lmprobe¶
Train linear probes on language model activations for AI safety monitoring.
lmprobe makes it easy to build text classifiers from a language model's internal representations. It has been used to detect deception, harmful intent, CBRN misuse, and other safety-relevant properties, but it can also be used to build arbitrary classifiers and even regression models.
What is a probe?¶
A probe is a classifier trained on a model's intermediate activations (residual stream, hidden states) rather than its output text. The most common type is the linear probe: because the classifier is linear, it is fast to train, interpretable, and reflects what the model internally represents rather than just what it says.
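The core idea can be sketched in a few lines of NumPy using a difference-of-means ("mass-mean") direction, conceptually similar to the `MassMeanClassifier` listed in the API reference. The activations below are synthetic stand-ins; a real probe would use vectors extracted from a model's hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 64

# Synthetic "activations": two classes separated along a hidden direction.
true_direction = rng.normal(size=hidden_dim)
pos_acts = rng.normal(size=(50, hidden_dim)) + true_direction
neg_acts = rng.normal(size=(50, hidden_dim)) - true_direction

# Mass-mean probe: direction = mean(positive) - mean(negative),
# with a bias that centers the decision boundary between the two means.
direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
bias = -0.5 * direction @ (pos_acts.mean(axis=0) + neg_acts.mean(axis=0))

def predict(acts):
    """Label 1 if the activation projects positively onto the probe direction."""
    return (acts @ direction + bias > 0).astype(int)
```

The probe is just a vector and a threshold, which is why linear probes are cheap to train and easy to inspect.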
Key results from the literature:
- Anthropic's probe work achieved >99% AUROC detecting sleeper agents
- Representation Engineering (Zou et al., 2023) showed probes reliably track honesty and power-seeking
- Apollo Research demonstrated probes trained on simple contrast pairs generalize to realistic deception scenarios
Install¶
Optional extras:
| Extra | Installs |
|---|---|
| `lmprobe[hub]` | HuggingFace Hub (activation datasets) |
| `lmprobe[s3]` | S3 cache backend |
| `lmprobe[nnsight]` | Remote execution via NDIF |
| `lmprobe[embeddings]` | Sentence-transformers baselines |
| `lmprobe[auto]` | Automatic layer selection (Group Lasso) |
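Assuming the package is published on PyPI under the name `lmprobe` (the page does not show the base install command), installation would follow the standard pip pattern, with extras combined in brackets:

```shell
pip install lmprobe

# Extras use standard pip syntax; quote the brackets for zsh compatibility.
pip install "lmprobe[hub,auto]"
```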
Five-minute example¶
```python
from lmprobe import Probe

positive_prompts = [
    "Who wants to go for a walk?",
    "My tail is wagging with delight.",
    "Fetch the ball!",
]
negative_prompts = [
    "Purring, stalking, pouncing, scratching.",
    "Uses a litterbox, throws sand all over the room.",
    "Tail raised, back arched, eyes alert.",
]

probe = Probe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=16,
    pooling="last_token",
    classifier="logistic_regression",
)
probe.fit(positive_prompts, negative_prompts)

predictions = probe.predict(["Arf! Let's go outside!", "Knocking things off the counter."])
# [1, 0]
```
See the Quickstart for a complete walkthrough.
Design philosophy¶
- sklearn-inspired API — `fit()`, `predict()`, `predict_proba()`, `score()`
- Contrastive-first — positive vs. negative prompts, following the RepE literature
- Sensible defaults — simple cases are one-liners; complex cases are fully configurable
- Separation of concerns — extraction, pooling, and classification are distinct and independently configurable
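The separation between extraction, pooling, and classification can be illustrated with plain NumPy: pooling reduces a `[seq_len, hidden]` activation matrix to a single vector before anything reaches the classifier. The hidden states below are synthetic, and the two strategies shown are illustrative (`last_token` mirrors the pooling used in the example above).

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden_dim = 12, 64

# Synthetic per-token hidden states for one prompt: shape [seq_len, hidden].
hidden_states = rng.normal(size=(seq_len, hidden_dim))

# Pooling collapses the sequence axis, yielding one vector per prompt.
last_token = hidden_states[-1]           # take the final token's activation
mean_pooled = hidden_states.mean(axis=0) # average over all tokens

# Either vector can then be handed to any downstream classifier.
assert last_token.shape == mean_pooled.shape == (hidden_dim,)
```

Because each stage has a fixed interface (a matrix in, a vector out), pooling strategies and classifiers can be swapped independently.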
Guides¶
| Guide | Topic |
|---|---|
| Quickstart | Install, train, evaluate, save |
| Contrastive Probing | Contrast pair design, pooling, regression, classifiers |
| Layer Selection & Sweep | Find the most informative layers |
| Preprocessing | StandardScaler, PCA, and chained pipelines |
| Ensembles | Multi-probe ensembles and bootstrap stability |
| Baselines | Validate your probe against text and activation baselines |
| Caching | Cache backends, eviction, introspection, env vars |
| Activation Datasets | Share pre-extracted activations via HuggingFace |
| Remote Execution | Probe large models via NDIF without local GPU |
| Geometry of Truth Tutorial | Reproduce truthfulness probes on pre-extracted data |
API Reference¶
| Reference | Coverage |
|---|---|
| Probe | Probe, LayerSweepResult |
| Ensemble | ProbeEnsemble |
| Classifiers | Built-in classifiers, MassMeanClassifier, EnsembleClassifier |
| Pooling | Pooling strategies and stage prefixes |
| Baseline | BaselineProbe, ActivationBaseline, BaselineBattery |
| Cache | Cache configuration, inspection, eviction |
| Datasets | UnifiedCache, push_dataset, load_activations, pull_dataset |
| Dataset Format | v2 format specification (Parquet + safetensors) |
| Scaling | PerLayerScaler for multi-layer normalization |