lmprobe¶
Train linear probes on language model activations for AI safety monitoring.
lmprobe makes it easy to build text classifiers from a language model's internal representations. It has been used to detect deception, harmful intent, CBRN misuse, and other safety-relevant properties, but it can also be used to build arbitrary classifiers and even regression models.
What is a probe?¶
A probe is a classifier trained on a model's intermediate activations (residual stream, hidden states) rather than its output text. The most common type is the linear probe: because the classifier is linear, it is fast to train, interpretable, and reflects what the model internally represents rather than just what it says.
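The core idea can be sketched in a few lines of NumPy using a difference-of-means ("mass-mean") direction, conceptually similar to the `MassMeanClassifier` listed in the API reference. The activations below are synthetic stand-ins; a real probe would use vectors extracted from a model's hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 64

# Synthetic "activations": two classes separated along a hidden direction.
true_direction = rng.normal(size=hidden_dim)
pos_acts = rng.normal(size=(50, hidden_dim)) + true_direction
neg_acts = rng.normal(size=(50, hidden_dim)) - true_direction

# Mass-mean probe: direction = mean(positive) - mean(negative),
# with a bias that centers the decision boundary between the two means.
direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
bias = -0.5 * direction @ (pos_acts.mean(axis=0) + neg_acts.mean(axis=0))

def predict(acts):
    """Label 1 if the activation projects positively onto the probe direction."""
    return (acts @ direction + bias > 0).astype(int)
```

The probe is just a vector and a threshold, which is why linear probes are cheap to train and easy to inspect.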
Key results from the literature:
- Anthropic's probe work achieved >99% AUROC detecting sleeper agents
- Representation Engineering (Zou et al., 2023) showed probes reliably track honesty and power-seeking
- Apollo Research demonstrated probes trained on simple contrast pairs generalize to realistic deception scenarios
Install¶
Optional extras:
| Extra | Installs |
|---|---|
| `lmprobe[hub]` | HuggingFace Hub (activation datasets) |
| `lmprobe[s3]` | S3 cache backend |
| `lmprobe[nnsight]` | Remote execution via NDIF |
| `lmprobe[embeddings]` | Sentence-transformers baselines |
| `lmprobe[auto]` | Automatic layer selection (Group Lasso) |
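Assuming the package is published on PyPI under the name `lmprobe` (the page does not show the base install command), installation would follow the standard pip pattern, with extras combined in brackets:

```shell
pip install lmprobe

# Extras use standard pip syntax; quote the brackets for zsh compatibility.
pip install "lmprobe[hub,auto]"
```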
Five-minute example¶
```python
from lmprobe import Probe

positive_prompts = [
    "Who wants to go for a walk?",
    "My tail is wagging with delight.",
    "Fetch the ball!",
]
negative_prompts = [
    "Purring, stalking, pouncing, scratching.",
    "Uses a litterbox, throws sand all over the room.",
    "Tail raised, back arched, eyes alert.",
]

probe = Probe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=16,
    pooling="last_token",
    classifier="logistic_regression",
)
probe.fit(positive_prompts, negative_prompts)

predictions = probe.predict(["Arf! Let's go outside!", "Knocking things off the counter."])
# [1, 0]
```
See the Quickstart for a complete walkthrough.
Design philosophy¶
- sklearn-inspired API — `fit()`, `predict()`, `predict_proba()`, `score()`
- Contrastive-first — positive vs. negative prompts, following the RepE literature
- Sensible defaults — simple cases are one-liners; complex cases are fully configurable
- Separation of concerns — extraction, pooling, and classification are distinct and independently configurable
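The separation between extraction, pooling, and classification can be illustrated with plain NumPy: pooling reduces a `[seq_len, hidden]` activation matrix to a single vector before anything reaches the classifier. The hidden states below are synthetic, and the two strategies shown are illustrative (`last_token` mirrors the pooling used in the example above).

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden_dim = 12, 64

# Synthetic per-token hidden states for one prompt: shape [seq_len, hidden].
hidden_states = rng.normal(size=(seq_len, hidden_dim))

# Pooling collapses the sequence axis, yielding one vector per prompt.
last_token = hidden_states[-1]           # take the final token's activation
mean_pooled = hidden_states.mean(axis=0) # average over all tokens

# Either vector can then be handed to any downstream classifier.
assert last_token.shape == mean_pooled.shape == (hidden_dim,)
```

Because each stage has a fixed interface (a matrix in, a vector out), pooling strategies and classifiers can be swapped independently.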
Guides¶
| Guide | Topic |
|---|---|
| Quickstart | Install, train, evaluate, save |
| Contrastive Probing | Contrast pair design, pooling, regression, classifiers |
| Layer Selection & Sweep | Find the most informative layers |
| Preprocessing | StandardScaler, PCA, and chained pipelines |
| Ensembles | Multi-probe ensembles and bootstrap stability |
| Baselines | Validate your probe against text and activation baselines |
| Caching | Cache backends, eviction, introspection, env vars |
| Activation Datasets | Share pre-extracted activations via HuggingFace |
| Remote Execution | Probe large models via NDIF without local GPU |
| Geometry of Truth Tutorial | Reproduce truthfulness probes on pre-extracted data |
API Reference¶
| Reference | Coverage |
|---|---|
| Probe | Probe, LayerSweepResult |
| Ensemble | ProbeEnsemble |
| Classifiers | Built-in classifiers, MassMeanClassifier, EnsembleClassifier |
| Pooling | Pooling strategies and stage prefixes |
| Baseline | BaselineProbe, ActivationBaseline, BaselineBattery |
| Cache | Cache configuration, inspection, eviction |
| Datasets | UnifiedCache, push_dataset, load_activations, pull_dataset |
| Dataset Format | v2 format specification (Parquet + safetensors) |
| Scaling | PerLayerScaler for multi-layer normalization |