lmprobe

Train linear probes on language model activations for AI safety monitoring.

PyPI version Python 3.10+ License: MIT

lmprobe makes it easy to build text classifiers from a language model's internal representations. It has been used to detect deception, harmful intent, CBRN misuse, and other safety-relevant properties, but it can also be used to build arbitrary classifiers and even regression models.


What is a probe?

A probe is a classifier trained on a model's intermediate activations (residual stream, hidden states) rather than its output text. A very common type of probe is the linear probe: because the classifier is linear, it is fast to train and interpretable, and it can only pick up information that is linearly decodable from the activations, so it reflects what the model represents internally, not just what it says.
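The core idea can be sketched with plain numpy and scikit-learn, independent of the lmprobe API. The "activations" below are synthetic stand-ins for pooled hidden states (d_model, the shift direction, and all shapes are illustrative, not taken from any real model): a linear probe is simply a linear classifier fit on those vectors.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 64  # hidden size of a hypothetical model

# Stand-ins for pooled residual-stream activations at one layer:
# positive examples are shifted along a fixed "concept" direction.
direction = rng.normal(size=d_model)
pos_acts = rng.normal(size=(50, d_model)) + direction
neg_acts = rng.normal(size=(50, d_model)) - direction

X = np.vstack([pos_acts, neg_acts])
y = np.array([1] * 50 + [0] * 50)

# The probe itself is just a linear classifier on activations.
linear_probe = LogisticRegression(max_iter=1000).fit(X, y)
print(linear_probe.score(X, y))  # near-perfect on this separable toy data
```

The learned weight vector (`linear_probe.coef_`) is a single direction in activation space, which is what makes linear probes cheap to train and easy to interpret.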


Install

pip install lmprobe

Optional extras:

| Extra | Installs |
| --- | --- |
| lmprobe[hub] | HuggingFace Hub (activation datasets) |
| lmprobe[s3] | S3 cache backend |
| lmprobe[nnsight] | Remote execution via NDIF |
| lmprobe[embeddings] | Sentence-transformers baselines |
| lmprobe[auto] | Automatic layer selection (Group Lasso) |

Five-minute example

from lmprobe import Probe

positive_prompts = [
    "Who wants to go for a walk?",
    "My tail is wagging with delight.",
    "Fetch the ball!",
]

negative_prompts = [
    "Purring, stalking, pouncing, scratching.",
    "Uses a litterbox, throws sand all over the room.",
    "Tail raised, back arched, eyes alert.",
]

probe = Probe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=16,
    pooling="last_token",
    classifier="logistic_regression",
)

probe.fit(positive_prompts, negative_prompts)

predictions = probe.predict(["Arf! Let's go outside!", "Knocking things off the counter."])
# [1, 0]

See the Quickstart for a complete walkthrough.


Design philosophy

  • sklearn-inspired API — fit(), predict(), predict_proba(), score()
  • Contrastive-first — positive vs. negative prompts, following the RepE literature
  • Sensible defaults — simple cases are one-liners; complex cases are fully configurable
  • Separation of concerns — extraction, pooling, and classification are distinct and independently configurable
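The separation-of-concerns point can be sketched in plain numpy/scikit-learn terms (the shapes, the `pool` helper, and its strategy names are illustrative assumptions, not lmprobe internals): extraction produces per-token hidden states, pooling reduces them to one vector per prompt, and the classifier only ever sees pooled vectors, so each stage can be swapped independently.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_prompts, seq_len, d_model = 20, 12, 32

# 1) Extraction (stubbed here): per-token hidden states for each prompt.
hidden_states = rng.normal(size=(n_prompts, seq_len, d_model))

# 2) Pooling: "last_token" keeps the final position; "mean" averages over tokens.
def pool(states, strategy="last_token"):
    if strategy == "last_token":
        return states[:, -1, :]
    if strategy == "mean":
        return states.mean(axis=1)
    raise ValueError(f"unknown pooling strategy: {strategy}")

X = pool(hidden_states, "last_token")  # shape (n_prompts, d_model)
y = rng.integers(0, 2, size=n_prompts)

# 3) Classification: any sklearn-style estimator over the pooled vectors.
clf = LogisticRegression(max_iter=1000).fit(X, y)
probs = clf.predict_proba(X)           # shape (n_prompts, 2)
```

Because the classifier only depends on the pooled matrix X, changing the pooling strategy or the layer being extracted never requires changing the classification code.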

Guides

| Guide | Topic |
| --- | --- |
| Quickstart | Install, train, evaluate, save |
| Contrastive Probing | Contrast pair design, pooling, regression, classifiers |
| Layer Selection & Sweep | Find the most informative layers |
| Preprocessing | StandardScaler, PCA, and chained pipelines |
| Ensembles | Multi-probe ensembles and bootstrap stability |
| Baselines | Validate your probe against text and activation baselines |
| Caching | Cache backends, eviction, introspection, env vars |
| Activation Datasets | Share pre-extracted activations via HuggingFace |
| Remote Execution | Probe large models via NDIF without local GPU |
| Geometry of Truth Tutorial | Reproduce truthfulness probes on pre-extracted data |

API Reference

| Reference | Coverage |
| --- | --- |
| Probe | Probe, LayerSweepResult |
| Ensemble | ProbeEnsemble |
| Classifiers | Built-in classifiers, MassMeanClassifier, EnsembleClassifier |
| Pooling | Pooling strategies and stage prefixes |
| Baseline | BaselineProbe, ActivationBaseline, BaselineBattery |
| Cache | Cache configuration, inspection, eviction |
| Datasets | UnifiedCache, push_dataset, load_activations, pull_dataset |
| Dataset Format | v2 format specification (Parquet + safetensors) |
| Scaling | PerLayerScaler for multi-layer normalization |