
Quickstart

This guide walks through a complete probe workflow: install, train, evaluate, and save.


Install

pip install lmprobe

1. Define your contrast pairs

Probes are trained contrastively: you provide examples of the positive class and examples of the negative class. The model learns what separates them in activation space.

positive_prompts = [  # dog-like text, without using the word "dog"
    "Who wants to go for a walk?",
    "My tail is wagging with delight.",
    "Fetch the ball!",
    "Good boy!",
    "Slobbering, chewing, growling, barking.",
]

negative_prompts = [  # cat-like text
    "Enjoys lounging in the sun beam all day.",
    "Purring, stalking, pouncing, scratching.",
    "Uses a litterbox, throws sand all over the room.",
    "Tail raised, back arched, eyes alert, whiskers forward.",
]

Contrast pair quality

The contrastive approach is most effective when prompts isolate the property you care about. If the positive and negative prompts differ in multiple ways (topic, length, tone), the probe may learn those surface features instead of the target concept.
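One way to control for those surface features is to write near-minimal pairs, where each positive prompt has a negative counterpart matched in structure, length, and tone. The prompts below are illustrative examples of this pattern, not part of lmprobe itself:

```python
# Hypothetical matched pairs: each pair differs mainly in the dog-vs-cat
# property, so the probe has less opportunity to latch onto length or tone.
positive_prompts = [
    "It greeted me at the door, tail wagging with excitement.",
    "It chased the ball across the yard, barking happily.",
]
negative_prompts = [
    "It greeted me at the door, tail flicking with indifference.",
    "It chased the laser dot across the floor, meowing softly.",
]

# Balanced, index-aligned pairs keep the intended contrast explicit.
assert len(positive_prompts) == len(negative_prompts)
```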


2. Configure the probe

from lmprobe import Probe

probe = Probe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=16,                          # which residual stream layer to probe
    pooling="last_token",               # how to aggregate across tokens
    classifier="logistic_regression",   # classification head
    device="auto",                      # "auto", "cpu", "cuda:0"
    random_state=42,
)

Key parameters at a glance:

| Parameter  | What it controls |
|------------|------------------|
| model      | HuggingFace model ID or local path |
| layers     | Which layer(s) to extract: int, list[int], "middle", "all", or "sweep" |
| pooling    | Token aggregation strategy: "last_token", "mean", or "first_token" |
| classifier | Classification head: "logistic_regression", "ridge", "svm", "lda", or "mass_mean" |

See API Reference: Probe for the full parameter table.
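For intuition about the pooling options, here is a minimal numpy sketch of how each strategy collapses a [num_tokens, hidden_dim] activation matrix to a single vector. This is a conceptual illustration, not lmprobe's actual implementation:

```python
import numpy as np

def pool(activations: np.ndarray, strategy: str) -> np.ndarray:
    """Collapse a [num_tokens, hidden_dim] matrix to a [hidden_dim] vector."""
    if strategy == "last_token":
        return activations[-1]           # final token's residual stream
    if strategy == "first_token":
        return activations[0]            # first token's residual stream
    if strategy == "mean":
        return activations.mean(axis=0)  # average over all tokens
    raise ValueError(f"unknown pooling strategy: {strategy}")

acts = np.arange(12, dtype=float).reshape(4, 3)  # 4 tokens, hidden dim 3
print(pool(acts, "last_token"))  # [ 9. 10. 11.]
print(pool(acts, "mean"))        # [4.5 5.5 6.5]
```

"last_token" is a common default for autoregressive models because the final position has attended to the full prompt; "mean" can be more robust when the signal is spread across tokens.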


3. Fit

probe.fit(positive_prompts, negative_prompts)

This extracts activations from the model for each prompt, pools them, and fits the classifier. Activations are cached automatically. Re-running fit() with the same prompts is fast.
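To see what a fitted classifier can look like in the simplest case, here is a numpy sketch of the mass-mean idea from the classifier table: the probe direction is just the difference between class means, and examples are classified by which side of the midpoint they project onto. The random arrays stand in for pooled activations; this is not lmprobe's internal code:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for pooled activations: shape [n_examples, hidden_dim].
pos_acts = rng.normal(loc=1.0, size=(20, 8))   # "positive class" activations
neg_acts = rng.normal(loc=-1.0, size=(20, 8))  # "negative class" activations

# Mass-mean probe: the direction is simply mean(pos) - mean(neg).
direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
midpoint = (pos_acts.mean(axis=0) + neg_acts.mean(axis=0)) / 2

def predict(acts: np.ndarray) -> np.ndarray:
    """1 where the activation falls on the positive side of the midpoint."""
    return ((acts - midpoint) @ direction > 0).astype(int)

accuracy = np.mean(np.concatenate([predict(pos_acts), 1 - predict(neg_acts)]))
print(accuracy)
```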


4. Predict

test_prompts = [
    "Arf! Arf! Let's go outside!",
    "Knocking things off the counter for sport.",
]

predictions = probe.predict(test_prompts)
# array([1, 0])   (1 = dog, 0 = cat)

probabilities = probe.predict_proba(test_prompts)
# array([[0.12, 0.88], [0.91, 0.09]])
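The two outputs agree: assuming sklearn-style column ordering (column i holds the probability of class i), each prediction is the argmax of its probability row. A quick numpy check using the illustrative values above:

```python
import numpy as np

probabilities = np.array([[0.12, 0.88], [0.91, 0.09]])
predictions = probabilities.argmax(axis=1)  # pick the most probable class per row
print(predictions)  # [1 0]
```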

5. Evaluate

test_labels = [1, 0]  # ground truth

accuracy = probe.score(test_prompts, test_labels)

# Multiple metrics at once
metrics = probe.evaluate(test_prompts, test_labels)
# {"accuracy": 0.85, "f1": 0.85, "precision": 0.88, "recall": 0.82, "auroc": 0.91, ...}
# (illustrative values; a two-prompt test set would yield accuracy of 0.0, 0.5, or 1.0)
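For intuition about where these numbers come from, here is a pure-Python sketch of how accuracy, precision, recall, and F1 fall out of a binary confusion matrix. lmprobe computes these for you; this is only to make the definitions concrete:

```python
def binary_metrics(preds, labels):
    """Compute standard binary classification metrics from 0/1 lists."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))  # true positives
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))  # false positives
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))  # false negatives
    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

print(binary_metrics([1, 0, 1, 1], [1, 0, 0, 1]))
```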

6. Save and load

# Save to disk
probe.save("dog_vs_cat_probe.pkl")

# Load for inference
loaded_probe = Probe.load("dog_vs_cat_probe.pkl")
predictions = loaded_probe.predict(test_prompts)
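The .pkl extension suggests pickle-based serialization; if so, the usual caveat applies: only load probe files from sources you trust, since unpickling can execute arbitrary code. The round-trip pattern, sketched here with a stand-in dict rather than a real Probe object:

```python
import os
import pickle
import tempfile

# Stand-in for a fitted probe's state (hypothetical fields).
probe_state = {"direction": [0.1, -0.2, 0.3], "bias": 0.05}

path = os.path.join(tempfile.mkdtemp(), "dog_vs_cat_probe.pkl")
with open(path, "wb") as f:
    pickle.dump(probe_state, f)   # save to disk

with open(path, "rb") as f:
    loaded = pickle.load(f)       # load for inference

print(loaded == probe_state)  # True
```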

Next steps