
Quickstart

This guide walks through a complete probe workflow: install, train, evaluate, and save.


Install

pip install lmprobe

1. Define your contrast pairs

Probes are trained contrastively: you provide examples of the positive class and examples of the negative class. The model learns what separates them in activation space.

positive_prompts = [  # dog-like text, without using the word "dog"
    "Who wants to go for a walk?",
    "My tail is wagging with delight.",
    "Fetch the ball!",
    "Good boy!",
    "Slobbering, chewing, growling, barking.",
]

negative_prompts = [  # cat-like text
    "Enjoys lounging in the sun beam all day.",
    "Purring, stalking, pouncing, scratching.",
    "Uses a litterbox, throws sand all over the room.",
    "Tail raised, back arched, eyes alert, whiskers forward.",
]

Contrast pair quality

The contrastive approach is most effective when prompts isolate the property you care about. If the positive and negative prompts differ in multiple ways (topic, length, tone), the probe may learn those surface features instead of the target concept.
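One way to control for those surface features is to write near-minimal pairs, where each positive prompt has a negative counterpart matched in structure, length, and tone. The prompts below are illustrative examples of this pattern, not part of lmprobe itself:

```python
# Hypothetical matched pairs: each pair differs mainly in the dog-vs-cat
# property, so the probe has less opportunity to latch onto length or tone.
positive_prompts = [
    "It greeted me at the door, tail wagging with excitement.",
    "It chased the ball across the yard, barking happily.",
]
negative_prompts = [
    "It greeted me at the door, tail flicking with indifference.",
    "It chased the laser dot across the floor, meowing softly.",
]

# Balanced, index-aligned pairs keep the intended contrast explicit.
assert len(positive_prompts) == len(negative_prompts)
```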


2. Configure the probe

from lmprobe import Probe

probe = Probe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=16,                          # which residual stream layer to probe
    pooling="last_token",               # how to aggregate across tokens
    classifier="logistic_regression",   # classification head
    device="auto",                      # "auto", "cpu", "cuda:0"
    random_state=42,
)

Key parameters at a glance:

| Parameter  | What it controls |
|------------|------------------|
| model      | HuggingFace model ID or local path |
| layers     | Which layer(s) to extract: int, list[int], "middle", "all", or "sweep" |
| pooling    | Token aggregation strategy: "last_token", "mean", or "first_token" |
| classifier | Classification head: "logistic_regression", "ridge", "svm", "lda", or "mass_mean" |

See API Reference: Probe for the full parameter table.
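For intuition about the pooling options, here is a minimal numpy sketch of how each strategy collapses a [num_tokens, hidden_dim] activation matrix to a single vector. This is a conceptual illustration, not lmprobe's actual implementation:

```python
import numpy as np

def pool(activations: np.ndarray, strategy: str) -> np.ndarray:
    """Collapse a [num_tokens, hidden_dim] matrix to a [hidden_dim] vector."""
    if strategy == "last_token":
        return activations[-1]           # final token's residual stream
    if strategy == "first_token":
        return activations[0]            # first token's residual stream
    if strategy == "mean":
        return activations.mean(axis=0)  # average over all tokens
    raise ValueError(f"unknown pooling strategy: {strategy}")

acts = np.arange(12, dtype=float).reshape(4, 3)  # 4 tokens, hidden dim 3
print(pool(acts, "last_token"))  # [ 9. 10. 11.]
print(pool(acts, "mean"))        # [4.5 5.5 6.5]
```

"last_token" is a common default for autoregressive models because the final position has attended to the full prompt; "mean" can be more robust when the signal is spread across tokens.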


3. Fit

probe.fit(positive_prompts, negative_prompts)

This extracts activations from the model for each prompt, pools them, and fits the classifier. Activations are cached automatically. Re-running fit() with the same prompts is fast.
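To see what a fitted classifier can look like in the simplest case, here is a numpy sketch of the mass-mean idea from the classifier table: the probe direction is just the difference between class means, and examples are classified by which side of the midpoint they project onto. The random arrays stand in for pooled activations; this is not lmprobe's internal code:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for pooled activations: shape [n_examples, hidden_dim].
pos_acts = rng.normal(loc=1.0, size=(20, 8))   # "positive class" activations
neg_acts = rng.normal(loc=-1.0, size=(20, 8))  # "negative class" activations

# Mass-mean probe: the direction is simply mean(pos) - mean(neg).
direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
midpoint = (pos_acts.mean(axis=0) + neg_acts.mean(axis=0)) / 2

def predict(acts: np.ndarray) -> np.ndarray:
    """1 where the activation falls on the positive side of the midpoint."""
    return ((acts - midpoint) @ direction > 0).astype(int)

accuracy = np.mean(np.concatenate([predict(pos_acts), 1 - predict(neg_acts)]))
print(accuracy)
```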


4. Predict

test_prompts = [
    "Arf! Arf! Let's go outside!",
    "Knocking things off the counter for sport.",
]

predictions = probe.predict(test_prompts)
# array([1, 0])   (1 = dog, 0 = cat)

probabilities = probe.predict_proba(test_prompts)
# array([[0.12, 0.88], [0.91, 0.09]])
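The two outputs agree: assuming sklearn-style column ordering (column i holds the probability of class i), each prediction is the argmax of its probability row. A quick numpy check using the illustrative values above:

```python
import numpy as np

probabilities = np.array([[0.12, 0.88], [0.91, 0.09]])
predictions = probabilities.argmax(axis=1)  # pick the most probable class per row
print(predictions)  # [1 0]
```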

5. Evaluate

test_labels = [1, 0]  # ground truth

accuracy = probe.score(test_prompts, test_labels)

# Multiple metrics at once
metrics = probe.evaluate(test_prompts, test_labels)
# {"accuracy": 0.85, "f1": 0.85, "precision": 0.88, "recall": 0.82, "auroc": 0.91, ...}
# (illustrative values; a two-prompt test set would yield accuracy of 0.0, 0.5, or 1.0)
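For intuition about where these numbers come from, here is a pure-Python sketch of how accuracy, precision, recall, and F1 fall out of a binary confusion matrix. lmprobe computes these for you; this is only to make the definitions concrete:

```python
def binary_metrics(preds, labels):
    """Compute standard binary classification metrics from 0/1 lists."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))  # true positives
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))  # false positives
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))  # false negatives
    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

print(binary_metrics([1, 0, 1, 1], [1, 0, 0, 1]))
```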

6. Save and load

# Save to disk
probe.save("dog_vs_cat_probe.pkl")

# Load for inference
loaded_probe = Probe.load("dog_vs_cat_probe.pkl")
predictions = loaded_probe.predict(test_prompts)
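The .pkl extension suggests pickle-based serialization; if so, the usual caveat applies: only load probe files from sources you trust, since unpickling can execute arbitrary code. The round-trip pattern, sketched here with a stand-in dict rather than a real Probe object:

```python
import os
import pickle
import tempfile

# Stand-in for a fitted probe's state (hypothetical fields).
probe_state = {"direction": [0.1, -0.2, 0.3], "bias": 0.05}

path = os.path.join(tempfile.mkdtemp(), "dog_vs_cat_probe.pkl")
with open(path, "wb") as f:
    pickle.dump(probe_state, f)   # save to disk

with open(path, "rb") as f:
    loaded = pickle.load(f)       # load for inference

print(loaded == probe_state)  # True
```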

Next steps