Probe

The main class for training and using linear probes on language model activations.


Probe

lmprobe.probe.Probe

Train a linear probe on language model activations.

Parameters:

Name Type Description Default
model str | None

HuggingFace model ID or local path. Optional when using *_from_activations() methods only.

None
layers int | list[int] | str

Which layers to extract activations from:

- int: Single layer (negative indexing supported)
- list[int]: Multiple layers (concatenated)
- "middle": Middle third of layers
- "last": Last layer only
- "all": All layers
- "auto": Automatic layer selection via Group Lasso
- "fast_auto": Fast automatic layer selection via coefficient importance
- "sweep": Train independent probe per layer (memory-safe)
- "sweep:N": Sweep every Nth layer (coarse sweep)
- "sweep:START-END": Sweep a range of layers (fine sweep)

"middle"
pooling str

Token pooling strategy for both training and inference. Options: "last_token", "first_token", "mean", "all"

"last_token"
train_pooling str | None

Override pooling for training only.

None
inference_pooling str | None

Override pooling for inference only. Additional options: "max", "min" (score-level pooling)

None
classifier str | BaseEstimator

Classification model. Either a string name or sklearn estimator.

"logistic_regression"
task str

Task type: "classification" or "regression". When "regression", defaults to Ridge regression and disables predict_proba.

"classification"
device str

Device for model inference: "auto", "cpu", "cuda:0", etc.

"auto"
remote bool

Use nnsight remote execution (requires NDIF_API_KEY).

False
random_state int | None

Random seed for reproducibility. Propagates to classifier.

None
batch_size int

Number of prompts to process at once during activation extraction. Smaller values use less memory but may be slower.

8
auto_candidates list[int] | list[float] | None

Candidate layers for layers="auto" mode:

- list[int]: Explicit layer indices (e.g., [10, 16, 22])
- list[float]: Fractional positions (e.g., [0.33, 0.5, 0.66])
- None: Default to [0.25, 0.5, 0.75]

Only used when layers="auto".

None
auto_alpha float

Group Lasso regularization strength for layers="auto". Higher values select fewer layers. Typical range: 0.001 to 0.1.

0.01
normalize_layers bool | str

Per-layer feature standardization when using multiple layers. Compensates for differences in activation magnitude across layers. Options:

- True or "per_neuron": Each neuron gets its own mean/std (default)
- "per_layer": All neurons in a layer share one mean/std (may work better with small sample sizes due to lower variance)
- False: No scaling

True
fast_auto_top_k int | None

Number of layers to select when using layers="fast_auto". If None, defaults to selecting half the candidate layers.

None
backend str

Extraction backend: "local" or "nnsight". "local" uses HuggingFace transformers directly without nnsight, enabling use with models not supported by nnsight/NDIF.

"local"
dtype str | None

Model dtype for local backend: "float32", "float16", or "bfloat16". Defaults to "float32" if None. Ignored for nnsight backend.

None
classifier_kwargs dict | None

Additional keyword arguments passed to the sklearn classifier constructor. Overrides defaults for built-in classifiers. Example: {"C": 0.01, "solver": "liblinear", "max_iter": 5000} for logistic regression.

None
preprocessing str | list[str] | None

Preprocessing pipeline applied after layer scaling but before the classifier. Steps are separated by "+" when given as a string:

- "standard": StandardScaler
- "pca" or "pca:N": PCA with N components
- "standard+pca": StandardScaler then PCA

Use pca_components to set N when using "pca" without :N.

None
pca_components int | None

Number of PCA components when preprocessing includes "pca" without an explicit component count.

None
mass_mean_augment bool

If True, compute the mass-mean direction (mean_positive - mean_negative) on the original activations (before preprocessing), project all samples onto this direction to get a 1D feature, and append it to the (optionally PCA-reduced) features before fitting the classifier. This augmentation is also applied during inference.

False
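The mass-mean augmentation described above is a simple projection; a minimal NumPy sketch of the computation on synthetic activations (illustrative only, not the library's internals):

```python
import numpy as np

# Synthetic activations: 6 samples x 4 features, first 3 positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
y = np.array([1, 1, 1, 0, 0, 0])

# Mass-mean direction: difference of class means on the raw activations.
direction = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)

# Project each sample onto the direction to obtain one extra feature,
# then append it to the (optionally PCA-reduced) feature matrix.
projection = X @ direction
X_augmented = np.hstack([X, projection[:, None]])
print(X_augmented.shape)  # (6, 5)
```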

Attributes:

Name Type Description
classifier_ BaseEstimator

The fitted sklearn classifier (after calling fit()).

classes_ ndarray

Class labels (after calling fit()).

selected_layers_ list[int] | None

Layer indices selected when layers="auto" or "fast_auto". None for other layer modes or before fitting.

scaler_ PerLayerScaler | None

The fitted per-layer scaler (after calling fit() with multiple layers and normalize_layers=True). None if single layer or normalize_layers=False.
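The per-layer standardization that scaler_ performs can be sketched in NumPy; this shows the "per_layer" variant on hypothetical concatenated activations (the real PerLayerScaler is fit on training data and reused at inference):

```python
import numpy as np

# Hypothetical activations from 2 concatenated layers of width 3 each,
# with very different magnitudes (common across transformer layers).
rng = np.random.default_rng(1)
X = np.hstack([rng.normal(0, 1, (8, 3)), rng.normal(0, 10, (8, 3))])

n_layers, width = 2, 3
Xs = X.copy()
for i in range(n_layers):
    block = Xs[:, i * width:(i + 1) * width]
    # "per_layer": one mean/std shared by all neurons in the layer.
    # "per_neuron" would use block.mean(axis=0) and block.std(axis=0) instead.
    Xs[:, i * width:(i + 1) * width] = (block - block.mean()) / block.std()

# After scaling, both layer blocks have unit standard deviation.
print(round(Xs[:, :width].std(), 6), round(Xs[:, width:].std(), 6))
```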

Examples:

>>> probe = Probe(
...     model="meta-llama/Llama-3.1-8B-Instruct",
...     layers=16,
...     pooling="last_token",
...     classifier="logistic_regression",
...     random_state=42,
... )
>>> probe.fit(positive_prompts, negative_prompts)
>>> predictions = probe.predict(test_prompts)
>>> # Automatic layer selection
>>> probe = Probe(
...     model="meta-llama/Llama-3.1-8B-Instruct",
...     layers="auto",
...     auto_candidates=[0.25, 0.5, 0.75],
...     auto_alpha=0.01,
... )
>>> probe.fit(positive_prompts, negative_prompts)
>>> print(probe.selected_layers_)  # e.g., [8, 16]

warmup

warmup(prompts: list[str], remote: bool | None = None, max_retries: int | None = None, batch_size: int | None = None) -> None

Extract and cache activations without training a classifier.

Use this to pre-populate the activation cache for a set of prompts. This is useful when you want to separate the (expensive) activation extraction step from the (cheap) classifier training step, or when you plan to train multiple probes on the same prompts.

Parameters:

Name Type Description Default
prompts list[str]

Text prompts to extract and cache activations for.

required
remote bool | None

Override the instance-level remote setting.

None
max_retries int | None

Override the instance-level max_retries setting. Only applies to remote extraction.

None
batch_size int | None

Override the instance-level batch_size for this call. Smaller values reduce memory usage; larger values may improve throughput on GPU.

None

fit

fit(positive_prompts: list[str], negative_prompts: list[str] | ndarray | list[int] | None = None, remote: bool | None = None, invalidate_cache: bool = False, max_retries: int | None = None, batch_size: int | None = None, sample_weight: ndarray | list[float] | None = None) -> Probe

Fit the probe on training data.

Supports two signatures:

1. Contrastive: fit(positive_prompts, negative_prompts)
2. Standard: fit(prompts, labels)

Parameters:

Name Type Description Default
positive_prompts list[str]

In contrastive mode: prompts for the positive class. In standard mode: all prompts.

required
negative_prompts list[str] | ndarray | list[int] | None

In contrastive mode: prompts for the negative class. In standard mode: labels (array of ints).

None
remote bool | None

Override the instance-level remote setting.

None
invalidate_cache bool

If True, ignore cached activations and re-extract.

False
max_retries int | None

Override the instance-level max_retries setting. Only applies to remote extraction.

None
batch_size int | None

Override the instance-level batch_size for this call. Smaller values reduce memory usage; larger values may improve throughput on GPU.

None
sample_weight ndarray | list[float] | None

Per-sample weights passed to the classifier's fit() method. Length must match the total number of training samples (len(positive_prompts) + len(negative_prompts) in contrastive mode, or len(prompts) in standard mode). If None, all samples are weighted equally.

None

Returns:

Type Description
Probe

Self, for method chaining.

Notes

When layers="auto", fitting occurs in two phases:

1. Train Group Lasso on candidate layers to identify informative layers
2. Re-train the specified classifier using only the selected layers

After fitting with layers="auto", check probe.selected_layers_ to see which layers were chosen.
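The two signatures carry the same information; a minimal sketch of how the contrastive form maps onto explicit labels (hypothetical prompts):

```python
# Hypothetical prompt lists: contrastive mode builds the labels for you.
positive_prompts = ["The sky is blue.", "Water is wet."]
negative_prompts = ["The sky is green.", "Fire is cold.", "Ice is hot."]

# fit(positive_prompts, negative_prompts) is equivalent to the standard
# signature with explicit integer labels:
prompts = positive_prompts + negative_prompts
labels = [1] * len(positive_prompts) + [0] * len(negative_prompts)
print(labels)  # [1, 1, 0, 0, 0]
```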

predict

predict(prompts: list[str], remote: bool | None = None, batch_size: int | None = None) -> np.ndarray

Predict class labels for prompts.

Parameters:

Name Type Description Default
prompts list[str]

Text prompts to classify.

required
remote bool | None

Override the instance-level remote setting.

None
batch_size int | None

Override the instance-level batch_size for this call.

None

Returns:

Type Description
ndarray

Predicted class labels, shape (n_samples,).

predict_proba

predict_proba(prompts: list[str], remote: bool | None = None, batch_size: int | None = None) -> np.ndarray

Predict class probabilities for prompts.

Parameters:

Name Type Description Default
prompts list[str]

Text prompts to classify.

required
remote bool | None

Override the instance-level remote setting.

None
batch_size int | None

Override the instance-level batch_size for this call.

None

Returns:

Type Description
ndarray

Class probabilities. Shape depends on inference_pooling:

- Normal: (n_samples, n_classes)
- "all": (n_samples, seq_len, n_classes)
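Score-level pooling ("max"/"min" for inference_pooling) reduces per-token probabilities after classification; a NumPy sketch on hypothetical per-token scores:

```python
import numpy as np

# Hypothetical per-token positive-class probabilities for 2 prompts,
# shape (n_samples, seq_len), as produced with inference_pooling="all".
token_probs = np.array([[0.2, 0.8, 0.5],
                        [0.1, 0.3, 0.4]])

max_pooled = token_probs.max(axis=1)  # inference_pooling="max"
min_pooled = token_probs.min(axis=1)  # inference_pooling="min"
print(max_pooled, min_pooled)  # [0.8 0.4] [0.2 0.1]
```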

score

score(prompts: list[str], labels: list[int] | ndarray, remote: bool | None = None, batch_size: int | None = None) -> float

Compute accuracy on test data.

Parameters:

Name Type Description Default
prompts list[str]

Test prompts.

required
labels list[int] | ndarray

True labels.

required
remote bool | None

Override the instance-level remote setting.

None
batch_size int | None

Override the instance-level batch_size for this call.

None

Returns:

Type Description
float

Classification accuracy.

evaluate

evaluate(prompts: list[str], labels: list[int] | ndarray, remote: bool | None = None) -> dict

Compute a standard set of evaluation metrics.

Computes accuracy, AUROC, F1, precision, and recall. Results are cached on self._evaluation_results_.

Parameters:

Name Type Description Default
prompts list[str]

Evaluation prompts (should NOT be training data).

required
labels list[int] | ndarray

True labels.

required
remote bool | None

Override the instance-level remote setting.

None

Returns:

Type Description
dict

Metrics dict with keys: accuracy, auroc, f1, precision, recall, n_eval, eval_hash.
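The metrics match the standard sklearn definitions; a sketch on hypothetical labels and probability scores (not the library's internal code; eval_hash is omitted):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, roc_auc_score, f1_score,
                             precision_score, recall_score)

y_true = np.array([1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.7, 0.4, 0.2, 0.35, 0.6])  # hypothetical probabilities
y_pred = (y_score >= 0.5).astype(int)

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "auroc": roc_auc_score(y_true, y_score),   # uses scores, not hard labels
    "f1": f1_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "n_eval": len(y_true),
}
print(metrics["accuracy"])
```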

compute_layer_importance

compute_layer_importance(metric: str = 'l2', normalize: bool = True) -> np.ndarray

Compute layer importance from classifier coefficients.

This method analyzes the trained classifier's coefficients to determine which layers contribute most to the classification decision. It provides a fast alternative to Group Lasso for layer importance analysis.

Must be called after fit() when using multiple layers with a linear classifier (one with a coef_ attribute).

Parameters:

Name Type Description Default
metric str

How to aggregate coefficients per layer:

- "l2": L2 norm (Euclidean magnitude), analogous to Group Lasso
- "l1": Sum of absolute values
- "mean_abs": Mean absolute value (normalized by dimension)
- "max_abs": Maximum absolute value

"l2"
normalize bool

If True, normalize importances to sum to 1.

True

Returns:

Type Description
ndarray

Layer importance scores, shape (n_layers,). Also stored in self.layer_importances_.

Raises:

Type Description
RuntimeError

If probe not fitted or classifier lacks coef_ attribute.

ValueError

If unknown metric specified.

Examples:

>>> probe = Probe(model="...", layers=[8, 16, 24])
>>> probe.fit(positive_prompts, negative_prompts)
>>> importance = probe.compute_layer_importance()
>>> print(f"Most important: layer {probe.candidate_layers_[importance.argmax()]}")
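The per-layer aggregation itself is straightforward; a NumPy sketch of the "l2" metric on a hypothetical coefficient vector spanning 3 layers of width 4:

```python
import numpy as np

# Hypothetical coef_ over 3 concatenated layers of width 4 (12 features).
coef = np.arange(12, dtype=float)
per_layer = coef.reshape(3, 4)

raw = np.linalg.norm(per_layer, axis=1)  # metric="l2": one norm per layer
importance = raw / raw.sum()             # normalize=True: importances sum to 1
print(importance.argmax())  # 2
```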

fit_from_activations

fit_from_activations(X: Any, y: Any, sample_weight: ndarray | list[float] | None = None, n_layers: int = 1) -> Probe

Fit the probe from pre-computed activation tensors.

Skips activation extraction and pooling, but applies the same normalization and preprocessing pipeline as fit(): StandardScaler for single-layer, PerLayerScaler for multi-layer, and any user-specified preprocessing.

Parameters:

Name Type Description Default
X ndarray | Tensor

Pre-computed activations, shape (n_samples, n_features).

required
y ndarray | Tensor

Labels. int for classification, float for regression.

required
sample_weight ndarray | list[float] | None

Per-sample weights passed to the classifier's fit() method. If None, all samples are weighted equally.

None
n_layers int

Number of concatenated layers in X. Controls scaling behavior: 1 (default) applies StandardScaler, >1 applies PerLayerScaler when normalize_layers is set.

1

Returns:

Type Description
Probe

Self, for method chaining.
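When activations are already on hand, this path is essentially standard sklearn fitting; a hedged sketch of the equivalent single-layer pipeline (synthetic activations, not the library's actual internals):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic "activations": two well-separated classes, 16 features each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 1.0, (20, 16)),    # positive-class samples
               rng.normal(-1.0, 1.0, (20, 16))])  # negative-class samples
y = np.array([1] * 20 + [0] * 20)

# Single-layer case (n_layers=1): StandardScaler, then the classifier.
Xs = StandardScaler().fit_transform(X)
clf = LogisticRegression(random_state=42).fit(Xs, y)
print(clf.score(Xs, y))  # high training accuracy on separable data
```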

predict_from_activations

predict_from_activations(X: Any) -> np.ndarray

Predict from pre-computed activation tensors.

Parameters:

Name Type Description Default
X ndarray | Tensor

Pre-computed activations, shape (n_samples, n_features).

required

Returns:

Type Description
ndarray

Predictions, shape (n_samples,).

predict_proba_from_activations

predict_proba_from_activations(X: Any) -> np.ndarray

Predict probabilities from pre-computed activation tensors.

Only available for classification tasks.

Parameters:

Name Type Description Default
X ndarray | Tensor

Pre-computed activations, shape (n_samples, n_features).

required

Returns:

Type Description
ndarray

Class probabilities, shape (n_samples, n_classes).

Raises:

Type Description
ValueError

If task is regression.

score_from_activations

score_from_activations(X: Any, y: Any) -> float

Score the probe on pre-computed activation tensors.

Returns accuracy for classification, R-squared for regression.

Parameters:

Name Type Description Default
X ndarray | Tensor

Pre-computed activations, shape (n_samples, n_features).

required
y ndarray | Tensor

True labels/values.

required

Returns:

Type Description
float

Accuracy (classification) or R-squared (regression).

save

save(path: str) -> None

Save the fitted probe to disk.

Parameters:

Name Type Description Default
path str

Path to save the probe.

required

load classmethod

load(path: str) -> Probe

Load a fitted probe from disk.

Parameters:

Name Type Description Default
path str

Path to the saved probe.

required

Returns:

Type Description
Probe

The loaded probe.

sweep_layers classmethod

sweep_layers(model: str, positive_prompts: list[str], negative_prompts: list[str], layers: int | list[int] | str = 'all', pooling: str = 'last_token', classifier: str | BaseEstimator = 'logistic_regression', device: str = 'auto', remote: bool = False, random_state: int | None = None, batch_size: int = 8, backend: str = 'local', dtype: str | None = None, normalize_layers: bool | str = True, classifier_kwargs: dict | None = None, preprocessing: str | list[str] | None = None, pca_components: int | None = None) -> LayerSweepResult

Train a probe at every layer and return per-layer results.

This method avoids the boilerplate of manually looping over layers. It performs one warmup pass extracting all requested layers (single forward pass through the model, cached), then trains an independent single-layer probe for each layer using cached activations.

Parameters:

Name Type Description Default
model str

HuggingFace model ID or local path.

required
positive_prompts list[str]

Prompts for the positive class.

required
negative_prompts list[str]

Prompts for the negative class.

required
layers int | list[int] | str

Which layers to sweep. Accepts same specifications as Probe: int, list[int], "all", "middle", "last".

"all"
pooling str

Token pooling strategy.

"last_token"
classifier str | BaseEstimator

Classification model.

"logistic_regression"
device str

Device for model inference.

"auto"
remote bool

Use nnsight remote execution.

False
random_state int | None

Random seed for reproducibility.

None
batch_size int

Number of prompts per batch during extraction.

8
backend str

Extraction backend: "local" or "nnsight".

"local"
dtype str | None

Model dtype for local backend.

None
normalize_layers bool | str

Per-layer normalization (applied per single-layer probe).

True
classifier_kwargs dict | None

Additional keyword arguments passed to the sklearn classifier constructor.

None
preprocessing str | list[str] | None

Preprocessing pipeline applied before the classifier (same format as Probe).

None
pca_components int | None

Number of PCA components when preprocessing includes "pca" without an explicit component count.

None

Returns:

Type Description
LayerSweepResult

Contains a fitted probe for each layer, with methods for scoring and finding the best layer.

Examples:

>>> result = Probe.sweep_layers(
...     model="meta-llama/Llama-3.1-8B-Instruct",
...     positive_prompts=pos,
...     negative_prompts=neg,
...     layers="all",
... )
>>> scores = result.score(test_prompts, test_labels)
>>> best = result.best_layer(test_prompts, test_labels)
>>> print(f"Best layer: {best}, accuracy: {scores[best]:.3f}")

LayerSweepResult

lmprobe.probe.LayerSweepResult dataclass

Results from a per-layer probe sweep.

Contains a fitted Probe for each layer, with convenience methods for scoring and finding the best layer.

Parameters:

Name Type Description Default
probes dict[int, Probe]

Mapping from layer index to fitted Probe.

dict()

Examples:

>>> result = Probe.sweep_layers(
...     model="meta-llama/Llama-3.1-8B-Instruct",
...     positive_prompts=pos,
...     negative_prompts=neg,
...     layers="all",
... )
>>> scores = result.score(test_prompts, test_labels)
>>> print(f"Best layer: {result.best_layer(test_prompts, test_labels)}")

layers property

layers: list[int]

Return sorted list of layer indices in this sweep.

__getitem__

__getitem__(layer: int) -> Probe

Get the probe for a specific layer.

score

score(test_prompts: list[str], test_labels: list[int] | ndarray) -> dict[int, float]

Score each layer's probe on test data.

Performs a single warmup extraction pass for all layers, then scores each probe from cache (no redundant forward passes).

best_layer

best_layer(test_prompts: list[str], test_labels: list[int] | ndarray) -> int

Return the layer index with the highest accuracy.
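best_layer is equivalent to taking the argmax over the dict returned by score(); a sketch with hypothetical per-layer accuracies:

```python
# Hypothetical per-layer accuracies, as returned by result.score(...).
scores = {8: 0.71, 16: 0.93, 24: 0.88}

# best_layer picks the key with the highest accuracy.
best = max(scores, key=scores.get)
print(best)  # 16
```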

predict

predict(prompts: list[str]) -> dict[int, np.ndarray]

Predict with each layer's probe.

Performs a single warmup extraction pass for all layers, then predicts from cache (no redundant forward passes).

predict_proba

predict_proba(prompts: list[str]) -> dict[int, np.ndarray]

Predict probabilities with each layer's probe.

Performs a single warmup extraction pass for all layers, then predicts from cache (no redundant forward passes).