Probe¶
The main class for training and using linear probes on language model activations.
Probe¶
lmprobe.probe.Probe ¶
Train a linear probe on language model activations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `str \| None` | HuggingFace model ID or local path. Optional when using `*_from_activations()` methods only. | `None` |
| `layers` | `int \| list[int] \| str` | Which layers to extract activations from: `int` selects a single layer (negative indexing supported); `list[int]` selects multiple layers (concatenated); `"middle"` the middle third of layers; `"last"` the last layer only; `"all"` all layers; `"auto"` automatic layer selection via Group Lasso; `"fast_auto"` fast automatic layer selection via coefficient importance; `"sweep"` trains an independent probe per layer (memory-safe); `"sweep:N"` sweeps every Nth layer (coarse sweep); `"sweep:START-END"` sweeps a range of layers (fine sweep). | `"middle"` |
| `pooling` | `str` | Token pooling strategy for both training and inference. Options: `"last_token"`, `"first_token"`, `"mean"`, `"all"`. | `"last_token"` |
| `train_pooling` | `str \| None` | Override pooling for training only. | `None` |
| `inference_pooling` | `str \| None` | Override pooling for inference only. Additional options: `"max"`, `"min"` (score-level pooling). | `None` |
| `classifier` | `str \| BaseEstimator` | Classification model: either a string name or an sklearn estimator. | `"logistic_regression"` |
| `task` | `str` | Task type: `"classification"` or `"regression"`. When `"regression"`, defaults to Ridge regression and disables `predict_proba`. | `"classification"` |
| `device` | `str` | Device for model inference: `"auto"`, `"cpu"`, `"cuda:0"`, etc. | `"auto"` |
| `remote` | `bool` | Use nnsight remote execution (requires `NDIF_API_KEY`). | `False` |
| `random_state` | `int \| None` | Random seed for reproducibility. Propagates to the classifier. | `None` |
| `batch_size` | `int` | Number of prompts to process at once during activation extraction. Smaller values use less memory but may be slower. | `8` |
| `auto_candidates` | `list[int] \| list[float] \| None` | Candidate layers for `layers="auto"` mode: `list[int]` gives explicit layer indices (e.g., `[10, 16, 22]`); `list[float]` gives fractional positions (e.g., `[0.33, 0.5, 0.66]`); `None` defaults to `[0.25, 0.5, 0.75]`. Only used when `layers="auto"`. | `None` |
| `auto_alpha` | `float` | Group Lasso regularization strength for `layers="auto"`. Higher values select fewer layers. Typical range: 0.001 to 0.1. | `0.01` |
| `normalize_layers` | `bool \| str` | Per-layer feature standardization when using multiple layers; compensates for differences in activation magnitude across layers. Options: `True` or `"per_neuron"` gives each neuron its own mean/std (default); `"per_layer"` shares one mean/std across all neurons in a layer (may work better with small sample sizes due to lower variance); `False` disables scaling. | `True` |
| `fast_auto_top_k` | `int \| None` | Number of layers to select when using `layers="fast_auto"`. If `None`, defaults to selecting half the candidate layers. | `None` |
| `backend` | `str` | Extraction backend: `"local"` (default) or `"nnsight"`. `"local"` uses HuggingFace transformers directly without nnsight, enabling use with models not supported by nnsight/NDIF. | `"local"` |
| `dtype` | `str \| None` | Model dtype for the local backend: `"float32"`, `"float16"`, or `"bfloat16"`. Defaults to `"float32"` if `None`. Ignored for the nnsight backend. | `None` |
| `classifier_kwargs` | `dict \| None` | Additional keyword arguments passed to the sklearn classifier constructor. Overrides defaults for built-in classifiers. | `None` |
| `preprocessing` | `str \| list[str] \| None` | Preprocessing pipeline applied after layer scaling but before the classifier. Steps are separated by `\|` when given as a single string. | `None` |
| `pca_components` | `int \| None` | Number of PCA components when `preprocessing` includes `"pca"`. | `None` |
| `mass_mean_augment` | `bool` | If `True`, compute the mass-mean direction (`mean_positive - mean_negative`) on the original activations (before preprocessing), project all samples onto this direction to get a 1D feature, and append it to the (optionally PCA-reduced) features before fitting the classifier. The augmentation is also applied during inference. | `False` |
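The `mass_mean_augment` behavior described above can be illustrated with a small NumPy sketch. This is an illustrative re-implementation of the documented behavior, not lmprobe's internal code; all array names here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
X_pos = rng.normal(1.0, 1.0, size=(20, 8))   # activations for positive-class prompts
X_neg = rng.normal(-1.0, 1.0, size=(20, 8))  # activations for negative-class prompts

# Mass-mean direction: difference of the class means on the raw activations
direction = X_pos.mean(axis=0) - X_neg.mean(axis=0)

# Project every sample onto the direction to get a single extra feature,
# then append it to the feature matrix the classifier will see
X = np.vstack([X_pos, X_neg])
mass_mean_feature = X @ direction            # shape (40,)
X_augmented = np.hstack([X, mass_mean_feature[:, None]])

print(X_augmented.shape)  # (40, 9)
```

The same projection must be applied at inference time with the direction learned during fitting, which is why the docstring notes the augmentation is also applied during inference.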
Attributes:

| Name | Type | Description |
|---|---|---|
| `classifier_` | `BaseEstimator` | The fitted sklearn classifier (after calling `fit()`). |
| `classes_` | `ndarray` | Class labels (after calling `fit()`). |
| `selected_layers_` | `list[int] \| None` | Layer indices selected when `layers="auto"` or `"fast_auto"`. `None` for other layer modes or before fitting. |
| `scaler_` | `PerLayerScaler \| None` | The fitted per-layer scaler (after calling `fit()` with multiple layers and `normalize_layers=True`). `None` if a single layer is used or `normalize_layers=False`. |
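The per-layer standardization that `scaler_` performs can be sketched in NumPy. This is an illustrative version of the documented `normalize_layers` modes, not lmprobe's actual `PerLayerScaler`:

```python
import numpy as np

def per_layer_scale(X: np.ndarray, n_layers: int, mode: str = "per_neuron") -> np.ndarray:
    """Standardize concatenated multi-layer features one layer block at a time."""
    d = X.shape[1] // n_layers  # features per layer block
    out = np.empty_like(X, dtype=float)
    for i in range(n_layers):
        block = X[:, i * d:(i + 1) * d]
        if mode == "per_neuron":  # each neuron gets its own mean/std
            mu, sd = block.mean(axis=0), block.std(axis=0)
        else:                     # "per_layer": one mean/std for the whole block
            mu, sd = block.mean(), block.std()
        out[:, i * d:(i + 1) * d] = (block - mu) / (sd + 1e-8)
    return out

rng = np.random.default_rng(0)
# Two concatenated layers with very different activation magnitudes
X = np.hstack([rng.normal(0, 1, (16, 4)), rng.normal(0, 100, (16, 4))])
X_scaled = per_layer_scale(X, n_layers=2)
```

After scaling, both layer blocks contribute features of comparable magnitude, which is the point of the compensation described above.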
Examples:
>>> probe = Probe(
... model="meta-llama/Llama-3.1-8B-Instruct",
... layers=16,
... pooling="last_token",
... classifier="logistic_regression",
... random_state=42,
... )
>>> probe.fit(positive_prompts, negative_prompts)
>>> predictions = probe.predict(test_prompts)
>>> # Automatic layer selection
>>> probe = Probe(
... model="meta-llama/Llama-3.1-8B-Instruct",
... layers="auto",
... auto_candidates=[0.25, 0.5, 0.75],
... auto_alpha=0.01,
... )
>>> probe.fit(positive_prompts, negative_prompts)
>>> print(probe.selected_layers_) # e.g., [8, 16]
warmup ¶
warmup(prompts: list[str], remote: bool | None = None, max_retries: int | None = None, batch_size: int | None = None) -> None
Extract and cache activations without training a classifier.
Use this to pre-populate the activation cache for a set of prompts. This is useful when you want to separate the (expensive) activation extraction step from the (cheap) classifier training step, or when you plan to train multiple probes on the same prompts.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `prompts` | `list[str]` | Text prompts to extract and cache activations for. | required |
| `remote` | `bool \| None` | Override the instance-level `remote` setting. | `None` |
| `max_retries` | `int \| None` | Override the instance-level `max_retries` setting. Only applies to remote extraction. | `None` |
| `batch_size` | `int \| None` | Override the instance-level `batch_size` for this call. Smaller values reduce memory usage; larger values may improve throughput on GPU. | `None` |
fit ¶
fit(positive_prompts: list[str], negative_prompts: list[str] | ndarray | list[int] | None = None, remote: bool | None = None, invalidate_cache: bool = False, max_retries: int | None = None, batch_size: int | None = None, sample_weight: ndarray | list[float] | None = None) -> Probe
Fit the probe on training data.
Supports two signatures:

1. Contrastive: `fit(positive_prompts, negative_prompts)`
2. Standard: `fit(prompts, labels)`
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `positive_prompts` | `list[str]` | In contrastive mode: prompts for the positive class. In standard mode: all prompts. | required |
| `negative_prompts` | `list[str] \| ndarray \| list[int] \| None` | In contrastive mode: prompts for the negative class. In standard mode: labels (array of ints). | `None` |
| `remote` | `bool \| None` | Override the instance-level `remote` setting. | `None` |
| `invalidate_cache` | `bool` | If `True`, ignore cached activations and re-extract. | `False` |
| `max_retries` | `int \| None` | Override the instance-level `max_retries` setting. Only applies to remote extraction. | `None` |
| `batch_size` | `int \| None` | Override the instance-level `batch_size` for this call. Smaller values reduce memory usage; larger values may improve throughput on GPU. | `None` |
| `sample_weight` | `ndarray \| list[float] \| None` | Per-sample weights passed to the classifier's `fit()` method. | `None` |

Returns:

| Type | Description |
|---|---|
| `Probe` | Self, for method chaining. |
Notes

When `layers="auto"`, fitting occurs in two phases:

1. Train Group Lasso on candidate layers to identify informative layers.
2. Re-train the specified classifier using only the selected layers.

After fitting with `layers="auto"`, check `probe.selected_layers_` to see which layers were chosen.
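In contrastive mode, the two prompt lists reduce to a single labeled dataset equivalent to the standard signature. A minimal sketch of that mapping (illustrative only, assuming positives map to label 1 and negatives to label 0; not lmprobe's internals):

```python
# Contrastive mode inputs
positive_prompts = ["I love this.", "Great work!"]
negative_prompts = ["I hate this.", "Terrible."]

# Equivalent standard-mode inputs: fit(prompts, labels)
prompts = positive_prompts + negative_prompts
labels = [1] * len(positive_prompts) + [0] * len(negative_prompts)

print(labels)  # [1, 1, 0, 0]
```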
predict ¶
predict(prompts: list[str], remote: bool | None = None, batch_size: int | None = None) -> np.ndarray
Predict class labels for prompts.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `prompts` | `list[str]` | Text prompts to classify. | required |
| `remote` | `bool \| None` | Override the instance-level `remote` setting. | `None` |
| `batch_size` | `int \| None` | Override the instance-level `batch_size` for this call. | `None` |

Returns:

| Type | Description |
|---|---|
| `ndarray` | Predicted class labels, shape `(n_samples,)`. |
predict_proba ¶
predict_proba(prompts: list[str], remote: bool | None = None, batch_size: int | None = None) -> np.ndarray
Predict class probabilities for prompts.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `prompts` | `list[str]` | Text prompts to classify. | required |
| `remote` | `bool \| None` | Override the instance-level `remote` setting. | `None` |
| `batch_size` | `int \| None` | Override the instance-level `batch_size` for this call. | `None` |

Returns:

| Type | Description |
|---|---|
| `ndarray` | Class probabilities. Shape depends on `inference_pooling`: `(n_samples, n_classes)` normally, or `(n_samples, seq_len, n_classes)` when `inference_pooling="all"`. |
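Given per-token probabilities of shape `(n_samples, seq_len, n_classes)`, the score-level `"max"`/`"min"` options for `inference_pooling` amount to pooling class scores over the sequence axis. A NumPy sketch of the idea (the probability array below is made up, and this is an illustration rather than lmprobe's implementation):

```python
import numpy as np

# Hypothetical per-token probabilities: (n_samples=2, seq_len=3, n_classes=2)
token_probs = np.array([
    [[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]],
    [[0.2, 0.8], [0.1, 0.9], [0.5, 0.5]],
])

pos_scores = token_probs[:, :, 1]    # positive-class score per token, (n_samples, seq_len)
max_pooled = pos_scores.max(axis=1)  # "max": highest score anywhere in the sequence
min_pooled = pos_scores.min(axis=1)  # "min": lowest score anywhere in the sequence

print(max_pooled)  # [0.6 0.9]
```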
score ¶
score(prompts: list[str], labels: list[int] | ndarray, remote: bool | None = None, batch_size: int | None = None) -> float
Compute accuracy on test data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `prompts` | `list[str]` | Test prompts. | required |
| `labels` | `list[int] \| ndarray` | True labels. | required |
| `remote` | `bool \| None` | Override the instance-level `remote` setting. | `None` |
| `batch_size` | `int \| None` | Override the instance-level `batch_size` for this call. | `None` |

Returns:

| Type | Description |
|---|---|
| `float` | Classification accuracy. |
evaluate ¶
Compute a standard set of evaluation metrics.
Computes accuracy, AUROC, F1, precision, and recall. Results are
cached on self._evaluation_results_.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `prompts` | `list[str]` | Evaluation prompts (should NOT be training data). | required |
| `labels` | `list[int] \| ndarray` | True labels. | required |
| `remote` | `bool \| None` | Override the instance-level `remote` setting. | `None` |

Returns:

| Type | Description |
|---|---|
| `dict` | Metrics dict with keys: `accuracy`, `auroc`, `f1`, `precision`, `recall`, `n_eval`, `eval_hash`. |
compute_layer_importance ¶
Compute layer importance from classifier coefficients.
This method analyzes the trained classifier's coefficients to determine which layers contribute most to the classification decision. It provides a fast alternative to Group Lasso for layer importance analysis.
Must be called after fit() when using multiple layers with a linear classifier (one with a coef_ attribute).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metric` | `str` | How to aggregate coefficients per layer: `"l2"` takes the L2 norm (Euclidean magnitude, analogous to Group Lasso); `"l1"` the sum of absolute values; `"mean_abs"` the mean absolute value (normalized by dimension); `"max_abs"` the maximum absolute value. | `"l2"` |
| `normalize` | `bool` | If `True`, normalize importances to sum to 1. | `True` |

Returns:

| Type | Description |
|---|---|
| `ndarray` | Layer importance scores, shape `(n_layers,)`. Also stored in `self.layer_importances_`. |

Raises:

| Type | Description |
|---|---|
| `RuntimeError` | If the probe is not fitted or the classifier lacks a `coef_` attribute. |
| `ValueError` | If an unknown metric is specified. |
Examples:
>>> probe = Probe(model="...", layers=[8, 16, 24])
>>> probe.fit(positive_prompts, negative_prompts)
>>> importance = probe.compute_layer_importance()
>>> print(f"Layer {probe.candidate_layers_[importance.argmax()]} is most important")
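The aggregation metrics above can be sketched directly in NumPy. This is an illustrative re-implementation that assumes equal-width layer blocks in the coefficient vector; it is not lmprobe's actual code:

```python
import numpy as np

def layer_importance(coef: np.ndarray, n_layers: int, metric: str = "l2",
                     normalize: bool = True) -> np.ndarray:
    """Aggregate a linear classifier's coefficients into per-layer importances."""
    blocks = coef.reshape(n_layers, -1)  # one row of coefficients per layer
    if metric == "l2":
        imp = np.linalg.norm(blocks, axis=1)
    elif metric == "l1":
        imp = np.abs(blocks).sum(axis=1)
    elif metric == "mean_abs":
        imp = np.abs(blocks).mean(axis=1)
    elif metric == "max_abs":
        imp = np.abs(blocks).max(axis=1)
    else:
        raise ValueError(f"unknown metric: {metric}")
    return imp / imp.sum() if normalize else imp

# 3 layers x 2 features: the middle layer carries the largest coefficients
coef = np.array([0.1, -0.1, 2.0, -2.0, 0.5, 0.5])
print(layer_importance(coef, n_layers=3).round(3))
```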
fit_from_activations ¶
fit_from_activations(X: Any, y: Any, sample_weight: ndarray | list[float] | None = None, n_layers: int = 1) -> Probe
Fit the probe from pre-computed activation tensors.
Skips activation extraction and pooling, but applies the same
normalization and preprocessing pipeline as fit():
StandardScaler for single-layer, PerLayerScaler for multi-layer,
and any user-specified preprocessing.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `X` | `ndarray \| Tensor` | Pre-computed activations, shape `(n_samples, n_features)`. | required |
| `y` | `ndarray \| Tensor` | Labels. `int` for classification, `float` for regression. | required |
| `sample_weight` | `ndarray \| list[float] \| None` | Per-sample weights passed to the classifier's `fit()` method. | `None` |
| `n_layers` | `int` | Number of concatenated layers in `X`. Controls scaling behavior: `1` (default) applies `StandardScaler`; `>1` applies `PerLayerScaler` when `normalize_layers` is enabled. | `1` |

Returns:

| Type | Description |
|---|---|
| `Probe` | Self, for method chaining. |
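For the single-layer case, the pipeline described above (standardization followed by a linear classifier) can be approximated with sklearn directly. A sketch on synthetic activations, illustrative rather than a substitute for `fit_from_activations`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 1.0, (32, 16)),    # positive-class activations
               rng.normal(-1.0, 1.0, (32, 16))])  # negative-class activations
y = np.array([1] * 32 + [0] * 32)

# Single-layer path: StandardScaler, then the classifier
clf = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
clf.fit(X, y)
print(clf.score(X, y))
```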
predict_from_activations ¶
Predict from pre-computed activation tensors.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `X` | `ndarray \| Tensor` | Pre-computed activations, shape `(n_samples, n_features)`. | required |

Returns:

| Type | Description |
|---|---|
| `ndarray` | Predictions, shape `(n_samples,)`. |
predict_proba_from_activations ¶
Predict probabilities from pre-computed activation tensors.
Only available for classification tasks.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `X` | `ndarray \| Tensor` | Pre-computed activations, shape `(n_samples, n_features)`. | required |

Returns:

| Type | Description |
|---|---|
| `ndarray` | Class probabilities, shape `(n_samples, n_classes)`. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the task is regression. |
score_from_activations ¶
Score the probe on pre-computed activation tensors.
Returns accuracy for classification, R-squared for regression.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `X` | `ndarray \| Tensor` | Pre-computed activations, shape `(n_samples, n_features)`. | required |
| `y` | `ndarray \| Tensor` | True labels/values. | required |

Returns:

| Type | Description |
|---|---|
| `float` | Accuracy (classification) or R-squared (regression). |
save ¶
Save the fitted probe to disk.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to save the probe. | required |
load
classmethod
¶
Load a fitted probe from disk.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the saved probe. | required |

Returns:

| Type | Description |
|---|---|
| `Probe` | The loaded probe. |
sweep_layers
classmethod
¶
sweep_layers(model: str, positive_prompts: list[str], negative_prompts: list[str], layers: int | list[int] | str = 'all', pooling: str = 'last_token', classifier: str | BaseEstimator = 'logistic_regression', device: str = 'auto', remote: bool = False, random_state: int | None = None, batch_size: int = 8, backend: str = 'local', dtype: str | None = None, normalize_layers: bool | str = True, classifier_kwargs: dict | None = None, preprocessing: str | list[str] | None = None, pca_components: int | None = None) -> LayerSweepResult
Train a probe at every layer and return per-layer results.
This method avoids the boilerplate of manually looping over layers. It performs one warmup pass extracting all requested layers (single forward pass through the model, cached), then trains an independent single-layer probe for each layer using cached activations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `str` | HuggingFace model ID or local path. | required |
| `positive_prompts` | `list[str]` | Prompts for the positive class. | required |
| `negative_prompts` | `list[str]` | Prompts for the negative class. | required |
| `layers` | `int \| list[int] \| str` | Which layers to sweep. Accepts the same specifications as `Probe`: `int`, `list[int]`, `"all"`, `"middle"`, `"last"`. | `"all"` |
| `pooling` | `str` | Token pooling strategy. | `"last_token"` |
| `classifier` | `str \| BaseEstimator` | Classification model. | `"logistic_regression"` |
| `device` | `str` | Device for model inference. | `"auto"` |
| `remote` | `bool` | Use nnsight remote execution. | `False` |
| `random_state` | `int \| None` | Random seed for reproducibility. | `None` |
| `batch_size` | `int` | Number of prompts per batch during extraction. | `8` |
| `backend` | `str` | Extraction backend: `"local"` or `"nnsight"`. | `"local"` |
| `dtype` | `str \| None` | Model dtype for the local backend. | `None` |
| `normalize_layers` | `bool \| str` | Per-layer normalization (applied per single-layer probe). | `True` |
| `classifier_kwargs` | `dict \| None` | Additional keyword arguments passed to the sklearn classifier constructor. | `None` |
| `preprocessing` | `str \| list[str] \| None` | Preprocessing pipeline applied before the classifier. | `None` |
| `pca_components` | `int \| None` | Number of PCA components when `preprocessing` includes `"pca"`. | `None` |

Returns:

| Type | Description |
|---|---|
| `LayerSweepResult` | Contains a fitted probe for each layer, with methods for scoring and finding the best layer. |
Examples:
>>> result = Probe.sweep_layers(
... model="meta-llama/Llama-3.1-8B-Instruct",
... positive_prompts=pos,
... negative_prompts=neg,
... layers="all",
... )
>>> scores = result.score(test_prompts, test_labels)
>>> best = result.best_layer(test_prompts, test_labels)
>>> print(f"Best layer: {best}, accuracy: {scores[best]:.3f}")
LayerSweepResult¶
lmprobe.probe.LayerSweepResult
dataclass
¶
Results from a per-layer probe sweep.
Contains a fitted Probe for each layer, with convenience methods for scoring and finding the best layer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `probes` | `dict[int, Probe]` | Mapping from layer index to fitted `Probe`. | `dict()` |
Examples:
>>> result = Probe.sweep_layers(
... model="meta-llama/Llama-3.1-8B-Instruct",
... positive_prompts=pos,
... negative_prompts=neg,
... layers="all",
... )
>>> scores = result.score(test_prompts, test_labels)
>>> print(f"Best layer: {result.best_layer(test_prompts, test_labels)}")
score ¶
Score each layer's probe on test data.
Performs a single warmup extraction pass for all layers, then scores each probe from cache (no redundant forward passes).
best_layer ¶
Return the layer index with the highest accuracy.
predict ¶
Predict with each layer's probe.
Performs a single warmup extraction pass for all layers, then predicts from cache (no redundant forward passes).
predict_proba ¶
Predict probabilities with each layer's probe.
Performs a single warmup extraction pass for all layers, then predicts from cache (no redundant forward passes).
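The selection logic of `best_layer` amounts to an argmax over the per-layer score mapping. A minimal sketch (the score values below are made up for illustration):

```python
# Hypothetical per-layer accuracies, as a mapping from layer index to score
scores = {8: 0.71, 16: 0.93, 24: 0.88}

# best_layer() reduces to an argmax over the score mapping
best = max(scores, key=scores.get)
print(best)  # 16
```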