Probe

The main class for training and using linear probes on language model activations.


Probe

lmprobe.probe.Probe

Train a linear probe on language model activations.

Parameters:

Name Type Description Default
model str | None

HuggingFace model ID or local path. Optional when using *_from_activations() methods only.

None
layers int | list[int] | str

Which layers to extract activations from:

- int: Single layer (negative indexing supported)
- list[int]: Multiple layers (concatenated)
- "middle": Middle third of layers
- "last": Last layer only
- "all": All layers
- "auto": Automatic layer selection via Group Lasso
- "fast_auto": Fast automatic layer selection via coefficient importance
- "sweep": Train independent probe per layer (memory-safe)
- "sweep:N": Sweep every Nth layer (coarse sweep)
- "sweep:START-END": Sweep a range of layers (fine sweep)

"middle"
pooling str

Token pooling strategy for both training and inference. Options: "last_token", "first_token", "mean", "all"

"last_token"
train_pooling str | None

Override pooling for training only.

None
inference_pooling str | None

Override pooling for inference only. Additional options: "max", "min" (score-level pooling)

None
classifier str | BaseEstimator

Classification model. Either a string name or sklearn estimator.

"logistic_regression"
task str

Task type: "classification" or "regression". When "regression", defaults to Ridge regression and disables predict_proba.

"classification"
device str

Device for model inference: "auto", "cpu", "cuda:0", etc.

"auto"
remote bool

Use nnsight remote execution (requires NDIF_API_KEY).

False
random_state int | None

Random seed for reproducibility. Propagates to classifier.

None
batch_size int

Number of prompts to process at once during activation extraction. Smaller values use less memory but may be slower.

8
auto_candidates list[int] | list[float] | None

Candidate layers for layers="auto" mode:

- list[int]: Explicit layer indices (e.g., [10, 16, 22])
- list[float]: Fractional positions (e.g., [0.33, 0.5, 0.66])
- None: Default to [0.25, 0.5, 0.75]

Only used when layers="auto".

None
auto_alpha float

Group Lasso regularization strength for layers="auto". Higher values select fewer layers. Typical range: 0.001 to 0.1.

0.01
normalize_layers bool | str

Per-layer feature standardization when using multiple layers. Compensates for differences in activation magnitude across layers. Options:

- True or "per_neuron": Each neuron gets its own mean/std (default)
- "per_layer": All neurons in a layer share one mean/std (may work better with small sample sizes due to lower variance)
- False: No scaling

True
fast_auto_top_k int | None

Number of layers to select when using layers="fast_auto". If None, defaults to selecting half the candidate layers.

None
backend str

Extraction backend: "local" or "nnsight". "local" uses HuggingFace transformers directly without nnsight, enabling use with models not supported by nnsight/NDIF.

"local"
dtype str | None

Model dtype for local backend: "float32", "float16", or "bfloat16". Defaults to "float32" if None. Ignored for nnsight backend.

None
classifier_kwargs dict | None

Additional keyword arguments passed to the sklearn classifier constructor. Overrides defaults for built-in classifiers. Example: {"C": 0.01, "solver": "liblinear", "max_iter": 5000} for logistic regression.

None
preprocessing str | list[str] | None

Preprocessing pipeline applied after layer scaling but before the classifier. Steps are separated by "+" when given as a string:

- "standard": StandardScaler
- "pca" or "pca:N": PCA with N components
- "standard+pca": StandardScaler then PCA

Use pca_components to set N when using "pca" without :N.

None
pca_components int | None

Number of PCA components when preprocessing includes "pca" without an explicit component count.

None
mass_mean_augment bool

If True, compute the mass-mean direction (mean_positive - mean_negative) on the original activations (before preprocessing), project all samples onto this direction to get a 1D feature, and append it to the (optionally PCA-reduced) features before fitting the classifier. This augmentation is also applied during inference.

False
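The mass-mean augmentation described above is a simple projection; a minimal NumPy sketch of the computation on synthetic activations (illustrative only, not the library's internals):

```python
import numpy as np

# Synthetic activations: 6 samples x 4 features, first 3 positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
y = np.array([1, 1, 1, 0, 0, 0])

# Mass-mean direction: difference of class means on the raw activations.
direction = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)

# Project each sample onto the direction to obtain one extra feature,
# then append it to the (optionally PCA-reduced) feature matrix.
projection = X @ direction
X_augmented = np.hstack([X, projection[:, None]])
print(X_augmented.shape)  # (6, 5)
```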

Attributes:

Name Type Description
classifier_ BaseEstimator

The fitted sklearn classifier (after calling fit()).

classes_ ndarray

Class labels (after calling fit()).

selected_layers_ list[int] | None

Layer indices selected when layers="auto" or "fast_auto". None for other layer modes or before fitting.

scaler_ PerLayerScaler | None

The fitted per-layer scaler (after calling fit() with multiple layers and normalize_layers=True). None if single layer or normalize_layers=False.
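The per-layer standardization that scaler_ performs can be sketched in NumPy; this shows the "per_layer" variant on hypothetical concatenated activations (the real PerLayerScaler is fit on training data and reused at inference):

```python
import numpy as np

# Hypothetical activations from 2 concatenated layers of width 3 each,
# with very different magnitudes (common across transformer layers).
rng = np.random.default_rng(1)
X = np.hstack([rng.normal(0, 1, (8, 3)), rng.normal(0, 10, (8, 3))])

n_layers, width = 2, 3
Xs = X.copy()
for i in range(n_layers):
    block = Xs[:, i * width:(i + 1) * width]
    # "per_layer": one mean/std shared by all neurons in the layer.
    # "per_neuron" would use block.mean(axis=0) and block.std(axis=0) instead.
    Xs[:, i * width:(i + 1) * width] = (block - block.mean()) / block.std()

# After scaling, both layer blocks have unit standard deviation.
print(round(Xs[:, :width].std(), 6), round(Xs[:, width:].std(), 6))
```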

Examples:

>>> probe = Probe(
...     model="meta-llama/Llama-3.1-8B-Instruct",
...     layers=16,
...     pooling="last_token",
...     classifier="logistic_regression",
...     random_state=42,
... )
>>> probe.fit(positive_prompts, negative_prompts)
>>> predictions = probe.predict(test_prompts)
>>> # Automatic layer selection
>>> probe = Probe(
...     model="meta-llama/Llama-3.1-8B-Instruct",
...     layers="auto",
...     auto_candidates=[0.25, 0.5, 0.75],
...     auto_alpha=0.01,
... )
>>> probe.fit(positive_prompts, negative_prompts)
>>> print(probe.selected_layers_)  # e.g., [8, 16]

warmup

warmup(prompts: list[str], remote: bool | None = None, max_retries: int | None = None, batch_size: int | None = None) -> None

Extract and cache activations without training a classifier.

Use this to pre-populate the activation cache for a set of prompts. This is useful when you want to separate the (expensive) activation extraction step from the (cheap) classifier training step, or when you plan to train multiple probes on the same prompts.

Parameters:

Name Type Description Default
prompts list[str]

Text prompts to extract and cache activations for.

required
remote bool | None

Override the instance-level remote setting.

None
max_retries int | None

Override the instance-level max_retries setting. Only applies to remote extraction.

None
batch_size int | None

Override the instance-level batch_size for this call. Smaller values reduce memory usage; larger values may improve throughput on GPU.

None

fit

fit(positive_prompts: list[str], negative_prompts: list[str] | ndarray | list[int] | None = None, remote: bool | None = None, invalidate_cache: bool = False, max_retries: int | None = None, batch_size: int | None = None, sample_weight: ndarray | list[float] | None = None) -> Probe

Fit the probe on training data.

Supports two signatures:

1. Contrastive: fit(positive_prompts, negative_prompts)
2. Standard: fit(prompts, labels)

Parameters:

Name Type Description Default
positive_prompts list[str]

In contrastive mode: prompts for the positive class. In standard mode: all prompts.

required
negative_prompts list[str] | ndarray | list[int] | None

In contrastive mode: prompts for the negative class. In standard mode: labels (array of ints).

None
remote bool | None

Override the instance-level remote setting.

None
invalidate_cache bool

If True, ignore cached activations and re-extract.

False
max_retries int | None

Override the instance-level max_retries setting. Only applies to remote extraction.

None
batch_size int | None

Override the instance-level batch_size for this call. Smaller values reduce memory usage; larger values may improve throughput on GPU.

None
sample_weight ndarray | list[float] | None

Per-sample weights passed to the classifier's fit() method. Length must match the total number of training samples (len(positive_prompts) + len(negative_prompts) in contrastive mode, or len(prompts) in standard mode). If None, all samples are weighted equally.

None

Returns:

Type Description
Probe

Self, for method chaining.

Notes

When layers="auto", fitting occurs in two phases:

1. Train Group Lasso on candidate layers to identify informative layers
2. Re-train the specified classifier using only the selected layers

After fitting with layers="auto", check probe.selected_layers_ to see which layers were chosen.
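The two signatures carry the same information; a minimal sketch of how the contrastive form maps onto explicit labels (hypothetical prompts):

```python
# Hypothetical prompt lists: contrastive mode builds the labels for you.
positive_prompts = ["The sky is blue.", "Water is wet."]
negative_prompts = ["The sky is green.", "Fire is cold.", "Ice is hot."]

# fit(positive_prompts, negative_prompts) is equivalent to the standard
# signature with explicit integer labels:
prompts = positive_prompts + negative_prompts
labels = [1] * len(positive_prompts) + [0] * len(negative_prompts)
print(labels)  # [1, 1, 0, 0, 0]
```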

predict

predict(prompts: list[str], remote: bool | None = None, batch_size: int | None = None) -> np.ndarray

Predict class labels for prompts.

Parameters:

Name Type Description Default
prompts list[str]

Text prompts to classify.

required
remote bool | None

Override the instance-level remote setting.

None
batch_size int | None

Override the instance-level batch_size for this call.

None

Returns:

Type Description
ndarray

Predicted class labels, shape (n_samples,).

predict_proba

predict_proba(prompts: list[str], remote: bool | None = None, batch_size: int | None = None) -> np.ndarray

Predict class probabilities for prompts.

Parameters:

Name Type Description Default
prompts list[str]

Text prompts to classify.

required
remote bool | None

Override the instance-level remote setting.

None
batch_size int | None

Override the instance-level batch_size for this call.

None

Returns:

Type Description
ndarray

Class probabilities. Shape depends on inference_pooling:

- Normal: (n_samples, n_classes)
- "all": (n_samples, seq_len, n_classes)
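Score-level pooling ("max"/"min" for inference_pooling) reduces per-token probabilities after classification; a NumPy sketch on hypothetical per-token scores:

```python
import numpy as np

# Hypothetical per-token positive-class probabilities for 2 prompts,
# shape (n_samples, seq_len), as produced with inference_pooling="all".
token_probs = np.array([[0.2, 0.8, 0.5],
                        [0.1, 0.3, 0.4]])

max_pooled = token_probs.max(axis=1)  # inference_pooling="max"
min_pooled = token_probs.min(axis=1)  # inference_pooling="min"
print(max_pooled, min_pooled)  # [0.8 0.4] [0.2 0.1]
```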

score

score(prompts: list[str], labels: list[int] | ndarray, remote: bool | None = None, batch_size: int | None = None) -> float

Compute accuracy on test data.

Parameters:

Name Type Description Default
prompts list[str]

Test prompts.

required
labels list[int] | ndarray

True labels.

required
remote bool | None

Override the instance-level remote setting.

None
batch_size int | None

Override the instance-level batch_size for this call.

None

Returns:

Type Description
float

Classification accuracy.

evaluate

evaluate(prompts: list[str], labels: list[int] | ndarray, remote: bool | None = None) -> dict

Compute a standard set of evaluation metrics.

Computes accuracy, AUROC, F1, precision, and recall. Results are cached on self._evaluation_results_.

Parameters:

Name Type Description Default
prompts list[str]

Evaluation prompts (should NOT be training data).

required
labels list[int] | ndarray

True labels.

required
remote bool | None

Override the instance-level remote setting.

None

Returns:

Type Description
dict

Metrics dict with keys: accuracy, auroc, f1, precision, recall, n_eval, eval_hash.
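The metrics match the standard sklearn definitions; a sketch on hypothetical labels and probability scores (not the library's internal code; eval_hash is omitted):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, roc_auc_score, f1_score,
                             precision_score, recall_score)

y_true = np.array([1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.7, 0.4, 0.2, 0.35, 0.6])  # hypothetical probabilities
y_pred = (y_score >= 0.5).astype(int)

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "auroc": roc_auc_score(y_true, y_score),   # uses scores, not hard labels
    "f1": f1_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "n_eval": len(y_true),
}
print(metrics["accuracy"])
```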

compute_layer_importance

compute_layer_importance(metric: str = 'l2', normalize: bool = True) -> np.ndarray

Compute layer importance from classifier coefficients.

This method analyzes the trained classifier's coefficients to determine which layers contribute most to the classification decision. It provides a fast alternative to Group Lasso for layer importance analysis.

Must be called after fit() when using multiple layers with a linear classifier (one with a coef_ attribute).

Parameters:

Name Type Description Default
metric str

How to aggregate coefficients per layer:

- "l2": L2 norm (Euclidean magnitude), analogous to Group Lasso
- "l1": Sum of absolute values
- "mean_abs": Mean absolute value (normalized by dimension)
- "max_abs": Maximum absolute value

"l2"
normalize bool

If True, normalize importances to sum to 1.

True

Returns:

Type Description
ndarray

Layer importance scores, shape (n_layers,). Also stored in self.layer_importances_.

Raises:

Type Description
RuntimeError

If probe not fitted or classifier lacks coef_ attribute.

ValueError

If unknown metric specified.

Examples:

>>> probe = Probe(model="...", layers=[8, 16, 24])
>>> probe.fit(positive_prompts, negative_prompts)
>>> importance = probe.compute_layer_importance()
>>> print(f"Most important: layer {probe.candidate_layers_[importance.argmax()]}")
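The per-layer aggregation itself is straightforward; a NumPy sketch of the "l2" metric on a hypothetical coefficient vector spanning 3 layers of width 4:

```python
import numpy as np

# Hypothetical coef_ over 3 concatenated layers of width 4 (12 features).
coef = np.arange(12, dtype=float)
per_layer = coef.reshape(3, 4)

raw = np.linalg.norm(per_layer, axis=1)  # metric="l2": one norm per layer
importance = raw / raw.sum()             # normalize=True: importances sum to 1
print(importance.argmax())  # 2
```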

fit_from_activations

fit_from_activations(X: Any, y: Any, sample_weight: ndarray | list[float] | None = None, n_layers: int = 1) -> Probe

Fit the probe from pre-computed activation tensors.

Skips activation extraction and pooling, but applies the same normalization and preprocessing pipeline as fit(): StandardScaler for single-layer, PerLayerScaler for multi-layer, and any user-specified preprocessing.

Parameters:

Name Type Description Default
X ndarray | Tensor

Pre-computed activations, shape (n_samples, n_features).

required
y ndarray | Tensor

Labels. int for classification, float for regression.

required
sample_weight ndarray | list[float] | None

Per-sample weights passed to the classifier's fit() method. If None, all samples are weighted equally.

None
n_layers int

Number of concatenated layers in X. Controls scaling behavior: 1 (default) applies StandardScaler, >1 applies PerLayerScaler when normalize_layers is set.

1

Returns:

Type Description
Probe

Self, for method chaining.
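When activations are already on hand, this path is essentially standard sklearn fitting; a hedged sketch of the equivalent single-layer pipeline (synthetic activations, not the library's actual internals):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic "activations": two well-separated classes, 16 features each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 1.0, (20, 16)),    # positive-class samples
               rng.normal(-1.0, 1.0, (20, 16))])  # negative-class samples
y = np.array([1] * 20 + [0] * 20)

# Single-layer case (n_layers=1): StandardScaler, then the classifier.
Xs = StandardScaler().fit_transform(X)
clf = LogisticRegression(random_state=42).fit(Xs, y)
print(clf.score(Xs, y))  # high training accuracy on separable data
```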

predict_from_activations

predict_from_activations(X: Any) -> np.ndarray

Predict from pre-computed activation tensors.

Parameters:

Name Type Description Default
X ndarray | Tensor

Pre-computed activations, shape (n_samples, n_features).

required

Returns:

Type Description
ndarray

Predictions, shape (n_samples,).

predict_proba_from_activations

predict_proba_from_activations(X: Any) -> np.ndarray

Predict probabilities from pre-computed activation tensors.

Only available for classification tasks.

Parameters:

Name Type Description Default
X ndarray | Tensor

Pre-computed activations, shape (n_samples, n_features).

required

Returns:

Type Description
ndarray

Class probabilities, shape (n_samples, n_classes).

Raises:

Type Description
ValueError

If task is regression.

score_from_activations

score_from_activations(X: Any, y: Any) -> float

Score the probe on pre-computed activation tensors.

Returns accuracy for classification, R-squared for regression.

Parameters:

Name Type Description Default
X ndarray | Tensor

Pre-computed activations, shape (n_samples, n_features).

required
y ndarray | Tensor

True labels/values.

required

Returns:

Type Description
float

Accuracy (classification) or R-squared (regression).

save

save(path: str) -> None

Save the fitted probe to disk.

Parameters:

Name Type Description Default
path str

Path to save the probe.

required

load classmethod

load(path: str) -> Probe

Load a fitted probe from disk.

Parameters:

Name Type Description Default
path str

Path to the saved probe.

required

Returns:

Type Description
Probe

The loaded probe.

sweep_layers classmethod

sweep_layers(model: str, positive_prompts: list[str], negative_prompts: list[str], layers: int | list[int] | str = 'all', pooling: str = 'last_token', classifier: str | BaseEstimator = 'logistic_regression', device: str = 'auto', remote: bool = False, random_state: int | None = None, batch_size: int = 8, backend: str = 'local', dtype: str | None = None, normalize_layers: bool | str = True, classifier_kwargs: dict | None = None, preprocessing: str | list[str] | None = None, pca_components: int | None = None) -> LayerSweepResult

Train a probe at every layer and return per-layer results.

This method avoids the boilerplate of manually looping over layers. It performs one warmup pass extracting all requested layers (single forward pass through the model, cached), then trains an independent single-layer probe for each layer using cached activations.

Parameters:

Name Type Description Default
model str

HuggingFace model ID or local path.

required
positive_prompts list[str]

Prompts for the positive class.

required
negative_prompts list[str]

Prompts for the negative class.

required
layers int | list[int] | str

Which layers to sweep. Accepts same specifications as Probe: int, list[int], "all", "middle", "last".

"all"
pooling str

Token pooling strategy.

"last_token"
classifier str | BaseEstimator

Classification model.

"logistic_regression"
device str

Device for model inference.

"auto"
remote bool

Use nnsight remote execution.

False
random_state int | None

Random seed for reproducibility.

None
batch_size int

Number of prompts per batch during extraction.

8
backend str

Extraction backend: "local" or "nnsight".

"local"
dtype str | None

Model dtype for local backend.

None
normalize_layers bool | str

Per-layer normalization (applied per single-layer probe).

True
classifier_kwargs dict | None

Additional keyword arguments passed to the sklearn classifier constructor.

None
preprocessing str | list[str] | None

Preprocessing pipeline applied before the classifier (same format as Probe).

None
pca_components int | None

Number of PCA components when preprocessing includes "pca" without an explicit component count.

None

Returns:

Type Description
LayerSweepResult

Contains a fitted probe for each layer, with methods for scoring and finding the best layer.

Examples:

>>> result = Probe.sweep_layers(
...     model="meta-llama/Llama-3.1-8B-Instruct",
...     positive_prompts=pos,
...     negative_prompts=neg,
...     layers="all",
... )
>>> scores = result.score(test_prompts, test_labels)
>>> best = result.best_layer(test_prompts, test_labels)
>>> print(f"Best layer: {best}, accuracy: {scores[best]:.3f}")

LayerSweepResult

lmprobe.probe.LayerSweepResult dataclass

Results from a per-layer probe sweep.

Contains a fitted Probe for each layer, with convenience methods for scoring and finding the best layer.

Parameters:

Name Type Description Default
probes dict[int, Probe]

Mapping from layer index to fitted Probe.

dict()

Examples:

>>> result = Probe.sweep_layers(
...     model="meta-llama/Llama-3.1-8B-Instruct",
...     positive_prompts=pos,
...     negative_prompts=neg,
...     layers="all",
... )
>>> scores = result.score(test_prompts, test_labels)
>>> print(f"Best layer: {result.best_layer(test_prompts, test_labels)}")

layers property

layers: list[int]

Return sorted list of layer indices in this sweep.

__getitem__

__getitem__(layer: int) -> Probe

Get the probe for a specific layer.

score

score(test_prompts: list[str], test_labels: list[int] | ndarray) -> dict[int, float]

Score each layer's probe on test data.

Performs a single warmup extraction pass for all layers, then scores each probe from cache (no redundant forward passes).

best_layer

best_layer(test_prompts: list[str], test_labels: list[int] | ndarray) -> int

Return the layer index with the highest accuracy.
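best_layer is equivalent to taking the argmax over the dict returned by score(); a sketch with hypothetical per-layer accuracies:

```python
# Hypothetical per-layer accuracies, as returned by result.score(...).
scores = {8: 0.71, 16: 0.93, 24: 0.88}

# best_layer picks the key with the highest accuracy.
best = max(scores, key=scores.get)
print(best)  # 16
```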

predict

predict(prompts: list[str]) -> dict[int, np.ndarray]

Predict with each layer's probe.

Performs a single warmup extraction pass for all layers, then predicts from cache (no redundant forward passes).

predict_proba

predict_proba(prompts: list[str]) -> dict[int, np.ndarray]

Predict probabilities with each layer's probe.

Performs a single warmup extraction pass for all layers, then predicts from cache (no redundant forward passes).