
Baseline

Classes for building baselines to compare against your probe.


BaselineProbe

lmprobe.baseline.BaselineProbe

Text classification baseline for comparison with linear probes.

This class provides simple baselines that don't use model activations, helping determine if probes are learning meaningful representations or just exploiting surface-level features.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `method` | `str` | Feature extraction method:<br>- `"bow"`: bag-of-words (word counts)<br>- `"tfidf"`: TF-IDF weighted bag-of-words<br>- `"random"`: random predictions (true chance baseline)<br>- `"majority"`: always predict the majority class<br>- `"sentence_length"`: character/word count features<br>- `"perplexity"`: the model's own logprobs (requires `model`)<br>- `"sentence_transformers"`: off-the-shelf embeddings<br>- `"shuffled_labels"`: sanity check that trains on shuffled labels using features from `base_method`; should score near chance (~50% accuracy) | `"tfidf"` |
| `classifier` | `str \| BaseEstimator` | Classification model. Same options as `LinearProbe`. Ignored for `method="random"` and `method="majority"`. | `"logistic_regression"` |
| `random_state` | `int \| None` | Random seed for reproducibility. | `None` |
| `max_features` | `int \| None` | Maximum vocabulary size for the `bow`/`tfidf` methods. | `10000` |
| `ngram_range` | `tuple[int, int]` | N-gram range for `bow`/`tfidf`. `(1, 1)` = unigrams only, `(1, 2)` = unigrams and bigrams. | `(1, 1)` |
| `base_method` | `str` | Feature extraction method used when `method="shuffled_labels"`. Ignored for other methods. | `"tfidf"` |
| `model` | `str \| None` | HuggingFace model ID. Required for `method="perplexity"`. | `None` |
| `device` | `str` | Device for model inference (`perplexity` method). | `"auto"` |
| `remote` | `bool` | Use nnsight remote execution (`perplexity` method). | `False` |
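As a point of reference, the `"tfidf"` baseline amounts to vectorizing the prompts and fitting a linear classifier on the resulting features. The sketch below shows that idea with plain scikit-learn on toy data; it is illustrative only, not lmprobe's actual implementation.

```python
# Illustrative sketch of what a TF-IDF baseline computes (toy data,
# plain scikit-learn -- not lmprobe's actual implementation).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

positive = ["great movie, loved every minute", "wonderful acting and a sharp plot"]
negative = ["terrible film, hated every minute", "boring plot and flat acting"]

texts = positive + negative
labels = np.array([1, 1, 0, 0])  # 1 = positive class, 0 = negative class

# Mirrors the documented defaults: max_features=10000, ngram_range=(1, 1)
vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 1))
X = vectorizer.fit_transform(texts)

clf = LogisticRegression().fit(X, labels)
train_acc = clf.score(X, labels)  # accuracy on the training texts
```

A probe that cannot beat this kind of surface-feature classifier may simply be exploiting lexical cues rather than learned representations.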

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `classifier_` | `BaseEstimator` | The fitted classifier (after calling `fit()`). |
| `classes_` | `ndarray` | Class labels (after calling `fit()`). |
| `vectorizer_` | `CountVectorizer \| TfidfVectorizer \| None` | The fitted text vectorizer (for the `bow`/`tfidf` methods). |

Examples:

```python
>>> baseline = BaselineProbe(method="tfidf", classifier="logistic_regression")
>>> baseline.fit(positive_prompts, negative_prompts)
>>> accuracy = baseline.score(test_prompts, test_labels)
>>> print(f"TF-IDF baseline: {accuracy:.1%}")
```

fit

`fit(positive_prompts: list[str], negative_prompts: list[str]) -> BaselineProbe`

Fit the baseline on contrastive examples.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `positive_prompts` | `list[str]` | Examples of the positive class. | *required* |
| `negative_prompts` | `list[str]` | Examples of the negative class. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `BaselineProbe` | Self, for method chaining. |

predict

`predict(prompts: list[str]) -> np.ndarray`

Predict class labels for prompts.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompts` | `list[str]` | Text prompts to classify. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `ndarray` | Predicted class labels, shape `(n_prompts,)`. |

predict_proba

`predict_proba(prompts: list[str]) -> np.ndarray`

Predict class probabilities for prompts.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompts` | `list[str]` | Text prompts to classify. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `ndarray` | Class probabilities, shape `(n_prompts, n_classes)`. |

Raises:

| Type | Description |
| --- | --- |
| `AttributeError` | If the classifier doesn't support `predict_proba`. |

score

`score(prompts: list[str], labels: list[int] | ndarray) -> float`

Compute classification accuracy.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompts` | `list[str]` | Text prompts to classify. | *required* |
| `labels` | `list[int] \| ndarray` | True labels. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `float` | Classification accuracy. |

get_feature_names

`get_feature_names() -> list[str] | None`

Get feature names for the `bow`/`tfidf` methods.

Returns:

| Type | Description |
| --- | --- |
| `list[str] \| None` | Feature names (vocabulary), or `None` for `random`/`majority`. |

get_top_features

`get_top_features(n: int = 20) -> dict[str, list[tuple[str, float]]] | None`

Get the top features by classifier weight for each class.

Only available for `bow`/`tfidf` with linear classifiers that expose `coef_`.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n` | `int` | Number of top features to return per class. | `20` |

Returns:

| Type | Description |
| --- | --- |
| `dict \| None` | Dictionary with `'positive'` and `'negative'` keys, each containing a list of `(feature_name, weight)` tuples; `None` if not applicable. |


ActivationBaseline

lmprobe.activation_baseline.ActivationBaseline

Activation-based baseline classifiers.

These baselines test whether a probe is learning something meaningful beyond what simple transformations of activations would capture.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | `str` | HuggingFace model ID or local path. | *required* |
| `method` | `str` | Baseline method:<br>- `"random_direction"`: project onto a random unit vector<br>- `"pca"`: project onto the top-k principal components<br>- `"layer_0"`: use the embedding layer instead of later layers | `"random_direction"` |
| `layers` | `int \| list[int] \| str` | Layers for activation extraction (ignored for the `layer_0` method). | `-1` |
| `pooling` | `str` | Token pooling strategy. | `"last_token"` |
| `classifier` | `str \| BaseEstimator` | Classification model. | `"logistic_regression"` |
| `device` | `str` | Device for model inference. | `"auto"` |
| `remote` | `bool` | Use nnsight remote execution. | `False` |
| `random_state` | `int \| None` | Random seed for reproducibility. | `None` |
| `n_components` | `int` | Number of PCA components (for `method="pca"`). | `10` |
| `batch_size` | `int` | Batch size for activation extraction. | `8` |
| `backend` | `str` | Extraction backend: `"local"` (default) or `"nnsight"`. | `"local"` |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `classifier_` | `BaseEstimator` | The fitted classifier (after calling `fit()`). |
| `classes_` | `ndarray` | Class labels (after calling `fit()`). |
| `random_direction_` | `ndarray \| None` | Random unit vector (for `method="random_direction"`). |
| `pca_` | `PCA \| None` | Fitted PCA transformer (for `method="pca"`). |

Examples:

```python
>>> baseline = ActivationBaseline(
...     model="meta-llama/Llama-3.1-8B-Instruct",
...     method="random_direction",
...     layers=-1,
... )
>>> baseline.fit(positive_prompts, negative_prompts)
>>> accuracy = baseline.score(test_prompts, test_labels)
```

fit

`fit(positive_prompts: list[str], negative_prompts: list[str]) -> ActivationBaseline`

Fit the baseline on contrastive examples.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `positive_prompts` | `list[str]` | Examples of the positive class. | *required* |
| `negative_prompts` | `list[str]` | Examples of the negative class. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `ActivationBaseline` | Self, for method chaining. |

predict

`predict(prompts: list[str]) -> np.ndarray`

Predict class labels.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompts` | `list[str]` | Text prompts to classify. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `ndarray` | Predicted class labels, shape `(n_prompts,)`. |

predict_proba

`predict_proba(prompts: list[str]) -> np.ndarray`

Predict class probabilities.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompts` | `list[str]` | Text prompts to classify. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `ndarray` | Class probabilities, shape `(n_prompts, n_classes)`. |

Raises:

| Type | Description |
| --- | --- |
| `AttributeError` | If the classifier doesn't support `predict_proba`. |

score

`score(prompts: list[str], labels: list[int] | ndarray) -> float`

Compute classification accuracy.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompts` | `list[str]` | Text prompts to classify. | *required* |
| `labels` | `list[int] \| ndarray` | True labels. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `float` | Classification accuracy. |


BaselineBattery

lmprobe.battery.BaselineBattery

Run multiple baselines and compare their performance.

BaselineBattery provides a convenient way to run all available baselines and find which one performs best on your task. This helps determine if a linear probe is learning something meaningful beyond simpler approaches.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | `str \| None` | HuggingFace model ID. Required for activation-based baselines. If `None`, only text-based baselines are run. | `None` |
| `layers` | `int \| list[int] \| str` | Layers for activation extraction (activation baselines only). | `-1` |
| `pooling` | `str` | Token pooling strategy. | `"last_token"` |
| `classifier` | `str \| BaseEstimator` | Classification model for all baselines. | `"logistic_regression"` |
| `device` | `str` | Device for model inference. | `"auto"` |
| `remote` | `bool` | Use nnsight remote execution. | `False` |
| `random_state` | `int \| None` | Random seed for reproducibility. | `None` |
| `include` | `list[str] \| None` | Which baselines to include. If `None`, includes all applicable. Available: `bow`, `tfidf`, `random`, `majority`, `sentence_length`, `sentence_transformers`, `perplexity`, `random_direction`, `pca`, `layer_0`. | `None` |
| `exclude` | `list[str] \| None` | Which baselines to exclude. | `None` |
| `scorer` | `Callable \| None` | Custom scoring function; default is accuracy. Signature: `scorer(y_true, y_pred) -> float`. | `None` |
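A custom scorer is any callable matching the documented `scorer(y_true, y_pred) -> float` signature. The sketch below swaps in F1 via scikit-learn's `f1_score`; the name `f1_scorer` is hypothetical, chosen for this example.

```python
# A custom scorer is any callable matching the documented signature
# scorer(y_true, y_pred) -> float. Here, F1 in place of accuracy;
# "f1_scorer" is a hypothetical name used only in this sketch.
from sklearn.metrics import f1_score

def f1_scorer(y_true, y_pred) -> float:
    return float(f1_score(y_true, y_pred))

# Would be passed as: BaselineBattery(scorer=f1_scorer)
score = f1_scorer([1, 1, 0, 0], [1, 0, 0, 0])  # precision 1.0, recall 0.5
```

An F1-based scorer is a reasonable choice when the positive/negative classes are imbalanced and plain accuracy would be misleading.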

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `results_` | `BaselineResults \| None` | Results from the last `fit()` call; `None` before fitting. |

Examples:

```python
>>> # Run all text-only baselines
>>> battery = BaselineBattery(random_state=42)
>>> results = battery.fit(
...     positive_prompts, negative_prompts,
...     test_prompts, test_labels,
... )
>>> print(results.summary())
>>> best = results.get_best(n=3)
```

```python
>>> # Run all baselines, including activation-based ones
>>> battery = BaselineBattery(
...     model="meta-llama/Llama-3.1-8B-Instruct",
...     layers=-1,
...     device="cuda",
... )
>>> results = battery.fit(pos, neg, test_prompts, test_labels)
```

available_baselines property

`available_baselines: list[str]`

List of all registered baseline names.

applicable_baselines property

`applicable_baselines: list[str]`

List of baselines that would run with the current configuration.

fit

`fit(positive_prompts: list[str], negative_prompts: list[str], test_prompts: list[str] | None = None, test_labels: list[int] | ndarray | None = None) -> BaselineResults`

Fit all baselines and optionally score them on test data.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `positive_prompts` | `list[str]` | Positive training examples. | *required* |
| `negative_prompts` | `list[str]` | Negative training examples. | *required* |
| `test_prompts` | `list[str] \| None` | Test prompts for scoring. If `None`, uses the training data. | `None` |
| `test_labels` | `list[int] \| ndarray \| None` | Test labels. If `None`, uses the training labels. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `BaselineResults` | Results for all baselines that ran successfully. |

get_best

`get_best(n: int = 1) -> list[BaselineResult]`

Get the top `n` baselines by score.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n` | `int` | Number of top baselines to return. | `1` |

Returns:

| Type | Description |
| --- | --- |
| `list[BaselineResult]` | Top `n` baselines sorted by score. |

Raises:

| Type | Description |
| --- | --- |
| `RuntimeError` | If `fit()` has not been called. |

get_baseline

`get_baseline(name: str) -> BaselineProbe | ActivationBaseline`

Get a specific fitted baseline by name.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `name` | `str` | Name of the baseline. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `BaselineProbe \| ActivationBaseline` | The fitted baseline instance. |

Raises:

| Type | Description |
| --- | --- |
| `KeyError` | If the baseline was not run or not found. |


BaselineResults

lmprobe.battery.BaselineResults dataclass

Results from BaselineBattery.fit().

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `results` | `list[BaselineResult]` | Results for each baseline, unsorted. |

get_best

`get_best(n: int = 1) -> list[BaselineResult]`

Return the top `n` baselines by score, descending.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n` | `int` | Number of top baselines to return. | `1` |

Returns:

| Type | Description |
| --- | --- |
| `list[BaselineResult]` | Top `n` baselines sorted by score (highest first). |

__getitem__

`__getitem__(key: str | int) -> BaselineResult`

Get a result by baseline name or by index.

summary

`summary() -> str`

Return a formatted summary string.