
Baseline

Classes for building baselines to compare against your probe.


BaselineProbe

lmprobe.baseline.BaselineProbe

Text classification baseline for comparison with linear probes.

This class provides simple baselines that don't use model activations, helping determine if probes are learning meaningful representations or just exploiting surface-level features.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `method` | `str` | Feature extraction method:<br>- `"bow"`: bag-of-words (word counts)<br>- `"tfidf"`: TF-IDF weighted bag-of-words<br>- `"random"`: random predictions (true chance baseline)<br>- `"majority"`: always predict the majority class<br>- `"sentence_length"`: character/word count features<br>- `"perplexity"`: the model's own logprobs (requires `model`)<br>- `"sentence_transformers"`: off-the-shelf embeddings<br>- `"shuffled_labels"`: sanity check that trains on shuffled labels using features from `base_method`; should score near chance (~50% accuracy) | `"tfidf"` |
| `classifier` | `str \| BaseEstimator` | Classification model. Same options as `LinearProbe`. Ignored for `method="random"` and `method="majority"`. | `"logistic_regression"` |
| `random_state` | `int \| None` | Random seed for reproducibility. | `None` |
| `max_features` | `int \| None` | Maximum vocabulary size for the `bow`/`tfidf` methods. | `10000` |
| `ngram_range` | `tuple[int, int]` | N-gram range for `bow`/`tfidf`. `(1, 1)` = unigrams only, `(1, 2)` = unigrams and bigrams. | `(1, 1)` |
| `base_method` | `str` | Feature extraction method used when `method="shuffled_labels"`. Ignored for other methods. | `"tfidf"` |
| `model` | `str \| None` | HuggingFace model ID. Required for `method="perplexity"`. | `None` |
| `device` | `str` | Device for model inference (`perplexity` method). | `"auto"` |
| `remote` | `bool` | Use nnsight remote execution (`perplexity` method). | `False` |
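As a point of reference, the `"tfidf"` baseline amounts to vectorizing the prompts and fitting a linear classifier on the resulting features. The sketch below shows that idea with plain scikit-learn on toy data; it is illustrative only, not lmprobe's actual implementation.

```python
# Illustrative sketch of what a TF-IDF baseline computes (toy data,
# plain scikit-learn -- not lmprobe's actual implementation).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

positive = ["great movie, loved every minute", "wonderful acting and a sharp plot"]
negative = ["terrible film, hated every minute", "boring plot and flat acting"]

texts = positive + negative
labels = np.array([1, 1, 0, 0])  # 1 = positive class, 0 = negative class

# Mirrors the documented defaults: max_features=10000, ngram_range=(1, 1)
vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 1))
X = vectorizer.fit_transform(texts)

clf = LogisticRegression().fit(X, labels)
train_acc = clf.score(X, labels)  # accuracy on the training texts
```

A probe that cannot beat this kind of surface-feature classifier may simply be exploiting lexical cues rather than learned representations.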

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `classifier_` | `BaseEstimator` | The fitted classifier (after calling `fit()`). |
| `classes_` | `ndarray` | Class labels (after calling `fit()`). |
| `vectorizer_` | `CountVectorizer \| TfidfVectorizer \| None` | The fitted text vectorizer (for the `bow`/`tfidf` methods). |

Examples:

```python
>>> baseline = BaselineProbe(method="tfidf", classifier="logistic_regression")
>>> baseline.fit(positive_prompts, negative_prompts)
>>> accuracy = baseline.score(test_prompts, test_labels)
>>> print(f"TF-IDF baseline: {accuracy:.1%}")
```

fit

`fit(positive_prompts: list[str], negative_prompts: list[str]) -> BaselineProbe`

Fit the baseline on contrastive examples.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `positive_prompts` | `list[str]` | Examples of the positive class. | *required* |
| `negative_prompts` | `list[str]` | Examples of the negative class. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `BaselineProbe` | Self, for method chaining. |

predict

`predict(prompts: list[str]) -> np.ndarray`

Predict class labels for prompts.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompts` | `list[str]` | Text prompts to classify. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `ndarray` | Predicted class labels, shape `(n_prompts,)`. |

predict_proba

`predict_proba(prompts: list[str]) -> np.ndarray`

Predict class probabilities for prompts.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompts` | `list[str]` | Text prompts to classify. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `ndarray` | Class probabilities, shape `(n_prompts, n_classes)`. |

Raises:

| Type | Description |
| --- | --- |
| `AttributeError` | If the classifier doesn't support `predict_proba`. |

score

`score(prompts: list[str], labels: list[int] | ndarray) -> float`

Compute classification accuracy.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompts` | `list[str]` | Text prompts to classify. | *required* |
| `labels` | `list[int] \| ndarray` | True labels. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `float` | Classification accuracy. |

get_feature_names

`get_feature_names() -> list[str] | None`

Get feature names for the `bow`/`tfidf` methods.

Returns:

| Type | Description |
| --- | --- |
| `list[str] \| None` | Feature names (vocabulary), or `None` for `random`/`majority`. |

get_top_features

`get_top_features(n: int = 20) -> dict[str, list[tuple[str, float]]] | None`

Get the top features by classifier weight for each class.

Only available for `bow`/`tfidf` with linear classifiers that expose `coef_`.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n` | `int` | Number of top features to return per class. | `20` |

Returns:

| Type | Description |
| --- | --- |
| `dict \| None` | Dictionary with `'positive'` and `'negative'` keys, each containing a list of `(feature_name, weight)` tuples; `None` if not applicable. |


ActivationBaseline

lmprobe.activation_baseline.ActivationBaseline

Activation-based baseline classifiers.

These baselines test whether a probe is learning something meaningful beyond what simple transformations of activations would capture.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | `str` | HuggingFace model ID or local path. | *required* |
| `method` | `str` | Baseline method:<br>- `"random_direction"`: project onto a random unit vector<br>- `"pca"`: project onto the top-k principal components<br>- `"layer_0"`: use the embedding layer instead of later layers | `"random_direction"` |
| `layers` | `int \| list[int] \| str` | Layers for activation extraction (ignored for the `layer_0` method). | `-1` |
| `pooling` | `str` | Token pooling strategy. | `"last_token"` |
| `classifier` | `str \| BaseEstimator` | Classification model. | `"logistic_regression"` |
| `device` | `str` | Device for model inference. | `"auto"` |
| `remote` | `bool` | Use nnsight remote execution. | `False` |
| `random_state` | `int \| None` | Random seed for reproducibility. | `None` |
| `n_components` | `int` | Number of PCA components (for `method="pca"`). | `10` |
| `batch_size` | `int` | Batch size for activation extraction. | `8` |
| `backend` | `str` | Extraction backend: `"local"` (default) or `"nnsight"`. | `"local"` |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `classifier_` | `BaseEstimator` | The fitted classifier (after calling `fit()`). |
| `classes_` | `ndarray` | Class labels (after calling `fit()`). |
| `random_direction_` | `ndarray \| None` | Random unit vector (for `method="random_direction"`). |
| `pca_` | `PCA \| None` | Fitted PCA transformer (for `method="pca"`). |

Examples:

```python
>>> baseline = ActivationBaseline(
...     model="meta-llama/Llama-3.1-8B-Instruct",
...     method="random_direction",
...     layers=-1,
... )
>>> baseline.fit(positive_prompts, negative_prompts)
>>> accuracy = baseline.score(test_prompts, test_labels)
```

fit

`fit(positive_prompts: list[str], negative_prompts: list[str]) -> ActivationBaseline`

Fit the baseline on contrastive examples.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `positive_prompts` | `list[str]` | Examples of the positive class. | *required* |
| `negative_prompts` | `list[str]` | Examples of the negative class. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `ActivationBaseline` | Self, for method chaining. |

predict

`predict(prompts: list[str]) -> np.ndarray`

Predict class labels.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompts` | `list[str]` | Text prompts to classify. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `ndarray` | Predicted class labels, shape `(n_prompts,)`. |

predict_proba

`predict_proba(prompts: list[str]) -> np.ndarray`

Predict class probabilities.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompts` | `list[str]` | Text prompts to classify. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `ndarray` | Class probabilities, shape `(n_prompts, n_classes)`. |

Raises:

| Type | Description |
| --- | --- |
| `AttributeError` | If the classifier doesn't support `predict_proba`. |

score

`score(prompts: list[str], labels: list[int] | ndarray) -> float`

Compute classification accuracy.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompts` | `list[str]` | Text prompts to classify. | *required* |
| `labels` | `list[int] \| ndarray` | True labels. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `float` | Classification accuracy. |


BaselineBattery

lmprobe.battery.BaselineBattery

Run multiple baselines and compare their performance.

BaselineBattery provides a convenient way to run all available baselines and find which one performs best on your task. This helps determine if a linear probe is learning something meaningful beyond simpler approaches.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | `str \| None` | HuggingFace model ID. Required for activation-based baselines. If `None`, only text-based baselines are run. | `None` |
| `layers` | `int \| list[int] \| str` | Layers for activation extraction (activation baselines only). | `-1` |
| `pooling` | `str` | Token pooling strategy. | `"last_token"` |
| `classifier` | `str \| BaseEstimator` | Classification model for all baselines. | `"logistic_regression"` |
| `device` | `str` | Device for model inference. | `"auto"` |
| `remote` | `bool` | Use nnsight remote execution. | `False` |
| `random_state` | `int \| None` | Random seed for reproducibility. | `None` |
| `include` | `list[str] \| None` | Which baselines to include. If `None`, includes all applicable. Available: `bow`, `tfidf`, `random`, `majority`, `sentence_length`, `sentence_transformers`, `perplexity`, `random_direction`, `pca`, `layer_0`. | `None` |
| `exclude` | `list[str] \| None` | Which baselines to exclude. | `None` |
| `scorer` | `Callable \| None` | Custom scoring function; default is accuracy. Signature: `scorer(y_true, y_pred) -> float`. | `None` |
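A custom scorer is any callable matching the documented `scorer(y_true, y_pred) -> float` signature. The sketch below swaps in F1 via scikit-learn's `f1_score`; the name `f1_scorer` is hypothetical, chosen for this example.

```python
# A custom scorer is any callable matching the documented signature
# scorer(y_true, y_pred) -> float. Here, F1 in place of accuracy;
# "f1_scorer" is a hypothetical name used only in this sketch.
from sklearn.metrics import f1_score

def f1_scorer(y_true, y_pred) -> float:
    return float(f1_score(y_true, y_pred))

# Would be passed as: BaselineBattery(scorer=f1_scorer)
score = f1_scorer([1, 1, 0, 0], [1, 0, 0, 0])  # precision 1.0, recall 0.5
```

An F1-based scorer is a reasonable choice when the positive/negative classes are imbalanced and plain accuracy would be misleading.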

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `results_` | `BaselineResults \| None` | Results from the last `fit()` call; `None` before fitting. |

Examples:

```python
>>> # Run all text-only baselines
>>> battery = BaselineBattery(random_state=42)
>>> results = battery.fit(
...     positive_prompts, negative_prompts,
...     test_prompts, test_labels,
... )
>>> print(results.summary())
>>> best = results.get_best(n=3)
```

```python
>>> # Run all baselines, including activation-based ones
>>> battery = BaselineBattery(
...     model="meta-llama/Llama-3.1-8B-Instruct",
...     layers=-1,
...     device="cuda",
... )
>>> results = battery.fit(pos, neg, test_prompts, test_labels)
```

available_baselines property

`available_baselines: list[str]`

List of all registered baseline names.

applicable_baselines property

`applicable_baselines: list[str]`

List of baselines that would run with the current configuration.

fit

`fit(positive_prompts: list[str], negative_prompts: list[str], test_prompts: list[str] | None = None, test_labels: list[int] | ndarray | None = None) -> BaselineResults`

Fit all baselines and optionally score them on test data.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `positive_prompts` | `list[str]` | Positive training examples. | *required* |
| `negative_prompts` | `list[str]` | Negative training examples. | *required* |
| `test_prompts` | `list[str] \| None` | Test prompts for scoring. If `None`, uses the training data. | `None` |
| `test_labels` | `list[int] \| ndarray \| None` | Test labels. If `None`, uses the training labels. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `BaselineResults` | Results for all baselines that ran successfully. |

get_best

`get_best(n: int = 1) -> list[BaselineResult]`

Get the top `n` baselines by score.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n` | `int` | Number of top baselines to return. | `1` |

Returns:

| Type | Description |
| --- | --- |
| `list[BaselineResult]` | Top `n` baselines sorted by score. |

Raises:

| Type | Description |
| --- | --- |
| `RuntimeError` | If `fit()` has not been called. |

get_baseline

`get_baseline(name: str) -> BaselineProbe | ActivationBaseline`

Get a specific fitted baseline by name.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `name` | `str` | Name of the baseline. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `BaselineProbe \| ActivationBaseline` | The fitted baseline instance. |

Raises:

| Type | Description |
| --- | --- |
| `KeyError` | If the baseline was not run or not found. |


BaselineResults

lmprobe.battery.BaselineResults dataclass

Results from BaselineBattery.fit().

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `results` | `list[BaselineResult]` | Results for each baseline, unsorted. |

get_best

`get_best(n: int = 1) -> list[BaselineResult]`

Return the top `n` baselines by score, descending.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n` | `int` | Number of top baselines to return. | `1` |

Returns:

| Type | Description |
| --- | --- |
| `list[BaselineResult]` | Top `n` baselines sorted by score (highest first). |

__getitem__

`__getitem__(key: str | int) -> BaselineResult`

Get a result by baseline name or by index.

summary

`summary() -> str`

Return a formatted summary string.