Baseline¶
Classes for building baselines to compare against your probe.
BaselineProbe¶
lmprobe.baseline.BaselineProbe ¶
Text classification baseline for comparison with linear probes.
This class provides simple baselines that don't use model activations, helping determine if probes are learning meaningful representations or just exploiting surface-level features.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `method` | `str` | Feature extraction method:<br>`"bow"`: bag-of-words (word counts)<br>`"tfidf"`: TF-IDF weighted bag-of-words<br>`"random"`: random predictions (true chance baseline)<br>`"majority"`: always predict the majority class<br>`"sentence_length"`: character/word count features<br>`"perplexity"`: the model's own logprobs (requires the `model` param)<br>`"sentence_transformers"`: off-the-shelf embeddings<br>`"shuffled_labels"`: sanity check that trains on shuffled labels using features from `base_method`; should score ~50% accuracy | `"tfidf"` |
| `classifier` | `str \| BaseEstimator` | Classification model. Same options as `LinearProbe`. Ignored for `method="random"` and `method="majority"`. | `"logistic_regression"` |
| `random_state` | `int \| None` | Random seed for reproducibility. | `None` |
| `max_features` | `int \| None` | Maximum vocabulary size for the bow/tfidf methods. | `10000` |
| `ngram_range` | `tuple[int, int]` | N-gram range for bow/tfidf: `(1, 1)` = unigrams only, `(1, 2)` = unigrams and bigrams. | `(1, 1)` |
| `base_method` | `str` | Feature extraction method used when `method="shuffled_labels"`. Ignored for other methods. | `"tfidf"` |
| `model` | `str \| None` | HuggingFace model ID. Required for `method="perplexity"`. | `None` |
| `device` | `str` | Device for model inference (perplexity method). | `"auto"` |
| `remote` | `bool` | Use nnsight remote execution (perplexity method). | `False` |
Attributes:

| Name | Type | Description |
|---|---|---|
| `classifier_` | `BaseEstimator` | The fitted classifier (after calling `fit()`). |
| `classes_` | `ndarray` | Class labels (after calling `fit()`). |
| `vectorizer_` | `CountVectorizer \| TfidfVectorizer \| None` | The fitted text vectorizer (for bow/tfidf methods). |
Examples:

```python
>>> baseline = BaselineProbe(method="tfidf", classifier="logistic_regression")
>>> baseline.fit(positive_prompts, negative_prompts)
>>> accuracy = baseline.score(test_prompts, test_labels)
>>> print(f"TF-IDF baseline: {accuracy:.1%}")
```
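Under the hood, the `"tfidf"` method is essentially TF-IDF vectorization followed by a linear classifier. The sketch below reproduces that idea directly with scikit-learn; it illustrates the technique rather than lmprobe's actual implementation, and the example prompts are made up:

```python
# Rough sketch of what the "tfidf" baseline computes, using scikit-learn
# directly. Illustrative only -- not lmprobe's implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

positive_prompts = ["I love this movie", "What a great film"]
negative_prompts = ["I hate this movie", "What a terrible film"]

texts = positive_prompts + negative_prompts
labels = [1] * len(positive_prompts) + [0] * len(negative_prompts)

# Defaults mirror the documented parameters above.
vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 1))
X = vectorizer.fit_transform(texts)

clf = LogisticRegression(random_state=0).fit(X, labels)
train_accuracy = clf.score(X, labels)
```

If a probe trained on model activations barely beats this kind of surface-feature classifier, the probe may not be capturing anything beyond lexical cues.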
fit ¶
Fit the baseline on contrastive examples.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `positive_prompts` | `list[str]` | Examples of the positive class. | *required* |
| `negative_prompts` | `list[str]` | Examples of the negative class. | *required* |

Returns:

| Type | Description |
|---|---|
| `BaselineProbe` | Self, for method chaining. |
predict ¶
Predict class labels for prompts.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `prompts` | `list[str]` | Text prompts to classify. | *required* |

Returns:

| Type | Description |
|---|---|
| `ndarray` | Predicted class labels, shape `(n_prompts,)`. |
predict_proba ¶
Predict class probabilities for prompts.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `prompts` | `list[str]` | Text prompts to classify. | *required* |

Returns:

| Type | Description |
|---|---|
| `ndarray` | Class probabilities, shape `(n_prompts, n_classes)`. |

Raises:

| Type | Description |
|---|---|
| `AttributeError` | If the classifier doesn't support `predict_proba`. |
score ¶
Compute classification accuracy.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `prompts` | `list[str]` | Text prompts to classify. | *required* |
| `labels` | `list[int] \| ndarray` | True labels. | *required* |

Returns:

| Type | Description |
|---|---|
| `float` | Classification accuracy. |
get_feature_names ¶
Get feature names for bow/tfidf methods.
Returns:

| Type | Description |
|---|---|
| `list[str] \| None` | Feature names (vocabulary), or `None` for random/majority. |
get_top_features ¶
Get top features by classifier weight for each class.
Only works for bow/tfidf with linear classifiers that have coef_.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `n` | `int` | Number of top features to return per class. | `20` |

Returns:

| Type | Description |
|---|---|
| `dict \| None` | Dictionary with `'positive'` and `'negative'` keys, each containing a list of `(feature_name, weight)` tuples. `None` if not applicable. |
ActivationBaseline¶
lmprobe.activation_baseline.ActivationBaseline ¶
Activation-based baseline classifiers.
These baselines test whether a probe is learning something meaningful beyond what simple transformations of activations would capture.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `str` | HuggingFace model ID or local path. | *required* |
| `method` | `str` | Baseline method:<br>`"random_direction"`: project onto a random unit vector<br>`"pca"`: project onto the top-k principal components<br>`"layer_0"`: use the embedding layer instead of later layers | `"random_direction"` |
| `layers` | `int \| list[int] \| str` | Layers for activation extraction (ignored for the layer_0 method). | `-1` |
| `pooling` | `str` | Token pooling strategy. | `"last_token"` |
| `classifier` | `str \| BaseEstimator` | Classification model. | `"logistic_regression"` |
| `device` | `str` | Device for model inference. | `"auto"` |
| `remote` | `bool` | Use nnsight remote execution. | `False` |
| `random_state` | `int \| None` | Random seed for reproducibility. | `None` |
| `n_components` | `int` | Number of PCA components (for `method="pca"`). | `10` |
| `batch_size` | `int` | Batch size for activation extraction. | `8` |
| `backend` | `str` | Extraction backend: `"local"` (default) or `"nnsight"`. | `"local"` |
|
Attributes:

| Name | Type | Description |
|---|---|---|
| `classifier_` | `BaseEstimator` | The fitted classifier (after calling `fit()`). |
| `classes_` | `ndarray` | Class labels (after calling `fit()`). |
| `random_direction_` | `ndarray \| None` | Random unit vector (for `method="random_direction"`). |
| `pca_` | `PCA \| None` | Fitted PCA transformer (for `method="pca"`). |
Examples:

```python
>>> baseline = ActivationBaseline(
...     model="meta-llama/Llama-3.1-8B-Instruct",
...     method="random_direction",
...     layers=-1,
... )
>>> baseline.fit(positive_prompts, negative_prompts)
>>> accuracy = baseline.score(test_prompts, test_labels)
```
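The `"random_direction"` method boils down to projecting each activation vector onto a single random unit vector and classifying on that one-dimensional feature. A minimal sketch of the idea, using synthetic NumPy arrays in place of real model activations (not lmprobe's implementation):

```python
# Sketch of the "random_direction" idea: one random unit vector,
# one projected feature per example, then a linear classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 64

# Fake "activations" standing in for pooled hidden states of two classes.
pos_acts = rng.normal(loc=1.0, size=(50, d_model))
neg_acts = rng.normal(loc=-1.0, size=(50, d_model))
X = np.vstack([pos_acts, neg_acts])
y = np.array([1] * 50 + [0] * 50)

direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)   # random unit vector

projected = X @ direction                # shape (n_samples,)
clf = LogisticRegression().fit(projected.reshape(-1, 1), y)
acc = clf.score(projected.reshape(-1, 1), y)
```

If a trained probe's learned direction barely outperforms a random one, the "signal" it finds may be trivially present throughout activation space.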
fit ¶
Fit the baseline on contrastive examples.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `positive_prompts` | `list[str]` | Examples of the positive class. | *required* |
| `negative_prompts` | `list[str]` | Examples of the negative class. | *required* |

Returns:

| Type | Description |
|---|---|
| `ActivationBaseline` | Self, for method chaining. |
predict ¶
Predict class labels.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `prompts` | `list[str]` | Text prompts to classify. | *required* |

Returns:

| Type | Description |
|---|---|
| `ndarray` | Predicted class labels, shape `(n_prompts,)`. |
predict_proba ¶
Predict class probabilities.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `prompts` | `list[str]` | Text prompts to classify. | *required* |

Returns:

| Type | Description |
|---|---|
| `ndarray` | Class probabilities, shape `(n_prompts, n_classes)`. |

Raises:

| Type | Description |
|---|---|
| `AttributeError` | If the classifier doesn't support `predict_proba`. |
score ¶
Compute classification accuracy.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `prompts` | `list[str]` | Text prompts to classify. | *required* |
| `labels` | `list[int] \| ndarray` | True labels. | *required* |

Returns:

| Type | Description |
|---|---|
| `float` | Classification accuracy. |
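The `"pca"` method can be sketched the same way: reduce activations to the top `n_components` principal components, then fit the classifier on the reduced features. Again, synthetic data stands in for real activations, and this is an illustration rather than lmprobe's implementation:

```python
# Sketch of the "pca" baseline: top-k principal components of the
# activations become the classifier's features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))     # fake pooled activations
y = np.array([1] * 50 + [0] * 50)
X[:50] += 0.5                      # shift the positive class slightly

pca = PCA(n_components=10, random_state=0)
X_reduced = pca.fit_transform(X)   # shape (100, 10)

clf = LogisticRegression().fit(X_reduced, y)
acc = clf.score(X_reduced, y)
```

A probe that only matches this baseline is, in effect, learning a direction already captured by the dominant variance in the activations.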
BaselineBattery¶
lmprobe.battery.BaselineBattery ¶
Run multiple baselines and compare their performance.
BaselineBattery provides a convenient way to run all available baselines and find which one performs best on your task. This helps determine if a linear probe is learning something meaningful beyond simpler approaches.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `str \| None` | HuggingFace model ID. Required for activation-based baselines; if `None`, only text-based baselines are run. | `None` |
| `layers` | `int \| list[int] \| str` | Layers for activation extraction (activation baselines only). | `-1` |
| `pooling` | `str` | Token pooling strategy. | `"last_token"` |
| `classifier` | `str \| BaseEstimator` | Classification model for all baselines. | `"logistic_regression"` |
| `device` | `str` | Device for model inference. | `"auto"` |
| `remote` | `bool` | Use nnsight remote execution. | `False` |
| `random_state` | `int \| None` | Random seed for reproducibility. | `None` |
| `include` | `list[str] \| None` | Baselines to include. If `None`, includes all applicable. Available: `bow`, `tfidf`, `random`, `majority`, `sentence_length`, `sentence_transformers`, `perplexity`, `random_direction`, `pca`, `layer_0`. | `None` |
| `exclude` | `list[str] \| None` | Baselines to exclude. | `None` |
| `scorer` | `Callable \| None` | Custom scoring function with signature `scorer(y_true, y_pred) -> float`. Defaults to accuracy. | `None` |
|
Attributes:

| Name | Type | Description |
|---|---|---|
| `results_` | `BaselineResults \| None` | Results from the last `fit()` call; `None` before fitting. |
Examples:

```python
>>> # Run all text-only baselines
>>> battery = BaselineBattery(random_state=42)
>>> results = battery.fit(
...     positive_prompts, negative_prompts,
...     test_prompts, test_labels,
... )
>>> print(results.summary())
>>> best = results.get_best(n=3)

>>> # Run all baselines, including activation-based ones
>>> battery = BaselineBattery(
...     model="meta-llama/Llama-3.1-8B-Instruct",
...     layers=-1,
...     device="cuda",
... )
>>> results = battery.fit(pos, neg, test_prompts, test_labels)
```
available_baselines (property) ¶
List of all registered baseline names.
applicable_baselines (property) ¶
List of the baselines that would run with the current configuration.
fit ¶
```python
fit(
    positive_prompts: list[str],
    negative_prompts: list[str],
    test_prompts: list[str] | None = None,
    test_labels: list[int] | ndarray | None = None,
) -> BaselineResults
```
Fit all baselines and optionally score on test data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `positive_prompts` | `list[str]` | Positive training examples. | *required* |
| `negative_prompts` | `list[str]` | Negative training examples. | *required* |
| `test_prompts` | `list[str] \| None` | Test prompts for scoring. If `None`, uses the training data. | `None` |
| `test_labels` | `list[int] \| ndarray \| None` | Test labels. If `None`, uses the training labels. | `None` |
Returns:

| Type | Description |
|---|---|
| `BaselineResults` | Results for all baselines that ran successfully. |
get_best ¶
Get top n baselines by score.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `n` | `int` | Number of top baselines to return. | `1` |

Returns:

| Type | Description |
|---|---|
| `list[BaselineResult]` | Top `n` baselines sorted by score. |

Raises:

| Type | Description |
|---|---|
| `RuntimeError` | If `fit()` has not been called. |
get_baseline ¶
Get a specific fitted baseline by name.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the baseline. | *required* |

Returns:

| Type | Description |
|---|---|
| `BaselineProbe \| ActivationBaseline` | The fitted baseline instance. |

Raises:

| Type | Description |
|---|---|
| `KeyError` | If the baseline was not run or not found. |
BaselineResults¶
lmprobe.battery.BaselineResults (dataclass) ¶
Results from BaselineBattery.fit().
Attributes:

| Name | Type | Description |
|---|---|---|
| `results` | `list[BaselineResult]` | List of results for each baseline, unsorted. |
get_best ¶
Return top n baselines by score, descending.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `n` | `int` | Number of top baselines to return. | `1` |

Returns:

| Type | Description |
|---|---|
| `list[BaselineResult]` | Top `n` baselines sorted by score (highest first). |
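The sorting behind `get_best()` can be sketched with plain dataclasses. The `name` and `score` fields here are assumptions for illustration; the real `BaselineResult` may carry additional information:

```python
# Minimal stand-in sketch of BaselineResults.get_best(): sort results by
# score, descending, and return the first n. Field names are assumed.
from dataclasses import dataclass

@dataclass
class BaselineResult:
    name: str
    score: float

@dataclass
class BaselineResults:
    results: list

    def get_best(self, n: int = 1):
        # Highest score first; ties keep their original order (stable sort).
        return sorted(self.results, key=lambda r: r.score, reverse=True)[:n]

res = BaselineResults(results=[
    BaselineResult("random", 0.50),
    BaselineResult("tfidf", 0.82),
    BaselineResult("majority", 0.55),
])
best = res.get_best(n=2)  # → tfidf, then majority
```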