
Baselines

Always compare your probe against baselines. A probe that doesn't beat a bag-of-words classifier may be learning surface features, not model-internal representations.


Text-only baselines

from lmprobe import BaselineProbe

# Bag-of-words
bow = BaselineProbe(method="bow", classifier="logistic_regression")
bow.fit(positive_prompts, negative_prompts)
bow_acc = bow.score(test_prompts, test_labels)

# TF-IDF
tfidf = BaselineProbe(method="tfidf")
tfidf.fit(positive_prompts, negative_prompts)

# Sentence length (surprisingly predictive for some tasks)
length = BaselineProbe(method="sentence_length")
length.fit(positive_prompts, negative_prompts)

# Sentence-transformers embeddings (requires: pip install lmprobe[embeddings])
st = BaselineProbe(method="sentence_transformers")
st.fit(positive_prompts, negative_prompts)

# Sanity checks
random_baseline = BaselineProbe(method="random")    # should be ~50%
majority_baseline = BaselineProbe(method="majority")
shuffled_baseline = BaselineProbe(method="shuffled_labels")  # overfitting check
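
To make concrete what the sanity-check baselines measure, here is a minimal pure-Python sketch of the random and majority predictors (illustrative only, not the lmprobe implementation):

```python
import random

def majority_baseline(train_labels):
    """Always predict the most common training label."""
    majority = max(set(train_labels), key=train_labels.count)
    return lambda n: [majority] * n

def random_baseline(rng, n):
    """Uniform random predictions; accuracy should hover near 50%."""
    return [rng.choice([0, 1]) for _ in range(n)]

# Majority class in the training data is 1 (three of four examples)
train_labels = [1, 1, 1, 0]
test_labels = [1, 0, 1, 1]
predict = majority_baseline(train_labels)
preds = predict(len(test_labels))
accuracy = sum(p == y for p, y in zip(preds, test_labels)) / len(test_labels)
print(accuracy)  # 0.75: the constant prediction "1" matches 3 of 4 test labels
```

If your probe only matches these numbers, it has learned nothing beyond the label distribution.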

Activation-based baselines

These use the same model activations as your probe but with simpler classification approaches. If your probe doesn't beat them, the learned direction may not be special:

from lmprobe import ActivationBaseline

# Random direction — project onto random unit vector
random_dir = ActivationBaseline(
    method="random_direction",
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=-1,
)
random_dir.fit(positive_prompts, negative_prompts)

# PCA — classify using top principal components
pca = ActivationBaseline(
    method="pca",
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=-1,
)

# Layer 0 — input embeddings only (no deep processing)
layer0 = ActivationBaseline(
    method="layer_0",
    model="meta-llama/Llama-3.1-8B-Instruct",
)

# Perplexity — model's own token probabilities (uses BaselineProbe, not ActivationBaseline)
perplexity = BaselineProbe(
    method="perplexity",
    model="meta-llama/Llama-3.1-8B-Instruct",
)
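
Conceptually, the random_direction baseline projects each activation vector onto a fixed random unit vector and fits the simplest possible 1-D classifier on the projections. A numpy sketch of that idea, using synthetic stand-in activations (illustrative only; lmprobe's internals may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "activations": 50 positive and 50 negative samples in a
# 16-dimensional space, separated along every coordinate for simplicity.
pos = rng.normal(loc=1.0, size=(50, 16))
neg = rng.normal(loc=-1.0, size=(50, 16))
X = np.vstack([pos, neg])
y = np.array([1] * 50 + [0] * 50)

# Project every activation onto a fixed random unit vector.
direction = rng.normal(size=16)
direction /= np.linalg.norm(direction)
scores = X @ direction

# Simplest 1-D classifier: threshold at the midpoint of the class means,
# oriented so the positive class sits on the high side.
midpoint = (scores[y == 1].mean() + scores[y == 0].mean()) / 2
sign = np.sign(scores[y == 1].mean() - scores[y == 0].mean())
preds = (sign * (scores - midpoint) > 0).astype(int)
print((preds == y).mean())  # how much a *random* direction already recovers
```

If a random direction scores nearly as well as your trained probe, the class signal is spread broadly across activation space rather than concentrated in one special direction.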

Baseline battery

Run all applicable baselines at once and get a ranked comparison:

from lmprobe import BaselineBattery

# Text-only (no model required)
battery = BaselineBattery(model=None, random_state=42)
results = battery.fit(positive_prompts, negative_prompts, test_prompts, test_labels)

print(results.summary())
# Baseline Results:
# ------------------------------------------------------------
#   sentence_transformers          0.7925  (fit: 1.23s, predict: 0.05s)
#   tfidf                          0.7547  (fit: 0.01s, predict: 0.00s)
#   bow                            0.6792  (fit: 0.01s, predict: 0.00s)
#   majority                       0.5556  (fit: 0.00s, predict: 0.00s)
#   random                         0.4906  (fit: 0.00s, predict: 0.00s)

best = results.get_best()[0]
print(f"Best baseline: {best.name} ({best.score:.2%})")

# With activation baselines (requires model)
battery = BaselineBattery(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=-1,
    include=["bow", "tfidf", "random_direction", "pca"],
)
results = battery.fit(positive_prompts, negative_prompts, test_prompts, test_labels)
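
Under the hood, a battery is essentially "fit each baseline, time it, score it, rank by score". A generic sketch of that loop, using hypothetical duck-typed baselines with fit/score methods (not the lmprobe internals):

```python
import time

def run_battery(baselines, fit_args, score_args):
    """baselines: dict of name -> object exposing .fit(...) and .score(...).
    Returns (name, score, fit_seconds) tuples, best score first."""
    results = []
    for name, baseline in baselines.items():
        start = time.perf_counter()
        baseline.fit(*fit_args)
        fit_seconds = time.perf_counter() - start
        results.append((name, baseline.score(*score_args), fit_seconds))
    return sorted(results, key=lambda r: r[1], reverse=True)

class FixedScore:
    """Stand-in baseline that always reports the same accuracy."""
    def __init__(self, acc):
        self.acc = acc
    def fit(self, *args):
        pass
    def score(self, *args):
        return self.acc

ranking = run_battery({"bow": FixedScore(0.68), "tfidf": FixedScore(0.75)}, (), ())
print([name for name, _, _ in ranking])  # ['tfidf', 'bow']
```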

All baseline methods

| Method | Type | Description |
|---|---|---|
| bow | Text | Bag-of-words + classifier |
| tfidf | Text | TF-IDF + classifier |
| random | Text | Random predictions (sanity check) |
| majority | Text | Always predict the majority class |
| sentence_length | Text | Classify by text length |
| sentence_transformers | Text | Pretrained embeddings + classifier |
| shuffled_labels | Text | Train on permuted labels (overfitting check) |
| random_direction | Activation | Project onto a random unit vector |
| pca | Activation | Top principal components |
| layer_0 | Activation | Input embeddings only |
| perplexity | Activation | Model's own token log-probabilities |

Interpreting results

| Situation | What it means |
|---|---|
| Probe ≈ random baseline | Probe learned nothing; check data quality or layer choice |
| Probe ≈ BOW/TF-IDF | Probe may be picking up lexical features also recoverable from the raw text |
| Probe ≈ sentence_transformers | Signal may be general semantics, not model-specific |
| Probe ≈ layer_0 | Signal is in token identity/position, not deep representations |
| Probe ≈ random_direction | Any direction in activation space works; the learned direction is not special |
| Probe >> all baselines | Strong evidence of a model-internal signal |
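
One way to operationalize the table above is a simple margin check: count the probe as evidence of a model-internal signal only if it beats every baseline by some margin. A hypothetical helper (the 5-point margin is an arbitrary choice, not an lmprobe default):

```python
def beats_all_baselines(probe_acc, baseline_accs, margin=0.05):
    """Return (verdict, strongest_baseline).

    baseline_accs: dict mapping baseline name -> accuracy.
    The probe "wins" only if it exceeds the strongest baseline by `margin`.
    """
    strongest = max(baseline_accs, key=baseline_accs.get)
    verdict = probe_acc >= baseline_accs[strongest] + margin
    return verdict, strongest

accs = {"bow": 0.68, "tfidf": 0.75, "random_direction": 0.62}
print(beats_all_baselines(0.91, accs))  # (True, 'tfidf')
```

If the verdict is False, the rows above suggest which confound to investigate: the name of the strongest baseline tells you where the probe's apparent signal may actually live.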