
Baselines

Always compare your probe against baselines. A probe that doesn't beat a bag-of-words classifier may be learning surface features, not model-internal representations.


Text-only baselines

from lmprobe import BaselineProbe

# Bag-of-words
bow = BaselineProbe(method="bow", classifier="logistic_regression")
bow.fit(positive_prompts, negative_prompts)
bow_acc = bow.score(test_prompts, test_labels)

# TF-IDF
tfidf = BaselineProbe(method="tfidf")
tfidf.fit(positive_prompts, negative_prompts)

# Sentence length (surprisingly predictive for some tasks)
length = BaselineProbe(method="sentence_length")
length.fit(positive_prompts, negative_prompts)

# Sentence-transformers embeddings (requires: pip install lmprobe[embeddings])
st = BaselineProbe(method="sentence_transformers")
st.fit(positive_prompts, negative_prompts)

# Sanity checks
random_baseline = BaselineProbe(method="random")    # should be ~50%
majority_baseline = BaselineProbe(method="majority")
shuffled_baseline = BaselineProbe(method="shuffled_labels")  # overfitting check
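
To make concrete what the sanity-check baselines measure, here is a minimal pure-Python sketch of the random and majority predictors (illustrative only, not the lmprobe implementation):

```python
import random

def majority_baseline(train_labels):
    """Always predict the most common training label."""
    majority = max(set(train_labels), key=train_labels.count)
    return lambda n: [majority] * n

def random_baseline(rng, n):
    """Uniform random predictions; accuracy should hover near 50%."""
    return [rng.choice([0, 1]) for _ in range(n)]

# Majority class in the training data is 1 (three of four examples)
train_labels = [1, 1, 1, 0]
test_labels = [1, 0, 1, 1]
predict = majority_baseline(train_labels)
preds = predict(len(test_labels))
accuracy = sum(p == y for p, y in zip(preds, test_labels)) / len(test_labels)
print(accuracy)  # 0.75: the constant prediction "1" matches 3 of 4 test labels
```

If your probe only matches these numbers, it has learned nothing beyond the label distribution.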

Activation-based baselines

These use the same model activations as your probe but with simpler classification approaches. If your probe doesn't beat them, the learned direction may not be special:

from lmprobe import ActivationBaseline

# Random direction — project onto random unit vector
random_dir = ActivationBaseline(
    method="random_direction",
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=-1,
)
random_dir.fit(positive_prompts, negative_prompts)

# PCA — classify using top principal components
pca = ActivationBaseline(
    method="pca",
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=-1,
)

# Layer 0 — input embeddings only (no deep processing)
layer0 = ActivationBaseline(
    method="layer_0",
    model="meta-llama/Llama-3.1-8B-Instruct",
)

# Perplexity — model's own token probabilities (uses BaselineProbe, not ActivationBaseline)
perplexity = BaselineProbe(
    method="perplexity",
    model="meta-llama/Llama-3.1-8B-Instruct",
)
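
Conceptually, the random_direction baseline projects each activation vector onto a fixed random unit vector and fits the simplest possible 1-D classifier on the projections. A numpy sketch of that idea, using synthetic stand-in activations (illustrative only; lmprobe's internals may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "activations": 50 positive and 50 negative samples in a
# 16-dimensional space, separated along every coordinate for simplicity.
pos = rng.normal(loc=1.0, size=(50, 16))
neg = rng.normal(loc=-1.0, size=(50, 16))
X = np.vstack([pos, neg])
y = np.array([1] * 50 + [0] * 50)

# Project every activation onto a fixed random unit vector.
direction = rng.normal(size=16)
direction /= np.linalg.norm(direction)
scores = X @ direction

# Simplest 1-D classifier: threshold at the midpoint of the class means,
# oriented so the positive class sits on the high side.
midpoint = (scores[y == 1].mean() + scores[y == 0].mean()) / 2
sign = np.sign(scores[y == 1].mean() - scores[y == 0].mean())
preds = (sign * (scores - midpoint) > 0).astype(int)
print((preds == y).mean())  # how much a *random* direction already recovers
```

If a random direction scores nearly as well as your trained probe, the class signal is spread broadly across activation space rather than concentrated in one special direction.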

Baseline battery

Run all applicable baselines at once and get a ranked comparison:

from lmprobe import BaselineBattery

# Text-only (no model required)
battery = BaselineBattery(model=None, random_state=42)
results = battery.fit(positive_prompts, negative_prompts, test_prompts, test_labels)

print(results.summary())
# Baseline Results:
# ------------------------------------------------------------
#   sentence_transformers          0.7925  (fit: 1.23s, predict: 0.05s)
#   tfidf                          0.7547  (fit: 0.01s, predict: 0.00s)
#   bow                            0.6792  (fit: 0.01s, predict: 0.00s)
#   majority                       0.5556  (fit: 0.00s, predict: 0.00s)
#   random                         0.4906  (fit: 0.00s, predict: 0.00s)

best = results.get_best()[0]
print(f"Best baseline: {best.name} ({best.score:.2%})")

# With activation baselines (requires model)
battery = BaselineBattery(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=-1,
    include=["bow", "tfidf", "random_direction", "pca"],
)
results = battery.fit(positive_prompts, negative_prompts, test_prompts, test_labels)
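
Under the hood, a battery is essentially "fit each baseline, time it, score it, rank by score". A generic sketch of that loop, using hypothetical duck-typed baselines with fit/score methods (not the lmprobe internals):

```python
import time

def run_battery(baselines, fit_args, score_args):
    """baselines: dict of name -> object exposing .fit(...) and .score(...).
    Returns (name, score, fit_seconds) tuples, best score first."""
    results = []
    for name, baseline in baselines.items():
        start = time.perf_counter()
        baseline.fit(*fit_args)
        fit_seconds = time.perf_counter() - start
        results.append((name, baseline.score(*score_args), fit_seconds))
    return sorted(results, key=lambda r: r[1], reverse=True)

class FixedScore:
    """Stand-in baseline that always reports the same accuracy."""
    def __init__(self, acc):
        self.acc = acc
    def fit(self, *args):
        pass
    def score(self, *args):
        return self.acc

ranking = run_battery({"bow": FixedScore(0.68), "tfidf": FixedScore(0.75)}, (), ())
print([name for name, _, _ in ranking])  # ['tfidf', 'bow']
```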

All baseline methods

| Method | Type | Description |
|---|---|---|
| bow | Text | Bag-of-words + classifier |
| tfidf | Text | TF-IDF + classifier |
| random | Text | Random predictions (sanity check) |
| majority | Text | Always predict the majority class |
| sentence_length | Text | Classify by text length |
| sentence_transformers | Text | Pretrained embeddings + classifier |
| shuffled_labels | Text | Train on permuted labels (overfitting check) |
| random_direction | Activation | Project onto a random unit vector |
| pca | Activation | Top principal components |
| layer_0 | Activation | Input embeddings only |
| perplexity | Activation | Model's own token log-probabilities |

Interpreting results

| Situation | What it means |
|---|---|
| Probe ≈ random baseline | Probe learned nothing; check data quality or layer choice |
| Probe ≈ BOW/TF-IDF | Probe may be picking up lexical features also recoverable from the raw text |
| Probe ≈ sentence_transformers | Signal may be general semantics, not model-specific |
| Probe ≈ layer_0 | Signal is in token identity/position, not deep representations |
| Probe ≈ random_direction | Any direction in activation space works; the learned direction is not special |
| Probe >> all baselines | Strong evidence of a model-internal signal |
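
One way to operationalize the table above is a simple margin check: count the probe as evidence of a model-internal signal only if it beats every baseline by some margin. A hypothetical helper (the 5-point margin is an arbitrary choice, not an lmprobe default):

```python
def beats_all_baselines(probe_acc, baseline_accs, margin=0.05):
    """Return (verdict, strongest_baseline).

    baseline_accs: dict mapping baseline name -> accuracy.
    The probe "wins" only if it exceeds the strongest baseline by `margin`.
    """
    strongest = max(baseline_accs, key=baseline_accs.get)
    verdict = probe_acc >= baseline_accs[strongest] + margin
    return verdict, strongest

accs = {"bow": 0.68, "tfidf": 0.75, "random_direction": 0.62}
print(beats_all_baselines(0.91, accs))  # (True, 'tfidf')
```

If the verdict is False, the rows above suggest which confound to investigate: the name of the strongest baseline tells you where the probe's apparent signal may actually live.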