
Tutorial: Geometry of Truth Probes

This tutorial trains truthfulness probes on the Geometry of Truth datasets using pre-extracted activations from latent-lab/got-activations-qwen2.5-0.5b — no GPU or local model required.

Geometry of Truth (Marks & Tegmark, 2023) showed that language models represent truth as a linear feature in activation space, detectable with simple probes. We'll reproduce that result on Qwen2.5-0.5B and explore what the activations reveal.


The dataset

latent-lab/got-activations-qwen2.5-0.5b contains 7,600 true/false statements from six GoT categories, with full-sequence activations pre-extracted across all 24 layers of Qwen2.5-0.5B (896-dim, float32). Top-100 logits are also cached.

Category          N      Example
cities            1,486  "The city of Amman is in Jordan."
neg_cities        1,482  "The city of Omsk is not in Russia."
sp_en_trans         350  "The Spanish word 'rosa' means 'rose'."
neg_sp_en_trans     353  "The Spanish word 'miedo' does not mean 'fear'."
larger_than       1,966  "Ninety-nine is larger than eighty-five."
smaller_than      1,963  "Fifty-six is smaller than eighty-one."

Labels: 1 = true statement, 0 = false statement. The dataset is nearly balanced (~50/50).


Setup

pip install "lmprobe[hub]" pyarrow

No API keys required. No GPU required.


Step 1: Load the index

The Parquet index is tiny (~1 MB). Download it first to understand the dataset and filter prompts before touching any tensors:

from huggingface_hub import hf_hub_download
import pyarrow.parquet as pq

DATASET = "latent-lab/got-activations-qwen2.5-0.5b"

path = hf_hub_download(DATASET, "index/train-00000-of-00001.parquet", repo_type="dataset")
index = pq.read_table(path)

texts  = index["text"].to_pylist()
labels = index["label"].to_pylist()
cats   = index["category"].to_pylist()

print(f"{len(texts)} prompts  |  columns: {index.column_names}")
# 7600 prompts  |  columns: ['text', 'label', 'category', 'prompt_format', ...]
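With the index in memory, a quick tally confirms the category sizes and label balance. A minimal sketch using stand-in lists (in the tutorial, `cats` and `labels` come from the index loaded above):

```python
from collections import Counter

# Stand-in lists; in the tutorial these come from the Parquet index.
cats = ["cities", "cities", "neg_cities", "cities"]
labels = [1, 0, 1, 0]

print(Counter(cats))                               # prompts per category
print(f"share true: {sum(labels) / len(labels):.0%}")  # label balance
```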

Filter to a single category:

def filter_category(category):
    mask = [c == category for c in cats]
    return (
        [t for t, m in zip(texts, mask) if m],
        [l for l, m in zip(labels, mask) if m],
    )

cities_texts, cities_labels = filter_category("cities")
print(f"cities: {len(cities_texts)} prompts")
# cities: 1486 prompts

Step 2: Train a probe

Split into train/test, load activations from the Hugging Face Hub, and fit a probe. load_activations() downloads only the shards for the requested layer; subsequent runs hit the local cache.

import numpy as np
from sklearn.model_selection import train_test_split
from lmprobe import load_activations, Probe

train_texts, test_texts, train_labels, test_labels = train_test_split(
    cities_texts, cities_labels, test_size=0.2, random_state=42, stratify=cities_labels
)

# Load layer 9 activations for all cities prompts
all_cities = train_texts + test_texts
acts = load_activations(DATASET, layers=[9], prompts=all_cities)

# Split activations to match train/test
n_train = len(train_texts)
X_train, X_test = acts[9][:n_train], acts[9][n_train:]

probe = Probe(classifier="logistic_regression", random_state=42)
probe.fit_from_activations(X_train, train_labels)

accuracy = probe.score_from_activations(X_test, test_labels)
print(f"Accuracy: {accuracy:.1%}")
# Accuracy: 95.6%
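The code above splits texts first and re-concatenates before loading, which relies on load_activations() preserving prompt order. An equivalent pattern splits indices instead, keeping activations and labels aligned by construction. A sketch with stand-in arrays (shapes match the dataset's 896-dim activations; the data itself is random):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in activations and labels (random; 896-dim like the dataset).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 896))
y = np.array([0, 1] * 50)

# Split index positions once; reuse them for every array that shares row order.
idx = np.arange(len(y))
idx_train, idx_test = train_test_split(
    idx, test_size=0.2, random_state=42, stratify=y
)
X_train, X_test = X[idx_train], X[idx_test]
y_train, y_test = y[idx_train], y[idx_test]
print(X_train.shape, X_test.shape)  # (80, 896) (20, 896)
```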

Step 3: Find the best layer

Layer 9 wasn't hand-picked — it came from a sweep. Load all layers at once and iterate:

all_acts = load_activations(DATASET, prompts=all_cities)

scores = {}
for layer, X in all_acts.items():
    X_train, X_test = X[:n_train], X[n_train:]
    p = Probe(classifier="logistic_regression", random_state=42)
    p.fit_from_activations(X_train, train_labels)
    scores[layer] = p.score_from_activations(X_test, test_labels)

best_layer = max(scores, key=scores.get)
print(f"Best layer: {best_layer}  ({scores[best_layer]:.1%})")

Layer sweep — cities dataset

Signal emerges sharply around layer 7 and remains high through the middle layers — a pattern consistent with the original GoT paper.


Step 4: Compare classifiers

X_train, X_test = all_acts[9][:n_train], all_acts[9][n_train:]

for clf in ["logistic_regression", "ridge", "svm", "lda", "mass_mean"]:
    p = Probe(classifier=clf, random_state=42)
    p.fit_from_activations(X_train, train_labels)
    acc = p.score_from_activations(X_test, test_labels)
    print(f"  {clf:22s}  acc={acc:.1%}")

Classifier comparison

Mass-Mean underperforms

Mass-Mean Probing is the method highlighted in the original GoT paper, yet it performs dramatically worse than logistic regression here — roughly 70% vs 95%. The mean-difference direction isn't the optimal linear separator for Qwen2.5-0.5B's representation of truthfulness. Logistic regression, Ridge, and SVM all perform comparably, suggesting the signal is robust to classifier choice as long as the direction is learned from data.
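Why can the mean-difference direction lose to a learned one? When the class-conditional covariance is strongly correlated across dimensions, the best separating direction is not the line between the class means. A self-contained sketch on synthetic Gaussian data (all numbers invented, not taken from the dataset) reproduces a gap of the same flavor:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two classes whose means differ along one axis, but whose shared
# covariance is strongly correlated across the two dimensions.
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 0.0])
cov = np.array([[4.0, 3.8], [3.8, 4.0]])
X = np.vstack([
    rng.multivariate_normal(mu0, cov, size=2000),
    rng.multivariate_normal(mu1, cov, size=2000),
])
y = np.array([0] * 2000 + [1] * 2000)

# Mass-mean probe: project onto the difference of class means,
# threshold at the midpoint of the projected means.
d = X[y == 1].mean(0) - X[y == 0].mean(0)
proj = X @ d
thresh = (X[y == 1].mean(0) @ d + X[y == 0].mean(0) @ d) / 2
mm_acc = ((proj > thresh).astype(int) == y).mean()

# Logistic regression learns a direction that accounts for the covariance.
lr = LogisticRegression(max_iter=1000).fit(X, y)
lr_acc = lr.score(X, y)

# Roughly: mass-mean ~69%, logistic regression ~95% on this geometry.
print(f"mass-mean: {mm_acc:.1%}   logistic regression: {lr_acc:.1%}")
```

The point is geometric, not model-specific: the mean-difference direction ignores covariance, while a learned linear classifier can tilt away from it to find a lower-variance projection.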


Step 5: Explore all six categories

Each category encodes a different kind of factual knowledge. Do all of them have a detectable truth direction?

categories = ["cities", "neg_cities", "sp_en_trans", "neg_sp_en_trans",
              "larger_than", "smaller_than"]

for cat in categories:
    t, l = filter_category(cat)
    tr_t, te_t, tr_l, te_l = train_test_split(t, l, test_size=0.2, random_state=42, stratify=l)

    cat_acts = load_activations(DATASET, prompts=tr_t + te_t)
    n_tr = len(tr_t)

    best_acc, best_layer = 0, 0
    for layer, X in cat_acts.items():
        p = Probe(classifier="logistic_regression", random_state=42)
        p.fit_from_activations(X[:n_tr], tr_l)
        acc = p.score_from_activations(X[n_tr:], te_l)
        if acc > best_acc:
            best_acc, best_layer = acc, layer

    print(f"  {cat:20s}  best_layer={best_layer}  acc={best_acc:.1%}")

Per-category accuracy heatmap

The heatmap shows accuracy by layer for each category, with the best layer highlighted in green. Truth representations appear consistently in the middle layers across all six categories, though the exact peak varies.
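To build a heatmap like the one above, collect every layer's accuracy instead of keeping only the maximum. A sketch with stand-in numbers (values invented; in the loop above you would append each layer's accuracy per category):

```python
import numpy as np

# Stand-in accuracies: 2 categories x 3 layers (values invented).
acc_by_cat = {
    "cities":      [0.52, 0.93, 0.96],
    "larger_than": [1.00, 1.00, 1.00],
}
matrix = np.array(list(acc_by_cat.values()))  # shape: (n_categories, n_layers)
best_layers = matrix.argmax(axis=1)           # best layer index per category
print(best_layers)  # [2 0]
```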


Step 6: Pull all shards locally for fast iteration

After your first run, the layers you have already loaded are cached locally. If you plan to sweep all layers across all categories repeatedly, pre-download everything at once:

from lmprobe import pull_dataset

n = pull_dataset(DATASET)   # downloads all shards (~few GB)
print(f"Pulled {n} prompts — all subsequent load_activations() calls from local cache")

Key findings

Category          Best layer   Accuracy
cities                 9          95.6%
neg_cities            12          96.6%
sp_en_trans           10         100.0%
neg_sp_en_trans        9         100.0%
larger_than           10         100.0%
smaller_than           0         100.0%
Truth is linearly represented: logistic regression reaches >95% on cities at layer 9, consistent with GoT's core claim.
Numerical comparisons are trivially separable: larger_than and smaller_than hit 100% from layer 0 onward; the model represents numerical order very strongly.
Signal peaks in the middle layers: layers 7-17 are most informative for cities, while early layers already suffice for the numerical tasks.
Mass-Mean is notably weaker: ~70% vs ~95% for logistic regression on cities; the mean-difference direction is not the optimal separator for this model.
LR, Ridge, and SVM are comparable: all three land within ~1% of each other; the linear representation is robust to classifier choice.
No GPU needed: the full analysis runs on CPU using pre-extracted activations from HuggingFace.