# lmprobe Dataset Format Specification
Format version: 2.0
This document describes the lmprobe dataset format: a self-describing, sharded storage layout for language model activation data. The format is designed so that a researcher with no knowledge of lmprobe can download a single parquet file, understand the dataset, and write their own loading code.
## How the format works
An lmprobe dataset consists of two parts:
- A parquet index — a small file with one row per prompt. Contains the text, labels, and addressing columns that locate each prompt's tensor data within the shards.
- Tensor shards — safetensors files containing the actual activation vectors. Split by layer, then by prompt groups, and optimized for common access patterns.
The parquet index is the entry point. Everything a researcher needs to understand the dataset (what model produced it, what layers are stored, how to construct shard filenames) is embedded in the parquet's schema metadata. The tensor shards are bulk data; their structure is fully described by the index.
## The parquet index

### Core columns (always present)

| Column | Type | Description |
|---|---|---|
| text | string | The prompt text |
| label | int32 or string | Class label (may be null for unlabeled data) |
| num_tokens | int32 | Token count for this prompt |
| shard_index | int32 | Which shard file holds this prompt's tensor |
| row_offset | int32 | Row position within that shard |
The universal contract: shard_index and row_offset always resolve to the last-token hidden-state vector for a prompt. This is true regardless of whether the dataset stores pooled or full-sequence activations. These two columns are the only addressing a researcher needs for the most common workflow: extracting one activation vector per prompt per layer.
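As a quick orientation, here is a minimal sketch of index-only usage: it loads the parquet file from the layout shown later in this document and inspects the addressing columns (the label filter at the end is purely illustrative).

```python
import pyarrow.parquet as pq

# One row per prompt; the addressing columns locate each prompt's last-token vector
df = pq.read_table("index/train-00000-of-00001.parquet").to_pandas()

print(df[["text", "label", "num_tokens", "shard_index", "row_offset"]].head())

# Index-only work, such as picking a labeled subset, needs no tensor downloads
positives = df[df["label"] == 1]
```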
### Full-sequence columns (when full-sequence data is stored)

| Column | Type | Description |
|---|---|---|
| token_offset | int64 | Legacy offset (equal to row_offset for compatibility) |
| token_shard_ids | list<int64> | Per-token shard index; one entry per token in the prompt |
| token_shard_offsets | list<int64> | Per-token row offset within each shard |
To retrieve the activation for token i of a prompt, use token_shard_ids[i] and token_shard_offsets[i] in place of shard_index and row_offset. The universal shard_index/row_offset columns still point to the last token, as always.
### Logits columns (when logits are stored)

| Column | Type | Description |
|---|---|---|
| shard_index_logits | int32 | Shard index for logits data |
| row_offset_logits | int32 | Row offset within the logits shard |
Logits are stored in separate shard files from hidden states, so they have their own addressing columns.
### Optional columns

Perplexity (when the publisher computed perplexity):

| Column | Type | Description |
|---|---|---|
| perplexity_mean | float64 | Mean per-token perplexity for the prompt |
| perplexity_min | float64 | Minimum per-token perplexity |
| perplexity_max | float64 | Maximum per-token perplexity |
Token-level data (when the publisher included decoded tokens):
| Column | Type | Description |
|---|---|---|
| token_ids | list<int64> | Token IDs for the prompt |
| token_perplexity | list<float64> | Per-token perplexity values |
| token_strings | list<string> | Decoded token strings |
User metadata: Any additional columns the publisher attached at push time (dataset name, category, source, etc.) appear as extra columns with auto-inferred types.
## Schema metadata
All dataset-level metadata is embedded in the parquet file's schema metadata under keys prefixed with lmprobe:. Each key stores a JSON-encoded value.
To read it:
```python
import json

import pyarrow.parquet as pq

table = pq.read_table("index/train-00000-of-00001.parquet")
info = {
    k.decode().removeprefix("lmprobe:"): json.loads(v)
    for k, v in table.schema.metadata.items()
    if k.decode().startswith("lmprobe:")
}
```
This produces a dict with the following top-level keys:
### format_version
String. Currently "2.0".
### model
The source model.
```json
{
  "name": "meta-llama/Llama-3.1-8B-Instruct",
  "revision": "5206a32e0bd3067aef1ce90f5528ade7d866253f"
}
```
name is the HuggingFace model identifier. revision is the exact commit hash, pinning the model weights that produced this data.
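The revision is directly usable for reproduction. As a hedged sketch (the format itself does not require this), the pinned name and revision can be passed to transformers to reload the exact weights:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Values taken from the lmprobe:model metadata entry shown above;
# passing revision pins the exact weights that produced the stored activations.
name = "meta-llama/Llama-3.1-8B-Instruct"
revision = "5206a32e0bd3067aef1ce90f5528ade7d866253f"

tokenizer = AutoTokenizer.from_pretrained(name, revision=revision)
model = AutoModelForCausalLM.from_pretrained(name, revision=revision)
```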
### num_prompts
Integer. Total number of prompts in the dataset.
### prompt_ordering
String. How prompts are ordered in the index. Typically "random".
### tensors
The core of the metadata. Describes what tensor data exists and how to find it. See Tensor descriptors below.
### provenance
Build environment details.
```json
{
  "lmprobe_version": "0.8.9",
  "extraction_backend": "local",
  "nnsight_version": null,
  "torch_version": "2.5.1",
  "transformers_version": "4.46.3",
  "python_version": "3.11.10",
  "created_at": "2026-03-25T12:00:00+00:00"
}
```
## Tensor descriptors
The tensors dict contains one entry per tensor type. The two tensor types are hidden_layers (hidden-state activations) and logits_topk (top-k logit values and indices).
### Hidden layers: pooled storage
The common case. One vector per prompt per layer, using last-token pooling.
```json
{
  "hidden_layers": {
    "type": "hidden",
    "layers": [0, 1, 2, "...", 31],
    "dim": 4096,
    "dtype": "float32",
    "layout": "per_layer",
    "file_pattern": "tensors/hidden_layer{layer:03d}_shard{shard:03d}.safetensors",
    "key_pattern": "hidden.layer_{layer}",
    "storage": "pooled",
    "pooling": "last_token",
    "row_bytes": 16384,
    "shards": [
      {"num_prompts": 500}
    ]
  }
}
```
| Field | Description |
|---|---|
| layers | List of layer indices stored |
| dim | Hidden dimension per vector |
| dtype | Tensor dtype ("float32", "float16", etc.) |
| layout | Always "per_layer" — each layer is stored in its own set of shard files |
| file_pattern | Python format string for constructing shard filenames |
| key_pattern | Python format string for the tensor key inside each safetensors file |
| storage | "pooled" — one vector per prompt |
| pooling | "last_token" — the vector is the activation at the final token position |
| row_bytes | Byte size of one row (= dim × dtype size) |
| shards | List of shard descriptors, ordered by shard index |
Each entry in shards describes one shard file. num_prompts is the number of prompts (rows) in that shard. Shard indices are implicit from list position: the first entry is shard 0, the second is shard 1, and so on.
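As an illustration, the descriptor can be expanded into concrete filenames and rough per-shard sizes. This sketch assumes the pooled descriptor above has been parsed into a dict named hidden, as in the resolution examples later in this document:

```python
# Expand the pooled descriptor into shard filenames for one layer,
# with a rough size estimate per file (row_bytes * rows, ignoring file headers).
layer = 16
for shard_idx, shard in enumerate(hidden["shards"]):
    filename = hidden["file_pattern"].format(layer=layer, shard=shard_idx)
    approx_mb = shard["num_prompts"] * hidden["row_bytes"] / 1e6
    print(f"{filename}: {shard['num_prompts']} rows, ~{approx_mb:.1f} MB")
```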
### Hidden layers: full-sequence storage
All tokens stored, with dedicated last-token shards for fast access to the common case.
```json
{
  "hidden_layers": {
    "type": "hidden",
    "layers": [0, 1, 2, "...", 125],
    "dim": 16384,
    "dtype": "float32",
    "layout": "per_layer",
    "file_pattern": "tensors/hidden_layer{layer:03d}_shard{shard:03d}.safetensors",
    "key_pattern": "hidden.layer_{layer}",
    "storage": "full_sequence",
    "last_token_shards": 2,
    "shards": [
      {"num_prompts": 4000, "num_tokens": 4000},
      {"num_prompts": 3660, "num_tokens": 3660},
      {"num_prompts": 4000, "num_tokens": 285432},
      {"num_prompts": 3660, "num_tokens": 261100}
    ]
  }
}
```
The critical difference is last_token_shards. This integer tells you that shards 0 through last_token_shards - 1 are last-token shards. They contain exactly one vector per prompt (the last-token activation), just like a pooled dataset. The universal shard_index/row_offset columns in the parquet always point into these shards.
Shards from index last_token_shards onward are sequence shards. They contain the remaining tokens in variable-length chunks. These are addressed by the token_shard_ids/token_shard_offsets list columns in the parquet.
In last-token shards, num_tokens always equals num_prompts (one token per prompt). In sequence shards, num_tokens reflects the total token count across all prompts packed into that shard.
This design means a researcher doing last-token work downloads only the small last-token shards and never touches the bulk sequence data.
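A download step can act on last_token_shards directly. The sketch below assumes the full-sequence descriptor above is parsed into a dict named hidden and uses huggingface_hub to fetch files; the repository ID is a placeholder, and any download mechanism works:

```python
from huggingface_hub import hf_hub_download

repo_id = "some-org/some-lmprobe-dataset"  # placeholder
layer = 16

# For last-token work, only shards 0 .. last_token_shards - 1 are needed per layer.
for shard_idx in range(hidden["last_token_shards"]):
    filename = hidden["file_pattern"].format(layer=layer, shard=shard_idx)
    local_path = hf_hub_download(repo_id, filename, repo_type="dataset")
```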
### Logits (top-k)
```json
{
  "logits_topk": {
    "type": "logits_topk",
    "k": 50,
    "dtype": "float32",
    "pooling": "last_token",
    "file_pattern": "tensors/logits_topk_{shard:03d}.safetensors",
    "row_bytes": 600,
    "shards": [
      {"file": "tensors/logits_topk_000.safetensors", "num_prompts": 500}
    ]
  }
}
```
Logits shards store both values and indices for the top-k tokens at the last token position. Addressed via the shard_index_logits/row_offset_logits parquet columns.
## Resolving a prompt to a tensor

### Last-token hidden state (the common path)
```python
import json

import pyarrow.parquet as pq
from safetensors import safe_open

# 1. Load the index
table = pq.read_table("index/train-00000-of-00001.parquet")
df = table.to_pandas()

# 2. Read the tensor descriptor
meta = table.schema.metadata
tensors = json.loads(meta[b"lmprobe:tensors"])
hidden = tensors["hidden_layers"]

# 3. Pick a prompt and a layer
row = df.iloc[42]
layer = 16

# 4. Construct the filename and key
filename = hidden["file_pattern"].format(layer=layer, shard=row["shard_index"])
key = hidden["key_pattern"].format(layer=layer)

# 5. Read the vector
with safe_open(filename, framework="pt") as f:
    vector = f.get_tensor(key)[row["row_offset"]]

# vector has shape (hidden_dim,): the last-token activation for prompt 42 at layer 16
```
This works identically for pooled and full-sequence datasets. The shard_index/row_offset contract guarantees it.
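When many prompts are needed, it is usually worth grouping rows by shard so each shard file is opened only once. A sketch of that pattern, reusing df, hidden, and safe_open from the example above:

```python
# Collect last-token vectors for every prompt at one layer,
# opening each shard file a single time.
layer = 16
key = hidden["key_pattern"].format(layer=layer)
vectors = {}  # df index -> activation vector

for shard_idx, group in df.groupby("shard_index"):
    filename = hidden["file_pattern"].format(layer=layer, shard=shard_idx)
    with safe_open(filename, framework="pt") as f:
        tensor = f.get_tensor(key)  # shape (rows_in_shard, dim)
        for idx, offset in zip(group.index, group["row_offset"]):
            vectors[idx] = tensor[offset]
```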
### Per-token activation (full-sequence only)
```python
# For token i of prompt 42:
row = df.iloc[42]
token_idx = 5
layer = 16

shard = row["token_shard_ids"][token_idx]
offset = row["token_shard_offsets"][token_idx]

filename = hidden["file_pattern"].format(layer=layer, shard=shard)
key = hidden["key_pattern"].format(layer=layer)

with safe_open(filename, framework="pt") as f:
    vector = f.get_tensor(key)[offset]
```
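Reconstructing the whole sequence for one prompt is just a walk over the two list columns. A minimal sketch (it opens a shard file per token; grouping tokens by shard would be faster):

```python
import torch

# Stack the per-token activations for prompt 42 at one layer.
row = df.iloc[42]
layer = 16
key = hidden["key_pattern"].format(layer=layer)

token_vectors = []
for shard, offset in zip(row["token_shard_ids"], row["token_shard_offsets"]):
    filename = hidden["file_pattern"].format(layer=layer, shard=shard)
    with safe_open(filename, framework="pt") as f:
        token_vectors.append(f.get_tensor(key)[offset])

sequence = torch.stack(token_vectors)  # shape (num_tokens, dim)
```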
### Logits
```python
logits_desc = tensors["logits_topk"]
row = df.iloc[42]

filename = logits_desc["file_pattern"].format(shard=row["shard_index_logits"])

with safe_open(filename, framework="pt") as f:
    # Contains top-k values and indices at the last token position
    values = f.get_tensor("topk_values")[row["row_offset_logits"]]
    indices = f.get_tensor("topk_indices")[row["row_offset_logits"]]
```
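To see which tokens the top-k entries correspond to, the indices can be decoded with the source model's tokenizer. A hedged sketch, reusing meta, values, and indices from the example above:

```python
from transformers import AutoTokenizer

# Decode the top-k indices using the tokenizer of the model named in the metadata.
model_meta = json.loads(meta[b"lmprobe:model"])
tokenizer = AutoTokenizer.from_pretrained(model_meta["name"], revision=model_meta["revision"])

for token, value in zip(tokenizer.convert_ids_to_tokens(indices.tolist()), values.tolist()):
    print(f"{token!r}: {value:.3f}")
```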
## File layout on disk
A typical dataset on HuggingFace:
```
repo/
├── index/
│   └── train-00000-of-00001.parquet
├── tensors/
│   ├── hidden_layer000_shard000.safetensors
│   ├── hidden_layer000_shard001.safetensors
│   ├── hidden_layer001_shard000.safetensors
│   ├── ...
│   └── logits_topk_000.safetensors
└── README.md
```
Shard filenames are fully determined by the file_pattern in the tensor descriptor. No naming convention is assumed. The pattern is the contract.
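One consequence is that the complete expected file list can be generated from the descriptors alone. A sketch of a completeness check for the hidden-state shards, run from the root of a local copy (in real metadata the layers list contains every layer index; the "..." in the examples above is only an abbreviation):

```python
import os

# Every hidden-state shard file the descriptor promises.
expected = [
    hidden["file_pattern"].format(layer=layer, shard=shard_idx)
    for layer in hidden["layers"]
    for shard_idx in range(len(hidden["shards"]))
]
missing = [path for path in expected if not os.path.exists(path)]
print(f"{len(expected)} expected files, {len(missing)} missing")
```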
## Design rationale
Why shards are split by layer. Nearly all interpretability workflows (probes, autoencoders, steering vectors, etc.) operate on one layer at a time. Per-layer sharding means a researcher downloads only the layers they need.
Why last-token shards exist. Last-token pooling is the dominant access pattern for probes and classifiers. For full-sequence datasets, dedicating small shards to last-token vectors avoids downloading gigabytes of per-token data for the most common workflow.
Why shard_index/row_offset always means last-token. A single, universal addressing contract means df.head() is immediately legible. A researcher seeing these columns knows what they resolve to without checking which storage mode the dataset uses. Full-sequence and per-token addressing is available through additional columns, but the default path is always simple.
Why metadata is embedded in the parquet. A researcher downloads one file and has everything: the data, the schema, and the full description of the tensor layout. No sidecar files, no separate config downloads, no implicit conventions.
## Version history
2.0 (current): Unified shard_index/row_offset contract. Explicit file_pattern and key_pattern in tensor descriptors. Embedded schema metadata with lmprobe: prefix.
1.x (before-times): Not documented, never used by anyone important.