Caching¶
Activation extraction is expensive. A single forward pass through a large model can take seconds. lmprobe caches activations automatically so repeated calls with the same prompts are fast.
Default behavior¶
Caching is always enabled. Activations are stored at ~/.cache/lmprobe/ by default. The location can be overridden with the LMPROBE_CACHE_DIR environment variable:
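For example, to relocate the cache to a faster disk (the path below is illustrative):

```shell
# Point lmprobe's cache at a different directory before importing the library
export LMPROBE_CACHE_DIR=/mnt/fast-ssd/lmprobe
```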
Inspecting the cache¶
```python
from lmprobe import cache_info

info = cache_info()
print(info)
# CacheInfo(total_size_gb=3.42, models=[...])
```
Reducing disk usage¶
Store activations in float16 instead of float32 for a 2× reduction in disk usage with negligible accuracy impact. The storage dtype is controlled by set_cache_dtype() or the LMPROBE_CACHE_DTYPE environment variable.
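The 2× figure follows directly from element width: float16 stores each value in two bytes instead of four. A quick NumPy check (illustrative only, not lmprobe code):

```python
import numpy as np

# A stand-in activation tensor: batch of 8 prompts, 4096-dim hidden states
acts32 = np.zeros((8, 4096), dtype=np.float32)
acts16 = acts32.astype(np.float16)

print(acts32.nbytes)  # 131072 bytes
print(acts16.nbytes)  # 65536 bytes, exactly half
```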
LRU eviction¶
Set a maximum cache size with set_cache_limit() or the LMPROBE_CACHE_MAX_GB environment variable. When the limit is exceeded, least-recently-used entries are evicted.
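set_cache_limit() is named in the environment-variables section below; assuming it takes the maximum size in gigabytes (mirroring LMPROBE_CACHE_MAX_GB; the exact signature may differ), capping the cache at 100 GB looks like:

```python
from lmprobe import set_cache_limit

# Cap the cache at 100 GB; least-recently-used entries are
# evicted once the total size goes over the limit.
set_cache_limit(100)
```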
S3 backend¶
Store activations in S3 for cross-machine sharing or building large datasets. Requires `pip install lmprobe[s3]`.
S3 is for datasets, not ephemeral caching
The S3 backend is designed for building and sharing large activation datasets: pre-extracting activations for thousands of prompts across machines. It is not intended as a drop-in replacement for the local cache for short-lived work.
Configure AWS credentials via the standard environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION) or an IAM role.
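set_cache_backend() is named in the environment-variables section below; assuming it accepts the same URI form as LMPROBE_CACHE_BACKEND, pointing the cache at a bucket looks like:

```python
from lmprobe import set_cache_backend

# Bucket name and prefix are placeholders; substitute your own.
set_cache_backend("s3://my-bucket/prefix")
```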
Warmup¶
Pre-extract and cache activations before running predictions. Useful when you want to front-load extraction work:
```python
probe.warmup(test_prompts, batch_size=16)

# Subsequent calls hit the cache
predictions = probe.predict(test_prompts)
```
Cache logging¶
Enable verbose logging to see cache hits and misses via the LMPROBE_CACHE_DEBUG environment variable.
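Using the LMPROBE_CACHE_DEBUG variable from the table below:

```shell
# Log a line for every cache hit and miss
export LMPROBE_CACHE_DEBUG=1
```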
Evicting specific entries¶
Manually trigger LRU eviction when you've set a cache limit.
This is decoupled from writes for performance. Call it at natural boundaries — after a batch of extractions, at session end, or on a schedule.
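As a mental model only (a minimal sketch, not lmprobe's implementation), a decoupled LRU sweep walks the cache directory, sorts entries by last access time, and deletes the oldest until the total size is back under the limit:

```python
from pathlib import Path

def lru_sweep(cache_dir: str, max_bytes: int) -> list[str]:
    """Evict least-recently-used cache files until total size <= max_bytes."""
    files = [p for p in Path(cache_dir).rglob("*") if p.is_file()]
    files.sort(key=lambda p: p.stat().st_atime)  # oldest access first
    total = sum(p.stat().st_size for p in files)
    evicted = []
    for path in files:
        if total <= max_bytes:
            break
        total -= path.stat().st_size
        path.unlink()
        evicted.append(path.name)
    return evicted
```

Note the sketch keys eviction on access time (`st_atime`), which is unreliable on filesystems mounted with `noatime`; a real implementation would track recency in its own index.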
Cache introspection¶
Check what's cached for a specific model and prompt:
```python
from lmprobe import discover_cached

info = discover_cached("meta-llama/Llama-3.1-8B-Instruct", "Who wants to go for a walk?")
if info is not None:
    print(info.raw_layers)      # [0, 1, ..., 31]
    print(info.pooled)          # {"last_token": [0, 1, ..., 31]}
    print(info.has_perplexity)  # True
    print(info.has_logits)      # False
```
Returns None if nothing is cached for that combination.
Clearing the cache¶
Warning
clear_cache() deletes all cached activations for all models. This is irreversible.
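If you do want to start fresh (assuming clear_cache is importable from the top-level package, like the other helpers on this page):

```python
from lmprobe import clear_cache

# Irreversibly deletes every cached activation for every model
clear_cache()
```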
Environment variables¶
All cache settings can be configured via environment variables, useful for CI/CD or containerized deployments:
| Variable | Description | Example |
|---|---|---|
| `LMPROBE_CACHE_DIR` | Cache directory (default: `~/.cache/lmprobe/`) | `/mnt/fast-ssd/lmprobe` |
| `LMPROBE_CACHE_MAX_GB` | Max cache size in GB (LRU eviction) | `100` |
| `LMPROBE_CACHE_DTYPE` | Storage dtype | `float16` |
| `LMPROBE_CACHE_BACKEND` | Cache backend URI | `s3://my-bucket/prefix` |
| `LMPROBE_CACHE_DEBUG` | Enable verbose cache logging | `1` or `debug` |
Environment variables are read at import time and can be overridden programmatically via set_cache_limit(), set_cache_dtype(), and set_cache_backend().
Cache format¶
Activations are stored in safetensors format (v2), keyed per prompt, per model, per layer. The key is a hash of the prompt text and model ID. Older .pt format caches (v1) are still readable for backwards compatibility.
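The exact key derivation is an internal detail; conceptually it behaves like a content hash over the model ID and prompt text, so the same pair always maps to the same entry. A sketch using SHA-256 as a stand-in:

```python
import hashlib

def cache_key(model_id: str, prompt: str) -> str:
    """Illustrative key: deterministic hash of (model ID, prompt text)."""
    h = hashlib.sha256()
    h.update(model_id.encode("utf-8"))
    h.update(b"\x00")  # separator, so ("ab", "c") != ("a", "bc")
    h.update(prompt.encode("utf-8"))
    return h.hexdigest()
```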