Caching¶
Activation extraction is expensive. A single forward pass through a large model can take seconds. lmprobe caches activations automatically so repeated calls with the same prompts are fast.
Default behavior¶
Caching is always enabled. Activations are stored at ~/.cache/lmprobe/ by default. The location can be overridden with the LMPROBE_CACHE_DIR environment variable:
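For example, to relocate the cache to a faster disk (the path below is illustrative):

```shell
# Point lmprobe's cache at a different directory before importing the library
export LMPROBE_CACHE_DIR=/mnt/fast-ssd/lmprobe
```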
Inspecting the cache¶
```python
from lmprobe import cache_info

info = cache_info()
print(info)
# CacheInfo(total_size_gb=3.42, models=[...])
```
Reducing disk usage¶
Store activations in float16 instead of float32 for a 2× reduction in disk usage with negligible accuracy impact. The storage dtype is controlled by set_cache_dtype() or the LMPROBE_CACHE_DTYPE environment variable.
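The 2× figure follows directly from element width: float16 stores each value in two bytes instead of four. A quick NumPy check (illustrative only, not lmprobe code):

```python
import numpy as np

# A stand-in activation tensor: batch of 8 prompts, 4096-dim hidden states
acts32 = np.zeros((8, 4096), dtype=np.float32)
acts16 = acts32.astype(np.float16)

print(acts32.nbytes)  # 131072 bytes
print(acts16.nbytes)  # 65536 bytes, exactly half
```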
LRU eviction¶
Set a maximum cache size with set_cache_limit() or the LMPROBE_CACHE_MAX_GB environment variable. When the limit is exceeded, least-recently-used entries are evicted.
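set_cache_limit() is named in the environment-variables section below; assuming it takes the maximum size in gigabytes (mirroring LMPROBE_CACHE_MAX_GB; the exact signature may differ), capping the cache at 100 GB looks like:

```python
from lmprobe import set_cache_limit

# Cap the cache at 100 GB; least-recently-used entries are
# evicted once the total size goes over the limit.
set_cache_limit(100)
```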
S3 backend¶
Store activations in S3 for cross-machine sharing or building large datasets. Requires `pip install lmprobe[s3]`.
S3 is for datasets, not ephemeral caching
The S3 backend is designed for building and sharing large activation datasets: pre-extracting activations for thousands of prompts across machines. It is not intended as a drop-in replacement for the local cache for short-lived work.
Configure AWS credentials via the standard environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION) or an IAM role.
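set_cache_backend() is named in the environment-variables section below; assuming it accepts the same URI form as LMPROBE_CACHE_BACKEND, pointing the cache at a bucket looks like:

```python
from lmprobe import set_cache_backend

# Bucket name and prefix are placeholders; substitute your own.
set_cache_backend("s3://my-bucket/prefix")
```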
Warmup¶
Pre-extract and cache activations before running predictions. Useful when you want to front-load extraction work:
```python
probe.warmup(test_prompts, batch_size=16)

# Subsequent calls hit the cache
predictions = probe.predict(test_prompts)
```
Cache logging¶
Enable verbose logging to see cache hits and misses via the LMPROBE_CACHE_DEBUG environment variable.
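Using the LMPROBE_CACHE_DEBUG variable from the table below:

```shell
# Log a line for every cache hit and miss
export LMPROBE_CACHE_DEBUG=1
```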
Evicting specific entries¶
Manually trigger LRU eviction when you've set a cache limit.
This is decoupled from writes for performance. Call it at natural boundaries — after a batch of extractions, at session end, or on a schedule.
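As a mental model only (a minimal sketch, not lmprobe's implementation), a decoupled LRU sweep walks the cache directory, sorts entries by last access time, and deletes the oldest until the total size is back under the limit:

```python
from pathlib import Path

def lru_sweep(cache_dir: str, max_bytes: int) -> list[str]:
    """Evict least-recently-used cache files until total size <= max_bytes."""
    files = [p for p in Path(cache_dir).rglob("*") if p.is_file()]
    files.sort(key=lambda p: p.stat().st_atime)  # oldest access first
    total = sum(p.stat().st_size for p in files)
    evicted = []
    for path in files:
        if total <= max_bytes:
            break
        total -= path.stat().st_size
        path.unlink()
        evicted.append(path.name)
    return evicted
```

Note the sketch keys eviction on access time (`st_atime`), which is unreliable on filesystems mounted with `noatime`; a real implementation would track recency in its own index.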
Cache introspection¶
Check what's cached for a specific model and prompt:
```python
from lmprobe import discover_cached

info = discover_cached("meta-llama/Llama-3.1-8B-Instruct", "Who wants to go for a walk?")
if info is not None:
    print(info.raw_layers)      # [0, 1, ..., 31]
    print(info.pooled)          # {"last_token": [0, 1, ..., 31]}
    print(info.has_perplexity)  # True
    print(info.has_logits)      # False
```
Returns None if nothing is cached for that combination.
Clearing the cache¶
Warning
clear_cache() deletes all cached activations for all models. This is irreversible.
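If you do want to start fresh (assuming clear_cache is importable from the top-level package, like the other helpers on this page):

```python
from lmprobe import clear_cache

# Irreversibly deletes every cached activation for every model
clear_cache()
```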
Environment variables¶
All cache settings can be configured via environment variables, useful for CI/CD or containerized deployments:
| Variable | Description | Example |
|---|---|---|
| `LMPROBE_CACHE_DIR` | Cache directory (default: `~/.cache/lmprobe/`) | `/mnt/fast-ssd/lmprobe` |
| `LMPROBE_CACHE_MAX_GB` | Max cache size in GB (LRU eviction) | `100` |
| `LMPROBE_CACHE_DTYPE` | Storage dtype | `float16` |
| `LMPROBE_CACHE_BACKEND` | Cache backend URI | `s3://my-bucket/prefix` |
| `LMPROBE_CACHE_DEBUG` | Enable verbose cache logging | `1` or `debug` |
Environment variables are read at import time and can be overridden programmatically via set_cache_limit(), set_cache_dtype(), and set_cache_backend().
Cache format¶
Activations are stored in safetensors format (v2), keyed per prompt, per model, per layer. The key is a hash of the prompt text and model ID. Older .pt format caches (v1) are still readable for backwards compatibility.
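The exact key derivation is an internal detail; conceptually it behaves like a content hash over the model ID and prompt text, so the same pair always maps to the same entry. A sketch using SHA-256 as a stand-in:

```python
import hashlib

def cache_key(model_id: str, prompt: str) -> str:
    """Illustrative key: deterministic hash of (model ID, prompt text)."""
    h = hashlib.sha256()
    h.update(model_id.encode("utf-8"))
    h.update(b"\x00")  # separator, so ("ab", "c") != ("a", "bc")
    h.update(prompt.encode("utf-8"))
    return h.hexdigest()
```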