Caching

Activation extraction is expensive. A single forward pass through a large model can take seconds. lmprobe caches activations automatically so repeated calls with the same prompts are fast.


Default behavior

Caching is always enabled. Activations are stored at ~/.cache/lmprobe/ by default. Override with an environment variable:

export LMPROBE_CACHE_DIR="/path/to/my/cache"
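The resolution order is: environment variable if set, otherwise the default. A minimal sketch of that logic (the helper name resolve_cache_dir is hypothetical, not part of the lmprobe API):

```python
import os
from pathlib import Path

def resolve_cache_dir() -> Path:
    """Return the cache directory: LMPROBE_CACHE_DIR if set, else the default."""
    override = os.environ.get("LMPROBE_CACHE_DIR")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".cache" / "lmprobe"
```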

Inspecting the cache

from lmprobe import cache_info

info = cache_info()
print(info)
# CacheInfo(total_size_gb=3.42, models=[...])

Reducing disk usage

Store activations in float16 instead of float32 for a 2× reduction in disk usage with negligible accuracy impact:

from lmprobe import set_cache_dtype

set_cache_dtype("float16")
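The 2× figure follows directly from element width: a float16 value occupies 2 bytes versus 4 for float32. A quick stdlib check, independent of lmprobe:

```python
import struct

# Pack one value in each dtype and compare the byte widths.
f32 = struct.pack("<f", 3.14159)   # IEEE 754 single precision: 4 bytes
f16 = struct.pack("<e", 3.14159)   # IEEE 754 half precision: 2 bytes
print(len(f32), len(f16))  # 4 2
```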

LRU eviction

Set a maximum cache size. When the limit is exceeded, least-recently-used entries are evicted:

from lmprobe import set_cache_limit

set_cache_limit(50)  # GB
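The eviction semantics can be sketched with an OrderedDict. This is an illustration of LRU behavior under a size limit, not lmprobe's actual implementation:

```python
from collections import OrderedDict

class LRUCache:
    """Evict least-recently-used entries once total size exceeds a limit."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.entries: OrderedDict[str, int] = OrderedDict()  # key -> size in bytes
        self.total = 0

    def put(self, key: str, size: int) -> None:
        if key in self.entries:
            self.total -= self.entries.pop(key)
        self.entries[key] = size
        self.total += size

    def get(self, key: str) -> None:
        self.entries.move_to_end(key)  # mark as recently used

    def evict(self) -> list[str]:
        """Remove oldest entries until under the limit; return evicted keys."""
        evicted = []
        while self.total > self.max_bytes:
            key, size = self.entries.popitem(last=False)  # oldest first
            self.total -= size
            evicted.append(key)
        return evicted
```

Note that eviction is a separate step from insertion, mirroring lmprobe's decoupling of evict() from writes.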

S3 backend

Store activations in S3 for cross-machine sharing or building large datasets. Requires pip install lmprobe[s3].

from lmprobe import set_cache_backend

set_cache_backend("s3://my-bucket/lmprobe-cache")

S3 is for datasets, not ephemeral caching

The S3 backend is designed for building and sharing large activation datasets: pre-extracting activations for thousands of prompts across machines. It is not intended as a drop-in replacement for the local cache for short-lived work.

Configure AWS credentials via the standard environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION) or an IAM role.


Warmup

Pre-extract and cache activations before running predictions. Useful when you want to front-load extraction work:

probe.warmup(test_prompts, batch_size=16)

# Subsequent calls hit the cache
predictions = probe.predict(test_prompts)

Cache logging

Enable verbose logging to see cache hits and misses:

from lmprobe import enable_cache_logging

enable_cache_logging()

Evicting specific entries

Manually trigger LRU eviction when you've set a cache limit:

from lmprobe import evict

evict()  # removes least-recently-used entries if over the size limit

This is decoupled from writes for performance. Call it at natural boundaries — after a batch of extractions, at session end, or on a schedule.


Cache introspection

Check what's cached for a specific model and prompt:

from lmprobe import discover_cached

info = discover_cached("meta-llama/Llama-3.1-8B-Instruct", "Who wants to go for a walk?")
if info is not None:
    print(info.raw_layers)           # [0, 1, ..., 31]
    print(info.pooled)               # {"last_token": [0, 1, ..., 31]}
    print(info.has_perplexity)       # True
    print(info.has_logits)           # False

Returns None if nothing is cached for that combination.


Clearing the cache

# Clear everything (irreversible)
from lmprobe.cache import clear_cache
clear_cache()

Warning

clear_cache() deletes all cached activations for all models. This is irreversible.


Environment variables

All cache settings can be configured via environment variables, which is useful for CI/CD or containerized deployments:

Variable               Description                                    Example
LMPROBE_CACHE_DIR      Cache directory (default: ~/.cache/lmprobe/)   /mnt/fast-ssd/lmprobe
LMPROBE_CACHE_MAX_GB   Max cache size in GB (LRU eviction)            100
LMPROBE_CACHE_DTYPE    Storage dtype                                  float16
LMPROBE_CACHE_BACKEND  Cache backend URI                              s3://my-bucket/prefix
LMPROBE_CACHE_DEBUG    Enable verbose cache logging                   1 or debug

Environment variables are read at import time and can be overridden programmatically via set_cache_limit(), set_cache_dtype(), and set_cache_backend().


Cache format

Activations are stored in safetensors format (v2), keyed per prompt, per model, per layer. The key is a hash of the prompt text and model ID. Older .pt format caches (v1) are still readable for backwards compatibility.
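The exact hash scheme is internal to lmprobe; a sketch of how such a key could be derived, assuming SHA-256 and a null-byte separator (both illustrative choices):

```python
import hashlib

def cache_key(model_id: str, prompt: str) -> str:
    """Derive a stable cache key from model ID and prompt text.

    SHA-256 over the two fields with a null-byte separator, so that
    different (model, prompt) pairs never collide even when their plain
    concatenations coincide.
    """
    payload = model_id.encode("utf-8") + b"\x00" + prompt.encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

The same (model, prompt) pair always maps to the same key, which is what makes repeated extraction calls cache hits.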