Large Model Inference

lmprobe provides two backends for probing models that don't fit in GPU VRAM. The chunked backend handles models that fit in CPU RAM, while the disk offload backend handles models that exceed even CPU RAM — up to hundreds of billions of parameters on a single GPU.

| Backend | Best for | Memory requirement |
|---|---|---|
| "local" | Models that fit in GPU VRAM | GPU VRAM > model size |
| "chunked" | Models that fit in CPU RAM | CPU RAM > model size |
| "disk_offload" | Models that exceed CPU RAM | Disk > model size |
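The selection rule in the table can be written as a small helper. This is a minimal sketch, not part of lmprobe: `choose_backend` and its parameters are hypothetical names, and it compares weight size only, ignoring the extra memory needed for activations.

```python
GiB = 1024 ** 3

def choose_backend(model_bytes: int, vram_bytes: int, ram_bytes: int, disk_bytes: int) -> str:
    """Pick the first backend whose memory tier can hold the model weights.

    Hypothetical helper: mirrors the table above, weights only.
    """
    if model_bytes < vram_bytes:
        return "local"          # whole model fits in GPU VRAM
    if model_bytes < ram_bytes:
        return "chunked"        # fits in CPU RAM, streamed through the GPU
    if model_bytes < disk_bytes:
        return "disk_offload"   # only fits on disk, loaded layer by layer
    raise ValueError("model does not fit on disk")

# Example machine: 24 GiB GPU, 256 GiB RAM, 2 TiB disk (illustrative numbers).
print(choose_backend(16 * GiB, 24 * GiB, 256 * GiB, 2048 * GiB))
print(choose_backend(140 * GiB, 24 * GiB, 256 * GiB, 2048 * GiB))
print(choose_backend(642 * GiB, 24 * GiB, 256 * GiB, 2048 * GiB))
```

In practice, leave some headroom below each limit, since extraction also needs room for activations and buffers.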

Chunked backend

The chunked backend loads the full model on CPU, then streams layers through the GPU in chunks. Use it for models like Llama-3.1-70B or Mixtral-8x22B that fit in CPU RAM but not GPU VRAM.

from lmprobe import Probe

probe = Probe(
    model="mistralai/Mixtral-8x22B-v0.1",
    layers=16,
    backend="chunked",          # load on CPU, chunk through GPU
    dtype="bfloat16",           # bf16 recommended for large models
)

# positive_prompts and negative_prompts are lists of strings you supply
probe.fit(positive_prompts, negative_prompts)

The chunk size is estimated automatically from available VRAM. Override it if needed:

probe = Probe(
    model="mistralai/Mixtral-8x22B-v0.1",
    layers=16,
    backend="chunked",
    dtype="bfloat16",
    chunk_size=4,               # 4 layers at a time on GPU
)
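To reason about a manual chunk_size, a rough estimate is free VRAM divided by the size of one layer, with some headroom for activations. The sketch below illustrates that arithmetic only. lmprobe's actual heuristic is not documented here, so `estimate_chunk_size`, the `headroom` factor, and the example sizes are all assumptions.

```python
GiB = 1024 ** 3

def estimate_chunk_size(free_vram_bytes: int, layer_bytes: int,
                        n_layers: int, headroom: float = 0.8) -> int:
    """Estimate how many layers can sit on the GPU at once.

    Hypothetical helper: reserves (1 - headroom) of VRAM for activations.
    """
    if layer_bytes <= 0:
        raise ValueError("layer_bytes must be positive")
    usable = int(free_vram_bytes * headroom)
    fit = usable // layer_bytes
    if fit < 1:
        raise MemoryError("a single layer does not fit in the usable VRAM")
    return min(fit, n_layers)   # never more layers than the model has

# Illustrative numbers: 24 GiB free, 5 GiB per layer, 56 layers.
print(estimate_chunk_size(24 * GiB, 5 * GiB, n_layers=56))  # -> 3
```

If extraction still runs out of memory with the estimated value, lowering chunk_size trades speed for a smaller peak footprint.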

When to use chunked vs local

If your model loads successfully with backend="local" but you're running out of VRAM during extraction, switch to "chunked". The trade-off is speed — each batch requires moving layers between CPU and GPU.


Disk offload backend

For models that don't fit in CPU RAM either (e.g. DeepSeek-V3 at 671B parameters / 642GB), the disk offload backend loads layer weights directly from safetensors files to GPU, one layer at a time. It never materializes the full model in memory.

from lmprobe.backends import DiskOffloadBackend
from lmprobe.activation_types import ExtractionSpec, detect_moe_info

backend = DiskOffloadBackend(
    "deepseek-ai/DeepSeek-V3-Base",
    device="cuda:0",
)

# Detect MoE structure automatically
moe = detect_moe_info("deepseek-ai/DeepSeek-V3-Base")

spec = ExtractionSpec(
    hidden_layers=[0, 10, 20, 30],
    router_layers=moe.moe_layer_indices,
    router_module_template=moe.router_module_template,
    router_hook_strategy=moe.router_hook_strategy,
)

result = backend.extract_all(
    prompts,
    spec,
    batch_size=16,
)
# result.hidden_per_layer[layer_idx] -> (N, seq_len, hidden_dim)
# result.router_logits[layer_idx]    -> (N, seq_len, n_experts)

Layer-amortized extraction

The key optimization is extract_all: it processes the entire dataset through each layer before moving to the next. Each layer's weights are loaded from disk exactly once, regardless of dataset size.

Layer 0:  load weights -> run all batches -> free weights
Layer 1:  load weights -> run all batches -> free weights
...
Layer 60: load weights -> run all batches -> free weights

This is critical for large models where weight loading dominates runtime. For DeepSeek-V3, loading one MoE layer takes ~18 seconds, while running a batch through an already-loaded layer is fast (GPU compute only). Because the loading cost is paid once per layer regardless of dataset size, it is the same for 20,000 prompts as for 100; only the per-batch GPU compute grows with the dataset.
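The effect of loop order on weight loading can be shown with a small simulation. This is a sketch of the idea rather than lmprobe's implementation: `count_weight_loads` is a hypothetical function that assumes one layer is resident on the GPU at a time, and it ignores where intermediate activations are kept between layers.

```python
def count_weight_loads(n_layers: int, n_batches: int, layer_major: bool) -> int:
    """Count disk-to-GPU weight loads when only one layer is resident at a time."""
    if layer_major:
        # Amortized order: every batch passes through a layer before moving on.
        order = ((layer, b) for layer in range(n_layers) for b in range(n_batches))
    else:
        # Naive order: each batch runs through all layers in turn.
        order = ((layer, b) for b in range(n_batches) for layer in range(n_layers))

    loads = 0
    resident = None  # index of the layer currently on the GPU
    for layer, _batch in order:
        if layer != resident:
            loads += 1
            resident = layer
    return loads

# 61 layers, 20,000 prompts at batch_size=16 -> 1,250 batches.
print(count_weight_loads(61, 1250, layer_major=True))   # -> 61
print(count_weight_loads(61, 1250, layer_major=False))  # -> 76250
```

With the layer-major order, the number of loads depends only on the number of layers, which is why dataset size barely affects the loading cost.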

Mean pooling

For large-scale extraction where storing full (N, seq_len, hidden_dim) tensors per layer would exceed memory, use pool="mean" to mean-pool over valid tokens as features are captured:

result = backend.extract_all(
    prompts,
    spec,
    batch_size=16,
    pool="mean",                # store (N, dim) per layer, not (N, seq, dim)
)
# result.hidden_per_layer[layer_idx] -> (N, hidden_dim)
# result.router_logits[layer_idx]    -> (N, n_experts)

This reduces memory from O(N * seq * layers * dim) to O(N * layers * dim) — essential when combining multiple datasets in a single extraction pass.
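As a rough illustration of both points, the sketch below mean-pools over valid (non-padding) tokens and compares the storage needed with and without pooling. It uses plain Python lists instead of tensors. `masked_mean_pool` and `feature_bytes` are hypothetical helpers, and the 2-byte element size and example dimensions are assumptions.

```python
def masked_mean_pool(hidden, mask):
    """Average the token vectors whose mask entry is 1.

    hidden: list of seq_len vectors (each a list of dim floats)
    mask:   list of seq_len ints, 1 for real tokens, 0 for padding
    """
    n_valid = sum(mask)
    if n_valid == 0:
        raise ValueError("no valid tokens to pool")
    pooled = [0.0] * len(hidden[0])
    for vec, keep in zip(hidden, mask):
        if keep:
            for j, x in enumerate(vec):
                pooled[j] += x
    return [s / n_valid for s in pooled]

def feature_bytes(n_prompts, seq_len, dim, n_layers, bytes_per_value=2, pooled=False):
    """Bytes needed to store hidden features for all prompts and layers."""
    per_prompt = dim if pooled else seq_len * dim
    return n_prompts * per_prompt * n_layers * bytes_per_value

# The padded third token is ignored by the mask.
print(masked_mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0]))  # -> [2.0, 3.0]

# 20,000 prompts, 128 tokens, dim 7168, 4 layers, 2 bytes per value.
full = feature_bytes(20_000, 128, 7168, 4)
small = feature_bytes(20_000, 128, 7168, 4, pooled=True)
print(f"{full / 1e9:.1f} GB unpooled vs {small / 1e9:.2f} GB pooled")
```

The saving is exactly a factor of seq_len, so longer prompts benefit more from pooling.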

FP8 quantized models

The disk offload backend handles FP8 block-quantized weights (e.g. DeepSeek-V3's native e4m3 format) automatically. Weights are dequantized to bf16 during loading using the companion weight_scale_inv tensors. No manual configuration is needed.
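Conceptually, block dequantization multiplies each tile of the weight matrix by that tile's scale. The sketch below shows this on plain Python lists, assuming the FP8 values have already been decoded to floats and that each block-by-block tile has one entry in the scale grid. `dequantize_blockwise` is a hypothetical name, not the backend's actual routine, and the tiny block size in the example is only for readability.

```python
def dequantize_blockwise(weight, scale_inv, block=128):
    """Scale each (block x block) tile of `weight` by its entry in `scale_inv`.

    weight:    rows x cols matrix of decoded FP8 values (list of lists)
    scale_inv: ceil(rows/block) x ceil(cols/block) matrix of per-tile scales
    """
    return [
        [w * scale_inv[i // block][j // block] for j, w in enumerate(row)]
        for i, row in enumerate(weight)
    ]

# Toy example: a 2x4 matrix with 2x2 tiles, so one row of two scales.
weight = [[1.0, 2.0, 3.0, 4.0],
          [5.0, 6.0, 7.0, 8.0]]
scale_inv = [[0.5, 2.0]]
print(dequantize_blockwise(weight, scale_inv, block=2))
# -> [[0.5, 1.0, 6.0, 8.0], [2.5, 3.0, 14.0, 16.0]]
```

A real implementation would do this on GPU tensors and cast the result to bf16; the indexing logic is the same.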

For MoE architectures, expert weights stored as individual per-expert tensors in safetensors are automatically packed into the 3D format expected by the model's forward pass.
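The packing step amounts to collecting the per-expert matrices and stacking them in expert order. Below is a minimal sketch on plain lists. The tensor key pattern and the `pack_experts` helper are hypothetical and will differ between checkpoints. Note that the expert index is parsed as an integer, so expert 10 sorts after expert 2.

```python
import re

def pack_experts(tensors, pattern=r"experts\.(\d+)\.weight$"):
    """Stack per-expert 2D matrices into one [n_experts][rows][cols] structure.

    tensors: dict mapping tensor name -> 2D matrix (list of lists)
    pattern: hypothetical key pattern with the expert index as group 1
    """
    found = {}
    for name, matrix in tensors.items():
        match = re.search(pattern, name)
        if match:
            found[int(match.group(1))] = matrix   # numeric, not lexicographic, order
    if sorted(found) != list(range(len(found))):
        raise ValueError("expert indices are not contiguous from 0")
    return [found[i] for i in range(len(found))]

tensors = {
    "mlp.experts.1.weight": [[3.0, 4.0]],
    "mlp.experts.0.weight": [[1.0, 2.0]],
    "mlp.gate.weight": [[9.0, 9.0]],      # not an expert, ignored
}
print(pack_experts(tensors))  # -> [[[1.0, 2.0]], [[3.0, 4.0]]]
```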

GPU requirements

Peak VRAM usage depends on the largest single layer. For DeepSeek-V3 (256 routed experts per MoE layer, expert intermediate size 2048), one MoE layer in bf16 requires ~22GB. A GPU with 24-32GB VRAM is sufficient.
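The ~22GB figure can be reproduced with back-of-the-envelope arithmetic. The sketch below counts the routed expert weights only, leaving out attention, router, and any shared expert. It assumes three projection matrices per expert and a hidden size of 7168, taken from the published DeepSeek-V3 config rather than from this page. `moe_layer_bytes` is a hypothetical helper.

```python
def moe_layer_bytes(hidden: int, intermediate: int, n_experts: int,
                    bytes_per_param: int = 2, matrices_per_expert: int = 3) -> int:
    """Approximate size of one MoE layer's routed expert weights.

    Assumes each expert has `matrices_per_expert` matrices of shape
    (hidden x intermediate); bf16 means 2 bytes per parameter.
    """
    params = n_experts * matrices_per_expert * hidden * intermediate
    return params * bytes_per_param

size = moe_layer_bytes(hidden=7168, intermediate=2048, n_experts=256)
print(f"{size / 1e9:.1f} GB")  # -> 22.5 GB
```

The remaining VRAM has to hold the non-expert weights of the layer and the activations for one batch, so batch_size may need to be lowered on a 24GB card.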


Combining multiple datasets

A common pattern for probing experiments is extracting features from many datasets at once. With the disk offload backend, combining datasets into a single extract_all call is much faster than processing them separately:

# Combine all prompts (datasets: dict mapping dataset name -> list of prompt strings)
all_prompts = []
slices = {}
for name, prompts in datasets.items():
    start = len(all_prompts)
    all_prompts.extend(prompts)
    slices[name] = (start, len(all_prompts))

# Single extraction pass — each layer loaded once
result = backend.extract_all(all_prompts, spec, batch_size=16, pool="mean")

# Slice results per dataset
layer_idx = 20  # example; must be listed in both hidden_layers and router_layers of the spec
for name, (start, end) in slices.items():
    hidden = result.hidden_per_layer[layer_idx][start:end]
    router = result.router_logits[layer_idx][start:end]
    # ... run probes on this dataset

Backend comparison

| | "local" | "chunked" | "disk_offload" |
|---|---|---|---|
| Model in GPU VRAM | Full | No | No |
| Model in CPU RAM | Full | Full | No |
| Weight loading | Once | Per batch | Per layer (amortized) |
| Speed (per batch) | Fast | Moderate | Slow per layer, fast per batch |
| FP8 support | Via transformers | Via transformers | Native dequantization |
| MoE routing extraction | Yes | Yes | Yes |
| Max tested model size | ~30B | ~70B | 671B |