vLLM shipped tiered KV cache management this week…

vLLM v0.21.0 shipped the Hybrid Memory Allocator (HMA) as a production feature. I want to explain what it actually does, why it took this long, and the specific hardware constraint that determines whether any of this matters for your deployment.

The short version: HMA solves two separate problems that were blocking production tiered KV cache. One has been solved well. One has a hardware ceiling that most writeups don't mention.

Problem one: hybrid model memory waste. This is what motivated the HMA RFC two years ago and it's genuinely fixed now.

Models like Gemma-2, Nemotron 3 Super, and Ministral have heterogeneous layer types. Gemma-2 alternates sliding window attention layers (KV cache only covers the last N tokens) with full attention layers (KV cache covers all tokens). MLlama has cross-attention layers for image tokens with a different KV cache shape than its self-attention layers for text. Mamba hybrid models have SSM layers that keep a fixed-size recurrent state and no KV cache at all.

The old vLLM allocator -- a single block size for all layers -- handled this badly. If you set the block size for the worst-case layer (full attention, largest KV), every sliding window layer wastes the portion of each block that can never be used. The numbers from the RFC: 79.6% memory waste in MLlama, 25% in Gemma-2, 56.25% in Ministral. You're paying for GPU HBM you cannot use because the allocator doesn't understand that different layers have different KV footprints.
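A back-of-the-envelope way to see where this kind of waste comes from. This is a minimal sketch, assuming every layer stores the same number of bytes per token and that the old allocator reserves room for the full sequence in every layer. The layer layout and sequence length are illustrative, not taken from the RFC, and `LayerSpec` and `uniform_allocator_waste` are hypothetical names, not vLLM APIs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LayerSpec:
    """One attention layer. window=None means full attention."""
    window: Optional[int] = None


def tokens_needed(layer: LayerSpec, seq_len: int) -> int:
    # A sliding window layer only ever reads the last `window` tokens.
    if layer.window is None:
        return seq_len
    return min(seq_len, layer.window)


def uniform_allocator_waste(layers: List[LayerSpec], seq_len: int) -> float:
    """Fraction of reserved KV memory that can never be used when every
    layer is sized like a full attention layer (assumed model)."""
    allocated = seq_len * len(layers)
    needed = sum(tokens_needed(layer, seq_len) for layer in layers)
    return 1.0 - needed / allocated


# Illustrative layout: sliding window and full attention alternating 1:1.
alternating = [LayerSpec(window=4096), LayerSpec()] * 4
print(f"{uniform_allocator_waste(alternating, seq_len=8192):.2%}")
```

With a 4096-token window and an 8192-token sequence, half the layers need half the space, so a quarter of the reservation is dead weight. Under the same model, sizing each layer type separately brings the waste to zero by construction.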

HMA gives every layer type its own allocator with the correct block size. Sliding window layers get small blocks. Full attention layers get full blocks. SSM layers get a Mamba-specific cache manager that doesn't interfere with prefix caching. Memory fragmentation drops dramatically. For MLlama, the nearly 80% of KV cache memory that used to be wasted becomes usable. On a GPU where HBM is the primary constraint on how many concurrent sessions you can serve, that is not marginal.

That problem is cleanly solved. Production-ready.

Problem two: tiered offloading across GPU HBM → CPU DRAM → NVMe. This is where the PCIe ceiling shows up.

The architecture makes sense on paper. GPU HBM is fast, expensive, and small -- roughly 80GB on an H100. CPU DRAM is slower, cheap, and large -- a standard server has 512GB to 2TB. NVMe is slower still, very cheap, and very large. When the KV cache for active sessions exceeds HBM, you spill to DRAM. When you want persistent caching across sessions (for prefix reuse on long documents), DRAM gives you capacity without recomputation.

The problem is PCIe.

HBM bandwidth on an H100: 3.35 TB/s. PCIe 5.0 x16, which is how the GPU connects to the host CPU and DRAM: 64 GB/s. Ratio: about 50x slower.

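A 65K-token context window for Llama-3.1-405B generates roughly 33GB of KV cache. Reading that much data out of HBM takes 10 to 15ms. Pulling it back from CPU DRAM across PCIe takes just over 500ms at the 64 GB/s theoretical peak, and closer to 800ms if sustained throughput sits near 40 GB/s. That is more than half a second on the PCIe bus while the GPU waits.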
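The arithmetic behind those numbers can be checked in a few lines. This is a sketch under stated assumptions: the architecture figures (126 layers, 8 KV heads with grouped-query attention, head dimension 128, BF16 cache) are my assumption about Llama-3.1-405B, and the transfer times are peak-bandwidth floors, not measurements. The helper names are hypothetical.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_elem: int = 2) -> int:
    # The factor of 2 covers the separate K and V tensors in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


def transfer_ms(num_bytes: int, bandwidth_gb_per_s: float) -> float:
    # Lower bound at peak bandwidth; ignores latency, pinning, and contention.
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


# Assumed architecture: 126 layers, 8 KV heads, head_dim 128, BF16 (2 bytes).
size = kv_cache_bytes(num_layers=126, num_kv_heads=8, head_dim=128,
                      num_tokens=65_536)
print(f"KV cache: {size / 1e9:.1f} GB")
print(f"HBM at 3350 GB/s: {transfer_ms(size, 3350):.0f} ms")
print(f"PCIe 5.0 x16 at 64 GB/s: {transfer_ms(size, 64):.0f} ms")
```

Under these assumptions the cache comes out at about 33.8 GB, roughly 10 ms at HBM peak and roughly 528 ms at PCIe peak. Real sustained PCIe throughput is lower than the headline number, which is what pushes the practical figure toward 800ms.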

For a user asking a follow-up question about a long document -- the multi-turn KV retention use case -- 800ms of PCIe transfer adds directly to their TTFT. That's not a rounding error. That's the dominant term in their latency experience.

vLLM's HMA handles this with async transfers and non-blocking scheduling: requests waiting for a DRAM load aren't scheduled until the load completes, freeing the GPU to serve other requests in the meantime. The scheduler groups KV blocks by position in the sliding window and promotes recently-accessed blocks back to HBM proactively before the next request arrives. When it works, the transfer happens during idle time and the user never sees it. When the cluster is under load and there's no idle time, the user waits.
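The non-blocking idea is easy to show in a toy simulation. This is not vLLM's scheduler, just a minimal sketch of the behavior described above: a request whose KV is still loading is skipped, and the GPU keeps serving whatever is ready. `Request`, `load_done_at`, and `run` are hypothetical names, and time is counted in abstract scheduler steps.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    name: str
    load_done_at: int  # step at which its KV is back in HBM (0 = resident)
    steps_left: int    # decode steps still to run


def run(requests: List[Request], max_batch: int = 2) -> List[List[str]]:
    """Return, for each step, the names of the requests that ran."""
    timeline = []
    step = 0
    while any(r.steps_left > 0 for r in requests):
        # Only requests whose load has completed are eligible. The others
        # are skipped rather than stalling the whole batch.
        ready = [r for r in requests
                 if r.steps_left > 0 and r.load_done_at <= step]
        batch = ready[:max_batch]
        for r in batch:
            r.steps_left -= 1
        timeline.append([r.name for r in batch])
        step += 1
    return timeline


timeline = run([
    Request("follow-up", load_done_at=2, steps_left=2),  # waiting on DRAM
    Request("fresh-a", load_done_at=0, steps_left=3),
    Request("fresh-b", load_done_at=0, steps_left=1),
])
print(timeline)
```

The follow-up request only joins the batch at step 2, once its load lands, and the two fresh requests use the GPU in the meantime. If nothing else is ready, the batch is empty and the wait becomes visible, which is the loaded-cluster case.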

The multi-tier framework in v0.21.0 adds a Python filesystem backend for NVMe, Mooncake disk offloading support, and DSv4 integration. The hierarchy is now fully pluggable. You can have HBM → DRAM → local NVMe → Mooncake distributed cache as a four-tier stack. Each tier with its own connector, its own eviction policy, its own capacity configuration.
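To make "pluggable hierarchy" concrete, here is a minimal sketch of a tier chain where each tier has its own capacity and demotes its least recently used block to the tier below. The `Tier` class, the LRU policy, and the capacities are all assumptions for illustration. This is not the vLLM connector interface.

```python
from collections import OrderedDict
from typing import Optional, Tuple


class Tier:
    """One cache tier with a fixed block capacity and LRU eviction."""

    def __init__(self, name: str, capacity: int,
                 lower: Optional["Tier"] = None) -> None:
        self.name = name
        self.capacity = capacity
        self.lower = lower
        self.blocks = OrderedDict()  # key -> block payload, oldest first

    def put(self, key: str, value: bytes) -> None:
        self.blocks[key] = value
        self.blocks.move_to_end(key)
        while len(self.blocks) > self.capacity:
            old_key, old_value = self.blocks.popitem(last=False)
            if self.lower is not None:
                # Demote to the next tier instead of dropping the block.
                self.lower.put(old_key, old_value)

    def get(self, key: str) -> Optional[Tuple[str, bytes]]:
        """Return (tier_name, value) from the first tier holding the key."""
        if key in self.blocks:
            self.blocks.move_to_end(key)
            return self.name, self.blocks[key]
        return self.lower.get(key) if self.lower is not None else None


nvme = Tier("nvme", capacity=4)
dram = Tier("dram", capacity=2, lower=nvme)
hbm = Tier("hbm", capacity=1, lower=dram)
for i in range(4):
    hbm.put(f"block-{i}", b"kv")
print(hbm.get("block-3"), hbm.get("block-1"), hbm.get("block-0"))
```

After four inserts the newest block sits in the top tier, the two before it sit in the middle tier, and the oldest has been pushed to the bottom. The sketch deliberately leaves out promotion on a hit, which is where the transfer cost discussed above comes in.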

The adaptive tiered storage paper (March 2026) found something that vLLM doesn't yet act on: adding more DRAM beyond a certain threshold doesn't help.

For workloads with high prefix hit rates -- document Q&A, RAG pipelines, agent workflows with shared context -- DRAM tier capacity translates directly to KV cache reuse and lower TTFT. For workloads with low hit rates -- fresh requests, diverse inputs, no shared prefixes -- the PCIe transfer overhead to populate the DRAM tier costs more than it saves. The optimal DRAM allocation varies by workload and cannot be set statically.
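The trade-off can be written as a one-line cost model. This is a sketch, not the paper's method. It assumes every request pays a fixed cost to write its KV to DRAM, and that a hit replaces recomputation of the shared prefix with a load. The function names and all timings are placeholders, not measurements.

```python
def net_saving_ms(hit_rate: float, recompute_ms: float,
                  load_ms: float, store_ms: float) -> float:
    """Expected time saved per request by keeping a DRAM tier.

    Assumed model: every request pays `store_ms` to offload its KV, and a
    hit pays `load_ms` instead of `recompute_ms` for the shared prefix.
    """
    return hit_rate * (recompute_ms - load_ms) - store_ms


def break_even_hit_rate(recompute_ms: float, load_ms: float,
                        store_ms: float) -> float:
    # Hit rate at which the tier neither helps nor hurts under this model.
    gain_per_hit = recompute_ms - load_ms
    if gain_per_hit <= 0:
        return float("inf")  # loading is never cheaper than recomputing
    return store_ms / gain_per_hit


# Placeholder timings chosen only to exercise the formula.
threshold = break_even_hit_rate(recompute_ms=2000, load_ms=800, store_ms=300)
print(f"break-even hit rate: {threshold:.0%}")
print(net_saving_ms(0.15, 2000, 800, 300), net_saving_ms(0.50, 2000, 800, 300))
```

The useful part is the shape, not the numbers: the threshold moves with prefix length, interconnect speed, and how much of the store cost async transfers can hide. That is why a static setting is hard to get right.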

vLLM's current configuration uses fixed provisioning. You decide at startup how much CPU DRAM to reserve for KV offloading. There's no adaptive feedback that says "your current traffic has 15% prefix hit rate, your DRAM tier is costing more than it's saving, reduce it to 64GB." That system doesn't exist yet in any production framework. The paper proposes one. It's not shipped anywhere.

This is the honest state: HMA is production. Tiered KV is production. Adaptive tier configuration is research.

The DGX Spark post on the vLLM blog (June 1st) changes the bandwidth math in a way nobody has said clearly.

The DGX Spark is a desktop machine built on NVIDIA's Grace Blackwell Superchip, with CPU and GPU sharing 128GB of NVLink-connected unified memory. Not PCIe-connected. NVLink.

NVLink bandwidth between the Grace CPU and Blackwell GPU on the DGX Spark: approximately 900 GB/s. Compare to PCIe 5.0 x16 at 64 GB/s. The DGX Spark's "host" memory is 14x faster than a standard server's CPU DRAM from the GPU's perspective.

The PCIe bottleneck that makes tiered KV cache painful on standard hardware -- the 800ms transfer for a 33GB context -- becomes approximately 55ms on the DGX Spark. Still slower than HBM-to-HBM, but in the range where async prefetching can hide it behind compute latency rather than dominating it.
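The 55ms figure is just the 800ms estimate scaled by the bandwidth ratio. A short sketch, taking the bandwidth numbers quoted in this post at face value and assuming transfer time scales inversely with bandwidth for a fixed payload:

```python
PCIE5_X16_GB_S = 64.0     # PCIe 5.0 x16 figure quoted in this post
NVLINK_GB_S = 900.0       # CPU-GPU link figure quoted in this post
PCIE_TRANSFER_MS = 800.0  # this post's estimate for a 33GB context


def scale_transfer_ms(baseline_ms: float, baseline_gb_s: float,
                      new_gb_s: float) -> float:
    # For a fixed payload, time is inversely proportional to bandwidth.
    return baseline_ms * baseline_gb_s / new_gb_s


speedup = NVLINK_GB_S / PCIE5_X16_GB_S
scaled = scale_transfer_ms(PCIE_TRANSFER_MS, PCIE5_X16_GB_S, NVLINK_GB_S)
print(f"{speedup:.1f}x faster, about {scaled:.0f} ms")
```

This is a ratio argument, not a benchmark. Swap in measured sustained bandwidth for your own hardware before relying on the result.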

More importantly: the DGX Spark has 128GB of unified memory. A 70B model in BF16 is 140GB -- slightly over budget. In FP8 it's 70GB, leaving 58GB for KV cache and overhead. That's a single-machine 70B deployment with meaningful KV headroom, on a desktop form factor, without requiring the NVMe tier at all for most workloads.
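The budget arithmetic, plus a rough ceiling on how many tokens of KV cache the headroom could hold. The weight sizes follow directly from parameter count times bytes per parameter. The KV-per-token figure assumes a Llama-3.1-70B-style layout (80 layers, 8 KV heads, head dimension 128, BF16 cache), which is my assumption, and the result ignores activations and runtime overhead, so treat it as an upper bound.

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    # 1e9 parameters times bytes per parameter, expressed in GB (1e9 bytes).
    return params_billion * bytes_per_param


def headroom_gb(total_gb: float, params_billion: float,
                bytes_per_param: float) -> float:
    return total_gb - weights_gb(params_billion, bytes_per_param)


for name, width in [("BF16", 2), ("FP8", 1)]:
    print(name, weights_gb(70, width), "GB weights,",
          headroom_gb(128, 70, width), "GB left")

# Assumed layout: 80 layers, 8 KV heads, head_dim 128, 2 bytes per element.
kv_bytes_per_token = 2 * 80 * 8 * 128 * 2
max_tokens = int(headroom_gb(128, 70, 1) * 1e9 // kv_bytes_per_token)
print(f"upper bound on cached tokens: {max_tokens:,}")
```

BF16 weights come out 12GB over the 128GB budget and FP8 leaves 58GB, matching the figures above. Under the assumed layout, that headroom is on the order of 177K cached tokens across all sessions before any overhead is subtracted.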

The HMA running on a DGX Spark doesn't see a slow DRAM tier that costs 800ms. It sees a fast unified memory tier that costs 55ms. The tiered KV cache architecture that was theoretically correct but practically constrained on standard hardware becomes practically useful on Grace Blackwell unified memory.

This is not a DGX Spark advertisement. It's a statement about what the tier structure looks like when you change the interconnect. CXL memory does the same thing at rack scale -- takes the DRAM tier from 800ms-equivalent to something manageable by replacing PCIe with a load/store protocol over CXL. The DGX Spark does it at single-machine scale with NVLink. Both are solving the same bandwidth problem. The software (HMA) is now production. The hardware that makes the software worthwhile is shipping.

the tiered kv cache architecture has been correct in principle for two years.

the pcie bus made it painful in practice.

nvlink unified memory and cxl both attack the same bottleneck from different angles.

hma shipped this week. the hardware it needs to reach its ceiling is shipping this year.

the adaptive tier configuration problem -- knowing how much dram to reserve for your actual traffic pattern -- is the open research problem that nobody has shipped in production yet. if you're deploying hma today, measure your prefix hit rate first. if it's below 20%, the dram tier is costing more than it's returning.

P.S. The per-layer allocation fix in HMA has a non-obvious consequence for prefix caching. The old allocator couldn't do prefix caching for SSM/Mamba layers because the Mamba cache had a separate manager incompatible with the prefix cache index. HMA unifies this: all layer types register their cache state through the same allocator interface, so the prefix cache can index into SSM state as well as KV blocks. Multi-turn sessions on hybrid models like Nemotron 3 Super -- which is 75% Mamba layers -- can now reuse cached recurrent state across turns, not just KV. Nobody wrote about this. It's in the RFC. It's real. And it significantly changes the economics of serving hybrid models for multi-turn workloads because you're no longer recomputing SSM recurrent state from scratch on every new turn.
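A minimal sketch of what "one index for both kinds of state" could look like. Everything here is hypothetical: `CachedState`, `PrefixCache`, the block size, and the string ids standing in for real tensors. It is not the RFC's data structure. It only shows the idea that a prefix hit returns KV blocks and SSM snapshots together, so a new turn computes just the new tokens.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class CachedState:
    kv_blocks: Dict[int, str] = field(default_factory=dict)  # attn layer -> block id
    ssm_state: Dict[int, str] = field(default_factory=dict)  # Mamba layer -> snapshot id


def prefix_key(tokens: List[int]) -> str:
    return hashlib.sha256(repr(tokens).encode()).hexdigest()


class PrefixCache:
    def __init__(self) -> None:
        self._entries: Dict[str, CachedState] = {}

    def store(self, tokens: List[int], state: CachedState) -> None:
        self._entries[prefix_key(tokens)] = state

    def longest_hit(self, tokens: List[int],
                    block: int = 4) -> Tuple[int, Optional[CachedState]]:
        # Try block-aligned prefixes from longest to shortest.
        for n in range(len(tokens) // block * block, 0, -block):
            state = self._entries.get(prefix_key(tokens[:n]))
            if state is not None:
                return n, state
        return 0, None


cache = PrefixCache()
turn_1 = [11, 12, 13, 14, 15, 16, 17, 18]
cache.store(turn_1, CachedState(kv_blocks={0: "kv-a"}, ssm_state={1: "ssm-a"}))

turn_2 = turn_1 + [21, 22, 23]  # follow-up appends three new tokens
hit_len, state = cache.longest_hit(turn_2)
print(hit_len, len(turn_2) - hit_len, state)
```

One design point worth noting: a recurrent state summarizes the whole prefix up to the point it was saved, so it can only be reused at exactly that boundary, whereas KV blocks can be reused for any shorter aligned prefix. The sketch stores one snapshot per cached prefix for that reason.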