GQA models have been making thousands of RDMA…

I want to be precise about what this means because it's the kind of problem that's invisible until you know the memory layout.

PD disaggregation sends KV cache from prefill workers to decode workers over RDMA. This is known. The KV cache is large. The transfer happens once per request. You've read about this. What's less discussed: in GQA models -- Grouped Query Attention, which includes DeepSeek-V4, Qwen3.5, Llama-3, and most other production models deployed right now -- the K and V slices that a given decode rank needs are not contiguous in memory.

Here's why. GQA assigns multiple query heads to each KV head, so the number of KV heads is a fraction of the number of query heads. The model stores KV tensors per layer, per head. When you shard across TP ranks in a disaggregated deployment, the KV head slices for each rank are scattered: head 0, head 4, head 8 -- strided, non-contiguous ranges of GPU memory, interleaved with slices that don't need to be transferred to that rank.
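To make the layout concrete, here is a minimal sketch that computes the byte ranges one rank would have to send. Everything in it is an assumption for illustration: the per-layer layout `[num_tokens][num_kv_heads][head_dim]`, the strided ownership (rank r owns heads r, r + TP, r + 2*TP, matching the head 0 / 4 / 8 example), and the function names. It is not SGLang's actual layout code.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (byte_offset, byte_length)


def head_slice_ranges(num_tokens: int, num_kv_heads: int, head_dim: int,
                      tp_degree: int, rank: int,
                      bytes_per_elem: int = 2) -> List[Range]:
    """Byte ranges of every KV head slice owned by `rank` in one layer buffer.

    Assumed (hypothetical) layout: [num_tokens][num_kv_heads][head_dim],
    row-major, BF16 by default. Assumed ownership: rank r owns heads
    r, r + tp_degree, r + 2 * tp_degree, ...
    """
    slice_bytes = head_dim * bytes_per_elem
    token_bytes = num_kv_heads * slice_bytes
    ranges: List[Range] = []
    for token in range(num_tokens):
        for head in range(rank, num_kv_heads, tp_degree):
            ranges.append((token * token_bytes + head * slice_bytes, slice_bytes))
    return ranges


def count_contiguous_runs(ranges: List[Range]) -> int:
    """Count maximal runs of back-to-back ranges (one transfer request each)."""
    runs = 0
    prev_end = None
    for offset, length in ranges:
        if offset != prev_end:  # a gap in memory starts a new run
            runs += 1
        prev_end = offset + length
    return runs


if __name__ == "__main__":
    ranges = head_slice_ranges(num_tokens=2, num_kv_heads=12, head_dim=128,
                               tp_degree=4, rank=0)
    print(ranges[:3])
    print(count_contiguous_runs(ranges))
```

Under these assumptions every slice is its own run, so the number of separate ranges grows with tokens times heads per rank, and again with the number of layers. With `tp_degree=1` the same buffer collapses into a single contiguous run.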

RDMA transfers require contiguous memory. RDMA can't natively scatter-gather across non-contiguous GPU memory ranges at the granularity of individual KV head slices. The workaround: issue one RDMA request per head slice. For a model with 128 KV heads across multiple layers with TP degree 4, you're issuing hundreds to thousands of individual RDMA requests per token transfer. Each request carries its own completion event. The InfiniBand fabric queues them all. The receive side processes them all. The RDMA subsystem was not designed for this many small messages at this frequency.

SGLang's GPU Staging Buffer (PR #19890) fixes this with a single architectural insight: consolidate before you transfer.

A dedicated CUDA kernel runs before the RDMA transfer. It gathers all scattered KV head slices -- from wherever they sit in GPU memory, in whatever non-contiguous layout GQA produces -- into a single contiguous staging buffer in GPU HBM. One contiguous memory region. Then one bulk RDMA transfer. The receive side gets one message. The completion event fires once. The decode worker copies from its contiguous receive buffer into its own KV cache.

The gather is cheap -- a coalesced CUDA copy kernel. The RDMA transfer is now a single large message instead of thousands of small ones. RDMA was designed for exactly this: large contiguous bulk transfers.
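The gather-then-send idea can be sketched at the byte level in plain Python. This is a toy model under stated assumptions, not the code from the PR: `CountingLink` is a hypothetical stand-in for an RDMA connection that only counts posted messages, and the real gather runs as a CUDA kernel over GPU memory rather than as Python slicing.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (byte_offset, byte_length)


class CountingLink:
    """Hypothetical stand-in for an RDMA connection; it only counts messages."""

    def __init__(self) -> None:
        self.messages = 0

    def send(self, payload: bytes) -> bytes:
        self.messages += 1
        return bytes(payload)  # what the receiver sees


def send_per_slice(link: CountingLink, src: bytes,
                   ranges: List[Range]) -> List[bytes]:
    """Baseline: one message per non-contiguous slice."""
    return [link.send(src[off:off + n]) for off, n in ranges]


def gather(src: bytes, ranges: List[Range]) -> bytearray:
    """Pack scattered slices back-to-back into one contiguous staging buffer."""
    staging = bytearray()
    for off, n in ranges:
        staging += src[off:off + n]
    return staging


def scatter(dst: bytearray, ranges: List[Range], staging: bytes) -> None:
    """Receiver side: copy the packed slices into their final positions."""
    cursor = 0
    for off, n in ranges:
        dst[off:off + n] = staging[cursor:cursor + n]
        cursor += n


def send_staged(link: CountingLink, src: bytes, src_ranges: List[Range],
                dst: bytearray, dst_ranges: List[Range]) -> None:
    """Gather, send one bulk message, then scatter into the destination."""
    received = link.send(gather(src, src_ranges))
    scatter(dst, dst_ranges, received)


if __name__ == "__main__":
    src = bytes(range(64))
    src_ranges = [(0, 4), (16, 4), (32, 4), (48, 4)]
    dst = bytearray(16)
    dst_ranges = [(0, 4), (4, 4), (8, 4), (12, 4)]
    link = CountingLink()
    send_staged(link, src, src_ranges, dst, dst_ranges)
    print(link.messages, dst.hex())
```

The message count drops from one per slice to one per transfer, at the price of one extra copy on each side. That trade only pays off because the copies are local memory traffic while each message carries fixed per-request overhead.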

RDMA request count reduction: approximately 1000x on GQA models. TPS/GPU at high concurrency: a 5x improvement with Prefill TP4 + Decode DEP4 on Qwen3.5. The throughput improvement is not from a better algorithm. It's from removing a mismatch between the memory layout GQA creates and the bulk transfer semantics RDMA needs.

The same SGLang release that shipped the staging buffer shipped HiSparse, and the two are solving adjacent problems in the same serving stack. I want to explain HiSparse's mechanism specifically because the LMSYS blog post names it without fully explaining the kernel.

Long-context inference has a KV cache size problem even after all the architectural tricks. At 1 million token context on a 40-billion-parameter model, the KV cache at BF16 precision is roughly 160GB per request -- well beyond any single GPU's HBM. The standard approach is to limit the context window to what fits. The alternative is to offload KV to CPU DRAM and fetch it back when needed. The problem: naive offloading fetches the entire KV cache from CPU on every attention step, which is bandwidth-limited and slow.
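For a sense of where a number of that size comes from, here is the standard KV cache size arithmetic as a small function. The configuration in the example (40 layers, 8 KV heads, head dimension 128) is a hypothetical one I picked because it lands near the 160GB figure; it is not the configuration of any specific model named in this post.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_elem: int = 2) -> int:
    """KV cache size for one request with plain (uncompressed) K and V.

    The leading 2 accounts for storing both a K and a V tensor per layer.
    bytes_per_elem defaults to 2 for BF16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


if __name__ == "__main__":
    # Hypothetical config, chosen for illustration only.
    per_token = kv_cache_bytes(40, 8, 128, num_tokens=1)
    total = kv_cache_bytes(40, 8, 128, num_tokens=1_000_000)
    print(per_token)    # 163840 bytes, i.e. 160 KiB per token
    print(total / 1e9)  # 163.84 (GB)
```

The size is linear in context length, so the only levers are fewer bytes per token or keeping fewer tokens resident at once.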

HiSparse is selective. The insight: at any given decode step, the attention kernel only actually accesses a small fraction of the total KV cache. For DeepSeek-V4's hybrid sparse attention layers -- which mix sliding window attention with 4:1 top-k compressed attention and 128:1 dense compressed attention -- the indexer touches maybe 5-10% of KV positions per step. The other 90-95% are inactive at this moment.

The HiSparse CUDA kernel does three things in sequence: it identifies which KV cache entries are cache misses in the device buffer (needed but not in HBM), selects eviction candidates from the device buffer via LRU (what to move out to make room), and fetches the required entries from host DRAM to HBM in one pipelined operation. The device buffer on GPU HBM stays sized to hold the "hot" KV entries -- the ones the current sliding window or top-k attention will actually access. The "cold" entries live on CPU.
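The three steps can be sketched as a toy host-side model. This is an illustration under assumptions, not HiSparse's kernel: the class and method names are hypothetical, entries are keyed by an integer position, the host store is assumed to always hold a full copy (so eviction needs no write-back), and the real thing runs as a pipelined CUDA kernel rather than a Python loop.

```python
from collections import OrderedDict
from typing import Dict, Iterable, List


class HotKVBuffer:
    """Toy fixed-capacity device buffer backed by a host-side KV store."""

    def __init__(self, capacity: int, host_store: Dict[int, bytes]) -> None:
        self.capacity = capacity
        self.host = host_store  # "cold" entries; assumed to hold everything
        # "hot" entries, ordered from least to most recently used
        self.device: "OrderedDict[int, bytes]" = OrderedDict()
        self.fetched = 0  # total entries copied host -> device

    def step(self, needed: Iterable[int]) -> List[int]:
        """Make every entry in `needed` resident; return the ones fetched."""
        wanted = list(dict.fromkeys(needed))  # dedupe, keep order
        if len(wanted) > self.capacity:
            raise ValueError("working set larger than device buffer")

        # 1. Identify misses: needed this step but not resident on device.
        misses = [k for k in wanted if k not in self.device]

        # Mark hits as most recently used so they are never chosen below.
        for k in wanted:
            if k in self.device:
                self.device.move_to_end(k)

        # 2. Evict least recently used entries to make room for the misses.
        overflow = len(self.device) + len(misses) - self.capacity
        for _ in range(max(0, overflow)):
            self.device.popitem(last=False)

        # 3. Fetch the missing entries from the host store.
        for k in misses:
            self.device[k] = self.host[k]
        self.fetched += len(misses)
        return misses


if __name__ == "__main__":
    host = {i: bytes([i]) for i in range(10)}
    buf = HotKVBuffer(capacity=3, host_store=host)
    for needed in ([0, 1, 2], [1, 2], [2, 3]):
        print(needed, "->", buf.step(needed), list(buf.device))
```

The point of the sketch is the traffic pattern: host-to-device copies scale with the number of misses per step, not with the total context length.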

The result on DeepSeek-V4: decode throughput stays essentially flat from 4K to 900K token context, with a drop of under 10% on both B200 (199 → 180 tokens/second) and H200 (266 → 240 tokens/second). Without HiSparse, throughput drops sharply once the KV cache exceeds HBM capacity, because preemptions and recomputation kick in.

The key property: HiSparse makes its data movement aware of the attention pattern. It uses the sparsity of the attention itself -- the fact that modern long-context models only attend to a fraction of their context at each step -- to make the CPU offload work. Naive offloading ignores sparsity. HiSparse exploits it.

These two things together -- the staging buffer and HiSparse -- are solving a problem that wasn't visible two years ago because the models that expose it didn't exist yet.

Two years ago, the dominant serving workload was dense transformer with full attention, moderate context lengths, no MoE. KV cache fit in HBM. RDMA transfers were manageable because KV layouts were simpler. GQA was rare. Sparse attention was research.

DeepSeek-V4 is 1.6 trillion parameters, hybrid sparse attention, GQA, MoE, 1 million token context. Serving it in production requires solving: scattered KV layout for RDMA transfer, attention sparsity for KV offloading, expert parallelism fault tolerance, and MoE dispatch communication overlap -- simultaneously, in the same serving stack. Each of these was a separate research problem. SGLang is shipping production solutions for all four in the same release cycle.

The pattern I keep noticing: the models expose the infrastructure problems. Dense GPT-4-style serving didn't require staging buffers or HiSparse. MoE + sparse attention + GQA at 1M context does. The infrastructure work is playing catch-up to the model architecture and the context lengths, and the catch-up is now happening in weeks rather than years because the serving framework community is reading the same model papers and shipping fixes before the papers are fully cited.

1000x rdma request reduction from one staging buffer.

the kv layout that gqa creates is scattered. rdma needs contiguous. the mismatch was costing thousands of small messages per transfer.

the fix is a gather kernel before the rdma call.

this is not a research result. it shipped in sglang last week.

if you're running pd disaggregation on any gqa model -- qwen3.5, llama-3, deepseek-v4, any of them -- and you haven't pulled the latest sglang, you're still issuing thousands of rdma requests per token transfer. the 5x throughput improvement is sitting in a github pr you haven't merged.

P.S. The ShadowRadix prefix cache in the same DeepSeek-V4 serving post is the third piece of this that nobody is talking about separately. Standard radix tree prefix caching doesn't handle prefix invalidation gracefully -- when a cached prefix gets evicted due to memory pressure, the next request that shares that prefix has to recompute from scratch, often under high-load conditions when recomputation is most expensive. ShadowRadix maintains a shadow copy of recently evicted prefixes in compressed form, allowing partial prefix reuse rather than full recomputation. It's small, it's in the same release, and it closes the gap between "prefix caching works in theory" and "prefix caching degrades gracefully under memory pressure in production." The details are in the blog post. Read it before you configure your cache eviction policy.