Three things shipped in vLLM and SGLang this week…

I want to look at them together.

Separately, each one is a changelog entry. Together they describe what the optimized attention stack looks like on Blackwell right now, and the combination is meaningfully different from what it was 60 days ago.

TurboQuant 2-bit KV cache is now a production vLLM attention backend.

PR #38479. Merged. Shipping. FA3 and FA4 prefill support added in #40092 this week.

I wrote about TurboQuant as a research paper in February. The short version then: PolarQuant plus random orthogonal rotation plus a Lloyd-Max quantizer, compressing KV cache to 2-bit integers with accuracy loss below measurement noise on standard benchmarks. 4x the KV capacity for the same HBM footprint. The paper was credible. The question was whether it would survive contact with production.

It survived. Here's what that means for serving economics.

A single H100 SXM5 has 80GB of HBM. A production LLaMA-3.1-70B deployment with 4-bit weights uses roughly 35GB for model weights (FP8 weights alone would take about 70GB), leaving ~45GB for KV cache. At BF16 KV, that 45GB supports around 85 concurrent sessions at 4,096 average context length. At FP8 KV, roughly 170 sessions. At 2-bit TurboQuant KV, roughly 340 sessions.

340 vs 85 is not an incremental improvement. It's a different conversation about how many GPUs you need to serve a given request volume.

The serving economics change: if you were running 4 H100s to maintain session density, you now run 1. If you were bottlenecked on KV memory rather than compute at decode time -- which most production long-context deployments are -- TurboQuant 2-bit doesn't just save money. It changes which hardware resource is the constraint.
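To sanity-check session counts for your own deployment, here is a minimal sizing sketch. The model shape (80 layers, 8 KV heads, head dimension 128) is an assumption for illustration and is not taken from any of the PRs. The function names are hypothetical. Real quantizers also store per-block scales or norms, so the effective bits per value will be somewhat higher than the nominal bit width. Treat the output as a way to reason about your own numbers rather than as a reproduction of the rounded figures quoted above.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bits_per_value):
    """Bytes of KV cache that one token occupies across all layers."""
    values = 2 * num_layers * num_kv_heads * head_dim  # 2 = one K and one V vector
    return values * bits_per_value / 8


def max_sessions(kv_budget_bytes, context_len, bytes_per_token):
    """How many sessions of `context_len` tokens fit in the KV budget."""
    return int(kv_budget_bytes // (bytes_per_token * context_len))


# Assumed model shape (hypothetical, for illustration only).
shape = dict(num_layers=80, num_kv_heads=8, head_dim=128)
budget = 45e9    # ~45GB left for KV cache, as in the text
context = 4096   # average context length, as in the text

for label, bits in [("BF16", 16), ("FP8", 8), ("2-bit (nominal)", 2)]:
    per_token = kv_bytes_per_token(bits_per_value=bits, **shape)
    print(label, max_sessions(budget, context, per_token))
```

The useful property is the scaling: session capacity is inversely proportional to effective bits per cached value, so the ratio between two KV formats matters more than either absolute count.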

The accuracy question: TurboQuant's random orthogonal rotation spreads outlier energy across all dimensions before the Lloyd-Max quantizer is applied. The rotation is the key -- without it, 2-bit quantization destroys information because KV caches have heavy-tailed value distributions (a few outlier channels carry most of the magnitude). After rotation, every channel carries roughly equal energy, so the 2-bit budget per value is used efficiently. At 2-bit with rotation, measured perplexity degradation on standard benchmarks is within noise. Structured tasks with precise numerical retrieval are the failure mode to watch, but for conversational and generative workloads, the accuracy holds.
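The effect of the rotation can be seen in a toy example. This is a sketch under stated assumptions, not the vLLM backend: it swaps the dense random orthogonal matrix for a randomized Hadamard transform (random sign flips followed by a normalized Walsh-Hadamard transform, a common cheap stand-in), and it uses the textbook 4-level Lloyd-Max codebook for a unit Gaussian, scaled by the vector's RMS. The vector, the outlier positions, and all function names are made up for illustration.

```python
import math
import random

# Lloyd-Max levels and decision thresholds for a unit-variance Gaussian at 2 bits.
LEVELS = (-1.510, -0.4528, 0.4528, 1.510)
THRESHOLDS = (-0.9816, 0.0, 0.9816)


def quantize_2bit(x):
    """Scale by RMS, then snap each value to one of four fixed Gaussian levels."""
    rms = math.sqrt(sum(v * v for v in x) / len(x)) or 1.0
    out = []
    for v in x:
        idx = sum(v / rms > t for t in THRESHOLDS)  # 0..3, i.e. the 2-bit code
        out.append(LEVELS[idx] * rms)
    return out


def hadamard(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    h = 1
    while h < len(y):
        for i in range(0, len(y), 2 * h):
            for j in range(i, i + h):
                y[j], y[j + h] = y[j] + y[j + h], y[j] - y[j + h]
        h *= 2
    scale = 1 / math.sqrt(len(y))
    return [v * scale for v in y]


def rotate(x, signs):
    return hadamard([s * v for s, v in zip(signs, x)])


def unrotate(y, signs):
    # The normalized Hadamard matrix is its own inverse; then undo the sign flips.
    return [s * v for s, v in zip(signs, hadamard(y))]


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


rng = random.Random(0)
n = 64
x = [rng.gauss(0, 1) for _ in range(n)]
for ch in (3, 17, 40, 58):   # a few heavy outlier channels
    x[ch] = 30.0
signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]

err_plain = mse(x, quantize_2bit(x))
err_rotated = mse(x, unrotate(quantize_2bit(rotate(x, signs)), signs))
print(f"plain: {err_plain:.2f}  rotated: {err_rotated:.2f}")
```

Without the rotation, the four outliers inflate the RMS scale and then get clipped to the largest level, while every ordinary channel is rounded to a level far from its true value. With the rotation, the same energy is spread across all 64 coordinates, so the fixed codebook fits the data it sees.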

FlashAttention 4 is now the default MLA prefill backend in vLLM on SM90+.

PR #38819. Head-dim 512 and paged-KV support added in #38835.

FA4 as the default for standard attention has been available since March. The new thing this week: FA4 as the default specifically for MLA -- Multi-head Latent Attention, the architecture DeepSeek uses in V3 and R1.

MLA is architecturally different from standard multi-head attention. Instead of storing full K and V tensors in the KV cache, MLA compresses them into a lower-rank latent representation and projects up at attention time. The KV cache stores the compressed latent; the full K and V are reconstructed on the fly for each forward pass. This dramatically reduces KV cache memory but adds projection overhead.
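A toy version of the latent cache makes the trade-off concrete. This is a simplified sketch, not DeepSeek's implementation: all dimensions are made-up toy sizes, the weights are random, and it leaves out details such as the separate rotary-position key component that real MLA carries. The names `w_down`, `w_up_k`, and `w_up_v` are hypothetical.

```python
import random


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_latent, n_heads, head_dim = 64, 16, 4, 16   # toy sizes, assumed

w_down = rand_matrix(d_latent, d_model, rng)              # hidden -> latent
w_up_k = rand_matrix(n_heads * head_dim, d_latent, rng)   # latent -> all K heads
w_up_v = rand_matrix(n_heads * head_dim, d_latent, rng)   # latent -> all V heads

latent_cache = []   # one d_latent vector per token; this is all that is stored


def append_token(hidden):
    """Compress the token's hidden state and cache only the latent."""
    latent_cache.append(matvec(w_down, hidden))


def reconstruct_kv():
    """Rebuild full K and V for every cached token at attention time."""
    keys = [matvec(w_up_k, c) for c in latent_cache]
    values = [matvec(w_up_v, c) for c in latent_cache]
    return keys, values


for _ in range(5):
    append_token([rng.gauss(0, 1) for _ in range(d_model)])

keys, values = reconstruct_kv()
stored = len(latent_cache) * d_latent
full = len(keys) * 2 * n_heads * head_dim
print(f"cached floats: {stored}, full K+V floats: {full}")
```

The cache holds `d_latent` numbers per token instead of `2 * n_heads * head_dim`, and the price is the two up-projections in `reconstruct_kv`, which is the projection overhead mentioned above.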

FA4's software-emulated softmax (routing exp() through ALUs instead of SFUs on Blackwell) is more valuable for MLA than for standard attention because MLA's projection step produces attention score distributions that are less numerically stable than standard attention -- the projection introduces additional variance that makes the softmax argument range wider. Wider range means more exp() calls landing in the high-value region where SFU precision matters most. The ALU-based approximation handles this more gracefully at 2.25 PFLOP/s than the SFU-based hardware implementation at its current throughput ceiling.
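For intuition on what "software-emulated" means here: the exponential can be computed with plain multiplies and adds by splitting the argument into an integer part (handled through the floating-point exponent) and a small fractional part (handled by a low-degree polynomial). The sketch below shows only that arithmetic. The cubic uses Taylor coefficients, which is an assumption made for simplicity; a production kernel would use tuned coefficients, and nothing here models ALU or SFU scheduling. `exp2_poly` and `softmax_row` are hypothetical names.

```python
import math

LN2 = math.log(2.0)
# Degree-3 Taylor coefficients of 2**f = exp(f * ln 2). Illustrative only;
# a real kernel would use tuned (e.g. minimax) coefficients.
C1, C2, C3 = LN2, LN2 ** 2 / 2, LN2 ** 3 / 6


def exp2_poly(x):
    """Approximate 2**x using multiply/add plus an exponent adjustment."""
    n = math.floor(x + 0.5)                  # nearest integer, so f is in [-0.5, 0.5)
    f = x - n
    p = 1.0 + f * (C1 + f * (C2 + f * C3))   # Horner evaluation of the cubic
    return math.ldexp(p, n)                  # p * 2**n via the float exponent


def softmax_row(scores):
    """Softmax written in base 2, using the polynomial exponential."""
    log2e = 1.0 / LN2
    m = max(scores)                          # subtract the max for stability
    w = [exp2_poly((s - m) * log2e) for s in scores]
    z = sum(w)
    return [v / z for v in w]


print(softmax_row([2.0, 0.5, -1.0, 0.0]))
```

With the fractional part confined to [-0.5, 0.5), this cubic stays within about 0.1% relative error, which is the kind of accuracy-for-throughput trade the paragraph above describes.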

The MLA + FA4 + 2-bit KV combination is the attention stack that production DeepSeek-V4 deployments on Blackwell use now. MLA reduces KV cache memory by the compression ratio (roughly 4-8x depending on configuration). TurboQuant 2-bit reduces it by another 4x. FA4 gives you 71% hardware utilization instead of the 50-60% you'd get from standard kernels. These three don't add -- they multiply. The serving economics for DeepSeek-class models on Blackwell this week are a different category from what they were at the end of March.

Skip-Softmax attention shipped in SGLang for the FlashInfer TRT-LLM kernel path.

PR #19089. This is the freshest and least-covered thing from this week's releases.

In speculative decoding with tree-based or chunked drafting, the verification pass computes attention for K candidate tokens simultaneously against the same KV prefix. Standard attention: for each candidate, compute a row of the attention score matrix, apply softmax, weight the values. K independent softmax normalizations.

Skip-Softmax observes a mathematical property of adjacent rows in this joint attention computation: when the K candidate tokens are semantically related (which they are in speculative decoding, because they're all continuations of the same prefix), their attention score distributions are correlated. The row sums of exp(QK^T) -- the normalization denominators for the softmax -- are similar across candidates. Similar enough that for adjacent candidates i and i+1, you can reuse the normalization from candidate i to compute the softmax for candidate i+1, accepting a small approximation error, rather than computing the normalization independently.

The error introduced by skipping re-normalization is bounded by the similarity of the score distributions. For speculative decoding with a well-trained draft model -- where the candidates are plausible continuations, not random tokens -- the score distributions are similar enough that the approximation error is below the accept/reject threshold. You accept or reject the same candidates whether you use exact normalization or skip normalization.
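The bound is easy to see in a small example. This is a sketch of the reuse idea as described in this post, not of the SGLang or FlashInfer kernel; the scores, the offsets, and the function names are all invented for illustration. If every score in a candidate row differs from the reference row by at most eps, then normalizing with the reference row's max and row sum gives weights that sum to a value between exp(-eps) and exp(eps) instead of exactly 1.

```python
import math


def softmax_exact(row):
    m = max(row)
    w = [math.exp(s - m) for s in row]
    z = sum(w)
    return [v / z for v in w]


def softmax_reuse(rows):
    """Normalize every row with the max and row sum of the first row only."""
    m0 = max(rows[0])
    z0 = sum(math.exp(s - m0) for s in rows[0])
    return [[math.exp(s - m0) / z0 for s in row] for row in rows]


# A reference row and one candidate row whose scores differ by at most eps.
eps = 0.05
reference = [2.0, 0.5, -1.0, 0.0, 1.5, -0.5]
offsets = [0.05, -0.03, 0.02, -0.05, 0.01, 0.04]
rows = [reference, [s + d for s, d in zip(reference, offsets)]]

approx = softmax_reuse(rows)
for row, weights in zip(rows, approx):
    exact = softmax_exact(row)
    worst = max(abs(a - b) for a, b in zip(weights, exact))
    print(f"sum of weights: {sum(weights):.4f}  max abs error: {worst:.4f}")
```

The first row is exact by construction. The second row's error shrinks as eps shrinks, which is why the quality of the draft model matters: the closer the candidates' score rows are, the less the shared normalization costs.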

The compute saving: on Blackwell, softmax normalization (the exp() and row-sum operations) is the SFU bottleneck that FA4 addressed at the kernel level. Skip-Softmax reduces the number of independent normalizations from K to 1 for a batch of K speculative candidates. At K=4 (four speculative tokens per step, typical for EAGLE-3), that's 4x fewer exp() operations in the verification pass. At K=8 (more aggressive speculation), 8x fewer.

FA4 is already routing exp() to ALUs to avoid SFU saturation. Skip-Softmax reduces the total number of exp() calls regardless of which unit computes them. These two optimizations attack the same bottleneck from different angles and compose: FA4 makes each exp() call cheaper, and Skip-Softmax means there are fewer of them.

The reason I'm writing about these three together is that they form a coherent optimization story for a specific workload class: speculative decoding with MLA models on Blackwell.

TurboQuant 2-bit KV: the KV cache you're caching between speculative decode steps is 4x smaller. More sessions fit per GPU. The memory that was the bottleneck isn't anymore.

FA4 as default MLA prefill: the prefill step that initializes the KV cache for each new session runs at 71% hardware utilization instead of 50-60%. The end-to-end latency per new session is lower.

Skip-Softmax: the verification pass in speculative decoding -- run at every decode step, K times per accepted token batch -- is 4-8x cheaper in exp() operations on Blackwell.

Three separate PRs, three separate research lineages, one model class (DeepSeek-V4 / MLA + speculative decoding on B200), one week of production releases.

Sixty days ago, this stack didn't exist in production.

2-bit KV was a paper. FA4 was research. Skip-Softmax wasn't merged. MLA on FA4 wasn't supported.

Today they're all in the latest vLLM and SGLang releases.

The gap between frontier research and production shipping is narrowing. It used to be 18 months. For these techniques it was 4 months. For some of the kernel work this month it was weeks.

If you're benchmarking a B200-based serving cluster in June 2026 without TurboQuant KV, FA4 MLA, and Skip-Softmax enabled simultaneously, you're not benchmarking what the hardware can actually do. You're benchmarking a cluster running last quarter's software.

P.S. The online quantization frontend that shipped in the same vLLM release (#38138) is the operational piece that makes TurboQuant deployment practical. Before this, enabling quantization required either offline weight conversion (a separate preprocessing step that breaks deployment automation) or manual per-model configuration. The online frontend handles quantization in the serving path dynamically -- you enable it as a serving flag, not a model preprocessing step. For teams with CI/CD pipelines that deploy model updates automatically, the difference between offline and online quantization is the difference between "we can try this" and "we can ship this." It's the implementation detail that determines whether TurboQuant 2-bit goes from "technically available" to "production default" for most teams. The flag is --enable-online-quantization. Turn it on. Measure the accuracy on your specific workload. The perplexity hit is typically within 0.5% for conversational tasks. The capacity gain is 4x.