# Going from batch size 33 to 34 on an H100 SXM5 more than doubles your decode attention latency.

Date: 2026-06-23
Source: https://vanshverma.com/notes/wave-quantization-decode-cliff
Tags: gpu, inference

Not degrades slightly. Not increases meaningfully. More than doubles.

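This is the wave quantization cliff in decode attention, and I haven't seen it written out precisely for the people who are setting batch sizes in their serving configs right now. Let me do that. (The 33-to-34 headline assumes 4 CTAs per request: 33 × 4 = 132 CTAs exactly fills the H100 SXM5's 132 SMs, and request 34 starts a second wave.)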

---

**The mechanism.**

Decode attention launches one CTA (thread block) per KV head group per request. For a GQA model with 8 KV heads, each request in the batch produces 8 CTAs. For a model with 4 KV head groups (a configuration closer to the MQA end of GQA), each request produces 4 CTAs.

An H100 SXM5 has 132 SMs. An H100 PCIe has 114 SMs. A single A100 has 108 SMs.

Take the 4-CTA-per-request case on A100 (108 SMs). At batch size 27: 27 × 4 = 108 CTAs. Exactly fills all 108 SMs. One wave. Every SM has work. Every SM finishes at roughly the same time. The kernel latency is determined by the cost of processing one KV sequence block on one SM.

At batch size 28: 28 × 4 = 112 CTAs. 108 SMs run the first wave of 108 CTAs. Then the 4 remaining CTAs run on 4 SMs while 104 SMs sit idle. The kernel doesn't finish until those 4 stragglers complete, and each straggler does as much work as a first-wave CTA. You're paying for the full second wave just to process 4 CTAs. The wall-clock latency is approximately 2x the single-wave cost.
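The wave arithmetic above fits in a few lines. This is a minimal sketch, assuming one resident CTA per SM and equal work per CTA; the function names `decode_waves` and `last_wave_fill` are mine for illustration, not from any library.

```python
import math

def decode_waves(batch_size: int, ctas_per_request: int, num_sms: int) -> int:
    """Number of waves the kernel needs, assuming one resident CTA per SM."""
    total_ctas = batch_size * ctas_per_request
    return math.ceil(total_ctas / num_sms)

def last_wave_fill(batch_size: int, ctas_per_request: int, num_sms: int) -> float:
    """Fraction of SMs that have work during the final wave."""
    remainder = (batch_size * ctas_per_request) % num_sms
    return 1.0 if remainder == 0 else remainder / num_sms

# A100 (108 SMs), 4 CTAs per request
print(decode_waves(27, 4, 108))  # 1
print(decode_waves(28, 4, 108))  # 2
print(round(last_wave_fill(28, 4, 108), 3))  # 0.037: 4 of 108 SMs busy
```

In this idealized model, latency scales with the wave count, so the 27-to-28 step is the 1x-to-2x jump described above.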

POD-Attention (arXiv:2410.18038) measured this. At batch size 54 (216 CTAs, exactly 2 × 108): two full waves, clean boundary, no stragglers. At batch size 55 (220 CTAs): a third wave with 4 stragglers, and kernel latency increases by more than 25%. Within total attention latency (prefill plus decode), the decode component increases by up to 17% from that one request being added to the batch.

One request crosses the boundary. Total decode attention latency increases by 17-25%. The scheduler didn't know. The monitoring dashboard didn't show it. The user who sent request 55 just had a noticeably worse experience.

---

**Why this matters more at long context.**

The CTA count per request is fixed by the model's KV head structure. The per-CTA work is determined by context length -- each CTA processes the attention over the full KV cache for its assigned heads. At short context (4K tokens), the per-CTA work is fast, and a second wave adds modest absolute latency. At long context (128K tokens), the per-CTA work is slow, and a second wave adds substantial absolute latency.

The specific arithmetic for a Llama-3.1-70B deployment at 128K context on H100 SXM5 (132 SMs; 8 KV heads shared across 64 query heads, so 8 CTAs per request in this config):

Clean wave at batch size 16: 16 × 8 = 128 CTAs, 4 SMs idle (128/132 efficiency = 97%). Near-perfect.

Cliff at batch size 17: 17 × 8 = 136 CTAs, second wave with 4 CTAs. Each of those 4 stragglers still has to walk a full 128K-token KV cache. In the idealized wave model the kernel takes about twice as long as it did at batch size 16, and the absolute penalty is one full per-CTA pass at 128K context, which is substantial.

This is why long-context deployments have wilder P99 latency variance than short-context deployments even at identical batch sizes. The cliff sits at the same batch size regardless of context length, but its absolute latency cost scales with context. A crossing that was tolerable at 4K context becomes painful at 128K because each straggler now does far more work.
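To make the context-length scaling concrete, here is a rough sketch of an idealized latency model: kernel time equals wave count times per-CTA time, with per-CTA time linear in context length. Both the linear cost model and the `cost_per_token` unit are assumptions for illustration, not measurements.

```python
import math

def kernel_time(batch_size, ctas_per_request, num_sms, context_len, cost_per_token=1.0):
    """Idealized kernel time: waves x (context length x per-token cost).

    cost_per_token is an arbitrary, hypothetical unit, not a measured value.
    """
    waves = math.ceil(batch_size * ctas_per_request / num_sms)
    return waves * context_len * cost_per_token

def cliff_penalty(context_len):
    """Extra time from adding request 17 on 132 SMs with 8 CTAs per request."""
    return kernel_time(17, 8, 132, context_len) - kernel_time(16, 8, 132, context_len)

ratio = kernel_time(17, 8, 132, 131072) / kernel_time(16, 8, 132, 131072)
print(ratio)                                       # 2.0
print(cliff_penalty(131072) / cliff_penalty(4096))  # 32.0
```

The relative jump is the same 2x at any context length in this model. What changes is the absolute penalty: 32 times larger at 128K than at 4K, which is what shows up in P99.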

---

**The standard attention kernels don't solve this.**

FlashAttention-1, -2, and -3 all decompose attention into tiles with one tile per CTA. The CTA structure is fixed by the decomposition strategy. The wave quantization behavior is a direct consequence of the CTA structure and the SM count. If your batch × head_groups isn't a multiple of the SM count, the last wave is only partially filled and you have stragglers.

FlashDecoding addresses part of this by splitting the KV sequence across multiple CTAs -- instead of one CTA handling the full KV cache per request per head, you split the KV sequence into chunks and distribute across more CTAs. This gets you more parallelism over the sequence dimension, which helps at very long context. It doesn't eliminate wave quantization -- you still have a fixed CTA count that may or may not align with SM count.
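A quick sketch of why splitting helps without fixing the problem. It assumes the CTA count is batch × KV heads × splits and ignores the cost of the final reduction over splits; `num_splits` is a hypothetical parameter here, since real implementations choose the split count with their own heuristics.

```python
import math

def wave_efficiency(batch_size, kv_heads, num_splits, num_sms):
    """Fraction of SM-wave slots doing useful work across the whole kernel."""
    total_ctas = batch_size * kv_heads * num_splits
    waves = math.ceil(total_ctas / num_sms)
    return total_ctas / (waves * num_sms)

# Batch 17, 8 KV heads, H100 SXM5 (132 SMs)
for splits in (1, 4, 8):
    print(splits, round(wave_efficiency(17, 8, splits, 132), 3))
# 1 0.515
# 4 0.824
# 8 0.916
```

More splits means smaller CTAs and a cheaper partial last wave, so efficiency climbs. It still isn't 1.0 unless the CTA count happens to be a multiple of the SM count.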

The deeper problem: the standard Flash-family approach assigns work to CTAs at kernel launch time. Whatever the CTA count is, that's what the kernel runs with. The SM scheduler takes whatever CTAs are launched and schedules them as waves. You have no mechanism to tell the kernel "distribute work continuously across all available SMs regardless of the natural CTA boundary."

---

**LeanAttention's proof.**

LeanAttention (arXiv:2405.10480, Microsoft) proves a non-obvious mathematical property about softmax that enables a completely different decomposition strategy.

Online softmax -- the trick FlashAttention uses to compute attention without materializing the full attention matrix -- works by maintaining a running maximum and running sum that get updated as new KV blocks are processed. After processing all blocks, you rescale the accumulated output by the final normalization factor.

The key operation: combining the partial results from two independently processed KV chunks. You have an output from chunk A (with its own running max and running sum) and an output from chunk B. To merge them into the correct combined output, you compute the correction factors and rescale. This merge operation is what LeanAttention proves is an associative reduction.

Associativity of the merge operation means: you can process any subset of KV blocks on any SM and later combine the partial results correctly, regardless of the order in which the subsets were processed. The final result is the same whether you process blocks in order, out of order, split differently, or distributed across any number of SMs.
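Here is a small sketch of that merge, in plain Python for a single query. It is the standard online-softmax rescaling written out by hand, not code from the LeanAttention paper; `scores` stands in for precomputed q·k dot products, and all names and toy numbers are mine.

```python
import math

def partial_state(scores, values):
    """Process one KV chunk: (running max, running sum, unnormalized output)."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    l = sum(weights)
    dim = len(values[0])
    o = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, l, o

def merge(a, b):
    """Combine two partial states by rescaling both to the shared max."""
    m_a, l_a, o_a = a
    m_b, l_b, o_b = b
    m = max(m_a, m_b)
    scale_a = math.exp(m_a - m)
    scale_b = math.exp(m_b - m)
    l = scale_a * l_a + scale_b * l_b
    o = [scale_a * x + scale_b * y for x, y in zip(o_a, o_b)]
    return m, l, o

def finalize(state):
    """Apply the final normalization factor."""
    _, l, o = state
    return [x / l for x in o]

# Toy data: 6 KV positions, value dimension 2
scores = [0.5, 2.0, -1.0, 3.5, 0.0, 1.25]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.25, 4.0], [3.0, -2.0]]

a, b, c = (partial_state(scores[i:i + 2], values[i:i + 2]) for i in range(0, 6, 2))
left = finalize(merge(merge(a, b), c))       # (a + b) + c
right = finalize(merge(a, merge(b, c)))      # a + (b + c)
swapped = finalize(merge(c, merge(a, b)))    # different order
reference = finalize(partial_state(scores, values))  # one pass over everything
```

All four results agree up to floating-point rounding, which is the property that lets chunks be processed on any SM in any grouping.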

Why this matters: Stream-K GEMM (the technique for eliminating wave quantization from matrix multiplication) works by distributing work continuously across ALL available SMs rather than assigning fixed tiles. Instead of "SM 0 gets rows 0-15, SM 1 gets rows 16-31" (which creates stragglers whenever the tile count isn't a multiple of the SM count), Stream-K says "each SM processes approximately 1/N of the total work, picking up tiles continuously until the work is exhausted." Because the work is balanced, every SM finishes at nearly the same time and there is almost no straggler tail. This requires that partial results from different SMs can be merged correctly -- which requires associativity of the reduction operation.

For GEMM, the reduction is summation over the inner dimension. Summation is associative. Stream-K works for GEMM.

For attention, the reduction is the softmax-weighted sum over KV blocks. Online softmax's merge operation is what has to be associative. LeanAttention proves that it is. This enables Stream-K style decomposition of attention -- distribute KV block processing continuously across all available SMs, merge partial results at the end.
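A sketch of what "distribute continuously" means in practice: flatten all KV-block work into one list of units and hand each SM a contiguous, near-equal slice. This illustrates the Stream-K idea only; it is not LeanAttention's actual scheduler, and the 256-token block size is a hypothetical choice.

```python
def stream_k_ranges(total_units, num_sms):
    """Split total_units into num_sms contiguous [start, end) ranges
    whose sizes differ by at most one unit."""
    base, extra = divmod(total_units, num_sms)
    ranges = []
    start = 0
    for sm in range(num_sms):
        size = base + (1 if sm < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

# Batch 17, 8 KV heads, 128K context, hypothetical 256-token KV blocks
blocks_per_cta = 131072 // 256          # 512 blocks per (request, head)
total_units = 17 * 8 * blocks_per_cta   # 69632 block-sized work units
ranges = stream_k_ranges(total_units, 132)
sizes = [end - start for start, end in ranges]
print(max(sizes), min(sizes))  # 528 527
```

With fixed CTAs, this same workload needs two waves, so the critical path is 2 × 512 = 1024 blocks. With the even split it is 528. Slices that cross a (request, head) boundary or cover only part of one are exactly where the associative merge is needed.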

The result: near-100% SM occupancy regardless of batch size or context length. The wave quantization cliff disappears because there is no fixed CTA count that creates wave boundaries. The paper reports up to a 2.18x speedup at 256K context compared to FlashDecoding. Not from a better algorithm for the attention computation itself -- from eliminating the idle SM time that the wave structure was creating.

---

**What this means for your serving config.**

This is the kernel-level explanation for something that shows up in your latency percentiles as inexplicable variance.

If you're running FlashAttention-2 or FlashDecoding on an H100 SXM5 at batch sizes in the 30-50 range at long context, and your P99 TTOT (per-token decode latency) is dramatically higher than your P50, check whether your most common batch size is near a wave quantization boundary. The formula: wave k holds up to floor(k × SM count / CTAs per request) requests, and the cliff is the very next request. For H100 SXM5 at 132 SMs with 8 KV head groups per request, the last clean batch sizes are 16, 33, 49, 66... For A100 at 108 SMs with 8 KV head groups: 13, 27, 40, 54... With 4 KV head groups on H100 SXM5 the first boundary is 33, which is where the 33-to-34 headline comes from.
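The boundary arithmetic can be sketched as a one-liner. As before, this assumes one CTA per KV head group per request and one resident CTA per SM; `last_clean_batch_sizes` is an illustrative name.

```python
def last_clean_batch_sizes(num_sms, ctas_per_request, max_waves):
    """Largest batch size that still fits in k waves, for k = 1..max_waves.
    Adding one more request past any of these starts a new wave."""
    return [(k * num_sms) // ctas_per_request for k in range(1, max_waves + 1)]

print(last_clean_batch_sizes(132, 8, 4))  # [16, 33, 49, 66]  H100 SXM5, 8 CTAs/request
print(last_clean_batch_sizes(108, 8, 4))  # [13, 27, 40, 54]  A100, 8 CTAs/request
print(last_clean_batch_sizes(132, 4, 2))  # [33, 66]          H100 SXM5, 4 CTAs/request
```

Plug in your own SM count and per-request CTA count; if you run tensor parallelism, use the KV heads that land on each GPU, not the model total.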

If you're scheduling continuous batches and your most common batch size is 34 on a 4-CTA-per-request model, you're paying the second-wave tax on every single decode step for every single user in that batch. (With 8 CTAs per request it's the third wave, since 33 × 8 = 264 = 2 × 132.) The P99 latency you're attributing to tail workloads might be structural.

LeanAttention is the kernel fix. It's not in vLLM or SGLang as the default attention backend yet -- FlashAttention and FlashDecoding are still the defaults. LeanAttention integration is in progress. FlashInfer has Stream-K style optimizations that partially address this for variable-length sequences.

The monitoring fix, available now: log the actual batch size at each decode step and cross-reference with your TTOT distribution. If the TTOT distribution is bimodal, and the slow mode is made up of steps whose batch size sits just past a wave boundary, you've confirmed the problem. You can then tune your scheduler's maximum batch size to stay at or below the nearest boundary, trading some throughput for dramatically improved P99 latency.
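A minimal sketch of that cross-reference, assuming you already have per-step logs as `(batch_size, latency_ms)` pairs. The log format and the sample numbers below are made up for illustration; they are not real measurements.

```python
import math
from collections import defaultdict
from statistics import median

def median_latency_by_wave(samples, num_sms, ctas_per_request):
    """Group logged decode steps by wave count and report the median latency
    of each group. samples is an iterable of (batch_size, latency_ms)."""
    buckets = defaultdict(list)
    for batch_size, latency_ms in samples:
        waves = math.ceil(batch_size * ctas_per_request / num_sms)
        buckets[waves].append(latency_ms)
    return {waves: median(lat) for waves, lat in sorted(buckets.items())}

# Synthetic log entries (hypothetical values), H100 SXM5 with 8 CTAs per request
samples = [(15, 1.0), (16, 1.1), (17, 2.0), (18, 2.1), (33, 2.2), (34, 3.0)]
report = median_latency_by_wave(samples, num_sms=132, ctas_per_request=8)
for waves, ms in report.items():
    print(waves, ms)
```

If the medians step up with wave count while barely moving inside a wave, the variance is structural and batch-size capping is worth trying.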

---

going from batch size 33 to 34 more than doubles decode attention latency.

the sm scheduler doesn't know. the dashboard doesn't show it. the user who sends request 34 just waits.

lean attention eliminates this by proving that attention's reduction is associative, which enables stream-k style continuous work distribution across all sms regardless of cta count.

the math is in the paper. the fix is in progress. the cliff is in your production system right now.

*log your batch size at each decode step. cross-reference with ttot distribution. if you see bimodal distribution around multiples of sm-count / kv-head-groups, you've found the cliff. schedule around it until lean attention lands as the default.*

---

**P.S.** Wave quantization is the same root problem that the Bullet paper (ASPLOS '26) addressed with SM partitioning between prefill and decode -- prefill CTAs and decode CTAs competing for the same SMs, each creating their own wave quantization pattern, compounding each other's idle time. LeanAttention solves wave quantization within the decode attention kernel specifically. Bullet solves SM allocation between prefill and decode phases more broadly. The two improvements compose: Bullet eliminates inter-phase SM idle time, LeanAttention eliminates intra-decode wave quantization. Running both simultaneously gets you closer to actual SM saturation for the full compute budget. The SM utilization improvement from composing them is additive on independent bottlenecks -- which is what they are.
