blackwell doubled the tensor cores. it did not change the SFUs.
FlashAttention-4 is the most important kernel paper of 2026. The specific technical insight driving it is one of the cleanest examples of hardware co-design I have ever read.
May 30, 2026
Blackwell doubled the tensor core throughput. It did not change the SFU count.
Let me tell you what that means for the attention kernel, because FlashAttention-4 (March 5th, Tri Dao, Princeton, Together AI, Meta, NVIDIA, Colfax Research) is the most important kernel paper of 2026 and the specific technical insight driving it is one of the cleanest examples of hardware co-design I have ever read.
The H100 delivered 1 PFLOP/s of BF16 tensor core throughput. The B200 delivers 2.25 PFLOP/s. 2.25x. The shared memory bandwidth on the B200: unchanged from H100. The Special Function Unit count -- the hardware units that compute exponentials, logarithms, sine, cosine, all the transcendental operations -- unchanged from H100.
You doubled the engine. You did not double the plumbing. You did not double the exhaust.
For a GEMM -- pure matrix multiplication, all tensor cores, no transcendental ops -- this is fine. You get 2.25x. For attention -- two GEMMs with softmax in between, where softmax requires computing an exponential for every element of the attention score matrix -- you do not get 2.25x. You get something substantially less, because the exponential computation is now the bottleneck and the tensor cores are waiting for the SFUs to finish.
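To put a rough number on "substantially less", here is a minimal Amdahl-style sketch. It assumes the exponential work and the GEMM work run back to back with no overlap, which real kernels do not do, and the exponential shares are hypothetical values chosen for illustration, not measurements from the paper.

```python
def attention_speedup(exp_share: float, mma_speedup: float) -> float:
    """Amdahl-style estimate: only the GEMM share of the old runtime gets faster.

    exp_share is the (hypothetical) fraction of the old kernel's time spent
    on exponentials; that part runs at the same speed on the new chip.
    """
    gemm_share = 1.0 - exp_share
    return 1.0 / (exp_share + gemm_share / mma_speedup)


# 2.25x is the tensor core ratio quoted above; the shares are made up.
for exp_share in (0.0, 0.1, 0.25, 0.5):
    speedup = attention_speedup(exp_share, 2.25)
    print(f"exp share {exp_share:.2f} -> overall speedup {speedup:.2f}x")
```

Under this toy model, an exponential share of only 10% already pulls the overall gain from 2.25x down to 2.0x.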
This is the asymmetric hardware scaling problem. Every generation of NVIDIA GPUs since Volta has scaled tensor cores faster than everything else. The gap between tensor core throughput and everything that isn't tensor cores has been growing ever since. FA4 is the paper that names it, quantifies it, and builds an attention kernel specifically designed around it.
The softmax problem in detail.
FlashAttention-1, -2, and -3 all compute softmax in the standard way: compute the attention scores S = Q·K^T (the first GEMM), apply softmax rowwise (exponential of each element, divide by row sum), then multiply by the values, O = softmax(S)·V (the second GEMM). The exponentials in the softmax step run on the SFUs. At H100 hardware ratios, the SFUs were not the bottleneck -- they could keep up. At B200 hardware ratios, the tensor cores finish their GEMM tiles faster than the SFUs can compute the softmax for those tiles. The tensor cores idle. The pipeline stalls.
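For reference, the standard computation can be written out in a few lines of plain Python. This is an untiled sketch for reading, not how any FlashAttention version is implemented; it leaves out the usual 1/sqrt(d) scaling to match the formula in the text, and the `exp_count` return value is an addition that only exists to show one exponential per score element.

```python
import math


def matmul(a, b):
    """Plain GEMM on lists of lists: returns a @ b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention(q, k, v):
    """Reference attention: GEMM, row-wise softmax, GEMM.

    Returns (output, exp_count), where exp_count is the number of
    exponentials evaluated, one per element of the score matrix.
    """
    k_t = [list(col) for col in zip(*k)]
    s = matmul(q, k_t)                       # GEMM 1: S = Q K^T
    p, exp_count = [], 0
    for row in s:
        row_max = max(row)                   # subtract the row max for stability
        e = [math.exp(x - row_max) for x in row]
        exp_count += len(e)
        total = sum(e)
        p.append([x / total for x in e])
    return matmul(p, v), exp_count           # GEMM 2: O = P V


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out, n_exp = attention(q, k, v)
print(out, n_exp)
```

The exponential count grows with the square of the sequence length, the same as the GEMM work, which is why a fixed SFU budget becomes visible once the GEMMs get faster.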
FA4's fix for this is software-emulated exponential.
The exponential function e^x can be approximated with a polynomial expansion. You can compute it on the regular floating point ALUs -- not the SFUs -- using a sequence of multiply-accumulate operations. This is slower per operation than the dedicated SFU instruction, but it runs on ALUs that are not the bottleneck. The total throughput improves because you're moving work from a saturated unit to an idle one.
The specific implementation: FA4 uses a conditional softmax rescaling approach where the exponential is decomposed into e^(floor(x)) × e^(x - floor(x)). The floor term is a table lookup. The fractional term uses a polynomial approximation on the ALUs. The combined operation is faster than waiting for the SFU, at Blackwell hardware ratios.
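The decomposition described above can be sketched in plain Python. Everything specific here is an assumption for illustration: the table range, the degree-6 Taylor coefficients, and the names `EXP_INT_TABLE`, `COEFFS`, and `soft_exp` are not taken from the FA4 source. A tuned kernel would likely use a fitted polynomial with fewer terms, and GPU kernels commonly work in base 2 rather than base e.

```python
import math

# Hypothetical lookup table for e^n at integer n. Softmax inputs are <= 0
# once the row max has been subtracted, so only non-positive n is needed.
MIN_INT = -90
EXP_INT_TABLE = {n: math.exp(n) for n in range(MIN_INT, 1)}

# Taylor coefficients for e^f on [0, 1), highest degree first (illustrative).
COEFFS = [1 / 720, 1 / 120, 1 / 24, 1 / 6, 1 / 2, 1.0, 1.0]


def soft_exp(x: float) -> float:
    """Approximate e^x for x <= 0 as e^floor(x) * e^(x - floor(x))."""
    if x < MIN_INT:
        return 0.0                 # far below anything softmax cares about
    n = math.floor(x)
    f = x - n                      # fractional part, in [0, 1)
    poly = 0.0
    for c in COEFFS:               # Horner's rule: one multiply-add per term
        poly = poly * f + c
    return EXP_INT_TABLE[n] * poly


for x in (-10.0, -3.7, -0.5, 0.0):
    rel_err = abs(soft_exp(x) - math.exp(x)) / math.exp(x)
    print(f"x={x:6.2f}  soft_exp={soft_exp(x):.6e}  rel_err={rel_err:.1e}")
```

The inner loop is nothing but multiply-adds, which is the point: it is the kind of work the regular floating point ALUs can absorb.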
This is the kind of optimization that only makes sense when you know exactly which hardware unit is the bottleneck. On H100, computing softmax on the SFUs is fine. On B200, it's the bottleneck, so you route the computation to ALUs that are otherwise idle. The same kernel can't be optimal on both architectures. This is the core argument for per-generation attention kernel redesigns.
The new memory hierarchy: TMEM.
Blackwell introduces tensor memory (TMEM) -- 256KB of on-chip memory per SM, distinct from shared memory (SMEM), specifically designed to hold intermediate results of tensor core operations. TMEM is warp-synchronous and tightly coupled to the tensor cores. The matrix multiply-accumulate units can write outputs directly to TMEM without consuming registers. The accumulator stays in TMEM across multiple MMA operations rather than cycling through registers.
This changes the register pressure calculus that dominated Hopper kernel design. On H100, deep pipelines required large register files to hold accumulators in flight, which limited occupancy (fewer active warps per SM when each warp holds more registers). On B200, accumulators live in TMEM, not registers. Register pressure from the accumulator disappears. Deeper pipelines and larger tiles become practical without the register spilling that made equivalent Hopper kernels slower than they should have been.
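A back-of-envelope version of the register argument. The 128×128 accumulator shape and the 128-thread warpgroup are illustrative assumptions, not FA3's actual configuration. The 255-registers-per-thread ceiling is CUDA's documented limit, and the 256KB TMEM figure is the one quoted above.

```python
MAX_REGS_PER_THREAD = 255          # CUDA per-thread register limit
TMEM_BYTES_PER_SM = 256 * 1024     # TMEM capacity per SM, as quoted in the text


def accumulator_regs_per_thread(tile_m: int, tile_n: int, threads: int,
                                bytes_per_elem: int = 4) -> int:
    """32-bit registers each thread needs to hold its slice of the accumulator."""
    return tile_m * tile_n * bytes_per_elem // (threads * 4)


def accumulator_bytes(tile_m: int, tile_n: int, bytes_per_elem: int = 4) -> int:
    """Total accumulator footprint, wherever it is stored."""
    return tile_m * tile_n * bytes_per_elem


# Hypothetical Hopper-style case: FP32 accumulator held in registers.
regs = accumulator_regs_per_thread(128, 128, threads=128)
print(f"128x128 FP32 accumulator: {regs} of {MAX_REGS_PER_THREAD} registers per thread")

# Blackwell-style case: a 128x256 FP32 accumulator parked in TMEM instead.
share = accumulator_bytes(128, 256) / TMEM_BYTES_PER_SM
print(f"128x256 FP32 accumulator: {share:.0%} of TMEM, 0 accumulator registers")
```

In the register-resident case, about half of each thread's register budget is gone before softmax state, pointers, or loop counters are counted.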
The new MMA instruction is UMMA -- Unified MMA. UMMA is launched by a single thread rather than requiring coordination across a warpgroup (as Hopper's WGMMA required). This makes warp specialization dramatically more viable: some warps can be dedicated to data movement while others issue UMMA instructions, and the synchronization overhead between them is lower because UMMA is thread-launched rather than warpgroup-launched.
The 2-CTA MMA is the deepest architectural detail in the paper and the most underreported. Blackwell can execute one UMMA operation across a CTA pair in the same cluster, spanning the TMEM of both CTAs. One thread in the leader CTA launches the operation. Both CTAs must stay active while it's in flight. This scales the effective MMA tile to 256×256×16 -- a 64K-element output tile, about 1M multiply-accumulates per instruction -- by splitting M and N dimensions across the pair. At this tile size, the ratio of useful compute to boundary overhead is much higher than at smaller tiles. You do more math per memory access. The arithmetic intensity of the kernel improves.
The largest single CTA UMMA tile on Blackwell: 128×256×16. Hopper's largest WGMMA tile: 64×256×16. With the 2-CTA pairing, FA4 runs on MMA tiles that are roughly 4x larger than what FA3 could use. At larger tiles, the pipeline can run deeper with proportionally lower synchronization overhead.
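The "more math per memory access" claim can be checked with a simple count. This is a minimal sketch under one assumption: each MMA reads an M×K operand and an N×K operand exactly once, and nothing else is counted. The function name is hypothetical, and the model ignores how the 2-CTA mode actually divides operand loads between the pair.

```python
def macs_per_operand_element(m: int, n: int, k: int) -> float:
    """Multiply-accumulates performed per input element read for one m x n x k MMA."""
    macs = m * n * k
    operand_elements = m * k + n * k
    return macs / operand_elements


single_cta = macs_per_operand_element(128, 256, 16)   # largest single-CTA UMMA tile
cta_pair = macs_per_operand_element(256, 256, 16)     # 2-CTA tile
print(f"128x256x16: {single_cta:.1f} MACs per operand element")
print(f"256x256x16: {cta_pair:.1f} MACs per operand element")
print(f"ratio: {cta_pair / single_cta:.2f}x")
```

In this model K cancels out and the intensity is M·N/(M+N), so doubling M from 128 to 256 buys 1.5x more math per element loaded.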
The CuTe-DSL implementation detail matters more than it looks.
FlashAttention-4 is implemented entirely in CuTe-DSL embedded in Python -- not C++ templates, not raw CUDA. 20-30x faster compile times than the traditional C++ template approach. Full expressivity without the template metaprogramming overhead that makes CUTLASS kernels notoriously slow to compile and hard to modify.
This is not a convenience feature. It determines how quickly the kernel can be updated as hardware evolves. FlashAttention-3 was written in C++ templates against the Hopper-specific instruction set. Porting it to Blackwell would have required extensive template refactoring. FA4 in CuTe-DSL means the pipeline architecture is expressed at a higher level of abstraction that maps to both Hopper and Blackwell backends. When the next architecture arrives, the abstraction layer handles the mapping.
The 20-30x compile time improvement also matters for autotuning. One of the core ideas in RL-based kernel optimization (CUDA-L1, CUDA Agent) is using execution feedback to guide search over the optimization space. Fast compilation means more evaluations per unit time. FA4's CuTe-DSL implementation is the right substrate for that search process, whereas C++ template compilation was slow enough to make iterative kernel search impractical.
The numbers: 1,605 TFLOP/s on B200. 71% hardware utilization. 1.3x faster than cuDNN 9.13. 2.7x faster than Triton.
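The utilization figure is just the ratio of the two throughput numbers already quoted in this post. A quick check, assuming the 2.25 PFLOP/s BF16 peak is the denominator being used:

```python
peak_tflops = 2250.0       # B200 BF16 tensor core peak quoted earlier (2.25 PFLOP/s)
achieved_tflops = 1605.0   # FA4 throughput quoted above

utilization = achieved_tflops / peak_tflops
print(f"utilization: {utilization:.1%}")
```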
71% hardware utilization is the number I want to focus on. Most published attention kernels achieve 50-60% on their target hardware. Getting to 71% on B200 requires that you have correctly identified every bottleneck at that hardware generation and addressed each one. The SFU bottleneck, addressed via software-emulated exponential. The register pressure bottleneck, addressed via TMEM. The tile size bottleneck, addressed via 2-CTA MMA. The synchronization bottleneck, addressed via single-thread-launched UMMA.
Each optimization is a response to a specific hardware constraint. None of them generalize to H100. All of them are necessary on B200. This is what hardware-specific kernel co-design actually means at the implementation level -- not "we tuned it for the new chip" but "we identified four separate bottlenecks that didn't exist on the previous chip and built specific solutions for each one."
1.3x over cuDNN means FA4 outperforms NVIDIA's own library on NVIDIA's own hardware by 30%. NVIDIA engineers had access to internal hardware documentation, silicon characterization data, and months of tuning time. Tri Dao's team, working from public specifications and empirical profiling, beat it.
This is not an accident. It is the consequence of correctly understanding the asymmetric scaling problem when NVIDIA's library team was still treating the B200 as a faster H100.
blackwell doubled the tensor cores.
it didn't change the sfus.
the attention kernel that was optimal on h100 is not optimal on b200 because softmax is now the bottleneck.
fa4 routes the exponential computation to the alus. moves the accumulator out of registers into tmem. scales the tile to 256×256 via 2-cta mma. gets to 71% hardware utilization.
the kernel that was right last year is wrong this year because one number changed in the hardware spec.
the asymmetric scaling problem compounds every generation. tensor cores will keep doubling. sfus will not. every attention kernel written without this constraint in mind is leaving an increasing fraction of hardware performance on the table. fa4 is the first kernel that treats the asymmetry as a first-order design constraint. it won't be the last.
P.S. The LPT (Longest Processing Time) scheduling for variable-length batches is the production systems detail buried at the end of the paper. Standard varlen attention kernels process batches in the order they arrive -- which can be badly suboptimal when a long-context decode batch is followed by a short prefill batch. FA4 adds a preprocessing step that sorts batches by maximum per-worktile execution time and creates a virtual-to-actual index mapping. The sorted order reduces load imbalance across SMs. The preprocessing overhead is negligible. The latency variance reduction is not. On heavy-tailed request distributions -- which is what production serving traffic looks like -- the LPT schedule smooths the P99 more than any of the core algorithmic improvements. Most engineers who read the FA4 paper skip this section. That's the section to read first.
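The scheduling idea is easy to see in a toy simulation. This is a sketch of generic LPT ordering, not FA4's scheduler: the cost estimates are hypothetical, "workers" stand in for SMs, and each item is dispatched in order to whichever worker frees up first. The names `lpt_order` and `makespan` are made up for this example.

```python
import heapq


def lpt_order(costs):
    """Virtual-to-actual mapping: virtual slot i runs actual item lpt_order(costs)[i]."""
    return sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)


def makespan(costs, order, num_workers):
    """Dispatch items in `order`; each goes to the worker that frees up first."""
    free_at = [0.0] * num_workers
    heapq.heapify(free_at)
    for i in order:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + costs[i])
    return max(free_at)


# Hypothetical estimated costs: four short items, then one long one arriving last.
costs = [1, 1, 1, 1, 8]
arrival = list(range(len(costs)))

print("arrival order:", makespan(costs, arrival, num_workers=2))
print("LPT order:    ", makespan(costs, lpt_order(costs), num_workers=2))
```

In arrival order the long item starts last and finishes at time 10 while the other worker sits idle. Sorted longest-first, it starts immediately and everything is done by time 8.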