
DiffusionGemma doesn't accelerate text generation by being a smarter model. It accelerates it by using GPU hardware in a completely different mode.

Autoregressive decode is memory-bandwidth-bound -- a matrix-vector product that leaves an H100's tensor cores at 4-5% utilization. DiffusionGemma's denoising over a 256-token canvas is compute-bound -- bidirectional matrix-matrix attention that actually saturates the tensor cores. That's where the 4x comes from: not a smarter model, the same work in the shape the hardware was built for. 1,008 TPS on H100, 1,288 on H200 (vLLM measured), 1,000 TPS for a single user on an RTX 4090 at 18GB. Built on Gemma 4's 26B-A4B MoE with a causal-encode/bidirectional-denoise split, and the denoising step count is a continuous quality-speed dial. Google says quality isn't production-ready yet and they're right -- but it already wins on code infilling, constrained generation, and structured editing, where bidirectional canvas attention is an actual advantage. Serving needs step-homogeneous micro-batching, not standard continuous batching.

June 24, 2026

This distinction matters and nobody has said it clearly in two weeks of coverage.

Autoregressive decode is memory-bandwidth-bound. Every token generation step loads the full model weight matrices from HBM and multiplies them against a single query vector -- a matrix-vector product. Your H100 has 3.35 TB/s of HBM bandwidth and roughly 1 PFLOPS of dense BF16 tensor core throughput (the ~2 PFLOPS figure on the spec sheet assumes structured sparsity). At decode, you're using the bandwidth. The tensor cores are mostly idle because a matrix-vector product performs only about one FLOP per byte of weights it loads, far too little arithmetic intensity to saturate them. This is why decode on an H100 at small batch sizes sits at 4-5% compute utilization. The hardware was designed for matrix-matrix. Decode gives it matrix-vector.

DiffusionGemma's denoising step over a 256-token canvas is compute-bound. You're running bidirectional attention over 256 tokens simultaneously -- the canvas -- attending to the full KV cache of the prompt plus bidirectionally attending to all 256 canvas positions against each other. This is a matrix-matrix operation. The tensor cores run at actual utilization. HBM bandwidth is not the constraint.
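To put rough numbers on the matrix-vector versus matrix-matrix contrast, here is a minimal roofline sketch. It is an idealized model, not a measurement. The 4096 x 4096 layer shape is an arbitrary illustration, the 989 TFLOPS dense BF16 peak is my assumption for an H100 SXM, and the model ignores attention, MoE routing, and kernel overheads.

```python
def matmul_intensity(rows: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for a (rows x k) @ (k x n) product."""
    flops = 2 * rows * k * n                    # one multiply and one add per term
    elems_moved = rows * k + k * n + rows * n   # read activations and weights, write output
    return flops / (bytes_per_elem * elems_moved)


def roofline_fraction(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Fraction of peak compute an idealized roofline allows at this intensity."""
    return min(1.0, intensity * bandwidth / peak_flops)


H100_BANDWIDTH = 3.35e12  # bytes/s, the HBM figure quoted above
H100_PEAK_BF16 = 989e12   # FLOP/s, assumed dense BF16 peak (no sparsity)

# Hypothetical 4096 x 4096 weight matrix, BF16 everywhere.
decode = matmul_intensity(rows=1, k=4096, n=4096)    # one token: matrix-vector
canvas = matmul_intensity(rows=256, k=4096, n=4096)  # 256-token canvas: matrix-matrix

decode_frac = roofline_fraction(decode, H100_PEAK_BF16, H100_BANDWIDTH)
canvas_frac = roofline_fraction(canvas, H100_PEAK_BF16, H100_BANDWIDTH)

print(f"decode: {decode:.2f} FLOP/byte, roofline allows {decode_frac:.1%} of peak")
print(f"canvas: {canvas:.2f} FLOP/byte, roofline allows {canvas_frac:.1%} of peak")
```

Under these assumptions the single-row case lands near 1 FLOP per byte and the 256-row case lands two orders of magnitude higher. Where exactly the crossover to compute-bound sits depends on layer shapes and precision, so treat the percentages as directional.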

1,000 tokens per second on a single H100 with FP8 -- measured independently by the vLLM team at 1,008 TPS, 1,288 on H200. Not because DiffusionGemma does less work. Because it does the work in a shape that the hardware was designed for.


The architecture, precisely.

DiffusionGemma is built on the Gemma 4 26B-A4B MoE backbone -- 26B total parameters, 3.8B active per step, 128 experts, 8 activated per token. Google grafted a diffusion head onto this backbone and changed the attention mechanism from causal to a split architecture that I want to explain carefully.

The model has two phases.

Encoding: the prompt is processed with causal (autoregressive) attention. The encoder is standard -- each token attends only to prior tokens, left-to-right. This produces a KV cache that captures the prompt context.

Denoising: the generation canvas -- 256 positions, initialized with noise -- is processed with bidirectional attention. Every position in the canvas attends to every other position in the canvas, AND to the prompt KV cache from the encoder. The model predicts less noisy tokens at each position simultaneously.

This is not standard masked diffusion (LLaDA, Dream), which runs bidirectional attention over the full context at once. This is not standard text generation, which is causal from start to finish. It's a specific hybrid: causal for the prompt because prompts have left-to-right semantic structure that benefits from causal processing, bidirectional for the canvas because denoising is inherently non-causal -- the corrected version of token 128 should be able to influence the corrected version of token 3.
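The causal-prompt, bidirectional-canvas split is easiest to see as an attention mask. This is a minimal sketch of the pattern described above for a single block, not DiffusionGemma's actual implementation. The function name and the boolean-list representation are my own.

```python
from typing import List


def hybrid_mask(prompt_len: int, canvas_len: int) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    Positions 0..prompt_len-1 are the prompt, the rest are the canvas.
    """
    total = prompt_len + canvas_len
    mask = []
    for q in range(total):
        if q < prompt_len:
            # Prompt token: causal, sees itself and earlier prompt tokens only.
            row = [k <= q for k in range(total)]
        else:
            # Canvas token: sees the whole prompt and every canvas position.
            row = [True] * total
        mask.append(row)
    return mask


for row in hybrid_mask(prompt_len=3, canvas_len=2):
    print("".join("x" if allowed else "." for allowed in row))
# x....
# xx...
# xxx..
# xxxxx
# xxxxx
```

The top-left triangle is the ordinary causal prompt. The two full rows at the bottom are the canvas, which can see everything, including canvas positions to its right.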

The block-autoregressive structure handles variable length generation: generate canvas block 1 (256 tokens), commit to KV cache, generate canvas block 2 (256 tokens conditioned on block 1 via its KV cache), repeat until the model signals completion. The AR structure across blocks is causal. The denoising structure within each block is bidirectional. Variable length generation works naturally because you keep adding blocks until you're done.
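The outer loop can be sketched in a few lines. Everything here is illustrative: `denoise_block` stands in for the full multi-step denoiser, the `EOS` id is hypothetical, and a plain Python list stands in for the committed KV cache.

```python
from typing import Callable, List

EOS = 0  # hypothetical end-of-sequence token id


def generate(
    prompt: List[int],
    denoise_block: Callable[[List[int], int], List[int]],
    block_size: int = 256,
    max_blocks: int = 8,
) -> List[int]:
    """Block-autoregressive generation: denoise a block, commit it, repeat."""
    context = list(prompt)  # stands in for the committed KV cache
    output: List[int] = []
    for _ in range(max_blocks):
        block = denoise_block(context, block_size)  # bidirectional inside the block
        if EOS in block:
            output.extend(block[: block.index(EOS)])  # keep tokens before EOS, then stop
            break
        output.extend(block)
        context.extend(block)  # commit: later blocks see this block causally
    return output


def toy_denoiser(context: List[int], block_size: int) -> List[int]:
    """Stand-in for the model: counts upward and signals completion once context is long."""
    start = len(context) + 1
    block = list(range(start, start + block_size))
    if len(context) >= 8:
        block[2] = EOS
    return block


tokens = generate([101, 102, 103, 104], toy_denoiser, block_size=4)
print(tokens)  # [5, 6, 7, 8, 9, 10]
```

The first block is emitted whole and committed. The second block contains the end marker at index 2, so only its first two tokens survive and the loop stops.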


The denoising step count is the quality lever -- and the thing that makes the 4x headline require context.

Each 256-token block requires N denoising steps to produce. N is the primary quality-speed tradeoff. More steps = better output = slower generation. Fewer steps = faster but lower quality.

DiffusionGemma's model card doesn't commit to a specific step count because it's configurable. The 1,000 tokens/second measurement is at a specific step count, presumably one chosen so the quality loss stays acceptable for the speed gained. The "4x faster than Gemma 4" headline is measured at a step count where the quality delta is acceptable. At higher step counts (where quality approaches autoregressive), the speedup is lower.

This is structurally similar to the speculative decoding tradeoff: drafting more tokens per step raises throughput when the drafts are accepted, but lowers the acceptance rate. DiffusionGemma's step count is the same kind of dial. It's not a binary fast/slow switch. It's a knob that the serving system can tune based on the latency SLO for the workload.

The specific implication: at step counts where DiffusionGemma reaches Gemma 4 quality parity, the speedup may be closer to 2x than 4x. The 4x number holds at a step count tuned for DiffusionGemma's natural strengths (structured outputs, constraint satisfaction, code infilling), where it can be faster AND better. On tasks where autoregressive models have natural advantages (strict sequential reasoning, step-by-step math), the quality comparison shifts and the step count you need to match quality is higher.
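The arithmetic behind the dial is simple enough to write down. This is a back-of-envelope sketch that assumes every denoising step costs the same and ignores prompt encoding time. The 5 ms step latency is a made-up placeholder, not a measured value.

```python
def block_tps(block_size: int, steps: int, step_latency_s: float) -> float:
    """Tokens per second for one block that needs `steps` denoising passes."""
    return block_size / (steps * step_latency_s)


def max_steps_within_slo(block_size: int, step_latency_s: float, min_tps: float) -> int:
    """Largest step count that still meets a minimum tokens-per-second target."""
    return int(block_size / (min_tps * step_latency_s))


STEP_LATENCY_S = 0.005  # hypothetical per-step latency, not a measurement

for steps in (16, 32, 64):
    print(f"{steps} steps: {block_tps(256, steps, STEP_LATENCY_S):.0f} tokens/s")

budget = max_steps_within_slo(256, STEP_LATENCY_S, min_tps=500)
print(f"step budget at 500 tokens/s: {budget}")
```

Throughput is inversely proportional to the step count in this model, so doubling the steps to chase quality halves the speedup. That is the 4x-to-2x slide in one line.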

Google's model card is explicit: "DiffusionGemma scores lower than Gemma 4 on MMLU and coding evaluations. We recommend Gemma 4 for production use cases that prioritize output quality over generation speed."

This is the most honest sentence in any model release this year. The model is experimental. The benchmark gap is real. The speed advantage is real. Both are true simultaneously.


What DiffusionGemma is actually better at.

The bidirectional canvas attention is a genuine architectural advantage for specific task types that the autoregressive structure can't match.

Code infilling: given a function signature and a body stub, fill in the implementation. Standard LLMs handle this by processing the full context (signature + stub) autoregressively and generating left-to-right. DiffusionGemma can attend to BOTH the left context (signature) and right context (stub terminator) simultaneously while generating the implementation. The canvas knows what comes after the fill region. The output can be globally consistent with both constraints.

Text editing: given a draft paragraph with a marked region to revise, produce a better version of the marked region. Autoregressive models have to process the full context, including the text that comes after the edit region, and then generate the edit left-to-right without directly attending to what follows. DiffusionGemma's canvas attends to the full context bidirectionally while denoising the edit region. The edit can be globally optimized against the post-edit context.

Constraint satisfaction: generate text that satisfies multiple positional constraints (this word must appear at position N, this phrase must appear near position M). Autoregressive models struggle with constraints on future positions. Diffusion can initialize the canvas with the constraints and denoise the rest around them.
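Here is a toy sketch of "initialize the canvas with the constraints and denoise around them". It is not DiffusionGemma's API. The `propose` callable is a hypothetical stand-in for one model denoising pass, and re-clamping the pinned positions after every step is my assumption about how such constraints would be enforced.

```python
from typing import Callable, Dict, List, Optional

MASK = None  # marks a canvas position that is still noise

Canvas = List[Optional[str]]


def init_canvas(length: int, pinned: Dict[int, str]) -> Canvas:
    """Start from an all-noise canvas with the constrained positions filled in."""
    return [pinned.get(i, MASK) for i in range(length)]


def denoise(
    canvas: Canvas,
    pinned: Dict[int, str],
    propose: Callable[[Canvas, int], List[str]],
    steps: int,
) -> Canvas:
    """Run `steps` denoising passes, re-clamping pinned positions after each one."""
    for step in range(steps):
        proposal = propose(canvas, step)  # the model sees the pinned tokens on both sides
        canvas = [pinned.get(i, proposal[i]) for i in range(len(canvas))]
    return canvas


def toy_propose(canvas: Canvas, step: int) -> List[str]:
    """Stand-in for the model: proposes a placeholder token for every position."""
    return [f"tok{i}" for i in range(len(canvas))]


pinned = {0: "def", 5: "return"}
result = denoise(init_canvas(6, pinned), pinned, toy_propose, steps=4)
print(result)  # ['def', 'tok1', 'tok2', 'tok3', 'tok4', 'return']
```

The point of the sketch is the control flow: the pinned tokens are visible to every denoising pass and survive every pass, which is the part a left-to-right decoder has no clean way to express.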

These are the tasks where the quality comparison reverses. DiffusionGemma may outperform Gemma 4 on structured constrained generation at the same step counts that produce lower MMLU scores. The benchmark that captures this is not MMLU.


The serving implications nobody has spelled out.

Continuous batching doesn't work the same way for DiffusionGemma.

In autoregressive serving, continuous batching works because every active request at every decode step is doing the same operation -- run one forward pass, get one new token. You can freely mix requests at different generation positions in the same batch.

In DiffusionGemma serving, active requests need to be at the same denoising step within their current canvas block. If request A is on denoising step 32 of its first canvas block and request B is on denoising step 15 of its second canvas block, they cannot be trivially batched because they're at different denoising steps. The model's step conditioning (analogous to AdaLN timestep conditioning in image diffusion) varies per request.

This is the diffusion-aware scheduler problem I wrote about in the video world model serving post months ago. The serving framework for DiffusionGemma should use step-homogeneous micro-batching: group requests by their current denoising step, process them together, advance them all to the next step. Requests entering the system get queued until enough have accumulated to form an efficient batch at the same starting step.
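A minimal sketch of step-homogeneous micro-batching, under my own assumptions: requests are grouped purely by denoising step (two requests on different blocks but the same step share a batch), and every block uses the same number of steps. The `Request` fields and function names are illustrative and are not vLLM's scheduler API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    req_id: str
    block: int  # which canvas block the request is on
    step: int   # current denoising step within that block


def step_homogeneous_batches(requests: List[Request], max_batch: int) -> List[List[Request]]:
    """Group requests so every batch shares one denoising step."""
    by_step: Dict[int, List[Request]] = defaultdict(list)
    for req in requests:
        by_step[req.step].append(req)
    batches = []
    for step in sorted(by_step):
        group = by_step[step]
        for i in range(0, len(group), max_batch):  # split oversized groups
            batches.append(group[i : i + max_batch])
    return batches


def advance(batch: List[Request], steps_per_block: int) -> None:
    """Move a batch forward one step, rolling over to the next block when done."""
    for req in batch:
        req.step += 1
        if req.step == steps_per_block:
            req.block += 1
            req.step = 0


active = [Request("A", block=0, step=32), Request("B", block=1, step=15), Request("C", block=0, step=15)]
batches = step_homogeneous_batches(active, max_batch=8)
print([[r.req_id for r in b] for b in batches])  # [['B', 'C'], ['A']]
```

Request A from the example above ends up alone at step 32 while B and C share a batch at step 15. A real scheduler also has to decide how long to hold a lone request waiting for company, which is where the latency cost of step-synchronization shows up.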

vLLM added DiffusionGemma support specifically -- it's the first diffusion LLM natively in vLLM's framework. How well vLLM's continuous batching scheduler adapts to the step-homogeneity requirement will determine whether production DiffusionGemma serving achieves the 1,000 TPS headline under real multi-user load or whether the scheduler overhead from step-synchronization becomes the bottleneck.

At small batch sizes (1-2 concurrent users), DiffusionGemma's compute-bound profile gives it a massive advantage over autoregressive models. At large batch sizes (100+ concurrent users) where autoregressive models batch effectively, the advantage narrows because autoregressive models also become more compute-bound at large batch size. DiffusionGemma's sweet spot is exactly the single-user local inference case that Google specifically targeted: "Optimized for Small Batch Size Inference -- specifically engineered for low-latency, high-speed generation on a single capable accelerator."
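How large is "large" for the autoregressive side? A rough estimate, assuming weight traffic dominates (KV cache reads and MoE routing are ignored) and using the same assumed H100 figures of 3.35 TB/s and 989 TFLOPS dense BF16:

```python
import math


def ar_decode_intensity(batch_size: int, bytes_per_weight: float = 2.0) -> float:
    """Approximate FLOPs per byte of weights read: each weight does 2 FLOPs per sequence."""
    return 2.0 * batch_size / bytes_per_weight


def batch_to_reach_ridge(peak_flops: float, bandwidth: float, bytes_per_weight: float = 2.0) -> int:
    """Smallest batch size whose decode intensity reaches the roofline ridge point."""
    ridge = peak_flops / bandwidth  # FLOPs per byte where compute and bandwidth balance
    return math.ceil(ridge * bytes_per_weight / 2.0)


crossover = batch_to_reach_ridge(peak_flops=989e12, bandwidth=3.35e12)
print(f"intensity at batch 1: {ar_decode_intensity(1):.1f} FLOP/byte")
print(f"batch size to reach the ridge: {crossover}")
```

In this simplified model an autoregressive decoder needs a few hundred concurrent sequences before its matmuls stop being bandwidth-bound, which is why the gap is widest at one or two users and narrows under heavy load.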

An RTX 4090 running DiffusionGemma at 18GB VRAM gets ~1,000 TPS for a single user. An RTX 4090 running Llama-3-70B at 4-bit quant gets ~30-50 TPS for a single user. That gap is the hardware utilization story.


the 4x faster claim is true.

it comes from shifting from matrix-vector (memory-bandwidth-bound) to matrix-matrix (compute-bound) operations per generation step.

the hardware was designed for matrix-matrix. autoregressive decode gives it matrix-vector. diffusion gives it matrix-matrix.

google says the quality isn't there yet for production. they're right. the architecture is right. the quality will follow.

the tasks where diffusiongemma already wins: code infilling, constrained text generation, structured editing. these are the tasks where bidirectional context during generation is an actual advantage, not just an artifact of the architecture. benchmark suites don't capture them well. your workload might.


P.S. The 18GB VRAM at NVFP4 on an RTX 4090 is the most interesting deployment story in this release and has gotten no coverage relative to the cloud benchmarks. DiffusionGemma running locally on consumer hardware at 1,000 TPS means: zero cloud latency, zero API cost, zero data privacy concern, offline capability. For the specific tasks where DiffusionGemma's quality is competitive (structured constrained generation, code infilling), this is a production deployment story for teams that couldn't previously run frontier-adjacent local inference at this speed. The 18GB number was not accidental. Google specifically targeted consumer GPU deployment. The RTX 5090 and 4090 are the primary platforms NVIDIA announced DiffusionGemma optimization for. That's the market.
