
Open an Nsight profile on a DeepSeek-R1 decode workload. Find the MoE Dispatch/Combine section. Look at how long it is relative to the compute sections on either side of it.

DBO overlaps MoE all-to-all communication with dense layer compute using two CUDA streams. A 25% reduction in decode latency from one flag. The tensor cores had been idle during that communication window the whole time.

May 20, 2026

That gap is what I want to talk about.

The compute before it finishes fast. Attention, dense layers, everything non-MoE -- done in milliseconds. Then the all-to-all dispatch kicks in. Tokens route to their selected experts on remote GPUs. The combine gathers results back. The GPU waits. The profiling trace shows a long flat section where nothing is computing while the collective runs.

The compute load during MoE dispatch/combine is negligible -- the GPUs aren't doing significant arithmetic during that window. They're moving data. And while they're moving data, the tensor cores are idle.

On a WideEP deployment of DeepSeek-R1 at decode time, this communication window is not a rounding error. It is the dominant term in per-layer latency. You bought H100s for the tensor cores. You are using the network.


The fix shipped in vLLM behind --enable-dbo. One flag. I want to explain the mechanism because it's genuinely clever and because the failure modes are specific and non-obvious.

DBO -- Dual Batch Overlap -- splits the decode batch into two microbatches and runs them on two CUDA streams with two worker threads. The key insight: microbatch A's all-to-all dispatch and microbatch B's dense layer compute use different hardware resources. Collective communication goes through NVLink/IB. Dense compute uses tensor cores. They do not compete.

So run them simultaneously.

The execution pattern with DBO: microbatch A launches its dispatch all-to-all and yields to microbatch B's thread. B runs its dense compute layers while A's dispatch is in flight. When A's dispatch completes, B yields back. A runs its expert compute while B launches its own dispatch. Then A's combine runs while B computes. The communication of one microbatch overlaps with the computation of the other throughout the entire decode step.
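The handoff is easier to see in code than in prose. Here is a minimal sketch of the ordering only, assuming a single MoE layer and using Python generators as stand-ins for the two worker threads and CUDA streams. The names `microbatch` and `run_interleaved` are mine, not vLLM's, and nothing here touches a GPU.

```python
from typing import Generator, List


def microbatch(name: str, log: List[str]) -> Generator[None, None, None]:
    """One microbatch's path through a single MoE layer.

    Each `yield` marks a point where an asynchronous all-to-all has been
    launched and control is handed to the other microbatch.
    """
    log.append(f"{name}: dense compute")
    log.append(f"{name}: launch dispatch")
    yield  # dispatch in flight; let the other microbatch use the GPU
    log.append(f"{name}: expert compute")
    log.append(f"{name}: launch combine")
    yield  # combine in flight; hand control over again
    log.append(f"{name}: finish layer")


def run_interleaved(names: List[str]) -> List[str]:
    """Round-robin between microbatches, switching at every yield."""
    log: List[str] = []
    pending = [microbatch(n, log) for n in names]
    while pending:
        still_running = []
        for gen in pending:
            try:
                next(gen)
                still_running.append(gen)
            except StopIteration:
                pass  # this microbatch has finished the layer
        pending = still_running
    return log


if __name__ == "__main__":
    for line in run_interleaved(["A", "B"]):
        print(line)
```

In the printed order, B's dense compute lands right after A launches its dispatch, and A's expert compute lands right after B launches its own. That is the whole trick: every communication launch is followed by the other microbatch's compute.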

The profiling trace after DBO looks completely different. The flat communication gap collapses. Compute and communication fill the same wall-clock window rather than running sequentially. 25% decode latency reduction on DeepSeek-R1 workloads. Not from a new algorithm. Not from better hardware. From scheduling two things simultaneously that were previously running one after the other for no fundamental reason.
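A back-of-the-envelope model shows where the saving comes from. This is an idealized sketch, not a measurement: it assumes that halving the batch halves both compute and communication time (optimistic for memory-bound decode), that compute and communication never contend, and that switching microbatches is free. The function names and the numbers in the example are illustrative.

```python
def sequential_ms(compute_ms: float, comm_ms: float, layers: int) -> float:
    """No overlap: every layer pays for compute, then for communication."""
    return layers * (compute_ms + comm_ms)


def overlapped_ms(compute_ms: float, comm_ms: float, layers: int) -> float:
    """Two half-size microbatches, one offset from the other.

    In steady state the busier resource sets the pace. The less busy one
    adds half a phase of pipeline fill and drain at the edges.
    """
    steady_state = layers * max(compute_ms, comm_ms)
    fill_and_drain = min(compute_ms, comm_ms) / 2
    return steady_state + fill_and_drain


if __name__ == "__main__":
    # Illustrative units only: 4 of compute and 2 of communication per layer.
    print(sequential_ms(4, 2, 2))   # 12
    print(overlapped_ms(4, 2, 2))   # 9.0
```

The upper bound on the saving is whichever of the two terms is smaller. If communication dominates, DBO hides the compute behind it. If compute dominates, it hides the communication. Either way the sum becomes roughly a max.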


This is DeepSeek's DualPipe applied to inference.

DualPipe was DeepSeek's solution to pipeline parallelism bubbles in training -- overlapping the forward pass of one microbatch with the backward pass of another to keep pipeline stages continuously occupied. The idea of splitting work into two offset microbatches to hide communication behind computation is the same principle. vLLM's DBO takes it from training pipeline parallelism to inference decode MoE communication. The communication pattern is different. The insight is identical.


The non-obvious failure mode: DBO requires both microbatches to be non-empty.

vLLM's scheduler does a collective all_reduce across all DP ranks before each decode step to agree whether microbatching will be applied. If any rank would end up with an empty second microbatch after the batch is split, microbatching is disabled for all ranks. No overlap. Standard sequential execution.

At low batch sizes -- which is exactly the regime where decode latency matters most, because you're serving individual user requests, not saturating throughput -- the batch might not split cleanly. A rank holding 7 decode tokens splits into microbatches of 3 and 4. Both non-empty, DBO fires. A rank holding 3 tokens splits into 1 and 2. Still non-empty. A rank holding 1 token can't be split, and because the decision is collective, that one rank disables DBO for every rank in the step.
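The agreement step can be sketched in a few lines. This is a simulation under stated assumptions, not vLLM's code: the ranks are a plain Python list, `min()` stands in for an all_reduce with a MIN op, and the even split with the first microbatch taking the extra token is my guess at the split rule. Both function names are hypothetical.

```python
from typing import List


def second_microbatch_size(num_tokens: int) -> int:
    """Tokens left for the second microbatch after an even split.

    Assumption: the first microbatch takes the extra token on odd counts.
    """
    first = (num_tokens + 1) // 2
    return num_tokens - first


def all_ranks_agree_to_microbatch(tokens_per_rank: List[int]) -> bool:
    """Each rank votes 1 if it can split; the group takes the minimum.

    min() over the votes stands in for an all_reduce(MIN) across DP ranks.
    """
    votes = [int(second_microbatch_size(n) > 0) for n in tokens_per_rank]
    return min(votes) == 1


if __name__ == "__main__":
    print(all_ranks_agree_to_microbatch([7, 3]))  # True
    print(all_ranks_agree_to_microbatch([7, 1]))  # False: one rank cannot split
```

The second call is the case that bites in production: one rank with a single token vetoes overlap for a rank that had plenty of work to split.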

The threshold is configurable via --dbo-decode-token-threshold. Below that threshold, the scheduler doesn't attempt microbatching. The default is set conservatively. If you have insight into your traffic distribution -- if you know your p10 batch size at decode time -- you can tune this down and capture overlap at lower batch sizes than the default captures.
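If you log decode batch sizes, getting a p10 out of them is a one-liner. The sketch below is a heuristic of my own for picking a candidate value to benchmark, not a vLLM recommendation. `suggest_decode_threshold` and its `floor` parameter are hypothetical, and the floor of 2 reflects only the fact that a single token cannot be split into two microbatches.

```python
import statistics
from typing import Sequence


def suggest_decode_threshold(batch_sizes: Sequence[int], floor: int = 2) -> int:
    """Return a candidate threshold near the 10th percentile of batch sizes.

    statistics.quantiles(..., n=10) returns nine cut points; the first one
    is the 10th percentile.
    """
    p10 = statistics.quantiles(batch_sizes, n=10, method="inclusive")[0]
    return max(floor, int(p10))


if __name__ == "__main__":
    observed = list(range(1, 101))  # stand-in for logged decode batch sizes
    print(suggest_decode_threshold(observed))  # 10
```

Treat the result as a starting point for a benchmark sweep. Whether overlap pays off at that batch size depends on your hardware and backend, and only a trace will tell you.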

The backend also matters. --all2all-backend deepep_low_latency is the backend that makes DBO worth enabling. It uses NVLink for intra-node expert communication with native CUDA stream support, which is what lets the overlap actually happen. deepep_high_throughput -- the InfiniBand backend for inter-node communication -- has different overlap characteristics and the performance gain from DBO is lower. If your EP group fits within a single node (which it does at EP width 8 or less on NVL8, or 16 or less on a dual-node NVLink setup), use deepep_low_latency. If it spans nodes, benchmark before assuming DBO helps.


The load imbalance story is the second half of this problem and it compounds with DBO in a way that isn't obvious.

Experts are balanced at training time -- the load balancing loss during training pushes the router toward even token distribution across all experts. At inference time, real workloads don't distribute evenly. A query about Python code routes heavily to certain experts. A query about French poetry routes to different ones. The training-time balance doesn't hold.

At WideEP with high parallelism degree, load imbalance means some GPUs in the EP group are processing 3x their expected token count per step while others are nearly idle. The step wall-clock time is determined by the slowest GPU. You're paying for the overloaded GPU's latency while the underloaded GPUs sit idle -- and you're paying for this inside the very communication window DBO is trying to hide.
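The cost of that imbalance is simple arithmetic. A small sketch, assuming step time scales with the token count on the busiest GPU. The per-GPU counts below are made up to match the 3x case, and the function names are mine.

```python
from typing import Sequence


def imbalance_ratio(tokens_per_gpu: Sequence[int]) -> float:
    """Busiest GPU's load relative to the mean load (1.0 is perfect)."""
    mean = sum(tokens_per_gpu) / len(tokens_per_gpu)
    return max(tokens_per_gpu) / mean


def idle_fraction(tokens_per_gpu: Sequence[int]) -> float:
    """Share of total GPU time spent waiting on the busiest GPU."""
    return 1.0 - 1.0 / imbalance_ratio(tokens_per_gpu)


if __name__ == "__main__":
    loads = [300, 100, 100, 100, 50, 50, 50, 50]  # illustrative token counts
    print(imbalance_ratio(loads))  # 3.0
```

At a ratio of 3.0 the step takes three times as long as a perfectly balanced one would, and two thirds of the group's GPU time goes to waiting.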

The hierarchical load balancer in vLLM monitors token routing in real time and reshuffles expert assignments to balance load across GPU ranks. Not at restart time. Not at config time. Each decode step, if the imbalance exceeds a threshold, it rebalances. 12-18% throughput improvement on real heterogeneous workloads where some queries are disproportionately expert-hungry.
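To make the idea of rebalancing concrete, here is a deliberately simple stand-in: a greedy longest-processing-time assignment that places the heaviest expert on the currently lightest rank. This is not vLLM's hierarchical balancer. It ignores node topology, does not keep the number of experts per rank equal, and the `assign_experts` name and the load numbers are illustrative.

```python
import heapq
from typing import Dict, List


def assign_experts(expert_load: Dict[int, int], num_ranks: int) -> List[List[int]]:
    """Greedy placement: heaviest expert first, onto the least loaded rank."""
    heap = [(0, rank) for rank in range(num_ranks)]  # (total load, rank id)
    heapq.heapify(heap)
    placement: List[List[int]] = [[] for _ in range(num_ranks)]
    by_load_desc = sorted(expert_load.items(), key=lambda kv: -kv[1])
    for expert, load in by_load_desc:
        total, rank = heapq.heappop(heap)
        placement[rank].append(expert)
        heapq.heappush(heap, (total + load, rank))
    return placement


if __name__ == "__main__":
    observed = {0: 90, 1: 60, 2: 30, 3: 20, 4: 10, 5: 10}  # tokens per expert
    print(assign_experts(observed, num_ranks=2))  # [[0, 3], [1, 2, 4, 5]]
```

Both ranks end up with 110 tokens, where a naive contiguous split of three experts each would have given 180 and 40. A real balancer also has to weigh the cost of moving expert weights, which this sketch leaves out.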

DBO and dynamic load balancing are independent improvements that compose. DBO hides the communication latency of a balanced dispatch. Dynamic load balancing reduces the tail latency from an imbalanced one. If you're running WideEP on DeepSeek-class models without both enabled, you're leaving a significant fraction of the available performance on the table.


the profiling trace is the fastest way to understand this.

run deepseek-r1 decode without dbo. look at the moe dispatch/combine section. measure how long the gpu is idle waiting for collective communication.

enable --enable-dbo --all2all-backend deepep_low_latency. run again. look at the same section.

the communication time doesn't shrink. it overlaps with compute. same window on the timeline. two things happening in it instead of one.

25% lower decode latency from one flag on a workload you're probably already running. the compute was always available during that communication window. nobody scheduled anything into it until now.


P.S. The current DBO implementation in vLLM is model-specific -- there's a deepseek_dbo.py for DeepSeek-V3, and adding another model means writing another model-specific DBO module. The RFC to refactor DBO into a model-agnostic framework (RFC #2599 in vllm-ascend) is actively being worked on. Once it lands, DBO becomes a flag you enable for any MoE architecture rather than something that requires per-model implementation. Qwen3 MoE, Nemotron 3 Super, Mixtral -- all of them have the same all-to-all communication gap in their profiling traces. The fix is the same fix. The generalization is what makes it a platform feature rather than a DeepSeek-specific optimization.
