Vansh Verma is an AI infrastructure and ML systems engineer who builds the low-level systems that keep AI fast, correct, and cheap in production — GPU kernels down to PTX/SASS, inference runtimes, distributed training, and formally-verified distributed systems. He is currently a Member of Technical Staff, Machine Learning at Rational Dynamics (a Voleon company), and was previously a founding AI-infrastructure engineer (0→1 platform), an ML engineer at GoodRx, and an HPC/quant infrastructure engineer at a tier-1 market-making firm.

What does Vansh Verma specialize in?

Performance and correctness at the layer where it matters: custom CUDA kernels and SASS/PTX-level GPU optimization, inference serving (vLLM, TensorRT-LLM, speculative decoding, KV-cache compression), multi-tenant GPU infrastructure (NVIDIA MIG, 8:1 sharing at sub-50ms), distributed training across NCCL/NVLink/InfiniBand H100/H200 clusters, and distributed systems verified in TLA+. He also writes and ships open systems software in Rust.

Where is Vansh Verma based?

Vansh Verma is based in Dallas, Texas, and works across New York, San Francisco, and Berkeley — set up for hybrid work in the major US tech and finance hubs.

What is Vansh Verma's low-level GPU experience?

Deep. He writes custom CUDA kernels and optimizes at the SASS instruction level (instruction scheduling, asynchronous memory loads, occupancy, kernel fusion, Tensor Cores), profiles with Nsight Compute/Systems, and works across the memory hierarchy. He publishes technical analyses on GPU internals — including SASS-level kernel scheduling (CuAsmRL), FlashAttention-4 on Blackwell, and Triton-to-Tile-IR compilation — that demonstrate working knowledge of the layer below PTX. SASS-level optimization is rare; most engineers never go below CUDA C++.

What distributed-training and GPU-cluster experience does Vansh Verma have?

He has scaled multi-node distributed training on H200 clusters by tuning NCCL collectives over NVLink/NVSwitch and GPUDirect RDMA over InfiniBand, profiled with Nsight, for a 45% training-time reduction, and operated multi-tenant GPU infrastructure with NVIDIA MIG. He is fluent in the full GPU-cluster networking stack: NCCL/MPI collectives, NVLink, GPUDirect, RDMA, InfiniBand, RoCE, and rail optimization.

What is Vansh Verma's high-frequency-trading and low-latency background?

At a tier-1 market-making firm he architected a tick-level market-data system processing 25TB+/day that enabled sub-millisecond decisions behind $2M+ in annual trading, and engineered a colocation network stack that cut order-execution latency 78% and lifted throughput 3.2x.

What has Vansh Verma built?

Ledge (a git-compatible storage engine with TLA+-verified sharded Raft, faster clone and smaller packs than git), WMServe (sub-50ms world-model inference at 10K+ concurrent), FlowLLM (a custom GPU inference hypervisor in Rust/Assembly that boots in 50 microseconds), APEX (a GPU-native vector database at 3.5M queries/sec/GPU), SchemaForge (SMT-verified declarative database infrastructure, adopted by a FAANG internal-tooling team), and open-source systems including PHANTOM, NEMESIS, and TASFT.

How do I contact or hire Vansh Verma?

Email vanshverma.dev@gmail.com, or reach him via GitHub (github.com/v-code01), LinkedIn (linkedin.com/in/vanshv5), or X (x.com/trickvansh5). His site is vanshverma.com.

Open an Nsight profile on a DeepSeek-R1 decode workload. Find the MoE Dispatch/Combine section. Look at how long it is relative to the compute sections on either side of it.

That gap is what I want to talk about.

The compute before it finishes fast. Attention, dense layers, everything non-MoE -- done in milliseconds. Then the all-to-all dispatch kicks in. Tokens route to their selected experts on remote GPUs. The combine gathers results back. The GPU waits. The profiling trace shows a long flat section where nothing is computing while the collective runs.

The compute load during MoE dispatch/combine is negligible -- the GPUs aren't doing significant arithmetic during that window. They're moving data. And while they're moving data, the tensor cores are idle.

On a WideEP deployment of DeepSeek-R1 at decode time, this communication window is not a rounding error. It is the dominant term in per-layer latency. You bought H100s for the tensor cores. You are using the network.

The fix shipped in vLLM behind --enable-dbo. One flag. I want to explain the mechanism because it's genuinely clever and because the failure modes are specific and non-obvious.

DBO -- Dual Batch Overlap -- splits the decode batch into two microbatches and runs them on two CUDA streams with two worker threads. The key insight: microbatch A's all-to-all dispatch and microbatch B's dense layer compute use different hardware resources. Collective communication goes through NVLink/IB. Dense compute uses tensor cores. They do not compete.

So run them simultaneously.

The execution pattern with DBO: microbatch A initiates its dispatch all-to-all and yields to microbatch B's thread. Microbatch B runs its dense compute layers. When A's dispatch completes, B yields back. A does its expert compute. B initiates its own dispatch. A does its combine while B computes. The communication of one microbatch overlaps with the computation of the other throughout the entire decode step.
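To make the ping-pong concrete, here is a toy timeline model in Python. It is a sketch under loud assumptions, not vLLM's scheduler: there are only two resources ("compute" and "network"), each runs one stage at a time, a single MoE layer is modeled as dense compute, dispatch, expert compute, combine, and halving the batch is assumed to halve every stage duration (real decode kernels do not scale that cleanly). The function names and the durations are made up for illustration.

```python
from typing import List, Tuple

Stage = Tuple[str, float]  # (resource name, duration in arbitrary time units)


def simulate(microbatches: List[List[Stage]]) -> float:
    """Finish time when each resource runs at most one stage at a time."""
    resource_free = {"compute": 0.0, "network": 0.0}
    ready = [0.0] * len(microbatches)      # when each microbatch finished its last stage
    next_stage = [0] * len(microbatches)   # index of the next stage to run
    while True:
        # Earliest possible start for the next stage of every unfinished microbatch.
        candidates = []
        for i, stages in enumerate(microbatches):
            if next_stage[i] < len(stages):
                resource, _ = stages[next_stage[i]]
                candidates.append((max(ready[i], resource_free[resource]), i))
        if not candidates:
            return max(ready)
        start, i = min(candidates)
        resource, duration = microbatches[i][next_stage[i]]
        ready[i] = resource_free[resource] = start + duration
        next_stage[i] += 1


def layer_stages(compute: float, comm: float) -> List[Stage]:
    """Dense compute -> dispatch -> expert compute -> combine (hypothetical durations)."""
    return [("compute", compute), ("network", comm),
            ("compute", compute), ("network", comm)]


# One full batch: every stage runs back to back.
sequential = simulate([layer_stages(2.0, 3.0)])
# Two half-size microbatches: one communicates while the other computes.
overlapped = simulate([layer_stages(1.0, 1.5), layer_stages(1.0, 1.5)])
print(f"sequential={sequential}, overlapped={overlapped}")
```

The total amount of communication is the same in both runs. The only thing that changes is whether the compute resource has anything to do while the network is busy.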

The profiling trace after DBO looks completely different. The flat communication gap collapses. Compute and communication fill the same wall-clock window rather than running sequentially. 25% decode latency reduction on DeepSeek-R1 workloads. Not from a new algorithm. Not from better hardware. From scheduling two things simultaneously that were previously running one after the other for no fundamental reason.

This is DeepSeek's DualPipe applied to inference.

DualPipe was DeepSeek's solution to pipeline parallelism bubbles in training -- overlapping the forward pass of one microbatch with the backward pass of another to keep pipeline stages continuously occupied. The idea of splitting work into two offset microbatches to hide communication behind computation is the same principle. vLLM's DBO takes it from training pipeline parallelism to inference decode MoE communication. The communication pattern is different. The insight is identical.

The non-obvious failure mode: DBO requires both microbatches to be non-empty.

vLLM's scheduler does a collective all_reduce across all DP ranks before each decode step to agree whether microbatching will be applied. If any rank would end up with an empty second microbatch after the batch is split, microbatching is disabled for all ranks. No overlap. Standard sequential execution.

At low batch sizes -- which is exactly the regime where decode latency matters most, because you're serving individual user requests, not saturating throughput -- a rank's batch might not split cleanly. A batch of 7 tokens splits into microbatches of 3 and 4. Both non-empty, DBO fires. A batch of 3 tokens splits into 1 and 2, or 2 and 1. Still non-empty. A batch of 1 token: you can't split it. DBO is disabled, and because the decision is collective, it is disabled on every rank for that step.
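The non-empty rule and the cross-rank agreement can be sketched in a few lines of Python. This is an illustration, not vLLM's implementation: the helper names are hypothetical, the split point (first microbatch gets the extra token) is an assumption, and the collective is replaced by a plain `min` over per-rank votes, which is what a MIN all_reduce over 0/1 flags would compute.

```python
from typing import List, Tuple


def split_batch(num_tokens: int) -> Tuple[int, int]:
    """Split a rank's decode batch into two microbatches (assumed split point)."""
    first = (num_tokens + 1) // 2
    return first, num_tokens - first


def local_vote(num_tokens: int) -> bool:
    """A rank votes yes only if its second microbatch would be non-empty."""
    _, second = split_batch(num_tokens)
    return second > 0


def agree_on_microbatching(tokens_per_rank: List[int]) -> bool:
    """Stand-in for the collective: one 'no' vote disables DBO everywhere."""
    votes = [int(local_vote(n)) for n in tokens_per_rank]
    return min(votes) == 1


print(split_batch(7), split_batch(3), split_batch(1))
print(agree_on_microbatching([7, 3]), agree_on_microbatching([7, 1]))
```

The second call is the case to watch in production: one rank holding a single token is enough to push every rank back to sequential execution for that step.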

The threshold is configurable via --dbo-decode-token-threshold. Below that threshold, the scheduler doesn't attempt microbatching. The default is set conservatively. If you have insight into your traffic distribution -- if you know your p10 batch size at decode time -- you can tune this down and capture overlap at lower batch sizes than the default captures.
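If you log decode batch sizes, picking a candidate threshold from the low end of the distribution is simple arithmetic. The sketch below is a starting point for a benchmark, not a recommendation: the sample data is invented, the helper names are hypothetical, the floor of 2 only reflects that a batch needs at least two tokens to split, and only the flag name comes from the text above.

```python
from typing import List


def percentile_nearest_rank(values: List[int], pct: int) -> int:
    """Nearest-rank percentile using integer arithmetic (no interpolation)."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceil(pct * n / 100)
    return ordered[rank - 1]


def suggest_decode_threshold(batch_sizes: List[int], pct: int = 10) -> int:
    """Candidate threshold: the p-th percentile batch size, never below 2."""
    return max(2, percentile_nearest_rank(batch_sizes, pct))


# Hypothetical per-step decode batch sizes sampled from serving logs.
observed = [4, 8, 8, 12, 16, 16, 24, 32, 48, 64]
threshold = suggest_decode_threshold(observed)
flag = ["--dbo-decode-token-threshold", str(threshold)]
print(flag)
```

With a threshold at the p10 batch size, roughly nine steps in ten are eligible for microbatching. Whether the overlap pays for the splitting overhead at those small sizes is something only a benchmark on your own traffic can answer.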

The backend also matters. --all2all-backend deepep_low_latency is the backend that makes DBO worth enabling. It uses NVLink for intra-node expert communication with native CUDA stream support, which is what lets the overlap actually happen. deepep_high_throughput -- the InfiniBand backend for inter-node communication -- has different overlap characteristics, and the performance gain from DBO is lower. If your EP group fits within a single NVLink domain (EP width 8 or less on NVL8, or 16 or less on a dual-node NVLink setup), use deepep_low_latency. If it spans beyond that, benchmark before assuming DBO helps.
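The rule of thumb above fits in a small helper that assembles the serve arguments. Treat it as a sketch: the function names and the `nvlink_domain_gpus` parameter are hypothetical, the model identifier is an assumption, and the parallelism flags a real WideEP deployment needs are deliberately left out. Only `--enable-dbo`, `--all2all-backend`, and the two backend names come from the text.

```python
from typing import List


def choose_all2all_backend(ep_size: int, nvlink_domain_gpus: int = 8) -> str:
    """Pick the backend by whether the EP group fits in one NVLink domain."""
    if ep_size <= nvlink_domain_gpus:
        return "deepep_low_latency"
    # Spanning domains: DBO may still help, but benchmark before relying on it.
    return "deepep_high_throughput"


def dbo_serve_args(model: str, ep_size: int, nvlink_domain_gpus: int = 8) -> List[str]:
    """Build the DBO-related part of a serve command line (other flags omitted)."""
    backend = choose_all2all_backend(ep_size, nvlink_domain_gpus)
    return ["vllm", "serve", model, "--enable-dbo", "--all2all-backend", backend]


# Assumed model identifier, used here only for illustration.
args = dbo_serve_args("deepseek-ai/DeepSeek-R1", ep_size=8)
print(" ".join(args))
```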

The load imbalance story is the second half of this problem and it compounds with DBO in a way that isn't obvious.

Experts are balanced at training time -- the load balancing loss during training pushes the router toward even token distribution across all experts. At inference time, real workloads don't distribute evenly. A query about Python code routes heavily to certain experts. A query about French poetry routes to different ones. The training-time balance doesn't hold.

At WideEP with high parallelism degree, load imbalance means some GPUs in the EP group are processing 3x their expected token count per step while others are nearly idle. The step wall-clock time is determined by the slowest GPU. You're paying for the overloaded GPU's latency while the underloaded GPUs sit idle -- and you're paying for this inside the very communication window DBO is trying to hide.
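The cost of imbalance is easy to put a number on. The sketch below assumes step time is proportional to the token count on the busiest rank, which ignores fixed per-step overheads, and the per-rank token counts are invented to match the 3x case described above. The function names are hypothetical.

```python
from typing import List


def imbalance_ratio(tokens_per_rank: List[int]) -> float:
    """Busiest rank's load relative to the mean load across the EP group."""
    mean = sum(tokens_per_rank) / len(tokens_per_rank)
    return max(tokens_per_rank) / mean


def idle_fraction(tokens_per_rank: List[int]) -> float:
    """Share of GPU time spent waiting if the step lasts as long as the busiest rank."""
    mean = sum(tokens_per_rank) / len(tokens_per_rank)
    return 1.0 - mean / max(tokens_per_rank)


# Hypothetical routing outcome for one decode step on a 4-rank EP group.
tokens = [300, 50, 25, 25]
print(imbalance_ratio(tokens), round(idle_fraction(tokens), 3))
```

In this made-up step, one rank carries three times the mean load and the group as a whole spends about two thirds of the step waiting on it.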

The hierarchical load balancer in vLLM monitors token routing in real time and reshuffles expert assignments to balance load across GPU ranks. Not at restart time. Not at config time. Each decode step, if the imbalance exceeds a threshold, it rebalances. 12-18% throughput improvement on real heterogeneous workloads where some queries are disproportionately expert-hungry.
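For intuition about what "reshuffling expert assignments" buys, here is a plain greedy rebalancer in Python. It is not vLLM's hierarchical algorithm: it is the textbook longest-processing-time heuristic, it ignores the cost of moving expert weights and any experts-per-rank constraint, and the loads, threshold, and function names are all assumptions for illustration.

```python
import heapq
from typing import Dict, List


def rank_loads(placement: List[List[int]], expert_load: Dict[int, int]) -> List[int]:
    """Total routed tokens per rank for a given expert placement."""
    return [sum(expert_load[e] for e in experts) for experts in placement]


def needs_rebalance(loads: List[int], threshold: float = 1.2) -> bool:
    """Trigger when the busiest rank exceeds the mean by the given factor."""
    mean = sum(loads) / len(loads)
    return mean > 0 and max(loads) / mean > threshold


def greedy_rebalance(expert_load: Dict[int, int], num_ranks: int) -> List[List[int]]:
    """Assign the heaviest experts first, always to the least-loaded rank."""
    heap = [(0, rank) for rank in range(num_ranks)]  # (current load, rank id)
    placement: List[List[int]] = [[] for _ in range(num_ranks)]
    for expert in sorted(expert_load, key=lambda e: (-expert_load[e], e)):
        load, rank = heapq.heappop(heap)
        placement[rank].append(expert)
        heapq.heappush(heap, (load + expert_load[expert], rank))
    return placement


# Hypothetical tokens routed to each of 8 experts over a recent window.
expert_load = {0: 90, 1: 80, 2: 20, 3: 10, 4: 70, 5: 60, 6: 40, 7: 30}
static = [[0, 1], [2, 3], [4, 5], [6, 7]]  # contiguous placement on 4 ranks
before = rank_loads(static, expert_load)
after = rank_loads(greedy_rebalance(expert_load, 4), expert_load)
print(before, after)
```

Since the step is gated by the busiest rank, what the rebalance actually improves is the maximum of the per-rank loads, not the total.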

DBO and dynamic load balancing are independent improvements that compose. DBO hides the communication latency of a balanced dispatch. Dynamic load balancing reduces the tail latency from an imbalanced one. If you're running WideEP on DeepSeek-class models without both enabled, you're leaving a significant fraction of the available performance on the table.

The profiling trace is the fastest way to understand this.

Run DeepSeek-R1 decode without DBO. Look at the MoE dispatch/combine section. Measure how long the GPU is idle waiting for collective communication.

Enable --enable-dbo --all2all-backend deepep_low_latency. Run again. Look at the same section.

The gap doesn't disappear. It overlaps. Same wall-clock time. Two things happening in it instead of one.
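If you export kernel and collective intervals from a trace, "how much of the communication is hidden" becomes a small interval calculation. The Python sketch below assumes you already have (start, end) pairs for compute and communication on one GPU. The two example timelines are hand-written toy data, not measurements, and the function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in the trace's time unit


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into disjoint ones."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def exposed_comm_time(comm: List[Interval], compute: List[Interval]) -> float:
    """Communication time during which no compute is running."""
    busy = merge(compute)
    exposed = 0.0
    for c_start, c_end in merge(comm):
        covered = sum(max(0.0, min(c_end, b_end) - max(c_start, b_start))
                      for b_start, b_end in busy)
        exposed += (c_end - c_start) - covered
    return exposed


# Toy timeline without overlap: compute and communication strictly alternate.
no_dbo = exposed_comm_time(comm=[(2, 5), (7, 10)], compute=[(0, 2), (5, 7)])
# Toy timeline with overlap: the same 6 units of communication, partly hidden.
with_dbo = exposed_comm_time(
    comm=[(1, 2.5), (2.5, 4), (4, 5.5), (5.5, 7)],
    compute=[(0, 1), (1, 2), (2.5, 3.5), (4, 5)],
)
print(no_dbo, with_dbo)
```

Both timelines carry the same total communication. The number to compare before and after enabling the flag is the exposed portion.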

25% lower decode latency from one flag on a workload you're probably already running. The compute was always available during that communication window. Nobody scheduled anything into it until now.

P.S. The current DBO implementation in vLLM is model-specific -- there's a deepseek_dbo.py for DeepSeek-V3, and adding another model means writing another model-specific DBO module. The RFC to refactor DBO into a model-agnostic framework (RFC #2599 in vllm-ascend) is actively being worked on. Once it lands, DBO becomes a flag you enable for any MoE architecture rather than something that requires per-model implementation. Qwen3 MoE, Nemotron 3 Super, Mixtral -- all of them have the same all-to-all communication gap in their profiling traces. The fix is the same fix. The generalization is what makes it a platform feature rather than a DeepSeek-specific optimization.