Vansh Verma is an AI infrastructure and ML systems engineer who builds the low-level systems that keep AI fast, correct, and cheap in production — GPU kernels down to PTX/SASS, inference runtimes, distributed training, and formally-verified distributed systems. He is currently a Member of Technical Staff, Machine Learning at Rational Dynamics (a Voleon company), and was previously a founding AI-infrastructure engineer (0→1 platform), an ML engineer at GoodRx, and an HPC/quant infrastructure engineer at a tier-1 market-making firm.

What does Vansh Verma specialize in?

Performance and correctness at the layer where it matters: custom CUDA kernels and SASS/PTX-level GPU optimization, inference serving (vLLM, TensorRT-LLM, speculative decoding, KV-cache compression), multi-tenant GPU infrastructure (NVIDIA MIG, 8:1 sharing at sub-50ms), distributed training across NCCL/NVLink/InfiniBand H100/H200 clusters, and distributed systems verified in TLA+. He also writes and ships open systems software in Rust.

Where is Vansh Verma based?

Vansh Verma is based in Dallas, Texas, and works across New York, San Francisco, and Berkeley — set up for hybrid work in the major US tech and finance hubs.

What is Vansh Verma's low-level GPU experience?

Deep. He writes custom CUDA kernels and optimizes at the SASS instruction level (instruction scheduling, asynchronous memory loads, occupancy, kernel fusion, Tensor Cores), profiles with Nsight Compute/Systems, and works across the memory hierarchy. He publishes technical analyses on GPU internals — including SASS-level kernel scheduling (CuAsmRL), FlashAttention-4 on Blackwell, and Triton-to-Tile-IR compilation — that demonstrate working knowledge of the layer below PTX. SASS-level optimization is rare; most engineers never go below CUDA C++.

What distributed-training and GPU-cluster experience does Vansh Verma have?

He has scaled multi-node distributed training on H200 clusters by tuning NCCL collectives over NVLink/NVSwitch and GPUDirect RDMA over InfiniBand, profiled with Nsight, for a 45% training-time reduction, and operated multi-tenant GPU infrastructure with NVIDIA MIG. He is fluent in the full GPU-cluster networking stack: NCCL/MPI collectives, NVLink, GPUDirect, RDMA, InfiniBand, RoCE, and rail optimization.

What is Vansh Verma's high-frequency-trading and low-latency background?

At a tier-1 market-making firm he architected a tick-level market-data system processing 25TB+/day that enabled sub-millisecond decisions behind $2M+ in annual trading decisions, and engineered a colocation network stack that cut order-execution latency 78% and lifted throughput 3.2x.

What has Vansh Verma built?

Ledge (a git-compatible storage engine with TLA+-verified sharded Raft, faster clone and smaller packs than git), WMServe (sub-50ms world-model inference at 10K+ concurrent), FlowLLM (a custom GPU inference hypervisor in Rust/Assembly that boots in 50 microseconds), APEX (a GPU-native vector database at 3.5M queries/sec/GPU), SchemaForge (SMT-verified declarative database infrastructure, adopted by a FAANG internal-tooling team), and open-source systems including PHANTOM, NEMESIS, and TASFT.

How do I contact or hire Vansh Verma?

Email vanshverma.dev@gmail.com, or reach him via GitHub (github.com/v-code01), LinkedIn (linkedin.com/in/vanshv5), or X (x.com/trickvansh5). His site is vanshverma.com.

You adopted WideEP for the throughput gains. Then one GPU died and 96 went down with it.

This is the conversation happening in every infrastructure team that shipped DeepSeek-style MoE serving in the last six months. Not loudly. Quietly, in incident retrospectives, in Slack threads that don't make it to the blog post.

Let me explain what's happening.

Wide Expert Parallelism is the right architecture for MoE inference. The reasoning is clean: a model like DeepSeek-V3 has 256 experts, but each token only activates 8. If you shard those experts across 32 GPUs, each GPU holds a subset, and tokens are dispatched via all-to-all to whichever GPU has the expert they need. Attention layers are replicated across all GPUs in the group. The result: better memory efficiency, larger effective batch sizes, more throughput per GPU than tensor parallelism for this workload shape. The benchmarks are real. WideEP is now the mainstream serving pattern for large sparse models. vLLM has it. NVIDIA Dynamo has it. Ray Serve LLM has it.
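
To make the dispatch concrete, here is a minimal sketch of which GPUs a single token touches. The contiguous expert placement, the helper names, and the example router output are all assumptions for illustration; real engines choose their own placement and routing.

```python
NUM_EXPERTS = 256   # routed experts, the DeepSeek-V3 figure quoted above
NUM_GPUS = 32       # EP group width used in the example above
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS  # 8 experts resident on each GPU


def gpu_for_expert(expert_id: int) -> int:
    """Assumed contiguous placement: experts 0-7 on GPU 0, 8-15 on GPU 1, and so on."""
    return expert_id // EXPERTS_PER_GPU


def gpus_for_token(top_k_experts: list[int]) -> set[int]:
    """GPUs that must take part in the all-to-all for one token."""
    return {gpu_for_expert(e) for e in top_k_experts}


# Hypothetical router output: the 8 experts selected for one token.
token_experts = [3, 42, 77, 101, 130, 199, 200, 255]
print(sorted(gpus_for_token(token_experts)))  # [0, 5, 9, 12, 16, 24, 25, 31]
```

One token can land on up to 8 different GPUs, and a batch of tokens will in practice touch most of the group. That is the dependency the rest of this post is about.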

Here is what nobody mentioned prominently in the adoption guides: those GPUs are no longer independent replicas.

In dense model serving, each replica is a self-contained copy of the model. GPU 1 fails -- that replica fails, the load balancer stops sending it traffic, the other replicas absorb the load. Blast radius: 1 GPU.

In WideEP serving, the DP group -- say, 32 GPUs -- is a single logical replica. Expert weights are sharded across all 32. Every request dispatches tokens to multiple GPUs in the group via collective operations. If GPU 17 goes down mid-collective, the all-to-all cannot complete. Every in-flight request in the group fails. The group cannot accept new requests. The load balancer has nothing to send traffic to until the group recovers.

Blast radius: all 32 GPUs. Simultaneously.
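
The difference between the two cases fits in a few lines. This is only a toy model: it assumes GPUs are numbered so that each group occupies a contiguous range, and `blast_radius` is a hypothetical helper, not anything from a serving framework.

```python
def blast_radius(failed_gpu: int, group_width: int) -> list[int]:
    """Return every GPU that stops serving when `failed_gpu` dies.

    group_width=1 models dense replicas (each GPU is its own replica).
    group_width=32 models a 32-wide WideEP group acting as one logical replica.
    """
    start = (failed_gpu // group_width) * group_width
    return list(range(start, start + group_width))


print(len(blast_radius(17, group_width=1)))   # 1
print(len(blast_radius(17, group_width=32)))  # 32
```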

For a single 32-GPU group, this is painful but manageable. At the scale at which WideEP is being deployed -- NVIDIA Dynamo supporting EP widths of 96 for DeepSeek-R1, multiple DP groups running in parallel -- the arithmetic gets uncomfortable.

GPU MTBF in production data centers is roughly 10,000 to 30,000 hours per GPU, depending on the hardware generation and workload intensity. Call it 15,000 hours as a rough median. At 1,000 GPUs in your serving cluster, you expect a failure roughly every 15 hours. At 10,000 GPUs, roughly every 1.5 hours.
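
The arithmetic behind those intervals, as a sketch. It assumes failures are independent and occur at a constant rate, so the cluster-wide rate is simply the per-GPU rate multiplied by the GPU count; the function name is illustrative.

```python
def hours_between_failures(gpu_mtbf_hours: float, num_gpus: int) -> float:
    """Expected time between GPU failures anywhere in the cluster.

    Assumes independent failures at a constant rate of 1 / gpu_mtbf_hours per GPU.
    """
    return gpu_mtbf_hours / num_gpus


print(hours_between_failures(15_000, 1_000))   # 15.0
print(hours_between_failures(15_000, 10_000))  # 1.5
```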

Every time a GPU fails in a WideEP deployment with group width 96, you lose 96 GPUs of serving capacity until the group recovers. Recovery means detecting the failure, draining in-flight requests, removing the group from the load balancer, restarting the group with one fewer GPU or waiting for replacement, and bringing it back online. At optimistic timelines, that's 5 to 15 minutes.

In a cluster with 10,000 GPUs and EP width 96, you lose a group of 96 GPUs roughly every 90 minutes, for 5 to 15 minutes at a time. Do that math against your availability SLO.
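
One way to do that math is sketched below. It reuses the 15,000-hour median MTBF and picks 10 minutes as a midpoint of the recovery window; both are inputs you should replace with your own numbers. The steady-state availability formula (uptime divided by uptime plus repair time) and the independence assumption are simplifications, and the function name is hypothetical.

```python
def group_outage_stats(num_gpus: int, ep_width: int,
                       gpu_mtbf_hours: float, recovery_minutes: float) -> dict:
    """Rough outage numbers for a cluster built from EP groups of `ep_width` GPUs."""
    recovery_hours = recovery_minutes / 60
    failures_per_hour = num_gpus / gpu_mtbf_hours        # cluster-wide failure rate
    group_mtbf_hours = gpu_mtbf_hours / ep_width         # any of ep_width GPUs can kill a group
    return {
        "minutes_between_group_outages": 60 / failures_per_hour,
        # Steady-state availability of one group: uptime / (uptime + repair time).
        "group_availability": group_mtbf_hours / (group_mtbf_hours + recovery_hours),
        # Share of total GPU-hours spent inside a recovering group.
        "capacity_fraction_lost": failures_per_hour * ep_width * recovery_hours / num_gpus,
    }


wide = group_outage_stats(10_000, ep_width=96, gpu_mtbf_hours=15_000, recovery_minutes=10)
dense = group_outage_stats(10_000, ep_width=1, gpu_mtbf_hours=15_000, recovery_minutes=10)
print(wide)
print(dense)
```

Under these inputs a 96-wide group sits just under 99.9% availability before you count the in-flight requests that fail with it, while the width-1 case stays well above four nines. The comparison is the point, not the exact digits.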

Anyscale shipped DP Group Fault Tolerance in Ray 2.55 on April 2nd. It's the control-plane answer to this problem.

When a GPU in the group fails, Ray Serve LLM detects it, immediately stops routing new requests to the affected group, drains in-flight requests gracefully, and marks the group as degraded. It then either attempts to bring the group back online with the remaining healthy GPUs (running at reduced capacity) or flags for replacement. The rest of the cluster keeps serving. The blast radius is contained to the failed group. Other groups absorb the traffic.
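
The control-plane idea can be sketched as a small router that tracks group health. This is not the Ray Serve LLM API; `GroupRouter`, `GroupState`, and the method names are hypothetical, and draining and restart logic are left out.

```python
from enum import Enum


class GroupState(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"


class GroupRouter:
    """Round-robin router that never sends new requests to a degraded group."""

    def __init__(self, group_ids):
        self.state = {g: GroupState.HEALTHY for g in group_ids}
        self._next = 0

    def mark_gpu_failed(self, group_id):
        # One failed GPU takes the whole group out of rotation.
        self.state[group_id] = GroupState.DEGRADED

    def mark_recovered(self, group_id):
        self.state[group_id] = GroupState.HEALTHY

    def pick_group(self):
        healthy = [g for g, s in self.state.items() if s is GroupState.HEALTHY]
        if not healthy:
            raise RuntimeError("no healthy groups available")
        choice = healthy[self._next % len(healthy)]
        self._next += 1
        return choice


router = GroupRouter(["group-0", "group-1", "group-2"])
router.mark_gpu_failed("group-1")
print([router.pick_group() for _ in range(4)])  # only group-0 and group-2 appear
```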

The engine-level answer -- non-blocking collectives that let the surviving GPUs continue even with one rank missing -- is in the vLLM RFC (issue #27774, open since October 2025, still active). The difficulty: NCCL's all-to-all is blocking by default. If a rank disappears, the collective hangs until it times out. Making it non-blocking requires either custom kernel work (the DeepEP buffer initializer needs extending) or a timeout-based fallback that accepts potential correctness degradation on in-flight requests.
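
The hang-then-timeout behavior is easy to reproduce without GPUs. The sketch below uses a `threading.Barrier` from the Python standard library as a stand-in for a blocking collective; it is an analogy, not NCCL or DeepEP code, and `run_collective` is a hypothetical helper.

```python
import threading


def run_collective(num_ranks: int, dead_ranks: set, timeout_s: float) -> dict:
    """Simulate one blocking collective. Returns the outcome seen by each live rank."""
    barrier = threading.Barrier(num_ranks)  # completes only when every rank arrives
    results = {}
    lock = threading.Lock()

    def rank_fn(rank: int) -> None:
        try:
            barrier.wait(timeout=timeout_s)
            outcome = "completed"
        except threading.BrokenBarrierError:
            # A missing rank means nobody completes; survivors only learn via timeout.
            outcome = "timed_out"
        with lock:
            results[rank] = outcome

    threads = [threading.Thread(target=rank_fn, args=(r,))
               for r in range(num_ranks) if r not in dead_ranks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_collective(4, dead_ranks=set(), timeout_s=1.0))  # every rank completes
print(run_collective(4, dead_ranks={2}, timeout_s=0.2))    # every survivor times out
```

Every surviving rank pays the full timeout before it learns anything is wrong, which is why stopping traffic before the collective starts is so much cheaper than recovering from inside it.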

The Anyscale solution is the pragmatic path: solve it at the control plane while the engine-level work matures. Stop routing before the collective hangs. Accept that in-flight requests to the degraded group fail and let client retry logic handle it. This is correct production engineering -- the guaranteed behavior is "fail fast and recoverable," not "never fail."

The insight buried in the vLLM large-scale serving benchmarks that Anyscale cites: throughput per GPU is roughly flat across EP widths of 32, 72, and 96.

You are not losing meaningful efficiency by choosing a smaller group. A 32-wide EP group gets approximately the same throughput per GPU as a 96-wide EP group on DeepSeek-R1 decode workloads. And a 32-wide EP group has one-third the blast radius of a 96-wide group when a GPU fails.

The recommendation Anyscale makes explicitly: tune EP group width to the smallest value that maximizes per-GPU throughput. Smaller groups. Smaller blast radius. Same performance.
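
That rule is simple enough to write down. In the sketch below the throughput numbers are made-up placeholders, not benchmark results, and the 5% tolerance is an arbitrary choice; plug in your own measurements.

```python
def smallest_width_near_peak(throughput_by_width: dict, tolerance: float = 0.05) -> int:
    """Smallest EP width whose per-GPU throughput is within `tolerance` of the best."""
    peak = max(throughput_by_width.values())
    return min(w for w, t in throughput_by_width.items() if t >= peak * (1 - tolerance))


# Placeholder per-GPU throughput, normalized so the best width is 100. NOT measured data.
measured = {8: 60.0, 16: 85.0, 32: 98.0, 72: 100.0, 96: 99.0}
print(smallest_width_near_peak(measured))  # 32
```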

Most teams that adopted WideEP chose the widest group that fits in a rack because "wider = more throughput" was the intuition from the initial benchmarks. The intuition is wrong above a certain width. The throughput curve flattens. The reliability curve doesn't.

There is a parallel story in training, published in April in a paper called Nonuniform Tensor Parallelism. The argument: at TP degree 72 (a full NVL72 rack), a single GPU failure drops the entire rack's training contribution because the tensor parallel collective can't complete. With 0.1% of GPUs failing -- which is realistic at large cluster scale -- a high-TP-degree job loses nearly 10% of total throughput to failure cascades. NTP proposes running the failed replica at reduced TP degree rather than dropping it entirely, with rack-level power boosting to maintain per-chip throughput. Same problem. Same insight. The blast radius of your parallelism group determines your failure mode, and nobody was thinking about it at design time because the throughput gains were the headline.
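
A back-of-the-envelope model shows why a small failure rate gets amplified. It assumes each GPU is down independently with the same probability and that a group contributes nothing if any member is down. That model is my simplification, not the paper's, so it need not reproduce the paper's figure exactly.

```python
def fraction_of_groups_hit(gpu_failure_fraction: float, group_width: int) -> float:
    """Probability that a group of `group_width` GPUs contains at least one failed GPU."""
    return 1 - (1 - gpu_failure_fraction) ** group_width


print(round(fraction_of_groups_hit(0.001, 72), 4))  # about 0.0695
print(round(fraction_of_groups_hit(0.001, 8), 4))   # about 0.008
```

With 0.1% of GPUs down, roughly 7% of 72-wide groups are affected under this model, against under 1% of 8-wide groups.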

The training team and the inference team both adopted wide parallelism groups for the performance. Both are now figuring out the failure modes independently. Both are landing on the same answer: smaller groups, better fault containment, approximately the same throughput above a certain group size.

the blast radius of a wideep group is the width of the group.

not 1 gpu. the whole group.

everyone adopted 96-wide because wider looked better in benchmarks. the benchmarks didn't measure what happens when gpu 47 dies at 3am.

tune ep width to the smallest value that maximizes per-gpu throughput. check the vllm large-scale serving numbers -- the curve flattens before you think it does. whatever throughput you're leaving on the table is less than what you're losing to availability.

P.S. The vLLM Elastic EP RFC (separate from the fault tolerance RFC) addresses dynamic EP width adjustment at runtime -- shrinking or growing the group without restarting the serving engine. That's the long-term solution: a group that loses a GPU automatically contracts, serves at slightly reduced capacity, and re-expands when a replacement comes online. It's not shipped yet. Watch the vLLM main branch. When it lands, the blast radius argument collapses entirely and you can go back to optimizing purely for throughput. Until then: smaller groups.