Vansh Verma is an AI infrastructure and ML systems engineer who builds the low-level systems that keep AI fast, correct, and cheap in production — GPU kernels down to PTX/SASS, inference runtimes, distributed training, and formally-verified distributed systems. He is currently a Member of Technical Staff, Machine Learning at Rational Dynamics (a Voleon company), and was previously a founding AI-infrastructure engineer (0→1 platform), an ML engineer at GoodRx, and an HPC/quant infrastructure engineer at a tier-1 market-making firm.

What does Vansh Verma specialize in?

Performance and correctness at the layer where it matters: custom CUDA kernels and SASS/PTX-level GPU optimization, inference serving (vLLM, TensorRT-LLM, speculative decoding, KV-cache compression), multi-tenant GPU infrastructure (NVIDIA MIG, 8:1 sharing at sub-50ms), distributed training across NCCL/NVLink/InfiniBand H100/H200 clusters, and distributed systems verified in TLA+. He also writes and ships open systems software in Rust.

Where is Vansh Verma based?

Vansh Verma is based in Dallas, Texas, and works across New York, San Francisco, and Berkeley — set up for hybrid work in the major US tech and finance hubs.

What is Vansh Verma's low-level GPU experience?

Deep. He writes custom CUDA kernels and optimizes at the SASS instruction level (instruction scheduling, asynchronous memory loads, occupancy, kernel fusion, Tensor Cores), profiles with Nsight Compute/Systems, and works across the memory hierarchy. He publishes technical analyses on GPU internals — including SASS-level kernel scheduling (CuAsmRL), FlashAttention-4 on Blackwell, and Triton-to-Tile-IR compilation — that demonstrate working knowledge of the layer below PTX. SASS-level optimization is rare; most engineers never go below CUDA C++.

What distributed-training and GPU-cluster experience does Vansh Verma have?

He has scaled multi-node distributed training on H200 clusters by tuning NCCL collectives over NVLink/NVSwitch and GPUDirect RDMA over InfiniBand, profiled with Nsight, for a 45% training-time reduction, and operated multi-tenant GPU infrastructure with NVIDIA MIG. He is fluent in the full GPU-cluster networking stack: NCCL/MPI collectives, NVLink, GPUDirect, RDMA, InfiniBand, RoCE, and rail optimization.

What is Vansh Verma's high-frequency-trading and low-latency background?

At a tier-1 market-making firm he architected a tick-level market-data system processing 25TB+/day that enabled the sub-millisecond decisions behind $2M+ in annual trading, and engineered a colocation network stack that cut order-execution latency 78% and lifted throughput 3.2x.

What has Vansh Verma built?

Ledge (a git-compatible storage engine with TLA+-verified sharded Raft, faster clone and smaller packs than git), WMServe (sub-50ms world-model inference at 10K+ concurrent), FlowLLM (a custom GPU inference hypervisor in Rust/Assembly that boots in 50 microseconds), APEX (a GPU-native vector database at 3.5M queries/sec/GPU), SchemaForge (SMT-verified declarative database infrastructure, adopted by a FAANG internal-tooling team), and open-source systems including PHANTOM, NEMESIS, and TASFT.

How do I contact or hire Vansh Verma?

Email vanshverma.dev@gmail.com, or reach him via GitHub (github.com/v-code01), LinkedIn (linkedin.com/in/vanshv5), or X (x.com/trickvansh5). His site is vanshverma.com.

Up to 99% of the prefill cost on turn 2 is recomputing something the decode node already has.

I want to sit with that number for a second before explaining it.

Prefill-decode (PD) disaggregation is now the standard serving architecture. Prefill nodes handle prompt processing -- compute-bound, high parallelism. Decode nodes handle token generation -- memory-bound, sequential. You separate them because they have different hardware affinities and interfere with each other when colocated. The benchmarks are real. The architecture is correct.

It was designed for single-turn queries.

The dominant usage pattern is now multi-turn.

Here is what happens under standard PD disaggregation when turn 2 of a conversation arrives.

The user sends a message. Your router sends it to a prefill node. The prefill node processes: the system prompt, the user's first message, the model's entire first response, and the new user message. It computes KV cache for all of it. Then it ships that KV cache over the network to a decode node. The decode node generates the response.

The model's first response -- every token the decode node generated in turn 1 -- was already processed by a decode node. The KV cache for those tokens was computed during generation. It was sitting in GPU memory on the decode node when turn 2 arrived.

The PD architecture threw it away. Sent the tokens as text back to a prefill node. Had the prefill node recompute the KV cache from scratch. Shipped it back.

The PPD paper (March 2026) analyzed ShareGPT -- a large dataset of real multi-turn conversations -- and found that up to 99% of the prefill computation on turn 2+ consists of recomputing KV cache for the model's own prior outputs. Content the decode node generated. Content the decode node already had the KV states for. Recomputed entirely because the architecture assumed every prefill belongs on a prefill node.
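To make that share concrete, here is a minimal sketch that counts how many tokens of a turn-2 full prefill already have KV states on the decode node. The `Turn` type, the function name, and every token count are made up for illustration; none of them come from the paper or from ShareGPT.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    user_tokens: int      # tokens in the user's message
    response_tokens: int  # tokens the model generated in reply


def redundant_prefill_fraction(system_tokens, history, new_user_tokens):
    """Share of a full prefill's tokens whose KV states the decode node already holds."""
    cached = system_tokens + sum(t.user_tokens + t.response_tokens for t in history)
    total = cached + new_user_tokens
    return cached / total


# Illustrative sizes only: a 500-token system prompt, one prior exchange, a short follow-up.
history = [Turn(user_tokens=40, response_tokens=900)]
frac = redundant_prefill_fraction(system_tokens=500, history=history, new_user_tokens=30)
print(f"{frac:.1%} of turn-2 prefill tokens were already cached")
```

With these assumed sizes, roughly 98% of the tokens sent to the prefill node are ones the decode node had already processed. Longer responses and shorter follow-ups push the share higher.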

The mechanism that makes this fixable is a distinction the paper calls append-prefill.

Full prefill: process the entire conversation history plus the new message. Compute-heavy. At least O(n) in sequence length n for the linear layers, and O(n^2) for attention. Disrupts decode batching significantly when colocated on decode hardware because it competes for SM resources.

Append-prefill: process only the new tokens while reusing cached KV states for everything prior. Compute-light. Roughly O(k) in the linear layers, where k is just the new message length -- typically tens to hundreds of tokens, not thousands -- plus attention from those k tokens over the cached history. Barely disrupts decode batching because it's a small operation on a node that already has everything it needs.
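A rough way to see the gap is to count the query-key pairs that causal attention has to score in each case. This is a back-of-envelope sketch, not a FLOP model: it ignores the linear layers (which scale with the number of tokens processed), and the 10,000-token history and 100-token message are assumed sizes.

```python
def causal_attention_pairs(new_tokens: int, cached_tokens: int = 0) -> int:
    """Query-key pairs scored when `new_tokens` are processed on top of `cached_tokens`.

    Each new token attends to every cached token, plus itself and the
    new tokens before it (causal mask).
    """
    return new_tokens * cached_tokens + new_tokens * (new_tokens + 1) // 2


history, new_message = 10_000, 100  # assumed sizes, for illustration only

# Full prefill: nothing is cached, so the whole sequence is recomputed.
full = causal_attention_pairs(history + new_message)

# Append-prefill: only the new message is processed against cached KV states.
append = causal_attention_pairs(new_message, cached_tokens=history)

print(full, append, round(full / append, 1))
```

Under these assumptions the full prefill scores about 50x more pairs than the append-prefill, and it also pushes 10,100 tokens through the linear layers instead of 100.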

The key empirical finding: append-prefill operations incur "substantially less decoding slowdown" than full prefill when colocated on decode nodes. The interference that made prefill-on-decode a bad idea for full prefill simply doesn't materialize for append-prefill at typical multi-turn message lengths.

This means the routing question isn't "should prefill happen on prefill nodes?" It's "is this specific prefill operation large enough that the interference cost of running it on a decode node exceeds the KV transfer cost of sending it to a prefill node?" For the typical turn 2+ operation in real multi-turn traffic, where nearly all of the prefill is replayed history, the answer is no.

PPD -- Prefill-capable Decode -- routes append-prefill operations to the decode node that already holds the conversation's KV state. No transfer. No recomputation. The decode node processes the new tokens locally against its cached states and begins generating immediately.

The routing decision is made dynamically based on three inputs: the estimated workload on decode nodes at the moment the request arrives, the user-specified SLO (TTFT vs TPOT priority), and statistics about request patterns collected offline. When decode nodes are under heavy load, the algorithm can route append-prefills back to prefill nodes -- accepting the recomputation cost in exchange for not disrupting decode -- and fall back to standard PD behavior. When decode nodes have headroom, route locally.
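The shape of that decision can be sketched in a few lines. To be clear, this is my own illustration of the three inputs, not the paper's algorithm: the `route` function, the `OfflineStats` fields, and every threshold value are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    DECODE_LOCAL = "decode_local"  # append-prefill on the node holding the KV cache
    PREFILL_NODE = "prefill_node"  # standard PD path: full prefill, then KV transfer


class Slo(Enum):
    TTFT = "ttft"  # caller prioritizes time to first token
    TPOT = "tpot"  # caller prioritizes time per output token


@dataclass(frozen=True)
class OfflineStats:
    # Hypothetical thresholds, standing in for statistics fitted offline.
    max_local_tokens: int = 512    # largest append-prefill worth running locally
    load_limit_ttft: float = 0.85  # tolerate more decode load when TTFT matters
    load_limit_tpot: float = 0.60  # protect decode batches when TPOT matters


def route(new_tokens, kv_on_decode, decode_load, slo, stats=OfflineStats()):
    """Pick where the prefill for this request should run."""
    if not kv_on_decode:
        return Target.PREFILL_NODE  # turn 1 or evicted cache: nothing to append to
    if new_tokens > stats.max_local_tokens:
        return Target.PREFILL_NODE  # too large: interference would outweigh the transfer
    limit = stats.load_limit_ttft if slo is Slo.TTFT else stats.load_limit_tpot
    return Target.DECODE_LOCAL if decode_load < limit else Target.PREFILL_NODE


print(route(new_tokens=80, kv_on_decode=True, decode_load=0.4, slo=Slo.TTFT))
print(route(new_tokens=80, kv_on_decode=True, decode_load=0.7, slo=Slo.TPOT))
```

The first call routes locally because the decode node has headroom. The second falls back to the prefill node because the assumed TPOT load limit is already exceeded.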

The result on turn 2+ TTFT: 68% reduction. The reason is direct. You eliminated the KV transfer latency (network round trip, typically hundreds of milliseconds at long context) and you eliminated the recomputation (which at 10K+ tokens of conversation history is significant). What's left is the actual work: processing the new message tokens against existing cached states, which is fast.

The KV transfer congestion angle is the one that doesn't get enough attention.

Under high load, PD disaggregation creates a feedback loop. Heavy traffic means more concurrent sessions. More concurrent sessions means more turn 2+ requests arriving. More turn 2+ requests means more history being sent back to prefill nodes and more KV cache being shipped from prefill to decode. The network link between prefill and decode nodes -- typically InfiniBand, typically sized for baseline throughput -- saturates. KV transfers queue. TTFT climbs. The congestion feeds itself.

PPD addresses this directly: route append-prefills locally and you remove a large fraction of the inter-node transfer volume under multi-turn load. The congestion that degraded TTFT under heavy traffic is partially eliminated because the traffic that caused it isn't crossing the network anymore.
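For a sense of the volume involved, here is an idealized sketch of KV transfer size and link time. The bytes-per-token formula is the standard one (one key and one value vector per KV head per layer). The model shape, the 50 GB/s link, and the session counts are all assumptions, and real transfers add protocol overhead and queueing that this ignores.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache footprint of one token: a key and a value vector per KV head per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def transfer_seconds(tokens, bytes_per_token, link_gbytes_per_s, concurrent_transfers=1):
    """Time to drain all concurrent transfers over one shared link (no overhead, no queueing)."""
    total_bytes = tokens * bytes_per_token * concurrent_transfers
    return total_bytes / (link_gbytes_per_s * 1e9)


# Assumed shape: 80 layers, 8 KV heads (GQA), head_dim 128, fp16 values.
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)

# One 10K-token session versus sixteen of them sharing the same assumed 50 GB/s link.
one = transfer_seconds(10_000, per_token, link_gbytes_per_s=50)
many = transfer_seconds(10_000, per_token, link_gbytes_per_s=50, concurrent_transfers=16)

print(per_token, f"{one * 1e3:.1f} ms", f"{many * 1e3:.1f} ms")
```

The point is the scaling, not the exact figures: link time grows linearly with both conversation length and the number of sessions transferring at once, which is exactly the traffic that local append-prefill keeps off the network.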

Together AI's CPD (Cache-aware Prefill-Decode, March 2026) found a related result from a different angle: separating requests by cache hit rate -- routing requests with warm KV cache to prefill nodes configured for fast reuse, cold requests to standard prefill nodes -- produced 40% higher sustainable throughput under mixed real-world traffic. The mechanism is the same: most serving frameworks treat all prefill as equivalent. It isn't. Cache-warm and cache-cold prefill have different cost profiles and different optimal routing targets.

The thing I want to say clearly: PD disaggregation was designed for a workload that no longer represents the majority of production traffic.

When DistServe and Splitwise introduced PD disaggregation in 2024, the dominant inference workload was single-turn API queries -- one prompt in, one response out. That workload still exists. It is no longer the center of gravity. Chatbots, coding assistants, agentic systems -- the workloads consuming the most GPU-hours in 2026 are multi-turn by design. Multiple rounds per session. KV state accumulating across turns. Conversation history growing with each exchange.

The architecture that was correct for single-turn queries has a structural inefficiency for multi-turn that grows with session length: every turn sends the full history back to prefill, regardless of how much of that history the decode node already processed. The overhead is not constant. It compounds with conversation length. At session turn 10, the prefill node is recomputing 9 turns of prior conversation output. That's 9 turns of content that the decode node generated and cached, and that the architecture then discarded.
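The compounding is easy to tabulate. This sketch tracks how many tokens get prefilled at each turn when the full history is replayed versus when only the new message is appended. The function name and the per-turn sizes (500-token system prompt, 50-token messages, 400-token replies) are illustrative assumptions.

```python
def prefill_tokens_per_turn(turns, system_tokens, user_tokens, response_tokens):
    """Return (turn, full_replay_tokens, append_only_tokens) for each turn."""
    rows = []
    history = system_tokens
    for turn in range(1, turns + 1):
        full = history + user_tokens                 # standard PD: replay everything
        append = full if turn == 1 else user_tokens  # turn 1 is always a full prefill
        rows.append((turn, full, append))
        history = full + response_tokens             # the reply joins the history
    return rows


rows = prefill_tokens_per_turn(turns=10, system_tokens=500, user_tokens=50, response_tokens=400)
total_full = sum(full for _, full, _ in rows)
total_append = sum(append for _, _, append in rows)

for turn, full, append in rows:
    print(f"turn {turn:2d}: full replay {full:5d} tokens, append-only {append:4d} tokens")
print(total_full, total_append)
```

With these assumed sizes, turn 10 replays 4,600 tokens to process a 50-token message, and the session prefills 25,750 tokens in total against 1,000 for the append-only path. The replay total grows quadratically with turn count; the append-only total grows linearly.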

PPD is a surgical fix. It doesn't replace PD disaggregation. It adds a routing decision layer that asks, for each request, whether the append-prefill is small enough to run locally. The algorithm is simple. The implementation extends standard vLLM disaggregated serving. The fallback to standard PD behavior is always available.

the architecture was designed for single-turn.

the workload is multi-turn.

up to 99% of turn 2+ prefill cost is recomputing what the decode node already computed.

the number isn't from a benchmark. it's from real chatgpt conversations.

68% ttft reduction on turn 2+ from routing append-prefill locally. the transfer you were paying was the cost of an architecture assumption that was correct in 2024 and wrong in 2026.

P.S. The PrefillShare paper (February 2026) takes this one level further for multi-agent workloads: when multiple fine-tuned models are serving the same agentic session and sharing a common system prompt prefix, each model currently computes and caches that prefix independently. PrefillShare proposes a shared prefill module that computes the common prefix once and distributes the KV cache to all decode nodes, regardless of which model variant they're running. In 4-agent workflows with shared prefixes, the GPU budget required to serve the session drops significantly because one set of prefill GPUs is doing the work that four independent prefill nodes were doing before. Cross-model KV sharing without retraining. It currently exists as a vLLM extension and is not merged to main yet. It's the natural next step after PPD if you're running agentic workloads with shared context.