Vansh Verma is an AI infrastructure and ML systems engineer who builds the low-level systems that keep AI fast, correct, and cheap in production — GPU kernels down to PTX/SASS, inference runtimes, distributed training, and formally-verified distributed systems. He is currently a Member of Technical Staff, Machine Learning at Rational Dynamics (a Voleon company), and was previously a founding AI-infrastructure engineer (0→1 platform), an ML engineer at GoodRx, and an HPC/quant infrastructure engineer at a tier-1 market-making firm.

What does Vansh Verma specialize in?

Performance and correctness at the layer where it matters: custom CUDA kernels and SASS/PTX-level GPU optimization, inference serving (vLLM, TensorRT-LLM, speculative decoding, KV-cache compression), multi-tenant GPU infrastructure (NVIDIA MIG, 8:1 sharing at sub-50ms), distributed training across NCCL/NVLink/InfiniBand H100/H200 clusters, and distributed systems verified in TLA+. He also writes and ships open systems software in Rust.

Where is Vansh Verma based?

Vansh Verma is based in Dallas, Texas, and works across New York, San Francisco, and Berkeley — set up for hybrid work in the major US tech and finance hubs.

What is Vansh Verma's low-level GPU experience?

Deep. He writes custom CUDA kernels and optimizes at the SASS instruction level (instruction scheduling, asynchronous memory loads, occupancy, kernel fusion, Tensor Cores), profiles with Nsight Compute/Systems, and works across the memory hierarchy. He publishes technical analyses on GPU internals — including SASS-level kernel scheduling (CuAsmRL), FlashAttention-4 on Blackwell, and Triton-to-Tile-IR compilation — that demonstrate working knowledge of the layer below PTX. SASS-level optimization is rare; most engineers never go below CUDA C++.

What distributed-training and GPU-cluster experience does Vansh Verma have?

He has scaled multi-node distributed training on H200 clusters by tuning NCCL collectives over NVLink/NVSwitch and GPUDirect RDMA over InfiniBand, profiled with Nsight, for a 45% training-time reduction, and operated multi-tenant GPU infrastructure with NVIDIA MIG. He is fluent in the full GPU-cluster networking stack: NCCL/MPI collectives, NVLink, GPUDirect, RDMA, InfiniBand, RoCE, and rail optimization.

What is Vansh Verma's high-frequency-trading and low-latency background?

At a tier-1 market-making firm he architected a tick-level market-data system processing 25TB+/day, enabling the sub-millisecond decision-making behind $2M+ in annual trading, and engineered a colocation network stack that cut order-execution latency 78% and lifted throughput 3.2x.

What has Vansh Verma built?

Ledge (a git-compatible storage engine with TLA+-verified sharded Raft, faster clone and smaller packs than git), WMServe (sub-50ms world-model inference at 10K+ concurrent), FlowLLM (a custom GPU inference hypervisor in Rust/Assembly that boots in 50 microseconds), APEX (a GPU-native vector database at 3.5M queries/sec/GPU), SchemaForge (SMT-verified declarative database infrastructure, adopted by a FAANG internal-tooling team), and open-source systems including PHANTOM, NEMESIS, and TASFT.

How do I contact or hire Vansh Verma?

Email vanshverma.dev@gmail.com, or reach him via GitHub (github.com/v-code01), LinkedIn (linkedin.com/in/vanshv5), or X (x.com/trickvansh5). His site is vanshverma.com.

Prefill and decode run on the same GPU. They stress completely different hardware resources on it. Nobody ran them at the same time until six weeks ago.

This one took me a while to see.

The standard story of prefill-decode disaggregation goes like this: prefill is compute-bound, decode is memory-bandwidth-bound, they want different hardware, so split them across GPUs. vLLM does it. SGLang does it. NVIDIA Dynamo was built around it. The whole industry has spent 18 months optimizing the inter-GPU version of this problem.

The paper I've been reading for the last three days asks a different question. If prefill saturates compute units and decode saturates memory bandwidth -- and those are genuinely different hardware resources on the same chip -- why are we running them sequentially at all?

Bullet. ASPLOS '26. March 22nd, Pittsburgh. One citation in the vLLM issue tracker. Otherwise: nothing.

Here's the problem it's solving, made physical.

A GPU has two scarce resources: SM compute throughput and HBM memory bandwidth. Prefill uses the first one -- it's processing your entire input in parallel, saturating the tensor cores, feeding the SMs as fast as it can. Decode uses the second one -- loading weight matrices one token at a time, mostly waiting for HBM to deliver bytes.
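A back-of-envelope way to see that asymmetry is arithmetic intensity: FLOPs performed per byte moved for one weight matmul. The sketch below is mine, not from the paper. The layer size, the fp16 assumption, and the peak FLOP/s and bandwidth figures are round placeholder numbers rather than any specific GPU's spec sheet.

```python
def arithmetic_intensity(n_tokens: int, d_in: int, d_out: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for one (n_tokens x d_in) @ (d_in x d_out) matmul."""
    flops = 2 * n_tokens * d_in * d_out                      # one multiply + one add per weight per token
    weight_bytes = bytes_per_elem * d_in * d_out             # weights are read once per pass
    activation_bytes = bytes_per_elem * n_tokens * (d_in + d_out)  # inputs read, outputs written
    return flops / (weight_bytes + activation_bytes)


# Assumed round numbers for illustration only (not a real GPU's datasheet).
PEAK_FLOPS = 1000e12       # 1000 TFLOP/s
PEAK_BANDWIDTH = 3e12      # 3 TB/s
RIDGE = PEAK_FLOPS / PEAK_BANDWIDTH  # FLOPs/byte needed to become compute-bound


def bound(n_tokens: int, d: int = 8192) -> str:
    """Classify a square d x d layer as compute- or memory-bound at this token count."""
    return "compute" if arithmetic_intensity(n_tokens, d, d) >= RIDGE else "memory"


prefill = arithmetic_intensity(4096, 8192, 8192)  # whole prompt processed at once
decode = arithmetic_intensity(1, 8192, 8192)      # one new token per step
print(prefill, decode, bound(4096), bound(1))
```

With these assumptions the prefill case lands far above the ridge point and the single-token decode case lands below 1 FLOP per byte. Batching several decode requests raises that number, but the gap is the point.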

When you run prefill and decode through one scheduler on one GPU (what every serving framework does by default), you are forcing two workloads with opposite resource profiles to share scheduling time. The scheduler picks one batch, runs it, picks the next. While prefill is running, HBM bandwidth sits mostly idle. While decode is running, SM compute sits mostly idle. You are getting one resource at a time when the hardware has two.

The number Bullet puts on this: chunked prefill -- the technique everyone uses to prevent prefill from blocking decode -- produces 5.2% lower SLO compliance than Bullet's approach, and leaves 20% of GPU compute idle during decode batches. You bought the hardware. You are using roughly 80% of it during the portion of inference that matters most for latency.

Bullet's mechanism is SM partitioning via libsmctrl.

libsmctrl is a CUDA SM masking library. It lets you specify which Streaming Multiprocessors on the GPU a given kernel is allowed to run on. Not scheduling hints -- actual hard partitions. SM 0-47 for prefill. SM 48-95 for decode. Both running simultaneously. Different kernels, different resource profiles, different SM allocations, one GPU.
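To make the partition concrete, here is a minimal sketch of the two disjoint masks for that 0-47 / 48-95 example. `sm_mask` is a hypothetical helper of mine, not a libsmctrl function. The real library's mask width, hardware unit, and bit polarity (whether a set bit means enabled or disabled) may differ, so treat this as the idea only.

```python
def sm_mask(first: int, last: int) -> int:
    """Bitmask with bits first..last (inclusive) set, one bit per SM."""
    if not 0 <= first <= last:
        raise ValueError("need 0 <= first <= last")
    width = last - first + 1
    return ((1 << width) - 1) << first


# Illustrative split taken from the example in the text: 96 SMs, halved.
PREFILL_MASK = sm_mask(0, 47)
DECODE_MASK = sm_mask(48, 95)

# The two partitions must never overlap, and together they cover the whole GPU.
assert PREFILL_MASK & DECODE_MASK == 0
assert PREFILL_MASK | DECODE_MASK == (1 << 96) - 1
```

Python integers are unbounded, so a 96-bit mask is fine here. A fixed-width mask type in C would need a different representation for a GPU this size.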

The two engines -- prefill and decode -- run in separate processes under NVIDIA MPS (Multi-Process Service), which handles the GPU context multiplexing. They communicate through a shared CPU buffer and a unified GPU memory pool so KV cache doesn't have to move between engines. The scheduler monitors both continuously with microsecond-level overhead via non-intrusive model instrumentation, and dynamically rebalances the SM partition in real time based on what the current batch composition needs.
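Why the unified pool matters is easier to see in a toy model. The sketch below is a single-process stand-in I wrote for illustration: `KVPool`, `prefill`, and `decode_step` are hypothetical names, the "KV entries" are just token ids, and there is no GPU memory or MPS involved. The one thing it tries to show is that the handoff between engines is a list of block ids, not a copy of the cache.

```python
class KVPool:
    """Toy stand-in for a unified KV-cache pool: fixed-size blocks, one free list."""

    def __init__(self, num_blocks: int):
        self.blocks = [None] * num_blocks       # block storage (lists of token ids here)
        self.free = list(range(num_blocks))     # indices of unused blocks

    def alloc(self) -> int:
        if not self.free:
            raise MemoryError("KV pool exhausted")
        idx = self.free.pop()
        self.blocks[idx] = []
        return idx

    def release(self, idx: int) -> None:
        self.blocks[idx] = None
        self.free.append(idx)


def prefill(pool: KVPool, prompt_tokens: list, block_size: int = 4) -> list:
    """Fill blocks with the prompt's entries; return only the block ids."""
    block_ids = []
    for start in range(0, len(prompt_tokens), block_size):
        idx = pool.alloc()
        pool.blocks[idx].extend(prompt_tokens[start:start + block_size])
        block_ids.append(idx)
    return block_ids


def decode_step(pool: KVPool, block_ids: list, new_token: int, block_size: int = 4) -> list:
    """Append one token's entry in place, growing into a new block when the last one is full."""
    if len(pool.blocks[block_ids[-1]]) == block_size:
        block_ids.append(pool.alloc())
    pool.blocks[block_ids[-1]].append(new_token)
    return block_ids


pool = KVPool(num_blocks=8)
handoff = prefill(pool, list(range(6)))      # 6 prompt tokens fill 2 blocks
handoff = decode_step(pool, handoff, 100)    # decode writes into the same blocks
```

Nothing in the handoff scales with sequence length except the id list, which is the property that lets two engines share a cache without a transfer step.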

This is the part that took the longest to build: a real-time performance model that knows, given the current mix of prefill and decode work, what SM split maximizes throughput while keeping both engines inside their SLO. Static splits don't work -- a 70/30 prefill/decode SM partition is wrong during a burst of short decode-only traffic and wrong in the opposite direction during a prefill-heavy admission surge. Bullet's control loop adjusts the partition at microsecond granularity without restarting either engine.
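Here is a deliberately crude sketch of that kind of decision, to show why a static split fails. It is not Bullet's performance model. The latency functions, every constant in them, the 96-SM total, and the search granularity are all invented for illustration.

```python
TOTAL_SMS = 96  # hypothetical GPU size, matching the 0-95 example earlier


def decode_latency_ms(decode_sms: int, batch_tokens: int) -> float:
    """Toy model: a bandwidth floor plus a compute term that shrinks with more SMs."""
    bandwidth_floor_ms = 8.0                       # made-up cost of streaming weights once
    compute_ms = 0.5 * batch_tokens / decode_sms   # made-up per-token compute cost
    return max(bandwidth_floor_ms, compute_ms)


def prefill_latency_ms(prefill_sms: int, prompt_tokens: int) -> float:
    """Toy model: prefill is compute-bound, so time scales inversely with SM count."""
    return 2.0 * prompt_tokens / prefill_sms


def choose_split(prompt_tokens, batch_tokens, decode_slo_ms, prefill_slo_ms, step=8):
    """Return (prefill_sms, decode_sms) giving prefill the most SMs while both SLOs hold, else None."""
    for decode_sms in range(step, TOTAL_SMS, step):   # try the smallest decode share first
        prefill_sms = TOTAL_SMS - decode_sms
        if (decode_latency_ms(decode_sms, batch_tokens) <= decode_slo_ms
                and prefill_latency_ms(prefill_sms, prompt_tokens) <= prefill_slo_ms):
            return prefill_sms, decode_sms
    return None


print(choose_split(4096, 64, 10.0, 200.0))    # prefill-heavy mix
print(choose_split(512, 1024, 10.0, 200.0))   # decode-heavy mix
```

Under these made-up models the prefill-heavy mix gets an 88/8 split and the decode-heavy mix gets 40/56. Any single fixed ratio is wrong for at least one of them, which is the argument for rebalancing continuously.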

1.26x average throughput gain over state-of-the-art. Up to 1.55x at peak. While consistently meeting latency constraints. On real-world workloads, not synthetic benchmarks.

The thing I keep coming back to: this makes prefill-decode disaggregation possible without buying a second GPU.

The inter-GPU disaggregation story -- split prefill and decode onto separate servers -- requires two machines, an interconnect, a KV transfer layer. It's correct at scale. It's expensive and operationally complex, and for a lot of serving deployments it's overkill.

Bullet runs both on the same GPU. No new hardware. No KV transfer across the network. No RDMA. Just a CUDA SM partition and two processes. The KV cache lives in the same GPU memory pool accessible to both engines.

"But won't SM contention--" It won't; that's the entire point of the partition. The engines don't share SMs. They share memory bandwidth, which is fine: prefill is bottlenecked on compute rather than memory bandwidth, so decode's bandwidth-heavy access pattern has little to contend with. The resource profiles are complementary. They fit together.

The reason this wasn't done before is tooling. Hard SM partitioning at the application level wasn't really accessible before libsmctrl packaged the underlying SM-masking mechanism into a usable library. The MPS multi-process approach also required careful engineering to avoid the context-switching overhead that historically made GPU multi-tenancy painful. Both of those constraints loosened in the Hopper generation. Bullet is the paper that used them.

The code is at github.com/zejia-lin/BulletServe. It was originally forked from SGLang. The authors note it's a research prototype without full feature parity. Integration into vLLM is open in issue #27093 -- the same issue where someone from the Bullet team posted the proof-of-concept. I expect it in production frameworks within 12 months. That's how these things go.

Inter-GPU disaggregation split the problem across machines.

Bullet split it across SM allocations on the same machine.

The hardware had two resource profiles the whole time. They just ran sequentially because nobody partitioned them spatially until six weeks ago.

1.26x throughput on real workloads from a kernel-level scheduling change. No new hardware. The gains were already in the silicon. They were just waiting for someone to run both engines at once.

P.S. The complementary paper at ASPLOS '26 is "Towards High-Goodput LLM Serving with Prefill-decode Multiplexing" -- a different group, same conference, attacking the same problem from a slightly different angle. Where Bullet uses hard SM partitions, the multiplexing paper uses temporal interleaving with tighter latency modeling. Both shipped at the same conference six weeks apart. Either the problem was riper than anyone realized, or ASPLOS reviewers saw something in this direction and took everything that addressed it. Either way: two independent solutions to the same root cause in one conference proceedings is a signal.