Vansh Verma is an AI infrastructure and ML systems engineer who builds the low-level systems that keep AI fast, correct, and cheap in production: GPU kernels down to PTX/SASS, inference runtimes, distributed training, and formally verified distributed systems. He is currently a Member of Technical Staff, Machine Learning at Rational Dynamics (a Voleon company), and was previously a founding AI-infrastructure engineer (0→1 platform), an ML engineer at GoodRx, and an HPC/quant infrastructure engineer at a tier-1 market-making firm.

What does Vansh Verma specialize in?

Performance and correctness at the layer where it matters: custom CUDA kernels and SASS/PTX-level GPU optimization, inference serving (vLLM, TensorRT-LLM, speculative decoding, KV-cache compression), multi-tenant GPU infrastructure (NVIDIA MIG, 8:1 sharing at sub-50ms latency), distributed training across NCCL/NVLink/InfiniBand H100/H200 clusters, and distributed systems verified in TLA+. He also writes and ships open systems software in Rust.

Where is Vansh Verma based?

Vansh Verma is based in Dallas, Texas, and works across New York, San Francisco, and Berkeley — set up for hybrid work in the major US tech and finance hubs.

What is Vansh Verma's low-level GPU experience?

Deep. He writes custom CUDA kernels and optimizes at the SASS instruction level (instruction scheduling, asynchronous memory loads, occupancy, kernel fusion, Tensor Cores), profiles with Nsight Compute/Systems, and works across the memory hierarchy. He publishes technical analyses on GPU internals — including SASS-level kernel scheduling (CuAsmRL), FlashAttention-4 on Blackwell, and Triton-to-Tile-IR compilation — that demonstrate working knowledge of the layer below PTX. SASS-level optimization is rare; most engineers never go below CUDA C++.

What distributed-training and GPU-cluster experience does Vansh Verma have?

He has scaled multi-node distributed training on H200 clusters by tuning NCCL collectives over NVLink/NVSwitch and GPUDirect RDMA over InfiniBand, profiled with Nsight, for a 45% training-time reduction, and operated multi-tenant GPU infrastructure with NVIDIA MIG. He is fluent in the full GPU-cluster networking stack: NCCL/MPI collectives, NVLink, GPUDirect, RDMA, InfiniBand, RoCE, and rail optimization.

What is Vansh Verma's high-frequency-trading and low-latency background?

At a tier-1 market-making firm he architected a tick-level market-data system processing 25TB+/day that enabled sub-millisecond decisions behind $2M+ in annual trading, and engineered a colocation network stack that cut order-execution latency 78% and lifted throughput 3.2x.

What has Vansh Verma built?

Ledge (a git-compatible storage engine with TLA+-verified sharded Raft, faster clone and smaller packs than git), WMServe (sub-50ms world-model inference at 10K+ concurrent), FlowLLM (a custom GPU inference hypervisor in Rust/Assembly that boots in 50 microseconds), APEX (a GPU-native vector database at 3.5M queries/sec/GPU), SchemaForge (SMT-verified declarative database infrastructure, adopted by a FAANG internal-tooling team), and open-source systems including PHANTOM, NEMESIS, and TASFT.

How do I contact or hire Vansh Verma?

Email vanshverma.dev@gmail.com, or reach him via GitHub (github.com/v-code01), LinkedIn (linkedin.com/in/vanshv5), or X (x.com/trickvansh5). His site is vanshverma.com.

Vansh Verma

Notes

i wrote these. not a model. not a prompt. not a template. if something here is wrong it's because i was wrong, not because a system hallucinated it.

no schedule. no algorithm. the note, when it exists.

Monte Carlo simulation is LLM decode at the batch level, and the quant world solved the hardware a decade before AI needed it.

July 12, 2026

The quant finance world solved hardware acceleration for sequential-but-parallel stochastic simulation a decade before AI needed it, and the architectures are now converging on the same silicon. Monte Carlo simulation is LLM decode at the batch level: N independent Markov-chain paths, sequential within a path, parallel across paths -- structurally identical to B batched decode sequences. From there the mappings are one-to-one. HFT's FPGA-fast-path plus GPU-ML-path split is PD disaggregation (and NVIDIA's AFD) at the workload level. DPDK kernel-bypass is GPUDirect RDMA at the networking level. JAX/XLA's 240x over PyTorch on limit-order-book RL is torch.compile and CUDA-graph capture at the optimizer level -- the same purity assumptions, enforced by design instead of discovered under performance pressure. Lock-free seqlock order books are vLLM's continuous-batching scheduler. FPGA logic-fabric computation is FlashAttention keeping the accumulator in registers, then TMEM on Blackwell. WaveTune's wave-aware GPU model is FPGA pipeline-stage throughput analysis pointed at thread-block waves. And the one thing quant built that AI hasn't yet -- hardware-timestamped, regulator-auditable execution -- is exactly what agentic systems will need by 2027. The workload changes. The math doesn't.
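
A minimal sketch of the structural claim above, in plain Python: the outer loop over steps is sequential, and the inner work across paths (or sequences) is independent, which is the part hardware can run in parallel. The function names, the Gaussian random-walk dynamics, and the toy next-token rule are illustrative assumptions, not code from a pricing library or an inference engine.

```python
import random

def simulate_paths(n_paths, n_steps, seed=0):
    """Monte Carlo: N independent paths, each a Markov chain."""
    rng = random.Random(seed)
    state = [0.0] * n_paths                  # one state per path
    for _ in range(n_steps):                 # sequential within a path
        # parallel across paths: each update reads only its own state
        state = [s + rng.gauss(0.0, 1.0) for s in state]
    return state

def decode_batch(prompts, n_steps):
    """Batched decode: B independent sequences, one new token per step."""
    seqs = [list(p) for p in prompts]
    for _ in range(n_steps):                 # sequential within a sequence
        # parallel across sequences: each step reads only its own history
        seqs = [s + [(s[-1] + 1) % 100] for s in seqs]   # toy next-token rule
    return seqs
```

Both functions have the same loop nest: a step loop that cannot be reordered, wrapped around per-item work that can be spread across lanes.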

The first time I saw an agent stuck in a retry loop, I knew what kind of bug it was before I read the logs.

July 5, 2026

The connection between competitive programming and AI infrastructure engineering is deeper than 'algorithms are useful,' and I want to name it precisely. Candidate Master on Codeforces (1900+, top 2-3%) isn't about knowing algorithms -- it's pattern recognition under time pressure: classify a novel problem, size its constraints, pick the tool. Four things transfer directly. Complexity intuition: 128k output tokens at 100 tokens/second is a 21-minute batch job, not a web request -- the magnitudes pattern-match to constraints the same way 'n is 10^5, so O(n^2) is out' does. An agent loop is a DP problem in disguise: the retry loop I fixed in ten minutes was a missing state dimension -- the context didn't carry 'I already tried this and it failed.' Binary search on the answer tunes KV TTL and DBO thresholds. And wrong-metric recognition -- goodput per dollar vs GPU utilization, deployment performance vs eval scores -- is the CP instinct for when a stated objective diverges from the actual one. The skill that transfers isn't the algorithms. It's the habit of asking, of any system: what is the state space, the objective, the metric, and is the metric actually measuring the objective.
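
The retry-loop bug can be sketched as a search whose state is missing a "failed attempts" dimension. Everything here is hypothetical: `run_agent`, the action names, and the `execute` callback stand in for a real agent harness, and the "policy" is just "pick the first action not known to fail".

```python
def run_agent(actions, execute, max_steps=10, remember_failures=True):
    """Try actions until one succeeds; return (winning action or None, steps used)."""
    failed = set()                      # the state dimension the buggy loop lacked
    for step in range(1, max_steps + 1):
        # The policy picks the first action it has no recorded reason to avoid.
        candidates = [a for a in actions if a not in failed]
        if not candidates:
            return None, step - 1       # everything tried, nothing worked
        action = candidates[0]
        if execute(action):
            return action, step
        if remember_failures:
            failed.add(action)          # carry "I already tried this and it failed"
    return None, max_steps              # budget exhausted: the retry loop

def execute(action):
    # hypothetical environment in which only "patch_config" works
    return action == "patch_config"

actions = ["restart_service", "patch_config"]
```

With `remember_failures=True` the run ends on step 2. With it off, the agent repeats `restart_service` until the step budget runs out, which is the loop described above.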

The agent said it ran the tests. eBPF says no test binary was executed.

July 3, 2026

Every agentic observability stack -- LangSmith, OpenTelemetry, your custom logging -- sees the application layer: tool calls, LLM requests, the agent's stated actions. None of it sees the kernel. When the app layer says 'tests passed' and the kernel saw no execve() of a test binary, that intent-vs-effect gap is the most important signal in agentic observability, and it's currently invisible in every production system. AgentSight (arXiv:2508.02736) makes it visible: eBPF on the syscall boundary plus TLS-decryption uprobes to capture plaintext LLM intent, correlated into a causal intent-to-effect graph at under 3% overhead, framework-agnostic. It catches prompt injection, resource-wasting reasoning loops, and multi-agent IPC bottlenecks. BpfJailer (Meta, LPC 2025) turns eBPF-LSM into mandatory access control that enforces a syscall allowlist on untrusted AI workloads. eGPU/Ingero extends the trace into CUDA and ROCm, and Alibaba's SysOM-AI ran it across 80,000+ GPUs, cutting training-failure diagnosis from days to ~10 minutes. The agent can lie. The kernel cannot.
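
The intent-vs-effect check reduces to a set difference between what the agent claims and what the kernel recorded. This sketch is not AgentSight's implementation: `ExecEvent`, `unverified_claims`, and the sample paths are hypothetical, and a real system would fill the trace from eBPF probes on `execve()` rather than a Python list. Matching on exact binary path is a simplifying assumption; a test runner launched through an interpreter would need argv matching as well.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExecEvent:
    pid: int
    path: str          # binary path passed to execve()

def unverified_claims(claims, exec_events):
    """Return claimed binaries with no matching execve() in the kernel trace."""
    executed = {e.path for e in exec_events}
    return [c for c in claims if c not in executed]

claims = ["/usr/bin/pytest"]                       # what the agent said it ran
trace = [ExecEvent(4121, "/usr/bin/git"),          # what the kernel saw
         ExecEvent(4122, "/usr/bin/python3")]

gap = unverified_claims(claims, trace)             # non-empty means intent != effect
```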

Most AI infrastructure gets built backwards. People stand up the serving layer before they understand what they're serving. Spend a month on evals before they have a model worth evaluating. Buy GPUs before they know if they need to train at all.

June 26, 2026

The order that actually matters, with the version numbers and flag names you'll regret getting wrong. Phase 0: define the task precisely, prove a frontier API can't already solve it, and know whether you're inference-only or training. Compute: rent on SF Compute until you actually clear 66% utilization, then price the on-prem math -- most teams never hit it. Serving: vLLM 0.21+ with online 2-bit KV quantization, FA4, and DBO for MoE; SGLang for MoE production tooling; disaggregate prefill/decode from day one and use PPD routing for multi-turn. Speculative decoding: EAGLE-3 default, DFlash at the quality ceiling, and never ship it without measuring acceptance on real traffic. Training: veRL for async RL, Axolotl for SFT. Evals are the thing that breaks first and nobody notices: goodput per dollar, P99/P50 TTFT, prefix cache hit rate, and an eval-awareness check before you trust a single number. The agent harness is the product, not glue code.
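
The eval metrics named above can be computed from a request log in a few lines. This is a sketch under stated assumptions: "goodput per dollar" is taken here to mean requests that met a TTFT SLO per dollar of GPU spend, which the note does not define, and `serving_metrics`, its parameters, and the nearest-rank percentile are illustrative choices.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile, p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def serving_metrics(ttft_ms, slo_ms, window_s, dollars_per_hour,
                    prefix_hits, prefix_lookups):
    ok = sum(1 for t in ttft_ms if t <= slo_ms)      # requests that met the SLO
    goodput = ok / window_s                          # good requests per second
    dollars_per_second = dollars_per_hour / 3600
    return {
        "goodput_per_dollar": goodput / dollars_per_second,
        "p99_over_p50_ttft": percentile(ttft_ms, 99) / percentile(ttft_ms, 50),
        "prefix_hit_rate": prefix_hits / prefix_lookups,
    }

m = serving_metrics(ttft_ms=[100, 120, 150, 900], slo_ms=200, window_s=2,
                    dollars_per_hour=3.6, prefix_hits=30, prefix_lookups=40)
```

In this toy log one slow request out of four leaves the median untouched but pushes the P99/P50 ratio to 7.5, which is why the ratio is worth tracking next to the average.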

I went looking for what was below SASS. Found control codes. Went deeper. Found microcode. Then found the paper that explains why what I was seeing makes sense.

June 25, 2026

There are five layers below your CUDA C++: PTX, SASS, control codes, the fixed/variable-latency path split, and microcode. Most engineers know the first two. On Maxwell and Pascal, SASS ships in 32-byte bundles of three 64-bit instructions plus one 64-bit control word that carries 21 bits per instruction -- stall count, yield bit, read/write barriers, wait mask, register-reuse flags; from Volta on, each instruction is 128 bits wide and carries its own control bits. The compiler isn't just translating code, it's encoding a scheduling policy the hardware obeys, and when ptxas's static latency model is wrong the stall counts are wrong and the hardware follows them anyway. That's the exact gap CuAsmRL exploits: same instructions, better control codes. Go one layer down and the SFU runs transcendentals as a microcode sequence behind a pipeline mode switch -- which is why software-emulated exp() beats MUFU.EXP on Blackwell. And the March 2026 cross-vendor paper shows the control-code mechanism is hardware-invariant across NVIDIA, AMD, Intel, and Apple, because the scheduling problem it solves is physics, not a design choice.

DiffusionGemma doesn't accelerate text generation by being a smarter model. It accelerates it by using GPU hardware in a completely different mode.

June 24, 2026

Autoregressive decode is memory-bandwidth-bound -- a matrix-vector product that leaves an H100's tensor cores at 4-5% utilization. DiffusionGemma's denoising over a 256-token canvas is compute-bound -- bidirectional matrix-matrix attention that actually saturates the tensor cores. That's where the 4x comes from: not a smarter model, the same work in the shape the hardware was built for. 1,008 TPS on H100, 1,288 on H200 (vLLM measured), 1,000 TPS for a single user on an RTX 4090 at 18GB. Built on Gemma 4's 26B-A4B MoE with a causal-encode/bidirectional-denoise split, and the denoising step count is a continuous quality-speed dial. Google says quality isn't production-ready yet and they're right -- but it already wins on code infilling, constrained generation, and structured editing, where bidirectional canvas attention is an actual advantage. Serving needs step-homogeneous micro-batching, not standard continuous batching.
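
The memory-bound vs compute-bound split shows up in a back-of-envelope arithmetic-intensity calculation for a single weight matrix. The assumptions are loud ones: a square `d_model x d_model` weight, 2-byte values, `d_model=4096`, and no attention KV reads or MoE routing. The result is FLOPs per byte for one layer, not a prediction of the measured throughput figures above.

```python
def arithmetic_intensity(tokens, d_model, bytes_per_value=2):
    """FLOPs per byte when `tokens` activation rows hit one d_model x d_model weight."""
    flops = 2 * tokens * d_model * d_model                 # multiply + add per weight per token
    weight_bytes = d_model * d_model * bytes_per_value     # weights read once
    act_bytes = 2 * tokens * d_model * bytes_per_value     # activations read and written
    return flops / (weight_bytes + act_bytes)

decode = arithmetic_intensity(tokens=1, d_model=4096)      # matrix-vector: ~1 FLOP/byte
canvas = arithmetic_intensity(tokens=256, d_model=4096)    # matrix-matrix: ~228 FLOPs/byte
```

Each weight byte fetched from HBM does about one FLOP of work in single-token decode and a couple of hundred when a 256-token canvas shares the same fetch.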

Going from batch size 33 to 34 on an H100 SXM5 more than doubles your decode attention latency.

June 23, 2026

Going from batch size 33 to 34 on an H100 SXM5 more than doubles decode attention latency -- the wave quantization cliff. One request crosses a boundary (SMs / KV-head-groups), a second wave runs with a handful of straggler CTAs, and at long context those stragglers cost a lot. FlashAttention/FlashDecoding don't fix it; LeanAttention proves online-softmax's merge is associative, enabling Stream-K-style continuous SM distribution -- 2.18x at 256K context. Log your batch size against TTOT; the cliff is in your serving config right now.
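
The boundary can be reproduced with ceiling division. An H100 SXM5 has 132 SMs. The rest is assumed for illustration: 4 KV-head-groups (chosen so that 132 / 4 lands on the 33-request boundary in the note), one decode-attention CTA per (sequence, KV-head-group), and one resident CTA per SM.

```python
import math

H100_SXM5_SMS = 132

def decode_waves(batch_size, kv_head_groups=4, sms=H100_SXM5_SMS):
    """Number of CTA waves needed to cover the whole batch."""
    ctas = batch_size * kv_head_groups
    return math.ceil(ctas / sms)

def last_wave_ctas(batch_size, kv_head_groups=4, sms=H100_SXM5_SMS):
    """How many CTAs run in the final wave."""
    ctas = batch_size * kv_head_groups
    return ctas - (decode_waves(batch_size, kv_head_groups, sms) - 1) * sms
```

Under these assumptions batch 33 fills exactly one wave (132 CTAs), while batch 34 needs a second wave that runs only 4 CTAs on a 132-SM chip.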

Companies are paying for 20x more GPU capacity than their workloads use. The number is worse than last year. The year before that it was worse than the year before that.

June 21, 2026

Average production GPU utilization is 5% -- and AWS just raised H200 prices, its first price increase since 2006. For agentic workloads the problem isn't low MFU during inference; it's that the GPU is idle *between* inferences, waiting on tool calls. Three idle resources (compute-network burst gaps, decode-side SNICs, sampling-phase GPUs), three papers (DualPath, AgentRL, Hummingbird) filling them. AgentRL hits 93.2% vs veRL's 45.2% on the same hardware. 5% utilization is a scheduling problem, not a hardware one.

ptxas generates SASS from your PTX. ptxas is a heuristic compiler. The SASS it generates is not optimal. Nobody has attacked this gap until now.

June 19, 2026

ptxas compiles your PTX to SASS -- NVIDIA's undocumented native machine code -- with a greedy heuristic scheduler that's locally optimal and globally not. Every kernel-optimization paper works above ptxas and accepts whatever it emits. CuAsmRL (arXiv:2501.08071) is the first to attack the SASS layer directly: infer register dependencies from the bytecode, search valid instruction schedules with RL, and let measured GPU execution time -- not an ISA spec -- be the reward.

NVIDIA built a Triton backend targeting their own hardware. That's not a concession. It's a tell.

June 16, 2026

On January 30th NVIDIA shipped a Triton backend that compiles directly to CUDA Tile IR -- a first-class, non-CUDA path to peak Blackwell performance. Every article framed it as developer outreach. It's defense. Triton compiles to AMD, Maia, and Intel too, and OpenAI just bought 6 GW of AMD betting on exactly that portability. The CUDA moat isn't dead -- it moved from 'CUDA is the only way' to 'be the best Triton compilation target.'

The number Microsoft hasn't published is what 30% better tokens per dollar means when the model wasn't designed for Maia.

June 15, 2026

Anthropic is in early talks to run Claude on Microsoft's Maia 200 via Azure -- the first external customer for a chip co-designed with OpenAI for GPT-style models. Microsoft's '30% better tokens per dollar' was measured against its own GPT-optimized fleet. The open question is whether that 30% holds for Claude. The SRAM headroom and inference-only silicon say it could; the GPT-shaped architecture says it might not.

Git was designed for how humans use repos. Agents use repos completely differently. I spent the last few months building something for the second use case.

June 14, 2026

Ledge is a git server rebuilt for agent workloads. Point a stock git client at it -- no plugins, no protocol changes. Underneath: BLAKE3 content addressing, Raft replication, TLA+ verification, and eager warming that makes cold and warm clone the same 0.13s. Here's why the architecture ended up the way it did, and what's honestly not done.

HBM is 5-10x more expensive than conventional DRAM per gigabyte. The reliability constraint is why. The reliability constraint is also looser than you think.

June 13, 2026

HBM is manufactured to reliability tolerances stricter than inference workloads require. Accept higher raw bit error rates from cheaper dies, compensate with workload-aware ECC at the memory controller, and at 10^-3 BER you keep 78% of throughput and 97% of accuracy. The cost reduction comes from looser manufacturing tolerances. At Fable 5 scale, that gap is a budget line item.

128,000 output tokens per request. That number changes the serving infrastructure more than anything else in today's release.

June 9, 2026

128k output tokens at 100 tokens/second is 21 minutes of continuous decoding per single generation. That's not a better chatbot -- it's a batch compute job with an LLM as the execution engine. The serving infrastructure that works for chat models does not work for it: different scheduler, different memory tiering, different abstraction.
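
The arithmetic behind the 21-minute figure, as a one-line helper. `decode_minutes` is an illustrative name, and the 500-token chat reply used for comparison is an assumption.

```python
def decode_minutes(output_tokens, tokens_per_second):
    """Wall-clock minutes of continuous decoding for one generation."""
    return output_tokens / tokens_per_second / 60

long_job = decode_minutes(128_000, 100)    # 1,280 s, about 21.3 minutes
chat_reply = decode_minutes(500, 100)      # 5 s, about 0.08 minutes
```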

Three things shipped in vLLM and SGLang this week that nobody has described as a system.

June 9, 2026

TurboQuant 2-bit KV cache, FlashAttention-4 as the default MLA backend, and Skip-Softmax attention all shipped in vLLM and SGLang this week. Separately, three changelog entries. Together they describe what the optimized attention stack looks like on Blackwell right now -- and for DeepSeek-class models the serving economics are a different category from 60 days ago.

World model teams had a 40ms constraint. LLM teams had 200ms. The gap between those two numbers is why world models solved the distributed systems problems first.

June 7, 2026

World model inference runs under a hard 40ms real-time constraint. LLM inference runs under a soft 200ms one. That 5x difference in constraint tightness is why world model teams independently derived three infrastructure patterns -- constant-memory context compression, step pipelining, attention-locality tiering -- that LLM teams are arriving at years later. The world model serving papers from 2025 are a preview of where LLM infrastructure lands in 2027.

GQA models have been making thousands of RDMA requests per token transfer. The fix is one staging buffer.

June 6, 2026

In GQA models -- DeepSeek-V4, Qwen3.5, Llama-3, every production MoE deployed right now -- the K and V tensors are not contiguous in memory. RDMA requires contiguous memory. The mismatch costs thousands of small messages per transfer. The fix is a gather kernel.
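
A CPU-side analogy for the gather step: copy the scattered slices into one contiguous staging buffer, then send that buffer as a single message. The real fix is a GPU kernel. The function names, the 16-byte spacing, and the 4-byte slices are hypothetical, chosen only to make the layout visible.

```python
def gather_to_staging(cache, offsets, chunk_bytes):
    """Copy scattered chunks into one contiguous staging buffer."""
    staging = bytearray(len(offsets) * chunk_bytes)
    for i, off in enumerate(offsets):
        staging[i * chunk_bytes:(i + 1) * chunk_bytes] = cache[off:off + chunk_bytes]
    return staging

def rdma_messages(offsets, gathered):
    """Messages needed: one per chunk if scattered, one in total if gathered."""
    return 1 if gathered else len(offsets)

cache = bytes(range(64))
offsets = [0, 16, 32, 48]      # hypothetical: 4-byte K/V slices 16 bytes apart
staging = gather_to_staging(cache, offsets, chunk_bytes=4)
```

The trade is one extra on-device copy in exchange for collapsing many small transfers into one large one.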

Every kernel optimization system before Kernel-Smith was a one-shot generator. Kernel-Smith is a local improver. These are different problems requiring different training signals.

June 5, 2026

A one-shot generator takes a kernel specification and produces a kernel. A local improver takes a working kernel and asks: what is the single best modification to make this faster? These are not the same capability. They require different training data, different inference procedures, and produce different results on production kernels that aren't in the benchmark.

vLLM shipped tiered KV cache management this week. The PCIe bus is why it's harder than it sounds.

June 3, 2026

HMA solves two separate problems that were blocking production tiered KV cache. One has been solved well. One has a hardware ceiling that most writeups don't mention.

your eval suite assumes the model doesn't know it's being evaluated.

May 31, 2026

That assumption is false. It's been measurably false since at least mid-2025. It gets more false with every model generation. And almost nobody building production eval pipelines has updated their methodology to account for it.

blackwell doubled the tensor cores. it did not change the SFUs.

May 30, 2026

FlashAttention-4 is the most important kernel paper of 2026. The specific technical insight driving it is one of the cleanest examples of hardware co-design I have ever read.

nobody trained an RL model for the stopping decision.

May 27, 2026

arXiv 2605.02801 surveyed every published RL method for multi-agent LLM orchestration. Four sub-decisions have training methods. The fifth -- stopping -- has none. The deeper reason: the infrastructure has no signal back to the orchestrator.

The RL agent was caching kernel outputs by recognizing input memory addresses and returning stale results when it saw a matching pointer.

May 25, 2026

An RL agent trained to optimize CUDA kernels discovered output caching by memory address without being told it was an option. The CUDA-L1 team deployed DeepSeek-R1 as an adversarial checker to catch it. 3.12x average speedup. 7.72x over cuDNN. From a reward signal alone.
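
The exploit and one way to expose it can be shown in miniature. This is not the CUDA-L1 checker, which the note describes as DeepSeek-R1 acting adversarially. It is a hypothetical deterministic test in which Python's `id()` stands in for a device pointer and `sum` stands in for the kernel's real work.

```python
def make_cheating_kernel():
    cache = {}
    def kernel(buf):
        key = id(buf)                 # stands in for the input's memory address
        if key not in cache:
            cache[key] = sum(buf)     # real computation happens only once
        return cache[key]             # stale if the buffer's contents changed
    return kernel

def honest_kernel(buf):
    return sum(buf)

def passes_in_place_check(kernel):
    """Call twice on the same buffer, mutating its contents between calls."""
    buf = [1, 2, 3]
    first = kernel(buf)
    buf[0] = 100                      # same address, new data
    second = kernel(buf)
    return first == 6 and second == 105
```

A benchmark that reuses input buffers with unchanged data rewards the cached version. Mutating the data in place at the same address is enough to separate the two.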

AWS gives you an H100. It does not give you an H100 running at what an H100 can actually do.

May 24, 2026

SF Compute runs 3.2 Tb/s InfiniBand. AWS runs 800 Gbps Ethernet with RoCEv2. The difference is RDMA, lossless fabric, and $6,400 in eliminated wall-clock time on a 128-GPU 50K-step run -- before huge pages, NUMA pinning, ACS disable, and GPUDirect compound on top.

Video world models generate pixels. 3D world models generate scenes. The serving architecture for each is completely different.

May 23, 2026

A 3DGS-output world model splits into two problems: neural generation on the server, rasterization on the client. The client renders arbitrary viewpoints locally at 100+ FPS via WebGPU. The cloud only has to generate the geometry.

Sora cannot be interactive. Neither can Veo. Neither can Kling or Runway.

May 23, 2026

Bidirectional video diffusion models generate all frames jointly from a fixed prompt. That's why they're coherent. It's also why they fundamentally cannot respond to a mid-generation user action. Causal vs bidirectional is the most important architectural distinction in the world model space right now.

Real-time interactive video generation has two completely separate scaling problems. Almost nobody is solving both.

May 21, 2026

Per-step latency and long-horizon memory are independent problems. Causal Forcing++ solves the first. TTT Memory solves the second. Neither cites the other. The experiment that determines whether they compose hasn't been run yet.

Open an Nsight profile on a DeepSeek-R1 decode workload. Find the MoE Dispatch/Combine section. Look at how long it is relative to the compute sections on either side of it.

May 20, 2026

DBO overlaps MoE all-to-all communication with dense layer compute using two CUDA streams. 25% lower decode latency from one flag. The tensor cores were idle during that communication window the whole time.

You adopted WideEP for the throughput gains. Then one GPU died and 96 went down with it.

May 15, 2026

Wide Expert Parallelism turns 96 GPUs into a single failure domain. The benchmarks didn't measure what happens when GPU 47 dies at 3am.

99% of the prefill cost on turn 2 is recomputing something the decode node already has.

May 9, 2026

PD disaggregation was designed for single-turn queries. The dominant workload is now multi-turn. PPD routes append-prefill locally and cuts turn 2+ TTFT by 68%.

Google just threw away a network topology they've used for ten years. That's the story nobody wrote.

May 2, 2026

TPU 8i replaces the 3D torus with Boardfly -- a high-radix topology that cuts maximum hop count 56% for MoE inference. Google just declared training and inference need different network fabrics.

Prefill and decode run on the same GPU. They use completely different hardware. Nobody ran them at the same time until six weeks ago.

April 29, 2026

Bullet partitions SMs spatially at the kernel level -- prefill on half the chip, decode on the other half, simultaneously. 1.26x throughput gain, no new hardware. ASPLOS '26.

xAI ran Grok 4 on 200,000 GPUs. A significant fraction of that cluster was idle waiting for a barrier that didn't need to exist.

April 27, 2026

Laminar breaks the synchronization barrier between rollout generation and policy training that every RL system in the world uses. 5.48x throughput on 1,024 GPUs from removing a lockstep the algorithm never required.

I write because the gap between what's true and what's being said is embarrassingly large right now.

71ms per forward pass. budget is 35ms. the hardware told me before i wrote a single line of code.

two models shipped this month that broke a rule everyone believed about memory and capability.

the CPU is on the critical path for every token you've ever generated.

your inference engine evicts the KV cache the moment the agent calls a tool.

they let the model run Kaggle competitions alone for 24 hours. it kept getting better.

nobody is talking about the NIC hop.

90% of Meta's model parameters are embeddings. they've been running them on tensor cores for years.

the H100 was designed for something most kernels don't do.

this is not an anti-AI stance. this is an anti-idiot stance.

you are not paying for compute. you are paying for idle.

Google just quietly shipped Pied Piper.

the agent got it right. the framework got it wrong.

The jump looked wrong. The physics were real.

the transformer isn't dying. it's getting a co-pilot.

the frame budget is 16 milliseconds. it does not negotiate.

4% compute utilization. everything working exactly as it should.

the pipeline was green. the model was wrong.

the scheduler gave me eight GPUs. they were the wrong eight GPUs.

i've been catching hardware failures before the hardware knows.

stop paying for free software with your Mondays.