Vansh Verma is an AI infrastructure and ML systems engineer who builds the low-level systems that keep AI fast, correct, and cheap in production: GPU kernels down to PTX/SASS, inference runtimes, distributed training, and formally verified distributed systems. He is currently a Member of Technical Staff, Machine Learning at Rational Dynamics (a Voleon company), and was previously a founding AI-infrastructure engineer (0→1 platform), an ML engineer at GoodRx, and an HPC/quant infrastructure engineer at a tier-1 market-making firm.

What does Vansh Verma specialize in?

Performance and correctness at the layer where it matters: custom CUDA kernels and SASS/PTX-level GPU optimization, inference serving (vLLM, TensorRT-LLM, speculative decoding, KV-cache compression), multi-tenant GPU infrastructure (NVIDIA MIG, 8:1 sharing at sub-50ms), distributed training across NCCL/NVLink/InfiniBand H100/H200 clusters, and distributed systems verified in TLA+. He also writes and ships open systems software in Rust.

Where is Vansh Verma based?

Vansh Verma is based in Dallas, Texas, and works across New York, San Francisco, and Berkeley, so he is set up for hybrid work in the major US tech and finance hubs.

What is Vansh Verma's low-level GPU experience?

Deep. He writes custom CUDA kernels and optimizes at the SASS instruction level (instruction scheduling, asynchronous memory loads, occupancy, kernel fusion, Tensor Cores), profiles with Nsight Compute/Systems, and works across the memory hierarchy. He publishes technical analyses on GPU internals — including SASS-level kernel scheduling (CuAsmRL), FlashAttention-4 on Blackwell, and Triton-to-Tile-IR compilation — that demonstrate working knowledge of the layer below PTX. SASS-level optimization is rare; most engineers never go below CUDA C++.

What distributed-training and GPU-cluster experience does Vansh Verma have?

He has scaled multi-node distributed training on H200 clusters, tuning NCCL collectives over NVLink/NVSwitch and GPUDirect RDMA over InfiniBand and profiling with Nsight, for a 45% reduction in training time. He has also operated multi-tenant GPU infrastructure with NVIDIA MIG. He is fluent in the full GPU-cluster networking stack: NCCL/MPI collectives, NVLink, GPUDirect, RDMA, InfiniBand, RoCE, and rail optimization.

What is Vansh Verma's high-frequency-trading and low-latency background?

At a tier-1 market-making firm he architected a tick-level market-data system processing 25TB+/day that supported sub-millisecond decision-making behind $2M+ in annual trading decisions, and engineered a colocation network stack that cut order-execution latency by 78% and lifted throughput 3.2x.

What has Vansh Verma built?

Ledge (a git-compatible storage engine with TLA+-verified sharded Raft, faster clone and smaller packs than git), WMServe (sub-50ms world-model inference at 10K+ concurrent), FlowLLM (a custom GPU inference hypervisor in Rust/Assembly that boots in 50 microseconds), APEX (a GPU-native vector database at 3.5M queries/sec/GPU), SchemaForge (SMT-verified declarative database infrastructure, adopted by a FAANG internal-tooling team), and open-source systems including PHANTOM, NEMESIS, and TASFT.

How do I contact or hire Vansh Verma?

Email vanshverma.dev@gmail.com, or reach him via GitHub (github.com/v-code01), LinkedIn (linkedin.com/in/vanshv5), or X (x.com/trickvansh5). His site is vanshverma.com.

I write because the gap between what's true and what's being said is embarrassingly large right now.

That's the whole reason. I keep waiting for it to close and it doesn't.

There are hundreds of pieces every week about AI. Most of them are about models -- which one scored higher on which benchmark, which company raised more money, which CEO said something quotable. Some of them are about infrastructure in a surface way -- Jensen said the inference inflection point has arrived, here is a summary of the GTC keynote, here are the numbers he cited.

Almost none of them are about the actual problems. The specific, hard, unsolved engineering problems that determine whether any of this works at scale. The things that keep the people building this space awake at 2am not because they're anxious but because the problem is genuinely interesting and they can't stop thinking about it.

That gap is what I write into.

I got pulled into this space because I couldn't stop reading the papers.

Not because someone told me to. Not because it was strategically useful. Because I would find one paper -- something about KV cache management, or GPU scheduling, or post-training infrastructure -- and it would contain a number that didn't fit in my head, and I'd spend the next three hours chasing the references until I understood where the number came from. And then I'd surface and realize nobody had written about it in plain language anywhere.

That's the pattern that keeps happening. A paper gets published. It has real results -- 5x throughput, 9x latency reduction, 2.5x tokens per second from a software change on hardware you already own. It gets two citations. Nobody writes about it. The engineers who would benefit from knowing about it don't know it exists, because they're busy shipping and the paper is dense and nobody translated it.

I find that situation slightly maddening. So I write.

Why this space specifically.

The honest answer is that I think we are in one of those rare moments where the foundational decisions being made right now will determine the shape of an entire industry for a decade. The infrastructure decisions -- how you serve models, how you train them, what hardware you build around, how you schedule the work across a cluster -- these are being made in 2025 and 2026 by a relatively small number of people, and most of the options aren't even visible yet because the papers describing them haven't been translated into language engineers can act on.

That's a genuinely interesting problem to be writing around. Not "here's a think-piece about what AI means for society." The actual technical decisions. The ones with numbers. The ones where being right or wrong by a factor of two changes your compute bill by tens of millions of dollars.

I also think the problems are beautiful. I mean that in the way that mathematicians mean it -- there is a kind of elegance to a well-formulated constraint. The cost of the attention mechanism scales quadratically with context length, but the model's capability grows with context, and you need the capability, so you have to find a way to make the quadratic not matter. That's a hard problem. The kind you can spend years on and still feel like you haven't gotten to the bottom of it.
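To put rough numbers on the quadratic, here is a minimal sketch. It counts only the two sequence-by-sequence matrix multiplies in self-attention (Q times K-transpose, and scores times V) and ignores causal masking, the linear projections, and the MLP. The model shape (`D_MODEL`, `N_LAYERS`) is a made-up assumption for illustration, not any particular model.

```python
def attention_matmul_flops(seq_len: int, d_model: int, n_layers: int) -> int:
    """Rough FLOPs for the two sequence-by-sequence matmuls in self-attention.

    Counts Q @ K^T and scores @ V only, at about 2 * seq_len^2 * d_model FLOPs
    each per layer. Ignores causal masking, the linear projections, and the MLP.
    """
    per_layer = 2 * (2 * seq_len * seq_len * d_model)
    return per_layer * n_layers


# Hypothetical model shape, chosen only for illustration.
D_MODEL = 4096
N_LAYERS = 32

base = attention_matmul_flops(8_192, D_MODEL, N_LAYERS)
for seq_len in (8_192, 16_384, 32_768, 65_536):
    flops = attention_matmul_flops(seq_len, D_MODEL, N_LAYERS)
    # Each doubling of context multiplies this term by four.
    print(f"{seq_len:>6} tokens: {flops / base:.0f}x the 8K cost")
```

Under these assumptions, going from 8K to 64K tokens is 8x the context and 64x this part of the compute. The rest of the model grows only linearly with context, which is why this term ends up dominating at long lengths.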

The KV cache is a similar shape. Memory is finite. Context is infinite. The model needs everything it's ever seen to answer your question well. Something has to give. The papers I keep reading are all different attempts at negotiating that trade -- compression, eviction, pooling, offloading, tiering, restructuring the attention kernel so it doesn't need to see everything. None of them fully solve it. Each one moves the constraint somewhere else. That is a beautiful problem.
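The "memory is finite" side of that trade is easy to sketch. A standard transformer stores one key and one value vector per token, per layer, per KV head, so the cache grows linearly with context. The configuration below (32 layers, 8 KV heads, head dimension 128, fp16) is a hypothetical assumption picked for round numbers, not a specific model.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_elem: int = 2,  # 2 bytes assumes fp16/bf16 storage
) -> int:
    """Uncompressed KV cache size: keys plus values for every layer and token."""
    # The leading 2 accounts for storing both K and V.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical configuration, for illustration only.
CONFIG = dict(n_layers=32, n_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(seq_len=1, **CONFIG)
print(f"per token: {per_token / 2**10:.0f} KiB")          # 128 KiB

for seq_len in (8_192, 131_072):
    total = kv_cache_bytes(seq_len=seq_len, **CONFIG)
    print(f"{seq_len:>7} tokens: {total / 2**30:.0f} GiB per sequence")
```

With these assumed numbers a single 128K-token sequence needs 16 GiB of cache before any model weights are counted, and that is per concurrent request. Compression, eviction, and offloading each attack a different factor in that product.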

The thing I try to do when I write -- and often fail at -- is find the one true thing buried in whatever I'm looking at and say it out loud before the reader can negotiate with it.

Not the interesting thing. Not the surprising thing. The true thing. The thing that follows inevitably from the facts if you look at them directly enough. Usually it's something that's visible in the data but that nobody has said explicitly, because saying it explicitly is slightly uncomfortable.

The GPU utilization post started because I kept seeing teams report 85%, 90% utilization and treat it as a sign of success, and I knew from the math that 85% utilization on inference workloads is not a success metric -- it's a symptom of serving the wrong users. The true thing was: you are measuring how busy your hardware is, not how well you are serving people. Those two things can diverge silently. They do diverge, constantly, in production systems. Nobody was saying it.
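A toy sketch of that divergence, with made-up numbers. It assumes one GPU, a 10-second window, nine requests that all arrive at once and are served first-in-first-out at one GPU-second each, and a 2-second latency target. The `Request` type, the target, and the definition of goodput as "requests that met the latency target per second" are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Request:
    gpu_seconds: float      # time the GPU spent computing this request
    latency_seconds: float  # end-to-end latency the user saw, queueing included


def utilization(requests: list[Request], wall_seconds: float) -> float:
    """Fraction of the window the GPU was busy."""
    return sum(r.gpu_seconds for r in requests) / wall_seconds


def goodput(requests: list[Request], wall_seconds: float, slo_seconds: float) -> float:
    """Requests per second that finished within the latency target."""
    met = sum(1 for r in requests if r.latency_seconds <= slo_seconds)
    return met / wall_seconds


# Nine requests arrive together; the k-th one waits behind k-1 others,
# so its latency is k seconds even though it only used 1 GPU-second.
requests = [Request(gpu_seconds=1.0, latency_seconds=float(k)) for k in range(1, 10)]

WALL_SECONDS = 10.0
SLO_SECONDS = 2.0  # hypothetical latency target

print(f"utilization: {utilization(requests, WALL_SECONDS):.0%}")                      # 90%
print(f"throughput:  {len(requests) / WALL_SECONDS:.1f} req/s")                       # 0.9 req/s
print(f"goodput:     {goodput(requests, WALL_SECONDS, SLO_SECONDS):.1f} req/s")       # 0.2 req/s
```

The dashboard for this window would show 90% utilization, while only two of the nine users got a response inside the target.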

That's the post I want to write every time. The one where the true thing is hiding in the numbers and everyone has been politely not saying it.

I write about this space specifically -- AI infrastructure, systems engineering, the machinery underneath the models -- because I think it's where the most important problems are right now, and they're dramatically undercovered relative to their importance.

A 10% improvement in inference throughput at Anthropic's scale is not a footnote. It is hundreds of millions of dollars. It is the difference between being able to serve a new capability profitably or not. The engineering decisions that produce that 10% are made by people who read papers, run experiments, argue about kernel implementations. Those decisions are invisible in the coverage that most people consume.

I want to make them slightly less invisible.

Not because I think everyone needs to know about warp specialization or CXL memory pooling or the difference between goodput and throughput. Most people don't, and that's fine.

But the people who are making these decisions -- the engineers, the infra leads, the people choosing between hardware configurations that will determine their cost structure for the next three years -- those people deserve writing that treats them as the intelligent adults they are. That doesn't condescend. That says the true thing directly and trusts them to handle it.

That's what I'm trying to do.

Whether I'm succeeding is a different question. Most days I'm not sure.

But the gap is still there. Papers keep getting published. Numbers keep being buried. Problems keep being real and interesting and mostly invisible.

So I keep writing.

P.S. The problems I find most interesting right now, if you're curious: the KV cache at long contexts (unsolved), the RL post-training synchronization bottleneck (just being cracked open), the memory hierarchy for disaggregated inference (active research, nobody has the full answer), and what on-device inference actually means for privacy and economics when the models get small enough to run locally at genuine quality. That last one is the one that keeps me up. The implications are large and mostly unexplored.