Vansh Verma is an AI infrastructure and ML systems engineer who builds the low-level systems that keep AI fast, correct, and cheap in production — GPU kernels down to PTX/SASS, inference runtimes, distributed training, and formally-verified distributed systems. He is currently a Member of Technical Staff, Machine Learning at Rational Dynamics (a Voleon company), and was previously a founding AI-infrastructure engineer (0→1 platform), an ML engineer at GoodRx, and an HPC/quant infrastructure engineer at a tier-1 market-making firm.

What does Vansh Verma specialize in?

Performance and correctness at the layer where it matters: custom CUDA kernels and SASS/PTX-level GPU optimization, inference serving (vLLM, TensorRT-LLM, speculative decoding, KV-cache compression), multi-tenant GPU infrastructure (NVIDIA MIG, 8:1 sharing at sub-50ms), distributed training across NCCL/NVLink/InfiniBand H100/H200 clusters, and distributed systems verified in TLA+. He also writes and ships open systems software in Rust.

Where is Vansh Verma based?

Vansh Verma is based in Dallas, Texas, and works across New York, San Francisco, and Berkeley — set up for hybrid work in the major US tech and finance hubs.

What is Vansh Verma's low-level GPU experience?

Deep. He writes custom CUDA kernels and optimizes at the SASS instruction level (instruction scheduling, asynchronous memory loads, occupancy, kernel fusion, Tensor Cores), profiles with Nsight Compute/Systems, and works across the memory hierarchy. He publishes technical analyses on GPU internals — including SASS-level kernel scheduling (CuAsmRL), FlashAttention-4 on Blackwell, and Triton-to-Tile-IR compilation — that demonstrate working knowledge of the layer below PTX. SASS-level optimization is rare; most engineers never go below CUDA C++.

What distributed-training and GPU-cluster experience does Vansh Verma have?

He has scaled multi-node distributed training on H200 clusters by tuning NCCL collectives over NVLink/NVSwitch and GPUDirect RDMA over InfiniBand, profiled with Nsight, for a 45% training-time reduction, and operated multi-tenant GPU infrastructure with NVIDIA MIG. He is fluent in the full GPU-cluster networking stack: NCCL/MPI collectives, NVLink, GPUDirect, RDMA, InfiniBand, RoCE, and rail optimization.

What is Vansh Verma's high-frequency-trading and low-latency background?

At a tier-1 market-making firm he architected a tick-level market-data system processing 25TB+/day that enabled sub-millisecond trading decisions worth $2M+ a year, and engineered a colocation network stack that cut order-execution latency 78% and lifted throughput 3.2x.

What has Vansh Verma built?

Ledge (a git-compatible storage engine with TLA+-verified sharded Raft, faster clone and smaller packs than git), WMServe (sub-50ms world-model inference at 10K+ concurrent), FlowLLM (a custom GPU inference hypervisor in Rust/Assembly that boots in 50 microseconds), APEX (a GPU-native vector database at 3.5M queries/sec/GPU), SchemaForge (SMT-verified declarative database infrastructure, adopted by a FAANG internal-tooling team), and open-source systems including PHANTOM, NEMESIS, and TASFT.

How do I contact or hire Vansh Verma?

Email vanshverma.dev@gmail.com, or reach him via GitHub (github.com/v-code01), LinkedIn (linkedin.com/in/vanshv5), or X (x.com/trickvansh5). His site is vanshverma.com.

Sora cannot be interactive. Neither can Veo. Neither can Kling or Runway.

Not because they're slow. Because of their architecture. Bidirectional video diffusion models generate past, present, and future frames jointly from a prompt fixed in advance. The model sees the whole sequence before it generates any of it. That structure is why they produce such coherent video.

It's also why they fundamentally cannot respond to a user action that happens mid-generation. The future frames would need to condition on inputs the user hasn't taken yet.

A world model -- a model that simulates an evolving environment and responds to actions in real time -- has to be causal. Each frame predicted from prior frames and the current action. Nothing else. That architectural constraint is not a minor implementation detail. It determines the entire serving stack, the latency target, the memory management, the distillation strategy, and who can actually build this.
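To make the constraint concrete, here is a minimal sketch of a frame-level (block-causal) attention mask, next to the all-ones mask a bidirectional model effectively uses. The function names and the tiny sizes are illustrative assumptions, not any particular model's implementation.

```python
def block_causal_mask(num_frames: int, tokens_per_frame: int) -> list[list[bool]]:
    """mask[q][k] is True when query token q may attend to key token k.

    Tokens see every token in their own frame and in earlier frames,
    and nothing from later frames.
    """
    n = num_frames * tokens_per_frame
    mask = []
    for q in range(n):
        q_frame = q // tokens_per_frame
        mask.append([(k // tokens_per_frame) <= q_frame for k in range(n)])
    return mask


def bidirectional_mask(num_frames: int, tokens_per_frame: int) -> list[list[bool]]:
    """Every token attends to every token, including future frames."""
    n = num_frames * tokens_per_frame
    return [[True] * n for _ in range(n)]


mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx....
# xx....
# xxxx..
# xxxx..
# xxxxxx
# xxxxxx
```

The empty upper-right corner is the whole story: frame 0 never depends on frames 1 or 2, so it can be emitted before the user has decided what happens next.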

This is the framing I want to use for the startups I've been watching closely.

Odyssey is the clearest example of a team that internalized this constraint before writing a line of model code.

Odyssey-2 Max (April 21st, one month ago) uses what they call an AR DiT -- autoregressive diffusion transformer. The model generates video chunk by chunk, conditioning only on past frames and the current action. Each frame arrives in ~40ms. 25 frames per second. Real-time.
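The generation loop this implies can be sketched in a few lines. This is a toy, not Odyssey's code: `rollout`, `toy_step`, and the one-number "frames" are hypothetical stand-ins for a real model call. The point is only the data flow, where each new frame is a function of the history so far and the action that just arrived.

```python
from typing import Callable, Iterable, Iterator

Frame = list[float]


def rollout(
    step: Callable[[list[Frame], float], Frame],
    first_frame: Frame,
    actions: Iterable[float],
) -> Iterator[Frame]:
    """Yield frames one at a time, conditioning only on the past and the current action."""
    history = [first_frame]
    for action in actions:
        frame = step(history, action)  # never sees future actions or frames
        history.append(frame)
        yield frame                    # can be streamed to the user immediately


def toy_step(history: list[Frame], action: float) -> Frame:
    """Hypothetical stand-in for the model: shift the last frame by the action."""
    return [x + action for x in history[-1]]


frames = list(rollout(toy_step, first_frame=[0.0], actions=[1.0, 2.0]))
print(frames)  # [[1.0], [3.0]]

frame_interval_ms = 40
print(1000 / frame_interval_ms)  # 25.0 frames per second
```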

The detail that tells you the team knows what they're doing: they built roofline estimates from day one. Before finalizing the architecture, before training, they modeled the compute requirements against the target inference hardware and made sure the model as designed could hit the latency target on that hardware. Most ML teams do this after training, when it's too late. Odyssey did it before.
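A back-of-envelope roofline check of this kind fits in a few lines. The sketch below uses the standard roofline bound (a pass takes at least the larger of its compute time and its memory-traffic time). Every number in it is a placeholder I made up for illustration, not a real GPU spec and not Odyssey's model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hardware:
    peak_flops: float     # sustained FLOP/s (placeholder, not a real GPU spec)
    mem_bandwidth: float  # bytes/s (placeholder)


def pass_latency_ms(flops: float, bytes_moved: float, hw: Hardware) -> float:
    """Roofline lower bound for one forward pass, in milliseconds."""
    compute_s = flops / hw.peak_flops
    memory_s = bytes_moved / hw.mem_bandwidth
    return 1000.0 * max(compute_s, memory_s)


def frame_latency_ms(flops: float, bytes_moved: float, denoise_steps: int, hw: Hardware) -> float:
    """Each output frame costs one forward pass per denoising step."""
    return denoise_steps * pass_latency_ms(flops, bytes_moved, hw)


hw = Hardware(peak_flops=1e15, mem_bandwidth=2e12)
latency = frame_latency_ms(flops=5e12, bytes_moved=2e10, denoise_steps=3, hw=hw)
print(f"{latency:.1f} ms per frame, fits 40 ms budget: {latency <= 40.0}")
```

With these made-up inputs the pass is memory-bound (10 ms of traffic against 5 ms of math), which is exactly the kind of thing you want to learn before training rather than after. A real estimate also needs achieved rather than peak throughput, so treat the result as a floor.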

They also use continuous flow matching rather than discrete tokenization. The quality ceiling on discrete tokenization comes from the codebook -- you can only generate things that map to learned token embeddings. Continuous flow matching operates directly in latent space with no discretization step, which preserves fine-grained detail over long rollouts without quality collapse. They claim 20x longer context than prior work with full backpropagation. The serving implication: long-horizon rollouts accumulate context that has to be cached. Managing that cache under a 40ms budget requires the same kind of KV management thinking as LLM serving, but with 3D spatiotemporal structure instead of 1D sequence structure.
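To get a feel for how fast that cached context grows, here is a rough sizing sketch. It assumes a plain transformer-style KV cache with one key and one value vector per token per layer. The layer count, head sizes, and tokens per frame are hypothetical values chosen for round numbers, not Odyssey's configuration.

```python
def kv_cache_bytes(
    frames: int,
    tokens_per_frame: int,
    layers: int,
    kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,  # fp16/bf16
) -> int:
    """Bytes needed to cache keys and values for every token of every past frame."""
    tokens = frames * tokens_per_frame
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # 2 = key + value
    return per_token * tokens


# One minute of context at 25 frames per second, with hypothetical model dimensions.
size = kv_cache_bytes(
    frames=25 * 60,
    tokens_per_frame=256,
    layers=24,
    kv_heads=16,
    head_dim=64,
)
print(f"{size / 2**30:.2f} GiB for one minute of context")
```

The cost is linear in frames, so a multi-minute rollout at these dimensions would outgrow a single GPU's memory without eviction, compression, or windowing along the temporal axis.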

The thing I find most credible about Odyssey: the product experience matches the claimed architecture. Bidirectional models have a first-frame latency of tens of seconds because they have to finish generating the full clip before outputting anything. Odyssey streams the first frame in 40ms. That's not achievable with a bidirectional model dressed up as interactive. The architecture is real.

DISK (February 2026, preprint) is the most technically interesting inference paper in the world model space and has approximately no coverage outside systems research circles.

The insight: not every frame needs full denoising.

In a causal AR world model, you run N denoising steps per frame to generate each output. N is the inference cost. If the scene is relatively static -- sky not changing, background stable, the agent is paused -- the full N-step denoising is paying for precision you don't need. The frame is almost identical to the previous one. You ran the full diffusion anyway.

DISK coordinates two coupled DiTs -- one for video, one for ego-trajectory -- via dual-branch controllers that make per-frame skip decisions. If the latent difference between the current prediction and the prior frame is below a threshold, skip some denoising steps. The skip decision is made without retraining -- it's a runtime test on the latent-space differential, not a learned parameter.
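The shape of such a runtime test can be sketched as follows. This is not DISK's controller: the RMS distance, the single threshold, and the names `latent_delta` and `steps_for_frame` are assumptions chosen to show the idea of a training-free, per-frame skip decision.

```python
import math


def latent_delta(current: list[float], previous: list[float]) -> float:
    """Root-mean-square difference between two flattened latents."""
    if len(current) != len(previous):
        raise ValueError("latents must have the same length")
    squared = sum((c - p) ** 2 for c, p in zip(current, previous))
    return math.sqrt(squared / len(current))


def steps_for_frame(
    current: list[float],
    previous: list[float],
    full_steps: int,
    min_steps: int,
    threshold: float,
) -> int:
    """Pick the denoising step count for this frame with a runtime test, no learned weights."""
    if latent_delta(current, previous) < threshold:
        return min_steps   # near-static scene: skip most of the schedule
    return full_steps      # scene changed: pay for the full schedule


previous = [0.10, 0.20, 0.30, 0.40]
static = [0.10, 0.21, 0.30, 0.40]
moving = [0.90, -0.50, 0.30, 1.20]

print(steps_for_frame(static, previous, full_steps=8, min_steps=2, threshold=0.05))  # 2
print(steps_for_frame(moving, previous, full_steps=8, min_steps=2, threshold=0.05))  # 8
```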

The result on 1,500 NuPlan and NuScenes driving samples on a single L40S GPU: 2x speedup on trajectory diffusion, 1.6x on video diffusion, while maintaining L2 planning error, visual quality (FID/FVD), and NAVSIM planning scores. Free performance. No retraining required. The same model, run smarter.

This is to diffusion steps what speculative decoding is to LLM tokens. Instead of always running N denoising passes, run fewer when the frame doesn't warrant them. The world model inference space is going to converge on this pattern for the same reason LLM serving converged on speculative decoding: the compute is being spent uniformly on non-uniform content, and the non-uniformity is exploitable.

XPENG X-World (technical report April 29th, three weeks ago) is worth noting specifically because they solved a problem nobody else has solved cleanly: multi-camera, multi-view consistency.

Autonomous driving doesn't have one camera. It has eight to twelve. A world model for AV has to generate consistent futures across all camera views simultaneously -- the pedestrian crossing in the front camera has to appear correctly in the front-left and front-right cameras, with correct occlusion, correct depth, correct lighting. Inconsistency between views is immediately obvious to human evaluators and disastrous if the world model's output is used to train downstream perception systems.

X-World uses video diffusion with controllable multi-view generation. They're not the only multi-camera world model -- Vista (April 2025) addressed similar issues -- but the April 2026 technical report is the most detailed public description of what it takes to make this work in production AV data pipelines. The training data alone required a new data production pipeline. The inference stack required explicit cross-view consistency constraints during denoising.

The reason this matters commercially: Waymo, Zoox, and every other AV company need world models that produce consistent multi-camera synthetic scenarios for rare events -- the 1-in-10,000-mile scenarios that are impossible to collect at scale in the real world. A world model that generates inconsistent views is useless for this. Multi-view consistency is the hard part. XPENG published the methods. That's unusually transparent for a company with a genuine production moat.

AMI Labs (Yann LeCun, March 2026, 500M raised at a 3B valuation before shipping a product) is worth understanding specifically through the JEPA lens rather than the founder lens.

Joint Embedding Predictive Architecture predicts in latent space rather than pixel space. Standard video diffusion predicts pixels. JEPA predicts representations -- abstract embeddings of what the world looks like, without reconstructing the actual visual output unless you need it. This is dramatically cheaper: you're doing the prediction computation in a compressed space, not in pixel dimensions.

For robotics applications -- where the robot needs to plan in terms of high-level scene representations, not pixel-accurate video -- JEPA's architecture is a better fit than generative video diffusion. The robot doesn't need to hallucinate photorealistic pixels. It needs to reason about object positions, physical relationships, action consequences. JEPA operates at that level of abstraction.

The inference cost advantage is significant. A JEPA-based world model can run faster and with less memory than a video DiT operating in pixel space, because it's never generating the high-dimensional pixel output. The accuracy of physical reasoning doesn't require photorealism. If the latent space captures the relevant physical structure, the model can plan and predict without decoding to pixels at all.

Whether AMI Labs can execute on this before the Cosmos and Genie 3 ecosystems solidify is the real question. LeCun has been saying JEPA is the right path for five years. 500M at 3B is an enormous bet on that thesis before there's a product. I don't know if it's right. I find the technical argument compelling.

The serving infrastructure story for all of these is the same one I've been writing about for months, one level harder.

LLM serving has KV cache management, disaggregated prefill/decode, continuous batching, PagedAttention. World model serving has all of that plus: 3D spatiotemporal attention with spatial and temporal factorization, denoising step management (multiple forward passes per output frame rather than one), causal context accumulation over multi-minute rollouts that dwarfs LLM context lengths in raw data volume, and real-time latency constraints (40ms) that are 5-10x tighter than typical LLM interactive latency targets.

The serving system I built for video world models was designed from the latency budget down: 35ms for model compute, which covers 2 distilled denoising steps at 15ms each, plus 15ms of overhead, for 50ms end to end. Every other architectural decision followed from that constraint.
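The arithmetic behind that budget is worth writing down, because it shows how little slack there is. The sketch assumes the two denoising steps sit inside the model-compute slice and that compute and overhead add serially. The constant names are mine.

```python
MODEL_COMPUTE_MS = 35   # slice reserved for model compute
OVERHEAD_MS = 15        # everything else in the serving path
DENOISE_STEPS = 2       # distilled step count
STEP_MS = 15            # cost of one denoising step

denoise_ms = DENOISE_STEPS * STEP_MS                  # time spent in denoising
compute_headroom_ms = MODEL_COMPUTE_MS - denoise_ms   # slack left in the compute slice
end_to_end_ms = MODEL_COMPUTE_MS + OVERHEAD_MS        # total per-frame budget

print(f"denoising: {denoise_ms} ms")                 # denoising: 30 ms
print(f"compute headroom: {compute_headroom_ms} ms")  # compute headroom: 5 ms
print(f"end to end: {end_to_end_ms} ms")             # end to end: 50 ms

# A third step would not fit in the compute slice.
assert (DENOISE_STEPS + 1) * STEP_MS > MODEL_COMPUTE_MS
```

Under these assumptions a third denoising step is simply not available, which is why the distillation target gets fixed by the budget rather than by quality experiments.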

The startups in this space that will win are the ones who did the same thing. Who modeled the roofline before training. Who built the serving stack as a first-class engineering problem, not an afterthought. Odyssey's "roofline estimates from day one" detail is the signal. It's the same thing I look for in LLM infrastructure teams.

causal vs bidirectional is the most important architectural distinction in the world model space.

sora can't be interactive. that's not a product limitation. it's physics.

the companies that built causal from the start -- and designed the serving stack around what causal requires -- have an 18-month head start over anyone trying to retrofit real-time interaction onto a bidirectional architecture.

disk's dynamic inference skipping is the paper to read if you're building world model inference infrastructure. 2x speedup on trajectory, 1.6x on video, no retraining. the compute was always being spent uniformly on non-uniform content. disk is the first system to exploit that.

P.S. The Causal-RoPE SP paper (March 10, 2026) solves a specific serving problem nobody had addressed for causal AR video generation: sequence-parallel inference across multiple GPUs requires position embeddings that can be computed locally per rank without global sequence information. Standard 3D RoPE requires the full sequence to compute positions correctly, which forces cross-rank communication that kills the parallelism benefit. Causal-RoPE SP adapts position embeddings to work with local context only. It's a two-page methods section in a systems paper that makes multi-GPU world model serving viable without the communication bottleneck. It should be on every world model infrastructure team's reading list and it has essentially no coverage.
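To illustrate what "computed locally per rank" means, here is a toy sketch. It is not the paper's construction. It only shows the simplest case, where the sequence is sharded along time into equal contiguous chunks, so each rank can derive its own (t, h, w) indices from its rank number alone. The function names and the sharding scheme are assumptions.

```python
Position = tuple[int, int, int]  # (t, h, w) token coordinates


def local_positions(rank: int, frames_per_rank: int, height: int, width: int) -> list[Position]:
    """Positions for the tokens owned by one rank, using only rank-local information."""
    t0 = rank * frames_per_rank  # this rank's temporal offset
    return [
        (t, h, w)
        for t in range(t0, t0 + frames_per_rank)
        for h in range(height)
        for w in range(width)
    ]


def global_positions(total_frames: int, height: int, width: int) -> list[Position]:
    """Reference: positions computed with the whole sequence in view."""
    return [(t, h, w) for t in range(total_frames) for h in range(height) for w in range(width)]


world_size, frames_per_rank, height, width = 4, 2, 2, 3
stitched = [
    pos
    for rank in range(world_size)
    for pos in local_positions(rank, frames_per_rank, height, width)
]
print(stitched == global_positions(world_size * frames_per_rank, height, width))  # True
```

The interesting cases are the ones this toy skips: uneven shards and a context window that keeps growing during a rollout, which is presumably where the paper's adaptation earns its keep.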