Vansh Verma is an AI infrastructure and ML systems engineer who builds the low-level systems that keep AI fast, correct, and cheap in production — GPU kernels down to PTX/SASS, inference runtimes, distributed training, and formally-verified distributed systems. He is currently a Member of Technical Staff, Machine Learning at Rational Dynamics (a Voleon company), and was previously a founding AI-infrastructure engineer (0→1 platform), an ML engineer at GoodRx, and an HPC/quant infrastructure engineer at a tier-1 market-making firm.

What does Vansh Verma specialize in?

Performance and correctness at the layer where it matters: custom CUDA kernels and SASS/PTX-level GPU optimization, inference serving (vLLM, TensorRT-LLM, speculative decoding, KV-cache compression), multi-tenant GPU infrastructure (NVIDIA MIG, 8:1 sharing at sub-50ms), distributed training across NCCL/NVLink/InfiniBand H100/H200 clusters, and distributed systems verified in TLA+. He also writes and ships open systems software in Rust.

Where is Vansh Verma based?

Vansh Verma is based in Dallas, Texas, and works across New York, San Francisco, and Berkeley — set up for hybrid work in the major US tech and finance hubs.

What is Vansh Verma's low-level GPU experience?

Deep. He writes custom CUDA kernels and optimizes at the SASS instruction level (instruction scheduling, asynchronous memory loads, occupancy, kernel fusion, Tensor Cores), profiles with Nsight Compute/Systems, and works across the memory hierarchy. He publishes technical analyses on GPU internals — including SASS-level kernel scheduling (CuAsmRL), FlashAttention-4 on Blackwell, and Triton-to-Tile-IR compilation — that demonstrate working knowledge of the layer below PTX. SASS-level optimization is rare; most engineers never go below CUDA C++.

What distributed-training and GPU-cluster experience does Vansh Verma have?

He has scaled multi-node distributed training on H200 clusters by tuning NCCL collectives over NVLink/NVSwitch and GPUDirect RDMA over InfiniBand, profiled with Nsight, for a 45% training-time reduction, and operated multi-tenant GPU infrastructure with NVIDIA MIG. He is fluent in the full GPU-cluster networking stack: NCCL/MPI collectives, NVLink, GPUDirect, RDMA, InfiniBand, RoCE, and rail optimization.

What is Vansh Verma's high-frequency-trading and low-latency background?

At a tier-1 market-making firm he architected a tick-level market-data system processing 25TB+/day that enabled sub-millisecond decisions behind $2M+ in annual trading decisions, and engineered a colocation network stack that cut order-execution latency 78% and lifted throughput 3.2x.

What has Vansh Verma built?

Ledge (a git-compatible storage engine with TLA+-verified sharded Raft, faster clone and smaller packs than git), WMServe (sub-50ms world-model inference at 10K+ concurrent), FlowLLM (a custom GPU inference hypervisor in Rust/Assembly that boots in 50 microseconds), APEX (a GPU-native vector database at 3.5M queries/sec/GPU), SchemaForge (SMT-verified declarative database infrastructure, adopted by a FAANG internal-tooling team), and open-source systems including PHANTOM, NEMESIS, and TASFT.

How do I contact or hire Vansh Verma?

Email vanshverma.dev@gmail.com, or reach him via GitHub (github.com/v-code01), LinkedIn (linkedin.com/in/vanshv5), or X (x.com/trickvansh5). His site is vanshverma.com.

Video world models generate pixels. 3D world models generate scenes. The serving architecture for each is completely different.

This is the distinction I haven't seen written clearly for systems engineers, and it's the one that determines your infrastructure.

A video world model -- Odyssey, Self-Forcing, Causal Forcing -- outputs pixels. Frame by frame, at the camera position the generation was conditioned on. The output is an H×W×3 RGB tensor. It streams to the client. The client displays it. The user can't move the camera to an angle that wasn't in the generation path without the server regenerating from that viewpoint. The model baked the viewpoint into the output.

A 3D world model outputs a scene representation. Something renderable from any viewpoint. The canonical choice right now is 3D Gaussian Splatting -- a set of explicit Gaussian primitives, each with position, orientation, scale, opacity, and spherical harmonic coefficients for view-dependent color. You give a renderer a 3DGS scene and a camera pose, and it rasterizes the scene at that pose in milliseconds. Arbitrarily. Any angle. Any position.
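To make the primitive concrete, here is a minimal sketch of one Gaussian and the per-primitive parameter count. The class and function names are my own for illustration, not any library's API, and the quaternion encoding of orientation is an assumption (it is the common choice, but the text above only says "orientation").

```python
from dataclasses import dataclass
from typing import List, Tuple


def sh_coeffs_per_channel(sh_degree: int) -> int:
    """Number of spherical-harmonic basis functions up to and including sh_degree."""
    return (sh_degree + 1) ** 2


@dataclass
class Gaussian:
    position: Tuple[float, float, float]         # center in world space
    rotation: Tuple[float, float, float, float]  # orientation as a quaternion (assumed encoding)
    scale: Tuple[float, float, float]            # per-axis extent
    opacity: float                               # blending weight in [0, 1]
    sh: List[float]                              # 3 * (sh_degree + 1)^2 RGB color coefficients


def floats_per_gaussian(sh_degree: int) -> int:
    """Scalar values needed to store one Gaussian at the given SH degree."""
    return 3 + 4 + 3 + 1 + 3 * sh_coeffs_per_channel(sh_degree)


def scene_bytes(num_gaussians: int, sh_degree: int, bytes_per_value: int = 4) -> int:
    """Uncompressed storage for a scene; bytes_per_value=4 means fp32."""
    return num_gaussians * floats_per_gaussian(sh_degree) * bytes_per_value
```

Under these assumptions a Gaussian is 14 values with view-independent color (SH degree 0) and 59 values at SH degree 3, which is 236 bytes in uncompressed fp32. Most of the footprint is the view-dependent color, so the SH degree is the first knob to turn when a scene has to travel over a network.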

The moment your world model outputs 3DGS instead of pixels, the serving architecture splits into two fundamentally different computational problems: neural generation on the server, and rasterization on the client. And the rasterization is free compared to the generation.

Let me be concrete about what that split means for the latency stack.

Pixel output pipeline: generate frame on cloud GPU (40ms at 2-step distillation) → encode to H.264 or HEVC (~5ms) → stream over network (~10-20ms one-way) → decode in browser (~5ms) → display. Total: 60-70ms minimum, network-bound, viewpoint-locked.

3DGS output pipeline: generate incremental Gaussian update on cloud GPU (generation time) → stream compact representation to browser (~2-5ms for a delta of a few thousand Gaussians) → rasterize via WebGPU (~1ms at 100+ FPS) → display. Total: generation time + ~5ms overhead. Rendering latency is essentially zero.
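The arithmetic behind those two totals can be sketched as a simple sum over stages. The stage names and the dictionary layout are mine; the millisecond figures are the rough estimates quoted above, not measurements.

```python
from typing import Dict, Tuple

Range = Tuple[float, float]  # (low, high) latency in milliseconds

# Pixel pipeline stages, using the rough figures from the text.
PIXEL_PIPELINE: Dict[str, Range] = {
    "generate": (40.0, 40.0),
    "encode": (5.0, 5.0),
    "network": (10.0, 20.0),
    "decode": (5.0, 5.0),
}

# 3DGS pipeline stages that come after generation.
SPLAT_OVERHEAD: Dict[str, Range] = {
    "stream_delta": (2.0, 5.0),
    "rasterize": (1.0, 1.0),
}


def total_ms(stages: Dict[str, Range]) -> Range:
    """Sum the low and high ends of every stage to get an end-to-end range."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high
```

With these inputs, `total_ms(PIXEL_PIPELINE)` gives 60-70ms and `total_ms(SPLAT_OVERHEAD)` gives 3-6ms on top of whatever generation costs. The second number is the one that matters: it stays fixed no matter how the user moves the camera.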

The browser isn't receiving video. It's receiving a 3D scene representation that it renders locally at native GPU speed. WebGPU -- shipped by default in Chrome since 2023, with Firefox and Safari support following, and now covering the vast majority of desktop browsers -- exposes GPU rasterization APIs that can render a 3DGS scene at 100+ FPS without a plugin. The render happens on the user's machine. The cloud only has to generate the geometry.

This changes the serving problem in three ways. The cloud no longer encodes and streams video frames -- it streams compact scene deltas. The client no longer decodes video -- it renders 3D geometry. And the user can freely orbit, pan, and zoom without any additional cloud compute, because the scene is on their machine and the GPU handles arbitrary viewpoints locally at real-time rates.

The generative model side is where the current research is.

Generative Gaussian Splatting (GGS, Meta Reality Labs, March 2025) is the cleanest architectural statement of this approach: a video diffusion model that outputs a 3DGS feature field rather than RGB frames. A pose-conditional diffusion model generates a feature field parameterized as 3D Gaussian primitives, which is then decoded into a renderable radiance field. FID on the generated scenes improves by ~20% over an equivalent model without the 3D representation, because the 3D representation enforces geometric coherence across viewpoints by construction -- something pixel-level video diffusion has to approximate implicitly.

L3DG (latent 3D Gaussian diffusion) pushes this into a latent space: a VQ-VAE learns a compressed representation of 3DGS scenes, and a diffusion model operates on those latents. Cheaper to run, room-scale coverage, renders from arbitrary viewpoints in real time. The compression makes the generation cost manageable. The explicit 3D representation makes the rendering free.

Lyra 2.0 (April 14th, six weeks ago) extends this to long-horizon interactive exploration with anti-forgetting and anti-drifting mechanisms -- the same persistent consistency problem that kills video world models over long sessions, solved at the scene reconstruction level rather than at the sequence modeling level. Starting from a single image, Lyra 2.0 lets users define arbitrary long-horizon camera trajectories, progressively reconstructing new areas as the camera moves. The 3DGS representations it generates can be directly exported to NVIDIA Isaac Sim for physics simulation. The 3D output is not just for display. It is simulation-ready geometry.

The hardest unsolved problem in this stack: causal 3D generation.

Here is the issue. Standard generative 3DGS models take a full prompt -- an image, a text description, a set of reference views -- and generate a complete scene in one shot. That's offline generation. For interactive use, you need the model to be causal: each new action by the user (move left, accelerate, open door) updates the scene based on the prior state and the new action, in real time, frame by frame.

Causal 3DGS generation requires an autoregressive world model that outputs incremental scene updates rather than complete scenes. Each timestep: given current 3DGS scene state + user action → output delta to 3DGS (new Gaussians added, existing ones updated, some deleted) → merge delta into scene → stream delta to client → client re-renders. The generation cost is for the delta, not the full scene. Streaming cost is for the delta, which is small. Rendering cost is zero.
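To give a feel for what "stream delta to client" could look like on the wire, here is a sketch of a binary encoding for added or updated Gaussians. The record layout is hypothetical, not a published format: a uint32 id plus 14 half-precision values (position 3, quaternion 4, scale 3, opacity 1, view-independent RGB 3). Removals are left out and would need their own id list.

```python
import struct
from typing import List, Tuple

# Hypothetical layout, little-endian with no padding.
HEADER = struct.Struct("<II")    # (frame index, record count)
RECORD = struct.Struct("<I14e")  # Gaussian id + 14 fp16 parameters = 32 bytes

Record = Tuple[int, List[float]]


def pack_delta(frame: int, records: List[Record]) -> bytes:
    """Serialize added/updated Gaussians for one frame into a byte string."""
    out = bytearray(HEADER.pack(frame, len(records)))
    for gid, values in records:
        if len(values) != 14:
            raise ValueError("expected 14 values per Gaussian")
        out += RECORD.pack(gid, *values)
    return bytes(out)


def unpack_delta(payload: bytes) -> Tuple[int, List[Record]]:
    """Inverse of pack_delta: recover the frame index and the records."""
    frame, count = HEADER.unpack_from(payload, 0)
    records: List[Record] = []
    offset = HEADER.size
    for _ in range(count):
        gid, *values = RECORD.unpack_from(payload, offset)
        records.append((gid, list(values)))
        offset += RECORD.size
    return frame, records
```

With this layout a delta of 3,000 Gaussians is 96,008 bytes before any entropy coding. Half precision is a lossy choice I am assuming for illustration; whether it is acceptable for positions depends on scene scale.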

The mechanism for incremental 3DGS update is not standardized yet. The options: output the full scene every frame (expensive, bandwidth-heavy), output a fixed set of Gaussians that get updated parameters each frame (efficient but limits scene capacity), or output a sparse delta of added/removed/modified Gaussians (correct but requires a merge operation that's nontrivial to implement at low latency).
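The third option, a sparse delta, can be sketched as a merge over a scene keyed by Gaussian id. The `SceneDelta` type and the conflict rules (removals first, updates to missing ids ignored, adds overwrite) are assumptions of mine; none of the cited papers specify a merge protocol.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Params = List[float]  # flattened parameters of one Gaussian


@dataclass
class SceneDelta:
    added: Dict[int, Params] = field(default_factory=dict)
    modified: Dict[int, Params] = field(default_factory=dict)
    removed: List[int] = field(default_factory=list)


def apply_delta(scene: Dict[int, Params], delta: SceneDelta) -> Dict[int, Params]:
    """Return a new scene with the delta applied; the input scene is not mutated."""
    merged = dict(scene)
    for gid in delta.removed:
        merged.pop(gid, None)       # removing an unknown id is a no-op
    for gid, params in delta.modified.items():
        if gid in merged:           # skip updates to Gaussians that no longer exist
            merged[gid] = params
    for gid, params in delta.added.items():
        merged[gid] = params
    return merged
```

This version copies the whole id map every frame, which is fine for showing the semantics but not for a real client. The low-latency difficulty comes from doing the same thing in place on GPU buffers, where removals leave holes and additions need free slots.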

None of the published generative 3DGS papers fully solve the causal streaming case. GGS and L3DG are offline generators. Lyra 2.0 is progressive but not truly action-conditioned. The system that cracks causal autoregressive 3DGS generation -- updating a persistent scene representation frame by frame in response to user actions, streaming compact deltas to a browser that renders them at 100 FPS -- has solved the hard problem that everything else in this space is building toward.

The teams with backgrounds in both 3D scene reconstruction (knowing how Gaussian primitives are structured, what makes a valid scene representation, how rendering works) and neural generation (causal diffusion distillation, flow matching, low-latency serving) are the ones positioned to solve this. The two skill sets were separate communities six months ago. They're converging now.

The serving architecture question for 3D interactive world models:

What does the cloud generate? What does the client render? And how does the delta between one frame's scene state and the next get communicated at sub-40ms latency?

Pixel-output models answer these questions badly. The cloud generates everything. The client displays it. There's no viewpoint freedom. The bandwidth is video-grade.

3DGS-output models answer them correctly. The cloud generates geometry. The client renders it. The delta is kilobytes, not megabytes. The viewpoint is free.

The browser already has WebGPU. The cloud already has generative models. The gap is the causal delta-update mechanism that connects them at interactive latency. Whoever solves that specific problem in a production-grade way owns the infrastructure layer for the next ten years of interactive 3D AI.

P.S. The Visionary paper (December 2025) built a full WebGPU-based Gaussian Splatting platform in the browser with per-frame neural updates -- "a single browser-resident pipeline can support both fast rendering and per-frame neural updates." They validated that WebGPU can handle dynamic 3DGS scenes with neural components at real-time rates. The rendering layer is proven. The open problem is the causal generative model that feeds it. That's where the research is now.