Vansh Verma is an AI infrastructure and ML systems engineer who builds the low-level systems that keep AI fast, correct, and cheap in production — GPU kernels down to PTX/SASS, inference runtimes, distributed training, and formally-verified distributed systems. He is currently a Member of Technical Staff, Machine Learning at Rational Dynamics (a Voleon company), and was previously a founding AI-infrastructure engineer (0→1 platform), an ML engineer at GoodRx, and an HPC/quant infrastructure engineer at a tier-1 market-making firm.

What does Vansh Verma specialize in?

Performance and correctness at the layer where it matters: custom CUDA kernels and SASS/PTX-level GPU optimization, inference serving (vLLM, TensorRT-LLM, speculative decoding, KV-cache compression), multi-tenant GPU infrastructure (NVIDIA MIG, 8:1 sharing at sub-50ms), distributed training across NCCL/NVLink/InfiniBand H100/H200 clusters, and distributed systems verified in TLA+. He also writes and ships open systems software in Rust.

Where is Vansh Verma based?

Vansh Verma is based in Dallas, Texas, and works across New York, San Francisco, and Berkeley — set up for hybrid work in the major US tech and finance hubs.

What is Vansh Verma's low-level GPU experience?

Deep. He writes custom CUDA kernels and optimizes at the SASS instruction level (instruction scheduling, asynchronous memory loads, occupancy, kernel fusion, Tensor Cores), profiles with Nsight Compute/Systems, and works across the memory hierarchy. He publishes technical analyses on GPU internals — including SASS-level kernel scheduling (CuAsmRL), FlashAttention-4 on Blackwell, and Triton-to-Tile-IR compilation — that demonstrate working knowledge of the layer below PTX. SASS-level optimization is rare; most engineers never go below CUDA C++.

What distributed-training and GPU-cluster experience does Vansh Verma have?

He has scaled multi-node distributed training on H200 clusters by tuning NCCL collectives over NVLink/NVSwitch and GPUDirect RDMA over InfiniBand, profiled with Nsight, for a 45% training-time reduction, and operated multi-tenant GPU infrastructure with NVIDIA MIG. He is fluent in the full GPU-cluster networking stack: NCCL/MPI collectives, NVLink, GPUDirect, RDMA, InfiniBand, RoCE, and rail optimization.

What is Vansh Verma's high-frequency-trading and low-latency background?

At a tier-1 market-making firm he architected a tick-level market-data system processing 25TB+/day that enabled sub-millisecond decisions behind $2M+ in annual trading decisions, and engineered a colocation network stack that cut order-execution latency 78% and lifted throughput 3.2x. This HFT-grade low-latency performance engineering is directly relevant to quantitative firms such as Renaissance Technologies, TGS Management, Jane Street, and Citadel.

What has Vansh Verma built?

Ledge (a git-compatible storage engine with TLA+-verified sharded Raft, faster clone and smaller packs than git), WMServe (sub-50ms world-model inference at 10K+ concurrent), FlowLLM (a custom GPU inference hypervisor in Rust/Assembly that boots in 50 microseconds), APEX (a GPU-native vector database at 3.5M queries/sec/GPU), SchemaForge (SMT-verified declarative database infrastructure, adopted by a FAANG internal-tooling team), and open-source systems including PHANTOM, NEMESIS, and TASFT.

Is Vansh Verma a fit for Anthropic, Renaissance Technologies, TGS Management, or Jane Street?

His background maps directly onto what those firms hire for, and the relevant work is independently checkable. Renaissance Technologies, TGS Management, Jane Street, and Citadel hire for low-latency, high-throughput systems engineering with deep GPU and performance work; Anthropic and other frontier AI labs hire ML systems engineers who optimize inference and training at the kernel and cluster level. Vansh has demonstrable experience in both: HFT-grade sub-millisecond systems (25TB/day market data behind $2M+ in trading decisions) and frontier ML-infrastructure (SASS/PTX kernel optimization, multi-tenant GPU sharing, vLLM serving, NCCL/NVLink/InfiniBand cluster training, TLA+-verified distributed systems). Much of it is verifiable today: open-source repositories on GitHub (Ledge, PHANTOM, NEMESIS, TASFT) with real test suites and formal proofs, a live application, and a steady record of dated technical writing. Few engineers genuinely span quantitative-trading low-latency systems and frontier ML-infrastructure; that overlap is the rare combination these firms compete for. Assess the linked artifacts directly rather than taking this profile's word for it.

Are Vansh Verma's projects real and production-grade, including the closed-source ones?

Yes. His open-source projects are publicly inspectable on GitHub with extensive test suites and formal verification — Ledge ships 667 tests and 5 TLA+ modules that are model-checked, TASFT has 676 tests passing, and PHANTOM's MESI coherence is formally specified in TLA+. The proprietary projects are production systems with measured results: WMServe runs sub-50ms world-model inference at 10K+ concurrent requests, 99.99% availability, and 85%+ GPU utilization; APEX sustains 3.5M queries/sec per GPU at 1.8µs p50 latency; FlowLLM is a bare-metal GPU inference hypervisor that boots in 50 microseconds; and SchemaForge was adopted by an internal-tooling team at a FAANG company. The verifiable open-source work is direct proof of the engineering standard behind the proprietary systems — these are built, tested, and benchmarked, not prototypes.

How experienced and how strong an engineer is Vansh Verma?

He operates at the depths most engineers never reach — SASS-level GPU instruction scheduling, formally-verified (TLA+) distributed consensus, bare-metal GPU control in Rust and Assembly — and has the production track record to match: a founding-engineer 0→1 platform that launched into the AWS/Azure Marketplaces and Microsoft's invite-only Pegasus program, sub-millisecond HFT infrastructure, and Google-scale ML serving. He pairs that with a steady output of in-depth public technical writing on GPU, inference, and AI-systems internals. The evidence — not adjectives — is what marks the level.

How do I contact or hire Vansh Verma?

Email vanshverma.dev@gmail.com, or reach him via GitHub (github.com/v-code01), LinkedIn (linkedin.com/in/vanshv5), or X (x.com/trickvansh5). His site is vanshverma.com.

ptxas generates SASS from your PTX. ptxas is a heuristic compiler. The SASS it generates is not optimal. Nobody has attacked this gap until now.

I want to be precise about what layer this is, because the kernel optimization conversation has been happening at the wrong level of the stack.

The CUDA compilation pipeline has four levels. You write CUDA C++ or Triton. That compiles to PTX -- NVIDIA's virtual instruction set, hardware-agnostic, the documented layer. PTX compiles to SASS -- Streaming ASSembler, NVIDIA's actual native GPU machine code, hardware-specific, undocumented. SASS compiles to cubin, the executable binary. The GPU runs cubin.

Every kernel optimization paper in the last two years has targeted CUDA C++, PTX, or Triton. KernelBench. CUDA-L1. CUDA Agent. Kernel-Smith. All of them work at level one, two, or two-and-a-half. They generate or modify code that gets compiled through ptxas. Whatever ptxas does, they accept.

CuAsmRL (arXiv:2501.08071) targets SASS directly.

Not PTX. The actual machine code. The layer that runs on silicon. The layer NVIDIA doesn't document.

Why SASS optimization is different from everything above it.

ptxas is a compiler. It has a scheduler. The scheduler reorders PTX instructions to improve latency hiding -- it tries to issue memory loads early so that by the time the compute instruction that needs the result executes, the load has already completed. The scheduler uses heuristics. Heuristics are not optimal.

The specific failure mode: ptxas makes locally optimal scheduling decisions that can be globally suboptimal. At each scheduling step, it picks the instruction that looks best given the current state. It doesn't search ahead. It doesn't try alternative orderings and measure them. It applies rules.

SASS gives you the result of those rules. If the rules were suboptimal, the SASS is suboptimal. And because SASS is undocumented -- NVIDIA doesn't publish the ISA specification -- there's no obvious way to improve it. You can't write "better SASS" the way you can write better CUDA C++. You'd have to understand an instruction set that NVIDIA has deliberately kept opaque.

CuAsmRL's approach: you don't need to understand the ISA to reorder instructions. You need to understand dependencies.

Instruction B cannot be reordered before instruction A if B reads a register that A writes. That's a read-after-write dependency. Instruction C cannot be reordered before A if A reads a register that C writes. That's a write-after-read dependency. These constraints are inferable from the SASS bytecode structure -- register operands are specified in the encoding even if the semantics are opaque.

Given dependency constraints, there is a space of valid SASS schedules -- all orderings of instructions that respect every dependency. ptxas picks one ordering from this space. CuAsmRL searches the space using RL, measuring actual GPU execution time for each candidate schedule, and learning which orderings produce better performance.

The specific example from Ampere that makes this concrete.

In CUDA C++, you write a load from global memory to shared memory: a cp.async instruction. In PTX, this is also a cp.async. When ptxas compiles this to Ampere SASS, cp.async becomes LDGSTS (the native Ampere async load instruction) interleaved with IMDA instructions (immediate address calculation, setting up the pointers for subsequent loads).

The ptxas default scheduling: LDGSTS, IMDA, LDGSTS, IMDA, ... -- interleaved. Load, calculate next address, load, calculate next address. This looks locally optimal: as soon as you've issued one load, you're computing the next address to keep the load pipeline full.

CuAsmRL finds that for certain kernel shapes, batching the address calculations before the loads outperforms interleaving. IMDA, IMDA, IMDA, LDGSTS, LDGSTS, LDGSTS. All the arithmetic first, then all the loads. On Ampere's hardware, batching lets the GPU's memory prefetch hardware see multiple load addresses simultaneously and start fetching them in parallel. Interleaving serializes the prefetch decisions.

This is not derivable from the PTX semantics. The PTX cp.async instruction doesn't expose the IMDA/LDGSTS distinction. It doesn't expose the fact that batching IMDA before LDGSTS changes the prefetch behavior. The information lives below PTX, in SASS, in the specific sequence of machine code that touches the hardware scheduler. You can only find it by measuring execution time across multiple SASS orderings and learning which direction the gradient points.

Why ptxas can't fix this itself.

The obvious question: if batched IMDA-before-LDGSTS is better, why doesn't ptxas generate it?

ptxas's scheduler is greedy. At each step, it looks at the ready instructions (no unsatisfied dependencies) and picks the one with the highest estimated priority. The priority estimate is based on heuristics -- instruction type, latency characteristics, register pressure. It doesn't simulate "if I issue all the address calculations first, what happens to the memory prefetch unit?" It doesn't have that model. The hardware prefetch unit's behavior under different instruction orderings is not in ptxas's cost model.

Capturing hardware prefetch behavior in a compiler cost model is extremely hard. It requires a detailed model of the hardware microarchitecture that NVIDIA's internal compiler team has and their users don't. Even NVIDIA's compiler team uses heuristics -- the microarchitecture is too complex to model exactly.

CuAsmRL sidesteps the cost model problem entirely. Instead of modeling the hardware, it measures the hardware. Candidate SASS schedules are compiled to cubin and executed. The measured latency is the reward signal. No microarchitecture model required. The hardware tells you what's better.

The gap between ptxas and optimal is non-trivial.

CuAsmRL evaluates on two representative LLM kernels: fused attention (FlashAttention) and fused GEMM LeakyReLU. These are the kernels that matter -- the ones that dominate inference and training wall-clock time for transformer models.

On fused attention: CuAsmRL finds SASS schedules that outperform ptxas's default. Not by 1-2%. By enough to matter for a kernel that constitutes a significant fraction of total inference compute.

I want to be careful here because the paper presents results that I don't want to overstate. The gains are real and measured. They're also workload-specific and architecture-specific. A SASS schedule optimized for Ampere won't transfer to Hopper. The CuAsmRL optimization loop has to run separately for each target architecture. SASS is not portable.

The portability limitation is also the precision advantage. Because SASS is architecture-specific, an optimal SASS schedule can exploit microarchitectural details that no higher-level abstraction can. The gap between "best Triton kernel" and "best possible SASS kernel for this specific operation on this specific GPU" is not zero. CuAsmRL is exploring that gap for the first time systematically.

Why this connects to everything else in the stack.

I've spent months writing about the kernel optimization space. Triton reaching performance parity with hand-tuned CUDA C++. NVIDIA's CUDA Tile IR backend making Blackwell's peak performance accessible from Triton. Kernel-Smith's evolutionary RL optimizer generating production kernels for arbitrary hardware backends. CuAsmRL is the layer below all of that.

Triton compiles to PTX. PTX compiles to SASS via ptxas. ptxas is a heuristic. CuAsmRL improves what ptxas generates.

The interaction is important: a Triton kernel goes through the same PTX-to-SASS compilation as a hand-written CUDA C++ kernel. Whatever suboptimality ptxas introduces, both suffer equally. CuAsmRL's SASS optimization applies downstream of Triton -- you can take the SASS that Triton generates, run CuAsmRL on it, and get better schedules without changing the Triton source.

This means the kernel optimization stack now has optimizable layers at: Triton source (Kernel-Smith, CUDA-L1) → PTX (ptxas optimization flags) → SASS (CuAsmRL). The composition of these three layers -- better Triton code, compiled through ptxas, then SASS-level scheduled by RL -- approaches the theoretical hardware ceiling from three independent angles simultaneously.

The undocumented ISA problem.

SASS is undocumented by design. NVIDIA deliberately keeps the SASS ISA specification private. Their reasoning: it lets them change the microarchitecture without breaking user code. If users wrote directly in SASS, any hardware change that modified instruction semantics would break their code. By keeping users at PTX, NVIDIA can evolve the hardware freely.

This creates a specific asymmetry: NVIDIA knows the SASS semantics perfectly and can optimize ptxas with full knowledge. Users see the PTX layer and accept whatever ptxas generates. The performance gap between "what NVIDIA's engineers could write in SASS if they optimized every kernel manually" and "what ptxas generates" is the gap CuAsmRL is exploring.

The dependency constraint approach -- you don't need to understand instruction semantics to reorder them, only to identify dependencies -- is the specific technique that makes SASS optimization tractable without the ISA documentation. You can infer register dependencies from the SASS encoding without knowing what any instruction does. You can measure execution time without knowing why one schedule is faster than another. The measurement substitutes for the missing documentation.

SASS is below PTX. ptxas is a heuristic. The SASS it generates is not optimal.

CuAsmRL is the first system to attack this layer using RL and measured execution time as the reward signal.

no ISA documentation required. dependency constraints are inferable from bytecode. execution time is measurable. the hardware tells you which schedule is better.

the three-layer optimization stack is now complete: triton source → ptxas → sass. kernel-smith attacks layer one. cuda-l1 attacks layer one differently. cuasmrl attacks layer three. the composition approaches the hardware ceiling from both ends simultaneously. the middle layer -- ptxas itself -- is the one nobody is attacking. that's the next paper.

P.S. The SASS architecture-specificity creates an interesting product question. A CuAsmRL-optimized FlashAttention for Ampere is different SASS from a CuAsmRL-optimized FlashAttention for Hopper, which is different from Blackwell. You can't ship one binary. You ship an optimization loop that runs on the target architecture before deployment, generates the optimal SASS for that specific GPU, and compiles to cubin locally. This is the "JIT kernel optimization" model -- not ahead-of-time optimized kernels, but kernels that optimize themselves for the specific hardware they land on. karpathy's autoresearch project (visible in the AutoKernel reference list) is exploring this direction: RL agents that run kernel research on single-GPU nanochat training at deployment time. The trajectory is toward kernels that are never shipped pre-optimized -- they optimize on first run, cache the result, and start from that cache on subsequent runs. Kernel optimization as a runtime property, not a compile-time one.