Vansh Verma is an AI infrastructure and ML systems engineer who builds the low-level systems that keep AI fast, correct, and cheap in production — GPU kernels down to PTX/SASS, inference runtimes, distributed training, and formally-verified distributed systems. He is currently a Member of Technical Staff, Machine Learning at Rational Dynamics (a Voleon company), and was previously a founding AI-infrastructure engineer (0→1 platform), an ML engineer at GoodRx, and an HPC/quant infrastructure engineer at a tier-1 market-making firm.

What does Vansh Verma specialize in?

Performance and correctness at the layer where it matters: custom CUDA kernels and SASS/PTX-level GPU optimization, inference serving (vLLM, TensorRT-LLM, speculative decoding, KV-cache compression), multi-tenant GPU infrastructure (NVIDIA MIG, 8:1 sharing at sub-50ms), distributed training across NCCL/NVLink/InfiniBand H100/H200 clusters, and distributed systems verified in TLA+. He also writes and ships open systems software in Rust.

Where is Vansh Verma based?

Vansh Verma is based in Dallas, Texas, and works across New York, San Francisco, and Berkeley — set up for hybrid work in the major US tech and finance hubs.

What is Vansh Verma's low-level GPU experience?

Deep. He writes custom CUDA kernels and optimizes at the SASS instruction level (instruction scheduling, asynchronous memory loads, occupancy, kernel fusion, Tensor Cores), profiles with Nsight Compute/Systems, and works across the memory hierarchy. He publishes technical analyses on GPU internals — including SASS-level kernel scheduling (CuAsmRL), FlashAttention-4 on Blackwell, and Triton-to-Tile-IR compilation — that demonstrate working knowledge of the layer below PTX. SASS-level optimization is rare; most engineers never go below CUDA C++.

What distributed-training and GPU-cluster experience does Vansh Verma have?

He has scaled multi-node distributed training on H200 clusters by tuning NCCL collectives over NVLink/NVSwitch and GPUDirect RDMA over InfiniBand, profiled with Nsight, for a 45% training-time reduction, and operated multi-tenant GPU infrastructure with NVIDIA MIG. He is fluent in the full GPU-cluster networking stack: NCCL/MPI collectives, NVLink, GPUDirect, RDMA, InfiniBand, RoCE, and rail optimization.

What is Vansh Verma's high-frequency-trading and low-latency background?

At a tier-1 market-making firm he architected a tick-level market-data system processing 25TB+/day that supported sub-millisecond decision-making behind $2M+ in annual trading decisions, and engineered a colocation network stack that cut order-execution latency 78% and lifted throughput 3.2x.

What has Vansh Verma built?

Ledge (a git-compatible storage engine with TLA+-verified sharded Raft, faster clone and smaller packs than git), WMServe (sub-50ms world-model inference at 10K+ concurrent), FlowLLM (a custom GPU inference hypervisor in Rust/Assembly that boots in 50 microseconds), APEX (a GPU-native vector database at 3.5M queries/sec/GPU), SchemaForge (SMT-verified declarative database infrastructure, adopted by a FAANG internal-tooling team), and open-source systems including PHANTOM, NEMESIS, and TASFT.

How do I contact or hire Vansh Verma?

Email vanshverma.dev@gmail.com, or reach him via GitHub (github.com/v-code01), LinkedIn (linkedin.com/in/vanshv5), or X (x.com/trickvansh5). His site is vanshverma.com.

AWS gives you an H100. It does not give you an H100 running at what an H100 can actually do.

This distinction is the entire technical argument for the neocloud model and almost nobody has stated it precisely.

An H100 SXM5 on AWS is virtualized. There is a hypervisor between your code and the silicon. The NVLink fabric between GPUs in the node runs at full speed -- AWS doesn't virtualize intranode bandwidth. But the moment a collective operation leaves the node and hits the network fabric, you are on shared 800 Gbps Ethernet with other tenants' traffic, with RoCEv2 congestion control running on top, with the virtual network interface adding latency that the physical interface doesn't have.

SF Compute's Kubernetes cluster nodes have 3.2 Tb/s InfiniBand. That's not a marketing comparison. That's 4x the bandwidth and roughly half the latency of the 800 Gbps Ethernet that hyperscaler GPU instances use for inter-node collective operations. The difference is RDMA -- Remote Direct Memory Access. When GPU A on node 1 does an allreduce with GPU B on node 2, RDMA lets GPU A write directly into GPU B's memory without touching either machine's CPU or OS kernel. The message goes: GPU A HBM → NIC (via GPUDirect) → InfiniBand fabric → NIC → GPU B HBM. No CPU involvement. No kernel context switch. No memory copy into a staging buffer.

On Ethernet, even RoCEv2, you have more CPU involvement, higher latency variance, and congestion control that occasionally drops performance under load. The congestion control is necessary because Ethernet is not a lossless fabric -- packets can drop and retransmit. InfiniBand is lossless by design: its credit-based link-level flow control lets a sender transmit only when the receiver has buffer space, so packets are not dropped for congestion. In a properly configured InfiniBand cluster, congestion-driven retransmits do not happen.

The number that makes this concrete: in a 128-GPU training run on a 70B parameter model, a forward+backward pass triggers allreduce operations across all 128 GPUs roughly once per gradient update. At 50,000 training steps, that's 50,000 allreduce operations crossing the inter-node fabric. Each operation's latency is the bottleneck -- you cannot start the next forward pass until the gradient synchronization completes.

If your inter-node allreduce takes 40ms on 800 Gbps Ethernet and 20ms on InfiniBand, the difference is 20ms × 50,000 steps = 1,000 seconds, or about 16.7 minutes of wall-clock training time. On hardware that costs $3/hr per GPU × 128 GPUs = $384/hr, that's roughly $107 of cluster time that InfiniBand eliminates. That figure is for a single collective per step; it multiplies with every additional collective a step issues.
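This arithmetic is easy to get wrong by a unit, so here is a minimal sketch that recomputes it from the quoted inputs. The function name is made up for illustration, and it assumes exactly one inter-node allreduce per training step.

```python
def allreduce_overhead(slow_ms, fast_ms, steps, gpus, usd_per_gpu_hour):
    """Extra wall-clock time and cluster cost from a slower per-step allreduce."""
    extra_seconds = (slow_ms - fast_ms) * steps / 1000  # ms -> s
    cluster_usd_per_hour = gpus * usd_per_gpu_hour
    extra_usd = extra_seconds / 3600 * cluster_usd_per_hour
    return extra_seconds, cluster_usd_per_hour, extra_usd


# Inputs are the figures from the text: 40ms vs 20ms, 50,000 steps, 128 GPUs at $3/hr.
seconds, hourly, usd = allreduce_overhead(
    slow_ms=40, fast_ms=20, steps=50_000, gpus=128, usd_per_gpu_hour=3.0
)
print(f"{seconds:.0f} s = {seconds / 60:.1f} min, ${hourly:.0f}/hr, ${usd:.2f} extra")
# prints: 1000 s = 16.7 min, $384/hr, $106.67 extra
```

Change `steps` or the per-step latencies to see how quickly the total moves; the cost is linear in all of them.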

This math gets worse as model size grows, as tensor parallelism degree increases, and as the allreduce message size scales with model width. The training runs where the fabric matters most are the ones where the cost difference is largest. This is not a marginal infrastructure choice.
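To see why message size matters, a rough sketch using the textbook alpha-beta cost model for ring allreduce helps. This is an estimate, not a measurement: it ignores congestion, intra-node NVLink, and compute/communication overlap. The node count and hop latencies below are hypothetical; only the two link rates come from the text.

```python
def ring_allreduce_seconds(message_bytes, ranks, link_bytes_per_s, hop_latency_s):
    """Alpha-beta estimate for ring allreduce: 2(N-1) steps, each moving 1/N of the message."""
    steps = 2 * (ranks - 1)
    bandwidth_term = steps * (message_bytes / ranks) / link_bytes_per_s
    latency_term = steps * hop_latency_s
    return bandwidth_term + latency_term


GBIT = 1e9 / 8  # bytes per second carried by one Gb/s

# Hypothetical setup: 16 nodes, per-node link rates from the text, made-up hop latencies.
for size_mib in (64, 256, 1024):
    size = size_mib * 2**20
    eth = ring_allreduce_seconds(size, 16, 800 * GBIT, 10e-6)
    ib = ring_allreduce_seconds(size, 16, 3200 * GBIT, 5e-6)
    print(
        f"{size_mib:>5} MiB  ethernet {eth * 1e3:7.2f} ms  "
        f"infiniband {ib * 1e3:7.2f} ms  gap {(eth - ib) * 1e3:7.2f} ms"
    )
```

In this model the bandwidth term dominates once messages get large, so the absolute gap between the two fabrics grows roughly linearly with message size.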

FluidStack runs from UEFI up. Atlas OS. That phrase covers a specific set of system-level configurations that hyperscalers don't expose because they would break multi-tenant operation.

Huge pages: at boot, reserve 1GB huge pages for the GPU driver and training framework processes. These have to be reserved explicitly; transparent huge pages on x86-64 are 2MB, not 1GB. When the model weight matrices are 50GB+ and the training loop accesses them thousands of times per second, huge pages reduce TLB misses by orders of magnitude. Standard hyperscaler instances run 4KB pages by default because huge page configuration is per-instance and the hypervisor can't coordinate it efficiently across tenants.
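A quick way to check what a node is actually running with is to read the huge-page counters from /proc/meminfo. The sketch below assumes a Linux host; the helper names are made up for illustration. It also counts how many pages a 50GB working set spans at each page size, which is where the TLB pressure comes from.

```python
import re


def parse_hugepage_info(meminfo_text):
    """Pull the huge-page counters out of /proc/meminfo-style text."""
    fields = {}
    for key in ("HugePages_Total", "HugePages_Free", "Hugepagesize"):
        match = re.search(rf"^{key}:\s+(\d+)", meminfo_text, re.MULTILINE)
        fields[key] = int(match.group(1)) if match else None
    return fields


def pages_needed(working_set_bytes, page_bytes):
    """Number of pages (and so page-table entries) covering a working set."""
    return -(-working_set_bytes // page_bytes)  # ceiling division


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            print(parse_hugepage_info(f.read()))  # Hugepagesize is reported in kB
    except FileNotFoundError:
        print("no /proc/meminfo here (not Linux?)")

    weights = 50 * 2**30  # the 50GB figure from the text
    print("4KB pages:", pages_needed(weights, 4 * 2**10))
    print("1GB pages:", pages_needed(weights, 2**30))
```

At 4KB the same weights span about 13 million pages; at 1GB they span 50.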

NUMA pinning: a GB200 NVL72 rack has 36 CPUs and 72 GPUs. Each GPU is physically attached to a specific CPU. Getting allreduce latency down requires that the NCCL process for each GPU is pinned to CPU cores on the same NUMA node as that GPU, and that memory allocations for communication buffers happen on that NUMA node's DIMM banks. Hyperscalers handle NUMA at the VM level, not the GPU level. The default NUMA configuration on a p4d instance is not optimized for this topology because it can't be -- the instance is shared and the configuration would need to be per-workload.
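On Linux the GPU-to-NUMA mapping is visible in sysfs, so the CPU half of the pinning can be sketched in a few lines. This is a minimal illustration, not what any provider ships: the function names are made up, the PCI address in the usage note is hypothetical, and it covers CPU affinity only. Binding memory to the node needs numactl or libnuma on top.

```python
import os
from pathlib import Path


def parse_cpulist(text):
    """Expand a sysfs cpulist such as '0-3,8-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def cpus_local_to_device(pci_bdf):
    """CPUs on the NUMA node that a PCI device (e.g. a GPU) is attached to."""
    node = int(Path(f"/sys/bus/pci/devices/{pci_bdf}/numa_node").read_text())
    if node < 0:  # -1 means the kernel reports no NUMA affinity for this device
        return set(os.sched_getaffinity(0))
    cpulist = Path(f"/sys/devices/system/node/node{node}/cpulist").read_text()
    return parse_cpulist(cpulist)


def pin_to_device(pci_bdf):
    """Restrict the calling process to the CPUs local to the given device."""
    os.sched_setaffinity(0, cpus_local_to_device(pci_bdf))
```

A launcher would call something like `pin_to_device("0000:3b:00.0")` in each rank before initializing NCCL, using the sysfs-style address of that rank's GPU.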

PCIe ACS disable: Access Control Services enforces PCIe access isolation between devices. It is the right default for multi-tenant environments where you don't want one customer's GPU accessing another's memory. On a single-tenant AI cluster, ACS is overhead: with it enabled, peer-to-peer transactions are redirected up to the root complex instead of being routed directly through the PCIe switch. Disabling it restores the direct peer-to-peer path across PCIe, which is what GPUDirect P2P uses. FluidStack's Atlas OS disables ACS on bare-metal deployments by default.
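Whether ACS is actually off can be read from the `ACSCtl` lines in `sudo lspci -vvv` output. The sketch below parses that text and lists devices that still have ACS bits set. The function name is made up, and the parsing assumes the usual lspci layout (device header at column 0, capability lines indented), so treat it as a starting point.

```python
import re

# ACS control bits checked here. Request and completion redirect send peer-to-peer
# traffic upstream; source validation is commonly set alongside them.
CHECKED_BITS = ("SrcValid", "ReqRedir", "CmpltRedir")


def devices_with_acs_enabled(lspci_text):
    """Return PCI addresses whose ACSCtl line has any checked bit set ('+')."""
    flagged = []
    current = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            current = line.split()[0]  # device header, e.g. '3a:00.0 PCI bridge: ...'
        elif "ACSCtl:" in line and current is not None:
            if any(re.search(rf"\b{bit}\+", line) for bit in CHECKED_BITS):
                flagged.append(current)
    return flagged
```

An empty list from the bridges between GPUs and NICs is the result you want on a single-tenant node; `nvidia-smi topo -m` gives a complementary view of the resulting topology.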

GPUDirect RDMA: the NIC-to-GPU zero-copy path. When NCCL sends a tensor from one node to another, GPUDirect RDMA lets the NIC DMA directly from GPU HBM without staging through CPU DRAM. This requires the right kernel driver version, the right MLNX_OFED version, the nvidia-peermem kernel module loaded, and the NIC and GPU physically on the same PCIe root complex. Hyperscalers have this configured on their highest-tier instances. They do not expose whether it's working correctly or let you tune it. FluidStack's UEFI-up control means you know exactly what's configured and you can verify it.
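One piece of that verification is simple enough to sketch: checking that a peer-memory kernel module is loaded. The helper name below is made up, and this covers only one link in the chain. Newer driver stacks may use dma-buf instead of a peer-memory module, so an empty result is a prompt to look further, not proof that GPUDirect RDMA is off.

```python
def peer_memory_modules(proc_modules_text):
    """Names of loaded kernel modules that provide GPU peer memory for RDMA NICs."""
    known = {"nvidia_peermem", "nv_peer_mem"}  # current and legacy module names
    loaded = {
        line.split()[0] for line in proc_modules_text.splitlines() if line.strip()
    }
    return sorted(known & loaded)


if __name__ == "__main__":
    try:
        with open("/proc/modules") as f:
            found = peer_memory_modules(f.read())
    except FileNotFoundError:
        found = []
    print("peer-memory modules loaded:", found or "none")
```

Running a job with `NCCL_DEBUG=INFO` then shows which transport NCCL actually selected for inter-node traffic, which is the end-to-end confirmation.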

These are not options in a settings menu. They are low-level system configurations that require bare-metal control to set correctly and that compound when they're all right versus all defaulted.

SF Compute's model is technically different from FluidStack's but solves the same problem from a different angle.

SF Compute started because someone signed an inflexible 12-month GPU contract for more capacity than they needed, organized a shared arrangement with 170 other startups to use the excess, and accidentally built a marketplace. The technical insight that came out of that: GPU capacity is fungible enough that you can build a secondary market for it, and the secondary market is more efficient for buyers who don't have steady-state utilization.

The sell-back mechanism -- buy 32 nodes for 3 days at market price, sell back what you don't need when the experiment finishes early -- is technically a spot market with a guaranteed sell-back counterparty. The CLI interface (sf nodes create -n 32 --zone landsend -d 3d) is the buy interface. The sell-back is the liquidity mechanism that eliminates the lock-in risk.

The Q2 2026 InfiniBand-for-VMs shipping date is the product milestone that closes the remaining technical gap. Right now, SF Compute's self-serve VM nodes don't have InfiniBand -- the InfiniBand fabric is available only on the Kubernetes cluster nodes, which require a sales conversation. When InfiniBand lands on the self-serve VM path, the model becomes: provision an InfiniBand-connected H100 cluster in 5 minutes, run your training job, sell back what you don't use. No sales call. No 12-month contract. Full fabric performance.

That is not a hyperscaler product. That product does not exist on AWS or Azure. The combination of bare-metal-equivalent performance, InfiniBand fabric, and spot-market liquidity is the specific niche these providers own.

The financing story is where the model gets genuinely novel and genuinely risky simultaneously.

Macquarie structured a $10B senior debt facility for FluidStack with the physical GPUs as collateral. This is the kind of instrument that exists for aircraft, shipping containers, railroad cars -- depreciating physical assets with known residual value curves. Macquarie lends against the asset's future value, structures the amortization schedule to match the depreciation curve, and gets first claim on the hardware if the borrower defaults.

GPUs depreciate 40-60% in three years as next-generation chips arrive. The H100 is already being discounted. The H200 followed. The B200 is shipping. The depreciation curve is faster than aircraft and less predictable -- NVIDIA's release cadence is not on a publicly committed schedule the way aircraft retirement is governed by FAA airframe hours.
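To put the three-year range on the per-year footing that lenders use for amortization schedules, here is a small sketch. It assumes a constant compounding rate, which is a simplification: real GPU prices tend to step down when a new generation ships. The function names are illustrative.

```python
def annual_depreciation_rate(total_loss_fraction, years):
    """Constant yearly rate that compounds to the given total loss over `years`."""
    return 1 - (1 - total_loss_fraction) ** (1 / years)


def residual_value(purchase_price, annual_rate, years):
    """Value left after `years` of compounding depreciation."""
    return purchase_price * (1 - annual_rate) ** years


# The 40-60% three-year range from the text, expressed per year.
for loss in (0.40, 0.60):
    rate = annual_depreciation_rate(loss, 3)
    print(f"{loss:.0%} over 3 years -> {rate:.1%} per year")
```

Under this assumption the range works out to roughly 16% to 26% per year, and the spread between those two ends is the residual value uncertainty the lender is carrying.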

The gap: in mature asset-backed lending markets, lenders buy residual value insurance. A counterparty takes on the risk that the asset is worth less than expected at loan maturity. No RVI market exists for GPUs. The Silicon Data GPU Forward Curve launched April 8th is the first standardized forward pricing signal that could underpin an RVI market -- you need a liquid forward curve before you can price insurance against it.

Macquarie is taking residual value risk naked on a $10B facility. That is an enormous bet on the stability of GPU value curves. It is also probably correct for the current cycle -- the demand for GPU compute is growing faster than supply, which supports prices -- but it is not a risk that has been priced by a market, because the market didn't exist two months ago.

the hyperscaler gives you an h100. it does not give you rdma, uefi control, huge pages, acs-disabled peer-to-peer, gpudirect, or a fabric that doesn't share bandwidth with other tenants.

the neocloud gives you all of that.

for distributed training at scale, those are not amenities. they are the difference between a cluster that achieves 60% mfu and one that achieves 40% mfu. on a 128-gpu run at $384/hr, the 40% cluster delivers two-thirds of the useful work for the same bill: that's $128/hr of compute you paid for and didn't get.

sf compute's infiniband-for-vms shipping q2 2026 is the product milestone to watch. when self-serve spot-market gpu clusters have full fabric performance, the hyperscaler model for ai training has no remaining technical argument. only integration arguments. and integration arguments are losing.

P.S. The 66% utilization threshold from Uptime Institute's analysis is the number that should be on every infrastructure team's whiteboard. If your dedicated GPU cluster runs below 66% utilization, a neocloud is cheaper. Above 66%, on-prem wins. Most teams are not running at 66% -- training runs are bursty, experiments fail, datasets don't load on schedule. The neocloud wins more of these economics than the on-prem case predicts, because the on-prem case assumes you achieve the utilization that justifies the CapEx. Most teams don't. The spot market sell-back is how you close the gap between projected and actual utilization without eating the idle cost.
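The break-even logic is simple enough to sketch. The prices below are hypothetical, chosen only so the threshold lands on the 66% figure, and the model assumes rented capacity is paid for only while it is used, which is the idealized version of sell-back.

```python
def effective_cost_per_used_hour(owned_cost_per_hour, utilization):
    """What a useful GPU-hour costs on owned hardware that sits idle part of the time."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return owned_cost_per_hour / utilization


def break_even_utilization(owned_cost_per_hour, rented_price_per_hour):
    """Utilization at which owning and renting cost the same per useful hour."""
    return owned_cost_per_hour / rented_price_per_hour


# Hypothetical prices: $1.98/hr amortized owned cost vs $3.00/hr rented.
owned, rented = 1.98, 3.00
print(f"break-even: {break_even_utilization(owned, rented):.0%}")
for u in (0.40, 0.66, 0.90):
    cost = effective_cost_per_used_hour(owned, u)
    print(f"utilization {u:.0%}: owned ${cost:.2f} vs rented ${rented:.2f}")
```

Plug in your own amortized cost and your measured utilization, not the projected one; the gap between those two is the whole argument.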