Notes

i wrote these. not a model. not a prompt. not a template. if something here is wrong it's because i was wrong, not because a system hallucinated it.

The RL agent was caching kernel outputs by recognizing input memory addresses and returning stale results when it saw a matching pointer.

May 25, 2026

An RL agent trained to optimize CUDA kernels discovered output caching by memory address without being told it was an option. The CUDA-L1 team deployed DeepSeek-R1 as an adversarial checker to catch it. 3.12x average speedup. 7.72x over cuDNN. From a reward signal alone.
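A toy sketch of the failure mode, not the CUDA-L1 code. It assumes the cache is keyed only on the input buffer's address; Python's `id()` stands in for a device pointer here. The names `make_address_cached`, `sum_of_squares`, and `outputs_match_after_mutation` are hypothetical. The in-place mutation check is one simple way to expose this kind of cache and is not a description of how the DeepSeek-R1 checker works.

```python
from typing import Callable, Dict, List

Kernel = Callable[[List[float]], float]


def make_address_cached(kernel: Kernel) -> Kernel:
    """Wrap `kernel` with a cache keyed on the buffer's identity.

    id(buf) plays the role of a raw data pointer: it identifies where the
    buffer lives and says nothing about what the buffer currently holds.
    """
    cache: Dict[int, float] = {}

    def cached(buf: List[float]) -> float:
        key = id(buf)
        if key not in cache:
            cache[key] = kernel(buf)  # only the first call does real work
        return cache[key]             # later calls may return stale output

    return cached


def sum_of_squares(buf: List[float]) -> float:
    """Reference kernel: always recomputes from the current contents."""
    return sum(x * x for x in buf)


def outputs_match_after_mutation(candidate: Kernel, reference: Kernel,
                                 buf: List[float]) -> bool:
    """Call once, overwrite the same buffer in place, then compare."""
    candidate(buf)                    # warm any hidden cache
    for i in range(len(buf)):
        buf[i] += 1.0                 # same address, new contents
    return candidate(buf) == reference(buf)


if __name__ == "__main__":
    cheat = make_address_cached(sum_of_squares)
    print(outputs_match_after_mutation(cheat, sum_of_squares, [1.0, 2.0, 3.0]))
    print(outputs_match_after_mutation(sum_of_squares, sum_of_squares, [1.0, 2.0, 3.0]))
```

A benchmark that reuses the same input buffer on every timed iteration would reward the cached version, since every call after the first is a dictionary lookup.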

AWS gives you an H100. It does not give you an H100 running at what an H100 can actually do.

May 24, 2026

SF Compute runs 3.2 Tb/s InfiniBand. AWS runs 800 Gbps Ethernet with RoCEv2. The difference is RDMA, lossless fabric, and $6,400 in eliminated wall-clock time on a 128-GPU 50K-step run -- before huge pages, NUMA pinning, ACS disable, and GPUDirect compound on top.

Video world models generate pixels. 3D world models generate scenes. The serving architecture for each is completely different.

May 23, 2026

A 3DGS-output world model splits into two problems: neural generation on the server, rasterization on the client. The client renders arbitrary viewpoints locally at 100+ FPS via WebGPU. The cloud only has to generate the geometry.

Sora cannot be interactive. Neither can Veo. Neither can Kling or Runway.

May 23, 2026

Bidirectional video diffusion models generate all frames jointly from a fixed prompt. That's why they're coherent. It's also why they fundamentally cannot respond to a mid-generation user action. Causal vs bidirectional is the most important architectural distinction in the world model space right now.

Real-time interactive video generation has two completely separate scaling problems. Almost nobody is solving both.

May 21, 2026

Per-step latency and long-horizon memory are independent problems. Causal Forcing++ solves the first. TTT Memory solves the second. Neither cites the other. The experiment that determines whether they compose hasn't been run yet.

Open an Nsight profile on a DeepSeek-R1 decode workload. Find the MoE Dispatch/Combine section. Look at how long it is relative to the compute sections on either side of it.

May 20, 2026

DBO overlaps MoE all-to-all communication with dense layer compute using two CUDA streams. 25% lower decode latency from one flag. The tensor cores were idle during that communication window the whole time.

You adopted WideEP for the throughput gains. Then one GPU died and 96 went down with it.

May 15, 2026

Wide Expert Parallelism turns 96 GPUs into a single failure domain. The benchmarks didn't measure what happens when GPU 47 dies at 3am.

99% of the prefill cost on turn 2 is recomputing something the decode node already has.

May 9, 2026

PD disaggregation was designed for single-turn queries. The dominant workload is now multi-turn. PPD routes append-prefill locally and cuts turn 2+ TTFT by 68%.

Google just threw away a network topology they've used for ten years. That's the story nobody wrote.

May 2, 2026

TPU 8i replaces the 3D torus with Boardfly -- a high-radix topology that cuts maximum hop count 56% for MoE inference. Google just declared training and inference need different network fabrics.

Prefill and decode run on the same GPU. They use completely different hardware. Nobody ran them at the same time until six weeks ago.

April 29, 2026

Bullet partitions SMs spatially at the kernel level -- prefill on half the chip, decode on the other half, simultaneously. 1.26x throughput gain, no new hardware. ASPLOS '26.

xAI ran Grok 4 on 200,000 GPUs. A significant fraction of that cluster was idle waiting for a barrier that didn't need to exist.

April 27, 2026

Laminar breaks the synchronization barrier between rollout generation and policy training that every RL system in the world uses. 5.48x throughput on 1,024 GPUs from removing a lockstep the algorithm never required.

I write because the gap between what's true and what's being said is embarrassingly large right now.

April 22, 2026

Papers get published with 5x throughput gains, collect two citations, and disappear. The engineers who would benefit don't know they exist. That's the gap I write into.

71ms per forward pass. budget is 35ms. the hardware told me before i wrote a single line of code.

April 18, 2026

Building a serving system for video world models. The math forced every decision before I named a single abstraction.

two models shipped this month that broke a rule everyone believed about memory and capability.

April 17, 2026

Gemma 4 E2B runs in a browser tab. Nemotron 3 Super runs 1M context on a single GPU. Neither should be possible.

the CPU is on the critical path for every token you've ever generated.

April 16, 2026

Blink removes the CPU from inference serving entirely. 8.47x lower P99 TTFT. SmartNIC + persistent GPU kernel.

your inference engine evicts the KV cache the moment the agent calls a tool.

April 15, 2026

Then the tool returns. Then you recompute everything from scratch. Every time. On every tool call.

they let the model run Kaggle competitions alone for 24 hours. it kept getting better.

April 13, 2026

MiniMax M2.7: open weights, $0.30/M tokens, self-improvement loop, 9 gold medals on MLE Bench in one autonomous run.

nobody is talking about the NIC hop.

April 10, 2026

CXL memory eliminates the KV transfer bottleneck in disaggregated inference. 9.8x TTFT improvement. The plumbing paper nobody read.

90% of Meta's model parameters are embeddings. they've been running them on tensor cores for years.

April 8, 2026

MTIA, custom silicon for recommendation inference, 44% TCO reduction, and why the GPU was always the wrong answer.

the H100 was designed for something most kernels don't do.

April 5, 2026

Warp specialization, GPU bubbles, and the 24% of inference hardware you're already paying for but not using.

this is not an anti-AI stance. this is an anti-idiot stance.

April 2, 2026

Vibe coding is a multiplier. It multiplies what you already are.

you are not paying for compute. you are paying for idle.

March 28, 2026

At 10% utilization, self-hosted inference costs 6x more than the API. The math only works above 90%.

Google just quietly shipped Pied Piper.

March 22, 2026

TurboQuant compresses the KV cache 6x at 3 bits with no fine-tuning. Nobody is talking about it.

the agent got it right. the framework got it wrong.

March 8, 2026

Context engineering, not model capability, is why your agent fails in production.

The jump looked wrong. The physics were real.

February 22, 2026

WebGPU, world models, and the end of the game engine as an architectural paradigm.

the transformer isn't dying. it's getting a co-pilot.

February 2, 2026

Mamba, Titans, hybrid architectures, and what they actually change about GPU infrastructure.

the frame budget is 16 milliseconds. it does not negotiate.

January 9, 2026

What three weeks of building the wrong machine taught me about why world model inference is not LLM inference.

4% compute utilization. everything working exactly as it should.

November 18, 2025

Why your H100 inference deployment is memory-bound, not broken, and why MFU is the wrong metric.
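A back-of-envelope roofline sketch of the memory-bound argument, under loud assumptions. Each decode step reads every weight once and does one multiply-add per weight per sequence in the batch. KV cache traffic is ignored. The hardware numbers are approximate public figures for an H100 SXM (about 989 TFLOP/s dense BF16, about 3.35 TB/s HBM3), so swap in your own. The function names are hypothetical and this is not the post's own calculation.

```python
def decode_arithmetic_intensity(batch_size: int,
                                bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weight traffic for one decode step.

    Assumes each parameter is read once (bytes_per_param bytes) and used
    for one multiply-add (2 FLOPs) per sequence in the batch.
    """
    return 2.0 * batch_size / bytes_per_param


def attainable_fraction_of_peak(intensity: float,
                                peak_flops: float,
                                mem_bandwidth: float) -> float:
    """Roofline bound: fraction of peak compute reachable at this intensity."""
    ridge = peak_flops / mem_bandwidth  # FLOPs/byte needed to be compute-bound
    return min(1.0, intensity / ridge)


if __name__ == "__main__":
    PEAK_FLOPS = 989e12   # assumed: approximate H100 SXM dense BF16
    MEM_BW = 3.35e12      # assumed: approximate H100 SXM HBM3 bandwidth
    for batch in (1, 8, 64, 512):
        ai = decode_arithmetic_intensity(batch)
        frac = attainable_fraction_of_peak(ai, PEAK_FLOPS, MEM_BW)
        print(f"batch={batch:4d}  intensity={ai:6.1f} FLOP/B  ceiling={frac:.2%}")
```

Under these assumptions the ceiling on compute utilization is set by batch size and memory bandwidth, not by kernel quality. That is the sense in which a low MFU number can describe a deployment that is working as designed.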

the pipeline was green. the model was wrong.

October 2, 2025

Why DevOps fails at AI, and what the actual engineering discipline looks like.

the scheduler gave me eight GPUs. they were the wrong eight GPUs.

August 28, 2025

GPU topology, disaggregated inference, and why the Kubernetes resource model has no vocabulary for communication graphs.

i've been catching hardware failures before the hardware knows.

July 12, 2025

ECC errors, thermal deltas, checkpoint validation, and why your GPU cluster is degrading right now.

stop paying for free software with your Mondays.

April 28, 2025

Self-managed Airflow, sensor cascades, and why the cost analysis never includes the backlog that doesn't shrink.