
I write because the gap between what's true and what's being said is embarrassingly large right now.

Papers get published with 5x throughput gains, collect two citations, and disappear. The engineers who would benefit don't know they exist. That's the gap I write into.

April 22, 2026

That's the whole reason. I keep waiting for it to close and it doesn't.

There are hundreds of pieces every week about AI. Most of them are about models -- which one scored higher on which benchmark, which company raised more money, which CEO said something quotable. Some of them are about infrastructure in a surface way -- Jensen said the inference inflection point has arrived, here is a summary of the GTC keynote, here are the numbers he cited.

Almost none of them are about the actual problems. The specific, hard, unsolved engineering problems that determine whether any of this works at scale. The things that keep the people building this space awake at 2am not because they're anxious but because the problem is genuinely interesting and they can't stop thinking about it.

That gap is what I write into.


I got pulled into this space because I couldn't stop reading the papers.

Not because someone told me to. Not because it was strategically useful. Because I would find one paper -- something about KV cache management, or GPU scheduling, or post-training infrastructure -- and it would contain a number that didn't fit in my head, and I'd spend the next three hours chasing the references until I understood where the number came from. And then I'd surface and realize nobody had written about it in plain language anywhere.

That's the pattern that keeps happening. A paper gets published. It has real results -- 5x throughput, 9x latency reduction, 2.5x tokens per second from a software change on hardware you already own. It gets two citations. Nobody writes about it. The engineers who would benefit from knowing about it don't know it exists, because they're busy shipping and the paper is dense and nobody translated it.

i find that situation slightly maddening. so i write.


Why this space specifically.

The honest answer is that I think we are in one of those rare moments where the foundational decisions being made right now will determine the shape of an entire industry for a decade. The infrastructure decisions -- how you serve models, how you train them, what hardware you build around, how you schedule the work across a cluster -- these are being made in 2025 and 2026 by a relatively small number of people, and most of the options aren't even visible yet because the papers describing them haven't been translated into language engineers can act on.

That's a genuinely interesting problem to be writing around. Not "here's a think-piece about what AI means for society." The actual technical decisions. The ones with numbers. The ones where being right or wrong by a factor of two changes your compute bill by tens of millions of dollars.

I also think the problems are beautiful. I mean that in the way that mathematicians mean it -- there is a kind of elegance to a well-formulated constraint. The cost of attention scales quadratically with context length, but the model's capability grows with context, and you need the capability, so you have to find a way to make the quadratic not matter. That's a hard problem. The kind you can spend years on and still feel like you haven't gotten to the bottom of it.
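To make the quadratic concrete, here is a rough back-of-the-envelope sketch. It counts only the two matrix multiplies inside standard full attention for one layer (scores, then the weighted sum over values) and ignores everything else. The function name and the default head counts are hypothetical, chosen for illustration rather than taken from any particular model.

```python
def attention_matmul_flops(context_len: int, head_dim: int = 128, num_heads: int = 32) -> int:
    """Approximate FLOPs for the attention matmuls of one layer over a full context.

    Assumes plain full attention; causal masking roughly halves the work
    but does not change the quadratic growth.
    """
    # Q @ K^T: context_len x context_len dot products, each of length head_dim.
    # One multiply-add is counted as 2 FLOPs.
    scores = 2 * context_len * context_len * head_dim
    # softmax(scores) @ V: the same amount of work again.
    weighted_sum = 2 * context_len * context_len * head_dim
    return num_heads * (scores + weighted_sum)


if __name__ == "__main__":
    for n in (8_000, 16_000, 32_000):
        print(n, attention_matmul_flops(n))
```

Each doubling of the context multiplies this term by four, which is the whole problem in one line.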

The KV cache is a similar shape. Memory is finite. Context is infinite. The model needs everything it's ever seen to answer your question well. Something has to give. The papers I keep reading are all different attempts at negotiating that trade -- compression, eviction, pooling, off-loading, tiering, restructuring the attention kernel so it doesn't need to see everything. None of them fully solve it. Each one moves the constraint somewhere else. That is a beautiful problem.
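The memory side of that trade is simple arithmetic, sketched below. The formula is the usual one for a standard transformer KV cache (one key and one value vector per layer, per KV head, per token). The specific configuration (32 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical example and not a claim about any specific model.

```python
def kv_cache_bytes(
    seq_len: int,
    num_layers: int = 32,     # hypothetical
    num_kv_heads: int = 8,    # hypothetical
    head_dim: int = 128,      # hypothetical
    bytes_per_value: int = 2, # 16-bit storage, an assumption
    batch_size: int = 1,
) -> int:
    """Bytes needed to hold keys and values for every token seen so far."""
    keys_and_values = 2  # one K vector and one V vector per token
    per_token = keys_and_values * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


if __name__ == "__main__":
    print(kv_cache_bytes(1) // 1024, "KiB per token")
    print(kv_cache_bytes(131_072) // 2**30, "GiB for one 128k-token sequence")
```

Under these assumptions the cache grows linearly with context and with the number of concurrent sequences, so a handful of long conversations can claim more memory than the weights themselves. Every technique in the list above is a way of shrinking one of the factors in that product or moving the bytes somewhere cheaper.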


The thing I try to do when I write -- and often fail at -- is find the one true thing buried in whatever I'm looking at and say it out loud before the reader can negotiate with it.

Not the interesting thing. Not the surprising thing. The true thing. The thing that follows inevitably from the facts if you look at them directly enough. Usually it's something that's visible in the data but that nobody has said explicitly, because saying it explicitly is slightly uncomfortable.

The GPU utilization post started because I kept seeing teams report 85%, 90% utilization and treat it as a sign of success, and I knew from the math that 85% utilization on inference workloads is not a success metric -- it's a symptom of serving the wrong users. The true thing was: you are measuring how busy your hardware is, not how well you are serving people. Those two things can diverge silently. They do diverge, constantly, in production systems. Nobody was saying it.
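A toy illustration of that divergence, with made-up numbers: two 100-second windows on a single GPU, both 90% busy. In the first, every request comes back in one second. In the second, the queue builds and each request waits a little longer than the one before. The `Request` type, the 2-second latency target, and the definition of goodput used here (requests per second that met the target) are assumptions for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Request:
    gpu_ms: int      # time the GPU spent working on this request
    latency_ms: int  # what the user experienced, queueing included


def utilization(requests: list[Request], wall_ms: int, num_gpus: int = 1) -> float:
    """Fraction of available GPU time spent busy."""
    return sum(r.gpu_ms for r in requests) / (wall_ms * num_gpus)


def goodput(requests: list[Request], wall_ms: int, slo_ms: int) -> float:
    """Requests per second that finished within the latency target."""
    met_target = sum(1 for r in requests if r.latency_ms <= slo_ms)
    return met_target / (wall_ms / 1000)


WALL_MS = 100_000
SLO_MS = 2_000  # hypothetical latency target

# Hypothetical traffic: same GPU work per request, very different waits.
healthy = [Request(gpu_ms=900, latency_ms=1_000) for _ in range(100)]
overloaded = [Request(gpu_ms=900, latency_ms=1_000 + 100 * i) for i in range(100)]

if __name__ == "__main__":
    for name, reqs in (("healthy", healthy), ("overloaded", overloaded)):
        print(name, utilization(reqs, WALL_MS), goodput(reqs, WALL_MS, SLO_MS))
```

Both windows report the same utilization. Only the second number tells you that, in the overloaded case, most users waited longer than the target.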

That's the post I want to write every time. The one where the true thing is hiding in the numbers and everyone has been politely not saying it.


I write about this space specifically -- AI infrastructure, systems engineering, the machinery underneath the models -- because I think it's where the most important problems are right now, and they're dramatically undercovered relative to their importance.

A 10% improvement in inference throughput at Anthropic's scale is not a footnote. It can be worth hundreds of millions of dollars. It is the difference between being able to serve a new capability profitably or not. The engineering decisions that produce that 10% are made by people who read papers, run experiments, and argue about kernel implementations. Those decisions are invisible in the coverage that most people consume.

I want to make them slightly less invisible.

Not because I think everyone needs to know about warp specialization or CXL memory pooling or the difference between goodput and throughput. Most people don't, and that's fine.

But the people who are making these decisions -- the engineers, the infra leads, the people choosing between hardware configurations that will determine their cost structure for the next three years -- those people deserve writing that treats them as the intelligent adults they are. That doesn't condescend. That says the true thing directly and trusts them to handle it.

That's what I'm trying to do.


Whether I'm succeeding is a different question. Most days I'm not sure.

But the gap is still there. Papers keep getting published. Numbers keep being buried. Problems keep being real and interesting and mostly invisible.

so I keep writing.


P.S. The problems I find most interesting right now, if you're curious: the KV cache at long contexts (unsolved), the RL post-training synchronization bottleneck (just being cracked open), the memory hierarchy for disaggregated inference (active research, nobody has the full answer), and what on-device inference actually means for privacy and economics when the models get small enough to run locally at genuine quality. That last one is the one that keeps me up. The implications are large and mostly unexplored.

i write these when i have something worth saying. no schedule. no algorithm. if you want to know when the next one goes up -- leave your email.

no spam. no sequence. just the note, when it exists.