Every kernel optimization system before Kernel-Smith…

Let me explain that distinction precisely, because it determines why the results are what they are.

A one-shot generator takes a kernel specification -- "implement a fused attention kernel for GQA with FP8 weights on Hopper" -- and produces a kernel. It's the approach of CUDA-L1, CUDA Agent, and most LLM-based kernel generation work. You train on (specification, fast reference kernel) pairs. The model learns to map specs to implementations. At inference, one generation pass produces a candidate. You evaluate it. Done.

The problem with one-shot generation: the optimization space for any non-trivial kernel is combinatorial. Block sizes, pipeline stages, warp counts, shared memory layout, register allocation strategy, memory access order -- each of these interacts with the others. The number of valid combinations is enormous. The number of near-optimal combinations is small. A one-shot generator is trying to find the good region of this space from a standing start. It can learn patterns ("attention kernels usually want 128x64 tiles") but it can't systematically explore the local neighborhood of a given configuration to find the optimum.
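To get a feel for the size of that space, here is a back-of-the-envelope count. The knob names and value ranges are illustrative assumptions of mine, not taken from the Kernel-Smith paper or from any real autotune config. The point is only the contrast between the full grid and the handful of configurations one step away from a given starting point.

```python
# Hypothetical tuning knobs for a single tiled kernel (illustrative values only).
SEARCH_SPACE = {
    "block_m": [16, 32, 64, 128, 256],
    "block_n": [16, 32, 64, 128, 256],
    "block_k": [16, 32, 64, 128],
    "num_stages": [1, 2, 3, 4, 5],
    "num_warps": [1, 2, 4, 8],
    "smem_layout": ["row_major", "col_major", "swizzled"],
}


def count_configs(space):
    """Number of points in the full Cartesian product of all knobs."""
    total = 1
    for values in space.values():
        total *= len(values)
    return total


def neighbors(config, space):
    """Yield configs that differ from `config` in exactly one knob."""
    for knob, values in space.items():
        for value in values:
            if value != config[knob]:
                yield {**config, knob: value}


start = {knob: values[0] for knob, values in SEARCH_SPACE.items()}
print(count_configs(SEARCH_SPACE))                 # 6000 points in the grid
print(len(list(neighbors(start, SEARCH_SPACE))))   # 20 one-knob neighbors
```

Six small knobs already give 6000 combinations, and real kernels have far more degrees of freedom than a flat grid. A single local step only has to rank 20 neighbors.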

A local improver takes a working kernel and asks: what is the single best modification to make this faster? It doesn't generate from scratch. It looks at what exists, profiles it, identifies the bottleneck, and proposes one targeted change. Then repeats. This is how expert human kernel engineers actually work. You don't write FlashAttention in one shot. You start with something correct, profile it, find the bottleneck -- GEMM memory bandwidth? SFU throughput? register pressure? -- and address it. Then profile again.

A local improver needs a different kind of training data than a one-shot generator does.

One-shot training data: (spec, fast_kernel) pairs. Lots of them. Relatively easy to collect -- run reference implementations, generate fast variants via existing tools, pair them.

Local improver training data: you need (kernel_t, modification, kernel_t+1, speedup_delta) tuples -- the kernel at step t, the specific code change applied, the resulting kernel, and the speedup produced by that change. These tuples only exist inside evolution trajectories -- long sequences of iterative improvements where someone or something ran an optimization loop and recorded each step.
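As a minimal sketch of what one of those tuples might look like in code: the class name, field names, and the snapshot format below are my own assumptions for illustration, not the paper's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImprovementStep:
    """One recorded step of an evolution trajectory (hypothetical schema)."""
    kernel_before: str        # source of kernel_t
    modification: str         # the code change that was applied
    kernel_after: str         # source of kernel_t+1
    latency_before_ms: float
    latency_after_ms: float

    @property
    def speedup_delta(self) -> float:
        """Speedup produced by this one change (> 1.0 means faster)."""
        return self.latency_before_ms / self.latency_after_ms


def steps_from_trajectory(snapshots):
    """Pair consecutive snapshots of a trajectory into per-step tuples."""
    steps = []
    for prev, cur in zip(snapshots, snapshots[1:]):
        steps.append(ImprovementStep(
            kernel_before=prev["source"],
            modification=cur["modification"],
            kernel_after=cur["source"],
            latency_before_ms=prev["latency_ms"],
            latency_after_ms=cur["latency_ms"],
        ))
    return steps


# Toy trajectory: placeholder strings stand in for real kernel source.
trajectory = [
    {"source": "kernel_v0", "modification": None, "latency_ms": 4.0},
    {"source": "kernel_v1", "modification": "vectorize loads", "latency_ms": 2.0},
    {"source": "kernel_v2", "modification": "unroll inner loop", "latency_ms": 2.5},
]
steps = steps_from_trajectory(trajectory)
for step in steps:
    print(step.modification, round(step.speedup_delta, 2))
```

Note that the second step is a regression (speedup below 1.0). A raw trajectory contains plenty of those, which is why it can't be used as training data unfiltered.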

Kernel-Smith's training procedure: run long evolutionary trajectories (thousands of steps) across hundreds of kernels. Record every modification. Filter to retain only "correctness-preserving, high-gain revisions" -- the modifications that produced meaningful speedup without breaking the kernel. Convert these to step-centric (state, action, reward) tuples. Train the model on this filtered step corpus via RL.
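The filter-and-convert stage could be sketched roughly like this. Everything specific here is an assumption on my part: the `Revision` record, the 1.05 speedup threshold, and the choice of log-speedup as the reward are placeholders, since the paper's actual thresholds and reward shaping aren't given in this post.

```python
import math
from typing import NamedTuple


class Revision(NamedTuple):
    """One recorded modification from a trajectory (hypothetical record)."""
    kernel_before: str
    modification: str
    passed_correctness: bool
    speedup: float  # latency_before / latency_after


def to_training_tuples(revisions, min_speedup=1.05):
    """Keep correctness-preserving, high-gain revisions as (state, action, reward)."""
    tuples = []
    for rev in revisions:
        if not rev.passed_correctness:
            continue  # a fast but wrong kernel must never be rewarded
        if rev.speedup < min_speedup:
            continue  # drop regressions and within-noise "gains"
        reward = math.log(rev.speedup)  # assumption: log-speedup as the reward
        tuples.append((rev.kernel_before, rev.modification, reward))
    return tuples


revisions = [
    Revision("k0", "tile the K loop", True, 2.0),
    Revision("k1", "skip the bounds check", False, 3.0),  # fast but incorrect
    Revision("k2", "reorder two loads", True, 1.01),      # below threshold
]
corpus = to_training_tuples(revisions)
print(corpus)
```

Only the first revision survives. The fastest one in the list is exactly the one the filter has to throw away.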

The result: Kernel-Smith-235B-RL is not trained to write fast kernels. It's trained to make the next improvement to whatever kernel it receives. The optimization loop is: profile current kernel → identify bottleneck → propose targeted modification → apply it → measure → repeat. The model is the "propose targeted modification" step.
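The loop itself is simple enough to sketch. This is a toy under stated assumptions: `improve_loop` and its callback names are hypothetical, the "kernel" is a one-knob dict, the cost function stands in for a real hardware measurement, and bottleneck identification is folded into `propose`, which is where the trained model would sit.

```python
def improve_loop(kernel, profile, propose, apply_change, max_steps=10):
    """profile -> propose -> apply -> measure -> keep if faster, then repeat."""
    best_latency = profile(kernel)
    history = [best_latency]
    for _ in range(max_steps):
        change = propose(kernel, best_latency)  # the model's role in the loop
        if change is None:
            break
        candidate = apply_change(kernel, change)
        latency = profile(candidate)            # hardware feedback in a real system
        if latency < best_latency:              # accept only measured improvements
            kernel, best_latency = candidate, latency
            history.append(latency)
    return kernel, history


# Toy stand-ins: the optimum of this made-up cost function is block=128.
def toy_profile(kernel):
    return abs(kernel["block"] - 128) + 10


def toy_propose(kernel, latency):
    return {"block": kernel["block"] * 2}


def toy_apply(kernel, change):
    return {**kernel, **change}


final_kernel, history = improve_loop({"block": 16}, toy_profile, toy_propose, toy_apply)
print(final_kernel, history)  # {'block': 128} [122, 106, 74, 10]
```

The proposal to go from 128 to 256 is measured, found slower, and rejected. The measurement decides, not the proposer.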

This is the architectural decision that changes the performance profile. A one-shot generator has one chance to be right. Kernel-Smith gets to iterate. Each iteration, it has more profiling information -- real hardware feedback -- than the previous step. The search is guided by what the hardware actually measured, not what the model predicted.

State of the art on KernelBench with the NVIDIA Triton backend. Best average speedup ratio. Outperforms Gemini-3.0-pro and Claude-4.6-opus on kernel generation.

Those numbers are real but they're not the sentence that stopped me.

"Beyond benchmark results, the same workflow produces upstream contributions to production systems including SGLang and LMDeploy."

Kernel-Smith did not just score well on KernelBench. It wrote kernels that shipped to SGLang and LMDeploy. Code that is running right now in production inference deployments. The evolutionary optimization loop -- the same pipeline that produces benchmark numbers -- was also used to generate actual optimizations for real production kernels used by real serving deployments.

This closes a gap that has plagued every ML-based code generation system: the lab-to-production transfer problem. KernelBench is a controlled benchmark with clean kernel specifications, clear correctness criteria, and a standardized evaluation harness. Production kernels have messy dependencies, hardware-specific constraints, version-sensitive APIs, and performance requirements that interact with the full serving stack in ways the benchmark doesn't capture. Systems that do well on KernelBench often fail to transfer because the benchmark environment is cleaner than the real environment.

Kernel-Smith transferred. The same evolutionary loop worked on real SGLang kernels with real constraints. That is the sentence that matters.

The hardware portability result is the one nobody wrote about.

Kernel-Smith was also validated on the MetaX MACA backend. MetaX is a Chinese GPU company. MACA is their programming interface -- analogous to CUDA for NVIDIA or ROCm for AMD. The model they deployed: Kernel-Smith-MACA-30B, trained specifically for MACA kernel optimization.

Result: Kernel-Smith-MACA-30B outperforms DeepSeek-V3.2-think and Qwen3-235B-2507-think on MACA kernel generation. Much larger models, beaten on a non-NVIDIA backend.

The implication is structural. The evolutionary kernel optimization approach -- maintain a population of kernels, iterate with LLM-proposed modifications, filter by hardware execution speedup -- is hardware-agnostic. The model needs to know the programming interface (Triton, MACA, HIP, Metal), and the evaluation service needs to be backend-specific. But the optimization loop itself is the same. You build a backend-specific evaluation service, feed it the appropriate compiler and runtime, train a model on evolution trajectories from that backend, and you have a kernel optimizer for that hardware.
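Here is one way to picture that separation, as a sketch only. The `EvalService` protocol, the `evolve` function, and the toy backend are all hypothetical names of mine; in a real system `measure` would compile the kernel for the target hardware, check correctness, and time it. The toy backend treats the "kernel" as a single integer written as a string, purely so the example runs anywhere.

```python
from typing import Optional, Protocol


class EvalService(Protocol):
    """Backend-specific piece: build, verify, and time a kernel (hypothetical)."""

    def measure(self, kernel_source: str) -> Optional[float]:
        """Return latency in ms, or None if the kernel fails to build or is wrong."""
        ...


def evolve(population, propose, service, generations=5, keep=4):
    """Backend-agnostic loop: propose children, measure them, keep the fastest."""
    scored = [(service.measure(k), k) for k in population]
    scored = [(lat, k) for lat, k in scored if lat is not None]
    for _ in range(generations):
        for _, parent in list(scored):
            child = propose(parent)           # LLM-proposed modification in practice
            latency = service.measure(child)  # only the service knows the hardware
            if latency is not None:
                scored.append((latency, child))
        scored = sorted(set(scored))[:keep]   # selection by measured latency
    return scored


class ToyBackend:
    """Stand-in backend: values above 16 'fail', and 8 is the optimum."""

    def measure(self, kernel_source):
        n = int(kernel_source)
        return None if n > 16 else float(abs(n - 8) + 1)


survivors = evolve(["1", "3"], lambda k: str(int(k) * 2), ToyBackend())
print(survivors[0])  # (1.0, '8')
```

Swapping hardware means swapping the object passed as `service` and the model behind `propose`. The `evolve` function never changes.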

This is directly relevant to anyone thinking about hardware alternatives to NVIDIA in the current infrastructure market. The CUDA ecosystem's dominance comes partly from the decades of accumulated kernel optimization work -- cuBLAS, cuDNN, CUTLASS, FlashAttention -- that simply doesn't exist for alternatives at the same depth. If evolutionary LLM optimization can close that gap -- if you can spin up a Kernel-Smith instance for any new hardware backend in weeks rather than years -- the moat narrows.

The GPU Kernel Scientist paper (arXiv 2506.20807, this week) takes the same principle to AMD HIP specifically. Three LLM stages, using Gemini 2.5 Flash for rapid generation and Gemini 2.5 Pro for higher-quality improvements, orchestrate iterative AMD HIP kernel optimization. Starting point: a direct CUDA-to-HIP translation that ran 6x slower than PyTorch's baseline. The system autonomously applied loop transformations, memory access pattern reorganization, AMD-specific intrinsics, and fast math substitutions across multiple iterations until it reached competitive performance.

6x behind PyTorch to competitive in an automated loop on AMD hardware. Not from a specialized AMD kernel expert. From general-purpose LLMs with execution feedback as the guide.

This is what the entire kernel optimization space has been building toward: making the "execution feedback → next modification" loop tight, reliable, and hardware-aware enough that it can discover the optimizations humans would find, plus the interaction effects humans miss.

Kernel-Smith's step-centric RL is the training regime that makes the local improver better than the one-shot generator. The population-based evolutionary search is the exploration strategy that avoids local optima. The backend-specific evaluation service is what grounds every claim in real hardware numbers rather than model predictions.

And it shipped to production.

the one-shot generator learns patterns.

the local improver learns to navigate the optimization space one step at a time, guided by what the hardware actually measured.

these are not the same capability. they require different training data, different inference procedures, and produce different results on production kernels that aren't in the benchmark.

kernel-smith is the first system that trains specifically for the second capability.

it then contributed those optimizations to sglang and lmdeploy. the same loop that generated benchmark results generated production improvements. that is the bar nobody else has cleared.

P.S. The "correctness-preserving, high-gain revisions" filter is doing more work than it looks like. Long evolution trajectories contain a lot of noise -- modifications that improve performance on one input size but degrade it on another, modifications that only help because the preceding step created an artifact, modifications that improve the measured latency but increase variance in a way that hurts P99. Filtering to "correctness-preserving, high-gain" means the RL signal only reinforces modifications that unambiguously improve the kernel across validation inputs and produce consistent timing improvements. This is the eval-rigorous version of what makes RL training for kernel optimization trustworthy rather than reward-hacky. The filter is the quality control that prevents the model from learning to optimize the measurement rather than the kernel. It's described in two sentences in the paper. It's the engineering decision that makes everything else work.
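To make those noise categories concrete, here is a sketch of what a stricter acceptance check could look like. This is my own illustration, not the paper's filter: the function names, the 1.05 threshold, and the median-plus-P99 criterion are all assumptions.

```python
import math
import statistics


def p99(samples):
    """Nearest-rank 99th percentile of a list of latency samples."""
    ordered = sorted(samples)
    index = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return ordered[index]


def is_robust_gain(before, after, min_speedup=1.05):
    """before/after map input size -> latency samples (ms).

    Accept only if every input size gets faster in the median
    and no input size gets a worse tail latency.
    """
    for size, base in before.items():
        new = after[size]
        if statistics.median(base) / statistics.median(new) < min_speedup:
            return False  # helps one shape, not this one
        if p99(new) > p99(base):
            return False  # faster on average, worse at the tail
    return True


baseline = {1024: [2.0, 2.0, 2.1, 2.0, 2.2], 4096: [8.0, 8.1, 8.0, 8.2, 8.0]}
clean_win = {1024: [1.0, 1.0, 1.1, 1.0, 1.1], 4096: [4.0, 4.1, 4.0, 4.0, 4.2]}
one_size_only = {1024: [1.0, 1.0, 1.1, 1.0, 1.1], 4096: [9.0, 9.1, 9.0, 9.0, 9.2]}
noisy_tail = {1024: [1.0, 1.0, 1.0, 1.0, 5.0], 4096: [4.0, 4.1, 4.0, 4.0, 4.2]}

print(is_robust_gain(baseline, clean_win))      # True
print(is_robust_gain(baseline, one_size_only))  # False
print(is_robust_gain(baseline, noisy_tail))     # False
```

A single mean-latency number would have scored all three candidates as wins on the 1024 input.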