
NVIDIA built a Triton backend targeting their own hardware. That's not a concession. It's a tell.

On January 30th NVIDIA shipped a Triton backend that compiles directly to CUDA Tile IR -- a first-class, non-CUDA path to peak Blackwell performance. Every article framed it as developer outreach. It's defense. Triton compiles to AMD, Maia, and Intel too, and OpenAI signed for up to 6 GW of AMD GPUs betting on exactly that portability. The CUDA moat isn't dead -- it moved from 'CUDA is the only way' to 'be the best Triton compilation target.'

June 16, 2026

Let me explain what the CUDA Tile IR backend for Triton actually means, because every piece I've read about it led with the headline and missed the implication.

January 30th. NVIDIA released Triton-to-TileIR -- a new backend for OpenAI's Triton GPU programming language that compiles directly to CUDA Tile IR instead of PTX assembly. Available on GitHub under the triton-lang organization. Requires Blackwell GPUs.

The framing in every article: NVIDIA making their hardware more accessible to developers who don't know CUDA.

The actual implication: NVIDIA just endorsed Triton as the canonical path to peak performance on their newest architecture. And Triton compiles to AMD. To Maia. To Intel XPU. To anything with a Triton backend.


The CUDA moat has always been at a specific layer.

The GPU programming stack has three levels:

CUDA C++ and PTX at the bottom -- NVIDIA-specific, maximum control, requires deep hardware knowledge. This is where CUTLASS, cuBLAS, and handwritten attention kernels live. This layer is entirely NVIDIA proprietary.

Triton in the middle -- cross-hardware, Python-level, compiles to backend-specific code. OpenAI built it. It targets NVIDIA via PTX, AMD via ROCm, Intel via oneAPI, Maia via Microsoft's Triton backend. Write once, compile many.

torch.compile above that -- automatic, no kernel writing required, lowers to Triton by default. Most ML engineers live here and never write a kernel.

The CUDA moat has been at level one. Decades of accumulated optimization expertise, written in CUDA C++, targeting NVIDIA-specific hardware features. cuBLAS. cuDNN. FlashAttention 2 and 3. CUTLASS. The performance of every LLM serving framework in production depends on this accumulated expertise. AMD's gap has not been hardware specs -- it has been the kernel library ecosystem. ROCm hardware is competitive. ROCm kernel coverage is not.

What NVIDIA just did: made their newest architecture -- Blackwell, CUDA Tile IR -- first-class in Triton. Not in CUDA C++. In Triton.


Why this matters more than it looks.

Blackwell's performance ceiling requires programming at the tile level. PTX is no longer sufficient to reach peak utilization. The CUDA Tile IR abstraction -- the same abstraction that FA4 used in CuTe-DSL, the same abstraction that ThunderKittens targets -- expresses tile-level semantics that the Blackwell hardware was designed to execute. You cannot reach 71% hardware utilization on Blackwell (FA4's number) by compiling PTX. You can only reach it by expressing tile-level operations that map to UMMA, TMEM, and 2-CTA MMA instructions.

NVIDIA's new Triton backend preserves tile-level semantics throughout compilation. Instead of lowering to thread-level SIMT code (the way Triton previously worked), it preserves the tile structure all the way to CUDA Tile IR, which then maps to the Blackwell-specific instructions that deliver peak performance.
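
To make the thread-level versus tile-level distinction concrete, here is a toy sketch in plain Python. It is not Triton and it is not CUDA Tile IR; the function names and the tile size are made up for illustration. The first function computes one output element per "thread", which is how SIMT code is structured. The second treats a whole tile as the unit of work, with `tile_mma` standing in for the kind of single tile-level operation a tile-preserving compiler could hand to a matrix unit.

```python
def matmul_per_thread(a, b):
    """SIMT-style view: each (i, j) 'thread' owns one output element."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            c[i][j] = sum(a[i][p] * b[p][j] for p in range(k))
    return c


def tile_mma(c_tile, a_tile, b_tile):
    """Hypothetical stand-in for one tile-level multiply-accumulate."""
    inner = len(b_tile)
    for i, a_row in enumerate(a_tile):
        for j in range(len(b_tile[0])):
            c_tile[i][j] += sum(a_row[p] * b_tile[p][j] for p in range(inner))


def matmul_tiled(a, b, tile=2):
    """Tile-style view: the unit of work is a whole output tile."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            rows, cols = min(tile, m - i0), min(tile, n - j0)
            c_tile = [[0.0] * cols for _ in range(rows)]
            # Walk the shared dimension one tile at a time, accumulating.
            for k0 in range(0, k, tile):
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                tile_mma(c_tile, a_tile, b_tile)
            # Write the finished tile back into the output matrix.
            for di, row in enumerate(c_tile):
                c[i0 + di][j0:j0 + len(row)] = row
    return c
```

Both functions return the same matrix. The difference is what the compiler gets to see: in the first, the tile structure has already been dissolved into scalar work; in the second, it is still there to be mapped onto hardware.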

What this means in practice: a Triton kernel written for Blackwell, using the TileIR backend, can now reach the same performance class as a hand-tuned CuTe-DSL kernel. Without writing CUDA C++. Without knowing warp specialization. Without manually managing TMEM. The compiler handles it.

NVIDIA made this path available because the alternative -- keeping peak Blackwell performance locked in CUDA C++ -- creates a problem for NVIDIA, not for AMD. The developers who can write CuTe-DSL are a tiny fraction of the ML engineering population. Keeping peak performance locked behind that expertise means most developers are leaving significant performance on the table, which makes Blackwell look worse in benchmarks that matter to real users. Making peak performance accessible through Triton serves NVIDIA's commercial interests.

The side effect: Triton now has a first-class path to Blackwell's peak performance. Triton is hardware-agnostic. Every other Triton backend benefits when a performance-motivated backer like NVIDIA puts expertise and tooling into the Triton ecosystem.


The OpenAI AMD deal is the stakes.

October 2025. OpenAI signed a multi-year agreement with AMD for up to 6 gigawatts of Instinct GPUs. The first wave -- 1 gigawatt of MI450 series -- arrives H2 2026. OpenAI is actively hiring inference engineers focused specifically on AMD GPU enablement.

6 GW is enormous. To put it in context: Anthropic committed to 3.5 GW of TPU capacity over a multiyear deal and it was the largest headline in AI infrastructure this year. OpenAI is committing to 6 GW of AMD on a shorter timeline.

For this to make economic sense, OpenAI needs the inference performance on AMD MI450 to be close enough to NVIDIA that the cost advantage (whatever they negotiated for 6 GW) is worth the engineering investment of enabling AMD. If AMD inference is 40% slower than NVIDIA, 6 GW is a bad deal regardless of price. If AMD inference is 10% slower, the math probably works.
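
A back-of-envelope version of that math, as a sketch. The function name, the simple cost model, and the 0.75 cost ratio below are my assumptions for illustration, not anything from the deal. It ignores engineering cost, power, and utilization, and only asks: if the alternative is some fraction slower and costs some fraction of the baseline, what happens to cost per token?

```python
def cost_per_token_ratio(slowdown, cost_ratio):
    """Return (alternative cost per token) / (baseline cost per token).

    slowdown:   fraction by which the alternative is slower (0.10 = 10%).
    cost_ratio: alternative hardware cost divided by baseline cost for
                the same deployment (hypothetical input).
    A result below 1.0 means the alternative is cheaper per token.
    """
    if not 0.0 <= slowdown < 1.0:
        raise ValueError("slowdown must be in [0, 1)")
    if cost_ratio <= 0.0:
        raise ValueError("cost_ratio must be positive")
    relative_throughput = 1.0 - slowdown
    return cost_ratio / relative_throughput


if __name__ == "__main__":
    # 0.75 is a made-up cost ratio, used only to show the shape of the math.
    for slowdown in (0.10, 0.40):
        ratio = cost_per_token_ratio(slowdown, cost_ratio=0.75)
        print(f"{slowdown:.0%} slower -> cost per token ratio {ratio:.2f}")
```

With that made-up 25% discount, 10% slower comes out around 0.83 (cheaper per token) and 40% slower comes out around 1.25 (more expensive per token). Under this model the break-even discount equals the slowdown, before a single engineer is paid to enable the new hardware.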

Triton is the bridge that makes "10% slower" achievable. AMD ROCm 7 delivered 3.5x better inference than previous ROCm versions -- not from new hardware, from software improvements in the ROCm Triton backend. The gap between AMD and NVIDIA in production inference has been narrowing because Triton kernel coverage for AMD has been improving.

The specific technical bet OpenAI is making: by H2 2026 when MI450 arrives, the Triton ecosystem will have sufficient AMD backend quality that models trained on NVIDIA hardware can be served on AMD hardware with acceptable performance, using the same Triton kernels, with minimal AMD-specific engineering. The 6 GW bet is a bet on Triton portability.


The kernel optimization loop closes it.

Kernel-Smith (March 2026, the evolutionary RL kernel optimizer I wrote about recently) demonstrated that it could generate production kernels for MACA -- MetaX's Chinese GPU alternative to CUDA -- by training on MACA execution feedback. The same evolutionary optimization loop, different backend, near-equivalent results. A 30B model trained on MACA kernels outperformed DeepSeek-V3.2-think and Qwen3-235B on MACA kernel generation.

The specific implication: the kernel expertise that's been locked in NVIDIA-targeted CUDA code for 15 years is now reproducible for any hardware backend via evolutionary RL optimization. You point the optimization loop at your target hardware, run evolution for long enough, and converge on kernels approaching hardware ceiling -- not because your engineers know the hardware, but because the reward signal (measured throughput) teaches the model what the hardware rewards.
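
The shape of that loop fits in a few lines. This is a toy sketch, not Kernel-Smith's method: `measured_throughput` is a hypothetical stand-in for compiling and timing a kernel on real hardware, and the two knobs (`block`, `stages`) and their ranges are made up. The point is only that the search never needs to know why a configuration is fast. It needs a measurement.

```python
import random


def measured_throughput(cfg):
    """Hypothetical reward: pretend the hardware peaks at block=128, stages=4.

    A real loop would compile the candidate kernel and time it on the target.
    """
    block, stages = cfg
    return 1000.0 - abs(block - 128) * 2.0 - abs(stages - 4) * 50.0


def mutate(cfg, rng):
    """Nudge one knob to a neighboring value, clamped to a valid range."""
    block, stages = cfg
    if rng.random() < 0.5:
        block = block * 2 if rng.random() < 0.5 else block // 2
        block = max(16, min(512, block))
    else:
        stages = max(1, min(8, stages + rng.choice((-1, 1))))
    return (block, stages)


def evolve(seed_cfg, generations=30, population=8, rng_seed=0):
    """Keep the fastest candidates each generation; return the best found."""
    rng = random.Random(rng_seed)
    pool = [seed_cfg]
    for _ in range(generations):
        children = [mutate(rng.choice(pool), rng) for _ in range(population)]
        # Survivors are chosen by the measurement alone (parents included,
        # so the best candidate never gets worse between generations).
        ranked = sorted(set(pool + children), key=measured_throughput, reverse=True)
        pool = ranked[:population]
    return pool[0]


if __name__ == "__main__":
    start = (16, 1)
    best = evolve(start)
    print("start:", start, measured_throughput(start))
    print("best: ", best, measured_throughput(best))
```

Swap the stand-in reward for a benchmark on a different accelerator and the loop itself does not change. That is the portability argument in miniature.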

AMD MI450 will have an evolutionary kernel optimizer running against it before it ships at scale. OpenAI's AMD inference team is hiring for exactly this. The CUDA moat survives only until a Kernel-Smith-style loop has run on AMD hardware long enough to close the performance gap. The Tawa paper I wrote about in the warp specialization post did this for Triton autotuning. Kernel-Smith did it for full kernel generation.

The moat is eroding from two directions simultaneously. From above, Triton's first-class path to Blackwell removes the CUDA expertise requirement, and its portability to AMD removes the NVIDIA hardware requirement. From below, RL-based kernel optimization removes the need for accumulated human expertise specific to any one hardware target.


What NVIDIA is actually protecting.

NVIDIA built the CUDA Tile IR Triton backend because they understand the strategic position. The layer they need to win is not CUDA -- it's the Triton compiler backend. If NVIDIA's Triton backend generates code that hits 95%+ hardware utilization on Blackwell, and AMD's Triton backend generates code that hits 80%, the performance gap survives the portability transition. NVIDIA wins not by keeping developers in CUDA but by being the best Triton compilation target.

AMD's path: close the ROCm Triton backend quality gap. 3.5x improvement in ROCm 7 is the trajectory. The question is whether AMD's compiler team can close the remaining gap before the MI450 deployment scale makes it a commercial problem.

NVIDIA's path: keep investing in the Triton backend. Make Blackwell the reference platform that Triton is tuned against. Accept that hardware portability is happening and compete on the quality of the compilation rather than the exclusivity of the programming model.

The CUDA moat is not dead. It's transforming. It's moving from "CUDA is the only way to reach peak performance" to "NVIDIA's Triton backend produces better code than AMD's Triton backend on NVIDIA hardware." That's a narrower moat. It's also more contestable. It's the one NVIDIA chose to defend.


nvidia built a triton backend for blackwell.

not cuda. triton.

the moat they're defending is no longer the programming model -- it's being the best compilation target for the programming model they just endorsed.

that is a different competitive position than the one they had 18 months ago.

the 6 GW openai AMD deal is the pressure that forced this. when your largest customer is buying 6 GW of competing hardware specifically because triton makes it viable, you invest heavily in being the best triton target. that's what the tile ir backend is. it's not outreach. it's defense.


P.S. CuTe-DSL (the Python DSL over CUTLASS's CuTe abstractions that FA4 used) and CUDA Tile IR (what the Triton backend targets) are the same tile-level idea expressed at different levels of control. CuTe-DSL is for engineers who want maximum control and are willing to manage layouts, pipelining, and TMEM by hand. CUDA Tile IR from Triton is for engineers who want most of the performance and want the compiler to handle those details. Both target the same Blackwell hardware instructions: UMMA, TMEM, 2-CTA MMA. The convergence is intentional. NVIDIA is saying: express your computation at the tile level, in either front end, and we will compile it to peak Blackwell performance. The abstraction level is what matters, not which front end you chose to express it. This is genuinely new for NVIDIA. They have never previously endorsed a non-CUDA path to peak hardware performance on their own chips.
