# I went looking for what was below SASS. Found control codes. Went deeper. Found microcode. Then found the paper that explains why what I was seeing makes sense.

Date: 2026-06-25
Source: https://vanshverma.com/notes/five-layers-below-cuda
Tags: gpu, systems

Let me tell you what I actually found and what the research says about it, because the two connect in a way that I didn't expect when I started.

SASS is the layer I wrote about last time -- NVIDIA's native GPU assembly, below PTX, undocumented ISA, where CuAsmRL does RL optimization on instruction schedules. I assumed SASS was the floor. It's not. There's structure below it that most GPU engineers have never seen, that the compiler is actively using to guide hardware behavior, and that explains several things about kernel performance that don't make sense if you model SASS as the bottom layer.

---

**The control codes.**

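The exact layout depends on the generation. On Maxwell and Pascal, every SASS instruction is 64 bits, but SASS code isn't just a flat stream of 64-bit instructions. It's organized in 32-byte groups: one 64-bit control word followed by three 64-bit instructions. The control word carries 21 bits of meta-information for each instruction in the group -- 3 × 21 = 63 bits, with one bit left over. From Volta onward, instructions are 128 bits wide and the same control fields are embedded in each instruction instead of being packed into a separate word.

Those 21 bits per instruction break down as:

- **Stall count (4 bits)**: how many cycles the scheduler should stall before issuing the next instruction. Encoded statically by the compiler. If the instruction has a 4-cycle latency and the next instruction depends on it, the compiler sets stall count to 4. The hardware issues the next instruction after waiting that many cycles. No dynamic latency detection needed.
- **Yield bit (1 bit)**: whether to yield execution to another warp after this instruction. The compiler marks instructions where switching warps is cheap (typically after a memory instruction that won't return for a while).
- **Write barrier (3 bits)** and **read barrier (3 bits)**: each field names one of six scoreboard barriers for tracking outstanding variable-latency operations. A write barrier is armed by an instruction whose result arrives late (a load, for example) and clears when the destination register has been written. A read barrier is armed by an instruction that reads its source registers late (a store, for example) and clears once the operands have been read, so a later instruction can safely overwrite them.
- **Wait mask (6 bits)**: a bitmask over those six barriers, one bit per barrier, naming the barriers that must clear before this instruction can issue.
- **Reuse flags (4 bits)**: hints that a source operand should be kept in the operand reuse cache for the next instruction.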
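
The fields in the list above can be pulled apart with a few shifts and masks. This is a minimal sketch, not NVIDIA's documented encoding: the widths come from the list, but the bit positions (stall in the lowest bits, then yield, write barrier, read barrier, wait mask) are an assumption taken from how community Maxwell-era tooling reports them. The names `ControlCode`, `decode_control`, and `encode_control` are hypothetical, and any bits above these five fields are ignored.

```python
from dataclasses import dataclass

# (field name, width in bits), lowest bits first.
# ASSUMPTION: this ordering follows community reverse-engineering, not an
# official NVIDIA specification.
FIELDS = (
    ("stall", 4),
    ("yield_flag", 1),
    ("write_barrier", 3),
    ("read_barrier", 3),
    ("wait_mask", 6),
)


@dataclass(frozen=True)
class ControlCode:
    stall: int          # cycles to wait before the next issue (0..15)
    yield_flag: int     # raw yield bit
    write_barrier: int  # barrier index armed for a late result
    read_barrier: int   # barrier index armed for late operand reads
    wait_mask: int      # one bit per barrier to wait on


def decode_control(bits: int) -> ControlCode:
    """Split one instruction's control field into its sub-fields."""
    values = []
    for _name, width in FIELDS:
        values.append(bits & ((1 << width) - 1))
        bits >>= width
    return ControlCode(*values)


def encode_control(code: ControlCode) -> int:
    """Inverse of decode_control: pack the sub-fields back into an integer."""
    bits, shift = 0, 0
    for name, width in FIELDS:
        value = getattr(code, name)
        if value >> width:
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        bits |= value << shift
        shift += width
    return bits


# Hypothetical example: stall 4 cycles, yield, wait on barrier 0.
example = ControlCode(stall=4, yield_flag=1, write_barrier=0,
                      read_barrier=7, wait_mask=0b000001)
packed = encode_control(example)
```

The round trip `decode_control(encode_control(c)) == c` is the useful property here: it is what lets a tool disassemble a schedule, change one stall count, and reassemble it without disturbing the other fields.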

None of this is in NVIDIA's public documentation. It's in the NVIDIA patent filings, inferable from cuasm's source code (which had to reverse engineer the encoding to be able to assemble SASS back from disassembled code), and confirmed by the UPC microarchitecture paper (arXiv:2503.20481, March 2025).

The key insight: **the compiler is not just translating code to instructions. It's encoding a scheduling policy into the instruction stream.** The hardware scheduler reads the control codes and follows them. ptxas makes decisions about stall counts, yield bits, and barrier tokens at compile time based on its static model of instruction latencies. When ptxas's model is wrong -- when the actual hardware latency differs from ptxas's estimate, or when ptxas doesn't model the specific microarchitectural behavior of a new chip -- the stall counts it encodes are wrong. The hardware follows them anyway.

This is the specific mechanism by which CuAsmRL finds better SASS schedules. It's not changing the instructions. It's changing the control codes -- the stall counts, the yield bits, the barrier assignments -- so that they match what the hardware actually does more closely than ptxas's estimate did. The instructions are the same. The embedded scheduling policy is different.

---

**The fixed-latency / variable-latency split.**

The UPC paper reveals that modern NVIDIA GPU cores distinguish between two instruction categories at the hardware level: fixed-latency instructions and variable-latency instructions. The distinction determines which path through the pipeline the instruction takes.

Fixed-latency instructions: arithmetic, logical operations, register-to-register moves. The latency is deterministic and known at compile time. ptxas sets the stall count for the following instruction accordingly. The hardware doesn't need to do any dynamic latency tracking. These instructions go through the "FL path" -- a fast path through the pipeline with an L0 fixed-latency instruction cache that prefetches them efficiently.

Variable-latency instructions: loads from global memory (latency depends on cache hit/miss), shared memory accesses (latency depends on bank conflicts), instructions that stall on synchronization barriers. The latency is not predictable at compile time. These use the barrier mechanism: the compiler has the variable-latency instruction arm a barrier when it issues, and sets that barrier's bit in the wait mask of subsequent instructions that depend on its result. The hardware holds those instructions until the barrier clears.
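
The two ordering mechanisms can be put side by side in a toy issue-timing model. This is a sketch under loud assumptions: one warp, in-order issue, no warp switching, and made-up latencies (the 300-cycle load is illustrative, not a measurement). `Instr` and `issue_cycles` are hypothetical names, not part of any real tool.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instr:
    name: str
    stall: int = 1                      # fixed-latency path: cycles before next issue
    sets_barrier: Optional[int] = None  # variable-latency path: barrier armed at issue
    wait_mask: int = 0                  # bit i set => wait for barrier i to clear
    latency: int = 0                    # true cycles until the armed barrier clears


def issue_cycles(program: List[Instr]) -> List[Tuple[str, int]]:
    """Return (name, issue cycle) for each instruction of a single warp."""
    cycle = 0
    barrier_clears_at = {}  # barrier index -> cycle at which it clears
    issued = []
    for ins in program:
        # Variable-latency ordering: hold until every awaited barrier clears.
        for barrier, clears_at in barrier_clears_at.items():
            if (ins.wait_mask >> barrier) & 1:
                cycle = max(cycle, clears_at)
        issued.append((ins.name, cycle))
        if ins.sets_barrier is not None:
            barrier_clears_at[ins.sets_barrier] = cycle + ins.latency
        # Fixed-latency ordering: the compiler-chosen stall count, nothing else.
        cycle += ins.stall
    return issued


program = [
    Instr("FADD", stall=4),                         # consumer must wait 4 cycles
    Instr("FMUL", stall=1),                         # uses the FADD result
    Instr("LDG", sets_barrier=0, latency=300),      # latency unknown at compile time
    Instr("IADD", stall=1),                         # independent, issues right away
    Instr("FFMA", wait_mask=0b000001),              # needs the loaded value
]
timeline = issue_cycles(program)
```

In this model the `FMUL` is ordered purely by the `FADD`'s stall count, while the `FFMA` is ordered purely by barrier 0; the independent `IADD` slips in behind the load without waiting for it.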

The interesting consequence: **ptxas uses two completely different mechanisms for ordering fixed-latency and variable-latency instructions, and confusing the two is a source of performance bugs that are invisible at the PTX level.**

If ptxas incorrectly classifies an instruction as fixed-latency when it's actually variable-latency (or estimates the fixed latency incorrectly for a new architecture generation), it will encode stall counts instead of barrier tokens. The instruction will issue with incorrect timing relative to its dependencies. The output will be wrong or the performance will be degraded depending on whether the incorrectly ordered instruction happens to read from a register before the write completes.
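
A small checker makes the failure mode concrete: replay a schedule using its encoded stall counts, then compare against the latencies the hardware really has. This is a sketch with hypothetical names (`raw_hazards`) and hypothetical latencies; it assumes one warp, in-order issue, and that only stall counts order these instructions.

```python
from typing import Dict, List, Optional, Tuple

# (opcode, destination register, source registers, encoded stall count)
Scheduled = Tuple[str, Optional[str], Tuple[str, ...], int]


def raw_hazards(program: List[Scheduled],
                true_latency: Dict[str, int]) -> List[Tuple[str, str, str]]:
    """Return (consumer, producer, register) for every read that issues
    before the producer's result is actually available."""
    cycle = 0
    ready_at = {}  # register -> (cycle the value is ready, producing opcode)
    hazards = []
    for opcode, dest, sources, stall in program:
        for reg in sources:
            if reg in ready_at and cycle < ready_at[reg][0]:
                hazards.append((opcode, ready_at[reg][1], reg))
        if dest is not None:
            ready_at[dest] = (cycle + true_latency[opcode], opcode)
        cycle += stall  # the hardware follows the encoded stall regardless
    return hazards


schedule = [
    ("FADD", "R0", ("R1", "R2"), 4),  # compiler assumed a 4-cycle FADD
    ("FMUL", "R3", ("R0", "R4"), 1),
]

# Model matches hardware: no hazard.
ok = raw_hazards(schedule, {"FADD": 4, "FMUL": 4})
# Hypothetical chip where FADD really takes 6 cycles: FMUL reads R0 too early.
bad = raw_hazards(schedule, {"FADD": 6, "FMUL": 4})
```

The same schedule is clean under one latency table and hazardous under another, which is the sense in which the control codes are only as good as the compiler's model of the chip.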

This is architecture-version-specific. A schedule whose control codes are correct for Ampere could be incorrect on Hopper if the latency characteristics changed, because the timing is baked into the binary. That is one reason SASS binaries are not portable across architecture generations: the driver won't load SASS built for one major compute capability on a different one, and instead JIT-compiles the embedded PTX (if there is any) with a ptxas that models the new chip.

---

**The microcode layer.**

Below the control codes and below the instruction pipeline, there's a layer that handles specific instruction classes that aren't single-cycle operations: the Special Function Units. In SASS these show up as the MUFU family: MUFU.SIN, MUFU.COS, MUFU.EX2 (base-2 exponential, which I'll loosely call MUFU.EXP below), MUFU.LG2 (base-2 logarithm), MUFU.RSQ, MUFU.RCP (reciprocal), MUFU.SQRT. These are the transcendental math functions.

An SFU instruction is not a single operation from the hardware's perspective. It's a microcode sequence. The NVIDIA patent for micro-coded transcendental instruction execution (US9471305) describes this explicitly: the pipeline "switches from a first mode to a second mode" to execute SFU instructions. In second mode, the pipeline is "controlled by the instruction pipeline for iterative processing of the micro-code" -- the normal SIMT pipeline hands control over to the microcode sequencer, which executes the SFU's iterative algorithm (Newton-Raphson iterations for reciprocal, polynomial approximations for trig functions, etc.), then switches back to first mode when done.

The SFU is a fixed-point pipeline -- not floating-point -- that evaluates piecewise polynomial approximations. The algorithms are proprietary. The latency is multiple cycles -- higher than a floating-point multiply, much higher than what the stall count mechanism can express in 4 bits (which caps at 15 cycles). SFU instructions use the variable-latency barrier mechanism even though their latency is approximately predictable, because the microcode execution time can vary depending on the input value's domain.

**This is why software-emulated exp() outperforms MUFU.EXP on Blackwell.**

When FlashAttention-4 routes the softmax exponential through ALU polynomial approximation instead of MUFU.EXP, it's not just avoiding the SFU unit. It's avoiding the pipeline mode switch overhead, the microcode sequencer, and the fixed-point arithmetic path that the SFU uses. A 4th-order polynomial approximation on the regular FP32 ALU stays in "first mode" the entire time. It uses the fast path. It uses the FL cache. The stall counts are small and statically correct.
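
To make "polynomial approximation on the ALU" concrete, here is a minimal sketch of a base-2 exponential built from range reduction plus a 4th-order polynomial evaluated with Horner's rule (a chain of multiply-adds). The coefficients are plain Taylor coefficients, which is an assumption for illustration: FlashAttention-4's actual coefficients and reduction scheme are not given here, and a production kernel would likely use tuned minimax coefficients in FP32 rather than Python doubles. `exp2_poly` is a hypothetical name.

```python
import math

LN2 = math.log(2.0)

# Taylor coefficients of 2**f = exp(f * ln 2), degrees 0..4.
# ASSUMPTION: illustrative only; not the coefficients any real kernel uses.
COEFFS = [LN2 ** k / math.factorial(k) for k in range(5)]


def exp2_poly(x: float) -> float:
    """Approximate 2**x using only floor, multiply-add, and exponent scaling."""
    n = math.floor(x + 0.5)  # nearest integer part
    f = x - n                # remainder in [-0.5, 0.5)
    # Horner evaluation: four fused-multiply-add-shaped steps.
    p = COEFFS[4]
    for c in reversed(COEFFS[:4]):
        p = p * f + c
    # Scale by 2**n; on a GPU this is an integer add into the exponent bits.
    return math.ldexp(p, n)


def max_relative_error(samples):
    """Largest relative error of exp2_poly against 2.0**x over the samples."""
    return max(abs(exp2_poly(x) - 2.0 ** x) / 2.0 ** x for x in samples)


worst = max_relative_error([i / 64.0 for i in range(-640, 641)])
```

Every step here is an ordinary fixed-latency arithmetic instruction, which is the point of the paragraph above: nothing in the sequence needs the special function unit.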

MUFU.EXP triggers a mode switch, hands control to the microcode sequencer, runs the polynomial evaluation on the fixed-point pipeline, and then switches back to first mode. The round-trip is more expensive than ptxas's stall count estimate because the mode switch overhead isn't captured in the stall count model.

At Hopper hardware ratios, the SFU was not the bottleneck and MUFU.EXP was fast enough. At Blackwell, where tensor core throughput doubled while SFU count didn't change, the SFU became saturated. The mode switch overhead started to dominate. The software-emulated path on ALUs that were otherwise idle became faster not just because it used idle hardware but because it avoided a pipeline mode that has overhead the control code system doesn't fully model.
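
The saturation half of that argument is simple arithmetic and can be sketched as a throughput model. All numbers below are hypothetical placeholders, not measured Hopper or Blackwell figures, and the model deliberately ignores the mode-switch overhead discussed above: it only shows why moving some exponentials onto otherwise idle ALUs shortens the critical path once the SFU is the bottleneck. `exp_time` and `balanced_fraction` are hypothetical names.

```python
def exp_time(n_exp: int, sfu_rate: float, alu_rate: float,
             alu_fraction: float) -> float:
    """Cycles to finish n_exp exponentials when alu_fraction of them are
    emulated on the ALUs and the rest go to the SFU, both running concurrently.
    Rates are exponentials per cycle."""
    sfu_cycles = n_exp * (1.0 - alu_fraction) / sfu_rate
    alu_cycles = n_exp * alu_fraction / alu_rate
    return max(sfu_cycles, alu_cycles)


def balanced_fraction(sfu_rate: float, alu_rate: float) -> float:
    """Fraction to emulate so that both units finish at the same time."""
    return alu_rate / (sfu_rate + alu_rate)


# HYPOTHETICAL rates: the SFU does 16 exp/cycle; the emulated path manages
# 8 exp/cycle because each exp costs several multiply-adds.
SFU_RATE, ALU_RATE, N_EXP = 16.0, 8.0, 4096

sfu_only = exp_time(N_EXP, SFU_RATE, ALU_RATE, alu_fraction=0.0)
split = balanced_fraction(SFU_RATE, ALU_RATE)
balanced = exp_time(N_EXP, SFU_RATE, ALU_RATE, alu_fraction=split)
```

With these placeholder rates, a slower-per-operation software path still reduces total time because it runs in parallel with the SFU instead of queueing behind it.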

---

**The cross-vendor view: what's hardware-invariant at this level.**

The March 2026 paper (arXiv:2603.28793) that turned up in my research -- "Toward a Universal GPU ISA" -- did something that no prior work had done: it analyzed this microarchitectural layer across all four major GPU vendors (NVIDIA, AMD, Intel, Apple) and identified what's hardware-invariant.

Across 16 distinct microarchitectures, drawing on 5,000+ pages of ISA references, patent filings, and community reverse-engineering, the authors found ten hardware-invariant computational primitives. These are things that appear on every GPU regardless of vendor because they're required by the physics of parallel computation:

- Warp/wavefront execution units.
- Register file with banked access.
- Shared on-chip scratchpad memory.
- Thread synchronization barriers.
- Memory access coalescing.
- Fixed-latency and variable-latency instruction distinction.
- Special function units for transcendental math.
- Texture/sampler units.
- Atomic operations for shared state.
- Instruction-level control codes for scheduling hints.

That last one is the one I want to flag: **instruction-level scheduling hints encoded in the instruction stream itself are a hardware-invariant primitive across all four vendors.** AMD's RDNA has a similar system -- wave32 instructions have implied dependency tracking that functions like NVIDIA's barrier mechanism. Intel's Xe has SWSB (software scoreboarding) -- an explicit, documented version of exactly what NVIDIA embeds in control codes. Apple's M-series GPU (reverse-engineered since Apple doesn't publish the ISA) has an analogous encoding inferred from Metal shader compiler output.

The six true architectural divergences -- places where vendors made fundamentally different design choices -- include one directly relevant to the microcode question: NVIDIA uses a fixed-point SFU pipeline for transcendental functions, AMD uses a fully floating-point implementation, Intel uses a mixed approach. These are different answers to the same question (how do you implement sin() efficiently in hardware?) that result in different performance characteristics and different costs for software emulation.

AMD's floating-point transcendental implementation means software-emulated exp() on AMD may not show the same benefit over the hardware instruction as it does on NVIDIA. The mode-switch overhead that makes MUFU.EXP costly on Blackwell doesn't necessarily exist on RDNA's FP transcendental pipeline. The FA4 optimization that routes softmax through ALUs specifically addresses an NVIDIA microarchitectural property. Porting FA4's softmax emulation to AMD without understanding this distinction would apply an optimization designed for one vendor's microcode structure to hardware where it may not help.

---

I started this exploration because CuAsmRL's results implied that the scheduler was making decisions based on information encoded below the instruction level. I looked for where that information lived. Found the control code encoding. Looked for what the control codes were implementing. Found the fixed-latency / variable-latency path split. Looked for what used the variable-latency path despite having approximately-predictable latency. Found the microcode sequencer.

The UPC paper (March 2025) confirmed most of what I'd inferred from cuasm's source and the patent filings. The cross-vendor paper (March 2026) told me that the control code mechanism isn't NVIDIA-specific -- it's hardware-invariant across all four vendors because the scheduling problem it solves is physics, not design choice.

---

there are five layers below your CUDA C++.

PTX. SASS. control codes. fixed/variable latency paths. microcode.

most engineers know the first two. the research community recently published on the next two. the fifth one is in the patents and the compiler source and the performance anomalies you've been calling "weird hardware behavior."

it's not weird. it's a pipeline mode switch for transcendental functions that adds cycles the stall count model doesn't capture.

*the reason mufu.exp is slower than polynomial exp on blackwell is not that the sfu is slower. it's that the mode switch from first mode to second mode costs cycles that ptxas's stall count doesn't encode correctly for the new hardware ratios. the software path stays in first mode. it wins not by being faster per operation but by never paying the switching overhead.*

---

**P.S.** The cross-vendor paper's finding on Intel's SWSB (software scoreboarding) is worth a standalone post. Intel made the NVIDIA control code mechanism fully public and explicit -- SWSB is documented in the Xe HPG ISA reference. Every instruction has a SWSB field that works identically to NVIDIA's barrier tokens, but Intel tells you what it does and what the encoding means. NVIDIA's version is inferred from patents and disassembly. Intel's version is in the manual. If you want to understand what's in NVIDIA's control codes, read Intel's SWSB specification. The concepts are identical. The documentation is not.
