HBM is 5-10x more expensive than conventional DRAM…

this post is more technical than usual. if you've got a fried attention span you might wanna skip this one. if you stayed -- good. this is the paper nobody in the inference infrastructure community is talking about and it directly changes the cost floor for everything we're building.

I want to walk through a specific paper that was just published in IEEE Computer Architecture Letters, and then explain why the timing -- right after Fable 5 dropped with 1 million token context and 128k output tokens -- makes it more important than it was six months ago when it first appeared.

The argument in short: HBM is expensive partly because it's manufactured to tight reliability tolerances. Those tolerances are more stringent than inference workloads require. You can use cheaper HBM dies with higher raw bit error rates if you compensate with workload-aware error correction at the memory controller. At raw bit error rates up to 10^-3, the paper reports that you retain 78% of throughput and 97% of PIQA accuracy. The cost reduction from looser manufacturing tolerances is substantial.

That's the entire paper. Let me explain why it's technically non-trivial.

The reliability problem in HBM manufacturing.

HBM uses a 3D stack of DRAM dies connected by through-silicon vias (TSVs). The die stacking introduces defects. The tight interconnect densities amplify the yield problem. To ship parts that meet spec, manufacturers test every die, repair defects with redundant cells, and run each HBM module through extensive characterization. Parts that pass tight error rate requirements ship as HBM3E or HBM4. Parts that fail get discarded or reclassified.

The on-die ECC in current HBM is short-codeword -- typically 16B or 32B. Short codewords provide limited error correction strength. The main purpose is catching single-bit upsets during operation, not compensating for manufacturing defects. The manufacturing defects are handled upstream through binning and yield management.

The paper's premise: if you could accept higher raw bit error rates from the DRAM die -- letting more defective dies through manufacturing -- and compensate with stronger ECC at the memory controller level rather than on-die, you'd have higher yield per wafer, lower test overhead, and lower cost per usable gigabyte.

The question is whether stronger controller-side ECC can actually compensate. Short on-die ECC at 16B-32B codeword length has limited error correction capability -- it can correct a single bit error per codeword. Reed-Solomon ECC at 512B-2KB codeword length corrects many more errors per codeword. At a fixed parity overhead, as long as the parity budget comfortably exceeds the expected number of errors, the probability of an uncorrectable codeword falls roughly exponentially as the codeword gets longer.
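To put rough numbers on that, here's a back-of-envelope sketch of my own, not the paper's analysis. It assumes independent bit errors, a short code that corrects one bit in 32B plus 10 check bits, and a long RS-style code over 16-bit symbols with 1024 data symbols plus 128 parity symbols (so up to 64 correctable symbol errors). Every one of those parameters is a hypothetical choice, and the long code spends more parity than the short one, so this is not an equal-overhead comparison.

```python
import math

def binom_tail(n, p, t):
    """P(X > t) for X ~ Binomial(n, p), summed in log space to avoid overflow."""
    total = 0.0
    for k in range(t + 1, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                    + k * math.log(p) + (n - k) * math.log1p(-p))
        total += math.exp(log_term)
    return total

BER = 1e-3  # raw bit error rate discussed in the post

# Short codeword (assumed layout): 256 data bits + 10 check bits, corrects 1 bit error.
short_fail = binom_tail(n=256 + 10, p=BER, t=1)

# Long codeword (assumed layout): 2KB of data as 1024 16-bit symbols + 128 parity
# symbols, which lets an RS-style code correct up to 64 symbol errors.
symbol_bits = 16
symbol_error = 1 - (1 - BER) ** symbol_bits  # a symbol is bad if any of its bits is bad
long_fail = binom_tail(n=1024 + 128, p=symbol_error, t=64)

print(f"short codeword uncorrectable: {short_fail:.3e}")
print(f"long codeword uncorrectable:  {long_fail:.3e}")
```

Under these assumptions the short code fails on a few percent of codewords at 10^-3, while the long code's failure probability is many orders of magnitude smaller. The exact values depend entirely on the parity budget you pick.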

The tradeoff: large-codeword ECC introduces two problems. First, write amplification -- updating a 2KB ECC codeword when you write a 32-byte block requires reading and rewriting 2KB. Second, decoder complexity -- RS decoding at multi-terabyte-per-second HBM bandwidth requires significant silicon area and power at the memory controller.

Both problems have solutions specific to the AI inference access pattern.

Why inference access patterns make large-codeword ECC feasible.

Inference workloads access HBM in two modes: streaming large contiguous blocks (weight matrices, KV cache for sequential token generation) and small random accesses (scheduling metadata, index updates, cache management).

Large contiguous access is the dominant mode. Weight streaming during decode reads the same large matrices repeatedly in sequence. KV cache traffic is contiguous too: prefill writes long runs of context, and each decode step reads them back. The access granularity is naturally 512B-2KB aligned -- the same size as the large-codeword ECC. Write amplification doesn't apply when you're reading and writing at the codeword granularity anyway.

The small random accesses are the problem case. When a scheduler updates a 32-byte metadata block, naively you'd need to read 2KB, decode ECC, modify 32 bytes, re-encode 2KB, write 2KB. 64x write amplification.

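The paper's fix: differential parity updates. Reed-Solomon is a linear code, so the change in parity depends only on the change in data. For a small random write within a large codeword, you read the old bytes, XOR them with the new bytes to get a delta, and fold the parity contribution of that delta into the stored parity symbols rather than re-encoding the full codeword from scratch. The parity update cost is proportional to the modified region, not the full codeword. Write amplification collapses from 64x to near 1x for small random writes.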
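Here's a toy sketch of why that works. It is not the paper's controller design: I'm using Vandermonde-style parity over GF(2^8) as a stand-in for a real RS encoder, a 64-byte codeword instead of 2KB, and function names I made up. The only property it relies on is linearity, which real RS codes share.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11d)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return result

def gf_pow(a, n):
    """Raise a to the n-th power in GF(2^8) by repeated multiplication."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result

def encode_parity(data, num_parity):
    """Full encode: parity[j] = sum over i of data[i] * 2^(i*j), all in GF(2^8)."""
    parity = [0] * num_parity
    for i, byte in enumerate(data):
        for j in range(num_parity):
            parity[j] ^= gf_mul(byte, gf_pow(2, i * j))
    return parity

def update_parity(parity, offset, old_bytes, new_bytes):
    """Differential update: fold only the delta of the modified bytes into parity."""
    updated = list(parity)
    for k, (old, new) in enumerate(zip(old_bytes, new_bytes)):
        delta = old ^ new
        for j in range(len(updated)):
            updated[j] ^= gf_mul(delta, gf_pow(2, (offset + k) * j))
    return updated

data = bytes(range(64))
parity = encode_parity(data, num_parity=8)

# Overwrite 4 bytes at offset 10 and patch the parity without touching the rest.
modified = bytearray(data)
modified[10:14] = b"\xde\xad\xbe\xef"
patched = update_parity(parity, 10, data[10:14], modified[10:14])

print(patched == encode_parity(modified, num_parity=8))
```

Note what the update still costs: a read of the old bytes and a read-modify-write of the parity symbols. What it avoids is touching the other data bytes in the codeword.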

The bit criticality insight is the most technically interesting piece.

BF16 and FP8 floating point values are not uniform in their bit criticality. An exponent bit error changes the represented value by at least a factor of 2, and by far more if a high exponent bit flips -- potentially catastrophic for the output. A mantissa bit error is bounded: it changes the value by less than a factor of 2 in the worst case, and the lowest BF16 mantissa bit moves it by under 0.8% -- noise-like, typically absorbed in the model's statistical tolerance.
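You can see the asymmetry by flipping bits in a BF16 value directly. A minimal sketch in plain Python, treating BF16 as the top 16 bits of a float32 (truncation, no rounding); the helper names are mine:

```python
import struct

def to_bf16_bits(x):
    """Truncate a float to its BF16 bit pattern (top 16 bits of the float32 encoding)."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16

def from_bf16_bits(bits16):
    """Expand a 16-bit BF16 pattern back into a Python float."""
    (x,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return x

def flip_bf16_bit(x, bit):
    """Return the value obtained by flipping one bit of x's BF16 encoding."""
    return from_bf16_bits(to_bf16_bits(x) ^ (1 << bit))

# BF16 layout: bit 15 = sign, bits 14..7 = exponent, bits 6..0 = mantissa.
for bit in (0, 6, 7, 14, 15):
    print(f"flip bit {bit:2d} of 1.0 -> {flip_bf16_bit(1.0, bit)}")
```

Flipping the lowest mantissa bit of 1.0 nudges it to 1.0078125. Flipping the lowest exponent bit halves it, and flipping the highest exponent bit turns it into infinity.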

The paper organizes HBM storage bit-plane-wise. For m floating-point values stored together, the i-th bit plane contains all bits at position i across all m values. The exponent planes are critical. The mantissa planes are not.
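The bit-plane layout is just a transpose of the bit matrix. A small sketch, assuming 16-bit words and packing each plane into a Python integer; the function names and the packing order are my own choices, not the paper's physical layout:

```python
def to_bit_planes(words, width=16):
    """Plane i holds bit i of every word; word idx lands at bit idx of the plane."""
    planes = []
    for i in range(width):
        plane = 0
        for idx, word in enumerate(words):
            plane |= ((word >> i) & 1) << idx
        planes.append(plane)
    return planes

def from_bit_planes(planes, count):
    """Inverse transpose: rebuild `count` words from their bit planes."""
    words = [0] * count
    for i, plane in enumerate(planes):
        for idx in range(count):
            words[idx] |= ((plane >> idx) & 1) << i
    return words

# BF16 bit patterns for 1.0, -1.0 and 2.0.
words = [0x3F80, 0xBF80, 0x4000]
planes = to_bit_planes(words)

print(f"sign plane (bit 15):        {planes[15]:03b}")
print(f"top exponent plane (bit 14): {planes[14]:03b}")
print(from_bit_planes(planes, len(words)) == words)
```

Once the data is stored this way, each plane can get its own protection level, since all the bits of equal importance sit together.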

Importance-adaptive ECC: apply full Reed-Solomon protection only to the critical exponent planes. Apply lighter CRC detection or no error correction to the mantissa planes. If γ is the fraction of bits that sit in protected planes, the RS decoder only sees a fraction γ of the traffic, so the throughput it has to sustain drops by (1-γ). If only 30% of bits are in critical planes, your RS decoder needs to handle only 30% of the bandwidth it would otherwise require.

This makes large-codeword RS ECC at multi-terabyte-per-second HBM bandwidth viable from a silicon area perspective. You're not decoding 3.35 TB/s through a massive RS decoder. You're decoding 30% of 3.35 TB/s -- roughly 1 TB/s -- through a more modest RS decoder that nonetheless provides exponentially stronger correction than the 16B on-die ECC it replaces.
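The arithmetic behind that, as a quick sketch. The 3.35 TB/s and γ = 0.3 figures are the ones quoted above; the per-format ratios are my own illustration of what γ would be if you protected the sign plane plus every exponent plane, which may not be the split the paper uses.

```python
HBM_BANDWIDTH_TBPS = 3.35  # total HBM bandwidth figure quoted in the post

def protected_bandwidth_tbps(gamma, total_tbps=HBM_BANDWIDTH_TBPS):
    """Bandwidth the RS decoder must sustain when only a fraction gamma is protected."""
    return gamma * total_tbps

def plane_ratio(protected_planes, total_planes):
    """Fraction of stored bits that live in protected bit planes."""
    return protected_planes / total_planes

print(f"gamma = 0.3 -> {protected_bandwidth_tbps(0.3):.3f} TB/s through the RS decoder")

# Illustrative only: sign + all exponent planes protected.
print(f"BF16 (1 sign + 8 exponent of 16 bits): gamma = {plane_ratio(9, 16):.4f}")
print(f"FP8 E4M3 (1 sign + 4 exponent of 8 bits): gamma = {plane_ratio(5, 8):.4f}")
```

Protecting every exponent plane puts γ above 0.5 for both formats, so a γ near 0.3 presumably means full RS protection on only the highest planes. I haven't verified which planes the paper actually selects.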

The numbers at 10^-3 raw bit error rate.

10^-3 BER means 1 bit error per 1000 bits at rest. That is an extremely high error rate for DRAM. Current HBM operates at BERs many orders of magnitude lower. The paper's claim: at 10^-3, with domain-specific ECC, you retain 78% of throughput and 97% of PIQA accuracy, 94% of MMLU accuracy compared to error-free HBM.

The drop to 78% throughput at 10^-3 BER comes from error correction and detection overhead -- not from uncorrectable errors killing performance. The ECC processing adds latency to HBM accesses. At 10^-3 BER you're correcting a lot of errors and the correction time adds up. At lower BER the throughput penalty is smaller.

The accuracy numbers are the more important ones. 97% of PIQA, 94% of MMLU. The model is still working. The occasional undetected error that slips through ECC ends up in a mantissa bit and the model absorbs it. This matches the MTIA paper's observation that Meta ran inference without ECC because "inference results are inherently statistical" -- the tolerance is real, not theoretical.

The cost implication: HBM yield is highly sensitive to BER targets. Relaxing the BER target from current tight specifications to 10^-4 or 10^-3 increases yield per wafer substantially. The exact numbers depend on process node and vendor economics and aren't public. But the directional argument is strong: strict BER targets are a major driver of HBM cost, and loosening them while compensating at the controller level opens a cost path that doesn't exist in today's supply chain.

Why Fable 5 makes this paper more important than it was six months ago.

In February 2026, the inference infrastructure problem was primarily about efficiency -- how do you serve Llama-3 70B at reasonable cost with acceptable latency. The optimization was at the software layer: better schedulers, smarter KV allocation, tiered memory.

After today, the problem includes scale that didn't exist before. Fable 5 with 1M context and 128k output tokens running multi-hour asynchronous jobs requires HBM at a different order of magnitude. A single 21-minute Fable 5 decode job -- 128k tokens at 100 tokens/second -- holds a KV cache of potentially hundreds of gigabytes in HBM for the duration. At any reasonable concurrency, the HBM footprint per H100 is fully consumed by KV state. You're in tiering territory immediately.
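For a sense of scale, here's the KV arithmetic as a sketch. Fable 5's architecture isn't public, so I'm plugging in the Llama-3 70B shape (80 layers, 8 KV heads, head dimension 128, BF16) purely as a stand-in; the real numbers could be quite different.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """KV cache bytes per token: one K and one V vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

def decode_minutes(output_tokens, tokens_per_second):
    """Wall-clock minutes to generate output_tokens at a fixed decode rate."""
    return output_tokens / tokens_per_second / 60

# Stand-in model shape (assumption): Llama-3 70B with a BF16 KV cache.
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)
context_tokens = 1_000_000

print(f"{per_token / 1024:.0f} KiB of KV per token")
print(f"{per_token * context_tokens / 1e9:.1f} GB of KV for a {context_tokens:,}-token context")
print(f"{decode_minutes(128_000, 100):.1f} minutes to decode 128k tokens at 100 tok/s")
```

With that stand-in shape, a single full-length context already lands in the hundreds of gigabytes, several times the HBM on one H100.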

The HMA tiered memory I wrote about last week addresses the capacity constraint by offloading cold KV to DRAM. That helps. What it doesn't address is the per-gigabyte cost of HBM itself. If you're building the infrastructure that serves Fable 5 at Anthropic's scale -- $30B ARR, 3.5 gigawatts of TPU committed, serving hundreds of millions of tokens per second -- the HBM cost is a first-order budget item.

Domain-specific ECC that allows cheaper HBM dies attacks a cost driver that software optimization can't touch. It's a hardware supply chain argument dressed up as a systems paper. The paper is from academic researchers at RPI and IBM. The companies with the leverage to push this into HBM manufacturing are the hyperscalers and neocloud operators buying HBM at scale -- Anthropic, Meta, Microsoft, Google. The paper gives them a technical argument for a procurement conversation with HBM vendors that didn't have a technical foundation before.

most of the inference optimization work this year has been software.

better schedulers. better allocators. better kernels. better quantization.

this paper is about the manufacturing floor.

hbm reliability is a tunable parameter, not a fixed constraint. inference workloads tolerate bit errors in ways other workloads don't. the tolerance gap between "what hbm provides" and "what inference actually needs" is large enough to drive a meaningful cost reduction through manufacturing yield.

at fable 5 scale, that gap is a budget line item.

the bit plane organization detail is the implementation insight that makes this practical. you don't protect all bits equally. you protect exponent bits with rs correction and leave mantissa bits to crc or unprotected. decoder area drops by (1-γ). at γ=0.3, your decoder handles 30% of the bandwidth. viable at hbm speeds.

P.S. The write amplification solution -- differential parity updates -- is the engineering detail that makes this deployable rather than theoretical. Without it, every small random write to HBM requires re-encoding a 2KB codeword: 64x write amplification that would destroy scheduling performance. With differential parity, the parity update cost scales with the modified bytes, not the codeword, so the small random accesses that dominate scheduling overhead pay close to 1x. The large contiguous accesses that dominate inference bandwidth pay nothing extra because they're already at codeword granularity. Both access patterns are covered. The paper has a companion at arXiv:2512.18152 that goes deeper on the controller implementation. If this thread interests you, read that one next.