# Companies are paying for 20x more GPU capacity than their workloads use. The number is worse than last year. The year before that it was worse than the year before that.

Date: 2026-06-21
Source: https://vanshverma.com/notes/agentic-gpu-idle-scheduling
Tags: gpu, inference

Cast AI published this in April. Average GPU utilization across production clusters: 5%. Not 40%. Not 20%. Five. For every dollar of GPU compute that runs a workload, nineteen dollars sits idle waiting for something to happen.

And in January 2026, AWS raised H200 Capacity Block prices by 15%. The first GPU price increase since EC2 launched in 2006. Paying more for hardware that's idle 95% of the time.

I've been sitting with this because I think most people are looking at the wrong idle time. They're looking at "GPU is on but underutilized during inference." That's the MFU (model FLOPs utilization) problem. The utilization problem for agentic workloads is different. The GPU isn't underutilized during inference. It's idle between inferences. Not computing at low efficiency. Not computing at all.

Let me show you the specific mechanism.

---

**Why agentic workloads make GPUs idle in ways inference workloads don't.**

Standard inference: request arrives, prefill runs, decode runs, response emitted. The GPU is doing work the entire time. The only idle time is scheduling overhead between requests.

Agentic inference: model generates a tool call, GPU finishes, CPU dispatches the tool, tool executes, result arrives, CPU tokenizes and prepares the next input, GPU starts again. For a workflow with 10 tool calls per task -- which is typical for anything non-trivial -- you have 10 GPU-idle gaps per task. Each gap is the duration of: CPU parsing the output, dispatching the tool call, waiting for the external service, receiving the result, tokenizing it for the next input.

A database query takes 50ms. A web search takes 200ms. A code execution sandbox takes 2-5 seconds. The GPU is idle for all of it.

At 10 tool calls per agentic task, with average tool execution time of 500ms, you have 5 seconds of GPU idle time per task. If the GPU inference time per task is also 5 seconds, you're at 50% GPU utilization structurally -- not from poor kernel efficiency, not from scheduling overhead, but from the fundamental pattern of alternating between GPU compute and external tool execution.
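The arithmetic above fits in a few lines. A minimal sketch, assuming the GPU does nothing else during tool gaps (no other requests are batched into them); the function name and parameters are illustrative, not from any library:

```python
def structural_gpu_utilization(gpu_seconds: float,
                               tool_calls: int,
                               avg_tool_seconds: float) -> float:
    """Fraction of wall-clock time the GPU is busy for one agentic task.

    Assumes tool gaps are pure GPU idle time (nothing else is scheduled
    into them) and ignores CPU-side parsing and tokenization overhead.
    """
    idle_seconds = tool_calls * avg_tool_seconds
    return gpu_seconds / (gpu_seconds + idle_seconds)


# The example from the text: 5 s of inference, 10 tool calls at 500 ms each.
print(structural_gpu_utilization(5.0, 10, 0.5))  # 0.5
```

Swap in your own tool latencies: the same 5 seconds of inference with ten 2-second sandbox calls lands at 20%.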

The Semiwiki infrastructure report puts the CPU-to-GPU ratio at 1:1 to 1.4:1 (86-120 CPU cores per GPU) as the emerging recommendation for agentic workloads. That's 4-5x more CPU than standard inference deployments. The CPU isn't the bottleneck for inference efficiency. It's the bottleneck for agentic throughput because all the orchestration -- control flow, branching, retries, tool dispatch, result parsing -- runs on CPU, and when CPU is saturated the GPU waits.

---

**The hardware trend is making this worse, not better.**

From NVIDIA Ampere to Blackwell, GPU FLOPS increased by roughly 4-6x depending on precision. Storage network bandwidth (the SNICs connecting GPUs to KV cache storage) increased by much less. The I/O-to-compute ratio from Ampere to Blackwell decreased by 14.4x. Fourteen times. You have dramatically more compute per unit of I/O.

For pure compute workloads, this is fine. You're faster.

For agentic workloads that are I/O-bound -- waiting for KV cache retrieval from storage, waiting for tool call results, waiting for external services -- more GPU FLOPS doesn't help. The GPU finishes its work faster and then waits longer proportionally. The idle fraction goes up as the compute-to-I/O ratio grows.
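To see why faster arithmetic raises the idle fraction, hold the I/O wait fixed and shrink only the compute time. A rough sketch with made-up numbers (the 1-second figures and the speedup values are assumptions for illustration, not measurements from any GPU generation):

```python
def idle_fraction(compute_seconds: float,
                  io_wait_seconds: float,
                  compute_speedup: float) -> float:
    """Share of a step the GPU spends waiting when only compute gets faster.

    io_wait_seconds is held constant: storage and tool latency do not
    change when the GPU does.
    """
    faster_compute = compute_seconds / compute_speedup
    return io_wait_seconds / (io_wait_seconds + faster_compute)


# Hypothetical step: 1 s of compute and 1 s of I/O wait on the older GPU.
for speedup in (1, 2, 4):
    print(f"{speedup}x compute -> {idle_fraction(1.0, 1.0, speedup):.0%} idle")
```

With these inputs the idle share climbs from 50% at 1x to 80% at 4x, even though nothing about the I/O got worse.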

Blackwell is faster at the arithmetic. It is also, for I/O-bound agentic workloads, idle more of the time than Ampere was. Not because Blackwell is worse -- because the I/O hasn't kept up. The hardware improvements are outrunning the infrastructure architecture.

---

**DualPath's insight: the compute network has burst idle capacity that nobody is using.**

DualPath (arXiv:2602.21548) is the most technically specific answer to the storage bottleneck problem and I want to explain the mechanism precisely.

In PD-disaggregated serving -- prefill and decode on separate worker pools -- KV cache for long-context agentic sessions lives in remote storage (object store, distributed cache). When a session starts a new turn, the KV for its earlier context has to be loaded back from storage before the GPU can continue. In the standard design that retrieval goes through the storage network via the SNICs (storage network interface cards) attached to prefill engines. All KV loading pressure centralizes on the prefill-side SNICs. The decode-side SNICs sit idle.

Meanwhile: the compute network -- InfiniBand or NVLink, the high-bandwidth fabric used for collective operations (allreduce, allgather) during tensor parallel inference -- has a specific access pattern. During model forward passes, the collective operations burst at sub-millisecond intervals. Between those bursts (between decode steps, during tool call gaps, during scheduling overhead) the compute network is idle.

Two idle resources. Compute network has burst idle bandwidth. Decode-side SNICs have unused capacity. Storage loading is the bottleneck.

DualPath's solution: let decode engines participate in KV retrieval. Instead of all KV loading going exclusively through prefill-side SNICs, decode engines also pull KV from storage through their own idle SNICs and then forward it to the prefill engines over the compute network, using the gaps between collective operations. The storage I/O pressure distributes across both sets of SNICs instead of centralizing on one.

The storage network and compute network now load KV cache in parallel -- "dual path." The aggregate bandwidth available for KV retrieval increases substantially. GPU idle time waiting for KV loads decreases. The resource that was idle (compute network burst gaps) now does work.
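The bandwidth argument can be made concrete with a toy model. This is a sketch under stated assumptions, not DualPath's scheduler: it treats each path as a fixed-bandwidth pipe, splits the KV bytes so both paths finish at the same time, and uses hypothetical bandwidth figures.

```python
def split_kv_load(kv_bytes: float, storage_path_bw: float, second_path_bw: float):
    """Split a KV load across two paths in proportion to their bandwidth.

    Returns (bytes_on_storage_path, bytes_on_second_path, seconds).
    Proportional splitting makes both transfers finish together, which
    minimizes total load time for fixed-bandwidth pipes.
    """
    total_bw = storage_path_bw + second_path_bw
    storage_share = kv_bytes * storage_path_bw / total_bw
    second_share = kv_bytes - storage_share
    return storage_share, second_share, kv_bytes / total_bw


GB = 1e9
# Hypothetical numbers: 40 GB of KV, 25 GB/s on the prefill-side storage
# path, 15 GB/s of spare capacity on the second path.
single_path_seconds = 40 * GB / (25 * GB)
_, _, dual_path_seconds = split_kv_load(40 * GB, 25 * GB, 15 * GB)
print(single_path_seconds, dual_path_seconds)  # 1.6 1.0
```

A real system would not see constant spare bandwidth on the second path, since it is only free between collective bursts, so treat the second number as an upper bound on the benefit.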

---

**AgentRL: 93.2% GPU utilization on long-horizon agent workloads.**

AgentRL (arXiv:2602.06485) attacks the synchronization barrier problem specifically for RL post-training on agentic tasks -- the problem I wrote about in the Laminar post, applied to tool-intensive long-horizon agent training.

The comparison: veRL (ByteDance's RL framework, which achieves state-of-the-art on many standard RL benchmarks) achieves 45.2% average GPU compute utilization on long-horizon agent workloads. AgentRL achieves 93.2%.

Same hardware. Same model. Different system organization.

The mechanism: AgentRL uses a fully asynchronous pipeline with explicit resource scheduling and memory handover between the sampling phase (generating rollouts via tool-using agents) and the training phase (computing policy gradients and updating weights). In veRL, sampling and training are tightly synchronized -- the cluster waits for the sampling phase to complete before training begins, and vice versa. Each synchronization barrier means GPUs sit idle while the other phase catches up.

In AgentRL, sampling and training run concurrently. A pool of workers continuously generates agent rollouts -- sending requests, executing tool calls, receiving results, completing episodes. A separate pool continuously trains on completed rollouts as they arrive. The handover between the two pools is explicit and memory-managed -- completed rollouts move from sampling memory to training memory without blocking either pool.

The GPU idle time from synchronization barriers collapses. Training GPUs have rollouts to train on continuously. Sampling GPUs have trajectories to generate continuously. Neither pool waits for the other.
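The decoupled-pools idea can be sketched with a queue between a sampler and a trainer. This is a toy under loud assumptions: it is not AgentRL's code, the rollout and training work are simulated with `asyncio.sleep`, and names like `sampler`, `trainer`, and `run_async_pipeline` are made up for illustration.

```python
import asyncio


async def sampler(queue: asyncio.Queue, n_rollouts: int) -> None:
    """Generate rollouts one at a time and hand each over as soon as it is done."""
    for i in range(n_rollouts):
        await asyncio.sleep(0.01)  # stand-in for an episode with tool calls
        await queue.put(f"rollout-{i}")
    await queue.put(None)  # sentinel: no more rollouts


async def trainer(queue: asyncio.Queue) -> list:
    """Train on each rollout as it arrives instead of waiting for the full batch."""
    trained = []
    while True:
        rollout = await queue.get()
        if rollout is None:
            return trained
        await asyncio.sleep(0.01)  # stand-in for a gradient step
        trained.append(rollout)


async def run_async_pipeline(n_rollouts: int = 5) -> list:
    queue: asyncio.Queue = asyncio.Queue()
    # Both coroutines run concurrently; neither waits at a phase barrier.
    _, trained = await asyncio.gather(sampler(queue, n_rollouts), trainer(queue))
    return trained


if __name__ == "__main__":
    print(asyncio.run(run_async_pipeline()))
```

A barrier-style version would collect every rollout first and only then start training, so the two sleeps would add up instead of overlapping. The queue is what lets the trainer start on the first rollout while the second is still being sampled.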

45.2% → 93.2% from removing synchronization barriers. The compute was always available. The scheduling was the bottleneck.

---

**The pattern connecting DualPath and AgentRL.**

Both papers are solving the same category of problem: resources that are available but not being used because the system architecture doesn't schedule work into the gaps.

DualPath: compute network bandwidth is available during gaps between collective operations. Standard serving architectures don't schedule KV loading into those gaps. DualPath schedules it there.

AgentRL: training GPUs are available while sampling is running. Standard RL training architectures don't schedule training into that window. AgentRL schedules it there.

The Hummingbird paper I wrote about months ago did the same thing at the kernel level: GPU SMs are idle during the 24% bubble time in distributed inference. Hummingbird harvests those idle SMs for best-effort work.

The pattern: everywhere there is scheduled idle time in the system, there is a paper proposing to fill it. The gaps are real. The utilization is there if you architect the system to use it. The infrastructure that the industry is deploying right now doesn't use it.

5% average GPU utilization is not a hardware problem. It's a scheduling problem. The compute is there. The decisions about when to use it are wrong.

---

5% utilization. 20x more capacity than workloads use. GPU prices rising for the first time since 2006.

the compute network is idle between collective operations. the decode-side SNICs are idle while prefill-side SNICs saturate. training gpus are idle while sampling runs.

three idle resources in the same cluster. two papers, each filling a different gap.

the system architecture treats these gaps as unavoidable. they're not. they're scheduled. and what's scheduled can be rescheduled.

*agentrl achieves 93.2% gpu utilization on long-horizon tool-using agent workloads. the same workload achieves 45.2% on verl. same hardware. different scheduling. if your agentic training infrastructure is showing 40-50% gpu utilization, the gap to 90%+ is entirely in the system design.*

---

**P.S.** The CPU-to-GPU ratio recommendation (1:1 to 1.4:1, 86-120 cores per GPU for agentic workloads) is 4-5x higher than standard inference deployments. Most GPU servers ship at 0.25:1 or 0.5:1 -- 16-32 CPU cores per GPU. For pure inference, that's sufficient. For agentic inference, where CPU orchestration, tool dispatch, and result parsing run between every GPU step, the CPU becomes the throughput bottleneck before the GPU does. If you're deploying Fable 5 for multi-step agentic tasks and your GPU utilization is low, run `htop` next to `nvidia-smi dmon` while your workload runs. Look for CPU cores pinned at 100% coinciding with GPU utilization dropping to single digits. That's the pattern. It means you need more CPU, not more GPU. Buying more H100s to fix CPU starvation is the most expensive way to not solve the problem.
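
If you log CPU and GPU utilization side by side, the pattern can be flagged mechanically. A minimal sketch, assuming you already have paired samples from your own monitoring; the `Sample` type and both thresholds are illustrative assumptions, not established cutoffs:

```python
from typing import List, NamedTuple


class Sample(NamedTuple):
    max_cpu_core_pct: float  # busiest CPU core at this instant, 0-100
    gpu_util_pct: float      # GPU utilization at the same instant, 0-100


def cpu_starvation_fraction(samples: List[Sample],
                            cpu_saturated_pct: float = 95.0,
                            gpu_idle_pct: float = 10.0) -> float:
    """Fraction of samples where a CPU core is pegged while the GPU is near idle."""
    if not samples:
        return 0.0
    starved = sum(
        1 for s in samples
        if s.max_cpu_core_pct >= cpu_saturated_pct and s.gpu_util_pct < gpu_idle_pct
    )
    return starved / len(samples)


# Hypothetical trace: the GPU drops to single digits whenever a core hits 100%.
trace = [Sample(100, 4), Sample(100, 7), Sample(40, 92), Sample(100, 3)]
print(cpu_starvation_fraction(trace))  # 0.75
```

A high fraction here points at CPU starvation. A low fraction with low GPU utilization points somewhere else, such as slow external tools or storage.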
