
Prefill and decode run on the same GPU. They stress completely different hardware resources. Nobody ran them at the same time until six weeks ago.

Bullet partitions SMs spatially at the kernel level -- prefill on one slice of the chip, decode on the rest, simultaneously. 1.26x throughput gain, no new hardware. ASPLOS '26.

April 29, 2026

This one took me a while to see.

The standard story of prefill-decode disaggregation goes like this: prefill is compute-bound, decode is memory-bandwidth-bound, they want different hardware, so split them across GPUs. vLLM does it. SGLang does it. NVIDIA Dynamo was built around it. The whole industry has spent 18 months optimizing the inter-GPU version of this problem.

The paper I've been reading for the last three days asks a different question. If prefill saturates compute units and decode saturates memory bandwidth -- and those are genuinely different hardware resources on the same chip -- why are we running them sequentially at all?

Bullet. ASPLOS '26. March 22nd, Pittsburgh. One citation in the vLLM issue tracker. Otherwise: nothing.


Here's the problem it's solving, made physical.

A GPU has two scarce resources: SM compute throughput and HBM memory bandwidth. Prefill uses the first one -- it's processing your entire input in parallel, saturating the tensor cores, feeding the SMs as fast as it can. Decode uses the second one -- loading weight matrices one token at a time, mostly waiting for HBM to deliver bytes.
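To put rough numbers on that split, here is a back-of-the-envelope roofline sketch. It is not from the paper. The peak FLOPs and bandwidth figures are hypothetical round numbers, the 4096-wide layer is an arbitrary choice, and the function names are mine.

```python
# Roofline-style estimate for one weight matmul in a transformer layer.
# PEAK_FLOPS and PEAK_BW are illustrative assumptions, not measurements
# of any particular GPU.

PEAK_FLOPS = 1.0e15  # hypothetical: 1 PFLOP/s of dense fp16 compute
PEAK_BW = 3.0e12     # hypothetical: 3 TB/s of HBM bandwidth


def arithmetic_intensity(tokens: int, d_in: int, d_out: int,
                         bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for a (tokens x d_in) @ (d_in x d_out) matmul."""
    flops = 2 * tokens * d_in * d_out                   # multiply + add
    weight_bytes = d_in * d_out * bytes_per_elem        # weights read once
    act_bytes = tokens * (d_in + d_out) * bytes_per_elem  # inputs + outputs
    return flops / (weight_bytes + act_bytes)


def bound(tokens: int, d_in: int, d_out: int) -> str:
    """Classify the matmul as compute-bound or memory-bound."""
    ridge = PEAK_FLOPS / PEAK_BW  # FLOPs/byte needed to saturate compute
    ai = arithmetic_intensity(tokens, d_in, d_out)
    return "compute" if ai >= ridge else "memory"


if __name__ == "__main__":
    print("decode, 1 token:     ", bound(1, 4096, 4096))
    print("prefill, 2048 tokens:", bound(2048, 4096, 4096))
```

With one token in flight, every weight byte is used for about one FLOP, far below the ridge point. With a 2048-token prompt the same weights are reused across all tokens and the matmul lands well above it. Same layer, same chip, opposite bottleneck.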

When you run prefill and decode through the same scheduler on the same GPU (what every serving framework does by default), you are forcing two workloads with opposite resource profiles to time-share the chip. The scheduler picks one batch, runs it, picks the next. While prefill is running, HBM bandwidth sits mostly unused. While decode is running, SM compute sits mostly unused. You are getting one resource at a time when the hardware has two.

The number Bullet puts on this: chunked prefill -- the technique everyone uses to prevent prefill from blocking decode -- produces 5.2% lower SLO compliance than Bullet's approach, and leaves 20% of GPU compute idle during decode batches. You bought the hardware. You are using roughly 80% of it during the portion of inference that matters most for latency.


Bullet's mechanism is SM partitioning via libsmctrl.

libsmctrl is a CUDA SM masking library. It lets you specify which Streaming Multiprocessors on the GPU a given kernel is allowed to run on. Not scheduling hints -- actual hard partitions. SM 0-47 for prefill. SM 48-95 for decode. Both running simultaneously. Different kernels, different resource profiles, different SM allocations, one GPU.
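A minimal sketch of what a hard partition looks like as data, using the illustrative 0-47 / 48-95 split above. This does not call libsmctrl. The helper names are hypothetical, and the real library's mask encoding (granularity, and whether a set bit means enabled or disabled) may differ, so treat this as a picture of the partition rather than of the API.

```python
# One bit per SM; a set bit here means "this engine may run on that SM".
# That convention is an assumption for illustration only.

def sm_mask(first: int, last: int) -> int:
    """Bitmask with bits first..last (inclusive) set."""
    if first < 0 or last < first:
        raise ValueError("need 0 <= first <= last")
    width = last - first + 1
    return ((1 << width) - 1) << first


def sm_count(mask: int) -> int:
    """Number of SMs a mask covers."""
    return bin(mask).count("1")


def disjoint(a: int, b: int) -> bool:
    """True when two masks share no SM, i.e. a hard partition."""
    return (a & b) == 0


PREFILL_MASK = sm_mask(0, 47)   # SMs 0-47
DECODE_MASK = sm_mask(48, 95)   # SMs 48-95

if __name__ == "__main__":
    print("prefill SMs:", sm_count(PREFILL_MASK))
    print("decode SMs: ", sm_count(DECODE_MASK))
    print("disjoint:   ", disjoint(PREFILL_MASK, DECODE_MASK))
```

The property that matters is the disjointness check: no SM appears in both masks, so neither engine can land a kernel on the other's compute.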

The two engines -- prefill and decode -- run in separate processes under NVIDIA MPS (Multi-Process Service), which handles the GPU context multiplexing. They communicate through a shared CPU buffer and a unified GPU memory pool so KV cache doesn't have to move between engines. The scheduler monitors both continuously with microsecond-level overhead via non-intrusive model instrumentation, and dynamically rebalances the SM partition in real time based on what the current batch composition needs.
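The handoff described above can be sketched in a single process. This is a toy under loud assumptions: a plain Python queue stands in for the shared CPU buffer, a block allocator stands in for the unified GPU memory pool, and every class and function name is hypothetical rather than taken from BulletServe.

```python
from collections import deque


class KVPool:
    """Toy block allocator standing in for a GPU memory pool shared by both engines."""

    def __init__(self, num_blocks: int) -> None:
        self.free = deque(range(num_blocks))
        self.owner = {}  # block id -> request id

    def allocate(self, request_id: str, n: int) -> list:
        if n > len(self.free):
            raise MemoryError("KV pool exhausted")
        blocks = [self.free.popleft() for _ in range(n)]
        for b in blocks:
            self.owner[b] = request_id
        return blocks

    def release(self, blocks: list) -> None:
        for b in blocks:
            del self.owner[b]
            self.free.append(b)


def prefill_step(pool: KVPool, handoff: deque, request_id: str, n_blocks: int) -> list:
    """Prefill engine: fill KV blocks, then publish only their ids."""
    blocks = pool.allocate(request_id, n_blocks)
    handoff.append((request_id, blocks))  # metadata only, no KV bytes copied
    return blocks


def decode_admit(handoff: deque) -> tuple:
    """Decode engine: pick up a finished prefill and reuse its blocks in place."""
    return handoff.popleft()


if __name__ == "__main__":
    pool, handoff = KVPool(8), deque()
    prefill_step(pool, handoff, "req-1", 3)
    rid, blocks = decode_admit(handoff)
    print(rid, blocks, "free blocks left:", len(pool.free))
    pool.release(blocks)
```

What crosses between engines is a list of block ids. The KV data itself never moves, which is the contrast with the inter-GPU setup.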

This is the part that took the longest to build: a real-time performance model that knows, given the current mix of prefill and decode work, what SM split maximizes throughput while keeping both engines inside their SLO. Static splits don't work -- a 70/30 prefill/decode SM partition is wrong during a burst of short decode-only traffic and wrong in the opposite direction during a prefill-heavy admission surge. Bullet's control loop adjusts the partition at microsecond granularity without restarting either engine.
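To make the "static splits don't work" point concrete, here is a toy version of the decision the control loop has to make. The latency model and every constant in it are invented for illustration. The paper's actual performance model is not reproduced here, and the 96-SM total only matches the example split used earlier.

```python
# Toy SM-split chooser. K_PREFILL, K_DECODE and DECODE_FLOOR_MS are made-up
# constants; the real system fits its model to measurements.

TOTAL_SMS = 96
K_PREFILL = 1.0         # hypothetical: ms * SMs per prefill token
K_DECODE = 4.0          # hypothetical: ms * SMs per decoding sequence
DECODE_FLOOR_MS = 10.0  # hypothetical: bandwidth-bound floor per decode step


def prefill_ms(tokens: int, sms: int) -> float:
    """Prefill is modeled as compute-bound: time shrinks with more SMs."""
    return tokens * K_PREFILL / sms


def decode_ms(batch: int, sms: int) -> float:
    """Decode stops improving once it hits the memory-bandwidth floor."""
    return max(DECODE_FLOOR_MS, batch * K_DECODE / sms)


def choose_split(prefill_tokens: int, decode_batch: int,
                 ttft_slo_ms: float, tpot_slo_ms: float, step: int = 8):
    """Return (prefill_sms, decode_sms) with the best modeled throughput
    among splits that meet both SLOs, or None if no split does."""
    best, best_rate = None, -1.0
    for p in range(step, TOTAL_SMS, step):
        d = TOTAL_SMS - p
        pt, dt = prefill_ms(prefill_tokens, p), decode_ms(decode_batch, d)
        if pt > ttft_slo_ms or dt > tpot_slo_ms:
            continue  # this split violates an SLO
        rate = (prefill_tokens / pt if prefill_tokens else 0.0) + decode_batch / dt
        if rate > best_rate:
            best, best_rate = (p, d), rate
    return best


if __name__ == "__main__":
    print("prefill-heavy:", choose_split(4096, 8, 100.0, 20.0))
    print("decode-heavy: ", choose_split(512, 64, 100.0, 20.0))
```

Even in this crude model the best split moves when the batch mix changes, and a split that is fine for a small decode batch breaks the decode SLO for a large one. A fixed partition has to be wrong for one of them.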

1.26x average throughput gain over state-of-the-art. Up to 1.55x at peak. While consistently meeting latency constraints. On real-world workloads, not synthetic benchmarks.


The thing I keep coming back to: this makes intra-GPU disaggregation possible without buying a second GPU.

The inter-GPU disaggregation story -- split prefill and decode onto separate servers -- requires two machines, an interconnect, a KV transfer layer. It's correct at scale. It's expensive and operationally complex, and for a lot of serving deployments it's overkill.

Bullet runs both on the same GPU. No new hardware. No KV transfer across the network. No RDMA. Just a CUDA SM partition and two processes. The KV cache lives in the same GPU memory pool accessible to both engines.

"But won't SM contention--" it won't, that's the entire point of the partition. The engines don't share SMs. They do share memory bandwidth, which is fine: prefill is bottlenecked on compute, not memory, so decode's bandwidth-heavy access pattern has little to contend with. The resource profiles are complementary. They fit together.


The reason this wasn't done before is that there was no libsmctrl. Hard SM partitioning at the application level wasn't really accessible before NVIDIA exposed the APIs that libsmctrl wraps. The MPS multi-process approach also required careful engineering to avoid the context-switching overhead that historically made GPU multi-tenancy painful. Both of those constraints loosened in the Hopper generation. Bullet is the paper that used them.

The code is at github.com/zejia-lin/BulletServe. It was originally forked from SGLang. The authors note it's a research prototype without full feature parity. Integration into vLLM is open in issue #27093 -- the same issue where someone from the Bullet team posted the proof-of-concept. It will be in production frameworks within 12 months. That's how these things go.


Inter-GPU disaggregation split the problem across machines.

Bullet split it across SM allocations on the same machine.

The hardware had two resource profiles the whole time. They just ran sequentially because nobody partitioned them spatially until six weeks ago.

1.26x throughput on real workloads from a kernel-level scheduling change. no new hardware. the gains were already in the silicon. they were just waiting for someone to run both engines at once.


P.S. The complementary paper at ASPLOS '26 is "Towards High-Goodput LLM Serving with Prefill-decode Multiplexing" -- a different group, same conference, attacking the same problem from a slightly different angle. Where Bullet uses hard SM partitions, the multiplexing paper uses temporal interleaving with tighter latency modeling. Both landed at the same conference. Either the problem was riper than anyone realized, or ASPLOS reviewers saw something in this direction and took everything that addressed it. Either way: two independent solutions to the same root cause in one conference proceedings is a signal.
