Most AI infrastructure gets built backwards. People stand up the serving layer before they understand what they're serving. Spend a month on evals before they have a model worth evaluating. Buy GPUs before they know if they need to train at all.
The order that actually matters, with the version numbers and flag names you'll regret getting wrong. Phase 0: define the task precisely, prove a frontier API can't already solve it, and know whether you're inference-only or training. Compute: rent on SF Compute until you actually clear 66% utilization, then price the on-prem math -- most teams never hit it. Serving: vLLM 0.21+ with online 2-bit KV quantization, FA4, and DBO for MoE; SGLang for MoE production tooling; disaggregate prefill/decode from day one and use PPD routing for multi-turn. Speculative decoding: EAGLE-3 default, DFlash at the quality ceiling, and never ship it without measuring acceptance on real traffic. Training: veRL for async RL, Axolotl for SFT. Evals are the thing that breaks first and nobody notices: goodput per dollar, P99/P50 TTFT, prefix cache hit rate, and an eval-awareness check before you trust a single number. The agent harness is the product, not glue code.
June 26, 2026
Here is the order that actually matters, with the specific tech decisions you'll regret getting wrong.
I'm writing this with exact version numbers and flag names because "use a good LLM serving framework" is not advice. Advice is what to turn on, what to leave off, and what to measure to know if you got it right.
Phase 0: Before you write a line of infrastructure code.
Answer three questions. If you can't answer them, the infrastructure doesn't matter yet.
One: what is the task, exactly? "AI assistant" is not a task. "Extracts structured fields from legal contracts under 10 pages and returns them as JSON" is a task. The serving requirements, latency SLOs, context length requirements, and eval criteria fall directly out of the task definition. Vague task → vague infrastructure → expensive rebuilds.
Two: does a frontier API solve it today? Claude Fable 5 API at $10/M input tokens solves more problems than most teams realize. The correct time to build serving infrastructure is after you've validated the task is solvable and that API latency and cost are the actual constraints. Most teams build infrastructure for a problem they haven't proven is real. Use the API. Validate. Then build.
Three: what's your data situation? Inference-only stacks and fine-tuning stacks are different systems. If your answer to "what makes your output better than baseline?" is "our proprietary dataset," you need a training pipeline. If it's "our prompts and evals," you don't. This determines your entire infrastructure path.
If you passed those three: continue.
Compute: rent before you buy, buy before you build.
Start on SF Compute (sfcompute.com). H100 SXM5 nodes, InfiniBand fabric, single-tenant, CLI provisioning in under 5 minutes, sell-back mechanism for unused capacity. sf nodes create -n 8 --zone landsend -d 7d gets you 8 H100s for a week. The 2026 Q2 milestone they hit: InfiniBand on self-serve VMs. Use it.
The rule: if your GPU utilization is below 66% averaged over a billing period, you should be on SF Compute or FluidStack spot market, not on-prem. The 66% threshold is from Uptime Institute's analysis -- below it, on-prem CapEx is economically worse than neocloud even accounting for the premium per-GPU price. Most teams don't hit 66% on training runs because experiments fail, datasets don't load, checkpoints get corrupted. Be honest about your actual utilization before signing any dedicated hardware contract.
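The arithmetic behind a utilization threshold is worth writing down before you argue about it. A minimal sketch, assuming on-prem cost accrues for every wall-clock hour while rental is paid only for hours you actually use. The function names and the dollar figures are hypothetical, picked only so the break-even lands on 66%; plug in your own quotes.

```python
def breakeven_utilization(onprem_cost_per_gpu_hour: float,
                          rental_price_per_gpu_hour: float) -> float:
    """Utilization above which owning beats renting.

    Assumption: owned GPUs cost money every hour, busy or idle;
    rented GPUs cost money only for the hours you use them.
    """
    return onprem_cost_per_gpu_hour / rental_price_per_gpu_hour


def measured_utilization(busy_gpu_hours: float,
                         provisioned_gpu_hours: float) -> float:
    """Fraction of provisioned GPU-hours that did useful work."""
    return busy_gpu_hours / provisioned_gpu_hours


def should_rent(utilization: float, breakeven: float) -> bool:
    """Rent while measured utilization sits below the break-even point."""
    return utilization < breakeven


if __name__ == "__main__":
    # Hypothetical numbers: $1.65 all-in amortized on-prem, $2.50 rental.
    breakeven = breakeven_utilization(1.65, 2.50)
    # One billing period: 8 GPUs for 30 days, 2,900 busy GPU-hours.
    util = measured_utilization(2_900, 8 * 24 * 30)
    print(f"break-even {breakeven:.0%}, measured {util:.0%}, rent={should_rent(util, breakeven)}")
```

Count failed experiments, stalled data loaders, and reruns from corrupted checkpoints as idle hours when you fill in busy_gpu_hours. That is where the honest number comes from.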
FluidStack if you need bare-metal control from UEFI up -- their Atlas OS handles huge page configuration, NUMA pinning, PCIe ACS disable, GPUDirect RDMA at the kernel level. This matters when you're running multi-node training at scale and the difference between 3.2 Tb/s InfiniBand (FluidStack's fabric) and 800 Gbps Ethernet (most hyperscalers) shows up as 16.7 hours of wall-clock training time on a 128-GPU 70B run.
For inference deployments that will be long-lived: H100 SXM5 for general purpose, H200 SXM5 if you're serving 1M+ token contexts at scale (the extra memory matters more than the bandwidth delta). B200 NVL72 if you're running trillion-parameter MoE models at production scale (FA4 and DBO make the AFD architecture viable). Don't overbuy. The GPU forward curve from Silicon Data now gives you a reference price for forward commitments -- use it before signing long-term contracts.
Serving layer: vLLM v0.21+ with specific flags, SGLang for MoE.
vLLM 0.21+ is the baseline. Install it. Configure it. The flags that are off by default but should be on for production:
--enable-online-quantization -- TurboQuant 2-bit KV cache. 4x KV capacity. Accuracy within 0.5% on conversational tasks. If you're memory-capacity-bound (and you are, at any serious context length), this changes which hardware resource is the constraint.
For Blackwell (SM100+): FA4 is now the default MLA prefill backend. For MLA models (DeepSeek-V4 and family), this is production-ready. For GQA models (Llama-3, Qwen-3), verify FA4 performance against FA3 before committing. FA4's software-emulated softmax gains are larger on Blackwell; on Hopper (SM90), the delta is smaller.
--enable-dbo --all2all-backend deepep_low_latency -- Dual Batch Overlap for MoE models. Communication-compute overlap for the MoE dispatch/combine cycle. 25% decode latency reduction on DeepSeek-class models. Only enable with deepep_low_latency backend (NVLink-based intra-node). Check whether your EP group fits within a single node before enabling.
--dbo-decode-token-threshold N -- set N to your p10 batch size at decode time. Below N, DBO is disabled because microbatching would create empty second microbatches. Measure your traffic distribution before setting this. Default is conservative; tuning it captures DBO gains at lower batch sizes.
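Picking N from traffic is a one-liner once you have the samples. A minimal sketch, assuming you log the decode batch size at every scheduler step; the sample list and the helper name are hypothetical, and nearest-rank is just one reasonable percentile definition.

```python
import math


def nearest_rank_percentile(values, q):
    """Smallest sample with at least q percent of all samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical decode batch sizes, one entry per scheduler step.
    decode_batch_sizes = [4, 8, 8, 12, 16, 16, 24, 32, 32, 64]
    n = nearest_rank_percentile(decode_batch_sizes, 10)
    print(f"candidate threshold N = {n}")
```

Treat the result as a starting point. Re-measure decode latency with the threshold applied, because traffic shape shifts once the serving behavior changes.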
SGLang for: speculative decoding with Skip-Softmax (the verification pass is 4-8x cheaper in exp() operations with this enabled), DeepSeek-V4 with the GPU Staging Buffer (1000x RDMA request reduction for GQA KV transfer between prefill and decode workers), and Elastic EP (WideEP fault tolerance -- when a GPU fails in an EP group, the system redistributes expert weights and continues serving without full restart). The SGLang / vLLM choice is workload-specific. vLLM has broader model coverage and more mature HMA. SGLang has better MoE production tooling right now.
Disaggregation: run disaggregated prefill-decode from day one if you're serving any significant mix of long-context requests. Use PPD (Prefill-capable Decode) routing for multi-turn workloads -- the default PD architecture recomputes 81-99% of multi-turn prefill cost at the prefill node when the decode node already has it. PPD routes append-prefill to the decode node that already holds the KV state. 61% TTFT reduction on turn 2+.
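The routing decision itself is small. A toy sketch of the idea, not any framework's API: the class, method names, and worker labels are all hypothetical. It remembers which decode worker holds a session's KV state and sends follow-up turns there, falling back to the prefill pool for first turns or evicted sessions.

```python
from dataclasses import dataclass, field


@dataclass
class PPDRouter:
    """Toy router: tracks which decode worker owns each session's KV state."""
    prefill_workers: list
    kv_owner: dict = field(default_factory=dict)  # session_id -> decode worker
    _next_prefill: int = 0

    def route(self, session_id: str) -> tuple:
        owner = self.kv_owner.get(session_id)
        if owner is not None:
            # Turn 2+: only the new tokens need prefill, and the decode
            # worker already holds the rest of the context.
            return ("append_prefill", owner)
        # First turn (or KV evicted): full prefill, round-robin over the pool.
        worker = self.prefill_workers[self._next_prefill % len(self.prefill_workers)]
        self._next_prefill += 1
        return ("full_prefill", worker)

    def record_decode_placement(self, session_id: str, decode_worker: str) -> None:
        """Call when a session's KV lands on a decode worker."""
        self.kv_owner[session_id] = decode_worker

    def evict(self, session_id: str) -> None:
        """Call when the KV TTL expires or the worker drops the state."""
        self.kv_owner.pop(session_id, None)


if __name__ == "__main__":
    router = PPDRouter(prefill_workers=["prefill-0", "prefill-1"])
    print(router.route("s1"))                      # first turn
    router.record_decode_placement("s1", "decode-3")
    print(router.route("s1"))                      # second turn
```

A real router also has to weigh decode-node load against the recompute it saves. This sketch ignores that tradeoff entirely.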
KV cache TTL for agents: set your KV TTL to the p90 inter-turn latency from your actual user traffic. Not 30 seconds -- whatever your users actually wait between turns. For coding assistant workloads, this is 11-15 seconds in production. For voice agents, it's 2-3 seconds. Measure it. Set it. The compute you recover from eliminating multi-turn recomputation easily funds the memory cost of KV retention at any realistic concurrency above 40 concurrent sessions.
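Measuring it looks roughly like this. A minimal sketch, assuming your request logs give you a session id, a request arrival time, and a response completion time per turn; that tuple layout and the function names are assumptions, so adapt them to whatever your logs actually contain.

```python
import math
from collections import defaultdict


def inter_turn_gaps(turns):
    """Seconds between a response finishing and the same session's next request.

    turns: iterable of (session_id, request_ts, response_end_ts), in seconds.
    """
    by_session = defaultdict(list)
    for session_id, request_ts, response_end_ts in turns:
        by_session[session_id].append((request_ts, response_end_ts))
    gaps = []
    for session_turns in by_session.values():
        session_turns.sort()
        for (_, prev_end), (next_start, _) in zip(session_turns, session_turns[1:]):
            gaps.append(max(0.0, next_start - prev_end))
    return gaps


def kv_ttl_seconds(turns, q=90):
    """Nearest-rank q-th percentile of inter-turn gaps."""
    gaps = sorted(inter_turn_gaps(turns))
    if not gaps:
        raise ValueError("need at least one session with two or more turns")
    rank = max(1, math.ceil(q * len(gaps) / 100))
    return gaps[rank - 1]


if __name__ == "__main__":
    # Hypothetical log rows.
    turns = [("a", 0, 2), ("a", 10, 12), ("a", 25, 27), ("b", 0, 1), ("b", 4, 5)]
    print(f"KV TTL candidate: {kv_ttl_seconds(turns)}s")
```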
Speculative decoding: EAGLE-3 as the default, DFlash if you're pushing quality ceiling.
EAGLE-3 draft model with Skip-Softmax on the verification pass: 2-3x decode speedup, production-tested, integrated in both vLLM and SGLang. The draft model needs to be trained for your specific base model. Use the published EAGLE-3 draft models for Llama-3, Qwen-3, DeepSeek-V3 -- they exist and they work. The verification pass with Skip-Softmax enabled is meaningfully cheaper per accepted token on Blackwell because the correlated normalization denominators across candidate tokens allow 4-8x fewer exp() operations.
DFlash (block diffusion speculative decoding, arXiv:2602.06036) if you need maximum throughput at maintained quality on long outputs: 6x lossless acceleration over standard decoding, 2.5x over EAGLE-3. The SGLang nightly has it. Not production-stable yet in mainline but fast-moving.
Do not run speculative decoding without measuring your actual acceptance rate on your actual traffic. Draft model acceptance rates vary significantly by task type. Code completion: high acceptance, high gain. Open-ended generation: lower acceptance, smaller gain. Measure before committing the operational complexity.
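A sketch of what "measure" means here, assuming you can log, per verification step, the task type, how many tokens were drafted, and how many were accepted. The record layout is hypothetical. The expected-tokens formula is the standard chain-draft analysis with independent per-token acceptance, which is a simplification: tree-based drafters like EAGLE-3 will not follow it exactly, so use it for intuition and trust the measured speedup.

```python
from collections import defaultdict


def acceptance_by_task(records):
    """records: iterable of (task_type, drafted_tokens, accepted_tokens)."""
    drafted = defaultdict(int)
    accepted = defaultdict(int)
    for task_type, n_drafted, n_accepted in records:
        drafted[task_type] += n_drafted
        accepted[task_type] += n_accepted
    return {t: accepted[t] / drafted[t] for t in drafted if drafted[t] > 0}


def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass with k drafted tokens.

    Assumes a single draft chain and independent acceptance probability
    alpha per token: (1 - alpha^(k+1)) / (1 - alpha).
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


if __name__ == "__main__":
    # Hypothetical log rows.
    records = [("code", 4, 3), ("code", 4, 4), ("chat", 4, 1)]
    for task, alpha in acceptance_by_task(records).items():
        print(task, f"acceptance={alpha:.2f}",
              f"tokens/step~{expected_tokens_per_step(alpha, 4):.2f}")
```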
Training pipeline: veRL for RL post-training, Axolotl for supervised fine-tuning.
If you're doing supervised fine-tuning: Axolotl (github.com/axolotl-ai-cloud/axolotl). YAML config-driven, handles LoRA, QLoRA, full fine-tune, multiple base model families, integrates with vLLM for inference testing post-fine-tune. Simple. Get it working before building anything custom.
If you're doing RL post-training (GRPO, PPO, DPO variants): veRL (ByteDance). Fully async sampling-training pipeline (the same decoupled architecture as Laminar and AgentRL that I've written about), 2.35x-2.67x GPU utilization improvement over synchronous RL frameworks. Configure the async buffer size based on your rollout length distribution -- longer rollouts need larger buffers to maintain the async pipeline's saturation. The default settings assume short rollouts; if you're training agents with 100+ step episodes, tune the buffer.
The common mistake: using a synchronous RL framework (the old veRL, TRL's PPO implementation) for long-horizon agent training. The synchronization barrier between rollout generation and training is the bottleneck. At long horizon, the sampling phase takes 10x longer than the training phase, so the training GPUs sit idle waiting for rollouts. At 93.2% utilization (AgentRL) vs 45.2% (synchronous baseline), the gap is too large to ignore.
Training checkpointing: checkpoint to S3 or equivalent object store on every epoch, not every N steps. The cost of recomputing from a corrupt or missing step checkpoint vastly exceeds the storage cost of frequent checkpointing. Set your checkpoint write bandwidth as a metric and alert on degradation -- a slow checkpoint write is often the first signal of a failing NVMe in your training cluster.
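Tracking write bandwidth as a metric can be as simple as timing each write and comparing it to a rolling baseline. A minimal sketch: the local file write stands in for your real checkpoint path, and the class name, window size, and 50% degradation threshold are hypothetical choices, not recommendations.

```python
import os
import tempfile
import time
from collections import deque


def timed_write(path: str, payload: bytes) -> float:
    """Write payload, fsync it, and return the achieved bandwidth in MB/s."""
    start = time.perf_counter()
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    return len(payload) / 1e6 / max(elapsed, 1e-9)


class BandwidthMonitor:
    """Flags a write as degraded when it falls well below the rolling median."""

    def __init__(self, window: int = 20, degraded_fraction: float = 0.5):
        self.history = deque(maxlen=window)
        self.degraded_fraction = degraded_fraction

    def observe(self, mb_per_s: float) -> bool:
        degraded = False
        if len(self.history) >= 3:
            baseline = sorted(self.history)[len(self.history) // 2]
            degraded = mb_per_s < self.degraded_fraction * baseline
        if not degraded:
            # Keep degraded samples out of the baseline so it does not drift down.
            self.history.append(mb_per_s)
        return degraded


if __name__ == "__main__":
    monitor = BandwidthMonitor()
    with tempfile.TemporaryDirectory() as tmp:
        for step in range(5):
            bw = timed_write(os.path.join(tmp, f"ckpt-{step}.bin"), os.urandom(4_000_000))
            if monitor.observe(bw):
                print(f"step {step}: checkpoint write degraded ({bw:.0f} MB/s)")
```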
Evals: the thing that breaks first and nobody notices.
Before trusting any eval number you produce, run the Needham et al. evaluation awareness benchmark (it's public) on whatever frontier model you're evaluating. Get its AUC on the evaluation-vs-deployment classification task. If it's above 0.7, your eval suite has a recognizability problem that inflates performance estimates for compliant behavior. This takes two hours. Do it before spending a month building an eval pipeline.
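The AUC part needs no tooling. A minimal sketch, assuming you have collected, per transcript, a ground-truth label (1 = evaluation, 0 = deployment) and the model's stated probability that the transcript is an evaluation. That data layout is an assumption, and this does not reproduce the benchmark itself; it only scores the classification results you bring to it.

```python
def roc_auc(labels, scores):
    """AUC as P(random positive scores above random negative); ties count half.

    Pairwise (Mann-Whitney) form: O(n^2), fine for a few thousand transcripts.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical results: 1 = evaluation transcript, 0 = deployment transcript.
    labels = [1, 1, 0, 0]
    scores = [0.9, 0.4, 0.5, 0.1]
    auc = roc_auc(labels, scores)
    print(f"AUC={auc:.2f}", "recognizability problem" if auc > 0.7 else "ok")
```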
The three metrics that actually matter in production:
Goodput per dollar: requests completed within your latency SLO divided by total inference cost. Not throughput. Not GPU utilization. Not P50 TTFT. Goodput per dollar tells you whether your infrastructure is serving users or serving the dashboard. Everything else is a proxy that can diverge silently.
P99/P50 TTFT ratio: if this is above 5, you have a wave quantization problem, a head-of-line blocking problem, or a scheduler issue. None of them show up in P50. All of them show up in user experience. Compute your SM cliff (SM count / KV head groups per request) and check if your most common batch size is near it. If yes, that's your P99 spike.
Prefix cache hit rate: the efficiency signal for your HMA tier configuration. If your hit rate is below 20%, your DRAM tier is costing more in PCIe transfer overhead than it's returning in recompute savings. Reduce the DRAM tier allocation until your hit rate rises above 20%. The HMA's adaptive tier sizing isn't shipped yet -- you're doing this manually for now.
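All three metrics fall out of per-request logs. A minimal sketch under stated assumptions: the Request fields are hypothetical, the SLO is applied to end-to-end latency, and hit rate is counted per prompt token rather than per request. Change those definitions to match how your SLO and cache are actually specified.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    ttft_s: float              # time to first token
    total_latency_s: float     # end-to-end latency
    cached_prompt_tokens: int  # prompt tokens served from the prefix cache
    prompt_tokens: int         # total prompt tokens


def nearest_rank(values, q):
    """Nearest-rank q-th percentile."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def goodput_per_dollar(requests, slo_s: float, total_cost_usd: float) -> float:
    """Requests completed within the latency SLO, per dollar of inference cost."""
    within_slo = sum(1 for r in requests if r.total_latency_s <= slo_s)
    return within_slo / total_cost_usd


def ttft_tail_ratio(requests) -> float:
    """P99 TTFT divided by P50 TTFT."""
    ttfts = [r.ttft_s for r in requests]
    return nearest_rank(ttfts, 99) / nearest_rank(ttfts, 50)


def prefix_cache_hit_rate(requests) -> float:
    """Fraction of prompt tokens served from cache."""
    return (sum(r.cached_prompt_tokens for r in requests)
            / sum(r.prompt_tokens for r in requests))


if __name__ == "__main__":
    # Hypothetical request log.
    reqs = [Request(0.1, 1.0, 50, 100), Request(0.2, 2.0, 0, 100),
            Request(0.1, 5.0, 10, 100), Request(1.0, 1.5, 0, 100)]
    print(goodput_per_dollar(reqs, slo_s=2.0, total_cost_usd=0.5))
    print(ttft_tail_ratio(reqs), prefix_cache_hit_rate(reqs))
```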
Use LURE (Live-Usage Replay Evaluations) for production quality assessment: replay actual user conversations through your eval framework rather than synthetic benchmarks. The model doesn't recognize real user traffic as an eval context (AUC drops to near-chance for eval detection on real traffic). This is the most accurate quality signal available. The operational cost is curation, PII cleaning, and annotation infrastructure. Worth building once you have real traffic to replay.
Agent infrastructure: harness design is the product.
The agent harness is not glue code. It's the product. Most teams treat it as infrastructure and ship the model. The teams beating them are treating the harness as the primary engineering surface.
The three decisions that determine agent quality more than the model:
Non-blocking harness: the sampling phase and the evaluation/tool-execution phase run concurrently. The GPU is never waiting for a tool call to complete. This is the DualPath insight applied to agent architecture -- idle compute is schedulable. If your harness does result = await tool_call() and then feeds the result to the next LLM call, you're leaving GPU idle for the entire tool execution time. Buffer completed episodes and start the next LLM call as soon as the GPU is free.
Stopping decision infrastructure: the orchestrator should receive a lightweight signal from the serving layer -- current P50 prefill latency tier, cluster utilization tier, estimated queue depth -- updated every 30 seconds. Train your stopping policy on orchestration traces that include this signal. At peak load, the correct stopping decision is different from off-peak. The model doesn't know this. Give it the information.
CPU-to-GPU ratio: 1:1 to 1.4:1 for agentic workloads (86-120 CPU cores per GPU). Most GPU servers ship at 1:0.25 to 1:0.5. For pure inference, this is fine. For agentic inference where CPU orchestration, tool dispatch, and result parsing run between every GPU step, CPU becomes the throughput bottleneck. If you're adding GPUs to fix low throughput on agentic workloads, run htop first. If you see CPU cores pegged at 100% while GPU utilization drops to single digits, you need more CPU, not more GPU.
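The non-blocking harness is easiest to see in a toy. A minimal asyncio sketch with hypothetical stand-ins: fake_llm_step and fake_tool_call just sleep, and a semaphore plays the role of the single GPU. Awaiting a tool call inside one episode is unavoidable, since the next step depends on the result. The point is that the GPU is released during that wait, so other episodes' LLM steps run in the gap.

```python
import asyncio


async def fake_llm_step(gpu: asyncio.Semaphore, episode_id: int, step: int) -> str:
    """Stand-in for an LLM call; holds the GPU only while generating."""
    async with gpu:
        await asyncio.sleep(0.01)
        return f"ep{episode_id}-action{step}"


async def fake_tool_call(action: str) -> str:
    """Stand-in for tool execution; no GPU is held while this runs."""
    await asyncio.sleep(0.05)
    return f"result({action})"


async def run_episode(gpu: asyncio.Semaphore, episode_id: int, steps: int) -> list:
    transcript = []
    for step in range(steps):
        action = await fake_llm_step(gpu, episode_id, step)
        # The GPU is free here, so another episode's LLM step can take it.
        transcript.append(await fake_tool_call(action))
    return transcript


async def run_all(num_episodes: int = 8, steps: int = 3) -> list:
    gpu = asyncio.Semaphore(1)  # one GPU shared by every episode
    episodes = (run_episode(gpu, i, steps) for i in range(num_episodes))
    return await asyncio.gather(*episodes)


if __name__ == "__main__":
    for transcript in asyncio.run(run_all()):
        print(transcript)
```

Run the episodes one after another instead and the GPU idles for every tool call. With enough concurrent episodes, tool latency hides behind other episodes' generation.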
The design principles that don't change regardless of stack.
Move computation to write time, not read time. Eager warming (Ledge's approach to git serving) is the same principle as precomputing KV pack files. If you know what computation is coming, do it before it's needed and cache the result.
The constraint is never where the dashboard says it is. The P50 looks fine. The P99 doesn't. The GPU utilization looks high. The goodput is 40%. Measure the right thing or your optimizations are aimed at the wrong target.
Every parameter in your system has a tier: which memory, which hardware, which precision. The model weights: HBM, FP8. The hot KV cache: HBM, BF16. The warm KV cache: CPU DRAM, FP4 via TurboQuant. The cold KV cache: NVMe or distributed object store. The right tier for each parameter is determined by access frequency and latency tolerance. The wrong tier assignment is the most common source of avoidable cost in AI infrastructure.
The serving stack you ship on day 30 will be wrong. That's fine. What matters is that the architecture is structured so the wrong parts can be replaced without rebuilding the right parts. Disaggregate prefill and decode from the start -- not because you need it on day 30, but because retrofitting disaggregation into a monolithic serving stack takes longer than building disaggregated from the start.
vllm 0.21+ with the right flags.
verl for rl, axolotl for sft.
sf compute until you hit 66% utilization, then price the on-prem math.
goodput per dollar, p99/p50 ratio, prefix cache hit rate.
everything else is a detail.
the most expensive AI infrastructure mistake is building serving infrastructure for a task you haven't validated is worth building for. use the api. validate. then build. the frontier api is cheap relative to three months of infrastructure work for a problem that doesn't exist.
P.S. The data flywheel is the compounding asset that makes every other investment in this stack worth something. Production traffic → user feedback → eval labels → fine-tuning data → better model → more production traffic. Everything in this guide is about serving fast and cheaply enough that you can generate enough traffic to build that flywheel. The infrastructure that doesn't support this loop is infrastructure that won't compound. Build the feedback collection and the eval pipeline before the model gets good, not after. The model will get good eventually. The flywheel only starts spinning once you have the infrastructure to capture what it produces.