nobody trained an RL model for the stopping decision.

Nobody has trained an RL model for the stopping decision in multi-agent systems.

That sentence is from a paper published May 4th. I want to add something to it that I haven't seen written anywhere.

The paper (arXiv 2605.02801) surveyed every published RL training method for multi-agent LLM orchestration as of May 4, 2026. It breaks orchestration into five sub-decisions: when to spawn a sub-agent, whom to delegate to, how to communicate between agents, how to aggregate results, and when to stop. Four of those five have explicit RL training methods in the literature. The fifth -- when to stop -- has none.

Not "fewer methods than the others." None. Zero.

The stopping decision is the most important decision in a multi-agent system from a cost perspective. Every sub-agent spawn is a new inference request. Every communication between agents is tokens. Every aggregation step is prefill. A multi-agent workflow that decides to spawn three more sub-agents before answering is making a compute allocation decision. If that decision is wrong -- if the answer was already available before the third spawn -- the waste is not a quality issue. It's a bill.

Current multi-agent systems make the stopping decision using the same mechanism they use for everything else: the orchestrator model generates text. It decides whether to stop based on whether its current context, reasoning, and accumulated sub-agent results look "good enough." There's no explicit objective function for this. There's no reward signal. There's no trained policy. The stopping decision is made by a model that was never taught what "good enough" means in the context of stopping.

Here's the insight I want to add that I haven't seen in any of the papers.

The stopping decision is not just a task-quality problem. It is an infrastructure-state problem. And the infrastructure has no signal back to the orchestrator.

Consider what actually happens when an orchestrator decides to spawn another sub-agent at 3am versus at peak hours:

At 3am, your inference cluster might be at 30% utilization. The sub-agent spawn is scheduled immediately. The KV cache for the new context prefills in 40ms. The decode starts. The cost of the spawn is close to its raw compute cost -- the queueing penalty is negligible at low load.

At peak hours, the same cluster might be at 87% utilization. The spawn queues behind 40 other requests. The prefill takes 400ms instead of 40ms because the prefill pool is saturated. The total latency for the orchestrated task climbs to 8 seconds instead of 2. The orchestrator made exactly the same stopping decision -- spawn one more sub-agent -- but the consequences in wall-clock time and downstream SLO compliance are completely different.

The orchestrator doesn't know any of this. It receives no signal about queue depth, cluster utilization, current prefill latency, or cost per token at this moment. It's making a resource allocation decision -- spawn another inference request -- with zero visibility into the resource environment it's allocating into.

This is the infrastructure blindness problem in multi-agent orchestration. And it compounds directly with the unsolved stopping decision: not only is there no trained policy for when to stop, the stopping policy that does exist has no access to the one signal that would change its decision most dramatically -- how expensive is the next spawn right now.

The RL paper identifies five sub-decisions and notes that stopping has no training method. The deeper reason it has no training method: the reward signal for stopping is the hardest to define.

For spawning: reward = did the sub-agent's output improve the final answer. Measurable. Clear counterfactual.

For delegation: reward = did this agent perform better on this task than alternatives would have. Measurable. Requires routing experiments.

For aggregation: reward = does the combined output outperform individual outputs. Measurable. Straightforward comparison.

For stopping: reward = was this the right moment to return an answer rather than continuing. This requires knowing both what the current answer quality is AND what continued exploration would have produced. The counterfactual is: if I had stopped here, how much quality would I have lost? If I hadn't stopped, how much quality would I have gained and at what cost?

You can't evaluate this reward signal without running the system both ways -- stopping and not stopping -- and comparing outcomes. That requires a counterfactual evaluation infrastructure that doesn't exist for most agentic deployments. Production systems don't run the same task twice with different stopping decisions. They make one decision and move on.
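
To make the counterfactual concrete, here is a minimal sketch, assuming you could replay one task offline and record the answer quality after every sub-agent call. The quality scores, the per-call cost, and the name `stopping_regret` are all hypothetical illustrations, not anything defined in the paper.

```python
from typing import Sequence

def stopping_regret(quality: Sequence[float], cost_per_call: float, stop_at: int) -> float:
    """Regret of stopping after `stop_at` calls versus the best stop point in hindsight.

    quality[t] is the (hypothetical) answer quality after t sub-agent calls,
    so quality[0] is the answer produced with no sub-agents at all.
    """
    # Net value of stopping after t calls: quality reached minus compute spent.
    net = [q - cost_per_call * t for t, q in enumerate(quality)]
    return max(net) - net[stop_at]

# Illustrative replayed trace: quality plateaus after the second call.
trace = [0.50, 0.70, 0.85, 0.86, 0.86, 0.86]
COST_PER_CALL = 0.05  # assumed cost of one call, in the same units as quality

# Best stop point in hindsight: the call count with the highest net value.
best_stop = max(range(len(trace)), key=lambda t: trace[t] - COST_PER_CALL * t)
```

The catch is the first line of the setup: `trace` only exists if you ran past the point where production would have stopped. A live system that stops after call 2 never observes entries 3 through 5.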

The paper points at a closely related gap: "explicit counterfactual message-level credit remains especially sparse in our curated pool." The credit assignment problem for stopping is harder than for any other sub-decision because stopping terminates the episode. Once you stop, you can't observe what would have happened if you hadn't.

The practical consequence in production systems right now:

Multi-agent orchestration frameworks -- LangGraph, AutoGen, CrewAI, Claude's lead-agent model, all of them -- implement stopping via heuristics. Maximum iteration count. Token budget. Confidence threshold on the current answer. Human-specified task completion criteria. These work approximately for narrow, well-specified tasks. They fail systematically for open-ended tasks where the quality ceiling is unknown and "good enough" is context-dependent.
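
Stripped down, a heuristic stop rule of this kind looks roughly like the sketch below. This is not any particular framework's API; the function name, parameters, and default thresholds are hypothetical.

```python
def heuristic_should_stop(
    calls: int,
    tokens_used: int,
    confidence: float,
    max_calls: int = 5,           # hypothetical iteration cap
    token_budget: int = 50_000,   # hypothetical token budget
    min_confidence: float = 0.9,  # hypothetical confidence threshold
) -> bool:
    """Stop as soon as any hand-set limit is hit.

    Nothing here is learned, and nothing here looks at cluster load.
    """
    return (
        calls >= max_calls
        or tokens_used >= token_budget
        or confidence >= min_confidence
    )
```

Every input is either a counter or a self-reported score, and every threshold was chosen by a person ahead of time.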

"Maximum 5 sub-agent calls" is a hard stop. It is not a learned stopping policy. It will underperform on tasks that need 3 calls and waste compute on tasks that needed 2. The gap between a hard-stop heuristic and an optimal learned policy is not small for production workloads with high task heterogeneity.

The infrastructure piece makes this worse. A hard stop of "maximum 5 calls" doesn't account for the fact that at peak load, 5 calls might take 40 seconds and violate every SLO in the system. At off-peak, 5 calls might take 6 seconds and be fine. The optimal stopping decision should be jointly conditioned on task state AND infrastructure state. Neither the RL literature nor the production frameworks have this today.
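
The arithmetic behind that is short. A rough sketch, assuming the calls run sequentially and using the illustrative timings above (5 calls in 6 seconds off-peak, 5 calls in 40 seconds at peak); the 10-second SLO and the function name are hypothetical:

```python
import math

def max_calls_within_slo(seconds_per_call: float, slo_seconds: float) -> int:
    """Largest number of sequential sub-agent calls that still fits the latency SLO."""
    return math.floor(slo_seconds / seconds_per_call)

OFF_PEAK_SECONDS_PER_CALL = 6 / 5   # 5 calls in 6 seconds
PEAK_SECONDS_PER_CALL = 40 / 5      # 5 calls in 40 seconds
SLO_SECONDS = 10.0                  # hypothetical task-level SLO

off_peak_budget = max_calls_within_slo(OFF_PEAK_SECONDS_PER_CALL, SLO_SECONDS)  # 8 calls
peak_budget = max_calls_within_slo(PEAK_SECONDS_PER_CALL, SLO_SECONDS)          # 1 call
```

Under these assumptions the same SLO leaves room for 8 calls off-peak and 1 call at peak. A fixed cap of 5 is wrong in both directions.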

what I'd build, if I were building it:

A stopping policy trained on orchestration traces -- task state, accumulated evidence, current answer quality estimate -- plus a lightweight infrastructure signal: current p50 prefill latency, cluster utilization tier, estimated queue depth. Not full observability. One additional feature vector from the serving layer, updated every 30 seconds. The policy learns to stop earlier when infrastructure is under load and continue longer when it isn't.
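
A minimal sketch of the interface, with a hand-written rule standing in for the trained policy. Every name here (`InfraSignal`, `TaskState`, `should_stop`), the tier encoding, and the threshold scaling are assumptions for illustration; a real version would learn the mapping from traces instead of hard-coding it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InfraSignal:
    """Hypothetical snapshot published by the serving layer every ~30 seconds."""
    p50_prefill_ms: float
    utilization_tier: int  # 0 = low, 1 = medium, 2 = high (assumed encoding)
    queue_depth: int

@dataclass(frozen=True)
class TaskState:
    """Hypothetical orchestrator-side features."""
    answer_quality_estimate: float  # in [0, 1]
    expected_gain_next_call: float  # estimated quality gain from one more spawn
    calls_so_far: int

def should_stop(task: TaskState, infra: InfraSignal, base_threshold: float = 0.02) -> bool:
    """Stop when one more spawn is not worth it at the current load.

    Stand-in for a learned policy: the bar for continuing doubles with each
    utilization tier. Only the tier is used here; a trained policy would
    consume the whole feature vector.
    """
    threshold = base_threshold * (2 ** infra.utilization_tier)
    return task.expected_gain_next_call < threshold

# Same task state, two different infrastructure states.
task = TaskState(answer_quality_estimate=0.8, expected_gain_next_call=0.05, calls_so_far=3)
quiet = InfraSignal(p50_prefill_ms=40.0, utilization_tier=0, queue_depth=0)
busy = InfraSignal(p50_prefill_ms=400.0, utilization_tier=2, queue_depth=40)

stop_when_quiet = should_stop(task, quiet)  # False: threshold 0.02, gain 0.05
stop_when_busy = should_stop(task, busy)    # True: threshold 0.08, gain 0.05
```

The point of the sketch is the signature, not the rule: the stopping function takes a second argument that today's orchestrators never receive.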

The serving layer already collects this data for its own scheduling decisions. It just doesn't expose it to the orchestration layer. The integration is a few lines of code and a retrained policy. The savings at peak load -- fewer spawns that would have queued, fewer SLO violations from runaway agentic tasks -- are not small.

The RL literature hasn't built this because the RL literature treats stopping as a pure reasoning problem. The serving infrastructure literature hasn't built this because they don't see the orchestrator's decision process as their scope. The gap is between two communities that don't read each other's papers.

nobody trained an rl model for the stopping decision.

they also didn't give the stopping decision access to the one signal that would change it most at scale.

the serving layer knows how loaded it is. the orchestrator doesn't know it exists.

that's the gap. it's not a research problem. it's a missing interface between two systems that are running on the same cluster right now.

if you're running a multi-agent framework in production, measure the p99 task completion time segmented by cluster load tier. the gap between tiers will tell you how much infrastructure state is shaping the outcome of your orchestration decisions without the orchestrator knowing it. that number is the size of the problem.
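
One way to run that measurement, as a sketch: it assumes you can export one `(load_tier, task_seconds)` pair per completed task from your own logs. The record shape and the function name `p99_by_tier` are hypothetical; `statistics.quantiles` is from the Python standard library.

```python
from collections import defaultdict
from statistics import quantiles
from typing import Dict, Hashable, Iterable, List, Tuple

def p99_by_tier(records: Iterable[Tuple[Hashable, float]]) -> Dict[Hashable, float]:
    """Map each load tier to the p99 of its task completion times, in seconds."""
    by_tier: Dict[Hashable, List[float]] = defaultdict(list)
    for tier, seconds in records:
        by_tier[tier].append(seconds)
    # quantiles(n=100) returns 99 cut points; the last one is the 99th percentile.
    # Tiers with fewer than 2 samples are skipped because the percentile is undefined.
    return {
        tier: quantiles(values, n=100, method="inclusive")[-1]
        for tier, values in by_tier.items()
        if len(values) >= 2
    }
```

Feed it a week of task logs and compare the highest-load tier against the lowest. If the two p99s are close, infrastructure state is not your problem yet.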

P.S. The paper found no RL method for the stopping decision in the literature as of May 4, 2026. There are 27 days between then and today. If someone shipped one in the last four weeks, it's not in the paper and I haven't found it. The stopping decision is still the open problem. The infrastructure-awareness angle is the one I haven't seen addressed anywhere -- in RL papers, in serving infrastructure papers, or in production agentic framework documentation. If you've seen it, I want to read it.