Real-time interactive video generation has two completely separate scaling problems. Almost nobody is solving both.

Per-step latency and long-horizon memory are independent problems. Causal Forcing++ solves the first. TTT Memory solves the second. Neither cites the other. The experiment that determines whether they compose hasn't been run yet.

May 21, 2026

I want to be precise about what each problem is, because conflating the two is why most systems in this space hit a wall.

Problem one: per-step latency. How long does it take to generate one frame? In a diffusion model, that's the number of denoising steps times the cost per step. Standard video diffusion runs 20-50 steps. At real-time rates (25 FPS), you have 40 milliseconds per frame. Twenty steps in 40ms means 2ms per step, far below what a forward pass through a video-scale denoiser costs. You need 1 or 2 steps.
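The budget arithmetic above, as a quick sketch. The 15 ms per-step cost is a hypothetical placeholder, not a measured number from either paper; substitute your own profile.

```python
def frame_budget_ms(fps: float) -> float:
    """Wall-clock time available to produce one frame at the given frame rate."""
    return 1000.0 / fps


def max_denoising_steps(fps: float, step_cost_ms: float) -> int:
    """Largest whole number of sequential denoising steps that fits in one frame budget."""
    return int(frame_budget_ms(fps) // step_cost_ms)


# ASSUMPTION: 15 ms per denoising step is illustrative only.
STEP_COST_MS = 15.0

print(frame_budget_ms(25))                    # 40.0
print(max_denoising_steps(25, STEP_COST_MS))  # 2
print(20 * STEP_COST_MS)                      # 300.0 (ms for a 20-step sampler)
```

At that assumed step cost, a 20-step sampler overshoots the frame budget by 7.5x, which is why the step count has to come down rather than the hardware go up.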

Problem two: long-horizon memory. How much GPU memory does a 5-minute interactive session require? In a causal autoregressive world model, context accumulates as the session runs. Every frame the model generates gets appended to the KV cache. At 25 FPS over 5 minutes, that's 7,500 frames. At roughly 1,024 spatial tokens per frame, that's 7.68 million KV cache entries. An LLM running a 128K context window has 128,000 entries. The world model's cache holds 60x as many entries as a long-context LLM's, and it grows continuously with no natural stopping point.
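The same count as a sketch, plus a rough byte estimate. The entry count uses the numbers above; the layer count, KV width, and fp16 storage are hypothetical values chosen only to show how entries turn into bytes.

```python
def kv_cache_entries(fps: int, seconds: int, tokens_per_frame: int) -> int:
    """Number of cached token positions after `seconds` of generation."""
    return fps * seconds * tokens_per_frame


def kv_cache_bytes(entries: int, layers: int, kv_dim: int, bytes_per_value: int) -> int:
    """Each position stores one key and one value vector per layer."""
    return entries * layers * 2 * kv_dim * bytes_per_value


entries = kv_cache_entries(fps=25, seconds=5 * 60, tokens_per_frame=1024)
print(entries)             # 7680000
print(entries / 128_000)   # 60.0

# ASSUMPTION: 32 layers, 1024-wide keys/values, fp16. Illustrative only.
total = kv_cache_bytes(entries, layers=32, kv_dim=1024, bytes_per_value=2)
print(total / 2**30)       # 937.5 (GiB)
```

The exact byte figure depends entirely on the assumed model shape. The point is that it scales linearly with session length.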

These are independent problems. Solving per-step latency does not help with memory growth. Solving memory growth does not help with per-step latency. You need both, simultaneously, for a production interactive world model. The papers solving each one have almost no citation overlap.


Causal Forcing++ dropped May 14th -- a week ago, ICML 2026, Tsinghua -- and it's the sharpest attack on problem one I've seen.

To understand why it matters, you need to understand the specific failure mode it fixes. The standard approach for making a fast causal world model: take a strong bidirectional video model, distill it down to a causal AR student that runs in 2 steps instead of 20. The bidirectional teacher knows the whole sequence -- it conditions on past and future frames simultaneously. It generates clean samples with high quality. You use its outputs as supervision targets for the faster student.

The problem is architectural misalignment.

The bidirectional teacher generates each frame conditioned on future frames that the causal student will never see. The ODE trajectory -- the path from noise to clean frame that the teacher traces -- is fundamentally shaped by information that doesn't exist for a causal model. When you use that trajectory as a supervision target for the causal student, you're training the student to match a signal that was computed with access it can't have at inference time.

Previous methods -- CausVid, Self Forcing -- did this anyway. The results were acceptable for chunk-wise generation (process 4 seconds at once, output in a burst) but broke down badly under frame-wise generation (output one frame at a time, truly interactive). Dynamic degree -- how much the generated world actually moves and changes in response to actions -- collapsed. The models became static and unresponsive at the frame-wise latency regime that real-time interaction requires.

Causal Forcing (the original, February 2026) fixed the alignment problem by computing causal ODE trajectories -- teacher paths that condition only on past frames, architecturally matched to what the causal student sees. Better dynamics, better quality. The cost: precomputing full probability-flow ODE (PF-ODE) trajectories is expensive. Slow data curation. High training cost.

Causal Forcing++ eliminates the trajectory precomputation entirely. Instead of storing full ODE paths, it uses causal consistency distillation -- a single online teacher ODE step between adjacent timesteps provides the supervision signal, computed on the fly, no stored trajectories needed. Same causal alignment as the original. 4x lower Stage 2 training cost. 50% lower first-frame latency.
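To make "a single online teacher ODE step between adjacent timesteps" concrete, here is a minimal toy sketch of the generic consistency-distillation objective with a teacher that only sees past frames. It is not the paper's implementation: frames are scalars, the teacher is a stand-in function, the ODE is assumed to take the common form dx/dt = (x - D(x, t)) / t, and every name is hypothetical.

```python
def teacher_denoise(x_t, t, past_frames):
    """Hypothetical causal teacher: its clean-frame estimate uses past frames only.
    Toy stand-in: predict the most recent past frame."""
    return past_frames[-1]


def teacher_ode_step(x_hi, t_hi, t_lo, past_frames):
    """One Euler step of the teacher ODE from noise level t_hi down to t_lo,
    computed on the fly instead of read from a stored trajectory."""
    slope = (x_hi - teacher_denoise(x_hi, t_hi, past_frames)) / t_hi
    return x_hi + (t_lo - t_hi) * slope


def consistency_loss(student, target_student, x_hi, t_hi, t_lo, past_frames):
    """The student at t_hi should agree with the (frozen or EMA) target student
    evaluated at the adjacent point the teacher step lands on."""
    x_lo = teacher_ode_step(x_hi, t_hi, t_lo, past_frames)
    pred = student(x_hi, t_hi, past_frames)
    target = target_student(x_lo, t_lo, past_frames)
    return (pred - target) ** 2


past = [0.5]
identity = lambda x, t, past_frames: x              # untrained student
exact = lambda x, t, past_frames: past_frames[-1]   # maps every point to the ODE endpoint

print(teacher_ode_step(2.0, 1.0, 0.5, past))                   # 1.25
print(consistency_loss(identity, identity, 2.0, 1.0, 0.5, past))  # 0.5625
print(consistency_loss(exact, exact, 2.0, 1.0, 0.5, past))        # 0.0
```

The loss only ever needs one pair of adjacent points, which is why no full trajectory has to be precomputed or stored.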

The result: a frame-wise 2-step model that outperforms the best existing chunk-wise 4-step model on VBench Total, VBench Quality, and VisionReward. Finer response granularity, lower latency, better quality. The chunk-wise 4-step model was previously the practical ceiling. The frame-wise 2-step model just cleared it.


DexWorldModel's TTT Memory Module (April 13th, preprint) attacks problem two from a direction I hadn't seen before.

The standard approach to long-horizon KV cache growth: cap the context window and evict old frames. Drop frames older than N seconds. Keep a sliding window. The model loses information about events that happened more than N seconds ago.
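A sliding window of this kind is a few lines of code, which is part of why it is the default. A minimal sketch, with a hypothetical 10-second window and `None` standing in for the real key-value tensors:

```python
from collections import deque


class SlidingWindowCache:
    """Keeps only the most recent window of frames; older ones are evicted."""

    def __init__(self, window_seconds: int, fps: int):
        # deque with maxlen drops the oldest item automatically on append.
        self.frames = deque(maxlen=window_seconds * fps)

    def append(self, frame_index, kv):
        self.frames.append((frame_index, kv))

    def oldest_visible_frame(self):
        return self.frames[0][0] if self.frames else None


cache = SlidingWindowCache(window_seconds=10, fps=25)  # ASSUMPTION: 10 s window
for i in range(25 * 120):                              # two minutes of frames
    cache.append(i, kv=None)

print(len(cache.frames))             # 250
print(cache.oldest_visible_frame())  # 2750
```

After two minutes the model can attend to frames 2750 onward. Everything from the first 110 seconds is gone.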

For a world model running a continuous interactive session -- a robot performing a task, a user navigating a game environment -- losing context is not a minor quality degradation. It's the difference between a model that remembers where it placed an object two minutes ago and one that doesn't. Causal world models derive most of their value from long-range temporal consistency. Evicting the context that provides that consistency defeats the purpose.

TTT Memory replaces the KV cache entirely with a small neural network layer whose weights get updated with each new frame. Instead of appending each frame's key-value pairs to a growing sequence, you run a gradient-free update rule on the memory layer's weights that compresses the new observation into the existing weights. The "memory" at any point in time is the current state of that layer -- fixed size, regardless of how long the session has been running.

The mechanism is Test-Time Training applied to sequence memory. The memory layer is trained to support fast weight updates via a linear attention-style recurrent update rule, not full gradient descent. At inference time, each new frame triggers a weight update that takes roughly the same compute as a forward pass through the layer. The memory size stays constant. The session can run indefinitely.
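A minimal sketch of what a fixed-size, linear attention-style memory looks like. This uses the plain outer-product fast-weight rule as a stand-in; the actual update rule in DexWorldModel may differ, and the class and method names are hypothetical.

```python
class FastWeightMemory:
    """Fixed-size memory: a dim x dim matrix updated in place per observation."""

    def __init__(self, dim: int):
        self.dim = dim
        self.W = [[0.0] * dim for _ in range(dim)]

    def write(self, key, value):
        # Gradient-free recurrent update: W += outer(value, key).
        for i in range(self.dim):
            for j in range(self.dim):
                self.W[i][j] += value[i] * key[j]

    def read(self, query):
        # Retrieval is a single matrix-vector product: W @ query.
        return [sum(self.W[i][j] * query[j] for j in range(self.dim))
                for i in range(self.dim)]

    def num_parameters(self) -> int:
        return self.dim * self.dim


mem = FastWeightMemory(dim=3)
mem.write(key=[1.0, 0.0, 0.0], value=[0.2, 0.4, 0.6])
mem.write(key=[0.0, 1.0, 0.0], value=[1.0, 2.0, 3.0])

print(mem.read([1.0, 0.0, 0.0]))  # [0.2, 0.4, 0.6]
print(mem.num_parameters())       # 9, no matter how many writes happened
```

With orthogonal keys, retrieval is exact. With correlated keys, stored values interfere, so the compression is lossy. That is the price of constant size.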

On long-horizon manipulation tasks in DexWorldModel's evaluation: the TTT Memory Module eliminates the memory exhaustion that causes KV-cache-based models to fail or degrade after ~2 minutes of continuous operation. The model maintains task-relevant context from the start of the session. Performance on long-horizon tasks -- tasks requiring memory of actions taken more than 60 seconds ago -- improves substantially compared to sliding-window KV approaches.


The systems engineering point I want to make about both papers together:

If you are building real-time interactive world model inference, the serving stack has to solve both problems. Causal Forcing++ gets your per-frame latency to 40ms or below by distilling to 2 steps with correct causal alignment. TTT Memory or equivalent gets your memory footprint to constant size so a 5-minute session costs the same as a 5-second session. Neither alone is sufficient. A system with 2-step distillation but sliding-window KV eviction works in demos but fails on long tasks. A system with TTT memory but 20-step denoising can't hit real-time rates regardless of how much GPU you throw at it.

The interaction between these two is also nontrivial. TTT Memory requires the model to generate hidden states that carry the temporal information needed for the weight update rule. Those hidden states are produced by the denoising process. If you aggressively distill to 1-2 steps, you need to verify that the reduced denoising trajectory still produces hidden states with sufficient temporal information for the memory update. The original Causal Forcing paper doesn't address this -- it was designed without TTT Memory in mind. Causal Forcing++ doesn't address it either. This is an open problem that whoever is actually shipping production interactive world models is going to have to solve, probably through careful ablation of distillation depth against memory quality on long rollouts.

That experiment does not exist in any paper I've found. Whoever runs it first has the answer that determines whether 1-step or 2-step is the practical floor for a system that also uses TTT-style memory compression.
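The shape of that ablation is simple to write down, even though nobody has published the numbers. A skeleton, assuming you supply your own `evaluate` callable that runs a rollout and scores long-range recall; the stub below returns nothing and stands in for it.

```python
from itertools import product


def ablation_grid(step_counts, rollout_seconds, evaluate):
    """Run `evaluate(steps, seconds)` for every combination of distillation
    depth and rollout length, and collect the results by configuration."""
    results = {}
    for steps, seconds in product(step_counts, rollout_seconds):
        results[(steps, seconds)] = evaluate(steps, seconds)
    return results


def stub_evaluate(steps, seconds):
    # HYPOTHETICAL placeholder: replace with a real rollout and a
    # memory-quality metric of your choosing. Returns no measurement.
    return None


grid = ablation_grid(step_counts=[1, 2, 4],
                     rollout_seconds=[5, 60, 300],
                     evaluate=stub_evaluate)
print(len(grid))        # 9
print(sorted(grid)[0])  # (1, 5)
```

The interesting cells are the ones with few steps and long rollouts, since that is where a thin denoising trajectory and a compressed memory would have to work together.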


two papers. one a week old, one about five weeks old. solving the two completely separate problems that determine whether real-time interactive world model inference is actually possible at production scale.

neither cites the other.

the engineers building these systems right now are going to have to figure out how they interact -- whether causal consistency distillation and TTT memory compose cleanly, or whether they fight each other at the hidden state level.

that experiment hasn't been run yet.

if you're building in this space, run it. the result determines your architecture.


P.S. The chunk-wise vs frame-wise distinction is the one that maps most directly to system design decisions. Chunk-wise: buffer 4 seconds of frames, run denoising on the chunk, output the chunk. Lower per-token compute, higher first-chunk latency (users wait for the buffer to fill), coarser action responsiveness. Frame-wise: generate and output each frame individually, 1-2 denoising steps, immediate action response. Lower buffer latency, higher per-frame cost, tighter real-time constraints. The right choice depends entirely on your latency SLO and action granularity requirement. Causal Forcing++ enables frame-wise 2-step to match chunk-wise 4-step quality -- meaning the quality tradeoff that previously forced you into chunk-wise is now resolved. Frame-wise is the correct architecture for truly interactive systems. The quality excuse for chunk-wise just disappeared.
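The responsiveness gap in numbers, using the 4-second chunk and 25 FPS figures from this post. This only models how often a new action can influence the output; it ignores compute time entirely.

```python
def response_granularity_ms(frames_per_call: int, fps: int) -> float:
    """Time span covered by one generation call. An action that arrives
    mid-call cannot affect output until the next call begins."""
    return 1000.0 * frames_per_call / fps


FPS = 25
chunk_frames = 4 * FPS  # 4-second chunk

print(response_granularity_ms(chunk_frames, FPS))  # 4000.0
print(response_granularity_ms(1, FPS))             # 40.0
```

A 100x difference in action granularity is the design decision hiding behind the chunk-wise versus frame-wise label.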
