
Sora cannot be interactive. Neither can Veo. Neither can Kling or Runway.

Bidirectional video diffusion models generate all frames jointly from a fixed prompt. That's why they're coherent. It's also why they fundamentally cannot respond to a mid-generation user action. Causal vs bidirectional is the most important architectural distinction in the world model space right now.

May 23, 2026

Not because they're slow. Because of their architecture. Bidirectional video diffusion models generate past, present, and future frames jointly from a prompt fixed in advance. Every frame attends to every other frame throughout denoising, so no frame is finished until the whole clip is. That structure is why they produce such coherent video.

It's also why they fundamentally cannot respond to a user action that happens mid-generation. The future frames would need to condition on inputs the user hasn't taken yet.

A world model -- a model that simulates an evolving environment and responds to actions in real time -- has to be causal. Each frame predicted from prior frames and the current action. Nothing else. That architectural constraint is not a minor implementation detail. It determines the entire serving stack, the latency target, the memory management, the distillation strategy, and who can actually build this.
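A minimal sketch of the distinction as attention masks, assuming frame-level attention and nothing else. The function names are mine for illustration, not taken from any model discussed here.

```python
def bidirectional_mask(num_frames):
    """Every frame may attend to every other frame, including future ones."""
    return [[True] * num_frames for _ in range(num_frames)]


def causal_mask(num_frames):
    """Frame t may attend only to frames 0..t."""
    return [[s <= t for s in range(num_frames)] for t in range(num_frames)]


def latest_dependency(mask, t):
    """Index of the latest frame that frame t depends on."""
    return max(s for s, allowed in enumerate(mask[t]) if allowed)


if __name__ == "__main__":
    T = 4
    # Bidirectional: frame 0 depends on the last frame of the clip.
    print(latest_dependency(bidirectional_mask(T), 0))
    # Causal: frame 0 depends only on itself, so it can be emitted right away.
    print(latest_dependency(causal_mask(T), 0))
```

Under the bidirectional mask, frame 0 cannot be finalized until frame 3 exists, and frame 3 would need an action the user has not taken yet. Under the causal mask, each frame can ship as soon as it is computed.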

This is the framing I want to use for the startups I've been watching closely.


Odyssey is the clearest example of a team that internalized this constraint before writing a line of model code.

Odyssey-2 Max (April 21st, one month ago) uses what they call an AR DiT -- autoregressive diffusion transformer. The model generates video chunk by chunk, conditioning only on past frames and the current action. Each frame arrives in ~40ms. 25 frames per second. Real-time.

The detail that tells you the team knows what they're doing: they built roofline estimates from day one. Before finalizing the architecture, before training, they modeled the compute requirements against the target inference hardware and made sure the model as designed could hit the latency target on that hardware. Most ML teams do this after training, when it's too late. Odyssey did it before.
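For a sense of what a pre-training roofline check looks like, here is a rough sketch. Every number in it is a hypothetical I picked for illustration (model size, tokens per frame, GPU peaks), not Odyssey's figures. It uses the common 2-FLOPs-per-parameter-per-token approximation and ignores attention FLOPs and activation traffic.

```python
def roofline_latency_ms(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Lower bound on latency: the slower of compute time and memory time."""
    compute_s = flops / peak_flops_per_s
    memory_s = bytes_moved / peak_bytes_per_s
    return 1000.0 * max(compute_s, memory_s)


def frame_latency_ms(params, tokens_per_frame, denoise_steps, hw):
    """Per-frame lower bound: one forward pass per denoising step."""
    flops = 2 * params * tokens_per_frame          # ~2 FLOPs per param per token
    bytes_moved = params * hw["bytes_per_param"]   # weights read once per step
    per_step = roofline_latency_ms(
        flops, bytes_moved, hw["peak_flops_per_s"], hw["peak_bytes_per_s"]
    )
    return denoise_steps * per_step


# Hypothetical accelerator and budget, for illustration only.
HYPOTHETICAL_GPU = {
    "peak_flops_per_s": 400e12,
    "peak_bytes_per_s": 2e12,
    "bytes_per_param": 2,
}
BUDGET_MS = 40.0

if __name__ == "__main__":
    for steps in (4, 2):
        ms = frame_latency_ms(2e9, 1024, steps, HYPOTHETICAL_GPU)
        print(steps, "steps:", round(ms, 2), "ms, fits:", ms <= BUDGET_MS)
```

With these made-up numbers, four denoising steps land just over 40ms and two steps fit comfortably. The point is that this answer is available before any training run, which is when it can still change the design.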

They also use continuous flow matching rather than discrete tokenization. The quality ceiling on discrete tokenization comes from the codebook -- you can only generate things that map to learned token embeddings. Continuous flow matching operates directly in latent space with no discretization step, which preserves fine-grained detail over long rollouts without quality collapse. They claim 20x longer context than prior work with full backpropagation. The serving implication: long-horizon rollouts accumulate context that has to be cached. Managing that cache under a 40ms budget requires the same kind of KV management thinking as LLM serving, but with 3D spatiotemporal structure instead of 1D sequence structure.
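To make the cache pressure concrete, here is a back-of-the-envelope sizing sketch. The layer count, hidden size, and tokens per frame are hypothetical, and it assumes plain multi-head attention with every past token cached in 16-bit precision (no grouped-query attention, no compression, no eviction).

```python
def kv_cache_bytes(layers, hidden_dim, tokens_per_frame, frames, bytes_per_value=2):
    """Bytes needed to cache keys and values for `frames` past frames."""
    # Factor of 2: one tensor for keys, one for values, per layer.
    return 2 * layers * hidden_dim * tokens_per_frame * frames * bytes_per_value


def frames_that_fit(layers, hidden_dim, tokens_per_frame, budget_bytes, bytes_per_value=2):
    """How many past frames fit in a fixed cache budget."""
    per_frame = kv_cache_bytes(layers, hidden_dim, tokens_per_frame, 1, bytes_per_value)
    return budget_bytes // per_frame


if __name__ == "__main__":
    GIB = 2**30
    # Hypothetical model shape, for illustration only.
    layers, hidden, tokens = 32, 2048, 1024
    per_frame = kv_cache_bytes(layers, hidden, tokens, 1)
    print(per_frame // 2**20, "MiB per frame")
    one_minute = kv_cache_bytes(layers, hidden, tokens, 25 * 60)
    print(one_minute // GIB, "GiB for one minute at 25 fps")
    print(frames_that_fit(layers, hidden, tokens, 40 * GIB), "frames fit in 40 GiB")
```

For this hypothetical shape, a naive cache holds about 6 seconds of 25 fps context in 40 GiB. That gap is why long rollouts force explicit decisions about what to keep, compress, or drop.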

The thing I find most credible about Odyssey: the product experience matches the claimed architecture. Bidirectional models have a first-frame latency of tens of seconds because they have to finish generating the full clip before outputting anything. Odyssey streams the first frame in 40ms. That's not achievable with a bidirectional model dressed up as interactive. The architecture is real.


DISK (February 2026, preprint) is the most technically interesting inference paper in the world model space and has approximately no coverage outside systems research circles.

The insight: not every frame needs full denoising.

In a causal AR world model, you run N denoising steps to generate each frame. N is the inference cost. If the scene is relatively static -- sky not changing, background stable, the agent paused -- the full N-step denoising is paying for precision you don't need. The frame is almost identical to the previous one. You run the full diffusion anyway.

DISK coordinates two coupled DiTs -- one for video, one for ego-trajectory -- via dual-branch controllers that make per-frame skip decisions. If the latent difference between the current prediction and the prior frame is below a threshold, skip some denoising steps. The skip decision is made without retraining -- it's a runtime test on the latent-space differential, not a learned parameter.
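The shape of that runtime test can be sketched in a few lines. This is my simplification, not DISK's actual controller: the distance metric, the threshold, and the step counts are all assumptions chosen for illustration.

```python
def mean_abs_diff(a, b):
    """Average absolute difference between two equal-length latent vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def steps_for_frame(current_latent, prev_latent, full_steps, min_steps, threshold):
    """Pick a denoising step count with a runtime test, no learned parameters."""
    if mean_abs_diff(current_latent, prev_latent) < threshold:
        return min_steps   # scene barely changed: skip some steps
    return full_steps      # scene changed: pay for full denoising


if __name__ == "__main__":
    prev = [0.10, 0.20, 0.30]
    static_frame = [0.11, 0.20, 0.29]   # nearly identical to prev
    moving_frame = [0.50, -0.20, 0.90]  # large change from prev
    print(steps_for_frame(static_frame, prev, full_steps=8, min_steps=4, threshold=0.05))
    print(steps_for_frame(moving_frame, prev, full_steps=8, min_steps=4, threshold=0.05))
```

Because the decision only reads latents the model already produced, it can wrap an existing checkpoint without touching the weights.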

The result on 1,500 NuPlan and NuScenes driving samples on a single L40S GPU: 2x speedup on trajectory diffusion, 1.6x on video diffusion, while maintaining L2 planning error, visual quality (FID/FVD), and NAVSIM planning scores. Free performance. No retraining required. The same model, run smarter.

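This is the idea behind speculative decoding, applied to diffusion steps: spend less compute where the output is easy to predict. Instead of always running N denoising passes, run fewer when the frame doesn't warrant them. The mechanisms differ (speculative decoding drafts tokens and verifies them; DISK thresholds a latent difference), but the economics match. The world model inference space is going to converge on this pattern for the same reason LLM serving converged on speculative decoding: the compute is being spent uniformly on non-uniform content, and the non-uniformity is exploitable.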
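How much the non-uniformity is worth depends on how often frames qualify for the cheap path. A small sketch of that arithmetic, with made-up step counts and skip fractions (these are not DISK's numbers), counting denoising cost only:

```python
def amortized_speedup(full_steps, reduced_steps, skip_fraction):
    """Speedup on denoising cost when a fraction of frames run fewer steps."""
    if not 0.0 <= skip_fraction <= 1.0:
        raise ValueError("skip_fraction must be between 0 and 1")
    avg_steps = skip_fraction * reduced_steps + (1.0 - skip_fraction) * full_steps
    return full_steps / avg_steps


if __name__ == "__main__":
    # Hypothetical: 10 full steps, 5 reduced steps.
    for fraction in (0.0, 0.5, 1.0):
        print(fraction, round(amortized_speedup(10, 5, fraction), 3))
```

Any fixed per-frame cost outside denoising (decode, transfer, scheduling) pulls the end-to-end speedup below this figure.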


XPENG X-World (technical report April 29th, three weeks ago) is worth noting specifically because they solved a problem nobody else has solved cleanly: multi-camera, multi-view consistency.

Autonomous driving doesn't have one camera. It has eight to twelve. A world model for AV has to generate consistent futures across all camera views simultaneously -- the pedestrian crossing in the front camera has to appear correctly in the front-left and front-right cameras, with correct occlusion, correct depth, correct lighting. Inconsistency between views is immediately detectable to human evaluators and disastrous when the world model's output is used to train downstream perception systems.

X-World uses video diffusion with controllable multi-view generation. They're not the only multi-camera world model -- Vista (April 2025) addressed similar issues -- but the April 2026 technical report is the most detailed public description of what it takes to make this work in production AV data pipelines. The training data alone required a new data production pipeline. The inference stack required explicit cross-view consistency constraints during denoising.

The reason this matters commercially: Waymo, Zoox, and every other AV company need world models that produce consistent multi-camera synthetic scenarios for rare events -- the 1-in-10,000-mile scenarios that are impossible to collect at scale in the real world. A world model that generates inconsistent views is useless for this. Multi-view consistency is the hard part. XPENG published the methods. That's unusually transparent for a company with a genuine production moat.


AMI Labs (Yann LeCun, March 2026, 500M raised at a 3B valuation before a product ships) is worth understanding specifically through the JEPA lens rather than the founder lens.

Joint Embedding Predictive Architecture predicts in latent space rather than pixel space. Standard video diffusion is trained to produce the visual output itself: pixels, or compressed latents that a decoder turns back into pixels. JEPA predicts representations -- abstract embeddings of what the world looks like, without reconstructing the actual visual output unless you need it. This is dramatically cheaper: the prediction target is a compact abstract embedding, not every visual detail of the frame.

For robotics applications -- where the robot needs to plan in terms of high-level scene representations, not pixel-accurate video -- JEPA's architecture is a better fit than generative video diffusion. The robot doesn't need to hallucinate photorealistic pixels. It needs to reason about object positions, physical relationships, action consequences. JEPA operates at that level of abstraction.

The inference cost advantage is significant. A JEPA-based world model can run faster and with less memory than a generative video DiT, because it never has to produce the high-dimensional visual output. The accuracy of physical reasoning doesn't require photorealism. If the latent space captures the relevant physical structure, the model can plan and predict without decoding to pixels at all.
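A toy sketch of what planning without decoding means. Everything here is a stand-in: the "latent" is a hand-written 2D position and the predictor is a hard-coded function, where a real JEPA-style system would use a learned encoder and a learned predictor. The structure is the point: the planner only ever compares latents, and no frame is rendered.

```python
# Hypothetical action set and its effect on the toy latent state.
ACTIONS = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}


def predict_next_latent(latent, action):
    """Stand-in for a learned predictor: next latent from latent and action."""
    dx, dy = ACTIONS[action]
    return (latent[0] + dx, latent[1] + dy)


def latent_distance(a, b):
    """Cost is measured in latent space, never on decoded pixels."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def plan(latent, goal_latent, horizon):
    """Greedy planner: pick the action whose predicted latent is closest to the goal."""
    chosen = []
    for _ in range(horizon):
        if latent == goal_latent:
            break
        best = min(
            ACTIONS,
            key=lambda a: latent_distance(predict_next_latent(latent, a), goal_latent),
        )
        chosen.append(best)
        latent = predict_next_latent(latent, best)
    return chosen


if __name__ == "__main__":
    print(plan((0, 0), (2, 1), horizon=5))
```

A greedy one-step planner is the simplest possible choice here; the same loop works with any search over predicted latents.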

Whether AMI Labs can execute on this before the Cosmos and Genie 3 ecosystems solidify is the real question. LeCun has been saying JEPA is the right path for five years. 500M at 3B is an enormous bet on that thesis before there's a product. I don't know if it's right. I find the technical argument compelling.


The serving infrastructure story for all of these is the same one I've been writing about for months, one level harder.

LLM serving has KV cache management, disaggregated prefill/decode, continuous batching, PagedAttention. World model serving has all of that plus: 3D spatiotemporal attention with spatial and temporal factorization, denoising step management (multiple forward passes per output frame rather than one), causal context accumulation over multi-minute rollouts that dwarfs LLM context lengths in raw data volume, and real-time latency constraints (40ms) that are 5-10x tighter than typical LLM interactive latency targets.
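The case for factorizing spatiotemporal attention is mostly arithmetic. A quick sketch counting query-key pairs, with a hypothetical clip shape (16 frames of 256 tokens each); it ignores heads, hidden size, and causal masking.

```python
def full_attention_pairs(frames, tokens_per_frame):
    """Joint attention: every token attends to every token in the clip."""
    n = frames * tokens_per_frame
    return n * n


def factorized_attention_pairs(frames, tokens_per_frame):
    """Spatial attention within each frame plus temporal attention per token position."""
    spatial = frames * tokens_per_frame ** 2
    temporal = tokens_per_frame * frames ** 2
    return spatial + temporal


if __name__ == "__main__":
    T, S = 16, 256  # hypothetical clip shape
    full = full_attention_pairs(T, S)
    fact = factorized_attention_pairs(T, S)
    print(full, fact, round(full / fact, 1))
```

The gap widens as the rollout gets longer, which is exactly the regime a real-time world model lives in.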

The serving system I built for video world models was designed from the latency budget down: 35ms for model compute plus 15ms for overhead, 50ms per frame in total, with 2 distilled denoising steps at 15ms each (30ms) fitting inside the compute budget. Every other architectural decision followed from that constraint.
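Written out as a check, using the numbers above and assuming compute and overhead add serially (no overlap between them):

```python
def check_budget(compute_budget_ms, overhead_ms, denoise_steps, ms_per_step):
    """Summarize a per-frame latency budget built from denoising steps."""
    denoise_ms = denoise_steps * ms_per_step
    frame_ms = compute_budget_ms + overhead_ms
    return {
        "denoise_ms": denoise_ms,
        "compute_headroom_ms": compute_budget_ms - denoise_ms,
        "frame_ms": frame_ms,
        "max_fps": 1000 / frame_ms,
        "fits": denoise_ms <= compute_budget_ms,
    }


if __name__ == "__main__":
    print(check_budget(compute_budget_ms=35, overhead_ms=15, denoise_steps=2, ms_per_step=15))
```

Two steps leave 5ms of compute headroom; a third step at the same cost would need 45ms and break the budget, which is why the distillation target is set by the budget and not the other way around.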

The startups in this space that will win are the ones who did the same thing. Who modeled the roofline before training. Who built the serving stack as a first-class engineering problem, not an afterthought. Odyssey's "roofline estimates from day one" detail is the signal. It's the same thing I look for in LLM infrastructure teams.


causal vs bidirectional is the most important architectural distinction in the world model space.

sora can't be interactive. that's not a product limitation. it's architecture.

the companies that built causal from the start -- and designed the serving stack around what causal requires -- have an 18-month head start over anyone trying to retrofit real-time interaction onto a bidirectional architecture.

disk's dynamic inference skipping is the paper to read if you're building world model inference infrastructure. 2x speedup on trajectory, 1.6x on video, no retraining. the compute was always being spent uniformly on non-uniform content. disk is the first system i've seen exploit that for world models.


P.S. The Causal-RoPE SP paper (March 10, 2026) solves a specific serving problem nobody had addressed for causal AR video generation: sequence-parallel inference across multiple GPUs requires position embeddings that can be computed locally per rank without global sequence information. Standard 3D RoPE requires the full sequence to compute positions correctly, which forces cross-rank communication that kills the parallelism benefit. Causal-RoPE SP adapts position embeddings to work with local context only. It's a two-page methods section in a systems paper that makes multi-GPU world model serving viable without the communication bottleneck. It should be on every world model infrastructure team's reading list, and it has essentially no coverage.
