Three Papers That Reveal Why Your Robot Still Can't Plan Ahead
New research on world models, video-language rewards, and causal planning exposes the fundamental gaps between what LLMs predict and what robots actually need to reason about.
Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
If you have ever watched a large language model confidently describe how to make a sandwich, then watched a robot arm flail helplessly when actually attempting the task, you have witnessed the central problem in robotic AI right now. The gap between linguistic competence and physical reasoning is not just an engineering challenge to be optimized away. Three recent papers suggest it may be a fundamental architectural limitation, and the solutions they propose tell us something important about where the field is heading.
To be precise, what we are seeing is a convergence of independent research groups arriving at similar conclusions through different methodologies. This is the kind of pattern that typically precedes a paradigm shift in how we think about robot learning, though I should note that "paradigm shift" is an overused phrase and the actual transition will probably be messier and slower than anyone predicts.
The case against sequence prediction is laid out most directly in a paper from researchers who introduce what they call Latent Dynamics Inference, or LDI. The core argument of the paper, posted on arXiv, is straightforward but has significant implications: large language models are trained to predict the next token in a sequence, but reasoning about physical environments requires tracking persistent state and modeling how actions cause transitions between states. These are fundamentally different computational problems.
The researchers created a test environment called Flux, essentially a game specified entirely through natural language rules. The detail that matters, at the risk of being picky, is that the rules can be compiled into an explicit state-transition simulator. This means you can directly compare how well an LLM reasons about the game from text descriptions versus how well a reinforcement learning agent performs when it has access to the actual underlying state space.
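To make "explicit state-transition simulator" concrete, here is a minimal sketch in Python. It is not Flux and not the paper's code: the rooms, the key, and every name (`State`, `legal_actions`, `step`, `is_win`) are hypothetical stand-ins I made up to show what it means for rules to be compiled into state plus transitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    position: int   # which room the agent is in (rooms 0..3, hypothetical)
    has_key: bool   # persistent fact that must be remembered across steps


def legal_actions(state: State) -> list[str]:
    """Enumerate the actions the (made-up) rules allow in this state."""
    actions = []
    if state.position > 0:
        actions.append("left")
    # The door between room 2 and room 3 is locked without the key.
    if state.position < 2 or (state.position == 2 and state.has_key):
        actions.append("right")
    if state.position == 0 and not state.has_key:
        actions.append("take_key")
    return actions


def step(state: State, action: str) -> State:
    """Transition function: reject invalid actions, otherwise return the next state."""
    if action not in legal_actions(state):
        raise ValueError(f"invalid action {action!r} in {state}")
    if action == "left":
        return State(state.position - 1, state.has_key)
    if action == "right":
        return State(state.position + 1, state.has_key)
    return State(state.position, True)  # "take_key"


def is_win(state: State) -> bool:
    return state.position == 3


def run(start: State, actions: list[str]) -> State:
    """Apply a sequence of actions, carrying the state forward explicitly."""
    state = start
    for action in actions:
        state = step(state, action)
    return state
```

An agent with access to `State` can never forget that it picked up the key, and `step` refuses invalid moves outright. An agent that only sees a text transcript has to reconstruct both facts from the history at every turn, which is where the invalid-action and state-tracking failures come from.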
The results are stark. Agents with explicit access to the latent state achieved roughly 79% win rates in long-horizon gameplay. LLMs operating purely over textual observations managed about 11%. The failure modes are instructive: invalid actions, state-tracking errors, and what the authors describe as "short-horizon reasoning failures." The models could handle immediate decisions but fell apart when they needed to maintain a coherent model of the world over many steps.
It's worth noting that this is a controlled laboratory environment, not a physical robot. The gap might be smaller or larger in real-world settings. We don't know yet. But the qualitative pattern, that sequence prediction struggles with persistent state, matches what roboticists observe in practice.
The reward hacking problem gets a thorough treatment in work on SOLE-R1, a video-language model designed specifically to serve as a reward signal for robot reinforcement learning. The paper addresses a problem that anyone who has tried to use vision-language models to supervise robot learning will recognize: these models are easily fooled.
When you use a VLM to judge whether a robot has completed a task, the robot often learns to exploit perceptual errors in the model rather than actually solving the task. It's reward hacking, but at the level of the evaluator rather than the environment. The robot learns that certain visual configurations make the VLM think the task is done, even when nothing useful has happened.
SOLE-R1 attempts to address this through what the researchers call spatiotemporal chain-of-thought reasoning. Instead of making a single judgment about task completion, the model reasons through each timestep, tracking progress over time. The training pipeline generates what they call "temporally grounded" reasoning traces, basically forcing the model to justify its progress estimates with reference to specific visual evidence at specific moments.
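The difference between a single end-of-episode judgment and per-timestep progress tracking can be sketched with a toy example. This is an illustration of the general idea only, not SOLE-R1's model or training pipeline: the milestone list, the frame dictionaries, and both reward functions are assumptions of mine, with hand-written "events" standing in for what a video-language model would perceive.

```python
# Hypothetical ordered milestones for a pick-and-place task.
MILESTONES = ["reached", "grasped", "lifted", "placed"]


def final_frame_reward(frames):
    """Single judgment: reward depends only on how the last frame looks."""
    return 1.0 if frames[-1]["looks_done"] else 0.0


def temporal_progress_reward(frames):
    """Per-timestep tracking: credit a milestone only when it is observed
    in order, and record the timestep that serves as evidence for it.
    At most one milestone is credited per frame."""
    next_idx = 0
    evidence = []
    for t, frame in enumerate(frames):
        if next_idx < len(MILESTONES) and MILESTONES[next_idx] in frame["events"]:
            evidence.append((t, MILESTONES[next_idx]))
            next_idx += 1
    return next_idx / len(MILESTONES), evidence


# An honest episode: every milestone happens, in order.
honest = [
    {"events": {"reached"}, "looks_done": False},
    {"events": {"grasped"}, "looks_done": False},
    {"events": {"lifted"}, "looks_done": False},
    {"events": {"placed"}, "looks_done": True},
]

# A hacked episode: the last frame merely looks finished to the judge.
hacked = [
    {"events": set(), "looks_done": False},
    {"events": {"reached"}, "looks_done": False},
    {"events": set(), "looks_done": True},
]

for name, episode in [("honest", honest), ("hacked", hacked)]:
    print(name, final_frame_reward(episode), temporal_progress_reward(episode))
```

In this toy setup the final-frame judge gives both episodes full reward, while the temporal scorer credits the hacked episode only for the one milestone it can point to. The real system has to extract that evidence from pixels, which is the hard part this sketch skips entirely.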
The results are genuinely interesting. The system enables what the paper describes as "zero-shot online RL from random initialization," meaning robots can learn manipulation tasks without ground-truth rewards, success indicators, demonstrations, or task-specific tuning. They report success on 24 unseen tasks and claim substantially better robustness to reward hacking compared to alternatives including, notably, GPT-5 and Gemini-3-Pro.
I want to be careful here about methodology concerns. The comparison is across four simulation environments and one real robot setting. The sample size for real-world validation is limited. The paper claims "markedly greater robustness to reward hacking" but robustness is notoriously difficult to measure comprehensively. What looks robust in tested conditions may fail in untested ones. Still, the approach of building temporal reasoning directly into the reward model rather than treating it as a post-hoc evaluator seems like a genuine architectural insight.
The planning problem gets yet another perspective from ChainFlow-VLA, which tackles autonomous driving rather than manipulation. The paper frames the core tension as a dichotomy between autoregressive models and diffusion models for trajectory planning.
Autoregressive models, the paper argues, capture temporal dependencies well because they explicitly model how each step depends on previous steps. But they accumulate errors over long sequences and can produce globally inconsistent trajectories. Diffusion models optimize trajectories as a whole, giving better global structure, but lack explicit causal constraints. They can produce beautiful trajectories that violate basic physics or ignore how other agents will respond to the ego vehicle's actions.
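The error-accumulation point is easy to see in a toy calculation. The sketch below is my own illustration, not anything from the paper: it assumes a planner that should drive straight but adds a small, constant heading bias at every step, with each step inheriting the previous step's heading. The function name and parameters are hypothetical.

```python
import math


def lateral_drift(n_steps, heading_bias, step_len=1.0):
    """Lateral deviation from a straight reference path after n_steps,
    when each step inherits the previous heading plus a small bias."""
    heading = 0.0
    y = 0.0
    for _ in range(n_steps):
        heading += heading_bias            # error is carried forward, never reset
        y += step_len * math.sin(heading)  # sideways displacement this step
    return y


for n in (10, 50):
    print(n, round(lateral_drift(n, heading_bias=0.01), 2))
```

Because every step builds on the previous step's mistake, the drift grows faster than linearly with horizon length: five times as many steps produces far more than five times the deviation. That is the property that makes long autoregressive rollouts fragile, and it is why optimizing the whole trajectory at once is attractive.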
ChainFlow-VLA attempts to unify these approaches. An autoregressive generator produces what the authors call "causal trajectory modes," essentially a discrete set of plausible trajectory shapes that respect temporal causality. Then a diffusion-based refiner adjusts these trajectories using vision-language model features as semantic priors. The idea is to get the benefits of both: causal coherence from the autoregressive component, global optimization from the diffusion component.
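The propose-then-refine structure can be sketched in a few lines. To be clear about what this is: a toy under heavy assumptions, not ChainFlow-VLA. The "autoregressive" stage is a hand-written rollout with fixed turn rates, the "refiner" is plain iterative smoothing plus a pull toward a goal point (standing in for a learned diffusion model), and there are no vision-language features at all. All names and parameters are hypothetical.

```python
import math


def propose_modes(start, mode_turn_rates, n_steps, step_len=1.0):
    """Stage 1 (autoregressive stand-in): roll out one trajectory per discrete
    mode, computing each waypoint from the previous one."""
    modes = {}
    for name, turn_rate in mode_turn_rates.items():
        x, y = start
        heading = 0.0
        traj = [(x, y)]
        for _ in range(n_steps):
            heading += turn_rate
            x += step_len * math.cos(heading)
            y += step_len * math.sin(heading)
            traj.append((x, y))
        modes[name] = traj
    return modes


def refine(traj, goal, n_iters=50, smooth=0.25, pull=0.1):
    """Stage 2 (diffusion stand-in): update all waypoints jointly over several
    iterations. Interior points move toward the midpoint of their neighbours;
    the final point is pulled toward the goal. The start point stays fixed."""
    pts = [list(p) for p in traj]
    n = len(pts)
    for _ in range(n_iters):
        new_pts = [p[:] for p in pts]
        for i in range(1, n):
            for d in range(2):
                if i < n - 1:
                    mid = 0.5 * (pts[i - 1][d] + pts[i + 1][d])
                    new_pts[i][d] += smooth * (mid - pts[i][d])
                else:
                    new_pts[i][d] += pull * (goal[d] - pts[i][d])
        pts = new_pts
    return [tuple(p) for p in pts]


def endpoint_error(traj, goal):
    return math.dist(traj[-1], goal)


def plan(start, goal, n_steps=10):
    """Pick the proposed mode that ends closest to the goal, then refine it."""
    turn_rates = {"left": 0.1, "straight": 0.0, "right": -0.1}
    modes = propose_modes(start, turn_rates, n_steps)
    best = min(modes, key=lambda name: endpoint_error(modes[name], goal))
    return best, modes[best], refine(modes[best], goal)
```

Calling `plan((0.0, 0.0), (9.0, 4.0))` selects one coarse mode and then adjusts the whole trajectory at once. The division of labor is the point: the first stage commits to a causally generated shape, and the second stage fixes global properties that a step-by-step rollout cannot see.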
The benchmark results are strong. The system achieves 94.85 on the NAVSIM v1 leaderboard, which the authors note matches human-level performance at 94.8. I should be precise about what this means: NAVSIM is a simulation benchmark, and matching human performance on a benchmark is not the same as matching human performance in the real world. Simulation benchmarks have known limitations around distribution shift and edge cases. But within the evaluated setting, the results suggest the hybrid architecture is doing something useful.
What connects these papers is a shared recognition that the dominant paradigm of treating everything as sequence prediction has fundamental limitations for physical reasoning. Each paper proposes a different solution, but they all involve some form of explicit state or dynamics modeling that goes beyond predicting the next token.
The Flux paper argues for latent dynamics inference. SOLE-R1 builds temporal reasoning into the reward model. ChainFlow-VLA combines causal autoregression with global diffusion-based optimization. Different approaches, but the underlying diagnosis is similar: you cannot get robust physical reasoning purely from pattern matching over sequences.
This research supports something that roboticists have suspected for years. The question has always been whether scale would solve the problem, whether enough data and parameters would allow sequence prediction to implicitly learn world models. The evidence from these papers suggests the answer is probably no, at least not with current architectures.
What remains unclear is how these insights will translate into practical systems. All three papers are working in relatively controlled settings. Flux is a text-based game environment. SOLE-R1 is tested primarily in simulation with limited real-robot validation. ChainFlow-VLA is benchmarked on driving simulation. The gap between these controlled evaluations and robust real-world deployment is substantial.
There's also a question of computational cost. Explicit state tracking and world modeling add complexity. SOLE-R1's per-timestep chain-of-thought reasoning is more expensive than a single VLM call. ChainFlow-VLA runs both an autoregressive generator and a diffusion refiner. Whether these approaches can scale to real-time robot control in complex environments remains an open question.
What I'd want to see next is systematic comparison across these different approaches on shared benchmarks. Right now each paper uses its own evaluation setup, which makes it difficult to compare them directly. We need standardized tests for long-horizon reasoning, state tracking, and robustness to distribution shift.
I would also want to see more real-world validation. Simulation results are necessary but not sufficient. The failure modes that matter most, the ones that lead to broken objects, damaged robots, or worse, often only appear when systems encounter the full complexity of physical environments.
Finally, I think there's interesting theoretical work to be done on characterizing exactly when sequence prediction fails and when explicit world models are necessary. The Flux paper makes a start on this with its latent dynamics framework, but we need more formal understanding of the boundary conditions.
The broader picture is that robotic AI may be approaching an inflection point. The easy gains from scaling language models are hitting diminishing returns for physical reasoning tasks. The next generation of systems will likely need architectural innovations that explicitly model dynamics, track state, and reason causally about physical interactions.
This is not to say that language models are useless for robotics. They clearly provide valuable semantic understanding and can serve as components in larger systems. But the vision of a single foundation model that handles everything from language understanding to physical manipulation through pure sequence prediction looks increasingly unlikely.
The field seems to be converging on hybrid architectures that combine the semantic capabilities of language models with explicit mechanisms for physical reasoning. What those mechanisms look like, whether they involve learned world models, structured state representations, or something else entirely, is still being worked out. But the direction of travel is becoming clearer.
(A methodological note: I have focused on three papers here, but this is based on limited sampling of recent work. There may be other approaches I have not covered that reach similar or different conclusions. The pattern I am describing is suggestive, not definitive.)
The practical implications for robotics companies and researchers are significant. If you are betting everything on scaling up language models for robot control, these papers suggest you may want to hedge that bet. If you are working on explicit world models or hybrid architectures, you are probably on the right track, though the specific implementations that will win out remain unclear.
For the rest of us watching from the outside, the takeaway is that the path to capable robotic AI is not simply a matter of making language models bigger. The problems are architectural, not just scale-related. And solving them will require the kind of careful research represented in these papers: identifying specific failure modes, proposing principled solutions, and validating them in controlled settings before attempting real-world deployment.
The robots that eventually work reliably in our homes and workplaces will likely contain something like a language model for understanding instructions and context. But they will also contain explicit mechanisms for tracking state, modeling dynamics, and reasoning about cause and effect. Getting that hybrid architecture right is the central challenge for robotic AI over the next several years.