Photo credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
A robot arm reaches for a coffee mug. Its neural network has predicted the next 16 actions in sequence, a technique called action chunking that's become standard in modern robot learning. But somewhere around action 9, the mug shifts slightly. The robot doesn't notice. It's still executing the original plan, open-loop, committed to a trajectory that no longer makes sense.
This scenario plays out constantly in robotics labs around the world, and it points to a problem that the field has been quietly struggling with for the past two years. Action chunking, the technique that made modern vision-language-action (VLA) models practical, is fundamentally flawed. And judging by the flood of papers hitting arXiv this week, researchers are finally admitting it.
I counted six papers in the past seven days that directly address action chunking limitations. That's not normal. That's a field collectively realizing something is broken.
The core issue is deceptively simple. When a robot policy predicts a chunk of, say, 16 future actions, someone has to decide how many of those actions to actually execute before checking in with the real world again. Execute too few and you get jerky, inefficient motion. Execute too many and the robot can't react to changes. The "right" number turns out to be maddeningly task-dependent.
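To make the tradeoff concrete, here is a minimal toy sketch of that loop: predict a 16-action chunk, execute only the first `exec_horizon` actions, then replan. Everything in it is an assumption for illustration (`ToyEnv`, `toy_policy`, and `run_episode` are hypothetical names, and a 1-D point stands in for the arm and the mug); it is not taken from any of the papers discussed here.

```python
import numpy as np

class ToyEnv:
    """Hypothetical 1-D world: the target jumps once, mid-episode."""
    def __init__(self, shift_at=9, shift=0.5):
        self.pos, self.target, self.t = 0.0, 1.0, 0
        self.shift_at, self.shift = shift_at, shift

    def observe(self):
        return np.array([self.pos, self.target])

    def step(self, action):
        self.pos += float(action)
        self.t += 1
        if self.t == self.shift_at:      # the "mug" moves after action 9
            self.target += self.shift
        return self.observe()

def toy_policy(obs, chunk_len=16, max_step=0.125):
    """Stand-in for a learned policy: plan a chunk toward the target seen right now."""
    pos, target = obs
    chunk = []
    for _ in range(chunk_len):
        a = float(np.clip(target - pos, -max_step, max_step))
        chunk.append(a)
        pos += a                         # imagined position, not the real one
    return np.array(chunk)

def run_episode(exec_horizon, chunk_len=16, total_steps=16):
    """Execute `exec_horizon` (>= 1) actions per chunk, then replan from a fresh observation."""
    env = ToyEnv()
    obs, steps, replans = env.observe(), 0, 0
    while steps < total_steps:
        chunk = toy_policy(obs, chunk_len)
        replans += 1
        for action in chunk[:exec_horizon]:   # open-loop inside this slice
            if steps == total_steps:
                break
            obs = env.step(action)
            steps += 1
    final_error = abs(obs[1] - obs[0])        # distance from the target at the end
    return final_error, replans

err_long, replans_long = run_episode(exec_horizon=16)
err_short, replans_short = run_episode(exec_horizon=4)
```

In this toy, running the full chunk open-loop leaves the arm parked at the old target, while replanning every four actions catches the shift at the cost of four policy calls instead of one. Real policies add inference latency and discontinuities between chunks, which is why shorter is not automatically better.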
PACE, a new method from researchers working with the RoboTwin2.0 benchmark, quantifies this problem with uncomfortable precision. On 50 simulation tasks, they found that success rates varied non-monotonically with execution horizon. There's no single "good" number. A horizon that works perfectly for one task fails catastrophically on another. Their solution, which identifies natural transition points in predicted trajectories, raised average success from 57.8% to 64.2% in simulation. On real hardware (bimanual ALOHA and single-arm Franka platforms), they improved success rates from 50.7% to 70.4%.
Those are significant gains. But look at those baseline numbers. We're talking about systems that fail on roughly 40 to 50 percent of attempts at tasks they were specifically trained for.
A separate paper on Mixture of Horizons (MoH) takes a different approach, processing multiple horizon lengths in parallel and fusing them with a learned gate. The authors claim 99% average success on LIBERO benchmarks after 30,000 training iterations, which is impressive, but LIBERO tasks are relatively constrained. The real test is always production volume and real-world variability.
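For readers who want a feel for what "fusing horizons with a gate" could look like, here is a rough NumPy sketch. It is my own simplification, not the MoH architecture: the function name `fuse_horizons`, the per-head softmax gate, and the masking scheme are all assumptions, and in a real model the gate logits would be produced by a trained network rather than passed in by hand.

```python
import numpy as np

def fuse_horizons(chunks, gate_logits):
    """Blend action chunks of different lengths into one chunk.

    chunks:      list of K arrays, each of shape (h_k, action_dim)
    gate_logits: array of shape (K,), one score per horizon head (assumed given)
    returns:     array of shape (max_h, action_dim)
    """
    gate_logits = np.asarray(gate_logits, dtype=float)
    max_h = max(c.shape[0] for c in chunks)
    action_dim = chunks[0].shape[1]

    preds = np.zeros((len(chunks), max_h, action_dim))
    mask = np.zeros((len(chunks), max_h))
    for k, c in enumerate(chunks):
        preds[k, : c.shape[0]] = c       # pad short chunks with zeros
        mask[k, : c.shape[0]] = 1.0      # remember which steps are real

    # Softmax over heads, restricted at each timestep to heads that cover it.
    weights = np.exp(gate_logits - gate_logits.max())[:, None] * mask
    weights = weights / weights.sum(axis=0, keepdims=True)
    return (weights[:, :, None] * preds).sum(axis=0)

# Two hypothetical heads: a short horizon of 4 steps and a long one of 8.
short_head = np.full((4, 2), 1.0)
long_head = np.full((8, 2), 3.0)
fused = fuse_horizons([short_head, long_head], gate_logits=[0.0, 0.0])
```

With equal gate scores, the first four steps are an even blend of both heads and the last four come from the long head alone, since it is the only one that predicts that far out.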
Here's what I find genuinely concerning. All of the papers discussed so far assume a static world. The robot plans, executes some actions, observes, replans. But what happens when objects move during execution?
AHEAD is the only paper this week that directly confronts dynamic manipulation, and its results are stark. On 20 simulation scenarios involving moving objects, the strongest baseline (a frozen OpenVLA) achieved 31 to 58% success. AHEAD, which adds a small world model that predicts future visual states, reached 79 to 97%.
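Why anticipation matters so much for moving targets is easy to show with a toy. The sketch below uses a constant-velocity extrapolation as a crude stand-in for a learned world model; that substitution, the function names, and the numbers are all my assumptions and have nothing to do with how AHEAD actually predicts future visual states.

```python
import numpy as np

def predict_future_position(history, steps_ahead):
    """Constant-velocity guess from the last two observed positions (toy 'world model')."""
    history = np.asarray(history, dtype=float)
    velocity = history[-1] - history[-2]
    return history[-1] + steps_ahead * velocity

def miss_distance(aim_point, history, steps_to_reach):
    """How far the gripper lands from the object, assuming the object keeps its velocity."""
    history = np.asarray(history, dtype=float)
    velocity = history[-1] - history[-2]
    actual = history[-1] + steps_to_reach * velocity
    return float(np.linalg.norm(actual - np.asarray(aim_point, dtype=float)))

# Hypothetical conveyor: the object moves 0.25 units per step along x,
# and the arm needs 8 steps to reach wherever it aims.
observed = [[0.0, 0.0], [0.25, 0.0]]
reactive_miss = miss_distance(observed[-1], observed, steps_to_reach=8)
anticipating_miss = miss_distance(
    predict_future_position(observed, steps_ahead=8), observed, steps_to_reach=8
)
```

Aiming at the last observed position misses by the full distance the object travels during the reach, while aiming at the predicted position does not. The toy is rigged, since the prediction and the ground truth share the same motion model, but it shows why a policy with no forward prediction is structurally behind on any target that moves.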
The real hardware numbers are more telling. On a UFactory xArm 7 doing conveyor and rolling-ball tasks, AHEAD succeeded on 29 or 30 out of 30 trials. On projectile catching (admittedly a hard task), it managed 19 out of 30. Every baseline scored zero.
Look, I've seen enough spec sheets to know that zero-success baselines usually mean the task was designed to make baselines fail. But even accounting for that, the gap is enormous. Current VLA models simply cannot handle moving targets, and that's a problem because the real world moves.
The more ambitious papers this week are betting that world models (neural networks that predict future states) can solve the chunking problem by letting robots simulate outcomes before committing to actions.
$\tau_0$-WM is the most comprehensive attempt I've seen. It's trained on approximately 27,300 hours of robot teleoperation and human video, and it can both generate actions and simulate their consequences. At inference time, it samples candidate action chunks, ranks them by consistency, and uses the simulator to fix low-quality candidates.
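The "rank by consistency" step is the part I can sketch without guessing too much about the model. One plausible reading, and it is only my assumption, is that a candidate chunk is trustworthy when it agrees with the other samples. The hypothetical `rank_by_consistency` below scores each candidate by its average distance to the rest; the repair step that uses the simulator is not shown because the details are not something I can reconstruct from the description above.

```python
import numpy as np

def rank_by_consistency(candidates):
    """Rank sampled action chunks by agreement with the other samples.

    candidates: array of shape (N, horizon, action_dim), with N >= 2
    returns:    (order, scores), where order lists indices from most to least consistent
    """
    candidates = np.asarray(candidates, dtype=float)
    n = len(candidates)
    flat = candidates.reshape(n, -1)
    # Pairwise Euclidean distances between flattened chunks, shape (N, N).
    dists = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)
    # Higher score means closer, on average, to every other candidate.
    scores = -dists.sum(axis=1) / (n - 1)
    order = np.argsort(-scores)
    return order, scores

# Four hypothetical 2-step, 1-D chunks: three that roughly agree and one outlier.
samples = np.array([
    [[0.0], [0.0]],
    [[0.1], [0.1]],
    [[-0.1], [-0.1]],
    [[5.0], [-5.0]],
])
order, scores = rank_by_consistency(samples)
```

Here the outlier ends up last in `order`, which is the candidate a system like this would flag as low quality and hand to the simulator.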
The training data scale is notable. 27,300 hours is roughly 3.1 years of continuous recording. That's a serious data collection effort, and it raises questions about whether this approach can scale to new embodiments without similar investments.
Wall-OSS-0.5, an open-source 4B parameter VLA, takes a different stance on the pretraining question. The authors argue that most VLA papers only report results after task-specific fine-tuning, which makes it impossible to know whether pretraining itself provides useful robot capability or just better initialization.
Their answer: it's both, sort of. The pretrained checkpoint achieves "non-trivial zero-shot real-robot behavior" on several tasks, including a held-out deformable manipulation task. After fine-tuning, it reaches 60.5% average task progress on 15 real-robot tasks, outperforming $\pi_{0.5}$ by 17.5 percentage points.
That 60.5% number deserves scrutiny. It's task progress, not success rate, and the distinction matters. A robot that gets 60% of the way through a task every time is still failing every time. The paper doesn't break down how many tasks were actually completed versus partially completed, which is frustrating. I'd want to see those numbers before drawing conclusions.
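A quick illustration of why the distinction matters. The per-trial numbers below are invented for the example, not taken from the Wall-OSS paper, and `summarize` is a hypothetical helper.

```python
def summarize(progress_per_trial):
    """Return (mean task progress, success rate) for a list of per-trial progress values in [0, 1]."""
    n = len(progress_per_trial)
    mean_progress = sum(progress_per_trial) / n
    success_rate = sum(1 for p in progress_per_trial if p >= 1.0) / n
    return mean_progress, success_rate

# Robot A stalls at the same point every time; Robot B finishes 6 of 10 trials outright.
robot_a = [0.6] * 10
robot_b = [1.0] * 6 + [0.0] * 4

progress_a, success_a = summarize(robot_a)
progress_b, success_b = summarize(robot_b)
```

Both robots report about 60% average task progress, but one never completes the task and the other completes it six times out of ten. A single progress figure cannot tell you which of the two you are looking at.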
ELAN4D tackles a different angle: what happens when deployment conditions don't match training? Their approach adds future robot keypoint predictions as auxiliary supervision during training, essentially teaching the model to anticipate where its own joints will be.
The results under distribution shift are the interesting part. On tasks with camera, background, and layout changes, ELAN4D shows "substantial gains" over baselines. The paper tests on LIBERO, LIBERO-Plus, RoboTwin2.0, and real-world tasks, which is a reasonable spread.
But here's what bothers me. The auxiliary prediction branch is discarded at inference time. The model doesn't actually predict future keypoints when deployed. It just learned better representations by being forced to predict them during training. That's clever, but it also means the model has no explicit mechanism for anticipating its own motion at deployment. It's learned a better prior, not a better inference procedure.
If I'm being honest, the picture that emerges from this week's papers is not encouraging for near-term deployment. The techniques work, often impressively, in controlled settings. But the gaps are still large:
Success rates on trained tasks hover around 60-70% for many systems
Moving objects remain largely unsolved outside of specialized approaches
Generalization to new environments requires either massive pretraining data or careful fine-tuning
The "right" execution horizon varies by task in ways that aren't predictable
From my time building hardware at Fanuc, I learned that a 70% success rate in the lab usually means 40% in production. Lighting changes, sensor drift, mechanical wear, operator variation. All of it compounds.
The research direction is clearly correct. World models, adaptive chunking, and better pretraining are all necessary pieces. But the field seems to be in a phase where each paper solves one piece of the puzzle while assuming the other pieces are solved. They're not.
The most honest assessment might be this: action chunking was a necessary hack to make VLA models trainable and deployable. Now we're discovering all the ways that hack breaks down. The papers this week are patches on patches, each one clever, each one addressing a real limitation. But we don't yet have a unified solution.
That's not a criticism. That's just where the field is. The question is whether these incremental improvements will compound into something robust enough for real deployment, or whether we need a more fundamental rethink of how robots should represent and execute multi-step actions.
I don't know the answer. Based on what I'm reading, neither does anyone else. But at least we're asking the right questions now.