VLA Models Are Getting Smarter, But the Hard Problems Remain Unsolved
A wave of new research tackles the gap between language understanding and robot control, with genuinely clever approaches that still leave fundamental questions open.
Photo credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
Vision-Language-Action models are the most exciting development in robot learning we've seen in years. There, I said it. Now let me spend the next 2,000 words explaining why that excitement should be heavily qualified.
The past few weeks have brought a flurry of papers pushing VLA architectures in new directions, and while the results are genuinely impressive in places, the field is still dancing around a core problem: these systems remain brittle in ways that matter for real-world deployment. The research is good. The hype is, predictably, getting ahead of it.
Let me walk through the most interesting recent work, because there's real substance buried under the benchmark numbers.
The paper that caught my attention first was π₀-EqM, which replaces the flow-matching decoder in Physical Intelligence's π₀ architecture with something called Equilibrium Matching. To be precise, this is an energy-based approach that treats action generation as finding a fixed point rather than running a fixed number of denoising steps. The results on RoboTwin jump from 40.4% to 50.2% average success across 19 tasks under matched compute budgets.
That's a meaningful improvement, but here's what I find more interesting than the numbers: the authors identify what they call the "stationarity-executability gap." Basically, they found that the relationship between how converged the model is and how well it actually performs is non-monotonic and task-dependent. Sometimes stopping early works better. Sometimes you need more iterations. This suggests that inference depth in iterative VLA control is part of policy design, not just a hyperparameter to tune. That's a genuinely novel framing.
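To make the distinction between the two decoding styles concrete, here is a toy sketch. Everything in it is an assumption for illustration: the quadratic energy, the step size, the tolerance, and the function names are hypothetical and are not taken from the π₀-EqM paper. It only contrasts running a fixed number of update steps with descending until the update becomes small, so that inference depth shows up as an explicit knob.

```python
def energy_grad(action, target):
    """Gradient of a toy quadratic energy 0.5 * ||action - target||^2."""
    return [a - t for a, t in zip(action, target)]


def descend(action, target, step_size=0.2, max_steps=100, tol=None):
    """Gradient descent on the toy energy.

    With tol=None this runs exactly max_steps updates (fixed-depth decoding).
    With a tolerance it stops once the gradient norm falls below tol
    (fixed-point style decoding), so the depth depends on the input.
    Returns the final action and the number of update steps taken.
    """
    steps = 0
    for _ in range(max_steps):
        grad = energy_grad(action, target)
        norm = sum(g * g for g in grad) ** 0.5
        if tol is not None and norm < tol:
            break
        action = [a - step_size * g for a, g in zip(action, grad)]
        steps += 1
    return action, steps


# Hypothetical 2-D "actions": one start close to the minimum, one far away.
target = [0.5, -0.3]
near_start = [0.45, -0.25]
far_start = [3.0, 2.0]

fixed_action, fixed_steps = descend(far_start, target, max_steps=10)
near_action, near_steps = descend(near_start, target, tol=1e-3)
far_action, far_steps = descend(far_start, target, tol=1e-3)
print(fixed_steps, near_steps, far_steps)
```

In this toy, the convergence-based variant spends fewer steps on the easy start than on the hard one, while the fixed-depth variant spends the same budget on both. It cannot reproduce the non-monotonic link between convergence and task success that the authors describe; it only shows where the depth decision sits.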
Then there's Agentic-VLA, which takes a different approach entirely. Instead of improving the core architecture, it wraps VLAs in an agentic training framework with three components: adaptive reward synthesis that decomposes tasks into learnable sub-goals, language-guided exploration using a critic model, and an experience memory for warm-starting on similar tasks. The headline numbers are impressive: +12.3% on long-horizon LIBERO tasks, +28.5% in 1-shot learning, and cross-task transfer going from 0% to 31.2% without task-specific demonstrations.
I know I'm being picky here, but the cross-task transfer result needs context. Going from 0% to 31.2% sounds dramatic until you remember that a 31.2% success rate means the robot fails more than two-thirds of the time. That's fine for research, genuinely, but let's not pretend this is deployment-ready.
V-VLAPS addresses a different limitation: VLAs are reactive, which means they can fail badly on long-horizon tasks or under distribution shift. The solution here is to train a lightweight value head on offline rollouts and use those predictions to guide Monte Carlo Tree Search. The results are modest but real: +6 percentage points on LIBERO-Object and +4 on LIBERO-10 with larger search budgets. The authors are refreshingly honest about limitations, noting that many hard failures occur at root-level timeouts where predicted values are weakly separated. In other words, when the value estimates aren't confident, the planning doesn't help much.
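The point about weakly separated values can be illustrated with a deliberately simplified sketch. This is not the V-VLAPS algorithm and not a full Monte Carlo Tree Search: it is a one-level, bandit-style selection rule at the root, with made-up value predictions, a hypothetical exploration constant, and no value backups. It only shows how a search budget gets allocated when predicted values are far apart versus nearly tied.

```python
import math


def ucb_score(value, visits, parent_visits, c=1.0):
    """Predicted value plus an exploration bonus that shrinks with visits."""
    return value + c * math.sqrt(math.log(parent_visits + 1) / (visits + 1))


def select_branch(values, visits, c=1.0):
    """Return the index of the child with the highest UCB-style score."""
    parent_visits = sum(visits)
    scores = [ucb_score(v, n, parent_visits, c) for v, n in zip(values, visits)]
    return max(range(len(scores)), key=scores.__getitem__)


def visit_counts(values, budget, c=1.0):
    """Spend a search budget at the root and count visits per branch."""
    visits = [0] * len(values)
    for _ in range(budget):
        visits[select_branch(values, visits, c)] += 1
    return visits


def separation(values):
    """Gap between the best and second-best predicted values."""
    ranked = sorted(values, reverse=True)
    return ranked[0] - ranked[1]


# Hypothetical value-head outputs for three candidate branches.
well_separated = [0.9, 0.2, 0.1]
weakly_separated = [0.51, 0.50, 0.49]

strong_visits = visit_counts(well_separated, budget=60)
weak_visits = visit_counts(weakly_separated, budget=60)
print(separation(well_separated), strong_visits)
print(separation(weakly_separated), weak_visits)
```

With a clear gap, most of the budget concentrates on the top branch. With near-tied values, the budget spreads almost evenly, so the search has little basis for committing to any branch, which is consistent with the failure pattern the authors report.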
The key takeaways from this batch of research:
Energy-based decoders can outperform flow-matching under matched compute, but optimal inference depth is task-dependent
Agentic wrappers with curriculum learning show substantial gains on adaptation and few-shot learning
Value-guided planning helps, but only when the value head can actually distinguish good from bad branches
Cross-task transfer remains difficult; even the best results are below 35% success without task-specific data
Sample efficiency is improving (2.4x faster convergence claimed for Agentic-VLA) but still requires substantial compute
Perhaps the most conceptually interesting paper in this batch is Language Movement Primitives, which takes a completely different approach to the VLA paradigm. Instead of training end-to-end models that map observations to actions, LMPs use VLMs to set parameters for Dynamic Movement Primitives, which are a classical robotics formulation for generating stable trajectories.
The insight here is that DMPs provide a small, interpretable parameter space that VLMs can actually reason about. Rather than asking a language model to output joint torques (which it fundamentally cannot understand), you ask it to specify high-level trajectory properties that get compiled down to motion. Across 31 real-world manipulation tasks, LMPs achieve 65% success compared to 35% for the best baseline.
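For readers who haven't met DMPs, here is a minimal sketch of the standard one-dimensional discrete formulation, integrated with explicit Euler steps. The `vlm_params` dictionary stands in for whatever a VLM would emit and is purely hypothetical, as are the gain values and function names; the LMP paper's actual parameterization may differ. The point is how small the interface is: a goal and a duration are enough to produce a smooth, stable trajectory.

```python
def rollout_dmp(start, goal, duration, dt=0.01, alpha=25.0, beta=6.25,
                alpha_x=3.0, forcing=None):
    """Integrate a 1-D discrete DMP with explicit Euler steps.

    Transformation system: tau * dz = alpha * (beta * (goal - y) - z) + f
                           tau * dy = z
    Canonical system:      tau * dx = -alpha_x * x
    Here tau is the duration. With forcing=None the forcing term f is zero,
    leaving a critically damped spring (beta = alpha / 4) that settles at
    the goal without oscillating.
    """
    y, z, x = start, 0.0, 1.0
    trajectory = [y]
    # Run for 1.5x the nominal duration so the motion has time to settle.
    for _ in range(int(1.5 * duration / dt)):
        # Optional learned shape term, scaled by phase x and by (goal - start).
        f = forcing(x) * x * (goal - start) if forcing is not None else 0.0
        dz = (alpha * (beta * (goal - y) - z) + f) / duration
        dy = z / duration
        dx = -alpha_x * x / duration
        z += dz * dt
        y += dy * dt
        x += dx * dt
        trajectory.append(y)
    return trajectory


# Hypothetical parameters a VLM might choose for "move the gripper 30 cm".
vlm_params = {"goal": 0.30, "duration": 2.0}
trajectory = rollout_dmp(start=0.0, goal=vlm_params["goal"],
                         duration=vlm_params["duration"])
print(trajectory[0], trajectory[-1])
```

Changing `goal` or `duration` reshapes the whole motion while the underlying dynamics stay stable, which is the kind of high-level control the paragraph above describes.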
The paper also makes an important point about why current VLA architectures struggle. The authors argue that end-to-end approaches force the action decoder to "implicitly relearn cognitive and perceptual capabilities already present in the pretrained VLM." That's a significant inefficiency, and it might explain why these models need so much data.
AVP (Action with Visual Primitives) takes a related but distinct approach. Instead of DMPs, it has the VLM emit "visual-primitive tokens" that condition a flow-matching action expert. The VLM infers the next-stage target and encodes it visually; the action expert handles the low-level control. Real-robot experiments show a 27.61% improvement over π₀.5 on pick-and-place tasks, with better data efficiency and spatial generalization.
Both of these papers are pushing against the same problem: the current dominant paradigm of mapping observations directly to actions in a single forward pass entangles too many different capabilities. Instruction comprehension, spatial understanding, and motor control are all competing for the same representational capacity. Decomposing these functions, whether through DMPs or visual primitives, appears to help.
It's worth noting that neither result has been replicated extensively yet. The LMP paper reports results on 31 tasks, which is solid, but these are all tabletop manipulation. AVP shows gains on pick-and-place specifically. Whether these approaches generalize to more complex manipulation, or to mobile robots, or to multi-step assembly, remains unclear.
One paper stands somewhat apart from the others: SOLE-R1, which uses a video-language reasoning model as the sole reward signal for online reinforcement learning. This is, in a way, the logical endpoint of the VLA paradigm: if language models can understand tasks, why not use them to evaluate whether robots are succeeding?
The approach is genuinely clever. SOLE-R1 performs per-timestep spatiotemporal chain-of-thought reasoning and produces dense estimates of task progress. The authors developed a large-scale synthesis pipeline to generate temporally grounded reasoning traces aligned with continuous progress supervision. The training combines supervised fine-tuning with RL from verifiable rewards.
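One common way to turn dense progress estimates into a reward signal is to reward the change in estimated progress at each step. The sketch below assumes that scheme; the paper's summary doesn't confirm that SOLE-R1 does exactly this, and the function name and the progress values are hypothetical.

```python
def progress_to_rewards(progress):
    """Turn per-timestep progress estimates in [0, 1] into step rewards.

    Each reward is the change in estimated progress since the previous
    step, so the rewards over an episode sum to the net progress made.
    """
    return [curr - prev for prev, curr in zip(progress, progress[1:])]


# Hypothetical progress estimates for a short episode, including a setback.
progress = [0.0, 0.1, 0.3, 0.25, 0.6, 1.0]
rewards = progress_to_rewards(progress)
print(rewards)
```

Under this assumed scheme, a setback shows up as a negative reward, and the policy's total return is only as trustworthy as the progress estimates themselves.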
The results are striking. SOLE-R1 enables zero-shot online RL from random initialization across four simulation environments and a real robot, succeeding on 24 unseen tasks. It substantially outperforms other vision-language reward models, including comparisons against GPT-5 and Gemini-3-Pro (I'm assuming these are the latest versions available to the researchers).
More importantly, the authors report that SOLE-R1 exhibits "markedly greater robustness to reward hacking." This is crucial. The failure mode of using language models as reward signals is that policies learn to exploit perceptual errors rather than actually solve tasks. A robot might learn to position objects in ways that look correct to the vision system without actually completing the manipulation. If SOLE-R1 genuinely resists this, that's a significant advance.
But I want to be careful here. The paper claims robustness to reward hacking, but the evaluation is necessarily limited. Reward hacking is an adversarial problem; the more capable the policy becomes, the more sophisticated its potential exploits. Whether SOLE-R1's robustness holds up under longer training or more capable base policies is something we don't know yet.
This batch of papers represents real progress, but I keep coming back to the same concerns:
Benchmark saturation. LIBERO appears in nearly every paper here. It's a useful benchmark, but when multiple papers report results in the 80-90% range on LIBERO-10, we're approaching the point where improvements might reflect overfitting to benchmark quirks rather than genuine capability gains. The field needs harder, more diverse evaluation settings.
Real-world validation. LMPs and AVP include real-robot experiments, which is good. But the tasks are relatively simple: tabletop manipulation, pick-and-place. The gap between simulation performance and real-world deployment remains substantial, and most papers don't address it.
Long-horizon tasks. Agentic-VLA shows improvements on "long-horizon" tasks, but in context, this means tasks that take maybe 10-20 steps. Real-world robotics often involves tasks with hundreds or thousands of steps, with complex error recovery requirements. We're not close to that.
Failure analysis. The V-VLAPS paper is unusually good about this, acknowledging that many failures occur when value estimates aren't well-separated. More papers should include this kind of honest analysis. When does your method fail? Why? What would fix it?
Compositional generalization. Can these systems handle novel combinations of known skills? AVP claims gains in "spatial-compositional generalization," but the evaluation is limited. This remains one of the hardest problems in robot learning.
(A methodological aside: I'd also like to see more standardization in how these papers report compute costs. "Matched compute budget" means different things in different contexts, and it's often unclear whether improvements come from better algorithms or just more GPU hours.)
Stepping back, what does this wave of research tell us about where VLA models are heading?
First, the architecture is not settled. Flow matching, equilibrium matching, visual primitives, DMP grounding: these are all being explored simultaneously. We're still in the phase where fundamental design choices are up for grabs.
Second, the pure end-to-end paradigm is showing cracks. Multiple papers argue, with evidence, that decomposing the problem helps. Whether that decomposition happens through DMPs, visual primitives, value heads, or agentic wrappers, the theme is consistent: trying to do everything in one forward pass is inefficient.
Third, online adaptation is becoming central. Both Agentic-VLA and SOLE-R1 focus on enabling robots to learn and adapt during deployment, not just execute pretrained behaviors. This is probably the right direction, because the real world will always contain situations that weren't in the training data.
Fourth, and perhaps most importantly, the gap between benchmark performance and real-world capability remains wide. A system that achieves 87% on LIBERO-10 might still fail catastrophically in a slightly different kitchen. We don't have good ways to measure or predict this gap.
I remain optimistic about VLA models as a paradigm. The combination of language understanding, visual perception, and action generation in a single framework is compelling, and the rapid progress we're seeing suggests there's more room to run. But I also think the field would benefit from more skepticism about headline numbers and more attention to the failure modes and limitations that these papers sometimes bury in appendices.
The hard problems (robust generalization, long-horizon planning, safe deployment, compositional skill transfer) haven't been solved. The research is chipping away at them, which is what research should do. But let's not mistake incremental progress for revolution.