Photo credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
Nineteen minutes. That's roughly how much online robot data, on average, one new system needed to reach perfect task performance on challenging manipulation tasks, from routing string lights to sinking pool balls into pockets. I had to read that number twice.
If you've been following the vision-language-action (VLA) space, you know the usual story: impressive demos, followed by the quiet admission that these models still fail a lot in the real world. The gap between "understands what you want" and "can actually do it" has been stubbornly wide. But a cluster of new papers suggests researchers are finally making serious progress on closing it.
Honestly, I'm cautiously optimistic. Let me walk through what's actually happening.
VLA models have gotten remarkably good at understanding language instructions and visual scenes. You can tell a robot "put the red block on the blue plate" and it genuinely understands what you mean. The problem is that understanding and executing are different skills, and current architectures often try to learn both at once.
One paper from researchers working on a framework called AVP (Action with Visual Primitives) puts it well: the action expert in most VLA systems "must implicitly relearn cognitive and perceptual capabilities already present in the pretrained VLM." That's inefficient. It's like hiring someone who already speaks French, then making them sit through French 101 again because your training program can't separate language skills from other competencies.
AVP's approach is to let the vision-language model do what it's good at (understanding the scene and figuring out what needs to happen next) while a separate action expert handles the actual motor control. In their real-robot tests on pick-and-place tasks, this improved success rates by 27.61% over the pi_0.5 baseline. That's a meaningful jump, though I should note this is on a specific set of tasks, and generalization to messier real-world scenarios remains unclear.
The memory problem is trickier than I initially thought. A new benchmark called RoboMME specifically tests how well VLA models handle tasks that require remembering things: counting repeated actions, tracking objects that get temporarily hidden, that sort of thing. The researchers built 16 manipulation tasks designed to stress-test temporal, spatial, object, and procedural memory.
What they found is, frankly, a bit sobering. They tested 14 different memory-augmented VLA variants, and the results showed that "the effectiveness of memory representations is highly task-dependent." In other words, there's no universal solution. What works for remembering how many times you've stirred something might not help you track an object that rolled behind a box.
This matters because real-world tasks are often long-horizon and history-dependent. If your robot can't remember what it already did, it's going to struggle with anything more complex than single-step actions.
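To make the counting case concrete, here is a toy sketch of why a policy that only sees the current frame struggles with "stir three times, then stop." Everything in it (the `"pot_with_spoon"` observation string, `memoryless_policy`, `CountingPolicy`, `run`) is invented for illustration; it is not a RoboMME task or any of the 14 variants the benchmark tested.

```python
def memoryless_policy(observation: str) -> str:
    # Sees only the current frame. A pot looks the same after every stir,
    # so the policy has no basis for ever choosing "stop".
    return "stir"


class CountingPolicy:
    """Hypothetical policy with the simplest possible memory: a counter."""

    def __init__(self, target_stirs: int) -> None:
        self.target_stirs = target_stirs
        self.stirs_done = 0

    def __call__(self, observation: str) -> str:
        if self.stirs_done >= self.target_stirs:
            return "stop"
        self.stirs_done += 1
        return "stir"


def run(policy, max_steps: int = 10) -> list:
    """Roll the policy out on an observation that never changes."""
    actions = []
    for _ in range(max_steps):
        action = policy("pot_with_spoon")
        actions.append(action)
        if action == "stop":
            break
    return actions


print(run(memoryless_policy))   # ten "stir" actions, never stops
print(run(CountingPolicy(3)))   # ['stir', 'stir', 'stir', 'stop']
```

A counter solves this one task, but it would do nothing for an object that rolled behind a box, which is the task-dependence the benchmark authors describe.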
Here's what the current research wave is tackling:
Replanning on the fly: RePlan-Bot introduces multi-level continuous replanning during task execution, with a high-level LLM-based auditor that adjusts sub-goals based on environmental feedback. On the ALFRED benchmark, it achieved state-of-the-art performance in both seen and unseen environments.
Grounding language in motion: Language Movement Primitives (LMPs) connect VLM reasoning to Dynamic Movement Primitive parameterization, essentially giving language models a way to specify actual robot trajectories. Across 31 real-world manipulation tasks, the approach hit 65% task success compared to 35% for the best baseline.
Efficient online adaptation: Agentic-VLA uses adaptive reward synthesis and language-guided exploration to help VLAs adapt to new environments without massive demonstration datasets. They saw +12.3% improvement on long-horizon tasks and +28.5% in 1-shot learning.
Sample-efficient fine-tuning: EXPO-FT is where that 19-minute number comes from. They achieved 30/30 successes on tasks like inserting a plug to light up string lights, within an average of 19.1 minutes of online robot data.
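For readers who haven't met Dynamic Movement Primitives, the sketch below rolls out a one-dimensional DMP in its standard textbook form: a spring-damper pulled toward a goal, plus a learned forcing term that fades as a phase variable decays. It is not the LMP paper's code. The function name `rollout_dmp`, the gain values, and the basis-width heuristic are illustrative assumptions. The point is only that a trajectory is fully described by a goal and a handful of weights, which is the kind of compact parameter set a language model could plausibly output.

```python
import math


def rollout_dmp(y0, goal, weights, duration=1.0, dt=0.001,
                alpha_z=25.0, beta_z=6.25, alpha_x=4.0):
    """Integrate a 1-D discrete DMP with Euler steps and return the positions."""
    n = len(weights)
    # Basis centers evenly spaced in time, i.e. exponentially spaced in phase.
    centers = [math.exp(-alpha_x * i / max(n - 1, 1)) for i in range(n)]
    # Width heuristic (an assumption): narrower bases where the phase moves slowly.
    widths = [n ** 1.5 / c / alpha_x for c in centers]

    y, z, x = y0, 0.0, 1.0  # position, scaled velocity, phase
    trajectory = [y]
    for _ in range(int(round(duration / dt))):
        psi = [math.exp(-h * (x - c) ** 2) for h, c in zip(widths, centers)]
        total = sum(psi)
        forcing = 0.0
        if total > 1e-12:
            mix = sum(p * w for p, w in zip(psi, weights)) / total
            forcing = mix * x * (goal - y0)  # vanishes as the phase x decays
        dz = (alpha_z * (beta_z * (goal - y) - z) + forcing) / duration
        dy = z / duration
        dx = -alpha_x * x / duration
        z, y, x = z + dz * dt, y + dy * dt, x + dx * dt
        trajectory.append(y)
    return trajectory


# Zero weights give a plain reach; non-zero weights reshape the path on the way.
straight = rollout_dmp(0.0, 1.0, [0.0] * 5)
shaped = rollout_dmp(0.0, 1.0, [0.0, 60.0, -40.0, 20.0, 0.0])
print(round(straight[-1], 3), round(shaped[300] - straight[300], 3))
```

Because the forcing term is multiplied by the decaying phase, the weights change how the motion gets to the goal without changing where it ends up, which is what makes the parameterization forgiving of imperfect weight choices.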
The 65% success rate on Language Movement Primitives is interesting to me. On one hand, that's nearly double the best baseline. On the other hand, 65% means the robot still fails about a third of the time. For research, that's great progress. For deployment in, say, a warehouse or a home, you probably need to be in the high 90s at minimum.
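A quick back-of-the-envelope calculation shows why the bar sits so high. The sketch assumes a job made of several tasks chained together, each succeeding independently at the same rate. That independence assumption is mine, not something the papers claim, and `chain_success` is a hypothetical helper.

```python
def chain_success(per_task_rate: float, n_tasks: int) -> float:
    """Probability that n_tasks independent tasks all succeed in a row."""
    return per_task_rate ** n_tasks


for rate in (0.65, 0.90, 0.99):
    print(f"{rate:.2f} per task -> {chain_success(rate, 5):.3f} over 5 chained tasks")
```

Under that assumption, 65% per task leaves about a 12% chance of finishing five tasks in a row, 90% leaves about 59%, and it takes 99% per task to clear 95% overall.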
I initially thought the EXPO-FT results were almost too good, but the tasks they tested (string lights, pool balls, flower insertion) are admittedly controlled scenarios. The paper's authors are releasing their code, which should let other researchers verify and build on the results. That's the right move.
You might be wondering: what's actually driving these improvements? A few patterns emerge across the papers:
Separation of concerns. Instead of one monolithic model doing everything, newer approaches split the work. The VLM handles understanding. A separate module handles action generation. Another handles memory or replanning. This modularity seems to help.
Better use of priors. Rather than training from scratch, systems like Agentic-VLA use "experience memory" to warm-start adaptation to similar tasks. EXPO-FT specifically focuses on fine-tuning pretrained VLAs rather than building new ones. The models we already have are pretty capable; the trick is getting them to translate that capability into reliable action.
Continuous adaptation. Several papers emphasize online learning and replanning. The robot isn't just executing a pre-computed plan; it's adjusting as it goes based on what it observes. This matters because the real world is messy and unpredictable.
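The first and third patterns can be shown together in a toy sketch: one function stands in for the VLM (deciding where to go), another for the action expert (taking one motor step), and a flag controls whether the plan is refreshed every step. The one-dimensional `World`, the function names, and the mid-episode bump are all invented for illustration and don't correspond to any of the papers' systems.

```python
from dataclasses import dataclass


@dataclass
class World:
    gripper: int = 0   # gripper position on a 1-D track
    target: int = 5    # where the block currently sits


def plan_subgoal(instruction: str, world: World) -> int:
    """Stand-in for the VLM: read the scene and decide where to go next."""
    return world.target


def act(world: World, subgoal: int) -> None:
    """Stand-in for the action expert: one bounded motor step toward the subgoal."""
    if world.gripper < subgoal:
        world.gripper += 1
    elif world.gripper > subgoal:
        world.gripper -= 1


def run_episode(world: World, replan_every_step: bool,
                disturb_at: int = 3, new_target: int = 8,
                max_steps: int = 20) -> bool:
    instruction = "reach the block"
    subgoal = plan_subgoal(instruction, world)  # initial plan
    for step in range(max_steps):
        if step == disturb_at:
            world.target = new_target  # the block gets bumped mid-episode
        if replan_every_step:
            subgoal = plan_subgoal(instruction, world)
        act(world, subgoal)
        if world.gripper == world.target:
            return True
    return False


print(run_episode(World(), replan_every_step=False))  # False: drives to the stale goal
print(run_episode(World(), replan_every_step=True))   # True: follows the moved block
```

The action step is identical in both runs; only the planning cadence differs, which is the appeal of keeping the two concerns in separate modules.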
I should be honest about what we don't know yet. Most of these results are on benchmarks like ALFRED, LIBERO, or controlled real-robot setups. How well they transfer to genuinely novel environments with unexpected obstacles, lighting changes, and objects the model has never seen is still an open question.
The RoboMME benchmark is a step toward more rigorous evaluation, but 16 tasks isn't exhaustive. And while EXPO-FT's 30/30 successes sound impressive, we don't know how those tasks were selected or whether there were failed attempts before the final evaluation runs.
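It's also worth asking how much a perfect 30/30 can statistically tell us. As a rough sketch, treat the 30 runs as independent trials of a single task (my simplification, not the paper's analysis) and ask for the lowest true success rate that would still produce 30 straight successes at least 5% of the time. The helper name `success_lower_bound` is hypothetical.

```python
def success_lower_bound(n_trials: int, confidence: float = 0.95) -> float:
    """Lowest true success rate p still consistent with n_trials straight
    successes at the given one-sided confidence: solves p**n_trials = 1 - confidence."""
    return (1.0 - confidence) ** (1.0 / n_trials)


print(f"{success_lower_bound(30):.3f}")  # prints 0.905
```

So under these assumptions, a flawless 30-run evaluation is still compatible with a policy that fails nearly one time in ten. More trials tighten the bound.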
There's also the compute question. Papers often don't fully disclose the resources required for training and inference. If these approaches need massive GPU clusters to work, that limits who can actually use them.
Where this leaves us: VLA models are getting meaningfully better at translating understanding into action. The 19-minute fine-tuning result from EXPO-FT and the roughly 27.6% improvement from AVP aren't incremental; they're the kind of jumps that suggest the field is figuring something out.
But we're not at reliable deployment yet. A robot that succeeds 65% of the time, or even 90% of the time, isn't ready for unsupervised operation in most real-world contexts. The gap between research benchmarks and messy reality remains wide.
What I find encouraging is that researchers seem to be tackling the right problems: memory, replanning, efficient adaptation, and better separation between understanding and action. These are the bottlenecks that have held back embodied AI for years.
I think we'll look back at this period as when VLA models started becoming practical, not just impressive. But honestly, it's too early to say whether the current approaches will scale to the reliability levels real-world deployment demands. The next year or two should tell us a lot.