Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
Twelve point three percent. That's the improvement on long-horizon tasks that a new framework called Agentic-VLA claims over existing methods, and honestly, that number stopped me cold when I read it. Not because it's huge (it's not earth-shattering), but because of what it represents: reinforcement learning crawling back from the grave in robotics.
I've been covering tech long enough to watch entire methodologies get declared dead and then resurrect themselves when nobody's looking. Neural networks in the 90s, anyone? So when DeepMind's blog quietly announced that their internal benchmarks show offline RL methods reaching parity with imitation learning, my first thought was: here we go again.
For the past few years, the robotics world has been absolutely drunk on imitation learning. The pitch was seductive: why bother with complex reward engineering when you can just show a robot what to do? Collect demonstrations, train a model, deploy. The kids building these systems (and I say that with affection, mostly) grew up in an era where data was cheap and compute was cheaper.
But here's the thing about imitation learning that the hype cycle conveniently forgot to mention: it doesn't generalize well. You train a robot to pick up a red cup in a lab with specific lighting and a specific table height, and then you put it in a warehouse with fluorescent lights and suddenly it's useless. The robot learned to mimic, not to understand.
This isn't a new problem, by the way. I remember writing nearly identical paragraphs about expert systems in the 80s, about how they'd fail the moment you stepped outside their narrow domain. Call me old-fashioned, but I think there's something to be said for systems that actually learn principles rather than just copying homework.
The arXiv paper introducing Agentic-VLA is dense (aren't they all), but the core ideas are worth unpacking. The researchers built a framework with three main innovations, and I'll try to explain them without putting you to sleep.
First, there's Adaptive Reward Synthesis. Instead of a human sitting down and manually designing reward functions (which is tedious and error-prone), the system generates and adjusts its own rewards based on what the robot can currently do. It breaks complex tasks into smaller goals, basically creating its own curriculum. This is clever because it takes aim at one of RL's biggest historical problems: reward hacking, where robots find bizarre loopholes in poorly designed reward functions.
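To make the curriculum idea concrete, here's a toy sketch in Python. To be clear, this is my own illustration and not the paper's method: the class name, the subgoal names, and the promotion rule (move to a harder subgoal once the recent success rate clears a threshold) are all assumptions on my part.

```python
from collections import deque


class AdaptiveCurriculum:
    """Hypothetical sketch: reward the current subgoal, and move on to a
    harder one once the robot succeeds often enough. Not the paper's code."""

    def __init__(self, subgoals, window=20, promote_at=0.8):
        self.subgoals = list(subgoals)      # ordered easy -> hard (assumed)
        self.level = 0                      # index of the active subgoal
        self.window = window                # how many recent attempts to track
        self.promote_at = promote_at        # success rate needed to advance
        self.recent = deque(maxlen=window)  # rolling record of outcomes

    def current_subgoal(self):
        return self.subgoals[self.level]

    def reward(self, achieved):
        """1.0 if the set of achieved subgoals includes the active one."""
        return 1.0 if self.current_subgoal() in achieved else 0.0

    def record(self, success):
        """Log one attempt and promote to the next subgoal if warranted."""
        self.recent.append(bool(success))
        window_full = len(self.recent) == self.window
        rate = sum(self.recent) / len(self.recent)
        is_last = self.level == len(self.subgoals) - 1
        if window_full and rate >= self.promote_at and not is_last:
            self.level += 1
            self.recent.clear()  # start fresh statistics for the new subgoal


curriculum = AdaptiveCurriculum(["reach", "grasp", "lift"], window=5)
for _ in range(5):
    curriculum.record(success=True)
print(curriculum.current_subgoal())
```

The point is only that the reward target moves with the robot's competence instead of being fixed by hand up front.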
Second, Language-Guided Exploration. Rather than having the robot flail around randomly trying things (which is how a lot of older RL worked, and it was painful to watch), a critic model provides structured guidance. Think of it as the difference between a toddler randomly grabbing at everything versus one who's been told "the toy is near the blue box."
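Here's roughly what "guided" versus "flailing" looks like in code. This is a sketch under heavy assumptions: the critic below is a stand-in distance function I made up, not a language model, and sampling actions through a softmax over critic scores is my choice of illustration, not something the paper specifies.

```python
import math
import random


def guided_choice(candidates, critic_score, temperature=0.5, rng=random):
    """Sample one candidate, favouring those the critic scores highly.

    A high temperature approaches uniform random exploration; a low one
    approaches always taking the critic's top pick.
    """
    scores = [critic_score(c) for c in candidates]
    top = max(scores)
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp((s - top) / temperature) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical setup: the hint "the toy is near the blue box" has already
# been turned into a target location by some upstream model (assumed).
blue_box = (2, 3)


def critic(position):
    # Closer to the hinted location means a higher score.
    return -math.dist(position, blue_box)


rng = random.Random(0)
places_to_try = [(0, 0), (2, 2), (5, 5)]
picks = [guided_choice(places_to_try, critic, rng=rng) for _ in range(1000)]
print(picks.count((2, 2)), "of 1000 tries went to the spot nearest the hint")
```

The robot still explores, it just spends most of its attempts where the hint says to look.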
Third, Experience Memory. The system stores successful approaches and retrieves them when facing similar tasks. This sounds obvious, but it's surprisingly hard to implement well. The reported 28.5% improvement in one-shot learning suggests they've figured something out here.
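The store-and-retrieve loop is simple enough to sketch. Everything here is an assumption for illustration: I'm using word counts and cosine similarity as a crude stand-in for whatever learned representation the real system uses, and the task strings and plans are invented.

```python
import math
from collections import Counter


def embed(text):
    """Crude stand-in for a learned embedding: a bag of lowercase words."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ExperienceMemory:
    """Hypothetical sketch: remember plans that worked, reuse the closest."""

    def __init__(self):
        self.entries = []  # list of (task embedding, plan)

    def store(self, task, plan):
        self.entries.append((embed(task), plan))

    def retrieve(self, task, min_similarity=0.3):
        """Return the plan of the most similar stored task, or None if
        nothing stored is similar enough."""
        query = embed(task)
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best is None or cosine(query, best[0]) < min_similarity:
            return None
        return best[1]


memory = ExperienceMemory()
memory.store("pick up the red cup", ["reach", "grasp", "lift"])
memory.store("open the drawer", ["reach", "pull"])
print(memory.retrieve("pick up the blue cup"))
print(memory.retrieve("fold laundry"))
```

The hard part in practice is the piece I've faked: deciding what counts as a "similar" task.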
The cross-task transfer numbers are what really caught my attention, though. Going from 0% to 31.2% without task-specific demonstrations isn't revolutionary (31.2% is still failing most of the time!), but it's the direction of travel that matters. Zero to something is always harder than something to better.
So why is RL making a comeback in 2025 after years of imitation learning dominance? A few factors seem to be converging.
The offline RL methods that DeepMind mentions have matured significantly. You no longer need a robot to spend thousands of hours in the real world learning from scratch; you can train on logged data and then fine-tune. This addresses the sample efficiency problem that made RL impractical for physical systems.
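If "train on logged data" sounds abstract, here's the smallest version I can write: tabular Q-learning that only ever replays a fixed log and never touches an environment. This is a sketch, not what DeepMind runs. The three-state world and the log are invented, and practical offline methods typically add some form of conservatism to cope with actions the log never tried, which I've left out.

```python
from collections import defaultdict


def offline_q_learning(log, actions, gamma=0.9, alpha=0.5, sweeps=200):
    """Learn action values purely from logged transitions.

    log: iterable of (state, action, reward, next_state, done) tuples.
    No environment is ever stepped; we just sweep over the log repeatedly.
    """
    q = defaultdict(float)  # unseen (state, action) pairs default to 0.0
    for _ in range(sweeps):
        for state, action, reward, next_state, done in log:
            if done:
                target = reward
            else:
                target = reward + gamma * max(q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (target - q[(state, action)])
    return q


def greedy(q, state, actions):
    """Pick the action with the highest learned value in this state."""
    return max(actions, key=lambda a: q[(state, a)])


# Hypothetical log from a three-cell corridor: cells 0 and 1, goal at cell 2.
ACTIONS = ["left", "right"]
LOG = [
    (0, "right", 0.0, 1, False),
    (1, "right", 1.0, 2, True),   # reaching the goal pays 1.0
    (1, "left", 0.0, 0, False),
    (0, "left", 0.0, 0, False),
]

q = offline_q_learning(LOG, ACTIONS)
print(greedy(q, 0, ACTIONS), greedy(q, 1, ACTIONS))
```

Once values like these exist, the fine-tuning step is where a real robot's limited time actually gets spent.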
Vision-language models have also gotten good enough to provide meaningful guidance. The "Language-Guided Exploration" in Agentic-VLA wouldn't have been possible five years ago because the language models weren't there yet. Now you can actually describe what you want in natural language and have a system that roughly understands.
And frankly, the limitations of pure imitation learning have become impossible to ignore. Companies that went all-in on demonstration collection are discovering that you can't collect your way to general intelligence. At some point, the robot needs to actually figure things out.
Now look, I should be clear about what we don't know yet. The Agentic-VLA results are on benchmarks, specifically LIBERO and RoboTwin 2.0. Benchmarks are useful, but they're not the real world. I've seen plenty of systems ace benchmarks and then fall apart in deployment; it's practically a rite of passage in this field.
The 2.4x faster convergence claim is nice, but convergence to what? If you're converging to 31.2% cross-task transfer, you're still failing more than two-thirds of the time. The paper is honest about this (refreshingly so), but I worry the headline numbers will get separated from the context.
DeepMind's blog post is also notably light on specifics. "Internal benchmarks showing parity or improvement" is the kind of claim that makes me want to see the actual data. But what do I know, maybe they're saving it for a proper paper.
It's also too early to say whether this represents a genuine paradigm shift (ugh, I hate that phrase) or just another swing of the methodology pendulum. The history of AI is littered with approaches that looked promising for a few years and then hit walls nobody anticipated.
If RL really is becoming practical for robotics again, the implications are significant. Companies that built their entire stack around imitation learning might need to rethink their approach. The massive demonstration datasets that everyone's been collecting might become less valuable relative to good reward design and exploration strategies.
This is the self-driving car hype cycle all over again, in a way. Remember when everyone thought the problem was mostly solved around 2016? Then edge cases multiplied and timelines stretched and the whole industry had to get humble. Robotics might be entering a similar phase of recalibration.
The smart money, I think, is on hybrid approaches. Use imitation learning to get a reasonable starting point, then use RL to adapt and improve. Agentic-VLA's Experience Memory component suggests the researchers are thinking along these lines too.
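For the curious, the hybrid recipe fits in a few dozen lines if you shrink the problem down to a toy. The sketch below is mine, not anything from the paper: "imitation" is just copying the most common demonstrated action per situation, "RL" is a simple epsilon-greedy value update, and the lab/warehouse rewards are made up to echo the red-cup example from earlier.

```python
import random
from collections import Counter, defaultdict


def clone_from_demos(demos):
    """Imitation stand-in: copy the most common demonstrated action per state."""
    counts = defaultdict(Counter)
    for state, action in demos:
        counts[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


def finetune(policy, actions, reward_fn, episodes=500, epsilon=0.2,
             alpha=0.1, seed=0):
    """RL stand-in: start from the cloned policy, then adjust from reward."""
    rng = random.Random(seed)
    # Initial values trust the demonstrations: cloned action 1.0, others 0.0.
    value = {(s, a): (1.0 if policy[s] == a else 0.0)
             for s in policy for a in actions}
    states = list(policy)
    for _ in range(episodes):
        s = rng.choice(states)
        if rng.random() < epsilon:
            a = rng.choice(actions)  # occasionally try something else
        else:
            a = max(actions, key=lambda b: value[(s, b)])
        # Nudge the estimate toward the reward actually observed.
        value[(s, a)] += alpha * (reward_fn(s, a) - value[(s, a)])
    return {s: max(actions, key=lambda b: value[(s, b)]) for s in states}


# Hypothetical scenario: every demo used a top grasp, but in the warehouse
# only a side grasp actually works.
ACTIONS = ["top_grasp", "side_grasp"]
DEMOS = [("lab", "top_grasp")] * 3 + [("warehouse", "top_grasp")]


def reward_fn(state, action):
    works = {"lab": "top_grasp", "warehouse": "side_grasp"}
    return 1.0 if works[state] == action else 0.0


cloned = clone_from_demos(DEMOS)
tuned = finetune(cloned, ACTIONS, reward_fn)
print(cloned)
print(tuned)
```

The cloned policy gives the robot a sensible default everywhere; the reward signal is what lets it notice that the default is wrong in the warehouse.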
For what it's worth, I'm cautiously optimistic. Not because I think we're about to see general-purpose robots (we're not), but because the field seems to be maturing past the "one methodology to rule them all" phase. That's usually when real progress happens.
If you want to argue about any of this, my email's on the about page. I actually read it, unlike certain messaging platforms I could name.