The New Trick in Robot Learning: Train Without the Robot
A wave of research papers suggests we might finally crack the robot data problem by ditching robots entirely during training. I've seen this kind of hype before, but this time the numbers are interesting.
Researchers are getting serious about training robots without actually using robots, and the results are good enough that I'm paying attention. A new paper called Phantom demonstrates manipulation policies trained entirely from human video demonstrations, with no teleoperation required, that reach up to 92% success rates on real hardware with zero fine-tuning.
This is the kind of claim that would have gotten you laughed out of the room five years ago. Now it's one of half a dozen papers this month pushing the same basic idea: robot data is expensive and hard to get, human data is everywhere, so let's figure out how to bridge the gap.
The robot learning field has been banging its head against the data wall for years. Everyone knows the problem: you need demonstrations to train policies, but collecting robot demonstrations requires robots, operators, and controlled environments. It doesn't scale. The kids building foundation models for language had the entire internet to work with. Robot researchers have been stuck with whatever they could teleoperate in their own labs.
So the field has been hunting for workarounds. Simulation was supposed to be the answer for a while (and still is, for some applications), but sim-to-real transfer remains finicky. Domain randomization helps but doesn't solve everything. The new approach is more direct: just use human video and figure out how to make it robot-compatible.
Phantom does this by converting human demonstrations into robot observation-action pairs. They estimate hand poses from video, inpaint the human arm out of the frame, and overlay a rendered robot arm instead. The visual domains get aligned without ever touching a real robot during training. It's clever, maybe too clever, but the 92% success rate on deformable object manipulation is hard to argue with.
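To make the "observation-action pairs" part concrete, here is a minimal sketch of how per-frame hand estimates could be turned into robot-style actions. This is not Phantom's code: the HandPose and RobotAction types, the pinch threshold, and the idea of reading the gripper state off the thumb-to-index distance are all my own assumptions for illustration, and the inpainting and robot-rendering steps are left out entirely.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class HandPose:
    """Hypothetical per-frame hand estimate (metres, camera frame)."""
    wrist: Vec3
    thumb_tip: Vec3
    index_tip: Vec3

@dataclass
class RobotAction:
    """Hypothetical action format: end-effector translation plus gripper state."""
    delta_xyz: Vec3
    gripper_closed: bool

def hand_poses_to_actions(poses: List[HandPose],
                          pinch_threshold: float = 0.03) -> List[RobotAction]:
    """Turn consecutive hand poses into end-effector deltas plus a gripper flag."""
    actions = []
    for current, nxt in zip(poses, poses[1:]):
        # The wrist displacement between frames stands in for the end-effector motion.
        delta = tuple(b - a for a, b in zip(current.wrist, nxt.wrist))
        # Assumption: a small thumb-to-index gap means the hand is pinching.
        closed = dist(nxt.thumb_tip, nxt.index_tip) < pinch_threshold
        actions.append(RobotAction(delta, closed))
    return actions

# Two frames: the wrist moves 10 cm along x while the fingers pinch together.
demo = [
    HandPose((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.1, 0.0, 0.0)),
    HandPose((0.1, 0.0, 0.0), (0.1, 0.0, 0.0), (0.11, 0.0, 0.0)),
]
actions = hand_poses_to_actions(demo)
```

A demonstration of N frames yields N - 1 actions, each of which would be paired with the corresponding edited video frame as the observation.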
Not everyone is abandoning robot data entirely. Some researchers are trying to squeeze more value out of limited robot demonstrations, and there's interesting work here too.
MonoDuo tackles bimanual manipulation, which is one of those capabilities everyone agrees robots need but few labs can actually train because two-arm setups are expensive and rare. Their solution is to collect single-arm demonstrations with a human filling in for the missing arm, then swap roles and synthesize the full bimanual policy. They're reporting 70% success rates on zero-shot deployment to unseen bimanual configurations, which is, well, it's not bad for a problem this hard.
The approach feels a bit like a hack, call me old-fashioned, but hacks that work are still hacks that work. If you can train bimanual policies using single-arm robots that most labs already have, that's a genuine accessibility improvement for the field.
Here's where things get more interesting, and honestly more concerning. A paper called VE2VF (terrible name, but bear with me) points out that vision-enabled policies tend to overfit to the visual conditions they saw during training. This is not news to anyone who's deployed a robot in the real world, but the solution they propose is unusual: train a vision-enabled teacher, then distill its knowledge into a vision-free student that relies only on pose, twist, and wrench sensing.
They're achieving 95% success on the NIST assembly benchmark after about 50 minutes of real-world training, including generalization to 8 unseen task variants. The whole thing is done without domain randomization or data augmentation, which is notable because those techniques are usually considered essential for robustness.
I'm genuinely uncertain what to make of this. On one hand, throwing away vision entirely seems like giving up on a capability robots obviously need. On the other hand, if vision is making policies brittle in ways that are hard to fix, maybe the pragmatic move is to use it for training and then discard it. It's too early to say whether this is a dead end or a genuine insight.
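For readers who haven't run into teacher-student distillation, a toy version looks something like this. It is a sketch under heavy assumptions, not the paper's method: the teacher is a hard-coded stand-in rather than a trained vision policy, the student is a single weight, and I'm assuming the wrench signal is proportional to the misalignment the teacher sees. The point is only that the student is trained on the teacher's outputs and never receives the visual input.

```python
import random

def teacher_action(vision_offset):
    """Stand-in for a trained vision-enabled teacher: push against the seen offset."""
    return -0.8 * vision_offset

def distill_student(num_samples=500, lr=0.05, seed=0):
    """Fit a one-weight student that maps wrench alone to the teacher's action."""
    rng = random.Random(seed)
    weight = 0.0
    for _ in range(num_samples):
        offset = rng.uniform(-1.0, 1.0)      # misalignment, visible only to the teacher
        wrench = 2.0 * offset                # toy assumption: contact force tracks misalignment
        target = teacher_action(offset)      # the teacher's output is the training label
        error = weight * wrench - target
        weight -= lr * 2.0 * error * wrench  # gradient step on squared imitation error
    return weight

student_weight = distill_student()
print(f"student weight: {student_weight:.4f}")
```

In this toy setup the student can match the teacher exactly because the wrench carries the same information as the image. Whether that holds on a real assembly task is precisely the open question.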
Vision-Language-Action models are the hot thing right now, and several papers are trying to make them more reliable without just scaling up parameters.
ProgVLA is a compact model, only 0.1 billion parameters, that maintains an explicit representation of task progress over extended horizons. They're using offline reinforcement learning to train auxiliary "progress heads" that give the policy an internal estimate of how far along it is in a task. The idea is that knowing where you are in a task helps you figure out what to do next, which sounds obvious but apparently isn't built into most VLA architectures.
The results are competitive with much larger models on standard benchmarks, and they actually beat the bigger models on long-horizon tasks. That's the kind of result that makes you wonder whether we've been throwing compute at the wrong problems.
VLA-Pro takes a different approach, storing task-specific LoRA adapters as "procedural memories" during training and retrieving relevant ones at inference time. They're reporting up to 207% relative improvement in simulation and a jump from 5.8% to 65.0% success rate in real-world tests, a gain of 59.2 percentage points. Those numbers are dramatic enough that I want to see independent replication, but the basic idea of modular, retrievable task knowledge seems sound.
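The store-and-retrieve idea is easy to sketch, with the caveat that I'm guessing at the mechanics. The AdapterMemory class, the task embeddings used as keys, and the choice of cosine similarity are all hypothetical, and the "adapters" here are just names rather than real LoRA weights.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class AdapterMemory:
    """Hypothetical store that maps task embeddings to adapter identifiers."""

    def __init__(self):
        self._entries = []  # list of (key_embedding, adapter_name) pairs

    def store(self, key, adapter_name):
        self._entries.append((list(key), adapter_name))

    def retrieve(self, query, top_k=1):
        """Return the names of the top_k adapters whose keys best match the query."""
        ranked = sorted(self._entries,
                        key=lambda entry: cosine(query, entry[0]),
                        reverse=True)
        return [name for _, name in ranked[:top_k]]

memory = AdapterMemory()
memory.store([1.0, 0.0], "open_drawer")
memory.store([0.0, 1.0], "pour_water")
best = memory.retrieve([0.9, 0.1])
```

At inference time the retrieved adapter would be loaded into the base model before acting. The appeal is that adding a new skill means adding an entry, not retraining everything.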
One paper that deserves attention is BORA, which tackles what might be the hardest practical problem in robot learning: how do you do reinforcement learning in the real world without breaking things?
Dexterous manipulation with high-dimensional hand control is particularly brutal for RL because exploration is expensive and dangerous. BORA uses an offline-to-online approach with human-in-the-loop intervention: it trains a critic offline, then adapts carefully online with humans ready to step in when things go wrong. They're reporting a 33% absolute increase in success rate under standard conditions and 43% improvement on unseen object generalization.
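The human-in-the-loop part of that recipe can be sketched as a data-collection loop in which a supervisor may replace any action the policy proposes. This is my own simplification, not BORA's algorithm: the function names, the one-dimensional toy environment, and the rule the toy "human" follows are all made up, and the offline critic training is not shown.

```python
def collect_with_interventions(policy, human_override, step, state, num_steps):
    """Roll out a policy while letting a human replace unsafe actions.

    human_override(state, proposed) returns a corrected action, or None to accept.
    Every transition is logged with a flag saying whether the human stepped in.
    """
    transitions = []
    for _ in range(num_steps):
        proposed = policy(state)
        correction = human_override(state, proposed)
        action = proposed if correction is None else correction
        next_state, reward = step(state, action)
        transitions.append({
            "state": state,
            "action": action,
            "reward": reward,
            "next_state": next_state,
            "intervened": correction is not None,
        })
        state = next_state
    return transitions

# Toy 1-D task: stay near 0. The policy always drifts right; the human pulls it back.
def toy_policy(state):
    return 1.0

def toy_human(state, proposed):
    return -1.0 if abs(state + proposed) > 2.0 else None

def toy_step(state, action):
    next_state = state + action
    return next_state, -abs(next_state)

log = collect_with_interventions(toy_policy, toy_human, toy_step, 0.0, 5)
```

The logged intervention flags are the useful part. They mark exactly where the policy would have gone wrong, which is the kind of signal an online update can learn from without the robot ever having to make the mistake.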
This is the kind of unglamorous but necessary work that actually moves the field forward. Real-world RL for dexterous manipulation has been stuck for years because nobody figured out how to make exploration safe enough. If BORA's approach generalizes, it could unlock a lot of capability that's been bottlenecked on this specific problem.
I've been covering tech long enough to recognize a pattern when I see it. The field is clearly pivoting away from the assumption that robot learning requires massive robot datasets. Multiple independent groups are converging on variations of the same insight: use whatever data you can get (human video, single-arm demonstrations, simulation) and figure out how to transfer it.
This is reminiscent of how computer vision evolved, actually. For years everyone assumed you needed hand-labeled datasets for everything, then transfer learning from ImageNet changed the game, then self-supervised learning changed it again. Robot learning seems to be going through a similar evolution, just compressed into a shorter timeframe.
The numbers in these papers are encouraging, but I'd want to see how they hold up in messier real-world conditions. Lab benchmarks are one thing; actual deployment is another. The 92% success rate from Phantom is impressive, but it's on tasks the researchers chose, in environments they controlled. That's not a criticism, it's just reality.
What I find most promising is the diversity of approaches. We're not seeing one dominant paradigm yet; we're seeing multiple groups attacking the data problem from different angles. That's usually a sign that the field is genuinely making progress rather than just iterating on a single idea until it stops working.
But what do I know? If you want to argue about any of this, my email's on the about page.