The End of Robot Data Collection as We Know It? Two Papers Suggest Human Videos Might Be Enough
New research shows robots learning manipulation skills directly from watching humans, no expensive teleoperation required. I'm cautiously optimistic, but let's look at what's actually happening here.
5 hours ago · 4 min read
I've been covering humanoid robotics long enough to develop a healthy skepticism about "breakthrough" claims. So when two papers dropped this week suggesting robots can learn dexterous manipulation just by watching human videos, my first instinct was to look for the catch.
Honestly? I'm still looking. But the results are compelling enough that I think we need to talk about what's happening here.
Here's the thing about robot learning that doesn't get enough attention: there's basically no data. Language models train on the entire internet. Vision models have billions of images. Robotics? We're scraping together datasets through painstaking teleoperation, where a human operator controls a robot arm for hours to collect maybe a few hundred demonstrations of a single task.
This is why progress has been so uneven. It's not that the algorithms are bad. It's that we're trying to teach robots to interact with the physical world using datasets that would make a 2015 image classifier laugh.
Two new arXiv papers attack this problem from slightly different angles, and both arrive at a similar conclusion: maybe we've been overthinking the embodiment gap.
The first paper, Ego-Pi, builds on Physical Intelligence's π₀.₅ model (which, if you're not following this space closely, is one of the more capable vision-language-action models out there). The core idea is deceptively simple: fine-tune the model on egocentric human video, the kind of first-person footage you'd get from someone wearing a GoPro while cooking or assembling furniture.
What caught my attention is the claim that human data enables robots to "learn new task semantics and compose existing skills into novel behaviors without corresponding robot data." That's a big deal if it holds up. It means you could potentially teach a robot to fold laundry by showing it videos of humans folding laundry, without ever teleoperating the robot through the motion.
I should note that the paper doesn't provide extensive quantitative benchmarks in the abstract, so I'm reserving judgment on how well this actually works in practice. The research website has more details, but the gap between "works in controlled demos" and "works in your kitchen" remains substantial.
The second paper is where things get really interesting. Dexterous Point Policy takes a different approach: instead of trying to map human motions directly to robot motions, it extracts 3D keypoints from both human hands and robot hands, focusing on wrist and fingertip positions.
The insight here is that at the keypoint level, human and robot behaviors actually align pretty well. The fingertips trace similar trajectories when picking up a mug, whether they belong to your hand or to a robot's.
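To make the keypoint idea concrete, here's a minimal sketch of what a shared representation could look like. To be clear, this is my own illustration, not code from the paper: the `HandKeypoints` class, the two helper functions, and the toy coordinates are all hypothetical. It stores a wrist and fingertip positions for each hand, expresses the fingertips relative to the wrist, and measures how far apart two hands are in that shared space.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]


@dataclass
class HandKeypoints:
    """Hypothetical container: wrist plus fingertip positions (metres)."""
    wrist: Point
    fingertips: List[Point]


def relative_to_wrist(hand: HandKeypoints) -> List[Point]:
    """Express each fingertip as an offset from the wrist position."""
    wx, wy, wz = hand.wrist
    return [(x - wx, y - wy, z - wz) for (x, y, z) in hand.fingertips]


def mean_keypoint_distance(a: HandKeypoints, b: HandKeypoints) -> float:
    """Average Euclidean distance between matched wrist-relative fingertips."""
    pa, pb = relative_to_wrist(a), relative_to_wrist(b)
    if len(pa) != len(pb):
        raise ValueError("hands must have the same number of fingertips")
    return sum(math.dist(p, q) for p, q in zip(pa, pb)) / len(pa)


# Toy data (made up): a human pinch and a robot pinch in different places.
human = HandKeypoints(
    wrist=(0.0, 0.0, 0.0),
    fingertips=[(0.10, 0.02, 0.0), (0.10, -0.02, 0.0)],
)
robot = HandKeypoints(
    wrist=(0.5, 0.5, 0.2),
    fingertips=[(0.60, 0.52, 0.2), (0.60, 0.48, 0.2)],
)

# Same pinch shape relative to the wrist, so the distance is close to zero.
print(mean_keypoint_distance(human, robot))
```

The two hands sit in different parts of the workspace, yet once you subtract the wrist position they look nearly identical. A real system would also need to handle orientation, differing finger counts, and noisy 3D estimates, none of which this sketch attempts.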
The results are striking. On a suite of real-robot tasks spanning pick-and-place and tool use, their method achieved 75.0% success. A state-of-the-art VLA baseline? 1.0%.
That's not a typo. One percent.
Now, I initially thought this had to be some kind of cherry-picked comparison or an unfair baseline. But the paper claims strong generalization to unseen scenarios, including multi-object environments and novel object categories. If that's accurate (and honestly, I'd want to see independent replication), this is a genuinely significant result.
You might be wondering if this solves the robot data problem entirely. It doesn't. A few caveats worth noting:
First, both papers focus on manipulation tasks with dexterous hands. It's unclear how well these approaches transfer to other domains, like locomotion or whole-body control.
Second, the 75% success rate is impressive but not deployment-ready. In real-world applications, you often need 95%+ reliability before a system becomes useful rather than frustrating.
Third, and I'll admit this is a gap in my own understanding, we don't have great metrics for measuring how "novel" the generalization actually is. A robot picking up a slightly different mug isn't the same as a robot figuring out how to use a tool it's never seen before.
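On the second caveat, a bit of back-of-the-envelope arithmetic shows why the gap between 75% and 95% matters more than it looks. The sketch below assumes a chore is a chain of sub-tasks that each succeed independently at the same rate. That assumption is mine, not something either paper claims, and the function name `chained_success` is made up for illustration.

```python
def chained_success(per_task: float, n_tasks: int) -> float:
    """Probability that n independent sub-tasks all succeed.

    Assumes every sub-task has the same success rate and that failures
    are independent, which is a simplification for illustration only.
    """
    return per_task ** n_tasks


for rate in (0.75, 0.95):
    for n in (1, 5, 10):
        print(f"per-task {rate:.0%}, {n:2d} steps -> {chained_success(rate, n):.1%}")
```

Under that assumption, a five-step chore completes roughly 24% of the time at 75% per step and roughly 77% of the time at 95% per step. Real failures are rarely independent, so treat these numbers as intuition rather than prediction.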
What excites me about these papers isn't any single result. It's the convergence. Multiple research groups are independently discovering that the embodiment gap, this supposedly fundamental barrier between human and robot learning, might be more porous than we thought.
If you can learn manipulation from human videos, suddenly the data problem looks very different. There are millions of hours of cooking tutorials, craft videos, repair guides, and everyday footage on YouTube. That's not quite "internet-scale" for robotics, but it's orders of magnitude more data than we've been working with.
I'm not ready to declare victory here. The gap between research demos and real-world deployment is littered with the corpses of promising approaches. But for the first time in a while, I'm seeing a plausible path from "robots that work in the lab" to "robots that work in your home."
Whether we get there in two years or ten remains unclear. But the direction of travel seems right.