Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
82.9 percent.
That's the out-of-distribution success rate one research team achieved by rethinking how we train robots to learn from visual feedback. The baseline they improved on? 17.5 percent. I've been covering embodied AI for a while now, and jumps like that don't happen often.
This week, three separate papers dropped on arXiv that all circle the same fundamental question: how do we make reinforcement learning agents behave less like optimization machines and more like, well, us? It's a question I initially thought was mostly about aesthetics (who cares if a robot moves weirdly if it gets the job done?), but after reading through these papers, I'm starting to think it might be more foundational than that.
Here's the thing about most RL agents: they're really good at maximizing whatever reward signal you give them, but they do it in ways that can be, honestly, kind of alien. They find shortcuts humans would never take. They develop movement patterns that work but look nothing like natural motion. And when you try to interpret what they're doing or predict their next move, good luck.
The team behind HiMAQ (Hierarchical Macro Action Quantization) is attacking this head-on. Their approach encodes human demonstrations into what they call "macro actions" using two levels of vector quantization. The lower level maps actions to fine-grained clusters; the higher level aggregates those clusters into broader action patterns. The result is an agent that doesn't just succeed at tasks but does so in ways that look recognizably human.
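To make the two-level idea concrete, here's a minimal sketch of hierarchical quantization using plain nearest-neighbor lookups. To be clear, this is my own toy illustration, not the HiMAQ implementation: the codebooks, the 2-D actions, the two-step segment length, and the function names are all assumptions I made up for the example.

```python
import math

def nearest(vec, codebook):
    """Return the index of the codebook entry closest to vec (Euclidean)."""
    return min(range(len(codebook)), key=lambda i: math.dist(vec, codebook[i]))

# Hypothetical low-level codebook: four fine-grained clusters of 2-D actions.
LOW_CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Hypothetical high-level codebook: each macro action is stored as the
# concatenated low-level centroids of a two-step segment.
HIGH_CODEBOOK = [
    (0.0, 0.0, 1.0, 0.0),  # macro 0: centroid 0 followed by centroid 1
    (0.0, 1.0, 1.0, 1.0),  # macro 1: centroid 2 followed by centroid 3
]

def quantize_segment(actions):
    """Quantize a two-step action segment at both levels."""
    # Level 1: snap each raw action to its nearest fine-grained cluster.
    low_ids = [nearest(a, LOW_CODEBOOK) for a in actions]
    # Level 2: describe the segment by its centroids, then snap that
    # description to the nearest macro action.
    flat = tuple(x for i in low_ids for x in LOW_CODEBOOK[i])
    macro_id = nearest(flat, HIGH_CODEBOOK)
    return low_ids, macro_id

# A noisy demonstration segment: roughly "stay put, then move right".
low_ids, macro_id = quantize_segment([(0.1, -0.05), (0.9, 0.1)])
```

In a real system both codebooks would be learned from demonstration data rather than hand-written, but the lookup structure is the part worth seeing: noisy actions collapse into a small vocabulary, and short runs of that vocabulary collapse into named patterns.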
You might be wondering why this matters beyond making robots less creepy to watch. The researchers argue it's about interpretability and reliability. If an agent's behavior aligns with human expectations, you can actually predict what it's going to do next. That seems important for, say, a humanoid working alongside people in a warehouse.
The cross-embodiment problem
But making robots learn from humans hits a wall pretty quickly: humans and robots don't have the same bodies. We don't see the same things. We can't execute the same movements. This is the cross-embodiment gap, and it's been a persistent headache for anyone trying to train robots on human demonstration videos.
HARP-VLA takes a clever approach here. Instead of trying to directly map human actions to robot actions (which, tbh, doesn't work great), they use limited paired human-robot demonstrations as "bridges" while pulling in way more unpaired video data for scale. The key insight is training a robot-adapted visual encoder that pushes robot representations toward human semantics while keeping them distinct enough to be useful.
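One way to picture "push toward, but keep distinct" is a margin-based alignment loss. The sketch below is my own simplification and should not be read as HARP-VLA's actual objective: the function name, the Euclidean distance, and the margin value are all assumptions chosen for illustration.

```python
import math

def hinge_alignment_loss(robot_emb, human_emb, margin=0.2):
    """Penalize a paired robot embedding for being far from its human
    counterpart, but stop penalizing once it is within `margin`."""
    distance = math.dist(robot_emb, human_emb)
    # Inside the margin the loss is zero, so nothing forces the two
    # embeddings to collapse onto the same point.
    return max(0.0, distance - margin)

# Hypothetical paired embeddings (2-D only to keep the numbers readable).
far = hinge_alignment_loss((1.0, 0.0), (0.0, 0.0))   # distance 1.0
near = hinge_alignment_loss((0.1, 0.0), (0.0, 0.0))  # distance 0.1
```

The far pair gets a positive loss and would be pulled together during training, while the near pair is already inside the margin and gets left alone.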
Their results on the CALVIN benchmark and real-world manipulation tasks show meaningful improvements, including a 7.1% success rate gain over their strongest baseline in real-world tests. That's not nothing, though I should note these are still controlled lab environments. How this translates to messier real-world conditions remains unclear.
What Feat2Go adds to the picture
The third paper, Feat2Go, tackles a different piece of the puzzle: how do you give robots useful feedback during long, multi-step tasks when you can only really tell them "success" or "failure" at the end?
Their solution involves deriving continuous progress signals from a pretrained visual world model. Basically, the system measures how similar the current visual state is to subgoal states and clusters episodes into semantic stages. This gives the robot a sense of "am I getting warmer?" throughout a task, not just at the finish line.
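Here's a rough sketch of what a similarity-based progress signal could look like. This is my own assumption-heavy toy, not Feat2Go's method: I'm standing in hand-written vectors for world-model embeddings, using cosine similarity, assigning the stage by nearest subgoal, and inventing the `(stage + closeness) / num_stages` formula purely to show the "getting warmer" idea.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def progress(state, subgoals):
    """Return (stage, progress in [0, 1]) for one visual state embedding."""
    sims = [cosine(state, g) for g in subgoals]
    # Stage = index of the most similar subgoal (first one wins on ties).
    stage = max(range(len(sims)), key=sims.__getitem__)
    # Closeness to that subgoal, clipped so it never goes negative.
    closeness = max(0.0, sims[stage])
    return stage, (stage + closeness) / len(subgoals)

# Hypothetical subgoal embeddings for a three-stage task, in task order.
SUBGOALS = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

stage, value = progress((0.0, 1.0, 0.0), SUBGOALS)
```

A state sitting exactly on the second subgoal lands in stage 1 with progress of two thirds, and the value climbs toward 1.0 as states approach the final subgoal. That gives a learner something denser to work with than a single success flag at the end.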
The numbers here are striking:
OpenVLA-OFT improved from 17.5% to 82.9% average out-of-distribution success on ManiSkill3
Retained 96.9% in-distribution performance (so it's not sacrificing the easy cases for generalization)
Achieved 88.8% average success rate on RoboTwin 2.0 in domain-randomized settings
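For a sense of scale, the out-of-distribution jump works out to about 65.4 percentage points, or roughly 4.7 times the baseline. The quick check below uses only the two figures reported above; the variable names are mine.

```python
baseline, improved = 17.5, 82.9  # OOD success rates (%) reported above

# Absolute gain in percentage points.
absolute_gain = round(improved - baseline, 1)

# Relative improvement as a multiple of the baseline.
relative_factor = round(improved / baseline, 2)
```

I find it worth keeping those two numbers separate, since "points gained" and "times better" can tell quite different stories when the baseline is low.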
I should be clear that these are simulation benchmarks, not real-world deployments. The gap between sim and reality is, well, it's a whole thing. But as simulation results go, these are pretty compelling.
What I think is actually happening here
Reading these three papers together, a pattern emerges. The field seems to be moving away from the pure "maximize reward" paradigm toward something more structured. More... human-shaped, I guess?
There's an implicit argument running through all of this work: that human behavior isn't just one possible solution to manipulation tasks, it's actually a pretty good prior. Humans have spent millions of years optimizing for the kinds of physical interactions robots need to master. Maybe there's information encoded in how we move that's worth preserving, not just the outcomes we achieve.
I'm not entirely convinced this is always true. Humans do plenty of suboptimal things. We have biomechanical constraints that robots don't share. But for tasks where robots need to work alongside us, predict our behavior, or be interpretable to human operators, the case for human-likeness seems strong.
The questions I'm still sitting with
A few things remain unclear to me after digging through this work:
First, how do these approaches compose? Could you use HARP-VLA's alignment with Feat2Go's progress estimation and HiMAQ's action quantization? The papers don't really address this, and I suspect integration isn't trivial.
Second, there's a selection bias issue I keep bumping into. The demonstrations these systems learn from come (presumably) from competent humans doing tasks well. What happens when the human demonstrations include errors or suboptimal strategies? Does the robot learn those too?
Third, and this is more philosophical, at what point does "human-like" become a constraint rather than an asset? If a robot could complete a task faster or more reliably with non-human motion patterns, should we force it to move like us anyway?
I don't have clean answers to any of these. But I think the direction is interesting. We've spent years asking "can robots do this task?" Maybe the better question is "can robots do this task in a way we can understand and trust?"
That's a harder bar to clear. But it might be the one that actually matters for getting these systems out of the lab.