The Hidden Crisis in Robot Learning: Your Training Data Is Lying to You
Two new papers expose a problem most robotics labs don't want to talk about: the data we're using to train manipulation policies is riddled with invisible failures and physically impossible trajectories.
Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
Robot learning has a data quality problem, and it's worse than most researchers admit.
This isn't a controversial claim if you've spent time in the weeds of imitation learning. But two papers crossing my desk this week lay out the issue with unusual clarity: one from a team studying false success detection in simulation, another from researchers trying to make Universal Manipulation Interface (UMI) data actually usable for Vision-Language-Action models. Read together, they paint a picture of a field building increasingly sophisticated policies on foundations that are, to put it plainly, somewhat rotten.
Let me be clear about what I mean. The problem isn't that we lack data. The problem is that the data we have is contaminated in ways that are genuinely difficult to detect, and we've been papering over this with bigger models and more compute rather than addressing it directly.
The first paper, "How Visible Are Silent Manipulation Failures?", posted on arXiv, asks a deceptively simple question: when a robot thinks it succeeded at a task but actually failed, how much of the information needed to catch that error is present in the robot's own sensor data?
This matters because imitation learning pipelines typically rely on the robot's own success checks to label training episodes. If the robot says it transferred the cube successfully, that episode gets a positive label. If the robot is wrong (and robots are wrong more often than you'd hope), you've just trained your policy to replicate a failure.
The researchers built a testbed using two bimanual ALOHA tasks: cube transfer and peg insertion. Rather than manually corrupting labels, they induced failures through environment perturbations and used privileged simulator state to establish ground truth. Then they compared proprioceptive detectors (joint positions, velocities) against vision-based detectors.
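To make the setup concrete, here is a minimal sketch of the kind of audit this design allows: compare the robot's self-reported success against privileged ground truth, then ask how many of the false successes a detector catches. The `Episode` fields, the `audit` helper, and the toy episodes are my own illustrative assumptions, not the paper's code or data.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    reported_success: bool  # what the robot's own success check said
    true_success: bool      # privileged simulator ground truth
    flagged: bool           # did a (hypothetical) detector flag this episode?


def audit(episodes):
    """Return (contamination rate among reported successes,
    detector recall on false successes)."""
    positives = [e for e in episodes if e.reported_success]
    false_successes = [e for e in positives if not e.true_success]
    contamination = len(false_successes) / len(positives) if positives else 0.0
    caught = sum(e.flagged for e in false_successes)
    recall = caught / len(false_successes) if false_successes else 1.0
    return contamination, recall


# Toy episodes, invented purely for illustration.
episodes = [
    Episode(reported_success=True, true_success=True, flagged=False),
    Episode(reported_success=True, true_success=False, flagged=True),
    Episode(reported_success=True, true_success=False, flagged=False),
    Episode(reported_success=False, true_success=False, flagged=True),
    Episode(reported_success=True, true_success=True, flagged=False),
]
contamination, recall = audit(episodes)
print(f"contamination={contamination:.2f}, recall={recall:.2f}")
```

The contamination rate is the share of positively labeled episodes that are actually failures; the recall is the share of those a detector recovers. Both numbers require ground truth, which is exactly what real datasets lack.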
The results are, well, complicated in ways that matter.
For cube transfer, false successes were almost entirely recoverable from joint data alone. The proprioceptive signal was sufficient. For peg insertion, proprioception caught only a fraction, and vision was needed to close the gap. This suggests the detectability of false successes is highly task-dependent, which is not great news for anyone hoping for a universal solution.
But here's the finding that stuck with me (I know I'm being picky here, but this seems important): the proprioceptive separability they measured depended on velocity differences far below any realistic sensor noise floor. The authors themselves note this should be read as an "optimistic upper bound that a noiseless simulator inflates." In other words, even the good results may not transfer to real hardware.
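A toy simulation shows why a noise floor matters here. This is a sketch under my own assumptions, not the paper's method: failures differ from successes by a tiny velocity offset `delta`, the detector is a simple midpoint threshold, and sensor noise is Gaussian. The specific values of `delta` and `noise_std` are invented for illustration.

```python
import random


def detector_accuracy(delta, noise_std, n=2000, seed=0):
    """Accuracy of a midpoint-threshold detector separating two classes
    whose true velocities differ by `delta`, under Gaussian sensor noise."""
    rng = random.Random(seed)
    threshold = delta / 2
    correct = 0
    for i in range(n):
        is_failure = (i % 2 == 1)              # alternate classes evenly
        true_velocity = delta if is_failure else 0.0
        measured = true_velocity + rng.gauss(0.0, noise_std)
        predicted_failure = measured > threshold
        correct += (predicted_failure == is_failure)
    return correct / n


delta = 1e-4  # hypothetical class separation (rad/s)
clean = detector_accuracy(delta, noise_std=0.0)          # noiseless simulator
noisy = detector_accuracy(delta, noise_std=100 * delta)  # noise far above delta
print(f"noiseless accuracy={clean:.3f}, noisy accuracy={noisy:.3f}")
```

With zero noise the two classes separate perfectly. Once the noise is two orders of magnitude larger than the offset, the same detector is barely better than a coin flip, which is the sense in which a noiseless simulator can inflate separability.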
It's worth noting that this is a simulation study with only two tasks. The authors are admirably transparent about this limitation, but it means we don't know yet how these findings generalize. What we do know is that the problem exists and that naive success checking is probably contaminating datasets across the field.
The second paper, VISTA, tackles a different kind of contamination. Universal Manipulation Interface (UMI) has been genuinely useful for scalable data collection. You don't need robot-specific teleoperation hardware; humans can demonstrate tasks more naturally. The problem is that many human demonstrations are, to put it bluntly, physically impossible for robots to execute.
The VISTA authors identify two critical mismatches. First, UMI uses wrist-mounted fisheye cameras, which produce images with severe radial distortion that are out-of-distribution for pretrained vision-language models. Second, human-collected trajectories routinely violate kinematic limits, cause collisions, or exceed controller bandwidth. You're essentially teaching the robot to attempt movements it cannot physically perform.
Their solution has three components. UMI-VQA is a large-scale visual question answering dataset specifically designed for wrist-mounted fisheye observations, intended to adapt VLM representations to this visual regime. A physical validation pipeline scores trajectories for continuity, self-collision risk, and execution fidelity before training. And a two-stage co-training recipe combines vision-language grounding with action prediction on validated data.
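To give a feel for what a pre-training validation step might look like, here is a deliberately simple sketch. It is not VISTA's pipeline: the `validate_trajectory` function, the joint limits, and the speed limit are hypothetical, and it only checks joint limits and per-step speed as a crude stand-in for continuity. Self-collision and execution fidelity would need a robot model and a controller, which this example leaves out.

```python
def validate_trajectory(waypoints, dt, pos_limits, max_speed):
    """Check a joint-space trajectory against simple physical limits.

    waypoints:  list of joint-position lists (rad), sampled every `dt` seconds
    pos_limits: list of (low, high) bounds, one pair per joint
    max_speed:  maximum allowed joint speed (rad/s), same for every joint
    """
    in_limits = all(
        lo <= q <= hi
        for point in waypoints
        for q, (lo, hi) in zip(point, pos_limits)
    )
    # Finite-difference speed between consecutive waypoints; a large value
    # indicates a jump the controller could not track.
    peak_speed = 0.0
    for prev, curr in zip(waypoints, waypoints[1:]):
        for q0, q1 in zip(prev, curr):
            peak_speed = max(peak_speed, abs(q1 - q0) / dt)
    return {
        "within_joint_limits": in_limits,
        "within_speed_limit": peak_speed <= max_speed,
        "peak_speed": peak_speed,
    }


# Hypothetical two-joint arm and two toy trajectories.
limits = [(-1.5, 1.5), (-2.0, 2.0)]
smooth = [[0.0, 0.0], [0.05, 0.1], [0.1, 0.2]]
jumpy = [[0.0, 0.0], [1.0, 0.1], [1.6, 0.2]]

smooth_report = validate_trajectory(smooth, dt=0.1, pos_limits=limits, max_speed=2.0)
jumpy_report = validate_trajectory(jumpy, dt=0.1, pos_limits=limits, max_speed=2.0)
print(smooth_report)
print(jumpy_report)
```

The smooth trajectory passes both checks, while the jumpy one both leaves the first joint's range and demands a speed far beyond the limit. A filter like this would drop or down-weight the second demonstration before it ever reached training.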
The results suggest this actually works. VISTA outperforms several strong baselines including π₀.₅, LingBot-VLA, and Wall-X on both simulation and real-world tasks. More interestingly, they show that physical validation scores are predictive of deployment success. This is the kind of finding that, if it replicates, could change how labs curate training data.
I want to be careful here, though. The paper claims "significant" improvements, but how well they generalize depends on the specific numbers and task distributions. The release of their validation pipeline, dataset, and pretrained model should help others verify these claims.
Read together, these papers suggest something that has been obvious to practitioners but underacknowledged in the literature: data quality is probably the bottleneck for robot learning, not model architecture or scale.
This is somewhat uncomfortable for a field that has spent the last few years scaling up. Bigger models, more parameters, more data. The implicit assumption has been that quantity eventually overcomes quality issues. These papers suggest that's not quite right, or at least that there are failure modes that don't get averaged away.
The false success paper shows that even with perfect privileged labels (which we never have in practice), detecting failures from robot observations is task-dependent and potentially limited by sensor noise. The VISTA paper shows that even abundant human demonstrations can be worse than useless if they teach physically impossible behaviors.
Neither paper fully solves the problems it identifies. The false success detection work is limited to simulation and two tasks. VISTA's physical validation pipeline requires defining what counts as a valid trajectory, which involves assumptions that may not generalize. But both point toward a research direction that seems underexplored: systematic methods for auditing and filtering training data before it enters the learning pipeline.
A few open questions that these papers raise but don't answer:
First, how prevalent are false successes in real deployed systems? The simulation study gives us a framework for thinking about this, but we don't have good estimates of contamination rates in actual training datasets. Someone should do that audit.
Second, can physical validation be made task-agnostic? VISTA's approach requires defining validity criteria for specific behaviors. A more general method for flagging physically implausible trajectories would be valuable.
Third, what's the interaction between data quality and model scale? It's possible that larger models are more robust to label noise, or that they're actually more sensitive to it. The answer probably depends on the type of noise. This seems like it matters for deciding where to invest compute.
Finally, both papers focus on manipulation. Locomotion, navigation, and multi-robot coordination presumably have their own data quality failure modes. Characterizing those would be useful.
The broader point is this: robot learning is maturing past the phase where any data is good data. The field needs systematic methods for data quality assessment, and these two papers, despite their limitations, are steps in that direction. Whether the community actually prioritizes this work over the next scaling paper remains unclear. But the problems aren't going away.