The Robots Need to Fail More, and We Need to Watch

New research confirms what anyone who's trained a robot arm already knows: you can't teach good behaviour without showing what bad looks like.

By Robert "Bob" Macintosh

7 hours ago · 3 min read

Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).

Three papers crossed my desk this week, all circling the same problem I've been griping about since my Kuka days: we're training robots on success stories and then acting surprised when they can't handle failure.

Look, here's the thing. When I was at Kuka, we had this internal dataset we called the "blooper reel." Thousands of hours of robot arms dropping parts, missing welds, crashing into fixtures. Nobody wanted to publish it. Nobody wanted to admit their system screwed up that many times. But that data was gold for calibration and error recovery. We'd pull it out whenever a new engineer needed to understand failure modes.

Now there's a position paper from Stanford and Berkeley researchers making the same argument I made in conference hallways fifteen years ago: embodied reward models need bad behaviour data. They tested three state-of-the-art systems and found they systematically over-reward unsafe interactions, poor execution, and what they call "shortcut strategies" that only superficially complete tasks. The robots look like they're succeeding when they're actually cheating.

I called my old colleague Frank at Siemens about this. He laughed. Said they've got terabytes of failure data sitting on drives that'll never see daylight because legal won't sign off on releasing anything that shows their products misbehaving.
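
To make the over-rewarding complaint concrete, here is a minimal Python sketch of how you might put a number on it, assuming you have episodes that a human has labelled as real successes or failures. The `Episode` fields, the 0.5 threshold, and the `false_positive_rate` helper are all illustrative assumptions, not anything taken from the position paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    reward_score: float     # reward model output, assumed to lie in [0, 1]
    truly_succeeded: bool   # human label: did the robot really do the task?


def false_positive_rate(episodes: Sequence[Episode], threshold: float = 0.5) -> float:
    """Share of genuinely failed episodes that the reward model scores as a success."""
    failures = [e for e in episodes if not e.truly_succeeded]
    if not failures:
        # Without labelled failures there is nothing to measure against.
        raise ValueError("need at least one labelled failure episode")
    over_rewarded = sum(1 for e in failures if e.reward_score >= threshold)
    return over_rewarded / len(failures)
```

The catch is in the `ValueError`: with no failure episodes in the dataset, the number can't be computed at all, which is the blooper-reel argument in one line.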

The False Positive Problem

A second paper, Demo2Reward, tackles a related issue: vision-language models used as reward functions produce too many false positives without careful prompt engineering. Their solution uses 3 to 10 demonstration trajectories to tune the reward model before training even starts. No additional compute during policy learning, which matters when you're running real hardware.

This is basically what we did manually in the 2010s. You'd have a senior technician watch the first hundred cycles and flag the ones that "looked wrong" even when the system scored them as successful. Now they're automating that intuition. Progress, I suppose, though it feels a bit like we're rediscovering wheels.
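
The general idea of tuning a reward prompt on a handful of demonstrations before training can be sketched roughly as follows. This is not Demo2Reward's algorithm. It is a simple stand-in that assumes each demo carries a success or failure label (the article's summary doesn't say that), and that `scorer` is some hypothetical wrapper around a vision-language model returning a score in [0, 1]. All names here are made up for illustration.

```python
from typing import Callable, Sequence, Tuple

Trajectory = Sequence[str]                   # placeholder for frames or frame IDs
Scorer = Callable[[str, Trajectory], float]  # hypothetical VLM wrapper: (prompt, trajectory) -> score
Demo = Tuple[Trajectory, bool]               # (trajectory, did it really succeed?)


def pick_prompt(prompts: Sequence[str], demos: Sequence[Demo],
                scorer: Scorer, threshold: float = 0.5) -> str:
    """Pick the prompt with the fewest false positives on the demos.

    Ties are broken by the fewest false negatives.
    """
    def errors(prompt: str) -> Tuple[int, int]:
        false_pos = false_neg = 0
        for trajectory, succeeded in demos:
            predicted_success = scorer(prompt, trajectory) >= threshold
            if predicted_success and not succeeded:
                false_pos += 1
            elif succeeded and not predicted_success:
                false_neg += 1
        return (false_pos, false_neg)

    return min(prompts, key=errors)
```

All the scoring happens once, up front, on 3 to 10 trajectories. Nothing in this selection step runs during policy learning, which is the property the paper is after.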

The third paper, Feat2Go, goes further. They're using visual world models to create continuous progress targets, measuring patch-level similarity to subgoal states. That gives the policy a dense signal instead of a sparse pass/fail flag. On ManiSkill3 benchmarks, they improved out-of-distribution success from 17.5% to 82.9% while keeping in-distribution performance at 96.9%. On RoboTwin 2.0, they hit 88.8% success in domain-randomized settings.

Those are good numbers. I'll be honest, better than I expected from a reinforcement learning approach on manipulation tasks.
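
For a feel of what "patch-level similarity to subgoal states" could mean as a number, here is a toy Python sketch. It assumes each image has already been turned into a list of per-patch feature vectors by some visual encoder, and it uses mean cosine similarity rescaled to [0, 1] as the progress value. Both choices are assumptions for illustration. Feat2Go's actual features and similarity measure may well differ.

```python
import math
from typing import Sequence

Patch = Sequence[float]  # one feature vector per image patch (encoder is assumed, not shown)


def cosine(a: Patch, b: Patch) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def progress(current: Sequence[Patch], subgoal: Sequence[Patch]) -> float:
    """Mean patch-wise cosine similarity to the subgoal, rescaled from [-1, 1] to [0, 1]."""
    if not current or len(current) != len(subgoal):
        raise ValueError("need the same non-zero number of patches in both images")
    mean_sim = sum(cosine(c, g) for c, g in zip(current, subgoal)) / len(current)
    return (mean_sim + 1.0) / 2.0
```

A frame whose patches match the subgoal scores near 1, and one that looks nothing like it scores lower, so every step gets a graded number rather than a single verdict at the end of the episode.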

Sources

Feat2Go: Visual Feature-Grounded Value Estimation for Embodied Reinforcement Learning · arXiv, cs.RO (Robotics)
Position: Good Embodied Reward Models Need Bad Behavior Data · arXiv, cs.RO (Robotics)
From Demonstrations to Rewards: Test-Time Prompt Optimization for VLM Reward Models · arXiv, cs.RO (Robotics)
