VLA models are succeeding at tasks while failing at safety, and we're only now measuring it

New benchmarks reveal that up to 56% of 'successful' robot manipulation tasks involve safety violations we weren't even tracking.

By Sarah Williams

1 hour ago · 4 min read

Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).

Most coverage of vision-language-action models focuses on the same metric: did the robot complete the task? Pick up the cup, insert the connector, open the drawer. Success rates are climbing, papers are celebrating, and honestly, I was ready to write another "VLAs are getting better" piece.

Then I read two papers that made me reconsider what "better" even means here.

The gap nobody was measuring

Here's something that should probably concern us more than it does: when the researchers behind SafeVLA-Bench went back and re-evaluated existing VLA benchmarks with actual safety metrics, they found that high-performing models were routinely completing tasks in ways that would be, well, problematic in the real world.

The numbers are striking. On tabletop manipulation tasks, models with high success rates still had 13 to 15 percent of episodes flagged as unsafe. On kitchen tasks in RoboCasa-365, between 36 and 56 percent of successful rollouts violated at least one safety requirement.

Let me be clear about what "unsafe" means here. We're talking about excessive contact force, disturbing objects the robot wasn't supposed to touch, destabilizing whatever it was holding, or the robot colliding with itself. These aren't edge cases. These are things that would matter immediately if you deployed these models outside a simulator.

The researchers introduced two metrics I think we'll be hearing more about: Succ-But-Unsafe (SBU), which tracks the fraction of rollouts that succeed while violating safety, and Violation Severity Index (VSI), a bounded score for how bad the worst violation was. Neither of these existed in standard VLA benchmarks before. We were literally not tracking this.
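To make the two metrics concrete, here is a minimal sketch of how they could be computed from logged rollouts. It is not the benchmark's actual implementation: the `Rollout` record, the assumption that each violation arrives with a severity already normalized to [0, 1], the choice of denominator for SBU, and the "worst severity, clipped" stand-in for VSI are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rollout:
    """Hypothetical log record for one evaluation episode."""
    success: bool
    # Assumed: one severity per detected violation, pre-normalized to [0, 1].
    violation_severities: List[float] = field(default_factory=list)

    @property
    def unsafe(self) -> bool:
        return len(self.violation_severities) > 0


def succ_but_unsafe(rollouts: List[Rollout], over_successes: bool = False) -> float:
    """Fraction of rollouts that succeed while violating safety.

    The denominator is an assumption: all rollouts by default, or only
    successful ones when over_successes=True.
    """
    sbu_count = sum(1 for r in rollouts if r.success and r.unsafe)
    if over_successes:
        denom = sum(1 for r in rollouts if r.success)
    else:
        denom = len(rollouts)
    return sbu_count / denom if denom else 0.0


def worst_violation_severity(rollout: Rollout) -> float:
    """Bounded stand-in for VSI: the worst severity, clipped to [0, 1]."""
    if not rollout.violation_severities:
        return 0.0
    return min(1.0, max(0.0, max(rollout.violation_severities)))


# Made-up example data, not results from any paper.
rollouts = [
    Rollout(success=True),
    Rollout(success=True, violation_severities=[0.2, 0.7]),
    Rollout(success=False, violation_severities=[0.9]),
    Rollout(success=True, violation_severities=[0.4]),
]

sbu_all = succ_but_unsafe(rollouts)
sbu_succ = succ_but_unsafe(rollouts, over_successes=True)
worst = worst_violation_severity(rollouts[1])
```

The point of the sketch is the gap it exposes: a plain success rate would count three of these four made-up rollouts as wins, while two of those three carry a violation that the success metric never sees.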
