VLA models are succeeding at tasks while failing at safety, and we're only now measuring it
New benchmarks reveal that up to 56% of 'successful' robot manipulation tasks involve safety violations we weren't even tracking.
Most coverage of vision-language-action (VLA) models focuses on a single metric: did the robot complete the task? Pick up the cup, insert the connector, open the drawer. Success rates are climbing, papers are celebrating, and honestly, I was ready to write another "VLAs are getting better" piece.
Then I read two papers that made me reconsider what "better" even means here.
The gap nobody was measuring
Here's something that should concern us more than it does: when the researchers behind SafeVLA-Bench went back and re-evaluated existing VLA benchmarks with explicit safety metrics, they found that high-performing models were routinely completing tasks in ways that would be, well, problematic in the real world.
The numbers are striking. On tabletop manipulation tasks, models with high success rates still had 13 to 15 percent of episodes flagged as unsafe. On kitchen tasks in RoboCasa-365, between 36 and 56 percent of successful rollouts violated at least one safety requirement.
Let me be clear about what "unsafe" means here. We're talking about excessive contact force, disturbing objects the robot wasn't supposed to touch, destabilizing whatever it was holding, or the robot colliding with itself. These aren't edge cases. These are things that would matter immediately if you deployed these models outside a simulator.
The researchers introduced two metrics I think we'll be hearing more about: Succ-But-Unsafe (SBU), which tracks the fraction of rollouts that succeed while violating safety, and Violation Severity Index (VSI), a bounded score for how bad the worst violation was. Neither existed in standard VLA benchmarks before. We simply were not tracking this.
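To make the bookkeeping concrete, here is a minimal Python sketch of how metrics like these could be tallied from a batch of rollouts. To be clear about what is assumed: the Rollout class, its field names, and every function name below are hypothetical, and the severity values are assumed to arrive already normalized to roughly the 0 to 1 range. The article only tells us VSI is a bounded score for the worst violation, so worst_violation_score is a stand-in that clamps the maximum severity, not the benchmark's actual formula.

```python
from dataclasses import dataclass, field


@dataclass
class Rollout:
    """One evaluation episode. All field names here are hypothetical."""
    success: bool
    # Severity of each recorded violation, assumed pre-normalized to about [0, 1].
    violation_severities: list[float] = field(default_factory=list)

    @property
    def unsafe(self) -> bool:
        # An episode counts as unsafe if it recorded any violation at all.
        return len(self.violation_severities) > 0


def succ_but_unsafe(rollouts: list[Rollout]) -> float:
    """Fraction of ALL rollouts that succeed while violating safety."""
    if not rollouts:
        return 0.0
    return sum(r.success and r.unsafe for r in rollouts) / len(rollouts)


def unsafe_share_of_successes(rollouts: list[Rollout]) -> float:
    """Fraction of SUCCESSFUL rollouts with at least one violation."""
    successes = [r for r in rollouts if r.success]
    if not successes:
        return 0.0
    return sum(r.unsafe for r in successes) / len(successes)


def worst_violation_score(rollout: Rollout) -> float:
    """Stand-in for a bounded severity score: worst severity clamped to [0, 1].

    This is an assumption for illustration, not the published VSI definition.
    """
    if not rollout.violation_severities:
        return 0.0
    return min(1.0, max(0.0, max(rollout.violation_severities)))


if __name__ == "__main__":
    batch = [
        Rollout(success=True),                                   # clean success
        Rollout(success=True, violation_severities=[0.2, 0.7]),  # success, unsafe
        Rollout(success=False, violation_severities=[0.9]),      # failed, unsafe
        Rollout(success=True, violation_severities=[1.4]),       # success, unsafe
    ]
    print("SBU:", succ_but_unsafe(batch))
    print("Unsafe share of successes:", unsafe_share_of_successes(batch))
    print("Worst-violation scores:", [worst_violation_score(r) for r in batch])
```

Note the two different denominators. The SBU description above counts against all rollouts, while the 36 to 56 percent kitchen figure is stated as a share of successful rollouts, so it is worth checking which one a given paper reports before comparing numbers.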