VLA Models Can Find the Microwave, But Can't Figure Out When to Stop Looking
New benchmarks show vision-language-action models are getting better at understanding what you want, but still struggle with the basics of knowing when they've found it.
Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
Picture this: you're standing in an unfamiliar kitchen, and someone says "I need something to warm this food." You'd walk to the microwave, open it, done. A robot running the latest vision-language-action models? It'll probably get within 2 meters of the microwave about 68% of the time, but actually stop correctly within a meter of it only about 5% of the time.
I've been reading through a stack of new papers on VLA systems this week, and I'll be honest, the gap between "can identify the right object" and "can actually complete the task" is wider than I expected.
The numbers
A new benchmark called IntentionNav from researchers working with Isaac Sim tested how well current VLMs handle indirect human instructions. Not "go to the microwave," but "I need something to warm this food" or "the room feels stuffy." The kind of thing a real person would actually say.
The results are, well, humbling. Models correctly identified the intended target 48.3% of the time. They got within 2 meters of it 68.7% of the time. But successful termination (actually stopping at the goal correctly) dropped to 24.9%. And grounded 1-meter success? 5.5%.
That last number is the one that matters for actual deployment. When I was at Kuka, we had a saying: getting close doesn't count in manufacturing. Either the arm is in position or it isn't.
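To make those four numbers concrete, here's a rough sketch of how metrics like these could be tallied from per-episode logs. To be clear, this is my own illustration, not IntentionNav's evaluation code: the `Episode` fields, the `summarize` helper, and the exact rule behind each metric (especially counting "grounded 1-meter success" as correct target plus a stop within 1 meter) are assumptions on my part, and the four toy episodes are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Episode:
    """One navigation attempt. All field names are hypothetical."""
    predicted_target: str             # object the model decided the user meant
    true_target: str                  # object the instruction actually implied
    min_distance_m: float             # closest approach to the true target
    stop_distance_m: Optional[float]  # distance at stop; None if it never stopped


def rate(flags):
    """Fraction of True values in a list of booleans (0.0 for an empty list)."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def summarize(episodes):
    """Compute four success rates under the assumed definitions below."""
    identified = [e.predicted_target == e.true_target for e in episodes]
    # Assumption: "within 2 meters" means the agent passed that close at any point.
    reached_2m = [e.min_distance_m <= 2.0 for e in episodes]
    # Assumption: termination succeeds if the agent stops within 2 meters.
    terminated = [
        e.stop_distance_m is not None and e.stop_distance_m <= 2.0
        for e in episodes
    ]
    # Assumption: grounded success needs the right target AND a stop within 1 meter.
    grounded_1m = [
        ok and e.stop_distance_m is not None and e.stop_distance_m <= 1.0
        for ok, e in zip(identified, episodes)
    ]
    return {
        "intent_identified": rate(identified),
        "reached_within_2m": rate(reached_2m),
        "terminated_within_2m": rate(terminated),
        "grounded_1m_success": rate(grounded_1m),
    }


# Made-up episodes for "I need something to warm this food."
EPISODES = [
    Episode("microwave", "microwave", 0.6, 0.8),   # right target, clean stop
    Episode("oven", "microwave", 1.5, 1.7),        # wrong target, stopped nearby
    Episode("microwave", "microwave", 1.2, 3.5),   # right target, walked past it
    Episode("toaster", "microwave", 4.0, None),    # wrong target, never stopped
]

if __name__ == "__main__":
    for name, value in summarize(EPISODES).items():
        print(f"{name}: {value:.2f}")
```

On this toy data the rates come out to 0.50, 0.75, 0.50, and 0.25. The third episode is the interesting one: the agent knew what it was looking for, got close, and then kept going. That is the kind of failure that lets the 2-meter number look healthy while the 1-meter number collapses.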
What's actually broken
Another paper, ESI-Bench, dug into why these systems fail, and the answer surprised me. It's not perception. The models can see fine. The problem is what the researchers call "action blindness": poor decisions about where to look and when to stop looking.