VLA Models Can Find the Microwave, But Can't Figure Out When to Stop Looking

New benchmarks show vision-language-action models are getting better at understanding what you want, but still struggle with the basics of knowing when they've found it.

By Robert "Bob" Macintosh

2 hours ago · 4 min read

Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).

Picture this: you're standing in an unfamiliar kitchen, and someone says "I need something to warm this food." You'd walk to the microwave, open it, done. A robot running the latest vision-language-action models? It'll probably get within a couple of meters of the microwave about 68% of the time, but it will stop correctly within a meter of it only about 5% of the time.

I've been reading through a stack of new papers on VLA systems this week, and I'll be honest, the gap between "can identify the right object" and "can actually complete the task" is wider than I expected.

The numbers

A new benchmark called IntentionNav from researchers working with Isaac Sim tested how well current VLMs handle indirect human instructions. Not "go to the microwave," but "I need something to warm this food" or "the room feels stuffy." The kind of thing a real person would actually say.

The results are, well, humbling. Models correctly identified the intended target 48.3% of the time. They got within 2 meters of it 68.7% of the time. But successful termination (actually stopping at the goal correctly) dropped to 24.9%. And grounded 1-meter success? 5.5%.

That last number is the one that matters for actual deployment. When I was at Kuka, we had a saying: getting close doesn't count in manufacturing. Either the arm is in position or it isn't.
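To make those four numbers concrete, here is a rough sketch of how a funnel like this could be tallied from per-episode logs. To be clear, this is my own illustration, not IntentionNav's evaluation code: the `Episode` fields, the `summarize` helper, the metric definitions, and the toy episodes are all assumptions I made up to show why each stage can only shrink the success rate.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One navigation run. All fields are hypothetical, not the benchmark's schema."""
    identified_target: bool  # model named the intended object
    min_distance_m: float    # closest approach to the target during the run
    stopped: bool            # agent issued an explicit stop action
    stop_distance_m: float   # distance to the target when the run ended


def rate(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def summarize(episodes, near_m=2.0, grounded_m=1.0):
    """Assumed definitions of the four headline metrics."""
    return {
        # Did the model pick the right object at all?
        "identified": rate(e.identified_target for e in episodes),
        # Did it ever come within near_m of the target?
        "reached_near": rate(e.min_distance_m <= near_m for e in episodes),
        # Did it choose to stop while still within near_m?
        "terminated": rate(
            e.stopped and e.stop_distance_m <= near_m for e in episodes
        ),
        # Right object, explicit stop, and within grounded_m at the end.
        "grounded": rate(
            e.identified_target and e.stopped and e.stop_distance_m <= grounded_m
            for e in episodes
        ),
    }


# Toy data, invented for illustration only.
episodes = [
    Episode(True, 0.5, True, 0.6),    # clean success
    Episode(True, 1.5, True, 1.8),    # stopped nearby, but not within 1 m
    Episode(False, 1.9, False, 3.0),  # wandered past the target, never stopped
    Episode(False, 4.0, True, 4.0),   # stopped in the wrong place
]

summary = summarize(episodes)
for name, value in summary.items():
    print(f"{name}: {value:.1%}")
```

The third toy episode is the interesting one: the robot passes within two meters of the target and keeps going. Under these assumed definitions, that run counts toward "got close" and toward nothing else, which is the same shape as the drop from 68.7% to 24.9% in the reported results.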

What's actually broken

Another paper, ESI-Bench, dug into why these systems fail, and the answer surprised me. It's not perception. The models can see fine. The problem is what the researchers call "action blindness," which basically means making poor decisions about where to look and when to stop looking.
