Vision-Language Models Still Can't Grasp Objects Properly, New Benchmarks Reveal

A wave of new research exposes a fundamental gap: today's AI can describe a scene beautifully but struggles to actually interact with it.

By James Chen

5 hours ago · 5 min read

Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).

21,000 question-answer pairs later, the verdict is in: current vision-language models are basically impressive pattern matchers that fall apart the moment you ask them to do something physical.

That's the takeaway from Embodied3DBench, a new academic benchmark that systematically tested 13 state-of-the-art models on what the authors call "low-level spatial intelligence." The results aren't pretty. While these models can tell you that a mug is to the left of a keyboard, they struggle with the basics of actually picking that mug up.

What exactly are these models failing at?

The benchmark divides tasks into two groups: understanding spatial structure (where things are) and interaction-oriented perception (how to manipulate them). Models performed reasonably well on the first category. The second? Not so much.

Specifically, the tests covered:

Grounding and spatial relation prediction
Multi-view correspondence
Affordance prediction (what can you do with this object?)
Grasp point prediction
Trajectory prediction

The researchers found that models "remain fragile in interaction-oriented perception," which they attribute to a "significant lack of robust 3D-aware interaction priors." In plain English: these systems haven't learned the physics of actually touching things.

Look, I've seen enough spec sheets to know that benchmark numbers can be massaged. But the gap here is consistent across all 13 models tested, which suggests this isn't a cherry-picked result.
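To make the "consistent gap" idea concrete, here is a minimal Python sketch of how one might check it from per-task scores. Everything in it is an assumption for illustration: the task-to-group mapping is a guess (the benchmark's own assignment isn't spelled out here), the function names are hypothetical, and the numbers are placeholders, not results from Embodied3DBench.

```python
from statistics import mean

# ASSUMPTION: which of the five task types belongs to which group.
# This mapping is an illustrative guess, not taken from the benchmark.
TASK_GROUPS = {
    "grounding_and_spatial_relations": "spatial_structure",
    "multi_view_correspondence": "spatial_structure",
    "affordance_prediction": "interaction",
    "grasp_point_prediction": "interaction",
    "trajectory_prediction": "interaction",
}


def group_scores(task_scores):
    """Average a model's per-task scores into one score per group."""
    buckets = {}
    for task, score in task_scores.items():
        buckets.setdefault(TASK_GROUPS[task], []).append(score)
    return {group: mean(scores) for group, scores in buckets.items()}


def interaction_gap(task_scores):
    """Spatial-structure score minus interaction score for one model."""
    grouped = group_scores(task_scores)
    return grouped["spatial_structure"] - grouped["interaction"]


def gap_is_consistent(all_models):
    """True if every model scores lower on interaction than on structure."""
    return all(interaction_gap(scores) > 0 for scores in all_models.values())


# PLACEHOLDER scores for two hypothetical models. These are made-up
# numbers for demonstration only, NOT figures reported by the benchmark.
PLACEHOLDER_SCORES = {
    "model_a": {
        "grounding_and_spatial_relations": 0.8,
        "multi_view_correspondence": 0.6,
        "affordance_prediction": 0.4,
        "grasp_point_prediction": 0.3,
        "trajectory_prediction": 0.5,
    },
    "model_b": {
        "grounding_and_spatial_relations": 0.7,
        "multi_view_correspondence": 0.7,
        "affordance_prediction": 0.5,
        "grasp_point_prediction": 0.4,
        "trajectory_prediction": 0.6,
    },
}

if __name__ == "__main__":
    for name, scores in PLACEHOLDER_SCORES.items():
        print(name, round(interaction_gap(scores), 3))
    print("gap in every model:", gap_is_consistent(PLACEHOLDER_SCORES))
```

The point of checking the sign of the gap model by model, rather than averaging across all models first, is that one very weak model can't create the appearance of a gap the others don't share.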

