VLA Models Can't Actually Understand What You're Asking Them to Do

A new benchmark shows that vision-language-action models grasp objects just fine, but pick the right one at near-random rates.

By James Chen

2 hours ago · 7 min read

Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).

Most coverage of vision-language-action models focuses on their impressive demos: robots folding laundry, making coffee, following complex instructions. What gets buried in the fine print is whether these models actually understand the instructions or just learned to mimic the training data really well.

A new benchmark called RoboSemanticBench suggests it's mostly the latter. The setup is elegantly brutal: give a robot a multiple-choice question (math problems, general knowledge), show it blocks labeled with possible answers, and ask it to grasp the correct one. The results? Models that successfully grasp blocks at high rates select the semantically correct block at near-random or below-random rates.

That's a damning finding. It means the gap between what these models can do (grab things) and what we assume they can do (understand language and act on it) is far wider than demo videos suggest.

What does the benchmark actually test?

RoboSemanticBench isolates semantic grounding from motor control. The distinction matters. A robot that can grasp any block proves it has decent visuomotor skills. A robot that grasps the block corresponding to "What is 7 × 8?" proves something much harder: that language understanding flows all the way through to action selection.

The benchmark covers three categories: controlled arithmetic, grade-school math, and commonsense or factual knowledge. Each comes in four-choice and ten-choice variants. More choices means less chance of getting lucky.
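To put a number on "less chance of getting lucky", here is a minimal Python sketch of the chance baseline for a k-choice task and the probability of reaching a given score by pure guessing. The function names and the trial counts in the example are illustrative assumptions, not part of RoboSemanticBench.

```python
from math import comb


def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing among num_choices blocks."""
    return 1.0 / num_choices


def prob_at_least(correct: int, trials: int, num_choices: int) -> float:
    """Probability that random guessing gets at least `correct` of `trials` right.

    This is the upper tail of a binomial distribution with success
    probability 1 / num_choices.
    """
    p = chance_accuracy(num_choices)
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# Hypothetical example: 10 correct picks out of 20 trials.
for k in (4, 10):
    print(f"{k}-choice: chance = {chance_accuracy(k):.2f}, "
          f"P(>=10 of 20 by luck) = {prob_at_least(10, 20, k):.6f}")
```

The same raw score is much harder to reach by luck in the ten-choice variant, which is why that variant separates real grounding from guessing more sharply.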

The researchers tested representative VLA models and found a consistent pattern. Grasp success rates were reasonable. Semantic accuracy was not. After controlling for grasp success (meaning: among trials where the robot successfully grasped a block, how often it grasped the correct one), performance was often indistinguishable from random guessing.
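Controlling for grasp success can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the benchmark's actual evaluation code: the `Trial` record, the `summarize` function, and the example data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    grasped: bool  # the robot successfully grasped some block
    correct: bool  # the grasped block carried the right answer


def summarize(trials: list[Trial], num_choices: int) -> dict[str, float]:
    """Separate motor skill (grasp rate) from semantic grounding.

    Semantic accuracy is computed only over trials where a block was
    grasped, so grasp failures do not drag the score down.
    """
    grasped = [t for t in trials if t.grasped]
    grasp_rate = len(grasped) / len(trials) if trials else 0.0
    accuracy = sum(t.correct for t in grasped) / len(grasped) if grasped else 0.0
    return {
        "grasp_success": grasp_rate,
        "semantic_accuracy_given_grasp": accuracy,
        "chance": 1.0 / num_choices,
    }


# Made-up data for a four-choice task: 8 of 10 grasps succeed, 2 are correct.
example = (
    [Trial(True, True)] * 2
    + [Trial(True, False)] * 6
    + [Trial(False, False)] * 2
)
print(summarize(example, num_choices=4))
```

In this made-up run the grasp rate looks healthy while the conditional accuracy sits exactly at the four-choice chance level, which is the pattern the benchmark reports.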

Problem	Proposed Solutions	Maturity
Language-action mismatch	Continuous reasoning latents	Early research
Reasoning latency	Latency-aware training	Single paper
Shortcut learning	Better benchmarks (RSB)	Diagnostic only
Condition monitoring	Decoupled architectures (CLAW)	Task-specific
World modeling integration	Sequential WAMs	Architecture-dependent
