Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
21,000 question-answer pairs later, the verdict is in: current vision-language models are basically impressive pattern matchers that fall apart the moment you ask them to do something physical.
That's the takeaway from Embodied3DBench, a new academic benchmark that systematically tested 13 state-of-the-art models on what the authors call "low-level spatial intelligence." The results aren't pretty. While these models can tell you that a mug is to the left of a keyboard, they struggle with the basics of actually picking that mug up.
The benchmark divides tasks into two groups: understanding spatial structure (where things are) and interaction-oriented perception (how to manipulate them). Models performed reasonably well on the first category. The second? Not so much.
Specifically, the tests covered:
Grounding and spatial relation prediction
Multi-view correspondence
Affordance prediction (what can you do with this object?)
Grasp point prediction
Trajectory prediction
The researchers found that models "remain fragile in interaction-oriented perception," which they attribute to a "significant lack of robust 3D-aware interaction priors." In plain English: these systems haven't learned the physics of actually touching things.
Look, I've seen enough spec sheets to know that benchmark numbers can be massaged. But the gap here is consistent across all 13 models tested, which suggests this isn't a cherry-picked result.
Several teams are trying to close that gap. The most ambitious effort comes from Alibaba's Qwen team, who released Qwen-VLA, a unified model that attempts to bridge perception and action through what they call a "DiT-based action decoder."
The numbers are impressive on paper:
97.9% on LIBERO manipulation benchmarks
73.7% on Simpler-WidowX
86.1%/87.2% on RoboTwin-Easy/Hard
76.9% average success in real-world ALOHA experiments
That last figure is the one worth watching. Real-world success rates are where most of these systems collapse. A 76.9% average in physical experiments is, well, actually decent. Though I'd want to see what "average" is hiding before getting too excited.
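To make the worry about averages concrete, here is a small sketch. The per-task numbers are invented purely for illustration and are not from the Qwen-VLA paper. They are chosen only so that the mean lands on 76.9% while one task quietly fails three times out of four.

```python
from statistics import mean

# Hypothetical per-task success rates in percent.
# ASSUMPTION: these task names and values are made up for illustration;
# they are NOT the breakdown reported by the Qwen-VLA authors.
task_success = {
    "pick_and_place": 99.0,
    "drawer_open": 95.0,
    "towel_fold": 88.6,
    "cable_insert": 25.0,
}

# The headline number: a plain average over tasks.
average = mean(list(task_success.values()))

# The number the headline hides: the single worst task.
worst_task, worst_rate = min(task_success.items(), key=lambda kv: kv[1])

print(f"average success: {average:.1f}%")
print(f"worst task: {worst_task} at {worst_rate:.1f}%")
```

The same headline figure is compatible with a very even spread or with one badly broken skill, which is why the per-task table matters more than the mean.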
The Qwen team's approach uses what they call "embodiment-aware prompt conditioning," essentially telling the model which robot body it's controlling. This lets the same model work across different robot platforms, at least in theory.
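As a rough mental model of what "telling the model which body it's controlling" could look like, here is a minimal sketch. The `Embodiment` class, the `build_prompt` function, the header format, and the action dimensions are all my own assumptions for illustration; the paper's actual prompt format is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Embodiment:
    """Hypothetical description of a robot body (not the paper's schema)."""
    name: str
    arm_count: int
    action_dim: int  # assumed size of one action vector for this robot


def build_prompt(embodiment: Embodiment, instruction: str) -> str:
    """Prefix the task instruction with a description of the robot body."""
    header = (
        f"[embodiment: {embodiment.name}; "
        f"arms: {embodiment.arm_count}; "
        f"action_dim: {embodiment.action_dim}]"
    )
    return f"{header}\n{instruction}"


# Illustrative values only; real platforms may be configured differently.
widowx = Embodiment("WidowX", arm_count=1, action_dim=7)
aloha = Embodiment("ALOHA", arm_count=2, action_dim=14)

# Same instruction, different conditioning text for each body.
print(build_prompt(widowx, "pick up the mug"))
print(build_prompt(aloha, "pick up the mug"))
```

The point of the sketch is only that the task text stays the same while the conditioning changes, so one set of weights can, in principle, serve several robots.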
A separate paper, 3DVLA, tackles the 3D reasoning problem more directly. The researchers argue that current vision-language-action models suffer from three intertwined problems:
Weak extraction of 3D spatial positions
Inadequate 3D instance understanding
Fragile reasoning under occlusion
That third one is particularly nasty. In the real world, objects are constantly hidden behind other objects. Humans handle this effortlessly. Robots... don't.
3DVLA proposes a "plug-and-play framework" that injects 3D reasoning into existing models without requiring expensive manual labeling. The key innovation is a masked self-supervised branch that essentially teaches the model to imagine what's behind occluded areas.
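For readers who haven't met masked self-supervision before, the general idea can be sketched in a few lines. This is a generic toy version, not 3DVLA's actual branch: the function name, the mask ratio, and especially the "model" (which just predicts the mean of the visible patches) are placeholder assumptions.

```python
import random


def masked_reconstruction_loss(patches, mask_ratio=0.5, seed=0):
    """Hide some patches, predict them from the rest, score only the hidden ones.

    `patches` is a list of equal-length feature vectors (lists of floats).
    ASSUMPTION: the predictor is a stand-in (mean of visible patches);
    a real system would use a learned network here.
    """
    n = len(patches)
    if n < 2:
        raise ValueError("need at least two patches to mask one and keep one")

    rng = random.Random(seed)
    # Always mask at least one patch and keep at least one visible.
    num_masked = min(n - 1, max(1, int(n * mask_ratio)))
    masked_idx = set(rng.sample(range(n), num_masked))
    visible = [p for i, p in enumerate(patches) if i not in masked_idx]

    # Placeholder prediction: per-dimension mean of the visible patches.
    dim = len(patches[0])
    prediction = [sum(p[d] for p in visible) / len(visible) for d in range(dim)]

    # Mean squared error over the masked patches only. No labels are needed:
    # the hidden patches themselves are the training target.
    errors = [
        (patches[i][d] - prediction[d]) ** 2
        for i in masked_idx
        for d in range(dim)
    ]
    return sum(errors) / len(errors)


# A uniform scene is trivially reconstructable; a scene with an outlier is not.
print(masked_reconstruction_loss([[1.0, 2.0]] * 4))
print(masked_reconstruction_loss([[0.0], [0.0], [0.0], [10.0]]))
```

Because the target is the hidden input itself, this kind of objective needs no human annotation, which fits the paper's pitch about avoiding expensive manual labeling.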
This is where things get interesting. DGSG-Mind, another recent paper, claims to have deployed their system on physical robots with "target-oriented reasoning and dynamic update capabilities."
The system builds what the authors call a "hierarchical scene graph" using 3D Gaussian representations. It achieved what they describe as "the best zero-shot 3DVG performance among methods operating on self-reconstructed maps."
I should note: "best among methods operating on self-reconstructed maps" is a fairly specific category. The real test is production volume, and none of these systems are anywhere close to that.
From my time in hardware, I can tell you that perception is rarely the whole story. But DynaFLIP, a new pre-training framework, makes a compelling argument that we've been training visual encoders wrong.
The paper's central claim: most robot learning pipelines use visual encoders pre-trained for static image recognition or vision-language alignment. Motion understanding gets dumped on downstream policies, which then have to figure it out from scratch.
DynaFLIP instead trains encoders on "image-language-3D flow triplets" extracted from human and robot videos. The result, they claim, is representations that encode "not just what is present, but how the world changes under action."
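It helps to picture what one such triplet might contain. The sketch below is my own guess at a plausible record layout, not DynaFLIP's data format: the `FlowTriplet` name, its fields, and the units are all assumptions.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class FlowTriplet:
    """Hypothetical training record: what is seen, what is asked, what moves."""
    image_path: str     # a video frame (assumed to be stored on disk)
    instruction: str    # language describing the action
    flow: List[Vec3]    # assumed: per-point 3D displacement to the next frame, in metres

    def mean_motion(self) -> float:
        """Average displacement magnitude across tracked points."""
        if not self.flow:
            return 0.0
        total = sum(sqrt(dx * dx + dy * dy + dz * dz) for dx, dy, dz in self.flow)
        return total / len(self.flow)


sample = FlowTriplet(
    image_path="frame_0001.png",  # placeholder file name
    instruction="push the mug to the right",
    flow=[(3.0, 4.0, 0.0), (0.0, 0.0, 0.0)],  # toy values: one moving point, one static
)
print(sample.mean_motion())
```

The third field is what a static image-recognition dataset lacks: it says how points in the scene move, not just what they are.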
The gains are substantial in their tests: an improvement of up to 22.5% in out-of-distribution scenarios. That's an ambitious number, and I'd want to see independent replication before taking it at face value.
The Embodied3DBench team did something useful beyond just exposing the problem: they synthesized a training dataset of 1.3 million QA pairs specifically designed to address the interaction-perception gap. Fine-tuning on this data, they report, "yields significant improvements in low-level spatial intelligence."
There's also SCOUT, which takes a different approach entirely. Instead of trying to make vision-language models better at 3D reasoning, it uses scene graphs and "procedural distillation" to extract knowledge from large language models into lightweight models suitable for on-robot inference.
The tradeoff is explicit: SCOUT "matches LLM-level performance while remaining computationally efficient." In other words, it's not trying to be smarter, just faster and cheaper.
A few things remain unclear. None of these papers adequately address latency, the time between perception and action. In real-world robotics, a 100-millisecond delay can mean the difference between catching an object and watching it fall.
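A quick back-of-the-envelope calculation shows why 100 milliseconds matters. The sketch below uses basic free-fall kinematics; the function names are mine, and the 0.5 m/s conveyor speed is an arbitrary example value.

```python
G = 9.81  # standard gravitational acceleration, m/s^2


def fall_distance_m(delay_s: float) -> float:
    """Distance an object dropped from rest falls during the delay (no drag)."""
    return 0.5 * G * delay_s ** 2


def travel_distance_m(speed_m_s: float, delay_s: float) -> float:
    """Distance an object moving at constant speed covers during the delay."""
    return speed_m_s * delay_s


delay = 0.100  # 100 ms between perception and action

# An object released from rest has dropped about 4.9 cm before the robot reacts.
print(f"free fall: {fall_distance_m(delay) * 100:.1f} cm")

# ASSUMPTION: 0.5 m/s is just an example conveyor speed.
print(f"conveyor at 0.5 m/s: {travel_distance_m(0.5, delay) * 100:.1f} cm")
```

Roughly five centimeters is more than the width of many gripper openings, so a grasp planned from a 100 ms old frame can miss entirely.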
The generalization claims also need scrutiny. Qwen-VLA reports "26.6% zero-shot success on DOMINO dynamic manipulation." That's honest reporting (most papers would bury a 26% figure), but it also shows how far we are from robots that can handle genuinely novel situations.
And then there's the question nobody wants to talk about: compute requirements. These are large models. The Qwen-VLA paper doesn't disclose exact inference costs, but given the architecture, I'd guess we're looking at significant GPU requirements for real-time operation.
The research community is clearly converging on the same diagnosis: today's vision-language models are perception-heavy and interaction-light. The treatment is less obvious. Some teams are building bigger models with more data. Others are trying surgical interventions, injecting 3D reasoning into existing architectures.
My bet is that the surgical approach wins in the short term. Production robotics can't wait for the next generation of foundation models. It needs fixes that work with today's hardware and today's constraints. But we don't know yet which of these approaches will actually scale beyond the lab.