The Real Problem With Robot AI Isn't Perception, It's Knowing When to Act
A wave of new benchmarks and frameworks reveals that vision-language models fail not because they can't see, but because they commit too early and explore too little.
Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
Most coverage of the latest robotics AI research focuses on the impressive capabilities: models that can answer questions about their environment, generate manipulation trajectories from natural language, or reason about spatial relationships. What these summaries consistently miss is the more troubling finding buried in the methodology sections: these systems fail in ways that suggest a fundamental gap between perception and action that no amount of scaling will fix.
I've spent the past week reading through five recent papers that, taken together, paint a picture that's both more nuanced and more concerning than the typical "AI gets better at robots" narrative. The research spans embodied question answering, visual planning for manipulation, language-to-motion generation, and comprehensive benchmarking. What emerges is a consistent theme: we've gotten reasonably good at teaching models to see and understand. We remain remarkably bad at teaching them to act wisely on that understanding.
The perception-action disconnect is real, and it's not getting better
The most striking evidence comes from ESI-Bench, a new benchmark from researchers building on OmniGibson that explicitly tests what they call "embodied spatial intelligence." The benchmark spans 10 task categories grounded in Spelke's core knowledge systems (the developmental psychology framework for how infants understand objects, space, and causality). What makes ESI-Bench different from prior spatial reasoning benchmarks is that it treats the observer as an actor who must decide what to do to gather information, not just process information that's handed to them.
The results are, to be precise, damning for current approaches. The researchers found that most failures "stem not from weak perception but from action blindness: poor action choices lead to poor observations, which in turn drive cascading errors." In other words, the models can see fine. They just don't know what to look at or when to look at it.
Here's what I find particularly concerning: random multi-view sampling (basically, taking more pictures from different angles) often made performance worse, not better. The models consumed far more images but added noise rather than signal. Meanwhile, agents that were allowed to actively explore "spontaneously discovered emergent spatial strategies without explicit instructions." The capability is there, somewhere, but current training approaches aren't reliably eliciting it.
The metacognitive gap is the real problem
The ESI-Bench paper includes human studies that reveal something the robotics community needs to grapple with. Humans, when uncertain about a spatial judgment, seek falsifying viewpoints and revise their beliefs when they encounter contradictory evidence. The models do neither. They "commit prematurely with high confidence regardless of evidence quality." This isn't a perception problem. It's a knowing-what-you-don't-know problem.
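To make the contrast between premature commitment and evidence seeking concrete, here is a minimal sketch of my own, not ESI-Bench's evaluation code. It assumes a single yes/no spatial hypothesis and that each new viewpoint can be summarized as a pair of likelihoods; the names `update_belief`, `decide`, and `commit_threshold` are hypothetical.

```python
def update_belief(prior, p_obs_if_true, p_obs_if_false):
    """Bayes' rule for a binary hypothesis: returns P(H | observation)."""
    numerator = prior * p_obs_if_true
    return numerator / (numerator + (1.0 - prior) * p_obs_if_false)


def decide(observations, commit_threshold=0.95, prior=0.5):
    """Consume viewpoints one at a time and commit only once belief is decisive.

    Each observation is a pair (P(obs | H true), P(obs | H false)).
    Returns (decision, belief, views_used); decision is None when the
    evidence runs out before either threshold is crossed.
    """
    belief = prior
    used = 0
    for used, (p_true, p_false) in enumerate(observations, start=1):
        belief = update_belief(belief, p_true, p_false)
        if belief >= commit_threshold:
            return True, belief, used
        if belief <= 1.0 - commit_threshold:
            return False, belief, used
    return None, belief, used


# Illustrative numbers only: the first view weakly supports the hypothesis,
# while the later (falsifying) views contradict it.
views = [(0.6, 0.4), (0.2, 0.8), (0.1, 0.9), (0.1, 0.9)]

hasty = decide(views, commit_threshold=0.55)    # commits on the first weak cue
careful = decide(views, commit_threshold=0.95)  # keeps looking, then revises
```

With these made-up likelihoods, the low-threshold agent answers "yes" after one view, while the high-threshold agent keeps gathering views and ends up rejecting the hypothesis. The threshold is a stand-in for the behavior the human studies describe; nothing here says how a real model should estimate the likelihoods.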
The researchers are blunt about the implications: this metacognitive gap "neither better perception nor more embodied interaction alone can close." I know I'm being picky here, but this is exactly the kind of finding that should temper the enthusiasm around simply scaling up robot foundation models. If the architecture itself doesn't support appropriate uncertainty, more data won't help.
Meanwhile, others are trying to bridge the gap with explicit visual planning
The Afford-VLA paper takes a different approach to the perception-action disconnect. The authors argue that current vision-language-action models suffer from "insufficient spatial reasoning, particularly in determining where to interact in complex visual scenes." Their solution is to internalize what they call "affordance" (the regions of a scene where meaningful interaction is possible) as an explicit planning interface.
The technical approach involves learnable tokens that query task-relevant interaction regions, decode affordance masks, and convert them into compact embeddings that condition action generation. It's genuinely new in the sense that it creates a tightly coupled perception-action pathway rather than treating visual understanding and motor control as separate modules. Prior work relied on global geometric cues, symbolic intermediate representations, or externally generated visual signals, all of which the authors argue are "weakly coupled with downstream action prediction."
The results on LIBERO, LIBERO-Plus, and SimplerEnv show consistent improvements over baselines. But I'd want to see replication before getting too excited. The paper reports strong real-world results but doesn't provide the kind of detailed failure analysis that ESI-Bench does. We don't know if Afford-VLA still suffers from premature commitment or action blindness in edge cases.
A simpler approach that actually works
The Language Movement Primitives paper takes a refreshingly modest approach to the same problem. Rather than trying to get vision-language models to output raw action commands (which they're bad at), the researchers ground VLM reasoning in Dynamic Movement Primitive parameterization. DMPs are a well-established framework in robotics that represent trajectories with a small number of interpretable parameters.
The key insight is that VLMs are actually pretty good at reasoning over free-form natural language task descriptions. They're just bad at converting that reasoning into low-level position and velocity control. By having the VLM set DMP parameters rather than raw actions, you get "diverse, continuous, and stable trajectories" without requiring the model to understand motor control.
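For readers who haven't met DMPs, the sketch below shows why a handful of parameters is enough to produce a smooth trajectory. It is a textbook-style one-dimensional discrete DMP integrated with Euler steps, not the LMP authors' code. The function name `rollout_dmp`, the gain values, and the basis-width heuristic are my assumptions; how LMP maps VLM output onto these parameters is not something the article's summary specifies.

```python
import math


def rollout_dmp(y0, goal, weights, duration=1.0, dt=0.001,
                alpha=25.0, beta=6.25, alpha_x=4.0):
    """Roll out a 1-D discrete DMP and return the list of positions.

    Dynamics (tau = duration):
        tau * dv/dt = alpha * (beta * (goal - y) - v) + f(x)
        tau * dy/dt = v
        tau * dx/dt = -alpha_x * x
    The forcing term f(x) is a normalized weighted sum of Gaussian basis
    functions, scaled by the phase x so that it fades out over time.
    """
    n = len(weights)
    # Basis centers spaced evenly in normalized time, expressed in phase x.
    centers = [math.exp(-alpha_x * i / max(n - 1, 1)) for i in range(n)]
    # Width heuristic (an assumption): narrower bases where x changes slowly.
    widths = [n ** 1.5 / c / alpha_x for c in centers]

    y, v, x = y0, 0.0, 1.0
    path = [y]
    for _ in range(int(round(duration / dt))):
        psi = [math.exp(-h * (x - c) ** 2) for h, c in zip(widths, centers)]
        total = sum(psi)
        forcing = 0.0
        if total > 1e-12:
            shape = sum(p * w for p, w in zip(psi, weights)) / total
            forcing = shape * x * (goal - y0)
        dv = (alpha * (beta * (goal - y) - v) + forcing) / duration
        dy = v / duration
        dx = -alpha_x * x / duration
        v += dv * dt
        y += dy * dt
        x += dx * dt
        path.append(y)
    return path


# Same start and goal, different weights: the shape of the motion changes,
# but the spring-damper term still pulls both trajectories toward the goal.
straight = rollout_dmp(0.0, 1.0, [0.0, 0.0, 0.0, 0.0, 0.0])
shaped = rollout_dmp(0.0, 1.0, [0.0, 80.0, -40.0, 0.0, 0.0])
```

The point of the design is that a planner only has to pick a goal, a duration, and a short weight vector. Stability comes from the underlying spring-damper system rather than from anything the language model has to get right.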
Across 31 real-world manipulation tasks, LMP achieved 65% task success compared to 35% for the best performing baseline. That's a substantial improvement, though I'd note the sample size is relatively small for drawing strong conclusions. The approach is also limited to tabletop manipulation, and it's not clear how well it would generalize to more complex scenarios involving locomotion or multi-step planning.
The benchmarking problem isn't solved either
The EQA-Decision paper attempts to address another gap in the field: the fragmentation of existing benchmarks. Current datasets each focus on limited subsets of reasoning skills (spatial understanding, procedural reasoning, etc.) without offering a unified framework for comprehensive evaluation.
EQA-Decision contains over four million question-answer pairs with hierarchical annotations across four dimensions: static scene construction, spatial understanding, task dynamics reasoning, and instant decision. The scale is impressive, but I have methodology concerns. Generating four million QA pairs almost certainly involves automated annotation pipelines, and the paper doesn't provide enough detail about quality control to assess how reliable the ground truth labels are.
The accompanying RoboDecision baseline model provides a unified framework for evaluating perception, reasoning, and action-level decision-making. The results demonstrate that the benchmark "effectively benchmarks and enhances VLM capabilities," but this is somewhat circular. We'd need to see whether models trained on EQA-Decision actually perform better on held-out real-world tasks, not just on the benchmark itself.
Tool use might be the path forward, but we're not there yet
The most comprehensive paper in this batch is Embodied Tool Protocol, which proposes externalizing heterogeneous capabilities into independently optimized tools that are dynamically invoked at inference time. The authors argue that perception, reasoning, planning, and control are "inherently hierarchical and heterogeneous," making them difficult to reliably learn within a single parameterized policy.
They've curated over 100 validated tools spanning perception, cognition, reasoning, and execution, and built EmbodiedToolBench to evaluate how well current models use them. The results confirm that capability externalization consistently improves embodied performance (average gains of 31% on EB-ALFRED and 36% on EB-Navigation).
But here's the critical finding: there's a clear boundary to these gains. They're substantial for cognition and perception but limited for execution-type capabilities. The analysis reveals that "knowing when, which, and how to invoke tools remains a persistent challenge across all models." This echoes the ESI-Bench finding about action blindness. Models can be given powerful tools but still fail because they don't know when to use them.
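The "when, which, and how" framing maps neatly onto three separate failure points in any tool-calling loop. The sketch below is purely illustrative and is not the Embodied Tool Protocol interface: the registry, the two toy tools (`locate_object`, `distance`), and the call format are all hypothetical. It only shows how a dispatcher could attribute a failure to one of the three decisions.

```python
import inspect
from typing import Any, Callable, Dict, Optional

TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn):
    """Register a function under its own name (hypothetical registry)."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def locate_object(name: str, scene: dict) -> Optional[tuple]:
    """Toy perception tool: look up an object's position in a scene dict."""
    return scene.get(name)


@tool
def distance(a: tuple, b: tuple) -> float:
    """Toy reasoning tool: Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


def invoke(call: Optional[dict]) -> dict:
    """Run one model-proposed call and label which decision went wrong."""
    if call is None:
        # "When": the model chose not to call any tool at this step.
        return {"status": "skipped"}
    fn = TOOLS.get(call.get("name"))
    if fn is None:
        # "Which": the model asked for a tool that does not exist.
        return {"status": "unknown_tool"}
    args = call.get("args", {})
    try:
        # "How": check the arguments against the tool's signature.
        inspect.signature(fn).bind(**args)
    except TypeError as err:
        return {"status": "bad_arguments", "detail": str(err)}
    return {"status": "ok", "result": fn(**args)}


scene = {"mug": (0.0, 0.0, 0.0), "shelf": (3.0, 4.0, 0.0)}
good = invoke({"name": "distance",
               "args": {"a": scene["mug"], "b": scene["shelf"]}})
wrong_tool = invoke({"name": "teleport", "args": {}})
wrong_args = invoke({"name": "distance", "args": {"a": scene["mug"]}})
```

A dispatcher like this can catch a wrong tool name or malformed arguments, but it cannot tell whether skipping a tool was the right call. That "when" decision is the part that has to come from the model.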
What this means for the field
Taken together, these papers suggest we're at an inflection point in embodied AI research. The perception side of the problem, while not solved, is tractable. We can build models that understand scenes, answer questions about them, and reason about spatial relationships. The action side remains stubbornly difficult.
The ESI-Bench metacognition findings are particularly important. If models commit prematurely regardless of evidence quality, then simply improving perception won't help. We need architectures that can represent and reason about their own uncertainty, that know when to gather more information before acting.
The tool use results from ETP point toward a possible path forward: don't try to learn everything end-to-end. Externalize capabilities into modular, independently optimized components. But even this approach hits the same wall. The models still need to know when to invoke which tool, and that meta-level reasoning remains weak.
I'd want to see future work focus explicitly on the metacognitive gap. How do we train models that seek falsifying evidence rather than confirming evidence? How do we get appropriate uncertainty calibration in embodied settings where the cost of premature commitment is high? These aren't questions that more data or bigger models will automatically answer.
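One standard way to put a number on "appropriate uncertainty calibration" is expected calibration error (ECE): bin predictions by stated confidence and compare each bin's average confidence to its actual accuracy. None of the papers discussed here is described as reporting it, so treat the sketch below as an example of the kind of measurement I have in mind. The function name and the toy data are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: floats in [0, 1], the model's stated confidence per decision.
    correct:     booleans, whether each decision turned out to be right.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / total * abs(avg_conf - accuracy)
    return ece


# An agent that always claims 95% confidence but is right half the time
# is badly calibrated; one that claims 75% and is right 3 times in 4 is not.
overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, False, True, False])
calibrated = expected_calibration_error(
    [0.75, 0.75, 0.75, 0.75], [True, True, True, False])
```

In an embodied setting the interesting version of this metric would be computed at the moment the agent commits to an action, since that is where premature confidence does the damage.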
What I'd want to see next
The field needs benchmarks that explicitly test metacognitive competence, not just task performance. ESI-Bench makes a start here with its human comparison studies, but we need more systematic evaluation of whether models know what they don't know.
We also need better failure analysis. The Afford-VLA paper reports strong results but doesn't tell us much about failure modes. The EQA-Decision paper has scale but unclear quality control. The LMP paper has good real-world results but limited task diversity.
Most importantly, we need to stop treating perception and action as separable problems that can be solved independently and then combined. The consistent finding across all five papers is that the interface between seeing and doing is where things break down. That interface, not perception alone or action alone, should be the focus of the next generation of research.