The 3D Vision Gap: Why Your Robot Can't Plan What It Can't See
New research reveals frontier AI models fail spectacularly at multi-step visual planning, but a self-exploration technique just boosted one model's success rate from 2.5% to nearly 48%.
Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
From 2.5% to 47.8%. That's the improvement researchers achieved on interactive view planning tasks by teaching a vision-language model to explore 3D environments on its own, according to a new paper on arXiv. The number caught my attention because it suggests something uncomfortable about the current generation of AI models: they can describe what they see, but they can't reliably plan how to see something else.
The research team behind ViewSuite, a new benchmark built on real ScanNet scenes, tested 13 frontier vision-language models on what sounds like a simple task: predict how moving a camera will change the view, then chain multiple moves together to reach a target viewpoint. The results were, honestly, worse than I expected.
GPT-5.4 Pro hit 18.5%. Gemini 3.1 Pro managed 21.4%. These are models that can write poetry and debug code, struggling with spatial reasoning that a warehouse robot needs to perform hundreds of times per shift.
The researchers identified a specific failure mode they call the "planning gap." The models possess what they term basic view-action knowledge, meaning they understand that rotating the camera left shifts the scene rightward in the frame. But they fall apart when asked to compose multiple such transformations. The gap widens as viewpoint distance grows, which makes sense if you think about it: each step compounds the uncertainty left by the one before it.
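To make the compounding concrete, here is a toy sketch of my own, not anything from the paper. It assumes a camera on a flat floor with a pose of (x, y, heading), and a hypothetical planner that misjudges every turn by the same small bias. The function names and the 5-degree bias are illustrative assumptions.

```python
import math

def apply_move(pose, turn_deg, forward):
    """Rotate the camera by turn_deg (counterclockwise positive), then step forward."""
    x, y, heading = pose
    heading += math.radians(turn_deg)
    return (x + forward * math.cos(heading),
            y + forward * math.sin(heading),
            heading)

def rollout(moves, turn_bias_deg=0.0):
    """Chain a list of (turn_deg, forward) moves, adding a fixed bias to every turn."""
    pose = (0.0, 0.0, 0.0)
    for turn_deg, forward in moves:
        pose = apply_move(pose, turn_deg + turn_bias_deg, forward)
    return pose

def position_error(moves, turn_bias_deg):
    """Distance between the true end position and the biased prediction."""
    true_pose = rollout(moves)
    predicted = rollout(moves, turn_bias_deg)
    return math.hypot(true_pose[0] - predicted[0], true_pose[1] - predicted[1])

# Hypothetical plan: eight unit steps straight ahead.
moves = [(0.0, 1.0)] * 8

for n in (1, 2, 4, 8):
    print(n, "steps ->", round(position_error(moves[:n], 5.0), 3))
```

In this toy setup the heading error accumulates with every move, so the position error grows faster than the step count. A model that looks nearly right on single moves can still end up far from the target after eight.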
From my time building hardware at Fanuc, I saw this problem from the other side. Industrial robots don't rely on learned visual planning because the failure modes are too unpredictable. They use explicit kinematic models and pre-programmed paths. But that approach doesn't scale to unstructured environments, which is precisely where the robotics industry wants to go.
The fix proposed in the ViewSuite paper is clever. Rather than training on expert demonstrations (the standard approach), they let the model explore environments and fail repeatedly. All those failure trajectories, combined with successful ones, form what they call a view graph: a map of how viewpoints connect across a scene. Distilling this graph into training data apparently reshapes the policy distribution in ways that pure reinforcement learning can't achieve.
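The article-level description leaves out how the distillation actually works, so treat the following as one plausible reading rather than the authors' method. The sketch assumes each exploration step is logged as a hypothetical (view, action, next_view) triple. It merges all trajectories into a directed graph, then uses breadth-first search to pull out a shortest action sequence between any two connected viewpoints.

```python
from collections import deque

def build_view_graph(trajectories):
    """Merge (view, action, next_view) steps from all episodes into one directed graph."""
    graph = {}
    for trajectory in trajectories:
        for view, action, next_view in trajectory:
            graph.setdefault(view, {})[action] = next_view
            graph.setdefault(next_view, {})
    return graph

def shortest_plan(graph, start, target):
    """Breadth-first search for a shortest action sequence from start to target."""
    parent = {start: None}          # view -> (previous view, action taken)
    queue = deque([start])
    while queue:
        view = queue.popleft()
        if view == target:
            plan = []
            while parent[view] is not None:
                view, action = parent[view]
                plan.append(action)
            return plan[::-1]
        for action, next_view in graph[view].items():
            if next_view not in parent:
                parent[next_view] = (view, action)
                queue.append(next_view)
    return None                     # target not reachable from start

# Two hypothetical episodes, neither of which reached viewpoint "D" from "A".
episode_1 = [("A", "turn_left", "B"), ("B", "forward", "C")]
episode_2 = [("C", "turn_right", "D")]

graph = build_view_graph([episode_1, episode_2])
print(shortest_plan(graph, "A", "D"))
```

The point of the toy example is that neither episode got from A to D on its own, yet the merged graph yields a valid three-step plan. That is how failed trajectories can still turn into useful training targets.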
Look, 47.8% is still a failing grade in most contexts. But the jump from 2.5% suggests the approach has legs.
This isn't the only recent work attacking the 3D understanding problem. A separate paper introducing 3DVLA argues that current vision-language-action models suffer from three intertwined challenges:
Weak extraction of 3D spatial positions without multi-view consistency
Inadequate 3D instance understanding (knowing that the red mug is a distinct object from the table)
Fragile reasoning under occlusion (what happens when something blocks the view)
The 3DVLA framework attempts to inject 3D reasoning into existing VLA models without requiring expensive instance-level annotations. They use what they call Spatially-Conditioned Geometry Aggregation and a masked self-supervised encoding branch that handles occlusions. The paper claims consistent gains on LIBERO-Plus and RoboTwin 2.0 benchmarks, though I couldn't find the exact percentage improvements in the abstract.
What interests me more is the architectural approach. They're treating 3D perception as a plug-in module rather than requiring a ground-up redesign. That's an ambitious compatibility goal, and the real question is whether it actually works across different base models.
Meanwhile, a third line of research is asking whether we even need the robot to understand 3D space directly. The VERA project from MIT takes a different tack: leave the video planner completely embodiment-agnostic and train a separate inverse dynamics model (IDM) for each robot type.
The logic here is that video generation models already understand how scenes evolve. They can predict that pushing a block will move it, that opening a drawer reveals contents. The hard part is translating that visual prediction into motor commands for a specific robot arm. VERA decouples these problems entirely.
Their IDM design uses the robot's Jacobian matrix, which maps joint velocities to end-effector motion. This is basic robotics math, but apparently nobody had combined it with video world models in quite this way. The paper demonstrates zero-shot manipulation with a Panda arm and dexterous manipulation with a 16-DoF Allegro hand, the latter being essentially cube reorientation.
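For readers who haven't touched the Jacobian since school, here is the textbook relation on a planar two-link arm. It says nothing about how VERA's IDM is actually built. The link lengths, joint angles, and function names are illustrative assumptions.

```python
import math

def forward_kinematics(q1, q2, l1=1.0, l2=1.0):
    """End-effector (x, y) of a planar two-link arm with joint angles q1, q2."""
    return (l1 * math.cos(q1) + l2 * math.cos(q1 + q2),
            l1 * math.sin(q1) + l2 * math.sin(q1 + q2))

def jacobian(q1, q2, l1=1.0, l2=1.0):
    """2x2 Jacobian: partial derivatives of (x, y) with respect to (q1, q2)."""
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [ l1 * c1 + l2 * c12,  l2 * c12]]

def ee_velocity(J, qdot):
    """Joint velocities -> end-effector velocity (v = J * qdot)."""
    return [J[0][0] * qdot[0] + J[0][1] * qdot[1],
            J[1][0] * qdot[0] + J[1][1] * qdot[1]]

def joint_velocity(J, v):
    """End-effector velocity -> joint velocities, by inverting the 2x2 Jacobian."""
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    if abs(det) < 1e-9:
        raise ValueError("arm is at a singular configuration")
    return [( J[1][1] * v[0] - J[0][1] * v[1]) / det,
            (-J[1][0] * v[0] + J[0][0] * v[1]) / det]

J = jacobian(0.3, 0.8)
v = ee_velocity(J, [0.2, -0.1])      # how the gripper moves for these joint speeds
print(v, joint_velocity(J, v))       # second value recovers the joint speeds
```

The inverse direction is the interesting one for an inverse dynamics model: given the motion you want at the gripper, which a video could in principle supply, the Jacobian tells you what the joints need to do.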
The real test is production volume, and VERA remains a research prototype. But the architecture has some practical advantages: you can swap video models without retraining the IDM, and the IDM can be trained with self-play data (the robot just moves around and records what happens). No expensive human demonstrations required.
A fourth paper tackles the exploration problem from yet another angle. LLM-Guided Future Hypotheses uses language models to generate short-horizon future videos that serve as priors for robot control. The idea is that if you can show the robot a plausible future, a video of the block sliding into position, it has a target to aim for.
They call this Future-Experience Conditioning (FEC), and the results fall in the order you'd expect. Generated futures improve performance over no-future conditioning. Mismatched futures (showing the wrong outcome) degrade performance. Ground-truth futures work best, obviously. The BC+RL variant achieved the strongest results on RoboCasa and CALVIN benchmarks.
The pipeline is complex: an LLM reasons over a task ontology, a digital twin rolls out intended object motion without the robot present, and then a video diffusion model synthesizes what the scene would look like with the robot executing the action. It's a lot of moving parts, and I'm skeptical about real-time performance. But the core insight, that robots benefit from imagining the future, seems sound.
So where does this leave us? Four papers, four approaches to the same underlying problem: robots need to understand 3D space to operate in the real world, and current AI models are surprisingly bad at it. The main takeaways:
Self-exploration with graph distillation dramatically improves planning (2.5% → 47.8%)
3D perception can potentially be added to existing VLAs as a plug-in module
Decoupling video prediction from action generation may simplify training
Future video conditioning improves policy learning, but requires accurate futures
What remains unclear is how these approaches will combine, or whether they're even compatible. The field is moving fast enough that papers from three months ago already feel dated. I've seen enough spec sheets to know that benchmark performance rarely translates directly to real-world capability, and none of these papers include data on cycle times, failure recovery, or the kind of edge cases that actually matter in deployment.
The 47.8% number is promising. It's also a reminder that we're still, in a way, teaching robots to see. The hardware has been ready for years. The perception is finally catching up.