Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
Why do robots still struggle with tasks a toddler finds trivial?
I've been thinking about this a lot lately. You can show a vision-language model a cluttered kitchen counter and it'll describe every object with impressive accuracy. Ask it to tell a robot how to stack those objects without knocking anything over, and suddenly things fall apart. Sometimes literally.
A batch of recent papers from robotics researchers is converging on the same uncomfortable truth: our best VLMs are great at seeing and talking, but the spatial reasoning that connects perception to physical action remains, honestly, kind of a mess. The good news? People are finding workarounds. The bad news? Those workarounds reveal just how far we still have to go.
Let me start with what I think is the most revealing study of the bunch. Researchers tested VLMs on a collaborative structure-building task, basically a robot version of describing how to rebuild a Lego tower to someone who can't see it. According to their paper published on arXiv, multi-turn dialogue between AI agents improved performance on spatial reasoning. But here's the kicker: only barely.
The finding that stuck with me was this: detailed text descriptions of a target structure actually worked better than showing the model images of it. Think about that for a second. You'd assume a vision model would prefer, you know, vision. But when it comes to spatial reasoning, words apparently beat pictures. That's weird, right?
What this tells us is that VLMs can process visual information, but they struggle to extract the precise spatial relationships needed for manipulation. They see a cup on a table but can't reliably tell you it's 15 centimeters from the edge or that rotating it 30 degrees would clear the obstacle behind it.
The tool augmentation approach
One response to this limitation is to give VLMs access to specialized tools. The team behind SpaceTools took this route, combining depth estimators, segmentation models, and pose estimators that the VLM can call on when it needs precise spatial information. Their method, published on arXiv, uses what they call Double Interactive Reinforcement Learning to teach models how to coordinate multiple tools.
The results are pretty striking:
12% improvement over standard supervised fine-tuning on spatial understanding benchmarks
16% improvement over vanilla reinforcement learning approaches
State-of-the-art performance on RoboSpatial-Home, BLINK, and BOP-ASK benchmarks
Successful real-world manipulation using a 7-DOF robot arm
I initially thought this was just throwing more compute at the problem, but after reading the paper more carefully, I think there's something smarter happening here. The two-phase training (a teaching phase with demonstrations, then an exploration phase) lets the model discover tool-use patterns that humans might not think to program explicitly. It's learning when to ask for help, basically.
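To make the "ask for help" idea a bit more concrete, here's a minimal sketch of a tool-dispatch loop. To be clear, this is my own illustration and not SpaceTools' interface or training procedure: the names `estimate_depth`, `segment`, `TOOLS`, and `run_tool_calls` are all hypothetical, and the tools are stubs that return fixed values so the example runs on its own.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in tools. Real ones would wrap a depth estimator and a
# segmentation model; these stubs return fixed values so the sketch runs.
def estimate_depth(obj: str) -> float:
    fake_depths_m = {"cup": 0.42, "plate": 0.55}
    return fake_depths_m.get(obj, 1.0)

def segment(obj: str) -> Tuple[int, int, int, int]:
    fake_boxes = {"cup": (10, 20, 50, 80)}
    return fake_boxes.get(obj, (0, 0, 0, 0))

# Assumed registry mapping tool names to callables.
TOOLS: Dict[str, Callable] = {"depth": estimate_depth, "segment": segment}

def run_tool_calls(calls: List[Tuple[str, str]]) -> List[str]:
    """Execute (tool_name, argument) requests and return text observations
    that could be appended to the model's context."""
    observations = []
    for tool_name, arg in calls:
        tool = TOOLS.get(tool_name)
        if tool is None:
            # Report unknown tools instead of crashing, so the caller can recover.
            observations.append(f"{tool_name}({arg}) -> error: unknown tool")
            continue
        observations.append(f"{tool_name}({arg}) -> {tool(arg)}")
    return observations

# Assumed scenario: the model requests depth, a mask box, and an unregistered tool.
for line in run_tool_calls([("depth", "cup"), ("segment", "cup"), ("pose", "cup")]):
    print(line)
```

The interesting part in the paper is presumably not the dispatch itself but learning which calls to make and when; the loop above only shows where those decisions would plug in.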
Planning as a crutch (in a good way)
Another approach that's gaining traction is to give VLMs a structured reasoning scaffold. PLanAR, from a team whose work appears on arXiv, introduces what they call a planning-language interface. Instead of letting the VLM reason in free-form natural language (which, tbh, leads to some pretty wild hallucinations), it constrains reasoning to defined object predicates, action schemas, and symbolic plans.
The clever bit is stepwise verification. After each action, the system checks whether the expected effects actually happened. Did the cup actually move to where we thought it would? If not, the VLM can update its understanding and replan.
This works across different robot types and VLM backends, which suggests the approach is genuinely useful rather than just tuned for one specific setup. But it also reveals a limitation: current VLMs apparently can't do this kind of verification and replanning on their own. They need the scaffolding.
You might be wondering why this matters if the scaffolding works. I think it's because it tells us something about what's missing from VLMs themselves. They're not learning the kind of causal, physical reasoning that would let them anticipate and catch their own mistakes.
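Here's roughly what stepwise verification over symbolic predicates could look like. This is a generic STRIPS-style sketch under my own assumptions, not PLanAR's actual interface: `Action`, `execute_with_verification`, and the simulated `world` are hypothetical names, and the "robot" is a stub that deliberately fails the place step.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional, Tuple

Predicate = Tuple[str, ...]            # e.g. ("on", "cup", "shelf")
State = FrozenSet[Predicate]

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: State
    add_effects: State
    del_effects: State

    def expected_state(self, state: State) -> State:
        """State we expect if the action succeeds."""
        return (state - self.del_effects) | self.add_effects

def execute_with_verification(
    plan: List[Action],
    state: State,
    execute: Callable[[Action], None],
    observe: Callable[[], State],
) -> Tuple[State, Optional[str]]:
    """Run each action, then check the observed state against its effects.
    Returns (last observed state, name of the failed action or None)."""
    for action in plan:
        if not action.preconditions <= state:
            return state, action.name
        execute(action)
        state = observe()
        effects_hold = action.add_effects <= state and not (action.del_effects & state)
        if not effects_hold:
            return state, action.name  # a caller would replan from `state`
    return state, None

pick = Action("pick(cup)", frozenset({("on", "cup", "table")}),
              frozenset({("holding", "cup")}), frozenset({("on", "cup", "table")}))
place = Action("place(cup, shelf)", frozenset({("holding", "cup")}),
               frozenset({("on", "cup", "shelf")}), frozenset({("holding", "cup")}))

# Simulated world (assumption): placing fails and the cup lands back on the table.
world = {"state": frozenset({("on", "cup", "table")})}

def execute(action: Action) -> None:
    if action.name.startswith("place"):
        world["state"] = frozenset({("on", "cup", "table")})
    else:
        world["state"] = action.expected_state(world["state"])

def observe() -> State:
    return world["state"]

final_state, failed = execute_with_verification([pick, place], world["state"], execute, observe)
print(failed)  # place(cup, shelf)
```

In a real system `observe` would be the VLM grounding predicates from camera images, which is exactly the step that can go wrong. The check itself is cheap; the perception feeding it is the hard part.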
The simulation problem (and a possible solution)
Here's where things get interesting. If we want robots to learn manipulation skills, we need environments to train them in. Real-world training is expensive and slow. Simulation is cheaper but, honestly, most simulated environments are kind of pathetic. Sparse furniture, no clutter, nothing like the chaos of an actual home.
SceneSmith, detailed in a paper on arXiv, tries to fix this with a hierarchical framework that generates realistic indoor environments from text descriptions. The numbers are impressive: 3-6x more objects than previous methods, less than 2% object collisions, and 96% of objects remain stable under physics simulation.
In a user study with 205 participants, SceneSmith achieved 92% average realism ratings and 91% prompt faithfulness. That's... actually pretty good? I should know this better, but I'm not sure what the baseline comparison looks like for these metrics. The paper claims wins against prior methods, though the specific baselines aren't clear from the abstract.
What I find most promising is that these environments can be used for automatic robot policy evaluation. If you can generate realistic test environments on demand, you can stress-test robot behaviors much more thoroughly than with hand-crafted scenarios.
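For a sense of what a collision number like that measures, here's a small sketch that counts the fraction of objects whose axis-aligned bounding boxes overlap another object's. This is one plausible reading of the metric and almost certainly simpler than whatever SceneSmith actually computes; `Box`, `overlaps`, and `collision_rate` are hypothetical names, and the scene is made up.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class Box:
    name: str
    min_xyz: Vec3
    max_xyz: Vec3

def overlaps(a: Box, b: Box, tol: float = 1e-6) -> bool:
    """True if the boxes interpenetrate on all three axes.
    Boxes that merely touch (a mug resting on a table) do not count."""
    return all(
        a.min_xyz[i] < b.max_xyz[i] - tol and b.min_xyz[i] < a.max_xyz[i] - tol
        for i in range(3)
    )

def collision_rate(boxes: List[Box]) -> float:
    """Fraction of objects that overlap at least one other object."""
    colliding = set()
    for a, b in combinations(boxes, 2):
        if overlaps(a, b):
            colliding.update((a.name, b.name))
    return len(colliding) / len(boxes) if boxes else 0.0

# Made-up scene: the chair clips into the table; the mug only rests on it.
scene = [
    Box("table", (0.0, 0.0, 0.0), (1.0, 1.0, 0.8)),
    Box("mug", (0.4, 0.4, 0.8), (0.5, 0.5, 0.9)),
    Box("chair", (0.9, 0.2, 0.0), (1.4, 0.7, 0.9)),
    Box("lamp", (2.0, 2.0, 0.0), (2.2, 2.2, 1.5)),
]
print(collision_rate(scene))  # 0.5
```

Bounding boxes overestimate collisions for oddly shaped objects, so a serious evaluation would use mesh-level checks from a physics engine. The point here is just that "percent of objects colliding" is something you can compute automatically for every generated scene.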
Learning from humans (sort of)
The data problem keeps coming up in these papers. Robot demonstrations are expensive to collect and only work for specific robot types. Human videos are everywhere, but humans and robots have different bodies, so you can't just copy what a human does.
A comprehensive survey on arXiv breaks down how researchers are trying to bridge this gap. The approaches fall into four categories:
Latent action representations that encode changes between video frames
World models that predict what will happen next
2D supervision extracted from image-plane cues
3D reconstruction that recovers geometry and motion
The survey highlights three open challenges that remain unsolved: structuring messy internet videos into usable training data, grounding video-derived knowledge into robot-executable actions, and designing evaluation protocols that actually predict real-world performance.
That last one is, in a way, the most important. We don't have great ways to know whether a model that performs well on benchmarks will actually work when you put it in a kitchen. How wide the gap between benchmark performance and deployment success really is remains unclear.
Affordances: the missing piece?
One more paper worth mentioning: AFUN, which aims to build what the researchers call an affordance foundation model. Published on arXiv, it predicts both where to interact with an object (a functional mask) and how to interact (a 3D motion curve) from a single RGB-D image and language description.
The results are substantial. On affordance segmentation, AFUN improves mean gIoU by 23.9 points and cIoU by 26.3 points over baselines across eight test sets. Contact-point prediction improves by 12.7% to 61.3% depending on the baseline.
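If gIoU and cIoU are unfamiliar: I'm assuming AFUN follows the convention common in referring and reasoning segmentation work, where gIoU is the mean of per-sample IoUs and cIoU is the cumulative intersection divided by the cumulative union across the whole test set. That's an assumption on my part, not something I've confirmed from the paper. Under that assumption, a minimal sketch looks like this (`giou_ciou` and the toy masks are hypothetical):

```python
from typing import List, Sequence, Tuple

Mask = Sequence[int]  # flattened binary mask, 1 = pixel belongs to the region

def intersection_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Pixel counts for the intersection and union of two binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter, union

def giou_ciou(pairs: List[Tuple[Mask, Mask]]) -> Tuple[float, float]:
    """Return (gIoU, cIoU) under the assumed convention:
    gIoU = mean of per-sample IoUs, cIoU = total intersection / total union."""
    if not pairs:
        raise ValueError("need at least one (prediction, target) pair")
    per_sample = []
    total_inter = total_union = 0
    for pred, target in pairs:
        inter, union = intersection_union(pred, target)
        # Two empty masks count as a perfect match.
        per_sample.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_sample) / len(per_sample)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou

# Toy data: a complete miss on a tiny region, a perfect hit on a large one.
pairs = [
    ([1, 0, 0, 0], [0, 1, 0, 0]),
    ([1] * 8, [1] * 8),
]
giou, ciou = giou_ciou(pairs)
print(giou, ciou)  # 0.5 0.8
```

The toy example shows why papers tend to report both: cIoU is dominated by large regions, while gIoU weights every sample equally, so a model that misses small affordance regions can still look good on cIoU alone.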
What's notable is that AFUN works on real robots without fine-tuning for specific embodiments or task-specific heuristics. That's rare. Most systems need significant adaptation to move from simulation to reality, or from one robot to another.
The team built a data pipeline that converts heterogeneous sources (robot data, human data, simulation, real-world scans) into a shared format. This kind of infrastructure work is unglamorous but probably essential for the field to progress.
Where does this leave us?
I keep coming back to a tension in all this research. On one hand, people are finding clever workarounds for VLM limitations: tool augmentation, planning scaffolds, better training environments, affordance models. These workarounds work, at least on benchmarks.
On the other hand, the need for so many workarounds suggests that current VLMs are missing something fundamental about spatial reasoning. They're not learning the intuitive physics that lets humans (and toddlers) manipulate objects without explicit planning.
It's too early to say whether scaling up existing approaches will eventually solve this, or whether we need architectural changes to how these models process spatial information. The papers I've read this week don't settle that question.
What they do show is that the robotics community is getting more systematic about identifying and addressing VLM limitations. The problems are clearer now than they were a year ago. The solutions are still partial, still scaffolded, still requiring human-designed structures to compensate for what the models can't do themselves.
But partial solutions that work in the real world? That's progress. I think.