Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
Centimeters. That's the margin we're talking about here.
I've been digging through a batch of recent robotics papers, and honestly, I keep coming back to this one detail that's bothering me. We've got vision-language-action models that can recognize thousands of objects, follow complex instructions, and generate fluid motion trajectories. But ask them to place a cup in a specific slot on a dish rack? That's where things fall apart.
Three separate research papers dropped recently that all circle the same problem from different angles, and I think they're telling us something important about where embodied AI actually is right now. Not where the press releases say it is. Where it actually is.
The slot problem is harder than it looks. A team introduced something called AnySlot, a framework specifically designed to handle what they call "slot-level placement." Think about tasks like putting dishes in a dishwasher, organizing tools on a pegboard, or slotting batteries into a remote. These aren't exotic research scenarios. They're Tuesday afternoon in a warehouse.
The core insight here is that current end-to-end VLA policies (the kind that go directly from visual input to motor commands) struggle with compositional language that requires precise geometric execution. You can tell a robot "put the red cup in the top-left slot" and it understands the words just fine. The semantic grounding works. But translating that understanding into centimeter-accurate placement? That's a different beast entirely.
AnySlot's solution is to break the problem into two stages. First, convert the language instruction into a visual goal by literally rendering a marker where the object should go. Then hand that visual target to a goal-conditioned policy that executes the placement. It's a hierarchical approach that decouples high-level slot selection from low-level motor control.
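To make the two-stage split concrete, here is a minimal toy sketch of the idea, not AnySlot's actual implementation. Everything in it is an assumption for illustration: the `Slot` type, the keyword-matching `select_slot` (a real system would use a vision-language model here), the 8x8 grayscale "image," and the one-pixel-per-step `goal_conditioned_step` standing in for a learned policy.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values
MARKER = 255             # pixel value reserved for the rendered goal marker


@dataclass(frozen=True)
class Slot:
    name: str
    center: Tuple[int, int]  # (row, col) in pixels


def select_slot(instruction: str, slots: Dict[str, Slot]) -> Slot:
    """Stage 1a (hypothetical): pick a slot by naive keyword match."""
    text = instruction.lower()
    for name, slot in slots.items():
        if name in text:
            return slot
    raise ValueError(f"no known slot named in: {instruction!r}")


def render_marker(image: Image, slot: Slot, radius: int = 1) -> Image:
    """Stage 1b: draw a square marker on a copy of the image at the slot."""
    goal = [row[:] for row in image]
    r0, c0 = slot.center
    for r in range(max(0, r0 - radius), min(len(goal), r0 + radius + 1)):
        for c in range(max(0, c0 - radius), min(len(goal[0]), c0 + radius + 1)):
            goal[r][c] = MARKER
    return goal


def goal_conditioned_step(goal: Image, gripper: Tuple[int, int]) -> Tuple[int, int]:
    """Stage 2 stand-in: move one pixel toward the marker's centroid.

    Note that this function never sees the language instruction,
    only the marked image. Assumes the scene has no other MARKER pixels.
    """
    marked = [(r, c) for r, row in enumerate(goal)
              for c, v in enumerate(row) if v == MARKER]
    target_r = round(sum(r for r, _ in marked) / len(marked))
    target_c = round(sum(c for _, c in marked) / len(marked))

    def toward(a: int, b: int) -> int:
        return (b > a) - (b < a)  # -1, 0, or +1

    return (gripper[0] + toward(gripper[0], target_r),
            gripper[1] + toward(gripper[1], target_c))


slots = {
    "top-left": Slot("top-left", (1, 1)),
    "bottom-right": Slot("bottom-right", (6, 6)),
}
scene = [[0] * 8 for _ in range(8)]
goal = render_marker(scene, select_slot("Put the red cup in the top-left slot", slots))

gripper = (5, 5)
for _ in range(10):
    gripper = goal_conditioned_step(goal, gripper)
print(gripper)  # (1, 1)
```

The point of the split is visible in the function signatures: the low-level step only has to chase a mark in the image, so any language ambiguity gets resolved before motor control starts.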
I initially thought this was sort of a hack, like they're just adding an extra step to compensate for model limitations. But after reading through their experiments, I think there's something deeper going on. The paper introduces SlotBench, a simulation benchmark with nine task categories specifically designed for precision placement. And the results show AnySlot significantly outperforming both flat VLA baselines and modular grounding methods.
The key word there is "significantly." In robotics papers, that usually means the difference between a system that works sometimes and one that works reliably. Though I should note, this is based on simulation results. Real-world performance tends to be messier.
Meanwhile, dialogue doesn't save us. You might be wondering if the problem is just that robots need to ask clarifying questions. If spatial reasoning is hard, maybe back-and-forth communication could help fill in the gaps?
A separate study tested exactly this hypothesis. Researchers set up a task where VLMs had to reconstruct a target structure through dialogue, combining visual interpretation, grounding, and action generation. The kind of collaborative scenario you'd want in, say, a human-robot assembly line.
The findings are, well, humbling. Spatial reasoning over visual representations remains difficult for the evaluated VLMs. That's the paper's conclusion, stated plainly. Multi-agent dialogue helps, but only barely (their word, not mine). The improvement exists, but it's marginal enough that the researchers seem almost surprised by how little it moved the needle.
What did help? Detailed text representations of the target structure yielded higher reconstruction success. In other words, if you describe the goal in exhaustive textual detail rather than relying on the model to interpret an image, performance improves across conditions. Decomposed image representations (breaking a scene into parts) also helped.
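For a feel of what an "exhaustive textual" goal description might look like, here is a small sketch. The format is my own guess, not the one used in the study: I'm assuming a block structure stored as a hypothetical mapping from `(column, row, layer)` coordinates to colors, and `describe_structure` is an illustrative helper.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int, int]  # (column, row, layer), an assumed convention


def describe_structure(blocks: Dict[Coord, str]) -> str:
    """Spell out every block's position and support, bottom layer first."""
    lines = []
    # Sort by layer, then row, then column so the text reads bottom-up.
    for (x, y, z), color in sorted(blocks.items(),
                                   key=lambda kv: (kv[0][2], kv[0][1], kv[0][0])):
        below = blocks.get((x, y, z - 1))
        support = f"the {below} block" if below else "the table"
        lines.append(
            f"{color} block at column {x}, row {y}, layer {z}, resting on {support}"
        )
    return "\n".join(lines)


target = {(0, 0, 0): "red", (0, 0, 1): "blue", (1, 0, 0): "green"}
print(describe_structure(target))
```

Every spatial relation the model would otherwise have to read off an image ("blue is on top of red") is stated outright, which is one plausible reason this kind of input is easier to act on.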
This is interesting because it suggests the bottleneck isn't language understanding or even action generation. It's visual spatial grounding, the ability to look at a scene and accurately encode where things are relative to each other. Our models can talk about space. They struggle to see it.
Memory makes everything harder. The third paper that caught my attention tackles a related but distinct problem: what happens when placement tasks require remembering information across time?
Researchers working on scratchpad-augmented VLAs make a point that honestly should be more obvious than it is. Many dexterous manipulation tasks are non-Markovian in nature. Translation: the correct action depends on what happened before, not just what you're seeing right now. Yet most VLAs are, as the paper puts it, "stateless." They process each moment independently.
Think about a task like sorting objects into bins based on a rule you were told at the start. Or picking items in a specific sequence. Or, honestly, most real assembly tasks where step 4 depends on how you did step 2.
Their solution is to give the model a "language scratchpad," essentially a running text buffer where the system can write notes to itself. It can memorize object positions, track progress toward subgoals, and maintain a plan over time. The approach was tested on memory-dependent tasks from the ClevrSkills environment, on something called MemoryBench, and on a real-world pick-and-place task.
The results show that incorporating a scratchpad significantly improves generalization on these tasks for both recurrent and non-recurrent architectures. Which makes sense, but also raises questions about why we've been building stateless systems for so long when so many real tasks obviously require memory.
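Here is a toy sketch of why a text scratchpad helps on a non-Markovian task, using the bin-sorting example from above. This is not the paper's system: in the real thing the model writes its own notes, whereas here the note format (`rule: color -> bin`), the `Scratchpad` class, and both policies are hand-coded assumptions.

```python
from typing import List, Optional


class Scratchpad:
    """A running text buffer the policy can write to and read back."""

    def __init__(self) -> None:
        self.notes: List[str] = []

    def write(self, note: str) -> None:
        self.notes.append(note)

    def render(self) -> str:
        return "\n".join(self.notes)


def stateless_policy(observation: str) -> Optional[str]:
    """Sees only the current object. The sorting rule is long gone."""
    return None  # no basis for choosing a bin


def scratchpad_policy(observation: str, pad: Scratchpad) -> Optional[str]:
    """Sees the current object plus its own earlier notes."""
    for note in pad.notes:
        if note.startswith("rule: "):
            color, bin_name = note[len("rule: "):].split(" -> ")
            if color == observation:
                return bin_name
    return None


pad = Scratchpad()
# Step 0: the rule is stated once, then never shown again.
pad.write("rule: red -> bin A")
pad.write("rule: blue -> bin B")

placements = []
for seen in ["red", "blue", "red"]:  # later steps show only the object
    chosen = scratchpad_policy(seen, pad)
    placements.append(chosen)
    pad.write(f"placed {seen} in {chosen}")  # track progress for later steps

print(placements)  # ['bin A', 'bin B', 'bin A']
```

The same observation ("red") is unanswerable for the stateless policy and trivial for the one with notes, which is the non-Markovian point in miniature.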
What ties these together. Here's what I think is happening, and I should be upfront that this is my interpretation, not something any of these papers explicitly claims.
We've been so focused on scaling language understanding and visual recognition that we've somewhat neglected the geometric and temporal reasoning that makes manipulation actually work. It's like we built incredible eyes and ears but forgot about proprioception.
The AnySlot paper addresses this by adding an explicit spatial goal representation. The dialogue paper reveals that visual spatial grounding is the weak link even when models can talk the task through. The scratchpad paper shows that temporal memory is basically absent from systems that need it.
Three different research groups, three different approaches, all pointing at the same gap.
Why this matters for the humanoid moment. We're in the middle of what feels like a humanoid robot gold rush. Figure, 1X, Apptronik, Tesla, and others are all racing to deploy general-purpose robots in warehouses and factories. The pitch is usually some version of "drop-in replacement for human workers."
But human workers are really good at spatial reasoning. We can eyeball whether a part will fit. We remember where we put the tool we'll need in step 7. We adjust our grip based on subtle visual cues about an object's weight distribution.
These papers suggest that our current VLA architectures, even the impressive ones, have real limitations in exactly these areas. Not insurmountable ones. AnySlot shows you can engineer around the slot problem with hierarchical design. The scratchpad work shows you can add memory with the right augmentation. But these aren't solved problems. They're active research challenges.
I talked to a few people in the field while trying to make sense of this (off the record, unfortunately), and the general sentiment was something like cautious concern. The gap between demo videos and reliable deployment is wider than the funding announcements suggest. Centimeter-level precision matters when you're loading dishwashers or assembling electronics. "Close enough" doesn't cut it.
The benchmarking problem. One thing that struck me across all three papers is the emphasis on new benchmarks. SlotBench for slot-level placement. MemoryBench for memory-dependent tasks. Structured evaluation frameworks for collaborative reconstruction.
This tells me something about the state of the field. When researchers feel the need to create new benchmarks, it usually means existing ones aren't capturing what matters. We've been measuring the wrong things, or at least not measuring the right things precisely enough.
It also suggests that progress in these areas has been slower than progress on the metrics we were tracking. If your benchmark says your model is 95% accurate but it still can't reliably place objects in slots, you've got a benchmark problem.
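A quick worked example of how the same set of placements can look great or terrible depending on the tolerance a benchmark uses. The error values below are made up for illustration and don't come from any of the three papers; `success_rate` is just a hypothetical helper.

```python
from typing import List


def success_rate(errors_cm: List[float], tolerance_cm: float) -> float:
    """Fraction of placements that land within the given tolerance."""
    return sum(e <= tolerance_cm for e in errors_cm) / len(errors_cm)


# Hypothetical placement errors in centimeters for ten trials.
errors = [0.4, 0.9, 1.8, 2.5, 3.1, 0.7, 4.2, 1.2, 2.9, 0.5]

print(success_rate(errors, tolerance_cm=5.0))  # 1.0, "roughly the right area"
print(success_rate(errors, tolerance_cm=1.0))  # 0.4, "actually in the slot"
```

Same robot, same trials. Only the definition of success changed, which is why precision-focused benchmarks can tell a very different story than coarse ones.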
What I'm still uncertain about. Honestly, I'm not sure how quickly these gaps will close. The optimistic read is that now that we've identified the problems clearly, targeted solutions like AnySlot and scratchpad augmentation will proliferate. The engineering solutions exist. They just need to be integrated into production systems.
The pessimistic read is that these are fundamental architectural limitations. Maybe end-to-end VLAs are just the wrong paradigm for precision manipulation, and we need something more modular. Maybe the current generation of vision encoders simply doesn't capture spatial relationships with enough fidelity.
I lean toward cautious optimism, but I've been wrong before. The history of AI is full of problems that seemed almost solved until they weren't.
What to watch for. If you're tracking this space, here's what I'd pay attention to:
First, whether hierarchical approaches like AnySlot become standard practice or remain research curiosities. If major robotics companies start adopting explicit spatial goal representations, that's a signal the field has internalized these lessons.
Second, benchmark adoption. Will SlotBench and similar precision-focused evaluations become standard? Or will we keep optimizing for metrics that don't capture real-world performance?
Third, memory architectures. The scratchpad approach is clever, but it's also sort of a bolt-on solution. Watch for whether next-generation VLAs incorporate memory more natively, or whether we keep adding external augmentations.
And finally, watch the gap between demos and deployments. If humanoid companies start talking less about impressive videos and more about uptime, error rates, and precision tolerances, that's a sign of maturation. If the marketing stays flashy while the technical details stay vague, well, draw your own conclusions.
The centimeter conclusion. We started with centimeters, so let's end there.
The difference between a robot that can place an object "roughly in the right area" and one that can hit a specific slot reliably is the difference between a research demo and a useful tool. These three papers, in their different ways, are all about closing that gap.
It's not glamorous work. Spatial reasoning and memory don't make for exciting headlines the way "robot learns to cook" does. But it's the work that will determine whether the current generation of embodied AI actually delivers on its promises.
I think we'll get there. I'm just less sure about the timeline than the press releases suggest.