Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
Why do robots still get lost in places they've been a hundred times?
I've been covering autonomy long enough to remember when the answer was always "better sensors." Then it was "more compute." Then it was "bigger neural nets." Now we're in the foundation model era, and guess what, robots are still getting confused about whether that thing in front of them is a parked car or a large shrub. The sensors got better! The compute got faster! The neural nets got enormous! And yet.
Two papers dropped this week that I think get at something important, something the press releases and demo videos tend to gloss over. The problem isn't perception anymore, not really. The problem is that modern robots have two completely different ways of understanding the world, and those two ways keep contradicting each other.
Here's what's happening under the hood of any modern autonomous system worth its salt. You've got your geometric perception stack, the stuff that's been around since the DARPA Grand Challenge days: LiDAR, depth cameras, SLAM algorithms that build maps of physical space. This channel is boring and reliable. It knows where the walls are. It knows you can't drive through a concrete pillar.
Then you've got your foundation model channel, your GPT-4Vs and your Geminis and whatever else the kids are plugging in these days. This channel is exciting and unreliable. It can tell you that's a fire hydrant, not a bollard. It can read street signs. It can understand that the person waving their arms is probably telling you to stop.
The problem, and I've seen this movie before with sensor fusion in the 2010s, is that nobody's figured out what to do when these two channels disagree. The geometric stack says "there's definitely something there." The foundation model says "that's a pedestrian." But what if the foundation model is hallucinating? What if it's confidently wrong, which, call me old-fashioned, I've noticed these models tend to be?
A team (the paper doesn't specify the institution, which is a bit odd) published work on arXiv this week proposing what they call a "conflict-drop window": basically, a mechanism that refuses to commit foundation model claims when the geometric channel contradicts them at the moment of observation. The results are striking: car commit precision jumped from 43.9% to 99.7% on the KITTI-360 benchmark.
Let me say that again. At 43.9% precision, the baseline system was wrong about more than half of the cars it committed to. Cars! The thing we've been training autonomous vehicles to recognize for over a decade!
Now, this is based on a specific benchmark configuration with an oracle geometric channel, so real-world numbers would be messier. But the direction is clear: we've been treating foundation models as just another voter in the perception pipeline, and that's been a mistake.
The key insight, and this is the part that feels obvious in retrospect, is that foundation models don't come with calibrated reliability scores. They're confident about everything. Your geometric perception stack, boring as it is, at least knows when it's uncertain. The foundation model will tell you with equal confidence that something is a mailbox whether it's actually a mailbox or a small child in a boxy costume.
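For the code-minded, here's a minimal sketch of that gating idea as I read it from the description above. It is not the authors' implementation: the `Observation` fields, the function names, and the four toy observations are all my own assumptions, and the toy numbers have nothing to do with the paper's KITTI-360 results. It also ignores whatever the "window" part of the mechanism does over time and only shows the core rule, which is to drop a semantic claim when geometry contradicts it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Observation:
    """One object seen by both channels. All fields are hypothetical."""
    semantic_label: str   # what the foundation model claims, e.g. "car"
    geometry_agrees: bool  # does the geometric channel support that claim?
    true_label: str       # ground truth, used only for scoring


def commit_always(obs: Observation) -> Optional[str]:
    """Baseline: treat the foundation model as trusted and commit everything."""
    return obs.semantic_label


def commit_with_conflict_drop(obs: Observation) -> Optional[str]:
    """Gate: refuse to commit when geometry contradicts the claim."""
    if not obs.geometry_agrees:
        return None  # conflict at observation time, so drop the claim
    return obs.semantic_label


def commit_precision(
    observations: List[Observation],
    policy: Callable[[Observation], Optional[str]],
) -> float:
    """Fraction of committed claims that match ground truth."""
    committed = [(policy(o), o.true_label) for o in observations]
    committed = [(claim, truth) for claim, truth in committed if claim is not None]
    if not committed:
        return 0.0
    return sum(claim == truth for claim, truth in committed) / len(committed)


# Made-up toy data: the model calls everything a car, and geometry
# disagrees exactly when the thing is actually a shrub.
TOY_OBSERVATIONS = [
    Observation("car", True, "car"),
    Observation("car", True, "car"),
    Observation("car", False, "shrub"),
    Observation("car", False, "shrub"),
]

print(commit_precision(TOY_OBSERVATIONS, commit_always))              # 0.5
print(commit_precision(TOY_OBSERVATIONS, commit_with_conflict_drop))  # 1.0
```

Notice what the gate costs you: it commits to fewer things. Precision goes up because the system stays quiet when the channels disagree, not because the foundation model got any smarter.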
The second paper, from a team that actually does link to their code repository, takes a different angle on the same fundamental problem. Their work, also on arXiv, focuses on vision-language navigation: getting robots to follow natural language instructions through unfamiliar spaces.
Their diagnosis is blunt: VLMs "struggle with 3D spatial reasoning and fail to capture the causal dynamics between actions and spatial transitions." In other words, these models can understand language beautifully, can recognize objects in 2D images like nobody's business, but ask them to understand that turning left and walking ten feet puts you in a different room and they fall apart.
Their solution is a hierarchical map structure that essentially translates 3D geometric information into a format that VLMs can actually reason about. Three levels:
Geometric level: where you can and can't walk, basic obstacle information
Semantic level: what objects are where and how they relate to each other
Decision level: high-level task planning and goal selection
The VLM handles the semantic and decision stuff, picking waypoints that make sense given the instructions. A classical path-planning algorithm, the kind we've had working reliably since the 80s, handles actually getting between those waypoints without hitting things.
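Here's a toy sketch of that division of labor, under heavy assumptions. The grid, the object labels, and every function name are hypothetical, `choose_waypoint` is a keyword match standing in for a real VLM call, and the planner is plain breadth-first search rather than whatever the paper actually uses. The point is only the shape of the thing: the language side picks where to go, the geometric side decides how to get there.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]  # (row, column)

# Geometric level: toy occupancy grid, 1 = obstacle, 0 = free.
GRID = [
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

# Semantic level: which objects sit at which cells (made-up labels).
SEMANTIC_MAP: Dict[str, Cell] = {"stove": (0, 4), "sofa": (4, 0)}


def choose_waypoint(instruction: str, semantic_map: Dict[str, Cell]) -> Optional[Cell]:
    """Decision level: stand-in for a VLM. Here it is just a keyword match."""
    for label, cell in semantic_map.items():
        if label in instruction.lower():
            return cell
    return None


def plan_path(grid: List[List[int]], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """Breadth-first search over 4-connected free cells."""
    rows, cols = len(grid), len(grid[0])
    parents: Dict[Cell, Optional[Cell]] = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk back through parents to the start
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            in_bounds = 0 <= nr < rows and 0 <= nc < cols
            if in_bounds and grid[nr][nc] == 0 and nxt not in parents:
                parents[nxt] = cell
                queue.append(nxt)
    return None  # goal unreachable


def navigate(instruction: str, start: Cell) -> Optional[List[Cell]]:
    """Language picks the goal; the classical planner finds the route."""
    goal = choose_waypoint(instruction, SEMANTIC_MAP)
    if goal is None:
        return None
    return plan_path(GRID, start, goal)


print(navigate("Go to the stove", (0, 0)))
```

The language side never touches the occupancy grid, and the planner never has to know what a stove is. Swap in a real model for `choose_waypoint` and the wall-avoidance logic doesn't change at all.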
This is what I mean when I say we keep relearning the same lessons. The robotics field spent years trying to get neural networks to do everything end-to-end, and now we're coming back around to hybrid architectures that let different systems do what they're good at. The foundation model is good at understanding that "go to the kitchen" means finding a room with a stove and refrigerator. The geometric planner is good at not walking into walls. Why were we ever trying to make one system do both?
The results on the R2R-CE and RxR-CE benchmarks are genuinely impressive: they claim state-of-the-art zero-shot performance that beats some supervised methods. Though I'd want to see more independent replication before getting too excited. We've been burned before on navigation benchmarks that don't transfer to real environments.
What I find interesting about both papers is what they don't say. Neither one really grapples with the deployment question: what happens when you put these systems in environments that are genuinely novel, not just held-out test sets from the same distribution as training? KITTI-360 was recorded on German streets. ScanNet is indoor scenes. What happens when your delivery robot encounters, I don't know, a street fair? A construction zone? Snow?
This is the self-driving car hype cycle all over again. The benchmarks look great. The demos are impressive. And then you deploy and discover that the real world has a much longer tail than your test set.
But I don't want to be too grumpy about this. These papers represent genuine progress on a real problem. The insight that foundation models need different integration strategies than traditional perception, that you can't just treat them as another sensor, is valuable. The specific mechanisms proposed (calibrated commit gates, conflict-drop windows, hierarchical maps that separate semantic reasoning from action execution) are the kinds of architectural innovations that actually move the field forward.
The question, and it remains unclear at this point, is whether these approaches will generalize. Both papers evaluate on established benchmarks with known characteristics. Real-world deployment means handling the unknown unknowns, the situations that aren't in any dataset because nobody thought to record them.
I've been doing this long enough to know that the gap between benchmark performance and real-world reliability is where robotics companies go to die. But at least we're asking the right questions now. The problem isn't sensors. The problem isn't compute. The problem is integration: getting systems with fundamentally different strengths and failure modes to work together without one corrupting the other.
That's progress. Slow, unglamorous progress that won't make any headlines, but progress nonetheless.
If you want to argue about any of this, my email's on the about page. I still check it, unlike apparently everyone under 40.