Robots Are Finally Learning to See. Now the Hard Part Begins.
A wave of new research on imitation learning and robot perception is promising a lot. Mark Kowalski has seen this kind of promise before.
Yesterday · 7 min read
I've seen this movie before. A cluster of papers drops, each one claiming to solve some foundational problem in robotics, and for a few weeks the field buzzes with excitement before reality reasserts itself and everyone gets back to the slow, grinding work of actually making things function outside a lab. That's not cynicism, that's just pattern recognition built up over thirty-something years of covering tech cycles.
But here's the thing: this particular cluster of papers on robot manipulation and imitation learning is, genuinely, worth paying attention to. Not because any single one of them cracks the code, but because taken together they're pointing at something real, a convergence of ideas around how robots perceive space and learn from watching others, that feels less like hype and more like actual progress. I could be wrong. I've been wrong before. But let me walk you through what's actually happening.
The core problem, stated simply, is this: robots are bad at figuring out where things are and what to do about it, especially when you change anything about the environment they were trained in. Move the camera. Change the lighting. Put the cup in a slightly different spot. A robot that seemed competent will suddenly look very, very stupid. The research community has been chipping away at this for years, and several new papers suggest some of the chipping is finally drawing blood.
Start with the camera question, because it's more interesting than it sounds. Most manipulation systems today use multiple cameras, including a wrist-mounted camera that essentially gives the robot a close-up view of whatever it's grabbing. That's the de facto standard, as a new paper on something called Spatially Conditioned Diffusion Policy (SCDP) puts it. The problem is that wrist cameras add complexity, cost, and failure points. The SCDP paper argues you can get comparable performance from a single fixed camera if you're clever about how the policy uses visual information.
The clever part involves using the robot's predicted end-effector trajectory as a kind of attention signal, essentially telling the visual encoder "look here, this is where the action is." The results in simulation are competitive with multi-camera setups, and real-world tests show reasonable robustness to visual distractors. Whether this holds up at scale and in genuinely messy environments, well, it's too early to say. But the basic idea is elegant and I appreciate elegant ideas even when I'm skeptical of the claims around them.
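For readers who want the "look here" idea in concrete form, here is a minimal sketch of one way to turn a predicted trajectory into a spatial weighting over image features. It is not the SCDP authors' implementation: the Gaussian weighting, the function names, and the assumption that waypoints are already projected into feature-map coordinates are all mine, chosen for illustration.

```python
import math

def trajectory_attention(height, width, waypoints, sigma=1.5):
    """Build a height x width weight map that peaks near the waypoints.

    waypoints: (row, col) positions of the predicted end-effector, assumed
    to be already projected into feature-map coordinates (hypothetical setup).
    The returned weights sum to 1.
    """
    weights = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            for wr, wc in waypoints:
                d2 = (r - wr) ** 2 + (c - wc) ** 2
                weights[r][c] += math.exp(-d2 / (2 * sigma ** 2))
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]

def pool_features(feature_map, weights):
    """Weighted average of per-cell feature vectors using the weight map."""
    dim = len(feature_map[0][0])
    pooled = [0.0] * dim
    for feat_row, w_row in zip(feature_map, weights):
        for feat, w in zip(feat_row, w_row):
            for i in range(dim):
                pooled[i] += w * feat[i]
    return pooled
```

The point of the design is that cells far from where the arm is headed contribute almost nothing to the pooled feature, which is one plausible reason a distractor in the corner of the frame would matter less.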
A separate paper, also on arXiv, attacks a different piece of the perception puzzle: the way visual encoders compress spatial information before passing it to the action-generation module. The researchers, working on something they call PRISM, found that standard approaches lose fine-grained spatial detail through repeated downsampling, which hurts performance especially on high-precision tasks. Their fix, using multiscale cross-attention to preserve that detail, improved success rates on a notoriously difficult benchmark called ToolHang from 5.0% to 13.4% while adding only 15.4% more parameters. Those numbers sound modest but in manipulation research, going from 5% to 13% on a hard task is not nothing.
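To put that ToolHang jump in perspective, the arithmetic is worth spelling out. This small snippet only restates the two success rates quoted above; the helper name is my own.

```python
def compare_rates(before_pct, after_pct):
    """Return the absolute gain (percentage points) and the relative factor."""
    absolute_gain = after_pct - before_pct
    relative_factor = after_pct / before_pct
    return absolute_gain, relative_factor

# ToolHang success rates quoted in the article: 5.0% before, 13.4% after.
gain, factor = compare_rates(5.0, 13.4)
print(f"absolute gain: {gain:.1f} percentage points")
print(f"relative improvement: {factor:.2f}x")
```

So the absolute gain is 8.4 percentage points, but the success rate is roughly 2.7 times what it was, which is why the result reads as meaningful despite the low ceiling.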
Then there's the data problem, which is sort of the meta-problem underneath all of this. Imitation learning works by training on demonstrations, and getting enough demonstrations across enough varied conditions is expensive and slow. A paper on a system called R2RDreamer proposes augmenting a small set of real demonstrations by editing them in 3D space and then using a video generation model to fill in realistic RGB observations. The goal is spatial generalization, getting a policy trained on a handful of demos to work across varied object positions and camera angles without collecting hundreds more real-world examples. This approach avoids the sim-to-real gap that plagues simulation-based augmentation, at least in theory. The results are promising, but they rest on a limited set of evaluation tasks, and I'd want to see the approach stress-tested more before getting too excited.
Two other papers caught my eye for different reasons.
One tackles something that's bothered me about imitation learning for a while, which is that most policies only look at a short window of recent observations. If the robot gets confused partway through a task, it can get stuck in a loop, repeating the same failing motion over and over because it has no memory of what it already tried. A new study investigates what happens when you scale context length, meaning how much history the policy can see, from short to long across various tasks. The finding is actually reassuring in a way: naively scaling context length is not as brittle as prior literature suggested. With the right architecture choices, specifically a UNet with cross-attention for denoising, longer context helps on memory-dependent tasks without wrecking performance on simpler ones. The paper also proposes training policies at multiple context lengths simultaneously to reduce sample complexity. It's careful, systematic work of the kind that doesn't generate headlines but actually moves the field forward.
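Here is a rough sketch of what "training at multiple context lengths" could look like on the data side. The paper's actual procedure may differ; the left-padding scheme, the uniform sampling over lengths, and both function names are assumptions of mine.

```python
import random

def make_context(history, context_len, pad_value=None):
    """Return the last context_len observations, left-padded if history is short."""
    if context_len < 1:
        raise ValueError("context_len must be at least 1")
    window = history[-context_len:]
    padding = [pad_value] * (context_len - len(window))
    return padding + window

def sample_training_context(history, context_lengths, rng):
    """Pick one context length at random for this training example.

    Sampling uniformly over the candidate lengths is an assumption of this
    sketch, not something taken from the paper.
    """
    k = rng.choice(context_lengths)
    return k, make_context(history, k)

# Example: the same episode history viewed through a short or a long window.
rng = random.Random(0)
episode = ["obs0", "obs1", "obs2", "obs3", "obs4"]
length, context = sample_training_context(episode, [2, 4, 8], rng)
```

The same policy sees short and long windows during training, so in principle it can lean on history when a task needs memory and ignore it when it doesn't.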
The other, a one-shot imitation paper, is the flashiest of the bunch, and I mean that as mild criticism, not a compliment. DemoDiffusion, described in a paper on arXiv, claims to enable robots to perform manipulation tasks by imitating a single human demonstration, with no task-specific training and no paired human-robot data. The trick is combining kinematic retargeting of hand motion with a pre-trained diffusion policy that adjusts the trajectory to stay within the distribution of plausible robot actions. Across 8 tasks, the system achieves 83.8% average success, compared to 52.5% for plain kinematic retargeting and 13.8% for the base policy alone. Those are striking numbers. I remain curious about how the 8 tasks were selected and how performance degrades on tasks further from the training distribution of the base policy, because that's always where these things fall apart. The paper doesn't fully answer that question, at least not to my satisfaction.
There's also work on using human video at scale to train robot policies. CLAP, which stands for Contrastive Latent Action Pretraining and is described in another arXiv paper, tries to bridge the gap between the abundance of human video on the internet and the scarcity of labeled robot demonstration data. The approach learns a latent action vocabulary from robot trajectories and then aligns human video into that vocabulary through contrastive learning, so the robot can effectively learn manipulation skills from watching people do things. It's ambitious. Whether the latent action space learned from robot data is rich enough to capture the full range of human manipulation behavior remains unclear, and that's a real open question, not a rhetorical one.
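"Aligning through contrastive learning" usually means something like the InfoNCE objective: pull matched pairs together and push mismatched pairs apart. The sketch below is a generic InfoNCE loss in plain Python, not CLAP's actual loss. The pairing of human-video embeddings with robot latent-action embeddings, the temperature value, and the function names are illustrative assumptions.

```python
import math

def normalize(v):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(human_embs, robot_embs, temperature=0.1):
    """Mean InfoNCE loss where human_embs[i] is the positive for robot_embs[i]."""
    h = [normalize(v) for v in human_embs]
    r = [normalize(v) for v in robot_embs]
    loss = 0.0
    for i, hv in enumerate(h):
        logits = [dot(hv, rv) / temperature for rv in r]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(h)
```

When each human clip sits closest to its matching robot action, the loss approaches zero; when the pairs are scrambled, it grows. That is the whole mechanism, and it also shows the limitation raised above: the human clip can only be matched to something the robot-derived vocabulary already contains.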
And then there's ReMoBot, a paper on mobile manipulation with a Boston Dynamics Spot robot that takes a different approach entirely: instead of distilling demonstrations into a parametric policy, it retrieves relevant demonstrations at inference time and uses them directly to select actions. With just 20 demonstrations per task, it achieves 70% success on a Table Uncover task and 80% on a Gap Cover task in real-world settings. The training-free angle is interesting, even if the approach has obvious scaling questions around demonstration retrieval in complex environments.
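The retrieval idea is simple enough to sketch in a few lines. This is a generic nearest-neighbor lookup, not ReMoBot's actual retrieval scheme, which I haven't tried to reproduce; the embedding format, the Euclidean distance, and the averaging of the k nearest actions are all assumptions for illustration.

```python
import math

def retrieve_action(query_embedding, demos, k=3):
    """Average the actions of the k stored demo steps closest to the query.

    demos: list of (embedding, action) pairs, where both are lists of floats.
    """
    ranked = sorted(demos, key=lambda d: math.dist(query_embedding, d[0]))
    nearest = ranked[:k]
    dim = len(nearest[0][1])
    return [sum(action[i] for _, action in nearest) / len(nearest) for i in range(dim)]

# Example: two stored steps near the query agree on the action, one far away does not.
demos = [
    ([0.0, 0.0], [1.0, 0.0]),
    ([0.0, 1.0], [1.0, 0.0]),
    ([10.0, 10.0], [0.0, 1.0]),
]
action = retrieve_action([0.0, 0.2], demos, k=2)
```

Nothing is trained here, which is the appeal. The scaling question is also visible: every query is compared against every stored step, and the answer is only as good as the nearest thing in the library.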
Look, this is a lot of papers and a lot of ideas, and I'm not going to pretend I can synthesize them into a clean narrative about where robot manipulation is headed, because I don't think anyone can right now. What I can say is that the field is attacking the right problems: perception under limited sensing, generalization from few demonstrations, memory and context, learning from human behavior without expensive robot-specific data collection. Those are the real bottlenecks and it's good to see serious work on all of them.
What I'm watching for is whether any of this survives contact with the real world at scale, in settings that weren't designed to make the research look good. That's where the self-driving car hype cycle taught me to pay attention: not to the demo videos, not to the simulation benchmarks, but to the messy, uncontrolled, slightly chaotic conditions that actual deployment involves. The kids doing this research are smart, genuinely smart, and some of them are going to be right about things I'm currently skeptical of. But the gap between a promising paper and a robot that reliably does useful work in an unstructured environment is still very, very large.
We're making progress. It's just slower than the press releases suggest. But what do I know.