Two New Studies Show Foundation Models Can Actually Work for Robots, With Some Clever Engineering
Researchers are figuring out how to adapt vision models like SAM for robotic tasks without burning through hundreds of GPU-hours, and the results are surprisingly practical.
Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
Foundation models are everywhere in AI research right now, but making them work for actual robots has been, frankly, a mess. Two new papers on arXiv this week offer some of the clearest evidence yet that the gap between impressive demos and reliable manipulation can be closed, if you're willing to get into the weeds on representation alignment and parameter efficiency.
The Segment Anything Model and similar foundation models were trained on millions of internet images. Robotic environments look nothing like that. From my time in hardware, I've watched plenty of teams try to bolt a foundation model onto a robot arm and wonder why it fails on transparent objects or cluttered bins.
The RepSAM paper actually quantifies this problem. The authors measured representation similarity across SAM's transformer layers using Centered Kernel Alignment (CKA), a similarity score between 0 and 1, and found something interesting: shallow layers show large domain gaps (CKA below 0.7), while deeper layers remain relatively stable (CKA above 0.7). That's an ambitious claim to verify, but their methodology looks sound.
This means you can't just fine-tune the whole model uniformly. The shallow layers need more aggressive adaptation than the deeper ones. It's the kind of insight that seems obvious in hindsight but requires actual measurement to confirm.
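For readers who haven't met CKA before, here is a minimal sketch of the linear variant in Python with NumPy. It assumes both activation matrices come from the same set of inputs (for example, the same images passed through two versions of a model). Which CKA variant the RepSAM authors use and how they pair samples isn't something I can confirm, so the function name `linear_cka` and the setup are illustrative only.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices.

    Both inputs have shape (n_samples, n_features); the feature
    dimensions may differ, but the rows must be the same samples.
    """
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have the same number of samples")
    # Center each feature column so constant offsets do not affect the score.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layer_a = rng.normal(size=(50, 8))   # stand-in activations, not real SAM features
    layer_b = rng.normal(size=(50, 8))   # unrelated stand-in activations
    print("same representation:", linear_cka(layer_a, layer_a))
    print("unrelated representations:", linear_cka(layer_a, layer_b))
```

A score near 1 means the two layers encode the inputs in essentially the same way (up to rotation and uniform scaling); lower scores indicate the representations have drifted apart, which is the signal the paper reads as a domain gap.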
Here's where RepSAM gets interesting for anyone who's ever had to justify GPU budgets.
The numbers:
Full fine-tuning: 632M trainable parameters, 384 GPU-hours on A100
RepSAM: 4.0M trainable parameters, 4 hours on a single A100
That's a 158x reduction in trainable parameters and a 96x reduction in GPU-hours. The performance hit? They report 89.0% mIoU versus 90.9% for full fine-tuning, so RepSAM retains 97.9% of the fully fine-tuned score. The paper claims these results are statistically significant (p < 0.01), though I'd want to see more detail on their validation methodology.
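The ratios are easy to check from the reported figures. This quick Python snippet assumes that 4 hours on a single A100 counts as 4 GPU-hours; the variable names are mine, not the paper's.

```python
# Figures as reported in the article.
full_params = 632e6        # trainable parameters, full fine-tuning
repsam_params = 4.0e6      # trainable parameters, RepSAM
full_gpu_hours = 384       # A100 GPU-hours, full fine-tuning
repsam_gpu_hours = 4       # assumption: 4 h on one A100 = 4 GPU-hours
full_miou = 90.9           # percent
repsam_miou = 89.0         # percent

param_reduction = full_params / repsam_params
time_reduction = full_gpu_hours / repsam_gpu_hours
retained = repsam_miou / full_miou

print(f"parameter reduction: {param_reduction:.0f}x")
print(f"GPU-hour reduction:  {time_reduction:.0f}x")
print(f"mIoU retained:       {retained:.1%}")
```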
Look, I've seen enough spec sheets to know that benchmark numbers don't always translate to real-world performance. But a 12.0-percentage-point improvement in robotic manipulation success rates over their LoRA baseline is substantial. That's the difference between a robot that works most of the time and one that actually ships.
The second paper takes a different approach. The keypoint imitation learning study ran over 2,000 real-world rollouts to test whether extracting keypoints from foundation models can help robots generalize to unseen objects.
Their headline result: 75% overall success rate across five manipulation tasks. For comparison:
RGB baseline: 47%
S2-diffusion: 73%
So keypoint imitation learning (KIL) performs roughly on par with S2-diffusion while being more data-efficient. The researchers are refreshingly honest about limitations. They explicitly note that KIL "does not outperform alternative representations" in all cases and "inherits limitations of the foundation models used for keypoint extraction."
That's the kind of hedging I wish more papers included. The real test is whether these approaches work outside controlled lab conditions, and we don't know yet.
If you're building robotic systems that need to handle novel objects or cluttered environments, these papers suggest a few things:
For perception pipelines:
Don't fine-tune foundation models uniformly. Measure where the domain gap actually exists (shallow vs. deep layers) and allocate parameters accordingly.
RepSAM's CKA-guided rank allocation is, basically, a principled way to decide where to spend your parameter budget.
Transparent objects and clutter remain hard, but multi-modal fusion helps.
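To make the "spend parameters where the gap is" idea concrete, here is a rough Python sketch of one way to turn per-layer CKA scores into per-layer adapter ranks. This is my own illustrative heuristic (rank proportional to 1 minus CKA), not RepSAM's actual allocation rule, and the function name `allocate_ranks`, the rank budget, and the CKA values in the example are all hypothetical.

```python
def allocate_ranks(cka_per_layer, total_rank, min_rank=1):
    """Split a total low-rank budget across layers.

    Layers with lower CKA (a larger domain gap) receive more rank.
    Illustrative heuristic only: the share is proportional to (1 - CKA).
    """
    n = len(cka_per_layer)
    spare = total_rank - min_rank * n
    if n == 0 or spare < 0:
        raise ValueError("budget too small for the requested minimum rank")
    gaps = [max(0.0, 1.0 - c) for c in cka_per_layer]
    total_gap = sum(gaps)
    if total_gap == 0:
        # No measurable gap anywhere: fall back to an even split.
        raw = [spare / n] * n
    else:
        raw = [spare * g / total_gap for g in gaps]
    ranks = [min_rank + int(r) for r in raw]
    # Hand out the integer remainder to the largest fractional parts.
    leftover = total_rank - sum(ranks)
    order = sorted(range(n), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    for i in order[:leftover]:
        ranks[i] += 1
    return ranks


if __name__ == "__main__":
    # Made-up CKA scores, ordered from shallow to deep layers.
    example_cka = [0.4, 0.6, 0.8, 0.9]
    print(allocate_ranks(example_cka, total_rank=32))
```

The design choice worth noticing is the floor of `min_rank` per layer: even layers that look well aligned keep a little capacity to adapt, which seems safer than freezing them outright on the strength of a single similarity measurement.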
For imitation learning:
Keypoints extracted from foundation models provide useful intermediate representations, but they're not magic.
You'll still need to handle multiple object instances carefully.
The 2,000+ rollouts in the KIL paper provide reasonable confidence in the results, though replication on different robot platforms would strengthen the findings.
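How much confidence a success rate deserves depends on how many rollouts sit behind each individual number, which I don't have broken out per task and method. The Python sketch below computes a 95% Wilson score interval for an observed 75% success rate at a few hypothetical sample sizes; the trial counts are placeholders, not figures from the paper.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Hypothetical rollout counts per condition, all at a 75% observed rate.
    for n in (20, 100, 400):
        low, high = wilson_interval(round(0.75 * n), n)
        print(f"n={n:4d}: 95% interval = [{low:.3f}, {high:.3f}]")
```

The interval tightens quickly as the count grows, which is why the per-task, per-method rollout numbers matter more than the 2,000+ total when comparing 75% against 73%.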
For training budgets:
4 hours on a single A100 is genuinely accessible for most robotics labs.
The parameter efficiency gains (4.0M vs. 632M) mean you can iterate faster on adaptation strategies.
Neither paper addresses long-horizon tasks where errors compound. A 75% success rate on individual manipulation primitives becomes much less impressive when you chain five of them together.
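The compounding is simple to work out. As a back-of-the-envelope Python sketch, assume each step succeeds independently with the same probability (a simplification; real failures are often correlated, and the 75% figure is an overall rate across tasks rather than a per-primitive measurement):

```python
def chained_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independence."""
    if not 0.0 <= per_step <= 1.0 or steps < 0:
        raise ValueError("per_step must be in [0, 1] and steps non-negative")
    return per_step ** steps


if __name__ == "__main__":
    for steps in (1, 2, 3, 5):
        print(f"{steps} step(s): {chained_success(0.75, steps):.1%}")
```

Under those assumptions, five chained steps at 75% each succeed end to end only about 24% of the time.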
The RepSAM benchmarks cover six datasets, but the paper doesn't provide detailed breakdowns of failure modes. When it fails, why does it fail? That information would be more useful than aggregate mIoU numbers for practitioners.
And there's a broader question neither paper fully answers: are foundation models the right starting point for robotic perception at all? The domain gap problem exists precisely because these models weren't trained on robotic data. Training smaller, domain-specific models from scratch might eventually prove more efficient. It's too early to say.
For now, though, these papers represent solid engineering work on a real problem. The robotics community has been talking about adapting foundation models for years. It's good to see quantitative evidence of what works, what doesn't, and what it actually costs.