Foundation Models Are Getting Smarter About What They Don't Know
Two new papers show real progress on adapting big AI models for robot vision, and for once the results actually hold up in the real world.
Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
I spent a good chunk of yesterday morning reading through two papers that landed on arXiv this week, and I'll be honest, I almost didn't write anything. The robotics AI space is so full of incremental work dressed up as breakthroughs that my default setting these days is skepticism. But these two caught my attention for a simple reason: they're both trying to solve problems I watched engineers struggle with for years.
The first one, RepSAM, tackles something I've been curious about since Meta's Segment Anything Model came out. SAM is impressive, no question, but when you try to use it on actual factory floors with transparent plastic parts or cluttered bins, it falls apart. Anyone who's worked with vision systems for pick-and-place knows this pain. I remember a project at KUKA back in 2018 where we tried to get a system to reliably segment shrink-wrapped components, and we burned months on it.
What the RepSAM team figured out is that the problem isn't uniform across the model. The shallow layers of the transformer show the largest domain gaps, which they measured with centered kernel alignment (CKA), a metric for how similar two sets of layer activations are, while the deeper layers are actually pretty stable. So instead of fine-tuning the whole thing, which takes forever and costs a fortune in compute, they focus their adaptation on the layers that actually need it.
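For readers who haven't met CKA before, here is a minimal sketch of the linear variant in plain Python. It is not the RepSAM authors' code, and I'm assuming a setup the paper may not use: two activation matrices for the same batch of inputs (one row per input), for example from the same layer before and after adaptation. The `layers_to_adapt` helper and its 0.8 threshold are hypothetical, just to show how per-layer scores could drive the choice of which layers to tune. In practice you'd do this with NumPy or PyTorch.

```python
import math
import random


def center_columns(m):
    """Subtract each feature's mean so CKA ignores constant offsets."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for two column-centered matrices."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            total += sum(p * q for p, q in zip(col_a, col_b)) ** 2
    return total


def linear_cka(x, y):
    """Linear CKA: 1.0 means the same representation up to rotation and scale."""
    x, y = center_columns(x), center_columns(y)
    return cross_sq_norm(x, y) / math.sqrt(cross_sq_norm(x, x) * cross_sq_norm(y, y))


def layers_to_adapt(cka_per_layer, threshold=0.8):
    """Hypothetical rule: adapt layers whose CKA falls below the threshold."""
    return [i for i, score in enumerate(cka_per_layer) if score < threshold]


rng = random.Random(0)
acts = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]

# Same features, shuffled and rescaled: CKA should treat these as identical.
shuffled = [[3.0 * r[2], 3.0 * r[0], 3.0 * r[3], 3.0 * r[1]] for r in acts]
# Independent noise: CKA should be far lower.
noise = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]

same_score = linear_cka(acts, shuffled)
noise_score = linear_cka(acts, noise)
print(round(same_score, 6), noise_score < 0.5)
```

The useful property is that the score ignores feature reordering and uniform rescaling. A low value therefore means the two sets of activations are organized differently, not merely that a layer's features were shuffled or rescaled.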
The numbers are genuinely good. They got 97.9% of full fine-tuning performance with 158 times fewer trainable parameters. Training took four hours on a single A100, versus 384 GPU-hours for full fine-tuning, which works out to roughly 96 times less compute if those GPU-hours are on comparable hardware. Look, I've seen enough papers with cherry-picked benchmarks to be cautious, but they tested across six different benchmarks plus actual manipulation tasks. The 12% improvement in manipulation success rates over the baseline is the kind of thing that matters in production.