Foundation Models Are Getting Smarter About What They Don't Know
Two new papers show real progress on adapting big AI models for robot vision, and for once the results actually hold up in the real world.
Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
I spent a good chunk of yesterday morning reading through two papers that landed on arXiv this week, and I'll be honest, I almost didn't write anything. The robotics AI space is so full of incremental work dressed up as breakthroughs that my default setting these days is skepticism. But these two caught my attention for a simple reason: they're both trying to solve problems I watched engineers struggle with for years.
The first one, RepSAM, tackles something I've been curious about since Meta's Segment Anything Model came out. SAM is impressive, no question, but when you try to use it on actual factory floors with transparent plastic parts or cluttered bins, it falls apart. Anyone who's worked with vision systems for pick-and-place knows this pain. I remember a project at KUKA back in 2018 where we tried to get a system to reliably segment shrink-wrapped components, and we burned months on it.
What the RepSAM team figured out is that the problem isn't uniform across the model. The shallow layers of the transformer show large domain gaps (they measured this with CKA, or centered kernel alignment, a metric for how similar two sets of neural network representations are), while the deeper layers are actually pretty stable. So instead of fine-tuning the whole thing, which takes forever and costs a fortune in compute, they focus their adaptation on the layers that actually need it.
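For readers who want to see what a layer-by-layer CKA check looks like, here is a minimal sketch using linear CKA in NumPy. To be clear about what's assumed: I'm guessing the comparison is between activations of the same images taken from two versions of the model, and the names `linear_cka` and `layers_to_adapt`, the 0.8 threshold, and the synthetic activations are all hypothetical. None of this is the RepSAM code, and the paper's actual setup may differ.

```python
from typing import Sequence

import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, n_features).

    Rows must correspond to the same inputs in both matrices.
    Returns a value in [0, 1]; 1 means the representations match
    up to rotation and isotropic scaling.
    """
    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def layers_to_adapt(cka_scores: Sequence[float], threshold: float = 0.8) -> list[int]:
    """Indices of layers whose CKA falls below a (hypothetical) threshold."""
    return [i for i, score in enumerate(cka_scores) if score < threshold]


# Synthetic stand-in for per-layer activations: the noise levels are made up
# so that the "shallow" layers drift more than the "deep" ones.
rng = np.random.default_rng(0)
base = rng.normal(size=(128, 32))
noise_levels = [2.0, 1.0, 0.1, 0.05]
scores = [
    linear_cka(base, base + level * rng.normal(size=base.shape))
    for level in noise_levels
]
print([round(score, 3) for score in scores])
print(layers_to_adapt(scores, threshold=0.8))
```

In a real pipeline you would replace the synthetic matrices with activations captured from each transformer block, then unfreeze or attach adapters only to the layers this check flags.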
The numbers are genuinely good. They got 97.9% of full fine-tuning performance with 158 times fewer trainable parameters. That's four hours on a single A100 versus 384 GPU-hours for the full approach. Look, I've seen enough papers with cherry-picked benchmarks to be cautious, but they tested across six different benchmarks plus actual manipulation tasks. The 12% improvement in manipulation success rates over the baseline is the kind of thing that matters in production.
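To put the figures above side by side, here is a quick back-of-the-envelope calculation. It assumes that four hours on one A100 counts as four GPU-hours and that the 384 GPU-hours for full fine-tuning were measured on comparable hardware, which the numbers quoted here don't confirm. The variable names are mine.

```python
# Figures as quoted above; the hardware comparability is an assumption.
full_finetune_gpu_hours = 384.0
repsam_gpu_hours = 4.0 * 1  # four hours on a single A100
param_reduction_factor = 158.0

# How many times less compute the adapted approach uses.
compute_ratio = full_finetune_gpu_hours / repsam_gpu_hours

# Share of the full model's trainable parameters that still get trained.
trainable_share = 1.0 / param_reduction_factor

print(f"compute reduction: {compute_ratio:.0f}x")
print(f"trainable share: {100 * trainable_share:.2f}% of the full fine-tuning count")
```

Under those assumptions that works out to roughly 96 times less compute, with well under 1% of the parameters being trained.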