Foundation Models Are Getting Smarter About What They Don't Know

Two new papers show real progress on adapting big AI models for robot vision, and for once the results actually hold up in the real world.

27 May 2026 · 3 min read

I spent a good chunk of yesterday morning reading through two papers that landed on arXiv this week, and I'll be honest, I almost didn't write anything. The robotics AI space is so full of incremental work dressed up as breakthroughs that my default setting these days is skepticism. But these two caught my attention for a simple reason: they're both trying to solve problems I watched engineers struggle with for years.

The first one, RepSAM, tackles something I've been curious about since Meta's Segment Anything Model came out. SAM is impressive, no question, but when you try to use it on actual factory floors with transparent plastic parts or cluttered bins, it falls apart. Anyone who's worked with vision systems for pick-and-place knows this pain. I remember we had a project at KUKA back in 2018 trying to get a system to reliably segment shrink-wrapped components, and we burned months on it.

What the RepSAM team figured out is that the problem isn't uniform across the model. The shallow layers of the transformer have massive domain gaps (they measured this with CKA, centered kernel alignment, which is basically a similarity score between two sets of learned features), while the deeper layers are actually pretty stable. So instead of fine-tuning the whole thing, which takes forever and costs a fortune in compute, they focus their adaptation on the layers that actually need it.
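For readers who want to see what a CKA-style check looks like, here is a minimal sketch using the standard linear CKA formula. To be clear about what is assumed: I don't know exactly how the RepSAM authors applied CKA, so the setup here (comparing two feature matrices per layer, computed on the same batch of images, and flagging layers whose score falls under a cutoff) is my own illustration. The function names and the 0.8 threshold are hypothetical, not taken from the paper.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two feature matrices of shape (n_samples, n_features).

    Both matrices must describe the same n_samples inputs, row for row.
    Returns a value in [0, 1]; 1 means the representations match up to
    rotation and uniform scaling.
    """
    # Center each feature dimension across the batch.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def layers_to_adapt(feats_a, feats_b, threshold=0.8):
    """Score each layer and return the indices whose CKA is below threshold.

    feats_a and feats_b are lists of per-layer feature matrices (hypothetical
    inputs, e.g. features from two domains or two model checkpoints).
    The threshold is an illustrative value, not one from the paper.
    """
    scores = [linear_cka(a, b) for a, b in zip(feats_a, feats_b)]
    chosen = [i for i, score in enumerate(scores) if score < threshold]
    return scores, chosen


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shallow_a = rng.normal(size=(256, 16))
    shallow_b = rng.normal(size=(256, 16))   # unrelated features: large gap
    deep_a = rng.normal(size=(256, 16))
    deep_b = 2.0 * deep_a                    # same features rescaled: no gap
    scores, chosen = layers_to_adapt([shallow_a, deep_a], [shallow_b, deep_b])
    print("per-layer CKA:", [round(s, 3) for s in scores])
    print("layers flagged for adaptation:", chosen)
```

The point of the sketch is only the shape of the idea: score every layer, then spend your training budget where the score says the representations have drifted.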

The numbers are genuinely good. They reach 97.9% of full fine-tuning performance with 158 times fewer trainable parameters, and training takes four hours on a single A100 versus 384 GPU-hours for the full approach. Look, I've seen enough papers with cherry-picked benchmarks to be cautious, but they tested across six different benchmarks plus actual manipulation tasks. The 12% improvement in manipulation success rates over the baseline is the kind of thing that matters in production.
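If you want the compute claim as a single ratio, the arithmetic is short. This is just a back-of-the-envelope sketch using the figures quoted above; the variable names are mine, and I'm assuming "four hours on a single A100" means 4 GPU-hours.

```python
# Figures as quoted in the article; variable names are illustrative.
full_finetune_gpu_hours = 384
adapted_gpu_hours = 4 * 1          # four hours on one A100 (assumed = 4 GPU-hours)
param_reduction_factor = 158

compute_ratio = full_finetune_gpu_hours / adapted_gpu_hours
trainable_fraction = 1 / param_reduction_factor

print(f"compute reduction: {compute_ratio:.0f}x")
print(f"trainable share of the full fine-tuning parameter count: {trainable_fraction:.2%}")
```

That works out to roughly a 96-fold cut in GPU-hours while training well under 1% of the parameters that full fine-tuning touches.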
