Retail robots are failing basic grocery tasks, and researchers finally have a benchmark to prove it

A new simulation benchmark shows that today's best vision-language models can't reliably stock shelves or pick items from cluttered store environments.

By Sarah Williams

3 hours ago · 5 min read

Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).

The robots that can fold laundry and make coffee in YouTube demos? They can't stock a grocery shelf. That's the uncomfortable finding from RoboBenchMart, a new benchmark from researchers who set out to test whether today's generalist robot AI actually generalizes to, you know, realistic work environments.

I've been tracking the hype around vision-language-action models (VLAs) for months now. These are the systems that combine large language models with visual understanding and physical control, the ones companies keep promising will revolutionize everything from warehouses to kitchens. And honestly, the tabletop demos are impressive. Robots sorting objects, stacking blocks, following natural language commands. It looks like progress.

But here's what RoboBenchMart reveals: when you take those same state-of-the-art models and drop them into a simulated retail environment (think dark stores, the fulfillment centers behind rapid grocery delivery), they struggle with tasks that seem almost trivially simple. Picking items from cluttered shelves. Handling objects at different heights. Navigating the kind of dense, chaotic layouts that define actual retail spaces.

The benchmark is open-source and includes a procedural store layout generator, which means researchers can create endless variations of retail environments. That's important because one of the persistent problems in robotics is overfitting to specific test setups. A robot that's learned to pick up a particular mug from a particular table in a particular lab isn't necessarily learning manipulation. It might just be memorizing a sequence.
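To make the overfitting point concrete, here is a minimal Python sketch of what seed-driven procedural generation looks like in principle. To be clear, this is not RoboBenchMart's code or API: the names `Slot` and `generate_layout`, the SKU list, and the shelf dimensions are all invented for illustration. The idea is simply that each seed yields a different but reproducible store, so a model can be evaluated on layouts it never saw in training.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    """One item position in a hypothetical store layout."""
    aisle: int
    shelf_level: int  # 0 = bottom shelf
    position: int     # index along the shelf
    sku: str


def generate_layout(seed, n_aisles=3, n_levels=4, slots_per_shelf=5, skus=None):
    """Return a reproducible random layout determined only by the seed."""
    # Illustrative SKU names, not taken from the benchmark.
    skus = skus or ["canned_tomatoes_a", "canned_tomatoes_b", "cereal", "milk", "soda"]
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    layout = []
    for aisle in range(n_aisles):
        for level in range(n_levels):
            for pos in range(slots_per_shelf):
                layout.append(Slot(aisle, level, pos, rng.choice(skus)))
    return layout


# Train on one range of seeds, evaluate on layouts from seeds never used in training.
train_layouts = [generate_layout(seed) for seed in range(0, 100)]
held_out_layouts = [generate_layout(seed) for seed in range(1000, 1010)]
```

A policy that has merely memorized a sequence for one shelf arrangement has nothing to fall back on when the held-out seeds shuffle which item sits where.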

What strikes me about this work is how it exposes a gap I think a lot of us suspected but couldn't quite prove. The geometry of a retail shelf is fundamentally different from a tabletop. Items are stacked vertically, tucked behind other items, positioned at awkward depths. The semantics are different too. A grocery store has thousands of SKUs, many of which look nearly identical (good luck telling two brands of canned tomatoes apart from a single camera angle). And the workflows involve continuous operation, not one-off demonstrations.
