The "robotics generalist" question gets a useful new benchmark

A new benchmark suite makes the question of robotic generalisation testable in a way previous benchmarks did not.

14 May 2026 · 3 min read

Photo credit: Conny Schneider on Unsplash

Robotics has no shortage of benchmarks, and the field has known for years that most of them overstate how well models actually generalise. A new benchmark suite, RoboGen, addresses that criticism directly.

The arXiv paper proposing RoboGen, and an IEEE Spectrum analysis, describe a benchmark designed specifically to evaluate cross-task and cross-environment generalisation.

What is different

Most existing manipulation benchmarks evaluate model performance on tasks that are minor variants of the training conditions: new object colours, slightly different starting positions, different table heights. Models perform well, and the field celebrates "generalisation".

The criticism, fairly, has been that this is not generalisation in any meaningful sense. It is interpolation across small variations within a distribution the model has already seen.

RoboGen is structured to test something harder. The benchmark holds out entire task categories from training. It includes environments with object configurations the model could not have encountered during training. It evaluates on robot platforms with kinematic structures different from those used in training.
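The difference between this kind of split and a conventional one can be sketched in a few lines. The following is a minimal illustration, not RoboGen's actual tooling: the `Episode` record, the `split_by_held_out` helper, and the category and robot names are all hypothetical, chosen only to show what holding out entire task categories and robot platforms means in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    # All field values below are illustrative, not taken from RoboGen.
    task_category: str
    environment: str
    robot: str


def split_by_held_out(episodes, held_out_categories, held_out_robots):
    """Send an episode to evaluation if its task category or robot is held out.

    Everything else goes to training, so no held-out category or robot
    ever appears in the training set.
    """
    train, evaluation = [], []
    for ep in episodes:
        if ep.task_category in held_out_categories or ep.robot in held_out_robots:
            evaluation.append(ep)
        else:
            train.append(ep)
    return train, evaluation


episodes = [
    Episode("pick_place", "kitchen", "arm_7dof"),
    Episode("pick_place", "lab", "arm_7dof"),
    Episode("pouring", "kitchen", "arm_7dof"),
    Episode("drawer_open", "lab", "arm_6dof"),
]

train, evaluation = split_by_held_out(
    episodes,
    held_out_categories={"pouring"},
    held_out_robots={"arm_6dof"},
)

# Sanity check: nothing held out has leaked into training.
assert all(ep.task_category != "pouring" for ep in train)
assert all(ep.robot != "arm_6dof" for ep in train)
```

A conventional benchmark would instead sample evaluation episodes from the same categories and robots as training, varying only details such as colour or starting position, which is why strong scores there say little about performance on a split like this one.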

The early results from the original paper are bracing. State-of-the-art models that score in the 80s on conventional benchmarks score in the 30s and 40s on RoboGen. The gap reflects how much current "generalisation" is interpolation within a known distribution.
