New Benchmark Reveals Robots Can't Reliably Combine Skills They've Already Learned

ATOM-Bench ran 2,700 physical robot trials and found that good performance on individual tasks doesn't translate to combining them, a core problem for anyone building general-purpose manipulation systems.

16 June 2026 · 5 min read

Picture a robot arm that can pick up a cup. It can also push a button. Ask it to pick up the cup and then push the button, and it falls apart. That's not a hypothetical. It's roughly what a new benchmark is finding across five of the most widely tested manipulation policies in robotics research.

Two papers dropped on arXiv this week that together paint a fairly sobering picture of where robot manipulation actually stands, as opposed to where press releases say it stands.

What did ATOM-Bench actually test?

The ATOM-Bench paper, posted to arXiv, introduces a real-world benchmark designed to separate two types of failure that usually get lumped together: robots that can't execute a skill cleanly, and robots that can execute skills individually but can't chain them into new task sequences.

The distinction matters. A lot.

ATOM-Bench breaks tabletop manipulation down into what the authors call "motor atoms" and "instruction atoms." Motor atoms are the physical primitives: grasping, pushing, placing. Instruction atoms are the language-grounding side: understanding "pick up the red one" versus "pick up the heavy one." The benchmark contains 30 atomic tasks and 24 held-out compositional tasks, run across both single-arm and dual-arm robot configurations. The researchers collected 3,000 human demonstrations for fine-tuning and then ran 2,700 physical rollouts across five policies to get their results.
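To make the atomic-versus-compositional split concrete, here is a minimal Python sketch of how rollout logs could be scored so the two failure types stay separate. It is not the paper's evaluation code: the `Rollout` record, the `success_rates` and `composition_gap` helpers, and the `policy_a` label are all hypothetical names, and the demo records are made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    """One physical trial. All field names are assumptions for this sketch."""
    policy: str      # hypothetical policy label
    task_kind: str   # "atomic" or "compositional"
    success: bool


def success_rates(rollouts):
    """Return {policy: {task_kind: success rate}} for a list of rollouts."""
    # Each cell holds [successes, trials] for one (policy, task_kind) pair.
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for r in rollouts:
        cell = counts[r.policy][r.task_kind]
        cell[0] += int(r.success)
        cell[1] += 1
    return {
        policy: {kind: s / n for kind, (s, n) in kinds.items()}
        for policy, kinds in counts.items()
    }


def composition_gap(rates):
    """Atomic success rate minus compositional success rate, per policy."""
    return {
        policy: kinds["atomic"] - kinds["compositional"]
        for policy, kinds in rates.items()
        if "atomic" in kinds and "compositional" in kinds
    }


# Made-up records purely for illustration; these are not results from the paper.
demo = [
    Rollout("policy_a", "atomic", True),
    Rollout("policy_a", "atomic", True),
    Rollout("policy_a", "atomic", True),
    Rollout("policy_a", "atomic", False),
    Rollout("policy_a", "compositional", True),
    Rollout("policy_a", "compositional", False),
    Rollout("policy_a", "compositional", False),
    Rollout("policy_a", "compositional", False),
]

rates = success_rates(demo)
gap = composition_gap(rates)
print(rates)
print(gap)
```

A large positive gap would point to a policy that executes skills individually but fails to chain them, while low scores on both kinds would point to an execution problem. For scale, 30 atomic plus 24 compositional tasks is 54 tasks, so 2,700 rollouts over five policies would work out to 10 trials per task per policy if they were spread evenly, which is an assumption rather than something reported here.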

That's a meaningful amount of physical testing. I've seen enough spec sheets from purely simulation-based evaluations to be skeptical of benchmarks that never leave the virtual environment, so the real-world rollout count here is worth noting.

