ATOM-Bench ran 2,700 physical robot trials and found that good performance on individual tasks doesn't translate to combining them, a core problem for anyone building general-purpose manipulation systems.
Picture a robot arm that can pick up a cup. It can also push a button. Ask it to pick up the cup and then push the button, and it falls apart. That's not a hypothetical. It's roughly what a new benchmark is finding across five of the most widely tested manipulation policies in robotics research.
Two papers dropped on arXiv this week that together paint a fairly sobering picture of where robot manipulation actually stands, as opposed to where press releases say it stands.
The first is the ATOM-Bench paper, which introduces a real-world benchmark designed specifically to separate two types of failure that usually get lumped together: robots that can't execute a skill cleanly, and robots that can execute skills individually but can't chain them into new task sequences.
The distinction matters. A lot.
ATOM-Bench breaks tabletop manipulation down into what the authors call "motor atoms" and "instruction atoms." Motor atoms are the physical primitives: grasping, pushing, placing. Instruction atoms are the language-grounding side: understanding "pick up the red one" versus "pick up the heavy one." The benchmark contains 30 atomic tasks and 24 held-out compositional tasks, run across both single-arm and dual-arm robot configurations. The researchers collected 3,000 human demonstrations for fine-tuning and then ran 2,700 physical rollouts across five policies to get their results.
That's a meaningful amount of physical testing. I've seen enough spec sheets from purely simulation-based evaluations to be skeptical of benchmarks that never leave the virtual environment, so the real-world rollout count here is worth noting.
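The headline numbers also fit together neatly. As a back-of-the-envelope check, and assuming the rollouts were split evenly across policies and tasks (an assumption on my part, not something stated in this summary), the arithmetic works out like this:

```python
# Back-of-the-envelope check on the ATOM-Bench rollout budget.
# ASSUMPTION (not stated in the article): rollouts are split evenly
# across the five policies and across all atomic + compositional tasks.
ATOMIC_TASKS = 30
COMPOSITIONAL_TASKS = 24
POLICIES = 5
TOTAL_ROLLOUTS = 2700

total_tasks = ATOMIC_TASKS + COMPOSITIONAL_TASKS
rollouts_per_policy = TOTAL_ROLLOUTS // POLICIES
rollouts_per_policy_task = TOTAL_ROLLOUTS / (POLICIES * total_tasks)

print(total_tasks)               # 54
print(rollouts_per_policy)       # 540
print(rollouts_per_policy_task)  # 10.0
```

If that even split holds, each policy gets about ten physical attempts per task. That is a lot of robot time in aggregate, but it also means a per-task success rate can only move in ten-point steps, so small differences between policies on a single task deserve some caution.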
The headline finding: strong atomic performance does not reliably transfer to held-out compositional tasks. Policies that scored well on individual skills still struggled when those same skills needed to be recombined in a new sequence the policy hadn't seen during training. The benchmark introduces two new metrics, Atomic Score and Compositional Failure Share, specifically to quantify which type of failure is responsible when a policy breaks down.
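To make the two metrics concrete, here is a minimal sketch of how numbers like these could be computed from labeled rollouts. The formulas are my own guesses based on the metric names; the `Rollout` record, its field names, and both functions are hypothetical and are not the paper's definitions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    # Hypothetical record format, invented for this illustration.
    task_kind: str          # "atomic" or "compositional"
    success: bool
    failure_type: str = ""  # "" on success, else "motor" or "compositional"


def atomic_score(rollouts: List[Rollout]) -> float:
    """ASSUMED definition: success rate over atomic-task rollouts."""
    atomic = [r for r in rollouts if r.task_kind == "atomic"]
    if not atomic:
        return 0.0
    return sum(r.success for r in atomic) / len(atomic)


def compositional_failure_share(rollouts: List[Rollout]) -> float:
    """ASSUMED definition: among failed compositional rollouts, the
    fraction blamed on recombination rather than motor execution."""
    failures = [
        r for r in rollouts
        if r.task_kind == "compositional" and not r.success
    ]
    if not failures:
        return 0.0
    return sum(r.failure_type == "compositional" for r in failures) / len(failures)


demo = [
    Rollout("atomic", True),
    Rollout("atomic", True),
    Rollout("atomic", True),
    Rollout("atomic", False, "motor"),
    Rollout("compositional", True),
    Rollout("compositional", False, "compositional"),
    Rollout("compositional", False, "compositional"),
    Rollout("compositional", False, "motor"),
]

print(atomic_score(demo))                 # 0.75
print(compositional_failure_share(demo))  # two of three failures
```

The point of splitting the numbers this way is that a policy with a high atomic score and a high compositional failure share is telling you something different from one that simply drops objects.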
The paper also flags specific weak spots. Fine-grained motor control, counting, and logical filtering ("pick up the object that is NOT blue") were consistent trouble areas even for policies that otherwise performed reasonably.
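It helps to see how trivial counting and logical filtering are once the scene is already symbolic. The toy scene and helper names below are invented for illustration and have nothing to do with the benchmark's actual task set.

```python
# Toy symbolic scene; objects and colors are made up for illustration.
scene = [
    {"name": "cup", "color": "blue"},
    {"name": "block", "color": "red"},
    {"name": "bowl", "color": "blue"},
    {"name": "sponge", "color": "green"},
]


def not_color(objects, color):
    """Logical filtering: keep every object that is NOT the given color."""
    return [o for o in objects if o["color"] != color]


def count_color(objects, color):
    """Counting: how many objects have the given color."""
    return sum(1 for o in objects if o["color"] == color)


targets = not_color(scene, "blue")
print([o["name"] for o in targets])  # ['block', 'sponge']
print(count_color(scene, "blue"))    # 2
```

Two one-line functions handle it here. A manipulation policy has to get the same answer from camera pixels and a natural-language instruction, with no object list handed to it, which is presumably why these categories stay hard.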
This is where the second paper becomes relevant. The SCE paper, also on arXiv (https://arxiv.org/abs/2606.15685), tackles a related but distinct problem: Embodied Continual Learning, or ECL. The goal in ECL is to train a robot on new tasks over time without it forgetting the old ones, which is harder than it sounds under closed-loop control.
The core issue the SCE authors identify is something called feature drift. When a robot operates in closed-loop, small errors in its internal representations compound over sequential decisions. A tiny miscalibration in how the policy represents "grasp" at step one propagates forward and degrades behavior at step five. The longer the task sequence, the worse this gets.
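A toy model makes the compounding intuition easier to see. This is not the paper's formal definition of feature drift; it is a sketch that assumes each closed-loop step multiplies the existing representation error by a fixed factor, with the starting error and growth rate picked arbitrarily.

```python
def compounded_drift(initial_error, growth, steps):
    """Return the error at each step under a toy geometric model.

    ASSUMPTION: error grows by a constant factor per closed-loop step.
    Real drift would depend on the policy, the task, and the feedback loop.
    """
    errors = []
    error = initial_error
    for _ in range(steps):
        errors.append(error)
        error *= growth
    return errors


# Arbitrary illustrative values: 1% error at step one, 50% growth per step.
drift = compounded_drift(initial_error=0.01, growth=1.5, steps=5)
for step, err in enumerate(drift, start=1):
    print(f"step {step}: error {err:.4f}")
```

Under these made-up numbers, the error at step five is roughly five times what it was at step one, and each additional step makes it worse by the same factor. That is the shape of the problem, whatever the real constants turn out to be.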
The SCE framework tries to address this by building what the authors call a "skill base" through Compositional Skill Grounding, which decomposes task demonstrations into reusable skill units. A second component, Dual Execution-and-Transition Experts (DETE), then handles both executing individual skills and managing the transitions between them. The idea is that clean transitions are just as important as clean execution, and most prior work focuses almost entirely on the execution side.
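The separation of concerns can be sketched in a few lines. Everything below is hypothetical: the `SkillBase` class, the `run_plan` function, and the hard-coded lambdas stand in for whatever learned components the paper actually uses, and none of it reflects SCE's code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Observation = Dict[str, float]
Action = str
Expert = Callable[[Observation], Action]


@dataclass
class SkillBase:
    # Hypothetical container: one execution expert per skill, plus one
    # transition expert per ordered pair of skills that needs a handoff.
    execution: Dict[str, Expert] = field(default_factory=dict)
    transition: Dict[Tuple[str, str], Expert] = field(default_factory=dict)


def run_plan(base: SkillBase, plan: List[str], obs: Observation) -> List[Action]:
    """Execute skills in order, inserting a handoff action between
    consecutive skills whenever a transition expert exists for that pair."""
    actions: List[Action] = []
    for i, skill in enumerate(plan):
        if i > 0:
            handoff = base.transition.get((plan[i - 1], skill))
            if handoff is not None:
                actions.append(handoff(obs))
        actions.append(base.execution[skill](obs))
    return actions


base = SkillBase(
    execution={
        "pick_cup": lambda obs: "grasp cup",
        "push_button": lambda obs: "press button",
    },
    transition={
        ("pick_cup", "push_button"): lambda obs: "retract and reorient",
    },
)

print(run_plan(base, ["pick_cup", "push_button"], {}))
# ['grasp cup', 'retract and reorient', 'press button']
```

The design point this illustrates is that the handoff between skills gets its own dedicated component instead of being left for the execution experts to absorb.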
Tested on the LIBERO benchmark suite and on real-world manipulation tasks, SCE shows improvements in retention, meaning it forgets old tasks less severely when learning new ones. The ablation studies in the paper suggest the transition expert component is doing meaningful work, not just adding parameters.
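For readers who want a handle on what "retention" can mean numerically, one common measure in the continual-learning literature is average forgetting: how far each old task has fallen from its best earlier score. This summary doesn't say which retention metric SCE reports, so treat the function below as a generic sketch with invented numbers, not the paper's evaluation.

```python
def average_forgetting(acc):
    """Average forgetting over all but the last task.

    acc[i][j] is the success rate on task j after training through task i.
    For each earlier task j, forgetting is its best score before the final
    stage minus its score at the final stage.
    """
    n = len(acc)
    if n < 2:
        return 0.0
    drops = []
    for j in range(n - 1):
        best_earlier = max(acc[i][j] for i in range(j, n - 1))
        drops.append(best_earlier - acc[n - 1][j])
    return sum(drops) / len(drops)


# Invented success rates for three tasks learned in sequence.
acc = [
    [0.9, 0.0, 0.0],  # after learning task 0
    [0.7, 0.8, 0.0],  # after learning task 1
    [0.6, 0.5, 0.9],  # after learning task 2
]

print(round(average_forgetting(acc), 3))  # 0.3
```

Lower is better. A method that improves retention should push this number toward zero without hurting the scores on the newest task.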
Whether either paper actually closes the gap is, honestly, too early to say. Both represent genuine technical progress, but they're also both pointing at how hard the underlying problem is.
Look, the compositional generalization problem isn't new. Roboticists have been aware that skill combination is a bottleneck for years. What ATOM-Bench contributes is a standardized, physically grounded way to measure it, which is valuable because it means future papers will have to actually demonstrate compositional transfer rather than just atomic performance on familiar tasks. That's the kind of benchmark the field needs.
The SCE approach to continual learning is interesting, but the real test is whether it holds up as the number of tasks scales. The LIBERO benchmark is well-designed, but it covers a constrained set of tabletop scenarios. Whether the skill base stays coherent at, say, 50 or 100 distinct manipulation tasks remains unclear from the current paper.
From my time in hardware, one thing I kept running into was the gap between what a system could do in a controlled demo and what it could do when the task context shifted even slightly. Both of these papers are essentially trying to close that gap from different angles, one by measuring it more precisely and one by proposing an architectural fix. Neither claims the problem is solved, which is the right framing.
The ATOM-Bench paper makes a pointed observation: generalist manipulation policies are increasingly being presented as foundation models for robotic control, but their real-world generalization is difficult to diagnose. A policy can succeed on demonstrated tasks and still fail at the fine-grained motor execution or compositional recombination that would make it genuinely useful in varied environments.
That's a polite way of saying some of the generalist manipulation claims circulating in the field are getting ahead of what the systems can actually do. The benchmark gives researchers and, importantly, anyone evaluating these systems for industrial deployment a cleaner way to ask: is this policy actually generalizing, or is it pattern-matching to training distributions?
For anyone building real automation systems, the practical upshot is fairly concrete. Compositional task failure is a distinct failure mode from motor execution failure, and they require different fixes. Knowing which one you're dealing with is half the battle. ATOM-Bench, at minimum, gives you the diagnostic tools to find out.
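Even without the benchmark, a rough version of that triage is possible with numbers most teams already log. The sketch below is my own heuristic, not anything from the paper: it compares the observed success rate on a chained task against the product of the individual skill success rates, which is what you would expect if the skills failed independently. The function names and both thresholds are arbitrary assumptions.

```python
from math import prod


def independence_baseline(atomic_success_rates):
    """Expected chain success if each skill failed independently."""
    return prod(atomic_success_rates)


def triage(atomic_success_rates, observed_chain_success,
           motor_floor=0.8, gap_tolerance=0.1):
    """Rough first-pass label for a failing chained task.

    ASSUMPTIONS: motor_floor and gap_tolerance are arbitrary illustrative
    thresholds; tune them (or discard them) for a real system.
    """
    if min(atomic_success_rates) < motor_floor:
        return "motor execution"
    gap = independence_baseline(atomic_success_rates) - observed_chain_success
    if gap > gap_tolerance:
        return "composition"
    return "no clear failure mode"


print(triage([0.9, 0.9], observed_chain_success=0.3))    # composition
print(triage([0.5, 0.95], observed_chain_success=0.4))   # motor execution
print(triage([0.9, 0.9], observed_chain_success=0.8))    # no clear failure mode
```

Two skills that each work 90 percent of the time should chain at around 81 percent if nothing else is going wrong. A chain that lands at 30 percent is not a motor problem, and more grasping data is unlikely to fix it.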