Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
Vision-language-action models can't remember what happened three seconds ago, and that's becoming a serious problem for anyone trying to deploy them on real tasks.
That's the uncomfortable conclusion from RoboMME, a new benchmark that systematically tests how well these foundation models handle memory-dependent manipulation. The results aren't pretty: even state-of-the-art VLA architectures fail at tasks that require counting repeated actions or tracking objects that briefly disappear from view.
I've seen enough spec sheets to know when a technology's limitations are being glossed over in demos. Memory is one of those limitations. A robot that can't remember it already picked up two screws isn't going to reliably assemble anything.
The RoboMME benchmark breaks down robot memory into four categories: temporal (what happened when), spatial (where things are), object (which item is which), and procedural (what steps were completed). The researchers built 16 manipulation tasks specifically designed to stress-test each category.
The findings reveal something that anyone who's worked with these systems probably suspected: current VLA models are essentially stateless. They process each frame as if it's the first time they're seeing the world. That works fine for simple pick-and-place operations. It falls apart completely when a task requires the robot to remember that it already stirred the pot twice, or that the red block moved behind the blue one.
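To make the "stateless" point concrete, here is a toy sketch of the stirring example. Nothing in it comes from RoboMME: the observation string, the `stateless_policy` function, and the `CountingPolicy` class are hypothetical names invented for illustration. The only idea it borrows from the text is that a policy looking at a single frame cannot tell whether it has already stirred.

```python
from dataclasses import dataclass


def stateless_policy(observation: str) -> str:
    """Pick an action from the current frame only (hypothetical toy policy)."""
    # The pot looks the same before and after a stir, so a frame-only
    # policy has no signal that would ever tell it to stop.
    return "stir" if observation == "pot_with_spoon" else "idle"


@dataclass
class CountingPolicy:
    """The same rule plus one integer of procedural memory."""
    target_stirs: int
    stirs_done: int = 0

    def act(self, observation: str) -> str:
        if observation != "pot_with_spoon":
            return "idle"
        if self.stirs_done >= self.target_stirs:
            return "stop"
        self.stirs_done += 1
        return "stir"


# Four identical frames: the scene gives no hint about past actions.
frames = ["pot_with_spoon"] * 4

stateless_actions = [stateless_policy(frame) for frame in frames]

policy = CountingPolicy(target_stirs=2)
stateful_actions = [policy.act(frame) for frame in frames]

print(stateless_actions)
print(stateful_actions)
```

The stateless version keeps stirring on every frame, while the counting version stops after two stirs. Real memory mechanisms are far richer than a counter, but the failure mode is the same in kind.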
The researchers developed 14 memory-augmented variants built on the π0.5 backbone to test different approaches. Here's where it gets interesting, and frankly, a bit discouraging. No single memory architecture worked well across all task types. What helped with temporal memory often hurt spatial reasoning. What improved object tracking degraded procedural recall.
That's not the kind of result that suggests an easy fix is coming.
Several recent papers claim improvements, but the numbers need context.
Agentic-VLA reports a 12.3% improvement on long-horizon tasks in the LIBERO benchmark. That sounds significant until you dig into what "long-horizon" means in this context. These are still relatively short sequences compared to real industrial applications. The framework also raised cross-task transfer from 0% to 31.2% without task-specific demonstrations, which is genuinely impressive, but 31.2% isn't exactly production-ready.
The system uses what the researchers call "Experience Memory," storing and retrieving task-relevant policy weights for warm-starting adaptation. It's a clever approach, basically giving the robot a library of past experiences to draw from. Whether it scales to the thousands of task variations you'd encounter in, say, a fulfillment center remains unclear.
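A rough sketch of what "store and retrieve policy weights for warm-starting" could look like follows. This is not the Agentic-VLA implementation. Keying the store on a task embedding, using cosine similarity for retrieval, and every name below (`ExperienceStore`, `warm_start`, the weight dictionary layout) are assumptions made purely for illustration.

```python
import math
from typing import Dict, List, Optional, Tuple

Vector = List[float]
Weights = Dict[str, Vector]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ExperienceStore:
    """Hypothetical library of (task embedding, policy weights) pairs."""

    def __init__(self) -> None:
        self._entries: List[Tuple[Vector, Weights]] = []

    def add(self, task_embedding: Vector, weights: Weights) -> None:
        self._entries.append((task_embedding, weights))

    def warm_start(self, task_embedding: Vector) -> Optional[Weights]:
        """Return a copy of the weights stored for the most similar task."""
        if not self._entries:
            return None
        _, weights = max(
            self._entries, key=lambda entry: cosine(entry[0], task_embedding)
        )
        # Copy so that later adaptation does not overwrite the stored entry.
        return {name: list(values) for name, values in weights.items()}


store = ExperienceStore()
store.add([1.0, 0.0], {"head": [0.1, 0.2]})  # e.g. a "pick" style task
store.add([0.0, 1.0], {"head": [0.9, 0.8]})  # e.g. a "push" style task

# A new task whose embedding sits close to the first stored task.
initial_weights = store.warm_start([0.9, 0.1])
print(initial_weights)
```

The scaling question raised above shows up directly in this sketch: a linear scan over a handful of entries is trivial, but thousands of task variations would need an index and some policy for which entries to keep.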
RePlan-Bot takes a different approach, implementing multi-level replanning throughout task execution. Instead of trying to remember everything, it continuously re-evaluates and adjusts. The system integrates an LLM-based auditor for dynamic sub-goal adjustments and a lightweight ViT-based corrector to fix risky low-level actions before they happen.
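The control flow described here can be sketched as a loop with two hooks, one per level. This is an illustration only, not RePlan-Bot's code: plain Python functions stand in for the LLM-based auditor and the ViT-based corrector, and the toy world, the `fast:`/`slow:` action strings, and all function names are hypothetical.

```python
from typing import Callable, Dict, List

# Toy world state: an obstacle is blocking the workspace at the start.
world: Dict[str, bool] = {"obstacle": True}


def audit(queue: List[str], log: List[str]) -> List[str]:
    """High level: revise the remaining sub-goals (stand-in for the auditor)."""
    if world["obstacle"] and queue[0] != "clear obstacle":
        return ["clear obstacle"] + queue
    return queue


def propose(subgoal: str) -> str:
    """Stand-in for the base policy, which here always proposes a fast motion."""
    return f"fast:{subgoal}"


def correct(action: str) -> str:
    """Low level: replace a risky action before it runs (stand-in corrector)."""
    return action.replace("fast:", "slow:")


def execute(action: str) -> bool:
    """Apply the action to the toy world and report whether the sub-goal is done."""
    if action.endswith("clear obstacle"):
        world["obstacle"] = False
    return True


def run_with_replanning(
    subgoals: List[str],
    audit_fn: Callable[[List[str], List[str]], List[str]],
    max_steps: int = 20,
) -> List[str]:
    log: List[str] = []
    queue = list(subgoals)
    for _ in range(max_steps):
        if not queue:
            break
        queue = audit_fn(queue, log)  # re-evaluate the plan on every step
        if not queue:
            break
        action = correct(propose(queue[0]))
        if execute(action):
            queue.pop(0)
        log.append(action)
    return log


executed = run_with_replanning(["grasp cup", "place cup"], audit)
print(executed)
```

Note that the loop carries almost no memory of its own: the auditor looks at the world as it currently is and patches the plan, which is the design choice that lets this style of system sidestep the memory problem.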
On the ALFRED benchmark, RePlan-Bot achieves state-of-the-art performance in both seen and unseen environments. But ALFRED tasks are still fairly constrained. The real test is whether this replanning approach can handle the kind of irreversible state changes you get in manufacturing, where dropping a component means starting over.
This is where I'll admit the recent work looks more promising than I expected.
EXPO-FT claims perfect task performance (30/30 successes) across evaluated tasks within an average of 19.1 minutes of online robot data. That's a remarkable number if it holds up. The tasks include:
Routing string lights and inserting the plug to light it up
Striking a pool ball into a pocket
Inserting a flower into a wine bottle
These require combinations of high precision and dynamic actions, which is harder than the typical block-stacking benchmarks. The researchers released an open-source codebase, which should help verify whether these results replicate.
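A quick calculation shows how much a 30/30 result can actually support. The sketch below assumes the 30 trials are independent and share a single success probability, which the source does not state; under that assumption, the exact one-sided lower confidence bound for an all-successes result is `alpha ** (1 / n)`, and the familiar "rule of three" gives nearly the same answer. The function name is my own.

```python
def success_rate_lower_bound(trials: int, confidence: float = 0.95) -> float:
    """Exact one-sided lower bound on the true success rate when every
    one of `trials` independent attempts succeeded."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    alpha = 1.0 - confidence
    # All n trials succeed with probability p**n; solve p**n = alpha for p.
    return alpha ** (1.0 / trials)


bound = success_rate_lower_bound(30)

# Rule of three: with zero failures in n trials, the 95% upper bound on
# the failure rate is roughly 3 / n.
rule_of_three = 1.0 - 3.0 / 30

print(f"exact lower bound: {bound:.3f}")
print(f"rule of three:     {rule_of_three:.3f}")
```

Both approaches put the lower bound at about 90%. In other words, 30 clean runs are consistent with a system that fails roughly one time in ten, which is why replication matters here.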
For comparison, prior RL-from-scratch approaches and VLA finetuning methods performed substantially worse on the same tasks. The paper doesn't provide exact baseline numbers in the abstract, which is frustrating, but the claim of outperforming both categories is specific enough to be meaningful.
Two papers suggest that maybe we're overcomplicating things.
Language Movement Primitives proposes grounding VLM reasoning in Dynamic Movement Primitive (DMP) parameterization. The key insight: DMPs provide a small number of interpretable parameters, and VLMs can set these parameters to specify trajectories without needing to learn low-level control from scratch.
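For readers who haven't met DMPs, here is a minimal one-dimensional rollout in the common textbook form: a critically damped spring pulling toward the goal, plus a learned forcing term that fades out as a phase variable decays. It is not the LMP authors' code. The gains, the basis-function heuristics, and the function name are assumptions; the point is only that a start, a goal, and a short weight vector fully specify a trajectory.

```python
import math
from typing import List


def rollout_dmp(
    x0: float,
    goal: float,
    weights: List[float],
    duration: float = 1.0,
    steps: int = 1000,
    alpha: float = 25.0,
    beta: float = 6.25,   # alpha / 4 gives critical damping
    alpha_s: float = 4.0,
) -> List[float]:
    """Integrate a 1-D discrete DMP with explicit Euler and return positions."""
    n = len(weights)
    dt = duration / steps
    # Basis centres spaced along the decaying phase variable s.
    centres = [math.exp(-alpha_s * i / max(n - 1, 1)) for i in range(n)]
    widths = [n ** 1.5 / c for c in centres]  # heuristic widths (assumption)

    x, v, s = x0, 0.0, 1.0
    path = [x]
    for _ in range(steps):
        psi = [math.exp(-h * (s - c) ** 2) for h, c in zip(widths, centres)]
        total = sum(psi)
        forcing = 0.0
        if total > 0.0:
            shaped = sum(p * w for p, w in zip(psi, weights)) / total
            forcing = shaped * s * (goal - x0)  # vanishes as s decays
        dv = (alpha * (beta * (goal - x) - v) + forcing) / duration
        dx = v / duration
        ds = -alpha_s * s / duration
        v += dv * dt
        x += dx * dt
        s += ds * dt
        path.append(x)
    return path


# Zero weights: a plain point-to-point reach toward the goal.
straight = rollout_dmp(0.0, 1.0, [0.0, 0.0, 0.0])

# Three numbers are enough to bend the path while keeping the same goal.
shaped_path = rollout_dmp(0.0, 1.0, [50.0, -50.0, 0.0])

print(f"straight ends at {straight[-1]:.4f}")
print(f"shaped ends at   {shaped_path[-1]:.4f}")
```

Both rollouts end close to the goal because the forcing term decays with the phase variable, while the weights change only the shape of the path in between. That is the property that makes it plausible for a VLM to choose a few parameters instead of emitting low-level control.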
Across 31 real-world manipulation tasks, LMP achieved 65% task success compared to 35% for the best performing baseline. That's a meaningful gap. The approach essentially sidesteps the memory problem by breaking tasks into discrete motion primitives that don't require maintaining complex state.
AVP (Action with Visual Primitives) takes a similar philosophy. Instead of forcing the action expert to relearn capabilities already present in the pretrained VLM, it implements a visual-primitive-centric interface. The VLM infers the next-stage target and emits visual-primitive tokens that condition a flow-matching action expert.
Real-robot experiments showed a 27.61% improvement over π0.5 on general pick-and-place tasks, with gains in data efficiency and object-level transfer. Look, that's not a solution to the memory problem, but it suggests that decomposing the problem might be more tractable than trying to build monolithic models that do everything.
From my time in hardware, I learned that the gap between benchmark performance and production reliability is usually wider than researchers acknowledge. These papers represent genuine progress, but some caveats are worth noting:
Task complexity is still limited. Even the "long-horizon" tasks in these benchmarks are short by industrial standards. A real assembly sequence might involve hundreds of steps over hours. We don't have good data on how these approaches scale.
Environmental variation is controlled. The unseen environments in these benchmarks are still variations on training distributions. A factory floor with unexpected obstacles, lighting changes, and component variations is a different challenge entirely.
Memory requirements are task-dependent. The RoboMME results show that different memory architectures excel at different things. That suggests deployment might require task-specific tuning rather than general-purpose solutions.
Sample efficiency claims need verification. EXPO-FT's average of 19.1 minutes of online robot data is impressive, but it's based on a limited task set. Whether that efficiency holds for more diverse applications remains to be seen.
The honest assessment: we're making progress on the memory problem, but we're not close to solving it. The systems that work best seem to be the ones that avoid relying on memory in the first place, either through continuous replanning or by decomposing tasks into stateless primitives.
That's not necessarily a bad approach. Sometimes the best engineering solution is to design around a limitation rather than trying to eliminate it. But it does suggest that the vision of general-purpose robots that learn and remember like humans is still quite far off.
For now, the practical path forward seems to be: keep tasks short, replan frequently, and don't expect your robot to remember what it did five minutes ago. That's an ambitious enough target for most applications.