Vision-language-action models hit a memory wall, and three labs are trying to break it
The latest VLA models are starting to fail in a specific, predictable way: they remember the last few seconds, and not much more. Researchers are racing to fix it.
Image credit: Photo by Steve Johnson on Unsplash
If you have followed vision-language-action models for the last year, you have probably noticed the same complaint coming from researchers and roboticists: the models are clever, the models are precise, and the models forget what they were doing about ninety seconds in.
Stanford HAI summarises the problem and three concurrent research efforts that are trying to solve it.
"Everyone hit the same wall in the same month." (VLA researcher, via Stanford HAI)
What the wall actually looks like
Current VLA architectures use a transformer-style attention mechanism over a relatively small window of recent observations. That window covers, generously, the last few seconds of visual context.
For short manipulation tasks, that is plenty. Pick up the cup. Place it on the saucer. Done.
For longer tasks, the model loses track. Asked to assemble a small component from a parts bin, a current VLA can perform individual sub-tasks beautifully and lose the order between them. Asked to clean a kitchen, it can wash one plate well and have no idea whether it has already washed the second.
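To make the failure concrete, here is a minimal Python sketch of a fixed-size observation window. It is a toy illustration, not how any particular VLA is implemented: the class name ObservationWindow, the frame rate and the three-second window are all assumptions chosen for the example, and real models attend over image tokens rather than text labels.

```python
from collections import deque


class ObservationWindow:
    """Toy stand-in for a bounded context: keeps only the newest frames.

    Hypothetical class for illustration; not taken from any real VLA codebase.
    """

    def __init__(self, max_frames: int) -> None:
        # A deque with maxlen silently drops the oldest item when full.
        self.frames = deque(maxlen=max_frames)

    def observe(self, frame: str) -> None:
        self.frames.append(frame)

    def remembers(self, event: str) -> bool:
        # The "model" can only use what is still inside the window.
        return event in self.frames


# Assumed numbers, for illustration only.
FPS = 10             # observations per second
WINDOW_SECONDS = 3   # seconds of context the window covers

window = ObservationWindow(max_frames=FPS * WINDOW_SECONDS)

window.observe("washed plate 1")
short_gap = window.remembers("washed plate 1")  # checked immediately

# One minute of unrelated work pushes the earlier event out of the window.
for _ in range(FPS * 60):
    window.observe("scrubbing pan")

long_gap = window.remembers("washed plate 1")

print(short_gap, long_gap)  # True False
```

Nothing in the sketch is wrong at any single step. The record of the first plate simply falls out of the buffer, which is the kitchen failure described above in miniature.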
This is not a minor inconvenience. Most useful work happens over time scales longer than the current attention window.
The three candidate fixes