Vision-language-action models hit a memory wall, and three labs are trying to break it

The latest VLA models are starting to fail in a specific, predictable way: they remember the last few seconds, and not much more. Researchers are racing to fix it.

By Priya Nair

23 May 2026 · 3 min read

Image credit: Photo by Steve Johnson on Unsplash

If you have followed vision-language-action models for the last year, you have probably noticed the same complaint coming from researchers and roboticists: the models are clever, the models are precise, and the models forget what they were doing about ninety seconds in.

Stanford HAI summarises the problem and three concurrent research efforts that are trying to solve it.

Everyone hit the same wall in the same month. — VLA researcher (via Stanford HAI)

What the wall actually looks like

Current VLA architectures use a transformer-style attention mechanism over a relatively small window of recent observations. That window covers, generously, the last few seconds of visual context.

For short manipulation tasks, that is plenty. Pick up the cup. Place it on the saucer. Done.

For longer tasks, the model loses track. Asked to assemble a small component from a parts bin, a current VLA can perform individual sub-tasks beautifully and lose the order between them. Asked to clean a kitchen, it can wash one plate well and have no idea whether it has already washed the second.
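To make the failure mode concrete, here is a toy sketch of a fixed-size observation window. It is an illustration only, not how any particular VLA is implemented: the `ObservationWindow` class, the `max_frames` parameter, and the event strings are all hypothetical, and a real model attends over image and token embeddings rather than text labels. The point it shows is simply that once the window is full, the oldest observation is gone.

```python
from collections import deque


class ObservationWindow:
    """Toy stand-in for a bounded context of recent observations (hypothetical)."""

    def __init__(self, max_frames: int) -> None:
        # A deque with maxlen silently discards the oldest item once it is full.
        self.frames = deque(maxlen=max_frames)

    def observe(self, frame: str) -> None:
        # Record the newest observation, evicting the oldest if necessary.
        self.frames.append(frame)

    def remembers(self, frame: str) -> bool:
        # The "model" can only consult what is still inside the window.
        return frame in self.frames


# Assumed window of three observations; the kitchen events are made up.
window = ObservationWindow(max_frames=3)
for event in [
    "washed plate 1",
    "picked up plate 2",
    "scrubbed plate 2",
    "rinsed plate 2",
]:
    window.observe(event)

print(window.remembers("washed plate 1"))  # False: evicted by newer frames
print(window.remembers("rinsed plate 2"))  # True: still inside the window
```

With four events and room for three, the record of the first plate has already been pushed out, which is the kitchen example above in miniature.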

This is not a minor inconvenience. Most useful work happens over time scales longer than the current attention window.

The three candidate fixes

