VLAs Are Getting Memory, and It's About Time

Vision-language-action models are finally learning to remember what they did five seconds ago. I've been waiting for this since 2019.

By Robert "Bob" Macintosh

18 hours ago · 4 min read

Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).

Look, here's the thing about vision-language-action models: they're impressive as hell at understanding what they're looking at, but ask them to remember what they did thirty seconds ago and they fall apart like a cheap gripper on a dusty conveyor.

I've been watching this space since the early VLA papers started dropping, and the memory problem has always been the elephant in the room. When I was at Kuka, we had a saying: a robot that can't remember is just a very expensive reflex. These new research papers suggest the field is finally taking that seriously.

The Scratchpad Approach

A team has published work on what they call "Notes-to-Self," essentially giving VLAs a language scratchpad to jot down what they've seen and what they're planning to do. The arXiv paper is titled "Notes-to-Self: Scratchpad Augmented VLAs for Memory Dependent Manipulation Tasks," and the concept is almost embarrassingly simple. The model writes itself notes: object positions, subgoal progress, that sort of thing.

I'll be honest, when I first read the abstract I thought, "we're reinventing state machines with extra steps." But then I thought about it more. The beauty here is that the scratchpad uses natural language, which means you get the semantic flexibility of modern language models without hardcoding your state representation. That's actually clever.
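Here's roughly how I picture the loop, as a toy Python sketch. To be clear, this is my illustration and not the paper's code: `Scratchpad`, `run_episode`, and `stub_policy` are hypothetical names I made up, observations are plain strings instead of camera frames, and the stub stands in for a real VLA. The point is only the shape of the idea: the notes written at one step get fed back in as input at the next.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Scratchpad:
    """Rolling list of natural-language notes the policy writes to itself."""
    max_notes: int = 8
    notes: List[str] = field(default_factory=list)

    def write(self, note: str) -> None:
        # Empty notes are skipped; only the most recent notes are kept
        # so the text fed back to the policy stays bounded.
        if note:
            self.notes.append(note)
            self.notes = self.notes[-self.max_notes:]

    def render(self) -> str:
        # Text form of the notes, as it would be placed in the model's prompt.
        return "\n".join(f"- {n}" for n in self.notes) or "(empty)"


# A policy maps (observation, instruction, rendered notes) -> (action, new note).
Policy = Callable[[str, str, str], Tuple[str, str]]


def run_episode(policy: Policy, observations: List[str],
                instruction: str, pad: Scratchpad) -> List[str]:
    """Step through observations, feeding the scratchpad back in each time."""
    actions = []
    for obs in observations:
        action, note = policy(obs, instruction, pad.render())
        pad.write(note)
        actions.append(action)
    return actions


def stub_policy(obs: str, instruction: str, notes: str) -> Tuple[str, str]:
    """Hypothetical stand-in for a VLA: notes where the cube was last seen."""
    if "red cube" in obs:
        return "look_around", f"red cube seen: {obs}"
    if "red cube seen" in notes:
        # The cube is out of view now, but the note still records it.
        return "return_to_cube", ""
    return "explore", ""


if __name__ == "__main__":
    pad = Scratchpad()
    frames = ["red cube at left bin", "empty table", "empty table"]
    print(run_episode(stub_policy, frames, "fetch the red cube", pad))
    # prints ['look_around', 'return_to_cube', 'return_to_cube']
```

Run the same stub on an empty table with a fresh scratchpad and it just explores. The only thing carrying state across steps is a list of strings, which is exactly why this feels like a state machine whose states nobody had to enumerate in advance.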

They tested it on ClevrSkills and a real-world pick-and-place task. The results show significant improvement on memory-dependent tasks for both recurrent and non-recurrent models. How significant? The abstract doesn't give an exact percentage improvement, so I can't tell you. That's the limitation of working from abstracts.

The Latency Problem Nobody Talks About

While we're on VLAs, there's another paper that caught my eye. UCLA's TIC-VLA work tackles something I've been complaining about for years: these models assume that thinking and acting happen at the same speed. They don't.

In any real factory environment, you've got sensor latency, network delays, inference time on your compute. By the time your fancy language model has figured out what to do, the world has moved on. I remember debugging a palletizing cell back in 2014 where we had maybe 200 milliseconds of total system latency and it was causing stack failures. These VLA models can take multiple seconds to reason.

TIC-VLA explicitly models this delay. They condition action generation on delayed semantic states plus latency metadata, so the policy can compensate for the fact that its understanding of the world is always slightly stale. They built a simulation suite called DynaNav to test this properly. The results show robust control even under multi-second reasoning latency, which is, well, exactly where we are with current hardware.
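To make "delayed semantic state plus latency metadata" concrete, here's a small Python sketch of the plumbing as I understand it. Big caveats: every name here (`DelayedSemantics`, `PolicyInput`, `make_policy_input`, `toy_policy`) is hypothetical, and the constant-velocity extrapolation is my own hand-written stand-in. In the paper the compensation is learned by the policy, not coded by hand like this.

```python
from dataclasses import dataclass
from typing import Tuple

Vec2 = Tuple[float, float]


@dataclass(frozen=True)
class DelayedSemantics:
    """Output of the slow reasoning module, tagged with the frame time it used."""
    target_xy: Vec2      # where the reasoner said the target was (m)
    target_vel: Vec2     # assumed target velocity at that time (m/s)
    observed_at: float   # timestamp of the frame the reasoning was based on (s)


@dataclass(frozen=True)
class PolicyInput:
    """What the fast control policy is conditioned on at each step."""
    robot_xy: Vec2
    semantics: DelayedSemantics
    latency_s: float     # latency metadata: how stale the semantics are


def make_policy_input(robot_xy: Vec2, semantics: DelayedSemantics,
                      now: float) -> PolicyInput:
    # Staleness is the gap between "now" and the frame the reasoner saw.
    latency = max(0.0, now - semantics.observed_at)
    return PolicyInput(robot_xy, semantics, latency)


def toy_policy(inp: PolicyInput) -> Vec2:
    """Hand-coded stand-in for a learned policy.

    Extrapolates the stale target position forward by the reported latency,
    then returns the displacement from the robot to that predicted position.
    """
    tx, ty = inp.semantics.target_xy
    vx, vy = inp.semantics.target_vel
    predicted = (tx + vx * inp.latency_s, ty + vy * inp.latency_s)
    return (predicted[0] - inp.robot_xy[0], predicted[1] - inp.robot_xy[1])


if __name__ == "__main__":
    # Reasoner saw the target at x=2.0 m moving at 0.5 m/s, two seconds ago.
    stale = DelayedSemantics((2.0, 0.0), (0.5, 0.0), observed_at=10.0)
    inp = make_policy_input((0.0, 0.0), stale, now=12.0)
    print(inp.latency_s, toy_policy(inp))  # prints 2.0 (3.0, 0.0)
```

With two seconds of staleness and a target moving at half a meter per second, a policy that ignores the latency field aims a full meter behind where the target actually is. That's the whole argument for passing the staleness in as an input instead of pretending it's zero.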

Sources

GSAM: A Generalizable and Safe Robotic Framework for Articulated Object Manipulation · arXiv, cs.RO (Robotics)
RDGen: Demonstration Generation for High-Quality Robot Learning via Reinforcement Learning · arXiv, cs.RO (Robotics)
On-Device Robotic Planning: Eliminating Inference Redundancy for Efficient Decision-Making · arXiv, cs.RO (Robotics)
Multi-Turn Multi-Agent Dialogue for Collaborative Reconstruction Improves VLM Performance on Spatial Reasoning, But Only Barely · arXiv, cs.RO (Robotics)
TIC-VLA: A Think-in-Control Vision-Language-Action Model for Robot Navigation in Dynamic Environments · arXiv, cs.RO (Robotics)
Notes-to-Self: Scratchpad Augmented VLAs for Memory Dependent Manipulation Tasks · arXiv, cs.RO (Robotics)
