The VLM-to-Motion Problem Is Getting Solved, Piece by Piece
Four new papers tackle the same headache I've watched engineers struggle with for years: getting language models to actually move a robot arm.
Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
Thirty-one real-world manipulation tasks with a 65% success rate, zero-shot, from natural language instructions.
That number from Virginia Tech's Language Movement Primitives paper stopped me mid-coffee this morning. When I was at Kuka, we spent months tuning motion primitives for a single palletizing application. The idea that you could describe a task in plain English and have a robot figure out the trajectory parameters on its own would've gotten you laughed out of the building. Times change.
What's actually happening here
Four papers crossed my desk this week, all circling the same problem: how do you get these fancy vision-language models to stop being clever chatbots and start being useful robot controllers?
The Virginia Tech team's approach (their paper is on arXiv) uses Dynamic Movement Primitives, which I'll admit made me smile. DMPs have been around since, what, the early 2000s? They came out of Stefan Schaal's group at USC, if I remember right. The insight here is simple: VLMs are good at reasoning about what should happen, and DMPs are good at generating smooth trajectories from a handful of parameters. Connect them properly and you get something that actually works.
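For readers who never had to tune one, here is a minimal sketch of what a DMP rollout looks like once something upstream has picked the parameters. This is a textbook one-dimensional discrete DMP, not the paper's implementation. The names `DMPParams` and `rollout_dmp`, the gain values, and the idea that a VLM would emit exactly these fields are all my assumptions for illustration.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class DMPParams:
    """Hypothetical parameter bundle a VLM might emit for one axis of motion."""
    start: float
    goal: float
    duration: float        # seconds; used as the DMP time constant tau
    weights: List[float]   # forcing-term weights; all zeros gives a plain reach


def rollout_dmp(p: DMPParams, dt: float = 0.001,
                alpha_z: float = 25.0, alpha_x: float = 4.0) -> List[float]:
    """Integrate a 1-D discrete DMP with explicit Euler steps; returns positions."""
    beta_z = alpha_z / 4.0  # conventional choice: critically damped spring
    n = len(p.weights)
    # Gaussian basis centers, evenly spaced in time and mapped into phase space.
    centers = [math.exp(-alpha_x * i / max(n - 1, 1)) for i in range(n)]
    widths = [n ** 1.5 / c / alpha_x for c in centers]  # common heuristic

    y, z, x = p.start, 0.0, 1.0  # position, scaled velocity, phase
    tau = p.duration
    traj: List[float] = []
    for _ in range(int(round(p.duration / dt))):
        psi = [math.exp(-h * (x - c) ** 2) for h, c in zip(widths, centers)]
        total = sum(psi)
        # The forcing term shapes the path and fades out as the phase x decays.
        f = 0.0
        if total > 1e-10:
            weighted = sum(w * a for w, a in zip(p.weights, psi))
            f = weighted / total * x * (p.goal - p.start)
        dz = (alpha_z * (beta_z * (p.goal - y) - z) + f) / tau
        dy = z / tau
        dx = -alpha_x * x / tau
        z += dz * dt
        y += dy * dt
        x += dx * dt
        traj.append(y)
    return traj


if __name__ == "__main__":
    reach = rollout_dmp(DMPParams(start=0.0, goal=0.3, duration=2.0,
                                  weights=[0.0] * 10))
    print(f"final position: {reach[-1]:.4f}")
```

The appeal is in how small the interface is: a start, a goal, a duration, and a short weight vector. With zero weights you get a smooth point-to-point reach, and nonzero weights bend the path on the way there.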
Then there's this AVP architecture from another group, posted on arXiv, claiming a 27.61% improvement over pi_0.5 on pick-and-place. I'll be honest, I had to look up what pi_0.5 even was (it's a recent VLA baseline, for those as behind as me). The core idea is separating visual reasoning from motor control rather than cramming everything into one model. Which, look, makes intuitive sense. You don't ask your eyes to move your hands.
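To make the separation concrete, here is a rough sketch of the general pattern: a slow reasoning stage that outputs a small intermediate target, and a fast control stage that only ever sees that target. This is not the AVP architecture. `VisualTarget`, `step_toward`, and `run_episode` are hypothetical names, and the controller is a deliberately dumb clipped step toward a point.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class VisualTarget:
    """Hypothetical intermediate representation handed from stage 1 to stage 2."""
    label: str
    position: Point  # meters, in the robot base frame (assumed)


# Stage 1 signature: (image, instruction) -> VisualTarget.
# Any vision-language model could sit behind this callable.
Reasoner = Callable[[object, str], VisualTarget]


def step_toward(current: Point, target: VisualTarget,
                max_step: float = 0.01) -> Point:
    """Stage 2: move toward the target position, at most max_step per call."""
    delta = [t - c for c, t in zip(current, target.position)]
    dist = sum(d * d for d in delta) ** 0.5
    if dist <= max_step:
        return target.position
    scale = max_step / dist
    return tuple(c + d * scale for c, d in zip(current, delta))


def run_episode(image: object, instruction: str, reasoner: Reasoner,
                start: Point, max_iters: int = 1000) -> List[Point]:
    """Call the reasoner once, then let the controller run on its own."""
    target = reasoner(image, instruction)
    path: List[Point] = [start]
    while path[-1] != target.position and len(path) <= max_iters:
        path.append(step_toward(path[-1], target))
    return path


if __name__ == "__main__":
    # Stand-in reasoner that always "sees" a cup 5 cm ahead.
    fake = lambda img, text: VisualTarget("cup", (0.05, 0.0, 0.0))
    print(run_episode(None, "pick up the cup", fake, (0.0, 0.0, 0.0))[-1])
```

The design point is that the controller never touches the image or the instruction, so you can swap either stage without retraining the other.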
The reinforcement learning angle
The SOLE-R1 paper takes a different route. The authors built a video-language model specifically designed to judge whether a robot is making progress on a task, and they use that judgment as the reward signal for reinforcement learning. No ground-truth rewards, no demonstrations, no task-specific tuning.
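One simple way to turn a progress judgment into a per-step reward is to pay the agent for the change in estimated progress. The sketch below shows that idea only. I don't know the exact reward formulation SOLE-R1 uses, and `estimate_progress` is a hypothetical stand-in for the video-language model, assumed to return a value between 0 and 1.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]  # placeholder type for a camera frame

# Hypothetical estimator: (frames seen so far, instruction) -> progress in [0, 1].
ProgressFn = Callable[[Sequence[Frame], str], float]


def progress_rewards(frames: Sequence[Frame], instruction: str,
                     estimate_progress: ProgressFn) -> List[float]:
    """Reward at each step is the change in estimated task progress."""
    rewards: List[float] = []
    prev = 0.0
    for t in range(1, len(frames) + 1):
        raw = estimate_progress(frames[:t], instruction)
        progress = min(1.0, max(0.0, raw))  # clamp out-of-range estimates
        rewards.append(progress - prev)
        prev = progress
    return rewards


if __name__ == "__main__":
    # Toy estimator: reads "progress" straight from the latest frame.
    toy = lambda seen, text: seen[-1][0]
    video = [[0.0], [0.25], [0.5], [0.5], [1.0]]
    print(progress_rewards(video, "stack the blocks", toy))
```

Because the rewards telescope, the episode return equals the final progress estimate, which is exactly why the estimator's honesty matters so much: if the policy finds frames that fool the judge, it gets paid in full.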
This is where I get a bit skeptical. Actually, let me be precise: cautiously optimistic. They tested on 24 unseen tasks across four simulation environments plus a real robot. That's not nothing. But how well the results transfer from simulation to real hardware remains unclear, and the paper describes its model as "markedly more robust to reward hacking" than alternatives. More robust is not immune, which implies reward hacking is still a problem.
I called my old colleague at Siemens last week about something unrelated, and we ended up talking about this exact issue. His take: the reward hacking problem in VLM-supervised RL is like the sensor drift problem we dealt with in the 90s. Everyone knows it's there, everyone has workarounds, nobody's really solved it.
Sources
- Language Movement Primitives: Grounding Language Models in Robot Motion. arXiv, cs.RO (Robotics)
- Action with Visual Primitives. arXiv, cs.RO (Robotics)
- Learning Compositional Symbolic Task Rules from Demonstrations with Inductive Logic Programming. arXiv, cs.RO (Robotics)
- SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning. arXiv, cs.RO (Robotics)

