Two New Papers Show How to Fix Robot Policies Without Starting Over

FlowPRO and EVE tackle the same problem from opposite directions: making robot learning actually work outside the lab.

6 June 2026 · 4 min read

Training a robot to do something in a lab is one thing. Getting that same robot to keep working when reality gets messy is, honestly, where most approaches fall apart.

Two papers dropped this week that attack this problem from completely different angles, and I think they're worth looking at together because they reveal something interesting about where the field is headed.

The core problem both papers are solving is what happens after you've trained a vision-language-action model. These VLAs (the things that let robots see, understand language commands, and actually move) work pretty well in controlled settings. But deploy them on a real robot, and they start failing in ways that are expensive to fix. The traditional answer has been to collect more data and retrain, which is slow and costly.

FlowPRO appeared on arXiv this week, and it takes what I'd call the "teach from corrections" approach. The setup is clever: a human operator watches the robot attempt a task, and when it screws up, they intervene with a correction. That single correction creates a natural pair of data: the wrong thing the robot was about to do and the right thing it should have done instead.
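To make the "one correction, one pair" idea concrete, here is a minimal sketch of how such a record might be stored. Everything in it is my own illustration, not FlowPRO's actual data format: the `PreferencePair` class, the `pair_from_intervention` helper, and the assumption that both action chunks cover the same number of timesteps are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class PreferencePair:
    """One human intervention, stored as a (rejected, chosen) pair.

    Hypothetical layout for illustration; not taken from the paper.
    """
    observation: List[float]      # what the robot saw when the human stepped in
    rejected: List[List[float]]   # action chunk the policy was about to execute
    chosen: List[List[float]]     # action chunk the human executed instead


def pair_from_intervention(
    observation: Sequence[float],
    policy_actions: Sequence[Sequence[float]],
    human_actions: Sequence[Sequence[float]],
) -> PreferencePair:
    """Build a preference pair from a single correction.

    Assumes (my assumption) that both chunks span the same number of timesteps.
    """
    if len(policy_actions) != len(human_actions):
        raise ValueError("policy and human action chunks must have equal length")
    return PreferencePair(
        observation=list(observation),
        rejected=[list(a) for a in policy_actions],
        chosen=[list(a) for a in human_actions],
    )


# One correction: the policy wanted to move right, the human moved diagonally.
pair = pair_from_intervention(
    observation=[0.1, 0.2],
    policy_actions=[[0.0, 1.0]],
    human_actions=[[0.5, 0.5]],
)
```

The point of the layout is that no label or score is attached: the only supervision is which of the two chunks the human preferred.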

What makes this interesting is that it's reward-free. If you've followed RL in robotics at all, you know that designing reward functions for real-world tasks is, tbh, kind of a nightmare. FlowPRO sidesteps this entirely by using preference optimization. The robot learns that trajectory A (what the human did) is better than trajectory B (what it was about to do) without needing to assign specific numerical rewards.
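For a feel of what "better than, without a number attached" looks like as a loss, here is a sketch of a generic pairwise logistic (Bradley-Terry style) preference loss, the kind used in DPO-like methods. I'm using it as a stand-in: the article above doesn't spell out FlowPRO's exact objective, and `score_chosen`, `score_rejected`, and `beta` are illustrative names for however the policy scores each trajectory.

```python
import math


def preference_loss(score_chosen: float, score_rejected: float, beta: float = 1.0) -> float:
    """Pairwise logistic preference loss: -log(sigmoid(beta * (chosen - rejected))).

    Generic stand-in, not FlowPRO's published objective. The loss only sees the
    difference between the two scores, never an absolute reward.
    """
    margin = beta * (score_chosen - score_rejected)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is low when the human's trajectory scores higher than the policy's,
# and high when the ordering is reversed.
agrees = preference_loss(score_chosen=2.0, score_rejected=0.0)
disagrees = preference_loss(score_chosen=0.0, score_rejected=2.0)
```

Notice that shifting both scores by the same constant leaves the loss unchanged, which is exactly why no hand-designed reward scale is needed.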

The technical contribution here is something called RPRO, which adds a regularizer to prevent what the authors call "reward hacking." I should know this better, but my understanding is that without this anchor, the model can find degenerate solutions that technically satisfy the preference objective but don't actually produce useful behavior. The regularizer keeps the learned preferences grounded.
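Here is a toy sketch of why an anchor term can matter. To be clear, this is not RPRO's formula, which I haven't reproduced here; the `anchor_error` term (think: distance between the policy's output and the human's demonstrated action) and the weight `lam` are assumptions of mine, chosen only to show the shape of the idea.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def regularized_loss(
    score_chosen: float,
    score_rejected: float,
    anchor_error: float,
    beta: float = 1.0,
    lam: float = 0.5,
) -> float:
    """Preference loss plus a hypothetical anchor penalty.

    anchor_error is an assumed non-negative measure of how far the policy has
    drifted from the human's corrected behavior; it is not from the paper.
    """
    preference_term = softplus(-beta * (score_chosen - score_rejected))
    return preference_term + lam * anchor_error


# Two policies with the SAME preference margin (5.0), so the preference term
# alone cannot tell them apart.
grounded = regularized_loss(score_chosen=0.0, score_rejected=-5.0, anchor_error=0.1)
drifted = regularized_loss(score_chosen=-10.0, score_rejected=-15.0, anchor_error=4.0)
```

In this toy setup the second policy has pushed both scores down and wandered away from what the human actually did, yet it satisfies the preference just as well. Only the anchor term makes its loss higher, which is the kind of degenerate solution a regularizer is meant to rule out.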
