Two New Papers Show How to Fix Robot Policies Without Starting Over
FlowPRO and EVE tackle the same problem from opposite directions: making robot learning actually work outside the lab.
Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
Training a robot to do something in a lab is one thing. Getting that same robot to keep working when reality gets messy is, honestly, where most approaches fall apart.
Two papers dropped this week that attack this problem from completely different angles, and I think they're worth looking at together because they reveal something interesting about where the field is headed.
The core problem both papers are solving is what happens after you've trained a vision-language-action model. These VLAs (the things that let robots see, understand language commands, and actually move) work pretty well in controlled settings. But deploy them on a real robot, and they start failing in ways that are expensive to fix. The traditional answer has been to collect more data and retrain, which is slow and costly.
FlowPRO appeared on arXiv this week, and it takes what I'd call the "teach from corrections" approach. The setup is clever: a human operator watches the robot attempt a task, and when it screws up, they intervene with a correction. That single correction creates a natural pair of data: the wrong thing the robot was about to do and the right thing it should have done instead.
What makes this interesting is that it's reward-free. If you've followed RL in robotics at all, you know that designing reward functions for real-world tasks is, tbh, kind of a nightmare. FlowPRO sidesteps this entirely by using preference optimization. The robot learns that trajectory A (what the human did) is better than trajectory B (what it was about to do) without needing to assign specific numerical rewards.
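To make the "A is better than B, no reward numbers needed" idea concrete, here's a minimal sketch of a generic pairwise preference loss (Bradley-Terry style, the kind used in DPO-like methods). This is not FlowPRO's actual objective. The `PreferencePair` class, the score fields, and `beta` are all names I made up for illustration, and I'm assuming the policy can assign some scalar score to a trajectory.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One human intervention, stored as a (chosen, rejected) pair.

    Both fields are hypothetical scalar scores the policy assigns to a
    trajectory (for example a log-likelihood). They are assumptions made
    for this sketch, not quantities defined in the paper.
    """
    chosen_score: float    # score of the human's correction
    rejected_score: float  # score of what the robot was about to do


def preference_loss(pair: PreferencePair, beta: float = 1.0) -> float:
    """Return -log(sigmoid(beta * (chosen - rejected))).

    The loss only depends on the gap between the two scores, so no
    hand-designed numerical reward is ever needed.
    """
    margin = beta * (pair.chosen_score - pair.rejected_score)
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Usage: the loss is small when the policy already prefers the correction
# and large when it prefers its own mistake.
agrees = preference_loss(PreferencePair(chosen_score=2.0, rejected_score=-1.0))
disagrees = preference_loss(PreferencePair(chosen_score=-1.0, rejected_score=2.0))
```

The only training signal is the ordering inside each pair, which is what makes the approach reward-free.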
The technical contribution here is something called RPRO, which adds a regularizer to prevent what the authors call "reward hacking." I should know this better, but my understanding is that without this anchor, the model can find degenerate solutions that technically satisfy the preference objective but don't actually produce useful behavior. The regularizer keeps the learned preferences grounded.
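Here's one way to picture why an anchor might matter. To be clear, this is my own toy illustration, not RPRO's formula: the function names, the squared-drift penalty, `chosen_ref`, and `lam` are all assumptions. The point is just that a pure pairwise loss only sees the gap between two scores, so a policy that drags both scores down while keeping the gap looks equally good to it.

```python
import math


def pairwise_loss(chosen: float, rejected: float) -> float:
    """Generic pairwise preference loss: -log(sigmoid(chosen - rejected))."""
    margin = chosen - rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def anchored_loss(chosen: float, rejected: float,
                  chosen_ref: float, lam: float = 1.0) -> float:
    """Pairwise loss plus a hypothetical anchor penalty.

    chosen_ref is the score a reference (pre-fine-tuning) policy gave the
    human correction. The penalty is zero while the current score stays at
    or above that reference and grows quadratically once it drifts below.
    This penalty is an assumption for illustration, not the paper's regularizer.
    """
    drift = max(chosen_ref - chosen, 0.0)
    return pairwise_loss(chosen, rejected) + lam * drift ** 2


# Two policies with the same gap between correction and mistake:
healthy = (-1.0, -5.0)      # still scores the correction highly
degenerate = (-10.0, -14.0)  # scores everything low, but keeps the gap

same_plain = pairwise_loss(*healthy) == pairwise_loss(*degenerate)
healthy_anchored = anchored_loss(*healthy, chosen_ref=-1.0)
degenerate_anchored = anchored_loss(*degenerate, chosen_ref=-1.0)
```

Under the plain loss the two policies are indistinguishable; with the anchor term, the one that drifted away from the reference is penalized.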
On four bimanual tasks (two-armed manipulation, which is genuinely hard), FlowPRO beat four baselines. The paper doesn't give exact success rate numbers in the abstract, so I can't tell you by how much.
EVE takes the opposite approach. Instead of fixing the policy through additional training, it asks: what if we just made the robot think harder at test time?
This is directly inspired by what's happened with large language models. The whole "test-time compute scaling" thing, where you let a model generate multiple candidates and then verify which one is best, has been transformative for reasoning tasks. EVE, also posted to arXiv, is an attempt to bring that same idea to robot control.
The setup: you take a frozen base policy (no additional training) and wrap it with multiple VLM-based verifiers. The robot proposes an action, the verifiers score it and suggest refinements, and an "action incorporator" fuses that feedback back into the final motion.
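The propose, verify, fuse loop described above can be sketched roughly like this. Everything here is a stand-in: the `Feedback` type, the weighting rule inside `incorporate`, and the toy policy and verifiers are my assumptions, not EVE's actual components (which are VLM-based and surely more involved).

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Action = List[float]
Observation = str  # placeholder; a real system would pass images and language


@dataclass
class Feedback:
    score: float        # verifier's confidence in the proposal, in [0, 1]
    suggestion: Action  # verifier's suggested refinement of the action


Policy = Callable[[Observation], Action]
Verifier = Callable[[Observation, Action], Feedback]


def incorporate(proposal: Action, feedback: Sequence[Feedback]) -> Action:
    """Hypothetical action incorporator.

    Blends the proposal with each suggestion, weighting a suggestion by how
    much its verifier doubted the proposal (1 - score). If every verifier is
    fully confident, the proposal passes through unchanged.
    """
    weights = [1.0 - fb.score for fb in feedback]
    total = 1.0 + sum(weights)  # the proposal itself carries weight 1
    fused = []
    for i, value in enumerate(proposal):
        blended = value + sum(w * fb.suggestion[i] for w, fb in zip(weights, feedback))
        fused.append(blended / total)
    return fused


def act(policy: Policy, verifiers: Sequence[Verifier], obs: Observation) -> Action:
    """One control step: the frozen policy proposes, verifiers critique, fuse."""
    proposal = policy(obs)  # the policy is only called, never updated
    feedback = [verify(obs, proposal) for verify in verifiers]
    return incorporate(proposal, feedback)


# Toy usage with stand-in components.
def frozen_policy(obs: Observation) -> Action:
    return [0.0, 1.0]

def confident_verifier(obs: Observation, action: Action) -> Feedback:
    return Feedback(score=1.0, suggestion=list(action))

def doubtful_verifier(obs: Observation, action: Action) -> Feedback:
    return Feedback(score=0.0, suggestion=[1.0, 1.0])

final_action = act(frozen_policy, [confident_verifier, doubtful_verifier], "pick up the cup")
```

The design choice worth noticing is that all the extra work happens at test time: the base policy's weights never change, and more verifiers just means more compute per step.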
Sources
- FlowPRO: Reward-Free Reinforced Fine-Tuning of Flow-Matching VLAs via Proximalized Preference Optimization. arXiv, cs.RO (Robotics)
- EVE: A Generator-Verifier System for Generative Policies. arXiv, cs.RO (Robotics)


