Two New Frameworks Tackle Reinforcement Learning's Reward Function Problem from Opposite Directions
One uses graph-based reasoning to auto-generate rewards; the other fuses human language and physical corrections. Both report large gains over their baselines, and one even beats expert-designed rewards.
The hardest part of teaching a robot isn't the teaching. It's figuring out what to reward.
That's the core problem in reinforcement learning: you need a reward function that tells the system what "good" looks like, and designing one by hand is tedious, error-prone, and often requires domain expertise that most end users don't have. Two new papers on arXiv tackle this from completely different angles, and both show results that, if they hold up in production, could meaningfully reduce the human effort required to train robotic systems.
The first approach automates reward design entirely. Reward Evolution with Graph-of-Thoughts (RE-GoT) is a framework that pairs large language models with visual language models to generate and iteratively refine reward functions without human feedback. The key innovation here is structured reasoning: instead of asking an LLM to produce a reward function in one shot, RE-GoT decomposes each task into a text-attributed graph that breaks the problem down into analyzable components.
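To make "text-attributed graph" concrete, here is a minimal sketch of what such a decomposition could look like for a drawer-opening task. The class name, the drawer example, and the prompt format are my own illustrative assumptions, not RE-GoT's actual data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TaskGraph:
    """Hypothetical text-attributed graph: nodes and edges both carry free text."""
    nodes: Dict[str, str] = field(default_factory=dict)              # name -> description
    edges: List[Tuple[str, str, str]] = field(default_factory=list)  # (src, dst, relation)

    def add_node(self, name: str, text: str) -> None:
        self.nodes[name] = text

    def add_edge(self, src: str, dst: str, relation: str) -> None:
        # Refuse edges that point at components we never described.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append((src, dst, relation))

    def to_prompt(self) -> str:
        """Serialize the graph into text an LLM could reason over piece by piece."""
        lines = ["Components:"]
        lines += [f"- {name}: {text}" for name, text in self.nodes.items()]
        lines.append("Relations:")
        lines += [f"- {src} -> {dst}: {rel}" for src, dst, rel in self.edges]
        return "\n".join(lines)


# Illustrative task: "open the drawer" (example invented for this sketch).
graph = TaskGraph()
graph.add_node("gripper", "robot end effector; position and open/closed state are observable")
graph.add_node("handle", "drawer handle; must be grasped before pulling")
graph.add_node("drawer", "slides along one axis; success means it is pulled out far enough")
graph.add_edge("gripper", "handle", "reach and grasp")
graph.add_edge("handle", "drawer", "pulling the handle slides the drawer open")

prompt = graph.to_prompt()
print(prompt)
```

The point of the structure is that each node or edge can be turned into its own reward term (a distance term for "reach and grasp," a displacement term for the drawer) and revised independently, rather than regenerating the whole function every time something goes wrong.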
The numbers are worth paying attention to. On RoboGen benchmarks (10 tasks), RE-GoT improved average success rates by 32.25% over existing LLM-based methods. On ManiSkill2 manipulation tasks, it hit 93.73% average success across four tasks. That last figure is notable because it exceeds the success rate achieved with expert-designed rewards on those benchmarks.
Look, I've seen enough spec sheets to know that benchmark performance doesn't always translate to real-world deployment. But 93.73% on manipulation tasks is a strong result, and the fact that it beat hand-crafted rewards suggests the automated approach isn't just "good enough" but potentially better at capturing task requirements that human designers miss.
The visual feedback loop is the clever bit. VLMs evaluate rollouts (basically, they watch the robot try to do the task) and provide feedback that guides reward refinement. No human in the loop. This addresses one of the persistent problems with LLM-based reward design: these models hallucinate, and without grounding in actual task performance, they'll confidently generate reward functions that optimize for nonsense.
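Stripped to its skeleton, that loop is propose, roll out, evaluate, refine. The sketch below is my own toy rendering of that control flow under loud assumptions: the three stub functions stand in for an LLM call, an RL training run, and a VLM judgment, and none of the names or thresholds come from the paper.

```python
from typing import Callable, List, Optional, Tuple


def refine_reward(
    propose: Callable[[str], str],
    rollout: Callable[[str], List[str]],
    evaluate: Callable[[List[str]], Tuple[float, str]],
    max_iters: int = 5,
    target: float = 0.9,
) -> Tuple[Optional[str], float]:
    """Iterate propose -> rollout -> evaluate, feeding critiques back to the proposer."""
    feedback = ""
    best_src: Optional[str] = None
    best_score = -1.0
    for _ in range(max_iters):
        reward_src = propose(feedback)       # LLM writes or revises the reward
        frames = rollout(reward_src)         # train a policy and record its behavior
        score, feedback = evaluate(frames)   # VLM scores the rollout and critiques it
        if score > best_score:
            best_src, best_score = reward_src, score
        if score >= target:
            break
    return best_src, best_score


# --- Toy stand-ins (all hypothetical) so the loop runs end to end ---
def toy_propose(feedback: str) -> str:
    # Adds a grasp bonus only once the critique mentions grasping.
    if "grasp" in feedback:
        return "reward = -dist(gripper, handle) + 1.0 * grasped"
    return "reward = -dist(gripper, handle)"


def toy_rollout(reward_src: str) -> List[str]:
    # "Frames" are just labels describing what the trained policy did.
    return ["reach", "grasp"] if "grasped" in reward_src else ["reach"]


def toy_evaluate(frames: List[str]) -> Tuple[float, str]:
    if "grasp" in frames:
        return 1.0, "task completed"
    return 0.3, "the arm reaches the handle but never attempts a grasp"


final_src, final_score = refine_reward(toy_propose, toy_rollout, toy_evaluate)
print(final_score, final_src)
```

In this toy run the first reward only pays for reaching, the evaluator complains that no grasp happens, and the second proposal adds the missing term. The grounding comes from the evaluator looking at behavior rather than at the reward code itself.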
The second paper takes the opposite approach: humans stay in the loop, but the system gets much better at understanding them. QuickLAP (Quick Language-Action Preference learning) comes from MIT's CLEAR Lab and focuses on semi-autonomous systems where humans provide real-time corrections.
The insight here is that physical corrections and verbal feedback are both incomplete on their own. If I grab a robot arm and nudge it left, that's grounded in physical reality but ambiguous. Was I correcting the trajectory? The speed? The approach angle? Conversely, if I say "be more careful near the edges," that's clear in intent but lacks physical grounding.
QuickLAP fuses both modalities using a Bayesian framework that treats language as a probabilistic observation over the user's latent preferences. In practice, this means LLMs extract what the researchers call "reward feature attention masks" from free-form speech, basically figuring out which aspects of the task the user cares about, and the system then integrates those masks with physical feedback in real time.
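Here is one way to picture how a language-derived mask could disambiguate a physical nudge. This is a deliberately simplified, Kalman-style toy of my own, not QuickLAP's inference procedure: the feature names, the numbers, and the choice to scale observation noise by the mask are all assumptions made for illustration.

```python
from typing import List, Tuple


def fuse_correction(
    mean: List[float],
    var: List[float],
    delta: List[float],
    mask: List[float],
    obs_var: float = 0.5,
    eps: float = 1e-6,
) -> Tuple[List[float], List[float]]:
    """Update a per-feature Gaussian belief over reward weights.

    delta: feature change caused by the physical correction.
    mask:  language-derived relevance in [0, 1] per feature (hypothetical
           stand-in for an attention mask). Low relevance inflates the
           observation noise, so the correction barely moves that weight.
    """
    new_mean, new_var = [], []
    for mu, v, d, m in zip(mean, var, delta, mask):
        noise = obs_var / max(m, eps)   # irrelevant feature -> very noisy evidence
        gain = v / (v + noise)          # scalar Kalman gain
        new_mean.append(mu + gain * d)
        new_var.append((1.0 - gain) * v)
    return new_mean, new_var


features = ["lane_center", "speed", "cone_distance"]
prior_mean = [0.0, 0.0, 0.0]
prior_var = [1.0, 1.0, 1.0]

# The nudge changed all three features a bit, which is the ambiguity problem.
delta = [0.2, -0.1, 0.8]

# Physical-only: every feature is treated as equally relevant.
unmasked_mean, _ = fuse_correction(prior_mean, prior_var, delta, [1.0, 1.0, 1.0])

# With speech like "stay further from the cones," a mask singles out one feature.
masked_mean, masked_var = fuse_correction(prior_mean, prior_var, delta, [0.05, 0.05, 0.9])

for name, u, m in zip(features, unmasked_mean, masked_mean):
    print(f"{name:14s} physical-only={u:+.3f}  with-language={m:+.3f}")
```

The physical-only update spreads credit across every feature the nudge happened to touch. With the mask, almost all of the update lands on cone distance, which is the kind of disambiguation the paragraph above describes.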
The claimed performance: a reduction of more than 70% in reward learning error compared to physical-only baselines in a semi-autonomous driving simulator. A 15-participant user study found the system "significantly more understandable and collaborative" than alternatives.
I want to be careful here. Fifteen participants is a small study, and "understandable and collaborative" is subjective. The 70% error reduction is more concrete, but it's in simulation, not on physical hardware. The real test is whether this works when someone's actually sitting in a semi-autonomous vehicle trying to teach it their driving preferences.
What these papers share is a recognition that the reward specification problem is fundamentally a communication problem. RE-GoT tries to eliminate the need for communication by having AI systems talk to each other (LLMs generating rewards, VLMs evaluating them). QuickLAP tries to make human-robot communication more efficient by combining multiple modalities.
Neither approach is complete. RE-GoT still requires well-defined task specifications and benchmark environments. It's not clear how it handles truly novel tasks or environments that differ significantly from training distributions. QuickLAP requires a human in the loop, which limits scalability, and the Bayesian inference assumes the user's preferences are consistent (they often aren't).
From my time in hardware, I can tell you that the gap between "works in simulation" and "works on the factory floor" is where most promising research goes to die. Manipulation tasks in ManiSkill2 are controlled environments with known objects and predictable physics. Real industrial settings have dust, vibration, lighting variation, and parts that don't quite match the CAD model.
That said, the direction is right. Reward engineering is a bottleneck. If you can reduce the human expertise required to specify what "good" looks like, you expand the set of people who can train robotic systems. That's meaningful for deployment at scale.
The practical implications differ by application. RE-GoT's fully automated approach makes more sense for scenarios where you have clear task definitions and can run extensive simulations: think warehouse picking, assembly operations, repetitive manufacturing tasks. You define the task structure, let the system figure out the reward, validate in simulation, then deploy.
QuickLAP's human-in-the-loop approach fits better for personalized systems where preferences vary by user: autonomous vehicles, assistive robots, collaborative manufacturing where different operators have different styles. You can't pre-specify everyone's preferences, so you need efficient online learning.
Both papers are available on arXiv. The QuickLAP authors have released their code on GitHub. RE-GoT's code availability isn't mentioned in the abstract, which is a limitation worth noting if you're trying to reproduce the results.
The bigger picture here is that LLMs are increasingly being used not to control robots directly but to handle the meta-problem of specifying what robots should learn. This is probably the right level of abstraction. LLMs are bad at real-time control (too slow, too prone to hallucination) but potentially good at translating human intent into formal specifications that traditional RL can optimize.
Whether these specific approaches scale remains unclear. But the 32.25% improvement on RoboGen and QuickLAP's error reduction of more than 70% suggest we're past the "interesting demo" phase and into territory where the gains are large enough to matter for practical applications.
I'll be watching for follow-up work that tests these methods on physical hardware with real-world task variation. That's where we'll learn if this is a genuine step forward or another case of simulation results that don't survive contact with reality.