The Reward Engineering Problem Is Getting Fixed, Just Not the Way You Think
Four new papers from robotics researchers tackle one of RL's most stubborn bottlenecks, and the approaches are more varied and more interesting than the headlines suggest.
·9 hours ago·7 min read
Most of the coverage on LLMs in robotics right now focuses on manipulation demos and humanoid hype. What's getting less attention is a quieter but arguably more consequential problem that's been grinding away at researchers for years: reward engineering. Specifically, how do you tell a robot what "good" looks like without spending hundreds of engineering hours hand-crafting reward functions that break the moment conditions change?
Four recent arXiv papers take different angles on this problem, and together they sketch out something close to a coherent research direction. None of them is a magic bullet. But the convergence is worth paying attention to.
Let me be specific about what's actually being claimed across these four papers, because the abstracts can blur together if you're skimming.
The first paper, Self-CriTeach (arXiv:2509.21543), proposes a framework where an LLM essentially teaches itself how to plan robotic tasks. The mechanism is dual-purpose: the model generates symbolic planning domains, which then serve both as a source of training data (chain-of-thought trajectories for supervised fine-tuning) and as structured reward functions for reinforcement learning. The key claim is that this sidesteps the need for manual reward engineering while also reducing the cost of collecting chain-of-thought supervision, which historically has required human annotators or expensive oracle systems.
The headline result is "higher planning success rates," and I've seen enough spec sheets to know that phrase needs context. The paper also reports improved cross-task generalization and resistance to imperfect logical states, which in plain terms means the planner degrades more gracefully when perception is noisy. That's actually the more practically useful result, because real factory floors are not clean perception environments.
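To make the "symbolic domain as reward function" idea concrete, here is a minimal Python sketch. It is not the paper's implementation: the toy domain, the action names, and the 50/50 split between step validity and goal achievement are all assumptions made for illustration. The point is only that once a planning domain exists, scoring a candidate plan against it needs no hand-written reward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset


# Hypothetical toy domain. In Self-CriTeach the domain is generated by the LLM;
# this one is hand-written purely to show the scoring step.
DOMAIN = {
    "pick_cup": Action(
        "pick_cup",
        frozenset({"hand_empty", "cup_on_table"}),
        frozenset({"holding_cup"}),
        frozenset({"hand_empty", "cup_on_table"}),
    ),
    "place_cup_on_shelf": Action(
        "place_cup_on_shelf",
        frozenset({"holding_cup"}),
        frozenset({"cup_on_shelf", "hand_empty"}),
        frozenset({"holding_cup"}),
    ),
}


def plan_reward(plan, initial_state, goal):
    """Score a plan in [0, 1]: half for applicable steps, half for reaching the goal."""
    state = set(initial_state)
    valid_steps = 0
    for step in plan:
        action = DOMAIN.get(step)
        if action is None or not action.preconditions <= state:
            break  # the first infeasible step ends the rollout
        state = (state - action.del_effects) | action.add_effects
        valid_steps += 1
    step_score = valid_steps / len(plan) if plan else 0.0
    goal_score = 1.0 if goal <= state else 0.0
    return 0.5 * step_score + 0.5 * goal_score  # assumed weighting


start = {"hand_empty", "cup_on_table"}
goal = {"cup_on_shelf"}
full = plan_reward(["pick_cup", "place_cup_on_shelf"], start, goal)
partial = plan_reward(["pick_cup"], start, goal)
infeasible = plan_reward(["place_cup_on_shelf"], start, goal)
```

A plan that is valid but stops short of the goal lands between the two extremes, which is the kind of graded signal an RL loop can use.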
The second paper, MEMO (arXiv:2603.04560), comes from Virginia Tech and takes a different approach entirely. Instead of self-generated supervision, MEMO collects natural language corrections from human users when a robot fails, then clusters and rephrases those corrections across multiple users and tasks to synthesize more general skill templates. The system builds a retrieval-augmented skillbook that grows over time. At runtime, the robot retrieves relevant guidance and uses it to generate new skills on the fly.
This is worth unpacking. The baseline comparison they're beating is simple text recall, where you just store the exact correction and replay it next time. MEMO's hypothesis is that aggregating and generalizing across many corrections produces better guidance than any single correction alone. Their experiments support this, though the evaluation is on household manipulation benchmarks, which remain somewhat idealized environments. Whether this holds at production volume is a different question.
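The difference between replaying one stored correction and retrieving an aggregated one is easier to see in code. The sketch below is a rough Python stand-in, not MEMO itself: it swaps the paper's LLM-based clustering and rephrasing for plain word overlap (Jaccard similarity), and the threshold, function names, and example corrections are all hypothetical.

```python
def tokens(text):
    """Lowercased bag of words; a crude stand-in for a real text embedding."""
    return set(text.lower().split())


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_skillbook(corrections, threshold=0.4):
    """Greedy clustering: a correction joins the first cluster it overlaps enough with."""
    clusters = []
    for text in corrections:
        for cluster in clusters:
            if jaccard(tokens(text), tokens(cluster[0])) >= threshold:
                cluster.append(text)
                break
        else:
            clusters.append([text])
    # Stand-in for LLM rephrasing: keep only the words every cluster member shares.
    book = []
    for cluster in clusters:
        shared = set.intersection(*(tokens(t) for t in cluster))
        book.append({"keywords": shared, "examples": cluster})
    return book


def retrieve(book, query):
    """Return the skillbook entry whose shared keywords best match the query."""
    return max(book, key=lambda entry: jaccard(tokens(query), entry["keywords"]))


# Hypothetical corrections from different users.
corrections = [
    "grip the mug by the handle",
    "grip the cup by the handle",
    "move slower near the glass shelf",
]
book = build_skillbook(corrections)
hit = retrieve(book, "grip the bowl by the handle")
```

Here the query mentions a bowl, which no user ever corrected, yet it still matches the generalized "grip by the handle" entry. Exact text recall would have nothing stored for it.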
Third is MAPL (arXiv:2606.25398), which focuses specifically on locomotion rather than manipulation or planning. Reward design for locomotion is its own particular headache because you're balancing competing objectives simultaneously: speed, stability, energy efficiency, terrain adaptation. MAPL prompts an LLM to compare robot trajectories along multiple semantically meaningful criteria independently, trains a multi-head preference scoring model from those comparisons, and aggregates the outputs into a scalar reward for policy optimization.
The result, tested across four quadruped locomotion environments, is performance comparable to or better than expert-designed rewards, with no task-specific reward engineering required. That's an ambitious claim, and four environments is a small evaluation set. Still, the multi-objective framing is the right instinct. Single-judgment preference learning has always struggled with locomotion precisely because "which behavior is better overall" is a question that obscures the tradeoffs underneath.
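Here is a small Python sketch of the per-criterion idea, under heavy assumptions. Each "head" is a linear scorer fit with a Bradley-Terry style logistic loss on pairwise comparisons, and the final reward is a weighted sum of head scores. The criteria names, the three trajectory features, the hard-coded labels standing in for LLM judgments, and the equal mixing weights are all invented for illustration; the paper's actual model and aggregation rule may differ.

```python
import math

CRITERIA = ["speed", "stability", "energy"]  # assumed criteria names


def score(weights, features):
    return sum(w * x for w, x in zip(weights, features))


def train_head(pairs, dim, lr=0.5, epochs=200):
    """Fit one linear head on (features_a, features_b, label) pairs.

    label is 1 if trajectory a was preferred on this criterion, else 0.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for fa, fb, label in pairs:
            diff = score(w, fa) - score(w, fb)
            p_a = 1.0 / (1.0 + math.exp(-diff))  # Bradley-Terry P(a preferred)
            grad = p_a - label  # gradient of the logistic loss w.r.t. diff
            for i in range(dim):
                w[i] -= lr * grad * (fa[i] - fb[i])
    return w


def scalar_reward(heads, features, mix):
    """Aggregate per-criterion scores into one number (weighted sum is an assumption)."""
    return sum(mix[c] * score(heads[c], features) for c in heads)


# Hypothetical features: [forward velocity, body roll variance, energy used].
traj_fast = [1.0, 0.8, 0.9]
traj_steady = [0.5, 0.1, 0.4]

# Stand-ins for per-criterion LLM judgments: fast wins on speed only.
prefs = {
    "speed": [(traj_fast, traj_steady, 1)],
    "stability": [(traj_fast, traj_steady, 0)],
    "energy": [(traj_fast, traj_steady, 0)],
}
heads = {c: train_head(prefs[c], dim=3) for c in CRITERIA}
mix = {c: 1.0 / len(CRITERIA) for c in CRITERIA}

reward_fast = scalar_reward(heads, traj_fast, mix)
reward_steady = scalar_reward(heads, traj_steady, mix)
```

The speed head alone prefers the fast trajectory, but the aggregate prefers the steady one because it wins on two of three criteria. A single "which is better overall" judgment would hide that tradeoff.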
Fourth is HEART (arXiv:2606.25404), which is less about reward engineering and more about planning reliability under physical constraints. The core problem it addresses is that LLMs are good at reasoning over language but routinely produce plans that are physically infeasible: reach for an object that's out of range, try to stack something on an unstable surface, sequence actions in an order that violates physical logic. HEART decomposes planning into atomic reasoning tasks and routes them to specialized expert agents, each focused on a specific type of constraint validation (capability, reachability, logical ordering). There's also a token budget constraint built in, which is a practical nod to the fact that real-world deployments can't run unconstrained inference.
HEART outperforms both single-LLM planners and rule-based planners on household benchmarks. The heterogeneous agent architecture is not a new idea in AI, but applying it specifically to physical constraint validation in robotics is a cleaner framing than most.
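The routing-plus-budget pattern can be sketched in a few lines of Python. To be clear about what is assumed: in HEART the experts are LLM agents, while the validators below are deterministic stand-ins, and the reach limit, the per-call token costs, and every name in the block are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str
    target: str
    distance_m: float


def check_capability(step, history):
    return step.action in {"pick", "place", "open"}  # assumed skill set


def check_reachability(step, history):
    return step.distance_m <= 0.8  # assumed arm reach in metres


def check_ordering(step, history):
    # A "place" is only valid if the same target was picked earlier in the plan.
    if step.action != "place":
        return True
    return any(s.action == "pick" and s.target == step.target for s in history)


# (expert name, validator, assumed token cost per call)
EXPERTS = [
    ("capability", check_capability, 50),
    ("reachability", check_reachability, 80),
    ("ordering", check_ordering, 120),
]


def validate_plan(plan, token_budget):
    """Return (ok, reason, tokens_used); stop at the first violation or when the budget runs out."""
    used = 0
    history = []
    for step in plan:
        for name, validator, cost in EXPERTS:
            if used + cost > token_budget:
                return False, "budget_exhausted", used
            used += cost
            if not validator(step, history):
                return False, f"{name}:{step.action}:{step.target}", used
        history.append(step)
    return True, "ok", used


good_plan = [Step("pick", "cup", 0.4), Step("place", "cup", 0.5)]
bad_plan = [Step("place", "cup", 0.5)]

result_good = validate_plan(good_plan, token_budget=1000)
result_bad = validate_plan(bad_plan, token_budget=1000)
result_tight = validate_plan(good_plan, token_budget=300)
```

With a tight budget the validator gives up partway through rather than silently skipping checks. That is one possible policy; a real system might instead prioritize the cheapest or most safety-critical experts first.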
Key themes across all four papers:
All four are attacking the reward engineering and supervision bottleneck from different angles: self-generated domains (Self-CriTeach), human feedback aggregation (MEMO), multi-objective LLM preferences (MAPL), and constraint-specialized agent decomposition (HEART)
None require task-specific reward functions hand-crafted by engineers, which has historically been the dominant cost in deploying RL-trained robots
Three of the four use LLMs as a component of the training pipeline rather than as the deployed policy itself, which is a more realistic architecture for production systems
All four are evaluated on benchmarks rather than physical hardware at scale, which matters for how much weight you put on the results
The papers converge on a shared assumption: that the bottleneck is structured reasoning about constraints and objectives, not raw model capability
That last point is the one I'd push back on slightly. It's not entirely clear whether the bottleneck is reasoning structure or data quality or embodiment transfer or something else entirely. These papers make a reasonable case for their framing, but the field doesn't have consensus yet.
From my time in hardware at Fanuc, the thing that always killed automation projects wasn't the algorithm. It was the gap between what the system was trained to handle and what the actual production environment threw at it. Perception noise, mechanical variance, edge cases that nobody thought to include in the training set. Self-CriTeach's explicit claim of robustness to imperfect logical states speaks directly to this, and it's the part of that paper I find most credible.
MEMO's approach of learning from human corrections is also practically interesting, though it raises questions about deployment context. It works well when you have a stream of users providing corrections across similar tasks. In a single-facility industrial deployment with one operator, the feedback aggregation benefit is much smaller. The paper doesn't address this directly, which is a gap.
MAPL's locomotion results are genuinely surprising to me. Quadruped locomotion reward tuning is notoriously fussy. Getting LLM-generated preferences to match expert-designed rewards across four different terrain types, without any task-specific engineering, is a result that deserves more scrutiny than it will probably get. The terrain-invariant language descriptions they use are the clever part: by describing behaviors in generic terms that don't reference specific terrain features, they avoid the distribution shift problem that kills most LLM-based reward approaches when conditions change. Whether this holds on terrain types outside the four tested environments remains unclear.
HEART's token budget constraint is the detail I appreciated most. Look, a lot of multi-agent LLM architectures proposed for robotics are operationally impractical because they assume unbounded compute at inference time. Baking in a token budget as a first-class constraint is the kind of engineering-minded decision that separates papers written by people who've thought about deployment from papers written by people who haven't.
The broader picture here is that reward engineering, which has been a known bottleneck in robotics RL for at least a decade, is finally getting systematic research attention from the LLM side. That reading rests on a limited sample of four papers, and the field is moving fast enough that the landscape will look different in six months. But the convergence on similar problem framings across independent research groups suggests this isn't just a coincidence of timing.
The real test, as always, is production volume on physical hardware. Benchmark results on household manipulation tasks are a starting point, not a finish line. None of these papers has shown its method working on an actual factory floor or in an actual warehouse deployment at scale. That's not a criticism exactly; it's just where the work is. The path from "works on benchmark" to "works in production" is where most robotics research quietly disappears, and there's no reason to expect these papers will be exceptions until someone proves otherwise.