Coding Agents Are Teaching Themselves to Be Roboticists. I'm Not Sure How I Feel About That.
Two new research papers suggest the future of robot control might be written in code by AI agents that never touched a robot. That's either brilliant or a disaster waiting to happen.
Here's my take: the most interesting thing happening in robotics right now isn't a new humanoid, isn't a flashy warehouse deployment, and it's definitely not another foundation model trained on YouTube videos of people folding laundry. It's a quieter shift, and it's the kind of thing that only looks obvious in hindsight. Researchers are starting to let AI coding agents design the actual policy architecture for robots: not just helper functions, but the whole multi-file, multi-component system that tells a robot how to act in the world. And that should probably make you stop and think for a moment.
Now, I've seen this movie before. Every five years or so, some genuinely clever idea arrives in a field I cover, gets overhyped by people who should know better, and then either quietly delivers or quietly collapses. The honest answer here is that it's too early to say which direction this one goes. But the research coming out right now is specific enough, and weird enough, that it's worth paying attention.
For a few years now, the robotics community has been excited about what's called Code-as-Policies, the idea that you can use a large language model to write code that controls a robot by stringing together perception, planning, and control primitives. It's a reasonable idea. Instead of training a neural network end-to-end on millions of demonstrations, you ask an LLM to reason about what the robot should do and write executable instructions. The robot follows the code. Simple.
Except it isn't simple, and the problem has been lurking in the background for a while. Most of these systems rely on multi-turn code generation loops at test time, meaning the LLM is sitting in the control loop, generating and revising code while the robot is trying to do something in the real world. That's fine for a slow tabletop demo in a lab. It's basically useless for anything that needs to happen in real time.
A paper just posted to arXiv called RHO (Robotics Harness Optimization) tries to fix this, and the approach is genuinely different from what I've seen before. Instead of running the coding agent at test time, RHO runs it at training time. The agent proposes and searches through what the researchers call "Repositories-as-Policies," interpretable multi-file policy repositories that compose primitives, and it does this search using feedback from environment rewards and execution results, not from human teleoperation demonstrations. You train the whole structure up front. Then you deploy it and the LLM stays out of the loop.
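To make the training-time versus test-time distinction concrete, here is a minimal sketch of a propose, evaluate, keep loop driven only by reward. It is not RHO's algorithm: the random `propose` function is a stand-in for the coding agent, a tuple of primitive names is a stand-in for a policy repository, and `toy_reward`, `PRIMITIVES`, and every other name here are hypothetical.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical stand-in for a policy repository: an ordered tuple of primitive names.
Policy = Tuple[str, ...]
PRIMITIVES = ("detect", "approach", "grasp", "lift", "place")  # hypothetical names


def toy_reward(policy: Policy) -> float:
    """Hypothetical environment reward: fraction of steps matching a hidden target order."""
    target = ("detect", "approach", "grasp", "lift", "place")
    return sum(a == b for a, b in zip(policy, target)) / len(target)


def propose(parent: Policy, rng: random.Random) -> Policy:
    """Stand-in for the coding agent: change one step of the parent policy."""
    child = list(parent)
    child[rng.randrange(len(child))] = rng.choice(PRIMITIVES)
    return tuple(child)


def search(reward_fn: Callable[[Policy], float], steps: int, seed: int = 0) -> Tuple[Policy, float]:
    """Training time: keep a candidate only if its reward is at least as good as the best so far."""
    rng = random.Random(seed)
    best = tuple(rng.choice(PRIMITIVES) for _ in range(len(PRIMITIVES)))
    best_score = reward_fn(best)
    for _ in range(steps):
        candidate = propose(best, rng)
        score = reward_fn(candidate)
        if score >= best_score:
            best, best_score = candidate, score
    return best, best_score


def deploy(policy: Policy) -> List[str]:
    """Deployment: execute the frozen policy once. No proposer and no search run here."""
    return [f"run:{step}" for step in policy]
```

The point of the split is that `search` can be as slow and expensive as you like, because it never runs while the robot is moving. `deploy` only replays whatever the search settled on.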
The numbers they report are hard to ignore. On a benchmark called LIBERO-PRO, which tests perturbed pick-and-place tasks (meaning the setup changes in ways the system hasn't seen before), OpenVLA scores 0.0%. Zero. A model called pi-0.5 averages 12.83%. RHO, using the same low-level primitives as those systems, reaches 45.0%. That's roughly 3.5 times pi-0.5's score and, by the paper's account, 2.5 times that of the strongest multi-turn agentic system they tested against. On Robosuite, RHO hits 70.0%, just edging out the previous record of 68.29%, and does it in single-turn execution with no corrective LLM edits at deployment.
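For anyone who wants to check the arithmetic, here is a quick calculation using only the figures quoted above. The variable names are mine, and the relative-gain line is a derived number, not something the paper reports.

```python
# Success rates quoted above, in percent.
rho_libero_pro = 45.0
pi05_libero_pro = 12.83
rho_robosuite = 70.0
previous_robosuite = 68.29

# Ratio of RHO to pi-0.5 on LIBERO-PRO (about 3.51, which rounds to the quoted 3.5x).
ratio_vs_pi05 = rho_libero_pro / pi05_libero_pro

# Robosuite margin in percentage points (about 1.71) and as a relative gain (about 2.5%).
robosuite_margin = rho_robosuite - previous_robosuite
robosuite_relative_gain = 100.0 * robosuite_margin / previous_robosuite

print(f"RHO vs pi-0.5 on LIBERO-PRO: {ratio_vs_pi05:.2f}x")
print(f"Robosuite margin: {robosuite_margin:.2f} points ({robosuite_relative_gain:.1f}% relative)")
```

The LIBERO-PRO gap is a multiple. The Robosuite gap is under two percentage points, which is why "edging out" is the right verb there.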
Is this the final word? No. These are benchmark results, and I've watched benchmark results mean almost nothing in real-world conditions more times than I care to count. But the gap between 0.0% and 45.0% is not a rounding error. That's a qualitative difference in capability, and it comes from a structural change in how the policy is built, not from throwing more compute at the same approach.
What I find most interesting, and maybe most underappreciated, is the bit about searching with reflective feedback from environment reward rather than demonstrations. Teleoperation data is expensive to collect, slow to scale, and biased toward whatever tasks the human demonstrators happened to do. If you can replace that with a coding agent that learns from reward signals, you've potentially removed one of the biggest bottlenecks in robot learning. Whether RHO actually delivers on that in practice outside of controlled benchmarks remains unclear.
Meanwhile, a separate group has posted something called MagicSim, also on arXiv, and it's attacking a different part of the same problem. Simulation for robot learning has always been a mess, and I mean that as a structural observation, not a criticism. You've got rendering pipelines, controller testbeds, training environments, and planning layers that were all built separately, by different teams, for different purposes, and they don't talk to each other cleanly. Researchers end up with what the MagicSim paper calls "magic" actions, shortcuts in the simulation that bypass the actual physics and make results look better than they are.
MagicSim's pitch is a unified infrastructure built around a single deterministic batched runtime and a shared Markov decision process. Everything (world construction, skill execution, planning, evaluation, data collection) runs through one loop. You define tasks in YAML files that separate out what's in the world, where things are, how they behave, and what the agent can see. From those specs, the system generates diverse executable environments that span different physics, layouts, sensors, and robot embodiments.
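As a rough illustration of how one declarative spec can fan out into many environments, here is a sketch in Python. MagicSim's real schema lives in YAML and is not reproduced in this article, so `TaskSpec`, its field names, and the example values are all hypothetical; the only idea carried over is the separation of concerns described above.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, Iterator, Tuple


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical task spec; each field mirrors one concern from the text above."""
    objects: Tuple[str, ...]      # what's in the world
    layouts: Tuple[str, ...]      # where things are
    physics: Tuple[str, ...]      # how things behave
    sensors: Tuple[str, ...]      # what the agent can see
    embodiments: Tuple[str, ...]  # which robot runs the task


def expand(spec: TaskSpec) -> Iterator[Dict[str, object]]:
    """Yield one concrete environment config per combination of the varying axes."""
    for layout, physics, sensor, robot in product(
        spec.layouts, spec.physics, spec.sensors, spec.embodiments
    ):
        yield {
            "objects": list(spec.objects),
            "layout": layout,
            "physics": physics,
            "sensor": sensor,
            "robot": robot,
        }


# Example values are made up for illustration.
SPEC = TaskSpec(
    objects=("mug", "tray"),
    layouts=("table_left", "table_right"),
    physics=("low_friction", "high_friction"),
    sensors=("rgb_camera",),
    embodiments=("arm_a", "arm_b"),
)
ENVS = list(expand(SPEC))  # 2 layouts x 2 physics x 1 sensor x 2 robots = 8 configs
```

One short spec yields eight environments here, and adding a second sensor would double that without touching the task definition.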
The part that caught my attention is the autocollect interface. If a command executes successfully, the system automatically saves it as a structured multimodal trajectory that aligns language supervision, action representations, visual and geometric data, and task status with the actual episode. You're not just running simulations, you're automatically building training datasets from the simulations that work. The Command->Skill->Planner->Robot->Record pipeline runs independently per environment above a shared physics tick, so you can run many environments in parallel without them interfering with each other.
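Here is a toy version of that idea: several environments advance under one shared tick, each keeps its own record, and only successful episodes end up in the dataset. This is a sketch under my own assumptions, not MagicSim's API. The one-dimensional "robot", the clipped-velocity planner, and names such as `Env`, `Trajectory`, and `run_batch` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Trajectory:
    """Hypothetical record aligning language, actions, observations, and task status."""
    command: str
    actions: List[float]
    observations: List[Dict[str, float]]
    success: bool


@dataclass
class Env:
    """Toy 1-D environment: move `position` toward `goal` with a clipped velocity."""
    command: str
    goal: float
    position: float = 0.0
    actions: List[float] = field(default_factory=list)
    observations: List[Dict[str, float]] = field(default_factory=list)

    def step(self, dt: float) -> None:
        # Plan an action, apply it, and record it, all local to this environment.
        action = max(-1.0, min(1.0, self.goal - self.position))
        self.position += action * dt
        self.actions.append(action)
        self.observations.append({"position": self.position})

    def succeeded(self, tol: float = 0.05) -> bool:
        return abs(self.goal - self.position) <= tol


def run_batch(envs: List[Env], ticks: int, dt: float = 0.1) -> List[Trajectory]:
    """Advance every environment on a shared tick, then save only the successes."""
    for _ in range(ticks):      # shared physics tick
        for env in envs:        # each environment steps independently of the others
            env.step(dt)
    return [
        Trajectory(env.command, env.actions, env.observations, True)
        for env in envs
        if env.succeeded()
    ]
```

With two environments, one whose goal is reachable in the tick budget and one whose goal is not, only the first contributes a trajectory. Failures are simply never written to the dataset.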
This is the kind of infrastructure work that doesn't get enough credit. The kids building the flashy end-to-end models get the press. The people building the plumbing that makes the whole enterprise possible get a citation or two and a poster session at a conference. But MagicSim, if it holds up under scrutiny, is addressing something real: the fact that robot learning research is still fragmented enough that comparing results across papers is often nearly impossible because everyone's running different simulators with different assumptions baked in.
I found only these two papers directly on this topic this week, and both are preprints that haven't gone through peer review yet, so take the broader claims with appropriate skepticism.
Put these two ideas together and you get a picture of where a certain corner of robotics research is heading. Coding agents that design policy architectures at training time, unified simulation infrastructure that can generate training data automatically, and the whole thing running without a human in the loop except at the design stage. That's a meaningful shift from where the field was even two years ago.
Call me old-fashioned, but I want to see this work outside a simulator before I get too excited. The history of autonomous systems is littered with results that looked spectacular in controlled conditions and fell apart the moment you introduced a slightly different lighting condition or a box that was 3 centimeters off from where the model expected it. The LIBERO-PRO benchmark is specifically designed to test generalization to perturbed conditions, which is encouraging, but it's still a benchmark.
What I will say is that the structural logic here is sound. If you can move the expensive reasoning work to training time and deploy a lean, interpretable policy that doesn't need a live LLM in the loop, you've solved a real problem. And if you can build simulation infrastructure that automatically generates aligned training data from successful rollouts, you've potentially solved another real problem. Whether both of these hold up at scale, in messy real-world conditions, with robots that cost real money and operate around real people, is the question that the next two or three years will answer.
I've been covering tech long enough to know that the papers that change things rarely announce themselves as paradigm-shifting. They just show up on arXiv on a Tuesday and start getting cited. These two might be that kind of paper. Or they might be a footnote. But they're worth reading.