If you've ever watched a reinforcement learning demo and thought "that robot moves weird," you're not imagining things. The jerky, twitchy movements you see in lab videos are called action jitter, and it's been a dirty secret of the RL community for longer than most of these young founders have been coding.
I've seen this movie before. Back in the early days of industrial robotics, we had similar problems with motor control, and engineers solved it with filters and dampers and lots of trial and error. The RL crowd has been trying to solve it with math, which, call me old-fashioned, seems like the harder path. But two new papers out this week suggest they might finally be getting somewhere.
The first, from a team publishing on arXiv, introduces something called ZAPS-DA (because of course it needs an acronym). The second tackles the same fundamental problem but for fixed-wing UAVs, using what they call HJB-inspired (Hamilton-Jacobi-Bellman) residual filtering. Both papers are attacking the same enemy: policies that work great in simulation but would destroy physical hardware in about thirty seconds flat.
Here's the thing most people outside the field don't understand: continuous control policies trained with off-policy reinforcement learning tend to output high-frequency noise in their actions. The neural network thinks it's making tiny optimizations, but what it's actually doing is telling a motor to go left-right-left-right-left-right faster than the hardware can physically respond.
The traditional fix has been to slap a filter on the output. Smooth out the jitter after the fact. Problem is, filters introduce lag. Your robot sees an obstacle, decides to turn, but the filter delays the command by a few hundred milliseconds while it smooths things out. In a lab, maybe that's fine. On a highway at 70 mph, you're dead.
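To put a number on that lag, here's a toy sketch. It's my own illustration, not something from either paper. It applies a causal first-order low-pass filter, the simplest "slap a filter on it" fix, to a step command. The 50 Hz loop rate and the smoothing constant are made-up values, and a real controller would be tuned differently.

```python
def low_pass(signal, alpha):
    """Causal first-order low-pass filter (exponential moving average).

    Each output depends only on the current and past inputs, which is
    why it can run live on a robot and also why it lags.
    """
    out, y = [], 0.0
    for x in signal:
        y += alpha * (x - y)
        out.append(y)
    return out


def time_to_reach(signal, threshold, dt):
    """Seconds until the signal first reaches `threshold`, or None."""
    for k, value in enumerate(signal):
        if value >= threshold:
            return k * dt
    return None


DT = 0.02     # hypothetical 50 Hz control loop
ALPHA = 0.1   # hypothetical smoothing strength

command = [1.0] * 100  # the policy decides "full turn" at t = 0 and holds it
print(time_to_reach(command, 0.9, DT))                   # raw command: immediately
print(time_to_reach(low_pass(command, ALPHA), 0.9, DT))  # filtered: about 0.42 s later
```

Crank the smoothing up and the jitter goes away, but the delay grows right along with it. That's the tradeoff both papers are trying to escape.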
The ZAPS-DA approach does something clever that I genuinely didn't expect from this generation of researchers (and I mean that as a compliment). Instead of filtering the output, they train a second "decoupled actor" that learns to imitate what a zero-phase filter would have done, but without actually running the filter at deployment time. It's basically teaching the robot to be smooth from the start, rather than smoothing it after the fact.
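For the curious, here's roughly what "zero-phase" means. This is a minimal sketch under my own assumptions, not the ZAPS-DA code. If you run a causal filter forward and then backward over a recorded action trajectory, the lag cancels. The catch is that it needs future samples, which is exactly why you can't run it live. That makes it usable as an offline training target, and a second actor can be regressed toward it. The filter choice (an exponential moving average), the smoothing constant, and the squared-error `imitation_loss` are all hypothetical stand-ins for whatever the paper actually uses.

```python
def ema(signal, alpha):
    """Causal exponential moving average, initialised at the first sample."""
    out = []
    y = signal[0]
    for x in signal:
        y += alpha * (x - y)
        out.append(y)
    return out


def zero_phase(signal, alpha):
    """Forward-backward filtering: the two passes' lags cancel.

    Non-causal: every output depends on future samples, so this only
    works offline on a recorded trajectory, never at deployment time.
    """
    forward = ema(signal, alpha)
    backward = ema(forward[::-1], alpha)
    return backward[::-1]


def imitation_loss(actor_actions, targets):
    """Mean squared error between the actor's actions and filtered targets."""
    return sum((a - t) ** 2 for a, t in zip(actor_actions, targets)) / len(targets)


ramp = [0.1 * k for k in range(41)]  # a steadily increasing steering command
causal = ema(ramp, 0.3)
offline = zero_phase(ramp, 0.3)
print(ramp[20] - causal[20])   # roughly 0.23: the causal filter trails the ramp
print(ramp[20] - offline[20])  # close to 0: forward and backward lags cancel
```

On a steady ramp, the causal filter trails the input, while the forward-backward version tracks it through the middle of the trajectory. The ends still show edge effects.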
The numbers are actually impressive! On MetaDrive, a driving simulator, they reduced steering jitter by 14 to 21 times. Throttle jitter dropped by 3 to 5 times. And here's the kicker: task completion stayed basically the same. The robot drove just as well, just without looking like it was having a panic attack.
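I won't vouch for the paper's exact jitter metric here, so treat this part as an assumption. One common proxy is the mean absolute change between consecutive actions, and a "reduction factor" is just the ratio of that number before and after. The action sequences below are toy data, not the MetaDrive results.

```python
def jitter(actions):
    """Mean absolute change between consecutive actions (a hypothetical proxy)."""
    if len(actions) < 2:
        raise ValueError("need at least two actions")
    diffs = [abs(b - a) for a, b in zip(actions, actions[1:])]
    return sum(diffs) / len(diffs)


def reduction_factor(baseline_actions, smoothed_actions):
    """How many times less jittery the smoothed sequence is."""
    return jitter(baseline_actions) / jitter(smoothed_actions)


twitchy = [0.1, -0.1] * 50                 # left-right-left-right steering
gentle = [0.002 * k for k in range(100)]   # a slow, steady turn-in
print(reduction_factor(twitchy, gentle))   # roughly 100 on this toy data
```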
This is where I get a bit skeptical, because I've watched a lot of simulator results fail to translate to the real world. The ZAPS-DA team validated on two simulators, MetaDrive and a custom Webots environment for adaptive cruise control. Good start, but we don't know yet how this holds up on actual hardware with actual sensor noise and actual mechanical slop.
The UAV paper takes a different approach that I find more interesting from an engineering perspective, even if the results are messier. Instead of replacing the low-level controller, they put a learned supervisor on top of an unchanged autopilot. The supervisor picks from a finite set of bounded adjustments to airspeed, altitude, and heading. The autopilot stays the only thing actually talking to the actuators.
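Here's a bare-bones sketch of that kind of menu-driven supervisor. Everything in it is hypothetical: the step sizes, the envelope limits, and the `Reference` and `apply_residual` names are mine. Simple clamping stands in for whatever safety filtering the authors actually use. The point is the shape of the design. The learned policy only outputs an index into a fixed list of adjustments, and whatever it picks gets clipped to the envelope before the autopilot ever sees it.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Reference:
    airspeed: float  # m/s
    altitude: float  # m
    heading: float   # degrees in [0, 360)


# Hypothetical menu of bounded residual adjustments: 3 x 3 x 3 = 27 discrete actions.
AIRSPEED_STEPS = (-2.0, 0.0, 2.0)
ALTITUDE_STEPS = (-10.0, 0.0, 10.0)
HEADING_STEPS = (-5.0, 0.0, 5.0)
ACTIONS = list(product(AIRSPEED_STEPS, ALTITUDE_STEPS, HEADING_STEPS))

# Hypothetical flight envelope, standing in for the paper's safety constraints.
AIRSPEED_LIMITS = (18.0, 30.0)
ALTITUDE_LIMITS = (50.0, 400.0)


def clamp(value, low, high):
    return max(low, min(high, value))


def apply_residual(nominal, action_index):
    """Turn the supervisor's discrete choice into a new autopilot reference.

    The supervisor never touches actuators. It only edits the reference,
    and the result is clipped to the envelope before the autopilot sees it.
    """
    d_airspeed, d_altitude, d_heading = ACTIONS[action_index]
    return Reference(
        airspeed=clamp(nominal.airspeed + d_airspeed, *AIRSPEED_LIMITS),
        altitude=clamp(nominal.altitude + d_altitude, *ALTITUDE_LIMITS),
        heading=(nominal.heading + d_heading) % 360.0,
    )


nominal = Reference(airspeed=29.0, altitude=100.0, heading=358.0)
choice = ACTIONS.index((2.0, 0.0, 5.0))  # what a trained policy might select
print(apply_residual(nominal, choice))
# Reference(airspeed=30.0, altitude=100.0, heading=3.0)
```

The policy can only pick one of 27 pre-approved nudges, so you can enumerate and check every possible output offline. That's a big part of why this architecture is easier to argue about than a network driving the actuators directly.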
This is the kind of conservative architecture that might actually get deployed! You're not asking anyone to trust a neural network with direct motor control. You're asking them to trust a neural network to nudge the references that a proven autopilot then executes. Much easier sell to the FAA, I'd imagine.
Their results show mean RMS path-tracking error dropping from 338.6 meters (baseline autopilot) to 44.8 meters (with their HJB residual approach). That's an 86.77% reduction. But, and this is important, they're honest about the tradeoffs: airspeed error went up. No method dominated every metric. I appreciate when researchers actually report their failures alongside their wins.
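For anyone who wants to check the arithmetic, here's the calculation. I'm assuming "mean RMS" means the RMS tracking error of each run, averaged across runs. The paper may aggregate differently, and the `mean_rms` helper is my own.

```python
import math


def rms(errors):
    """Root-mean-square of one run's path-tracking errors (metres)."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mean_rms(runs):
    """Average of per-run RMS errors (assumed aggregation)."""
    return sum(rms(run) for run in runs) / len(runs)


def percent_reduction(baseline, improved):
    return 100.0 * (baseline - improved) / baseline


# Reported figures: 338.6 m (baseline autopilot) vs 44.8 m (HJB residual).
print(round(percent_reduction(338.6, 44.8), 2))  # 86.77
```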
Look, I've been covering tech since the 90s, and if there's one thing I've learned it's that academic papers don't change the world on their own. These techniques need to be implemented, tested on real hardware, validated across edge cases, and then maybe, maybe, they'll show up in products five years from now.
But what I see here is a maturing of the field. The RL community spent the last decade chasing benchmark scores and superhuman game-playing. Now they're actually grappling with the boring, hard problems that matter for deployment: how do you make these things work on physical hardware that breaks when you jitter the controls?
The ZAPS-DA paper is particularly notable for what they call "zero-hyperparameter portability." Meaning you can supposedly drop this into different RL setups without endless tuning. That's the kind of practical consideration that suggests these researchers have actually tried to deploy something, failed, and learned from it.
The UAV work is interesting for a different reason. It shows a path toward certification. Regulators don't want to certify a black-box neural network flying an airplane. But a neural network that can only adjust references within bounded ranges, filtered through safety constraints, with a proven autopilot doing the actual flying? That's a conversation you can have with the FAA.
Neither paper addresses what happens when the simulation-to-reality gap bites you. Both validated in simulation only. The ZAPS-DA team used a paired evaluation protocol with n=150, which is decent statistical rigor, but it's still simulation. The UAV paper held "the plant, autopilot, and actuator model fixed" during comparison, meaning they're comparing supervisor designs against the same simulated aircraft, not real-world performance.
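For readers who haven't run one, a paired protocol means both methods face the same scenarios (same seeds, same traffic, same starting conditions). You then analyze the per-scenario differences. Here's a minimal sketch using a normal-approximation 95% confidence interval. That choice is my assumption, and the ZAPS-DA authors may well use a different test.

```python
import math
import statistics


def paired_summary(baseline, treatment, z=1.96):
    """Mean paired difference (treatment - baseline) with a normal-approximation CI.

    Entry i of both lists must come from the same scenario and seed.
    """
    if len(baseline) != len(treatment) or len(baseline) < 2:
        raise ValueError("need two equal-length lists with at least two pairs")
    diffs = [t - b for b, t in zip(baseline, treatment)]
    mean = statistics.fmean(diffs)
    half_width = z * statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean, (mean - half_width, mean + half_width)


# Toy data: 150 scenarios whose difficulty varies widely, with a consistent difference.
baseline = [10.0 + 5.0 * (k % 7) for k in range(150)]
treatment = [b + (0.9 if k % 2 else 1.1) for k, b in enumerate(baseline)]
mean, (low, high) = paired_summary(baseline, treatment)
print(f"mean difference {mean:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

Pairing removes the scenario-to-scenario variation that would otherwise swamp a small improvement. That's why 150 paired runs typically buy more confidence than 150 unrelated runs per method.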
We also don't know how these approaches handle distribution shift. What happens when your robot encounters a situation it's never seen in training? The filtered outputs might stay smooth, but smooth and wrong is still wrong.
And then there's the question nobody wants to ask: are these improvements actually worth the added complexity? ZAPS-DA requires training and maintaining a second actor network. The HJB residual approach adds a whole supervisor layer with its own value-iteration critic and safety filters. At some point, you've got so many systems watching each other that debugging becomes a nightmare.
But what do I know. I still prefer email to Slack, and these kids are clearly onto something. The jitter problem has been a real barrier to deployment, and seeing multiple groups attack it from different angles in the same week suggests the field is finally taking it seriously.
If you want to argue about any of this, my email's on the about page. I actually read those, unlike certain other communication channels.