Photo credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
I'm going to say something that might sound obvious: we've been leaving performance on the table with robot learning for years, and everyone kind of knew it.
That's the thread running through a handful of papers that dropped recently, and honestly, it's refreshing to see researchers finally tackling problems that practitioners have been complaining about forever. The common theme? The algorithms we've standardized on aren't actually the best ones. They're just the ones that happened to work first.
If you've followed robot learning at all, you've probably heard of PPO (Proximal Policy Optimization). It's become the default choice for training legged robots, the thing everyone reaches for when they want to teach a quadruped to walk or a humanoid to balance. And it works! That's not the issue.
The issue, as a team points out in a new arXiv paper, is that PPO is what's called "on-policy." In plain terms: it can only learn from experiences it just collected. It can't reuse old data. This makes it wildly sample-inefficient, which matters a lot when you want to fine-tune a robot in the real world where every interaction costs time and wear on the hardware.
Soft Actor-Critic (SAC) doesn't have this problem. It's off-policy, meaning it can learn from a big buffer of past experiences. In theory, this makes it perfect for sim-to-real transfer workflows. In practice? SAC has consistently failed to match PPO's performance in the massively parallel training setups everyone uses now.
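To make the on-policy versus off-policy distinction concrete, here is a minimal sketch of the data handling on each side. The `ReplayBuffer` class and the toy transitions are my own illustration, not code from the paper, and a real SAC setup would store tensors and run thousands of parallel environments.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions (illustrative, not from the paper)."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once it is full.
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling mixes fresh and stale experience in one batch.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


# On-policy (PPO-style): collect a rollout, update on it once, then discard it.
on_policy_rollout = [(t, 0.0, 1.0, t + 1, False) for t in range(4)]
on_policy_rollout.clear()  # the data is gone after one update

# Off-policy (SAC-style): keep appending, and sample from everything retained.
buffer = ReplayBuffer(capacity=1000)
for t in range(50):
    buffer.add(state=t, action=0.0, reward=1.0, next_state=t + 1, done=False)
batch = buffer.sample(batch_size=8)
```

The practical point is the last two lines: every update can draw on all 50 stored transitions, not only the most recent rollout.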
The researchers dug into why SAC falls behind in those massively parallel setups, and the fixes are almost embarrassingly simple: better policy initialization, timeout-aware critic targets, and multi-step return estimation. With these changes, SAC fully matches PPO's performance across multiple robot platforms. I initially thought this was incremental work, but after reading through the experiments, I think it's actually kind of a big deal. It means we might finally be able to use the same algorithm in simulation and on real hardware without switching approaches.
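Two of those fixes are easy to show in a few lines. The sketch below is the textbook form of a timeout-aware, multi-step critic target as I understand the idea, not the paper's exact formulation. The function name and arguments are hypothetical, and SAC's entropy term is left out to keep it short.

```python
def n_step_target(rewards, terminated, truncated, bootstrap_values, gamma=0.99):
    """Multi-step return for the first transition in a short window.

    rewards[k]          reward at step k of the window
    terminated[k]       True if the episode truly ended at step k (e.g. a fall)
    truncated[k]        True if the episode was cut by a time limit at step k
    bootstrap_values[k] critic estimate of the state reached after step k
    """
    target = 0.0
    discount = 1.0
    for k, reward in enumerate(rewards):
        target += discount * reward
        discount *= gamma
        if terminated[k]:
            # Real terminal state: there is no future value to add.
            return target
        if truncated[k] or k == len(rewards) - 1:
            # Time-limit cut or end of the window: still bootstrap from the critic.
            return target + discount * bootstrap_values[k]
    return target
```

The design choice that matters is the split between `terminated` and `truncated`. If a time limit is treated like a fall, the critic learns that states near the end of an episode are worthless, even though the robot was doing fine when the clock ran out.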
Here's where things get interesting for anyone who doesn't have a server farm.
Another paper introduces something called SDPG (Stochastic Decoupled Policy Gradient), and the headline result is striking: they're training visual control policies end-to-end in a few hours on a single RTX 4080. Not a cluster. Not even a high-end workstation. A gaming GPU.
The trick is estimating policy gradients via random perturbations of trajectory rollouts instead of the usual approach. This requires "orders of magnitude fewer" batch-rendered environments and cuts compute and memory overhead substantially. On visual MuJoCo benchmarks, it beats baselines in training time, memory usage, and final rewards.
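I don't have SDPG's exact estimator in front of me, so treat the following as a stand-in for the general family: a generic antithetic random-perturbation gradient estimate (evolution-strategies style) on a toy one-parameter controller. It perturbs the policy parameter rather than the rollouts themselves, and the toy dynamics, `rollout_return`, and all constants are assumptions of mine.

```python
import random


def rollout_return(gain, steps=20):
    """Toy episode: drive a 1-D state toward zero with the action u = -gain * x."""
    x, total = 1.0, 0.0
    for _ in range(steps):
        u = -gain * x
        x = x + 0.5 * u
        total -= x * x + 0.01 * u * u  # reward penalizes distance and effort
    return total


def perturbation_gradient(theta, sigma=0.05, pairs=32, rng=None):
    """Estimate d(return)/d(theta) from paired random perturbations, no backprop."""
    rng = rng or random.Random(0)
    grad = 0.0
    for _ in range(pairs):
        eps = rng.gauss(0.0, 1.0)
        # Antithetic pair: evaluate the same noise with both signs.
        diff = rollout_return(theta + sigma * eps) - rollout_return(theta - sigma * eps)
        grad += diff * eps / (2.0 * sigma)
    return grad / pairs


theta = 0.1
rng = random.Random(0)
start = rollout_return(theta)
for _ in range(50):
    theta += 0.01 * perturbation_gradient(theta, rng=rng)  # plain gradient ascent
final = rollout_return(theta)
```

Nothing here differentiates through the simulator or renderer: the estimate comes only from comparing returns of perturbed rollouts, which is the property that makes this family of methods cheap on memory.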
What I find compelling here is the democratization angle. Right now, if you want to do serious robot learning research, you basically need access to expensive compute. If methods like SDPG hold up (and tbh, I'm cautiously optimistic but want to see more replication), that barrier drops significantly.
Okay, so we can train robots faster and more efficiently. But here's the thing that's been bugging me: most robot learning systems are still terrible at handling anything they weren't explicitly trained on. You train a robot to pick up a mug, and it falls apart when the lighting changes.
Vision-Language-Action models (VLAs) are supposed to help with this by leveraging pre-trained representations from the internet-scale datasets used for things like GPT and CLIP. The promise is that robots could understand tasks described in natural language and transfer knowledge across situations.
The reality, as a new framework called Agentic-VLA acknowledges, is that current VLA training methods have "poor generalization to novel environments and low training efficiency requiring extensive demonstrations." Not exactly the dream.
Agentic-VLA tries to fix this with three ideas: dynamically generating reward functions based on the robot's current capabilities (so it's not trying to learn everything at once), using a critic model to guide exploration systematically rather than randomly, and maintaining a memory of task-relevant policy weights to warm-start adaptation.
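The third idea, a memory of policy weights used to warm-start adaptation, is the easiest one to picture in code. This is a deliberately tiny sketch under my own assumptions: tasks are represented by small embedding vectors, retrieval is nearest-neighbor by cosine similarity, and `WeightMemory` is a hypothetical name. The paper's actual mechanism may look quite different.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class WeightMemory:
    """Stores (task embedding, policy weights) pairs (hypothetical structure)."""

    def __init__(self):
        self.entries = []

    def store(self, task_embedding, weights):
        self.entries.append((list(task_embedding), list(weights)))

    def warm_start(self, task_embedding, default):
        # Fall back to the default initialization when nothing is stored yet.
        if not self.entries:
            return list(default)
        _, weights = max(self.entries, key=lambda e: cosine(e[0], task_embedding))
        return list(weights)


memory = WeightMemory()
memory.store([1.0, 0.0], [0.2, 0.4])  # weights learned on one earlier task
memory.store([0.0, 1.0], [0.9, 0.1])  # weights learned on a different task
init = memory.warm_start([0.9, 0.1], default=[0.0, 0.0])
```

Here the new task's embedding sits closest to the first stored task, so adaptation starts from `[0.2, 0.4]` instead of from scratch.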
The results on the LIBERO benchmark are solid: +12.3% on long-horizon tasks, +28.5% in one-shot learning, and cross-task transfer going from 0% to 31.2% without task-specific demonstrations. They also claim 2.4x faster convergence than existing methods.
I should note that benchmark results don't always translate to real-world performance, and the paper doesn't provide extensive real-world validation. But the direction feels right.
There's a separate thread in recent work that's worth mentioning: the return of model-based approaches.
The idea of learning a model of the world and then planning through it has been around forever, but it's historically been finicky. Model errors compound, planning is computationally expensive, and in practice, model-free methods often just work better.
Dream-MPC is trying to change that by combining Model Predictive Control with learned world models in a smarter way. The key insight is that gradient-based optimization through a learned model can work if you do it carefully: generate candidate trajectories from a policy, then optimize each one using gradient ascent with uncertainty regularization.
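Here is roughly what that loop looks like, as a toy sketch and not Dream-MPC's implementation. I am assuming a three-member ensemble of linear models as the "learned" world model, ensemble disagreement as the uncertainty term, and central finite differences standing in for automatic differentiation. Every name and constant is made up for illustration.

```python
import random

# Hypothetical learned ensemble: each member predicts x' = a * x + b * u.
ENSEMBLE = [(0.95, 0.48), (1.00, 0.50), (1.05, 0.52)]
TARGET = 1.0


def objective(actions, x0=0.0, penalty=0.5):
    """Predicted reward minus a penalty on ensemble disagreement (uncertainty)."""
    states = [x0] * len(ENSEMBLE)
    total = 0.0
    for u in actions:
        states = [a * x + b * u for (a, b), x in zip(ENSEMBLE, states)]
        mean = sum(states) / len(states)
        variance = sum((x - mean) ** 2 for x in states) / len(states)
        total += -(mean - TARGET) ** 2 - 0.01 * u * u - penalty * variance
    return total


def refine(actions, steps=30, lr=0.05, h=1e-4):
    """Gradient ascent on one action sequence through the model."""
    actions = list(actions)
    for _ in range(steps):
        grad = []
        for i in range(len(actions)):
            up = actions[:i] + [actions[i] + h] + actions[i + 1:]
            down = actions[:i] + [actions[i] - h] + actions[i + 1:]
            grad.append((objective(up) - objective(down)) / (2 * h))
        actions = [u + lr * g for u, g in zip(actions, grad)]
    return actions


def policy_proposal(rng, horizon=5):
    # Stand-in for sampling an action sequence from a learned policy.
    return [0.3 + rng.gauss(0.0, 0.2) for _ in range(horizon)]


rng = random.Random(0)
candidates = [policy_proposal(rng) for _ in range(4)]
refined = [refine(c) for c in candidates]
best = max(refined, key=objective)
first_action = best[0]  # MPC executes only the first action, then replans
```

The disagreement penalty is what keeps the optimizer honest: without it, gradient ascent will happily steer into regions where the model is confidently wrong.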
On 24 continuous control tasks, Dream-MPC outperforms both gradient-free MPC and state-of-the-art baselines. You might be wondering why gradient-based methods haven't dominated before. The paper notes that "recent works have empirically shown that gradient-based methods often perform worse than their gradient-free counterparts." The difference here seems to be the combination of uncertainty regularization and amortization of optimization iterations over time.
A related paper on Trust Region Q-Adjoint Matching (TRQAM) tackles a different but connected problem: fine-tuning pre-trained flow policies without everything collapsing. The technical details are dense, but the upshot is they achieve 68% success rate on OGBench offline RL tasks, up from 46% for the strongest baseline.
So what does all of this add up to? Honestly, I'm not sure yet. That's the truthful answer.
What I can say is that there's a pattern here: researchers are systematically identifying why standard approaches underperform and finding fixes that are often conceptually simple but require careful engineering. SAC was always theoretically better for certain workflows; it just needed the right modifications to work at scale. Model-based RL was always appealing; it just needed better handling of uncertainty and optimization.
The gap between what's possible in simulation and what works on real robots remains wide. These papers mostly evaluate in simulation or on limited real-world setups. The SDPG paper mentions "effective sim-to-real transfer on physical hardware," but the details are sparse.
I think we're in a period where the low-hanging fruit in robot learning is being picked more aggressively. The algorithms are getting more mature, the compute requirements are (slowly) coming down, and the frameworks for evaluation are getting more standardized.
What's still missing, in my view, is the kind of robust real-world validation that would make me confident these advances translate outside the lab. But that's always been the hard part, and at least now we might have better tools to attempt it.