Robots Are Finally Learning to Handle Soft Things. Here's Why That's Harder Than It Sounds.
Two new papers out of academic robotics tackle deformable object manipulation and learning from human video. The results are genuinely interesting, even if the road to your laundry-folding robot is still very long.
By
· Yesterday · 7 min read
Picture a robot in a lab somewhere, trying to fold a dish towel. It grabs the corner, pulls, and the cloth bunches up in ways the robot's model never anticipated. The robot freezes, or worse, it keeps going and makes a mess of it. Researchers have been watching this same scene play out, in various forms, for decades now. Soft things, floppy things, things that don't hold their shape when you grab them, have been the quiet nemesis of robotic manipulation research for as long as I've been covering this beat.
Two new papers, both out of academic groups and posted to arXiv in recent weeks, take direct aim at this problem from different angles. One focuses on real-time control for ropes and cloth. The other looks at teaching robots manipulation skills by watching humans on video. Neither claims to have solved the problem entirely, and I appreciate that honesty. But together they suggest the field is making actual, measurable progress, not just publishing benchmarks that look good on paper.
I've seen this movie before. Usually, around this point in a hype cycle, somebody announces a robot that can fold laundry in three seconds, the headlines go wild, and then nothing ships for five years. These papers are more careful than that, which is either a sign of scientific maturity or just good PR instincts. Probably both.
The first paper introduces something called CORD-SLS, a control method built around a GPU-parallel differentiable simulator. The core idea is that you can simulate the physics of a rope or a piece of cloth fast enough, on millisecond timescales, to use the simulation for real-time planning. That's not trivial. Deformable objects have what researchers call high-dimensional state spaces, which is a polite way of saying there are a staggering number of ways a piece of cloth can be configured at any given moment, and your planner has to search across that whole space.
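To put a rough number on "high-dimensional," here is a back-of-the-envelope sketch. It assumes the common particle-based way of modeling ropes and cloth, where each particle carries a 3-D position and a 3-D velocity. The particle counts are my own illustrative picks, not figures from the paper, and the paper's simulator may represent things differently.

```python
# Back-of-the-envelope state-space sizes, assuming a particle-based model.
# All counts below are illustrative assumptions, not values from the paper.

def state_dim(num_particles: int, dims_per_particle: int = 6) -> int:
    """Size of the state vector: 3 position + 3 velocity numbers per particle."""
    return num_particles * dims_per_particle

# A rigid object: 6 pose degrees of freedom plus 6 velocities.
rigid_dim = 6 + 6

# A rope discretized into 50 particles (hypothetical resolution).
rope_dim = state_dim(50)

# A cloth discretized as a 30 x 30 grid of particles (hypothetical resolution).
cloth_dim = state_dim(30 * 30)

print(f"rigid object: {rigid_dim} numbers")
print(f"rope:         {rope_dim} numbers")
print(f"cloth:        {cloth_dim} numbers")
```

Under these assumptions the cloth needs 450 times as many numbers as the rigid object to describe a single instant, and a planner has to predict how all of them evolve.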
The team's approach uses a robust variant of model predictive control, or MPC, which is a well-established control framework. The novelty here is running it in parallel on a GPU and coupling it with conformal prediction, a statistical technique for calibrating uncertainty bounds. In plain English, the system is trying to be honest about what it doesn't know about where the cloth is and how it'll behave, and then plan conservatively enough that it stays safe despite that uncertainty. They test it on obstacle avoidance, rope routing, cloth folding, and cloth smoothing, and report that it beats baselines on safety, speed, and task success.
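For a feel of what "calibrating uncertainty bounds" can look like, here is a minimal sketch of textbook split conformal prediction feeding a conservative safety check. This is the generic technique, not the paper's procedure, which may differ in its details. The function names, the error values, and the clearance numbers are all hypothetical.

```python
import math

def conformal_quantile(residuals, alpha):
    """Split conformal bound: the ceil((n + 1) * (1 - alpha))-th smallest
    calibration residual. Under exchangeability, a new residual falls at or
    below this value with probability at least 1 - alpha."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration samples for this coverage level
    return sorted(residuals)[k - 1]

def is_safe(predicted_clearance, error_bound):
    """Accept a planned state only if the predicted clearance to the nearest
    obstacle still exceeds the calibrated prediction error."""
    return predicted_clearance - error_bound > 0.0

# Hypothetical calibration set: distance in metres between where a simulator
# predicted a cloth point would be and where it was actually observed.
calibration_errors = [0.004, 0.007, 0.003, 0.010, 0.006,
                      0.005, 0.008, 0.002, 0.009]

bound = conformal_quantile(calibration_errors, alpha=0.2)

print(f"calibrated error bound: {bound} m")
print("2 cm clearance safe?  ", is_safe(0.020, bound))
print("5 mm clearance safe?  ", is_safe(0.005, bound))
```

The design point is that the bound comes from measured prediction errors, not from an assumed noise model, so the planner's safety margin grows or shrinks with how wrong the simulator has actually been.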
Millisecond-speed planning is the headline number here, and it's a meaningful one. Previous approaches to this kind of problem often required planning times that made real-time control impractical. Whether those millisecond speeds hold up outside of the specific hardware and simulation configurations they tested is, honestly, unclear. I only found the one source on the implementation details, and the paper is brand new.
The second paper, describing a framework called Perceive-Simulate-Imitate (or PSI, which is a better acronym than most), attacks a different piece of the puzzle. The question it's asking is: can a robot learn manipulation skills by watching videos of humans doing tasks, without needing any robot-collected data at all? The answer is sort of yes, with an important asterisk.
Human videos are actually pretty good for teaching the post-grasp part of a manipulation task, the bit where you've already got the object in your hand and you're doing something with it. They're much less useful for teaching grasping itself, because robot hands don't look like human hands, and the geometry is just different enough that you can't copy directly. PSI's solution is modular: use a dedicated grasp generator for the grasping part, and use the human video data for the downstream motion. The clever bit is a simulation filtering step that labels which grasps are actually compatible with the task the robot needs to perform afterward. A stable grasp isn't always a useful grasp, and that distinction matters a lot in practice.
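The difference between a stable grasp and a useful one is easier to see in a toy example. The sketch below is my own illustration of the filtering idea, not the PSI implementation: the wrist-limit check stands in for a real physics simulation, and every name and number in it is hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grasp:
    name: str
    wrist_angle_deg: float  # wrist angle at the moment of grasping
    stable: bool            # verdict from a (hypothetical) grasp generator

WRIST_LIMIT_DEG = 90.0  # assumed symmetric joint limit, for illustration only

def simulate(grasp, object_rotations_deg):
    """Toy stand-in for a simulated rollout: the grasp passes if the wrist
    stays within its joint limit at every waypoint of the post-grasp motion."""
    return all(abs(grasp.wrist_angle_deg + rotation) <= WRIST_LIMIT_DEG
               for rotation in object_rotations_deg)

def label_grasps(grasps, object_rotations_deg):
    """Label each candidate: it must be stable AND survive the rollout."""
    return {g.name: g.stable and simulate(g, object_rotations_deg)
            for g in grasps}

# Hypothetical post-grasp motion, as it might be extracted from a human
# pouring video: how far the object must rotate at each waypoint.
pour_motion = [0.0, 30.0, 60.0, 100.0]

candidates = [
    Grasp("side", -20.0, True),   # stable, and the wrist peaks at 80 degrees
    Grasp("top", 45.0, True),     # stable, but the wrist would hit 145 degrees
    Grasp("edge", -20.0, False),  # rejected by the grasp generator itself
]

labels = label_grasps(candidates, pour_motion)
print(labels)
```

In this toy setup the "top" grasp holds the object perfectly well and is still thrown out, because the robot could not complete the pour from that starting pose.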
Here's my take, for whatever it's worth. Both of these papers are addressing real bottlenecks, not invented ones. The deformable manipulation problem has been a genuine obstacle to deploying robots in environments that involve fabric, cables, or any material that doesn't hold a fixed shape. That covers a huge range of practical applications, from warehouse logistics to surgical robotics to, yes, eventually household tasks.
The learning-from-video angle is interesting because it's trying to solve a data problem. Robot learning is hungry for training data, and collecting that data with actual robots is expensive and slow. If you can bootstrap from the enormous amount of human video that exists in the world, that's potentially a significant shortcut. The PSI paper is careful to note its limitations, specifically that it tested on prehensile manipulation tasks and that the simulation filtering step adds its own complexity. It's too early to say how well this generalizes beyond the specific tasks they evaluated.
What I keep coming back to is the gap between academic benchmarks and real-world deployment. Both papers do include hardware experiments, which is better than pure simulation, but the tasks are still controlled, the environments are still relatively structured, and the objects are still relatively well-behaved examples of deformable things. Real cloth in a real laundry situation is messier, wetter, heavier, and more unpredictable than a lab towel on a clean table. Real cables in a real wiring harness are tangled in ways that no simulator has fully captured yet.
Call me old-fashioned, but I think the honest version of this story is: these are genuinely good results that advance the state of the art in two specific and important subproblems, and we're probably still five to ten years away from seeing them translated into products that work reliably outside a lab. Maybe less if the GPU-parallel simulation approach turns out to scale as well as the authors suggest. Maybe more if the real-world robustness problems turn out to be harder than the benchmarks imply.
Some researchers in this space argue that the simulation-to-real gap is nearly closed for rigid objects and closing fast for deformable ones. Others counter that we've been saying that for a decade and the gap keeps revealing new layers. That disagreement raises at least two questions: whether our benchmarks are actually measuring the right things, and whether the field has the right incentives to be honest about failure cases.
The CORD-SLS paper's GPU-parallel approach is worth watching, because it suggests a path toward scaling this kind of robust control to more complex scenarios without sacrificing the real-time performance that makes it practical. If the approach holds up under more adversarial conditions, and if the conformal prediction calibration works as well on novel objects as it does on the test set, that's a meaningful step.
The PSI framework's value proposition is essentially about data efficiency. If you can train a useful manipulation policy from human video without collecting robot data, you dramatically lower the cost of teaching robots new tasks. The question is how far that scales and how much the simulation filtering step can be automated. Right now it still requires a fair amount of careful engineering.
Both papers are from academic groups, and the companies paying attention to this space (the robot arm manufacturers, the warehouse automation players, the surgical robotics startups) will be looking at whether these techniques can be productized. That's a different problem than publishing a paper, and it's where most good academic robotics research goes to have a long, quiet conversation with reality.
I've been covering tech long enough to know that the papers that actually change the industry aren't always the ones that get the most attention when they drop. Sometimes it's the careful, methodical work on hard subproblems that ends up mattering most, because it removes the bottlenecks that were blocking everything else. Deformable manipulation has been one of those bottlenecks for a long time. Whether these specific approaches are the ones that finally crack it open, I genuinely don't know yet. But the direction is right, and the work is serious. That's more than you can say for a lot of what crosses my desk.