Robots Still Can't Fold Your Laundry, But Three New Papers Are Making the Problem Less Embarrassing

A cluster of new robotics research tackles cloth manipulation, VLA latency, and humanoid locomotion. The results are genuinely interesting, though production-ready is still a ways off.

4 hours ago · 7 min read

Picture a shirt crumpled on a table. A human picks it up, shakes it once, reads its orientation in half a second, and folds it. A robot in 2025 will spend several seconds just figuring out where the fabric ends and the background begins, and that's before it attempts a single grasp. Cloth manipulation has been one of the most stubborn open problems in robotics for years, not because researchers aren't smart, but because deformable objects break nearly every assumption that makes rigid-body manipulation tractable. This week, two separate research groups published work that makes meaningful progress on exactly this problem, and a third paper addresses a different but equally stubborn issue: getting vision-language-action models to stop jittering during real robot execution.

Start with the cloth work. A new arXiv preprint (cs.RO) describes a method called simulator-in-the-loop refinement for cloth manipulation. The core idea is to use a deformable-object physics simulator called FLASH as a real-time backend that evaluates candidate robot trajectories in parallel during inference. The system takes a single RGB image, maps it to a simulation-compatible cloth state, and then plans online using a technique called prior-guided MPPI, where MPPI stands for Model Predictive Path Integral. That last part matters: MPPI lets the system run many trajectory rollouts in parallel, which is exactly what you need when you're reasoning about a shirt that could be in an effectively unlimited number of configurations.
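For readers who haven't met MPPI before, here is a minimal sketch of the general technique, not the paper's implementation. It assumes a toy 2-D point mass in place of the FLASH cloth simulator, and every name in it (`simulate`, `cost`, `mppi_step`, `plan`, the noise scale, the temperature) is hypothetical. The "prior" here is just a nominal action sequence that sampling is centered on; how the paper actually builds and uses its prior is not something this sketch claims to reproduce.

```python
import numpy as np


def simulate(x0, actions, dt=0.1):
    """Toy stand-in for a simulator: integrate velocity commands for a 2-D point.

    actions has shape (K, T, 2); returns positions of shape (K, T, 2).
    A real system would roll out a cloth simulator here instead.
    """
    return x0 + dt * np.cumsum(actions, axis=1)


def cost(trajectories, actions, goal):
    """Terminal distance to the goal plus a small control-effort penalty."""
    terminal = np.linalg.norm(trajectories[:, -1] - goal, axis=-1)
    effort = 1e-3 * np.sum(actions ** 2, axis=(1, 2))
    return terminal + effort


def mppi_step(x0, nominal, goal, rng, num_samples=256, sigma=0.5, temperature=0.05):
    """One MPPI update: sample around the nominal plan, score, and reweight."""
    noise = rng.normal(0.0, sigma, size=(num_samples,) + nominal.shape)
    candidates = nominal[None] + noise               # (K, T, 2) perturbed plans
    costs = cost(simulate(x0, candidates), candidates, goal)
    # Exponential weighting: low-cost rollouts dominate the average.
    weights = np.exp(-(costs - costs.min()) / temperature)
    weights /= weights.sum()
    return np.tensordot(weights, candidates, axes=1)  # weighted mean, shape (T, 2)


def plan(x0, prior_actions, goal, iterations=10, seed=0):
    """Refine a prior action sequence with repeated MPPI updates."""
    rng = np.random.default_rng(seed)
    actions = prior_actions.copy()
    for _ in range(iterations):
        actions = mppi_step(x0, actions, goal, rng)
    return actions


if __name__ == "__main__":
    start = np.zeros(2)
    target = np.array([1.0, -0.5])
    refined = plan(start, np.zeros((20, 2)), target)
    end = simulate(start, refined[None])[0, -1]
    print("distance to goal after planning:", np.linalg.norm(end - target))
```

The point of the sketch is the shape of the loop: every candidate is scored independently, so the rollouts can run in parallel, and a fast simulator is what makes that affordable at inference time.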

The real-to-sim module is trained entirely on synthetic data, which is worth noting. The system has to map a single camera image to a mesh representation that the simulator can actually use, and it does this by fusing pretrained visual features with what the authors call learnable canonical tokens. From my time in hardware, I know that the gap between synthetic training data and real-world sensor input is where a lot of these systems fall apart. The paper reports higher success rates and stronger robustness than baseline methods in real-robot experiments. The abstract doesn't give exact success percentages, so the full numbers require reading the complete paper.
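One plausible way to picture "fusing pretrained visual features with learnable canonical tokens" is cross-attention, where each token stands for a vertex of a canonical cloth mesh and queries the image features for its own 3-D position. The sketch below is a guess at that general pattern, not the authors' architecture: the class `CanonicalTokenDecoder`, its weight matrices, and the dimensions are all hypothetical, the weights are random and untrained, and random numbers stand in for the output of a pretrained vision backbone.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


class CanonicalTokenDecoder:
    """Hypothetical decoder: one token per canonical mesh vertex attends to image features."""

    def __init__(self, num_vertices, feat_dim, rng):
        scale = 1.0 / np.sqrt(feat_dim)
        # In a trained model these would all be learned parameters.
        self.tokens = rng.normal(0.0, scale, (num_vertices, feat_dim))
        self.w_q = rng.normal(0.0, scale, (feat_dim, feat_dim))
        self.w_k = rng.normal(0.0, scale, (feat_dim, feat_dim))
        self.w_v = rng.normal(0.0, scale, (feat_dim, feat_dim))
        self.w_out = rng.normal(0.0, scale, (feat_dim, 3))

    def __call__(self, image_features):
        """image_features: (num_patches, feat_dim), e.g. from a frozen backbone."""
        queries = self.tokens @ self.w_q
        keys = image_features @ self.w_k
        values = image_features @ self.w_v
        attention = softmax(queries @ keys.T / np.sqrt(queries.shape[-1]))
        fused = self.tokens + attention @ values   # residual fusion of token and image info
        vertices = fused @ self.w_out              # (num_vertices, 3) predicted positions
        return vertices, attention


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    decoder = CanonicalTokenDecoder(num_vertices=16, feat_dim=8, rng=rng)
    patch_features = rng.normal(size=(49, 8))      # placeholder for backbone output
    vertices, attention = decoder(patch_features)
    print(vertices.shape, attention.shape)
```

Under this reading, the output always has the same vertex count and ordering regardless of how the cloth is crumpled, which is the kind of fixed-topology state a simulator can consume directly.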

