Robots Still Can't Fold Your Laundry, But Five New Papers Are Making the Problem Less Embarrassing
A cluster of new robotics research tackles cloth manipulation, VLA latency, and humanoid locomotion. The results are genuinely interesting, though production-ready is still a ways off.
·4 hours ago·7 min read
Picture a shirt crumpled on a table. A human picks it up, shakes it once, reads its orientation in half a second, and folds it. A robot in 2025 will spend several seconds just figuring out where the fabric ends and the background begins, and that's before it attempts a single grasp. Cloth manipulation has been one of the most stubborn open problems in robotics for years, not because researchers aren't smart, but because deformable objects break nearly every assumption that makes rigid-body manipulation tractable. This week, two separate research groups published work that makes meaningful progress on exactly this problem, and a third paper addresses a different but equally stubborn issue: getting vision-language-action models to stop jittering during real robot execution.
Start with the cloth work. A preprint posted to arXiv (cs.RO) describes a method called simulator-in-the-loop refinement for cloth manipulation. The core idea is to use a deformable-object simulator called FLASH as a real-time backend that evaluates candidate robot trajectories in parallel during inference. The system takes a single RGB image, maps it to a simulation-compatible cloth state, and then plans online using a technique called prior-guided MPPI, which stands for Model Predictive Path Integral. That last part matters: MPPI samples many trajectory rollouts in parallel and weights them by cost, which is exactly what you need when you're reasoning about a shirt that can take an effectively unlimited number of configurations.
The real-to-sim module is trained entirely on synthetic data, which is worth noting. The system has to map a single camera image to a mesh representation that the simulator can actually use, and it does this by fusing pretrained visual features with what the authors call learnable canonical tokens. From my time in hardware, I know that the gap between synthetic training data and real-world sensor input is where a lot of these systems fall apart. The paper reports higher success rates and stronger robustness than baseline methods in real-robot experiments, though the abstract doesn't give exact success percentages, so the full numbers require reading the complete paper.
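To make the planning loop concrete, here is a minimal sketch of a generic MPPI update in plain Python. It is not the paper's implementation: the one-dimensional `toy_dynamics` stands in for the cloth simulator, the all-zero `prior` stands in for whatever prior guides the search, and every parameter value is an assumption chosen for illustration.

```python
import math
import random


def mppi_step(state, prior_controls, dynamics, cost, rng,
              num_samples=256, noise_std=0.3, temperature=1.0):
    """One MPPI update: perturb the prior plan, roll out, and reweight by cost."""
    perturbations, costs = [], []
    for _ in range(num_samples):
        # Sample Gaussian noise around the prior control sequence.
        eps = [rng.gauss(0.0, noise_std) for _ in prior_controls]
        x, total = state, 0.0
        for u, e in zip(prior_controls, eps):
            x = dynamics(x, u + e)   # in the paper's setting: a simulator rollout
            total += cost(x)
        perturbations.append(eps)
        costs.append(total)

    # Softmin weights: cheaper rollouts get exponentially more influence.
    best = min(costs)
    weights = [math.exp(-(c - best) / temperature) for c in costs]
    norm = sum(weights)

    # New plan = prior plus the cost-weighted average perturbation.
    return [
        u + sum(w * eps[t] for w, eps in zip(weights, perturbations)) / norm
        for t, u in enumerate(prior_controls)
    ]


def toy_dynamics(x, u):
    return x + 0.1 * u           # hypothetical 1-D stand-in for a cloth simulator


def toy_cost(x):
    return (x - 1.0) ** 2        # squared distance to a target state of 1.0


def plan_cost(state, controls):
    total = 0.0
    for u in controls:
        state = toy_dynamics(state, u)
        total += toy_cost(state)
    return total


rng = random.Random(0)
prior = [0.0] * 10               # hypothetical prior plan (e.g. from a learned policy)
plan = prior
for _ in range(5):               # a few refinement iterations
    plan = mppi_step(0.0, plan, toy_dynamics, toy_cost, rng)
```

In this toy setup the refined `plan` should cost less than the do-nothing `prior`. The expensive part in a real system is the inner rollout, which is why a fast, parallel simulator matters so much for this style of planning.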
The second cloth paper, also on arXiv, tackles a more specific but genuinely hard sub-problem: Random-to-Target Fabric Flattening, which the authors abbreviate RTFF. The task is to take a randomly wrinkled piece of fabric and bring it to a specific, user-defined wrinkle-free target pose. That's harder than it sounds. Flattening a fabric tends to shift its position, and repositioning it tends to reintroduce wrinkles. The two objectives are coupled in a way that makes naive approaches fail.
The team's solution is a hybrid policy combining imitation learning with visual servoing. They anchor both the current fabric state and the target state to the same template mesh, which lets the system do direct vertex-level comparison without needing a separate registration step. A component they call the Mesh Action Chunking Transformer, or MACT, handles coarse alignment using a small set of demonstrations, and then visual servoing takes over for precise final convergence. The system runs on a real dual-arm teleoperation platform and generalizes to unseen target poses, different fabric types, and different fabric scales. The code and videos are publicly available at the project page linked in the paper.
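The shared-template idea is easy to illustrate. The sketch below assumes both fabric states are lists of vertex coordinates in the same template order, so error is a direct vertex-by-vertex distance. The threshold-based switch between the two controllers is an assumption made for illustration; the article's source does not say how the handoff is decided, and `switch_threshold` is a made-up value.

```python
import math


def vertex_error(current, target):
    """Mean distance between corresponding vertices of the shared template mesh."""
    return sum(math.dist(c, t) for c, t in zip(current, target)) / len(current)


def choose_controller(current, target, switch_threshold=0.05):
    """Pick the coarse learned policy when far away, visual servoing when close.

    switch_threshold is a hypothetical value in metres, not taken from the paper.
    """
    if vertex_error(current, target) < switch_threshold:
        return "visual_servoing"
    return "coarse_policy"


# Two tiny 2-D "meshes" with the same vertex ordering (illustrative data only).
target = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
far = [(0.5, 0.5), (1.5, 0.5), (1.5, 1.5)]
near = [(0.0, 0.01), (1.0, 0.01), (1.0, 1.01)]
```

Because vertex `i` in the current state and vertex `i` in the target refer to the same point on the fabric, no registration step is needed before computing the error.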
Both cloth papers are interesting, and both represent genuine technical progress. But I've seen enough spec sheets to know that lab success rates and production robustness are different animals entirely. Cloth manipulation in a controlled lab environment, with consistent lighting and a single fabric type, is not the same as handling the variety of materials and states you'd encounter in, say, a hospital laundry facility or a garment factory. The real test is whether these methods hold up when the fabric is wet, or when there's a button on it, or when the lighting changes. That remains unclear from the current publications.
The third paper this week is less about manipulation and more about making VLA models actually usable on real hardware. Vision-language-action models, which use large pretrained vision and language backbones to generate robot actions directly from observations and natural language instructions, have shown impressive generalization in recent years. The problem is that running inference on a large model takes time, often tens to hundreds of milliseconds, and robots need to send commands at high frequency to maintain smooth, stable motion. The mismatch creates what the authors of this paper call handoff discontinuities: the robot is still executing one chunk of actions while the model is computing the next, and when the new chunk arrives, the transition can be jerky or outright wrong.
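The arithmetic behind the mismatch is simple enough to sketch. The numbers below (100 ms inference, a 50 Hz control loop, three joints) are hypothetical, and `handoff_jump` is just one plausible way to quantify a jerky transition, not a metric taken from the paper.

```python
import math


def stale_steps(inference_ms, control_hz):
    """Control ticks that elapse while one inference call is running."""
    return math.ceil(inference_ms * control_hz / 1000)


def handoff_jump(last_executed, first_new):
    """Largest per-joint change between the last old action and the first new one."""
    return max(abs(a - b) for a, b in zip(last_executed, first_new))


# Hypothetical numbers: 100 ms inference, 50 Hz control loop.
print(stale_steps(100, 50))                              # 5
print(handoff_jump([0.25, 0.5, 0.0], [0.25, 0.0, 0.0]))  # 0.5
```

With these numbers the robot executes five commands between the moment the model sees an observation and the moment its new chunk arrives, so the new chunk was planned for a state the robot has already left.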
The paper, also posted to arXiv, proposes a lightweight adapter called Action ControlNet, or ACNet. Rather than retraining the entire VLA model to be delay-aware (which is expensive and architecture-specific), ACNet adds a small trainable module that takes the recently executed motion suffix as a residual condition for the action head. The pretrained backbone stays frozen. The adapter learns to account for the fact that the observation the model sees is already stale by the time the action executes. The authors test on Kinetix, Meta-World MT50, and a real-world platform called SO-ARM101, and report improved robustness under inference delay and smoother trajectories compared to direct chunk stitching.
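As a rough illustration of the frozen-backbone-plus-adapter pattern, here is a deliberately simplified sketch. It is not ACNet's architecture: the adapter here is a plain linear map added to the head's output, the zero initialisation is an assumption borrowed from the general ControlNet idea, and `frozen_action_head` is a hypothetical stand-in for a pretrained model controlling a single joint.

```python
class SuffixResidualAdapter:
    """Adds a learned correction, conditioned on recent motion, to a frozen head's output."""

    def __init__(self, suffix_len, chunk_len):
        # Zero-initialised weights (an assumption): before training, the adapter
        # is a no-op and the frozen model's behaviour is unchanged.
        self.weights = [[0.0] * suffix_len for _ in range(chunk_len)]

    def __call__(self, base_chunk, executed_suffix):
        return [
            base + sum(w * s for w, s in zip(row, executed_suffix))
            for base, row in zip(base_chunk, self.weights)
        ]


def frozen_action_head(observation):
    # Hypothetical stand-in for a pretrained VLA action head (1-DoF, 3-step chunk).
    return [observation, observation + 0.5, observation + 1.0]


adapter = SuffixResidualAdapter(suffix_len=2, chunk_len=3)
base_chunk = frozen_action_head(0.0)
untrained = adapter(base_chunk, executed_suffix=[0.25, 0.5])   # equals base_chunk

adapter.weights[0] = [0.0, 1.0]   # pretend training learned to echo the last action
adjusted = adapter(base_chunk, executed_suffix=[0.25, 0.5])    # first action shifts by 0.5
```

The appeal of this pattern is that only the adapter's weights are trained, so the cost of adding delay awareness does not scale with the size of the backbone.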
This is a pragmatic engineering solution rather than a fundamental architectural fix, and that's not a criticism. Robotics is full of elegant theoretical solutions that never ship because they require too much retraining or too many assumptions about the deployment hardware. ACNet's selling point is that it's compatible with generative action heads including diffusion and flow matching, introduces few trainable parameters, and doesn't touch the backbone. If those claims hold up across a wider range of VLA architectures and tasks, this could be a useful plug-in for teams already deploying VLA-based systems. It's too early to say how broadly it generalizes.
There are also two humanoid locomotion papers worth mentioning briefly. WOLF-VLA, described in another arXiv preprint, is a framework for training VLA models on whole-body humanoid locomotion rather than just manipulation. The core contribution is a dataset of dynamically feasible humanoid trajectories across six locomotion task families, each with variations in environment, object placement, and visual distractors. The model takes joint trajectories, ego-centric visual observations, and natural language instructions as inputs. The team is releasing the dataset, model checkpoints, and a benchmarking simulation suite, which is the right move. Humanoid locomotion research has suffered from a lack of standardized benchmarks, and more open datasets in this space are unambiguously useful.
PhyGile, the fifth paper in this cluster, addresses a related problem: text-to-motion generation for humanoid robots tends to produce motions that look kinematically reasonable but turn out to be physically infeasible when you actually try to run them on hardware. The arXiv paper introduces physics-guided prefixes at inference time, generating robot-native motions directly in a 262-dimensional skeletal space rather than retargeting human motion capture data. The General Motion Tracking controller is trained with a curriculum-based mixture-of-experts scheme and fine-tuned on unlabeled motion data. Real-robot experiments reportedly show stable tracking of agile, complex whole-body motions beyond what prior methods could handle.
Look, the honest summary of all five papers is this: the field is making real, incremental progress on problems that have resisted clean solutions for years. Cloth manipulation is getting more robust. VLA inference latency is being addressed with lightweight, practical adapters. Humanoid locomotion is getting more physically grounded. None of these papers are claiming a solved problem, and none of them should be. What they represent is the kind of steady, methodical work that eventually compounds into systems that actually ship.
The cloth manipulation results are probably the most practically significant in the near term, because the industrial use case (automated garment handling, hospital linen management, e-commerce fulfillment) is large and the current state of the art is genuinely bad. Whether the simulator-in-the-loop approach scales to real production environments is the question. FLASH, the deformable simulator used in the first paper, is described as balancing physical fidelity, numerical stability, and rollout efficiency. That's the right set of tradeoffs to optimize for, but the claim rests on limited published benchmark data, and the real stress test comes when someone tries to run this outside a lab. That's when we'll know if any of this matters.