Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
How do you know an LLM-generated robot plan is actually safe?
This is the question I keep coming back to as foundation models increasingly get handed the keys to physical systems. The pattern is familiar by now: a large language model generates a plan, the robot executes it, and we cross our fingers that nothing goes catastrophically wrong. Two papers released this week on arXiv offer different answers to this verification problem, and I think they're worth examining together because they reveal something important about where the field is heading.
The core tension, to be precise, is this: LLMs are remarkably good at generating plausible-sounding robot plans. They can take natural language instructions and produce sequences of actions that often work. But "often works" is a problematic standard when you're dealing with physical systems that can hurt people or break things. A self-driving car that completes 99% of its trips safely is still a car that crashes, on average, once every hundred trips.
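To put a number on that intuition, here is a minimal sketch. It assumes, for illustration only, that the 99% figure is a per-trip success probability and that trips are independent of one another; neither assumption comes from the papers, and the function name is my own.

```python
def prob_at_least_one_failure(per_trip_success: float, trips: int) -> float:
    """Probability of one or more failures over `trips` independent trips.

    Assumption (illustrative only): every trip succeeds with the same
    probability, independently of all other trips.
    """
    if not 0.0 <= per_trip_success <= 1.0:
        raise ValueError("per_trip_success must be between 0 and 1")
    if trips < 0:
        raise ValueError("trips must be non-negative")
    # P(no failures) = p^n, so P(at least one failure) = 1 - p^n.
    return 1.0 - per_trip_success ** trips


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, round(prob_at_least_one_failure(0.99, n), 3))
```

Under those assumptions, the chance of at least one failure across 100 trips comes out at roughly 63%, which is why a per-run success rate on its own says little about long-run safety.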
The first paper, PerceptTwin, takes what I'd call the empirical verification route. The authors, whose affiliations aren't specified in the abstract, propose building interactive simulations directly from a robot's perception of its environment. The robot sees a room, constructs a semantic map of what's in it, and PerceptTwin automatically generates a simulation where proposed plans can be tested before execution.
The pipeline combines several components: open-vocabulary object maps (so the system can recognize and represent arbitrary objects, not just a predefined set), 3D asset generation to populate the simulation, affordance prediction to determine what actions are possible on each object, and what they call "commonsense condition checking." There's also an LLM judge, borrowed from the AI alignment literature, that evaluates whether plans align with human preferences.
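To make the idea of testing a plan before execution concrete, here is a small sketch of precondition checking over a symbolic world state. This is not PerceptTwin's implementation: the abstract doesn't describe one at this level, and the `Skill` class, the `check_plan` function, and the fridge example are all hypothetical names I've made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    """Hypothetical skill model: facts required before, facts changed after."""
    name: str
    preconditions: frozenset
    add_effects: frozenset = frozenset()
    del_effects: frozenset = frozenset()


def check_plan(initial_facts, plan):
    """Step through the plan symbolically.

    Returns (ok, failed_step_index, missing_facts). A step fails when any of
    its preconditions is absent from the simulated state at that point.
    """
    state = set(initial_facts)
    for index, skill in enumerate(plan):
        missing = skill.preconditions - state
        if missing:
            return False, index, set(missing)
        state -= skill.del_effects
        state |= skill.add_effects
    return True, None, set()


# Illustrative skills and world state (assumptions, not taken from the paper).
open_fridge = Skill(
    name="open_fridge",
    preconditions=frozenset({"at(fridge)"}),
    add_effects=frozenset({"open(fridge)"}),
)
pick_milk = Skill(
    name="pick_milk",
    preconditions=frozenset({"at(fridge)", "open(fridge)", "hand_empty"}),
    add_effects=frozenset({"holding(milk)"}),
    del_effects=frozenset({"hand_empty"}),
)
initial = {"at(fridge)", "hand_empty"}

bad_result = check_plan(initial, [pick_milk])
good_result = check_plan(initial, [open_fridge, pick_milk])
```

Here `bad_result` flags step 0 with `open(fridge)` missing, while `good_result` passes. A real system would derive the facts from perception and run a physics simulation rather than a set of strings, which is exactly where the fidelity questions come in.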
The results are genuinely interesting. PerceptTwin improves plan success by approximately 39% on average across GPT-5, GPT-5-Mini, and GPT-5-Nano planners. It also improves human plan verification by up to 18% for plans that fail due to unmet skill preconditions. These are meaningful improvements.
But here's what I'd want to see next: the paper tests on "a suite of tasks," but the abstract doesn't specify what those tasks are or how diverse they are. A 39% improvement on carefully selected benchmarks might not translate to the messy, unpredictable environments where robots actually operate. The approach also inherits whatever limitations exist in the perception stack and asset generation pipeline. If the robot misidentifies an object, or the simulation doesn't capture some relevant physical property, a passing result in simulation can't be trusted. Simulation fidelity is itself a hard problem, and automating the creation of simulations doesn't solve it.
The second paper, VASO, takes a fundamentally different tack. Rather than testing plans empirically in simulation, VASO uses formal methods to mathematically prove that plans satisfy safety specifications.
The key insight here is that existing skill-evolution approaches (where LLMs learn to generate better robot skills over time) rely on what the authors call "trace-level evidence." A skill worked on this execution, and that execution, and those five executions over there. But that doesn't prove the skill will work on the execution you haven't tested yet. VASO instead represents each skill as a "semantic contract" with both a formal interface (for model checking) and a planner-facing interface (for generating behavior). A model checker verifies that plans satisfy temporal safety specifications, and when verification fails, the counterexample gets translated into feedback that updates the skill.
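The verify-then-repair loop is easier to see in miniature. The sketch below is not VASO's method; it is a toy explicit-state check of a single invariant ("never enter the keep-out cell") over a made-up one-dimensional corridor. The function names, the corridor, and the `max_cell` parameter standing in for a skill contract are all my own assumptions.

```python
from collections import deque


def find_counterexample(initial, successors, is_unsafe):
    """Breadth-first search over reachable states.

    Returns the shortest trace from `initial` to an unsafe state, or None
    if no reachable state violates the invariant.
    """
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if is_unsafe(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


def make_advance_skill(max_cell):
    """Hypothetical skill contract: the robot may advance one cell at a time
    until it reaches `max_cell`."""
    def successors(position):
        return [position + 1] if position < max_cell else []
    return successors


KEEP_OUT = 3  # illustrative hazard location


def in_keep_out(position):
    return position == KEEP_OUT


# The original contract lets the robot reach the keep-out cell.
trace = find_counterexample(0, make_advance_skill(4), in_keep_out)

# "Feedback" step, done by hand here: tighten the contract and re-check.
repaired = find_counterexample(0, make_advance_skill(2), in_keep_out)
```

The first check returns the trace `[0, 1, 2, 3]`, a concrete path to the violation, and the tightened contract returns `None`. In this toy I repair the contract by hand; as I read the abstract, the interesting part of VASO is that the counterexample is translated into feedback that updates the skill automatically.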
The results are striking: 97.2% formal-specification compliance using fewer than 100 optimization samples, tested on Clearpath Jackal ground robots and PX4 quadcopters. This outperforms execution-feedback baselines, prompt optimization, and fine-tuning approaches.
I know I'm being picky here, but the 97.2% figure deserves scrutiny. Formal verification is supposed to give you guarantees, not probabilities. If 2.8% of plans still violate specifications, what's happening there? The abstract doesn't clarify whether this is a limitation of the approach, the specifications themselves, or something else. Also, formal methods have historically struggled with the gap between mathematical models and physical reality. Your proof is only as good as your model, and physical systems have a way of surprising you.
Let me try to be precise about what each paper contributes.
PerceptTwin's novelty is primarily in the automation. Building simulations from perception isn't new. Using simulations to validate plans isn't new. But doing it automatically, without human intervention, in a way that scales to arbitrary environments? That's a meaningful step forward. The LLM judge for alignment checking is also interesting, though I'd want to see more details on how robust it is to adversarial inputs.
VASO's contribution is more fundamental. The idea of using formal verification counterexamples as optimization feedback for skill learning is, to my knowledge, genuinely new. Previous work has either used formal methods to verify one-off plans (useful but doesn't improve the underlying skills) or used execution feedback to improve skills (doesn't provide formal guarantees). Closing this loop is a real contribution.
Key points worth highlighting:
Both papers acknowledge that "it worked in testing" is insufficient evidence for robot safety
PerceptTwin takes an empirical approach (simulate and test), while VASO takes a formal approach (prove correctness mathematically)
Neither approach fully solves the sim-to-real or model-to-reality gap
Both report strong headline numbers, though they measure different things: roughly a 39% improvement in plan success for PerceptTwin, and 97.2% formal-specification compliance for VASO
Neither paper's abstract specifies the full range of tasks tested, which makes it hard to assess generalization
The PerceptTwin paper mentions resistance to "harmful black-box prompting attacks," which suggests security considerations are entering the robot planning literature
What strikes me about these papers appearing in the same week is that they represent two different philosophical approaches to the same problem, and I'm not sure either one is sufficient on its own.
The empirical approach (PerceptTwin) is more flexible. You don't need to formally specify every safety requirement; you just simulate and see what happens. But it can only test the scenarios you think to test. The formal approach (VASO) is more rigorous. If the proof goes through, you have mathematical certainty about the model. But you can only prove properties you can formalize, and physical reality has a way of including properties you didn't think to formalize.
The field seems to be converging on a recognition that verification is the bottleneck. We've gotten remarkably good at generating robot behaviors. What we haven't gotten good at is knowing when to trust them. These papers are both attempts to address that gap, and I expect we'll see many more.
One thing remains unclear: how these approaches will scale to more complex, longer-horizon tasks. VASO's experiments involve ground robots and quadcopters on what appear to be relatively constrained tasks, and PerceptTwin's abstract doesn't say what its task suite contains. Humanoid robots doing household chores, or surgical robots, or robots working alongside humans in factories present verification challenges of a different order. The scale of the evaluations is also hard to judge from the abstracts (VASO explicitly mentions "fewer than 100 optimization samples," which is good for efficiency but tells us little about robustness), and neither result has yet been replicated by independent teams.
I'll be watching to see whether these approaches can be combined. A system that uses formal verification where possible and falls back to simulation-based testing where formal specifications are impractical might get you closer to actual safety guarantees. But that's speculation on my part.
For now, I think the most important takeaway is that the research community is taking verification seriously. For too long, the attitude toward LLM-generated robot plans has been something like "ship it and see what happens." These papers suggest that attitude is changing. Whether the specific approaches they propose will prove sufficient is, well, too early to say. But the fact that we're asking the question seriously is progress.