Two new papers tackle the same problem: how do you trust an LLM to control a robot?

PerceptTwin and VASO take different approaches to verification, but both acknowledge that 'it worked once' isn't good enough for physical systems.

5 June 2026 · 6 min read

How do you know an LLM-generated robot plan is actually safe?

This is the question I keep coming back to as foundation models increasingly get handed the keys to physical systems. The pattern is familiar by now: a large language model generates a plan, the robot executes it, and we cross our fingers that nothing goes catastrophically wrong. Two papers released this week on arXiv offer different answers to this verification problem, and I think they're worth examining together because they reveal something important about where the field is heading.

The core tension is this: LLMs are remarkably good at generating plausible-sounding robot plans. They can take natural language instructions and produce sequences of actions that often work. But "often works" is a problematic standard when you're dealing with physical systems that can hurt people or break things. A self-driving car that works 99% of the time is still a car that crashes, on average, once every hundred trips.
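To make the 99% figure concrete, here is a minimal back-of-the-envelope sketch. It assumes every trip succeeds or fails independently with the same probability, which is a simplification of mine and not something either paper claims; the function name `failure_stats` is hypothetical.

```python
def failure_stats(success_rate: float, trips: int) -> tuple[float, float]:
    """Return (expected number of failures, probability of at least one failure).

    Assumes each trip is an independent trial with the same success rate.
    """
    per_trip_failure = 1.0 - success_rate
    expected_failures = per_trip_failure * trips
    # P(at least one failure) = 1 - P(every trip succeeds)
    p_at_least_one = 1.0 - success_rate ** trips
    return expected_failures, p_at_least_one


expected, p_any = failure_stats(0.99, 100)
print(f"expected failures: {expected:.2f}, P(at least one): {p_any:.3f}")
# prints: expected failures: 1.00, P(at least one): 0.634
```

Under that independence assumption, a hundred trips give one expected failure and roughly a 63% chance of seeing at least one, which is why a high per-run success rate alone says little about safety over a deployment's lifetime.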

The simulation-as-verification approach

The first paper, PerceptTwin, takes what I'd call the empirical verification route. The authors, whose affiliations aren't specified in the abstract, propose building interactive simulations directly from a robot's perception of its environment. The robot sees a room, constructs a semantic map of what's in it, and PerceptTwin automatically generates a simulation where proposed plans can be tested before execution.

The pipeline combines several components: open-vocabulary object maps (so the system can recognize and represent arbitrary objects, not just a predefined set), 3D asset generation to populate the simulation, affordance prediction to determine what actions are possible on each object, and what they call "commonsense condition checking." There's also an LLM judge, borrowed from the AI alignment literature, that evaluates whether plans align with human preferences.
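The idea of testing a plan before execution can be sketched in a few lines. This is not PerceptTwin's implementation: it swaps the physics simulation for a purely symbolic dry run, leaves out the object maps, 3D asset generation, and LLM judge entirely, and every name in it (`SimObject`, `CONDITIONS`, `EFFECTS`, `check_plan`) is a hypothetical stand-in chosen for illustration.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class SimObject:
    name: str
    affordances: set[str]  # actions the object supports, e.g. {"open", "pour"}
    state: dict[str, bool] = field(default_factory=dict)


# Hypothetical commonsense conditions: action -> (state key, required value).
CONDITIONS = {
    "pour": ("open", True),   # cannot pour from a closed container
    "open": ("open", False),  # cannot open something that is already open
}
# Hypothetical effects: action -> (state key, new value).
EFFECTS = {"open": ("open", True), "close": ("open", False)}


def check_plan(world: dict[str, SimObject], plan: list[tuple[str, str]]) -> list[str]:
    """Dry-run (action, object) steps on a copy of the world; return violations."""
    sim = copy.deepcopy(world)  # never mutate the caller's world model
    violations = []
    for step, (action, target) in enumerate(plan):
        obj = sim.get(target)
        if obj is None:
            violations.append(f"step {step}: unknown object '{target}'")
            continue
        if action not in obj.affordances:  # affordance check
            violations.append(f"step {step}: '{target}' does not afford '{action}'")
            continue
        if action in CONDITIONS:  # commonsense condition check
            key, required = CONDITIONS[action]
            if obj.state.get(key, False) != required:
                violations.append(f"step {step}: condition failed for '{action}' on '{target}'")
                continue
        if action in EFFECTS:  # apply the action's effect to the simulated state
            key, value = EFFECTS[action]
            obj.state[key] = value
    return violations


world = {"bottle": SimObject("bottle", {"open", "close", "pour", "pick_up"}, {"open": False})}
bad_result = check_plan(world, [("pick_up", "bottle"), ("pour", "bottle")])
good_result = check_plan(world, [("pick_up", "bottle"), ("open", "bottle"), ("pour", "bottle")])
print(bad_result)   # one violation: pouring from a closed bottle
print(good_result)  # empty list: the plan passes the dry run
```

Working on a deep copy keeps the check side-effect free, so the same perceived world can be reused to compare several candidate plans.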

