The 9% Problem: Why Synthetic Data Pipelines Are Finally Getting Serious About Pedestrian Behaviour
Two new papers tackle the unglamorous but critical challenge of generating useful training data for autonomous vehicles, and the results reveal how far we still have to go.
Nine percent. That's the native pedestrian crossing rate in CARLA, the widely used autonomous driving simulator. If you've ever wondered why your perception model struggles with pedestrian intent prediction in the real world, this number should give you pause.
To be precise, this means that in default CARLA scenarios, only about one in eleven pedestrians actually crosses the road. The rest just stand there or walk parallel to traffic. For researchers trying to train models on crossing prediction (a safety-critical task, obviously), this creates a fundamental data imbalance problem that no amount of clever loss weighting can fully solve.
Two papers released this week on arXiv address different aspects of this synthetic data challenge, and together they paint a picture of a field that's finally taking data generation infrastructure seriously. Neither paper is revolutionary. But both represent the kind of methodical, reproducible work that actually moves the needle.
The first paper, ARCANE-PedSynth, comes from researchers who clearly got frustrated with CARLA's limitations. Their solution is a hybrid AI-manual pedestrian control architecture that can push crossing rates up to 75%, configurable via command-line parameters.
I know I'm being picky here, but the phrase "hybrid AI-manual" deserves unpacking. What they've actually built is a 12-state behavioural finite state machine with five character archetypes (the paper doesn't specify what these archetypes are, which is a gap in the abstract). The "AI" component handles navigation and collision avoidance, while the "manual" component allows researchers to trigger specific behaviours at specific moments. It's less sophisticated than it sounds, but arguably more useful for that reason.
The framework generates synchronised RGB, LiDAR, and DVS (dynamic vision sensor) data with per-frame crossing labels, behavioural states, and estimated 2D pose keypoints. They've released an example dataset called PedSynth++ with 533 multi-pedestrian clips across 12 weather conditions.
533 clips is, frankly, not a lot. The paper positions this as a demonstration rather than a production dataset, which is fair. But it does raise the question of whether the framework can scale. Docker containerisation and CLI parameterisation are nice for reproducibility, but they don't tell us anything about computational cost or generation time per clip.
What's genuinely new here is the behavioural annotation granularity. Most synthetic datasets give you a binary crossing/not-crossing label. ARCANE-PedSynth provides the intermediate behavioural states: approaching, hesitating, committing, crossing, completing. This is useful for intent prediction models that need to catch the hesitation phase, which is often where real-world models fail.
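To make the idea of per-frame behavioural states concrete, here is a minimal sketch of a probabilistic finite state machine over the five states named above. It is not the paper's implementation: the paper describes a 12-state machine, and the transition table, the probabilities, the `simulate` function, and the rule that only the CROSSING state counts as a positive crossing label are all my own assumptions for illustration.

```python
import random
from enum import Enum, auto


class PedState(Enum):
    APPROACHING = auto()
    HESITATING = auto()
    COMMITTING = auto()
    CROSSING = auto()
    COMPLETING = auto()


# Hypothetical transition table: state -> [(next_state, probability), ...].
# The probabilities are made up for illustration and sum to 1 per row.
TRANSITIONS = {
    PedState.APPROACHING: [(PedState.APPROACHING, 0.6), (PedState.HESITATING, 0.4)],
    PedState.HESITATING: [
        (PedState.HESITATING, 0.5),
        (PedState.COMMITTING, 0.3),
        (PedState.APPROACHING, 0.2),  # pedestrian backs off
    ],
    PedState.COMMITTING: [(PedState.CROSSING, 1.0)],
    PedState.CROSSING: [(PedState.CROSSING, 0.7), (PedState.COMPLETING, 0.3)],
    PedState.COMPLETING: [(PedState.COMPLETING, 1.0)],
}


def step(state, rng):
    """Sample the next state from the current state's transition row."""
    next_states, probs = zip(*TRANSITIONS[state])
    return rng.choices(next_states, weights=probs, k=1)[0]


def simulate(num_frames, seed=0):
    """Return one label dict per frame: frame index, state name, crossing flag."""
    rng = random.Random(seed)
    state = PedState.APPROACHING
    labels = []
    for frame in range(num_frames):
        labels.append(
            {
                "frame": frame,
                "state": state.name,
                # Assumption: only CROSSING counts as a positive crossing label.
                "crossing": state is PedState.CROSSING,
            }
        )
        state = step(state, rng)
    return labels


if __name__ == "__main__":
    for label in simulate(10):
        print(label)
```

The point of the sketch is the label format: an intent model trained on records like these can be supervised on the HESITATING and COMMITTING frames directly, instead of only on the binary flag.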
The second paper, from a different research group, tackles a related but distinct problem: what do you do when you have a rich multi-sensor dataset but only bounding box annotations?
The Zenseact Open Dataset (ZOD) is one of those frustrating resources. Great sensor coverage, good diversity, but no pixel-level segmentation labels. The researchers built a SAM-based pipeline to convert bounding boxes into semantic masks, processed over 100,000 frames, and manually curated a 2,300-frame subset.
The reported 36% acceptance rate for the SAM-generated annotations is telling. It means nearly two-thirds of them weren't good enough for the researchers' own standards. The paper doesn't detail the failure modes, which would have been valuable. Was SAM struggling with occlusion? Weather conditions? Specific object classes?
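The box-to-mask step can be sketched roughly as follows. This is a guess at the general shape of such a pipeline, not the authors' code: `boxes_to_semantic_mask`, `make_box_fill_stub`, the half-open box convention, and the "paint larger boxes first" ordering are all assumptions of mine. The `predict_mask` callable is a hypothetical stand-in for a box-prompted SAM call, so the example runs without model weights.

```python
import numpy as np


def box_area(box):
    """Area of an (x0, y0, x1, y1) box; x1 and y1 are treated as exclusive."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def boxes_to_semantic_mask(image_shape, boxes, class_ids, predict_mask):
    """Combine per-box binary masks into one semantic label map.

    image_shape:  (height, width)
    boxes:        list of (x0, y0, x1, y1) integer pixel boxes
    class_ids:    one positive integer per box; 0 is reserved for background
    predict_mask: callable(box) -> boolean array of shape image_shape
                  (hypothetical stand-in for a box-prompted segmentation model)
    """
    height, width = image_shape
    semantic = np.zeros((height, width), dtype=np.uint8)

    # Assumption: paint larger boxes first so smaller objects end up on top.
    order = sorted(range(len(boxes)), key=lambda i: -box_area(boxes[i]))
    for i in order:
        x0, y0, x1, y1 = boxes[i]
        mask = np.asarray(predict_mask(boxes[i]), dtype=bool)

        # Discard any mask pixels that leak outside the prompting box.
        inside = np.zeros((height, width), dtype=bool)
        inside[y0:y1, x0:x1] = True
        semantic[mask & inside] = class_ids[i]
    return semantic


def make_box_fill_stub(image_shape):
    """Stub predictor that fills the whole box, for testing without a model."""
    def predict(box):
        x0, y0, x1, y1 = box
        mask = np.zeros(image_shape, dtype=bool)
        mask[y0:y1, x0:x1] = True
        return mask
    return predict


if __name__ == "__main__":
    stub = make_box_fill_stub((10, 10))
    # Class 1 for the large box, class 2 for the small box nested inside it.
    print(boxes_to_semantic_mask((10, 10), [(0, 0, 8, 8), (2, 2, 4, 4)], [1, 2], stub))
```

Even in this toy form, the obvious failure points are visible: overlapping boxes force an ordering decision, and a mask that only partly covers its object goes through unflagged. That is presumably where manual curation comes in.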
The segmentation results themselves are... okay. 48.1% mIoU with their best model (CLFT-Hybrid) across diverse weather conditions. For context, state-of-the-art on Cityscapes is above 85% mIoU, though that's not a fair comparison given the different domains and conditions. The 77.5% mIoU on their Iseauto platform validation is more impressive, but that's a controlled environment with presumably cleaner data.
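For readers less familiar with the metric, mIoU averages the per-class intersection-over-union between predicted and ground-truth label maps. The sketch below is a generic per-image version, not the paper's evaluation code. The function name and the convention of skipping classes that appear in neither map are my assumptions, and benchmarks typically accumulate intersections and unions over the whole evaluation set rather than per image.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    pred = np.asarray(pred)
    target = np.asarray(target)
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            # Class absent from both maps: skip it rather than count it as 0 or 1.
            continue
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious)) if ious else float("nan")


if __name__ == "__main__":
    target = np.array([[0, 0], [1, 1]])
    pred = np.array([[0, 1], [1, 1]])
    # Class 0: 1 shared pixel / 2 in the union; class 1: 2 shared / 3 in the union.
    print(mean_iou(pred, target, num_classes=2))
```

Because every class counts equally in the average, a model that nails roads and sky but misses most pedestrians still gets dragged down hard, which is part of why a number like 48.1% is difficult to interpret without the per-class breakdown.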
Here's where both papers converge on an uncomfortable truth: the stuff that matters most for safety (pedestrians, cyclists, small signs) is vanishingly rare in pixel terms.
The SAM paper reports that pedestrians, cyclists, and signs constitute less than 1% of pixels in their dataset. Less than one percent. You can throw all the focal loss and oversampling you want at this problem, but fundamentally you're asking a model to care deeply about something it almost never sees.
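Since focal loss comes up here, a quick reminder of what it actually does. This is the standard binary form from the literature, FL = -alpha_t * (1 - p_t)^gamma * log(p_t), written in NumPy as a sketch. It is not taken from either paper, and the `gamma` and `alpha` defaults are the commonly quoted ones rather than anything these authors report.

```python
import numpy as np


def binary_focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Per-element binary focal loss.

    probs:   predicted probability of the positive (rare) class
    targets: 1 for positive pixels, 0 for negative pixels
    gamma:   focusing parameter; 0 reduces to alpha-weighted cross-entropy
    alpha:   weight on the positive class (1 - alpha on the negative class)
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    targets = np.asarray(targets, dtype=float)
    # p_t is the probability the model assigned to the true class.
    p_t = np.where(targets == 1, probs, 1.0 - probs)
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)


if __name__ == "__main__":
    # A confidently correct background pixel vs. a badly missed pedestrian pixel.
    print(binary_focal_loss([0.05, 0.10], [0, 1]))
```

The modulating factor shrinks the loss on pixels the model already gets right, so the abundant easy background contributes less to the gradient. It does not create new pedestrian pixels, which is the underlying complaint.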
The ARCANE-PedSynth approach of artificially boosting crossing rates is one solution, but it introduces its own distribution shift. If your synthetic data has 75% crossing pedestrians and your real deployment environment has 5%, you've created a different kind of mismatch.
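To put a number on that mismatch, one textbook remedy is to rescale a classifier's output for the change in class prior. Neither paper proposes this; it is a generic label-shift correction, and it assumes that only the crossing rate changes between training and deployment while the appearance of crossing and non-crossing pedestrians stays the same, which is a strong assumption for synthetic-to-real transfer. The function name and the 75% and 5% figures (borrowed from the paragraph above) are purely illustrative.

```python
def adjust_for_prior_shift(p_train, train_prior, deploy_prior):
    """Rescale a predicted crossing probability for a different class prior.

    p_train:      model output under the training distribution
    train_prior:  fraction of crossing pedestrians in the training data
    deploy_prior: assumed fraction of crossing pedestrians at deployment
    """
    # Reweight each class's posterior by the ratio of new prior to old prior,
    # then renormalise so the two classes sum to one again.
    pos = p_train * deploy_prior / train_prior
    neg = (1.0 - p_train) * (1.0 - deploy_prior) / (1.0 - train_prior)
    return pos / (pos + neg)


if __name__ == "__main__":
    for p in (0.75, 0.90, 0.99):
        print(p, round(adjust_for_prior_shift(p, train_prior=0.75, deploy_prior=0.05), 3))
```

Under these assumptions, a model output of 0.75 (no more informative than the training prior) maps back to 0.05, and a fairly confident 0.90 drops to roughly 0.14. The correction is cheap, but it only addresses the base rate, not any difference in how synthetic and real pedestrians look or move.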
It's worth noting that neither paper addresses this tension directly. ARCANE-PedSynth focuses on behavioural diversity within the crossing class. The SAM paper explores "specialized models targeting rare classes" but doesn't report results for these in the abstract. This is the kind of methodological gap that makes me want to see the full papers before drawing strong conclusions.
Both papers are doing necessary infrastructure work. Reproducible data generation pipelines and annotation tools are the plumbing of machine learning research. Not glamorous, but essential.
That said, several questions remain open:
First, how do models trained on ARCANE-PedSynth's behaviourally diverse synthetic data actually perform on real-world crossing prediction benchmarks? The paper demonstrates the framework but doesn't close the loop on downstream task performance.
Second, what's the ceiling on SAM-based annotation quality? The 36% acceptance rate suggests significant room for improvement, but is that a SAM limitation, a bounding box quality issue, or something else entirely?
Third (and this is the big one), how do these synthetic and semi-automated annotation approaches compare to smaller amounts of high-quality human annotation? There's an implicit assumption in both papers that more data, even if noisier, is better. That assumption hasn't been rigorously tested in this domain.
The transfer learning results in the SAM paper (bidirectional transfer between sensor configurations) are actually the most interesting finding, though they're somewhat buried. If SAM-derived representations genuinely transfer across different camera setups, that has implications for fleet-wide model deployment that go beyond the annotation use case.
I've been critical of the robotics and AV research community's tendency to chase flashy demos over reproducible infrastructure. These two papers are the opposite of that tendency, and I mean that as a compliment.
Neither ARCANE-PedSynth nor the SAM annotation pipeline will appear in any "breakthrough" headlines. They're tools, not solutions. But they're the kind of tools that let other researchers do better work, and that compounds over time.
The 9% crossing rate problem has been known for years. Someone finally built a configurable fix and released the code. That's how progress actually happens in this field, one unglamorous commit at a time.
Whether these specific tools see adoption depends on factors the papers can't control: documentation quality, community engagement, maintenance over time. The Docker containerisation is a good sign. The CLI parameterisation is a good sign. But I've seen too many "fully reproducible" research codebases rot within months of publication to get excited prematurely.
For now, both frameworks are worth bookmarking if you work on pedestrian perception or AV segmentation. The sample sizes are small, the evaluations are preliminary, and the hard questions about synthetic-to-real transfer remain unanswered. But the infrastructure is there, which is more than we could say last week.