Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
You know that moment when you're driving and a kid runs into the street? Your brain processes that in maybe 150 milliseconds. Now imagine your car's perception system taking hundreds of seconds to figure out what it's looking at.
That's not a hypothetical. That's been the actual state of some cutting-edge 3D scene understanding systems. And two new papers dropped this week that are trying to fix this in very different ways.
Here's what I initially thought when I started digging into this: surely we've solved basic perception by now? Autonomous vehicles have been in development for over a decade. Robots are doing warehouse work. This should be figured out.
But after reading through these papers, I think I was conflating 'works in demos' with 'works in the real world on hardware you can actually afford.'
The first paper, LiteViLNet, tackles road segmentation. Which sounds boring until you realize that most state-of-the-art methods use massive transformer architectures that are basically impossible to run on embedded hardware. We're talking about systems that need to go on actual robots and cars, not datacenter GPUs.
The numbers here are kind of wild. LiteViLNet hits 96.36% MaxF score on the KITTI Road benchmark with only 14.04 million parameters. For context, that's competitive with much larger transformer models while running at 163.79 FPS on a consumer GPU (RTX 4060 Ti). On a Jetson Orin NX, which is what you'd actually put in a robot, it still manages 22.18 FPS.
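If MaxF is unfamiliar: as I understand it, it is the best F1 score you can get by sweeping the confidence threshold that turns per-pixel road probabilities into a road/not-road decision. Here is a minimal sketch of that idea. The function name, the evenly spaced threshold sweep, and the toy data are my own assumptions, and this does not reproduce the benchmark's actual evaluation tooling.

```python
def max_f_score(probs, labels, num_thresholds=100):
    """Return (best F1, threshold) over evenly spaced confidence thresholds.

    probs  -- predicted road probability per pixel
    labels -- ground truth per pixel, 1 = road, 0 = not road
    """
    best_f, best_t = 0.0, 0.0
    for i in range(num_thresholds + 1):
        t = i / num_thresholds
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
        if tp == 0:
            continue  # precision and recall are both zero or undefined here
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f = 2 * precision * recall / (precision + recall)
        if f > best_f:
            best_f, best_t = f, t
    return best_f, best_t


# Toy per-pixel probabilities and ground truth (made-up values).
probs = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]
best_f, best_t = max_f_score(probs, labels)
print(f"MaxF = {best_f:.4f} at threshold {best_t:.2f}")
```

The point of taking the maximum is that the score doesn't depend on whichever threshold the authors happened to pick, which makes cross-paper comparisons a bit fairer.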
Is 22 FPS enough? Honestly, I'm not sure. It depends heavily on the application. For a warehouse robot moving at walking speed, probably fine. For highway driving, you might want more headroom.
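One way to make the "is it enough" question concrete is to convert frame rate into the distance a vehicle covers between two consecutive frames. The sketch below does that arithmetic. The walking and highway speeds are my own assumed values, not numbers from the paper, and it only accounts for segmentation throughput, not the rest of the perception and control pipeline.

```python
def frame_budget(fps, speed_m_per_s):
    """Return (seconds per frame, metres travelled between frames)."""
    period_s = 1.0 / fps
    return period_s, speed_m_per_s * period_s


JETSON_FPS = 22.18   # reported Jetson Orin NX throughput
WALKING_M_S = 1.4    # assumption: typical walking pace
HIGHWAY_M_S = 30.0   # assumption: roughly 108 km/h

for name, speed in [("warehouse robot", WALKING_M_S), ("highway car", HIGHWAY_M_S)]:
    period_s, gap_m = frame_budget(JETSON_FPS, speed)
    print(f"{name}: {period_s * 1000:.1f} ms per frame, {gap_m:.2f} m between frames")
```

Under those assumptions, each frame takes about 45 ms. A walking-speed robot moves around 6 cm in that time, while a car at highway speed moves about 1.35 m, which is why the same frame rate feels comfortable in one setting and tight in the other.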
The second paper is tackling something harder: open-vocabulary 3D instance segmentation. SpaCeFormer is trying to let robots understand arbitrary objects in 3D space, not just pre-defined categories.
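For readers new to the term, "open-vocabulary" usually means an object is labelled by comparing its embedding against embeddings of free-form text, rather than by picking from a fixed class list. The sketch below shows that generic idea with made-up three-dimensional vectors. It is not SpaCeFormer's actual architecture; the function names and embeddings are purely illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def label_instance(instance_embedding, text_embeddings):
    """Pick the text query whose embedding is most similar to the instance."""
    return max(
        text_embeddings,
        key=lambda name: cosine(instance_embedding, text_embeddings[name]),
    )


# Hypothetical text embeddings; a real system would get these from a
# vision-language model's text encoder.
text_embeddings = {
    "office chair": [0.9, 0.1, 0.0],
    "fire extinguisher": [0.0, 0.2, 0.9],
    "potted plant": [0.1, 0.9, 0.1],
}
instance_embedding = [0.1, 0.3, 0.8]  # hypothetical embedding of one 3D instance

label = label_instance(instance_embedding, text_embeddings)
print(label)
```

Because the candidate labels are just a dictionary of text embeddings, adding a new object type means adding a new text query, not retraining the model.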
The speed difference here is staggering. Prior methods using 2D foundation models aggregated into 3D take hundreds of seconds per scene. SpaCeFormer does it in 0.12 to 0.30 seconds. That's two to three orders of magnitude faster.
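A quick sanity check on that claim. The article only says "hundreds of seconds" for prior methods, so the 100 to 900 second range below is my own reading of that phrase, not a number from the paper.

```python
import math


def orders_of_magnitude(slow_s, fast_s):
    """Base-10 log of the speedup from slow_s to fast_s."""
    return math.log10(slow_s / fast_s)


SPACEFORMER_S = (0.12, 0.30)   # reported per-scene runtime range
PRIOR_S = (100.0, 900.0)       # assumption: "hundreds of seconds"

# Most conservative pairing: fastest prior run vs slowest SpaCeFormer run.
smallest = orders_of_magnitude(PRIOR_S[0], SPACEFORMER_S[1])
# Most generous pairing: slowest prior run vs fastest SpaCeFormer run.
largest = orders_of_magnitude(PRIOR_S[1], SPACEFORMER_S[0])

print(f"speedup spans {smallest:.2f} to {largest:.2f} orders of magnitude")
```

Under that reading the gap lands between roughly 2.5 and 3.9 orders of magnitude, so "two to three" is, if anything, on the cautious side.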
You might be wondering why anyone was using methods that took hundreds of seconds in the first place. The answer is accuracy. Those slow multi-stage pipelines were getting better results by leveraging powerful 2D vision-language models. SpaCeFormer is trying to get similar quality without the computational murder.
Key technical approaches worth noting:
LiteViLNet uses a dual-stream encoder that processes RGB and LiDAR data separately but efficiently, with depth-wise separable convolutions keeping the parameter count low
LiteViLNet's Multi-Scale Feature Fusion Module handles cross-modal interaction at different resolution levels
SpaCeFormer skips the whole 'generate proposals then classify' pipeline entirely, predicting instance masks directly from learned queries
The SpaCeFormer team built a new dataset (SpaCeFormer-3M) with 3 million multi-view captions across 604,000 instances, which addresses a data problem that's been limiting the field
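To see why depth-wise separable convolutions help with parameter count, it's enough to count weights. The layer sizes below are hypothetical and not taken from LiteViLNet; the sketch just compares a standard convolution against a depth-wise convolution followed by a 1x1 point-wise convolution, ignoring biases.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depth-wise k x k conv plus a 1 x 1 point-wise conv."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


C_IN, C_OUT, K = 128, 256, 3   # hypothetical layer shape

standard = conv_params(C_IN, C_OUT, K)
separable = depthwise_separable_params(C_IN, C_OUT, K)
print(f"standard:  {standard:,} weights")
print(f"separable: {separable:,} weights")
print(f"reduction: {standard / separable:.1f}x")
```

For this example layer the separable version needs 33,920 weights instead of 294,912, about 8.7x fewer. Stack enough layers like that and a 14-million-parameter budget starts to look plausible.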
The SpaCeFormer dataset point is interesting. They claim 21x higher mask recall than prior single-view pipelines (54.3% vs 2.5% at IoU greater than 0.5). That's a massive gap, and it suggests previous work was training on pretty fragmented data.
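For intuition on what "mask recall at IoU > 0.5" measures, here is a small sketch using one common definition: a ground-truth instance counts as recalled if some predicted mask overlaps it with intersection-over-union above the threshold. Masks are represented as sets of point indices, and the toy data is made up. The paper's exact matching protocol may differ.

```python
def iou(a, b):
    """Intersection over union of two masks given as sets of point indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def mask_recall(gt_masks, pred_masks, iou_threshold=0.5):
    """Fraction of ground-truth masks matched by a prediction above the threshold."""
    if not gt_masks:
        return 0.0
    hits = sum(
        1 for gt in gt_masks
        if any(iou(gt, pred) > iou_threshold for pred in pred_masks)
    )
    return hits / len(gt_masks)


# Three ground-truth instances in a toy scene.
gt_masks = [{0, 1, 2, 3}, {10, 11, 12, 13}, {20, 21}]

# Fragmented masks: each object is seen only in pieces.
fragmented = [{0, 1}, {2}, {10}, {20, 21}]
# More complete masks, as if fused across several views.
complete = [{0, 1, 2, 3}, {10, 11, 12}, {20, 21, 22}]

print(f"fragmented recall: {mask_recall(gt_masks, fragmented):.2f}")
print(f"complete recall:   {mask_recall(gt_masks, complete):.2f}")
```

In the toy scene the fragmented masks recover only one of three objects, while the more complete ones recover all three. That is the kind of effect a low recall number hints at: masks that each cover a sliver of the object never clear the IoU bar.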
I should note: I haven't verified these numbers independently, and both papers are self-reporting their results. The KITTI results for LiteViLNet are on a well-established benchmark, which gives me more confidence. SpaCeFormer's comparisons are a bit harder to evaluate since the paper introduces new evaluation protocols alongside the method.
What this means for actual robots
Tbh, I think these papers represent two different bets on the future.
LiteViLNet is betting that specialized, efficient networks for specific tasks (road segmentation) will remain important even as foundation models get bigger. They're optimizing for deployment on real hardware today.
SpaCeFormer is betting that open-vocabulary understanding (robots that can recognize arbitrary objects without retraining) is worth pursuing, but that it needs to be much faster to be practical. They're trying to make tomorrow's capability deployable.
Neither paper solves the full perception stack, obviously. Road segmentation is one piece of autonomous driving. 3D instance segmentation is one piece of robot scene understanding. But both are attacking the same fundamental tension: the methods that work best in research papers often can't run on actual robots.
There's something else here that remains unclear to me. Both papers benchmark on established datasets (KITTI, ScanNet, Replica), but real-world performance can diverge significantly from benchmark results. Think weather conditions, unusual lighting, or objects that don't appear in the training data. The papers don't really address this, and I only found limited discussion of failure cases.
The LiteViLNet team does mention 'real-world applications' in their abstract, but the details are thin. SpaCeFormer's zero-shot mAP of 11.1 on ScanNet200 sounds impressive (2.8x better than prior proposal-free methods), but I'd want to see more analysis of where it breaks down.
The bigger picture
I keep coming back to that 22 FPS number for LiteViLNet on embedded hardware. That's the kind of constraint that separates research from products. You can have the most elegant architecture in the world, but if it can't run on a $500 compute module at useful speeds, it's not going to end up in robots people actually buy.
The SpaCeFormer speed improvements are more dramatic in absolute terms (hundreds of seconds to fractions of a second), but they're also starting from a much worse baseline. Getting open-vocabulary 3D understanding to work in real-time on edge devices is still probably years away, if the trajectory of 2D vision-language models is any guide.
What I find encouraging is that both teams are explicitly optimizing for deployment constraints rather than just chasing benchmark numbers. That's a shift I've been noticing in robotics ML papers over the past year or so. Maybe the field is finally getting serious about the gap between research and reality.
Or maybe I'm reading too much into two papers. It's too early to say whether these specific approaches will matter in the long run. But the problem they're both attacking, getting fast enough perception onto cheap enough hardware, definitely matters.