Two New Papers Show Why LiDAR and Cameras Are Better Together Than Apart
SurfFill and CoMo3R-SLAM take opposite approaches to the same problem, and both reveal something important about where 3D reconstruction is actually headed.
Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
The debate over LiDAR versus cameras in robotics has always struck me as somewhat misguided. It is not a competition. It never was. Two papers that appeared on arXiv this month make this point with unusual clarity, and they do so by approaching the sensor fusion problem from completely opposite directions. SurfFill asks: what if we used cameras to fix what LiDAR gets wrong? CoMo3R-SLAM asks: what if we used learned 3D priors to make cameras work like LiDAR never existed? Both succeed, which tells us something important about where this field is actually heading.
To be precise, SurfFill (arXiv:2512.03010) addresses a problem that anyone who has worked with LiDAR point clouds knows intimately but rarely discusses in polite company. LiDAR is supposed to be the gold standard in active 3D reconstruction. And it is, mostly. Flat surfaces, large structures, cooperative materials: LiDAR handles these beautifully. But thin structures? Edges? Dark, absorbent materials? This is where things get embarrassing. The authors identify LiDAR beam divergence as the main culprit for artifacts at thin structures and edges, which is a more specific diagnosis than I usually see. Most papers wave vaguely at "occlusion" and "material properties" without pinning down the mechanism.
The approach here is genuinely clever, though I should note that the underlying technique (Gaussian splatting for 3D reconstruction) is not new. What is new is the application to LiDAR completion specifically, and the ambiguity heuristic they introduce. They evaluate changes in point cloud density to identify regions close to missed areas, then use Gaussian surfel reconstruction to grow additional points in those ambiguous zones. The divide-and-conquer scheme for building-scale completion is a practical addition that suggests the authors actually want this to work in the real world, not just on benchmark datasets.
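To make the density idea concrete, here is a toy sketch. It is not SurfFill's actual heuristic, which the paper defines for itself. It is a minimal stand-in, assuming a 2D occupancy grid over a planar patch, that flags occupied cells sitting next to a sharp drop in point count. The function name, the cell size, and the drop threshold are all hypothetical.

```python
import numpy as np

def ambiguous_cells(points_xy, cell=1.0, drop=0.5):
    """Flag occupied grid cells that border a sharp density drop.

    Toy stand-in for a density-change heuristic. `cell` (grid size) and
    `drop` (relative density threshold) are hypothetical parameters.
    """
    idx = np.floor(points_xy / cell).astype(int)
    idx -= idx.min(axis=0)                      # shift indices to start at 0
    shape = tuple(int(n) for n in idx.max(axis=0) + 1)

    # Count points per cell.
    counts = np.zeros(shape, dtype=int)
    np.add.at(counts, (idx[:, 0], idx[:, 1]), 1)

    # Edge padding: cells on the border of the scan see a copy of
    # themselves outside the grid, so the border itself is not flagged.
    padded = np.pad(counts, 1, mode="edge")
    flagged = np.zeros(shape, dtype=bool)
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        neighbor = padded[1 + dx : 1 + dx + shape[0],
                          1 + dy : 1 + dy + shape[1]]
        flagged |= (counts > 0) & (neighbor < drop * counts)
    return counts, flagged

# A 6 x 6 patch with four points per cell, minus one missed strip (x in [3, 4)).
pts = np.array([(i + a, j + b)
                for i in range(6) for j in range(6) if i != 3
                for a in (0.25, 0.75) for b in (0.25, 0.75)])
counts, flagged = ambiguous_cells(pts)
print(np.argwhere(flagged.any(axis=1)).ravel())  # rows of cells next to the gap
```

In a real pipeline the flagged cells would only mark where to look. Growing new points there is the job of the Gaussian surfel reconstruction the paper describes, which this sketch does not attempt.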
CoMo3R-SLAM takes the opposite philosophical position. Where SurfFill says "LiDAR is great, let's patch its weaknesses with cameras," CoMo3R-SLAM says "what if we could skip LiDAR entirely?" This is the first collaborative monocular dense RGB SLAM system designed for outdoor multi-agent mapping, and it achieves this without depth sensors or even parametric camera intrinsics. That last part is worth pausing on. No intrinsics means you could, in theory, throw together a fleet of robots with whatever cheap cameras you have lying around and still get globally consistent metric maps.
The technical contribution centers on learned feed-forward 3D reconstruction priors. Each agent runs a prior-guided front-end for real-time tracking and local dense fusion. A coordinator handles the hard part: dense pointmap matching for cross-agent verification, closed-form Sim(3) gauge synchronization (which addresses the scale ambiguity problem that has plagued monocular SLAM forever), and GPU-accelerated global bundle adjustment with segment-level depth optimization. The authors report matching or exceeding state-of-the-art RGB-D methods on Tanks and Temples and Waymo sequences while running online at 8 FPS.
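For readers who have not met a closed-form Sim(3) solve before, the classic pairwise version is Umeyama's alignment: given matched 3D points from two maps, it recovers scale, rotation, and translation in one SVD. The sketch below assumes the correspondences are already known and noise-free, and it aligns only two frames. The paper's multi-agent gauge synchronization is presumably more involved, and I am not claiming this is their formulation.

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Closed-form s, R, t minimizing sum ||dst_i - (s * R @ src_i + t)||^2.

    src, dst: (N, 3) arrays of matched points. Follows Umeyama (1991).
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d

    cov = xd.T @ xs / len(src)            # 3 x 3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)

    # Guard against a reflection: force det(R) = +1.
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)    # variance of the source points
    s = float((D * np.diag(S)).sum() / var_s)
    t = mu_d - s * R @ mu_s
    return s, R, t

# Hypothetical example: agent B's map is agent A's map, scaled and moved.
rng = np.random.default_rng(0)
map_a = rng.normal(size=(50, 3))
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
map_b = 2.5 * map_a @ R_true.T + np.array([1.0, -2.0, 0.5])

s, R, t = umeyama_sim3(map_a, map_b)
print(round(s, 6))  # recovers the scale of 2.5 on this noise-free data
```

The scale term is the part that matters for monocular systems: a single camera cannot observe absolute scale, so each agent's map lives in its own arbitrary units until something like this ties them together.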
I know I am being picky here, but the 8 FPS figure deserves scrutiny. Real-time in robotics typically means 30 FPS or higher for control loops. 8 FPS is fine for mapping and localization that feeds into a slower planning system, but it is not "real-time" in the sense that a control engineer would use the term. The paper does not claim otherwise, to be fair. I am just noting this because "real-time" has become one of those words that means different things to different communities, and it is worth being precise.
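The arithmetic behind that distinction is simple enough to write down. The snippet below computes the per-frame time budget and how far a platform moves between frames. The 2 m/s speed is a hypothetical figure for illustration, not something from the paper.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to process one frame, in milliseconds."""
    return 1000.0 / fps

def travel_per_frame_m(speed_mps: float, fps: float) -> float:
    """Distance a platform covers between two consecutive frames."""
    return speed_mps / fps

SPEED_MPS = 2.0  # hypothetical ground-robot speed

for fps in (8, 30):
    print(fps, "FPS:",
          round(frame_budget_ms(fps), 1), "ms per frame,",
          round(travel_per_frame_m(SPEED_MPS, fps), 3), "m between frames")
# 8 FPS gives 125.0 ms per frame and 0.25 m of travel at 2 m/s
```

A quarter of a meter between map updates is perfectly workable for mapping. It would be a long time to wait inside a feedback loop.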
What strikes me about these two papers appearing in the same month is how they bracket the solution space. SurfFill is incremental over prior Gaussian splatting work but applies it to a specific, well-defined problem with clear practical value. CoMo3R-SLAM is more ambitious, attempting to solve collaborative monocular SLAM in outdoor environments, a problem that has resisted clean solutions for years. Both succeed on their chosen benchmarks, but the generalization question remains open for both.
The SurfFill evaluation covers synthetic and real-world scenes, which is good, but the paper does not specify how many real-world scenes or what their characteristics were. This is a common gap in reconstruction papers. Synthetic benchmarks are useful for controlled comparisons but often feature idealized geometry and lighting. Real-world performance can degrade in ways that synthetic evaluations miss entirely. The authors claim they "outperform previous reconstruction methods," but without knowing the diversity of their test set, it is hard to assess how robust that claim is.
CoMo3R-SLAM has a different evaluation challenge. They report "best ATE on three of four Tanks and Temples scenes," which is impressive, but Tanks and Temples is a relatively constrained benchmark. Waymo sequences are more representative of real outdoor driving scenarios, but the paper describes their Waymo performance as "competitive" rather than state-of-the-art. This is actually honest reporting, which I appreciate. Too many papers cherry-pick the metrics where they win and bury the ones where they merely match prior work.
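For anyone unfamiliar with the metric, ATE (absolute trajectory error) is usually reported as the root-mean-square distance between estimated and ground-truth camera positions. The sketch below assumes the two trajectories are already time-associated and aligned, which for monocular systems normally means a Sim(3) fit beforehand. How the paper performs that alignment is not something I can confirm from the abstract.

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """RMSE of per-pose position error between two aligned trajectories.

    est_xyz, gt_xyz: (N, 3) arrays of camera positions, same order and frame.
    """
    err = np.linalg.norm(est_xyz - gt_xyz, axis=1)   # per-pose distance
    return float(np.sqrt(np.mean(err ** 2)))

# Hypothetical trajectory: a straight 10 m path, estimate offset by a constant.
gt = np.column_stack([np.linspace(0.0, 10.0, 11), np.zeros(11), np.zeros(11)])
est = gt + np.array([0.3, 0.4, 0.0])                 # every pose is 0.5 m off
print(ate_rmse(est, gt))                             # approximately 0.5
```

Because it is a single number per sequence, ATE says nothing about where along the trajectory the error concentrates, which is worth remembering when comparing "best on three of four scenes" claims.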
The learned priors in CoMo3R-SLAM raise questions about training data distribution. Feed-forward 3D reconstruction networks are only as good as what they have seen during training. Outdoor scenes have enormous variability: weather conditions, lighting, vegetation, architectural styles, road surfaces. The paper does not detail what training data was used for the priors or how the system performs on scenes that differ significantly from that distribution. This is not a criticism unique to this paper; it is a gap in basically every learned reconstruction system I have read about. But it matters for deployment.
Something interesting emerges when you read the two papers together. The sensor fusion question is not really about LiDAR versus cameras anymore. It is about how to combine geometric constraints (from LiDAR, from structure-from-motion, from learned priors) with appearance information (from cameras) in ways that are robust to the failure modes of each. SurfFill uses cameras to fill in where LiDAR fails. CoMo3R-SLAM uses learned priors to provide the geometric grounding that monocular cameras cannot supply on their own. Both are sensor fusion in spirit, even if only one uses multiple physical sensor modalities.
The multi-agent aspect of CoMo3R-SLAM deserves more attention than it is getting in the abstract. Collaborative SLAM is hard. Really hard. You have to handle inter-agent data association, which means figuring out when two robots are looking at the same place from different angles. In outdoor environments with low overlap and repetitive structures, traditional feature matching fails constantly. The paper claims their dense pointmap matching approach handles this, but the details of how well it handles adversarial cases (repetitive building facades, for instance, or large featureless parking lots) remain unclear.
It is worth noting that both papers focus on reconstruction accuracy rather than downstream task performance. This is standard for the field, but it leaves open the question of whether the improvements matter for actual robot behavior. A robot that needs to navigate through a doorway cares about whether it can fit, not about the mean reconstruction error across the entire scene. A manipulation system cares about the geometry of the object it is grasping, not about the walls in the background. The connection between reconstruction metrics and task success is weaker than we often assume.
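A contrived example shows how the two can come apart. Every number below is hypothetical: a scene with a thousand perfectly reconstructed wall points and two door-jamb points that are each pulled 5 cm into the opening, plus a robot 0.75 m wide facing a doorway that is really 0.80 m wide.

```python
import numpy as np

ROBOT_WIDTH_M = 0.75  # hypothetical robot footprint

def mean_point_error(recon, truth):
    """Mean Euclidean error over all matched points in the scene."""
    return float(np.linalg.norm(recon - truth, axis=1).mean())

def fits_through(left_jamb, right_jamb, robot_width=ROBOT_WIDTH_M):
    """Task-level check: is the doorway wider than the robot?"""
    return bool(np.linalg.norm(right_jamb - left_jamb) > robot_width)

# Ground truth: 1000 wall points plus two jamb points 0.80 m apart.
rng = np.random.default_rng(0)
walls = rng.uniform(0.0, 10.0, size=(1000, 3))
jambs = np.array([[0.0, 0.0, 1.0], [0.80, 0.0, 1.0]])
truth = np.vstack([walls, jambs])

# Reconstruction: walls exact, each jamb shifted 5 cm into the opening.
recon = truth.copy()
recon[-2, 0] += 0.05
recon[-1, 0] -= 0.05

print(mean_point_error(recon, truth))        # about 1e-4 m, looks excellent
print(fits_through(truth[-2], truth[-1]))    # True
print(fits_through(recon[-2], recon[-1]))    # False
```

The scene-wide metric is near perfect, yet a planner working from this map would refuse a doorway the robot can actually pass through.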
The practical deployment story differs significantly between these approaches. SurfFill requires both LiDAR and cameras, which means more hardware, more calibration, more things that can break. But it builds on a sensor (LiDAR) that is already widely deployed and trusted. CoMo3R-SLAM requires only cameras, which is cheaper and lighter, but it relies on learned components that may behave unpredictably in novel environments. For safety-critical applications, the interpretability of LiDAR-based systems is a real advantage. You can debug a LiDAR point cloud. Debugging why a neural network produced a bad depth estimate is a different kind of problem entirely.
I would want to see both systems tested on failure cases specifically. What happens to SurfFill when the camera images are motion-blurred or poorly exposed? What happens to CoMo3R-SLAM when the learned priors encounter architecture they have never seen? The papers present success cases, as papers do, but deployment requires understanding failure modes. This is not unique to these papers. It is a gap in how we evaluate reconstruction systems generally.
The computational requirements also matter and are not fully specified in either abstract. SurfFill mentions a divide-and-conquer scheme for building-scale completion, which suggests significant compute for large scenes. CoMo3R-SLAM mentions GPU-accelerated bundle adjustment, which implies non-trivial hardware requirements. Neither paper (based on the abstracts) specifies whether these systems could run on edge devices typical of mobile robots or whether they require cloud compute. For multi-agent systems especially, the question of where computation happens (onboard versus offloaded) has major implications for latency, reliability, and bandwidth.
What these papers collectively suggest is that the field is converging on hybrid approaches, even when the explicit framing is about a single modality. SurfFill is explicitly multi-modal. CoMo3R-SLAM is nominally camera-only, but the learned priors encode geometric knowledge that must ultimately have come from some combination of sensor modalities in the training data. The pure camera versus pure LiDAR debate is increasingly irrelevant. The question is how to combine information sources, whether that combination happens at training time (learned priors) or inference time (explicit fusion).
The sample sizes in both evaluations are relatively small by machine learning standards, which is typical for robotics papers but still limits confidence in generalization. Tanks and Temples has only about twenty scenes. Waymo is larger but still represents a specific geographic and environmental distribution (a handful of U.S. cities, mostly daytime, mostly clear weather). We do not know yet how these methods perform in snow, heavy rain, fog, or the kind of chaotic urban environments common in many parts of the world. This is not a criticism of these papers specifically. It is a limitation of the field's evaluation infrastructure.
Looking at where this research points, I expect we will see more work on adaptive sensor fusion, systems that can dynamically weight different information sources based on confidence estimates. SurfFill's ambiguity heuristic is a step in this direction for LiDAR completion. CoMo3R-SLAM's use of learned priors is another approach to the same underlying challenge: knowing when to trust which source of information. The next generation of systems will likely need to handle not just sensor failure but graceful degradation, maintaining useful (if reduced) capability when one or more information sources become unreliable.
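The simplest version of "knowing when to trust which source" is inverse-variance weighting, sketched below for a single depth value. Neither paper describes doing this. It is a textbook baseline, and the depth and variance figures are hypothetical. It does show the graceful-degradation behavior in miniature: a source that reports infinite variance drops out of the estimate instead of corrupting it.

```python
import numpy as np

def fuse_depth(depths, variances):
    """Inverse-variance weighted fusion of independent depth estimates.

    Returns (fused_depth, fused_variance). A source with infinite variance
    gets zero weight. If every source is unusable, returns (nan, inf).
    """
    depths = np.asarray(depths, dtype=float)
    weights = 1.0 / np.asarray(variances, dtype=float)  # 1 / inf -> 0.0
    total = weights.sum()
    if total == 0.0:
        return float("nan"), float("inf")
    fused = float((weights * depths).sum() / total)
    return fused, float(1.0 / total)

# Hypothetical readings: LiDAR says 10.0 m, a learned prior says 12.0 m.
print(fuse_depth([10.0, 12.0], [1.0, 1.0]))      # (11.0, 0.5)
print(fuse_depth([10.0, 12.0], [1.0, np.inf]))   # (10.0, 1.0)
```

The hard part in practice is not the formula but getting variance estimates that are honest, especially from learned components.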
Both papers represent solid incremental progress on hard problems. SurfFill is the more modest contribution, applying known techniques to a specific gap in LiDAR reconstruction. CoMo3R-SLAM is more ambitious and correspondingly more uncertain in its generalization. Neither is a breakthrough in the sense of fundamentally changing how we think about 3D reconstruction. But both push the boundaries of what is practical, and that matters. The gap between research demonstrations and deployed systems remains large, and papers that close that gap, even partially, are valuable even when they are not revolutionary.