IBD-SLAM: Learning Image-Based Depth Fusion for Generalizable SLAM

Minghao Yin1, Shangzhe Wu2, Kai Han1
1The University of Hong Kong    2University of Oxford
CVPR 2024

Abstract

In this paper, we address the challenging problem of visual SLAM with neural scene representations. Neural scene representations have recently shown promise for SLAM, producing high-quality dense 3D scene reconstructions. However, existing methods require scene-specific optimization, leading to a time-consuming mapping process for each individual scene. To overcome this limitation, we propose IBD-SLAM, an Image-Based Depth fusion framework for generalizable SLAM. In particular, we adopt a Neural Radiance Field (NeRF) for scene representation. Inspired by multi-view image-based rendering, instead of learning a fixed-grid scene representation, we propose to learn an image-based depth fusion model that fuses depth maps of multiple reference views into an xyz-map representation. Once trained, this model can be applied to new, uncalibrated monocular RGB-D videos of unseen scenes without retraining, and reconstructs full 3D scenes efficiently with a lightweight pose-optimization procedure. We thoroughly evaluate IBD-SLAM on public visual SLAM benchmarks, outperforming the previous state of the art while being 10× faster in the mapping stage.

Overview

IBD-SLAM is a feed-forward 3D scene reconstruction approach that generalizes robustly to unseen scenes.
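As a rough illustration of the xyz-map representation mentioned in the abstract, the minimal NumPy sketch below back-projects a depth map into per-pixel world coordinates and combines several such maps with scalar confidence weights. The function names, the pinhole-camera model, and the plain weighted average are illustrative assumptions on our part; the paper's actual fusion is a learned model, not a fixed average.

    import numpy as np

    def unproject_to_xyz_map(depth, K, c2w):
        """Back-project an HxW depth map into an HxWx3 "xyz-map" of
        world-space coordinates, given 3x3 intrinsics K and a 4x4
        camera-to-world pose c2w (pinhole camera assumed)."""
        H, W = depth.shape
        u, v = np.meshgrid(np.arange(W), np.arange(H))         # pixel grid
        pix = np.stack([u, v, np.ones_like(u)], -1).astype(np.float64)
        cam = (pix @ np.linalg.inv(K).T) * depth[..., None]    # camera-space points
        cam_h = np.concatenate([cam, np.ones((H, W, 1))], -1)  # homogeneous coords
        return (cam_h @ c2w.T)[..., :3]                        # world-space xyz-map

    def fuse_xyz_maps(xyz_maps, weights):
        """Confidence-weighted average of per-view xyz-maps (assumed
        already aligned to a common target view). A plain weighted mean
        stands in here for the paper's learned fusion model."""
        w = np.asarray(weights, np.float64)[:, None, None, None]
        return (np.stack(xyz_maps) * w).sum(0) / w.sum()

    # Toy usage: two reference views of a flat wall 2 m away.
    K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
    depth = np.full((480, 640), 2.0)
    xyz_a = unproject_to_xyz_map(depth, K, np.eye(4))
    xyz_b = unproject_to_xyz_map(depth, K, np.eye(4))
    fused = fuse_xyz_maps([xyz_a, xyz_b], weights=[0.6, 0.4])
    print(fused.shape)  # (480, 640, 3)

In IBD-SLAM, the fusion and the per-view alignment come from the trained model and the optimized camera poses, respectively; the toy example above only shows the geometry of the representation.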

Reconstruction

Poster

BibTeX


@InProceedings{Yin_2024_CVPR,
  author    = {Yin, Minghao and Wu, Shangzhe and Han, Kai},
  title     = {IBD-SLAM: Learning Image-Based Depth Fusion for Generalizable SLAM},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2024},
  pages     = {10563-10573}
}