6D Camera Relocalization in Ambiguous Scenes via Continuous Multimodal Inference

Mai Bui, Tolga Birdal, Haowen Deng, Shadi Albarqouni, Leonidas Guibas, Slobodan Ilic & Nassir Navab

Stanford University & Technical University of Munich & Siemens AG

Multimodal 6D Camera Pose Predictions: In a highly ambiguous environment, similar-looking views can easily confuse current camera pose regression models and lead to incorrect localization results. Given a query RGB image, our aim is instead to predict the possible modes as well as the associated uncertainties, which we model by the parameters of Bingham and Gaussian mixture models.


We present a multimodal camera relocalization framework that captures ambiguities and uncertainties with continuous mixture models defined on the manifold of camera poses. In highly ambiguous environments, which can easily arise due to symmetries and repetitive structures in the scene, computing one plausible solution (what most state-of-the-art methods currently regress) may not be sufficient. Instead, we predict multiple camera pose hypotheses as well as the respective uncertainty for each prediction. Towards this aim, we use Bingham distributions to model the orientation of the camera pose, and a multivariate Gaussian to model the position, with an end-to-end deep neural network. By incorporating a Winner-Takes-All training scheme, we finally obtain a mixture model that is well suited for explaining ambiguities in the scene, yet does not suffer from mode collapse, a common problem with mixture density networks. We introduce a new dataset specifically designed to foster camera localization research in ambiguous environments and exhaustively evaluate our method on synthetic as well as real data, on both ambiguous scenes and non-ambiguous benchmark datasets.
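To make the two ingredients above concrete, the following is a minimal NumPy sketch, not the paper's implementation: an unnormalized Bingham log-density over unit quaternions (the normalization constant, a confluent hypergeometric function, is omitted) and a relaxed Winner-Takes-All loss in which the best hypothesis receives most of the gradient weight while the remaining ones keep a small share, which is one common way to avoid mode collapse. All function names and the relaxation parameter `eps` are illustrative assumptions.

```python
import numpy as np

def bingham_log_density_unnorm(q, M, Z):
    """Unnormalized Bingham log-density of a unit quaternion q.

    M: 4x4 orthogonal matrix whose columns are the dispersion directions
       (the first column is the mode).
    Z: diag(0, z1, z2, z3) with z_i <= 0 controlling the concentration.
    The density is antipodally symmetric: p(q) = p(-q), matching the
    double cover of SO(3) by unit quaternions.
    """
    return float(q @ M @ Z @ M.T @ q)

def wta_loss(hypotheses, target, loss_fn, eps=0.05):
    """Relaxed Winner-Takes-All over a set of pose hypotheses.

    The hypothesis with the smallest loss gets weight (1 - eps);
    the others share eps equally, so every branch keeps receiving
    a small training signal instead of dying off.
    """
    losses = np.array([loss_fn(h, target) for h in hypotheses])
    weights = np.full(len(losses), eps / (len(losses) - 1))
    weights[np.argmin(losses)] = 1.0 - eps
    return float(weights @ losses)

# Toy check: axis-aligned dispersion, mode at the identity rotation.
M = np.eye(4)
Z = np.diag([0.0, -10.0, -10.0, -10.0])
q_mode = np.array([1.0, 0.0, 0.0, 0.0])   # identity quaternion
print(bingham_log_density_unnorm(q_mode, M, Z))  # 0.0 at the mode
```

In a trained network, `M` and `Z` for each mixture component would be regressed per image, and `loss_fn` would be the negative log-likelihood of the component rather than the toy scalar loss used here.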


We created a synthetic dataset that is specifically designed to contain repetitive structures and introduce highly ambiguous viewpoints.

Qualitative results on synthetic scenes

In addition, we create highly ambiguous real scenes using Google Tango and a graph-based SLAM approach. We acquire RGB images as well as distinct ground-truth camera trajectories for training and testing.

Qualitative results on real scenes

In comparison to current state-of-the-art methods, the proposed model is able to capture plausible but diverse modes as well as the associated uncertainties for each pose hypothesis.


More information and details can be found in our paper.

Our ambiguous relocalization dataset can be downloaded here.


The implementation of our work can be found here.


Video - 6D Continuous Multimodal Inference


@inproceedings{bui2020relocalization,
  title={6D Camera Relocalization in Ambiguous Scenes via Continuous Multimodal Inference},
  author={Bui, Mai and Birdal, Tolga and Deng, Haowen and Albarqouni, Shadi and Guibas, Leonidas and Ilic, Slobodan and Navab, Nassir},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2020}
}


This joint effort is supported by BaCaTec, the Bavaria California Technology Center.

Interested in Collaborating with Us?

We would like this project to evolve into a repository of methods for handling challenging multimodal problems in 3D computer vision. We are therefore looking for contributors and collaborators with strong coding and mathematics skills as well as good knowledge of 3D vision and machine (deep) learning.