Deep Incremental Scene Understanding
Federico Tombari & Christian Rupprecht
Technical University of Munich, Germany
Scene Understanding and SLAM
- Scene understanding with deep learning (typically frame-wise) [Couprie14]
- SLAM from RGB-D data allowing real-time scene reconstruction [Izadi11]
Can we fuse the two, while still being real-time?
- C. Couprie et al., “Toward Real-time Indoor Semantic Segmentation Using Depth Information”, JMLR, 2014
- S. Izadi et al., “KinectFusion: Real-time 3D Reconstruction and Interaction Using a Moving Depth Camera”, UIST 2011
Incremental Scene Understanding
Beyond SLAM: fusing reconstruction with scene understanding
- Fusing multiple viewpoints over time improves semantic perception and object pose estimation
- Dense SLAM [Li16]
- SLAM++ [Salas-Moreno13]
- C. Li et al., “Incremental scene understanding on dense SLAM“, IROS 2016
- R. Salas-Moreno et al., “SLAM++: Simultaneous Localisation and Mapping at the Level of Objects“, CVPR 2013
Incremental 3D Segmentation
Real-time segmentation of the SLAM reconstruction [Tateno15], with constant complexity with respect to the size of the reconstruction (a generic sketch of such incremental fusion is shown below)
- K. Tateno, F. Tombari, N. Navab: “Real-Time and Scalable Incremental Segmentation on Dense SLAM”, IROS 2015
Real-time also on Google Tango..
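The constant per-frame complexity can be made concrete with a small sketch: only the map elements observed in the current frame are touched when per-frame segments are fused into the global segmentation. The Python snippet below is a generic illustration with hypothetical names and data structures, not the algorithm of [Tateno15].

```python
# Generic sketch of incremental segment-label fusion: per-frame segments are
# matched to global map elements via projective association, so only elements
# visible in the current frame are updated and the per-frame cost does not
# grow with the size of the reconstruction. Names are illustrative.
import numpy as np

def propagate_labels(frame_segments, frame_to_global, global_labels, next_label):
    """frame_segments: (H, W) int array of per-frame segment ids.
    frame_to_global: (H, W) int array mapping each pixel to a global map
                     element index (-1 where no correspondence exists).
    global_labels:   dict {global element index -> global segment label}."""
    for seg_id in np.unique(frame_segments):
        targets = frame_to_global[frame_segments == seg_id]
        targets = targets[targets >= 0]
        # Existing global labels that overlap this frame segment
        existing = [global_labels[t] for t in targets if t in global_labels]
        if existing:
            # Merge: adopt the most frequent overlapping global label
            label = max(set(existing), key=existing.count)
        else:
            # Previously unseen segment: assign a fresh global label
            label, next_label = next_label, next_label + 1
        for t in targets:
            global_labels[t] = label
    return global_labels, next_label
```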
What if a depth sensor is not available? Is semantic mapping / incremental scene understanding still possible from a single RGB camera?
Monocular SLAM – state of the art
- LSD-SLAM [Engel14] – direct
- ORB-SLAM [Mur-Artal15] – feature-based
Main limitations:
- no pure rotational motions
- not dense on texture-less regions
- no absolute scale
- J. Engel et al., “LSD-SLAM: Large-Scale Direct Monocular SLAM” ECCV 2014
- R. Mur-Artal et al., “ORB-SLAM: A Versatile and Accurate Monocular SLAM System” IEEE Trans. Robotics 2015
Depth prediction with CNNs
[Figure: RGB image, depth ground truth (Kinect), depth prediction]
An alternative to monocular SLAM?
Goal: Use a CNN to predict a dense depth map from a single RGB image
Fully Convolutional ResNet with UpProjections [Laina16]
CNN Architecture
- Starting point: ResNet-50 (residual blocks, average pooling, fully connected (FC) layer)
- Limitation of fully connected layers: high-dimensional outputs can require billions of parameters (memory limitations)
- I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, N. Navab: “Deeper Depth Prediction with Fully Convolutional Residual Networks”, 3DV 2016
[Figure: prediction vs. ground truth]
Drawbacks of a fully connected output:
- difficult convergence
- blurry predictions
- need for bigger datasets
- Solution: fully convolutional ResNet with progressive up-sampling, replacing the fully connected layer with a stack of residual up-projection blocks
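As an illustration of the up-projection idea, here is a minimal PyTorch sketch of a single up-projection block. It follows the description in [Laina16] (2x up-sampling, a 5x5/3x3 convolution pair, and a 5x5 projection branch that are summed), but it is a simplified reimplementation, not the authors' code, which uses a faster interleaved up-convolution; the channel sizes in the usage example are illustrative.

```python
# Minimal PyTorch sketch of an up-projection block in the spirit of [Laina16]:
# 2x nearest-neighbor unpooling followed by a pair of convolutions, with a
# projection branch so the two paths can be summed (residual structure).
import torch
import torch.nn as nn

class UpProjection(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.unpool = nn.Upsample(scale_factor=2, mode="nearest")
        # Main branch: 5x5 conv -> ReLU -> 3x3 conv
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=5, padding=2)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        # Projection branch: 5x5 conv matching the output channels
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=5, padding=2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.unpool(x)                      # double the spatial resolution
        main = self.conv2(self.relu(self.conv1(x)))
        skip = self.proj(x)
        return self.relu(main + skip)           # residual sum

# Stacking several such blocks on top of a ResNet-50 encoder progressively
# up-samples the low-resolution feature map toward the output depth map.
out = UpProjection(1024, 512)(torch.randn(1, 1024, 8, 10))
print(out.shape)  # torch.Size([1, 512, 16, 20])
```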
Multi-task Fully Convolutional ResNet
[Figure: RGB input, depth GT (Kinect), depth prediction, 4-class semantic segmentation, 40-class segmentation (RGB only), 40-class segmentation (RGB + predicted depth)]
Monocular SLAM and CNN depth prediction are complementary
- Monocular SLAM: accurate on depth borders but sparse
- CNN depth prediction: dense but imprecise along depth borders
CNN-SLAM [Tateno17] takes the best of both worlds by fusing monocular SLAM with depth prediction in real time:
1. it can recover the absolute scale
2. it produces dense maps
3. it can deal with pure rotational motion
- K. Tateno, F. Tombari, I. Laina, N. Navab: “CNN-SLAM: Real-time Dense Monocular SLAM with Learned Depth Prediction”, CVPR 2017
CNN-SLAM framework
- Camera pose estimated via direct method at each new frame
- Set of key-frames, each associated to a depth map
- Each key-frame depth map D_k is
- 1. initialized via the Fully Convolutional ResNet [Laina16]
- 2. refined with depth values D_t estimated via short-baseline stereo matching [Engel14], weighted by the associated uncertainties U_k, U_t:
D_k(v) = ( U_t(v) · D_k(v) + U_k(v) · D_t(v) ) / ( U_k(v) + U_t(v) )
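The refinement above is a per-pixel weighted average in which each depth is weighted by the uncertainty of the other source, so the more confident estimate dominates. A minimal NumPy sketch (illustrative names, not the paper's code; the eps guard against division by zero is an added assumption):

```python
# Uncertainty-weighted key-frame depth refinement, per pixel:
# D_k, U_k: CNN-initialized key-frame depth and its uncertainty (H, W)
# D_t, U_t: short-baseline stereo depth and its uncertainty (H, W)
import numpy as np

def refine_keyframe_depth(D_k, U_k, D_t, U_t, eps=1e-6):
    D_fused = (U_t * D_k + U_k * D_t) / (U_k + U_t + eps)
    # The fused uncertainty shrinks accordingly (product over sum),
    # mirroring standard inverse-variance fusion.
    U_fused = (U_k * U_t) / (U_k + U_t + eps)
    return D_fused, U_fused
```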
[Framework diagram: input RGB image → camera pose estimation and frame-wise depth refinement (every input frame); key-frame initialization via CNN depth prediction and CNN semantic segmentation (every key-frame); pose graph optimization; global map and semantic label fusion]
Key-frame depth refinement
- Key-frame depth refinement allows estimating fine structures on previously blurred surfaces
- Gradual fusion of CNN-predicted depth with monocular SLAM:
– elements near intensity gradients are progressively refined by the frame-wise depth estimates
– elements within low-textured regions gradually keep the depth value predicted by the CNN
[Figure: refined depth in the key-frame, RGB image in the key-frame, RGB image in the current frame]
Qualitative results – SLAM on pure rotational motion
Qualitative results – Absolute scale estimation
First demonstration of fully monocular real-time semantic mapping
Many prediction tasks are ambiguous
What will the other driver do? What is the label for this image?
- C. Rupprecht, I. Laina, R. DiPietro, M. Baust, F. Tombari, N. Navab, G. D. Hager: “Learning in an Uncertain World: Representing Ambiguity Through Multiple Hypotheses”, arXiv:1612.00197, 2017
Many prediction tasks contain uncertainty. In some cases, uncertainty is inherent in the task itself [Rupprecht17].
Simple example: next frame prediction
[Figure: single prediction]
- a square is bouncing around the frame
- it randomly switches color between black and white
- the CNN predicts the next frame in the sequence
- the mean of black and white is gray, which is also the background color
- as a result, the predicted frame is constant gray
Approximating with the mean
Learning the mean can lead to very unlikely solutions: if p(x) has two modes (e.g. black and white), the mean lies between them (gray), where the probability is close to zero.
Approximate with multiple hypotheses
a simple meta-loss transforms any model into a multiple hypothesis predictor (MHP)
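A hedged sketch of such a meta-loss, in the spirit of [Rupprecht17]: a winner-takes-all assignment relaxed by a small epsilon so every hypothesis keeps receiving some gradient. Function names and the epsilon value are illustrative, not the reference implementation.

```python
# Multiple-hypothesis meta-loss: compute the base loss for each of the M
# hypotheses, then weight the best one with (1 - eps) and spread eps over
# the rest, so (mostly) only the closest hypothesis is pulled toward the target.
import torch

def mhp_loss(hypotheses, target, base_loss=torch.nn.functional.mse_loss, eps=0.05):
    """hypotheses: list of M tensors, each with the same shape as target."""
    M = len(hypotheses)
    per_hyp = torch.stack([base_loss(h, target) for h in hypotheses])  # (M,)
    best = torch.argmin(per_hyp.detach())
    weights = torch.full((M,), eps / max(M - 1, 1))
    weights[best] = 1.0 - eps
    return (weights * per_hyp).sum()

# Usage: give any network M output heads and train with mhp_loss in place of
# the original single-prediction loss.
```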
Simple example: next frame prediction
[Figure: prediction 1, prediction 2]
- now the same network is transformed into a multiple hypothesis model
- with two predictions it is able to separate the black and white outcomes for the future frame
Image Classification
Human Pose Estimation
- the variance of the predictions can help detect ambiguities
- the predicted locations of the hands vary much more than those of the shoulders
Future Frame Prediction
- with more predictions future frames become sharper
- the model does not need to blend together all possible outcomes
Multiple Hypothesis for Depth Prediction
[Figure: input, ground truth, hypotheses variance, hypotheses mean]
Multiple Hypothesis Prediction for CNN-SLAM
[Figure: original CNN-SLAM (correct pixels: 10.6%) vs. CNN-SLAM with MHP (correct pixels: 36.0%)]
- the variance across hypotheses can be used to estimate confidences
- the confidences are used as initialization for the refinement of the key-frame
- with MHP depth prediction the overall accuracy increases
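To illustrate the last points, a minimal NumPy sketch (hypothetical names, not the CNN-SLAM code) of how M depth hypotheses could seed a key-frame: the per-pixel mean serves as the initial depth and the per-pixel variance as the initial uncertainty, so ambiguous regions start with low confidence and are corrected more aggressively by the SLAM-based refinement.

```python
# Turn M depth hypotheses into an initial key-frame depth and uncertainty map.
import numpy as np

def keyframe_init_from_hypotheses(hypotheses):
    """hypotheses: array of shape (M, H, W) holding M depth hypotheses."""
    depth_init = hypotheses.mean(axis=0)        # (H, W) initial key-frame depth
    uncertainty_init = hypotheses.var(axis=0)   # (H, W) initial uncertainty
    return depth_init, uncertainty_init
```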
Conclusion
- We presented a framework for real-time scene understanding fusing semantic segmentation and SLAM reconstruction
- Depth prediction complements monocular SLAM in low-texture regions and provides the global scale
- Multiple hypotheses allow for improved 3D reconstruction
Combine deep learning with 3D computer vision to leverage the best of both worlds
Credits (alphabetical)
- Dr. Max Baust
- Dr. Vasilis Belagiannis
- Robert DiPietro
- Prof. Greg Hager
- Iro Laina
- Prof. Nassir Navab
- Keisuke Tateno
We gratefully acknowledge the donation from Nvidia of two GPUs that helped the development of the presented research activities.
References
[Couprie14] C. Couprie, C. Farabet, L. Najman, Y. LeCun: “Toward Real-time Indoor Semantic Segmentation Using Depth Information”, JMLR, 2014
[Engel14] J. Engel et al., “LSD-SLAM: Large-Scale Direct Monocular SLAM”, ECCV 2014
[Izadi11] S. Izadi et al., “KinectFusion: Real-time 3D Reconstruction and Interaction Using a Moving Depth Camera”, UIST 2011
[Laina16] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, N. Navab: “Deeper Depth Prediction with Fully Convolutional Residual Networks”, 3DV 2016
[Li16] C. Li et al., “Incremental scene understanding on dense SLAM”, IROS 2016
[Mur-Artal15] R. Mur-Artal et al., “ORB-SLAM: A Versatile and Accurate Monocular SLAM System”, IEEE Trans. Robotics 2015
[Rupprecht17] C. Rupprecht, I. Laina, R. DiPietro, M. Baust, F. Tombari, N. Navab, G. D. Hager: “Learning in an Uncertain World: Representing Ambiguity Through Multiple Hypotheses”, arXiv:1612.00197, 2017
[Salas-Moreno13] R. Salas-Moreno et al., “SLAM++: Simultaneous Localisation and Mapping at the Level of Objects”, CVPR 2013
[Tateno15] K. Tateno, F. Tombari, N. Navab: “Real-Time and Scalable Incremental Segmentation on Dense SLAM”, IROS 2015
[Tateno17] K. Tateno, F. Tombari, I. Laina, N. Navab: “CNN-SLAM: Real-time Dense Monocular SLAM with Learned Depth Prediction”, CVPR 2017