Deep Incremental Scene Understanding Federico Tombari & - - PowerPoint PPT Presentation



SLIDE 1

Deep Incremental Scene Understanding

Federico Tombari & Christian Rupprecht Technical University of Munich, Germany

SLIDE 2

Scene Understanding and SLAM

  • Scene understanding with deep learning (typically frame-wise) [Couprie14]
  • SLAM from RGB-D data allowing real-time scene reconstruction [Izadi11]

Can we fuse the two, while still being real-time?

  • C. Couprie et al., "Toward Real-time Indoor Semantic Segmentation Using Depth Information", JMLR 2014
  • S. Izadi et al., “KinectFusion: Real-time 3D Reconstruction and Interaction Using a Moving Depth Camera”, UIST 2011

SLIDE 3

Beyond SLAM: fusing reconstruction with scene understanding

Fusing multiple viewpoints over time improves semantic perception and object pose estimation.

Incremental Scene Understanding:

  • Dense SLAM [Li16]
  • SLAM++ [Salas-Moreno13]

  • C. Li et al., “Incremental scene understanding on dense SLAM“, IROS 2016
  • R. Salas-Moreno et al., “SLAM++: Simultaneous Localisation and Mapping at the Level of Objects“, CVPR 2013
SLIDE 4

Incremental 3D Segmentation

Real-time segmentation of the SLAM reconstruction [Tateno15], yielding constant complexity with respect to the size of the reconstruction

  • K. Tateno, F. Tombari, N. Navab, “Real-Time and Scalable Incremental Segmentation on Dense SLAM”, IROS 2015

SLIDE 5

Real-time also on Google Tango.

SLIDE 6

Is semantic mapping/incremental scene understanding still possible from a single RGB camera? What if a depth sensor is not available?

SLIDE 7

Monocular SLAM – state of the art

LSD-SLAM [Engel14] (direct) vs. ORB-SLAM [Mur-Artal15] (feature-based)

Main limitations:
  • No pure rotational motions
  • Not dense on texture-less regions
  • No absolute scale

  • J. Engel et al., “LSD-SLAM: Large-Scale Direct Monocular SLAM” ECCV 2014
  • R. Mur-Artal et al., “ORB-SLAM: A Versatile and Accurate Monocular SLAM System” IEEE Trans. Robotics 2015
SLIDE 8

Depth prediction with CNNs

[Figure: RGB image · depth ground truth (Kinect) · depth prediction]

An alternative to monocular SLAM?

Goal: Use a CNN to predict a dense depth map from a single RGB image

SLIDE 9

FC ResNet with UpProjections [Laina16]

Limitation of fully connected layers: high-dimensional outputs can require billions of parameters

CNN Architecture

[Architecture diagram: ResNet-50 backbone (residual blocks) → avg pool → FC; memory limitations]

  • I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, N. Navab: “Deeper Depth Prediction with Fully Convolutional Residual Networks”, 3DV 2016

SLIDE 10

FC ResNet with UpProjections [Laina16]

CNN Architecture

[Architecture diagram: residual blocks → avg pool → FC; prediction vs. ground truth]

Issues: difficult convergence, blurry predictions, need for bigger datasets

SLIDE 11

FC ResNet with UpProjections [Laina16]

CNN Architecture: fully convolutional ResNet (residual blocks) with progressive up-sampling
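The progressive up-sampling idea can be sketched roughly as follows. This is a minimal numpy illustration of a single up-sampling step: the real up-projection block in [Laina16] uses learned 5×5 convolutions and a projection shortcut, for which a fixed 3×3 box filter stands in here.

```python
import numpy as np

def up_project(feat):
    """Toy up-projection step: 2x nearest-neighbor unpooling followed
    by a smoothing convolution (3x3 box filter as a stand-in for the
    learned conv of the real block)."""
    h, w = feat.shape
    up = np.repeat(np.repeat(feat, 2, axis=0), 2, axis=1)  # 2x upsample
    pad = np.pad(up, 1, mode="edge")
    out = np.zeros_like(up)
    for dy in range(3):  # 3x3 box filter
        for dx in range(3):
            out += pad[dy:dy + 2 * h, dx:dx + 2 * w]
    return out / 9.0

# stacking four such steps upsamples a 10x8 feature map to 160x128
x = np.random.rand(10, 8)
for _ in range(4):
    x = up_project(x)
print(x.shape)  # (160, 128)
```

Each step doubles the spatial resolution, so a small encoder output grows back toward input resolution without any fully connected layer.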


SLIDE 15

Multi-task FC ResNet

[Figure: RGB input · depth GT (Kinect) · depth prediction · 4-class semantic segmentation · 40-class (RGB only) · 40-class (RGB + depth prediction)]

SLIDE 16

Monocular SLAM and CNN depth prediction are complementary

Monocular SLAM: accurate on depth borders, but sparse.

CNN depth prediction: dense, but imprecise along depth borders.

CNN-SLAM [Tateno17] takes the best of both worlds by fusing monocular SLAM with depth prediction in real time:
1. can learn the absolute scale
2. dense maps
3. can deal with pure rotational motion

  • K. Tateno, F. Tombari, I. Laina, N. Navab: “CNN-SLAM: Real-Time Dense Monocular SLAM with Learned Depth Prediction”, CVPR 2017

SLIDE 17

CNN-SLAM framework

  • Camera pose estimated via direct method at each new frame
  • Set of key-frames, each associated to a depth map
  • Each key-frame depth map E_kj is:
  • 1. initialized via the fully convolutional ResNet [Laina16]
  • 2. refined with depth values E_u estimated via short-baseline stereo matching [Engel14], weighted by the associated uncertainties V_kj, V_u:

E_kj(v) = (V_u(v) · E_kj(v) + V_kj(v) · E_u(v)) / (V_kj(v) + V_u(v))
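The uncertainty-weighted refinement can be sketched as follows. This is a minimal numpy illustration of the weighting scheme, not the paper's implementation; the variable names are mine. Each depth source is weighted by the *other* source's uncertainty, so the more certain measurement dominates.

```python
import numpy as np

def refine_depth(d_key, v_key, d_stereo, v_stereo):
    """Fuse the key-frame depth map (CNN-initialized) with
    short-baseline stereo estimates, weighting each depth by the
    other's uncertainty (inverse-variance style weighting)."""
    return (v_stereo * d_key + v_key * d_stereo) / (v_key + v_stereo)

# a pixel where stereo is very certain (low uncertainty) pulls the
# refined depth toward the stereo estimate
d = refine_depth(d_key=np.array([2.0]), v_key=np.array([1.0]),
                 d_stereo=np.array([3.0]), v_stereo=np.array([0.1]))
print(d)  # ~2.91, much closer to the stereo value 3.0
```

With equal uncertainties the result is simply the average of the two depths.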

[Framework diagram: input RGB image → camera pose estimation (every input frame) → key-frame initialization via CNN depth prediction and CNN semantic segmentation (every key-frame) → frame-wise depth refinement → pose graph optimization → global map and semantic label fusion]

SLIDE 18

Key-frame depth refinement

  • Key-frame depth refinement allows estimating fine structures on previously blurred surfaces
  • Gradual fusion of CNN-predicted depth with monocular SLAM:
    – elements near intensity gradients will be more and more refined by the frame-wise depth estimates
    – elements within low-textured regions will gradually hold the predicted depth value from the CNN

[Figure: refining depth in key-frame · RGB image in key-frame · RGB image in current frame]

SLIDE 19

Qualitative results – SLAM on pure rotational motion

SLIDE 20

Qualitative results – Absolute scale estimation

SLIDE 21

First demonstration of fully monocular real-time semantic mapping

SLIDE 22

Many prediction tasks are ambiguous

What will the other driver do? What is the label for this image?

  • C. Rupprecht, I. Laina, R. DiPietro, M. Baust, F. Tombari, N. Navab, G. D. Hager: “Learning in an Uncertain World: Representing Ambiguity Through Multiple Hypotheses”, arXiv:1612.00197, 2017

Many prediction tasks contain uncertainty. In some cases, uncertainty is inherent in the task itself [Rupprecht17].

SLIDE 23

Simple example: next frame prediction

single prediction

  • a square is bouncing around the frame
  • it randomly switches color between black and white
  • the CNN predicts the next frame in the sequence
  • the mean of black and white is gray, which is also the background color
  • so the predicted frame is constant gray
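The gray-frame behavior follows directly from the L2 loss: the single prediction minimizing the expected squared error for an equiprobable black/white pixel is their mean, i.e. mid-gray. A quick numerical check:

```python
import numpy as np

# pixel is black (0.0) or white (1.0) with equal probability
samples = np.array([0.0, 1.0])

# search for the single value minimizing the expected squared error
candidates = np.linspace(0.0, 1.0, 101)
mse = np.array([np.mean((samples - c) ** 2) for c in candidates])
best = candidates[np.argmin(mse)]
print(best)  # minimum at 0.5: mid-gray
```

Analytically, mse(c) = c² - c + 0.5, which is minimized at c = 0.5 regardless of the grid.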
SLIDE 24

Approximations with the mean

Learning the mean can lead to very unlikely solutions.

[Plot: probability density p(x) over x]

SLIDE 25

Approximate with multiple hypotheses

a simple meta-loss transforms any model into a multiple hypothesis predictor (MHP)
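Such a meta-loss can be sketched as below: a minimal numpy illustration of the relaxed winner-takes-all formulation from [Rupprecht17], with invented values; in actual training this weighting is applied when backpropagating through the hypothesis heads.

```python
import numpy as np

def mhp_meta_loss(hypotheses, target, eps=0.05):
    """Relaxed winner-takes-all: the hypothesis closest to the target
    receives almost all of the loss weight, the others share a small
    eps, so every head keeps receiving gradient."""
    per_hyp = np.sum((hypotheses - target) ** 2, axis=1)  # L2 per hypothesis
    m = len(per_hyp)
    weights = np.full(m, eps / (m - 1))
    weights[np.argmin(per_hyp)] = 1.0 - eps
    return float(np.sum(weights * per_hyp))

# two hypotheses for a 2D target: the closer one dominates the loss
hyps = np.array([[0.0, 0.0], [1.0, 1.0]])
target = np.array([0.9, 1.1])
loss = mhp_meta_loss(hyps, target)
print(loss)  # far below the mean of the per-hypothesis losses
```

Because only the best hypothesis is (mostly) penalized, different heads are free to specialize on different modes of the output distribution instead of all regressing to the mean.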

SLIDE 26

Simple example: next frame prediction

[Figure: prediction 1 · prediction 2]

  • now we transformed the same network into a multiple hypothesis model
  • with two predictions it is able to separate black and white blocks for the future frame

SLIDE 27

Image Classification

SLIDE 28

Human Pose Estimation

  • the variance of the predictions can help detect ambiguities
  • the predicted locations of the hands vary much more than those of the shoulders
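This per-joint variance is straightforward to compute across the hypotheses. A toy numpy example with invented coordinates:

```python
import numpy as np

# M=3 hypothesized x-coordinates for two joints (values invented
# for illustration): a stable shoulder vs. an ambiguous hand
shoulder = np.array([50.1, 50.0, 49.9])
hand = np.array([80.0, 95.0, 62.0])

# low variance across hypotheses -> confident localization,
# high variance -> ambiguous localization
print(shoulder.var(), hand.var())
```

Thresholding such variances gives a simple per-output confidence signal without any extra network head.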

SLIDE 29

Future Frame Prediction

  • with more predictions future frames become sharper
  • the model does not need to blend together all possible outcomes
SLIDE 30

Multiple Hypothesis for Depth Prediction

[Figure: input · ground truth · variance of hypotheses · mean of hypotheses]

SLIDE 31

Multiple Hypothesis Prediction for CNN-SLAM

Original CNN-SLAM: 10.6% correct pixels vs. CNN-SLAM with MHP: 36.0% correct pixels

  • the variance can be used to estimate confidences
  • confidences are used as initialization for the refinement of the key-frame
  • with MHP depth prediction the overall accuracy increases

SLIDE 32

Conclusion

  • We presented a framework for real-time scene understanding, fusing semantic segmentation and SLAM reconstruction
  • Depth prediction complements monocular SLAM in low-texture regions and provides global scale
  • Multiple hypotheses allow for improved 3D reconstruction

Combine deep learning with 3D computer vision to leverage the best of both worlds.


SLIDE 33

Credits (alphabetical)

  • Dr. Max Baust
  • Dr. Vasilis Belagiannis
  • Robert DiPietro
  • Prof. Greg Hager
  • Iro Laina
  • Prof. Nassir Navab
  • Keisuke Tateno

We gratefully acknowledge the donation from Nvidia of two GPUs that helped the development of the presented research activities.

SLIDE 34

References

[Couprie14] C. Couprie, C. Farabet, L. Najman, Y. LeCun: "Toward Real-time Indoor Semantic Segmentation Using Depth Information", JMLR 2014
[Engel14] J. Engel et al., “LSD-SLAM: Large-Scale Direct Monocular SLAM”, ECCV 2014
[Izadi11] S. Izadi et al., “KinectFusion: Real-time 3D Reconstruction and Interaction Using a Moving Depth Camera”, UIST 2011
[Laina16] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, N. Navab: “Deeper Depth Prediction with Fully Convolutional Residual Networks”, 3DV 2016
[Li16] C. Li et al., “Incremental scene understanding on dense SLAM”, IROS 2016
[Mur-Artal15] R. Mur-Artal et al., “ORB-SLAM: A Versatile and Accurate Monocular SLAM System”, IEEE Trans. Robotics 2015
[Rupprecht17] C. Rupprecht, I. Laina, R. DiPietro, M. Baust, F. Tombari, N. Navab, G. D. Hager: “Learning in an Uncertain World: Representing Ambiguity Through Multiple Hypotheses”, arXiv:1612.00197, 2017
[Salas-Moreno13] R. Salas-Moreno et al., “SLAM++: Simultaneous Localisation and Mapping at the Level of Objects”, CVPR 2013
[Tateno15] K. Tateno, F. Tombari, N. Navab: “Real-Time and Scalable Incremental Segmentation on Dense SLAM”, IROS 2015
[Tateno17] K. Tateno, F. Tombari, I. Laina, N. Navab: “CNN-SLAM: Real-Time Dense Monocular SLAM with Learned Depth Prediction”, CVPR 2017