Scene Understanding with 3D Deep Networks Thomas Funkhouser - - PowerPoint PPT Presentation



SLIDE 1

Scene Understanding with 3D Deep Networks

Thomas Funkhouser Princeton University

SLIDE 2

Disclaimer: I am talking about the work of these people …

Shuran Song, Andy Zeng, Fisher Yu, Angela Dai, Matthias Niessner, Matt Fisher, Jianxiong Xiao, Maciej Halber, Yinda Zhang

SLIDE 3

Goal

Understanding indoor scenes observed in RGB-D images

  • Robotics
  • Augmented reality
  • Virtual tourism
  • Surveillance
  • Home remodeling
  • Real estate
  • Telepresence
  • Forensics
  • Games
  • etc.
SLIDE 4

Goal

Understanding indoor scenes observed in RGB-D images

Input RGB-D Image(s)

Semantic Segmentation

SLIDE 5

Goal

Understanding indoor scenes observed in RGB-D images in 3D

Input RGB-D Image(s) → Semantic Segmentation → 3D Scene Understanding

SLIDE 6

Goal

Understanding indoor scenes observed in RGB-D images in 3D

  • Surface reconstruction
  • Amodal object detection
  • Object relationships
  • Materials, lights, etc.
  • Physical properties
  • Novel views
  • Info sharing
  • Spatial inference
  • Simulation
  • etc.


SLIDE 7

Goal for This Talk

Learn ConvNets to recognize patterns in voxels

  • Local shape descriptor
  • Amodal object detection
  • Semantic scene completion
SLIDE 8

Talk Outline

  • Local shape descriptor
  • Amodal object detection
  • Semantic scene completion

Scale: small → large

SLIDE 9

Talk Outline

  • Local shape descriptor
  • Amodal object detection
  • Semantic scene completion

Scale: small → large

  • A. Zeng, S. Song, M. Niessner, M. Fisher, J. Xiao, and T. Funkhouser, “3DMatch: Learning Local Geometric Descriptors from 3D Reconstructions,” submitted to CVPR 2017

SLIDE 10

Local Shape Descriptor

Goal: train a discriminative 3D local shape descriptor from data

(Figure: each local patch is encoded as a descriptor vector, e.g. (0.58, 0.21, 0.92, 0.67, 0.04, 0.53, …); two patches match when their descriptors are close.)

SLIDE 11

Local Shape Descriptor

Challenge: where to get training data?

SLIDE 12

Local Shape Descriptor: “3D Match”

Approach: train on wide-baseline correspondences in RGB-D reconstructions

“Ground truth” match between RGB-D images from different views

SLIDE 13

Local Shape Descriptor: “3D Match”

Approach: train on wide-baseline correspondences in RGB-D reconstructions

SLIDE 14

Local Shape Descriptor: “3D Match”

Method: sample true/false correspondences from RGB-D reconstructions, train Siamese network
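The sampling-plus-Siamese recipe can be sketched in a few lines of numpy. Everything here is a stand-in: the embedding is a fixed random linear map over a small synthetic patch (3DMatch learns a 3D ConvNet over local TSDF patches), and the patch sizes are shrunk for the sketch. The point being illustrated is the contrastive objective that pulls true correspondences together and pushes false ones apart.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embedding: a fixed random linear map from a flattened
# 16x16x16 local patch to a 128-d descriptor. 3DMatch learns a
# 3D ConvNet here; only the loss below is the point of the sketch.
W = rng.standard_normal((128, 16 * 16 * 16)) * 0.01

def embed(patch):
    """Map a 16x16x16 local patch to a descriptor vector."""
    return W @ patch.ravel()

def contrastive_loss(d1, d2, is_match, margin=1.0):
    """Pull matching descriptors together; push non-matches at
    least `margin` apart (hinge on the descriptor distance)."""
    dist = np.linalg.norm(d1 - d2)
    if is_match:
        return 0.5 * dist ** 2
    return 0.5 * max(0.0, margin - dist) ** 2

# A "true" wide-baseline correspondence: the same local geometry seen
# from two views (simulated with small noise), plus a false pairing.
p = rng.standard_normal((16, 16, 16))
q = p + 0.01 * rng.standard_normal((16, 16, 16))
r = rng.standard_normal((16, 16, 16))

loss_pos = contrastive_loss(embed(p), embed(q), is_match=True)
loss_neg = contrastive_loss(embed(p), embed(r), is_match=False)
```

In training, the gradient of this loss with respect to network weights would replace the fixed matrix `W`.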

SLIDE 15

Local Shape Descriptor: “3D Match”

Result: learns to discriminate local shapes found in real-world data

SLIDE 16

Local Shape Descriptor: “3D Match” Results

Result 1: learned feature descriptor predicts RGB-D point correspondences more accurately than hand-tuned descriptors

(Metrics: match classification error at 95% recall; fragment alignment success rate.)

SLIDE 17

Local Shape Descriptor: “3D Match” Results

Result 2: the feature descriptor learned from RGB-D reconstructions supports matching for recognizing poses of small objects in the Amazon Picking Challenge

(Figure: predicting the pose of a 3D object model in an RGB-D scan; chart: object pose prediction accuracy.)

SLIDE 18

Local Shape Descriptor: “3D Match” Results

Result 3: feature descriptor learned from RGB-D reconstructions provides discriminative matching of semantic correspondences on 3D meshes

SLIDE 19

Talk Outline

  • Local shape descriptor
  • Amodal object detection
  • Semantic scene completion

Scale: small → large

  • S. Song and J. Xiao, “Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images,” CVPR 2016

SLIDE 20

Object Detection

Goal: given an RGB-D image, find objects (labeled 3D amodal bounding boxes)

Input: single RGB-D image → Output: labeled 3D amodal boxes

SLIDE 21

Object Detection

Most previous work uses 2D operations (encode the depth map as extra image channels; 2D region proposal, object detection, instance segmentation, and contour detection; coarse pose classification; point cloud alignment) to go from 3D input to a 3D amodal detection result:

  • [CVPR13] Perceptual Organization and Recognition of Indoor Scenes from RGB-D Images
  • [IJCV14] Indoor Scene Understanding with RGB-D Images: Bottom-up Segmentation, Object Detection and Semantic Segmentation
  • [ECCV14] Object Detection and Segmentation using Semantically Rich Image and Depth Features
  • [CVPR15] Aligning 3D Models to RGB-D Images of Cluttered Scenes
  • [CVPR16] Cross Modal Distillation for Supervision Transfer

SLIDE 22

Object Detection: “Deep Sliding Shapes”

Approach: 3D deep learning, with 3D input (depth map and image), 3D operations, and 3D output (3D amodal detection result).

SLIDE 23

Object Detection: “Deep Sliding Shapes”

Pipeline: RGB-D image → region proposal network → object recognition network


SLIDE 25

Object Detection: “Deep Sliding Shapes”

Data encoding:

1) Estimate major directions of room
2) Compute TSDF

SLIDE 26

Object Detection: “Deep Sliding Shapes”

Data encoding:

1) Estimate major directions of room
2) Compute TSDF (volume: 2.5 m × 5.2 m × 5.2 m)
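The TSDF step can be made concrete with a toy numpy sketch. This simplified version is unsigned and computed from a raw point set (Deep Sliding Shapes derives a directional TSDF from the depth map); the grid size, voxel size, and truncation value here are made up for illustration.

```python
import numpy as np

def tsdf_from_points(points, grid_shape, voxel_size, trunc=0.3):
    """Truncated distance from each voxel center to the nearest surface
    point, clipped at `trunc` meters. A simplified, sign-free TSDF;
    the paper's encoding is directional and depth-map based."""
    # Voxel centers in world coordinates.
    idx = np.indices(grid_shape).reshape(3, -1).T          # (V, 3)
    centers = (idx + 0.5) * voxel_size
    # Distance from every voxel center to its nearest surface point.
    d = np.linalg.norm(centers[:, None, :] - points[None, :, :], axis=2)
    tdf = np.minimum(d.min(axis=1), trunc)
    return tdf.reshape(grid_shape)

# A single surface point inside a 1 m cube sampled at 0.1 m voxels.
pts = np.array([[0.5, 0.5, 0.5]])
grid = tsdf_from_points(pts, grid_shape=(10, 10, 10), voxel_size=0.1)
```

Voxels near the point carry small distances; everything farther than the truncation band saturates at `trunc`, which is what lets the network focus on geometry near surfaces.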


SLIDE 28

Object Detection: “Deep Sliding Shapes”

3D region proposal network: TSDF → region proposal network → 3D region proposals

SLIDE 29

Object Detection: “Deep Sliding Shapes”

3D region proposal network. (Chart: across objects, projected pixel area varies ~×50 while physical size varies only ~×3, so fixed physical-size 3D anchors are practical.)

SLIDE 30

Object Detection: “Deep Sliding Shapes”

Multiscale 3D region proposal network:

SLIDE 31

Object Detection: “Deep Sliding Shapes”

Multiscale 3D region proposal network: input TSDF → Conv1 + ReLU + Pool → Conv2 + ReLU + Pool

SLIDE 32

Object Detection: “Deep Sliding Shapes”

Multiscale 3D region proposal network: input TSDF → Conv1 + ReLU + Pool → Conv2 + ReLU + Pool → Conv3 + ReLU + Pool → two heads: Conv Class (softmax loss) and Conv 3D Box (L1-smooth loss)

Receptive field: 0.4 m³

SLIDE 33

Object Detection: “Deep Sliding Shapes”

Multiscale 3D region proposal network, Level 1 anchors (e.g., 0.6 × 0.2 × 0.4 m and 0.5 × 0.5 × 0.2 m)

Receptive field: 0.4 m³
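The anchor mechanism amounts to tiling fixed physical-size 3D boxes over the scene. The two box sizes below are the Level 1 sizes quoted above; the spatial extent and stride are made-up illustration values, not the paper's.

```python
import numpy as np

# Level-1 anchor sizes from the slide, as (w, h, d) in meters.
ANCHOR_SIZES = [(0.6, 0.2, 0.4), (0.5, 0.5, 0.2)]

def make_anchors(extent=(2.0, 2.0, 2.0), stride=0.5, sizes=ANCHOR_SIZES):
    """Return an (N, 6) array of anchors (cx, cy, cz, w, h, d):
    every anchor size centered at every grid position."""
    xs = np.arange(stride / 2, extent[0], stride)
    ys = np.arange(stride / 2, extent[1], stride)
    zs = np.arange(stride / 2, extent[2], stride)
    centers = np.stack(np.meshgrid(xs, ys, zs, indexing="ij"), -1).reshape(-1, 3)
    return np.stack([np.concatenate([c, s])
                     for c in centers for s in np.array(sizes)])

anchors = make_anchors()   # 4*4*4 positions x 2 sizes = 128 anchors
```

The proposal network then scores each anchor (objectness) and regresses offsets from it to the nearest ground-truth amodal box.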

SLIDE 34

Object Detection: “Deep Sliding Shapes”

Multiscale 3D region proposal network: a second branch adds Conv4 + ReLU + Pool before its own Conv Class / Conv 3D Box heads

Receptive fields: 0.4 m³ (level 1) and 1 m³ (level 2)
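Receptive fields like those quoted above follow from the standard recurrence rf += (k - 1) * jump, jump *= s applied over the layer stack. The stack and the 0.025 m voxel size below are assumptions for illustration, not the paper's exact architecture.

```python
def receptive_field(layers):
    """Receptive field (in voxels) of a stack of conv/pool layers,
    given as (kernel, stride) pairs, via the standard recurrence."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # widen by kernel, scaled by current jump
        jump *= s              # stride compounds the jump between taps
    return rf

# Illustrative stack: three 3x3x3 convs, each followed by a stride-2 pool.
stack = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1), (2, 2)]
rf_voxels = receptive_field(stack)       # 22 voxels
voxel_size = 0.025                       # assumed grid resolution, meters
rf_meters = rf_voxels * voxel_size       # 0.55 m
```

Adding one more conv level multiplies the jump again, which is why the second branch sees a much larger physical volume with only a few extra layers.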

SLIDE 35

Object Detection: “Deep Sliding Shapes”

Level 2 anchors: larger objects handled by the Conv4 branch

Receptive field: 1 m³


SLIDE 37

Object Detection: “Deep Sliding Shapes”

Joint object recognition network: each 3D proposal is also projected to 2D to obtain an image patch

SLIDE 38

Object Detection: “Deep Sliding Shapes”

Joint object recognition network: inputs are the proposal’s TSDF volume and its projected image patch

SLIDE 39

Object Detection: “Deep Sliding Shapes”

Joint object recognition network:

SLIDE 40

Object Detection: “Deep Sliding Shapes”

Joint object recognition network:
  • 3D ConvNet on the TSDF: Conv1 + ReLU + Pool → Conv2 + ReLU + Pool → Conv3 + ReLU → FC2
  • 2D VGG (pretrained on ImageNet) on the image patch
  • Concatenation → FC3 → two heads: FC Class (softmax loss) and FC 3D Box (L1-smooth loss)
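A shape-level numpy sketch of the fusion step: concatenate the 3D-ConvNet and VGG features, apply an FC layer, then split into a softmax class head and a 3D-box regression head. All dimensions, weights, and the class count are hypothetical stand-ins for the learned layers; only the wiring is the point.

```python
import numpy as np

rng = np.random.default_rng(1)
N_CLASSES = 19   # hypothetical class count

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Stand-in features; in the real network these come from the 3D ConvNet
# on the proposal's TSDF and from a 2D VGG on the projected image patch.
# Dimensions are shrunk for the sketch.
feat_3d = rng.standard_normal(256)
feat_2d = rng.standard_normal(256)

fused = np.concatenate([feat_3d, feat_2d])                     # concatenation
fc3 = np.maximum(0.0, rng.standard_normal((100, 512)) * 0.1 @ fused)  # FC3+ReLU
class_scores = softmax(rng.standard_normal((N_CLASSES, 100)) * 0.1 @ fc3)
box_deltas = rng.standard_normal((6, 100)) * 0.1 @ fc3   # (cx, cy, cz, w, h, d)
```

At training time the class head would take a softmax cross-entropy loss and the box head an L1-smooth loss, as on the slide.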

SLIDE 41

Object Detection: “Deep Sliding Shapes” Experiments

Train and test on amodal boxes provided in SUN RGB-D

  • S. Song, S. Lichtenberg, and J. Xiao, “SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite,” CVPR 2015
SLIDE 42

Object Detection: “Deep Sliding Shapes” Results

Quantitative comparisons: object detection accuracy (mAP) on the NYU v2 dataset, against 2D deep learning, 3D deep learning, and 3D non-deep-learning baselines

SLIDE 43

Object Detection: “Deep Sliding Shapes” Results

Qualitative comparison. Sliding Shapes: sofa; ours: bathtub

SLIDE 44

Object Detection: “Deep Sliding Shapes” Results

Qualitative comparison. Sliding Shapes: chair; ours: sofa

SLIDE 45

Object Detection: “Deep Sliding Shapes” Results

Qualitative comparison. Sliding Shapes: table; ours: bed

SLIDE 46

Object Detection: “Deep Sliding Shapes” Results

Qualitative comparison. Sliding Shapes: miss; ours: table and chairs

SLIDE 47

Object Detection: “Deep Sliding Shapes” Results

Qualitative comparison. Sliding Shapes: toilet; ours: garbage bin + bed

SLIDE 48

Talk Outline

  • Local shape descriptor
  • Amodal object detection
  • Semantic scene completion

Scale: small → large

  • S. Song, F. Yu, A. Zeng, A. Chang, M. Savva, and T. Funkhouser, “Semantic Scene Completion from a Single Depth Image,” submitted to CVPR 2017

SLIDE 49

Semantic Scene Completion

Goal: given an RGB-D image, label all voxels by semantic class

Input: single-view depth map → Output: semantic scene completion

SLIDE 50

Semantic Scene Completion

Goal: given an RGB-D image, label all voxels by semantic class

3D scene voxel types:
  • Visible surface
  • Free space
  • Occluded space
  • Outside view
  • Outside room


SLIDE 52

Semantic Scene Completion

Prior work addresses segmentation OR completion:
  • Surface segmentation: Silberman et al.
  • Scene completion: Firman et al.
  • Semantic scene completion: this paper

The occupancy and the object identity are tightly intertwined!

SLIDE 53

Semantic Scene Completion: “SSCNet”

Approach: an end-to-end 3D ConvNet (SSCNet). Input: single-view depth map; output: semantic scene completion, predicting one of N+1 classes per voxel


SLIDE 56

Semantic Scene Completion: “SSCNet”

Encode 3D space using flipped TSDF

SLIDE 57

Semantic Scene Completion: “SSCNet”

Encode 3D space using flipped TSDF; voxel size: 0.02 m
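The flipped TSDF can be illustrated in 1D: rather than storing the raw truncated distance, the value peaks at the surface and decays to zero at the truncation band, which concentrates signal where the geometry is. This toy is an unsigned simplification (SSCNet's encoding also carries sign to separate visible from occluded space), and the 0.24 m truncation is an assumed value.

```python
import numpy as np

def flipped_tsdf(dist, trunc=0.24):
    """Flip a truncated distance so magnitude peaks at the surface
    (1 at dist = 0) and decays to 0 at the truncation band.
    Simplified, unsigned variant of SSCNet's encoding."""
    d = np.minimum(np.abs(dist), trunc)
    return 1.0 - d / trunc

# Distances of some voxel centers to the nearest surface (meters).
dist = np.array([0.0, 0.06, 0.12, 0.24, 1.0])
enc = flipped_tsdf(dist)   # surface voxels get the largest values
```

Far voxels all encode to 0 instead of a saturated truncation constant, so the strongest activations sit on and near surfaces.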

SLIDE 58

Semantic Scene Completion: “SSCNet”

Local geometry; receptive field: 0.98 m

SLIDE 59

Semantic Scene Completion: “SSCNet”

High-level 3D context via a large receptive field (2.26 m) provided by dilated convolution

SLIDE 60

Semantic Scene Completion: “SSCNet”

Multi-scale aggregation: combine features with receptive fields of 0.98 m, 1.62 m, and 2.26 m
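Dilated convolution, which provides the large receptive fields being aggregated, can be shown in 1D: a kernel of size k with dilation r covers r*(k-1)+1 inputs using the same k weights. SSCNet's layers are 3D; this sketch only demonstrates the mechanism.

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """'Valid' 1D correlation with a dilated kernel: taps are spaced
    `dilation` samples apart, widening the receptive field for free."""
    k = len(w)
    span = dilation * (k - 1) + 1        # effective receptive field
    return np.array([
        sum(w[j] * x[i + j * dilation] for j in range(k))
        for i in range(len(x) - span + 1)
    ])

x = np.arange(10, dtype=float)
w = np.array([1.0, 1.0, 1.0])

y1 = dilated_conv1d(x, w, dilation=1)   # each output spans 3 inputs
y2 = dilated_conv1d(x, w, dilation=2)   # spans 5 inputs, same 3 weights
```

Stacking layers with increasing dilation grows the receptive field multiplicatively while keeping resolution, which is how SSCNet gathers room-scale context without aggressive pooling.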

SLIDE 61

Semantic Scene Completion: “SSCNet” Experiments

Where to get training data?

SLIDE 62

Semantic Scene Completion: “SSCNet” Experiments

Where to get training data?

No dense volumetric ground truth with semantic labels exists for complete scenes:
  • SUN3D: no semantic labels
  • NYU: only visible surfaces

SLIDE 63

Semantic Scene Completion: “SSCNet” Experiments

SUNCG dataset

SLIDE 64

Semantic Scene Completion: “SSCNet” Experiments

SUNCG dataset

  • 46K houses
  • 50K floors
  • 400K rooms
  • 5.6M object instances
SLIDE 65

Semantic Scene Completion: “SSCNet” Experiments

SUNCG dataset

Synthetic camera views → rendered depth → ground-truth semantic scene completion


SLIDE 67

Train on SUNCG, test on NYU

Semantic Scene Completion: “SSCNet” Experiments

SLIDE 68

Semantic Scene Completion: “SSCNet” Results

Result: better than previous volumetric completion algorithms

Comparison to previous algorithms for volumetric completion

SLIDE 69

(Figure panels: color image, observed surface, Zhang et al., Firman et al., ours (SSCNet), ground truth)

SLIDE 70

Semantic Scene Completion: “SSCNet” Results

Result: better than previous 3D model fitting algorithms

Comparison to previous algorithms for 3D model fitting

SLIDE 71

(Figure panels: color image, observed surface, Geiger and Wang, Lin et al., ours (SSCNet), ground truth)

SLIDE 72

(Additional example with the same comparison panels.)

SLIDE 73

(Additional example with the same comparison panels.)

SLIDE 74

Summary

Three projects where ConvNets are trained to recognize patterns in voxels with different …

  • Tasks
  • Scales
  • Training data
  • Loss functions
  • Network architectures
  • Training protocols
SLIDE 75

Future Challenges

  • Acquiring larger data sets
  • Leveraging geometric structure
  • Leveraging semantic structure
  • Better integration of RGB and D
  • Better surface parameterizations
  • Finer-grained categories
  • Higher resolution
  • etc.


SLIDE 77

Future Challenges: acquiring larger data sets

ScanNet: 1,500 surface reconstructions, 36,213 labeled objects

  • A. Dai, A. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Niessner, “ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes,” submitted to CVPR 2017

SLIDE 78

Future Challenges: leveraging geometric structure

  • M. Halber and T. Funkhouser, “Fine-to-Coarse Registration of RGB-D Scans,” submitted to CVPR 2017


SLIDE 81

Future Challenges: leveraging semantic structure

Example (“sleeping area” scene parse): bed, sofa, ottoman, nightstand, lamp, wall, dresser, dresser with mirror

  • Y. Zhang, M. Bai, J. Xiao, P. Kohli, and S. Izadi, “DeepContext: Context-Encoding Neural Pathways for 3D Holistic Scene Understanding,” submitted to CVPR 2017

SLIDE 82

Acknowledgments

Princeton:

  • Angel Chang, Maciej Halber, Manolis Savva, Elena Sizikova, Shuran Song, Jianxiong Xiao, Fisher Yu, Yinda Zhang, Andy Zeng

Collaborators:

  • Angela Dai, Matt Fisher, Matthias Niessner, Ersin Yumer

Data:

  • SUN3D, 7-Scenes, Analysis-by-Synthesis, NYU, Trimble, Planner5D

Funding:

  • Intel, NSF, Adobe

Thank You!