SLIDE 1

3D Scene Reconstruction with Multi-layer Depth and Epipolar Transformers

to appear, ICCV 2019

SLIDE 2

Goal: 3D scene reconstruction from a single RGB image

(Figure: input RGB image and the reconstructed 3D scene, SUNCG ground truth)

SLIDE 3

(Figure: object-centered vs. viewer-centered coordinate frames; multi-surface and voxel shape representations)

Pixels, voxels, and views: A study of shape representations for single view 3D object shape prediction (Shin, Fowlkes, Hoiem; CVPR 18)

Question: What effect does shape representation have on prediction?

SLIDE 4

Coordinate system is an important part of shape representation

(Figure: object-centered vs. viewer-centered coordinate axes)

CVPR 18

SLIDE 5

Synthetic training data

CVPR 18

SLIDE 6

Multi-surface Prediction

Surfaces vs. voxels for 3D object shape prediction

(Figure: RGB image, predicted mesh, and predicted voxels; voxel-based 3D reconstruction uses 3D convolution (the most common approach), while multi-surface prediction uses 2D convolution)

CVPR 18

SLIDE 7

(Figure: object-centered vs. viewer-centered coordinate frames; multi-surface and voxel shape representations)

Question: What effect does shape representation have on prediction?

CVPR 18

SLIDE 8

Network architecture for surface prediction

CVPR 18

SLIDE 9

Experiments

  • Three difficulty settings (how well does the prediction generalize?)
    – Novel view: a new view of a model that is in the training set
    – Novel model: a new model from a category that is in the training set
    – Novel category: a new model from a category that is not in the training set

  • Evaluation metrics: mesh surface distance, voxel IoU, depth L1 error (see the sketch below)

  • The same procedure is applied in all four cases.
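A minimal sketch of two of these metrics, voxel IoU and depth L1 error, on NumPy arrays; mesh surface distance, which needs point-to-mesh queries, is omitted, and the function names are illustrative:

```python
import numpy as np

def voxel_iou(pred, gt):
    """Intersection-over-union between two boolean occupancy grids."""
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(intersection) / union if union else 1.0

def depth_l1(pred, gt):
    """Mean absolute depth error over pixels with valid ground truth."""
    valid = np.isfinite(gt) & (gt > 0)
    return float(np.abs(pred[valid] - gt[valid]).mean())
```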

CVPR 18

SLIDE 10

What effect does coordinate system have on prediction?

Viewer-centered vs. object-centered

(Charts: voxel IoU (mean, higher is better) and depth error (mean, lower is better))

CVPR 18

SLIDE 11

What effect does shape representation have on prediction?

Voxels vs. multi-surface

(Charts: voxel IoU (mean, higher is better) and surface distance (mean, lower is better))

CVPR 18

SLIDE 12

CVPR 18

SLIDE 13

SLIDE 14

(Figure: input, GT, and object-centered prediction (3D-R2N2); inspiring examples from 3D-R2N2's supplementary material)

SLIDE 15

SLIDE 16

Shape representation is important in learning and prediction.

  • Viewer-centered representations generalize better to difficult inputs, such as novel object categories.

  • 2.5D surfaces (depth and segmentation) tend to generalize better than voxels and predict higher-fidelity shapes (e.g., thin structures).

SLIDE 17

Viewer-centered vs. Object-centered: Human vision

  • Tarr and Pinker [1] found that human perception is largely tied to viewer-centered coordinates, in experiments on 2D symbols.

  • McMullen and Farah [2] found that object-centered coordinates seem to play more of a role for familiar exemplars, in line-drawing experiments.

  • We do not claim that our computational approach has any similarity to human visual processing.

[1] M. J. Tarr and S. Pinker. When does human object recognition use a viewer-centered reference frame? Psychological Science, 1(4):253–256, 1990.
[2] P. A. McMullen and M. J. Farah. Viewer-centered and object-centered representations in the recognition of naturalistic line drawings. Psychological Science, 2(4):275–278, 1991.

SLIDE 18

Follow-up work (Tatarchenko et al., CVPR 19):

  • They observe that state-of-the-art single-view 3D object reconstruction methods actually perform image classification, and that retrieval performance is just as good.

  • Following our CVPR 18 work, they recommend the use of viewer-centered coordinate frames.

SLIDE 19

Follow-up work (Zhang et al., NIPS 18 oral):

  • Zhang et al. perform single-view reconstruction of objects in novel categories.
  • Their viewer-centered approach achieves state-of-the-art results.
  • Following our CVPR 18 work, they experiment with both object-centered and viewer-centered models and validate our findings.

SLIDE 20

How can we extend viewer-centered, surface-based object representations to whole scenes?

SLIDE 21

Background: the typical monocular depth estimation pipeline

Viewer-centered visible geometry inference

(Figure: predicted depth (2.5D surface) evaluated against GT depth with a pixel-wise error map)

What about the rest of the scene?

SLIDE 22

2.5D in relation to 3D

(Figure: predicted depth, the predicted depth rendered as a 3D mesh, and the ground-truth 3D mesh used for evaluation)

  • 3D requires predicting both visible and occluded surfaces!
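To make the 2.5D-to-3D relationship concrete, here is a minimal sketch of lifting a depth image to the viewer-centered point cloud that the mesh above visualizes (assuming a pinhole camera with known intrinsics K; the function name is illustrative):

```python
import numpy as np

def depth_to_points(depth, K):
    """Backproject an (H, W) metric depth image into camera-frame 3D points."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pixels = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1)
    # Each pixel's viewing ray, scaled by its depth, is a 3D surface point.
    points = (np.linalg.inv(K) @ pixels) * depth.reshape(1, -1)
    return points.T  # (H*W, 3), viewer-centered coordinates
```

Triangulating neighboring points of this cloud yields the 2.5D mesh shown above; everything occluded in the input view is missing, which is exactly what the multi-layer representation addresses next.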
SLIDE 23

Multi-layer Depth

SLIDE 24

Synthetic dataset

(Figure: CAD model of a 3D scene (SUNCG ground truth, CVPR 17) and its RGB rendering, produced with physically-based rendering (PBRS, CVPR 17))

SLIDE 25

Object First-hit Depth Layer

Learning target D1: the traditional depth image, with segmentation.

SLIDE 26

Object Instance-exit Depth Layer

Learning target D2: the back of the first object instance.

SLIDE 27

Room Envelope Depth Layer

Learning target D5: the room envelope (the walls, floor, and ceiling enclosing the scene); a sketch of all three layers follows.
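A minimal sketch of how these per-pixel targets could be derived by casting a ray through each pixel of the synthetic scene. The intersector producing the sorted hit list is left abstract, the hit format is an assumption, and the occluded-object layers (D3, D4) are omitted:

```python
import numpy as np

def multilayer_depth(hits):
    """Compute (D1, D2, D5) for one pixel from its sorted ray-mesh hits.

    `hits`: list of (depth, instance_id, is_room) tuples along the pixel's
    viewing ray, sorted by increasing depth (hypothetical intersector output).
    """
    d1 = d2 = d5 = np.nan
    first_instance = None
    for depth, instance_id, is_room in hits:
        if is_room:
            d5 = depth                               # D5: room envelope surface
        elif first_instance is None:
            first_instance, d1 = instance_id, depth  # D1: first object hit
        elif instance_id == first_instance:
            d2 = depth                               # D2: exit of that instance
    if np.isnan(d2):
        d2 = d1                                      # open/thin geometry fallback
    return d1, d2, d5
```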

SLIDE 28

Multi-layer Surface Prediction

(Figure: input RGB image → encoder-decoder → predicted multi-layer depth and semantic segmentation)

SLIDE 29

Multi-layer Surface Prediction

(Figure: input RGB image → multi-layer depth prediction and segmentation → surface reconstruction from multi-layer depth)

SLIDE 30

3D scene geometry from depth (2.5D)

  • How much geometric information is present in a depth image?

(Figure: RGB image (2D) and 2.5D depth; mesh representation of a synthetically generated depth image, SUNCG)

SLIDE 31

Epipolar Feature Transformers

SLIDE 32

Multi-layer is not enough: motivation for multi-view prediction

(Ground-truth depth visualization: 2.5D (objects only); multiple layers of 2.5D; multiple views of 2.5D, including a top-down view)

SLIDE 33

Multi-view prediction from a single image:

Epipolar Feature Transformer Networks

SLIDE 34

Multi-view prediction from a single image:

Epipolar Feature Transformer Networks

SLIDE 35

Virtual Viewpoint Proposal: (tx, ty, tz, θ, σ)

Transformed Virtual View Features (117 channels total):
  • Transformed RGB: 3 channels
  • "Best guess" depth: 1 channel
  • Frustum mask: 1 channel
  • Transformed depth feature map: 48 channels
  • Transformed segmentation feature map: 64 channels

These features feed the virtual view surface prediction; a sketch of the warping step follows.
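A minimal sketch of the feature transformation step, under the assumption that it can be approximated by forward-warping input-view features into the virtual view using the predicted first-layer depth. Shared pinhole intrinsics K and all names are assumptions; the paper's epipolar transformer is more general than this splatting approximation:

```python
import numpy as np

def warp_to_virtual_view(feats, depth, K, R, t, out_hw):
    """Forward-warp (C, H, W) input-view features into a virtual view.

    depth: (H, W) predicted first-layer depth for the input view.
    R, t: transform taking input-camera points into the virtual camera frame.
    Returns the transformed feature map, a "best guess" depth (z-buffer),
    and a coverage/frustum mask for the virtual view.
    """
    C, H, W = feats.shape
    Ho, Wo = out_hw
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    rays = np.linalg.inv(K) @ np.stack([u, v, np.ones_like(u)], 0).reshape(3, -1)
    pts = R @ (rays * depth.reshape(1, -1)) + t[:, None]  # points in virtual frame
    uvw = K @ pts
    z = uvw[2]
    px = np.round(uvw[0] / np.maximum(z, 1e-6)).astype(int)
    py = np.round(uvw[1] / np.maximum(z, 1e-6)).astype(int)
    ok = (z > 0) & (px >= 0) & (px < Wo) & (py >= 0) & (py < Ho)

    out = np.zeros((C, Ho, Wo), dtype=feats.dtype)
    zbuf = np.full((Ho, Wo), np.inf)
    mask = np.zeros((Ho, Wo), dtype=bool)
    src = feats.reshape(C, -1)
    for i in np.flatnonzero(ok):          # z-buffered splat: nearest point wins
        y, x = py[i], px[i]
        if z[i] < zbuf[y, x]:
            zbuf[y, x], mask[y, x] = z[i], True
            out[:, y, x] = src[:, i]
    return out, np.where(mask, zbuf, 0.0), mask
```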

SLIDE 36

(Figure: transformed virtual-view features → height map prediction, with ground truth and an L1 error map)

SLIDE 37

Multi-layer Multi-view Inference

(Figure: input image → frontal multi-layer prediction and height map prediction → frontal-view and virtual-view surface reconstructions)

SLIDE 38

Network architecture for multi-layer depth prediction

SLIDE 39

Network architecture for multi-layer semantic segmentation

SLIDE 40

Network architecture for virtual camera pose proposal

SLIDE 41

Network architecture for virtual view surface prediction

SLIDE 42

Network architecture for virtual view semantic segmentation

SLIDE 43

SLIDE 44

Layer-wise cumulative surface coverage

SLIDE 45

Results

SLIDE 46

Input View / Alternate viewpoint

SLIDE 47

Input View / Alternate viewpoint

SLIDE 48

Previous state of the art, based on object detection and volumetric object shape prediction:

  • "Factoring Shape, Pose, and Layout from the 2D Image of a 3D Scene" by Tulsiani et al., CVPR 2018
  • 3D scene geometry prediction from a single RGB image
SLIDE 49

Object-based reconstruction is sensitive to detection and pose estimation errors

(Figure: our viewer-centered, end-to-end scene surface prediction vs. the object-detection-based state of the art, Tulsiani et al., CVPR 18)

SLIDE 50

Results on real-world images: object detection error and geometry

SLIDE 51

Results on real-world images

SLIDE 52

Results on real-world images

SLIDE 53

Quantitative Evaluation Metric

(Figure: predicted 3D mesh and ground-truth 3D mesh)

"Inlier" threshold: the maximum point-to-surface distance at which a sampled point counts as a match.

SLIDE 54

Surface Coverage Precision-Recall Metrics

(Figure: GT surface from SUNCG and predicted surface)

SLIDE 55

Surface Coverage Precision-Recall Metrics

(Figure: GT surface from SUNCG and predicted surface)

I.i.d. point sampling on the predicted mesh, with constant density ρ = 10,000 points per unit area (m² in real-world scale); see the sketch below.
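A minimal sketch of this sampling step (pure NumPy; names are illustrative): triangles are chosen in proportion to their area, and points are drawn uniformly within each via reflected barycentric coordinates.

```python
import numpy as np

def sample_surface(vertices, faces, rho=10000, seed=0):
    """I.i.d. points on a triangle mesh at density `rho` (points per m^2)."""
    rng = np.random.default_rng(seed)
    tri = vertices[faces]                                   # (F, 3, 3)
    cross = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    area = 0.5 * np.linalg.norm(cross, axis=1)              # per-triangle area
    n = int(np.ceil(rho * area.sum()))                      # total point budget
    idx = rng.choice(len(faces), size=n, p=area / area.sum())
    r1, r2 = rng.random(n), rng.random(n)
    flip = r1 + r2 > 1.0                                    # fold into the triangle
    r1[flip], r2[flip] = 1.0 - r1[flip], 1.0 - r2[flip]
    t = tri[idx]
    return (t[:, 0]
            + r1[:, None] * (t[:, 1] - t[:, 0])
            + r2[:, None] * (t[:, 2] - t[:, 0]))
```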

SLIDE 56

Surface Coverage Precision-Recall Metrics

(Figure: GT surface from SUNCG and predicted surface; closest distance from each sampled point to the GT surface, compared against the "inlier" threshold)

Precision = (number of sampled points within the threshold) / (total number of points sampled on the predicted mesh)

SLIDE 57

Surface Coverage Precision-Recall Metrics

(Figure: GT surface from SUNCG and predicted surface)

SLIDE 58

Surface Coverage Precision-Recall Metrics

(Figure: GT surface from SUNCG and predicted surface)

I.i.d. point sampling on the GT mesh, with the same constant density ρ = 10,000 points per unit area (m² in real-world scale).

SLIDE 59

Surface Coverage Precision-Recall Metrics

(Figure: GT surface from SUNCG and predicted surface; closest distance from each sampled point to the predicted surface, compared against the "inlier" threshold)

Recall = (number of sampled points within the threshold) / (total number of points sampled on the GT mesh)
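Putting the two directions together, a minimal sketch using SciPy's cKDTree. Point-to-surface distance is approximated here by nearest-neighbor distance between the two dense point samples, which is an approximation of, not necessarily identical to, the paper's exact procedure:

```python
import numpy as np
from scipy.spatial import cKDTree

def coverage_precision_recall(pred_pts, gt_pts, threshold=0.05):
    """Surface-coverage precision/recall from i.i.d. surface samples."""
    d_pred = cKDTree(gt_pts).query(pred_pts)[0]   # predicted -> GT distances
    d_gt = cKDTree(pred_pts).query(gt_pts)[0]     # GT -> predicted distances
    precision = float(np.mean(d_pred <= threshold))
    recall = float(np.mean(d_gt <= threshold))
    return precision, recall
```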

SLIDE 60

Our multi-layer, virtual-view depths vs. the object-detection-based state of the art (2018)

(Charts: surface-coverage precision and recall; legend: multi-layer + virtual-view (ours))

SLIDE 61

Layer-wise evaluation

SLIDE 62

Top-down virtual-view prediction improves both precision and recall

(match threshold of 5 cm)

SLIDE 63

Synthetic-to-real transfer of 3D scene geometry on ScanNet

We measure recovery of true object surfaces and room layouts within the viewing frustum (threshold of 10 cm).

SLIDE 64

Voxelization of multi-layer depth maps

(Figure: input image → our fully convolutional, viewer-centered inference of 3D scene geometry → output; panels: semantic segmentation, front surfaces, back surfaces, visible objects, invisible objects, high-resolution voxels)

We project the center of each voxel into the input camera, and the voxel is marked occupied if the depth value falls in the first object interval (D1, D2) or the occluded object interval (D3, D4); a sketch follows.
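A minimal sketch of this rule in NumPy; the voxel grid is assumed axis-aligned in the camera frame, and all names are illustrative:

```python
import numpy as np

def voxelize_multilayer(D, K, grid_min, voxel_size, dims):
    """Occupancy grid from multi-layer depth maps.

    D: dict of (H, W) depth layers "D1".."D4" (object entry/exit pairs).
    A voxel is occupied if its center, projected into the input camera,
    has a depth inside (D1, D2) or (D3, D4) at that pixel.
    """
    i, j, k = np.meshgrid(*(np.arange(d) for d in dims), indexing="ij")
    centers = grid_min + (np.stack([i, j, k], axis=-1) + 0.5) * voxel_size
    uvw = centers.reshape(-1, 3) @ K.T            # project voxel centers
    z = uvw[:, 2]
    u = np.round(uvw[:, 0] / np.maximum(z, 1e-6)).astype(int)
    v = np.round(uvw[:, 1] / np.maximum(z, 1e-6)).astype(int)
    H, W = D["D1"].shape
    ok = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    occ = np.zeros(z.shape, dtype=bool)
    ui, vi, zi = u[ok], v[ok], z[ok]
    in_first = (zi >= D["D1"][vi, ui]) & (zi <= D["D2"][vi, ui])
    in_occl = (zi >= D["D3"][vi, ui]) & (zi <= D["D4"][vi, ui])
    occ[ok] = in_first | in_occl
    return occ.reshape(dims)
```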

SLIDE 65

Voxelization of multi-layer depth maps: additional example (input image and output panels as above).

SLIDE 66

Voxelization of multi-layer depth maps: additional example (input image and output panels as above).

SLIDE 67

Voxelization of multi-layer depth maps: additional example highlighting recovered occluded structure (input image and output panels as above).

SLIDE 68

Voxelization of multi-layer depth maps: additional example (input image and output panels as above).

SLIDE 69

Voxelization of multi-layer depth maps: additional example (input image and output panels as above).

SLIDE 70

Voxelization of multi-layer depth maps: additional example (input image and output panels as above).

SLIDE 71

Voxelization of multi-layer depth maps: additional example (input image and output panels as above).

SLIDE 72

Voxelization of multi-layer depth maps: additional example (input image and output panels as above).

SLIDE 73

Voxelization of multi-layer depth maps: additional example (input image and output panels as above).

SLIDE 74

Supplemental Video

SLIDE 75

Conclusion

  • Multi-layer and virtual-view prediction from a single image
  • Surface-based accuracy evaluation
  • Synthetic-to-real transfer of 3D scene geometry prediction, evaluated quantitatively
  • Geometric comparison with detection-based voxel prediction methods

(Figure: real-world input image and real-world ground truth)

@DaeyunShin Code and dataset coming soon. Follow on Twitter for updates!