Synthesizing 3D Shapes via Modeling Multi-View Depth Maps and Silhouettes with Deep Generative Networks - PowerPoint PPT Presentation



SLIDE 1

Amir A. Soltani Haibin Huang Jiajun Wu Tejas Kulkarni Josh Tenenbaum

Synthesizing 3D Shapes via Modeling Multi-View Depth Maps and Silhouettes with Deep Generative Networks

07/21/2017

Samples

Out-of-Sample Generalization

SLIDE 2

Motivation - Autonomous Vehicles

SLIDE 3

Motivation - Robotics

SLIDE 4

Motivation

  • Computer vision cannot rely on 2D data alone to solve 3D problems
  • We need good 3D representations to solve inverse problems
  • A generative model for 3D is a good starting point (though much more is needed)
  • Good progress has been made in the past two or three years
  • Still, the choice of 3D representation is debated
  • Each representation has advantages and disadvantages
  • So far there is no consensus on which representation to use
SLIDE 5

Choice of Representation

  • Multi-view
  • Template-based
  • Point clouds
  • Meshes
  • Voxels

SLIDE 6

3D Representation - Voxels

  • Computational complexity is very high (O(n³)) if used naively
  • Cannot model high-resolution shapes
  • Details are easily lost
  • Highly sparse at higher resolutions
  • Cannot easily model regular structures
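The cubic growth in the first bullet is easy to quantify. A minimal sketch, assuming 4 bytes per voxel (float32 occupancy; the resolutions are illustrative):

```python
# Sketch: why naive dense voxel grids scale as O(n^3) in memory.
# 4 bytes per voxel (float32 occupancy) is an illustrative assumption.
def voxel_grid_bytes(res, bytes_per_voxel=4):
    """Memory needed for a dense res x res x res occupancy grid."""
    return res ** 3 * bytes_per_voxel

for res in (32, 64, 128, 256):
    mb = voxel_grid_bytes(res) / 2 ** 20
    print(f"{res:>3}^3 grid: {mb:8.1f} MiB")
```

At 256³ a single dense float32 grid already takes 64 MiB, which is why high-resolution voxel models are rarely stored or predicted densely.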
SLIDE 7

3D Representation - Voxels

  • Directly predicting high-resolution voxel-based outputs is very hard
  • Highest so far is 64 × 64 × 64
  • One model per object category

Wu et al, NIPS 2016

SLIDE 8

3D Representation - Point Clouds

  • Not much work has used point clouds so far
  • Things become mathematically involved from here
  • The choice of loss function, non-differentiability issues, etc.
  • Not obvious how many points to use
Image courtesy: Hao Su

  • Details Will Be Missing
SLIDE 9

3D Representation - Point Clouds

Su et al, CVPR 2017

SLIDE 10

3D Representation - Meshes

  • Cannot directly apply out-of-the-box models to meshes
  • Mathematically involved
  • Can also be seen as a graph

Image courtesy: Hao Su

  • Need to construct special kinds of kernels for CNNs
SLIDE 11

3D Representation - Template-Based (CAD)

Image courtesy: Haibin Huang

  • Again, not easy to apply out-of-the-box models to
  • Data is very hard to obtain
  • Offers compositionality intrinsically and explicitly
  • Hard to model shapes never seen before
  • Might be a good option for learning functionalities
SLIDE 12

3D Representation - Multi-View

  • The multi-view representation is very lightweight
  • Can easily apply out-of-the-box CNN models to it
  • Offers flexibility (depth maps) and significantly eases computation
  • Although 2D, it still explicitly models 3D shapes
  • Allows generating high-res, detailed, novel objects
  • Without the machinery required for new voxel-based models
  • Not mathematically involved
  • More intuitive
SLIDE 13

Motivations

  • Synthesize/Generate Hi-Res, Detailed and Novel Shapes
  • Use Some Sort of a Representation Whose Data is Easily Obtainable
  • No doubt that 2D images, RGB-D, or depth alone are very easy to obtain
  • Have Out-of-Sample Generalizability
  • A Step Forward Towards Obtaining 3D Concepts Efficiently to Solve Inverse Vision Problems
  • Model 3D via 2D (inspired by biological vision)
  • Share the Same Representations For All Categories
SLIDE 14

Pipeline - Data Set

  • Used ShapeNet Core
  • Contains aligned, normalized shapes
  • ~37k shapes for training, ~3k for testing
  • Rendered 20 depth-map views per shape
  • Camera positions fixed
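The slides only say the 20 camera positions are fixed; one natural placement (an assumption here, not stated in the deck) is the 20 vertices of a regular dodecahedron, which spreads the views evenly over a viewing sphere:

```python
# Sketch: 20 fixed camera positions as the vertices of a regular
# dodecahedron, normalized onto a unit viewing sphere. This placement
# is an assumption; the slides only state that the 20 poses are fixed.
import itertools
import math

def dodecahedron_viewpoints():
    phi = (1 + math.sqrt(5)) / 2  # golden ratio
    verts = []
    # 8 cube vertices (+-1, +-1, +-1)
    verts += list(itertools.product((-1, 1), repeat=3))
    # 12 vertices: cyclic permutations of (0, +-1/phi, +-phi)
    for a, b in itertools.product((-1 / phi, 1 / phi), (-phi, phi)):
        verts += [(0, a, b), (a, b, 0), (b, 0, a)]
    # every vertex has norm sqrt(3); project onto the unit sphere
    return [tuple(c / math.sqrt(sum(x * x for x in v)) for c in v) for v in verts]

views = dodecahedron_viewpoints()
print(len(views))  # 20 camera positions
```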
SLIDE 15

Pipeline - Architectures

  • Train 3 different VAE models
  • AllVPNet: trained with all 20 views
  • DropoutNet: trained with 2-5 randomly chosen views
  • SingleVPNet: trained with 1 randomly chosen view
  • The z layer has 100 nodes for the unconditional model and 40 for the conditional
  • An L1 loss function is used during training
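The training objective can be sketched as a standard VAE loss with an L1 reconstruction term, as the bullets describe. The toy depth map, encoder outputs, and perfect "decoder" below are illustrative assumptions, not the paper's architecture:

```python
# Sketch of the objective: VAE with an L1 reconstruction term.
# Shapes and the toy encoder/decoder values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
Z_DIM = 100  # 100 latent nodes (unconditional model); 40 for conditional

def vae_l1_loss(x, x_recon, mu, log_var):
    """L1 reconstruction error + KL(q(z|x) || N(0, I))."""
    recon = np.abs(x - x_recon).sum()
    kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
    return recon + kl

x = rng.random((1, 64, 64))  # a toy depth map
mu = np.zeros(Z_DIM)         # encoder mean (illustrative)
log_var = np.zeros(Z_DIM)    # unit variance, so the KL term is 0
z = mu + np.exp(0.5 * log_var) * rng.standard_normal(Z_DIM)  # reparameterized sample a real decoder would consume
x_recon = x.copy()           # perfect "decoder" for the demo
print(vae_l1_loss(x, x_recon, mu, log_var))  # 0.0: zero L1 error, zero KL
```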
SLIDE 16

Pipeline - Architectures


SLIDE 17

Pipeline - 3D Reconstruction

  • A deterministic function is used to generate the final 3D point cloud
  • The number of points ranges from ~30k to ~400k depending on shape complexity
  • Not fixed!
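One way such a deterministic function can work (a sketch under an assumed pinhole camera model, not necessarily the paper's exact procedure) is to back-project every foreground depth pixel into camera space; fusing all 20 views then yields a point count that varies with how much surface the shape exposes:

```python
# Sketch: deterministically lifting a depth map to a 3D point cloud
# with a pinhole camera model. The intrinsics (fx, fy, cx, cy) are
# illustrative assumptions.
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Lift each valid depth pixel (u, v, d) to a camera-space 3D point."""
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]  # drop background (zero-depth) pixels

depth = np.zeros((4, 4))
depth[1:3, 1:3] = 2.0  # a 2x2 foreground patch at depth 2
pts = depth_to_points(depth, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
print(pts.shape)  # (4, 3): one point per foreground pixel
```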
SLIDE 18

Results - Sampling

Random Sampling

SLIDE 19

Results - Sampling

Random Samples’ Nearest Neighbors

(Columns: training set, reconstruction, random sample)

SLIDE 20

Results - Sampling

Random Samples

SLIDE 21

Results - Sampling

More Random Samples

SLIDE 22

Results - Sampling

More Random Samples

SLIDE 23

Results - Sampling

Conditional Sampling

SLIDE 24

Results - Sampling

Conditional Sampling

SLIDE 25

Results - Sampling

Conditional Samples

SLIDE 26

Results - Conditional Sampling

Nearest Neighbors

(Columns: training set, reconstruction, conditional sample)
SLIDE 27

Results - Conditional Sampling

Nearest Neighbors

(Columns: training set, reconstruction, conditional sample)
SLIDE 28

Results - Reconstruction

SLIDE 29

Results - Classification

Classification, Reconstruction Error

SLIDE 30

Results - Reconstruction

Out-of-Sample Generalization

  • Put silhouettes/depth maps onto 224 × 224 canvases
  • Camera pose not fixed
  • Different sizes and orientations
  • NYUD and silhouettes from the internet
  • Images scaled to fit
  • The remaining results are all obtained with the SingleVPNet model
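The canvas preparation above can be sketched as scale-to-fit followed by center-pasting; nearest-neighbor resizing and centering are assumptions here, as the slides do not specify them:

```python
# Sketch: scale a silhouette/depth crop to fit, then paste it centered
# on a 224 x 224 canvas, as in the out-of-sample experiments.
# Nearest-neighbor resizing and center placement are assumptions.
import numpy as np

def fit_to_canvas(img, size=224):
    h, w = img.shape
    scale = size / max(h, w)  # scale so the longer side fits exactly
    nh, nw = max(1, round(h * scale)), max(1, round(w * scale))
    # nearest-neighbor resize via index sampling (no external deps)
    rows = (np.arange(nh) * h / nh).astype(int)
    cols = (np.arange(nw) * w / nw).astype(int)
    resized = img[np.ix_(rows, cols)]
    canvas = np.zeros((size, size), dtype=img.dtype)
    top, left = (size - nh) // 2, (size - nw) // 2
    canvas[top:top + nh, left:left + nw] = resized
    return canvas

sil = np.ones((100, 50))  # a tall silhouette crop
out = fit_to_canvas(sil)
print(out.shape)  # (224, 224), with the mask scaled to a 224 x 112 region
```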
SLIDE 31

Results - Reconstruction

Out-of-Sample Generalization (NYUD)

SLIDE 32

Results - Reconstruction

Out-of-Sample Generalization (Uncond. SingleVPNet - NYUD Silhouettes)

SLIDE 33

Results - Reconstruction

Out-of-Sample Generalization (Uncond. SingleVPNet - NYUD Silhouettes)

SLIDE 34

Results - Reconstruction

Out-of-Sample Generalization (Silhouettes From Web)

SLIDE 35

Results - Representation Analysis

Consistent Representation

  • Naturally, we would like to get the same shape across all views
  • Intuitively, uncertainty is actually part of consistency
  • Obtaining good priors is important!
SLIDE 36

Results - Analysis

Consistent Representation

SLIDE 37

Results - Analysis

Consistent Representation

SLIDE 38

Results - Analysis

Priors Matter!

SLIDE 39

Results - Analysis

What 3D shape is this?

SLIDE 40

Results - Analysis

SLIDE 41

Results - Analysis

  • Model’s Prediction: “airplane”
  • Obtaining good inductive biases is hard but helps a lot!
  • Behaves like a hierarchical prior
  • Quite meaningful and intuitive
SLIDE 42

Results - Analysis

Implicitly Learning About Parts

SLIDE 43

Results - Analysis

Implicitly Learning About Parts

SLIDE 44

Conclusion

  • We showed an effective paradigm for learning 3D shapes using a multi-view representation
  • The samples obtained look realistic, novel, and detailed
  • Out-of-sample generalization is attainable via good generative models plus meaningful priors
  • Hierarchical priors can effectively induce enough bias to generate meaningful results
  • Parts can be learned implicitly; explicitly learning parts for real-world tasks is hard
  • Strong inductive biases help recover meaningful 3D shapes from highly occluded inputs
SLIDE 45

Future Directions and Challenges

  • Current data sets are not sufficient for learning about 3D vision
  • 3D shapes are the end product of an underlying process: physics
  • Current data-driven approaches do not get us all the way to where we want to be
  • 3D shapes involve properties like material, mass, etc.
  • To meaningfully interact with 3D shapes we need to do more!
  • Learning fast and accurate physics simulators might be a good starting point
SLIDE 46

Future Directions and Challenges

Thank you!

SLIDE 47

Results - Conditional Sampling

Conditional Samples

SLIDE 48

Results - Classification, Recon. Err.

  • The goal is not to do classification or reconstruction per se, but to obtain hierarchical priors
  • Strong Regularization
SLIDE 49

Results - IoU

IoU numbers for ShapeNet Core
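The IoU metric reported here can be sketched as a voxel-wise intersection-over-union; the 4³ grids below are illustrative (actual evaluation grids are larger):

```python
# Sketch: intersection-over-union between two voxelized shapes,
# the metric reported for ShapeNet Core. Grid size is illustrative.
import numpy as np

def voxel_iou(a, b):
    """IoU of two boolean occupancy grids of the same shape."""
    a, b = a.astype(bool), b.astype(bool)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 1.0  # two empty grids match perfectly

a = np.zeros((4, 4, 4), dtype=bool); a[:2] = True   # 32 occupied voxels
b = np.zeros((4, 4, 4), dtype=bool); b[1:3] = True  # 32 voxels, 16 shared
print(voxel_iou(a, b))  # 16 / 48
```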

SLIDE 50

Results - Conditional Sampling

More Conditional Samples

SLIDE 51

Results - Conditional Sampling

More Conditional Samples

SLIDE 52

Results - Analysis

What about this?

SLIDE 53

Results - Analysis

SLIDE 54

Results - Analysis

SLIDE 55

Results - Analysis