

SLIDE 1

Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling

Jiajun Wu*   Chengkai Zhang*   Tianfan Xue   Bill Freeman   Josh Tenenbaum

(* indicates equal contribution)

NIPS 2016

SLIDE 2

Outline

  • Synthesizing 3D shapes
  • Recognizing 3D structure

SLIDE 3

Outline

Synthesizing 3D shapes

SLIDE 4

3D Shape Synthesis

Template-based models

  • Synthesizing realistic shapes
  • Requiring a large shape repository
  • Recombining parts and pieces

Image credit: [Huang et al., SGP 2015]

SLIDE 5

3D Shape Synthesis

Image credit: 3D ShapeNets [Wu et al., CVPR 2015]

Voxel-based deep generative model

  • Synthesizing new shapes
  • Hard to scale up to high resolution
  • Producing less realistic shapes
SLIDE 6

3D Shape Synthesis

Realistic + New


SLIDE 7

Adversarial Learning

Generative adversarial networks [Goodfellow et al., NIPS 2014]
DCGAN [Radford et al., ICLR 2016]
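For reference, the minimax objective of the original GAN [Goodfellow et al., NIPS 2014], which 3D-GAN adapts from 2D images to voxel grids:

$$\min_G \max_D \;\; \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\!\left[\log\left(1 - D(G(z))\right)\right]$$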

SLIDE 8

Our Synthesized 3D Shapes

[Figure: 3D shapes synthesized from sampled latent vectors]

SLIDE 9

3D Generative Adversarial Network

Latent vector → Generator → generated shape
Real shapes: training on ShapeNet [Chang et al., 2015]
Discriminator: real or generated?

SLIDE 10

3D Generative Adversarial Network

The Generator maps a latent vector to a generated shape; the Discriminator distinguishes generated shapes from real ones (training on ShapeNet [Chang et al., 2015]).
SLIDE 11

Generator Structure

Latent vector z → 512×4×4×4 → 256×8×8×8 → 128×16×16×16 → 64×32×32×32 → G(z) in 3D voxel space (64×64×64)
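A minimal PyTorch sketch of a generator with these layer sizes. The kernel size, strides, batch norm + ReLU, sigmoid output, and the 200-d latent are assumptions in the spirit of the paper, not the authors' released code:

```python
import torch
import torch.nn as nn

class Generator3D(nn.Module):
    """Latent vector -> 64x64x64 voxel grid, matching the slide's sizes:
    512x4x4x4 -> 256x8x8x8 -> 128x16x16x16 -> 64x32x32x32 -> 1x64x64x64."""
    def __init__(self, z_dim=200):  # 200-d latent assumed
        super().__init__()
        def up(cin, cout):
            # kernel 4, stride 2, padding 1 doubles each spatial dim (assumed)
            return nn.Sequential(
                nn.ConvTranspose3d(cin, cout, 4, stride=2, padding=1),
                nn.BatchNorm3d(cout),
                nn.ReLU(inplace=True))
        self.net = nn.Sequential(
            nn.ConvTranspose3d(z_dim, 512, 4),  # z (as a 1x1x1 volume) -> 512x4x4x4
            nn.BatchNorm3d(512),
            nn.ReLU(inplace=True),
            up(512, 256), up(256, 128), up(128, 64),
            nn.ConvTranspose3d(64, 1, 4, stride=2, padding=1),
            nn.Sigmoid())  # per-voxel occupancy probability

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1, 1))

G = Generator3D()
voxels = G(torch.randn(2, 200))  # -> torch.Size([2, 1, 64, 64, 64])
```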

SLIDE 12

Randomly Sampled Shapes

[Figure: randomly sampled chairs and sofas, compared with results from 3D ShapeNets]

SLIDE 13

Randomly Sampled Shapes

[Figure: randomly sampled cars and tables, compared with results from 3D ShapeNets]

SLIDE 14

Interpolation in Latent Space

SLIDE 15

Interpolation in Latent Space

[Figure: interpolation from a car to a boat in latent space]
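Interpolation itself is a one-liner: decode shapes along the straight line between two latent codes. A sketch reusing the Generator3D stub above, with z_car and z_boat as stand-ins for codes of actual shapes:

```python
import torch

z_car, z_boat = torch.randn(200), torch.randn(200)  # stand-in latent codes
t = torch.linspace(0, 1, steps=8).view(-1, 1)
z_path = (1 - t) * z_car + t * z_boat   # 8 evenly spaced codes on the line
shapes = G(z_path)                      # one 64^3 voxel grid per step
```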

SLIDE 16

Arithmetic in Latent Space

[Figure: vector arithmetic in latent space and the corresponding decoded shapes in shape space]
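As with word embeddings, adding and subtracting latent codes maps to semantic edits in shape space. A sketch with hypothetical codes z_a, z_b, z_c (again reusing the Generator3D stub):

```python
import torch

# e.g. (chair with arms) - (chair without arms) + (sofa) ~ sofa with arms
z_a, z_b, z_c = torch.randn(3, 200).unbind(0)  # stand-in latent codes
shape = G((z_a - z_b + z_c).unsqueeze(0))      # decode the edited code
```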

SLIDE 22

Unsupervised 3D Shape Descriptors

Shape → Discriminator ("Real?"); the extracted mid-level features serve as an unsupervised shape descriptor.

slide-23
SLIDE 23

3D Shape Classification

Shape → Discriminator ("Real?") → extracted mid-level features → linear SVM → category (e.g., chair)
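A sketch of this evaluation protocol. The paper concatenates max-pooled activations from several discriminator layers; here that is abstracted into a hypothetical D.features(x) hook, with scikit-learn providing the linear SVM:

```python
import numpy as np
from sklearn.svm import LinearSVC

# X_train / X_test: voxelized ModelNet shapes; y_train / y_test: labels.
# D.features(x) is a hypothetical hook returning the flattened mid-level
# activations of the frozen, adversarially trained discriminator.
F_train = np.stack([D.features(x) for x in X_train])
F_test = np.stack([D.features(x) for x in X_test])

svm = LinearSVC().fit(F_train, y_train)   # only the SVM ever sees labels
print("accuracy:", svm.score(F_test, y_test))
```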

slide-24
SLIDE 24

Supervision       Pretraining   Method                                ModelNet40   ModelNet10
Category labels   ImageNet      MVCNN [Su et al., 2015]                    90.1%            –
                  ImageNet      MVCNN-MultiRes [Qi et al., 2016]           91.4%            –
                  None          3D ShapeNets [Wu et al., 2015]             77.3%        83.5%
                  None          DeepPano [Shi et al., 2015]                77.6%        85.5%
                  None          VoxNet [Maturana and Scherer, 2015]        83.0%        92.0%
                  None          ORION [Sedaghat et al., 2016]                  –        93.8%
Unsupervised      –             SPH [Kazhdan et al., 2003]                 68.2%        79.8%
                  –             LFD [Chen et al., 2003]                    75.5%        79.9%
                  –             T-L Network [Girdhar et al., 2016]         74.4%            –
                  –             Vconv-DAE [Sharma et al., 2016]            75.5%        80.5%
                  –             3D-GAN (ours)                              83.3%        91.0%

3D Shape Classification Results


SLIDE 27

Limited Training Samples

  • Comparable with the best unsupervised features using only ~25 training samples per class
  • Comparable with the best voxel-based supervised descriptors using the entire training set

SLIDE 28

Discriminator Activations

Units respond to certain object shapes and their parts.

SLIDE 29

Extension: Single Image 3D Reconstruction

SLIDE 30

Model: 3D-VAE-GAN

Image input → variational image encoder → mapped latent vector → reconstructed shape

A variational image encoder maps an image to a latent vector for 3D object reconstruction.

Related: VAE-GAN [Larsen et al., ICML 2016], TL-Network [Girdhar et al., ECCV 2016]

SLIDE 31

Model: 3D-VAE-GAN

Image input → variational image encoder → mapped latent vector → Generator → reconstructed shape, with the Discriminator still judging real vs. generated shapes

We combine the encoder with 3D-GAN for reconstruction and generation.
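A sketch of the variational encoder's role: it predicts a distribution over latent codes for an image, samples via the usual reparameterization trick, and decodes with the 3D-GAN generator. The conv trunk and sizes below are placeholders, not the paper's architecture:

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Image -> (mu, log_var) -> sampled latent vector for the 3D generator."""
    def __init__(self, z_dim=200):
        super().__init__()
        self.backbone = nn.Sequential(       # placeholder conv trunk (assumed)
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.mu = nn.Linear(64, z_dim)
        self.log_var = nn.Linear(64, z_dim)

    def forward(self, img):
        h = self.backbone(img)
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterize
        return z, mu, log_var

E = ImageEncoder()
z, mu, log_var = E(torch.randn(1, 3, 256, 256))
reconstruction = G(z)  # decode the mapped latent vector into a 3D shape
```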

SLIDE 32

Single Image 3D Reconstruction

[Figure: input images and their reconstructed 3D shapes]

SLIDE 33

Single Image 3D Reconstruction

Method                                  Bed   Bookcase   Chair   Desk   Sofa   Table   Mean
AlexNet-fc8 [Girdhar et al., 2016]     29.5       17.3    20.4   19.7   38.8    16.0   23.6
AlexNet-conv4 [Girdhar et al., 2016]   38.2       26.6    31.4   26.6   69.3    19.1   35.2
T-L Network [Girdhar et al., 2016]     56.3       30.2    32.9   25.8   71.7    23.3   40.0
Our 3D-VAE-GAN (jointly trained)       49.1       31.9    42.6   34.8   79.8    33.1   45.2
Our 3D-VAE-GAN (separately trained)    63.2       46.3    47.2   40.7   78.8    42.3   53.1

Average precision on IKEA dataset [Lim et al., ICCV 2013]

SLIDE 34

Contributions of 3D-GAN

  • Synthesizing new and realistic 3D shapes via adversarial learning
  • Exploring the latent shape space
  • Extracting powerful shape descriptors for classification
  • Extending 3D-GAN for single image 3D reconstruction
SLIDE 35

Outline

Recognizing 3D structure

SLIDE 36

Single Image 3D Interpreter Network

Jiajun Wu*   Tianfan Xue*   Joseph Lim   Yuandong Tian   Josh Tenenbaum   Antonio Torralba   Bill Freeman

(* indicates equal contribution)

ECCV 2016

SLIDE 37

3D Object Representation

Mesh Skeleton Voxel

Image credits: Girdhar et al. ’16, Choy et al. ’16, Xiao et al. ’12, Zhou et al. ’16, Biederman et al. ’93, Fan et al. ’89, Goesele et al. ’10, Furukawa and Ponce ’07, Lensch et al. ’03

SLIDE 38

Goal

SLIDE 39

Skeleton Representation

[Figure: chair skeleton with keypoints C1–C4, described by structure parameters]

SLIDE 40

3D Skeleton to 2D Image

[Figure: the skeleton (keypoints C1–C4) maps to the image via structure parameters, rotation, translation, and projection]
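Putting the slide's labels together: the skeleton is a weighted sum of base shapes, which is then rotated, translated, and projected onto the image. A reconstruction of the model, where the base-shape notation $B_k$ is an assumption and $\alpha_k$ (structure parameters), $R$ (rotation), $T$ (translation), and $P$ (projection) follow the slide:

$$S = \sum_k \alpha_k B_k, \qquad X_{\mathrm{2D}} = P\,(R\,S + T)$$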

SLIDE 41

Goal

SLIDE 42

Approach I: Using 3D Object Labels

ObjectNet3D [Xiang et al., ’16]

SLIDE 43

Approach II: Using 3D Synthetic Data

ObjectNet3D [Xiang et al., ’16], Render for CNN [Su et al., ’15], Multi-view CNNs [Dosovitskiy et al., ’16], TL network [Girdhar et al., ’16], PhysNet [Lerer et al., ’16]

SLIDE 44

Intermediate 2D Representation

Real images with 2D keypoint labels + synthetic 3D models

Only 2D labels!

SLIDE 45

3D INterpreter Network (3D-INN)

Real images with 2D keypoint labels + synthetic 3D models

Only 2D labels!

Ramakrishna et al. ’12, Grinciunaite et al. ’13

SLIDE 46

3D-INN: Image to 2D Keypoints


Inspired by Tompson et al. ’15

2D keypoint estimation, using 2D-annotated real data
Input: an RGB image
Output: keypoint heatmaps

SLIDE 47

3D-INN: 2D Keypoints to 3D Skeleton

3D interpreter, using 3D synthetic data
Input: rendered keypoint heatmaps
Output: 3D parameters

SLIDE 48

3D-INN: Initial Design

Image → 2D keypoint estimation → 3D interpreter

SLIDE 49

Initial Results

Errors in the first stage propagate to the second.

[Figure: image, inferred keypoint heatmap, inferred 3D skeleton]

SLIDE 50

3D-INN: End-to-End Training?

No 3D labels are available to supervise the 3D interpreter's output.

SLIDE 51

3D-INN: End-to-End Training?

Idea: connect the 3D interpreter back to the 2D keypoint labels.

SLIDE 52

3D-INN: 3D-to-2D Projection Layer

3D-to-2D projection is fully differentiable.

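Since the projection is plain matrix arithmetic, autograd differentiates through it for free. A minimal sketch assuming a weak-perspective camera, a common simplification rather than the paper's exact formulation:

```python
import torch

def project_3d_to_2d(S, R, T, scale):
    """Differentiable 3D-to-2D projection.
    S: (N, K, 3) skeleton keypoints; R: (N, 3, 3) rotations;
    T: (N, 2) image translations; scale: (N, 1) weak-perspective scale.
    Returns (N, K, 2) projected keypoints."""
    cam = torch.einsum('nij,nkj->nki', R, S)           # rotate into camera frame
    return scale[:, None] * cam[..., :2] + T[:, None]  # drop depth, scale, shift

# Gradients from a 2D keypoint loss flow back to the 3D parameters:
N, K = 4, 10
S = torch.randn(N, K, 3, requires_grad=True)  # predicted 3D keypoints
R = torch.eye(3).expand(N, 3, 3)
T = torch.zeros(N, 2, requires_grad=True)
s = torch.ones(N, 1, requires_grad=True)
labels_2d = torch.randn(N, K, 2)              # 2D keypoint annotations

loss = ((project_3d_to_2d(S, R, T, s) - labels_2d) ** 2).mean()
loss.backward()                               # S.grad, T.grad, s.grad populated
```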

SLIDE 53

3D-INN: 3D-to-2D Projection Layer

2D keypoint estimation → 3D interpreter → 3D-to-2D projection, using 2D-annotated real data
Input: an RGB image
Output: projected 2D keypoint coordinates
Objective function: distance to the 2D keypoint labels
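A reconstruction of that objective, assuming a squared-L2 penalty between the projected keypoints and the labels $\hat{x}_i$ (the slide's formula was lost in extraction):

$$\mathcal{L} = \sum_i \left\| \big[P\,(R\,S + T)\big]_i - \hat{x}_i \right\|_2^2$$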

SLIDE 54

3D-INN: Training Paradigm

Three-step training paradigm:
  I. 2D keypoint estimation
  II. 3D interpreter
  III. End-to-end fine-tuning

SLIDE 55

Refined Results

[Figure: image, initial estimation, and result after end-to-end fine-tuning]

SLIDE 56

3D Estimation: Qualitative Results

Keypoint-5 dataset
Training: our Keypoint-5 dataset, 2K images per category

SLIDE 57

3D Estimation: Qualitative Results

IKEA dataset [Lim et al., ’13]
Training: our Keypoint-5 dataset, 2K images per category

SLIDE 58

3D Estimation: Qualitative Results

SUN database [Xiao et al., ’11]
Training: our Keypoint-5 dataset, 2K images per category

[Figure: input images and results after fine-tuning]

SLIDE 59

3D Estimation: Qualitative Results

SUN database [Xiao et al., ’11]
Training: our Keypoint-5 dataset, 2K images per category

SLIDE 60

3D Structure Estimation

[Figure: example images and results]

Method     Bed    Sofa   Chair   Avg.
3D-INN     88.6   88.0   87.8    88.0
Zhou ’16   52.3   58.0   60.8    58.5

Average recall (%); RMSE of estimated 3D keypoints on IKEA [Lim et al., ’13]
[Plot: average recall vs. RMSE threshold, 3D-INN vs. Zhou-perp]
SLIDE 61

Viewpoint Estimation

[Figure: example images and results]

Method    Table   Sofa   Chair   Avg.
3D-INN     55.0   64.7   63.5    60.3
Su ’15     52.7   35.7   37.7    43.3

Average recall (%); azimuth angular error on IKEA [Lim et al., ’13]
[Plot: recall (%) vs. azimuth angular error, 3D-INN vs. Su et al.]
SLIDE 62

Localization and Viewpoint Estimation

Viewpoint estimation on the PASCAL 3D+ dataset [Xiang et al., ’14], using R-CNN detections [Girshick et al., ’14]

Category   VDPM   DPM+VP   Su et al.   V&K    3D-INN
Chair       6.8     6.1      15.7      25.1     23.1
Sofa        5.1    11.8      18.6      43.8     45.8

SLIDE 63

Chair Embedding

Manifold of chairs based on their inferred viewpoint

SLIDE 64

Contributions of 3D-INN

  • Single image 3D perception
  • Real 2D labels + synthetic 3D models, connected via keypoints
  • A 3D-to-2D projection layer for end-to-end training
SLIDE 65

Summary

  • 3D-GAN: synthesizing 3D shapes
  • 3D-INN: recognizing 3D structure