Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views - PowerPoint PPT Presentation



SLIDE 1

Render for CNN:

Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views

Hao Su*, Charles R. Qi*, Yangyan Li, Leonidas J. Guibas

SLIDE 2

ILSVRC Image Classification: Top-5 Error (%)

Year             2010  2011  2012  2013  2014  2015
Top-5 Error (%)  28.2  25.8  16.4  11.7   6.7   3.6

SLIDE 3

Go beyond 2D Image Classification

car

  • 3D bounding box
  • 3D alignment
  • 3D model retrieval
SLIDE 4

Go beyond 2D Image Classification

car

3D Viewpoint Estimation

azimuth, elevation, in-plane rotation
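A viewpoint here is the triple (azimuth, elevation, in-plane rotation). As a minimal sketch of how the three angles compose into a camera rotation (the axis conventions below are an assumption; the paper's exact convention may differ):

```python
import numpy as np

def viewpoint_to_rotation(azimuth, elevation, theta):
    """Compose a camera rotation from azimuth, elevation, and in-plane
    rotation (all in degrees). Axis conventions are illustrative."""
    a, e, t = np.deg2rad([azimuth, elevation, theta])
    # Azimuth: rotation about the world up (y) axis
    Ry = np.array([[np.cos(a), 0.0, np.sin(a)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(a), 0.0, np.cos(a)]])
    # Elevation: rotation about the x axis
    Rx = np.array([[1.0, 0.0, 0.0],
                   [0.0, np.cos(e), -np.sin(e)],
                   [0.0, np.sin(e), np.cos(e)]])
    # In-plane rotation: rotation about the camera's viewing (z) axis
    Rz = np.array([[np.cos(t), -np.sin(t), 0.0],
                   [np.sin(t), np.cos(t), 0.0],
                   [0.0, 0.0, 1.0]])
    return Rz @ Rx @ Ry
```

Any such composition yields a proper rotation (orthonormal, determinant 1), which is what the viewpoint label encodes.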

SLIDE 5

3D Viewpoint Estimation in the Wild

Images in the Wild; models unknown

SLIDE 6

3D Perception in the Wild

Images in the Wild; models unknown

Learn from Data

AlexNet [Krizhevsky et al.]

SLIDE 7

However... Accurate Label Acquisition is Expensive

What are the camera viewpoint angles to the SUV in the image?

SLIDE 8

PASCAL3D+ dataset [Xiang et al.]

However... Accurate Label Acquisition is Expensive

SLIDE 9

PASCAL3D+ dataset [Xiang et al.]

However... Accurate Label Acquisition is Expensive

Step 1: Choose a similar model

SLIDE 10

PASCAL3D+ dataset [Xiang et al.]

However... Accurate Label Acquisition is Expensive

Step 2: Coarse viewpoint labeling

Step 1: Choose a similar model

SLIDE 11

PASCAL3D+ dataset [Xiang et al.]

However... Accurate Label Acquisition is Expensive

Step 2: Coarse viewpoint labeling
Step 3: Label keypoints for alignment

Step 1: Choose a similar model

Annotation takes ~1 min per object

SLIDE 12

30K images with viewpoint labels in PASCAL3D+ dataset [Xiang et al.]

High-cost Label Acquisition vs. High-capacity Model

AlexNet [Krizhevsky et al.]: 60M parameters

How to get MORE images with ACCURATE viewpoint labels?

SLIDE 13

From manual alignment by annotators to auto alignment through rendering

SLIDE 14

http://shapenet.cs.stanford.edu

Good News: ShapeNet

[Chart: #total models vs. #models per class for PSB '05, SHREC '14, ModelNet '15, and ShapeNet (ongoing)]

3M models in total; 330K models from 4K categories annotated

SLIDE 15

Key Idea: Render for CNN

[Pipeline: ShapeNet models + sampled viewpoints → Rendering → Synthetic Images → Training]

SLIDE 16

Key Idea: Render for CNN

[Pipeline: Real Images → Testing → Viewpoint]

SLIDE 17

Rendering

I want data! How to render data with both quantity and quality?

SLIDE 18

Synthesize: Scalability vs Quality

[Chart: scalability (low to high) vs. quality (low to high); "Ideal" marks the high-scalability, high-quality corner]

SLIDE 20

Synthesize: Scalability vs Quality

[Chart: scalability vs. quality; "Previous works" and "Ideal" annotated]

SLIDE 21

Synthesize: Scalability vs Quality

[Chart: scalability vs. quality; "Previous works", "Ideal", and the "Sweet spot" annotated]

SLIDE 22

Synthesize: Scalability vs Quality

[Chart: scalability vs. quality; "Previous works", "Ideal", and the "Sweet spot" annotated]

Story Time!

SLIDE 23

A “Data Engineering” Journey

  • 80K rendered chair images
  • Metric: 16-view classification accuracy tested on real images

At the beginning...

  • Lighting: 4 fixed point light sources on the sphere
  • Background: clean
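The 16-view classification metric above can be made concrete by binning azimuth into 16 equal sectors and scoring exact bin matches. The binning scheme below (bins centered on multiples of 22.5 degrees) is an assumption about the protocol, not taken from the slides:

```python
import numpy as np

def azimuth_to_view_class(azimuth_deg, num_views=16):
    """Map an azimuth angle (degrees) to one of num_views discrete
    view classes; bin 0 is centered at 0 degrees. The centering is
    an assumption -- the actual protocol may differ."""
    bin_size = 360.0 / num_views
    return int(((azimuth_deg + bin_size / 2) % 360) // bin_size)

def view_accuracy(pred_azimuths, gt_azimuths, num_views=16):
    """Fraction of predictions that land in the ground-truth view bin."""
    pred = [azimuth_to_view_class(a, num_views) for a in pred_azimuths]
    gt = [azimuth_to_view_class(a, num_views) for a in gt_azimuths]
    return float(np.mean([p == g for p, g in zip(pred, gt)]))
```

For example, with 16 views a prediction of 350 degrees and a ground truth of 5 degrees fall in the same bin and count as correct.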
SLIDE 24

A “Data Engineering” Journey

ConvNet: Ah ha, I know! Viewpoint is just the brightness pattern!

47% on real test set vs. 95% on synthetic val set

SLIDE 26

47% -> 74%

A “Data Engineering” Journey

Randomize lighting

ConvNet: Hmm... viewpoint is not the brightness pattern. Maybe it’s the contour?
SLIDE 28

A “Data Engineering” Journey

Add backgrounds: 74% -> 86%

ConvNet: It becomes really hard! Let me look more into the picture.

SLIDE 29

A “Data Engineering” Journey

Bbox crop, texture: 86% -> 93%

SLIDE 30

A “Data Engineering” Journey

Bbox crop, texture

ConvNet: the mapping becomes hard. I have to learn harder to get it right!

86% -> 93%

Key Lesson: Don’t give the CNN a chance to “cheat” - it’s very good at it. When there is no way to cheat, true learning starts.

SLIDE 31

Render for CNN Image Synthesis Pipeline

3D model → Rendering (sample lighting and camera params) → Add bkg (sample bkg image, alpha-blending) → Crop (sample cropping params)

Hyper-parameters estimated from real images

SLIDE 32

Render for CNN Image Synthesis Pipeline

3D model → Rendering

Sample lighting and camera params

SLIDE 33

Rendering

Camera params: KDE from the PASCAL3D+ train set
Lighting params: randomly sampled

  • Number of light sources
  • Light distances
  • Light energies
  • Light positions
  • Light types
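A sketch of both samplers, assuming viewpoints are stored as (azimuth, elevation, in-plane rotation, distance) rows and using scipy's `gaussian_kde`; the lighting ranges and light types below are illustrative assumptions, not the paper's exact values:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

def fit_camera_kde(train_viewpoints):
    """train_viewpoints: (N, 4) array of (azimuth, elevation,
    in-plane rotation, distance) rows, e.g. from the PASCAL3D+
    train set. gaussian_kde expects variables in rows, so transpose."""
    return gaussian_kde(np.asarray(train_viewpoints).T)

def sample_camera_params(kde, n=1):
    """Draw n camera-parameter vectors from the fitted density."""
    return kde.resample(n).T  # back to (n, 4)

def sample_lighting(max_lights=4):
    """Randomly sampled lighting, mirroring the factors listed above;
    the numeric ranges are illustrative assumptions."""
    num = rng.integers(1, max_lights + 1)  # number of light sources
    return [{
        "distance": rng.uniform(2.0, 8.0),
        "energy": rng.uniform(0.5, 2.0),
        "position": rng.normal(size=3),    # direction around the object
        "type": rng.choice(["POINT", "SUN"]),
    } for _ in range(num)]
```

Sampling camera parameters from a KDE fitted on real annotations keeps the synthetic viewpoint distribution close to the real one, while the lighting is deliberately randomized to prevent the network from latching onto shading cues.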
SLIDE 34

Render for CNN Image Synthesis Pipeline

3D model → Rendering (sample lighting and camera params) → Add bkg (sample bkg image, alpha-blending)

SLIDE 35

Background Composition

  • Simple but effective!
  • Backgrounds randomly sampled from SUN397 dataset [Xiao et al.]
  • Alpha blending composition for natural boundaries
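The composition step can be sketched in a few lines of numpy; `render_rgb`, `alpha`, and `background_rgb` are assumed to be float arrays in [0, 1] of matching size:

```python
import numpy as np

def composite(render_rgb, alpha, background_rgb):
    """Alpha-blend a rendered object (H, W, 3 RGB plus H, W alpha mask)
    over a background image of the same size. Blending with the soft
    alpha mask gives natural, anti-aliased object boundaries instead
    of a hard paste."""
    alpha = alpha[..., None]  # broadcast the mask over 3 color channels
    return alpha * render_rgb + (1.0 - alpha) * background_rgb
```

In the pipeline the background would be a crop randomly sampled from SUN397; here it is just any image array of the right shape.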
SLIDE 36

Render for CNN Image Synthesis Pipeline

3D model → Rendering (sample lighting and camera params) → Add bkg (sample bkg image, alpha-blending) → Crop (sample cropping params)

SLIDE 37

Image Cropping

Cropping patterns: KDE from the PASCAL3D+ train set
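A hedged sketch of KDE-based crop sampling, assuming crops are parameterized as offsets and scales relative to the tight object box (the actual parameterization in the paper may differ):

```python
import numpy as np
from scipy.stats import gaussian_kde

def fit_crop_kde(train_crops):
    """train_crops: (N, 4) array of relative crop parameters, e.g.
    (dx, dy, scale_w, scale_h) of annotated detection boxes relative
    to tight object boxes in the train set (parameterization assumed)."""
    return gaussian_kde(np.asarray(train_crops).T)

def sample_crop(kde, bbox):
    """Perturb a tight bbox (x, y, w, h) with one sampled crop pattern,
    imitating the loose, jittered boxes seen at test time."""
    dx, dy, sw, sh = kde.resample(1).T[0]
    x, y, w, h = bbox
    return (x + dx * w, y + dy * h, w * abs(sw), h * abs(sh))
```

As with the camera parameters, fitting the crop statistics on real annotations keeps the synthetic crops distributed like real detection boxes.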

SLIDE 39

2.4M Synthesized Images for 12 Categories

  • High scalability
  • High quality
  • Overfit-resistant
  • Accurate labels
SLIDE 40

Results

SLIDE 41

3D Viewpoint Estimation Evaluation

Metric: median angle error (lower is better). Real test images from the PASCAL3D+ dataset.

SLIDE 42

3D Viewpoint Estimation Evaluation

Metric: viewpoint accuracy and median angle error (lower is better). Real test images from the PASCAL3D+ dataset. Our model trained on rendered images outperforms the state-of-the-art model trained on real images from PASCAL3D+.

[Bar chart: Viewpoint Median Error (degrees, axis 8-16) for Vps&Kps (CVPR15) vs. RenderForCNN (Ours)]
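The median angle error behind this chart is typically computed as the geodesic distance between predicted and ground-truth rotation matrices; a minimal sketch:

```python
import numpy as np

def geodesic_angle_deg(R_pred, R_gt):
    """Rotation angle (degrees) between a predicted and a ground-truth
    3x3 rotation matrix -- the standard distance behind the median
    angle error metric. The arccos argument is clipped to guard
    against floating-point round-off outside [-1, 1]."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def median_angle_error(pred_rotations, gt_rotations):
    """Median of per-example geodesic angle errors over a test set."""
    errors = [geodesic_angle_deg(p, g)
              for p, g in zip(pred_rotations, gt_rotations)]
    return float(np.median(errors))
```

For instance, predicting a 90-degree in-plane rotation when the ground truth is the identity gives a 90-degree error for that example.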

SLIDE 43

How many 3D models are necessary?

[Chart: accuracy (55-90%) vs. #models for one category (10, 91, 1000, 6928)]

10 vs. 1000 models: 20%+ accuracy difference

SLIDE 44

3D Viewpoint Estimation

SLIDE 45

Azimuth Viewpoint Estimation

[Figure: per-example azimuth confidence plots over 0-360 degrees for airplane, bicycle, boat, motorbike, and car; ground-truth view vs. estimated view confidence]

SLIDE 46

Azimuth Viewpoint Estimation

[Figure: per-example azimuth confidence plots for table, chair, monitor, and sofa; ground-truth view vs. estimated view confidence]

SLIDE 47

Failure Cases

  • sofa occluded by people
  • car occluded by motorbike
  • ambiguous car viewpoint
  • ambiguous chair viewpoint
  • multiple cars
  • multiple chairs

SLIDE 48

Limitations of Current Synthesis Pipeline

  • Modeling Occlusions?
  • Modeling Background Context?
  • Shape database augmentation by interpolation?
SLIDE 49

Render for CNN – Beyond Viewpoint

  • 3D model retrieval
  • Joint Embedding [Li et al., SIGGRAPH Asia 15]
  • Object detection
  • Segmentation
  • Intrinsic image decomposition
  • Controlled experiments for DL
  • Vision algorithm verification
SLIDE 50

Conclusion

Images rendered from 3D models can be effectively used to train CNNs, especially for 3D tasks. State-of-the-art results have been achieved.

Keys to success:

  • Quantity: Large scale 3D model collection (ShapeNet)
  • Quality: Overfit-resistant, scalable image synthesis pipeline

http://shapenet.cs.stanford.edu

SLIDE 51

THE END THANK YOU!