Model Architectures and Training Techniques for High-Precision Landmark Localization - PowerPoint PPT Presentation

SLIDE 1

Model Architectures and Training Techniques for High-Precision Landmark Localization

Sina Honari, Pavlo Molchanov, Jason Yosinski, Stephen Tyree, Pascal Vincent, Jan Kautz, Christopher Pal

SLIDE 2

KEYPOINT DETECTION / LANDMARK LOCALIZATION

The problem of localizing important points on images

Keypoints for a human face include:

  • left/right eye
  • nose
  • left/right mouth corner

Applications include:

  • Face alignment/rectification
  • Emotion recognition
  • Head pose estimation
  • Person identification
SLIDE 3

MOTIVATION

Conventional ConvNets alternate convolutional and max-pooling layers:

  • Max-pooled features: lose precise spatial information, but give robust features
  • Conv-only networks: keep spatial information, but produce many false positives

Can we take robust pooled features and keep positional information?

SLIDE 4

SUMMATION-BASED NETWORKS (SUMNETS)

Sums features of different granularity (FCN [1], HyperColumns [2])

Legend: C = convolution, P = pooling, U = upsampling; branch = horizontal row of C and U layers

Coarse to Fine Branches

[1] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[2] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
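As a rough illustration of the summation idea (not the authors' exact architecture: the map sizes and the nearest-neighbor upsampling here are assumptions), each branch's pre-softmax map can be upsampled to full resolution, summed, and passed through a spatial softmax:

```python
import numpy as np

def upsample_nearest(feat, factor):
    """Nearest-neighbor upsampling of an (H, W) map by an integer factor."""
    return feat.repeat(factor, axis=0).repeat(factor, axis=1)

def sumnet_heatmap(branch_maps, full_size):
    """Sum pre-softmax maps from branches of different granularity after
    upsampling each to full resolution, then apply a spatial softmax to
    get one probability map per keypoint."""
    total = np.zeros((full_size, full_size))
    for m in branch_maps:
        total += upsample_nearest(m, full_size // m.shape[0])
    e = np.exp(total - total.max())  # numerically stable spatial softmax
    return e / e.sum()

rng = np.random.default_rng(0)
# coarse (4x4), mid (8x8), fine (16x16) pre-softmax maps for one keypoint
branches = [rng.standard_normal((s, s)) for s in (4, 8, 16)]
prob = sumnet_heatmap(branches, 16)
```

The keypoint prediction is then the argmax (or expected location) of `prob`.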

SLIDE 5

SUMMATION-BASED NETWORKS (SUMNETS)

[Figure: pre-softmax activations and softmax probabilities for the coarse-to-fine branches and their sum]


SLIDE 7

Recombinator Networks

Learning Coarse-to-Fine Feature Aggregation

(CVPR 2016)

Sina Honari Jason Yosinski Pascal Vincent Christopher Pal

Montreal Institute for Learning Algorithms

SLIDE 8

The model feeds coarse features into finer layers early in their computation

Recombinator Networks (RCNs)

Legend: C = convolution, P = pooling, U = upsampling, K = concatenation; branch = horizontal row of C and U layers

Coarse to Fine Branches
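A minimal sketch of the recombination step (channel counts and nearest-neighbor upsampling are illustrative assumptions, not the paper's exact layers): the coarse branch's feature maps are upsampled (U) and concatenated (K) with the finer branch's maps along the channel axis, so the finer branch's subsequent convolutions see both:

```python
import numpy as np

def upsample_nearest(feat, factor):
    """Nearest-neighbor upsampling of a (C, H, W) feature map."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def recombine(coarse, fine):
    """RCN-style recombination: upsample coarse features to the finer
    branch's resolution and concatenate along channels, so the finer
    branch conditions its computation on coarse features early on."""
    factor = fine.shape[1] // coarse.shape[1]
    return np.concatenate([upsample_nearest(coarse, factor), fine], axis=0)

coarse = np.ones((8, 4, 4))   # 8 channels at 4x4
fine = np.zeros((8, 8, 8))    # 8 channels at 8x8
merged = recombine(coarse, fine)  # 16 channels at 8x8
```

Contrast this with SumNets, where branches only interact through the final addition of their outputs.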

SLIDE 9

Recombinator Networks (RCNs)

A Convolutional Encoder-Decoder Network with Skip Connections

SLIDE 10

SumNets vs. RCNs

Summation-Based Networks (SumNets) Recombinator Networks (RCNs)

SLIDE 11

SumNets vs. RCNs

[Figure: pre-softmax activations for SumNet and RCN]

SLIDE 12

SumNets vs. RCNs

[Figure: pre-softmax activations and softmax probabilities for SumNet and RCN]

SLIDE 13

Prediction Samples

Models compared: TCDCN, SumNet, RCN

Red: model predictions; green: ground-truth keypoints

SLIDE 14

Comparison with SOTA

Landmark estimation error (in percent; lower is better) on the 300W and MTFL datasets

Error = Euclidean distance between GT and predicted keypoints, normalized by the inter-ocular distance
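The metric can be computed directly; a small sketch (the toy keypoint layout and eye indices are made up for illustration):

```python
import numpy as np

def landmark_error(pred, gt, left_eye, right_eye):
    """Mean Euclidean distance between predicted and GT keypoints,
    normalized by the inter-ocular distance, in percent."""
    dists = np.linalg.norm(pred - gt, axis=1)
    iod = np.linalg.norm(gt[left_eye] - gt[right_eye])
    return 100.0 * dists.mean() / iod

# 5 toy keypoints (x, y): eyes, nose, mouth corners
gt = np.array([[10., 10.], [20., 10.], [15., 15.], [12., 20.], [18., 20.]])
pred = gt + 1.0   # every prediction off by (1, 1)
err = landmark_error(pred, gt, left_eye=0, right_eye=1)  # ~14.14%
```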

SLIDE 15

Improving Landmark Localization with Semi-Supervised Learning

(CVPR 2018)

Sina Honari, Pavlo Molchanov, Stephen Tyree, Pascal Vincent, Jan Kautz, Christopher Pal

SLIDE 16

MOTIVATION

Labeling cost per image:

  • Attribute (e.g. "smiling", "looking straight"): ~1 s, fast
  • Landmarks: ~60 s, very time consuming

Manual landmark localization is a tedious task (to build datasets)

SLIDE 17

MOTIVATION

Can we use an attribute (e.g. head pose) to guide landmark localization?

SLIDE 18

SEMI-SUPERVISED LEARNING 1

Using CNNs with sequential multitasking

  1. First predict landmarks
  2. Use the predicted landmarks to predict the attribute

SLIDE 19

SEMI-SUPERVISED LEARNING 1

Using CNNs with sequential multitasking

Forward pass: landmarks help predict the attribute. Backward pass: the attribute loss sends gradient back to the landmark network.

  1. First predict landmarks
  2. Use the predicted landmarks to predict the attribute
  3. Get the gradient from the attribute loss into the landmark localization network

SLIDE 20

SEMI-SUPERVISED LEARNING 1

To train the entire network end-to-end, we use soft-argmax to predict landmarks.

Soft-argmax estimates the location of the scaled center of mass of the heatmap:

  ✓ Continuous, not discrete
  ✓ Differentiable
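A minimal NumPy sketch of soft-argmax (the temperature `beta` and the map size are illustrative choices; the model applies this to each keypoint's heatmap):

```python
import numpy as np

def soft_argmax(logits, beta=10.0):
    """Differentiable argmax: softmax over the (scaled) heatmap, then the
    probability-weighted average of pixel coordinates (a scaled center of
    mass). Continuous and differentiable, unlike a hard argmax."""
    flat = beta * logits.ravel()
    p = np.exp(flat - flat.max())
    p /= p.sum()
    ys, xs = np.mgrid[0:logits.shape[0], 0:logits.shape[1]]
    return float((p * ys.ravel()).sum()), float((p * xs.ravel()).sum())

heat = np.zeros((5, 5))
heat[2, 3] = 5.0          # strong peak at row 2, col 3
y, x = soft_argmax(heat)  # ≈ (2.0, 3.0)
```

Because the output is a weighted average rather than an index, gradients from any downstream loss (e.g. the attribute loss) flow back into the heatmap.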

SLIDE 21

SEMI-SUPERVISED LEARNING 2

Equivariant Landmark Transformation (ELT)

Ask the model to make equivariant predictions w.r.t. image transformations:

  • Predict landmarks L on an image I
  • Apply a transformation T to image I to get I′
  • Predict landmarks L′ on the image I′
  • Apply the transformation T to the landmarks L
  • Compare L′ with T ∘ L
  • Get the gradient from the ELT loss
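The steps above can be sketched as follows (the dummy predictor and the horizontal flip are stand-ins: in the paper the predictor is the landmark network and T is a random transformation; the squared-error penalty is also an illustrative choice):

```python
import numpy as np

H, W = 8, 8

def predict(img):
    """Stand-in 'model': returns the (y, x) of the single bright pixel."""
    y, x = np.unravel_index(np.argmax(img), img.shape)
    return np.array([[float(y), float(x)]])

def elt_loss(predict, image, T_img, T_pts):
    """Predictions on the transformed image should equal the transformed
    predictions on the original image."""
    L = predict(image)            # landmarks L on image I
    L_t = predict(T_img(image))   # landmarks L' on I' = T(I)
    return float(np.mean(np.sum((L_t - T_pts(L)) ** 2, axis=1)))

T_img = lambda img: img[:, ::-1]                                   # horizontal flip
T_pts = lambda L: np.stack([L[:, 0], (W - 1) - L[:, 1]], axis=1)   # same flip on coords

img = np.zeros((H, W)); img[3, 5] = 1.0
loss = elt_loss(predict, img, T_img, T_pts)  # 0.0: predictions are equivariant
```

Note that the loss needs no landmark labels, so it applies to every image.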
SLIDE 23

LEARNING LANDMARKS

Making use of all data:

  • Loss from GT landmarks
  • Loss from attributes (A) using sequential multi-tasking
  • Loss from the Equivariant Landmark Transformation (ELT)
  • S << M <= N
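A sketch of how the three losses might be combined over a batch; this reads the slide's S << M <= N as the counts of images with landmark labels, with attribute labels, and in total, and the loss weights and masking scheme are assumptions for illustration:

```python
import numpy as np

def total_loss(lm_loss, attr_loss, elt_loss, has_lm, has_attr,
               lam_attr=1.0, lam_elt=1.0):
    """Combine per-image losses: GT landmark loss only where landmark
    labels exist, attribute loss only where attribute labels exist, and
    the ELT loss on every image (labeled or not)."""
    lm = (lm_loss * has_lm).sum() / max(has_lm.sum(), 1)
    at = (attr_loss * has_attr).sum() / max(has_attr.sum(), 1)
    return float(lm + lam_attr * at + lam_elt * elt_loss.mean())

# batch of 4: one landmark-labeled image, two attribute-labeled, ELT on all
lm_loss = np.array([0.4, 0.0, 0.0, 0.0])
attr_loss = np.array([0.2, 0.3, 0.0, 0.0])
elt_loss = np.array([0.1, 0.1, 0.1, 0.1])
has_lm = np.array([1, 0, 0, 0])
has_attr = np.array([1, 1, 0, 0])
loss = total_loss(lm_loss, attr_loss, elt_loss, has_lm, has_attr)  # 0.75
```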
SLIDE 24

RESULTS

Faces

Method      Training data
just CNN    100%

SLIDE 25

RESULTS

Faces

Method      Training data
just CNN    100%
just CNN    5%

SLIDE 26

RESULTS

Faces

Method                 Training data
just CNN               100%
just CNN               5%
semi-supervised CNN    5%

SLIDE 27

COMPARISON WITH SOTA

Faces

[Bar chart: landmark error for CDM, ERT, LBF, SDM, CFSS, RCPR, CCL, Lv et al., and Ours (+SSL at 1%, 5%, and 100% labeled data); values from 5.43 down to 1.59]

AFLW dataset:

  • 19 landmarks
  • Head pose: yaw, pitch, roll
  • ~25k images

Error metric: Euclidean distance between GT and predicted keypoints, normalized by the inter-ocular distance


SLIDE 31

COMPARISON WITH SOTA

Faces

Models shown: RCN+ (L+ELT+A) with 100% labels; RCN+ (L+ELT+A) with 1% labels; RCN+ (L) with 1% labels

SLIDE 32

CONCLUSIONS

  • Fuse predictions from multiple granularities
  • Let the network learn the fusion method
  • Additional attributes and priors improve results with semi-supervised learning
  • Works with landmarks on hands as well
  • Read more in our CVPR 2016 paper (https://arxiv.org/abs/1511.07356) and CVPR 2018 paper (https://arxiv.org/abs/1709.01591)

SLIDE 33

THANK YOU!

Pavlo Molchanov (pmolchanov@nvidia.com)
Sina Honari (honaris@iro.umontreal.ca)

SLIDE 34

Comparison with SOTA

SLIDE 35

Loss & Error

Loss: softmax loss on keypoints + L2-norm
Error: Euclidean distance between GT and predicted keypoints, normalized by the inter-ocular distance

SLIDE 36

Masking Branches

Mask: 0 = branch omitted, 1 = branch included (listed coarse → fine)

Mask (coarse → fine)   SumNet AFLW   SumNet AFW   RCN AFLW   RCN AFW
1, 0, 0, 0             10.54         10.63        10.61      10.89
0, 1, 0, 0             11.28         11.43        11.56      11.87
1, 1, 0, 0             9.47          9.65         9.31       9.44
0, 0, 1, 0             16.14         16.35        15.78      15.91
0, 0, 0, 1             45.39         47.97        46.87      48.61
0, 0, 1, 1             13.90         14.14        12.67      13.53
0, 1, 1, 1             7.91          8.22         7.62       7.95
1, 0, 0, 1             6.91          7.51         6.79       7.27
1, 1, 1, 1             6.44          6.78         6.37       6.43

SLIDE 37

Adding More Branches

[Figure: SumNet and RCN coarse-to-fine architectures with additional branches]