SLIDE 1

Learning Face Recognition from Limited Training Data using Deep Neural Networks

Xi Peng*, Nalini Ratha**, Sharathchandra Pankanti**

*Rutgers University, New Jersey, 08854 **IBM T. J. Watson Research Center, New York 10598

SLIDE 2

Motivation

Challenges of Face Recognition (illustrated with face images of Angelina Jolie):

  • Pose
  • Low resolution
  • Illumination
  • Chronology
  • Expression
  • Decoration
  • Occlusion

SLIDE 3

Motivation

Genuine pair (same person) vs. impostor pair (different person), differing in:

  • Pose
  • Expression
  • Illumination
  • Occlusion

Challenge of Face Verification: inter-class variation can be smaller than intra-class variation.

SLIDE 4

Motivation

Limited Training Data

Private Dataset   Group      Approach        #subjects    #images       Accuracy on LFW
SFC               Facebook   DeepFace/2014   4,030        4,400,000     97.35%
WebFace           Google     FaceNet/2015    8,000,000    200,000,000   98.87%

Public Dataset    #subjects   #images
LFW               5,749       13,233
PubFig            200         58,797
FaceScrub         530         107,818
CACD              2,000       163,446
CASIA             10,575      494,414

DNNs are extremely data-hungry; however, only limited training data is publicly available.

SLIDE 5

Motivation

Existing Solutions

  • Normalize viewpoints using 3D face models [Hassner et al., CVPR 2015]
      • Sensitive to noise: incorrect landmark detection → artifacts;
      • Computationally expensive: external module → inefficient;
SLIDE 6

Motivation

Existing Solutions

  • Normalize viewpoints using 3D face models [Hassner et al., CVPR 2015]
      • Sensitive to noise: incorrect landmark detection → artifacts;
      • Computationally expensive: external module → inefficient;
  • Pose-aware networks [Masi et al., CVPR 2016]
      • Require multiple networks:
          • complicated system;
          • expensive to train;
          • not end-to-end trainable;
SLIDE 7

Our Approach

Overview

[Pipeline diagram: 224x224x3 input → alignment network → affine parameters → 100x100x3 aligned face → recognition network → face identity]

SLIDE 8

Our Approach

Alignment Network

[Pipeline diagram with the alignment network highlighted: 224x224x3 input → alignment network → affine parameters → 100x100x3 aligned face]

SLIDE 9

Our Approach

Recognition Network

[Pipeline diagram with the recognition network highlighted: 100x100x3 aligned face → recognition network → face identity]

SLIDE 10

Our Approach

Alignment Network Architecture

224x224x3 input
→ conv 7x7x128 → max pooling 2x2
→ conv 5x5x128 → max pooling 2x2
→ conv 3x3x128 → max pooling 2x2
→ fully connected 256
→ fully connected 64
→ 6 affine parameters

SLIDE 11

Our Approach

Alignment Network Architecture

224x224x3 input
→ conv 7x7x128 → max pooling 2x2
→ conv 5x5x128 → max pooling 2x2
→ conv 3x3x128 → max pooling 2x2
→ fully connected 256
→ fully connected 64
→ 6 affine parameters → 100x100x3 aligned output
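The affine parameters drive a differentiable warp of the input, so the whole pipeline can be trained with back-propagation. Below is a minimal PyTorch sketch of such an alignment branch; the layer sizes follow the slide, but the strides, padding, identity initialisation, and the affine_grid/grid_sample warping step are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentNet(nn.Module):
    """Regresses 6 affine parameters from a 224x224x3 face and warps it to 100x100x3."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 128, 7, padding=3), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(128, 128, 5, padding=2), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 28 * 28, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 64), nn.ReLU(inplace=True),
            nn.Linear(64, 6),                      # the 6 affine parameters
        )
        # Start from the identity transform so early training does not distort faces.
        nn.init.zeros_(self.regressor[-1].weight)
        self.regressor[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):                          # x: (N, 3, 224, 224)
        theta = self.regressor(self.features(x)).view(-1, 2, 3)
        # Resample a 100x100 aligned crop from the input using the predicted affine map.
        grid = F.affine_grid(theta, [x.size(0), 3, 100, 100], align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)  # (N, 3, 100, 100)
```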

SLIDE 12

Our Approach

Recognition Network Architecture

100x100x3 aligned face
→ convolution module 1 → max pooling 2x2
→ convolution module 2 → max pooling 2x2
→ inception module 3 → max pooling 2x2
→ inception module 4 → max pooling 2x2
→ inception module 5
→ fully connected 256
→ softmax loss + contrastive loss

SLIDE 13

Our Approach

Recognition Network: Convolution Module

Each convolution module stacks two 3x3 convolutions, each followed by local response normalization (LRN) and ReLU (see the sketch below):

Convolution module 1: conv 1a 3x3 → LRN → ReLU → conv 1b 3x3 → LRN → ReLU
Convolution module 2: conv 2a 3x3 → LRN → ReLU → conv 2b 3x3 → LRN → ReLU
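As a rough illustration, one such module could be written in PyTorch as follows; the channel widths and the LRN window size are assumptions, since the slide does not give them.

```python
import torch.nn as nn

def conv_module(in_ch, out_ch):
    """Two 3x3 convolutions, each followed by LRN and ReLU, then 2x2 max pooling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),   # conv a: 3x3
        nn.LocalResponseNorm(5),                  # LRN
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1),  # conv b: 3x3
        nn.LocalResponseNorm(5),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),                          # max pooling 2x2
    )
```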

SLIDE 14

Our Approach

Recognition Network: Inception Module

Inception modules 3, 4 and 5 each stack two inception blocks, each ending in a depth concatenation:

Inception module 3: incept3a → depth concat → incept3b → depth concat
Inception module 4: incept4a → depth concat → incept4b → depth concat
Inception module 5: incept5a → depth concat → incept5b → depth concat

SLIDE 15

Our Approach

Recognition Network: Inception Module

Inside each inception block, four parallel branches process the previous layer and their outputs are merged by depth concatenation (sketched below):

  • conv 1x1
  • conv 1x1 → conv 3x3
  • conv 1x1 → conv 5x5
  • pool 3x3 → conv 1x1
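This is the standard GoogLeNet-style inception unit. A hedged PyTorch sketch is below; the per-branch channel widths are illustrative parameters, not the values used in the paper.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Four parallel branches merged by depth ("channel") concatenation."""
    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c3r, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c3r, c3, 3, padding=1), nn.ReLU(inplace=True))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, c5r, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c5r, c5, 5, padding=2), nn.ReLU(inplace=True))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Depth concat: stack all branch outputs along the channel dimension.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)
```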

SLIDE 16

Our Approach

Recognition Network: Training Loss

Two losses supervise the fully connected 256 feature (see the sketch below):

  • Softmax loss: fully connected 256 → fully connected 10,575 → softmax over the identity labels;
  • Contrastive loss: computed between pairs of fully connected 256 features, using all genuine pairs and the top-k closest impostor pairs;
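A possible PyTorch rendering of this joint supervision is sketched below; the contrastive margin, the weight lam, and the pair selection are assumptions (the slide only states that all genuine pairs and the top-k closest impostor pairs are used).

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, same_id, margin=1.0):
    """same_id is 1.0 for a genuine pair, 0.0 for an impostor pair."""
    d = F.pairwise_distance(emb1, emb2)
    pos = same_id * d.pow(2)                          # pull genuine pairs together
    neg = (1 - same_id) * F.relu(margin - d).pow(2)   # push impostors beyond the margin
    return 0.5 * (pos + neg).mean()

def joint_loss(logits, labels, emb1, emb2, pair_labels, lam=0.1):
    """Softmax identification loss over 10,575 classes plus weighted contrastive loss."""
    return F.cross_entropy(logits, labels) + lam * contrastive_loss(emb1, emb2, pair_labels)
```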

SLIDE 17

Our Approach

Training Strategy

  • Sufficient training images → capable of end-to-end training;
SLIDE 18

Our Approach

Training Strategy

  • Sufficient training images → capable of end-to-end training;
  • Speed up training with limited data → two-step training:
SLIDE 19

Our Approach

Training Strategy

  • Sufficient training images → capable of end-to-end training;
  • Speed up training with limited data → two-step training:
      • Step 1: 100x100x3 2D-aligned images → pre-train the recognition network;

[Step 1 pipeline: 224x224x3 → face detection → landmark detection → landmark centers → mirroring → 2D alignment → 100x100x3]

SLIDE 20

Our Approach

Training Strategy

  • Sufficient training images → capable of end-to-end training;
  • Speed up training with limited data → two-step training:
      • Step 1: 100x100x3 2D-aligned images → pre-train the recognition network;
      • Step 2: 224x224x3 unaligned images → fine-tune the alignment and recognition networks jointly (sketched below);

[Step 1 pipeline: 224x224x3 → face detection → landmark detection → landmark centers → mirroring → 2D alignment → 100x100x3]

[Step 2 pipeline: alignment network → recognition network → loss]
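In outline, the two steps could look like the PyTorch sketch below; RecognitionNet, its loss method, the data loaders, and the optimiser settings are illustrative assumptions.

```python
import torch

align_net, recog_net = AlignmentNet(), RecognitionNet()   # RecognitionNet is an assumed module

# Step 1: pre-train the recognition network alone on 100x100x3 2D-aligned crops.
opt1 = torch.optim.SGD(recog_net.parameters(), lr=0.01, momentum=0.9)
for images_100, labels in aligned_loader:                  # assumed DataLoader of aligned crops
    loss = recog_net.loss(images_100, labels)
    opt1.zero_grad()
    loss.backward()
    opt1.step()

# Step 2: fine-tune alignment + recognition jointly on raw 224x224x3 images.
opt2 = torch.optim.SGD(list(align_net.parameters()) + list(recog_net.parameters()),
                       lr=0.001, momentum=0.9)
for images_224, labels in raw_loader:                      # assumed DataLoader of raw images
    aligned = align_net(images_224)                        # learned warp to 100x100x3
    loss = recog_net.loss(aligned, labels)
    opt2.zero_grad()
    loss.backward()
    opt2.step()
```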

SLIDE 21

Experiments

Comparison of Different Distance Metrics

SLIDE 22

Experiments

Comparison of Different Distance Metrics

  • Cosine distance is more robust to scale and rotation than the other metrics
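Concretely, verification can take the distance between two embeddings as one minus their cosine similarity; a small sketch follows, where the 0.5 threshold is an assumption.

```python
import torch.nn.functional as F

def same_person(emb_a, emb_b, threshold=0.5):
    """Predict whether two face embeddings belong to the same person."""
    cos_dist = 1.0 - F.cosine_similarity(emb_a, emb_b, dim=-1)
    return cos_dist < threshold
```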
SLIDE 23

Experiments

Comparison of Different Data Pre-processing

  • The external 3D alignment:
      • significant performance degradation on testing;
SLIDE 24

Experiments

Comparison of Different Data Pre-processing

  • The external 3D alignment:
      • significant performance degradation on testing;
      • artifacts ← landmark localization fails in challenging cases;
SLIDE 25

Experiments

Comparison of Different Data Pre-processing

  • The external 3D alignment:
      • significant performance degradation on testing;
      • artifacts ← landmark localization fails in challenging cases;
  • The proposed alignment network:
      • better generalization ability ← jointly optimized;
SLIDE 26

Experiments

Comparison of Different Data Pre-processing

  • The external 3D alignment:
      • significant performance degradation on testing;
      • artifacts ← landmark localization fails in challenging cases;
  • The proposed alignment network:
      • better generalization ability ← jointly optimized;
      • 27 ms per image, roughly 100x faster ← end-to-end network;
SLIDE 27

Experiments

Comparison with the State of the Art

  • Our training dataset is significantly smaller:
      • images: 0.4 million vs. 200 million;
SLIDE 28

Experiments

Comparison with the State of the Art

  • Our training dataset is significantly smaller:
      • images: 0.4 million vs. 200 million;
      • subjects: 10,575 vs. 8 million;
SLIDE 29

Experiments

Comparison with the State of the Art

  • Our training dataset is significantly smaller:
      • images: 0.4 million vs. 200 million;
      • subjects: 10,575 vs. 8 million;
  • Strong performance:
      • comparable accuracy: 96.6% vs. 97.3%;
      • more efficient: no external modules for alignment;
SLIDE 30

Summary

  • An end-to-end framework that jointly learns the alignment and recognition tasks;
  • Comparable verification accuracy with far fewer training images;
  • Highly efficient deployment at test time;