Deep Learning for Face Analysis
Chen-Change LOY, MMLAB, The Chinese University of Hong Kong
Homepage: http://personal.ie.cuhk.edu.hk/~ccloy/


slide-1
SLIDE 1

Deep Learning for Face Analysis

Chen-Change LOY MMLAB The Chinese University of Hong Kong

Homepage: http://personal.ie.cuhk.edu.hk/~ccloy/

slide-2
SLIDE 2

https://www.youtube.com/watch?v=k3T2WbRkgvg&index=4&list=PLkNuzPSJx0mO0_mLUjDQFXFgngTV7QwHZ

slide-3
SLIDE 3
slide-4
SLIDE 4

Vivo X20 Face Wake: unlock your mobile phone in 0.1 seconds

slide-5
SLIDE 5
slide-6
SLIDE 6

DeepID3 99.55% DeepID2 99.15% GaussianFace 98.52%

  • C. Lu, X. Tang, "Surpassing Human-Level Face Verification Performance on LFW with GaussianFace", Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI), January 2015. Best student paper of AAAI 2015.

Human accuracy 97.45%

Papers

Training set
  • DeepID2: 200K images
  • Now: 2 billion images in total, 200M individuals' faces

1:1 result
  • DeepID2 (2014): 99.5% accuracy @ 0.5% FAR
  • 6 digit password (2015): >90% accuracy @ 10^-6 FAR
  • 8 digit password (2017): >97% accuracy @ 10^-8 FAR

1:N result
  • DeepID2: top 30 < 40% for N = 100M
  • Now: top 30 > 90% for N = 100M

Industry Breakthrough

slide-7
SLIDE 7

2015

Yang et al., From Facial Part Responses to Face Detection: A Deep Learning Approach, ICCV 2015

slide-8
SLIDE 8

Zhang et al., SΒ³FD: Single Shot Scale-invariant Face Detector, ICCV 2017

2017

slide-9
SLIDE 9

Is there anything else I can solve?

slide-10
SLIDE 10

Is there anything else I can solve?

  • Learning in small data regime
  • The use of unannotated data
  • Challenging scenarios
  • Generalization and transferability
  • Imbalance problem
  • …
slide-11
SLIDE 11

Face Recognition

Pose-Robust Face Recognition via Deep Residual Equivariant Mapping

  • K. Cao, Y. Rong, C. Li, C. C. Loy

A submission to CVPR 2018

slide-12
SLIDE 12

Profile and Frontal Face Recognition

  • Large pose discrepancy between two face images is one of the key challenges in face recognition
  • The numbers of frontal and profile training faces are highly imbalanced

Profile faces of different persons are easily mismatched (false positives), while profile and frontal faces of the same identity may fail to trigger a match, leading to false negatives.

slide-13
SLIDE 13

Why does face recognition not work well on profile faces?

  • The generalization power of deep models is usually proportional to the training data size
  • Given an uneven distribution of profile and frontal faces in the dataset, deeply learned features tend to be biased toward distinguishing frontal faces rather than profile faces.

slide-14
SLIDE 14

Existing solutions

  • I. Masi, S. Rawls, G. Medioni, and P. Natarajan. Pose-aware face recognition in the wild. In CVPR, 2016
slide-15
SLIDE 15

Existing solutions

  • Y. Taigman et al. Deepface: Closing the gap to human-level performance in face verification. In CVPR, 2014
slide-16
SLIDE 16

Existing solutions

Zhu et al. High-Fidelity Pose and Expression Normalization for Face Recognition in the Wild, CVPR 2015

slide-17
SLIDE 17

Existing solutions

  • L. Tran, X. Yin, and X. Liu. Disentangled representation learning GAN for pose-invariant face recognition. In CVPR, 2017

Model Input Generated Real

slide-18
SLIDE 18

Motivation

We can map a profile face feature to the frontal space through a mapping function that adds a residual.

slide-19
SLIDE 19

Feature equivariance

  • The representation of many deep layers depends upon transformations of the input image
  • Such transformations can be learned by a mapping function from data
  • The function can subsequently be applied to manipulate the representation of an input image to achieve the desired transformation

  • K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In CVPR, 2015
slide-20
SLIDE 20

Feature equivariance

  • A convolutional neural network (CNN) can be regarded as a function Ο† that maps an image x ∈ X to a vector Ο†(x) ∈ R^d
  • The representation Ο† is said to be equivariant with a transformation g of the input image if the transformation can be transferred to the representation output: βˆ€x ∈ X: Ο†(gx) β‰ˆ M_g Ο†(x), where M_g is a mapping function learned from data

  • K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In CVPR, 2015
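The mapping M_g can be estimated directly from pairs of features. Below is a minimal sketch (not from the cited paper) that fits a linear map by ridge-regularized least squares so that M Ο†(x) β‰ˆ Ο†(gx); `feats` and `feats_transformed` are assumed to be precomputed feature matrices.

```python
import numpy as np

def fit_equivariant_map(feats, feats_transformed, reg=1e-3):
    """Fit a linear map M so that M @ phi(x) ~= phi(g x).

    feats:             (n, d) array of features phi(x) of the original images
    feats_transformed: (n, d) array of features phi(g x) of the transformed images
    Solves a ridge-regularized least-squares problem for M.
    """
    d = feats.shape[1]
    A = feats.T @ feats + reg * np.eye(d)      # (d, d) normal equations matrix
    B = feats.T @ feats_transformed            # (d, d)
    M_T = np.linalg.solve(A, B)                # M^T, since feats @ M^T ~= feats_transformed
    return M_T.T

# usage sketch: a small relative residual indicates phi is (approximately) equivariant to g
# M = fit_equivariant_map(phi_x, phi_gx)
# residual = np.linalg.norm(phi_gx - phi_x @ M.T) / np.linalg.norm(phi_gx)
```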
slide-21
SLIDE 21

Problem formulation

  • For simplicity, let's assume we have a frontal face image x₁ and a profile face image xβ‚‚
  • We wish to obtain a transformed representation of the profile image xβ‚‚ through a mapping function M, so that M(Ο†(xβ‚‚)) β‰ˆ Ο†(x₁)

M(Ο†(xβ‚‚)) = Ο†(xβ‚‚) + Y(xβ‚‚) R(xβ‚‚) β‰ˆ Ο†(x₁)

where R is the residual function and Y(xβ‚‚) ∈ [0, 1] is the yaw coefficient, a soft gate of the residuals.

slide-22
SLIDE 22

Problem formulation

  • Yaw coefficient
  • provides a higher magnitude of residuals (thus a heavier fix) to a face that deviates more from the frontal pose
  • Y(x) = 0 for a frontal face, and the coefficient gradually changes from 0 to 1 as the face pose shifts from frontal to a complete profile
  • The soft gate can be viewed as a correction mechanism that adopts top-down information (the yaw in our case) to influence the feed-forward process

slide-23
SLIDE 23

Network structure – the DREAM block

The Deep Residual EquivAriant Mapping (DREAM) block
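A minimal PyTorch sketch of a DREAM-style block, assuming a 256-dimensional stem feature and a yaw coefficient in [0, 1] supplied by a separate pose branch; the layer widths and the residual branch operating on the stem feature are illustrative choices, not necessarily the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DreamBlock(nn.Module):
    """Adds a yaw-gated residual to the stem feature:
       out = feat + yaw * R(feat), with yaw in [0, 1]."""
    def __init__(self, feat_dim=256, hidden_dim=256):
        super().__init__()
        # residual branch R(.): two fully connected layers (illustrative sizes)
        self.residual = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.PReLU(),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, feat, yaw):
        # feat: (batch, feat_dim) stem CNN feature
        # yaw:  (batch,) soft gate, 0 for frontal faces, ~1 for a full profile
        return feat + yaw.unsqueeze(1) * self.residual(feat)

# usage sketch:
# block = DreamBlock(feat_dim=256)
# mapped = block(stem_features, yaw_coefficients)   # "frontalized" features
```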

slide-24
SLIDE 24

Usage of DREAM

  • Stitching
  • Stitch the DREAM block to an existing stem CNN

  • End-to-end + Stitching
  • First end-to-end training
  • Followed by DREAM block fine-tuning
  • DREAM block training
slide-25
SLIDE 25

Visualization

slide-26
SLIDE 26

Visualization

slide-27
SLIDE 27

Results on Celebrities in Frontal-Profile (CFP)

  • Equal error rate (EER).
  • Baselines
  • CDFE - two transforms are simultaneously learned to map the samples in the two modalities respectively to a common feature space
  • JB - Joint Bayesian approach for face verification
  • FF - face frontalization, which morphs faces from profile to frontal with a generative adversarial network
  • S. Sengupta et al. Frontal to profile face verification in the wild. In WACV, 2016

[Results table: EER values 7.26 and 7.82 shown on the slide.]

slide-28
SLIDE 28

Results on IJB-A

slide-29
SLIDE 29

Further analysis

slide-30
SLIDE 30

Summary

  • Equivariant mapping in the deep feature space
  • Performing frontalization in the feature space is more fruitful than in the image space
  • Easy to use, light-weight, and can be implemented with negligible computational overhead

slide-31
SLIDE 31

WIDER FACE

slide-32
SLIDE 32

Diversity

MIT+CMU FDDB WIDER FACE

slide-33
SLIDE 33

Data scale

Number of labeled faces per dataset:

  • AFW: 468
  • MIT+CMU: 507
  • PASCAL FACE: 1,335
  • FDDB: 5,171
  • MALF: 11,931
  • IJB-A: 49,759
  • WIDER FACE: 393,703

slide-34
SLIDE 34

Richer annotations

Number of annotations per dataset:

  • MIT+CMU: 507
  • PASCAL FACE: 1,335
  • AFW: 2,808
  • FDDB: 5,171
  • IJB-A: 49,759
  • MALF: 95,448
  • WIDER FACE: 393,703 Γ— 6 = 2,362,218

slide-35
SLIDE 35

Rich events

[Figure: detection-rate plot and example images for the "Traffic" event category.]

slide-36
SLIDE 36

Rich events

[Figure: detection-rate plot and example images for the "Students Schoolkids" event category.]

slide-37
SLIDE 37

Rich events

[Figure: detection-rate plot and example images for the "Handshaking" event category.]

slide-38
SLIDE 38

Rich label annotations

Occlusion Illumination Expression Pose Blur Normal Extreme Intermediate

slide-39
SLIDE 39

WIDER FACE is more challenging

[Figure: detection rate vs. number of proposals per image on AFW.]

slide-40
SLIDE 40

WIDER FACE is more challenging

[Figure: detection rate vs. number of proposals per image on AFW and PASCAL FACE.]

slide-41
SLIDE 41

WIDER FACE is more challenging

[Figure: detection rate vs. number of proposals per image on AFW, PASCAL FACE, and FDDB.]

slide-42
SLIDE 42

WIDER FACE is more challenging

[Figure: detection rate vs. number of proposals per image on AFW, PASCAL FACE, FDDB, and IJB-A.]

slide-43
SLIDE 43

WIDER FACE is more challenging

[Figure: detection rate vs. number of proposals per image on AFW, PASCAL FACE, FDDB, IJB-A, and the WIDER FACE Easy, Medium, and Hard subsets.]

slide-44
SLIDE 44

Webpage: http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/

slide-45
SLIDE 45

WIDER FACE Benchmark

Average precision (Easy): FAN - 0.946, Face R-FCN - 0.943, SFD - 0.935, ..., 2015 method - 0.711
Average precision (Medium): FAN - 0.936, Face R-FCN - 0.931, SFD - 0.921, ..., 2015 method - 0.636
Average precision (Hard): FAN - 0.885, Face R-FCN - 0.876, SFD - 0.858, ..., 2015 method - 0.400

slide-46
SLIDE 46

Is there anything else I can solve?

  • While maintaining good detection performance
  • Light-weight architecture and speed
  • Training with fewer annotated data
  • Coping with noisy annotations
  • …
slide-47
SLIDE 47

Face Detection

Face Detection through Scale-Friendly Deep Convolutional Networks

  • S. Yang, Y. Xiong, C. C. Loy, X. Tang

https://arxiv.org/pdf/1706.02863.pdf, 2017

slide-48
SLIDE 48

Problem

  • The clues to be gleaned for recognizing a 300-pixel tall face are qualitatively different from those for recognizing a 10-pixel tall face
  • More convolution layers are required to learn highly representative features that can distinguish faces with large appearance variations
  • By going deeper, spatial information is lost through pooling or convolution operations
  • Dilated convolution? Remove pooling?
slide-49
SLIDE 49

Motivation

  • Faces with different scales possess different inherent visual cues and thus lead to disparate detection difficulties

  • Use different specialized network structures
slide-50
SLIDE 50

A naΓ―ve solution

[Diagram: two-stage pipeline. Stage 1: the input image is fed to four multiscale proposal networks (covering faces of 10-30, 30-120, 120-240, and 240-480 pixels) that produce response maps and proposals. Stage 2: four multiscale detection networks (operating on 30Γ—30, 120Γ—120, 240Γ—240, and 480Γ—480 pixel inputs) produce the final detection results.]

slide-51
SLIDE 51

Solution

  • Split a large range of target scales into a set of sub-ranges
  • Each sub-range is modeled by a specialized network with carefully designed depth and spatial pooling to optimize the receptive field for that particular range
  • Combine the sub-nets into a single network and optimize them end-to-end
  • Previous state of the art: average precision of 81%, running at 0.6 fps
  • Ours: average precision of 76.4% at 7 fps
slide-52
SLIDE 52

Faster R-CNN

Ren et al., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NIPS 2015

Fixed-size feature maps for each ROI

slide-53
SLIDE 53

ScaleFace

  • Contains three scale-variant detectors with different spatial pooling strides and depths
  • The scale-variant detectors are integrated into a single backbone network by sharing the representation (ResNet-50)
  • Single-scale inference: a single input image is used, without an image pyramid

slide-54
SLIDE 54

ScaleFace

Given a test image, a forward pass is performed and each scale-variant face detector will generate detection windows independently

slide-55
SLIDE 55

Finding a network for specific scale range

  • Faces with different scales can be better modeled by networks with different spatial pooling structures

Experiment
  • Group faces into three classes according to image height: small (10px - 40px), medium (40px - 140px), and large (140px or more)
  • For each face group, train four deep networks with different spatial pooling structures

slide-56
SLIDE 56

Finding a network for specific scale range

  • The best performance for a given scale is obtained when the projected face scale on the feature map is close to the ROI template

slide-57
SLIDE 57

Finding a network for specific scale range

  • Convolutional features at higher layers tend to have a smaller projected ROI size
  • The detection performance for a target scale consistently decreases when the ROI on the target layer is smaller than the ROI pooling size
  • Even if we increase the depth of the network, which generally improves the discriminative power of the feature representation, the detection performance still drops

The green box represents the ROI template

slide-58
SLIDE 58

Finding a network for specific scale range

  • Remapped features with a size similar to the ROI template will yield the best detection performance.
  • If the projected region is much larger than the ROI template, discriminative information will be lost during the pooling procedure.
  • On the other hand, if the projected region is much smaller than the ROI template, insufficient information and overlap between features will cause a performance drop.

slide-59
SLIDE 59

How many scale-variant detectors

  • Small faces (less than 40 pixels in height)
  • lose most appearance information and can be characterized by rigid structures and context.
  • Medium faces (40px - 140px)
  • have high variance since the persons in these images are usually not the photographer's main subjects, and can therefore be in various poses looking in different directions.
  • Large faces (140px or more)
  • usually have low variance as they are the main subjects when a photo is captured; these large faces are usually in a frontal or profile pose.
slide-60
SLIDE 60

How many scale-variant detectors

slide-61
SLIDE 61

How to combine the scale-variant detectors

[Diagram: the same two-stage multiscale pipeline as in the naΓ―ve solution, with four scale-specific proposal networks (10-30, 30-120, 120-240, 240-480 pixels) followed by four scale-specific detection networks.]

slide-62
SLIDE 62

Online hard negative mining

  • Detection datasets contain an overwhelming number of easy examples and a small number of hard examples.
  • Automatic selection of these hard examples can make training more effective and efficient
  • Training examples are sampled according to a non-uniform, non-stationary distribution that depends on the current loss of each example under consideration (a selection sketch is given after the reference below)

Shrivastava et al., Training Region-based Object Detectors with Online Hard Example Mining, CVPR 2016
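A schematic sketch of the hard-example selection step, assuming per-ROI losses from a forward pass over all ROIs; deduplication of highly overlapping ROIs (done via NMS in the original method) is omitted here.

```python
import torch

def select_hard_examples(roi_losses, batch_size=128):
    """Online hard example mining: keep the ROIs with the largest current loss.

    roi_losses: (num_rois,) per-ROI loss from a forward pass over all ROIs
    Returns the indices of the hard examples used for back-propagation.
    """
    k = min(batch_size, roi_losses.numel())
    _, hard_idx = torch.topk(roi_losses, k)   # highest-loss ROIs are the hard ones
    return hard_idx

# usage sketch:
# hard = select_hard_examples(per_roi_loss.detach(), batch_size=128)
# loss = per_roi_loss[hard].mean()   # only hard examples contribute to the gradient
# loss.backward()
```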

slide-63
SLIDE 63

Online hard negative mining

Shrivastava et al., Training Region-based Object Detectors with Online Hard Example Mining, CVPR 2016

slide-64
SLIDE 64

Online hard negative mining

slide-65
SLIDE 65

Results

slide-66
SLIDE 66

Results

slide-67
SLIDE 67

Results

Runtime tested on an NVIDIA Titan X GPU by averaging over 1,000 images randomly sampled from the WIDER FACE dataset. Evaluation of different range partitioning schemes across the three difficulty settings of WIDER FACE (Easy, Medium, Hard).

slide-68
SLIDE 68

Face Attribute Recognition

Learning Deep Representation for Imbalanced Classification

  • C. Huang, Y. Li, C. C. Loy, X. Tang

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016
Code available: http://mmlab.ie.cuhk.edu.hk/projects/LMLE.html

slide-69
SLIDE 69

CelebA face attributes dataset

200K celebrity images, each annotated with 40 attributes
Liu et al., "Deep Learning Face Attributes in the Wild", ICCV 2015
http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html

slide-70
SLIDE 70

CelebA face attributes dataset

slide-71
SLIDE 71

Previous work

[Figure: the attribute prediction pipeline of Liu et al.: (a) LNeto and (b) LNets localize the face, (c) ANet extracts features, and (d) the extracted features are fed to linear SVMs to predict attributes such as Smiling, Wavy Hair, No Beard, and High Cheekbones.]

Liu et al. β€œDeep Learning Face Attributes in the Wild”, ICCV 2015

slide-72
SLIDE 72

Previous work

  • Classification accuracy is biased toward the majority class
  • accuracy = (t_p + t_n) / (N_p + N_n)
  • We adopt a balanced accuracy
  • accuracy = (1/2) (t_p / N_p + t_n / N_n)

N_p and N_n are the numbers of positive and negative samples, while t_p and t_n are the numbers of true positives and true negatives.

slide-73
SLIDE 73

A more fundamental problem

  • Without handling the imbalanced class issue
  • Prediction biases toward the majority class
  • Poor accuracy for the minority class

CelebA positive/negative distribution

slide-74
SLIDE 74

Existing solutions

  • Class re-sampling [Drummond & Holte, ICML'03]
  • Random under-sampling of the majority class - removes valuable information
  • Random over-sampling of the minority class - introduces artificial noise
  • Cost-sensitive learning [Zadrozny et al., ICDM'03]
  • Assigns higher misclassification costs to the minority class - how to design the costs?

slide-75
SLIDE 75

Motivation

  • Is there a better way apart from re-sampling and cost learning?

Minority class: very few instances with a high degree of visual variability. The genuine neighborhood of these instances is easily invaded by imposter nearest neighbors. Can we introduce tighter constraints to ameliorate such invasion?

slide-76
SLIDE 76

Triplet loss helps to a certain extent

  • Class-level constraint
  • x_i - an anchor
  • x_i^p - a positive instance (of the same class)
  • x_i^n - a negative instance (of a different class)

[Figure: class 1 (minority, wearing hat) vs. class 2 (majority, not wearing hat), showing the anchor x_i, the positive x_i^p, and the negative x_i^n.]

slide-77
SLIDE 77

Triplet loss helps to a certain extent

[Figure: 2D feature embeddings of one imbalanced binary face attribute, comparing features extracted from the DeepID2 model with a triplet embedding. Legend: class 1 clusters (PC1, PC2, minority) and class 2 clusters (NC1-NC5, majority).]

slide-78
SLIDE 78

Contributions

  • Learning deep feature embedding for imbalanced data classification
  • A new method that preserves locality across clusters and discrimination between classes

  • Large margin classification via fast cluster-wise kNN search
slide-79
SLIDE 79

Our solution compared to triplet loss

[Figure: 2D feature embeddings of one imbalanced binary face attribute, comparing features extracted from the DeepID2 model, the triplet embedding, and our solution. Legend: class 1 clusters (PC1, PC2) and class 2 clusters (NC1-NC5).]

slide-80
SLIDE 80

Large Margin Local Embedding

  • Our goal:

Learn a Euclidean embedding f(x) from an image x into a feature space R^d, such that the embedded features are discriminative with minimal possible local class imbalance.

  • Main idea:
  • 1. Find patterns (clusters) in each class
  • 2. Draw the classification boundary locally, only between marginal clusters, so it does not depend on class size
  • 3. Learn deep features to reduce class imbalance in any local neighborhood
slide-81
SLIDE 81

Large Margin Local Embedding

[Diagram: five CNNs with shared parameters embed a quintuplet (x_i, x_i^{p+}, x_i^{p-}, x_i^{p--}, x_i^n) sampled from mini-batches of training samples; the embeddings f(x_i), f(x_i^{p+}), f(x_i^{p-}), f(x_i^{p--}), f(x_i^n) feed a triple-header hinge loss.]

slide-82
SLIDE 82

Quintuplet sampling

  • Cluster- and class-level
  • x_i - an anchor
  • x_i^{p+} - the anchor's most distant within-cluster neighbor
  • x_i^{p-} - the nearest within-class neighbor of the anchor, but from a different cluster
  • x_i^{p--} - the most distant within-class neighbor of the anchor
  • x_i^n - the nearest between-class neighbor of the anchor

[Figure: class 1 (minority) and class 2 (majority) with their clusters, showing the anchor x_i and its quintuplet members x_i^{p+}, x_i^{p-}, x_i^{p--}, x_i^n.]

slide-83
SLIDE 83

Quintuplet sampling

  • Ensure the following relationship:

D(f(x_i), f(x_i^{p+})) < D(f(x_i), f(x_i^{p-})) < D(f(x_i), f(x_i^{p--})) < D(f(x_i), f(x_i^n))

where D(f(x_i), f(x_j)) = ||f(x_i) - f(x_j)||_2^2 is the Euclidean distance.

[Figure: class 1 (minority) and class 2 (majority) with their clusters, as in the previous slide.]

slide-84
SLIDE 84

Advantages

  • Richer information and a stronger constraint than the conventional class-level image similarity

  • No information loss unlike under-sampling
  • No artificial noise unlike over-sampling
slide-85
SLIDE 85

How to obtain the clusters?

  • Obtain the initial clusters for each class by applying k-means on some prior features
  • For face attribute recognition, we use pre-trained DeepID2 features
  • Alternating scheme
  • Refine the clusters using features extracted from the proposed model itself every n iterations

slide-86
SLIDE 86

Triple-header hinge loss

  • To constrain three margins between the four distances:

min Ξ£_i (Ξ΅_i + Ο„_i + Οƒ_i) + Ξ»||W||_2^2

s.t.:
max(0, g_1 + D(f(x_i), f(x_i^{p+})) - D(f(x_i), f(x_i^{p-}))) ≀ Ξ΅_i
max(0, g_2 + D(f(x_i), f(x_i^{p-})) - D(f(x_i), f(x_i^{p--}))) ≀ Ο„_i
max(0, g_3 + D(f(x_i), f(x_i^{p--})) - D(f(x_i), f(x_i^n))) ≀ Οƒ_i
βˆ€i: Ξ΅_i β‰₯ 0, Ο„_i β‰₯ 0, Οƒ_i β‰₯ 0
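A PyTorch sketch of this loss, assuming the quintuplet embeddings have already been computed and the margins g1, g2, g3 are hyperparameters; the weight-decay term Ξ»||W||Β² is left to the optimizer.

```python
import torch
import torch.nn.functional as F

def triple_header_hinge_loss(f_a, f_pp, f_pm, f_pmm, f_n, g1=0.2, g2=0.2, g3=0.2):
    """Sketch of the triple-header hinge loss over quintuplet embeddings.

    f_a   : anchor embeddings f(x_i)
    f_pp  : most distant within-cluster neighbors f(x_i^{p+})
    f_pm  : nearest within-class, cross-cluster neighbors f(x_i^{p-})
    f_pmm : most distant within-class neighbors f(x_i^{p--})
    f_n   : nearest between-class neighbors f(x_i^n)
    All tensors have shape (batch, dim); g1, g2, g3 are margin hyperparameters.
    """
    def D(a, b):                        # squared Euclidean distance
        return (a - b).pow(2).sum(dim=1)

    eps   = F.relu(g1 + D(f_a, f_pp)  - D(f_a, f_pm))    # within-cluster vs. cross-cluster margin
    tau   = F.relu(g2 + D(f_a, f_pm)  - D(f_a, f_pmm))   # within-class margin
    sigma = F.relu(g3 + D(f_a, f_pmm) - D(f_a, f_n))     # between-class margin
    return (eps + tau + sigma).mean()   # lambda * ||W||^2 handled as optimizer weight decay
```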

slide-87
SLIDE 87

Triple-header hinge loss

  • To constrain three margins between the four distances

[Same objective and constraints as on the previous slide, shown together with the distance ordering D(f(x_i), f(x_i^{p+})) < D(f(x_i), f(x_i^{p-})) < D(f(x_i), f(x_i^{p--})) < D(f(x_i), f(x_i^n)).]

slide-88
SLIDE 88

Triple-header hinge loss

[Figure: clusters of Class 1, Class 2, ..., Class c illustrated in a 2D embedding space, with the margins g_1, g_2, g_3 separating within-cluster, within-class, and between-class distances; the objective and constraints are the same as on the previous slides.]

slide-89
SLIDE 89

Network architecture (learning)

[Diagram: the same quintuplet embedding network as on slide 81 - five CNNs with shared parameters map (x_i, x_i^{p+}, x_i^{p-}, x_i^{p--}, x_i^n) from mini-batches to embeddings that feed the triple-header hinge loss.]

slide-90
SLIDE 90

Summary of steps

  • Clustering by k-means
  • Generate quintuplets from cluster & class membership
  • Re-sample batches equally from each class
  • Forward the quintuplets through the CNN to compute the loss
  • Back-propagation

Feature-based clustering and feature learning/updating alternate every 5000 iterations (a control-flow sketch follows).
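A high-level control-flow sketch of the alternating scheme; `kmeans_per_class`, `sample_quintuplets`, `balanced_batches`, and `model.extract_features` are hypothetical helpers used only to mirror the steps above.

```python
def train_lmle(model, images, labels, optimizer, total_iters, refresh_every=5000):
    """Alternate feature-based clustering with feature learning (sketch only)."""
    feats = model.extract_features(images)          # initial features (e.g., DeepID2)
    clusters = kmeans_per_class(feats, labels)      # step 1: clustering by k-means

    for it in range(total_iters):
        batch = balanced_batches(images, labels)    # step 3: equal samples per class
        quints = sample_quintuplets(batch, clusters, labels)   # step 2: quintuplets
        embeddings = [model(x) for x in quints]     # step 4: forward quintuplets
        loss = triple_header_hinge_loss(*embeddings)
        optimizer.zero_grad()
        loss.backward()                             # step 5: back-propagation
        optimizer.step()

        if (it + 1) % refresh_every == 0:           # refresh clusters with current features
            feats = model.extract_features(images)
            clusters = kmeans_per_class(feats, labels)
```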

slide-91
SLIDE 91

Why is it effective?

  • Triplet loss
  • The similarity information is only extracted at the class-level
  • Homogeneously collapse each class irrespective of their different degrees of variation
  • When a class has high data variability, it is also hard to maintain the class-wise margin
  • Triple-header hinge loss
  • Generates diverse quintuplets that differ in the membership of both clusters and classes
  • Captures the considerable data variability within each class
  • Can easily enforce the local margin
slide-92
SLIDE 92

Nearest neighbor imbalanced classification

  • We modify kNN in two ways:
  • 1. In the well-clustered LMLE embedding space, we treat each cluster as a class-specific exemplar and perform a fast cluster-wise kNN search.
  • 2. Use a large-margin decision

Let Ο†(q) be query q's local neighborhood defined by its k nearest cluster centroids {m_i}_{i=1}^k:

y_q = argmax_{c=1,...,C} ( min_{m_j ∈ Ο†(q), y_j β‰  c} D(f(q), f(m_j)) - max_{m_i ∈ Ο†(q), y_i = c} D(f(q), f(m_i)) )
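A NumPy sketch of this decision rule, assuming embedded cluster centroids with class labels; the names and the choice of k are illustrative.

```python
import numpy as np

def cluster_knn_predict(query_feat, centroids, centroid_labels, k=5):
    """Large-margin cluster-wise kNN classification sketch.

    query_feat:      (d,) embedded query f(q)
    centroids:       (m, d) embedded cluster centroids
    centroid_labels: (m,) class label of each cluster
    Within the k nearest centroids, pick the class that maximizes
    (distance to the nearest other-class centroid) minus
    (distance to the farthest same-class centroid).
    """
    dists = np.sum((centroids - query_feat) ** 2, axis=1)    # squared Euclidean distances
    nn_idx = np.argsort(dists)[:k]                           # local neighborhood phi(q)
    local_d, local_y = dists[nn_idx], centroid_labels[nn_idx]

    best_class, best_margin = None, -np.inf
    for c in np.unique(local_y):
        same, other = local_d[local_y == c], local_d[local_y != c]
        if other.size == 0:          # only one class present in the neighborhood
            return c
        margin = other.min() - same.max()
        if margin > best_margin:
            best_class, best_margin = c, margin
    return best_class
```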

slide-93
SLIDE 93

CelebA dataset (100k train, 10k test)

Class imbalance level = |positive class rate - 50|%
ANet: classification accuracy = 87.24%, balanced accuracy = 80.02%
Ours: classification accuracy = 90.35%, balanced accuracy = 84.25%

slide-94
SLIDE 94

CelebA dataset (100k train, 10k test)

  • Code available: http://mmlab.ie.cuhk.edu.hk/projects/LMLE.html

[Figure: relative accuracy gain (%) of our method over ANet, PANDA, and Triplet-kNN on face attributes, plotted against class imbalance level (%); the x-axis arrow indicates increasingly imbalanced attributes.]

slide-95
SLIDE 95

Thanks