Towards the Subjectiveness in Facial Expression Analysis
Jiabei Zeng, Ph.D. August 21, 2019 @ VALSE Webinar

Recognizing facial expressions is subjective for human beings: different individuals understand the same facial expression differently.
The six basic emotions
- Universal across cultures
- [Image: "He was about to fight." → angry]
- [Image: "His child had just died." → sad]
Universal ≠ 100% consistent
Elfenbein, H. A., & Ambady, N. On the universality and cultural specificity of emotion recognition: a meta-analysis. Psychological Bulletin, 2002, 128(2): 203.
Humans’ annotations are subjective. How do we make the machines objective?
Subjectiveness of humans → training dataset has annotation bias → trained system has recognition bias → subjectiveness of the machines.
Humans’ annotations are subjective. How do we make the machines objective?
兼听则明，偏信则暗 ("Listen to many sides and be enlightened; believe only one side and be left in the dark"):
- Learn the classifier from multiple datasets instead of only one
- Describe facial expressions in a more objective way: the Facial Action Coding System (FACS)
Challenge
- How to evaluate the machine? Look for a consistent performance boost on diverse test datasets.
- How to train the machine? More data from merging multiple training datasets ≠ better performance of the trained system: a model trained on merged AffectNet + RAF (A+R) performs worse than one trained on RAF alone (A+R < R) or on AffectNet alone (A+R < A).
Learn from datasets with annotation biases
IPA2LT: the Inconsistent Pseudo Annotations to Latent Truth framework learns from
- multiple inconsistent annotations
- unlabeled data
Step 1: Train machine coders
[Figure: Model A is trained on Data A (labels, e.g., happy, disgust, …); Model B is trained on Data B (labels, e.g., sad, fear, …).]
Step 1: Train machine coders
Data B labelB: sad labelB: fear Data A labelA: happy labelA: disgust Data U predA: disgust predA: sad Model A
Step 2: Predict pseudo labels
Model B predB: happy predB: angry Model A predA: sad predA: angry Model B predB: disgust predB: disgust
Step 3: Train the Latent Truth Net
[Figure: for each sample in Data A, Data B, and Data U, LTNet takes the inconsistent human annotations and pseudo labels (e.g., labelA: disgust with predB: disgust; labelB: fear with predA: sad) and estimates a latent truth (LT), e.g., LT: disgust, angry, sad.]
The full three-step pipeline is sketched below.
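A minimal sketch of the three-step IPA2LT pipeline. The helpers `train_classifier`, `predict`, `collect`, `concat`, and `train_ltnet` are hypothetical names for illustration, not taken from the released code:

```python
# Sketch of the IPA2LT pipeline (all helper names are hypothetical).
# Step 1: train one "machine coder" per annotated dataset.
model_a = train_classifier(data_a.images, data_a.labels)   # e.g., AffectNet
model_b = train_classifier(data_b.images, data_b.labels)   # e.g., RAF

# Step 2: every machine coder predicts pseudo labels on all other data.
pseudo = {
    "A_on_B": predict(model_a, data_b.images),
    "A_on_U": predict(model_a, data_u.images),   # unlabeled data
    "B_on_A": predict(model_b, data_a.images),
    "B_on_U": predict(model_b, data_u.images),
}

# Step 3: train LTNet on the union of samples; each sample carries
# several (inconsistent) annotations, one per human/machine coder.
ltnet = train_ltnet(
    images=concat(data_a.images, data_b.images, data_u.images),
    annotations_per_coder=collect(data_a.labels, data_b.labels, pseudo),
)
```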
Conventional architecture vs. Latent Truth Net
Conventional architecture: the network outputs p, the predicted probability of each facial expression, and is trained against the ground-truth label y.
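In LaTeX, this is the standard cross-entropy objective over the C expression classes:

```latex
\mathcal{L}_{\text{conv}} = -\sum_{c=1}^{C} y_c \log p_c
```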
LTNet learns from samples with inconsistent annotations: it estimates a latent truth probability over expressions, and a learned transition matrix for each coder maps the latent truth to that coder's predicted annotation.
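One plausible way to write this formulation, consistent with the slide's description (the notation is mine, not copied from the paper): with latent truth distribution μ and transition matrix T⁽ⁱ⁾ for coder i whose observed annotation is y⁽ⁱ⁾,

```latex
p^{(i)}_c = \sum_{k=1}^{C} T^{(i)}_{k c}\, \mu_k ,
\qquad
\mathcal{L}_{\text{LTNet}} = -\sum_{i \in \text{coders}} \log p^{(i)}_{y^{(i)}}
```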
Experiments on synthetic data
Synthetic data
- Make 3 copies of the training set of CIFAR-10.
- Randomly corrupt 20%, 30%, and 40% of the labels, respectively (see the sketch below).
- Evaluate the methods on the clean test set of CIFAR-10.
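A minimal sketch of the label-noise injection, assuming uniform random flips (the slides do not specify the noise distribution) and that `train_labels` holds the CIFAR-10 training labels:

```python
import numpy as np

def corrupt_labels(labels: np.ndarray, noise_rate: float,
                   num_classes: int = 10, seed: int = 0) -> np.ndarray:
    """Return a copy of `labels` with `noise_rate` of them flipped
    to a different, uniformly chosen class."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    n_flip = int(noise_rate * len(labels))
    idx = rng.choice(len(labels), size=n_flip, replace=False)
    # Draw an offset in [1, num_classes - 1] so the new label always differs.
    offsets = rng.integers(1, num_classes, size=n_flip)
    noisy[idx] = (labels[idx] + offsets) % num_classes
    return noisy

# Three noisy copies of the CIFAR-10 training labels.
copies = [corrupt_labels(train_labels, r, seed=i)
          for i, r in enumerate([0.2, 0.3, 0.4])]
```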
LTNet can reveal the true labels
[Figure: LTNet-learned latent truth vs. ground truth.]
Evaluation on synthetic data
- LTNet is comparable to the CNN trained on clean data.
[Table: test accuracy of different methods.]
Experiments on FER datasets
Training data
- Dataset A: AffectNet (training part)
- Dataset B: RAF (training part)
- Unlabeled data:
  - un-annotated part of AffectNet (~700,000 images)
  - unlabeled facial images downloaded from Bing (~500,000 images)
Test data
- In-the-lab: CK+, MMI, CFEE, Oulu-CASIA
- In-the-wild: SFEW, AffectNet (validation part), RAF (test part)
Evaluation on FER datasets
[Table 1: test accuracy of different methods (bold: best; underline: second best).]
LTNet-learned transition matrix T for the 4 coders: human coder (AffectNet), human coder (RAF), machine coder (AffectNet-trained model), machine coder (RAF-trained model)
- The human coder (RAF) is the most reliable.
- Labels in RAF are derived from ~40 annotations per image.
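To make "reliable" concrete: a reliable coder's transition matrix is close to the identity, meaning its annotations usually match the latent truth. An illustrative, made-up 3-class example (not taken from the paper):

```latex
T^{(\text{reliable})} \approx
\begin{pmatrix}
0.90 & 0.05 & 0.05 \\
0.05 & 0.90 & 0.05 \\
0.05 & 0.05 & 0.90
\end{pmatrix}
```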
Statistics of the samples
- For the majority of samples, the latent truth agrees with the human annotation, the model prediction, or both.
- For a few samples, the latent truth differs from both the human annotation and the model prediction (cases 2 and 3).
Examples in the 5 cases
- The LTNet-learned latent truth is reasonable.
[Figure legend: H: human annotation; G: LTNet-learned latent truth; A: predictions by the AffectNet-trained model; R: predictions by the RAF-trained model.]
Humans’ annotations are subjective. How do we make the machines objective?
The second direction of 兼听则明，偏信则暗: describe facial expressions in a more objective way, using the Facial Action Coding System (FACS).
From subjectiveness to objectiveness
From emotional categories to the Facial Action Coding System.
Facial Action Coding System (FACS)
- Taxonomizes facial muscle movements by their appearance.
- Human-defined facial action units (AUs), e.g.: AU1: inner brow raiser; AU2: outer brow raiser; AU4: brow lowerer; AU5: upper lid raiser; AU7: lid tightener.
* Pictures are from "Facial Action Coding System: Manual" by P. Ekman, W. V. Friesen, J. C. Hager.
What did we usually do?
Supervised learning on manually annotated data (BP4D, DISFA), with networks such as AlexNet, VGGNet, and the state of the art (e.g., JAA-Net, ECCV'18).
Can we learn from the unlabeled videos?
Facial actions appear as local changes of the face between frames. Learn from the changes! Changes are easy to detect without manual annotations.
The change of the face decomposes into the change of facial actions and the change of head poses.
Supervisory task
- Change the facial actions or the head pose of the source frame to those of the target frame, by predicting the related movements separately (change of face = change of facial actions + change of head poses). A sketch follows below.
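A minimal sketch of the supervisory task under stated assumptions: an encoder predicts two dense displacement fields (AU-related and pose-related) between a source and a target frame, and the source is warped by each field to re-generate approximations of the target. The `encoder` module and the `grid_sample`-style warping are my assumptions for illustration, not the released TCAE code:

```python
import torch
import torch.nn.functional as F

def warp(image, flow):
    """Warp `image` (N,C,H,W) by a dense displacement field `flow` (N,H,W,2)
    given in normalized [-1, 1] coordinates."""
    n, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).expand(n, h, w, 2)
    return F.grid_sample(image, base + flow, align_corners=True)

# The encoder is assumed to return the two displacement fields between frames.
au_flow, pose_flow = encoder(source, target)

# Each factor alone changes only one aspect of the source frame.
au_changed   = warp(source, au_flow)    # AUs of the target, pose of the source
pose_changed = warp(source, pose_flow)  # pose of the target, AUs of the source

# Applying both should approximately reconstruct the target frame.
both_changed = warp(au_changed, pose_flow)
recon_loss = F.l1_loss(both_changed, target)
```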
Self-supervised learning from videos
[Figure, built up over several slides: from a source image and a target image, the encoder extracts an AU feature and separates facial action changes from pose changes; each set of changes is used to re-generate an approximation (≈) of the target image.]
Twin-Cycle AutoEncoder (TCAE)
Feature disentanglement (source → target). AU-related displacements are
- sparse: facial actions are local
- small in value: facial actions are subtle
Cycle with AU changed (source → target → source): pixel consistency.
Cycle with pose changed (source → target → source): pixel consistency.
Target reconstruction (source → target): pixel consistency.
Feature consistency. AU-changed image:
- Same AU as the target
- Same pose as the source
Feature consistency. Pose-changed image:
- Same pose as the target
- Same AU as the source
A sketch of the combined losses follows below.
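Putting the pieces together, a hedged sketch of the TCAE training objective as described on these slides (pixel-consistency cycles, target reconstruction, and feature consistency). The cycle construction, the `au_feat`/`pose_feat` helpers, the distance functions, and the (omitted) loss weights are my assumptions:

```python
# Continuing the sketch above (warp, encoder, source, target as before).
au_changed   = warp(source, au_flow)
pose_changed = warp(source, pose_flow)

# Cycles: undo each change to get back to the source frame.
au_cycle   = warp(au_changed,   encoder(au_changed,   source)[0])
pose_cycle = warp(pose_changed, encoder(pose_changed, source)[1])

pixel_loss = (F.l1_loss(au_cycle, source)            # cycle with AU changed
              + F.l1_loss(pose_cycle, source)        # cycle with pose changed
              + F.l1_loss(warp(au_changed, pose_flow), target))  # reconstruction

# Feature consistency: the AU-changed image should share the target's AU
# feature and the source's pose feature (au_feat/pose_feat assumed helpers).
feat_loss = (F.mse_loss(au_feat(au_changed),   au_feat(target))
             + F.mse_loss(pose_feat(au_changed), pose_feat(source)))

loss = pixel_loss + feat_loss   # per-term weights omitted in this sketch
```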
Experimental settings
Training data without AU annotations: VoxCeleb
- 7,000+ subjects
- Practically infinite random image pairs
Evaluation protocol (see the linear-probe sketch below)
- AU features from the encoder + a simple linear classifier
- Datasets with AU annotations: BP4D, DISFA, GFT
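A minimal sketch of this linear-probe evaluation, assuming the encoder is frozen; `extract_au_features` and the `bp4d` data object are hypothetical names for illustration:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Frozen encoder: AU features become fixed inputs to a linear classifier.
X_train = extract_au_features(encoder, bp4d.train_images)
X_test  = extract_au_features(encoder, bp4d.test_images)

f1_per_au = []
for au in range(bp4d.num_aus):          # one binary classifier per AU
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, bp4d.train_labels[:, au])
    f1_per_au.append(f1_score(bp4d.test_labels[:, au],
                              clf.predict(X_test)))
print("mean F1:", sum(f1_per_au) / len(f1_per_au))
```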
Comparisons to other methods
[Charts: F1 on the BP4D, DISFA, and GFT datasets for handcrafted descriptors, supervised methods (vgg-face, vgg-emotion, alexnet, Res50, DRML, EAC-Net, JAA-Net, ROI), and self-supervised methods (SplitBrain, DeformAE, FabNet, TCAE (prop.), TCAE-A (prop.)).]
- TCAE outperforms other self-supervised methods.
- TCAE is comparable to state-of-the-art supervised AU detection methods.
Analysis of the learned displacements
[Figure: images re-generated (≈) with the facial action changed and with the pose changed, comparing TCAE and TCAE-A.]
- AU-related displacements are shorter than pose-related displacements.
[Chart: percentage of transformations vs. average length of the displacements in a transformation, for TCAE and TCAE-A.]
Results of facial image retrieval
[Figure: query images, alongside images retrieved by facial actions and images retrieved by head poses.]
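A minimal sketch of how such retrieval could be done with the disentangled features: nearest neighbors by cosine similarity in the AU-feature space vs. the pose-feature space. The pre-extracted `au_feats`/`pose_feats` matrices and `query_idx` are assumptions, and this is an illustration rather than the authors' exact procedure:

```python
import numpy as np

def retrieve(query_feat: np.ndarray, gallery_feats: np.ndarray, k: int = 5):
    """Return indices of the k gallery images most similar to the query,
    ranked by cosine similarity."""
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q
    return np.argsort(-sims)[:k]

# Same query image, two different notions of similarity:
by_action = retrieve(au_feats[query_idx],   au_feats)    # similar facial actions
by_pose   = retrieve(pose_feats[query_idx], pose_feats)  # similar head poses
```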
Summary
Humans’ annotations are subjective. How do we make the machines objective?
兼听则明，偏信则暗 ("Listen to many sides and be enlightened; believe only one side and be left in the dark"):
- Learn the classifier from multiple datasets instead of only one: the IPA2LT framework that learns from inconsistent annotations, https://github.com/dualplus/LTNet
- Describe facial expressions in a more objective way with the Facial Action Coding System (FACS): self-supervised learning with the Twin-Cycle AutoEncoder, https://github.com/mysee1989/TCAE
More in the future…
Thanks for your attention
Q&A