Towards the subjective-ness in facial expression analysis


SLIDE 1

Towards the subjective-ness in facial expression analysis

Jiabei Zeng, Ph.D. August 21, 2019 @ VALSE Webinar

SLIDE 2

Recognizing facial expressions is subjective: different individuals understand the same facial expression differently.

SLIDE 3

The six basic emotions

 Universal across cultures

Example: "He was about to fight." → angry

SLIDE 4

The six basic emotions

 Universal across cultures

Example: "His child had just died." → sad

SLIDE 5

Universal ≠ 100% consistent

Elfenbein H. A., Ambady N. On the universality and cultural specificity of emotion recognition: a meta-analysis. Psychological Bulletin, 2002, 128(2): 203.


SLIDE 6

Humans’ annotations are subjective. How do we make the machines objective?

Subjective-ness of humans → training dataset has annotation bias → trained system has recognition bias → subjective-ness of the machines

SLIDE 7

Humans’ annotations are subjective. How do we make the machines objective?

 "兼听则明,偏信则暗" ("Listen to many sides and be enlightened; trust only one side and be misled"): learn the classifier from multiple datasets instead of only one
 Describe facial expression in a more objective way: the Facial Action Coding System (FACS)

SLIDE 8

Humans’ annotations are subjective. How do we make the machines objective?

 "兼听则明,偏信则暗" ("Listen to many sides and be enlightened; trust only one side and be misled"): learn the classifier from multiple datasets instead of only one
 Describe facial expression in a more objective way: the Facial Action Coding System (FACS)

SLIDE 9

Challenge

 How to evaluate the machine?
  • Consistent performance boost on diverse test datasets.
 How to train the machine?
  • More data by merging multiple training datasets ≠ better performance of the trained system: a model trained on the merged set A+R can underperform models trained on A or R alone (A+R < R, A+R < A).

SLIDE 10

Learn from datasets with annotation biases

 Inconsistent Pseudo Annotations to Latent Truth (IPA2LT) framework, which leverages:
  • multiple inconsistent annotations
  • unlabeled data


Step 1: Train machine coders

Model A ← Data A (labelA: happy, labelA: disgust, …)
Model B ← Data B (labelB: sad, labelB: fear, …)

SLIDE 11

Learn from datasets with annotation biases

 Inconsistent Pseudo Annotations to Latent Truth (IPA2LT) framework, which leverages:
  • multiple inconsistent annotations
  • unlabeled data


Step 1: Train machine coders

Model A ← Data A (labelA: happy, labelA: disgust, …)
Model B ← Data B (labelB: sad, labelB: fear, …)

Step 2: Predict pseudo labels

Model A predicts on Data B and Data U (predA: sad, predA: disgust, …)
Model B predicts on Data A and Data U (predB: happy, predB: angry, …)

SLIDE 12

Learn from datasets with annotation biases

 Inconsistent Pseudo Annotations to Latent Truth (IPA2LT) framework, which leverages:
  • multiple inconsistent annotations
  • unlabeled data


Step 1: Train machine coders → Step 2: Predict pseudo labels → Step 3: Train Latent Truth Net

For each sample in Data A, Data B, and Data U, the observed labels (human labels such as labelA: disgust or labelB: fear, plus pseudo labels such as predA: disgust or predB: disgust) are combined to estimate a latent truth (LT), e.g., LT: disgust, LT: angry, LT: sad.
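As a rough Python sketch of Steps 1 and 2 (a minimal illustration assuming PyTorch-style models and loaders that yield (image, label) pairs; the names are mine, not from the released code):

    import torch

    def train_coder(model, loader, epochs=10):
        """Step 1: fit one machine coder on a single dataset's own labels."""
        opt = torch.optim.Adam(model.parameters(), lr=1e-4)
        loss_fn = torch.nn.CrossEntropyLoss()
        for _ in range(epochs):
            for images, labels in loader:
                opt.zero_grad()
                loss_fn(model(images), labels).backward()
                opt.step()

    @torch.no_grad()
    def pseudo_annotate(model, loader):
        """Step 2: predict pseudo labels on data the coder was not trained on."""
        return torch.cat([model(images).argmax(dim=1) for images, _ in loader])

    # After Step 2, every sample carries several inconsistent annotations:
    # Data A: (labelA, predB); Data B: (labelB, predA); Data U: (predA, predB).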

SLIDE 13

Learn from datasets with annotation biases

 Inconsistent Pseudo Annotations to Latent Truth (IPA2LT) framework, which leverages:
  • multiple inconsistent annotations
  • unlabeled data


Step 1: Train machine coders → Step 2: Predict pseudo labels → Step 3: Train Latent Truth Net

SLIDE 14

Conventional architecture vs. Latent Truth Net

 Conventional architecture

 p is the predicted probability of each facial expression
 y is the ground truth label
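The slide's formula is not preserved in this transcript; reconstructed from the definitions above, the conventional architecture minimizes the standard cross-entropy (assuming the usual one-hot encoding of y over the expression classes):

    \mathcal{L}(p, y) = -\sum_{i} y_i \log p_i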


SLIDE 15

Conventional architecture vs. Latent Truth Net

 LTNet learns from samples with inconsistent annotations


Diagram: LTNet outputs a latent truth probability; a transition matrix for each coder maps it to the predicted annotation for that coder.
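In equation form, coder c's annotation distribution is p^c = T^c p, with p the latent-truth distribution. A minimal PyTorch-style sketch (the layer and variable names are mine, not from the released LTNet code):

    import torch
    import torch.nn as nn

    class LTNet(nn.Module):
        """Backbone predicts the latent truth; one learned transition
        matrix per coder maps it to that coder's annotation distribution."""
        def __init__(self, backbone, num_classes, num_coders):
            super().__init__()
            self.backbone = backbone  # images -> (B, num_classes) logits
            self.transitions = nn.Parameter(
                torch.eye(num_classes).repeat(num_coders, 1, 1))

        def forward(self, x):
            p_latent = self.backbone(x).softmax(dim=1)  # latent truth prob.
            T = self.transitions.softmax(dim=2)         # each row sums to 1
            # p_coders[b, c, j] = sum_k p_latent[b, k] * T[c, k, j]
            p_coders = torch.einsum('bk,ckj->bcj', p_latent, T)
            return p_latent, p_coders

Training then maximizes the likelihood of every observed (possibly inconsistent) annotation under its own coder's output distribution.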

SLIDE 16

Experiments on synthetic data

 Synthetic data
  • Make 3 copies of the training set of CIFAR-10.
  • Randomly add 20%, 30%, 40% label noise, respectively.
  • Evaluate the methods on the clean test set of CIFAR-10.
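A small Python sketch of this setup (my own illustration; torchvision assumed for the dataset):

    import random
    import torchvision

    def noisy_labels(labels, noise_rate, num_classes=10):
        """Flip a fraction of the labels to a random wrong class."""
        corrupted = list(labels)
        for i in range(len(corrupted)):
            if random.random() < noise_rate:
                wrong = [c for c in range(num_classes) if c != corrupted[i]]
                corrupted[i] = random.choice(wrong)
        return corrupted

    train = torchvision.datasets.CIFAR10(root='data', train=True, download=True)
    # Three inconsistent "coders": the same images with differently noised labels.
    coder_labels = [noisy_labels(train.targets, r) for r in (0.2, 0.3, 0.4)]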


 LTNet can reveal the true labels

Figure: LTNet-learned latent truth vs. ground truth.

SLIDE 17

Experiments on synthetic data

 Evaluations on synthetic data

  • LTNet is comparable to the CNN trained on clean data

Table: test accuracy of different methods.

SLIDE 18

Experiments on FER datasets

 Training data
  • Dataset A: AffectNet (training part)
  • Dataset B: RAF (training part)
  • Unlabeled data:
    • un-annotated part of AffectNet (~700,000 images)
    • unlabeled facial images downloaded from Bing (~500,000 images)
 Test data
  • In-the-lab: CK+, MMI, CFEE, Oulu-CASIA
  • In-the-wild: SFEW, AffectNet (validation part), RAF (test part)


SLIDE 19

Experiments on FER datasets

 Evaluation on FER datasets

Table 1. Test accuracy of different methods (bold: best, underline: second best).

SLIDE 20

Experiments on FER datasets

 LTNet-learned transition matrix T for 4 coders

  • Human coder (RAF) is the most reliable
  • Labels in RAF are derived from ~40 annotations per image

Figure: transition matrices for the four coders: human coder (AffectNet), human coder (RAF), machine coder (AffectNet-trained model), machine coder (RAF-trained model).

SLIDE 21

Experiments on FER datasets

 Statistics of the samples
  • For the majority of samples, the latent truth agrees with the human annotation, the model prediction, or both.
  • For a few samples, the latent truth differs from both the human annotation and the model prediction (cases 2 and 3).


SLIDE 22

Experiments on FER datasets

 Examples in the 5 cases

  • LTNet-learned latent truth is reasonable

Legend: H: human annotation; G: LTNet-learned latent truth; A: predictions by the AffectNet-trained model; R: predictions by the RAF-trained model.

SLIDE 23

Humans’ annotations are subjective. How do we make the machines objective?

 "兼听则明,偏信则暗" ("Listen to many sides and be enlightened; trust only one side and be misled"): learn the classifier from multiple datasets instead of only one
 Describe facial expression in a more objective way: the Facial Action Coding System (FACS)

SLIDE 24

From subjective-ness to objective-ness

From the emotional category to the Facial Action Coding System.

SLIDE 25

Facial Action Coding System (FACS)

 Taxonomizes facial muscle movements by their appearance
 Human-defined facial action units (AUs)

* Pictures are from "Facial Action Coding System: Manual" by P. Ekman, W. V. Friesen, J. C. Hager.

AU1: Inner brow raiser; AU2: Outer brow raiser; AU4: Brow lowerer; AU5: Upper lid raiser; AU7: Lid tightener.
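To make the coding concrete, a small Python mapping of the AUs named above, with one textbook combination (surprise is commonly coded as AU1 + AU2 + AU5 + AU26; this example is mine, not from the talk):

    au_names = {
        1: "Inner brow raiser",
        2: "Outer brow raiser",
        4: "Brow lowerer",
        5: "Upper lid raiser",
        7: "Lid tightener",
        26: "Jaw drop",
    }

    # A prototypical 'surprise' face decomposed into objective AU codes:
    surprise = [1, 2, 5, 26]
    print(" + ".join(f"AU{n} ({au_names[n]})" for n in surprise))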


SLIDE 26

What did we usually do?

Supervised learning from manually annotated data: annotated datasets (BP4D, DISFA) + networks (AlexNet, VGGNet) → state-of-the-art models (e.g., JAA-Net, ECCV'18).


SLIDE 27

Can we learn from the unlabeled videos?

Facial actions appear as local changes of the faces between frames!

Learn from the changes!


Changes are easy to detect without manual annotations!

SLIDE 28

Can we learn from the unlabeled videos?

Diagram: change of the face = change of facial actions + change of head poses.


SLIDE 29

Can we learn from the unlabeled videos?

 Supervisory task

  • Change the facial actions or head poses of the source frame to those of the target frame by predicting the related movements, respectively.


Diagram: change of the face = change of facial actions + change of head poses.

SLIDE 30

Self-supervised learning from videos

Diagram: a source image and a target image from a video.


SLIDE 31

Self-supervised learning from videos

Diagram: an AU feature encodes the facial action changes, which re-generate the target image from the source image.


SLIDE 32

Self-supervised learning from videos

Diagram: the facial action changes (AU feature) and the pose changes each re-generate an approximation (≈) of the target image from the source image.

SLIDE 33

Self-supervised learning from videos

Diagram: the facial action changes (AU feature) and the pose changes re-generate approximations (≈) of the target image from the source image.

SLIDE 34

Twin-Cycle AutoEncoder (TCAE)

 Feature disentanglement

Diagram: source and target frames.

AU-related displacements:
  • sparse: local
  • small values: subtle
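These two priors translate naturally into penalties on the predicted AU displacement field; a hedged Python sketch (my own formulation of "sparse and small", not necessarily the exact TCAE losses):

    import torch

    def au_displacement_penalty(disp, w_sparse=1.0, w_small=1.0):
        """disp: (B, 2, H, W) per-pixel (dx, dy) AU-related displacements.
        L1 encourages sparsity (local); L2 keeps the values small (subtle)."""
        return w_sparse * disp.abs().mean() + w_small * disp.pow(2).mean()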

SLIDE 35

Twin-Cycle AutoEncoder

 Cycle with AU changed

Diagram: cycle source → AU-changed → back to source.

  • Pixel consistency


SLIDE 36

Twin-Cycle AutoEncoder

 Cycle with pose changed

  • Pixel consistency

Diagram: cycle source → pose-changed → back to source.

SLIDE 37

Twin-Cycle AutoEncoder

 Target reconstruction

Diagram: the source, transformed by both the AU and the pose changes, reconstructs the target.

  • Pixel consistency
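Together, the two cycles and the target reconstruction are all supervised by pixel-level consistency; a compact sketch of the three L1 terms (the generator names are placeholders, not the exact TCAE code):

    import torch

    def pixel_consistency_losses(src, tgt, au_cycled, pose_cycled, tgt_generated):
        """All arguments are (B, 3, H, W) image tensors.
        au_cycled:     source with AUs changed to the target's, then changed back
        pose_cycled:   source with pose changed to the target's, then changed back
        tgt_generated: source transformed by both the AU and the pose changes"""
        l1 = torch.nn.functional.l1_loss
        return (l1(au_cycled, src)         # cycle with AU changed
                + l1(pose_cycled, src)     # cycle with pose changed
                + l1(tgt_generated, tgt))  # target reconstruction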
SLIDE 38

Twin-Cycle AutoEncoder

 Feature consistency

The AU-changed image should have:
  • the same AU as the target
  • the same pose as the source

Diagram: source, target, and the AU-changed image.

SLIDE 39

Twin-Cycle AutoEncoder

 Feature consistency

The AU-changed image should have:
  • the same AU as the target
  • the same pose as the source

Diagram: source, target, and the pose-changed image.
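A sketch of the feature-consistency idea: re-encode the generated image and require its AU and pose features to match their intended origins (the encoder interface is illustrative and assumed to return an (AU feature, pose feature) pair):

    import torch

    def feature_consistency_loss(encoder, au_changed_img, src, tgt):
        """The AU-changed image should carry the target's AU feature
        and the source's pose feature."""
        mse = torch.nn.functional.mse_loss
        au_gen, pose_gen = encoder(au_changed_img)  # re-encode generated image
        au_tgt, _ = encoder(tgt)
        _, pose_src = encoder(src)
        return mse(au_gen, au_tgt) + mse(pose_gen, pose_src)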
SLIDE 40

Experimental settings

 Training data without AU annotation: VoxCeleb
  • 7,000+ subjects
  • practically infinite random image pairs
 Evaluation protocols (see the sketch below)
  • AU features from the encoder + a simple linear classifier
  • datasets with AU annotations: BP4D, DISFA, GFT

Figure: example images from GFT, BP4D, and DISFA.
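A hedged sketch of that linear-evaluation protocol (scikit-learn for the probe; tcae_encoder is a placeholder for the frozen, trained encoder returning one AU-feature vector per image):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def linear_probe(tcae_encoder, train_imgs, train_aus, test_imgs, test_aus):
        """Freeze the self-supervised encoder and fit only a linear
        classifier per AU on its features."""
        f_train = np.stack([tcae_encoder(im) for im in train_imgs])
        f_test = np.stack([tcae_encoder(im) for im in test_imgs])
        scores = []
        for au in range(train_aus.shape[1]):  # one binary probe per AU
            clf = LogisticRegression(max_iter=1000)
            clf.fit(f_train, train_aus[:, au])
            scores.append(clf.score(f_test, test_aus[:, au]))
        return scores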

SLIDE 41


Comparisons to other methods

Figure: F1 on the DISFA and GFT datasets for handcrafted descriptors, supervised methods (vgg-face, vgg-emotion, alexnet, Res50, DRML, EAC-Net, JAA-Net), and self-supervised methods (SplitBrain, DeformAE, FabNet, TCAE (prop.)).

 TCAE outperforms other self-supervised methods
 TCAE is comparable to state-of-the-art supervised AU detection methods

Figure: F1 on the BP4D dataset for handcrafted descriptors, supervised AU detection methods (alexnet, vgg-face, vgg-emotion, DRML, EAC-Net, JAA-Net, ROI), and self-supervised methods (SplitBrain, DeformAE, FabNet, TCAE (prop.), TCAE-A (prop.)).

SLIDE 42

Analysis of the learned displacement


Figure: facial-action-changed and pose-changed results for TCAE and TCAE-A.

SLIDE 43

Analysis of the learned displacement


Figure: facial-action-changed and pose-changed results for TCAE and TCAE-A.

SLIDE 44

Analysis of the learned displacement

 AU-related displacements are shorter than pose-related displacements

Figure: percentage of transformations vs. the average length of the displacements in a transformation, for TCAE and TCAE-A.

SLIDE 45

Results of facial image retrieval

Figure: query images with the images retrieved by facial actions and by head poses.


SLIDE 46

Summary

 Humans' annotations are subjective. How do we make the machines objective?
 "兼听则明,偏信则暗" ("Listen to many sides and be enlightened; trust only one side and be misled"): learn the classifier from multiple datasets instead of only one
  • the IPA2LT framework that learns from inconsistent annotations: https://github.com/dualplus/LTNet
 Describe facial expression in a more objective way: the Facial Action Coding System (FACS)
  • self-supervised learning with the Twin-Cycle AutoEncoder: https://github.com/mysee1989/TCAE
 More in the future…

SLIDE 47

Thanks for your attention

Q&A
