

SLIDE 1

Manifold Based Sparse Representation for Robust Expression Recognition without Neutral Subtraction

Raymond Ptucha, Grigorios Tsagkatakis, Andreas Savakis, Department of Computer Engineering, Rochester Institute of Technology, Rochester, NY. Nov 13, 2011, BeFIT 2011: 1st IEEE International Workshop on Benchmarking Facial Image Analysis Technologies

Ptucha, Tsagkatakis, Savakis, BeFIT2011 1

ICCV 2011, WS24, Paper 15

Sparse Representations

  • The Sparse Representations (SRs) framework was inspired by studies of neurons in the visual cortex that suggest selective firing of neurons for visual processing.

  • For many input signals, such as natural images, only a small number of exemplars are needed to represent new test images.

  • SR gives state-of-the-art results for pattern recognition, noise reduction, super-resolution, tracking, …

  • At the First Facial Expression Recognition and Analysis Challenge (FERA2011) at FG'11: 13/15 entrants used SVM, but 0/15 entrants used SR.

SLIDE 2

High Level Overview

[Diagram: classification approaches: K-NN, SVM, neural nets]

Related Work

  • Yang [PAMI '07] used dimensionality reduction with SRs for classification purposes.

  • Wright [PAMI '09] used SRs for best-in-class facial recognition.

  • Zafeiriou [CVPR '10] used PCA and SR methods based on Wright for facial expression.
    – Zafeiriou noted this was not an easy task, as facial identity was often confused with facial expression.
    – The solution was to do neutral frame subtraction.

  • The methods proposed in this paper do not require neutral frame subtraction, yet give state-of-the-art results.

SLIDE 3

Hypothesis

  • Methods based on manifold learning and sparse representations can achieve accurate, robust, and efficient facial expression recognition.

[Pipeline diagram: Facial Pre-processing → Facial Parts Processing → Manifold Learning → Sparse Representation Reconstruction Model → Statistical Mixture Model → Temporal Processing]

Refer to two excellent ICCV 2011 papers:
  – Rudovic, O., Pantic, M., "Shape-constrained Gaussian Process Regression for Facial-point-based Head-pose Normalization"
  – Asthana, A., et al., "Fully Automatic Pose-Invariant Face Recognition via 3D Pose Normalization"

  • We term this process Manifold based Sparse Representation (MSR).

Dimensionality Reduction

  • For the purpose of facial understanding, the dimensionality of a 26x20 (∈ R^520) pixel face image or an 82x2 (∈ R^164) set of ASM coordinates is artificially high.

  • The high dimensionality makes the facial understanding algorithms more complex/burdened than necessary.

  • The set of 520 pixels (or 164 coordinates) are actually samples from a lower dimensional manifold that is embedded in a higher dimensional space.

  • We would like to discover this lower dimensional manifold representation (to simplify our facial modeling), a technique formally called manifold learning. [Cayton '05, Ghodsi '06]

  • Given a set of inputs x1..xn ∈ R^D, find a mapping yi = f(xi), y1..yn ∈ R^d, where d < D.

SLIDE 4

Locality Preserving Projections (LPP) [He '03]

  • Given a set of input points x1..xn ∈ R^D, find a mapping yi = A^T xi, where the resulting y1..yn ∈ R^d, and d << D.
    – Same algebra as PCA, if we kept the top d eigenvectors!

  • Create a fully connected adjacency graph W. Assign high weights to close/similar nodes, and low weights to far/dissimilar nodes.
    – Mimic local neighborhood structure from input to projected space.

  • LPP is a linear approximation to the nonlinear Laplacian Eigenmap and is solved via the generalized eigenvector problem: X L X^T a = λ X D X^T a

  • Where:
    – D is a diagonal matrix whose values are the column sums of W,
    – L is the Laplacian matrix: L = D - W,
    – a is the resulting projection matrix (the "eigenvectors"), and
    – λ is the resulting vector importance (the "eigenvalues").
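The generalized eigenproblem above can be sketched directly with standard numerical libraries. This is a minimal illustration, not the authors' implementation; the heat-kernel weighting, neighborhood size k, and the small regularization ridge are assumptions:

```python
import numpy as np
from scipy.linalg import eigh

def lpp(X, d, k=5, t=1.0):
    """Locality Preserving Projections sketch.
    X: (D, n) data matrix, one sample per column.
    Returns A (D, d) so that y = A^T x; the smallest generalized
    eigenvalues preserve local neighborhood structure."""
    D_dim, n = X.shape
    # Pairwise squared distances and heat-kernel adjacency weights W:
    # close/similar nodes get high weight, far/dissimilar nodes low weight.
    dist2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    W = np.exp(-dist2 / t)
    far = np.argsort(dist2, axis=1)[:, k + 1:]  # drop all but k nearest (+self)
    for i in range(n):
        W[i, far[i]] = 0.0
    W = np.maximum(W, W.T)                      # symmetrize the graph
    Dm = np.diag(W.sum(axis=1))                 # degree matrix (column sums of W)
    L = Dm - W                                  # graph Laplacian
    # Generalized eigenproblem: X L X^T a = lambda X Dm X^T a
    lhs = X @ L @ X.T
    rhs = X @ Dm @ X.T + 1e-6 * np.eye(D_dim)   # small ridge for stability
    _, V = eigh(lhs, rhs)                       # eigenvalues in ascending order
    return V[:, :d]
```

The same code with the graph step removed reduces to the PCA-style eigendecomposition noted above.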

Sparse Representations

  • In the SRs framework, a test image is represented as a sparse linear combination of exemplar images from a training dictionary, Φ.

  • The objective of SRs is to identify the smallest number of nonzero coefficients a ∈ R^n such that ŷ = Φa.

  • The solution is equivalent to the Lasso regression: â = arg min ||a||1 s.t. ŷ = Φa, where ||a||1 = Σ |ai|.

  • This is easily solved using iterative convex optimization or Least Angle Regression with Lasso (LARS).
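The ℓ1 objective above can be minimized by iterative soft-thresholding (ISTA), one simple instance of the iterative convex optimization approaches mentioned. This sketch solves the penalized (Lasso) form rather than the equality-constrained one; the step size and λ are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def sparse_code(Phi, y, lam=0.01, n_iter=2000):
    """Sparse coding of y over dictionary Phi via ISTA, minimizing
    0.5*||y - Phi a||_2^2 + lam*||a||_1.
    Phi: (D, n) dictionary with unit-norm columns; y: (D,) test sample."""
    a = np.zeros(Phi.shape[1])
    step = 1.0 / (np.linalg.norm(Phi, 2) ** 2)  # 1 / Lipschitz constant of gradient
    for _ in range(n_iter):
        z = a - step * (Phi.T @ (Phi @ a - y))  # gradient step on quadratic term
        a = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return a
```

As λ → 0 the penalized solution approaches the constrained form ŷ = Φa; LARS instead traces the entire Lasso regularization path.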

SLIDE 5

MSR: Putting it Together

  • MSR exploits the discriminative behavior of manifold learning, and combines it with the parsimonious power of sparse signal representation.

[Diagram: test sample → manifold learning → ℓ1 optimization against the training dictionary Φ ∈ R^(nxd) (n training samples, each originally ∈ R^D); test face ≈ Σ(i=1..n) ai Φi → classifier; bar plot of sparse coefficients]

  • Top non-negative 'a' sparse coefficients for a test "sad" face. Interesting… but how do we turn this into a classifier?
    – Max peak?
    – Max non-zero coefficients?
    – Max energy?

[Bar plot of coefficients grouped by class: A: Anger, C: Contempt, D: Disgust, F: Fear, H: Happy, Sa: Sad, Su: Surprised]

SLIDE 6

Reconstruction Error

  • A reconstruction error classifier generally outperforms other methods. [Wright '09]

  • Estimate the class c* of a query sample y by comparing the reconstruction error incurred when only the reconstruction coefficients ac corresponding to a specific class c are selected.

  • Select the class with the minimum reconstruction error: c* = arg min(c=1…z) ||y - Φ ac||2

  • Use the non-zero coefficients from all classes to estimate y ≈ Φa; then use the non-zero coefficients from each class in turn to compute its reconstruction error.
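A minimal sketch of the minimum-reconstruction-error rule above (function and variable names are hypothetical):

```python
import numpy as np

def min_error_class(Phi, labels, a, y):
    """For each class c, keep only that class's coefficients a_c and
    pick c* = argmin_c ||y - Phi a_c||_2.
    Phi: (D, n) dictionary; labels: (n,) class id of each column;
    a: (n,) sparse coefficients computed for y; y: (D,) query sample."""
    best_c, best_err = None, np.inf
    for c in np.unique(labels):
        a_c = np.where(labels == c, a, 0.0)  # zero out other classes' coefficients
        err = np.linalg.norm(y - Phi @ a_c)
        if err < best_err:
            best_c, best_err = c, err
    return best_c
```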

Localized Facial Processing

  • Improve accuracy and make robust to occlusions. [Kumar '08]

  • Perform the MSR process for each of 11 regions: {fullimg, face, eyes, mouth, nose, chin, eyebrow, mustache, cheek, farhead, eyereg}.

  • Evaluate performance of each region for expression recognition.

SLIDE 7

Statistical Mixture Model

  • Final classification is predicted from the top m facial regions.

  • Each facial region casts a weighted vote for its expression classification; the class with the most votes is the predicted class: ĉ = arg max(c=1…z) Σf Pf I[c = cf*]

  • The summation is done over multiple facial regions, where the weight Pf is based on the prior classification accuracy of each region (more accurate regions get higher weight).
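The weighted vote can be sketched as follows (names are hypothetical; tie-breaking and the top-m region selection are omitted for brevity):

```python
from collections import Counter

def mixture_vote(region_classes, region_weights):
    """Weighted vote: region f casts weight P_f for its predicted class
    c_f*; return the class with the largest weighted total."""
    totals = Counter()
    for c_star, p_f in zip(region_classes, region_weights):
        totals[c_star] += p_f
    return max(totals, key=totals.get)
```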

Region and Pixel Processing

  • It is further conceivable that different regions of the face may benefit from different types of pixel processing.
    – Each pixel processing ↔ facial region combination is a valid feature input to the statistical mixture model.

[Table: MSR accuracies on CK dataset]
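The cross-product of pixel processing types and facial regions described above can be enumerated directly; the region and processing names below are placeholders for illustration, not the paper's exact sets:

```python
from itertools import product

# Placeholder names: the paper uses 11 regions and its own processing variants.
regions = ["face", "eyes", "mouth", "nose"]
pixel_ops = ["raw", "gradient", "lbp"]

# Every (pixel processing, region) pair is one candidate feature for the
# statistical mixture model.
features = [(op, reg) for op, reg in product(pixel_ops, regions)]
```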

SLIDE 8

Datasets

  • CK dataset [FG '00] has 92 subjects in 229 expression sequences, each of expression {anger, disgust, fear, happiness, sadness, surprise}.

  • CK+ dataset [CVPR '10] has 118 subjects in 327 expression sequences, each of the 6 expressions above, plus the contempt expression. A panel of judges removed any fake expressions.

  • GEMEP-FERA dataset [FG '11] has 7 training set actors in 155 video clips and 6 test set actors (half of which were not in the training set) in 134 videos. Each video clip exhibited one of five emotions of {anger, fear, joy, relief, sadness}. Actors were talking as they expressed their emotions.

Results on CK Dataset

  • MSR can use any number of features, where each feature has its own pixel processing, manifold, and Φ.

[Table: MSR accuracies on the CK dataset vs. prior work (1, 2)]

  • 1. Zafeiriou [CVPR '10] used neutral frame subtraction, PCA, NMF, and a most-nonzero classifier.

  • 2. Zhi [ICME '09] used NMF, with less challenging k-fold rather than leave-one-subject-out cross validation.

SLIDE 9

Robustness to Occlusions

[Figure: original test images and test images with occlusions; approximately 250 real-world images with occlusions]

  • Occluded regions of the face are automatically detected when the minimum reconstruction error (across all classes) for a region is > τ, our threshold.

  • The weight of occluded regions is pulled to zero.

[Table: classification accuracy on the CK dataset without and with occlusions]
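This occlusion gating can be sketched in a few lines, assuming per-region minimum reconstruction errors have already been computed (all names and the threshold value are hypothetical):

```python
def occlusion_weights(region_errors, weights, tau):
    """If a region's minimum reconstruction error (across all classes)
    exceeds the threshold tau, treat it as occluded and pull its vote
    weight to zero; otherwise keep its prior-accuracy weight."""
    return [0.0 if err > tau else w for err, w in zip(region_errors, weights)]
```

The gated weights then feed the statistical mixture model's vote unchanged.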

GEMEP-FERA Dataset

  • Results for the GEMEP-FERA test set are returned by the FERA2011 organizers three ways:

    – Person dependent (subjects in test set are in the training set): 98.5%, 2nd place (out of 15 submissions)
    – Person independent (subjects in test set are not in the training set): 56.5%, 10th place
    – Overall: 73.5%, 5th place

SLIDE 10

Posed vs. Natural Datasets

  • MSR enables evaluation of any region of the face.

[Figure: region importance ("more important") for expression recognition on CK+ (posed) vs. GEMEP-FERA (natural)*]

*Correlates well with Pfister et al., ICCV 2011

Summary

  • Facial expressions can be reliably extracted in unconstrained scenes using SRs.

  • The usage of SLPP before SR clusters by expression, avoiding confusion by identity, pose, or other factors.

  • The usage of sparse reconstruction coefficients in varying facial parts, along with a statistical mixture model, makes the model robust to occlusions.

  • If the training dictionary is not overcomplete, SR methods have trouble generalizing test samples from training dictionary exemplars.

SLIDE 11

Thank You
