 
Manifold Based Sparse Representation for Robust Expression Recognition without Neutral Subtraction

Raymond Ptucha, Grigorios Tsagkatakis, Andreas Savakis
Department of Computer Engineering, Rochester Institute of Technology, Rochester, NY
Nov 13, 2011
BeFIT 2011 - 1st IEEE International Workshop on Benchmarking Facial Image Analysis Technologies
ICCV 2011 WS24, Paper 15

Ptucha, Tsagkatakis, Savakis, BeFIT2011

Sparse Representations
• The Sparse Representations (SRs) framework was inspired by studies of neurons in the visual cortex that suggest selective firing of neurons for visual processing.
• For many input signals, such as natural images, only a small number of exemplars are needed to represent new test images.
• SR gives state-of-the-art results for pattern recognition, noise reduction, super-resolution, tracking, …
• At the First Facial Expression Recognition and Analysis Challenge (FERA2011) at FG’11:
  – 13/15 entrants used SVM, but 0/15 entrants used SR

High Level Overview
[Figure: overview diagram; recoverable labels: K-NN, SVM, Neural nets]

Related Work
• Yang [PAMI ‘07] used dimensionality reduction with SRs for classification purposes.
• Wright [PAMI ‘09] used SRs for best-in-class facial recognition.
• Zafeiriou [CVPR ‘10] used PCA and SR methods based on Wright for facial expression.
  – Zafeiriou noted this was not an easy task, as facial identity was often confused with facial expression.
  – The solution was to do neutral frame subtraction.
• The methods proposed in this paper do not require neutral frame subtraction, yet give state-of-the-art results.

Hypothesis
• Methods based on manifold learning and sparse representations can achieve accurate, robust, and efficient facial expression recognition.
[Figure: MSR pipeline diagram; recoverable block labels include Facial Pre-processing, Facial Parts Processing, Manifold Learning, Sparse Representation, Reconstruction, Statistical Mixture Model, Temporal Processing]
• Refer to two excellent ICCV 2011 papers:
  – Rudovic, O., Pantic, M. “Shape-constrained Gaussian Process Regression for Facial-point-based Head-pose Normalization”
  – Asthana, A., et al. “Fully Automatic Pose-Invariant Face Recognition via 3D Pose Normalization”
• We term this process Manifold based Sparse Representation (MSR).

Dimensionality Reduction
• For the purpose of facial understanding, the dimensionality of a 26x20 ($\in \mathbb{R}^{520}$) pixel face image or an 82x2 ($\in \mathbb{R}^{164}$) set of ASM coordinates is artificially high.
• The high-dimensional space makes the facial understanding algorithms more complex and burdened than necessary.
• Each set of 520 pixels (or 164 coordinates) is actually a sample from a lower dimensional manifold that is embedded in a higher dimensional space.
• We would like to discover this lower dimensional manifold representation (to simplify our facial modeling), a technique formally called manifold learning. [Cayton ‘05, Ghodsi ’06]
• Given a set of inputs $x_1, \dots, x_n \in \mathbb{R}^D$, find a mapping $y_i = f(x_i)$, with $y_1, \dots, y_n \in \mathbb{R}^d$, where $d < D$.

Locality Preserving Projections (LPP) [He ‘03]
• Given a set of input points $x_1, \dots, x_n \in \mathbb{R}^D$, find a mapping $y_i = A^T x_i$, where the resulting $y_1, \dots, y_n \in \mathbb{R}^d$ and $d \ll D$.
  – Same algebra as PCA, if we kept the top $d$ eigenvectors!
• Create a fully connected adjacency graph $W$. Assign high weights to close/similar nodes, and low weights to far/dissimilar nodes.
  – Mimic local neighborhood structure from input to projected space.
• LPP is a linear approximation to the nonlinear Laplacian Eigenmap and is solved via the generalized eigenvector problem:
  $$X L X^T a = \lambda X D X^T a$$
• Where:
  – $D$ is a diagonal matrix whose values are the column sums of $W$,
  – $L$ is the Laplacian matrix: $L = D - W$,
  – $a$ is the resulting projection matrix (== “eigenvectors”), and
  – $\lambda$ is the resulting vector importance (== “eigenvalues”).

Sparse Representations
• In the SRs framework, a test image is represented as a sparse linear combination of exemplar images from a training dictionary, $\Phi$.
• The objective of SRs is to identify the smallest number of nonzero coefficients $a \in \mathbb{R}^n$ such that $\hat{y} = \Phi a$.
• The solution is equivalent to the Lasso regression:
  $$\hat{a} = \arg\min_a \|a\|_1 \quad \text{s.t.} \quad \hat{y} = \Phi a$$
  where $\|a\|_1 = \sum_i |a_i|$.
• This is easily solved using iterative convex optimization methods or Least Angle Regression with lasSo (LARS).
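
As a rough illustration of the LPP slide, the generalized eigenproblem $X L X^T a = \lambda X D X^T a$ can be set up and solved in a few lines. This is a minimal sketch, not the authors' implementation: the heat-kernel weighting, its width `t`, the small regularizer `reg`, and the function name `lpp` are all assumptions made for the example. It keeps the eigenvectors with the smallest eigenvalues, as in He and Niyogi's formulation of LPP.

```python
import numpy as np
from scipy.linalg import eigh


def lpp(X, d, t=1.0, reg=1e-6):
    """Sketch of LPP. X has shape (D, n): one sample per column.

    Returns (A, vals): A has shape (D, d), so that Y = A.T @ X is (d, n).
    The heat-kernel weights and the regularizer are assumptions.
    """
    # Pairwise squared Euclidean distances between the columns of X.
    sq = np.sum(X ** 2, axis=0)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X.T @ X), 0.0)

    # Fully connected graph: close samples get high weight, far ones low.
    W = np.exp(-dist2 / t)

    # Degree matrix (column sums of W) and graph Laplacian L = Deg - W.
    Deg = np.diag(W.sum(axis=0))
    L = Deg - W

    # Generalized symmetric eigenproblem  X L X^T a = lambda X Deg X^T a.
    lhs = X @ L @ X.T
    rhs = X @ Deg @ X.T + reg * np.eye(X.shape[0])  # keep rhs positive definite
    vals, vecs = eigh(lhs, rhs)  # eigenvalues returned in ascending order

    # Keep the d eigenvectors with the smallest eigenvalues.
    return vecs[:, :d], vals[:d]
```

A call such as `A, vals = lpp(X, d=2)` followed by `Y = A.T @ X` projects every column of `X` into the reduced space; new test samples are projected with the same `A`.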

MSR: Putting it Together
• MSR exploits the discriminative behavior of manifold learning, and combines it with the parsimonious power of sparse signal representation.
[Figure: MSR flow. The test sample and the n training samples (each $\in \mathbb{R}^D$) pass through manifold learning; the projected training samples form the training dictionary $\Phi \in \mathbb{R}^{n \times d}$; ℓ1 optimization expresses the projected test sample as $\approx \sum_{i=1}^{n} a_i \Phi_i$; the coefficients feed a classifier.]

Sparse Coefficients
• Top non-negative ‘a’ sparse coefficients for a test “sad” face.
[Figure: test face and plot of sparse coefficients grouped by class. A: Anger, C: Contempt, D: Disgust, F: Fear, H: Happy, Sa: Sad, Su: Surprised]
• Interesting… but how do we turn this into a classifier?
  – Max peak?
  – Max non-zero coefficients?
  – Max energy?
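
The three candidate decision rules listed on the Sparse Coefficients slide can be compared by summarizing the coefficient vector per class. The sketch below is only illustrative: the function name `class_scores` is hypothetical, and "energy" is assumed here to mean the sum of squared coefficients, which the slide does not define.

```python
import numpy as np


def class_scores(a, labels, classes):
    """Per-class summaries of a sparse coefficient vector.

    a       : coefficients, one per dictionary exemplar
    labels  : class label of each dictionary exemplar
    classes : iterable of class labels to score
    """
    a = np.asarray(a, dtype=float)
    labels = np.asarray(labels)
    scores = {}
    for c in classes:
        a_c = a[labels == c]  # coefficients belonging to class c
        scores[c] = {
            # Largest coefficient magnitude in the class ("max peak").
            "max_peak": float(np.max(np.abs(a_c))) if a_c.size else 0.0,
            # How many exemplars of the class were selected.
            "nonzero": int(np.count_nonzero(a_c)),
            # Assumed definition of energy: sum of squared coefficients.
            "energy": float(np.sum(a_c ** 2)),
        }
    return scores
```

Picking the class that maximizes any one of these summaries gives a simple classifier; the three rules can disagree when one class has a single large coefficient and another has many small ones.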

Reconstruction Error
• A reconstruction error classifier generally outperforms other methods. [Wright ‘09]
• Estimate the class $c^*$ of a query sample $y$ by comparing the reconstruction error incurred when only the reconstruction coefficients $a_c$ corresponding to a specific class $c$ are selected.
• Select the class with the minimum reconstruction error:
  $$c^* = \arg\min_{c=1,\dots,z} \|y - \Phi a_c\|_2$$
[Figure annotations: “Use non-zero coefficients from all classes to estimate $y \approx \Phi a$”; “Use non-zero coefficients from each class”]

Localized Facial Processing
• Improve accuracy and make robust to occlusions. [Kumar ‘08]
• Perform the MSR process for each of the 11 regions shown in the figure: {fullimg, face, eyes, mouth, nose, chin, eyebrow, mustache, cheek, farhead, eyereg}.
• Evaluate performance of each region for expression recognition.
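
The reconstruction error rule $c^* = \arg\min_c \|y - \Phi a_c\|_2$ can be sketched end to end as follows. This is an illustration under stated assumptions rather than the authors' code: the sparse coefficients are obtained with a plain ISTA loop on the penalized Lasso objective (the slides mention convex optimization and LARS, not ISTA specifically), the dictionary is stored with one exemplar per column so that $\Phi a$ is well defined, and the names `ista` and `classify_by_residual` as well as the values of `lam` and `n_iter` are hypothetical.

```python
import numpy as np


def ista(Phi, y, lam=0.01, n_iter=500):
    """Approximate argmin_a 0.5*||y - Phi a||_2^2 + lam*||a||_1 with ISTA.

    Phi has shape (d, n): one dictionary exemplar per column (assumption).
    """
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2  # 1 / Lipschitz constant
    a = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ a - y)           # gradient of the quadratic term
        z = a - step * grad
        # Soft thresholding: proximal step for the l1 penalty.
        a = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return a


def classify_by_residual(Phi, labels, y, lam=0.01):
    """Return (predicted class, per-class residuals) for query y."""
    labels = np.asarray(labels)
    a = ista(Phi, y, lam=lam)
    residuals = {}
    for c in np.unique(labels).tolist():
        # Keep only the coefficients of class c, zero out the rest.
        a_c = np.where(labels == c, a, 0.0)
        residuals[c] = float(np.linalg.norm(y - Phi @ a_c))
    best = min(residuals, key=residuals.get)
    return best, residuals
```

In the MSR setting, `Phi` and `y` would hold the manifold-projected training and test samples, so the same projection has to be applied to both before calling `classify_by_residual`.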

Statistical Mixture Model
• Final classification is predicted from the top m facial regions.
• Each facial region casts a weighted vote for its expression classification; the class with the most votes is the predicted class:
  $$\hat{c} = \arg\max_{c=1,\dots,z} \sum_f P_f \, I[c = c_f^*]$$
• The summation is done over multiple facial regions, where $c_f^*$ is the class selected by region $f$ and the weight $P_f$ is based on the prior classification accuracy of each region (more accurate regions get higher weight).

Region and Pixel Processing
• It is further conceivable that different regions of the face may benefit from different types of pixel processing.
  – Each pixel processing ↔ facial region combination is a valid feature input to the statistical mixture model.
[Figure/table: MSR accuracies on CK dataset]
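
The weighted vote of the Statistical Mixture Model slide can be sketched in a few lines. The function name `weighted_vote`, the dictionary-based inputs, and the example region weights are hypothetical; the weights stand in for the prior per-region accuracies $P_f$.

```python
from collections import defaultdict


def weighted_vote(region_preds, region_weights, top_m=None):
    """Combine per-region predictions into one class by weighted voting.

    region_preds   : {region name: predicted class for that region}
    region_weights : {region name: weight P_f, e.g. prior accuracy}
    top_m          : if given, only the m highest-weighted regions vote
    """
    # Rank regions from most to least reliable.
    regions = sorted(region_preds, key=lambda f: region_weights[f], reverse=True)
    if top_m is not None:
        regions = regions[:top_m]

    # Each region adds its weight to the class it predicted.
    votes = defaultdict(float)
    for f in regions:
        votes[region_preds[f]] += region_weights[f]

    winner = max(votes, key=votes.get)
    return winner, dict(votes)
```

For example, with predictions `{"mouth": "happy", "eyes": "sad", "face": "happy"}` and weights `{"mouth": 0.9, "eyes": 0.6, "face": 0.8}`, the "happy" class collects the mouth and face weights and wins the vote.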

Datasets
• CK dataset [FG ’00] has 92 subjects in 229 expression sequences, each showing one expression from {anger, disgust, fear, happiness, sadness, surprise}.
• CK+ dataset [CVPR ‘10] has 118 subjects in 327 expression sequences, each showing one of the 6 expressions above or the contempt expression. A panel of judges removed any fake expressions.
• GEMEP-FERA dataset [FG ’11] has 7 training set actors in 155 video clips and 6 test set actors (half of whom were not in the training set) in 134 videos. Each video clip exhibits one of five emotions: {anger, fear, joy, relief, sadness}. Actors were talking as they expressed their emotions.

Results on CK Dataset
• MSR can use any number of features, where each feature has its own pixel processing, manifold, and $\Phi$.
[Table: results on CK dataset, with footnote markers 1 and 2]
1. Zafeiriou [CVPR ‘10] used neutral frame subtraction, PCA, NMF, and a most-nonzero classifier.
2. Zhi [ICME ‘09] used NMF, with less challenging k-fold cross validation rather than leave-one-subject-out cross validation.
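
The distinction drawn in footnote 2 matters because leave-one-subject-out cross validation never lets the same person appear in both the training and the test fold, whereas a plain k-fold split over sequences can. A minimal sketch of such a split is shown below; the function name `leave_one_subject_out` and the example subject identifiers are hypothetical.

```python
import numpy as np


def leave_one_subject_out(subject_ids):
    """Yield (held-out subject, train indices, test indices) per subject.

    subject_ids : one subject identifier per expression sequence
    """
    subject_ids = np.asarray(subject_ids)
    for s in np.unique(subject_ids).tolist():
        test_idx = np.flatnonzero(subject_ids == s)   # all sequences of subject s
        train_idx = np.flatnonzero(subject_ids != s)  # everyone else
        yield s, train_idx, test_idx
```

Each fold trains the manifold and the dictionary on `train_idx` only and evaluates on `test_idx`, so reported accuracy reflects performance on unseen identities.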