SLIDE 1

Discriminative Feature Extraction and Dimension Reduction

  • PCA & LDA

Berlin Chen, 2004

SLIDE 2

Introduction

  • Goal: discover significant patterns or features from the input data

– Salient feature selection or dimensionality reduction
– Compute an input-output mapping based on some desirable properties

[Figure: a network with weight matrix W maps x in the input space to y in the feature space]

SLIDE 3

Introduction (cont.)

  • Principal Component Analysis (PCA)
  • Linear Discriminant Analysis (LDA)
  • Heteroscedastic Discriminant Analysis (HDA)
SLIDE 4

Introduction (cont.)

  • Formulation for discriminative feature extraction

– Model-free (nonparametric)

  • Without prior information: e.g., PCA
  • With prior information: e.g., LDA

– Model-dependent (parametric)

  • E.g., EM (Expectation-Maximization), MCE (Minimum Classification Error) training

SLIDE 5

Principal Component Analysis (PCA)

Pearson, 1901

  • Also known as the Karhunen-Loève Transform (1947, 1963)

– Or the Hotelling Transform (1933)

  • A standard technique commonly used for data reduction in statistical pattern recognition and signal processing

  • A transform by which the data set can be represented by a reduced number of effective features while still retaining the most intrinsic information content

– A small set of features is to be found that represents the data samples accurately

  • Also called “Subspace Decomposition”
SLIDE 6

Principal Component Analysis (cont.)

  • The patterns show a significant difference from each other along one of the transformed axes
SLIDE 7

Principal Component Analysis (cont.)

  • Suppose x is an n-dimensional zero-mean random vector, $E\{\mathbf{x}\} = \mathbf{0}$

– If x is not zero-mean, we can subtract the mean before the following analysis
– x can be represented without error by a summation of n linearly independent vectors:

$$\mathbf{x} = \sum_{i=1}^{n} y_i \boldsymbol{\varphi}_i = \boldsymbol{\Phi}\mathbf{y},
\quad \text{where } \mathbf{y} = [y_1 \ y_2 \ \dots \ y_n]^T
\text{ and } \boldsymbol{\Phi} = [\boldsymbol{\varphi}_1 \ \boldsymbol{\varphi}_2 \ \dots \ \boldsymbol{\varphi}_n]$$

– The $\boldsymbol{\varphi}_i$ are the basis vectors; $y_i$ is the i-th component in the feature (mapped) space

SLIDE 8

Principal Component Analysis (cont.)

– Further assume the column (basis) vectors of the matrix $\boldsymbol{\Phi}$ form an orthonormal set:

$$\boldsymbol{\varphi}_i^T \boldsymbol{\varphi}_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$

  • Such that $y_i$ is equal to the projection of $\mathbf{x}$ on $\boldsymbol{\varphi}_i$:

$$y_i = \boldsymbol{\varphi}_i^T \mathbf{x}, \ \forall i
\qquad \text{e.g., } y_1 = \boldsymbol{\varphi}_1^T \mathbf{x} = \|\mathbf{x}\| \cos\theta, \ \text{where } \|\boldsymbol{\varphi}_1\| = 1$$

  • $y_i$ also has the following properties

– Its mean is zero, too:

$$E\{y_i\} = E\{\boldsymbol{\varphi}_i^T \mathbf{x}\} = \boldsymbol{\varphi}_i^T E\{\mathbf{x}\} = 0$$

– Its variance is:

$$\sigma_i^2 = E\{y_i^2\} = E\{\boldsymbol{\varphi}_i^T \mathbf{x}\mathbf{x}^T \boldsymbol{\varphi}_i\}
= \boldsymbol{\varphi}_i^T E\{\mathbf{x}\mathbf{x}^T\} \boldsymbol{\varphi}_i
= \boldsymbol{\varphi}_i^T \mathbf{R} \boldsymbol{\varphi}_i,$$

where $\mathbf{R} = E\{\mathbf{x}\mathbf{x}^T\} \approx \frac{1}{N}\sum_i \mathbf{x}_i \mathbf{x}_i^T$ is the (auto-)correlation matrix

SLIDE 9

Principal Component Analysis (cont.)

– Under the same orthonormal-basis assumption, each projection $y_i = \boldsymbol{\varphi}_i^T \mathbf{x}$ has zero mean and variance $\sigma_i^2 = \boldsymbol{\varphi}_i^T \mathbf{R} \boldsymbol{\varphi}_i$ (previous slide)

  • The correlation between two projections $y_i$ and $y_j$ is:

$$E\{y_i y_j\} = E\{(\boldsymbol{\varphi}_i^T \mathbf{x})(\mathbf{x}^T \boldsymbol{\varphi}_j)\}
= \boldsymbol{\varphi}_i^T E\{\mathbf{x}\mathbf{x}^T\} \boldsymbol{\varphi}_j
= \boldsymbol{\varphi}_i^T \mathbf{R} \boldsymbol{\varphi}_j$$
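These properties are easy to verify numerically. Below is a minimal NumPy sketch (my own illustration with synthetic data, not part of the original deck) showing that the sample second moments of the projections match $\boldsymbol{\varphi}_i^T \mathbf{R} \boldsymbol{\varphi}_j$:

```python
import numpy as np

# Minimal sketch: check the projection properties on synthetic data.
rng = np.random.default_rng(0)
n, N = 4, 100_000

# Zero-mean random vectors x with an arbitrary correlation structure;
# columns of X are the samples x_i.
A = rng.normal(size=(n, n))
X = A @ rng.normal(size=(n, N))

R = X @ X.T / N                      # sample (auto-)correlation matrix, R ~ E{x x^T}

# Any orthonormal basis Phi (here obtained from a QR decomposition).
Phi, _ = np.linalg.qr(rng.normal(size=(n, n)))

Y = Phi.T @ X                        # projections y_i = phi_i^T x

# The matrix of sample moments E{y_i y_j} equals Phi^T R Phi exactly,
# since both sides are built from the same samples.
print(np.allclose(Y @ Y.T / N, Phi.T @ R @ Phi))   # True
```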

SLIDE 10

Principal Component Analysis (cont.)

  • Minimum Mean-Squared Error Criterion

– We want to choose only m of the $\boldsymbol{\varphi}_i$'s such that we can still approximate $\mathbf{x}$ well under the mean-squared error criterion:

$$\mathbf{x} = \sum_{i=1}^{n} y_i \boldsymbol{\varphi}_i = \sum_{i=1}^{m} y_i \boldsymbol{\varphi}_i + \sum_{j=m+1}^{n} y_j \boldsymbol{\varphi}_j,
\qquad \hat{\mathbf{x}}(m) = \sum_{i=1}^{m} y_i \boldsymbol{\varphi}_i$$

– The resulting mean-squared error is

$$\varepsilon^2(m) = E\left\{\left(\mathbf{x} - \hat{\mathbf{x}}(m)\right)^T \left(\mathbf{x} - \hat{\mathbf{x}}(m)\right)\right\}
= E\left\{\left(\sum_{j=m+1}^{n} y_j \boldsymbol{\varphi}_j\right)^T \left(\sum_{k=m+1}^{n} y_k \boldsymbol{\varphi}_k\right)\right\}
= \sum_{j=m+1}^{n} E\{y_j^2\}
= \sum_{j=m+1}^{n} \boldsymbol{\varphi}_j^T \mathbf{R} \boldsymbol{\varphi}_j
= \sum_{j=m+1}^{n} \sigma_j^2$$

using the orthonormality $\boldsymbol{\varphi}_j^T \boldsymbol{\varphi}_k = \delta_{jk}$ and, since $E\{y_j\} = 0$,

$$\sigma_j^2 = E\{y_j^2\} - \left(E\{y_j\}\right)^2 = E\{y_j^2\}$$

– We should discard the bases where the projections have lower variances

SLIDE 11

Principal Component Analysis (cont.)

  • Minimum Mean-Squared Error Criterion

– If the orthonormal (basis) set is selected to be the eigenvectors $\boldsymbol{\varphi}_j$ of the correlation matrix $\mathbf{R}$, associated with eigenvalues $\lambda_j$

  • They will have the property that: $\mathbf{R}\boldsymbol{\varphi}_j = \lambda_j \boldsymbol{\varphi}_j$

($\mathbf{R}$ is real and symmetric, therefore its eigenvectors form an orthonormal set)

– Such that the mean-squared error mentioned above will be

$$\varepsilon^2(m) = \sum_{j=m+1}^{n} \sigma_j^2
= \sum_{j=m+1}^{n} \boldsymbol{\varphi}_j^T \mathbf{R} \boldsymbol{\varphi}_j
= \sum_{j=m+1}^{n} \lambda_j \boldsymbol{\varphi}_j^T \boldsymbol{\varphi}_j
= \sum_{j=m+1}^{n} \lambda_j$$

SLIDE 12

Principal Component Analysis (cont.)

  • Minimum Mean-Squared Error Criterion

– If the eigenvectors associated with the m largest eigenvalues are retained, the mean-squared error will be

$$\varepsilon^2_{\text{eigen}}(m) = \sum_{j=m+1}^{n} \lambda_j,
\quad \text{where } \lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_m \geq \dots \geq \lambda_n$$

– Any two projections $y_i$ and $y_j$ ($i \neq j$) will be mutually uncorrelated:

$$E\{y_i y_j\} = E\{(\boldsymbol{\varphi}_i^T \mathbf{x})(\mathbf{x}^T \boldsymbol{\varphi}_j)\}
= \boldsymbol{\varphi}_i^T \mathbf{R} \boldsymbol{\varphi}_j
= \lambda_j \boldsymbol{\varphi}_i^T \boldsymbol{\varphi}_j = 0$$

  • Good news for most statistical modeling

– Gaussians and diagonal covariance matrices
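As a concrete check of the error formula, the following NumPy sketch (again my own synthetic example) performs PCA by eigendecomposition of the sample correlation matrix, keeps the $m$ leading eigenvectors, and confirms that the empirical mean-squared reconstruction error equals the sum of the discarded eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, m = 8, 50_000, 3

A = rng.normal(size=(n, n))
X = A @ rng.normal(size=(n, N))          # zero-mean samples as columns
R = X @ X.T / N                          # sample correlation matrix

# R is real and symmetric: eigh returns orthonormal eigenvectors with
# eigenvalues in ascending order, so flip to descending.
lam, Phi = np.linalg.eigh(R)
lam, Phi = lam[::-1], Phi[:, ::-1]

Y = Phi[:, :m].T @ X                     # keep the m largest components
X_hat = Phi[:, :m] @ Y                   # reconstruction from m components

mse = np.mean(np.sum((X - X_hat) ** 2, axis=0))
print(mse, lam[m:].sum())                # the two numbers agree
```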

SLIDE 13

Principal Component Analysis (cont.)

  • A two-dimensional example of Principal Component Analysis

SLIDE 14

Principal Component Analysis (cont.)

  • Minimum Mean-Squared Error Criterion

– It can be proved that $\varepsilon^2_{\text{eigen}}(m)$ is the optimal solution under the mean-squared error criterion

Define the Lagrangian (the error to be minimized plus the orthonormality constraints):

$$J = \sum_{j=m+1}^{n} \boldsymbol{\varphi}_j^T \mathbf{R} \boldsymbol{\varphi}_j
- \sum_{j=m+1}^{n} \sum_{k=m+1}^{n} \mu_{jk} \left(\boldsymbol{\varphi}_j^T \boldsymbol{\varphi}_k - \delta_{jk}\right)$$

Take the derivative (using $\partial(\boldsymbol{\varphi}^T \mathbf{R} \boldsymbol{\varphi})/\partial\boldsymbol{\varphi} = 2\mathbf{R}\boldsymbol{\varphi}$ for symmetric $\mathbf{R}$, and $\partial(\mathbf{x}^T \mathbf{y})/\partial\mathbf{x} = \mathbf{y}$):

$$\frac{\partial J}{\partial \boldsymbol{\varphi}_j} = 2\mathbf{R}\boldsymbol{\varphi}_j - 2\sum_{k=m+1}^{n} \mu_{jk} \boldsymbol{\varphi}_k = \mathbf{0},
\quad \forall\, m+1 \leq j \leq n$$

$$\Rightarrow \ \mathbf{R}\,\boldsymbol{\Phi}_{n-m} = \boldsymbol{\Phi}_{n-m}\,\mathbf{U}_{n-m},
\quad \text{where } \boldsymbol{\Phi}_{n-m} = [\boldsymbol{\varphi}_{m+1} \ \dots \ \boldsymbol{\varphi}_n]
\text{ and } \mathbf{U}_{n-m} = [\mu_{jk}]$$

This has a particular solution if $\mathbf{U}_{n-m}$ is a diagonal matrix whose diagonal elements are the eigenvalues $\lambda_{m+1}, \dots, \lambda_n$ of $\mathbf{R}$, and $\boldsymbol{\varphi}_{m+1}, \dots, \boldsymbol{\varphi}_n$ are their corresponding eigenvectors

SLIDE 15

Principal Component Analysis (cont.)

  • Given an input vector x of dimension n

– Try to construct a linear transform $\boldsymbol{\Phi}'$ ($\boldsymbol{\Phi}'$ is an $n \times m$ matrix, $m < n$) such that the truncation result, $\boldsymbol{\Phi}'^T \mathbf{x}$, is optimal under the mean-squared error criterion

Encoder: $\mathbf{x} = [x_1 \ x_2 \ \dots \ x_n]^T \ \longrightarrow \ \mathbf{y} = \boldsymbol{\Phi}'^T \mathbf{x} = [y_1 \ y_2 \ \dots \ y_m]^T$, where $\boldsymbol{\Phi}' = [\mathbf{e}_1 \ \mathbf{e}_2 \ \dots \ \mathbf{e}_m]$

Decoder: $\mathbf{y} \ \longrightarrow \ \hat{\mathbf{x}} = \boldsymbol{\Phi}' \mathbf{y} = [\hat{x}_1 \ \hat{x}_2 \ \dots \ \hat{x}_n]^T$

Objective: minimize $E\left\{(\mathbf{x} - \hat{\mathbf{x}})^T (\mathbf{x} - \hat{\mathbf{x}})\right\}$
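The encoder/decoder pair can be packaged in a few lines. A minimal sketch follows, assuming the training samples are the columns of `X`; the helper name `make_codec` is hypothetical, not from the slides:

```python
import numpy as np

def make_codec(X, m):
    """Build the PCA encoder/decoder: Phi_prime holds the m leading
    eigenvectors of the sample correlation matrix of X (columns = samples)."""
    R = X @ X.T / X.shape[1]
    _, vecs = np.linalg.eigh(R)               # eigenvalues in ascending order
    Phi_prime = vecs[:, ::-1][:, :m]          # n x m, m < n
    encode = lambda x: Phi_prime.T @ x        # y = Phi'^T x   (n -> m)
    decode = lambda y: Phi_prime @ y          # x_hat = Phi' y (m -> n)
    return encode, decode
```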

SLIDE 16

Principal Component Analysis (cont.)

  • Data compression in communication

– PCA is an optimal transform for signal representation and dimensionality reduction, but not necessarily for classification tasks, such as speech recognition
– PCA needs no prior information (e.g., class distributions) about the sample patterns

SLIDE 17

Principal Component Analysis (cont.)

  • Example 1: principal components of some data points
SLIDE 18

Principal Component Analysis (cont.)

  • Example 2: feature transformation and selection

[Figure: the correlation matrix of the old feature dimensions is transformed into new feature dimensions; a threshold on the information content to be retained selects how many dimensions to keep]

SLIDE 19

Principal Component Analysis (cont.)

  • Example 3: Image Coding

[Figure: a 256×256 image and its 8×8 blocks]

SLIDE 20

Principal Component Analysis (cont.)

  • Example 3: Image Coding (cont.)
SLIDE 21

Principal Component Analysis

Eigenface and Eigenvoice

  • Example 4: Eigenface in face recognition (1990)

– Consider an individual image to be a linear combination of a small number of face components or “eigenfaces” derived from a set of reference images
– Steps

  • Convert each of the L reference images into a vector of floating-point numbers representing the light intensity in each pixel:

$$\mathbf{x}_1 = \begin{bmatrix} x_{1,1} \\ x_{1,2} \\ \vdots \\ x_{1,n} \end{bmatrix}, \
\mathbf{x}_2 = \begin{bmatrix} x_{2,1} \\ x_{2,2} \\ \vdots \\ x_{2,n} \end{bmatrix}, \ \dots, \
\mathbf{x}_L = \begin{bmatrix} x_{L,1} \\ x_{L,2} \\ \vdots \\ x_{L,n} \end{bmatrix}$$

  • Calculate the covariance/correlation matrix between these reference vectors
  • Apply Principal Component Analysis (PCA) to find the eigenvectors of the matrix: the eigenfaces
  • Besides, the vector obtained by averaging all images is called “eigenface 0”; the other eigenfaces, from “eigenface 1” onwards, model the variations from this average face

SLIDE 22

Principal Component Analysis

Eigenface and Eigenvoice (cont.)

  • Example 4: Eigenface in face recognition (cont.)

– Steps

  • The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces:

$$\hat{\mathbf{x}}^{(i)} = \mathbf{x}_{\text{eigenface 0}} + w_{i,1}\mathbf{e}_1 + w_{i,2}\mathbf{e}_2 + \dots + w_{i,K}\mathbf{e}_K
\ \Rightarrow \ \mathbf{y}_i = [w_{i,1}, w_{i,2}, \dots, w_{i,K}]^T$$

where $\mathbf{y}_i$ is the feature vector of person i

– The eigenface approach preserves the minimum mean-squared error criterion
– Incidentally, the eigenfaces are not themselves usually plausible faces, only directions of variation between faces
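The steps above translate directly into NumPy. A minimal sketch follows, assuming a hypothetical `images` array holding the L reference faces; the function name and the plain eigendecomposition are my own choices (for large images one would use the L×L Gram-matrix trick instead):

```python
import numpy as np

def eigenfaces(images, K):
    """images: (L, h, w) array of L reference faces; returns the average
    face ("eigenface 0"), the K leading eigenfaces, and per-person weights."""
    L = images.shape[0]
    X = images.reshape(L, -1).astype(float)   # each image -> a pixel vector
    mean_face = X.mean(axis=0)                # "eigenface 0": the average face
    Xc = X - mean_face                        # variations from the average
    C = Xc.T @ Xc / L                         # covariance of the reference vectors
    _, vecs = np.linalg.eigh(C)               # PCA: eigenvectors of C
    E = vecs[:, ::-1][:, :K]                  # eigenfaces 1..K (columns)
    W = Xc @ E                                # W[i] = [w_{i,1}, ..., w_{i,K}]
    return mean_face, E, W
```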

SLIDE 23

Principal Component Analysis

Eigenface and Eigenvoice (cont.)

  • Example 5: Eigenvoice in speaker adaptation (PSTL, 2000)

– Steps

  • Concatenate the regarded parameters for each speaker r to form a huge vector a(r) (a supervector)
  • Using the SD model mean parameters (µ), the supervector dimension is D = (M·n)×1

[Figure: eigenvoice space construction — data from speaker 1 through speaker R are used for model training, yielding the speaker-dependent HMMs (plus an SI HMM); the resulting supervectors are fed to Principal Component Analysis]

  • Each new speaker S is represented by a point P in K-space:

$$\mathbf{P} = \mathbf{e}(0) + w_1 \mathbf{e}(1) + w_2 \mathbf{e}(2) + \dots + w_K \mathbf{e}(K)$$

SLIDE 24

Principal Component Analysis

Eigenface and Eigenvoice (cont.)

  • Example 5: Eigenvoice in speaker adaptation (cont.)
SLIDE 25

Principal Component Analysis

Eigenface and Eigenvoice (cont.)

  • Example 5: Eigenvoice in speaker adaptation (cont.)

– Dimension 1 (eigenvoice 1): correlates with pitch or sex
– Dimension 2 (eigenvoice 2): correlates with amplitude
– Dimension 3 (eigenvoice 3): correlates with second-formant movement

SLIDE 26

Linear Discriminant Analysis (LDA)

  • Also called

– Fisher’s Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis

  • Fisher (1936): introduced it for two-class classification
  • Rao (1965): extended it to handle multiple-class classification
SLIDE 27

Linear Discriminant Analysis (cont.)

  • Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

(Within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space.)

SLIDE 28

Linear Discriminant Analysis (cont.)

  • Suppose there are N sample vectors $\mathbf{x}_i$ with dimensionality n, each of which belongs to one of the J classes, $g(\mathbf{x}_i) \in \{1, 2, \dots, J\}$, where $g(\cdot)$ is the class index

– The sample mean is:

$$\bar{\mathbf{x}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i$$

– The class sample means are:

$$\bar{\mathbf{x}}_j = \frac{1}{N_j} \sum_{g(\mathbf{x}_i) = j} \mathbf{x}_i$$

– The class sample covariances are:

$$\boldsymbol{\Sigma}_j = \frac{1}{N_j} \sum_{g(\mathbf{x}_i) = j} (\mathbf{x}_i - \bar{\mathbf{x}}_j)(\mathbf{x}_i - \bar{\mathbf{x}}_j)^T$$

– The average within-class variation before transform:

$$\mathbf{S}_w = \frac{1}{N} \sum_{j} N_j \boldsymbol{\Sigma}_j$$

– The average between-class variation before transform:

$$\mathbf{S}_b = \frac{1}{N} \sum_{j} N_j (\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^T$$
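These statistics map one-to-one onto code. The sketch below (my own helper, `lda_scatter`) computes $\mathbf{S}_w$ and $\mathbf{S}_b$ for samples stored as the rows of `X` with integer class labels `g`:

```python
import numpy as np

def lda_scatter(X, g):
    """X: (N, n) samples; g: (N,) class labels. Returns S_w and S_b."""
    N, n = X.shape
    x_bar = X.mean(axis=0)                    # sample mean
    S_w = np.zeros((n, n))                    # average within-class variation
    S_b = np.zeros((n, n))                    # average between-class variation
    for j in np.unique(g):
        Xj = X[g == j]
        Nj = len(Xj)
        xj_bar = Xj.mean(axis=0)              # class sample mean
        Sigma_j = (Xj - xj_bar).T @ (Xj - xj_bar) / Nj   # class covariance
        S_w += (Nj / N) * Sigma_j
        d = (xj_bar - x_bar)[:, None]
        S_b += (Nj / N) * (d @ d.T)
    return S_w, S_b
```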

SLIDE 29

Linear Discriminant Analysis (cont.)

  • If the transform $\mathbf{W} = [\mathbf{w}_1 \ \mathbf{w}_2 \ \dots \ \mathbf{w}_m]$ is applied

– The sample vectors will be $\mathbf{y}_i = \mathbf{W}^T \mathbf{x}_i$
– The sample mean will be

$$\bar{\mathbf{y}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{W}^T \mathbf{x}_i
= \mathbf{W}^T \left(\frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i\right)
= \mathbf{W}^T \bar{\mathbf{x}}$$

– The class sample means will be

$$\bar{\mathbf{y}}_j = \frac{1}{N_j} \sum_{g(\mathbf{x}_i) = j} \mathbf{W}^T \mathbf{x}_i = \mathbf{W}^T \bar{\mathbf{x}}_j$$

– The average within-class variation will be

$$\tilde{\mathbf{S}}_w
= \frac{1}{N} \sum_j N_j \cdot \frac{1}{N_j} \sum_{g(\mathbf{x}_i) = j}
\left(\mathbf{W}^T\mathbf{x}_i - \mathbf{W}^T\bar{\mathbf{x}}_j\right)\left(\mathbf{W}^T\mathbf{x}_i - \mathbf{W}^T\bar{\mathbf{x}}_j\right)^T
= \mathbf{W}^T \left\{\frac{1}{N}\sum_j N_j \boldsymbol{\Sigma}_j\right\} \mathbf{W}
= \mathbf{W}^T \mathbf{S}_w \mathbf{W}$$

SLIDE 30

Linear Discriminant Analysis (cont.)

  • If the transform $\mathbf{W} = [\mathbf{w}_1 \ \mathbf{w}_2 \ \dots \ \mathbf{w}_m]$ is applied

– Similarly, the average between-class variation will be $\tilde{\mathbf{S}}_b = \mathbf{W}^T \mathbf{S}_b \mathbf{W}$
– Try to find an optimal $\mathbf{W}$ such that the following criterion function is maximized:

$$J(\mathbf{W}) = \frac{\left|\tilde{\mathbf{S}}_b\right|}{\left|\tilde{\mathbf{S}}_w\right|}
= \frac{\left|\mathbf{W}^T \mathbf{S}_b \mathbf{W}\right|}{\left|\mathbf{W}^T \mathbf{S}_w \mathbf{W}\right|}$$

  • A closed-form solution: the column vectors of an optimal matrix $\mathbf{W}$ are the generalized eigenvectors corresponding to the largest eigenvalues in

$$\mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{S}_w \mathbf{w}_i$$

  • That is, the $\mathbf{w}_i$'s are the eigenvectors corresponding to the largest eigenvalues of

$$\mathbf{S}_w^{-1} \mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{w}_i$$
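Given $\mathbf{S}_b$ and $\mathbf{S}_w$ (e.g., from the `lda_scatter` sketch earlier), the closed-form solution is a generalized symmetric eigenproblem; a minimal sketch, assuming SciPy is available:

```python
import numpy as np
from scipy.linalg import eigh        # generalized symmetric eigensolver

def lda_transform(S_b, S_w, m):
    """Columns of the result are w_1 ... w_m, the generalized eigenvectors
    of S_b w = lambda S_w w with the m largest eigenvalues."""
    lam, W = eigh(S_b, S_w)          # eigenvalues returned in ascending order
    return W[:, ::-1][:, :m]

# Usage: Y = X @ lda_transform(S_b, S_w, m) projects the (N, n) samples.
```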

SLIDE 31

Linear Discriminant Analysis (cont.)

  • Proof:

$$\hat{\mathbf{W}} = \arg\max_{\mathbf{W}} J(\mathbf{W})
= \arg\max_{\mathbf{W}} \frac{\left|\mathbf{W}^T \mathbf{S}_b \mathbf{W}\right|}{\left|\mathbf{W}^T \mathbf{S}_w \mathbf{W}\right|}$$

Or, for each column vector $\mathbf{w}_i$ of $\mathbf{W}$, we want to find

$$\hat{\mathbf{w}}_i = \arg\max_{\mathbf{w}_i} \frac{\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i}{\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i}$$

The quotient has quadratic forms in the numerator and denominator; using $\left(\frac{F}{G}\right)' = \frac{F'G - FG'}{G^2}$ and $\frac{d(\mathbf{x}^T \mathbf{C} \mathbf{x})}{d\mathbf{x}} = (\mathbf{C} + \mathbf{C}^T)\mathbf{x}$, setting the derivative to zero gives

$$\frac{2\,\mathbf{S}_b \mathbf{w}_i \left(\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i\right)
- 2\,\mathbf{S}_w \mathbf{w}_i \left(\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i\right)}
{\left(\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i\right)^2} = \mathbf{0}$$

$$\Rightarrow \ \mathbf{S}_b \mathbf{w}_i
= \left(\frac{\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i}{\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i}\right) \mathbf{S}_w \mathbf{w}_i
= \lambda_i \mathbf{S}_w \mathbf{w}_i
\ \Rightarrow \ \mathbf{S}_w^{-1} \mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{w}_i$$

SLIDE 32

Linear Discriminant Analysis (cont.)

  • Example 1: Experiments on Speech Signal Processing

[Figure: covariance matrix of the 18-Mel-filter-bank vectors vs. covariance matrix of the 18-cepstral vectors (after the cosine transform), each calculated using Year-99's 5471 files]

$$\boldsymbol{\Sigma}_{\mathbf{x}} = \frac{1}{N}\sum_i (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T
\qquad
\boldsymbol{\Sigma}'_{\mathbf{y}} = \frac{1}{N}\sum_i (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})^T$$

SLIDE 33

Experiments on Speech Signal Processing

  • Example 1: Experiments on Speech Signal Processing (cont.)

[Figure: covariance matrix of the 18-PCA-cepstral vectors (after the PCA transform) vs. covariance matrix of the 18-LDA-cepstral vectors (after the LDA transform), each calculated using Year-99's 5471 files]

Character Error Rate:

            TC      WG
  MFCC      26.32   22.71
  LDA-1     23.12   20.17
  LDA-2     23.11   20.11

SLIDE 34

PCA vs. LDA

SLIDE 35

LDA vs. HDA

  • HDA: Heteroscedastic Discriminant Analysis
SLIDE 36

Heteroscedastic Discriminant Analysis (HDA)

IBM, 2000

  • Heteroscedastic: a set of statistical distributions having different variances

  • LDA does not consider individual class covariances and may therefore generate suboptimal results

– Modify the LDA objective function:

$$H(\mathbf{W}) = \prod_{j=1}^{J} \left(\frac{\left|\mathbf{W}^T \mathbf{S}_b \mathbf{W}\right|}{\left|\mathbf{W}^T \boldsymbol{\Sigma}_j \mathbf{W}\right|}\right)^{N_j}
= \frac{\left|\mathbf{W}^T \mathbf{S}_b \mathbf{W}\right|^{N}}{\prod_{j=1}^{J} \left|\mathbf{W}^T \boldsymbol{\Sigma}_j \mathbf{W}\right|^{N_j}}$$

– Take the log and rearrange terms:

$$\log H(\mathbf{W}) = -\sum_{j=1}^{J} N_j \log\left|\mathbf{W}^T \boldsymbol{\Sigma}_j \mathbf{W}\right|
+ N \log\left|\mathbf{W}^T \mathbf{S}_b \mathbf{W}\right|$$

– However, the dimensions of the HDA projection can often be highly correlated

  • Another transform can be further composed with HDA
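Unlike LDA, this objective has no closed-form maximizer, so $\mathbf{W}$ is usually found numerically. The sketch below (my own helper name and array layout) merely evaluates $\log H(\mathbf{W})$ for a candidate projection, which a gradient-based optimizer could then maximize:

```python
import numpy as np

def hda_log_objective(W, S_b, Sigma, Nj):
    """W: (n, m) candidate projection; S_b: between-class scatter;
    Sigma: (J, n, n) per-class covariances; Nj: (J,) class counts."""
    N = Nj.sum()
    _, logdet_b = np.linalg.slogdet(W.T @ S_b @ W)
    log_H = N * logdet_b
    for Sigma_j, N_j in zip(Sigma, Nj):
        _, logdet_j = np.linalg.slogdet(W.T @ Sigma_j @ W)
        log_H -= N_j * logdet_j
    return log_H        # to be maximized over W
```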

SLIDE 37

Heteroscedastic Discriminant Analysis (cont.)

  • The difference in the projections obtained from LDA and HDA for the 2-class case

– Clearly, HDA provides a much lower classification error than LDA in theory

  • However, most statistical modeling assumes the data samples are Gaussian and have diagonal covariance matrices