Machine Learning for Signal Processing: Linear Gaussian Models


SLIDE 1

Machine Learning for Signal Processing Linear Gaussian Models

Class 21. 12 Nov 2013. Instructor: Bhiksha Raj

SLIDE 2

Administrivia

  • HW3 is up
  • Projects – please send us an update

SLIDE 3

Recap: MAP Estimators

  • MAP (Maximum A Posteriori): Find a “best guess” for y (statistically), given known x

$$\hat{y} = \arg\max_Y P(Y \mid x)$$

SLIDE 4

Recap: MAP estimation

  • x and y are jointly Gaussian
  • z is Gaussian

$$z = \begin{bmatrix} x \\ y \end{bmatrix}, \qquad \mu_z = E[z] = \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}$$

$$C_{zz} = \mathrm{Var}(z) = E[(z - \mu_z)(z - \mu_z)^T] = \begin{bmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{bmatrix}, \qquad C_{xy} = E[(x - \mu_x)(y - \mu_y)^T]$$

$$P(z) = N(\mu_z, C_{zz}) = \frac{1}{\sqrt{(2\pi)^D |C_{zz}|}} \exp\!\left(-0.5\,(z - \mu_z)^T C_{zz}^{-1} (z - \mu_z)\right)$$

SLIDE 5

MAP estimation: Gaussian PDF

[Figure: a joint Gaussian density over the X–Y plane]

SLIDE 6

MAP estimation: The Gaussian at a particular value of X

[Figure: the same density, sliced at X = x₀]

SLIDE 7

Conditional Probability of y|x

  • The conditional probability of y given x is also Gaussian
    – The slice in the figure is Gaussian
  • The mean of this Gaussian is a function of x
  • The variance of y reduces if x is known
    – Uncertainty is reduced

$$P(y \mid x) = N\!\left(\mu_y + C_{yx} C_{xx}^{-1}(x - \mu_x),\; C_{yy} - C_{yx} C_{xx}^{-1} C_{xy}\right)$$

$$E[y \mid x] = \mu_{y|x} = \mu_y + C_{yx} C_{xx}^{-1}(x - \mu_x)$$

$$\mathrm{Var}(y \mid x) = C_{yy} - C_{yx} C_{xx}^{-1} C_{xy}$$

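To make the conditioning formulas concrete, here is a minimal numpy sketch (an illustration added for this transcript, not from the slides; the block sizes and parameter values are made up):

```python
import numpy as np

# Hypothetical joint Gaussian over z = [x; y]: x is 2-D, y is 1-D.
mu_x = np.array([0.0, 1.0])
mu_y = np.array([2.0])
C_xx = np.array([[2.0, 0.5],
                 [0.5, 1.0]])
C_xy = np.array([[0.8],
                 [0.3]])
C_yy = np.array([[1.5]])

def conditional_y_given_x(x):
    """Mean and covariance of P(y | x) for jointly Gaussian (x, y)."""
    gain = C_xy.T @ np.linalg.inv(C_xx)   # C_yx C_xx^{-1}
    mean = mu_y + gain @ (x - mu_x)       # E[y | x]: the MAP (= MMSE) estimate
    cov = C_yy - gain @ C_xy              # Var(y | x): smaller than C_yy
    return mean, cov

print(conditional_y_given_x(np.array([1.0, 0.0])))
```

Note that the conditional covariance does not depend on the observed value of x; that is a peculiarity of the Gaussian.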
SLIDE 8

MAP estimation: The Gaussian at a particular value of X

[Figure: the slice at X = x₀; its peak is the most likely value of y]

SLIDE 9

MAP Estimation of a Gaussian RV

$$\hat{y} = \arg\max_y P(y \mid x) = E[y \mid x]$$

SLIDE 10

It’s also a minimum mean squared error (MMSE) estimate

  • Minimize the error:

$$Err = E[(\hat{y} - y)^T(\hat{y} - y) \mid x] = E[\hat{y}^T\hat{y} - 2\hat{y}^T y + y^T y \mid x] = \hat{y}^T\hat{y} - 2\hat{y}^T E[y \mid x] + E[y^T y \mid x]$$

  • Differentiating and equating to 0:

$$\frac{d\,Err}{d\hat{y}} = 2\hat{y} - 2E[y \mid x] = 0 \quad\Rightarrow\quad \hat{y} = E[y \mid x]$$

The MMSE estimate is the mean of the distribution

SLIDE 11

For the Gaussian: MAP = MMSE

[Figure: the most likely value is also the MEAN value]

  • This would be true of any symmetric distribution
SLIDE 12

MMSE estimates for mixture distributions

  • Let P(y|x) be a mixture density
  • The MMSE estimate of y is then just a weighted combination of the MMSE estimates from the component distributions:

$$P(y \mid x) = \sum_k P(k \mid x)\, P(y \mid x, k)$$

$$E[y \mid x] = \int y \sum_k P(k \mid x)\, P(y \mid x, k)\, dy = \sum_k P(k \mid x) \int y\, P(y \mid x, k)\, dy = \sum_k P(k \mid x)\, E[y \mid x, k]$$

SLIDE 13

MMSE estimates from a Gaussian mixture

  • Let P(x, y) be a Gaussian mixture
  • Then P(y|x) is also a Gaussian mixture:

$$P(x, y) = P(z) = \sum_k P(k)\, N(\mu_{z,k}, C_{zz,k}), \qquad z = \begin{bmatrix} x \\ y \end{bmatrix}$$

$$P(y \mid x) = \frac{P(x, y)}{P(x)} = \sum_k \frac{P(k)\, P(x \mid k)\, P(y \mid x, k)}{P(x)} = \sum_k P(k \mid x)\, P(y \mid x, k)$$

SLIDE 14

MMSE estimates from a Gaussian mixture

  • Let P(y|x) be a Gaussian mixture:

$$P(x, y, k) = P(k)\, N\!\left(\begin{bmatrix} \mu_{x,k} \\ \mu_{y,k} \end{bmatrix}, \begin{bmatrix} C_{xx,k} & C_{xy,k} \\ C_{yx,k} & C_{yy,k} \end{bmatrix}\right)$$

$$P(y \mid x, k) = N\!\left(\mu_{y,k} + C_{yx,k} C_{xx,k}^{-1}(x - \mu_{x,k}),\; C_k\right)$$

$$P(y \mid x) = \sum_k P(k \mid x)\, N\!\left(\mu_{y,k} + C_{yx,k} C_{xx,k}^{-1}(x - \mu_{x,k}),\; C_k\right)$$

(where $C_k$ is the conditional covariance of component k, $C_{yy,k} - C_{yx,k} C_{xx,k}^{-1} C_{xy,k}$)

SLIDE 15

MMSE estimates from a Gaussian mixture

  • P(y|x) is a Gaussian mixture density
  • E[y|x] is therefore also a mixture:

$$P(y \mid x) = \sum_k P(k \mid x)\, N\!\left(\mu_{y,k} + C_{yx,k} C_{xx,k}^{-1}(x - \mu_{x,k}),\; C_k\right)$$

$$E[y \mid x] = \sum_k P(k \mid x)\, E[y \mid x, k] = \sum_k P(k \mid x)\left(\mu_{y,k} + C_{yx,k} C_{xx,k}^{-1}(x - \mu_{x,k})\right)$$

SLIDE 16

MMSE estimates from a Gaussian mixture

  • A weighted combination of MMSE estimates obtained from the individual Gaussians!
  • The weight P(k|x) is easily computed too:

$$E[y \mid x] = \sum_k P(k \mid x)\left(\mu_{y,k} + C_{yx,k} C_{xx,k}^{-1}(x - \mu_{x,k})\right)$$

$$P(k \mid x) = \frac{P(k)\, P(x \mid k)}{P(x)}, \qquad P(x \mid k) = N(\mu_{x,k}, C_{xx,k})$$

SLIDE 17

MMSE estimates from a Gaussian mixture

  • A mixture of estimates from the individual Gaussians

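A minimal sketch of this estimator in numpy/scipy (illustrative only: the two-component parameters below are made up, and a real system would learn them with EM):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-component joint GMM over scalar x and scalar y.
priors = np.array([0.4, 0.6])                    # P(k)
mu_x = [np.array([0.0]), np.array([3.0])]
mu_y = [np.array([1.0]), np.array([-1.0])]
C_xx = [np.array([[1.0]]), np.array([[2.0]])]
C_xy = [np.array([[0.6]]), np.array([[-0.5]])]

def mmse_estimate(x):
    """E[y|x] = sum_k P(k|x) (mu_yk + C_yxk C_xxk^{-1} (x - mu_xk))."""
    # Component evidence P(k) P(x|k), with P(x|k) = N(mu_xk, C_xxk)
    evidence = np.array([p * multivariate_normal.pdf(x, mean=m, cov=c)
                         for p, m, c in zip(priors, mu_x, C_xx)])
    post = evidence / evidence.sum()             # P(k|x)
    return sum(pk * (my + C_xy[k].T @ np.linalg.inv(C_xx[k]) @ (x - mx))
               for k, (pk, my, mx) in enumerate(zip(post, mu_y, mu_x)))

print(mmse_estimate(np.array([1.0])))
```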
SLIDE 18

Voice Morphing

  • Align training recordings from both speakers
    – Cepstral vector sequence
  • Learn a GMM on the joint vectors
  • Given speech from one speaker, find the MMSE estimate of the other
  • Synthesize from the estimated cepstra

SLIDE 19

MMSE with GMM: Voice Transformation

  • Festvox GMM transformation suite (Toda)

[Audio demos: transformations among the voices awb, bdl, jmk, and slt]

SLIDE 20

MAP / ML / MMSE

  • General statistical estimators
  • All used to predict a variable, based on other parameters related to it
  • Most common assumption: data are Gaussian; all RVs are Gaussian
    – Other probability densities may also be used
  • For Gaussians, the relationships are linear, as we saw

SLIDE 21

Gaussians and more Gaussians..

  • Linear Gaussian Models..
  • But first, a recap

SLIDE 22

A Brief Recap

  • Principal component analysis: Find the K bases that best explain the given data
  • Find B and C such that the difference between D and BC is minimum
    – While constraining that the columns of B are orthonormal

$$D \approx BC$$

SLIDE 23

Remember Eigenfaces

  • Approximate every face f as

$$f = w_{f,1} V_1 + w_{f,2} V_2 + w_{f,3} V_3 + \dots + w_{f,k} V_k$$

  • Estimate V to minimize the squared error
  • The error is unexplained by V₁…V_k
  • The error is orthogonal to the Eigenfaces

SLIDE 24

Karhunen Loeve vs. PCA

  • Eigenvectors of the Correlation matrix:
    – Principal directions of the tightest ellipse centered on the origin
    – Directions that retain maximum energy

SLIDE 25

Karhunen Loeve vs. PCA

  • Eigenvectors of the Correlation matrix:
    – Principal directions of the tightest ellipse centered on the origin
    – Directions that retain maximum energy
  • Eigenvectors of the Covariance matrix:
    – Principal directions of the tightest ellipse centered on the data
    – Directions that retain maximum variance

SLIDE 28

Karhunen Loeve vs. PCA

  • If the data are naturally centered at the origin, KLT == PCA
  • The following slides refer to PCA!
    – Assume data centered at the origin for simplicity
      • Not essential, as we will see..

SLIDE 29

Remember Eigenfaces

  • Approximate every face f as

$$f = w_{f,1} V_1 + w_{f,2} V_2 + w_{f,3} V_3 + \dots + w_{f,k} V_k$$

  • Estimate V to minimize the squared error
  • The error is unexplained by V₁…V_k
  • The error is orthogonal to the Eigenfaces

SLIDE 30

Eigen Representation

  • K-dimensional representation
    – Error is orthogonal to the representation
    – Weight and error are specific to the data instance

$$x_1 = w_{11} V_1 + e_1$$

[Figure: illustration assuming a 3D space]

SLIDE 31

Representation

  • K-dimensional representation
    – Error is orthogonal to the representation
    – Weight and error are specific to the data instance

$$x_2 = w_{12} V_1 + e_2$$

[Figure: illustration assuming a 3D space; the error is at 90° to the eigenface]

SLIDE 32

Representation

  • K-dimensional representation
    – Error is orthogonal to the representation

[Figure: all data with the same representation $wV_1$ lie on a plane orthogonal to $wV_1$]

SLIDE 33

With 2 bases

  • K-dimensional representation
    – Error is orthogonal to the representation
    – Weight and error are specific to the data instance

$$x_1 = w_{11} V_1 + w_{21} V_2 + e_1$$

[Figure: illustration assuming a 3D space; the error is at 90° to the eigenfaces]

SLIDE 34

With 2 bases

  • K-dimensional representation
    – Error is orthogonal to the representation
    – Weight and error are specific to the data instance

$$x_2 = w_{12} V_1 + w_{22} V_2 + e_2$$

[Figure: illustration assuming a 3D space; the error is at 90° to the eigenfaces]

SLIDE 35

In Vector Form

  • K-dimensional representation
    – Error is orthogonal to the representation
    – Weight and error are specific to the data instance

$$X_i = w_{1i} V_1 + w_{2i} V_2 + e_i = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} w_{1i} \\ w_{2i} \end{bmatrix} + e_i$$

SLIDE 36

In Vector Form

$$x = Vw + e$$

  • K-dimensional representation
  • x is a D-dimensional vector
  • V is a D x K matrix
  • w is a K-dimensional vector
  • e is a D-dimensional vector
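The following numpy sketch (an added illustration, with synthetic data) computes this decomposition and checks that the error really is orthogonal to the bases:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 10))
X -= X.mean(axis=0)                     # center the data (KLT == PCA)

K = 3
evals, evecs = np.linalg.eigh(np.cov(X.T))
V = evecs[:, np.argsort(evals)[::-1][:K]]   # V: D x K, top-K eigenvectors

W = X @ V                               # K-dimensional weights, one row per x
E = X - W @ V.T                         # D-dimensional errors: x = Vw + e

print(np.allclose(V.T @ V, np.eye(K)))  # V^T V = I
print(np.allclose(E @ V, 0.0))          # e^T V = 0 for every instance
```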
SLIDE 37

Learning PCA

  • For the given data: find the K-dimensional subspace that captures most of the variance in the data
    – The variance in the remaining subspace is minimal

SLIDE 38

Constraints

$$x = Vw + e$$

  • $V^T V = I$: the eigenvectors are orthogonal to each other
  • For every vector, the error is orthogonal to the eigenvectors
    – $e^T V = 0$
  • Over the collection of data
    – Average $w w^T$ = diagonal: the eigen representations are uncorrelated
    – Determinant of $e^T e$ = minimum: the error variance is minimum
  • The mean of the error is 0

SLIDE 39

A Statistical Formulation of PCA

  • x is a random variable generated according to a linear relation
  • w is drawn from a K-dimensional Gaussian with diagonal covariance
  • e is drawn from a 0-mean, (D−K)-rank, D-dimensional Gaussian
  • Estimate V (and B) given examples of x

$$x = Vw + e, \qquad w \sim N(0, B), \qquad e \sim N(0, E)$$

SLIDE 40

Linear Gaussian Models!!

  • x is a random variable generated according to a linear relation
  • w is drawn from a Gaussian
  • e is drawn from a 0-mean Gaussian
  • Estimate V given examples of x
    – In the process, also estimate B and E

$$x = Vw + e, \qquad w \sim N(0, B), \qquad e \sim N(0, E)$$


SLIDE 42

Linear Gaussian Models

  • Observations are linear functions of two uncorrelated Gaussian random variables
    – A “weight” variable w
    – An “error” variable e
    – The error is not correlated with the weight: $E[e^T w] = 0$
  • Learning LGMs: Estimate the parameters of the model given instances of x
    – This is the problem of learning the distribution of a Gaussian RV

$$x = \mu + Vw + e, \qquad w \sim N(0, B), \qquad e \sim N(0, E)$$

SLIDE 43

LGMs: Probability Density

$$x = \mu + Vw + e, \qquad w \sim N(0, B), \qquad e \sim N(0, E)$$

  • The mean of x:

$$E[x] = \mu + V\,E[w] + E[e] = \mu$$

  • The covariance of x:

$$E[(x - \mu)(x - \mu)^T] = V B V^T + E$$

SLIDE 44

The probability of x

  • x is a linear function of Gaussians: x is also Gaussian
  • Its mean and variance are as given above:

$$x \sim N(\mu,\; VBV^T + E)$$

$$P(x) = \frac{1}{\sqrt{(2\pi)^D\,|VBV^T + E|}} \exp\!\left(-0.5\,(x - \mu)^T (VBV^T + E)^{-1} (x - \mu)\right)$$

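As a quick numerical sanity check of this density (a sketch with made-up dimensions and parameters), we can draw samples from the generative model and compare their sample statistics with $\mu$ and $VBV^T + E$:

```python
import numpy as np

rng = np.random.default_rng(1)
D, K, N = 5, 2, 200_000

mu = rng.standard_normal(D)
V = rng.standard_normal((D, K))
B = np.diag(rng.uniform(0.5, 2.0, size=K))   # diagonal covariance of w
E = np.diag(rng.uniform(0.1, 0.5, size=D))   # diagonal covariance of e

w = rng.multivariate_normal(np.zeros(K), B, size=N)
e = rng.multivariate_normal(np.zeros(D), E, size=N)
x = mu + w @ V.T + e                         # x = mu + V w + e, row-wise

print(np.allclose(x.mean(axis=0), mu, atol=0.05))            # mean -> mu
print(np.allclose(np.cov(x.T), V @ B @ V.T + E, atol=0.1))   # cov -> V B V^T + E
```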
SLIDE 45

Estimating the variables of the model

  • Estimating the variables of the LGM is equivalent to estimating P(x)
    – The variables are μ, V, B and E

$$x = \mu + Vw + e, \qquad w \sim N(0, B), \qquad e \sim N(0, E), \qquad x \sim N(\mu,\; VBV^T + E)$$

SLIDE 46

Estimating the model

  • The model is indeterminate:
    – $Vw = VCC^{-1}w = (VC)(C^{-1}w)$
    – We need extra constraints to make the solution unique
  • Usual constraint: B = I
    – The variance of w is an identity matrix

$$x = \mu + Vw + e, \qquad w \sim N(0, B), \qquad e \sim N(0, E), \qquad x \sim N(\mu,\; VBV^T + E)$$

SLIDE 47

Estimating the variables of the model

  • Estimating the variables of the LGM is equivalent to estimating P(x)
    – The variables are μ, V, and E

$$x = \mu + Vw + e, \qquad w \sim N(0, I), \qquad e \sim N(0, E), \qquad x \sim N(\mu,\; VV^T + E)$$

SLIDE 48

The Maximum Likelihood Estimate

  • Given a training set x₁, x₂, …, x_N, find μ, V, E
  • The ML estimate of μ does not depend on the covariance of the Gaussian:

$$\mu = \frac{1}{N} \sum_i x_i$$

SLIDE 49

Centered Data

  • We can safely assume “centered” data
    – μ = 0
  • If the data are not centered, “center” them
    – Estimate the mean of the data
      • Which is its maximum likelihood estimate
    – Subtract it from the data

SLIDE 50

Simplified Model

  • Estimating the variables of the LGM is equivalent to estimating P(x)
    – The variables are V and E

$$x = Vw + e, \qquad w \sim N(0, I), \qquad e \sim N(0, E), \qquad x \sim N(0,\; VV^T + E)$$

SLIDE 51

Estimating the model

  • Given a collection of x_i terms: x₁, x₂, …, x_N
  • Estimate V and E
  • w is unknown for each x
  • But if we assume we know w for each x, then what do we get?

$$x = Vw + e, \qquad x \sim N(0,\; VV^T + E)$$

SLIDE 52

Estimating the Parameters

  • We will use a maximum-likelihood estimate
  • The log-likelihood of x₁…x_N, knowing their w_i's:

$$x = Vw + e, \qquad P(e_i) = N(0, E) \quad\Rightarrow\quad P(x \mid w) = N(Vw,\; E)$$

$$P(x \mid w) = \frac{1}{\sqrt{(2\pi)^D |E|}} \exp\!\left(-0.5\,(x - Vw)^T E^{-1} (x - Vw)\right)$$

$$LL = \log P(x_1 \dots x_N \mid w_1 \dots w_N) = -0.5\,N \log|E| - 0.5 \sum_i (x_i - Vw_i)^T E^{-1} (x_i - Vw_i) + \text{const}$$

SLIDE 53

Maximizing the log-likelihood

$$LL = -0.5\,N \log|E| - 0.5 \sum_i (x_i - Vw_i)^T E^{-1} (x_i - Vw_i) + \text{const}$$

  • Differentiating w.r.t. V and setting to 0:

$$\frac{d\,LL}{dV} = \sum_i E^{-1} (x_i - Vw_i)\, w_i^T = 0 \quad\Rightarrow\quad V = \left[\sum_i x_i w_i^T\right]\left[\sum_i w_i w_i^T\right]^{-1}$$

  • Differentiating w.r.t. $E^{-1}$ and setting to 0:

$$E = \frac{1}{N} \sum_i x_i x_i^T - \frac{1}{N}\, V \sum_i w_i x_i^T$$

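In numpy, these two closed-form updates take only a few lines (a sketch under the assumption that the w_i are observed; the function and variable names are my own):

```python
import numpy as np

def ml_estimate_given_w(X, W):
    """ML estimates of V and E for x = Vw + e when the weights are known.

    X: N x D matrix of (centered) observations, W: N x K matrix of weights.
    """
    N = X.shape[0]
    V = (X.T @ W) @ np.linalg.inv(W.T @ W)   # [sum_i x_i w_i^T][sum_i w_i w_i^T]^{-1}
    E = (X.T @ X) / N - V @ (W.T @ X) / N    # (1/N) sum x x^T - (1/N) V sum w x^T
    return V, E
```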
SLIDE 54

Estimating LGMs: If we know w

$$x = Vw + e, \qquad P(e_i) = N(0, E)$$

$$V = \left[\sum_i x_i w_i^T\right]\left[\sum_i w_i w_i^T\right]^{-1}, \qquad E = \frac{1}{N} \sum_i x_i x_i^T - \frac{1}{N}\, V \sum_i w_i x_i^T$$

  • But in reality we don’t know the w for each x
    – So how do we deal with this?
  • EM..

SLIDE 55

Recall EM

  • We figured out how to compute the parameters if we knew the missing information
  • Then we “fragmented” the observations according to the posterior probability P(z|x) and counted as usual
  • In effect, we took the expectation with respect to the a posteriori probability of the missing data: P(z|x)

[Figure: the two-dice example; when the die is unknown, each observed number (e.g. a 6) is split between the “blue” and “red” collections in proportion to P(z|x)]

SLIDE 56

EM for LGMs

  • Replace the unseen data terms with expectations taken w.r.t. P(w|x_i):

$$V = \left[\sum_i x_i\, E_{w|x_i}[w^T]\right]\left[\sum_i E_{w|x_i}[w w^T]\right]^{-1}$$

$$E = \frac{1}{N} \sum_i x_i x_i^T - \frac{1}{N}\, V \sum_i E_{w|x_i}[w]\, x_i^T$$


SLIDE 58

Expected Value of w given x

  • x and w are jointly Gaussian!
    – x is Gaussian
    – w is Gaussian
    – They are linearly related

$$x = Vw + e, \qquad P(w) = N(0, I), \qquad P(e) = N(0, E), \qquad P(x) = N(0,\; VV^T + E)$$

$$z = \begin{bmatrix} x \\ w \end{bmatrix}, \qquad P(z) = N(\mu_z, C_{zz})$$

SLIDE 59

Expected Value of w given x

  • x and w are jointly Gaussian!

$$z = \begin{bmatrix} x \\ w \end{bmatrix}, \qquad \mu_z = \begin{bmatrix} \mu_x \\ \mu_w \end{bmatrix}, \qquad C_{zz} = \begin{bmatrix} C_{xx} & C_{xw} \\ C_{wx} & C_{ww} \end{bmatrix}$$

$$C_{xw} = E[(x - \mu_x)(w - \mu_w)^T] = E[(Vw + e)\, w^T] = V$$

$$C_{zz} = \begin{bmatrix} VV^T + E & V \\ V^T & I \end{bmatrix}$$
slide-60
SLIDE 60

The conditional expectation of w given z

  • P(w|z) is a Gaussian

12 Nov 2013 11755/18797 60

) ), ( ( ) | (

1 1 xw xx wx ww x xx wx w

x w C C C C x C C N P

T  

     

        I E C

T T

V V VV

zz

      

ww wx xw xx zz

C C C C C

       

w x z

   ) ) ( , ) ( ( ) | (

1 1

V VV V x VV V x w

 

    E I E N P

T T T T i T T

E E

i

x VV V w

x w 1 |

) ( ] [

 

T T

i i i

E E Var E ] [ ] [ ) ( ] [

| | |

w w w ww

x w x w x w

 

T T T T

i i i

E E E I E ] [ ] [ ) ( ] [

| | 1 |

w w V VV V ww

x w x w x w

   

SLIDE 61

LGM: The complete EM algorithm

  • Initialize V and E
  • E step:

$$E_{w|x_i}[w] = V^T (VV^T + E)^{-1} x_i, \qquad E_{w|x_i}[w w^T] = I - V^T (VV^T + E)^{-1} V + E_{w|x_i}[w]\, E_{w|x_i}[w]^T$$

  • M step:

$$V = \left[\sum_i x_i\, E_{w|x_i}[w^T]\right]\left[\sum_i E_{w|x_i}[w w^T]\right]^{-1}, \qquad E = \frac{1}{N} \sum_i x_i x_i^T - \frac{1}{N}\, V \sum_i E_{w|x_i}[w]\, x_i^T$$

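Putting the E-step and M-step together, here is a compact numpy sketch of the whole algorithm (an added illustration; the random initialization and fixed iteration count are my own choices, not prescribed by the slides):

```python
import numpy as np

def lgm_em(X, K, n_iters=50, seed=0):
    """EM for x = Vw + e with w ~ N(0, I), e ~ N(0, E). X: N x D, centered."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((D, K))          # initialize V and E
    E = np.eye(D)
    for _ in range(n_iters):
        # E step: posterior moments of w for every x_i
        G = V.T @ np.linalg.inv(V @ V.T + E)     # V^T (V V^T + E)^{-1}
        Ew = X @ G.T                             # rows are E[w | x_i], N x K
        cov_w = np.eye(K) - G @ V                # Var(w | x), same for all i
        sum_Eww = N * cov_w + Ew.T @ Ew          # sum_i E[w w^T | x_i]
        # M step: closed-form updates from the expected statistics
        V = (X.T @ Ew) @ np.linalg.inv(sum_Eww)
        E = (X.T @ X) / N - V @ (Ew.T @ X) / N
    return V, E
```

For a PCA-style model one would further constrain E (e.g. isotropic, as in probabilistic PCA); the slides keep E general.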
SLIDE 62

So what have we achieved?

  • Employed a complicated EM algorithm to learn a Gaussian PDF for a variable x
  • What have we gained???
  • Next class:
    – PCA
      • Sensible PCA
      • EM algorithms for PCA
    – Factor Analysis
      • FA for feature extraction

SLIDE 63

LGMs, Application 1: Learning principal components

  • Find the directions that capture most of the variation in the data
  • The error is orthogonal to these variations

$$x = Vw + e, \qquad w \sim N(0, I), \qquad e \sim N(0, E)$$

SLIDE 64

LGMs, Application 2: Learning with insufficient data

  • The full covariance matrix of a Gaussian has D² terms
  • It fully captures the relationships between variables
  • Problem: it needs a lot of data to estimate robustly

[Figure: a full-covariance Gaussian fit to data]

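A rough parameter count shows the appeal (simple arithmetic added here, under the assumption of a diagonal E): the symmetric full covariance has D(D+1)/2 free terms, while the LGM covariance $VV^T + E$ needs only about DK + D.

```python
D, K = 100, 10
full_cov_params = D * (D + 1) // 2   # symmetric full D x D covariance
lgm_params = D * K + D               # V (D x K) plus a diagonal E
print(full_cov_params, lgm_params)   # 5050 vs 1100
```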
SLIDE 65

To be continued..

  • Other applications..
  • Next class