Machine Learning for Signal Processing: Linear Gaussian Models


SLIDE 1

Machine Learning for Signal Processing Linear Gaussian Models

Class 21. 12 Nov 2013. Instructor: Bhiksha Raj

SLIDE 2

Administrivia

  • HW3 is up
  • Projects – please send us an update

SLIDE 3

Recap: MAP Estimators

  • MAP (Maximum A Posteriori): Find a “best guess” for y (statistically), given known x

$$\hat{y} = \arg\max_Y P(Y \mid x)$$

SLIDE 4

Recap: MAP estimation

  • x and y are jointly Gaussian
  • z is Gaussian

$$z = \begin{bmatrix} x \\ y \end{bmatrix}, \qquad \mu_z = E[z] = \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}$$

$$C_{zz} = \mathrm{Var}(z) = E[(z - \mu_z)(z - \mu_z)^T] = \begin{bmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{bmatrix}, \qquad C_{xy} = E[(x - \mu_x)(y - \mu_y)^T]$$

$$P(z) = N(\mu_z, C_{zz}) = \frac{1}{\sqrt{(2\pi)^D |C_{zz}|}} \exp\!\left(-0.5\,(z - \mu_z)^T C_{zz}^{-1} (z - \mu_z)\right)$$

SLIDE 5

MAP estimation: Gaussian PDF

[Figure: a joint Gaussian density over the X–Y plane]

SLIDE 6

MAP estimation: The Gaussian at a particular value of X

[Figure: the same density, sliced at X = x₀]

SLIDE 7

Conditional Probability of y|x

  • The conditional probability of y given x is also Gaussian
    – The slice in the figure is Gaussian
  • The mean of this Gaussian is a function of x
  • The variance of y reduces if x is known
    – Uncertainty is reduced

$$P(y \mid x) = N\!\left(\mu_y + C_{yx} C_{xx}^{-1}(x - \mu_x),\; C_{yy} - C_{yx} C_{xx}^{-1} C_{xy}\right)$$

$$E[y \mid x] = \mu_{y|x} = \mu_y + C_{yx} C_{xx}^{-1}(x - \mu_x)$$

$$\mathrm{Var}(y \mid x) = C_{yy} - C_{yx} C_{xx}^{-1} C_{xy}$$

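To make the conditioning formulas concrete, here is a minimal numpy sketch (an illustration added for this transcript, not from the slides; the block sizes and parameter values are made up):

```python
import numpy as np

# Hypothetical joint Gaussian over z = [x; y]: x is 2-D, y is 1-D.
mu_x = np.array([0.0, 1.0])
mu_y = np.array([2.0])
C_xx = np.array([[2.0, 0.5],
                 [0.5, 1.0]])
C_xy = np.array([[0.8],
                 [0.3]])
C_yy = np.array([[1.5]])

def conditional_y_given_x(x):
    """Mean and covariance of P(y | x) for jointly Gaussian (x, y)."""
    gain = C_xy.T @ np.linalg.inv(C_xx)   # C_yx C_xx^{-1}
    mean = mu_y + gain @ (x - mu_x)       # E[y | x]: the MAP (= MMSE) estimate
    cov = C_yy - gain @ C_xy              # Var(y | x): smaller than C_yy
    return mean, cov

print(conditional_y_given_x(np.array([1.0, 0.0])))
```

Note that the conditional covariance does not depend on the observed value of x; that is a peculiarity of the Gaussian.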
SLIDE 8

MAP estimation: The Gaussian at a particular value of X

[Figure: the slice at X = x₀; its peak is the most likely value of y]

SLIDE 9

MAP Estimation of a Gaussian RV

$$\hat{y} = \arg\max_y P(y \mid x) = E[y \mid x]$$

SLIDE 10

It’s also a minimum mean squared error (MMSE) estimate

  • Minimize the error:

$$Err = E[(\hat{y} - y)^T(\hat{y} - y) \mid x] = E[\hat{y}^T\hat{y} - 2\hat{y}^T y + y^T y \mid x] = \hat{y}^T\hat{y} - 2\hat{y}^T E[y \mid x] + E[y^T y \mid x]$$

  • Differentiating and equating to 0:

$$\frac{d\,Err}{d\hat{y}} = 2\hat{y} - 2E[y \mid x] = 0 \quad\Rightarrow\quad \hat{y} = E[y \mid x]$$

The MMSE estimate is the mean of the distribution

SLIDE 11

For the Gaussian: MAP = MMSE

[Figure: the most likely value is also the MEAN value]

  • This would be true of any symmetric distribution
SLIDE 12

MMSE estimates for mixture distributions

  • Let P(y|x) be a mixture density
  • The MMSE estimate of y is then just a weighted combination of the MMSE estimates from the component distributions:

$$P(y \mid x) = \sum_k P(k \mid x)\, P(y \mid x, k)$$

$$E[y \mid x] = \int y \sum_k P(k \mid x)\, P(y \mid x, k)\, dy = \sum_k P(k \mid x) \int y\, P(y \mid x, k)\, dy = \sum_k P(k \mid x)\, E[y \mid x, k]$$

SLIDE 13

MMSE estimates from a Gaussian mixture

  • Let P(x, y) be a Gaussian mixture
  • Then P(y|x) is also a Gaussian mixture:

$$P(x, y) = P(z) = \sum_k P(k)\, N(\mu_{z,k}, C_{zz,k}), \qquad z = \begin{bmatrix} x \\ y \end{bmatrix}$$

$$P(y \mid x) = \frac{P(x, y)}{P(x)} = \sum_k \frac{P(k)\, P(x \mid k)\, P(y \mid x, k)}{P(x)} = \sum_k P(k \mid x)\, P(y \mid x, k)$$

SLIDE 14

MMSE estimates from a Gaussian mixture

  • Let P(y|x) be a Gaussian mixture:

$$P(x, y, k) = P(k)\, N\!\left(\begin{bmatrix} \mu_{x,k} \\ \mu_{y,k} \end{bmatrix}, \begin{bmatrix} C_{xx,k} & C_{xy,k} \\ C_{yx,k} & C_{yy,k} \end{bmatrix}\right)$$

$$P(y \mid x, k) = N\!\left(\mu_{y,k} + C_{yx,k} C_{xx,k}^{-1}(x - \mu_{x,k}),\; C_k\right)$$

$$P(y \mid x) = \sum_k P(k \mid x)\, N\!\left(\mu_{y,k} + C_{yx,k} C_{xx,k}^{-1}(x - \mu_{x,k}),\; C_k\right)$$

(where $C_k$ is the conditional covariance of component k, $C_{yy,k} - C_{yx,k} C_{xx,k}^{-1} C_{xy,k}$)

SLIDE 15

MMSE estimates from a Gaussian mixture

  • P(y|x) is a Gaussian mixture density
  • E[y|x] is therefore also a mixture:

$$P(y \mid x) = \sum_k P(k \mid x)\, N\!\left(\mu_{y,k} + C_{yx,k} C_{xx,k}^{-1}(x - \mu_{x,k}),\; C_k\right)$$

$$E[y \mid x] = \sum_k P(k \mid x)\, E[y \mid x, k] = \sum_k P(k \mid x)\left(\mu_{y,k} + C_{yx,k} C_{xx,k}^{-1}(x - \mu_{x,k})\right)$$

SLIDE 16

MMSE estimates from a Gaussian mixture

  • A weighted combination of MMSE estimates obtained from the individual Gaussians!
  • The weight P(k|x) is easily computed too:

$$E[y \mid x] = \sum_k P(k \mid x)\left(\mu_{y,k} + C_{yx,k} C_{xx,k}^{-1}(x - \mu_{x,k})\right)$$

$$P(k \mid x) = \frac{P(k)\, P(x \mid k)}{P(x)}, \qquad P(x \mid k) = N(\mu_{x,k}, C_{xx,k})$$

SLIDE 17

MMSE estimates from a Gaussian mixture

  • A mixture of estimates from the individual Gaussians

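A minimal sketch of this estimator in numpy/scipy (illustrative only: the two-component parameters below are made up, and a real system would learn them with EM):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-component joint GMM over scalar x and scalar y.
priors = np.array([0.4, 0.6])                    # P(k)
mu_x = [np.array([0.0]), np.array([3.0])]
mu_y = [np.array([1.0]), np.array([-1.0])]
C_xx = [np.array([[1.0]]), np.array([[2.0]])]
C_xy = [np.array([[0.6]]), np.array([[-0.5]])]

def mmse_estimate(x):
    """E[y|x] = sum_k P(k|x) (mu_yk + C_yxk C_xxk^{-1} (x - mu_xk))."""
    # Component evidence P(k) P(x|k), with P(x|k) = N(mu_xk, C_xxk)
    evidence = np.array([p * multivariate_normal.pdf(x, mean=m, cov=c)
                         for p, m, c in zip(priors, mu_x, C_xx)])
    post = evidence / evidence.sum()             # P(k|x)
    return sum(pk * (my + C_xy[k].T @ np.linalg.inv(C_xx[k]) @ (x - mx))
               for k, (pk, my, mx) in enumerate(zip(post, mu_y, mu_x)))

print(mmse_estimate(np.array([1.0])))
```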
SLIDE 18

Voice Morphing

  • Align training recordings from both speakers
    – Cepstral vector sequence
  • Learn a GMM on the joint vectors
  • Given speech from one speaker, find the MMSE estimate of the other
  • Synthesize from the estimated cepstra

SLIDE 19

MMSE with GMM: Voice Transformation

  • Festvox GMM transformation suite (Toda)

[Audio demos: transformations among the voices awb, bdl, jmk, and slt]

SLIDE 20

MAP / ML / MMSE

  • General statistical estimators
  • All used to predict a variable, based on other parameters related to it
  • Most common assumption: data are Gaussian; all RVs are Gaussian
    – Other probability densities may also be used
  • For Gaussians, the relationships are linear, as we saw

SLIDE 21

Gaussians and more Gaussians..

  • Linear Gaussian Models..
  • But first, a recap

SLIDE 22

A Brief Recap

  • Principal component analysis: Find the K bases that best explain the given data
  • Find B and C such that the difference between D and BC is minimum
    – While constraining that the columns of B are orthonormal

$$D \approx BC$$

SLIDE 23

Remember Eigenfaces

  • Approximate every face f as

$$f = w_{f,1} V_1 + w_{f,2} V_2 + w_{f,3} V_3 + \dots + w_{f,k} V_k$$

  • Estimate V to minimize the squared error
  • The error is unexplained by V₁…V_k
  • The error is orthogonal to the Eigenfaces

SLIDE 24

Karhunen Loeve vs. PCA

  • Eigenvectors of the Correlation matrix:
    – Principal directions of the tightest ellipse centered on the origin
    – Directions that retain maximum energy

SLIDE 25

Karhunen Loeve vs. PCA

  • Eigenvectors of the Correlation matrix:
    – Principal directions of the tightest ellipse centered on the origin
    – Directions that retain maximum energy
  • Eigenvectors of the Covariance matrix:
    – Principal directions of the tightest ellipse centered on the data
    – Directions that retain maximum variance

SLIDE 28

Karhunen Loeve vs. PCA

  • If the data are naturally centered at the origin, KLT == PCA
  • The following slides refer to PCA!
    – Assume data centered at the origin for simplicity
      • Not essential, as we will see..

SLIDE 29

Remember Eigenfaces

  • Approximate every face f as

$$f = w_{f,1} V_1 + w_{f,2} V_2 + w_{f,3} V_3 + \dots + w_{f,k} V_k$$

  • Estimate V to minimize the squared error
  • The error is unexplained by V₁…V_k
  • The error is orthogonal to the Eigenfaces

SLIDE 30

Eigen Representation

  • K-dimensional representation
    – Error is orthogonal to the representation
    – Weight and error are specific to the data instance

$$x_1 = w_{11} V_1 + e_1$$

[Figure: illustration assuming a 3D space]

SLIDE 31

Representation

  • K-dimensional representation
    – Error is orthogonal to the representation
    – Weight and error are specific to the data instance

$$x_2 = w_{12} V_1 + e_2$$

[Figure: illustration assuming a 3D space; the error is at 90° to the eigenface]

SLIDE 32

Representation

  • K-dimensional representation
    – Error is orthogonal to the representation

[Figure: all data with the same representation $wV_1$ lie on a plane orthogonal to $wV_1$]

SLIDE 33

With 2 bases

  • K-dimensional representation
    – Error is orthogonal to the representation
    – Weight and error are specific to the data instance

$$x_1 = w_{11} V_1 + w_{21} V_2 + e_1$$

[Figure: illustration assuming a 3D space; the error is at 90° to the eigenfaces]

SLIDE 34

With 2 bases

  • K-dimensional representation
    – Error is orthogonal to the representation
    – Weight and error are specific to the data instance

$$x_2 = w_{12} V_1 + w_{22} V_2 + e_2$$

[Figure: illustration assuming a 3D space; the error is at 90° to the eigenfaces]

SLIDE 35

In Vector Form

  • K-dimensional representation
    – Error is orthogonal to the representation
    – Weight and error are specific to the data instance

$$X_i = w_{1i} V_1 + w_{2i} V_2 + e_i = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} w_{1i} \\ w_{2i} \end{bmatrix} + e_i$$

SLIDE 36

In Vector Form

$$x = Vw + e$$

  • K-dimensional representation
  • x is a D-dimensional vector
  • V is a D x K matrix
  • w is a K-dimensional vector
  • e is a D-dimensional vector
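The following numpy sketch (an added illustration, with synthetic data) computes this decomposition and checks that the error really is orthogonal to the bases:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 10))
X -= X.mean(axis=0)                     # center the data (KLT == PCA)

K = 3
evals, evecs = np.linalg.eigh(np.cov(X.T))
V = evecs[:, np.argsort(evals)[::-1][:K]]   # V: D x K, top-K eigenvectors

W = X @ V                               # K-dimensional weights, one row per x
E = X - W @ V.T                         # D-dimensional errors: x = Vw + e

print(np.allclose(V.T @ V, np.eye(K)))  # V^T V = I
print(np.allclose(E @ V, 0.0))          # e^T V = 0 for every instance
```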
SLIDE 37

Learning PCA

  • For the given data: find the K-dimensional subspace that captures most of the variance in the data
    – The variance in the remaining subspace is minimal

SLIDE 38

Constraints

$$x = Vw + e$$

  • $V^T V = I$: the eigenvectors are orthogonal to each other
  • For every vector, the error is orthogonal to the eigenvectors
    – $e^T V = 0$
  • Over the collection of data
    – Average $w w^T$ = diagonal: the eigen representations are uncorrelated
    – Determinant of $e^T e$ = minimum: the error variance is minimum
  • The mean of the error is 0

SLIDE 39

A Statistical Formulation of PCA

  • x is a random variable generated according to a linear relation
  • w is drawn from a K-dimensional Gaussian with diagonal covariance
  • e is drawn from a 0-mean, (D−K)-rank, D-dimensional Gaussian
  • Estimate V (and B) given examples of x

$$x = Vw + e, \qquad w \sim N(0, B), \qquad e \sim N(0, E)$$

SLIDE 40

Linear Gaussian Models!!

  • x is a random variable generated according to a linear relation
  • w is drawn from a Gaussian
  • e is drawn from a 0-mean Gaussian
  • Estimate V given examples of x
    – In the process, also estimate B and E

$$x = Vw + e, \qquad w \sim N(0, B), \qquad e \sim N(0, E)$$


SLIDE 42

Linear Gaussian Models

  • Observations are linear functions of two uncorrelated Gaussian random variables
    – A “weight” variable w
    – An “error” variable e
    – The error is not correlated with the weight: $E[e^T w] = 0$
  • Learning LGMs: Estimate the parameters of the model given instances of x
    – This is the problem of learning the distribution of a Gaussian RV

$$x = \mu + Vw + e, \qquad w \sim N(0, B), \qquad e \sim N(0, E)$$

SLIDE 43

LGMs: Probability Density

$$x = \mu + Vw + e, \qquad w \sim N(0, B), \qquad e \sim N(0, E)$$

  • The mean of x:

$$E[x] = \mu + V\,E[w] + E[e] = \mu$$

  • The covariance of x:

$$E[(x - \mu)(x - \mu)^T] = V B V^T + E$$

SLIDE 44

The probability of x

  • x is a linear function of Gaussians: x is also Gaussian
  • Its mean and variance are as given above:

$$x \sim N(\mu,\; VBV^T + E)$$

$$P(x) = \frac{1}{\sqrt{(2\pi)^D\,|VBV^T + E|}} \exp\!\left(-0.5\,(x - \mu)^T (VBV^T + E)^{-1} (x - \mu)\right)$$

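As a quick numerical sanity check of this density (a sketch with made-up dimensions and parameters), we can draw samples from the generative model and compare their sample statistics with $\mu$ and $VBV^T + E$:

```python
import numpy as np

rng = np.random.default_rng(1)
D, K, N = 5, 2, 200_000

mu = rng.standard_normal(D)
V = rng.standard_normal((D, K))
B = np.diag(rng.uniform(0.5, 2.0, size=K))   # diagonal covariance of w
E = np.diag(rng.uniform(0.1, 0.5, size=D))   # diagonal covariance of e

w = rng.multivariate_normal(np.zeros(K), B, size=N)
e = rng.multivariate_normal(np.zeros(D), E, size=N)
x = mu + w @ V.T + e                         # x = mu + V w + e, row-wise

print(np.allclose(x.mean(axis=0), mu, atol=0.05))            # mean -> mu
print(np.allclose(np.cov(x.T), V @ B @ V.T + E, atol=0.1))   # cov -> V B V^T + E
```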
SLIDE 45

Estimating the variables of the model

  • Estimating the variables of the LGM is equivalent to estimating P(x)
    – The variables are μ, V, B and E

$$x = \mu + Vw + e, \qquad w \sim N(0, B), \qquad e \sim N(0, E), \qquad x \sim N(\mu,\; VBV^T + E)$$

SLIDE 46

Estimating the model

  • The model is indeterminate:
    – $Vw = VCC^{-1}w = (VC)(C^{-1}w)$
    – We need extra constraints to make the solution unique
  • Usual constraint: B = I
    – The variance of w is an identity matrix

$$x = \mu + Vw + e, \qquad w \sim N(0, B), \qquad e \sim N(0, E), \qquad x \sim N(\mu,\; VBV^T + E)$$

SLIDE 47

Estimating the variables of the model

  • Estimating the variables of the LGM is equivalent to estimating P(x)
    – The variables are μ, V, and E

$$x = \mu + Vw + e, \qquad w \sim N(0, I), \qquad e \sim N(0, E), \qquad x \sim N(\mu,\; VV^T + E)$$

SLIDE 48

The Maximum Likelihood Estimate

  • Given a training set x₁, x₂, …, x_N, find μ, V, E
  • The ML estimate of μ does not depend on the covariance of the Gaussian:

$$\mu = \frac{1}{N} \sum_i x_i$$

SLIDE 49

Centered Data

  • We can safely assume “centered” data
    – μ = 0
  • If the data are not centered, “center” them
    – Estimate the mean of the data
      • Which is its maximum likelihood estimate
    – Subtract it from the data

SLIDE 50

Simplified Model

  • Estimating the variables of the LGM is equivalent to estimating P(x)
    – The variables are V and E

$$x = Vw + e, \qquad w \sim N(0, I), \qquad e \sim N(0, E), \qquad x \sim N(0,\; VV^T + E)$$

SLIDE 51

Estimating the model

  • Given a collection of x_i terms: x₁, x₂, …, x_N
  • Estimate V and E
  • w is unknown for each x
  • But if we assume we know w for each x, then what do we get?

$$x = Vw + e, \qquad x \sim N(0,\; VV^T + E)$$

SLIDE 52

Estimating the Parameters

  • We will use a maximum-likelihood estimate
  • The log-likelihood of x₁…x_N, knowing their w_i's:

$$x = Vw + e, \qquad P(e_i) = N(0, E) \quad\Rightarrow\quad P(x \mid w) = N(Vw,\; E)$$

$$P(x \mid w) = \frac{1}{\sqrt{(2\pi)^D |E|}} \exp\!\left(-0.5\,(x - Vw)^T E^{-1} (x - Vw)\right)$$

$$LL = \log P(x_1 \dots x_N \mid w_1 \dots w_N) = -0.5\,N \log|E| - 0.5 \sum_i (x_i - Vw_i)^T E^{-1} (x_i - Vw_i) + \text{const}$$

SLIDE 53

Maximizing the log-likelihood

$$LL = -0.5\,N \log|E| - 0.5 \sum_i (x_i - Vw_i)^T E^{-1} (x_i - Vw_i) + \text{const}$$

  • Differentiating w.r.t. V and setting to 0:

$$\frac{d\,LL}{dV} = \sum_i E^{-1} (x_i - Vw_i)\, w_i^T = 0 \quad\Rightarrow\quad V = \left[\sum_i x_i w_i^T\right]\left[\sum_i w_i w_i^T\right]^{-1}$$

  • Differentiating w.r.t. $E^{-1}$ and setting to 0:

$$E = \frac{1}{N} \sum_i x_i x_i^T - \frac{1}{N}\, V \sum_i w_i x_i^T$$

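In numpy, these two closed-form updates take only a few lines (a sketch under the assumption that the w_i are observed; the function and variable names are my own):

```python
import numpy as np

def ml_estimate_given_w(X, W):
    """ML estimates of V and E for x = Vw + e when the weights are known.

    X: N x D matrix of (centered) observations, W: N x K matrix of weights.
    """
    N = X.shape[0]
    V = (X.T @ W) @ np.linalg.inv(W.T @ W)   # [sum_i x_i w_i^T][sum_i w_i w_i^T]^{-1}
    E = (X.T @ X) / N - V @ (W.T @ X) / N    # (1/N) sum x x^T - (1/N) V sum w x^T
    return V, E
```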
SLIDE 54

Estimating LGMs: If we know w

$$x = Vw + e, \qquad P(e_i) = N(0, E)$$

$$V = \left[\sum_i x_i w_i^T\right]\left[\sum_i w_i w_i^T\right]^{-1}, \qquad E = \frac{1}{N} \sum_i x_i x_i^T - \frac{1}{N}\, V \sum_i w_i x_i^T$$

  • But in reality we don’t know the w for each x
    – So how do we deal with this?
  • EM..

SLIDE 55

Recall EM

  • We figured out how to compute the parameters if we knew the missing information
  • Then we “fragmented” the observations according to the posterior probability P(z|x) and counted as usual
  • In effect, we took the expectation with respect to the a posteriori probability of the missing data: P(z|x)

[Figure: the two-dice example; when the die is unknown, each observed number (e.g. a 6) is split between the “blue” and “red” collections in proportion to P(z|x)]

SLIDE 56

EM for LGMs

  • Replace the unseen data terms with expectations taken w.r.t. P(w|x_i):

$$V = \left[\sum_i x_i\, E_{w|x_i}[w^T]\right]\left[\sum_i E_{w|x_i}[w w^T]\right]^{-1}$$

$$E = \frac{1}{N} \sum_i x_i x_i^T - \frac{1}{N}\, V \sum_i E_{w|x_i}[w]\, x_i^T$$


SLIDE 58

Expected Value of w given x

  • x and w are jointly Gaussian!
    – x is Gaussian
    – w is Gaussian
    – They are linearly related

$$x = Vw + e, \qquad P(w) = N(0, I), \qquad P(e) = N(0, E), \qquad P(x) = N(0,\; VV^T + E)$$

$$z = \begin{bmatrix} x \\ w \end{bmatrix}, \qquad P(z) = N(\mu_z, C_{zz})$$

SLIDE 59

Expected Value of w given x

  • x and w are jointly Gaussian!

$$z = \begin{bmatrix} x \\ w \end{bmatrix}, \qquad \mu_z = \begin{bmatrix} \mu_x \\ \mu_w \end{bmatrix}, \qquad C_{zz} = \begin{bmatrix} C_{xx} & C_{xw} \\ C_{wx} & C_{ww} \end{bmatrix}$$

$$C_{xw} = E[(x - \mu_x)(w - \mu_w)^T] = E[(Vw + e)\, w^T] = V$$

$$C_{zz} = \begin{bmatrix} VV^T + E & V \\ V^T & I \end{bmatrix}$$
slide-60
SLIDE 60

The conditional expectation of w given z

  • P(w|z) is a Gaussian

12 Nov 2013 11755/18797 60

) ), ( ( ) | (

1 1 xw xx wx ww x xx wx w

x w C C C C x C C N P

T  

     

        I E C

T T

V V VV

zz

      

ww wx xw xx zz

C C C C C

       

w x z

   ) ) ( , ) ( ( ) | (

1 1

V VV V x VV V x w

 

    E I E N P

T T T T i T T

E E

i

x VV V w

x w 1 |

) ( ] [

 

T T

i i i

E E Var E ] [ ] [ ) ( ] [

| | |

w w w ww

x w x w x w

 

T T T T

i i i

E E E I E ] [ ] [ ) ( ] [

| | 1 |

w w V VV V ww

x w x w x w

   

SLIDE 61

LGM: The complete EM algorithm

  • Initialize V and E
  • E step:

$$E_{w|x_i}[w] = V^T (VV^T + E)^{-1} x_i, \qquad E_{w|x_i}[w w^T] = I - V^T (VV^T + E)^{-1} V + E_{w|x_i}[w]\, E_{w|x_i}[w]^T$$

  • M step:

$$V = \left[\sum_i x_i\, E_{w|x_i}[w^T]\right]\left[\sum_i E_{w|x_i}[w w^T]\right]^{-1}, \qquad E = \frac{1}{N} \sum_i x_i x_i^T - \frac{1}{N}\, V \sum_i E_{w|x_i}[w]\, x_i^T$$

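Putting the E-step and M-step together, here is a compact numpy sketch of the whole algorithm (an added illustration; the random initialization and fixed iteration count are my own choices, not prescribed by the slides):

```python
import numpy as np

def lgm_em(X, K, n_iters=50, seed=0):
    """EM for x = Vw + e with w ~ N(0, I), e ~ N(0, E). X: N x D, centered."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((D, K))          # initialize V and E
    E = np.eye(D)
    for _ in range(n_iters):
        # E step: posterior moments of w for every x_i
        G = V.T @ np.linalg.inv(V @ V.T + E)     # V^T (V V^T + E)^{-1}
        Ew = X @ G.T                             # rows are E[w | x_i], N x K
        cov_w = np.eye(K) - G @ V                # Var(w | x), same for all i
        sum_Eww = N * cov_w + Ew.T @ Ew          # sum_i E[w w^T | x_i]
        # M step: closed-form updates from the expected statistics
        V = (X.T @ Ew) @ np.linalg.inv(sum_Eww)
        E = (X.T @ X) / N - V @ (Ew.T @ X) / N
    return V, E
```

For a PCA-style model one would further constrain E (e.g. isotropic, as in probabilistic PCA); the slides keep E general.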
SLIDE 62

So what have we achieved?

  • Employed a complicated EM algorithm to learn a Gaussian PDF for a variable x
  • What have we gained???
  • Next class:
    – PCA
      • Sensible PCA
      • EM algorithms for PCA
    – Factor Analysis
      • FA for feature extraction

SLIDE 63

LGMs, Application 1: Learning principal components

  • Find the directions that capture most of the variation in the data
  • The error is orthogonal to these variations

$$x = Vw + e, \qquad w \sim N(0, I), \qquad e \sim N(0, E)$$

SLIDE 64

LGMs, Application 2: Learning with insufficient data

  • The full covariance matrix of a Gaussian has D² terms
  • It fully captures the relationships between variables
  • Problem: it needs a lot of data to estimate robustly

[Figure: a full-covariance Gaussian fit to data]

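A rough parameter count shows the appeal (simple arithmetic added here, under the assumption of a diagonal E): the symmetric full covariance has D(D+1)/2 free terms, while the LGM covariance $VV^T + E$ needs only about DK + D.

```python
D, K = 100, 10
full_cov_params = D * (D + 1) // 2   # symmetric full D x D covariance
lgm_params = D * K + D               # V (D x K) plus a diagonal E
print(full_cov_params, lgm_params)   # 5050 vs 1100
```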
SLIDE 65

To be continued..

  • Other applications..
  • Next class