Machine Learning for Signal Processing: Regression and Prediction - PowerPoint PPT Presentation
SLIDE 1

Machine Learning for Signal Processing: Regression and Prediction

Class 14. 17 Oct 2013. Instructor: Bhiksha Raj

SLIDE 2

Matrix Identities

  • The derivative of a scalar function w.r.t. a vector is a vector

For $\mathbf{x} = [x_1\; x_2\; \ldots\; x_D]$:

$$df(\mathbf{x}) = \frac{df}{dx_1}\,dx_1 + \frac{df}{dx_2}\,dx_2 + \cdots + \frac{df}{dx_D}\,dx_D$$

so the derivative collects the partials into the vector $\left[\frac{df}{dx_1}\; \frac{df}{dx_2}\; \cdots\; \frac{df}{dx_D}\right]$.
SLIDE 3

Matrix Identities

  • The derivative of a scalar function w.r.t. a vector is a vector
  • The derivative w.r.t. a matrix is a matrix

For a matrix argument $\mathbf{X} = [x_{ij}]$ of size $D \times D$:

$$df(\mathbf{X}) = \sum_{i=1}^{D}\sum_{j=1}^{D} \frac{df}{dx_{ij}}\,dx_{ij},
\qquad
\frac{df}{d\mathbf{X}} =
\begin{bmatrix}
\frac{df}{dx_{11}} & \cdots & \frac{df}{dx_{1D}}\\
\vdots & \ddots & \vdots\\
\frac{df}{dx_{D1}} & \cdots & \frac{df}{dx_{DD}}
\end{bmatrix}$$
SLIDE 4

Matrix Identities

  • The derivative of a vector function w.r.t. a vector is a matrix
    – Note the transposition of order

For $\mathbf{F}(\mathbf{x}) = [F_1\; F_2\; \ldots\; F_N]$ and $\mathbf{x} = [x_1\; x_2\; \ldots\; x_D]$:

$$[dF_1\; dF_2\; \cdots\; dF_N] = [dx_1\; dx_2\; \cdots\; dx_D]
\begin{bmatrix}
\frac{dF_1}{dx_1} & \frac{dF_2}{dx_1} & \cdots & \frac{dF_N}{dx_1}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{dF_1}{dx_D} & \frac{dF_2}{dx_D} & \cdots & \frac{dF_N}{dx_D}
\end{bmatrix}$$
SLIDE 5

Derivatives

  • In general: differentiating an M×N function by a U×V argument results in an M×N×U×V tensor derivative
    – E.g., an N×1 function differentiated by a U×V argument gives an N×U×V tensor (or U×V×N, depending on the ordering convention)
SLIDE 6

Matrix derivative identities

  • Some basic linear and quadratic identities

$$\frac{d(\mathbf{X}\mathbf{a})}{d\mathbf{a}} = \mathbf{X}$$
(X is a matrix, a is a vector. The solution may also be $\mathbf{X}^T$, depending on the layout convention.)

$$d(\mathbf{A}\mathbf{X}) = \mathbf{A}\,(d\mathbf{X}); \qquad d(\mathbf{X}\mathbf{A}) = (d\mathbf{X})\,\mathbf{A}$$
(A is a matrix.)

$$\frac{d(\mathbf{a}^T\mathbf{X}\mathbf{a})}{d\mathbf{X}} = \mathbf{a}\mathbf{a}^T$$

$$\frac{d\,\mathrm{trace}(\mathbf{X}\mathbf{A})}{d\mathbf{X}} = \mathbf{A}^T; \qquad
\frac{d\,\mathrm{trace}(\mathbf{X}\mathbf{A}\mathbf{X}^T)}{d\mathbf{X}} = \mathbf{X}\mathbf{A}^T + \mathbf{X}\mathbf{A}$$
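These identities are easy to sanity-check numerically. A minimal sketch (mine, not from the slides) verifying the quadratic identity $d(\mathbf{a}^T\mathbf{X}\mathbf{a})/d\mathbf{X} = \mathbf{a}\mathbf{a}^T$ by finite differences:

```python
import numpy as np

# Finite-difference check of the quadratic identity d(a^T X a)/dX = a a^T.
rng = np.random.default_rng(0)
D = 4
X = rng.standard_normal((D, D))
a = rng.standard_normal(D)

f = lambda M: a @ M @ a              # scalar function of the matrix argument
eps = 1e-6
num_grad = np.zeros((D, D))
for i in range(D):
    for j in range(D):
        dM = np.zeros((D, D))
        dM[i, j] = eps
        num_grad[i, j] = (f(X + dM) - f(X - dM)) / (2 * eps)

print(np.allclose(num_grad, np.outer(a, a), atol=1e-5))   # -> True
```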

SLIDE 7

A Common Problem

  • Can you spot the glitches?

SLIDE 8

How to fix this problem?

  • “Glitches” in audio
    – Must be detected – How?
    – Then what?
  • Glitches must be “fixed”
    – Delete the glitch
      • Results in a “hole”
    – Fill in the hole – How?
SLIDE 9

Interpolation..

  • “Extend” the curve on the left to “predict” the values in the “blank” region
    – Forward prediction
  • Extend the blue curve on the right leftwards to predict the blank region
    – Backward prediction
  • How?
    – Regression analysis..
SLIDE 10

Detecting the Glitch

  • Regression-based reconstruction can be done anywhere
  • The reconstructed value will not match the actual value
  • A large reconstruction error identifies glitches
SLIDE 11

What is a regression?

  • Analyzing the relationship between variables
  • Expressed in many forms
  • Wikipedia:
    – Linear regression, Simple regression, Ordinary least squares, Polynomial regression, General linear model, Generalized linear model, Discrete choice, Logistic regression, Multinomial logit, Mixed logit, Probit, Multinomial probit, ….
  • Generally a tool to predict variables
SLIDE 12

Regressions for prediction

  • y = f(x; Θ) + e
  • Different possibilities
    – y is a scalar
      • y is real
      • y is categorical (classification)
    – y is a vector
    – x is a vector
      • x is a set of real-valued variables
      • x is a set of categorical variables
      • x is a combination of the two
    – f(.) is a linear or affine function
    – f(.) is a non-linear function
    – f(.) is a time-series model
SLIDE 13

A linear regression

  • Assumption: the relationship between the variables is linear
    – A linear trend may be found relating x and y
    – y = dependent variable
    – x = explanatory variable
    – Given x, y can be predicted as an affine function of x
SLIDE 14

An imaginary regression..

  • http://pages.cs.wisc.edu/~kovar/hall.html
  • “Check this shit out (Fig. 1). That's bonafide, 100%-real data, my friends. I took it myself over the course of two weeks. And this was not a leisurely two weeks, either; I busted my ass day and night in order to provide you with nothing but the best data possible. Now, let's look a bit more closely at this data, remembering that it is absolutely first-rate. Do you see the exponential dependence? I sure don't. I see a bunch of crap. Christ, this was such a waste of my time. Banking on my hopes that whoever grades this will just look at the pictures, I drew an exponential through my noise. I believe the apparent legitimacy is enhanced by the fact that I used a complicated computer program to make the fit. I understand this is the same process by which the top quark was discovered.”
SLIDE 15

Linear Regressions

  • y = Ax + b + e
    – e = prediction error
  • Given a “training” set of {x, y} values, estimate A and b:
    – y1 = Ax1 + b + e1
    – y2 = Ax2 + b + e2
    – y3 = Ax3 + b + e3
    – …
  • If A and b are well estimated, the prediction error will be small
SLIDE 16

Linear Regression to a scalar

  • Rewrite
    – y1 = aᵀx1 + b + e1
    – y2 = aᵀx2 + b + e2
    – y3 = aᵀx3 + b + e3
  • Define:

$$\mathbf{y} = [y_1\; y_2\; y_3\; \ldots], \qquad
\mathbf{X} = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \mathbf{x}_3 & \ldots \\ 1 & 1 & 1 & \ldots \end{bmatrix}, \qquad
\mathbf{A} = \begin{bmatrix} \mathbf{a} \\ b \end{bmatrix}, \qquad
\mathbf{e} = [e_1\; e_2\; e_3\; \ldots]$$

  • Then:

$$\mathbf{y} = \mathbf{A}^T\mathbf{X} + \mathbf{e}$$
SLIDE 17

Learning the parameters

  • Given training data: several (x, y) pairs
  • Can define a “divergence”: D(y, ŷ)
    – Measures how much ŷ differs from y
    – Ideally, if the model is accurate, this should be small
  • Estimate A, b to minimize D(y, ŷ)

$$\hat{\mathbf{y}} = \mathbf{A}^T\mathbf{X} \;\;\text{(assuming no error)}, \qquad
\mathbf{y} = \mathbf{A}^T\mathbf{X} + \mathbf{e}$$
SLIDE 18

The prediction error as divergence

  • Define the divergence as the sum of squared errors in predicting y

$$\mathbf{y} = \mathbf{A}^T\mathbf{X} + \mathbf{e} \quad\Rightarrow\quad \mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$$

$$D(\mathbf{y}, \hat{\mathbf{y}}) = E = e_1^2 + e_2^2 + e_3^2 + \cdots
= (y_1 - \mathbf{a}^T\mathbf{x}_1 - b)^2 + (y_2 - \mathbf{a}^T\mathbf{x}_2 - b)^2 + (y_3 - \mathbf{a}^T\mathbf{x}_3 - b)^2 + \cdots$$

$$E = \|\mathbf{y} - \mathbf{A}^T\mathbf{X}\|^2 = (\mathbf{y} - \mathbf{A}^T\mathbf{X})(\mathbf{y} - \mathbf{A}^T\mathbf{X})^T$$

Prediction error as divergence

  • y = aᵀx + e
    – e = prediction error
    – Find the “slope” a such that the total squared length of the error lines is minimized
SLIDE 20

Solving a linear regression

  • Minimize the squared error:

$$E = \|\mathbf{y} - \mathbf{A}^T\mathbf{X}\|^2
= (\mathbf{y} - \mathbf{A}^T\mathbf{X})(\mathbf{y} - \mathbf{A}^T\mathbf{X})^T
= \mathbf{y}\mathbf{y}^T - 2\,\mathbf{y}\mathbf{X}^T\mathbf{A} + \mathbf{A}^T\mathbf{X}\mathbf{X}^T\mathbf{A}$$

  • Differentiating w.r.t. A and equating to 0:

$$\frac{dE}{d\mathbf{A}} = 2\,\mathbf{X}\mathbf{X}^T\mathbf{A} - 2\,\mathbf{X}\mathbf{y}^T = 0
\quad\Rightarrow\quad
\mathbf{A} = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{y}^T = \mathrm{pinv}(\mathbf{X}^T)\,\mathbf{y}^T$$
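A minimal numerical sketch of this closed-form solution (variable names are mine), using the slides' convention that training inputs are the columns of X, with a row of ones appended to absorb the offset b:

```python
import numpy as np

# Closed-form least squares: A = (X X^T)^{-1} X y^T.
rng = np.random.default_rng(1)
D, N = 3, 200
x = rng.standard_normal((D, N))
a_true, b_true = np.array([2.0, -1.0, 0.5]), 0.3
y = a_true @ x + b_true + 0.01 * rng.standard_normal(N)

X = np.vstack([x, np.ones((1, N))])     # (D+1) x N augmented data matrix
A = np.linalg.solve(X @ X.T, X @ y)     # solves (X X^T) A = X y^T
a_hat, b_hat = A[:-1], A[-1]
print(a_hat, b_hat)                     # close to a_true and b_true
```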

SLIDE 21

Regression in multiple dimensions

  • Also called multiple regression
  • Equivalent of saying: yᵢ = Aᵀxᵢ + b + eᵢ, where yᵢ is a vector:
    – yᵢ₁ = a₁ᵀxᵢ + b₁ + eᵢ₁
    – yᵢ₂ = a₂ᵀxᵢ + b₂ + eᵢ₂
    – yᵢ₃ = a₃ᵀxᵢ + b₃ + eᵢ₃
    (yᵢⱼ = jth component of vector yᵢ; aⱼ = jth column of A; bⱼ = jth component of b)
  • Fundamentally no different from N separate single regressions
    – But we can use the relationship between the ys to our benefit
SLIDE 22

Multiple Regression

  • Define (with the row of ones in X absorbing the offset b):

$$\mathbf{Y} = [\mathbf{y}_1\; \mathbf{y}_2\; \mathbf{y}_3\; \ldots], \qquad
\mathbf{X} = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \mathbf{x}_3 & \ldots \\ 1 & 1 & 1 & \ldots \end{bmatrix}, \qquad
\hat{\mathbf{A}} = \begin{bmatrix} \mathbf{A} \\ \mathbf{b}^T \end{bmatrix}, \qquad
\mathbf{E} = [\mathbf{e}_1\; \mathbf{e}_2\; \mathbf{e}_3\; \ldots]$$

$$\mathbf{Y} = \hat{\mathbf{A}}^T\mathbf{X} + \mathbf{E}$$

  • Minimize the total squared error:

$$DIV = \sum_i \|\mathbf{y}_i - \hat{\mathbf{A}}^T\mathbf{x}_i\|^2
= \mathrm{trace}\!\left((\mathbf{Y} - \hat{\mathbf{A}}^T\mathbf{X})(\mathbf{Y} - \hat{\mathbf{A}}^T\mathbf{X})^T\right)$$

  • Differentiating and equating to 0:

$$\frac{d\,DIV}{d\hat{\mathbf{A}}} = 2\,\mathbf{X}\mathbf{X}^T\hat{\mathbf{A}} - 2\,\mathbf{X}\mathbf{Y}^T = 0
\quad\Rightarrow\quad
\hat{\mathbf{A}} = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{Y}^T = \mathrm{pinv}(\mathbf{X}^T)\,\mathbf{Y}^T$$
SLIDE 23

A Different Perspective

  • y is a noisy reading of Aᵀx:

$$\mathbf{y} = \mathbf{A}^T\mathbf{x} + \mathbf{e}, \qquad \mathbf{e} \sim N(0, \sigma^2\mathbf{I})$$

  • Estimate A from the collected data:

$$\mathbf{Y} = [\mathbf{y}_1\; \mathbf{y}_2\; \ldots\; \mathbf{y}_N], \qquad
\mathbf{X} = [\mathbf{x}_1\; \mathbf{x}_2\; \ldots\; \mathbf{x}_N]$$
SLIDE 24

The Likelihood of the data

  • Probability of observing a specific y, given x, for a particular matrix A:

$$P(\mathbf{y}\,|\,\mathbf{x}; \mathbf{A}) = N(\mathbf{y};\, \mathbf{A}^T\mathbf{x},\, \sigma^2\mathbf{I})$$

  • Probability of the collection $\mathbf{Y} = [\mathbf{y}_1\; \ldots\; \mathbf{y}_N]$, $\mathbf{X} = [\mathbf{x}_1\; \ldots\; \mathbf{x}_N]$:

$$P(\mathbf{Y}\,|\,\mathbf{X}; \mathbf{A}) = \prod_i N(\mathbf{y}_i;\, \mathbf{A}^T\mathbf{x}_i,\, \sigma^2\mathbf{I})$$

  • Assuming IID for convenience (not necessary)
SLIDE 25

A Maximum Likelihood Estimate

  • The likelihood of the data:

$$P(\mathbf{Y}\,|\,\mathbf{X}) = \prod_i \frac{1}{(2\pi\sigma^2)^{D/2}}
\exp\!\left(-\frac{1}{2\sigma^2}\,\|\mathbf{y}_i - \mathbf{A}^T\mathbf{x}_i\|^2\right)$$

$$\log P(\mathbf{Y}\,|\,\mathbf{X}; \mathbf{A}) = C - \frac{1}{2\sigma^2}\sum_i \|\mathbf{y}_i - \mathbf{A}^T\mathbf{x}_i\|^2
= C - \frac{1}{2\sigma^2}\,\mathrm{trace}\!\left((\mathbf{Y} - \mathbf{A}^T\mathbf{X})(\mathbf{Y} - \mathbf{A}^T\mathbf{X})^T\right)$$

  • Maximizing the log probability is identical to minimizing the trace
    – Identical to the least-squares solution:

$$\mathbf{A} = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{Y}^T = \mathrm{pinv}(\mathbf{X}^T)\,\mathbf{Y}^T$$
SLIDE 26

Predicting an output

  • From a collection of training data, we have learned A
  • Given x for a new instance, but not y, what is y?
  • Simple solution:

$$\hat{\mathbf{y}} = \mathbf{A}^T\mathbf{x}$$
SLIDE 27

Applying it to our problem

  • Prediction by regression
  • Forward regression: $x_t = a_1 x_{t-1} + a_2 x_{t-2} + \cdots + a_K x_{t-K} + e_t$
  • Backward regression: $x_t = b_1 x_{t+1} + b_2 x_{t+2} + \cdots + b_K x_{t+K} + e_t$
SLIDE 28

Applying it to our problem

  • Forward prediction: collect each predicted sample $x_t$ into a vector $\mathbf{x}$, and the K preceding samples $[x_{t-1}\; x_{t-2}\; \ldots\; x_{t-K}]$ into the corresponding row of a matrix $\mathbf{X}$; then

$$\mathbf{x} = \mathbf{X}\mathbf{a} + \mathbf{e}, \qquad \mathbf{a} = \mathrm{pinv}(\mathbf{X})\,\mathbf{x}$$
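A small sketch of this least-squares predictor fit (function and variable names are mine; K is the model order):

```python
import numpy as np

# Fit a K-th order forward linear predictor x_t ~ sum_k a_k x_{t-k}
# by least squares: a = pinv(X) x in the slide's notation.
def fit_forward_predictor(x, K):
    T = len(x)
    # Each row holds the reversed lag window [x_{t-1}, ..., x_{t-K}].
    X = np.array([x[t - K:t][::-1] for t in range(K, T)])
    a, *_ = np.linalg.lstsq(X, x[K:], rcond=None)
    return a

# Example: an AR(2) signal is predicted almost perfectly.
rng = np.random.default_rng(0)
x = np.zeros(500)
for t in range(2, 500):
    x[t] = 1.8 * x[t - 1] - 0.9 * x[t - 2] + 0.01 * rng.standard_normal()
a = fit_forward_predictor(x[:-1], K=2)
print(a, a @ x[-3:-1][::-1], x[-1])     # coefficients, prediction, truth
```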

SLIDE 29

Applying it to our problem

  • Backward prediction: collect each predicted sample $x_t$ into a vector $\mathbf{x}$, and the K following samples $[x_{t+1}\; x_{t+2}\; \ldots\; x_{t+K}]$ into the corresponding row of a matrix $\mathbf{X}$; then

$$\mathbf{x} = \mathbf{X}\mathbf{b} + \mathbf{e}, \qquad \mathbf{b} = \mathrm{pinv}(\mathbf{X})\,\mathbf{x}$$
SLIDE 30

Finding the burst

  • At each time t:
    – Learn a “forward” predictor aₜ
    – Predict the next sample: $x_t^{est} = \sum_k a_{t,k}\, x_{t-k}$
    – Compute the forward error: $ferr_t = |x_t - x_t^{est}|^2$
    – Learn a “backward” predictor and compute the backward error $berr_t$
    – Compute the average prediction error over a window, and threshold it
SLIDE 31

Filling the hole

  • Learn a “forward” predictor at the left edge of the “hole”
    – For each missing sample, predict the next sample: $x_t^{est} = \sum_k a_{t,k}\, x_{t-k}$
    – Use estimated samples if real samples are not available
  • Learn a “backward” predictor at the right edge of the “hole”
    – For each missing sample, predict the previous sample: $x_t^{est} = \sum_k b_{t,k}\, x_{t+k}$
    – Use estimated samples if real samples are not available
  • Average the forward and backward predictions (see the sketch below)
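Putting the two predictors together, a sketch of this hole-filling recipe (the order K and the plain averaging are my choices; fit_forward_predictor is the helper from the earlier sketch):

```python
# Fill x[start:end] by running a forward pass from the left edge and a
# backward pass from the right edge, feeding estimates back in as we go,
# then averaging the two passes.
def fill_hole(x, start, end, K=16):
    a = fit_forward_predictor(x[:start], K)       # left-edge predictor
    b = fit_forward_predictor(x[end:][::-1], K)   # right edge, time-reversed
    fwd, bwd = x.copy(), x.copy()
    for t in range(start, end):                   # forward prediction
        fwd[t] = a @ fwd[t - K:t][::-1]
    for t in range(end - 1, start - 1, -1):       # backward prediction
        bwd[t] = b @ bwd[t + 1:t + K + 1]
    out = x.copy()
    out[start:end] = 0.5 * (fwd[start:end] + bwd[start:end])
    return out
```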

SLIDE 32

Reconstruction zoom in

[Figure: zoomed-in reconstruction – actual data, distorted signal, reconstruction area, interpolation result, recovered signal, next glitch]
SLIDE 33

Incrementally learning the regression

  • Can we learn A incrementally instead, as the data comes in?
    – The batch solution $\mathbf{A} = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{Y}^T$ requires knowledge of all (x, y) pairs
  • The Widrow-Hoff rule (scalar prediction version, learning rate η):

$$\hat{\mathbf{a}}_{t+1} = \hat{\mathbf{a}}_t + \eta\,\mathbf{x}_t\,(y_t - \hat{y}_t), \qquad \hat{y}_t = \hat{\mathbf{a}}_t^T\mathbf{x}_t$$

  • Note the structure: the update is driven by the scalar prediction error $(y_t - \hat{y}_t)$
    – Can also be done in batch mode!
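A minimal sketch of the rule (η is the learning rate; names are mine):

```python
import numpy as np

# Widrow-Hoff / LMS: nudge the weights after every (x, y) pair,
# driven by the scalar prediction error.
def lms(xs, ys, eta=0.01):
    a = np.zeros(xs.shape[1])        # rows of xs are the input vectors x_t
    for x_t, y_t in zip(xs, ys):
        y_hat = a @ x_t
        a = a + eta * x_t * (y_t - y_hat)
    return a
```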

SLIDE 34

Predicting a value

  • What are we doing exactly?

– For the explanation we are assuming no “b” (X is 0 mean) – Explanation generalizes easily even otherwise

17 Oct 2013 11755/18797 34

 

T T

XY XX A

  • 1

  x

XX YX x A y

1

ˆ

 

T T T

T

XX C 

x C x

2 1

ˆ

i T T

x X Y x C C YX y ˆ ˆ ˆ

2 1 2 1

 

 

 Let and

 Whitening x  N-0.5 C-0.5 is the whitening matrix for x

X C X

2 1

ˆ

slide-35
SLIDE 35

Predicting a value

  • What are we doing exactly?

$$\hat{\mathbf{y}} = \mathbf{Y}\hat{\mathbf{X}}^T\hat{\mathbf{x}}
= [\mathbf{y}_1\; \ldots\; \mathbf{y}_N]
\begin{bmatrix} \hat{\mathbf{x}}_1^T\hat{\mathbf{x}} \\ \vdots \\ \hat{\mathbf{x}}_N^T\hat{\mathbf{x}} \end{bmatrix}
= \sum_i \mathbf{y}_i\,(\hat{\mathbf{x}}_i^T\hat{\mathbf{x}})$$
SLIDE 36

Predicting a value

  • Given training instances (xᵢ, yᵢ) for i = 1..N, estimate y for a new test instance of x with unknown y:

$$\hat{\mathbf{y}} = \sum_i \mathbf{y}_i\,(\hat{\mathbf{x}}_i^T\hat{\mathbf{x}})$$

  • y is simply a weighted sum of the yᵢ instances from the training data
  • The weight of any yᵢ is simply the inner product between its corresponding xᵢ and the new x
    – With due whitening and scaling..
SLIDE 37

What are we doing: A different perspective

  • This assumes XXᵀ is invertible
  • What if it is not?
    – Dimensionality of X is greater than the number of observations?
    – Underdetermined
  • In this case XᵀX will generally be invertible; use the alternate form:

$$\mathbf{A} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{Y}^T
\quad\Rightarrow\quad
\hat{\mathbf{y}} = \mathbf{Y}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{x}$$
SLIDE 38

High-dimensional regression

  • XᵀX is the “Gram matrix”:

$$\mathbf{G} = \mathbf{X}^T\mathbf{X} =
\begin{bmatrix}
\mathbf{x}_1^T\mathbf{x}_1 & \mathbf{x}_1^T\mathbf{x}_2 & \cdots & \mathbf{x}_1^T\mathbf{x}_N\\
\mathbf{x}_2^T\mathbf{x}_1 & \mathbf{x}_2^T\mathbf{x}_2 & \cdots & \mathbf{x}_2^T\mathbf{x}_N\\
\vdots & \vdots & \ddots & \vdots\\
\mathbf{x}_N^T\mathbf{x}_1 & \mathbf{x}_N^T\mathbf{x}_2 & \cdots & \mathbf{x}_N^T\mathbf{x}_N
\end{bmatrix}$$

$$\hat{\mathbf{y}} = \mathbf{Y}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{x} = \mathbf{Y}\mathbf{G}^{-1}\mathbf{X}^T\mathbf{x}$$
SLIDE 39

High-dimensional regression

  • Normalize Y by the inverse of the Gram matrix: $\bar{\mathbf{Y}} = \mathbf{Y}\mathbf{G}^{-1}$

$$\hat{\mathbf{y}} = \mathbf{Y}\mathbf{G}^{-1}\mathbf{X}^T\mathbf{x}
= \bar{\mathbf{Y}}\mathbf{X}^T\mathbf{x}
= \sum_i \bar{\mathbf{y}}_i\,(\mathbf{x}_i^T\mathbf{x})$$

  • Working our way down..
SLIDE 40

Linear Regression in High-dimensional Spaces

  • Given training instances (xᵢ, yᵢ) for i = 1..N, estimate y for a new test instance of x with unknown y:

$$\hat{\mathbf{y}} = \sum_i \bar{\mathbf{y}}_i\,(\mathbf{x}_i^T\mathbf{x}), \qquad \bar{\mathbf{Y}} = \mathbf{Y}\mathbf{G}^{-1}$$

  • y is simply a weighted sum of the normalized yᵢ instances from the training data
    – The normalization is done via the Gram matrix
  • The weight of any yᵢ is simply the inner product between its corresponding xᵢ and the new x
SLIDE 41

Relationships are not always linear

  • How do we model these?
  • Multiple solutions

SLIDE 42

Non-linear regression

  • y = Aφ(x) + e

$$\varphi(\mathbf{x}) = \begin{bmatrix} \varphi_1(\mathbf{x}) \\ \varphi_2(\mathbf{x}) \\ \vdots \\ \varphi_K(\mathbf{x}) \end{bmatrix}, \qquad
\Phi(\mathbf{X}) = [\varphi(\mathbf{x}_1)\; \varphi(\mathbf{x}_2)\; \ldots\; \varphi(\mathbf{x}_N)]$$

  • Y = AΦ(X) + e
  • Replace X with Φ(X) in the earlier equations for the solution:

$$\mathbf{A} = \left(\Phi(\mathbf{X})\,\Phi(\mathbf{X})^T\right)^{-1}\Phi(\mathbf{X})\,\mathbf{Y}^T$$
SLIDE 43

Problem

  • Y = AΦ(X) + e; replace X with Φ(X) in the earlier equations for the solution:

$$\mathbf{A} = \left(\Phi(\mathbf{X})\,\Phi(\mathbf{X})^T\right)^{-1}\Phi(\mathbf{X})\,\mathbf{Y}^T$$

  • Φ(X) may be in a very high-dimensional space
  • The high-dimensional space (or the transform Φ(X)) may be unknown..
SLIDE 44

The regression is in high dimensions

  • Linear regression:

$$\hat{\mathbf{y}} = \sum_i \bar{\mathbf{y}}_i\,(\mathbf{x}_i^T\mathbf{x}), \qquad \bar{\mathbf{Y}} = \mathbf{Y}\mathbf{G}^{-1}$$

  • High-dimensional regression:

$$\mathbf{G} = \begin{bmatrix}
\varphi(\mathbf{x}_1)^T\varphi(\mathbf{x}_1) & \cdots & \varphi(\mathbf{x}_1)^T\varphi(\mathbf{x}_N)\\
\vdots & \ddots & \vdots\\
\varphi(\mathbf{x}_N)^T\varphi(\mathbf{x}_1) & \cdots & \varphi(\mathbf{x}_N)^T\varphi(\mathbf{x}_N)
\end{bmatrix}$$

$$\hat{\mathbf{y}} = \sum_i \bar{\mathbf{y}}_i\,\left(\varphi(\mathbf{x}_i)^T\varphi(\mathbf{x})\right), \qquad \bar{\mathbf{Y}} = \mathbf{Y}\mathbf{G}^{-1}$$
SLIDE 45

Doing it with Kernels

  • High-dimensional regression with kernels: replace every inner product by a kernel evaluation, $\varphi(\mathbf{x})^T\varphi(\mathbf{y}) = K(\mathbf{x}, \mathbf{y})$:

$$\mathbf{G} = \begin{bmatrix}
K(\mathbf{x}_1, \mathbf{x}_1) & \cdots & K(\mathbf{x}_1, \mathbf{x}_N)\\
\vdots & \ddots & \vdots\\
K(\mathbf{x}_N, \mathbf{x}_1) & \cdots & K(\mathbf{x}_N, \mathbf{x}_N)
\end{bmatrix}, \qquad \bar{\mathbf{Y}} = \mathbf{Y}\mathbf{G}^{-1}$$

$$\hat{\mathbf{y}} = \sum_i \bar{\mathbf{y}}_i\, K(\mathbf{x}_i, \mathbf{x})$$

  • Regression in Kernel Hilbert Space..
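A compact sketch of this Gram-matrix regression (the Gaussian kernel and the tiny diagonal loading, added only to keep G well conditioned, are my choices):

```python
import numpy as np

# Kernel regression in the slides' notation: ybar = G^{-1} y,
# yhat(x) = sum_i ybar_i K(x_i, x).
def k_gauss(A, B, h=0.3):
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * h * h))

def fit(X, y):
    G = k_gauss(X, X)                        # Gram matrix G_ij = K(x_i, x_j)
    return np.linalg.solve(G + 1e-8 * np.eye(len(X)), y)

def predict(X, ybar, xq):
    return k_gauss(xq, X) @ ybar             # sum_i ybar_i K(x_i, xq)

X = np.linspace(-3, 3, 15)[:, None]          # 1-D training inputs
y = np.sin(X[:, 0])
ybar = fit(X, y)
print(predict(X, ybar, np.array([[1.0]])), np.sin(1.0))
```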

SLIDE 46

A different way of finding nonlinear relationships: Locally linear regression

  • Previous discussion: the regression parameters are optimized over the entire training set; minimize

$$E = \sum_{\text{all } i} \|\mathbf{y}_i - \mathbf{A}^T\mathbf{x}_i - \mathbf{b}\|^2$$

  • A single global regression is estimated and applied to all future x
  • Alternative: local regression
    – Learn a regression that is specific to x
SLIDE 47

Being non-committal: Local Regression

  • Estimate the regression to be applied to any x using training instances near x:

$$\hat{\mathbf{y}} = \sum_{j \in \text{neighborhood}(\mathbf{x})} d(\mathbf{x}, \mathbf{x}_j)\,\mathbf{y}_j + \mathbf{e},
\qquad
E = \sum_{i \in \text{neighborhood}(\mathbf{x})} \|\mathbf{y}_i - \mathbf{A}^T\mathbf{x}_i - \mathbf{b}\|^2$$

  • The resultant regression has the above form
    – Note: this regression is specific to x
  • A separate regression must be learned for every x
SLIDE 48

Local Regression

  • But what is d()?
    – For linear regression, d() is an inner product
  • More generic form: choose d() as a function of the distance between x and xⱼ
  • If d() falls off rapidly with the distance between x and xⱼ, the “neighborhood” requirement can be relaxed:

$$\hat{\mathbf{y}} = \sum_{j \in \text{neighborhood}(\mathbf{x})} d(\mathbf{x}, \mathbf{x}_j)\,\mathbf{y}_j + \mathbf{e}
\quad\longrightarrow\quad
\hat{\mathbf{y}} = \sum_{\text{all } j} d(\mathbf{x}, \mathbf{x}_j)\,\mathbf{y}_j + \mathbf{e}$$
SLIDE 49

Kernel Regression: d() = K()

  • Typical kernel functions: Gaussian, Laplacian, other density functions
    – Must fall off rapidly with increasing distance between x and xⱼ
  • The regression is local to every x: local regression
  • Actually a non-parametric MAP estimator of y
    – But first.. MAP estimators..

$$\hat{\mathbf{y}} = \frac{\sum_i K_h(\mathbf{x} - \mathbf{x}_i)\,\mathbf{y}_i}{\sum_i K_h(\mathbf{x} - \mathbf{x}_i)}$$
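A short sketch of this estimator with a Gaussian $K_h$ (the bandwidth h is my choice); the division by $\sum_i K_h$ is the normalization in the formula above:

```python
import numpy as np

# Kernel regression: a kernel-weighted average of the training y_i.
def kernel_estimate(X, y, xq, h=0.3):
    w = np.exp(-np.sum((X - xq) ** 2, axis=-1) / (2 * h * h))  # K_h(x - x_i)
    return (w @ y) / w.sum()

X = np.linspace(0, 6, 120)[:, None]
y = np.sin(X[:, 0])
print(kernel_estimate(X, y, np.array([2.0])), np.sin(2.0))
```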

SLIDE 50

MAP Estimators

  • MAP (Maximum A Posteriori): find a “best guess” for y (statistically), given known x:

$$y = \arg\max_Y P(Y\,|\,\mathbf{x})$$

  • ML (Maximum Likelihood): find that value of y for which the statistical best guess of x would have been the observed x:

$$y = \arg\max_Y P(\mathbf{x}\,|\,Y)$$

  • MAP is simpler to visualize
SLIDE 51

MAP estimation: Gaussian PDF

[Scatter figure: X vs Y]

  • Assume X and Y are jointly Gaussian
  • The parameters of the Gaussian are learned from training data
SLIDE 52

Learning the parameters of the Gaussian

$$\mathbf{z} = \begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix}, \qquad
\mu_z = \frac{1}{N}\sum_{i=1}^{N} \mathbf{z}_i, \qquad
\mathbf{C}_z = \frac{1}{N}\sum_{i=1}^{N} (\mathbf{z}_i - \mu_z)(\mathbf{z}_i - \mu_z)^T
= \begin{bmatrix} \mathbf{C}_{XX} & \mathbf{C}_{XY} \\ \mathbf{C}_{YX} & \mathbf{C}_{YY} \end{bmatrix}$$
SLIDE 53

Learning the parameters of the Gaussian

$$\mathbf{z} = \begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix}, \qquad
\mu_z = \frac{1}{N}\sum_{i=1}^{N} \mathbf{z}_i, \qquad
\mathbf{C}_z = \frac{1}{N}\sum_{i=1}^{N} (\mathbf{z}_i - \mu_z)(\mathbf{z}_i - \mu_z)^T
= \begin{bmatrix} \mathbf{C}_{XX} & \mathbf{C}_{XY} \\ \mathbf{C}_{YX} & \mathbf{C}_{YY} \end{bmatrix}$$

The blocks are computed directly from the data, e.g.:

$$\mu_x = \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i, \qquad
\mathbf{C}_{XY} = \frac{1}{N}\sum_{i=1}^{N} (\mathbf{x}_i - \mu_x)(\mathbf{y}_i - \mu_y)^T$$
SLIDE 54

MAP estimation: Gaussian PDF

[Scatter figure: X vs Y]

  • Assume X and Y are jointly Gaussian
  • The parameters of the Gaussian are learned from training data
SLIDE 55

MAP Estimator for Gaussian RV

  • Assume X and Y are jointly Gaussian
  • The parameters of the Gaussian are learned from training data
  • Now we are given an X, but no Y. What is Y?

[Figure: level set of the Gaussian, with the given value X₀ marked]
SLIDE 56

MAP estimator for Gaussian RV

[Figure: the joint Gaussian, sliced at the observed value x₀]
SLIDE 57

MAP estimation: Gaussian PDF

[Scatter figure: X vs Y]
SLIDE 58

MAP estimation: The Gaussian at a particular value of X

[Figure: the conditional distribution of Y at X = x₀]
SLIDE 59

MAP estimation: The Gaussian at a particular value of X

[Figure: the conditional distribution of Y at X = x₀, with the most likely value marked]
SLIDE 60

MAP Estimation of a Gaussian RV

$$Y = \arg\max_y P(y\,|\,X = x_0) = \;?$$
SLIDE 61

MAP Estimation of a Gaussian RV

[Figure: the conditional distribution of Y at X = x₀]
SLIDE 62

MAP Estimation of a Gaussian RV

$$Y = \arg\max_y P(y\,|\,X = x_0)$$
SLIDE 63

So what is this value?

  • Clearly a line
  • Equation of the line:

$$\hat{y} = \mu_Y + C_{YX}\,C_{XX}^{-1}\,(x - \mu_X)$$

  • The scalar version is given; the vector version is identical:

$$\hat{\mathbf{y}} = \mu_Y + \mathbf{C}_{YX}\,\mathbf{C}_{XX}^{-1}\,(\mathbf{x} - \mu_X)$$

  • Derivation? Later in the program a bit
    – Note the similarity to regression
SLIDE 64

This is a multiple regression

  • This is the MAP estimate of y:

$$\hat{\mathbf{y}} = \mu_Y + \mathbf{C}_{YX}\,\mathbf{C}_{XX}^{-1}\,(\mathbf{x} - \mu_X), \qquad \mathbf{y} = \arg\max_Y P(Y\,|\,\mathbf{x})$$

  • What about the ML estimate of y?
    – argmax_Y P(x|Y)
  • Note: neither of these may be the regression line!
    – MAP estimation of y is the regression on Y for Gaussian RVs
    – But this is not the MAP estimation of the regression parameter
SLIDE 65

It's also a minimum mean-squared error estimate

  • General principle of MMSE estimation:
    – y is unknown, x is known
    – Must estimate y such that the expected squared error is minimized:

$$Err = E\!\left[\|\mathbf{y} - \hat{\mathbf{y}}\|^2 \,\middle|\, \mathbf{x}\right]$$

    – Minimize the above term
SLIDE 66

It's also a minimum mean-squared error estimate

  • Minimize the error:

$$Err = E\!\left[(\mathbf{y} - \hat{\mathbf{y}})^T(\mathbf{y} - \hat{\mathbf{y}}) \,\middle|\, \mathbf{x}\right]
= E[\mathbf{y}^T\mathbf{y}\,|\,\mathbf{x}] - 2\,\hat{\mathbf{y}}^T E[\mathbf{y}\,|\,\mathbf{x}] + \hat{\mathbf{y}}^T\hat{\mathbf{y}}$$

  • Differentiating and equating to 0:

$$\frac{d\,Err}{d\hat{\mathbf{y}}} = -2\,E[\mathbf{y}\,|\,\mathbf{x}] + 2\,\hat{\mathbf{y}} = 0
\quad\Rightarrow\quad
\hat{\mathbf{y}} = E[\mathbf{y}\,|\,\mathbf{x}]$$

The MMSE estimate is the mean of the distribution.
SLIDE 67

For the Gaussian: MAP = MMSE

  • The most likely value is also the MEAN value
  • Would be true of any symmetric distribution
SLIDE 68

MMSE estimates for mixture distributions

  • Let P(y|x) be a mixture density:

$$P(\mathbf{y}\,|\,\mathbf{x}) = \sum_k P(k)\,P(\mathbf{y}\,|\,k, \mathbf{x})$$

  • The MMSE estimate of y is given by

$$E[\mathbf{y}\,|\,\mathbf{x}] = \int \mathbf{y}\,P(\mathbf{y}\,|\,\mathbf{x})\,d\mathbf{y}
= \sum_k P(k)\int \mathbf{y}\,P(\mathbf{y}\,|\,k, \mathbf{x})\,d\mathbf{y}
= \sum_k P(k)\,E[\mathbf{y}\,|\,k, \mathbf{x}]$$

  • Just a weighted combination of the MMSE estimates from the component distributions
SLIDE 69

MMSE estimates from a Gaussian mixture

  • Let P(x, y) be a Gaussian mixture, with $\mathbf{z} = [\mathbf{x};\, \mathbf{y}]$:

$$P(\mathbf{x}, \mathbf{y}) = P(\mathbf{z}) = \sum_k P(k)\,N(\mathbf{z};\, \mu_k, \Sigma_k)$$

  • P(y|x) is also a Gaussian mixture:

$$P(\mathbf{y}\,|\,\mathbf{x}) = \frac{P(\mathbf{x}, \mathbf{y})}{P(\mathbf{x})}
= \sum_k \frac{P(k)\,P(\mathbf{x}\,|\,k)\,P(\mathbf{y}\,|\,\mathbf{x}, k)}{P(\mathbf{x})}
= \sum_k P(k\,|\,\mathbf{x})\,P(\mathbf{y}\,|\,\mathbf{x}, k)$$
SLIDE 70

MMSE estimates from a Gaussian mixture

  • Let P(y|x) be a Gaussian mixture: $P(\mathbf{y}\,|\,\mathbf{x}) = \sum_k P(k\,|\,\mathbf{x})\,P(\mathbf{y}\,|\,\mathbf{x}, k)$
  • Each component is a conditioned joint Gaussian:

$$P(\mathbf{x}, \mathbf{y}\,|\,k) = N\!\left(\begin{bmatrix}\mathbf{x}\\ \mathbf{y}\end{bmatrix};
\begin{bmatrix}\mu_{x,k}\\ \mu_{y,k}\end{bmatrix},
\begin{bmatrix}\mathbf{C}_{xx,k} & \mathbf{C}_{xy,k}\\ \mathbf{C}_{yx,k} & \mathbf{C}_{yy,k}\end{bmatrix}\right)$$

$$P(\mathbf{y}\,|\,\mathbf{x}, k) = N\!\left(\mathbf{y};\;
\mu_{y,k} + \mathbf{C}_{yx,k}\mathbf{C}_{xx,k}^{-1}(\mathbf{x} - \mu_{x,k}),\; \Theta_k\right)$$

$$P(\mathbf{y}\,|\,\mathbf{x}) = \sum_k P(k\,|\,\mathbf{x})\,
N\!\left(\mathbf{y};\; \mu_{y,k} + \mathbf{C}_{yx,k}\mathbf{C}_{xx,k}^{-1}(\mathbf{x} - \mu_{x,k}),\; \Theta_k\right)$$
SLIDE 71

MMSE estimates from a Gaussian mixture

  • P(y|x) is a mixture Gaussian density, so E[y|x] is also a mixture:

$$E[\mathbf{y}\,|\,\mathbf{x}] = \sum_k P(k\,|\,\mathbf{x})\,E[\mathbf{y}\,|\,\mathbf{x}, k]$$

$$E[\mathbf{y}\,|\,\mathbf{x}] = \sum_k P(k\,|\,\mathbf{x})
\left(\mu_{y,k} + \mathbf{C}_{yx,k}\mathbf{C}_{xx,k}^{-1}(\mathbf{x} - \mu_{x,k})\right)$$
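A sketch of this estimator, assuming the joint GMM on z = [x; y] has already been learned (the parameter layout and names are mine; dx is the dimensionality of x):

```python
import numpy as np
from scipy.stats import multivariate_normal

# MMSE under a joint GMM: E[y|x] = sum_k P(k|x) E[y|x, k].
def gmm_mmse(x, weights, means, covs, dx):
    post = np.array([w * multivariate_normal.pdf(x, m[:dx], C[:dx, :dx])
                     for w, m, C in zip(weights, means, covs)])
    post /= post.sum()                                  # P(k | x)
    y_hat = 0.0
    for p, m, C in zip(post, means, covs):
        # Per component: E[y|x,k] = mu_y + C_yx C_xx^{-1} (x - mu_x)
        cond = m[dx:] + C[dx:, :dx] @ np.linalg.solve(C[:dx, :dx], x - m[:dx])
        y_hat = y_hat + p * cond
    return y_hat
```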

SLIDE 72

MMSE estimates from a Gaussian mixture

  • A mixture of estimates from the individual Gaussians
SLIDE 73

Voice Morphing

  • Align training recordings from both speakers
    – Cepstral vector sequence
  • Learn a GMM on the joint vectors
  • Given speech from one speaker, find the MMSE estimate of the other
  • Synthesize from the estimated cepstra
SLIDE 74

MMSE with GMM: Voice Transformation

  • Festvox GMM transformation suite (Toda)

[Audio demo grid: transformations among the voices awb, bdl, jmk, slt]
SLIDE 75

A problem with regressions

  • The ML fit is sensitive
    – The error is squared
    – Small variations in the data → large variations in the weights
    – Outliers affect it adversely
  • Unstable
    – If the dimension of X >= the number of instances, (XXᵀ) is not invertible

$$\mathbf{A} = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{Y}^T$$
SLIDE 76

MAP estimation of weights

  • Model: y = aᵀX + e, with the weights drawn from a Gaussian prior:
    – P(a) = N(0, γ²I)
  • Maximum likelihood estimate:

$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}} \log P(\mathbf{y}\,|\,\mathbf{X}; \mathbf{a})$$

  • Maximum a posteriori estimate:

$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}} \log P(\mathbf{a}, \mathbf{y}\,|\,\mathbf{X})
= \arg\max_{\mathbf{a}} \log P(\mathbf{y}\,|\,\mathbf{a}, \mathbf{X}) + \log P(\mathbf{a})$$
SLIDE 77

MAP estimation of weights

  • Similar to the ML estimate, with an additional term:

$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}} \log P(\mathbf{y}\,|\,\mathbf{a}, \mathbf{X}) + \log P(\mathbf{a})$$

$$\log P(\mathbf{y}\,|\,\mathbf{a}, \mathbf{X}) = C - \frac{1}{2\sigma^2}\,(\mathbf{y} - \mathbf{a}^T\mathbf{X})(\mathbf{y} - \mathbf{a}^T\mathbf{X})^T$$

$$P(\mathbf{a}) = N(0, \gamma^2\mathbf{I}) \quad\Rightarrow\quad
\log P(\mathbf{a}) = C' - \log\gamma - 0.5\,\gamma^{-2}\,\|\mathbf{a}\|^2$$

$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}}\; C'' - \frac{1}{2\sigma^2}\,(\mathbf{y} - \mathbf{a}^T\mathbf{X})(\mathbf{y} - \mathbf{a}^T\mathbf{X})^T - 0.5\,\gamma^{-2}\,\|\mathbf{a}\|^2$$
SLIDE 78

MAP estimate of weights

  • Setting the derivative to zero (λ absorbs the variance ratio):

$$\frac{dL}{d\mathbf{a}} = 2\,\mathbf{X}\mathbf{X}^T\mathbf{a} - 2\,\mathbf{X}\mathbf{y}^T + 2\lambda\,\mathbf{a} = 0
\quad\Rightarrow\quad
\mathbf{a} = (\mathbf{X}\mathbf{X}^T + \lambda\mathbf{I})^{-1}\mathbf{X}\mathbf{y}^T$$

  • Equivalent to diagonal loading of the correlation matrix
    – Improves the condition number of the correlation matrix
      • Can be inverted with greater stability
    – Will not affect the estimation from well-conditioned data
    – Also called Tikhonov regularization
  • Dual form: ridge regression
  • This is a MAP estimate of the weights
    – Not to be confused with the MAP estimate of Y
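A one-line sketch of the diagonally loaded solution (lam plays the role of λ):

```python
import numpy as np

# Ridge / Tikhonov: (X X^T + lam I) is better conditioned than X X^T,
# so the solve is stable even for nearly collinear data.
def ridge_fit(X, y, lam=0.1):
    D = X.shape[0]                    # columns of X are training inputs
    return np.linalg.solve(X @ X.T + lam * np.eye(D), X @ y)
```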

  

SLIDE 79

MAP estimate priors

  • Left: Gaussian prior on the weights
  • Right: Laplacian prior
SLIDE 80

MAP estimation of weights with a Laplacian prior

  • Assume the weights are drawn from a Laplacian:
    – P(a) = λ⁻¹ exp(−λ⁻¹ ‖a‖₁)
  • Maximum a posteriori estimate:

$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}}\; C - (\mathbf{y} - \mathbf{a}^T\mathbf{X})(\mathbf{y} - \mathbf{a}^T\mathbf{X})^T - \lambda^{-1}\,\|\mathbf{a}\|_1$$

  • No closed-form solution
    – A quadratic-programming solution is required
  • Non-trivial
SLIDE 81

MAP estimation of weights with a Laplacian prior

  • Assume the weights are drawn from a Laplacian:
    – P(a) = λ⁻¹ exp(−λ⁻¹ ‖a‖₁)
  • Maximum a posteriori estimate:

$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}}\; C - (\mathbf{y} - \mathbf{a}^T\mathbf{X})(\mathbf{y} - \mathbf{a}^T\mathbf{X})^T - \lambda^{-1}\,\|\mathbf{a}\|_1$$

  • Identical to L1-regularized least-squares estimation
SLIDE 82

L1-regularized LSE

  • No closed-form solution
    – Quadratic-programming solutions are required

$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}}\; C - (\mathbf{y} - \mathbf{a}^T\mathbf{X})(\mathbf{y} - \mathbf{a}^T\mathbf{X})^T - \lambda^{-1}\,\|\mathbf{a}\|_1$$

  • Dual formulation:

$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}}\; C - (\mathbf{y} - \mathbf{a}^T\mathbf{X})(\mathbf{y} - \mathbf{a}^T\mathbf{X})^T
\quad\text{subject to}\quad \|\mathbf{a}\|_1 \le t$$

  • “LASSO” – Least absolute shrinkage and selection operator
SLIDE 83

LASSO Algorithms

  • Various convex optimization algorithms
  • LARS: Least angle regression
  • Pathwise coordinate descent..
  • Matlab code available from web
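For illustration, a minimal ISTA (iterative soft-thresholding) solver – a simple proximal-gradient alternative to the LARS and coordinate-descent methods named above:

```python
import numpy as np

# ISTA for min_a ||y - X a||^2 + lam * ||a||_1   (rows of X = samples).
def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam=0.1, iters=1000):
    L = 2 * np.linalg.norm(X, 2) ** 2    # Lipschitz constant of the gradient
    a = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = 2 * X.T @ (X @ a - y)
        a = soft_threshold(a - grad / L, lam / L)
    return a
```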

SLIDE 84

Regularized least squares

  • Regularization results in the selection of a suboptimal (in the least-squares sense) solution
    – One of the loci outside the center
  • Tikhonov regularization selects the shortest solution
  • L1 regularization selects the sparsest solution

Image credit: Tibshirani
SLIDE 85

LASSO and Compressive Sensing

  • Given Y and X, estimate the sparse weight vector a:  Y = Xa
  • LASSO:
    – X = explanatory variable
    – Y = dependent variable
    – a = weights of regression
  • CS:
    – X = measurement matrix
    – Y = measurement
    – a = data
SLIDE 86

An interesting problem: Predicting War!

  • Economists measure a number of social indicators for countries weekly
    – Happiness index
    – Hunger index
    – Freedom index
    – Twitter records
    – …
  • Question: Will there be a revolution or war next week?
SLIDE 87

An interesting problem: Predicting War!

  • Issues:
    – Dissatisfaction builds up – not an instantaneous phenomenon
      • Usually
    – War / rebellion builds up much faster
      • Often in hours
  • Important to predict
    – Preparedness for security
    – Economic impact
SLIDE 88

Predicting War

  • Given:
    – A sequence of economic indicators for each week
    – A sequence of unrest markers for each week
      • At the end of each week we know if war happened or not that week
  • Predict the probability of unrest next week
    – This could be a new unrest or the persistence of a current one

[Figure: week-by-week lattice of war/stability states (W/S) for wk1–wk8, with observations O1–O8]
SLIDE 89

Predicting Time Series

  • Need time-series models
  • HMMs – later in the course
