
Slide 1

11-755 Machine Learning for Signal Processing

Regression and Prediction

Class 15, 23 Oct 2012. Instructor: Bhiksha Raj

Slide 2

Matrix Identities

- The derivative of a scalar function w.r.t. a vector is a vector
- The derivative w.r.t. a matrix is a matrix

$$f(\mathbf{x}) = f([x_1 \;\; x_2 \;\; \cdots \;\; x_D])$$

$$df(\mathbf{x}) = \frac{\partial f}{\partial x_1}\,dx_1 + \frac{\partial f}{\partial x_2}\,dx_2 + \cdots + \frac{\partial f}{\partial x_D}\,dx_D$$

Slide 3

Matrix Identities

- The derivative of a scalar function w.r.t. a vector is a vector
- The derivative w.r.t. a matrix is a matrix

$$f(\mathbf{X}) = f\!\left(\begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1D}\\ x_{21} & x_{22} & \cdots & x_{2D}\\ \vdots & \vdots & \ddots & \vdots\\ x_{D1} & x_{D2} & \cdots & x_{DD} \end{bmatrix}\right)$$

$$df(\mathbf{X}) = \sum_{i,j} \frac{\partial f}{\partial x_{ij}}\,dx_{ij}$$

Slide 4

Matrix Identities

- The derivative of a vector function w.r.t. a vector is a matrix
- Note the transposition of order

$$F(\mathbf{x}) = [F_1(\mathbf{x}) \;\; F_2(\mathbf{x}) \;\; \cdots \;\; F_N(\mathbf{x})], \qquad \mathbf{x} = [x_1 \;\; x_2 \;\; \cdots \;\; x_D]$$

$$dF = [dF_1 \;\; dF_2 \;\; \cdots \;\; dF_N] = [dx_1 \;\; dx_2 \;\; \cdots \;\; dx_D] \begin{bmatrix} \frac{\partial F_1}{\partial x_1} & \frac{\partial F_2}{\partial x_1} & \cdots & \frac{\partial F_N}{\partial x_1}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial F_1}{\partial x_D} & \frac{\partial F_2}{\partial x_D} & \cdots & \frac{\partial F_N}{\partial x_D} \end{bmatrix} = d\mathbf{x}\,\frac{\partial F}{\partial \mathbf{x}}$$

Slide 5

Derivatives

- In general: differentiating an M×N function by a U×V argument results in an M×N×U×V tensor derivative
- For example, differentiating an N×1 function by a U×V argument gives an N×U×V tensor (or U×V×N, depending on the ordering convention)

Slide 6

Matrix derivative identities

- Some basic linear and quadratic identities (numerically checked in the sketch below):

$$\frac{d(\mathbf{X}\mathbf{a})}{d\mathbf{a}} = \mathbf{X}$$

(X is a matrix, a is a vector; the solution may also be $\mathbf{X}^T$, depending on the layout convention.)

$$d(\mathbf{A}\mathbf{X}) = \mathbf{A}\,(d\mathbf{X}); \qquad d(\mathbf{X}\mathbf{A}) = (d\mathbf{X})\,\mathbf{A}$$

(A is a matrix.)

$$\frac{d(\mathbf{a}^T\mathbf{X}\mathbf{a})}{d\mathbf{a}} = \mathbf{X}\mathbf{a} + \mathbf{X}^T\mathbf{a}$$

$$\frac{d\,\mathrm{trace}(\mathbf{X}\mathbf{A})}{d\mathbf{X}} = \mathbf{A}^T; \qquad \frac{d\,\mathrm{trace}(\mathbf{X}\mathbf{A}\mathbf{X}^T)}{d\mathbf{X}} = \mathbf{X}\mathbf{A}^T + \mathbf{X}\mathbf{A}$$

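These identities are easy to sanity-check numerically. Below is a minimal Python sketch (mine, not the slides') that verifies the quadratic identity by central finite differences.

```python
import numpy as np

# Check d(a' X a)/da = X a + X' a by central finite differences.
rng = np.random.default_rng(0)
D = 5
X = rng.standard_normal((D, D))
a = rng.standard_normal(D)

f = lambda v: v @ X @ v            # the scalar quadratic form a' X a
analytic = X @ a + X.T @ a         # the identity above

eps = 1e-6
numeric = np.array([(f(a + eps * e) - f(a - eps * e)) / (2 * eps)
                    for e in np.eye(D)])
print(np.allclose(analytic, numeric, atol=1e-5))   # True
```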
Slide 7

A Common Problem

- Can you spot the glitches?

Slide 8

How to fix this problem?

- “Glitches” in audio
- Must be detected. How?
- Then what? Glitches must be “fixed”:
  - Delete the glitch; this results in a “hole”
  - Fill in the hole. How?

Slide 9

Interpolation..


- “Extend” the curve on the left to “predict” the values in the “blank” region
  - Forward prediction
- Extend the blue curve on the right leftwards to predict the blank region
  - Backward prediction
- How?
  - Regression analysis..

Slide 10

Detecting the Glitch


- Regression-based reconstruction can be done anywhere
- The reconstructed value will not match the actual value
- A large reconstruction error identifies glitches

Slide 11

What is a regression

- Analyzing the relationship between variables
- Expressed in many forms
- Wikipedia: Linear regression, Simple regression, Ordinary least squares, Polynomial regression, General linear model, Generalized linear model, Discrete choice, Logistic regression, Multinomial logit, Mixed logit, Probit, Multinomial probit, …
- Generally a tool to predict variables

Slide 12

Regressions for prediction

- y = f(x; Θ) + e
- Different possibilities:
  - y is a scalar
    - y is real
    - y is categorical (classification)
  - y is a vector
  - x is a vector
    - x is a set of real-valued variables
    - x is a set of categorical variables
    - x is a combination of the two
  - f(·) is a linear or affine function
  - f(·) is a non-linear function
  - f(·) is a time-series model

Slide 13

A linear regression

- Assumption: the relationship between the variables is linear
  - A linear trend may be found relating x and y
- y = dependent variable
- x = explanatory variable
- Given x, y can be predicted as an affine function of x

Slide 14

An imaginary regression..

- http://pages.cs.wisc.edu/~kovar/hall.html:

“Check this shit out (Fig. 1). That's bonafide, 100%-real data, my friends. I took it myself over the course of two weeks. And this was not a leisurely two weeks, either; I busted my ass day and night in order to provide you with nothing but the best data possible. Now, let's look a bit more closely at this data, remembering that it is absolutely first-rate. Do you see the exponential dependence? I sure don't. I see a bunch of crap. Christ, this was such a waste of my time. Banking on my hopes that whoever grades this will just look at the pictures, I drew an exponential through my noise. I believe the apparent legitimacy is enhanced by the fact that I used a complicated computer program to make the fit. I understand this is the same process by which the top quark was discovered.”

Slide 15

Linear Regressions

- y = Ax + b + e
  - e = prediction error
- Given a “training” set of {x, y} values: estimate A and b
  - y1 = Ax1 + b + e1
  - y2 = Ax2 + b + e2
  - y3 = Ax3 + b + e3
  - …
- If A and b are well estimated, the prediction error will be small

Slide 16

Linear Regression to a scalar

- Rewrite:

$$y_1 = \mathbf{a}^T\mathbf{x}_1 + b + e_1, \quad y_2 = \mathbf{a}^T\mathbf{x}_2 + b + e_2, \quad y_3 = \mathbf{a}^T\mathbf{x}_3 + b + e_3, \;\ldots$$

- Define:

$$\mathbf{y} = [y_1 \;\; y_2 \;\; y_3 \;\; \cdots], \qquad \mathbf{X} = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \mathbf{x}_3 & \cdots\\ 1 & 1 & 1 & \cdots \end{bmatrix}, \qquad \mathbf{A} = \begin{bmatrix} \mathbf{a}\\ b \end{bmatrix}, \qquad \mathbf{e} = [e_1 \;\; e_2 \;\; e_3 \;\; \cdots]$$

- Then:

$$\mathbf{y} = \mathbf{A}^T\mathbf{X} + \mathbf{e}$$

Slide 17

Learning the parameters

- Given training data: several (x, y) pairs
- Can define a “divergence” $D(\mathbf{y}, \hat{\mathbf{y}})$, where $\hat{\mathbf{y}} = \mathbf{A}^T\mathbf{X}$ (assuming no error)
  - Measures how much $\hat{\mathbf{y}}$ differs from $\mathbf{y}$
  - Ideally, if the model is accurate, this should be small
- Estimate A (i.e. a and b) to minimize $D(\mathbf{y}, \hat{\mathbf{y}})$, where the data obey $\mathbf{y} = \mathbf{A}^T\mathbf{X} + \mathbf{e}$

Slide 18

The prediction error as divergence

- Define the divergence as the sum of the squared errors in predicting y:

$$D(\mathbf{y}, \hat{\mathbf{y}}) = E = e_1^2 + e_2^2 + e_3^2 + \cdots = (y_1 - \mathbf{a}^T\mathbf{x}_1 - b)^2 + (y_2 - \mathbf{a}^T\mathbf{x}_2 - b)^2 + (y_3 - \mathbf{a}^T\mathbf{x}_3 - b)^2 + \cdots$$

- In matrix form, with $\mathbf{y} = \mathbf{A}^T\mathbf{X} + \mathbf{e}$:

$$E = \|\mathbf{y} - \mathbf{A}^T\mathbf{X}\|^2$$

Slide 19

Prediction error as divergence

- y = aᵀx + e
  - e = prediction error
- Find the “slope” a such that the total squared length of the error lines is minimized

Slide 20

Solving a linear regression

- Minimize the squared error
- Differentiating w.r.t. A and equating to 0 (a NumPy sketch follows):

$$E = \|\mathbf{y} - \mathbf{A}^T\mathbf{X}\|^2 = (\mathbf{y} - \mathbf{A}^T\mathbf{X})(\mathbf{y} - \mathbf{A}^T\mathbf{X})^T = \mathbf{y}\mathbf{y}^T - 2\mathbf{y}\mathbf{X}^T\mathbf{A} + \mathbf{A}^T\mathbf{X}\mathbf{X}^T\mathbf{A}$$

$$\frac{dE}{d\mathbf{A}} = 2\mathbf{X}\mathbf{X}^T\mathbf{A} - 2\mathbf{X}\mathbf{y}^T = 0$$

$$\mathbf{A} = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{y}^T = \mathrm{pinv}(\mathbf{X}^T)\,\mathbf{y}^T$$

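As a concrete illustration, here is a minimal NumPy sketch of this closed-form solution (my code; the row of ones from Slide 16 absorbs the bias b, and names like `fit_linear` are illustrative).

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares fit of y ~ a'x + b.
    x: (D, N) training inputs; y: (N,) targets.
    Returns the extended weight vector A = [a; b]."""
    N = x.shape[1]
    X = np.vstack([x, np.ones((1, N))])   # append the row of ones
    # A = (X X')^{-1} X y'  ==  pinv(X') y'
    return np.linalg.pinv(X.T) @ y

# Tiny demo: recover a known affine map from noisy data.
rng = np.random.default_rng(1)
x = rng.standard_normal((3, 200))
y = np.array([2.0, -1.0, 0.5]) @ x + 0.7 + 0.01 * rng.standard_normal(200)
print(fit_linear(x, y))   # approx [ 2.  -1.   0.5  0.7]
```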
Slide 21

An Aside

- What happens if we minimize the perpendicular error instead?

Slide 22

Regression in multiple dimensions

- Also called multiple regression
- Equivalent of saying: $\mathbf{y}_1 = \mathbf{A}^T\mathbf{x}_1 + \mathbf{b} + \mathbf{e}_1$, $\mathbf{y}_2 = \mathbf{A}^T\mathbf{x}_2 + \mathbf{b} + \mathbf{e}_2$, $\mathbf{y}_3 = \mathbf{A}^T\mathbf{x}_3 + \mathbf{b} + \mathbf{e}_3$, … (each $\mathbf{y}_i$ is a vector)
- Fundamentally no different from N separate single regressions, one per output component:

$$y_{i1} = \mathbf{a}_1^T\mathbf{x}_i + b_1 + e_{i1}, \qquad y_{i2} = \mathbf{a}_2^T\mathbf{x}_i + b_2 + e_{i2}, \qquad y_{i3} = \mathbf{a}_3^T\mathbf{x}_i + b_3 + e_{i3}, \;\ldots$$

where $y_{ij}$ = jth component of vector $\mathbf{y}_i$, $\mathbf{a}_j$ = jth column of A, and $b_j$ = jth component of b
- But we can use the relationship between the ys to our benefit

Slide 23

Multiple Regression

- Define (the row of ones appended to X absorbs the bias b into the last row of A):

$$\mathbf{Y} = [\mathbf{y}_1 \;\; \mathbf{y}_2 \;\; \mathbf{y}_3 \;\; \cdots], \qquad \mathbf{X} = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \mathbf{x}_3 & \cdots\\ 1 & 1 & 1 & \cdots \end{bmatrix}, \qquad \mathbf{E} = [\mathbf{e}_1 \;\; \mathbf{e}_2 \;\; \mathbf{e}_3 \;\; \cdots]$$

$$\mathbf{Y} = \mathbf{A}^T\mathbf{X} + \mathbf{E}$$

$$DIV = \sum_i \|\mathbf{y}_i - \mathbf{A}^T\mathbf{x}_i - \mathbf{b}\|^2 = \mathrm{trace}\!\left((\mathbf{Y} - \mathbf{A}^T\mathbf{X})(\mathbf{Y} - \mathbf{A}^T\mathbf{X})^T\right)$$

- Differentiating and equating to 0:

$$\frac{d\,DIV}{d\mathbf{A}} = 2\mathbf{X}\mathbf{X}^T\mathbf{A} - 2\mathbf{X}\mathbf{Y}^T = 0 \;\Rightarrow\; \mathbf{A} = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{Y}^T = \mathrm{pinv}(\mathbf{X}^T)\,\mathbf{Y}^T$$

Slide 24

A Different Perspective

- y is a noisy reading of Aᵀx:

$$\mathbf{y} = \mathbf{A}^T\mathbf{x} + \mathbf{e}$$

- The error e is Gaussian:

$$\mathbf{e} \sim N(0, \sigma^2\mathbf{I})$$

- Estimate A from:

$$\mathbf{Y} = [\mathbf{y}_1 \;\; \mathbf{y}_2 \;\; \cdots \;\; \mathbf{y}_N], \qquad \mathbf{X} = [\mathbf{x}_1 \;\; \mathbf{x}_2 \;\; \cdots \;\; \mathbf{x}_N]$$

Slide 25

The Likelihood of the data

- Probability of observing a specific y, given x, for a particular matrix A:

$$P(\mathbf{y}\,|\,\mathbf{x}; \mathbf{A}) = N(\mathbf{A}^T\mathbf{x},\, \sigma^2\mathbf{I})$$

- Probability of the collection $\mathbf{Y} = [\mathbf{y}_1 \cdots \mathbf{y}_N]$, $\mathbf{X} = [\mathbf{x}_1 \cdots \mathbf{x}_N]$, assuming IID for convenience (not necessary):

$$P(\mathbf{Y}\,|\,\mathbf{X}; \mathbf{A}) = \prod_i N(\mathbf{y}_i;\, \mathbf{A}^T\mathbf{x}_i,\, \sigma^2\mathbf{I})$$

Slide 26

A Maximum Likelihood Estimate

- Maximizing the log probability is identical to minimizing the trace
- Identical to the least squares solution

$$P(\mathbf{Y}\,|\,\mathbf{X}) = \prod_i \frac{1}{(2\pi\sigma^2)^{D/2}} \exp\!\left(-\frac{\|\mathbf{y}_i - \mathbf{A}^T\mathbf{x}_i\|^2}{2\sigma^2}\right)$$

$$\log P(\mathbf{Y}\,|\,\mathbf{X}; \mathbf{A}) = C - \frac{1}{2\sigma^2}\sum_i \|\mathbf{y}_i - \mathbf{A}^T\mathbf{x}_i\|^2 = C' - \frac{1}{2\sigma^2}\,\mathrm{trace}\!\left((\mathbf{Y} - \mathbf{A}^T\mathbf{X})(\mathbf{Y} - \mathbf{A}^T\mathbf{X})^T\right)$$

$$\mathbf{A} = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{Y}^T = \mathrm{pinv}(\mathbf{X}^T)\,\mathbf{Y}^T$$

Slide 27

Predicting an output

- From a collection of training data, we have learned A
- Given x for a new instance, but not y, what is y?
- Simple solution:

$$\hat{\mathbf{y}} = \mathbf{A}^T\mathbf{x}$$

Slide 28

Applying it to our problem

- Prediction by regression
- Forward regression: $x_t = a_1 x_{t-1} + a_2 x_{t-2} + \cdots + a_K x_{t-K} + e_t$
- Backward regression: $x_t = b_1 x_{t+1} + b_2 x_{t+2} + \cdots + b_K x_{t+K} + e_t$

Slide 29

Applying it to our problem

- Forward prediction: stack the prediction equations for all predictable samples,

$$[x_{K+1} \;\; x_{K+2} \;\; \cdots \;\; x_t] = \mathbf{a}^T \begin{bmatrix} x_K & x_{K+1} & \cdots & x_{t-1}\\ x_{K-1} & x_K & \cdots & x_{t-2}\\ \vdots & \vdots & \ddots & \vdots\\ x_1 & x_2 & \cdots & x_{t-K} \end{bmatrix} + [e_{K+1} \;\; e_{K+2} \;\; \cdots \;\; e_t]$$

i.e. $\bar{\mathbf{x}}_t = \mathbf{a}^T\mathbf{X}_t + \mathbf{e}$, so that

$$\mathbf{a}^T = \bar{\mathbf{x}}_t\,\mathrm{pinv}(\mathbf{X}_t)$$

Slide 30

Applying it to our problem

- Backward prediction: each sample is predicted from the K samples that follow it; stacking those equations gives the same form,

$$\bar{\mathbf{x}}_t = \mathbf{b}^T\mathbf{X}_t + \mathbf{e}, \qquad \mathbf{b}^T = \bar{\mathbf{x}}_t\,\mathrm{pinv}(\mathbf{X}_t)$$

where the columns of $\mathbf{X}_t$ now hold the K future samples $x_{t+1}, \ldots, x_{t+K}$ for each predicted sample $x_t$. (A NumPy sketch of both estimators follows.)

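A possible NumPy sketch of both estimators (my construction of the stacked system; the backward predictor is simply the forward predictor run on the time-reversed signal).

```python
import numpy as np

def predictor_coeffs(x, K, backward=False):
    """Fit a K-tap linear predictor x_t ~ sum_k a_k x_{t-k}
    (or from future samples if backward=True) via the pseudo-inverse."""
    if backward:
        x = x[::-1]                      # backward = forward on the reversed signal
    # Row k of H holds the signal delayed by k+1 samples, so column t of H
    # is [x_{t-1}, ..., x_{t-K}] for the predicted sample x_t.
    H = np.array([x[K - k - 1 : len(x) - k - 1] for k in range(K)])
    return x[K:] @ np.linalg.pinv(H)     # a' = x_t' pinv(X_t)

# Example: an AR(2) signal should give taps close to [1.8, -0.9].
rng = np.random.default_rng(3)
x = np.zeros(5000)
for t in range(2, len(x)):
    x[t] = 1.8 * x[t - 1] - 0.9 * x[t - 2] + 0.01 * rng.standard_normal()
print(predictor_coeffs(x, 2))
```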
Slide 31

Finding the burst

- At each time:
  - Learn a “forward” predictor $\mathbf{a}_t$
  - Predict the next sample: $x_t^{est} = \sum_k a_{t,k}\,x_{t-k}$
  - Compute the error: $ferr_t = |x_t - x_t^{est}|^2$
- Learn a “backward” predictor and compute the backward error $berr_t$ the same way
- Compute the average prediction error over a window and threshold it (see the sketch below)

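A sketch of the detection loop in NumPy. The predictor order, window length and threshold are my illustrative choices; the slide leaves them open.

```python
import numpy as np

def detect_glitches(x, K=16, win=64, thresh=10.0):
    """Flag samples whose windowed squared prediction error is large.
    K, win and thresh are illustrative constants."""
    # Fit a forward predictor on the whole signal: a' = x_t' pinv(X_t).
    H = np.array([x[K - k - 1 : len(x) - k - 1] for k in range(K)])
    a = x[K:] @ np.linalg.pinv(H)
    ferr = np.zeros(len(x))
    ferr[K:] = (x[K:] - a @ H) ** 2                 # forward prediction error
    # Average the error over a window, then threshold.
    avg = np.convolve(ferr, np.ones(win) / win, mode="same")
    return avg > thresh * np.median(avg)
```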
Slide 32

Filling the hole

- Learn a “forward” predictor at the left edge of the “hole”
  - For each missing sample, predict the next sample: $x_t^{est} = \sum_k a_{t,k}\,x_{t-k}$
  - Use estimated samples if real samples are not available
- Learn a “backward” predictor at the other edge of the “hole”
  - For each missing sample, predict: $x_t^{est} = \sum_k b_{t,k}\,x_{t+k}$
  - Use estimated samples if real samples are not available
- Average the forward and backward predictions (a sketch follows)

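A minimal sketch of the fill-in procedure, assuming forward taps `a` and backward taps `b` have already been fit on the clean data on either side of the hole (helper and variable names are mine).

```python
import numpy as np

def fill_hole(x, lo, hi, a, b):
    """Interpolate x[lo:hi] given forward taps a (fit left of the hole)
    and backward taps b (fit right of it). Assumes at least K = len(a)
    clean samples on each side of the hole."""
    K = len(a)
    fwd = x.copy()
    for t in range(lo, hi):                    # left-to-right pass
        fwd[t] = a @ fwd[t - K : t][::-1]      # uses estimates once real data runs out
    bwd = x.copy()
    for t in range(hi - 1, lo - 1, -1):        # right-to-left pass
        bwd[t] = b @ bwd[t + 1 : t + K + 1]
    out = x.copy()
    out[lo:hi] = 0.5 * (fwd[lo:hi] + bwd[lo:hi])   # average the two predictions
    return out
```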
Slide 33

Reconstruction zoom in

[Figure: the distorted signal and the recovered signal, showing the reconstruction area, the interpolation result against the actual data, and the next glitch]

Slide 34

Incrementally learning the regression

- Can we learn A incrementally instead, as data comes in?
- The batch solution $\mathbf{A} = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{Y}^T$ requires knowledge of all (x, y) pairs
- The Widrow-Hoff rule (scalar prediction version, with learning rate $\eta$; a sketch follows):

$$\hat{\mathbf{a}}_{t+1} = \hat{\mathbf{a}}_t + \eta\,\mathbf{x}_t\,(y_t - \hat{y}_t), \qquad \hat{y}_t = \hat{\mathbf{a}}_t^T\mathbf{x}_t$$

- Note the structure: each update nudges the weights along the input, scaled by the scalar prediction error $y_t - \hat{y}_t$
- Can also be done in batch mode!

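A sketch of the rule in NumPy; the learning rate `eta` is my addition, since the garbled slide does not show how the update is scaled.

```python
import numpy as np

def lms(xs, ys, eta=0.01):
    """Widrow-Hoff / LMS: update the weights after every (x, y) pair.
    xs: (N, D) inputs; ys: (N,) targets."""
    a = np.zeros(xs.shape[1])
    for x, y in zip(xs, ys):
        y_hat = a @ x                    # current prediction
        a = a + eta * x * (y - y_hat)    # nudge weights along the error
    return a
```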
Slide 35

Predicting a value

- What are we doing exactly?

$$\hat{\mathbf{y}} = \mathbf{A}^T\mathbf{x} = \mathbf{Y}\mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{x}$$

- Let $\mathbf{C} = \mathbf{X}\mathbf{X}^T$ and define $\hat{\mathbf{x}} = \mathbf{C}^{-1/2}\mathbf{x}$, $\hat{\mathbf{x}}_i = \mathbf{C}^{-1/2}\mathbf{x}_i$: normalizing and rotating the space (the rotation is irrelevant). Then:

$$\hat{\mathbf{y}} = \mathbf{Y}\hat{\mathbf{X}}^T\hat{\mathbf{x}} = \sum_i \mathbf{y}_i\,(\hat{\mathbf{x}}_i^T\hat{\mathbf{x}})$$

- A weighted combination of inputs: each training output $\mathbf{y}_i$ is weighted by the inner product between the normalized inputs
Slide 36

Relationships are not always linear

- How do we model these?
- Multiple solutions

Slide 37

Non-linear regression

- $\mathbf{y} = \varphi(\mathbf{x}) + \mathbf{e}$, where

$$\varphi(\mathbf{x}) = [\varphi_1(\mathbf{x}) \;\; \varphi_2(\mathbf{x}) \;\; \cdots \;\; \varphi_K(\mathbf{x})], \qquad \Phi(\mathbf{X}) = [\varphi(\mathbf{x}_1) \;\; \varphi(\mathbf{x}_2) \;\; \cdots \;\; \varphi(\mathbf{x}_N)]$$

- $\mathbf{Y} = \mathbf{A}^T\Phi(\mathbf{X}) + \mathbf{e}$
- Replace X with Φ(X) in the earlier equations for the solution (a sketch follows):

$$\mathbf{A} = \left(\Phi(\mathbf{X})\,\Phi(\mathbf{X})^T\right)^{-1}\Phi(\mathbf{X})\,\mathbf{Y}^T$$

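A sketch with one concrete choice of basis, polynomials in a scalar x (the slide does not commit to any particular φ):

```python
import numpy as np

def poly_features(x, degree=3):
    """phi(x) = [1, x, x^2, ..., x^degree]; one column per sample."""
    return np.vstack([x ** d for d in range(degree + 1)])

def fit_nonlinear(x, y, degree=3):
    """A = (Phi(X) Phi(X)')^{-1} Phi(X) y' -- the linear solution on phi(X)."""
    Phi = poly_features(x, degree)
    return np.linalg.pinv(Phi.T) @ y

# Demo: recover a cubic trend from noisy samples.
rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 200)
y = 1.0 - 2.0 * x + 0.5 * x ** 3 + 0.05 * rng.standard_normal(x.size)
print(fit_nonlinear(x, y))   # approx [1, -2, 0, 0.5]
```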
Slide 38

What we are doing

- Finding the optimal combination of various functions
- Remind you of something?

Slide 39

Being non-committal: Local Regression

- Regression is usually trained over the entire data, and must apply everywhere:

$$\hat{\mathbf{y}} = \sum_i \mathbf{y}_i\,(\hat{\mathbf{x}}_i^T\hat{\mathbf{x}}) = \sum_i \mathbf{y}_i\,(\mathbf{x}_i^T\mathbf{C}^{-1}\mathbf{x}) + e$$

- How about doing this locally? For any x, weight each training point by its closeness $d(\mathbf{x}, \mathbf{x}_i)$ to x:

$$\hat{\mathbf{y}} = \sum_i d(\mathbf{x}, \mathbf{x}_i)\,\mathbf{y}_i + e$$

Slide 40

Local Regression

- The resulting regression is dependent on x!

$$\hat{\mathbf{y}}(\mathbf{x}) = \sum_i d(\mathbf{x}, \mathbf{x}_i)\,\mathbf{y}_i, \qquad e(\mathbf{x}) = \sum_i d(\mathbf{x}, \mathbf{x}_i)\,\|\mathbf{y}_i - \hat{\mathbf{y}}(\mathbf{x}_i)\|^2$$

- No closed form solution
- But can be highly accurate
- But what is d(x, x′)?

Slide 41

Kernel Regression

- Actually a non-parametric MAP estimator of y
  - Note: an estimator of y, not of the parameters of the regression
- The “kernel” is the kernel of a Parzen window (a sketch follows):

$$\hat{\mathbf{y}} = \frac{\sum_i K_h(\mathbf{x} - \mathbf{x}_i)\,\mathbf{y}_i}{\sum_i K_h(\mathbf{x} - \mathbf{x}_i)}$$

- But first.. MAP estimators..

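A sketch of this estimator with a Gaussian Parzen kernel; the bandwidth `h` is an illustrative choice.

```python
import numpy as np

def kernel_regress(x_query, X, Y, h=1.0):
    """Nadaraya-Watson: y_hat = sum_i K_h(x - x_i) y_i / sum_i K_h(x - x_i).
    X: (N, D) training inputs; Y: (N,) or (N, M) training outputs."""
    d2 = np.sum((X - x_query) ** 2, axis=1)   # squared distances to x
    w = np.exp(-0.5 * d2 / h ** 2)            # Gaussian Parzen kernel
    return w @ Y / w.sum()
```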
Slide 42

MAP Estimators

- MAP (Maximum A Posteriori): find a “best guess” for y (in a statistical sense), given that we know x:

$$\hat{y} = \underset{y}{\arg\max}\; P(y\,|\,x)$$

- ML (Maximum Likelihood): find that value of y for which the statistical best guess of x would have been the observed x:

$$\hat{y} = \underset{y}{\arg\max}\; P(x\,|\,y)$$

- MAP is simpler to visualize

Slide 43

MAP estimation: Gaussian PDF

- Assume x and y are jointly Gaussian
- The parameters of the Gaussian are learned from training data

[Figure: a joint Gaussian over the features F0 and F1]

Slide 44

Learning the parameters of the Gaussian

$$\mathbf{z} = \begin{bmatrix} \mathbf{x}\\ \mathbf{y} \end{bmatrix}, \qquad \mu_z = \frac{1}{N}\sum_{i=1}^{N}\mathbf{z}_i, \qquad \mathbf{C}_z = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{z}_i - \mu_z)(\mathbf{z}_i - \mu_z)^T$$

$$\mu_z = \begin{bmatrix} \mu_x\\ \mu_y \end{bmatrix}, \qquad \mathbf{C}_z = \begin{bmatrix} \mathbf{C}_{XX} & \mathbf{C}_{XY}\\ \mathbf{C}_{YX} & \mathbf{C}_{YY} \end{bmatrix}$$

Slide 45

Learning the parameters of the Gaussian

$$\mathbf{z} = \begin{bmatrix} \mathbf{x}\\ \mathbf{y} \end{bmatrix}, \qquad \mu_z = \frac{1}{N}\sum_{i=1}^{N}\mathbf{z}_i, \qquad \mathbf{C}_z = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{z}_i - \mu_z)(\mathbf{z}_i - \mu_z)^T$$

$$\mu_z = \begin{bmatrix} \mu_x\\ \mu_y \end{bmatrix}, \qquad \mathbf{C}_z = \begin{bmatrix} \mathbf{C}_{XX} & \mathbf{C}_{XY}\\ \mathbf{C}_{YX} & \mathbf{C}_{YY} \end{bmatrix}$$

with the components estimated as, e.g.,

$$\mu_x = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i, \qquad \mathbf{C}_{XY} = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \mu_x)(\mathbf{y}_i - \mu_y)^T$$

Slide 46

MAP estimation: Gaussian PDF

- Assume x and y are jointly Gaussian
- The parameters of the Gaussian are learned from training data

[Figure: a joint Gaussian over the features F0 and F1]

Slide 47

MAP Estimator for Gaussian RV

- Assume x and y are jointly Gaussian; the parameters of the Gaussian are learned from training data
- Now we are given an x, but no y
- What is y?

[Figure: level set of the Gaussian]

Slide 48

MAP estimator for Gaussian RV

[Figure: level set of the Gaussian, with the observed value x0 marked]

Slide 49

MAP estimation: Gaussian PDF


Slide 50

MAP estimation: The Gaussian at a particular value of X

[Figure: the Gaussian restricted to x = x0]

Slide 51

MAP estimation: The Gaussian at a particular value of X

[Figure: the Gaussian restricted to x = x0, with the most likely value marked]

Slide 52

MAP Estimation of a Gaussian RV

$$\hat{y} = \underset{y}{\arg\max}\; P(y\,|\,x) \;=\; ???$$

Slide 53

MAP Estimation of a Gaussian RV


Slide 54

MAP Estimation of a Gaussian RV


Slide 55

So what is this value?

- Clearly a line
- Equation of the line (scalar version):

$$\hat{y} = \mu_Y + C_{YX}\,C_{XX}^{-1}\,(x - \mu_X)$$

- The vector version is identical:

$$\hat{\mathbf{y}} = \mu_Y + \mathbf{C}_{YX}\,\mathbf{C}_{XX}^{-1}\,(\mathbf{x} - \mu_X)$$

- Derivation? A bit later in the program

Slide 56

This is a multiple regression

- This is the MAP estimate of y (NOT the regression parameter):

$$\hat{\mathbf{y}} = \mu_Y + \mathbf{C}_{YX}\,\mathbf{C}_{XX}^{-1}\,(\mathbf{x} - \mu_X)$$

- What about the ML estimate of y?
  - Again, the ML estimate of y, not of the regression parameter

Slide 57

It's also a minimum-mean-squared-error estimate

- General principle of MMSE estimation:
  - y is unknown, x is known
  - Must estimate y such that the expected squared error is minimized:

$$Err = E\!\left[\|\mathbf{y} - \hat{\mathbf{y}}\|^2 \,\big|\, \mathbf{x}\right]$$

- Minimize this term

Slide 58

It's also a minimum-mean-squared-error estimate

- Minimize the error:

$$Err = E\!\left[(\mathbf{y} - \hat{\mathbf{y}})^T(\mathbf{y} - \hat{\mathbf{y}}) \,\big|\, \mathbf{x}\right] = E[\mathbf{y}^T\mathbf{y}\,|\,\mathbf{x}] - 2\hat{\mathbf{y}}^T E[\mathbf{y}\,|\,\mathbf{x}] + \hat{\mathbf{y}}^T\hat{\mathbf{y}}$$

- Differentiating and equating to 0:

$$\frac{d\,Err}{d\hat{\mathbf{y}}} = -2E[\mathbf{y}\,|\,\mathbf{x}] + 2\hat{\mathbf{y}} = 0 \;\Rightarrow\; \hat{\mathbf{y}} = E[\mathbf{y}\,|\,\mathbf{x}]$$

- The MMSE estimate is the mean of the distribution

Slide 59

For the Gaussian: MAP = MMSE

- The most likely value is also the MEAN value
- This would be true of any symmetric distribution

Slide 60

MMSE estimates for mixture distributions

- Let P(y|x) be a mixture density:

$$P(\mathbf{y}\,|\,\mathbf{x}) = \sum_k P(k)\,P(\mathbf{y}\,|\,k, \mathbf{x})$$

- The MMSE estimate of y is given by:

$$E[\mathbf{y}\,|\,\mathbf{x}] = \int \mathbf{y} \sum_k P(k)\,P(\mathbf{y}\,|\,k, \mathbf{x})\,d\mathbf{y} = \sum_k P(k) \int \mathbf{y}\,P(\mathbf{y}\,|\,k, \mathbf{x})\,d\mathbf{y} = \sum_k P(k)\,E[\mathbf{y}\,|\,k, \mathbf{x}]$$

- Just a weighted combination of the MMSE estimates from the component distributions

Slide 61

MMSE estimates from a Gaussian mixture

23 Oct 2012 11755/18797 61

 Let P(y|x) is also a Gaussian mixture  Let P(x,y) be a Gaussian Mixture

) , ; ( ) ( ) ( ) (

k k k

N k P P P S  

 z z y x,        x y z ) ( ) , | ( ) | ( ) ( ) ( ) , , ( ) ( ) ( ) | ( x x y x x x y x x y x, P k P k P P P k P P P x y P

k k

 

  

k

k P k P P ) , | ( ) | ( ) | ( x y x x y

Slide 62

MMSE estimates from a Gaussian mixture

23 Oct 2012 11755/18797 62

 Let P(y|x) is a Gaussian Mixture

k

k P k P P ) , | ( ) | ( ) | ( x y x x y ) ], ; [ ]; ; ([ ) , , (

, , , , , ,

      

xx xy yx yy x y

x y x y

k k k k k k

C C C C N k P   ) ), ( ; ( ) , | (

, 1 , , ,

Q   

 x xx yx y

x y x y

k k k k

C C N k P  

Q   

 k k k k k

C C N k P P ) ), ( ; ( ) | ( ) | (

, 1 , , , x xx yx y

x y x x y  

Slide 63

MMSE estimates from a Gaussian mixture

- P(y|x) is a mixture density, so E[y|x] is also a mixture:

$$E[\mathbf{y}\,|\,\mathbf{x}] = \sum_k P(k\,|\,\mathbf{x})\,E[\mathbf{y}\,|\,k, \mathbf{x}] = \sum_k P(k\,|\,\mathbf{x})\left(\mu_{k,y} + \mathbf{C}_{k,yx}\,\mathbf{C}_{k,xx}^{-1}(\mathbf{x} - \mu_{k,x})\right)$$

Slide 64

MMSE estimates from a Gaussian mixture

- A mixture of estimates from the individual Gaussians (a sketch follows)

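A sketch of the estimator, assuming the GMM parameters (component weights, means split into x- and y-parts, and covariance blocks) have already been learned; all argument names are mine.

```python
import numpy as np

def gmm_mmse(x, w, mu_x, mu_y, C_xx, C_yx):
    """E[y|x] = sum_k P(k|x) (mu_{k,y} + C_{k,yx} C_{k,xx}^{-1} (x - mu_{k,x})).
    Each argument is a list with one entry per mixture component."""
    # Unnormalized log P(k) P(x|k) for each component.
    logp = np.array([
        np.log(wk) - 0.5 * (np.linalg.slogdet(Cxx)[1]
                            + (x - mx) @ np.linalg.solve(Cxx, x - mx))
        for wk, mx, Cxx in zip(w, mu_x, C_xx)])
    post = np.exp(logp - logp.max())
    post /= post.sum()                        # P(k|x)
    # Posterior-weighted sum of the per-component regressions.
    return sum(p * (my + Cyx @ np.linalg.solve(Cxx, x - mx))
               for p, my, Cyx, Cxx, mx in zip(post, mu_y, C_yx, C_xx, mu_x))
```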
Slide 65

MMSE with GMM: Voice Transformation

- The Festvox GMM transformation suite (Toda)

[Audio demos: transformations among the speakers awb, bdl, jmk, and slt]

Slide 66

Voice Morphing

- Align training recordings from both speakers
  - Cepstral vector sequence
- Learn a GMM on the joint vectors
- Given speech from one speaker, find the MMSE estimate of the other
- Synthesize from the cepstra

Slide 67

A problem with regressions

- The ML fit is sensitive:
  - The error is squared
  - Small variations in the data cause large variations in the weights
  - Outliers affect it adversely
- Unstable: if the dimension of x is greater than or equal to the number of instances, $\mathbf{X}\mathbf{X}^T$ is not invertible in

$$\mathbf{A} = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{Y}^T$$

Slide 68

MAP estimation of weights

- $\mathbf{y} = \mathbf{a}^T\mathbf{X} + \mathbf{e}$
- Assume the weights are drawn from a Gaussian: $P(\mathbf{a}) = N(0, \gamma^2\mathbf{I})$
- Maximum likelihood estimate:

$$\hat{\mathbf{a}} = \underset{\mathbf{a}}{\arg\max}\; \log P(\mathbf{y}\,|\,\mathbf{X}; \mathbf{a})$$

- Maximum a posteriori estimate:

$$\hat{\mathbf{a}} = \underset{\mathbf{a}}{\arg\max}\; \log P(\mathbf{a}\,|\,\mathbf{y}, \mathbf{X}) = \underset{\mathbf{a}}{\arg\max}\; \log P(\mathbf{y}\,|\,\mathbf{X}, \mathbf{a})\,P(\mathbf{a})$$

Slide 69

MAP estimation of weights

- Similar to the ML estimate, with an additional term:

$$\hat{\mathbf{a}} = \underset{\mathbf{a}}{\arg\max}\; \log P(\mathbf{y}\,|\,\mathbf{X}, \mathbf{a})\,P(\mathbf{a})$$

$$\log P(\mathbf{y}\,|\,\mathbf{X}, \mathbf{a}) = C - \frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{a}^T\mathbf{X})(\mathbf{y} - \mathbf{a}^T\mathbf{X})^T$$

- With $P(\mathbf{a}) = N(0, \gamma^2\mathbf{I})$: $\log P(\mathbf{a}) = C' - \log\gamma - 0.5\,\gamma^{-2}\|\mathbf{a}\|^2$

$$\hat{\mathbf{a}} = \underset{\mathbf{a}}{\arg\max}\left[C'' - \frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{a}^T\mathbf{X})(\mathbf{y} - \mathbf{a}^T\mathbf{X})^T - 0.5\,\gamma^{-2}\|\mathbf{a}\|^2\right]$$
slide-70
SLIDE 70

MAP estimate of weights

 Equivalent to diagonal loading of correlation matrix

 Improves condition number of correlation matrix

Can be inverted with greater stability

 Will not affect the estimation from well-conditioned data  Also called Tikhonov Regularization

Dual form: Ridge regression

 MAP estimate of weights

 Not to be confused with MAP estimate of Y

23 Oct 2012 11755/18797 70

 

2 2 2     a I yX XX a d dL

T T T

 

T T

XY I XX a

  • 1

  

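The diagonal loading is one line in NumPy; in this sketch λ is a free parameter.

```python
import numpy as np

def ridge(X, y, lam=1e-2):
    """a = (X X' + lam I)^{-1} X y'.
    X: (D, N), one column per training instance; y: (N,) targets."""
    D = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(D), X @ y)
```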
Slide 71

MAP estimate priors

- Left: Gaussian prior on the weights
- Right: Laplacian prior

Slide 72

MAP estimation of weights with Laplacian prior

- Assume the weights are drawn from a Laplacian:

$$P(\mathbf{a}) = \lambda^{-1}\exp\!\left(-\lambda^{-1}\|\mathbf{a}\|_1\right)$$

- Maximum a posteriori estimate:

$$\hat{\mathbf{a}} = \underset{\mathbf{a}}{\arg\max}\left[C - (\mathbf{y} - \mathbf{a}^T\mathbf{X})(\mathbf{y} - \mathbf{a}^T\mathbf{X})^T - \lambda^{-1}\|\mathbf{a}\|_1\right]$$

- No closed form solution
  - A quadratic programming solution is required
  - Non-trivial

Slide 73

MAP estimation of weights with Laplacian prior

- Assume the weights are drawn from a Laplacian: $P(\mathbf{a}) = \lambda^{-1}\exp(-\lambda^{-1}\|\mathbf{a}\|_1)$
- Maximum a posteriori estimate:

$$\hat{\mathbf{a}} = \underset{\mathbf{a}}{\arg\max}\left[C - (\mathbf{y} - \mathbf{a}^T\mathbf{X})(\mathbf{y} - \mathbf{a}^T\mathbf{X})^T - \lambda^{-1}\|\mathbf{a}\|_1\right]$$

- Identical to L1-regularized least-squares estimation

Slide 74

L1-regularized LSE

- No closed form solution
  - Quadratic programming solutions are required
- Dual formulation:

$$\hat{\mathbf{a}} = \underset{\mathbf{a}}{\arg\max}\left[C - (\mathbf{y} - \mathbf{a}^T\mathbf{X})(\mathbf{y} - \mathbf{a}^T\mathbf{X})^T\right] \quad \text{subject to} \quad \|\mathbf{a}\|_1 \le t$$

- “LASSO”: Least Absolute Shrinkage and Selection Operator

Slide 75

LASSO Algorithms

- Various convex optimization algorithms
- LARS: Least angle regression
- Pathwise coordinate descent..
- Matlab code available from the web (a simple iterative sketch follows)

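Beyond the Matlab packages the slide points to, a compact option is iterative soft-thresholding (ISTA), sketched below; this is my illustration, not one of the algorithms named above.

```python
import numpy as np

def lasso_ista(X, y, lam=0.1, n_iter=1000):
    """Minimize ||y - a'X||^2 + lam * ||a||_1 by iterative soft-thresholding.
    X: (D, N); y: (N,)."""
    a = np.zeros(X.shape[0])
    L = 2.0 * np.linalg.norm(X @ X.T, 2)   # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = 2.0 * X @ (X.T @ a - y)     # gradient of the squared error
        z = a - grad / L                   # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return a
```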
Slide 76

Regularized least squares

- Regularization results in the selection of a suboptimal (in the least-squares sense) solution: one of the loci outside the center
- Tikhonov regularization selects the shortest solution
- L1 regularization selects the sparsest solution

Image Credit: Tibshirani

Slide 77

LASSO and Compressive Sensing

- Given Y and X, estimate a sparse a in Y = Xa
- LASSO:
  - X = explanatory variable
  - Y = dependent variable
  - a = weights of the regression
- CS:
  - X = measurement matrix
  - Y = measurement
  - a = data

Slide 78

An interesting problem: Predicting War!

- Economists measure a number of social indicators for countries weekly:
  - Happiness index
  - Hunger index
  - Freedom index
  - Twitter records
  - …
- Question: will there be a revolution or war next week?

Slide 79

An interesting problem: Predicting War!

- Issues:
  - Dissatisfaction builds up; it is usually not an instantaneous phenomenon
  - War / rebellion builds up much faster, often in hours
- Important to predict:
  - Preparedness for security
  - Economic impact

Slide 80

Predicting War

- Given:
  - A sequence of economic indicators for each week
  - A sequence of unrest markers for each week: at the end of each week we know whether war happened that week
- Predict the probability of unrest next week
  - This could be a new unrest or the persistence of a current one

[Figure: weekly observations O1…O8, each pairing a war marker W with social indicators S for weeks wk1…wk8]

Slide 81

A Step Aside: Predicting Time Series

- An HMM is a model for time-series data
- How can we use it to predict the future?

Slide 82

Predicting with an HMM

- Given:
  - Observations O1..Ot
  - All HMM parameters (learned from some training data)
- Must estimate the future observation Ot+1
  - The estimate must consider the entire history (O1..Ot)
  - No knowledge of the actual state of the process at any time

Slide 83

Predicting with an HMM

- Given O1..Ot, compute P(O1..Ot, s) using the forward algorithm, which computes α(s, t):

$$\alpha(s, t) = P(O_1, O_2, \ldots, O_t,\; \mathrm{state}(t) = s)$$

$$P(s_t = s\,|\,O_{1..t}) = \frac{\alpha(s, t)}{\sum_{s'} \alpha(s', t)}$$

Slide 84

Predicting with an HMM

- Given $P(s_t = s\,|\,O_{1..t})$ for all s:

$$P(s_{t+1} = s\,|\,O_{1..t}) = \sum_{s'} P(s_t = s'\,|\,O_{1..t})\,P(s\,|\,s')$$

$$P(O_{t+1}, s\,|\,O_{1..t}) = P(O_{t+1}\,|\,s)\,P(s_{t+1} = s\,|\,O_{1..t})$$

$$P(O_{t+1}\,|\,O_{1..t}) = \sum_s P(O_{t+1}, s\,|\,O_{1..t}) = \sum_s P(O_{t+1}\,|\,s)\,P(s_{t+1} = s\,|\,O_{1..t})$$

- This is a mixture distribution

Slide 85

Predicting with an HMM

- $P(O_{t+1}\,|\,O_{1..t}) = \sum_s P(O_{t+1}\,|\,s)\,P(s_{t+1} = s\,|\,O_{1..t})$
- The MMSE estimate of $O_{t+1}$ given $O_{1..t}$:

$$E[O_{t+1}\,|\,O_{1..t}] = \sum_s P(s_{t+1} = s\,|\,O_{1..t})\,E[O\,|\,s]$$

- A weighted sum of the state means

Slide 86

Predicting with an HMM

- MMSE estimate: $\hat{O}_{t+1} = E[O_{t+1}\,|\,O_{1..t}] = \sum_s P(s_{t+1} = s\,|\,O_{1..t})\,E[O\,|\,s]$
- If $P(O\,|\,s)$ is a GMM with weights $w_{k,s}$ and means $\mu_{k,s}$: $E[O\,|\,s] = \sum_k w_{k,s}\,\mu_{k,s}$, so (a sketch follows):

$$\hat{O}_{t+1} = \sum_s P(s_{t+1} = s\,|\,O_{1..t}) \sum_k w_{k,s}\,\mu_{k,s} = \sum_s \frac{\sum_{s'} \alpha(s', t)\,P(s\,|\,s')}{\sum_{s'} \alpha(s', t)} \sum_k w_{k,s}\,\mu_{k,s}$$

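A sketch of the one-step predictor for an HMM with discrete emissions (Gaussian or GMM emissions would replace the emission table with densities and the means with the GMM means above; all arrays describe an assumed, already-trained model).

```python
import numpy as np

def predict_next(obs, pi, A, B, means):
    """One-step HMM prediction via the forward algorithm.
    pi: (S,) initial state probabilities; A[s', s] = P(s | s');
    B[s, o] = P(o | s); means[s] = E[O | s].
    Returns E[O_{t+1} | O_{1..t}]."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        alpha /= alpha.sum()           # scale for numerical stability
    p_next = alpha @ A                 # P(s_{t+1} = s | O_{1..t})
    return p_next @ means              # weighted sum of the state means
```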
Slide 87

Predicting War

- Train an HMM on z = [w, s] (war marker w, social indicators s)
- After the t-th week, predict the probability distribution for next week, $P(\mathbf{z}_{t+1}\,|\,\mathbf{z}_{1..t})$
- Marginalize out s (not known for next week):

$$P(w_{t+1}\,|\,\mathbf{z}_{1..t}) = \int P(w_{t+1}, s_{t+1}\,|\,\mathbf{z}_{1..t})\,ds_{t+1}$$

- War? $E[w\,|\,\mathbf{z}_{1..t}]$