Machine Learning for Signal Processing Regression and Prediction
Class 16. 29 Oct 2015 Instructor: Bhiksha Raj
A Common Problem

Can you spot the glitches?

How to fix this problem? Glitches can be excised from the signal, leaving a “blank” region that must then be filled in:
– Forward prediction: predict the blank region from the samples that precede it
– Backward prediction: predict the blank region from the samples that follow it
– Regression analysis..
[Figure: candidate reconstructions of the glitched region, marked NOT OK and OK]
A linear trend may be found relating x and y:
– y = dependent variable
– x = explanatory variable
– Given x, y can be predicted as an affine function of x (a quick sketch follows below)
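As a minimal illustration of fitting such an affine trend, here is a numpy sketch on synthetic data (the data and all parameter values are assumed for the example, not taken from the slides):

    import numpy as np

    # Synthetic 1-D data: y is roughly an affine function of x plus noise
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=100)
    y = 2.5 * x + 1.0 + rng.normal(scale=0.5, size=100)

    # Least-squares fit of y = a*x + b: stack x with a column of ones
    X = np.column_stack([x, np.ones_like(x)])
    (a, b), *_ = np.linalg.lstsq(X, y, rcond=None)
    print(a, b)  # close to the true 2.5 and 1.0

The column of ones absorbs the offset b; the same augmentation trick is used in the matrix derivation that follows.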
[Figure: scatter of data points in the X-Y plane]
That's bonafide, 100%-real data, my friends. I took it myself over the course of two weeks. And this was not a leisurely two weeks, either; I busted my ass day and night in order to provide you with nothing but the best data. Look closely at this data, remembering that it is absolutely first-rate. Do you see the exponential dependence? I sure don't. I see a bunch of crap. Christ, this was such a waste of my time. Banking on my hopes that whoever grades this will just look at the pictures, I drew an exponential through my noise. I believe the apparent legitimacy is enhanced by the fact that I used a complicated computer program to make the fit. I understand this is the same process by which the top quark was discovered.
Fitting the data: scalar y, vector x

y_1 = a^T x_1 + b + e_1
y_2 = a^T x_2 + b + e_2
y_3 = a^T x_3 + b + e_3

Define:

x̂_i = [x_i; 1],   â = [a; b],   y = [y_1 y_2 y_3 ...],   X̂ = [x̂_1 x̂_2 x̂_3 ...]

so that y_i = â^T x̂_i + e_i, or in matrix form y = â^T X̂ + e.

Assuming no error: y = â^T X̂, giving â^T = y pinv(X̂).

With error, minimize the total squared error:

E = e_1^2 + e_2^2 + e_3^2 = Σ_i (y_i - â^T x̂_i)^2 = ||y - â^T X̂||^2
Minimizing the squared error:

E = ||y - â^T X̂||^2 = (y - â^T X̂)(y - â^T X̂)^T

dE/dâ = 2 X̂ X̂^T â - 2 X̂ y^T = 0

â = (X̂ X̂^T)^{-1} X̂ y^T = pinv(X̂^T) y^T
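A compact numpy check of this closed form, on assumed toy data (vector inputs, scalar output):

    import numpy as np

    # Toy data: columns of X are the inputs x_i (assumed for illustration)
    rng = np.random.default_rng(1)
    N, D = 200, 3
    X = rng.normal(size=(D, N))
    a_true, b_true = np.array([1.0, -2.0, 0.5]), 0.3
    y = a_true @ X + b_true + rng.normal(scale=0.1, size=N)

    # Augment each x_i with a 1 to absorb the offset b
    Xhat = np.vstack([X, np.ones(N)])          # shape (D+1, N)

    # Closed form: a_hat = (Xhat Xhat^T)^{-1} Xhat y^T
    a_hat = np.linalg.solve(Xhat @ Xhat.T, Xhat @ y)
    print(a_hat)                               # first D entries near a, last near b

np.linalg.solve is used instead of forming the inverse explicitly; it solves the same normal equations.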
What if y is itself a vector? Each component of y could be regressed on x independently
– But we can use the relationship between ys to our benefit
y_1 = A^T x_1 + b + e_1
y_2 = A^T x_2 + b + e_2
y_3 = A^T x_3 + b + e_3

In general y_i = A^T x_i + b + e_i; written out per component:

y_i1 = a_1^T x_i + b_1 + e_i1
y_i2 = a_2^T x_i + b_2 + e_i2
y_i3 = a_3^T x_i + b_3 + e_i3

where y_ij = jth component of vector y_i, a_j = jth column of A, b_j = jth component of b
Define Y = [y_1 y_2 y_3 ...], X̂ = [x̂_1 x̂_2 x̂_3 ...] and Â = [A; b^T], so that Y = Â^T X̂ + E.

Total squared error:

E_tot = Σ_i ||y_i - Â^T x̂_i||^2 = ||Y - Â^T X̂||_F^2

Minimizing as before:

Â = (X̂ X̂^T)^{-1} X̂ Y^T
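The same closed form handles all output components at once; np.linalg.lstsq accepts a matrix right-hand side, which this assumed toy example exploits:

    import numpy as np

    # Vector-output regression: columns of Y are the y_i (toy, assumed data)
    rng = np.random.default_rng(2)
    N, D, M = 200, 3, 2                        # inputs in R^3, outputs in R^2
    X = rng.normal(size=(D, N))
    A_true = rng.normal(size=(D, M))
    b_true = np.array([0.5, -1.0])
    Y = A_true.T @ X + b_true[:, None] + 0.05 * rng.normal(size=(M, N))

    Xhat = np.vstack([X, np.ones(N)])          # augment with ones for b
    # Solves Xhat^T @ Ahat = Y^T for every output column simultaneously
    Ahat, *_ = np.linalg.lstsq(Xhat.T, Y.T, rcond=None)
    print(Ahat.shape)                          # (D+1, M): A stacked over b^T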
A probabilistic view: y = A^T x + e, with Gaussian noise

e ~ N(0, σ^2 I),   i.e.   P(y | x) = N(A^T x, σ^2 I)

(the offset b can be folded into A via the augmented x̂, as before). Collect the training data as

Y = [y_1 y_2 ... y_N],   X = [x_1 x_2 ... x_N],   y_i = A^T x_i + e_i

The likelihood of the training outputs is

P(Y | X) = Π_i (2πσ^2)^{-D/2} exp( -||y_i - A^T x_i||^2 / (2σ^2) )

Maximizing the log-likelihood with respect to A is identical to minimizing Σ_i ||y_i - A^T x_i||^2, so the maximum-likelihood estimate is exactly the least-squares solution:

Â = (X X^T)^{-1} X Y^T
Learning the predictor: model each sample as a linear combination of the K preceding samples,

x_t = Σ_{k=1..K} a_k x_{t-k} + e_t

For all predictable times at once, in matrix form:

[x_{K+1} x_{K+2} ... x_t] = a^T [ x_K      x_{K+1}  ...  x_{t-1}
                                  x_{K-1}  x_K      ...  x_{t-2}
                                  ..       ..       ..   ..
                                  x_1      x_2      ...  x_{t-K} ] + [e_{K+1} e_{K+2} ... e_t]

which is solved for a by least squares, exactly as before. A “backward” predictor b is learned the same way from the K following samples:

x_t = Σ_{k=1..K} b_k x_{t+k} + e_t
Glitch detection:
– Learn a “forward” predictor a_t
– At each time, predict the next sample: x_t^est = Σ_k a_{t,k} x_{t-k}
– Compute the forward error: ferr_t = |x_t - x_t^est|^2
– Learn a “backward” predictor and compute the backward error the same way
– Compute the average prediction error over a window and threshold it (a sketch follows below)
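A minimal numpy sketch of the forward half of this detector; the predictor order K and window length win are assumed values, and the slides' backward predictor would be handled symmetrically:

    import numpy as np

    def lpc_fit(x, K):
        # Least-squares fit of a K-tap forward linear predictor to signal x
        H = np.array([x[t-K:t][::-1] for t in range(K, len(x))])
        a, *_ = np.linalg.lstsq(H, x[K:], rcond=None)
        return a

    def glitch_score(x, K=16, win=256):
        # Windowed mean squared forward-prediction error; threshold to flag glitches
        a = lpc_fit(x, K)
        pred = np.array([a @ x[t-K:t][::-1] for t in range(K, len(x))])
        err = (x[K:] - pred) ** 2
        return np.convolve(err, np.ones(win) / win, mode="same")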
Filling in the gap:
– For each missing sample, predict it forward from past samples: x_t^est = Σ_k a_{t,k} x_{t-k}
– And backward from future samples: x_t^est = Σ_k b_{t,k} x_{t+k}
– Combine the two predictions (sketched below)
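One plausible combination, sketched in numpy: run the forward predictor left-to-right across the gap, the backward predictor right-to-left, and average. The slides do not pin down the exact combination rule here, so the 50/50 average is an assumption:

    import numpy as np

    def fill_gap(x, gap, a, b):
        # gap: contiguous, sorted array of indices of the missing samples
        K = len(a)
        fwd, bwd = x.copy(), x.copy()
        for t in gap:                          # forward pass, left to right
            fwd[t] = a @ fwd[t-K:t][::-1]      # uses x_{t-1} ... x_{t-K}
        for t in reversed(gap):                # backward pass, right to left
            bwd[t] = b @ bwd[t+1:t+1+K]        # uses x_{t+1} ... x_{t+K}
        y = x.copy()
        y[gap] = 0.5 * (fwd[gap] + bwd[gap])
        return y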
[Figure: distorted signal with the reconstruction area marked, the interpolation result, and the recovered signal compared against the actual data; the next glitch is handled the same way]
Computing a = (X X^T)^{-1} X y^T requires knowledge of all (x, y) pairs. An alternate way to look at this solution (scalar prediction version):
– For the explanation we are assuming no “b” (X is 0 mean)
– The explanation generalizes easily even otherwise
a = (X X^T)^{-1} X y^T. Let C = (1/N) X X^T and define the whitened inputs

X̂ = C^{-1/2} X   (N^{-1/2} C^{-1/2} is the whitening matrix for x)

Then

a = (N C)^{-1} X y^T = (1/N) C^{-1/2} (C^{-1/2} X) y^T = (1/N) C^{-1/2} Σ_i y_i x̂_i

– With due whitening and scaling, the regression weight is simply a sum of the training inputs, each weighted by its output y_i
What if the dimensionality of X is greater than the number of training instances?
– X X^T is not invertible
– The problem is underdetermined: many choices of a satisfy y_i = a^T x_i exactly
In the underdetermined case we can pick the minimum-norm solution:

a = X (X^T X)^{-1} y^T

where G = X^T X is the Gram matrix of inner products between training inputs:

G = [ x_1^T x_1   x_1^T x_2   ...   x_1^T x_N
      x_2^T x_1   x_2^T x_2   ...   x_2^T x_N
      ..          ..          ..    ..
      x_N^T x_1   x_N^T x_2   ...   x_N^T x_N ]

Writing w = G^{-1} y^T, the solution is again a weighted sum of training inputs, a = Σ_i w_i x_i, and the prediction for a new x is

y^est = a^T x = Σ_i w_i x_i^T x

– The normalization is done via the Gram matrix
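A short numpy sketch of the minimum-norm solution on assumed toy data with more dimensions than instances:

    import numpy as np

    rng = np.random.default_rng(3)
    D, N = 50, 10
    X = rng.normal(size=(D, N))                # columns are training inputs
    y = rng.normal(size=N)                     # scalar targets

    G = X.T @ X                                # N x N Gram matrix
    w = np.linalg.solve(G, y)                  # w = G^{-1} y
    a = X @ w                                  # minimum-norm weights a = sum_i w_i x_i

    print(np.allclose(a @ X, y))               # True: fits the training data exactly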
What if the relationship between x and y is not linear? Map x through a set of nonlinear basis functions,

φ(x) = [φ_1(x) φ_2(x) ... φ_K(x)]^T

and regress linearly in feature space: Y = A^T φ(X) + e. Replace X with Φ(X) = [φ(x_1) φ(x_2) ... φ(x_N)] in the earlier equations.

φ(X) may live in a very high-dimensional space. The high-dimensional space (or the transform φ() itself) may even be unknown; but, as the Gram-matrix form shows, only inner products between feature vectors are ever needed.
Applying the minimum-norm solution in feature space:

a = Φ(X) G^{-1} y^T,   G_ij = φ(x_i)^T φ(x_j)

Define the kernel function K(x_i, x_j) = φ(x_i)^T φ(x_j). Then

G = [ K(x_1, x_1)   K(x_1, x_2)   ...   K(x_1, x_N)
      K(x_2, x_1)   K(x_2, x_2)   ...   K(x_2, x_N)
      ..            ..            ..    ..
      K(x_N, x_1)   K(x_N, x_2)   ...   K(x_N, x_N) ]

and, with w = G^{-1} y^T, the prediction for a new x never requires φ explicitly:

y^est = Σ_i w_i K(x_i, x)
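A compact kernel-regression sketch in numpy; the RBF kernel, its width gamma, the toy data, and the tiny ridge term added for numerical stability are all assumptions of the example, not part of the slides:

    import numpy as np

    def rbf_kernel(A, B, gamma=1.0):
        # Gaussian kernel matrix between the columns of A and B
        d2 = ((A[:, :, None] - B[:, None, :]) ** 2).sum(axis=0)
        return np.exp(-gamma * d2)

    rng = np.random.default_rng(4)
    X = rng.uniform(-3, 3, size=(1, 40))       # toy 1-D inputs (columns)
    y = np.sin(X[0]) + 0.1 * rng.normal(size=40)

    G = rbf_kernel(X, X)
    w = np.linalg.solve(G + 1e-6 * np.eye(40), y)   # w = G^{-1} y, stabilized

    Xq = np.linspace(-3, 3, 200)[None, :]
    y_est = rbf_kernel(Xq, X) @ w              # y^est(x) = sum_i w_i K(x_i, x)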
Regression need not be global: we can estimate a fit specific to each query point x, weighting every training instance by its closeness to x:

â(x) = argmin_a Σ_{j ∈ neighborhood(x)} d(x, x_j) (y_j - a^T x_j)^2

– Note: this regression is specific to x
– For linear regression d() is an inner product
In its simplest form, predict y directly as a normalized, weighted combination of the neighbors' outputs:

y^est(x) = Σ_{j ∈ neighborhood(x)} w(x, x_j) y_j / Σ_{all j} w(x, x_j)

– The weighting function w(x, x_j) must fall off rapidly with increasing distance between x and x_j (see the sketch below)
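A minimal sketch of such a weighted estimate, using Gaussian weights over all training points (the data and the width gamma are assumed):

    import numpy as np

    def weighted_estimate(x, X, y, gamma=2.0):
        # Gaussian weights fall off rapidly with squared distance from x
        w = np.exp(-gamma * ((X - x[:, None]) ** 2).sum(axis=0))
        return (w @ y) / w.sum()

    rng = np.random.default_rng(5)
    X = rng.normal(size=(2, 100))              # columns are training inputs
    y = X[0] ** 2 + X[1] + 0.1 * rng.normal(size=100)
    print(weighted_estimate(np.array([0.5, -0.2]), X, y))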
– But first.. MAP estimators..
– MAP estimate: y = argmax_y P(y | x)
– ML estimate: y = argmax_y P(x | y)
[Figure: training data plotted in the (F0, F1) plane]

Assume X and Y are jointly Gaussian. The parameters of the Gaussian are learned from training data.
Stack the variables as z = [x; y] and estimate the Gaussian's parameters from the N training pairs:

μ_z = (1/N) Σ_{i=1..N} z_i

C_zz = (1/N) Σ_{i=1..N} (z_i - μ_z)(z_i - μ_z)^T

The mean and covariance inherit a block structure:

μ_z = [μ_X; μ_Y],   C_zz = [ C_XX  C_XY
                             C_YX  C_YY ]

with, for example, μ_X = (1/N) Σ_i x_i and C_XY = (1/N) Σ_i (x_i - μ_X)(y_i - μ_Y)^T.
Assume X and Y are jointly Gaussian, with the parameters learned from training data. Now we are given an X, but no Y. What is Y? [Figure: level set of the Gaussian]
[Figure sequence: slicing the joint Gaussian at X = x0; the conditional distribution of y given x0 is itself Gaussian, and its peak marks the most likely value of y]

Y = argmax_y P(y | x0)
For jointly Gaussian x and y this argmax has a closed form:

ŷ = argmax_y P(y | x) = μ_Y + C_YX C_XX^{-1} (x - μ_X)

– This is y = argmax_y P(y|x) (MAP), not argmax_y P(x|y) (ML)
– MAP estimation of y is the regression of Y on X for Gaussian RVs
– But this is not the MAP estimation of the regression parameter (that comes later)
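A small numpy sketch: estimate the joint Gaussian from assumed paired training data, then apply the conditional-mean formula:

    import numpy as np

    def gaussian_map(x, mu_x, mu_y, Cxx, Cyx):
        # MAP (= conditional mean) of y given x for jointly Gaussian (x, y)
        return mu_y + Cyx @ np.linalg.solve(Cxx, x - mu_x)

    rng = np.random.default_rng(6)
    Xtr = rng.normal(size=(2, 500))            # toy 2-D x, 1-D y
    Ytr = np.array([[1.0, -0.5]]) @ Xtr + 0.1 * rng.normal(size=(1, 500))
    Z = np.vstack([Xtr, Ytr])                  # stacked z = [x; y]
    mu, C = Z.mean(axis=1), np.cov(Z)

    y_hat = gaussian_map(np.array([0.3, 0.7]), mu[:2], mu[2:],
                         C[:2, :2], C[2:, :2])
    print(y_hat)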
ŷ = μ_Y + C_YX C_XX^{-1} (x - μ_X). Factoring C_YX C_XX^{-1} = (C_YX C_XX^{-1/2}) C_XX^{-1/2} shows that the estimator effectively regresses on whitened inputs: it has the same structure as the least-squares weight computed from whitened training instances (a = Σ_i y_i x̂_i, seen earlier).
A different criterion: the minimum mean squared error (MMSE) estimate,

ŷ_MMSE = argmin_ŷ E[ ||y - ŷ||^2 | x ] = E[y | x]

The MMSE estimate is the mean of the distribution P(y | x).
For a Gaussian the most likely value is also the mean value, so the MAP and MMSE estimates coincide.
Let P(y|x) be a mixture density. The MMSE estimate of y is then

E[y | x] = Σ_k P(k | x) E[y | x, k]

Just a weighted combination of the MMSE estimates from the individual components.
Let P(x, y) be a Gaussian mixture:

P(x, y) = Σ_k P(k) N([x; y]; μ_k, C_k)

Then P(y|x) is also a Gaussian mixture.
Each component k has block parameters

μ_k = [μ_{x,k}; μ_{y,k}],   C_k = [ C_{xx,k}  C_{xy,k}
                                    C_{yx,k}  C_{yy,k} ]

and conditioning component k on x gives the component mean

E[y | x, k] = μ_{y,k} + C_{yx,k} C_{xx,k}^{-1} (x - μ_{x,k})
P(y|x) is a mixture Gaussian density, so E[y|x] is also a mixture:

E[y | x] = Σ_k P(k | x) ( μ_{y,k} + C_{yx,k} C_{xx,k}^{-1} (x - μ_{x,k}) )

A mixture of estimates from the individual Gaussians.
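A sketch of this mixture estimate in numpy/scipy, with assumed parameter conventions (mixture weights, stacked means over z = [x; y], and block covariances):

    import numpy as np
    from scipy.stats import multivariate_normal

    def gmm_mmse(x, weights, mus, covs, dx):
        # E[y|x] for a Gaussian mixture over z = [x; y]; dx = dimension of x
        num, den = 0.0, 0.0
        for pk, mu, C in zip(weights, mus, covs):
            mux, muy = mu[:dx], mu[dx:]
            Cxx, Cyx = C[:dx, :dx], C[dx:, :dx]
            px = pk * multivariate_normal.pdf(x, mean=mux, cov=Cxx)  # prop. to P(k|x)
            ey = muy + Cyx @ np.linalg.solve(Cxx, x - mux)           # E[y|x,k]
            num, den = num + px * ey, den + px
        return num / den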
– Example: cepstral vector sequences (speech features)
Problems with standard least-squares regression:
– The error is squared: small variations in the data can produce large variations in the weights
– Outliers affect it adversely
– If the dimension of X ≥ the number of instances, X X^T is not invertible
Recall the least-squares solution: Â = (X X^T)^{-1} X Y^T, i.e. Â = C_XX^{-1} C_XY in correlation form.
MAP estimation of the weights: in y = a^T X + e, treat a itself as a random variable, place a prior P(a) on it, and estimate a = argmax_a P(a | X, y).
Assume a Gaussian prior on the weights: P(a) = N(0, σ_a^2 I), so log P(a) = C - log σ_a - 0.5 σ_a^{-2} ||a||^2.

Maximizing log P(y | X, a) + log P(a) adds a penalty on ||a||^2 to the squared error and gives

â = (X X^T + λ I)^{-1} X y^T,   λ = σ^2 / σ_a^2

– Improves the condition number of the correlation matrix
– Will not affect the estimation from well-conditioned data
– Also called Tikhonov regularization (ridge regression)
– Not to be confused with the MAP estimate of Y
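The closed form is a one-liner in numpy; a sketch on assumed ill-conditioned toy data:

    import numpy as np

    def ridge(X, y, lam):
        # Tikhonov-regularized least squares: (X X^T + lam*I)^{-1} X y^T
        return np.linalg.solve(X @ X.T + lam * np.eye(X.shape[0]), X @ y)

    rng = np.random.default_rng(7)
    X = rng.normal(size=(20, 10))              # D > N: plain least squares fails
    y = rng.normal(size=10)
    print(ridge(X, y, lam=0.1).shape)          # well-defined despite D > N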
Instead of the L2 penalty, penalize the L1 norm of the weights:

â = argmin_a ||y - a^T X||^2 + λ ||a||_1

or, equivalently, the constrained form

â = argmin_a ||y - a^T X||^2   subject to   ||a||_1 ≤ t

This is the LASSO: the L1 term drives many weights to exactly zero.
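There is no closed form for the L1-penalized problem; one simple solver is iterative soft-thresholding (ISTA), sketched here on assumed toy data (a generic solver choice, not necessarily what the slides use):

    import numpy as np

    def lasso_ista(X, y, lam, iters=500):
        # Minimize ||y - a^T X||^2 + lam*||a||_1 by proximal gradient descent
        a = np.zeros(X.shape[0])
        step = 0.5 / np.linalg.norm(X @ X.T, 2)          # safe step size
        for _ in range(iters):
            grad = 2 * X @ (X.T @ a - y)                 # gradient of squared error
            z = a - step * grad
            a = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
        return a

    rng = np.random.default_rng(8)
    X = rng.normal(size=(50, 30))
    a_true = np.zeros(50)
    a_true[[3, 17, 41]] = [1.0, -2.0, 1.5]               # sparse ground truth
    y = X.T @ a_true + 0.01 * rng.normal(size=30)
    a_hat = lasso_ista(X, y, lam=0.1)                    # most entries exactly zero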
– The L1 ball has corners: the constrained optimum typically lands on one of these loci away from the center, where some coefficients are exactly zero
Image credit: Tibshirani
The same mathematics reappears with the roles renamed. In regression:
– X = explanatory variable
– Y = dependent variable
– a = weights of regression
In sparse recovery (compressive sensing):
– X = measurement matrix
– Y = measurement
– a = data
[Figure: the linear system Y = X a]