Machine Learning for Signal Processing: Regression and Prediction
Class 14. 17 Oct 2013. Instructor: Bhiksha Raj
Matrix Identities

For a scalar function of a vector, $f(\mathbf{x}) = f(x_1, x_2, \ldots, x_D)$, the differential is the sum of the partial differentials:

$$df(\mathbf{x}) = \frac{\partial f}{\partial x_1}\,dx_1 + \frac{\partial f}{\partial x_2}\,dx_2 + \cdots + \frac{\partial f}{\partial x_D}\,dx_D$$

i.e. $df = \nabla f\, d\mathbf{x}$, the $1 \times D$ gradient row times the column of differentials.
For a scalar function of a matrix, $f(\mathbf{X}) = f(x_{11}, x_{12}, \ldots, x_{1D}, x_{21}, x_{22}, \ldots, x_{DD})$, the same rule applies entry by entry:

$$df(\mathbf{X}) = \sum_i \sum_j \frac{\partial f}{\partial x_{ij}}\, dx_{ij}$$
For a vector function of a vector, $\mathbf{F}(\mathbf{x}) = [F_1(\mathbf{x})\ F_2(\mathbf{x})\ \cdots\ F_N(\mathbf{x})]^\top$ with $\mathbf{x} = [x_1\ \cdots\ x_D]^\top$, each component obeys the scalar rule, so

$$d\mathbf{F} = \begin{bmatrix} dF_1 \\ \vdots \\ dF_N \end{bmatrix} = \begin{bmatrix} \frac{\partial F_1}{\partial x_1} & \cdots & \frac{\partial F_1}{\partial x_D} \\ \vdots & \ddots & \vdots \\ \frac{\partial F_N}{\partial x_1} & \cdots & \frac{\partial F_N}{\partial x_D} \end{bmatrix} \begin{bmatrix} dx_1 \\ \vdots \\ dx_D \end{bmatrix}$$

The $N \times D$ matrix of partials is the Jacobian of $\mathbf{F}$.
The derivative of an N×1 vector function with respect to a U×V matrix is an N×U×V tensor (equivalently arranged as U×V×N).
Some useful identities (X is a matrix, a is a vector, A is a constant matrix):

– $d(\mathbf{X}\mathbf{a}) = (d\mathbf{X})\mathbf{a} + \mathbf{X}(d\mathbf{a})$; the derivative of $\mathbf{X}\mathbf{a}$ with respect to $\mathbf{a}$ is $\mathbf{X}$ (the solution may also be written $\mathbf{X}^\top$, depending on the layout convention)
– $d(\mathbf{A}\mathbf{X}) = \mathbf{A}(d\mathbf{X})$;  $d(\mathbf{X}\mathbf{A}) = (d\mathbf{X})\mathbf{A}$
– $d(\mathbf{a}^\top\mathbf{X}\mathbf{a}) = \mathbf{a}^\top(d\mathbf{X})\mathbf{a}$, so $\partial(\mathbf{a}^\top\mathbf{X}\mathbf{a})/\partial\mathbf{X} = \mathbf{a}\mathbf{a}^\top$
– $d\,\mathrm{trace}(\mathbf{X}\mathbf{A}\mathbf{A}^\top\mathbf{X}^\top) = 2\,\mathrm{trace}(\mathbf{A}\mathbf{A}^\top\mathbf{X}^\top d\mathbf{X})$, so $\partial\,\mathrm{trace}(\mathbf{X}\mathbf{A}\mathbf{A}^\top\mathbf{X}^\top)/\partial\mathbf{X} = 2\,\mathbf{X}\mathbf{A}\mathbf{A}^\top$
A common signal problem: a signal has a "blank" (missing) region that must be filled in. The missing samples can be estimated from the rest of the signal by:
– Forward prediction: predict the blank region from the samples that precede it
– Backward prediction: predict it from the samples that follow it
– Regression analysis..
[Figure: examples labeled "NOT OK" and "OK"]
– A linear trend may be found relating x and y
– y = dependent variable
– x = explanatory variable
– Given x, y can be predicted as an affine function of x
[Scatter plot: X vs. Y]

A cautionary quote to go with the plot: "That's bonafide, 100%-real data, my friends. I took it myself over the course of two weeks. And this was not a leisurely two weeks, either; I busted my ass day and night in order to provide you with nothing but the best data possible. Now, let's look a bit more closely at this data, remembering that it is absolutely first-rate. Do you see the exponential dependence? I sure don't. I see a bunch of crap. Christ, this was such a waste of my time. Banking on my hopes that whoever grades this will just look at the pictures, I drew an exponential through my noise. I believe the apparent legitimacy is enhanced by the fact that I used a complicated computer program to make the fit. I understand this is the same process by which the top quark was discovered."
Simple linear regression with three training pairs:

$$y_1 = \mathbf{a}^\top\mathbf{x}_1 + b + e_1, \qquad y_2 = \mathbf{a}^\top\mathbf{x}_2 + b + e_2, \qquad y_3 = \mathbf{a}^\top\mathbf{x}_3 + b + e_3$$

Define:

$$\mathbf{y} = [y_1\ y_2\ y_3], \quad \mathbf{e} = [e_1\ e_2\ e_3], \quad \hat{\mathbf{X}} = \begin{bmatrix}\mathbf{x}_1 & \mathbf{x}_2 & \mathbf{x}_3\\ 1 & 1 & 1\end{bmatrix}, \quad \hat{\mathbf{a}} = \begin{bmatrix}\mathbf{a}\\ b\end{bmatrix}$$

so that $\mathbf{y} = \hat{\mathbf{a}}^\top\hat{\mathbf{X}} + \mathbf{e}$.
Assuming no error ($e_i = 0$): $\mathbf{y} = \hat{\mathbf{a}}^\top\hat{\mathbf{X}}$, which can be solved for $\hat{\mathbf{a}}$ directly when $\hat{\mathbf{X}}$ is square and invertible, and more generally via the pseudo-inverse, $\hat{\mathbf{a}}^\top = \mathbf{y}\,\mathrm{pinv}(\hat{\mathbf{X}})$.
With error, $y_i = \mathbf{a}^\top\mathbf{x}_i + b + e_i$. The total squared error is

$$E = e_1^2 + e_2^2 + e_3^2 = \sum_i \left(y_i - \mathbf{a}^\top\mathbf{x}_i - b\right)^2$$

or, in matrix form,

$$E = \left\|\mathbf{y} - \hat{\mathbf{a}}^\top\hat{\mathbf{X}}\right\|^2 = \left(\mathbf{y} - \hat{\mathbf{a}}^\top\hat{\mathbf{X}}\right)\left(\mathbf{y} - \hat{\mathbf{a}}^\top\hat{\mathbf{X}}\right)^\top$$
To minimize E, expand and differentiate with respect to $\hat{\mathbf{a}}$:

$$E = \mathbf{y}\mathbf{y}^\top - 2\,\hat{\mathbf{a}}^\top\hat{\mathbf{X}}\mathbf{y}^\top + \hat{\mathbf{a}}^\top\hat{\mathbf{X}}\hat{\mathbf{X}}^\top\hat{\mathbf{a}}$$

$$\frac{dE}{d\hat{\mathbf{a}}} = -2\,\hat{\mathbf{X}}\mathbf{y}^\top + 2\,\hat{\mathbf{X}}\hat{\mathbf{X}}^\top\hat{\mathbf{a}} = 0 \;\Rightarrow\; \hat{\mathbf{a}} = \left(\hat{\mathbf{X}}\hat{\mathbf{X}}^\top\right)^{-1}\hat{\mathbf{X}}\mathbf{y}^\top$$

When the target is a vector, each component could be regressed independently
– But we can use the relationship between the components of y to our benefit (the multivariate case follows; a numerical sketch of the scalar case is below)
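This closed form is easy to check numerically. A minimal sketch (the sizes, data, and noise level are illustrative assumptions, not from the slides):

```python
import numpy as np

# Columns of X are the training points x_i (D x N); y is the 1 x N target row.
rng = np.random.default_rng(0)
D, N = 3, 200
X = rng.standard_normal((D, N))
a_true, b_true = np.array([1.0, -2.0, 0.5]), 0.7
y = a_true @ X + b_true + 0.1 * rng.standard_normal(N)  # y_i = a^T x_i + b + e_i

# Append a row of ones so the offset b is absorbed into the weight vector.
Xhat = np.vstack([X, np.ones((1, N))])

# a_hat = (Xhat Xhat^T)^{-1} Xhat y^T; solve() avoids forming the inverse.
a_hat = np.linalg.solve(Xhat @ Xhat.T, Xhat @ y)
print(a_hat[:-1], a_hat[-1])   # estimates of a and b, close to a_true and b_true
```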
Multiple regression: the target is a vector.

$$\mathbf{y}_1 = \mathbf{A}^\top\mathbf{x}_1 + \mathbf{b} + \mathbf{e}_1, \qquad \mathbf{y}_2 = \mathbf{A}^\top\mathbf{x}_2 + \mathbf{b} + \mathbf{e}_2, \qquad \mathbf{y}_3 = \mathbf{A}^\top\mathbf{x}_3 + \mathbf{b} + \mathbf{e}_3$$

Component by component, $\mathbf{y}_i = \mathbf{A}^\top\mathbf{x}_i + \mathbf{b} + \mathbf{e}_i$ reads

$$y_{i1} = \mathbf{a}_1^\top\mathbf{x}_i + b_1 + e_{i1}, \qquad y_{i2} = \mathbf{a}_2^\top\mathbf{x}_i + b_2 + e_{i2}, \qquad y_{i3} = \mathbf{a}_3^\top\mathbf{x}_i + b_3 + e_{i3}$$

where $y_{ij}$ is the jth component of $\mathbf{y}_i$, $\mathbf{a}_j$ is the jth column of A, and $b_j$ is the jth component of b.
As before, define $\mathbf{Y} = [\mathbf{y}_1\ \mathbf{y}_2\ \mathbf{y}_3]$, $\hat{\mathbf{X}} = \begin{bmatrix}\mathbf{x}_1 & \mathbf{x}_2 & \mathbf{x}_3\\ 1 & 1 & 1\end{bmatrix}$ (the appended row of ones absorbs b), and $\hat{\mathbf{A}} = \begin{bmatrix}\mathbf{A}\\ \mathbf{b}^\top\end{bmatrix}$. The total squared error

$$E = \sum_i \left\|\mathbf{y}_i - \hat{\mathbf{A}}^\top\hat{\mathbf{x}}_i\right\|^2 = \sum_i \left(\mathbf{y}_i - \hat{\mathbf{A}}^\top\hat{\mathbf{x}}_i\right)^\top\left(\mathbf{y}_i - \hat{\mathbf{A}}^\top\hat{\mathbf{x}}_i\right)$$

is minimized by the same closed form:

$$\hat{\mathbf{A}} = \left(\hat{\mathbf{X}}\hat{\mathbf{X}}^\top\right)^{-1}\hat{\mathbf{X}}\mathbf{Y}^\top$$
A probabilistic interpretation: assume the error is Gaussian, $\mathbf{e} \sim N(\mathbf{0}, \sigma^2\mathbf{I})$. With training data $\mathbf{Y} = [\mathbf{y}_1\ \mathbf{y}_2\ \cdots\ \mathbf{y}_N]$ and $\mathbf{X} = [\mathbf{x}_1\ \mathbf{x}_2\ \cdots\ \mathbf{x}_N]$, each pair satisfies $\mathbf{y}_i = \mathbf{A}^\top\mathbf{x}_i + \mathbf{e}_i$.
Since $\mathbf{y} = \mathbf{A}^\top\mathbf{x} + \mathbf{e}$ with $\mathbf{e} \sim N(\mathbf{0}, \sigma^2\mathbf{I})$, each $\mathbf{y}_i$ is Gaussian around $\mathbf{A}^\top\mathbf{x}_i$:

$$P(\mathbf{Y} \mid \mathbf{X}) = \prod_i \frac{1}{(2\pi\sigma^2)^{D/2}}\exp\!\left(-\frac{1}{2\sigma^2}\left\|\mathbf{y}_i - \mathbf{A}^\top\mathbf{x}_i\right\|^2\right)$$

Maximizing the log-likelihood with respect to A is the same as minimizing $\sum_i \|\mathbf{y}_i - \mathbf{A}^\top\mathbf{x}_i\|^2$: the maximum-likelihood estimate is exactly the least-squares solution,

$$\hat{\mathbf{A}} = \left(\mathbf{X}\mathbf{X}^\top\right)^{-1}\mathbf{X}\mathbf{Y}^\top$$
Linear prediction: model each sample as a weighted combination of the K samples preceding it, $x_t = \sum_{k=1}^{K} a_k x_{t-k} + e_t$. Over the whole signal this is a regression problem:

$$[\,x_{K+1}\ \ x_{K+2}\ \cdots\,] = \mathbf{a}^\top \begin{bmatrix} x_K & x_{K+1} & \cdots \\ x_{K-1} & x_K & \cdots \\ \vdots & \vdots & \\ x_1 & x_2 & \cdots \end{bmatrix} + [\,e_{K+1}\ \ e_{K+2}\ \cdots\,]$$
Backward prediction is the mirror image: each sample is modeled from the K samples that follow it, $x_t = \sum_{k=1}^{K} b_k x_{t+k} + e_t$, giving an analogous regression with columns of future samples. In either direction the predictor and the total error are

$$x_t^{\text{est}} = \sum_k a_{t,k}\, x_{t-k}, \qquad \text{error} = \sum_t \left|x_t - x_t^{\text{est}}\right|^2$$

and the coefficients are chosen to minimize the error over the known part of the signal.
To fill in the blank region:
– For each missing sample: at each time, predict the next sample $x_t$ from past samples, $x_t^{\text{est}} = \sum_k a_{t,k}\, x_{t-k}$ (forward prediction)
– For each missing sample: at each time, predict the sample $x_t$ from future samples, $x_t^{\text{est}} = \sum_k b_{t,k}\, x_{t+k}$ (backward prediction)
A sketch of the forward case follows.
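A sketch of forward-prediction-based extrapolation, under illustrative assumptions (a fixed order K, and coefficients fit once on the known segment rather than re-estimated at each time):

```python
import numpy as np

def fit_lp_coeffs(x, K):
    """Least-squares fit of a_1..a_K in x[t] ~ sum_k a_k * x[t - k]."""
    # Each row of P holds the K samples preceding one target sample.
    P = np.array([x[t - K:t][::-1] for t in range(K, len(x))])
    a, *_ = np.linalg.lstsq(P, x[K:], rcond=None)
    return a

def forward_extend(x, a, n):
    """Extrapolate n samples past the end of x with the fitted predictor."""
    x, K = list(x), len(a)
    for _ in range(n):
        past = np.array(x[-K:][::-1])     # most recent sample first
        x.append(float(a @ past))
    return np.array(x)

# Toy signal: a damped sinusoid with its last 20 samples treated as "blank".
t = np.arange(200)
clean = np.exp(-t / 150) * np.sin(0.2 * t)
a = fit_lp_coeffs(clean[:180], K=10)          # train on the known region
recovered = forward_extend(clean[:180], a, 20)
print(np.max(np.abs(recovered[180:] - clean[180:])))  # small prediction error
```

Backward prediction works the same way on the time-reversed signal, and the two estimates can be blended across the gap.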
[Figure: glitch-removal example – panels labeled "Next glitch", "Interpolation result", "Reconstruction area", "Actual data", "Distorted signal", "Recovered signal"]
The closed-form solution requires knowledge of all (x, y) pairs:

$$\mathbf{a} = \left(\sum_t \mathbf{x}_t\mathbf{x}_t^\top\right)^{-1} \sum_t \mathbf{x}_t\, y_t \qquad \text{(scalar prediction version)}$$

– For this explanation we assume no "b" term (X is zero mean)
– The explanation generalizes easily even otherwise
Rewriting the solution:

$$\mathbf{a} = \left(\mathbf{X}\mathbf{X}^\top\right)^{-1}\mathbf{X}\mathbf{y}^\top$$

Let $\mathbf{C} = \mathbf{X}\mathbf{X}^\top$ and define $\hat{\mathbf{X}} = \mathbf{C}^{-1/2}\mathbf{X}$.
– Whitening: $N^{-1/2}\,\mathbf{C}^{-1/2}$ is the whitening matrix for x (after applying it, the components of x are uncorrelated with unit variance).
In terms of the whitened data, the regression becomes a simple sum over training instances:

$$\mathbf{a} = \mathbf{C}^{-1/2}\,\hat{\mathbf{X}}\mathbf{y}^\top = \mathbf{C}^{-1/2}\sum_i \hat{\mathbf{x}}_i\, y_i$$

– With due whitening and scaling, the regression is just a weighted sum of the whitened training vectors.
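A small numerical illustration of whitening (the data and sizes are assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 2, 1000
M = np.array([[2.0, 0.5], [0.5, 1.0]])
X = M @ rng.standard_normal((D, N))   # correlated, zero-mean data

C = X @ X.T / N                       # covariance estimate (the N^{-1} scaling)
evals, evecs = np.linalg.eigh(C)      # C = V diag(evals) V^T
C_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
Xw = C_inv_sqrt @ X                   # whitened data

print(np.round(Xw @ Xw.T / N, 2))     # approximately the identity matrix
```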
What happens to $\mathbf{y}_i = \mathbf{A}^\top\mathbf{x}_i$ when:
– the dimensionality of X is greater than the number of training instances?
– the problem is underdetermined: $\mathbf{X}\mathbf{X}^\top$ is singular and many solutions fit the data exactly
In this case the solution can instead be written through the invertible N×N matrix $\mathbf{X}^\top\mathbf{X}$:

$$\hat{\mathbf{A}} = \mathbf{X}\left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{Y}^\top$$

(the minimum-norm solution among the exact fits). Here

$$\mathbf{X}^\top\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^\top\mathbf{x}_1 & \mathbf{x}_1^\top\mathbf{x}_2 & \cdots & \mathbf{x}_1^\top\mathbf{x}_N \\ \mathbf{x}_2^\top\mathbf{x}_1 & \mathbf{x}_2^\top\mathbf{x}_2 & \cdots & \mathbf{x}_2^\top\mathbf{x}_N \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{x}_N^\top\mathbf{x}_1 & \mathbf{x}_N^\top\mathbf{x}_2 & \cdots & \mathbf{x}_N^\top\mathbf{x}_N \end{bmatrix} = \mathbf{G}$$

For a new input x, the prediction is

$$\mathbf{y} = \hat{\mathbf{A}}^\top\mathbf{x} = \mathbf{Y}\,\mathbf{G}^{-1}\begin{bmatrix}\mathbf{x}_1^\top\mathbf{x}\\ \vdots \\ \mathbf{x}_N^\top\mathbf{x}\end{bmatrix}$$

– The prediction depends on x only through inner products; the normalization is done via the Gram matrix G
Equivalently, the prediction is a weighted combination of the training targets:

$$\mathbf{y} = \sum_i w_i\,\mathbf{y}_i, \qquad \mathbf{w} = \mathbf{G}^{-1}\mathbf{X}^\top\mathbf{x}$$
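A sketch of the underdetermined case, showing that the Gram-matrix solution fits the training data exactly (dimensions and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
D, N = 50, 10                      # D > N: X X^T (50 x 50) is singular
X = rng.standard_normal((D, N))
Y = rng.standard_normal((2, N))    # 2-dimensional targets

G = X.T @ X                        # Gram matrix: G_ij = x_i^T x_j
A = X @ np.linalg.solve(G, Y.T)    # A = X G^{-1} Y^T

print(np.allclose(A.T @ X, Y))     # True: every training pair is fit exactly
```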
Nonlinear regression by linear means: map the inputs through a nonlinear function $\Phi$, $\Phi(\mathbf{X}) = [\Phi(\mathbf{x}_1)\ \Phi(\mathbf{x}_2)\ \cdots\ \Phi(\mathbf{x}_N)]$, and regress in the transformed space:

$$\mathbf{Y} = \mathbf{A}\,\Phi(\mathbf{X}) + \mathbf{e}$$

Replace X with Φ(X) in the earlier equations for the solution. Φ(X) may be in a very high-dimensional space; the high-dimensional space (or the transform Φ itself) need never be computed explicitly, as shown next.
With the Gram-matrix form of the solution, only inner products in the transformed space are needed:

$$\mathbf{G} = \Phi(\mathbf{X})^\top\Phi(\mathbf{X}) = \begin{bmatrix} \Phi(\mathbf{x}_1)^\top\Phi(\mathbf{x}_1) & \cdots & \Phi(\mathbf{x}_1)^\top\Phi(\mathbf{x}_N) \\ \vdots & \ddots & \vdots \\ \Phi(\mathbf{x}_N)^\top\Phi(\mathbf{x}_1) & \cdots & \Phi(\mathbf{x}_N)^\top\Phi(\mathbf{x}_N) \end{bmatrix}$$

and the prediction is again $\mathbf{y} = \sum_i w_i\,\mathbf{y}_i$ with weights computed from $\mathbf{G}^{-1}$.
The inner products can be supplied directly by a kernel function, $K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i)^\top\Phi(\mathbf{x}_j)$, so Φ itself is never evaluated:

$$\mathbf{G} = \begin{bmatrix} K(\mathbf{x}_1,\mathbf{x}_1) & K(\mathbf{x}_1,\mathbf{x}_2) & \cdots & K(\mathbf{x}_1,\mathbf{x}_N) \\ K(\mathbf{x}_2,\mathbf{x}_1) & K(\mathbf{x}_2,\mathbf{x}_2) & \cdots & K(\mathbf{x}_2,\mathbf{x}_N) \\ \vdots & \vdots & \ddots & \vdots \\ K(\mathbf{x}_N,\mathbf{x}_1) & K(\mathbf{x}_N,\mathbf{x}_2) & \cdots & K(\mathbf{x}_N,\mathbf{x}_N) \end{bmatrix}$$

For a new x, the prediction $\mathbf{y} = \sum_i w_i\,\mathbf{y}_i$ uses only the kernel values $K(\mathbf{x}_i, \mathbf{x})$.
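A sketch of kernel regression with a radial-basis-function kernel as one illustrative choice (the slides leave K(·,·) generic); the tiny ridge term is an added numerical-stability assumption:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2) for all pairs of columns."""
    d2 = (a * a).sum(0)[:, None] + (b * b).sum(0)[None, :] - 2 * a.T @ b
    return np.exp(-gamma * d2)

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, (1, 40))                  # 1-D inputs, one per column
Y = np.sin(2 * X) + 0.05 * rng.standard_normal((1, 40))

G = rbf(X, X)                                    # Gram matrix of kernel values
W = np.linalg.solve(G + 1e-8 * np.eye(40), Y.T)  # w = G^{-1} Y^T

x_new = np.array([[0.5]])
y_hat = (rbf(X, x_new).T @ W).T                  # sum_i w_i K(x_i, x_new)
print(y_hat, np.sin(2 * 0.5))                    # close to the true value
```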
So far the error was summed over all training instances, $E = \sum_{\text{all } i} \|\mathbf{y}_i - \mathbf{A}^\top\mathbf{x}_i\|^2$. Alternatively, A can be estimated from only the training instances near the query x.
– Note: this regression is specific to x
Each nearby training instance can further be weighted by its affinity to the query:

$$E = \sum_{j \in \text{neighborhood}(\mathbf{x})} d(\mathbf{x}, \mathbf{x}_j)\left\|\mathbf{y}_j - \mathbf{A}^\top\mathbf{x}_j\right\|^2$$

where d(·,·) measures how close xⱼ is to x.
– For linear regression d(·) is an inner product
The neighborhood can be extended to the whole training set by letting the weight do the work:

$$E = \sum_{\text{all } j} d(\mathbf{x}, \mathbf{x}_j)\left\|\mathbf{y}_j - \mathbf{A}^\top\mathbf{x}_j\right\|^2$$

The weighting function d(·,·):
– must fall off rapidly with increasing distance between x and xⱼ (a sketch of such a weighted fit follows)
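A sketch of the locally weighted fit, with a Gaussian falloff as one illustrative choice of d(·,·) (the bandwidth and data are assumptions):

```python
import numpy as np

def local_fit_predict(x_query, X, y, width=0.2):
    """Minimize sum_j d(x, x_j) (y_j - a^T x_j)^2 and predict at x_query."""
    Xh = np.vstack([X, np.ones((1, X.shape[1]))])  # row of ones for the offset
    d = np.exp(-np.sum((X - x_query[:, None]) ** 2, axis=0) / (2 * width ** 2))
    Xw = Xh * d                                    # weight columns by d(x, x_j)
    a = np.linalg.solve(Xw @ Xh.T, Xw @ y)         # a = (X D X^T)^{-1} X D y^T
    return a @ np.append(x_query, 1.0)

rng = np.random.default_rng(4)
X = rng.uniform(0.2, np.pi, (1, 200))              # 1-D inputs, one per column
y = np.sin(X[0]) + 0.05 * rng.standard_normal(200)
print(local_fit_predict(np.array([1.0]), X, y))    # approximately sin(1) = 0.84
```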
But first.. MAP estimators. MAP estimation picks the most probable value:
– y = argmax_Y P(Y|x)
– equivalently, via Bayes' rule: y = argmax_Y P(x|Y) P(Y)
[Figure: joint distribution of two variables, axes labeled F0 and F1]
Assume X and Y are jointly Gaussian. The parameters of the Gaussian are learned from training data.
Stack the variables as $\mathbf{z} = \begin{bmatrix}\mathbf{x}\\\mathbf{y}\end{bmatrix}$ and estimate the Gaussian's parameters from N training pairs:

$$\boldsymbol{\mu}_z = \frac{1}{N}\sum_{i=1}^{N}\mathbf{z}_i, \qquad \mathbf{C}_{zz} = \frac{1}{N}\sum_{i=1}^{N}\left(\mathbf{z}_i - \boldsymbol{\mu}_z\right)\left(\mathbf{z}_i - \boldsymbol{\mu}_z\right)^\top = \begin{bmatrix}\mathbf{C}_{XX} & \mathbf{C}_{XY}\\ \mathbf{C}_{YX} & \mathbf{C}_{YY}\end{bmatrix}$$
The blocks are the usual per-variable statistics, e.g.

$$\boldsymbol{\mu}_x = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i, \qquad \mathbf{C}_{XY} = \frac{1}{N}\sum_{i=1}^{N}\left(\mathbf{x}_i - \boldsymbol{\mu}_x\right)\left(\mathbf{y}_i - \boldsymbol{\mu}_y\right)^\top$$

and similarly for $\boldsymbol{\mu}_y$, $\mathbf{C}_{XX}$, $\mathbf{C}_{YX}$, and $\mathbf{C}_{YY}$.
Assume X and Y are jointly Gaussian, with parameters learned from training data. Now we are given an X, but no Y. What is Y?
[Figure: level set of the joint Gaussian]
Fixing $\mathbf{x} = \mathbf{x}_0$ slices the joint Gaussian; the slice, suitably normalized, is the conditional $P(\mathbf{y} \mid \mathbf{x}_0)$, which is itself Gaussian. A natural estimate is the most likely value:

$$\hat{\mathbf{y}} = \operatorname*{argmax}_{\mathbf{y}}\, P(\mathbf{y} \mid \mathbf{x}_0)$$

[Figure sequence: the level set of the joint Gaussian over (F0, F1); the slice at x = x₀; the peak of the conditional marking the most likely value of y]
For jointly Gaussian variables this estimate has a closed form:

$$\hat{\mathbf{y}} = \boldsymbol{\mu}_Y + \mathbf{C}_{YX}\,\mathbf{C}_{XX}^{-1}\left(\mathbf{x} - \boldsymbol{\mu}_X\right)$$

– y = argmax_Y P(Y|x) = argmax_Y P(x|Y) P(Y)
– MAP estimation of y is simply linear regression of Y on X for Gaussian RVs
– But this is not the MAP estimation of the regression parameter
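The whole pipeline (estimate the joint statistics, then regress) in a short sketch; the linear ground truth is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(5)
N, Dx, Dy = 2000, 3, 2
X = rng.standard_normal((Dx, N))
W = rng.standard_normal((Dy, Dx))
Y = W @ X + 0.1 * rng.standard_normal((Dy, N))   # linearly related targets

Z = np.vstack([X, Y])                 # stacked variable z = [x; y]
mu = Z.mean(axis=1, keepdims=True)
C = np.cov(Z)                         # blocks [[Cxx, Cxy], [Cyx, Cyy]]
Cxx, Cyx = C[:Dx, :Dx], C[Dx:, :Dx]

x0 = np.array([[1.0], [0.0], [-1.0]])
y_hat = mu[Dx:] + Cyx @ np.linalg.solve(Cxx, x0 - mu[:Dx])
print(y_hat.ravel(), (W @ x0).ravel())   # close to the noise-free target
```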
The same expression,

$$\hat{\mathbf{y}} = \boldsymbol{\mu}_Y + \mathbf{C}_{YX}\,\mathbf{C}_{XX}^{-1}\left(\mathbf{x} - \boldsymbol{\mu}_X\right)$$

is also the minimum-mean-squared-error (MMSE) estimate: the value minimizing the expected squared error $E\!\left[\|\mathbf{y} - \hat{\mathbf{y}}\|^2 \mid \mathbf{x}\right]$ is the conditional mean,

$$\hat{\mathbf{y}} = E[\mathbf{y} \mid \mathbf{x}]$$

The MMSE estimate is the mean of the distribution.
For a Gaussian, the most likely value is also the MEAN value, so the MAP and MMSE estimates coincide.
Let P(y|x) be a mixture density, $P(\mathbf{y}\mid\mathbf{x}) = \sum_k P(k\mid\mathbf{x})\,P(\mathbf{y}\mid k,\mathbf{x})$. The MMSE estimate of y is given by

$$E[\mathbf{y}\mid\mathbf{x}] = \sum_k P(k\mid\mathbf{x})\;E[\mathbf{y}\mid k,\mathbf{x}]$$

Just a weighted combination of the MMSE estimates from the individual components.
Let P(x, y) be a Gaussian mixture:

$$P(\mathbf{x},\mathbf{y}) = \sum_k w_k\, N\!\left(\mathbf{z};\, \boldsymbol{\mu}_k, \mathbf{C}_k\right), \qquad \mathbf{z} = \begin{bmatrix}\mathbf{x}\\\mathbf{y}\end{bmatrix}$$

Then P(y|x) is also a Gaussian mixture.
Writing out the per-component means and covariances,

$$\boldsymbol{\mu}_k = \begin{bmatrix}\boldsymbol{\mu}_{x,k}\\ \boldsymbol{\mu}_{y,k}\end{bmatrix}, \qquad \mathbf{C}_k = \begin{bmatrix}\mathbf{C}_{xx,k} & \mathbf{C}_{xy,k}\\ \mathbf{C}_{yx,k} & \mathbf{C}_{yy,k}\end{bmatrix}$$

P(y|x) is a Gaussian mixture whose kth component has mean

$$\boldsymbol{\mu}_{y,k} + \mathbf{C}_{yx,k}\,\mathbf{C}_{xx,k}^{-1}\left(\mathbf{x} - \boldsymbol{\mu}_{x,k}\right)$$
E[y|x] is therefore also a mixture: since P(y|x) is a mixture Gaussian density,

$$E[\mathbf{y}\mid\mathbf{x}] = \sum_k P(k\mid\mathbf{x})\left(\boldsymbol{\mu}_{y,k} + \mathbf{C}_{yx,k}\,\mathbf{C}_{xx,k}^{-1}\left(\mathbf{x} - \boldsymbol{\mu}_{x,k}\right)\right)$$
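A sketch of the mixture MMSE estimate with two hand-specified one-dimensional components (all parameters are illustrative, not fitted to data):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Component:
    w: float       # mixture weight
    mu_x: float
    mu_y: float
    cxx: float
    cxy: float     # scalar C_xy = C_yx

comps = [Component(0.5, -1.0, -1.0, 1.0, 0.8),
         Component(0.5,  2.0,  3.0, 1.0, -0.6)]

def mmse_estimate(x):
    # Responsibilities P(k|x) from each component's marginal N(x; mu_x, cxx).
    px = np.array([c.w * np.exp(-0.5 * (x - c.mu_x) ** 2 / c.cxx)
                   / np.sqrt(2 * np.pi * c.cxx) for c in comps])
    post = px / px.sum()
    # Weighted sum of per-component regressions mu_y + cxy / cxx * (x - mu_x).
    return sum(p * (c.mu_y + c.cxy / c.cxx * (x - c.mu_x))
               for p, c in zip(post, comps))

print(mmse_estimate(-1.0))   # dominated by the first component: about -1
print(mmse_estimate(2.0))    # dominated by the second component: about 3
```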
The overall prediction is a mixture of estimates from the individual Gaussians.
[Figure: example application to speech – a cepstral vector sequence]
Problems with the least-squares estimate:
– The error is squared: small variations in the data can produce large variations in the weights
– Outliers affect it adversely
– If the dimension of X >= the number of training instances, the problem is ill-conditioned
$$\hat{\mathbf{A}} = \left(\mathbf{X}\mathbf{X}^\top\right)^{-1}\mathbf{X}\mathbf{Y}^\top = \mathbf{C}_{XX}^{-1}\,\mathbf{C}_{XY}$$

The estimate is fragile exactly when $\mathbf{C}_{XX}$ is poorly conditioned.
A MAP view of the weights themselves: in $y = \mathbf{a}^\top\mathbf{X} + e$, treat the weight vector a as a random variable with a prior P(a), and estimate a from its posterior given the data.
With a Gaussian prior $P(\mathbf{a}) = N(\mathbf{0}, \sigma^2\mathbf{I})$, $\log P(\mathbf{a}) = C - \log\sigma - 0.5\,\sigma^{-2}\|\mathbf{a}\|^2$. The MAP estimate maximizes log-likelihood plus log-prior, i.e. minimizes $\|\mathbf{y} - \mathbf{a}^\top\mathbf{X}\|^2 + \lambda\|\mathbf{a}\|^2$, giving

$$\hat{\mathbf{a}} = \left(\mathbf{X}\mathbf{X}^\top + \lambda\mathbf{I}\right)^{-1}\mathbf{X}\mathbf{y}^\top$$

– Improves the condition number of the correlation matrix
– Will not affect the estimation from well-conditioned data
– Also called Tikhonov regularization
– Not to be confused with the MAP estimate of Y
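A sketch of the regularized solve in the ill-conditioned regime (λ and the data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
D, N, lam = 40, 25, 1e-2              # D > N: plain least squares breaks down
X = rng.standard_normal((D, N))
a_true = rng.standard_normal(D)
y = a_true @ X + 0.01 * rng.standard_normal(N)

# a = (X X^T + lam I)^{-1} X y^T; the lam*I term makes the matrix invertible.
a_ridge = np.linalg.solve(X @ X.T + lam * np.eye(D), X @ y)
print(np.linalg.norm(a_ridge @ X - y))   # small residual on the training data
```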
An alternative is to penalize the L1 norm of the weights instead of the L2 norm (equivalent to a Laplacian prior on a):

$$\hat{\mathbf{a}} = \operatorname*{argmin}_{\mathbf{a}} \left\|\mathbf{y} - \mathbf{a}^\top\mathbf{X}\right\|^2 + \lambda\|\mathbf{a}\|_1$$

or, equivalently, in constrained form:

$$\hat{\mathbf{a}} = \operatorname*{argmin}_{\mathbf{a}} \left\|\mathbf{y} - \mathbf{a}^\top\mathbf{X}\right\|^2 \quad \text{subject to} \quad \|\mathbf{a}\|_1 \le t$$

This is the LASSO. Unlike the L2-regularized problem it has no closed-form solution, but it tends to drive many weights exactly to zero, yielding sparse solutions.
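Since there is no closed form, the L1-penalized objective needs an iterative solver; below is a sketch using ISTA (iterative soft-thresholding), one simple choice among many, with illustrative data and λ:

```python
import numpy as np

rng = np.random.default_rng(7)
D, N, lam = 50, 30, 0.5
X = rng.standard_normal((D, N))
a_true = np.zeros(D)
a_true[[3, 17, 41]] = [2.0, -1.5, 1.0]       # sparse ground-truth weights
y = a_true @ X + 0.01 * rng.standard_normal(N)

a = np.zeros(D)
# Step 1/L, where L = 2 * lmax(X X^T) bounds the gradient's Lipschitz constant.
step = 0.5 / np.linalg.norm(X @ X.T, 2)
for _ in range(2000):
    grad = 2 * X @ (a @ X - y)               # gradient of ||y - a^T X||^2
    z = a - step * grad
    a = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
print(np.nonzero(np.abs(a) > 0.1)[0])        # typically recovers {3, 17, 41}
```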
[Figure: the LASSO geometry – the error contours typically touch the L1 ball at one of the loci outside the center (a corner), which is why solutions are sparse. Image credit: Tibshirani]
The same linear model supports two readings. In regression:
– X = explanatory variable
– Y = dependent variable
– a = weights of regression
In sparse recovery (compressive sensing):
– X = measurement matrix
– Y = measurement
– a = data
[Figure: the linear system Y = X a]
[Figure: weekly prediction example – observations O1–O8 for weeks wk1–wk8, each week split into W and S entries, used to predict a value for that week]