Machine Learning for Signal Processing Regression and Prediction
Class 16. 29 Oct 2015 Instructor: Bhiksha Raj
A Common Problem

Can you spot the glitches?

How to fix this problem? Glitches can be excised from the signal, leaving a “blank” region that must then be filled in:
– Forward prediction: predict the blank region from the samples that precede it
– Backward prediction: predict the blank region from the samples that follow it
– Regression analysis..
[Figure: candidate reconstructions of the glitched region, marked NOT OK and OK]
A linear trend may be found relating x and y:
– y = dependent variable
– x = explanatory variable
– Given x, y can be predicted as an affine function of x (a quick sketch follows below)
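As a minimal illustration of fitting such an affine trend, here is a numpy sketch on synthetic data (the data and all parameter values are assumed for the example, not taken from the slides):

    import numpy as np

    # Synthetic 1-D data: y is roughly an affine function of x plus noise
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=100)
    y = 2.5 * x + 1.0 + rng.normal(scale=0.5, size=100)

    # Least-squares fit of y = a*x + b: stack x with a column of ones
    X = np.column_stack([x, np.ones_like(x)])
    (a, b), *_ = np.linalg.lstsq(X, y, rcond=None)
    print(a, b)  # close to the true 2.5 and 1.0

The column of ones absorbs the offset b; the same augmentation trick is used in the matrix derivation that follows.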
[Figure: scatter of data points in the X-Y plane]
That's bonafide, 100%-real data, my friends. I took it myself over the course of two weeks. And this was not a leisurely two weeks, either; I busted my ass day and night in order to provide you with nothing but the best data. Look closely at this data, remembering that it is absolutely first-rate. Do you see the exponential dependence? I sure don't. I see a bunch of crap. Christ, this was such a waste of my time. Banking on my hopes that whoever grades this will just look at the pictures, I drew an exponential through my noise. I believe the apparent legitimacy is enhanced by the fact that I used a complicated computer program to make the fit. I understand this is the same process by which the top quark was discovered.
Fitting the data: scalar y, vector x

y_1 = a^T x_1 + b + e_1
y_2 = a^T x_2 + b + e_2
y_3 = a^T x_3 + b + e_3

Define:

x̂_i = [x_i; 1],   â = [a; b],   y = [y_1 y_2 y_3 ...],   X̂ = [x̂_1 x̂_2 x̂_3 ...]

so that y_i = â^T x̂_i + e_i, or in matrix form y = â^T X̂ + e.

Assuming no error: y = â^T X̂, giving â^T = y pinv(X̂).

With error, minimize the total squared error:

E = e_1^2 + e_2^2 + e_3^2 = Σ_i (y_i - â^T x̂_i)^2 = ||y - â^T X̂||^2
Minimizing the squared error:

E = ||y - â^T X̂||^2 = (y - â^T X̂)(y - â^T X̂)^T

dE/dâ = 2 X̂ X̂^T â - 2 X̂ y^T = 0

â = (X̂ X̂^T)^{-1} X̂ y^T = pinv(X̂^T) y^T
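A compact numpy check of this closed form, on assumed toy data (vector inputs, scalar output):

    import numpy as np

    # Toy data: columns of X are the inputs x_i (assumed for illustration)
    rng = np.random.default_rng(1)
    N, D = 200, 3
    X = rng.normal(size=(D, N))
    a_true, b_true = np.array([1.0, -2.0, 0.5]), 0.3
    y = a_true @ X + b_true + rng.normal(scale=0.1, size=N)

    # Augment each x_i with a 1 to absorb the offset b
    Xhat = np.vstack([X, np.ones(N)])          # shape (D+1, N)

    # Closed form: a_hat = (Xhat Xhat^T)^{-1} Xhat y^T
    a_hat = np.linalg.solve(Xhat @ Xhat.T, Xhat @ y)
    print(a_hat)                               # first D entries near a, last near b

np.linalg.solve is used instead of forming the inverse explicitly; it solves the same normal equations.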
What if y is itself a vector? Each component of y could be regressed on x independently
– But we can use the relationship between ys to our benefit
y_1 = A^T x_1 + b + e_1
y_2 = A^T x_2 + b + e_2
y_3 = A^T x_3 + b + e_3

In general y_i = A^T x_i + b + e_i; written out per component:

y_i1 = a_1^T x_i + b_1 + e_i1
y_i2 = a_2^T x_i + b_2 + e_i2
y_i3 = a_3^T x_i + b_3 + e_i3

where y_ij = jth component of vector y_i, a_j = jth column of A, b_j = jth component of b
Define Y = [y_1 y_2 y_3 ...], X̂ = [x̂_1 x̂_2 x̂_3 ...] and Â = [A; b^T], so that Y = Â^T X̂ + E.

Total squared error:

E_tot = Σ_i ||y_i - Â^T x̂_i||^2 = ||Y - Â^T X̂||_F^2

Minimizing as before:

Â = (X̂ X̂^T)^{-1} X̂ Y^T
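The same closed form handles all output components at once; np.linalg.lstsq accepts a matrix right-hand side, which this assumed toy example exploits:

    import numpy as np

    # Vector-output regression: columns of Y are the y_i (toy, assumed data)
    rng = np.random.default_rng(2)
    N, D, M = 200, 3, 2                        # inputs in R^3, outputs in R^2
    X = rng.normal(size=(D, N))
    A_true = rng.normal(size=(D, M))
    b_true = np.array([0.5, -1.0])
    Y = A_true.T @ X + b_true[:, None] + 0.05 * rng.normal(size=(M, N))

    Xhat = np.vstack([X, np.ones(N)])          # augment with ones for b
    # Solves Xhat^T @ Ahat = Y^T for every output column simultaneously
    Ahat, *_ = np.linalg.lstsq(Xhat.T, Y.T, rcond=None)
    print(Ahat.shape)                          # (D+1, M): A stacked over b^T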
A probabilistic view: y = A^T x + e, with Gaussian noise

e ~ N(0, σ^2 I),   i.e.   P(y | x) = N(A^T x, σ^2 I)

(the offset b can be folded into A via the augmented x̂, as before). Collect the training data as

Y = [y_1 y_2 ... y_N],   X = [x_1 x_2 ... x_N],   y_i = A^T x_i + e_i

The likelihood of the training outputs is

P(Y | X) = Π_i (2πσ^2)^{-D/2} exp( -||y_i - A^T x_i||^2 / (2σ^2) )

Maximizing the log-likelihood with respect to A is identical to minimizing Σ_i ||y_i - A^T x_i||^2, so the maximum-likelihood estimate is exactly the least-squares solution:

Â = (X X^T)^{-1} X Y^T
Learning the predictor: model each sample as a linear combination of the K preceding samples,

x_t = Σ_{k=1..K} a_k x_{t-k} + e_t

For all predictable times at once, in matrix form:

[x_{K+1} x_{K+2} ... x_t] = a^T [ x_K      x_{K+1}  ...  x_{t-1}
                                  x_{K-1}  x_K      ...  x_{t-2}
                                  ..       ..       ..   ..
                                  x_1      x_2      ...  x_{t-K} ] + [e_{K+1} e_{K+2} ... e_t]

which is solved for a by least squares, exactly as before. A “backward” predictor b is learned the same way from the K following samples:

x_t = Σ_{k=1..K} b_k x_{t+k} + e_t
Glitch detection:
– Learn a “forward” predictor a_t
– At each time, predict the next sample: x_t^est = Σ_k a_{t,k} x_{t-k}
– Compute the forward error: ferr_t = |x_t - x_t^est|^2
– Learn a “backward” predictor and compute the backward error the same way
– Compute the average prediction error over a window and threshold it (a sketch follows below)
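A minimal numpy sketch of the forward half of this detector; the predictor order K and window length win are assumed values, and the slides' backward predictor would be handled symmetrically:

    import numpy as np

    def lpc_fit(x, K):
        # Least-squares fit of a K-tap forward linear predictor to signal x
        H = np.array([x[t-K:t][::-1] for t in range(K, len(x))])
        a, *_ = np.linalg.lstsq(H, x[K:], rcond=None)
        return a

    def glitch_score(x, K=16, win=256):
        # Windowed mean squared forward-prediction error; threshold to flag glitches
        a = lpc_fit(x, K)
        pred = np.array([a @ x[t-K:t][::-1] for t in range(K, len(x))])
        err = (x[K:] - pred) ** 2
        return np.convolve(err, np.ones(win) / win, mode="same")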
Filling in the gap:
– For each missing sample, predict it forward from past samples: x_t^est = Σ_k a_{t,k} x_{t-k}
– And backward from future samples: x_t^est = Σ_k b_{t,k} x_{t+k}
– Combine the two predictions (sketched below)
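One plausible combination, sketched in numpy: run the forward predictor left-to-right across the gap, the backward predictor right-to-left, and average. The slides do not pin down the exact combination rule here, so the 50/50 average is an assumption:

    import numpy as np

    def fill_gap(x, gap, a, b):
        # gap: contiguous, sorted array of indices of the missing samples
        K = len(a)
        fwd, bwd = x.copy(), x.copy()
        for t in gap:                          # forward pass, left to right
            fwd[t] = a @ fwd[t-K:t][::-1]      # uses x_{t-1} ... x_{t-K}
        for t in reversed(gap):                # backward pass, right to left
            bwd[t] = b @ bwd[t+1:t+1+K]        # uses x_{t+1} ... x_{t+K}
        y = x.copy()
        y[gap] = 0.5 * (fwd[gap] + bwd[gap])
        return y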
[Figure: distorted signal with the reconstruction area marked, the interpolation result, and the recovered signal compared against the actual data; the next glitch is handled the same way]
Computing a = (X X^T)^{-1} X y^T requires knowledge of all (x, y) pairs. An alternate way to look at this solution (scalar prediction version):
– For the explanation we are assuming no “b” (X is 0 mean)
– The explanation generalizes easily even otherwise
a = (X X^T)^{-1} X y^T. Let C = (1/N) X X^T and define the whitened inputs

X̂ = C^{-1/2} X   (N^{-1/2} C^{-1/2} is the whitening matrix for x)

Then

a = (N C)^{-1} X y^T = (1/N) C^{-1/2} (C^{-1/2} X) y^T = (1/N) C^{-1/2} Σ_i y_i x̂_i

– With due whitening and scaling, the regression weight is simply a sum of the training inputs, each weighted by its output y_i
What if the dimensionality of X is greater than the number of training instances?
– X X^T is not invertible
– The problem is underdetermined: many choices of a satisfy y_i = a^T x_i exactly
In the underdetermined case we can pick the minimum-norm solution:

a = X (X^T X)^{-1} y^T

where G = X^T X is the Gram matrix of inner products between training inputs:

G = [ x_1^T x_1   x_1^T x_2   ...   x_1^T x_N
      x_2^T x_1   x_2^T x_2   ...   x_2^T x_N
      ..          ..          ..    ..
      x_N^T x_1   x_N^T x_2   ...   x_N^T x_N ]

Writing w = G^{-1} y^T, the solution is again a weighted sum of training inputs, a = Σ_i w_i x_i, and the prediction for a new x is

y^est = a^T x = Σ_i w_i x_i^T x

– The normalization is done via the Gram matrix
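A short numpy sketch of the minimum-norm solution on assumed toy data with more dimensions than instances:

    import numpy as np

    rng = np.random.default_rng(3)
    D, N = 50, 10
    X = rng.normal(size=(D, N))                # columns are training inputs
    y = rng.normal(size=N)                     # scalar targets

    G = X.T @ X                                # N x N Gram matrix
    w = np.linalg.solve(G, y)                  # w = G^{-1} y
    a = X @ w                                  # minimum-norm weights a = sum_i w_i x_i

    print(np.allclose(a @ X, y))               # True: fits the training data exactly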
What if the relationship between x and y is not linear? Map x through a set of nonlinear basis functions,

φ(x) = [φ_1(x) φ_2(x) ... φ_K(x)]^T

and regress linearly in feature space: Y = A^T φ(X) + e. Replace X with Φ(X) = [φ(x_1) φ(x_2) ... φ(x_N)] in the earlier equations.

φ(X) may live in a very high-dimensional space. The high-dimensional space (or the transform φ() itself) may even be unknown; but, as the Gram-matrix form shows, only inner products between feature vectors are ever needed.
Applying the minimum-norm solution in feature space:

a = Φ(X) G^{-1} y^T,   G_ij = φ(x_i)^T φ(x_j)

Define the kernel function K(x_i, x_j) = φ(x_i)^T φ(x_j). Then

G = [ K(x_1, x_1)   K(x_1, x_2)   ...   K(x_1, x_N)
      K(x_2, x_1)   K(x_2, x_2)   ...   K(x_2, x_N)
      ..            ..            ..    ..
      K(x_N, x_1)   K(x_N, x_2)   ...   K(x_N, x_N) ]

and, with w = G^{-1} y^T, the prediction for a new x never requires φ explicitly:

y^est = Σ_i w_i K(x_i, x)
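A compact kernel-regression sketch in numpy; the RBF kernel, its width gamma, the toy data, and the tiny ridge term added for numerical stability are all assumptions of the example, not part of the slides:

    import numpy as np

    def rbf_kernel(A, B, gamma=1.0):
        # Gaussian kernel matrix between the columns of A and B
        d2 = ((A[:, :, None] - B[:, None, :]) ** 2).sum(axis=0)
        return np.exp(-gamma * d2)

    rng = np.random.default_rng(4)
    X = rng.uniform(-3, 3, size=(1, 40))       # toy 1-D inputs (columns)
    y = np.sin(X[0]) + 0.1 * rng.normal(size=40)

    G = rbf_kernel(X, X)
    w = np.linalg.solve(G + 1e-6 * np.eye(40), y)   # w = G^{-1} y, stabilized

    Xq = np.linspace(-3, 3, 200)[None, :]
    y_est = rbf_kernel(Xq, X) @ w              # y^est(x) = sum_i w_i K(x_i, x)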
Regression need not be global: we can estimate a fit specific to each query point x, weighting every training instance by its closeness to x:

â(x) = argmin_a Σ_{j ∈ neighborhood(x)} d(x, x_j) (y_j - a^T x_j)^2

– Note: this regression is specific to x
– For linear regression d() is an inner product
In its simplest form, predict y directly as a normalized, weighted combination of the neighbors' outputs:

y^est(x) = Σ_{j ∈ neighborhood(x)} w(x, x_j) y_j / Σ_{all j} w(x, x_j)

– The weighting function w(x, x_j) must fall off rapidly with increasing distance between x and x_j (see the sketch below)
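A minimal sketch of such a weighted estimate, using Gaussian weights over all training points (the data and the width gamma are assumed):

    import numpy as np

    def weighted_estimate(x, X, y, gamma=2.0):
        # Gaussian weights fall off rapidly with squared distance from x
        w = np.exp(-gamma * ((X - x[:, None]) ** 2).sum(axis=0))
        return (w @ y) / w.sum()

    rng = np.random.default_rng(5)
    X = rng.normal(size=(2, 100))              # columns are training inputs
    y = X[0] ** 2 + X[1] + 0.1 * rng.normal(size=100)
    print(weighted_estimate(np.array([0.5, -0.2]), X, y))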
– But first.. MAP estimators..
– MAP estimate: y = argmax_y P(y | x)
– ML estimate: y = argmax_y P(x | y)
[Figure: training data plotted in the (F0, F1) plane]

Assume X and Y are jointly Gaussian. The parameters of the Gaussian are learned from training data.
Stack the variables as z = [x; y] and estimate the Gaussian's parameters from the N training pairs:

μ_z = (1/N) Σ_{i=1..N} z_i

C_zz = (1/N) Σ_{i=1..N} (z_i - μ_z)(z_i - μ_z)^T

The mean and covariance inherit a block structure:

μ_z = [μ_X; μ_Y],   C_zz = [ C_XX  C_XY
                             C_YX  C_YY ]

with, for example, μ_X = (1/N) Σ_i x_i and C_XY = (1/N) Σ_i (x_i - μ_X)(y_i - μ_Y)^T.
Assume X and Y are jointly Gaussian, with the parameters learned from training data. Now we are given an X, but no Y. What is Y? [Figure: level set of the Gaussian]
[Figure sequence: slicing the joint Gaussian at X = x0; the conditional distribution of y given x0 is itself Gaussian, and its peak marks the most likely value of y]

Y = argmax_y P(y | x0)
For jointly Gaussian x and y this argmax has a closed form:

ŷ = argmax_y P(y | x) = μ_Y + C_YX C_XX^{-1} (x - μ_X)

– This is y = argmax_y P(y|x) (MAP), not argmax_y P(x|y) (ML)
– MAP estimation of y is the regression of Y on X for Gaussian RVs
– But this is not the MAP estimation of the regression parameter (that comes later)
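A small numpy sketch: estimate the joint Gaussian from assumed paired training data, then apply the conditional-mean formula:

    import numpy as np

    def gaussian_map(x, mu_x, mu_y, Cxx, Cyx):
        # MAP (= conditional mean) of y given x for jointly Gaussian (x, y)
        return mu_y + Cyx @ np.linalg.solve(Cxx, x - mu_x)

    rng = np.random.default_rng(6)
    Xtr = rng.normal(size=(2, 500))            # toy 2-D x, 1-D y
    Ytr = np.array([[1.0, -0.5]]) @ Xtr + 0.1 * rng.normal(size=(1, 500))
    Z = np.vstack([Xtr, Ytr])                  # stacked z = [x; y]
    mu, C = Z.mean(axis=1), np.cov(Z)

    y_hat = gaussian_map(np.array([0.3, 0.7]), mu[:2], mu[2:],
                         C[:2, :2], C[2:, :2])
    print(y_hat)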
ŷ = μ_Y + C_YX C_XX^{-1} (x - μ_X). Factoring C_YX C_XX^{-1} = (C_YX C_XX^{-1/2}) C_XX^{-1/2} shows that the estimator effectively regresses on whitened inputs: it has the same structure as the least-squares weight computed from whitened training instances (a = Σ_i y_i x̂_i, seen earlier).
A different criterion: the minimum mean squared error (MMSE) estimate,

ŷ_MMSE = argmin_ŷ E[ ||y - ŷ||^2 | x ] = E[y | x]

The MMSE estimate is the mean of the distribution P(y | x).
For a Gaussian the most likely value is also the mean value, so the MAP and MMSE estimates coincide.
Let P(y|x) be a mixture density. The MMSE estimate of y is then

E[y | x] = Σ_k P(k | x) E[y | x, k]

Just a weighted combination of the MMSE estimates from the individual components.
Let P(x, y) be a Gaussian mixture:

P(x, y) = Σ_k P(k) N([x; y]; μ_k, C_k)

Then P(y|x) is also a Gaussian mixture.
Each component k has block parameters

μ_k = [μ_{x,k}; μ_{y,k}],   C_k = [ C_{xx,k}  C_{xy,k}
                                    C_{yx,k}  C_{yy,k} ]

and conditioning component k on x gives the component mean

E[y | x, k] = μ_{y,k} + C_{yx,k} C_{xx,k}^{-1} (x - μ_{x,k})
P(y|x) is a mixture Gaussian density, so E[y|x] is also a mixture:

E[y | x] = Σ_k P(k | x) ( μ_{y,k} + C_{yx,k} C_{xx,k}^{-1} (x - μ_{x,k}) )

A mixture of estimates from the individual Gaussians.
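A sketch of this mixture estimate in numpy/scipy, with assumed parameter conventions (mixture weights, stacked means over z = [x; y], and block covariances):

    import numpy as np
    from scipy.stats import multivariate_normal

    def gmm_mmse(x, weights, mus, covs, dx):
        # E[y|x] for a Gaussian mixture over z = [x; y]; dx = dimension of x
        num, den = 0.0, 0.0
        for pk, mu, C in zip(weights, mus, covs):
            mux, muy = mu[:dx], mu[dx:]
            Cxx, Cyx = C[:dx, :dx], C[dx:, :dx]
            px = pk * multivariate_normal.pdf(x, mean=mux, cov=Cxx)  # prop. to P(k|x)
            ey = muy + Cyx @ np.linalg.solve(Cxx, x - mux)           # E[y|x,k]
            num, den = num + px * ey, den + px
        return num / den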
– Example: cepstral vector sequences (speech features)
Problems with standard least-squares regression:
– The error is squared: small variations in the data can produce large variations in the weights
– Outliers affect it adversely
– If the dimension of X ≥ the number of instances, X X^T is not invertible
Recall the least-squares solution: Â = (X X^T)^{-1} X Y^T, i.e. Â = C_XX^{-1} C_XY in correlation form.
MAP estimation of the weights: in y = a^T X + e, treat a itself as a random variable, place a prior P(a) on it, and estimate a = argmax_a P(a | X, y).
Assume a Gaussian prior on the weights: P(a) = N(0, σ_a^2 I), so log P(a) = C - log σ_a - 0.5 σ_a^{-2} ||a||^2.

Maximizing log P(y | X, a) + log P(a) adds a penalty on ||a||^2 to the squared error and gives

â = (X X^T + λ I)^{-1} X y^T,   λ = σ^2 / σ_a^2

– Improves the condition number of the correlation matrix
– Will not affect the estimation from well-conditioned data
– Also called Tikhonov regularization (ridge regression)
– Not to be confused with the MAP estimate of Y
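The closed form is a one-liner in numpy; a sketch on assumed ill-conditioned toy data:

    import numpy as np

    def ridge(X, y, lam):
        # Tikhonov-regularized least squares: (X X^T + lam*I)^{-1} X y^T
        return np.linalg.solve(X @ X.T + lam * np.eye(X.shape[0]), X @ y)

    rng = np.random.default_rng(7)
    X = rng.normal(size=(20, 10))              # D > N: plain least squares fails
    y = rng.normal(size=10)
    print(ridge(X, y, lam=0.1).shape)          # well-defined despite D > N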
Instead of the L2 penalty, penalize the L1 norm of the weights:

â = argmin_a ||y - a^T X||^2 + λ ||a||_1

or, equivalently, the constrained form

â = argmin_a ||y - a^T X||^2   subject to   ||a||_1 ≤ t

This is the LASSO: the L1 term drives many weights to exactly zero.
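There is no closed form for the L1-penalized problem; one simple solver is iterative soft-thresholding (ISTA), sketched here on assumed toy data (a generic solver choice, not necessarily what the slides use):

    import numpy as np

    def lasso_ista(X, y, lam, iters=500):
        # Minimize ||y - a^T X||^2 + lam*||a||_1 by proximal gradient descent
        a = np.zeros(X.shape[0])
        step = 0.5 / np.linalg.norm(X @ X.T, 2)          # safe step size
        for _ in range(iters):
            grad = 2 * X @ (X.T @ a - y)                 # gradient of squared error
            z = a - step * grad
            a = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
        return a

    rng = np.random.default_rng(8)
    X = rng.normal(size=(50, 30))
    a_true = np.zeros(50)
    a_true[[3, 17, 41]] = [1.0, -2.0, 1.5]               # sparse ground truth
    y = X.T @ a_true + 0.01 * rng.normal(size=30)
    a_hat = lasso_ista(X, y, lam=0.1)                    # most entries exactly zero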
– The L1 ball has corners: the constrained optimum typically lands on one of these loci away from the center, where some coefficients are exactly zero
Image credit: Tibshirani
The same mathematics reappears with the roles renamed. In regression:
– X = explanatory variable
– Y = dependent variable
– a = weights of regression
In sparse recovery (compressive sensing):
– X = measurement matrix
– Y = measurement
– a = data
[Figure: the linear system Y = X a]