MLE & Regression
Ken Kreutz-Delgado (Nuno Vasconcelos)
UCSD – ECE 175A – Winter 2012
Statistical Learning
Goal: Given a relationship between a feature vector x and a vector y, and iid sample data $(x_i, y_i)$, find an approximating function $f(x) \approx y$. This is called training or learning. Two major types of learning:
– Unsupervised (“blind” curve fitting): only X is known.
– Supervised: both X and the target value Y are known during training; only X is known at test time.
Consider the supervised learning problem:
– Y ∈ {0, 1} is referred to as detection.
– Y ∈ {0, ..., M−1} is referred to as (M-ary) classification.
– Y continuous is referred to as regression.
So far we have dealt mostly with classification; now we will emphasize regression. It is a relatively easy setting in which to explain non-trivial MLE problems.
The regression problem is defined by:
– Two random vectors: the independent (regressor) variable X and the dependent (regressed) variable Y.
– An iid dataset of training examples, D = {(x_1, y_1), ..., (x_n, y_n)}.
– An additive-noise parametric model of the form
$$Y = f(X;\theta) + E,$$
where $\theta \in \mathbb{R}^p$ is a deterministic parameter vector and E is an iid additive random vector that accounts for noise and model error.
There are two types of regression problems:
– Linear regression, where f(·) is linear in θ.
– Nonlinear regression, otherwise.
– What matters is linearity in the parameter θ, not in the data X!
Examples of linear regression models:
– Line Fitting: $f(x;\theta) = \theta_0 + \theta_1 x$
– Polynomial Fitting: $f(x;\theta) = \sum_{i=0}^{k} \theta_i\, x^i$
– Truncated Fourier Series: $f(x;\theta) = \sum_{i=1}^{k} \theta_i \sin(i x)$
Examples of nonlinear regression models:
– Neural Networks
– Sinusoidal Decompositions (unknown frequencies enter the model nonlinearly)
– Etc.
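To make the linearity-in-θ point concrete, here is a minimal Python/NumPy sketch (the helper names are hypothetical, not from the lecture) that evaluates each of the linear examples above as an inner product between a parameter vector and a feature vector:

```python
import numpy as np

def phi_line(x):
    # Features for line fitting: f(x; theta) = theta_0 + theta_1 * x
    return np.array([1.0, x])

def phi_poly(x, k=3):
    # Features for polynomial fitting: f(x; theta) = sum_{i=0..k} theta_i * x**i
    return np.array([x**i for i in range(k + 1)])

def phi_fourier(x, k=3):
    # Features for a truncated Fourier series: f(x; theta) = sum_{i=1..k} theta_i * sin(i*x)
    return np.array([np.sin(i * x) for i in range(1, k + 1)])

def f(x, theta, phi):
    # All three models share the same linear-in-theta form: f(x; theta) = phi(x)^T theta
    return phi(x) @ theta
```

Even though `phi_poly` and `phi_fourier` are highly nonlinear in x, `f` is linear in θ, which is why all three count as linear regression.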
The conditional probability model:
– x is (almost) always known; the goal is to predict y given x.
– Thus, for each x, $f(x;\theta)$ is treated like a constant.
– The realization E = e is added to $f(x;\theta)$ to form Y = y.
– Hence, Y is conditionally distributed as E, but with a constant added.
– This only changes the mean of the distribution of E, $P_E(\varepsilon;\theta)$, yielding
$$P_{Y|X}(y \mid x;\theta) = P_E\big(y - f(x;\theta);\,\theta\big).$$
– The conditional probability model for Y|X is determined by the distribution of the noise, $P_E(\varepsilon;\theta)$! Also note that the noise pdf, $P_E(\varepsilon;\theta)$, might itself depend on the unknown parameter vector θ.
– Viewed as a function of θ, $P_{Y|X}(y_i \mid x_i;\theta)$ is the X-conditional likelihood of θ given $Y = y_i$, and it depends on the inputs only through $x_i$ (both facts being a consequence of conditioning on $X = x_i$).
For the iid dataset D, the likelihood therefore factorizes across the training points:
$$P_D(D;\theta) = \prod_{i=1}^{n} P_{Y|X}(y_i \mid x_i;\theta) = \prod_{i=1}^{n} P_E\big(y_i - f(x_i;\theta);\,\theta\big).$$
Maximum likelihood estimation:
– Given a collection of iid training points D = {(x_1, y_1), ..., (x_n, y_n)}, the natural procedure to estimate the parameter θ is ML estimation:
$$\hat\theta_{ML} = \arg\max_\theta\; \prod_{i=1}^{n} P_{Y|X}(y_i \mid x_i;\theta) = \arg\max_\theta\; \prod_{i=1}^{n} P_E\big(y_i - f(x_i;\theta);\,\theta\big).$$
Equivalently, maximizing the log-likelihood,
$$\hat\theta_{ML} = \arg\max_\theta\; \sum_{i=1}^{n} \log P_E\big(y_i - f(x_i;\theta);\,\theta\big).$$
– Note that the noise pdf, $P_E(\varepsilon;\theta)$, can itself possibly depend on θ.
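The lecture proceeds analytically, but as a numerical sanity check this maximization can be done by brute force. A minimal sketch, assuming a line model and Gaussian noise purely for illustration (the data, grid, and names are hypothetical):

```python
import numpy as np

def neg_log_likelihood(theta, x, y, log_pE):
    # Negative log-likelihood of the line model f(x; theta) = theta[0] + theta[1] * x
    residuals = y - (theta[0] + theta[1] * x)
    return -np.sum(log_pE(residuals))

# Assumed noise model: zero-mean Gaussian with known sigma
sigma = 0.5
log_gauss = lambda e: -0.5 * np.log(2 * np.pi * sigma**2) - e**2 / (2 * sigma**2)

# Synthetic data from the true line y = 1 + 2x
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * x + sigma * rng.standard_normal(50)

# Crude grid search for the maximizer; the analytic conditions derived below do this exactly
grid = np.linspace(-5.0, 5.0, 101)
theta_ml = min(((a, b) for a in grid for b in grid),
               key=lambda th: neg_log_likelihood(th, x, y, log_gauss))
```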
Example: the Gaussian case, where the noise is zero-mean with variance $\sigma^2$:
$$P_E(\varepsilon) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\varepsilon^2 / 2\sigma^2}.$$
Then Y|X is Gaussian with mean $f(x;\theta)$ and variance $\sigma^2$:
$$P_{Y|X}(y \mid x;\theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y - f(x;\theta))^2 / 2\sigma^2}.$$
Assume $\sigma^2$ is known. Then the MLE is:
$$\hat\theta_{ML} = \arg\max_\theta\; \sum_{i=1}^{n} \left[ -\frac{\big(y_i - f(x_i;\theta)\big)^2}{2\sigma^2} \right] = \arg\min_\theta\; \sum_{i=1}^{n} \big(y_i - f(x_i;\theta)\big)^2.$$
– Since this minimizes the squared Euclidean distance of the estimation error (or prediction error), it is also known as least squares curve fitting.
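In code, this means that under AWGN with known variance the ML fit of a line is just an ordinary least squares fit. A minimal sketch (NumPy; the synthetic data are illustrative):

```python
import numpy as np

# Synthetic data: y = 1 + 2x plus Gaussian noise
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 100)
y = 1.0 + 2.0 * x + 0.3 * rng.standard_normal(100)

# Least squares line fit; np.polyfit returns coefficients highest-degree first
theta1_hat, theta0_hat = np.polyfit(x, y, deg=1)
```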
The least squares view suggests an alternative formulation: optimizing the loss of the learning system.
– The interpretation of f(·) as a predictor becomes even more intuitive.
– ML estimation under a given noise model is equivalent to picking a loss identical to the negative of the log of the noise probability density:
$$\mathcal{L}\big(y, f(x;\theta)\big) = -\log P_E\big(y - f(x;\theta);\,\theta\big).$$
Examples of the noise-model / loss correspondence:
– Gaussian noise (the AWGN case),
$$P_E(\varepsilon) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\varepsilon^2/2\sigma^2},$$
gives the L2 distance, $|y - f(x;\theta)|^2$.
– Laplacian noise,
$$P_E(\varepsilon) = \frac{1}{2\sigma}\, e^{-|\varepsilon|/\sigma},$$
gives the L1 distance, $|y - f(x;\theta)|$.
– Rayleigh noise,
$$P_E(\varepsilon) = \frac{\varepsilon}{\sigma^2}\, e^{-\varepsilon^2/2\sigma^2}, \qquad \varepsilon \ge 0,$$
gives the corresponding “Rayleigh distance”.
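A minimal sketch of this correspondence (Python; the function names are hypothetical). Each loss is the negative log of the matching noise density with θ-independent constants dropped:

```python
import numpy as np

# e is the vector of residuals e_i = y_i - f(x_i; theta)

def l2_loss(e, sigma=1.0):
    # Gaussian (AWGN) noise -> squared (L2) distance
    return np.sum(e**2) / (2.0 * sigma**2)

def l1_loss(e, sigma=1.0):
    # Laplacian noise -> absolute (L1) distance
    return np.sum(np.abs(e)) / sigma

def rayleigh_loss(e, sigma=1.0):
    # Rayleigh noise (requires e > 0) -> the "Rayleigh distance"
    return np.sum(e**2 / (2.0 * sigma**2) - np.log(e))
```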
How do we find the maximum? We look for parameter values $\theta^*$ such that the gradient of the log-likelihood vanishes,
$$\nabla_\theta \log P_D(D;\theta)\big|_{\theta^*} = 0,$$
and we must not forget the Hessian condition:
$$\nabla_\theta^2 \log P_D(D;\theta)\big|_{\theta^*} \ \text{must be negative definite}$$
(equivalently, the negative of the Hessian must be positive definite).
Recall some important results from linear algebra. The following are equivalent necessary and sufficient conditions for a real symmetric matrix A to be (strictly) positive definite:
i) $x^T A x > 0,\ \forall x \neq 0$;
ii) all eigenvalues of A are real and satisfy $\lambda_i > 0$;
iii) all upper-left submatrices $A_k$ have strictly positive determinant (strictly positive leading principal minors);
iv) there exists a matrix R with independent rows such that $A = R R^T$; equivalently, there exists a matrix Q with independent columns such that $A = Q^T Q$.
The upper-left submatrices are
$$A_1 = a_{1,1}, \qquad A_2 = \begin{pmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{pmatrix}, \qquad A_3 = \begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{pmatrix}.$$
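These characterizations are easy to check numerically; a minimal sketch (NumPy; the function name is hypothetical) testing conditions (ii), (iii), and (iv) on a real symmetric matrix:

```python
import numpy as np

def is_positive_definite(A, tol=1e-10):
    # A is assumed real symmetric
    eigs_ok = np.all(np.linalg.eigvalsh(A) > tol)               # (ii) eigenvalues > 0
    minors_ok = all(np.linalg.det(A[:k, :k]) > tol              # (iii) leading principal
                    for k in range(1, A.shape[0] + 1))          #       minors > 0
    try:
        np.linalg.cholesky(A)                                   # (iv) A = R R^T exists
        chol_ok = True
    except np.linalg.LinAlgError:
        chol_ok = False
    return eigs_ok and minors_ok and chol_ok
```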
Also recall the basic vector derivatives:
$$\nabla_x\big(b^T x\big) = b, \qquad \nabla_x\big(x^T A x\big) = \big(A + A^T\big)x = 2Ax \ \text{ for symmetric } A.$$
If you forget these, write everything out in terms of components, take the ordinary derivatives, and transpose the result; or Google “vector derivatives” or “matrix derivatives”.
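One quick way to gain confidence in such identities is a finite-difference check. A minimal sketch (hypothetical names) verifying $\nabla_\theta(\theta^T A \theta) = (A + A^T)\theta$:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))        # not necessarily symmetric
theta = rng.standard_normal(3)

# Analytic gradient of theta^T A theta
grad_analytic = (A + A.T) @ theta

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_fd = np.zeros(3)
for i in range(3):
    d = np.zeros(3)
    d[i] = eps
    grad_fd[i] = ((theta + d) @ A @ (theta + d)
                  - (theta - d) @ A @ (theta - d)) / (2 * eps)

assert np.allclose(grad_analytic, grad_fd, atol=1e-4)
```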
Back to our regression problem: for the AWGN model, maximizing
$$\sum_{i=1}^{n} \log P_{Y|X}(y_i \mid x_i;\theta) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n} \big(y_i - f(x_i;\theta)\big)^2 + \text{const}$$
is equivalent to minimizing the effective loss
$$L(\theta) = \sum_{i=1}^{n} \big(y_i - f(x_i;\theta)\big)^2.$$
For the minimizer $\hat\theta$, these conditions can be written as
$$\nabla_\theta L(\theta)\big|_{\hat\theta} = -2\sum_{i=1}^{n} \big(y_i - f(x_i;\hat\theta)\big)\,\nabla_\theta f(x_i;\hat\theta) = 0$$
and
$$\nabla_\theta^2 L(\theta)\big|_{\hat\theta} \ \text{positive definite}.$$
Example: Line Fitting.
– The model is $y = \theta_0 + \theta_1 x + \varepsilon$, where ε is scalar AWGN of known variance.
– The (effective) loss function is
$$L(\theta) = \sum_{i=1}^{n} \big(y_i - \theta_0 - \theta_1 x_i\big)^2.$$
– We are looking for the line that makes the squares of the vertical distances from the points to the line as small as possible in an averaged sense.
– Our first step is to compute the zeros of the gradient.
Setting the two components of the gradient to zero:
$$\frac{\partial L}{\partial \theta_0} = -2\sum_{i=1}^{n} \big(y_i - \theta_0 - \theta_1 x_i\big) = 0,$$
$$\frac{\partial L}{\partial \theta_1} = -2\sum_{i=1}^{n} x_i\big(y_i - \theta_0 - \theta_1 x_i\big) = 0.$$
Rearranging, we get the normal equations
$$\sum_{i=1}^{n} y_i = n\,\theta_0 + \theta_1 \sum_{i=1}^{n} x_i,$$
$$\sum_{i=1}^{n} x_i y_i = \theta_0 \sum_{i=1}^{n} x_i + \theta_1 \sum_{i=1}^{n} x_i^2,$$
or, in matrix form,
$$\begin{pmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix} \begin{pmatrix} \theta_0 \\ \theta_1 \end{pmatrix} = \begin{pmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{pmatrix}.$$
Solving this 2×2 system gives
$$\hat\theta_1 = \frac{n\sum_i x_i y_i - \sum_i x_i \sum_j y_j}{\,n\sum_i x_i^2 - \big(\sum_i x_i\big)^2\,}, \qquad \hat\theta_0 = \bar y - \hat\theta_1\,\bar x,$$
where $\bar x = \frac{1}{n}\sum_i x_i$ and $\bar y = \frac{1}{n}\sum_i y_i$. Dividing numerator and denominator by $n^2$, the slope can be written in terms of the sample statistics:
$$\hat\theta_1 = \frac{\widehat{\operatorname{cov}}(x,y)}{\widehat{\operatorname{var}}(x)}, \qquad \widehat{\operatorname{cov}}(x,y) = \frac{1}{n}\sum_i (x_i - \bar x)(y_i - \bar y), \qquad \widehat{\operatorname{var}}(x) = \frac{1}{n}\sum_i (x_i - \bar x)^2.$$
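A minimal sketch of these closed-form estimates (NumPy; the function name is hypothetical):

```python
import numpy as np

def fit_line(x, y):
    # theta1_hat = sample cov(x, y) / sample var(x); the 1/n factors cancel,
    # so ratios of centered sums suffice
    xbar, ybar = x.mean(), y.mean()
    theta1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar)**2)
    # theta0_hat = ybar - theta1_hat * xbar
    theta0 = ybar - theta1 * xbar
    return theta0, theta1
```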
For this stationary point to be a minimum of the loss, the Hessian
$$\nabla_\theta^2 L(\theta) = 2\begin{pmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix}$$
has to be positive definite. A necessary and sufficient condition is for the leading principal minors to be strictly positive. Dividing by $2n$, with $\overline{x^2} = \frac{1}{n}\sum_i x_i^2$, these are:
– $1 > 0$, which always holds;
– $\overline{x^2} - \bar x^2 > 0$, i.e.,
$$\widehat{\operatorname{var}}(x) = \overline{x^2} - \bar x^2 > 0:$$
the sample variance of x must be strictly positive (the $x_i$ cannot all be equal). This is also exactly what makes $\hat\theta_1 = \widehat{\operatorname{cov}}(x,y)/\widehat{\operatorname{var}}(x)$ well defined.
Vector notation:
– Note that we can write the line model $y = \theta_0 + \theta_1 x + \varepsilon$ as
$$y = \begin{pmatrix} 1 & x \end{pmatrix} \begin{pmatrix} \theta_0 \\ \theta_1 \end{pmatrix} + \varepsilon \;=\; \phi(x)^T \theta + \varepsilon.$$
– This suggests keeping the assumption of linearity in the parameters but letting the features be any (k+1)-vector
$$\phi(x) = \big(\phi_0(x),\, \phi_1(x),\, \dots,\, \phi_k(x)\big)^T$$
to form the general linear-in-θ model $f(x;\theta) = \phi(x)^T \theta$.
The earlier examples in this notation:
– Line Fitting: $f(x;\theta) = \theta_0 + \theta_1 x$, with $\phi(x) = (1,\ x)^T$
– Polynomial Fitting: $f(x;\theta) = \sum_{i=0}^{k} \theta_i x^i$, with $\phi(x) = (1,\ x,\ \dots,\ x^k)^T$
– Truncated Fourier Series: $f(x;\theta) = \sum_{i=1}^{k} \theta_i \sin(i x)$, with $\phi(x) = (\sin x,\ \dots,\ \sin kx)^T$
Under AWGN, the ML estimate again minimizes the (unweighted) least squares loss function
$$L(\theta) = \sum_{i=1}^{n} \big(y_i - \phi(x_i)^T\theta\big)^2,$$
which we can write as
$$L(\theta) = \big\|\, y - \Phi(x)\,\theta \,\big\|^2, \qquad y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} \in \mathbb{R}^n, \qquad \Phi(x) = \begin{pmatrix} \phi(x_1)^T \\ \vdots \\ \phi(x_n)^T \end{pmatrix} \in \mathbb{R}^{n \times (k+1)}.$$
For example:
– Line Fitting:
$$\Phi(x) = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}$$
– Truncated Fourier Series:
$$\Phi(x) = \begin{pmatrix} \sin x_1 & \cdots & \sin k x_1 \\ \vdots & & \vdots \\ \sin x_n & \cdots & \sin k x_n \end{pmatrix}$$
We set the gradient to zero to find a stationary point:
$$\nabla_\theta L(\theta) = -2\,\Phi(x)^T\big(y - \Phi(x)\,\theta\big) = 0,$$
from which, assuming that $\Phi(x)^T\Phi(x)$ is invertible,
$$\hat\theta_{LS} = \big(\Phi(x)^T\Phi(x)\big)^{-1}\,\Phi(x)^T\, y.$$
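A minimal sketch of this computation (NumPy; the cubic feature map and synthetic data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1.0, 1.0, 200)
y = 0.5 - x + 2.0 * x**3 + 0.1 * rng.standard_normal(x.size)

# Design matrix: row i is phi(x_i)^T = (1, x_i, x_i^2, x_i^3)
Phi = np.column_stack([x**i for i in range(4)])

# Stationary point theta* = (Phi^T Phi)^{-1} Phi^T y, computed by solving the
# normal equations rather than forming an explicit inverse
theta_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
```

Solving the normal equations with `np.linalg.solve` is cheaper and numerically safer than explicitly inverting $\Phi(x)^T\Phi(x)$.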
For this stationary point to be a minimum, the Hessian,
$$\nabla_\theta^2 L(\theta) = 2\,\Phi(x)^T\Phi(x),$$
must be positive definite. This holds when the columns of $\Phi(x)$ are linearly independent, which turns out to be
– the condition for $\Phi(x)^T\Phi(x)$ to be invertible,
– which is the necessary condition for the stationary solution to exist.
When is this true?
In summary:
– A problem of the type
$$\min_\theta\; \big\|\, y - \Phi(x)\,\theta \,\big\|^2$$
has a unique least squares solution iff $\Phi(x)$ has full column rank.
– The matrix
$$\Phi(x)^{\dagger} = \big(\Phi(x)^T\Phi(x)\big)^{-1}\,\Phi(x)^T$$
is called the (Moore-Penrose) pseudo-inverse of $\Phi(x)$; the solution is $\hat\theta_{LS} = \Phi(x)^{\dagger}\, y$.
Another way to arrive at the same answer:
– We have an inconsistent system of equations, $\Phi(x)\,\theta = y$; e.g., for the line,
$$\begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} \theta_0 \\ \theta_1 \end{pmatrix} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}.$$
This generally cannot be solved exactly: although $\Phi(x)$ has full (column) rank, it is “tall” (has more rows than columns) and thus is not invertible.
– To make a consistent system, we multiply both sides by $\Phi(x)^T$:
$$\Phi(x)^T\Phi(x)\,\theta = \Phi(x)^T\, y,$$
whose solution is given by the pseudo-inverse:
$$\hat\theta = \big(\Phi(x)^T\Phi(x)\big)^{-1}\,\Phi(x)^T\, y.$$
We have just seen that this is the best approximate solution to the original problem in the (unweighted) least squares sense.
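NumPy exposes this route directly; a brief self-contained sketch (illustrative data) confirming that the pseudo-inverse and the normal equations agree:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * x + 0.2 * rng.standard_normal(50)

# "Tall" design matrix for the line: Phi theta = y is inconsistent
Phi = np.column_stack([np.ones_like(x), x])

# Least squares solution via the Moore-Penrose pseudo-inverse
theta_pinv = np.linalg.pinv(Phi) @ y
# Same answer from the normal equations Phi^T Phi theta = Phi^T y
theta_ne = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
assert np.allclose(theta_pinv, theta_ne)
```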
In summary, when $\Phi(x)$ has full column rank, the least squares solution is straightforward to compute. For line fitting, with the sample moments $\bar x = \frac{1}{n}\sum_i x_i$, $\overline{x^2} = \frac{1}{n}\sum_i x_i^2$, $\bar y = \frac{1}{n}\sum_i y_i$, and $\overline{xy} = \frac{1}{n}\sum_i x_i y_i$,
$$\frac{1}{n}\,\Phi(x)^T\Phi(x) = \begin{pmatrix} 1 & \bar x \\ \bar x & \overline{x^2} \end{pmatrix}, \qquad \frac{1}{n}\,\Phi(x)^T y = \begin{pmatrix} \bar y \\ \overline{xy} \end{pmatrix},$$
so that
$$\hat\theta_{LS} = \begin{pmatrix} 1 & \bar x \\ \bar x & \overline{x^2} \end{pmatrix}^{-1} \begin{pmatrix} \bar y \\ \overline{xy} \end{pmatrix} = \frac{1}{\overline{x^2} - \bar x^2} \begin{pmatrix} \overline{x^2}\,\bar y - \bar x\,\overline{xy} \\ \overline{xy} - \bar x\,\bar y \end{pmatrix},$$
which is the solution that we had obtained before, but now with less work. Of course, we know from ECE 174 that there is a deep geometric formalism at play here.
To summarize, least squares:
– Estimates the function $f(x;\theta)$ of maximum likelihood for the scalar model
$$y = f(x;\theta) + \varepsilon,$$
where ε is scalar iid zero-mean Gaussian noise (AWGN) of known variance:
$$P_E(\varepsilon) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\varepsilon^2/2\sigma^2}.$$
– The method is general:
– Other models $f(x;\theta)$ will lead to other least squares problems.
– If the variance is unknown, we do not have a pure LS problem.
– If we have a vector model, in general we have weighted LS.
– If the error is not Gaussian, the problem is not least squares.