Maximum Likelihood Estimation (MLE)
Ken Kreutz-Delgado (Nuno Vasconcelos)
ECE 175A – Winter 2012 – UCSD
Statistical Learning

Goal: Given a relationship between a feature vector x and a vector y, and iid data samples (x_i, y_i), find an approximating function f(x) ≈ y. This is called training or learning. Two major types of learning:
– Unsupervised learning (“blind” curve fitting): only X is known.
– Supervised learning (classification or regression): both X and the target value Y are known during training; only X is known at test time.
To measure the quality of a predictor we need a loss function, and we care about how well matched the metric is to the properties of the data. In what sense is a decision rule “optimal”?
– The Average Loss (Risk) is the function that we want to minimize.
– The Risk depends on the true y and on the prediction.
– It tells us how good our predictor/estimator is.
A common choice assigns zero loss for no error and equal loss for the two error types. E.g., for the problem of deciding between a dart frog and a regular frog:

                          prediction
                   dart frog    regular frog
  true regular frog     1             0
  true dart frog        0             1

This is equivalent to the “zero/one” loss: L[y, f(x)] = 0 if f(x) = y and 1 otherwise. Under this loss the BDR becomes

i^*(x) = \arg\min_i \sum_j L[j, i]\, P_{Y|X}(j\,|\,x) = \arg\min_i \sum_{j \neq i} P_{Y|X}(j\,|\,x)
Under the “zero/one” loss, the risk is equal to the data-conditional Probability of Error, and so the BDR for this loss minimizes the conditional probability of error. This is given by the MAP BDR:

i^*(x) = \arg\max_i P_{Y|X}(i\,|\,x)
– Pick the class with the largest posterior probability given the observation x:

i^*(x) = \arg\min_i \sum_{j \neq i} P_{Y|X}(j\,|\,x) = \arg\min_i \left[ 1 - P_{Y|X}(i\,|\,x) \right] = \arg\max_i P_{Y|X}(i\,|\,x)

– We will often simplify our discussion by assuming this loss.
– But you should always be aware that other losses may be used.
The following three forms of the BDR are optimal and equivalent:

– 1) i^*(x) = \arg\max_i P_{Y|X}(i\,|\,x)
– 2) i^*(x) = \arg\max_i P_{X|Y}(x\,|\,i)\, P_Y(i)
– 3) i^*(x) = \arg\max_i \left[ \log P_{X|Y}(x\,|\,i) + \log P_Y(i) \right]

– Form 1) is usually hard to use; 3) is frequently easier than 2).
An important special case is the Gaussian classifier:
– The pdf of each class i is a Gaussian of mean μ_i and covariance Σ_i:

P_{X|Y}(x\,|\,i) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_i|}} \exp\left\{ -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) \right\}

– The BDR then becomes

i^*(x) = \arg\min_i \left[ (x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) + \log (2\pi)^d |\Sigma_i| - 2 \log P_Y(i) \right]
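As an illustration, here is a minimal sketch of this rule in Python; the function name and argument layout are illustrative, not part of the original slides.

```python
import numpy as np

def gaussian_bdr(x, means, covs, priors):
    """Pick the class i minimizing
    (x - mu_i)^T Sigma_i^{-1} (x - mu_i) + log((2*pi)^d |Sigma_i|) - 2*log P_Y(i)."""
    d = x.shape[0]
    scores = []
    for mu, S, p in zip(means, covs, priors):
        diff = x - mu
        maha = diff @ np.linalg.solve(S, diff)                  # Mahalanobis term
        log_det = np.log((2 * np.pi) ** d * np.linalg.det(S))   # normalization term
        scores.append(maha + log_det - 2 * np.log(p))
    return int(np.argmin(scores))
```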
This BDR can be written as

i^*(x) = \arg\min_i \left[ d_i^2(x, \mu_i) + \alpha_i \right]

with

d_i^2(x, y) = (x-y)^T \Sigma_i^{-1} (x-y), \qquad \alpha_i = \log (2\pi)^d |\Sigma_i| - 2 \log P_Y(i),

and can be interpreted as a nearest class-neighbor classifier which uses a “funny metric”:
– Note that each class has its own distance function, which is the square of the Mahalanobis distance for that class plus the α_i term for that class.
– We effectively use different metrics in different regions of the space.

[Figure: two Gaussian classes with different covariances; the decision boundary is the set of points where P_{Y|X}(1|x) = 0.5.]
Special case: all classes have the same covariance, Σ_i = Σ. The BDR becomes

i^*(x) = \arg\min_i \left[ d^2(x, \mu_i) + \alpha_i \right]

with

d^2(x, y) = (x-y)^T \Sigma^{-1} (x-y), \qquad \alpha_i = -2 \log P_Y(i).

– The α_i can be dropped when all classes have equal probability (the case shown in the figure). In this case the classifier is close in form to a NN classifier with the Mahalanobis distance, but instead of finding the nearest training data point, it looks for the nearest class prototype μ_i using the Mahalanobis distance.

[Figure: classes with a shared covariance; the decision boundary is the set of points where P_{Y|X}(1|x) = 0.5.]
Classification with Σ_i = Σ

– One important property of this case is that the decision boundary is a hyperplane (Homework). This can be shown by computing the set of points x such that g_i(x) = g_j(x) and showing that they satisfy

w^T (x - x_0) = 0

with normal w = \Sigma^{-1}(\mu_i - \mu_j) and

x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\mu_i - \mu_j}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \log \frac{P_Y(i)}{P_Y(j)}.

– x_0 can be any fixed point on the hyperplane, but it is standard to choose it to have minimum norm, in which case w and x_0 are then parallel.

[Figure: training points x_1, x_2, x_3, ..., x_n and the separating hyperplane with normal w through x_0; the discriminant is the set of points where P_{Y|X}(1|x) = 0.5.]
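The sketch below numerically verifies this hyperplane characterization for assumed example values of μ_i, μ_j, Σ, and the priors: points of the form x_0 + t·v with w^T v = 0 yield equal class discriminants.

```python
import numpy as np

# Assumed example values for a two-class problem with a shared covariance.
mu_i, mu_j = np.array([1.0, 0.0]), np.array([-1.0, 1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
p_i, p_j = 0.7, 0.3

Sinv = np.linalg.inv(Sigma)
diff = mu_i - mu_j
w = Sinv @ diff                                            # normal of the boundary
x0 = 0.5 * (mu_i + mu_j) - diff * np.log(p_i / p_j) / (diff @ Sinv @ diff)

def g(x, mu, p):
    """Class discriminant for a shared covariance (class-independent terms dropped)."""
    return -0.5 * (x - mu) @ Sinv @ (x - mu) + np.log(p)

v = np.array([-w[1], w[0]])                                # a direction orthogonal to w
for t in (-2.0, 0.0, 3.0):
    x = x0 + t * v
    assert np.isclose(g(x, mu_i, p_i), g(x, mu_j, p_j))    # equal on the hyperplane
print("boundary points verified")
```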
Euclidean Distance Template Matching with class means as templates

– A further special case: Σ = σ²I. The BDR reduces to

i^*(x) = \arg\min_i \left[ d^2(x, \mu_i) + \alpha_i \right], \qquad d^2(x, y) = \frac{\|x - y\|^2}{\sigma^2}, \qquad \alpha_i = -2 \log P_Y(i),

and with equal class priors it simply assigns x to the class whose mean (template) is closest in Euclidean distance.
– E.g., for digit classification the templates are the per-class mean images.
– Compare the complexity to nearest neighbors!
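A minimal sketch of this template-matching rule; the per-class mean vectors (e.g. mean digit images) are assumed to be given.

```python
import numpy as np

def template_match(x, class_means):
    """Assign x to the class whose mean (template) is nearest in Euclidean distance.
    Cost: one distance per class, versus one per training point for nearest neighbors."""
    dists = [np.sum((x - mu) ** 2) for mu in class_means]
    return int(np.argmin(dists))
```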
It is also useful to look at the original definition of the BDR in an alternative form:

i^*(x) = \arg\max_i g_i(x)

with

g_i(x) = P_{Y|X}(i\,|\,x) = \frac{P_{X|Y}(x\,|\,i)\, P_Y(i)}{P_X(x)},

where, for a two-class problem, P_X(x) = P_{X|Y}(x\,|\,0)\, P_Y(0) + P_{X|Y}(x\,|\,1)\, P_Y(1).
For two Gaussian classes, this posterior takes the form of a sigmoid:

P_{Y|X}(1\,|\,x) = \frac{1}{1 + \exp\left\{ \frac{1}{2}\left[ d_1^2(x, \mu_1) + \alpha_1 - d_0^2(x, \mu_0) - \alpha_0 \right] \right\}}

where, as before,

d_i^2(x, y) = (x-y)^T \Sigma_i^{-1} (x-y), \qquad \alpha_i = \log (2\pi)^d |\Sigma_i| - 2 \log P_Y(i).
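A minimal sketch that evaluates this sigmoid form of the posterior for a two-class Gaussian problem; the function name and argument layout are illustrative.

```python
import numpy as np

def posterior_class1(x, mus, Sigmas, priors):
    """P_{Y|X}(1|x) in the sigmoid form 1 / (1 + exp{(1/2)[d_1^2 + a_1 - d_0^2 - a_0]}).
    mus, Sigmas, priors hold the class-0 and class-1 parameters, in that order."""
    d = len(x)
    terms = []
    for mu, S, p in zip(mus, Sigmas, priors):
        diff = x - mu
        d2 = diff @ np.linalg.solve(S, diff)
        alpha = np.log((2 * np.pi) ** d * np.linalg.det(S)) - 2 * np.log(p)
        terms.append(d2 + alpha)
    return 1.0 / (1.0 + np.exp(0.5 * (terms[1] - terms[0])))
```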
The posterior P_{Y|X}(1|x) is a sigmoid and looks like this:

[Figure: sigmoid-shaped posterior P_{Y|X}(1|x) as a function of x, crossing the value 0.5 at the decision boundary.]
– In neural networks, the sigmoid can be interpreted as a posterior probability for a Gaussian problem where the class covariances are the same.
To implement the Gaussian classifier we need to know the values of the parameters μ_i, Σ_i, and P_Y(i). In practice they are estimated from data. To design a Gaussian classifier:
– Start from a collection of datasets: D^{(i)} = \{x_1^{(i)}, \ldots, x_{n_i}^{(i)}\} = set of examples from class i.
– For each class, estimate the Gaussian BDR parameters using

\hat{\mu}_i = \frac{1}{n_i} \sum_j x_j^{(i)}, \qquad \hat{\Sigma}_i = \frac{1}{n_i} \sum_j (x_j^{(i)} - \hat{\mu}_i)(x_j^{(i)} - \hat{\mu}_i)^T, \qquad \hat{P}_Y(i) = \frac{n_i}{T},

where T is the total number of examples (over all classes).
– E.g., below are sample means computed for digit classification:

[Figure: sample mean images for the digit classes.]
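A minimal sketch of this estimation step, assuming one array of examples per class; the function name is illustrative.

```python
import numpy as np

def fit_gaussian_classifier(datasets):
    """datasets[i] is an (n_i, d) array of examples from class i.
    Returns the plug-in estimates (means, covariances, priors)."""
    T = sum(len(D) for D in datasets)            # total number of examples
    means, covs, priors = [], [], []
    for D in datasets:
        mu = D.mean(axis=0)
        diff = D - mu
        means.append(mu)
        covs.append(diff.T @ diff / len(D))      # ML covariance (divide by n_i)
        priors.append(len(D) / T)
    return means, covs, priors
```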
To classify a new point, use the estimate of the BDR found from these plug-in parameters:

i^*(x) = \arg\min_i \left[ (x-\hat{\mu}_i)^T \hat{\Sigma}_i^{-1} (x-\hat{\mu}_i) + \log (2\pi)^d |\hat{\Sigma}_i| - 2 \log \hat{P}_Y(i) \right]
Important: if the probability estimates are poor, the optimality guarantees of the BDR cease to be valid!!
– The BDR is guaranteed to achieve the minimum loss only when we use the true probabilities.
– With poor probability estimates, we could be implementing a classifier that is quite distant from the optimal one.
– E.g., the true P_{X|Y}(x|i) may be too complex to be captured by simple parametric models (e.g. a single Gaussian).
How do we estimate the parameters which define the pdf? We shall utilize the maximum likelihood (ML) principle:
– 1) We choose a parametric model for all probabilities. To make the dependence on the parameters explicit, we denote the class-conditional distributions by

P_{X|Y}(x\,|\,i; \Theta_i).

– Note that Θ_i is NOT a random variable. It is a deterministic but unknown parameter, and the probabilities are a function of this unknown parameter.
– 2) Assemble a collection of datasets: D^{(i)} = \{x_1^{(i)}, \ldots, x_{n_i}^{(i)}\} = set of examples from each class i.
– 3) Select the values of the parameters of class i to be the ones that maximize the probability of the data from that class:

\hat{\Theta}_i = \arg\max_{\Theta} P_{X|Y}(D^{(i)}\,|\,i; \Theta) = \arg\max_{\Theta} \log P_{X|Y}(D^{(i)}\,|\,i; \Theta)

Note that it makes no difference whether we maximize probabilities or their logs, since the log is monotonically increasing.
– Each dataset D^{(i)} is considered independently.
– Each parameter vector Θ_i is estimated only from the sample D^{(i)}.
– Hence, we simply have to repeat the procedure for all classes. In what follows we therefore consider the generic problem of estimating Θ from a dataset D.

Given a parametric model P_X(x; Θ) (a statistical model) for the data D, we define a likelihood of the parameter vector Θ given D:

L(\Theta; D) = a(D)\, P_X(D; \Theta),

where a(D) is any positive factor that does not depend on Θ. Viewed as a function of Θ, this is called the likelihood of the parameter given the data D, or simply the likelihood function, and the ML estimate is the parameter value that maximizes it:

\hat{\Theta}_{ML} = \arg\max_{\Theta} P_X(D; \Theta)
Maximum Likelihood Principle
The likelihood function is a function of the parameter Θ:
– In general it does NOT have the same shape as the density itself; e.g., the likelihood function of a Gaussian is not bell-shaped.
– It is defined only after we have a data sample; e.g., for a single Gaussian observation D,

P_X(D; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(D-\mu)^2}{2\sigma^2} \right\}

is, viewed as a function of the parameters, the likelihood given D.
How do we find the maximum of the likelihood? In the scalar case, x is a maximum of a twice-differentiable function when:
– The first derivative at x is zero (x is a stationary point).
– The second derivative is negative at x.
Given a collection D = {x_1, ..., x_N} of independent and identically distributed (iid) measurements, the (standard) likelihood is

P_X(D; \Theta) = \prod_{i=1}^{N} P_X(x_i; \Theta).
For a Gaussian model, maximizing the log-likelihood with respect to μ yields

\hat{\mu}_{ML} = \frac{1}{N} \sum_{i=1}^{N} x_i,

which is just the sample mean.
Similarly, maximizing with respect to the variance yields

\hat{\sigma}^2_{ML} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu}_{ML})^2.

Note that this is just the sample variance.
– If the sample is {10, 20, 30, 40, 50}, then \hat{\mu}_{ML} = 30 and \hat{\sigma}^2_{ML} = 200.
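A quick numerical check of this example; note the division by N rather than N−1, as the ML estimate requires.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
mu_ml = x.mean()                       # 30.0
var_ml = np.mean((x - mu_ml) ** 2)     # 200.0 (divide by N, not N-1)
print(mu_ml, var_ml)
```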
When the parameter is a vector, the maximization of a function f is done by means of the gradient,

\nabla f(w) = \left( \frac{\partial f}{\partial w_1}, \ldots, \frac{\partial f}{\partial w_n} \right)^T.

Geometric interpretation:
– It points in the direction of maximum growth of the function (the steepest ascent direction).
– This makes it perpendicular to the contours where the function is constant.
– The above is the gradient for the simple (unweighted) Euclidean norm (aka the Cartesian gradient).

[Figure: contour plot of a function f(x, y) with gradient vectors drawn at two points, perpendicular to the level curves.]
A point x is a stationary point of f if and only if ∇f(x) = 0:
– There is no direction of growth at x.
– Also ∇(−f)(x) = 0, so there is no direction of decrease at x.
– We are therefore at a local minimum, a local maximum, or a “saddle” point.
To determine which, we need second-order conditions.

[Figure: examples of a maximum, a minimum, and a saddle point.]
The second-order conditions involve the Hessian matrix of second partial derivatives,

\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}.

Note that the Hessian is symmetric. It defines the quadratic function

f(x_0) + \frac{1}{2}(x - x_0)^T \nabla^2 f(x_0)(x - x_0)

that best approximates f(x) at a stationary point x_0.
Since the gradient is zero at x_0, we have f(x) ≈ f(x_0) + (1/2)(x − x_0)^T H(x_0)(x − x_0) near x_0, and x_0 is
– a maximum when the function f(x) can be locally approximated by an “upwards pointing” quadratic bowl (H(x_0) is negative definite);
– a minimum when the function can be locally approximated by a “downwards pointing” quadratic bowl (H(x_0) is positive definite);
– a saddle point otherwise (H(x_0) is indefinite).

[Figure: a maximum, a minimum, and a saddle point with the corresponding local quadratic approximations.]
For any symmetric matrix M, the quadratic function f(x) = x^T M x
– is an upwards pointing quadratic bowl at the point x = 0 when M is negative definite;
– is a downwards pointing quadratic bowl at x = 0 when M is positive definite;
– has a saddle point at x = 0 otherwise.
Hence the type of stationary point is determined by the definiteness property of the Hessian at x_0; e.g., x_0 is a maximum when the Hessian is negative definite at x_0.
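As a small numerical illustration, the sketch below classifies a stationary point from the eigenvalues of a symmetric Hessian (for a symmetric matrix, all eigenvalues negative means negative definite, all positive means positive definite); the function name and tolerance are illustrative.

```python
import numpy as np

def classify_stationary_point(H, tol=1e-10):
    """Classify a stationary point from the eigenvalues of the (symmetric) Hessian H."""
    eig = np.linalg.eigvalsh(H)
    if np.all(eig < -tol):
        return "maximum"                 # negative definite
    if np.all(eig > tol):
        return "minimum"                 # positive definite
    return "saddle or degenerate"

print(classify_stationary_point(np.array([[-2.0, 0.0], [0.0, -1.0]])))  # maximum
print(classify_stationary_point(np.array([[ 2.0, 0.0], [0.0, -1.0]])))  # saddle or degenerate
```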
In summary, w_0 is a (local) minimum of f if and only if
– f has zero gradient at w_0, and
– the Hessian of f at w_0 is positive definite,
where

\nabla f(w) = \left( \frac{\partial f}{\partial w_1}, \ldots, \frac{\partial f}{\partial w_n} \right)^T, \qquad \nabla^2 f(w) = \begin{bmatrix} \frac{\partial^2 f}{\partial w_1^2} & \cdots & \frac{\partial^2 f}{\partial w_1 \partial w_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial w_n \partial w_1} & \cdots & \frac{\partial^2 f}{\partial w_n^2} \end{bmatrix}.

(For a maximum, replace “positive definite” by “negative definite.”)
For maximum likelihood estimation we therefore need parameter values such that the gradient of the (log-)likelihood is zero,

\nabla_{\Theta} \log P_X(D; \Theta) \big|_{\Theta = \hat{\Theta}_{ML}} = 0,

and the Hessian condition holds:

\nabla^2_{\Theta} \log P_X(D; \Theta) \big|_{\Theta = \hat{\Theta}_{ML}} \ \text{is negative definite}.
Let us revisit the Gaussian example. For an iid sample D = {x_1, ..., x_N} with parameters Θ = (μ, σ), the log-likelihood is

\log P_X(D; \mu, \sigma) = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (x_i - \mu)^2 - N \log \sigma - \frac{N}{2} \log 2\pi,
from which we compute the first-order derivatives

\frac{\partial}{\partial \mu} \log P_X(D; \mu, \sigma) = \frac{1}{\sigma^2} \sum_i (x_i - \mu), \qquad \frac{\partial}{\partial \sigma} \log P_X(D; \mu, \sigma) = \frac{1}{\sigma^3} \sum_i (x_i - \mu)^2 - \frac{N}{\sigma},

whose zeros are the ML estimates \hat{\mu} and \hat{\sigma} found before.
Differentiating once more yields the second-order derivatives

\frac{\partial^2}{\partial \mu^2} \log P_X(D; \mu, \sigma) = -\frac{N}{\sigma^2}, \qquad \frac{\partial^2}{\partial \sigma \partial \mu} \log P_X(D; \mu, \sigma) = -\frac{2}{\sigma^3} \sum_i (x_i - \mu), \qquad \frac{\partial^2}{\partial \sigma^2} \log P_X(D; \mu, \sigma) = \frac{N}{\sigma^2} - \frac{3}{\sigma^4} \sum_i (x_i - \mu)^2.
Evaluating these at the stationary point (\hat{\mu}, \hat{\sigma}), where \sum_i (x_i - \hat{\mu}) = 0 and \sum_i (x_i - \hat{\mu})^2 = N\hat{\sigma}^2, gives the Hessian

\nabla^2 \log P_X(D; \hat{\mu}, \hat{\sigma}) = \begin{bmatrix} -N/\hat{\sigma}^2 & 0 \\ 0 & -2N/\hat{\sigma}^2 \end{bmatrix},

which is clearly negative definite at the stationary point. Thus we have determined the MLE of the parameters.
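A small numerical sanity check of this conclusion, on assumed synthetic data: it evaluates the second-order derivatives above at the ML estimates and confirms the resulting Hessian is negative definite.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=3.0, size=500)        # synthetic sample (assumed values)
N = len(x)
mu, s = x.mean(), np.sqrt(np.mean((x - x.mean()) ** 2))   # ML estimates

# Second-order derivatives of the log-likelihood, evaluated at (mu, s):
d2_mumu = -N / s**2
d2_musig = -2.0 * np.sum(x - mu) / s**3                    # ~ 0, since sum(x - mu_hat) = 0
d2_sigsig = N / s**2 - 3.0 * np.sum((x - mu) ** 2) / s**4  # = -2N / s^2
H = np.array([[d2_mumu, d2_musig], [d2_musig, d2_sigsig]])
print(np.linalg.eigvalsh(H))                               # both negative: a maximum
```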
We also need the ML estimate of the class priors. For a binary label Y with π = P_Y(1), note that P_Y(y; π) can be written as

P_Y(y; \pi) = \pi^{\chi(y=1)} (1 - \pi)^{\chi(y=0)},

where χ is the so-called indicator (or 0-1) function (χ(A) = 1 if A is true and 0 otherwise). For an iid sample D = {y_1, ..., y_N} the log-likelihood is therefore

\log P_Y(D; \pi) = \sum_{i=1}^{N} \left[ \chi(y_i = 1) \log \pi + \chi(y_i = 0) \log (1 - \pi) \right].
Setting the derivative with respect to π equal to zero,

\frac{\partial}{\partial \pi} \log P_Y(D; \pi) = \frac{1}{\pi} \sum_{i=1}^{N} \chi(y_i = 1) - \frac{1}{1-\pi} \sum_{i=1}^{N} \chi(y_i = 0) = 0,
yields the MLE estimate

\hat{\pi}_{ML} = \frac{1}{N} \sum_{i=1}^{N} \chi(y_i = 1).

Note that this is just the relative frequency of occurrence of the value “1” in the sample, i.e. the MLE is just the count of the number of 1’s over the total number of points! Again we see that the MLE yields an intuitively pleasing estimate of the unknown parameters.
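A one-line check of this estimator on hypothetical labels:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1, 0, 1])   # hypothetical class labels
pi_ml = np.mean(y == 1)                   # relative frequency of "1"
print(pi_ml)                              # 0.625
```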
To confirm that this is a maximum, note that the second derivative

\frac{\partial^2}{\partial \pi^2} \log P_Y(D; \pi) = -\frac{1}{\pi^2} \sum_{i=1}^{N} \chi(y_i = 1) - \frac{1}{(1-\pi)^2} \sum_{i=1}^{N} \chi(y_i = 0)

is negative for 0 < π < 1.
All of this is generalized to the random-vector case as follows:
– Assemble a collection of datasets D^{(i)} = \{x_1^{(i)}, \ldots, x_{n_i}^{(i)}\} = set of iid vector examples from each class i.
– Assume Gaussian class-conditional densities with parameters μ_i, Σ_i, and class priors P_Y(i).
– The MLE estimates in the vector random data case are

\hat{\mu}_i = \frac{1}{n_i} \sum_j x_j^{(i)}, \qquad \hat{\Sigma}_i = \frac{1}{n_i} \sum_j (x_j^{(i)} - \hat{\mu}_i)(x_j^{(i)} - \hat{\mu}_i)^T, \qquad \hat{P}_Y(i) = \frac{n_i}{T}.

– These are the sample estimates given earlier with no justification.
– The ML solutions are intuitive, which is usually the case.
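Putting the pieces together, here is a minimal end-to-end sketch: it computes the ML estimates above for two classes and applies the plug-in Gaussian BDR. The synthetic data and all numerical values are assumed for illustration, and the constant (2π)^d term is dropped since it is common to all classes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic two-class vector data (illustrative values only).
D = [rng.multivariate_normal([0.0, 0.0], np.eye(2), size=200),
     rng.multivariate_normal([3.0, 1.0], [[1.0, 0.3], [0.3, 2.0]], size=100)]

T = sum(len(Di) for Di in D)
params = []
for Di in D:
    mu = Di.mean(axis=0)
    Sigma = (Di - mu).T @ (Di - mu) / len(Di)      # ML covariance
    params.append((mu, Sigma, len(Di) / T))        # (mean, covariance, prior)

def bdr(x):
    """Plug-in Gaussian BDR using the ML parameter estimates."""
    scores = []
    for mu, Sigma, p in params:
        diff = x - mu
        scores.append(diff @ np.linalg.solve(Sigma, diff)
                      + np.log(np.linalg.det(Sigma)) - 2 * np.log(p))
    return int(np.argmin(scores))

print(bdr(np.array([2.5, 0.8])), bdr(np.array([-0.5, 0.2])))
```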