

SLIDE 1

Maximum Likelihood Estimation (MLE)

Ken Kreutz-Delgado (Nuno Vasconcelos)

ECE 175A – Winter 2012 – UCSD

SLIDE 2

Statistical Learning

Goal: Given a relationship between a feature vector x and a vector y, and iid data samples (x_i, y_i), find an approximating function

    ŷ = f(x) ≈ y.

This is called training or learning. Two major types of learning:

  • Unsupervised Classification (aka Clustering) or Regression ("blind" curve fitting): only X is known.

  • Supervised Classification or Regression: both X and the target value Y are known during training; only X is known at test time.

[Figure: x → f(·) → ŷ]

SLIDE 3

Optimal Classifiers

  • Performance depends on the data/feature space metric
  • Some metrics are better than others
  • The meaning of "better" is connected to how well adapted the metric is to the properties of the data
  • But can we be more rigorous? What do we mean by "optimal"?
  • To talk about optimality we need to talk about a cost or loss L(y, ŷ) on the prediction ŷ = f(x):
    – Average Loss (Risk) is the function that we want to minimize
    – Risk depends on the true y and the prediction ŷ
    – It tells us how good our predictor/estimator is

SLIDE 4

Data-Conditional Risk, R(x,i), for 0/1 Loss

  • An important special case of interest:
    – zero loss for no error and equal loss for the two error types

[Loss table: rows = true class (regular frog, dart frog), columns = predicted class; loss 1 for either error, 0 for a correct decision]

  • This is equivalent to the "zero/one" loss:

    L(i, j) = 1 − δ(i, j) = { 0 if i = j, 1 if i ≠ j }

  • Under this loss the optimal decision rule is

    i*(x) = argmin_i Σ_j L(j, i) P_{Y|X}(j|x) = argmin_i Σ_{j≠i} P_{Y|X}(j|x)

SLIDE 5

Data-Conditional Risk, R(x,i), for 0/1 Loss

  • Note, then, that in the 0/1 loss case,

    R(x, i) = E_{Y|X}[ L(Y, i) | x ] = Σ_{j≠i} P_{Y|X}(j|x) = 1 − P_{Y|X}(i|x)

  • I.e., the data-conditional risk under the 0/1 loss is equal to the data-conditional Probability of Error, 1 − P_{Y|X}(i|x)
  • Thus the optimal Bayesian decision rule (BDR) under 0/1 loss minimizes the conditional probability of error. This is given by the MAP BDR:

    i*(x) = argmax_i P_{Y|X}(i|x)

SLIDE 6

Data-Conditional Risk, R(x,i), for 0/1 Loss

  • Summarizing:

    i*(x) = argmin_i Σ_{j≠i} P_{Y|X}(j|x) = argmin_i [ 1 − P_{Y|X}(i|x) ] = argmax_i P_{Y|X}(i|x)

  • The optimal decision rule is the MAP Rule:
    – Pick the class with the largest probability given the observation x
  • This is the Bayes Decision Rule (BDR) for the 0/1 loss
    – We will often simplify our discussion by assuming this loss
    – But you should always be aware that other losses may be used

SLIDE 7

BDR (under 0/1 Loss)

  • For the zero/one loss, the following three decision rules are optimal and equivalent:
    – 1) i*(x) = argmax_i P_{Y|X}(i|x)
    – 2) i*(x) = argmax_i P_{X|Y}(x|i) P_Y(i)
    – 3) i*(x) = argmax_i [ log P_{X|Y}(x|i) + log P_Y(i) ]
    – Form 1) is usually hard to use; 3) is frequently easier than 2)

SLIDE 8

Gaussian BDR Classifier (0/1 Loss)

  • A very important case is that of Gaussian classes
    – The pdf of each class i is a Gaussian of mean μ_i and covariance Σ_i:

    P_{X|Y}(x|i) = 1/√((2π)^d |Σ_i|) · exp{ −½ (x − μ_i)^T Σ_i⁻¹ (x − μ_i) }

  • The Gaussian BDR under 0/1 Loss is

    i*(x) = argmax_i { −½ (x − μ_i)^T Σ_i⁻¹ (x − μ_i) − ½ log[ (2π)^d |Σ_i| ] + log P_Y(i) }
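The discriminant above can be sketched directly in NumPy; the two classes and the test point below are illustrative assumptions, not course data:

```python
import numpy as np

def gaussian_discriminant(x, mu, Sigma, prior):
    """The Gaussian BDR score: log of P(x|i) P(i) for one class."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
            - 0.5 * np.log((2 * np.pi) ** d * np.linalg.det(Sigma))
            + np.log(prior))

# Two hypothetical 2-D Gaussian classes (illustrative parameters).
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2 * np.eye(2)]
priors = [0.5, 0.5]

x = np.array([2.5, 2.8])
scores = [gaussian_discriminant(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
i_star = int(np.argmax(scores))
print(i_star)  # → 1 (x is much closer to the second class mean)
```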

SLIDE 9

Gaussian Classifier (0/1 Loss)

  • This can be written as

    i*(x) = argmin_i [ d_i²(x, μ_i) + α_i ]

with

    d_i²(x, y) = (x − y)^T Σ_i⁻¹ (x − y),   α_i = log[ (2π)^d |Σ_i| ] − 2 log P_Y(i),

and can be interpreted as a nearest class-neighbor classifier which uses a "funny metric"
    – Note that each class has its own distance function: the square of the Mahalanobis distance for that class plus the α term for that class
    – We effectively use different metrics in different regions of the space

[Figure: two Gaussian classes; decision boundary where P_{Y|X}(i|x) = 0.5]

SLIDE 10

Gaussian Classifier (0/1 Loss)

  • A special case of interest is when all classes have the same covariance, Σ_i = Σ:

    i*(x) = argmin_i [ d²(x, μ_i) + α_i ]

with

    d²(x, y) = (x − y)^T Σ⁻¹ (x − y),   α_i = −2 log P_Y(i)

  • Note:
    – α_i can be dropped when all classes have equal probability (the case shown in the figure). In this case the classifier is close in form to a NN classifier with Mahalanobis distance, but instead of finding the nearest training data point, it looks for the nearest class prototype μ_i using the Mahalanobis distance

[Figure: equal-covariance classes; decision boundary where P_{Y|X}(i|x) = 0.5]

SLIDE 11

Gaussian Classifier (0/1 Loss)

  • Binary Classification with Σ_i = Σ
    – One important property of this case is that the decision boundary is a hyperplane (Homework). This can be shown by computing the set of points x such that

    d²(x, μ_1) + α_1 = d²(x, μ_2) + α_2

and showing that they satisfy

    w^T (x − x_0) = 0.

  • This is the equation of a hyperplane with normal w. x_0 can be any fixed point on the hyperplane, but it is standard to choose it to have minimum norm, in which case w and x_0 are then parallel.

[Figure: training points x_1, ..., x_n separated by the hyperplane through x_0 with normal w; discriminant where P_{Y|X}(i|x) = 0.5]

SLIDE 12

Gaussian Classifier (0/1 Loss)

  • Furthermore, if all the covariances are the identity, Σ_i = I:

    i*(x) = argmin_i [ d²(x, μ_i) + α_i ]

with

    d²(x, y) = ||x − y||²,   α_i = −2 log P_Y(i)

  • This is just Euclidean Distance Template Matching with the class means as templates
    – E.g., for digit classification: compare the image to be classified against the mean image of each digit class
    – Compare complexity to nearest neighbors!
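A toy sketch of template matching under these assumptions (the 4-pixel "templates" below are invented for illustration, not real digit means):

```python
import numpy as np

# Hypothetical class templates (means), e.g. tiny 4-pixel "digit" images.
templates = {
    0: np.array([0.9, 0.1, 0.9, 0.1]),
    1: np.array([0.1, 0.9, 0.1, 0.9]),
}

def classify(x):
    # Nearest class mean in Euclidean distance (equal priors, Sigma_i = I).
    return min(templates, key=lambda i: np.sum((x - templates[i]) ** 2))

x = np.array([0.8, 0.2, 0.7, 0.0])
print(classify(x))  # → 0 (closest to the first template)
```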

SLIDE 13

The Sigmoid in 0/1 Loss Detection

  • We have derived all of this from the log-based 0/1 BDR:

    i*(x) = argmax_i [ log P_{X|Y}(x|i) + log P_Y(i) ]

  • When there are only two classes, it is also interesting to look at the original definition in an alternative form:

    i*(x) = argmax_i g_i(x)

with

    g_i(x) = P_{Y|X}(i|x) = P_{X|Y}(x|i) P_Y(i) / P_X(x)
           = P_{X|Y}(x|i) P_Y(i) / [ P_{X|Y}(x|0) P_Y(0) + P_{X|Y}(x|1) P_Y(1) ]

SLIDE 14

The Sigmoid in MAP Detection

  • Note that this can be written as i*(x) = argmax_i g_i(x) with

    g_0(x) = 1 / ( 1 + [ P_{X|Y}(x|1) P_Y(1) ] / [ P_{X|Y}(x|0) P_Y(0) ] ),   g_1(x) = 1 − g_0(x)

  • For Gaussian classes, the posterior probability for "0" is

    g_0(x) = 1 / ( 1 + exp{ ½ [ d_0²(x, μ_0) + α_0 − d_1²(x, μ_1) − α_1 ] } )

where, as before,

    d_i²(x, y) = (x − y)^T Σ_i⁻¹ (x − y),   α_i = log[ (2π)^d |Σ_i| ] − 2 log P_Y(i)
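For two 1-D Gaussian classes this posterior can be evaluated directly; the means, variances, and prior below are illustrative assumptions:

```python
import numpy as np

def g0(x, mu0, mu1, s0, s1, p0):
    """Posterior P(Y=0|x) for two 1-D Gaussian classes; a sigmoid in x
    when the variances are equal. All parameters are illustrative."""
    d0 = (x - mu0) ** 2 / s0 ** 2
    d1 = (x - mu1) ** 2 / s1 ** 2
    a0 = np.log(2 * np.pi * s0 ** 2) - 2 * np.log(p0)
    a1 = np.log(2 * np.pi * s1 ** 2) - 2 * np.log(1 - p0)
    return 1.0 / (1.0 + np.exp(0.5 * (d0 + a0 - d1 - a1)))

# Equal variances and priors: at the midpoint x = 0.5 the posterior is 0.5,
# i.e. exactly on the decision boundary.
print(g0(0.5, mu0=0.0, mu1=1.0, s0=1.0, s1=1.0, p0=0.5))  # → 0.5
```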

SLIDE 15

The Sigmoid in MAP detection

  • The posterior density for class "0",

    g_0(x) = 1 / ( 1 + exp{ ½ [ d_0²(x, μ_0) + α_0 − d_1²(x, μ_1) − α_1 ] } ),

is a sigmoid and looks like this:

[Figure: sigmoid-shaped posterior; decision boundary where P_{Y|X}(1|x) = 0.5]

SLIDE 16

The Sigmoid in Neural Networks

  • The sigmoid function also appears in neural networks

– In neural networks, it can be interpreted as a posterior density for a Gaussian problem where the covariances are the same.

SLIDE 17

The Sigmoid in Neural Networks

  • But not necessarily when the covariances are different
SLIDE 18

Implementation

  • All of this is appealing, but in practice one doesn't know the values of the parameters μ_i, Σ_i, P_Y(i)
  • In the homework we use an "intuitive solution" to design a Gaussian classifier:
    – Start from a collection of datasets: D^(i) = {x_1^(i), ..., x_n^(i)} = set of examples from class i
    – For each class, estimate the Gaussian BDR parameters using

    μ̂_i = (1/n_i) Σ_j x_j^(i)

    Σ̂_i = (1/n_i) Σ_j (x_j^(i) − μ̂_i)(x_j^(i) − μ̂_i)^T

    P̂_Y(i) = n_i / T

where T is the total number of examples (over all classes)
    – E.g., below are sample means computed for digit classification:

[Figure: per-class sample-mean digit images]
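These estimators can be sketched on synthetic data (the two Gaussian "classes" below are invented for illustration, not the course's digits):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class training data (illustrative).
D = {
    0: rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2)),
    1: rng.normal(loc=[3.0, 3.0], scale=1.0, size=(100, 2)),
}
T = sum(len(Di) for Di in D.values())

mu_hat, Sigma_hat, prior_hat = {}, {}, {}
for i, Di in D.items():
    mu_hat[i] = Di.mean(axis=0)                       # sample mean
    centered = Di - mu_hat[i]
    Sigma_hat[i] = centered.T @ centered / len(Di)    # sample covariance (MLE: divides by n_i)
    prior_hat[i] = len(Di) / T                        # relative class frequency n_i / T
```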

SLIDE 19

A Practical Gaussian MAP Classifier

  • Instead of the ideal BDR

    i*(x) = argmax_i { −½ (x − μ_i)^T Σ_i⁻¹ (x − μ_i) − ½ log[ (2π)^d |Σ_i| ] + log P_Y(i) }

use the estimate of the BDR found from

    i*(x) = argmax_i { −½ (x − μ̂_i)^T Σ̂_i⁻¹ (x − μ̂_i) − ½ log[ (2π)^d |Σ̂_i| ] + log P̂_Y(i) }

SLIDE 20

Important

  • Warning: at this point all optimality claims for the BDR cease to be valid!!
  • The BDR is guaranteed to achieve the minimum loss only when we use the true probabilities
  • When we "plug in" probability estimates, we could be implementing a classifier that is quite distant from the optimal
    – E.g., if the P_{X|Y}(x|i) look like the example above, one could never approximate them well by using simple parametric models (e.g., a single Gaussian).

SLIDE 21

Maximum Likelihood Estimation (MLE)

  • Given a parameterized pdf, how should one estimate the parameters which define the pdf?
  • There are many techniques of "parameter estimation." We shall utilize the maximum likelihood (ML) principle.
  • This has three steps:
    – 1) We choose a parametric model for all probabilities.
  • To make this clear we denote the vector of parameters by Θ and the class-conditional distributions by

    P_{X|Y}(x | i; Θ)

  • Note: This is a classical statistics approach, which means that Θ is NOT a random variable. It is a deterministic but unknown parameter, and the probabilities are a function of this unknown parameter.

SLIDE 22

Maximum Likelihood Estimation (MLE)

  • The three steps continued:
    – 2) Assemble a collection of datasets: D^(i) = {x_1^(i), ..., x_n^(i)} = set of examples from each class i
    – 3) Select the values of the parameters of class i to be the ones that maximize the probability of the data from that class:

    Θ̂_i = argmax_Θ P_{D|Y}( D^(i) | i; Θ ) = argmax_Θ log P_{D|Y}( D^(i) | i; Θ )

Note that it does not make any difference whether we maximize probabilities or their logs.

SLIDE 23

Maximum Likelihood Estimation (MLE)

  • Since
    – each sample D^(i) is considered independently, and
    – each parameter vector Θ_i is estimated only from sample D^(i),
we simply have to repeat the procedure for all classes.
  • So, from now on we omit the class variable i:

    Θ̂_ML = argmax_Θ P_X(D; Θ) = argmax_Θ log P_X(D; Θ)

  • The function L(Θ; D) = P_X(D; Θ) is the likelihood of the parameter Θ given the data D, or simply the likelihood function.

SLIDE 24

The Likelihood Function

  • Given a parameterized family of pdf's (also known as a statistical model) for the data D, we define a likelihood of the parameter vector Θ given D:

    L(Θ) = L(Θ; D) = α(D) P_D(D; Θ),

where α(D) > 0 for all D, and α(D) is independent of the parameter Θ.
  • The choice α(D) = 1 yields the Standard Likelihood:

    L(Θ; D) = P_D(D; Θ),

which was shown on the previous slide.

SLIDE 25

Maximum Likelihood Principle

[Figure: a density P_X(x; Θ) plotted for two candidate parameter values Θ_1 and Θ_2, evaluated at an observed point x]

  • Given the observed x, the parameter Θ_1 is preferred over Θ_2 when

    L(Θ_1) = P_X(x; Θ_1) > P_X(x; Θ_2) = L(Θ_2)

SLIDE 26

The Likelihood Function

  • Note that the likelihood function is a function of the parameters Θ
  • It does not have the same shape as the density itself
  • E.g., the likelihood function of a Gaussian is not bell-shaped
  • The likelihood is defined only after we have a data sample; e.g., for a Gaussian with parameters Θ = (μ, σ),

    P_X(D; Θ) = Π_j 1/√(2πσ²) · exp{ −(x_j − μ)²/(2σ²) }

SLIDE 27

Maximum Likelihood Estimation (MLE)

  • Given a sample, to obtain the ML estimate we need to solve

    Θ̂_ML = argmax_Θ P(D; Θ)

  • When Θ is a scalar, this is high-school calculus:
  • We have a local maximum of f(x) at a point x when
    – the first derivative at x is zero (x is a stationary point), and
    – the second derivative is negative at x.

SLIDE 28

MLE Example

  • Gaussian with unknown mean and standard deviation:

    P_T(t; μ, σ) = 1/√(2πσ²) · exp{ −(t − μ)²/(2σ²) }

  • Given a data sample D = {T_1, ..., T_N} of independent and identically distributed (iid) measurements, the (standard) likelihood is

    L(μ, σ; T_1, ..., T_N) = Π_{i=1}^N 1/√(2πσ²) · exp{ −(T_i − μ)²/(2σ²) }

SLIDE 29

MLE Example

  • The log-likelihood is

    ℓ(μ, σ) = −(N/2) log(2πσ²) − (1/(2σ²)) Σ_i (T_i − μ)²

  • The derivative with respect to the mean is zero when

    Σ_i (T_i − μ) = 0,

yielding

    μ̂ = (1/N) Σ_i T_i

  • Note that this is just the sample mean

SLIDE 30

MLE Example

  • The log-likelihood is

    ℓ(μ, σ) = −(N/2) log(2πσ²) − (1/(2σ²)) Σ_i (T_i − μ)²

  • The derivative wrt the standard deviation is zero when

    −N/σ + (1/σ³) Σ_i (T_i − μ)² = 0,

or

    σ̂² = (1/N) Σ_i (T_i − μ̂)²

Note that this is just the sample variance.

SLIDE 31

MLE Example

  • Numerical example:
    – If the sample is {10, 20, 30, 40, 50}, then

    μ̂ = (10 + 20 + 30 + 40 + 50)/5 = 30

    σ̂² = (400 + 100 + 0 + 100 + 400)/5 = 200
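The same arithmetic as a two-line check:

```python
# MLE of Gaussian mean and variance for the sample from the slide.
sample = [10, 20, 30, 40, 50]
N = len(sample)

mu_hat = sum(sample) / N                              # sample mean
var_hat = sum((t - mu_hat) ** 2 for t in sample) / N  # sample variance (divides by N)

print(mu_hat, var_hat)  # → 30.0 200.0
```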

SLIDE 32

The Gradient

  • In higher dimensions, the generalization of the derivative is the gradient
  • The (Cartesian) gradient of a function f(w) at z is

    ∇f(z) = [ ∂f/∂w_1 (z), ..., ∂f/∂w_n (z) ]^T

  • The gradient has a nice geometric interpretation:
    – It points in the direction of maximum growth of the function (the steepest ascent direction),
    – which makes it perpendicular to the contours where the function is constant.
    – The above is the gradient for the simple (unweighted) Euclidean norm (aka the Cartesian gradient).

[Figure: contours of f(x, y) with gradient vectors ∇f(x_0, y_0) and ∇f(x_1, y_1)]
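A minimal finite-difference sketch of the Cartesian gradient (the test function is an invented example):

```python
import numpy as np

def grad(f, z, h=1e-6):
    """Central-difference approximation of the gradient of f at z."""
    z = np.asarray(z, dtype=float)
    g = np.zeros_like(z)
    for k in range(len(z)):
        e = np.zeros_like(z)
        e[k] = h
        g[k] = (f(z + e) - f(z - e)) / (2 * h)
    return g

f = lambda w: w[0] ** 2 + 3 * w[1]  # simple test function
print(grad(f, [1.0, 2.0]))          # analytic gradient is [2, 3]
```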

SLIDE 33

The Gradient

  • Note that if ∇f(x) = 0:
    – There is no direction of growth at x
    – Also −∇f(x) = 0, so there is no direction of decrease at x
    – We are either at a local minimum, a local maximum, or a "saddle" point at x
  • Conversely, if there is a local min, max, or saddle point at x:
    – There is no direction of growth or decrease at x
    – ∇f(x) = 0
  • This shows that we have a stationary point at x if and only if ∇f(x) = 0
  • To determine which type holds we need second-order conditions

[Figure: surfaces illustrating a max, a min, and a saddle]

SLIDE 34

The Hessian

  • The extension of the scalar second-order derivative is the Hessian matrix of second partial derivatives:

    ∇²f(x) = [ ∂²f/∂x_1²     ...  ∂²f/∂x_1∂x_n ]
             [     ...       ...      ...      ]
             [ ∂²f/∂x_n∂x_1  ...  ∂²f/∂x_n²    ]

Note that the Hessian is symmetric.
  • The Hessian gives us the quadratic function

    ½ (x − x_0)^T ∇²f(x_0) (x − x_0)

that best approximates f(x) − f(x_0) at a stationary point x_0.

SLIDE 35

Hessian as a Quadratic Approximation

  • E.g., this means that if the gradient is zero at x_0, we have
    – a maximum when the function f(x) can be locally approximated by a downward-opening quadratic bowl (∇²f(x_0) is negative definite)
    – a minimum when the function can be locally approximated by an upward-opening quadratic bowl (∇²f(x_0) is positive definite)
    – a saddle point otherwise (∇²f(x_0) is indefinite)

[Figure: surfaces illustrating a max, a min, and a saddle]

SLIDE 36

Hessian Gives Local Behavior

  • This is something that we already saw: for any symmetric matrix M, the quadratic function

    f(x) = x^T M x

    – has a minimum (an upward-opening quadratic bowl) at x = 0 when M is positive definite
    – has a maximum (a downward-opening quadratic bowl) at x = 0 when M is negative definite
    – has a saddle point at x = 0 otherwise
  • Hence, similarly, what matters is the definiteness property of the Hessian at a stationary point x_0
  • E.g., we have a maximum at a stationary point x_0 when the Hessian is negative definite at x_0
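This definiteness test can be sketched numerically by checking the eigenvalues of a finite-difference Hessian (the test function is an invented example):

```python
import numpy as np

def hessian(f, x, h=1e-4):
    """Finite-difference Hessian of f at x (symmetric by construction)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.zeros(n), np.zeros(n)
            e_i[i], e_j[j] = h, h
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * h ** 2)
    return H

def classify_stationary(H):
    """Classify a stationary point by the eigenvalues of the Hessian."""
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > 0):
        return "min"     # positive definite
    if np.all(eig < 0):
        return "max"     # negative definite
    return "saddle"      # indefinite

f = lambda x: x[0] ** 2 - x[1] ** 2  # classic saddle at the origin
print(classify_stationary(hessian(f, [0.0, 0.0])))  # → saddle
```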

SLIDE 37

Optimality Conditions

In summary:

  • w_0 is a local minimum of f(w) if and only if
    – f has zero gradient at w_0: ∇f(w_0) = 0, and
    – the Hessian of f at w_0 is positive definite:

    d^T ∇²f(w_0) d > 0,   for all d ∈ ℝ^n, d ≠ 0,

where

    ∇²f(w_0) = [ ∂²f/∂x_1²     ...  ∂²f/∂x_1∂x_n ]
               [     ...       ...      ...      ]
               [ ∂²f/∂x_n∂x_1  ...  ∂²f/∂x_n²    ]

SLIDE 38

Maximum Likelihood Estimation (MLE)

  • Given a sample, to obtain an MLE we want to solve

    Θ̂_ML = argmax_Θ P(D; Θ)

  • Candidate solutions are the parameter values Θ̂ such that

    ∇_Θ P(D; Θ̂) = 0

  • Note that you always have to check the second-order Hessian condition:

    d^T ∇²_Θ P(D; Θ̂) d < 0,   for all d ≠ 0

SLIDE 39

MLE Example

  • Back to our Gaussian example
  • Given iid samples {T_1, ..., T_N}, the likelihood is

    L(μ, σ; T_1, ..., T_N) = Π_{i=1}^N 1/√(2πσ²) · exp{ −(T_i − μ)²/(2σ²) }

SLIDE 40

MLE Example

  • The log-likelihood is

    ℓ(μ, σ) = −(N/2) log(2πσ²) − (1/(2σ²)) Σ_i (T_i − μ)²

  • The derivative of ℓ with respect to the mean is

    ∂ℓ/∂μ = (1/σ²) Σ_i (T_i − μ),

from which we compute the second-order derivatives

    ∂²ℓ/∂μ² = −N/σ²

    ∂²ℓ/∂σ∂μ = −(2/σ³) Σ_i (T_i − μ)

SLIDE 41

MLE Example

  • The derivative of ℓ with respect to the standard deviation is

    ∂ℓ/∂σ = −N/σ + (1/σ³) Σ_i (T_i − μ)²,

which yields the second-order derivatives

    ∂²ℓ/∂σ² = N/σ² − (3/σ⁴) Σ_i (T_i − μ)²

    ∂²ℓ/∂μ∂σ = −(2/σ³) Σ_i (T_i − μ)

  • The stationary parameter values are

    μ̂ = (1/N) Σ_i T_i,   σ̂² = (1/N) Σ_i (T_i − μ̂)²

SLIDE 42

MLE Example

  • The elements of the Hessian, evaluated at the stationary point (μ̂, σ̂), are:

    ∂²ℓ/∂μ² = −N/σ̂²

    ∂²ℓ/∂σ∂μ = −(2/σ̂³) Σ_i (T_i − μ̂) = 0

    ∂²ℓ/∂σ² = N/σ̂² − (3/σ̂⁴) Σ_i (T_i − μ̂)² = N/σ̂² − 3N/σ̂² = −2N/σ̂²

  • Thus the Hessian is

    ∇²ℓ(μ̂, σ̂) = −(N/σ̂²) [ 1  0 ]
                          [ 0  2 ],

which is clearly negative definite at the stationary point. Thus we have determined the MLE of the parameters.

SLIDE 43

2nd MLE Example

  • To find the MLEs of the two prior class probabilities P_Y(i), note that

    P_Y(1) = π,   P_Y(0) = 1 − π

can be written as

    P_Y(x) = π^x (1 − π)^{1−x},   x ∈ {0, 1},

where x is the so-called indicator (or 0/1) function.
  • Given iid indicator samples D = {x_1, ..., x_N}, we have

    L(π; D) = Π_{i=1}^N π^{x_i} (1 − π)^{1−x_i}

SLIDE 44

2nd MLE Example

  • Therefore

    log L(π; D) = Σ_{i=1}^N [ x_i log π + (1 − x_i) log(1 − π) ]

  • Setting the derivative of the log-likelihood with respect to π equal to zero,

    ∂/∂π log L(π; D) = (1/π) Σ_i x_i − (1/(1 − π)) ( N − Σ_i x_i ) = 0,

SLIDE 45

2nd MLE Example

yields the MLE estimate

    π̂_ML = (1/N) Σ_{i=1}^N x_i = n_1/N,   where n_1 = Σ_i x_i.

Note that this is just the relative frequency of occurrence of the value "1" in the sample. I.e., the MLE is just the count of the number of 1's over the total number of points! Again we see that the MLE yields an intuitively pleasing estimate of the unknown parameters.
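As a one-liner on an invented indicator sample:

```python
# Bernoulli MLE: the relative frequency of 1's in an indicator sample.
D = [1, 0, 1, 1, 0, 1, 0, 1]  # illustrative iid 0/1 sample
pi_hat = sum(D) / len(D)
print(pi_hat)  # → 0.625 (five 1's out of eight points)
```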

 

  

 

SLIDE 46

2nd MLE Example

  • Check that the second derivative is negative:

    ∂²/∂π² log L(π; D) = −(1/π²) Σ_i x_i − (1/(1 − π)²) Σ_i (1 − x_i) < 0

for 0 < π < 1.

SLIDE 47

Combining the MLE Examples

  • For Gaussian classes, all of the above formulas can be generalized to the random vector case as follows:
    – D^(i) = {x_1^(i), ..., x_n^(i)} = set of iid vector examples from each class i
    – The MLE estimates in the vector random data case are:

    μ̂_i = (1/n_i) Σ_j x_j^(i)

    Σ̂_i = (1/n_i) Σ_j (x_j^(i) − μ̂_i)(x_j^(i) − μ̂_i)^T

    P̂_Y(i) = n_i / N

    – These are the sample estimates given earlier with no justification.
    – The ML solutions are intuitive, which is usually the case.
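Putting the pieces together, the whole pipeline (MLE fitting followed by the plug-in MAP BDR) can be sketched as follows; the synthetic two-class data and all numbers are illustrative assumptions:

```python
import numpy as np

# End-to-end sketch: fit each class by MLE, then classify with the plug-in
# MAP BDR. The data below are synthetic, not the course's digit data.
rng = np.random.default_rng(1)
train = {0: rng.normal([0.0, 0.0], 1.0, size=(300, 2)),
         1: rng.normal([4.0, 4.0], 1.0, size=(150, 2))}
N = sum(len(X) for X in train.values())

params = {}
for i, X in train.items():
    mu = X.mean(axis=0)                     # mu_hat_i
    Sigma = (X - mu).T @ (X - mu) / len(X)  # Sigma_hat_i (MLE: divides by n_i)
    params[i] = (mu, Sigma, len(X) / N)     # P_hat_Y(i) = n_i / N

def map_bdr(x):
    """argmax_i of the Gaussian log-discriminant with plug-in estimates."""
    def score(i):
        mu, Sigma, prior = params[i]
        diff = x - mu
        return (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
                - 0.5 * np.log(np.linalg.det(2 * np.pi * Sigma))
                + np.log(prior))
    return max(params, key=score)

print(map_bdr(np.array([0.5, -0.5])), map_bdr(np.array([3.5, 4.2])))  # → 0 1
```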

SLIDE 48

END