

slide-1
SLIDE 1

CS 559: Machine Learning Fundamentals and Applications 3rd Set of Notes

Instructor: Philippos Mordohai
Webpage: www.cs.stevens.edu/~mordohai
E-mail: Philippos.Mordohai@stevens.edu
Office: Lieb 215

1

slide-2
SLIDE 2

Overview

  • Making Decisions
  • Parameter Estimation
    – Frequentist or Maximum Likelihood approach

2

slide-3
SLIDE 3

Expected Utility

  • You are asked if you wish to take a bet on the outcome of tossing a fair coin. If you bet and win, you gain $100. If you bet and lose, you lose $200. If you don't bet, the cost to you is zero.
    U(win, bet) = 100, U(lose, bet) = -200, U(win, no bet) = 0, U(lose, no bet) = 0
  • Your expected winnings/losses are:
    U(bet) = p(win)×U(win, bet) + p(lose)×U(lose, bet) = 0.5×100 – 0.5×200 = -50
    U(no bet) = 0
  • Based on making the decision which maximizes expected utility, you would therefore be advised not to bet.
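
A minimal Python sketch of this expected-utility calculation (the function and variable names are illustrative, not from the slides):

```python
# Expected-utility calculation for the coin-bet example (illustrative sketch).
p_win, p_lose = 0.5, 0.5          # fair coin
U = {("win", "bet"): 100, ("lose", "bet"): -200,
     ("win", "no bet"): 0, ("lose", "no bet"): 0}

def expected_utility(action):
    """Average utility of an action over the outcomes of the coin toss."""
    return p_win * U[("win", action)] + p_lose * U[("lose", action)]

best = max(["bet", "no bet"], key=expected_utility)
print(expected_utility("bet"), expected_utility("no bet"), best)  # -50.0 0.0 'no bet'
```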

  • D. Barber (Ch. 7)

3

slide-4
SLIDE 4

Flow of Lecture and Entire Course

  • Making optimal decisions based on prior knowledge (prev. slide)
  • Making optimal decisions based on observations and prior knowledge
    – Given models of the underlying phenomena (last week and today)
    – Given training data with observations and labels (most of the semester)

4

slide-5
SLIDE 5

Bayesian Decision Theory

Adapted from: Duda, Hart and Stork, Pattern Classification textbook

  • O. Veksler
  • E. Sudderth

5

slide-6
SLIDE 6

Bayes’ Rule

Pattern Classification, Chapter 2 6

$$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\,P(\omega_j)}{p(x)}
\qquad\qquad
\text{posterior} = \frac{\text{likelihood}\times\text{prior}}{\text{evidence}}$$

$$\text{where}\quad p(x) = \sum_{j=1}^{2} p(x \mid \omega_j)\,P(\omega_j) \quad\text{(two-category case)}$$

slide-7
SLIDE 7

Bayes Rule - Intuition

The essence of the Bayesian approach is to provide a mathematical rule explaining how you should change your existing beliefs in the light of new evidence. In other words, it allows scientists to combine new data with their existing knowledge or expertise.

From the Economist (2000)

7

slide-8
SLIDE 8

Bayes Rule - Intuition

The canonical example is to imagine that a precocious newborn observes his first sunset, and wonders whether the sun will rise again or not. He assigns equal prior probabilities to both possible outcomes, and represents this by placing one white and one black marble into a bag. The following day, when the sun rises, the child places another white marble in the bag. The probability that a marble plucked randomly from the bag will be white (ie, the child's degree of belief in future sunrises) has thus gone from a half to two-thirds. After sunrise the next day, the child adds another white marble, and the probability (and thus the degree of belief) goes from two-thirds to three-quarters. And so on. Gradually, the initial belief that the sun is just as likely as not to rise each morning is modified to become a near-certainty that the sun will always rise.
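
A small sketch of the marble bookkeeping in this story, in Python (purely illustrative):

```python
# The newborn's marble bookkeeping from the sunrise story (illustrative).
white, black = 1, 1                 # equal prior belief: one white, one black marble
beliefs = []
for day in range(5):                # each sunrise adds a white marble
    beliefs.append(white / (white + black))
    white += 1
print(beliefs)                      # [0.5, 0.667, 0.75, 0.8, 0.833]
```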

From the Economist (2000) 8

slide-9
SLIDE 9

Bayesian Decision Theory

  • Knowing the probability distribution of the categories
  • We do not even need training data to design optimal classifiers
  • Rare in real life

Pattern Classification, Chapter 2 9

slide-10
SLIDE 10

Prior

  • Prior comes from prior knowledge, no data have been seen yet
  • If there is a reliable source of prior knowledge, it should be used
  • Some problems cannot even be solved reliably without a good prior
  • However, prior alone is not enough, we still need likelihood

Pattern Classification, Chapter 2 10

slide-11
SLIDE 11

Decision Rule based on Priors

  • Model the state of nature as a random variable, ω:
    – ω = ω1: the event that the next sample is from category 1
    – P(ω1) = probability of category 1
    – P(ω2) = probability of category 2
    – P(ω1) + P(ω2) = 1
  • Exclusivity: ω1 and ω2 share no events
  • Exhaustivity: the union of all outcomes is the sample space (either ω1 or ω2 must occur)
  • If all incorrect classifications have an equal cost:
  • Decide ω1 if P(ω1) > P(ω2); otherwise, decide ω2

11 Pattern Classification, Chapter 2

slide-12
SLIDE 12

Using Class-Conditional Information

  • Use of the class-conditional information can improve accuracy
  • p(x | ω1) and p(x | ω2) describe the difference in feature x between category 1 and category 2

12 Pattern Classification, Chapter 2

slide-13
SLIDE 13

Class-conditional Density vs. Likelihood

  • Class-conditional densities are probability density functions p(x | ω), viewed as functions of x with the class ω fixed
  • Likelihoods are values of p(x | ω) for a given x, viewed as functions of ω

  • This is a subtle point. Think about it.

Pattern Classification, Chapter 2 13

slide-14
SLIDE 14

14 Pattern Classification, Chapter 2

slide-15
SLIDE 15

Posterior, Likelihood, Evidence

– In the case of two categories:

$$p(x) = \sum_{j=1}^{2} p(x \mid \omega_j)\, P(\omega_j)$$

– Posterior = (Likelihood × Prior) / Evidence:

$$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)}$$

15 Pattern Classification, Chapter 2

slide-16
SLIDE 16

Decision using Posteriors

  • Decision given the posterior probabilities
    x is an observation for which:
    if P(ω1 | x) > P(ω2 | x): true state of nature = ω1
    if P(ω1 | x) < P(ω2 | x): true state of nature = ω2
  • Therefore, whenever we observe a particular x, the probability of error is:
    P(error | x) = P(ω1 | x) if we decide ω2
    P(error | x) = P(ω2 | x) if we decide ω1
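
A minimal sketch of this posterior-based decision in Python; the likelihood and prior values below are made up for illustration:

```python
import numpy as np

# Posteriors via Bayes' rule and the resulting probability of error (numbers made up).
priors = np.array([0.7, 0.3])            # P(w1), P(w2)
likelihoods = np.array([0.05, 0.20])     # p(x | w1), p(x | w2) at one observed x

evidence = likelihoods @ priors          # p(x)
posteriors = likelihoods * priors / evidence
decision = np.argmax(posteriors)         # decide the class with the larger posterior
p_error = posteriors.min()               # P(error | x) = min[P(w1|x), P(w2|x)]
print(posteriors, decision, p_error)
```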

16 Pattern Classification, Chapter 2

slide-17
SLIDE 17

17 Pattern Classification, Chapter 2

slide-18
SLIDE 18

Probability of Error

  • Minimizing the probability of error
  • Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2

Therefore: P(error | x) = min [P(ω1 | x), P(ω2 | x)] (Bayes decision)

18 Pattern Classification, Chapter 2

slide-19
SLIDE 19

Decision Theoretic Classification

ω ∈ Ω: unknown class or category, finite set of options
x: observed data, can take values in any space
a ∈ A: action to choose one of the categories (or possibly to reject the data)
L(ω, a): loss of action a given true class ω

19

slide-20
SLIDE 20

Loss Function

  • The loss function states how costly each action taken is
    – Opposite of the utility function: L = −U
  • Most common choice is the 0-1 loss
  • In regression, the squared loss is the most common choice
    L(y_true, y_pred) = (y_true − y_pred)²
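
A small sketch of the two loss functions in Python (function names are illustrative):

```python
def zero_one_loss(y_true, y_pred):
    """0-1 loss for classification: 0 if correct, 1 if wrong."""
    return float(y_true != y_pred)

def squared_loss(y_true, y_pred):
    """Squared loss for regression."""
    return (y_true - y_pred) ** 2

print(zero_one_loss(1, 1), zero_one_loss(1, 2))   # 0.0 1.0
print(squared_loss(3.0, 2.5))                      # 0.25
```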

20

slide-21
SLIDE 21

More General Loss Function

  • Allowing actions other than classification primarily allows the possibility of rejection
  • Refusing to make a decision in close or bad cases!
  • The loss function still states how costly each action taken is

21 Pattern Classification, Chapter 2

slide-22
SLIDE 22

Notation

  • Let {ω1, ω2, …, ωc} be the set of c states of nature (or “categories”)
  • Let {α1, α2, …, αa} be the set of a possible actions
  • Let λ(αi | ωj) be the loss incurred for taking action αi when the state of nature is ωj

22 Pattern Classification, Chapter 2

slide-23
SLIDE 23

Overall Risk

R = sum of all R(αi | x), for i = 1, …, a
Minimizing R is achieved by minimizing R(αi | x) for each x: select the action αi that minimizes the conditional risk as a function of x

Pattern Classification, Chapter 2 23

Conditional risk:

$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x)$$
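
A minimal sketch of computing conditional risks from a loss matrix and posteriors and picking the minimum-risk action; the loss values and posteriors are made up for illustration:

```python
import numpy as np

# Rows index actions a_i, columns index true states w_j: lambda(a_i | w_j).
loss = np.array([[0.0, 1.0],
                 [2.0, 0.0]])
posteriors = np.array([0.4, 0.6])        # P(w1 | x), P(w2 | x) for one x (made up)

cond_risk = loss @ posteriors            # R(a_i | x) = sum_j lambda(a_i|w_j) P(w_j|x)
best_action = np.argmin(cond_risk)       # Bayes decision: minimize conditional risk
print(cond_risk, best_action)            # [0.6 0.8] 0
```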

slide-24
SLIDE 24

Minimize Overall Risk

Select the action αi for which R(αi | x) is minimum. Then the overall risk R is minimum, and R in this case is called the Bayes risk = best performance that can be achieved.

24 Pattern Classification, Chapter 2

slide-25
SLIDE 25

Conditional Risk

  • Two-category classification
    α1: decide ω1
    α2: decide ω2
    λij = λ(αi | ωj): loss incurred for deciding ωi when the true state of nature is ωj

Conditional risk:

R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)

25 Pattern Classification, Chapter 2

slide-26
SLIDE 26

Decision Rule

Our rule is the following:
if R(α1 | x) < R(α2 | x), take action α1: decide ω1

This results in the equivalent rule: decide ω1 if
(λ21 − λ11) p(x | ω1) P(ω1) > (λ12 − λ22) p(x | ω2) P(ω2)
and decide ω2 otherwise

26 Pattern Classification, Chapter 2

slide-27
SLIDE 27

Likelihood ratio

The preceding rule is equivalent to the following rule:

$$\text{if}\quad \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \;>\; \frac{\lambda_{12}-\lambda_{22}}{\lambda_{21}-\lambda_{11}}\cdot\frac{P(\omega_2)}{P(\omega_1)}$$

then take action α1 (decide ω1); otherwise take action α2 (decide ω2)

27 Pattern Classification, Chapter 2

slide-28
SLIDE 28

Optimal decision property

“If the likelihood ratio exceeds a threshold value independent of the input pattern x, we can take optimal actions”
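
A hedged sketch of applying the likelihood-ratio rule with Gaussian class-conditionals; all parameters below are illustrative, not taken from the lecture:

```python
import numpy as np
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Univariate Gaussian density."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

# Illustrative parameters (assumed, not from the slides).
mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 1.0
P1, P2 = 0.5, 0.5
lam = np.array([[0.0, 1.0],   # lambda_11, lambda_12
                [1.0, 0.0]])  # lambda_21, lambda_22

x = 0.3
ratio = normal_pdf(x, mu1, s1) / normal_pdf(x, mu2, s2)
threshold = (lam[0, 1] - lam[1, 1]) / (lam[1, 0] - lam[0, 0]) * P2 / P1
decision = "w1" if ratio > threshold else "w2"
print(ratio, threshold, decision)
```

Note that the threshold depends only on the losses and the priors, not on x, which is exactly the property quoted above.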

28 Pattern Classification, Chapter 2

slide-29
SLIDE 29

Exercise

Select the optimal decision where:
Ω = {ω1, ω2}
p(x | ω1) ~ N(2, 0.5) (Normal distribution)
p(x | ω2) ~ N(1.5, 0.2)
P(ω1) = 2/3, P(ω2) = 1/3

Pattern Classification, Chapter 2 29

$$\lambda = \begin{pmatrix} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{pmatrix} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$$

slide-30
SLIDE 30

Minimum-Error-Rate Classification

  • Actions are decisions on classes
    If action αi is taken and the true state of nature is ωj, then the decision is correct if i = j and in error if i ≠ j
  • Seek a decision rule that minimizes the probability of error, which is called the error rate

30 Pattern Classification, Chapter 2

slide-31
SLIDE 31

The Zero-one Loss Function

  • Zero-one loss function:

$$\lambda(\alpha_i, \omega_j) = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases} \qquad i, j = 1, \dots, c$$

Therefore, the conditional risk is:

$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$$

  • The risk corresponding to this loss function is the average probability of error

31 Pattern Classification, Chapter 2

slide-32
SLIDE 32

Minimum Error Rate Decision Rule

  • Minimizing the risk requires maximizing P(ωi | x), since R(αi | x) = 1 − P(ωi | x)
  • For minimum error rate:
    – Decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i

32 Pattern Classification, Chapter 2

slide-33
SLIDE 33
  • Given the likelihood ratio and the zero-one loss function:

$$\text{Let}\quad \theta_\lambda = \frac{\lambda_{12}-\lambda_{22}}{\lambda_{21}-\lambda_{11}}\cdot\frac{P(\omega_2)}{P(\omega_1)}
\qquad\text{then decide } \omega_1 \text{ if } \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \theta_\lambda$$

  • If λ is the zero-one loss function, this means:

$$\theta_a = \frac{P(\omega_2)}{P(\omega_1)}
\qquad\text{and if } \lambda_{12} = 2\lambda_{21}, \quad \theta_b = \frac{2\,P(\omega_2)}{P(\omega_1)}$$

33 Pattern Classification, Chapter 2

slide-34
SLIDE 34

34 Pattern Classification, Chapter 2

slide-35
SLIDE 35

Classifiers, Discriminant Functions and Decision Surfaces

  • The multi-category case
    – Set of discriminant functions gi(x), i = 1, …, c
    – The classifier assigns a feature vector x to class ωi if: gi(x) > gj(x) for all j ≠ i

35 Pattern Classification, Chapter 2

slide-36
SLIDE 36

Max Discriminant Functions

  • Let gi(x) = −R(αi | x)
    (max. discriminant corresponds to min. risk)
  • For the minimum error rate, we take gi(x) = P(ωi | x)
    (max. discriminant corresponds to max. posterior)
    Equivalently: gi(x) ∝ p(x | ωi) P(ωi), or gi(x) = ln p(x | ωi) + ln P(ωi)
    (ln: natural logarithm)

36 Pattern Classification, Chapter 2

slide-37
SLIDE 37

Decision Regions

  • Feature space divided into c decision regions
    if gi(x) > gj(x) for all j ≠ i, then x is in Ri (Ri means assign x to ωi)
  • The two-category case
    – A classifier is a “dichotomizer” that has two discriminant functions g1 and g2
    – Let g(x) ≡ g1(x) − g2(x)
    – Decide ω1 if g(x) > 0; otherwise decide ω2

37 Pattern Classification, Chapter 2

slide-38
SLIDE 38

Computation of g(x)

38 Pattern Classification, Chapter 2

$$g(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x)$$

$$g(x) = \ln\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} + \ln\frac{P(\omega_1)}{P(\omega_2)}$$

slide-39
SLIDE 39

Discriminant Functions for the Normal Density

  • Minimum error-rate classification can be achieved by the discriminant function
    gi(x) = ln p(x | ωi) + ln P(ωi)
  • Case of multivariate normal:

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^t \Sigma_i^{-1}(x-\mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$

39 Pattern Classification, Chapter 2
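
A minimal sketch of this Gaussian discriminant in Python; the means, covariances, and priors are made up for illustration:

```python
import numpy as np

def gaussian_discriminant(x, mu, sigma, prior):
    """g_i(x) = -0.5 (x-mu)^T Sigma^-1 (x-mu) - d/2 ln 2pi - 0.5 ln|Sigma| + ln P(w_i)."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return (-0.5 * diff @ np.linalg.inv(sigma) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * logdet
            + np.log(prior))

# Illustrative 2-D, 2-class example (parameters assumed, not from the slides).
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
sigmas = [np.eye(2), np.array([[1.0, 0.3], [0.3, 1.0]])]
priors = [0.6, 0.4]

x = np.array([1.5, 1.0])
scores = [gaussian_discriminant(x, m, s, p) for m, s, p in zip(mus, sigmas, priors)]
print(np.argmax(scores))   # class with the largest discriminant wins
```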

slide-40
SLIDE 40
  • Case Σi = σ²I (I is the identity matrix)

40

Prove it!

Pattern Classification, Chapter 2

$$g_i(x) = w_i^t x + w_{i0} \quad\text{(linear discriminant function)}$$

$$\text{where: } w_i = \frac{\mu_i}{\sigma^2}; \qquad w_{i0} = -\frac{1}{2\sigma^2}\mu_i^t\mu_i + \ln P(\omega_i)$$

(w_{i0} is called the threshold for the i-th category)
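
A short sketch of these weights for the Σi = σ²I case; the numbers are illustrative assumptions:

```python
import numpy as np

def linear_discriminant_params(mu, sigma_sq, prior):
    """Weights for the Sigma_i = sigma^2 I case: g_i(x) = w_i^T x + w_i0."""
    w = mu / sigma_sq
    w0 = -mu @ mu / (2 * sigma_sq) + np.log(prior)
    return w, w0

# Illustrative means, variance, and priors (assumed, not from the slides).
w1, w10 = linear_discriminant_params(np.array([0.0, 0.0]), 1.0, 0.5)
w2, w20 = linear_discriminant_params(np.array([2.0, 2.0]), 1.0, 0.5)
x = np.array([1.5, 1.0])
print(0 if w1 @ x + w10 > w2 @ x + w20 else 1)   # index of the winning class
```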

slide-41
SLIDE 41

Linear Machines

– A classifier that uses linear discriminant functions is called “a linear machine”
– The decision surfaces for a linear machine are pieces of hyperplanes defined by:

gi(x) = gj(x)

41 Pattern Classification, Chapter 2

slide-42
SLIDE 42

42 Pattern Classification, Chapter 2

slide-43
SLIDE 43

– The hyperplane separating Ri and Rj is always orthogonal to the line linking the means

Decision boundary: gi(x) = gj(x), i.e. wᵗ(x − x0) = 0, where:

$$w = \mu_i - \mu_j$$

$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2}\,\ln\frac{P(\omega_i)}{P(\omega_j)}\,(\mu_i - \mu_j)$$

$$\text{if } P(\omega_i) = P(\omega_j) \text{ then } x_0 = \frac{1}{2}(\mu_i + \mu_j)$$

43 Pattern Classification, Chapter 2

slide-44
SLIDE 44

44 Pattern Classification, Chapter 2

slide-45
SLIDE 45
  • Case Σi = Σ (covariances of all classes are identical but arbitrary!)
    – Hyperplane separating Ri and Rj (the hyperplane separating Ri and Rj is generally not orthogonal to the line between the means):

$$w = \Sigma^{-1}(\mu_i - \mu_j)$$

$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\ln\left[P(\omega_i)/P(\omega_j)\right]}{(\mu_i - \mu_j)^t\,\Sigma^{-1}(\mu_i - \mu_j)}\,(\mu_i - \mu_j)$$

45 Pattern Classification, Chapter 2

slide-46
SLIDE 46

46 Pattern Classification, Chapter 2

slide-47
SLIDE 47

47 Pattern Classification, Chapter 2

slide-48
SLIDE 48
  • Case Σi = arbitrary
    – The covariance matrices are different for each category
    – The decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, hyperhyperboloids

$$g_i(x) = x^t W_i x + w_i^t x + w_{i0}$$

$$\text{where: } W_i = -\frac{1}{2}\Sigma_i^{-1}; \qquad w_i = \Sigma_i^{-1}\mu_i; \qquad w_{i0} = -\frac{1}{2}\mu_i^t\Sigma_i^{-1}\mu_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$

48 Pattern Classification, Chapter 2

slide-49
SLIDE 49

49 Pattern Classification, Chapter 2

slide-50
SLIDE 50

50 Pattern Classification, Chapter 2

slide-51
SLIDE 51

Bayes Decision Theory – Discrete Features

  • Components of x are binary or integer valued; x can take only one of m discrete values v1, v2, …, vm
  • Case of independent binary features in a 2-category problem
  • Let x = [x1, x2, …, xd]ᵗ where each xi is either 0 or 1, with probabilities:
    pi = P(xi = 1 | ω1)
    qi = P(xi = 1 | ω2)

51 Pattern Classification, Chapter 2

slide-52
SLIDE 52

52 Pattern Classification, Chapter 2

$$P(x \mid \omega_1) = \prod_{i=1}^{d} p_i^{x_i}(1-p_i)^{1-x_i}
\qquad\qquad
P(x \mid \omega_2) = \prod_{i=1}^{d} q_i^{x_i}(1-q_i)^{1-x_i}$$

Likelihood ratio:

$$\frac{P(x \mid \omega_1)}{P(x \mid \omega_2)} = \prod_{i=1}^{d}\left(\frac{p_i}{q_i}\right)^{x_i}\left(\frac{1-p_i}{1-q_i}\right)^{1-x_i}$$

Discriminant function:

$$g(x) = \sum_{i=1}^{d}\left[x_i \ln\frac{p_i}{q_i} + (1-x_i)\ln\frac{1-p_i}{1-q_i}\right] + \ln\frac{P(\omega_1)}{P(\omega_2)}$$

slide-53
SLIDE 53

53 Pattern Classification, Chapter 2

$$g(x) = \sum_{i=1}^{d} w_i x_i + w_0$$

$$\text{where: } w_i = \ln\frac{p_i(1-q_i)}{q_i(1-p_i)}, \quad i = 1, \dots, d
\qquad\text{and}\qquad
w_0 = \sum_{i=1}^{d}\ln\frac{1-p_i}{1-q_i} + \ln\frac{P(\omega_1)}{P(\omega_2)}$$

$$\text{decide } \omega_1 \text{ if } g(x) > 0 \text{ and } \omega_2 \text{ if } g(x) \le 0$$
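
A minimal sketch of this binary-feature linear discriminant in Python; the probabilities and priors are made-up illustration values:

```python
import numpy as np

def binary_feature_discriminant(x, p, q, prior1, prior2):
    """g(x) = sum_i w_i x_i + w_0 for independent binary features."""
    w = np.log(p * (1 - q) / (q * (1 - p)))
    w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(prior1 / prior2)
    return x @ w + w0

# Made-up probabilities for d = 3 binary features.
p = np.array([0.8, 0.6, 0.7])   # P(x_i = 1 | w1)
q = np.array([0.3, 0.4, 0.2])   # P(x_i = 1 | w2)
x = np.array([1, 0, 1])
g = binary_feature_discriminant(x, p, q, 0.5, 0.5)
print(g, "decide w1" if g > 0 else "decide w2")
```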

slide-54
SLIDE 54

Exercise: DHS Problem 2.12

Let ωmax(x) be the state of nature for which P(ωmax | x) ≥ P(ωi | x) for all i = 1, …, c

  • Show that P(ωmax | x) ≥ 1/c
  • Show that for the minimum-error-rate decision rule the average probability of error is given by

$$P(\text{error}) = 1 - \int P(\omega_{\max} \mid x)\, p(x)\, dx$$

  • Use these two results to show that P(error) ≤ (c − 1)/c
  • Describe a situation for which P(error) = (c − 1)/c

Pattern Classification, Chapter 3 54

slide-55
SLIDE 55
  • Case Σi = σ²I

55

Discriminant Functions for the Normal Density

$$g_i(x) = w_i^t x + w_{i0} \qquad\text{where: } w_i = \frac{\mu_i}{\sigma^2}; \quad w_{i0} = -\frac{1}{2\sigma^2}\mu_i^t\mu_i + \ln P(\omega_i)$$

slide-56
SLIDE 56

Discriminant Function Example

  • O. Veksler

56

slide-57
SLIDE 57

Discriminant Function Example

  • O. Veksler

57

slide-58
SLIDE 58

Discriminant Function Example

  • O. Veksler

58

slide-59
SLIDE 59

Discriminant Function Example

  • O. Veksler

59

slide-60
SLIDE 60

Maximum-Likelihood & Bayesian Parameter Estimation

Adapted from: Duda, Hart and Stork, Pattern Classification textbook

  • O. Veksler
  • E. Sudderth
  • D. Batra

Pattern Classification, Chapter 3 60

slide-61
SLIDE 61

Introduction

  • We could design an optimal classifier if we knew:
    – P(ωi) (priors)
    – p(x | ωi) (class-conditional densities)
    – Unfortunately, we rarely have this complete information!

  • Design a classifier from training data

Pattern Classification, Chapter 3 61

slide-62
SLIDE 62

Supervised Learning in a Nutshell

  • Training Stage:
    – Raw Data → x (Feature Extraction)
    – Training Data {(x, y)} → f (Learning)
  • Testing Stage:
    – Raw Data → x (Feature Extraction)
    – Test Data x → f(x) (Apply function, Evaluate error)

(C) Dhruv Batra 62

slide-63
SLIDE 63

Statistical Estimation View

  • Probabilities to the rescue:
    – x and y are random variables
    – D = (x1, y1), (x2, y2), …, (xN, yN) ~ P(X, Y)
  • IID: Independent Identically Distributed
    – Both training & testing data sampled IID from P(X, Y)
    – Learn on training set
    – Have some hope of generalizing to test set

(C) Dhruv Batra 63

slide-64
SLIDE 64

Parameter Estimation

  • Use a priori information about the problem
  • E.g.: Normality of p(x | ωi)
    p(x | ωi) ~ N(μi, Σi)
  • Simplify the problem
    – From estimating an unknown distribution function
    – To estimating parameters

Pattern Classification, Chapter 3 64

slide-65
SLIDE 65

Why Gaussians?

  • Why does the entire world seem to always be harping on about Gaussians?
    – Central Limit Theorem!
    – They’re easy (and we like easy)
    – Closely related to squared loss (for regression)
    – Mixture of Gaussians is sufficient to approximate many distributions

(C) Dhruv Batra 65

slide-66
SLIDE 66

Some properties of Gaussians

  • Affine transformation (multiplying by a scalar and adding a constant)
    – X ~ N(μ, σ²)
    – Y = aX + b ⟹ Y ~ N(aμ + b, a²σ²)
  • Sum of Independent Gaussians
    – X ~ N(μX, σX²)
    – Y ~ N(μY, σY²)
    – Z = X + Y ⟹ Z ~ N(μX + μY, σX² + σY²)

(C) Dhruv Batra 66

slide-67
SLIDE 67

Estimation techniques

  • Maximum-Likelihood (ML) and Bayesian estimation
  • Results are often identical, but the approaches are fundamentally different
  • Frequentist View
    – lim_{N→∞} #(A is true)/N
    – limiting frequency of a repeating non-deterministic event

  • Bayesian View

– P(A) is your “belief” about A

Pattern Classification, Chapter 3 67

slide-68
SLIDE 68

Parameter Estimation

  • Parameters in ML estimation are fixed but unknown!
  • Best parameters are obtained by maximizing the probability of obtaining the samples observed
  • Bayesian methods view the parameters as random variables having some known distribution
  • In either approach, we use P(ωi | x) for our classification rule

Pattern Classification, Chapter 3 68

slide-69
SLIDE 69

Independence Across Classes

  • For each class ωi we have a proposed density pi(x | θi) with unknown parameters θi which we need to estimate
  • Since we assumed independence of data across the classes, estimation is an identical procedure for all classes
  • To simplify notation, we drop sub-indexes and say that we need to estimate parameters θ for the density p(x)

Pattern Classification, Chapter 2 69

slide-70
SLIDE 70

Maximum-Likelihood Estimation

  • Has good convergence properties as the sample size increases
  • Simpler than alternative techniques
  • General principle
    – Assume c datasets (classes) D1, D2, …, Dc drawn independently according to p(x | ωj)

Pattern Classification, Chapter 3 70

slide-71
SLIDE 71

Maximum-Likelihood Estimation

  • Assume that p(x | ωj) has a known parametric form determined by the parameter vector θj
  • Further assume that Di gives no information about θj if i ≠ j

– Drop subscripts in remainder

Pattern Classification, Chapter 3 71

slide-72
SLIDE 72

Likelihood

  • Use a set of independent samples to estimate p(D | θ)
    – Let D = {x1, x2, …, xn}; |D| = n
    – p(x1, …, xn | θ) = ∏ p(xi | θ)
  • Our goal is to determine θ̂, the value of θ that best agrees with the observed training data
  • Note: if D is fixed, p(D | θ) is not a density in θ

Pattern Classification, Chapter 3 72

slide-73
SLIDE 73

Example: Gaussian case

  • Assume we have c classes and
    p(x | ωj) ~ N(μj, Σj)
    p(x | ωj) ≡ p(x | ωj, θj), where:

Pattern Classification, Chapter 3 73

$$\theta_j = (\mu_j, \Sigma_j) = \left(\mu_{j1}, \mu_{j2}, \dots, \sigma_{j11}, \sigma_{j22}, \operatorname{cov}(x_m, x_n), \dots\right)$$

slide-74
SLIDE 74

Pattern Classification, Chapter 3 74

  • Use the information provided by the training samples to estimate θ = (θ1, θ2, …, θc); each θi (i = 1, 2, …, c) is associated with one category
  • Suppose that D contains n samples, x1, x2, …, xn:

$$p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$$

  • p(D | θ) is called the likelihood of θ w.r.t. the set of samples
  • The ML estimate of θ is, by definition, the value θ̂ that maximizes p(D | θ)
    “It is the value of θ that best agrees with the actually observed training sample”
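
A hedged sketch of evaluating this likelihood for candidate parameter values (here just the mean of a univariate Gaussian with known variance; all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=50)     # samples with true mean 2.0

def log_likelihood(mu, x, sigma=1.0):
    """log p(D | theta) = sum_k log p(x_k | theta) for a Gaussian with known sigma."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu) ** 2 / (2 * sigma**2))

candidates = np.linspace(0.0, 4.0, 81)
best_mu = candidates[np.argmax([log_likelihood(m, data) for m in candidates])]
print(best_mu)    # close to the sample mean, hence close to 2.0
```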

slide-75
SLIDE 75

Pattern Classification, Chapter 3 75

  • Optimal estimation
    – Let θ = (θ1, θ2, …, θp)ᵗ and let ∇θ be the gradient operator:

$$\nabla_\theta = \left[\frac{\partial}{\partial\theta_1}, \dots, \frac{\partial}{\partial\theta_p}\right]^t$$

    – We define l(θ) as the log-likelihood function: l(θ) = ln p(D | θ)
    – New problem statement: determine θ that maximizes the log-likelihood:

$$\hat{\theta} = \arg\max_\theta\, l(\theta)$$

slide-76
SLIDE 76

Pattern Classification, Chapter 3 76

Necessary conditions for an optimum:

$$\nabla_\theta l = 0$$

$$\text{where}\quad l(\theta) = \sum_{k=1}^{n} \ln p(x_k \mid \theta)
\qquad\text{so}\qquad
\nabla_\theta l = \sum_{k=1}^{n} \nabla_\theta \ln p(x_k \mid \theta)$$

A solution of ∇θ l = 0 can be a:
  • Local or global maximum
  • Local or global minimum
  • Saddle point
  • Boundary of parameter space

slide-77
SLIDE 77

Pattern Classification, Chapter 3 77

Example of ML estimation: unknown μ

– p(xk | μ) ~ N(μ, Σ) (samples are drawn from a multivariate normal population); θ = μ, therefore:

$$\ln p(x_k \mid \mu) = -\frac{1}{2}\ln\left[(2\pi)^d |\Sigma|\right] - \frac{1}{2}(x_k - \mu)^t \Sigma^{-1}(x_k - \mu)$$

$$\nabla_\mu \ln p(x_k \mid \mu) = \Sigma^{-1}(x_k - \mu)$$

  • The ML estimate for μ must satisfy:

$$\sum_{k=1}^{n} \Sigma^{-1}(x_k - \hat{\mu}) = 0$$

slide-78
SLIDE 78

Pattern Classification, Chapter 3 78

  • Multiplying by Σ and rearranging, we obtain:

$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k$$

Just the arithmetic average of the training samples!

Conclusion:
If p(xk | ωj) (j = 1, 2, …, c) is assumed to be Gaussian in a d-dimensional feature space, then we can estimate the vector θ = (θ1, θ2, …, θc)ᵗ and perform optimal classification!

slide-79
SLIDE 79

Pattern Classification, Chapter 3 79

  • Example of ML estimation: unknown μ and σ (univariate); θ = (θ1, θ2) = (μ, σ²)

$$l = \ln p(x_k \mid \theta) = -\frac{1}{2}\ln 2\pi\theta_2 - \frac{1}{2\theta_2}(x_k - \theta_1)^2$$

$$\nabla_\theta l = \begin{bmatrix} \dfrac{\partial l}{\partial\theta_1} \\[2mm] \dfrac{\partial l}{\partial\theta_2} \end{bmatrix}
= \begin{bmatrix} \dfrac{1}{\theta_2}(x_k - \theta_1) \\[2mm] -\dfrac{1}{2\theta_2} + \dfrac{(x_k - \theta_1)^2}{2\theta_2^2} \end{bmatrix}$$

slide-80
SLIDE 80

Pattern Classification, Chapter 3 80

Summation over all samples gives the conditions:

$$\sum_{k=1}^{n} \frac{1}{\hat{\sigma}^2}(x_k - \hat{\mu}) = 0 \quad (1)
\qquad\qquad
-\sum_{k=1}^{n} \frac{1}{\hat{\sigma}^2} + \sum_{k=1}^{n} \frac{(x_k - \hat{\mu})^2}{\hat{\sigma}^4} = 0 \quad (2)$$

Combining (1) and (2), one obtains:

$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k
\qquad\qquad
\hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n} (x_k - \hat{\mu})^2$$
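
A minimal sketch of these ML estimates in Python (the true parameters and sample size are made up); the last line previews the biased vs. unbiased variance estimates discussed on the next slide:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # univariate samples (made-up parameters)

mu_hat = x.mean()                          # ML estimate of the mean
sigma2_hat = np.mean((x - mu_hat) ** 2)    # ML (biased) estimate of the variance
print(mu_hat, sigma2_hat)
print(np.var(x, ddof=0), np.var(x, ddof=1))   # biased vs. unbiased sample variance
```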

slide-81
SLIDE 81

Bias

– The ML estimate for σ² is biased:

$$E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right] = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2$$

– For one sample, the estimated variance is always zero => under-estimate
– An elementary unbiased estimator for Σ is the sample covariance matrix:

$$C = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \hat{\mu})(x_k - \hat{\mu})^t$$

– Ultimately, we are interested in the estimate that maximizes classification performance

Pattern Classification, Chapter 3 81

slide-82
SLIDE 82

Model Error

  • What if we assume the class distribution to be N(µ, 1), but the true distribution is N(µ, 10)?
    – ML estimate: θ = µ is the correct mean
  • Will this θ result in the best classifier performance?

– NO

Pattern Classification, Chapter 3 82