Bayes Decision Theory - II, Ken Kreutz-Delgado (Nuno Vasconcelos), PowerPoint PPT Presentation



SLIDE 1

Bayes Decision Theory - II

Ken Kreutz-Delgado (Nuno Vasconcelos)

ECE 175 – Winter 2012 - UCSD

SLIDE 2

Nearest Neighbor Classifier

  • We are considering supervised classification
  • Nearest Neighbor (NN) Classifier

– A training set D = {(x1, y1), ..., (xn, yn)}
– xi is a vector of observations, yi is the corresponding class label
– Given a new vector x to classify

  • The “NN Decision Rule” is

– argmin means: “the i that minimizes the distance”

$$i^* = \arg\min_{i \in \{1,\dots,n\}} d(x, x_i), \qquad \text{set } \hat{y} = y_{i^*}$$

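The NN rule above can be sketched in a few lines; the toy training set D and its labels below are made up for illustration, with Euclidean distance standing in for d.

```python
import math

def nn_classify(x, data):
    """Return y_{i*} where i* = argmin_i d(x, x_i), with Euclidean d."""
    # data is a list of (x_i, y_i) pairs; x and each x_i are feature tuples
    i_star = min(range(len(data)), key=lambda i: math.dist(x, data[i][0]))
    return data[i_star][1]

D = [((0.0, 0.0), "a"), ((1.0, 1.0), "b"), ((4.0, 4.0), "b")]
print(nn_classify((0.2, 0.1), D))  # "a": (0, 0) is the nearest training point
```

Swapping in a different distance function changes the classifier, which is exactly the metric-dependence discussed on the next slide.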
SLIDE 3

Optimal Classifiers

  • We have seen that performance depends on metric
  • Some metrics are “better” than others
  • The meaning of “better” is connected to how well adapted

the metric is to the properties of the data

  • But can we be more rigorous? What do we mean by "optimal"?
  • To talk about optimality we define cost or loss

– Loss is the function that we want to minimize
– Loss depends on the true label y and the prediction $\hat{y} = f(x)$
– The loss $L(y, \hat{y})$ tells us how good our predictor $f(\cdot)$ is

SLIDE 4

Loss Functions & Classification Errors

  • Loss is a function of classification errors

– What errors can we have?
– Two types: false positives and false negatives

  • consider a face detection problem (decide “face” or “non-face”)
  • if you see a non-face and say "face", you have a false positive (a false alarm)
  • if you see a face and say "non-face", you have a false negative (a miss, a failure to detect)

– Obviously, we have corresponding sub-classes for non-errors

  • true-positives and true-negatives

– the positive/negative part reflects what we say or decide
– the true/false part reflects the true class label ("true state of the world")

SLIDE 5

(Conditional) Risk

  • To weigh different errors differently

– We introduce a loss function L[i, j], the cost of classifying a point from class i as j
– One way to measure how good the classifier is: use the (data-conditional) expected value of the loss, aka the (conditional) Risk
– Note that the (data-conditional) risk is a function of both the decision "decide class i" and the conditioning data (measured feature vector) x:

$$R(x, i) = E_{Y|X}\{ L[Y, i] \mid x \} = \sum_j L[j, i]\, P_{Y|X}(j \mid x)$$
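The conditional-risk sum is a one-liner; the loss table and posterior values below are made-up numbers, using a simple symmetric loss (1 for any error, 0 otherwise).

```python
def conditional_risk(i, loss, posterior):
    """R(x, i) = sum_j L[j, i] * P_{Y|X}(j | x)."""
    # loss maps (true class j, decision i) -> cost; posterior maps j -> P(j|x)
    return sum(loss[(j, i)] * p for j, p in posterior.items())

loss = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}  # 1 for any error
post = {0: 0.3, 1: 0.7}                               # hypothetical P(j|x)
print(conditional_risk(0, loss, post))  # 0.7: probability of error if we say 0
print(conditional_risk(1, loss, post))  # 0.3
```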

SLIDE 6

Loss Functions

  • Example: two snakes deciding whether to eat possibly-poisonous dart frogs

– A regular snake will die if it eats a dart frog
– Frogs are a good snack for the (immune) predator dart-snake
– This leads to the losses below
– What is the optimal decision when the snakes find a frog like these?

Regular (vulnerable) snake:

    decision \ frog    dart frog    regular frog
    "regular" (eat)       ∞             0
    "dart" (avoid)        0             10

Predator snake:

    decision \ frog    dart frog    regular frog
    "regular" (eat)       10            0
    "dart" (avoid)        0             10

SLIDE 7

Minimum Risk Classification

  • We have seen that

– if both snakes see $P_{Y|X}(\text{dart} \mid x) = 0$ (so $P_{Y|X}(\text{regular} \mid x) = 1$), then both say "regular"
– however, if $P_{Y|X}(\text{dart} \mid x) = 0.1$ and $P_{Y|X}(\text{regular} \mid x) = 0.9$, then the vulnerable snake says "dart" while the predator says "regular"

  • The infinite loss for saying "regular" when the frog is a dart frog makes the vulnerable snake much more cautious!
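The two snakes' decisions can be checked numerically; the loss tables follow the slide (with the vulnerable snake's fatal error given infinite loss), and the 0.1/0.9 posterior is the second scenario above.

```python
import math

# Loss tables keyed by (true frog class, decision); the vulnerable snake's
# loss for deciding "regular" on a dart frog is infinite (it dies).
vulnerable = {("dart", "regular"): math.inf, ("regular", "dart"): 10,
              ("dart", "dart"): 0, ("regular", "regular"): 0}
predator   = {("dart", "regular"): 10, ("regular", "dart"): 10,
              ("dart", "dart"): 0, ("regular", "regular"): 0}

def risk(decision, loss, posterior):
    # R(x, i) = sum_j L[j, i] * P_{Y|X}(j | x)
    return sum(loss[(j, decision)] * p for j, p in posterior.items())

def decide(loss, posterior):
    # Bayes decision rule: minimize the conditional risk
    return min(["dart", "regular"], key=lambda i: risk(i, loss, posterior))

post = {"dart": 0.1, "regular": 0.9}
print(decide(vulnerable, post))  # "dart": any dart probability outweighs inf loss
print(decide(predator, post))    # "regular": risk 1 beats risk 9
```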

SLIDE 8

BDR = Minimizing Conditional Risk

  • Note that the definition of risk:

– immediately defines the optimal classifier as the one that minimizes the conditional risk for a given observation x
– The Optimal Decision is the Bayes Decision Rule (BDR):

$$i^*(x) = \arg\min_i R(x, i) = \arg\min_i \sum_j L[j, i]\, P_{Y|X}(j \mid x)$$

– The BDR yields the optimal (minimal) risk:

$$R^*(x) = R(x, i^*) = \min_i \sum_j L[j, i]\, P_{Y|X}(j \mid x)$$

SLIDE 9

What is a Decision Rule?

  • Consider the c-ary classification problem with class labels in $C = \{1, \dots, c\}$
  • Given an observation (feature) x to be classified, a decision rule is a function d = d(·) of the observation that takes its values in the set of class labels: $d(x) \in \{1, \dots, c\}$
  • Note that $d^*(x) = i^*(x)$ defined on the previous slide is an optimal decision rule in the sense that, for a specific value of x, it minimizes the conditional risk R(x, i) over all possible decisions i in C

SLIDE 10

(d-Dependent) Total Average Risk

  • Given a decision rule d and the conditional risk R(x,i), we

can consider the (d-dependent) conditional risk R(x,d(x)).

  • We can now define the total

(d-Dependent) Expected or Average Risk (aka d-Risk):

– Note that we have averaged over all possible measurements (features) x that we might encounter in the world
– Note that R(d) is a function of a function! (a function of d)
– The d-risk R(d) is a measure of how we expect to perform on average when we use the fixed decision rule d over and over again on a large set of real-world data
– It is natural to ask if there is an "optimal decision rule" which minimizes the average risk R(d) over the class of all possible decision rules

$$R(d) = E_X\{ R(X, d(X)) \}$$

SLIDE 11

Minimizing the Average Risk R(d)

  • Optimizing total risk R(d) seems hard because we are trying to

minimize it over a family of functions (decision rules), d.

  • However, since the average risk R(d) is just the data-conditional risk R(x, d(x)) averaged over x, one can equivalently minimize R(x, d(x)) point-wise in x.
  • I.e., solve for the value of the optimal decision rule at each x.
  • Thus d*(x) = i*(x)!! I.e., the BDR, which we already know optimizes the Data-Conditional Risk, ALSO optimizes the Average Risk R(d) over ALL possible decision rules d!!

  • This makes sense: if the BDR is optimal for every single

situation, x, it must be optimal on the average over all x

$$R(d) = E_X\{R(X, d(X))\} = \int R(x, d(x))\, p_X(x)\, dx$$

$$d^*(x) = \arg\min_{d(x) = i} R(x, d(x)) = \arg\min_i R(x, i) = i^*(x)$$

SLIDE 12

The 0/1 Loss Function

  • An important special case of interest:

– zero loss for no error and equal loss for two error types

  • This is equivalent to the

“zero/one” loss :

  • Under this loss the optimal Bayes decision rule (BDR) is

The 0/1 loss for the snake problem:

    prediction \ frog    dart frog    regular frog
    "regular"               1             0
    "dart"                  0             1

$$L[j, i] = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases}$$

$$d^*(x) = i^*(x) = \arg\min_i \sum_j L[j, i]\, P_{Y|X}(j \mid x) = \arg\min_i \sum_{j \neq i} P_{Y|X}(j \mid x)$$

SLIDE 13

0/1 Loss yields MAP Decision Rule

  • Note that:

$$i^*(x) = \arg\min_i \sum_{j \neq i} P_{Y|X}(j \mid x) = \arg\min_i \left[1 - P_{Y|X}(i \mid x)\right] = \arg\max_i P_{Y|X}(i \mid x)$$

  • Thus the Optimal Decision for the 0/1 loss is:

– Pick the class that is most probable given the observation x
– i*(x) is known as the Maximum a Posteriori Probability (MAP) solution

  • This is also known as the Bayes Decision Rule (BDR) for the 0/1 loss

– We will often simplify our discussion by assuming this loss
– But you should always be aware that other losses may be used
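The equivalence above can be checked numerically on a hypothetical 3-class posterior: minimizing the 0/1-loss conditional risk picks the same class as maximizing the posterior.

```python
posterior = {0: 0.2, 1: 0.7, 2: 0.1}  # hypothetical P_{Y|X}(i | x)

map_choice = max(posterior, key=posterior.get)                  # argmax P(i|x)
risk_choice = min(posterior, key=lambda i: 1.0 - posterior[i])  # argmin 1 - P(i|x)
print(map_choice, risk_choice)  # 1 1
```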

SLIDE 14

BDR for the 0/1 Loss

  • Consider the evaluation of the BDR for 0/1 loss

– This is also called the Maximum a Posteriori Probability (MAP) rule
– It is usually not trivial to evaluate the posterior probabilities P_{Y|X}(i|x)
– This is due to the fact that we are trying to infer the cause (class i) from the consequence (observation x)
– i.e., we are trying to solve a nontrivial inverse problem

  • E.g. imagine that I want to evaluate

PY|X( person | “has two eyes”)

  • This strongly depends on what the other classes are

$$i^*(x) = \arg\max_i P_{Y|X}(i \mid x)$$

SLIDE 15

Posterior Probabilities and Detection

  • If the two classes are “people” and “cars”

– then PY|X( person | “has two eyes” ) = 1

  • But if the classes are “people” and “cats”

– then PY|X( person | “has two eyes” ) = ½ if there are equal numbers of cats and people to uniformly choose from [ this is additional info! ]

  • How do we deal with this problem?

– We note that it is much easier to infer consequence from cause
– E.g., it is easy to infer that P_{X|Y}("has two eyes" | person) = 1
– This does not depend on any other classes
– We do not need any additional information
– Given a class, just count the frequency of observation

SLIDE 16

Bayes Rule

  • How do we go from PX|Y( x | j ) to PY|X( j | x ) ?
  • We use Bayes rule:
  • Consider the two-class problem, i.e. Y=0 or Y=1

– the BDR under 0/1 loss is

$$P_{Y|X}(i \mid x) = \frac{P_{X|Y}(x \mid i)\, P_Y(i)}{P_X(x)}$$

$$i^*(x) = \arg\max_i P_{Y|X}(i \mid x) = \begin{cases} 0, & \text{if } P_{Y|X}(0 \mid x) > P_{Y|X}(1 \mid x) \\ 1, & \text{if } P_{Y|X}(0 \mid x) < P_{Y|X}(1 \mid x) \end{cases}$$

SLIDE 17

BDR for 0/1 Loss Binary Classification

  • Pick "0" when

$$P_{Y|X}(0 \mid x) > P_{Y|X}(1 \mid x)$$

and "1" otherwise

  • Using Bayes rule on both sides of this inequality yields

$$\frac{P_{X|Y}(x \mid 0)\, P_Y(0)}{P_X(x)} > \frac{P_{X|Y}(x \mid 1)\, P_Y(1)}{P_X(x)}$$

– Noting that $P_X(x)$ is a non-negative quantity, this is the same as the rule: pick "0" when

$$P_{X|Y}(x \mid 0)\, P_Y(0) > P_{X|Y}(x \mid 1)\, P_Y(1),$$

i.e.,

$$i^*(x) = \arg\max_i P_{X|Y}(x \mid i)\, P_Y(i)$$
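The practical payoff of this form is that the evidence $P_X(x)$ never has to be computed; a minimal sketch, with hypothetical density and prior values at the observed x:

```python
def pick_binary(lik0, lik1, p0, p1):
    """Pick "0" when P(x|0)P(0) > P(x|1)P(1); the evidence P_X(x)
    divides both sides of the comparison and so cancels."""
    return 0 if lik0 * p0 > lik1 * p1 else 1

# made-up values: class 0 is less likely at x but has the larger prior
print(pick_binary(0.3, 0.5, 0.8, 0.2))  # 0: 0.24 > 0.10
```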

SLIDE 18

The “Log Trick”

  • Sometimes it’s not convenient to

work directly with pdf’s

– One helpful trick is to take logs
– Note that the log is a monotonically increasing function, from which we have

$$a > b \iff \log a > \log b \qquad (a, b > 0)$$

$$i^*(x) = \arg\max_i P_{X|Y}(x \mid i)\, P_Y(i) = \arg\max_i \left[\log P_{X|Y}(x \mid i) + \log P_Y(i)\right] = \arg\min_i \left\{-\log P_{X|Y}(x \mid i) - \log P_Y(i)\right\}$$
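Beyond convenience, one practical reason to take logs (an observation added here, not stated on the slide) is numerical: a product of many small densities underflows floating point, while the log-likelihood sum stays well-scaled.

```python
import math

# 80 i.i.d. factors of 1e-5: the product is 1e-400, below the smallest
# representable float, so it collapses to exactly 0.0.
probs = [1e-5] * 80
product = 1.0
for p in probs:
    product *= p
log_sum = sum(math.log(p) for p in probs)  # 80 * log(1e-5) ~ -921

print(product)   # 0.0: underflowed
print(log_sum)   # about -921.03: still usable for comparisons
```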

SLIDE 19

“Standard” (0/1) BDR

  • In summary

– for the zero/one loss, the following three decision rules are optimal and equivalent:

1) $i^*(x) = \arg\max_i P_{Y|X}(i \mid x)$

2) $i^*(x) = \arg\max_i P_{X|Y}(x \mid i)\, P_Y(i)$

3) $i^*(x) = \arg\max_i \left[\log P_{X|Y}(x \mid i) + \log P_Y(i)\right]$

– The form 1) is usually hardest to use; 3) is frequently easier than 2)

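The three forms can be checked numerically; the likelihood and prior values below are made up, and all three rules must select the same class.

```python
import math

prior = {0: 0.8, 1: 0.2}   # hypothetical P_Y(i)
lik   = {0: 0.3, 1: 0.5}   # hypothetical P_{X|Y}(x|i) at the observed x
evidence = sum(lik[i] * prior[i] for i in (0, 1))
posterior = {i: lik[i] * prior[i] / evidence for i in (0, 1)}

rule1 = max((0, 1), key=lambda i: posterior[i])                      # form 1
rule2 = max((0, 1), key=lambda i: lik[i] * prior[i])                 # form 2
rule3 = max((0, 1), key=lambda i: math.log(lik[i]) + math.log(prior[i]))  # form 3
print(rule1, rule2, rule3)  # 0 0 0
```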
SLIDE 20

(Standard 0/1-Loss) BDR - Example

  • So far the BDR is an abstract rule

– How does one implement the optimal decision in practice?
– In addition to having a loss function, you need to know, model, or estimate the probabilities!
– Example:

  • Suppose that you run a gas station
  • On Mondays you have a promotion to sell more gas
  • Q: is the promotion working? I.e., is Y = 0 (no) or Y = 1 (yes) ?
  • A good observation to answer this question is the interarrival time t between cars:

– high t: not working (Y = 0)
– low t: working well (Y = 1)

SLIDE 21

BDR - Example

  • What are the class-conditional and prior probabilities?

– Model the probability of arrival of a car by an Exponential density (a standard pdf to use)

  • Continuous-valued interarrival times are

assumed to be exponentially distributed.

  • Hence

$$P_{X|Y}(t \mid i) = \lambda_i\, e^{-\lambda_i t},$$

where $\lambda_i$ is the arrival rate (cars/s).

  • The expected value of the interarrival time is

$$E[X \mid Y = i] = \frac{1}{\lambda_i}$$

  • Consecutive times are assumed to be independent:

$$P_{X_1,\dots,X_n \mid Y}(t_1, \dots, t_n \mid i) = \prod_{k=1}^{n} P_{X|Y}(t_k \mid i) = \lambda_i^n\, e^{-\lambda_i \sum_k t_k}$$

SLIDE 22

BDR - Example

  • Let’s assume that we

– know $\lambda_i$ and the (prior) class probabilities $P_Y(i) = \pi_i$, $i = 0, 1$
– have measured a collection of times during the day, $D = \{t_1, \dots, t_n\}$

  • The probabilities are of exponential form

– Therefore it is easier to use the log-based BDR:

$$i^*(D) = \arg\max_i \left[\log P_{X_1,\dots,X_n|Y}(t_1,\dots,t_n \mid i) + \log \pi_i\right] = \arg\max_i \left[\log\!\left(\lambda_i^n\, e^{-\lambda_i \sum_k t_k}\right) + \log \pi_i\right] = \arg\max_i \left[n \log \lambda_i - \lambda_i \sum_{k=1}^n t_k + \log \pi_i\right]$$

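The log-form BDR above is directly computable; the rates below are hypothetical (λ0 = 0.5 for a slow day, λ1 = 2.0 for a busy one), with equal priors by default.

```python
import math

def exp_bdr(times, lam0, lam1, p0=0.5, p1=0.5):
    """Pick argmax_i [ n log(lam_i) - lam_i * sum(t_k) + log(p_i) ]
    for exponential interarrival times."""
    n, s = len(times), sum(times)
    score0 = n * math.log(lam0) - lam0 * s + math.log(p0)
    score1 = n * math.log(lam1) - lam1 * s + math.log(p1)
    return 0 if score0 > score1 else 1

short_gaps = [0.5, 0.8, 0.4, 0.6]   # cars arriving quickly
print(exp_bdr(short_gaps, lam0=0.5, lam1=2.0))  # 1: promotion working
```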
SLIDE 23

BDR - Example

  • This means we pick "0" when

$$n \log \lambda_0 - \lambda_0 \sum_k t_k + \log \pi_0 \;>\; n \log \lambda_1 - \lambda_1 \sum_k t_k + \log \pi_1,$$

or

$$(\lambda_1 - \lambda_0) \sum_k t_k \;>\; n \log\frac{\lambda_1}{\lambda_0} + \log\frac{\pi_1}{\pi_0}, \quad \text{or} \quad \frac{1}{n}\sum_{k=1}^n t_k \;>\; \frac{\log(\lambda_1/\lambda_0) + \frac{1}{n}\log(\pi_1/\pi_0)}{\lambda_1 - \lambda_0}$$

and "1" otherwise (reasonably taking $\lambda_1 > \lambda_0$).

  • Does this decision rule make sense?

– Let's assume, for simplicity, that $\pi_0 = \pi_1 = 1/2$

SLIDE 24

BDR - Example

  • For $\pi_0 = \pi_1 = 1/2$, we pick "promotion did not work" (Y = 0) if

$$\frac{1}{n}\sum_{k=1}^n t_k \;>\; T \;=\; \frac{\log(\lambda_1/\lambda_0)}{\lambda_1 - \lambda_0}$$

– The left-hand side is the (sample) average interarrival time for the day
– This means that there is an optimal choice of a "threshold" T above which we say "promotion did not work". This makes sense!
– What is the shape of this threshold?

  • Assuming $\lambda_0 = 1$, $T = \log(\lambda_1)/(\lambda_1 - 1)$. [Figure: plot of T versus $\lambda_1$]
  • The higher $\lambda_1$, the more likely we are to say "promotion did not work".
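The threshold is available in closed form; holding λ0 = 1 as on the slide, a quick sweep over a few hypothetical λ1 values shows T shrinking, i.e. the "tougher standard" discussed next.

```python
import math

def threshold(lam0, lam1):
    """Equal priors: say Y=0 ("did not work") when the sample mean
    interarrival time exceeds T = log(lam1/lam0) / (lam1 - lam0)."""
    return math.log(lam1 / lam0) / (lam1 - lam0)

for lam1 in (2.0, 4.0, 8.0):
    print(round(threshold(1.0, lam1), 3))  # 0.693, 0.462, 0.297
```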

SLIDE 25

BDR - Example

  • When $\pi_0 = \pi_1 = 1/2$, we pick "did not work" (Y = 0) when

$$\bar{t} \;=\; \frac{1}{n}\sum_{k=1}^n t_k \;>\; T \;=\; \frac{\log(\lambda_1/\lambda_0)}{\lambda_1 - \lambda_0}$$

– Assuming $\lambda_0 = 1$, T decreases with $\lambda_1$
– I.e., for a given daily average:

  • The larger $\lambda_1$, the easier it is to say "did not work"

– This means that:

  • As the expected rate of arrival for good days increases, we impose a tougher standard on the average measured interarrival times

– The average has to be smaller for us to accept the day as a good one
– Once again, this makes sense!
– A sensible answer is usually the case with the BDR (a good way to check your math)

[Figure: plot of T versus $\lambda_1$]

SLIDE 26

The Gaussian Classifier

  • One important case is that of Multivariate Gaussian Classes

– The pdf of class i is a Gaussian of mean $\mu_i$ and covariance $\Sigma_i$:

$$P_{X|Y}(x \mid i) = \frac{1}{\sqrt{(2\pi)^d\, |\Sigma_i|}} \exp\left\{-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right\}$$

  • The BDR is

$$i^*(x) = \arg\max_i \left[-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1}(x - \mu_i) - \frac{1}{2}\log\left((2\pi)^d\, |\Sigma_i|\right) + \log P_Y(i)\right]$$

SLIDE 27

Implementation of a Gaussian Classifier

  • To design a Gaussian classifier (e.g. homework)

– Start from a collection of datasets, where the i-th class dataset $D^{(i)} = \{x_1^{(i)}, \dots, x_{n_i}^{(i)}\}$ is a set of $n_i$ examples from class i
– For each class, estimate the Gaussian parameters:

$$\hat{\mu}_i = \frac{1}{n_i} \sum_j x_j^{(i)}, \qquad \hat{\Sigma}_i = \frac{1}{n_i} \sum_j (x_j^{(i)} - \hat{\mu}_i)(x_j^{(i)} - \hat{\mu}_i)^T, \qquad \hat{P}_Y(i) = \frac{n_i}{T},$$

where $T = \sum_{i=1}^{c} n_i$ is the total number of examples over all c classes

  • Via the "plug-in rule", the BDR is approximated as

$$i^*(x) = \arg\max_i \left[-\frac{1}{2}(x - \hat{\mu}_i)^T \hat{\Sigma}_i^{-1}(x - \hat{\mu}_i) - \frac{1}{2}\log\left((2\pi)^d\, |\hat{\Sigma}_i|\right) + \log \hat{P}_Y(i)\right]$$

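The plug-in recipe above can be sketched end-to-end; to stay dependency-free this is a 1-D simplification (scalar mean and variance per class), with two made-up class datasets.

```python
import math

def fit_gaussian_1d(samples):
    """Plug-in rule: ML estimates (sample mean and variance)."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n
    return mu, var

def gauss_log_lik(x, mu, var):
    # log of the 1-D Gaussian density
    return -0.5 * ((x - mu) ** 2 / var + math.log(2 * math.pi * var))

def gaussian_bdr(x, params, priors):
    # argmax_i [ log P(x|i) + log P_Y(i) ]
    return max(range(len(params)),
               key=lambda i: gauss_log_lik(x, *params[i]) + math.log(priors[i]))

D0, D1 = [0.9, 1.1, 1.0, 0.8], [3.1, 2.9, 3.0, 3.2]  # toy class datasets
params = [fit_gaussian_1d(D0), fit_gaussian_1d(D1)]
priors = [0.5, 0.5]
print(gaussian_bdr(1.05, params, priors))  # 0: near class 0's mean
print(gaussian_bdr(2.8, params, priors))   # 1
```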
SLIDE 28

Gaussian Classifier

  • The Gaussian Classifier can be written as

$$i^*(x) = \arg\min_i \left[ d_i^2(x, \mu_i) + \alpha_i \right]$$

with

$$d_i^2(x, y) = (x - y)^T \Sigma_i^{-1} (x - y), \qquad \alpha_i = \log\left((2\pi)^d\, |\Sigma_i|\right) - 2 \log P_Y(i),$$

and can be seen as a nearest "class-neighbor" classifier with a "funny metric"

– Each class has its own "distance" measure:

  • Compute the Mahalanobis-squared distance for that class, then add the constant $\alpha_i$.
  • We effectively have different "metrics" in the data (feature) space that are class-i dependent.
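The per-class "funny metric" is just the Mahalanobis-squared form; a minimal 2-D sketch with the inverse covariance passed in as nested lists (identity here, so it reduces to squared Euclidean distance).

```python
def mahalanobis_sq(x, y, sigma_inv):
    """d^2(x, y) = (x - y)^T Sigma^{-1} (x - y), 2-D case."""
    dx = [xi - yi for xi, yi in zip(x, y)]
    tmp = [sum(sigma_inv[r][c] * dx[c] for c in range(2)) for r in range(2)]
    return sum(dx[r] * tmp[r] for r in range(2))

# identity inverse covariance: reduces to ||x - y||^2
print(mahalanobis_sq((1.0, 2.0), (0.0, 0.0), [[1.0, 0.0], [0.0, 1.0]]))  # 5.0
```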

SLIDE 29

Gaussian Classifier

  • A special case of interest is when all classes have the same covariance, $\Sigma_i = \Sigma$:

$$i^*(x) = \arg\min_i \left[ d^2(x, \mu_i) + \alpha_i \right]$$

with

$$d^2(x, y) = (x - y)^T \Sigma^{-1} (x - y), \qquad \alpha_i = -2 \log P_Y(i)$$

  • Note that:

– $\alpha_i$ can be dropped when all classes have equal prior probability
– This is reminiscent of the NN classifier with Mahalanobis distance
– Instead of finding the nearest data-point neighbor of x, it looks for the nearest class "prototype" (or "archetype", "exemplar", "template", "representative", "ideal", "form"), defined as the class mean $\mu_i$

SLIDE 30

Binary Classifier – Special Case

  • Consider $\Sigma_i = \Sigma$ with two classes

– One important property of this case is that the decision boundary is a hyperplane (Homework)
– This can be shown by computing the set of points x such that

$$d^2(x, \mu_1) + \alpha_1 = d^2(x, \mu_2) + \alpha_2$$

and showing that they satisfy

$$w^T (x - x_0) = 0$$

  • This is the equation of a hyperplane with normal w. The point $x_0$ can be any fixed point on the hyperplane, but it is standard to choose it to have minimum norm, in which case w and $x_0$ are parallel.

[Figure: training points $x_1, \dots, x_n$ separated by the hyperplane with normal w through $x_0$]

SLIDE 31

Gaussian M-ary Classifier – Special Case

  • If all the class covariances are the identity, Si=I, then

with

  • This is called (simple, Cartesian)

temp mplat ate match tching ng with class means as templates

– E.g. for digit classification Compare the complexity of this classifier to NN Classifiers!

2 *( )

argmin ( , )

i i i

i x d x m a      

2 2( , ) ||

|| d x y x y   ) ( log 2 i P

Y i

  a

* ?
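Template matching is the simplest instance of the family above; a sketch with equal priors (so the $\alpha_i$ drop) and made-up 2-D "templates" standing in for class means.

```python
def template_match(x, templates):
    """Sigma_i = I, equal priors: pick the class whose mean (template)
    is nearest in squared Euclidean distance."""
    d2 = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(templates, key=lambda c: d2(x, templates[c]))

means = {"0": (0.1, 0.9), "1": (0.5, 0.5)}  # hypothetical class templates
print(template_match((0.2, 0.8), means))  # "0"
```

Note the contrast with the NN classifier: here we compare against c class means instead of all n training points.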

SLIDE 32

The Sigmoid Function

  • We have derived much of the above from the log-based BDR:

$$i^*(x) = \arg\max_i \left[\log P_{X|Y}(x \mid i) + \log P_Y(i)\right]$$

  • When there are only two classes, i = 0, 1, it is also interesting to manipulate the original definition as follows:

$$i^*(x) = \arg\max_i g_i(x),$$

where

$$g_i(x) = P_{Y|X}(i \mid x) = \frac{P_{X|Y}(x \mid i)\, P_Y(i)}{P_X(x)} = \frac{P_{X|Y}(x \mid i)\, P_Y(i)}{P_{X|Y}(x \mid 0)\, P_Y(0) + P_{X|Y}(x \mid 1)\, P_Y(1)}$$

SLIDE 33

The Sigmoid Function

  • Note that this can be written as

$$i^*(x) = \arg\max_i g_i(x), \qquad g_0(x) = \frac{1}{1 + \dfrac{P_{X|Y}(x \mid 1)\, P_Y(1)}{P_{X|Y}(x \mid 0)\, P_Y(0)}}, \qquad g_1(x) = 1 - g_0(x)$$

  • For Gaussian classes, the posterior probabilities are

$$g_0(x) = \frac{1}{1 + \exp\left\{\frac{1}{2}\left[d_0^2(x, \mu_0) + \alpha_0\right] - \frac{1}{2}\left[d_1^2(x, \mu_1) + \alpha_1\right]\right\}}$$

where, as before,

$$d_i^2(x, y) = (x - y)^T \Sigma_i^{-1}(x - y), \qquad \alpha_i = \log\left((2\pi)^d\, |\Sigma_i|\right) - 2\log P_Y(i)$$

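A 1-D sketch of the sigmoid posterior above, assuming two Gaussian classes with a shared variance `var` (so the $\log|\Sigma_i|$ part of the $\alpha_i$ terms cancels and only the prior part matters); the means, variance, and priors are made-up values.

```python
import math

def g0(x, mu0, mu1, var, p0=0.5, p1=0.5):
    """Class-0 posterior: 1 / (1 + exp(0.5*[(d0^2 + a0) - (d1^2 + a1)]))."""
    d0 = (x - mu0) ** 2 / var          # d_i^2(x, mu_i) in one dimension
    d1 = (x - mu1) ** 2 / var
    a0, a1 = -2 * math.log(p0), -2 * math.log(p1)  # shared log-det cancels
    return 1.0 / (1.0 + math.exp(0.5 * ((d0 + a0) - (d1 + a1))))

print(g0(1.0, 0.0, 2.0, 1.0))  # 0.5: x is midway between the two means
```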
SLIDE 34

The Sigmoid (“S-shaped”) Function

  • The posterior probability for class i = 0,

$$g_0(x) = \frac{1}{1 + \exp\left\{\frac{1}{2}\left[d_0^2(x, \mu_0) + \alpha_0\right] - \frac{1}{2}\left[d_1^2(x, \mu_1) + \alpha_1\right]\right\}},$$

is a sigmoid and looks like this:

[Figure: sigmoid curve, crossing $P_{Y|X}(1 \mid x) = 0.5$ at the decision boundary]

SLIDE 35

The Sigmoid Function in Neural Nets

  • The sigmoid appears in neural networks, where it can be

interpreted as a posterior pdf for a Gaussian binary classification problem when the covariances are the same

SLIDE 36

The Sigmoid Function in Neural Nets

  • But not necessarily when the covariances are different
SLIDE 37

END