Bayes Decision Theory - II
Ken Kreutz-Delgado (Nuno Vasconcelos)
ECE 175 – Winter 2012 - UCSD
Nearest Neighbor (NN) Classifier
– We are considering supervised classification.
– A training set $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$
– $x_i$ is a vector of observations, $y_i$ is the corresponding class label
– Given a new vector x to classify, the NN rule sets
  $y = y_{i^*}$, where $i^* = \arg\min_{i \in \{1,\dots,n\}} d(x, x_i)$
– argmin means: “the i that minimizes the distance”
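– A minimal sketch of this rule in Python (the toy data is made up; the metric d defaults to the Euclidean distance):

    import numpy as np

    def nn_classify(X_train, y_train, x, d=None):
        """Nearest Neighbor rule: label x with y_{i*}, i* = argmin_i d(x, x_i)."""
        if d is None:
            d = lambda a, b: np.linalg.norm(a - b)   # Euclidean metric by default
        dists = np.array([d(x, xi) for xi in X_train])
        i_star = np.argmin(dists)                    # "the i that minimizes the distance"
        return y_train[i_star]

    # toy usage: two 2-D training points with labels 0 and 1
    X = np.array([[0.0, 0.0], [2.0, 2.0]])
    y = np.array([0, 1])
    print(nn_classify(X, y, np.array([0.3, -0.1])))  # -> 0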
– The choice of metric $d(\cdot,\cdot)$ should be adapted to the properties of the data.
– The loss is the function that we want to minimize.
– The loss depends on the true label y and the prediction $\hat y = f(x)$, where $f(\cdot)$ is the classifier.
– The loss tells us how good our predictor is.
– What errors can we have?
– Two types: false positives and false negatives.
– E.g. in face detection (“face” vs. “non-face”):
  • false positive (false alarm): we say “face” when the true class is “non-face”
  • false negative (miss, failure to detect): we say “non-face” when the true class is “face”
– Obviously, we have corresponding sub-classes for non-errors: true positives and true negatives.
– The positive/negative part reflects what we say or decide.
– The true/false part reflects the true class label (“true state of the world”).
– We introduce a loss function.
– Denote the cost of classifying an x from class j as class i by $L(i\,|\,j)$.
– One way to measure how good the classifier is, is to use the (data-conditional) expected value of the loss, aka the (conditional) Risk:
  $R(x, i) = \sum_j L(i\,|\,j)\, P_{Y|X}(j\,|\,x)$
– Note that the (data-conditional) risk is a function of both the decision “decide class i” and the conditioning data (measured feature vector) x.
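– As a sketch, the conditional risk is just a weighted sum; here L[i][j] holds the assumed cost of deciding i when the true class is j, and post[j] = $P_{Y|X}(j\,|\,x)$:

    import numpy as np

    def conditional_risk(L, post, i):
        # R(x, i) = sum_j L(i|j) * P_{Y|X}(j|x); rows of L are indexed by the decision i
        return float(np.dot(L[i], post))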
– Example: snakes deciding whether a frog is a (poisonous) dart frog or a regular frog.
– A regular snake that eats a dart frog will die.
– Frogs are a good snack for the (immune) predator dart-snake.
– This leads to the losses (decisions in rows, true class in columns):

  Regular snake      dart frog   regular frog
  decide “regular”      ∞             0
  decide “dart”         0            10

  Predator snake     dart frog   regular frog
  decide “regular”     10             0
  decide “dart”         0            10

– What is the optimal decision when the snakes find a frog like these?
– If both snakes observe a frog for which
  $P_{Y|X}(\text{dart}\,|\,x) = 0, \quad P_{Y|X}(\text{regular}\,|\,x) = 1,$
  then both say “regular”.
– However, if
  $P_{Y|X}(\text{dart}\,|\,x) = 0.1, \quad P_{Y|X}(\text{regular}\,|\,x) = 0.9,$
  then the vulnerable snake says “dart” while the predator says “regular”:
  vulnerable: $R(x, \text{regular}) = 0.1 \cdot \infty = \infty$, $R(x, \text{dart}) = 0.9 \cdot 10 = 9$, so “dart”
  predator: $R(x, \text{regular}) = 0.1 \cdot 10 = 1$, $R(x, \text{dart}) = 0.9 \cdot 10 = 9$, so “regular”
– The threat of an infinite loss makes the vulnerable snake much more cautious!
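– A quick numeric check of the tables above (a sketch; np.inf encodes the fatal outcome for the regular snake):

    import numpy as np

    # columns: true class (0 = dart frog, 1 = regular frog); rows: decision
    L_regular  = np.array([[np.inf, 0.0],   # decide "regular" (eat)
                           [0.0,   10.0]])  # decide "dart" (don't eat)
    L_predator = np.array([[10.0,   0.0],
                           [0.0,   10.0]])

    post = np.array([0.1, 0.9])  # P(dart|x) = 0.1, P(regular|x) = 0.9
    for name, L in [("vulnerable", L_regular), ("predator", L_predator)]:
        risks = L @ post                     # R(x, i) for i = "regular", "dart"
        print(name, risks, "-> decide", ["regular", "dart"][int(np.argmin(risks))])
    # vulnerable: [inf, 9.] -> "dart";  predator: [1., 9.] -> "regular"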
– The conditional risk immediately defines the optimal classifier as the one that minimizes the conditional risk for a given observation x.
– The Optimal Decision is the Bayes Decision Rule (BDR):
  $i^*(x) = \arg\min_i \sum_j L(i\,|\,j)\, P_{Y|X}(j\,|\,x)$
– The BDR yields the optimal (minimal) risk:
  $R^*(x) = R(x, i^*(x)) = \min_i \sum_j L(i\,|\,j)\, P_{Y|X}(j\,|\,x)$
– A decision rule is a function $d = d(\cdot)$ of the observation that takes its values in the set of class labels $C = \{1, \dots, c\}$.
– The BDR $i^*(x)$ is an optimal decision rule in the sense that, for each specific value of x, it minimizes the conditional risk $R(x, i)$ over all possible decisions i in C.
– For any decision rule $d(\cdot)$ we can consider the (d-dependent) conditional risk $R(x, d(x))$.
– (d-Dependent) Expected or Average Risk (aka d-Risk):
  $R(d) = E_X\{R(x, d(x))\} = \int R(x, d(x))\, p_X(x)\, dx$
– Note that we have averaged over all possible measurements (features) x that we might encounter in the world.
– Note that R(d) is a function of a function! (A function of d.)
– The d-risk R(d) is a measure of how we expect to perform on average when we use the fixed decision rule d over and over again on a large set of real-world data.
– It is natural to ask if there is an “optimal decision rule” which minimizes the average risk R(d) over the class of all possible decision rules.
– The average risk R(d) is a function of a function: to find the best rule we must minimize it over a family of functions (decision rules), d.
– Key insight: the BDR minimizes $R(x, d(x))$ point-wise in x.
– Hence it also minimizes the Average Risk R(d) over ALL possible decision rules d!!
– Since the BDR is optimal in every individual situation, x, it must be optimal on the average over all x.
– In summary: the average risk
  $R(d) = E_X\{R(x, d(x))\} = \int R(x, d(x))\, p_X(x)\, dx$
  is minimized by the Bayes decision rule
  $d^*(x) = i^*(x) = \arg\min_i R(x, i)$
– An important special case: zero loss for no error and equal loss for the two error types, the “zero/one” loss:

  snake prediction   dart frog   regular frog
  regular                1            0
  dart                   0            1

– Under this loss, the conditional risk is
  $R(x, i) = \sum_j L(i\,|\,j)\, P_{Y|X}(j\,|\,x) = \sum_{j \neq i} P_{Y|X}(j\,|\,x) = 1 - P_{Y|X}(i\,|\,x)$
  so the BDR is
  $i^*(x) = \arg\min_i \left[1 - P_{Y|X}(i\,|\,x)\right] = \arg\max_i P_{Y|X}(i\,|\,x)$
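– A small numeric check that, under 0/1 loss, minimizing the risk is the same as maximizing the posterior (the posterior values are made up):

    import numpy as np

    post = np.array([0.2, 0.5, 0.3])          # P_{Y|X}(j|x) for 3 classes
    L01 = 1.0 - np.eye(3)                      # zero/one loss matrix
    risks = L01 @ post                         # R(x, i) = 1 - P_{Y|X}(i|x)
    assert np.allclose(risks, 1.0 - post)
    print(np.argmin(risks), np.argmax(post))   # same class: 1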
– For the 0/1 loss, the BDR is
  $i^*(x) = \arg\min_i \left[1 - P_{Y|X}(i\,|\,x)\right] = \arg\max_i P_{Y|X}(i\,|\,x)$
– Pick the class that is most probable given the observation x.
– $i^*(x)$ is known as the Maximum a Posteriori Probability (MAP) solution.
– We will often simplify our discussion by assuming this loss.
– But you should always be aware that other losses may be used.
– This is also called the Maximum a Posteriori Probability (MAP) rule:
  $i^*(x) = \arg\max_i P_{Y|X}(i\,|\,x)$
– It is usually not trivial to evaluate the posterior probabilities $P_{Y|X}(i\,|\,x)$.
– This is due to the fact that we are trying to infer the cause (class i) from the consequence (observation x), i.e. we are trying to solve a nontrivial inverse problem.
– E.g., try to evaluate $P_{Y|X}(\text{person}\,|\,\text{“has two eyes”})$.
– If “person” is the only class under consideration, then $P_{Y|X}(\text{person}\,|\,\text{“has two eyes”}) = 1$.
– If the classes are “person” and “cat”, then $P_{Y|X}(\text{person}\,|\,\text{“has two eyes”}) = 1/2$ if there are equal numbers of cats and people to uniformly choose from [this is additional info!].
– We note that it is much easier to infer the consequence from the cause.
– E.g., it is easy to infer that $P_{X|Y}(\text{“has two eyes”}\,|\,\text{person}) = 1$.
– This does not depend on any other classes; we do not need any additional information.
– Given a class, just count the frequency of the observation.
– Using Bayes rule,
  $P_{X|Y}(x\,|\,i)\, P_Y(i) = P_{Y|X}(i\,|\,x)\, P_X(x),$
  the BDR under 0/1 loss is
  $i^*(x) = \arg\max_i P_{Y|X}(i\,|\,x) = \arg\max_i \frac{P_{X|Y}(x\,|\,i)\, P_Y(i)}{P_X(x)} = \arg\max_i P_{X|Y}(x\,|\,i)\, P_Y(i),$
  since $P_X(x)$ does not depend on i.
– For a binary (“detection”) problem this is: pick “0” when
  $P_{Y|X}(0\,|\,x) > P_{Y|X}(1\,|\,x)$
  and “1” otherwise.
– By Bayes rule, this is the same as
  $\frac{P_{X|Y}(x\,|\,0)\, P_Y(0)}{P_X(x)} > \frac{P_{X|Y}(x\,|\,1)\, P_Y(1)}{P_X(x)}$
– Noting that $P_X(x)$ is a non-negative quantity, this is the same as the rule: pick “0” when
  $P_{X|Y}(x\,|\,0)\, P_Y(0) > P_{X|Y}(x\,|\,1)\, P_Y(1)$
  and “1” otherwise; i.e.
  $i^*(x) = \arg\max_i P_{X|Y}(x\,|\,i)\, P_Y(i)$
– In practice it is often not convenient to work directly with pdf’s.
– One helpful trick is to take logs.
– Note that the log is a monotonically increasing function:
  $a > b \Leftrightarrow \log a > \log b$
  from which we have
  $i^*(x) = \arg\max_i P_{X|Y}(x\,|\,i)\, P_Y(i)$
  $\quad = \arg\max_i \log\left[P_{X|Y}(x\,|\,i)\, P_Y(i)\right]$
  $\quad = \arg\max_i \left[\log P_{X|Y}(x\,|\,i) + \log P_Y(i)\right]$
  $\quad = \arg\min_i \left[-\log P_{X|Y}(x\,|\,i) - \log P_Y(i)\right]$
– In summary, for the zero/one loss, the following three decision rules are optimal and equivalent:
  1) $i^*(x) = \arg\max_i P_{Y|X}(i\,|\,x)$
  2) $i^*(x) = \arg\max_i P_{X|Y}(x\,|\,i)\, P_Y(i)$
  3) $i^*(x) = \arg\max_i \left[\log P_{X|Y}(x\,|\,i) + \log P_Y(i)\right]$
– The form 1) is usually hardest to use; 3) is frequently easier than 2).
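– A sketch verifying that forms 2) and 3) yield the same decision (form 1) needs the posterior itself, which is exactly what is hard to obtain); the Gaussian class-conditionals here are purely illustrative assumptions:

    import numpy as np
    from scipy.stats import norm

    priors = np.array([0.3, 0.7])
    means, sigmas = np.array([-1.0, 2.0]), np.array([1.0, 1.5])
    x = 0.4

    lik = norm.pdf(x, means, sigmas)                         # P_{X|Y}(x|i)
    form2 = np.argmax(lik * priors)
    form3 = np.argmax(norm.logpdf(x, means, sigmas) + np.log(priors))
    assert form2 == form3                                    # equivalent decisions
    print(form2)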
– How does one implement the optimal decision in practice?
– In addition to having a loss function, you need to know, model, or estimate the probabilities!
– Example: measure the time t between cars arriving to decide whether a promotion is working:
  • high t: promotion not working (Y = 0)
  • low t: promotion working well (Y = 1)
– Model the probability of arrival of a car by an exponential density (a standard pdf to use for inter-arrival times); i.e. the observations are assumed to be exponentially distributed:
  $P_{X|Y}(t\,|\,i) = \lambda_i\, e^{-\lambda_i t}, \quad t \ge 0,$
  where $\lambda_i$ is the arrival rate (cars/s) under class i.
– The mean inter-arrival time is $E_{X|Y}[x\,|\,y=i] = 1/\lambda_i$.
– Assuming the n measurements are independent given the class,
  $P_{X_1,\dots,X_n|Y}(t_1,\dots,t_n\,|\,i) = \prod_{k=1}^{n} P_{X|Y}(t_k\,|\,i) = \lambda_i^n\, e^{-\lambda_i \sum_{k=1}^{n} t_k}$
– We know the $\lambda_i$ and the (prior) class probabilities $P_Y(i) = \pi_i$, $i = 0, 1$, and have measured a collection of times during the day, $D = \{t_1, \dots, t_n\}$.
– Since the likelihood is a product of exponentials, it is easier to use the log-based BDR:
  $i^*(D) = \arg\max_i \left[\log P_{X_1,\dots,X_n|Y}(D\,|\,i) + \log P_Y(i)\right]$
  $\quad = \arg\max_i \left[\log\left(\lambda_i^n\, e^{-\lambda_i \sum_k t_k}\right) + \log \pi_i\right]$
  $\quad = \arg\max_i \left[n \log \lambda_i - \lambda_i \sum_{k=1}^{n} t_k + \log \pi_i\right]$
– The BDR is: pick “0” when
  $n \log \lambda_0 - \lambda_0 \sum_k t_k + \log \pi_0 > n \log \lambda_1 - \lambda_1 \sum_k t_k + \log \pi_1$
  and “1” otherwise.
– Let’s assume, for simplicity, that $\pi_0 = \pi_1 = 1/2$. The rule becomes: pick “0” when
  $(\lambda_1 - \lambda_0) \sum_{k=1}^{n} t_k > n \log \frac{\lambda_1}{\lambda_0}$, or
  $\frac{1}{n} \sum_{k=1}^{n} t_k > \frac{\log(\lambda_1/\lambda_0)}{\lambda_1 - \lambda_0}$
  (reasonably taking $\lambda_1 > \lambda_0$: cars arrive faster when the promotion is working).
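– The threshold form is easy to check against the log-BDR sketch above (same assumed rates and data; both give the same decision):

    import numpy as np

    lam0, lam1 = 0.05, 0.5                # with lam1 > lam0
    t = np.array([1.2, 3.5, 0.8, 2.1])
    T = np.log(lam1 / lam0) / (lam1 - lam0)
    decision = 0 if t.mean() > T else 1   # average above threshold -> "did not work"
    print(T, t.mean(), "-> decide Y =", decision)   # T ~ 5.12, mean 1.9 -> Y = 1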
– The left-hand side is the (sample) average inter-arrival time for the day:
  $\bar t = \frac{1}{n} \sum_{k=1}^{n} t_k$
– This means that there is an optimal choice of a “threshold”
  $T = \frac{\log(\lambda_1/\lambda_0)}{\lambda_1 - \lambda_0}$
  above which we say “the promotion did not work”. This makes sense!
– What is the shape of this threshold?
  [Plot: threshold T as a function of $\lambda_1$]
– Assuming $\lambda_0 = 1$, T decreases as $\lambda_1$ grows.
– I.e., the larger the arrival rate $\lambda_1$ expected from a successful promotion, the tougher the standard we impose on the measured average inter-arrival time
  $\bar t = \frac{1}{n} \sum_{k=1}^{n} t_k \;\lessgtr\; T = \frac{\log(\lambda_1/\lambda_0)}{\lambda_1 - \lambda_0}$
– The average has to be smaller for us to accept the day as a good one.
– Once again, this makes sense! A sensible answer is usually the case with the BDR (a good way to check your math).
  [Plot: T vs. $\lambda_1$]
– The pdf of class i is a Gaussian of mean $\mu_i$ and covariance $\Sigma_i$ if
  $P_{X|Y}(x\,|\,i) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_i|}} \exp\left\{-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right\}$
– The BDR is then
  $i^*(x) = \arg\max_i \left[-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{1}{2}\left(d \log 2\pi + \log |\Sigma_i|\right) + \log P_Y(i)\right]$
– Start from a collection of datasets, where the i-th class dataset $D^{(i)} = \{x_1^{(i)}, \dots, x_{n_i}^{(i)}\}$ is a set of $n_i$ examples from class i.
– For each class, estimate the Gaussian parameters:
  $\hat\mu_i = \frac{1}{n_i} \sum_j x_j^{(i)}$
  $\hat\Sigma_i = \frac{1}{n_i} \sum_j (x_j^{(i)} - \hat\mu_i)(x_j^{(i)} - \hat\mu_i)^T$
  $\hat P_Y(i) = \frac{n_i}{T}$
  where $T = \sum_{i=1}^{c} n_i$ is the total number of examples over all c classes.
– The BDR is then approximated by plugging in the estimates:
  $i^*(x) = \arg\max_i \left[-\frac{1}{2}(x - \hat\mu_i)^T \hat\Sigma_i^{-1} (x - \hat\mu_i) - \frac{1}{2}\left(d \log 2\pi + \log |\hat\Sigma_i|\right) + \log \hat P_Y(i)\right]$
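– A compact sketch of this train-and-classify pipeline on synthetic data (all names and numbers here are illustrative assumptions):

    import numpy as np

    def fit_gaussian_classifier(datasets):
        """datasets[i] is an (n_i, d) array of examples from class i."""
        T = sum(len(D) for D in datasets)
        params = []
        for D in datasets:
            mu = D.mean(axis=0)                       # hat mu_i
            Sigma = (D - mu).T @ (D - mu) / len(D)    # hat Sigma_i (ML, 1/n_i)
            params.append((mu, Sigma, len(D) / T))    # hat P_Y(i) = n_i / T
        return params

    def bdr(params, x):
        """argmax_i of the plugged-in Gaussian log-posterior (up to log P_X(x))."""
        scores = []
        for mu, Sigma, prior in params:
            diff = x - mu
            quad = diff @ np.linalg.solve(Sigma, diff)
            _, logdet = np.linalg.slogdet(Sigma)
            d = len(mu)
            scores.append(-0.5 * (quad + d * np.log(2 * np.pi) + logdet) + np.log(prior))
        return int(np.argmax(scores))

    rng = np.random.default_rng(0)
    D0 = rng.normal([0, 0], 1.0, size=(100, 2))
    D1 = rng.normal([3, 3], 1.5, size=(120, 2))
    params = fit_gaussian_classifier([D0, D1])
    print(bdr(params, np.array([2.5, 2.0])))   # -> 1 (closer to class 1)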
– The Gaussian BDR can be written as
  $i^*(x) = \arg\min_i \left[d_i^2(x, \mu_i) + \alpha_i\right]$
  with
  $d_i^2(x, y) = (x - y)^T \Sigma_i^{-1} (x - y)$ and $\alpha_i = \log\left[(2\pi)^d |\Sigma_i|\right] - 2 \log P_Y(i)$,
  and can be seen as a nearest “class-neighbor” classifier with a “funny metric”.
– Each class has its own “distance” measure: the metric $d_i$ and the constant $\alpha_i$ are class-i dependent.
  [Plot: Gaussian classes and the decision boundary, where $P_{Y|X}(i\,|\,x) = 0.5$]
– Special case: all classes have the same covariance, $\Sigma_i = \Sigma$. Then
  $i^*(x) = \arg\min_i \left[d^2(x, \mu_i) + \alpha_i\right]$
  with $d^2(x, y) = (x - y)^T \Sigma^{-1} (x - y)$ and $\alpha_i = -2 \log P_Y(i)$.
– $\alpha_i$ can be dropped when all classes have equal prior probability.
– This is reminiscent of the NN classifier with Mahalanobis distance.
– Instead of finding the nearest data-point neighbor of x, it looks for the nearest class “prototype” (or “archetype”, or “exemplar”, or “template”, or “representative”, or “ideal”, or “form”), defined as the class mean $\mu_i$.
  [Plot: decision boundary, where $P_{Y|X}(i\,|\,x) = 0.5$]
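– A minimal sketch of this nearest-prototype rule under an assumed shared covariance:

    import numpy as np

    def nearest_prototype(mus, Sigma_inv, priors, x):
        """argmin_i [ (x - mu_i)^T Sigma^{-1} (x - mu_i) - 2 log P_Y(i) ]."""
        scores = [(x - mu) @ Sigma_inv @ (x - mu) - 2 * np.log(p)
                  for mu, p in zip(mus, priors)]
        return int(np.argmin(scores))

    mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
    Sigma_inv = np.linalg.inv(np.array([[2.0, 0.5], [0.5, 1.0]]))
    print(nearest_prototype(mus, Sigma_inv, [0.5, 0.5], np.array([1.0, 1.2])))  # -> 0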
– One important property of this case is that the decision boundary is a hyperplane (Homework).
– This can be shown by computing the set of points x such that
  $d^2(x, \mu_i) + \alpha_i = d^2(x, \mu_j) + \alpha_j$
  and showing that they satisfy
  $w^T (x - x_0) = 0,$
  the equation of a hyperplane with normal w.
– $x_0$ can be any fixed point on the plane, but it is standard to choose it to have minimum norm, in which case w and $x_0$ are then parallel.
  [Figure: training points $x_1, x_2, x_3, \dots, x_n$ and the hyperplane through $x_0$ with normal w, where $P_{Y|X}(i\,|\,x) = 0.5$]
– Special case: when, in addition, $\Sigma = I$ and the priors are equal, the BDR
  $i^*(x) = \arg\min_i \left[d^2(x, \mu_i) + \alpha_i\right]$, with
  $d^2(x, y) = \|x - y\|^2$ and $\alpha_i = -2 \log P_Y(i)$,
  is simply template matching with the class means as templates.
– E.g. for digit classification: compare the complexity of this classifier to NN classifiers!
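– A sketch of the template matcher for digit-like feature vectors (the 8×8 “templates” are hypothetical stand-ins for class-mean images):

    import numpy as np

    def template_match(templates, x):
        """Pick the class whose mean ('template') is closest in Euclidean norm."""
        return int(np.argmin([np.sum((x - m) ** 2) for m in templates]))

    # note the cost: c distance evaluations vs. n (one per training point) for NN
    templates = [np.zeros(64), np.ones(64)]              # e.g. flattened 8x8 class means
    print(template_match(templates, np.full(64, 0.8)))   # -> 1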
– For the binary Gaussian problem, it is interesting to manipulate the original definition of the posterior as follows:
  $P_{Y|X}(1\,|\,x) = \frac{P_{X|Y}(x\,|\,1)\, P_Y(1)}{P_X(x)} = \frac{P_{X|Y}(x\,|\,1)\, P_Y(1)}{P_{X|Y}(x\,|\,1)\, P_Y(1) + P_{X|Y}(x\,|\,0)\, P_Y(0)}$
– Dividing through by the numerator,
  $P_{Y|X}(1\,|\,x) = \frac{1}{1 + \frac{P_{X|Y}(x\,|\,0)\, P_Y(0)}{P_{X|Y}(x\,|\,1)\, P_Y(1)}}$
– For Gaussian classes this becomes
  $P_{Y|X}(1\,|\,x) = \frac{1}{1 + e^{\frac{1}{2}\left[d_1^2(x, \mu_1) + \alpha_1 - d_0^2(x, \mu_0) - \alpha_0\right]}}$
  where, as before,
  $d_i^2(x, y) = (x - y)^T \Sigma_i^{-1} (x - y)$ and $\alpha_i = \log\left[(2\pi)^d |\Sigma_i|\right] - 2 \log P_Y(i)$.
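– A 1-D sketch of this posterior for two assumed equal-variance Gaussians; it traces out the sigmoid and equals 0.5 at the midpoint between the means:

    import numpy as np

    mu0, mu1, sigma = -1.0, 1.0, 1.0
    p0, p1 = 0.5, 0.5

    def posterior_1(x):
        """P(Y=1|x) = 1 / (1 + exp{ [d1^2 + a1 - d0^2 - a0] / 2 }) in 1-D."""
        d0 = (x - mu0) ** 2 / sigma ** 2
        d1 = (x - mu1) ** 2 / sigma ** 2
        a0, a1 = -2 * np.log(p0), -2 * np.log(p1)   # log (2pi)^d |Sigma| terms cancel
        return 1.0 / (1.0 + np.exp(0.5 * (d1 + a1 - d0 - a0)))

    for x in [-2.0, 0.0, 2.0]:
        print(x, posterior_1(x))   # rises from ~0 to ~1, equals 0.5 at x = 0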
– When the covariances are equal ($\Sigma_0 = \Sigma_1$), this function of x is a sigmoid and looks like this:
  [Plot: sigmoid posterior $P_{Y|X}(1\,|\,x)$, crossing the value 0.5 at the decision boundary]
– Hence the sigmoid can be interpreted as the posterior pdf of a Gaussian binary classification problem in which the covariances of the two classes are the same.