Bayes Decision Theory - I
Ken Kreutz-Delgado (Nuno Vasconcelos)
ECE 175A – Winter 2012 - UCSD
Statistical Learning from Data
Goal: Given a relationship between a feature vector x and a vector y, and iid data samples (x_i, y_i), find an approximating function $f(x) \approx y$:
$$x \;\xrightarrow{\;f(\cdot)\;}\; \hat{y} = f(x) \approx y$$
This is called training or learning. Two major types of learning:
– Unsupervised ("blind" curve fitting): only X is known.
– Supervised (classification or regression): both X and the target value Y are known during training, only X is known at test time.
Nearest Neighbor Classifier
– The nearest neighbor classifier consists of assigning to a new, unclassified vector the same class label as that of the closest vector in the labeled training set
– E.g. to classify the unlabeled point "Red":
  – measure its distance to all other labeled training points
  – if the closest point is labeled "A = square", assign it to the class A
  – otherwise assign it to the "B = circle" class
– This simple rule can work remarkably well, particularly if there are a lot of labeled training points
– More formally, we have:
  – a Training Set $D = \{(x_1,y_1), \ldots, (x_n,y_n)\}$
  – $x_i$ is a vector of observations, $y_i$ is the class label
  – a new vector $x$ to classify
– The Decision Rule is: set $\hat{y} = y_{i^*}$, where
  $$i^* = \arg\min_{i \in \{1,\ldots,n\}} d(x, x_i)$$
– argmin means: "the $i$ that minimizes the distance"
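As a concrete illustration, here is a minimal NumPy sketch of this rule; the training points and labels below are made up for the example:

```python
import numpy as np

def nn_classify(X_train, y_train, x):
    """Nearest neighbor rule: return the label of the closest training point."""
    # Euclidean distances from x to every training vector x_i
    dists = np.linalg.norm(X_train - x, axis=1)
    i_star = np.argmin(dists)          # i* = argmin_i d(x, x_i)
    return y_train[i_star]

# Toy example (made-up data): class "A" near the origin, class "B" near (5, 5)
X_train = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [4.9, 5.0]])
y_train = np.array(["A", "A", "B", "B"])
print(nn_classify(X_train, y_train, np.array([0.1, 0.3])))  # -> "A"
```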
Inner Products, Norms, and Distances

In $\mathbb{R}^d$:
– Inner product: $\langle x, y \rangle = x^T y = \sum_{i=1}^d x_i y_i$
– Euclidean norm (norm$^2$ = 'energy'): $\|x\|^2 = x^T x = \sum_{i=1}^d x_i^2$
– Euclidean distance (distance$^2$ = 'energy' of difference): $d(x,y) = \sqrt{\sum_{i=1}^d (x_i - y_i)^2}$

For continuous functions:
– Inner product: $\langle f, g \rangle = \int f(x)\, g(x)\, dx$
– norm$^2$ = 'energy': $\|f\|^2 = \int f^2(x)\, dx$
– distance$^2$ = 'energy' of difference: $d^2(f,g) = \int [f(x) - g(x)]^2\, dx$
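A quick numerical sanity check of these formulas; the vectors, functions, and integration grid below are arbitrary choices for illustration, and the integrals are approximated by Riemann sums:

```python
import numpy as np

# R^d case
x = np.array([1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 1.0])
print(x @ y)                          # <x,y> = x^T y = sum_i x_i y_i
print(x @ x)                          # ||x||^2 = 'energy' = sum_i x_i^2
print(np.sqrt(np.sum((x - y) ** 2)))  # d(x,y) = sqrt(sum_i (x_i - y_i)^2)

# Continuous-function case, approximated by a Riemann sum on [0, 1]
t, dt = np.linspace(0.0, 1.0, 100001), 1.0 / 100000
f, g = np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)
print(np.sum(f * g) * dt)                  # <f,g> = integral of f*g   (~ 0)
print(np.sum(f * f) * dt)                  # ||f||^2 = integral of f^2 (~ 0.5)
print(np.sqrt(np.sum((f - g) ** 2) * dt))  # d(f,g)                    (~ 1.0)
```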
– E.g., with the Euclidean distance $d(x,y) = \sqrt{\sum_{i=1}^d (x_i - y_i)^2}$, the set of points $x$ at a fixed distance $r$ from a point $y$ satisfies
  $$d(x,y) = r \iff \sum_{i=1}^d (x_i - y_i)^2 = r^2,$$
  i.e. $(x_1 - y_1)^2 + \cdots + (x_d - y_d)^2 = r^2$, which is a sphere (a circle in 2D) of radius $r$ centered at $y$
– The Euclidean distance is sensitive to the units in which each feature is measured. E.g., suppose:
  – the features are L = fish length, W = scale width
  – we measure L in meters and W in millimeters
  – I have three fish F1, F2, F3, where F1 and F3 look like the same species (e.g. both like salmon)
– With these units, $d(F_1,F_3) = 2.8 > d(F_2,F_3) = 2.23$
– there seems to be something wrong here
– but if scale width is also measured in meters: $d(F_1,F_3) = .05 < d(F_2,F_3) = 0.35$
– which seems to be right: the units are commensurate
– As the fish example shows, the Euclidean distance depends on the units (or scaling) of each axis
– e.g. if I multiply the second coordinate by 1,000, the 2nd coordinate's influence on the distance increases 1,000-fold, since the squared distance becomes
  $$d'^2(x,y) = (x_1 - y_1)^2 + 1{,}000{,}000\,(x_2 - y_2)^2$$
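A tiny sketch of this effect, with made-up coordinates (not the fish data above):

```python
import numpy as np

def dist(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

x, y = np.array([0.5, 0.10]), np.array([0.6, 0.30])  # both axes in meters
print(dist(x, y))                     # ~0.224: both axes contribute

# Same points, but the 2nd coordinate rescaled by 1,000 (meters -> millimeters)
s = np.array([1.0, 1000.0])
print(dist(x * s, y * s))             # ~200: the distance is now essentially
                                      # 1000*|x2 - y2|; the 1st axis is irrelevant
```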
– One fix is to work with a linearly transformed feature vector $x' = Ax$:
  – Taking $A = R$, $R$ proper and orthogonal, is equivalent to a rotation: $x' = Rx$
  – Another important special case is scaling: $A = S$ with $S = \mathrm{diag}(s_1, \ldots, s_n)$ diagonal, i.e. $x'_i = s_i x_i$
  – We can combine these two transformations by taking $A = SR$: $x' = SRx$
The Mahalanobis Distance
– Using the Euclidean distance on the transformed vectors $x' = Ax$, $y' = Ay$ is equivalent to working, in the original space, with the weighted metric
  $$d^2(x,y) = (Ax - Ay)^T (Ax - Ay) = (x - y)^T M (x - y), \quad M = A^T A$$
– How do we choose M? Let the data tell us! One possibility is to take M to be the inverse of the covariance matrix, $M = \Sigma^{-1}$, which gives the Mahalanobis distance
  $$d^2(x,y) = (x - y)^T \Sigma^{-1} (x - y)$$
– This distance is adapted to the data "scatter" and thereby yields "natural" units under a Gaussian assumption
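A minimal sketch, assuming synthetic data, of estimating $M = \Sigma^{-1}$ from the sample covariance and comparing the Mahalanobis distance to the Euclidean one:

```python
import numpy as np

def mahalanobis_sq(x, y, Sigma_inv):
    """Squared Mahalanobis distance d^2(x,y) = (x-y)^T Sigma^{-1} (x-y)."""
    diff = x - y
    return diff @ Sigma_inv @ diff

# Synthetic data with very different scales on the two axes
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) * np.array([1.0, 100.0])

Sigma = np.cov(X, rowvar=False)       # let the data tell us the scatter
Sigma_inv = np.linalg.inv(Sigma)

x, y = np.array([0.0, 0.0]), np.array([1.0, 100.0])
# Each axis now contributes in its own "natural" units (~1 std dev each)
print(mahalanobis_sq(x, y, Sigma_inv))   # ~2, i.e. ~1 per axis
print(np.sum((x - y) ** 2))              # Euclidean^2: ~10001, axis 2 dominates
```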
– The Mahalanobis distance is intimately connected to the Gaussian: for the density
  $$P_X(x) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\},$$
  points of high probability are those of small (Mahalanobis) distance
  $$d^2(x, \mu) = (x - \mu)^T \Sigma^{-1} (x - \mu)$$
  to the center $\mu$
– Thus the Mahalanobis distance can be interpreted as the "right" norm for a certain type of non-Cartesian space
– The mean $\mu$ and covariance $\Sigma$ capture, up to second order, everything we could possibly know statistically about the data:
– The pdf for a d-dimensional Gaussian of mean $\mu$ and covariance $\Sigma$ is
  $$P_X(x) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}$$
– This is equivalent to
  $$P_X(x) = K \exp\left\{ -\frac{1}{2} d^2(x, \mu) \right\}, \quad K = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}},$$
  which is the exponential of the negative Mahalanobis distance-squared up to a constant scaling factor K
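A numerical check of this identity; it assumes SciPy is available for the reference pdf, and the mean, covariance, and test point are arbitrary choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([1.0, -0.5])
d = len(mu)

# P_X(x) = K * exp(-1/2 d^2(x, mu)),  K = ((2 pi)^d |Sigma|)^{-1/2}
Sigma_inv = np.linalg.inv(Sigma)
d2 = (x - mu) @ Sigma_inv @ (x - mu)          # squared Mahalanobis distance
K = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
print(K * np.exp(-0.5 * d2))                           # manual evaluation
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # same value
```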
Loss Functions
– We have seen how important it is to adapt the metric to the properties of the data
– Can we go further and find an "optimal" metric? What could we mean by "optimal"?
– To define optimality we introduce a cost:
  – Cost is a real-valued loss function $L(y, \hat{y})$ that we want to minimize
  – It depends on the true $y$ and the prediction $\hat{y} = f(x)$
  – The value of the cost tells us how good our predictor $f(\cdot)$ is
– What types of errors can we have? Two Types: False Positives and False Negatives
– E.g., in face detection:
  – we say "face" when there is no face: a false positive
  – we say "non-face" when there is a face: a false negative (miss)
– Obviously, we have similar sub-classes for non-errors: true positives and true negatives
– The positive/negative part reflects what we say (predict); the true/false part reflects the reality of the situation
– The importance of each error type depends on the problem
– Consider a snake looking for lunch: it can classify each frog that it sees, Y ∈ {"dart", "regular"}, and eating a (poisonous) dart frog can be fatal
– A plausible loss function for this ordinary snake (the "Ordinary Snake Losses"; rows are the prediction i, columns the true class j):

  snake prediction | frog = dart | frog = regular
  "regular"        |      ∞      |       0
  "dart"           |      0      |      10
– Now consider a different snake: a dart frog predator
– It also can classify each frog it sees, Y ∈ {"dart", "regular"}, but it actually prefers to eat dart frogs and thus it might pass up a regular frog in its search for a tastier meal
– Its loss function (the "Dart-Snake Losses") is therefore:

  snake prediction | frog = dart | frog = regular
  "regular"        |     10      |       0
  "dart"           |      0      |      10
(Data-Conditional) Risk
– Denote by $L[i|j]$ the loss (cost) of classifying a data vector x generated from class j as i
– To measure how good the classifier is on the average if one (always) decides i, use the (conditional) expected value of the loss, aka the (data-conditional) Risk:
  $$R(i|x) \;\stackrel{\text{def}}{=}\; E_{Y|X}\big[\, L[i|Y] \,\big|\, x \,\big] = \sum_j L[i|j]\, P_{Y|X}(j|x)$$
– i.e. the sum, over all classes j, of the cost of classifying x as i when the truth is j, times the conditional probability that the true class is j (where the conditioning is on the observed value of x)
The Optimal Classifier
– This immediately allows us to define an optimal classifier as the one that minimizes the data-conditional risk
– For a given observation x, the Optimal Decision is given by
  $$i^*(x) = \arg\min_i R(i|x) = \arg\min_i \sum_j L[i|j]\, P_{Y|X}(j|x)$$
  and it has optimal (minimal) risk given by
  $$R^*(x) = R(i^*(x)|x) = \min_i \sum_j L[i|j]\, P_{Y|X}(j|x)$$
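A generic sketch of this rule (function and variable names are my own): given a loss matrix L[i, j], the loss of deciding i when the true class is j, and the posterior vector $P_{Y|X}(j|x)$, pick the decision of minimal conditional risk:

```python
import numpy as np

def bayes_decision(L, posterior):
    """Minimum-risk decision.
    L[i, j]   = loss of deciding i when the true class is j
    posterior = P_{Y|X}(j | x) for each class j
    Returns (i*, R(i*|x)) for the observed x."""
    risks = L @ posterior          # R(i|x) = sum_j L[i,j] P(j|x), for every i
    i_star = int(np.argmin(risks))
    return i_star, risks[i_star]

# Toy usage with a 0/1-style loss and some posterior
L = np.array([[0.0, 1.0], [1.0, 0.0]])
print(bayes_decision(L, np.array([0.3, 0.7])))  # -> (1, 0.3)
```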
– Example: the ordinary snake sees a frog (observation x), makes its probability assessments, and computes an optimal decision given its loss function L
– Suppose it is sure the frog is regular:
  $$P_{Y|X}(j|x) = \begin{cases} 0, & j = \text{dart} \\ 1, & j = \text{regular} \end{cases}$$
– Risk of deciding "regular", using the Ordinary Snake Losses and the class probabilities conditioned on x:
  $$R(\text{reg}|x) = L[\text{reg}|\text{reg}]\, P_{Y|X}(\text{reg}|x) + L[\text{reg}|\text{dart}]\, P_{Y|X}(\text{dart}|x) = 0 \cdot 1 + \infty \cdot 0 = 0$$
– Risk of deciding "dart":
  $$R(\text{dart}|x) = L[\text{dart}|\text{reg}]\, P_{Y|X}(\text{reg}|x) + L[\text{dart}|\text{dart}]\, P_{Y|X}(\text{dart}|x) = 10 \cdot 1 + 0 \cdot 0 = 10$$
– The optimal decision is "regular": the snake eats the frog given the observation x and has a good, safe lunch (risk = 0)
– Next, the ordinary snake sees a different image x
– It "knows" that dart frogs can be colorful, so it assigns a nonzero probability to this image x showing a dart frog:
  $$P_{Y|X}(j|x) = \begin{cases} 0.1, & j = \text{dart} \\ 0.9, & j = \text{regular} \end{cases}$$
– Risk of deciding "regular", using the Ordinary Snake Losses and the class probabilities conditioned on the new x:
  $$R(\text{reg}|x) = L[\text{reg}|\text{reg}] \cdot 0.9 + L[\text{reg}|\text{dart}] \cdot 0.1 = 0 \cdot 0.9 + \infty \cdot 0.1 = \infty$$
– Risk of deciding "dart":
  $$R(\text{dart}|x) = L[\text{dart}|\text{reg}] \cdot 0.9 + L[\text{dart}|\text{dart}] \cdot 0.1 = 10 \cdot 0.9 + 0 \cdot 0.1 = 9$$
– The optimal decision is "dart": the snake passes up this frog, even though this is a regular frog with 0.9 probability
– Now the dart-snake (the predator) sees the first frog, makes the same probability assessments,
  $$P_{Y|X}(j|x) = \begin{cases} 0, & j = \text{dart} \\ 1, & j = \text{regular} \end{cases}$$
  and computes the optimal decision under the Dart-Snake Losses
– Risk of deciding "regular" (Dart-Snake Losses):
  $$R(\text{reg}|x) = L[\text{reg}|\text{reg}] \cdot 1 + L[\text{reg}|\text{dart}] \cdot 0 = 0 \cdot 1 + 10 \cdot 0 = 0$$
– Risk of deciding "dart":
  $$R(\text{dart}|x) = L[\text{dart}|\text{reg}] \cdot 1 + L[\text{dart}|\text{dart}] \cdot 0 = 10 \cdot 1 + 0 \cdot 0 = 10$$
– So the dart-snake also decides "regular", consistent with the x-conditional class probabilities
– Now the dart-snake sees the colorful frog; let's assume that it makes the same probability assignments as the ordinary snake:
  $$P_{Y|X}(j|x) = \begin{cases} 0.1, & j = \text{dart} \\ 0.9, & j = \text{regular} \end{cases}$$
– Risk of deciding "regular" (Dart-Snake Losses):
  $$R(\text{reg}|x) = 0 \cdot 0.9 + 10 \cdot 0.1 = 1$$
– Risk of deciding "dart":
  $$R(\text{dart}|x) = 10 \cdot 0.9 + 0 \cdot 0.1 = 9$$
– Since 1 < 9, the dart-snake decides "regular" and passes up this (probably regular) frog
– In summary: when both snakes are sure the frog is regular ($P_{Y|X}(\text{dart}|x) = 0$), then both say "regular"
– But when $P_{Y|X}(\text{dart}|x) = 0.1$ and $P_{Y|X}(\text{reg}|x) = 0.9$:
  – the vulnerable snake decides "dart"
  – the predator snake decides "regular"
– The infinite loss of mistakenly eating a dart frog makes the vulnerable snake much more cautious! The sketch below reproduces all four decisions.
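The whole snake story in a few lines; the only liberty taken is replacing the infinite loss with a very large finite number so that the arithmetic stays well defined:

```python
import numpy as np

BIG = 1e6  # a very large finite loss standing in for the "infinite" cost

def decide(L, posterior, names=("dart", "regular")):
    risks = L @ posterior                 # R(i|x) = sum_j L[i,j] P(j|x)
    return names[int(np.argmin(risks))], risks

# Rows: decision ("dart", "regular"); columns: truth (dart frog, regular frog)
L_ordinary = np.array([[0.0, 10.0],   # decide "dart": 0 if dart frog, 10 if regular
                       [BIG, 0.0]])   # decide "regular": (near-)fatal if dart frog
L_predator = np.array([[0.0, 10.0],
                       [10.0, 0.0]])

p_sure     = np.array([0.0, 1.0])     # P(dart|x) = 0,   P(regular|x) = 1
p_colorful = np.array([0.1, 0.9])     # P(dart|x) = 0.1, P(regular|x) = 0.9

print(decide(L_ordinary, p_sure))     # -> 'regular' (risks: dart 10, regular 0)
print(decide(L_ordinary, p_colorful)) # -> 'dart'    (risks: dart 9, regular huge)
print(decide(L_predator, p_sure))     # -> 'regular' (risks: dart 10, regular 0)
print(decide(L_predator, p_colorful)) # -> 'regular' (risks: dart 9, regular 1)
```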
– The only ingredients of the optimal decision, besides the observation x, are:
  – the Loss Function $L[i|j]$ (which says: "if you decide i when the truth is j, you will lose this much!")
  – and the Measurement-Conditional Probabilities $P_{Y|X}(j|x)$
– Note that, for images like the colorful frog, the best the ordinary snake can do is to always decide "dart" and accept the loss of 9
– A more realistic loss function for an ordinary snake would have to:
  – Account for how hungry the snake is. (If the snake is starving, it will have to be more risk-preferring.)
  – Assign a finite cost to the choice of "regular" when the frog is a dart. (Maybe dart frogs will only make the snake super sick sometimes.)
– All of this assumes that you know how much mistakes will cost you, or can assess that in some way. What if I can't do that? One reasonable default is the 0/1 loss function.
The 0/1 Loss Function
– It assigns: i) zero loss for no error, and ii) equal loss for the two error types:

  snake prediction | frog = dart | frog = regular
  "regular"        |      1      |       0
  "dart"           |      0      |       1

  i.e. $L[i|j] = 0$ if $i = j$ and $L[i|j] = 1$ if $i \neq j$
– Under this loss the optimal decision becomes
  $$i^*(x) = \arg\min_i \sum_{j \neq i} P_{Y|X}(j|x)$$
– Since $\sum_{j \neq i} P_{Y|X}(j|x) = 1 - P_{Y|X}(i|x)$, this is equivalent to
  $$i^*(x) = \arg\max_i P_{Y|X}(i|x)$$
– i.e. pick the class that has the largest posterior probability given the observation x (pick the most probable class); this is the maximum a posteriori (MAP) rule
– We will simplify our discussion by assuming this loss, but you should always be aware that other losses may be used
– The associated optimal (minimal) risk is
  $$R^*(x) = \sum_{j \neq i^*(x)} P_{Y|X}(j|x) = 1 - P_{Y|X}(i^*(x)|x),$$
  which is the x-conditional probability that the optimal decision is wrong
– Averaging over x, $E_X[R^*(X)]$, gives the expected probability of error of the optimal decision
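A sketch of the 0/1-loss case, with an arbitrary three-class posterior, checking that minimizing the risk equals maximizing the posterior and that the minimal risk is $1 - \max_i P_{Y|X}(i|x)$:

```python
import numpy as np

posterior = np.array([0.1, 0.6, 0.3])   # P_{Y|X}(j|x) for classes j = 0, 1, 2

# 0/1 loss: L[i,j] = 0 if i == j else 1
L = 1.0 - np.eye(3)
risks = L @ posterior                   # R(i|x) = sum_{j != i} P(j|x) = 1 - P(i|x)

i_star = int(np.argmax(posterior))      # MAP rule: equivalent to argmin of risks
assert i_star == int(np.argmin(risks))

p_error_given_x = 1.0 - posterior[i_star]   # R*(x): prob. the decision is wrong
print(i_star, p_error_given_x)              # -> 1, 0.4
```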