Bayesian Decision Theory Chapter 2
(Jan 11, 18, 23, 25)
- Bayes decision theory is a fundamental statistical approach to pattern classification
- Assumption: the decision problem is posed in probabilistic terms, and all relevant probability values are known
Decision making when the probabilistic model is:
- Known: Bayes Decision Theory (Chapter 2), giving "optimal" rules
- Unknown:
  - Supervised Learning:
    - Parametric Approach (Chapter 3): plug-in rules
    - Nonparametric Approach (Chapters 4, 6): density estimation, K-NN, neural networks
  - Unsupervised Learning:
    - Parametric Approach (Chapter 10): mixture models
    - Nonparametric Approach (Chapter 10): cluster analysis
- Prior probabilities: P(ω1) + P(ω2) = 1; P(ω1) = P(ω2) = 1/2 if the classes are equally likely (uniform priors)
- Priors reflect how likely we are to observe a sea bass or salmon before any measurement; the prior prob. may depend on the time of the year or the fishing area!
- Decision rule using the class-conditional densities alone: decide ω1 if p(x | ω1) > p(x | ω2), otherwise decide ω2
- Posterior probability via the Bayes formula:
  P(ωj | x) = p(x | ωj) P(ωj) / p(x), j = 1, 2
  where p(x) = Σ_{j=1}^{2} p(x | ωj) P(ωj)
- The evidence p(x) is a scale factor that guarantees that the posterior probabilities sum to 1
- P(ωj | x) is the probability of the state of nature being ωj given that feature value x has been observed
"Optimal" Bayes decision rule. What does optimal mean? For a given observation (feature value) x:
- if P(ω1 | x) > P(ω2 | x), decide ω1
- if P(ω1 | x) < P(ω2 | x), decide ω2
To justify the above rule, calculate the probability of error:
- P(error | x) = P(ω1 | x) if we decide ω2
- P(error | x) = P(ω2 | x) if we decide ω1
so the rule achieves P(error | x) = min[P(ω1 | x), P(ω2 | x)], the smallest possible error for every x
Special cases:
(i) P(ω1) = P(ω2): decide ω1 if p(x | ω1) > p(x | ω2), otherwise ω2
(ii) p(x | ω1) = p(x | ω2): decide ω1 if P(ω1) > P(ω2), otherwise ω2
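A minimal sketch of the rule in Python (not from the slides; the priors, the Gaussian class-conditional densities, and the observed value x are illustrative assumptions):

```python
# Two-category Bayes rule: posteriors via the Bayes formula, then decide
# and report the conditional error P(error | x).
import numpy as np
from scipy.stats import norm

priors = np.array([0.5, 0.5])                    # P(w1), P(w2)
x = 1.2
likelihoods = np.array([norm.pdf(x, 0.0, 1.0),   # p(x | w1), assumed N(0, 1)
                        norm.pdf(x, 2.0, 1.0)])  # p(x | w2), assumed N(2, 1)

joint = likelihoods * priors
evidence = joint.sum()                 # p(x): scale factor so posteriors sum to 1
posteriors = joint / evidence          # P(w_j | x)

decision = np.argmax(posteriors)       # decide w1 if P(w1|x) > P(w2|x), else w2
p_error = 1.0 - posteriors[decision]   # P(error | x) = min_j P(w_j | x)
print(f"posteriors = {posteriors}, decide w{decision + 1}, P(error|x) = {p_error:.3f}")
```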
Generalizations of this basic setup:
- more than one feature and more than two states of nature
- allowing actions other than simply deciding the state of nature
- introducing a loss function more general than minimizing the probability of error
The overall risk R is the expected loss taken over every possible observation x.
Conditional risk of taking action αi given observation x:
R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x)
where λ(αi | ωj) is the loss incurred for taking action αi when the true state of nature is ωj
R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)
Take action α1 ("decide ω1") if R(α1 | x) < R(α2 | x), i.e., decide ω1 if
(λ21 − λ11) P(ω1 | x) > (λ12 − λ22) P(ω2 | x), and decide ω2 otherwise
If action αi is taken and the true state of nature is ωj, then the decision is correct if i = j and in error if i ≠ j
Zero-one loss function: no loss for a correct decision and a unit loss for an incorrect decision:
λ(αi, ωj) = 0 if i = j; 1 if i ≠ j (i, j = 1, …, c)
The conditional risk can now be simplified as:
R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x) = Σ_{j≠i} P(ωj | x) = 1 − P(ωi | x)
"The risk corresponding to the 0-1 loss function is the average probability of error"
Equivalently, in terms of the likelihood ratio: let λij denote the loss for deciding ωi when the true state is ωj; then decide ω1 if:
p(x | ω1) / p(x | ω2) > [(λ12 − λ22) / (λ21 − λ11)] · P(ω2) / P(ω1)
Under the zero-one loss this threshold is θa = P(ω2) / P(ω1); if errors on ω2 are penalized more heavily (λ12 > λ21), the threshold rises to a larger value θb
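A short sketch of the general-loss rule via conditional risks; the loss matrix and posteriors below are illustrative assumptions, and the chosen action minimizes R(αi | x):

```python
# Two-category decision with a general loss matrix lambda_ij.
import numpy as np

loss = np.array([[0.0, 2.0],     # lambda_11, lambda_12
                 [1.0, 0.0]])    # lambda_21, lambda_22
posteriors = np.array([0.3, 0.7])   # P(w1 | x), P(w2 | x), assumed known

risks = loss @ posteriors   # R(alpha_i | x) = sum_j lambda_ij P(w_j | x)
action = np.argmin(risks)   # take the action with minimum conditional risk
print(risks, f"-> take alpha_{action + 1}")
```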
Discriminant functions g1, …, gc: assign x to class ωi if
gi(x) > gj(x) for all j ≠ i
For minimum-error-rate classification we may take
gi(x) = P(ωi | x), or equivalently gi(x) = ln p(x | ωi) + ln P(ωi) (ln: natural log)
Two-category case: use a single discriminant built from the discriminant functions g1 and g2. Let g(x) ≡ g1(x) − g2(x); decide ω1 if g(x) > 0, otherwise decide ω2:
g(x) = ln [p(x | ω1) / p(x | ω2)] + ln [P(ω1) / P(ω2)]
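A compact sketch of the two-category discriminant; the densities and priors are illustrative assumptions:

```python
# g_i(x) = ln p(x | w_i) + ln P(w_i); decide w1 when g_1(x) - g_2(x) > 0.
import numpy as np
from scipy.stats import norm

priors = [0.5, 0.5]
dens = [norm(0.0, 1.0), norm(2.0, 0.5)]   # assumed p(x | w_1), p(x | w_2)

def g(x, i):
    return dens[i].logpdf(x) + np.log(priors[i])

x = 1.0
gx = g(x, 0) - g(x, 1)                    # g(x) = g_1(x) - g_2(x)
print("decide w1" if gx > 0 else "decide w2")
```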
The normal (Gaussian) density is an appropriate model when the feature vectors of a class can be viewed as randomly corrupted (noisy) versions of a single typical or prototype pattern
Univariate normal density:
p(x) = [1 / (√(2π) σ)] exp[−½ ((x − μ) / σ)²], −∞ < x < ∞
where: μ = mean (or expected value) of x; σ² = variance (or expected squared deviation) of x
Multivariate normal density in d dimensions:
p(x) = [1 / ((2π)^{d/2} |Σ|^{1/2})] exp[−½ (x − μ)^t Σ⁻¹ (x − μ)]
where:
- x = (x1, x2, …, xd)^t ("t" stands for the transpose of a vector)
- μ = (μ1, μ2, …, μd)^t is the mean vector
- Σ is the d×d covariance matrix; |Σ| and Σ⁻¹ are its determinant and inverse, respectively
- Σ is positive definite, so the determinant of Σ is strictly positive
The density is completely specified by the parameters μ and Σ.
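As a sanity check, the density can be evaluated directly from this formula; a sketch with illustrative parameters (scipy.stats.multivariate_normal returns the same value):

```python
# Multivariate normal density evaluated straight from the formula.
import numpy as np

def mvn_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(mvn_pdf(np.array([1.0, -0.5]), mu, Sigma))
```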
r² = (x − μ)^t Σ⁻¹ (x − μ) is the squared Mahalanobis distance from x to μ
- Samples drawn from a normal population tend to fall in a single cloud or cluster; the cluster center is determined by the mean vector and the cluster shape by the covariance matrix
- The loci of points of constant density are hyperellipsoids of constant Mahalanobis distance, whose principal axes are the eigenvectors of Σ
Linear combinations of jointly normal random variables have a normal distribution; a linear transformation can convert an arbitrary multivariate normal distribution into a spherical one ("whitening")
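A minimal sketch of the whitening transform Aw = Φ Λ^{−1/2}, where Φ holds the eigenvectors of Σ and Λ its eigenvalues; the transformed variable y = Aw^t x has identity covariance (Σ below is an illustrative assumption):

```python
# Whitening: A_w^t Sigma A_w = I, so y = A_w^t x is spherically distributed.
import numpy as np

Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
eigvals, Phi = np.linalg.eigh(Sigma)    # Sigma = Phi diag(eigvals) Phi^t
A_w = Phi @ np.diag(eigvals ** -0.5)    # whitening matrix

# Covariance of the transformed variable is the identity matrix:
print(np.round(A_w.T @ Sigma @ A_w, 6))
```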
For normal densities p(x | ωi) ~ N(μi, Σi), the discriminant gi(x) = ln p(x | ωi) + ln P(ωi) becomes:
gi(x) = −½ (x − μi)^t Σi⁻¹ (x − μi) − (d/2) ln 2π − ½ ln |Σi| + ln P(ωi)
Case 1: Σi = σ² I (I is the identity matrix)
- Features are statistically independent and each feature has the same variance, irrespective of the class
- The discriminant functions reduce to linear functions:
  gi(x) = wi^t x + wi0, with wi = μi / σ² and wi0 = −(μi^t μi) / (2σ²) + ln P(ωi)
- A classifier that uses linear discriminant functions is called "a linear machine"
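A sketch of this Case-1 linear machine; the means, variance, and priors are illustrative assumptions:

```python
# Case 1 (Sigma_i = sigma^2 I): g_i(x) = w_i^t x + w_i0.
import numpy as np

mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
priors = [0.5, 0.5]
sigma2 = 1.0

def g(x, i):
    w = mus[i] / sigma2                                      # w_i = mu_i / sigma^2
    w0 = -(mus[i] @ mus[i]) / (2 * sigma2) + np.log(priors[i])
    return w @ x + w0

x = np.array([1.5, 0.2])
print("decide w1" if g(x, 0) > g(x, 1) else "decide w2")
```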
The decision surfaces of a linear machine are pieces of hyperplanes defined by the linear equations gi(x) = gj(x); for Case 1 this gives
w^t (x − x0) = 0, with w = μi − μj and
x0 = ½ (μi + μj) − [σ² / ||μi − μj||²] ln [P(ωi) / P(ωj)] (μi − μj)
so the hyperplane passes through x0 and is orthogonal to the line joining the means
Case 2: Σi = Σ (equal covariance matrices). The discriminants remain linear and the boundary is the hyperplane w^t (x − x0) = 0 with w = Σ⁻¹ (μi − μj) and
x0 = ½ (μi + μj) − { ln [P(ωi) / P(ωj)] / [(μi − μj)^t Σ⁻¹ (μi − μj)] } (μi − μj)
Case 3: Σi arbitrary. In the 2-category case, the decision surfaces are hyperquadrics that can assume any of the general forms: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, hyperhyperboloids
The discriminant functions are quadratic:
gi(x) = x^t Wi x + wi^t x + wi0, where:
Wi = −½ Σi⁻¹
wi = Σi⁻¹ μi
wi0 = −½ μi^t Σi⁻¹ μi − ½ ln |Σi| + ln P(ωi)
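A sketch of this general quadratic discriminant assembled from (μi, Σi, P(ωi)); all parameter values below are illustrative:

```python
# Case 3: g_i(x) = x^t W_i x + w_i^t x + w_i0 for arbitrary Gaussian classes.
import numpy as np

def quadratic_discriminant(mu, Sigma, prior):
    Sigma_inv = np.linalg.inv(Sigma)
    W = -0.5 * Sigma_inv
    w = Sigma_inv @ mu
    w0 = (-0.5 * mu @ Sigma_inv @ mu
          - 0.5 * np.log(np.linalg.det(Sigma))
          + np.log(prior))
    return lambda x: x @ W @ x + w @ x + w0

g1 = quadratic_discriminant(np.array([0., 0.]), np.eye(2), 0.5)
g2 = quadratic_discriminant(np.array([2., 1.]), np.array([[2., .3], [.3, 1.]]), 0.5)
x = np.array([1.0, 0.5])
print("decide w1" if g1(x) > g2(x) else "decide w2")
```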
Example (two Gaussian classes ω1, ω2): the resulting decision boundary is the parabola
x2 = 3.514 − 1.125 x1 + 0.1875 x1²
– Simpler to compute the prob. of being correct (more ways to be wrong than to be right)
Figure: Bayes optimal decision boundary in the 1-D case (classes ω1, ω2).
Error probability for two normal densities: let p(x | ωi) ~ N(μi, Σ), i = 1, 2 (equal covariance matrices), with
gi(x) = ln P(ωi) − ½ (x − μi)^t Σ⁻¹ (x − μi)

We decide ω1 when g1(x) ≥ g2(x), i.e. when
½ (x − μ2)^t Σ⁻¹ (x − μ2) − ½ (x − μ1)^t Σ⁻¹ (x − μ1) ≥ ln [P(ω2) / P(ω1)]

Expanding the quadratic terms, the condition becomes h(x) ≥ t, with
h(x) = (μ1 − μ2)^t Σ⁻¹ x − ½ (μ1 − μ2)^t Σ⁻¹ (μ1 + μ2), t = ln [P(ω2) / P(ω1)]

h(x) is a linear combination of jointly normal random variables, so it is normally distributed under each class. Let η = ½ (μ1 − μ2)^t Σ⁻¹ (μ1 − μ2), half the squared Mahalanobis distance between the means. Then:
E[h(x) | ω1] = (μ1 − μ2)^t Σ⁻¹ μ1 − ½ (μ1 − μ2)^t Σ⁻¹ (μ1 + μ2) = η
E[h(x) | ω2] = −η
Var[h(x) | ωi] = (μ1 − μ2)^t Σ⁻¹ Σ Σ⁻¹ (μ1 − μ2) = 2η
so that
p(h(x) | x ∈ ω1) ~ N(η, 2η), p(h(x) | x ∈ ω2) ~ N(−η, 2η)

The error probability follows by integrating the two Gaussian tails on the wrong side of the threshold t:
P(error) = P(ω1) P(h(x) < t | ω1) + P(ω2) P(h(x) > t | ω2)
= P(ω1) · ½ [1 + erf((t − η) / (2√η))] + P(ω2) · ½ [1 − erf((t + η) / (2√η))]

For equal priors (t = 0) this reduces to:
P(error) = ½ [1 − erf(√η / 2)]
Mahalanobis distance is a good measure of separation between classes
Limiting cases of P(error) = ½ [1 − erf(√η / 2)]:
(i) No class separation: μ1 = μ2 ⇒ η = 0 ⇒ P(error) = 1/2
(ii) Perfect class separation: (μ1 − μ2)^t Σ⁻¹ (μ1 − μ2) → ∞ ⇒ erf(√η / 2) → 1 ⇒ P(error) → 0
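A quick numerical check of the closed form; the means and covariance are illustrative assumptions:

```python
# P(error) = (1/2)[1 - erf(sqrt(eta)/2)] for equal priors, equal covariances.
import numpy as np
from scipy.special import erf

mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])

diff = mu1 - mu2
eta = 0.5 * diff @ np.linalg.inv(Sigma) @ diff   # half squared Mahalanobis distance
p_error = 0.5 * (1.0 - erf(np.sqrt(eta) / 2.0))
print(f"eta = {eta:.3f}, P(error) = {p_error:.4f}")
```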
Error bounds: assume the conditional probabilities are normal, p(x | ωi) ~ N(μi, Σi); the exact error is hard to evaluate, but analytic upper bounds exist
The Chernoff bound for P(error) is found by determining the value of β that minimizes exp(−k(β))
When the two covariance matrices are equal, k(1/2) is proportional to the squared Mahalanobis distance between the two means
Chernoff bound vs. Bhattacharyya bound (β = 1/2): example with 2-category, 2D data
- True error using numerical integration = 0.0021
- Best Chernoff error bound = 0.008190
- Bhattacharyya error bound = 0.008191
Chernoff bound: for any 0 ≤ β ≤ 1,
P(error) ≤ P(ω1)^β P(ω2)^{1−β} ∫ p(x | ω1)^β p(x | ω2)^{1−β} dx
For normal densities the integral has a closed form:
∫ p(x | ω1)^β p(x | ω2)^{1−β} dx = e^{−k(β)}
k(β) = [β(1 − β) / 2] (μ2 − μ1)^t [β Σ1 + (1 − β) Σ2]⁻¹ (μ2 − μ1) + ½ ln { |β Σ1 + (1 − β) Σ2| / (|Σ1|^β |Σ2|^{1−β}) }
Bhattacharyya bound (β = 1/2):
P(error) ≤ √(P(ω1) P(ω2)) ∫ √(p(x | ω1) p(x | ω2)) dx = √(P(ω1) P(ω2)) e^{−k(1/2)}
k(1/2) = (1/8) (μ2 − μ1)^t [(Σ1 + Σ2) / 2]⁻¹ (μ2 − μ1) + ½ ln { |(Σ1 + Σ2) / 2| / √(|Σ1| |Σ2|) }
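A sketch that evaluates k(1/2) and the resulting Bhattacharyya bound; all parameter values are illustrative:

```python
# Bhattacharyya bound on P(error) for two Gaussian classes.
import numpy as np

def bhattacharyya_bound(mu1, Sigma1, mu2, Sigma2, P1, P2):
    Sigma = 0.5 * (Sigma1 + Sigma2)
    diff = mu2 - mu1
    k_half = (0.125 * diff @ np.linalg.inv(Sigma) @ diff
              + 0.5 * np.log(np.linalg.det(Sigma)
                             / np.sqrt(np.linalg.det(Sigma1) * np.linalg.det(Sigma2))))
    return np.sqrt(P1 * P2) * np.exp(-k_half)

mu1, mu2 = np.array([0., 0.]), np.array([3., 3.])
S1 = np.array([[2., .5], [.5, 2.]])
S2 = np.array([[5., 4.], [4., 5.]])
print(bhattacharyya_bound(mu1, S1, mu2, S2, 0.5, 0.5))
```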
Signal detection theory: we are interested in detecting a single weak pulse, e.g., a radar reflection; the internal signal x in the detector has mean μ1 (μ2) when the pulse is absent (present)
Discriminability: ease of determining whether the pulse is present or not
The detector uses a threshold x* to decide whether the pulse is present
For a given threshold x*, define hit, false alarm, miss, and correct rejection:
- P(x > x* | x ∈ ω2): hit
- P(x > x* | x ∈ ω1): false alarm
- P(x < x* | x ∈ ω2): miss
- P(x < x* | x ∈ ω1): correct rejection
p(x | ω1) ~ N(μ1, σ²), p(x | ω2) ~ N(μ2, σ²)
Discriminability: d′ = |μ2 − μ1| / σ
- Receiver operating characteristic (ROC) curve: hit rate vs. false alarm rate, traced out by sweeping the threshold x* for a fixed d′; performance is shown at different values of d′
- In general the distributions will be multidimensional; an ROC curve can still be plotted by varying a single control parameter of the decision rule and plotting the resulting hit and false alarm rates
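A sketch that traces hit and false-alarm rates for a sweep of thresholds under the 1-D Gaussian model above; μ1, μ2, and σ are illustrative assumptions:

```python
# Points on an ROC curve for p(x|w1) ~ N(mu1, s^2), p(x|w2) ~ N(mu2, s^2).
import numpy as np
from scipy.stats import norm

mu1, mu2, sigma = 0.0, 2.0, 1.0          # d' = |mu2 - mu1| / sigma = 2
thresholds = np.linspace(-4, 6, 101)

false_alarm = 1 - norm.cdf(thresholds, mu1, sigma)   # P(x > x* | w1)
hit = 1 - norm.cdf(thresholds, mu2, sigma)           # P(x > x* | w2)
for fa, h in list(zip(false_alarm, hit))[::25]:
    print(f"false alarm = {fa:.3f}, hit = {h:.3f}")
```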
Independent binary features: let pi = P(xi = 1 | ω1) and qi = P(xi = 1 | ω2). The minimum-error discriminant is linear:
g(x) = Σ_{i=1}^{d} wi xi + w0, where
wi = ln [pi (1 − qi) / (qi (1 − pi))], i = 1, …, d
w0 = Σ_{i=1}^{d} ln [(1 − pi) / (1 − qi)] + ln [P(ω1) / P(ω2)]
Example with d = 3 binary features, where the priors are equal: pi = 0.8 and qi = 0.5, i = 1, 2, 3
- Left figure: pi = 0.8 and qi = 0.5 for all three features
- Right figure: p3 = q3 (feature 3 does not provide any discriminatory information), so the decision surface is parallel to the x3 axis
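A sketch of this linear discriminant using the example's values (pi = 0.8, qi = 0.5, equal priors):

```python
# Independent binary features: g(x) = sum_i w_i x_i + w_0.
import numpy as np

p = np.array([0.8, 0.8, 0.8])   # p_i = P(x_i = 1 | w1)
q = np.array([0.5, 0.5, 0.5])   # q_i = P(x_i = 1 | w2)
P1 = P2 = 0.5                   # equal priors

w = np.log(p * (1 - q) / (q * (1 - p)))             # per-feature weights
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P1 / P2)

x = np.array([1, 0, 1])         # an observed binary feature vector
g = w @ x + w0
print("decide w1" if g > 0 else "decide w2")
```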
"Classification, Estimation and Pattern Recognition" by Young and Calvert
– Consecutive states of nature may be dependent; the current state of nature carries information about the states that follow
– Exploit such statistical dependence to gain improved performance (use of context)
– Compound decision vs. sequential compound decision
– Markov dependence
– Feature measurement process is sequential
– Feature measurement cost
– Minimize a combination of feature measurement cost and classification error