Bayes Decision Theory - II, Ken Kreutz-Delgado (Nuno Vasconcelos), PowerPoint PPT Presentation



SLIDE 1

Bayes Decision Theory - II

Ken Kreutz-Delgado (Nuno Vasconcelos)

ECE 175 – Winter 2012 - UCSD

SLIDE 2

Nearest Neighbor Classifier

  • We are considering supervised classification
  • Nearest Neighbor (NN) Classifier

– A training set D = {(x1, y1), ..., (xn, yn)}
– xi is a vector of observations, yi is the corresponding class label
– Given a new vector x to classify

  • The “NN Decision Rule” is

– argmin means: “the i that minimizes the distance”

$$i^* = \arg\min_{i \in \{1,\dots,n\}} d(x, x_i), \qquad \text{set } \hat{y} = y_{i^*}$$

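The NN rule above can be sketched in a few lines; the toy training set D and its labels below are made up for illustration, with Euclidean distance standing in for d.

```python
import math

def nn_classify(x, data):
    """Return y_{i*} where i* = argmin_i d(x, x_i), with Euclidean d."""
    # data is a list of (x_i, y_i) pairs; x and each x_i are feature tuples
    i_star = min(range(len(data)), key=lambda i: math.dist(x, data[i][0]))
    return data[i_star][1]

D = [((0.0, 0.0), "a"), ((1.0, 1.0), "b"), ((4.0, 4.0), "b")]
print(nn_classify((0.2, 0.1), D))  # "a": (0, 0) is the nearest training point
```

Swapping in a different distance function changes the classifier, which is exactly the metric-dependence discussed on the next slide.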
SLIDE 3

Optimal Classifiers

  • We have seen that performance depends on metric
  • Some metrics are “better” than others
  • The meaning of “better” is connected to how well adapted

the metric is to the properties of the data

  • But can we be more rigorous? What do we mean by "optimal"?
  • To talk about optimality we define cost or loss

– Loss is the function that we want to minimize
– Loss depends on the true label y and the prediction $\hat{y} = f(x)$
– The loss $L(y, \hat{y})$ tells us how good our predictor $f(\cdot)$ is

SLIDE 4

Loss Functions & Classification Errors

  • Loss is a function of classification errors

– What errors can we have?
– Two types: false positives and false negatives

  • consider a face detection problem (decide “face” or “non-face”)
  • if you see a non-face and say "face", you have a false positive (a false alarm)
  • if you see a face and say "non-face", you have a false negative (a miss, a failure to detect)

– Obviously, we have corresponding sub-classes for non-errors

  • true-positives and true-negatives

– the positive/negative part reflects what we say or decide
– the true/false part reflects the true class label ("true state of the world")

SLIDE 5

(Conditional) Risk

  • To weigh different errors differently

– We introduce a loss function L[i, j], the cost of classifying a point from class i as j
– One way to measure how good the classifier is: use the (data-conditional) expected value of the loss, aka the (conditional) Risk
– Note that the (data-conditional) risk is a function of both the decision "decide class i" and the conditioning data (measured feature vector) x:

$$R(x, i) = E_{Y|X}\{ L[Y, i] \mid x \} = \sum_j L[j, i]\, P_{Y|X}(j \mid x)$$
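The conditional-risk sum is a one-liner; the loss table and posterior values below are made-up numbers, using a simple symmetric loss (1 for any error, 0 otherwise).

```python
def conditional_risk(i, loss, posterior):
    """R(x, i) = sum_j L[j, i] * P_{Y|X}(j | x)."""
    # loss maps (true class j, decision i) -> cost; posterior maps j -> P(j|x)
    return sum(loss[(j, i)] * p for j, p in posterior.items())

loss = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}  # 1 for any error
post = {0: 0.3, 1: 0.7}                               # hypothetical P(j|x)
print(conditional_risk(0, loss, post))  # 0.7: probability of error if we say 0
print(conditional_risk(1, loss, post))  # 0.3
```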

SLIDE 6

Loss Functions

  • Example: two snakes deciding whether to eat possibly-poisonous dart frogs

– A regular snake will die if it eats a dart frog
– Frogs are a good snack for the (immune) predator dart-snake
– This leads to the losses below
– What is the optimal decision when the snakes find a frog like these?

Regular (vulnerable) snake:

    decision \ frog    dart frog    regular frog
    "regular" (eat)       ∞             0
    "dart" (avoid)        0             10

Predator snake:

    decision \ frog    dart frog    regular frog
    "regular" (eat)       10            0
    "dart" (avoid)        0             10

SLIDE 7

Minimum Risk Classification

  • We have seen that

– if both snakes see $P_{Y|X}(\text{dart} \mid x) = 0$ (so $P_{Y|X}(\text{regular} \mid x) = 1$), then both say "regular"
– however, if $P_{Y|X}(\text{dart} \mid x) = 0.1$ and $P_{Y|X}(\text{regular} \mid x) = 0.9$, then the vulnerable snake says "dart" while the predator says "regular"

  • The infinite loss for saying "regular" when the frog is a dart frog makes the vulnerable snake much more cautious!
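The two snakes' decisions can be checked numerically; the loss tables follow the slide (with the vulnerable snake's fatal error given infinite loss), and the 0.1/0.9 posterior is the second scenario above.

```python
import math

# Loss tables keyed by (true frog class, decision); the vulnerable snake's
# loss for deciding "regular" on a dart frog is infinite (it dies).
vulnerable = {("dart", "regular"): math.inf, ("regular", "dart"): 10,
              ("dart", "dart"): 0, ("regular", "regular"): 0}
predator   = {("dart", "regular"): 10, ("regular", "dart"): 10,
              ("dart", "dart"): 0, ("regular", "regular"): 0}

def risk(decision, loss, posterior):
    # R(x, i) = sum_j L[j, i] * P_{Y|X}(j | x)
    return sum(loss[(j, decision)] * p for j, p in posterior.items())

def decide(loss, posterior):
    # Bayes decision rule: minimize the conditional risk
    return min(["dart", "regular"], key=lambda i: risk(i, loss, posterior))

post = {"dart": 0.1, "regular": 0.9}
print(decide(vulnerable, post))  # "dart": any dart probability outweighs inf loss
print(decide(predator, post))    # "regular": risk 1 beats risk 9
```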

SLIDE 8

BDR = Minimizing Conditional Risk

  • Note that the definition of risk:

– immediately defines the optimal classifier as the one that minimizes the conditional risk for a given observation x
– The Optimal Decision is the Bayes Decision Rule (BDR):

$$i^*(x) = \arg\min_i R(x, i) = \arg\min_i \sum_j L[j, i]\, P_{Y|X}(j \mid x)$$

– The BDR yields the optimal (minimal) risk:

$$R^*(x) = R(x, i^*) = \min_i \sum_j L[j, i]\, P_{Y|X}(j \mid x)$$

SLIDE 9

What is a Decision Rule?

  • Consider the c-ary classification problem with class labels in $C = \{1, \dots, c\}$
  • Given an observation (feature) x to be classified, a decision rule is a function d = d(·) of the observation that takes its values in the set of class labels: $d(x) \in \{1, \dots, c\}$
  • Note that $d^*(x) = i^*(x)$ defined on the previous slide is an optimal decision rule in the sense that, for a specific value of x, it minimizes the conditional risk R(x, i) over all possible decisions i in C

SLIDE 10

(d-Dependent) Total Average Risk

  • Given a decision rule d and the conditional risk R(x,i), we

can consider the (d-dependent) conditional risk R(x,d(x)).

  • We can now define the total

(d-Dependent) Expected or Average Risk (aka d-Risk):

– Note that we have averaged over all possible measurements (features) x that we might encounter in the world
– Note that R(d) is a function of a function! (a function of d)
– The d-risk R(d) is a measure of how we expect to perform on average when we use the fixed decision rule d over and over again on a large set of real-world data
– It is natural to ask if there is an "optimal decision rule" which minimizes the average risk R(d) over the class of all possible decision rules

$$R(d) = E_X\{ R(X, d(X)) \}$$

SLIDE 11

Minimizing the Average Risk R(d)

  • Optimizing total risk R(d) seems hard because we are trying to

minimize it over a family of functions (decision rules), d.

  • However, since the average risk R(d) is just the data-conditional risk R(x, d(x)) averaged over x, one can equivalently minimize R(x, d(x)) point-wise in x.
  • I.e., solve for the value of the optimal decision rule at each x.
  • Thus d*(x) = i*(x)!! I.e., the BDR, which we already know optimizes the Data-Conditional Risk, ALSO optimizes the Average Risk R(d) over ALL possible decision rules d!!

  • This makes sense: if the BDR is optimal for every single

situation, x, it must be optimal on the average over all x

$$R(d) = E_X\{R(X, d(X))\} = \int R(x, d(x))\, p_X(x)\, dx$$

$$d^*(x) = \arg\min_{d(x) = i} R(x, d(x)) = \arg\min_i R(x, i) = i^*(x)$$

SLIDE 12

The 0/1 Loss Function

  • An important special case of interest:

– zero loss for no error and equal loss for two error types

  • This is equivalent to the

“zero/one” loss :

  • Under this loss the optimal Bayes decision rule (BDR) is

The 0/1 loss for the snake problem:

    prediction \ frog    dart frog    regular frog
    "regular"               1             0
    "dart"                  0             1

$$L[j, i] = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases}$$

$$d^*(x) = i^*(x) = \arg\min_i \sum_j L[j, i]\, P_{Y|X}(j \mid x) = \arg\min_i \sum_{j \neq i} P_{Y|X}(j \mid x)$$

SLIDE 13

0/1 Loss yields MAP Decision Rule

  • Note that:

$$i^*(x) = \arg\min_i \sum_{j \neq i} P_{Y|X}(j \mid x) = \arg\min_i \left[1 - P_{Y|X}(i \mid x)\right] = \arg\max_i P_{Y|X}(i \mid x)$$

  • Thus the Optimal Decision for the 0/1 loss is:

– Pick the class that is most probable given the observation x
– i*(x) is known as the Maximum a Posteriori Probability (MAP) solution

  • This is also known as the Bayes Decision Rule (BDR) for the 0/1 loss

– We will often simplify our discussion by assuming this loss
– But you should always be aware that other losses may be used
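The equivalence above can be checked numerically on a hypothetical 3-class posterior: minimizing the 0/1-loss conditional risk picks the same class as maximizing the posterior.

```python
posterior = {0: 0.2, 1: 0.7, 2: 0.1}  # hypothetical P_{Y|X}(i | x)

map_choice = max(posterior, key=posterior.get)                  # argmax P(i|x)
risk_choice = min(posterior, key=lambda i: 1.0 - posterior[i])  # argmin 1 - P(i|x)
print(map_choice, risk_choice)  # 1 1
```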

SLIDE 14

BDR for the 0/1 Loss

  • Consider the evaluation of the BDR for 0/1 loss

– This is also called the Maximum a Posteriori Probability (MAP) rule
– It is usually not trivial to evaluate the posterior probabilities P_{Y|X}(i|x)
– This is due to the fact that we are trying to infer the cause (class i) from the consequence (observation x)
– i.e., we are trying to solve a nontrivial inverse problem

  • E.g. imagine that I want to evaluate

PY|X( person | “has two eyes”)

  • This strongly depends on what the other classes are

$$i^*(x) = \arg\max_i P_{Y|X}(i \mid x)$$

SLIDE 15

Posterior Probabilities and Detection

  • If the two classes are “people” and “cars”

– then PY|X( person | “has two eyes” ) = 1

  • But if the classes are “people” and “cats”

– then PY|X( person | “has two eyes” ) = ½ if there are equal numbers of cats and people to uniformly choose from [ this is additional info! ]

  • How do we deal with this problem?

– We note that it is much easier to infer consequence from cause
– E.g., it is easy to infer that P_{X|Y}("has two eyes" | person) = 1
– This does not depend on any other classes
– We do not need any additional information
– Given a class, just count the frequency of observation

SLIDE 16

Bayes Rule

  • How do we go from PX|Y( x | j ) to PY|X( j | x ) ?
  • We use Bayes rule:
  • Consider the two-class problem, i.e. Y=0 or Y=1

– the BDR under 0/1 loss is

$$P_{Y|X}(i \mid x) = \frac{P_{X|Y}(x \mid i)\, P_Y(i)}{P_X(x)}$$

$$i^*(x) = \arg\max_i P_{Y|X}(i \mid x) = \begin{cases} 0, & \text{if } P_{Y|X}(0 \mid x) > P_{Y|X}(1 \mid x) \\ 1, & \text{if } P_{Y|X}(0 \mid x) < P_{Y|X}(1 \mid x) \end{cases}$$

SLIDE 17

BDR for 0/1 Loss Binary Classification

  • Pick "0" when

$$P_{Y|X}(0 \mid x) > P_{Y|X}(1 \mid x)$$

and "1" otherwise

  • Using Bayes rule on both sides of this inequality yields

$$\frac{P_{X|Y}(x \mid 0)\, P_Y(0)}{P_X(x)} > \frac{P_{X|Y}(x \mid 1)\, P_Y(1)}{P_X(x)}$$

– Noting that $P_X(x)$ is a non-negative quantity, this is the same as the rule: pick "0" when

$$P_{X|Y}(x \mid 0)\, P_Y(0) > P_{X|Y}(x \mid 1)\, P_Y(1),$$

i.e.,

$$i^*(x) = \arg\max_i P_{X|Y}(x \mid i)\, P_Y(i)$$
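The practical payoff of this form is that the evidence $P_X(x)$ never has to be computed; a minimal sketch, with hypothetical density and prior values at the observed x:

```python
def pick_binary(lik0, lik1, p0, p1):
    """Pick "0" when P(x|0)P(0) > P(x|1)P(1); the evidence P_X(x)
    divides both sides of the comparison and so cancels."""
    return 0 if lik0 * p0 > lik1 * p1 else 1

# made-up values: class 0 is less likely at x but has the larger prior
print(pick_binary(0.3, 0.5, 0.8, 0.2))  # 0: 0.24 > 0.10
```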

SLIDE 18

The “Log Trick”

  • Sometimes it’s not convenient to

work directly with pdf’s

– One helpful trick is to take logs
– Note that the log is a monotonically increasing function, from which we have

$$a > b \iff \log a > \log b \qquad (a, b > 0)$$

$$i^*(x) = \arg\max_i P_{X|Y}(x \mid i)\, P_Y(i) = \arg\max_i \left[\log P_{X|Y}(x \mid i) + \log P_Y(i)\right] = \arg\min_i \left\{-\log P_{X|Y}(x \mid i) - \log P_Y(i)\right\}$$
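Beyond convenience, one practical reason to take logs (an observation added here, not stated on the slide) is numerical: a product of many small densities underflows floating point, while the log-likelihood sum stays well-scaled.

```python
import math

# 80 i.i.d. factors of 1e-5: the product is 1e-400, below the smallest
# representable float, so it collapses to exactly 0.0.
probs = [1e-5] * 80
product = 1.0
for p in probs:
    product *= p
log_sum = sum(math.log(p) for p in probs)  # 80 * log(1e-5) ~ -921

print(product)   # 0.0: underflowed
print(log_sum)   # about -921.03: still usable for comparisons
```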

SLIDE 19

“Standard” (0/1) BDR

  • In summary

– for the zero/one loss, the following three decision rules are optimal and equivalent:

1) $i^*(x) = \arg\max_i P_{Y|X}(i \mid x)$

2) $i^*(x) = \arg\max_i P_{X|Y}(x \mid i)\, P_Y(i)$

3) $i^*(x) = \arg\max_i \left[\log P_{X|Y}(x \mid i) + \log P_Y(i)\right]$

– The form 1) is usually hardest to use; 3) is frequently easier than 2)

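The three forms can be checked numerically; the likelihood and prior values below are made up, and all three rules must select the same class.

```python
import math

prior = {0: 0.8, 1: 0.2}   # hypothetical P_Y(i)
lik   = {0: 0.3, 1: 0.5}   # hypothetical P_{X|Y}(x|i) at the observed x
evidence = sum(lik[i] * prior[i] for i in (0, 1))
posterior = {i: lik[i] * prior[i] / evidence for i in (0, 1)}

rule1 = max((0, 1), key=lambda i: posterior[i])                      # form 1
rule2 = max((0, 1), key=lambda i: lik[i] * prior[i])                 # form 2
rule3 = max((0, 1), key=lambda i: math.log(lik[i]) + math.log(prior[i]))  # form 3
print(rule1, rule2, rule3)  # 0 0 0
```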
SLIDE 20

(Standard 0/1-Loss) BDR - Example

  • So far the BDR is an abstract rule

– How does one implement the optimal decision in practice?
– In addition to having a loss function, you need to know, model, or estimate the probabilities!
– Example:

  • Suppose that you run a gas station
  • On Mondays you have a promotion to sell more gas
  • Q: is the promotion working? I.e., is Y = 0 (no) or Y = 1 (yes) ?
  • A good observation to answer this question is the interarrival time t between cars:

– high t: not working (Y = 0)
– low t: working well (Y = 1)

SLIDE 21

BDR - Example

  • What are the class-conditional and prior probabilities?

– Model the probability of arrival of a car by an Exponential density (a standard pdf to use)

  • Continuous-valued interarrival times are

assumed to be exponentially distributed.

  • Hence

$$P_{X|Y}(t \mid i) = \lambda_i\, e^{-\lambda_i t},$$

where $\lambda_i$ is the arrival rate (cars/s).

  • The expected value of the interarrival time is

$$E[X \mid Y = i] = \frac{1}{\lambda_i}$$

  • Consecutive times are assumed to be independent:

$$P_{X_1,\dots,X_n \mid Y}(t_1, \dots, t_n \mid i) = \prod_{k=1}^{n} P_{X|Y}(t_k \mid i) = \lambda_i^n\, e^{-\lambda_i \sum_k t_k}$$

SLIDE 22

BDR - Example

  • Let’s assume that we

– know $\lambda_i$ and the (prior) class probabilities $P_Y(i) = \pi_i$, $i = 0, 1$
– have measured a collection of times during the day, $D = \{t_1, \dots, t_n\}$

  • The probabilities are of exponential form

– Therefore it is easier to use the log-based BDR:

$$i^*(D) = \arg\max_i \left[\log P_{X_1,\dots,X_n|Y}(t_1,\dots,t_n \mid i) + \log \pi_i\right] = \arg\max_i \left[\log\!\left(\lambda_i^n\, e^{-\lambda_i \sum_k t_k}\right) + \log \pi_i\right] = \arg\max_i \left[n \log \lambda_i - \lambda_i \sum_{k=1}^n t_k + \log \pi_i\right]$$

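The log-form BDR above is directly computable; the rates below are hypothetical (λ0 = 0.5 for a slow day, λ1 = 2.0 for a busy one), with equal priors by default.

```python
import math

def exp_bdr(times, lam0, lam1, p0=0.5, p1=0.5):
    """Pick argmax_i [ n log(lam_i) - lam_i * sum(t_k) + log(p_i) ]
    for exponential interarrival times."""
    n, s = len(times), sum(times)
    score0 = n * math.log(lam0) - lam0 * s + math.log(p0)
    score1 = n * math.log(lam1) - lam1 * s + math.log(p1)
    return 0 if score0 > score1 else 1

short_gaps = [0.5, 0.8, 0.4, 0.6]   # cars arriving quickly
print(exp_bdr(short_gaps, lam0=0.5, lam1=2.0))  # 1: promotion working
```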
SLIDE 23

BDR - Example

  • This means we pick "0" when

$$n \log \lambda_0 - \lambda_0 \sum_k t_k + \log \pi_0 \;>\; n \log \lambda_1 - \lambda_1 \sum_k t_k + \log \pi_1,$$

or

$$(\lambda_1 - \lambda_0) \sum_k t_k \;>\; n \log\frac{\lambda_1}{\lambda_0} + \log\frac{\pi_1}{\pi_0}, \quad \text{or} \quad \frac{1}{n}\sum_{k=1}^n t_k \;>\; \frac{\log(\lambda_1/\lambda_0) + \frac{1}{n}\log(\pi_1/\pi_0)}{\lambda_1 - \lambda_0}$$

and "1" otherwise (reasonably taking $\lambda_1 > \lambda_0$).

  • Does this decision rule make sense?

– Let's assume, for simplicity, that $\pi_0 = \pi_1 = 1/2$

SLIDE 24

BDR - Example

  • For $\pi_0 = \pi_1 = 1/2$, we pick "promotion did not work" (Y = 0) if

$$\frac{1}{n}\sum_{k=1}^n t_k \;>\; T \;=\; \frac{\log(\lambda_1/\lambda_0)}{\lambda_1 - \lambda_0}$$

– The left-hand side is the (sample) average interarrival time for the day
– This means that there is an optimal choice of a "threshold" T above which we say "promotion did not work". This makes sense!
– What is the shape of this threshold?

  • Assuming $\lambda_0 = 1$, $T = \log(\lambda_1)/(\lambda_1 - 1)$. [Figure: plot of T versus $\lambda_1$]
  • The higher $\lambda_1$, the more likely we are to say "promotion did not work".
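The threshold is available in closed form; holding λ0 = 1 as on the slide, a quick sweep over a few hypothetical λ1 values shows T shrinking, i.e. the "tougher standard" discussed next.

```python
import math

def threshold(lam0, lam1):
    """Equal priors: say Y=0 ("did not work") when the sample mean
    interarrival time exceeds T = log(lam1/lam0) / (lam1 - lam0)."""
    return math.log(lam1 / lam0) / (lam1 - lam0)

for lam1 in (2.0, 4.0, 8.0):
    print(round(threshold(1.0, lam1), 3))  # 0.693, 0.462, 0.297
```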

SLIDE 25

BDR - Example

  • When $\pi_0 = \pi_1 = 1/2$, we pick "did not work" (Y = 0) when

$$\bar{t} \;=\; \frac{1}{n}\sum_{k=1}^n t_k \;>\; T \;=\; \frac{\log(\lambda_1/\lambda_0)}{\lambda_1 - \lambda_0}$$

– Assuming $\lambda_0 = 1$, T decreases with $\lambda_1$
– I.e., for a given daily average:

  • The larger $\lambda_1$, the easier it is to say "did not work"

– This means that:

  • As the expected rate of arrival for good days increases, we impose a tougher standard on the average measured interarrival times

– The average has to be smaller for us to accept the day as a good one
– Once again, this makes sense!
– A sensible answer is usually the case with the BDR (a good way to check your math)

[Figure: plot of T versus $\lambda_1$]

SLIDE 26

The Gaussian Classifier

  • One important case is that of Multivariate Gaussian Classes

– The pdf of class i is a Gaussian of mean $\mu_i$ and covariance $\Sigma_i$:

$$P_{X|Y}(x \mid i) = \frac{1}{\sqrt{(2\pi)^d\, |\Sigma_i|}} \exp\left\{-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right\}$$

  • The BDR is

$$i^*(x) = \arg\max_i \left[-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1}(x - \mu_i) - \frac{1}{2}\log\left((2\pi)^d\, |\Sigma_i|\right) + \log P_Y(i)\right]$$

SLIDE 27

Implementation of a Gaussian Classifier

  • To design a Gaussian classifier (e.g. homework)

– Start from a collection of datasets, where the i-th class dataset $D^{(i)} = \{x_1^{(i)}, \dots, x_{n_i}^{(i)}\}$ is a set of $n_i$ examples from class i
– For each class, estimate the Gaussian parameters:

$$\hat{\mu}_i = \frac{1}{n_i} \sum_j x_j^{(i)}, \qquad \hat{\Sigma}_i = \frac{1}{n_i} \sum_j (x_j^{(i)} - \hat{\mu}_i)(x_j^{(i)} - \hat{\mu}_i)^T, \qquad \hat{P}_Y(i) = \frac{n_i}{T},$$

where $T = \sum_{i=1}^{c} n_i$ is the total number of examples over all c classes

  • Via the "plug-in rule", the BDR is approximated as

$$i^*(x) = \arg\max_i \left[-\frac{1}{2}(x - \hat{\mu}_i)^T \hat{\Sigma}_i^{-1}(x - \hat{\mu}_i) - \frac{1}{2}\log\left((2\pi)^d\, |\hat{\Sigma}_i|\right) + \log \hat{P}_Y(i)\right]$$

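The plug-in recipe above can be sketched end-to-end; to stay dependency-free this is a 1-D simplification (scalar mean and variance per class), with two made-up class datasets.

```python
import math

def fit_gaussian_1d(samples):
    """Plug-in rule: ML estimates (sample mean and variance)."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n
    return mu, var

def gauss_log_lik(x, mu, var):
    # log of the 1-D Gaussian density
    return -0.5 * ((x - mu) ** 2 / var + math.log(2 * math.pi * var))

def gaussian_bdr(x, params, priors):
    # argmax_i [ log P(x|i) + log P_Y(i) ]
    return max(range(len(params)),
               key=lambda i: gauss_log_lik(x, *params[i]) + math.log(priors[i]))

D0, D1 = [0.9, 1.1, 1.0, 0.8], [3.1, 2.9, 3.0, 3.2]  # toy class datasets
params = [fit_gaussian_1d(D0), fit_gaussian_1d(D1)]
priors = [0.5, 0.5]
print(gaussian_bdr(1.05, params, priors))  # 0: near class 0's mean
print(gaussian_bdr(2.8, params, priors))   # 1
```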
SLIDE 28

Gaussian Classifier

  • The Gaussian Classifier can be written as

$$i^*(x) = \arg\min_i \left[ d_i^2(x, \mu_i) + \alpha_i \right]$$

with

$$d_i^2(x, y) = (x - y)^T \Sigma_i^{-1} (x - y), \qquad \alpha_i = \log\left((2\pi)^d\, |\Sigma_i|\right) - 2 \log P_Y(i),$$

and can be seen as a nearest "class-neighbor" classifier with a "funny metric"

– Each class has its own "distance" measure:

  • Compute the Mahalanobis-squared distance for that class, then add the constant $\alpha_i$.
  • We effectively have different "metrics" in the data (feature) space that are class-i dependent.
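The per-class "funny metric" is just the Mahalanobis-squared form; a minimal 2-D sketch with the inverse covariance passed in as nested lists (identity here, so it reduces to squared Euclidean distance).

```python
def mahalanobis_sq(x, y, sigma_inv):
    """d^2(x, y) = (x - y)^T Sigma^{-1} (x - y), 2-D case."""
    dx = [xi - yi for xi, yi in zip(x, y)]
    tmp = [sum(sigma_inv[r][c] * dx[c] for c in range(2)) for r in range(2)]
    return sum(dx[r] * tmp[r] for r in range(2))

# identity inverse covariance: reduces to ||x - y||^2
print(mahalanobis_sq((1.0, 2.0), (0.0, 0.0), [[1.0, 0.0], [0.0, 1.0]]))  # 5.0
```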

SLIDE 29

Gaussian Classifier

  • A special case of interest is when all classes have the same covariance, $\Sigma_i = \Sigma$:

$$i^*(x) = \arg\min_i \left[ d^2(x, \mu_i) + \alpha_i \right]$$

with

$$d^2(x, y) = (x - y)^T \Sigma^{-1} (x - y), \qquad \alpha_i = -2 \log P_Y(i)$$

  • Note that:

– $\alpha_i$ can be dropped when all classes have equal prior probability
– This is reminiscent of the NN classifier with Mahalanobis distance
– Instead of finding the nearest data-point neighbor of x, it looks for the nearest class "prototype" (or "archetype", "exemplar", "template", "representative", "ideal", "form"), defined as the class mean $\mu_i$

SLIDE 30

Binary Classifier – Special Case

  • Consider $\Sigma_i = \Sigma$ with two classes

– One important property of this case is that the decision boundary is a hyperplane (Homework)
– This can be shown by computing the set of points x such that

$$d^2(x, \mu_1) + \alpha_1 = d^2(x, \mu_2) + \alpha_2$$

and showing that they satisfy

$$w^T (x - x_0) = 0$$

  • This is the equation of a hyperplane with normal w. The point $x_0$ can be any fixed point on the hyperplane, but it is standard to choose it to have minimum norm, in which case w and $x_0$ are parallel.

[Figure: training points $x_1, \dots, x_n$ separated by the hyperplane with normal w through $x_0$]

SLIDE 31

Gaussian M-ary Classifier – Special Case

  • If all the class covariances are the identity, Si=I, then

with

  • This is called (simple, Cartesian)

temp mplat ate match tching ng with class means as templates

– E.g. for digit classification Compare the complexity of this classifier to NN Classifiers!

2 *( )

argmin ( , )

i i i

i x d x m a      

2 2( , ) ||

|| d x y x y   ) ( log 2 i P

Y i

  a

* ?
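Template matching is the simplest instance of the family above; a sketch with equal priors (so the $\alpha_i$ drop) and made-up 2-D "templates" standing in for class means.

```python
def template_match(x, templates):
    """Sigma_i = I, equal priors: pick the class whose mean (template)
    is nearest in squared Euclidean distance."""
    d2 = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(templates, key=lambda c: d2(x, templates[c]))

means = {"0": (0.1, 0.9), "1": (0.5, 0.5)}  # hypothetical class templates
print(template_match((0.2, 0.8), means))  # "0"
```

Note the contrast with the NN classifier: here we compare against c class means instead of all n training points.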

SLIDE 32

The Sigmoid Function

  • We have derived much of the above from the log-based BDR:

$$i^*(x) = \arg\max_i \left[\log P_{X|Y}(x \mid i) + \log P_Y(i)\right]$$

  • When there are only two classes, i = 0, 1, it is also interesting to manipulate the original definition as follows:

$$i^*(x) = \arg\max_i g_i(x),$$

where

$$g_i(x) = P_{Y|X}(i \mid x) = \frac{P_{X|Y}(x \mid i)\, P_Y(i)}{P_X(x)} = \frac{P_{X|Y}(x \mid i)\, P_Y(i)}{P_{X|Y}(x \mid 0)\, P_Y(0) + P_{X|Y}(x \mid 1)\, P_Y(1)}$$

SLIDE 33

The Sigmoid Function

  • Note that this can be written as

$$i^*(x) = \arg\max_i g_i(x), \qquad g_0(x) = \frac{1}{1 + \dfrac{P_{X|Y}(x \mid 1)\, P_Y(1)}{P_{X|Y}(x \mid 0)\, P_Y(0)}}, \qquad g_1(x) = 1 - g_0(x)$$

  • For Gaussian classes, the posterior probabilities are

$$g_0(x) = \frac{1}{1 + \exp\left\{\frac{1}{2}\left[d_0^2(x, \mu_0) + \alpha_0\right] - \frac{1}{2}\left[d_1^2(x, \mu_1) + \alpha_1\right]\right\}}$$

where, as before,

$$d_i^2(x, y) = (x - y)^T \Sigma_i^{-1}(x - y), \qquad \alpha_i = \log\left((2\pi)^d\, |\Sigma_i|\right) - 2\log P_Y(i)$$

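A 1-D sketch of the sigmoid posterior above, assuming two Gaussian classes with a shared variance `var` (so the $\log|\Sigma_i|$ part of the $\alpha_i$ terms cancels and only the prior part matters); the means, variance, and priors are made-up values.

```python
import math

def g0(x, mu0, mu1, var, p0=0.5, p1=0.5):
    """Class-0 posterior: 1 / (1 + exp(0.5*[(d0^2 + a0) - (d1^2 + a1)]))."""
    d0 = (x - mu0) ** 2 / var          # d_i^2(x, mu_i) in one dimension
    d1 = (x - mu1) ** 2 / var
    a0, a1 = -2 * math.log(p0), -2 * math.log(p1)  # shared log-det cancels
    return 1.0 / (1.0 + math.exp(0.5 * ((d0 + a0) - (d1 + a1))))

print(g0(1.0, 0.0, 2.0, 1.0))  # 0.5: x is midway between the two means
```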
SLIDE 34

The Sigmoid (“S-shaped”) Function

  • The posterior probability for class i = 0,

$$g_0(x) = \frac{1}{1 + \exp\left\{\frac{1}{2}\left[d_0^2(x, \mu_0) + \alpha_0\right] - \frac{1}{2}\left[d_1^2(x, \mu_1) + \alpha_1\right]\right\}},$$

is a sigmoid and looks like this:

[Figure: sigmoid curve, crossing $P_{Y|X}(1 \mid x) = 0.5$ at the decision boundary]

SLIDE 35

The Sigmoid Function in Neural Nets

  • The sigmoid appears in neural networks, where it can be

interpreted as a posterior pdf for a Gaussian binary classification problem when the covariances are the same

SLIDE 36

The Sigmoid Function in Neural Nets

  • But not necessarily when the covariances are different
SLIDE 37

END