 
Bayesian Decision Theory
Selim Aksoy
Department of Computer Engineering, Bilkent University
saksoy@cs.bilkent.edu.tr
CS 551, Fall 2019
© 2019, Selim Aksoy (Bilkent University)

Bayesian Decision Theory
◮ Bayesian Decision Theory is a fundamental statistical approach that quantifies the tradeoffs between various decisions using the probabilities and costs that accompany such decisions.
◮ First, we will assume that all probabilities are known.
◮ Then, we will study the cases where the probabilistic structure is not completely known.

Fish Sorting Example Revisited
◮ The state of nature is a random variable.
◮ Define $w$ as the type of fish we observe (state of nature, class), where
  ◮ $w = w_1$ for sea bass,
  ◮ $w = w_2$ for salmon.
◮ $P(w_1)$ is the a priori probability that the next fish is a sea bass.
◮ $P(w_2)$ is the a priori probability that the next fish is a salmon.

Prior Probabilities
◮ Prior probabilities reflect our knowledge of how likely each type of fish is to appear before we actually see it.
◮ How can we choose $P(w_1)$ and $P(w_2)$?
  ◮ Set $P(w_1) = P(w_2)$ if they are equiprobable (uniform priors).
  ◮ May use different values depending on the fishing area, time of the year, etc.
◮ Assume there are no other types of fish: $P(w_1) + P(w_2) = 1$ (exclusivity and exhaustivity).

Making a Decision
◮ How can we make a decision with only the prior information?
$$\text{Decide } \begin{cases} w_1 & \text{if } P(w_1) > P(w_2) \\ w_2 & \text{otherwise} \end{cases}$$
◮ What is the probability of error for this decision?
$$P(\text{error}) = \min\{P(w_1), P(w_2)\}$$
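
The prior-only rule above can be sketched in a few lines. This is a minimal illustration, not part of the original slides; the function name `decide_from_priors` and the prior values 0.7 and 0.3 are assumptions chosen only for the example.

```python
def decide_from_priors(p_w1, p_w2):
    """Pick the class with the larger prior; the error is the smaller prior."""
    decision = "w1" if p_w1 > p_w2 else "w2"
    p_error = min(p_w1, p_w2)
    return decision, p_error


# Hypothetical priors: sea bass (w1) is assumed more common than salmon (w2).
decision, p_error = decide_from_priors(0.7, 0.3)
print(decision, p_error)
```

Note that this rule always makes the same decision, regardless of the fish actually observed, which is why a measurement is introduced next.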

Class-Conditional Probabilities
◮ Let's try to improve the decision using the lightness measurement $x$.
◮ Let $x$ be a continuous random variable.
◮ Define $p(x|w_j)$ as the class-conditional probability density (the density of $x$ given that the state of nature is $w_j$, for $j = 1, 2$).
◮ $p(x|w_1)$ and $p(x|w_2)$ describe the difference in lightness between populations of sea bass and salmon.

Class-Conditional Probabilities
Figure 1: Hypothetical class-conditional probability density functions for two classes.

Posterior Probabilities
◮ Suppose we know $P(w_j)$ and $p(x|w_j)$ for $j = 1, 2$, and measure the lightness of a fish as the value $x$.
◮ Define $P(w_j|x)$ as the a posteriori probability (probability of the state of nature being $w_j$ given the measurement of feature value $x$).
◮ We can use the Bayes formula to convert the prior probability to the posterior probability
$$P(w_j|x) = \frac{p(x|w_j)\,P(w_j)}{p(x)}$$
where $p(x) = \sum_{j=1}^{2} p(x|w_j)\,P(w_j)$.
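
As a rough sketch of the Bayes formula above, the following computes posteriors for a single lightness value. The slides do not specify the form of $p(x|w_j)$; the Gaussian densities, their means and standard deviations, and the helper names `gaussian_pdf` and `posteriors` are all assumptions made for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x (assumed model)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posteriors(x, priors, params):
    """Return [P(w_1|x), ..., P(w_c|x)] via the Bayes formula."""
    joint = [gaussian_pdf(x, m, s) * p for (m, s), p in zip(params, priors)]
    evidence = sum(joint)  # p(x) = sum_j p(x|w_j) P(w_j)
    return [j / evidence for j in joint]


# Hypothetical setup: priors 2/3 and 1/3, unit-variance densities at 4 and 6.
priors = [2 / 3, 1 / 3]
params = [(4.0, 1.0), (6.0, 1.0)]
post = posteriors(5.0, priors, params)
print(post)
```

At $x = 5$ the two assumed likelihoods are equal by symmetry, so the posteriors coincide with the priors; moving $x$ toward either mean shifts the posterior toward that class.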

Making a Decision
◮ $p(x|w_j)$ is called the likelihood and $p(x)$ is called the evidence.
◮ How can we make a decision after observing the value of $x$?
$$\text{Decide } \begin{cases} w_1 & \text{if } P(w_1|x) > P(w_2|x) \\ w_2 & \text{otherwise} \end{cases}$$
◮ Rewriting the rule gives
$$\text{Decide } \begin{cases} w_1 & \text{if } \dfrac{p(x|w_1)}{p(x|w_2)} > \dfrac{P(w_2)}{P(w_1)} \\ w_2 & \text{otherwise} \end{cases}$$
◮ Note that, at every $x$, $P(w_1|x) + P(w_2|x) = 1$.
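
The likelihood-ratio form of the rule can be sketched as follows. The Gaussian class-conditional densities and all numeric values are assumptions for illustration, as is the function name `decide_by_ratio`.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x (assumed model)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def decide_by_ratio(x, p_w1, p_w2, params1, params2):
    """Decide w1 if p(x|w1)/p(x|w2) exceeds P(w2)/P(w1), else w2."""
    ratio = gaussian_pdf(x, *params1) / gaussian_pdf(x, *params2)
    return "w1" if ratio > p_w2 / p_w1 else "w2"


# Hypothetical priors and densities (mean, std) for sea bass and salmon.
p_w1, p_w2 = 2 / 3, 1 / 3
bass, salmon = (4.0, 1.0), (6.0, 1.0)
print(decide_by_ratio(5.2, p_w1, p_w2, bass, salmon))
print(decide_by_ratio(5.5, p_w1, p_w2, bass, salmon))
```

Because the evidence $p(x)$ cancels on both sides, the ratio test never needs to compute it, yet it yields the same decision as comparing posteriors.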

Probability of Error
◮ What is the probability of error for this decision?
$$P(\text{error}|x) = \begin{cases} P(w_1|x) & \text{if we decide } w_2 \\ P(w_2|x) & \text{if we decide } w_1 \end{cases}$$
◮ What is the average probability of error?
$$P(\text{error}) = \int_{-\infty}^{\infty} p(\text{error}, x)\,dx = \int_{-\infty}^{\infty} P(\text{error}|x)\,p(x)\,dx$$
◮ Bayes decision rule minimizes this error because, under this rule, $P(\text{error}|x) = \min\{P(w_1|x), P(w_2|x)\}$.
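
To make the average-error integral concrete, here is a hedged numerical sketch. It assumes Gaussian class-conditional densities with equal priors (none of these values come from the slides) and approximates the integral with a simple midpoint rule; `bayes_error` is a hypothetical helper name.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x (assumed model)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def bayes_error(priors, params, lo, hi, n=20000):
    """Midpoint-rule approximation of the integral of P(error|x) p(x) dx."""
    dx = (hi - lo) / n
    total = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * dx
        joint = [gaussian_pdf(x, m, s) * p for (m, s), p in zip(params, priors)]
        # P(error|x) p(x) = p(x) - max_j p(x|w_j) P(w_j);
        # with two classes this is the smaller of the two joint terms.
        total += (sum(joint) - max(joint)) * dx
    return total


# Hypothetical setup: equal priors, unit-variance densities at 4 and 6.
p_error = bayes_error([0.5, 0.5], [(4.0, 1.0), (6.0, 1.0)], lo=-4.0, hi=14.0)

# For this symmetric setup the error equals Phi(-1), the standard normal CDF at -1.
closed_form = 0.5 * (1.0 + math.erf(-1.0 / math.sqrt(2.0)))
print(p_error, closed_form)
```

The integration limits are chosen wide enough that the truncated tails are negligible for these assumed densities; a different model would need different limits.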

Bayesian Decision Theory
◮ How can we generalize to
  ◮ more than one feature?
    ◮ replace the scalar $x$ by the feature vector $\mathbf{x}$
  ◮ more than two states of nature?
    ◮ just a difference in notation
  ◮ allowing actions other than just decisions?
    ◮ allow the possibility of rejection
  ◮ different risks in the decision?
    ◮ define how costly each action is

Bayesian Decision Theory
◮ Let $\{w_1, \ldots, w_c\}$ be the finite set of $c$ states of nature (classes, categories).
◮ Let $\{\alpha_1, \ldots, \alpha_a\}$ be the finite set of $a$ possible actions.
◮ Let $\lambda(\alpha_i|w_j)$ be the loss incurred for taking action $\alpha_i$ when the state of nature is $w_j$.
◮ Let $\mathbf{x}$ be the $d$-component vector-valued random variable called the feature vector.

Bayesian Decision Theory
◮ $p(\mathbf{x}|w_j)$ is the class-conditional probability density function.
◮ $P(w_j)$ is the prior probability that nature is in state $w_j$.
◮ The posterior probability can be computed as
$$P(w_j|\mathbf{x}) = \frac{p(\mathbf{x}|w_j)\,P(w_j)}{p(\mathbf{x})}$$
where $p(\mathbf{x}) = \sum_{j=1}^{c} p(\mathbf{x}|w_j)\,P(w_j)$.

Conditional Risk
◮ Suppose we observe $\mathbf{x}$ and take action $\alpha_i$.
◮ If the true state of nature is $w_j$, we incur the loss $\lambda(\alpha_i|w_j)$.
◮ The expected loss of taking action $\alpha_i$ is
$$R(\alpha_i|\mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i|w_j)\,P(w_j|\mathbf{x})$$
which is also called the conditional risk.

Minimum-Risk Classification
◮ The general decision rule $\alpha(\mathbf{x})$ tells us which action to take for observation $\mathbf{x}$.
◮ We want to find the decision rule that minimizes the overall risk
$$R = \int R(\alpha(\mathbf{x})|\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}.$$
◮ Bayes decision rule minimizes the overall risk by selecting the action $\alpha_i$ for which $R(\alpha_i|\mathbf{x})$ is minimum.
◮ The resulting minimum overall risk is called the Bayes risk and is the best performance that can be achieved.
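
The conditional risk and the Bayes decision rule can be sketched together. The loss matrix below, including a third "reject" action with a flat cost of 0.3, is a hypothetical example (the slides mention rejection but give no loss values), and the function names are assumptions.

```python
def conditional_risks(loss, posteriors):
    """R(alpha_i|x) = sum_j loss[i][j] * P(w_j|x), one value per action."""
    return [sum(l * p for l, p in zip(row, posteriors)) for row in loss]


def bayes_action(loss, posteriors):
    """Return the index of the minimum-risk action and all conditional risks."""
    risks = conditional_risks(loss, posteriors)
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks


# Rows are actions, columns are true states (w1, w2). Values are assumed.
loss = [
    [0.0, 1.0],  # alpha_1: decide w1
    [1.0, 0.0],  # alpha_2: decide w2
    [0.3, 0.3],  # alpha_3: reject (hypothetical flat cost)
]

best, risks = bayes_action(loss, [0.6, 0.4])
print(best, risks)
```

With these assumed losses, an ambiguous posterior such as (0.6, 0.4) makes rejection the cheapest action, while a confident posterior such as (0.9, 0.1) makes deciding $w_1$ cheaper than rejecting.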

Two-Category Classification
◮ Define
  ◮ $\alpha_1$: deciding $w_1$,
  ◮ $\alpha_2$: deciding $w_2$,
  ◮ $\lambda_{ij} = \lambda(\alpha_i|w_j)$.
◮ Conditional risks can be written as
$$R(\alpha_1|\mathbf{x}) = \lambda_{11} P(w_1|\mathbf{x}) + \lambda_{12} P(w_2|\mathbf{x}),$$
$$R(\alpha_2|\mathbf{x}) = \lambda_{21} P(w_1|\mathbf{x}) + \lambda_{22} P(w_2|\mathbf{x}).$$

Two-Category Classification
◮ The minimum-risk decision rule becomes
$$\text{Decide } \begin{cases} w_1 & \text{if } (\lambda_{21} - \lambda_{11})\,P(w_1|\mathbf{x}) > (\lambda_{12} - \lambda_{22})\,P(w_2|\mathbf{x}) \\ w_2 & \text{otherwise} \end{cases}$$
◮ Assuming $\lambda_{21} > \lambda_{11}$ (an error costs more than a correct decision), this corresponds to deciding $w_1$ if
$$\frac{p(\mathbf{x}|w_1)}{p(\mathbf{x}|w_2)} > \frac{(\lambda_{12} - \lambda_{22})\,P(w_2)}{(\lambda_{21} - \lambda_{11})\,P(w_1)}$$
⇒ comparing the likelihood ratio to a threshold that is independent of the observation $\mathbf{x}$.
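
The step from the posterior form to the likelihood-ratio form is worth spelling out. The following derivation is a sketch that uses only the definitions above and assumes $\lambda_{21} > \lambda_{11}$, $P(w_1) > 0$, and $p(\mathbf{x}|w_2) > 0$, so that the divisions preserve the inequality.

```latex
\begin{align*}
R(\alpha_1|\mathbf{x}) &< R(\alpha_2|\mathbf{x}) \\
\lambda_{11} P(w_1|\mathbf{x}) + \lambda_{12} P(w_2|\mathbf{x})
  &< \lambda_{21} P(w_1|\mathbf{x}) + \lambda_{22} P(w_2|\mathbf{x}) \\
(\lambda_{21} - \lambda_{11})\, P(w_1|\mathbf{x})
  &> (\lambda_{12} - \lambda_{22})\, P(w_2|\mathbf{x}) \\
(\lambda_{21} - \lambda_{11})\, p(\mathbf{x}|w_1)\, P(w_1)
  &> (\lambda_{12} - \lambda_{22})\, p(\mathbf{x}|w_2)\, P(w_2)
  && \text{(Bayes formula; } p(\mathbf{x}) \text{ cancels)} \\
\frac{p(\mathbf{x}|w_1)}{p(\mathbf{x}|w_2)}
  &> \frac{(\lambda_{12} - \lambda_{22})\, P(w_2)}{(\lambda_{21} - \lambda_{11})\, P(w_1)}
\end{align*}
```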

Minimum-Error-Rate Classification
◮ Actions are decisions on classes ($\alpha_i$ is deciding $w_i$).
◮ If action $\alpha_i$ is taken and the true state of nature is $w_j$, then the decision is correct if $i = j$ and in error if $i \neq j$.
◮ We want to find a decision rule that minimizes the probability of error.

Minimum-Error-Rate Classification
◮ Define the zero-one loss function
$$\lambda(\alpha_i|w_j) = \begin{cases} 0 & \text{if } i = j \\ 1 & \text{if } i \neq j \end{cases} \qquad i, j = 1, \ldots, c$$
(all errors are equally costly).
◮ Conditional risk becomes
$$R(\alpha_i|\mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i|w_j)\,P(w_j|\mathbf{x}) = \sum_{j \neq i} P(w_j|\mathbf{x}) = 1 - P(w_i|\mathbf{x}).$$
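
A quick numerical check of the identity $R(\alpha_i|\mathbf{x}) = 1 - P(w_i|\mathbf{x})$ under zero-one loss is sketched below. The three posterior values are made up for the example, and the helper names are assumptions.

```python
def zero_one_loss(c):
    """c-by-c loss matrix: 0 on the diagonal, 1 elsewhere."""
    return [[0.0 if i == j else 1.0 for j in range(c)] for i in range(c)]


def conditional_risks(loss, posteriors):
    """R(alpha_i|x) = sum_j loss[i][j] * P(w_j|x), one value per action."""
    return [sum(l * p for l, p in zip(row, posteriors)) for row in loss]


# Hypothetical posteriors for three classes at some observation x.
posteriors = [0.5, 0.3, 0.2]
risks = conditional_risks(zero_one_loss(3), posteriors)

for p, r in zip(posteriors, risks):
    print(f"posterior={p:.2f}  risk={r:.2f}  1-posterior={1 - p:.2f}")
```

Since each risk is one minus the corresponding posterior, the action with the smallest risk is the class with the largest posterior.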

Minimum-Error-Rate Classification
◮ Minimizing the risk requires maximizing $P(w_i|\mathbf{x})$ and results in the minimum-error decision rule
$$\text{Decide } w_i \text{ if } P(w_i|\mathbf{x}) > P(w_j|\mathbf{x}) \quad \forall j \neq i.$$
◮ The resulting error is called the Bayes error and is the best performance that can be achieved.

Minimum-Error-Rate Classification
Figure 2: The likelihood ratio $p(x|w_1)/p(x|w_2)$. The threshold $\theta_a$ is computed using the priors $P(w_1) = 2/3$ and $P(w_2) = 1/3$, and a zero-one loss function. If we penalize mistakes in classifying $w_2$ patterns as $w_1$ more than the converse, we should increase the threshold to $\theta_b$.
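
The two thresholds in the caption can be reproduced from the two-category threshold formula. The priors and zero-one loss for $\theta_a$ come from the caption; the value $\lambda_{12} = 2$ used for $\theta_b$ is purely an assumption, since the caption does not state the loss behind $\theta_b$. The function name is hypothetical.

```python
def likelihood_ratio_threshold(loss, p_w1, p_w2):
    """Threshold (l12 - l22) P(w2) / ((l21 - l11) P(w1)) for deciding w1."""
    (l11, l12), (l21, l22) = loss
    if l21 <= l11:
        raise ValueError("requires lambda_21 > lambda_11")
    return ((l12 - l22) * p_w2) / ((l21 - l11) * p_w1)


p_w1, p_w2 = 2 / 3, 1 / 3  # priors from the figure caption

# Zero-one loss, as stated in the caption.
theta_a = likelihood_ratio_threshold([[0, 1], [1, 0]], p_w1, p_w2)

# Assumed loss: deciding w1 when the truth is w2 costs twice as much.
theta_b = likelihood_ratio_threshold([[0, 2], [1, 0]], p_w1, p_w2)

print(theta_a, theta_b)
```

Raising $\lambda_{12}$ raises the threshold, so the likelihood ratio must be larger before $w_1$ is chosen, which matches the direction described in the caption.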

Discriminant Functions
◮ A useful way of representing classifiers is through discriminant functions $g_i(\mathbf{x})$, $i = 1, \ldots, c$, where the classifier assigns a feature vector $\mathbf{x}$ to class $w_i$ if
$$g_i(\mathbf{x}) > g_j(\mathbf{x}) \quad \forall j \neq i.$$
◮ For the classifier that minimizes conditional risk
$$g_i(\mathbf{x}) = -R(\alpha_i|\mathbf{x}).$$
◮ For the classifier that minimizes error
$$g_i(\mathbf{x}) = P(w_i|\mathbf{x}).$$
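
A classifier built from discriminant functions can be sketched as an argmax over a list of callables. Here $g_i(x) = P(w_i|x)$, the minimum-error choice; the three Gaussian classes, their priors, and the names `classify` and `posterior` are assumptions for illustration, and the feature is a scalar for simplicity.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x (assumed model)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Hypothetical three-class problem.
PRIORS = [0.5, 0.3, 0.2]
PARAMS = [(0.0, 1.0), (3.0, 1.0), (6.0, 1.0)]  # (mean, std) per class


def posterior(i, x):
    """P(w_i|x) under the assumed Gaussian model."""
    joint = [gaussian_pdf(x, m, s) * p for (m, s), p in zip(PARAMS, PRIORS)]
    return joint[i] / sum(joint)


def classify(x, discriminants):
    """Return the 0-based index i maximizing g_i(x)."""
    scores = [g(x) for g in discriminants]
    return max(range(len(scores)), key=scores.__getitem__)


# g_i(x) = P(w_i|x); the default argument binds i at definition time.
discriminants = [lambda x, i=i: posterior(i, x) for i in range(3)]
labels = [classify(x, discriminants) for x in (-1.0, 3.0, 7.0)]
print(labels)
```

Swapping in $g_i(x) = -R(\alpha_i|x)$ would turn the same `classify` function into a minimum-risk classifier, which is the appeal of the discriminant-function view.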