Bayesian Decision Theory
Selim Aksoy, Department of Computer Engineering, Bilkent University
saksoy@cs.bilkent.edu.tr
CS 551, Fall 2019


1. Bayesian Decision Theory
Selim Aksoy
Department of Computer Engineering
Bilkent University
saksoy@cs.bilkent.edu.tr
CS 551, Fall 2019

2. Bayesian Decision Theory
◮ Bayesian decision theory is a fundamental statistical approach that quantifies the tradeoffs between various decisions using the probabilities and costs that accompany those decisions.
◮ First, we will assume that all probabilities are known.
◮ Then, we will study the cases where the probabilistic structure is not completely known.

3. Fish Sorting Example Revisited
◮ The state of nature is a random variable.
◮ Define w as the type of fish we observe (state of nature, class), where
  ◮ w = w_1 for sea bass,
  ◮ w = w_2 for salmon.
◮ P(w_1) is the a priori probability that the next fish is a sea bass.
◮ P(w_2) is the a priori probability that the next fish is a salmon.

4. Prior Probabilities
◮ Prior probabilities reflect our knowledge of how likely each type of fish is to appear before we actually see it.
◮ How can we choose P(w_1) and P(w_2)?
  ◮ Set P(w_1) = P(w_2) if the classes are equiprobable (uniform priors).
  ◮ We may use different values depending on the fishing area, the time of the year, etc.
◮ Assume there are no other types of fish, so that P(w_1) + P(w_2) = 1 (exclusivity and exhaustivity).

5. Making a Decision
◮ How can we make a decision with only the prior information?
  Decide w_1 if P(w_1) > P(w_2); otherwise decide w_2.
◮ What is the probability of error for this decision?
  P(error) = min{P(w_1), P(w_2)}
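A minimal Python sketch of this prior-only rule; the prior values below are hypothetical, not taken from the example:

```python
# Prior-only decision: always pick the class with the larger prior.
# The priors below are illustrative assumptions.
priors = {"sea_bass": 0.6, "salmon": 0.4}   # P(w1), P(w2); must sum to 1

decision = max(priors, key=priors.get)      # decide w1 if P(w1) > P(w2)
p_error = min(priors.values())              # P(error) = min{P(w1), P(w2)}

print(decision, p_error)                    # -> sea_bass 0.4
```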

6. Class-Conditional Probabilities
◮ Let us try to improve the decision using the lightness measurement x.
◮ Let x be a continuous random variable.
◮ Define p(x | w_j) as the class-conditional probability density (the probability density of x given that the state of nature is w_j, for j = 1, 2).
◮ p(x | w_1) and p(x | w_2) describe the difference in lightness between the populations of sea bass and salmon.
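As an illustration only, the class-conditional densities can be modeled parametrically; the Gaussian forms and parameter values below are assumptions for the sketch, not something the slides specify:

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mean, std):
    """Univariate normal density; an assumed form for p(x | w_j)."""
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2 * pi))

# Hypothetical lightness models: sea bass assumed lighter than salmon.
def p_x_given_w1(x):  # p(x | w1), sea bass
    return gaussian_pdf(x, mean=7.0, std=1.5)

def p_x_given_w2(x):  # p(x | w2), salmon
    return gaussian_pdf(x, mean=4.0, std=1.0)
```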

7. Class-Conditional Probabilities
Figure 1: Hypothetical class-conditional probability density functions for two classes.

8. Posterior Probabilities
◮ Suppose we know P(w_j) and p(x | w_j) for j = 1, 2, and measure the lightness of a fish as the value x.
◮ Define P(w_j | x) as the a posteriori probability (the probability of the state of nature being w_j given the measurement of feature value x).
◮ We can use the Bayes formula to convert the prior probability to the posterior probability:
  P(w_j | x) = p(x | w_j) P(w_j) / p(x)
  where p(x) = Σ_{j=1}^{2} p(x | w_j) P(w_j).
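A small sketch of the Bayes formula for the two-class case, reusing the assumed densities and priors from the snippets above:

```python
def posteriors(x, priors=(0.6, 0.4)):
    """Return [P(w1|x), P(w2|x)] via the Bayes formula.
    Priors and density models are illustrative assumptions."""
    likelihoods = (p_x_given_w1(x), p_x_given_w2(x))      # p(x | wj)
    joint = [l * p for l, p in zip(likelihoods, priors)]  # p(x | wj) P(wj)
    evidence = sum(joint)                                 # p(x)
    return [j / evidence for j in joint]                  # P(wj | x)

print(posteriors(5.5))  # the two posteriors sum to 1 at every x
```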

9. Making a Decision
◮ p(x | w_j) is called the likelihood and p(x) is called the evidence.
◮ How can we make a decision after observing the value of x?
  Decide w_1 if P(w_1 | x) > P(w_2 | x); otherwise decide w_2.
◮ Rewriting the rule gives
  Decide w_1 if p(x | w_1) / p(x | w_2) > P(w_2) / P(w_1); otherwise decide w_2.
◮ Note that, at every x, P(w_1 | x) + P(w_2 | x) = 1.
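The same decision written as a likelihood-ratio test, again in the illustrative setting assumed earlier:

```python
def decide(x, priors=(0.6, 0.4)):
    """Decide w1 or w2 by comparing the likelihood ratio to the prior ratio."""
    ratio = p_x_given_w1(x) / p_x_given_w2(x)   # p(x|w1) / p(x|w2)
    threshold = priors[1] / priors[0]           # P(w2) / P(w1)
    return "w1" if ratio > threshold else "w2"

print(decide(6.5), decide(4.2))
```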

10. Probability of Error
◮ What is the probability of error for this decision?
  P(error | x) = P(w_1 | x) if we decide w_2,
  P(error | x) = P(w_2 | x) if we decide w_1.
◮ What is the average probability of error?
  P(error) = ∫ p(error, x) dx = ∫ P(error | x) p(x) dx   (integrals from −∞ to ∞)
◮ The Bayes decision rule minimizes this error because P(error | x) = min{P(w_1 | x), P(w_2 | x)}.
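A rough numerical check of the average error under the assumed models, approximating the integral on a grid:

```python
def bayes_error(priors=(0.6, 0.4), lo=-5.0, hi=15.0, n=20000):
    """Approximate P(error) = integral of min{P(w1|x), P(w2|x)} p(x) dx."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        joint = (p_x_given_w1(x) * priors[0], p_x_given_w2(x) * priors[1])
        total += min(joint) * dx   # min posterior times evidence equals min joint
    return total

print(bayes_error())
```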

11. Bayesian Decision Theory
◮ How can we generalize to
  ◮ more than one feature? Replace the scalar x by the feature vector x.
  ◮ more than two states of nature? Just a difference in notation.
  ◮ allowing actions other than just decisions? Allow the possibility of rejection.
  ◮ different risks in the decision? Define how costly each action is.

12. Bayesian Decision Theory
◮ Let {w_1, ..., w_c} be the finite set of c states of nature (classes, categories).
◮ Let {α_1, ..., α_a} be the finite set of a possible actions.
◮ Let λ(α_i | w_j) be the loss incurred for taking action α_i when the state of nature is w_j.
◮ Let x be the d-component vector-valued random variable called the feature vector.

13. Bayesian Decision Theory
◮ p(x | w_j) is the class-conditional probability density function.
◮ P(w_j) is the prior probability that nature is in state w_j.
◮ The posterior probability can be computed as
  P(w_j | x) = p(x | w_j) P(w_j) / p(x)
  where p(x) = Σ_{j=1}^{c} p(x | w_j) P(w_j).

14. Conditional Risk
◮ Suppose we observe x and take action α_i.
◮ If the true state of nature is w_j, we incur the loss λ(α_i | w_j).
◮ The expected loss of taking action α_i is
  R(α_i | x) = Σ_{j=1}^{c} λ(α_i | w_j) P(w_j | x),
  which is also called the conditional risk.
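A sketch of the conditional-risk computation for a small, made-up loss matrix and posterior vector:

```python
def conditional_risks(loss, post):
    """R(alpha_i | x) = sum_j loss[i][j] * P(w_j | x), for each action i."""
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, post)) for row in loss]

# Hypothetical example: 2 actions, 2 classes, posteriors at some fixed x.
loss = [[0.0, 2.0],   # lambda(alpha_1 | w_1), lambda(alpha_1 | w_2)
        [1.0, 0.0]]   # lambda(alpha_2 | w_1), lambda(alpha_2 | w_2)
post = [0.7, 0.3]     # P(w_1 | x), P(w_2 | x)

print(conditional_risks(loss, post))  # -> [0.6, 0.7]
```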

15. Minimum-Risk Classification
◮ The general decision rule α(x) tells us which action to take for observation x.
◮ We want to find the decision rule that minimizes the overall risk
  R = ∫ R(α(x) | x) p(x) dx.
◮ The Bayes decision rule minimizes the overall risk by selecting the action α_i for which R(α_i | x) is minimum.
◮ The resulting minimum overall risk is called the Bayes risk and is the best performance that can be achieved.
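Continuing the sketch above, the Bayes rule simply picks the action with the smallest conditional risk at each x:

```python
def bayes_action(loss, post):
    """Index of the action alpha_i minimizing R(alpha_i | x)."""
    risks = conditional_risks(loss, post)
    return min(range(len(risks)), key=risks.__getitem__)

print(bayes_action(loss, post))  # -> 0 (take alpha_1 for this x)
```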

16. Two-Category Classification
◮ Define
  ◮ α_1: deciding w_1,
  ◮ α_2: deciding w_2,
  ◮ λ_ij = λ(α_i | w_j).
◮ The conditional risks can be written as
  R(α_1 | x) = λ_11 P(w_1 | x) + λ_12 P(w_2 | x),
  R(α_2 | x) = λ_21 P(w_1 | x) + λ_22 P(w_2 | x).

17. Two-Category Classification
◮ The minimum-risk decision rule becomes
  Decide w_1 if (λ_21 − λ_11) P(w_1 | x) > (λ_12 − λ_22) P(w_2 | x); otherwise decide w_2.
◮ This corresponds to deciding w_1 if
  p(x | w_1) / p(x | w_2) > [(λ_12 − λ_22) P(w_2)] / [(λ_21 − λ_11) P(w_1)],
  i.e., comparing the likelihood ratio to a threshold that is independent of the observation x.
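A sketch of the loss-dependent threshold, again using the illustrative losses, priors, and densities assumed in the earlier snippets:

```python
def decide_with_losses(x, loss, priors=(0.6, 0.4)):
    """Two-category minimum-risk rule as a likelihood-ratio test."""
    ratio = p_x_given_w1(x) / p_x_given_w2(x)
    threshold = ((loss[0][1] - loss[1][1]) * priors[1]) / \
                ((loss[1][0] - loss[0][0]) * priors[0])
    return "w1" if ratio > threshold else "w2"

# With a zero-one loss this reduces to the earlier decide(x).
print(decide_with_losses(5.5, loss))
```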

18. Minimum-Error-Rate Classification
◮ Actions are decisions on classes (α_i is deciding w_i).
◮ If action α_i is taken and the true state of nature is w_j, then the decision is correct if i = j and in error if i ≠ j.
◮ We want to find a decision rule that minimizes the probability of error.

19. Minimum-Error-Rate Classification
◮ Define the zero-one loss function
  λ(α_i | w_j) = 0 if i = j, 1 if i ≠ j,   i, j = 1, ..., c
  (all errors are equally costly).
◮ The conditional risk becomes
  R(α_i | x) = Σ_{j=1}^{c} λ(α_i | w_j) P(w_j | x)
             = Σ_{j ≠ i} P(w_j | x)
             = 1 − P(w_i | x).
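A quick check that plugging a zero-one loss into the generic conditional risk from the earlier sketch gives 1 − P(w_i | x); the posterior values are made up:

```python
c = 3
zero_one = [[0.0 if i == j else 1.0 for j in range(c)] for i in range(c)]
post3 = [0.5, 0.3, 0.2]   # hypothetical P(w_j | x), summing to 1

print(conditional_risks(zero_one, post3))   # -> [0.5, 0.7, 0.8]
print([1.0 - p for p in post3])             # same values: 1 - P(w_i | x)
```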

20. Minimum-Error-Rate Classification
◮ Minimizing the risk requires maximizing P(w_i | x) and results in the minimum-error decision rule:
  Decide w_i if P(w_i | x) > P(w_j | x) ∀ j ≠ i.
◮ The resulting error is called the Bayes error and is the best performance that can be achieved.
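Under the zero-one loss, the minimum-risk classifier therefore reduces to picking the largest posterior; a self-contained one-liner with hypothetical posteriors:

```python
def min_error_decision(post):
    """Return the class index with the largest posterior P(w_i | x)."""
    return max(range(len(post)), key=post.__getitem__)

print(min_error_decision([0.5, 0.3, 0.2]))  # -> 0
```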

21. Minimum-Error-Rate Classification
Figure 2: The likelihood ratio p(x | w_1) / p(x | w_2). The threshold θ_a is computed using the priors P(w_1) = 2/3 and P(w_2) = 1/3 and a zero-one loss function. If we penalize mistakes in classifying w_2 patterns as w_1 more than the converse, we should increase the threshold to θ_b.

22. Discriminant Functions
◮ A useful way of representing classifiers is through discriminant functions g_i(x), i = 1, ..., c, where the classifier assigns a feature vector x to class w_i if
  g_i(x) > g_j(x) ∀ j ≠ i.
◮ For the classifier that minimizes the conditional risk, g_i(x) = −R(α_i | x).
◮ For the classifier that minimizes the error, g_i(x) = P(w_i | x).
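A small sketch of the discriminant-function view, wrapping the posterior function assumed in the earlier snippets; with a zero-one loss, the choice g_i(x) = P(w_i | x) gives the same decisions as g_i(x) = −R(α_i | x):

```python
def classify_by_discriminants(gs, x):
    """Assign x to the class whose discriminant g_i(x) is largest."""
    scores = [g(x) for g in gs]
    return max(range(len(scores)), key=scores.__getitem__)

# g_i(x) = P(w_i | x): minimum-error-rate discriminants for the fish example.
gs = [lambda x: posteriors(x)[0], lambda x: posteriors(x)[1]]
print(classify_by_discriminants(gs, 6.0))
```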
