Bayesian Decision Theory
Selim Aksoy
Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr
CS 551, Fall 2019
◮ Bayesian Decision Theory is a fundamental statistical approach to the problem of pattern classification.
◮ First, we will assume that all probabilities are known.
◮ Then, we will study the cases where the probabilistic structure is not completely known.
◮ State of nature is a random variable.
◮ Define w as the type of fish we observe (state of nature, class):
  ◮ w = w1 for sea bass,
  ◮ w = w2 for salmon.
◮ P(w1) is the a priori probability that the next fish is a sea bass.
◮ P(w2) is the a priori probability that the next fish is a salmon.
◮ Prior probabilities reflect our knowledge of how likely each type of fish will appear before we actually see it.
◮ How can we choose P(w1) and P(w2)?
  ◮ Set P(w1) = P(w2) if they are equiprobable (uniform priors).
  ◮ May use different values depending on the fishing area, time of the year, etc.
◮ Assume there are no other types of fish, so that P(w1) + P(w2) = 1 (exclusivity and exhaustivity).
◮ How can we make a decision with only the prior information?
  Decide w1 if P(w1) > P(w2), otherwise decide w2.
◮ What is the probability of error for this decision?
  P(error) = min{P(w1), P(w2)}. For example, if P(w1) = 0.9, always deciding w1 gives P(error) = 0.1.
◮ Let’s try to improve the decision using the lightness measurement x.
◮ Let x be a continuous random variable.
◮ Define p(x|wj) as the class-conditional probability density function (the pdf of x given that the state of nature is wj).
◮ p(x|w1) and p(x|w2) describe the difference in lightness between populations of sea bass and salmon.
◮ Suppose we know P(wj) and p(x|wj) for j = 1, 2, and measure the lightness of a fish as the value x.
◮ Define P(wj|x) as the a posteriori probability (the probability of the state of nature being wj given the measurement of feature value x).
◮ We can use the Bayes formula to convert the prior probability to the posterior probability
  P(wj|x) = p(x|wj) P(wj) / p(x)
  where p(x) = Σ_{j=1}^{2} p(x|wj) P(wj).
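◮ A minimal Python sketch of the Bayes formula above; the Gaussian class-conditionals and prior values are assumed purely for illustration:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density, standing in for a class-conditional pdf."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Hypothetical likelihoods p(x|w1), p(x|w2) and priors P(w1), P(w2).
priors = np.array([0.6, 0.4])

def likelihoods(x):
    return np.array([gaussian_pdf(x, 10.0, 2.0),    # p(x|w1), sea bass
                     gaussian_pdf(x, 14.0, 2.5)])   # p(x|w2), salmon

x = 12.0                             # observed lightness value
joint = likelihoods(x) * priors      # p(x|wj) P(wj)
evidence = joint.sum()               # p(x) = sum_j p(x|wj) P(wj)
posteriors = joint / evidence        # P(wj|x) via the Bayes formula
print(posteriors, posteriors.sum())  # the posteriors sum to 1 at every x
```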
◮ p(x|wj) is called the likelihood and p(x) is called the evidence.
◮ How can we make a decision after observing the value of x?
  Decide w1 if P(w1|x) > P(w2|x), otherwise decide w2.
◮ Rewriting the rule gives
  Decide w1 if p(x|w1)/p(x|w2) > P(w2)/P(w1), otherwise decide w2.
◮ Note that, at every x, P(w1|x) + P(w2|x) = 1.
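◮ A sketch of the likelihood-ratio form of the rule, with the likelihood and prior values assumed for the example:

```python
def decide_w1(p_x_w1, p_x_w2, prior_w1, prior_w2):
    """Two-class Bayes decision: True means decide w1.

    Equivalent forms: P(w1|x) > P(w2|x)  <=>
    p(x|w1)/p(x|w2) > P(w2)/P(w1).
    """
    return p_x_w1 / p_x_w2 > prior_w2 / prior_w1

# e.g., p(x|w1)=0.12, p(x|w2)=0.05, P(w1)=0.4, P(w2)=0.6:
# ratio 2.4 exceeds threshold 1.5, so decide w1.
print(decide_w1(0.12, 0.05, 0.4, 0.6))
```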
◮ What is the probability of error for this decision?
  P(error|x) = P(w1|x) if we decide w2, and P(w2|x) if we decide w1.
◮ What is the average probability of error?
  P(error) = ∫_{−∞}^{∞} P(error, x) dx = ∫_{−∞}^{∞} P(error|x) p(x) dx.
◮ Bayes decision rule minimizes this error because
  P(error|x) = min{P(w1|x), P(w2|x)}.
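◮ To see the error integral concretely, a sketch that estimates the Bayes error numerically for two assumed Gaussian class-conditionals, using P(error) = ∫ min{p(x|w1)P(w1), p(x|w2)P(w2)} dx:

```python
import numpy as np

def pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

P1, P2 = 0.6, 0.4                       # assumed priors
x = np.linspace(-20.0, 50.0, 200001)    # fine grid standing in for (-inf, inf)
dx = x[1] - x[0]

# P(error|x) p(x) = min{p(x|w1)P(w1), p(x|w2)P(w2)} under the Bayes rule
pointwise_min = np.minimum(pdf(x, 10.0, 2.0) * P1, pdf(x, 14.0, 2.5) * P2)
bayes_error = pointwise_min.sum() * dx  # Riemann-sum approximation of the integral
print(f"estimated Bayes error: {bayes_error:.4f}")
```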
◮ How can we generalize to
  ◮ more than one feature?
    Replace the scalar x by the feature vector x.
  ◮ more than two states of nature?
    Just a difference in notation.
  ◮ allowing actions other than just decisions?
    Allow the possibility of rejection.
  ◮ different risks in the decision?
    Define how costly each action is.
◮ Let {w1, . . . , wc} be the finite set of c states of nature (classes, categories).
◮ Let {α1, . . . , αa} be the finite set of a possible actions.
◮ Let λ(αi|wj) be the loss incurred for taking action αi when the state of nature is wj.
◮ Let x be the d-component vector-valued random variable called the feature vector.
◮ p(x|wj) is the class-conditional probability density function.
◮ P(wj) is the prior probability that nature is in state wj.
◮ The posterior probability can be computed as
  P(wj|x) = p(x|wj) P(wj) / p(x)
  where p(x) = Σ_{j=1}^{c} p(x|wj) P(wj).
◮ Suppose we observe x and take action αi.
◮ If the true state of nature is wj, we incur the loss λ(αi|wj).
◮ The expected loss of taking action αi is
  R(αi|x) = Σ_{j=1}^{c} λ(αi|wj) P(wj|x),
  which is also called the conditional risk.
◮ The general decision rule α(x) tells us which action to take for every possible observation x.
◮ We want to find the decision rule that minimizes the overall risk
  R = ∫ R(α(x)|x) p(x) dx.
◮ Bayes decision rule minimizes the overall risk by selecting the action αi for which R(αi|x) is minimum, as sketched below.
◮ The resulting minimum overall risk is called the Bayes risk and is the best performance that can be achieved.
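◮ A sketch of the conditional-risk computation and the Bayes choice of action, with the loss matrix and posteriors assumed for the example:

```python
import numpy as np

# lam[i, j] = loss lambda(alpha_i | w_j) for taking action alpha_i
# when the true state is w_j (values assumed for the example)
lam = np.array([[0.0, 2.0],
                [1.0, 0.0]])
posteriors = np.array([0.3, 0.7])   # P(w1|x), P(w2|x) at the observed x

cond_risk = lam @ posteriors        # R(alpha_i|x) = sum_j lam(alpha_i|w_j) P(w_j|x)
best = int(np.argmin(cond_risk))    # Bayes rule: take the minimum-risk action
print(cond_risk, "-> take action alpha_%d" % (best + 1))
```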
◮ Define
  ◮ α1: deciding w1,
  ◮ α2: deciding w2,
  ◮ λij = λ(αi|wj).
◮ Conditional risks can be written as
  R(α1|x) = λ11 P(w1|x) + λ12 P(w2|x),
  R(α2|x) = λ21 P(w1|x) + λ22 P(w2|x).
◮ The minimum-risk decision rule becomes
  Decide w1 if (λ21 − λ11) P(w1|x) > (λ12 − λ22) P(w2|x), otherwise decide w2.
◮ This corresponds to deciding w1 if the likelihood ratio exceeds a threshold that is independent of the observation x:
  p(x|w1)/p(x|w2) > ((λ12 − λ22)/(λ21 − λ11)) (P(w2)/P(w1)).
◮ Actions are decisions on classes (αi is deciding wi).
◮ If action αi is taken and the true state of nature is wj, then the decision is correct if i = j and in error if i ≠ j.
◮ We want to find a decision rule that minimizes the probability of error.
◮ Define the zero-one loss function
  λ(αi|wj) = 0 if i = j, and 1 if i ≠ j
  (all errors are equally costly).
◮ Conditional risk becomes
  R(αi|x) = Σ_{j=1}^{c} λ(αi|wj) P(wj|x) = Σ_{j≠i} P(wj|x) = 1 − P(wi|x).
◮ Minimizing the risk requires maximizing P(wi|x) and results in the minimum-error decision rule
  Decide wi if P(wi|x) > P(wj|x) for all j ≠ i,
  as sketched below.
◮ The resulting error is called the Bayes error and is the best performance that can be achieved.
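◮ A minimal sketch of the minimum-error rule under the zero-one loss, with the posterior values assumed:

```python
import numpy as np

posteriors = np.array([0.2, 0.5, 0.3])  # P(w1|x), P(w2|x), P(w3|x), assumed
risks = 1.0 - posteriors                # R(alpha_i|x) = 1 - P(w_i|x) under 0/1 loss
decision = int(np.argmax(posteriors))   # same decision as np.argmin(risks)
print("decide w%d" % (decision + 1))    # conditional error at this x: 1 - 0.5
```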
◮ A useful way of representing classifiers is through discriminant functions gi(x), i = 1, . . . , c: assign x to class wi if gi(x) > gj(x) for all j ≠ i.
◮ For the classifier that minimizes conditional risk,
  gi(x) = −R(αi|x).
◮ For the classifier that minimizes error,
  gi(x) = P(wi|x).
◮ These functions divide the feature space into c decision regions R1, . . . , Rc, separated by decision boundaries.
◮ Note that the results do not change even if we replace every gi(x) by f(gi(x)), where f(·) is a monotonically increasing function (e.g., the logarithm), as illustrated below.
◮ This may lead to significant analytical and computational simplifications.
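◮ A sketch of the discriminant-function view and the invariance to a monotone f(·) (here the logarithm); the likelihood and prior values are assumed:

```python
import numpy as np

likelihoods = np.array([0.05, 0.12, 0.02])  # p(x|w_i) at some x, assumed
priors = np.array([0.5, 0.3, 0.2])          # P(w_i), assumed

g = likelihoods * priors                      # g_i(x) proportional to P(w_i|x)
g_log = np.log(likelihoods) + np.log(priors)  # f(g_i(x)) with f = ln
assert np.argmax(g) == np.argmax(g_log)       # same decision either way
print("decide w%d" % (np.argmax(g) + 1))
```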
◮ Gaussian can be considered as a model where the feature vectors for a given class are continuous-valued, randomly corrupted versions of a single typical or prototype vector.
◮ Some properties of the Gaussian:
  ◮ Analytically tractable.
  ◮ Completely specified by the 1st and 2nd moments.
  ◮ Has the maximum entropy of all distributions with a given mean and variance.
  ◮ Many processes are asymptotically Gaussian (Central Limit Theorem).
  ◮ Linear transformations of a Gaussian are also Gaussian.
  ◮ Uncorrelatedness implies independence.
◮ For x ∈ R:
  p(x) = (1/(√(2π) σ)) exp(−(x − µ)² / (2σ²)),
  with mean µ = E[x] = ∫_{−∞}^{∞} x p(x) dx
  and variance σ² = E[(x − µ)²] = ∫_{−∞}^{∞} (x − µ)² p(x) dx.
◮ For x ∈ Rd:
  p(x) = (1/((2π)^{d/2} |Σ|^{1/2})) exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ)),
  with mean vector µ = E[x] and covariance matrix Σ = E[(x − µ)(x − µ)ᵀ].
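◮ A sketch evaluating the multivariate Gaussian density directly from the formula above, with µ and Σ assumed:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """p(x) = (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-(x-mu)^T Sigma^{-1} (x-mu) / 2)."""
    d = mu.shape[0]
    diff = x - mu
    maha = diff @ np.linalg.solve(Sigma, diff)   # squared Mahalanobis distance
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mvn_pdf(np.array([1.0, -1.0]), mu, Sigma))
```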
◮ Recall that, given x ∈ Rd, A ∈ Rd×k, and y = Aᵀx ∈ Rk, if x ∼ N(µ, Σ) then y ∼ N(Aᵀµ, AᵀΣA).
◮ As a special case, the whitening transform
  Aw = Φ Λ^{−1/2},
  where
  ◮ Φ is the matrix whose columns are the orthonormal eigenvectors of Σ,
  ◮ Λ is the diagonal matrix of the corresponding eigenvalues,
  gives a transformed distribution whose covariance matrix is the identity matrix I.
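◮ A sketch of the whitening transform Aw = Φ Λ^{−1/2} applied to samples drawn with an assumed covariance; the sample covariance after the transform should be close to I:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])                  # assumed covariance
x = rng.multivariate_normal(mean=[0, 0], cov=Sigma, size=100000)

evals, Phi = np.linalg.eigh(Sigma)   # columns of Phi: orthonormal eigenvectors
A_w = Phi @ np.diag(evals ** -0.5)   # whitening transform A_w = Phi Lambda^{-1/2}
y = x @ A_w                          # y = A_w^T x, applied to each sample row

print(np.cov(y, rowvar=False))       # approximately the identity matrix
```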
◮ Discriminant functions for minimum-error-rate classification can be written as
  gi(x) = ln p(x|wi) + ln P(wi).
◮ For p(x|wi) = N(µi, Σi),
  gi(x) = −(1/2)(x − µi)ᵀ Σi⁻¹ (x − µi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(wi).
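◮ A sketch of this Gaussian discriminant computed directly from the expression above, with the class parameters assumed:

```python
import numpy as np

def g_i(x, mu, Sigma, prior):
    """g_i(x) = -0.5 (x-mu)^T Sigma^{-1} (x-mu) - (d/2) ln 2*pi
                - 0.5 ln|Sigma| + ln P(w_i)."""
    d = mu.shape[0]
    diff = x - mu
    return (-0.5 * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

# Two assumed classes; decide the one with the larger discriminant.
x = np.array([1.0, 2.0])
g1 = g_i(x, np.array([0.0, 0.0]), np.eye(2), 0.5)
g2 = g_i(x, np.array([3.0, 3.0]), 2 * np.eye(2), 0.5)
print("decide w1" if g1 > g2 else "decide w2")
```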
◮ For the special case Σi = σ²I, the discriminant functions are linear:
  gi(x) = wiᵀ x + wi0
  where wi = (1/σ²) µi and wi0 = −(1/(2σ²)) µiᵀ µi + ln P(wi).
◮ Decision boundaries are the hyperplanes gi(x) = gj(x), and can be written as
  wᵀ(x − x0) = 0
  where w = µi − µj and
  x0 = (1/2)(µi + µj) − (σ² / ‖µi − µj‖²) ln(P(wi)/P(wj)) (µi − µj).
◮ The hyperplane separating Ri and Rj passes through the point x0 and is orthogonal to the vector w.
◮ The special case when the priors P(wi) are the same for i = 1, . . . , c is the minimum-distance classifier, with the decision rule
  Decide wi* where i* = arg min_{i=1,...,c} ‖x − µi‖,
  i.e., assign x to the class of the nearest mean, as sketched below.
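◮ A minimal sketch of the resulting nearest-mean classifier, with the class means assumed:

```python
import numpy as np

means = np.array([[0.0, 0.0],     # mu_1
                  [4.0, 0.0],     # mu_2
                  [2.0, 3.0]])    # mu_3 (assumed)

def nearest_mean(x):
    """Assign x to the class of the nearest mean (equal priors, Sigma_i = s^2 I)."""
    dists = np.linalg.norm(means - x, axis=1)
    return int(np.argmin(dists))

print("decide w%d" % (nearest_mean(np.array([3.0, 1.0])) + 1))  # -> w2
```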
◮ For the special case Σi = Σ (common covariance), the discriminant functions are again linear:
  gi(x) = wiᵀ x + wi0
  where wi = Σ⁻¹ µi and wi0 = −(1/2) µiᵀ Σ⁻¹ µi + ln P(wi).
◮ Decision boundaries can be written as
  wᵀ(x − x0) = 0
  where w = Σ⁻¹(µi − µj) and
  x0 = (1/2)(µi + µj) − (ln(P(wi)/P(wj)) / ((µi − µj)ᵀ Σ⁻¹ (µi − µj))) (µi − µj).
◮ The hyperplane passes through x0 but is not necessarily orthogonal to the line between the means.
◮ For the general case of arbitrary Σi, the discriminant functions are quadratic:
  gi(x) = xᵀ Wi x + wiᵀ x + wi0
  where Wi = −(1/2) Σi⁻¹, wi = Σi⁻¹ µi, and
  wi0 = −(1/2) µiᵀ Σi⁻¹ µi − (1/2) ln |Σi| + ln P(wi).
◮ Decision boundaries are hyperquadrics.
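◮ A sketch computing the quadratic-form coefficients Wi, wi, wi0 explicitly for one class, with the parameters assumed:

```python
import numpy as np

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
prior = 0.4                              # assumed values

Sigma_inv = np.linalg.inv(Sigma)
W = -0.5 * Sigma_inv                                 # W_i
w = Sigma_inv @ mu                                   # w_i
w0 = (-0.5 * mu @ Sigma_inv @ mu
      - 0.5 * np.log(np.linalg.det(Sigma))
      + np.log(prior))                               # w_i0

x = np.array([0.5, 1.5])
g = x @ W @ x + w @ x + w0                           # g_i(x), quadratic in x
print(g)
```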
◮ For the two-category case,
  P(error) = P(x ∈ R2, w1) + P(x ∈ R1, w2)
           = ∫_{R2} p(x|w1) P(w1) dx + ∫_{R1} p(x|w2) P(w2) dx.
◮ For the multicategory case, it is easier to work with the probability of being correct:
  P(correct) = Σ_{i=1}^{c} P(x ∈ Ri, wi) = Σ_{i=1}^{c} ∫_{Ri} p(x|wi) P(wi) dx,
  so that P(error) = 1 − P(correct).
◮ Consider the two-category case and define
  ◮ w1: target is present,
  ◮ w2: target is not present.
◮ The four possible outcomes are: correct detection (decide w1, truth w1), false alarm (decide w1, truth w2), mis-detection (decide w2, truth w1), and correct rejection (decide w2, truth w2).
◮ Mis-detection is also called false negative or Type II error.
◮ False alarm is also called false positive or Type I error.
◮ If we use a parameter (e.g., a decision threshold) to trade off the false alarm rate against the mis-detection rate, the plot of these rates for different values of the parameter is called the receiver operating characteristic (ROC) curve, as sketched below.
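◮ A sketch tracing ROC operating points by sweeping a threshold on a scalar score, with the target and non-target score distributions assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
scores_w1 = rng.normal(2.0, 1.0, 5000)   # target present (assumed distribution)
scores_w2 = rng.normal(0.0, 1.0, 5000)   # target not present

thresholds = np.linspace(-4, 6, 101)
for t in thresholds[::25]:               # print a few operating points
    hit_rate = np.mean(scores_w1 > t)    # 1 - P(mis-detection)  (true positives)
    fa_rate = np.mean(scores_w2 > t)     # P(false alarm)        (false positives)
    print(f"threshold {t:5.2f}: hit {hit_rate:.3f}, false alarm {fa_rate:.3f}")
```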
◮ To minimize the overall risk, choose the action that minimizes the conditional risk R(α|x).
◮ To minimize the probability of error, choose the class that maximizes the posterior probability P(wj|x).
◮ If there are different penalties for misclassifying patterns from different classes, the posteriors must be weighted according to such penalties before taking a decision.
◮ Do not forget that these decisions are the optimal ones only when the underlying probability structure is known.