Bayesian Learning
Based on “Machine Learning”, T. Mitchell, McGraw Hill, 1997, ch. 6. Acknowledgement: The present slides are an adaptation of slides drawn by T. Mitchell.
Two roles for the Bayesian methods in learning:
1. They provide practical learning algorithms, by combining prior knowledge/probabilities with observed data: e.g. the Naive Bayes learner, and learning in the presence of unobserved variables (EM).
2. They provide a useful conceptual framework for evaluating other learning algorithms: e.g. concept learning through general-to-specific hypotheses ordering (Find-S, Candidate Elimination), neural networks, linear regression.

Contents:
1. Bayes' theorem. Defining classes of hypotheses: Maximum A Posteriori (MAP) hypotheses; Maximum Likelihood (ML) hypotheses
2.1 The brute-force MAP hypotheses learning algorithm
2.2 The Bayes optimal classifier
2.3 The Gibbs classifier
2.4 The Naive Bayes learner. Example: learning over text data
2.5 The Minimum Description Length (MDL) principle; MDL hypotheses
3.1 ML hypotheses in learning real-valued functions
3.2 ML hypotheses in learning to predict probabilities
3.3 The Expectation Maximization (EM) algorithm
Basic probability formulas:
Probability of a conjunction of two events A and B: P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)
Bayes' theorem: P(A|B) = P(B|A)P(A) / P(B)
Total probability: if the events A1, ..., An are mutually exclusive, with Σ_{i=1}^n P(Ai) = 1, then
P(B) = Σ_{i=1}^n P(B|Ai)P(Ai); in particular, P(B) = P(B|A)P(A) + P(B|¬A)P(¬A)
Defining classes of hypotheses. The Maximum A Posteriori (MAP) hypothesis:
hMAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h)P(h) / P(D) = argmax_{h∈H} P(D|h)P(h)
If we assume equal priors, P(hi) = P(hj) for all hi, hj ∈ H, we can further simplify and choose the Maximum Likelihood (ML) hypothesis:
hML = argmax_{hi∈H} P(D|hi)
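A minimal Python sketch of the brute-force MAP idea (section 2.1 in the plan): enumerate the hypotheses, score each one by P(D|h)·P(h), and return the best. The hypothesis space, prior and likelihood below are illustrative assumptions, not taken from the slides.

def map_hypothesis(hypotheses, prior, likelihood, data):
    # Brute-force MAP: score every h by P(D|h) * P(h) and return the argmax.
    # prior(h) and likelihood(data, h) are user-supplied functions.
    return max(hypotheses, key=lambda h: likelihood(data, h) * prior(h))

# Toy usage: hypotheses are candidate biases of a coin, data is a list of 0/1 flips.
hypotheses = [0.2, 0.5, 0.8]              # candidate values of P(heads)
prior = lambda h: 1.0 / 3                 # uniform prior over the three hypotheses
def likelihood(data, h):                  # i.i.d. Bernoulli likelihood P(D|h)
    p = 1.0
    for d in data:
        p *= h if d == 1 else (1 - h)
    return p
print(map_hypothesis(hypotheses, prior, likelihood, [1, 1, 0, 1]))   # -> 0.8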
Suppose the following data characterize the lab results for cancer-suspect people:
P(cancer) = 0.008, P(¬cancer) = 0.992
P(+|cancer) = 0.98, P(−|cancer) = 0.02
P(+|¬cancer) = 0.03, P(−|¬cancer) = 0.97
The two hypotheses are h1 = cancer and h2 = ¬cancer; the data D is the lab result (+ or −), and we compare P(D|h1) and P(D|h2).
Question: Should we diagnose a patient x whose lab result is positive as having cancer?
Answer: No. Indeed, we have to find argmax{P(cancer|+), P(¬cancer|+)}. Applying Bayes' theorem (for D = {+}):
P(+|cancer)P(cancer) = 0.98 × 0.008 ≈ 0.0078
P(+|¬cancer)P(¬cancer) = 0.03 × 0.992 ≈ 0.0298
so hMAP = ¬cancer.
(We can further infer P(cancer|+) = 0.0078 / (0.0078 + 0.0298) ≈ 21%.)
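A quick numerical check of this example (a sketch in Python, using only the probabilities given above):

# Posterior P(cancer|+) via Bayes' theorem, using the numbers above.
p_cancer, p_not_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

num = p_pos_given_cancer * p_cancer            # 0.98 * 0.008 = 0.00784
alt = p_pos_given_not * p_not_cancer           # 0.03 * 0.992 = 0.02976
print(round(num, 4), round(alt, 4))            # 0.0078 0.0298 -> hMAP = not cancer
print(round(num / (num + alt), 2))             # P(cancer|+) = 0.21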
The Bayes optimal classifier: the most probable classification of a new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities:
argmax_{vj∈V} Σ_{hi∈H} P(vj|hi) P(hi|D)

Example: Let us consider three possible hypotheses, with P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3. Obviously, hMAP = h1. Let's consider an instance x such that h1(x) = +, h2(x) = −, h3(x) = −.
Question: What is the most probable classification of x?
Answer:
P(−|h1) = 0, P(+|h1) = 1
P(−|h2) = 1, P(+|h2) = 0
P(−|h3) = 1, P(+|h3) = 0
Σ_{hi∈H} P(+|hi)P(hi|D) = 0.4 and Σ_{hi∈H} P(−|hi)P(hi|D) = 0.6, therefore
argmax_{vj∈V} Σ_{hi∈H} P(vj|hi)P(hi|D) = −
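The same computation can be scripted; a small sketch (the hypothesis names h1, h2, h3 and the 0/1 values of P(vj|hi) are as in the example above):

# Bayes optimal classification: argmax_v sum_h P(v|h) * P(h|D).
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}     # P(h|D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}    # h(x) for the new instance x

scores = {}
for v in ("+", "-"):
    # here P(v|h) is 1 if h predicts v for x, and 0 otherwise
    scores[v] = sum(p for h, p in posteriors.items() if predictions[h] == v)
print(scores)                                      # {'+': 0.4, '-': 0.6}
print(max(scores, key=scores.get))                 # -> '-'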
Note: The Bayes optimal classifier provides the best result, but it can be expensive to apply if there are many hypotheses.
The Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D).
2. Use this hypothesis to classify the new instance.
Surprising fact [Haussler et al., 1994]: if the target concept is selected randomly according to the P(h|D) distribution, then the expected error of the Gibbs classifier is no worse than twice the expected error of the Bayes optimal classifier:
E[error_Gibbs] ≤ 2 · E[error_BayesOptimal]
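A sketch of the Gibbs classifier, assuming hypotheses can be sampled from the posterior P(h|D); the toy hypotheses reuse the previous example:

import random

def gibbs_classify(x, hypotheses, posteriors):
    # Gibbs classifier: draw one hypothesis h according to P(h|D), then classify x with it.
    h = random.choices(hypotheses, weights=posteriors, k=1)[0]
    return h(x)

# Toy usage, reusing the three hypotheses of the previous example.
hs = [lambda x: "+", lambda x: "-", lambda x: "-"]
print(gibbs_classify(None, hs, [0.4, 0.3, 0.3]))   # '+' with prob. 0.4, '-' with prob. 0.6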
The Naive Bayes classifier. When to use it: when the attribute values a1, a2, ..., an that describe the instances can be assumed conditionally independent w.r.t. the given classification:
P(a1, a2, ..., an|vj) = ∏_i P(ai|vj)
The most probable value of f(x) is:
vMAP = argmax_{vj∈V} P(vj|a1, a2, ..., an)
     = argmax_{vj∈V} P(a1, a2, ..., an|vj)P(vj) / P(a1, a2, ..., an)
     = argmax_{vj∈V} P(a1, a2, ..., an|vj)P(vj)
and, using the conditional independence assumption, the Naive Bayes classifier outputs
vNB = argmax_{vj∈V} P(vj) ∏_i P(ai|vj)
The Naive Bayes algorithm estimates the probabilities P̂(vj) and P̂(ai|vj) from the training data, and classifies a new instance x as
vNB = argmax_{vj∈V} P̂(vj) ∏_{ai∈x} P̂(ai|vj)
Consider again the PlayTennis example, and the new instance
⟨Outlook = sunny, Temp = cool, Humidity = high, Wind = strong⟩
We compute vNB = argmax_{vj∈V} P(vj) ∏_i P(ai|vj), using the estimates
P(yes) = 9/14 = 0.64, P(no) = 5/14 = 0.36
...
P(strong|yes) = 3/9 = 0.33, P(strong|no) = 3/5 = 0.60
Then:
P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.0053
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 0.0206
→ vNB = no
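The same decision can be reproduced in a few lines of Python; the conditional probabilities below are the usual estimates obtained from Mitchell's 14-example PlayTennis table and should be taken as given:

# Naive Bayes decision for <sunny, cool, high, strong> on the PlayTennis data.
p_class = {"yes": 9/14, "no": 5/14}
cond = {   # P(attribute value | class), estimated from the 14 training examples
    "yes": {"sunny": 2/9, "cool": 3/9, "high": 3/9, "strong": 3/9},
    "no":  {"sunny": 3/5, "cool": 1/5, "high": 4/5, "strong": 3/5},
}
instance = ["sunny", "cool", "high", "strong"]

scores = {}
for v in ("yes", "no"):
    score = p_class[v]
    for a in instance:
        score *= cond[v][a]
    scores[v] = score
print({v: round(s, 4) for v, s in scores.items()})   # {'yes': 0.0053, 'no': 0.0206}
print(max(scores, key=scores.get))                   # -> 'no'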
Estimating probabilities: What if none of the training instances with target value vj have the attribute value ai? Then P̂(ai|vj) = 0, and therefore P̂(vj) ∏_i P̂(ai|vj) = 0.
The typical solution is to (re)define P̂(ai|vj) using the m-estimate:
P̂(ai|vj) = (nc + m·p) / (n + m), where
n is the number of training examples for which v = vj,
nc is the number of those examples for which a = ai as well,
p is a prior estimate of P̂(ai|vj) (for instance, if the attribute a has k values, then p = 1/k),
m is the weight given to the prior (i.e. the number of “virtual” examples).
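A one-function sketch of the m-estimate (the argument names are ours):

def m_estimate(n_c, n, p, m):
    # m-estimate of P(ai|vj): (n_c + m*p) / (n + m), where
    #   n   = number of training examples with class vj,
    #   n_c = number of those examples that also have attribute value ai,
    #   p   = prior estimate of P(ai|vj) (e.g. 1/k if the attribute has k values),
    #   m   = equivalent sample size (the number of 'virtual' examples).
    return (n_c + m * p) / (n + m)

# Attribute value never seen with this class (n_c = 0): the estimate is no longer 0.
print(m_estimate(n_c=0, n=8, p=0.5, m=2))   # 0.1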
For classifying text documents, the model is:
P(doc|vj) = ∏_{i=1}^{length(doc)} P(ai = wk|vj)
where P(ai = wk|vj) is the probability that the word in position i is wk, given vj.
An additional assumption: ∀ i, m: P(ai = wk|vj) = P(am = wk|vj) = P(wk|vj),
i.e. the attributes are (not only independent but) also identically distributed.
Learn_Naive_Bayes_Text(Examples, V):
1. Vocabulary ← all distinct words and other tokens in Examples
2. For each target value vj in V:
     docsj ← the subset of Examples for which the target value is vj
     P(vj) ← |docsj| / |Examples|
     Textj ← a single document created by concatenating all members of docsj
     n ← the total number of words in Textj
     For each word wk in Vocabulary:
         nk ← the number of times word wk occurs in Textj
         P(wk|vj) ← (nk + 1) / (n + |Vocabulary|)     (here we use the m-estimate)
Classify_Naive_Bayes_Text(Doc):
     positions ← all word positions in Doc that contain tokens from Vocabulary
     Return vNB = argmax_{vj∈V} P(vj) ∏_{i∈positions} P(ai = wk|vj)
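A compact Python sketch of the two procedures above, using log-probabilities to avoid underflow; the toy corpus and labels are made up for illustration:

import math
from collections import Counter

def learn_nb_text(examples):
    # examples: list of (list_of_words, label) pairs.
    vocab = {w for doc, _ in examples for w in doc}
    prior, word_prob = {}, {}
    for v in {lab for _, lab in examples}:
        docs_v = [doc for doc, lab in examples if lab == v]
        prior[v] = len(docs_v) / len(examples)
        text_v = [w for doc in docs_v for w in doc]      # Text_j: concatenation of docs_v
        counts, n = Counter(text_v), len(text_v)
        word_prob[v] = {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}
    return prior, word_prob, vocab

def classify_nb_text(doc, prior, word_prob, vocab):
    scores = {v: math.log(prior[v]) +
                 sum(math.log(word_prob[v][w]) for w in doc if w in vocab)
              for v in prior}
    return max(scores, key=scores.get)

# Toy usage with a made-up two-document corpus.
examples = [("cheap pills now".split(), "spam"), ("meeting agenda attached".split(), "ham")]
prior, word_prob, vocab = learn_nb_text(examples)
print(classify_nb_text("cheap pills".split(), prior, word_prob, vocab))   # -> 'spam'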
Example: the 20 Newsgroups data set. Given 1000 training documents from each of the 20 newsgroups, learn to classify new documents according to the newsgroup they came from:
comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, alt.atheism, soc.religion.christian, talk.religion.misc, sci.space, sci.crypt, sci.electronics, sci.med, talk.politics.mideast, talk.politics.misc, talk.politics.guns
Naive Bayes achieved 89% classification accuracy (using 2/3 of each group for training, after eliminating rare words and the 100 most frequent words).
[Figure: learning curve for the 20 Newsgroups task: classification accuracy vs. training set size (100 to 10000 documents), for Bayes, TFIDF and PRTFIDF.]
Occam's razor: prefer the shortest hypothesis.
Bayesian analysis: prefer the MAP hypothesis:
hMAP = argmax_{h∈H} P(D|h)P(h)
     = argmax_{h∈H} (log2 P(D|h) + log2 P(h))
     = argmin_{h∈H} (− log2 P(D|h) − log2 P(h))
An interesting fact from information theory: the optimal code (the one with the shortest expected coding length) for an event with probability p uses − log2 p bits. So we can interpret:
− log2 P(h): the length of h under the optimal code;
− log2 P(D|h): the length of D given h under the optimal code.
Therefore we prefer the hypothesis h that minimizes the sum of these two description lengths.
The Minimum Description Length (MDL) principle. We saw that a MAP learner prefers the hypothesis h that minimizes LC1(h) + LC2(D|h), where LC(x) is the description length of x under encoding C:
hMDL = argmin_{h∈H} (LC1(h) + LC2(D|h))
Example: H = decision trees, D = training data labels.
In the literature, applications of MDL to practical problems often include arguments justifying the choice of the encodings C1 and C2.
Note that LC2(D|h) = 0 if the examples are classified perfectly by h and both the transmitter and the receiver know h; in that situation we only need to describe the exceptions. So:
hMDL = argmin_{h∈H} (length(h) + length(misclassifications))
In general, MDL trades off hypothesis size for training errors: it might select a shorter hypothesis that makes a few errors over a longer hypothesis that perfectly classifies the data.
Consequence: in learning (for instance) decision trees, the MDL principle can work as an alternative to pruning.
ML hypotheses in learning real-valued functions.
[Figure: the target function f, the noisy observations ⟨xi, di⟩ (points (x, y)), and the learned hypothesis hML.]
Problem: Consider learning a real-valued target function f : X → R from D, a training set of examples ⟨xi, di⟩, i = 1, ..., m, where
the xi are assumed fixed (to simplify the analysis), and
the di are noisy training values, di = f(xi) + ei, with ei a random variable (noise) drawn independently for each xi according to a Gaussian distribution with mean 0.
Considering H, a class of functions h : X → R (candidates for f), and assuming that the xi are mutually independent given h, the maximum likelihood hypothesis hML is the one that minimizes the sum of squared errors:
hML =def argmax_{h∈H} p(D|h) = argmin_{h∈H} Σ_{i=1}^m (di − h(xi))²
Note: We will use the probability density function
p(x0) =def lim_{ε→0} (1/ε) · P(x0 ≤ x < x0 + ε)
Then:
hML = argmax_{h∈H} p(D|h) = argmax_{h∈H} ∏_{i=1}^m p(di|h)
Since di is Gaussian with mean µi = f(xi), and substituting h(xi) for f(xi):
    = argmax_{h∈H} ∏_{i=1}^m p(ei|h) = argmax_{h∈H} ∏_{i=1}^m p(di − f(xi)|h)
    = argmax_{h∈H} ∏_{i=1}^m p(di − h(xi)|h)
    = argmax_{h∈H} ∏_{i=1}^m (1/√(2πσ²)) · e^{−(1/2)((di − h(xi))/σ)²}
Maximizing the logarithm instead (ln is monotonically increasing):
    = argmax_{h∈H} Σ_{i=1}^m ( ln(1/√(2πσ²)) − (1/2)((di − h(xi))/σ)² )
    = argmax_{h∈H} Σ_{i=1}^m −(1/2)((di − h(xi))/σ)²
    = argmax_{h∈H} Σ_{i=1}^m −(di − h(xi))²
    = argmin_{h∈H} Σ_{i=1}^m (di − h(xi))²
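The conclusion is easy to check empirically: with Gaussian noise added to the targets, the maximum-likelihood hypothesis within a linear class is the least-squares fit. A small self-contained sketch (the target function and noise level are arbitrary illustrative choices):

import random

def least_squares(xs, ds):
    # Closed-form ordinary least-squares fit of h(x) = a*x + b,
    # i.e. the hypothesis minimizing sum_i (d_i - h(x_i))^2.
    n = len(xs)
    mx, md = sum(xs) / n, sum(ds) / n
    a = sum((x - mx) * (d - md) for x, d in zip(xs, ds)) / sum((x - mx) ** 2 for x in xs)
    return a, md - a * mx

random.seed(0)
f = lambda x: 2.0 * x + 1.0                        # the (unknown) target function
xs = [i / 10 for i in range(50)]
ds = [f(x) + random.gauss(0, 0.3) for x in xs]     # d_i = f(x_i) + e_i, Gaussian noise
print(least_squares(xs, ds))                       # close to (2.0, 1.0)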
Notes: Similar analyses can be carried out for other noise distributions (than Gaussians), producing different results. Also, the above derivation assumes that the noise affects only the target values di, not the training examples xi. Otherwise, the analysis becomes significantly more complex.
ML hypotheses in learning to predict probabilities. Let us consider a non-deterministic function (i.e. a one-to-many relation) f : X → {0, 1}. Given a set of independently drawn examples D = {⟨x1, d1⟩, ..., ⟨xm, dm⟩}, where di = f(xi) ∈ {0, 1}, we would like to learn an ML hypothesis for the probability function g(x) =def P(f(x) = 1). For example, h(xi) = 0.92 if P(f(xi) = 1) = 0.92.
Proposition: In this setting, hML = argmax_{h∈H} P(D|h) maximizes the sum
Σ_{i=1}^m [di ln h(xi) + (1 − di) ln(1 − h(xi))]
Proof: P(D|h) = ∏_{i=1}^m P(xi, di|h) = ∏_{i=1}^m P(di|xi, h) · P(xi|h).
It can be assumed that xi is independent of h, therefore
P(D|h) = ∏_{i=1}^m P(di|xi, h) · P(xi)
Proof (continued): What we want to compute is h(xi) = P(di = 1|xi, h). In a more general form:
P(di|xi, h) = h(xi) if di = 1, and 1 − h(xi) if di = 0,
or, in a more convenient mathematical form, P(di|xi, h) = h(xi)^{di} (1 − h(xi))^{1−di}. Therefore
hML = argmax_{h∈H} ∏_{i=1}^m [h(xi)^{di} (1 − h(xi))^{1−di} P(xi)]
    = argmax_{h∈H} ∏_{i=1}^m h(xi)^{di} (1 − h(xi))^{1−di} · ∏_{i=1}^m P(xi)
    = argmax_{h∈H} ∏_{i=1}^m h(xi)^{di} (1 − h(xi))^{1−di}
    = argmax_{h∈H} Σ_{i=1}^m [di ln h(xi) + (1 − di) ln(1 − h(xi))]
Note: The quantity −Σ_{i=1}^m [di ln h(xi) + (1 − di) ln(1 − h(xi))] is called cross-entropy; the above hML minimizes this quantity.
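The cross-entropy is straightforward to evaluate; a small sketch (the example predictions are made up):

import math

def cross_entropy(ds, hs):
    # - sum_i [d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i))], with d_i in {0,1} and h(x_i) in (0,1).
    return -sum(d * math.log(h) + (1 - d) * math.log(1 - h) for d, h in zip(ds, hs))

labels = [1, 0, 1, 1]
print(cross_entropy(labels, [0.9, 0.1, 0.8, 0.7]))   # ~0.79: predictions match the labels well
print(cross_entropy(labels, [0.5, 0.5, 0.5, 0.5]))   # ~2.77: uninformative predictions cost more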
The Expectation Maximization (EM) algorithm [Dempster et al., 1977]. When to use it: when the data is only partially observable, e.g. the target value is unobservable, or some instance attributes are unobservable. Example application: estimating the means of k Gaussians.
[Figure: the general EM iteration: from the current hypothesis h at step t, compute P(Y|X; h), derive the next hypothesis h^(t+1), and set t ← t+1.]
Bayesian belief networks.
[Figure: a Bayesian belief network over the boolean variables Storm (S), BusTourGroup (B), Lightning, Campfire (C), Thunder and ForestFire, together with the conditional probability table for Campfire given its parents Storm and BusTourGroup:]
        S,B    S,¬B   ¬S,B   ¬S,¬B
  C     0.4    0.1    0.8    0.2
 ¬C     0.6    0.9    0.2    0.8
The network is defined by such local conditional probability tables, together with conditional independence assertions: each node — representing a random variable — is asserted to be conditionally independent of its nondescendants, given its immediate predecessors (parents). Example: P(Thunder|ForestFire, Lightning) = P(Thunder|Lightning).
Another example network, over the boolean variables L, F, S, A, G, where L and F are the parents of S, and S is the parent of A and G:
P(L) = 0.4, P(F) = 0.6
P(S|L,F) = 0.8, P(S|L,¬F) = 0.6, P(S|¬L,F) = 0.5, P(S|¬L,¬F) = 0.3
P(A|S) = 0.7, P(A|¬S) = 0.3
P(G|S) = 0.8, P(G|¬S) = 0.2
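Since in any Bayesian network the joint probability factorizes as the product of P(yi|Parents(Yi)), the CPTs above suffice to compute any joint probability. A sketch, with the factorization following the structure just stated:

# Joint probability in the network above: P(l,f,s,a,g) = P(l)P(f)P(s|l,f)P(a|s)P(g|s).
P_L, P_F = 0.4, 0.6
P_S = {(True, True): 0.8, (True, False): 0.6, (False, True): 0.5, (False, False): 0.3}
P_A = {True: 0.7, False: 0.3}    # P(A = true | S)
P_G = {True: 0.8, False: 0.2}    # P(G = true | S)

def joint(l, f, s, a, g):
    p = (P_L if l else 1 - P_L) * (P_F if f else 1 - P_F)
    p *= P_S[(l, f)] if s else 1 - P_S[(l, f)]
    p *= P_A[s] if a else 1 - P_A[s]
    p *= P_G[s] if g else 1 - P_G[s]
    return p

print(joint(True, True, True, True, False))   # P(L, F, S, A, ~G) = 0.4*0.6*0.8*0.7*0.2 = 0.02688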
Exemplifying text classification using the Naive Bayes algorithm
(CMU, 2009 spring, Ziv Bar-Joseph, midterm, pr. 2)
Training data: 4 Regular messages (3 containing the word ‘study’, 1 containing ‘free’, 1 containing ‘money’) and 8 Spam messages (none containing ‘study’, all 8 containing ‘free’, 4 containing ‘money’).
Estimating the parameters by MLE and applying Laplace's rule (“add-one”):
P(study|spam) = (0 + 1)/(8 + 2) = 1/10          P(study|regular) = (3 + 1)/(4 + 2) = 2/3
P(free|spam) = (8 + 1)/(8 + 2) = 9/10           P(free|regular) = (1 + 1)/(4 + 2) = 1/3
P(money|spam) = (4 + 1)/(8 + 2) = 1/2           P(money|regular) = (1 + 1)/(4 + 2) = 1/3
Classification of the message s = “money for psychology study”, using the a priori probability P(spam) = 0.1:
P(spam|s) = P(spam|study, ¬free, money)
          = P(study, ¬free, money|spam) · P(spam) / [P(study, ¬free, money|spam) · P(spam) + P(study, ¬free, money|reg) · P(reg)]
P(study, ¬free, money|spam) = P(study|spam) · P(¬free|spam) · P(money|spam) = (1/10) · (1/10) · (1/2) = 1/200
P(study, ¬free, money|reg) = P(study|reg) · P(¬free|reg) · P(money|reg) = (2/3) · (2/3) · (1/3) = 4/27
Therefore, P(spam|s) = (1/200 · 1/10) / (1/200 · 1/10 + 4/27 · 9/10) ≈ 0.0037
Exemplifying the computation of the error rate for the Naive Bayes algorithm
(CMU, 2010 fall, Aarti Singh, HW1, pr. 4.2)
Consider a simple learning problem: determining whether Alice and Bob from CA will go hiking or not, Y: Hike ∈ {T, F}, given the weather conditions X1: Sunny ∈ {T, F} and X2: Windy ∈ {T, F}, by a Naive Bayes classifier. Using training data, we estimated the parameters
P(Hike) = 0.5
P(Sunny|Hike) = 0.8, P(Sunny|¬Hike) = 0.7
P(Windy|Hike) = 0.4, P(Windy|¬Hike) = 0.5
Assume that the true distribution of X1, X2 and Y satisfies the Naive Bayes assumption of conditional independence, with the above parameters.
a. What is the joint probability that Alice and Bob go hiking and the weather is sunny and windy, that is, P(Sunny, Windy, Hike)?
Answer: P(Sunny, Windy, Hike) = P(Sunny|Hike) · P(Windy|Hike) · P(Hike) = 0.8 · 0.4 · 0.5 = 0.16.
b. What is the expected error rate of this Naive Bayes classifier? (The expected error rate is the probability that each class generates an observation where the decision rule is incorrect.)
Answer: Under the Naive Bayes assumption, P(X1, X2, Y) = P(X1|Y) · P(X2|Y) · P(Y); the eight configurations are listed below.
X1  X2  Y   P(X1|Y)·P(X2|Y)·P(Y)      fNB(X1,X2)   PNB(Y|X1,X2)
F   F   F   0.3 · 0.5 · 0.5 = 0.075    F            0.555556
F   F   T   0.2 · 0.6 · 0.5 = 0.060    F            0.444444
F   T   F   0.3 · 0.5 · 0.5 = 0.075    F            0.652174
F   T   T   0.2 · 0.4 · 0.5 = 0.040    F            0.347826
T   F   F   0.7 · 0.5 · 0.5 = 0.175    T            0.421686
T   F   T   0.8 · 0.6 · 0.5 = 0.240    T            0.578314
T   T   F   0.7 · 0.5 · 0.5 = 0.175    F            0.522388
T   T   T   0.8 · 0.4 · 0.5 = 0.160    F            0.477612
Note: the joint probabilities corresponding to incorrect predictions are those of the rows (F,F,T), (F,T,T), (T,F,F) and (T,T,T).
error =def Σ I[fNB(X1, X2) ≠ Y] · P(X1, X2, Y) = 0.060 + 0.040 + 0.175 + 0.160 = 0.435
Note: I is the indicator function; its value is 1 whenever the associated condition (in our case, fNB(X1, X2) ≠ Y) is true, and 0 otherwise.
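The table and the 0.435 value can be reproduced by enumerating the eight configurations; a sketch using only the parameters given above:

from itertools import product

p_y = 0.5                              # P(Hike)
p_x1 = {True: 0.8, False: 0.7}         # P(Sunny = T | Hike), P(Sunny = T | ~Hike)
p_x2 = {True: 0.4, False: 0.5}         # P(Windy = T | Hike), P(Windy = T | ~Hike)

def joint(x1, x2, y):                  # true joint = Naive Bayes factorization (assumption holds)
    p = p_y if y else 1 - p_y
    p *= p_x1[y] if x1 else 1 - p_x1[y]
    p *= p_x2[y] if x2 else 1 - p_x2[y]
    return p

error = 0.0
for x1, x2, y in product([False, True], repeat=3):
    f_nb = joint(x1, x2, True) > joint(x1, x2, False)    # the classifier's prediction for (x1, x2)
    if f_nb != y:
        error += joint(x1, x2, y)                        # mass of the misclassified configuration
print(round(error, 3))                                   # 0.435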
c. Next, suppose that we gather more information about the weather conditions and introduce a new feature denoting whether the weather is X3: Rainy or not. Assume that each day the weather in CA can be either Rainy or Sunny; that is, it cannot be both Sunny and Rainy, and similarly it cannot be neither Sunny nor Rainy.
What is the joint probability that Alice and Bob go hiking and the weather is sunny, windy and not rainy, that is, P(Sunny, Windy, ¬Rainy, Hike)?
Answer: The Naive Bayes assumption (conditional independence of the variables given the class label) is now violated: knowing whether the weather is Sunny completely determines whether it is Rainy or not, so Sunny and Rainy are clearly NOT conditionally independent given Hike. The true joint probability is
P(Sunny, Windy, ¬Rainy, Hike) = P(Sunny|Hike) · P(Windy|Hike) · P(¬Rainy|Hike, Sunny, Windy) · P(Hike) = 0.8 · 0.4 · 1 · 0.5 = 0.16.
d. What is the expected error rate of the Naive Bayes classifier that uses all three attributes? Does the performance of Naive Bayes improve by observing the new attribute Rainy? Explain why.
Answer: Naive Bayes now also uses the estimates P(Rainy|Hike) = 1 − P(Sunny|Hike) = 0.2 and P(Rainy|¬Hike) = 1 − P(Sunny|¬Hike) = 0.3. The true joint P(X1, X2, X3, Y) is nonzero only when X3 = ¬X1; blank entries in that column are 0.
X1  X2  X3  Y   P(X1,X2,X3,Y)   P(X1|Y)·P(X2|Y)·P(X3|Y)·P(Y)   fNB(X1,X2,X3)   PNB(Y|X1,X2,X3)
F   F   F   F                   0.075 · 0.7 = 0.0525            F               0.522388
F   F   F   T                   0.060 · 0.8 = 0.0480            F               0.477612
F   F   T   F   0.075           0.075 · 0.3 = 0.0225            F               0.652174
F   F   T   T   0.060           0.060 · 0.2 = 0.0120            F               0.347826
F   T   F   F                   0.075 · 0.7 = 0.0525            F               0.621302
F   T   F   T                   0.040 · 0.8 = 0.0320            F               0.378698
F   T   T   F   0.075           0.075 · 0.3 = 0.0225            F               0.737705
F   T   T   T   0.040           0.040 · 0.2 = 0.0080            F               0.262295
T   F   F   F   0.175           0.175 · 0.7 = 0.1225            T               0.389507
T   F   F   T   0.240           0.240 · 0.8 = 0.1920            T               0.610493
T   F   T   F                   0.175 · 0.3 = 0.0525            F               0.522388
T   F   T   T                   0.240 · 0.2 = 0.0480            F               0.477612
T   T   F   F   0.175           0.175 · 0.7 = 0.1225            T               0.489022
T   T   F   T   0.160           0.160 · 0.8 = 0.1280            T               0.510978
T   T   T   F                   0.175 · 0.3 = 0.0525            F               0.621302
T   T   T   T                   0.160 · 0.2 = 0.0320            F               0.378698
The new error rate is 0.060 + 0.040 + 0.175 + 0.175 = 0.45 > 0.435 (see question b). The Naive Bayes classifier's performance drops because the conditional independence assumption does not hold for the correlated features Sunny and Rainy.
Computing the sample complexity of the Naive Bayes and Joint Bayes classifiers
(CMU, 2009 spring, Eric Xing, Tom Mitchell, Aarti Singh, HW2, pr. 1.1)
A big reason we use Naive Bayes classifiers is that they require less training data than full (joint) Bayes classifiers. This problem should give you a “feel” for how great the disparity really is.
Imagine that each instance is an independent observation of the multivariate random variable X̄ = ⟨X1, ..., Xd⟩, where the Xi are i.i.d. Bernoulli with parameter p = 0.5. To train a full Bayes classifier, we need to see every value of X̄ “enough” times; training a Naive Bayes classifier only requires seeing both values of each Xi “enough” times.
Question: How many observations are needed until, with probability 1 − ε, we have seen every value we need to see at least once?
Note: To train the classifier well would require more than this, but for this problem we only require one observation.
Hint: You may want to use the following inequalities: Pr{∪_{i=1}^n Ei} ≤ Σ_{i=1}^n Pr{Ei} (the “union bound”) and (1 − 1/n)^n ≤ 1/e.
Consider the Naive Bayes classifier. Show that, after N observations, the probability that a particular variable Xi has not been seen in both of its states (0 and 1) is ≤ 1/2^{N−1}. Then show that if N_NB ≥ 1 + log2(d/ε), the probability that any Xi has not been observed in both states is ≤ ε.
Answer:
P(not all values of Xi have been seen in N observations) = (1/2)^N + (1/2)^N = 2/2^N = 1/2^{N−1}
P(not all values of all the variables Xi, i = 1, ..., d, have been seen in N_NB observations)
  ≤ Σ_{i=1}^d P(not all values of Xi have been seen in N_NB observations)
  = Σ_{i=1}^d 1/2^{N_NB − 1} = d · 1/2^{N_NB − 1} = d · 1/2^{1 + log2(d/ε) − 1} = d · 1/2^{log2(d/ε)} = d · ε/d = ε
Consider the Joint (full) Bayes classifier. Let x̄ be a particular value of X̄. Show that the probability that, after N observations, we have never seen x̄ is ≤ e^{−N/2^d}. Then show that if N_JB ≥ 2^d ln(2^d/ε), the probability that some value of X̄ has not been seen is ≤ ε.
Answer:
P(the instance x̄ has not been seen in N observations) = (1 − 1/2^d)^N = [(1 − 1/2^d)^{2^d}]^{N/2^d} ≤ (1/e)^{N/2^d} = e^{−N/2^d}
P(not all values of X̄ have been seen in N_JB observations)
  ≤ Σ_{x̄} P(x̄ has not been seen in N_JB observations)
  = Σ_{x̄} e^{−N_JB/2^d} = 2^d · e^{−N_JB/2^d} = 2^d · e^{−ln(2^d/ε)} = 2^d · 1/(2^d/ε) = ε
For ε = 0.1, how many observations do the two classifiers need when d = 2? What about d = 5? And d = 10?
Answer:
ε = 0.1, d = 2  ⇒ N_NB = 1 + log2(2/0.1) = 1 + log2 20 ≈ 5.32;    N_JB = 2² · ln(2²/0.1) = 4 · ln 40 ≈ 14.75
ε = 0.1, d = 5  ⇒ N_NB = 1 + log2(5/0.1) = 1 + log2 50 ≈ 6.64;    N_JB = 2⁵ · ln(2⁵/0.1) = 32 · ln 320 ≈ 184.58
ε = 0.1, d = 10 ⇒ N_NB = 1 + log2(10/0.1) = 1 + log2 100 ≈ 7.64;  N_JB = 2¹⁰ · ln(2¹⁰/0.1) = 1024 · ln 10240 ≈ 9455.67
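A quick sketch that reproduces these numbers (up to rounding):

import math

def n_naive_bayes(d, eps):
    # enough observations to see both values of every X_i with probability >= 1 - eps
    return 1 + math.log2(d / eps)

def n_joint_bayes(d, eps):
    # enough observations to see every joint value of (X_1, ..., X_d) with probability >= 1 - eps
    return 2 ** d * math.log(2 ** d / eps)

for d in (2, 5, 10):
    print(d, round(n_naive_bayes(d, 0.1), 2), round(n_joint_bayes(d, 0.1), 2))
# prints values matching the ones above (up to rounding)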
Exemplifying ML hypotheses and MAP hypotheses
(CMU, 2009 spring, Tom Mitchell, midterm, pr. 2)
[Figure: a one-dimensional data set over the real-valued attribute X, with five points: X = 1.5, 2 and 3.75 labeled Y = 1, and X = 3 and 3.5 labeled Y = 0.]
Let's consider the 1-dimensional data set shown above, based on the single real-valued attribute X. Notice there are two classes (values of Y), and five data points.
Consider a new class of decision trees where leaves have probabilistic labels: each leaf node gives the probability of each possible label, where the probability is the fraction of points at that leaf node with that label. For example, a decision tree learned from the data set above with zero splits would say P(Y = 1) = 3/5 and P(Y = 0) = 2/5. A decision tree with one split (at X = 2.5) would say P(Y = 1) = 1 if X < 2.5, and P(Y = 1) = 1/3 if X ≥ 2.5.
Question: Give the ML tree, i.e. the tree that maximizes the likelihood of the data:
T_ML = argmax_T P_T(D), where P_T(D) =def P(D|T) = (by i.i.d.) ∏_{i=1}^5 P(Y = yi|X = xi, T),
with yi the label/class of the instance xi ∈ {1.5, 2, 3, 3.5, 3.75}.
Answer:
The ML tree has two splits, at X > 2.5 and then at X > 3.625:
if X ≤ 2.5: P(Y = 1) = 1;
if 2.5 < X ≤ 3.625: P(Y = 1) = 0;
if X > 3.625: P(Y = 1) = 1.
This tree assigns probability 1 to every training label, so its likelihood is 1 (maximal).
Question: Now consider the prior over trees P(T) ∝ (1/4)^{splits(T)²}, where splits(T) is the number of splits in T, and ∝ means “is proportional to”. For the same data set, give the MAP tree when using this prior P(T) over trees.
Answer:
0 splits: P(T0|D) ∝ (3/5)³ · (2/5)² · (1/4)⁰ = 108/3125 ≈ 0.0346
1 split:  P(T1|D) ∝ 1² · (2/3)² · (1/3) · (1/4)¹ = 1/27 ≈ 0.037
2 splits: P(T2) ∝ (1/4)⁴ ⇒ P(T2|D) ∝ 1 · (1/4)⁴ = 1/256 ≈ 0.0039
The MAP tree is therefore the one-split tree (at X > 2.5): P(Y = 1) = 1 if X ≤ 2.5, and P(Y = 1) = 1/3 if X > 2.5.
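A quick check of the three (unnormalized) posterior scores computed above:

def posterior_score(leaf_probs, splits):
    # Unnormalized P(T|D) = P(D|T) * P(T), with P(T) proportional to (1/4) ** (splits ** 2).
    # leaf_probs lists, for each training point, the probability its leaf assigns to its true label.
    score = (1 / 4) ** (splits ** 2)
    for p in leaf_probs:
        score *= p
    return score

t0 = posterior_score([3/5, 3/5, 3/5, 2/5, 2/5], splits=0)   # zero-split tree
t1 = posterior_score([1, 1, 1/3, 2/3, 2/3],     splits=1)   # one split, at X > 2.5
t2 = posterior_score([1, 1, 1, 1, 1],           splits=2)   # the ML tree (two splits)
print(round(t0, 4), round(t1, 4), round(t2, 4))             # 0.0346 0.037 0.0039 -> MAP: one split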