SLIDE 1

Bayesian Learning

Based on “Machine Learning”, T. Mitchell, McGraw Hill, 1997, ch. 6. Acknowledgement: the present slides are an adaptation of slides prepared by T. Mitchell.

SLIDE 2

Two Roles for Bayesian Methods in Learning

  • 1. They provide practical learning algorithms, by combining prior knowledge/probabilities with the observed data:
  • the Naive Bayes learning algorithm
  • the Expectation Maximization (EM) learning algorithm (scheme): learning in the presence of unobserved variables
  • Bayesian Belief Network learning
  • 2. They provide a useful conceptual framework:
  • a standard for evaluating other learning algorithms, e.g. concept learning through the general-to-specific ordering of hypotheses (Find-S, Candidate-Elimination), neural networks, linear regression
  • additional insight into Occam’s razor

SLIDE 3

PLAN

  • 1. Basic Notions
  Bayes’ theorem. Defining classes of hypotheses: Maximum A Posteriori (MAP) hypotheses, Maximum Likelihood (ML) hypotheses
  • 2. Learning MAP Hypotheses
  2.1 The brute-force MAP hypothesis learning algorithm
  2.2 The Bayes optimal classifier
  2.3 The Gibbs classifier
  2.4 The Naive Bayes learner. Example: learning over text data
  2.5 The Minimum Description Length (MDL) principle; MDL hypotheses
  • 3. Learning ML Hypotheses
  3.1 ML hypotheses in learning real-valued functions
  3.2 ML hypotheses in learning to predict probabilities
  3.3 The Expectation Maximization (EM) algorithm
  • 4. Bayesian Belief Networks

SLIDE 4

1 Basic Notions

  • Product rule: the probability of a conjunction of two events A and B is
  P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)
  • Bayes’ theorem:
  P(A|B) = P(B|A)P(A) / P(B)
  • Theorem of total probability: if the events A1, . . . , An are mutually exclusive, with Σ_{i=1}^{n} P(Ai) = 1, then
  P(B) = Σ_{i=1}^{n} P(B|Ai)P(Ai)
  and in particular P(B) = P(B|A)P(A) + P(B|¬A)P(¬A)
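As a quick sanity check of the last two formulas, here is a minimal Python sketch; the probability values are arbitrary illustration numbers, not taken from the slides:

```python
# Arbitrary illustration numbers (not from the slides).
p_A = 0.3                # P(A)
p_B_given_A = 0.9        # P(B|A)
p_B_given_not_A = 0.2    # P(B|¬A)

# Total probability: P(B) = P(B|A)P(A) + P(B|¬A)P(¬A)
p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)

# Bayes' theorem: P(A|B) = P(B|A)P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(f"P(B) = {p_B:.3f}, P(A|B) = {p_A_given_B:.3f}")   # 0.410, 0.659
```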

SLIDE 5

Using Bayes’ Theorem for Hypothesis Learning

P(h|D) = P(D|h)P(h) / P(D)

  • P(D) = the (prior) probability of the training data D
  • P(h) = the (prior) probability of the hypothesis h
  • P(D|h) = the probability of observing D given h (the likelihood)
  • P(h|D) = the (posterior) probability of h given D

SLIDE 6

Classes of Hypotheses

Maximum Likelihood (ML) hypothesis: the hypothesis that best explains the training data,
  hML = argmax_{hi∈H} P(D|hi)

Maximum A Posteriori (MAP) hypothesis: the most probable hypothesis given the training data,
  hMAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h)P(h)/P(D) = argmax_{h∈H} P(D|h)P(h)

Note: If P(hi) = P(hj) for all i, j, then hMAP = hML.

SLIDE 7

Exemplifying MAP Hypotheses

Suppose the following data characterize the lab results for cancer-suspect patients:
  P(cancer) = 0.008        P(¬cancer) = 0.992
  P(+|cancer) = 0.98       P(−|cancer) = 0.02
  P(+|¬cancer) = 0.03      P(−|¬cancer) = 0.97
The hypotheses are h1 = cancer and h2 = ¬cancer; the possible observations are D ∈ {+, −}.

Question: Should we diagnose a patient x whose lab result is positive as having cancer?
Answer: No. Indeed, we have to find argmax{P(cancer|+), P(¬cancer|+)}. Applying Bayes’ theorem (for D = {+}):
  P(+|cancer)P(cancer) = 0.98 × 0.008 = 0.0078
  P(+|¬cancer)P(¬cancer) = 0.03 × 0.992 = 0.0298
  ⇒ hMAP = ¬cancer
(We can further infer P(cancer|+) = 0.0078/(0.0078 + 0.0298) ≈ 21%.)
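The same computation as a minimal Python sketch (the variable names are ours, not part of the original slides):

```python
# The cancer diagnosis example, as an argmax over unnormalized posteriors.
priors = {"cancer": 0.008, "no_cancer": 0.992}
p_pos = {"cancer": 0.98, "no_cancer": 0.03}        # P(+|h)

scores = {h: p_pos[h] * priors[h] for h in priors} # P(+|h)P(h); P(+) cancels
h_map = max(scores, key=scores.get)

p_cancer = scores["cancer"] / sum(scores.values()) # normalizing gives P(cancer|+)
print(h_map, round(p_cancer, 3))                   # no_cancer 0.207
```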

SLIDE 8

2 Learning MAP Hypotheses

2.1 The Brute-Force MAP Hypothesis Learning Algorithm

Training: choose the hypothesis with the highest posterior probability,
  hMAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h)P(h)
Testing: given x, compute hMAP(x).
Drawback: this requires computing all the probabilities P(D|h) and P(h).

SLIDE 9

2.2 The Bayes Optimal Classifier:
The Most Probable Classification of New Instances

So far we have sought hMAP, the most probable hypothesis given the data D.
Question: Given a new instance x, whose classification can take any value vj in some set V, what is its most probable classification?
Answer:
  P(vj|D) = Σ_{hi∈H} P(vj|hi)P(hi|D)
Therefore, the Bayes optimal classification of x is
  argmax_{vj∈V} Σ_{hi∈H} P(vj|hi)P(hi|D)
Remark: hMAP(x) is not necessarily the most probable classification of x! (See the next example.)

SLIDE 10

The Bayes Optimal Classifier: An Example

Let us consider three possible hypotheses, with posteriors
  P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3
Obviously, hMAP = h1. Now consider an instance x such that
  h1(x) = +, h2(x) = −, h3(x) = −
Question: What is the most probable classification of x?
Answer: Since
  P(−|h1) = 0, P(+|h1) = 1
  P(−|h2) = 1, P(+|h2) = 0
  P(−|h3) = 1, P(+|h3) = 0
we get
  Σ_{hi∈H} P(+|hi)P(hi|D) = 0.4   and   Σ_{hi∈H} P(−|hi)P(hi|D) = 0.6
therefore
  argmax_{vj∈V} Σ_{hi∈H} P(vj|hi)P(hi|D) = −
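A minimal Python sketch of the Bayes optimal computation above:

```python
# Bayes optimal classification: weight each hypothesis' vote by P(h|D).
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}    # P(h|D)
prediction = {"h1": "+", "h2": "-", "h3": "-"}   # h(x)

p_v = {}
for h, v in prediction.items():
    p_v[v] = p_v.get(v, 0.0) + posterior[h]      # P(v|D) = Σ_h P(v|h)P(h|D)

print(max(p_v, key=p_v.get), p_v)                # '-' {'+': 0.4, '-': 0.6}
```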

SLIDE 11

2.3 Gibbs Classifier

[Opper and Haussler, 1991]

Note: The Bayes optimal classifier provides the best result, but it can be expensive to compute if there are many hypotheses.

Gibbs algorithm:

  • 1. Choose one hypothesis at random, according to P(h|D)
  • 2. Use it to classify the new instance

Surprising fact [Haussler et al., 1994]: If the target concept is selected randomly according to the P(h|D) distribution, then the expected error of the Gibbs classifier is no worse than twice the expected error of the Bayes optimal classifier:
  E[errorGibbs] ≤ 2E[errorBayesOptimal]
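For contrast with the Bayes optimal classifier, here is a minimal sketch of the Gibbs classifier on the same example (the sampling seed is an arbitrary choice of ours):

```python
import random

def gibbs_classify(posterior, prediction, rng=random.Random(0)):
    hyps = list(posterior)
    # 1. Draw a single hypothesis h according to P(h|D)...
    h = rng.choices(hyps, weights=[posterior[hh] for hh in hyps])[0]
    # 2. ...and classify the new instance using that one hypothesis.
    return prediction[h]

print(gibbs_classify({"h1": 0.4, "h2": 0.3, "h3": 0.3},
                     {"h1": "+", "h2": "-", "h3": "-"}))
```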

SLIDE 12

2.4 The Naive Bayes Classifier

When to use it:

  • The target function f takes values from a finite set V = {v1, . . . , vk}
  • A moderate or large training data set is available
  • The attributes ⟨a1, . . . , an⟩ that describe the instances are conditionally independent given the classification:
  P(a1, a2, . . . , an|vj) = Π_i P(ai|vj)

The most probable value of f(x) is:
  vMAP = argmax_{vj∈V} P(vj|a1, a2, . . . , an)
       = argmax_{vj∈V} P(a1, a2, . . . , an|vj)P(vj) / P(a1, a2, . . . , an)
       = argmax_{vj∈V} P(a1, a2, . . . , an|vj)P(vj)
       = argmax_{vj∈V} P(vj) Π_i P(ai|vj)

SLIDE 13

The Naive Bayes Classifier: Remarks

  • 1. Along with decision trees, neural networks and k-nearest neighbours, the Naive Bayes classifier is one of the most practical learning methods.
  • 2. Compared to the previously presented learning algorithms, the Naive Bayes classifier performs no search through the hypothesis space; the output hypothesis is simply formed by estimating the parameters P(vj) and P(ai|vj).

SLIDE 14

The Naive Bayes Classification Algorithm

Naive_Bayes_Learn(examples)
  for each target value vj:
    P̂(vj) ← estimate P(vj)
    for each attribute value ai of each attribute a:
      P̂(ai|vj) ← estimate P(ai|vj)

Classify_New_Instance(x)
  vNB = argmax_{vj∈V} P̂(vj) Π_{ai∈x} P̂(ai|vj)
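Below is a minimal runnable Python sketch of these two procedures, assuming discrete attributes and plain maximum-likelihood estimates (the unseen-value problem is addressed a few slides later):

```python
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    # examples: list of (attribute_dict, target_value) pairs
    class_counts = Counter(v for _, v in examples)
    p_v = {v: n / len(examples) for v, n in class_counts.items()}
    counts = defaultdict(Counter)                    # counts[(attr, v)][value]
    for x, v in examples:
        for a, ai in x.items():
            counts[(a, v)][ai] += 1
    p_ai_v = {(a, ai, v): counts[(a, v)][ai] / class_counts[v]
              for (a, v) in counts for ai in counts[(a, v)]}
    return p_v, p_ai_v

def classify_new_instance(x, p_v, p_ai_v):
    def score(v):
        s = p_v[v]
        for a, ai in x.items():
            s *= p_ai_v.get((a, ai, v), 0.0)         # unseen value -> 0 (see slide 17)
        return s
    return max(p_v, key=score)
```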

SLIDE 15

The Naive Bayes Classifier: An Example

Consider again the PlayTennis example, and the new instance
  ⟨Outlook = sun, Temp = cool, Humidity = high, Wind = strong⟩
We compute vNB = argmax_{vj∈V} P(vj) Π_i P(ai|vj):
  P(yes) = 9/14 = 0.64          P(no) = 5/14 = 0.36
  . . .
  P(strong|yes) = 3/9 = 0.33    P(strong|no) = 3/5 = 0.60

  P(yes) P(sun|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.0053
  P(no) P(sun|no) P(cool|no) P(high|no) P(strong|no) = 0.0206
  → vNB = no

SLIDE 16

A Note on the Conditional Independence Assumption of Attributes

P(a1, a2, . . . , an|vj) = Π_i P(ai|vj)

It is often violated in practice... but it works surprisingly well anyway. Note that we don’t need the estimated posteriors P̂(vj|x) to be correct; we only need that
  argmax_{vj∈V} P̂(vj) Π_i P̂(ai|vj) = argmax_{vj∈V} P(vj)P(a1, . . . , an|vj)
This phenomenon is analysed in [Domingos & Pazzani, 1996].

SLIDE 17

Naive Bayes Classification: The Problem of Unseen Data

What if none of the training instances with target value vj has the attribute value ai? Then P̂(ai|vj) = 0, and consequently P̂(vj) Π_i P̂(ai|vj) = 0.
The typical solution is to (re)define P̂(ai|vj), for each value ai of the attribute a and each vj, using the m-estimate:

  P̂(ai|vj) ← (nc + mp) / (n + m), where

  • n is the number of training examples for which v = vj,
  • nc is the number of examples for which v = vj and a = ai,
  • p is a prior estimate for P̂(ai|vj) (for instance, if the attribute a has k values, then p = 1/k),
  • m is the weight given to that prior estimate (i.e., the number of “virtual” examples).
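The smoothing formula translates directly to code; a minimal sketch:

```python
# Minimal sketch of the m-estimate above.
def m_estimate(n_c, n, p, m):
    """P̂(ai|vj) = (nc + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# E.g., an attribute value never seen with class vj (nc = 0), an attribute
# with k = 3 values (p = 1/3), m = 3 virtual examples, n = 10 examples:
print(m_estimate(0, 10, 1/3, 3))   # 0.0769... instead of 0
```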

SLIDE 18

Using the Naive Bayes Learner: Learning to Classify Text

  • Learn which news articles are of interest:
  target concept Interesting? : Document → {+, −}
  • Learn to classify web pages by topic:
  target concept Category : Document → {c1, . . . , cn}

Naive Bayes is among the most effective algorithms for such tasks.

SLIDE 19

Learning to Classify Text: Main Design Issues

  • 1. Representation: represent each document by a vector of words, with one attribute per word position in the document.
  • 2. Learning: use the training examples to estimate P(+), P(−), P(doc|+), P(doc|−).
  Naive Bayes conditional independence assumption:
    P(doc|vj) = Π_{i=1}^{length(doc)} P(ai = wk|vj)
  where P(ai = wk|vj) is the probability that the word in position i is wk, given vj.
  • Make one more assumption:
    ∀i, m: P(ai = wk|vj) = P(am = wk|vj) = P(wk|vj)
  i.e., the attributes are not only independent but also identically distributed.

SLIDE 20

Learn_naive_Bayes_text(Examples, Vocabulary)

  • 1. Collect all words and other tokens that occur in Examples:
  Vocabulary ← all distinct words and other tokens in Examples
  • 2. Calculate the required P(vj) and P(wk|vj) probability terms:
  for each target value vj in V:
    docsj ← the subset of Examples for which the target value is vj
    P(vj) ← |docsj| / |Examples|
    Textj ← a single document created by concatenating all the members of docsj
    n ← the total number of words in Textj
    for each word wk in Vocabulary:
      nk ← the number of times the word wk occurs in Textj
      P(wk|vj) ← (nk + 1) / (n + |Vocabulary|)
(here we use the m-estimate, with uniform priors p = 1/|Vocabulary| and m = |Vocabulary|, i.e., Laplace smoothing)
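A minimal Python sketch of Learn_naive_Bayes_text, assuming each example is a (token_list, target) pair:

```python
from collections import Counter

def learn_naive_bayes_text(examples):
    # examples: list of (list_of_tokens, target_value) pairs
    vocabulary = {w for doc, _ in examples for w in doc}
    p_v, p_w_v = {}, {}
    for v in {t for _, t in examples}:
        docs_v = [doc for doc, t in examples if t == v]
        p_v[v] = len(docs_v) / len(examples)
        text_v = [w for doc in docs_v for w in doc]      # concatenation
        counts, n = Counter(text_v), len(text_v)
        for w in vocabulary:                             # add-one smoothing
            p_w_v[(w, v)] = (counts[w] + 1) / (n + len(vocabulary))
    return vocabulary, p_v, p_w_v
```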

SLIDE 21

Classify_naive_Bayes_text(Doc)

positions ← all word positions in Doc that contain tokens from Vocabulary
Return vNB = argmax_{vj∈V} P(vj) Π_{i∈positions} P(ai = wk|vj)
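And a matching sketch of Classify_naive_Bayes_text; working in log-probabilities is our own addition, to avoid numeric underflow on long documents:

```python
import math

def classify_naive_bayes_text(doc, vocabulary, p_v, p_w_v):
    def log_score(v):
        s = math.log(p_v[v])
        for w in doc:               # only positions whose token is in Vocabulary
            if w in vocabulary:
                s += math.log(p_w_v[(w, v)])
        return s
    return max(p_v, key=log_score)
```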

SLIDE 22

Application: Learning to Classify Usenet News Articles

Given 1000 training documents from each of the 20 newsgroups, learn to classify new documents according to the newsgroup they came from:

  comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, alt.atheism, soc.religion.christian, talk.religion.misc, talk.politics.mideast, talk.politics.misc, talk.politics.guns, sci.space, sci.crypt, sci.electronics, sci.med

Naive Bayes: 89% classification accuracy (using 2/3 of each group for training, after eliminating rare words and the 100 most frequent words).

SLIDE 23

Learning Curve for 20 Newsgroups

[Figure: learning curve of classification accuracy vs. training-set size (100 to 10000 documents, log scale) on the 20 Newsgroups data, for the Bayes, TFIDF and PRTFIDF classifiers.]

SLIDE 24

2.5 The Minimum Description Length Principle

Occam’s razor: prefer the shortest hypothesis.
Bayes analysis: prefer the MAP hypothesis,
  hMAP = argmax_{h∈H} P(D|h)P(h)
       = argmax_{h∈H} (log2 P(D|h) + log2 P(h))
       = argmin_{h∈H} (− log2 P(D|h) − log2 P(h))
An interesting fact from information theory: the optimal (shortest expected length) code assigns − log2 p bits to an event with probability p. So we can interpret:
  − log2 P(h): the length of h under the optimal code
  − log2 P(D|h): the length of D given h under the optimal code
Therefore we prefer the hypothesis h that minimizes the sum of these two description lengths (formalized on the next slide).

SLIDE 25

Bayes Analysis and the MDL Principle

We saw that a MAP learner prefers the hypothesis h that minimizes LC1(h) + LC2(D|h), where LC(x) is the description length of x under encoding C:
  hMDL = argmin_{h∈H} (LC1(h) + LC2(D|h))
Example: H = decision trees, D = training data labels
  • LC1(h) is the number of bits needed to describe the tree h
  • LC2(D|h) is the number of bits needed to describe D given h
In the literature, applications of MDL to practical problems often include arguments justifying the choice of the encodings C1 and C2.

SLIDE 26

For instance: LC2(D|h) = 0 if the examples are classified perfectly by h and both the transmitter and the receiver know h; in that situation we need only describe the exceptions. So:
  hMDL = argmin_{h∈H} (length(h) + length(misclassifications))
In general, MDL trades off hypothesis size for training errors: it might select a shorter hypothesis that makes a few errors over a longer hypothesis that perfectly classifies the data!
Consequence: in learning decision trees, for instance, the MDL principle can work as an alternative to pruning.

SLIDE 27

The MDL Principle: Back to Occam’s Razor

MDL hypotheses are not necessarily also the best/MAP ones. (For that, we would have to know all the probabilities P(D|h) and P(h).)

SLIDE 28

3 Learning Maximum Likelihood (ML) Hypotheses

3.1 Learning Real-Valued Functions: ML Hypotheses as Least-Squared-Error Hypotheses

[Figure: a real-valued target function f, an approximating hypothesis hML, and noisy training examples (x, y) scattered around f.]

Problem: Consider learning a real-valued target function f : X → R from D, a training set consisting of examples ⟨xi, di⟩, i = 1, . . . , m, where
  • the xi are assumed fixed (to simplify the analysis),
  • di is a noisy training value: di = f(xi) + ei,
  • ei is a random variable (noise) drawn independently for each xi, according to a Gaussian distribution with zero mean.

SLIDE 29

Proposition

Considering H, a class of functions h : X → R such that h(xi) = f(xi), and assuming that the examples are mutually independent given h, the maximum likelihood hypothesis hML is the one that minimizes the sum of squared errors:

  hML =def argmax_{h∈H} P(D|h) = argmin_{h∈H} Σ_{i=1}^{m} (di − h(xi))²

SLIDE 30

Proof

Note: We will use the probability density function
  p(x0) =def lim_{ε→0} (1/ε) P(x0 ≤ x < x0 + ε)

hML = argmax_{h∈H} P(D|h) = argmax_{h∈H} Π_{i=1}^{m} p(di|h)
    = argmax_{h∈H} Π_{i=1}^{m} p(ei|h)                                  (with μi = f(xi))
    = argmax_{h∈H} Π_{i=1}^{m} p(di − f(xi)|h)
    = argmax_{h∈H} Π_{i=1}^{m} p(di − h(xi)|h)                          (using h(xi) = f(xi))
    = argmax_{h∈H} Π_{i=1}^{m} (1/√(2πσ²)) e^{−(1/2)((di − h(xi))/σ)²}
    = argmax_{h∈H} Σ_{i=1}^{m} [ln(1/√(2πσ²)) − (1/2)((di − h(xi))/σ)²]
    = argmax_{h∈H} Σ_{i=1}^{m} −(1/2)((di − h(xi))/σ)²
    = argmax_{h∈H} Σ_{i=1}^{m} −(di − h(xi))²
    = argmin_{h∈H} Σ_{i=1}^{m} (di − h(xi))²
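A small numeric illustration of the proposition on synthetic data (our own construction): for Gaussian noise, the least-squares line fit is exactly the likelihood maximizer, so no small perturbation of its parameters can lower the squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
d = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=x.size)  # di = f(xi) + ei

w1, w0 = np.polyfit(x, d, deg=1)                       # least-squares line

def sse(a, b):                                         # Σ (di - h(xi))²
    return np.sum((d - (a * x + b)) ** 2)

# The ML / least-squares fit is a minimum of the sum of squared errors:
assert all(sse(w1, w0) <= sse(w1 + da, w0 + db)
           for da in (-0.01, 0.01) for db in (-0.01, 0.01))
print(round(w1, 2), round(w0, 2))                      # close to 2.0 and 1.0
```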

SLIDE 31

Generalisations...

  • 1. Similar derivations can be performed starting from other assumed noise distributions (than the Gaussian), producing different results.
  • 2. It was assumed that
  • a. the noise affects only f(xi), and
  • b. no noise was recorded in the attribute values of the given examples xi.
  Otherwise, the analysis becomes significantly more complex.

SLIDE 32

3.2 ML Hypotheses for Learning Probability Functions

Let us consider a non-deterministic function (i.e., a one-to-many relation) f : X → {0, 1}. Given a set of independently drawn examples D = {⟨x1, d1⟩, . . . , ⟨xm, dm⟩} where di = f(xi) ∈ {0, 1}, we would like to learn an ML hypothesis for the probability function g(x) =def P(f(x) = 1).
For example, h(xi) = 0.92 if P(f(xi) = 1) = 0.92.

Proposition: In this setting, hML = argmax_{h∈H} P(D|h) maximizes the sum
  Σ_{i=1}^{m} [di ln h(xi) + (1 − di) ln (1 − h(xi))]

Proof: P(D|h) = Π_{i=1}^{m} P(xi, di|h) = Π_{i=1}^{m} P(di|xi, h) · P(xi|h)
It can be assumed that xi is independent of h, therefore:
  P(D|h) = Π_{i=1}^{m} P(di|xi, h) · P(xi)

SLIDE 33

Proof (continued): What we want to compute is h(xi) = P(di = 1|xi, h). In a more general form:
  P(di|xi, h) = h(xi)       if di = 1
                1 − h(xi)   if di = 0
or, in a more convenient mathematical form:
  P(di|xi, h) = h(xi)^di (1 − h(xi))^(1−di)
⇒ hML = argmax_{h∈H} Π_{i=1}^{m} [h(xi)^di (1 − h(xi))^(1−di) P(xi)]
      = argmax_{h∈H} Π_{i=1}^{m} h(xi)^di (1 − h(xi))^(1−di) · Π_{i=1}^{m} P(xi)
      = argmax_{h∈H} Π_{i=1}^{m} h(xi)^di (1 − h(xi))^(1−di)
      = argmax_{h∈H} Σ_{i=1}^{m} [di ln h(xi) + (1 − di) ln (1 − h(xi))]
Note: The quantity −Σ_{i=1}^{m} [di ln h(xi) + (1 − di) ln (1 − h(xi))] is called cross-entropy; the above hML therefore minimizes this quantity.
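The cross-entropy quantity translates directly to code; a minimal sketch:

```python
import numpy as np

# Cross-entropy for labels di in {0, 1} and predictions h(xi) in (0, 1);
# the ML hypothesis minimizes this quantity.
def cross_entropy(d, h):
    d, h = np.asarray(d, float), np.asarray(h, float)
    return -np.sum(d * np.log(h) + (1 - d) * np.log(1 - h))

print(cross_entropy([1, 0, 1], [0.9, 0.2, 0.8]))   # lower is better
```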

SLIDE 34

3.3 The Expectation Maximization (EM) Algorithm

[Dempster et al., 1977]

Find (locally) maximum likelihood hypotheses when the data is only partially observable:

  • Unsupervised learning (i.e., clustering): the target value is unobservable
  • Supervised learning: some instance attributes are unobservable

Some applications:

  • Non-hierarchical clustering: estimating the means of k Gaussians
  • Learning Hidden Markov Models
  • Learning Probabilistic Context-Free Grammars
  • Training Radial Basis Function networks
  • Training Bayesian Belief Networks

SLIDE 35

The General EM Problem

Given:

  • observed data X = {x1, . . . , xm}, independently generated from a parameterized distribution/hypothesis h,
  • unobserved data Z = {z1, . . . , zm},

determine ĥ, the hypothesis that (locally) maximizes P(Y|h), where Y = {y1, . . . , ym} is the full data, yi = xi ∪ zi.

SLIDE 36

The Essence of the EM Approach

Start with h0, an arbitrarily/conveniently chosen value of h. Then repeatedly:

  • 1. Use the observed data X and the current hypothesis ht to estimate [the probabilities associated with the values of] the unobserved variables Z, and compute their expectations E[Z].
  • 2. Use the expected values of the unobserved variables to calculate an improved hypothesis ht+1, by maximizing the expected log-likelihood E[ln P(Y|h)|X, ht].

SLIDE 37

The General EM Algorithm

Repeat the following two steps until convergence is reached:

Estimation (E) step: calculate the expected log-likelihood function
  Q(h|ht) =not E[ln P(Y|h)|X, ht], where Y = X ∪ Z.
Maximization (M) step: replace the hypothesis ht by the hypothesis ht+1 that maximizes this Q function:
  ht+1 ← argmax_h Q(h|ht)
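A minimal EM sketch for the classic special case listed on the EM introduction slide (estimating the means of k = 2 Gaussians, here with known equal variances and equal mixing weights); the data set and starting means are arbitrary illustration choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(4.0, 1.0, 100)])
mu = np.array([-1.0, 1.0])                      # initial hypothesis h0

for _ in range(50):
    # E step: responsibilities E[zij] = P(xi came from Gaussian j | x, mu)
    dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M step: the means maximizing E[ln P(Y|h)] are responsibility-weighted averages
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(np.round(mu, 2))                          # approx. [0., 4.]
```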

SLIDE 38

[Figure: the EM loop. From the current hypothesis ht and the observed data X, compute E[Z|X, ht]; then obtain ht+1 = argmax_h E[ln P(Y|h)], and repeat, climbing the likelihood P(Y|X, h).]

SLIDE 39

Baum-Welch Theorem

When Q is continuous, it can be shown that EM converges to a stationary point (a local maximum) of the likelihood function P(Y|h).

SLIDE 40

4 Bayesian Belief Networks

(also called Bayes Nets) Interesting because:

  • The Naive Bayes assumption of conditional independence of the attributes is too restrictive. (But the problem is intractable without some such assumptions...)
  • Bayesian belief networks describe conditional independence among subsets of variables.
  • They allow combining prior knowledge about (in)dependencies among variables with the observed training data.

SLIDE 41

Conditional Independence

Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given a value of Z:
  (∀xi, yj, zk) P(X = xi|Y = yj, Z = zk) = P(X = xi|Z = zk)
More compactly, we write P(X|Y, Z) = P(X|Z).
Note: Naive Bayes uses conditional independence to justify
  P(A1, A2|V) = P(A1|A2, V)P(A2|V) = P(A1|V)P(A2|V)
Generalizing the above definition:
  P(X1 . . . Xl|Y1 . . . Ym, Z1 . . . Zn) = P(X1 . . . Xl|Z1 . . . Zn)

SLIDE 42

A Bayes Net

[Figure: a directed acyclic graph with the nodes Storm, BusTourGroup, Lightning, Campfire, Thunder, ForestFire; Storm and BusTourGroup point to Campfire, Storm points to Lightning, Lightning points to Thunder, and Storm, Lightning and Campfire point to ForestFire. The conditional probability table for Campfire (S = Storm, B = BusTourGroup):]

         S,B    S,¬B   ¬S,B   ¬S,¬B
   C     0.4    0.1    0.8    0.2
  ¬C     0.6    0.9    0.2    0.8

The network is defined by:

  • A directed acyclic graph representing a set of conditional independence assertions: each node (a random variable) is asserted to be conditionally independent of its nondescendants, given its immediate predecessors.
  Example: P(Thunder|ForestFire, Lightning) = P(Thunder|Lightning)
  • A table of local conditional probabilities for each node/variable.

SLIDE 43

A Bayes Net (Cont’d)

represents the joint probability distribution over all the variables Y1, Y2, . . . , Yn. This joint distribution is fully defined by the graph plus the local conditional probabilities:
  P(y1, . . . , yn) = P(Y1 = y1, . . . , Yn = yn) = Π_{i=1}^{n} P(yi|Parents(Yi))
where Parents(Yi) denotes the immediate predecessors of Yi in the graph.
In our example:
  P(Storm, BusTourGroup, . . . , ForestFire) = P(Storm) P(BusTourGroup) P(Lightning|Storm) P(Campfire|Storm, BusTourGroup) P(Thunder|Lightning) P(ForestFire|Storm, Lightning, Campfire)
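A minimal Python sketch of this factorization, for a toy two-node network A → B (the CPT numbers are made up for illustration):

```python
# Factored joint: P(y1,...,yn) = Π_i P(yi | Parents(Yi))
cpt = {
    "A": {(): {True: 0.3, False: 0.7}},          # P(A); no parents
    "B": {(True,): {True: 0.9, False: 0.1},      # P(B|A)
          (False,): {True: 0.2, False: 0.8}},    # P(B|¬A)
}
parents = {"A": (), "B": ("A",)}

def joint(assignment):                           # assignment: name -> bool
    p = 1.0
    for var in cpt:
        pa = tuple(assignment[q] for q in parents[var])
        p *= cpt[var][pa][assignment[var]]
    return p

print(joint({"A": True, "B": True}))             # 0.3 * 0.9 = 0.27
```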

SLIDE 44

Inference in Bayesian Nets

Question: Given a Bayes net, can one infer the probabilities of the values of one or more network variables, given the observed values of (some of) the others?
Example: Given the Bayes net below, compute: (a) P(S); (b) P(A, S); (c) P(A).

[Figure: a network with the nodes L, F, S, A, G; L and F are the parents of S, and S is the parent of A and of G, with the parameters:]
  P(L) = 0.4    P(F) = 0.6
  P(S|L,F) = 0.8    P(S|L,¬F) = 0.6    P(S|¬L,F) = 0.5    P(S|¬L,¬F) = 0.3
  P(A|S) = 0.7    P(A|¬S) = 0.3
  P(G|S) = 0.8    P(G|¬S) = 0.2
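A minimal Python sketch answering (a), (b) and (c) by direct summation over the factored joint:

```python
from itertools import product

p_L, p_F = 0.4, 0.6
p_S = {(True, True): 0.8, (True, False): 0.6,
       (False, True): 0.5, (False, False): 0.3}     # P(S|L,F)
p_A = {True: 0.7, False: 0.3}                       # P(A|S)

def prob(b, p):                                     # P(X=b) given P(X=True)=p
    return p if b else 1 - p

# (a) P(S) = Σ_{L,F} P(S|L,F) P(L) P(F)
P_S = sum(p_S[l, f] * prob(l, p_L) * prob(f, p_F)
          for l, f in product([True, False], repeat=2))
P_AS = p_A[True] * P_S                              # (b) P(A,S) = P(A|S)P(S)
P_A = p_A[True] * P_S + p_A[False] * (1 - P_S)      # (c) total probability
print(P_S, P_AS, P_A)                               # 0.54 0.378 0.516
```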

SLIDE 45

Inference in Bayesian Nets (Cont’d)

Answer(s):

  • If only one variable has an unknown (probability) value, then it is easy to infer it.
  • In the general case, we can compute the probability distribution for any subset of the network variables, given the distribution of any subset of the remaining variables. But...
  • Exact inference of probabilities for an arbitrary Bayes net is an NP-hard problem!

SLIDE 46

Inference in Bayesian Nets (Cont’d)

In practice, we can succeed in many cases:

  • Exact inference methods work well for some network structures.
  • Monte Carlo methods “simulate” the network randomly to compute approximate solutions [Pradhan & Dagum, 1996].
  (In theory, even approximate inference of probabilities in Bayes nets can be NP-hard! [Dagum & Luby, 1993])

SLIDE 47

Learning Bayes Nets (I)

There are several variants of this learning task:

  • The network structure might be either known or unknown (i.e., it may have to be inferred from the training data).
  • The training examples might provide values for all the network variables, or just for some of them.

The simplest case: if the structure is known and we can observe the values of all the variables, then it is easy to estimate the conditional probability table entries (analogous to training a Naive Bayes classifier).

SLIDE 48

Learning Bayes Nets (II)

When

  • the structure of the Bayes net is known, and
  • the variables are only partially observable in the training data,

learning the entries of the conditional probability tables is similar to learning the weights of the hidden units in a neural network:
  − we can learn the net’s conditional probability tables using gradient ascent;
  − we converge to the network h that (locally) maximizes P(D|h).

SLIDE 49

Gradient Ascent for Bayes Nets

Let wijk denote one entry in the conditional probability table for the variable Yi in the network:
  wijk = P(Yi = yij|Parents(Yi) = the list uik of values)
It can be shown (see the next two slides) that
  ∂ ln Ph(D)/∂wijk = Σ_{d∈D} Ph(yij, uik|d) / wijk
Therefore, we can perform gradient ascent by repeatedly:

  • 1. updating all the wijk using the training data D:
  wijk ← wijk + η Σ_{d∈D} Ph(yij, uik|d) / wijk
  • 2. renormalizing the wijk to ensure that Σ_j wijk = 1 and 0 ≤ wijk ≤ 1
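A minimal sketch of one such update on a single CPT row, assuming the posteriors Ph(yij, uik|d) have already been computed by some inference routine (the numbers below are made up):

```python
def update_cpt_row(w, posteriors_per_example, eta=0.05):
    # w[j] = wijk for fixed i, k; posteriors_per_example[d][j] = Ph(yij, uik|d)
    w = [wj + eta * sum(post[j] for post in posteriors_per_example) / wj
         for j, wj in enumerate(w)]
    total = sum(w)                    # renormalize so that Σ_j wijk = 1
    return [wj / total for wj in w]

print(update_cpt_row([0.3, 0.7], [[0.2, 0.5], [0.1, 0.6]]))
```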

SLIDE 50

Gradient Ascent for Bayes Nets: Calculus

∂ ln Ph(D)/∂wijk = ∂/∂wijk ln Π_{d∈D} Ph(d) = Σ_{d∈D} ∂ ln Ph(d)/∂wijk = Σ_{d∈D} (1/Ph(d)) · ∂Ph(d)/∂wijk

Summing over all the values yij′ of Yi and uik′ of Ui = Parents(Yi):

∂ ln Ph(D)/∂wijk = Σ_{d∈D} (1/Ph(d)) · ∂/∂wijk [Σ_{j′,k′} Ph(d|yij′, uik′) Ph(yij′, uik′)]
                 = Σ_{d∈D} (1/Ph(d)) · ∂/∂wijk [Σ_{j′,k′} Ph(d|yij′, uik′) Ph(yij′|uik′) Ph(uik′)]

Note that wijk ≡ Ph(yij|uik), therefore...

SLIDE 51

Gradient Ascent for Bayes Nets: Calculus (Cont’d)

∂ ln Ph(D)/∂wijk = Σ_{d∈D} (1/Ph(d)) · ∂/∂wijk [Ph(d|yij, uik) wijk Ph(uik)]
                 = Σ_{d∈D} (1/Ph(d)) · Ph(d|yij, uik) Ph(uik)
                 = Σ_{d∈D} (1/Ph(d)) · (Ph(yij, uik|d) Ph(d) / Ph(yij, uik)) · Ph(uik)      (applying Bayes’ theorem)
                 = Σ_{d∈D} Ph(yij, uik|d) Ph(uik) / Ph(yij, uik)
                 = Σ_{d∈D} Ph(yij, uik|d) / Ph(yij|uik)
                 = Σ_{d∈D} Ph(yij, uik|d) / wijk

SLIDE 52

Learning Bayes Nets (II, Cont’d)

The EM algorithm can also be used. Repeatedly:

  • 1. Calculate/estimate from the data the probabilities of the unobserved variables, assuming that the current hypothesis h (i.e., the current values wijk) holds.
  • 2. Calculate a new h (i.e., new values wijk) so as to maximize E[ln P(D|h)], where D now includes both the observed and the unobserved variables.

SLIDE 53

Learning Bayes Nets (III)

When the structure is unknown, the algorithms usually use greedy search (adding/subtracting edges and nodes) to trade off network complexity against degree of fit to the data.
Example: the K2 algorithm [Cooper & Herskovits, 1992]: when the data is fully observable, use a score metric to choose among the alternative networks. The authors report an experiment on (re-)learning a network with 37 nodes and 46 arcs describing anesthesia problems in a hospital operating room. Using 3000 examples, the program succeeds almost perfectly: it misses one arc and adds one arc which is not in the original net.

SLIDE 54

Some exercises

SLIDE 55

Exemplifying Text Classification Using the Naive Bayes Algorithm
(CMU, 2009 spring, Ziv Bar-Joseph, midterm, pr. 2)

SLIDE 56

Training data:

  ‘study’  ‘free’  ‘money’  Category  count
     1       –       –      Regular     1
     1       –       –      Regular     1
     1       –       –      Regular     1
     –       1       1      Regular     1
     –       1       –      Spam        4
     –       1       1      Spam        4

Estimating the parameters by MLE, applying Laplace’s rule (“add one”):

  P(study|spam) = (0+1)/(8+2) = 1/10     P(study|regular) = (3+1)/(4+2) = 2/3
  P(free|spam)  = (8+1)/(8+2) = 9/10     P(free|regular)  = (1+1)/(4+2) = 1/3
  P(money|spam) = (4+1)/(8+2) = 1/2      P(money|regular) = (1+1)/(4+2) = 1/3

SLIDE 57

Classification of the message s = “money for psychology study”, using the prior probability P(spam) = 0.1:

P(spam|s) = P(spam|study, ¬free, money)
          =Bayes P(study, ¬free, money|spam) · P(spam) / [P(study, ¬free, money|spam)P(spam) + P(study, ¬free, money|reg)P(reg)]

P(study, ¬free, money|spam) =cond.indep. P(study|spam) · P(¬free|spam) · P(money|spam) = 1/10 · 1/10 · 1/2 = 1/200
P(study, ¬free, money|reg) =cond.indep. P(study|reg) · P(¬free|reg) · P(money|reg) = 2/3 · 2/3 · 1/3 = 4/27

Therefore, P(spam|s) = (1/200 · 1/10) / (1/200 · 1/10 + 4/27 · 9/10) ≈ 0.0037
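The computation can be verified with exact arithmetic; a minimal sketch:

```python
from fractions import Fraction as F

p_spam = F(1, 10)
lik_spam = F(1, 10) * (1 - F(9, 10)) * F(1, 2)      # P(study,¬free,money|spam)
lik_reg  = F(2, 3) * (1 - F(1, 3)) * F(1, 3)        # P(study,¬free,money|reg)

posterior = lik_spam * p_spam / (lik_spam * p_spam + lik_reg * (1 - p_spam))
print(posterior, float(posterior))                  # 3/803 ≈ 0.0037
```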

SLIDE 58

Exemplifying the Computation of the Error Rate for the Naive Bayes Algorithm
(CMU, 2010 fall, Aarti Singh, HW1, pr. 4.2)

SLIDE 59

Consider a simple learning problem: determining whether Alice and Bob from CA will go hiking or not, Y : Hike ∈ {T, F}, given the weather conditions X1 : Sunny ∈ {T, F} and X2 : Windy ∈ {T, F}, by a Naive Bayes classifier. Using the training data, we estimated the parameters:
  P(Hike) = 0.5
  P(Sunny|Hike) = 0.8     P(Sunny|¬Hike) = 0.7
  P(Windy|Hike) = 0.4     P(Windy|¬Hike) = 0.5
Assume that the true distribution of X1, X2 and Y satisfies the Naive Bayes assumption of conditional independence, with the above parameters.

  • a. What is the joint probability that Alice and Bob go hiking and the weather is sunny and windy, i.e., P(Sunny, Windy, Hike)?

Answer:
  P(Sunny, Windy, Hike) =cond.indep. P(Sunny|Hike) · P(Windy|Hike) · P(Hike) = 0.8 · 0.4 · 0.5 = 0.16

SLIDE 60
  • b. What is the expected error rate of the Naive Bayes classifier? (The expected error rate is the probability that each class generates an observation where the decision rule is incorrect.)

Answer: (joint probabilities corresponding to incorrect predictions are marked with *)

  X1  X2  Y   P(X1,X2,Y) = P(X1|Y)·P(X2|Y)·P(Y)   fNB(X1,X2)   PNB(Y|X1,X2)
  F   F   F   0.3 · 0.5 · 0.5 = 0.075              F            0.555556
  F   F   T   0.2 · 0.6 · 0.5 = 0.060*             F            0.444444
  F   T   F   0.3 · 0.5 · 0.5 = 0.075              F            0.652174
  F   T   T   0.2 · 0.4 · 0.5 = 0.040*             F            0.347826
  T   F   F   0.7 · 0.5 · 0.5 = 0.175*             T            0.421686
  T   F   T   0.8 · 0.6 · 0.5 = 0.240              T            0.578314
  T   T   F   0.7 · 0.5 · 0.5 = 0.175              F            0.522388
  T   T   T   0.8 · 0.4 · 0.5 = 0.160*             F            0.477612

  error =def Σ_{X1,X2,Y} I[fNB(X1, X2) ≠ Y] · P(X1, X2, Y) = 0.060 + 0.040 + 0.175 + 0.160 = 0.435

Note: I is the indicator function; its value is 1 whenever the associated condition (in our case, fNB(X1, X2) ≠ Y) is true, and 0 otherwise.
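A minimal sketch recomputing this expected error rate by enumerating all (X1, X2, Y) combinations:

```python
from itertools import product

p_y = {True: 0.5, False: 0.5}
p_x1 = {True: 0.8, False: 0.7}    # P(Sunny=T | Hike=y)
p_x2 = {True: 0.4, False: 0.5}    # P(Windy=T | Hike=y)

def joint(x1, x2, y):
    px1 = p_x1[y] if x1 else 1 - p_x1[y]
    px2 = p_x2[y] if x2 else 1 - p_x2[y]
    return px1 * px2 * p_y[y]

def f_nb(x1, x2):                 # the Naive Bayes decision rule
    return joint(x1, x2, True) > joint(x1, x2, False)

error = sum(joint(x1, x2, y)
            for x1, x2, y in product([True, False], repeat=3)
            if f_nb(x1, x2) != y)
print(error)                      # 0.435
```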

SLIDE 61

Next, suppose that we gather more information about the weather conditions and introduce a new feature denoting whether the weather is rainy: X3 : Rainy ∈ {T, F}. Assume that each day the weather in CA can be either Rainy or Sunny: it cannot be both Sunny and Rainy, and similarly it cannot be both ¬Sunny and ¬Rainy.

  • c. In this new setting, are any of the Naive Bayes assumptions violated? Why (not)? What is the joint probability that Alice and Bob go hiking and the weather is sunny, windy and not rainy, i.e., P(Sunny, Windy, ¬Rainy, Hike)?

Answer: The Naive Bayes assumption of conditional independence of the variables given the class label is violated: knowing whether the weather is Sunny completely determines whether it is Rainy. Therefore, Sunny and Rainy are clearly NOT conditionally independent given Hike.
  P(Sunny, Windy, ¬Rainy, Hike) = P(Sunny|Hike) · P(Windy|Hike) · P(¬Rainy|Hike, Sunny, Windy) · P(Hike) = 0.8 · 0.4 · 1 · 0.5 = 0.16

SLIDE 62
  • d. What is the expected error rate when the Naive Bayes classifier uses all three attributes? Does the performance of Naive Bayes improve by observing the new attribute Rainy? Explain why.

Answer: (the true joint P(X1, X2, X3, Y) is zero for the rows where X3 ≠ ¬X1, and equals P(X1, X2, Y) otherwise; the next column is the joint estimated under the NB assumption; true joints corresponding to incorrect predictions are marked with *)

  X1  X2  X3  Y   P(X1,X2,X3,Y)   P̂(X1|Y)·P̂(X2|Y)·P̂(X3|Y)·P̂(Y)   fNB(X1,X2,X3)   PNB(Y|X1,X2,X3)
  F   F   F   F   –               0.075 · 0.7 = 0.0525              F               0.522388
  F   F   F   T   –               0.060 · 0.8 = 0.0480              F               0.477612
  F   F   T   F   0.075           0.075 · 0.3 = 0.0225              F               0.652174
  F   F   T   T   0.060*          0.060 · 0.2 = 0.0120              F               0.347826
  F   T   F   F   –               0.075 · 0.7 = 0.0525              F               0.621302
  F   T   F   T   –               0.040 · 0.8 = 0.0320              F               0.378698
  F   T   T   F   0.075           0.075 · 0.3 = 0.0225              F               0.737705
  F   T   T   T   0.040*          0.040 · 0.2 = 0.0080              F               0.262295
  T   F   F   F   0.175*          0.175 · 0.7 = 0.1225              T               0.389507
  T   F   F   T   0.240           0.240 · 0.8 = 0.1920              T               0.610493
  T   F   T   F   –               0.175 · 0.3 = 0.0525              F               0.522388
  T   F   T   T   –               0.240 · 0.2 = 0.0480              F               0.477612
  T   T   F   F   0.175*          0.175 · 0.7 = 0.1225              T               0.489022
  T   T   F   T   0.160           0.160 · 0.8 = 0.1280              T               0.510978
  T   T   T   F   –               0.175 · 0.3 = 0.0525              F               0.621302
  T   T   T   T   –               0.160 · 0.2 = 0.0320              F               0.378698

SLIDE 63

The new error rate is 0.060 + 0.040 + 0.175 + 0.175 = 0.45 > 0.435 (see question b). The Naive Bayes classifier’s performance drops because the conditional independence assumption does not hold for the correlated features.

SLIDE 64

Computing the Sample Complexity of the Naive Bayes and Joint Bayes Classifiers
(CMU, 2009 spring, Eric Xing, Tom Mitchell, Aarti Singh, HW2, pr. 1.1)

SLIDE 65

A big reason we use Naive Bayes classifiers is that they require less training data than Full (Joint) Bayes classifiers. This problem should give you a “feel” for how great the disparity really is.
Imagine that each instance is an independent observation of the multi-variate random variable X̄ = ⟨X1, . . . , Xd⟩, where the Xi are i.i.d. Bernoulli of parameter p = 0.5. To train a Full Bayes classifier, we need to see every value of X̄ “enough” times; training a Naive Bayes classifier only requires seeing both values of each Xi “enough” times.
Question: How many observations are needed until, with probability 1 − ε, we have seen every value we need to see at least once?
Note: Training the classifier well would require more than this, but for this problem we require only one observation.
Hint: You may want to use the following inequalities:

  • for any k ≥ 1, (1 − 1/k)^k ≤ e^{−1}
  • for any events E1, . . . , Ek, Pr{E1 ∪ · · · ∪ Ek} ≤ Σ_{i=1}^{k} Pr{Ei} (the “union bound”)

SLIDE 66

Consider the Naive Bayes classifier.

  • a. Show that if N observations have been made, the probability that a given value of Xi (either 0 or 1) has not been seen is ≤ 1/2^{N−1}.
  • b. Show that if more than NNB = 1 + log2(d/ε) observations have been made, then the probability that some Xi has not been observed in both of its states is ≤ ε.

Answer:
  • a. P(not all the values of Xi have been seen in N observations) = (1/2)^N + (1/2)^N = 2/2^N = 1/2^{N−1}
  • b. P(not all the values of all the variables Xi, i = 1, . . . , d, have been seen in NNB observations)
       ≤ Σ_{i=1}^{d} P(not all the values of Xi have been seen in NNB observations)      (union bound)
       = d · 1/2^{NNB−1} = d · 1/2^{log2(d/ε)} = d · (ε/d) = ε

SLIDE 67

Consider the Joint (Full) Bayes classifier.

  • c. Let x̄ be a particular value of X̄. Show that after N observations, the probability that we have never seen x̄ is ≤ e^{−N/2^d}.
  • d. Using the union bound, show that if more than NJB = 2^d ln(2^d/ε) observations have been made, then the probability that some value of X̄ has not been seen is ≤ ε.

Answer:
  • c. P(the instance x̄ has not been seen in N observations) = (1 − 1/2^d)^N = ((1 − 1/2^d)^{2^d})^{N/2^d} ≤ (e^{−1})^{N/2^d} = e^{−N/2^d}
  • d. P(not all the instances of X̄ have been seen in NJB observations)
       ≤ Σ_{x̄} P(x̄ has not been seen in NJB observations)      (union bound)
       = 2^d · e^{−NJB/2^d} = 2^d · e^{−ln(2^d/ε)} = 2^d · (ε/2^d) = ε

SLIDE 68
  • e. Let d = 2 and ε = 0.1. What are the values of NJB and NNB? What about d = 5? And d = 10?

Answer: for ε = 0.1:
  d = 2:   NNB = 1 + log2(2/0.1)  = 1 + log2 20  ≈ 5.32     NJB = 2² · ln(2²/0.1)   = 4 · ln 40       ≈ 14.75
  d = 5:   NNB = 1 + log2(5/0.1)  = 1 + log2 50  ≈ 6.64     NJB = 2⁵ · ln(2⁵/0.1)   = 32 · ln 320     ≈ 184.58
  d = 10:  NNB = 1 + log2(10/0.1) = 1 + log2 100 ≈ 7.64     NJB = 2¹⁰ · ln(2¹⁰/0.1) = 1024 · ln 10240 ≈ 9455.67
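A minimal sketch recomputing these figures:

```python
import math

eps = 0.1
for d in (2, 5, 10):
    n_nb = 1 + math.log2(d / eps)            # Naive Bayes
    n_jb = 2**d * math.log(2**d / eps)       # Joint (Full) Bayes
    print(f"d={d:2d}: NNB ≈ {n_nb:.2f}, NJB ≈ {n_jb:.2f}")
```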

SLIDE 69

Exemplifying ML Hypotheses and MAP Hypotheses
(CMU, 2009 spring, Tom Mitchell, midterm, pr. 2)

SLIDE 70

[Figure: a one-dimensional data set of five points on the X axis; class Y = 1 at X ∈ {1.5, 2, 3.75} and class Y = 0 at X ∈ {3, 3.5}.]

Let’s consider the one-dimensional data set shown above, based on the single real-valued attribute X. Notice there are two classes (values of Y) and five data points. Consider a new class of decision trees whose leaves have probabilistic labels: each leaf node gives the probability of each possible label, namely the fraction of the points at that leaf node having that label. For example, a decision tree learned from the data set above with zero splits would say P(Y = 1) = 3/5 and P(Y = 0) = 2/5. A decision tree with one split (at X = 2.5) would say P(Y = 1) = 1 if X < 2.5, and P(Y = 1) = 1/3 if X ≥ 2.5.

  • a. For the above data set, draw a tree that maximizes the likelihood of the data:
  TML = argmax_T PT(D), where PT(D) =def P(D|T) =i.i.d. Π_{i=1}^{5} P(Y = yi|X = xi, T),
  with yi the label/class of the instance xi ∈ {1.5, 2, 3, 3.5, 3.75}.

Answer: the two-split tree
  X > 2.5?   No  → leaf with P(Y = 1) = 1
             Yes → X > 3.625?   No  → leaf with P(Y = 1) = 0
                                Yes → leaf with P(Y = 1) = 1

SLIDE 71
  • b. Consider a prior probability distribution P(T) over trees that penalizes the number of splits in the tree:
  P(T) ∝ (1/4)^{splits(T)²}
where T is a tree, splits(T) is the number of splits in T, and ∝ means “is proportional to”. For the same data set, give the MAP tree when using this prior P(T) over trees.

Answer:
  0 splits: P(T0|D) ∝ (3/5)³ · (2/5)² · (1/4)⁰ = 108/3125 ≈ 0.0346
  1 split:  P(T1|D) ∝ 1² · (2/3)² · (1/3) · (1/4)¹ = 1/27 ≈ 0.0370
  2 splits: P(T2|D) ∝ 1 · (1/4)⁴ = 1/256 ≈ 0.0039
The MAP tree is therefore the one-split tree:
  X > 2.5?   No  → leaf with P(Y = 1) = 1
             Yes → leaf with P(Y = 1) = 1/3
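A minimal sketch recomputing the three (unnormalized) tree posteriors with exact arithmetic:

```python
from fractions import Fraction as F

def posterior(likelihood, splits):
    return likelihood * F(1, 4) ** (splits ** 2)     # P(D|T) * P(T)

t0 = posterior(F(3, 5)**3 * F(2, 5)**2, 0)
t1 = posterior(F(1)**2 * F(2, 3)**2 * F(1, 3), 1)
t2 = posterior(F(1), 2)
print(float(t0), float(t1), float(t2))   # ≈ 0.0346, 0.0370, 0.0039 -> MAP = T1
```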