CSCE 478/878 Lecture 7: Bayesian Learning (Stephen D. Scott, adapted from Tom Mitchell's slides)

  1. CSCE 478/878 Lecture 7: Bayesian Learning. Stephen D. Scott (adapted from Tom Mitchell's slides). October 31, 2006

  2. Bayesian Methods
     Not all hypotheses are created equal (even if they are all consistent with the training data). Might have reasons (domain information) to favor some hypotheses over others a priori. Bayesian methods work with probabilities, and have two main roles:
     1. Provide practical learning algorithms:
        • Naïve Bayes learning
        • Bayesian belief network learning
        • Combine prior knowledge (prior probabilities) with observed data
        • Require prior probabilities
     2. Provide a useful conceptual framework:
        • A "gold standard" for evaluating other learning algorithms
        • Additional insight into Occam's razor

  3. Outline
     • Bayes theorem
     • MAP, ML hypotheses
     • MAP learners
     • Minimum description length principle
     • Bayes optimal classifier / Gibbs algorithm
     • Naïve Bayes classifier
     • Bayesian belief networks

  4. Bayes Theorem
     In general, an identity for conditional probabilities. For our work, we want to know the probability that a particular h ∈ H is the correct hypothesis given that we have seen training data D (examples and labels). Bayes theorem lets us do this:
          P(h | D) = P(D | h) P(h) / P(D)
     • P(h) = prior probability of hypothesis h (might include domain information)
     • P(D) = probability of training data D
     • P(h | D) = probability of h given D
     • P(D | h) = probability of D given h
     Note that P(h | D) increases with P(D | h) and P(h), and decreases with P(D).

  5. Choosing Hypotheses
          P(h | D) = P(D | h) P(h) / P(D)
     Generally want the most probable hypothesis given the training data, the maximum a posteriori (MAP) hypothesis h_MAP:
          h_MAP = argmax_{h ∈ H} P(h | D)
                = argmax_{h ∈ H} P(D | h) P(h) / P(D)
                = argmax_{h ∈ H} P(D | h) P(h)
     If we assume P(h_i) = P(h_j) for all i, j, then we can simplify further and choose the maximum likelihood (ML) hypothesis:
          h_ML = argmax_{h_i ∈ H} P(D | h_i)
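As a quick illustration of the difference between the two criteria, here is a minimal Python sketch; the hypothesis names and all probability values are made up for illustration, not taken from the slides:

```python
# MAP vs. ML selection over a tiny, made-up hypothesis space.
# With a uniform prior the two choices coincide; with a non-uniform prior they can differ.
likelihood = {"h1": 0.05, "h2": 0.20, "h3": 0.10}   # P(D | h), illustrative values
prior      = {"h1": 0.70, "h2": 0.10, "h3": 0.20}   # P(h),     illustrative values

h_ml  = max(likelihood, key=likelihood.get)                      # argmax_h P(D | h)
h_map = max(likelihood, key=lambda h: likelihood[h] * prior[h])  # argmax_h P(D | h) P(h)
print(h_ml, h_map)  # h2 is the ML hypothesis, h1 the MAP hypothesis
```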

  6. Bayes Theorem Example
     Does the patient have cancer or not? A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.
          P(cancer) =              P(¬cancer) =
          P(+ | cancer) =          P(− | cancer) =
          P(+ | ¬cancer) =         P(− | ¬cancer) =
     Now consider a new patient for whom the test is positive. What is our diagnosis?
          P(+ | cancer) P(cancer) =
          P(+ | ¬cancer) P(¬cancer) =
     So h_MAP =
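The blanks above can be filled in directly from the problem statement. A minimal Python sketch of the arithmetic (all numbers come from the slide; the variable names are mine):

```python
# Filling in the lab-test example: prior, sensitivity, and specificity from the slide.
p_cancer = 0.008
p_not_cancer = 1 - p_cancer           # 0.992
p_pos_given_cancer = 0.98             # correct positive rate when the disease is present
p_pos_given_not_cancer = 1 - 0.97     # 1 - correct negative rate = 0.03

# Unnormalized posteriors P(+ | h) P(h) for each hypothesis
score_cancer = p_pos_given_cancer * p_cancer              # 0.98 * 0.008 = 0.0078
score_not_cancer = p_pos_given_not_cancer * p_not_cancer  # 0.03 * 0.992 = 0.0298

h_map = "cancer" if score_cancer > score_not_cancer else "¬cancer"
print(score_cancer, score_not_cancer, h_map)  # h_MAP = ¬cancer

# Normalizing: P(cancer | +) = 0.0078 / (0.0078 + 0.0298), roughly 0.21
```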

  7. Basic Formulas for Probabilities
     • Product rule: probability P(A ∧ B) of a conjunction of two events A and B:
          P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A)
     • Sum rule: probability of a disjunction of two events A and B:
          P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
     • Theorem of total probability: if events A_1, ..., A_n are mutually exclusive with Σ_{i=1}^{n} P(A_i) = 1, then
          P(B) = Σ_{i=1}^{n} P(B | A_i) P(A_i)
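A small numerical sanity check of these three rules, using a made-up joint distribution over two binary events (the probabilities are illustrative, not from the slides):

```python
# Verify the product rule, sum rule, and theorem of total probability on a toy joint.
joint = {(True, True): 0.2, (True, False): 0.3,
         (False, True): 0.1, (False, False): 0.4}          # P(A=a ∧ B=b)

def p_A(a): return sum(p for (x, _), p in joint.items() if x == a)
def p_B(b): return sum(p for (_, y), p in joint.items() if y == b)
def p_A_given_B(a, b): return joint[(a, b)] / p_B(b)

# Product rule: P(A ∧ B) = P(A | B) P(B)
assert abs(joint[(True, True)] - p_A_given_B(True, True) * p_B(True)) < 1e-12

# Sum rule: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
p_A_or_B = 1.0 - joint[(False, False)]
assert abs(p_A_or_B - (p_A(True) + p_B(True) - joint[(True, True)])) < 1e-12

# Total probability: P(B) = Σ_a P(B | A=a) P(A=a), with A=True / A=False exclusive and exhaustive
p_B_total = sum((joint[(a, True)] / p_A(a)) * p_A(a) for a in (True, False))
assert abs(p_B_total - p_B(True)) < 1e-12
```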

  8. Brute-Force MAP Hypothesis Learner
     1. For each hypothesis h in H, calculate the posterior probability
          P(h | D) = P(D | h) P(h) / P(D)
     2. Output the hypothesis h_MAP with the highest posterior probability
          h_MAP = argmax_{h ∈ H} P(h | D)
     Problem: what if H is exponentially or infinitely large?
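A minimal sketch of this learner, assuming a small finite hypothesis space represented as Python functions and the noise-free 0/1 likelihood used on the following slides; the function and argument names are mine:

```python
def brute_force_map(hypotheses, prior, data):
    """Brute-force MAP learner.
    hypotheses: list of functions h(x) -> label
    prior:      dict mapping each h to P(h)
    data:       list of (x, label) pairs
    Returns (h_MAP, posterior over all h)."""
    def likelihood(h):
        # Noise-free concept-learning setting: P(D|h) = 1 if h is consistent
        # with every training example, else 0.
        return 1.0 if all(h(x) == y for x, y in data) else 0.0

    unnormalized = {h: likelihood(h) * prior[h] for h in hypotheses}
    p_data = sum(unnormalized.values())        # P(D), by the theorem of total probability
    posterior = {h: (p / p_data if p_data > 0 else 0.0) for h, p in unnormalized.items()}
    h_map = max(posterior, key=posterior.get)  # argmax_h P(h | D)
    return h_map, posterior
```

Enumerating H explicitly is exactly what makes this impractical when H is exponentially or infinitely large.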

  9. Relation to Concept Learning
     Consider our usual concept learning task: instance space X, hypothesis space H, training examples D.
     Consider the Find-S learning algorithm (outputs the most specific hypothesis from the version space VS_{H,D}).
     What would the brute-force MAP learner output as the MAP hypothesis? Does Find-S output a MAP hypothesis?

  10. Relation to Concept Learning (cont'd)
      Assume a fixed set of instances ⟨x_1, ..., x_m⟩ and assume D is the set of classifications D = ⟨c(x_1), ..., c(x_m)⟩.
      Assume no noise and c ∈ H, so choose
           P(D | h) = 1 if d_i = h(x_i) for all d_i ∈ D, and 0 otherwise
      Choose P(h) = 1/|H| for all h ∈ H, i.e. the uniform distribution.
      If h is inconsistent with D, then P(h | D) = (0 · P(h)) / P(D) = 0.
      If h is consistent with D, then
           P(h | D) = (1 · 1/|H|) / P(D) = (1/|H|) / (|VS_{H,D}| / |H|) = 1 / |VS_{H,D}|
      (see theorem of total probability, slide 7)
      Thus if D is noise-free, c ∈ H, and P(h) is uniform, every consistent hypothesis is a MAP hypothesis.
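A toy check of this result in Python; the hypothesis space, target concept, and data below are invented purely for illustration:

```python
# With a uniform prior and the 0/1 likelihood, every hypothesis consistent
# with D gets posterior exactly 1/|VS_{H,D}|.
hypotheses = {"always+":    lambda x: True,            # always positive
              "first-attr": lambda x: x[0],            # positive iff first attribute is True
              "both-attrs": lambda x: x[0] and x[1],   # positive iff both attributes are True
              "always-":    lambda x: False}           # always negative
data = [((True, True), True), ((True, False), True)]   # labels from target c(x) = x[0]

prior = {name: 1 / len(hypotheses) for name in hypotheses}   # uniform P(h)
likelihood = {name: float(all(h(x) == y for x, y in data))   # P(D|h): 1 if consistent, else 0
              for name, h in hypotheses.items()}
p_data = sum(likelihood[n] * prior[n] for n in hypotheses)   # P(D) = |VS| / |H|
posterior = {n: likelihood[n] * prior[n] / p_data for n in hypotheses}

print({n: round(p, 3) for n, p in posterior.items()})
# -> the two consistent hypotheses ("always+", "first-attr") each get 0.5 = 1/|VS|
```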

  11. Characterizing Learning Algorithms by Equivalent MAP Learners
      [Figure: an inductive system (the Candidate Elimination algorithm, taking hypothesis space H and training examples D as inputs and producing output hypotheses) is equivalent to a Bayesian inference system (a brute-force MAP learner with the same inputs, P(h) uniform, and P(D | h) = 0 if h is inconsistent, 1 if consistent), with the prior assumptions made explicit.]
      So we can characterize algorithms in a Bayesian framework even though they don't directly manipulate probabilities.
      Other priors will allow Find-S, etc. to output a MAP hypothesis, e.g. a P(h) that favors more specific hypotheses.

  12. Learning A Real-Valued Function
      Consider any real-valued target function f and training examples ⟨x_i, d_i⟩, where d_i is a noisy training value:
      • d_i = f(x_i) + e_i
      • e_i is a random variable (noise) drawn independently for each x_i according to some Gaussian distribution with mean μ_{e_i} = 0
      Then the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors, e.g. a linear unit trained with GD/EG:
           h_ML = argmin_{h ∈ H} Σ_{i=1}^{m} (d_i − h(x_i))²
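A minimal sketch of this idea: fitting a linear unit by gradient descent on the sum of squared errors, which under the zero-mean Gaussian noise assumption yields the ML hypothesis. The data, learning rate, and epoch count below are illustrative choices, not from the slides:

```python
import random

def fit_linear_ml(examples, eta=0.01, epochs=2000):
    """examples: list of (x, d) pairs with x a tuple of features.
    Minimizes sum_i (d_i - h(x_i))^2 for h(x) = w0 + w . x; returns (w0, w)."""
    n = len(examples[0][0])
    w0, w = 0.0, [0.0] * n
    for _ in range(epochs):
        for x, d in examples:
            h = w0 + sum(wj * xj for wj, xj in zip(w, x))
            err = d - h
            # Gradient step on the squared error (the constant factor is folded into eta)
            w0 += eta * err
            w = [wj + eta * err * xj for wj, xj in zip(w, x)]
    return w0, w

# Noisy samples of f(x) = 2x + 1; the fitted weights should come out near (1, [2])
data = [((x,), 2 * x + 1 + random.gauss(0, 0.1)) for x in [i / 10 for i in range(20)]]
print(fit_linear_ml(data))
```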

  13. Learning A Real-Valued Function (cont'd)
      h_ML = argmax_{h ∈ H} p(D | h) = argmax_{h ∈ H} p(d_1, ..., d_m | h)
           = argmax_{h ∈ H} ∏_{i=1}^{m} p(d_i | h)   (if the d_i's are conditionally independent)
           = argmax_{h ∈ H} ∏_{i=1}^{m} (1 / √(2πσ²)) exp( −(1/2) ((d_i − h(x_i)) / σ)² )
      (μ_{e_i} = 0 ⇒ E[d_i | h] = h(x_i))
      Maximize the natural log instead:
      h_ML = argmax_{h ∈ H} Σ_{i=1}^{m} [ ln (1/√(2πσ²)) − (1/2) ((d_i − h(x_i)) / σ)² ]
           = argmax_{h ∈ H} Σ_{i=1}^{m} −(1/2) ((d_i − h(x_i)) / σ)²
           = argmax_{h ∈ H} Σ_{i=1}^{m} −(d_i − h(x_i))²
           = argmin_{h ∈ H} Σ_{i=1}^{m} (d_i − h(x_i))²
      Thus we have a Bayesian justification for minimizing squared error (under certain assumptions).

  14. Learning to Predict Probabilities
      Consider predicting survival probability from patient data. Training examples are ⟨x_i, d_i⟩, where d_i is 1 or 0 (assume the label is [or appears] probabilistically generated).
      Want to train a neural network to output the probability that x_i has label 1, not the label itself.
      Using an approach similar to the previous slide (p. 169), can show
           h_ML = argmax_{h ∈ H} Σ_{i=1}^{m} [ d_i ln h(x_i) + (1 − d_i) ln(1 − h(x_i)) ]
      i.e. find the h minimizing cross-entropy.
      For a single sigmoid unit, use the update rule
           w_j ← w_j + η Σ_{i=1}^{m} (d_i − h(x_i)) x_{ij}
      to find h_ML (can also derive an EG rule).
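A minimal sketch of that update rule for a single sigmoid unit, in batch form; the data layout and hyperparameters are illustrative assumptions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_sigmoid_unit(examples, eta=0.1, epochs=500):
    """examples: list of (x, d) with x a feature tuple and d in {0, 1}.
    Applies w_j <- w_j + eta * sum_i (d_i - h(x_i)) * x_ij, i.e. gradient
    ascent on the cross-entropy objective above. Returns the weight vector."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)                       # w[0] is the bias on a constant input of 1
    for _ in range(epochs):
        grad = [0.0] * (n + 1)
        for x, d in examples:
            xs = (1.0,) + tuple(x)
            h = sigmoid(sum(wj * xj for wj, xj in zip(w, xs)))
            for j, xj in enumerate(xs):
                grad[j] += (d - h) * xj       # accumulate (d_i - h(x_i)) x_ij
        w = [wj + eta * gj for wj, gj in zip(w, grad)]
    return w
```

The trained unit's output sigmoid(w · x) can then be read as an estimate of the probability that x has label 1.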

  15. Minimum Description Length Principle
      Occam's razor: prefer the shortest hypothesis.
      MDL: prefer the hypothesis h that satisfies
           h_MDL = argmin_{h ∈ H} L_{C1}(h) + L_{C2}(D | h)
      where L_C(x) is the description length of x under encoding C.
      Example: H = decision trees, D = training data labels
      • L_{C1}(h) is the number of bits to describe tree h
      • L_{C2}(D | h) is the number of bits to describe D given h
        – Note L_{C2}(D | h) = 0 if the examples are classified perfectly by h; need only describe the exceptions
      • Hence h_MDL trades off tree size for training errors
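A toy illustration of the selection rule; the candidate trees and their bit counts below are made up, purely to show the size-versus-errors trade-off:

```python
# h_MDL = argmin_h L_C1(h) + L_C2(D | h), with invented description lengths in bits.
candidates = {
    "small tree":  {"L_h": 10, "L_D_given_h": 40},  # short tree, many exceptions to encode
    "medium tree": {"L_h": 25, "L_D_given_h": 8},
    "large tree":  {"L_h": 90, "L_D_given_h": 0},   # classifies D perfectly but is long
}
h_mdl = min(candidates, key=lambda h: candidates[h]["L_h"] + candidates[h]["L_D_given_h"])
print(h_mdl)  # "medium tree": the best trade-off of tree size vs. training errors
```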

  16. Minimum Description Length Principle: Bayesian Justification
      h_MAP = argmax_{h ∈ H} P(D | h) P(h)
            = argmax_{h ∈ H} log₂ P(D | h) + log₂ P(h)
            = argmin_{h ∈ H} −log₂ P(D | h) − log₂ P(h)   (1)
      Interesting fact from information theory: the optimal (shortest expected coding length) code for an event with probability p uses −log₂ p bits. So interpret (1):
      • −log₂ P(h) is the length of h under the optimal code
      • −log₂ P(D | h) is the length of D given h under the optimal code
      → prefer the hypothesis that minimizes length(h) + length(misclassifications)
      Caveat: h_MDL = h_MAP doesn't hold for arbitrary encodings (the codes for P(h) and P(D | h) must be optimal); MDL is merely a guide.

  17. Bayes Optimal Classifier
      • So far we've sought the most probable hypothesis given the data D, i.e. h_MAP
      • But given a new instance x, h_MAP(x) is not necessarily the most probable classification!
      • Consider three possible hypotheses:
           P(h_1 | D) = 0.4, P(h_2 | D) = 0.3, P(h_3 | D) = 0.3
        Given a new instance x: h_1(x) = +, h_2(x) = −, h_3(x) = −
      • h_MAP(x) =
      • What's the most probable classification of x?

  18. Bayes Optimal Classifier (cont'd)
      Bayes optimal classification:
           argmax_{v_j ∈ V} Σ_{h_i ∈ H} P(v_j | h_i) P(h_i | D)
      where V is the set of possible labels (e.g. {+, −}).
      Example:
           P(h_1 | D) = 0.4, P(− | h_1) = 0, P(+ | h_1) = 1
           P(h_2 | D) = 0.3, P(− | h_2) = 1, P(+ | h_2) = 0
           P(h_3 | D) = 0.3, P(− | h_3) = 1, P(+ | h_3) = 0
      therefore
           Σ_{h_i ∈ H} P(+ | h_i) P(h_i | D) = 0.4
           Σ_{h_i ∈ H} P(− | h_i) P(h_i | D) = 0.6
      and
           argmax_{v_j ∈ V} Σ_{h_i ∈ H} P(v_j | h_i) P(h_i | D) = −
      On average, no other classifier using the same prior and same hypothesis space can outperform Bayes optimal!
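The weighted vote from this example, in a minimal Python sketch (the posteriors and per-hypothesis label probabilities are taken from the slide; the names are mine):

```python
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}    # P(h_i | D)
label_given_h = {"h1": {"+": 1.0, "-": 0.0},      # P(v | h_i)
                 "h2": {"+": 0.0, "-": 1.0},
                 "h3": {"+": 0.0, "-": 1.0}}

def bayes_optimal_label(labels=("+", "-")):
    # argmax_v sum_h P(v | h) P(h | D)
    score = {v: sum(label_given_h[h][v] * p for h, p in posteriors.items()) for v in labels}
    return max(score, key=score.get), score

print(bayes_optimal_label())  # ('-', {'+': 0.4, '-': 0.6}); h_MAP(x) = + but the optimal label is -
```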
