CSCE 478/878 Lecture 7: Bayesian Learning
Stephen D. Scott (Adapted from Tom Mitchell’s slides)
October 31, 2006

Bayesian Methods

Not all hypotheses are created equal (even if they are all consistent with the training data)

Might have reasons (domain information) to favor some hypotheses over others a priori

Bayesian methods work with probabilities, and have two main roles:

1. Provide practical learning algorithms:
   • Naïve Bayes learning
   • Bayesian belief network learning
   • Combine prior knowledge (prior probabilities) with observed data
   • Require prior probabilities
2. Provide a useful conceptual framework:
   • Provides a “gold standard” for evaluating other learning algorithms
   • Additional insight into Occam’s razor

Outline

• Bayes Theorem
• MAP, ML hypotheses
• MAP learners
• Minimum description length principle
• Bayes optimal classifier/Gibbs algorithm
• Naïve Bayes classifier
• Bayesian belief networks

Bayes Theorem

In general, an identity for conditional probabilities

For our work, we want to know the probability that a particular $h \in H$ is the correct hypothesis given that we have seen training data $D$ (examples and labels). Bayes theorem lets us do this.

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$

• $P(h)$ = prior probability of hypothesis $h$ (might include domain information)
• $P(D)$ = probability of training data $D$
• $P(h \mid D)$ = probability of $h$ given $D$
• $P(D \mid h)$ = probability of $D$ given $h$

Note: $P(h \mid D)$ increases with $P(D \mid h)$ and $P(h)$, and decreases with $P(D)$

Choosing Hypotheses

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$

Generally want the most probable hypothesis given the training data

Maximum a posteriori hypothesis $h_{MAP}$:

$$
\begin{aligned}
h_{MAP} &= \operatorname*{argmax}_{h \in H} P(h \mid D) \\
        &= \operatorname*{argmax}_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)} \\
        &= \operatorname*{argmax}_{h \in H} P(D \mid h)\, P(h)
\end{aligned}
$$

If we assume $P(h_i) = P(h_j)$ for all $i, j$, then we can simplify further and choose the maximum likelihood (ML) hypothesis

$$h_{ML} = \operatorname*{argmax}_{h_i \in H} P(D \mid h_i)$$

Bayes Theorem Example

Does patient have cancer or not?

A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer.

$P(\text{cancer}) =$ ____   $P(\neg\text{cancer}) =$ ____
$P(+ \mid \text{cancer}) =$ ____   $P(- \mid \text{cancer}) =$ ____
$P(+ \mid \neg\text{cancer}) =$ ____   $P(- \mid \neg\text{cancer}) =$ ____

Now consider a new patient for whom the test is positive. What is our diagnosis?

$P(+ \mid \text{cancer})\, P(\text{cancer}) =$ ____
$P(+ \mid \neg\text{cancer})\, P(\neg\text{cancer}) =$ ____

So $h_{MAP} =$ ____
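
The blanks above can be filled in mechanically from the problem statement. The following is a minimal sketch in Python; the variable names are illustrative, and the only inputs are the three numbers given in the example (0.008, 98%, 97%). It also compares the MAP choice with the ML choice, which ignores the prior.

```python
# Quantities read off the problem statement.
p_cancer = 0.008
p_not_cancer = 1.0 - p_cancer
p_pos_given_cancer = 0.98                      # correct positive rate
p_neg_given_cancer = 1.0 - p_pos_given_cancer
p_neg_given_not_cancer = 0.97                  # correct negative rate
p_pos_given_not_cancer = 1.0 - p_neg_given_not_cancer

# Unnormalized posteriors P(+ | h) P(h) for the two hypotheses.
scores = {
    "cancer": p_pos_given_cancer * p_cancer,
    "not_cancer": p_pos_given_not_cancer * p_not_cancer,
}
h_map = max(scores, key=scores.get)

# Dividing by P(+) (theorem of total probability) gives the actual posteriors.
p_pos = sum(scores.values())
posteriors = {h: s / p_pos for h, s in scores.items()}

# The ML hypothesis compares only the likelihoods P(+ | h).
likelihoods = {"cancer": p_pos_given_cancer, "not_cancer": p_pos_given_not_cancer}
h_ml = max(likelihoods, key=likelihoods.get)

print("P(+|h)P(h):", scores)
print("h_MAP:", h_map, " h_ML:", h_ml)
print("P(cancer | +):", round(posteriors["cancer"], 4))
```

Because the prior for cancer is so small, the MAP and ML hypotheses disagree here, which is the point of the example.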

Basic Formulas for Probabilities

• Product Rule: probability $P(A \wedge B)$ of a conjunction of two events $A$ and $B$:

$$P(A \wedge B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A)$$

• Sum Rule: probability of a disjunction of two events $A$ and $B$:

$$P(A \vee B) = P(A) + P(B) - P(A \wedge B)$$

• Theorem of total probability: if events $A_1, \ldots, A_n$ are mutually exclusive with $\sum_{i=1}^{n} P(A_i) = 1$, then

$$P(B) = \sum_{i=1}^{n} P(B \mid A_i)\, P(A_i)$$
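
As a quick numerical sanity check of the three rules, here is a small Python sketch. The joint distribution over two Boolean events is made up for illustration (it is not from the lecture), and the helper `prob` is a hypothetical name.

```python
from math import isclose

# Hypothetical joint distribution P(A, B); the numbers are illustrative only.
joint = {
    (True, True): 0.2,
    (True, False): 0.3,
    (False, True): 0.1,
    (False, False): 0.4,
}

def prob(pred):
    """Probability of the event described by pred(a, b)."""
    return sum(p for (a, b), p in joint.items() if pred(a, b))

p_a = prob(lambda a, b: a)
p_b = prob(lambda a, b: b)
p_a_and_b = prob(lambda a, b: a and b)
p_a_or_b = prob(lambda a, b: a or b)

# Conditional probabilities from their definition.
p_a_given_b = p_a_and_b / p_b
p_b_given_a = p_a_and_b / p_a
p_b_given_not_a = prob(lambda a, b: b and not a) / (1.0 - p_a)

# Product rule: P(A ^ B) = P(A|B) P(B) = P(B|A) P(A)
product_rule_ok = (isclose(p_a_and_b, p_a_given_b * p_b)
                   and isclose(p_a_and_b, p_b_given_a * p_a))
# Sum rule: P(A v B) = P(A) + P(B) - P(A ^ B)
sum_rule_ok = isclose(p_a_or_b, p_a + p_b - p_a_and_b)
# Total probability with the partition {A, not A}.
total_prob_ok = isclose(p_b, p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a))

print(product_rule_ok, sum_rule_ok, total_prob_ok)
```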

Brute Force MAP Hypothesis Learner

1. For each hypothesis $h$ in $H$, calculate the posterior probability

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$

2. Output the hypothesis $h_{MAP}$ with the highest posterior probability

$$h_{MAP} = \operatorname*{argmax}_{h \in H} P(h \mid D)$$

Problem: what if $H$ is exponentially or infinitely large?
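
The two steps above can be sketched directly for a small, finite $H$. This is a minimal illustration in Python, not a practical learner: the function name `brute_force_map` and the coin-bias example (three candidate biases, four flips) are assumptions introduced here.

```python
def brute_force_map(hypotheses, prior, likelihood, data):
    """Score every hypothesis in a finite H and return (h_MAP, posteriors)."""
    # Step 1: P(D | h) P(h) for each h, then normalize by P(D).
    unnormalized = {h: likelihood(data, h) * prior(h) for h in hypotheses}
    p_data = sum(unnormalized.values())  # P(D) by total probability
    if p_data == 0:
        raise ValueError("data has zero probability under every hypothesis")
    posteriors = {h: u / p_data for h, u in unnormalized.items()}
    # Step 2: output the hypothesis with the highest posterior.
    h_map = max(posteriors, key=posteriors.get)
    return h_map, posteriors

# Hypothetical example: H is three candidate values of P(heads) for a coin.
hypotheses = [0.25, 0.5, 0.75]
data = "HHTH"

def uniform_prior(theta):
    return 1.0 / len(hypotheses)

def coin_likelihood(flips, theta):
    heads = flips.count("H")
    tails = len(flips) - heads
    return theta ** heads * (1.0 - theta) ** tails

h_map, posteriors = brute_force_map(hypotheses, uniform_prior, coin_likelihood, data)
print(h_map, posteriors)
```

The loop touches every element of $H$ once, which is exactly why the approach breaks down when $H$ is exponentially or infinitely large.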

Relation to Concept Learning

Consider our usual concept learning task: instance space $X$, hypothesis space $H$, training examples $D$

Consider the Find-S learning algorithm (outputs most specific hypothesis from the version space $VS_{H,D}$)

What would brute-force MAP learner output as MAP hypothesis?

Does Find-S output a MAP hypothesis?

Relation to Concept Learning (cont’d)

Assume fixed set of instances $\langle x_1, \ldots, x_m \rangle$

Assume $D$ is the set of classifications $D = \langle c(x_1), \ldots, c(x_m) \rangle$

Assume no noise and $c \in H$, so choose

$$
P(D \mid h) =
\begin{cases}
1 & \text{if } d_i = h(x_i) \text{ for all } d_i \in D \\
0 & \text{otherwise}
\end{cases}
$$

Choose $P(h) = 1/|H|$ for all $h \in H$, i.e. uniform distribution

If $h$ is inconsistent with $D$, then

$$P(h \mid D) = \frac{0 \cdot P(h)}{P(D)} = 0$$

If $h$ is consistent with $D$, then

$$P(h \mid D) = \frac{1 \cdot 1/|H|}{P(D)} = \frac{1/|H|}{|VS_{H,D}| / |H|} = \frac{1}{|VS_{H,D}|}$$

(for $P(D) = |VS_{H,D}|/|H|$, see the theorem of total probability under Basic Formulas for Probabilities)

Thus if $D$ is noise-free, $c \in H$, and $P(h)$ is uniform, every consistent hypothesis is a MAP hypothesis
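
The result $P(h \mid D) = 1/|VS_{H,D}|$ can be checked by enumeration on a tiny problem. The sketch below is an illustration under assumptions made here: $X$ is the four 2-bit inputs, $H$ is all 16 Boolean functions on $X$, and the target (AND) and the choice of three labeled instances are hypothetical. Exact fractions are used so the equalities hold without rounding.

```python
from fractions import Fraction
from itertools import product

instances = list(product([0, 1], repeat=2))            # X: all 2-bit inputs
# H: every Boolean function on X, stored as a tuple of outputs, one per instance.
hypotheses = list(product([0, 1], repeat=len(instances)))

target = (0, 0, 0, 1)                                   # hypothetical target c (AND)
train_idx = [0, 1, 3]                                   # instances with observed labels
data = {i: target[i] for i in train_idx}                # noise-free labels d_i = c(x_i)

def likelihood(h):
    """P(D | h): 1 if h agrees with every label in D, else 0."""
    return 1 if all(h[i] == d for i, d in data.items()) else 0

prior = Fraction(1, len(hypotheses))                    # uniform P(h) = 1/|H|
unnormalized = {h: likelihood(h) * prior for h in hypotheses}
p_data = sum(unnormalized.values())                     # P(D) = |VS| / |H|
posterior = {h: u / p_data for h, u in unnormalized.items()}

version_space = [h for h in hypotheses if likelihood(h) == 1]
print("|H| =", len(hypotheses), " |VS| =", len(version_space), " P(D) =", p_data)
for h in version_space:
    print(h, posterior[h])
```

Every hypothesis in the version space receives the same posterior and every other hypothesis receives zero, so a brute-force MAP learner may output any consistent hypothesis.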

Characterizing Learning Algorithms by Equivalent MAP Learners

[Figure: two block diagrams. Top, "Inductive system": training examples $D$ and hypothesis space $H$ feed the Candidate Elimination Algorithm, which outputs hypotheses. Bottom, "Equivalent Bayesian inference system": training examples $D$ and hypothesis space $H$ feed a brute force MAP learner, which outputs hypotheses. Prior assumptions made explicit: $P(h)$ uniform; $P(D \mid h) = 0$ if inconsistent, $= 1$ if consistent.]

So we can characterize algorithms in a Bayesian framework even though they don’t directly manipulate probabilities

Other priors will allow Find-S, etc. to output MAP; e.g. $P(h)$ that favors more specific hypotheses

Learning A Real-Valued Function

Consider any real-valued target function $f$

Training examples $\langle x_i, d_i \rangle$, where $d_i$ is a noisy training value

• $d_i = f(x_i) + e_i$
• $e_i$ is a random variable (noise) drawn independently for each $x_i$ according to some Gaussian distribution with mean $\mu_{e_i} = 0$

Then the maximum likelihood hypothesis $h_{ML}$ is the one that minimizes the sum of squared errors, e.g. a linear unit trained with GD/EG:

$$h_{ML} = \operatorname*{argmin}_{h \in H} \sum_{i=1}^{m} \left( d_i - h(x_i) \right)^2$$

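Learning A Real-Valued Function (cont’d)

$$
\begin{aligned}
h_{ML} &= \operatorname*{argmax}_{h \in H} p(D \mid h) = \operatorname*{argmax}_{h \in H} p(d_1, \ldots, d_m \mid h) \\
       &= \operatorname*{argmax}_{h \in H} \prod_{i=1}^{m} p(d_i \mid h) \qquad \text{(if the $d_i$'s are cond. indep.)} \\
       &= \operatorname*{argmax}_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2} \left( \frac{d_i - h(x_i)}{\sigma} \right)^2 \right)
\end{aligned}
$$

($\mu_{e_i} = 0 \Rightarrow E[d_i \mid h] = h(x_i)$)

Maximize natural log instead:

$$
\begin{aligned}
h_{ML} &= \operatorname*{argmax}_{h \in H} \sum_{i=1}^{m} \left[ \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2} \left( \frac{d_i - h(x_i)}{\sigma} \right)^2 \right] \\
       &= \operatorname*{argmax}_{h \in H} \sum_{i=1}^{m} -\frac{1}{2} \left( \frac{d_i - h(x_i)}{\sigma} \right)^2 \\
       &= \operatorname*{argmax}_{h \in H} \sum_{i=1}^{m} -\left( d_i - h(x_i) \right)^2 \\
       &= \operatorname*{argmin}_{h \in H} \sum_{i=1}^{m} \left( d_i - h(x_i) \right)^2
\end{aligned}
$$

Thus have Bayesian justification for minimizing squared error (under certain assumptions)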
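
The equivalence derived above can be observed numerically. The sketch below assumes a deliberately small setup that is not from the lecture: five made-up noisy samples of a target near $f(x) = 2x$, an assumed noise level `sigma`, and a hypothesis space of lines through the origin $h_w(x) = wx$ with $w$ restricted to a grid. It picks the hypothesis once by maximizing the Gaussian log-likelihood and once by minimizing the sum of squared errors.

```python
import math

# Hypothetical noisy samples of a target near f(x) = 2x (illustrative values only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ds = [0.1, 1.9, 4.2, 5.8, 8.1]
sigma = 0.5  # assumed noise standard deviation

# Hypothesis space H: h_w(x) = w * x, with w on a grid 0.00, 0.01, ..., 4.00.
ws = [i / 100 for i in range(401)]

def sse(w):
    """Sum of squared errors of h_w on the training data."""
    return sum((d - w * x) ** 2 for x, d in zip(xs, ds))

def log_likelihood(w):
    """ln p(D | h_w) under i.i.d. zero-mean Gaussian noise with std sigma."""
    m = len(xs)
    return -0.5 * m * math.log(2 * math.pi * sigma ** 2) - sse(w) / (2 * sigma ** 2)

w_ml = max(ws, key=log_likelihood)   # maximum likelihood hypothesis
w_lse = min(ws, key=sse)             # least-squares hypothesis

print("w_ML =", w_ml, " w_LSE =", w_lse)
```

The log-likelihood is a constant minus a positive multiple of the squared error, so the two selections coincide regardless of the value chosen for `sigma`.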

Learning to Predict Probabilities

Consider predicting survival probability from patient data

Training examples $\langle x_i, d_i \rangle$, where $d_i$ is 1 or 0 (assume label is [or appears] probabilistically generated)

Want to train neural network to output the probability that $x_i$ has label 1, not the label itself

Using approach similar to previous slide (p. 169), can show

$$h_{ML} = \operatorname*{argmax}_{h \in H} \sum_{i=1}^{m} d_i \ln h(x_i) + (1 - d_i) \ln\left( 1 - h(x_i) \right)$$

i.e. find $h$ minimizing cross-entropy

For single sigmoid unit, use update rule

$$w_j \leftarrow w_j + \eta \sum_{i=1}^{m} \left( d_i - h(x_i) \right) x_{ij}$$

to find $h_{ML}$ (can also derive EG rule)
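
A minimal Python sketch of the batch update rule for a single sigmoid unit follows. The data set (six one-feature examples with a constant bias input of 1.0), the learning rate, and the number of epochs are all assumptions chosen for illustration; the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Output h(x) of a single sigmoid unit with weights w."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def log_likelihood(w, xs, ds):
    """sum_i d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i))."""
    total = 0.0
    for x, d in zip(xs, ds):
        h = predict(w, x)
        total += d * math.log(h) + (1 - d) * math.log(1.0 - h)
    return total

def train(xs, ds, eta=0.05, epochs=2000):
    """Batch gradient ascent: w_j <- w_j + eta * sum_i (d_i - h(x_i)) x_ij."""
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        errors = [d - predict(w, x) for x, d in zip(xs, ds)]
        w = [wj + eta * sum(e * x[j] for e, x in zip(errors, xs))
             for j, wj in enumerate(w)]
    return w

# Hypothetical data: x = (bias input, feature); labels are noisy, not separable.
xs = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0), (1.0, 4.0), (1.0, 5.0)]
ds = [0, 0, 1, 0, 1, 1]

ll_before = log_likelihood([0.0, 0.0], xs, ds)
w = train(xs, ds)
ll_after = log_likelihood(w, xs, ds)

print("weights:", w)
print("log-likelihood before/after:", ll_before, ll_after)
print("P(label=1 | feature=5):", predict(w, (1.0, 5.0)))
```

The unit's output is read as a probability rather than thresholded into a label, which matches the goal stated above.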

Minimum Description Length Principle

Occam’s razor: prefer the shortest hypothesis

MDL: prefer the hypothesis $h$ that satisfies

$$h_{MDL} = \operatorname*{argmin}_{h \in H} L_{C_1}(h) + L_{C_2}(D \mid h)$$

where $L_C(x)$ is the description length of $x$ under encoding $C$

Example: $H$ = decision trees, $D$ = training data labels

• $L_{C_1}(h)$ is # bits to describe tree $h$
• $L_{C_2}(D \mid h)$ is # bits to describe $D$ given $h$
   – Note $L_{C_2}(D \mid h) = 0$ if examples classified perfectly by $h$. Need only describe exceptions
• Hence $h_{MDL}$ trades off tree size for training errors

Minimum Description Length Principle
Bayesian Justification

$$
\begin{aligned}
h_{MAP} &= \operatorname*{argmax}_{h \in H} P(D \mid h)\, P(h) \\
        &= \operatorname*{argmax}_{h \in H} \log_2 P(D \mid h) + \log_2 P(h) \\
        &= \operatorname*{argmin}_{h \in H} -\log_2 P(D \mid h) - \log_2 P(h) \qquad (1)
\end{aligned}
$$

Interesting fact from information theory: The optimal (shortest expected coding length) code for an event with probability $p$ is $-\log_2 p$ bits.

So interpret (1):

• $-\log_2 P(h)$ is length of $h$ under optimal code
• $-\log_2 P(D \mid h)$ is length of $D$ given $h$ under optimal code

→ prefer the hypothesis that minimizes length($h$) + length(misclassifications)

Caveat: $h_{MDL} = h_{MAP}$ doesn’t apply for arbitrary encodings (need $P(h)$ and $P(D \mid h)$ to be optimal); merely a guide
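
To make the correspondence in (1) concrete, the Python sketch below computes $-\log_2 p$ code lengths for three candidate hypotheses and confirms that the shortest total description picks the same hypothesis as the largest $P(D \mid h)\,P(h)$. The hypothesis names and all probabilities are invented for illustration, and optimal codes are assumed.

```python
import math

# Hypothetical priors P(h) and likelihoods P(D | h) for three candidate trees.
candidates = {
    "small_tree":  {"prior": 0.5,    "likelihood": 0.01},
    "medium_tree": {"prior": 0.25,   "likelihood": 0.10},
    "large_tree":  {"prior": 0.0625, "likelihood": 0.20},
}

def code_length_bits(p):
    """Length in bits of an event with probability p under an optimal code."""
    return -math.log2(p)

def description_length(name):
    """-log2 P(h) - log2 P(D | h): bits for the tree plus bits for D given the tree."""
    c = candidates[name]
    return code_length_bits(c["prior"]) + code_length_bits(c["likelihood"])

def posterior_score(name):
    """Unnormalized posterior P(D | h) P(h)."""
    c = candidates[name]
    return c["likelihood"] * c["prior"]

h_mdl = min(candidates, key=description_length)
h_map = max(candidates, key=posterior_score)

for name in candidates:
    print(name, round(description_length(name), 3), "bits,", posterior_score(name))
print("h_MDL =", h_mdl, " h_MAP =", h_map)
```

In this made-up example the small tree is cheap to describe but explains the data poorly, and the large tree is the reverse; the medium tree minimizes the total.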

Bayes Optimal Classifier

• So far we’ve sought the most probable hypothesis given the data $D$, i.e. $h_{MAP}$
• But given new instance $x$, $h_{MAP}(x)$ is not necessarily the most probable classification!
• Consider three possible hypotheses:

$$P(h_1 \mid D) = 0.4, \quad P(h_2 \mid D) = 0.3, \quad P(h_3 \mid D) = 0.3$$

Given new instance $x$,

$$h_1(x) = +, \quad h_2(x) = -, \quad h_3(x) = -$$

• $h_{MAP}(x) =$ ____
• What’s the most probable classification of $x$?

Bayes Optimal Classifier (cont’d)

Bayes optimal classification:

$$\operatorname*{argmax}_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)$$

where $V$ is set of possible labels (e.g. $\{+, -\}$)

Example:

$$
\begin{aligned}
P(h_1 \mid D) &= 0.4, & P(- \mid h_1) &= 0, & P(+ \mid h_1) &= 1 \\
P(h_2 \mid D) &= 0.3, & P(- \mid h_2) &= 1, & P(+ \mid h_2) &= 0 \\
P(h_3 \mid D) &= 0.3, & P(- \mid h_3) &= 1, & P(+ \mid h_3) &= 0
\end{aligned}
$$

therefore

$$\sum_{h_i \in H} P(+ \mid h_i)\, P(h_i \mid D) = 0.4$$

$$\sum_{h_i \in H} P(- \mid h_i)\, P(h_i \mid D) = 0.6$$

and

$$\operatorname*{argmax}_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D) = -$$

On average, no other classifier using same prior and same hyp. space can outperform Bayes optimal!
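
The example above translates almost line for line into code. This Python sketch uses the posteriors and predictions from the three-hypothesis example; the function and variable names are illustrative. It also reports the label predicted by $h_{MAP}$ alone, for comparison.

```python
# Posteriors P(h_i | D) and deterministic predictions h_i(x) from the example.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": "+", "h2": "-", "h3": "-"}
labels = ["+", "-"]  # V

def p_label_given_h(v, h):
    """P(v | h): 1 if h predicts v for the new instance, else 0."""
    return 1.0 if predictions[h] == v else 0.0

def bayes_optimal(labels, posteriors):
    """Return the label maximizing sum_i P(v | h_i) P(h_i | D), plus all sums."""
    votes = {v: sum(p_label_given_h(v, h) * p for h, p in posteriors.items())
             for v in labels}
    return max(votes, key=votes.get), votes

# Classification by the single most probable hypothesis.
h_map = max(posteriors, key=posteriors.get)
map_label = predictions[h_map]

best_label, votes = bayes_optimal(labels, posteriors)
print("h_MAP =", h_map, "predicts", map_label)
print("weighted votes:", votes, "-> Bayes optimal label:", best_label)
```

The most probable single hypothesis and the posterior-weighted vote disagree on this instance, which is why $h_{MAP}(x)$ need not be the most probable classification.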
