
Statistical Learning, CS 786, University of Waterloo. Lecture 6: May 17, 2012. Slides (c) 2012 P. Poupart.



  1. Statistical Learning
     CS 786, University of Waterloo. Lecture 6: May 17, 2012

     Decision Tree Predictions
     • Can make deterministic and probabilistic predictions
       – Deterministic rule: cs485 = A ∧ stat231 = A ⟹ cs786 = A
       – Probabilistic rule: Pr(cs786 = A | cs485 = A ∧ stat231 = A) = 0.9
     • A probabilistic rule is a conditional distribution… could we use Bayes nets instead of decision trees?
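As a rough illustration of the last bullet, a probabilistic rule can be stored directly as a conditional distribution over the class value given the attribute values. This is a minimal sketch only: the names cs485, stat231, cs786 and the grade A are reconstructed from a garbled original, and the probabilities for the attribute combinations other than (A, A) are made-up placeholders.

```python
# A probabilistic rule viewed as a conditional distribution over the class value.
# Keys: (cs485 == A, stat231 == A); values: Pr(cs786 = A | evidence).
prob_rule = {
    (True, True): 0.9,     # the rule from the slide
    (True, False): 0.5,    # hypothetical values for the remaining cases
    (False, True): 0.4,
    (False, False): 0.2,
}

# The deterministic rule is the special case where one outcome gets probability 1.
det_rule = {k: (1.0 if k == (True, True) else 0.0) for k in prob_rule}

print(prob_rule[(True, True)], det_rule[(True, True)])   # 0.9 1.0
```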

  2. Bayesian Network Predictions
     • Inference queries can be used to make probabilistic predictions
     • Advantages:
       – Can predict any variable
       – Predictions can be based on partial evidence
     • Question: how do we learn the parameters of a Bayesian network?

     Statistical Learning
     • Three common approaches:
       – Bayesian learning
       – Maximum a posteriori (MAP)
       – Maximum likelihood (ML)
         • Conditional maximum likelihood

  3. Candy Example
     • Favorite candy sold in two flavors:
       – Lime (ugh)
       – Cherry (yum)
     • Same wrapper for both flavors
     • Sold in bags with different ratios:
       – 100% cherry
       – 75% cherry + 25% lime
       – 50% cherry + 50% lime
       – 25% cherry + 75% lime
       – 100% lime

     Candy Example
     • You bought a bag of candy but don't know its flavor ratio
     • After eating k candies:
       – What's the flavor ratio of the bag?
       – What will be the flavor of the next candy?

  4. Statistical Learning
     • Hypothesis H: a probabilistic theory of the world
       – h1: 100% cherry
       – h2: 75% cherry + 25% lime
       – h3: 50% cherry + 50% lime
       – h4: 25% cherry + 75% lime
       – h5: 100% lime
     • Data D: evidence about the world
       – d1: 1st candy is cherry
       – d2: 2nd candy is lime
       – d3: 3rd candy is lime
       – …

     Bayesian Learning
     • Prior: Pr(H)
     • Likelihood: Pr(d|H)
     • Evidence: d = <d1, d2, …, dn>
     • Bayesian learning amounts to computing the posterior using Bayes' theorem:
       Pr(H|d) = k Pr(d|H) Pr(H)
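A minimal sketch of this update, assuming the hypotheses are discrete and the likelihoods Pr(d|h) are given; the constant k is just the normalizer over hypotheses.

```python
def posterior(prior, likelihood):
    """Pr(h|d) = k * Pr(d|h) * Pr(h), normalized over the hypotheses.

    prior:      {h: Pr(h)}
    likelihood: {h: Pr(d|h)} for the observed evidence d
    """
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    k = 1.0 / sum(unnorm.values())   # normalization constant
    return {h: k * p for h, p in unnorm.items()}
```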

  5. Bayesian Prediction
     • Suppose we want to make a prediction about an unknown quantity X (i.e., the flavor of the next candy)
     • Pr(X|d) = Σi Pr(X|d, hi) P(hi|d) = Σi Pr(X|hi) P(hi|d)
     • Predictions are weighted averages of the predictions of the individual hypotheses
     • Hypotheses serve as "intermediaries" between the raw data and the prediction

     Candy Example
     • Assume prior P(H) = <0.1, 0.2, 0.4, 0.2, 0.1>
     • Assume candies are i.i.d. (independent and identically distributed):
       – P(d|h) = Πj P(dj|h)
     • Suppose the first 10 candies all taste lime:
       – P(d|h5) = 1^10 = 1
       – P(d|h3) = 0.5^10 ≈ 0.00098
       – P(d|h1) = 0^10 = 0
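A small sketch that reproduces these numbers for the candy example: the i.i.d. likelihood P(d|h) = P(lime|h)^n after n limes, the resulting posterior, and the Bayesian prediction as a weighted average over hypotheses. The prior and per-hypothesis lime probabilities are the ones on the slide.

```python
# Candy example (values from the slides): prior <0.1, 0.2, 0.4, 0.2, 0.1> over
# h1..h5 and P(lime|h_i) = 0, 0.25, 0.5, 0.75, 1.
prior  = {"h1": 0.1, "h2": 0.2, "h3": 0.4, "h4": 0.2, "h5": 1.0 * 0.1}
p_lime = {"h1": 0.0, "h2": 0.25, "h3": 0.5, "h4": 0.75, "h5": 1.0}

def posterior_after_limes(n):
    """Pr(h|d) after n lime candies, using the i.i.d. likelihood P(lime|h)^n."""
    unnorm = {h: prior[h] * p_lime[h] ** n for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

def predict_lime(n):
    """Bayesian prediction Pr(next is lime|d) = sum_i P(lime|h_i) P(h_i|d)."""
    post = posterior_after_limes(n)
    return sum(p_lime[h] * post[h] for h in post)

print(0.5 ** 10)                         # P(d|h3) ≈ 0.00098
print(posterior_after_limes(10)["h5"])   # ≈ 0.90, as in the posterior plot
print(predict_lime(10))                  # ≈ 0.97, next candy is almost surely lime
```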

  6. Posterior
     • [Figure: posteriors P(hi|e1…et) for h1 to h5 given data generated from h5, plotted against the number of samples (0 to 10); the posterior of h5 approaches 1 while the others go to 0.]

     Prediction
     • [Figure: Bayes predictions with data generated from h5; the probability that the next candy is lime as a function of the number of samples (0 to 10), starting at 0.5 and approaching 1.]

  7. Bayesian Learning
     • Properties of Bayesian learning:
       – Optimal (i.e., given the prior, no other prediction is correct more often than the Bayesian one)
       – No overfitting (all hypotheses are weighted and considered)
     • There is a price to pay:
       – When the hypothesis space is large, Bayesian learning may be intractable
       – i.e., the sum (or integral) over hypotheses is often intractable
     • Solution: approximate Bayesian learning

     Maximum a posteriori (MAP)
     • Idea: make predictions based on the most probable hypothesis h_MAP
       – h_MAP = argmax_hi P(hi|d)
       – P(X|d) ≈ P(X|h_MAP)
     • In contrast, Bayesian learning makes predictions based on all hypotheses, weighted by their probabilities
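A minimal sketch of the MAP approximation, assuming the posterior over hypotheses and the per-hypothesis predictions are available as dictionaries.

```python
def map_prediction(posterior, pred):
    """P(X|d) ≈ P(X|h_MAP), where h_MAP = argmax_h posterior[h].

    posterior: {h: P(h|d)}    pred: {h: P(X|h)}
    """
    h_map = max(posterior, key=posterior.get)   # single most probable hypothesis
    return pred[h_map]
```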

  8. Candy Example (MAP)
     • Prediction after:
       – 1 lime:  h_MAP = h3, Pr(lime|h_MAP) = 0.5
       – 2 limes: h_MAP = h4, Pr(lime|h_MAP) = 0.75
       – 3 limes: h_MAP = h5, Pr(lime|h_MAP) = 1
       – 4 limes: h_MAP = h5, Pr(lime|h_MAP) = 1
       – …
     • After only 3 limes, it correctly selects h5

     Candy Example (MAP)
     • But what if the correct hypothesis is h4?
       – h4: P(lime) = 0.75 and P(cherry) = 0.25
     • After 3 limes:
       – MAP incorrectly selects h5
       – MAP yields P(lime|h_MAP) = 1
       – Bayesian learning yields P(lime|d) = 0.8
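A quick check of these numbers, reusing the candy prior and likelihoods from the earlier sketch: after 3 limes the MAP hypothesis is h5 (prediction 1), while the full Bayesian prediction is about 0.8.

```python
prior  = {"h1": 0.1, "h2": 0.2, "h3": 0.4, "h4": 0.2, "h5": 0.1}
p_lime = {"h1": 0.0, "h2": 0.25, "h3": 0.5, "h4": 0.75, "h5": 1.0}

unnorm = {h: prior[h] * p_lime[h] ** 3 for h in prior}   # P(h) P(d|h) after 3 limes
z = sum(unnorm.values())
post = {h: p / z for h, p in unnorm.items()}

h_map = max(post, key=post.get)
print(h_map, p_lime[h_map])                    # h5 1.0   (MAP prediction)
print(sum(p_lime[h] * post[h] for h in post))  # ≈ 0.796  (Bayesian prediction ≈ 0.8)
```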

  9. MAP properties
     • MAP predictions are less accurate than Bayesian predictions since they rely on only one hypothesis h_MAP
     • But MAP and Bayesian predictions converge as the amount of data increases
     • Controlled overfitting (the prior can be used to penalize complex hypotheses)
     • Finding h_MAP may be intractable:
       – h_MAP = argmax_h P(h|d)
       – This optimization may be difficult

     MAP computation
     • Optimization:
       – h_MAP = argmax_h P(h|d) = argmax_h P(h) P(d|h) = argmax_h P(h) Πi P(di|h)
     • The product makes the optimization non-linear
     • Take the log to turn the product into a sum:
       – h_MAP = argmax_h [log P(h) + Σi log P(di|h)]
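A sketch of this log-space MAP objective for the candy example; the list of observations ('lime'/'cherry' strings) is a hypothetical input.

```python
import math

prior  = {"h1": 0.1, "h2": 0.2, "h3": 0.4, "h4": 0.2, "h5": 0.1}
p_lime = {"h1": 0.0, "h2": 0.25, "h3": 0.5, "h4": 0.75, "h5": 1.0}

def log_map_score(h, observations):
    """log P(h) + sum_i log P(d_i|h), with log 0 treated as -infinity."""
    score = math.log(prior[h])
    for d in observations:
        p = p_lime[h] if d == "lime" else 1.0 - p_lime[h]
        score += math.log(p) if p > 0 else float("-inf")
    return score

obs = ["lime", "lime", "lime"]
print(max(prior, key=lambda h: log_map_score(h, obs)))   # h5
```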

 10. Maximum Likelihood (ML)
     • Idea: simplify MAP by assuming a uniform prior (i.e., P(hi) = P(hj) ∀ i, j)
       – h_MAP = argmax_h P(h) P(d|h)
       – h_ML  = argmax_h P(d|h)
     • Make predictions based on h_ML only:
       – P(X|d) ≈ P(X|h_ML)

     Candy Example (ML)
     • Prediction after:
       – 1 lime:  h_ML = h5, Pr(lime|h_ML) = 1
       – 2 limes: h_ML = h5, Pr(lime|h_ML) = 1
       – …
     • Frequentist view: an "objective" prediction since it relies only on the data (i.e., no prior)
     • Bayesian view: a prediction based on the data and a uniform prior (since no prior ≈ uniform prior)
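A minimal ML sketch for the candy example: with the prior dropped, a single lime already makes h5 the maximum-likelihood hypothesis.

```python
p_lime = {"h1": 0.0, "h2": 0.25, "h3": 0.5, "h4": 0.75, "h5": 1.0}

def ml_hypothesis(n_limes, n_cherries):
    """h_ML = argmax_h P(lime|h)^n_limes * P(cherry|h)^n_cherries."""
    return max(p_lime, key=lambda h: p_lime[h] ** n_limes * (1 - p_lime[h]) ** n_cherries)

print(ml_hypothesis(1, 0))   # h5, as on the slide
```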

 11. ML properties
     • ML predictions are less accurate than Bayesian and MAP predictions since they ignore prior information and rely on only one hypothesis h_ML
     • But ML, MAP and Bayesian predictions converge as the amount of data increases
     • Subject to overfitting (there is no prior to penalize complex hypotheses that could exploit statistically insignificant data patterns)
     • Finding h_ML is often easier than finding h_MAP:
       – h_ML = argmax_h Σi log P(di|h)

     Statistical Learning
     • Use Bayesian learning, MAP or ML
     • Complete data:
       – When the data has multiple attributes, all attributes are known
       – Easy
     • Incomplete data:
       – When the data has multiple attributes, some attributes are unknown
       – Harder

 12. Simple ML example
     • Hypothesis hθ:
       – P(cherry) = θ and P(lime) = 1 - θ
     • Data d:
       – c cherries and l limes
     • ML hypothesis:
       – θ is the relative frequency observed in the data
       – θ = c/(c+l)
       – P(cherry) = c/(c+l) and P(lime) = l/(c+l)

     ML computation
     • 1) Likelihood expression:
       – P(d|hθ) = θ^c (1-θ)^l
     • 2) Log likelihood:
       – log P(d|hθ) = c log θ + l log(1-θ)
     • 3) Derivative of the log likelihood:
       – d(log P(d|hθ))/dθ = c/θ - l/(1-θ)
     • 4) ML hypothesis (set the derivative to zero):
       – c/θ - l/(1-θ) = 0  ⟹  θ = c/(c+l)
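A small numerical sanity check of the closed form θ = c/(c+l): the hypothetical counts below give θ = 0.7, which also maximizes the log likelihood over a fine grid of values.

```python
import math

c, l = 7, 3                     # hypothetical counts of cherries and limes
theta_ml = c / (c + l)          # closed-form ML estimate from the slide

def log_lik(theta):
    """Log likelihood c*log(theta) + l*log(1 - theta)."""
    return c * math.log(theta) + l * math.log(1 - theta)

theta_grid = max((i / 1000 for i in range(1, 1000)), key=log_lik)
print(theta_ml, theta_grid)     # 0.7 0.7
```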

 13. More complicated ML example
     • Hypothesis: h_θ,θ1,θ2 (θ for the flavor, θ1 and θ2 for the wrapper color given the flavor)
     • Data:
       – c cherries
         • g_c with green wrappers
         • r_c with red wrappers
       – l limes
         • g_l with green wrappers
         • r_l with red wrappers

     ML computation
     • 1) Likelihood expression:
       – P(d|h_θ,θ1,θ2) = θ^c (1-θ)^l · θ1^r_c (1-θ1)^g_c · θ2^r_l (1-θ2)^g_l
     • 2) and 3) As before: take the log and set each partial derivative to zero
     • 4) ML hypothesis:
       – c/θ - l/(1-θ) = 0  ⟹  θ = c/(c+l)
       – r_c/θ1 - g_c/(1-θ1) = 0  ⟹  θ1 = r_c/(r_c+g_c)
       – r_l/θ2 - g_l/(1-θ2) = 0  ⟹  θ2 = r_l/(r_l+g_l)
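The three estimates decouple, so each parameter is just a relative frequency of its own counts. A sketch with hypothetical counts:

```python
c, l = 60, 40          # cherries and limes (hypothetical counts)
r_c, g_c = 45, 15      # red / green wrappers among the cherries (r_c + g_c = c)
r_l, g_l = 10, 30      # red / green wrappers among the limes    (r_l + g_l = l)

theta  = c / (c + l)          # P(cherry)
theta1 = r_c / (r_c + g_c)    # P(red wrapper | cherry)
theta2 = r_l / (r_l + g_l)    # P(red wrapper | lime)

print(theta, theta1, theta2)  # 0.6 0.75 0.25
```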

 14. Naïve Bayes model
     • Want to predict a class C based on attributes Ai
     • Structure: C is the parent of A1, A2, A3, …, An
     • Parameters:
       – θ   = P(C=true)
       – θi1 = P(Ai=true | C=true)
       – θi2 = P(Ai=true | C=false)
     • Assumption: the Ai's are independent given C

     Naïve Bayes model for Restaurant Problem
     • Data: [table of restaurant examples shown on the slide]
     • ML sets:
       – θ to the relative frequencies of wait and ¬wait
       – θi1, θi2 to the relative frequencies of each attribute value given wait and ¬wait
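A sketch of these relative-frequency (ML) estimates on a tiny hypothetical dataset; the actual restaurant table from the slide is not reproduced here, and the attribute names below are made up for illustration.

```python
from collections import defaultdict

# Each example: (attribute values, wait?); hypothetical data for illustration.
data = [
    ({"hungry": True,  "patrons_some": True},  True),
    ({"hungry": True,  "patrons_some": False}, False),
    ({"hungry": False, "patrons_some": True},  True),
    ({"hungry": False, "patrons_some": False}, False),
    ({"hungry": True,  "patrons_some": True},  True),
]

n_wait = sum(1 for _, wait in data if wait)
theta = n_wait / len(data)                   # P(wait) as a relative frequency

# counts[a] = [#(a=true and wait), #(a=true and ~wait)]
counts = defaultdict(lambda: [0, 0])
for attrs, wait in data:
    for a, value in attrs.items():
        if value:
            counts[a][0 if wait else 1] += 1

theta_i1 = {a: n[0] / n_wait for a, n in counts.items()}               # P(Ai=true | wait)
theta_i2 = {a: n[1] / (len(data) - n_wait) for a, n in counts.items()} # P(Ai=true | ~wait)
print(theta, theta_i1, theta_i2)
```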

 15. Naïve Bayes model vs decision trees
     • Wait prediction for the restaurant problem
     • [Figure: proportion correct on the test set (0.4 to 1) vs. training set size (0 to 100) for a decision tree and naïve Bayes; the decision tree reaches higher accuracy.]
     • Question: why is naïve Bayes less accurate than the decision tree?

     Bayesian network parameter learning (ML)
     • Parameters θ_V,pa(V)=v:
       – CPTs: θ_V,pa(V)=v = P(V | pa(V)=v)
     • Data d:
       – d1: <V1=v1,1, V2=v2,1, …, Vn=vn,1>
       – d2: <V1=v1,2, V2=v2,2, …, Vn=vn,2>
       – …
     • Maximum likelihood:
       – Set θ_V,pa(V)=v to the relative frequencies of the values of V given the values v of the parents of V
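A sketch of this ML rule for a single CPT entry θ_V,pa(V)=v: count the values of V among the records where its parents take the values v. The records and variable names below are hypothetical.

```python
from collections import Counter

records = [
    {"Cloudy": True,  "Sprinkler": False, "WetGrass": True},
    {"Cloudy": True,  "Sprinkler": False, "WetGrass": False},
    {"Cloudy": False, "Sprinkler": True,  "WetGrass": True},
    {"Cloudy": True,  "Sprinkler": False, "WetGrass": True},
]

def ml_cpt(records, child, parents, parent_values):
    """Relative frequencies of `child` among the records where pa(V) = v."""
    matching = [r for r in records
                if all(r[p] == v for p, v in zip(parents, parent_values))]
    counts = Counter(r[child] for r in matching)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

# Estimate P(WetGrass | Cloudy=True, Sprinkler=False) from the data
print(ml_cpt(records, "WetGrass", ["Cloudy", "Sprinkler"], [True, False]))
# {True: 0.667, False: 0.333}
```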
