
Statistical Learning [RN2 Sec 20.1-20.2] [RN3 Sec 20.1-20.2]
CS 486/686, University of Waterloo, Lecture 15: Oct 30, 2012
Lecture slides (c) 2012 P. Poupart



  1. Statistical Learning [RN2 Sec 20.1-20.2] [RN3 Sec 20.1-20.2]
     CS 486/686, University of Waterloo, Lecture 15: Oct 30, 2012

     Outline
     • Statistical learning
       – Bayesian learning
       – Maximum a posteriori
       – Maximum likelihood
     • Learning from complete data

  2. Statistical Learning
     • View: we have uncertain knowledge of the world
     • Idea: learning simply reduces this uncertainty

     Candy Example
     • Favorite candy sold in two flavors:
       – Lime (ugh)
       – Cherry (yum)
     • Same wrapper for both flavors
     • Sold in bags with different ratios:
       – 100% cherry
       – 75% cherry + 25% lime
       – 50% cherry + 50% lime
       – 25% cherry + 75% lime
       – 100% lime

  3. Candy Example
     • You bought a bag of candy but don't know its flavor ratio
     • After eating k candies:
       – What's the flavor ratio of the bag?
       – What will be the flavor of the next candy?

     Statistical Learning
     • Hypothesis H: probabilistic theory of the world
       – h1: 100% cherry
       – h2: 75% cherry + 25% lime
       – h3: 50% cherry + 50% lime
       – h4: 25% cherry + 75% lime
       – h5: 100% lime
     • Data D: evidence about the world
       – d1: 1st candy is cherry
       – d2: 2nd candy is lime
       – d3: 3rd candy is lime
       – ...

  4. Bayesian Learning
     • Prior: Pr(H)
     • Likelihood: Pr(d|H)
     • Evidence: d = <d1, d2, ..., dn>
     • Bayesian learning amounts to computing the posterior using Bayes' theorem:
       Pr(H|d) = k Pr(d|H) Pr(H), where k is a normalization constant

     Bayesian Prediction
     • Suppose we want to make a prediction about an unknown quantity X (e.g., the flavor of the next candy)
     • Pr(X|d) = Σi Pr(X|d,hi) P(hi|d) = Σi Pr(X|hi) P(hi|d)
     • Predictions are weighted averages of the predictions of the individual hypotheses
     • Hypotheses serve as "intermediaries" between raw data and predictions
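To make the two formulas on this slide concrete, here is a minimal Python sketch (not from the slides; function names and the plain-list representation are illustrative assumptions). It normalizes prior times likelihood to get the posterior, then forms the Bayesian prediction as a posterior-weighted average of the per-hypothesis predictions.

```python
def posterior(prior, likelihoods):
    """Pr(h_i|d) is proportional to Pr(d|h_i) * Pr(h_i); normalize so the values sum to 1."""
    unnorm = [p * l for p, l in zip(prior, likelihoods)]
    z = sum(unnorm)  # the normalization constant k is 1/z
    return [u / z for u in unnorm]

def bayes_predict(post, pred_given_h):
    """Pr(X|d) = sum_i Pr(X|h_i) * Pr(h_i|d): a posterior-weighted average."""
    return sum(px * ph for px, ph in zip(pred_given_h, post))

# Toy two-hypothesis check: prior 50/50, likelihoods 0.2 vs 0.8, predictions 0.1 vs 0.9.
print(bayes_predict(posterior([0.5, 0.5], [0.2, 0.8]), [0.1, 0.9]))  # 0.74
```

The candy example on the next slide plugs the lecture's specific numbers into these same two steps.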

  5. Candy Example
     • Assume prior P(H) = <0.1, 0.2, 0.4, 0.2, 0.1>
     • Assume candies are i.i.d. (independently and identically distributed):
       – P(d|h) = Πj P(dj|h)
     • Suppose the first 10 candies all taste lime:
       – P(d|h5) = 1^10 = 1
       – P(d|h3) = 0.5^10 ≈ 0.00098
       – P(d|h1) = 0^10 = 0

     Posterior
     [Figure: posteriors P(hi|e1...et), i = 1..5, versus the number of samples (0 to 10), for data generated from h5.]
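As a sanity check on these numbers, the sketch below (Python; variable names are illustrative) recomputes the i.i.d. likelihoods P(d|hi) = P(lime|hi)^10 after 10 lime candies and the resulting posterior under the prior <0.1, 0.2, 0.4, 0.2, 0.1>. The posterior mass concentrates on h5, which is the behavior the figure above depicts.

```python
prior = [0.1, 0.2, 0.4, 0.2, 0.1]        # P(h1), ..., P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]     # P(lime | h_i) for h1..h5

n_limes = 10
likelihood = [p ** n_limes for p in p_lime]              # i.i.d.: product of per-candy probabilities
unnorm = [pr * lk for pr, lk in zip(prior, likelihood)]
z = sum(unnorm)
post = [u / z for u in unnorm]

print([round(p, 3) for p in post])   # roughly [0.0, 0.0, 0.003, 0.101, 0.896]: most mass on h5
p_next_lime = sum(pl * po for pl, po in zip(p_lime, post))
print(round(p_next_lime, 2))         # ~0.97: the next candy is almost surely lime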

  6. Prediction
     [Figure: Bayes predictions with data generated from h5; probability that the next candy is lime versus the number of samples (0 to 10).]

     Bayesian Learning
     • Bayesian learning properties:
       – Optimal (i.e., given the prior, no other prediction is correct more often than the Bayesian one)
       – No overfitting (all hypotheses are weighted and considered)
     • There is a price to pay:
       – When the hypothesis space is large, Bayesian learning may be intractable
       – i.e., the sum (or integral) over hypotheses is often intractable
     • Solution: approximate Bayesian learning

  7. Maximum a posteriori (MAP)
     • Idea: make predictions based on the most probable hypothesis hMAP
       – hMAP = argmax_hi P(hi|d)
       – P(X|d) ≈ P(X|hMAP)
     • In contrast, Bayesian learning makes predictions based on all hypotheses, weighted by their probability

     Candy Example (MAP)
     • Prediction after
       – 1 lime: hMAP = h3, Pr(lime|hMAP) = 0.5
       – 2 limes: hMAP = h4, Pr(lime|hMAP) = 0.75
       – 3 limes: hMAP = h5, Pr(lime|hMAP) = 1
       – 4 limes: hMAP = h5, Pr(lime|hMAP) = 1
       – ...
     • After only 3 limes, it correctly selects h5

  8. Candy Example (MAP)
     • But what if the correct hypothesis is h4?
       – h4: P(lime) = 0.75 and P(cherry) = 0.25
     • After 3 limes:
       – MAP incorrectly selects h5
       – MAP yields P(lime|hMAP) = 1
       – Bayesian learning yields P(lime|d) ≈ 0.8

     MAP Properties
     • MAP predictions are less accurate than Bayesian predictions since they rely on only one hypothesis, hMAP
     • But MAP and Bayesian predictions converge as the amount of data increases
     • Controlled overfitting (the prior can be used to penalize complex hypotheses)
     • Finding hMAP may be intractable:
       – hMAP = argmax_h P(h|d)
       – This optimization may be difficult
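The contrast on this slide is easy to reproduce. The sketch below (Python; a minimal sketch assuming the same prior <0.1, 0.2, 0.4, 0.2, 0.1> as before) finds hMAP after three limes and compares its prediction to the full Bayesian one, which comes out near 0.8.

```python
prior = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]     # P(lime | h_i)

def posterior_after_limes(k):
    """Posterior over h1..h5 after observing k limes in a row."""
    unnorm = [pr * (pl ** k) for pr, pl in zip(prior, p_lime)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

post = posterior_after_limes(3)
h_map = max(range(5), key=lambda i: post[i])    # index of the most probable hypothesis
map_pred = p_lime[h_map]                        # P(lime | h_MAP)
bayes_pred = sum(pl * po for pl, po in zip(p_lime, post))

print(h_map + 1, map_pred)     # 5 1.0  -> MAP picks h5 and predicts lime with certainty
print(round(bayes_pred, 2))    # 0.8    -> the Bayesian prediction stays more cautious
```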

  9. MAP Computation
     • Optimization:
       – hMAP = argmax_h P(h|d)
              = argmax_h P(h) P(d|h)
              = argmax_h P(h) Πi P(di|h)
     • The product makes the optimization non-linear
     • Take the log to turn the product into a sum:
       – hMAP = argmax_h log P(h) + Σi log P(di|h)

     Maximum Likelihood (ML)
     • Idea: simplify MAP by assuming a uniform prior (i.e., P(hi) = P(hj) for all i, j)
       – hMAP = argmax_h P(h) P(d|h)
       – hML = argmax_h P(d|h)
     • Make predictions based on hML only:
       – P(X|d) ≈ P(X|hML)
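Working in log space, as the slide suggests, also avoids numerical underflow when there are many observations. Below is a hedged Python sketch of the log-space argmax (the representation of hypotheses as per-outcome probability tables and all names are assumptions for illustration); dropping the log-prior term from the score gives hML instead of hMAP.

```python
import math

def map_hypothesis(priors, outcome_probs, data):
    """priors[i]: P(h_i); outcome_probs[i]: dict outcome -> P(outcome|h_i); data: observed outcomes."""
    def score(i):
        log_prior = math.log(priors[i]) if priors[i] > 0 else float("-inf")
        log_lik = 0.0
        for d in data:
            p = outcome_probs[i].get(d, 0.0)
            log_lik += math.log(p) if p > 0 else float("-inf")
        return log_prior + log_lik          # log P(h) + sum_i log P(d_i|h)
    return max(range(len(priors)), key=score)

# Candy example: 3 limes observed.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
tables = [{"cherry": 1 - p, "lime": p} for p in (0.0, 0.25, 0.5, 0.75, 1.0)]
print(map_hypothesis(priors, tables, ["lime"] * 3))  # 4, i.e. h5
```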

  10. Candy Example (ML)
      • Prediction after
        – 1 lime: hML = h5, Pr(lime|hML) = 1
        – 2 limes: hML = h5, Pr(lime|hML) = 1
        – ...
      • Frequentist view: an "objective" prediction since it relies only on the data (i.e., no prior)
      • Bayesian view: a prediction based on the data and a uniform prior (since no prior ⇒ uniform prior)

      ML Properties
      • ML predictions are less accurate than Bayesian and MAP predictions since they ignore prior information and rely on only one hypothesis, hML
      • But ML, MAP and Bayesian predictions converge as the amount of data increases
      • Subject to overfitting (no prior to penalize complex hypotheses that could exploit statistically insignificant data patterns)
      • Finding hML is often easier than finding hMAP:
        – hML = argmax_h Σi log P(di|h)

  11. Statistical Learning
      • Use Bayesian learning, MAP or ML
      • Complete data:
        – When data has multiple attributes, all attributes are known
        – Easy
      • Incomplete data:
        – When data has multiple attributes, some attributes are unknown
        – Harder

      Simple ML Example
      • Hypothesis hθ:
        – P(cherry) = θ and P(lime) = 1 - θ
      • Data d:
        – c cherries and l limes
      • ML hypothesis:
        – θ is the relative frequency in the observed data
        – θ = c/(c+l)
        – P(cherry) = c/(c+l) and P(lime) = l/(c+l)
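For the single-parameter hypothesis hθ, the ML estimate is just this relative frequency. A minimal sketch (the function name and example counts are illustrative, not from the slides):

```python
def ml_estimate(c, l):
    """ML parameter for h_theta given c cherries and l limes: theta = c / (c + l)."""
    theta = c / (c + l)
    return {"cherry": theta, "lime": 1 - theta}

print(ml_estimate(3, 7))  # {'cherry': 0.3, 'lime': 0.7}
```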

  12. ML Computation
      • 1) Likelihood expression:
        – P(d|hθ) = θ^c (1-θ)^l
      • 2) Log likelihood:
        – log P(d|hθ) = c log θ + l log(1-θ)
      • 3) Derivative of the log likelihood:
        – d(log P(d|hθ))/dθ = c/θ - l/(1-θ)
      • 4) ML hypothesis (set the derivative to 0):
        – c/θ - l/(1-θ) = 0  ⇒  θ = c/(c+l)

      More Complicated ML Example
      • Hypothesis: hθ,θ1,θ2
      • Data:
        – c cherries
          • gc with green wrappers
          • rc with red wrappers
        – l limes
          • gl with green wrappers
          • rl with red wrappers
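The derivative argument above can also be double-checked numerically. The sketch below (Python; the counts c = 3, l = 7 are assumed purely for illustration) evaluates the log likelihood c log θ + l log(1-θ) on a grid and confirms the maximum sits at θ = c/(c+l) = 0.3.

```python
import math

c, l = 3, 7                                # assumed example counts

def log_lik(theta):
    return c * math.log(theta) + l * math.log(1 - theta)

grid = [i / 1000 for i in range(1, 1000)]  # theta in (0, 1), endpoints excluded
best = max(grid, key=log_lik)
print(best, c / (c + l))                   # 0.3 0.3
```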

  13. ML Computation
      • 1) Likelihood expression:
        – P(d|hθ,θ1,θ2) = θ^c (1-θ)^l θ1^rc (1-θ1)^gc θ2^rl (1-θ2)^gl
      • ...
      • 4) ML hypothesis:
        – c/θ - l/(1-θ) = 0  ⇒  θ = c/(c+l)
        – rc/θ1 - gc/(1-θ1) = 0  ⇒  θ1 = rc/(rc+gc)
        – rl/θ2 - gl/(1-θ2) = 0  ⇒  θ2 = rl/(rl+gl)

      Laplace Smoothing
      • An important case of overfitting happens when there is no sample for a certain outcome
        – E.g., no cherries eaten so far
        – P(cherry) = θ = c/(c+l) = 0
        – Zero probabilities are dangerous: they rule out outcomes
      • Solution: Laplace (add-one) smoothing
        – Add 1 to all counts
        – P(cherry) = θ = (c+1)/(c+l+2) > 0
        – Much better results in practice
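Add-one smoothing is a one-line change to the relative-frequency estimate. A hedged sketch (Python; names are illustrative), shown for the flavor parameter θ and easily repeated for the wrapper parameters θ1 and θ2:

```python
def estimate_theta(c, l, laplace=False):
    """P(cherry): relative frequency, optionally with add-one (Laplace) smoothing."""
    if laplace:
        return (c + 1) / (c + l + 2)   # add 1 to each of the two outcome counts
    return c / (c + l)

print(estimate_theta(0, 10))                 # 0.0   -- rules out cherry entirely
print(estimate_theta(0, 10, laplace=True))   # ~0.083 -- keeps cherry possible
```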

  14. Naïve Bayes Model
      • Want to predict a class C based on attributes Ai
        [Diagram: class node C with child attribute nodes A1, A2, A3, ..., An]
      • Parameters:
        – θ = P(C=true)
        – θi1 = P(Ai=true|C=true)
        – θi2 = P(Ai=true|C=false)
      • Assumption: the Ai's are independent given C

      Naïve Bayes Model for the Restaurant Problem
      • Data: the restaurant examples (table not reproduced here)
      • ML sets
        – θ to the relative frequencies of wait and ~wait
        – θi1, θi2 to the relative frequencies of each attribute value given wait and ~wait
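Under the ML recipe on this slide, learning a naïve Bayes model is just counting. A compact sketch (Python, with boolean attributes; all names and the tuple-based data format are assumptions, and it presumes both classes appear in the data):

```python
def learn_naive_bayes(examples):
    """examples: list of (attribute_tuple, class_bool). Returns ML parameters (theta, theta1, theta2)."""
    pos = [a for a, c in examples if c]
    neg = [a for a, c in examples if not c]
    n_attrs = len(examples[0][0])
    theta = len(pos) / len(examples)                                        # P(C=true)
    theta1 = [sum(a[i] for a in pos) / len(pos) for i in range(n_attrs)]    # P(A_i=true | C=true)
    theta2 = [sum(a[i] for a in neg) / len(neg) for i in range(n_attrs)]    # P(A_i=true | C=false)
    return theta, theta1, theta2

def predict(params, attrs):
    """P(C=true | attrs), using the conditional-independence assumption."""
    theta, theta1, theta2 = params
    p_true, p_false = theta, 1 - theta
    for a, t1, t2 in zip(attrs, theta1, theta2):
        p_true *= t1 if a else 1 - t1
        p_false *= t2 if a else 1 - t2
    return p_true / (p_true + p_false)

ex = [((True, False), True), ((True, True), True), ((False, False), False)]
print(predict(learn_naive_bayes(ex), (True, False)))
# 1.0 -- the zero count P(A1=true|C=false)=0 rules out C=false entirely,
# exactly the overfitting that Laplace smoothing (previous slide) addresses.
```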

  15. Naïve Bayes Model vs Decision Trees
      • Wait prediction for the restaurant problem
        [Figure: proportion correct on the test set versus training set size (0 to 100) for a decision tree and naïve Bayes. Question posed on the slide: why is naïve Bayes less accurate than the decision tree?]

      Bayesian Network Parameter Learning (ML)
      • Parameters θV,pa(V)=v:
        – CPTs: θV,pa(V)=v = P(V|pa(V)=v)
      • Data d:
        – d1: <V1=v1,1, V2=v2,1, ..., Vn=vn,1>
        – d2: <V1=v1,2, V2=v2,2, ..., Vn=vn,2>
        – ...
      • Maximum likelihood:
        – Set θV,pa(V)=v to the relative frequencies of the values of V given the values v of the parents of V
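The same counting idea gives the ML estimate of every CPT entry in a Bayesian network when the data are complete. A sketch (Python; the dict-of-counts representation, function name, and the toy records are assumptions for illustration, not from the slides):

```python
from collections import Counter, defaultdict

def ml_cpt(data, child, parents):
    """Estimate P(child | parents) by relative frequency from complete records.
    data: list of dicts mapping variable name -> value."""
    joint = Counter()            # counts of (parent values, child value)
    parent_totals = Counter()    # counts of parent values alone
    for record in data:
        pa = tuple(record[p] for p in parents)
        joint[(pa, record[child])] += 1
        parent_totals[pa] += 1
    cpt = defaultdict(dict)
    for (pa, v), n in joint.items():
        cpt[pa][v] = n / parent_totals[pa]   # theta_{V, pa(V)=pa}
    return dict(cpt)

# Hypothetical complete records for P(Lime | Bag).
records = [{"Bag": "b1", "Lime": True}, {"Bag": "b1", "Lime": False},
           {"Bag": "b2", "Lime": True}, {"Bag": "b2", "Lime": True}]
print(ml_cpt(records, "Lime", ["Bag"]))
# {('b1',): {True: 0.5, False: 0.5}, ('b2',): {True: 1.0}}
```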
