Statistical Learning [RN2 Sec 20.1-20.2] [RN3 Sec 20.1-20.2]

CS 486/686 University of Waterloo Lecture 15: Oct 30, 2012

CS486/686 Lecture Slides (c) 2012 P. Poupart


Outline

  • Statistical learning
    – Bayesian learning
    – Maximum a posteriori
    – Maximum likelihood
  • Learning from complete data

Statistical Learning

  • View: we have uncertain knowledge of the world
  • Idea: learning simply reduces this uncertainty


Candy Example

  • Favorite candy sold in two flavors:
    – Lime (ugh)
    – Cherry (yum)
  • Same wrapper for both flavors
  • Sold in bags with different ratios:
    – 100% cherry
    – 75% cherry + 25% lime
    – 50% cherry + 50% lime
    – 25% cherry + 75% lime
    – 100% lime


Candy Example

  • You bought a bag of candy but don’t know its flavor ratio
  • After eating k candies:
    – What’s the flavor ratio of the bag?
    – What will be the flavor of the next candy?


Statistical Learning

  • Hypothesis H: probabilistic theory of the world
    – h1: 100% cherry
    – h2: 75% cherry + 25% lime
    – h3: 50% cherry + 50% lime
    – h4: 25% cherry + 75% lime
    – h5: 100% lime
  • Data D: evidence about the world
    – d1: 1st candy is cherry
    – d2: 2nd candy is lime
    – d3: 3rd candy is lime
    – …


Bayesian Learning

  • Prior: Pr(H)
  • Likelihood: Pr(d|H)
  • Evidence: d = <d1, d2, …, dn>
  • Bayesian learning amounts to computing the posterior using Bayes’ Theorem:
    Pr(H|d) = k Pr(d|H) Pr(H), where k is a normalizing constant
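To make the update concrete, here is a minimal Python sketch (not from the slides): multiply the prior by the likelihood hypothesis-by-hypothesis, then normalize so the result sums to 1.

    def posterior(prior, likelihood):
        """Pr(H|d) = k * Pr(d|H) * Pr(H), with k chosen so the result sums to 1."""
        unnormalized = [p * l for p, l in zip(prior, likelihood)]
        k = 1.0 / sum(unnormalized)
        return [k * u for u in unnormalized]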


Bayesian Prediction

  • Suppose we want to make a prediction about an unknown quantity X
    (e.g., the flavor of the next candy)
  • Pr(X|d) = Σi Pr(X|d,hi) P(hi|d)
            = Σi Pr(X|hi) P(hi|d)
  • Predictions are weighted averages of the predictions of the individual
    hypotheses (see the sketch below)
  • Hypotheses serve as “intermediaries” between raw data and predictions
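A matching one-line Python sketch (again not from the slides) of the weighted average, reusing the posterior helper above:

    def predict(p_x_given_h, posterior_h):
        """Pr(X|d) = sum_i Pr(X|h_i) * Pr(h_i|d): a posterior-weighted average."""
        return sum(px * ph for px, ph in zip(p_x_given_h, posterior_h))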


Candy Example

  • Assume prior P(H) = <0.1, 0.2, 0.4, 0.2, 0.1>
  • Assume candies are i.i.d. (independently and identically distributed):
    – P(d|h) = Πj P(dj|h)
  • Suppose the first 10 candies all taste lime:
    – P(d|h5) = 1^10 = 1
    – P(d|h3) = 0.5^10 ≈ 0.00098
    – P(d|h1) = 0^10 = 0
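These numbers are easy to reproduce; a self-contained sketch (assuming the 10-lime observation above):

    # P(lime) under each hypothesis h1..h5, and the prior from the slide.
    p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]
    prior  = [0.1, 0.2, 0.4, 0.2, 0.1]

    # Likelihood of 10 limes in a row under each hypothesis (i.i.d. product).
    likelihood = [p ** 10 for p in p_lime]   # h5: 1.0, h3: ~0.00098, h1: 0.0

    # Posterior: normalize prior * likelihood.
    unnorm = [pr * lk for pr, lk in zip(prior, likelihood)]
    post = [u / sum(unnorm) for u in unnorm]

    # Bayesian prediction that the next candy is lime (close to 1).
    p_next_lime = sum(p * q for p, q in zip(p_lime, post))
    print(post, p_next_lime)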


Posterior

[Figure: posteriors P(hi|e1…et) vs. number of samples, given data generated from h5; curves for P(h1|E) through P(h5|E)]


Prediction

[Figure: Bayes predictions with data generated from h5; probability that the next candy is lime vs. number of samples]


Bayesian Learning

  • Bayesian learning properties:
    – Optimal (i.e., given the prior, no other prediction is correct more
      often than the Bayesian one)
    – No overfitting (all hypotheses are weighted and considered)
  • There is a price to pay:
    – When the hypothesis space is large, Bayesian learning may be intractable
    – i.e., the sum (or integral) over hypotheses is often intractable
  • Solution: approximate Bayesian learning

Maximum a posteriori (MAP)

  • Idea: make predictions based on the most probable hypothesis hMAP
    – hMAP = argmaxhi P(hi|d)
    – P(X|d) ≈ P(X|hMAP)
  • In contrast, Bayesian learning makes predictions based on all
    hypotheses, weighted by their probability
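A minimal Python sketch of MAP prediction (not from the slides; the numbers are the candy example’s, so after 3 limes it picks h5):

    def map_prediction(prior, likelihood, p_x_given_h):
        """Predict with h_MAP = argmax_h P(h|d) alone."""
        # P(h|d) is proportional to P(h) * P(d|h), so argmax needs no normalization.
        scores = [p * l for p, l in zip(prior, likelihood)]
        h_map = max(range(len(scores)), key=lambda i: scores[i])
        return p_x_given_h[h_map]

    p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]
    prior  = [0.1, 0.2, 0.4, 0.2, 0.1]
    print(map_prediction(prior, [p ** 3 for p in p_lime], p_lime))  # 1.0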


Candy Example (MAP)

  • Prediction after
    – 1 lime: hMAP = h3, Pr(lime|hMAP) = 0.5
    – 2 limes: hMAP = h4, Pr(lime|hMAP) = 0.75
    – 3 limes: hMAP = h5, Pr(lime|hMAP) = 1
    – 4 limes: hMAP = h5, Pr(lime|hMAP) = 1
    – …
  • After only 3 limes, it correctly selects h5


Candy Example (MAP)

  • But what if the correct hypothesis is h4?
    – h4: P(lime) = 0.75 and P(cherry) = 0.25
  • After 3 limes:
    – MAP incorrectly predicts h5
    – MAP yields P(lime|hMAP) = 1
    – Bayesian learning yields P(lime|d) = 0.8


MAP properties

  • MAP predictions are less accurate than Bayesian predictions since they
    rely on only one hypothesis hMAP
  • But MAP and Bayesian predictions converge as the amount of data increases
  • Controlled overfitting (the prior can be used to penalize complex
    hypotheses)
  • Finding hMAP may be intractable:
    – hMAP = argmaxh P(h|d)
    – The optimization may be difficult


MAP computation

  • Optimization:
    – hMAP = argmaxh P(h|d)
           = argmaxh P(h) P(d|h)
           = argmaxh P(h) Πi P(di|h)
  • The product induces a non-linear optimization
  • Take the log to linearize the optimization:
    – hMAP = argmaxh log P(h) + Σi log P(di|h)
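In code, the log trick looks like this (a sketch, assuming each hypothesis supplies its per-datum likelihoods P(di|h)):

    import math

    def h_map(prior, per_datum_likelihoods):
        """h_MAP = argmax_h [log P(h) + sum_i log P(d_i|h)].

        per_datum_likelihoods[h] is the list [P(d1|h), P(d2|h), ...].
        Summing logs avoids the numerical underflow of a long product.
        """
        def score(h):
            # log 0 = -inf correctly eliminates hypotheses that rule out the data.
            log_p = math.log(prior[h]) if prior[h] > 0 else float("-inf")
            return log_p + sum(math.log(p) if p > 0 else float("-inf")
                               for p in per_datum_likelihoods[h])
        return max(range(len(prior)), key=score)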


Maximum Likelihood (ML)

  • Idea: simplify MAP by assuming a uniform prior (i.e., P(hi) = P(hj) ∀ i,j)
    – hMAP = argmaxh P(h) P(d|h)
    – hML = argmaxh P(d|h)
  • Make predictions based on hML only:
    – P(X|d) ≈ P(X|hML)
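Since ML is just MAP with a flat prior, a sketch reduces to dropping the prior term (reusing the hypothetical h_map helper above):

    def h_ml(per_datum_likelihoods):
        """h_ML = argmax_h sum_i log P(d_i|h): MAP with a uniform prior."""
        n = len(per_datum_likelihoods)
        return h_map([1.0 / n] * n, per_datum_likelihoods)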


Candy Example (ML)

  • Prediction after
    – 1 lime: hML = h5, Pr(lime|hML) = 1
    – 2 limes: hML = h5, Pr(lime|hML) = 1
    – …
  • Frequentist view: “objective” prediction since it relies only on the
    data (i.e., no prior)
  • Bayesian view: prediction based on the data and a uniform prior
    (since no prior ⇒ uniform prior)


ML properties

  • ML predictions are less accurate than Bayesian and MAP predictions
    since they ignore prior information and rely on only one hypothesis hML
  • But ML, MAP and Bayesian predictions converge as the amount of data
    increases
  • Subject to overfitting (no prior to penalize complex hypotheses that
    could exploit statistically insignificant data patterns)
  • Finding hML is often easier than finding hMAP:
    – hML = argmaxh Σi log P(di|h)


Statistical Learning

  • Use Bayesian learning, MAP or ML
  • Complete data:
    – When data has multiple attributes, all attributes are known
    – Easy
  • Incomplete data:
    – When data has multiple attributes, some attributes are unknown
    – Harder


Simple ML example

  • Hypothesis h:

– P(cherry)= & P(lime)=1-

  • Data d:

– c cherries and l limes

  • ML hypothesis:

–  is relative frequency of observed data –  = c/(c+l) – P(cherry) = c/(c+l) and P(lime)= l/(c+l)


ML computation

  • 1) Likelihood expression:
    – P(d|hθ) = θ^c (1−θ)^l
  • 2) Log likelihood:
    – log P(d|hθ) = c log θ + l log (1−θ)
  • 3) Log likelihood derivative:
    – d(log P(d|hθ))/dθ = c/θ − l/(1−θ)
  • 4) ML hypothesis:
    – c/θ − l/(1−θ) = 0 ⇒ θ = c/(c+l)


More complicated ML example

  • Hypothesis: h,1,2
  • Data:

– c cherries

  • gc green wrappers
  • rc red wrappers

– l limes

  • gl green wrappers
  • rl red wrappers

ML computation

  • 1) Likelihood expression:
    – P(d|hθ,θ1,θ2) = θ^c (1−θ)^l · θ1^rc (1−θ1)^gc · θ2^rl (1−θ2)^gl
  • 4) ML hypothesis (after taking the log and setting each derivative
    to 0, as in steps 2 and 3 of the previous slide):
    – c/θ − l/(1−θ) = 0 ⇒ θ = c/(c+l)
    – rc/θ1 − gc/(1−θ1) = 0 ⇒ θ1 = rc/(rc+gc)
    – rl/θ2 − gl/(1−θ2) = 0 ⇒ θ2 = rl/(rl+gl)


Laplace Smoothing

  • An important case of overfitting happens when there is no sample for a
    certain outcome
    – E.g., no cherries eaten so far
    – P(cherry) = θ = c/(c+l) = 0
    – Zero probabilities are dangerous: they rule out outcomes
  • Solution: Laplace (add-one) smoothing
    – Add 1 to all counts
    – P(cherry) = θ = (c+1)/(c+l+2) > 0
    – Much better results in practice
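A sketch of the two estimators side by side (two outcomes, so the denominator gains 2):

    def ml_estimate(c, l):
        """Unsmoothed ML: a zero count yields a zero probability."""
        return c / (c + l)

    def laplace_estimate(c, l):
        """Add-one smoothing: add 1 to each of the two outcome counts."""
        return (c + 1) / (c + l + 2)

    print(ml_estimate(0, 10))       # 0.0, rules out cherry entirely
    print(laplace_estimate(0, 10))  # ~0.083, cherry remains possible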


Naïve Bayes model

[Figure: naïve Bayes network with class node C as parent of attribute nodes A1, A2, A3, …, An]

  • Want to predict a class C based on attributes Ai
  • Parameters:
    – θ = P(C=true)
    – θi1 = P(Ai=true|C=true)
    – θi2 = P(Ai=true|C=false)
  • Assumption: the Ai’s are independent given C
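A compact sketch of this model (boolean class and attributes; the function and variable names are hypothetical):

    def train_naive_bayes(examples):
        """ML estimates for a boolean naive Bayes model.

        examples: list of (attributes, cls) pairs, where attributes is a
        list of bools and cls is a bool. Assumes both classes occur.
        Returns theta = P(C=true), theta1[i] = P(Ai=true|C=true),
        theta2[i] = P(Ai=true|C=false), all as relative frequencies.
        """
        pos = [a for a, c in examples if c]
        neg = [a for a, c in examples if not c]
        n = len(examples[0][0])
        theta = len(pos) / len(examples)
        theta1 = [sum(a[i] for a in pos) / len(pos) for i in range(n)]
        theta2 = [sum(a[i] for a in neg) / len(neg) for i in range(n)]
        return theta, theta1, theta2

    def predict_class(attrs, theta, theta1, theta2):
        """P(C=true|attrs) by Bayes' rule under the independence assumption."""
        p_t, p_f = theta, 1 - theta
        for a, t1, t2 in zip(attrs, theta1, theta2):
            p_t *= t1 if a else 1 - t1
            p_f *= t2 if a else 1 - t2
        return p_t / (p_t + p_f)

For the restaurant problem on the next slide, examples would be the restaurant records with wait/~wait as the class.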


Naïve Bayes model for Restaurant Problem

  • Data: [table of restaurant examples]
  • ML sets
    – θ to the relative frequencies of wait and ~wait
    – θi1, θi2 to the relative frequencies of each attribute value given
      wait and ~wait


Naïve Bayes model vs decision trees

  • Wait prediction for the restaurant problem

[Figure: proportion correct on the test set vs. training set size, for a decision tree and naïve Bayes]

Why is naïve Bayes less accurate than the decision tree?


Bayesian network parameter learning (ML)

  • Parameters V,pa(V)=v:

– CPTs: V,pa(V)=v = P(V|pa(V)=v)

  • Data d:

– d1 : <V1=v1,1, V2=v2,1, …, Vn = vn,1> – d2 : <V1=v1,2, V2=v2,2, …, Vn = vn,2> – …

  • Maximum likelihood:

– Set V,pa(V)=v to the relative frequencies of the values of V given the values v of the parents of V


Next Class

  • Next class:
    – Continue statistical learning
    – Learning from incomplete data