Statistical Learning

CS 786 University of Waterloo Lecture 6: May 17, 2012

Decision Tree Predictions

  • Can make deterministic and probabilistic predictions
    – Deterministic rule: 485 ∧ 231 ⟹ 786
    – Probabilistic rule: Pr(786 | 485 ∧ 231) = 0.9
  • Probabilistic rule is a conditional distribution… could we use Bayes nets instead of decision trees?


Bayesian Network Predictions

  • Inference queries can be used to make probabilistic predictions
  • Advantages:
    – Predict any variable
    – Prediction based on partial evidence
  • Question: how do we learn the parameters of a Bayesian network?


Statistical Learning

  • Three common approaches:
    – Bayesian learning
    – Maximum a posteriori (MAP)
    – Maximum likelihood (ML)
      • Conditional maximum likelihood

Candy Example

  • Favorite candy sold in two flavors:
    – Lime (ugh)
    – Cherry (yum)
  • Same wrapper for both flavors
  • Sold in bags with different ratios:
    – 100% cherry
    – 75% cherry + 25% lime
    – 50% cherry + 50% lime
    – 25% cherry + 75% lime
    – 100% lime


Candy Example

  • You bought a bag of candy but don’t know its flavor ratio
  • After eating k candies:
    – What’s the flavor ratio of the bag?
    – What will be the flavor of the next candy?


Statistical Learning

  • Hypothesis H: probabilistic theory of the world
    – h1: 100% cherry
    – h2: 75% cherry + 25% lime
    – h3: 50% cherry + 50% lime
    – h4: 25% cherry + 75% lime
    – h5: 100% lime
  • Data D: evidence about the world
    – d1: 1st candy is cherry
    – d2: 2nd candy is lime
    – d3: 3rd candy is lime
    – …


Bayesian Learning

  • Prior: Pr(H)
  • Likelihood: Pr(d|H)
  • Evidence: d = <d1, d2, …, dn>
  • Bayesian learning amounts to computing the posterior using Bayes’ theorem:
    – Pr(H|d) = k Pr(d|H) Pr(H), where k is a normalizing constant


Bayesian Prediction

  • Suppose we want to make a prediction about an unknown quantity X (e.g., the flavor of the next candy)
  • Pr(X|d) = Σi Pr(X|d,hi) Pr(hi|d)
            = Σi Pr(X|hi) Pr(hi|d)
  • Predictions are weighted averages of the predictions of the individual hypotheses
  • Hypotheses serve as “intermediaries” between raw data and prediction
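As a small illustrative sketch (Python; the function and variable names are made up, not from the slides), the prediction is just a weighted sum of the per-hypothesis predictions:

```python
# Sketch: Bayesian prediction Pr(X=x|d) as a weighted average over hypotheses.
# posteriors[i] holds Pr(h_i|d); per_hyp[i] holds Pr(X=x|h_i).
def bayes_predict(per_hyp, posteriors):
    return sum(p_x * p_h for p_x, p_h in zip(per_hyp, posteriors))
```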


Candy Example

  • Assume prior P(H) = <0.1, 0.2, 0.4, 0.2, 0.1>
  • Assume candies are i.i.d. (independent and identically distributed)
    – P(d|h) = Πj P(dj|h)
  • Suppose the first 10 candies all taste lime:
    – P(d|h5) = 1^10 = 1
    – P(d|h3) = 0.5^10 ≈ 0.00098
    – P(d|h1) = 0^10 = 0
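To make the posterior update concrete, here is a minimal Python sketch (illustrative, not from the slides) that applies Bayes’ theorem to the five candy hypotheses after a run of limes:

```python
# Sketch: posteriors over h_1..h_5 after n lime candies in a row,
# using the prior <0.1, 0.2, 0.4, 0.2, 0.1> and i.i.d. likelihoods.
prior = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h_i)

def posterior_after_limes(n):
    unnorm = [pr * pl ** n for pr, pl in zip(prior, p_lime)]
    z = sum(unnorm)                     # normalization constant (1/k)
    return [u / z for u in unnorm]

post = posterior_after_limes(10)        # h_5 dominates: post[4] is about 0.9
p_next_lime = sum(pl * p for pl, p in zip(p_lime, post))   # Bayesian prediction
```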


Posterior

[Figure: posteriors P(h_i|e_1…e_t), i = 1…5, vs. number of samples, for data generated from h_5]


Prediction

[Figure: Bayesian prediction of the probability that the next candy is lime vs. number of samples, for data generated from h_5]


Bayesian Learning

  • Bayesian learning properties:
    – Optimal (i.e., given the prior, no other prediction is correct more often than the Bayesian one)
    – No overfitting (all hypotheses are weighted and considered)
  • There is a price to pay:
    – When the hypothesis space is large, Bayesian learning may be intractable
    – i.e., the sum (or integral) over hypotheses is often intractable
  • Solution: approximate Bayesian learning


Maximum a posteriori (MAP)

  • Idea: make predictions based on the most probable hypothesis hMAP
    – hMAP = argmax_hi P(hi|d)
    – P(X|d) ≈ P(X|hMAP)
  • In contrast, Bayesian learning makes predictions based on all hypotheses, weighted by their probability


Candy Example (MAP)

  • Prediction after
    – 1 lime:  hMAP = h3, Pr(lime|hMAP) = 0.5
    – 2 limes: hMAP = h4, Pr(lime|hMAP) = 0.75
    – 3 limes: hMAP = h5, Pr(lime|hMAP) = 1
    – 4 limes: hMAP = h5, Pr(lime|hMAP) = 1
    – …
  • After only 3 limes, MAP correctly selects h5


Candy Example (MAP)

  • But what if the correct hypothesis is h4?
    – h4: P(lime) = 0.75 and P(cherry) = 0.25
  • After 3 limes:
    – MAP incorrectly selects h5
    – MAP yields P(lime|hMAP) = 1
    – Bayesian learning yields P(lime|d) ≈ 0.8
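A quick numeric check of this contrast (a sketch that reuses the prior <0.1, 0.2, 0.4, 0.2, 0.1> from the earlier slides; variable names are made up):

```python
# Sketch: MAP vs. Bayesian prediction after 3 limes, prior <0.1,0.2,0.4,0.2,0.1>.
prior = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

unnorm = [pr * pl ** 3 for pr, pl in zip(prior, p_lime)]
post = [u / sum(unnorm) for u in unnorm]

h_map = max(range(5), key=lambda i: post[i])              # index 4, i.e. h_5
map_pred = p_lime[h_map]                                  # P(lime|h_MAP) = 1
bayes_pred = sum(pl * p for pl, p in zip(p_lime, post))   # roughly 0.8
```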


MAP properties

  • MAP predictions are less accurate than Bayesian predictions since they rely on only one hypothesis hMAP
  • But MAP and Bayesian predictions converge as the amount of data increases
  • Controlled overfitting (the prior can be used to penalize complex hypotheses)
  • Finding hMAP may be intractable:
    – hMAP = argmax_h P(h|d)
    – Optimization may be difficult


MAP computation

  • Optimization:
    – hMAP = argmax_h P(h|d)
           = argmax_h P(h) P(d|h)
           = argmax_h P(h) Πi P(di|h)
  • The product makes the optimization non-linear
  • Take the log to turn the product into a sum:
    – hMAP = argmax_h log P(h) + Σi log P(di|h)
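A minimal sketch of this log-form optimization over a finite hypothesis space (all names are illustrative; dropping the log-prior term gives the ML variant introduced on the next slides):

```python
import math

# Sketch: h_MAP = argmax_h  log P(h) + sum_i log P(d_i | h)  over a finite set.
def log_safe(p):
    return math.log(p) if p > 0 else float("-inf")   # guard against P = 0

def map_hypothesis(hypotheses, prior, data, likelihood):
    """likelihood(d, h) returns P(d | h); all names are illustrative."""
    def log_post(i):
        return log_safe(prior[i]) + sum(log_safe(likelihood(d, hypotheses[i]))
                                        for d in data)
    return max(range(len(hypotheses)), key=log_post)
```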


Maximum Likelihood (ML)

  • Idea: simplify MAP by assuming a uniform prior (i.e., P(hi) = P(hj) ∀ i, j)
    – hMAP = argmax_h P(h) P(d|h)
    – hML = argmax_h P(d|h)
  • Make predictions based on hML only:
    – P(X|d) ≈ P(X|hML)


Candy Example (ML)

  • Prediction after
    – 1 lime:  hML = h5, Pr(lime|hML) = 1
    – 2 limes: hML = h5, Pr(lime|hML) = 1
    – …
  • Frequentist: “objective” prediction since it relies only on the data (i.e., no prior)
  • Bayesian: prediction based on the data and a uniform prior (since no prior ⇒ uniform prior)


ML properties

  • ML predictions are less accurate than Bayesian and MAP predictions since they ignore prior information and rely on only one hypothesis hML
  • But ML, MAP and Bayesian predictions converge as the amount of data increases
  • Subject to overfitting (no prior to penalize complex hypotheses that could exploit statistically insignificant data patterns)
  • Finding hML is often easier than finding hMAP:
    – hML = argmax_h Σi log P(di|h)


Statistical Learning

  • Use Bayesian learning, MAP or ML
  • Complete data:
    – When the data has multiple attributes, all attributes are known
    – Easy
  • Incomplete data:
    – When the data has multiple attributes, some attributes are unknown
    – Harder


Simple ML example

  • Hypothesis h:

– P(cherry)= & P(lime)=1-

  • Data d:

– c cherries and l limes

  • ML hypothesis:

–  is relative frequency of observed data –  = c/(c+l) – P(cherry) = c/(c+l) and P(lime)= l/(c+l)


ML computation

  • 1) Likelihood expression
    – P(d|hθ) = θ^c (1−θ)^l
  • 2) Log likelihood
    – log P(d|hθ) = c log θ + l log(1−θ)
  • 3) Derivative of the log likelihood
    – d(log P(d|hθ))/dθ = c/θ − l/(1−θ)
  • 4) ML hypothesis
    – c/θ − l/(1−θ) = 0  ⇒  θ = c/(c+l)


More complicated ML example

  • Hypothesis: h,1,2
  • Data:

– c cherries

  • gc green wrappers
  • rc red wrappers

– l limes

  • gl green wrappers
  • rl red wrappers


ML computation

  • 1) Likelihood expression
    – P(d|hθ,θ1,θ2) = θ^c (1−θ)^l · θ1^rc (1−θ1)^gc · θ2^rl (1−θ2)^gl
  • 2)–3) Take the log and set the partial derivative with respect to each parameter to zero, as in the simple example
  • 4) ML hypothesis
    – c/θ − l/(1−θ) = 0  ⇒  θ = c/(c+l)
    – rc/θ1 − gc/(1−θ1) = 0  ⇒  θ1 = rc/(rc+gc)
    – rl/θ2 − gl/(1−θ2) = 0  ⇒  θ2 = rl/(rl+gl)


Naïve Bayes model

[Figure: naïve Bayes network with class node C and arcs to attribute nodes A1, A2, A3, …, An]

  • Want to predict a class C based on attributes Ai
  • Parameters:
    – θ = P(C=true)
    – θi1 = P(Ai=true|C=true)
    – θi2 = P(Ai=true|C=false)
  • Assumption: the Ai’s are independent given C


Naïve Bayes model for Restaurant Problem

  • Data: examples from the restaurant problem, each labeled wait or ~wait (table not reproduced)
  • ML sets
    – θ to the relative frequencies of wait and ~wait
    – θi1, θi2 to the relative frequencies of each attribute value given wait and ~wait
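Below is a sketch of these relative-frequency estimates for a boolean naïve Bayes model (the data layout and function name are illustrative; the actual restaurant table is not reproduced here):

```python
# Sketch: ML (relative-frequency) parameters for a boolean naive Bayes model.
# Each row is (c, attrs): c is the class (e.g. wait), attrs a list of booleans.
def learn_naive_bayes(rows):
    n = len(rows)
    n_pos = sum(1 for c, _ in rows if c)           # assumes both classes occur
    theta = n_pos / n                              # P(C = true)
    k = len(rows[0][1])
    theta1 = [sum(1 for c, a in rows if c and a[i]) / n_pos for i in range(k)]
    theta2 = [sum(1 for c, a in rows if not c and a[i]) / (n - n_pos)
              for i in range(k)]
    return theta, theta1, theta2                   # theta, theta_i1, theta_i2
```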


Naïve Bayes model vs decision trees

  • Wait prediction for the restaurant problem

[Figure: proportion correct on the test set vs. training set size (up to 100), comparing a decision tree with naïve Bayes]

Why is naïve Bayes less accurate than the decision tree?


Bayesian network parameter learning (ML)

  • Parameters V,pa(V)=v:

– CPTs: V,pa(V)=v = P(V|pa(V)=v)

  • Data d:

– d1 : <V1=v1,1, V2=v2,1, …, Vn = vn,1> – d2 : <V1=v1,2, V2=v2,2, …, Vn = vn,2> – …

  • Maximum likelihood:

– Set V,pa(V)=v to the relative frequencies of the values of V given the values v of the parents of V