SLIDE 1

Ensembles

Léon Bottou COS 424 – 4/8/2010

SLIDE 2

Readings

  • T. G. Dietterich (2000), “Ensemble Methods in Machine Learning”.

  • R. E. Schapire (2003), “The Boosting Approach to Machine Learning”, Sections 1, 2, 3, 4, 6.

SLIDE 3

Summary

  • 1. Why ensembles?
  • 2. Combining outputs.
  • 3. Constructing ensembles.
  • 4. Boosting.

SLIDE 4
  • I. Ensembles

SLIDE 5

Ensemble of classifiers

– Consider a set of classifiers h1, h2, . . . , hL.
– Construct a classifier by combining their individual decisions.
– For example, by voting their outputs.

Accuracy
– The ensemble works if the classifiers have low error rates.

Diversity
– No gain if all classifiers make the same mistakes.
– What if classifiers make different mistakes?

SLIDE 6

Uncorrelated classifiers

Assume that, for all $r \neq s$,

$$\mathrm{Cov}\big[\, \mathbb{1}\{h_r(x) = y\},\ \mathbb{1}\{h_s(x) = y\} \,\big] = 0.$$

The tally of classifier votes then follows a binomial distribution.

Example: twenty-one uncorrelated classifiers, each with a 30% error rate.

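A quick way to see the effect in the twenty-one-classifier example: under the uncorrelatedness assumption, the number of wrong votes follows Binomial(21, 0.3), so the majority vote errs only when at least eleven classifiers are wrong. A minimal sketch (illustrative, not from the slides):

```python
# Probability that a majority vote of L uncorrelated classifiers errs,
# assuming each classifier is wrong independently with probability p.
from math import comb

def majority_vote_error(L=21, p=0.3):
    # The ensemble errs when more than half of the L votes are wrong.
    return sum(comb(L, k) * p**k * (1 - p)**(L - k) for k in range(L // 2 + 1, L + 1))

print(majority_vote_error())  # ~0.026, versus 0.30 for any single classifier
```
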
SLIDE 7

Statistical motivation

blue : classifiers that work well on the training set(s)

f : best classifier.

SLIDE 8

Computational motivation

blue : classifier search may reach local optima

f : best classifier.

SLIDE 9

Representational motivation

blue : classifier space may not contain best classifier

f : best classifier.

SLIDE 10

Practical success

Recommendation system
– Netflix “movies you may like”.
– Customers sometimes rate movies they rent.
– Input: (movie, customer). Output: rating.

Netflix competition
– $1M for the first team to do 10% better than their system.

Winner: BellKor team and friends
– Ensemble of more than 800 rating systems.

Runner-up: everybody else
– Ensemble of all the rating systems built by the other teams.

SLIDE 11

Bayesian ensembles

Let $D$ represent the training data. Enumerating all the classifiers:

$$P(y \mid x, D) \;=\; \sum_h P(y, h \mid x, D) \;=\; \sum_h P(h \mid x, D)\, P(y \mid h, x, D) \;=\; \sum_h P(h \mid D)\, P(y \mid x, h)$$

– $P(h \mid D)$ : how well $h$ matches the training data.
– $P(y \mid x, h)$ : what $h$ predicts for pattern $x$.

Note that this is a weighted average.

SLIDE 12
  • II. Combining Outputs

SLIDE 13

Simple averaging

SLIDE 14

Weighted averaging a priori

  • Weights derived from the training errors, e.g. exp(−β TrainingError(ht)).

Approximate Bayesian ensemble.

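A minimal sketch of this a-priori weighting (illustrative; the weight formula follows the slide, but the function name, the ±1 label convention, and the value of β are assumptions):

```python
# Combine binary classifiers h_t(x) in {-1, +1} with a-priori weights
# w_t = exp(-beta * TrainingError(h_t)), then take the sign of the weighted vote.
import numpy as np

def weighted_vote(predictions, training_errors, beta=5.0):
    # predictions: (T, n) array of ±1 votes; training_errors: (T,) array.
    w = np.exp(-beta * np.asarray(training_errors))   # accurate classifiers weigh more
    scores = w @ np.asarray(predictions)              # weighted sum of votes, shape (n,)
    return np.sign(scores)
```
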
SLIDE 15

Weighted averaging with trained weights

  • Train weights on the validation set.

Training weights on the training set overfits easily. You need another validation set to estimate the performance!

SLIDE 16

Stacked classifiers

  • Second tier classifier trained on the validation set.

You need another validation set to estimate the performance!

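A sketch of a stacked ensemble under the usual train/validation split (illustrative; the choice of first-tier models and of logistic regression as the second tier is an assumption, not the lecture's recipe):

```python
# Stacking: fit first-tier models on the training set, then fit a second-tier
# classifier on their predictions over a held-out validation set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def fit_stack(X_train, y_train, X_valid, y_valid):
    tier1 = [DecisionTreeClassifier(max_depth=3), SVC(), LogisticRegression(max_iter=1000)]
    for h in tier1:
        h.fit(X_train, y_train)
    # Second-tier features: predictions of each first-tier model on the validation set.
    Z_valid = np.column_stack([h.predict(X_valid) for h in tier1])
    tier2 = LogisticRegression(max_iter=1000).fit(Z_valid, y_valid)
    return tier1, tier2

def predict_stack(tier1, tier2, X):
    Z = np.column_stack([h.predict(X) for h in tier1])
    return tier2.predict(Z)
```

As the slide warns, the validation set is consumed by the second-tier training, so yet another held-out set is needed to estimate the performance of the stack.
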
SLIDE 17
  • III. Constructing Ensembles

SLIDE 18

Diversification

Cause of the mistake                        Diversification strategy
Pattern was difficult                       hopeless
Overfitting (⋆)                             vary the training sets
Some features were noisy                    vary the set of input features
Multiclass decisions were inconsistent      vary the class encoding

SLIDE 19

Manipulating the training examples

Bootstrap replication simulates training set selection
– Given a training set of size n, construct a new training set by sampling n examples with replacement.
– About 37% of the examples are excluded: each example is missed with probability $(1 - 1/n)^n \approx e^{-1}$.

Bagging (see the sketch below)
– Create bootstrap replicates of the training set.
– Build a decision tree for each replicate.
– Estimate tree performance using out-of-bootstrap data.
– Average the outputs of all decision trees.

Boosting
– See part IV.

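A minimal bagging sketch (illustrative; scikit-learn decision trees, ±1 labels, and the number of replicates are assumptions):

```python
# Bagging: train one decision tree per bootstrap replicate and combine
# the trees by majority vote. Labels are assumed to be in {-1, +1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_trees=51, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)              # bootstrap replicate: n draws with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    votes = np.stack([t.predict(X) for t in trees])   # shape (n_trees, n_samples)
    return np.sign(votes.sum(axis=0))                 # majority vote (n_trees is odd, so no ties)
```

The examples not drawn into a given replicate (the out-of-bootstrap examples) can be kept aside to estimate that tree's performance, as the slide suggests.
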
SLIDE 20

Manipulating the features

Random forests
– Construct decision trees on bootstrap replicas.
– Restrict the node decisions to a small subset of features picked randomly for each node.
– Do not prune the trees.
– Estimate tree performance using out-of-bootstrap data.
– Average the outputs of all decision trees.

Multiband speech recognition
– Filter the speech to eliminate a random subset of the frequencies.
– Train a speech recognizer on the filtered data.
– Repeat and combine with a second-tier classifier.
– The resulting recognizer is more robust to noise.

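A random-forest sketch built along the same lines as the bagging code above (illustrative; the per-node feature restriction is obtained here through scikit-learn's max_features option, and the "sqrt" choice is an assumption):

```python
# Random forest: bootstrap replicas + a random feature subset at every node +
# unpruned trees, combined by majority vote over labels in {-1, +1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, n_trees=101, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)                    # bootstrap replicate
        tree = DecisionTreeClassifier(max_features="sqrt")  # random feature subset at each node
        forest.append(tree.fit(X[idx], y[idx]))             # grown to full depth, never pruned
    return forest

def random_forest_predict(forest, X):
    votes = np.stack([t.predict(X) for t in forest])
    return np.sign(votes.sum(axis=0))
```
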
SLIDE 21

Manipulating the output codes

Reducing multiclass problems to binary classification
– We have seen one versus all.
– We have seen all versus all.

Error correcting codes for multiclass problems (see the sketch below)
– Code the class numbers with an error correcting code.
– Construct a binary classifier for each bit of the code.
– Run the error correction algorithm on the binary classifier outputs.

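A small error-correcting-output-code sketch (illustrative; the code matrix, logistic regression as the per-bit classifier, and Hamming-distance decoding are assumptions standing in for a real error-correcting code and its decoder):

```python
# ECOC: give each of K classes a ±1 codeword (one row of `code`), train one
# binary classifier per code bit, and decode by nearest codeword in Hamming distance.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ecoc_fit(X, y, code):
    # code: (K, B) matrix with entries in {-1, +1}; y: integer labels in 0..K-1.
    # Each column is assumed to split the classes into two non-empty groups.
    bits = []
    for b in range(code.shape[1]):
        target = code[y, b]                    # relabel every example by bit b of its class codeword
        bits.append(LogisticRegression(max_iter=1000).fit(X, target))
    return bits

def ecoc_predict(bits, X, code):
    pred = np.column_stack([h.predict(X) for h in bits])        # predicted codewords, (n, B)
    dist = (pred[:, None, :] != code[None, :, :]).sum(axis=-1)  # Hamming distance to each class codeword
    return dist.argmin(axis=1)                                  # class whose codeword is closest
```
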
SLIDE 22
  • IV. Boosting

SLIDE 23

Motivation

  • Easy to come up with rough rules of thumb for classifying data:

– the email contains more than 50% capital letters;
– the email contains the expression “buy now”.

  • Each rule alone isn’t great, but better than random.
  • Boosting converts rough rules of thumb into an accurate classifier.

Boosting was invented by Prof. Schapire.

SLIDE 24

Adaboost

Given examples $(x_1, y_1), \dots, (x_n, y_n)$ with $y_i = \pm 1$.
Let $D_1(i) = 1/n$ for $i = 1 \dots n$.

For $t = 1 \dots T$ do:
  • Run the weak learner using the examples with weights $D_t$.
  • Get a weak classifier $h_t$.
  • Compute its error: $\varepsilon_t = \sum_i D_t(i)\, \mathbb{1}\{h_t(x_i) \neq y_i\}$.
  • Compute the magic coefficient: $\alpha_t = \frac{1}{2} \log \frac{1 - \varepsilon_t}{\varepsilon_t}$.
  • Update the weights: $D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t}$, where $Z_t$ normalizes $D_{t+1}$ to sum to one.

Output the final classifier $f_T(x) = \mathrm{sign}\big( \sum_{t=1}^{T} \alpha_t h_t(x) \big)$.

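A compact AdaBoost sketch following the pseudo-code above (illustrative; decision stumps as the weak learner and the clipping of the error are assumptions):

```python
# AdaBoost with decision stumps as the weak learner. Labels y_i are in {-1, +1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    y = np.asarray(y)
    n = len(X)
    D = np.full(n, 1.0 / n)                            # D_1(i) = 1/n
    stumps, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1)        # weak learner: a decision stump
        h.fit(X, y, sample_weight=D)                   # run it on the weighted examples
        pred = h.predict(X)
        eps = np.clip(D[pred != y].sum(), 1e-10, 1 - 1e-10)  # weighted error (clipped for safety)
        alpha = 0.5 * np.log((1 - eps) / eps)          # the "magic coefficient"
        D = D * np.exp(-alpha * y * pred)              # up-weight the mistakes
        D /= D.sum()                                   # divide by Z_t
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    score = sum(a * h.predict(X) for h, a in zip(stumps, alphas))
    return np.sign(score)
```
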
SLIDE 25

Toy example

Weak classifiers: vertical or horizontal half-planes.

SLIDE 26

Adaboost round 1

SLIDE 27

Adaboost round 2

SLIDE 28

Adaboost round 3

SLIDE 29

Adaboost final classifier

SLIDE 30

From weak learner to strong classifier (1)

Preliminary — write $f_T(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$ for the unthresholded vote. Unrolling the weight updates:

$$D_{T+1}(i) \;=\; D_1(i)\, \frac{e^{-\alpha_1 y_i h_1(x_i)}}{Z_1} \cdots \frac{e^{-\alpha_T y_i h_T(x_i)}}{Z_T} \;=\; \frac{1}{n}\, \frac{e^{-y_i f_T(x_i)}}{\prod_t Z_t}$$

Bounding the training error:

$$\frac{1}{n} \sum_i \mathbb{1}\{\mathrm{sign}(f_T(x_i)) \neq y_i\} \;\le\; \frac{1}{n} \sum_i e^{-y_i f_T(x_i)} \;=\; \sum_i D_{T+1}(i) \prod_t Z_t \;=\; \prod_t Z_t$$

Idea: make $Z_t$ as small as possible.

$$Z_t \;=\; \sum_{i=1}^{n} D_t(i)\, e^{-\alpha_t y_i h_t(x_i)} \;=\; (1 - \varepsilon_t)\, e^{-\alpha_t} + \varepsilon_t\, e^{\alpha_t}$$

  • 1. Pick $h_t$ to minimize $\varepsilon_t$.
  • 2. Pick $\alpha_t$ to minimize $Z_t$.

SLIDE 31

From weak learner to strong classifier (2)

Pick $\alpha_t$ to minimize $Z_t$ (the magic coefficient):

$$\frac{\partial Z_t}{\partial \alpha_t} \;=\; -(1 - \varepsilon_t)\, e^{-\alpha_t} + \varepsilon_t\, e^{\alpha_t} \;=\; 0 \quad\Longrightarrow\quad \alpha_t = \frac{1}{2} \log \frac{1 - \varepsilon_t}{\varepsilon_t}$$

Weak learner assumption: $\gamma_t = \frac{1}{2} - \varepsilon_t$ is positive and small. Then

$$Z_t \;=\; (1 - \varepsilon_t)\sqrt{\frac{\varepsilon_t}{1 - \varepsilon_t}} + \varepsilon_t\sqrt{\frac{1 - \varepsilon_t}{\varepsilon_t}} \;=\; \sqrt{4\,\varepsilon_t (1 - \varepsilon_t)} \;=\; \sqrt{1 - 4\gamma_t^2} \;\le\; \exp\!\big(-2\gamma_t^2\big)$$

$$\mathrm{TrainingError}(f_T) \;\le\; \prod_{t=1}^{T} Z_t \;\le\; \exp\!\Big(-2 \sum_{t=1}^{T} \gamma_t^2\Big)$$

The training error decreases exponentially if $\inf_t \gamma_t > 0$. But that does not happen beyond a certain point…

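A quick numeric illustration of the last two inequalities (not from the slides; the edge value 0.1 and the fifty rounds are arbitrary choices):

```python
# Check that Z_t = sqrt(1 - 4*gamma_t^2) <= exp(-2*gamma_t^2) and that the
# product of the Z_t, an upper bound on the training error, shrinks with T.
import numpy as np

gammas = np.full(50, 0.1)                 # fifty rounds, each weak classifier with edge gamma_t = 0.1
Z = np.sqrt(1 - 4 * gammas**2)
print(Z.prod())                           # ~0.36: bound on the training error after 50 rounds
print(np.exp(-2 * (gammas**2).sum()))     # ~0.37: the looser exponential bound
```
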
SLIDE 32

Boosting and exponential loss

Proofs are instructive. We obtained the bound

$$\mathrm{TrainingError}(f_T) \;\le\; \frac{1}{n} \sum_i e^{-y_i f_T(x_i)} \;=\; \prod_{t=1}^{T} Z_t$$

– without saying how $D_t$ relates to $h_t$;
– without using the value of $\alpha_t$.

(Figure omitted: a plot involving $y$ and $\hat{y}(x)$, presumably the exponential loss as a function of the margin.)

Conclusion
– Round $T$ chooses the $h_T$ and $\alpha_T$ that maximize the exponential loss reduction from $f_{T-1}$ to $f_T$.

Exercise
– Tweak Adaboost to minimize the log loss instead of the exp loss.

SLIDE 33

Boosting and margins

With $H(x) = \sum_t \alpha_t h_t(x)$,

$$\mathrm{margin}_H(x, y) \;=\; \frac{y\, H(x)}{\sum_t |\alpha_t|} \;=\; \frac{\sum_t \alpha_t\, y\, h_t(x)}{\sum_t |\alpha_t|}$$

Remember support vector machines?
