Machine Learning - 10601

Boosting

Geoff Gordon, Miroslav Dudík

(partly based on slides of Rob Schapire and Carlos Guestrin)

http://www.cs.cmu.edu/~ggordon/10601/ November 9, 2009

Ensembles of trees

BAGGING and RANDOM FORESTS

  • learn many big trees
  • each tree aims to fit the same target concept
    – random training sets
    – randomized tree growth
  • voting ≈ averaging: DECREASE in VARIANCE

BOOSTING

  • learn many small trees (weak classifiers)
  • each tree ‘specializes’ to a different part of the target concept
    – reweight training examples
    – higher weights where errors remain
  • voting increases expressivity: DECREASE in BIAS
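To make the contrast concrete, here is a minimal scikit-learn sketch (assuming scikit-learn is installed; the synthetic dataset and hyperparameters are illustrative choices, not from the lecture): a random forest of big trees whose votes are averaged, versus AdaBoost over decision stumps that reweights the training examples round by round.

```python
# Illustrative comparison only: bagging/random forests (many big trees, averaged)
# vs. boosting (many small trees, reweighted). Dataset and settings are arbitrary.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Random forest: each big tree fits the same concept on a randomized training set;
# averaging the votes mainly reduces variance.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# AdaBoost: small trees (decision stumps); each round reweights the training
# examples so later stumps focus on what is still misclassified, reducing bias.
# (The parameter is called base_estimator in older scikit-learn versions.)
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    random_state=0,
).fit(X_tr, y_tr)

print("random forest test accuracy:", rf.score(X_te, y_te))
print("AdaBoost (stumps) test accuracy:", ada.score(X_te, y_te))
```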

slide-2
SLIDE 2

11/9/2009 2

Boosting

  • boosting = general method of converting rough rules of thumb (e.g., decision stumps) into a highly accurate prediction rule

  • technically:
    – assume we are given a “weak” learning algorithm that can consistently find classifiers (“rules of thumb”) at least slightly better than random, say, accuracy ≥ 55% (in a two-class setting)
    – given sufficient data, a boosting algorithm can provably construct a single classifier with very high accuracy, say, 99%


AdaBoost

[Freund-Schapire 1995]

weak classifiers = decision stumps (vertical or horizontal half-planes)
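The original slides walk through AdaBoost’s rounds on a toy two-dimensional dataset as pictures, and the algorithm box itself did not survive extraction. The following is a minimal NumPy sketch of AdaBoost with decision stumps in the standard Freund-Schapire formulation; the exhaustive stump search and the helper names are illustrative assumptions, not code from the lecture.

```python
import numpy as np

def best_stump(X, y, w):
    # Exhaustive search for the axis-aligned stump with smallest weighted error.
    best = (np.inf, 0, 0.0, 1)               # (error, feature, threshold, polarity)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(X[:, j] > thr, pol, -pol)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, T=50):
    # y must be labeled in {-1, +1}; returns a list of (alpha, feature, threshold, polarity).
    n = len(y)
    w = np.full(n, 1.0 / n)                   # D_1: uniform distribution over examples
    ensemble = []
    for _ in range(T):
        err, j, thr, pol = best_stump(X, y, w)
        err = np.clip(err, 1e-12, 1 - 1e-12)  # guard against division by zero
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(X[:, j] > thr, pol, -pol)
        w *= np.exp(-alpha * y * pred)        # up-weight mistakes, down-weight correct ones
        w /= w.sum()                          # renormalize so D_{t+1} is a distribution
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def predict(ensemble, X):
    # Final classifier: sign of the alpha-weighted vote of all stumps.
    votes = sum(alpha * np.where(X[:, j] > thr, pol, -pol)
                for alpha, j, thr, pol in ensemble)
    return np.sign(votes)
```

Each round fits a stump to the current weight distribution, then multiplies the weights of misclassified examples up and of correctly classified ones down, which is exactly the ‘specialization’ behavior described on the ensembles slide.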



A typical run of AdaBoost

  • training error rapidly drops
    (combining weak learners increases expressivity)
  • test error does not increase with number of trees T
    (robustness to overfitting)
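The reason training error drops so fast is the standard AdaBoost training-error bound; it is not in the surviving text, so the statement below is reconstructed from the usual Freund-Schapire analysis. If round t’s weak classifier has weighted error ε_t = 1/2 − γ_t, then

```latex
\widehat{\Pr}_{\text{train}}\big[H(x) \neq y\big]
\;\le\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t(1-\epsilon_t)}
\;=\; \prod_{t=1}^{T} \sqrt{1-4\gamma_t^2}
\;\le\; \exp\!\Big(-2\sum_{t=1}^{T}\gamma_t^2\Big),
```

so even a small but consistent edge γ_t over random guessing drives the training error to zero exponentially fast.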



Bounding true error

[Freund-Schapire 1997]

  • T = number of rounds
  • d = VC dimension of weak learner
  • m = number of training examples
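The bound itself appears only as an image in the slides; the usual form of the Freund-Schapire 1997 result, reconstructed here for reference, holds with high probability over the draw of the m training examples:

```latex
\Pr_{\text{true}}\big[H(x) \neq y\big]
\;\le\; \widehat{\Pr}_{\text{train}}\big[H(x) \neq y\big]
\;+\; \tilde{O}\!\left(\sqrt{\frac{Td}{m}}\right).
```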

Bounding true error (a first guess)

A typical run contradicts a naïve bound

  • the complexity term grows with T, so the naïve bound predicts that test error should eventually rise as more rounds are added, yet in the typical run it keeps decreasing


Finer analysis: margins

[Schapire et al. 1998]

Empirical evidence: margin distribution


Theoretical evidence: large margins ⇒ simple classifiers

More technically…

Bound depends on:

  • d = VC dimension of weak learner
  • m = number of training examples
  • entire distribution of training margins
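The margin definition and the bound are images in the original slides; the standard statements from Schapire et al. 1998 are reconstructed below. The margin of a training example is its normalized weighted vote for the correct label, and, unlike the previous bound, the resulting guarantee does not depend on the number of rounds T:

```latex
\operatorname{margin}(x, y) \;=\; \frac{y \sum_{t} \alpha_t h_t(x)}{\sum_{t} \alpha_t} \;\in\; [-1, 1],
\qquad
\Pr_{\text{true}}\big[H(x) \neq y\big]
\;\le\; \widehat{\Pr}_{\text{train}}\big[\operatorname{margin}(x, y) \le \theta\big]
\;+\; \tilde{O}\!\left(\sqrt{\frac{d}{m\,\theta^2}}\right)
\quad \text{for any } \theta > 0.
```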



Practical advantages of AdaBoost

Application: detecting faces

[Viola-Jones 2001]


Caveats

“Hard” predictions can slow down learning!


Confidence-rated Predictions

[Schapire-Singer 1999]



Confidence-rated predictions help a lot!
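The Schapire-Singer formulation itself is not in the surviving text; reconstructed from the cited paper, a confidence-rated weak hypothesis outputs a real number whose sign is the predicted label and whose magnitude is its confidence, and that real value enters the weight update directly:

```latex
h_t : X \to \mathbb{R}, \qquad
D_{t+1}(i) \;=\; \frac{D_t(i)\,\exp\!\big(-\alpha_t\, y_i\, h_t(x_i)\big)}{Z_t}, \qquad
H(x) \;=\; \operatorname{sign}\!\Big(\sum_{t=1}^{T} \alpha_t h_t(x)\Big),
```

with α_t (and h_t itself) chosen to make the normalizer Z_t as small as possible, which is what speeds up learning relative to hard ±1 predictions.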

Loss in logistic regression


Loss in AdaBoost

Logistic regression vs AdaBoost
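The loss plots on these slides did not survive extraction; written out in their standard forms, with margin z = y f(x), the two losses being compared are

```latex
\text{logistic regression (log loss):} \quad \ell(z) \;=\; \log\!\big(1 + e^{-z}\big),
\qquad
\text{AdaBoost (exponential loss):} \quad \ell(z) \;=\; e^{-z}.
```

Both decrease with the margin; the exponential loss penalizes confidently wrong predictions much more aggressively than the log loss.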


Benefits of model-fitting view

What you should know about boosting

  • weak classifiers → strong classifiers
    – weak: slightly better than random on training data
    – strong: eventually zero error on training data

  • AdaBoost prevents overfitting by increasing margins
  • regimes when AdaBoost overfits
    – weak learner too strong: use small trees or stop early
    – data noisy: stop early

  • AdaBoost vs Logistic Regression
    – exponential loss vs log loss
    – single-coordinate updates vs full optimization