Theory and Applications of Boosting

Yoav Freund UCSD

Many slides from Rob Schapire

Santa Cruz Summer School, Monday, July 16, 2012

Plan

  • Day 1: Basics
    • Boosting
    • Adaboost
    • Margins theory
    • Confidence-rated boosting

  • Day 2: Applications
    • ADTrees
    • JBoost
    • Viola and Jones
    • Active Learning and Pedestrian Detection
    • Genome Wide association studies
    • Online boosting and tracking

  • Day 3: Advanced Topics
    • Boosting and repeated matrix games
    • Boosting and Loss minimization
    • Drifting games and Boost By Majority
    • Brownboost and Boosting with High Noise

Example: “How May I Help You?”

[Gorin et al.]

  • goal: automatically categorize type of call requested by phone customer (Collect, CallingCard, PersonToPerson, etc.)
    • yes I’d like to place a collect call long distance please (Collect)
    • operator I need to make a call but I need to bill it to my office (ThirdNumber)
    • yes I’d like to place a call on my master card please (CallingCard)
    • I just called a number in sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off of my bill (BillingCredit)
  • observation:
    • easy to find “rules of thumb” that are “often” correct
      • e.g.: “IF ‘card’ occurs in utterance THEN predict ‘CallingCard’ ”
    • hard to find single highly accurate prediction rule

The Boosting Approach

  • devise computer program for deriving rough rules of thumb
  • apply procedure to subset of examples
  • obtain rule of thumb
  • apply to 2nd subset of examples
  • obtain 2nd rule of thumb
  • repeat T times

Key Details

  • how to choose examples on each round?
    • concentrate on “hardest” examples (those most often misclassified by previous rules of thumb)
  • how to combine rules of thumb into single prediction rule?
    • take (weighted) majority vote of rules of thumb

Boosting

  • boosting = general method of converting rough rules of thumb into highly accurate prediction rule
  • technically:
    • assume given “weak” learning algorithm that can consistently find classifiers (“rules of thumb”) at least slightly better than random, say, accuracy ≥ 55% (in two-class setting)   [ “weak learning assumption” ]
    • given sufficient data, a boosting algorithm can provably construct single classifier with very high accuracy, say, 99%

Some History

  • How it all began ...

Strong and Weak Learnability

  • boosting’s roots are in “PAC” learning model   [Valiant ’84]
    • get random examples from unknown, arbitrary distribution
  • strong PAC learning algorithm:
    • for any distribution, with high probability, given polynomially many examples (and polynomial time), can find classifier with arbitrarily small generalization error
  • weak PAC learning algorithm:
    • same, but generalization error only needs to be slightly better than random guessing (1/2 − γ)
  • [Kearns & Valiant ’88]:
    • does weak learnability imply strong learnability?

If Boosting Possible, Then...

  • can use (fairly) wild guesses to produce highly accurate predictions
  • if can learn “part way” then can learn “all the way”
  • should be able to improve any learning algorithm
  • for any learning problem:
    • either can always learn with nearly perfect accuracy
    • or there exist cases where cannot learn even slightly better than random guessing

First Boosting Algorithms

  • [Schapire ’89]:
    • first provable boosting algorithm
  • [Freund ’90]:
    • “optimal” algorithm that “boosts by majority”
  • [Drucker, Schapire & Simard ’92]:
    • first experiments using boosting
    • limited by practical drawbacks
  • [Freund & Schapire ’95]:
    • introduced “AdaBoost” algorithm
    • strong practical advantages over previous boosting algorithms

Basic Algorithm and Core Theory

  • introduction to AdaBoost
  • analysis of training error
  • analysis of test error and the margins theory
  • experiments and applications

A Formal Description of Boosting

  • given training set (x1, y1), . . . , (xm, ym), where yi ∈ {−1, +1} is the correct label of instance xi ∈ X
  • for t = 1, . . . , T:
    • construct distribution Dt on {1, . . . , m}
    • find weak classifier (“rule of thumb”) ht : X → {−1, +1} with small error εt on Dt:  εt = Pr_{i∼Dt}[ ht(xi) ≠ yi ]
  • output final classifier Hfinal
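In code, the loop above looks roughly like the sketch below (illustrative only; `weak_learner` and `update_distribution` are placeholder hooks, and the AdaBoost choices for them appear on the next slide):

```python
import numpy as np

def boost(X, y, weak_learner, update_distribution, T):
    """Generic boosting loop from this slide (a sketch, not a reference implementation).

    weak_learner(X, y, D) -> callable h with h(X) in {-1, +1}, chosen to have
    small weighted error on D; update_distribution(D, pred, y) -> (D_next, alpha).
    """
    m = len(y)
    D = np.full(m, 1.0 / m)                      # D_1: uniform on {1, ..., m}
    hs, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)                # weak classifier ("rule of thumb")
        D, alpha = update_distribution(D, h(X), y)
        hs.append(h)
        alphas.append(alpha)
    def H_final(X_new):                          # final classifier: weighted vote
        return np.sign(sum(a * h(X_new) for a, h in zip(alphas, hs)))
    return H_final
```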

AdaBoost

[with Freund]

  • constructing Dt:
    • D1(i) = 1/m
    • given Dt and ht:

        Dt+1(i) = (Dt(i) / Zt) × { e^(−αt) if yi = ht(xi),  e^(αt) if yi ≠ ht(xi) }
                = (Dt(i) / Zt) · exp(−αt yi ht(xi))

      where Zt = normalization factor and αt = ½ ln( (1 − εt) / εt ) > 0
  • final classifier:
    • Hfinal(x) = sign( Σt αt ht(x) )

  • [ Freund & Schapire 96]
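The update on this slide, written as the `update_distribution` hook for the sketch above (assuming `pred` holds the weak classifier's ±1 predictions on the training set and 0 < εt < 1/2):

```python
import numpy as np

def adaboost_update(D, pred, y):
    """One round of AdaBoost reweighting, following the formulas on this slide."""
    eps = D[pred != y].sum()                   # weighted error eps_t of h_t on D_t
    alpha = 0.5 * np.log((1.0 - eps) / eps)    # alpha_t = 1/2 ln((1 - eps_t) / eps_t) > 0
    D_next = D * np.exp(-alpha * y * pred)     # e^(-alpha) if correct, e^(+alpha) if wrong
    return D_next / D_next.sum(), alpha        # dividing by the sum is the Z_t normalization
```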

Toy Example

[figure: training examples under the uniform initial distribution D1]

weak classifiers = vertical or horizontal half-planes

Round 1

ε1 = 0.30, α1 = 0.42   [figure: weak classifier h1 and reweighted distribution D2]

Round 2

ε2 = 0.21, α2 = 0.65   [figure: weak classifier h2 and reweighted distribution D3]

Round 3

ε3 = 0.14, α3 = 0.92   [figure: weak classifier h3]

Final Classifier

Hfinal = sign( 0.42 h1 + 0.65 h2 + 0.92 h3 )   [figure: combined decision regions]
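As a check on the numbers in this toy run, each αt follows from the rule αt = ½ ln((1 − εt)/εt) on the AdaBoost slide:

```latex
\alpha_1 = \tfrac{1}{2}\ln\tfrac{1-0.30}{0.30} \approx 0.42, \qquad
\alpha_2 = \tfrac{1}{2}\ln\tfrac{1-0.21}{0.21} \approx 0.65, \qquad
\alpha_3 = \tfrac{1}{2}\ln\tfrac{1-0.14}{0.14} \approx 0.92 .
```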

http://cseweb.ucsd.edu/~yfreund/adaboost/index.html

Basic Algorithm and Core Theory

  • introduction to AdaBoost
  • analysis of training error
  • analysis of test error and the margins theory
  • experiments and applications

Analyzing the Training Error

[with Freund]

  • Theorem:
    • write εt as 1/2 − γt   [ γt = “edge” ]
    • then

        training error(Hfinal) ≤ ∏t [ 2 √(εt (1 − εt)) ] = ∏t √(1 − 4γt²) ≤ exp( −2 Σt γt² )

    • so: if ∀t : γt ≥ γ > 0, then training error(Hfinal) ≤ e^(−2γ²T)
  • AdaBoost is adaptive:
    • does not need to know γ or T a priori
    • can exploit γt ≫ γ

[ Freund & Schapire 96]
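To give the bound a concrete scale (an illustration, not from the slides): the 55%-accurate weak learner mentioned earlier has edge γ = 0.05, so after T = 1000 rounds

```latex
\text{training error}(H_{\text{final}}) \;\le\; e^{-2\gamma^2 T}
  \;=\; e^{-2\,(0.05)^2 \cdot 1000} \;=\; e^{-5} \;\approx\; 0.007 .
```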

Proof

  • let F(x) = Σt αt ht(x)  ⇒  Hfinal(x) = sign(F(x))   (F is the “scoring function”)
  • Step 1: unwrapping recurrence:

      Dfinal(i) = (1/m) · exp( −yi Σt αt ht(xi) ) / ∏t Zt
                = (1/m) · exp( −yi F(xi) ) / ∏t Zt

Proof (cont.)

  • Step 2: training error(Hfinal) ≤ ∏t Zt
  • Proof:

      training error(Hfinal) = (1/m) Σi { 1 if yi ≠ Hfinal(xi), 0 else }
                             = (1/m) Σi { 1 if yi F(xi) ≤ 0, 0 else }
                             ≤ (1/m) Σi exp( −yi F(xi) )
                             = Σi Dfinal(i) ∏t Zt
                             = ∏t Zt

Proof (cont.)

  • Step 3: Zt = 2 √(εt (1 − εt))
  • Proof:

      Zt = Σi Dt(i) exp(−αt yi ht(xi))
         = Σ{i : yi ≠ ht(xi)} Dt(i) e^(αt) + Σ{i : yi = ht(xi)} Dt(i) e^(−αt)
         = εt e^(αt) + (1 − εt) e^(−αt)
         = 2 √(εt (1 − εt))
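A step the slide leaves implicit: the setting αt = ½ ln((1 − εt)/εt) is exactly the value of α minimizing Zt. Setting the derivative of εt e^α + (1 − εt) e^(−α) to zero gives

```latex
\epsilon_t e^{\alpha} - (1-\epsilon_t)\,e^{-\alpha} = 0
\;\Longrightarrow\;
\alpha_t = \tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t},
\qquad
Z_t = \epsilon_t e^{\alpha_t} + (1-\epsilon_t)\,e^{-\alpha_t} = 2\sqrt{\epsilon_t(1-\epsilon_t)} .
```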

Basic Algorithm and Core Theory

  • introduction to AdaBoost
  • analysis of training error
  • analysis of test error and the margins theory
  • experiments and applications

How Will Test Error Behave? (A First Guess)

[figure: hypothesized train and test error vs. # of rounds T]

expect:

  • training error to continue to drop (or reach zero)
  • test error to increase when Hfinal becomes “too complex”
    • “Occam’s razor”
    • overfitting
    • hard to know when to stop training

Technically...

  • with high probability:

      generalization error ≤ training error + Õ( √(dT / m) )

  • bound depends on:
    • m = # training examples
    • d = “complexity” of weak classifiers
    • T = # rounds
  • generalization error = E[ test error ]
  • predicts overfitting

Overfitting Can Happen

[figure: test and train error vs. # rounds (boosting “stumps” on the heart-disease dataset)]

  • but often doesn’t...

Actual Typical Run

[figure: test and train error vs. # of rounds T (boosting C4.5 on the “letter” dataset)]

  • test error does not increase, even after 1000 rounds
  • (total size > 2,000,000 nodes)
  • test error continues to drop even after training error is zero!

    # rounds      5     100    1000
    train error   0.0   0.0    0.0
    test error    8.4   3.3    3.1

  • Occam’s razor wrongly predicts “simpler” rule is better

A Better Story: The Margins Explanation

[with Freund, Bartlett & Lee]

  • key idea:
    • training error only measures whether classifications are right or wrong
    • should also consider confidence of classifications
  • recall: Hfinal is weighted majority vote of weak classifiers
  • measure confidence by margin = strength of the vote
      = (weighted fraction voting correctly) − (weighted fraction voting incorrectly)

[figure: margin scale of Hfinal from −1 to +1, running from high-confidence incorrect through low-confidence to high-confidence correct]

[ Schapire, Freund, Bartlett & Lee 97]
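A small sketch (not from the slides; the names are illustrative) of how these voting margins would be computed for a trained ensemble:

```python
import numpy as np

def margins(alphas, weak_preds, y):
    """Normalized voting margins of the training examples.

    alphas:     shape (T,), weak-classifier weights
    weak_preds: shape (T, m), h_t(x_i) in {-1, +1}
    y:          shape (m,), true labels in {-1, +1}
    Returns values in [-1, +1]: (weighted fraction voting correctly)
    minus (weighted fraction voting incorrectly).
    """
    F = alphas @ weak_preds                    # unnormalized score F(x_i)
    return y * F / np.abs(alphas).sum()        # normalize so margins lie in [-1, +1]

def margin_cdf(m_vals, thetas):
    """Cumulative margin distribution: fraction of examples with margin <= theta."""
    return [(m_vals <= t).mean() for t in thetas]
```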

Empirical Evidence: The Margin Distribution

  • margin distribution = cumulative distribution of margins of training examples

[figures: test/train error vs. # of rounds T, and cumulative margin distributions after 5, 100, and 1000 rounds of boosting]

    # rounds           5      100     1000
    train error        0.0    0.0     0.0
    test error         8.4    3.3     3.1
    % margins ≤ 0.5    7.7    0.0     0.0
    minimum margin     0.14   0.52    0.55

Theoretical Evidence: Analyzing Boosting Using Margins

  • Theorem: large margins ⇒ better bound on generalization error (independent of number of rounds)
    • proof idea: if all margins are large, then can approximate final classifier by a much smaller classifier (just as polls can predict not-too-close election)
  • Theorem: boosting tends to increase margins of training examples (given weak learning assumption)
    • moreover, larger edges ⇒ larger margins
    • proof idea: similar to training error proof
  • so: although final classifier is getting larger, margins are likely to be increasing, so final classifier actually getting close to a simpler classifier, driving down the test error

More Technically...

  • with high probability, ∀θ > 0 :

      generalization error ≤ P̂r[ margin ≤ θ ] + Õ( √(d/m) / θ )

    (P̂r[ ] = empirical probability)
  • bound depends on:
    • m = # training examples
    • d = “complexity” of weak classifiers
    • entire distribution of margins of training examples
  • P̂r[ margin ≤ θ ] → 0 exponentially fast (in T) if εt < 1/2 − θ (∀t)
  • so: if weak learning assumption holds, then all examples will quickly have “large” margins

Consequences of Margins Theory

  • predicts good generalization with no overfitting if:
    • weak classifiers have large edges (implying large margins)
    • weak classifiers not too complex relative to size of training set
    • e.g., boosting decision trees resistant to overfitting since trees often have large edges and limited complexity
  • overfitting may occur if:
    • small edges (underfitting), or
    • overly complex weak classifiers
    • e.g., heart-disease dataset:
      • stumps yield small edges
      • also, small dataset

Improved Boosting with Better Margin-Maximization?

  • can design algorithms more effective than AdaBoost at maximizing the minimum margin
  • in practice, often perform worse   [Breiman]
  • why??
    • more aggressive margin maximization seems to lead to:
      • more complex weak classifiers (even using same weak learner); or
      • higher minimum margins, but margin distributions that are lower overall

[with Reyzin]  [Reyzin & Schapire]

Comparison to SVM’s

  • both AdaBoost and SVM’s:
    • work by maximizing “margins”
    • find linear threshold function in high-dimensional space
  • differences:
    • margin measured slightly differently (using different norms)
    • SVM’s handle high-dimensional space using kernel trick; AdaBoost uses weak learner to search over space
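One standard way to make the “different norms” point precise (from the margins literature; not spelled out on this slide): boosting's margin is normalized with an ℓ1/ℓ∞ pairing, the SVM margin with ℓ2/ℓ2:

```latex
\text{boosting: } \frac{y\,\big(\boldsymbol{\alpha}\cdot \mathbf{h}(x)\big)}{\|\boldsymbol{\alpha}\|_1}
\ \text{ with } \|\mathbf{h}(x)\|_\infty \le 1,
\qquad
\text{SVM: } \frac{y\,(\mathbf{w}\cdot \mathbf{x})}{\|\mathbf{w}\|_2}
\ \text{ with } \|\mathbf{x}\|_2 \le 1 .
```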

Practical Extensions

  • multiclass classification
  • ranking problems
  • confidence-rated predictions

“Hard” Predictions Can Slow Learning

[figure: examples separated by a line L]

  • ideally, want weak classifier that says:
      h(x) = { +1 if x above L,  “don’t know” else }
  • problem: cannot express using “hard” predictions
    • if must predict ±1 below L, will introduce many “bad” predictions
    • need to “clean up” on later rounds
    • dramatically increases time to convergence

Confidence-Rated Predictions

[with Singer]

  • useful to allow weak classifiers to assign confidences to predictions
  • formally, allow ht : X → R, with
      sign(ht(x)) = prediction,   |ht(x)| = “confidence”
  • use identical update:
      Dt+1(i) = (Dt(i) / Zt) · exp(−αt yi ht(xi))
    and identical rule for combining weak classifiers
  • question: how to choose αt and ht on each round

[Schapire & Singer]

Confidence-Rated Predictions (cont.)

  • saw earlier:

      training error(Hfinal) ≤ ∏t Zt = (1/m) Σi exp( −yi Σt αt ht(xi) )

  • therefore, on each round t, should choose αt ht to minimize:

      Zt = Σi Dt(i) exp(−αt yi ht(xi))

  • in many cases (e.g., decision stumps), best confidence-rated weak classifier has simple form that can be found efficiently

[Schapire & Singer]
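A sketch of what that simple form can look like for a stump (illustrative, in the Schapire & Singer style: each branch outputs ½ ln(W₊/W₋), the value minimizing Zt on that branch, with a small smoothing constant for empty branches):

```python
import numpy as np

def confidence_rated_stump(X, y, D, feature, threshold, smooth=1e-10):
    """Real-valued stump on one feature; returns per-example predictions and Z_t."""
    pred = np.zeros(len(y))
    for side in (X[:, feature] <= threshold, X[:, feature] > threshold):
        w_plus = D[side & (y == +1)].sum()          # weight of positives on this branch
        w_minus = D[side & (y == -1)].sum()         # weight of negatives on this branch
        pred[side] = 0.5 * np.log((w_plus + smooth) / (w_minus + smooth))
    Z = np.sum(D * np.exp(-y * pred))               # pick the feature/threshold minimizing Z
    return pred, Z
```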

Confidence-Rated Predictions Help a Lot

[figure: % error vs. number of rounds: train and test, with and without confidence-rated predictions]

    % error   round first reached (conf.)   (no conf.)   speedup
    40        268                           16,938       63.2
    35        598                           65,292       109.2
    30        1,888                         >80,000      –

Application: Boosting for Text Categorization

[with Singer]

  • weak classifiers: very simple weak classifiers that test on simple patterns, namely, (sparse) n-grams
    • find parameter αt and rule ht of given form which minimize Zt
    • use efficiently implemented exhaustive search
  • “How may I help you” data:
    • 7844 training examples
    • 1000 test examples
    • categories: AreaCode, AttService, BillingCredit, CallingCard, Collect, Competitor, DialForMe, Directory, HowToDial, PersonToPerson, Rate, ThirdNumber, Time, TimeCharge, Other.

[Schapire & Singer]

Weak Classifiers

    rnd   term
    1     collect
    2     card
    3     my home
    4     person ? person
    5     code
    6     I

    (for each term, the original slide also shows its vote for each category: AC AS BC CC CO CM DM DI HO PP RA 3N TI TC OT)

More Weak Classifiers

    rnd   term
    7     time
    8     wrong number
    9     how
    10    call
    11    seven
    12    trying to
    13    and

More Weak Classifiers

    rnd   term
    14    third
    15    to
    16    for
    17    charges
    18    dial
    19    just

Finding Outliers

examples with most weight are often outliers (mislabeled and/or ambiguous)

  • I’m trying to make a credit card call (Collect)
  • hello (Rate)
  • yes I’d like to make a long distance collect call please (CallingCard)
  • calling card please (Collect)
  • yeah I’d like to use my calling card number (Collect)
  • can I get a collect call (CallingCard)
  • yes I would like to make a long distant telephone call and have the charges billed to another number (CallingCard DialForMe)
  • yeah I can not stand it this morning I did oversea call is so bad (BillingCredit)
  • yeah special offers going on for long distance (AttService Rate)
  • mister allen please william allen (PersonToPerson)
  • yes ma’am I I’m trying to make a long distance call to a non dialable point in san miguel philippines (AttService Other)

Basic Algorithm and Core Theory

  • introduction to AdaBoost
  • analysis of training error
  • analysis of test error and the margins theory
  • experiments and applications

Practical Advantages of AdaBoost

  • fast
  • simple and easy to program
  • no parameters to tune (except T)
  • flexible — can combine with any learning algorithm
  • no prior knowledge needed about weak learner
  • provably effective, provided can consistently find rough rules of thumb
    → shift in mind set: goal now is merely to find classifiers barely better than random guessing
  • versatile
    • can use with data that is textual, numeric, discrete, etc.
    • has been extended to learning problems well beyond binary classification

Caveats

  • performance of AdaBoost depends on data and weak learner
  • consistent with theory, AdaBoost can fail if:
    • weak classifiers too complex → overfitting
    • weak classifiers too weak (γt → 0 too quickly) → underfitting → low margins → overfitting
  • empirically, AdaBoost seems especially susceptible to uniform noise

UCI Experiments

[with Freund]

  • tested AdaBoost on UCI benchmarks
  • used:
    • C4.5 (Quinlan’s decision tree algorithm)
    • “decision stumps”: very simple rules of thumb that test on single attributes

[figures: two example stumps, “height > 5 feet ?” and “eye color = brown ?”, each predicting +1 on one branch and −1 on the other]

[Freund & Schapire]
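For concreteness, a minimal decision-stump weak learner (an illustrative sketch, not the code used in these experiments) that could serve as the `weak_learner` hook in the boosting loop sketched earlier:

```python
import numpy as np

def stump_weak_learner(X, y, D):
    """Return the single-attribute threshold test with the lowest weighted error."""
    m, n = X.shape
    best_err, best = np.inf, None
    for j in range(n):                               # attribute to test
        for thr in np.unique(X[:, j]):               # candidate threshold
            for sign in (+1, -1):                    # direction of the prediction
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = D[pred != y].sum()             # weighted error on D
                if err < best_err:
                    best_err, best = err, (j, thr, sign)
    j, thr, sign = best
    return lambda Xn: np.where(Xn[:, j] > thr, sign, -sign)
```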

UCI Results

[figures: scatter plots of test error on UCI benchmarks, comparing boosting stumps vs. C4.5 and boosting C4.5 vs. C4.5]

Tomorrow: more experiments and applications

  • Download and play around with jboost (2.4): http://jboost.sourceforge.net
