SLIDE 1

The Boosting Approach to Machine Learning

Maria-Florina Balcan

03/16/2015

SLIDE 2

Boosting

  • General method for improving the accuracy of any given learning algorithm.
  • Works by creating a series of challenge datasets s.t. even modest performance on these can be used to produce an overall high-accuracy predictor.
  • Backed up by solid foundations.
  • Works amazingly well in practice --- Adaboost and its variations are among the top 10 algorithms.

SLIDE 3

Readings:

  • The Boosting Approach to Machine Learning: An Overview. Rob Schapire, 2001.
  • Theory and Applications of Boosting. NIPS tutorial.
    http://www.cs.princeton.edu/~schapire/talks/nips-tutorial.pdf

Plan for today:

  • Motivation.
  • A bit of history.
  • Adaboost: algo, guarantees, discussion.
  • Focus on supervised classification.
SLIDE 4

An Example: Spam Detection

  • E.g., classify which emails are spam and which are important.

Key observation/motivation:

  • Easy to find rules of thumb that are often correct.
    E.g., “If buy now is in the message, then predict spam.”
    E.g., “If say good-bye to debt is in the message, then predict spam.”
  • Harder to find a single rule that is very highly accurate.
SLIDE 5

An Example: Spam Detection

Boosting: meta-procedure that takes in an algo for finding rules of thumb (weak learner). Produces a highly accurate rule by calling the weak learner repeatedly on cleverly chosen datasets.

  • apply weak learner to a subset of emails, obtain rule of thumb
  • apply to 2nd subset of emails, obtain 2nd rule of thumb
  • apply to 3rd subset of emails, obtain 3rd rule of thumb
  • repeat T times; combine weak rules into a single highly accurate rule.

[Figure: rules of thumb h1, h2, h3, …, hT produced from the emails]

SLIDE 6

Boosting: Important Aspects

How to choose examples on each round?

  • Typically, concentrate on the “hardest” examples (those most often misclassified by previous rules of thumb).

How to combine rules of thumb into a single prediction rule?

  • Take a (weighted) majority vote of the rules of thumb.
SLIDE 7

Historically….

SLIDE 8

Weak Learning vs Strong/PAC Learning

  • [Kearns & Valiant ’88]: defined weak learning: being able to predict better than random guessing (error ≤ 1/2 − γ), consistently.
  • Posed an open pb: “Does there exist a boosting algo that turns a weak learner into a strong PAC learner (that can produce arbitrarily accurate hypotheses)?”
  • Informally, given a “weak” learning algo that can consistently find classifiers of error ≤ 1/2 − γ, a boosting algo would provably construct a single classifier with error ≤ ε.

SLIDE 9

Weak Learning vs Strong/PAC Learning

Strong (PAC) Learning

  • ∃ algo A
  • ∀ c ∈ H
  • ∀ D
  • ∀ ε > 0
  • ∀ δ > 0
  • A produces h s.t.: Pr[ err(h) ≥ ε ] ≤ δ

Weak Learning

  • ∃ algo A
  • ∃ γ > 0
  • ∀ c ∈ H
  • ∀ D
  • ∀ ε > 1/2 − γ
  • ∀ δ > 0
  • A produces h s.t.: Pr[ err(h) ≥ ε ] ≤ δ

  • [Kearns & Valiant ’88]: defined weak learning & posed an open pb of finding a boosting algo.

SLIDE 10

Surprisingly….

Weak Learning = Strong (PAC) Learning

Original Construction [Schapire ’89]:

  • poly-time boosting algo, exploits that we can learn a little on every distribution.
  • A modest booster obtained via calling the weak learning algorithm on 3 distributions:

      Error = β < 1/2 − γ  →  error 3β² − 2β³   (see the note below)

  • Then amplifies the modest boost of accuracy by running this somehow recursively.
  • Cool conceptually and technically, not very practical.
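As a side note, not on the slide: 3β² − 2β³ is exactly the failure probability of a majority vote over three classifiers of error β whose mistakes behave independently; Schapire's construction attains the same figure without an independence assumption, by choosing the three distributions carefully.

\[
\Pr[\text{majority of 3 is wrong}] \;=\; 3\beta^2(1-\beta) + \beta^3 \;=\; 3\beta^2 - 2\beta^3 \;<\; \beta
\quad \text{for } 0 < \beta < \tfrac{1}{2}.
\]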

SLIDE 11

An explosion of subsequent work

SLIDE 12

Adaboost (Adaptive Boosting)

[Freund-Schapire, JCSS’97]

Gödel Prize winner 2003: “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting”

SLIDE 13

Informal Description Adaboost

Boosting: turns a weak algo into a strong (PAC) learner.

Input: S = {(x1, y1), …, (xm, ym)};  xi ∈ X, yi ∈ Y = {−1,1}
       weak learning algo A (e.g., Naïve Bayes, decision stumps)

  • For t=1,2, … ,T
    • Construct Dt on {x1, …, xm}
    • Run A on Dt producing ht: X → {−1,1} (weak classifier)
      εt = Pr_{xi~Dt}[ ht(xi) ≠ yi ] = error of ht over Dt

  • Output Hfinal(x) = sign( Σ_{t=1}^{T} αt ht(x) )

Roughly speaking, Dt+1 increases the weight on xi if ht is incorrect on xi; decreases it on xi if ht is correct.

[Figure: training examples labeled + and −, with a weak classifier ht]

SLIDE 14

Adaboost (Adaptive Boosting)

  • Weak learning algorithm A.
  • For t=1,2, … ,T
    • Construct Dt on {x1, …, xm}
    • Run A on Dt producing ht

Constructing Dt:

  • D1 uniform on {x1, …, xm}  [i.e., D1(i) = 1/m]
  • Given Dt and ht, set

      Dt+1(i) = Dt(i)/Zt · e^(−αt)   if yi = ht(xi)
      Dt+1(i) = Dt(i)/Zt · e^(αt)    if yi ≠ ht(xi)

    equivalently, Dt+1(i) = Dt(i)/Zt · e^(−αt yi ht(xi)),

    where αt = 1/2 ln( (1 − εt)/εt ) > 0 and Zt is the normalizer.

    Dt+1 puts half of its weight on examples xi where ht is incorrect & half on examples where ht is correct.

Final hyp: Hfinal(x) = sign( Σt αt ht(x) )

(A runnable sketch of these updates follows below.)
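A minimal runnable sketch of the loop above (labels in {−1, +1}), with a brute-force decision stump standing in for the weak learner A. The function names, the stump search, and the tiny dataset are illustrative choices, not from the lecture:

```python
import numpy as np

def stump_weak_learner(X, y, D):
    """Return the best threshold stump (predicts s if x[j] >= theta, else -s) under weights D."""
    m, n = X.shape
    best = None  # (weighted error, feature j, threshold theta, sign s)
    for j in range(n):
        for theta in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = s * np.where(X[:, j] - theta >= 0, 1, -1)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, theta, s)
    _, j, theta, s = best
    return lambda Z: s * np.where(Z[:, j] - theta >= 0, 1, -1)

def adaboost(X, y, T=50, weak_learner=stump_weak_learner):
    m = len(y)
    D = np.full(m, 1.0 / m)                       # D_1: uniform over the sample
    hs, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)                 # run A on D_t, producing h_t
        pred = h(X)
        eps = np.clip(D[pred != y].sum(), 1e-10, 1 - 1e-10)  # eps_t: error of h_t over D_t
        alpha = 0.5 * np.log((1 - eps) / eps)     # alpha_t = 1/2 ln((1 - eps_t)/eps_t)
        D = D * np.exp(-alpha * y * pred)         # up-weight mistakes, down-weight correct ones
        D = D / D.sum()                           # normalize by Z_t
        hs.append(h)
        alphas.append(alpha)
    return lambda Z: np.sign(sum(a * h(Z) for a, h in zip(alphas, hs)))  # H_final

# Tiny illustrative 1-D check (made-up data):
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([-1, -1, 1, -1, 1, 1])
H = adaboost(X, y, T=50)
print("training accuracy:", (H(X) == y).mean())
```

The two lines to compare against the slide are the αt computation and the exponential reweighting followed by normalization.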

SLIDE 15

Adaboost: A toy example

Weak classifiers: vertical or horizontal half-planes (a.k.a. decision stumps)

SLIDE 16

Adaboost: A toy example

SLIDE 17

Adaboost: A toy example

SLIDE 18

Adaboost (Adaptive Boosting)

  • Weak learning algorithm A.
  • For t=1,2, … ,T
    • Construct Dt on {x1, …, xm}
    • Run A on Dt producing ht

Constructing Dt:

  • D1 uniform on {x1, …, xm}  [i.e., D1(i) = 1/m]
  • Given Dt and ht, set

      Dt+1(i) = Dt(i)/Zt · e^(−αt)   if yi = ht(xi)
      Dt+1(i) = Dt(i)/Zt · e^(αt)    if yi ≠ ht(xi)

    equivalently, Dt+1(i) = Dt(i)/Zt · e^(−αt yi ht(xi)),

    where αt = 1/2 ln( (1 − εt)/εt ) > 0 and Zt is the normalizer.

    Dt+1 puts half of its weight on examples xi where ht is incorrect & half on examples where ht is correct.

Final hyp: Hfinal(x) = sign( Σt αt ht(x) )

SLIDE 19

Nice Features of Adaboost

  • Very general: a meta-procedure, it can use any weak learning algorithm (e.g., Naïve Bayes, decision stumps)!!! (See the usage sketch below.)
  • Very fast (single pass through data each round) & simple to code, no parameters to tune.
  • Grounded in rich theory.
  • Shift in mindset: goal is now just to find classifiers a bit better than random guessing.
  • Relevant for big data age: quickly focuses on “core difficulties”, well-suited to distributed settings, where data must be communicated efficiently [Balcan-Blum-Fine-Mansour COLT’12].
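If you just want to try boosting, scikit-learn ships an AdaBoost implementation; a minimal usage sketch follows (assuming scikit-learn is installed; the depth-1 tree plays the role of the decision-stump weak learner, and the synthetic dataset is only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)   # synthetic data for illustration
stump = DecisionTreeClassifier(max_depth=1)                  # weak learner: a decision stump
# Note: on scikit-learn versions before 1.2 the keyword is base_estimator, not estimator.
clf = AdaBoostClassifier(estimator=stump, n_estimators=100, random_state=0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```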

SLIDE 20

Analyzing Training Error

Theorem: Let εt = 1/2 − γt (error of ht over Dt). Then

    err_S(Hfinal) ≤ exp( −2 Σt γt² )

So, if ∀t, γt ≥ γ > 0, then err_S(Hfinal) ≤ exp( −2 γ² T ).

The training error drops exponentially in T!!!

To get err_S(Hfinal) ≤ ε, need only T = O( (1/γ²) log(1/ε) ) rounds. (A quick numeric check follows below.)

Adaboost is adaptive

  • Does not need to know γ or T a priori
  • Can exploit γt ≫ γ
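A quick numeric check of the bound; the values of γ and ε below are hypothetical, chosen only to illustrate the arithmetic:

```python
import math

gamma = 0.1        # assumed edge over random guessing on every round (hypothetical)
target_err = 0.01  # desired training error

# From the slide: err_S(H_final) <= exp(-2 * gamma^2 * T).
# Solving exp(-2 * gamma^2 * T) <= target_err for T:
T = math.ceil(math.log(1.0 / target_err) / (2 * gamma ** 2))
print(T)  # 231 rounds suffice under this bound
```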

SLIDE 21

Understanding the Updates & Normalization

Claim: Dt+1 puts half of its weight on the xi where ht was incorrect and half on the xi where ht was correct.

Recall Dt+1(i) = Dt(i) e^(−αt yi ht(xi)) / Zt.

Pr_{Dt+1}[ yi = ht(xi) ] = Σ_{i: yi=ht(xi)} Dt(i) e^(−αt) / Zt
                         = (1 − εt) e^(−αt) / Zt
                         = (1 − εt) √( εt/(1 − εt) ) / Zt
                         = √( εt (1 − εt) ) / Zt

Pr_{Dt+1}[ yi ≠ ht(xi) ] = Σ_{i: yi≠ht(xi)} Dt(i) e^(αt) / Zt
                         = εt e^(αt) / Zt
                         = εt √( (1 − εt)/εt ) / Zt
                         = √( εt (1 − εt) ) / Zt

Probabilities are equal!

Zt = Σ_i Dt(i) e^(−αt yi ht(xi))
   = Σ_{i: yi=ht(xi)} Dt(i) e^(−αt) + Σ_{i: yi≠ht(xi)} Dt(i) e^(αt)
   = (1 − εt) e^(−αt) + εt e^(αt)
   = 2 √( εt (1 − εt) )

SLIDE 22
Analyzing Training Error: Proof Intuition

  • If Hfinal incorrectly classifies xi,
    • then xi was incorrectly classified by the (weighted) majority of the ht's;
    • on each round t, we increase the weight of the xi for which ht is wrong;
    • which implies the final prob. weight of xi is large.
      Can show this probability is ≥ (1/m) · 1/(Πt Zt).
  • Since the sum of the probabilities is 1, we can't have too many examples of high weight.
    Can show # incorrectly classified ≤ m · Πt Zt.
  • And Πt Zt → 0.

Theorem: Let εt = 1/2 − γt (error of ht over Dt). Then err_S(Hfinal) ≤ exp( −2 Σt γt² ).

SLIDE 23

Analyzing Training Error: Proof Math

Step 1: unwrapping recurrence:

    D_{T+1}(i) = (1/m) · exp( −yi f(xi) ) / Πt Zt,   where f(xi) = Σt αt ht(xi)
    [f is the unthresholded weighted vote of the ht's on xi]

Step 2: err_S(Hfinal) ≤ Πt Zt.

Step 3: Πt Zt = Πt 2√( εt (1 − εt) ) = Πt √( 1 − 4γt² ) ≤ exp( −2 Σt γt² ).

SLIDE 24

Analyzing Training Error: Proof Math

Step 1: unwrapping recurrence:

    D_{T+1}(i) = (1/m) · exp( −yi f(xi) ) / Πt Zt,   where f(xi) = Σt αt ht(xi)

Recall D1(i) = 1/m and Dt+1(i) = Dt(i) exp( −yi αt ht(xi) ) / Zt.

D_{T+1}(i) = exp( −yi αT hT(xi) ) / ZT × D_T(i)
           = exp( −yi αT hT(xi) ) / ZT × exp( −yi α_{T−1} h_{T−1}(xi) ) / Z_{T−1} × D_{T−1}(i)
           = ……
           = exp( −yi αT hT(xi) ) / ZT × ⋯ × exp( −yi α1 h1(xi) ) / Z1 × (1/m)
           = (1/m) · exp( −yi (α1 h1(xi) + ⋯ + αT hT(xi)) ) / (Z1 ⋯ ZT)
           = (1/m) · exp( −yi f(xi) ) / Πt Zt

SLIDE 25

Analyzing Training Error: Proof Math

Step 1: unwrapping recurrence:

    D_{T+1}(i) = (1/m) · exp( −yi f(xi) ) / Πt Zt,   where f(xi) = Σt αt ht(xi)

Step 2: err_S(Hfinal) ≤ Πt Zt.

err_S(Hfinal) = (1/m) Σ_i 1[ yi ≠ Hfinal(xi) ]          [0/1 loss]
             = (1/m) Σ_i 1[ yi f(xi) ≤ 0 ]
             ≤ (1/m) Σ_i exp( −yi f(xi) )               [exp loss ≥ 0/1 loss; see note below]
             = Σ_i D_{T+1}(i) · Πt Zt
             = Πt Zt
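The inequality in the third line is just the pointwise bound below (a standard fact, spelled out for completeness):

\[
\mathbf{1}[z \le 0] \;\le\; e^{-z} \quad \text{for every real } z, \qquad \text{applied with } z = y_i f(x_i);
\]

if z ≤ 0 then e^(−z) ≥ 1 ≥ 1[z ≤ 0], and if z > 0 then e^(−z) > 0 = 1[z ≤ 0].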

SLIDE 26

Analyzing Training Error: Proof Math

Step 1: unwrapping recurrence:

    D_{T+1}(i) = (1/m) · exp( −yi f(xi) ) / Πt Zt,   where f(xi) = Σt αt ht(xi)

Step 2: err_S(Hfinal) ≤ Πt Zt.

Step 3: Πt Zt = Πt 2√( εt (1 − εt) ) = Πt √( 1 − 4γt² ) ≤ exp( −2 Σt γt² ).

Note: recall Zt = (1 − εt) e^(−αt) + εt e^(αt) = 2√( εt (1 − εt) ); αt is the minimizer of α → (1 − εt) e^(−α) + εt e^(α). (The two small calculations behind this are spelled out below.)
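The two calculations referenced in the note, written out (standard algebra, not spelled out on the slide). First, setting the derivative of α ↦ (1 − εt)e^(−α) + εt e^(α) to zero:

\[
-(1-\epsilon_t)e^{-\alpha} + \epsilon_t e^{\alpha} = 0
\;\Longrightarrow\;
e^{2\alpha} = \frac{1-\epsilon_t}{\epsilon_t}
\;\Longrightarrow\;
\alpha_t = \tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t},
\]

and substituting back gives Zt = 2√( εt(1 − εt) ). Second, with εt = 1/2 − γt,

\[
Z_t = 2\sqrt{\left(\tfrac{1}{2}-\gamma_t\right)\left(\tfrac{1}{2}+\gamma_t\right)} = \sqrt{1-4\gamma_t^2} \;\le\; e^{-2\gamma_t^2},
\]

using 1 + x ≤ e^x with x = −4γt².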

SLIDE 27
Analyzing Training Error: Proof Intuition

  • If Hfinal incorrectly classifies xi,
    • then xi was incorrectly classified by the (weighted) majority of the ht's;
    • on each round t, we increase the weight of the xi for which ht is wrong;
    • which implies the final prob. weight of xi is large.
      Can show this probability is ≥ (1/m) · 1/(Πt Zt).
  • Since the sum of the probabilities is 1, we can't have too many examples of high weight.
    Can show # incorrectly classified ≤ m · Πt Zt.
  • And Πt Zt → 0.

Theorem: Let εt = 1/2 − γt (error of ht over Dt). Then err_S(Hfinal) ≤ exp( −2 Σt γt² ).

SLIDE 28

Generalization Guarantees

Theorem: where εt = 1/2 − γt, err_S(Hfinal) ≤ exp( −2 Σt γt² ).

How about generalization guarantees?

Original analysis [Freund & Schapire ’97]

Hfinal is a weighted vote, so the hypothesis class is:

    G = { all fns of the form sign( Σ_{t=1}^{T} αt ht(x) ) }

Theorem [Freund & Schapire ’97]: ∀ g ∈ G, err(g) ≤ err_S(g) + Õ( √(Td/m) )

  • T = # of rounds; H = space of weak hypotheses; d = VCdim(H)

Key reason: VCdim(G) = Õ(dT), plus typical VC bounds.

SLIDE 29

Generalization Guarantees

Theorem [Freund & Schapire ’97]: ∀ g ∈ co(H), err(g) ≤ err_S(g) + Õ( √(Td/m) )

    generalization error ≤ train error + complexity term

where d = VCdim(H) and T = # of rounds.

SLIDE 30

Generalization Guarantees

  • Experiments with boosting showed that the test error of the generated classifier usually does not increase as its size becomes very large.
  • Experiments showed that continuing to add new weak learners after correct classification of the training set had been achieved could further improve test set performance!!!

SLIDE 31

Generalization Guarantees

  • Experiments with boosting showed that the test error of the generated classifier usually does not increase as its size becomes very large.
  • Experiments showed that continuing to add new weak learners after correct classification of the training set had been achieved could further improve test set performance!!!
  • These results seem to contradict the FS’97 bound and Occam’s razor (in order to achieve good test error the classifier should be as simple as possible)!

SLIDE 32

How can we explain the experiments?

  • R. Schapire, Y. Freund, P. Bartlett, W. S. Lee present a nice theoretical explanation in “Boosting the margin: A new explanation for the effectiveness of voting methods”.

Key Idea:

Training error does not tell the whole story. We also need to consider the classification confidence!!
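For context, beyond what these slides state: the notion of confidence made precise in that paper is the normalized margin of the vote,

\[
\mathrm{margin}_f(x, y) \;=\; \frac{y \sum_t \alpha_t h_t(x)}{\sum_t |\alpha_t|} \;\in\; [-1, 1],
\]

and the resulting generalization bounds depend on the empirical margin distribution rather than on the number of rounds T, roughly err(Hfinal) ≤ Pr_S[ margin ≤ θ ] + Õ( √( d/(m θ²) ) ) for any θ > 0. Continuing to boost can keep increasing margins even after the training error reaches zero, which is consistent with the experiments above.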