
The Boosting Approach to Machine Learning - Maria-Florina Balcan



  1. The Boosting Approach to Machine Learning. Maria-Florina Balcan, 03/16/2015.

  2. Boosting
  • General method for improving the accuracy of any given learning algorithm.
  • Works by creating a series of challenge datasets s.t. even modest performance on these can be used to produce an overall high-accuracy predictor.
  • Works amazingly well in practice: Adaboost and its variants are among the top 10 algorithms.
  • Backed up by solid theoretical foundations.

  3. Readings:
  • The Boosting Approach to Machine Learning: An Overview. Rob Schapire, 2001.
  • Theory and Applications of Boosting. NIPS tutorial. http://www.cs.princeton.edu/~schapire/talks/nips-tutorial.pdf
  Plan for today:
  • Motivation.
  • A bit of history.
  • Adaboost: algorithm, guarantees, discussion.
  • Focus on supervised classification.

  4. An Example: Spam Detection
  E.g., classify which emails are spam and which are important (spam vs. not spam).
  Key observation/motivation:
  • Easy to find rules of thumb that are often correct.
    • E.g., "If 'buy now' appears in the message, then predict spam."
    • E.g., "If 'say good-bye to debt' appears in the message, then predict spam."
  • Harder to find a single rule that is very highly accurate.

  5. An Example: Spam Detection
  • Boosting: a meta-procedure that takes in an algorithm for finding rules of thumb (a weak learner) and produces a highly accurate rule by calling the weak learner repeatedly on cleverly chosen datasets.
  [Figure: subsets of emails and the rules of thumb h₁, h₂, h₃, …, h_T learned from them.]
  • Apply the weak learner to a subset of emails, obtain a rule of thumb.
  • Apply it to a 2nd subset of emails, obtain a 2nd rule of thumb.
  • Apply it to a 3rd subset of emails, obtain a 3rd rule of thumb.
  • Repeat T times; combine the weak rules into a single highly accurate rule.

  6. Boosting: Important Aspects
  How to choose examples on each round?
  • Typically, concentrate on the "hardest" examples (those most often misclassified by previous rules of thumb).
  How to combine rules of thumb into a single prediction rule?
  • Take a (weighted) majority vote of the rules of thumb.

  7. Historically….

  8. Weak Learning vs Strong/PAC Learning
  [Kearns & Valiant '88]: defined weak learning:
  • being able to predict better than random guessing (error ≤ 1/2 − γ), consistently.
  • Posed an open problem: "Does there exist a boosting algorithm that turns a weak learner into a strong PAC learner (one that can produce arbitrarily accurate hypotheses)?"
  • Informally, given a "weak" learning algorithm that can consistently find classifiers of error ≤ 1/2 − γ, a boosting algorithm would provably construct a single classifier with error ≤ ε.

  9. Weak Learning vs Strong/PAC Learning
  Strong (PAC) Learning:
  • ∃ algorithm A
  • ∀ c ∈ H, ∀ D
  • ∀ ε > 0, ∀ δ > 0
  • A produces h s.t. Pr[err(h) ≥ ε] ≤ δ
  Weak Learning:
  • ∃ algorithm A, ∃ γ > 0
  • ∀ c ∈ H, ∀ D
  • ∀ ε ≥ 1/2 − γ, ∀ δ > 0
  • A produces h s.t. Pr[err(h) ≥ ε] ≤ δ
  [Kearns & Valiant '88]: defined weak learning & posed the open problem of finding a boosting algorithm.

  10. Surprisingly…. Weak Learning = Strong (PAC) Learning
  Original Construction [Schapire '89]:
  • Poly-time boosting algorithm; exploits that we can learn a little on every distribution.
  • A modest booster obtained via calling the weak learning algorithm on 3 distributions:
    Error = β < 1/2 − γ → error 3β² − 2β³.
  • Then amplifies the modest boost of accuracy by running this recursively.
  • Cool conceptually and technically, but not very practical.
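  To get a feel for the 3β² − 2β³ amplification, here is a minimal numeric sketch (my own illustration, not from the slides) that iterates the map on a weak error rate; the function name is invented:

    # Illustrative sketch: one level of the 3-distribution construction maps a
    # weak error rate b < 1/2 to 3*b**2 - 2*b**3; recursing keeps shrinking it.
    def one_level(b):
        return 3 * b**2 - 2 * b**3

    b = 0.40  # weak error, i.e. gamma = 0.10
    for level in range(4):
        print(f"level {level}: error = {b:.4f}")
        b = one_level(b)
    # level 0: 0.4000, level 1: 0.3520, level 2: 0.2845, level 3: 0.1967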

  11. An explosion of subsequent work

  12. Adaboost (Adaptive Boosting)
  "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting" [Freund-Schapire, JCSS'97]. Gödel Prize winner 2003.

  13. Informal Description of Adaboost
  • Boosting: turns a weak algorithm into a strong (PAC) learner.
  Input: S = {(x₁, y₁), …, (x_m, y_m)}; x_i ∈ X, y_i ∈ Y = {−1, 1}; plus a weak learning algorithm A (e.g., Naïve Bayes, decision stumps).
  • For t = 1, 2, …, T:
    • Construct D_t on {x₁, …, x_m}.
    • Run A on D_t, producing a weak classifier h_t : X → {−1, 1} with error ε_t = Pr_{x_i ∼ D_t}[h_t(x_i) ≠ y_i] over D_t.
  • Output H_final(x) = sign(Σ_{t=1}^{T} α_t h_t(x)).
  [Figure: positively and negatively labeled points in the plane, with the weak classifier h_t separating them imperfectly.]
  Roughly speaking, D_{t+1} increases the weight on x_i if h_t is incorrect on x_i, and decreases it if h_t is correct.
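  Since the slide names decision stumps as a typical weak learner, here is a minimal sketch of a stump trained on a weighted sample, which is the interface Adaboost needs. This is my own illustration, not code from the talk; fit_stump and its signature are invented for the example:

    import numpy as np

    def fit_stump(X, y, weights):
        """Pick the single-feature threshold rule with lowest weighted error.
        X: (m, d) features, y: (m,) labels in {-1, +1}, weights: distribution D_t."""
        m, d = X.shape
        best = (0, 0.0, 1, float("inf"))  # (feature, threshold, sign, weighted error)
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for sign in (+1, -1):
                    pred = np.where(X[:, j] <= thr, sign, -sign)
                    err = weights[pred != y].sum()
                    if err < best[3]:
                        best = (j, thr, sign, err)
        j, thr, sign, err = best
        return (lambda Z: np.where(Z[:, j] <= thr, sign, -sign)), err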

  14. Adaboost (Adaptive Boosting)
  • Weak learning algorithm A.
  • For t = 1, 2, …, T:
    • Construct D_t on {x₁, …, x_m}.
    • Run A on D_t, producing h_t.
  Constructing D_t:
  • D₁ is uniform on {x₁, …, x_m} [i.e., D₁(i) = 1/m].
  • Given D_t and h_t, set
    D_{t+1}(i) = (D_t(i) / Z_t) · e^{−α_t}  if y_i = h_t(x_i)
    D_{t+1}(i) = (D_t(i) / Z_t) · e^{α_t}   if y_i ≠ h_t(x_i)
    i.e., D_{t+1}(i) = (D_t(i) / Z_t) · e^{−α_t y_i h_t(x_i)},
    where α_t = (1/2) ln((1 − ε_t)/ε_t) > 0 and Z_t is a normalizer.
  • D_{t+1} puts half of the weight on examples x_i where h_t is incorrect and half on examples where h_t is correct.
  Final hypothesis: H_final(x) = sign(Σ_t α_t h_t(x)).
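  Putting the update rule on this slide together with the stump learner sketched above, a compact version of the Adaboost loop might look as follows. This is an illustrative reconstruction under the slide's notation, not the authors' code; the weak_learner argument can be any function with the (X, y, weights) -> (h, error) interface, e.g. the hypothetical fit_stump above:

    import numpy as np

    def adaboost(X, y, T, weak_learner):
        """Adaboost as on the slide: alpha_t = 0.5 * ln((1 - eps_t) / eps_t)."""
        m = len(y)
        D = np.full(m, 1.0 / m)                      # D_1 uniform
        hypotheses, alphas = [], []
        for t in range(T):
            h, eps = weak_learner(X, y, D)           # run A on D_t, get h_t and eps_t
            eps = min(max(eps, 1e-12), 1 - 1e-12)    # guard against division by zero
            if eps >= 0.5:                           # h_t no better than random; stop
                break
            alpha = 0.5 * np.log((1 - eps) / eps)
            D = D * np.exp(-alpha * y * h(X))        # up-weight mistakes, down-weight hits
            D = D / D.sum()                          # normalize by Z_t
            hypotheses.append(h)
            alphas.append(alpha)
        def H_final(Z):
            return np.sign(sum(a * h(Z) for a, h in zip(alphas, hypotheses)))
        return H_final

  Usage would then be a single call such as H = adaboost(X, y, T=50, weak_learner=fit_stump), after which H(X_new) returns ±1 predictions.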

  15. Adaboost: A toy example. Weak classifiers: vertical or horizontal half-planes (a.k.a. decision stumps).

  16. Adaboost: A toy example

  17. Adaboost: A toy example

  18. Adaboost (Adaptive Boosting) [slide 14 shown again after the toy example: construct D_t, run A to get h_t, reweight with α_t = (1/2) ln((1 − ε_t)/ε_t), output H_final(x) = sign(Σ_t α_t h_t(x))].

  19. Nice Features of Adaboost
  • Very general: a meta-procedure, it can use any weak learning algorithm!!! (e.g., Naïve Bayes, decision stumps)
  • Very fast (single pass through the data each round) & simple to code; no parameters to tune.
  • Shift in mindset: the goal is now just to find classifiers a bit better than random guessing.
  • Grounded in rich theory.
  • Relevant for the big data age: quickly focuses on "core difficulties"; well-suited to distributed settings where data must be communicated efficiently [Balcan-Blum-Fine-Mansour COLT'12].

  20. Analyzing Training Error
  Theorem: ε_t = 1/2 − γ_t (error of h_t over D_t); then
    err_S(H_final) ≤ exp(−2 Σ_t γ_t²).
  So, if ∀ t, γ_t ≥ γ > 0, then err_S(H_final) ≤ exp(−2 γ² T).
  The training error drops exponentially in T!!!
  To get err_S(H_final) ≤ ε, we need only T = O((1/γ²) log(1/ε)) rounds.
  Adaboost is adaptive:
  • Does not need to know γ or T a priori.
  • Can exploit γ_t ≫ γ.
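  As a quick numeric illustration of this bound (my own example, not from the slides): solving exp(−2γ²T) ≤ ε for T gives T ≥ ln(1/ε) / (2γ²).

    import math

    def rounds_needed(gamma, eps):
        """Smallest T with exp(-2 * gamma**2 * T) <= eps, per the training-error bound."""
        return math.ceil(math.log(1 / eps) / (2 * gamma ** 2))

    print(rounds_needed(0.1, 0.01))  # edge gamma = 0.1, target training error 1% -> 231 rounds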

  21. Understanding the Updates & Normalization
  Claim: D_{t+1} puts half of the weight on the x_i where h_t was incorrect and half of the weight on the x_i where h_t was correct.
  Recall D_{t+1}(i) = (D_t(i) / Z_t) · e^{−α_t y_i h_t(x_i)}, and note e^{α_t} = √((1 − ε_t)/ε_t).
  Weight on examples where h_t is incorrect:
    Pr_{D_{t+1}}[y_i ≠ h_t(x_i)] = Σ_{i: y_i ≠ h_t(x_i)} (D_t(i)/Z_t) e^{α_t} = (ε_t / Z_t) e^{α_t} = (1/Z_t) √(ε_t (1 − ε_t)).
  Weight on examples where h_t is correct:
    Pr_{D_{t+1}}[y_i = h_t(x_i)] = Σ_{i: y_i = h_t(x_i)} (D_t(i)/Z_t) e^{−α_t} = ((1 − ε_t) / Z_t) e^{−α_t} = (1/Z_t) √(ε_t (1 − ε_t)).
  The two probabilities are equal!
  Normalizer:
    Z_t = Σ_i D_t(i) e^{−α_t y_i h_t(x_i)} = Σ_{i: y_i = h_t(x_i)} D_t(i) e^{−α_t} + Σ_{i: y_i ≠ h_t(x_i)} D_t(i) e^{α_t} = (1 − ε_t) e^{−α_t} + ε_t e^{α_t} = 2√(ε_t (1 − ε_t)).
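  A tiny numeric check of the claim (illustrative only; the ~80% accuracy figure and variable names are made up) confirms that after reweighting and normalizing, exactly half the mass sits on the mistakes:

    import numpy as np

    rng = np.random.default_rng(0)
    m = 1000
    D = np.full(m, 1.0 / m)                    # D_t uniform, for simplicity
    correct = rng.random(m) < 0.8              # pretend h_t is right on ~80% of examples
    eps = D[~correct].sum()                    # weighted error of h_t under D_t
    alpha = 0.5 * np.log((1 - eps) / eps)
    D_next = D * np.where(correct, np.exp(-alpha), np.exp(alpha))
    D_next /= D_next.sum()                     # divide by the normalizer Z_t
    print(D_next[~correct].sum())              # prints 0.5 (up to floating-point rounding)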

  22. Analyzing Training Error: Proof Intuition
  Theorem: ε_t = 1/2 − γ_t (error of h_t over D_t); err_S(H_final) ≤ exp(−2 Σ_t γ_t²).
  • On round t, we increase the weight of the x_i for which h_t is wrong.
  • If H_final incorrectly classifies x_i,
    - then x_i is incorrectly classified by a (weighted) majority of the h_t's,
    - which implies the final probability weight of x_i is large.
    Can show it is ≥ (1/m) · (1 / Π_t Z_t).
  • Since the probability weights sum to 1, we can't have too many examples of high weight.
    Can show the number incorrectly classified is ≤ m · Π_t Z_t.
  • And Π_t Z_t → 0.
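  To connect this intuition to the exponential bound on slide 20, the standard final step (not spelled out on the slide, but implied by the normalizer computed on slide 21) chains the Z_t's:
    err_S(H_final) ≤ Π_t Z_t = Π_t 2√(ε_t (1 − ε_t)) = Π_t √(1 − 4γ_t²) ≤ Π_t e^{−2γ_t²} = exp(−2 Σ_t γ_t²),
  using ε_t = 1/2 − γ_t and the inequality 1 − x ≤ e^{−x}.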
