
Statistical Machine Learning: A Crash Course, Part III: Boosting - PowerPoint PPT Presentation



  1. Statistical Machine Learning: A Crash Course, Part III: Boosting. Stefan Roth, 11.05.2012 | Department of Computer Science | GRIS

  2. Combining Classifiers ■ Horse race prediction:

  3. Combining Classifiers ■ How do we make money from horse racing bets? ■ Ask a professional. ■ It is very likely that... • The professional cannot give a single highly accurate rule. • But presented with a set of races, they can always generate better-than-random rules. ■ Can you get rich? ■ Disclaimer: We are not saying you should actually try this at home :-)

  4. Combining Classifiers ■ Idea: • Ask an expert for their rule-of-thumb. • Assemble the set of cases where the rule-of-thumb fails (hard cases). • Ask the expert again for the selected set of hard cases. • And so on… ■ Combine many rules-of-thumb.

  5. Combining Classifiers ■ How do we actually do this? ■ How to choose races on each round? • Concentrate on the “hardest” races (those most often misclassified by previous rules of thumb) ■ How to combine rules of thumb into a single prediction rule? • Take a (weighted) majority vote of several rules-of-thumb h_t : R^d → {+1, −1} • We take a weighted average of simple rules (models): H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) )
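The weighted majority vote on slide 5 amounts to a few lines of code. The following is a minimal sketch, not from the slides; the names strong_classify, h1, h2 and the toy inputs are made up for illustration.

```python
def strong_classify(x, weak_classifiers, alphas):
    """Weighted majority vote H(x) = sign(sum_t alpha_t * h_t(x)).

    weak_classifiers: list of functions h_t mapping a sample x to +1 or -1
    alphas:           list of weights alpha_t (one per weak classifier)
    """
    score = sum(a * h(x) for h, a in zip(weak_classifiers, alphas))
    return +1 if score >= 0 else -1

# Illustrative rules-of-thumb on a 2-D input x = (x1, x2):
h1 = lambda x: +1 if x[0] > 0.5 else -1
h2 = lambda x: +1 if x[1] > 0.3 else -1
print(strong_classify((0.7, 0.1), [h1, h2], [0.8, 0.4]))  # -> +1, since h1's weight outvotes h2
```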

  6. Boosting ■ General method of converting rough rules of thumb into a highly accurate prediction rule. ■ More formally: • Given a “weak” learning algorithm that can consistently find “weak classifiers” with a (training) error of ≤ 1/2 − γ. • A boosting algorithm can provably construct a “strong classifier” that has a training error of ≤ ε, for any desired ε > 0. ■ As long as we have a “weak” learning algorithm that does better than chance, we can convert it into an algorithm that performs arbitrarily well!

  7. AdaBoost: Toy Example ■ Training data:

  8. AdaBoost: Toy Example ■ Round 1: [figures: 1st weak classifier; reweighted training data]

  9. AdaBoost: Toy Example ■ Round 2: [figures: 1st weak classifier; 2nd weak classifier; reweighted training data]

  10. AdaBoost: Toy Example ■ Round 3: [figures: 1st weak classifier; 2nd weak classifier; 3rd weak classifier]

  11. AdaBoost: Toy Example ■ Weighted combination:

  12. AdaBoost: Toy Example ■ Final hypothesis / “strong” classifier:

  13. AdaBoost ■ Given: Training data with labels (x_1, y_1), ..., (x_N, y_N) where x_i ∈ R^d, y_i ∈ {+1, −1}. ■ Initialize weights for every data point: D_1(i) = 1/N. ■ Loop over the boosting rounds t = 1, ..., T: • Train the weak learner h_t : R^d → {+1, −1} on the training data so that the weighted error with weights D_t is minimized. • Choose an appropriate weight α_t ∈ R^+ for the weak classifier. • Update the data weights as D_{t+1}(i) = (1/Z_t) D_t(i) exp(−α_t y_i h_t(x_i)), where Z_t is chosen such that D_{t+1} sums to 1.

  14. AdaBoost ■ Given: Training data with labels (x_1, y_1), ..., (x_N, y_N) where x_i ∈ R^d, y_i ∈ {+1, −1}. ■ Return the weighted (“strong”, ensemble) classifier: H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) ) ■ Intuition: • Boosting uses weighted training data and adapts the weights every round. • The weights make the algorithm focus on the wrongly classified examples: exp(−α_t y_i h_t(x_i)) > 1 if y_i ≠ h_t(x_i), and < 1 if y_i = h_t(x_i).
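Slides 13 and 14 together specify the complete training loop. The following Python sketch of that loop is added for illustration and is not the lecturer's code; it assumes the weak learning algorithm is passed in as a function fit_weak(X, y, D) returning a classifier and its weighted error, and the names adaboost, fit_weak and predict are placeholders for this example.

```python
import numpy as np

def adaboost(X, y, fit_weak, T):
    """AdaBoost training loop (slides 13/14).

    X        : (N, d) array of inputs x_i
    y        : (N,) array of labels y_i in {+1, -1}
    fit_weak : weak learning algorithm; fit_weak(X, y, D) returns a classifier
               h(X) -> (N,) array in {+1, -1} together with its weighted error eps_t
    T        : number of boosting rounds
    Returns the selected weak classifiers h_t and their weights alpha_t.
    """
    N = len(y)
    D = np.full(N, 1.0 / N)                      # D_1(i) = 1/N
    hs, alphas = [], []
    for t in range(T):
        h_t, eps = fit_weak(X, y, D)             # minimize the weighted error under D_t
        if eps >= 0.5:                           # not better than chance: stop boosting
            break
        eps = max(eps, 1e-12)                    # guard against log(0) for a perfect h_t
        alpha = 0.5 * np.log((1.0 - eps) / eps)  # alpha_t = 1/2 log((1 - eps_t) / eps_t)
        D = D * np.exp(-alpha * y * h_t(X))      # D_{t+1}(i) proportional to D_t(i) exp(-alpha_t y_i h_t(x_i))
        D = D / D.sum()                          # divide by Z_t so that D_{t+1} sums to 1
        hs.append(h_t)
        alphas.append(alpha)
    return hs, alphas

def predict(X, hs, alphas):
    """Strong classifier H(x) = sign(sum_t alpha_t h_t(x))."""
    scores = sum(a * h(X) for h, a in zip(hs, alphas))
    return np.where(scores >= 0, 1, -1)
```

One concrete choice for fit_weak, a decision stump learner, is sketched after slide 15 below.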

  15. AdaBoost: Weak Learners ■ Training the weak learner: • Given training data (x_1, y_1), ..., (x_N, y_N) • and weights D_t(i) for all data points. • Select the weak classifier with the smallest weighted error: h_t = argmin_{h ∈ H} ε_t with ε_t = Σ_{i=1}^{N} D_t(i) [y_i ≠ h(x_i)] • Prerequisite: Weighted training error ε_t ≤ 1/2 − γ_t, with γ_t > 0 ■ Examples for H: • Weighted least-squares classifier • Decision stumps (hold on...)
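Decision stumps threshold a single feature and are a common weak learner for AdaBoost. The sketch below is illustrative and not from the slides: it enumerates every (feature, threshold, polarity) combination, returns the stump with the smallest weighted error ε_t, and can serve as the fit_weak argument of the adaboost sketch above.

```python
import numpy as np

def fit_stump(X, y, D):
    """Weak learner: pick the decision stump minimizing the weighted error
    eps = sum_i D(i) [y_i != h(x_i)] over all features, thresholds and polarities."""
    N, d = X.shape
    best, best_eps = None, np.inf
    for j in range(d):                           # feature to threshold
        for thr in np.unique(X[:, j]):           # candidate thresholds taken from the data
            for sign in (+1, -1):                # which side of the threshold predicts +1
                pred = np.where(X[:, j] > thr, sign, -sign)
                eps = float(np.sum(D * (pred != y)))
                if eps < best_eps:
                    best_eps, best = eps, (j, thr, sign)
    j, thr, sign = best
    # Return the stump as a prediction function h(X) -> {+1, -1}^N
    def h(Z, j=j, thr=thr, sign=sign):
        return np.where(Z[:, j] > thr, sign, -sign)
    return h, best_eps
```

With the two sketches combined, hs, alphas = adaboost(X, y, fit_stump, T) trains the ensemble of slides 13/14.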

  16. AdaBoost: Weak Learners

  17. AdaBoost: Weak Learners

  18. AdaBoost ■ How do we select α_t? ■ We want to minimize the empirical error: ε_tr(H) = (1/N) Σ_{i=1}^{N} [y_i ≠ H(x_i)] ■ The empirical error can be upper bounded: ε_tr(H) ≤ Π_{t=1}^{T} Z_t = Π_{t=1}^{T} Σ_{i=1}^{N} D_t(i) exp(−α_t y_i h_t(x_i)) [Freund & Schapire] ■ To minimize the empirical error, we can greedily minimize Z_t in each round.
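The slide quotes the bound from Freund & Schapire without proof; the short derivation sketch below is added here for completeness. It unrolls the weight-update recursion, writing f(x) = Σ_t α_t h_t(x) so that H(x) = sign(f(x)).

```latex
% Added sketch: why eps_tr(H) <= prod_t Z_t.
\begin{align*}
D_{T+1}(i) &= D_1(i) \prod_{t=1}^{T} \frac{\exp\{-\alpha_t y_i h_t(x_i)\}}{Z_t}
            = \frac{\exp\{-y_i f(x_i)\}}{N \prod_{t=1}^{T} Z_t}, \\
{}[y_i \ne H(x_i)] &\le \exp\{-y_i f(x_i)\}
  \qquad \text{(the indicator is 1 only if } y_i f(x_i) \le 0\text{)}, \\
\epsilon_{\mathrm{tr}}(H) &= \frac{1}{N} \sum_{i=1}^{N} [y_i \ne H(x_i)]
  \le \frac{1}{N} \sum_{i=1}^{N} \exp\{-y_i f(x_i)\}
  = \sum_{i=1}^{N} D_{T+1}(i) \prod_{t=1}^{T} Z_t
  = \prod_{t=1}^{T} Z_t .
\end{align*}
```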

  19. AdaBoost ■ Select α_t by greedily minimizing Z_t(α) in each round. • Minimizes an upper bound on the empirical error. ■ Minimize Z_t(α) = Σ_{i=1}^{N} D_t(i) exp(−α y_i h_t(x_i)) ■ We obtain the AdaBoost weighting: α_t = (1/2) log((1 − ε_t)/ε_t) with ε_t = Σ_{i=1}^{N} D_t(i) [y_i ≠ h_t(x_i)]
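How the weighting α_t follows from minimizing Z_t(α) is not spelled out on the slide; the derivation below (added here) splits the sum over correctly and incorrectly classified examples.

```latex
% Added sketch: minimizing Z_t(alpha) over alpha.
\begin{align*}
Z_t(\alpha) &= \sum_{i:\,y_i = h_t(x_i)} D_t(i)\, e^{-\alpha}
             + \sum_{i:\,y_i \ne h_t(x_i)} D_t(i)\, e^{\alpha}
            = (1-\epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha}, \\
\frac{dZ_t}{d\alpha} &= -(1-\epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha} = 0
  \;\Longrightarrow\; e^{2\alpha} = \frac{1-\epsilon_t}{\epsilon_t}
  \;\Longrightarrow\; \alpha_t = \frac{1}{2} \log \frac{1-\epsilon_t}{\epsilon_t}, \\
Z_t(\alpha_t) &= 2 \sqrt{\epsilon_t (1-\epsilon_t)} < 1
  \qquad \text{whenever } \epsilon_t < \tfrac{1}{2}.
\end{align*}
```

Since ε_t < 1/2 implies Z_t(α_t) < 1, every round multiplies the error bound from slide 18 by a factor smaller than one, so the bound on the training error decays geometrically as long as each weak classifier is better than chance.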

  20. AdaBoost: Reweighting ■ D_{t+1}(i) = (1/Z_t) D_t(i) exp(−α_t y_i h_t(x_i))

  21. AdaBoost: Reweighting ■ D_{t+1}(i) = (1/Z_t) D_t(i) exp(−α_t y_i h_t(x_i)) ■ Increase the weight on incorrectly classified examples, decrease the weight on correctly classified examples: exp(−α_t y_i h_t(x_i)) > 1 if y_i ≠ h_t(x_i), and < 1 if y_i = h_t(x_i).
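A concrete made-up example of this reweighting: suppose the round-t weak classifier has weighted error ε_t = 1/4. Then α_t = (1/2) log 3 ≈ 0.55 and Z_t = 2√(ε_t(1 − ε_t)) = √3/2 ≈ 0.87, so after normalization each misclassified example has its weight multiplied by e^{α_t}/Z_t = 2 and each correctly classified example by e^{−α_t}/Z_t = 2/3. The misclassified examples, which carried total weight 1/4 under D_t, carry total weight 1/2 under D_{t+1}, forcing the next weak learner to pay much more attention to them.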

  22. AdaBoost: Reweighting ■ Eventually the weights concentrate on only the very difficult cases:

  23. AdaBoost: More realistic example (t = 0) ■ Initialize...

  24. AdaBoost: More realistic example (t = 1) ■ Initialize... ■ For t = 1, ..., T: • Find h_t = argmin_{h ∈ H} ε_t • Stop if ε_t > 1/2 • Set α_t = (1/2) log((1 − ε_t)/ε_t) • Reweight the data: D_{t+1}(i) = (1/Z_t) D_t(i) exp(−α_t y_i h_t(x_i))

  25. AdaBoost: More realistic example (t = 2) ■ (algorithm box repeated from slide 24)

  26. AdaBoost: More realistic example (t = 3) ■ (algorithm box repeated from slide 24)

  27. AdaBoost: More realistic example (t = 4) ■ (algorithm box repeated from slide 24)

  28. AdaBoost: More realistic example (t = 5) ■ (algorithm box repeated from slide 24)

  29. AdaBoost: More realistic example (t = 6) ■ (algorithm box repeated from slide 24)

  30. AdaBoost: More realistic example (t = 7) ■ (algorithm box repeated from slide 24)
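To reproduce an experiment like this in practice, an off-the-shelf implementation can be used instead of the hand-written sketches above. The snippet below is not part of the slides and assumes scikit-learn is installed; AdaBoostClassifier boosts depth-1 decision trees (stumps) by default, and staged_predict exposes the partial ensemble after each round, so the training error can be watched shrinking as the number of rounds T grows.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier

# A small 2-D two-class problem standing in for the "more realistic example"
X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

clf = AdaBoostClassifier(n_estimators=50, random_state=0)  # default weak learners: decision stumps
clf.fit(X, y)

# Training error of the partial ensemble H_T after each boosting round T
for T, y_pred in enumerate(clf.staged_predict(X), start=1):
    if T in (1, 5, 10, 25, 50):
        print(f"T = {T:2d}: training error = {np.mean(y_pred != y):.3f}")
```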
