
BBM406 Fundamentals of Machine Learning Lecture 20: AdaBoost



  1. BBM406 Fundamentals of Machine Learning, Lecture 20: AdaBoost. Aykut Erdem // Hacettepe University // Fall 2019. (Illustration adapted from Alex Rogozhnikov.)

  2. Last time… Bias/Variance Tradeoff • Graphical illustration of bias and variance: http://scott.fortmann-roe.com/docs/BiasVariance.html (slide by David Sontag)

  3. Last time… Bagging • Leo Breiman (1994). • Take repeated bootstrap samples from training set D. • Bootstrap sampling: given a set D containing N training examples, create D′ by drawing N examples at random with replacement from D. • Bagging: create k bootstrap samples D_1, …, D_k; train a distinct classifier on each D_i; classify a new instance by majority vote / average. (slide by David Sontag)
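
A minimal sketch of this bagging procedure, assuming NumPy and scikit-learn are available; the base learner, the number of models k, and the function names are illustrative choices, not the lecture's code:

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, base=DecisionTreeClassifier(), seed=0):
    """Train k classifiers, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    N = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, N, size=N)   # draw N indices with replacement
        models.append(clone(base).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Classify a new instance by majority vote over the k models."""
    votes = np.stack([m.predict(X) for m in models]).astype(int)  # assumes integer class labels
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```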

  4. Last time… Random Forests (figure: trees t = 1, 2, 3) [From the book of Hastie, Tibshirani and Friedman] (slide by Nando de Freitas)

  5. Boosting

  6. Boosting Ideas • Main idea: use a weak learner to create a strong learner. • Ensemble method: combine base classifiers returned by the weak learner. • Finding simple, relatively accurate base classifiers is often not hard. • But how should base classifiers be combined? (slide by Mehryar Mohri)

  7. Example: “How May I Help You?” [Gorin et al.] • Goal: automatically categorize the type of call requested by a phone customer (Collect, CallingCard, PersonToPerson, etc.) - “yes I’d like to place a collect call long distance please” (Collect) - “operator I need to make a call but I need to bill it to my office” (ThirdNumber) - “yes I’d like to place a call on my master card please” (CallingCard) - “I just called a number in sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off of my bill” (BillingCredit) • Observation: it is easy to find “rules of thumb” that are “often” correct, e.g. “IF ‘card’ occurs in utterance THEN predict ‘CallingCard’ ”, but hard to find a single highly accurate prediction rule. (slide by Rob Schapire)
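
To make the “rule of thumb” concrete, a tiny sketch of such a keyword rule as a weak classifier; the function name is hypothetical, and only the keyword/label pair comes from the slide:

```python
def card_rule(utterance):
    """Weak rule of thumb from the slide: often right, far from perfect."""
    if "card" in utterance.lower():
        return "CallingCard"
    return None  # the rule abstains when the keyword is absent

# e.g. card_rule("yes I'd like to place a call on my master card please")
# -> "CallingCard"
```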

  8. Boosting: Intuition • Instead of learning a single (weak) classifier, learn many weak classifiers that are good at different parts of the input space. • Output class: (weighted) vote of each classifier - Classifiers that are most “sure” will vote with more conviction - Classifiers will be most “sure” about a particular part of the space - On average, do better than a single classifier! • But how do you - force classifiers to learn about different parts of the input space? - weigh the votes of different classifiers? (slide by Aarti Singh & Barnabas Poczos)

  9. Boosting [Schapire, 1989] • Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote. • On each iteration t: - weight each training example by how incorrectly it was classified - learn a hypothesis h_t - and a strength for this hypothesis, α_t • Final classifier: a linear combination of the votes of the different classifiers, weighted by their strengths. • Practically useful and theoretically interesting. (slide by Aarti Singh & Barnabas Poczos)

  10. Boosting: Intuition • Want to pick weak classifiers that contribute something to the ensemble. • Greedy algorithm: for m = 1, …, M - pick a weak classifier h_m - adjust weights: misclassified examples get “heavier” - set α_m according to the weighted error of h_m (a code sketch of this loop follows slides 11–16 below). (slide by Raquel Urtasun) [Source: G. Shakhnarovich]

  11.–16. Boosting: Intuition (slides 11–16 repeat the text of slide 10 while the accompanying figure illustrates successive greedy rounds). (slide by Raquel Urtasun) [Source: G. Shakhnarovich]
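
A schematic of the greedy loop from slides 10–16 in Python; a sketch, not the lecture's own code. `fit_weak_classifier` is a hypothetical stand-in for whatever weak learner is used, and the particular α and reweighting formulas are the AdaBoost choices derived later in the lecture:

```python
import numpy as np

def greedy_ensemble(X, y, M, fit_weak_classifier):
    """Greedy boosting loop. Assumes y in {-1,+1} and that
    fit_weak_classifier(X, y, w) returns an object with .predict(X)
    whose weighted error lies strictly between 0 and 1/2."""
    w = np.full(len(y), 1.0 / len(y))           # uniform example weights
    ensemble = []
    for m in range(M):
        h = fit_weak_classifier(X, y, w)        # pick a weak classifier h_m
        pred = h.predict(X)
        eps = np.sum(w * (pred != y))           # weighted error of h_m
        alpha = 0.5 * np.log((1 - eps) / eps)   # alpha_m from the weighted error
        w = w * np.exp(-alpha * y * pred)       # misclassified examples get "heavier"
        w = w / w.sum()                         # renormalize to a distribution
        ensemble.append((alpha, h))
    return ensemble
```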

  17. First Boosting Algorithms • [Schapire ’89]: first provable boosting algorithm. • [Freund ’90]: “optimal” algorithm that “boosts by majority”. • [Drucker, Schapire & Simard ’92]: first experiments using boosting; limited by practical drawbacks. • [Freund & Schapire ’95]: introduced the “AdaBoost” algorithm, with strong practical advantages over previous boosting algorithms. (slide by Rob Schapire)

  18. The AdaBoost Algorithm

  19. Toy Example • Weak hypotheses = vertical or horizontal half-planes. • To minimize the error, for binary h_t one typically uses α_t = ½ ln((1 − ε_t)/ε_t). (slide by Rob Schapire)
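
Plugging the (rounded) errors from the three rounds below into this formula roughly reproduces the votes shown on the slides; the small discrepancies in rounds 2 and 3 presumably stem from unrounded ε values on the original slides:

```latex
\alpha_1 = \tfrac{1}{2}\ln\tfrac{0.70}{0.30} \approx 0.42,\qquad
\alpha_2 = \tfrac{1}{2}\ln\tfrac{0.79}{0.21} \approx 0.66,\qquad
\alpha_3 = \tfrac{1}{2}\ln\tfrac{0.86}{0.14} \approx 0.91
```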

  20. Round 1: weak hypothesis h_1, with weighted error ε_1 = 0.30. (slide by Rob Schapire)

  21. Round 1: ε_1 = 0.30, α_1 = 0.42. (slide by Rob Schapire)

  22. Round 1: reweighted distribution D_2; ε_1 = 0.30, α_1 = 0.42. (slide by Rob Schapire)

  23. Round 2: weak hypothesis h_2, with ε_2 = 0.21. (slide by Rob Schapire)

  24. Round 2: ε_2 = 0.21, α_2 = 0.65. (slide by Rob Schapire)

  25. Round 2: reweighted distribution D_3; ε_2 = 0.21, α_2 = 0.65. (slide by Rob Schapire)

  26. Round 3: weak hypothesis h_3, with ε_3 = 0.14. (slide by Rob Schapire)

  27. Round 3: ε_3 = 0.14, α_3 = 0.92. (slide by Rob Schapire)

  28. Final Hypothesis: H_final(x) = sign(0.42 h_1(x) + 0.65 h_2(x) + 0.92 h_3(x)). (slide by Rob Schapire)
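
A quick check of the vote arithmetic shows why this combination can be stronger than any single stump: every pair of votes outweighs the remaining one, so H_final agrees with whichever two of the three stumps concur:

```latex
0.42 + 0.65 = 1.07 > 0.92,\qquad
0.42 + 0.92 = 1.34 > 0.65,\qquad
0.65 + 0.92 = 1.57 > 0.42
```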

  29. Voted combination of classifiers • The general problem here is to try to combine many simple “weak” classifiers into a single “strong” classifier. • We consider voted combinations of simple binary ±1 component classifiers, f(x) = α_1 h(x; θ_1) + … + α_m h(x; θ_m), where the (non-negative) votes α_i can be used to emphasize component classifiers that are more reliable than others. (slide by Tommi S. Jaakkola)

  30. Components: Decision stumps • Consider the following simple family of component classifiers generating ±1 labels: h(x; θ) = sign(w_1 x_k − w_0), with parameters θ = {k, w_1, w_0}. These are called decision stumps. • Each decision stump pays attention to only a single component of the input vector. (slide by Tommi S. Jaakkola)
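
A minimal NumPy sketch of fitting a weighted decision stump by brute force over coordinates, thresholds, and signs; the parameterization follows the slide (one input component, a threshold, a sign), but the search strategy and function names are illustrative assumptions:

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump h(x) = s * sign(x[k] - t), for y in {-1,+1}.
    Brute-force search over coordinates k, thresholds t, and signs s."""
    n, d = X.shape
    best = (np.inf, None)                          # (weighted error, (k, t, s))
    for k in range(d):
        # candidate thresholds: midpoints between sorted unique feature values
        vals = np.unique(X[:, k])
        for t in (vals[:-1] + vals[1:]) / 2.0:
            pred = np.where(X[:, k] > t, 1.0, -1.0)
            for s in (+1.0, -1.0):
                err = np.sum(w * (s * pred != y))  # weighted misclassification
                if err < best[0]:
                    best = (err, (k, t, s))
    return best

def stump_predict(params, X):
    k, t, s = params
    return s * np.where(X[:, k] > t, 1.0, -1.0)
```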

  31. Voted combinations (cont’d.) • We need to define a loss function for the combination so we can determine which new component h(x; θ) to add and how many votes it should receive. • While there are many options for the loss function, we consider here only a simple exponential loss. (slide by Tommi S. Jaakkola)
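
Concretely, with f the voted combination over n training examples, the empirical exponential loss takes the standard form (consistent with the weight updates derived on the next slides):

```latex
J(f) \;=\; \sum_{i=1}^{n} \exp\bigl(-y_i f(x_i)\bigr)
```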

  32.–33. Modularity, errors, and loss • Consider adding the m-th component (these two slides step through the loss factorization, reconstructed after slide 34 below). (slide by Tommi S. Jaakkola)

  34. Modularity, errors, and loss • Consider adding the m-th component: the exponential loss factors into per-example weights (fixed before round m) times a term that depends only on the new component. • So at the m-th iteration the new component (and the votes) should optimize a weighted loss (weighted towards mistakes). (slide by Tommi S. Jaakkola)
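
The equations on slides 32–34 were images in this transcript; the standard exponential-loss factorization they present, reconstructed from the surrounding definitions, is:

```latex
J(\alpha_m, \theta_m)
  = \sum_{i=1}^{n} \exp\bigl(-y_i f_{m-1}(x_i) - y_i \alpha_m h(x_i;\theta_m)\bigr)
  = \sum_{i=1}^{n} W_i^{(m-1)} \exp\bigl(-y_i \alpha_m h(x_i;\theta_m)\bigr),
\qquad W_i^{(m-1)} = \exp\bigl(-y_i f_{m-1}(x_i)\bigr).
```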

  35. Empirical exponential loss (cont’d.) • To increase modularity we’d like to further decouple the optimization of h(x; θ_m) from the associated votes α_m. • To this end we select the h(x; θ_m) that optimizes the rate at which the loss would decrease as a function of α_m, i.e. the derivative of the loss at α_m = 0. (slide by Tommi S. Jaakkola)

  36. Empirical exponential loss (cont’d.) • We find the h(x; θ_m) that minimizes the derivative of the loss at α_m = 0, i.e. minimizes −Σ_i W_i^(m−1) y_i h(x_i; θ_m). • We can also normalize the weights, W̃_i^(m−1) = W_i^(m−1) / Σ_j W_j^(m−1), so that Σ_i W̃_i^(m−1) = 1. (slide by Tommi S. Jaakkola)

  37. Empirical exponential loss (cont’d.) • We find the h(x; θ̂_m) that minimizes the weighted error ε̂_m = Σ_i W̃_i^(m−1) [y_i ≠ h(x_i; θ_m)], where W̃ are the normalized weights. • The vote α̂_m is subsequently chosen to minimize the resulting loss J(α_m, θ̂_m), giving α̂_m = ½ ln((1 − ε̂_m)/ε̂_m). (slide by Tommi S. Jaakkola)
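
Combining the last two slides, the normalized weights for the next round obey the familiar AdaBoost recursion (a standard consequence of the definitions above):

```latex
\tilde{W}_i^{(m)} \;\propto\; \tilde{W}_i^{(m-1)}\,
\exp\!\bigl(-\alpha_m\, y_i\, h(x_i;\hat{\theta}_m)\bigr),
\qquad \sum_{i} \tilde{W}_i^{(m)} = 1 .
```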

  38. The AdaBoost Algorithm (slide by Jiri Matas and Jan Šochman)

  39. The AdaBoost Algorithm • Given: (x_1, y_1), …, (x_m, y_m); x_i ∈ X, y_i ∈ {−1, +1}. (slide by Jiri Matas and Jan Šochman)

  40. The AdaBoost Algorithm • Given: (x_1, y_1), …, (x_m, y_m); x_i ∈ X, y_i ∈ {−1, +1}. • Initialise weights D_1(i) = 1/m. (slide by Jiri Matas and Jan Šochman)
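
Slides 38–40 introduce the algorithm line by line; for reference, a compact runnable sketch of the whole procedure. Using scikit-learn depth-1 trees as the weak learner is an illustrative assumption (the slides do not fix a weak learner), as are the function names:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    """AdaBoost sketch for y in {-1,+1}: returns votes and weak hypotheses."""
    m = len(y)
    D = np.full(m, 1.0 / m)                     # initialise D_1(i) = 1/m
    alphas, hyps = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1) # decision stump
        h.fit(X, y, sample_weight=D)            # train on the weighted sample
        pred = h.predict(X)
        eps = np.sum(D * (pred != y))           # weighted error eps_t
        if eps >= 0.5:                          # no better than chance: stop
            break
        eps = max(eps, 1e-12)                   # guard against log of zero
        alpha = 0.5 * np.log((1 - eps) / eps)   # vote alpha_t
        D = D * np.exp(-alpha * y * pred)       # up-weight mistakes
        D = D / D.sum()                         # keep D_{t+1} a distribution
        alphas.append(alpha)
        hyps.append(h)
    return alphas, hyps

def adaboost_predict(alphas, hyps, X):
    """Final hypothesis H(x) = sign(sum_t alpha_t * h_t(x))."""
    return np.sign(sum(a * h.predict(X) for a, h in zip(alphas, hyps)))
```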
