
Theory and Applications of Boosting
Yoav Freund, UCSD
Santa Cruz Summer School, July 16, 2012 (many slides from Rob Schapire)


  1. Theory and Applications of Boosting. Yoav Freund, UCSD. Many slides from Rob Schapire. Santa Cruz Summer School, 2012.

  2. Santa Cruz Summer School, 2012.

  3. Plan
     • Day 1: Basics
       • AdaBoost
       • Boosting and loss minimization
       • Margins theory
     • Day 2: Applications
       • ADTrees, JBoost
       • Viola and Jones
       • Active learning and pedestrian detection
       • Genome-wide association studies
       • Online boosting and tracking
     • Day 3: Advanced Topics
       • Boosting and repeated matrix games
       • Drifting games and Boost By Majority
       • Confidence-rated boosting
       • BrownBoost and boosting with high noise

  4. Example: “How May I Help You?” [Gorin et al.]
     • goal: automatically categorize type of call requested by phone customer (Collect, CallingCard, PersonToPerson, etc.)
       • “yes I’d like to place a collect call long distance please” (Collect)
       • “operator I need to make a call but I need to bill it to my office” (ThirdNumber)
       • “yes I’d like to place a call on my master card please” (CallingCard)
       • “I just called a number in sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off of my bill” (BillingCredit)
     • observation:
       • easy to find “rules of thumb” that are “often” correct
         • e.g.: “IF ‘card’ occurs in utterance THEN predict ‘CallingCard’”
       • hard to find single highly accurate prediction rule
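To make the “rule of thumb” idea concrete, here is a tiny illustrative sketch of a keyword rule for this call-routing task. Only the ‘card’ rule is quoted on the slide; the ‘collect’ rule and the abstain behavior are assumptions added for illustration.

```python
# A minimal sketch of a keyword "rule of thumb" (illustrative, not from the slides).

def rule_of_thumb(utterance):
    """Predict a call type from a single keyword; return None to abstain."""
    words = utterance.lower().split()
    if "card" in words:
        return "CallingCard"   # the rule quoted on the slide
    if "collect" in words:
        return "Collect"       # assumed extra rule, for illustration only
    return None                # the rule only fires on some utterances

print(rule_of_thumb("yes I'd like to place a call on my master card please"))
# -> CallingCard
```

A rule like this is right “often”, but on its own it is far from an accurate classifier, which is exactly the gap boosting is meant to close.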

  5. The Boosting Approach
     • devise computer program for deriving rough rules of thumb
     • apply procedure to subset of examples
     • obtain rule of thumb
     • apply to 2nd subset of examples
     • obtain 2nd rule of thumb
     • repeat T times

  6. Key Details
     • how to choose examples on each round?
       • concentrate on “hardest” examples (those most often misclassified by previous rules of thumb)
     • how to combine rules of thumb into single prediction rule?
       • take (weighted) majority vote of rules of thumb

  7. Boosting
     • boosting = general method of converting rough rules of thumb into highly accurate prediction rule
     • technically:
       • assume given “weak” learning algorithm that can consistently find classifiers (“rules of thumb”) at least slightly better than random, say, accuracy ≥ 55% (in two-class setting) [“weak learning assumption”]
       • given sufficient data, a boosting algorithm can provably construct single classifier with very high accuracy, say, 99%

  8. Some History
     • How it all began ...

  9. Strong and Weak Learnability
     • boosting’s roots are in “PAC” learning model [Valiant ’84]
       • get random examples from unknown, arbitrary distribution
     • strong PAC learning algorithm:
       • for any distribution, with high probability, given polynomially many examples (and polynomial time), can find classifier with arbitrarily small generalization error
     • weak PAC learning algorithm:
       • same, but generalization error only needs to be slightly better than random guessing (1/2 − γ)
     • [Kearns & Valiant ’88]:
       • does weak learnability imply strong learnability?

  10. If Boosting Possible, Then...
     • can use (fairly) wild guesses to produce highly accurate predictions
     • if can learn “part way” then can learn “all the way”
     • should be able to improve any learning algorithm
     • for any learning problem:
       • either can always learn with nearly perfect accuracy
       • or there exist cases where cannot learn even slightly better than random guessing

  11. First Boosting Algorithms
     • [Schapire ’89]:
       • first provable boosting algorithm
     • [Freund ’90]:
       • “optimal” algorithm that “boosts by majority”
     • [Drucker, Schapire & Simard ’92]:
       • first experiments using boosting
       • limited by practical drawbacks
     • [Freund & Schapire ’95]:
       • introduced “AdaBoost” algorithm
       • strong practical advantages over previous boosting algorithms

  12. Basic Algorithm and Core Theory
     • introduction to AdaBoost
     • analysis of training error
     • analysis of test error and the margins theory
     • experiments and applications

  13. A Formal Description of Boosting
     • given training set (x_1, y_1), ..., (x_m, y_m)
       • y_i ∈ {−1, +1} correct label of instance x_i ∈ X
     • for t = 1, ..., T:
       • construct distribution D_t on {1, ..., m}
       • find weak classifier (“rule of thumb”) h_t : X → {−1, +1} with small error ε_t on D_t:
           ε_t = Pr_{i∼D_t}[ h_t(x_i) ≠ y_i ]
     • output final classifier H_final
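As a reading aid (not part of the original slides), here is a minimal Python sketch of this template. The construction of D_t and the form of H_final are deliberately left as abstract callbacks, since the next slide specifies both choices for AdaBoost; the names weak_learner and update_distribution are hypothetical placeholders.

```python
import numpy as np

def boosting_template(X, y, weak_learner, update_distribution, T):
    """Schematic boosting loop. X: (m, d) instances; y: (m,) labels in {-1, +1}."""
    m = len(y)
    D = np.full(m, 1.0 / m)                          # initial distribution on {1, ..., m}
    rounds = []
    for t in range(T):
        h_t = weak_learner(X, y, D)                  # h_t : X -> {-1, +1}
        eps_t = D[h_t(X) != y].sum()                 # ε_t = Pr_{i~D_t}[h_t(x_i) != y_i]
        D = update_distribution(D, h_t, X, y, eps_t) # build D_{t+1} (left abstract here)
        rounds.append((h_t, eps_t))
    return rounds                                    # to be combined into H_final
```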

  14. AdaBoost [Freund & Schapire ’96]
     • constructing D_t:
       • D_1(i) = 1/m
       • given D_t and h_t:
           D_{t+1}(i) = D_t(i)/Z_t × e^{−α_t}  if y_i = h_t(x_i)
                      = D_t(i)/Z_t × e^{+α_t}  if y_i ≠ h_t(x_i)
                      = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
         where Z_t = normalization factor and
           α_t = (1/2) ln((1 − ε_t)/ε_t) > 0
     • final classifier:
       • H_final(x) = sign( Σ_t α_t h_t(x) )
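To make the update concrete, here is a minimal NumPy sketch of AdaBoost using axis-aligned threshold (“decision stump”) weak learners, matching the half-plane weak classifiers of the toy example that follows. It is an illustrative implementation of the formulas above, not code from the course, and it assumes the weak learning condition 0 < ε_t < 1/2 so that α_t is well defined and positive.

```python
import numpy as np

def stump_weak_learner(X, y, D):
    """Pick the axis-aligned half-plane (decision stump) with the
    smallest weighted error under the distribution D."""
    m, d = X.shape
    best, best_err = None, np.inf
    for j in range(d):                          # feature = vertical/horizontal split
        for thresh in np.unique(X[:, j]):
            for sign in (+1, -1):               # which side is labeled +1
                pred = sign * np.where(X[:, j] <= thresh, 1, -1)
                err = D[pred != y].sum()
                if err < best_err:
                    best_err, best = err, (j, thresh, sign)
    j, thresh, sign = best
    return lambda Z: sign * np.where(Z[:, j] <= thresh, 1, -1)

def adaboost(X, y, T, weak_learner=stump_weak_learner):
    """AdaBoost as on the slide: returns the alphas and weak classifiers."""
    m = len(y)
    D = np.full(m, 1.0 / m)                     # D_1(i) = 1/m
    alphas, hs = [], []
    for t in range(T):
        h = weak_learner(X, y, D)
        pred = h(X)
        eps = D[pred != y].sum()                # ε_t, assumed in (0, 1/2)
        alpha = 0.5 * np.log((1 - eps) / eps)   # α_t = ½ ln((1 − ε_t)/ε_t)
        D = D * np.exp(-alpha * y * pred)       # D_{t+1}(i) ∝ D_t(i) exp(−α_t y_i h_t(x_i))
        D /= D.sum()                            # divide by Z_t (normalization)
        alphas.append(alpha)
        hs.append(h)
    return alphas, hs

def H_final(alphas, hs, Z):
    """H_final(x) = sign(Σ_t α_t h_t(x))."""
    return np.sign(sum(a * h(Z) for a, h in zip(alphas, hs)))
```

Typical usage would be `alphas, hs = adaboost(X, y, T=10)` followed by `H_final(alphas, hs, X)`; the exhaustive stump search is quadratic in the data and meant only for small examples like the toy problem below.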

  15. Toy Example
     • [figure: training points under the initial distribution D_1]
     • weak classifiers = vertical or horizontal half-planes

  16. Round 1
     • [figure: weak classifier h_1 and updated distribution D_2]
     • ε_1 = 0.30, α_1 = 0.42

  17. Round 2
     • [figure: weak classifier h_2 and updated distribution D_3]
     • ε_2 = 0.21, α_2 = 0.65

  18. Round 3
     • [figure: weak classifier h_3]
     • ε_3 = 0.14, α_3 = 0.92

  19. Final Classifier
     • H_final = sign( 0.42·h_1 + 0.65·h_2 + 0.92·h_3 )
     • [figure: the resulting combined decision boundary]
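As a quick sanity check on the toy-example numbers (not part of the slides), the round weights follow directly from the reported errors via α_t = ½ ln((1 − ε_t)/ε_t); small differences from the slide values come from the ε_t being rounded to two digits.

```python
import math

# Recompute the round weights of the toy example from the reported errors.
for t, eps in enumerate([0.30, 0.21, 0.14], start=1):
    alpha = 0.5 * math.log((1 - eps) / eps)
    print(f"round {t}: eps = {eps:.2f} -> alpha = {alpha:.2f}")
# alphas come out as approximately 0.42, 0.66, 0.91,
# matching the slide values 0.42, 0.65, 0.92 up to rounding of the eps values.
```

Note also why the weighted vote helps: a point misclassified by h_1 but classified correctly by h_2 and h_3 still gets the right label, since 0.65 + 0.92 > 0.42.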

  20. http://cseweb.ucsd.edu/~yfreund/adaboost/index.html

  21. Basic Algorithm and Core Theory
     • introduction to AdaBoost
     • analysis of training error
     • analysis of test error and the margins theory
     • experiments and applications

  22. Analyzing the Training Error [Freund & Schapire ’96]
     • Theorem:
       • write ε_t as 1/2 − γ_t   [γ_t = “edge”]
       • then
           training error(H_final) ≤ Π_t [ 2 √(ε_t (1 − ε_t)) ]
                                   = Π_t √(1 − 4 γ_t²)
                                   ≤ exp(−2 Σ_t γ_t²)
     • so: if ∀t: γ_t ≥ γ > 0, then training error(H_final) ≤ e^{−2 γ² T}
     • AdaBoost is adaptive:
       • does not need to know γ or T a priori
       • can exploit γ_t ≫ γ
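The three expressions in the bound are easy to evaluate numerically. Here is a short illustrative check (not from the slides) on the toy example’s reported errors, showing the product form, the equivalent √(1 − 4γ_t²) form, and the weaker exponential form.

```python
import math

# Evaluate the training-error bounds on the toy example's errors.
eps = [0.30, 0.21, 0.14]
gammas = [0.5 - e for e in eps]                  # γ_t = 1/2 − ε_t

prod_bound = math.prod(2 * math.sqrt(e * (1 - e)) for e in eps)
sqrt_bound = math.prod(math.sqrt(1 - 4 * g * g) for g in gammas)
exp_bound  = math.exp(-2 * sum(g * g for g in gammas))

print(prod_bound, sqrt_bound, exp_bound)
# prod_bound == sqrt_bound ≈ 0.52 (they are the same quantity rewritten),
# and both are ≤ exp_bound ≈ 0.60, so the bound already drops below 1 after
# only three rounds and keeps shrinking geometrically with T.
```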
