  1. An overview of Boosting. Yoav Freund, UCSD

  2. Plan of talk
  • Generative vs. non-generative modeling
  • Boosting
  • Alternating decision trees
  • Boosting and over-fitting
  • Applications

  3. Toy Example
  • Computer receives telephone call
  • Measures pitch of voice
  • Decides gender of caller
  [Diagram: Human Voice → Male / Female]

  4. Generative modeling
  [Plot: two class-conditional densities with parameters (mean1, var1) and (mean2, var2); y-axis: Probability, x-axis: Voice Pitch]

  5. Discriminative approach [Vapnik 85]
  [Plot: No. of mistakes vs. Voice Pitch]

  6. Ill-behaved data
  [Plot: fitted densities (mean1, mean2) and No. of mistakes over Voice Pitch]

  7. Traditional Statistics vs. Machine Learning
  [Diagram: traditional statistics goes Data → Statistics → Estimated world state → Decision Theory → Actions; Machine Learning maps Data directly to Predictions / Actions]

  8. Comparison of methodologies
  Model                  Generative              Discriminative
  Goal                   Probability estimates   Classification rule
  Performance measure    Likelihood              Misclassification rate
  Mismatch problems      Outliers                Misclassifications

  9. Boosting

  10. A weighted training set
  Feature vectors with binary labels in $\{-1, +1\}$ and positive weights:
  $(x_1, y_1, w_1),\ (x_2, y_2, w_2),\ \ldots,\ (x_m, y_m, w_m)$

  11. A weak learner
  Input: a weighted training set $(x_1, y_1, w_1),\ (x_2, y_2, w_2),\ \ldots,\ (x_m, y_m, w_m)$
  Output: a weak rule $h$ mapping the instances $x_1, x_2, \ldots, x_m$ to predictions $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_m$ with $\hat{y}_i \in \{-1, +1\}$
  The weak requirement: $\dfrac{\sum_{i=1}^{m} w_i\, y_i\, \hat{y}_i}{\sum_{i=1}^{m} w_i} > \gamma > 0$
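To make the weak requirement concrete, here is a minimal sketch of one possible weak learner: a threshold rule ("decision stump") on a single scalar feature. The stump search and the toy data are illustrative assumptions, not something specified on the slide; the quantity the search maximizes is exactly the weighted correlation in the requirement above.

```python
import numpy as np

def stump_weak_learner(x, y, w):
    """Search threshold rules h(x) = sign if x <= theta else -sign (decision stumps)
    and return the one maximizing the weighted correlation
        sum_i w_i * y_i * h(x_i) / sum_i w_i,
    i.e. the quantity that must exceed gamma in the weak requirement."""
    best_corr, best_theta, best_sign = -np.inf, None, None
    total = w.sum()
    for theta in np.unique(x):
        for sign in (+1, -1):
            pred = np.where(x <= theta, sign, -sign)   # predictions in {-1, +1}
            corr = np.dot(w, y * pred) / total
            if corr > best_corr:
                best_corr, best_theta, best_sign = corr, theta, sign
    h = lambda z: np.where(np.asarray(z) <= best_theta, best_sign, -best_sign)
    return h, best_corr

# Hypothetical weighted training set, just to exercise the code:
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([+1, +1, -1, -1, -1])
w = np.ones(5)
h, gamma = stump_weak_learner(x, y, w)
print(gamma)   # the edge; the weak requirement asks for gamma > 0
```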

  12. The boosting process
  $(x_1, y_1, 1), (x_2, y_2, 1), \ldots, (x_n, y_n, 1)$ → weak learner → $h_1$
  $(x_1, y_1, w_1^1), (x_2, y_2, w_2^1), \ldots, (x_n, y_n, w_n^1)$ → weak learner → $h_2$
  $(x_1, y_1, w_1^2), (x_2, y_2, w_2^2), \ldots, (x_n, y_n, w_n^2)$ → weak learner → $h_3$
  ...
  $(x_1, y_1, w_1^{T-1}), (x_2, y_2, w_2^{T-1}), \ldots, (x_n, y_n, w_n^{T-1})$ → weak learner → $h_T$
  $\text{Prediction}(x) = \operatorname{sign}(F_T(x))$

  13. A Formal Description of Boosting [Slides from Rob Schapire]
  • Given training set $(x_1, y_1), \ldots, (x_m, y_m)$, where $y_i \in \{-1, +1\}$ is the correct label of instance $x_i \in X$
  • For $t = 1, \ldots, T$:
    • construct distribution $D_t$ on $\{1, \ldots, m\}$
    • find weak classifier ("rule of thumb") $h_t : X \to \{-1, +1\}$ with small error $\epsilon_t$ on $D_t$: $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \ne y_i]$
  • Output final classifier $H_{\text{final}}$

  14. AdaBoost [with Freund] [Slides from Rob Schapire]
  • Constructing $D_t$:
    • $D_1(i) = 1/m$
    • given $D_t$ and $h_t$:
      $D_{t+1}(i) = \dfrac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \ne h_t(x_i) \end{cases} = \dfrac{D_t(i)\, \exp(-\alpha_t\, y_i\, h_t(x_i))}{Z_t}$
      where $Z_t$ is a normalization factor and $\alpha_t = \frac{1}{2} \ln\!\left(\frac{1 - \epsilon_t}{\epsilon_t}\right) > 0$
  • Final classifier: $H_{\text{final}}(x) = \operatorname{sign}\!\left(\sum_t \alpha_t\, h_t(x)\right)$
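The update above is straightforward to turn into code. Below is a minimal sketch, assuming decision-stump weak rules on a single scalar feature and made-up toy data (the slide itself leaves the weak learner abstract):

```python
import numpy as np

def best_stump(x, y, D):
    """Weak learner: the decision stump with the smallest weighted error on D."""
    best_err, best_theta, best_sign = np.inf, None, None
    for theta in np.unique(x):
        for sign in (+1, -1):
            pred = np.where(x <= theta, sign, -sign)
            err = D[pred != y].sum()
            if err < best_err:
                best_err, best_theta, best_sign = err, theta, sign
    h = lambda z: np.where(np.asarray(z) <= best_theta, best_sign, -best_sign)
    return best_err, h

def adaboost(x, y, T=10):
    """AdaBoost as on the slide: D_1(i) = 1/m, multiplicative reweighting by
    exp(-alpha_t * y_i * h_t(x_i)) followed by normalization (Z_t), and the
    final rule H(x) = sign(sum_t alpha_t * h_t(x))."""
    m = len(x)
    D = np.full(m, 1.0 / m)
    alphas, rules = [], []
    for t in range(T):
        eps, h = best_stump(x, y, D)
        if eps >= 0.5:                                    # weak assumption violated
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        alphas.append(alpha)
        rules.append(h)
        if eps < 1e-12:                                   # perfect rule: nothing left to reweight
            break
        D = D * np.exp(-alpha * y * h(x))
        D = D / D.sum()                                   # divide by Z_t
    return lambda z: np.sign(sum(a * r(z) for a, r in zip(alphas, rules)))

# Hypothetical toy data, just to exercise the code:
x = np.array([0.5, 1.2, 1.9, 2.6, 3.3, 4.0, 4.7, 5.4])
y = np.array([+1, +1, -1, -1, +1, -1, -1, -1])
H = adaboost(x, y, T=5)
print((H(x) == y).mean())   # training accuracy
```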

  15. Toy Example [Slides from Rob Schapire]
  Initial distribution $D_1$; weak classifiers = vertical or horizontal half-planes

  16. Round 1 [Slides from Rob Schapire]
  Weak rule $h_1$, updated distribution $D_2$; $\epsilon_1 = 0.30$, $\alpha_1 = 0.42$

  17. Round 2 [Slides from Rob Schapire]
  Weak rule $h_2$, updated distribution $D_3$; $\epsilon_2 = 0.21$, $\alpha_2 = 0.65$

  18. Round 3 [Slides from Rob Schapire]
  Weak rule $h_3$; $\epsilon_3 = 0.14$, $\alpha_3 = 0.92$
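The three $\alpha$ values above are just the rule $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ from slide 14 applied to the reported errors. A quick check (the printed values match the slides only up to rounding, since the reported $\epsilon_t$ are themselves rounded):

```python
import math

# epsilon values reported on the Round 1-3 slides
for t, eps in enumerate([0.30, 0.21, 0.14], start=1):
    alpha = 0.5 * math.log((1 - eps) / eps)
    print(f"round {t}: eps = {eps:.2f}, alpha = {alpha:.2f}")
# prints roughly 0.42, 0.66, 0.91, versus 0.42, 0.65, 0.92 on the slides
```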

  19. Final Classifier [Slides from Rob Schapire]
  $H_{\text{final}} = \operatorname{sign}(0.42\, h_1 + 0.65\, h_2 + 0.92\, h_3)$

  20. Analyzing the Training Error [with Freund] [Slides from Rob Schapire]
  • Theorem: write $\epsilon_t$ as $\frac{1}{2} - \gamma_t$ ($\gamma_t$ = "edge"); then
    $\text{training error}(H_{\text{final}}) \;\le\; \prod_t \left[ 2\sqrt{\epsilon_t (1 - \epsilon_t)} \right] \;=\; \prod_t \sqrt{1 - 4\gamma_t^2} \;\le\; \exp\!\left(-2 \sum_t \gamma_t^2\right)$
  • So if $\gamma_t \ge \gamma > 0$ for all $t$, then $\text{training error}(H_{\text{final}}) \le e^{-2\gamma^2 T}$
  • AdaBoost is adaptive:
    • does not need to know $\gamma$ or $T$ a priori
    • can exploit $\gamma_t \gg \gamma$
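For completeness, here is a sketch of the standard argument behind this bound (it is not spelled out on the slide); write $F(x) = \sum_t \alpha_t h_t(x)$ and unroll the AdaBoost update from slide 14:

```latex
\begin{align*}
D_{T+1}(i) &= \frac{\exp\!\big(-y_i F(x_i)\big)}{m \prod_t Z_t}
  &&\text{(unrolling } D_{t+1}(i) = D_t(i)\, e^{-\alpha_t y_i h_t(x_i)} / Z_t \text{)}\\
\text{training error}(H_{\text{final}})
  &= \frac{1}{m}\sum_i \mathbf{1}\{y_i F(x_i) \le 0\}
   \;\le\; \frac{1}{m}\sum_i e^{-y_i F(x_i)}
   \;=\; \prod_t Z_t,\\
Z_t &= \sum_i D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}
     = (1-\epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t}
     = 2\sqrt{\epsilon_t (1-\epsilon_t)}
  &&\text{for } \alpha_t = \tfrac{1}{2}\ln\tfrac{1-\epsilon_t}{\epsilon_t}.
\end{align*}
```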

  21. Boosting block diagram
  [Block diagram: the Booster supplies example weights to the Weak Learner, which returns a weak rule; iterating this loop turns the Weak Learner into a Strong Learner that outputs an accurate rule]

  22. Boosting with specialists
  • Specialists predict only when they are confident.
  • In addition to $\{-1, +1\}$, specialists output 0 to indicate "no prediction".
  • Since boosting allows both positive and negative weights, we restrict attention to specialists that output $\{0, 1\}$.

  23. AdaBoost as variational optimization
  Weak rule: $h_t : X \to \{0, 1\}$; labels: $y \in \{-1, +1\}$; training set: $\{(x_1, y_1, 1), (x_2, y_2, 1), \ldots, (x_n, y_n, 1)\}$
  $\alpha_t = \frac{1}{2} \ln\!\left(\frac{\epsilon_t^{+}}{\epsilon_t^{-}}\right), \quad \epsilon_t^{+} = \sum_{i\,:\; h_t(x_i)=1,\; y_i=+1} w_i, \quad \epsilon_t^{-} = \sum_{i\,:\; h_t(x_i)=1,\; y_i=-1} w_i$
  $F_t = F_{t-1} + \alpha_t h_t$
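A minimal sketch of this update, assuming the two sums are the total weight of positive and negative examples on which the $\{0,1\}$-valued specialist fires; the helper name, the numerical guard, and the toy data below are hypothetical.

```python
import numpy as np

def specialist_alpha(h_vals, y, w, guard=1e-12):
    """alpha_t = (1/2) ln(eps_plus / eps_minus) for a {0,1}-valued specialist:
    only examples on which the rule fires (h(x_i) = 1) enter the two sums.
    `guard` is a small numerical safeguard, not part of the slide's formula."""
    fires = (h_vals == 1)
    eps_plus  = w[fires & (y == +1)].sum()
    eps_minus = w[fires & (y == -1)].sum()
    return 0.5 * np.log((eps_plus + guard) / (eps_minus + guard))

# Hypothetical example: a rule that fires on the first three points only.
y      = np.array([+1, +1, -1, -1, +1])
w      = np.ones(5)
h_vals = np.array([1, 1, 1, 0, 0])
print(specialist_alpha(h_vals, y, w))   # 0.5 * ln(2/1), about 0.347
# The strong rule is then updated as F_t = F_{t-1} + alpha_t * h_t.
```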

  24. AdaBoost as variational optimization
  • $x$ = input (a scalar); $y$ = output, $-1$ or $+1$
  • $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ = training set
  • $F_{t-1}$ = strong rule after $t-1$ boosting iterations
  • $h_t$ = weak rule produced at iteration $t$
  [Plot: $F_{t-1}$ and $h_t$ over the scalar input $x$, with the + and − labeled training examples]

  25. AdaBoost as variational optimization
  • $F_{t-1}$ = strong rule after $t-1$ boosting iterations; $h_t$ = weak rule produced at iteration $t$
  Candidate combination: $F_{t-1}(x) + 0\, h_t(x)$
  [Plot: same labeled examples; the weak rule is given weight 0]

  26. AdaBoost as variational optimization
  • $F_{t-1}$ = strong rule after $t-1$ boosting iterations; $h_t$ = weak rule produced at iteration $t$
  Candidate combination: $F_{t-1}(x) + 0.4\, h_t(x)$
  [Plot: same labeled examples; the weak rule is given weight 0.4]

  27. AdaBoost as variational optimization
  • $F_{t-1}$ = strong rule after $t-1$ boosting iterations; $h_t$ = weak rule produced at iteration $t$
  Chosen combination: $F_t(x) = F_{t-1}(x) + 0.8\, h_t(x)$
  [Plot: same labeled examples; the weak rule is given weight 0.8]

  28. Margins
  Fix a set of weak rules and write $\vec{h}(x) = (h_1(x), h_2(x), \ldots, h_T(x))$.
  Represent the $i$-th example $x_i$ by the outputs of the weak rules: $\vec{h}_i$.
  Labels: $y \in \{-1, +1\}$; training set: $(\vec{h}_1, y_1), (\vec{h}_2, y_2), \ldots, (\vec{h}_n, y_n)$.
  Goal: find a weight vector $\vec{\alpha} = (\alpha_1, \ldots, \alpha_T)$ that minimizes the number of training mistakes.
  Margin of example $i$: $\dfrac{y_i\, (\vec{h}_i \cdot \vec{\alpha})}{\|\vec{\alpha}\|}$
  [Plots: examples projected onto $\vec{\alpha}$ (mistakes vs. correct), and the cumulative number of examples as a function of the margin]
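A short sketch of computing this margin distribution, assuming the $L_1$ normalization of $\vec{\alpha}$ that is standard in boosting margin analyses (the slide does not make the norm explicit); the matrix and vectors below are made-up examples:

```python
import numpy as np

def normalized_margins(H, y, alpha):
    """H[i, t] = h_t(x_i) in {-1, +1}; y = labels in {-1, +1}; alpha = rule weights.
    Returns y_i * (h_i . alpha) / ||alpha||_1 for every training example."""
    return y * (H @ alpha) / np.abs(alpha).sum()

# Hypothetical outputs of T = 3 weak rules on 4 examples:
H = np.array([[+1, +1, -1],
              [+1, -1, +1],
              [-1, -1, -1],
              [-1, +1, -1]])
y = np.array([+1, +1, -1, -1])
alpha = np.array([0.42, 0.65, 0.92])
m = normalized_margins(H, y, alpha)
print(np.sort(m))        # the margin distribution plotted on the slide
print((m <= 0).mean())   # fraction of training mistakes
```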

  29. Boosting as gradient descent
  • Our goal is to minimize the number of mistakes (the 0-1 loss).
  • Unfortunately, the step function has derivative zero at every point except zero, where the derivative is undefined.
  • AdaBoost therefore minimizes an upper bound on the 0-1 loss: the exponential loss $e^{-\text{margin}}$.
  • AdaBoost finds the vector $\alpha$ by coordinate-wise gradient descent; weak rules are added one at a time.
  [Plot: the 0-1 loss and the AdaBoost loss $e^{-\text{margin}}$ as functions of the margin (mistakes at negative margins, correct at positive margins)]

  30. AdaBoost as gradient descent
  • Weak rules are added one at a time; fixing the alphas for rounds $1, \ldots, t-1$, we choose the alpha for the new rule $h_t$.
  • The weak rule defines how each example would move: its margin increases or decreases.
  • The new margin of $(x_i, y_i)$ is $y_i\left(F_{t-1}(x_i) + \alpha\, h_t(x_i)\right)$, so the new total loss is
    $\sum_{i=1}^{n} \exp\!\left(-y_i\left(F_{t-1}(x_i) + \alpha\, h_t(x_i)\right)\right)$
  • The negative derivative of the total loss with respect to $\alpha$ at $\alpha = 0$ is
    $-\left.\frac{\partial}{\partial \alpha}\sum_{i=1}^{n} \exp\!\left(-y_i\left(F_{t-1}(x_i) + \alpha\, h_t(x_i)\right)\right)\right|_{\alpha = 0} = \sum_{i=1}^{n} y_i\, h_t(x_i)\, \underbrace{\exp\!\left(-y_i F_{t-1}(x_i)\right)}_{\text{weight of example } (x_i, y_i)}$
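A minimal sketch of this coordinate-wise step for a $\{-1,+1\}$-valued weak rule. The slide stops at the gradient at $\alpha = 0$, whose terms are the example weights $e^{-y_i F_{t-1}(x_i)}$; the sketch goes one step further and uses the closed-form minimizer of the exponential loss along that coordinate, which is the usual AdaBoost choice of $\alpha$. The data and rules below are made up.

```python
import numpy as np

def adaboost_step(F_prev, h_t, x, y):
    """One coordinate-wise step on the exponential loss
        sum_i exp(-y_i * (F_{t-1}(x_i) + alpha * h_t(x_i))).
    w_i = exp(-y_i * F_{t-1}(x_i)) is the example weight from the slide; for a
    {-1,+1}-valued h_t the minimizing alpha is (1/2) ln(W_correct / W_wrong)."""
    w = np.exp(-y * F_prev(x))                     # weight of example (x_i, y_i)
    agree = (y * h_t(x) == 1)
    W_correct, W_wrong = w[agree].sum(), w[~agree].sum()
    alpha = 0.5 * np.log(W_correct / W_wrong)
    F_new = lambda z: F_prev(z) + alpha * h_t(z)   # F_t = F_{t-1} + alpha * h_t
    return alpha, F_new

# Hypothetical example: start from F_0 = 0 and add one threshold rule.
x  = np.array([0.5, 1.5, 2.5, 3.5])
y  = np.array([+1, +1, -1, -1])
F0 = lambda z: np.zeros_like(np.asarray(z, dtype=float))
h  = lambda z: np.where(np.asarray(z) < 3.0, +1, -1)
alpha, F1 = adaboost_step(F0, h, x, y)
print(alpha)   # 0.5 * ln(3), about 0.55: the rule gets 3 of 4 examples right
```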

  31. LogitBoost as gradient descent
  Also called GentleBoost / LogitBoost [Friedman, Hastie & Tibshirani]
  • Instead of the AdaBoost loss $e^{-m}$ (where $m$ = margin), define the loss to be $\log\!\left(1 + e^{-m}\right)$.
  • When the margin $\gg 0$: $\log\!\left(1 + e^{-m}\right) \approx e^{-m}$; when the margin $\ll 0$: $\log\!\left(1 + e^{-m}\right) \approx -m$.
  • As before, the new margin of $(x_i, y_i)$ is $y_i\left(F_{t-1}(x_i) + \alpha\, h_t(x_i)\right)$, so the new total loss is
    $\sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i\left(F_{t-1}(x_i) + \alpha\, h_t(x_i)\right)\right)\right)$
  • The negative derivative of the total loss with respect to $\alpha$ at $\alpha = 0$ is
    $-\left.\frac{\partial}{\partial \alpha}\sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i\left(F_{t-1}(x_i) + \alpha\, h_t(x_i)\right)\right)\right)\right|_{\alpha = 0} = \sum_{i=1}^{n} y_i\, h_t(x_i)\, \underbrace{\frac{\exp\!\left(-y_i F_{t-1}(x_i)\right)}{1 + \exp\!\left(-y_i F_{t-1}(x_i)\right)}}_{\text{LogitBoost weight of example } (x_i, y_i)}$
  [Plot: the LogitBoost loss and the AdaBoost loss/weight as functions of the margin]
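The difference between the two weight functions is easiest to see numerically; the margins below are arbitrary illustrative values. The AdaBoost weight $e^{-m}$ explodes for badly misclassified examples, while the LogitBoost weight $e^{-m}/(1+e^{-m})$ never exceeds 1, which is the point the next slide makes about noise.

```python
import numpy as np

margins = np.array([-5.0, -2.0, 0.0, 2.0, 5.0])      # m = y_i * F_{t-1}(x_i)
ada_w   = np.exp(-margins)                           # AdaBoost weight e^{-m}
logit_w = np.exp(-margins) / (1 + np.exp(-margins))  # LogitBoost weight, always < 1
for m, a, l in zip(margins, ada_w, logit_w):
    print(f"m = {m:+.1f}   adaboost = {a:8.2f}   logitboost = {l:.3f}")
# at m = -5 the AdaBoost weight is about 148, while the LogitBoost weight is about 0.993
```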

  32. Noise resistance
  • AdaBoost:
    • performs well when the achievable error rate is close to zero (the almost-consistent case).
    • errors (examples with negative margins) get very large weights, which can lead to over-fitting.
  • LogitBoost:
    • inferior to AdaBoost when the achievable error rate is close to zero.
    • often better than AdaBoost when the achievable error rate is high.
    • the weight on any example is never larger than 1.
