 
An overview of Boosting
Yoav Freund, UCSD
Plan of talk
• Generative vs. non-generative modeling
• Boosting
• Alternating decision trees
• Boosting and over-fitting
• Applications
Toy Example
• Computer receives telephone call
• Measures pitch of voice
• Decides gender of caller
[Figure: human voice classified as male or female]
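Generative modeling
[Figure: probability vs. voice pitch, one distribution per class with parameters mean1, var1 and mean2, var2]
Discriminative approach [Vapnik 85]
[Figure: number of mistakes vs. voice pitch]
Ill-behaved data
[Figure: probability and number of mistakes vs. voice pitch, with mean1 and mean2 marked]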
Traditional Statistics vs. Machine Learning
[Diagram with the labels: Data, Statistics, Estimated world state, Decision Theory, Actions, Machine Learning, Predictions]
Comparison of methodologies
Model                  Generative               Discriminative
Goal                   Probability estimates    Classification rule
Performance measure    Likelihood               Misclassification rate
Mismatch problems      Outliers                 Misclassifications
Boosting
A weighted training set
$(x_1, y_1, w_1), (x_2, y_2, w_2), \ldots, (x_m, y_m, w_m)$
• $x_i$: feature vectors
• $y_i \in \{-1, +1\}$: binary labels
• $w_i$: positive weights
A weak learner
Input: a weighted training set $(x_1, y_1, w_1), (x_2, y_2, w_2), \ldots, (x_m, y_m, w_m)$
Output: a weak rule $h$ that maps the instances $x_1, x_2, \ldots, x_m$ to predictions $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_m$, with $\hat{y}_i \in \{-1, +1\}$
The weak requirement: $\frac{\sum_{i=1}^{m} y_i \hat{y}_i w_i}{\sum_{i=1}^{m} w_i} > \gamma > 0$
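
The weak requirement above can be checked directly from labels, predictions, and weights. The following is a minimal sketch; the function names and the four-example data set are illustrative assumptions, not part of the slides.

```python
def weighted_edge(labels, predictions, weights):
    """Weighted correlation between labels and predictions (a value in [-1, 1])."""
    total = sum(weights)
    return sum(y * p * w for y, p, w in zip(labels, predictions, weights)) / total


def satisfies_weak_requirement(labels, predictions, weights, gamma):
    """True when the weighted correlation exceeds gamma, and gamma is positive."""
    return weighted_edge(labels, predictions, weights) > gamma > 0


# Hypothetical data: the rule gets three of four equally weighted examples right.
labels = [+1, +1, -1, -1]
predictions = [+1, -1, -1, -1]
weights = [1.0, 1.0, 1.0, 1.0]
edge = weighted_edge(labels, predictions, weights)
```

With equal weights the quantity is the fraction of correct predictions minus the fraction of incorrect ones, so a rule that guesses at random scores about zero.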
The boosting process
$(x_1, y_1, 1), (x_2, y_2, 1), \ldots, (x_n, y_n, 1)$ → weak learner → $h_1$
$(x_1, y_1, w_1^1), (x_2, y_2, w_2^1), \ldots, (x_n, y_n, w_n^1)$ → weak learner → $h_2$
$(x_1, y_1, w_1^2), (x_2, y_2, w_2^2), \ldots, (x_n, y_n, w_n^2)$ → $h_3$
...
$(x_1, y_1, w_1^{T-1}), (x_2, y_2, w_2^{T-1}), \ldots, (x_n, y_n, w_n^{T-1})$ → $h_T$
Prediction$(x) = \text{sign}(F_T(x))$
A Formal Description of Boosting [Slides from Rob Schapire]
• given training set $(x_1, y_1), \ldots, (x_m, y_m)$
• $y_i \in \{-1, +1\}$ correct label of instance $x_i \in X$
• for $t = 1, \ldots, T$:
  • construct distribution $D_t$ on $\{1, \ldots, m\}$
  • find weak classifier ("rule of thumb") $h_t : X \to \{-1, +1\}$ with small error $\epsilon_t$ on $D_t$: $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$
• output final classifier $H_{\text{final}}$
AdaBoost [Slides from Rob Schapire] [with Freund]
• constructing $D_t$:
  • $D_1(i) = 1/m$
  • given $D_t$ and $h_t$:
    $D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \neq h_t(x_i) \end{cases} = \frac{D_t(i)}{Z_t} \exp(-\alpha_t y_i h_t(x_i))$
    where $Z_t$ = normalization factor and $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right) > 0$
• final classifier:
  • $H_{\text{final}}(x) = \text{sign}\left(\sum_t \alpha_t h_t(x)\right)$
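
The update rule above can be sketched in a few lines of plain Python. The slides do not fix a weak learner, so this sketch assumes threshold stumps on scalar inputs. The names `best_stump`, `adaboost`, and `predict`, and the six-point data set, are hypothetical.

```python
import math


def stump_predict(x, threshold, sign):
    """Predict `sign` for x above the threshold and `-sign` otherwise."""
    return sign if x > threshold else -sign


def best_stump(xs, ys, dist):
    """Return (weighted error, threshold, sign) of the lowest-error stump under dist."""
    points = sorted(set(xs))
    thresholds = [points[0] - 1.0] + [(a + b) / 2 for a, b in zip(points, points[1:])]
    best = None
    for thr in thresholds:
        for sign in (+1, -1):
            err = sum(d for x, y, d in zip(xs, ys, dist)
                      if stump_predict(x, thr, sign) != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best


def adaboost(xs, ys, rounds):
    """Run AdaBoost and return a list of (alpha_t, threshold, sign) triples."""
    m = len(xs)
    dist = [1.0 / m] * m                      # D_1(i) = 1/m
    ensemble = []
    for _ in range(rounds):
        err, thr, sign = best_stump(xs, ys, dist)
        if err == 0 or err >= 0.5:            # alpha_t undefined or not positive
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, sign))
        # D_{t+1}(i) = D_t(i) * exp(-alpha_t * y_i * h_t(x_i)) / Z_t
        dist = [d * math.exp(-alpha * y * stump_predict(x, thr, sign))
                for x, y, d in zip(xs, ys, dist)]
        z = sum(dist)                         # Z_t, the normalization factor
        dist = [d / z for d in dist]
    return ensemble


def predict(ensemble, x):
    """H_final(x): sign of the alpha-weighted vote of the stumps."""
    score = sum(alpha * stump_predict(x, thr, sign) for alpha, thr, sign in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical data that no single stump can classify correctly.
xs = [1, 2, 3, 4, 5, 6]
ys = [+1, +1, -1, -1, +1, +1]
ensemble = adaboost(xs, ys, rounds=3)
```

On this data no single stump does better than a weighted error of 1/3, yet three boosted stumps classify all six points correctly.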
Toy Example [Slides from Rob Schapire]
[Figure: initial distribution $D_1$ over the training points]
weak classifiers = vertical or horizontal half-planes
Round 1 [Slides from Rob Schapire] [Figure: $h_1$ and $D_2$] $\epsilon_1 = 0.30$, $\alpha_1 = 0.42$
Round 2 [Slides from Rob Schapire] [Figure: $h_2$ and $D_3$] $\epsilon_2 = 0.21$, $\alpha_2 = 0.65$
Round 3 [Slides from Rob Schapire] [Figure: $h_3$] $\epsilon_3 = 0.14$, $\alpha_3 = 0.92$
Final Classifier [Slides from Rob Schapire]
$H_{\text{final}} = \text{sign}(0.42\,h_1 + 0.65\,h_2 + 0.92\,h_3)$
[Figure: the three weighted half-planes and the resulting combined classifier]
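
The numbers in the three rounds can be recomputed from the formula $\alpha_t = \frac{1}{2}\ln((1-\epsilon_t)/\epsilon_t)$. This is a small sketch with illustrative function names. The recomputed values agree with the slides to within about 0.02; the small gap is presumably because the printed errors are rounded, though the slides do not say so.

```python
import math


def alpha_from_error(eps):
    """alpha_t = 1/2 * ln((1 - eps_t) / eps_t)."""
    return 0.5 * math.log((1.0 - eps) / eps)


# (error, alpha) pairs as printed for Rounds 1 to 3.
reported = [(0.30, 0.42), (0.21, 0.65), (0.14, 0.92)]
recomputed = [alpha_from_error(eps) for eps, _ in reported]


def final_vote(h1, h2, h3):
    """Sign of the weighted vote; h1, h2, h3 are rule outputs in {-1, +1}."""
    score = 0.42 * h1 + 0.65 * h2 + 0.92 * h3
    return 1 if score > 0 else -1
```

Because 0.42 + 0.65 > 0.92, any two of the three rules outvote the third, so in this toy example the final classifier behaves like a majority vote.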
Analyzing the Training Error [Slides from Rob Schapire] [with Freund]
• Theorem:
  • write $\epsilon_t$ as $\frac{1}{2} - \gamma_t$ [$\gamma_t$ = "edge"]
  • then training error$(H_{\text{final}}) \le \prod_t \left[ 2\sqrt{\epsilon_t (1 - \epsilon_t)} \right] = \prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\left(-2 \sum_t \gamma_t^2\right)$
• so: if $\forall t: \gamma_t \ge \gamma > 0$ then training error$(H_{\text{final}}) \le e^{-2\gamma^2 T}$
• AdaBoost is adaptive:
  • does not need to know $\gamma$ or $T$ a priori
  • can exploit $\gamma_t \gg \gamma$
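
The chain of inequalities in the theorem can be evaluated numerically. As a sketch, the code below plugs in the three errors from the toy example (0.30, 0.21, 0.14); the function name and the choice of inputs are illustrative assumptions.

```python
import math


def training_error_bounds(errors):
    """Return the three quantities in the theorem, from tightest to loosest."""
    gammas = [0.5 - e for e in errors]                     # edges gamma_t
    product_bound = math.prod(2.0 * math.sqrt(e * (1.0 - e)) for e in errors)
    edge_form = math.prod(math.sqrt(1.0 - 4.0 * g * g) for g in gammas)
    exp_bound = math.exp(-2.0 * sum(g * g for g in gammas))
    return product_bound, edge_form, exp_bound


errors = [0.30, 0.21, 0.14]
product_bound, edge_form, exp_bound = training_error_bounds(errors)

# Uniform version: use the smallest edge for every round, e^{-2 * gamma^2 * T}.
gamma = min(0.5 - e for e in errors)
uniform_bound = math.exp(-2.0 * gamma * gamma * len(errors))
```

The first two quantities coincide, and each later bound is looser than the one before it. The uniform bound is the loosest because it ignores rounds whose edge is larger than the minimum.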
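Boosting block diagram
[Diagram: the Strong Learner contains a Booster and a Weak Learner. The Booster passes example weights to the Weak Learner, the Weak Learner returns a weak rule, and the Strong Learner outputs an accurate rule.]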
Boosting with specialists
• Specialists predict only when they are confident.
• In addition to {-1,+1}, specialists use 0 to indicate no prediction.
• Because boosting allows both positive and negative weights, we restrict attention to specialists that output {0,1}.
Adaboost as variational optimization
Weak rule: $h_t : X \to \{0, 1\}$
Label: $y \in \{-1, +1\}$
Training set: $\{(x_1, y_1, 1), (x_2, y_2, 1), \ldots, (x_n, y_n, 1)\}$
$\alpha_t = \frac{1}{2} \ln\left( \frac{\epsilon + \sum_{i : h_t(x_i) = 1,\, y_i = 1} w_i^t}{\epsilon + \sum_{i : h_t(x_i) = 1,\, y_i = -1} w_i^t} \right)$
$F_t = F_{t-1} + \alpha_t h_t$
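
The weight assigned to a specialist can be computed as follows. This is a minimal sketch that reads the $\epsilon$ in the formula as a small smoothing constant that keeps the ratio finite. That reading, the default value `1e-3`, and the function name are assumptions, not statements from the slides.

```python
import math


def specialist_alpha(predictions, labels, weights, smoothing=1e-3):
    """alpha_t for a rule with outputs in {0, 1}; 0 means the rule abstains."""
    # Total weight of positive and negative examples on which the rule fires.
    w_pos = sum(w for h, y, w in zip(predictions, labels, weights)
                if h == 1 and y == +1)
    w_neg = sum(w for h, y, w in zip(predictions, labels, weights)
                if h == 1 and y == -1)
    return 0.5 * math.log((smoothing + w_pos) / (smoothing + w_neg))


# Hypothetical rule: fires on three examples (two positive, one negative), abstains on one.
alpha = specialist_alpha([1, 1, 1, 0], [+1, +1, -1, -1], [1.0, 1.0, 1.0, 1.0])
```

Examples on which the rule abstains do not enter either sum, so they have no influence on the rule's weight. A rule that fires mostly on negative examples receives a negative weight.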
AdaBoost as variational optimization.
• $x$ = input, a scalar
• $y$ = output, -1 or +1
• $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ = training set
• $F_{t-1}$ = strong rule after $t-1$ boosting iterations
• $h_t$ = weak rule produced at iteration $t$
[Figure: $F_{t-1}$ and $h_t$ plotted against $x$, with the positive and negative training examples marked along the $x$ axis]
AdaBoost as variational optimization (continued).
[Figure, repeated for three step sizes: $F_{t-1}(x) + 0 \cdot h_t(x)$, then $F_{t-1}(x) + 0.4\,h_t(x)$, then $F_t(x) = F_{t-1}(x) + 0.8\,h_t(x)$]
Margins
Fix a set of weak rules: $\vec{h}(x) = (h_1(x), h_2(x), \ldots, h_T(x))$
Represent the i'th example $x_i$ with the outputs of the weak rules: $\vec{h}_i$
Labels: $y \in \{-1, +1\}$
Training set: $(\vec{h}_1, y_1), (\vec{h}_2, y_2), \ldots, (\vec{h}_n, y_n)$
Goal: find a weight vector $\vec{\alpha} = (\alpha_1, \ldots, \alpha_T)$ that minimizes the number of training mistakes.
Margin of $(\vec{h}_i, y_i)$: $y_i\, \vec{h}_i \cdot \vec{\alpha}$
[Figure: examples are reflected to $(y_i \vec{h}_i, y_i)$ and projected onto $\vec{\alpha}$; a plot of the cumulative number of examples against the margin, with mistakes at negative margins and correct predictions at positive margins]
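
The margin of each example is a dot product, so the number of training mistakes for a given weight vector is easy to count. The sketch below reuses the weights 0.42, 0.65, 0.92 from the toy example. The four rule-output vectors, the labels, and the convention of counting a zero margin as a mistake are illustrative assumptions.

```python
def margins(rule_outputs, labels, alpha):
    """Margin y_i * (h_i . alpha) for every example."""
    return [y * sum(a * h for a, h in zip(alpha, hs))
            for hs, y in zip(rule_outputs, labels)]


def training_mistakes(rule_outputs, labels, alpha):
    """Number of examples with a non-positive margin."""
    return sum(1 for m in margins(rule_outputs, labels, alpha) if m <= 0)


alpha = (0.42, 0.65, 0.92)
rule_outputs = [(+1, +1, +1), (+1, -1, -1), (-1, +1, +1), (-1, -1, +1)]
labels = [+1, -1, +1, +1]
example_margins = margins(rule_outputs, labels, alpha)
mistakes = training_mistakes(rule_outputs, labels, alpha)
```

Only the last example has a negative margin: two rules vote against its label and together they outweigh the third.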
Boosting as gradient descent
✦ Our goal is to minimize the number of mistakes (0-1 loss).
✦ Unfortunately, the step function has derivative zero at all points other than zero, where the derivative is undefined.
✦ Adaboost uses an upper bound on the 0-1 loss.
Adaboost: $e^{-\text{margin}}$
✦ Adaboost minimizes the exponential loss, which is an upper bound.
✦ Finds the vector $\alpha$ through coordinate-wise gradient descent.
✦ Weak rules are added one at a time.
[Figure: loss vs. margin, showing the 0-1 loss and the Adaboost loss, with mistakes at negative margins and correct predictions at positive margins]
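
The claim that the exponential loss upper-bounds the 0-1 loss can be checked on a grid of margins. This is a small sketch; the grid and the convention that a zero margin counts as a mistake are assumptions.

```python
import math


def zero_one_loss(margin):
    """1 for a mistake (non-positive margin), 0 for a correct prediction."""
    return 1.0 if margin <= 0 else 0.0


def exponential_loss(margin):
    """The Adaboost loss e^{-margin}."""
    return math.exp(-margin)


# Margins from -3.0 to 3.0 in steps of 0.1.
grid = [k / 10.0 for k in range(-30, 31)]
bound_holds = all(exponential_loss(m) >= zero_one_loss(m) for m in grid)
```

The two losses touch at margin zero. The exponential loss grows without limit as the margin becomes more negative, which is what gives it a useful non-zero derivative.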
Adaboost as gradient descent
✦ Weak rules are added one at a time.
✦ Fixing the alphas for $1..t-1$, find alpha for the new rule ($h_t$).
✦ Weak rule defines how each example would move = increase or decrease the margin.
new margin of $(x_i, y_i)$ is $y_i F_{t-1}(x_i) + \alpha\, y_i h_t(x_i)$
new total loss $= \sum_{i=1}^{n} \exp\left(-y_i \left(F_{t-1}(x_i) + \alpha h_t(x_i)\right)\right)$
derivative of total loss wrt $\alpha$:
$-\frac{\partial}{\partial \alpha}\Big|_{\alpha = 0} \sum_{i=1}^{n} \exp\left(-y_i \left(F_{t-1}(x_i) + \alpha h_t(x_i)\right)\right) = \sum_{i=1}^{n} y_i h_t(x_i) \exp\left(-y_i F_{t-1}(x_i)\right)$
where $\exp(-y_i F_{t-1}(x_i))$ is the weight of example $(x_i, y_i)$.
[Figure: loss vs. margin, showing the 0-1 loss, with mistakes at negative margins and correct predictions at positive margins]
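
The derivative formula can be sanity-checked against a finite-difference estimate. The sketch below uses four made-up examples. The values of $F_{t-1}(x_i)$, $h_t(x_i)$ and $y_i$, the step size, and the function names are all illustrative assumptions.

```python
import math


def total_exp_loss(alpha, prev_scores, rule_outputs, labels):
    """Sum over examples of exp(-y_i * (F_{t-1}(x_i) + alpha * h_t(x_i)))."""
    return sum(math.exp(-y * (f + alpha * h))
               for f, h, y in zip(prev_scores, rule_outputs, labels))


def neg_derivative_at_zero(prev_scores, rule_outputs, labels):
    """Closed form: sum of y_i * h_t(x_i) * exp(-y_i * F_{t-1}(x_i))."""
    return sum(y * h * math.exp(-y * f)
               for f, h, y in zip(prev_scores, rule_outputs, labels))


prev_scores = [0.5, -0.2, 1.0, -0.7]    # hypothetical F_{t-1}(x_i)
rule_outputs = [+1, +1, -1, -1]         # hypothetical h_t(x_i)
labels = [+1, -1, +1, -1]               # y_i

analytic = neg_derivative_at_zero(prev_scores, rule_outputs, labels)

# Central finite difference of the total loss around alpha = 0, negated.
step = 1e-6
numeric = -(total_exp_loss(step, prev_scores, rule_outputs, labels)
            - total_exp_loss(-step, prev_scores, rule_outputs, labels)) / (2 * step)
```

The closed form is the weighted correlation between labels and rule outputs, with each example weighted by $\exp(-y_i F_{t-1}(x_i))$. A weak rule that correlates with the labels under the current weights is therefore a descent direction.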
Logitboost as gradient descent
Also called Gentle-Boost and Logit Boost, Hastie, Friedman & Tibshirani
Instead of the loss $e^{-m}$ (margin $= m$), define the loss to be $\log(1 + e^{-m})$
When margin $\gg 0$: $\log(1 + e^{-m}) \approx e^{-m}$
When margin $\ll 0$: $\log(1 + e^{-m}) \approx -m$
new margin of $(x_i, y_i)$ is $y_i F_{t-1}(x_i) + \alpha\, y_i h_t(x_i)$
new total loss $= \sum_{i=1}^{n} \log\left[1 + \exp\left(-y_i \left(F_{t-1}(x_i) + \alpha h_t(x_i)\right)\right)\right]$
derivative of total loss wrt $\alpha$:
$-\frac{\partial}{\partial \alpha}\Big|_{\alpha = 0} \sum_{i=1}^{n} \log\left[1 + \exp\left(-y_i \left(F_{t-1}(x_i) + \alpha h_t(x_i)\right)\right)\right] = \sum_{i=1}^{n} y_i h_t(x_i) \frac{\exp\left(-y_i F_{t-1}(x_i)\right)}{1 + \exp\left(-y_i F_{t-1}(x_i)\right)}$
where the fraction is the weight of example $(x_i, y_i)$.
[Figure: loss and weight vs. margin for Adaboost (loss and weight coincide), the Logitboost loss, and the Logitboost weight, with mistakes at negative margins and correct predictions at positive margins]
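
The two weighting schemes can be compared side by side as functions of the margin. This is a minimal sketch with illustrative function names, intended for moderate margins only, since `math.exp` overflows for very large arguments.

```python
import math


def adaboost_weight(margin):
    """Example weight under the exponential loss: e^{-m}."""
    return math.exp(-margin)


def logitboost_weight(margin):
    """Example weight under the logistic loss: e^{-m} / (1 + e^{-m})."""
    e = math.exp(-margin)
    return e / (1.0 + e)


# Weights at a badly misclassified, a borderline, and a well-classified example.
comparison = [(m, adaboost_weight(m), logitboost_weight(m)) for m in (-5.0, 0.0, 5.0)]
```

For large positive margins the two weights nearly coincide. For large negative margins the Adaboost weight grows exponentially while the Logitboost weight stays below 1.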
Noise resistance
• Adaboost:
  – Performs well when the achievable error rate is close to zero (the almost consistent case).
  – Errors (examples with negative margins) get very large weights, so Adaboost can overfit.
• Logitboost:
  – Inferior to Adaboost when the achievable error rate is close to zero.
  – Often better than Adaboost when the achievable error rate is high.
  – The weight on any example is never larger than 1.