

  1. Foundations of Machine Learning Boosting

  2. Weak Learning (Kearns and Valiant, 1994). Definition: a concept class $C$ is weakly PAC-learnable if there exist a (weak) learning algorithm $L$ and $\gamma > 0$ such that:
  • for all $\delta > 0$, for all $c \in C$ and all distributions $D$,
    $\Pr_{S \sim D}\big[ R(h_S) \le \tfrac{1}{2} - \gamma \big] \ge 1 - \delta$,
  • for samples $S$ of size $m = \mathrm{poly}(1/\delta)$ for a fixed polynomial.

  3. Boosting Ideas. Finding simple, relatively accurate base classifiers is often not hard: a weak learner. Main ideas:
  • use the weak learner to create a strong learner.
  • combine the base classifiers returned by the weak learner (ensemble method).
  But how should the base classifiers be combined?

  4. AdaBoost (Freund and Schapire, 1997). Base classifier set $H \subseteq \{-1, +1\}^X$.
  AdaBoost$(S = ((x_1, y_1), \ldots, (x_m, y_m)))$
   1  for $i \leftarrow 1$ to $m$ do
   2      $D_1(i) \leftarrow \frac{1}{m}$
   3  for $t \leftarrow 1$ to $T$ do
   4      $h_t \leftarrow$ base classifier in $H$ with small error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \ne y_i]$
   5      $\alpha_t \leftarrow \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$
   6      $Z_t \leftarrow 2\,[\epsilon_t (1 - \epsilon_t)]^{1/2}$   (normalization factor)
   7      for $i \leftarrow 1$ to $m$ do
   8          $D_{t+1}(i) \leftarrow \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$
   9  $f_T \leftarrow \sum_{t=1}^{T} \alpha_t h_t$
  10  return $h = \mathrm{sgn}(f_T)$
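
  To make the pseudocode concrete, here is a minimal Python sketch of AdaBoost with decision stumps as base classifiers (the setting of slide 16). The function names, the stump parameterization, and the toy circular dataset are illustrative choices, not part of the slides.

  ```python
  import numpy as np

  def fit_stump(X, y, D):
      """Return (error, feature j, threshold, sign) of the decision stump with
      the smallest weighted error eps_t under the current distribution D."""
      m, N = X.shape
      best = (np.inf, 0, 0.0, 1)
      for j in range(N):
          for theta in np.unique(X[:, j]):
              for sign in (1, -1):
                  pred = np.where(X[:, j] <= theta, -sign, sign)
                  err = D[pred != y].sum()
                  if err < best[0]:
                      best = (err, j, theta, sign)
      return best

  def adaboost(X, y, T=50):
      """AdaBoost as in the pseudocode above; returns the classifier h = sgn(f_T)."""
      m = X.shape[0]
      D = np.full(m, 1.0 / m)                      # D_1(i) = 1/m
      stumps, alphas = [], []
      for t in range(T):
          eps, j, theta, sign = fit_stump(X, y, D)
          eps = np.clip(eps, 1e-10, 1 - 1e-10)     # numerical guard
          alpha = 0.5 * np.log((1 - eps) / eps)    # alpha_t = 1/2 log((1 - eps_t)/eps_t)
          pred = np.where(X[:, j] <= theta, -sign, sign)
          D = D * np.exp(-alpha * y * pred)
          D /= D.sum()                             # division by Z_t
          stumps.append((j, theta, sign))
          alphas.append(alpha)

      def h(Xnew):
          f = sum(a * np.where(Xnew[:, j] <= th, -s, s)
                  for a, (j, th, s) in zip(alphas, stumps))
          return np.where(f >= 0, 1, -1)           # h = sgn(f_T)
      return h

  # Illustrative usage: points inside the unit circle are labeled +1.
  rng = np.random.default_rng(0)
  X = rng.normal(size=(300, 2))
  y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 <= 1.0, 1, -1)
  clf = adaboost(X, y, T=100)
  print("training error:", np.mean(clf(X) != y))
  ```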

  5. Notes. Distributions $D_t$ over the training sample:
  • originally uniform.
  • at each round, the weight of a misclassified example is increased.
  • observation: $D_{t+1}(i) = \frac{e^{-y_i f_t(x_i)}}{m \prod_{s=1}^{t} Z_s}$, since
    $D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t} = \frac{D_{t-1}(i)\, e^{-\alpha_{t-1} y_i h_{t-1}(x_i)}\, e^{-\alpha_t y_i h_t(x_i)}}{Z_{t-1} Z_t} = \frac{1}{m} \frac{e^{-y_i \sum_{s=1}^{t} \alpha_s h_s(x_i)}}{\prod_{s=1}^{t} Z_s}.$
  The weight $\alpha_t$ assigned to base classifier $h_t$ depends directly on the accuracy of $h_t$ at round $t$.
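
  The identity above can be checked numerically on a tiny made-up example; the one-dimensional sample and the three threshold classifiers below are hand-picked for illustration only.

  ```python
  import numpy as np

  x = np.array([-2.0, -1.0, 1.0, 2.0])
  y = np.array([-1, -1, 1, -1])
  thresholds = [0.0, 2.5, 0.0]            # base classifiers h_t(x) = +1 if x > theta else -1

  m = len(x)
  D = np.full(m, 1.0 / m)                 # D_1: uniform
  f = np.zeros(m)                         # f_t(x_i) = sum_{s<=t} alpha_s h_s(x_i)
  Z_prod = 1.0
  for theta in thresholds:
      h = np.where(x > theta, 1, -1)
      eps = D[h != y].sum()
      alpha = 0.5 * np.log((1 - eps) / eps)
      w = D * np.exp(-alpha * y * h)
      Z = w.sum()                         # normalization factor Z_t
      D, f, Z_prod = w / Z, f + alpha * h, Z_prod * Z
      # the identity from the slide: D_{t+1}(i) = exp(-y_i f_t(x_i)) / (m * prod_s Z_s)
      assert np.allclose(D, np.exp(-y * f) / (m * Z_prod))
  print("identity verified at every round")
  ```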

  6. Illustration of the first rounds, $t = 1$ and $t = 2$ (figures).

  7. Illustration, continued: round $t = 3$, and so on (figures).

  8. Final classifier (figure): the weighted combination $\alpha_1 h_1 + \alpha_2 h_2 + \alpha_3 h_3$ of the base classifiers from the first three rounds.

  9. Bound on the Empirical Error (Freund and Schapire, 1997). Theorem: the empirical error of the classifier output by AdaBoost verifies:
    $\widehat{R}(h) \le \exp\Big(-2 \sum_{t=1}^{T} \big(\tfrac{1}{2} - \epsilon_t\big)^2\Big).$
  • If further, for all $t \in [1, T]$, $\gamma \le \big(\tfrac{1}{2} - \epsilon_t\big)$, then $\widehat{R}(h) \le \exp(-2 \gamma^2 T)$.
  • $\gamma$ does not need to be known in advance: adaptive boosting.
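
  As a rough numerical illustration of the two forms of the bound (the per-round errors $\epsilon_t$ below are made up):

  ```python
  import math

  eps = [0.30, 0.35, 0.40, 0.38, 0.42, 0.36, 0.41, 0.39, 0.37, 0.40]   # illustrative
  bound = math.exp(-2 * sum((0.5 - e) ** 2 for e in eps))
  gamma = min(0.5 - e for e in eps)                  # edge: gamma <= 1/2 - eps_t for all t
  simpler = math.exp(-2 * gamma ** 2 * len(eps))
  print(f"exp(-2 sum_t (1/2 - eps_t)^2) = {bound:.4f}")
  print(f"exp(-2 gamma^2 T)             = {simpler:.4f}  (looser; gamma = {gamma:.2f})")
  ```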

  10. Proof: since, as we saw, $D_{t+1}(i) = \frac{e^{-y_i f_t(x_i)}}{m \prod_{s=1}^{t} Z_s}$,
    $\widehat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) \le 0} \le \frac{1}{m} \sum_{i=1}^{m} \exp(-y_i f(x_i)) = \frac{1}{m} \sum_{i=1}^{m} \Big[ m \prod_{t=1}^{T} Z_t \Big] D_{T+1}(i) = \prod_{t=1}^{T} Z_t.$
  • Now, since $Z_t$ is a normalization factor,
    $Z_t = \sum_{i=1}^{m} D_t(i)\, e^{-\alpha_t y_i h_t(x_i)} = \sum_{i: y_i h_t(x_i) \ge 0} D_t(i)\, e^{-\alpha_t} + \sum_{i: y_i h_t(x_i) < 0} D_t(i)\, e^{\alpha_t} = (1 - \epsilon_t) e^{-\alpha_t} + \epsilon_t e^{\alpha_t} = (1 - \epsilon_t) \sqrt{\tfrac{\epsilon_t}{1 - \epsilon_t}} + \epsilon_t \sqrt{\tfrac{1 - \epsilon_t}{\epsilon_t}} = 2 \sqrt{\epsilon_t (1 - \epsilon_t)}.$

  11. Thus,
    $\prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)} = \prod_{t=1}^{T} \sqrt{1 - 4\big(\tfrac{1}{2} - \epsilon_t\big)^2} \le \prod_{t=1}^{T} \exp\Big(-2 \big(\tfrac{1}{2} - \epsilon_t\big)^2\Big) = \exp\Big(-2 \sum_{t=1}^{T} \big(\tfrac{1}{2} - \epsilon_t\big)^2\Big).$
  Notes:
  • $\alpha_t$ is the minimizer of $\alpha \mapsto (1 - \epsilon_t) e^{-\alpha} + \epsilon_t e^{\alpha}$.
  • since $(1 - \epsilon_t) e^{-\alpha_t} = \epsilon_t e^{\alpha_t}$, at each round AdaBoost assigns the same probability mass to correctly classified and to misclassified instances.
  • for base classifiers taking values in $[-1, +1]$, $\alpha_t$ can similarly be chosen to minimize $Z_t$.
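
  The first two notes can be checked numerically for an arbitrary value of $\epsilon_t$ (0.3 below is just an example):

  ```python
  import numpy as np

  eps = 0.3                                           # illustrative per-round error
  alphas = np.linspace(-2, 2, 100001)
  Z = (1 - eps) * np.exp(-alphas) + eps * np.exp(alphas)
  alpha_star = 0.5 * np.log((1 - eps) / eps)          # AdaBoost's choice of alpha_t

  print("numeric argmin of Z:", alphas[np.argmin(Z)])
  print("closed form alpha_t:", alpha_star)
  # equal mass on correctly and incorrectly classified points at alpha_t:
  print("equal mass:", np.isclose((1 - eps) * np.exp(-alpha_star), eps * np.exp(alpha_star)))
  ```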

  12. AdaBoost = Coordinate Descent. Objective function: convex and differentiable,
    $F(\bar{\alpha}) = \frac{1}{m} \sum_{i=1}^{m} e^{-y_i f(x_i)} = \frac{1}{m} \sum_{i=1}^{m} e^{-y_i \sum_{j=1}^{N} \bar{\alpha}_j h_j(x_i)}.$
  (Figure: the exponential $x \mapsto e^{-x}$ upper-bounds the zero-one loss.)

  13. • Direction: the unit vector $e_k$ with the best directional derivative:
    $F'(\bar{\alpha}_{t-1}, e_k) = \lim_{\eta \to 0} \frac{F(\bar{\alpha}_{t-1} + \eta e_k) - F(\bar{\alpha}_{t-1})}{\eta}.$
  • Since $F(\bar{\alpha}_{t-1} + \eta e_k) = \frac{1}{m} \sum_{i=1}^{m} e^{-y_i \sum_{j=1}^{N} \bar{\alpha}_{t-1,j} h_j(x_i) - \eta y_i h_k(x_i)}$,
    $F'(\bar{\alpha}_{t-1}, e_k) = -\frac{1}{m} \sum_{i=1}^{m} y_i h_k(x_i)\, e^{-y_i \sum_{j=1}^{N} \bar{\alpha}_{t-1,j} h_j(x_i)} = -\frac{1}{m} \sum_{i=1}^{m} y_i h_k(x_i)\, \bar{D}_t(i)\, \bar{Z}_t = -\frac{\bar{Z}_t}{m} \Big[ \sum_{i=1}^{m} \bar{D}_t(i) 1_{y_i h_k(x_i) = +1} - \sum_{i=1}^{m} \bar{D}_t(i) 1_{y_i h_k(x_i) = -1} \Big] = -\frac{\bar{Z}_t}{m} \big[ (1 - \bar{\epsilon}_{t,k}) - \bar{\epsilon}_{t,k} \big] = \big[ 2 \bar{\epsilon}_{t,k} - 1 \big] \frac{\bar{Z}_t}{m}.$
  Thus, the chosen direction corresponds to the base classifier with the smallest weighted error.
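
  The closed-form directional derivative can be checked against a finite-difference approximation on a tiny made-up example (random $\pm 1$ entries stand in for $h_j(x_i)$ and $y_i$; all names are illustrative):

  ```python
  import numpy as np

  rng = np.random.default_rng(0)
  m, n_base = 8, 4
  H = rng.choice([-1, 1], size=(n_base, m))       # H[j, i] plays the role of h_j(x_i)
  y = rng.choice([-1, 1], size=m)
  alpha = rng.normal(scale=0.3, size=n_base)      # current point alpha_{t-1}

  def F(a):
      """F(a) = (1/m) sum_i exp(-y_i sum_j a_j h_j(x_i))."""
      return np.mean(np.exp(-y * (a @ H)))

  w = np.exp(-y * (alpha @ H))                    # unnormalized weights
  Z_bar = w.sum()
  D_bar = w / Z_bar                               # \bar{D}_t(i)

  for k in range(n_base):
      eps_k = D_bar[H[k] != y].sum()              # weighted error of h_k
      formula = (2 * eps_k - 1) * Z_bar / m       # directional derivative from the slide
      e_k = np.eye(n_base)[k]
      eta = 1e-6                                  # central finite difference
      numeric = (F(alpha + eta * e_k) - F(alpha - eta * e_k)) / (2 * eta)
      print(f"h_{k}: eps = {eps_k:.3f}, formula = {formula:+.6f}, finite diff = {numeric:+.6f}")
  ```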

  14. • Step size: $\eta$ chosen to minimize $F(\bar{\alpha}_{t-1} + \eta e_k)$:
    $\frac{d F(\bar{\alpha}_{t-1} + \eta e_k)}{d\eta} = 0 \iff -\sum_{i=1}^{m} y_i h_k(x_i)\, e^{-y_i \sum_{j=1}^{N} \bar{\alpha}_{t-1,j} h_j(x_i)}\, e^{-\eta y_i h_k(x_i)} = 0 \iff -\sum_{i=1}^{m} y_i h_k(x_i)\, \bar{D}_t(i)\, \bar{Z}_t\, e^{-\eta y_i h_k(x_i)} = 0 \iff -\sum_{i=1}^{m} y_i h_k(x_i)\, \bar{D}_t(i)\, e^{-\eta y_i h_k(x_i)} = 0 \iff -\big[ (1 - \bar{\epsilon}_{t,k}) e^{-\eta} - \bar{\epsilon}_{t,k} e^{\eta} \big] = 0 \iff \eta = \frac{1}{2} \log \frac{1 - \bar{\epsilon}_{t,k}}{\bar{\epsilon}_{t,k}}.$
  Thus, the step size matches the base classifier weight used by AdaBoost.
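
  A small hand-made line search confirming that minimizing $F$ along a direction $e_k$ recovers the AdaBoost step size (the labels, scores, and candidate classifier below are arbitrary illustrative numbers):

  ```python
  import numpy as np

  y   = np.array([ 1,  1, -1, -1,  1])
  h_k = np.array([-1,  1,  1, -1,  1])             # candidate direction, values in {-1, +1}
  f   = np.array([ 0.4, -0.2, -0.5, 0.1, 0.3])     # current scores sum_j alpha_{t-1,j} h_j(x_i)

  w = np.exp(-y * f)
  D_bar = w / w.sum()                              # \bar{D}_t
  eps_k = D_bar[h_k != y].sum()                    # weighted error of h_k

  etas = np.linspace(-2, 2, 40001)
  F_line = [np.mean(np.exp(-y * (f + eta * h_k))) for eta in etas]
  print("numeric argmin eta:", etas[np.argmin(F_line)])
  print("closed-form step  :", 0.5 * np.log((1 - eps_k) / eps_k))
  ```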

  15. Alternative Loss Functions (figure comparing the losses as functions of the margin):
  • boosting loss: $x \mapsto e^{-x}$
  • square loss: $x \mapsto (1 - x)^2\, 1_{x \le 1}$
  • logistic loss: $x \mapsto \log_2(1 + e^{-x})$
  • hinge loss: $x \mapsto \max(1 - x, 0)$
  • zero-one loss: $x \mapsto 1_{x < 0}$
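
  The same losses written as small Python functions of the margin $x = y f(x)$, handy for comparing them numerically or plotting them:

  ```python
  import numpy as np

  losses = {
      "zero-one": lambda x: (x < 0).astype(float),
      "boosting": lambda x: np.exp(-x),
      "logistic": lambda x: np.log2(1 + np.exp(-x)),
      "hinge":    lambda x: np.maximum(1 - x, 0),
      "square":   lambda x: (1 - x) ** 2 * (x <= 1),
  }

  margins = np.linspace(-2, 2, 5)
  for name, loss in losses.items():
      print(f"{name:>8}: {np.round(loss(margins), 3)}")
  ```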

  16. Standard Use in Practice. Base learners: decision trees, quite often just decision stumps (trees of depth one). Boosting stumps:
  • data in $\mathbb{R}^N$, e.g., $N = 2$, $(\mathrm{height}(x), \mathrm{weight}(x))$.
  • associate a stump with each component.
  • pre-sort each component: $O(Nm \log m)$.
  • at each round, find the best component and threshold (see the sketch below).
  • total complexity: $O((m \log m) N + mNT)$.
  • stumps are not always weak learners: think of the XOR example!
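
  Here is a sketch of the stump search described above: the columns are pre-sorted once, and each boosting round then scans every sorted column in $O(m)$ to find the best component and threshold under the current distribution $D$. Function names and the toy data are illustrative.

  ```python
  import numpy as np

  def presort(X):
      """Column-wise argsort, computed once before boosting: O(N m log m)."""
      return np.argsort(X, axis=0)

  def best_stump(X, y, D, order):
      """One boosting round's search: O(m) scan per pre-sorted component."""
      m, N = X.shape
      best = (np.inf, 0, 0.0, 1)                 # (error, component j, threshold, sign)
      for j in range(N):
          idx = order[:, j]
          xs, ys, ds = X[idx, j], y[idx], D[idx]
          # stump "x_j <= theta -> -1, else +1", swept from theta below all points:
          err = D[y == -1].sum()                 # error when everything is predicted +1
          for i in range(m):
              err += ds[i] if ys[i] == 1 else -ds[i]   # point i moves to the -1 side
              for e, sign in ((err, 1), (1 - err, -1)):
                  if e < best[0]:
                      best = (e, j, xs[i], sign)
      return best

  # illustrative usage with uniform weights
  rng = np.random.default_rng(0)
  X = rng.normal(size=(100, 2))
  y = np.where(X[:, 1] > 0.3, 1, -1)
  D = np.full(len(y), 1.0 / len(y))
  print(best_stump(X, y, D, presort(X)))         # (error, j, theta, sign)
  ```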

  17. Overfitting? Assume that $\mathrm{VCdim}(H) = d$ and, for a fixed $T$, define
    $F_T = \Big\{ \mathrm{sgn}\Big( \sum_{t=1}^{T} \alpha_t h_t - b \Big) : \alpha_t, b \in \mathbb{R},\, h_t \in H \Big\}.$
  $F_T$ can form a very rich family of classifiers. It can be shown (Freund and Schapire, 1997) that:
    $\mathrm{VCdim}(F_T) \le 2 (d + 1)(T + 1) \log_2((T + 1) e).$
  This suggests that AdaBoost could overfit for large values of $T$, and that is in fact observed in some cases, but in various others it is not!
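
  A quick tabulation of how this VC-dimension bound grows with the number of rounds $T$ (the value $d = 10$ is arbitrary, chosen only for illustration):

  ```python
  import math

  d = 10                                            # illustrative VCdim(H)
  for T in (10, 100, 1000):
      bound = 2 * (d + 1) * (T + 1) * math.log2((T + 1) * math.e)
      print(f"T = {T:>4}:  VCdim(F_T) <= {bound:,.0f}")
  ```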

  18. Empirical Observations. Several empirical observations (not all): AdaBoost does not seem to overfit; furthermore, the test error can keep decreasing even after the training error has reached zero. (Figure: training and test error versus the number of boosting rounds, with C4.5 decision trees as base learners; Schapire et al., 1998.)

  19. Rademacher Complexity of Convex Hulls. Theorem: let $H$ be a set of functions mapping from $X$ to $\mathbb{R}$. Let the convex hull of $H$ be defined as
    $\mathrm{conv}(H) = \Big\{ \sum_{k=1}^{p} \mu_k h_k : p \ge 1,\, \mu_k \ge 0,\, \sum_{k=1}^{p} \mu_k \le 1,\, h_k \in H \Big\}.$
  Then, for any sample $S$, $\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \widehat{\mathfrak{R}}_S(H)$.
  Proof:
    $\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \frac{1}{m} \mathbb{E}_{\sigma}\Big[ \sup_{h_k \in H,\, \mu \ge 0,\, \|\mu\|_1 \le 1} \sum_{i=1}^{m} \sigma_i \sum_{k=1}^{p} \mu_k h_k(x_i) \Big] = \frac{1}{m} \mathbb{E}_{\sigma}\Big[ \sup_{h_k \in H}\, \sup_{\mu \ge 0,\, \|\mu\|_1 \le 1} \sum_{k=1}^{p} \mu_k \sum_{i=1}^{m} \sigma_i h_k(x_i) \Big] = \frac{1}{m} \mathbb{E}_{\sigma}\Big[ \sup_{h_k \in H} \max_{k \in [1, p]} \sum_{i=1}^{m} \sigma_i h_k(x_i) \Big] = \frac{1}{m} \mathbb{E}_{\sigma}\Big[ \sup_{h \in H} \sum_{i=1}^{m} \sigma_i h(x_i) \Big] = \widehat{\mathfrak{R}}_S(H).$

  20. Margin Bound, Ensemble Methods (Koltchinskii and Panchenko, 2002). Corollary: let $H$ be a set of real-valued functions. Fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in \mathrm{conv}(H)$:
    $R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho} \mathfrak{R}_m(H) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$
    $R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho} \widehat{\mathfrak{R}}_S(H) + 3 \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$
  Proof: direct consequence of the margin bound of Lecture 4 and of $\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \widehat{\mathfrak{R}}_S(H)$.

  21. Margin Bound, Ensemble Methods (Koltchinskii and Panchenko, 2002); see also (Schapire et al., 1998). Corollary: let $H$ be a family of functions taking values in $\{-1, +1\}$ with VC dimension $d$. Fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in \mathrm{conv}(H)$:
    $R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho} \sqrt{\frac{2 d \log \frac{em}{d}}{m}} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$
  Proof: follows directly from the previous corollary and the VC-dimension bound on the Rademacher complexity (see Lecture 3).

  22. Notes. All of these bounds can be generalized to hold uniformly for all $\rho \in (0, 1)$, at the cost of an additional term $\sqrt{\frac{\log \log_2 \frac{2}{\rho}}{m}}$ and other minor constant-factor changes (Koltchinskii and Panchenko, 2002).
  For AdaBoost, the bound applies to the functions
    $x \mapsto \frac{f(x)}{\|\alpha\|_1} = \frac{\sum_{t=1}^{T} \alpha_t h_t(x)}{\|\alpha\|_1} \in \mathrm{conv}(H).$
  Note that $T$ does not appear in the bound.

  23. Margin Distribution. Theorem: for any $\rho > 0$, the following holds:
    $\Pr\Big[ \frac{y f(x)}{\|\alpha\|_1} \le \rho \Big] \le 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{1 - \rho} (1 - \epsilon_t)^{1 + \rho}}.$
  Proof: using the identity $D_{T+1}(i) = \frac{e^{-y_i f(x_i)}}{m \prod_{t=1}^{T} Z_t}$,
    $\frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) - \rho \|\alpha\|_1 \le 0} \le \frac{1}{m} \sum_{i=1}^{m} \exp(-y_i f(x_i) + \rho \|\alpha\|_1) = \frac{1}{m} e^{\rho \|\alpha\|_1} \sum_{i=1}^{m} \Big[ m \prod_{t=1}^{T} Z_t \Big] D_{T+1}(i) = e^{\rho \|\alpha\|_1} \prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} \Big( \frac{1 - \epsilon_t}{\epsilon_t} \Big)^{\rho/2} 2 \sqrt{\epsilon_t (1 - \epsilon_t)} = 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{1 - \rho} (1 - \epsilon_t)^{1 + \rho}}.$
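
  A numerical illustration of how the right-hand side of the theorem behaves for a few values of $\rho$ (the per-round errors $\epsilon_t$ below are made up):

  ```python
  import numpy as np

  eps = np.array([0.20, 0.25, 0.30, 0.22, 0.28, 0.24,
                  0.26, 0.23, 0.27, 0.21, 0.29, 0.25])   # illustrative eps_t, T = 12
  T = len(eps)
  for rho in (0.0, 0.05, 0.1, 0.2):
      bound = 2 ** T * np.prod(np.sqrt(eps ** (1 - rho) * (1 - eps) ** (1 + rho)))
      print(f"rho = {rho:.2f}: bound on Pr[y f(x)/||alpha||_1 <= rho] = {bound:.3f}")
  ```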
