Foundations of Machine Learning: Boosting
Weak Learning
(Kearns and Valiant, 1994)

Definition: a concept class $C$ is weakly PAC-learnable if there exists a (weak) learning algorithm $L$ and $\gamma > 0$ such that:
- for all $\delta > 0$, for all $c \in C$ and all distributions $D$,
$$\Pr_{S \sim D}\left[ R(h_S) \leq \frac{1}{2} - \gamma \right] \geq 1 - \delta,$$
- for samples $S$ of size $m = \mathrm{poly}(1/\delta)$ for a fixed polynomial.
Boosting Ideas

Finding simple, relatively accurate base classifiers is often not hard: weak learner.

Main ideas:
- use the weak learner to create a strong learner.
- combine base classifiers returned by the weak learner (ensemble method).

But how should the base classifiers be combined?
AdaBoost
(Freund and Schapire, 1997)

$H \subseteq \{-1, +1\}^X$.

AdaBoost$(S = ((x_1, y_1), \ldots, (x_m, y_m)))$
 1  for $i \leftarrow 1$ to $m$ do
 2      $D_1(i) \leftarrow \frac{1}{m}$
 3  for $t \leftarrow 1$ to $T$ do
 4      $h_t \leftarrow$ base classifier in $H$ with small error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$
 5      $\alpha_t \leftarrow \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$
 6      $Z_t \leftarrow 2[\epsilon_t(1 - \epsilon_t)]^{\frac{1}{2}}$    ▷ normalization factor
 7      for $i \leftarrow 1$ to $m$ do
 8          $D_{t+1}(i) \leftarrow \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$
 9      $f_t \leftarrow \sum_{s=1}^{t} \alpha_s h_s$
10  return $h = \mathrm{sgn}(f_T)$
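Below is a minimal NumPy/scikit-learn sketch of this pseudocode (not the authors' reference implementation); it assumes labels in $\{-1, +1\}$ and simulates the weak learner by fitting a depth-one decision tree on the weighted sample.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=100):
    """AdaBoost sketch: y in {-1, +1}, decision stumps as base classifiers."""
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                          # D_1(i) = 1/m
    hs, alphas, errors = [], [], []
    for t in range(T):
        # base classifier with small weighted error eps_t
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = float(np.sum(D * (pred != y)))
        if eps <= 0.0 or eps >= 0.5:                 # perfect, or not a weak learner
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)      # alpha_t
        D = D * np.exp(-alpha * y * pred)            # up-weight misclassified points
        D /= D.sum()                                 # normalize (divide by Z_t)
        hs.append(h); alphas.append(alpha); errors.append(eps)
    def f(X_new):                                    # f_T(x) = sum_t alpha_t h_t(x)
        return sum(a * h.predict(X_new) for a, h in zip(alphas, hs))
    return f, np.array(alphas), np.array(errors)

# Final classifier: h(x) = sgn(f_T(x)), e.g. np.sign(f(X_test)).
```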
Notes

Distributions $D_t$ over the training sample:
- originally uniform.
- at each round, the weight of a misclassified example is increased.
- observation: $D_{t+1}(i) = \dfrac{e^{-y_i f_t(x_i)}}{m \prod_{s=1}^{t} Z_s}$, since
$$D_{t+1}(i) = \frac{D_t(i) e^{-\alpha_t y_i h_t(x_i)}}{Z_t} = \frac{D_{t-1}(i) e^{-\alpha_{t-1} y_i h_{t-1}(x_i)} e^{-\alpha_t y_i h_t(x_i)}}{Z_{t-1} Z_t} = \frac{1}{m} \frac{e^{-y_i \sum_{s=1}^{t} \alpha_s h_s(x_i)}}{\prod_{s=1}^{t} Z_s}.$$

Weight $\alpha_t$ assigned to base classifier $h_t$: directly depends on the accuracy of $h_t$ at round $t$.
Illustration

[Figure: rounds $t = 1, 2, 3, \ldots$; at each round a base classifier is selected and the reweighted sample is passed to the next round; the final classifier is the weighted combination $\alpha_1 h_1 + \alpha_2 h_2 + \alpha_3 h_3 + \cdots$.]
Bound on Empirical Error
(Freund and Schapire, 1997)

Theorem: The empirical error of the classifier output by AdaBoost verifies:
$$\widehat{R}(h) \leq \exp\left(-2 \sum_{t=1}^{T} \left[\frac{1}{2} - \epsilon_t\right]^2\right).$$
- If further for all $t \in [1, T]$, $\gamma \leq \left(\frac{1}{2} - \epsilon_t\right)$, then
$$\widehat{R}(h) \leq \exp(-2\gamma^2 T).$$
- $\gamma$ does not need to be known in advance: adaptive boosting.

Proof: Since, as we saw, $D_{t+1}(i) = \frac{e^{-y_i f_t(x_i)}}{m \prod_{s=1}^{t} Z_s}$,
$$\widehat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) \leq 0} \leq \frac{1}{m} \sum_{i=1}^{m} \exp(-y_i f(x_i)) = \frac{1}{m} \sum_{i=1}^{m} \left[m \prod_{t=1}^{T} Z_t\right] D_{T+1}(i) = \prod_{t=1}^{T} Z_t.$$
Now, since $Z_t$ is a normalization factor,
$$Z_t = \sum_{i=1}^{m} D_t(i) e^{-\alpha_t y_i h_t(x_i)} = \sum_{i: y_i h_t(x_i) \geq 0} D_t(i) e^{-\alpha_t} + \sum_{i: y_i h_t(x_i) < 0} D_t(i) e^{\alpha_t} = (1 - \epsilon_t) e^{-\alpha_t} + \epsilon_t e^{\alpha_t} = (1 - \epsilon_t) \sqrt{\tfrac{\epsilon_t}{1 - \epsilon_t}} + \epsilon_t \sqrt{\tfrac{1 - \epsilon_t}{\epsilon_t}} = 2 \sqrt{\epsilon_t (1 - \epsilon_t)}.$$
Thus,
$$\prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} 2\sqrt{\epsilon_t(1 - \epsilon_t)} = \prod_{t=1}^{T} \sqrt{1 - 4\left[\tfrac{1}{2} - \epsilon_t\right]^2} \leq \prod_{t=1}^{T} \exp\left(-2\left[\tfrac{1}{2} - \epsilon_t\right]^2\right) = \exp\left(-2 \sum_{t=1}^{T} \left[\tfrac{1}{2} - \epsilon_t\right]^2\right).$$

Notes:
- $\alpha_t$ is the minimizer of $\alpha \mapsto (1 - \epsilon_t) e^{-\alpha} + \epsilon_t e^{\alpha}$.
- since $(1 - \epsilon_t) e^{-\alpha_t} = \epsilon_t e^{\alpha_t}$, at each round, AdaBoost assigns the same probability mass to correctly classified and misclassified instances.
- for base classifiers $h_t \colon X \to [-1, +1]$, $\alpha_t$ can be similarly chosen to minimize $Z_t$.
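As a quick numerical illustration (hypothetical $\epsilon_t$ values, not from the source), the chain of inequalities in the proof can be checked directly: the product of the $Z_t$ is sandwiched below the two exponential bounds.

```python
import numpy as np

eps = np.array([0.35, 0.40, 0.30, 0.45, 0.38])        # hypothetical round errors eps_t
Z = 2.0 * np.sqrt(eps * (1.0 - eps))                  # Z_t = 2 sqrt(eps_t (1 - eps_t))
gamma = np.min(0.5 - eps)                             # edge gamma <= 1/2 - eps_t

print(np.prod(Z))                                     # prod_t Z_t  (>= training error)
print(np.exp(-2.0 * np.sum((0.5 - eps) ** 2)))        # exp(-2 sum_t (1/2 - eps_t)^2)
print(np.exp(-2.0 * gamma ** 2 * len(eps)))           # exp(-2 gamma^2 T)
# The three printed values are increasing, as in the proof.
```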
AdaBoost = Coordinate Descent

Objective function: convex and differentiable, upper-bounding the zero-one loss ($1_{x \leq 0} \leq e^{-x}$):
$$F(\bar{\alpha}) = \frac{1}{m} \sum_{i=1}^{m} e^{-y_i f(x_i)} = \frac{1}{m} \sum_{i=1}^{m} e^{-y_i \sum_{j=1}^{N} \bar{\alpha}_j h_j(x_i)}.$$

- Direction: unit vector $e_k$ with best directional derivative:
$$F'(\bar{\alpha}_{t-1}, e_k) = \lim_{\eta \to 0} \frac{F(\bar{\alpha}_{t-1} + \eta e_k) - F(\bar{\alpha}_{t-1})}{\eta}.$$
- Since $F(\bar{\alpha}_{t-1} + \eta e_k) = \frac{1}{m} \sum_{i=1}^{m} e^{-y_i \sum_{j=1}^{N} \bar{\alpha}_{t-1, j} h_j(x_i) - \eta y_i h_k(x_i)}$,
$$F'(\bar{\alpha}_{t-1}, e_k) = -\frac{1}{m} \sum_{i=1}^{m} y_i h_k(x_i) e^{-y_i \sum_{j=1}^{N} \bar{\alpha}_{t-1, j} h_j(x_i)} = -\frac{1}{m} \sum_{i=1}^{m} y_i h_k(x_i) \bar{D}_t(i) \bar{Z}_t = -\left[\sum_{i=1}^{m} \bar{D}_t(i) 1_{y_i h_k(x_i) = +1} - \sum_{i=1}^{m} \bar{D}_t(i) 1_{y_i h_k(x_i) = -1}\right] \frac{\bar{Z}_t}{m} = -\left[(1 - \bar{\epsilon}_{t,k}) - \bar{\epsilon}_{t,k}\right] \frac{\bar{Z}_t}{m} = \left[2\bar{\epsilon}_{t,k} - 1\right] \frac{\bar{Z}_t}{m}.$$
Thus, the direction corresponds to the base classifier with smallest error.

- Step size: chosen to minimize $\eta \mapsto F(\bar{\alpha}_{t-1} + \eta e_k)$:
$$\frac{dF(\bar{\alpha}_{t-1} + \eta e_k)}{d\eta} = 0 \iff -\sum_{i=1}^{m} y_i h_k(x_i) e^{-y_i \sum_{j=1}^{N} \bar{\alpha}_{t-1, j} h_j(x_i)} e^{-\eta y_i h_k(x_i)} = 0 \iff -\sum_{i=1}^{m} y_i h_k(x_i) \bar{D}_t(i) \bar{Z}_t e^{-\eta y_i h_k(x_i)} = 0 \iff -\left[(1 - \bar{\epsilon}_{t,k}) e^{-\eta} - \bar{\epsilon}_{t,k} e^{\eta}\right] = 0 \iff \eta = \frac{1}{2} \log \frac{1 - \bar{\epsilon}_{t,k}}{\bar{\epsilon}_{t,k}}.$$
Thus, the step size matches the base classifier weight of AdaBoost.
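A small sanity check of the step-size computation (a sketch with a hypothetical weighted error): numerically minimizing $\eta \mapsto (1-\epsilon)e^{-\eta} + \epsilon e^{\eta}$ recovers AdaBoost's closed-form $\alpha_t$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

eps = 0.3                                             # hypothetical weighted error
g = lambda eta: (1.0 - eps) * np.exp(-eta) + eps * np.exp(eta)

numeric = minimize_scalar(g).x                        # numerical minimizer
closed_form = 0.5 * np.log((1.0 - eps) / eps)         # AdaBoost's alpha_t
print(numeric, closed_form)                           # both approximately 0.4236
```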
Alternative Loss Functions

[Plot of losses as functions of the margin $x = yf(x)$:]
- zero-one loss: $x \mapsto 1_{x < 0}$
- boosting loss: $x \mapsto e^{-x}$
- logistic loss: $x \mapsto \log_2(1 + e^{-x})$
- hinge loss: $x \mapsto \max(1 - x, 0)$
- square loss: $x \mapsto (1 - x)^2\, 1_{x \leq 1}$
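A short plotting sketch of these margin losses (matplotlib assumed available); each of them is a convex upper bound on the zero-one loss.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-2, 2, 400)                           # margin x = y f(x)
losses = {
    "zero-one": (x < 0).astype(float),
    "boosting": np.exp(-x),
    "logistic": np.log2(1.0 + np.exp(-x)),
    "hinge":    np.maximum(1.0 - x, 0.0),
    "square":   (1.0 - x) ** 2 * (x <= 1),
}
for name, values in losses.items():
    plt.plot(x, values, label=name)
plt.xlabel("margin y f(x)"); plt.ylabel("loss"); plt.legend(); plt.show()
```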
Standard Use in Practice

Base learners: decision trees, quite often just decision stumps (trees of depth one).

Boosting stumps:
- data in $\mathbb{R}^N$, e.g., $N = 2$, $(\mathrm{height}(x), \mathrm{weight}(x))$.
- associate a stump to each component.
- pre-sort each component: $O(N m \log m)$.
- at each round, find best component and threshold (see the sketch below).
- total complexity: $O((m \log m) N + m N T)$.
- stumps not weak learners: think XOR example!
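A sketch of the per-round stump search (hypothetical helper, not from the source): once each column is sorted, the best threshold per feature is found with one cumulative sum, so each round costs roughly $O(mN)$ on top of the $O(Nm\log m)$ pre-sorting.

```python
import numpy as np

def best_stump(X, y, D):
    """Weighted stump search; stump: h(x) = s * sign(x[j] - theta), s in {-1, +1}."""
    m, N = X.shape
    total = D.sum()
    best = (np.inf, None, None, None)                 # (error, feature, threshold, sign)
    for j in range(N):
        order = np.argsort(X[:, j])                   # in practice pre-sorted once, outside the loop
        xs, ys, ws = X[order, j], y[order], D[order]
        # error of "predict -1 at or below threshold xs[i], +1 above":
        below_pos = np.cumsum(ws * (ys == 1))         # positives predicted -1
        above_neg = np.sum(ws * (ys == -1)) - np.cumsum(ws * (ys == -1))
        err = below_pos + above_neg
        for e, sign in ((err, +1), (total - err, -1)):  # flipping the stump flips the error
            i = int(np.argmin(e))
            if e[i] < best[0]:
                best = (float(e[i]), j, float(xs[i]), sign)
    return best                                        # minimal weighted error and stump
```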
Overfitting?

Assume that $\mathrm{VCdim}(H) = d$ and, for a fixed $T$, define
$$F_T = \left\{ \mathrm{sgn}\left(\sum_{t=1}^{T} \alpha_t h_t - b\right) : \alpha_t, b \in \mathbb{R}, h_t \in H \right\}.$$
$F_T$ can form a very rich family of classifiers. It can be shown (Freund and Schapire, 1997) that:
$$\mathrm{VCdim}(F_T) \leq 2(d + 1)(T + 1) \log_2((T + 1)e).$$
This suggests that AdaBoost could overfit for large values of $T$, and that is in fact observed in some cases, but in various others it is not!
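Evaluating this bound for illustrative (hypothetical) values shows how quickly the capacity of $F_T$ can grow with the number of rounds:

```python
import numpy as np

d = 3                                                  # hypothetical VC dimension of H
for T in (10, 100, 1000):
    bound = 2 * (d + 1) * (T + 1) * np.log2((T + 1) * np.e)
    print(T, round(bound))                             # grows roughly as T log T
```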
Empirical Observations

Several empirical observations (not all): AdaBoost does not seem to overfit; furthermore:

[Figure: training error and test error as a function of the number of boosting rounds (10 to 1000), with C4.5 decision trees as base learners (Schapire et al., 1998).]
Rademacher Complexity of Convex Hulls

Theorem: Let $H$ be a set of functions mapping from $X$ to $\mathbb{R}$. Let the convex hull of $H$ be defined as
$$\mathrm{conv}(H) = \left\{ \sum_{k=1}^{p} \mu_k h_k : p \geq 1, \mu_k \geq 0, \sum_{k=1}^{p} \mu_k \leq 1, h_k \in H \right\}.$$
Then, for any sample $S$,
$$\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \widehat{\mathfrak{R}}_S(H).$$

Proof:
$$\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \frac{1}{m} \operatorname*{E}_{\sigma}\left[\sup_{h_k \in H,\, \mu \geq 0,\, \|\mu\|_1 \leq 1} \sum_{i=1}^{m} \sigma_i \sum_{k=1}^{p} \mu_k h_k(x_i)\right] = \frac{1}{m} \operatorname*{E}_{\sigma}\left[\sup_{h_k \in H} \sup_{\mu \geq 0,\, \|\mu\|_1 \leq 1} \sum_{k=1}^{p} \mu_k \sum_{i=1}^{m} \sigma_i h_k(x_i)\right] = \frac{1}{m} \operatorname*{E}_{\sigma}\left[\sup_{h_k \in H} \max_{k \in [1, p]} \sum_{i=1}^{m} \sigma_i h_k(x_i)\right] = \frac{1}{m} \operatorname*{E}_{\sigma}\left[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i h(x_i)\right] = \widehat{\mathfrak{R}}_S(H).$$
Margin Bound - Ensemble Methods
(Koltchinskii and Panchenko, 2002)

Corollary: Let $H$ be a set of real-valued functions. Fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in \mathrm{conv}(H)$:
$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \mathfrak{R}_m(H) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$$
$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \widehat{\mathfrak{R}}_S(H) + 3\sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$
Proof: Direct consequence of the margin bound of Lecture 4 and $\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \widehat{\mathfrak{R}}_S(H)$.
Margin Bound - Ensemble Methods
(Koltchinskii and Panchenko, 2002); see also (Schapire et al., 1998)

Corollary: Let $H$ be a family of functions taking values in $\{-1, +1\}$ with VC dimension $d$. Fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in \mathrm{conv}(H)$:
$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \sqrt{\frac{2d \log \frac{em}{d}}{m}} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$
Proof: Follows directly from the previous corollary and the VC dimension bound on the Rademacher complexity (see Lecture 3).
Notes
All of these bounds can be generalized to hold uniformly for all , at the cost of an additional term and other minor constant factor changes (Koltchinskii and Panchenko, 2002). For AdaBoost, the bound applies to the functions Note that does not appear in the bound.
x f(x) α1 = T
t=1 αtht(x)
α1 conv(H).
- log log2
2 ρ
m
ρ∈(0, 1)
T
Margin Distribution

Theorem: For any $\rho > 0$, the following holds:
$$\Pr\left[\frac{yf(x)}{\|\alpha\|_1} \leq \rho\right] \leq 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{1-\rho} (1 - \epsilon_t)^{1+\rho}}.$$
Proof: Using the identity $D_{T+1}(i) = \frac{e^{-y_i f(x_i)}}{m \prod_{t=1}^{T} Z_t}$,
$$\frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) - \rho \|\alpha\|_1 \leq 0} \leq \frac{1}{m} \sum_{i=1}^{m} \exp(-y_i f(x_i) + \rho \|\alpha\|_1) = \frac{1}{m} \sum_{i=1}^{m} e^{\rho \|\alpha\|_1} \left[m \prod_{t=1}^{T} Z_t\right] D_{T+1}(i) = e^{\rho \|\alpha\|_1} \prod_{t=1}^{T} Z_t = 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{1-\rho} (1 - \epsilon_t)^{1+\rho}}.$$
Notes

If for all $t \in [1, T]$, $\gamma \leq \left(\frac{1}{2} - \epsilon_t\right)$, then the upper bound can be bounded by
$$\Pr\left[\frac{yf(x)}{\|\alpha\|_1} \leq \rho\right] \leq \left[(1 - 2\gamma)^{1-\rho} (1 + 2\gamma)^{1+\rho}\right]^{T/2}.$$
For $\rho < \gamma$, $(1 - 2\gamma)^{1-\rho}(1 + 2\gamma)^{1+\rho} < 1$ and the bound decreases exponentially in $T$.

For the generalization bound to be convergent, $\rho \gg O(1/\sqrt{m})$ is needed; thus $\gamma \gg O(1/\sqrt{m})$ is roughly the condition on the edge value.
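A quick numerical check (hypothetical edge and margin values) that the base of the bound is below one once $\rho < \gamma$, so the bound decays exponentially with $T$:

```python
import numpy as np

gamma, rho = 0.10, 0.05                                # hypothetical edge and margin, rho < gamma
base = (1 - 2 * gamma) ** (1 - rho) * (1 + 2 * gamma) ** (1 + rho)
print(base)                                            # about 0.98 < 1
for T in (10, 100, 1000):
    print(T, base ** (T / 2))                          # exponential decrease in T
```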
Outliers

AdaBoost assigns larger weights to harder examples.

Applications:
- detecting mislabeled examples.
- dealing with noisy data: regularization based on the average weight assigned to a point (soft margin idea for boosting) (Meir and Rätsch, 2003).
L1-Geometric Margin

Definition: the $L_1$-margin $\rho_f(x)$ of a linear function $f = \sum_{t=1}^{T} \alpha_t h_t$ with $\alpha \neq 0$ at a point $x \in X$ is defined by
$$\rho_f(x) = \frac{|f(x)|}{\|\alpha\|_1} = \frac{\left|\sum_{t=1}^{T} \alpha_t h_t(x)\right|}{\|\alpha\|_1} = \frac{|\alpha \cdot h(x)|}{\|\alpha\|_1}.$$
- the $L_1$-margin of $f$ over a sample $S = (x_1, \ldots, x_m)$ is its minimum margin at points in that sample:
$$\rho_f = \min_{i \in [1, m]} \rho_f(x_i) = \min_{i \in [1, m]} \frac{|\alpha \cdot h(x_i)|}{\|\alpha\|_1}.$$
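In code (a sketch; `base_preds` is a hypothetical (m, T) matrix whose column t holds $h_t(x_i)$), the sample margin is just a minimum of normalized absolute scores:

```python
import numpy as np

def l1_margin(base_preds, alpha):
    """L1-geometric margin of f = sum_t alpha_t h_t over the sample."""
    scores = base_preds @ alpha                        # f(x_i) = alpha . h(x_i)
    margins = np.abs(scores) / np.abs(alpha).sum()     # rho_f(x_i)
    return margins.min()                               # rho_f
```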
SVM vs AdaBoost

- features / base hypotheses: SVM: $\Phi(x) = (\Phi_1(x), \ldots, \Phi_N(x))$; AdaBoost: $h(x) = (h_1(x), \ldots, h_N(x))$.
- predictor: SVM: $x \mapsto w \cdot \Phi(x)$; AdaBoost: $x \mapsto \alpha \cdot h(x)$.
- geom. margin: SVM: $\frac{w \cdot \Phi(x)}{\|w\|_2} = d_2(\Phi(x), \text{hyperpl.})$; AdaBoost: $\frac{\alpha \cdot h(x)}{\|\alpha\|_1} = d_\infty(h(x), \text{hyperpl.})$.
- conf. margin: SVM: $y(w \cdot \Phi(x))$; AdaBoost: $y(\alpha \cdot h(x))$.
- regularization: SVM: $\|w\|_2$; AdaBoost (L1-AB): $\|\alpha\|_1$.
Maximum-Margin Solutions

[Figure: maximum-margin separating hyperplanes for the norm $\|\cdot\|_2$ and the norm $\|\cdot\|_\infty$.]

But, Does AdaBoost Maximize the Margin?

No: AdaBoost may converge to a margin that is significantly below the maximum margin (Rudin et al., 2004) (e.g., 1/3 instead of 3/8)!

Lower bound: AdaBoost can asymptotically achieve a margin that is at least $\frac{\rho_{\max}}{2}$ if the data is separable and some conditions on the base learners hold (Rätsch and Warmuth, 2002).

Several boosting-type margin-maximization algorithms: but, performance in practice not clear or not reported.
AdaBoost's Weak Learning Condition

Definition: the edge of a base classifier $h_t$ for a distribution $D$ over the training sample is
$$\gamma(t) = \frac{1}{2} - \epsilon_t = \frac{1}{2} \sum_{i=1}^{m} y_i h_t(x_i) D(i).$$
Condition: there exists $\gamma > 0$ such that for any distribution $D$ over the training sample and any base classifier $h_t$, $\gamma(t) \geq \gamma$.
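The edge is straightforward to compute from weighted predictions (a sketch; `h_pred` is a hypothetical vector of $h_t(x_i)$ values in $\{-1,+1\}$):

```python
import numpy as np

def edge(y, h_pred, D):
    """gamma(t) = 1/2 - eps_t = (1/2) sum_i y_i h_t(x_i) D(i)."""
    return 0.5 * float(np.sum(y * h_pred * D))
    # equivalently: 0.5 - np.sum(D * (h_pred != y)), since D sums to 1
```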
Zero-Sum Games

Definition:
- payoff matrix $M = (M_{ij}) \in \mathbb{R}^{m \times n}$.
- $m$ possible actions (pure strategies) for the row player.
- $n$ possible actions for the column player.
- $M_{ij}$ = payoff for the row player (= loss for the column player) when row plays $i$, column plays $j$.

Example (row player's payoffs):

              rock   paper   scissors
  rock          0     -1        1
  paper         1      0       -1
  scissors     -1      1        0
Mixed Strategies
(von Neumann, 1928)

Definition: player row selects a distribution $p$ over the rows, player column a distribution $q$ over the columns. The expected payoff for row is
$$\operatorname*{E}_{i \sim p,\, j \sim q}[M_{ij}] = \sum_{i=1}^{m} \sum_{j=1}^{n} p_i M_{ij} q_j = p^\top M q.$$

von Neumann's minimax theorem:
$$\max_{p} \min_{q} p^\top M q = \min_{q} \max_{p} p^\top M q.$$
- equivalent form:
$$\max_{p} \min_{j \in [1, n]} p^\top M e_j = \min_{q} \max_{i \in [1, m]} e_i^\top M q.$$

[Photo: John von Neumann (1903-1957).]
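For instance, with the rock-paper-scissors payoff matrix above, uniform mixed strategies give expected payoff $p^\top M q = 0$, which is in fact the value of that game (a small sketch):

```python
import numpy as np

M = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])                           # rock-paper-scissors payoffs
p = q = np.full(3, 1.0 / 3.0)                          # uniform mixed strategies
print(p @ M @ q)                                       # expected payoff: 0.0
```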
AdaBoost and Game Theory

Game:
- Player A: selects point $x_i$, $i \in [1, m]$.
- Player B: selects base learner $h_t$, $t \in [1, T]$.
- Payoff matrix $M \in \{-1, +1\}^{m \times T}$: $M_{it} = y_i h_t(x_i)$.

von Neumann's theorem (assume a finite $H$):
$$2\gamma^* = \min_{D} \max_{h \in H} \sum_{i=1}^{m} D(i)\, y_i h(x_i) = \max_{\alpha} \min_{i \in [1, m]} \frac{y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)}{\|\alpha\|_1} = \rho^*.$$
Consequences

Weak learning condition = non-zero margin:
- thus, possible to search for a non-zero margin.
- AdaBoost = (suboptimal) search for the corresponding $\alpha$; achieves at least half of the maximum margin.

Weak learning $\Rightarrow$ strong condition:
- the condition $2\gamma^* > 0$ implies linear separability with margin $2\gamma^* > 0$.
Linear Programming Problem

Maximizing the margin:
$$\rho = \max_{\alpha} \min_{i \in [1, m]} \frac{y_i (\alpha \cdot x_i)}{\|\alpha\|_1}.$$
This is equivalent to the following convex optimization (LP) problem:
$$\max_{\alpha} \rho \quad \text{subject to: } y_i (\alpha \cdot x_i) \geq \rho,\ \|\alpha\|_1 = 1.$$
Note that:
$$\frac{|\alpha \cdot x|}{\|\alpha\|_1} = d_\infty(x, H), \text{ with } H = \{x \colon \alpha \cdot x = 0\}.$$
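A sketch of this LP with scipy.optimize.linprog (not from the source): the variables are $(\alpha, \rho)$; $\alpha \geq 0$ is assumed, which loses no generality when the hypothesis set is closed under negation, and the rows of the hypothetical `preds` matrix hold the base-hypothesis values (or raw features) for each training point.

```python
import numpy as np
from scipy.optimize import linprog

def max_margin_lp(preds, y):
    """Maximize rho s.t. y_i (alpha . preds_i) >= rho, sum_j alpha_j = 1, alpha >= 0."""
    m, N = preds.shape
    c = np.zeros(N + 1); c[-1] = -1.0                  # linprog minimizes, so minimize -rho
    A_ub = np.hstack([-y[:, None] * preds, np.ones((m, 1))])   # rho - y_i (alpha . preds_i) <= 0
    b_ub = np.zeros(m)
    A_eq = np.ones((1, N + 1)); A_eq[0, -1] = 0.0      # sum_j alpha_j = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * N + [(None, None)]          # alpha >= 0, rho free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:N], res.x[-1]                        # alpha, maximum margin rho
```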
Advantages of AdaBoost

Simple: straightforward implementation.

Efficient: complexity $O(mNT)$ for stumps:
- when $N$ and $T$ are not too large, the algorithm is quite fast.

Theoretical guarantees: but still many questions.
- AdaBoost not designed to maximize margin.
- regularized versions of AdaBoost.
Weaker Aspects

Parameters:
- need to determine $T$, the number of rounds of boosting: stopping criterion.
- need to determine base learners: risk of overfitting or low margins.

Noise: severely damages the accuracy of AdaBoost (Dietterich, 2000).
Other Boosting Algorithms

- arc-gv (Breiman, 1996): designed to maximize the margin, but outperformed by AdaBoost in experiments (Reyzin and Schapire, 2006).
- L1-regularized AdaBoost (Rätsch et al., 2001): outperforms AdaBoost in experiments (Cortes et al., 2014).
- DeepBoost (Cortes et al., 2014): more favorable learning guarantees, outperforms both AdaBoost and L1-regularized AdaBoost in experiments.
References

- Corinna Cortes, Mehryar Mohri, and Umar Syed. Deep boosting. In ICML, pages 262-270, 2014.
- Leo Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
- Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40(2):139-158, 2000.
- Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
- G. Lebanon and J. Lafferty. Boosting and maximum likelihood for exponential models. In NIPS, pages 447-454, 2001.
- Ron Meir and Gunnar Rätsch. An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning (LNAI 2600), 2003.
- J. von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100:295-320, 1928.
- Cynthia Rudin, Ingrid Daubechies, and Robert E. Schapire. The dynamics of AdaBoost: cyclic behavior and convergence of margins. Journal of Machine Learning Research, 5:1557-1595, 2004.
- Gunnar Rätsch and Manfred K. Warmuth. Maximizing the margin with boosting. In Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), Sydney, Australia, pages 334-350, July 2002.
- Lev Reyzin and Robert E. Schapire. How boosting the margin can also boost classifier complexity. In ICML, pages 753-760, 2006.
- Robert E. Schapire. The boosting approach to machine learning: an overview. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick, and B. Yu, editors, Nonlinear Estimation and Classification. Springer, 2003.
- Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. The MIT Press, 2012.
- Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, 1998.