Classification and statistical machine learning

Sylvain Arlot

http://www.di.ens.fr/~arlot/

CNRS – École Normale Supérieure (Paris), DI/ENS, Équipe Sierra

CEMRACS 2013, July 26th, 2013

Outline

1. Introduction
2. Goals
3. Overfitting
4. Examples
5. Key issues


Hand-written digit recognition (MNIST): digit image ⇒ which digit?

http://yann.lecun.com/exdb/mnist/


Object recognition

Example images of each category (American flag, butterfly, teddy bear) ⇒ for a new image: American flag? Butterfly? Teddy bear? ...

http://www.vision.caltech.edu/Image_Datasets/Caltech256/


Kinect: body part recognition [Shotton et al., 2011]

http://research.microsoft.com/en-us/projects/vrkinect/


Predict biochemical properties of molecules from structure

Mutagenic compounds vs. non-mutagenic compounds. A compound with unknown properties: is it likely to be mutagenic or not? [Mahé et al., 2005, Shervashidze et al., 2011]

Figure obtained from Koji Tsuda


Many other applications

Bioinformatics: sequencing data for diagnosis and prognosis (cancer, ...), personalized medicine, ...
Text classification: spam detection, Google ads, automatic document classification
Action recognition in videos, speech recognition, credit scoring, ...


Classification in ℝ: data

[Figure: 1-D data sample, class 0 vs. class 1]


Classification in ℝ: regression function



Classification in ℝ: Bayes classifier



Binary supervised classification

Data Dn: (X1, Y1), ..., (Xn, Yn) ∈ X × {0, 1}, i.i.d. ∼ P
Classifier: f : X → {0, 1} measurable
Cost/loss function ℓ(f(x), y) measures how well f(x) "predicts" y. For this talk: ℓ(y, y′) = 1{y ≠ y′} (0–1 loss)
Goal: learn f ∈ S = {measurable functions X → {0, 1}} such that the risk R(f) := E_{(X,Y)∼P}[ℓ(f(X), Y)] = P(f(X) ≠ Y) is minimal.
Remark: asymmetric cost ℓ_w(f(x), y) = w(y) 1{f(x) ≠ y} with w(0) ≠ w(1) > 0 (spam, medical diagnosis).


Bayes estimator and excess risk

Bayes classifier: f⋆ ∈ argmin_{f ∈ S} R(f)

Proposition. In binary classification with the 0–1 loss, f⋆(X) = 1{η(X) ≥ 1/2} (except maybe on {η(X) = 1/2}), where η(X) = P(Y = 1 | X) is the regression function. The Bayes risk is R(f⋆) = E[min{η(X), 1 − η(X)}] and the excess risk of any f ∈ S is

    R(f) − R(f⋆) = E[ |2η(X) − 1| 1{f(X) ≠ f⋆(X)} ] .

Remark: for the asymmetric cost ℓw, a similar result holds with 1/2 replaced by w(0)/(w(0) + w(1)).
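The proposition is easy to check numerically. Below is a minimal sketch (not from the slides; the regression function eta and all names are ours): we draw (X, Y) with a hand-picked η, form the Bayes classifier 1{η ≥ 1/2}, and compare Monte Carlo estimates of the risks with the two closed-form expressions above.

```python
import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    """Toy regression function eta(x) = P(Y = 1 | X = x) on X = [0, 1]."""
    return 0.8 * np.sin(np.pi * x) ** 2

def bayes_classifier(x):
    return (eta(x) >= 0.5).astype(int)

def some_classifier(x):
    # an arbitrary competitor: threshold the input at 0.5
    return (x >= 0.5).astype(int)

n = 200_000
X = rng.uniform(0.0, 1.0, size=n)                   # X uniform on [0, 1]
Y = (rng.uniform(size=n) < eta(X)).astype(int)      # Y | X ~ Bernoulli(eta(X))

risk_bayes = np.mean(bayes_classifier(X) != Y)      # estimates R(f*)
risk_f = np.mean(some_classifier(X) != Y)           # estimates R(f)

# closed-form expressions from the slide, estimated by Monte Carlo over X only
bayes_risk_formula = np.mean(np.minimum(eta(X), 1 - eta(X)))
excess_risk_formula = np.mean(np.abs(2 * eta(X) - 1)
                              * (some_classifier(X) != bayes_classifier(X)))

print(f"R(f*) by 0-1 loss        : {risk_bayes:.4f}")
print(f"E[min(eta, 1 - eta)]     : {bayes_risk_formula:.4f}")
print(f"R(f) - R(f*) empirically : {risk_f - risk_bayes:.4f}")
print(f"excess-risk formula      : {excess_risk_formula:.4f}")
```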


Bayes estimator and excess risk: proof

P(f(X) ≠ Y | X) = P(Y = 1 | X) 1{f(X) = 0} + P(Y = 0 | X) 1{f(X) = 1} = η(X) 1{f(X) = 0} + (1 − η(X)) 1{f(X) = 1} ≥ min{η(X), 1 − η(X)},

with equality if and only if η(X) = 1/2 or f(X) = 1{η(X) ≥ 1/2}. The first two results follow by integrating over X. Then, the excess risk is equal to

    E[ 1{f(X) ≠ Y} − 1{f⋆(X) ≠ Y} ]
  = E[ 1{f(X) ≠ f⋆(X)} ( 1{f(X) ≠ Y} − 1{f⋆(X) ≠ Y} ) ]
  = E[ 1{f(X) ≠ f⋆(X)} E( 1{f(X) ≠ Y} − 1{f⋆(X) ≠ Y} | X ) ]
  = E[ 1{f(X) ≠ f⋆(X)} ( max{η(X), 1 − η(X)} − min{η(X), 1 − η(X)} ) ]
  = E[ |2η(X) − 1| 1{f(X) ≠ f⋆(X)} ] .


Classification seen as a testing problem

[Figure: data points, class 0 vs. class 1]

f_i: density of P_i = L(X | Y = i) for i = 0, 1

Regression function: η(x) = P(Y = 1) f1(x) / [ P(Y = 0) f0(x) + P(Y = 1) f1(x) ]

Bayes predictor: f⋆(x) = 1{η(x) ≥ 1/2} = 1{ f1(x)/f0(x) ≥ P(Y = 0)/P(Y = 1) }

⇔ likelihood-ratio test 1{ f1(x)/f0(x) ≥ t } of H0: "X ∼ P0" against H1: "X ∼ P1".


Goals


Classification rule/algorithm

Classification rule: f̂ : ⋃_{n≥1} (X × {0, 1})^n → S
Input: a data set Dn (of any size n ≥ 1); output: a classifier f̂(Dn) : X → {0, 1}.

Example: k-nearest neighbours (k-NN): x ∈ X ↦ majority vote among the Yi such that Xi is one of the k nearest neighbours of x among X1, ..., Xn.
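A minimal NumPy implementation of the k-NN rule just described (a sketch with our own function names, not the code behind the figures):

```python
import numpy as np

def knn_classify(x_new, X, Y, k=3):
    """k-NN rule: for each row of x_new, majority vote among the Y_i of the k
    nearest X_i (Euclidean distance)."""
    x_new = np.atleast_2d(x_new)
    dists = np.linalg.norm(X[None, :, :] - x_new[:, None, :], axis=2)
    neighbours = np.argsort(dists, axis=1)[:, :k]              # indices of the k nearest points
    return (Y[neighbours].mean(axis=1) >= 0.5).astype(int)     # majority vote (ties -> class 1)

# toy usage: two Gaussian clouds in R^2
rng = np.random.default_rng(0)
Y_train = rng.integers(0, 2, size=200)
X_train = rng.normal(size=(200, 2)) + Y_train[:, None]         # class 1 shifted by (1, 1)
print(knn_classify(np.array([[0.0, 0.0], [1.5, 1.5]]), X_train, Y_train, k=3))
```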


Example: 3-nearest neighbours



Universal consistency

weak consistency: E[R(f̂(Dn))] → R(f⋆) as n → ∞
strong consistency: R(f̂(Dn)) → R(f⋆) almost surely as n → ∞
universal (weak) consistency: for all P, E[R(f̂(Dn))] → R(f⋆) as n → ∞
universal strong consistency: for all P, R(f̂(Dn)) → R(f⋆) almost surely as n → ∞

Stone's theorem [Stone, 1977]: if X = R^d with the Euclidean distance, kn-NN is (weakly) universally consistent if kn → +∞ and kn/n → 0 as n → +∞.
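A small simulation (our own construction, not from the slides) illustrating Stone's theorem: with kn = ⌈√n⌉, so that kn → ∞ and kn/n → 0, the test error of kn-NN approaches the Bayes risk as n grows, for a toy distribution whose η is known.

```python
import numpy as np

rng = np.random.default_rng(1)

def eta(x):                               # known regression function on R
    return 1.0 / (1.0 + np.exp(-3.0 * x))

def sample(n):
    X = rng.normal(size=n)
    Y = (rng.uniform(size=n) < eta(X)).astype(int)
    return X, Y

def knn_predict(x_test, X, Y, k):
    # indices of the k nearest neighbours of each test point (1-D distance)
    idx = np.argpartition(np.abs(X[None, :] - x_test[:, None]), k - 1, axis=1)[:, :k]
    return (Y[idx].mean(axis=1) >= 0.5).astype(int)

X_test, Y_test = sample(2_000)
bayes_risk = np.mean((eta(X_test) >= 0.5).astype(int) != Y_test)

for n in [100, 1_000, 5_000]:
    X, Y = sample(n)
    k = int(np.ceil(np.sqrt(n)))          # k_n -> infinity and k_n / n -> 0
    risk = np.mean(knn_predict(X_test, X, Y, k) != Y_test)
    print(f"n={n:5d}  k_n={k:3d}  test error={risk:.3f}  (Bayes risk ~ {bayes_risk:.3f})")
```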


Uniform universal consistency?

universal weak consistency: sup_{P ∈ M1(X×{0,1})} lim_{n→+∞} { E[R(f̂(Dn))] − R(f⋆) } = 0

uniform universal weak consistency: lim_{n→+∞} sup_{P ∈ M1(X×{0,1})} { E[R(f̂(Dn))] − R(f⋆) } = 0

that is, a common learning rate for all P? Yes if X is finite. No otherwise (see Chapter 7 of [Devroye et al., 1996]).


Classification on X finite

Theorem. If X is finite and f̂_maj is the majority vote rule (for each x ∈ X, majority vote among {Yi : Xi = x}), then

    sup_P { E[R(f̂_maj(Dn))] − R(f⋆) } ≤ √( Card(X) log(2) / (2n) ) .

Proof: standard risk bounds (see next section) + the maximal inequality

    E[ sup_{t∈T} (1/n) Σ_{i=1}^n ξ_{i,t} ] ≤ √( log(Card(T)) / (2n) )

if for all t, the (ξ_{i,t})_i are independent, centered and in [0, 1]; here T = {0, 1}^X, so log(Card(T)) = Card(X) log(2).

See e.g. http://www.di.ens.fr/~arlot/2013orsay.htm

Constants matter: Card(X) can be larger than n ⇒ beware of asymptotic results and O(·) that can hide such constants in first or second order terms.
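A quick simulation (our own sketch) of the majority-vote rule on a finite X, comparing its average excess risk with the √(Card(X) log(2)/(2n)) bound of the theorem; the distribution used is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
card_X = 50
eta = rng.uniform(0.0, 1.0, size=card_X)          # eta(x) = P(Y=1 | X=x), X uniform on {0, ..., card_X-1}
bayes_risk = np.mean(np.minimum(eta, 1 - eta))

def majority_vote_risk(n):
    X = rng.integers(0, card_X, size=n)
    Y = (rng.uniform(size=n) < eta[X]).astype(int)
    ones = np.bincount(X, weights=Y, minlength=card_X)    # votes for class 1 at each x
    total = np.bincount(X, minlength=card_X)              # number of observations at each x
    f_maj = (ones > total / 2).astype(int)                # majority vote (class 0 if tie or no data)
    # exact risk of f_maj under the known distribution
    return np.mean(np.where(f_maj == 1, 1 - eta, eta))

for n in [50, 500, 5000]:
    excess = np.mean([majority_vote_risk(n) for _ in range(200)]) - bayes_risk
    bound = np.sqrt(card_X * np.log(2) / (2 * n))
    print(f"n={n:5d}  average excess risk ~ {excess:.4f}   bound = {bound:.4f}")
```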


No Free Lunch Theorem

Theorem. If X is infinite, for any classification rule f̂ and any n ≥ 1,

    sup_{P ∈ M1(X×{0,1})} { E[R(f̂(Dn))] − R(f⋆) } ≥ 1/2 .

[Figure: Y = η(X), data points, estimator, unobserved points]
Remark: for any sequence (an) decreasing to zero and any f̂, some P exists such that E[R(f̂(Dn))] − R(f⋆) ≥ an. See Chapter 7 of [Devroye et al., 1996].

⇒ impossible to have C(P)/log(log n) as a universal risk bound!


No Free Lunch Theorem: proof

Assume N ⊂ X and let K ≥ 1. For any r ∈ {0, 1}^K, define P_r by: X uniform on {1, ..., K} and P(Y = r_i | X = i) = 1 for all i = 1, ..., K. Under P_r, f⋆(x) = r_x and R(f⋆) = 0. So, with r drawn uniformly at random on {0, 1}^K,

    sup_P { E_P[R_P(f̂(Dn))] − R_P(f⋆) }
      ≥ sup_{P_r} P_{P_r}( f̂(X; Dn) ≠ r_X )
      ≥ E_r [ P_{P_r}( f̂(X; Dn) ≠ r_X ) ]
      ≥ E[ 1{X ∉ {X1, ..., Xn}} E( 1{ f̂(X; (Xi, r_{Xi})_{i=1..n}) ≠ r_X } | X, (Xi, r_{Xi})_{i=1..n} ) ]
      = (1/2) P( X ∉ {X1, ..., Xn} ) = (1/2) (1 − 1/K)^n ,

and letting K → +∞ (n fixed) gives the bound 1/2.


Learning rates

How can we get a bound such as R(f̂(Dn)) − R(f⋆) ≤ C(P) n^{−1/2} ?

No Free Lunch Theorems ⇒ must make assumptions on P.

Minimax rate: given a set P ⊂ M1(X × {0, 1}),

    inf_{f̂} sup_{P∈P} E[ R(f̂(Dn)) − R(f⋆) ]

Examples:
  √(V/n) when f⋆ ∈ S known and dimVC(S) = V [Devroye et al., 1996]
  V/(nh) when in addition P(|η(X) − 1/2| ≤ h) = 0 (margin assumption) [Massart and Nédélec, 2006]


Overfitting


Overfitting with k-nearest-neighbours: k = 1



Choosing k ∈ {1, 3, 20, 200} for k-NN (n = 200)

[Figure: k-NN decision rules for k = 1, 3, 20, 200 on the same sample]
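The same comparison can be reproduced in a few lines on simulated data (this is not the dataset of the figure; the distribution and all names are ours). With n = 200, k = 1 overfits (zero training error, high test error) while k = 200 underfits.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)

def make_data(n):
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    eta = 1.0 / (1.0 + np.exp(-4.0 * (X[:, 0] ** 2 + X[:, 1] - 0.3)))   # P(Y = 1 | X)
    Y = (rng.uniform(size=n) < eta).astype(int)
    return X, Y

X_train, Y_train = make_data(200)        # n = 200 as in the slide
X_test, Y_test = make_data(10_000)

for k in [1, 3, 20, 200]:
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, Y_train)
    print(f"k={k:3d}  train error={1 - clf.score(X_train, Y_train):.3f}"
          f"  test error={1 - clf.score(X_test, Y_test):.3f}")
```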


Empirical risk minimization

Empirical risk: R̂n(f) := (1/n) Σ_{i=1}^n ℓ(f(Xi), Yi)

Empirical risk minimizer over a model S ⊂ S: f̂_S ∈ argmin_{f∈S} R̂n(f)

Examples:
  partitioning rule: S = { Σ_{k≥1} αk 1_{Ak} : αk ∈ {0, 1} } for some partition (Ak)_{k≥1} of X
  linear discrimination (X = R^d): S = { x ↦ 1{β⊤x + β0 ≥ 0} : β ∈ R^d, β0 ∈ R }
  ...
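ERM with the 0–1 loss is computationally hard for rich models (see the last section), but for a one-dimensional model of threshold classifiers S = {x ↦ 1{x ≥ t}} it can be done exactly by scanning the data points. A minimal sketch (names and data are ours):

```python
import numpy as np

rng = np.random.default_rng(4)

# toy 1-D data: class 1 tends to have larger X
n = 200
Y = rng.integers(0, 2, size=n)
X = rng.normal(loc=Y.astype(float), scale=1.0)

def empirical_risk(f_values, Y):
    """R_n(f) = (1/n) * sum of 0-1 losses."""
    return np.mean(f_values != Y)

# model S = { x -> 1{x >= t} }: it suffices to test thresholds at the data points
thresholds = np.sort(X)
risks = [empirical_risk((X >= t).astype(int), Y) for t in thresholds]
t_hat = thresholds[int(np.argmin(risks))]          # empirical risk minimizer over S

print(f"ERM threshold t = {t_hat:.3f}, empirical risk = {min(risks):.3f}")
```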


Example: linear discrimination

Fig. 4.3 of [Devroye et al., 1996]


Bias-variance trade-off

E[R(f̂_S)] − R(f⋆) = Bias + Variance

Bias or approximation error: R(f⋆_S) − R(f⋆) = inf_{f∈S} R(f) − R(f⋆)

Variance or estimation error: e.g., OLS in regression: σ² dim(S)/n ; k-NN in regression: σ²/k

Bias–variance trade-off ⇔ avoid overfitting and underfitting


Examples: plug-in rules; empirical risk minimization and model selection; convexification and support vector machines; decision trees and forests


Plug in classifiers

Idea: f⋆(x) = 1{η(x) ≥ 1/2} ⇒ if η̂(Dn) estimates η (a regression problem), take

    f̂(x; Dn) = 1{ η̂(x; Dn) ≥ 1/2 }

Examples: partitioning, k-NN, local average classifiers [Devroye et al., 1996], [Audibert and Tsybakov, 2007]...
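A minimal plug-in classifier (our own sketch): η is estimated here by scikit-learn's logistic regression, standing in for any regression estimator of η, and then thresholded at 1/2.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)

n = 1_000
X = rng.normal(size=(n, 2))
eta = 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] - X[:, 1])))   # true regression function
Y = (rng.uniform(size=n) < eta).astype(int)

# step 1: estimate eta (a regression problem)
eta_hat = LogisticRegression().fit(X, Y)

# step 2: plug-in classification rule f_hat(x) = 1{ eta_hat(x) >= 1/2 }
def plug_in_classify(x):
    return (eta_hat.predict_proba(x)[:, 1] >= 0.5).astype(int)

X_test = rng.normal(size=(5, 2))
print(plug_in_classify(X_test))
```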


Risk bound for plug in

Proposition (Theorem 2.2 in [Devroye et al., 1996]). For a plug-in classifier f̂,

    R(f̂(Dn)) − R(f⋆) ≤ 2 E[ |η(X) − η̂(X; Dn)| | Dn ] ≤ 2 √( E[ (η(X) − η̂(X; Dn))² | Dn ] )

(First step for proving Stone's theorem [Stone, 1977].)

Proof: R(f̂(Dn)) − R(f⋆) = E[ |2η(X) − 1| 1{f̂(X; Dn) ≠ f⋆(X)} | Dn ], and f̂(X; Dn) ≠ f⋆(X) implies |2η(X) − 1| ≤ 2 |η(X) − η̂(X; Dn)|.


Empirical risk minimization (ERM)

ERM over S: f̂_S ∈ argmin_{f∈S} R̂n(f)

E[R(f̂_S)] − R(f⋆) = Approximation error + Estimation error

Approximation error R(f⋆_S) − R(f⋆): bounded thanks to approximation theory, or assumed equal to zero.

Estimation error: E[ R(f̂_S) − R(f⋆_S) ] ≤ E[ sup_{f∈S} { R(f) − R̂n(f) } ]

Proof:
    R(f̂_S) − R(f⋆_S) = R(f̂_S) − R̂n(f̂_S) + R̂n(f̂_S) − R̂n(f⋆_S) + R̂n(f⋆_S) − R(f⋆_S)
                      ≤ sup_{f∈S} { R(f) − R̂n(f) } + R̂n(f⋆_S) − R(f⋆_S)
since R̂n(f̂_S) ≤ R̂n(f⋆_S); taking expectations (E[R̂n(f⋆_S)] = R(f⋆_S)) gives the bound.


Bounds on the estimation error (1): global approach

E[ R(f̂_S) − R(f⋆_S) ]
  ≤ E[ sup_{f∈S} { R(f) − R̂n(f) } ]                      (global complexity of S)
  ≤ 2 E[ sup_{f∈S} (1/n) Σ_{i=1}^n εi ℓ(f(Xi), Yi) ]       (symmetrization)
  ≤ (2√2/√n) E[ √( H(S; X1, ..., Xn) ) ]                   (combinatorial entropy)
  ≤ 2 √( 2 V(S) log( en / V(S) ) / n )                     (VC dimension)

References: Section 3 of [Boucheron et al., 2005], Chapters 12–13 of [Devroye et al., 1996].
See also lectures 1–2 of http://www.di.ens.fr/~arlot/2013orsay.htm


Bounds on the estimation error (2): localization

sup_{f∈S} { Var(R(f) − R̂n(f)) } ≥ C n^{−1/2} ⇒ no faster rate

Margin condition: P(|η(X) − 1/2| ≤ h) = 0 with h > 0 [Mammen and Tsybakov, 1999]

Localization idea: use that f̂_S is not anywhere in S:
    f̂_S ∈ { f ∈ S : R(f) − R(f⋆) ≤ ε } ⊂ { f ∈ S : Var(ℓ(f(X), Y) − ℓ(f⋆(X), Y)) ≤ ε/h }
by the margin condition.

+ Talagrand concentration inequality [Talagrand, 1996, Bousquet, 2002] + ... ⇒ fast rates (depending on the assumptions), e.g.,

    κ (V(S)/(nh)) (1 + log( nh² / V(S) ))

[Boucheron et al., 2005, Sec. 5], [Massart and Nédélec, 2006].


Model selection

Family of models (Sm)_{m∈M} ⇒ family of classifiers (f̂m(Dn))_{m∈M} ⇒ choose m̂ = m̂(Dn) such that R(f̂_{m̂(Dn)}) is minimal?

Goal: minimize the risk, i.e., oracle inequality (in expectation or with a large probability):

    R(f̂_{m̂}) − R(f⋆) ≤ C inf_{m∈M} { R(f̂m) − R(f⋆) } + Rn

Interpretation of m̂: the best model can be wrong / the true model can be worse than smaller ones.


Penalization for model selection

Penalization: m̂ ∈ argmin_{m∈M} { R̂n(f̂m) + pen(m) }

Ideal penalty: penid(m) = R(f̂m) − R̂n(f̂m), for which m̂ ∈ argmin_{m∈M} { R(f̂m) }

General idea: choose pen such that pen(m) ≈ penid(m), or at least pen(m) ≥ penid(m) for all m ∈ M.

Lemma (see next slide): if pen(m) ≥ penid(m) for all m ∈ M,

    R(f̂_{m̂}) − R(f⋆) ≤ inf_{m∈M} { R(f̂m) − R(f⋆) + pen(m) − penid(m) } .


Penalization for model selection: lemma

Lemma. If for all m ∈ M, −B(m) ≤ pen(m) − penid(m) ≤ A(m), then

    R(f̂_{m̂}) − R(f⋆) − B(m̂) ≤ inf_{m∈M} { R(f̂m) − R(f⋆) + A(m) } .

Proof: for all m ∈ M, by definition of m̂,

    R̂n(f̂_{m̂}) + pen(m̂) ≤ R̂n(f̂m) + pen(m) .

So,
    R̂n(f̂_{m̂}) + pen(m̂) = R(f̂_{m̂}) − penid(m̂) + pen(m̂) ≥ R(f̂_{m̂}) − B(m̂)
and
    R̂n(f̂m) + pen(m) = R(f̂m) − penid(m) + pen(m) ≤ R(f̂m) + A(m) .


Penalization for model selection

Structural risk minimization (Vapnik): penid(m) ≤ sup_{f∈Sm} { R(f) − R̂n(f) } ⇒ can use the previous bounds [Koltchinskii, 2001, Bartlett et al., 2002, Fromont, 2007], but remainder terms ≥ C n^{−1/2} ⇒ no fast rates.

Tighter estimates of penid(m) for fast rates: localization [Koltchinskii, 2006], resampling [Arlot, 2009].

See also Section 8 of [Boucheron et al., 2005].


Convexification of the classification problem

Convention: Yi ∈ {−1, 1}, so that 1{y ≠ y′} = 1{yy′ < 0} = Φ0−1(yy′)

min_f (1/n) Σ_{i=1}^n Φ0−1(Yi f(Xi)) is computationally heavy in general.

Classifier f : X → {−1, 1} ⇒ prediction function f : X → R such that sign(f(x)) will be used to classify x.

Risk R0−1(f) = E[Φ0−1(Y f(X))] ⇒ Φ-risk RΦ(f) = E[Φ(Y f(X))] for some Φ : R → R+

⇒ min_{f∈S} (1/n) Σ_{i=1}^n Φ(Yi f(Xi)) with S and Φ convex.


Examples of functions Φ

Figure from [Bartlett et al., 2006].

exponential: Φ(u) = e^{−u} ⇒ AdaBoost
hinge: Φ(u) = max{1 − u, 0} ⇒ support vector machines
logistic/logit: Φ(u) = log(1 + exp(−u)) ⇒ logistic regression
truncated quadratic: Φ(u) = (max{1 − u, 0})²

References: [Bartlett et al., 2006] and Section 4 of [Boucheron et al., 2005].
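Written out as code, the losses above read as follows (a plain illustration with our own function names; the argument is the margin u = y f(x)):

```python
import numpy as np

def phi_01(u):           # 0-1 loss: 1{u < 0} (not convex)
    return (u < 0).astype(float)

def phi_exponential(u):  # AdaBoost
    return np.exp(-u)

def phi_hinge(u):        # support vector machines
    return np.maximum(1.0 - u, 0.0)

def phi_logistic(u):     # logistic regression
    return np.log1p(np.exp(-u))

def phi_trunc_quad(u):   # truncated quadratic
    return np.maximum(1.0 - u, 0.0) ** 2

u = np.linspace(-2, 2, 5)
for name, phi in [("0-1", phi_01), ("exponential", phi_exponential), ("hinge", phi_hinge),
                  ("logistic", phi_logistic), ("trunc. quad.", phi_trunc_quad)]:
    print(f"{name:12s}", np.round(phi(u), 3))
```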


Links between 0–1 and convex risks

Definition. Φ is classification-calibrated if, for any x with η(x) ≠ 1/2, sign(f⋆_Φ(x)) = f⋆(x) for any f⋆_Φ ∈ argmin_f RΦ(f).

Theorem ([Bartlett et al., 2006]). Φ convex is classification-calibrated ⇔ Φ is differentiable at 0 and Φ′(0) < 0. Then, a function ψ exists such that

    ψ( R0−1(f) − R0−1(f⋆_{0−1}) ) ≤ RΦ(f) − RΦ(f⋆_Φ) .

Examples: exponential loss: ψ(θ) = 1 − √(1 − θ²); hinge loss: ψ(θ) = |θ|; truncated quadratic: ψ(θ) = θ².


Support Vector Machines: linear classifier

X = R^d, linear classifier: sign(β⊤x + β0) with β ∈ R^d, β0 ∈ R

    argmin_{β, β0 : ‖β‖ ≤ R} { (1/n) Σ_{i=1}^n Φhinge( Yi (β⊤Xi + β0) ) }

  = argmin_{β, β0} { (1/n) Σ_{i=1}^n Φhinge( Yi (β⊤Xi + β0) ) + λ ‖β‖² }

up to some (random) reparametrization (λ = λ(R; Dn)).

⇒ quadratic program with 2n linear constraints.
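A hedged sketch of the penalized formulation above, solved by plain subgradient descent on (1/n) Σ Φhinge(Yi(β⊤Xi + β0)) + λ‖β‖² (for illustration only; in practice one solves the quadratic program or calls a dedicated SVM solver). Labels are taken in {−1, +1}; the data and all names are ours.

```python
import numpy as np

def linear_svm_subgradient(X, Y, lam=0.1, n_steps=2000, lr=0.05):
    """Subgradient descent on (1/n) * sum hinge(Y_i (beta.X_i + beta0)) + lam * ||beta||^2."""
    n, d = X.shape
    beta, beta0 = np.zeros(d), 0.0
    for _ in range(n_steps):
        margins = Y * (X @ beta + beta0)
        active = margins < 1.0                       # points where the hinge loss is > 0
        g_beta = -(Y[active, None] * X[active]).sum(axis=0) / n + 2.0 * lam * beta
        g_beta0 = -Y[active].sum() / n
        beta -= lr * g_beta
        beta0 -= lr * g_beta0
    return beta, beta0

rng = np.random.default_rng(6)
n = 300
Y = rng.choice([-1, 1], size=n)
X = rng.normal(size=(n, 2)) + 1.2 * Y[:, None]       # two shifted Gaussian clouds

beta, beta0 = linear_svm_subgradient(X, Y)
pred = np.sign(X @ beta + beta0)
print("training 0-1 error:", np.mean(pred != Y))
```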


Support Vector Machines: linear classifier

Figure from http://cbio.ensmp.fr/~jvert/svn/kernelcourse/slides/master/master.pdf


Support Vector Machines: kernel trick

Positive definite kernel k : X × X → R, i.e., (k(Xi, Xj))_{i,j} is symmetric positive definite.

Reproducing Kernel Hilbert Space (RKHS) F: space of functions X → R spanned by the Φ(x) = k(x, ·), x ∈ X.

Theorem (Representer theorem). For any cost function ℓ,

    min_{f∈F} { (1/n) Σ_{i=1}^n ℓ(Yi, f(Xi)) + λ ‖f‖²_F }

is attained at some f of the form Σ_{i=1}^n αi k(Xi, ·)

⇒ any algorithm for X = R^d relying only on the dot products (⟨Xi, Xj⟩)_{i,j} can be kernelized.
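A sketch of a kernelized SVM with the Gaussian kernel via scikit-learn's SVC, which solves the dual of the hinge-loss problem (the parameter values gamma and C and the toy data are our choices):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(7)

# a problem no linear classifier can solve well: the class depends on the radius
n = 500
X = rng.normal(size=(n, 2))
Y = (np.linalg.norm(X, axis=1) > 1.2).astype(int)

clf = SVC(kernel="rbf", gamma=1.0, C=1.0)    # Gaussian kernel k(x, y) = exp(-gamma ||x - y||^2)
clf.fit(X, Y)

X_test = rng.normal(size=(2000, 2))
Y_test = (np.linalg.norm(X_test, axis=1) > 1.2).astype(int)
print("test error of the Gaussian-kernel SVM:", np.mean(clf.predict(X_test) != Y_test))
print("number of support vectors:", clf.support_vectors_.shape[0])
```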


Kernel examples

linear kernel: X = R^d, k(x, y) = ⟨x, y⟩ ⇒ F = R^d (Euclidean)
polynomial kernel: X = R^d, k(x, y) = (⟨x, y⟩ + 1)^r ⇒ F = R_r[X1, ..., Xd]
Gaussian kernel: X = R^d, k(x, y) = exp(−‖x − y‖²/(2σ²))
Laplace kernel: X = R, k(x, y) = e^{−|x−y|/2} ⇒ F = H¹ (Sobolev space), ‖f‖²_F = ‖f‖²_{L²} + ‖f′‖²_{L²}
min kernel: X = [0, 1], k(x, y) = min{x, y} ⇒ F = {f ∈ C⁰([0, 1]), f′ ∈ L², f(0) = 0}, ‖f‖_F = ‖f′‖_{L²}

⇒ intersection kernel: X = { p ∈ [0, 1]^d : p1 + · · · + pd = 1 }, k(p, q) = Σ_{i=1}^d min(pi, qi), useful in computer vision [Hein and Bousquet, 2004, Maji et al., 2008].

Other kernels on non-vectorial data (graphs, words / DNA sequences, ...): see for instance [Schölkopf et al., 2004, Mahé et al., 2005, Shervashidze et al., 2011] and http://cbio.ensmp.fr/~jvert/svn/kernelcourse/slides/master/master.pdf


Support Vector Machines: results / references

Main mathematical tools for SVM analysis: probability in Hilbert spaces (RKHS), functional analysis.

Some references:
  Risk bounds: e.g., [Blanchard et al., 2008] (SVM as a penalization procedure for selecting among balls); see also [Boucheron et al., 2005, Section 4]
  Tutorials and lecture notes: [Burges, 1998], http://cbio.ensmp.fr/~jvert/svn/kernelcourse/slides/master/master.pdf
  Books: e.g., [Steinwart and Christmann, 2008, Hastie et al., 2009, Schölkopf and Smola, 2001]


Decision / classification tree

piecewise constant predictor
partition obtained by recursive splitting of X ⊂ R^p, orthogonally to one axis (X^j < t vs. X^j ≥ t)
empirical risk minimization

Figures from [Hastie et al., 2009]


CART (Classification And Regression Trees)

CART [Breiman et al., 1984]:
1. generate one large tree by splitting the data recursively (minimization of some impurity measure) ⇒ over-adapted to the data
2. pruning (⇔ model selection)

Model selection results: e.g., [Gey and Nédélec, 2005, Sauvé and Tuleau-Malot, 2011, Gey and Mary-Huard, 2011].
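A sketch of the two CART steps with scikit-learn's DecisionTreeClassifier: grow one large tree, then select a pruned subtree by cost-complexity pruning, here with a simple validation split standing in for the model-selection step (the data and parameter values are ours):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
n = 1_000
X = rng.uniform(-1, 1, size=(n, 2))
Y = ((X[:, 0] * X[:, 1] > 0) ^ (rng.uniform(size=n) < 0.1)).astype(int)   # noisy XOR-like classes

X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.5, random_state=0)

# step 1: grow one large tree (over-adapted to the data)
big_tree = DecisionTreeClassifier(random_state=0).fit(X_train, Y_train)

# step 2: pruning = model selection among the subtrees of the large tree
alphas = big_tree.cost_complexity_pruning_path(X_train, Y_train).ccp_alphas
scores = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, Y_train).score(X_val, Y_val)
          for a in alphas]
best = DecisionTreeClassifier(random_state=0, ccp_alpha=alphas[int(np.argmax(scores))])
best.fit(X_train, Y_train)

print("leaves before/after pruning:", big_tree.get_n_leaves(), "/", best.get_n_leaves())
print("validation error before/after:",
      round(1 - big_tree.score(X_val, Y_val), 3), "/", round(1 - best.score(X_val, Y_val), 3))
```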


Random forests [Breiman, 2001]

Dn → Bootstrap → Dn⋆1, Dn⋆2, ..., Dn⋆K → Tree building → T1, T2, ..., TK → Voting → final classifier

Various ways to build the individual trees (subset of variables, ...)
Purely random forests: partitions independent from the training data.
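A minimal version of the diagram above — bootstrap resamples, one tree per resample, majority vote — built on scikit-learn decision trees (a sketch; in practice RandomForestClassifier also randomizes the variables considered at each split):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(9)
n = 500
X = rng.uniform(-1, 1, size=(n, 2))
Y = ((X[:, 0] * X[:, 1] > 0) ^ (rng.uniform(size=n) < 0.1)).astype(int)

def forest_fit(X, Y, n_trees=50):
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))                     # bootstrap resample D*_n
        trees.append(DecisionTreeClassifier().fit(X[idx], Y[idx]))     # tree building
    return trees

def forest_predict(trees, X):
    votes = np.mean([t.predict(X) for t in trees], axis=0)             # voting
    return (votes >= 0.5).astype(int)

trees = forest_fit(X, Y)
X_test = rng.uniform(-1, 1, size=(2000, 2))
Y_test = (X_test[:, 0] * X_test[:, 1] > 0).astype(int)
print("forest test error     :", np.mean(forest_predict(trees, X_test) != Y_test))
print("single tree test error:", np.mean(trees[0].predict(X_test) != Y_test))
```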


Results on random forests (classification and regression)

Most theoretical results are on purely random forests (partitions independent from the training data: by data splitting or with simpler models)
Consistency result in classification [Biau et al., 2008]
Convergence rate and some combination with variable selection [Biau, 2012]
From a single tree to a large forest:
  estimation error reduction (at least a constant factor) [Genuer, 2012]
  approximation error reduction (A. & Genuer, work in progress)
  ⇒ sometimes an improvement in the learning rate

See also [Breiman, 2004, Genuer et al., 2008, Genuer et al., 2010].


Kinect: depth features ⇒ body part

Depth image ⇒ depth comparison features at each pixel ⇒ body part at each pixel ⇒ body part positions ⇒ · · ·

Figure from [Shotton et al., 2011]


Key issues


Hyperparameter choice

Always one or several parameters to choose: k for k-NN, the model in model selection, λ for SVM, the kernel bandwidth for SVM with a Gaussian kernel, the tree size in random forests, ...
No universal choice is possible (the No Free Lunch Theorems apply) ⇒ must use some prior knowledge at some point.
Most general idea: data splitting (cross-validation) [Arlot and Celisse, 2010].
Sometimes specific approaches (penalization, ...): more efficient (for risk and computational cost) but also dependent on stronger assumptions.
It is important to choose a good parametrization (e.g., for cross-validation, the optimal parameter should not vary too much from one sample to another).
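A sketch of the data-splitting idea for one hyperparameter: choosing k for k-NN by V-fold cross-validation with scikit-learn (the grid of candidate k and the toy data are ours):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(10)
n = 400
X = rng.normal(size=(n, 2))
Y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)

candidate_k = [1, 3, 5, 10, 20, 50, 100]
cv_errors = []
for k in candidate_k:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, Y, cv=5)   # 5-fold CV accuracy
    cv_errors.append(1 - scores.mean())

best_k = candidate_k[int(np.argmin(cv_errors))]
print("CV errors :", np.round(cv_errors, 3))
print("selected k:", best_k)
```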


Computational complexity

Most classifiers are defined as f̂ ∈ argmin_{f∈S} C(f).

Optimization algorithms: usually faster (polynomial) when C and S are convex. Often NP-hard with the 0–1 loss. Counterexample: interval classification [Kearns et al., 1997].

General convex optimization algorithms are usually too slow if n or p = dim(X) are > 10³. ⇒ Need for specific, faster algorithms (e.g., for SVM, consider the dual problem and take advantage of the "sparsity" of the solution). Constants matter! (e.g., dependence on p).

Choice of a classification learning algorithm: trade-off between statistical performance and computational cost. It also depends on the confidence in the chosen modelling.


Optimization error

Risk = Approximation error + Estimation error + Optimization error

Figure from [Bottou and Bousquet, 2011]


The big data setting

Given ε > 0, what do we need to get R(f̂) − R(f⋆) ≤ ε?

Traditional statistical learning: sample complexity, i.e., n ≥ n0(ε), whatever the computational cost.

Big data: n so large that exploring all the data is impossible (and unnecessary) ⇒ better to throw away some data! [Bottou and Bousquet, 2008, Shalev-Shwartz and Srebro, 2008] ⇒ time complexity, i.e., the minimal number of computations, whatever n.

A very active field: Big Data Research and Development Initiative (US government), MASTODONS (CNRS), AMPLab (UC Berkeley), ...


Computational trade-offs, from statistics to big data

Figure from [Chandrasekaran and Jordan, 2012]


Conclusion

Learning theory: assumptions ⇒ learning rates (NFLT)
Main danger: overfitting
Various ways to model the data:
  k-NN: f⋆ locally constant w.r.t. d
  ERM/model selection: family of possible f⋆
  SVM: kernel ⇒ smoothness of f⋆ / feature space
  random forests: weak modelling (trees) + aggregation
  many other approaches: Bayesian statistics, neural networks, deep learning, ...
Key issues: tuning parameters & computational complexity
Big data ⇒ new challenges
Main mathematical domains involved (outside statistics): probability theory (concentration of measure, ...), approximation theory, functional analysis, optimization, ...


More references

These slides: http://www.di.ens.fr/~arlot/

Arlot, S. (2009). Model selection by resampling penalization. Electron. J. Stat., 3:557–624.
Arlot, S. and Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statist. Surv., 4:40–79.
Audibert, J.-Y. and Tsybakov, A. (2007). Fast learning rates for plug-in classifiers. Ann. Statist., 35(2):608–633.
Bartlett, P. L., Boucheron, S., and Lugosi, G. (2002). Model selection and error estimation. Machine Learning, 48:85–113.
Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156.
Biau, G. (2012). Analysis of a random forests model. J. Mach. Learn. Res., 13:1063–1095.
Biau, G., Devroye, L., and Lugosi, G. (2008). Consistency of random forests and other averaging classifiers. J. Mach. Learn. Res., 9:2015–2033.
Blanchard, G., Bousquet, O., and Massart, P. (2008). Statistical performance of support vector machines. Ann. Statist., 36(2):489–531.
Bottou, L. and Bousquet, O. (2008). The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20, pages 161–168.
Bottou, L. and Bousquet, O. (2011). The tradeoffs of large scale learning. In Optimization for Machine Learning, pages 351–368. MIT Press.
Boucheron, S., Bousquet, O., and Lugosi, G. (2005). Theory of classification: a survey of some recent advances. ESAIM Probab. Stat., 9:323–375.
Bousquet, O. (2002). A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Math. Acad. Sci. Paris, 334(6):495–500.
Breiman, L. (2001). Random forests. Machine Learning, 45:5–32.
Breiman, L. (2004). Consistency for a simple model of random forests. Technical Report 670, U.C. Berkeley Department of Statistics. http://www.stat.berkeley.edu/tech-reports/670.pdf
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.
Burges, C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167.
Chandrasekaran, V. and Jordan, M. I. (2012). Computational and statistical tradeoffs via convex relaxation. arXiv:1211.1073.
Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition, volume 31 of Applications of Mathematics. Springer-Verlag, New York.
Fromont, M. (2007). Model selection by bootstrap penalization for classification. Mach. Learn., 66(2–3):165–207.
Genuer, R. (2012). Variance reduction in purely random forests. Journal of Nonparametric Statistics, 24(3):543–562.
Genuer, R., Poggi, J.-M., and Tuleau, C. (2008). Random forests: some methodological insights. arXiv:0811.3619.
Genuer, R., Poggi, J.-M., and Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognition Letters, 31(14):2225–2236.
Gey, S. and Mary-Huard, T. (2011). Risk bounds for embedded variable selection in classification trees. arXiv:1108.0757.
Gey, S. and Nédélec, É. (2005). Model selection for CART regression trees. IEEE Trans. Inform. Theory, 51(2):658–670.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, second edition.
Hein, M. and Bousquet, O. (2004). Hilbertian metrics and positive-definite kernels on probability measures. In AISTATS.
Kearns, M., Mansour, Y., Ng, A. Y., and Ron, D. (1997). An experimental and theoretical comparison of model selection methods. Mach. Learn., 27:7–50.
Koltchinskii, V. (2001). Rademacher penalties and structural risk minimization. IEEE Trans. Inform. Theory, 47(5):1902–1914.
Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist., 34(6):2593–2656.
Mahé, P., Ueda, N., Akutsu, T., Perret, J.-L., and Vert, J.-P. (2005). Graph kernels for molecular structure-activity relationship analysis with support vector machines. Journal of Chemical Information and Modeling, 45(4):939–951.
Maji, S., Berg, A. C., and Malik, J. (2008). Classification using intersection kernel support vector machines is efficient. In CVPR.
Mammen, E. and Tsybakov, A. B. (1999). Smooth discrimination analysis. Ann. Statist., 27(6):1808–1829.
Massart, P. and Nédélec, É. (2006). Risk bounds for statistical learning. Ann. Statist., 34(5):2326–2366.
Sauvé, M. and Tuleau-Malot, C. (2011). Variable selection through CART. arXiv:1101.0689.
Schölkopf, B. and Smola, A. J. (2001). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA.
Schölkopf, B., Tsuda, K., and Vert, J.-P., editors (2004). Kernel Methods in Computational Biology. MIT Press.
Shalev-Shwartz, S. and Srebro, N. (2008). SVM optimization: inverse dependence on training set size. In 25th International Conference on Machine Learning (ICML).
Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K., and Borgwardt, K. (2011). Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12:2539–2561.
Shotton, J., Fitzgibbon, A. W., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., and Blake, A. (2011). Real-time human pose recognition in parts from single depth images. In CVPR, pages 1297–1304.
Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Information Science and Statistics. Springer, New York.
Stone, C. J. (1977). Consistent nonparametric regression. Ann. Statist., 5(4):595–645.
Talagrand, M. (1996). New concentration inequalities in product spaces. Invent. Math., 126(3):505–563.