SLIDE 1

Multi-class SVMs From Tighter Data-Dependent Generalization Bounds to Novel Algorithms

Marius Kloft

Joint work with Yunwen Lei (CU Hong Kong), Urun Dogan (Microsoft Research), and Alexander Binder (Singapore).

Extreme Classification

Many modern applications involve a huge number of classes.

◮ E.g., image annotation (Deng, Dong, Socher, Li, Li, and Fei-Fei, 2009)
◮ Still growing datasets

Need for theory and algorithms for extreme classification (multi-class classification with a huge number of classes).


SLIDE 2

Discrepancy of Theory and Algorithms in Extreme Classification

◮ Algorithms for handling huge class sizes
◮ (stochastic) dual coordinate ascent (Keerthi et al., 2008; Shalev-Shwartz and Zhang, to appear)
◮ Theory not prepared for extreme classification
◮ Data-dependent bounds scale at least linearly with the number of classes (Koltchinskii and Panchenko, 2002; Mohri et al., 2012; Kuznetsov et al., 2014)

Questions

◮ Can we get bounds with mild dependence on #classes?
◮ What would we learn from such bounds?
⇒ Novel algorithms?

Theory

SLIDE 3

Multi-class Classification

Given:

◮ Training data z1 = (x1, y1), . . . , zn = (xn, yn) ∈ X × Y, drawn i.i.d. from P
◮ Y := {1, 2, . . . , c}
◮ c = number of classes

Example classes: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person, pottedplant, sheep, sofa, train, tvmonitor


Formal Problem Setting

Aim:

◮ Define a hypothesis class H of functions h = (h1, . . . , hc)
◮ Find an h ∈ H that “predicts well” via ŷ := arg max_{y∈Y} h_y(x)

Multi-class SVMs:
◮ h_y(x) = ⟨w_y, φ(x)⟩
◮ Introduce the notion of the (multi-class) margin

ρ_h(x, y) := h_y(x) − max_{y′ : y′ ≠ y} h_{y′}(x)

◮ the larger the margin, the better

Want: large expected margin E ρ_h(X, Y).
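As a concrete illustration of the margin operator ρ_h, here is a minimal sketch for linear scorers h_y(x) = ⟨w_y, x⟩ (the weight matrix and input below are hypothetical toy numbers, not from the slides):

```python
import numpy as np

def multiclass_margin(W, x, y):
    """rho_h(x, y) = h_y(x) - max_{y' != y} h_{y'}(x), for linear scorers h_y(x) = <w_y, x>."""
    scores = W @ x                      # one score per class
    rival = np.delete(scores, y).max()  # best competing class score
    return scores[y] - rival

# Toy numbers (hypothetical): c = 3 classes, 2 features.
W = np.array([[ 2.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
x = np.array([1.0, 0.5])
print(multiclass_margin(W, x, y=0))     # scores [2.0, 0.5, -1.5] -> margin 2.0 - 0.5 = 1.5
```

Note that the prediction ŷ = arg max_y h_y(x) is correct exactly when ρ_h(x, y) > 0, which is why a large expected margin is desirable.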


SLIDE 4

Types of Generalization Bounds for Multi-class Classification

Data-independent bounds
◮ based on covering numbers (Guermeur, 2002; Zhang, 2004a,b; Hill and Doucet, 2007)
◮ conservative
◮ unable to adapt to the data

Data-dependent bounds
◮ based on Rademacher complexity (Koltchinskii and Panchenko, 2002; Mohri et al., 2012; Cortes et al., 2013; Kuznetsov et al., 2014)
+ tighter
+ able to capture the real data
+ computable from the data

Rademacher & Gaussian Complexity

Definition
◮ Let σ1, . . . , σn be independent Rademacher variables (taking only the values ±1, with equal probability).
◮ The Rademacher complexity (RC) is defined as

R(H) := E_σ sup_{h∈H} (1/n) Σ_{i=1}^n σ_i h(z_i)

Definition
◮ Let g1, . . . , gn ∼ N(0, 1) be independent.
◮ The Gaussian complexity (GC) is defined as

G(H) := E_g sup_{h∈H} (1/n) Σ_{i=1}^n g_i h(z_i)

Interpretation: RC and GC reflect the ability of the hypothesis class to correlate with random noise.

Theorem (Ledoux and Talagrand, 1991)
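The two definitions above can be estimated numerically. A minimal Monte Carlo sketch (not from the slides; hypothetical data) for the unit-ball linear class {x ↦ ⟨w, x⟩ : ‖w‖₂ ≤ 1}, where the inner supremum has the closed form (1/n)‖Σ_i ε_i x_i‖₂:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_complexity(X, noise):
    """Monte Carlo estimate of E sup_{||w||_2 <= 1} (1/n) sum_i eps_i <w, x_i>.
    For the unit-ball linear class the supremum has the closed form
    (1/n) * ||sum_i eps_i x_i||_2, so no inner optimization is needed."""
    n = X.shape[0]
    return float(np.mean([np.linalg.norm(eps @ X) / n for eps in noise]))

n, d, trials = 200, 5, 2000
X = rng.normal(size=(n, d))                              # hypothetical sample x_1, ..., x_n
rademacher = rng.choice([-1.0, 1.0], size=(trials, n))   # sigma_i uniform on {-1, +1}
gaussian = rng.normal(size=(trials, n))                  # g_i ~ N(0, 1)
print("R_hat:", empirical_complexity(X, rademacher))
print("G_hat:", empirical_complexity(X, gaussian))
```

Both estimates land around √(d/n), illustrating the "correlation with random noise" interpretation: a richer class would achieve larger values.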

SLIDE 5

Existing Data-Dependent Analysis

The key step is estimating R({ρ_h : h ∈ H}), induced by the margin operator ρ_h and the class H. Existing bounds build on the structural result

R({max{h1, . . . , hc} : h_j ∈ H_j, j = 1, . . . , c}) ≤ Σ_{j=1}^c R(H_j).  (1)

The correlation among class-wise components is ignored. Best known dependence on the number of classes:
◮ quadratic dependence (Koltchinskii and Panchenko, 2002; Mohri et al., 2012; Cortes et al., 2013)
◮ linear dependence (Kuznetsov et al., 2014)

Can we do better?

A New Structural Lemma on Gaussian Complexities

We consider the Gaussian complexity.
◮ H is a vector-valued function class; g11, . . . , gnc ∼ N(0, 1)
◮ We show:

G({max{h1, . . . , hc} : h = (h1, . . . , hc) ∈ H}) ≤ (1/n) E_g sup_{h=(h1,...,hc)∈H} Σ_{i=1}^n Σ_{j=1}^c g_ij h_j(x_i).  (2)

Core idea: comparison inequality for Gaussian processes (Slepian, 1962). Define

X_h := Σ_{i=1}^n g_i max{h1(x_i), . . . , hc(x_i)},  Y_h := Σ_{i=1}^n Σ_{j=1}^c g_ij h_j(x_i),  ∀h ∈ H.

E[(X_θ − X_θ̄)²] ≤ E[(Y_θ − Y_θ̄)²]  ⇒  E[sup_{θ∈Θ} X_θ] ≤ E[sup_{θ∈Θ} Y_θ].

Eq. (2) preserves the coupling among class-wise components!

SLIDE 6

Example on Comparison of the Structural Lemma

◮ Consider

H := {(x1, x2) ↦ (h1, h2)(x1, x2) = (w1 x1, w2 x2) : ‖(w1, w2)‖₂ ≤ 1}

◮ For the function class {max{h1, h2} : h = (h1, h2) ∈ H}, the classical route (1) yields two decoupled suprema,

sup_{(h1,h2)∈H} Σ_{i=1}^n σ_i h1(x_i) + sup_{(h1,h2)∈H} Σ_{i=1}^n σ_i h2(x_i),

whereas the new route (2) yields a single coupled supremum,

sup_{(h1,h2)∈H} Σ_{i=1}^n [g_i1 h1(x_i) + g_i2 h2(x_i)].

Preserving the coupling means the supremum is taken over a smaller space!
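The gap can be seen numerically. In this two-class example the decoupled suprema evaluate to |a| + |b| with a = Σ_i ε_i1 x_i1, b = Σ_i ε_i2 x_i2, while the coupled supremum over the shared ℓ2 unit ball equals √(a² + b²) (Cauchy–Schwarz). A minimal Monte Carlo sketch (hypothetical data; as a simplification, Gaussian noise is used in both quantities so they are directly comparable):

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 100, 5000
X = rng.normal(size=(n, 2))                 # hypothetical inputs (x_i1, x_i2)

decoupled, coupled = [], []
for _ in range(trials):
    g = rng.normal(size=(n, 2))             # one Gaussian variable per (example, class)
    a = float(g[:, 0] @ X[:, 0])
    b = float(g[:, 1] @ X[:, 1])
    # Classical route (1): one supremum per component -> |a| + |b|.
    decoupled.append(abs(a) + abs(b))
    # New route (2): a single supremum over the shared unit ball -> sqrt(a^2 + b^2).
    coupled.append(float(np.hypot(a, b)))

print("decoupled mean:", np.mean(decoupled))
print("coupled mean:  ", np.mean(coupled))
```

Realization by realization the coupled value is never larger (ℓ2 norm vs ℓ1 norm of the same pair), which is exactly the "smaller space" advantage.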

Estimating the Multi-class Gaussian Complexity

◮ Consider a vector-valued function class defined by

H := {h_w = (⟨w1, φ(x)⟩, . . . , ⟨wc, φ(x)⟩) : f(w) ≤ Λ},

where f is β-strongly convex w.r.t. ‖·‖:

◮ f(αx + (1 − α)y) ≤ α f(x) + (1 − α) f(y) − (β/2) α(1 − α) ‖x − y‖².

Theorem

(1/n) E_g sup_{h_w∈H} Σ_{i=1}^n Σ_{j=1}^c g_ij h_j^w(x_i) ≤ (1/n) √( (2πΛ/β) E_g ‖(Σ_{i=1}^n g_ij φ(x_i))_{j=1}^c‖²_∗ ),  (3)

where ‖·‖_∗ is the dual norm of ‖·‖.

SLIDE 7

Features of the Complexity Bound

◮ Applies to a general function class defined through a strongly convex regularizer f
◮ Class-wise components h1, . . . , hc are correlated through the term ‖(Σ_{i=1}^n g_ij φ(x_i))_{j=1}^c‖²_∗
◮ Consider the class H_{p,Λ} := {h_w : ‖w‖_{2,p} ≤ Λ} (with 1/p + 1/p∗ = 1); then:

(1/n) E_g sup_{h_w∈H_{p,Λ}} Σ_{i=1}^n Σ_{j=1}^c g_ij h_j^w(x_i) ≤ (Λ/n) √(Σ_{i=1}^n k(x_i, x_i)) × { √e (4 log c)^{1 + 1/(2 log c)}  if p∗ ≥ 2 log c;  (2p∗)^{1 + 1/p∗} c^{1/p∗}  otherwise }.

The dependence on c is sublinear for 1 ≤ p ≤ 2, and even logarithmic as p approaches 1!
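To see why the dependence on c is mild for small p, one can tabulate just the dominant class-size factor c^{1/p∗} (ignoring the constants in front, so this is an illustrative sketch rather than the full bound):

```python
import math

def pstar(p):
    """Conjugate exponent: 1/p + 1/p* = 1."""
    return p / (p - 1.0)

# Dominant class-size factor c**(1/p*) for several p (illustrative grid).
# For p close to 1, p* is large, so the factor stays near-constant, matching
# the ~log c regime of the bound once p* >= 2 log c.
for c in (10, 100, 1000, 10000):
    factors = {p: c ** (1.0 / pstar(p)) for p in (2.0, 1.5, 1.1)}
    print(c, {p: round(v, 2) for p, v in factors.items()}, "log c:", round(math.log(c), 2))
```

At p = 2 the factor is √c; at p = 1.1 it grows slower than log c on this range, which is what motivates the ℓp-norm formulation on the next slide.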

Algorithms


SLIDE 8

ℓp-norm Multi-class SVM

Motivated by the mild dependence on c as p → 1, we consider the

(ℓp-norm) Multi-class SVM, 1 ≤ p ≤ 2

min_w (1/2) (Σ_{j=1}^c ‖w_j‖₂^p)^{2/p} + C Σ_{i=1}^n (1 − t_i)₊
s.t. t_i = ⟨w_{y_i}, φ(x_i)⟩ − max_{y : y ≠ y_i} ⟨w_y, φ(x_i)⟩.  (P)

Dual Problem

sup_{α∈R^{n×c}} −(1/2) (Σ_{j=1}^c ‖Σ_{i=1}^n α_ij φ(x_i)‖₂^{p/(p−1)})^{2(p−1)/p} + Σ_{i=1}^n α_{i y_i}
s.t. α_i ≤ e_{y_i} · C ∧ α_i · 1 = 0, ∀i = 1, . . . , n.  (D)

(D) is not quadratic unless p = 2; how to optimize?

Equivalent Formulation

We introduce class weights β1, . . . , βc to get a quadratic dual:

min_β (1/2) Σ_{j=1}^c ‖w_j‖²₂ / β_j + λ ‖β‖_p^p

has its optimum at β_j ∝ ‖w_j‖₂^{2/(p+1)}.

Equivalent Problem

min_{w,β} Σ_{j=1}^c ‖w_j‖₂² / (2β_j) + C Σ_{i=1}^n (1 − t_i)₊
s.t. t_i ≤ ⟨w_{y_i}, φ(x_i)⟩ − ⟨w_y, φ(x_i)⟩, ∀y ≠ y_i, i = 1, . . . , n,
‖β‖_p̄ ≤ 1, p̄ = p(2 − p)^{−1}, β_j ≥ 0.  (E)

Alternating optimization w.r.t. β and w.
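The β-step of the alternating scheme has a closed form. A minimal sketch (hypothetical weight vectors; the normalization constant is the standard ℓp-norm-MKL-style rescaling so that ‖β‖_p̄ = 1, which is an assumption about the constrained variant rather than a transcription of the slide):

```python
import numpy as np

def beta_update(W, pbar):
    """Closed-form minimizer of sum_j ||w_j||_2^2 / (2*beta_j) over beta >= 0
    subject to ||beta||_pbar <= 1 (lp-norm-MKL-style update; a sketch)."""
    norms = np.linalg.norm(W, axis=1)           # ||w_j||_2, one per class
    b = norms ** (2.0 / (pbar + 1.0))           # beta_j proportional to ||w_j||^(2/(pbar+1))
    return b / np.linalg.norm(b, ord=pbar)      # rescale so that ||beta||_pbar = 1

# Hypothetical weights: c = 5 classes, 10 features; p = 1.5 gives pbar = p/(2-p) = 3.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 10))
beta = beta_update(W, pbar=3.0)

def objective(b):
    return 0.5 * np.sum(np.linalg.norm(W, axis=1) ** 2 / b)

uniform = np.full(5, 5.0 ** (-1.0 / 3.0))       # a feasible uniform choice of beta
print(objective(beta), "<=", objective(uniform))
```

In the full alternating scheme this closed-form β-update interleaves with a standard quadratic SVM solve in w, which is what makes (E) attractive compared to optimizing the non-quadratic dual (D) directly.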

SLIDE 9

Empirical Results

Description of datasets used in the experiments:

Dataset      # Classes  # Training Examples  # Test Examples  # Attributes
Sector       105        6,412                3,207            55,197
News 20      20         15,935               3,993            62,060
Rcv1         53         15,564               518,571          47,236
Birds 50     200        9,958                1,830            4,096
Caltech 256  256        12,800               16,980           4,096

Empirical results:

Method / Dataset   Sector     News 20    Rcv1       Birds 50   Caltech 256
ℓp-norm MC-SVM     94.2±0.3   86.2±0.1   85.7±0.7   27.9±0.2   56.0±1.2
Crammer & Singer   93.9±0.3   85.1±0.3   85.2±0.3   26.3±0.3   55.0±1.1

The proposed ℓp-norm MC-SVM performs consistently better on the benchmark datasets.


Future Directions

Theory: A data-dependent bound independent of the class size?
⇒ Needs a more powerful structural result on the Gaussian complexity of function classes induced by the maximum operator.
◮ Might be worth looking into ℓ∞-norm covering numbers.

Algorithms: New models & efficient solvers
◮ Novel models motivated by the theory
◮ top-k MC-SVM (Lapin et al., 2015), nuclear-norm regularization, ...
◮ Scalable algorithms
◮ Analyze the p > 2 regime
◮ Extensions to multi-label learning


SLIDE 10

References

C. Cortes, M. Mohri, and A. Rostamizadeh. Multi-class classification with maximum margin multiple kernel. In ICML-13, pages 46–54, 2013.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE, 2009.

Y. Guermeur. Combining discriminant models with new multi-class SVMs. Pattern Analysis & Applications, 5(2):168–179, 2002.

S. I. Hill and A. Doucet. A framework for kernel-based multi-category classification. Journal of Artificial Intelligence Research (JAIR), 30:525–564, 2007.

S. S. Keerthi, S. Sundararajan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. A sequential dual method for large scale multi-class linear SVMs. In 14th ACM SIGKDD, pages 408–416. ACM, 2008.

V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, pages 1–50, 2002.

V. Kuznetsov, M. Mohri, and U. Syed. Multi-class deep boosting. In Advances in Neural Information Processing Systems, pages 2501–2509, 2014.

M. Lapin, M. Hein, and B. Schiele. Top-k multiclass SVM. CoRR, abs/1511.06683, 2015. URL http://arxiv.org/abs/1511.06683.

M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes, volume 23. Springer, Berlin, 1991.

M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.

S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, Series A and B, to appear.

D. Slepian. The one-sided barrier problem for Gaussian noise. Bell System Technical Journal, 41(2):463–501, 1962.

T. Zhang. Class-size independent generalization analysis of some discriminative multi-category classification. In Advances in Neural Information Processing Systems, pages 1625–1632, 2004a.

T. Zhang. Statistical analysis of some multi-category large margin classification methods. The Journal of Machine Learning Research, 5:1225–1251, 2004b.