

slide-1
SLIDE 1

Multi-class Support Vector Machine

Rizal Zaini Ahmad Fathony November 10, 2016

University of Illinois at Chicago

slide-2
SLIDE 2

Introduction

slide-3
SLIDE 3

Support Vector Machine

  • The Support Vector Machine is a classification algorithm developed

based on a geometric intuition of finding large margin

Introduction 1

slide-4
SLIDE 4

Support Vector Machine

  • The Support Vector Machine is a classification algorithm developed

based on a geometric intuition of finding large margin

  • SVM has demonstrated successful results in binary classification

problems

Introduction 1

slide-5
SLIDE 5

Support Vector Machine

  • The Support Vector Machine is a classification algorithm developed

based on a geometric intuition of finding large margin

  • SVM has demonstrated successful results in binary classification

problems

  • Several efforts have been proposed to bring the success of SVM in

binary classification problems into multi-class classification problems

Introduction 1

slide-6
SLIDE 6

Support Vector Machine

  • The Support Vector Machine is a classification algorithm developed

based on a geometric intuition of finding large margin

  • SVM has demonstrated successful results in binary classification

problems

  • Several efforts have been proposed to bring the success of SVM in

binary classification problems into multi-class classification problems

  • We will study different approaches to formulating multi-class SVM, in terms of both theoretical properties (Fisher consistency) and the empirical performance of the models

Introduction 1

slide-7
SLIDE 7

Table of Contents

  • 1. Introduction
  • 2. Formulations
  • 3. Fisher Consistency
  • 4. Experiments
  • 5. Conclusions

Table of Contents 2

slide-8
SLIDE 8

Formulations

slide-9
SLIDE 9

Standard SVM Formulation

  • Training data:

{(x1, y1), (x2, y2), · · · , (xn, yn)}
  – xi : vector of features for the i-th example
  – yi : label for the i-th example, yi ∈ {−1, +1}
  – n : total number of training examples

Standard SVM Formulation 3

slide-10
SLIDE 10

Standard SVM Formulation

  • Training data:

{(x1, y1), (x2, y2), · · · , (xn, yn)}
  – xi : vector of features for the i-th example
  – yi : label for the i-th example, yi ∈ {−1, +1}
  – n : total number of training examples

  • Goal:

Find the maximum-margin hyperplane

i.e. the hyperplane that separates positive examples from negative examples which has the largest margin

Standard SVM Formulation 3

slide-11
SLIDE 11

Hyperplane

  • A hyperplane in d-dimensional space Rd:

w · x + b = 0
  – w ∈ Rd : a non-zero vector normal to the hyperplane
  – b ∈ R : a scalar

Standard SVM Formulation 4

slide-12
SLIDE 12

Maximum-margin hyperplane (right) and another hyperplane (left)

Margin: ρ = 1/‖w‖. Marginal hyperplanes: w · x + b = +1 and w · x + b = −1. Mohri, M. et al. Foundations of Machine Learning (MIT Press, 2012).

Standard SVM Formulation 5

slide-13
SLIDE 13

Optimization

  • Maximizing the margin ρ = 1/‖w‖
  • Equivalent: minimizing ‖w‖, or minimizing ½‖w‖²

Standard SVM Formulation 6

slide-14
SLIDE 14

Optimization

  • Maximizing the margin ρ = 1/‖w‖
  • Equivalent: minimizing ‖w‖, or minimizing ½‖w‖²

  • Denote: f(xi) = w · xi + b → the potential
  • Marginal hyperplane definition

⇒ |w · xi + b| ≥ 1 for each example i ∈ [1, n]

Standard SVM Formulation 6

slide-15
SLIDE 15

Optimization and Prediction

  • Quadratic Programming Formulation:

min_{w,b}  ½‖w‖²  subject to:  yi(w · xi + b) ≥ 1,  ∀i ∈ [1, n].

Standard SVM Formulation 7

slide-16
SLIDE 16

Optimization and Prediction

  • Quadratic Programming Formulation:

min_{w,b}  ½‖w‖²  subject to:  yi(w · xi + b) ≥ 1,  ∀i ∈ [1, n].

  • Prediction for a new data point x:

h(x) = sign(w · x + b).

Standard SVM Formulation 7
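The hard-margin QP above maps almost directly onto a convex-optimization toolkit. A minimal sketch, assuming cvxpy and numpy are installed; the toy data, variable names, and the linear separability of the two classes are illustrative assumptions, not from the slides.

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data: 2 features, labels in {-1, +1}
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

n, d = X.shape
w = cp.Variable(d)
b = cp.Variable()

# min (1/2)||w||^2  subject to  y_i (w . x_i + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

# Prediction: h(x) = sign(w . x + b)
x_new = np.array([0.5, 1.0])
print(np.sign(w.value @ x_new + b.value))
```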

slide-17
SLIDE 17

Soft-Margin SVM

  • Real world data are not always linearly separable
  • Allow violation, i.e. some points xi can have

yi(w · xi + b) < 1, but add a penalty to the optimization when there is a violation

Standard SVM Formulation 8

slide-18
SLIDE 18

Soft-Margin SVM

  • Real world data are not always linearly separable
  • Allow violation, i.e. some points xi can have

yi(w · xi + b) < 1, but add a penalty to the optimization when there is a violation

  • Introduce a slack variable ξi for each point i ∈ [1, n]

min_{w,b,ξ}  ½‖w‖² + C Σ_{i=1}^n ξi
subject to:  yi(w · xi + b) ≥ 1 − ξi,  ξi ≥ 0,  ∀i ∈ [1, n]

  • C ≥ 0: a parameter balancing between maximizing the margin and minimizing the violations

Standard SVM Formulation 8

slide-19
SLIDE 19

Hinge Loss

  • Note that: f (xi) = w · xi + b
  • The penalty ξi for example xi:
  • ξi = 0,             if yi f(xi) ≥ 1
  • ξi = 1 − yi f(xi),   if yi f(xi) < 1

Standard SVM Formulation 9

slide-20
SLIDE 20

Hinge Loss

  • Note that: f (xi) = w · xi + b
  • The penalty ξi for example xi:
  • ξi = 0,             if yi f(xi) ≥ 1
  • ξi = 1 − yi f(xi),   if yi f(xi) < 1

  • The loss:

[1 − yi f(xi)]+

where [u]+ = u if u ≥ 0 and 0 otherwise

  • Hinge loss

Standard SVM Formulation 9
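A small numeric sketch of the hinge loss above, assuming numpy; the labels and potential values are made up for illustration.

```python
import numpy as np

def hinge_loss(y, f_x):
    """Binary hinge loss [1 - y * f(x)]_+ per example."""
    return np.maximum(0.0, 1.0 - y * f_x)

y = np.array([+1, +1, -1, -1])          # true labels
f_x = np.array([2.3, 0.4, -0.1, 1.5])   # potentials w.x + b
print(hinge_loss(y, f_x))               # only the well-classified first example incurs zero loss
```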

slide-21
SLIDE 21

Standard SVM Formulation 10

slide-22
SLIDE 22

Multi-class Classification

  • Training data:

{(x1, y1), (x2, y2), · · · , (xn, yn)}
  – xi : vector of features for the i-th example
  – yi : label for the i-th example; yi takes an integer value from 1 to k, yi ∈ [1, k]
  – k : the number of classes
  – n : total number of training examples

Multi-class SVM Formulations 11

slide-23
SLIDE 23

Multi-class SVM Formulations

  • A. Multi-machine Formulations
  • One Versus One (OVO)
  • One Versus All (OVA)

Multi-class SVM Formulations 12

slide-24
SLIDE 24

Multi-class SVM Formulations

  • A. Multi-machine Formulations
  • One Versus One (OVO)
  • One Versus All (OVA)
  • B. All-in-one Machine Formulations
  • Weston and Watkins (WW) Formulation
  • Crammer and Singer (CS) Formulation
  • Lee, Lin, and Wahba (LLW) Formulation

Multi-class SVM Formulations 12

slide-25
SLIDE 25

Multi-machine Formulations

  • Divide a multi-class classification problem into several binary

classification tasks.

All-in-one Machine Formulations 13

slide-26
SLIDE 26

One Versus One

  • Construct a binary classification problem for each pair of classes

(a, b) ∈ {(a, b)|a < b , a, b ∈ [1, k]}

  • Each classifier differentiates the a-th class from the b-th class.

Resulting in a decision function ha−b(x)

Deng, N. et al. Support vector machines: optimization based theory, algorithms, and extensions (CRC press, 2012).

One Versus One 14

slide-27
SLIDE 27

Three classes classification

One Versus One 15

slide-28
SLIDE 28

First OVO model

One Versus One 16

slide-29
SLIDE 29

Second OVO model

One Versus One 17

slide-30
SLIDE 30

Third OVO model

One Versus One 18

slide-31
SLIDE 31

One Versus One

  • k(k − 1)/2 decision functions in total
  • Final decision: take the class which has the most votes

Deng, N. et al. Support vector machines: optimization based theory, algorithms, and extensions (CRC press, 2012).

One Versus One 19
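A minimal sketch of the one-versus-one scheme above, assuming scikit-learn's LinearSVC as the base binary classifier and classes labeled 0..k−1; the function name and structure are illustrative only.

```python
from itertools import combinations
import numpy as np
from sklearn.svm import LinearSVC

def ovo_fit_predict(X_train, y_train, X_test, k):
    """One-vs-one: train k(k-1)/2 binary SVMs, predict by majority vote."""
    votes = np.zeros((len(X_test), k), dtype=int)
    for a, b in combinations(range(k), 2):
        mask = np.isin(y_train, [a, b])              # keep only classes a and b
        clf = LinearSVC().fit(X_train[mask], y_train[mask])
        pred = clf.predict(X_test)                   # predictions in {a, b}
        for cls in (a, b):
            votes[:, cls] += (pred == cls)
    return votes.argmax(axis=1)                      # class with the most votes
```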

slide-32
SLIDE 32

One Versus All

  • Construct k binary classifiers
  • The a-th binary classifier tries to separate a-th class from the rest

Deng, N. et al. Support vector machines: optimization based theory, algorithms, and extensions (CRC press, 2012).

One Versus All 20

slide-33
SLIDE 33

Three classes classification

One Versus All 21

slide-34
SLIDE 34

First OVA model

One Versus All 22

slide-35
SLIDE 35

Second OVA model

One Versus All 23

slide-36
SLIDE 36

Third OVA model

One Versus All 24

slide-37
SLIDE 37

One Versus All

  • Let fa(x) = wa · x + ba be the potential function constructed by the

a-th binary classifier. The binary classifier picks class a if fa(x) > 0

  • Final decision:

ŷ = argmax_{a∈[1,k]} fa(x)

Deng, N. et al. Support vector machines: optimization based theory, algorithms, and extensions (CRC press, 2012).

One Versus All 25
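A minimal sketch of the one-versus-all rule ŷ = argmax_a fa(x) above, again assuming scikit-learn's LinearSVC and classes labeled 0..k−1; the names are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

def ova_fit_predict(X_train, y_train, X_test, k):
    """One-vs-all: k binary SVMs; predict argmax_a f_a(x)."""
    scores = np.empty((len(X_test), k))
    for a in range(k):
        clf = LinearSVC().fit(X_train, (y_train == a).astype(int))
        scores[:, a] = clf.decision_function(X_test)   # f_a(x) = w_a.x + b_a
    return scores.argmax(axis=1)
```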

slide-38
SLIDE 38

All-in-one Machine Formulations

  • Construct a single model that considers all classes
  • Directly modifies the optimization in binary SVM by:
  • 1. Modifying the objective function
  • 2. Modifying the constraints

All-in-one Machine Formulations 26

slide-39
SLIDE 39

All-in-one Machine Formulations

  • Construct a single model that considers all classes
  • Directly modifies the optimization in binary SVM by:
  • 1. Modifying the objective function
  • 2. Modifying the constraints
  • Formulations:
  • 1. Weston and Watkins (WW) Formulation
  • 2. Crammer and Singer (CS) Formulation
  • 3. Lee, Lin, and Wahba (LLW) Formulation

All-in-one Machine Formulations 26

slide-40
SLIDE 40

Weston and Watkins (WW) Formulation

  • A parameter wj for each class
  • A slack variable ξi,j for each example and each class

Weston, J., Watkins, C., et al. Support vector machines for multi-class pattern recognition. In ESANN 99 (1999), 219–224.

Weston and Watkins (WW) Formulation 27

slide-41
SLIDE 41

Weston and Watkins (WW) Formulation

  • A parameter wj for each class
  • A slack variable ξi,j for each example and each class
  • Define: the potential function for class j

fj(xi) = wj · xi + bj

Weston, J., Watkins, C., et al. Support vector machines for multi-class pattern recognition. In ESANN 99 (1999), 219–224.

Weston and Watkins (WW) Formulation 27

slide-42
SLIDE 42

Standard Binary SVM

min_{w,b,ξ}  ½‖w‖² + C Σ_{i=1}^n ξi
subject to:  yi(w · xi + b) ≥ 1 − ξi,  ξi ≥ 0,  ∀i ∈ [1, n]

Weston and Watkins (WW) Formulation

min_{w,b,ξ}  ½ Σ_{j=1}^k ‖wj‖² + C Σ_{i=1}^n Σ_{j∈{1,··· ,k}\yi} ξi,j
subject to:  (wyi · xi + byi) − (wj · xi + bj) ≥ 2 − ξi,j,  ξi,j ≥ 0,  i ∈ [1, n], j ∈ {1, · · · , k}\yi

Weston and Watkins (WW) Formulation 28
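A minimal sketch of the WW quadratic program above in cvxpy (linear potentials, no kernel); the random toy data, class count, and C value are illustrative assumptions, and classes are indexed from 0.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
k, d, n = 3, 2, 30
X = rng.standard_normal((n, d))
y = rng.integers(0, k, size=n)           # labels assumed in {0, ..., k-1}
C = 1.0

W = cp.Variable((k, d))                  # one weight vector w_j per class
b = cp.Variable(k)
Xi = cp.Variable((n, k), nonneg=True)    # slacks xi_{i,j}; columns with j = y_i stay at 0

constraints = []
for i in range(n):
    for j in range(k):
        if j != y[i]:
            # (w_{y_i}.x_i + b_{y_i}) - (w_j.x_i + b_j) >= 2 - xi_{i,j}
            constraints.append(
                W[y[i]] @ X[i] + b[y[i]] - (W[j] @ X[i] + b[j]) >= 2 - Xi[i, j]
            )

objective = cp.Minimize(0.5 * cp.sum_squares(W) + C * cp.sum(Xi))
cp.Problem(objective, constraints).solve()

# Prediction: h(x) = argmax_j (w_j . x + b_j)
pred = np.argmax(X @ W.value.T + b.value, axis=1)
print("training accuracy:", (pred == y).mean())
```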

slide-43
SLIDE 43

Weston and Watkins (WW) Formulation

  • Prediction:

h(x) = argmax_j [wj · x + bj] = argmax_j fj(x)

Weston, J., Watkins, C., et al. Support vector machines for multi-class pattern recognition. In ESANN 99 (1999), 219–224.

Weston and Watkins (WW) Formulation 29

slide-44
SLIDE 44

Crammer and Singer (CS) Formulation

  • A parameter wj for each class
  • Only one slack variable ξi for each example, (instead of k)

Crammer, K. & Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research 2, 265–292 (2002).

Crammer and Singer (CS) Formulation 30

slide-45
SLIDE 45

Weston and Watkins (WW) Formulation

min_{w,b,ξ}  ½ Σ_{j=1}^k ‖wj‖² + C Σ_{i=1}^n Σ_{j∈{1,··· ,k}\yi} ξi,j
subject to:  (wyi · xi + byi) − (wj · xi + bj) ≥ 2 − ξi,j,  ξi,j ≥ 0,  i ∈ [1, n], j ∈ {1, · · · , k}\yi

Crammer and Singer (CS) Formulation

min_{w,b,ξ}  ½ Σ_{j=1}^k ‖wj‖² + C Σ_{i=1}^n ξi
subject to:  (wyi · xi + byi) − (wj · xi + bj) ≥ 1 − ξi,  ξi ≥ 0,  i ∈ [1, n], j ∈ {1, · · · , k}\yi

Crammer and Singer (CS) Formulation 31

slide-46
SLIDE 46

Lee, Lin, and Wahba (LLW) Formulation

  • A parameter wj for each class
  • A slack variable ξi,j for each example and each class

Lee, Y. et al. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association 99, 67–81 (2004).

Lee, Lin, and Wahba (LLW) Formulation 32

slide-47
SLIDE 47

Lee, Lin, and Wahba (LLW) Formulation

  • A parameter wj for each class
  • A slack variable ξi,j for each example and each class
  • Use the absolute potential value fj(xi)

Instead of using the relative potential difference fyi(xi) − fj(xi)

Lee, Y. et al. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association 99, 67–81 (2004).

Lee, Lin, and Wahba (LLW) Formulation 32

slide-48
SLIDE 48

Weston and Watkins (WW) Formulation

min_{w,b,ξ}  ½ Σ_{j=1}^k ‖wj‖² + C Σ_{i=1}^n Σ_{j∈{1,··· ,k}\yi} ξi,j
subject to:  ξi,j ≥ 2 + fj(xi) − fyi(xi),  ξi,j ≥ 0,  i ∈ [1, n], j ∈ {1, · · · , k}\yi

Lee, Lin, and Wahba (LLW) Formulation

min_{w,b,ξ}  ½ Σ_{j=1}^k ‖wj‖² + C Σ_{i=1}^n Σ_{j∈{1,··· ,k}\yi} ξi,j
subject to:  ξi,j ≥ fj(xi) + 1/(k − 1);  Σ_{j=1}^k fj(xi) = 0;  ξi,j ≥ 0;  i ∈ [1, n], j ∈ {1, · · · , k}\yi

Lee, Lin, and Wahba (LLW) Formulation 33

slide-49
SLIDE 49

Fisher Consistency

slide-50
SLIDE 50

Fisher Consistency in Binary Classification

  • Fisher consistency / Bayes Consistency:

Requires a classifier to asymptotically yield the Bayes decision boundary

1Lin, Y. Support vector machines and the Bayes rule in classification.

Data Mining and Knowledge Discovery 6, 259–275 (2002).

Fisher Consistency in Binary Classification 34

slide-51
SLIDE 51

Fisher Consistency in Binary Classification

  • Fisher consistency / Bayes Consistency:

Requires a classifier to asymptotically yield the Bayes decision boundary

  • Binary case:

A loss V(f(x, y)) is Fisher consistent if: the minimizer of E[V(f(X, Y))|X = x] has the same sign as the Bayes decision P(Y = 1|X = x) − 1/2

1Lin, Y. Support vector machines and the Bayes rule in classification.

Data Mining and Knowledge Discovery 6, 259–275 (2002).

Fisher Consistency in Binary Classification 34

slide-52
SLIDE 52

Fisher Consistency in Binary Classification

  • Fisher consistency / Bayes Consistency:

Requires a classifier to asymptotically yield the Bayes decision boundary

  • Binary case:

A loss V(f(x, y)) is Fisher consistent if: the minimizer of E[V(f(X, Y))|X = x] has the same sign as the Bayes decision P(Y = 1|X = x) − 1/2

  • Binary SVM is Fisher consistent1

The minimizer of E[[1 − Y f(X)]+|X = x] is sign(P(Y = 1|X = x) − 1/2)

1Lin, Y. Support vector machines and the Bayes rule in classification.

Data Mining and Knowledge Discovery 6, 259–275 (2002).

Fisher Consistency in Binary Classification 34

slide-53
SLIDE 53

Fisher Consistency in Multi-class Classification

  • k classes, y ∈ [1, k]
  • Let: Pj(x) = P(Y = j|X = x)

Liu, Y. Fisher consistency of multicategory support vector machines in International Conference on Artificial Intelligence and Statistics (2007), 291–298.

Fisher Consistency in Multi-class Classification 35

slide-54
SLIDE 54

Fisher Consistency in Multi-class Classification

  • k classes, y ∈ [1, k]
  • Let: Pj(x) = P(Y = j|X = x)
  • Potential vector: f(x) = [f1(x), · · · , fk(x)]T
  • Denote: f∗(x) = [f∗1(x), · · · , f∗k(x)]T, the minimizer over f of E[V(f(X, Y))|X = x]

Liu, Y. Fisher consistency of multicategory support vector machines in International Conference on Artificial Intelligence and Statistics (2007), 291–298.

Fisher Consistency in Multi-class Classification 35

slide-55
SLIDE 55

Fisher Consistency in Multi-class Classification

  • k classes, y ∈ [1, k]
  • Let: Pj(x) = P(Y = j|X = x)
  • Potential vector: f(x) = [f1(x), · · · , fk(x)]T
  • Denote: f∗(x) = [f∗1(x), · · · , f∗k(x)]T, the minimizer over f of E[V(f(X, Y))|X = x]
  • Fisher consistency requires:

argmax_j f∗j(x) = argmax_j Pj(x)

Liu, Y. Fisher consistency of multicategory support vector machines in International Conference on Artificial Intelligence and Statistics (2007), 291–298.

Fisher Consistency in Multi-class Classification 35

slide-56
SLIDE 56

Fisher Consistency in Multi-class Classification

  • k classes, y ∈ [1, k]
  • Let: Pj(x) = P(Y = j|X = x)
  • Potential vector: f(x) = [f1(x), · · · , fk(x)]T
  • Denote: f∗(x) = [f∗1(x), · · · , f∗k(x)]T, the minimizer over f of E[V(f(X, Y))|X = x]
  • Fisher consistency requires:

argmax_j f∗j(x) = argmax_j Pj(x)

  • Remove redundant solutions:

Employ the constraint: Σ_{j=1}^k fj(x) = 0

Liu, Y. Fisher consistency of multicategory support vector machines in International Conference on Artificial Intelligence and Statistics (2007), 291–298.

Fisher Consistency in Multi-class Classification 35

slide-57
SLIDE 57

All-in-One Machines

Simplify the losses for analysis: change the constants to 1

  • 1. LLW loss:    VLLW(f(X, Y)) = Σ_{j≠y} [1 + fj(x)]+
  • 2. WW loss:     VWW(f(X, Y)) = Σ_{j≠y} [1 − (fy(x) − fj(x))]+
  • 3. CS loss:     VCS(f(X, Y)) = [1 − min_j (fy(x) − fj(x))]+
  • 4. Naive loss:  VNaive(f(X, Y)) = [1 − fy(x)]+

WW and CS: relative potential differences
LLW and Naive: absolute potential values

Fisher Consistency in Multi-class Classification 36
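A small numpy sketch of the four simplified losses above for a single example; the potential vector and the 0-indexed label are made-up values.

```python
import numpy as np

def llw_loss(f, y):
    """LLW: sum_{j != y} [1 + f_j]_+  (absolute potentials)."""
    mask = np.arange(len(f)) != y
    return np.maximum(0.0, 1.0 + f[mask]).sum()

def ww_loss(f, y):
    """WW: sum_{j != y} [1 - (f_y - f_j)]_+  (relative potentials)."""
    mask = np.arange(len(f)) != y
    return np.maximum(0.0, 1.0 - (f[y] - f[mask])).sum()

def cs_loss(f, y):
    """CS: [1 - min_{j != y} (f_y - f_j)]_+  (worst relative margin)."""
    mask = np.arange(len(f)) != y
    return max(0.0, 1.0 - np.min(f[y] - f[mask]))

def naive_loss(f, y):
    """Naive: [1 - f_y]_+."""
    return max(0.0, 1.0 - f[y])

f = np.array([0.7, -0.2, -0.5])   # potentials f_1..f_k at one x
y = 0                              # true class (0-indexed)
print(llw_loss(f, y), ww_loss(f, y), cs_loss(f, y), naive_loss(f, y))
```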

slide-58
SLIDE 58

Fisher Consistency of the All-in-One Machines SVM

  • A. Fisher Consistency of the All-in-One Machines SVM
  • 1. Inconsistency of the Naive Formulation
  • 2. Consistency of the LLW Formulation
  • 3. Inconsistency of the WW Formulation
  • 4. Inconsistency of the CS Formulation

Liu, Y. Fisher consistency of multicategory support vector machines in International Conference on Artificial Intelligence and Statistics (2007), 291–298.

Fisher Consistency of the All-in-One Machines SVM 37

slide-59
SLIDE 59

Fisher Consistency of the All-in-One Machines SVM

  • A. Fisher Consistency of the All-in-One Machines SVM
  • 1. Inconsistency of the Naive Formulation
  • 2. Consistency of the LLW Formulation
  • 3. Inconsistency of the WW Formulation
  • 4. Inconsistency of the CS Formulation
  • B. Modification of the Inconsistent Formulations
  • 1. Modification of the Naive Formulation
  • 2. Modification of the WW Formulation
  • 3. Modification of the CS Formulation

Liu, Y. Fisher consistency of multicategory support vector machines in International Conference on Artificial Intelligence and Statistics (2007), 291–298.

Fisher Consistency of the All-in-One Machines SVM 37

slide-60
SLIDE 60

Inconsistency of the Naive Formulation

  • For any fixed X = x:

Minimizing E[VNaive(f(X, Y))] = E[[1 − fY(x)]+] is equal to minimizing Σ_{l=1}^k Pl(x)[1 − fl(x)]+

Inconsistency of the Naive Formulation 38

slide-61
SLIDE 61

Inconsistency of the Naive Formulation

  • For any fixed X = x:

Minimizing E[VNaive(f(X, Y))] = E[[1 − fY(x)]+] is equal to minimizing Σ_{l=1}^k Pl(x)[1 − fl(x)]+

  • We want to find properties of the minimizer f∗

Lemma 1. The minimizer f∗ of E[[1 − fY(X)]+|X = x] = Σ_{l=1}^k Pl(x)[1 − fl(x)]+ subject to Σ_{j=1}^k fj(x) = 0 satisfies the following: f∗j(x) = −(k − 1) if j = argminj Pj(x), and 1 otherwise.

Inconsistency of the Naive Formulation 38

slide-62
SLIDE 62

Lemma 1. The minimizer f∗ of E[[1 − fY(X)]+|X = x] = Σ_{l=1}^k Pl(x)[1 − fl(x)]+ subject to Σ_{j=1}^k fj(x) = 0 satisfies the following: f∗j(x) = −(k − 1) if j = argminj Pj(x), and 1 otherwise.

  • The minimization can be reduced to: (proof omitted)

max_f  Σ_{l=1}^k Pl(x)fl(x)
subject to:  Σ_{l=1}^k fl(x) = 0,  fl(x) ≤ 1, ∀l ∈ [1, k]

Inconsistency of the Naive Formulation 39

slide-63
SLIDE 63

Lemma 1. The minimizer f∗ of E[[1 − fY(X)]+|X = x] = Σ_{l=1}^k Pl(x)[1 − fl(x)]+ subject to Σ_{j=1}^k fj(x) = 0 satisfies the following: f∗j(x) = −(k − 1) if j = argminj Pj(x), and 1 otherwise.

  • The minimization can be reduced to: (proof omitted)

max_f  Σ_{l=1}^k Pl(x)fl(x)
subject to:  Σ_{l=1}^k fl(x) = 0,  fl(x) ≤ 1, ∀l ∈ [1, k]

  • The solution of the maximization above:

satisfies f∗j(x) = −(k − 1) if j = argminj Pj(x), and 1 otherwise

  • The Naive hinge loss formulation is not Fisher consistent

Inconsistency of the Naive Formulation 39
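A quick numerical check of the reduced problem behind Lemma 1, assuming scipy is available; the probability vector is an arbitrary illustration. The LP solution puts −(k − 1) on the least likely class and 1 everywhere else, so argmax_j f∗j(x) does not single out the most likely class.

```python
import numpy as np
from scipy.optimize import linprog

P = np.array([0.5, 0.3, 0.2])   # class probabilities at a fixed x (illustrative)
k = len(P)

# Reduced problem from the slide: max_f sum_l P_l f_l
#   s.t. sum_l f_l = 0 and f_l <= 1  (linprog minimizes, so negate the objective)
res = linprog(c=-P,
              A_eq=np.ones((1, k)), b_eq=[0.0],
              bounds=[(None, 1.0)] * k)

print(res.x)   # [1, 1, -(k-1)] = [1, 1, -2]: argmax_j f*_j is not unique -> not Fisher consistent
```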

slide-64
SLIDE 64

Consistency of the LLW Formulation

  • For any fixed X = x:

Minimizing E[VLLW(f(X, Y))] = E[Σ_{j≠Y} [1 + fj(X)]+] is equal to minimizing Σ_{l=1}^k Σ_{j≠l} Pl(x)[1 + fj(x)]+

Consistency of the LLW Formulation 40

slide-65
SLIDE 65

Consistency of the LLW Formulation

  • For any fixed X = x:

Minimizing E[VLLW(f(X, Y))] = E[Σ_{j≠Y} [1 + fj(X)]+] is equal to minimizing Σ_{l=1}^k Σ_{j≠l} Pl(x)[1 + fj(x)]+

  • We want to find properties of the minimizer f∗

Lemma 2. The minimizer f∗ of E[Σ_{j≠Y} [1 + fj(X)]+|X = x] = Σ_{l=1}^k Σ_{j≠l} Pl(x)[1 + fj(x)]+ subject to Σ_{j=1}^k fj(x) = 0 satisfies the following: f∗j(x) = k − 1 if j = argmaxj Pj(x), and −1 otherwise.

Consistency of the LLW Formulation 40

slide-66
SLIDE 66

Lemma 2. The minimizer f∗ of E[Σ_{j≠Y} [1 + fj(X)]+|X = x] = Σ_{l=1}^k Σ_{j≠l} Pl(x)[1 + fj(x)]+ subject to Σ_{j=1}^k fj(x) = 0 satisfies the following: f∗j(x) = k − 1 if j = argmaxj Pj(x), and −1 otherwise.

Proof

  • The minimization can be reduced to: (proof omitted)

max_f  Σ_{l=1}^k Pl(x)fl(x)
subject to:  Σ_{l=1}^k fl(x) = 0,  fl(x) ≥ −1, ∀l ∈ [1, k]

Consistency of the LLW Formulation 41

slide-67
SLIDE 67

Lemma 2. The minimizer f∗ of E[Σ_{j≠Y} [1 + fj(X)]+|X = x] = Σ_{l=1}^k Σ_{j≠l} Pl(x)[1 + fj(x)]+ subject to Σ_{j=1}^k fj(x) = 0 satisfies the following: f∗j(x) = k − 1 if j = argmaxj Pj(x), and −1 otherwise.

Proof

  • The minimization can be reduced to: (proof omitted)

max_f  Σ_{l=1}^k Pl(x)fl(x)
subject to:  Σ_{l=1}^k fl(x) = 0,  fl(x) ≥ −1, ∀l ∈ [1, k]

  • The solution of the maximization above:

satisfies f∗j(x) = k − 1 if j = argmaxj Pj(x), and −1 otherwise

  • The LLW formulation is Fisher consistent

Consistency of the LLW Formulation 41

slide-68
SLIDE 68

Inconsistency of the WW Formulation

  • For any fixed X = x:

Minimizing E[VWW(f(X, Y))] = E[Σ_{j≠Y} [1 − (fY(X) − fj(X))]+] is equal to minimizing Σ_{l=1}^k Σ_{j≠l} Pl(x)[1 − (fl(x) − fj(x))]+

Inconsistency of the WW Formulation 42

slide-69
SLIDE 69

Inconsistency of the WW Formulation

  • For any fixed X = x:

Minimizing E[VWW(f(X, Y))] = E[Σ_{j≠Y} [1 − (fY(X) − fj(X))]+] is equal to minimizing Σ_{l=1}^k Σ_{j≠l} Pl(x)[1 − (fl(x) − fj(x))]+

  • We focus on the case where k = 3, and find the minimizer f∗

Lemma 3. Consider the case where k = 3 with 1/2 > P1 > P2 > P3. The minimizer f∗ = (f∗1, f∗2, f∗3) of E[Σ_{j≠Y} [1 − (fY(X) − fj(X))]+|X = x] = Σ_{l=1}^k Σ_{j≠l} Pl(x)[1 − (fl(x) − fj(x))]+ is the following:
(1) If P2 = 1/3, any f∗ satisfying f∗1 ≥ f∗2 ≥ f∗3 and f∗1 − f∗3 = 1.
(2) If P2 > 1/3, any f∗ satisfying f∗1 ≥ f∗2 ≥ f∗3, f∗1 = f∗2 and f∗2 − f∗3 = 1.
(3) If P2 < 1/3, any f∗ satisfying f∗1 ≥ f∗2 ≥ f∗3, f∗2 = f∗3 and f∗1 − f∗2 = 1.

Inconsistency of the WW Formulation 42

slide-70
SLIDE 70

Lemma 3. Consider the case where k = 3 with 1/2 > P1 > P2 > P3. The minimizer f∗ = (f∗1, f∗2, f∗3) of E[Σ_{j≠Y} [1 − (fY(X) − fj(X))]+|X = x] = Σ_{l=1}^k Σ_{j≠l} Pl(x)[1 − (fl(x) − fj(x))]+ is the following:
(1) If P2 = 1/3, any f∗ satisfying f∗1 ≥ f∗2 ≥ f∗3 and f∗1 − f∗3 = 1.
(2) If P2 > 1/3, any f∗ satisfying f∗1 ≥ f∗2 ≥ f∗3, f∗1 = f∗2 and f∗2 − f∗3 = 1.
(3) If P2 < 1/3, any f∗ satisfying f∗1 ≥ f∗2 ≥ f∗3, f∗2 = f∗3 and f∗1 − f∗2 = 1.

From Lemma 3:

  • In the case of k = 3 with 1/2 > P1 > P2 > P3
  • The WW formulation is Fisher consistent only when P2 < 1/3

Inconsistency of the WW Formulation 43

slide-71
SLIDE 71

Inconsistency of the CS Formulation

  • Denote g(f(x), y) = {fy(x) − fj(x); j ≠ y}

The CS loss can be rewritten as: [1 − min g(f(x), y)]+

  • For any fixed X = x:

Minimizing E[VCS(f(X, Y))] = E[[1 − minj (fY(X) − fj(X))]+] is equal to minimizing Σ_{l=1}^k Pl(x)[1 − min g(f(x), l)]+

Inconsistency of the CS Formulation 44

slide-72
SLIDE 72

Inconsistency of the CS Formulation

  • Denote g(f(x), y) = {fy(x) − fj(x); j ≠ y}

The CS loss can be rewritten as: [1 − min g(f(x), y)]+

  • For any fixed X = x:

Minimizing E[VCS(f(X, Y))] = E[[1 − minj (fY(X) − fj(X))]+] is equal to minimizing Σ_{l=1}^k Pl(x)[1 − min g(f(x), l)]+

  • We want to find properties of the minimizer f∗

Lemma 4. The minimizer f∗ of E[[1 − minj (fY(X) − fj(X))]+|X = x] subject to Σ_{j=1}^k fj(x) = 0 satisfies the following properties:
(1) If maxj Pj > 1/2, then argmaxj f∗j = argmaxj Pj and min g(f∗(x), argmaxj f∗j) = 1.
(2) If maxj Pj < 1/2, then f∗ = 0.

Inconsistency of the CS Formulation 44

slide-73
SLIDE 73

Lemma 4. The minimizer f∗ of E[[1 − minj (fY(X) − fj(X))]+|X = x] subject to Σ_{j=1}^k fj(x) = 0 satisfies the following properties:
(1) If maxj Pj > 1/2, then argmaxj f∗j = argmaxj Pj and min g(f∗(x), argmaxj f∗j) = 1.
(2) If maxj Pj < 1/2, then f∗ = 0.

From Lemma 4:

  • For the problem with k > 2, the existence of a dominating class (Pj > 1/2) cannot be guaranteed
  • If maxj Pj < 1/2 for a given x, then f∗(x) = 0

In this case argmaxj fj(x) cannot be uniquely determined

  • The CS formulation is Fisher consistent only when there is a dominating class

Inconsistency of the CS Formulation 45

slide-74
SLIDE 74

Modification of the Inconsistent Formulations

  • B. Modification of the Inconsistent Formulations
  • 1. Modification of the Naive Formulation
  • 2. Modification of the WW Formulation
  • 3. Modification of the CS Formulation

Liu, Y. Fisher consistency of multicategory support vector machines in International Conference on Artificial Intelligence and Statistics (2007), 291–298.

Modification of the Inconsistent Formulations 46

slide-75
SLIDE 75

Modification of the Naive Formulation

Reduced problem in the Naive Formulation (Inconsistent Loss)

max_f  Σ_{l=1}^k Pl(x)fl(x)
subject to:  Σ_{l=1}^k fl(x) = 0,  fl(x) ≤ 1, ∀l ∈ [1, k]

Reduced problem in the LLW Formulation (Consistent Loss)

max_f  Σ_{l=1}^k Pl(x)fl(x)
subject to:  Σ_{l=1}^k fl(x) = 0,  fl(x) ≥ −1, ∀l ∈ [1, k]

→ The only difference is the constraint for fl(x)

Modification of the Naive Formulation 47

slide-76
SLIDE 76

Modification of the Naive Formulation

  • If we add an additional constraint fl(x) ≥ −1/(k−1), ∀l ∈ [1, k]

to the Naive formulation, the minimizer becomes: f∗j(x) = 1 if j = argmaxj Pj(x), and −1/(k−1) otherwise

which indicates consistency.

Modification of the Naive Formulation 48

slide-77
SLIDE 77

Modification of the Naive Formulation

  • If we add an additional constraint fl(x) ≥ −1/(k−1), ∀l ∈ [1, k]

to the Naive formulation, the minimizer becomes: f∗j(x) = 1 if j = argmaxj Pj(x), and −1/(k−1) otherwise

which indicates consistency.

  • By rescaling the constant, we get the following consistent loss:

VConsistent-Naive(f(X, Y)) = [k − 1 − fy(x)]+
subject to:  Σ_{j=1}^k fj(x) = 0;  fl(x) ≥ −1, ∀l ∈ [1, k]

Modification of the Naive Formulation 48

slide-78
SLIDE 78

Modification of the WW Formulation

  • Note that the WW loss:

VWW(f(X, Y)) = Σ_{j≠y} [1 − (fy(x) − fj(x))]+

  • Add a new constraint −1 ≤ fj(x) ≤ k − 1 and change the constant part;

the loss reduces to: V(f(X, Y)) = k[k − 1 − fy(x)]+
subject to:  Σ_{j=1}^k fj(x) = 0;  fl(x) ≥ −1, ∀l ∈ [1, k]

  • The loss is equivalent to the Consistent-Naive formulation.

Therefore it is Fisher consistent.

Modification of the WW Formulation 49

slide-79
SLIDE 79

Modification of the WW Formulation : Optimization

  • The constraint −1 ≤ fj(x) ≤ k − 1, ∀j can be difficult to enforce for all possible x in the feature space
  • It is suggested that we restrict the constraint to the training data points only.

min_f  ½ Σ_{j=1}^k ‖fj‖² − C Σ_{i=1}^n fyi(xi)
subject to:  Σ_{j=1}^k fj(xi) = 0;  fj(xi) ≥ −1;  ∀j ∈ [1, k], i ∈ [1, n].

Modification of the WW Formulation 50

slide-80
SLIDE 80

Modification of the WW Formulation : Optimization

  • The constraint −1 ≤ fj(x) ≤ k − 1, ∀j can be difficult to enforce for all possible x in the feature space
  • It is suggested that we restrict the constraint to the training data points only.

min_f  ½ Σ_{j=1}^k ‖fj‖² − C Σ_{i=1}^n fyi(xi)
subject to:  Σ_{j=1}^k fj(xi) = 0;  fj(xi) ≥ −1;  ∀j ∈ [1, k], i ∈ [1, n].

  • To better understand the formulation above, we analyze the binary

case version (y ∈ {±1})

Modification of the WW Formulation 50

slide-81
SLIDE 81

An example of standard binary SVM solution (left) and modified WW formulation solution (right) in a two dimensional dataset.

Liu, Y. Fisher consistency of multicategory support vector machines in International Conference on Artificial Intelligence and Statistics (2007), 291–298.

Modification of the WW Formulation 51

slide-82
SLIDE 82

Modification of the CS Formulation

  • The CS formulation cannot be easily modified by adding a bounded

constraint as in the WW formulation

  • We explore the idea of truncating the hinge loss

Modification of the CS Formulation 52

slide-83
SLIDE 83

Function plot of H1(u) (left), Hs(u) (middle), and Ts(u) (right)

Liu, Y. Fisher consistency of multicategory support vector machines in International Conference on Artificial Intelligence and Statistics (2007), 291–298.

Modification of the CS Formulation 53

slide-84
SLIDE 84

Modification of the CS Formulation

  • For any s ≤ 0, it can be proven that the truncated version of the CS

formulation is Fisher consistent, even when there is no dominating class

Modification of the CS Formulation 54
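A small numpy sketch of the truncation idea above; the definitions Hs(u) = [s − u]+ and Ts(u) = H1(u) − Hs(u) are an assumption based on the truncated-hinge-loss literature, not taken from the slides.

```python
import numpy as np

def H(u, s=1.0):
    """Hinge-type function H_s(u) = [s - u]_+."""
    return np.maximum(0.0, s - u)

def T(u, s=-0.5):
    """Truncated hinge loss T_s(u) = H_1(u) - H_s(u): flat at 1 - s for u <= s."""
    return H(u, 1.0) - H(u, s)

u = np.linspace(-2, 2, 9)
print(T(u, s=-0.5))   # bounded above by 1 - s = 1.5, zero for u >= 1
```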

slide-85
SLIDE 85

Experiments

slide-86
SLIDE 86

Experiments

  • A. Artificial Benchmark Problem
  • 1. Artificial Benchmark Setup
  • 2. Benchmark Result

Dogan, U. et al. A Unified View on Multi-class Support Vector Classification. The Journal of Machine Learning Research (2015).

Experiments 55

slide-87
SLIDE 87

Experiments

  • A. Artificial Benchmark Problem
  • 1. Artificial Benchmark Setup
  • 2. Benchmark Result
  • B. Empirical Comparison
  • 1. Experiment Setup
  • 2. Experiment Result

Dogan, U. et al. A Unified View on Multi-class Support Vector Classification. The Journal of Machine Learning Research (2015).

Experiments 55

slide-88
SLIDE 88

Artificial Benchmark Setup

  • Help understand when and why some formulations deliver

substantially sub-optimal solutions

Artificial Benchmark Problem 56

slide-89
SLIDE 89

Artificial Benchmark Setup

  • Help understand when and why some formulations deliver

substantially sub-optimal solutions

  • Domain: X = S1 = {x ∈ R2 | ‖x‖ = 1} → the unit circle
  • The circle is parameterized using:

β(t) = (cos(t · π/10), sin(t · π/10)) where t ∈ [0, 20]

Artificial Benchmark Problem 56

slide-90
SLIDE 90

Artificial Benchmark Setup

  • Help understand when and why some formulations deliver

substantially sub-optimal solutions

  • Domain: X = S1 = {x ∈ R2 | ‖x‖ = 1} → the unit circle
  • The circle is parameterized using:

β(t) = (cos(t · π/10), sin(t · π/10)) where t ∈ [0, 20]

  • 3-class classification, Y = {1, 2, 3}

Artificial Benchmark Problem 56

slide-91
SLIDE 91

Artificial Benchmark Setup

  • Noise-less problem
  • The label y is drawn uniformly from Y
  • Then x is drawn uniformly at random from

sector Xy

Sectors: X1 = β([0, 5)), X2 = β([5, 11)), and X3 = β([11, 20))

  • Bayes-optimal prediction:

Predict label y on sector Xy

Artificial Benchmark Problem 57

slide-92
SLIDE 92

Artificial Benchmark Setup

  • Noisy problem
  • The same step as in the noise-less problem
  • Reassign 90% of the labels uniformly at

random

  • Therefore, the distribution of X remains unchanged

The conditional distribution of the label given a point x changes: conditioned on x ∈ Xz, the event y = z has probability 40%, while the other two classes each have probability 30%

  • Bayes-optimal prediction:

Predict label y on sector Xy

Artificial Benchmark Problem 58
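A small numpy sketch that generates the benchmark data described above (noise-free and noisy variants); the sampling code, random seed, and sample sizes are illustrative, with the sector boundaries taken from the previous slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def beta(t):
    """Map t in [0, 20] to a point on the unit circle."""
    return np.stack([np.cos(t * np.pi / 10), np.sin(t * np.pi / 10)], axis=-1)

# Sectors X_1 = beta([0, 5)), X_2 = beta([5, 11)), X_3 = beta([11, 20))
SECTORS = {1: (0, 5), 2: (5, 11), 3: (11, 20)}

def sample(n, noise=False):
    y = rng.integers(1, 4, size=n)                          # label drawn uniformly from {1, 2, 3}
    t = np.array([rng.uniform(*SECTORS[label]) for label in y])
    X = beta(t)                                             # x drawn uniformly from sector X_y
    if noise:                                               # reassign 90% of labels uniformly at random
        flip = rng.random(n) < 0.9
        y = np.where(flip, rng.integers(1, 4, size=n), y)
    return X, y

X_clean, y_clean = sample(100)
X_noisy, y_noisy = sample(500, noise=True)
```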

slide-93
SLIDE 93

Artificial Benchmark Result

Multi-class SVM Loss Review:

  • 1. LLW loss:  VLLW(f(X, Y)) = Σ_{j≠y} [1 + fj(x)]+
  • 2. WW loss:   VWW(f(X, Y)) = Σ_{j≠y} [1 − (fy(x) − fj(x))]+
  • 3. CS loss:   VCS(f(X, Y)) = [1 − min_j (fy(x) − fj(x))]+

WW and CS: relative potential differences, i.e. (fy(x) − fj(x))
LLW: absolute potential values, i.e. fj(x)

Artificial Benchmark Problem 59

slide-94
SLIDE 94

Artificial Benchmark Result

Multi-class SVM Loss Review:

  • 1. LLW loss:  VLLW(f(X, Y)) = Σ_{j≠y} [1 + fj(x)]+
  • 2. WW loss:   VWW(f(X, Y)) = Σ_{j≠y} [1 − (fy(x) − fj(x))]+
  • 3. CS loss:   VCS(f(X, Y)) = [1 − min_j (fy(x) − fj(x))]+

WW and CS: relative potential differences, i.e. (fy(x) − fj(x))
LLW: absolute potential values, i.e. fj(x)
OVA: k binary classifiers, where the loss of each classifier depends on the potential fj(x). Therefore, the loss for OVA can be viewed as a summation over absolute potential value losses.

Artificial Benchmark Problem 59

slide-95
SLIDE 95

Noise-less problem

Sector separators: Bayes-optimal predictor. Colors: blue = class 1, green = class 2, red = class 3. Points outside the circle: 100 training samples. Colored circles: classifier predictions for C = 10^n, n ∈ {0, 1, 2, 3, 4}, from inner to outer circles.

Dogan, U. et al. A Unified View on Multi-class Support Vector Classification. The Journal of Machine Learning Research (2015).

Artificial Benchmark Problem 60

slide-96
SLIDE 96

Noise-less problem results

  • Sub-optimal solution of absolute potential values losses

(LLW and OVA)

  • Both the LLW and OVA formulations give sub-optimal solutions

Artificial Benchmark Problem 61

slide-97
SLIDE 97

Noise-less problem results

  • Sub-optimal solution of absolute potential values losses

(LLW and OVA)

  • Both the LLW and OVA formulations give sub-optimal solutions
  • Fisher consistency property of the LLW formulation does not help

Artificial Benchmark Problem 61

slide-98
SLIDE 98

Noise-less problem results

  • Sub-optimal solution of absolute potential values losses

(LLW and OVA)

  • Both the LLW and OVA formulations give sub-optimal solutions
  • Fisher consistency property of the LLW formulation does not help
  • Dogan claimed that the sub-optimal solutions are caused by the

absolute potential values used in the loss construction, which are not compatible with the form of the decision function.

Artificial Benchmark Problem 61

slide-99
SLIDE 99

Noisy problem

Sector separators: Bayes-optimal predictor. Colors: blue = class 1, green = class 2, red = class 3. Points outside the circle: 500 training samples. Colored circles: classifier predictions for C = 10^n, n ∈ {−4, −3, −2, −1, 0}, from inner to outer circles.

Dogan, U. et al. A Unified View on Multi-class Support Vector Classification. The Journal of Machine Learning Research (2015).

Artificial Benchmark Problem 62

slide-100
SLIDE 100

Review of Lemma 4. in the CS Formulation

Lemma 4. The minimizer f∗ of E[[1 − minj (fY(X) − fj(X))]+|X = x] subject to Σ_{j=1}^k fj(x) = 0 satisfies the following properties:
(1) If maxj Pj > 1/2, then argmaxj f∗j = argmaxj Pj and min g(f∗(x), argmaxj f∗j) = 1.
(2) If maxj Pj < 1/2, then f∗ = 0.

Artificial Benchmark Problem 63

slide-101
SLIDE 101

Experiment Setup

  • 17 datasets from UCI ML repository and libsvm’s collection

Empirical Comparison 64

slide-102
SLIDE 102

Experiment Setup

  • 17 datasets from UCI ML repository and libsvm’s collection
  • Data pre-processing:

Rescale to unit variance (based on the training statistics)

  • Model Selection (for selecting C):

Five-fold cross-validation, repeated ten times

Empirical Comparison 64

slide-103
SLIDE 103

Experiment Setup

  • 17 datasets from UCI ML repository and libsvm’s collection
  • Data pre-processing:

Rescale to unit variance (based on the training statistics)

  • Model Selection (for selecting C):

Five-fold cross-validation, repeated ten times

  • Evaluation:
  • 100 different random splits of training and testing data
  • The setup yields 100 different testing accuracies
  • Paired U-tests at significance level 0.01

Empirical Comparison 64
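A sketch of the model-selection step above (five-fold cross-validation repeated ten times to pick C), assuming scikit-learn; the C grid and the LinearSVC base classifier are illustrative choices, not the exact setup in the paper.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.svm import LinearSVC

def select_C(X_train, y_train, C_grid=(0.01, 0.1, 1, 10, 100)):
    """Pick C by five-fold cross-validation repeated ten times."""
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
    search = GridSearchCV(LinearSVC(), param_grid={"C": list(C_grid)}, cv=cv)
    search.fit(X_train, y_train)
    return search.best_params_["C"], search.best_estimator_
```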

slide-104
SLIDE 104

Datasets

Dataset        Classes   Training data   Testing data
Covertype          7        406708          174304
Letter            26         14000            6000
News-20           20         14000            6000
Sector           105          6412            3207
Usps              10          7291            2007
Abalone           27          3133            1044
Car                4          1209             519
Glass              6           149              65
Iris               3           105              45
Opt. Digits       10          3823            1797
Page Blocks        5          3831            1642
Sat                7          4435            2000
Segment            7          1617             693
Soy Bean          19           214              93
Vehicle            4           592             254
Red wine          10          1119             480
White wine        10          3429            1469

Empirical Comparison 65

slide-105
SLIDE 105

Empirical Result

Dataset        OVA               WW                CS                LLW
Covertype      50.59 (±5.49)     70.55 (±0.09)     45.73 (±5.88)     21.87 (±23.19)
Letter         63.69 (±0.48)     69.39 (±0.63)     76.59 (±0.61)     12.78 (±0.40)
News-20        85.36 (±0.32)     85.13 (±0.15)     85.17 (±0.32)     86.71 (±0.39)
Sector         94.53 (±0.22)     94.10 (±0.33)     94.80 (±0.29)     94.82 (±0.28)
Usps           94.50 (±0.39)     94.46 (±0.57)     95.26 (±0.46)     78.18 (±5.27)
Abalone        18.95 (±0.86)     21.70 (±1.30)     14.12 (±1.64)     16.56 (±1.17)
Car            71.69 (±1.73)     73.76 (±1.68)     73.15 (±2.02)     65.34 (±12.17)
Glass          56.98 (±6.44)     61.93 (±6.63)     61.93 (±6.04)     46.78 (±6.77)
Iris           91.11 (±4.85)     95.88 (±1.71)     91.76 (±7.18)     74.65 (±7.52)
Opt. Digits    95.98 (±0.60)     96.03 (±0.37)     96.42 (±0.37)     73.56 (±2.11)
Page Blocks    70.44 (±21.20)    91.14 (±5.41)     94.20 (±2.34)     93.22 (±1.02)
Sat            75.04 (±0.96)     77.40 (±3.00)     66.87 (±9.90)     51.47 (±9.01)
Segment        92.54 (±0.75)     92.43 (±2.13)     92.43 (±2.13)     74.50 (±1.32)
Soy Bean       90.65 (±3.03)     87.75 (±3.16)     83.49 (±5.80)     77.95 (±9.97)
Vehicle        52.02 (±11.98)    72.75 (±4.13)     72.75 (±4.13)     63.21 (±10.63)
Red wine       53.38 (±2.63)     58.37 (±1.69)     55.61 (±2.47)     57.26 (±2.02)
White wine     50.73 (±1.27)     51.78 (±1.24)     50.85 (±1.12)     46.44 (±1.74)

Accuracies and standard deviations for each dataset.

Highlighted numbers: the best model and other models that are not significantly worse than the best one under a paired U-test with α = 0.01

Empirical Comparison 66

slide-106
SLIDE 106

Empirical Result

(Accuracy table repeated from SLIDE 105.)

WW : highlighted 9 times CS : highlighted 8 times

Empirical Comparison 67

slide-107
SLIDE 107

Empirical Result

(Accuracy table repeated from SLIDE 105.)

“News-20” and “Sector”: high-dimensional feature spaces (62,061 and 55,197 features, respectively)
Other datasets: rather low-dimensional feature spaces

Empirical Comparison 68

slide-108
SLIDE 108

Conclusions

slide-109
SLIDE 109

Conclusions

  • We explored the efforts to bring the success of SVM in binary

classification problems into multi-class classification problems

Conclusions 69

slide-110
SLIDE 110

Conclusions

  • We explored the efforts to bring the success of SVM in binary

classification problems into multi-class classification problems

  • We described the formulation of each model for both the learning and

prediction tasks

Conclusions 69

slide-111
SLIDE 111

Conclusions

  • We explored the efforts to bring the success of SVM in binary

classification problems into multi-class classification problems

  • We described the formulation of each model for both the learning and

prediction tasks

  • We discussed the Fisher consistency properties of the all-in-one

machine formulations

Conclusions 69

slide-112
SLIDE 112

Conclusions

  • We explored the efforts to bring the success of SVM in binary

classification problems into multi-class classification problems

  • We described the formulation of each model for both the learning and

prediction tasks

  • We discussed the Fisher consistency properties of the all-in-one

machine formulations

  • We showed the consistency of the LLW formulation and the

inconsistency of the WW and CS formulations

Conclusions 69

slide-113
SLIDE 113

Conclusions

  • We studied the modification proposed by Liu2 to make the WW and

CS formulations Fisher consistent

2Liu, Y. Fisher consistency of multicategory support vector machines in International

Conference on Artificial Intelligence and Statistics (2007), 291–298.

Conclusions 70

slide-114
SLIDE 114

Conclusions

  • We studied the modification proposed by Liu2 to make the WW and

CS formulations Fisher consistent

  • The modifications of the WW formulation:
  • Results in a new classification model which enforces all points to lie

inside the classification boundary

  • The model loses the sparsity property

2Liu, Y. Fisher consistency of multicategory support vector machines in International

Conference on Artificial Intelligence and Statistics (2007), 291–298.

Conclusions 70

slide-115
SLIDE 115

Conclusions

  • We studied the modification proposed by Liu2 to make the WW and

CS formulations Fisher consistent

  • The modifications of the WW formulation:
  • Results in a new classification model which enforces all points to lie

inside the classification boundary

  • The model loses the sparsity property
  • Sparsity is a key property in analyzing the SVM’s theoretical

properties, e.g. analyzing generalization bounds of the model

  • The effect of losing sparsity on the prediction performance needs to be analyzed for the proposed model.

2Liu, Y. Fisher consistency of multicategory support vector machines in International

Conference on Artificial Intelligence and Statistics (2007), 291–298.

Conclusions 70

slide-116
SLIDE 116

Conclusions

  • The modification of the CS formulation:
  • Introduces a truncated version of the hinge loss
  • The truncated loss fixes the inconsistency of the CS formulation

Conclusions 71

slide-117
SLIDE 117

Conclusions

  • The modification of the CS formulation:
  • Introduces a truncated version of the hinge loss
  • The truncated loss fixes the inconsistency of the CS formulation
  • The optimization is no longer convex
  • Convergence to the global optimum cannot be guaranteed

Conclusions 71

slide-118
SLIDE 118

Conclusions

  • The modification of the CS formulation:
  • Introduces a truncated version of the hinge loss
  • The truncated loss fixes the inconsistency of the CS formulation
  • The optimization is no longer convex
  • Convergence to the global optimum cannot be guaranteed
  • A local optimum solution may affect the prediction performance

Conclusions 71

slide-119
SLIDE 119

Conclusions

  • We discussed the experimental results presented in Dogan’s paper3

3Dogan, U. et al. A Unified View on Multi-class Support Vector Classification.

The Journal of Machine Learning Research (2015).

Conclusions 72

slide-120
SLIDE 120

Conclusions

  • We discussed the experimental results presented in Dogan’s paper3
  • An interesting result of the LLW formulation:

Although it has the Fisher consistency property, it performs poorly on data with low-dimensional feature spaces

  • These poor results are confirmed by both the artificial benchmark study and the empirical evaluation on real datasets

3Dogan, U. et al. A Unified View on Multi-class Support Vector Classification.

The Journal of Machine Learning Research (2015).

Conclusions 72

slide-121
SLIDE 121

Conclusions

  • We discussed the experimental results presented in Dogan’s paper3
  • An interesting result of the LLW formulation:

Although it has the Fisher consistency property, it performs poorly on data with low-dimensional feature spaces

  • These poor results are confirmed by both the artificial benchmark study and the empirical evaluation on real datasets

  • The problem is possibly caused by the construction of the LLW loss, which uses absolute potential values instead of relative potential differences.

3Dogan, U. et al. A Unified View on Multi-class Support Vector Classification.

The Journal of Machine Learning Research (2015).

Conclusions 72

slide-122
SLIDE 122

Conclusions

  • We discussed the experimental results presented in Dogan’s paper3
  • An interesting result of the LLW formulation:

Although it has the Fisher consistency property, it performs poorly on data with low-dimensional feature spaces

  • These poor results are confirmed by both the artificial benchmark study and the empirical evaluation on real datasets

  • The problem is possibly caused by the construction of the LLW loss, which uses absolute potential values instead of relative potential differences.

  • Employing the kernel trick in the LLW formulation is suggested.

3Dogan, U. et al. A Unified View on Multi-class Support Vector Classification.

The Journal of Machine Learning Research (2015).

Conclusions 72

slide-123
SLIDE 123

Conclusions

  • The WW and CS models, which are based on relative potential differences, perform well on most datasets, with a slight advantage for the WW model.

Conclusions 73

slide-124
SLIDE 124

Conclusions

  • The WW and CS models, which are based on relative potential differences, perform well on most datasets, with a slight advantage for the WW model.

  • Dogan recommends relative-potential-difference-based models for almost all applications.

Conclusions 73

slide-125
SLIDE 125

Conclusions

  • The WW and CS models, which are based on relative potential differences, perform well on most datasets, with a slight advantage for the WW model.

  • Dogan recommends relative-potential-difference-based models for almost all applications.

  • The WW formulation is preferred over the CS formulation for its slightly more stable performance.

Conclusions 73

slide-126
SLIDE 126

Conclusions

  • A new research question:

Is it possible to have a Fisher consistent formulation of multi-class SVM which performs well on datasets with low-dimensional feature spaces?

Conclusions 74

slide-127
SLIDE 127

Conclusions

  • A new research question:

Is it possible to have a Fisher consistent formulation of multi-class SVM which performs well on datasets with low-dimensional feature spaces?

  • The answer might be:

To construct a Fisher consistent loss which uses relative potential differences rather than absolute potential values

Conclusions 74

slide-128
SLIDE 128

Conclusions

  • A new research question:

Is it possible to have a Fisher consistent formulation of multi-class SVM which performs well on datasets with low-dimensional feature spaces?

  • The answer might be:

To construct a Fisher consistent loss which uses relative potential differences rather than absolute potential values

  • Follow-up research needs to be conducted

Conclusions 74

slide-129
SLIDE 129

Thank You!

Conclusions 74