Foundations of Machine Learning
Multi-Class Classification
Motivation
Real-world problems often have multiple classes: text, speech, image, biological sequences. Algorithms studied so far: designed for binary classification problems. How do we design multi-class classification algorithms?
- can the algorithms used for binary classification be generalized to multi-class classification?
- can we reduce multi-class classification to binary classification?
Multi-Class Classification Problem
Training data: sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times Y)^m$ drawn i.i.d. from $X$ according to some distribution $D$,
- mono-label case: $\mathrm{Card}(Y) = k$.
- multi-label case: $Y = \{-1, +1\}^k$.
Problem: find classifier $h \colon X \to Y$ in $H$ with small generalization error,
- mono-label case: $R(h) = \mathbb{E}_{x \sim D}\big[1_{h(x) \neq f(x)}\big]$.
- multi-label case: $R(h) = \mathbb{E}_{x \sim D}\big[\tfrac{1}{k} \sum_{l=1}^{k} 1_{[h(x)]_l \neq [f(x)]_l}\big]$.
Notes
In most tasks considered, the number of classes satisfies $k \leq 100$. For large $k$, the problem is often not treated as a multi-class classification problem (ranking or density estimation instead, e.g., automatic speech recognition). Computational efficiency issues arise for larger $k$. In general, classes are not balanced.
Multi-Class Classification - Margin
Hypothesis set $H$:
- functions $h \colon X \times Y \to \mathbb{R}$.
- label returned: $x \mapsto \operatorname*{argmax}_{y \in Y} h(x, y)$.
Margin:
- $\rho_h(x, y) = h(x, y) - \max_{y' \neq y} h(x, y')$.
- error: $1_{\rho_h(x, y) \leq 0} \leq \Phi_\rho(\rho_h(x, y))$.
- empirical margin loss: $\widehat{R}_\rho(h) = \frac{1}{m} \sum_{i=1}^{m} \Phi_\rho(\rho_h(x_i, y_i))$.
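A minimal numpy sketch (not part of the lecture) of these definitions: it computes the margins $\rho_h(x_i, y_i)$ from a matrix of scores $h(x_i, y)$ and the empirical margin loss, using the standard ramp surrogate $\Phi_\rho(u) = \min(1, \max(0, 1 - u/\rho))$; the score values are illustrative assumptions.

```python
import numpy as np

def multiclass_margins(scores, labels):
    """scores: (m, k) array of h(x_i, y); labels: (m,) true classes in [0, k)."""
    m = scores.shape[0]
    true_scores = scores[np.arange(m), labels]          # h(x_i, y_i)
    masked = scores.copy()
    masked[np.arange(m), labels] = -np.inf              # exclude the true class
    runner_up = masked.max(axis=1)                      # max_{y' != y_i} h(x_i, y')
    return true_scores - runner_up                      # rho_h(x_i, y_i)

def empirical_margin_loss(scores, labels, rho=1.0):
    margins = multiclass_margins(scores, labels)
    phi = np.clip(1.0 - margins / rho, 0.0, 1.0)        # Phi_rho(rho_h(x_i, y_i))
    return phi.mean()                                   # (1/m) sum_i Phi_rho(...)

# Hypothetical scores for m = 3 points and k = 4 classes.
scores = np.array([[2.0, 0.5, -1.0,  0.0],
                   [0.1, 0.3,  0.2, -0.5],
                   [1.0, 1.2,  0.8,  0.9]])
labels = np.array([0, 1, 0])
print(empirical_margin_loss(scores, labels, rho=1.0))
```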
Multi-Class Margin Bound
Theorem: let $H \subseteq \mathbb{R}^{X \times Y}$ with $Y = \{1, \ldots, k\}$. Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following multi-class classification bound holds for all $h \in H$:
$$R(h) \leq \widehat{R}_\rho(h) + \frac{4k}{\rho}\,\mathfrak{R}_m(\Pi_1(H)) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}},$$
with $\Pi_1(H) = \{x \mapsto h(x, y) \colon y \in Y,\ h \in H\}$.
(MM et al., 2012; Kuznetsov, MM, and Syed, 2014)
Kernel Based Hypotheses
Hypothesis set $H_{K,p}$:
- $\Phi$: feature mapping associated to PDS kernel $K$.
- functions $(x, y) \mapsto w_y \cdot \Phi(x)$, $y \in \{1, \ldots, k\}$.
- label returned: $x \mapsto \operatorname*{argmax}_{y \in \{1, \ldots, k\}} w_y \cdot \Phi(x)$.
- for any $p \geq 1$,
$$H_{K,p} = \big\{(x, y) \in X \times [1, k] \mapsto w_y \cdot \Phi(x) \colon W = (w_1, \ldots, w_k),\ \|W\|_{\mathbb{H},p} \leq \Lambda\big\}.$$
Multi-Class Margin Bound - Kernels
Theorem: let $K \colon X \times X \to \mathbb{R}$ be a PDS kernel and let $\Phi \colon X \to \mathbb{H}$ be a feature mapping associated to $K$. Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following multi-class bound holds for all $h \in H_{K,p}$:
$$R(h) \leq \widehat{R}_\rho(h) + 4k \sqrt{\frac{r^2 \Lambda^2}{\rho^2 m}} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}},$$
where $r^2 = \sup_{x \in X} K(x, x)$.
(MM et al., 2012)
Approaches
Single classifier:
- Multi-class SVMs.
- AdaBoost.MH.
- Conditional Maxent.
- Decision trees.
Combination of binary classifiers:
- One-vs-all.
- One-vs-one.
- Error-correcting codes.
Multi-Class SVMs
Optimization problem (Weston and Watkins, 1999; Crammer and Singer, 2001):
$$\min_{w, \xi}\ \frac{1}{2} \sum_{l=1}^{k} \|w_l\|^2 + C \sum_{i=1}^{m} \xi_i$$
$$\text{subject to: } w_{y_i} \cdot x_i + \delta_{y_i, l} \geq w_l \cdot x_i + 1 - \xi_i, \quad \forall (i, l) \in [1, m] \times Y.$$
Decision function:
$$h \colon x \mapsto \operatorname*{argmax}_{l \in Y} (w_l \cdot x).$$
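For reference, scikit-learn's LinearSVC exposes a Crammer-Singer style single-machine multi-class SVM through its multi_class option; the following is a minimal sketch assuming that library and the iris data set, not the lecture's own implementation.

```python
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
# One weight vector w_l per class, trained jointly with coupled constraints.
clf = LinearSVC(multi_class="crammer_singer", C=1.0, max_iter=10000)
clf.fit(X, y)
print(clf.coef_.shape)      # (k, n_features): one row w_l per class
print(clf.predict(X[:5]))   # h(x) = argmax_l  w_l . x
```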
Notes
Directly based on generalization bounds. Comparison with (Weston and Watkins, 1999): a single slack variable per point, i.e., the maximum of the slack variables (penalty for the worst class): $\sum_{l=1}^{k} \xi_{il} \to \max_{l=1}^{k} \xi_{il}$. PDS kernel instead of inner product. Optimization: complex constraints, $mk$-size problem.
- specific solution based on decomposition into $m$ disjoint sets of constraints (Crammer and Singer, 2001).
Dual Formulation
Optimization problem:
$$\max_{\alpha = [\alpha_{ij}]}\ \sum_{i=1}^{m} \alpha_i \cdot e_{y_i} - \frac{1}{2} \sum_{i, j=1}^{m} (\alpha_i \cdot \alpha_j)(x_i \cdot x_j)$$
$$\text{subject to: } \forall i \in [1, m],\ (0 \leq \alpha_{i y_i} \leq C) \wedge (\forall j \neq y_i,\ \alpha_{ij} \leq 0) \wedge (\alpha_i \cdot \mathbf{1} = 0),$$
where $\alpha_i$ denotes the $i$th row of the matrix $\alpha \in \mathbb{R}^{m \times k}$. Decision function:
$$h(x) = \operatorname*{argmax}_{l \in [1, k]} \sum_{i=1}^{m} \alpha_{il} (x_i \cdot x).$$
AdaBoost
Training data (multi-label case): $(x_1, y_1), \ldots, (x_m, y_m) \in X \times \{-1, +1\}^k$.
Reduction to binary classification (Schapire and Singer, 2000):
- each example $(x_i, y_i)$ leads to $k$ binary examples (see the sketch below):
$$(x_i, y_i) \to \big((x_i, 1), y_i[1]\big), \ldots, \big((x_i, k), y_i[k]\big), \quad i \in [1, m].$$
- apply AdaBoost to the resulting problem.
- choice of $\alpha_t$.
Computational cost: $mk$ distribution updates at each round.
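A small illustrative sketch of the reduction (the data below is made up): every multi-label example $(x_i, y_i)$ with $y_i \in \{-1, +1\}^k$ is expanded into the $k$ binary examples $((x_i, 1), y_i[1]), \ldots, ((x_i, k), y_i[k])$.

```python
def expand_to_binary(X, Y):
    """X: list of m points; Y: list of m label vectors in {-1, +1}^k."""
    return [((x, l), y_l)                      # binary example ((x_i, l), y_i[l])
            for x, y in zip(X, Y)
            for l, y_l in enumerate(y, start=1)]

sample = expand_to_binary([[0.2, 1.3], [1.0, -0.4]],
                          [[+1, -1, -1], [-1, +1, -1]])
print(len(sample))   # m * k = 6 binary examples
```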
AdaBoost.MH
AdaBoost.MH(S = ((x_1, y_1), ..., (x_m, y_m)))
 1  for i ← 1 to m do
 2      for l ← 1 to k do
 3          D_1(i, l) ← 1/(mk)
 4  for t ← 1 to T do
 5      h_t ← base classifier in H with small error ε_t = Pr_{D_t}[h_t(x_i, l) ≠ y_i[l]]
 6      α_t ← choice of α minimizing Z_t
 7      Z_t ← Σ_{i,l} D_t(i, l) exp(−α_t y_i[l] h_t(x_i, l))
 8      for i ← 1 to m do
 9          for l ← 1 to k do
10              D_{t+1}(i, l) ← D_t(i, l) exp(−α_t y_i[l] h_t(x_i, l)) / Z_t
11  f_T ← Σ_{t=1}^T α_t h_t
12  return h_T = sgn(f_T)

with base hypothesis set $H \subseteq \{-1, +1\}^{X \times Y}$.
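A short runnable sketch of AdaBoost.MH following the pseudocode above. Using decision stumps over the pair $(x, l)$ (with $l$ one-hot encoded) as the base hypothesis set, and scikit-learn as the stump learner, are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_mh(X, Y, T=50):
    """X: (m, d) array; Y: (m, k) array with entries in {-1, +1}."""
    X, Y = np.asarray(X, float), np.asarray(Y, int)
    m, k = Y.shape
    # Expanded binary sample: the features of (x_i, l) are [x_i, one_hot(l)].
    Xe = np.hstack([np.repeat(X, k, axis=0), np.tile(np.eye(k), (m, 1))])
    ye = Y.reshape(-1)                       # y_i[l] in {-1, +1}
    D = np.full(m * k, 1.0 / (m * k))        # D_1(i, l) = 1/(mk)
    stumps, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(Xe, ye, sample_weight=D)
        pred = h.predict(Xe)
        eps = np.clip(D[pred != ye].sum(), 1e-10, 1 - 1e-10)   # error under D_t
        alpha = 0.5 * np.log((1.0 - eps) / eps)                # alpha_t
        D *= np.exp(-alpha * ye * pred)                        # unnormalized update
        D /= D.sum()                                           # divide by Z_t
        stumps.append(h)
        alphas.append(alpha)

    def f(Xnew):
        Xnew = np.asarray(Xnew, float)
        n = Xnew.shape[0]
        Xne = np.hstack([np.repeat(Xnew, k, axis=0), np.tile(np.eye(k), (n, 1))])
        scores = sum(a * h.predict(Xne) for a, h in zip(alphas, stumps))
        return scores.reshape(n, k)          # f_T(x, l) for every class l
    return f

# Multi-label prediction: sign of f_T; mono-label prediction: argmax_l f_T(x, l).
```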
Bound on Empirical Error
Theorem: the empirical error of the classifier output by AdaBoost.MH verifies:
$$\widehat{R}(h) \leq \prod_{t=1}^{T} Z_t.$$
Proof: similar to the proof for AdaBoost.
Choice of $\alpha_t$:
- for $H \subseteq \{-1, +1\}^{X \times Y}$, as for AdaBoost, $\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$.
- for $H \subseteq [-1, +1]^{X \times Y}$, same choice: minimizes the upper bound.
- other cases: numerical/approximation method.
Notes
Objective function:
$$F(\alpha) = \sum_{i=1}^{m} \sum_{l=1}^{k} e^{-y_i[l] f_n(x_i, l)} = \sum_{i=1}^{m} \sum_{l=1}^{k} e^{-y_i[l] \sum_{t=1}^{n} \alpha_t h_t(x_i, l)}.$$
All comments and analysis given for AdaBoost apply here. Alternative: AdaBoost.MR, which coincides with a special case of RankBoost (see the ranking lecture).
Decision Trees
[Figure: a decision tree with node questions X1 < a1, X1 < a2, X2 < a3, X2 < a4 and leaf regions R1, ..., R5, shown alongside the corresponding axis-aligned partition of the (X1, X2) plane into R1, ..., R5.]
Different Types of Questions
Decision trees:
- $X \in \{\text{blue}, \text{white}, \text{red}\}$: categorical questions.
- $X \leq a$: continuous variables.
Binary space partition (BSP) trees:
- $\sum_{i=1}^{n} \alpha_i X_i \leq a$: partitioning with convex polyhedral regions.
Sphere trees:
- $\|X - a_0\| \leq a$: partitioning with pieces of spheres.
Hypotheses
In each region $R_t$:
- classification: majority vote, ties broken arbitrarily,
$$y_t = \operatorname*{argmax}_{y \in Y} \big|\{x_i \in R_t \colon i \in [1, m],\ y_i = y\}\big|.$$
- regression: average value,
$$y_t = \frac{1}{|S \cap R_t|} \sum_{x_i \in R_t,\ i \in [1, m]} y_i.$$
Form of hypotheses:
$$h \colon x \mapsto \sum_{t} y_t\, 1_{x \in R_t}.$$
Training
Problem: the general problem of determining a partition with minimum empirical error is NP-hard. Heuristics: greedy algorithm (a runnable sketch of one split step follows below).

Decision-Trees(S = ((x_1, y_1), ..., (x_m, y_m)))
 1  P ← {S}  (initial partition)
 2  for each region R ∈ P such that Pred(R) do
 3      (j, θ) ← argmin_{(j, θ)} error(R−(j, θ)) + error(R+(j, θ))
 4      P ← (P − {R}) ∪ {R−(j, θ), R+(j, θ)}
 5  return P

where, for all $j \in [1, N]$ and $\theta \in \mathbb{R}$,
$$R^{+}(j, \theta) = \{x_i \in R \colon x_i[j] \geq \theta,\ i \in [1, m]\}, \qquad R^{-}(j, \theta) = \{x_i \in R \colon x_i[j] < \theta,\ i \in [1, m]\}.$$
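A minimal sketch of a single greedy split step; the helper names (best_split, misclassification_error) are hypothetical, and the number of misclassified points is used as the error criterion for illustration.

```python
import numpy as np

def misclassification_error(y):
    """Number of points not in the majority class of the region."""
    if len(y) == 0:
        return 0
    return len(y) - np.bincount(y).max()

def best_split(X, y):
    """Scan all (feature j, threshold theta) pairs and return the pair
    minimizing error(R-(j, theta)) + error(R+(j, theta))."""
    X, y = np.asarray(X, float), np.asarray(y, int)
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            left, right = y[X[:, j] < theta], y[X[:, j] >= theta]
            err = misclassification_error(left) + misclassification_error(right)
            if err < best[2]:
                best = (j, theta, err)
    return best

X = [[0.1, 2.0], [0.4, 1.1], [0.9, 1.5], [1.2, 0.3]]
y = [0, 0, 1, 1]
print(best_split(X, y))   # e.g. (0, 0.9, 0): split feature 0 at 0.9
```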
Splitting/Stopping Criteria
Problem: larger trees overfit the training sample.
Conservative splitting:
- split a node only if the loss is reduced by some fixed value $\eta > 0$.
- issue: a seemingly bad split may dominate useful splits.
Grow-then-prune technique (CART):
- grow a very large tree, with $\mathrm{Pred}(R) \colon |R| > n_0$.
- prune the tree based on $F(T) = \mathrm{Loss}(T) + \alpha |T|$, with the parameter $\alpha \geq 0$ determined by cross-validation.
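For reference, scikit-learn's decision trees expose CART-style cost-complexity pruning through the ccp_alpha parameter; this sketch (an assumption about tooling and data set, not part of the lecture) selects the pruning parameter by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Candidate values of alpha along the pruning path of the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"ccp_alpha": path.ccp_alphas}, cv=5)
search.fit(X, y)
print(search.best_params_)   # alpha selected by cross-validation
```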
Decision Tree Tools
Most commonly used tools for learning decision trees:
- CART (classification and regression tree) (Breiman et al., 1984).
- C4.5 (Quinlan, 1986, 1993) and C5.0 (RuleQuest Research), a commercial system.
Differences between the latest versions are minor.
Approaches
Single classifier:
- SVM-type algorithm.
- AdaBoost-type algorithm.
- Conditional Maxent.
- Decision trees.
Combination of binary classifiers:
- One-vs-all.
- One-vs-one.
- Error-correcting codes.
One-vs-All
Technique:
- for each class $l \in Y$, learn a binary classifier $h_l = \mathrm{sgn}(f_l)$.
- combine the binary classifiers via a voting mechanism, typically majority vote:
$$h \colon x \mapsto \operatorname*{argmax}_{l \in Y} f_l(x).$$
Problem: poor justification (in general).
- calibration: classifier scores not comparable.
- nevertheless: simple and frequently used in practice, with computational advantages in some cases.
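A minimal one-vs-all sketch, assuming scikit-learn and the iris data set: OneVsRestClassifier trains one binary classifier $f_l$ per class and predicts the class with the highest score.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
ova = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)
print(len(ova.estimators_))   # k binary classifiers
print(ova.predict(X[:5]))     # argmax over the k per-class scores
```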
One-vs-One
Technique:
- for each pair $(l, l') \in Y^2$, $l \neq l'$, learn a binary classifier $h_{ll'} \colon X \to \{0, 1\}$.
- combine the binary classifiers via majority vote:
$$h(x) = \operatorname*{argmax}_{l' \in Y} \big|\{l \colon h_{ll'}(x) = 1\}\big|.$$
Problem:
- computational: train $k(k-1)/2$ binary classifiers.
- overfitting: the size of the training sample could become small for a given pair.
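A matching one-vs-one sketch with scikit-learn's OneVsOneClassifier (again an assumed library and data set): it trains the $k(k-1)/2$ pairwise classifiers and predicts by pairwise vote.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)
print(len(ovo.estimators_))   # k(k-1)/2 = 3 pairwise classifiers for k = 3
print(ovo.predict(X[:5]))
```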
Computational Comparison
             Training                              Testing
One-vs-all   O(k B_train(m))                       O(k B_test)
One-vs-one   O(k^2 B_train(m/k)) (on average)      O(k^2 B_test), but smaller number of support vectors per binary classifier

Time complexity for SVMs (α less than 3): training O(k m^α) for one-vs-all versus O(k^{2−α} m^α) for one-vs-one.
Error-Correcting Code Approach
Idea (Dietterich and Bakiri, 1995):
- assign an $F$-long binary code word to each class: $M = [M_{lj}] \in \{0, 1\}^{[1,k] \times [1,F]}$.
- learn a binary classifier $f_j \colon X \to \{0, 1\}$ for each column; example $x$ in class $l$ is labeled with $M_{lj}$.
- classifier output: $h \colon x \mapsto \operatorname*{argmin}_{l \in Y} d_{\mathrm{Hamming}}\big(M_l, f(x)\big)$, with $f(x) = \big(f_1(x), \ldots, f_F(x)\big)$.
Illustration
[Figure: 8 classes, code length 6. A table lists the 6-bit code word of each class; a new example x is mapped to the predictions (f_1(x), ..., f_6(x)) and assigned the class whose code word is closest.]
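A minimal sketch of the Hamming-distance decoding step; the code matrix below is a hypothetical example, not the one in the figure.

```python
import numpy as np

# Hypothetical code matrix M for k = 4 classes and code length F = 6.
M = np.array([[0, 0, 0, 1, 1, 1],
              [0, 1, 1, 0, 0, 1],
              [1, 0, 1, 0, 1, 0],
              [1, 1, 0, 1, 0, 0]])

def decode(f_x, M):
    """f_x: length-F vector of binary predictions f_1(x), ..., f_F(x);
    returns the row index l minimizing d_Hamming(M_l, f(x))."""
    distances = (M != np.asarray(f_x)).sum(axis=1)
    return int(np.argmin(distances))

print(decode([1, 0, 1, 0, 0, 0], M))   # -> 2: closest to the third code word
```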
Error-Correcting Codes - Design
Main ideas:
- independent columns: otherwise no effective discrimination.
- distance between rows: if the minimal Hamming distance between rows is $d$, then the multi-class classifier can correct up to $\lfloor \frac{d-1}{2} \rfloor$ errors.
- columns may correspond to features selected for the task.
- one-vs-all and one-vs-one (with ternary codes) are special cases.
Extensions
Matrix entries in $\{-1, 0, +1\}$ (Allwein et al., 2000):
- examples marked with $0$ are disregarded during training.
- one-vs-one also becomes a special case.
Margin loss $L$: a function of $y f(x)$, e.g., the hinge loss.
- Hamming loss decoding:
$$h(x) = \operatorname*{argmin}_{l \in \{1, \ldots, k\}} \sum_{j=1}^{F} \frac{1 - \mathrm{sgn}\big(M_{lj} f_j(x)\big)}{2}.$$
- Margin loss decoding (see the sketch below):
$$h(x) = \operatorname*{argmin}_{l \in \{1, \ldots, k\}} \sum_{j=1}^{F} L\big(M_{lj} f_j(x)\big).$$
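A minimal sketch of loss-based decoding with a ternary code matrix and the hinge loss $L(u) = \max(0, 1 - u)$; the code words and scores below are illustrative assumptions.

```python
import numpy as np

def hinge(u):
    return np.maximum(0.0, 1.0 - u)

def loss_decode(f_x, M, L=hinge):
    """f_x: real-valued scores f_1(x), ..., f_F(x); M: (k, F) matrix in {-1, 0, +1}.
    Returns argmin_l sum_j L(M_lj * f_j(x)); zero entries give L(0) regardless of f_j(x)."""
    losses = L(M * np.asarray(f_x)).sum(axis=1)
    return int(np.argmin(losses))

M = np.array([[+1, +1,  0],     # hypothetical ternary code words
              [-1,  0, +1],
              [ 0, -1, -1]])
print(loss_decode([0.8, -0.3, 1.5], M))   # -> 0 for these scores
```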
Applications
The one-vs-all approach is the most widely used. There is no clear empirical evidence of the superiority of other approaches (Rifkin and Klautau, 2004),
- except perhaps on small data sets with relatively large error rates.
Large structured multi-class problems: often treated as ranking problems (see the ranking lecture).
References
- Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113-141, 2000.
- Koby Crammer and Yoram Singer. Improved output coding for classification using continuous relaxation. In Proceedings of NIPS, 2000.
- Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265-292, 2001.
- Koby Crammer and Yoram Singer. On the learnability and design of output codes for multiclass problems. Machine Learning, 47, 2002.
- Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research (JAIR), 2:263-286, 1995.
- Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
- John C. Platt, Nello Cristianini, and John Shawe-Taylor. Large margin DAGs for multiclass classification. In Advances in Neural Information Processing Systems 12 (NIPS 1999), pp. 547-553, 2000.
- Ryan Rifkin. Everything Old Is New Again: A Fresh Look at Historical Approaches in Machine Learning. Ph.D. thesis, MIT, 2002.
- Ryan Rifkin and Aldebaro Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101-141, 2004.
- Robert E. Schapire. The boosting approach to machine learning: an overview. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick, and B. Yu, editors, Nonlinear Estimation and Classification. Springer, 2003.
- Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, 1998.
- Robert E. Schapire and Yoram Singer. BoosTexter: a boosting-based system for text categorization. Machine Learning, 39(2/3):135-168, 2000.
- Jason Weston and Chris Watkins. Support vector machines for multi-class pattern recognition. In Proceedings of the Seventh European Symposium on Artificial Neural Networks (ESANN '99), 1999.