Foundations of Machine Learning: Multi-Class Classification


  1. Foundations of Machine Learning Multi-Class Classification

  2. Motivation

Real-world problems often have multiple classes: text, speech, image, biological sequences. Algorithms studied so far: designed for binary classification problems. How do we design multi-class classification algorithms?
• Can the algorithms used for binary classification be generalized to multi-class classification?
• Can we reduce multi-class classification to binary classification?

  3. Multi-Class Classification Problem

Training data: sample drawn i.i.d. from $X$ according to some distribution $D$,
    $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times Y)^m$,
• mono-label case: $\mathrm{Card}(Y) = k$.
• multi-label case: $Y = \{-1, +1\}^k$.
Problem: find classifier $h \colon X \to Y$ in $H$ with small generalization error,
• mono-label case: $R(h) = \mathbb{E}_{x \sim D}\big[1_{h(x) \neq f(x)}\big]$.
• multi-label case: $R(h) = \mathbb{E}_{x \sim D}\Big[\frac{1}{k}\sum_{l=1}^{k} 1_{[h(x)]_l \neq [f(x)]_l}\Big]$.
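
To make the two error definitions concrete, here is a minimal sketch (NumPy assumed; the data and function names are illustrative, not from the lecture) that computes the corresponding empirical errors on a finite sample: the zero-one error in the mono-label case and the per-label Hamming error in the multi-label case.

    import numpy as np

    def mono_label_error(y_pred, y_true):
        # Mono-label case: average of the zero-one loss 1_{h(x) != f(x)} over the sample.
        return np.mean(y_pred != y_true)

    def multi_label_error(Y_pred, Y_true):
        # Multi-label case: average of (1/k) sum_l 1_{[h(x)]_l != [f(x)]_l} over the sample,
        # i.e. the fraction of wrong entries in the m x k label matrices.
        return np.mean(Y_pred != Y_true)

    # Toy sample (hypothetical): m = 4 points, k = 3 classes / labels.
    y_true = np.array([0, 2, 1, 1])
    y_pred = np.array([0, 1, 1, 1])
    print(mono_label_error(y_pred, y_true))            # 0.25

    Y_true = np.array([[+1, -1, +1], [-1, -1, +1], [+1, +1, +1], [-1, +1, -1]])
    Y_pred = np.array([[+1, +1, +1], [-1, -1, -1], [+1, +1, +1], [-1, +1, -1]])
    print(multi_label_error(Y_pred, Y_true))           # 2 wrong entries out of 12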

  4. Notes

In most tasks considered, number of classes $k \leq 100$. For large $k$, the problem is often not treated as a multi-class classification problem (ranking or density estimation, e.g., automatic speech recognition). Computational efficiency issues arise for larger $k$'s. In general, classes not balanced.

  5. Multi-Class Classification - Margin

Hypothesis set $H$:
• functions $h \colon X \times Y \to \mathbb{R}$.
• label returned: $x \mapsto \operatorname{argmax}_{y \in Y} h(x, y)$.
Margin:
• $\rho_h(x, y) = h(x, y) - \max_{y' \neq y} h(x, y')$.
• error: $1_{\rho_h(x, y) \leq 0} \leq \Phi_\rho(\rho_h(x, y))$.
• empirical margin loss:
    $\widehat{R}_\rho(h) = \frac{1}{m} \sum_{i=1}^{m} \Phi_\rho(\rho_h(x_i, y_i))$.
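
To illustrate, the sketch below (hypothetical scores; NumPy assumed) computes the multi-class margin $\rho_h(x, y)$ from a matrix of scores $h(x, y)$ and the empirical margin loss, using the common ramp choice $\Phi_\rho(u) = \min(1, \max(0, 1 - u/\rho))$ as one standard instantiation of the margin loss function, not the only possible one.

    import numpy as np

    def multiclass_margins(scores, y):
        # scores: (m, k) array with scores[i, l] = h(x_i, l); y: (m,) true labels in {0, ..., k-1}.
        # Returns rho_h(x_i, y_i) = h(x_i, y_i) - max_{y' != y_i} h(x_i, y').
        m = scores.shape[0]
        true_scores = scores[np.arange(m), y]
        masked = scores.copy()
        masked[np.arange(m), y] = -np.inf          # exclude the correct label from the max
        return true_scores - masked.max(axis=1)

    def empirical_margin_loss(scores, y, rho):
        # (1/m) sum_i Phi_rho(rho_h(x_i, y_i)) with Phi_rho(u) = min(1, max(0, 1 - u/rho)).
        margins = multiclass_margins(scores, y)
        return np.mean(np.clip(1.0 - margins / rho, 0.0, 1.0))

    # Toy example (hypothetical scores for m = 3 points, k = 3 classes).
    scores = np.array([[2.0, 0.5, -1.0],
                       [0.1, 0.3,  0.2],
                       [-1.0, 2.0, 1.5]])
    y = np.array([0, 2, 1])
    print(multiclass_margins(scores, y))           # [1.5, -0.1, 0.5]
    print(empirical_margin_loss(scores, y, rho=1.0))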

  6. Multi-Class Margin Bound (MM et al., 2012; Kuznetsov, MM, and Syed, 2014)

Theorem: let $H \subseteq \mathbb{R}^{X \times Y}$ with $Y = \{1, \ldots, k\}$. Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following multi-class classification bound holds for all $h \in H$:
    $R(h) \leq \widehat{R}_\rho(h) + \frac{4k}{\rho} \mathfrak{R}_m(\Pi_1(H)) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$,
with $\Pi_1(H) = \{x \mapsto h(x, y) \colon y \in Y, h \in H\}$.

  7. Kernel-Based Hypotheses

Hypothesis set $H_{K,p}$:
• feature mapping $\Phi$ associated to PDS kernel $K$.
• functions $(x, y) \mapsto w_y \cdot \Phi(x)$, $y \in \{1, \ldots, k\}$.
• label returned: $x \mapsto \operatorname{argmax}_{y \in \{1, \ldots, k\}} w_y \cdot \Phi(x)$.
• for any $p \geq 1$,
    $H_{K,p} = \big\{(x, y) \in X \times [1, k] \mapsto w_y \cdot \Phi(x) \colon \mathbf{W} = (w_1, \ldots, w_k)^\top, \ \|\mathbf{W}\|_{\mathbb{H},p} \leq \Lambda\big\}$.
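
A small sketch of this hypothesis class with an explicit, finite-dimensional feature map (all arrays hypothetical): the class scores are $w_y \cdot \Phi(x)$, the label is the argmax, and the constraint is checked with the $(\mathbb{H}, p)$ group norm, assumed here to be the $p$-norm of the vector of per-class norms $(\|w_1\|, \ldots, \|w_k\|)$.

    import numpy as np

    def group_norm(W, p=2):
        # ||W||_{H,p}: p-norm of the vector of per-class norms (||w_1||, ..., ||w_k||),
        # assuming that is the intended definition of the (H, p) norm.
        return np.linalg.norm(np.linalg.norm(W, axis=1), ord=p)

    def predict(W, phi_x):
        # Scores w_y . Phi(x) for each class y, and the argmax label.
        scores = W @ phi_x
        return scores, int(np.argmax(scores))

    # Toy example with an explicit (hypothetical) feature map Phi(x) in R^4 and k = 3 classes.
    rng = np.random.default_rng(0)
    W = rng.normal(size=(3, 4))          # rows are w_1, w_2, w_3
    phi_x = rng.normal(size=4)

    Lam = 5.0
    print(group_norm(W) <= Lam)          # membership in H_{K,p} requires ||W||_{H,p} <= Lambda
    print(predict(W, phi_x))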

  8. Multi-Class Margin Bound - Kernels (MM et al., 2012)

Theorem: let $K \colon X \times X \to \mathbb{R}$ be a PDS kernel and let $\Phi \colon X \to \mathbb{H}$ be a feature mapping associated to $K$. Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following multi-class bound holds for all $h \in H_{K,p}$:
    $R(h) \leq \widehat{R}_\rho(h) + 4k \sqrt{\frac{r^2 \Lambda^2}{\rho^2 m}} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$,
where $r^2 = \sup_{x \in X} K(x, x)$.
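
As a quick numeric illustration of how this bound scales (all plug-in values below are made up, not from the lecture), the complexity term $4k\sqrt{r^2\Lambda^2/(\rho^2 m)}$ and the confidence term $\sqrt{\log(1/\delta)/(2m)}$ can be evaluated directly:

    import numpy as np

    def kernel_margin_bound_terms(k, r, Lam, rho, m, delta):
        # Complexity term 4k * sqrt(r^2 Lambda^2 / (rho^2 m)) and confidence term sqrt(log(1/delta) / (2m)).
        complexity = 4 * k * np.sqrt((r**2 * Lam**2) / (rho**2 * m))
        confidence = np.sqrt(np.log(1.0 / delta) / (2 * m))
        return complexity, confidence

    # Hypothetical plug-in values: k = 10 classes, r = 1 (normalized kernel), Lambda = 1,
    # margin rho = 0.5, sample size m = 10000, confidence 1 - delta = 0.99.
    print(kernel_margin_bound_terms(k=10, r=1.0, Lam=1.0, rho=0.5, m=10_000, delta=0.01))
    # The first term shrinks as O(k / (rho * sqrt(m))), the second as O(1 / sqrt(m)).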

  9. Approaches

Single classifier:
• Multi-class SVMs.
• AdaBoost.MH.
• Conditional Maxent.
• Decision trees.
Combination of binary classifiers:
• One-vs-all (a minimal sketch follows below).
• One-vs-one.
• Error-correcting codes.
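
As an example of the second family, here is a minimal one-vs-all sketch; the base learner (scikit-learn's binary LogisticRegression) and the toy data are illustrative choices, not part of the lecture.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def one_vs_all_fit(X, y, k):
        # Train one binary classifier per class: class l vs. the rest.
        return [LogisticRegression().fit(X, (y == l).astype(int)) for l in range(k)]

    def one_vs_all_predict(classifiers, X):
        # Predict the class whose binary classifier assigns the highest score.
        scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
        return scores.argmax(axis=1)

    # Toy 2D data with 3 classes (hypothetical).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=c, size=(30, 2)) for c in ([0, 0], [4, 0], [0, 4])])
    y = np.repeat([0, 1, 2], 30)

    clfs = one_vs_all_fit(X, y, k=3)
    print((one_vs_all_predict(clfs, X) == y).mean())   # training accuracy on the toy data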

  10. Multi-Class SVMs (Weston and Watkins, 1999; Crammer and Singer, 2001)

Optimization problem:
    $\min_{\mathbf{w}, \boldsymbol{\xi}} \ \frac{1}{2} \sum_{l=1}^{k} \|w_l\|^2 + C \sum_{i=1}^{m} \xi_i$
subject to:
    $w_{y_i} \cdot x_i + \delta_{y_i, l} \geq w_l \cdot x_i + 1 - \xi_i, \quad \forall (i, l) \in [1, m] \times Y$.
Decision function:
    $h \colon x \mapsto \operatorname{argmax}_{l \in Y} (w_l \cdot x)$.
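
One convenient way to experiment with this kind of joint formulation is scikit-learn's LinearSVC with multi_class='crammer_singer', which optimizes the Crammer-Singer variant for a linear kernel; the sketch below uses toy data and should be read as an illustration of the interface, not as the exact primal above.

    import numpy as np
    from sklearn.svm import LinearSVC

    # Toy data: 3 Gaussian blobs in 2D (hypothetical).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in ([0, 0], [5, 0], [0, 5])])
    y = np.repeat([0, 1, 2], 50)

    # Crammer-Singer multi-class SVM (joint optimization over all w_l); C plays the same
    # role as in the primal objective above.
    clf = LinearSVC(multi_class="crammer_singer", C=1.0)
    clf.fit(X, y)

    # The decision function returns one score w_l . x per class; prediction is the argmax.
    scores = clf.decision_function(X[:3])
    print(scores.shape)                 # (3, k)
    print(clf.predict(X[:3]))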

  11. Notes

Directly based on generalization bounds.
Comparison with (Weston and Watkins, 1999): single slack variable per point, maximum of slack variables (penalty for worst class):
    $\sum_{l=1}^{k} \xi_{il} \ \to\ \max_{l=1}^{k} \xi_{il}$.
PDS kernel instead of inner product.
Optimization: complex constraints, $mk$-size problem.
• specific solution based on decomposition into $m$ disjoint sets of constraints (Crammer and Singer, 2001).

  12. Dual Formulation

Optimization problem ($\boldsymbol{\alpha}_i$ is the $i$th row of matrix $\boldsymbol{\alpha} \in \mathbb{R}^{m \times k}$):
    $\max_{\boldsymbol{\alpha} = [\alpha_{ij}]} \ \sum_{i=1}^{m} \boldsymbol{\alpha}_i \cdot \mathbf{e}_{y_i} - \frac{1}{2} \sum_{i, j=1}^{m} (\boldsymbol{\alpha}_i \cdot \boldsymbol{\alpha}_j)(x_i \cdot x_j)$
subject to:
    $\forall i \in [1, m], \ (0 \leq \alpha_{i y_i} \leq C) \wedge (\forall j \neq y_i, \ \alpha_{ij} \leq 0) \wedge (\boldsymbol{\alpha}_i \cdot \mathbf{1} = 0)$.
Decision function:
    $h(x) = \operatorname{argmax}_{l \in [1, k]} \ \sum_{i=1}^{m} \alpha_{il} (x_i \cdot x)$.

  13. AdaBoost (Schapire and Singer, 2000)

Training data (multi-label case):
    $(x_1, y_1), \ldots, (x_m, y_m) \in X \times \{-1, +1\}^k$.
Reduction to binary classification:
• each example leads to $k$ binary examples:
    $(x_i, y_i) \to ((x_i, 1), y_i[1]), \ldots, ((x_i, k), y_i[k]), \quad i \in [1, m]$.
• apply AdaBoost to the resulting problem.
• choice of $\alpha_t$.
Computational cost: $mk$ distribution updates at each round.
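
A minimal sketch of the reduction itself (names and data hypothetical): each multi-label example $(x_i, y_i)$ with $y_i \in \{-1, +1\}^k$ becomes $k$ binary examples $((x_i, l), y_i[l])$, so a standard binary AdaBoost can then be run on the $mk$ expanded examples.

    import numpy as np

    def expand_to_binary(X, Y):
        # X: (m, d) inputs; Y: (m, k) labels in {-1, +1}.
        # Returns the expanded binary sample: features (x_i, l) and labels y_i[l].
        m, k = Y.shape
        pairs = [(np.append(X[i], l), Y[i, l]) for i in range(m) for l in range(k)]
        X_bin = np.array([p[0] for p in pairs])     # each row encodes (x_i, l)
        y_bin = np.array([p[1] for p in pairs])     # each entry is y_i[l] in {-1, +1}
        return X_bin, y_bin

    # Toy multi-label sample: m = 2 points, k = 3 labels (hypothetical).
    X = np.array([[0.5, 1.0], [2.0, -1.0]])
    Y = np.array([[+1, -1, +1], [-1, +1, +1]])
    X_bin, y_bin = expand_to_binary(X, Y)
    print(X_bin.shape, y_bin.shape)    # (6, 3) and (6,): mk binary examples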

  14. AdaBoost.MH

$H \subseteq (\{-1, +1\}^k)^{(X \times Y)}$.

    AdaBoost.MH(S = ((x_1, y_1), ..., (x_m, y_m)))
     1  for i ← 1 to m do
     2      for l ← 1 to k do
     3          D_1(i, l) ← 1/(mk)
     4  for t ← 1 to T do
     5      h_t ← base classifier in H with small error ε_t = Pr_{D_t}[h_t(x_i, l) ≠ y_i[l]]
     6      α_t ← choose to minimize Z_t
     7      Z_t ← Σ_{i,l} D_t(i, l) exp(−α_t y_i[l] h_t(x_i, l))
     8      for i ← 1 to m do
     9          for l ← 1 to k do
    10              D_{t+1}(i, l) ← D_t(i, l) exp(−α_t y_i[l] h_t(x_i, l)) / Z_t
    11  f_T ← Σ_{t=1}^{T} α_t h_t
    12  return h_T = sgn(f_T)
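
Below is a compact Python sketch of the pseudocode above, under stated assumptions: the base classifiers are depth-one decision stumps trained with sample weights on the expanded (x, l) examples, and α_t is set as in the {−1, +1}-valued case, α_t = ½ log((1 − ε_t)/ε_t). It is an illustration, not a reference implementation.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_mh(X, Y, T=20):
        # X: (m, d) inputs; Y: (m, k) labels in {-1, +1}. Returns the scoring function f_T(x, l).
        m, k = Y.shape
        # Expanded binary sample: feature (x_i, l), label y_i[l], one weight D_t(i, l) per pair.
        X_exp = np.hstack([np.repeat(X, k, axis=0),
                           np.tile(np.arange(k), m).reshape(-1, 1)])
        y_exp = Y.reshape(-1)
        D = np.full(m * k, 1.0 / (m * k))
        stumps, alphas = [], []
        for _ in range(T):
            h = DecisionTreeClassifier(max_depth=1).fit(X_exp, y_exp, sample_weight=D)
            pred = h.predict(X_exp)
            eps = D[pred != y_exp].sum()            # weighted error under D_t
            if eps <= 0 or eps >= 0.5:              # stop if the weak learner is useless or perfect
                break
            alpha = 0.5 * np.log((1 - eps) / eps)   # alpha_t for {-1, +1}-valued base classifiers
            D *= np.exp(-alpha * y_exp * pred)
            D /= D.sum()                            # normalization by Z_t
            stumps.append(h)
            alphas.append(alpha)

        def f_T(X_new):
            # Scores f_T(x, l) for every point and label; predict sign(f_T) per label.
            n = X_new.shape[0]
            X_new_exp = np.hstack([np.repeat(X_new, k, axis=0),
                                   np.tile(np.arange(k), n).reshape(-1, 1)])
            scores = np.zeros(n * k)
            for a, h in zip(alphas, stumps):
                scores += a * h.predict(X_new_exp)
            return scores.reshape(n, k)

        return f_T

    # Toy multi-label data (hypothetical): label l is +1 iff feature l exceeds its median.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    Y = np.where(X > np.median(X, axis=0), 1, -1)
    f = adaboost_mh(X, Y, T=30)
    print((np.sign(f(X)) == Y).mean())    # training Hamming accuracy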

  15. Bound on Empirical Error

Theorem: the empirical error of the classifier output by AdaBoost.MH verifies:
    $\widehat{R}(h) \leq \prod_{t=1}^{T} Z_t$.
Proof: similar to the proof for AdaBoost.
Choice of $\alpha_t$:
• for $H \subseteq (\{-1, +1\}^k)^{X \times Y}$, as for AdaBoost, $\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$.
• for $H \subseteq ([-1, +1]^k)^{X \times Y}$, same choice: minimize upper bound.
• other cases: numerical/approximation method.

  16. Notes

Objective function:
    $F(\boldsymbol{\alpha}) = \sum_{i=1}^{m} \sum_{l=1}^{k} e^{-y_i[l] f_n(x_i, l)} = \sum_{i=1}^{m} \sum_{l=1}^{k} e^{-y_i[l] \sum_{t=1}^{n} \alpha_t h_t(x_i, l)}$.
All comments and analysis given for AdaBoost apply here.
Alternative: AdaBoost.MR, which coincides with a special case of RankBoost (ranking lecture).

  17. Decision Trees

[Figure: a binary decision tree with threshold questions X1 < a1, X1 < a2, X2 < a3, X2 < a4, partitioning the (X1, X2) plane into the regions R1, ..., R5.]

  18. Different Types of Questions

Decision trees:
• $X \in \{\text{blue}, \text{white}, \text{red}\}$: categorical questions.
• $X \leq a$: continuous variables.
Binary space partition (BSP) trees:
• $\sum_{i=1}^{n} \alpha_i X_i \leq a$: partitioning with convex polyhedral regions.
Sphere trees:
• $\|X - a_0\| \leq a$: partitioning with pieces of spheres.

  19. Hypotheses

In each region $R_t$:
• classification: majority vote, ties broken arbitrarily,
    $\widehat{y}_t = \operatorname{argmax}_{y \in Y} \big|\{x_i \in R_t : i \in [1, m], y_i = y\}\big|$.
• regression: average value,
    $\widehat{y}_t = \frac{1}{|S \cap R_t|} \sum_{x_i \in R_t,\, i \in [1, m]} y_i$.
Form of hypotheses:
    $h \colon x \mapsto \sum_{t} \widehat{y}_t 1_{x \in R_t}$.
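
A tiny sketch of the two per-region rules (all names hypothetical): given the labels of the training points that fall in a region R_t, classification returns the majority label and regression returns the average target.

    import numpy as np
    from collections import Counter

    def region_label_classification(y_in_region):
        # Majority vote over the labels of the training points falling in R_t
        # (ties broken arbitrarily by Counter ordering).
        return Counter(y_in_region).most_common(1)[0][0]

    def region_label_regression(y_in_region):
        # Average target value over the training points falling in R_t.
        return np.mean(y_in_region)

    # Toy region contents (hypothetical).
    print(region_label_classification([2, 0, 2, 1, 2]))   # 2
    print(region_label_regression([1.0, 3.0, 2.0]))       # 2.0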

  20. Training

Problem: the general problem of determining the partition with minimum empirical error is NP-hard.
Heuristics: greedy algorithm.
• for all $j \in [1, N]$, $\theta \in \mathbb{R}$,
    $R^+(j, \theta) = \{x_i \in R : x_i[j] \geq \theta, \ i \in [1, m]\}$
    $R^-(j, \theta) = \{x_i \in R : x_i[j] < \theta, \ i \in [1, m]\}$.

    Decision-Trees(S = ((x_1, y_1), ..., (x_m, y_m)))
     1  P ← {S}   ▷ initial partition
     2  for each region R ∈ P such that Pred(R) do
     3      (j, θ) ← argmin_{(j, θ)} error(R^−(j, θ)) + error(R^+(j, θ))
     4      P ← (P − {R}) ∪ {R^−(j, θ), R^+(j, θ)}
     5  return P
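
A minimal Python sketch of one greedy step (the misclassification count is used as error(·); everything else is hypothetical): scan the coordinates j and the thresholds θ taken from the data, and keep the split (j, θ) minimizing error(R^−(j, θ)) + error(R^+(j, θ)).

    import numpy as np
    from collections import Counter

    def misclassification_count(y):
        # error(R): number of points in R not matching the region's majority label.
        return 0 if len(y) == 0 else len(y) - Counter(y).most_common(1)[0][1]

    def best_split(X, y):
        # Greedy step: return (j, theta, err) minimizing error(R^-) + error(R^+).
        best = (None, None, np.inf)
        for j in range(X.shape[1]):
            for theta in np.unique(X[:, j]):
                right = X[:, j] >= theta                       # R^+(j, theta)
                err = (misclassification_count(y[~right]) +    # R^-(j, theta)
                       misclassification_count(y[right]))
                if err < best[2]:
                    best = (j, theta, err)
        return best

    # Toy sample (hypothetical): the class is determined by the second coordinate.
    X = np.array([[0.1, 1.0], [0.9, 2.0], [0.2, 5.0], [0.8, 6.0]])
    y = np.array([0, 0, 1, 1])
    print(best_split(X, y))    # expect (1, 5.0, 0): split on the second coordinate, zero error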

  21. Splitting/Stopping Criteria

Problem: larger trees overfit the training sample.
Conservative splitting:
• split node only if loss reduced by some fixed value $\eta > 0$.
• issue: seemingly bad split dominating useful splits.
Grow-then-prune technique (CART):
• grow a very large tree, $\mathrm{Pred}(R) \colon |R| > n_0$.
• prune the tree based on $F(T) = \widehat{\mathrm{Loss}}(T) + \alpha |T|$, with $\alpha \geq 0$ a parameter determined by cross-validation.
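
For the grow-then-prune step, one readily available implementation of the cost-complexity criterion F(T) = Loss(T) + α|T| is the ccp_alpha parameter of scikit-learn's decision trees; the sketch below (toy data) grows a full tree, takes candidate α values from its pruning path, and selects α by cross-validation, matching the way α is described as being chosen above.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Toy data (hypothetical).
    X, y = make_classification(n_samples=300, n_features=10, n_classes=3,
                               n_informative=5, random_state=0)

    # Grow a very large tree, then compare pruned versions F(T) = Loss(T) + alpha * |T|
    # over a grid of alpha values; pick alpha by cross-validation.
    full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    alphas = full_tree.cost_complexity_pruning_path(X, y).ccp_alphas

    scores = [(a, cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                                  X, y, cv=5).mean()) for a in alphas]
    best_alpha, best_score = max(scores, key=lambda t: t[1])
    print(best_alpha, best_score)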
