
Adversarial Surrogate Losses for General Multiclass Classification - PowerPoint PPT Presentation

  1. Adversarial Surrogate Losses for General Multiclass Classification. Rizal Zaini Ahmad Fathony. Committee: Prof. Brian Ziebart (Chair), Prof. Bhaskar DasGupta, Prof. Lev Reyzin, Prof. Xinhua Zhang, Prof. Simon Lacoste-Julien.

  2. Multiclass Classification → Supervised Learning. Training: labeled examples (x_1, y_1), (x_2, y_2), …, (x_n, y_n) drawn from a sample distribution P̃(x, y). Testing: predict the label ŷ for new inputs x_{n+1}, x_{n+2}, …. Each label y takes one of |Y| possible values, and a loss function loss(ŷ, y) scores a prediction ŷ against the true label y.

  3. Multiclass Classification → Zero-One Loss. Example: digit recognition (classes 1, 2, 3, …). Loss function: loss(ŷ, y) = I(ŷ ≠ y), the indicator of misclassification. General multiclass classification allows any loss function.

  4. Multiclass Classification → Ordinal Classification. Example: movie rating prediction (classes 1, 2, …, 5), where the loss compares the predicted label to the actual label. Loss function (example): the absolute distance loss, loss(ŷ, y) = |ŷ − y|.

  5. Multiclass Classification → Taxonomy Classification. Example: object classification over a class tree of height h = 4: Object → Nonlife, Life; Nonlife → Two-wheeled (1:Bicycle, 2:Motorbike), Four-wheeled (3:Bus, 4:Car); Life → 5:Person, Animal; Animal → Carnivore (6:Cat, 7:Dog), Herbivore (8:Cow). Loss function (example): loss(ŷ, y) = h − ℓ(ŷ, y) + 1, where h is the tree height and ℓ(ŷ, y) is the level of the lowest common ancestor of ŷ and y. Examples: loss(Bus, Car) = 2, loss(Cat, Dog) = 1, loss(Bus, Bicycle) = 3, loss(Cat, Cow) = 2, loss(Car, Cow) = 4, loss(Cow, Person) = 3, loss(Bus, Person) = 4, loss(Cow, Motorbike) = 4.
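
A minimal sketch, in Python, of how this tree-distance loss could be computed; the parent map and level numbering (Object = 1, …, Carnivore/Herbivore = 4) are assumptions read off the example hierarchy on this slide:

    # Sketch of the taxonomy loss from the example tree above.
    PARENT = {
        "Bicycle": "Two-wheeled", "Motorbike": "Two-wheeled",
        "Bus": "Four-wheeled", "Car": "Four-wheeled",
        "Cat": "Carnivore", "Dog": "Carnivore", "Cow": "Herbivore",
        "Person": "Life", "Two-wheeled": "Nonlife", "Four-wheeled": "Nonlife",
        "Carnivore": "Animal", "Herbivore": "Animal",
        "Animal": "Life", "Nonlife": "Object", "Life": "Object",
    }
    LEVEL = {"Object": 1, "Nonlife": 2, "Life": 2,
             "Two-wheeled": 3, "Four-wheeled": 3, "Animal": 3,
             "Carnivore": 4, "Herbivore": 4}

    def ancestors(node):
        """Node followed by its chain of ancestors up to the root."""
        chain = [node]
        while node in PARENT:
            node = PARENT[node]
            chain.append(node)
        return chain

    def taxonomy_loss(y_hat, y, h=4):
        """loss(y_hat, y) = h - level(lowest common ancestor) + 1."""
        if y_hat == y:
            return 0                      # correct prediction: no loss
        up = set(ancestors(y_hat))
        lca = next(a for a in ancestors(y) if a in up)  # first shared ancestor
        return h - LEVEL[lca] + 1

    assert taxonomy_loss("Bus", "Car") == 2
    assert taxonomy_loss("Cat", "Dog") == 1
    assert taxonomy_loss("Cow", "Person") == 3

The assertions reproduce the example values listed on the slide.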

  6. Multiclass Classification → Loss Matrix. A loss function loss(ŷ, y) over k classes can be written as a k × k loss matrix M with M_{ŷ,y} = loss(ŷ, y). Examples: zero-one loss, loss(ŷ, y) = I(ŷ ≠ y); ordinal classification loss, loss(ŷ, y) = |ŷ − y|; taxonomy-based loss, loss(ŷ, y) = h − ℓ(ŷ, y) + 1.
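
A short sketch constructing two of these loss matrices with NumPy (the 5-class size matches the movie-rating example of slide 4):

    import numpy as np

    def zero_one_matrix(k):
        """M[i, j] = I(i != j): zero-one loss matrix for k classes."""
        return 1.0 - np.eye(k)

    def ordinal_matrix(k):
        """M[i, j] = |i - j|: absolute-distance loss matrix for k classes."""
        idx = np.arange(k)
        return np.abs(idx[:, None] - idx[None, :]).astype(float)

    print(zero_one_matrix(3))
    print(ordinal_matrix(5))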

  7. Empirical Risk Minimization (ERM). Assume a family of parametric hypothesis functions f (e.g. a linear discriminator) and find the hypothesis f* that minimizes the empirical risk: f* = argmin_f E_{P̃(x,y)}[loss(f(x), y)]. This optimization is intractable (non-convex and non-continuous), so a convex surrogate loss needs to be employed. Example surrogates for the binary zero-one loss: hinge loss (used by SVM), log loss (used by logistic regression), exponential loss (used by AdaBoost).
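
For concreteness, a minimal numeric comparison of the three surrogates named above against the zero-one target, assuming the standard margin form m = y · f(x) for y ∈ {−1, +1}:

    import numpy as np

    def zero_one(m):  return (m <= 0).astype(float)    # the true target loss
    def hinge(m):     return np.maximum(0.0, 1.0 - m)  # SVM
    def log_loss(m):  return np.log1p(np.exp(-m))      # logistic regression
    def exp_loss(m):  return np.exp(-m)                # AdaBoost

    margins = np.linspace(-2, 2, 5)
    for fn in (zero_one, hinge, log_loss, exp_loss):
        print(fn.__name__, np.round(fn(margins), 3))

All three surrogates upper-bound the zero-one loss and are convex in the margin, which is what makes the ERM optimization tractable.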

  8. ERM under Hinge Loss and Log Loss. SVM uses the hinge loss; logistic regression uses the log loss and yields a probabilistic prediction P̂_f(y | x). Both binary SVM and binary logistic regression are Fisher consistent, i.e. they produce the Bayes optimal decision in the limit. Binary SVM alone additionally enjoys dual parameter sparsity. Surrogate losses for the multiclass case extend binary surrogates such as the hinge loss and log loss to multiple classes.

  9. Adversarial Prediction (Asif et al., 2015). Empirical risk minimization approximates the loss: it keeps the exact training data but replaces the non-convex, non-continuous original loss with convex surrogates. Adversarial prediction instead keeps the exact loss and approximates the training data: the loss is evaluated against an adversary's probabilistic prediction rather than the empirical data, the training data enters only through its statistics, and the statistics of the adversary's distribution are constrained to match the empirical statistics.

  10. Adversarial Prediction → Optimization. The adversarial prediction problem is a minimax over conditional label distributions: min_{P̂(ŷ|x)} max_{P̌(y̌|x)} E[loss(ŷ, y̌)], subject to the adversary matching the empirical feature statistics, E_{P̌}[φ(x, y̌)] = E_{P̃}[φ(x, y)]. Applying minimax and Lagrangian duality turns this into a minimization over many zero-sum games, one per training example; for the zero-one loss the game matrix is the loss matrix augmented with Lagrangian potentials. The inner optimization can be solved using linear programming, with complexity O(|Y|^{3.5}).
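
As an illustration, the inner zero-sum game can be solved with an off-the-shelf LP solver. This is a sketch, not the paper's exact solver: the 3-class potentials psi below are hypothetical, and the augmentation (adding the adversary's potential to each column of the zero-one loss matrix) follows the construction described above.

    import numpy as np
    from scipy.optimize import linprog

    def solve_game(G):
        """Value and predictor strategy of min_p max_q p^T G q
        for a k x k payoff matrix G (rows: predictor, cols: adversary)."""
        k = G.shape[0]
        c = np.zeros(k + 1); c[-1] = 1.0          # variables [p_1..p_k, v]
        # max_q p^T G q is the max over columns: require (G^T p)_j <= v.
        A_ub = np.hstack([G.T, -np.ones((k, 1))])
        b_ub = np.zeros(k)
        A_eq = np.zeros((1, k + 1)); A_eq[0, :k] = 1.0  # sum(p) = 1
        b_eq = np.array([1.0])
        bounds = [(0, None)] * k + [(None, None)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=bounds)
        return res.x[-1], res.x[:k]

    psi = np.array([0.0, 0.3, -0.2])            # hypothetical potentials
    G = (1.0 - np.eye(3)) + psi[None, :]        # augmented zero-one game
    value, p = solve_game(G)
    print(round(value, 4), np.round(p, 4))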

  11. Adversarial Prediction → ERM Perspective. The adversarial prediction optimization can be rewritten as empirical risk minimization with a surrogate loss, where the adversarial surrogate loss AL is the Nash equilibrium value of the zero-sum game characterized by the augmented matrix M'_{x,θ}.

  12. Outline. 1. AL^{0-1}: the adversarial surrogate loss for multiclass zero-one classification. 2. AL^{ord}: adversarial surrogate losses for multiclass ordinal classification. 3. Ongoing and future work.

  13. The Adversarial Surrogate Loss for Multiclass Zero-One Classification. Based on: Rizal Fathony, Anqi Liu, Kaiser Asif, Brian D. Ziebart. "Adversarial Multiclass Classification: A Risk Minimization Perspective". Advances in Neural Information Processing Systems 29 (NIPS), 2016.

  14. Multiclass Zero-One: Related Works. Multiclass support vector machines are compared on two criteria: Fisher consistency (Tewari and Bartlett, 2007; Liu, 2007; Dogan, 2016) and performance in low-dimensional feature spaces. 1. The WW model (Weston et al., 2002): a relative margin model. 2. The CS model (Crammer and Singer, 1999): a relative margin model. 3. The LLW model (Lee et al., 2004): an absolute margin model.

  15. Adversarial Prediction → Multiclass Zero-One Loss. The adversarial game for the zero-one loss, written in shorter notation, is analyzed through its Nash equilibrium.

  16. Adversarial Zero-Sum Game (Zero-One Loss). In the augmented game for 4 classes, the adversary plays probabilities q̌_1, …, q̌_4. If the adversary's equilibrium strategy is completely mixed, the game value takes one form; when some q̌_i = 0 (e.g. q̌_3 = 0), the game reduces to a smaller game over the remaining classes. Considering every possible set of classes on which the adversary places non-zero probability yields AL^{0-1} as a maximization over 2^{|Y|} − 1 hyperplanes, one per non-empty subset of classes.
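
A brute-force sketch of this maximization, using the subset form the paper derives: AL^{0-1} = max over non-empty subsets S of (Σ_{j∈S} ψ_j + |S| − 1) / |S|, where ψ_j = f_j(x) − f_y(x) is the potential difference for class j. The ψ values below are hypothetical.

    from itertools import combinations
    import numpy as np

    def al_zero_one(psi):
        """AL^{0-1} as a max over the 2^k - 1 non-empty subsets S:
        value(S) = (sum_{j in S} psi_j + |S| - 1) / |S|."""
        k = len(psi)
        best = -np.inf
        for size in range(1, k + 1):
            for S in combinations(range(k), size):
                best = max(best, (sum(psi[j] for j in S) + size - 1) / size)
        return best

    # 4-class example: the true class has psi = 0 by definition.
    psi = np.array([0.0, -0.5, 0.2, -1.0])
    print(round(al_zero_one(psi), 4))  # enumerates 2^4 - 1 = 15 hyperplanes

The enumeration is exponential in the number of classes; slide 23 describes the greedy algorithm that avoids it.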

  17. AL^{0-1} (Adversarial Surrogate Loss) → Binary Classification. For binary zero-one classification, AL^{0-1} reduces to a maximization over three hyperplanes. Changing the classification notation to y ∈ {+1, −1} and the parameters to w and b, and adding L2 regularization, relates binary AL^{0-1} to the soft-margin SVM.
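
Concretely, a short derivation sketch using the subset form from slide 16: with two classes the non-empty subsets are {y}, {ȳ}, and {y, ȳ}. Writing ψ for the potential difference of the wrong class (and ψ_y = 0 for the true class), the three subsets contribute the three hyperplanes

    AL^{0-1}(\psi) = \max\left\{\, 0,\ \psi,\ \tfrac{\psi + 1}{2} \,\right\}

which is a piecewise-linear, hinge-like loss in ψ.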


  20. AL^{0-1} → 3-Class Classification. AL^{0-1} for 3-class zero-one classification is a maximization over 2^3 − 1 = 7 hyperplanes (the loss surface plotted on the slide is for true label y = 1).
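
As a usage check of the subset-enumeration sketch from slide 16, the 3-class case lists its 7 hyperplane values explicitly (the ψ values are arbitrary illustrations):

    from itertools import combinations

    psi = [0.0, 0.4, -0.3]               # true class first, psi_y = 0
    values = {S: (sum(psi[j] for j in S) + len(S) - 1) / len(S)
              for size in (1, 2, 3) for S in combinations(range(3), size)}
    for S, v in sorted(values.items(), key=lambda kv: -kv[1]):
        print(S, round(v, 3))            # AL^{0-1} is the largest of the 7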

  21. AL^{0-1} → Fisher Consistency → Property of the Minimizer. For the Fisher consistency analysis in multiclass zero-one classification, a constraint on the potentials is employed to remove redundant solutions (the loss depends only on potential differences, so the potentials are shift-invariant). The minimizer f* lies in the region defined by the set of class labels attaining the maximum; in the illustrated example with true label y = 1, this set is T = {1, 2}.

  22. AL^{0-1} → Fisher Consistency. Finding the minimizer f* from these properties yields a solution that matches the Bayes optimal decision: AL^{0-1} is Fisher consistent.

  23. AL^{0-1} → Optimization → Primal. Optimization of AL^{0-1} by empirical risk minimization uses sub-gradients. Let S be the set of labels that maximizes AL^{0-1} for example x_j; the sub-gradient for x_j is built from the feature differences of the classes in S. Finding the set S with a greedy algorithm (see the sketch below): 1. Compute ω_k ≜ θᵀφ(x_j, k) − θᵀφ(x_j, y_j) for every class k. 2. Sort the ω_k in non-ascending order (largest first). 3. Start with the empty set S = ∅. 4. Repeat: incrementally add the next k to S and update the value of AL^{0-1}. 5. Stop when adding another class would decrease the value of AL^{0-1}.
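
A sketch of that greedy search; the ω values are hypothetical, and the stopping rule follows the description above:

    import numpy as np

    def greedy_max_set(omega):
        """Greedy search for the set S maximizing
        (sum_{k in S} omega_k + |S| - 1) / |S|,
        where omega[k] = theta . phi(x_j, k) - theta . phi(x_j, y_j)."""
        order = np.argsort(-np.asarray(omega))  # largest potential first
        S, best = [], -np.inf
        for k in order:
            cand = S + [int(k)]
            val = (sum(omega[i] for i in cand) + len(cand) - 1) / len(cand)
            if val <= best:        # adding k would not increase AL^{0-1}
                break
            S, best = cand, val
        return S, best

    omega = [0.0, -0.5, 0.2, -1.0]  # hypothetical potential differences
    S, val = greedy_max_set(omega)
    print(S, round(val, 4))

On this example the greedy search returns the same value (0.6) as the brute-force subset enumeration of slide 16, without visiting all 15 subsets.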

  24. AL^{0-1} → Optimization → Dual. The empirical risk minimization of AL^{0-1} with L2 regularization can be written as a constrained primal quadratic program, which by Lagrangian duality yields a dual QP formulation (with a constant term collected from the primal constraints).

  25. AL^{0-1} → Optimization → Dual → Kernel Trick. The kernel trick maps the input space into a rich feature space, x_j ↦ φ(x_j), and computes the dot products implicitly. The dual QP formulation admits this substitution; there, S_{j,l} is the set of labels included in the constraint Δ_{j,l}.
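
For reference, a standard Gaussian kernel Gram-matrix computation, the kind of implicit dot product the dual formulation allows (the bandwidth gamma is an assumed parameter):

    import numpy as np

    def gaussian_kernel(X, Y, gamma=1.0):
        """Gram matrix K[i, j] = exp(-gamma * ||x_i - y_j||^2):
        dot products in a rich feature space, computed implicitly."""
        sq = (np.sum(X**2, axis=1)[:, None]
              + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T)
        return np.exp(-gamma * sq)

    X = np.random.default_rng(0).normal(size=(5, 3))
    print(np.round(gaussian_kernel(X, X), 3))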

  26. AL^{0-1} → Optimization → Dual → Constraint Generation. The primal optimization has an exponential number of constraints (and the dual an exponential number of variables), so a constraint generation algorithm is used. A polynomial-time convergence guarantee is provided, and experiments show an even better convergence rate. A schematic loop follows.
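
This is a generic sketch of constraint generation, not the paper's exact algorithm; solve_restricted and most_violated are hypothetical helpers standing in for the restricted QP solve and the violation search:

    def constraint_generation(solve_restricted, most_violated,
                              tol=1e-6, max_iter=100):
        """Grow a working set of constraints until none is violated."""
        working_set = []
        solution = solve_restricted(working_set)   # e.g. a restricted QP
        for _ in range(max_iter):
            constraint, violation = most_violated(solution)
            if violation <= tol:                   # all constraints satisfied
                return solution
            working_set.append(constraint)         # add and re-solve
            solution = solve_restricted(working_set)
        return solution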

  27. AL^{0-1} → Experiments. Dataset properties and the AL^{0-1} constraints used in each experiment.

  28. AL^{0-1} → Experiments → Results. Results for the linear kernel and the Gaussian kernel: the mean (standard deviation) of accuracy. Bold numbers mark the best result or results not significantly worse than the best.

  29. Multiclass Zero-One Classification: summary on the two criteria, Fisher consistency and performance in low-dimensional feature spaces. 1. The SVM WW model (Weston et al., 2002), relative margin: not Fisher consistent, performs well. 2. The SVM CS model (Crammer and Singer, 1999), relative margin: not Fisher consistent, performs well. 3. The SVM LLW model (Lee et al., 2004), absolute margin: Fisher consistent, does not perform well. 4. AL^{0-1} (the adversarial surrogate loss), relative margin: Fisher consistent and performs well.

  30. Adversarial Surrogate Losses for Multiclass Ordinal Classification. Based on: Rizal Fathony, Mohammad Bashiri, Brian D. Ziebart. "Adversarial Surrogate Losses for Ordinal Regression". Advances in Neural Information Processing Systems 30 (NIPS), 2017.

  31. Ordinal Classification: Related Works. Support vector machines for ordinal classification extend the hinge loss to the ordinal setting. A. Threshold methods (Shashua & Levin, 2003; Chu & Keerthi, 2005; Rennie & Srebro, 2005): 1. All Threshold (also called SVORIM); 2. Immediate Threshold (also called SVOREX); both plug in a surrogate for binary classification, e.g. the hinge loss (see the sketch after this list). B. Reduction framework (Li & Lin, 2007): create |Y| − 1 weighted extended samples for each training sample, then run binary classification with a binary surrogate loss (e.g. the hinge loss) on the extended samples. C. Cost-sensitive classification based methods (Lin, 2008; Tu & Lin, 2010; Lin, 2014): 1. Cost-Sensitive One-Versus-All (CSOVA); 2. Cost-Sensitive One-Versus-One (CSOVO); 3. Cost-Sensitive One-Sided Regression (CSOSR).
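
A sketch of the all-threshold construction (SVORIM-style), assuming a single score f(x) and k − 1 ordered thresholds; the threshold values here are hypothetical:

    import numpy as np

    def all_threshold_hinge(f, y, thresholds):
        """Sum of binary hinge losses, one per threshold. For thresholds
        below the true label y, the score f should land above the
        threshold; for the rest, below it. y is 1-indexed."""
        loss = 0.0
        for l, theta in enumerate(thresholds, start=1):
            s = +1.0 if l < y else -1.0   # required side of threshold l
            loss += max(0.0, 1.0 - s * (f - theta))
        return loss

    thresholds = np.array([-1.0, 0.0, 1.0, 2.0])  # 5 ordered classes
    print(all_threshold_hinge(f=0.4, y=3, thresholds=thresholds))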

  32. Adversarial Surrogate Loss → Ordinal Classification. Setting up the adversarial game for the ordinal (absolute) loss and solving for its Nash equilibrium yields AL^{ord}, a maximization over pairs of classes: AL^{ord}_f(x, y) = max_{i,j} [ (f_i(x) + f_j(x) + j − i) / 2 ] − f_y(x). The two halves of the pair maximization can be realized independently: max_i (f_i(x) − i)/2 + max_j (f_j(x) + j)/2 − f_y(x). A short numeric check follows.
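
A sketch verifying that the pair maximization and its independent decomposition agree; the potentials f below are hypothetical:

    import numpy as np

    def al_ord(f, y):
        """AL^{ord} for potentials f (one per ordered class, 0-indexed)
        and true label y: max over pairs (i, j)."""
        k = len(f)
        pair_max = max((f[i] + f[j] + j - i) / 2.0
                       for i in range(k) for j in range(k))
        return pair_max - f[y]

    def al_ord_decomposed(f, y):
        """Same value, with the two maximizations realized independently."""
        i_part = max((f[i] - i) / 2.0 for i in range(len(f)))
        j_part = max((f[j] + j) / 2.0 for j in range(len(f)))
        return i_part + j_part - f[y]

    f = np.array([0.2, -0.1, 0.5, 0.0, -0.3])  # hypothetical potentials
    assert np.isclose(al_ord(f, 2), al_ord_decomposed(f, 2))
    print(round(al_ord(f, 2), 4))

The decomposition matters computationally: it replaces the O(k^2) pair search with two O(k) maximizations.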
