Rizal Zaini Ahmad Fathony
Adversarial Surrogate Losses
for General Multiclass Classification
Committee: Prof. Brian Ziebart (Chair), Prof. Bhaskar DasGupta, Prof. Lev Reyzin, Prof. Xinhua Zhang, Prof. Simon Lacoste-Julien
Supervised Learning
Data: sample distribution P̃(x, y)
Training: (x1, y1), (x2, y2), …, (xn, yn)
Testing: x_{n+1} → ?, x_{n+2} → ?
Multiclass Classification
Class labels: 1, 2, 3, …, |Y|
Loss Function: loss(ŷ, y), defined over the predicted label ŷ and every possible value of the true label y
Multiclass Classification – Zero-One Loss
Example: digit recognition (classes 1, 2, 3, …)
Loss Function: loss(ŷ, y) = I(ŷ ≠ y)
General Multiclass Classification – any loss

Multiclass Classification – Ordinal Classification
Example: movie rating prediction (classes 1, 2, …, 5)
Loss Function (example): loss(ŷ, y) = |ŷ − y|, the distance between the predicted and the actual label
Multiclass Classification – Taxonomy Classification
Example: object classification over a taxonomy:
Object → Nonlife → Two-wheeled (1: Bicycle, 2: Motorbike), Four-wheeled (3: Bus, 4: Car); Object → Life → 5: Person, Animal → Carnivore (6: Cat, 7: Dog), Herbivore (8: Cow)
Loss Function (example): loss(ŷ, y) = h − w(ŷ, y) + 1, where h is the tree height (h = 4 here) and w(ŷ, y) is the level of the common ancestor of ŷ and y (root = level 1)
Examples: loss(Cat, Dog) = 1, loss(Cat, Cow) = 2, loss(Cow, Person) = 3, loss(Cow, Motorbike) = 4, loss(Bus, Car) = 2, loss(Bus, Bicycle) = 3, loss(Car, Cow) = 4, loss(Bus, Person) = 4
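As a hedged illustration of this loss (the function and variable names below are my own, not from the slides), it can be computed from a parent map of the example tree:

```python
# A minimal sketch of the taxonomy loss: loss(yhat, y) = h - w(yhat, y) + 1,
# where h is the tree height and w is the level of the common ancestor
# (root = level 1). `taxonomy_loss` and `parent` are illustrative names.
parent = {
    "Bicycle": "Two-wheeled", "Motorbike": "Two-wheeled",
    "Bus": "Four-wheeled", "Car": "Four-wheeled",
    "Two-wheeled": "Nonlife", "Four-wheeled": "Nonlife",
    "Cat": "Carnivore", "Dog": "Carnivore", "Cow": "Herbivore",
    "Carnivore": "Animal", "Herbivore": "Animal",
    "Person": "Life", "Animal": "Life",
    "Nonlife": "Object", "Life": "Object",
}

def ancestors(node):
    # Path from a node up to the root, e.g. Cat -> Carnivore -> ... -> Object.
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def taxonomy_loss(y_hat, y, h=4):
    up = set(ancestors(y_hat))
    lca = next(node for node in ancestors(y) if node in up)  # common ancestor
    w = len(ancestors(lca))  # level of the LCA, counting the root as level 1
    return h - w + 1

assert taxonomy_loss("Cat", "Dog") == 1        # common ancestor: Carnivore
assert taxonomy_loss("Cow", "Motorbike") == 4  # common ancestor: Object
```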
Multiclass Classification – Loss Matrix
A loss function loss(ŷ, y) can be represented as a loss matrix L:
Zero-one loss: loss(ŷ, y) = I(ŷ ≠ y)
Ordinal classification loss: loss(ŷ, y) = |ŷ − y|
Taxonomy-based loss: loss(ŷ, y) = h − w(ŷ, y) + 1
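A small sketch (my own illustration, not from the slides) of turning these loss functions into loss matrices:

```python
import numpy as np

# Build the k x k loss matrix L with L[i, j] = loss(i, j); classes are
# indexed 0..k-1 here, while the slides index them from 1.
k = 5
idx = np.arange(k)
L_zero_one = (idx[:, None] != idx[None, :]).astype(float)      # I(yhat != y)
L_ordinal = np.abs(idx[:, None] - idx[None, :]).astype(float)  # |yhat - y|
print(L_ordinal)
```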
Empirical Risk Minimization (ERM)
Directly minimizing the empirical risk under such losses is an intractable optimization: non-convex and non-continuous. A convex surrogate loss needs to be employed.
Example: the binary zero-one loss and its convex surrogates.
ERM under Hinge Loss and Log Loss
SVM (hinge loss) and logistic regression (log loss).
Binary SVM and binary logistic regression: Fisher consistent.
Logistic regression only: probabilistic prediction P̂(y|x).
Binary SVM only: dual parameter sparsity.
A surrogate loss for the multiclass case should be Fisher consistent (produce the Bayes optimal decision in the limit) and extend binary surrogate losses like the hinge loss and log loss to the multiclass setting.
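A sketch (mine, not the slides' code) of the two binary surrogates named above, written as functions of the margin m = y·f(x) with y ∈ {+1, −1}:

```python
import numpy as np

def hinge_loss(margin):            # SVM surrogate
    return np.maximum(0.0, 1.0 - margin)

def log_loss(margin):              # logistic regression surrogate
    return np.log1p(np.exp(-margin))

margins = np.linspace(-2.0, 2.0, 5)
print(hinge_loss(margins))         # [3. 2. 1. 0. 0.]
print(log_loss(margins))
```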
Adversarial Prediction (Asif et al., 2015)
Empirical Risk Minimization approximates the original loss (non-convex, non-continuous) with convex surrogates and trains on the exact training data.
Adversarial Prediction instead makes a probabilistic prediction and evaluates the exact loss against an adversary's probabilistic prediction rather than the empirical data, constraining the statistics of the adversary's distribution to match the empirical statistics.
ERM: approximate loss, exact training data.
Adversarial Prediction: exact loss, approximate training data (by only using its statistics).

Adversarial Prediction – Optimization
Adversarial Prediction requires a minimization over many zero-sum games (for example, the game matrix for the zero-one loss). The inner optimization can be solved using linear programming with complexity O(|Y|^3.5); the outer problem is handled via minimax and Lagrangian duality.
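A hedged sketch (the name `game_value` is mine) of solving one inner zero-sum game as a linear program with scipy; this toy version uses the plain zero-one loss matrix, without the Lagrangian potential terms, which appear in the sketch after the next slide:

```python
import numpy as np
from scipy.optimize import linprog

# The predictor picks a distribution p over labels to minimize the value v
# of the game against a loss-maximizing adversary: min v s.t. (L^T p)_j <= v.
def game_value(L):
    k = L.shape[0]
    c = np.zeros(k + 1); c[-1] = 1.0                     # variables [p, v]
    A_ub = np.hstack([L.T, -np.ones((k, 1))])            # L^T p - v <= 0
    b_ub = np.zeros(k)
    A_eq = np.hstack([np.ones((1, k)), np.zeros((1, 1))])  # sum(p) = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * k + [(None, None)])
    return res.x[:k], res.x[-1]

L = 1.0 - np.eye(3)      # zero-one loss game for 3 classes
p, v = game_value(L)
print(p, v)              # uniform strategy, value 2/3
```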
Adversarial Prediction – ERM Perspective
Adversarial Prediction (optimization) can be rewritten as empirical risk minimization with a surrogate loss: the Adversarial Surrogate Loss is the Nash equilibrium value of the zero-sum game characterized by the augmented matrix L'_{x,w}.
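Reusing `game_value` from the sketch above, a hedged illustration of this equivalence: the surrogate is the value of the game augmented by potentials ψ_j = f_j(x) − f_y(x):

```python
import numpy as np

# Augmented game L'[i, j] = loss(i, j) + psi_j; its value is the
# adversarial surrogate loss for these potentials (a sketch, not the
# slides' implementation).
psi = np.array([0.0, -0.3, 0.8])          # potentials for true label y = 1
L_aug = (1.0 - np.eye(3)) + psi[None, :]
_, surrogate = game_value(L_aug)
print(surrogate)   # matches the max over 2^k - 1 hyperplanes shown later
```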
Outline
1. The Adversarial Surrogate Loss for Multiclass Zero-One Classification
2. Adversarial Surrogate Losses for Multiclass Ordinal Classification
3. Ongoing and Future Works
The Adversarial Surrogate Loss for Multiclass Zero-One Classification

Multiclass Zero-One: Related Works
Multiclass Support Vector Machines
Adversarial Prediction: Multiclass Zero-One Loss
The adversarial game (in shorter notation) and its Nash equilibrium.

Adversarial Zero-Sum Game (Zero-One Loss)
The augmented game for 4 classes pits the predictor's probabilities (p̂1, …, p̂4) against the adversary's probabilities (p̌1, …, p̌4). When p̌3 = 0, the game reduces to the sub-game over the remaining classes, and if both strategies are completely mixed, the equilibrium value has a closed form. Considering every possible set of the adversary's non-zero-probability classes yields AL0-1: a maximization over 2^|Y| − 1 hyperplanes.
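In code, a direct (exponential) enumeration of these hyperplanes reads as follows; a sketch using the subset form AL0-1(ψ) = max over non-empty S of (Σ_{j∈S} ψ_j + |S| − 1)/|S|, with potentials ψ_j = f_j(x) − f_y(x):

```python
import numpy as np
from itertools import combinations

# Sketch: AL0-1 as a maximization over all 2^k - 1 non-empty class subsets S.
def al01(psi):
    k = len(psi)
    best = -np.inf
    for size in range(1, k + 1):
        for S in combinations(range(k), size):
            best = max(best, (sum(psi[j] for j in S) + size - 1) / size)
    return best

psi = np.array([0.0, -0.3, 0.8])   # psi_y = 0 for the true label
print(al01(psi))                    # 0.9 here, attained by a 2-class subset
```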
AL0-1 (Adversarial Surrogate Loss) – Binary Classification
AL0-1 for binary zero-one classification, shown for the true label y = 1. Changing the classification notation to y ∈ {+1, −1}, the parameters to w and b, and adding L2 regularization makes binary AL0-1 directly comparable to the soft-margin SVM.
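For intuition, a sketch (my own) comparing binary AL0-1, obtained from the three subsets of the binary game, with the hinge loss, both as functions of the potential gap ψ = f_{−y}(x) − f_y(x):

```python
import numpy as np

# Binary game subsets {y}, {-y}, {y, -y} give the hyperplanes 0, psi, (psi+1)/2.
def al01_binary(psi):
    return np.maximum.reduce([np.zeros_like(psi), psi, (psi + 1.0) / 2.0])

def hinge(psi):                    # hinge loss in the same parameterization
    return np.maximum(0.0, 1.0 + psi)

psi = np.linspace(-2.0, 2.0, 9)
print(al01_binary(psi))            # hinge-like, with slope 1/2 on [-1, 1]
print(hinge(psi))
```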
AL0-1 – 3-Class Classification
AL0-1 for 3-class zero-one classification: a maximization over 7 hyperplanes (one per non-empty subset of the three classes), illustrated for the true label y = 1.

AL0-1 – Fisher Consistency – Property of the Minimizer
Fisher consistency in multiclass zero-one classification: a constraint is employed to remove redundant solutions, and the minimizer f* lies in the region defined by the class-label constraints (illustrated for y = 1 and the maximizing set S = {1, 2}).

AL0-1 – Fisher Consistency
Finding the minimizer f* based on its properties; the solution shows that AL0-1 is Fisher consistent.
AL0-1 – Optimization – Primal
Optimization of AL0-1 by empirical risk minimization: let S be the set that maximizes AL0-1 for a sample x_i; the sub-gradient for x_i is determined by S.
Finding the set S with a greedy algorithm (see the sketch below):
1. Compute all ψ_j = θ·φ(x_i, j) − θ·φ(x_i, y_i)
2. Sort the ψ_j in non-increasing order
3. Start with the empty set S = ∅
4. Repeat:
5. Incrementally add j to the set S, updating the value of AL0-1
6. Until adding another element decreases the value of AL0-1
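A sketch of this greedy maximizer (the function name is mine), matching the steps above:

```python
import numpy as np

# With potentials sorted in non-increasing order, grow S while the hyperplane
# value (sum(psi_S) + |S| - 1) / |S| keeps improving.
def al01_greedy(psi):
    order = np.argsort(-np.asarray(psi))      # largest potential first
    best_val, best_S = -np.inf, []
    S, total = [], 0.0
    for j in order:
        total += psi[j]
        S.append(int(j))
        val = (total + len(S) - 1) / len(S)
        if val <= best_val:
            break                              # adding more only decreases it
        best_val, best_S = val, list(S)
    return best_val, best_S

psi = [0.0, -0.3, 0.8]
print(al01_greedy(psi))   # matches the exhaustive 2^k - 1 enumeration
```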
AL0-1 – Optimization – Dual
Primal quadratic programming formulation of AL0-1 with L2 regularization: the constrained primal QP is converted to a dual QP formulation (with a constant term collected from the primal constraints).

AL0-1 – Optimization – Dual – Kernel Trick
In the dual QP formulation, the kernel trick maps the input space x_i into a rich feature space φ(x_i) and computes the dot products implicitly, where S_{i,j} denotes the set of labels included in constraint Δ_{i,j}.
AL0-1 – Optimization – Dual – Constraint Generation
Both the primal and the dual optimization involve an exponential number of constraints (in the primal) and dual variables. A constraint generation algorithm with a polynomial-time convergence guarantee is provided; experiments show an even better convergence rate in practice.
AL0-1 – Experiments
Dataset properties and AL0-1 constraints.

AL0-1 – Experiments – Results
Results for the linear kernel and the Gaussian kernel: the mean (standard deviation) of the accuracy. Bold numbers mark the best result or results not significantly worse than the best.

Adversarial Surrogate Losses for Multiclass Ordinal Classification
Ordinal Classification: Related Works
Support Vector Machines for ordinal classification.

Adversarial Surrogate Loss: Ordinal Classification
The adversarial game and its Nash equilibrium.
ALord: a maximization over pairs of labels, whose two components can be independently realized (see the sketch below).
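A sketch, assuming the pair form AL^ord(ψ) = max_{j,l} (ψ_j + ψ_l + l − j)/2 with ψ_j = f_j(x) − f_y(x); the split in the last line of the function is what "independently realized" refers to:

```python
import numpy as np

# The pair maximization decomposes into two independent 1-D maximizations:
#   max_{j,l} (psi_j + psi_l + l - j)/2
#     = max_j (psi_j - j)/2 + max_l (psi_l + l)/2
def alord(psi):
    j = np.arange(len(psi))
    return np.max((psi - j) / 2.0) + np.max((psi + j) / 2.0)

psi = np.array([0.6, 0.0, -0.4, 0.2])   # potentials, psi_y = 0 at the true label
print(alord(psi))
```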
ALord – Feature Representation
Thresholded regression representation: size m + |Y| − 1. Multiclass representation: size m|Y|. Here m is the dimension of the input space.
ALord – Thresholded Regression Representation
ALord-th: ALord for the thresholded regression representation, based on averaging the threshold label predictions for potentials built from j(w·x + 1) and j(w·x − 1).
ALord – Multiclass Representation
ALord-mc: ALord for the multiclass representation, a maximization over |Y|(|Y| + 1)/2 hyperplanes.
ALord – Fisher Consistency
Fisher consistency in ordinal classification: a constraint is employed to remove redundant solutions. The minimizer f* satisfies the loss reflective property, e.g., potential vectors of the form [−1, 0, −1, −2], [−2, −1, 0, −1, −2], [0, −1, −2, −3], or [−3, −2, −1, 0, −1, −2]. ALord is Fisher consistent.
ALord – Optimization – Primal
Optimization of ALord by empirical risk minimization with the Stochastic Average Gradient (SAG) method (Schmidt et al., 2013, 2015), which averages the gradient of each example from the last iteration at which it was selected and therefore requires storing a gradient per sample.
SAG for ALord-mc: the gradient for a single sample x_i is determined by the maximizing pair (j*, l*) (assuming j* and l* differ from y_i), so storing j* and l* suffices instead of the full gradient (see the sketch below).
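A hedged sketch of this idea (illustrative names; L2 regularization omitted):

```python
import numpy as np

# SAG for ALord-mc: each per-sample gradient is determined by the maximizing
# pair (j*, l*), so we store two indices per sample instead of a gradient.
def sag_alord_mc(X, y, k, lr=0.01, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = np.zeros((k, m))      # one weight vector per class
    G = np.zeros_like(W)      # running sum of the stored per-sample gradients
    pairs = [None] * n        # stored (j*, l*) for each sample
    for _ in range(epochs * n):
        i = int(rng.integers(n))
        f = W @ X[i]
        j_star = int(np.argmax(f - np.arange(k)))  # maximizer of (f_j - j)/2
        l_star = int(np.argmax(f + np.arange(k)))  # maximizer of (f_l + l)/2
        if pairs[i] is not None:  # remove this sample's old contribution
            oj, ol = pairs[i]
            G[oj] -= 0.5 * X[i]; G[ol] -= 0.5 * X[i]; G[y[i]] += X[i]
        # new contribution: subgradient (e_j* + e_l*)/2 - e_y, outer X[i]
        G[j_star] += 0.5 * X[i]; G[l_star] += 0.5 * X[i]; G[y[i]] -= X[i]
        pairs[i] = (j_star, l_star)
        W -= lr * G / n       # step along the average gradient
    return W

X = np.array([[0.5, 1.0], [1.5, -0.2], [0.1, 0.3]])
y = np.array([0, 2, 1])
print(sag_alord_mc(X, y, k=3))
```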
ALord – Optimization – Dual
Primal quadratic programming formulation of ALord with L2 regularization: the constrained primal QP is converted to a dual QP formulation. The kernel trick can also be easily applied.

ALord – Experiments
Dataset properties.
ALord – Experiments – Linear Kernel
Results for the linear kernel.

ALord – Experiments – Gaussian Kernel
Results for the Gaussian kernel, compared against the all-threshold and intermediate-threshold baselines.

Conclusion
A. Establish connections between Adversarial Prediction and ERM: optimizing the adversarial loss in the ERM framework = optimizing the original loss in the Adversarial Prediction framework.
B. Propose adversarial surrogate losses that:
1. Align better with the original loss
2. Guarantee Fisher consistency
3. Enable computational efficiency for rich feature spaces via the kernel trick and dual parameter sparsity
4. Perform well in practice
Ongoing and Future Works

[1.] Taxonomy-based Classification
Adversarial Surrogate Loss for taxonomy-based classification. Example: for the adversary's support set S = {1, 2, 3, 5}, analyze the non-zero-probability strategies of the adversary at the Nash equilibrium (the slide works out the adversary's probabilities on the example tree).
[1.] Taxonomy-based Classification – Algorithm
Algorithm for finding the adversarial loss: given potentials, e.g., ψ = [0.1, 1.2, 0.5, 1.5, 0.7], maintain a sorted index list at each non-leaf node and merge the lists up the tree. Complexity: O(c d² n²), where c is the maximum number of children, d is the depth of the tree, and n is the number of classes (compare with the LP complexity of O(n^3.5)).
Completed:
Future:
[2.] Sequence Prediction with Ordinal Classification Loss
Adversarial Surrogate Loss for sequence prediction with the ordinal classification loss: an adversarial game over joint distributions. There exists a Nash equilibrium in which only two strategies have non-zero probability (formal proof is ongoing work). The game matrix L_{y,ŷ} has exponential size, but the game can be solved using dynamic programming with complexity O(n|Y|^4); the double oracle method (Li et al., 2016) carries no polynomial convergence guarantee.
Completed:
Future:
[3.] Adversarial Graphical Model
Focus on tree structures with an additive loss: an adversarial game over marginal distributions, solvable by linear programming with O(|Y|² |E|) variables and complexity O(|Y|⁷ |E|^3.5), where |Y| is the number of classes and |E| is the number of edges in the graph.
Completed:
Future:
[*.] Timeline
[1.] Taxonomy Classification
[2.] Adversarial Loss for Sequence Prediction
[3.] Adversarial Graphical Models (with focus on tree structures)
Milestones at 3, 6, and 9 months.
Thank You