Performance-Aligned Learning Algorithms with Statistical Guarantees


1. Performance-Aligned Learning Algorithms with Statistical Guarantees
   Rizal Zaini Ahmad Fathony
   Committee: Prof. Brian Ziebart (Chair), Prof. Bhaskar DasGupta, Prof. Xinhua Zhang, Prof. Lev Reyzin, Prof. Simon Lacoste-Julien

2. Outline
   “New learning algorithms that align with performance/loss metrics and provide the statistical guarantees of Fisher consistency”
   1. Introduction & Motivation
   2. General Multiclass Classification
   3. Graphical Models
   4. Bipartite Matching in Graphs
   5. Conclusion & Future Directions

3. Introduction and Motivation

4. Supervised Learning
   Training: examples (x_1, y_1), …, (x_n, y_n) drawn from a data distribution Q(x, y). Testing: predict ŷ for each new input x, evaluated by a loss/performance metric loss(ŷ, y) / score(ŷ, y). Examples of metrics:
   - Multiclass classification: zero-one loss / accuracy metric; absolute loss (for ordinal regression)
   - Multivariate performance: F1-score, Precision@k
   - Structured prediction: Hamming loss (a sum of 0-1 losses)

5. Empirical Risk Minimization (ERM) (Vapnik, 1992)
   - Assume a family of parametric hypothesis functions f (e.g., a linear discriminator).
   - Find the hypothesis f* that minimizes the empirical risk.
   Non-convex, non-continuous metrics make this an intractable optimization, so a convex surrogate loss must be employed. A desirable property of convex surrogates is Fisher consistency: under ideal conditions (the true distribution and a fully expressive model), optimizing the surrogate also minimizes the loss metric.
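   The slide's equation is not recoverable from this transcript; as a sketch, the standard ERM objective it refers to is:

```latex
f^{*} = \operatorname*{argmin}_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} \mathrm{loss}\big(f(\mathbf{x}_i),\, y_i\big)
```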

6. Two Main Approaches
   1. Probabilistic approach: construct a prediction probability model and employ the logistic loss surrogate (logistic regression, conditional random fields (CRFs)).
   2. Large-margin approach: maximize the margin that separates the correct prediction from incorrect ones and employ the hinge loss surrogate (support vector machines (SVMs), structured SVMs).
   * Pictures are taken from the MLPP book (Kevin Murphy).

7. Multiclass Classification | Logistic Regression vs. SVM
   1. Multiclass logistic regression: statistical guarantee of Fisher consistency (minimizes the zero-one loss metric in the limit), but no dual parameter sparsity.
   2. Multiclass SVM: computational efficiency (via the kernel trick and dual parameter sparsity), but current multiclass SVM formulations either lack the Fisher consistency property or do not perform well in practice.

8. Structured Prediction | CRF vs. Structured SVM
   1. Conditional random fields (CRFs): statistical guarantee of Fisher consistency; no easy mechanism to incorporate customized loss/performance metrics; computation of the normalization term may be intractable.
   2. Structured SVM: no Fisher consistency guarantee; flexibility to incorporate customized loss/performance metrics; relatively more efficient in computation.

9. New Learning Algorithms?
   Goals: align better with the loss/performance metric (by incorporating the metric into the learning objective); provide a Fisher consistency guarantee; be computationally efficient; perform well in practice.
   How? A robust adversarial learning approach: “What predictor best maximizes the performance metric (or minimizes the loss metric) in the worst case given the statistical summaries of the empirical distributions?”

10. Performance-Aligned Surrogate Losses for General Multiclass Classification
   Based on:
   - Fathony, R., Asif, K., Liu, A., Bashiri, M. A., Xing, W., Behpour, S., Zhang, X., and Ziebart, B. D.: Consistent robust adversarial prediction for general multiclass classification. arXiv preprint arXiv:1812.07526, 2018. (Submitted to JMLR.)
   - Fathony, R., Liu, A., Asif, K., and Ziebart, B.: Adversarial multiclass classification: A risk minimization perspective. NIPS 2016.
   - Fathony, R., Bashiri, M. A., and Ziebart, B.: Adversarial surrogate losses for ordinal regression. NIPS 2017.

11. Supervised Learning | Multiclass Classification
   Training: examples (x_1, y_1), …, (x_n, y_n) drawn from a data distribution Q(x, y), where each label y takes one of a finite set of values {1, …, k}. Testing: predict ŷ for each new input x, evaluated by a loss/performance metric loss(ŷ, y) / score(ŷ, y).

12. Multiclass Classification | Zero-One Loss
   Example: digit recognition. Loss metric: zero-one loss, loss(ŷ, y) = I(ŷ ≠ y).

13. Multiclass Classification | Ordinal Classification
   Example: movie rating prediction, where the loss is the distance between the predicted and actual label. Loss metric: absolute loss, loss(ŷ, y) = |ŷ − y|.

14. Multiclass Classification | Classification with Abstention
   The predictor may answer ‘abstain’ instead of predicting a class. Loss metric: abstention loss, loss(ŷ, y) = β if the predictor abstains, and I(ŷ ≠ y) otherwise.

15. Multiclass Classification | Other Loss Metrics
   - Squared loss metric: loss(ŷ, y) = (ŷ − y)²
   - Cost-sensitive loss metric: loss(ŷ, y) = D_{ŷ,y} for a given cost matrix D
   - Taxonomy-based loss metric: the cost of predicting ŷ for true label y is based on their distance in a class hierarchy
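   The metrics on slides 12-15 are simple enough to state directly in code; a sketch (the ABSTAIN sentinel and the cost matrix D are illustrative placeholders, not from the deck):

```python
import numpy as np

ABSTAIN = -1  # hypothetical sentinel for the abstain option (slide 14)

def zero_one_loss(y_hat, y):
    """Slide 12: loss(yh, y) = I(yh != y)."""
    return float(y_hat != y)

def absolute_loss(y_hat, y):
    """Slide 13: loss(yh, y) = |yh - y| for ordinal labels."""
    return abs(y_hat - y)

def abstention_loss(y_hat, y, beta=0.25):
    """Slide 14: beta if the predictor abstains, zero-one loss otherwise
    (the deck later restricts beta to [0, 0.5])."""
    return beta if y_hat == ABSTAIN else zero_one_loss(y_hat, y)

def squared_loss(y_hat, y):
    """Slide 15: loss(yh, y) = (yh - y)^2."""
    return float(y_hat - y) ** 2

def cost_sensitive_loss(y_hat, y, D):
    """Slide 15: loss(yh, y) = D[yh, y] for an application-specific
    cost matrix D (a taxonomy-based loss is one way to fill D)."""
    return np.asarray(D)[y_hat, y]
```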

16. Robust Adversarial Learning

17. Robust Adversarial Learning (Grunwald & Dawid, 2004; Delage & Ye, 2010; Asif et al., 2015)
   Empirical risk minimization approximates the non-convex, non-continuous original loss metric with convex surrogates. Robust adversarial learning instead keeps the original loss metric but evaluates it against an adversary's probabilistic prediction rather than the empirical data, while constraining the statistics of the adversary's distribution to match the empirical statistics.
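   As a sketch of the formulation in the cited papers (with P̃ the empirical distribution, φ a feature function, P̂ the predictor's conditional distribution, and P̌ the adversary's):

```latex
\min_{\hat{P}(\hat{y} \mid \mathbf{x})} \; \max_{\check{P}(\check{y} \mid \mathbf{x})} \;
\mathbb{E}_{\mathbf{x} \sim \tilde{P};\; \hat{y} \sim \hat{P};\; \check{y} \sim \check{P}}
\left[ \mathrm{loss}(\hat{y}, \check{y}) \right]
\quad \text{subject to} \quad
\mathbb{E}_{\mathbf{x} \sim \tilde{P};\; \check{y} \sim \check{P}}\!\left[ \boldsymbol{\phi}(\mathbf{x}, \check{y}) \right]
= \mathbb{E}_{(\mathbf{x}, y) \sim \tilde{P}}\!\left[ \boldsymbol{\phi}(\mathbf{x}, y) \right]
```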

18. Robust Adversarial Dual Formulation
   Applying a Lagrange multiplier to the primal formulation and invoking minimax duality yields a dual problem that is simply ERM with the adversarial surrogate loss (AL), which is convex in the model parameters θ.
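   In the notation above, the resulting dual (a sketch of the form given in the cited papers, with θ the Lagrange multipliers / model parameters, f_θ = θᵀφ the potentials, L the loss matrix, and Δ the probability simplex):

```latex
\min_{\theta} \; \mathbb{E}_{(\mathbf{x}, y) \sim \tilde{P}} \left[ \mathrm{AL}\big(\mathbf{f}_\theta(\mathbf{x}), y\big) \right],
\qquad
\mathrm{AL}(\mathbf{f}, y) = \max_{\check{\mathbf{p}} \in \Delta} \; \min_{\hat{\mathbf{p}} \in \Delta}
\; \hat{\mathbf{p}}^{\top} \mathbf{L}\, \check{\mathbf{p}} + \mathbf{f}^{\top} \check{\mathbf{p}} - f_y
```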

19. Adversarial Surrogate Loss
   Evaluating the adversarial surrogate loss (e.g., for a four-class classification example) can be converted to a linear program over the convex polytope formed by the constraints and handed to an LP solver. Since a bounded LP always has an optimal solution at an extreme point of its polytope, computing AL reduces to finding the best extreme point of the domain.
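   A minimal sketch of that reduction in Python, assuming the max-min game form above (the helper name and the use of scipy.optimize.linprog are illustrative, not the authors' implementation):

```python
import numpy as np
from scipy.optimize import linprog

def adversarial_loss_lp(L, f, y):
    """Evaluate AL(f, y) by solving the inner max-min game as an LP.

    L[i, j] = loss(predict i, actual j); f = potential vector for one
    example.  Variables are the adversary distribution q and a scalar t
    with t <= (L q)_i for every pure predictor response i, so at the
    optimum t = min_p p^T L q.  We maximize t + f^T q over the simplex.
    """
    L = np.asarray(L, dtype=float)
    f = np.asarray(f, dtype=float)
    k = len(f)
    c = np.concatenate([-f, [-1.0]])            # linprog minimizes
    A_ub = np.hstack([-L, np.ones((k, 1))])     # t - (L q)_i <= 0
    b_ub = np.zeros(k)
    A_eq = np.concatenate([np.ones(k), [0.0]])[None, :]  # sum(q) = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * k + [(None, None)]   # q >= 0, t free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds)
    return -res.fun - f[y]
```

   The optimal q lands on an extreme point of the polytope, which is what the closed-form derivations on the following slides exploit to avoid calling an LP solver at all.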

20. Zero-One Loss: AL 0-1 | Convex Polytope
   The adversarial surrogate loss for the zero-one loss metric (AL 0-1) is defined by a convex polytope whose extreme points are built from the vectors e_i with a single 1 at the i-th index and 0 elsewhere.
   Computation of AL 0-1: sort the potentials f_j in non-increasing order, then incrementally add potentials to the set T until adding more potentials would decrease the loss value. Cost: O(k log k), where k is the number of classes.
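   A sketch of that recipe in Python, assuming the closed form from the cited NIPS 2016 paper, AL01(f, y) = max over non-empty sets T of (Σ_{j∈T} ψ_j + |T| − 1)/|T| with ψ_j = f_j − f_y; the best T of each size m collects the m largest ψ values, so one sort plus a prefix scan gives the stated O(k log k) cost:

```python
import numpy as np

def al_zero_one(f, y):
    """AL 0-1 via the slide's sorting recipe (a sketch)."""
    f = np.asarray(f, dtype=float)
    psi = f - f[y]                         # potential differences
    psi_sorted = np.sort(psi)[::-1]        # non-increasing order
    best = psi_sorted[0]                   # |T| = 1
    prefix = psi_sorted[0]
    for m in range(2, len(psi) + 1):
        prefix += psi_sorted[m - 1]        # add next-largest potential
        best = max(best, (prefix + m - 1) / m)
    return best
```

   For example, al_zero_one([0.0, 0.0], 0) returns 0.5, the adversarial value when two classes are indistinguishable.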

21. AL 0-1 | Loss Surface
   Loss surfaces of AL 0-1, plotted over the space of potential differences ψ_j = f_j − f_y with true label y = 1: one panel for binary classification and one for three-class classification.

22. Other Multiclass Loss Metrics | Ordinal Regression with the Absolute Loss Metric
   The extreme points of the polytope are again built from the vectors e_i with a single 1 at the i-th index and 0 elsewhere, yielding the adversarial surrogate loss AL ord. Computation cost: O(k), where k is the number of classes.
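   The cited NIPS 2017 paper gives the resulting surrogate in closed form (reproduced here as a sketch); splitting the pairwise maximization into two independent maximizations is what yields the O(k) cost:

```latex
\mathrm{AL}^{\mathrm{ord}}(\mathbf{f}, y)
= \max_{i, j \in \{1, \dots, k\}} \frac{f_i + f_j + j - i}{2} - f_y
= \max_{j} \frac{f_j + j}{2} \;+\; \max_{i} \frac{f_i - i}{2} \;-\; f_y
```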

23. Other Multiclass Loss Metrics | Classification with Abstention (0 ≤ β ≤ 0.5)
   The extreme points of the polytope are again built from the vectors e_i with a single 1 at the i-th index and 0 elsewhere, yielding the adversarial surrogate loss AL abstain. Computation cost: O(k), where k is the number of classes.

24. Fisher Consistency
   The Fisher consistency requirement in multiclass classification: with Q(Y|x) the true conditional distribution and f optimized over all measurable functions, the minimizer of the expected surrogate must also minimize the Bayes risk. A minimizer property of AL shows that, given the true conditional distribution, the optimal f* recovers the Bayes optimal prediction ŷ⋄; hence the adversarial surrogate losses are Fisher consistent.
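   In symbols, the requirement the slide states is (a standard formulation, written here in the deck's setting):

```latex
f^{*} \in \operatorname*{argmin}_{f \text{ measurable}} \;
\mathbb{E}_{Y \mid \mathbf{x} \sim Q} \left[ \mathrm{AL}\big(f(\mathbf{x}), Y\big) \right]
\;\;\Longrightarrow\;\;
\operatorname*{argmax}_{j} f_j^{*}(\mathbf{x}) \;\subseteq\;
\operatorname*{argmin}_{\hat{y}} \; \mathbb{E}_{Y \mid \mathbf{x} \sim Q} \left[ \mathrm{loss}(\hat{y}, Y) \right]
```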

25. Optimization
   Optimize via sub-gradient descent; for example, the sub-gradient of AL 0-1 is determined by T*, the set that maximizes AL 0-1.
   Rich feature spaces can be incorporated via the kernel trick, mapping each input x_i to φ(x_i) in a rich feature space and computing only dot products, in two ways:
   1. Dual optimization (benefit: dual parameter sparsity).
   2. Primal optimization, via PEGASOS (Shalev-Shwartz, 2010).
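   A sketch of that sub-gradient step, reusing the logic of the hypothetical al_zero_one helper above: recover T*, then the sub-gradient with respect to the potentials puts weight 1/|T*| on each member of T* and an extra −1 on the true label, from the ψ_j = f_j − f_y reparameterization:

```python
import numpy as np

def al_zero_one_subgradient(f, y):
    """Sub-gradient of AL 0-1 w.r.t. the potential vector f (a sketch).

    Finds T*, the set maximizing (sum_{j in T} psi_j + |T| - 1) / |T|,
    then d AL / d f_j = 1/|T*| for j in T*, with an extra -1 at the
    true label y because psi_j = f_j - f_y.
    """
    f = np.asarray(f, dtype=float)
    psi = f - f[y]
    order = np.argsort(-psi)                # indices, psi non-increasing
    best_val, best_m = psi[order[0]], 1
    prefix = psi[order[0]]
    for m in range(2, len(psi) + 1):
        prefix += psi[order[m - 1]]
        val = (prefix + m - 1) / m
        if val > best_val:
            best_val, best_m = val, m
    grad = np.zeros_like(psi)
    grad[order[:best_m]] += 1.0 / best_m    # uniform weight on T*
    grad[y] -= 1.0                          # from d psi_j / d f_y
    return grad
```

   With a linear model f = θᵀφ(x), chain-ruling through f gives the parameter update; the same potentials can come from a kernel expansion along the dual route.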

26. Experiments: Multiclass Classification (0-1 loss)

27. Multiclass Classification | Related Works
   Multiclass support vector machines, assessed on two questions: are they Fisher consistent, and do they perform well with low-dimensional features? (Tewari and Bartlett, 2007; Liu, 2007; Dogan et al., 2016)
   1. The WW model (Weston et al., 2002): relative margin model
   2. The CS model (Crammer and Singer, 1999): relative margin model
   3. The LLW model (Lee et al., 2004): absolute margin model

28. AL 0-1 | Experiments
   Dataset properties, AL 0-1 constraints, and dual parameter sparsity across the 12 datasets.

29. AL 0-1 | Experiments | Results
   Results for the linear kernel and the Gaussian kernel, reported as the mean (standard deviation) of accuracy; bold numbers mark the best result or results not significantly worse than the best. With the linear kernel, AL 0-1 shows a slight benefit while LLW performs poorly; with the Gaussian kernel, LLW's performance improves and AL 0-1 maintains its benefit.

30. Multiclass Zero-One Classification
   1. The SVM WW model (Weston et al., 2002): relative margin model; not Fisher consistent, but performs well with low-dimensional features.
   2. The SVM CS model (Crammer and Singer, 1999): relative margin model; not Fisher consistent, but performs well with low-dimensional features.
   3. The SVM LLW model (Lee et al., 2004): absolute margin model; Fisher consistent, but performs poorly with low-dimensional features.
   4. AL 0-1 (adversarial surrogate loss): relative margin model; Fisher consistent and performs well with low-dimensional features.
