Recent Results on Algorithmic Fairness and Meta-Learning

SLIDE 1

Recent Results on Algorithmic Fairness and Meta-Learning

Massimiliano Pontil

Computational Statistics and Machine Learning, Istituto Italiano di Tecnologia, and Department of Computer Science, University College London
4th Annual Machine Learning in the Real World workshop (MLiTRW), Criteo AI Lab, Paris, October 2, 2019

SLIDE 2

Plan

◮ Fair empirical risk minimization
◮ Using labeled and unlabeled data
◮ Multi-task approach
◮ Learning fair representations
◮ Online meta-learning

SLIDE 3

Algorithmic fairness

◮ Aim: ensure that learning algorithms do not treat subgroups in the population "unfairly"
◮ How: impose "fairness" constraints (different notions exist)
◮ Difficulty: design computationally efficient algorithms with statistical guarantees w.r.t. both the risk and the fairness measure

Binary classification setting: let µ be a probability distribution on $\mathcal X \times \mathcal S \times \{-1,+1\}$, where $\mathcal S = \{a, b\}$ is the set of values of the sensitive variable. We wish to find a solution $f^*$ of

$$\min_{f \in \mathcal F} \; \mathbb P\big(f(X,S) \neq Y\big) \quad \text{s.t. } f \text{ is fair}$$
SLIDE 4

Fairness constraints

(see e.g. [Hardt et al., 2016, Zafar et al., 2017])

◮ Equal opportunity (EO):
$$\mathbb P\big(f(X,S)>0 \mid Y=1, S=a\big) = \mathbb P\big(f(X,S)>0 \mid Y=1, S=b\big)$$
◮ Equalized odds (EOd): f(X, S) and S are conditionally independent given Y, i.e.
$$\mathbb P\big(f(X,S)>0 \mid Y=y, S=a\big) = \mathbb P\big(f(X,S)>0 \mid Y=y, S=b\big), \quad y \in \{-1, 1\}$$
◮ Demographic parity (DP):
$$\mathbb P\big(f(X,S)>0 \mid S=a\big) = \mathbb P\big(f(X,S)>0 \mid S=b\big)$$
◮ We may also loosen each constraint by requiring the l.h.s. to be close to the r.h.s.
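These criteria can be checked empirically from a model's scores. A minimal sketch in Python with NumPy (the function names are illustrative, not from the talk) of the empirical EO and DP gaps:

```python
import numpy as np

def positive_rate(scores, mask):
    """Fraction of examples with f(x, s) > 0 inside a boolean mask."""
    return float(np.mean(scores[mask] > 0))

def eo_gap(scores, y, s, a="a", b="b"):
    """Equal opportunity gap: positive-rate difference on true positives (Y = 1)."""
    return abs(positive_rate(scores, (y == 1) & (s == a))
               - positive_rate(scores, (y == 1) & (s == b)))

def dp_gap(scores, s, a="a", b="b"):
    """Demographic parity gap: positive-rate difference over whole groups."""
    return abs(positive_rate(scores, s == a) - positive_rate(scores, s == b))
```

A loosened constraint then simply requires the relevant gap to be at most some tolerance ε.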

SLIDE 5

Statistical learning setting

◮ Let $\ell : \mathbb R \times \mathcal Y \to \mathbb R$ be a loss function and let L be the associated risk: $L(f) = \mathbb E[\ell(f(X), Y)]$, for $f : \mathcal X \to \mathcal Y$
◮ Conditional risk of f for the positive class in group s: $L_{+,s}(f) = \mathbb E[\ell(f(X), Y) \mid Y = 1, S = s]$
◮ We relax the fairness constraint by using a loss function in place of the 0-1 loss and introduce a parameter ε ∈ [0, 1]. For EO, we obtain
$$\min_{f \in \mathcal F} \; L(f) \quad \text{s.t.} \quad \big|L_{+,a}(f) - L_{+,b}(f)\big| \le \varepsilon \qquad (1)$$

SLIDE 6

Fair empirical risk minimization (FERM)

[Donini et al. NeurIPS 2018]

◮ Distribution µ is unknown and we only have a data sequence $(x_i, s_i, y_i)_{i=1}^n$ sampled independently from µ. We then consider the empirical problem
$$\min_{f \in \mathcal F} \; \hat L(f) \quad \text{s.t.} \quad \big|\hat L_{+,a}(f) - \hat L_{+,b}(f)\big| \le \hat\varepsilon \qquad (2)$$
where $\hat\varepsilon$ is a parameter linked to ε
◮ Empirical risk: $\hat L(f) = \frac{1}{n}\sum_{i=1}^n \ell(f(x_i), y_i)$
◮ Empirical risk for the positive samples in group g: $\hat L_{+,g}(f) = \frac{1}{n_{+,g}}\sum_{i \in I_{+,g}} \ell(f(x_i), y_i)$, with $I_{+,g} = \{i : y_i = 1, s_i = g\}$ and $n_{+,g} = |I_{+,g}|$, $g \in \{a, b\}$
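The empirical quantities above, combined with the hinge-loss relaxation proposed later in the talk, can be sketched as follows (a linear model f(x) = ⟨w, x⟩ is assumed; names are illustrative):

```python
import numpy as np

def hinge(margins):
    """Hinge loss, a convex upper bound on the 0-1 loss."""
    return np.maximum(0.0, 1.0 - margins)

def empirical_risk(w, X, y):
    """L_hat(f): average hinge loss over the whole sample."""
    return float(hinge(y * (X @ w)).mean())

def positive_group_risk(w, X, y, s, g):
    """L_hat_{+,g}(f): average loss over I_{+,g} = {i : y_i = 1, s_i = g}."""
    idx = (y == 1) & (s == g)
    return float(hinge(y[idx] * (X[idx] @ w)).mean())
```

The fairness constraint in (2) then compares `positive_group_risk` across the two groups.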

SLIDE 7

Statistical analysis of FERM

We say a class of functions $\mathcal F$ is learnable (w.r.t. the loss ℓ) if, with probability at least 1 − δ,
$$\sup_{f \in \mathcal F} \big|L(f) - \hat L(f)\big| \le B(\delta, n, \mathcal F), \quad \text{with } \lim_{n \to \infty} B(\delta, n, \mathcal F) = 0$$

Proposition 1. Let δ ∈ (0, 1). If $\mathcal F$ is learnable, $f^*$ solves (1) and $\hat f$ solves (2) with $\hat\varepsilon = \varepsilon + \sum_{g \in \{a,b\}} B(\delta, n_{+,g}, \mathcal F)$, then with probability at least 1 − 6δ it holds simultaneously:
$$L(\hat f) - L(f^*) \le 2B(\delta, n, \mathcal F)$$
$$\big|L_{+,a}(\hat f) - L_{+,b}(\hat f)\big| \le \varepsilon + 2\sum_{g \in \{a,b\}} B(\delta, n_{+,g}, \mathcal F)$$
SLIDE 8

Implications of the bound

◮ The bound implies that a solution $\hat f$ of (2) is close to a solution $f^*$ of (1) both in terms of the risk and of the fairness constraint
◮ But how do we find $\hat f$? We would like to solve problem (2) for the hard (misclassification) loss:
$$\min_{f \in \mathcal F} \; \frac{1}{n}\sum_{i=1}^n \mathbf 1\{f(x_i) \neq y_i\} \quad \text{s.t.} \quad \big|\hat P(f(x)>0 \mid y=1, s=a) - \hat P(f(x)>0 \mid y=1, s=b)\big| \le \varepsilon \qquad (3)$$
◮ We propose to replace the hard loss in the risk with the (larger) hinge loss, and the hard loss in the fairness constraint with a linear loss
SLIDE 9

Fair learning with kernels

◮ Linear model $f(\cdot) = \langle w, \phi(\cdot)\rangle$, with $\phi : \mathcal X \to \mathbb H$ a kernel-induced feature map
◮ For the linear loss, the fairness constraint takes the form $|\langle w, u_a - u_b\rangle| \le \hat\varepsilon$, where $u_g$ is the barycenter of the positive points in group g:
$$u_g = \frac{1}{n_{+,g}}\sum_{i \in I_{+,g}} \phi(x_i), \quad g \in \{a, b\}$$
◮ We consider the regularized empirical risk minimization problem
$$\min_{w \in \mathbb H} \; \sum_{i=1}^n \ell\big(y_i \langle w, \phi(x_i)\rangle\big) + \lambda \|w\|^2 \quad \text{s.t.} \quad |\langle w, u_a - u_b\rangle| \le \hat\varepsilon, \qquad \lambda > 0$$
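In finite dimensions the linear-loss fairness constraint is just a slab $\{w : |\langle w, u_a - u_b\rangle| \le \hat\varepsilon\}$, so a projected-gradient scheme only needs the Euclidean projection onto it. A sketch using an explicit feature map instead of a kernel (helper names are my own):

```python
import numpy as np

def barycenter(Phi, y, s, g):
    """u_g: mean feature vector of the positive examples in group g."""
    return Phi[(y == 1) & (s == g)].mean(axis=0)

def project_onto_slab(w, u, eps):
    """Closest point to w in the slab {v : |<v, u>| <= eps}."""
    v = float(w @ u)
    if abs(v) <= eps:
        return w
    # Move w along u just enough to land on the nearest face of the slab
    return w - ((v - np.sign(v) * eps) / float(u @ u)) * u
```

After projection the constraint holds with equality whenever it was violated, which is what an alternating minimize-then-project loop would use.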

SLIDE 10

Form of the optimal classifier

[Chzhen et al. NeurIPS 2019]

◮ Proposition. Let $\eta(x, s) = \mathbb E[Y \mid X = x, S = s]$ be the regression function. If for each $s \in \{a, b\}$ the mapping $t \mapsto \mathbb P(\eta(X,S) \le t \mid S = s)$ is continuous on (0, 1), then an optimal classifier $f^*$ can be obtained for all $(x, s) \in \mathbb R^d \times \{a, b\}$ as
$$f_\theta(x, a) = \mathbf 1\Big\{\eta(x, a)\Big(2 - \frac{\theta}{\mathbb P(Y=1, S=a)}\Big) \ge 1\Big\}, \qquad f_\theta(x, b) = \mathbf 1\Big\{\eta(x, b)\Big(2 + \frac{\theta}{\mathbb P(Y=1, S=b)}\Big) \ge 1\Big\}$$
where θ ∈ [0, 2] solves the equation
$$\frac{\mathbb E_{X|S=a}\big[\eta(X, a) f_\theta(X, a)\big]}{\mathbb P(Y = 1 \mid S = a)} = \frac{\mathbb E_{X|S=b}\big[\eta(X, b) f_\theta(X, b)\big]}{\mathbb P(Y = 1 \mid S = b)}$$
◮ A similar result holds when S is not included as a predictor
SLIDE 11

Leveraging labeled and unlabeled data

◮ FERM leaves open the question of designing a computationally efficient and statistically consistent estimator for problem (3)
◮ Alternative method: estimate η from a labeled sample and θ from an independent unlabeled sample, by minimizing the empirical difference of equal opportunity (DEO)
$$\hat\Delta(f, \mu) = \left|\frac{\hat{\mathbb E}_{X|S=a}\big[\hat\eta(X, a) f_\theta(X, a)\big]}{\hat{\mathbb E}_{X|S=a}\big[\hat\eta(X, a)\big]} - \frac{\hat{\mathbb E}_{X|S=b}\big[\hat\eta(X, b) f_\theta(X, b)\big]}{\hat{\mathbb E}_{X|S=b}\big[\hat\eta(X, b)\big]}\right|$$
◮ Theorem (informal). If $\hat\eta \to \eta$ as $n \to \infty$, under mild additional assumptions the proposed estimator is consistent w.r.t. both accuracy and fairness:
$$\lim_{n,N \to \infty} \mathbb E_{(D_n, D_N)}\big[\Delta(\hat f, \mu)\big] = 0 \quad \text{and} \quad \lim_{n,N \to \infty} \mathbb E_{(D_n, D_N)}\big[R(\hat f)\big] \le R(f^*)$$
SLIDE 12

Modified validation procedure

◮ In experiments, we employ a two-step 10-fold CV procedure:

– Step 1: shortlist all hyperparameters with accuracy above a certain percentage (we choose 90%) of the best accuracy
– Step 2: from the shortlist, select the hyperparameter with the highest fairness (i.e. lowest DEO)

◮ We compare:

– Naïve SVM, validated with a standard nested 10-fold cross validation
– SVM with the novel validation procedure
– The method of [Hardt et al., 2016] applied to the best SVM
– The method of [Zafar et al., 2017] (code provided by the authors for the linear case∗)

∗Python code: https://github.com/mbilalzafar/fair-classification
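Given per-hyperparameter CV estimates of accuracy and DEO, the two-step selection reduces to a few lines (a sketch; the tuple format is my own):

```python
def select_fair_hyperparameter(results, frac=0.9):
    """results: list of (hyperparams, cv_accuracy, cv_deo) tuples.

    Step 1: shortlist configurations whose accuracy is at least `frac`
    times the best accuracy. Step 2: among those, pick the lowest DEO.
    """
    best_acc = max(acc for _, acc, _ in results)
    shortlist = [r for r in results if r[1] >= frac * best_acc]
    return min(shortlist, key=lambda r: r[2])[0]
```

The near-optimal-accuracy shortlist is what allows the procedure to trade a small amount of accuracy for a large gain in fairness.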

SLIDE 13

Experiments

Comparison between different methods. DEO is normalized in [0, 1] column-wise. The closer a point is to the origin, the better the result. The proposed methods slightly decrease accuracy while greatly improving the fairness measure.
Code: https://github.com/jmikko/fair_ERM
SLIDE 14

Taking advantage of multitask learning

[Oneto et al. AIES 2019]

◮ We consider group-specific models $f(x, s) = \langle w_s, x\rangle$ and a multitask learning (MTL) formulation
$$\min_{w_0, w_1, \ldots, w_k \in \mathbb H} \; \sum_{s=1}^k \hat L_s(w_s) + \frac{\lambda}{k}\sum_{s=1}^k \|w_s - w_0\|^2 + (1 - \lambda)\|w_0\|^2$$
◮ Regularization around a common mean encourages task similarity
◮ We impose additional (linearized) fairness constraints on f and on the common mean

Left: the shared model trained with MTL, with fairness constraint and no sensitive feature among the predictors, vs. the group-specific models trained with MTL with fairness constraint. Right: the latter models vs. the same models when the sensitive feature is predicted via random forest.
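The MTL objective above is straightforward to evaluate; a sketch with the squared loss (group labels mapped to task indices; names are illustrative):

```python
import numpy as np

def mtl_objective(W, w0, tasks, lam):
    """W: (k, d) per-group weights; w0: common mean; tasks: list of (X_s, y_s).

    Sum of per-task empirical risks, plus a term pulling each w_s toward
    the common mean w0, plus a penalty on w0 itself.
    """
    k = len(tasks)
    data = sum(np.mean((X @ W[t] - y) ** 2) for t, (X, y) in enumerate(tasks))
    spread = (lam / k) * sum(np.sum((W[t] - w0) ** 2) for t in range(k))
    return data + spread + (1.0 - lam) * np.sum(w0 ** 2)
```

With λ close to 1, the tasks are pulled strongly toward a shared model; with λ close to 0, they decouple.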

SLIDE 15

Learning fair representations

[Oneto et al. arXiv 2019]

◮ Now consider demographic parity: $P(f(x) = 1 \mid S = a) = P(f(x) = 1 \mid S = b)$
◮ Suppose $f(x) = g(h(x))$. If the representation $h : \mathcal X \to \mathbb R^r$ is fair in the sense that
$$P\big(h(x) \in C \mid S = a\big) = P\big(h(x) \in C \mid S = b\big), \quad \forall C \subseteq \mathbb R^r$$
then f is fair as well
◮ We relax this by requiring that both distributions have the same means. We let $c(z)$ denote the difference of the empirical group means computed from a dataset z
◮ We use multiple tasks to learn h. We illustrate the approach in the linear case, $h(x) = A^\top x$ and $f(x) = b^\top h(x)$, with $B = (b_1, \ldots, b_T)$:
$$\min_{A, B} \; \frac{1}{Tn}\sum_{t=1}^T\sum_{i=1}^n \big(y_{t,i} - \langle b_t, A^\top x_{t,i}\rangle\big)^2 + \frac{\lambda}{2}\big(\|A\|_F^2 + \|B\|_F^2\big) \quad \text{s.t.} \quad A^\top c(z_t) = 0, \; 1 \le t \le T$$
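The linearized constraint $A^\top c(z_t) = 0$ can be enforced exactly by projecting A onto the orthogonal complement of $c(z_t)$. A sketch with NumPy (helper names are my own, not code from the paper):

```python
import numpy as np

def group_mean_difference(X, s, a="a", b="b"):
    """c(z): difference of the two groups' empirical input means."""
    return X[s == a].mean(axis=0) - X[s == b].mean(axis=0)

def remove_unfair_direction(A, c):
    """Project each column of A orthogonally to c, so that A.T @ c == 0.

    Assumes c is nonzero.
    """
    u = c / np.linalg.norm(c)
    return A - np.outer(u, u @ A)
```

After the projection, no linear readout of the representation can distinguish the two groups' means.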
SLIDE 16

Learning fair representations (cont.)

◮ Theorem. Let A solve the above problem, with $\|A\|_F = 1$. Let the tasks $\mu_1, \ldots, \mu_T$ be i.i.d. from a meta-distribution ρ. Then, with probability at least 1 − δ, the average risk of the algorithm with representation A run on a random task is upper bounded by
$$\frac{1}{Tn}\sum_{t=1}^T\sum_{i=1}^n \big(y_{t,i} - \langle b_t, A^\top x_{t,i}\rangle\big)^2 + O\Big(\frac{1}{\lambda}\sqrt{\frac{\|\hat C\|_\infty}{n}}\Big) + O\Big(\sqrt{\frac{\ln\frac{1}{\delta}}{T}}\Big)$$
and the linearized fairness constraint is bounded as
$$\mathbb E_{\mu\sim\rho}\,\mathbb E_{z\sim\mu^n}\big\|A^\top c(z)\big\|^2 \le \frac{1}{T}\sum_{t=1}^T \big\|A^\top c(z_t)\big\|^2 + \frac{96\ln\frac{8T^2}{\delta}}{T} + 6\sqrt{\frac{\|\hat\Sigma\|_\infty \ln\frac{8T^2}{\delta}}{T}}$$
SLIDE 17

Experiments

M1: standard MTL with the fairness constraints on the outputs
M2: feed-forward neural network (FFNN) with adversarially generated representation [Madras et al. ICML 2018]
M3: similar to M2 but with a different loss function [Edwards & Storkey, ICLR 2016]
SLIDE 18

From MTL to meta-learning†

From a sequence of tasks, find an algorithm which works well on unseen similar tasks

task 1 1 1 1 1 2 2 2 2 3 3 3 3 3 3 · · ·
data 1 2 3 4 5 1 2 3 4 1 2 3 4 5 6 · · ·

◮ Previous work mainly focused on the batch statistical setting
[Baxter, 2000, Maurer, 2009, Pentina and Lampert, 2014, Maurer et al., 2016]
◮ Recent interest in online meta-learning:
– Online-within-online: both tasks and within-task data arrive online
[Alquier et al., 2017, Denevi et al., 2019, Khodak et al., 2019]
– Online-within-batch: tasks arrive online, their datasets in one batch
[Denevi et al., 2018a, Denevi et al., 2018b, Finn et al., 2019, Bullins et al., 2019]
◮ Also recent interest in meta-learning with deep neural networks, e.g.
[Ravi and Larochelle, 2017, Finn et al., 2017, Franceschi et al., 2018]

†Equivalent terminology: learning-to-learn or lifelong learning
SLIDE 19

Meta-algorithm

A model for each task is learned by an inner algorithm, which is updated by a meta-algorithm as the tasks are sequentially observed.
◮ Desiderata: memory- and time-efficient, and supported by learning guarantees
◮ Difficulty: lack of a convex meta-objective
SLIDE 20

Statistical and non-statistical settings

Let $Z_t = (x_{t,i}, y_{t,i})_{i=1}^n$ be the training sequence for the t-th task and let $\mathbf Z = (Z_t)_{t=1}^T$ be the meta-sequence. We consider two settings‡
◮ Statistical setting [Baxter, 2000, Maurer, 2009]: the tasks are sampled from a meta-distribution ρ and we wish to bound the average excess risk
$$\mathbb E_{\mathbf Z}\,\mathbb E_{\mu\sim\rho}\,\mathcal E_\mu\big(A(\mathbf Z)\big) = \mathbb E_{\mathbf Z}\,\mathbb E_{\mu\sim\rho}\Big[\mathbb E_{Z\sim\mu^n} R_\mu\big(A(Z)\big) - \min_{w\in\mathbb R^d} R_\mu(w)\Big]$$
◮ Non-statistical setting: we wish to bound the normalized regret across the tasks
$$\mathrm{regret}(A_1, \ldots, A_T) = \frac{1}{T}\sum_{t=1}^T\Big[\frac{1}{n}\sum_{i=1}^n \ell\big(\langle x_{t,i}, w_{t,i}\rangle, y_{t,i}\big) - \min_{w\in\mathbb R^d}\frac{1}{n}\sum_{i=1}^n \ell\big(\langle x_{t,i}, w\rangle, y_{t,i}\big)\Big]$$

‡See [Alquier et al., 2017] for a discussion
SLIDE 21

Regularization around a common mean – learning guarantees

[Denevi et al. ICML 2019; Denevi et al. NeurIPS 2019]

We assume ℓ(·, y) is L-Lipschitz for every y ∈ Y and the inputs are bounded. Let $w_\mu$ be the minimizer of the true risk for task µ, and define
$$V_\rho(\theta) = \tfrac{1}{2}\,\mathbb E_{\mu\sim\rho}\|w_\mu - \theta\|_2^2, \qquad \theta_\rho = \operatorname*{argmin}_{\theta\in\Theta} V_\rho(\theta) = \mathbb E_{\mu\sim\rho}\, w_\mu$$
◮ Our method (from Thm. 2, with tuned λ and η):
$$\mathbb E_{\mathbf Z}\,\mathbb E_{\mu\sim\rho}\,\mathcal E_\mu(\theta) \le O\Big(\sqrt{\frac{V_\rho(\theta_\rho)}{n}} + \frac{1}{\sqrt T}\Big)$$
◮ Best bias θ = θ_ρ: $\mathbb E_{\mu\sim\rho}\,\mathcal E_\mu \le O\big(\sqrt{V_\rho(\theta_\rho)/n}\big)$
◮ Independent task learning (ITL), θ = 0: $\mathbb E_{\mu\sim\rho}\,\mathcal E_\mu \le O\big(\sqrt{V_\rho(0)/n}\big)$
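For the square loss, the inner bias-regularized problem $\min_w \frac{1}{n}\|Xw - y\|^2 + \lambda\|w - \theta\|^2$ has a closed form, and a meta-algorithm can track a running average of the per-task solutions as its bias estimate. A minimal sketch under those assumptions (variable names are my own; this is not the exact online update analyzed in the papers):

```python
import numpy as np

def solve_biased_ridge(X, y, theta, lam):
    """Closed form of min_w (1/n)||Xw - y||^2 + lam * ||w - theta||^2.

    Substituting v = w - theta gives a standard ridge problem in v.
    """
    n, d = X.shape
    v = np.linalg.solve(X.T @ X / n + lam * np.eye(d),
                        X.T @ (y - X @ theta) / n)
    return theta + v

def meta_average(task_solutions):
    """Estimate the bias theta as the mean of the per-task solutions."""
    return np.mean(task_solutions, axis=0)
```

With large λ the solution stays near θ (useful when θ is close to the task vectors, i.e. $V_\rho(\theta)$ is small); with λ → 0 it reverts to independent task learning.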
SLIDE 22

Experiment

Averaged test performance of different methods on synthetic data (left) and on the Lenk dataset (right), as the number of training tasks increases.
SLIDE 23

We are hiring!

Postdoc/Researcher positions at Istituto Italiano di Tecnologia in Genoa, to work with me.
Send me an email if interested: massimiliano.pontil@iit.it
More info: http://tinyurl.com/MLPostDocIIT2019
SLIDE 24

References I

Alquier, P., Mai, T. T., and Pontil, M. (2017). Regret bounds for lifelong learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 261–269.
Baxter, J. (2000). A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198.
Bullins, B., Hazan, E., Kalai, A., and Livni, R. (2019). Generalize across tasks: Efficient algorithms for linear representation learning. In Algorithmic Learning Theory, pages 235–246.
Denevi, G., Ciliberto, C., Grazzi, R., and Pontil, M. (2019). Learning-to-learn stochastic gradient descent with biased regularization. arXiv preprint arXiv:1903.10399.
Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. (2018a). Incremental learning-to-learn with statistical guarantees. In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence (UAI).
Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. (2018b). Learning to learn around a common mean. In Advances in Neural Information Processing Systems, pages 10190–10200.
SLIDE 25

References II

Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135. PMLR.
Finn, C., Rajeswaran, A., Kakade, S., and Levine, S. (2019). Online meta-learning. arXiv preprint arXiv:1902.08438.
Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., and Pontil, M. (2018). Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, PMLR 80, pages 1568–1577.
Hardt, M., Price, E., and Srebro, N. (2016). Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems.
Khodak, M., Balcan, M.-F., and Talwalkar, A. (2019). Provable guarantees for gradient-based meta-learning. arXiv preprint arXiv:1902.10644.
Maurer, A. (2009). Transfer bounds for linear feature learning. Machine Learning, 75(3):327–350.
SLIDE 26

References III

Maurer, A., Pontil, M., and Romera-Paredes, B. (2016). The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853–2884.
Pentina, A. and Lampert, C. (2014). A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning, pages 991–999.
Ravi, S. and Larochelle, H. (2017). Optimization as a model for few-shot learning. In 5th International Conference on Learning Representations.
Zafar, M. B., Valera, I., Gomez Rodriguez, M., and Gummadi, K. P. (2017). Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In International Conference on World Wide Web.