Robustness and Regularization: Two sides of the same coin (PowerPoint PPT Presentation)


SLIDE 1

Robustness and Regularization: Two sides of the same coin

(Joint work with Jose Blanchet and Yang Kang)
Karthyek Murthy, Columbia University, Jun 28, 2016

SLIDE 2

Introduction

◮ Richer data has tempted us to consider more elaborate models
  (more elaborate models ⟹ more factors / variables)
◮ Generalization has become a lot more challenging
◮ Regularization has been useful in avoiding overfitting

Goal: a distributionally robust approach for improving generalization

SLIDE 3

Motivation for distributionally robust optimization

◮ Want to solve the stochastic optimization problem
$$\min_{\beta} \; E\big[\mathrm{Loss}(X, \beta)\big]$$
◮ Typically, we have access to the probability distribution of X only via its samples {X_1, ..., X_n}
◮ A common practice is to instead solve
$$\min_{\beta} \; \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(X_i, \beta)$$
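A minimal sketch (not from the slides) of this sample average approximation: the true expected loss is replaced by its empirical average over the observed samples. The quadratic loss Loss(x, β) = (x − β)² and the data below are illustrative assumptions, chosen so the true minimizer is E[X].

```python
# Sample average approximation: min_beta (1/n) sum_i Loss(X_i, beta)
# as a proxy for min_beta E[Loss(X, beta)].
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.0, size=50)   # X_1, ..., X_n

def empirical_loss(beta):
    # (1/n) sum_i (X_i - beta)^2, the empirical counterpart of E[(X - beta)^2]
    return np.mean((samples - beta) ** 2)

res = minimize_scalar(empirical_loss)
print(res.x, samples.mean())   # for this loss, the SAA solution is the sample mean
```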

SLIDE 4

$$\min_{\beta} \; \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(X_i, \beta) \quad \text{as a proxy for} \quad \min_{\beta} \; E\big[\mathrm{Loss}(X, \beta)\big]$$
SLIDE 5

[Figure: illustrative plot (axis residue omitted)]

$$\min_{\beta} \; \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(X_i, \beta) \quad \text{as a proxy for} \quad \min_{\beta} \; E\big[\mathrm{Loss}(X, \beta)\big]$$
SLIDE 6

[Figure: two illustrative panels (axis residue omitted)]

$$\min_{\beta} \; \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(X_i, \beta) \quad \text{as a proxy for} \quad \min_{\beta} \; E\big[\mathrm{Loss}(X, \beta)\big]$$
SLIDE 7

Learning

Naturally thought of as finding the "best" f such that
$$y_i = f(x_i) + e_i, \quad i = 1, \ldots, n,$$
where $x_i = (x_{i1}, \ldots, x_{id})$ is the vector of predictors and $y_i$ is the corresponding response.

(Image source: r-bloggers.com)

SLIDE 8

Learning

Naturally thought of as finding the "best" f such that
$$y_i = f(x_i) + e_i, \quad i = 1, \ldots, n.$$
Empirical loss/risk minimization (ERM):
$$\frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}\big(f(x_i), y_i\big)$$

(Image source: r-bloggers.com)

SLIDE 9

Learning

Naturally thought of as finding the "best" f such that
$$y_i = f(x_i) + e_i, \quad i = 1, \ldots, n.$$
Empirical loss/risk minimization (ERM):
$$\frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}\big(f(x_i), y_i\big) \;=\; \frac{1}{n} \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2$$

(Image source: r-bloggers.com)

SLIDE 10

Learning

Naturally thought of as finding the "best" f such that
$$y_i = f(x_i) + e_i, \quad i = 1, \ldots, n.$$

(Image source: r-bloggers.com)

Not enough: find an f that fits well over "future" values as well.

SLIDE 11

Generalization

Think of the data $(x_1, y_1), \ldots, (x_n, y_n)$ as samples from a probability distribution P. Then "future values" can also be interpreted as samples from P.

SLIDE 12

Generalization

Think of the data $(x_1, y_1), \ldots, (x_n, y_n)$ as samples from a probability distribution P. Then "future values" can also be interpreted as samples from P.

$$\min_{f} \; \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}\big(f(x_i), y_i\big) \;\longrightarrow\; \min_{f} \; E_P\big[\mathrm{Loss}\big(f(X), Y\big)\big]$$

However, the access to P is still via samples: $P_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{(x_i, y_i)}$.

SLIDE 13

[Figure: the unknown distribution P]

Want to solve $\min_{f \in \mathcal{F}} E_P\big[\mathrm{Loss}(f(X), Y)\big]$, but P is unknown.

SLIDE 14

[Figure: P and the empirical distribution P_n]

Know how to solve $\min_{f \in \mathcal{F}} E_{P_n}\big[\mathrm{Loss}(f(X), Y)\big]$; access to P is via the training samples, i.e., P_n.

SLIDE 15

[Figure: P and P_n]

More and more samples give a better approximation to P; however, the quality of this approximation depends on the dimension.

SLIDE 16

[Figure: P and P_n]

We are provided with only limited training data (n samples), sometimes to the extent that n is even smaller than the dimension of the parameter of interest.

SLIDE 17

[Figure: P, P_n, and a δ-neighborhood of P_n]

Instead of finding the best fit with respect to P_n, why not find a fit that works over all Q such that $D(Q, P_n) \le \delta$?

SLIDE 18

[Figure: P, P_n, and a δ-neighborhood of P_n]

Formally,
$$\min_{f \in \mathcal{F}} \; \max_{Q : D(Q, P_n) \le \delta} \; E_Q\big[\mathrm{Loss}\big(f(X), Y\big)\big]$$
SLIDE 19

DR Regression:
$$\min_{f \in \mathcal{F}} \; \max_{Q : D(Q, P_n) \le \delta} \; E_Q\big[\mathrm{Loss}\big(f(X), Y\big)\big]$$
SLIDE 20

DR Linear Regression:
$$\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D(Q, P_n) \le \delta} \; E_Q\big[(Y - \beta^T X)^2\big]$$

SLIDE 21

DR Linear Regression:
$$\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D(Q, P_n) \le \delta} \; E_Q\big[(Y - \beta^T X)^2\big]$$

I. Are these DR regression problems solvable?
  ◮ If so, how do they compare with known methods for improving generalization?
II. How to beat the curse of dimensionality while choosing δ?
  ◮ Robust Wasserstein profile function
III. Does the framework scale?
  ◮ Support vector machines
  ◮ Logistic regression
  ◮ General sample average approximation

SLIDE 22

DR Linear Regression:
$$\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D(Q, P_n) \le \delta} \; E_Q\big[(Y - \beta^T X)^2\big]$$

How to quantify the distance D(P, Q)?

SLIDE 23

DR Linear Regression:
$$\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D(Q, P_n) \le \delta} \; E_Q\big[(Y - \beta^T X)^2\big]$$

How to quantify the distance D(P, Q)?
Answer: Let (U, V) be two random variables such that U ∼ P and V ∼ Q, and let π denote a joint distribution of (U, V). Then
$$D(P, Q) = \inf_{\pi} \; E_\pi \|U - V\|$$

SLIDE 24

DR Linear Regression:
$$\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D(Q, P_n) \le \delta} \; E_Q\big[(Y - \beta^T X)^2\big]$$

[Figure: Monge's "déblais" and "remblais" transport picture]

How to quantify the distance D(P, Q)?
Answer: Let (U, V) be two random variables such that U ∼ P and V ∼ Q, and let π denote a joint distribution of (U, V). Then
$$D(P, Q) = \inf_{\pi} \; E_\pi \|U - V\|$$

(Image from the book Optimal Transport: Old and New by Cédric Villani)

SLIDE 25

DR Linear Regression:
$$\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D_c(Q, P_n) \le \delta} \; E_Q\big[(Y - \beta^T X)^2\big]$$

[Figure: Monge's "déblais" and "remblais" transport picture]

How to quantify the distance D(P, Q)?
Answer: Let (U, V) be two random variables such that U ∼ P and V ∼ Q, and let π denote a joint distribution of (U, V). Then
$$D_c(P, Q) = \inf_{\pi} \; E_\pi\big[c(U, V)\big]$$
The metric $D_c$ is called an optimal transport metric. When $c(u, v) = \|u - v\|^p$, $D_c^{1/p}$ is the p-th order Wasserstein distance.
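A minimal sketch (not from the slides) of the p = 1 case in one dimension: for scalar samples, SciPy's wasserstein_distance computes the first-order Wasserstein distance between two empirical distributions. The sample sizes and the Gaussian data below are illustrative assumptions; in the slides the cost c acts on the joint (X, Y) samples rather than on scalars.

```python
# Wasserstein-1 distance between two 1-D empirical distributions.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
p_samples = rng.normal(loc=0.0, scale=1.0, size=1000)   # samples from "P"
q_samples = rng.normal(loc=0.5, scale=1.2, size=1000)   # samples from "Q"

# For c(u, v) = |u - v| (p = 1), D_c is exactly the Wasserstein-1 distance.
print(wasserstein_distance(p_samples, q_samples))
```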

SLIDE 26

DR Linear Regression:
$$\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D_c(Q, P_n) \le \delta} \; E_Q\big[(Y - \beta^T X)^2\big]$$

Next, how do we choose δ?

SLIDE 27

DR Linear Regression:
$$\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D_c(Q, P_n) \le \delta} \; E_Q\big[(Y - \beta^T X)^2\big]$$

Next, how do we choose δ?

[Figure: P, P_n, and a δ-neighborhood of P_n]

See Fournier and Guillin (2015), Lee and Mehrotra (2013), Shafieezadeh-Abadeh, Esfahani and Kuhn (2015).

SLIDE 28

DR Linear Regression:
$$\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D_c(Q, P_n) \le \delta} \; E_Q\big[(Y - \beta^T X)^2\big]$$

The object of interest $\beta_*$ satisfies:
$$E_P\big[(Y - \beta_*^T X)\, X\big] = 0$$

[Figure: P, P_n, and a δ-neighborhood of P_n]

SLIDE 29

DR Linear Regression:
$$\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D_c(Q, P_n) \le \delta} \; E_Q\big[(Y - \beta^T X)^2\big]$$

The object of interest $\beta_*$ satisfies:
$$E_P\big[(Y - \beta_*^T X)\, X\big] = 0$$

[Figure: P, P_n, and the set $\{Q : E_Q[(Y - \beta_*^T X)\, X] = 0\}$]
SLIDE 30

DR Linear Regression:
$$\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D_c(Q, P_n) \le \delta} \; E_Q\big[(Y - \beta^T X)^2\big]$$

The object of interest $\beta_*$ satisfies:
$$E_P\big[(Y - \beta_*^T X)\, X\big] = 0$$

[Figure: P, P_n, the δ-neighborhood of P_n, and the set $\{Q : E_Q[(Y - \beta_*^T X)\, X] = 0\}$]

$$R_n(\beta_*) = \min\big\{ D_c(Q, P_n) \;:\; E_Q\big[(Y - \beta_*^T X)\, X\big] = 0 \big\}$$
SLIDE 31

DR Linear Regression:
$$\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D_c(Q, P_n) \le \delta} \; E_Q\big[(Y - \beta^T X)^2\big]$$

Theorem 1 [Blanchet, Kang & M]. If $Y = \beta_*^T X + \epsilon$, then
$$n\, R_n(\beta_*) \;\xrightarrow{\;D\;}\; L.$$

[Figure: P, P_n, the δ-neighborhood of P_n, and the set $\{Q : E_Q[(Y - \beta_*^T X)\, X] = 0\}$]

$$R_n(\beta_*) = \min\big\{ D_c(Q, P_n) \;:\; E_Q\big[(Y - \beta_*^T X)\, X\big] = 0 \big\}$$
SLIDE 32

DR Linear Regression:
$$\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D_c(Q, P_n) \le \delta} \; E_Q\big[(Y - \beta^T X)^2\big]$$

Theorem 1 [Blanchet, Kang & M]. If $Y = \beta_*^T X + \epsilon$, then
$$n\, R_n(\beta_*) \;\xrightarrow{\;D\;}\; L.$$

[Figure: P, P_n, the δ-neighborhood of P_n, and the constraint set]

Choose $\delta = \eta / n$, where η is such that $P\{L \le \eta\} \ge 0.95$.

SLIDE 33

DR Linear Regression:
$$\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D_c(Q, P_n) \le \delta} \; E_Q\big[(Y - \beta^T X)^2\big]$$

Theorem 1 [Blanchet, Kang & M]. If $Y = \beta_*^T X + \epsilon$, then
$$n\, R_n(\beta_*) \;\xrightarrow{\;D\;}\; L.$$

[Figure: P, P_n, the δ-neighborhood of P_n, and the constraint set]

Choose $\delta = \eta_\alpha / n$, where $\eta_\alpha$ is such that $P\{L \le \eta_\alpha\} \ge 1 - \alpha$.

SLIDE 34

Robust Wasserstein profile function:
$$R_n(\beta) = \min\big\{ D_c(Q, P_n) \;:\; E_Q\big[(Y - \beta^T X)\, X\big] = 0 \big\}$$

[Figure: the empirical distribution P_n]

SLIDE 35

Robust Wasserstein profile function:
$$R_n(\beta) = \min\big\{ D_c(Q, P_n) \;:\; E_Q\big[(Y - \beta^T X)\, X\big] = 0 \big\}$$

[Figure: points (x, y) and the empirical distribution P_n]

SLIDE 36

Robust Wasserstein profile function:
$$R_n(\beta) = \min\big\{ D_c(Q, P_n) \;:\; E_Q\big[(Y - \beta^T X)\, X\big] = 0 \big\}$$

[Figure: points (x, y), the empirical distribution P_n, and a perturbed distribution $\tilde{P}_n$]

SLIDE 37

Robust Wasserstein profile function:
$$R_n(\beta) = \min\big\{ D_c(Q, P_n) \;:\; E_Q\big[(Y - \beta^T X)\, X\big] = 0 \big\}$$

[Figure: P_n and $\tilde{P}_n$ with $D_c(P_n, \tilde{P}_n) = R_n(\beta)$]

SLIDE 38

Robust Wasserstein profile function:
$$R_n(\beta) = \min\big\{ D_c(Q, P_n) \;:\; E_Q\big[(Y - \beta^T X)\, X\big] = 0 \big\}$$

[Figure: P_n and $\tilde{P}_n$ with $D_c(P_n, \tilde{P}_n) = R_n(\beta)$]

◮ Basically, $R_n(\beta)$ is a measure of goodness of β:
$$n\, R_n(\beta) \;\longrightarrow\; \begin{cases} L, & \text{if } \beta = \beta_*, \\ \infty, & \text{if } \beta \ne \beta_*. \end{cases}$$
◮ Similar to the empirical likelihood profile function
◮ In a high-dimensional setting, one can instead consider suitable non-asymptotic bounds for $n\, R_n(\beta)$.

SLIDE 39

RWPI Linear Regression:
$$\min_{\beta \in \mathbb{R}^d} \; \underbrace{\max_{Q : D_c(Q, P_n) \le \delta} \; E_Q\big[(Y - \beta^T X)^2\big]}_{\text{worst-case loss}}$$

Theorem 2 [Blanchet, Kang & M]. If we take $c(u, v) = \|u - v\|_\infty^2$, then
$$\text{Worst-case loss} = \Big( \sqrt{\mathrm{MSE}_n(\beta)} + \sqrt{\delta}\, \|\beta\|_1 \Big)^2$$

(Recall $D_c(P, Q) = \inf_\pi \big\{ E_\pi[c(U, V)] : \pi_U = P,\ \pi_V = Q \big\}$.)
SLIDE 40

RWPI Linear Regression:
$$\min_{\beta \in \mathbb{R}^d} \; \underbrace{\max_{Q : D_c(Q, P_n) \le \delta} \; E_Q\big[(Y - \beta^T X)^2\big]}_{\text{worst-case loss}}$$

Theorem 2 [Blanchet, Kang & M]. If we take $c(u, v) = \|u - v\|_\infty^2$, then
$$\text{Worst-case loss} = \Big( \sqrt{\mathrm{MSE}_n(\beta)} + \sqrt{\delta}\, \|\beta\|_1 \Big)^2$$
⟹ RWPI-Regression = Generalized Lasso!
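A minimal sketch (not from the slides) of the equivalence in Theorem 2: minimizing the worst-case loss $(\sqrt{\mathrm{MSE}_n(\beta)} + \sqrt{\delta}\,\|\beta\|_1)^2$ directly over β. The data, the value of δ, and the derivative-free optimizer are illustrative assumptions; the slides do not prescribe this particular solver.

```python
# Generalized-Lasso form of the RWPI worst-case loss, minimized directly.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.normal(size=(n, d))
beta_true = np.zeros(d); beta_true[[0, 1, 3]] = [3.0, 2.0, 1.5]
Y = X @ beta_true + rng.normal(size=n)
delta = 0.1  # would normally come from the RWPI prescription for delta

def worst_case_loss(beta):
    mse = np.mean((Y - X @ beta) ** 2)                       # MSE_n(beta)
    return (np.sqrt(mse) + np.sqrt(delta) * np.abs(beta).sum()) ** 2

res = minimize(worst_case_loss, x0=np.zeros(d), method="Powell")
print(np.round(res.x, 2))
```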

SLIDE 41

RWPI Linear Regression:
$$\min_{\beta \in \mathbb{R}^d} \; \underbrace{\max_{Q : D_c(Q, P_n) \le \delta} \; E_Q\big[(Y - \beta^T X)^2\big]}_{\text{worst-case loss}}$$

Theorem 2 [Blanchet, Kang & M]. If we take $c(u, v) = \|u - v\|_q^2$, then
$$\text{Worst-case loss} = \Big( \sqrt{\mathrm{MSE}_n(\beta)} + \sqrt{\delta}\, \|\beta\|_p \Big)^2$$
⟹ RWPI-Regression(q) = ℓ_p-Penalized regression

SLIDE 42

RWPI Linear Regression:
$$\min_{\beta \in \mathbb{R}^d} \; \underbrace{\max_{Q : D_c(Q, P_n) \le \delta} \; E_Q\big[(Y - \beta^T X)^2\big]}_{\text{worst-case loss}}$$

Theorem 2 [Blanchet, Kang & M]. If we take $c(u, v) = \|u - v\|_q^2$, then
$$\text{Worst-case loss} = \Big( \sqrt{\mathrm{MSE}_n(\beta)} + \sqrt{\delta}\, \|\beta\|_p \Big)^2$$
A prescription for δ ⟹ a prescription for the regularization parameter

SLIDE 43

RWPI Linear Regression:
$$\min_{\beta \in \mathbb{R}^d} \; \underbrace{\max_{Q : D_c(Q, P_n) \le \delta} \; E_Q\big|Y - \beta^T X\big|}_{\text{worst-case loss}}$$

Theorem 3 [Blanchet, Kang & M]. If we take $c(u, v) = \|u - v\|_q$, then
$$\text{Worst-case loss} = \frac{1}{n} \sum_{i=1}^{n} \big|Y_i - \beta^T X_i\big| + \delta \|\beta\|_p$$
⟹ RWPI linear regression with LAD loss = LAD-Lasso

SLIDE 44

RWPI Logistic Regression:
$$\min_{\beta \in \mathbb{R}^d} \; \underbrace{\max_{Q : D_c(Q, P_n) \le \delta} \; E_Q\Big[\log\big(1 + \exp(-Y \beta^T X)\big)\Big]}_{\text{worst-case loss}}$$

Theorem 3 [Blanchet, Kang & M]. If we take $c(u, v) = \|u - v\|_q$, then
$$\text{Worst-case loss} = \frac{1}{n} \sum_{i=1}^{n} \log\big(1 + \exp(-Y_i \beta^T X_i)\big) + \delta \|\beta\|_p$$
⟹ RWPI logistic regression = Penalized logistic regression
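A minimal sketch (not from the slides) of the penalized objective above: $(1/n)\sum_i \log(1 + \exp(-Y_i \beta^T X_i)) + \delta \|\beta\|_1$ minimized directly for the p = 1 case. The data, the value of δ, and the derivative-free solver are illustrative assumptions.

```python
# ell_1-penalized logistic regression objective, minimized directly.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.normal(size=(n, d))
beta_true = np.array([1.5, -2.0, 0.0, 0.0, 1.0])
# labels Y in {-1, +1}, drawn from the logistic model
Y = np.where(rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ beta_true)), 1.0, -1.0)
delta = 0.05  # regularization level; the RWPI prescription would pick this

def penalized_logistic_loss(beta):
    margins = Y * (X @ beta)
    # log(1 + exp(-margin)) computed stably via logaddexp
    return np.mean(np.logaddexp(0.0, -margins)) + delta * np.abs(beta).sum()

res = minimize(penalized_logistic_loss, x0=np.zeros(d), method="Powell")
print(np.round(res.x, 2))
```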

SLIDE 45

RWPI Hinge-loss minimization:
$$\min_{\beta \in \mathbb{R}^d} \; \underbrace{\max_{Q : D_c(Q, P_n) \le \delta} \; E_Q\big[(1 - Y \beta^T X)^+\big]}_{\text{worst-case loss}}$$

Theorem 4 [Blanchet, Kang & M]. If we take $c(u, v) = \|u - v\|_q$, then
$$\text{Worst-case loss} = \frac{1}{n} \sum_{i=1}^{n} \big(1 - Y_i \beta^T X_i\big)^+ + \delta \|\beta\|_p$$
⟹ RWPI hinge-loss minimization = SVM
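For completeness, here is an analogous minimal sketch (not from the slides) for the hinge-loss case with p = 1: $(1/n)\sum_i (1 - Y_i \beta^T X_i)^+ + \delta \|\beta\|_1$. Again the data, δ, and the solver are illustrative assumptions.

```python
# ell_1-penalized hinge-loss (SVM-type) objective, minimized directly.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, d = 200, 5
X = rng.normal(size=(n, d))
Y = np.where(X[:, 0] - X[:, 1] + 0.3 * rng.normal(size=n) > 0, 1.0, -1.0)
delta = 0.05

def penalized_hinge_loss(beta):
    margins = 1.0 - Y * (X @ beta)
    return np.mean(np.maximum(margins, 0.0)) + delta * np.abs(beta).sum()

res = minimize(penalized_hinge_loss, x0=np.zeros(d), method="Powell")
print(np.round(res.x, 2))
```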

SLIDE 46

Robust SAA:
$$\min_{\beta \in \mathbb{R}^d} \; \underbrace{\max_{Q : D_c(Q, P_n) \le \delta} \; E_Q\big[\mathrm{Loss}(X, \beta)\big]}_{\text{worst-case loss}}$$

Theorem 5 [Blanchet, Kang & M]. If we let $c(u, v) = \|u - v\|_2^2$ and $h(x, \beta) = D_\beta \mathrm{Loss}(x, \beta)$, then
$$n\, R_n(\beta_*) \;\xrightarrow{\;D\;}\; \xi^T A^{-1} \xi,$$
where $\xi \sim N\big(0, \mathrm{Cov}[h(X, \beta_*)]\big)$ and $A = E\big[D_x h(X, \beta_*)\, D_x h(X, \beta_*)^T\big]$.

SLIDE 47

RWPI Linear Regression:
$$\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D_c(Q, P_n) \le \delta} \; E_Q\big[(Y - \beta^T X)^2\big] \;=\; \inf_{\beta \in \mathbb{R}^d} \Big( \sqrt{\mathrm{MSE}_n(\beta)} + \sqrt{\delta}\, \|\beta\|_1 \Big)^2$$

A prescription for δ ⟹ a prescription for the regularization parameter

SLIDE 48

RWPI Linear Regression:
$$\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D_c(Q, P_n) \le \delta} \; E_Q\big[(Y - \beta^T X)^2\big] \;=\; \inf_{\beta \in \mathbb{R}^d} \Big( \sqrt{\mathrm{MSE}_n(\beta)} + \sqrt{\delta}\, \|\beta\|_1 \Big)^2$$

A prescription for δ ⟹ a prescription for the regularization parameter

◮ Recall that we chose δ such that $P\{R_n(\beta_*) \le \delta\} \ge 1 - \alpha$
◮ If X has sub-Gaussian tails, then the corresponding prescription for the tuning parameter turns out to be
$$\frac{c\, \Phi^{-1}(1 - \alpha/2d)}{\sqrt{n}} \;=\; O\left(\sqrt{\frac{\log d}{n}}\right)$$
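A minimal sketch (not from the slides) of this tuning-parameter prescription, $c\,\Phi^{-1}(1 - \alpha/2d)/\sqrt{n}$. The constant c is left unspecified on the slide; setting c = 1 below is an illustrative assumption.

```python
# RWPI-suggested regularization level (up to the unspecified constant c).
import numpy as np
from scipy.stats import norm

def rwpi_tuning_parameter(n, d, alpha=0.05, c=1.0):
    """c * Phi^{-1}(1 - alpha/(2d)) / sqrt(n), of order sqrt(log d / n)."""
    return c * norm.ppf(1 - alpha / (2 * d)) / np.sqrt(n)

print(rwpi_tuning_parameter(n=100, d=1000))
```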
SLIDE 49

Concluding remarks

◮ Distributional robustness
◮ Viewing regularization through the lens of distributional robustness
◮ Applications to stochastic optimization
◮ Additional learning applications where the regularization structure may not be clear?

SLIDE 50

RWPI Linear Regression:
$$\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D_c(Q, P_n) \le \delta} \; E_Q\big[(Y - \beta^T X)^2\big]$$

Model: $Y = 3X_1 + 2X_2 + 1.5X_4 + e$, with $X \sim N(0, \Sigma)$, $\Sigma_{k,j} = 0.5^{|k-j|}$, $e \sim N(0, 1)$; n = 100 training samples of (X, Y).

   d      RWPI     Cross Validation    (log d / n)^{1/2}
  10      3 (3)        8 (3)                4 (3)
 500      3 (3)       10 (3)                6 (3)
1000      3 (3)       19 (3)               11 (3)
3000      3 (3)       55 (3)               17 (3)

Table: Performance of different choices of regularization parameters for generalized Lasso.
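A minimal sketch (not from the slides) that generates data from the model on this slide: $Y = 3X_1 + 2X_2 + 1.5X_4 + e$ with $X \sim N(0, \Sigma)$, $\Sigma_{k,j} = 0.5^{|k-j|}$, $e \sim N(0,1)$, and n = 100. The choice d = 500 and the use of NumPy are illustrative assumptions; the generated data could then be fed to any of the regularized estimators compared in the table.

```python
# Data generation for the experiment on SLIDE 50.
import numpy as np

rng = np.random.default_rng(42)
n, d = 100, 500
# Sigma_{k,j} = 0.5^{|k-j|}
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
X = rng.multivariate_normal(mean=np.zeros(d), cov=Sigma, size=n)

beta_true = np.zeros(d)
beta_true[[0, 1, 3]] = [3.0, 2.0, 1.5]   # X_1, X_2, X_4 in 1-based indexing
Y = X @ beta_true + rng.normal(size=n)
print(X.shape, Y.shape)
```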