
Robustness and Regularization: Two sides of the same coin



  1. Robustness and Regularization: Two sides of the same coin (Joint work with Jose Blanchet and Yang Kang). Karthyek Murthy, Columbia University. Jun 28, 2016.

  2. Introduction
     ◮ Richer data has tempted us to consider more elaborate models. Elaborate models ⇒ more factors/variables.
     ◮ Generalization has become a lot more challenging.
     ◮ Regularization has been useful in avoiding overfitting.
     Goal: a distributionally robust approach for improving generalization.

  3. Motivation for distributionally robust optimization
     ◮ Want to solve the stochastic optimization problem $\min_\beta E[\mathrm{Loss}(X, \beta)]$.
     ◮ Typically, we have access to the probability distribution of $X$ only via its samples $\{X_1, \dots, X_n\}$.
     ◮ A common practice is to instead solve $\min_\beta \frac{1}{n}\sum_{i=1}^n \mathrm{Loss}(X_i, \beta)$.

  4. Solve $\min_\beta \frac{1}{n}\sum_{i=1}^n \mathrm{Loss}(X_i, \beta)$ as a proxy for $\min_\beta E[\mathrm{Loss}(X, \beta)]$.
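A minimal sketch of this sample average approximation; the quadratic loss and the data-generating distribution below are our own illustrative choices, not fixed by the slides:

```python
# Sample average approximation (SAA): minimize the empirical average of a
# loss over samples of X, as a stand-in for the unknown expectation.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=500)   # samples X_1, ..., X_n

def empirical_loss(beta):
    # (1/n) * sum_i Loss(X_i, beta) with Loss(x, b) = (x - b)^2
    return np.mean((X - beta) ** 2)

res = minimize(empirical_loss, x0=np.array([0.0]))
print(res.x)   # close to E[X] = 3, the minimizer of the true expected loss
```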

  5. [Figure: plot of the sampled data over the range $-15$ to $15$.] Solve $\min_\beta \frac{1}{n}\sum_{i=1}^n \mathrm{Loss}(X_i, \beta)$ as a proxy for $\min_\beta E[\mathrm{Loss}(X, \beta)]$.

  6. [Figure: two side-by-side plots of sampled data over the range $-15$ to $15$.] Solve $\min_\beta \frac{1}{n}\sum_{i=1}^n \mathrm{Loss}(X_i, \beta)$ as a proxy for $\min_\beta E[\mathrm{Loss}(X, \beta)]$.

  7. Learning. Naturally thought of as finding the "best" $f$ such that $y_i = f(x_i) + e_i$, $i = 1, \dots, n$, where $x_i = (x^1, \dots, x^d)$ is the vector of predictors and $y_i$ is the corresponding response. (Image source: r-bloggers.com)

  8. Learning. Find the "best" $f$ such that $y_i = f(x_i) + e_i$, $i = 1, \dots, n$. Empirical loss/risk minimization (ERM): $\frac{1}{n}\sum_{i=1}^n \mathrm{Loss}(f(x_i), y_i)$. (Image source: r-bloggers.com)

  9. Learning. Find the "best" $f$ such that $y_i = f(x_i) + e_i$, $i = 1, \dots, n$. ERM with squared loss: $\frac{1}{n}\sum_{i=1}^n \mathrm{Loss}(f(x_i), y_i) = \frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2$. (Image source: r-bloggers.com)

  10. Learning. Finding the "best" $f$ such that $y_i = f(x_i) + e_i$, $i = 1, \dots, n$, is not enough: find an $f$ that fits well over "future" values as well. (Image source: r-bloggers.com)
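To see why fitting the training data alone is not enough, a small sketch of our own (not from the slides): a high-degree polynomial drives the training error near zero yet does much worse on fresh samples from the same distribution.

```python
# Overfitting illustration: a degree-15 polynomial nearly interpolates the
# training data but has much larger error on new samples from the same P.
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    x = rng.uniform(-3, 3, n)
    return x, np.sin(x) + 0.3 * rng.normal(size=n)   # y = f(x) + e

x_tr, y_tr = sample(20)
x_te, y_te = sample(1000)

coef = np.polyfit(x_tr, y_tr, deg=15)   # ERM over degree-15 polynomials
train_mse = np.mean((y_tr - np.polyval(coef, x_tr)) ** 2)
test_mse = np.mean((y_te - np.polyval(coef, x_te)) ** 2)
print(train_mse, test_mse)   # training error tiny, test error much larger
```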

  11. Generalization. Think of the data $(x_1, y_1), \dots, (x_n, y_n)$ as samples from a probability distribution $P$. Then "future values" can also be interpreted as samples from $P$.

  12. Generalization. Think of the data $(x_1, y_1), \dots, (x_n, y_n)$ as samples from a probability distribution $P$; then "future values" can also be interpreted as samples from $P$:
     $\min_f \frac{1}{n}\sum_{i=1}^n \mathrm{Loss}(f(x_i), y_i) \;\longrightarrow\; \min_f E_P[\mathrm{Loss}(f(X), Y)]$
     However, the access to $P$ is still via samples: $P_n = \frac{1}{n}\sum_{i=1}^n \delta_{(x_i, y_i)}$.
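Spelling out why these two problems line up, a routine plug-in identity added here for completeness:

```latex
% ERM is expected-loss minimization under the empirical measure P_n:
\mathbb{E}_{P_n}\!\left[\mathrm{Loss}(f(X), Y)\right]
  = \int \mathrm{Loss}(f(x), y)\, \mathrm{d}P_n(x, y)
  = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(f(x_i), y_i),
% so minimizing empirical risk is the plug-in version of minimizing the
% population risk, with the unknown P replaced by P_n.
```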

  13. Want to solve $\min_{f \in \mathcal{F}} E_P[\mathrm{Loss}(f(X), Y)]$; $P$ unknown.

  14. Know how to solve $\min_{f \in \mathcal{F}} E_{P_n}[\mathrm{Loss}(f(X), Y)]$; access to $P$ is via the training samples, $P_n$.

  15. More and more samples give a better approximation to $P$; however, the quality of this approximation depends on the dimension.

  16. We are provided with only limited training data ($n$ samples), sometimes to the extent that $n$ is even smaller than the dimension of the parameter of interest.

  17. Instead of finding the best fit with respect to $P_n$, why not find a fit that works over all $Q$ such that $D(Q, P_n) \le \delta$?

  18. Formally, $\min_{f \in \mathcal{F}} \max_{Q:\, D(Q, P_n) \le \delta} E_Q[\mathrm{Loss}(f(X), Y)]$.

  19. DR Regression: $\min_{f \in \mathcal{F}} \max_{Q:\, D(Q, P_n) \le \delta} E_Q[\mathrm{Loss}(f(X), Y)]$

  20. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \max_{Q:\, D(Q, P_n) \le \delta} E_Q[(Y - \beta^T X)^2]$

  21. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \max_{Q:\, D(Q, P_n) \le \delta} E_Q[(Y - \beta^T X)^2]$
     I. Are these DR regression problems solvable?
        ◮ If so, how do they compare with known methods for improving generalization? (See the note after this slide.)
     II. How to beat the curse of dimensionality while choosing $\delta$?
        ◮ Robust Wasserstein profile function
     III. Does the framework scale?
        ◮ Support vector machines
        ◮ Logistic regression
        ◮ General sample average approximation
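On question I, as we understand the companion result of this line of work (Blanchet, Kang & Murthy, "Robust Wasserstein Profile Inference and Applications to Machine Learning"; this is a hedged restatement, not reproduced from these slides), for a suitable transport cost on the predictors the inner maximization evaluates in closed form:

```latex
% Hedged restatement: take cost c((x,y),(u,v)) = ||x - u||_q^2 when y = v,
% and +infinity otherwise (predictors may move, responses may not). Then
\sup_{Q:\, D_c(Q, P_n) \le \delta} \mathbb{E}_Q\!\left[(Y - \beta^T X)^2\right]
  = \left( \sqrt{\mathbb{E}_{P_n}\!\left[(Y - \beta^T X)^2\right]}
           + \sqrt{\delta}\, \|\beta\|_p \right)^{\!2},
\qquad \tfrac{1}{p} + \tfrac{1}{q} = 1,
% so DR linear regression coincides with square-root-LASSO-type regularized
% regression: the "two sides of the same coin" of the title.
```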

  22. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \max_{Q:\, D(Q, P_n) \le \delta} E_Q[(Y - \beta^T X)^2]$. How to quantify the distance $D(P, Q)$?

  23. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \max_{Q:\, D(Q, P_n) \le \delta} E_Q[(Y - \beta^T X)^2]$. How to quantify the distance $D(P, Q)$? Ans: Let $(U, V)$ be two random variables such that $U \sim P$ and $V \sim Q$, and call their joint distribution $\pi$. Then $D(P, Q) = \inf_\pi E_\pi \|U - V\|$.

  24. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \max_{Q:\, D(Q, P_n) \le \delta} E_Q[(Y - \beta^T X)^2]$. [Figure: Monge's transport map $T$ carrying "déblais" to "remblais".¹] How to quantify the distance $D(P, Q)$? Ans: Let $(U, V)$ be two random variables such that $U \sim P$ and $V \sim Q$, and call their joint distribution $\pi$. Then $D(P, Q) = \inf_\pi E_\pi \|U - V\|$. (¹ Image from the book Optimal Transport: Old and New by Cédric Villani.)

  25. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \max_{Q:\, D_c(Q, P_n) \le \delta} E_Q[(Y - \beta^T X)^2]$. How to quantify the distance? Ans: Let $(U, V)$ be two random variables such that $U \sim P$ and $V \sim Q$, and call their joint distribution $\pi$. Then $D_c(P, Q) = \inf_\pi E_\pi[c(U, V)]$. The metric $D_c$ is called the optimal transport metric. When $c(u, v) = \|u - v\|^p$, $D_c^{1/p}$ is the $p$-th order Wasserstein distance.
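For intuition, when $P$ and $Q$ are empirical measures the infimum over couplings is a finite linear program. A minimal sketch, assuming 1-D samples, uniform weights, and cost $c(u, v) = |u - v|$ (all our own illustrative choices):

```python
# Discrete optimal transport: D_c(P, Q) between two empirical measures,
# solved as a linear program over couplings pi.
import numpy as np
from scipy.optimize import linprog

def transport_distance(u, v):
    """D_c between uniform empirical measures on 1-D samples u and v,
    with cost c(a, b) = |a - b|."""
    n, m = len(u), len(v)
    cost = np.abs(u[:, None] - v[None, :])        # cost matrix c(u_i, v_j)
    # Marginal constraints: rows of pi sum to 1/n, columns to 1/m.
    A_rows = np.kron(np.eye(n), np.ones(m))       # sum_j pi[i, j] = 1/n
    A_cols = np.kron(np.ones(n), np.eye(m))       # sum_i pi[i, j] = 1/m
    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([np.full(n, 1 / n), np.full(m, 1 / m)])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun

rng = np.random.default_rng(2)
u, v = rng.normal(0, 1, 30), rng.normal(1, 1, 30)
print(transport_distance(u, v))
# Sanity check: for equal-size 1-D samples with cost |.|, the optimum is the
# sorted matching, i.e. the mean absolute difference of sorted samples.
print(np.mean(np.abs(np.sort(u) - np.sort(v))))
```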

  26. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \max_{Q:\, D_c(Q, P_n) \le \delta} E_Q[(Y - \beta^T X)^2]$. Next, how do we choose $\delta$?

  27. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \max_{Q:\, D_c(Q, P_n) \le \delta} E_Q[(Y - \beta^T X)^2]$. Next, how do we choose $\delta$? See Fournier and Guillin (2015), Lee and Mehrotra (2013), and Shafieezadeh-Abadeh, Esfahani and Kuhn (2015).

  28. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \max_{Q:\, D_c(Q, P_n) \le \delta} E_Q[(Y - \beta^T X)^2]$. The object of interest $\beta_*$ satisfies $E_P[(Y - \beta_*^T X) X] = 0$.
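This estimating equation is just the first-order condition of the population least-squares problem; writing it out:

```latex
% beta_* minimizes the population squared loss; differentiate under E_P:
\nabla_{\beta}\, \mathbb{E}_P\!\left[(Y - \beta^T X)^2\right]
  = -2\, \mathbb{E}_P\!\left[(Y - \beta^T X)\, X\right],
% and setting the gradient to zero at beta = beta_* gives the estimating
% equation on this slide:
\mathbb{E}_P\!\left[(Y - \beta_*^T X)\, X\right] = 0 .
```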

  29. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \max_{Q:\, D_c(Q, P_n) \le \delta} E_Q[(Y - \beta^T X)^2]$. The object of interest $\beta_*$ satisfies $E_P[(Y - \beta_*^T X) X] = 0$. [Figure: $P_n$, $P$, and the set $\{Q : E_Q[(Y - \beta_*^T X) X] = 0\}$.]

  30. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \max_{Q:\, D_c(Q, P_n) \le \delta} E_Q[(Y - \beta^T X)^2]$. The object of interest $\beta_*$ satisfies $E_P[(Y - \beta_*^T X) X] = 0$. Define $R_n(\beta_*) = \min\{ D_c(Q, P_n) : E_Q[(Y - \beta_*^T X) X] = 0 \}$. [Figure: $P_n$, $P$, and the set $\{Q : E_Q[(Y - \beta_*^T X) X] = 0\}$ at distance $R_n(\beta_*)$ from $P_n$.]

  31. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \max_{Q:\, D_c(Q, P_n) \le \delta} E_Q[(Y - \beta^T X)^2]$.
     Theorem 1 [Blanchet, Kang & M]: If $Y = \beta_*^T X + \epsilon$, then $n R_n(\beta_*) \xrightarrow{D} L$, where $R_n(\beta_*) = \min\{ D_c(Q, P_n) : E_Q[(Y - \beta_*^T X) X] = 0 \}$.

  32. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \max_{Q:\, D_c(Q, P_n) \le \delta} E_Q[(Y - \beta^T X)^2]$.
     Theorem 1 [Blanchet, Kang & M]: If $Y = \beta_*^T X + \epsilon$, then $n R_n(\beta_*) \xrightarrow{D} L$.
     Choose $\delta = \eta / n$, where $\eta$ is such that $P\{L \le \eta\} \ge 0.95$.

  33. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \max_{Q:\, D_c(Q, P_n) \le \delta} E_Q[(Y - \beta^T X)^2]$.
     Theorem 1 [Blanchet, Kang & M]: If $Y = \beta_*^T X + \epsilon$, then $n R_n(\beta_*) \xrightarrow{D} L$.
     Choose $\delta = \eta_\alpha / n$, where $\eta_\alpha$ is such that $P\{L \le \eta_\alpha\} \ge 1 - \alpha$. (An illustrative numerical recipe follows below.)
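One way to act on this prescription numerically, sketched under our own assumptions (this is not necessarily the authors' procedure): approximate the quantile $\eta_\alpha$ of the limit $L$ by bootstrapping $n R_n(\beta)$, where `rwp` is a hypothetical routine evaluating the profile function (a crude LP approximation of it appears after slide 38):

```python
# Illustrative sketch (our own, not from the slides): pick delta as
# eta_alpha / n, with eta_alpha the (1 - alpha) quantile of n * R_n(beta)
# estimated by bootstrap resampling. `rwp(beta, X, y)` is a hypothetical
# routine evaluating the robust Wasserstein profile function R_n(beta).
import numpy as np

def choose_delta(rwp, beta_hat, X, y, alpha=0.05, B=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = []
    for _ in range(B):
        idx = rng.integers(0, n, n)            # bootstrap resample of the data
        stats.append(n * rwp(beta_hat, X[idx], y[idx]))
    eta_alpha = np.quantile(stats, 1 - alpha)  # estimate of L's quantile
    return eta_alpha / n
```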

  34. Robust Wasserstein profile function: $R_n(\beta) = \min\{ D_c(Q, P_n) : E_Q[(Y - \beta^T X) X] = 0 \}$

  35. Robust Wasserstein profile function: $R_n(\beta) = \min\{ D_c(Q, P_n) : E_Q[(Y - \beta^T X) X] = 0 \}$ [Figure: a density $p(x, y)$ over the $(x, y)$ plane, with $P_n$ marked.]

  36. Robust Wasserstein profile function: $R_n(\beta) = \min\{ D_c(Q, P_n) : E_Q[(Y - \beta^T X) X] = 0 \}$ [Figure: the density $p(x, y)$ with both $P_n$ and a perturbed measure $\tilde{P}_n$ marked.]

  37. Robust Wasserstein profile function: $R_n(\beta) = \min\{ D_c(Q, P_n) : E_Q[(Y - \beta^T X) X] = 0 \}$ [Figure: the perturbation of $P_n$ into $\tilde{P}_n$, with $D_c(\tilde{P}_n, P_n) = R_n(\beta)$.]

  38. Robust Wasserstein profile function: $R_n(\beta) = \min\{ D_c(Q, P_n) : E_Q[(Y - \beta^T X) X] = 0 \}$
     ◮ Basically, $R_n(\beta)$ is a measure of the goodness of $\beta$:
       $n R_n(\beta) \xrightarrow{D} L$ if $\beta = \beta_*$, and $n R_n(\beta) \to \infty$ if $\beta \ne \beta_*$.
     ◮ Similar to the empirical likelihood profile function.
     ◮ In the high-dimensional setting, one can instead consider suitable non-asymptotic bounds for $n R_n(\beta)$.
     (A crude numerical sketch of $R_n(\beta)$ follows below.)
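A crude numerical sketch of $R_n(\beta)$, assuming simple linear regression ($d = 1$) and restricting the perturbed measure $Q$ to a finite candidate grid so the optimization becomes an LP. The discretization, the cost, and the function names are our own illustration, not the authors' algorithm:

```python
# LP approximation of the robust Wasserstein profile function R_n(beta)
# for d = 1, with Q supported on y-shifted copies of the sample points.
import numpy as np
from scipy.optimize import linprog

def rwp(beta, x, y, shifts=np.linspace(-3.0, 3.0, 13)):
    """Approximate R_n(beta) = min{ D_c(Q, P_n) : E_Q[(Y - beta*X) X] = 0 }
    with cost c = |x - x'| + |y - y'| (an arbitrary modeling choice)."""
    n = len(x)
    gx = np.repeat(x, len(shifts))                 # candidate x-coordinates
    gy = (y[:, None] + shifts[None, :]).ravel()    # candidate y-coordinates
    m = len(gx)
    cost = np.abs(x[:, None] - gx[None, :]) + np.abs(y[:, None] - gy[None, :])
    h = (gy - beta * gx) * gx                      # estimating function at grid
    A_rows = np.kron(np.eye(n), np.ones(m))        # each sample ships mass 1/n
    A_eq = np.vstack([A_rows, np.tile(h, n)[None, :]])  # + moment constraint
    b_eq = np.concatenate([np.full(n, 1.0 / n), [0.0]])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun if res.success else np.inf     # inf: grid too coarse

rng = np.random.default_rng(3)
x = rng.normal(size=20)
y = 2.0 * x + 0.5 * rng.normal(size=20)           # true beta_* = 2
print(rwp(2.0, x, y), rwp(0.0, x, y))             # small near beta_*, larger away
```

Consistent with the dichotomy on slide 38, the approximate profile value stays small at $\beta_*$ and grows when $\beta$ is far from it.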
