
Robust Statistics and Generative Adversarial Networks (Yuan Yao)

  1. Robust Statistics and Generative Adversarial Networks. Yuan YAO, HKUST.

  2. Chao GAO (U. Chicago), Jiyi LIU (Yale U.), Weizhi ZHU (HKUST).

  3. Deep Learning is Notoriously Not Robust!
     • Imperceptible adversarial examples are ubiquitous and easily fool neural networks.
     • How can one achieve robustness?

  4. Robust Optimization
     • Traditional training: \min_\theta J_n(\theta, z = (x_i, y_i)_{i=1}^n)
     • e.g. squared or cross-entropy loss as the negative log-likelihood of logit models
     • Robust optimization (Madry et al., ICLR 2018): \min_\theta \max_{\|\epsilon_i\| \le \delta} J_n(\theta, z = (x_i + \epsilon_i, y_i)_{i=1}^n)
     • robust to any perturbation distribution, yet computationally hard
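
     The inner maximization has no closed form for neural networks; in practice it is approximated by projected gradient ascent on the perturbation (PGD). The following is a minimal sketch of this min-max training loop, assuming PyTorch is available and using a toy linear model and synthetic data in place of a real network and dataset.

     ```python
     # Minimal sketch of Madry-style adversarial training (assumptions: PyTorch,
     # toy linear classifier, synthetic data, l_inf perturbation budget).
     import torch
     import torch.nn as nn

     torch.manual_seed(0)
     n, d = 256, 20
     X = torch.randn(n, d)
     y = (X[:, 0] > 0).long()                     # toy labels
     model = nn.Linear(d, 2)                      # the parameter theta
     opt = torch.optim.SGD(model.parameters(), lr=0.1)
     loss_fn = nn.CrossEntropyLoss()
     eps, alpha, pgd_steps = 0.1, 0.02, 10        # budget, inner step size, inner steps

     for epoch in range(50):
         # Inner maximization: worst-case perturbation within the l_inf ball
         delta = torch.zeros_like(X, requires_grad=True)
         for _ in range(pgd_steps):
             loss = loss_fn(model(X + delta), y)
             loss.backward()
             with torch.no_grad():
                 delta += alpha * delta.grad.sign()   # gradient ascent step
                 delta.clamp_(-eps, eps)              # project back onto the ball
             delta.grad.zero_()
         # Outer minimization: update theta on the adversarially perturbed batch
         opt.zero_grad()
         loss_fn(model(X + delta.detach()), y).backward()
         opt.step()
     ```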

  5. Distributionally Robust Optimization (DRO)
     • Distributionally robust optimization: \min_\theta \max_{P_\epsilon \in D} E_{z \sim P_\epsilon}[J_n(\theta, z)]
     • D is a set of ambiguous distributions, e.g. the Wasserstein ambiguity set
       D = \{ P_\epsilon : W_2(P_\epsilon, \text{uniform (empirical) distribution}) \le \epsilon \},
       where DRO may be reduced to regularized maximum likelihood estimates (Shafieezadeh-Abadeh, Esfahani, Kuhn, NIPS 2015), which are convex optimization problems and hence tractable

  6. Wasserstein DRO and Sqrt-Lasso (Jose Blanchet et al., 2016)
     Theorem (Blanchet, Kang, Murthy, 2016). Suppose that
       c((x, y), (x', y')) = \|x - x'\|_q^2 if y = y', and \infty if y \ne y'.
     Then, if 1/p + 1/q = 1,
       \max_{P : D_c(P, P_n) \le \delta} E_P^{1/2}\left[(Y - \beta^T X)^2\right] = E_{P_n}^{1/2}\left[(Y - \beta^T X)^2\right] + \sqrt{\delta}\,\|\beta\|_p.
     Remark 1: This is the sqrt-Lasso (Belloni et al., 2011).
     Remark 2: Uses the RoPA duality theorem and a "judicious choice of c(·)".

  7. Certified Robustness of Lasso
     Take q = \infty and p = 1, with
       c((x, y), (x', y')) = \|x - x'\|_\infty^2 if y = y', and \infty if y \ne y'.
     Then for P'_n = \frac{1}{n}\sum_i \delta_{x'_i} with \|x_i - x'_i\|_\infty \le \delta,
       D_c(P'_n, P_n) = \int c((x, y), (x', y'))\, d\pi((x, y), (x', y')) \le \delta,
     for small enough \delta and well-separated x's. The sqrt-Lasso
       \min_\beta \left\{ E_{P_n}^{1/2}\left[(Y - \beta^T X)^2\right] + \sqrt{\delta}\,\|\beta\|_1 \right\}
         = \min_\beta \max_{P : D_c(P, P_n) \le \delta} E_P^{1/2}\left[(Y - \beta^T X)^2\right]
     therefore provides a certified robust estimate, in the sense of Madry's adversarial training, via a convex Wasserstein relaxation.
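
     For concreteness, here is a minimal sketch of the square-root Lasso objective that the Wasserstein DRO problem reduces to, solved with cvxpy on synthetic data; the radius delta, the data, and the sparsity pattern are illustrative assumptions, not values from the slides.

     ```python
     # Sqrt-Lasso sketch: minimize sqrt(mean squared residual) + sqrt(delta)*||beta||_1.
     # Assumes numpy and cvxpy are available; data and delta are synthetic.
     import numpy as np
     import cvxpy as cp

     rng = np.random.default_rng(0)
     n, p, delta = 200, 50, 0.01
     beta_true = np.zeros(p); beta_true[:5] = 1.0
     X = rng.standard_normal((n, p))
     y = X @ beta_true + 0.5 * rng.standard_normal(n)

     beta = cp.Variable(p)
     objective = cp.norm(y - X @ beta, 2) / np.sqrt(n) + np.sqrt(delta) * cp.norm1(beta)
     prob = cp.Problem(cp.Minimize(objective))
     prob.solve()
     print("nonzero coefficients:", np.sum(np.abs(beta.value) > 1e-3))
     ```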

  8. TV-neighborhood
     • Now how about the TV-uncertainty set D = \{ P_\epsilon : TV(P_\epsilon, \text{uniform (empirical) distribution}) \le \epsilon \}?
     • An example from robust statistics ...

  9. Huber's Model
     X_1, ..., X_n \sim (1 - \epsilon) P_\theta + \epsilon Q,
     where \epsilon is the contamination proportion, Q is an arbitrary contamination distribution, and \theta is the parameter of interest. [Huber, 1964]

  10. An Example
      X_1, ..., X_n \sim (1 - \epsilon) N(\theta, I_p) + \epsilon Q. How to estimate \theta?

  11. Robust Maximum-Likelihood Does Not Work!
      X_1, ..., X_n \sim (1 - \epsilon) N(\theta, I_p) + \epsilon Q. How to estimate \theta?
      \ell(\theta, Q) = \text{negative log-likelihood} = \sum_{i=1}^n \|\theta - X_i\|^2
                      \sim (1 - \epsilon)\, E_{N(\theta_0)}\|\theta - X\|^2 + \epsilon\, E_Q \|\theta - X\|^2  (with \theta_0 the true parameter)
      The sample mean \hat{\theta}_{mean} = \frac{1}{n}\sum_{i=1}^n X_i = \arg\min_\theta \ell(\theta, Q), yet
      \min_\theta \max_Q \ell(\theta, Q) \ge \max_Q \min_\theta \ell(\theta, Q) = \max_Q \ell(\hat{\theta}_{mean}, Q) = \infty.

  12. Medians
      1. Coordinatewise median: \hat{\theta} = (\hat{\theta}_j), where \hat{\theta}_j = \mathrm{Median}(\{X_{ij}\}_{i=1}^n);
      2. Tukey's median: \hat{\theta} = \arg\max_{\eta \in R^p} \min_{\|u\|=1} \frac{1}{n}\sum_{i=1}^n I\{u^T X_i > u^T \eta\}.
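
      A small numpy experiment (synthetic data; the contamination Q is a hypothetical cluster centered at 10·1_p) illustrates why one turns to medians under Huber contamination: the sample mean is dragged toward Q while the coordinatewise median stays near \theta.

      ```python
      # Sample mean vs. coordinatewise median under Huber contamination.
      # Assumes numpy; the contamination Q is an illustrative choice.
      import numpy as np

      rng = np.random.default_rng(0)
      n, p, eps, theta = 2000, 20, 0.2, 0.0
      X = rng.standard_normal((n, p)) + theta                          # clean part: N(theta, I_p)
      outliers = rng.random(n) < eps
      X[outliers] = 10.0 + rng.standard_normal((outliers.sum(), p))    # contamination Q

      mean_est = X.mean(axis=0)
      cw_median = np.median(X, axis=0)                                 # coordinatewise median
      print("||mean - theta||_2          =", np.linalg.norm(mean_est - theta))
      print("||coord. median - theta||_2 =", np.linalg.norm(cw_median - theta))
      ```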

  13. Comparisons
      Coordinatewise median: breakdown point 1/2; statistical precision (no contamination) p/n; minimax precision (with contamination) p/n + p\epsilon^2; computational complexity polynomial.
      Tukey's median: breakdown point 1/3; statistical precision (no contamination) p/n; minimax precision (with contamination) p/n + \epsilon^2 [Chen-Gao-Ren '15]; computational complexity NP-hard [Amenta et al. '00].
      Note: the R package for the Tukey median cannot deal with more than 10 dimensions! [https://github.com/ChenMengjie/DepthDescent]

  14. Depth and Statistical Properties

  15. Multivariate Location Depth
      \hat{\theta} = \arg\max_{\eta \in R^p} \min_{\|u\|=1} \left\{ \frac{1}{n}\sum_{i=1}^n I\{u^T X_i > u^T \eta\} \wedge \frac{1}{n}\sum_{i=1}^n I\{u^T X_i \le u^T \eta\} \right\}
      Estimator 2: \hat{\theta} = \arg\max_{\eta \in R^p} \min_{\|u\|=1} \frac{1}{n}\sum_{i=1}^n I\{u^T X_i > u^T \eta\}.
      [Tukey, 1975]
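
      Since the exact Tukey median is NP-hard to compute, a common heuristic is to approximate the minimum over directions u by a finite set of random unit vectors. The sketch below is an illustrative approximation (not the algorithm used later in the talk): it evaluates the approximate depth at each data point and returns the deepest one.

      ```python
      # Approximate Tukey depth and Tukey median via random unit directions.
      # Assumes numpy; synthetic data; random search is a heuristic approximation.
      import numpy as np

      rng = np.random.default_rng(0)
      n, p = 500, 5
      X = rng.standard_normal((n, p))

      def approx_tukey_depth(eta, X, n_dirs=500, rng=rng):
          U = rng.standard_normal((n_dirs, X.shape[1]))
          U /= np.linalg.norm(U, axis=1, keepdims=True)        # random unit directions u
          proj = X @ U.T - eta @ U.T                           # u^T (X_i - eta)
          return np.minimum((proj > 0).mean(axis=0), (proj <= 0).mean(axis=0)).min()

      depths = np.array([approx_tukey_depth(x, X) for x in X])  # candidates eta = X_i
      tukey_median = X[depths.argmax()]
      print("approximate Tukey median:", np.round(tukey_median, 3))
      ```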

  16. Regression Depth
      model:      y | X \sim N(X^T \beta, \sigma^2)
      embedding:  X y | X \sim N(X X^T \beta, \sigma^2 X X^T)
      projection: u^T X y | X \sim N(u^T X X^T \beta, \sigma^2 u^T X X^T u)
      \hat{\beta} = \arg\max_{\eta \in R^p} \min_{u \in R^p} \left\{ \frac{1}{n}\sum_{i=1}^n I\{u^T X_i (y_i - X_i^T \eta) > 0\} \wedge \frac{1}{n}\sum_{i=1}^n I\{u^T X_i (y_i - X_i^T \eta) \le 0\} \right\}
      [Rousseeuw & Hubert, 1999]

  17. Tukey's depth is not a special case of regression depth.

  18. Multi-task Regression Depth
      (X, Y) \in R^p \times R^m \sim P, depth of B \in R^{p \times m}:
      population version: D_U(B, P) = \inf_{U \in \mathcal{U}} P\left( \langle U^T X, Y - B^T X \rangle \ge 0 \right)
      empirical version:  D_U(B, \{(X_i, Y_i)\}_{i=1}^n) = \inf_{U \in \mathcal{U}} \frac{1}{n}\sum_{i=1}^n I\left\{ \langle U^T X_i, Y_i - B^T X_i \rangle \ge 0 \right\}
      [Mizera, 2002]

  19. Multi-task Regression Depth
      D_U(B, P) = \inf_{U \in \mathcal{U}} P\left( \langle U^T X, Y - B^T X \rangle \ge 0 \right)
      p = 1, X = 1 \in R:  D_U(b, P) = \inf_{u \in \mathcal{U}} P\left( u^T (Y - b) \ge 0 \right)
      m = 1:               D_U(\beta, P) = \inf_{u \in \mathcal{U}} P\left( u^T X (y - \beta^T X) \ge 0 \right)

  20. Multi-task Regression Depth
      Estimation error. For any \delta > 0,
        \sup_{B \in R^{p \times m}} |D(B, P_n) - D(B, P)| \le C \sqrt{\frac{pm}{n}} + \sqrt{\frac{\log(1/\delta)}{2n}},
      with probability at least 1 - 2\delta.
      Contamination error.
        \sup_{B, Q} |D(B, (1 - \epsilon) P_{B^*} + \epsilon Q) - D(B, P_{B^*})| \le \epsilon.

  21. Multi-task Regression Depth
      (X, Y) \sim P_B:  X \sim N(0, \Sigma),  Y | X \sim N(B^T X, \sigma^2 I_m),
      (X_1, Y_1), ..., (X_n, Y_n) \sim (1 - \epsilon) P_B + \epsilon Q.
      Theorem [G17]. For some C > 0,
        \mathrm{Tr}\left( (\hat{B} - B)^T \Sigma (\hat{B} - B) \right) \le C \sigma^2 \left( \frac{pm}{n} \vee \epsilon^2 \right),
        \|\hat{B} - B\|_F^2 \le C \frac{\sigma^2}{\kappa^2} \left( \frac{pm}{n} \vee \epsilon^2 \right)  (where \kappa^2 lower-bounds the eigenvalues of \Sigma),
      with high probability, uniformly over B, Q.

  22. Covariance Matrix
      X_1, ..., X_n \sim (1 - \epsilon) N(0, \Sigma) + \epsilon Q. How to estimate \Sigma?

  23. Covariance Matrix

  24. Covariance Matrix
      Matrix depth:
        D(\Gamma, \{X_i\}_{i=1}^n) = \min_{\|u\|=1} \min\left\{ \frac{1}{n}\sum_{i=1}^n I\{|u^T X_i|^2 \ge u^T \Gamma u\}, \frac{1}{n}\sum_{i=1}^n I\{|u^T X_i|^2 < u^T \Gamma u\} \right\},
        \hat{\Gamma} = \arg\max_{\Gamma \succeq 0} D(\Gamma, \{X_i\}_{i=1}^n),  \hat{\Sigma} = \hat{\Gamma} / \beta  (\beta a fixed scaling constant).
      Theorem [CGR15]. For some C > 0,
        \|\hat{\Sigma} - \Sigma\|_{op}^2 \le C \left( \frac{p}{n} \vee \epsilon^2 \right)
      with high probability, uniformly over \Sigma, Q.
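
      The matrix depth can likewise be approximated by sampling random unit directions. The numpy sketch below uses synthetic data; the constant 0.4549 is the median of the \chi^2_1 distribution, which is one natural reading of the scaling \hat{\Sigma} = \hat{\Gamma}/\beta on the slide, since for Gaussian data the depth maximizer is approximately that multiple of \Sigma.

      ```python
      # Evaluate the matrix depth of candidate Gamma via random unit directions.
      # Assumes numpy; synthetic Gaussian data; random directions approximate the min over u.
      import numpy as np

      rng = np.random.default_rng(0)
      n, p = 1000, 5
      Sigma = np.diag(np.arange(1.0, p + 1.0))
      X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

      def matrix_depth(Gamma, X, n_dirs=2000, rng=rng):
          U = rng.standard_normal((n_dirs, X.shape[1]))
          U /= np.linalg.norm(U, axis=1, keepdims=True)      # unit directions u
          lhs = (X @ U.T) ** 2                               # |u^T X_i|^2, shape (n, n_dirs)
          rhs = np.einsum('ij,jk,ik->i', U, Gamma, U)        # u^T Gamma u, shape (n_dirs,)
          above = (lhs >= rhs).mean(axis=0)
          return np.minimum(above, 1.0 - above).min()

      beta0 = 0.4549   # median of chi-squared_1; the depth maximizer is roughly beta0 * Sigma
      print("depth of beta0 * Sigma (near 1/2):", matrix_depth(beta0 * Sigma, X))
      print("depth of a misscaled guess       :", matrix_depth(0.1 * np.eye(p), X))
      ```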

  25. Summary
      Problem                     Loss                  Rate (with \epsilon-contamination)
      mean                        \|\cdot\|_2^2         \frac{p}{n} \vee \epsilon^2
      reduced-rank regression     \|\cdot\|_F^2         \frac{\sigma^2}{\kappa^2}\left( \frac{r(p+m)}{n} \vee \epsilon^2 \right)
      Gaussian graphical model    \|\cdot\|_{\ell_1}^2  \frac{s^2 \log(ep/s)}{n} \vee s\,\epsilon^2
      covariance matrix           \|\cdot\|_{op}^2      \frac{p}{n} \vee \epsilon^2
      sparse PCA                  \|\cdot\|_F^2         \frac{s \log(ep/s)}{n\lambda^2} \vee \frac{\epsilon^2}{\lambda^2}

  26. Computation

  27. Computational Challenges
      X_1, ..., X_n \sim (1 - \epsilon) N(\theta, I_p) + \epsilon Q. How to estimate \theta?
      Related work: Lai, Rao, Vempala; Diakonikolas, Kamath, Kane, Li, Moitra, Stewart; Balakrishnan, Du, Singh; Dalalyan; Carpentier, Collier, Verzelen.
      • Polynomial-time algorithms with minimax-optimal statistical precision have been proposed [Diakonikolas et al. '16, Lai et al. '16]
      • but they need information on second- or higher-order moments
      • and some prior knowledge about \epsilon

  28. Advantages of the Tukey Median
      • A well-defined objective function
      • Adaptive to \epsilon and \Sigma
      • Optimal for any elliptical distribution

  29. A practically good algorithm?

  30. Generative Adversarial Networks [Goodfellow et al. 2014]
      Note: the R package for the Tukey median cannot deal with more than 10 dimensions [https://github.com/ChenMengjie/DepthDescent]

  31. Robust Learning of Cauchy Distributions
      Table 4: Comparison of various methods of robust location estimation under Cauchy distributions. Samples are drawn from (1 - \epsilon) Cauchy(0_p, I_p) + \epsilon Q with \epsilon = 0.2, p = 50, and various choices of Q. Sample size: 50,000. Discriminator net structure: 50-50-25-1. Generator g_\omega(\xi) structure: 48-48-32-24-12-1 with absolute-value activation function in the output layer.

      Contamination Q               JS-GAN (G_1)      JS-GAN (G_2)      Dimension Halving   Iterative Filtering
      Cauchy(1.5*1_p, I_p)          0.0664 (0.0065)   0.0743 (0.0103)   0.3529 (0.0543)     0.1244 (0.0114)
      Cauchy(5.0*1_p, I_p)          0.0480 (0.0058)   0.0540 (0.0064)   0.4855 (0.0616)     0.1687 (0.0310)
      Cauchy(1.5*1_p, 5*I_p)        0.0754 (0.0135)   0.0742 (0.0111)   0.3726 (0.0530)     0.1220 (0.0112)
      Normal(1.5*1_p, 5*I_p)        0.0702 (0.0064)   0.0713 (0.0088)   0.3915 (0.0232)     0.1048 (0.0288)

      • Dimension Halving: [Lai et al. '16] https://github.com/kal2000/AgnosticMeanAndCovarianceCode
      • Iterative Filtering: [Diakonikolas et al. '17] https://github.com/hoonose/robust-filter

  32. f-GAN
      Given a strictly convex function f that satisfies f(1) = 0, the f-divergence between two probability distributions P and Q is defined by
        D_f(P \| Q) = \int f\left( \frac{p}{q} \right) dQ.                                              (8)
      Let f^* be the convex conjugate of f. A variational lower bound of (8) is
        D_f(P \| Q) \ge \sup_{T \in \mathcal{T}} \left[ E_P T(X) - E_Q f^*(T(X)) \right],               (9)
      where equality holds whenever the class \mathcal{T} contains the function f'(p/q). [Nowozin-Cseke-Tomioka '16]
      With i.i.d. observations X_1, ..., X_n \sim P, the f-GAN minimizes the variational lower bound (9):
        \hat{P} = \arg\min_{Q \in \mathcal{Q}} \sup_{T \in \mathcal{T}} \left[ \frac{1}{n}\sum_{i=1}^n T(X_i) - E_Q f^*(T(X)) \right].    (10)
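
      Specializing f to the Jensen-Shannon divergence and restricting the generator class \mathcal{Q} to \{N(\eta, I_p)\} gives a JS-GAN for robust location estimation. The PyTorch sketch below is a minimal illustration under assumed network sizes, learning rates, and contamination; it is not the authors' released implementation, and it uses the standard non-saturating generator update for stability.

      ```python
      # JS-GAN sketch for robust location estimation under Huber contamination.
      # Assumes PyTorch; network sizes, learning rates and Q are illustrative.
      import torch
      import torch.nn as nn

      torch.manual_seed(0)
      p, n, eps = 25, 5000, 0.2
      mask = (torch.rand(n, 1) < eps).float()
      X = (1 - mask) * torch.randn(n, p) + mask * (5.0 + torch.randn(n, p))  # (1-eps)N(0,I)+eps N(5*1_p,I)

      disc = nn.Sequential(nn.Linear(p, 25), nn.LeakyReLU(0.2), nn.Linear(25, 1))
      eta = torch.zeros(p, requires_grad=True)            # location parameter to estimate
      opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)
      opt_g = torch.optim.Adam([eta], lr=2e-2)
      bce = nn.BCEWithLogitsLoss()
      ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

      for it in range(1500):
          fake = eta + torch.randn(n, p)                  # samples from N(eta, I_p)
          # Discriminator ascends the Jensen-Shannon (logistic) objective
          d_loss = bce(disc(X), ones) + bce(disc(fake.detach()), zeros)
          opt_d.zero_grad(); d_loss.backward(); opt_d.step()
          # eta descends the same objective (non-saturating form), pushing
          # N(eta, I_p) toward the bulk of the data rather than the contamination
          g_loss = bce(disc(eta + torch.randn(n, p)), ones)
          opt_g.zero_grad(); g_loss.backward(); opt_g.step()

      print("||eta_hat||_2        =", eta.detach().norm().item())   # error of the JS-GAN estimate
      print("||sample mean||_2    =", X.mean(0).norm().item())      # compare with the contaminated mean
      ```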
