Robust Statistics and Generative Adversarial Networks - Yuan YAO - PowerPoint PPT Presentation


SLIDE 1

Robust Statistics and Generative Adversarial Networks

Yuan YAO, HKUST

SLIDE 2

Chao GAO (U. Chicago), Jiyi LIU (Yale U.), Weizhi ZHU (HKUST)

SLIDE 3

Deep Learning is Notoriously Not Robust!

  • Imperceptible adversarial examples are ubiquitous and fool neural networks
  • How can one achieve robustness?

SLIDE 4

Robust Optimization

  • Traditional training:
    min_θ J_n(θ, z = (x_i, y_i)_{i=1}^n)
  • e.g. square or cross-entropy loss as the negative log-likelihood of logit models
  • Robust optimization (Madry et al., ICLR 2018), see the sketch below:
    min_θ max_{‖ε_i‖ ≤ δ} J_n(θ, z = (x_i + ε_i, y_i)_{i=1}^n)
  • robust to any distribution, yet computationally hard
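A minimal PyTorch sketch of this min-max problem (our illustration, not the slides' code): the inner maximization is approximated by projected gradient ascent on an ℓ∞ ball; `model`, `loss_fn`, the radius `eps`, step size, and iteration count are placeholder assumptions.

```python
import torch

def pgd_attack(model, loss_fn, x, y, eps=0.1, step=0.02, iters=10):
    """Approximate the inner max over ||eps_i|| <= eps (Madry et al.)
    by projected gradient ascent on the loss."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(iters):
        loss = loss_fn(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += step * delta.grad.sign()   # ascent step on the loss
            delta.clamp_(-eps, eps)             # project back onto the ball
        delta.grad.zero_()
    return (x + delta).detach()

# Outer minimization: a training step on the perturbed batch
# x_adv = pgd_attack(model, loss_fn, x, y)
# loss_fn(model(x_adv), y).backward(); optimizer.step()
```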

SLIDE 5

Distributionally Robust Optimization (DRO)

  • Distributionally Robust Optimization:
    min_θ max_{P_ε∈D} E_{z∼P_ε}[J_n(θ, z)]
  • D is a set of ambiguous distributions, e.g. the Wasserstein ambiguity set
    D = {P_ε : W_2(P_ε, uniform distribution) ≤ ε},
    where DRO may be reduced to regularized maximum likelihood estimates (Shafieezadeh-Abadeh, Esfahani, Kuhn, NIPS 2015) that are convex optimizations and tractable

SLIDE 6

Wasserstein DRO and Sqrt-Lasso (Jose Blanchet et al. 2016)

Theorem (Blanchet, Kang, Murthy (2016)). Suppose that
c((x, y), (x′, y′)) = ‖x − x′‖_q² if y = y′, and ∞ if y ≠ y′.
Then, if 1/p + 1/q = 1,
max_{P : D_c(P, P_n) ≤ δ} E_P^{1/2}[(Y − β^T X)²] = E_{P_n}^{1/2}[(Y − β^T X)²] + √δ ‖β‖_p.

Remark 1: This is the sqrt-Lasso (Belloni et al. (2011)).
Remark 2: Uses the RoPA duality theorem and a "judicious choice of c(·)".

SLIDE 7

Certified Robustness of Lasso

Take q = ∞ and p = 1, with
c((x, y), (x′, y′)) = ‖x − x′‖_∞² if y = y′, and ∞ if y ≠ y′.
Then for P′_n = (1/n) Σ_i δ_{x′_i} with ‖x_i − x′_i‖_∞ ≤ δ,
D_c(P′_n, P_n) = ∫ π((x, y), (x′, y′)) c((x, y), (x′, y′)) ≤ δ
for small enough δ and well-separated x's. The sqrt-Lasso

min_β { E_{P_n}^{1/2}[(Y − β^T X)²] + √δ ‖β‖_1 }² = min_β max_{P : D_c(P, P_n) ≤ δ} E_P[(Y − β^T X)²]

provides a certified robust estimate in terms of Madry's adversarial training, using a convex Wasserstein relaxation.
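Since the theorem reduces the DRO problem to a convex program, here is a small cvxpy sketch of the resulting sqrt-Lasso (ours; the radius `delta` and the synthetic data are arbitrary assumptions):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p); beta_true[:5] = 1.0
y = X @ beta_true + 0.1 * rng.standard_normal(n)

delta = 0.01                      # Wasserstein radius (assumed)
beta = cp.Variable(p)
# sqrt-Lasso objective: RMS residual plus sqrt(delta) * ||beta||_1
rms = cp.norm(y - X @ beta, 2) / np.sqrt(n)
cp.Problem(cp.Minimize(rms + np.sqrt(delta) * cp.norm1(beta))).solve()
```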

SLIDE 8

TV-neighborhood

  • Now how about the TV-uncertainty set?
    D = {P_ε : TV(P_ε, uniform distribution) ≤ ε}?
  • an example from robust statistics …

SLIDE 9

Huber's Model [Huber 1964]

X_1, ..., X_n ∼ (1 − ε)P_θ + εQ

(ε: contamination proportion; θ: parameter of interest; Q: arbitrary contamination)

SLIDE 10

An Example

X_1, ..., X_n ∼ (1 − ε)N(θ, I_p) + εQ. How to estimate θ?

SLIDE 11

Robust Maximum-Likelihood Does Not Work!

X_1, ..., X_n ∼ (1 − ε)N(θ, I_p) + εQ. How to estimate θ?

ℓ(θ, Q) = negative log-likelihood = Σ_{i=1}^n (θ − X_i)² ∼ (1 − ε) E_{N(θ)}(θ − X)² + ε E_Q(θ − X)²

The sample mean θ̂_mean = (1/n) Σ_{i=1}^n X_i = argmin_θ ℓ(θ, Q), yet

min_θ max_Q ℓ(θ, Q) ≥ max_Q min_θ ℓ(θ, Q) = max_Q ℓ(θ̂_mean, Q) = ∞
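A quick numerical illustration of the failure (ours, with an arbitrary contamination Q = N(50·1_p, I_p)): the sample mean is dragged away while the coordinatewise median of the next slide stays close.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, eps = 1000, 20, 0.2
theta = np.zeros(p)

# Huber model: (1 - eps) N(theta, I_p) + eps Q, with Q = N(50 * 1_p, I_p)
contaminated = rng.random(n) < eps
X = rng.standard_normal((n, p)) + theta
X[contaminated] += 50.0

print(np.linalg.norm(X.mean(axis=0) - theta))        # blows up with eps
print(np.linalg.norm(np.median(X, axis=0) - theta))  # stays small
```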

SLIDE 12

Medians

  • Estimator 1: Coordinatewise median
    θ̂ = (θ̂_j), where θ̂_j = Median({X_ij}_{i=1}^n)
  • Estimator 2: Tukey's median
    θ̂ = argmax_{η∈R^p} min_{‖u‖=1} (1/n) Σ_{i=1}^n I{u^T X_i > u^T η}

SLIDE 13

Comparisons

|                                             | Coordinatewise Median | Tukey's Median |
|---------------------------------------------|-----------------------|----------------|
| breakdown point                             | 1/2                   | 1/3            |
| statistical precision (no contamination)    | p/n                   | p/n            |
| statistical precision (with contamination)  | p/n + pε²             | p/n + ε², minimax [Chen-Gao-Ren'15] |
| computational complexity                    | polynomial            | NP-hard [Amenta et al. '00] |

Note: the R package for the Tukey median cannot deal with more than 10 dimensions! [https://github.com/ChenMengjie/DepthDescent]

SLIDE 14

Depth and Statistical Properties

SLIDE 15

Multivariate Location Depth [Tukey, 1975]

θ̂ = argmax_{η∈R^p} min_{‖u‖=1} { (1/n) Σ_{i=1}^n I{u^T X_i > u^T η} ∧ (1/n) Σ_{i=1}^n I{u^T X_i ≤ u^T η} }

Estimator 2: θ̂ = argmax_{η∈R^p} min_{‖u‖=1} (1/n) Σ_{i=1}^n I{u^T X_i > u^T η}.
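A minimal numpy sketch of this depth (ours): the minimum over all unit directions is approximated by random directions, which is only a heuristic; exact computation is hard in high dimensions.

```python
import numpy as np

def tukey_depth(eta, X, n_dirs=500, rng=None):
    """Approximate Tukey's halfspace depth of eta: minimize over random
    unit directions u the smaller of the two halfspace fractions."""
    rng = rng or np.random.default_rng(0)
    U = rng.standard_normal((n_dirs, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    frac = ((X @ U.T) > (U @ eta)).mean(axis=0)   # I{u^T X_i > u^T eta}
    return np.minimum(frac, 1 - frac).min()

# Tukey's median maximizes this depth over eta; exact maximization is
# NP-hard, so in practice one resorts to heuristic search over eta.
```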

SLIDE 16

Regression Depth [Rousseeuw & Hubert, 1999]

model: y|X ∼ N(X^T β, σ²)

β̂ = argmax_{η∈R^p} min_{u∈R^p} { (1/n) Σ_{i=1}^n I{u^T X_i (y_i − X_i^T η) > 0} ∧ (1/n) Σ_{i=1}^n I{u^T X_i (y_i − X_i^T η) ≤ 0} }

embedding: Xy|X ∼ N(XX^T β, σ² XX^T)

projection: u^T Xy|X ∼ N(u^T XX^T β, σ² u^T XX^T u)

SLIDE 17

Tukey's depth is not a special case of regression depth.

SLIDE 18

Multi-task Regression Depth [Mizera, 2002]

(X, Y) ∈ R^p × R^m ∼ P, coefficient matrix B ∈ R^{p×m}

empirical version:
D_U(B, {(X_i, Y_i)}_{i=1}^n) = inf_{U∈U} (1/n) Σ_{i=1}^n I{⟨U^T X_i, Y_i − B^T X_i⟩ ≥ 0}

population version:
D_U(B, P) = inf_{U∈U} P(⟨U^T X, Y − B^T X⟩ ≥ 0)

SLIDE 19

Multi-task Regression Depth

D_U(B, P) = inf_{U∈U} P(⟨U^T X, Y − B^T X⟩ ≥ 0)

  • p = 1, X = 1 ∈ R (location): D_U(b, P) = inf_{u∈U} P(u^T (Y − b) ≥ 0)
  • m = 1 (regression): D_U(β, P) = inf_{u∈U} P(u^T X (y − β^T X) ≥ 0)

SLIDE 20

Multi-task Regression Depth

Estimation Error. For any δ > 0, with probability at least 1 − δ,
sup_{B∈R^{p×m}} |D(B, P_n) − D(B, P)| ≤ C ( √(pm/n) + √(log(1/δ)/(2n)) ).

Contamination Error.
sup_{B,Q} |D(B, (1 − ε)P_{B*} + εQ) − D(B, P_{B*})| ≤ ε

SLIDE 21

Multi-task Regression Depth

  • (X, Y) ∼ P_B: X ∼ N(0, Σ), Y|X ∼ N(B^T X, σ² I_m)
  • (X_1, Y_1), ..., (X_n, Y_n) ∼ (1 − ε)P_B + εQ

Theorem [G17]. For some C > 0, with high probability uniformly over B, Q,
Tr((B̂ − B)^T Σ (B̂ − B)) ≤ C σ² (pm/n ∨ ε²),
‖B̂ − B‖_F² ≤ C (σ²/λ_min(Σ)) (pm/n ∨ ε²).

SLIDE 22

Covariance Matrix

X_1, ..., X_n ∼ (1 − ε)N(0, Σ) + εQ. How to estimate Σ?

SLIDE 23

Covariance Matrix

SLIDE 24

Covariance Matrix

Σ̂ = Γ̂/β (for a fixed scaling constant β),  Γ̂ = argmax_{Γ⪰0} D(Γ, {X_i}_{i=1}^n),

where the matrix depth is
D(Γ, {X_i}_{i=1}^n) = min_{‖u‖=1} min{ (1/n) Σ_{i=1}^n I{|u^T X_i|² ≥ u^T Γu}, (1/n) Σ_{i=1}^n I{|u^T X_i|² < u^T Γu} }

Theorem [CGR15]. For some C > 0, with high probability uniformly over Σ, Q,
‖Σ̂ − Σ‖_op² ≤ C (p/n ∨ ε²).
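In the same spirit as the location-depth sketch above, a numpy approximation of the matrix depth over random unit directions (ours, a heuristic stand-in for the exact minimum over all ‖u‖ = 1):

```python
import numpy as np

def matrix_depth(Gamma, X, n_dirs=500, rng=None):
    """Approximate the matrix depth of Gamma (PSD) via random directions."""
    rng = rng or np.random.default_rng(0)
    U = rng.standard_normal((n_dirs, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    sq = (X @ U.T) ** 2                             # |u^T X_i|^2, n x n_dirs
    thresh = np.einsum('dp,pq,dq->d', U, Gamma, U)  # u^T Gamma u, per direction
    frac = (sq >= thresh).mean(axis=0)
    return np.minimum(frac, 1 - frac).min()
```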

SLIDE 25

Summary

| problem                  | loss       | rate                        |
|--------------------------|------------|-----------------------------|
| mean                     | ‖·‖₂²      | p/n ∨ ε²                    |
| covariance matrix        | ‖·‖_op²    | p/n ∨ ε²                    |
| reduced rank regression  | ‖·‖_F²     | σ² r(p + m)/n ∨ σ² ε²       |
| Gaussian graphical model | ‖·‖_{ℓ1}²  | s² log(ep/s)/n ∨ s ε²       |
| sparse PCA               | ‖·‖_F²     | s log(ep/s)/n ∨ ε²          |

SLIDE 26

Computation

SLIDE 27

Computational Challenges

X_1, ..., X_n ∼ (1 − ε)N(θ, I_p) + εQ. How to estimate θ?

(Lai, Rao, Vempala; Diakonikolas, Kamath, Kane, Li, Moitra, Stewart; Balakrishnan, Du, Singh; Dalalyan, Carpentier, Collier, Verzelen)

  • Polynomial-time algorithms are proposed [Diakonikolas et al.'16, Lai et al.'16] of minimax optimal statistical precision,
  • but they need information on second or higher order moments,
  • and some a priori knowledge about ε

SLIDE 28

Advantages of Tukey Median

  • A well-defined objective function
  • Adaptive to ε and Σ
  • Optimal for any elliptical distribution

SLIDE 29

A practically good algorithm?

SLIDE 30

Generative Adversarial Networks [Goodfellow et al. 2014]

Note: the R package for the Tukey median cannot deal with more than 10 dimensions [https://github.com/ChenMengjie/DepthDescent]

SLIDE 31

Robust Learning of Cauchy Distributions

Comparison of various methods of robust location estimation under Cauchy distributions. Samples are drawn from (1 − ε)Cauchy(0_p, I_p) + εQ with ε = 0.2, p = 50, and various choices of Q. Sample size: 50,000. Discriminator net structure: 50-50-25-1. Generator g_ω(ξ) structure: 48-48-32-24-12-1 with absolute-value activation in the output layer.

| Contamination Q         | JS-GAN (G1)     | JS-GAN (G2)     | Dimension Halving | Iterative Filtering |
|-------------------------|-----------------|-----------------|-------------------|---------------------|
| Cauchy(1.5·1_p, I_p)    | 0.0664 (0.0065) | 0.0743 (0.0103) | 0.3529 (0.0543)   | 0.1244 (0.0114)     |
| Cauchy(5.0·1_p, I_p)    | 0.0480 (0.0058) | 0.0540 (0.0064) | 0.4855 (0.0616)   | 0.1687 (0.0310)     |
| Cauchy(1.5·1_p, 5·I_p)  | 0.0754 (0.0135) | 0.0742 (0.0111) | 0.3726 (0.0530)   | 0.1220 (0.0112)     |
| Normal(1.5·1_p, 5·I_p)  | 0.0702 (0.0064) | 0.0713 (0.0088) | 0.3915 (0.0232)   | 0.1048 (0.0288)     |

  • Dimension Halving: [Lai et al.'16] https://github.com/kal2000/AgnosticMeanAndCovarianceCode
  • Iterative Filtering: [Diakonikolas et al.'17] https://github.com/hoonose/robust-filter

SLIDE 32

f-GAN

Given a strictly convex function f that satisfies f(1) = 0, the f-divergence between two probability distributions P and Q is defined by

D_f(P‖Q) = ∫ f(p/q) dQ. (8)

Let f* be the convex conjugate of f. A variational lower bound of (8) is

D_f(P‖Q) ≥ sup_{T∈T} [E_P T(X) − E_Q f*(T(X))], (9)

where equality holds whenever the class T contains the function f′(p/q). [Nowozin-Cseke-Tomioka'16]

f-GAN minimizes the variational lower bound (9):

P̂ = argmin_{Q∈Q} sup_{T∈T} [ (1/n) Σ_{i=1}^n T(X_i) − E_Q f*(T(X)) ], (10)

with i.i.d. observations X_1, ..., X_n ∼ P.

SLIDE 33

From f-GAN to Tukey's Median: f-Learning

Consider the special case

T = { f′(q̃/q) : q̃ ∈ Q̃ }, (11)

which is tight if P ∈ Q̃. The sample version leads to the following f-learning:

P̂ = argmin_{Q∈Q} sup_{q̃∈Q̃} [ (1/n) Σ_{i=1}^n f′(q̃(X_i)/q(X_i)) − E_Q f*( f′(q̃(X)/q(X)) ) ]. (12)

  • If f(x) = x log x and Q = Q̃, then (12) ⇒ Maximum Likelihood Estimate
  • If f(x) = (x − 1)_+, then D_f(P‖Q) = (1/2)∫|p − q| is the TV-distance and f*(t) = t I{0 ≤ t ≤ 1} (derived below), so f-GAN ⇒ TV-GAN
  • If Q = {N(η, I_p) : η ∈ R^p} and Q̃ = {N(η̃, I_p) : ‖η̃ − η‖ ≤ r}, then (12) ⇒ Tukey's Median as r → 0
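The conjugate f*(t) = t I{0 ≤ t ≤ 1} used in the TV bullet can be checked directly (our worked step, filling in what the slide states without proof):

```latex
% Conjugate of f(x) = (x-1)_+ over x >= 0:
f^*(t) \;=\; \sup_{x \ge 0} \{\, t x - (x-1)_+ \,\}
  \;=\;
  \begin{cases}
    0       & t < 0 \quad (\text{supremum at } x = 0),\\
    t       & 0 \le t \le 1 \quad (\text{supremum at } x = 1,\ \text{since }
              (t-1)x + 1 \le t \text{ for } x > 1),\\
    +\infty & t > 1 \quad ((t-1)x + 1 \to \infty \text{ as } x \to \infty).
  \end{cases}
% Hence f^*(t) = t \,\mathbb{I}\{0 \le t \le 1\} on its effective domain.
```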

SLIDE 34

f-Learning: f-divergence

f(u) = sup_t (t u − f*(t)),  D_f(P‖Q) = ∫ f(p/q) dQ.

SLIDE 35

f-Learning

D_f(P‖Q) = ∫ f(p/q) dQ  (f-divergence)

variational representation:
D_f(P‖Q) = sup_T [E_{X∼P} T(X) − E_{X∼Q} f*(T(X))]

optimal T: T(x) = f′(p(x)/q(x))

SLIDE 36

f-Learning

D_f(P‖Q) = ∫ f(p/q) dQ  (f-divergence)

variational representation:
D_f(P‖Q) = sup_T [E_{X∼P} T(X) − E_{X∼Q} f*(T(X))]
        = sup_{Q̃} { E_{X∼P} f′(dQ̃(X)/dQ(X)) − E_{X∼Q} f*( f′(dQ̃(X)/dQ(X)) ) }

SLIDE 37

f-Learning vs f-GAN [Nowozin, Cseke, Tomioka]

f-GAN:
min_{Q∈Q} max_{T∈T} { (1/n) Σ_{i=1}^n T(X_i) − ∫ f*(T) dQ }

f-Learning:
min_{Q∈Q} max_{Q̃∈Q̃} { (1/n) Σ_{i=1}^n f′(q̃(X_i)/q(X_i)) − ∫ f*( f′(q̃/q) ) dQ }

SLIDE 38

f-Learning

| divergence        | f(x)                            | method                               |
|-------------------|---------------------------------|--------------------------------------|
| Jensen-Shannon    | x log x − (x + 1) log((x + 1)/2)| GAN [Goodfellow et al.]              |
| Kullback-Leibler  | x log x                         | MLE                                  |
| Squared Hellinger | 2 − 2√x                         | ρ-estimation [Baraud and Birgé]      |
| Total Variation   | (x − 1)_+                       | depth                                |

SLIDE 39

TV-Learning

min_{Q∈Q} max_{Q̃∈Q̃} { (1/n) Σ_{i=1}^n I{q̃(X_i)/q(X_i) ≥ 1} − Q(q̃/q ≥ 1) }

With Q = {N(θ, I_p) : θ ∈ R^p}, Q̃ = {N(θ̃, I_p) : θ̃ ∈ N_r(θ)}, and r → 0, this recovers the Tukey depth:

max_{θ∈R^p} min_{‖u‖=1} (1/n) Σ_{i=1}^n I{u^T X_i ≥ u^T θ}

SLIDE 40

TV-Learning

min_{Q∈Q} max_{Q̃∈Q̃} { (1/n) Σ_{i=1}^n I{q̃(X_i)/q(X_i) ≥ 1} − Q(q̃/q ≥ 1) }

With Q = {N(0, Σ) : Σ ∈ R^{p×p}}, Q̃ = {N(0, Σ̃) : Σ̃ = Σ + r u u^T, ‖u‖ = 1}, and r → 0, this is (related to) the matrix depth:

max_Σ min_{‖u‖=1} [ ( (1/n) Σ_{i=1}^n I{|u^T X_i|² ≤ u^T Σu} − P(χ²_1 ≤ 1) ) ∧ ( (1/n) Σ_{i=1}^n I{|u^T X_i|² > u^T Σu} − P(χ²_1 > 1) ) ]

SLIDE 41

f-Learning (robust statistics community: theoretical foundation) ↔ f-GAN (deep learning community: practically good algorithms)

SLIDE 42

TV-GAN

θ̂ = argmin_η sup_{w,b} [ (1/n) Σ_{i=1}^n 1/(1 + e^{−w^T X_i − b}) − E_η 1/(1 + e^{−w^T X − b}) ]

(a logistic regression classifier; E_η is the expectation under N(η, I_p))

Theorem [GLYZ18]. For some C > 0, with high probability uniformly over θ ∈ R^p and Q,
‖θ̂ − θ‖² ≤ C (p/n ∨ ε²).
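A minimal PyTorch sketch of this TV-GAN (ours, not the authors' implementation): simultaneous gradient steps, coordinatewise-median initialization, and the generator batch size `m` are arbitrary choices; as the next slide shows, this landscape can trap such first-order dynamics.

```python
import torch

def tv_gan_mean(X, steps=2000, lr=0.02, m=512):
    """TV-GAN sketch: discriminator sigmoid(w^T x + b); eta descends and
    (w, b) ascend  mean_i sigmoid(w^T X_i + b) - E_{N(eta,I)} sigmoid(w^T X + b)."""
    n, p = X.shape
    eta = X.median(dim=0).values.clone().requires_grad_(True)
    w = torch.zeros(p, requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    for _ in range(steps):
        Z = eta + torch.randn(m, p)   # reparameterized draws from N(eta, I_p)
        obj = torch.sigmoid(X @ w + b).mean() - torch.sigmoid(Z @ w + b).mean()
        g_eta, g_w, g_b = torch.autograd.grad(obj, [eta, w, b])
        with torch.no_grad():
            eta -= lr * g_eta         # minimize over eta
            w += lr * g_w             # maximize over (w, b)
            b += lr * g_b
    return eta.detach()
```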

SLIDE 43

TV-GAN: rugged landscape!

Figure: Heatmaps of the landscape of F(η, w) = sup_b [E_P sigmoid(wX + b) − E_{N(η,1)} sigmoid(wX + b)], where b is maximized out for visualization. Left: samples are drawn from P = (1 − ε)N(1, 1) + εN(1.5, 1) with ε = 0.2. Right: samples are drawn from P = (1 − ε)N(1, 1) + εN(10, 1) with ε = 0.2.

Left: the landscape is good in the sense that no matter whether we start from the top-left area or the bottom-right area of the heatmap, gradient ascent on η does not consistently increase or decrease the value of η; the signal becomes weak close to the saddle point around η = 1. Right: it is clear that F̃(w) = F(η, w) has two local maxima for a given η, achieved at w = +∞ and w = −∞. In fact, the global maximum of F̃(w) has a phase transition from w = +∞ to w = −∞ as η grows: for example, the maximum is achieved at w = +∞ when η = 1 (blue solid) and at w = −∞ when η = 5 (red solid). Unfortunately, even if we initialize with η₀ = 1 and w₀ > 0, gradient ascent on η will only increase the value of η (green dash); as long as the discriminator cannot reach the global maximizer, w will be stuck in the positive half space {w : w > 0} and further increase the value of η.

SLIDE 44

The Original JS-GAN [Goodfellow et al. 2014]

For f(x) = x log x − (x + 1) log((x + 1)/2),

θ̂ = argmin_{η∈R^p} max_{D∈D} [ (1/n) Σ_{i=1}^n log D(X_i) + E_{N(η,I_p)} log(1 − D(X)) ] + log 4. (15)

What is D, the class of discriminators?

  • Single layer (no hidden layer): D = { D(x) = sigmoid(w^T x + b) : w ∈ R^p, b ∈ R }
  • One or multiple hidden layers: D = { D(x) = sigmoid(w^T g(x)) }, with g a feature network (a training sketch follows)
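A compact PyTorch sketch of this JS-GAN for robust mean estimation, with a one-hidden-layer discriminator (ours; the architecture sizes, Adam optimizer, learning rate, and initialization are our own choices, not the slides'):

```python
import torch
import torch.nn as nn

def js_gan_mean(X, hidden=32, steps=3000, lr=1e-3, m=512):
    """JS-GAN sketch:  min_eta max_D  mean_i log D(X_i)
       + E_{N(eta, I)} log(1 - D(X))   (the constant log 4 is irrelevant)."""
    n, p = X.shape
    D = nn.Sequential(nn.Linear(p, hidden), nn.ReLU(),
                      nn.Linear(hidden, 1), nn.Sigmoid())
    eta = X.median(dim=0).values.clone().requires_grad_(True)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    opt_g = torch.optim.Adam([eta], lr=lr)
    for _ in range(steps):
        # discriminator ascent on the JS objective
        Z = (eta + torch.randn(m, p)).detach()
        obj = (torch.log(D(X).clamp_min(1e-6)).mean()
               + torch.log((1 - D(Z)).clamp_min(1e-6)).mean())
        opt_d.zero_grad(); (-obj).backward(); opt_d.step()
        # generator (eta) descent through reparameterized samples
        Z = eta + torch.randn(m, p)
        obj_g = torch.log((1 - D(Z)).clamp_min(1e-6)).mean()
        opt_g.zero_grad(); obj_g.backward(); opt_g.step()
    return eta.detach()
```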
SLIDE 45

JS-GAN: numerical experiment

X_1, ..., X_n ∼ (1 − ε)N(θ, I_p) + εN(θ̃, I_p)

JS-GAN: θ̂ = argmin_{η∈R^p} max_{T∈T} [ (1/n) Σ_{i=1}^n log T(X_i) + E_η log(1 − T(X)) ] + log 4

Without hidden layers the estimate is pulled toward the contamination, θ̂ ≈ (1 − ε)θ + εθ̃; with hidden layers, θ̂ ≈ θ.

SLIDE 46

JS-GAN

A classifier with hidden layers leads to robustness. Why?

JS_g(P, Q) = max_{w∈R^d} { E_P log(1/(1 + e^{−w^T g(X)})) + E_Q log(1/(1 + e^{w^T g(X)})) } + log 4.

Proposition. JS_g(P, Q) = 0 ⟺ P_{g(X)} = Q_{g(X)}

SLIDE 47

JS-GAN

θ̂ = argmin_{η∈R^p} max_{T∈T} [ (1/n) Σ_{i=1}^n log T(X_i) + E_η log(1 − T(X)) ] + log 4

Theorem [GLYZ18]. For a neural network class T with at least one hidden layer and appropriate regularization, we have, with high probability uniformly over θ ∈ R^p and Q,

‖θ̂ − θ‖² ≲ p/n + ε²   (indicator/sigmoid/ramp activations),
‖θ̂ − θ‖² ≲ (p log p)/n + ε²   (ReLU + sigmoid features).

SLIDE 48

JS-GAN: Adaptation to Unknown Covariance

Unknown covariance? X_1, ..., X_n ∼ (1 − ε)N(θ, Σ) + εQ

(θ̂, Σ̂) = argmin_{η,Γ} max_{T∈T} [ (1/n) Σ_{i=1}^n log T(X_i) + E_{X∼N(η,Γ)} log(1 − T(X)) ]

No need to change the discriminator class.
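In a sketch like the one after Slide 44, adapting to unknown covariance only changes how generator samples are drawn; one hedged possibility (ours) keeps a lower-triangular factor so the covariance stays positive semidefinite:

```python
import torch

# Parameterize Gamma = L L^T via a lower-triangular L, optimized jointly
# with eta, and reparameterize the generator samples accordingly.
p, m = 10, 512
eta = torch.zeros(p, requires_grad=True)
L = torch.eye(p, requires_grad=True)
Z = eta + torch.randn(m, p) @ torch.tril(L).T   # Z ~ N(eta, tril(L) tril(L)^T)
```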

SLIDE 49

Generalization

Strong contamination model:

X_1, ..., X_n iid∼ P for some P satisfying TV(P, E(θ, Σ, H)) ≤ ε,

where E(θ, Σ, H) is an elliptical distribution.

We are going to replace the log-likelihoods in JS-GAN by scoring functions
log t ↦ S(t, 1) : [0, 1] → R,  log(1 − t) ↦ S(t, 0) : [0, 1] → R,
that map the probability (likelihood) to real numbers:

(θ̂, Σ̂, Ĥ) = argmin_{η∈R^p, Γ∈E_p(M), H∈H(M′)} max_{T∈T} [ (1/n) Σ_{i=1}^n S(T(X_i), 1) + E_{X∼E(η,Γ,H)} S(T(X), 0) ]

SLIDE 50

Fisher Consistency: Proper Scoring Rule

  • With a Bernoulli experiment observing 1 with probability p, define the expected score S(t, p) = p S(t, 1) + (1 − p) S(t, 0)
  • Like the likelihood, we hope that, as a function of t, S(t, p) is maximized at t = p:
    max_t S(t, p) = S(p, p) =: G(p)
  • Such a score is called a Proper Scoring Rule.

SLIDE 51

Savage Representation of Proper Scoring Rule

Lemma (Savage representation). For a proper scoring rule S(t, p):

  – G(t) = S(t, t) is convex
  – S(t, 0) = G(t) − t G′(t)
  – S(t, 1) = G(t) + (1 − t) G′(t)
  – S(t, p) = p S(t, 1) + (1 − p) S(t, 0) = G(t) + G′(t)(p − t)

(A symbolic check for the log score follows.)
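A quick sanity check of the lemma for the log score (our addition; sympy assumed available), where G(t) = t log t + (1 − t) log(1 − t):

```python
import sympy as sp

t = sp.symbols('t', positive=True)
G = t * sp.log(t) + (1 - t) * sp.log(1 - t)    # negative Shannon entropy
S1 = sp.simplify(sp.expand(G + (1 - t) * sp.diff(G, t)))  # Savage: S(t, 1)
S0 = sp.simplify(sp.expand(G - t * sp.diff(G, t)))        # Savage: S(t, 0)
print(S1)  # log(t)      -> the log score S(t, 1)
print(S0)  # log(1 - t)  -> S(t, 0)
```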

SLIDE 52

Proof of Lemma

  • Write S(t, p) as a linear function of p:
    S(t, p) = p S(t, 1) + (1 − p) S(t, 0) = a(t) + b(t) p,
    where a(t) = S(t, 0) and b(t) = S(t, 1) − S(t, 0).
  • Fisher consistency says that
    S(t, p) = a(t) + b(t) p ≤ S(p, p) = a(p) + b(p) p =: G(p). Hence,
    (a) S(t, ·) is a supporting line of G(p), touching at p = t;
    (b) G(p) is thus convex;
    (c) b(t) ∈ ∂G(p)|_{p=t} =: G′(t);
    (d) G(p)|_{p=t} = a(t) + b(t) p|_{p=t} ⇒ a(t) = G(t) − G′(t) t.

SLIDE 53

Divergence

D_T(P, Q) = max_{T∈T} { (1/2) E_{X∼P} S(T(X), 1) + (1/2) E_{X∼Q} S(T(X), 0) } − G(1/2).

Proposition 1. Given any regular proper scoring rule {S(·, 1), S(·, 0)} and any class T ∋ 1/2, D_T(P, Q) is a divergence function, and

D_T(P, Q) ≤ D_f( P ‖ (1/2)P + (1/2)Q ), (4)

where f(t) = G(t/2) − G(1/2). Moreover, whenever T ∋ dP/(dP + dQ), the inequality above becomes an equality.

  • A scoring rule S is regular if both S(·, 0) and S(·, 1) are real-valued, except possibly that S(0, 1) = −∞ or S(1, 0) = −∞.

SLIDE 54

Example 1: Log Score and JS-GAN

  • 1. Log Score. The log score is perhaps the most commonly used rule because of its various intriguing properties [31]. The scoring rule with S(t, 1) = log t and S(t, 0) = log(1 − t) is regular and strictly proper. Its Savage representation is given by the convex function G(t) = t log t + (1 − t) log(1 − t), which is interpreted as the negative Shannon entropy of Bernoulli(t). The corresponding divergence function D_T(P, Q), according to Proposition 1, is a variational lower bound of the Jensen-Shannon divergence
    JS(P, Q) = (1/2) ∫ log( dP/(dP + dQ) ) dP + (1/2) ∫ log( dQ/(dP + dQ) ) dQ + log 2.
    Its sample version (13) is the original GAN proposed by [25] that is widely used in learning distributions of images.

SLIDE 55

Example 2: Zero-One Score and TV-GAN

  • 2. Zero-One Score. The zero-one score S(t, 1) = 2 I{t ≥ 1/2} and S(t, 0) = 2 I{t < 1/2} is also known as the misclassification loss. This is a regular proper scoring rule but not strictly proper. The induced divergence function D_T(P, Q) is a variational lower bound of the total variation distance
    TV(P, Q) = P( dP/dQ ≥ 1 ) − Q( dP/dQ ≥ 1 ) = (1/2) ∫ |dP − dQ|.
    The sample version (13) is recognized as the TV-GAN that is extensively studied by [21] in the context of robust estimation.

SLIDE 56

Example 3: Quadratic Score and LS-GAN

  • 3. Quadratic Score. Also known as the Brier score [6], the definition is given by S(t, 1) = −(1 − t)² and S(t, 0) = −t². The corresponding convex function in the Savage representation is given by G(t) = −t(1 − t). By Proposition 1, the divergence function (3) induced by this regular strictly proper scoring rule is a variational lower bound of the divergence
    Δ(P, Q) = (1/8) ∫ (dP − dQ)² / (dP + dQ),
    known as the triangular discrimination. The sample version (5) belongs to the family of least-squares GANs proposed by [39].
SLIDE 57

Example 4: Boosting Score

  • 4. Boosting Score. The boosting score was introduced by [7], with S(t, 1) = −((1 − t)/t)^{1/2} and S(t, 0) = −(t/(1 − t))^{1/2}, and has a connection to the AdaBoost algorithm. The corresponding convex function in the Savage representation is given by G(t) = −2 √(t(1 − t)). The induced divergence function D_T(P, Q) is thus a variational lower bound of the squared Hellinger distance
    H²(P, Q) = (1/2) ∫ (√dP − √dQ)².

SLIDE 58

Example 5: Beta Score and new GANs

  • 5. Beta Score. A general Beta family of proper scoring rules was introduced by [7], with S(t, 1) = −∫_t^1 c^{α−1}(1 − c)^β dc and S(t, 0) = −∫_0^t c^α(1 − c)^{β−1} dc for any α, β > −1. The log score, the quadratic score and the boosting score are special cases of the Beta score with α = β = 0, α = β = 1, and α = β = −1/2, respectively. The zero-one score is a limiting case of the Beta score as α = β → ∞. Moreover, it also leads to asymmetric scoring rules when α ≠ β.
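The Beta family is straightforward to evaluate numerically; a small sketch (ours) using scipy quadrature, checking the log-score special case α = β = 0:

```python
import numpy as np
from scipy.integrate import quad

def beta_score(t, alpha, beta):
    """Beta-family proper scoring rule: returns (S(t, 1), S(t, 0))."""
    s1 = -quad(lambda c: c**(alpha - 1) * (1 - c)**beta, t, 1)[0]
    s0 = -quad(lambda c: c**alpha * (1 - c)**(beta - 1), 0, t)[0]
    return s1, s0

# alpha = beta = 0 recovers the log score: S(t,1) = log t, S(t,0) = log(1-t)
t = 0.3
print(beta_score(t, 0.0, 0.0))
print(np.log(t), np.log(1 - t))
```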

SLIDE 59

Smooth Proper Scores

Assumption (Smooth Proper Scoring Rules). We assume that

  • G^{(2)}(1/2) > 0 and G^{(3)}(t) is continuous at t = 1/2;
  • moreover, there is a universal constant c₀ > 0 such that 2 G^{(2)}(1/2) ≥ G^{(3)}(1/2) + c₀.

  – The condition 2 G^{(2)}(1/2) ≥ G^{(3)}(1/2) + c₀ is automatically satisfied by a symmetric scoring rule, because S(t, 1) = S(1 − t, 0) immediately implies that G^{(3)}(1/2) = 0.
  – For the Beta score with S(t, 1) = −∫_t^1 c^{α−1}(1 − c)^β dc and S(t, 0) = −∫_0^t c^α(1 − c)^{β−1} dc for any α, β > −1, it is easy to check that such a c₀ (only depending on α, β) exists as long as |α − β| < 1.

SLIDE 60

Statistical Optimality

Theorem [GYZ19]. For a neural network class T with at least one hidden layer and appropriate regularization, we have

‖θ̂ − θ‖² ≤ C (p/n ∨ ε²),  ‖Σ̂ − Σ‖_op² ≤ C (p/n ∨ ε²).

SLIDE 61

Experiments

SLIDE 62

Robust Learning of Gaussian Distributions

| Q                | n      | p   | ε   | TV-GAN              | JS-GAN              | Dimension Halving | Iterative Filtering |
|------------------|--------|-----|-----|---------------------|---------------------|-------------------|---------------------|
| N(0.5·1_p, I_p)  | 50,000 | 100 | .2  | **0.0953 (0.0064)** | 0.1144 (0.0154)     | 0.3247 (0.0058)   | 0.1472 (0.0071)     |
| N(0.5·1_p, I_p)  | 5,000  | 100 | .2  | **0.1941 (0.0173)** | 0.2182 (0.0527)     | 0.3568 (0.0197)   | 0.2285 (0.0103)     |
| N(0.5·1_p, I_p)  | 50,000 | 200 | .2  | **0.1108 (0.0093)** | 0.1573 (0.0815)     | 0.3251 (0.0078)   | 0.1525 (0.0045)     |
| N(0.5·1_p, I_p)  | 50,000 | 100 | .05 | 0.0913 (0.0527)     | 0.1390 (0.0050)     | 0.0814 (0.0056)   | **0.0530 (0.0052)** |
| N(5·1_p, I_p)    | 50,000 | 100 | .2  | 2.7721 (0.1285)     | **0.0534 (0.0041)** | 0.3229 (0.0087)   | 0.1471 (0.0059)     |
| N(0.5·1_p, Σ)    | 50,000 | 100 | .2  | 0.1189 (0.0195)     | **0.1148 (0.0234)** | 0.3241 (0.0088)   | 0.1426 (0.0113)     |
| Cauchy(0.5·1_p)  | 50,000 | 100 | .2  | 0.0738 (0.0053)     | **0.0525 (0.0029)** | 0.1045 (0.0071)   | 0.0633 (0.0042)     |

Table: Comparison of various robust mean estimation methods. The smallest error of each case is highlighted in bold.

  • Dimension Halving: [Lai et al.'16] https://github.com/kal2000/AgnosticMeanAndCovarianceCode
  • Iterative Filtering: [Diakonikolas et al.'17] https://github.com/hoonose/robust-filter

SLIDE 63

Robust Learning of Cauchy Distributions

Comparison of various methods of robust location estimation under Cauchy distributions. Samples are drawn from (1 − ε)Cauchy(0_p, I_p) + εQ with ε = 0.2, p = 50, and various choices of Q. Sample size: 50,000. Discriminator net structure: 50-50-25-1. Generator g_ω(ξ) structure: 48-48-32-24-12-1 with absolute-value activation in the output layer.

| Contamination Q         | JS-GAN (G1)     | JS-GAN (G2)     | Dimension Halving | Iterative Filtering |
|-------------------------|-----------------|-----------------|-------------------|---------------------|
| Cauchy(1.5·1_p, I_p)    | 0.0664 (0.0065) | 0.0743 (0.0103) | 0.3529 (0.0543)   | 0.1244 (0.0114)     |
| Cauchy(5.0·1_p, I_p)    | 0.0480 (0.0058) | 0.0540 (0.0064) | 0.4855 (0.0616)   | 0.1687 (0.0310)     |
| Cauchy(1.5·1_p, 5·I_p)  | 0.0754 (0.0135) | 0.0742 (0.0111) | 0.3726 (0.0530)   | 0.1220 (0.0112)     |
| Normal(1.5·1_p, 5·I_p)  | 0.0702 (0.0064) | 0.0713 (0.0088) | 0.3915 (0.0232)   | 0.1048 (0.0288)     |

  • Dimension Halving: [Lai et al.'16] https://github.com/kal2000/AgnosticMeanAndCovarianceCode
  • Iterative Filtering: [Diakonikolas et al.'17] https://github.com/hoonose/robust-filter

SLIDE 64

  • Discriminator helps identify outliers or contaminated samples
  • Generator fits the uncontaminated portion of the true samples

Discriminator identifies outliers from (1 − ε)N(0_p, I_p) + εQ, with Q = N(5·1_p, I_p).

SLIDE 65

Application: prices of 50 stocks from 2007/01 to 2018/12. Companies are selected by ranking in market capitalization.

SLIDE 66

Log-return: y_i = log(price_{i+1}/price_i)
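In code, with hypothetical prices:

```python
import numpy as np

# Hypothetical daily closing prices for one stock;
# y[i] = log(price[i+1] / price[i])
prices = np.array([100.0, 101.5, 99.8, 102.3])
log_returns = np.diff(np.log(prices))   # length n - 1
```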

SLIDE 67

Fit the data by an Elliptical-GAN. Apply SVD on the scatter matrix for dimension reduction onto R². Outliers (marked x and o) are selected from the discriminator value distribution.

SLIDE 68

Discriminator value distribution from the (elliptical) generator and the real samples. Outliers are chosen as samples scoring higher/lower than a chosen percentile of the generator's distribution.
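A sketch of this selection rule (ours): `disc_real` and `disc_fake` are assumed arrays of discriminator outputs on real and generator samples, respectively.

```python
import numpy as np

def flag_outliers(disc_real, disc_fake, lo=1.0, hi=99.0):
    """Flag real samples whose discriminator value falls outside a
    percentile band of the values assigned to generator samples."""
    lo_thr, hi_thr = np.percentile(disc_fake, [lo, hi])
    return (disc_real < lo_thr) | (disc_real > hi_thr)   # boolean outlier mask
```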

SLIDE 69

Standard (non-robust) PCA: the first two directions are dominated by a few companies → not robust.

SLIDE 70

Robust PCA: loadings of the elliptical scatter. Compared with PCA, it is more robust in the sense that it is not totally dominated by financial companies (JPM, GS).

SLIDE 71

References

  • Gao, Liu, Yao, Zhu. Robust Estimation and Generative Adversarial Networks. ICLR 2019. https://arxiv.org/abs/1810.02030
  • Gao, Yao, Zhu. Generative Adversarial Networks for Robust Scatter Estimation: A Proper Scoring Rule Perspective. https://arxiv.org/abs/1903.01944

SLIDE 72

Thank You