  1. Robust Estimation and Generative Adversarial Networks
Weizhi ZHU, Hong Kong University of Science and Technology, wzhuai@ust.hk
April 3, 2019
Based on: Robust Estimation and Generative Adversarial Nets [GLYZ18]; Generative Adversarial Nets for Robust Scatter Estimation: A Proper Scoring Rule Perspective [GYZ19].

  2. Huber's Contamination Model
Huber's contamination model [Huber, 1964]:
$$P = (1 - \epsilon) P_\theta + \epsilon Q.$$
Strong contamination model [Diakonikolas et al., 2016a]:
$$\mathrm{TV}(P, P_\theta) \le \epsilon.$$
Can we recover $\theta$ from data drawn from $P$ under arbitrary unknown contamination $(\epsilon, Q)$?
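For concreteness, here is a minimal sketch (assumed setup, not from the slides) of drawing a sample from Huber's model; the helper `sample_huber` and the example choice of $Q$ are hypothetical:

```python
import numpy as np

def sample_huber(n, theta, eps, sample_q, rng=None):
    """Draw n points from (1 - eps) * N(theta, I_p) + eps * Q."""
    rng = np.random.default_rng(rng)
    clean = rng.standard_normal((n, theta.shape[0])) + theta  # N(theta, I_p)
    dirty = sample_q(n, rng)                                  # samples from Q
    mask = rng.random(n) < eps                                # contaminated rows
    return np.where(mask[:, None], dirty, clean)

# Example: theta = 0 in R^5, Q = N(5 * 1, I_5), eps = 0.2
X = sample_huber(1000, np.zeros(5), 0.2,
                 lambda n, rng: rng.standard_normal((n, 5)) + 5.0)
```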

  3. Example: Robust Mean Estimation
Let us first consider robust estimation of the location parameter $\theta$ of a normal distribution,
$$X_1, \dots, X_n \sim (1 - \epsilon) N(\theta, I_p) + \epsilon Q.$$
Coordinate-wise median.
Tukey median [Tukey, 1975]:
$$\hat\theta = \mathop{\mathrm{argmax}}_{\eta \in \mathbb{R}^p} \min_{\|u\|_2 = 1} \left\{ \sum_{i=1}^n \mathbb{1}\{u^T X_i > u^T \eta\} \wedge \sum_{i=1}^n \mathbb{1}\{u^T X_i \le u^T \eta\} \right\}.$$
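Since computing the exact Tukey median is NP-hard, any quick experiment has to approximate the depth; the sketch below (assumed code, with the minimum over all directions $u$ replaced by a minimum over random unit directions) compares the depth of candidate centers:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, eps = 1000, 5, 0.2
X = rng.standard_normal((n, p))              # N(0, I_p); true theta = 0
X[rng.random(n) < eps] += 5.0                # contamination Q = N(5 * 1, I_p)

theta_med = np.median(X, axis=0)             # coordinate-wise median

def tukey_depth(eta, X, m=2000):
    U = rng.standard_normal((m, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # random unit directions u
    proj = (X - eta) @ U.T                          # u^T X_i - u^T eta
    above = (proj > 0).sum(axis=0)                  # #{i : u^T X_i > u^T eta}
    below = (proj <= 0).sum(axis=0)                 # #{i : u^T X_i <= u^T eta}
    return np.minimum(above, below).min()           # min over u of the two counts

print(tukey_depth(theta_med, X), tukey_depth(np.zeros(p), X))  # higher = deeper
```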

  4. Comparison
Coordinate-wise median: convergence rate $p/n$ (no contamination); $\frac{p}{n} \vee p\epsilon^2$ (Huber's $\epsilon$-contamination); polynomial-time.
Tukey median: convergence rate $p/n$ (no contamination); $\frac{p}{n} \vee \epsilon^2$, which is minimax (Huber's $\epsilon$-contamination); NP-hard.

  5. Example: Robust Covariance Estimation
We can also estimate the covariance matrix $\Sigma$ of a normal distribution,
$$X_1, \dots, X_n \sim (1 - \epsilon) N(0, \Sigma) + \epsilon Q.$$
Covariance depth [Chen-Gao-Ren, 2017]:
$$\hat\Gamma = \mathop{\mathrm{argmax}}_{\Gamma > 0} \min_{\|u\|_2 = 1} \left\{ \sum_{i=1}^n \mathbb{1}\{|u^T X_i|^2 > u^T \Gamma u\} \wedge \sum_{i=1}^n \mathbb{1}\{|u^T X_i|^2 \le u^T \Gamma u\} \right\}, \quad (1)$$
$$\hat\Sigma = \hat\Gamma / \beta, \quad \text{where } \beta \text{ satisfies } P\left(N(0,1) < \sqrt{\beta}\right) = \frac{3}{4}.$$
Then $\|\hat\Sigma - \Sigma\|_{\mathrm{op}}^2 \le C\left(\frac{p}{n} + \epsilon^2\right)$ with high probability, uniformly over $\Sigma$ and $Q$.
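Evaluating the depth objective (1) exactly is again intractable; a rough Monte Carlo version over random directions looks like the following (an assumed sketch; the maximization over $\Gamma > 0$, itself a hard problem, is omitted):

```python
import numpy as np
from scipy.stats import norm

beta = norm.ppf(0.75) ** 2        # P(N(0,1) < sqrt(beta)) = 3/4

def cov_depth(Gamma, X, m=2000, rng=None):
    rng = np.random.default_rng(rng)
    U = rng.standard_normal((m, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # unit directions u
    lhs = (X @ U.T) ** 2                            # |u^T X_i|^2, shape (n, m)
    rhs = np.einsum('mi,ij,mj->m', U, Gamma, U)     # u^T Gamma u per direction
    above = (lhs > rhs).sum(axis=0)
    below = (lhs <= rhs).sum(axis=0)
    return np.minimum(above, below).min()

# After maximizing the depth over Gamma, rescale: Sigma_hat = Gamma_hat / beta.
```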

  6. Computational Complexity
Polynomial-time algorithms with nearly minimax-optimal statistical precision have been proposed [Lai et al., 2016; Diakonikolas et al., 2018], but they
- require prior knowledge of $\epsilon$;
- need some moment constraints.
Advantages of depth-based estimation:
- no prior knowledge of $\epsilon$ needed;
- adaptive to any elliptical distribution;
- a well-defined objective function.
But are there any feasible algorithms in practice?

  7. $f$-divergence
Given a convex function $f$ satisfying $f(1) = 0$, the $f$-divergence of $P$ from $Q$ is defined as
$$D_f(P \| Q) = \int f\left(\frac{dP}{dQ}\right) dQ. \quad (2)$$
Let $f^*$ be the convex conjugate of $f$. A variational lower bound of (2) is given by
$$D_f(P \| Q) = \int q(x) \sup_{t \in \mathrm{dom} f^*} \left[ t \frac{p(x)}{q(x)} - f^*(t) \right] dx \ge \sup_{T \in \mathcal{T}} \left\{ E_{X \sim P}[T(X)] - E_{X \sim Q}[f^*(T(X))] \right\}. \quad (3)$$
The equality holds in (3) if $f'\left(\frac{p}{q}\right) \in \mathcal{T}$.
Restricting the supremum to density ratios from a family $\tilde{\mathcal{Q}}$ and replacing $E_{X \sim P}$ by the empirical average gives
$$D_f(P \| Q) \ge \max_{\tilde{Q} \in \tilde{\mathcal{Q}}} \left\{ \frac{1}{n} \sum_{i=1}^n f'\left(\frac{\tilde{q}(X_i)}{q(X_i)}\right) - E_{X \sim Q}\, f^*\left(f'\left(\frac{\tilde{q}(X)}{q(X)}\right)\right) \right\}. \quad (4)$$
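As a sanity check on the tightness condition $f'(p/q) \in \mathcal{T}$, the following sketch (an assumed example with $P = N(1,1)$, $Q = N(0,1)$ in one dimension) plugs $T = f'(p/q)$ into the right side of (3) for $f(x) = x \log x$, where $f'(x) = 1 + \log x$ and $f^*(t) = e^{t-1}$, and recovers $\mathrm{KL}(P \| Q) = 1/2$:

```python
import numpy as np

rng = np.random.default_rng(0)
xp, xq = rng.normal(1.0, 1.0, 10**6), rng.normal(0.0, 1.0, 10**6)
log_ratio = lambda x: (-(x - 1.0)**2 + x**2) / 2.0   # log p(x) - log q(x)
T = lambda x: 1.0 + log_ratio(x)                      # T = f'(p/q)
fstar = lambda t: np.exp(t - 1.0)                     # convex conjugate of x log x
print(T(xp).mean() - fstar(T(xq)).mean())             # ~ 0.5 = KL(P || Q)
```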

  8. $f$-GAN and $f$-Learning
$f$-Learning. Let $\tilde{\mathcal{Q}}$ be a distribution family,
$$\hat{P} = \mathop{\mathrm{argmin}}_{Q \in \mathcal{Q}} \max_{\tilde{Q} \in \tilde{\mathcal{Q}}} \left\{ \frac{1}{n} \sum_{i=1}^n f'\left(\frac{\tilde{q}(X_i)}{q(X_i)}\right) - E_{X \sim Q}\, f^*\left(f'\left(\frac{\tilde{q}(X)}{q(X)}\right)\right) \right\}.$$
$f$-GAN [Nowozin et al., 2016]:
$$\hat{P} = \mathop{\mathrm{argmin}}_{Q \in \mathcal{Q}} \max_{T \in \mathcal{T}} \left\{ \frac{1}{n} \sum_{i=1}^n T(X_i) - E_{X \sim Q}[f^*(T(X))] \right\},$$
where $T$ is usually parametrized by a neural network.
- $f$-GAN can smooth $f$-Learning's objective function.
- $f$-divergence is robust.
- Practical, efficient algorithms exist to solve it.

  9. Example
$f(x) = x \log x$ (KL divergence): if $p \in \tilde{\mathcal{Q}}$ (or $f'(p/q) \in \mathcal{T}$), then KL-Learning (or KL-GAN) becomes the maximum likelihood estimator.
$f(x) = x \log x - (x+1) \log\frac{1+x}{2}$ (JS divergence), which leads to the original JS-GAN [Goodfellow et al., 2014]:
$$\hat{P} = \mathop{\mathrm{argmin}}_{Q \in \mathcal{Q}} \max_{T \in \mathcal{T}} \left\{ \frac{1}{n} \sum_{i=1}^n \log(\mathrm{sigmoid}(T(X_i))) + E_{x \sim Q} \log(1 - \mathrm{sigmoid}(T(x))) \right\}.$$

  10. Example (Continued)
$f(x) = (x - 1)_+$ (TV divergence) and $f^*(t) = t$, $0 \le t \le 1$.
- Taking $\mathcal{Q} = \{N(\theta, I_p) : \theta \in \mathbb{R}^p\}$ and $\tilde{\mathcal{Q}}(\theta, r) = \{N(\tilde\theta, I_p) : \|\tilde\theta - \theta\|_2 \le r\}$, TV-Learning is defined as
$$\min_{Q \in \mathcal{Q}} \max_{\tilde{Q} \in \tilde{\mathcal{Q}}(\theta, r)} \left\{ \frac{1}{n} \sum_{i=1}^n \mathbb{1}\left\{\frac{\tilde{q}(X_i)}{q(X_i)} \ge 1\right\} - Q\left(\frac{\tilde{q}}{q} \ge 1\right) \right\}.$$
- As $r \to 0$, TV-Learning recovers the Tukey median,
$$\max_{\eta \in \mathbb{R}^p} \min_{\|u\|_2 = 1} \sum_{i=1}^n \mathbb{1}\{u^T X_i > u^T \eta\}.$$
- With $T$ parameterized by a class of neural networks, TV-GAN is defined as
$$\hat{P} = \mathop{\mathrm{argmin}}_{Q \in \mathcal{Q}} \max_{T \in \mathcal{T}} \left\{ \frac{1}{n} \sum_{i=1}^n \mathrm{sigmoid}(T(X_i)) - E_{x \sim Q}[\mathrm{sigmoid}(T(x))] \right\}.$$
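In code, the TV-GAN objective is just a difference of mean sigmoid outputs; a minimal sketch (assumed interface, with `X_fake` drawn from the current generator $Q$):

```python
import torch

def tv_gan_objective(T, X_real, X_fake):
    # (1/n) sum_i sigmoid(T(X_i)) - E_{x ~ Q} sigmoid(T(x)), both from samples;
    # the discriminator T ascends this value, the generator descends it.
    return torch.sigmoid(T(X_real)).mean() - torch.sigmoid(T(X_fake)).mean()
```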

  11. Proper Scoring Rule
$\{S(\cdot, 1), S(\cdot, 0)\}$ is the forecaster's reward if a player quotes $t$ when event 1 or 0 occurs.
$S(t; p) = p\, S(t, 1) + (1 - p)\, S(t, 0)$ is the expected reward when the event occurs with probability $p$.
$\{S(\cdot, 1), S(\cdot, 0)\}$ is a proper scoring rule if $S(p; p) \ge S(t; p)$ for all $t \in [0, 1]$.
(Savage representation) $S$ is proper iff there exists a convex function $G(\cdot)$ such that
$$S(t, 1) = G(t) + (1 - t) G'(t), \qquad S(t, 0) = G(t) - t G'(t).$$
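For example, the convex function $G(t) = t \log t + (1-t) \log(1-t)$ generates the log score of the next slide; a quick numeric check of the representation (an assumed example, not from the slides):

```python
import math

def G(t):                       # negative entropy; convex on (0, 1)
    return t * math.log(t) + (1 - t) * math.log(1 - t)

def dG(t, h=1e-6):              # numeric derivative G'(t)
    return (G(t + h) - G(t - h)) / (2 * h)

t = 0.3
print(G(t) + (1 - t) * dG(t), math.log(t))       # S(t, 1) vs log t
print(G(t) - t * dG(t), math.log(1 - t))         # S(t, 0) vs log(1 - t)
```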

  12. Proper Scoring Rule and $f$-divergence
Consider a natural objective under the assumption $X \mid y = 1 \sim P$ and $X \mid y = 0 \sim Q$ with prior $P(y = 1) = 1/2$:
$$\frac{1}{2} E_{X \sim P}\, S(T(X), 1) + \frac{1}{2} E_{X \sim Q}\, S(T(X), 0).$$
One can then find a good classification rule $T(\cdot)$ by maximizing this objective over $T \in \mathcal{T}$:
$$D_{\mathcal{T}}(P, Q) = \max_{T \in \mathcal{T}} \left\{ \frac{1}{2} E_{X \sim P}\, S(T(X), 1) + \frac{1}{2} E_{X \sim Q}\, S(T(X), 0) \right\} - G(1/2).$$
Log score (JS divergence): $S(t, 1) = \log t$, $S(t, 0) = \log(1 - t)$.
Zero-one score (TV divergence): $S(t, 1) = \mathbb{1}\{t \ge 1/2\}$, $S(t, 0) = \mathbb{1}\{t < 1/2\}$.

  13. (Multi-layer) JS-GAN is Statistically Optimal
$$\hat\theta = \mathop{\mathrm{argmin}}_{\eta \in \mathbb{R}^p} \max_{T \in \mathcal{T}} \left\{ \frac{1}{n} \sum_{i=1}^n \log T(X_i) + E_{X \sim N(\eta, I_p)} \log(1 - T(X)) + \log 4 \right\}.$$
Theorem (Gao-Liu-Yao-Zhu, 2018). With i.i.d. observations $X_1, \dots, X_n \sim (1 - \epsilon) N(\theta, I_p) + \epsilon Q$ and some regularization on the weight matrices, we have
$$\|\hat\theta - \theta\|_2^2 \lesssim \begin{cases} \frac{p}{n} \vee \epsilon^2, & \text{at least one bounded activation,} \\ \frac{p \log p}{n} \vee \epsilon^2, & \text{ReLU,} \end{cases}$$
with high probability, uniformly over all $\theta \in \mathbb{R}^p$ and all $Q$.
The result generalizes to elliptical distributions $\mu + \Sigma^{1/2} \xi U$ and to the strong contamination model; covariance and mean can be estimated simultaneously.
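The estimator above can be run as an alternating gradient game. Below is a minimal PyTorch sketch (assumed architecture, optimizer, and hyper-parameters; the paper's actual networks and regularization differ), initializing $\eta$ at the coordinate-wise median, which is one sensible choice:

```python
import torch

def js_gan_mean(X, steps=500, d_steps=5, lr=0.02):
    n, p = X.shape
    eta = X.median(dim=0).values.clone().requires_grad_(True)  # generator mean
    T = torch.nn.Sequential(torch.nn.Linear(p, 20), torch.nn.Sigmoid(),
                            torch.nn.Linear(20, 1))            # discriminator logits
    opt_T = torch.optim.SGD(T.parameters(), lr=lr)
    opt_eta = torch.optim.SGD([eta], lr=lr)
    bce = torch.nn.BCEWithLogitsLoss()
    real, fake_lbl = torch.ones(n, 1), torch.zeros(n, 1)
    for _ in range(steps):
        fake = eta + torch.randn(n, p)                   # X ~ N(eta, I_p)
        for _ in range(d_steps):                         # inner max over T
            loss_T = bce(T(X), real) + bce(T(fake.detach()), fake_lbl)
            opt_T.zero_grad(); loss_T.backward(); opt_T.step()
        # outer min over eta: only E_{N(eta, I_p)} log(1 - T(X)) depends on eta,
        # so the generator step backpropagates through the samples eta + Z
        loss_eta = torch.nn.functional.logsigmoid(-T(fake)).mean()
        opt_eta.zero_grad(); loss_eta.backward(); opt_eta.step()
    return eta.detach()
```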

  14. Proof Sketch
$$\sup_{D \in \mathcal{D}} |E_{P_n} D(X) - E_P D(X)| \le C\left(\sqrt{\frac{p}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\right).$$
$$\sup_{D \in \mathcal{D}} |E_{P_\theta} D(X) - E_{P_{\hat\theta}} D(X)| \le 2C\left(\sqrt{\frac{p}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\right) + 2\epsilon.$$
The function $f(t) = E_{Z \sim N(0,1)}[\mathrm{sigmoid}(Z - t)]$ satisfies $|f(t) - f(0)| \ge c'|t|$ for $|t| < \tau$, for some $\tau > 0$. Taking $D(X) = \mathrm{sigmoid}(w^T X + b)$ with $\|w\|_2 = 1$ and $b = -w^T \theta$ gives
$$E_{P_\theta} D(X) = f(0), \qquad E_{P_{\hat\theta}} D(X) = f(w^T(\theta - \hat\theta)).$$
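The lower bound $|f(t) - f(0)| \ge c'|t|$ can be eyeballed numerically (an assumed illustration, not part of the proof):

```python
import numpy as np

z = np.random.default_rng(0).standard_normal(10**6)
f = lambda t: (1.0 / (1.0 + np.exp(-(z - t)))).mean()  # f(t) = E sigmoid(Z - t)
for t in (0.05, 0.1, 0.2, 0.4):
    print(t, abs(f(t) - f(0.0)) / t)   # ratio stays bounded away from 0 near t = 0
```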

  15. Covariance Matrix Estimation: Improper Network Structure
$$\mathcal{T}_1 = \left\{ T(x) = \mathrm{sigmoid}\left(\sum_{j \ge 1} w_j\, \mathrm{sigmoid}(u_j^T x)\right) : \sum_{j \ge 1} |w_j| \le \kappa,\; u_j \in \mathbb{R}^p \right\}.$$
$$\mathcal{T}_2 = \left\{ T(x) = \mathrm{sigmoid}\left(\sum_{j \ge 1} w_j\, \mathrm{ReLU}(u_j^T x)\right) : \sum_{j \ge 1} |w_j| \le \kappa,\; \|u_j\| \le 1 \right\}.$$

  16. Covariance Matrix Estimation: Proper Network Structure
$$\mathcal{T}_3 = \left\{ T(x) = \mathrm{sigmoid}\left(\sum_{j \ge 1} w_j\, \mathrm{sigmoid}(u_j^T x + b_j)\right) : \sum_{j \ge 1} |w_j| \le \kappa,\; u_j \in \mathbb{R}^p,\; b_j \in \mathbb{R} \right\}.$$
$$\mathcal{T}_4 = \left\{ T(x) = \mathrm{sigmoid}\left(\sum_{j \ge 1} w_j\, \mathrm{sigmoid}\left(\sum_{l=1}^H v_{jl}\, \mathrm{ReLU}(u_l^T x)\right)\right) : \sum_{j \ge 1} |w_j| \le \kappa_1,\; \sum_{l=1}^H |v_{jl}| \le \kappa_2,\; \|u_l\| \le 1 \right\}.$$
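A $\mathcal{T}_4$-style discriminator translates directly into a small network; a sketch with assumed layer widths follows (the norm constraints $\kappa_1$, $\kappa_2$, $\|u_l\| \le 1$ would be enforced by projection or clipping after each update, omitted here):

```python
import torch

class T4(torch.nn.Module):
    """sigmoid( sum_j w_j * sigmoid( sum_l v_{jl} * ReLU(u_l^T x) ) )."""
    def __init__(self, p, H=20, J=20):
        super().__init__()
        self.u = torch.nn.Linear(p, H, bias=False)   # rows u_l^T, no bias
        self.v = torch.nn.Linear(H, J, bias=False)   # weights v_{jl}
        self.w = torch.nn.Linear(J, 1, bias=False)   # weights w_j
    def forward(self, x):
        return torch.sigmoid(self.w(torch.sigmoid(self.v(torch.relu(self.u(x))))))
```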

  17.
$$\hat\Sigma = \mathop{\mathrm{argmin}}_{\Gamma \in \mathcal{E}_p(M)} \max_{T \in \mathcal{T}} \left\{ \frac{1}{n} \sum_{i=1}^n S(T(X_i), 1) + E_{X \sim N(0, \Gamma)}\, S(T(X), 0) \right\}.$$
Theorem (Gao-Yao-Zhu, 2019). With i.i.d. observations $X_1, \dots, X_n \sim (1 - \epsilon) N(0, \Sigma) + \epsilon Q$ and some regularization on the network weight matrices, we have
$$\|\hat\Sigma - \Sigma\|_{\mathrm{op}}^2 \lesssim \frac{p}{n} \vee \epsilon^2$$
with high probability, uniformly over all $\|\Sigma\|_{\mathrm{op}} \le M = O(1)$ and all $Q$.
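For the outer minimization over $\Gamma \in \mathcal{E}_p(M)$, one convenient device (an assumed parameterization, not necessarily the paper's) is to write $\Gamma = A A^T$ and draw generator samples by reparameterization, so that gradients flow into $A$:

```python
import torch

p, n = 5, 1000
A = torch.eye(p, requires_grad=True)   # Gamma = A @ A.T is automatically PSD
Z = torch.randn(n, p)
fake = Z @ A.T                          # fake ~ N(0, A A^T), differentiable in A
Gamma = A @ A.T                         # current scatter iterate
```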
