SLIDE 1

Robust Statistics and Generative Adversarial Networks

Yuan Yao (HKUST)

SLIDE 2

Chao Gao (Chicago), Jiyu Liu (Yale), Weizhi Zhu (HKUST)

SLIDE 3

Deep Learning is Notoriously Not Robust!

  • Imperceptible adversarial examples are ubiquitous and can fool neural networks
  • How can one achieve robustness?
SLIDE 4

Robust Optimization

  • Traditional training:
$$\min_\theta J_n\big(\theta, z = (x_i, y_i)_{i=1}^n\big)$$
  • e.g., squared or cross-entropy loss as the negative log-likelihood of logit models
  • Robust optimization (Madry et al., ICLR 2018):
$$\min_\theta \max_{\|\epsilon_i\| \le \delta} J_n\big(\theta, z = (x_i + \epsilon_i, y_i)_{i=1}^n\big)$$
  • robust to any distribution, yet computationally hard
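For concreteness, a minimal sketch of this min-max training with projected gradient ascent on the inner maximization (the toy network, the ℓ∞ budget eps, step size alpha, and iteration counts are illustrative assumptions):

```python
import torch
import torch.nn as nn

def pgd_attack(model, x, y, loss_fn, eps=0.1, alpha=0.02, steps=10):
    """Approximate the inner max over ||eps_i|| <= eps by projected gradient ascent."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # gradient ascent step
            delta.clamp_(-eps, eps)              # project back onto the eps-ball
        delta.grad.zero_()
    return (x + delta).detach()

# Illustrative model and data; any classifier and data loader can be plugged in.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

x, y = torch.randn(128, 20), torch.randint(0, 2, (128,))
for _ in range(5):                               # outer min over theta
    x_adv = pgd_attack(model, x, y, loss_fn)
    opt.zero_grad()
    loss_fn(model(x_adv), y).backward()
    opt.step()
```

The inner loop approximates the maximum over perturbations; the outer SGD step is the usual minimization over the model parameters.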
SLIDE 5

Distributionally Robust Optimization (DRO)

  • Distributionally robust optimization:
$$\min_\theta \max_{P_\epsilon \in \mathcal{D}} \mathbb{E}_{z \sim P_\epsilon}[J_n(\theta, z)]$$
  • $\mathcal{D}$ is a set of ambiguous distributions, e.g., the Wasserstein ambiguity set $\mathcal{D} = \{P_\epsilon : W_2(P_\epsilon, \hat{P}_n) \le \epsilon\}$, where $\hat{P}_n$ is the empirical (uniform) distribution on the samples
  • DRO may then be reduced to regularized maximum likelihood estimates (Shafieezadeh-Abadeh, Esfahani, Kuhn, NIPS 2015) that are convex optimizations and tractable
SLIDE 6

Wasserstein DRO and Sqrt-Lasso

Theorem (Blanchet, Kang, Murthy (2016)). Suppose that
$$c\big((x, y), (x', y')\big) = \begin{cases} \|x - x'\|_q^2 & \text{if } y = y' \\ \infty & \text{if } y \neq y'. \end{cases}$$
Then, if $1/p + 1/q = 1$,
$$\max_{P :\, D_c(P, P_n) \le \delta} \mathbb{E}_P^{1/2}\big[(Y - \beta^T X)^2\big] = \mathbb{E}_{P_n}^{1/2}\big[(Y - \beta^T X)^2\big] + \sqrt{\delta}\,\|\beta\|_p.$$
Remark 1: This is sqrt-Lasso (Belloni et al. (2011)). Remark 2: Uses the RoPA duality theorem and a "judicious choice of $c(\cdot)$".
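To make the equivalence concrete, a minimal sketch that minimizes the sqrt-Lasso objective on simulated data (the data-generating process, the value δ = 0.01, and the derivative-free solver are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def sqrt_lasso_objective(beta, X, y, delta):
    """E_{P_n}^{1/2}[(Y - beta^T X)^2] + sqrt(delta) * ||beta||_1."""
    rmse = np.sqrt(np.mean((y - X @ beta) ** 2))
    return rmse + np.sqrt(delta) * np.sum(np.abs(beta))

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta_true = np.r_[np.ones(3), np.zeros(p - 3)]
y = X @ beta_true + 0.5 * rng.normal(size=n)

# Powell handles the non-smooth l1 term; dedicated solvers are preferable in practice.
res = minimize(sqrt_lasso_objective, np.zeros(p), args=(X, y, 0.01), method="Powell")
print(np.round(res.x, 2))   # roughly sparse and close to beta_true
```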

SLIDE 7

Certified Robustness of Lasso

Take $q = \infty$ and $p = 1$, with
$$c\big((x, y), (x', y')\big) = \begin{cases} \|x - x'\|_\infty^2 & \text{if } y = y' \\ \infty & \text{if } y \neq y'. \end{cases}$$
Then for $P'_n = \frac{1}{n}\sum_i \delta_{x'_i}$ with $\|x_i - x'_i\|_\infty \le \delta$,
$$D_c(P'_n, P_n) = \inf_\pi \int c\big((x, y), (x', y')\big)\, d\pi \le \delta,$$
for small enough $\delta$ and well-separated $x$'s. Sqrt-Lasso,
$$\min_\beta \left\{ \mathbb{E}_{P_n}^{1/2}\big(Y - \beta^T X\big)^2 + \sqrt{\delta}\,\|\beta\|_1 \right\}^2 = \min_\beta \max_{P :\, D_c(P, P_n) \le \delta} \mathbb{E}_P\Big[\big(Y - \beta^T X\big)^2\Big],$$
provides a certified robust estimate in terms of Madry's adversarial training, using a convex Wasserstein relaxation.

SLIDE 8

TV-neighborhood

  • Now how about the TV-uncertainty set $\mathcal{D} = \{P_\epsilon : TV(P_\epsilon, \hat{P}_n) \le \epsilon\}$?

SLIDES 9-12

Huber's Model

[Huber 1964]

$$X_1, \ldots, X_n \sim (1 - \epsilon)\, P_\theta + \epsilon\, Q,$$

where $\theta$ is the parameter of interest, $\epsilon$ the contamination proportion, and $Q$ an arbitrary contamination.

SLIDES 13-14

An Example

$$X_1, \ldots, X_n \sim (1-\epsilon)\, N(\theta, I_p) + \epsilon\, Q. \quad \text{How to estimate } \theta?$$

SLIDES 15-16

Robust Maximum-Likelihood Does Not Work!

$$X_1, \ldots, X_n \sim (1-\epsilon)\, N(\theta, I_p) + \epsilon\, Q. \quad \text{How to estimate } \theta?$$

$$\ell(\theta, Q) = \text{negative log-likelihood} = \sum_{i=1}^n (\theta - X_i)^2 \sim (1-\epsilon)\,\mathbb{E}_{N(\theta)}(\theta - X)^2 + \epsilon\,\mathbb{E}_Q(\theta - X)^2$$

The sample mean $\hat\theta_{\mathrm{mean}} = \frac{1}{n}\sum_{i=1}^n X_i = \arg\min_\theta \ell(\theta, Q)$, yet

$$\min_\theta \max_Q \ell(\theta, Q) \ge \max_Q \min_\theta \ell(\theta, Q) = \max_Q \ell(\hat\theta_{\mathrm{mean}}, Q) = \infty.$$

SLIDES 17-18

Medians

  • 1. Coordinatewise median: $\hat\theta = (\hat\theta_j)$, where $\hat\theta_j = \mathrm{Median}(\{X_{ij}\}_{i=1}^n)$.
  • 2. Tukey's median: $\hat\theta = \arg\max_{\eta\in\mathbb{R}^p} \min_{\|u\|=1} \frac{1}{n}\sum_{i=1}^n I\{u^T X_i > u^T \eta\}$.
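A quick numerical sanity check (a sketch with illustrative parameters): under 20% contamination the sample mean is dragged toward Q, while the coordinatewise median stays near θ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, eps = 1000, 20, 0.2
theta = np.zeros(p)

# Huber contamination: (1 - eps) * N(theta, I_p) + eps * Q, with Q a far-away Gaussian.
clean = rng.normal(theta, 1.0, size=(n, p))
outliers = rng.normal(10.0, 1.0, size=(n, p))
mask = rng.random(n) < eps
X = np.where(mask[:, None], outliers, clean)

theta_mean = X.mean(axis=0)                  # breaks down: dragged toward Q
theta_med = np.median(X, axis=0)             # coordinatewise median stays near theta
print(np.linalg.norm(theta_mean - theta))    # large, roughly eps * 10 * sqrt(p)
print(np.linalg.norm(theta_med - theta))     # small
```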
SLIDE 19

Comparisons

|                                            | Coordinatewise Median | Tukey's Median                       |
| breakdown point                            | 1/2                   | 1/3                                  |
| statistical precision (no contamination)   | p/n                   | p/n                                  |
| statistical precision (with contamination) | p/n + pε²             | p/n + ε² (minimax) [Chen-Gao-Ren'15] |
| computational complexity                   | polynomial            | NP-hard [Amenta et al. '00]          |

Note: the R package for the Tukey median cannot deal with more than 10 dimensions! [https://github.com/ChenMengjie/DepthDescent]

SLIDES 20-23

Multivariate Location Depth

[Tukey, 1975]

$$\hat\theta = \arg\max_{\eta \in \mathbb{R}^p} \min_{\|u\|=1} \left\{ \frac{1}{n}\sum_{i=1}^n I\{u^T X_i > u^T \eta\} \;\wedge\; \frac{1}{n}\sum_{i=1}^n I\{u^T X_i \le u^T \eta\} \right\}$$

Equivalently (Estimator 2): $\hat\theta = \arg\max_{\eta\in\mathbb{R}^p} \min_{\|u\|=1} \frac{1}{n}\sum_{i=1}^n I\{u^T X_i > u^T \eta\}$.
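Computing this depth exactly is NP-hard (see the comparison on slide 19); a common heuristic approximates the minimum over $\|u\| = 1$ with random unit directions. A minimal sketch, with the number of directions an illustrative assumption:

```python
import numpy as np

def approx_tukey_depth(eta, X, n_dirs=2000, rng=None):
    """Approximate min over ||u||=1 of the one-sided empirical masses at eta
    by sampling random unit directions u."""
    rng = rng or np.random.default_rng(0)
    U = rng.normal(size=(n_dirs, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # unit directions
    proj = (X - eta) @ U.T                           # u^T (X_i - eta), shape (n, n_dirs)
    above = (proj > 0).mean(axis=0)                  # fraction with u^T X_i > u^T eta
    return np.minimum(above, 1 - above).min()

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
print(approx_tukey_depth(np.zeros(5), X, rng=rng))      # near 1/2 at the center
print(approx_tukey_depth(5 * np.ones(5), X, rng=rng))   # near 0 far outside
```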

SLIDES 24-30

Regression Depth

[Rousseeuw & Hubert, 1999]

model: $y|X \sim N(X^T\beta, \sigma^2)$
embedding: $Xy|X \sim N(XX^T\beta, \sigma^2 XX^T)$
projection: $u^T X y|X \sim N(u^T XX^T\beta, \sigma^2 u^T XX^T u)$

$$\hat\beta = \arg\max_{\eta\in\mathbb{R}^p} \min_{u\in\mathbb{R}^p} \left\{ \frac{1}{n}\sum_{i=1}^n I\{u^T X_i (y_i - X_i^T\eta) > 0\} \;\wedge\; \frac{1}{n}\sum_{i=1}^n I\{u^T X_i (y_i - X_i^T\eta) \le 0\} \right\}$$

SLIDE 31

Tukey’s depth is not a special case of regression depth.

SLIDES 32-36

Multi-task Regression Depth

[Mizera, 2002]

$(X, Y) \in \mathbb{R}^p \times \mathbb{R}^m \sim P$, with coefficient matrix $B \in \mathbb{R}^{p\times m}$.

population version:
$$D_{\mathcal{U}}(B, P) = \inf_{U\in\mathcal{U}} P\big(\langle U^T X,\, Y - B^T X\rangle \ge 0\big)$$

empirical version:
$$D_{\mathcal{U}}\big(B, \{(X_i, Y_i)\}_{i=1}^n\big) = \inf_{U\in\mathcal{U}} \frac{1}{n}\sum_{i=1}^n I\big\{\langle U^T X_i,\, Y_i - B^T X_i\rangle \ge 0\big\}$$

SLIDES 37-39

Multi-task Regression Depth

$$D_{\mathcal{U}}(B, P) = \inf_{U\in\mathcal{U}} P\big(\langle U^T X,\, Y - B^T X\rangle \ge 0\big)$$

  • $p = 1$, $X = 1 \in \mathbb{R}$ (location depth): $D_{\mathcal{U}}(b, P) = \inf_{u\in\mathcal{U}} P\big(u^T(Y - b) \ge 0\big)$
  • $m = 1$ (regression depth): $D_{\mathcal{U}}(\beta, P) = \inf_{u\in\mathcal{U}} P\big(u^T X (y - \beta^T X) \ge 0\big)$
SLIDES 40-41

Multi-task Regression Depth

  • Proposition. For any $\delta > 0$, with probability at least $1 - 2\delta$,
$$\sup_{B\in\mathbb{R}^{p\times m}} |D(B, P_n) - D(B, P)| \le C\left(\sqrt{\frac{pm}{n}} + \sqrt{\frac{\log(1/\delta)}{2n}}\right).$$
  • Proposition.
$$\sup_{B, Q} \big|D\big(B, (1-\epsilon)P_{B^*} + \epsilon Q\big) - D(B, P_{B^*})\big| \le \epsilon.$$

SLIDES 42-45

Multi-task Regression Depth

  • $(X, Y) \sim P_B$: $X \sim N(0, \Sigma)$, $Y|X \sim N(B^T X, \sigma^2 I_m)$

$$(X_1, Y_1), \ldots, (X_n, Y_n) \sim (1-\epsilon)\, P_B + \epsilon\, Q$$

Theorem [G17]. For some $C > 0$, with high probability uniformly over $B, Q$,
$$\mathrm{Tr}\big((\hat B - B)^T \Sigma (\hat B - B)\big) \le C\sigma^2\left(\frac{pm}{n} \vee \epsilon^2\right), \qquad \|\hat B - B\|_F^2 \le \frac{C\sigma^2}{\lambda_{\min}(\Sigma)}\left(\frac{pm}{n} \vee \epsilon^2\right).$$
SLIDES 46-47

Covariance Matrix

$$X_1, \ldots, X_n \sim (1-\epsilon)\, N(0, \Sigma) + \epsilon\, Q. \quad \text{How to estimate } \Sigma?$$

SLIDES 48-53

Covariance Matrix

$$D\big(\Gamma, \{X_i\}_{i=1}^n\big) = \min_{\|u\|=1} \min\left\{ \frac{1}{n}\sum_{i=1}^n I\{|u^T X_i|^2 \ge u^T \Gamma u\},\ \frac{1}{n}\sum_{i=1}^n I\{|u^T X_i|^2 < u^T \Gamma u\} \right\}$$

SLIDES 54-55

Covariance Matrix

$$\hat\Sigma = \hat\Gamma/\beta, \qquad \hat\Gamma = \arg\max_{\Gamma \succeq 0} D\big(\Gamma, \{X_i\}_{i=1}^n\big),$$

with the depth $D(\Gamma, \{X_i\}_{i=1}^n)$ as above and $\beta$ a fixed normalizing constant.

Theorem [CGR15]. For some $C > 0$, with high probability uniformly over $\Sigma, Q$,
$$\|\hat\Sigma - \Sigma\|_{\mathrm{op}}^2 \le C\left(\frac{p}{n} \vee \epsilon^2\right).$$
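As with Tukey depth, the exact minimization over $\|u\| = 1$ is intractable; below is a minimal sketch that evaluates the matrix depth along random unit directions (the sampling heuristic and the parameters are illustrative assumptions):

```python
import numpy as np

def approx_matrix_depth(Gamma, X, n_dirs=2000, rng=None):
    """Approximate the matrix depth of Gamma at the sample X via random unit directions."""
    rng = rng or np.random.default_rng(0)
    U = rng.normal(size=(n_dirs, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    proj2 = (X @ U.T) ** 2                           # |u^T X_i|^2, shape (n, n_dirs)
    thresh = np.einsum('dp,pq,dq->d', U, Gamma, U)   # u^T Gamma u per direction
    above = (proj2 >= thresh).mean(axis=0)
    return np.minimum(above, 1 - above).min()

rng = np.random.default_rng(2)
Sigma = np.diag([2.0, 1.0, 0.5])
X = rng.multivariate_normal(np.zeros(3), Sigma, size=2000)
# Depth is higher near a correctly scaled Gamma than at a badly scaled one.
print(approx_matrix_depth(Sigma, X, rng=rng))
print(approx_matrix_depth(10 * Sigma, X, rng=rng))
```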

SLIDES 56-57

Summary

Minimax rates under $\epsilon$-contamination:

| problem                  | loss                        | minimax rate                                                       |
| mean                     | $\|\cdot\|_2^2$             | $\frac{p}{n} \vee \epsilon^2$                                      |
| covariance matrix        | $\|\cdot\|_{\mathrm{op}}^2$ | $\frac{p}{n} \vee \epsilon^2$                                      |
| reduced rank regression  | $\|\cdot\|_F^2$             | $\sigma^2\frac{r(p+m)}{n} \vee \sigma^2\epsilon^2$                 |
| Gaussian graphical model | $\|\cdot\|_{\ell_1}^2$      | $\frac{s^2\log(ep/s)}{n} \vee s\epsilon^2$                         |
| sparse PCA               | $\|\cdot\|_F^2$             | $\frac{s\log(ep/s)}{n\lambda^2} \vee \frac{\epsilon^2}{\lambda^2}$ |

SLIDE 58

Computation

SLIDES 59-61

Computational Challenges

$$X_1, \ldots, X_n \sim (1-\epsilon)\, N(\theta, I_p) + \epsilon\, Q. \quad \text{How to estimate } \theta?$$

Lai, Rao, Vempala; Diakonikolas, Kamath, Kane, Li, Moitra, Stewart; Balakrishnan, Du, Singh; Dalalyan, Carpentier, Collier, Verzelen

  • Polynomial-time algorithms are proposed [Diakonikolas et al.'16, Lai et al.'16] with minimax optimal statistical precision,
  • but they need information on second or higher order moments,
  • and some a priori knowledge about ε.
SLIDES 62-65

Advantages of Tukey Median

  • A well-defined objective function
  • Adaptive to ε and Σ
  • Optimal for any elliptical distribution

SLIDE 66

A practically good algorithm?

SLIDE 67

Generative Adversarial Networks [Goodfellow et al. 2014]

Note: the R package for the Tukey median cannot deal with more than 10 dimensions [https://github.com/ChenMengjie/DepthDescent]

SLIDE 68

Robust Learning of Cauchy Distributions

Table 4: comparison of various methods of robust location estimation under Cauchy distributions. Samples are drawn from $(1-\epsilon)\,\mathrm{Cauchy}(0_p, I_p) + \epsilon\, Q$ with $\epsilon = 0.2$, $p = 50$, and various choices of $Q$. Sample size: 50,000. Discriminator net structure: 50-50-25-1. Generator $g_\omega(\xi)$ structure: 48-48-32-24-12-1 with absolute-value activation in the output layer.

| Contamination Q          | JS-GAN (G1)     | JS-GAN (G2)     | Dimension Halving | Iterative Filtering |
| Cauchy(1.5·1_p, I_p)     | 0.0664 (0.0065) | 0.0743 (0.0103) | 0.3529 (0.0543)   | 0.1244 (0.0114)     |
| Cauchy(5.0·1_p, I_p)     | 0.0480 (0.0058) | 0.0540 (0.0064) | 0.4855 (0.0616)   | 0.1687 (0.0310)     |
| Cauchy(1.5·1_p, 5·I_p)   | 0.0754 (0.0135) | 0.0742 (0.0111) | 0.3726 (0.0530)   | 0.1220 (0.0112)     |
| Normal(1.5·1_p, 5·I_p)   | 0.0702 (0.0064) | 0.0713 (0.0088) | 0.3915 (0.0232)   | 0.1048 (0.0288)     |

  • Dimension Halving: [Lai et al.'16] https://github.com/kal2000/AgnosticMeanAndCovarianceCode
  • Iterative Filtering: [Diakonikolas et al.'17] https://github.com/hoonose/robust-filter

SLIDE 69

f-GAN

Given a strictly convex function $f$ that satisfies $f(1) = 0$, the $f$-divergence between two probability distributions $P$ and $Q$ is defined by
$$D_f(P\|Q) = \int f\left(\frac{p}{q}\right) dQ. \qquad (8)$$
Let $f^*$ be the convex conjugate of $f$. A variational lower bound of (8) is
$$D_f(P\|Q) \ge \sup_{T\in\mathcal{T}} \left[\mathbb{E}_P\, T(X) - \mathbb{E}_Q\, f^*(T(X))\right], \qquad (9)$$
where equality holds whenever the class $\mathcal{T}$ contains the function $f'(p/q)$. [Nowozin-Cseke-Tomioka'16]
The f-GAN minimizes the variational lower bound (9):
$$\hat P = \arg\min_{Q\in\mathcal{Q}} \sup_{T\in\mathcal{T}} \left[\frac{1}{n}\sum_{i=1}^n T(X_i) - \mathbb{E}_Q\, f^*(T(X))\right], \qquad (10)$$
with i.i.d. observations $X_1, \ldots, X_n \sim P$.

SLIDE 70

From f-GAN to Tukey's Median: f-Learning

Consider the special case
$$\mathcal{T} = \left\{ f'\!\left(\frac{\tilde q}{q}\right) : \tilde q \in \widetilde{\mathcal{Q}} \right\}, \qquad (11)$$
which is tight if $P \in \widetilde{\mathcal{Q}}$. The sample version leads to the following f-learning:
$$\hat P = \arg\min_{Q\in\mathcal{Q}} \sup_{\tilde Q \in \widetilde{\mathcal{Q}}} \left[ \frac{1}{n}\sum_{i=1}^n f'\!\left(\frac{\tilde q(X_i)}{q(X_i)}\right) - \mathbb{E}_Q\, f^*\!\left(f'\!\left(\frac{\tilde q(X)}{q(X)}\right)\right) \right]. \qquad (12)$$

  • If $f(x) = x\log x$ and $\mathcal{Q} = \widetilde{\mathcal{Q}}$, then (12) ⟹ the Maximum Likelihood Estimate.
  • If $f(x) = (x-1)_+$, then $D_f(P\|Q) = \frac{1}{2}\int |p - q|$ is the TV-distance, $f^*(t) = t\, I\{0 \le t \le 1\}$, and f-GAN ⟹ TV-GAN.
  • If $\mathcal{Q} = \{N(\eta, I_p) : \eta\in\mathbb{R}^p\}$ and $\widetilde{\mathcal{Q}} = \{N(\tilde\eta, I_p) : \|\tilde\eta - \eta\| \le r\}$, then (12) $\xrightarrow{\,r\to 0\,}$ Tukey's Median.

SLIDES 71-76

f-Learning

f-divergence:
$$D_f(P\|Q) = \int f\left(\frac{p}{q}\right) dQ, \qquad f(u) = \sup_t\, \big(tu - f^*(t)\big).$$

variational representation:
$$D_f(P\|Q) = \sup_T \left[\mathbb{E}_{X\sim P}\, T(X) - \mathbb{E}_{X\sim Q}\, f^*(T(X))\right]$$

optimal $T$: $T(x) = f'\!\left(\frac{p(x)}{q(x)}\right)$, so that
$$D_f(P\|Q) = \sup_{\tilde Q}\left\{ \mathbb{E}_{X\sim P}\, f'\!\left(\frac{d\tilde Q(X)}{dQ(X)}\right) - \mathbb{E}_{X\sim Q}\, f^*\!\left(f'\!\left(\frac{d\tilde Q(X)}{dQ(X)}\right)\right)\right\}$$

SLIDES 77-80

f-Learning vs. f-GAN [Nowozin, Cseke, Tomioka]

f-GAN:
$$\min_{Q\in\mathcal{Q}} \max_{T\in\mathcal{T}} \left\{\frac{1}{n}\sum_{i=1}^n T(X_i) - \int f^*(T)\, dQ\right\}$$

f-Learning:
$$\min_{Q\in\mathcal{Q}} \max_{\tilde Q\in\widetilde{\mathcal{Q}}} \left\{\frac{1}{n}\sum_{i=1}^n f'\!\left(\frac{\tilde q(X_i)}{q(X_i)}\right) - \int f^*\!\left(f'\!\left(\frac{\tilde q}{q}\right)\right) dQ\right\}$$

SLIDES 81-85

f-Learning: Examples [Goodfellow et al.; Baraud and Birgé]

| f-divergence      | f(x)                               | procedure    |
| Jensen-Shannon    | $x\log x - (x+1)\log\frac{x+1}{2}$ | GAN          |
| Kullback-Leibler  | $x\log x$                          | MLE          |
| Squared Hellinger | $2 - 2\sqrt{x}$                    | ρ-estimation |
| Total Variation   | $(x-1)_+$                          | depth        |

SLIDES 86-89

TV-Learning

$$\min_{Q\in\mathcal{Q}} \max_{\tilde Q\in\widetilde{\mathcal{Q}}} \left\{\frac{1}{n}\sum_{i=1}^n I\left\{\frac{\tilde q(X_i)}{q(X_i)} \ge 1\right\} - Q\left(\frac{\tilde q}{q} \ge 1\right)\right\}$$

With $\mathcal{Q} = \{N(\theta, I_p) : \theta\in\mathbb{R}^p\}$ and $\widetilde{\mathcal{Q}} = \{N(\tilde\theta, I_p) : \tilde\theta \in \mathcal{N}_r(\theta)\}$, letting $r \to 0$ recovers the Tukey depth:
$$\max_{\theta\in\mathbb{R}^p} \min_{\|u\|=1} \frac{1}{n}\sum_{i=1}^n I\{u^T X_i \ge u^T \theta\}$$

SLIDES 90-93

TV-Learning

With $\mathcal{Q} = \{N(0, \Sigma) : \Sigma \in \mathbb{R}^{p\times p}\}$ and $\widetilde{\mathcal{Q}} = \{N(0, \tilde\Sigma) : \tilde\Sigma = \Sigma + r\,uu^T, \|u\| = 1\}$, letting $r \to 0$ is (related to) the matrix depth:
$$\max_{\Sigma} \min_{\|u\|=1} \left[\left(\frac{1}{n}\sum_{i=1}^n \frac{I\{|u^T X_i|^2 \le u^T \Sigma u\}}{P(\chi^2_1 \le 1)}\right) \wedge \left(\frac{1}{n}\sum_{i=1}^n \frac{I\{|u^T X_i|^2 > u^T \Sigma u\}}{P(\chi^2_1 > 1)}\right)\right]$$

SLIDES 94-97

robust statistics community ⟷ deep learning community
f-Learning ⟷ f-GAN
theoretical foundation ⟷ practically good algorithms

SLIDES 98-101

TV-GAN

$$\hat\theta = \arg\min_{\eta} \sup_{w, b} \left[\frac{1}{n}\sum_{i=1}^n \frac{1}{1 + e^{-w^T X_i - b}} - \mathbb{E}_{\eta}\, \frac{1}{1 + e^{-w^T X - b}}\right]$$

The discriminator is a logistic regression classifier; $\mathbb{E}_\eta$ is taken under the generator $N(\eta, I_p)$.

Theorem [GLYZ18]. For some $C > 0$, with high probability uniformly over $\theta\in\mathbb{R}^p$ and $Q$,
$$\|\hat\theta - \theta\|^2 \le C\left(\frac{p}{n} \vee \epsilon^2\right).$$
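A minimal PyTorch sketch of this estimator (the learning rates, iteration counts, and the near contamination in the demo are illustrative assumptions; note the next slide's caveat that the landscape can trap gradient ascent when the contamination is far away):

```python
import torch

def tv_gan_mean(X, steps=500, lr=0.05):
    """Estimate theta by min over eta, max over (w, b) of the TV-GAN objective."""
    n, p = X.shape
    eta = X.median(dim=0).values.clone().requires_grad_(True)   # robust initialization
    w = torch.zeros(p, requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt_g = torch.optim.SGD([eta], lr=lr)
    opt_d = torch.optim.SGD([w, b], lr=lr)
    for _ in range(steps):
        fake = eta + torch.randn(n, p)               # samples from N(eta, I_p)
        obj = torch.sigmoid(X @ w + b).mean() - torch.sigmoid(fake.detach() @ w + b).mean()
        opt_d.zero_grad(); (-obj).backward(); opt_d.step()       # ascent on (w, b)
        fake = eta + torch.randn(n, p)
        obj = torch.sigmoid(X @ w + b).mean() - torch.sigmoid(fake @ w + b).mean()
        opt_g.zero_grad(); obj.backward(); opt_g.step()          # descent on eta
    return eta.detach()

X = torch.randn(2000, 5)
X[:400] += 0.5          # 20% near contamination, where TV-GAN behaves well
print(tv_gan_mean(X))   # should be close to the true mean 0
```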

SLIDE 102

TV-GAN: Rugged Landscape!

Figure: heatmaps of the landscape of $F(\eta, w) = \sup_b\,[\mathbb{E}_P\, \mathrm{sigmoid}(wX + b) - \mathbb{E}_{N(\eta, 1)}\, \mathrm{sigmoid}(wX + b)]$, where $b$ is maximized out for visualization. Left: samples are drawn from $P = (1-\epsilon)N(1, 1) + \epsilon N(1.5, 1)$ with $\epsilon = 0.2$. Right: samples are drawn from $P = (1-\epsilon)N(1, 1) + \epsilon N(10, 1)$ with $\epsilon = 0.2$. Left: the landscape is good in the sense that no matter whether we start from the left-top area or the right-bottom area of the heatmap, gradient ascent on $\eta$ does not consistently increase or decrease the value of $\eta$; the signal becomes weak close to the saddle point around $\eta = 1$. Right: it is clear that $\tilde F(w) = F(\eta, w)$ has two local maxima for a given $\eta$, achieved at $w = +\infty$ and $w = -\infty$. In fact, the global maximum of $\tilde F(w)$ has a phase transition from $w = +\infty$ to $w = -\infty$ as $\eta$ grows: for example, the maximum is achieved at $w = +\infty$ when $\eta = 1$ (blue solid) and at $w = -\infty$ when $\eta = 5$ (red solid). Unfortunately, even if we initialize with $\eta_0 = 1$ and $w_0 > 0$, gradient ascent on $\eta$ will only increase the value of $\eta$ (green dash); as long as the discriminator cannot reach the global maximizer, $w$ will be stuck in the positive half space $\{w : w > 0\}$ and further increase the value of $\eta$.

SLIDE 103

The Original JS-GAN

[Goodfellow et al. 2014] For $f(x) = x\log x - (x+1)\log\frac{x+1}{2}$,
$$\hat\theta = \arg\min_{\eta\in\mathbb{R}^p} \max_{D\in\mathcal{D}} \left[\frac{1}{n}\sum_{i=1}^n \log D(X_i) + \mathbb{E}_{N(\eta, I_p)}\log(1 - D(X))\right] + \log 4. \qquad (15)$$

What is $\mathcal{D}$, the class of discriminators?
  • Single layer (no hidden layer): $\mathcal{D} = \{D(x) = \mathrm{sigmoid}(w^T x + b) : w\in\mathbb{R}^p, b\in\mathbb{R}\}$
  • One or multiple hidden layers: $\mathcal{D} = \{D(x) = \mathrm{sigmoid}(w^T g(X))\}$ for feature maps $g$ computed by the hidden layers

SLIDES 104-108

JS-GAN

$$\hat\theta = \arg\min_{\eta\in\mathbb{R}^p} \max_{T\in\mathcal{T}} \left[\frac{1}{n}\sum_{i=1}^n \log T(X_i) + \mathbb{E}_\eta \log(1 - T(X))\right] + \log 4$$

Numerical experiment: $X_1, \ldots, X_n \sim (1-\epsilon)\, N(\theta, I_p) + \epsilon\, N(\tilde\theta, I_p)$. Without hidden layers the estimate is pulled to the mixture mean, $\hat\theta \approx (1-\epsilon)\theta + \epsilon\tilde\theta$; with hidden layers, $\hat\theta \approx \theta$.

SLIDES 109-111

JS-GAN

A classifier with hidden layers leads to robustness. Why?

$$JS_g(P, Q) = \max_{w\in\mathbb{R}^d} \left[ \mathbb{E}_P \log\frac{1}{1 + e^{-w^T g(X)}} + \mathbb{E}_Q \log\frac{1}{1 + e^{w^T g(X)}} \right] + \log 4.$$

Proposition.
$$JS_g(P, Q) = 0 \iff P_{g(X)} = Q_{g(X)}$$

SLIDE 112

JS-GAN

$$\hat\theta = \arg\min_{\eta\in\mathbb{R}^p} \max_{T\in\mathcal{T}} \left[\frac{1}{n}\sum_{i=1}^n \log T(X_i) + \mathbb{E}_\eta \log(1 - T(X))\right] + \log 4$$

Theorem [GLYZ18]. For a neural network class $\mathcal{T}$ with at least one hidden layer and appropriate regularization, we have, with high probability uniformly over $\theta\in\mathbb{R}^p$ and $Q$,
$$\|\hat\theta - \theta\|^2 \lesssim \begin{cases} \frac{p}{n} + \epsilon^2 & \text{(indicator/sigmoid/ramp activations)} \\ \frac{p\log p}{n} + \epsilon^2 & \text{(ReLU + sigmoid features)} \end{cases}$$
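A minimal PyTorch sketch of JS-GAN mean estimation with a one-hidden-layer sigmoid discriminator (the network width, optimizers, and step counts are illustrative assumptions, not the paper's tuned settings):

```python
import torch
import torch.nn as nn

def js_gan_mean(X, steps=600, d_steps=3, hidden=32, lr=0.02):
    """min over eta, max over discriminator T of the JS-GAN objective."""
    n, p = X.shape
    eta = X.median(dim=0).values.clone().requires_grad_(True)
    D = nn.Sequential(nn.Linear(p, hidden), nn.Sigmoid(), nn.Linear(hidden, 1))
    opt_g = torch.optim.Adam([eta], lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        for _ in range(d_steps):                       # inner max over T
            fake = (eta + torch.randn(n, p)).detach()
            loss_d = bce(D(X), torch.ones(n, 1)) + bce(D(fake), torch.zeros(n, 1))
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        fake = eta + torch.randn(n, p)                 # outer min over eta
        loss_g = -bce(D(fake), torch.zeros(n, 1))      # push N(eta, I_p) toward "real"
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return eta.detach()

X = torch.randn(2000, 10)
X[:400] += 5.0                                         # 20% contamination at 5 * 1_p
print(js_gan_mean(X).norm())                           # should be small
```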

SLIDES 113-115

JS-GAN: Adaptation to Unknown Covariance

Unknown covariance?
$$X_1, \ldots, X_n \sim (1-\epsilon)\, N(\theta, \Sigma) + \epsilon\, Q$$

$$(\hat\theta, \hat\Sigma) = \arg\min_{\eta, \Gamma} \max_{T\in\mathcal{T}} \left[\frac{1}{n}\sum_{i=1}^n \log T(X_i) + \mathbb{E}_{X\sim N(\eta, \Gamma)} \log(1 - T(X))\right]$$

No need to change the discriminator class.

SLIDE 116

Generalization

Strong contamination model: $X_1, \ldots, X_n \overset{iid}{\sim} P$ for some $P$ satisfying $TV(P, \mathcal{E}(\theta, \Sigma, H)) \le \epsilon$.

$$(\hat\theta, \hat\Sigma, \hat H) = \arg\min_{\eta\in\mathbb{R}^p,\, \Gamma\in\mathcal{E}_p(M),\, H\in\mathcal{H}(M_0)} \max_{T\in\mathcal{T}} \left[\frac{1}{n}\sum_{i=1}^n S(T(X_i), 1) + \mathbb{E}_{X\sim \mathcal{E}(\eta, \Gamma, H)}\, S(T(X), 0)\right]$$

A scoring rule $S$ is regular if both $S(\cdot, 0)$ and $S(\cdot, 1)$ are real-valued, except possibly that $S(0, 1) = -\infty$ or $S(1, 0) = -\infty$. The celebrated Savage representation [50] asserts that a regular scoring rule $S$ is proper if and only if there is a convex function $G(\cdot)$ such that
$$S(t, 1) = G(t) + (1-t)G'(t), \qquad S(t, 0) = G(t) - tG'(t). \qquad (10)$$
Here, $G'(t)$ is a subgradient of $G$ at the point $t$. Moreover, the statement also holds for strictly proper scoring rules when "convex" is replaced by "strictly convex".

SLIDE 117

Consistency

Theorem [GYZ19]. For a neural network class $\mathcal{T}$ with at least one hidden layer and appropriate regularization, we have
$$\|\hat\theta - \theta\|^2 \le C\left(\frac{p}{n} \vee \epsilon^2\right), \qquad \|\hat\Sigma - \Sigma\|_{\mathrm{op}}^2 \le C\left(\frac{p}{n} \vee \epsilon^2\right),$$
with high probability.

SLIDE 118

Example 1: Log Score and JS-GAN

  • 1. Log Score. The log score is perhaps the most commonly used rule because of its various intriguing properties [31]. The scoring rule with $S(t, 1) = \log t$ and $S(t, 0) = \log(1-t)$ is regular and strictly proper. Its Savage representation is given by the convex function $G(t) = t\log t + (1-t)\log(1-t)$, which is interpreted as the negative Shannon entropy of Bernoulli(t). The corresponding divergence function $D_{\mathcal{T}}(P, Q)$, according to Proposition 3.1, is a variational lower bound of the Jensen-Shannon divergence
$$JS(P, Q) = \frac{1}{2}\int \log\left(\frac{dP}{dP + dQ}\right)dP + \frac{1}{2}\int \log\left(\frac{dQ}{dP + dQ}\right)dQ + \log 2.$$
Its sample version (13) is the original GAN proposed by [25] that is widely used in learning distributions of images.
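The representation (10) can be verified numerically for the log score; a small illustrative sketch:

```python
import numpy as np

# Savage representation: S(t,1) = G(t) + (1-t) G'(t), S(t,0) = G(t) - t G'(t).
# Log score: G(t) = t log t + (1-t) log(1-t), with G'(t) = log(t / (1-t)).
t = np.linspace(0.01, 0.99, 99)
G = t * np.log(t) + (1 - t) * np.log(1 - t)
dG = np.log(t / (1 - t))

S1 = G + (1 - t) * dG          # should equal log t
S0 = G - t * dG                # should equal log(1 - t)
print(np.allclose(S1, np.log(t)), np.allclose(S0, np.log(1 - t)))   # True True
```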

SLIDE 119

Example 2: Zero-One Score and TV-GAN

  • 2. Zero-One Score. The zero-one score $S(t, 1) = 2I\{t \ge 1/2\}$ and $S(t, 0) = 2I\{t < 1/2\}$ is also known as the misclassification loss. This is a regular proper scoring rule but not strictly proper. The induced divergence function $D_{\mathcal{T}}(P, Q)$ is a variational lower bound of the total variation distance
$$TV(P, Q) = P\left(\frac{dP}{dQ} \ge 1\right) - Q\left(\frac{dP}{dQ} \ge 1\right) = \frac{1}{2}\int |dP - dQ|.$$
The sample version (13) is recognized as the TV-GAN that is extensively studied by [21] in the context of robust estimation.

SLIDE 120

Example 3: Quadratic Score and LS-GAN

  • 3. Quadratic Score. Also known as the Brier score [6], it is given by $S(t, 1) = -(1-t)^2$ and $S(t, 0) = -t^2$. The corresponding convex function in the Savage representation is $G(t) = -t(1-t)$. By Proposition 2.1, the divergence function (3) induced by this regular strictly proper scoring rule is a variational lower bound of the divergence
$$\Delta(P, Q) = \frac{1}{8}\int \frac{(dP - dQ)^2}{dP + dQ},$$
known as the triangular discrimination. The sample version (5) belongs to the family of least-squares GANs proposed by [39].
SLIDE 121

Example 4: Boosting Score

  • 4. Boosting Score. The boosting score was introduced by [7] with $S(t, 1) = -\left(\frac{1-t}{t}\right)^{1/2}$ and $S(t, 0) = -\left(\frac{t}{1-t}\right)^{1/2}$, and has a connection to the AdaBoost algorithm. The corresponding convex function in the Savage representation is $G(t) = -2\sqrt{t(1-t)}$. The induced divergence function $D_{\mathcal{T}}(P, Q)$ is thus a variational lower bound of the squared Hellinger distance
$$H^2(P, Q) = \frac{1}{2}\int \left(\sqrt{dP} - \sqrt{dQ}\right)^2.$$

SLIDE 122

Example 5: Beta Score and New GANs

  • 5. Beta Score. A general Beta family of proper scoring rules was introduced by [7], with $S(t, 1) = -\int_t^1 c^{\alpha-1}(1-c)^{\beta}\, dc$ and $S(t, 0) = -\int_0^t c^{\alpha}(1-c)^{\beta-1}\, dc$ for any $\alpha, \beta > -1$. The log score, the quadratic score, and the boosting score are special cases of the Beta score with $\alpha = \beta = 0$, $\alpha = \beta = 1$, and $\alpha = \beta = -1/2$, respectively. The zero-one score is a limiting case of the Beta score by letting $\alpha = \beta \to \infty$. Moreover, it also leads to asymmetric scoring rules with $\alpha \neq \beta$.

SLIDE 123

Robust Learning of Gaussian Distributions

| Q               | n      | p   | ε   | TV-GAN          | JS-GAN          | Dimension Halving | Iterative Filtering |
| N(0.5·1_p, I_p) | 50,000 | 100 | .2  | 0.0953 (0.0064) | 0.1144 (0.0154) | 0.3247 (0.0058)   | 0.1472 (0.0071)     |
| N(0.5·1_p, I_p) | 5,000  | 100 | .2  | 0.1941 (0.0173) | 0.2182 (0.0527) | 0.3568 (0.0197)   | 0.2285 (0.0103)     |
| N(0.5·1_p, I_p) | 50,000 | 200 | .2  | 0.1108 (0.0093) | 0.1573 (0.0815) | 0.3251 (0.0078)   | 0.1525 (0.0045)     |
| N(0.5·1_p, I_p) | 50,000 | 100 | .05 | 0.0913 (0.0527) | 0.1390 (0.0050) | 0.0814 (0.0056)   | 0.0530 (0.0052)     |
| N(5·1_p, I_p)   | 50,000 | 100 | .2  | 2.7721 (0.1285) | 0.0534 (0.0041) | 0.3229 (0.0087)   | 0.1471 (0.0059)     |
| N(0.5·1_p, Σ)   | 50,000 | 100 | .2  | 0.1189 (0.0195) | 0.1148 (0.0234) | 0.3241 (0.0088)   | 0.1426 (0.0113)     |
| Cauchy(0.5·1_p) | 50,000 | 100 | .2  | 0.0738 (0.0053) | 0.0525 (0.0029) | 0.1045 (0.0071)   | 0.0633 (0.0042)     |

Table: comparison of various robust mean estimation methods. The smallest error in each row was highlighted in bold in the original.

  • Dimension Halving: [Lai et al.'16] https://github.com/kal2000/AgnosticMeanAndCovarianceCode
  • Iterative Filtering: [Diakonikolas et al.'17] https://github.com/hoonose/robust-filter

SLIDE 124

Robust Learning of Cauchy Distributions

Table 4: comparison of various methods of robust location estimation under Cauchy distributions. Samples are drawn from $(1-\epsilon)\,\mathrm{Cauchy}(0_p, I_p) + \epsilon\, Q$ with $\epsilon = 0.2$, $p = 50$, and various choices of $Q$. Sample size: 50,000. Discriminator net structure: 50-50-25-1. Generator $g_\omega(\xi)$ structure: 48-48-32-24-12-1 with absolute-value activation in the output layer.

| Contamination Q          | JS-GAN (G1)     | JS-GAN (G2)     | Dimension Halving | Iterative Filtering |
| Cauchy(1.5·1_p, I_p)     | 0.0664 (0.0065) | 0.0743 (0.0103) | 0.3529 (0.0543)   | 0.1244 (0.0114)     |
| Cauchy(5.0·1_p, I_p)     | 0.0480 (0.0058) | 0.0540 (0.0064) | 0.4855 (0.0616)   | 0.1687 (0.0310)     |
| Cauchy(1.5·1_p, 5·I_p)   | 0.0754 (0.0135) | 0.0742 (0.0111) | 0.3726 (0.0530)   | 0.1220 (0.0112)     |
| Normal(1.5·1_p, 5·I_p)   | 0.0702 (0.0064) | 0.0713 (0.0088) | 0.3915 (0.0232)   | 0.1048 (0.0288)     |

  • Dimension Halving: [Lai et al.'16] https://github.com/kal2000/AgnosticMeanAndCovarianceCode
  • Iterative Filtering: [Diakonikolas et al.'17] https://github.com/hoonose/robust-filter

SLIDE 125

  • The discriminator helps identify outliers or contaminated samples
  • The generator fits the uncontaminated portion of the true samples

The discriminator identifies outliers from $(1-\epsilon)\, N(0_p, I_p) + \epsilon\, Q$, with $Q = N(5\cdot 1_p, I_p)$.

SLIDE 126

Application: prices of 50 stocks from 2007/01 to 2018/12. Corporations are selected by ranking in market capitalization.

SLIDE 127

Log-return: $y_i = \log(\mathrm{price}_{i+1} / \mathrm{price}_i)$
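For instance, a minimal sketch of the log-return computation (the toy price series is an illustrative assumption):

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 101.5, 99.8, 102.3], name="price")
log_returns = np.log(prices.shift(-1) / prices).dropna()   # y_i = log(price_{i+1} / price_i)
print(log_returns)
```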

SLIDE 128

Fit the data by the elliptical GAN.
Apply SVD on the scatter matrix.
Dimension reduction onto R².

  • Outliers (x and o) are selected from the discriminator value distribution.
SLIDE 129

Discriminator value distributions from the (elliptical) generator and the real samples. Outliers are chosen as samples above/below a chosen percentile of the generator's distribution.

SLIDE 130

Loadings of PCA: the first two directions are dominated by a few corporations → not robust.

SLIDE 131

Loadings of the elliptical scatter: compared with PCA, it is more robust in the sense that it is not totally dominated by financial companies (JPM, GS).

SLIDE 132

Reference

  • Gao, Liu, Yao, Zhu. Robust Estimation and Generative Adversarial Networks. ICLR 2019. https://arxiv.org/abs/1810.02030
  • Gao, Yao, Zhu. Generative Adversarial Networks for Robust Scatter Estimation: A Proper Scoring Rule Perspective. https://arxiv.org/abs/1903.01944

SLIDE 133

Thank You