GANs, Optimal Transport, and Implicit Distribution Estimation (PowerPoint Presentation)



SLIDE 1

Intro. Adversarial Framework GANs Optimization Optimal Transport

GANs, Optimal Transport, and Implicit Distribution Estimation

Tengyuan Liang

Econometrics and Statistics

1 / 40



SLIDE 4

OUTLINE

Implicit Distribution Estimation. Given i.i.d. Y_1, ..., Y_n ∼ ν, use a transformation T : R^d → R^d to represent and learn the unknown distribution Y ∼ ν via a simple Z ∼ µ (say uniform or Gaussian):

T(Z) ≈ Y (close in distribution?), equivalently T_#µ ≈ ν.
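As a concrete illustration of the pushforward idea (added here, not from the slides), the classical Box-Muller map is an explicit T that pushes µ = Unif([0,1]²) forward to ν = N(0, I_2):

```python
import math
import random

def box_muller(z1: float, z2: float) -> tuple[float, float]:
    """An explicit transformation T : (0,1]^2 -> R^2 pushing the uniform
    distribution forward to the standard Gaussian N(0, I_2)."""
    r = math.sqrt(-2.0 * math.log(z1))
    return r * math.cos(2 * math.pi * z2), r * math.sin(2 * math.pi * z2)

random.seed(0)
# Sample Z ~ mu = Unif([0,1]^2); 1 - random() keeps z1 in (0,1] for the log.
xs = [box_muller(1 - random.random(), random.random()) for _ in range(100_000)]
first = [x for x, _ in xs]
mean = sum(first) / len(first)
var = sum((x - mean) ** 2 for x in first) / len(first)
print(round(mean, 2), round(var, 2))  # approximately 0.0 and 1.0
```

Here T is known in closed form; the point of implicit distribution estimation is to learn such a T from samples of ν.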

SLIDE 5

OUTLINE

Implicit Distribution Estimation

Generative Adversarial Networks
  • statistical rates
  • pair regularization
  • optimization

Optimal Transport
  • estimate the Wasserstein metric, vs.
  • estimate under the Wasserstein metric

SLIDE 6

GENERATIVE ADVERSARIAL NETWORKS

  • GAN: Goodfellow et al. (2014)
  • WGAN: Arjovsky et al. (2017); Arjovsky and Bottou (2017)
  • MMD GAN: Li, Swersky, and Zemel (2015); Dziugaite, Roy, and Ghahramani (2015); Arbel, Sutherland, Bińkowski, and Gretton (2018)
  • f-GAN: Nowozin, Cseke, and Tomioka (2016)
  • Sobolev GAN: Mroueh et al. (2017)
  • many others: Liu, Bousquet, and Chaudhuri (2017); Tolstikhin, Gelly, Bousquet, Simon-Gabriel, and Schölkopf (2017)


SLIDE 8

GENERATIVE ADVERSARIAL NETWORKS

Generator g_θ, discriminator f_ω:

U(θ, ω) = E_{Y∼ν}[f_ω(Y)] − E_{Z∼µ}[f_ω(g_θ(Z))]   (Y: target, Z: input)

min_θ max_ω U(θ, ω)

GANs are widely used in practice, however...
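To make the objective concrete, here is a minimal numerical sketch (my own toy choices, not from the talk): a location generator g_θ(z) = z + θ and a linear discriminator f_ω(x) = ω·x, for which the empirical objective collapses to ω times a difference of sample means.

```python
import random

def U_hat(theta: float, omega: float, ys: list[float], zs: list[float]) -> float:
    """Empirical GAN objective U(theta, omega) = E_n[f_w(Y)] - E_m[f_w(g_t(Z))]
    for the toy choices g_theta(z) = z + theta and f_omega(x) = omega * x."""
    e_target = sum(omega * y for y in ys) / len(ys)
    e_fake = sum(omega * (z + theta) for z in zs) / len(zs)
    return e_target - e_fake

random.seed(1)
ys = [random.gauss(2.0, 1.0) for _ in range(5000)]   # target Y ~ nu = N(2, 1)
zs = [random.gauss(0.0, 1.0) for _ in range(5000)]   # input  Z ~ mu = N(0, 1)

# With a linear discriminator, U(theta, omega) = omega * (mean(Y) - theta - mean(Z)).
# The inner max over |omega| <= 1 is |mean(Y) - theta - mean(Z)|, so the outer
# min is attained when the generator matches the first moment:
theta_star = sum(ys) / len(ys) - sum(zs) / len(zs)
print(round(theta_star, 2))  # close to 2.0
```

Richer discriminator classes force the generator to match more than the mean, which is exactly the role of the discriminator metric introduced below.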

SLIDE 9

MUCH NEEDS TO BE UNDERSTOOD, IN THEORY

  • Approximation: what distributions can be approximated by the generator (g_θ)_#µ?
  • Statistical: given n samples, what is the statistical/generalization error rate?
  • Computational: local convergence for practical optimization; how to stabilize?
  • Landscape: are local saddle points good globally?



SLIDE 12

FORMULATION

T_G: class of generator transformations; F_D: class of discriminator functions; ν: target distribution.

Population:

g* ∈ argmin_{g∈T_G} max_{f∈F_D} { E_{X∼g_#µ}[f(X)] − E_{Y∼ν}[f(Y)] }

Empirical (with ν̂_n the empirical distribution):

ĝ ∈ argmin_{g∈T_G} max_{f∈F_D} { E_{X∼g_#µ}[f(X)] − E_{Y∼ν̂_n}[f(Y)] }

ĝ_#µ serves as the estimate for ν.

  • Density learning/estimation has a long history in nonparametric statistics: model the target density ρ_ν ∈ W^α, a Sobolev space with smoothness α ≥ 0. Stone (1982); Nemirovski (2000); Tsybakov (2009); Wasserman (2006)
  • GAN statistical theory is needed: Arora and Zhang (2017); Arora et al. (2017a,b); Liu et al. (2017)


SLIDE 14

DISCRIMINATOR METRIC

Define the critic metric (integral probability metric, IPM):

d_F(µ, ν) := sup_{f∈F} | E_{X∼µ} f(X) − E_{Y∼ν} f(Y) |.

  • F Lipschitz-1: Wasserstein metric d_W
  • F bounded by 1: total variation/Radon metric d_TV
  • RKHS H, F = {f ∈ H : ‖f‖_H ≤ 1}: MMD GAN
  • F of Sobolev smoothness β: Sobolev GAN

Statistical question: what is the statistical error rate E d_F(ν, ν̂_n) with n i.i.d. samples, for a range of F and of ν with certain regularity?
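For the RKHS unit ball the IPM has a closed form: d_F(µ, ν)² = E k(X, X′) + E k(Y, Y′) − 2 E k(X, Y), the squared maximum mean discrepancy. A small sketch of the biased plug-in estimator with a Gaussian kernel (toy data and bandwidth are my own choices):

```python
import math
import random

def gaussian_kernel(x: float, y: float, sigma: float = 1.0) -> float:
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

def mmd2_biased(xs: list[float], ys: list[float]) -> float:
    """Biased (V-statistic) estimate of MMD^2 = E k(X,X') + E k(Y,Y') - 2 E k(X,Y),
    the squared IPM over the unit ball of the Gaussian-kernel RKHS."""
    kxx = sum(gaussian_kernel(a, b) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(gaussian_kernel(a, b) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(gaussian_kernel(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(200)]
ys = [random.gauss(1.0, 1.0) for _ in range(200)]  # shifted target
print(mmd2_biased(xs, xs))  # 0: d_F(mu, mu) = 0
print(mmd2_biased(xs, ys))  # strictly positive for distinct samples
```

For the Lipschitz or Sobolev balls the inner supremum has no such closed form, which is one reason those IPMs are approximated by a parametrized discriminator in practice.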


SLIDE 16

SUMMARY OF FIRST HALF OF TALK

Adversarial framework (nonparametric), evaluation metric d_F:
  • Sobolev GAN: minimax optimal; generator class G = Sobolev W^α, discriminator class F = Sobolev W^β
  • MMD GAN: upper bound; G = smooth subspace of an RKHS, F = RKHS H
  • oracle results: G = any (G†), F = Sobolev W^β

Generative Adversarial Networks (parametric):
  • leaky-ReLU GANs, metric d_TV: upper bound; G and F leaky-ReLU networks (F‡, m*)
  • any GANs, metrics d_TV, d_KL, d_H: oracle results; G and F neural networks (G†, F‡, m*)
  • Lipschitz GANs, metric d_W: oracle results; G and F Lipschitz neural networks (G†, F‡, m*)

The symbols (G†) and (F‡) denote mis-specification of the generator class and the discriminator class respectively, and (m*) indicates dependence on the number of generator samples.

SLIDE 17

Implicit distribution estimators (GANs, optimal transport) vs. explicit density estimators (KDE, projection/series estimators, ...)

SLIDE 18

Adversarial Framework (nonparametric)


SLIDE 20

MINIMAX OPTIMAL RATES: SOBOLEV GAN

Consider the target class G := {ν : ρ_ν ∈ W^α} (Sobolev space with smoothness α) and the evaluation metric F = W^β with smoothness β.

Theorem (L. '17 & L. '18, Sobolev). The minimax optimal rate is

inf_{ν̃_n} sup_{ν∈G} E d_F(ν, ν̃_n) ≍ n^{−(α+β)/(2α+d)} ∨ n^{−1/2},

where ν̃_n is any estimator based on n samples in dimension d.

Liang (2017); Singh et al. (2018); Weed and Berthet (2019)



SLIDE 23

MINIMAX OPTIMAL RATES: MMD GAN

Consider a reproducing kernel Hilbert space (RKHS) H:
  • integral operator T with eigenvalue decay t_i ≍ i^{−κ}, 0 < κ < ∞
  • evaluation metric F = {f ∈ H : ‖f‖_H ≤ 1}
  • target density ρ_ν in G = {ν : ‖T^{−(α−1)/2} ρ_ν‖_H ≤ 1}, with smoothness α

Theorem (L. '18, RKHS). The minimax optimal rate is

inf_{ν̃_n} sup_{ν∈G} E d_F(ν, ν̃_n) ≾ n^{−(α+1)κ/(2ακ+2)} ∨ n^{−1/2}.

κ > 1 (finite intrinsic dimension, ∑_{i≥1} t_i = ∑_{i≥1} i^{−κ} ≤ C): parametric rate, n^{−(α+1)κ/(2ακ+2)} ∨ n^{−1/2} = n^{−1/2}.
κ < 1: the sample complexity to reach error ε scales as n ≍ ε^{−2 − (2/(α+1))(1/κ − 1)}, with effective dimension 1/κ.



SLIDE 26

ORACLE INEQUALITY FOR GANS

The generator class may not contain the target ν: take an oracle approach. Let T_G be any class of generator transformations, let the discriminator metric be F_D = W^β, and let the target density satisfy ρ_ν ∈ W^α. The pushforwards ĝ_#µ and g̃_#µ are implicit density estimators.

Corollary (L. '17). With the empirical ν̂_n as plug-in, the GAN estimator

ĝ ∈ argmin_{g∈T_G} max_{f∈F_D} { E_{X∼g_#µ}[f(X)] − E_{Y∼ν̂_n}[f(Y)] }

attains a sub-optimal rate:

E d_{F_D}(ĝ_#µ, ν) ≤ min_{g∈T_G} d_{F_D}(g_#µ, ν) + n^{−β/d} ∨ (log n/√n).

Canas and Rosasco (2012): β = 1.

Corollary (L. '17). In contrast, with a regularized empirical ν̃_n as plug-in,

g̃ ∈ argmin_{g∈T_G} max_{f∈F_D} { E_{X∼g_#µ}[f(X)] − E_{Y∼ν̃_n}[f(Y)] },

a faster rate is attainable:

E d_{F_D}(g̃_#µ, ν) ≤ min_{g∈T_G} d_{F_D}(g_#µ, ν) + n^{−(α+β)/(2α+d)} ∨ (1/√n).

SLIDE 27

SUB-OPTIMALITY AND REGULARIZATION

Regularization helps achieve the faster rate: use a "smoothed" empirical estimate ν̃_n, which serves as regularization. For example, kernel smoothing,

ν̃_n(x) = (1/(n h_n^d)) ∑_{i=1}^n K((x − x_i)/h_n),

and SGD works.

It turns out this is used in practice, called "instance noise" or "data augmentation".

Sønderby et al. (2016); Liang et al. (2017); Arjovsky and Bottou (2017); Mescheder et al. (2018)
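A minimal sketch of the smoothed plug-in (my own toy implementation): sampling from the Gaussian-kernel-smoothed ν̃_n is exactly drawing a data point at random and adding h·N(0,1) noise, which is what "instance noise" does in practice.

```python
import random

def sample_smoothed(data: list[float], h: float) -> float:
    """Draw one sample from the kernel-smoothed empirical measure: pick a data
    point uniformly, then add Gaussian noise of bandwidth h. Equivalent to
    sampling from the Gaussian kernel density estimate."""
    return random.choice(data) + h * random.gauss(0.0, 1.0)

random.seed(0)
data = [random.gauss(2.0, 1.0) for _ in range(500)]  # observed y_1..y_n
smoothed = [sample_smoothed(data, h=0.3) for _ in range(20_000)]

mean_s = sum(smoothed) / len(smoothed)
var_s = sum((x - mean_s) ** 2 for x in smoothed) / len(smoothed)
# The smoothed measure keeps the data mean and inflates the variance by h^2.
print(round(mean_s, 2), round(var_s, 2))
```

Feeding these perturbed samples to the discriminator, rather than the raw y_i, is the regularized plug-in ν̃_n of the corollary above.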

SLIDE 28

Generative Adversarial Networks and Pair Regularization (parametric)

SLIDE 29

Consider the parametrized GAN estimator

θ̂_{m,n} ∈ argmin_{θ: g_θ∈G} max_{ω: f_ω∈F} { Ê_m f_ω(g_θ(Z)) − Ê_n f_ω(Y) },

with m generator samples and n target samples. How well do GANs learn the distribution, under an objective evaluation metric, say d_TV((g_{θ̂_{m,n}})_#µ, ν)?



SLIDE 32

GENERALIZED ORACLE INEQUALITY

  • approximation errors:

A_1(F, G, ν) := sup_θ inf_ω ‖ log(ρ_ν/ρ_{µ_θ}) − f_ω ‖,   A_2(G, ν) := inf_θ ‖ log(ρ_{µ_θ}/ρ_ν) ‖^{1/2}

  • stochastic error:

S_{n,m}(F, G) := √(Pdim(F) log(m∧n)/(m∧n)) ∨ √(Pdim(F∘G) log(m)/m),

where Pdim(·) is the pseudo-dimension of the neural network class.

Theorem (L. '18, generalized oracle inequality).

E d_TV²(ν, (g_{θ̂_{m,n}})_#µ),  E d_W²(ν, (g_{θ̂_{m,n}})_#µ),  E d_KL(ν ‖ (g_{θ̂_{m,n}})_#µ) + E d_KL((g_{θ̂_{m,n}})_#µ ‖ ν)
    ≤ A_1(F, G, ν) + A_2(G, ν) + S_{n,m}(F, G).

We emphasize the interplay between (G, F) as a pair of tuning parameters for regularization.

SLIDE 33

A similar result holds for the Hellinger metric d_H, for non-absolutely-continuous (g_θ)_#µ and ν.

  • approximation errors:

A_1(F, G, ν) := sup_θ inf_ω ‖ (√ρ_ν − √ρ_{µ_θ})/(√ρ_ν + √ρ_{µ_θ}) − f_ω ‖,   A_2(G, ν) := inf_θ ‖ (√ρ_ν − √ρ_{µ_θ})/(√ρ_ν + √ρ_{µ_θ}) ‖

Theorem (L. '18, generalized oracle inequality).

E d_TV²(ν, (g_{θ̂_{m,n}})_#µ),  E d_H²(ν, (g_{θ̂_{m,n}})_#µ)  ≤  A_1(F, G, ν) + A_2(G, ν) + S_{n,m}(F, G).

SLIDE 34

PAIR REGULARIZATION

Fix G; as F increases: A_1(F, G, ν) decreases, A_2(G, ν) stays constant, S_{n,m}(F, G) increases.
Fix F; as G increases: A_1(F, G, ν) increases, A_2(G, ν) decreases, S_{n,m}(F, G) increases.

[Diagram: generator class vs. discriminator class trade-off, and which error term dominates.]

SLIDE 35

Applications of pair regularization

SLIDE 36

APPLICATION I: PARAMETRIC RATES FOR LEAKY RELU NETWORKS

Let the generator G and discriminator F both be leaky ReLU networks with depth L (width properly chosen, depending on the dimension), and suppose the target density is realizable by the generator:

log ρ_{(g_θ)_#µ}(x) = c_1 ∑_{l=1}^{L−1} ∑_{i=1}^{d} 1{m_{li}(x) ≥ 0} + c_0.

[Figure: generator and discriminator network architectures.]

Bai et al. (2018)

SLIDE 37

APPLICATION I: PARAMETRIC RATES FOR LEAKY RELU NETWORKS

Theorem (L. '18, leaky ReLU). When the generator G and discriminator F are both leaky ReLU networks with depth L (width properly chosen, depending on the dimension),

E d_TV²(ν, (g_{θ̂_{m,n}})_#µ) ≾ √(d²L² log(dL)) · (log m/m ∨ log n/n).

The result holds for very deep networks, with depth L = o(√n / log n).

SLIDE 38

APPLICATION II: LEARNING MULTIVARIATE GAUSSIAN

Corollary (L. '18, Gaussian). Consider ν ∼ N(µ, Σ). With proper choices of the architecture and activation, GANs enjoy near-optimal sample complexity (with respect to the dimension d):

E d_TV²(ν, (g_{θ̂_{m,n}})_#µ) ≾ √(d² log d / (n ∧ m)).

SLIDE 39

PAIR REGULARIZATION: WHY GANS MIGHT BE BETTER

[Diagram: generator class vs. discriminator class, with regimes dominated in turn by classic parametric models, nonparametric density estimation, and data memorization / empirical deviation.]

SLIDE 40

Optimization (local convergence)


SLIDE 42

FORMULATION

Generator g_θ, discriminator f_ω:

U(θ, ω) = E_{Y∼ν}[h_1 ∘ f_ω(Y)] − E_{Z∼µ}[h_2 ∘ f_ω(g_θ(Z))]   (Y: target, Z: input)

min_θ max_ω U(θ, ω)

  • global optimization for general U(θ, ω) is hard: Singh et al. (2000); Pfau and Vinyals (2016); Salimans et al. (2016)

A local saddle point (θ*, ω*) is one with no incentive to deviate locally:

U(θ*, ω) ≤ U(θ*, ω*) ≤ U(θ, ω*),  for (θ, ω) in an open neighborhood of (θ*, ω*).

  • also called a local Nash equilibrium (NE)
  • modest goal: properly initialized, the algorithm converges to a local NE


SLIDE 44

INTERACTION MATTERS: ∂²U(θ, ω)/∂θ∂ω

Stable equilibrium: geometrically fast local convergence. However, the "interaction term" matters and slows down the convergence ⇐ curse.
Unstable equilibrium? It turns out the "interaction term" matters again: utilizing it renders geometrically fast convergence ⇐ blessing.
Motivation for: optimistic mirror descent, extra-gradients, negative momentum, ...

SLIDE 45

"However, no guarantees are known beyond the convex-concave setting and, more importantly for the paper, even in convex-concave games, no guarantees are known for the last-iterate pair."

— Daskalakis, Ilyas, Syrgkanis, and Zeng (2017)


SLIDE 47

GEOMETRICALLY FAST CONVERGENCE TO UNSTABLE EQUILIBRIUM

OMD, proposed in Daskalakis et al. (2017); see also Rakhlin and Sridharan (2013):

θ_{t+1} = θ_t − 2η ∇_θ U(θ_t, ω_t) + η ∇_θ U(θ_{t−1}, ω_{t−1})
ω_{t+1} = ω_t + 2η ∇_ω U(θ_t, ω_t) − η ∇_ω U(θ_{t−1}, ω_{t−1})

For the bilinear game U(θ, ω) = θᵀCω, to obtain an ε-close solution:

shown in Daskalakis et al. (2017):  T ≿ ε^{−4} log(1/ε) · Poly(λ_max(CCᵀ)/λ_min(CCᵀ))

Theorem (L. & Stokes, '18).  We proved:  T ≿ log(1/ε) · λ_max(CCᵀ)/λ_min(CCᵀ),

further generalized beyond the bilinear game in Mokhtari et al. (2019).
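A sketch of why optimism helps on the scalar bilinear game U(θ, ω) = θω (C = 1; the step size is my own choice): simultaneous gradient descent-ascent spirals outward from the equilibrium at the origin, while the OMD iterates converge to it.

```python
def gda(theta, omega, eta, steps):
    """Simultaneous gradient descent-ascent on U(theta, omega) = theta * omega."""
    for _ in range(steps):
        theta, omega = theta - eta * omega, omega + eta * theta
    return theta, omega

def omd(theta, omega, eta, steps):
    """Optimistic mirror descent: a gradient step plus a correction using the
    previous gradient, following the update on this slide."""
    th_prev, om_prev = theta, omega
    for _ in range(steps):
        th_new = theta - 2 * eta * omega + eta * om_prev
        om_new = omega + 2 * eta * theta - eta * th_prev
        th_prev, om_prev = theta, omega
        theta, omega = th_new, om_new
    return theta, omega

norm = lambda p: (p[0] ** 2 + p[1] ** 2) ** 0.5
print(norm(gda(1.0, 0.0, 0.1, 1000)))   # diverges: norm grows geometrically
print(norm(omd(1.0, 0.0, 0.1, 2000)))   # converges geometrically to (0, 0)
```

The GDA iterate norm multiplies by √(1 + η²) every step, while the OMD recurrence has both eigenvalues strictly inside the unit circle for small η, which is the geometric rate in the theorem.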

SLIDE 48

GEOMETRICALLY FAST CONVERGENCE TO UNSTABLE EQUILIBRIUM

[Figure: radius vs. gradient step (1000 to 6000), comparing the predictive method with OMD.]

SLIDE 49

  • Statistical :-) given n samples, what is the statistical/generalization error rate?
  • Approximation :-( what distributions can be approximated by the generator g_θ(Z)?
  • Computational :-O local convergence for practical optimization; how to stabilize?
  • Landscape :-( are local saddle points good globally?

Other approach? Theory of optimal transport ⇒ GANs?


SLIDE 51

OPTIMAL TRANSPORT

Wasserstein-p metric:

W_p(µ, ν) := ( inf_{π∈Π(µ,ν)} ∫_{X×Y} ‖x − y‖^p dπ )^{1/p},   Π(µ, ν) the set of all couplings.

Theorem (Brenier, '87, p = 2). Let X = Y = R^d, and let µ, ν be absolutely continuous w.r.t. the Lebesgue measure. There exists a unique convex ψ_opt : R^d → R such that

(1/2) W_2²(µ, ν) = inf_{π∈Π(µ,ν)} ∫ (1/2)‖x − y‖² dπ = ∫ (‖x‖²/2 − ψ_opt(x)) µ(dx) + ∫ (‖y‖²/2 − ψ*_opt(y)) ν(dy).

Here ψ*(y) = sup_x { ⟨y, x⟩ − ψ(x) } is the Legendre-Fenchel conjugate of ψ.

Peyré et al. (2019)

SLIDE 52

OPTIMAL TRANSPORT

Approximation :-) On [0, 1]^d with Z ∼ Unif([0, 1]^d), the map (∇ψ)(Z) for a convex ψ can represent the distribution ν!

Theorem (Brenier, '87, p = 2). Let X = Y = R^d, and let µ, ν be absolutely continuous w.r.t. the Lebesgue measure. There exists a unique convex ψ_opt : R^d → R such that

(1/2) W_2²(µ, ν) = inf_{π∈Π(µ,ν)} ∫ (1/2)‖x − y‖² dπ = ∫ (‖x‖²/2 − ψ_opt(x)) µ(dx) + ∫ (‖y‖²/2 − ψ*_opt(y)) ν(dy)
                 = ∫ (1/2)‖x − (∇ψ_opt)(x)‖² µ(dx),   with ν = (∇ψ_opt)_#µ.

Peyré et al. (2019)
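In one dimension the Brenier map is explicit (a standard fact, added here as an illustration): the monotone, hence gradient-of-convex, map from µ = Unif([0,1]) to ν is the quantile function F_ν^{−1}. For a Gaussian target, (∇ψ)(Z) = Φ^{−1}(Z):

```python
import random
from statistics import NormalDist

# 1-d Brenier map from Unif([0,1]) to N(0,1): the monotone (gradient-of-convex)
# map is the Gaussian quantile function z -> Phi^{-1}(z).
brenier_map = NormalDist(0.0, 1.0).inv_cdf

random.seed(0)
us = [random.random() for _ in range(100_000)]
# inv_cdf requires the open interval (0, 1); filter guards the boundary.
pushed = [brenier_map(u) for u in us if 0.0 < u < 1.0]

m = sum(pushed) / len(pushed)
v = sum((x - m) ** 2 for x in pushed) / len(pushed)
print(round(m, 2), round(v, 2))  # approximately 0.0 and 1.0
```

In higher dimensions no such closed form exists in general, which is where the dual formulations on the next slides come in.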

SLIDE 53

OPTIMAL TRANSPORT

Computation :-) linear program, or smooth convex program; simple landscape.

Recall the input measure µ is given and the target is the empirical measure ν̂_n:

(1/2) W_2²(µ, ν̂_n) = sup_φ { ∫ φ^c(x) µ(dx) + ∫ φ(y) ν̂_n(dy) },   where φ^c(x) := inf_y { (1/2)‖x − y‖² − φ(y) }.

Genevay, Cuturi, Peyré, and Bach (2016)

SLIDE 54

OPTIMAL TRANSPORT

Computation :-) linear program, or smooth convex program; simple landscape.

Add ε-entropic regularization:

(1/2) W_{2,ε}²(µ, ν̂_n) = sup_φ { ∫ φ^c_ε(x) µ(dx) + ∫ φ(y) ν̂_n(dy) },

where φ^c_ε(x) := −ε log [ ∫ exp( −((1/2)‖x − y‖² − φ(y))/ε ) ν̂_n(dy) ].

On data y_1, ..., y_n, the optimization reduces to SGD on [φ(y_1), ..., φ(y_n)] ∈ R^n.

Genevay, Cuturi, Peyré, and Bach (2016)
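A compact sketch of this semi-dual ascent (toy sizes; full-batch gradient instead of SGD; ε, the step size, and the data are my own choices). The potential vector [φ(y_1), ..., φ(y_n)] is ascended on the concave objective ∫ φ^c_ε dµ + (1/n) Σ_j φ(y_j), with µ approximated by a fixed sample:

```python
import math
import random

def phi_c_eps(x, ys, phi, eps):
    """Smoothed c-transform: -eps * log((1/n) * sum_j exp((phi_j - 0.5|x-y_j|^2)/eps))."""
    a = [(phi[j] - 0.5 * (x - ys[j]) ** 2) / eps for j in range(len(ys))]
    amax = max(a)  # stabilized log-sum-exp
    return -eps * (amax + math.log(sum(math.exp(t - amax) for t in a) / len(ys)))

def semi_dual(xs, ys, phi, eps):
    """Entropic semi-dual objective, with mu approximated by the fixed sample xs."""
    return (sum(phi_c_eps(x, ys, phi, eps) for x in xs) / len(xs)
            + sum(phi) / len(ys))

def ascend(xs, ys, phi, eps, eta, steps):
    """Full-batch gradient ascent on the concave semi-dual in [phi_1..phi_n]."""
    n = len(ys)
    for _ in range(steps):
        grad = [1.0 / n] * n
        for x in xs:
            a = [(phi[j] - 0.5 * (x - ys[j]) ** 2) / eps for j in range(n)]
            amax = max(a)
            w = [math.exp(t - amax) for t in a]
            s = sum(w)
            for j in range(n):
                grad[j] -= w[j] / s / len(xs)  # minus softmax weight, averaged over mu
        phi = [p + eta * g for p, g in zip(phi, grad)]
    return phi

random.seed(0)
xs = [random.random() for _ in range(50)]          # mu ~ Unif([0,1]), fixed sample
ys = [random.gauss(0.5, 0.2) for _ in range(20)]   # data y_1..y_n from nu
phi0 = [0.0] * len(ys)
f0 = semi_dual(xs, ys, phi0, eps=0.5)
phi = ascend(xs, ys, phi0, eps=0.5, eta=0.1, steps=100)
f1 = semi_dual(xs, ys, phi, eps=0.5)
print(f1 > f0)  # ascent on a concave objective increases the value
```

Replacing the average over the fixed xs with a fresh draw X ∼ µ at each step gives exactly the SGD scheme on R^n described on this slide.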

SLIDE 55

Varying ε and solving W_{2,ε}²(µ, ν̂_n) induces the transportation map

(Id − ∇φ^c)(x) = ∑_{i=1}^n y_i exp(−((1/2)‖x − y_i‖² − φ(y_i))/ε) / ∑_{i=1}^n exp(−((1/2)‖x − y_i‖² − φ(y_i))/ε).

On data y_1, ..., y_n, the optimization reduces to SGD on [φ(y_1), ..., φ(y_n)] ∈ R^n.
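The induced map is a softmax-weighted average of the data points (a sketch with my own numerical stabilization). As ε → 0 with φ fixed, the weights collapse onto the best-scoring y_i, which illustrates the data-memorization issue discussed on the next slides:

```python
import math

def transport_map(x: float, ys: list[float], phi: list[float], eps: float) -> float:
    """Barycentric map (Id - grad phi^c)(x): average of the y_i weighted by
    softmax(-(0.5|x - y_i|^2 - phi_i)/eps), computed stably via max-subtraction."""
    scores = [-(0.5 * (x - y) ** 2 - p) / eps for y, p in zip(ys, phi)]
    smax = max(scores)
    w = [math.exp(s - smax) for s in scores]
    return sum(y * wi for y, wi in zip(ys, w)) / sum(w)

ys = [0.0, 1.0, 2.0]
phi = [0.0, 0.0, 0.0]
print(transport_map(0.1, ys, phi, eps=1.0))    # a blended average of the y_i
print(transport_map(0.1, ys, phi, eps=0.01))   # nearly 0.0: snaps to the nearest y_i
```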


SLIDE 57

OPTIMAL TRANSPORT AND PAIR REGULARIZATION

Recall the input measure µ is given and the target is the empirical measure ν̂_n:

(1/2) W_2²(µ, ν̂_n) = sup_φ { ∫ φ^c(x) µ(dx) + ∫ φ(y) ν̂_n(dy) },   where φ^c(x) := inf_y { (1/2)‖x − y‖² − φ(y) }.

Analogy to GANs:
  • φ : R^d → R as the discriminator function
  • Id − ∇φ^c : R^d → R^d as the generator transformation

However, (Id − ∇φ^c)_#µ = ν̂_n: data memorization, and

W_2((Id − ∇φ^c)_#µ, ν) = W_2(ν̂_n, ν) ≍ n^{−1/d}.

SLIDE 58

PAIR REGULARIZATION, AGAIN

Analogy to GANs: φ : R^d → R as the discriminator function; Id − ∇φ^c : R^d → R^d as the generator transformation.

Solution: pair regularization, F* = {φ regular}, G* = {Id − ∇φ^c regular}, for a better statistical rate.

[Diagram: generator class vs. discriminator class trade-off.]

SLIDE 59

Estimating Transportation Cost

SLIDE 60

ANOTHER APPLICATION OF PAIR REGULARIZATION

Regularity in OT, Caffarelli (1992, 1991): µ, ν ∈ C^α Hölder.

Statistical question: estimate the "transportation cost" W_2²(µ, ν) based on n i.i.d. samples y_1, ..., y_n ∼ ν. Suppose µ ∼ Unif([0, 1]^d) is known.

Lemma (L. & Sadhanala, '19).

sup_{ν∈C^α} E | W̃_n − W_2²(µ, ν) | ≾ n^{−(2α+2)/(2α+d)} + n^{−1/2}.

SLIDE 61

Elbow phenomenon: for α ≥ d/2 − 2, one gets the parametric rate.

SLIDE 62

Pair regularization: φ ∈ C^{α+2} and Id − ∇φ^c ∈ C^{α+1}, by Caffarelli (1992, 1991).

SLIDE 63

Estimating the cost is typically an easier problem than estimating the measure under W_2, or than estimating the transportation map T under the metric E_{X∼µ} ‖T̂(X) − T(X)‖².

Hütter and Rigollet (2019)


SLIDE 65

BACK TO THE ADVERSARIAL FRAMEWORK

Two related problems.

Estimating under the metric/loss. Theorem (L. '17):

inf_{ν̃_n} sup_{ν∈G} E d_F²(ν, ν̃_n) ≍ n^{−(2α+2β)/(2α+d)} ∨ n^{−1},   G = W^α, F = W^β.

No elbow phenomenon in α. Liang (2017); Singh et al. (2018); Weed and Berthet (2019)

Estimating the metric/loss itself. Theorem (L. & Sadhanala, '19):

inf_{W̃_n} sup_{ν∈G} E | W̃_n − d_F²(µ, ν) |² ≍ n^{−(8α+8β)/(4α+d)} ∨ n^{−1},   G = W^α, F = W^β.

Elbow phenomenon at α = d/4 − 2β; typically an easier problem.


SLIDE 67

HOWEVER, FOR THE WASSERSTEIN METRIC

Theorem (L. '19). Consider d ≥ 2 and the domain Ω = [0, 1]^d. Given n i.i.d. samples y_1, ..., y_n from ν,

(log log n / log n) · n^{−(α+1)/(2α+d)} ≾ inf_{W̃_n} sup_{ν∈C^α} E | W̃_n − W_1(µ, ν) | ≾ n^{−(α+1)/(2α+d)},

while, as we know,

inf_{ν̃_n} sup_{ν∈C^α} E W_1(ν̃_n, ν) ≍ n^{−(α+1)/(2α+d)}.

Estimating the Wasserstein-1 metric itself is almost as hard as estimating under the Wasserstein-1 metric.

SLIDE 68

HOWEVER, FOR THE WASSERSTEIN METRIC

  • the main technicality is in deriving the lower bound: wavelets
  • construct two composite/fuzzy hypotheses using delicate priors with matching log n moments, such that the Wasserstein metrics differ sufficiently
  • calculate the total variation metric directly on the posterior of the data (sum-product form), via a telescoping trick


SLIDE 70

SUMMARY

  • In this talk, we study statistical rates for d(T̂_#µ, ν) and d̂(µ, ν), with ν = T*_#µ: Implicit Distribution Estimation, motivated by GANs and OT.
  • Conceptually: learning the distribution via a transformation/transportation, vs. estimating the transformation/transportation difficulty.
  • Closely related problems in the lens of Optimal Transport: d(T̂_#µ, ν) (harder) induces a plug-in estimate of d̂(µ, ν) (easier), which sometimes induces a transportation map in turn.
  • The idea of pair regularization: what GANs have over classical nonparametrics.

Many interesting open problems remain, both statistical and computational, with new insights on regularization and adaptivity.

SLIDE 71

References

Thank you!

Liang, T. (2018). On How Well Generative Adversarial Networks Learn Densities: Nonparametric and Parametric Results. arXiv:1811.03179, under review.
Liang, T. & Stokes, J. (2018). Interaction Matters: A Note on Non-asymptotic Local Convergence of Generative Adversarial Networks. arXiv:1802.06132, AISTATS 2019.
Liang, T. (2019). On the Minimax Optimality of Estimating the Wasserstein Metric. arXiv:1908.10324.
Liang, T. & Sadhanala, V. (2019). Working paper.

Michael Arbel, Dougal J. Sutherland, Mikołaj Bińkowski, and Arthur Gretton. On gradient regularizers for MMD GANs. arXiv:1805.11565, 2018.
Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. arXiv:1701.04862, 2017.
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv:1701.07875, 2017.
Sanjeev Arora and Yi Zhang. Do GANs actually learn the distribution? An empirical study. arXiv:1706.08224, 2017.
Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). arXiv:1703.00573, 2017a.
Sanjeev Arora, Andrej Risteski, and Yi Zhang. Theoretical limitations of encoder-decoder GAN architectures. arXiv:1711.02651, 2017b.
Yu Bai, Tengyu Ma, and Andrej Risteski. Approximability of discriminators implies diversity in GANs. arXiv:1806.10586, 2018.
Luis A. Caffarelli. Some regularity properties of solutions of Monge-Ampère equation. Communications on Pure and Applied Mathematics, 44(8-9):965-969, 1991.
Luis A. Caffarelli. The regularity of mappings with a convex potential. Journal of the American Mathematical Society, 5(1):99-104, 1992.
Guillermo Canas and Lorenzo Rosasco. Learning probability measures with respect to optimal transport metrics. In Advances in Neural Information Processing Systems, pages 2492-2500, 2012.
Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. arXiv preprint