SLIDE 1

Improving GANs using Game Theory and Statistics

Constantinos Daskalakis

CSAIL and EECS, MIT

SLIDE 2

Min-Max Optimization

Solve: $\inf_\theta \sup_w f(\theta, w)$, where $\theta, w$ are high-dimensional

  • Applications: Mathematics, Optimization, Game Theory, …
    [von Neumann 1928, Dantzig '47, Brown '50, Robinson '51, Blackwell '56, …]
  • Best-Case Scenario: $f$ is convex in $\theta$, concave in $w$
  • Modern Applications: GANs, adversarial examples, …
    – these exacerbate the importance of first-order methods and of non convex-concave objectives

SLIDE 3

GAN Outputs

  • LSGAN. Mao et al. 2017.
  • BEGAN. Berthelot et al. 2017.
SLIDE 4

GAN uses

  • CycleGAN. Zhu et al. 2017.
  • Pix2pix. Isola et al. 2017. Many examples at https://phillipi.github.io/pix2pix/
  • Text -> Image Synthesis. Reed et al. 2017.

Many applications:

  • Domain adaptation
  • Super-resolution
  • Image Synthesis
  • Image Completion
  • Compressed Sensing
  • …
SLIDE 5

Min-Max Optimization

Solve: $\inf_\theta \sup_w f(\theta, w)$, where $\theta, w$ are high-dimensional

  • Applications: Mathematics, Optimization, Game Theory, …
    [von Neumann 1928, Dantzig '47, Brown '50, Robinson '51, Blackwell '56, …]
  • Best-Case Scenario: $f$ is convex in $\theta$, concave in $w$
  • Modern Applications: GANs, adversarial examples, …
    – these exacerbate the importance of first-order methods and of non convex-concave objectives
  • Personal Perspective: applications of min-max optimization will multiply going forward, as ML develops more complex and harder-to-interpret algorithms – sup players will be introduced to check the behavior of the inf players
SLIDE 6

Generative Adversarial Networks (GANs)
[Goodfellow et al. NeurIPS'14]

Simple randomness: $z \sim N(0, I)$
Generator: DNN w/ parameters $\theta$
Discriminator: DNN w/ parameters $w$
(The discriminator sees hallucinated images from the generator and real images from the training set, and outputs "real or hallucinated".)

$\inf_\theta \sup_w f(\theta, w)$, where $f(\theta, w)$ expresses how well the Discriminator distinguishes true vs generated images
e.g. Wasserstein-GANs: $f(\theta, w) = \mathbb{E}_{X \sim p_{\mathrm{real}}}[D_w(X)] - \mathbb{E}_{z \sim N(0, I)}[D_w(G_\theta(z))]$

  • $\theta, w$: high-dimensional
    ⇝ solve the game by having the min (resp. max) player run online gradient descent (resp. ascent)
  • major challenges:
    – training oscillations
    – generated & real distributions are high-dimensional ⇝ no rigorous statistical guarantees

SLIDE 7

Menu

  • Min-Max Optimization and Adversarial Training
  • Training Challenges:
    • reducing training oscillations
  • Statistical Challenges:
    • reducing sample requirements
    • attaining statistical guarantees
SLIDE 9

Training Oscillations: Gaussian Mixture

True Distribution: mixture of 8 Gaussians on a circle
Output Distribution of standard GAN, trained via gradient descent/ascent dynamics: cycling through modes at different steps of training

from [Metz et al. ICLR'17]

SLIDE 10

Training Oscillations: Handwritten Digits

True Distribution: MNIST
Output Distribution of standard GAN, trained via gradient descent/ascent dynamics: cycling through "proto-digits" at different steps of training

from [Metz et al. ICLR'17]

SLIDE 11

Training Oscillations: even for bilinear objectives!

  • True distribution: isotropic Normal, namely $X \sim \mathcal{N}\!\left((3, 4)^T, I_{2 \times 2}\right)$
  • Generator architecture: $G_\theta(z) = \theta + z$ (adds input $z$ to internal params)
  • Discriminator architecture: $D_w(\cdot) = \langle w, \cdot \rangle$ (linear projection)
  • W-GAN objective: $\min_\theta \max_w \; \mathbb{E}_X[D_w(X)] - \mathbb{E}_z[D_w(G_\theta(z))] = \min_\theta \max_w \; w^T \cdot \left((3, 4)^T - \theta\right)$
  • a convex-concave function; $z, \theta, w$: 2-dimensional

(figure: gradient descent dynamics, oscillating around the equilibrium)
from [Daskalakis, Ilyas, Syrgkanis, Zeng ICLR'18]
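The oscillation on this slide's toy example is easy to reproduce. Below is a minimal NumPy sketch (not from the talk; the step size and horizon are arbitrary choices) of simultaneous gradient descent-ascent on $f(\theta, w) = w^T((3,4)^T - \theta)$; the distance to the equilibrium $\theta^* = (3,4)$, $w^* = 0$ grows over time.

```python
import numpy as np

# Simultaneous gradient descent-ascent on f(theta, w) = w . ((3,4) - theta).
# Equilibrium: theta* = (3, 4), w* = (0, 0).
target = np.array([3.0, 4.0])
eta = 0.1

theta, w = np.zeros(2), np.ones(2)
for t in range(500):
    g_theta = -w              # gradient of f in theta
    g_w = target - theta      # gradient of f in w
    theta, w = theta - eta * g_theta, w + eta * g_w

print(np.linalg.norm(theta - target))  # grows with t: the iterates spiral outward
```

In the linearized dynamics, each step rotates and scales the deviation from equilibrium by $\sqrt{1 + \eta^2} > 1$, which is exactly the outward spiral in the figure.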

SLIDE 12

Training Oscillations:

persistence under many variants of Gradient Descent

from [Daskalakis, Ilyas, Syrgkanis, Zeng ICLR’18]

SLIDE 13

Training Oscillations: Online Learning Perspective

  • Best-Case Scenario: given convex-concave $f(x, y)$, solve: $\min_{x \in X} \max_{y \in Y} f(x, y)$
  • [von Neumann '28]: min-max = max-min; solvable via convex programming
  • Online Learning: if the min and max players run any no-regret learning procedure, they converge to minimax equilibrium
  • e.g. follow-the-regularized-leader (FTRL), follow-the-perturbed-leader, MWU
  • follow-the-regularized-leader with $\ell_2^2$-regularization ≡ gradient descent
  • "Convergence": the sequence $(x_t, y_t)_t$ converges to minimax equilibrium in the average sense, i.e. $f\!\left(\frac{1}{t}\sum_{\tau \le t} x_\tau, \; \frac{1}{t}\sum_{\tau \le t} y_\tau\right) \to \min_{x \in X} \max_{y \in Y} f(x, y)$
  • Can we show point-wise convergence of no-regret learning methods?
  • [Mertikopoulos-Papadimitriou-Piliouras SODA'18]: No, for any FTRL
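To make the average-sense guarantee concrete on the earlier bilinear toy, here is a sketch (assuming a decaying step size $\eta_t = 0.1/\sqrt{t}$, as no-regret bounds require; constants are arbitrary): the last iterate keeps circling, while the running average of $\theta_t$ drifts toward the equilibrium $(3, 4)$.

```python
import numpy as np

# Same bilinear toy; gradient descent/ascent with decaying step size,
# tracking the running average of the min player's iterates.
target = np.array([3.0, 4.0])
theta, w = np.zeros(2), np.ones(2)
theta_avg = np.zeros(2)
T = 20000
for t in range(1, T + 1):
    eta = 0.1 / np.sqrt(t)
    theta, w = theta + eta * w, w + eta * (target - theta)
    theta_avg += (theta - theta_avg) / t   # running average of iterates

print(np.linalg.norm(theta - target))      # last iterate: stays far from 0
print(np.linalg.norm(theta_avg - target))  # average: much smaller, shrinking in T
```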

SLIDE 14

Negative Momentum

  • Variant of gradient descent:
    $\forall t: \; x_{t+1} = x_t - \eta \cdot \nabla f(x_t) + \frac{\eta}{2} \cdot \nabla f(x_{t-1})$
  • Interpretation: undo, today, some of yesterday's gradient; i.e. negative momentum
  • Gradient Descent w/ negative momentum = Optimistic FTRL w/ $\ell_2^2$-regularization [Rakhlin-Sridharan COLT'13, Syrgkanis et al. NeurIPS'15] ≈ extra-gradient method [Korpelevich '76, Chiang et al. COLT'12, Gidel et al. '18, Mertikopoulos et al. '18]
  • Does it help in min-max optimization?

SLIDE 15

Negative Momentum: why it could help

  • E.g. $f(x, y) = (x - 0.5) \cdot (y - 0.5)$

(figure: trajectories from a start point toward the min-max equilibrium, under the two dynamics below)

Gradient descent-ascent (GDA):
$x_{t+1} = x_t - \eta \cdot \nabla_x f(x_t, y_t)$
$y_{t+1} = y_t + \eta \cdot \nabla_y f(x_t, y_t)$

GDA with negative momentum:
$x_{t+1} = x_t - \eta \cdot \nabla_x f(x_t, y_t) + \frac{\eta}{2} \cdot \nabla_x f(x_{t-1}, y_{t-1})$
$y_{t+1} = y_t + \eta \cdot \nabla_y f(x_t, y_t) - \frac{\eta}{2} \cdot \nabla_y f(x_{t-1}, y_{t-1})$

SLIDE 16

Negative Momentum: convergence

  • Optimistic gradient descent-ascent (OGDA) dynamics:
    $\forall t: \; x_{t+1} = x_t - \eta \cdot \nabla_x f(x_t, y_t) + \frac{\eta}{2} \cdot \nabla_x f(x_{t-1}, y_{t-1})$
    $y_{t+1} = y_t + \eta \cdot \nabla_y f(x_t, y_t) - \frac{\eta}{2} \cdot \nabla_y f(x_{t-1}, y_{t-1})$
  • [Daskalakis-Ilyas-Syrgkanis-Zeng ICLR'18]: OGDA exhibits last-iterate convergence for unconstrained bilinear games: $\min_{x \in \mathbb{R}^n} \max_{y \in \mathbb{R}^m} f(x, y) = x^T A y + b^T x + c^T y$
  • [Liang-Stokes '18]: … the convergence rate is geometric if $A$ is well-conditioned; extends to strongly convex-concave functions $f(x, y)$
  • E.g. in the previous isotropic Gaussian case: $X \sim \mathcal{N}\!\left((3, 4)^T, I_{2 \times 2}\right)$, $G_\theta(z) = \theta + z$, $D_w(\cdot) = \langle w, \cdot \rangle$

SLIDE 17

Negative Momentum: convergence

  • Optimistic gradient descent-ascent (OGDA) dynamics:
    $\forall t: \; x_{t+1} = x_t - \eta \cdot \nabla_x f(x_t, y_t) + \frac{\eta}{2} \cdot \nabla_x f(x_{t-1}, y_{t-1})$
    $y_{t+1} = y_t + \eta \cdot \nabla_y f(x_t, y_t) - \frac{\eta}{2} \cdot \nabla_y f(x_{t-1}, y_{t-1})$
  • [Daskalakis-Ilyas-Syrgkanis-Zeng ICLR'18]: OGDA exhibits last-iterate convergence for unconstrained bilinear games: $\min_{x \in \mathbb{R}^n} \max_{y \in \mathbb{R}^m} f(x, y) = x^T A y + b^T x + c^T y$
  • [Liang-Stokes '18]: … the convergence rate is geometric if $A$ is well-conditioned; extends to strongly convex-concave functions $f(x, y)$
  • [Daskalakis-Panageas ITCS'18]: Projected OGDA exhibits last-iterate convergence even for constrained bilinear games: $\min_{x \in \Delta^n} \max_{y \in \Delta^m} x^T A y$ (= all of linear programming)
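A minimal NumPy sketch (not from the talk; step size and horizon are arbitrary) of the OGDA update above on the same bilinear toy game $f(\theta, w) = w^T((3,4)^T - \theta)$. Unlike plain gradient descent-ascent, the last iterate now converges to the equilibrium.

```python
import numpy as np

# OGDA on f(theta, w) = w . ((3,4) - theta), with the eta and eta/2
# coefficients of the update rule above.
target = np.array([3.0, 4.0])
eta = 0.1

def grads(theta, w):
    return -w, target - theta   # (gradient in theta, gradient in w)

theta, w = np.zeros(2), np.ones(2)
prev_gt, prev_gw = grads(theta, w)
for t in range(20000):
    gt, gw = grads(theta, w)
    theta = theta - eta * gt + (eta / 2) * prev_gt  # undo half of yesterday's gradient
    w = w + eta * gw - (eta / 2) * prev_gw
    prev_gt, prev_gw = gt, gw

print(np.linalg.norm(theta - target), np.linalg.norm(w))  # both -> 0: last-iterate convergence
```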

SLIDE 18

Negative Momentum: in the Wild

  • Can try optimism for non convex-concave min-max objectives $f(x, y)$
  • Issue [Daskalakis, Panageas NeurIPS'18]: no hope that the stable points of OGDA or GDA are only the local min-max points
  • e.g. $f(x, y) = -\frac{1}{8} x^2 - \frac{1}{2} y^2 + \frac{6}{10} \, x y$
  • Nested-ness: Local Min-Max ⊆ Stable Points of GDA ⊆ Stable Points of OGDA

(figure: the gradient descent-ascent field around such a stable point)
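One can verify the claim for the example above by linearizing the GDA flow at the origin: $f$ is strictly concave in the min player's variable $x$ there ($\partial^2 f / \partial x^2 = -1/4 < 0$, so the origin is not a local min-max), yet the Jacobian of the flow has eigenvalues with negative real parts, so the origin is GDA-stable. A small NumPy check:

```python
import numpy as np

# f(x, y) = -x^2/8 - y^2/2 + (6/10) x y, as above.
# GDA flow linearized at the origin:
#   x' = -df/dx = x/4 - 0.6 y
#   y' = +df/dy = 0.6 x - y
A = np.array([[0.25, -0.6],
              [0.60, -1.0]])
print(np.linalg.eigvals(A))  # approx. -0.2 and -0.55: both negative, so GDA-stable
```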

SLIDE 19

Negative Momentum: in the Wild

  • Can try optimism for non convex-concave min-max objectives $f(x, y)$
  • Issue [Daskalakis, Panageas NeurIPS'18]: no hope that the stable points of OGDA or GDA are only the local min-max points
  • Local Min-Max ⊆ Stable Points of GDA ⊆ Stable Points of OGDA
  • also [Adolphs et al. '18]: the left inclusion
  • Question: identify a first-order method converging to local min-max w/ probability 1
  • While this is pending, evaluate optimism in practice…
  • [Daskalakis-Ilyas-Syrgkanis-Zeng ICLR'18]: propose Optimistic Adam
  • Adam, a variant of gradient descent proposed by [Kingma-Ba ICLR'15], has found wide adoption in deep learning, although it doesn't always converge [Reddi-Kale-Kumar ICLR'18]
  • Optimistic Adam is the right adaptation of Adam to "undo some of the past gradients"
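The talk does not spell the update out here, but a minimal sketch of the optimistic pattern applied to Adam-style preconditioned steps, in the spirit of [Daskalakis-Ilyas-Syrgkanis-Zeng ICLR'18], looks as follows; the exact constants and bias-correction details in the paper may differ.

```python
import numpy as np

def optimistic_adam_step(params, grad, state,
                         lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One step of an Optimistic-Adam-style update: take twice Adam's
    preconditioned step, then add back the previous preconditioned step
    ("undo some of the past gradients"). Sketch only, not the paper verbatim."""
    m, v, prev_step, t = state
    t += 1
    m = b1 * m + (1 - b1) * grad          # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias corrections, as in Adam
    v_hat = v / (1 - b2 ** t)
    step = lr * m_hat / (np.sqrt(v_hat) + eps)
    params = params - 2.0 * step + prev_step
    return params, (m, v, step, t)

# usage: state starts as (zeros, zeros, zeros, 0); one call per iteration
p = np.zeros(3)
state = (np.zeros(3), np.zeros(3), np.zeros(3), 0)
p, state = optimistic_adam_step(p, np.array([0.1, -0.2, 0.3]), state)
```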

SLIDE 20

Optimistic Adam on CIFAR10

  • Compare Adam vs. Optimistic Adam, trained on CIFAR10, in terms of Inception Score
  • No fine-tuning for Optimistic Adam; used the same hyper-parameters for both algorithms, as suggested in Gulrajani et al. (2017)

SLIDE 22

Menu

  • Min-Max Optimization and Adversarial Training
  • Training Challenges:
    • reducing training oscillations
  • Statistical Challenges:
    • reducing sample requirements
    • attaining statistical guarantees
SLIDE 24

Generative Adversarial Networks (GANs)

Simple randomness: $z \sim N(0, I)$
Generator: DNN w/ parameters $\theta$
Discriminator: DNN w/ parameters $w$
(The discriminator sees hallucinated images from the generator and real images from the training set, and outputs "real or hallucinated".)

$\inf_\theta \sup_w f(\theta, w)$, where $f(\theta, w)$ expresses how well the Discriminator distinguishes true from generated images
e.g. Wasserstein-GANs: $f(\theta, w) = \mathbb{E}_{X \sim p_{\mathrm{real}}}[D_w(X)] - \mathbb{E}_{z \sim N(0, I)}[D_w(G_\theta(z))]$

  • Inner sup (discrimination) problem: a statistical estimation problem
    – how close are $p_{\mathrm{real}}$ and $p_{\mathrm{generated}}$ in the distance defined by the test functions expressible in the architecture of the discriminator?
    – because training will fail to solve the min-max problem to optimality, this distance won't be truly minimized
  • major statistical challenges:
    – Certifying a trained GAN: how close are $p_{\mathrm{real}}$ and $p_{\mathrm{generated}}$ in some distance of interest?
    – Alleviating the computational & statistical burden of discrimination
    – Scaling up the dimensionality of generated distributions

SLIDE 25

GANs: Statistical Challenges

  • Certifying a trained GAN: how close are $p_{\mathrm{real}}$ and $p_{\mathrm{generated}}$ in some distance of interest?
  • Fundamental Challenge: the curse of dimensionality
  • claim (birthday paradox): given sample access to a distribution $P$ over $\{0,1\}^n$, and $Q = \mathrm{Unif}(\{0,1\}^n)$, estimating $\mathrm{Wasserstein}(P, Q)$ to within $\pm 1/4$ requires $\Omega(2^{n/2})$ samples
  • for $n$ in the 1000s (e.g. CIFAR) ⇝ infeasible, unless lower-dimensional structure in $p_{\mathrm{real}}$ and $p_{\mathrm{generated}}$ is exploited
  • Alleviating the computational & statistical burden of the discriminator:
    ⇝ infeasible, unless lower-dimensional structure in $p_{\mathrm{real}}$ and $p_{\mathrm{generated}}$ is exploited
  • Scaling up the dimensionality of the generated distribution (e.g. video generation):
    ⇝ infeasible, unless lower-dimensional structure in $p_{\mathrm{real}}$ is exploited

SLIDE 26

Lower-Dimensional Structure: Bayesian Networks

  • Probability distribution defined in terms of a DAG $G = (V, E)$
  • Node $v$ is associated w/ a random variable $X_v \in \Sigma$
  • The distribution factorizes in terms of parenthood relationships:
    $\Pr(x) = \prod_v \Pr_{X_v \mid X_{\Pi_v}}(x_v \mid x_{\Pi_v}), \quad \forall x \in \Sigma^V$
  • Example (5-node DAG on $X_1, \ldots, X_5$):
    $\Pr(x) = \Pr[x_1] \cdot \Pr[x_2] \cdot \Pr[x_3 \mid x_1, x_2] \cdot \Pr[x_4 \mid x_3] \cdot \Pr[x_5 \mid x_3, x_4]$

Is it easier to discriminate between Bayes nets whose structure is known?
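For concreteness, a small sketch of evaluating the factorized probability over the 5-node example DAG above, with a binary alphabet and arbitrary toy conditional probability tables (both illustrative assumptions, not from the talk):

```python
import itertools
import numpy as np

# Parent sets of the 5-node example DAG above.
parents = {1: (), 2: (), 3: (1, 2), 4: (3,), 5: (3, 4)}

rng = np.random.default_rng(0)
# cpt[v][parent_values] = P(X_v = 1 | parents take these values); toy numbers.
cpt = {v: {vals: rng.uniform(0.2, 0.8)
           for vals in itertools.product((0, 1), repeat=len(pa))}
       for v, pa in parents.items()}

def bayes_net_prob(x, parents, cpt):
    """Pr(x) = prod_v Pr(x_v | x_{Pi_v}) for an assignment x: node -> {0, 1}."""
    p = 1.0
    for v, pa in parents.items():
        p1 = cpt[v][tuple(x[u] for u in pa)]
        p *= p1 if x[v] == 1 else 1.0 - p1
    return p

print(bayes_net_prob({1: 1, 2: 0, 3: 1, 4: 0, 5: 1}, parents, cpt))
```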

SLIDE 27

BayesNet Discrimination

Setup: Bayes net $P$ on DAG $G$ with $n$ nodes and in-degree $d$; Bayes net $Q$ on DAG $G$ with $n$ nodes and in-degree $d$. How far apart are they?
Goal: given samples from $P$ and $Q$, and $\varepsilon$, distinguish: $P = Q$ vs $\mathrm{dist}(P, Q) > \varepsilon$

[Daskalakis-Pan COLT'17]: If dist is Total Variation distance, there exist computationally efficient testers using $\tilde{O}\!\left(|\Sigma|^{0.75(d+1)} \, n / \varepsilon^2\right)$ samples. Moreover, the dependence of both bounds on $n$, $\varepsilon$ is tight up to an $O(\log n)$ factor, and the exponential dependence on $d$ is necessary and essentially tight.

[Canonne et al. COLT'17]: Identify conditions under which the dependence on $n$ can be made $\sqrt{n}$, when one of the two Bayes nets is known.

Effective dimensionality is governed by $n$ and the in-degree $d$: such a Bayes net has roughly $n \cdot |\Sigma|^{d+1}$ parameters, rather than the $|\Sigma|^n$ of an unrestricted distribution.

SLIDE 28

BayesNet Discrimination in TV

  • Goal: distinguish $P = Q$ vs $d_{\mathrm{TV}}(P, Q) > \varepsilon$
  • Idea: distance localization
  • prove a statement of the form: "If Bayes nets $P$ and $Q$ are far in TV, there exists a small witness set $S$ of variables such that $P_S$ and $Q_S$, the marginals of $P$ and $Q$ on the variables in $S$, are also somewhat far apart"
  • this reduces the original problem to identity testing on small-size sets, whose distributions can be sampled
  • Question: which distances are localizable?
  • $\mathrm{KL}(P \,\|\, Q) \le \sum_v \mathrm{KL}(P_{v \cup \Pi_v} \,\|\, Q_{v \cup \Pi_v})$  (chain rule of KL)
  • $d_{\mathrm{TV}}(P, Q) \le \sum_v d_{\mathrm{TV}}(P_{v \cup \Pi_v}, Q_{v \cup \Pi_v}) + \sum_v d_{\mathrm{TV}}(P_{\Pi_v}, Q_{\Pi_v})$  (hybrid argument)
  • $H^2(P, Q) \le \sum_v H^2(P_{v \cup \Pi_v}, Q_{v \cup \Pi_v})$  (squared Hellinger)
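A sketch of how localization gets used algorithmically (hypothetical helper, binary alphabet, not from the talk): estimate the distance between the empirical marginals of $P$ and $Q$ on each small set $\{v\} \cup \Pi_v$. By subadditivity, $d_{\mathrm{TV}}(P, Q) > \varepsilon$ forces some local term to be noticeably large, so testing the small marginals suffices.

```python
import numpy as np
from itertools import product

def local_tv_estimates(samples_p, samples_q, parents):
    """Empirical TV distance between P's and Q's marginals on each {v} u Pi_v.
    samples_*: (num_samples, n) arrays of 0/1 values, column v-1 holds node v."""
    estimates = {}
    for v, pa in parents.items():
        cols = [v - 1] + [u - 1 for u in pa]
        configs = list(product((0, 1), repeat=len(cols)))
        freqs = []
        for s in (samples_p, samples_q):
            sub = s[:, cols]
            freqs.append({c: np.mean(np.all(sub == c, axis=1)) for c in configs})
        # empirical TV distance between the two local marginals
        estimates[v] = 0.5 * sum(abs(freqs[0][c] - freqs[1][c]) for c in configs)
    return estimates
```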
SLIDE 29

Wasserstein Subadditivity

Bayes net $P$ on DAG $G$ vs. Bayes net $Q$ on DAG $G$: how far apart?

Q: Does Wasserstein satisfy subadditivity, i.e. $\mathrm{Wass}(P, Q) \le \sum_v \mathrm{Wass}(P_{v \cup \Pi_v}, Q_{v \cup \Pi_v})$?

A: Not always; there exists a pair of Markov chains $X \to Y \to Z$ and $X' \to Y' \to Z'$ such that
$\dfrac{\mathrm{Wass}((X, Y), (X', Y')) + \mathrm{Wass}((Y, Z), (Y', Z'))}{\mathrm{Wass}((X, Y, Z), (X', Y', Z'))}$
can be made arbitrarily small.

[Preliminary Result]: The Wasserstein distance between two Markov chains $X_1, \ldots, X_T$ and $Y_1, \ldots, Y_T$ satisfies subadditivity if the conditional densities $f_X(x_t \mid x_{t-1})$ and $f_Y(y_t \mid y_{t-1})$ are Lipschitz w.r.t. $x_{t-1}$ and $y_{t-1}$ respectively, for all $t$. (Extends to Bayesian networks.)

SLIDE 30

Video Generation

Discriminate generated video against the target distribution over videos.

(figure: a chain of generators, each taking randomness and producing frames 1-4; a single discriminator comparing $X_{\mathrm{model}}$ against $X_{\mathrm{data}}$ on whole videos, which is too high-dimensional)

SLIDE 31

Video Generation

One can exploit subadditivity and discriminate only pairs of consecutive frames of the generated distribution against pairs of consecutive frames of the target distribution.

(figure: the same generator chain over frames 1-4, now with discriminators Disc1, Disc2, Disc3, one per pair of consecutive frames)

N.B. the resulting multi-player zero-sum game falls in the realm of [Daskalakis-Papadimitriou ICALP'09], [Even-Dar et al. STOC'09], [Cai-Daskalakis SODA'11], [Cai et al. MATHOR'15]; efficient dynamics are known
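A minimal PyTorch sketch of this factorization (sizes and architecture are illustrative assumptions, not from the talk): one small discriminator per pair of consecutive frames, with the pairwise scores summed.

```python
import torch
import torch.nn as nn

FRAMES, FRAME_DIM = 4, 784  # e.g. four flattened 28x28 MNIST frames

# One small discriminator per consecutive-frame pair.
pair_discs = nn.ModuleList(
    nn.Sequential(nn.Linear(2 * FRAME_DIM, 128), nn.ReLU(), nn.Linear(128, 1))
    for _ in range(FRAMES - 1)
)

def factorized_score(video):  # video: (batch, FRAMES, FRAME_DIM)
    # Disc_i only ever sees (frame_i, frame_{i+1}), never the whole clip.
    pairs = (torch.cat([video[:, i], video[:, i + 1]], dim=1)
             for i in range(FRAMES - 1))
    return sum(d(p) for d, p in zip(pair_discs, pairs))  # (batch, 1)

print(factorized_score(torch.randn(8, FRAMES, FRAME_DIM)).shape)  # [8, 1]
```

Each pairwise discriminator works on a $2 \times 784$-dimensional input rather than the full clip, which is the statistical point of the factorization.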

SLIDE 32

Video Generation: experiment [Ilyas '18]

  • Created random 4-frame videos of MNIST digits
  • in every training video, the digits are weakly increasing in time
  • Trained two video GANs:
    • a GAN w/ an un-factorized discriminator
    • a GAN w/ a factorized discriminator
  • the GANs must learn both how to hallucinate handwritten digits and that they need to put them in increasing order
  • Compare the factorized vs. un-factorized models in terms of accuracy
SLIDE 33

Conclusions

  • Min-Max Optimization has found numerous applications in Optimization, Game Theory, and Adversarial Training
  • Applications to Generative Adversarial Networks pose serious challenges, of both an optimization (oscillations) and a statistical (curse of dimensionality) nature
  • We propose gradient descent with negative momentum as an approach to ease training oscillations
  • We prove Wasserstein subadditivity for Bayes nets and propose modeling dependencies in the data as an approach to ease the curse of dimensionality
  • Lots of interesting theoretical and practical challenges going forward

SLIDE 34

Thanks!

SLIDE 35

The First Auction by Christie’s