Improving GANs using Game Theory and Statistics
Constantinos Daskalakis, CSAIL and EECS, MIT
Min-Max Optimization
- Solve: $\inf_\theta \sup_w f(\theta, w)$, where $\theta, w$ are high-dimensional
- Applications: Mathematics, Optimization, Game Theory, … [von Neumann 1928, Dantzig '47, Brown '50, Robinson '51, Blackwell '56, …]
- Best-Case Scenario: $f$ is convex in $\theta$, concave in $w$
- Modern Applications: GANs, adversarial examples, … ⇒ these heighten the importance of first-order methods and of non convex-concave objectives
GAN Outputs
- LSGAN. Mao et al. 2017.
- BEGAN. Berthelot et al. 2017.

GAN Uses
- CycleGAN. Zhu et al. 2017.
- Pix2pix. Isola et al. 2017. Many examples at https://phillipi.github.io/pix2pix/
- Text → Image Synthesis. Reed et al. 2017.
Many applications:
- Domain adaptation
- Super-resolution
- Image Synthesis
- Image Completion
- Compressed Sensing
- …
Min-Max Optimization (cont'd)
- Personal Perspective: applications of min-max optimization will multiply going forward, as ML develops more complex and harder-to-interpret algorithms; sup players will be introduced to check the behavior of the inf players
Generative Adversarial Networks (GANs)
[Goodfellow et al. NeurIPS'14]
- Simple randomness $z \sim N(0, I)$ feeds the Generator, a DNN with parameters $\theta$, which outputs hallucinated images
- The Discriminator, a DNN with parameters $w$, sees hallucinated images (from the generator) and real images (from the training set) and guesses: real or hallucinated
- Solve $\inf_\theta \sup_w f(\theta, w)$, where $f(\theta, w)$ expresses how well the Discriminator distinguishes true vs generated images; e.g. Wasserstein-GANs: $f(\theta, w) = \mathbb{E}_{x \sim p_{\mathrm{real}}}[D_w(x)] - \mathbb{E}_{z \sim N(0, I)}[D_w(G_\theta(z))]$
- $\theta, w$: high-dimensional ⇒ solve the game by having the min (resp. max) player run online gradient descent (resp. ascent), as in the sketch below
- Major challenges:
  - training oscillations
  - generated & real distributions are high-dimensional ⇒ no rigorous statistical guarantees
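To make the descent/ascent loop concrete, here is a minimal PyTorch sketch of one simultaneous gradient descent-ascent step on the W-GAN objective, with hypothetical toy architectures (not the talk's code); the Lipschitz constraint on the discriminator (weight clipping / gradient penalty) and all other practical details are omitted.

```python
# Minimal sketch of one simultaneous descent/ascent step on
#   f(theta, w) = E[D_w(x)] - E[D_w(G_theta(z))].
import torch
import torch.nn as nn

d_noise, d_data = 8, 16
G = nn.Sequential(nn.Linear(d_noise, 32), nn.ReLU(), nn.Linear(32, d_data))
D = nn.Sequential(nn.Linear(d_data, 32), nn.ReLU(), nn.Linear(32, 1))
opt_G = torch.optim.SGD(G.parameters(), lr=1e-3)  # min player: descent on theta
opt_D = torch.optim.SGD(D.parameters(), lr=1e-3)  # max player: ascent on w

real = torch.randn(64, d_data)  # stand-in for a minibatch of real images
z = torch.randn(64, d_noise)    # simple randomness z ~ N(0, I)

f = D(real).mean() - D(G(z)).mean()  # the W-GAN objective f(theta, w)

opt_G.zero_grad()
opt_D.zero_grad()
f.backward()
for p in D.parameters():
    p.grad.neg_()  # flip D's gradients so its SGD step *ascends* f
opt_D.step()
opt_G.step()       # G's SGD step descends f
```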
Menu
- Min-Max Optimization and Adversarial Training
- Training Challenges:
- reducing training oscillations
- Statistical Challenges:
- reducing sample requirements
- attaining statistical guarantees
Training Oscillations: Gaussian Mixture
True Distribution: mixture of 8 Gaussians on a circle. Output distribution of a standard GAN, trained via gradient descent/ascent dynamics: it cycles through the modes at different steps of training.
from [Metz et al. ICLR'17]
Training Oscillations: Handwritten Digits
True Distribution: MNIST. Output distribution of a standard GAN, trained via gradient descent/ascent dynamics: it cycles through "proto-digits" at different steps of training.
from [Metz et al. ICLR'17]
Training Oscillations: even for bilinear objectives!
- True distribution: isotropic Normal, namely $x \sim \mathcal{N}\big((3,4), I_{2\times 2}\big)$
- Generator architecture: $G_\theta(z) = \theta + z$ (adds input $z$ to internal parameters)
- Discriminator architecture: $D_v(x) = \langle v, x \rangle$ (linear projection)
- W-GAN objective: $\min_\theta \max_v\ \mathbb{E}_x[D_v(x)] - \mathbb{E}_z[D_v(G_\theta(z))] = \min_\theta \max_v\ v^{\mathsf T}\big((3,4)^{\mathsf T} - \theta\big)$, from [Daskalakis, Ilyas, Syrgkanis, Zeng ICLR'18]
- a convex-concave function, with $\theta, v$ just 2-dimensional, yet the gradient descent dynamics oscillate (see the sketch below)

Training Oscillations: persistence under many variants of Gradient Descent
from [Daskalakis, Ilyas, Syrgkanis, Zeng ICLR'18]
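A minimal numpy sketch of the bilinear example above, using the reduced objective $f(\theta, v) = v^{\mathsf T}((3,4)^{\mathsf T} - \theta)$: simultaneous gradient descent-ascent spirals away from the equilibrium $\theta^* = (3,4)$, $v^* = 0$ rather than converging.

```python
# GDA on f(theta, v) = v^T ((3,4) - theta): the iterates rotate around the
# equilibrium while their distance from it grows.
import numpy as np

c = np.array([3.0, 4.0])
theta, v = np.zeros(2), np.ones(2)
eta = 0.1
for t in range(1, 501):
    g_theta, g_v = -v, c - theta   # grad_theta f, grad_v f
    theta = theta - eta * g_theta  # min player descends
    v = v + eta * g_v              # max player ascends
    if t % 100 == 0:
        # distance from equilibrium grows by sqrt(1 + eta^2) per step
        print(t, np.hypot(np.linalg.norm(theta - c), np.linalg.norm(v)))
```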
Training Oscillations: Online Learning Perspective
- Best-Case Scenario: given a convex-concave $f(y, z)$, solve: $\min_{y \in Y} \max_{z \in Z} f(y, z)$
- [von Neumann '28]: min-max = max-min; solvable via convex programming
- Online Learning: if the min and max players run any no-regret learning procedure, they converge to a minimax equilibrium
  - e.g. follow-the-regularized-leader (FTRL), follow-the-perturbed-leader, multiplicative weights (MWU)
  - follow-the-regularized-leader with $\ell_2^2$-regularization $\equiv$ gradient descent (see the derivation below)
- "Convergence": the sequence $(y_t, z_t)_t$ converges to a minimax equilibrium in the average sense, i.e. $f\big(\tfrac{1}{T}\sum_{t \le T} y_t,\ \tfrac{1}{T}\sum_{t \le T} z_t\big) \to \min_{y \in Y} \max_{z \in Z} f(y, z)$
- Can we show point-wise convergence of no-regret learning methods?
  - [Mertikopoulos-Papadimitriou-Piliouras SODA'18]: No, for any FTRL
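The FTRL $\equiv$ gradient-descent claim can be checked in one line in the unconstrained case (a standard derivation, not specific to the talk), assuming linearized losses $\langle g_s, \cdot \rangle$:

$$x_{t+1} \;=\; \arg\min_{x}\Big\{\sum_{s\le t}\langle g_s, x\rangle + \tfrac{1}{2\eta}\|x\|_2^2\Big\} \;=\; -\eta\sum_{s\le t} g_s \;=\; x_t - \eta\, g_t,$$

since $x_t = -\eta\sum_{s\le t-1} g_s$; each FTRL iterate is exactly a gradient step from the previous one.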
Negative Momentum
- Variant of gradient descent: $\forall t:\ y_{t+1} = y_t - \eta \cdot \nabla f(y_t) + \tfrac{\eta}{2} \cdot \nabla f(y_{t-1})$
- Interpretation: undo, today, some of yesterday's gradient; i.e. negative momentum
- Gradient descent with negative momentum $\equiv$ optimistic FTRL with $\ell_2^2$-regularization [Rakhlin-Sridharan COLT'13, Syrgkanis et al. NeurIPS'15] $\approx$ extra-gradient method [Korpelevich '76, Chiang et al. COLT'12, Gidel et al. '18, Mertikopoulos et al. '18]
- Does it help in min-max optimization?
Negative Momentum: why it could help
- E.g. $f(y, z) = (y - 0.5) \cdot (z - 0.5)$
- GDA: $y_{t+1} = y_t - \eta \cdot \nabla_y f(y_t, z_t)$, $z_{t+1} = z_t + \eta \cdot \nabla_z f(y_t, z_t)$
- GDA with negative momentum: $y_{t+1} = y_t - \eta \cdot \nabla_y f(y_t, z_t) + \tfrac{\eta}{2} \cdot \nabla_y f(y_{t-1}, z_{t-1})$, $z_{t+1} = z_t + \eta \cdot \nabla_z f(y_t, z_t) - \tfrac{\eta}{2} \cdot \nabla_z f(y_{t-1}, z_{t-1})$
- (figure: trajectories of both dynamics, marking the start point and the min-max equilibrium)
Negative Momentum: convergence
- Optimistic gradient descent-ascent (OGDA) dynamics, $\forall t$:
  $y_{t+1} = y_t - \eta \cdot \nabla_y f(y_t, z_t) + \tfrac{\eta}{2} \cdot \nabla_y f(y_{t-1}, z_{t-1})$
  $z_{t+1} = z_t + \eta \cdot \nabla_z f(y_t, z_t) - \tfrac{\eta}{2} \cdot \nabla_z f(y_{t-1}, z_{t-1})$
- [Daskalakis-Ilyas-Syrgkanis-Zeng ICLR'18]: OGDA exhibits last-iterate convergence for unconstrained bilinear games: $\min_{y \in \mathbb{R}^n} \max_{z \in \mathbb{R}^n} f(y, z) = y^{\mathsf T} B z + b^{\mathsf T} y + c^{\mathsf T} z$
- [Liang-Stokes '18]: …the convergence rate is geometric if $B$ is well-conditioned; extends to strongly convex-concave functions $f(y, z)$
- E.g. in the previous isotropic Gaussian case: $x \sim \mathcal{N}\big((3,4), I_{2\times 2}\big)$, $G_\theta(z) = \theta + z$, $D_w(x) = \langle w, x \rangle$ (see the sketch below)
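A minimal numpy sketch of OGDA on the same bilinear game, written in the $2\eta$-current / $\eta$-previous form used in [Daskalakis-Ilyas-Syrgkanis-Zeng ICLR'18] (the same family as the $\eta, \eta/2$ form above, up to rescaling $\eta$): unlike plain GDA, the last iterate now converges.

```python
# OGDA on f(theta, v) = v^T ((3,4) - theta): last-iterate convergence to
# the equilibrium theta* = (3,4), v* = 0.
import numpy as np

c = np.array([3.0, 4.0])
theta, v = np.zeros(2), np.ones(2)
pg_theta, pg_v = np.zeros(2), np.zeros(2)  # previous step's gradients
eta = 0.1
for t in range(2000):
    g_theta, g_v = -v, c - theta
    theta = theta - 2 * eta * g_theta + eta * pg_theta  # descend, undo yesterday
    v = v + 2 * eta * g_v - eta * pg_v                  # ascend, undo yesterday
    pg_theta, pg_v = g_theta, g_v

print(np.linalg.norm(theta - c), np.linalg.norm(v))     # both near 0
```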
Negative Momentum: convergence (cont'd)
- [Daskalakis-Panageas ITCS'18]: projected OGDA exhibits last-iterate convergence even for constrained bilinear games: $\min_{y \in \Delta^n} \max_{z \in \Delta^m} y^{\mathsf T} B z$ (equivalently, all of linear programming)
Negative Momentum: in the Wild
- Can try optimism for non convex-concave min-max objectives $f(y, z)$
- Issue [Daskalakis, Panageas NeurIPS'18]: no hope that the stable points of OGDA or GDA are only local min-max points
  - e.g. $f(y, z) = -\tfrac{1}{8} y^2 - \tfrac{1}{2} z^2 + \tfrac{6}{10}\, y z$
- Nested-ness: Local Min-Max $\subseteq$ Stable Points of GDA $\subseteq$ Stable Points of OGDA
- (figure: the Gradient Descent-Ascent field for this example)
Negative Momentum: in the Wild (cont'd)
- Local Min-Max $\subseteq$ Stable Points of GDA $\subseteq$ Stable Points of OGDA; also [Adolphs et al. '18] for the left inclusion
- Question: identify a first-order method converging to local min-max points with probability 1
- While this is pending, evaluate optimism in practice…
- [Daskalakis-Ilyas-Syrgkanis-Zeng ICLR'18]: propose optimistic Adam (sketched below)
  - Adam, a variant of gradient descent proposed by [Kingma-Ba ICLR'15], has found wide adoption in deep learning, although it doesn't always converge [Reddi-Kale-Kumar ICLR'18]
  - Optimistic Adam is the right adaptation of Adam to "undo some of the past gradients"
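A minimal numpy sketch of the optimistic-Adam update rule (a sketch of the idea, not the authors' released code): take Adam's adaptive step with double weight and undo the previous adaptive step.

```python
# Optimistic variant of the Adam update: 2x today's adaptive step, minus
# yesterday's adaptive step.
import numpy as np

def optimistic_adam_step(theta, grad, state, lr=1e-3,
                         beta1=0.9, beta2=0.999, eps=1e-8):
    m, v, prev_step, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias corrections
    v_hat = v / (1 - beta2 ** t)
    step = lr * m_hat / (np.sqrt(v_hat) + eps)
    theta = theta - 2 * step + prev_step      # optimistic correction
    return theta, (m, v, step, t)

# initial state for a parameter vector theta0:
# state = (np.zeros_like(theta0), np.zeros_like(theta0),
#          np.zeros_like(theta0), 0)
```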
Optimistic Adam on CIFAR10
- Compare Adam vs. Optimistic Adam, trained on CIFAR10, in terms of Inception Score
- No fine-tuning for Optimistic Adam: the same hyper-parameters were used for both algorithms, as suggested in Gulrajani et al. (2017)
Menu
- Min-Max Optimization and Adversarial Training
- Training Challenges:
- reducing training oscillations
- Statistical Challenges:
- reducing sample requirements
- attaining statistical guarantees
Generative Adversarial Networks (GANs), revisited
- Simple randomness $z \sim N(0, I)$ feeds a Generator (DNN with parameters $\theta$); a Discriminator (DNN with parameters $w$) sees hallucinated images (from the generator) and real images (from the training set) and guesses: real or hallucinated
- $\inf_\theta \sup_w f(\theta, w)$, where $f(\theta, w)$ expresses how well the Discriminator distinguishes true from generated images; e.g. Wasserstein-GANs: $f(\theta, w) = \mathbb{E}_{x \sim p_{\mathrm{real}}}[D_w(x)] - \mathbb{E}_{z \sim N(0, I)}[D_w(G_\theta(z))]$
- The inner sup (discrimination) problem is a statistical estimation problem:
  - how close are $p_{\mathrm{real}}$ and $p_{\mathrm{generated}}$ in the distance defined by the test functions expressible in the architecture of the discriminator?
  - because training will fail to solve the min-max problem to optimality, this distance won't be truly minimized
- Major statistical challenges:
  - certifying a trained GAN: how close are $p_{\mathrm{real}}$ and $p_{\mathrm{generated}}$ in some distance of interest?
  - alleviating the computational & statistical burden of discrimination
  - scaling up the dimensionality of generated distributions
GANs: Statistical Challenges
- Certifying a trained GAN: how close are $p_{\mathrm{real}}$ and $p_{\mathrm{generated}}$ in some distance of interest?
  - fundamental challenge: the curse of dimensionality
  - claim (birthday paradox): given sample access to a distribution $P$ over $\{0,1\}^n$, and $Q = \mathrm{Unif}(\{0,1\}^n)$, estimating $\mathrm{Wasserstein}(P, Q)$ to within $\pm 1/4$ requires $\Omega(2^{n/2})$ samples
  - for $n$ in the 1000s (e.g. CIFAR) ⇒ infeasible, unless lower-dimensional structure in $p_{\mathrm{real}}$ and $p_{\mathrm{generated}}$ is exploited
- Alleviating the computational & statistical burden of discrimination:
  ⇒ infeasible, unless lower-dimensional structure in $p_{\mathrm{real}}$ and $p_{\mathrm{generated}}$ is exploited
- Scaling up the dimensionality of the generated distribution (e.g. video generation):
  ⇒ infeasible, unless lower-dimensional structure in $p_{\mathrm{real}}$ is exploited
Lower-Dimensional Structure: Bayesian Networks
- Probability distribution defined in terms of a DAG $G = (V, E)$
- Node $v$ is associated with a random variable $X_v \in \Sigma$
- The distribution factorizes according to the parenthood relationships (see the sketch after this list):
  $\Pr(x) = \prod_v \Pr_{X_v \mid X_{\Pi_v}}\big(x_v \mid x_{\Pi_v}\big), \quad \forall x \in \Sigma^V$
- E.g., for a 5-node DAG on $X_1, \dots, X_5$:
  $\Pr(x) = \Pr(x_1) \cdot \Pr(x_2) \cdot \Pr(x_3 \mid x_1, x_2) \cdot \Pr(x_4 \mid x_3) \cdot \Pr(x_5 \mid x_3, x_4)$
- Is it easier to discriminate between Bayes nets whose structure is known?
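A minimal Python sketch of the factorization above; `parents` and `cpds` are hypothetical inputs, with `parents[node]` listing the node's parents and `cpds[node](value, parent_values)` returning the conditional probability.

```python
# The probability of a full assignment is the product of each node's
# conditional probability given its parents.
def bayesnet_prob(assignment, parents, cpds):
    p = 1.0
    for node, value in assignment.items():
        parent_values = tuple(assignment[u] for u in parents[node])
        p *= cpds[node](value, parent_values)
    return p

# e.g. the 5-node DAG above:
# parents = {1: (), 2: (), 3: (1, 2), 4: (3,), 5: (3, 4)}
```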
BayesNet Discrimination
- Setup: Bayes nets $P$ and $Q$, each on a DAG $G$ with $n$ nodes and in-degree $d$
- Goal: given samples from $P$ and $Q$, distinguish $P = Q$ vs $\mathrm{dist}(P, Q) > \epsilon$
- [Daskalakis-Pan COLT'17]: if dist is the Total Variation distance, there exist computationally efficient testers using $\tilde{O}\big(|\Sigma|^{0.75(d+1)} \cdot n / \epsilon^2\big)$ samples; moreover, the dependence on $n, \epsilon$ is tight up to an $O(\log n)$ factor, and the exponential dependence on $d$ is necessary and essentially tight
- [Canonne et al. COLT'17]: identify conditions under which the dependence on $n$ can be made $\sqrt{n}$ when one of the two Bayes nets is known
- the effective dimensionality is thus governed by $n$ and the in-degree $d$, not by the full domain $\Sigma^n$
BayesNet Discrimination in TV
- Goal: distinguish $P = Q$ vs $d_{\mathrm{TV}}(P, Q) > \epsilon$
- Idea: distance localization (see the sketch below)
  - prove a statement of the form: "if Bayes nets $P$ and $Q$ are far in TV, there exists a small witness set $S$ of variables such that $P_S$ and $Q_S$, the marginals of $P$ and $Q$ on the variables in $S$, are also somewhat far apart"
  - this reduces the original problem to identity testing on small sets of variables, whose distributions can be sampled
- Question: which distances are localizable?
  - $\mathrm{KL}(P \| Q) \le \sum_v \mathrm{KL}\big(P_{v \cup \Pi_v} \,\|\, Q_{v \cup \Pi_v}\big)$ (chain rule of KL)
  - $d_{\mathrm{TV}}(P, Q) \le \sum_v d_{\mathrm{TV}}\big(P_{v \cup \Pi_v}, Q_{v \cup \Pi_v}\big) + \sum_v d_{\mathrm{TV}}\big(P_{\Pi_v}, Q_{\Pi_v}\big)$ (hybrid argument)
  - $H^2(P, Q) \le \sum_v H^2\big(P_{v \cup \Pi_v}, Q_{v \cup \Pi_v}\big)$ (squared Hellinger)
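A minimal sketch of the localization reduction, assuming a hypothetical `marginal_distance` subroutine that estimates the chosen (localizable) distance between the marginals of $P$ and $Q$ on a small variable set: if the joints are $\epsilon$-far, subadditivity guarantees that some marginal on $\{v\} \cup \Pi_v$ is at least $\epsilon/n$-far.

```python
# Test each small marginal over {node} U parents instead of the full
# n-dimensional joints.
def localized_test(samples_p, samples_q, parents, marginal_distance, eps):
    n = len(parents)
    for node, pa in parents.items():
        scope = (node, *pa)  # the small witness candidate {v} U Pi_v
        # subadditivity: if the joints are eps-far, some local marginal is
        # at least eps/n-far; test at eps/(2n) to tolerate estimation error
        if marginal_distance(samples_p, samples_q, scope) > eps / (2 * n):
            return "far"
    return "close"
```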
Wasserstein Subadditivity
- Q: Does Wasserstein satisfy subadditivity, i.e. $\mathrm{Wass}(P, Q) \le \sum_v \mathrm{Wass}\big(P_{v \cup \Pi_v}, Q_{v \cup \Pi_v}\big)$?
- A: Not always; there exist pairs of Markov chains $X \to Y \to Z$ and $X' \to Y' \to Z'$ such that
  $\dfrac{\mathrm{Wass}\big((X,Y), (X',Y')\big) + \mathrm{Wass}\big((Y,Z), (Y',Z')\big)}{\mathrm{Wass}\big((X,Y,Z), (X',Y',Z')\big)}$
  can be made arbitrarily small
- [Preliminary Result]: the Wasserstein distance between two Markov chains $X_1, \dots, X_n$ and $Y_1, \dots, Y_n$ satisfies subadditivity if the conditional densities $P_{X_t \mid X_{t-1}}(x_t \mid x_{t-1})$ and $P_{Y_t \mid Y_{t-1}}(y_t \mid y_{t-1})$ are Lipschitz with respect to $x_{t-1}$ and $y_{t-1}$ respectively, for all $t$ (extends to Bayesian networks)
Video Generation
- Discriminate the generated video distribution against the target distribution over videos
- (figure: randomness feeds a Generator at each of frames 1-4; discriminating $p_{\mathrm{model}}$ from $p_{\mathrm{data}}$ over entire videos is too high-dimensional)
Video Generation (cont'd)
- Can exploit subadditivity and discriminate only pairs of consecutive frames of the generated distribution against pairs of consecutive frames of the target distribution (figure: discriminators Disc1, Disc2, Disc3, one per pair of consecutive frames; see the sketch below)
- N.B. the resulting multi-player zero-sum game falls in the realm of [D-Papadimitriou ICALP'09], [Even-Dar et al. STOC'09], [Cai-D SODA'11], [Cai et al. MATHOR'15]; efficient dynamics are known
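A minimal PyTorch sketch of the factorized discriminator idea, with hypothetical toy shapes: one small discriminator per pair of consecutive frames, whose scores are summed; the generator then plays against this sum rather than against a single discriminator over whole videos.

```python
# Factorized video discriminator: Disc_i sees only frames (i, i+1).
import torch
import torch.nn as nn

n_frames, d_frame = 4, 784
pair_discs = nn.ModuleList(
    nn.Sequential(nn.Linear(2 * d_frame, 64), nn.ReLU(), nn.Linear(64, 1))
    for _ in range(n_frames - 1)
)

def factorized_score(video):  # video: (batch, n_frames, d_frame)
    # concatenate each pair of consecutive frames and sum the pair scores
    pairs = [torch.cat([video[:, i], video[:, i + 1]], dim=1)
             for i in range(n_frames - 1)]
    return sum(D(p).mean() for D, p in zip(pair_discs, pairs))
```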
Video Generation: experiment [Ilyas '18]
- Created random 4-frame videos of MNIST digits; in every training video, the digits are weakly increasing in time
- Trained two video GANs: one with an un-factorized discriminator, and one with a factorized discriminator
- The GANs must learn both how to hallucinate handwritten digits and that they need to put them in increasing order
- Compare the factorized vs. un-factorized models in terms of accuracy
Conclusions
- Min-max optimization has found numerous applications in Optimization, Game Theory, and Adversarial Training
- Applications to Generative Adversarial Networks pose serious challenges, both of an optimization nature (oscillations) and of a statistical nature (curse of dimensionality)
- We propose gradient descent with negative momentum as an approach to ease training oscillations
- We prove Wasserstein subadditivity for Bayes nets and propose modeling dependencies in the data as an approach to ease the curse of dimensionality
- Lots of interesting theoretical and practical challenges going forward