Improving GANs using Game Theory and Statistics
Constantinos Daskalakis, CSAIL and EECS, MIT
Min-Max Optimization
- Solve: $\inf_\theta \sup_w f(\theta, w)$, where $\theta, w$ are high-dimensional
- Applications: Mathematics, Optimization, Game Theory, … [von Neumann 1928, Dantzig '47, Brown '50, Robinson '51, Blackwell '56, …]
- Best-Case Scenario: $f$ is convex in $\theta$, concave in $w$
- Modern Applications: GANs, adversarial examples, … ⇒ these heighten the importance of first-order methods and of non convex-concave objectives
GAN Outputs
- LSGAN. Mao et al. 2017.
- BEGAN. Berthelot et al. 2017.

GAN Uses
- CycleGAN. Zhu et al. 2017.
- Pix2pix. Isola et al. 2017. Many examples at https://phillipi.github.io/pix2pix/
- Text → Image Synthesis. Reed et al. 2017.
Many applications:
- Domain adaptation
- Super-resolution
- Image Synthesis
- Image Completion
- Compressed Sensing
- …
Min-Max Optimization (cont'd)
- Personal Perspective: applications of min-max optimization will multiply going forward, as ML develops more complex and harder-to-interpret algorithms; sup players will be introduced to check the behavior of the inf players
Generative Adversarial Networks (GANs)
[Goodfellow et al. NeurIPS'14]
- Simple randomness $z \sim N(0, I)$ feeds the Generator, a DNN with parameters $\theta$, which outputs hallucinated images
- The Discriminator, a DNN with parameters $w$, sees hallucinated images (from the generator) and real images (from the training set) and guesses: real or hallucinated
- Solve $\inf_\theta \sup_w f(\theta, w)$, where $f(\theta, w)$ expresses how well the Discriminator distinguishes true vs generated images; e.g. Wasserstein-GANs: $f(\theta, w) = \mathbb{E}_{x \sim p_{\mathrm{real}}}[D_w(x)] - \mathbb{E}_{z \sim N(0, I)}[D_w(G_\theta(z))]$
- $\theta, w$: high-dimensional ⇒ solve the game by having the min (resp. max) player run online gradient descent (resp. ascent), as in the sketch below
- Major challenges:
  - training oscillations
  - generated & real distributions are high-dimensional ⇒ no rigorous statistical guarantees
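To make the descent/ascent loop concrete, here is a minimal PyTorch sketch of one simultaneous gradient descent-ascent step on the W-GAN objective, with hypothetical toy architectures (not the talk's code); the Lipschitz constraint on the discriminator (weight clipping / gradient penalty) and all other practical details are omitted.

```python
# Minimal sketch of one simultaneous descent/ascent step on
#   f(theta, w) = E[D_w(x)] - E[D_w(G_theta(z))].
import torch
import torch.nn as nn

d_noise, d_data = 8, 16
G = nn.Sequential(nn.Linear(d_noise, 32), nn.ReLU(), nn.Linear(32, d_data))
D = nn.Sequential(nn.Linear(d_data, 32), nn.ReLU(), nn.Linear(32, 1))
opt_G = torch.optim.SGD(G.parameters(), lr=1e-3)  # min player: descent on theta
opt_D = torch.optim.SGD(D.parameters(), lr=1e-3)  # max player: ascent on w

real = torch.randn(64, d_data)  # stand-in for a minibatch of real images
z = torch.randn(64, d_noise)    # simple randomness z ~ N(0, I)

f = D(real).mean() - D(G(z)).mean()  # the W-GAN objective f(theta, w)

opt_G.zero_grad()
opt_D.zero_grad()
f.backward()
for p in D.parameters():
    p.grad.neg_()  # flip D's gradients so its SGD step *ascends* f
opt_D.step()
opt_G.step()       # G's SGD step descends f
```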
Menu
- Min-Max Optimization and Adversarial Training
- Training Challenges:
- reducing training oscillations
- Statistical Challenges:
- reducing sample requirements
- attaining statistical guarantees
Training Oscillations: Gaussian Mixture
True Distribution: mixture of 8 Gaussians on a circle. Output distribution of a standard GAN, trained via gradient descent/ascent dynamics: it cycles through the modes at different steps of training.
from [Metz et al. ICLR'17]
Training Oscillations: Handwritten Digits
True Distribution: MNIST. Output distribution of a standard GAN, trained via gradient descent/ascent dynamics: it cycles through "proto-digits" at different steps of training.
from [Metz et al. ICLR'17]
Training Oscillations: even for bilinear objectives!
- True distribution: isotropic Normal, namely $x \sim \mathcal{N}\big((3,4), I_{2\times 2}\big)$
- Generator architecture: $G_\theta(z) = \theta + z$ (adds input $z$ to internal parameters)
- Discriminator architecture: $D_v(x) = \langle v, x \rangle$ (linear projection)
- W-GAN objective: $\min_\theta \max_v\ \mathbb{E}_x[D_v(x)] - \mathbb{E}_z[D_v(G_\theta(z))] = \min_\theta \max_v\ v^{\mathsf T}\big((3,4)^{\mathsf T} - \theta\big)$, from [Daskalakis, Ilyas, Syrgkanis, Zeng ICLR'18]
- a convex-concave function, with $\theta, v$ just 2-dimensional, yet the gradient descent dynamics oscillate (see the sketch below)

Training Oscillations: persistence under many variants of Gradient Descent
from [Daskalakis, Ilyas, Syrgkanis, Zeng ICLR'18]
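A minimal numpy sketch of the bilinear example above, using the reduced objective $f(\theta, v) = v^{\mathsf T}((3,4)^{\mathsf T} - \theta)$: simultaneous gradient descent-ascent spirals away from the equilibrium $\theta^* = (3,4)$, $v^* = 0$ rather than converging.

```python
# GDA on f(theta, v) = v^T ((3,4) - theta): the iterates rotate around the
# equilibrium while their distance from it grows.
import numpy as np

c = np.array([3.0, 4.0])
theta, v = np.zeros(2), np.ones(2)
eta = 0.1
for t in range(1, 501):
    g_theta, g_v = -v, c - theta   # grad_theta f, grad_v f
    theta = theta - eta * g_theta  # min player descends
    v = v + eta * g_v              # max player ascends
    if t % 100 == 0:
        # distance from equilibrium grows by sqrt(1 + eta^2) per step
        print(t, np.hypot(np.linalg.norm(theta - c), np.linalg.norm(v)))
```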
Training Oscillations: Online Learning Perspective
- Best-Case Scenario: given a convex-concave $f(y, z)$, solve: $\min_{y \in Y} \max_{z \in Z} f(y, z)$
- [von Neumann '28]: min-max = max-min; solvable via convex programming
- Online Learning: if the min and max players run any no-regret learning procedure, they converge to a minimax equilibrium
  - e.g. follow-the-regularized-leader (FTRL), follow-the-perturbed-leader, multiplicative weights (MWU)
  - follow-the-regularized-leader with $\ell_2^2$-regularization $\equiv$ gradient descent (see the derivation below)
- "Convergence": the sequence $(y_t, z_t)_t$ converges to a minimax equilibrium in the average sense, i.e. $f\big(\tfrac{1}{T}\sum_{t \le T} y_t,\ \tfrac{1}{T}\sum_{t \le T} z_t\big) \to \min_{y \in Y} \max_{z \in Z} f(y, z)$
- Can we show point-wise convergence of no-regret learning methods?
  - [Mertikopoulos-Papadimitriou-Piliouras SODA'18]: No, for any FTRL
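The FTRL $\equiv$ gradient-descent claim can be checked in one line in the unconstrained case (a standard derivation, not specific to the talk), assuming linearized losses $\langle g_s, \cdot \rangle$:

$$x_{t+1} \;=\; \arg\min_{x}\Big\{\sum_{s\le t}\langle g_s, x\rangle + \tfrac{1}{2\eta}\|x\|_2^2\Big\} \;=\; -\eta\sum_{s\le t} g_s \;=\; x_t - \eta\, g_t,$$

since $x_t = -\eta\sum_{s\le t-1} g_s$; each FTRL iterate is exactly a gradient step from the previous one.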
Negative Momentum
- Variant of gradient descent: $\forall t:\ y_{t+1} = y_t - \eta \cdot \nabla f(y_t) + \tfrac{\eta}{2} \cdot \nabla f(y_{t-1})$
- Interpretation: undo, today, some of yesterday's gradient; i.e. negative momentum
- Gradient descent with negative momentum $\equiv$ optimistic FTRL with $\ell_2^2$-regularization [Rakhlin-Sridharan COLT'13, Syrgkanis et al. NeurIPS'15] $\approx$ extra-gradient method [Korpelevich '76, Chiang et al. COLT'12, Gidel et al. '18, Mertikopoulos et al. '18]
- Does it help in min-max optimization?
Negative Momentum: why it could help
- E.g. $f(y, z) = (y - 0.5) \cdot (z - 0.5)$
- GDA: $y_{t+1} = y_t - \eta \cdot \nabla_y f(y_t, z_t)$, $z_{t+1} = z_t + \eta \cdot \nabla_z f(y_t, z_t)$
- GDA with negative momentum: $y_{t+1} = y_t - \eta \cdot \nabla_y f(y_t, z_t) + \tfrac{\eta}{2} \cdot \nabla_y f(y_{t-1}, z_{t-1})$, $z_{t+1} = z_t + \eta \cdot \nabla_z f(y_t, z_t) - \tfrac{\eta}{2} \cdot \nabla_z f(y_{t-1}, z_{t-1})$
- (figure: trajectories of both dynamics, marking the start point and the min-max equilibrium)
Negative Momentum: convergence
- Optimistic gradient descent-ascent (OGDA) dynamics, $\forall t$:
  $y_{t+1} = y_t - \eta \cdot \nabla_y f(y_t, z_t) + \tfrac{\eta}{2} \cdot \nabla_y f(y_{t-1}, z_{t-1})$
  $z_{t+1} = z_t + \eta \cdot \nabla_z f(y_t, z_t) - \tfrac{\eta}{2} \cdot \nabla_z f(y_{t-1}, z_{t-1})$
- [Daskalakis-Ilyas-Syrgkanis-Zeng ICLR'18]: OGDA exhibits last-iterate convergence for unconstrained bilinear games: $\min_{y \in \mathbb{R}^n} \max_{z \in \mathbb{R}^n} f(y, z) = y^{\mathsf T} B z + b^{\mathsf T} y + c^{\mathsf T} z$
- [Liang-Stokes '18]: …the convergence rate is geometric if $B$ is well-conditioned; extends to strongly convex-concave functions $f(y, z)$
- E.g. in the previous isotropic Gaussian case: $x \sim \mathcal{N}\big((3,4), I_{2\times 2}\big)$, $G_\theta(z) = \theta + z$, $D_w(x) = \langle w, x \rangle$ (see the sketch below)
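A minimal numpy sketch of OGDA on the same bilinear game, written in the $2\eta$-current / $\eta$-previous form used in [Daskalakis-Ilyas-Syrgkanis-Zeng ICLR'18] (the same family as the $\eta, \eta/2$ form above, up to rescaling $\eta$): unlike plain GDA, the last iterate now converges.

```python
# OGDA on f(theta, v) = v^T ((3,4) - theta): last-iterate convergence to
# the equilibrium theta* = (3,4), v* = 0.
import numpy as np

c = np.array([3.0, 4.0])
theta, v = np.zeros(2), np.ones(2)
pg_theta, pg_v = np.zeros(2), np.zeros(2)  # previous step's gradients
eta = 0.1
for t in range(2000):
    g_theta, g_v = -v, c - theta
    theta = theta - 2 * eta * g_theta + eta * pg_theta  # descend, undo yesterday
    v = v + 2 * eta * g_v - eta * pg_v                  # ascend, undo yesterday
    pg_theta, pg_v = g_theta, g_v

print(np.linalg.norm(theta - c), np.linalg.norm(v))     # both near 0
```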
Negative Momentum: convergence (cont'd)
- [Daskalakis-Panageas ITCS'18]: projected OGDA exhibits last-iterate convergence even for constrained bilinear games: $\min_{y \in \Delta^n} \max_{z \in \Delta^m} y^{\mathsf T} B z$ (equivalently, all of linear programming)
Negative Momentum: in the Wild
- Can try optimism for non convex-concave min-max objectives $f(y, z)$
- Issue [Daskalakis, Panageas NeurIPS'18]: no hope that the stable points of OGDA or GDA are only local min-max points
  - e.g. $f(y, z) = -\tfrac{1}{8} y^2 - \tfrac{1}{2} z^2 + \tfrac{6}{10}\, y z$
- Nested-ness: Local Min-Max $\subseteq$ Stable Points of GDA $\subseteq$ Stable Points of OGDA
- (figure: the Gradient Descent-Ascent field for this example)
Negative Momentum: in the Wild (cont'd)
- Local Min-Max $\subseteq$ Stable Points of GDA $\subseteq$ Stable Points of OGDA; also [Adolphs et al. '18] for the left inclusion
- Question: identify a first-order method converging to local min-max points with probability 1
- While this is pending, evaluate optimism in practice…
- [Daskalakis-Ilyas-Syrgkanis-Zeng ICLR'18]: propose optimistic Adam (sketched below)
  - Adam, a variant of gradient descent proposed by [Kingma-Ba ICLR'15], has found wide adoption in deep learning, although it doesn't always converge [Reddi-Kale-Kumar ICLR'18]
  - Optimistic Adam is the right adaptation of Adam to "undo some of the past gradients"
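A minimal numpy sketch of the optimistic-Adam update rule (a sketch of the idea, not the authors' released code): take Adam's adaptive step with double weight and undo the previous adaptive step.

```python
# Optimistic variant of the Adam update: 2x today's adaptive step, minus
# yesterday's adaptive step.
import numpy as np

def optimistic_adam_step(theta, grad, state, lr=1e-3,
                         beta1=0.9, beta2=0.999, eps=1e-8):
    m, v, prev_step, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias corrections
    v_hat = v / (1 - beta2 ** t)
    step = lr * m_hat / (np.sqrt(v_hat) + eps)
    theta = theta - 2 * step + prev_step      # optimistic correction
    return theta, (m, v, step, t)

# initial state for a parameter vector theta0:
# state = (np.zeros_like(theta0), np.zeros_like(theta0),
#          np.zeros_like(theta0), 0)
```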
Optimistic Adam on CIFAR10
- Compare Adam vs. Optimistic Adam, trained on CIFAR10, in terms of Inception Score
- No fine-tuning for Optimistic Adam: the same hyper-parameters were used for both algorithms, as suggested in Gulrajani et al. (2017)
Menu
- Min-Max Optimization and Adversarial Training
- Training Challenges:
- reducing training oscillations
- Statistical Challenges:
- reducing sample requirements
- attaining statistical guarantees
Generative Adversarial Networks (GANs), revisited
- Simple randomness $z \sim N(0, I)$ feeds a Generator (DNN with parameters $\theta$); a Discriminator (DNN with parameters $w$) sees hallucinated images (from the generator) and real images (from the training set) and guesses: real or hallucinated
- $\inf_\theta \sup_w f(\theta, w)$, where $f(\theta, w)$ expresses how well the Discriminator distinguishes true from generated images; e.g. Wasserstein-GANs: $f(\theta, w) = \mathbb{E}_{x \sim p_{\mathrm{real}}}[D_w(x)] - \mathbb{E}_{z \sim N(0, I)}[D_w(G_\theta(z))]$
- The inner sup (discrimination) problem is a statistical estimation problem:
  - how close are $p_{\mathrm{real}}$ and $p_{\mathrm{generated}}$ in the distance defined by the test functions expressible in the architecture of the discriminator?
  - because training will fail to solve the min-max problem to optimality, this distance won't be truly minimized
- Major statistical challenges:
  - certifying a trained GAN: how close are $p_{\mathrm{real}}$ and $p_{\mathrm{generated}}$ in some distance of interest?
  - alleviating the computational & statistical burden of discrimination
  - scaling up the dimensionality of generated distributions
GANs: Statistical Challenges
- Certifying a trained GAN: how close are $p_{\mathrm{real}}$ and $p_{\mathrm{generated}}$ in some distance of interest?
  - fundamental challenge: the curse of dimensionality
  - claim (birthday paradox): given sample access to a distribution $P$ over $\{0,1\}^n$, and $Q = \mathrm{Unif}(\{0,1\}^n)$, estimating $\mathrm{Wasserstein}(P, Q)$ to within $\pm 1/4$ requires $\Omega(2^{n/2})$ samples
  - for $n$ in the 1000s (e.g. CIFAR) ⇒ infeasible, unless lower-dimensional structure in $p_{\mathrm{real}}$ and $p_{\mathrm{generated}}$ is exploited
- Alleviating the computational & statistical burden of discrimination:
  ⇒ infeasible, unless lower-dimensional structure in $p_{\mathrm{real}}$ and $p_{\mathrm{generated}}$ is exploited
- Scaling up the dimensionality of the generated distribution (e.g. video generation):
  ⇒ infeasible, unless lower-dimensional structure in $p_{\mathrm{real}}$ is exploited
Lower-Dimensional Structure: Bayesian Networks
- Probability distribution defined in terms of a DAG $G = (V, E)$
- Node $v$ is associated with a random variable $X_v \in \Sigma$
- The distribution factorizes according to the parenthood relationships (see the sketch after this list):
  $\Pr(x) = \prod_v \Pr_{X_v \mid X_{\Pi_v}}\big(x_v \mid x_{\Pi_v}\big), \quad \forall x \in \Sigma^V$
- E.g., for a 5-node DAG on $X_1, \dots, X_5$:
  $\Pr(x) = \Pr(x_1) \cdot \Pr(x_2) \cdot \Pr(x_3 \mid x_1, x_2) \cdot \Pr(x_4 \mid x_3) \cdot \Pr(x_5 \mid x_3, x_4)$
- Is it easier to discriminate between Bayes nets whose structure is known?
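A minimal Python sketch of the factorization above; `parents` and `cpds` are hypothetical inputs, with `parents[node]` listing the node's parents and `cpds[node](value, parent_values)` returning the conditional probability.

```python
# The probability of a full assignment is the product of each node's
# conditional probability given its parents.
def bayesnet_prob(assignment, parents, cpds):
    p = 1.0
    for node, value in assignment.items():
        parent_values = tuple(assignment[u] for u in parents[node])
        p *= cpds[node](value, parent_values)
    return p

# e.g. the 5-node DAG above:
# parents = {1: (), 2: (), 3: (1, 2), 4: (3,), 5: (3, 4)}
```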
BayesNet Discrimination
- Setup: Bayes nets $P$ and $Q$, each on a DAG $G$ with $n$ nodes and in-degree $d$
- Goal: given samples from $P$ and $Q$, distinguish $P = Q$ vs $\mathrm{dist}(P, Q) > \epsilon$
- [Daskalakis-Pan COLT'17]: if dist is the Total Variation distance, there exist computationally efficient testers using $\tilde{O}\big(|\Sigma|^{0.75(d+1)} \cdot n / \epsilon^2\big)$ samples; moreover, the dependence on $n, \epsilon$ is tight up to an $O(\log n)$ factor, and the exponential dependence on $d$ is necessary and essentially tight
- [Canonne et al. COLT'17]: identify conditions under which the dependence on $n$ can be made $\sqrt{n}$ when one of the two Bayes nets is known
- the effective dimensionality is thus governed by $n$ and the in-degree $d$, not by the full domain $\Sigma^n$
BayesNet Discrimination in TV
- Goal: distinguish $P = Q$ vs $d_{\mathrm{TV}}(P, Q) > \epsilon$
- Idea: distance localization (see the sketch below)
  - prove a statement of the form: "if Bayes nets $P$ and $Q$ are far in TV, there exists a small witness set $S$ of variables such that $P_S$ and $Q_S$, the marginals of $P$ and $Q$ on the variables in $S$, are also somewhat far apart"
  - this reduces the original problem to identity testing on small sets of variables, whose distributions can be sampled
- Question: which distances are localizable?
  - $\mathrm{KL}(P \| Q) \le \sum_v \mathrm{KL}\big(P_{v \cup \Pi_v} \,\|\, Q_{v \cup \Pi_v}\big)$ (chain rule of KL)
  - $d_{\mathrm{TV}}(P, Q) \le \sum_v d_{\mathrm{TV}}\big(P_{v \cup \Pi_v}, Q_{v \cup \Pi_v}\big) + \sum_v d_{\mathrm{TV}}\big(P_{\Pi_v}, Q_{\Pi_v}\big)$ (hybrid argument)
  - $H^2(P, Q) \le \sum_v H^2\big(P_{v \cup \Pi_v}, Q_{v \cup \Pi_v}\big)$ (squared Hellinger)
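A minimal sketch of the localization reduction, assuming a hypothetical `marginal_distance` subroutine that estimates the chosen (localizable) distance between the marginals of $P$ and $Q$ on a small variable set: if the joints are $\epsilon$-far, subadditivity guarantees that some marginal on $\{v\} \cup \Pi_v$ is at least $\epsilon/n$-far.

```python
# Test each small marginal over {node} U parents instead of the full
# n-dimensional joints.
def localized_test(samples_p, samples_q, parents, marginal_distance, eps):
    n = len(parents)
    for node, pa in parents.items():
        scope = (node, *pa)  # the small witness candidate {v} U Pi_v
        # subadditivity: if the joints are eps-far, some local marginal is
        # at least eps/n-far; test at eps/(2n) to tolerate estimation error
        if marginal_distance(samples_p, samples_q, scope) > eps / (2 * n):
            return "far"
    return "close"
```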
Wasserstein Subadditivity
- Q: Does Wasserstein satisfy subadditivity, i.e. $\mathrm{Wass}(P, Q) \le \sum_v \mathrm{Wass}\big(P_{v \cup \Pi_v}, Q_{v \cup \Pi_v}\big)$?
- A: Not always; there exist pairs of Markov chains $X \to Y \to Z$ and $X' \to Y' \to Z'$ such that
  $\dfrac{\mathrm{Wass}\big((X,Y), (X',Y')\big) + \mathrm{Wass}\big((Y,Z), (Y',Z')\big)}{\mathrm{Wass}\big((X,Y,Z), (X',Y',Z')\big)}$
  can be made arbitrarily small
- [Preliminary Result]: the Wasserstein distance between two Markov chains $X_1, \dots, X_n$ and $Y_1, \dots, Y_n$ satisfies subadditivity if the conditional densities $P_{X_t \mid X_{t-1}}(x_t \mid x_{t-1})$ and $P_{Y_t \mid Y_{t-1}}(y_t \mid y_{t-1})$ are Lipschitz with respect to $x_{t-1}$ and $y_{t-1}$ respectively, for all $t$ (extends to Bayesian networks)
Video Generation
- Discriminate the generated video distribution against the target distribution over videos
- (figure: randomness feeds a Generator at each of frames 1-4; discriminating $p_{\mathrm{model}}$ from $p_{\mathrm{data}}$ over entire videos is too high-dimensional)
Video Generation (cont'd)
- Can exploit subadditivity and discriminate only pairs of consecutive frames of the generated distribution against pairs of consecutive frames of the target distribution (figure: discriminators Disc1, Disc2, Disc3, one per pair of consecutive frames; see the sketch below)
- N.B. the resulting multi-player zero-sum game falls in the realm of [D-Papadimitriou ICALP'09], [Even-Dar et al. STOC'09], [Cai-D SODA'11], [Cai et al. MATHOR'15]; efficient dynamics are known
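A minimal PyTorch sketch of the factorized discriminator idea, with hypothetical toy shapes: one small discriminator per pair of consecutive frames, whose scores are summed; the generator then plays against this sum rather than against a single discriminator over whole videos.

```python
# Factorized video discriminator: Disc_i sees only frames (i, i+1).
import torch
import torch.nn as nn

n_frames, d_frame = 4, 784
pair_discs = nn.ModuleList(
    nn.Sequential(nn.Linear(2 * d_frame, 64), nn.ReLU(), nn.Linear(64, 1))
    for _ in range(n_frames - 1)
)

def factorized_score(video):  # video: (batch, n_frames, d_frame)
    # concatenate each pair of consecutive frames and sum the pair scores
    pairs = [torch.cat([video[:, i], video[:, i + 1]], dim=1)
             for i in range(n_frames - 1)]
    return sum(D(p).mean() for D, p in zip(pair_discs, pairs))
```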
Video Generation: experiment [Ilyas '18]
- Created random 4-frame videos of MNIST digits; in every training video, the digits are weakly increasing in time
- Trained two video GANs: one with an un-factorized discriminator, and one with a factorized discriminator
- The GANs must learn both how to hallucinate handwritten digits and that they need to put them in increasing order
- Compare the factorized vs. un-factorized models in terms of accuracy
Conclusions
- Min-max optimization has found numerous applications in Optimization, Game Theory, and Adversarial Training
- Applications to Generative Adversarial Networks pose serious challenges, both of an optimization nature (oscillations) and of a statistical nature (curse of dimensionality)
- We propose gradient descent with negative momentum as an approach to ease training oscillations
- We prove Wasserstein subadditivity for Bayes nets and propose modeling dependencies in the data as an approach to ease the curse of dimensionality
- Lots of interesting theoretical and practical challenges going forward