Improving GANs using Game Theory and Statistics – Constantinos Daskalakis, CSAIL and EECS, MIT (PowerPoint presentation transcript)


  1. Improving GANs using Game Theory and Statistics – Constantinos Daskalakis, CSAIL and EECS, MIT

  2. Min-Max Optimization
  Solve: $\inf_\theta \sup_w f(\theta, w)$, where $\theta, w$ are high-dimensional.
  β€’ Applications: Mathematics, Optimization, Game Theory, … [von Neumann 1928, Dantzig '47, Brown '50, Robinson '51, Blackwell '56, …]
  β€’ Best-Case Scenario: $f$ is convex in $\theta$, concave in $w$
  β€’ Modern Applications: GANs, adversarial examples, … – these exacerbate the importance of first-order methods and of non convex-concave objectives

  3. GAN Outputs
  Sample images from BEGAN [Berthelot et al. 2017] and LSGAN [Mao et al. 2017].

  4. GAN Uses
  Text-to-image synthesis [Reed et al. 2017]; Pix2pix [Isola et al. 2017], many examples at https://phillipi.github.io/pix2pix/; CycleGAN [Zhu et al. 2017].
  Many applications:
  β€’ Domain adaptation
  β€’ Super-resolution
  β€’ Image synthesis
  β€’ Image completion
  β€’ Compressed sensing
  β€’ …

  5. Min-Max Optimization
  Solve: $\inf_\theta \sup_w f(\theta, w)$, where $\theta, w$ are high-dimensional.
  β€’ Applications: Mathematics, Optimization, Game Theory, … [von Neumann 1928, Dantzig '47, Brown '50, Robinson '51, Blackwell '56, …]
  β€’ Best-Case Scenario: $f$ is convex in $\theta$, concave in $w$
  β€’ Modern Applications: GANs, adversarial examples, … – these exacerbate the importance of first-order methods and of non convex-concave objectives
  β€’ Personal Perspective: applications of min-max optimization will multiply going forward, as ML develops more complex and harder-to-interpret algorithms – sup players will be introduced to check the behavior of the inf players

  6. Generative Adversarial Networks (GANs) [Goodfellow et al. NeurIPS'14]
  Real or hallucinated? $\inf_\theta \sup_w f(\theta, w)$
  β€’ Discriminator: DNN with parameters $w$; $f$ expresses how well the discriminator distinguishes real images (from the training set) from hallucinated images (from the generator)
  β€’ Generator: DNN with parameters $\theta$, fed simple randomness $z \sim \mathcal{N}(0, I)$
  β€’ E.g. Wasserstein-GANs: $f(\theta, w) = \mathbb{E}_{X \sim p_{\mathrm{real}}}[D_w(X)] - \mathbb{E}_{z \sim \mathcal{N}(0, I)}[D_w(G_\theta(z))]$
  β€’ $\theta, w$ high-dimensional ⇝ solve the game by having the min (resp. max) player run online gradient descent (resp. ascent)
  β€’ Major challenges:
  – training oscillations
  – generated and real distributions are high-dimensional ⇝ no rigorous statistical guarantees
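To make the "min player runs gradient descent, max player runs gradient ascent" recipe concrete, below is a minimal sketch of W-GAN-style training in PyTorch. The network sizes, learning rates, batch size, and the 2-D Gaussian stand-in for real data are illustrative assumptions, and the Lipschitz constraint on the discriminator (weight clipping or gradient penalty) is omitted; this is not the speaker's implementation.

```python
# Minimal sketch of W-GAN-style training via simultaneous gradient descent-ascent.
# Network sizes, learning rates, and the 2-D Gaussian "real" data are illustrative
# assumptions; the Lipschitz constraint on D (clipping / gradient penalty) is omitted.
import torch
import torch.nn as nn

dim = 2
G = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, dim))  # generator G_theta
D = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))    # discriminator D_w

opt_G = torch.optim.SGD(G.parameters(), lr=1e-3)  # min player: gradient descent on theta
opt_D = torch.optim.SGD(D.parameters(), lr=1e-3)  # max player: gradient ascent on w

real_mean = torch.tensor([3.0, 4.0])              # stand-in "real" distribution N((3,4), I)

for step in range(10_000):
    x_real = real_mean + torch.randn(128, dim)    # samples from the real distribution
    z = torch.randn(128, dim)                     # simple randomness z ~ N(0, I)

    # W-GAN objective f(theta, w) = E[D_w(X)] - E[D_w(G_theta(z))]
    f = D(x_real).mean() - D(G(z)).mean()

    opt_G.zero_grad()
    opt_D.zero_grad()
    f.backward()
    for p in D.parameters():   # ascent for the discriminator: flip the sign of its gradients
        p.grad = -p.grad
    opt_G.step()
    opt_D.step()
```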

  7. Menu β€’ Min-Max Optimization and Adversarial Training β€’ Training Challenges: β€’ reducing training oscillations β€’ Statistical Challenges: β€’ reducing sample requirements β€’ attaining statistical guarantees

  8. Menu β€’ Min-Max Optimization and Adversarial Training β€’ Training Challenges: β€’ reducing training oscillations β€’ Statistical Challenges: β€’ reducing sample requirements β€’ attaining statistical guarantees

  9. Training Oscillations: Gaussian Mixture
  True distribution: mixture of 8 Gaussians on a circle.
  Output distribution of a standard GAN, trained via gradient descent/ascent dynamics: it cycles through the modes at different steps of training (from [Metz et al. ICLR'17]).

  10. Training Oscillations: Handwritten Digits
  True distribution: MNIST.
  Output distribution of a standard GAN, trained via gradient descent/ascent dynamics: it cycles through "proto-digits" at different steps of training (from [Metz et al. ICLR'17]).

  11. Training Oscillations: even for bilinear objectives!
  β€’ True distribution: isotropic Normal, namely $X \sim \mathcal{N}((3,4)^T, I_{2\times 2})$
  β€’ Generator architecture: $G_\theta(z) = \theta + z$ (adds input $z$ to internal parameters); $z, \theta, w$ are 2-dimensional
  β€’ Discriminator architecture: $D_w(\cdot) = \langle w, \cdot \rangle$ (linear projection)
  β€’ W-GAN objective: $\min_\theta \max_w \mathbb{E}_X[D_w(X)] - \mathbb{E}_z[D_w(G_\theta(z))] = \min_\theta \max_w \, w^T \cdot ((3,4)^T - \theta)$, a convex-concave function
  Figure: gradient descent dynamics (from [Daskalakis, Ilyas, Syrgkanis, Zeng ICLR'18])
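A quick numerical sketch of plain gradient descent-ascent on this 2-D bilinear objective makes the oscillation visible; the step size, starting point, and iteration count below are arbitrary illustrative choices.

```python
# Plain gradient descent-ascent on the bilinear W-GAN game f(theta, w) = w^T ((3,4) - theta).
# Step size, starting point, and iteration count are arbitrary; the observation is that the
# iterates spiral away from the equilibrium theta* = (3,4), w* = (0,0) instead of converging.
import numpy as np

mu = np.array([3.0, 4.0])
theta, w = np.zeros(2), np.array([1.0, 1.0])
eta = 0.1

dist = []
for t in range(200):
    grad_theta = -w            # d f / d theta
    grad_w = mu - theta        # d f / d w
    theta, w = theta - eta * grad_theta, w + eta * grad_w   # simultaneous descent / ascent
    dist.append(np.linalg.norm(theta - mu) + np.linalg.norm(w))

print(dist[0], dist[-1])       # the distance to the equilibrium grows over time
```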

  12. Training Oscillations: persistence under many variants of Gradient Descent (from [Daskalakis, Ilyas, Syrgkanis, Zeng ICLR'18])

  13. Training Oscillations: Online Learning Perspective
  β€’ Best-Case Scenario: given convex-concave $f(x, y)$, solve $\min_{x \in X} \max_{y \in Y} f(x, y)$
  β€’ [von Neumann '28]: min-max = max-min; solvable via convex programming
  β€’ Online Learning: if the min and max players run any no-regret learning procedure, they converge to a minimax equilibrium
  β€’ E.g. follow-the-regularized-leader (FTRL), follow-the-perturbed-leader, MWU
  β€’ Follow-the-regularized-leader with $\ell_2^2$-regularization ≑ gradient descent
  β€’ "Convergence": the sequence $(x_t, y_t)_t$ converges to a minimax equilibrium in the average sense, i.e. $f\!\left(\frac{1}{T}\sum_{t \le T} x_t, \frac{1}{T}\sum_{t \le T} y_t\right) \to \min_{x \in X} \max_{y \in Y} f(x, y)$
  β€’ Can we show point-wise convergence of no-regret learning methods?
  β€’ [Mertikopoulos-Papadimitriou-Piliouras SODA'18]: No, for any FTRL
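As a concrete instance of the no-regret route to the minimax value, the sketch below runs multiplicative weights (FTRL with an entropic regularizer) in self-play on a small zero-sum matrix game; the payoff matrix (matching pennies), step size, and starting strategies are illustrative assumptions. The time-averaged strategies approach the equilibrium, while the last iterates keep cycling, consistent with the point-wise non-convergence result above.

```python
# Multiplicative weights (FTRL with entropic regularizer) in self-play on matching pennies.
# The payoff matrix, step size, and starting strategies are illustrative choices.
import numpy as np

A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])   # min player chooses x, max player chooses y, payoff x^T A y
eta, T = 0.1, 5000
x, y = np.array([0.9, 0.1]), np.array([0.2, 0.8])
x_sum, y_sum = np.zeros(2), np.zeros(2)

for t in range(T):
    x_sum += x
    y_sum += y
    gx, gy = A @ y, A.T @ x                    # payoff gradients seen by each player
    x = x * np.exp(-eta * gx); x /= x.sum()    # min player: multiplicative update downhill
    y = y * np.exp(+eta * gy); y /= y.sum()    # max player: multiplicative update uphill

x_avg, y_avg = x_sum / T, y_sum / T
print(x_avg @ A @ y_avg)   # average play: close to the minimax value 0
print(x, y)                # last iterates keep cycling; they need not approach (1/2, 1/2)
```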

  14. Negative Momentum
  β€’ Variant of gradient descent: $\forall t: x_{t+1} = x_t - \eta \cdot \nabla f(x_t) + \frac{\eta}{2} \cdot \nabla f(x_{t-1})$
  β€’ Interpretation: undo, today, some of yesterday's gradient; i.e., negative momentum
  β€’ Gradient descent with negative momentum = optimistic FTRL with $\ell_2^2$-regularization [Rakhlin-Sridharan COLT'13, Syrgkanis et al. NeurIPS'15] β‰ˆ extra-gradient method [Korpelevich '76, Chiang et al. COLT'12, Gidel et al. '18, Mertikopoulos et al. '18]
  β€’ Does it help in min-max optimization?

  15. Negative Momentum: why it could help
  β€’ E.g. $f(x, y) = (x - 0.5) \cdot (y - 0.5)$
  β€’ GDA: $x_{t+1} = x_t - \eta \cdot \nabla_x f(x_t, y_t)$, $\; y_{t+1} = y_t + \eta \cdot \nabla_y f(x_t, y_t)$
  β€’ OGDA: $x_{t+1} = x_t - \eta \cdot \nabla_x f(x_t, y_t) + \frac{\eta}{2} \cdot \nabla_x f(x_{t-1}, y_{t-1})$, $\; y_{t+1} = y_t + \eta \cdot \nabla_y f(x_t, y_t) - \frac{\eta}{2} \cdot \nabla_y f(x_{t-1}, y_{t-1})$
  Figure: trajectories of both dynamics, marking the starting point and the min-max equilibrium

  16. Negative Momentum: convergence
  β€’ Optimistic gradient descent-ascent (OGDA) dynamics:
  $\forall t: \; x_{t+1} = x_t - \eta \cdot \nabla_x f(x_t, y_t) + \frac{\eta}{2} \cdot \nabla_x f(x_{t-1}, y_{t-1})$
  $\qquad\; y_{t+1} = y_t + \eta \cdot \nabla_y f(x_t, y_t) - \frac{\eta}{2} \cdot \nabla_y f(x_{t-1}, y_{t-1})$
  β€’ [Daskalakis-Ilyas-Syrgkanis-Zeng ICLR'18]: OGDA exhibits last-iterate convergence for unconstrained bilinear games: $\min_{x \in \mathbb{R}^n} \max_{y \in \mathbb{R}^m} f(x, y) = x^T A y + b^T x + c^T y$
  β€’ [Liang-Stokes '18]: …the convergence rate is geometric if $A$ is well-conditioned; extends to strongly convex-concave functions $f(x, y)$
  β€’ E.g. in the previous isotropic Gaussian case: $X \sim \mathcal{N}((3,4)^T, I_{2\times 2})$, $G_\theta(z) = \theta + z$, $D_w(\cdot) = \langle w, \cdot \rangle$
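A small sanity check of the last-iterate claim on the earlier 2-D bilinear example: with the same (illustrative) step size, the optimistic correction turns the diverging spiral of plain GDA into a trajectory that converges to the equilibrium $\theta^* = (3,4)$, $w^* = 0$.

```python
# Optimistic gradient descent-ascent (negative momentum) on f(theta, w) = w^T ((3,4) - theta),
# the same bilinear game where plain GDA diverges. Step size and iteration count are illustrative.
import numpy as np

mu = np.array([3.0, 4.0])
eta = 0.1

def grads(theta, w):
    return -w, mu - theta                # (d f / d theta, d f / d w)

theta, w = np.zeros(2), np.array([1.0, 1.0])
prev_gt, prev_gw = grads(theta, w)
for t in range(10_000):
    gt, gw = grads(theta, w)
    # x_{t+1} = x_t - eta * grad_t + (eta/2) * grad_{t-1}, and the mirror image for the max player
    theta = theta - eta * gt + (eta / 2) * prev_gt
    w = w + eta * gw - (eta / 2) * prev_gw
    prev_gt, prev_gw = gt, gw

print(np.linalg.norm(theta - mu), np.linalg.norm(w))   # both tend to 0: last-iterate convergence
```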

  17. Negative Momentum: convergence
  β€’ Optimistic gradient descent-ascent (OGDA) dynamics:
  $\forall t: \; x_{t+1} = x_t - \eta \cdot \nabla_x f(x_t, y_t) + \frac{\eta}{2} \cdot \nabla_x f(x_{t-1}, y_{t-1})$
  $\qquad\; y_{t+1} = y_t + \eta \cdot \nabla_y f(x_t, y_t) - \frac{\eta}{2} \cdot \nabla_y f(x_{t-1}, y_{t-1})$
  β€’ [Daskalakis-Ilyas-Syrgkanis-Zeng ICLR'18]: OGDA exhibits last-iterate convergence for unconstrained bilinear games: $\min_{x \in \mathbb{R}^n} \max_{y \in \mathbb{R}^m} f(x, y) = x^T A y + b^T x + c^T y$
  β€’ [Liang-Stokes '18]: …the convergence rate is geometric if $A$ is well-conditioned; extends to strongly convex-concave functions $f(x, y)$
  β€’ [Daskalakis-Panageas ITCS'18]: projected OGDA exhibits last-iterate convergence even for constrained bilinear games: $\min_{x \in \Delta^n} \max_{y \in \Delta^m} x^T A y$ (this captures all of linear programming)

  18. Negative Momentum: in the Wild
  β€’ Can try optimism for non convex-concave min-max objectives $f(x, y)$
  β€’ Issue [Daskalakis, Panageas NeurIPS'18]: no hope that the stable points of OGDA or GDA are only local min-max points
  β€’ E.g. $f(x, y) = -\frac{1}{8} x^2 - \frac{1}{2} y^2 + \frac{6}{10} \, x y$
  Figure: gradient descent-ascent field for this objective
  β€’ Nested-ness: Local Min-Max βŠ† Stable Points of GDA βŠ† Stable Points of OGDA
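For this particular objective the claim can be checked directly: the Jacobian of the (continuous-time) GDA field at the origin has eigenvalues with negative real parts, so the origin attracts GDA, yet $f$ is concave in $x$ there, so the origin is not a local min-max point. A short numerical check of these two facts:

```python
# Check that (0, 0) is a stable point of gradient descent-ascent for
# f(x, y) = -x^2/8 - y^2/2 + (6/10) x y, yet is not a local min-max point.
import numpy as np

# Continuous-time GDA field: (dx/dt, dy/dt) = (-df/dx, +df/dy),
# with df/dx = -x/4 + 0.6 y and df/dy = -y + 0.6 x.
J = np.array([[0.25, -0.6],    # Jacobian of the GDA field at the origin
              [0.60, -1.0]])

print(np.linalg.eigvals(J))    # roughly -0.2 and -0.55: both negative, so the origin attracts GDA

d2f_dx2 = -0.25                # f is strictly concave in x at the origin ...
print(d2f_dx2 < 0)             # ... so x = 0 is not a local min over x: (0,0) is not local min-max
```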

  19. Negative Momentum: in the Wild
  β€’ Can try optimism for non convex-concave min-max objectives $f(x, y)$
  β€’ Issue [Daskalakis, Panageas NeurIPS'18]: no hope that the stable points of OGDA or GDA are only local min-max points
  β€’ Local Min-Max βŠ† Stable Points of GDA βŠ† Stable Points of OGDA; see also [Adolphs et al. '18] for the left inclusion
  β€’ Question: identify a first-order method that converges to local min-max points with probability 1
  β€’ While this is pending, evaluate optimism in practice…
  β€’ [Daskalakis-Ilyas-Syrgkanis-Zeng ICLR'18]: propose Optimistic Adam
  β€’ Adam, a variant of gradient descent proposed by [Kingma-Ba ICLR'15], has found wide adoption in deep learning, although it doesn't always converge [Reddi-Kale-Kumar ICLR'18]
  β€’ Optimistic Adam is the right adaptation of Adam to "undo some of the past gradients"
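The exact pseudocode for Optimistic Adam is in the ICLR'18 paper; what follows is only a rough sketch of the underlying idea, grafting the "undo part of yesterday's step" correction onto Adam-style moment estimates. The coefficients, bias correction, and function name are assumptions here and may differ from the paper's algorithm; in the GAN setting one such update would be run for the generator (descent) and one, with flipped gradient sign, for the discriminator (ascent).

```python
# Rough sketch (NOT the paper's exact pseudocode) of an "optimistic" Adam-style update:
# take the usual Adam step and partially undo the previous one. Coefficients and bias
# correction are assumptions following the standard Adam recipe.
import numpy as np

def optimistic_adam_sketch(grad_fn, x0, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8, steps=1000):
    x = np.array(x0, dtype=float)
    m = np.zeros_like(x)           # first-moment (momentum) estimate
    v = np.zeros_like(x)           # second-moment estimate
    prev_step = np.zeros_like(x)   # previous Adam step, part of which gets undone
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)  # bias correction, as in standard Adam
        v_hat = v / (1 - b2 ** t)
        step = eta * m_hat / (np.sqrt(v_hat) + eps)
        x = x - 2.0 * step + prev_step   # optimistic update: -2 * step_t + step_{t-1}
        prev_step = step
    return x

# e.g. minimizing a toy quadratic: optimistic_adam_sketch(lambda x: 2 * x, x0=[5.0, -3.0])
```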

  20. Optimistic Adam on CIFAR10
  β€’ Compare Adam and Optimistic Adam, trained on CIFAR10, in terms of Inception Score
  β€’ No fine-tuning for Optimistic Adam; the same hyper-parameters were used for both algorithms, as suggested in Gulrajani et al. (2017)
