

slide-1
SLIDE 1

SVRE: NEW METHOD FOR TRAINING GANS

GAUTHIER GIDEL

Mila, Université de Montréal
Research intern at Element AI

GENERATIVE MODELING AND MODEL-BASED REASONING FOR ROBOTICS AND AI WORKSHOP

June 14, 2019

slide-2
SLIDE 2

REDUCING NOISE IN GAN TRAINING WITH VARIANCE REDUCED EXTRAGRADIENT

TATJANA CHAVDAROVA*   GAUTHIER GIDEL*   FRANÇOIS FLEURET   SIMON LACOSTE-JULIEN

* Equal contribution

slide-3
SLIDE 3

GENERATIVE ADVERSARIAL NETWORKS

[Goodfellow et al., 2014]

Gauthier Gidel Generative Modeling and Model-Based Reasoning for Robotics and AI Workshop 3 / 17


slide-5
SLIDE 5

CHALLENGES

  • Standard supervised learning:

min_θ L(θ)

  • GANs: Hard (different) optimization problem: minimax.

Image source: Vaishnavh Nagarajan
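The contrast between the two problems can be seen on the simplest minimax example, min_θ max_φ θφ, whose equilibrium is (0, 0). A minimal numpy sketch (the bilinear game, step size, and starting point are illustrative choices, not from the slides): plain gradient descent on a single objective contracts, while simultaneous gradient descent/ascent on the game spirals away from the equilibrium.

```python
import numpy as np

eta = 0.1

# (1) Single-objective minimization, L(theta) = theta**2 / 2:
# plain gradient descent contracts toward the minimizer 0.
theta_min = 1.0
for _ in range(100):
    theta_min -= eta * theta_min

# (2) Bilinear minimax game min_theta max_phi theta*phi, equilibrium (0, 0):
# simultaneous gradient descent/ascent spirals away from it.
theta, phi = 1.0, 1.0
for _ in range(100):
    d_theta, d_phi = phi, theta       # grad_theta = phi, grad_phi = theta
    theta, phi = theta - eta * d_theta, phi + eta * d_phi

print(abs(theta_min))                 # tiny: minimization converged
print(np.hypot(theta, phi))           # grew past its initial sqrt(2): diverged
```

Each simultaneous step multiplies the distance to the equilibrium by √(1 + η²) > 1, which is exactly the "different optimization problem" the slide points at.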


slide-6
SLIDE 6

“NOISE”: NOISY GRADIENT ESTIMATES

DUE TO STOCHASTICITY

  • Using sub-samples (mini-batches) of the full dataset to update the parameters
  • Variance Reduced (VR) Gradient: optimization methods that reduce such noise

[Figure: single-objective minimization over (θ, φ): the batch-method direction vs. the noisy stochastic-method direction.]
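The noise the bullets refer to can be quantified: the variance of a mini-batch average of i.i.d. per-example gradients shrinks as 1/B in the batch size B. A quick numerical check with synthetic scalar "gradients" (the distribution and batch sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
grads = rng.normal(loc=1.0, scale=2.0, size=100_000)  # per-example "gradients"

def minibatch_variance(batch_size, n_trials=2000):
    """Empirical variance of the mini-batch mean gradient."""
    means = [rng.choice(grads, size=batch_size).mean() for _ in range(n_trials)]
    return np.var(means)

v1, v64 = minibatch_variance(1), minibatch_variance(64)
print(v1 / v64)  # roughly 64: variance scales as 1/batch_size
```

Variance-reduced methods aim at the same effect without paying the cost of a 64× larger batch.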


slide-7
SLIDE 7

VARIANCE REDUCTION–MOTIVATION FOR GAMES

  • Intuitively: minimization vs. game (noise from the stochastic gradient)
  • Empirically: BigGAN: "increased batch size significantly improves performance"
  • To sum up, two issues:

[Figure: minimization vs. game over (θ, φ): with noise, the minimization direction is still "approximately" the right direction, while in a game the direction with noise can be "bad".]


slide-8
SLIDE 8

VARIANCE REDUCTION–MOTIVATION FOR GAMES

  • Intuitively: minimization vs. game (noise from the stochastic gradient)
  • Empirically: BigGAN: "increased batch size significantly improves performance"
  • To sum up, two issues:

Brock et al. [2018] report a relative improvement of 46% on the Inception Score metric [Salimans et al., 2016] on ImageNet when the mini-batch size is increased 8-fold.


slide-9
SLIDE 9

VARIANCE REDUCTION–MOTIVATION FOR GAMES

  • Intuitively: minimization vs. game (noise from the stochastic gradient)
  • Empirically: BigGAN: "increased batch size significantly improves performance"
  • To sum up, two issues:
    • Adversarial aspect from min-max → Extragradient.
    • Noise from the stochastic gradient → Variance Reduction.


slide-10
SLIDE 10

EXTRAGRADIENT


slide-11
SLIDE 11

EXTRAGRADIENT

Two players θ and ϕ. Idea: perform a "lookahead step".

Extrapolation:

θt+1/2 = θt − η∇θLG(θt, ϕt)
ϕt+1/2 = ϕt − η∇ϕLD(θt, ϕt)

Update:

θt+1 = θt − η∇θLG(θt+1/2, ϕt+1/2)
ϕt+1 = ϕt − η∇ϕLD(θt+1/2, ϕt+1/2)
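On a bilinear game, this lookahead step is exactly what restores convergence. A minimal numpy sketch of the two-step update above on the illustrative game min_θ max_φ θφ (so LG = θφ and LD = −θφ; game and step size are assumptions, not from the slides):

```python
import numpy as np

eta = 0.1
theta, phi = 1.0, 1.0   # min_theta max_phi theta*phi; equilibrium (0, 0)
for _ in range(1000):
    # Extrapolation: lookahead half-step from (theta_t, phi_t).
    theta_half = theta - eta * phi      # descent on L_G = theta*phi
    phi_half = phi + eta * theta        # descent on L_D = -theta*phi
    # Update: step from (theta_t, phi_t) using gradients at the half-step.
    theta, phi = theta - eta * phi_half, phi + eta * theta_half

print(np.hypot(theta, phi))  # shrinks toward 0: extragradient converges
```

Compare with the simultaneous update, which on the same game spirals outward at every step.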


slide-12
SLIDE 12

VARIANCE REDUCED GRADIENT METHODS


slide-13
SLIDE 13

VARIANCE REDUCED ESTIMATE OF THE GRADIENT

Based on the finite-sum assumption L(ω) = (1/n) Σi=1..n L(xi, ω), an epoch-based algorithm:

  • Save the full gradient (1/n) Σi ∇L(xi, ωS) and the snapshot ωS.
  • For one epoch use the update rule:

ω ← ω − η ( ∇L(xi, ω) − ∇L(xi, ωS) + (1/n) Σj ∇L(xj, ωS) )

where ∇L(xi, ω) is the stochastic gradient and the last two terms are a correction using the saved past iterate (snapshot).

  • Requires 2 stochastic gradients per step (at the current point and at the snapshot).
  • If ωS is close to ω → the estimate is close to the full-batch gradient → small variance.
  • The full-batch gradient is expensive but tractable, e.g., compute it once per pass.
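A minimal sketch of this variance-reduced estimator on a 1-D least-squares problem (the data, step size, and epoch count are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 3.0 * x + 0.1 * rng.normal(size=n)   # L_i(w) = 0.5 * (x_i*w - y_i)**2

def grad(i, w):
    """Per-example gradient of L_i at w."""
    return x[i] * (x[i] * w - y[i])

eta, w = 0.01, 0.0
for epoch in range(20):
    w_snap = w                                            # snapshot omega_S
    full_grad = np.mean([grad(i, w_snap) for i in range(n)])
    for _ in range(n):                                    # one epoch
        i = rng.integers(n)
        # stochastic gradient + correction using the saved snapshot
        w -= eta * (grad(i, w) - grad(i, w_snap) + full_grad)

print(w)  # close to the least-squares solution, near 3.0
```

Note that the correction has zero mean over i, so the estimator is unbiased; its variance vanishes as w approaches the snapshot.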



slide-17
SLIDE 17

SVRE: VARIANCE REDUCTION + EXTRAGRADIENT

PSEUDO–ALGORITHM

  1. Save the snapshot ωS ← ωt and compute the full gradient (1/n) Σi ∇L(xi, ωS).
  2. For i in 1, . . . , epoch_length:
     • Compute ωt+1/2 with variance reduced gradients at ωt.
     • Compute ωt+1 with variance reduced gradients at ωt+1/2.
     • t ← t + 1
  3. Repeat until convergence.
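The steps above can be sketched on a stochastic bilinear toy game (the game, coefficients, and step size are illustrative assumptions, not the paper's GAN setting): the SVRG-style estimator supplies the variance-reduced field, and each inner iteration performs one extragradient (extrapolation + update) step.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
a = 1.0 + 0.5 * rng.random(n)   # per-sample coefficients
a_bar = a.mean()

# Stochastic bilinear game min_theta max_phi (1/n) sum_i a_i*theta*phi,
# equilibrium (0, 0). F_i is the per-sample simultaneous-gradient field.
def F(i, th, ph):
    return a[i] * ph, -a[i] * th        # (descent dir for theta, for phi)

def vr(i, th, ph, snap, mu):
    """SVRG-style variance-reduced field at (th, ph)."""
    g = F(i, th, ph)
    g_s = F(i, *snap)
    return g[0] - g_s[0] + mu[0], g[1] - g_s[1] + mu[1]

eta, th, ph = 0.1, 1.0, 1.0
for epoch in range(30):
    snap = (th, ph)                     # 1. snapshot
    mu = (a_bar * ph, -a_bar * th)      #    full field at the snapshot
    for _ in range(n):                  # 2. one epoch
        i, j = rng.integers(n), rng.integers(n)
        v = vr(i, th, ph, snap, mu)     # extrapolation with VR field
        th_h, ph_h = th - eta * v[0], ph - eta * v[1]
        w = vr(j, th_h, ph_h, snap, mu) # update with VR field at lookahead
        th, ph = th - eta * w[0], ph - eta * w[1]

print(np.hypot(th, ph))  # near 0: SVRE converges on the stochastic game
```

Plain stochastic simultaneous updates diverge on this game; here the extragradient handles the adversarial rotation and the snapshot correction handles the sampling noise.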



slide-22
SLIDE 22

SVRE: VARIANCE REDUCTION + EXTRAGRADIENT


SVRE yields the fastest convergence rate for strongly convex stochastic game optimization in the literature.


slide-23
SLIDE 23

SVRE: EXPERIMENTS


slide-24
SLIDE 24

EXPERIMENTS

SVRE YIELDS STABLE GAN OPTIMIZATION

Stochastic baseline

[Figure: Fréchet Inception Distance (50–300) vs. iterations (1–5 ×10⁵), across step sizes ηG ∈ {1×10⁻⁴, 5×10⁻⁵}, ηD ∈ {2×10⁻⁴, 4×10⁻⁴} with update ratios r ∈ {1:1, 1:2}, and ηG = 1×10⁻³, ηD ∈ {4×10⁻³, 5×10⁻³} over seeds s ∈ {1, 2, 3, 4}.]

− Always diverges.
− Many hyperparameters (ηG, ηD, β1, γ, r).
+ If it converges, convergence is fast.


slide-25
SLIDE 25

EXPERIMENTS

SVRE YIELDS STABLE GAN OPTIMIZATION

Stochastic baseline vs. SVRE

[Figure: Fréchet Inception Distance (50–300) vs. iterations (1–5 ×10⁵). Left, stochastic baseline: ηG ∈ {1×10⁻⁴, 5×10⁻⁵}, ηD ∈ {2×10⁻⁴, 4×10⁻⁴}, update ratios r ∈ {1:1, 1:2}. Right, SVRE: ηG = 1×10⁻³, ηD ∈ {4×10⁻³, 5×10⁻³}, seeds s ∈ {1, 2, 3, 4}.]

Stochastic baseline:
− Always diverges.
− Many hyperparameters (ηG, ηD, β1, γ, r).
+ If it converges, convergence is fast.

SVRE:
+ Does not diverge.
+ Fewer hyperparameters (omits β1, γ, r).
− Slower for very deep nets.


slide-26
SLIDE 26

SVRE: TAKEAWAYS


slide-27
SLIDE 27

SVRE: TAKEAWAYS

  • Controlling variance is more critical for games (this could be the reason behind the success of Adam on GANs).
  • SVRE combines extragradient and variance reduction.
  • Best convergence rate for games (under some assumptions).
  • Good stability properties.


slide-28
SLIDE 28

THANKS.

Questions?


slide-29
SLIDE 29

REFERENCES I

  • A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv e-prints, September 2018.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680, 2014.
  • T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.

slide-30
SLIDE 30

APPENDIX


slide-31
SLIDE 31

THE GAN FRAMEWORK

EQUILIBRIUM AT pg = pd

The discriminator maximizes:

V(G, D) = ∫x pd(x) log(D(x)) dx + ∫z pz(z) log(1 − D(G(z))) dz
        = ∫x [pd(x) log(D(x)) + pg(x) log(1 − D(x))] dx

where we used x = G(z), and pg is the distribution of x = G(z). Hence, the optimal discriminator D∗ is:

D∗(x) = pd(x) / (pd(x) + pg(x))
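The optimum follows pointwise: for fixed a = pd(x) > 0 and b = pg(x) > 0, the map d ↦ a log d + b log(1 − d) on (0, 1) has derivative a/d − b/(1 − d), which vanishes at d = a/(a + b). A quick numerical check, with illustrative values for a and b:

```python
import numpy as np

a, b = 0.7, 0.3                          # stand-ins for p_d(x), p_g(x)
d = np.linspace(1e-4, 1 - 1e-4, 100_000)
objective = a * np.log(d) + b * np.log(1 - d)
d_star = d[np.argmax(objective)]
print(d_star, a / (a + b))               # both near 0.7
```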


slide-32
SLIDE 32

THE GAN FRAMEWORK

EQUILIBRIUM AT pg = pd

The generator minimizes:

V(G, D∗) = Ex∼pd[log D∗(x)] + Ex∼pg[log(1 − D∗(x))]
         = Ex∼pd[log pd(x)/(pd(x) + pg(x))] + Ex∼pg[log pg(x)/(pd(x) + pg(x))]
         = − log 4 + DKL(pd ‖ (pd + pg)/2) + DKL(pg ‖ (pd + pg)/2)
         = − log 4 + 2 · DJS(pd ‖ pg)

where we used: DJS(p ‖ q) = ½ DKL(p ‖ (p + q)/2) + ½ DKL(q ‖ (p + q)/2).

The optimum is reached when pg = pd, and the optimal value is − log 4.
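The identity can be verified numerically on discrete distributions (the two distributions below are illustrative stand-ins for pd and pg):

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence."""
    return float(np.sum(p * np.log(p / q)))

pd_ = np.array([0.5, 0.3, 0.2])          # stand-in for p_d
pg_ = np.array([0.2, 0.3, 0.5])          # stand-in for p_g
m = (pd_ + pg_) / 2

# V(G, D*) computed directly from the optimal discriminator...
v = np.sum(pd_ * np.log(pd_ / (pd_ + pg_))) + np.sum(pg_ * np.log(pg_ / (pd_ + pg_)))
# ...matches -log 4 + 2 * D_JS(pd || pg):
v_js = -np.log(4) + kl(pd_, m) + kl(pg_, m)

print(np.isclose(v, v_js))               # True
print(v > -np.log(4))                    # True: value exceeds -log 4 when pd != pg
```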
