SVRE: A NEW METHOD FOR TRAINING GANS
REDUCING NOISE IN GAN TRAINING WITH VARIANCE REDUCED EXTRAGRADIENT
TATJANA CHAVDAROVA*   GAUTHIER GIDEL*   FRANÇOIS FLEURET   SIMON LACOSTE-JULIEN
* Equal contribution
Gauthier Gidel: Mila, Université de Montréal; research intern at Element AI
Generative Modeling and Model-Based Reasoning for Robotics and AI Workshop, June 14, 2019
GENERATIVE ADVERSARIAL NETWORKS
[Goodfellow et al., 2014]
CHALLENGES
- Standard supervised learning: a single-objective minimization, min_θ L(θ).
- GANs: a harder (different) optimization problem: a minimax game between two players, min_θ max_ϕ L(θ, ϕ).
Image source: Vaishnavh Nagarajan
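For concreteness, the original GAN objective [Goodfellow et al., 2014] is the two-player minimax problem below; it matches the value function V(G, D) analyzed in the appendix.

```latex
\[
\min_{G} \max_{D} \; V(G, D)
  = \mathbb{E}_{x \sim p_d}[\log D(x)]
  + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]
\]
```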
“NOISE”: NOISY GRADIENT ESTIMATES DUE TO STOCHASTICITY
- Using sub-samples (mini-batches) of the full dataset to update the parameters makes the gradient estimates noisy; see the sketch after this list.
- Variance-reduced (VR) gradient methods: optimization methods that reduce this noise.
Figure: single-objective minimization over (θ, ϕ); the batch-method direction versus the noisy stochastic-method direction.
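A quick illustration of this noise (my own toy example, not from the slides): the variance of a mini-batch gradient estimate shrinks roughly as 1/batch_size.

```python
# Toy demonstration: mini-batch gradient estimates are noisy, and their
# variance decreases roughly as 1 / batch_size.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=n)   # "dataset"
w = 0.5                  # scalar parameter, loss L(w) = mean((w * x)^2) / 2

def minibatch_grad(batch_size):
    # Stochastic estimate of dL/dw = mean(w * x^2) over a random mini-batch.
    idx = rng.integers(n, size=batch_size)
    return np.mean(w * X[idx] ** 2)

for b in (1, 10, 100, 1000):
    grads = [minibatch_grad(b) for _ in range(2000)]
    print(b, np.var(grads))  # variance of the estimate shrinks ~ 1/b
```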
VARIANCE REDUCTION: MOTIVATION FOR GAMES
- Intuitively: minimization vs. game. In minimization, a noisy stochastic direction is still “approximately” the right direction; in a game, a direction with noise can be “bad”.
- Empirically: BigGAN, where “increased batch size significantly improves performance”. Brock et al. [2018] report a 46% relative improvement of the Inception Score metric [Salimans et al., 2016] on ImageNet when the mini-batch size is increased 8-fold.
- To sum up, two issues:
  - Adversarial aspect from min-max → extragradient.
  - Noise from the stochastic gradient → variance reduction.
EXTRAGRADIENT
EXTRAGRADIENT
Two players θ, ϕ. Idea: perform a “Lookahead step” Extrapolation:
θt+1/2 = θt − η∇θLG(θt, ϕt)
ϕt+1/2 = ϕt − η∇ϕLD(θt, ϕt) Update:
θt+1 = θt − η∇θLG(θt+1/2, ϕt+1/2)
ϕt+1 = ϕt − η∇ϕLD(θt+1/2, ϕt+1/2)
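A minimal sketch of why the lookahead matters (a toy illustration using the zero-sum special case L_G = −L_D = θϕ, not the authors' code): on the bilinear game min_θ max_ϕ θϕ, simultaneous gradient steps spiral away from the equilibrium at (0, 0), while extragradient converges to it.

```python
# Toy comparison: simultaneous gradient steps vs. extragradient on the
# bilinear game min_theta max_phi theta * phi (equilibrium at (0, 0)).
import numpy as np

def sim_gd_step(theta, phi, eta):
    # Both players step using gradients at the CURRENT point; on this game
    # the iterates spiral outward (the norm grows at every step).
    return theta - eta * phi, phi + eta * theta

def extragradient_step(theta, phi, eta):
    # Extrapolation: a lookahead half-step from (theta, phi).
    theta_h = theta - eta * phi
    phi_h = phi + eta * theta
    # Update: step from the ORIGINAL point, with gradients at the half-step.
    return theta - eta * phi_h, phi + eta * theta_h

theta_gd = phi_gd = theta_eg = phi_eg = 1.0
for _ in range(100):
    theta_gd, phi_gd = sim_gd_step(theta_gd, phi_gd, eta=0.1)
    theta_eg, phi_eg = extragradient_step(theta_eg, phi_eg, eta=0.1)

print(np.hypot(theta_gd, phi_gd))  # grows: simultaneous steps diverge
print(np.hypot(theta_eg, phi_eg))  # shrinks toward 0: extragradient converges
```

The key design choice: the update restarts from (θ_t, ϕ_t) but uses gradients evaluated at the extrapolated point, which dampens the rotational dynamics typical of games.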
VARIANCE REDUCED GRADIENT METHODS
VARIANCE REDUCED ESTIMATE OF THE GRADIENT
Based on the finite-sum assumption L(ω) = (1/n) Σ_{i=1}^n L(x_i, ω). Epoch-based algorithm:
- Save the full gradient (1/n) Σ_i ∇L(x_i, ω^S) and the snapshot ω^S.
- For one epoch, use the update rule
    ω ← ω − η [ ∇L(x_i, ω) − ∇L(x_i, ω^S) + (1/n) Σ_j ∇L(x_j, ω^S) ]
  i.e., the stochastic gradient plus a correction using the saved past iterate.
- Requires 2 stochastic gradients per step (at the current point and at the snapshot).
- If ω^S is close to ω → the estimate is close to the full-batch gradient → small variance.
- The full-batch gradient is expensive but tractable, e.g., computed once per pass.
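A minimal sketch of this variance-reduced estimate on a toy least-squares problem (the loss, data, and names here are my own choices for illustration, not the paper's code):

```python
# SVRG-style training loop: one full-batch gradient per epoch (at the
# snapshot), plus two stochastic gradients per inner step, as above.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def grad_i(w, i):
    # Gradient of the i-th loss L(x_i, w) = 0.5 * (x_i . w - y_i)^2.
    return (X[i] @ w - y[i]) * X[i]

def full_grad(w):
    return X.T @ (X @ w - y) / n

w, eta = np.zeros(d), 0.01
for epoch in range(20):
    w_snap = w.copy()        # snapshot omega^S
    mu = full_grad(w_snap)   # full gradient at the snapshot, once per epoch
    for _ in range(n):
        i = rng.integers(n)
        # Variance-reduced estimate: stochastic gradient at w, corrected by
        # the same sample's gradient at the snapshot plus the full gradient.
        g = grad_i(w, i) - grad_i(w_snap, i) + mu
        w -= eta * g

print(np.linalg.norm(full_grad(w)))  # close to 0: near the minimizer
```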
SVRE: VARIANCE REDUCTION + EXTRAGRADIENT
PSEUDO-ALGORITHM
1. Save the snapshot ω^S ← ω_t and compute the full gradient (1/n) Σ_i ∇L(x_i, ω^S).
2. For i in 1, ..., epoch_length:
   - Compute ω_{t+1/2} with variance-reduced gradients at ω_t (extrapolation).
   - Compute ω_{t+1} with variance-reduced gradients at ω_{t+1/2} (update).
   - t ← t + 1.
3. Repeat until convergence.
SVRE yields the fastest convergence rate for strongly convex stochastic game optimization in the literature.
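Putting the two pieces together, here is a minimal sketch of an SVRE-style loop on a toy finite-sum bilinear game. The toy game and all names are my own illustration of the structure above, not the authors' implementation.

```python
# SVRE sketch: extragradient steps driven by SVRG-style variance-reduced
# estimates of the game operator, with one snapshot per epoch.
import numpy as np

rng = np.random.default_rng(1)
n = 100
a = 1.0 + 0.5 * rng.normal(size=n)   # per-sample coefficients for L_i = a_i * theta * phi

def op_i(theta, phi, i):
    # Stochastic game operator: (grad wrt theta, minus grad wrt phi),
    # since phi is the maximizing player.
    return np.array([a[i] * phi, -a[i] * theta])

def full_op(theta, phi):
    # Full finite-sum operator, computed once per epoch at the snapshot.
    return np.array([a.mean() * phi, -a.mean() * theta])

def vr_op(theta, phi, snap_theta, snap_phi, mu, i):
    # Variance-reduced estimate of the operator at (theta, phi).
    return op_i(theta, phi, i) - op_i(snap_theta, snap_phi, i) + mu

theta, phi, eta = 1.0, 1.0, 0.1
for epoch in range(20):
    snap_theta, snap_phi = theta, phi      # snapshot omega^S
    mu = full_op(snap_theta, snap_phi)     # full operator at the snapshot
    for _ in range(n):
        i, j = rng.integers(n), rng.integers(n)
        # Extrapolation: lookahead half-step with a VR estimate at (theta, phi).
        g = vr_op(theta, phi, snap_theta, snap_phi, mu, i)
        theta_h, phi_h = theta - eta * g[0], phi - eta * g[1]
        # Update: step from the ORIGINAL point with a VR estimate at the half-step.
        g_h = vr_op(theta_h, phi_h, snap_theta, snap_phi, mu, j)
        theta, phi = theta - eta * g_h[0], phi - eta * g_h[1]

print(theta, phi)  # both should approach the equilibrium at (0, 0)
```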
SVRE: EXPERIMENTS
EXPERIMENTS
SVRE YIELDS STABLE GAN OPTIMIZATION
Figure: Fréchet Inception Distance versus iterations (up to 5×10⁵). Left, stochastic baseline: η_G ∈ {1×10⁻⁴, 5×10⁻⁵}, η_D ∈ {2×10⁻⁴, 4×10⁻⁴}, ratios r ∈ {1:1, 1:2}. Right, SVRE: η_G = 1×10⁻³, η_D ∈ {4×10⁻³, 5×10⁻³}, s ∈ {1, 2, 3, 4}.
Stochastic baseline:
  − Always diverges.
  − Many hyperparameters (η_G, η_D, β_1, γ, r).
  + If it converges, convergence is fast.
SVRE:
  + Does not diverge.
  + Fewer hyperparameters (omits β_1, γ, r).
  − Slower for very deep nets.
SVRE: TAKEAWAYS
SVRE: TAKEAWAYS
- Controlling variance is more critical for games than for minimization (this could be a reason behind the success of Adam on GANs).
- SVRE combines extragradient with variance reduction.
- Best known convergence rate (under some assumptions) for games.
- Good stability properties.
THANKS.
Questions?
REFERENCES
- A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv e-prints, September 2018.
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680, 2014.
- T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
APPENDIX
THE GAN FRAMEWORK
EQUILIBRIUM AT p_g = p_d
The discriminator maximizes:
  V(G, D) = ∫_x p_d(x) log(D(x)) dx + ∫_z p_z(z) log(1 − D(G(z))) dz
          = ∫_x [ p_d(x) log(D(x)) + p_g(x) log(1 − D(x)) ] dx
where we used x = G(z), and p_g is the distribution of G(z). Hence, the optimal discriminator D* is:
  D*(x) = p_d(x) / (p_d(x) + p_g(x))
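The last step is a pointwise maximization; spelled out for completeness (a standard step not on the slide): for fixed x, the integrand has the form f(D) = a log D + b log(1 − D) with a = p_d(x) and b = p_g(x).

```latex
\[
f'(D) = \frac{a}{D} - \frac{b}{1 - D} = 0
\;\Longrightarrow\; a(1 - D) = bD
\;\Longrightarrow\; D^* = \frac{a}{a + b}
= \frac{p_d(x)}{p_d(x) + p_g(x)}.
\]
```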
THE GAN FRAMEWORK
EQUILIBRIUM AT p_g = p_d
The generator minimizes:
  V(G, D*) = E_{x∼p_d}[log D*(x)] + E_{x∼p_g}[log(1 − D*(x))]
           = E_{x∼p_d}[log (p_d(x) / (p_d(x) + p_g(x)))] + E_{x∼p_g}[log (p_g(x) / (p_d(x) + p_g(x)))]
           = − log 4 + D_KL(p_d ‖ (p_d + p_g)/2) + D_KL(p_g ‖ (p_d + p_g)/2)
           = − log 4 + 2 · D_JS(p_d ‖ p_g)
where we used D_JS(p ‖ q) = (1/2) D_KL(p ‖ (p+q)/2) + (1/2) D_KL(q ‖ (p+q)/2).
The optimum is reached when p_g = p_d, and the optimal value is − log 4.
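Where the − log 4 comes from (a standard step, spelled out here for completeness): insert a factor 1/2 inside each logarithm.

```latex
\[
\mathbb{E}_{x \sim p_d}\!\left[\log \frac{p_d(x)}{p_d(x) + p_g(x)}\right]
= \mathbb{E}_{x \sim p_d}\!\left[\log \frac{p_d(x)}{\tfrac{1}{2}(p_d(x) + p_g(x))}\right] - \log 2
= D_{\mathrm{KL}}\!\left(p_d \,\middle\|\, \tfrac{p_d + p_g}{2}\right) - \log 2,
\]
```

and likewise for the p_g term; summing the two − log 2 contributions gives the − log 4.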