Generative Adversarial Networks
Mostly adapted from Goodfellow's 2016 NIPS tutorial: https://arxiv.org/pdf/1701.00160.pdf
Story so far: why generative models? Unsupervised learning means we have more training data. Some problems have latent variables, and the data may lie on a manifold of high dimensional data; this manifold can provide insight into high dimensional observations.
Bishop - Pattern Recognition and Machine Learning
$p(\mathbf{x}_n \mid \mathbf{W}, \boldsymbol{\mu}, \sigma) = \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}, \sigma^2\mathbf{I} + \mathbf{W}\mathbf{W}^\top)$, so that the covariance matrix of the data distribution is broken into 2 terms.
$\mathrm{KL}\!\left(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x}, \theta)\right) = \mathrm{KL}\!\left(q(\mathbf{z}) \,\|\, p(\mathbf{x}, \mathbf{z} \mid \theta)\right) + \log p(\mathbf{x} \mid \theta)$
$\log p(\mathbf{x} \mid \theta) = \mathrm{KL}\!\left(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x}, \theta)\right) - \mathrm{KL}\!\left(q(\mathbf{z}) \,\|\, p(\mathbf{x}, \mathbf{z} \mid \theta)\right)$
Define $\mathcal{L}(q, \theta) \equiv -\mathrm{KL}\!\left(q(\mathbf{z}) \,\|\, p(\mathbf{x}, \mathbf{z} \mid \theta)\right)$ (the ELBO). Then: $\log p(\mathbf{x} \mid \theta) = \mathrm{KL}\!\left(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x}, \theta)\right) + \mathcal{L}(q, \theta)$
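A one-line expansion, using $p(\mathbf{z} \mid \mathbf{x}, \theta) = p(\mathbf{x}, \mathbf{z} \mid \theta)/p(\mathbf{x} \mid \theta)$ and $\int q(\mathbf{z})\,d\mathbf{z} = 1$, shows where the decomposition comes from:
$\mathrm{KL}\!\left(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x}, \theta)\right) = \int q(\mathbf{z}) \log \frac{q(\mathbf{z})\, p(\mathbf{x} \mid \theta)}{p(\mathbf{x}, \mathbf{z} \mid \theta)}\, d\mathbf{z} = \mathrm{KL}\!\left(q(\mathbf{z}) \,\|\, p(\mathbf{x}, \mathbf{z} \mid \theta)\right) + \log p(\mathbf{x} \mid \theta)$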
E step: set $q^{(t)}(\mathbf{z}) = p(\mathbf{z} \mid \mathbf{x}, \theta^{(t-1)})$
M step: update the parameters according to $\theta^{(t)} \leftarrow \operatorname{argmax}_{\theta}\; \mathbb{E}_{q^{(t)}(\mathbf{z})}\!\left[\log p(\mathbf{x}, \mathbf{z} \mid \theta)\right]$
$\operatorname{argmax}_{\mathbf{W}, \boldsymbol{\Psi}} \, \mathbb{E}_{q^{(t)}(\mathbf{z})}\!\left[\log p(\mathbf{x}, \mathbf{z} \mid \mathbf{W}, \boldsymbol{\Psi})\right] = \operatorname{argmax}_{\mathbf{W}, \boldsymbol{\Psi}} \left[ -\frac{N}{2}\log\det(\boldsymbol{\Psi}) - \sum_{n=1}^{N}\left( \tfrac{1}{2}\mathbf{x}_n^\top \boldsymbol{\Psi}^{-1}\mathbf{x}_n - \mathbf{x}_n^\top \boldsymbol{\Psi}^{-1}\mathbf{W}\,\mathbb{E}_{q^{(t)}}[\mathbf{z}_n] + \tfrac{1}{2}\operatorname{tr}\!\left(\mathbf{W}^\top \boldsymbol{\Psi}^{-1}\mathbf{W}\,\mathbb{E}_{q^{(t)}}[\mathbf{z}_n\mathbf{z}_n^\top]\right)\right)\right]$
Note that the expectations $\mathbb{E}_{q^{(t)}}[\mathbf{z}_n]$ and $\mathbb{E}_{q^{(t)}}[\mathbf{z}_n\mathbf{z}_n^\top]$ are all we need to compute in the E step.
The ELBO $\mathcal{L}(q, \theta)$ takes as arguments a probability distribution (functional) $q$ and the parameters $\theta$. A fully Bayesian treatment makes no distinction between the latent variables and parameters of a distribution: it puts a probability distribution on the parameters $\theta$, then absorbs them into $\mathbf{z}$, so that only latent variables remain and there are no explicit parameters.
Networks output the variational parameters $\boldsymbol{\mu}(\mathbf{x}_n)$, $\boldsymbol{\sigma}(\mathbf{x}_n)$, with either $\widetilde{\mathcal{L}}^A$ or $\widetilde{\mathcal{L}}^B$ as the loss.
VAEs combine back-propagation with an MC procedure to approximately run EM on the ELBO.
The reparameterization trick allows the gradient to flow through the network:
$p(\mathbf{x}_n \mid \mathbf{z}_n, \theta)$ with $\mathbf{z}_n = g(\boldsymbol{\epsilon}_n, \mathbf{x}_n, \phi)$, $\boldsymbol{\epsilon}_n \sim p(\boldsymbol{\epsilon})$
The noise distribution $p(\boldsymbol{\epsilon})$ is simple, for example uniform, Gaussian, or even isotropic Gaussian.
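As a concrete illustration (not from the slides), here is a minimal PyTorch sketch of the reparameterization trick with a toy Gaussian encoder/decoder; the layer sizes and the names enc_mu, enc_logvar, and decoder are assumptions made for this example:

```python
import torch
import torch.nn as nn

x_dim, z_dim = 784, 16                      # assumed sizes
enc_mu = nn.Linear(x_dim, z_dim)            # outputs mu(x_n)
enc_logvar = nn.Linear(x_dim, z_dim)        # outputs log sigma^2(x_n)
decoder = nn.Linear(z_dim, x_dim)           # parameters theta of p(x_n | z_n, theta)

x = torch.rand(32, x_dim)                   # stand-in minibatch of data

mu, logvar = enc_mu(x), enc_logvar(x)
eps = torch.randn_like(mu)                  # eps_n ~ p(eps): simple isotropic Gaussian
z = mu + torch.exp(0.5 * logvar) * eps      # z_n = g(eps_n, x_n, phi): differentiable in phi

# Negative ELBO: reconstruction term plus KL(q(z|x) || N(0, I)) in closed form
recon = decoder(z)
recon_loss = ((recon - x) ** 2).sum(dim=1).mean()
kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
(recon_loss + kl).backward()                # gradients reach phi through mu and logvar
```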
https://blog.openai.com/generative-models/
Samples from VAEs tend to look blurry. There are several candidate explanations for this: the maximum likelihood objective itself, the restricted family of distributions used for the model, or the variational approximation.
https://arxiv.org/pdf/1701.00160.pdf
The tutorial suggests that this is not actually the problem: some models train with maximum likelihood and still generate sharp examples, for example autoregressive models:
$p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})$
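A toy illustration of this chain-rule factorization, with a made-up logistic conditional over binary variables standing in for a real autoregressive model such as PixelRNN:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
# Hypothetical weights for p(x_i = 1 | x_1, ..., x_{i-1}); a real model would learn these.
W = rng.normal(scale=0.5, size=(n, n))

def cond_prob(x, i):
    """p(x_i = 1 | x_1, ..., x_{i-1}) as a logistic function of the prefix."""
    logit = W[i, :i] @ x[:i] if i > 0 else 0.0
    return 1.0 / (1.0 + np.exp(-logit))

def sample():
    """Ancestral sampling: draw x_1, then x_2 | x_1, and so on."""
    x = np.zeros(n)
    for i in range(n):
        x[i] = rng.random() < cond_prob(x, i)
    return x

def log_prob(x):
    """log p(x) = sum_i log p(x_i | x_1, ..., x_{i-1})."""
    total = 0.0
    for i in range(n):
        p = cond_prob(x, i)
        total += np.log(p if x[i] == 1 else 1.0 - p)
    return total

x = sample()
print(x, log_prob(x))
```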
[Figure: real samples vs. generated samples, Boundary Equilibrium GAN / Energy-Based GAN] https://arxiv.org/pdf/1703.10717.pdf
Generator (counterfeiter): creates fake data from random input.
Discriminator (detective): distinguishes real data from fake data.
The generator learns what real data looks like, i.e. the data distribution, implicitly. The noise $\mathbf{z}$ is the generator's input, and the probability distribution it defines is conditioned on this input; $\mathbf{z}$ acts like a latent code, i.e. a point on the manifold.
Generator: trained to get better and better at fooling the discriminator (making fake data look real).
Discriminator: trained to get better and better at distinguishing real data from fake data.
Generator: $G(\mathbf{z}; \theta^{(G)})$, a differentiable function $G$ (here having parameters $\theta^{(G)}$) mapping from the latent space $\mathbb{R}^m$ to the data space $\mathbb{R}^n$.
Discriminator: $D(\mathbf{x}; \theta^{(D)})$, a differentiable function $D$ (here having parameters $\theta^{(D)}$) mapping from the data space $\mathbb{R}^n$ to a scalar between 0 and 1 representing the probability that the data is real.
Generator: $G(\mathbf{z})$. For simplicity of notation, we write $G(\mathbf{z})$ without $\theta^{(G)}$. Typically $G$ is a neural network, but it doesn't have to be. Note that $\mathbf{z}$ can go into any layer, not just the first.
Discriminator: $D(\mathbf{x})$, $D(G(\mathbf{z}))$. Note that the discriminator can also take the output of the generator as input. Typically $D$ is a neural network, but it doesn't have to be.
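A minimal sketch of $G$ and $D$ as small fully connected networks in PyTorch; the latent dimension, data dimension, and layer widths are arbitrary assumptions:

```python
import torch.nn as nn

z_dim, x_dim = 64, 784  # assumed sizes of the latent space R^m and data space R^n

# G: R^m -> R^n, differentiable, parameters theta^(G)
G = nn.Sequential(
    nn.Linear(z_dim, 256), nn.ReLU(),
    nn.Linear(256, x_dim), nn.Tanh(),   # outputs scaled to [-1, 1]
)

# D: R^n -> (0, 1), differentiable, parameters theta^(D)
D = nn.Sequential(
    nn.Linear(x_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),    # probability that the input is real
)
```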
Computation graph: $\mathbf{z} \to G(\mathbf{z})$, then either $D(G(\mathbf{z}))$ or $D(\mathbf{x})$.
Training is a game between the two players; we seek an equilibrium:
- Sample a minibatch of data $\{\mathbf{x}_1, \ldots, \mathbf{x}_m\}$ and noise $\mathbf{z}_i \sim p(\mathbf{z})$; compute $D(\mathbf{x}_i)$ and $D(G(\mathbf{z}_i))$
- $\mathbf{g}^{(D)} \leftarrow$ gradient of the discriminator loss $J^{(D)}(\theta^{(D)}, \theta^{(G)})$ w.r.t. $\theta^{(D)}$; update $\theta^{(D)}$
- Sample fresh noise $\mathbf{z}_i \sim p(\mathbf{z})$; compute $D(G(\mathbf{z}_i))$ and $D(\mathbf{x}_i)$
- $\mathbf{g}^{(G)} \leftarrow$ gradient of the generator loss $J^{(G)}(\theta^{(D)}, \theta^{(G)})$ w.r.t. $\theta^{(G)}$; update $\theta^{(G)}$
Can also run $k$ minibatches of discriminator updates before updating the generator, but Goodfellow finds $k = 1$ tends to work best.
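A sketch of this training loop in PyTorch, assuming small fully connected networks and the heuristic non-saturating losses introduced later in the deck (sizes, optimizer, and learning rates are illustrative, not prescribed by the slides):

```python
import torch
import torch.nn as nn

z_dim, x_dim = 64, 784   # assumed latent and data dimensions

G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(x_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(x_real):
    m = x_real.shape[0]
    ones, zeros = torch.ones(m, 1), torch.zeros(m, 1)

    # Discriminator: sample z_i ~ p(z), compute D(x_i) and D(G(z_i)),
    # then take a gradient step on the discriminator loss w.r.t. theta^(D).
    z = torch.randn(m, z_dim)
    x_fake = G(z).detach()                 # do not backprop into G here
    loss_D = bce(D(x_real), ones) + bce(D(x_fake), zeros)
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Generator: fresh z_i ~ p(z), gradient step on the generator loss w.r.t. theta^(G).
    z = torch.randn(m, z_dim)
    loss_G = bce(D(G(z)), ones)            # non-saturating: push fakes toward "real"
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()

print(train_step(torch.rand(128, x_dim) * 2 - 1))   # one step on stand-in data
```

Running $k$ discriminator minibatches per generator update just repeats the first block $k$ times; as noted above, $k = 1$ is usually enough.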
Notice: the only explicit probability distribution we have is the random noise distribution (the prior). The loss causes the data distribution to be learned implicitly.
$\mathbf{g}^{(D)} \leftarrow \nabla_{\theta^{(D)}} J^{(D)}(\theta^{(D)}, \theta^{(G)})$, $\quad \mathbf{g}^{(G)} \leftarrow \nabla_{\theta^{(G)}} J^{(G)}(\theta^{(D)}, \theta^{(G)})$
Update the discriminator and generator from the same pair of mini-batches
$J^{(D)}(\theta^{(D)}, \theta^{(G)}) = -\tfrac{1}{2}\,\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}\log D(\mathbf{x}) - \tfrac{1}{2}\,\mathbb{E}_{\mathbf{z}\sim p_{\mathbf{z}}}\log\!\left(1 - D(G(\mathbf{z}))\right)$
$J^{(G)}(\theta^{(D)}, \theta^{(G)}) = -J^{(D)}(\theta^{(D)}, \theta^{(G)})$
The minimax game: $\min_{G} \max_{D} \left(-J^{(D)}(\theta^{(D)}, \theta^{(G)})\right)$
(we'll come back to this)
$J^{(D)}(\theta^{(D)}, \theta^{(G)}) = -\tfrac{1}{2}\,\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}\log D(\mathbf{x}) - \tfrac{1}{2}\,\mathbb{E}_{\mathbf{z}}\log\!\left(1 - D(G(\mathbf{z}))\right)$
$= -\tfrac{1}{2}\left[\int_{\mathbf{x}} p_{\text{data}}(\mathbf{x})\log D(\mathbf{x})\, d\mathbf{x} + \int_{\mathbf{z}} p_{\mathbf{z}}(\mathbf{z})\log\!\left(1 - D(G(\mathbf{z}))\right) d\mathbf{z}\right]$
$= -\tfrac{1}{2}\int_{\mathbf{x}} \left[p_{\text{data}}(\mathbf{x})\log D(\mathbf{x}) + p_G(\mathbf{x})\log\!\left(1 - D(\mathbf{x})\right)\right] d\mathbf{x}$
$J^{(D)}(\theta^{(D)}, \theta^{(G)}) = -\tfrac{1}{2}\int_{\mathbf{x}} \left[p_{\text{data}}(\mathbf{x})\log D(\mathbf{x}) + p_G(\mathbf{x})\log\!\left(1 - D(\mathbf{x})\right)\right] d\mathbf{x}$
Take the functional derivative w.r.t. $D(\mathbf{x})$ and set it to 0, analogous to:
$\frac{d}{da}\left[p_{\text{data}}(\mathbf{x})\log a + p_G(\mathbf{x})\log(1-a)\right] = 0$
$\frac{p_{\text{data}}(\mathbf{x})}{a} - \frac{p_G(\mathbf{x})}{1-a} = 0$
$a = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_G(\mathbf{x})} \;\Rightarrow\; D^*(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_G(\mathbf{x})}$
The optimal discriminator reflects the relative probabilities of $\mathbf{x}$ under the data distribution and the generator distribution: $D^*(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_G(\mathbf{x})} = P(\text{data} \mid \mathbf{x})$
[Figure: curves of $p_{\text{data}}(\mathbf{x})$, $p_G(\mathbf{x})$, $D(\mathbf{x})$, and the optimal $D^*(\mathbf{x})$]
$J^{(D)}(\theta^{(D)}, \theta^{(G)}) = -\tfrac{1}{2}\,\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}\log D(\mathbf{x}) - \tfrac{1}{2}\,\mathbb{E}_{\mathbf{z}}\log\!\left(1 - D(G(\mathbf{z}))\right)$
This means the discriminator estimates the ratio $\frac{p_{\text{data}}(\mathbf{x})}{p_G(\mathbf{x})}$ via supervised learning.
The global optimum for this game is achieved if and only if $p_G(\mathbf{x}) = p_{\text{data}}(\mathbf{x})$.
With the optimal discriminator, the generator's criterion is (up to a constant) the Jensen-Shannon divergence between the generator distribution and the true data distribution!
If $D$ is set to its global optimum given $G$ at every iteration, and $G$ improves the criterion at every iteration, then alternating updates converge to the global optimum (in practice, there may be reasons to weaken it).
In practice we only take gradient steps on $J^{(D)}(\theta^{(D)}, \theta^{(G)})$ and $J^{(G)}(\theta^{(D)}, \theta^{(G)})$, relying on the discriminator to track the density ratio.
This may not hold, especially if we have data lying on a manifold. Even when it holds, the ratio can be numerically unstable.
In practice many other discriminators can do nearly as good a job as the optimal one, i.e. the discriminator can overfit the data.
(we will spend much of today on these issues)
$J^{(G)}(\theta^{(D)}, \theta^{(G)}) = \tfrac{1}{2}\,\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}\log D(\mathbf{x}) + \tfrac{1}{2}\,\mathbb{E}_{\mathbf{z}}\log\!\left(1 - D(G(\mathbf{z}))\right)$
What happens when the discriminator is much better than the generator? $D(G(\mathbf{z})) \to 0$, so $\tfrac{1}{2}\,\mathbb{E}_{\mathbf{z}}\log\!\left(1 - D(G(\mathbf{z}))\right) \to 0$ and the generator's gradient vanishes.
Instead, flip the targets used for the generator's cost: $J^{(G)}(\theta^{(D)}, \theta^{(G)}) = -\tfrac{1}{2}\,\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}\log\!\left(1 - D(\mathbf{x})\right) - \tfrac{1}{2}\,\mathbb{E}_{\mathbf{z}}\log D(G(\mathbf{z}))$.
Since the first term does not depend on $G$, this is equivalent to $J^{(G)}(\theta^{(D)}, \theta^{(G)}) = -\tfrac{1}{2}\,\mathbb{E}_{\mathbf{z}}\log D(G(\mathbf{z}))$.
Now $D(G(\mathbf{z})) \to 0$ implies $\tfrac{1}{2}\,\mathbb{E}_{\mathbf{z}}\log D(G(\mathbf{z})) \to -\infty$, so the generator still gets a strong gradient when the discriminator is winning.
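A small autograd check of this argument (illustrative numbers only): when the discriminator confidently rejects a fake, the minimax generator term has a vanishing gradient while the non-saturating cost does not.

```python
import torch

# Discriminator confidently rejects a fake: very negative logit, D(G(z)) = sigmoid(logit) ~ 0.
logit = torch.tensor(-10.0, requires_grad=True)
d_fake = torch.sigmoid(logit)                       # D(G(z))

minimax_term = 0.5 * torch.log(1 - d_fake)          # generator's term in J^(G) = -J^(D)
g_minimax, = torch.autograd.grad(minimax_term, logit, retain_graph=True)

non_saturating = -0.5 * torch.log(d_fake)           # heuristic generator cost
g_heuristic, = torch.autograd.grad(non_saturating, logit)

print(float(d_fake))       # ~4.5e-05
print(float(g_minimax))    # ~ -2.3e-05: the gradient has vanished
print(float(g_heuristic))  # ~ -0.5: still a useful learning signal
```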
A third cost function approximates maximum likelihood estimation:
$J^{(G)}(\theta^{(D)}, \theta^{(G)}) = -\tfrac{1}{2}\,\mathbb{E}_{\mathbf{z}} \exp\!\left(\sigma^{-1}\!\left(D(G(\mathbf{z}))\right)\right)$, where $\sigma$ is the logistic sigmoid.
Minimizing it is equivalent to minimizing the KL divergence between the data distribution and the model distribution under certain assumptions.
The heuristic cost $J^{(G)}(\theta^{(D)}, \theta^{(G)}) = -\tfrac{1}{2}\,\mathbb{E}_{\mathbf{z}}\log D(G(\mathbf{z}))$ has the same fixed point (under some assumptions) as $J^{(G)} = -J^{(D)}$.
Energy-based GANs pair the discriminator with a hinge loss (for example the L2 loss of an autoencoder on real vs. fake examples): $J^{(D)}(\theta^{(D)}, \theta^{(G)}) = D(\mathbf{x}) + \max\!\left(m - D(G(\mathbf{z})), 0\right)$, $\quad J^{(G)}(\theta^{(D)}, \theta^{(G)}) = D(G(\mathbf{z}))$
These losses still have their optimum where $p_G(\mathbf{x}) = p_{\text{data}}(\mathbf{x})$ almost everywhere.
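A rough sketch of this energy-based variant, using an autoencoder's per-example L2 reconstruction error as $D$; the architecture and the margin value are assumptions made for illustration:

```python
import torch
import torch.nn as nn

x_dim, z_dim, margin = 784, 32, 20.0   # margin m is a hyperparameter, value assumed

# D is an energy: per-example L2 reconstruction error of an autoencoder.
ae = nn.Sequential(nn.Linear(x_dim, z_dim), nn.ReLU(), nn.Linear(z_dim, x_dim))

def energy(x):
    return ((ae(x) - x) ** 2).sum(dim=1)   # D(x): low where the AE reconstructs well

def d_loss(x_real, x_fake):
    # J^(D) = D(x) + max(m - D(G(z)), 0)
    return (energy(x_real) + torch.clamp(margin - energy(x_fake), min=0)).mean()

def g_loss(x_fake):
    # J^(G) = D(G(z))
    return energy(x_fake).mean()
```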
The choice of loss shapes the learning. See Nowozin and colleagues' f-GAN paper: they show a family of loss functions and how each corresponds to an f-divergence on the probability distributions we are trying to learn.
The Wasserstein GAN paper argues for a different choice of divergence (and proposes using an approximation to the Earth Mover's distance).
In a high dimensional space, the model's manifold and the true data manifold can have a negligible intersection in practice.
In that case the usual divergences are not well behaved, while the Earth Mover's distance still gives a useful training signal.
$J^{(D)}(\theta^{(D)}, \theta^{(G)}) = -\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}} D(\mathbf{x}) + \mathbb{E}_{\mathbf{z}} D(G(\mathbf{z}))$, $\quad J^{(G)}(\theta^{(D)}, \theta^{(G)}) = -\mathbb{E}_{\mathbf{z}} D(G(\mathbf{z}))$ (here $D$ is a critic with unbounded output, not a probability)
The critic is trained for several steps (closer to optimality) before the generator is updated.
Weight clipping is used to enforce the Lipschitz continuity required by the theory.
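A minimal WGAN-style sketch: several critic steps with weight clipping per generator step. The clip value 0.01, RMSprop, learning rate 5e-5, and n_critic = 5 follow the original paper's defaults, but they are assumptions here rather than something taken from the slides.

```python
import torch
import torch.nn as nn

z_dim, x_dim, clip, n_critic = 64, 784, 0.01, 5

G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))
critic = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, 1))  # no sigmoid

opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
opt_g = torch.optim.RMSprop(G.parameters(), lr=5e-5)

def train_step(x_real):
    m = x_real.shape[0]
    # Critic: maximize E[D(x)] - E[D(G(z))], i.e. minimize the negative.
    for _ in range(n_critic):
        z = torch.randn(m, z_dim)
        loss_c = -critic(x_real).mean() + critic(G(z).detach()).mean()
        opt_c.zero_grad()
        loss_c.backward()
        opt_c.step()
        for p in critic.parameters():          # weight clipping: crude Lipschitz constraint
            p.data.clamp_(-clip, clip)
    # Generator: minimize -E[D(G(z))].
    z = torch.randn(m, z_dim)
    loss_g = -critic(G(z)).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

train_step(torch.rand(64, x_dim))  # one step on stand-in data
```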
Another common failure is mode collapse: the generator maps many different $\mathbf{z}$ values to the same (or very similar) output.
https://arxiv.org/pdf/1611.02163.pdf
One fix: let the discriminator look at an entire minibatch of fake/real examples, so it can detect a lack of diversity.
Unrolled GANs: the generator looks ahead at how the discriminator will respond before making its update. The generator can see that it would end up putting less mass wherever the discriminator pushes back, and this discourages the generator from concentrating mass.
https://arxiv.org/pdf/1611.02163.pdf
Does seeking a Nash equilibrium with simultaneous gradient descent even make sense? That is not what gradient descent was designed for.
See the notes on the numerics of GANs and Consensus Optimization:
http://www.inference.vc/my-notes-on-the-numerics-of-gans/
One view is that training minimizes a divergence between the model and true probability distributions, but the actual gradient dynamics near the equilibrium are not that good.
DCGAN: deep convolutional Generative Adversarial Networks. Convolutional generator and discriminator architectures with batch normalization keep activations in a well-behaved distribution so that many of the training problems are mitigated.
$J^{(D)}(\theta^{(D)}, \theta^{(G)}) = -\tfrac{1}{2}\,\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}\log D(\mathbf{x}) - \tfrac{1}{2}\,\mathbb{E}_{\mathbf{z}}\log\!\left(1 - D(G(\mathbf{z}))\right)$
One-sided label smoothing: set the target for the real examples to e.g. 0.9 instead of 1 (but keep the target of the model samples at exactly 0).
This keeps the discriminator from becoming overconfident and generalizing poorly to new data (overfitting).
It also helps with the problem of non-overlapping support between the model and data distributions.
https://arxiv.org/pdf/1701.00160.pdf
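A sketch of one-sided label smoothing applied to the discriminator loss above; D is assumed to be any discriminator with a sigmoid output:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def d_loss_smoothed(d_real, d_fake, real_target=0.9):
    """d_real = D(x), d_fake = D(G(z)), both in (0, 1).
    Real targets are smoothed to 0.9; fake targets stay at exactly 0."""
    real_labels = torch.full_like(d_real, real_target)
    fake_labels = torch.zeros_like(d_fake)
    return 0.5 * bce(d_real, real_labels) + 0.5 * bce(d_fake, fake_labels)

d_real = torch.rand(8, 1) * 0.5 + 0.5   # stand-in D(x) outputs
d_fake = torch.rand(8, 1) * 0.5         # stand-in D(G(z)) outputs
print(d_loss_smoothed(d_real, d_fake))
```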
Virtual batch normalization: normalize each example using statistics from a fixed reference batch (rather than from its own minibatch).
Both networks can also be given the label, making them class conditional.
Alternatively, the discriminator can classify among $K + 1$ classes, where a class is added for fake data.
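Sketches of the two labelled variants (sizes and architectures are assumptions): a class-conditional discriminator that is given the label, and a (K+1)-way discriminator with one extra class for fake data.

```python
import torch
import torch.nn as nn

x_dim, n_classes = 784, 10

# (a) Class-conditional: concatenate a one-hot label to the input of D (and of G).
D_cond = nn.Sequential(nn.Linear(x_dim + n_classes, 256), nn.ReLU(),
                       nn.Linear(256, 1), nn.Sigmoid())

def d_conditional(x, y):
    y_onehot = nn.functional.one_hot(y, n_classes).float()
    return D_cond(torch.cat([x, y_onehot], dim=1))

# (b) K+1 classes: logits for the K real classes plus one extra "fake" class.
D_k1 = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(),
                     nn.Linear(256, n_classes + 1))   # class index n_classes = fake

x = torch.rand(4, x_dim)
y = torch.tensor([0, 3, 7, 9])
print(d_conditional(x, y).shape)   # (4, 1): probability of "real" given the label
print(D_k1(x).shape)               # (4, 11): logits over 10 real classes + "fake"
```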
There is a connection to reinforcement learning: the generator's samples are rewarded, and the reward function governs learning (the difference is that here there are two players responding to each other; GANs are competitive though).
GANs remain an area of active research. Extensions make the latent code more interpretable (InfoGAN), and there are many more variants.