  1. Generative Adversarial Networks Mostly adapted from Goodfellow’s 2016 NIPS tutorial: https://arxiv.org/pdf/1701.00160.pdf

  2. Story so far: Why generative models?
     • Unsupervised learning means we have more training data
     • Some problems have many right answers, and diversity is desirable
       • Caption generation, image-to-image translation, super-resolution
     • Some tasks intrinsically require generation
       • Machine translation
     • Some generative models allow us to investigate a lower-dimensional manifold of high-dimensional data. This manifold can provide insight into high-dimensional observations
       • Brain activity, gene expression

  3. Recap: Factor Analysis
     • Generative model: assumes that data are generated from real-valued latent variables
     (Figure: Bishop – Pattern Recognition and Machine Learning)

  4. Recap: Factor Analysis
     • We can see from the marginal distribution
       $p(\mathbf{y}_j \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, \boldsymbol{\Psi}) = \mathcal{N}(\mathbf{y}_j \mid \boldsymbol{\mu}, \; \boldsymbol{\Psi} + \boldsymbol{\Lambda}\boldsymbol{\Lambda}^T)$
       that the covariance matrix of the data distribution is broken into 2 terms:
       • A diagonal part $\boldsymbol{\Psi}$: variance not shared between variables
       • A low-rank matrix $\boldsymbol{\Lambda}\boldsymbol{\Lambda}^T$: shared variance due to latent factors
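     To make the decomposition concrete, here is a small NumPy check (not from the slides) that samples from the factor-analysis generative model and compares the sample covariance against $\boldsymbol{\Psi} + \boldsymbol{\Lambda}\boldsymbol{\Lambda}^T$; the dimensions and parameter values are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 5, 2, 200_000                   # observed dim, latent dim, sample count (all made up)

Lam = rng.normal(size=(D, K))             # factor loadings (Lambda)
Psi = np.diag(rng.uniform(0.1, 1.0, D))   # diagonal noise covariance (Psi)

z = rng.normal(size=(N, K))                          # latent factors z ~ N(0, I)
eps = rng.multivariate_normal(np.zeros(D), Psi, N)   # observation noise ~ N(0, Psi)
y = z @ Lam.T + eps                                  # y = Lambda z + eps (zero mean for simplicity)

# The sample covariance approaches the marginal covariance Psi + Lambda Lambda^T
print(np.abs(np.cov(y, rowvar=False) - (Psi + Lam @ Lam.T)).max())  # small, ~1e-2
```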

  5. Recap: Evidence Lower Bound (ELBO)
     • From basic probability we have:
       $\mathrm{KL}\!\left(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x}, \theta)\right) = \mathrm{KL}\!\left(q(\mathbf{z}) \,\|\, p(\mathbf{x}, \mathbf{z} \mid \theta)\right) + \log p(\mathbf{x} \mid \theta)$
     • We can rearrange the terms to get the following decomposition:
       $\log p(\mathbf{x} \mid \theta) = \mathrm{KL}\!\left(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x}, \theta)\right) - \mathrm{KL}\!\left(q(\mathbf{z}) \,\|\, p(\mathbf{x}, \mathbf{z} \mid \theta)\right)$
     • We define the evidence lower bound (ELBO) as:
       $\mathcal{L}(q, \theta) \triangleq -\mathrm{KL}\!\left(q(\mathbf{z}) \,\|\, p(\mathbf{x}, \mathbf{z} \mid \theta)\right)$
     • Then:
       $\log p(\mathbf{x} \mid \theta) = \mathrm{KL}\!\left(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x}, \theta)\right) + \mathcal{L}(q, \theta)$
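     As a sanity check on this decomposition, the following small NumPy example (not part of the original slides) builds a toy discrete joint $p(x, z)$, picks an arbitrary $q(z)$, and verifies numerically that $\log p(x) = \mathrm{KL}(q \,\|\, p(z \mid x)) + \mathcal{L}(q)$; all numbers are made up.

```python
import numpy as np

# Toy discrete joint p(x, z) over 2 observed and 3 latent states (illustrative numbers, sums to 1)
p_joint = np.array([[0.10, 0.20, 0.15],   # p(x=0, z=k)
                    [0.05, 0.30, 0.20]])  # p(x=1, z=k)
x = 1
p_x = p_joint[x].sum()              # evidence p(x)
p_z_given_x = p_joint[x] / p_x      # posterior p(z | x)

q = np.array([0.2, 0.5, 0.3])       # any valid variational distribution q(z)

kl_posterior = np.sum(q * np.log(q / p_z_given_x))   # KL(q || p(z|x))
elbo = np.sum(q * np.log(p_joint[x] / q))            # -KL(q || p(x,z)) = E_q[log p(x,z) - log q(z)]

# The decomposition log p(x) = KL(q || p(z|x)) + ELBO holds exactly
print(np.isclose(np.log(p_x), kl_posterior + elbo))  # True
```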

  6. Recap: The EM algorithm (E step)
     • Maximize $\mathcal{L}(q, \theta^{(t-1)})$ with respect to $q$ by setting $q^{(t)}(\mathbf{z}) \leftarrow p(\mathbf{z} \mid \mathbf{x}, \theta^{(t-1)})$
     (Figure: Bishop – Pattern Recognition and Machine Learning)

  7. Recap: The M step
     • After applying the E step, we increase the likelihood of the data by finding better parameters according to:
       $\theta^{(t)} \leftarrow \operatorname{argmax}_{\theta} \; \mathbb{E}_{q^{(t)}(\mathbf{z})}\!\left[\log p(\mathbf{x}, \mathbf{z} \mid \theta)\right]$
     (Figure: Bishop – Pattern Recognition and Machine Learning)
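     To illustrate the alternation of the two steps, here is a minimal, self-contained EM loop for a toy two-component, unit-variance 1-D Gaussian mixture. The mixture model is chosen only because it is compact; it is not the factor-analysis update from these slides.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 700)])  # made-up data
pi, mu = 0.5, np.array([-1.0, 1.0])        # initial parameters theta^(0): weight and two means

for t in range(50):
    # E step: q^(t)(z_i) <- p(z_i | x_i, theta^(t-1))  (responsibilities, unit variances assumed)
    lik = np.exp(-0.5 * (x[:, None] - mu) ** 2)
    r = lik * np.array([pi, 1 - pi])
    r /= r.sum(axis=1, keepdims=True)
    # M step: theta^(t) <- argmax_theta E_{q^(t)(z)}[log p(x, z | theta)]
    pi = r[:, 0].mean()
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)

print(pi, mu)   # converges near 0.3 and [-2, 2]
```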

  8. Recap: EM in practice
     $\operatorname{argmax}_{\boldsymbol{\Lambda}, \boldsymbol{\Psi}} \; \mathbb{E}_{q^{(t)}(\mathbf{Z})}\!\left[\log p(\mathbf{Y}, \mathbf{Z} \mid \boldsymbol{\Lambda}, \boldsymbol{\Psi})\right]
       = \operatorname{argmax}_{\boldsymbol{\Lambda}, \boldsymbol{\Psi}} \left[ -\frac{N}{2}\log\det(\boldsymbol{\Psi})
       - \sum_{j=1}^{N}\left( \frac{1}{2}\mathbf{y}_j^T \boldsymbol{\Psi}^{-1} \mathbf{y}_j
       - \mathbf{y}_j^T \boldsymbol{\Psi}^{-1} \boldsymbol{\Lambda}\, \mathbb{E}_{q^{(t)}(\mathbf{z}_j)}[\mathbf{z}_j]
       + \frac{1}{2}\operatorname{tr}\!\left(\boldsymbol{\Lambda}^T \boldsymbol{\Psi}^{-1} \boldsymbol{\Lambda}\, \mathbb{E}_{q^{(t)}(\mathbf{z}_j)}[\mathbf{z}_j \mathbf{z}_j^T]\right) \right) \right]$
     • By looking at what expectations the M step requires, we find out what we need to compute in the E step.
     • For FA, we only need these 2 sufficient statistics, $\mathbb{E}[\mathbf{z}_j]$ and $\mathbb{E}[\mathbf{z}_j \mathbf{z}_j^T]$, to enable the M step.
     • In practice, sufficient statistics are often what we compute in the E step.
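     A possible NumPy sketch of the corresponding FA E step, computing just these two sufficient statistics under the standard factor-analysis posterior; the function name and argument layout are illustrative, not from the tutorial.

```python
import numpy as np

def fa_e_step(Y, Lam, Psi_diag):
    """E step for factor analysis: the two sufficient statistics E[z_j] and E[z_j z_j^T].

    Y        : (N, D) data matrix, assumed centred
    Lam      : (D, K) factor loading matrix (Lambda)
    Psi_diag : (D,)  diagonal of the noise covariance (Psi)
    """
    K = Lam.shape[1]
    Psi_inv_Lam = Lam / Psi_diag[:, None]                # Psi^{-1} Lambda
    G = np.linalg.inv(np.eye(K) + Lam.T @ Psi_inv_Lam)   # posterior covariance of z_j given y_j
    Ez = Y @ Psi_inv_Lam @ G                             # (N, K): E[z_j | y_j]
    Ezz = G[None] + Ez[:, :, None] * Ez[:, None, :]      # (N, K, K): E[z_j z_j^T | y_j]
    return Ez, Ezz
```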

  9. Recap: From EM to Variational Inference
     • In EM we alternately maximize the ELBO with respect to $\theta$ and the probability distribution (functional) $q$
     • In variational inference, we drop the distinction between hidden variables and parameters of a distribution
     • I.e., we replace $p(\mathbf{x}, \mathbf{z} \mid \theta)$ with $p(\mathbf{x}, \mathbf{z})$. Effectively this puts a probability distribution on the parameters $\theta$, then absorbs them into $\mathbf{z}$
     • Fully Bayesian treatment instead of a point estimate for the parameters

  10. Recap: Variational Autoencoder
      • For $t = 1:b:T$ (minibatches):
        • Sample $\epsilon_j \sim p(\epsilon)$
        • Compute $\mathbf{z}_j = g(\epsilon_j, \mathbf{x}_j, \phi)$ and $p(\mathbf{x}_j \mid \mathbf{z}_j, \theta)$
        • Estimate $\partial\mathcal{L}/\partial\phi$ and $\partial\mathcal{L}/\partial\theta$ with either $-\tilde{\mathcal{L}}_A$ or $-\tilde{\mathcal{L}}_B$ as the loss
        • Update $\phi, \theta$
      • The training procedure uses standard back-propagation with an MC procedure to approximately run EM on the ELBO
      • The reparameterization trick, $\mathbf{z}_j = g(\epsilon_j, \mathbf{x}_j, \phi)$ with $\epsilon_j \sim p(\epsilon)$, enables the gradient to flow through the network
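      A minimal PyTorch sketch of the reparameterization step $\mathbf{z} = g(\epsilon, \mathbf{x}, \phi)$ for a Gaussian encoder, showing that gradients reach the encoder outputs through the sampled $\mathbf{z}$; the tensor values and the stand-in loss are made up for illustration.

```python
import torch

# Hypothetical encoder outputs for one example (in a real VAE these come from a network)
mu = torch.tensor([0.5, -1.0], requires_grad=True)       # encoder mean
log_var = torch.tensor([0.1, 0.2], requires_grad=True)   # encoder log-variance

eps = torch.randn(2)                          # eps ~ N(0, I); no gradient needed through eps
z = mu + torch.exp(0.5 * log_var) * eps       # z = g(eps, x, phi): deterministic in mu, log_var

loss = (z ** 2).sum()                         # stand-in for the ELBO terms that depend on z
loss.backward()
print(mu.grad, log_var.grad)                  # gradients flow back through the sampling step
```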

  11. Recap: Requirements of the VAE
      • Note that the VAE requires 2 tractable distributions to be used:
        • The prior distribution $p(\mathbf{z})$ must be easy to sample from
        • The conditional likelihood $p(\mathbf{x} \mid \mathbf{z}, \theta)$ must be computable
      • In practice this means that the 2 distributions of interest are often simple, for example uniform, Gaussian, or even isotropic Gaussian

  12. Recap: The VAE blurry image problem
      • The samples from the VAE look blurry
      • Three plausible explanations for this:
        • Maximizing the likelihood
        • Restrictions on the family of distributions
        • The lower bound approximation
      https://blog.openai.com/generative-models/

  13. Recap: The maximum likelihood explanation • Recent evidence suggests that this is not actually the problem • GANs can be trained with maximum likelihood and still generate sharp examples https://arxiv.org/pdf/1701.00160.pdf

  14. A taxonomy of generative models

  15. Fully Visible Belief Net (FVBN), e.g. WaveNet
      $p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$
      • No latent variables (hence "fully visible")
      • Easier to optimize well
      • Tractable log-likelihood
      • Slower to run
      • Train with an auto-regressive target
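      A toy Python illustration (not WaveNet) of the factorization above: a small logistic autoregressive model over binary variables, with a tractable log-likelihood and inherently sequential sampling; the weights and dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 8
W = rng.normal(scale=0.5, size=(T, T))   # toy autoregressive weights; row t only uses entries < t
b = rng.normal(scale=0.1, size=T)

def cond_prob_one(x, t):
    """Toy conditional p(x_t = 1 | x_1, ..., x_{t-1}) via a logistic model."""
    return 1.0 / (1.0 + np.exp(-(W[t, :t] @ x[:t] + b[t])))

def log_likelihood(x):
    # Tractable log-likelihood: sum over t of log p(x_t | x_<t)
    total = 0.0
    for t in range(T):
        p = cond_prob_one(x, t)
        total += np.log(p if x[t] == 1 else 1 - p)
    return total

def sample():
    # Sampling is inherently sequential: each x_t needs all previous values
    x = np.zeros(T)
    for t in range(T):
        x[t] = rng.random() < cond_prob_one(x, t)
    return x

x = sample()
print(x, log_likelihood(x))
```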

  16. GAN Advantages • Sample in parallel (vs FVBN) • Few restrictions on generator function • No Markov Chain • No variational bound • Subjectively better samples

  17. GAN Disadvantages • Very difficult to train properly • Difficult to evaluate • Likelihood cannot be computed • No encoder (in vanilla GAN)

  18. GAN samples look sharp
      (Figure: real samples vs. generated samples) https://arxiv.org/pdf/1703.10717.pdf

  19. GAN samples look sharp
      (Figure: real samples vs. generated samples from the Boundary Equilibrium GAN and the Energy-Based GAN) https://arxiv.org/pdf/1703.10717.pdf

  20. Interpolation is impressive https://arxiv.org/pdf/1703.10717.pdf

  21. Generative Adversarial Networks: Basic idea
      • Generator (Counterfeiter): creates fake data from random input
      • Discriminator (Detective): distinguishes real data from fake, outputting "Looks Real!" or "Looks Fake!"

  22. The Generator
      • Faking data
        • To create good fake data, the generator must understand what real data looks like
        • It attempts to generate samples that are likely under the true data distribution
        • It implicitly learns to model the true distribution
      • Latent code
        • Since the sample is determined by the random noise input, the probability distribution is conditioned on this input
        • The random noise is interpreted by the model as a latent code, i.e. a point on the manifold

  23. Problem setup
      • Generator: trained to get better and better at fooling the discriminator (making fake data look real)
      • Discriminator: trained to get better and better at distinguishing real data from fake data

  24. Formalizing the generator/discriminator
      • Generator: $G(\mathbf{z}; \theta^{(G)})$, a differentiable function (here having parameters $\theta^{(G)}$) mapping from the latent space, $\mathbb{R}^M$, to the data space, $\mathbb{R}^N$
      • Discriminator: $D(\mathbf{x}; \theta^{(D)})$, a differentiable function (here having parameters $\theta^{(D)}$) mapping from the data space, $\mathbb{R}^N$, to a scalar between 0 and 1 representing the probability that the data is real

  25. Simplifying notation
      • Generator: $G(\mathbf{z})$. For simplicity of notation, we write $G(\mathbf{z})$ without $\theta^{(G)}$. Typically $G$ is a neural network, but it doesn't have to be. Note that $\mathbf{z}$ can go into any layer of the network, not just the first.
      • Discriminator: $D(\mathbf{x})$, $D(G(\mathbf{z}))$. Note that the discriminator can also take the output of the generator as input. Typically $D$ is a neural network, but it doesn't have to be.
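      A possible PyTorch sketch of this setup, with $G$ and $D$ as small MLPs; the latent and data dimensions, layer sizes, and architectures are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

M, N = 16, 64   # latent dim and data dim (illustrative choices)

# Generator G: latent space R^M -> data space R^N
G = nn.Sequential(nn.Linear(M, 128), nn.ReLU(),
                  nn.Linear(128, N))

# Discriminator D: data space R^N -> probability in (0, 1) that the input is real
D = nn.Sequential(nn.Linear(N, 128), nn.ReLU(),
                  nn.Linear(128, 1), nn.Sigmoid())

z = torch.randn(32, M)        # batch of latent codes
fake = G(z)                   # G(z): generated samples
print(D(fake).shape)          # D(G(z)): (32, 1) probabilities
```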

  26. An artist's rendition
      (Figure: $\mathbf{z}$ feeds into $G$; $G(\mathbf{z})$ or $\mathbf{x}$ feeds into $D$, producing $D(G(\mathbf{z}))$ or $D(\mathbf{x})$)

  27. The game (theory) • The generator and discriminator are adversaries in a game • The generator controls only its parameters • The discriminator controls only its parameters • Each seeks to maximize its own success and minimize the success of the other: related to minimax theory
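      For reference, this game is commonly written as the minimax value function from Goodfellow et al. (2014); the notation below follows that paper rather than the slides.

```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}\!\left[\log D(\mathbf{x})\right]
  + \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}\!\left[\log\!\left(1 - D(G(\mathbf{z}))\right)\right]
```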

  28. Nash equilibrium
      • In game theory, a local optimum in this system is called a Nash equilibrium:
        • Generator loss, $J^{(G)}$, is at a local minimum with respect to $\theta^{(G)}$
        • Discriminator loss, $J^{(D)}$, is at a local minimum with respect to $\theta^{(D)}$

  29. Basic training procedure
      • Initialize $\theta^{(G)}$, $\theta^{(D)}$
      • For $t = 1:b:T$:
        • Initialize $\Delta\theta^{(D)} = 0$
        • For $i = t : t + b - 1$:
          • Sample $\mathbf{z}_i \sim p(\mathbf{z}_i)$
          • Compute $D(G(\mathbf{z}_i))$, $D(\mathbf{x}_i)$
          • $\Delta\theta_i^{(D)} \leftarrow$ gradient of the discriminator loss, $J^{(D)}(\theta^{(G)}, \theta^{(D)})$
          • $\Delta\theta^{(D)} \leftarrow \Delta\theta^{(D)} + \Delta\theta_i^{(D)}$
        • Update $\theta^{(D)}$
        • Initialize $\Delta\theta^{(G)} = 0$
        • For $j = t : t + b - 1$:
          • Sample $\mathbf{z}_j \sim p(\mathbf{z}_j)$
          • Compute $D(G(\mathbf{z}_j))$, $D(\mathbf{x}_j)$
          • $\Delta\theta_j^{(G)} \leftarrow$ gradient of the generator loss, $J^{(G)}(\theta^{(G)}, \theta^{(D)})$
          • $\Delta\theta^{(G)} \leftarrow \Delta\theta^{(G)} + \Delta\theta_j^{(G)}$
        • Update $\theta^{(G)}$
      • Note: one can also run $k$ minibatches of the discriminator update before updating the generator, but Goodfellow finds $k = 1$ tends to work best
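      A compact PyTorch sketch of this alternating loop under simple assumptions (non-saturating binary cross-entropy losses, Adam, a made-up Gaussian "real" dataset, $k = 1$); it follows the spirit of the procedure above rather than its exact notation.

```python
import torch
import torch.nn as nn

M, N, batch = 16, 64, 32   # illustrative latent dim, data dim, minibatch size
G = nn.Sequential(nn.Linear(M, 128), nn.ReLU(), nn.Linear(128, N))
D = nn.Sequential(nn.Linear(N, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real_data = torch.randn(1024, N) + 3.0     # stand-in "real" dataset

for step in range(100):
    x = real_data[torch.randint(0, 1024, (batch,))]

    # Discriminator update: push D(x) -> 1 and D(G(z)) -> 0
    z = torch.randn(batch, M)
    d_loss = bce(D(x), torch.ones(batch, 1)) + bce(D(G(z).detach()), torch.zeros(batch, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator update (k = 1): push D(G(z)) -> 1, i.e. fool the discriminator
    z = torch.randn(batch, M)
    g_loss = bce(D(G(z)), torch.ones(batch, 1))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```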
