  1. Variational Autoencoders

  2. Recap: Story so far
  • A classification MLP actually comprises two components
  • A “feature extraction network” that converts the inputs into linearly separable features, or nearly linearly separable features
  • A final linear classifier that operates on the linearly separable features
  • Neural networks can be used to perform linear or non-linear PCA
  • “Autoencoders”
  • They can also be used to compose constructive dictionaries for data
  • Which, in turn, can be used to model data distributions

  3. Recap: The penultimate layer
  [Figure: network mapping inputs x1, x2 through hidden layers to features z1, z2 and outputs y1, y2]
  • The network up to the output layer may be viewed as a transformation that transforms data from non-linear classes to linearly separable features
  • We can now attach any linear classifier above it for perfect classification
  • It need not be a perceptron; in fact, putting an SVM on top of the features may generalize better!
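As a rough illustration of the "any linear classifier on top" point, the sketch below splits a small MLP into a feature-extraction part and fits a scikit-learn LinearSVC on the penultimate-layer features. The architecture, the toy data, and the fact that the network is left untrained are assumptions made purely for brevity; in a real pipeline the feature network would be the trained classifier up to its penultimate layer.

```python
import torch
from sklearn.svm import LinearSVC

# Hypothetical feature-extraction network: everything up to the penultimate layer.
# (Sizes are arbitrary; a real pipeline would use the trained MLP's weights.)
feature_net = torch.nn.Sequential(
    torch.nn.Linear(2, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 2), torch.nn.ReLU(),   # penultimate layer -> 2-D features z
)

X = torch.randn(500, 2)                        # toy inputs
y = (X[:, 0] * X[:, 1] > 0).long()             # a non-linearly-separable toy labeling

with torch.no_grad():
    feats = feature_net(X)                     # features from the penultimate layer

# Any linear classifier can sit on top of the features; here, a linear SVM.
svm = LinearSVC().fit(feats.numpy(), y.numpy())
print(svm.score(feats.numpy(), y.numpy()))
```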

  4. Recap: The behavior of the layers

  5. Recap: Auto-encoders and PCA
  Training: learning $W$ by minimizing the L2 divergence between the input $x$ and the reconstruction $\hat{x}$
  • Encode and decode with tied weights: $z = w^T x$, $\hat{x} = w z = w w^T x$
  • $div(\hat{x}, x) = \| x - \hat{x} \|^2 = \| x - w w^T x \|^2$
  • $\widehat{W} = \arg\min_W E\left[ div(\hat{x}, x) \right] = \arg\min_W E\left[ \| x - w w^T x \|^2 \right]$
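To make the PCA connection concrete, here is a minimal sketch (with made-up 2-D data and arbitrary training settings) of a one-unit, tied-weight linear autoencoder trained to minimize the L2 reconstruction error. The learned direction should line up, up to sign, with the principal eigenvector of the data covariance.

```python
import torch

torch.manual_seed(0)
# Toy zero-mean data whose dominant variance lies along the (1, 1) direction.
z = torch.randn(1000, 1)
x = z @ torch.tensor([[1.0, 1.0]]) + 0.1 * torch.randn(1000, 2)

w = torch.randn(2, 1, requires_grad=True)       # single hidden unit, tied weights
opt = torch.optim.SGD([w], lr=0.01)
for _ in range(2000):
    opt.zero_grad()
    x_hat = (x @ w) @ w.T                       # encode z = w^T x, decode x_hat = w z
    loss = ((x - x_hat) ** 2).sum(dim=1).mean() # L2 divergence div(x_hat, x)
    loss.backward()
    opt.step()

w_dir = (w / w.norm()).detach().squeeze()
cov = x.T @ x / len(x)
evals, evecs = torch.linalg.eigh(cov)           # eigenvectors in ascending eigenvalue order
print(w_dir, evecs[:, -1])                      # learned direction vs. principal axis (up to sign)
```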

  6. Recap: Auto-encoders and PCA
  • The autoencoder finds the direction of maximum energy
  • Variance, if the input is a zero-mean RV
  • All input vectors are mapped onto a point on the principal axis

  7. Recap: Auto-encoders and PCA • Varying the hidden layer value only generates data along the learned manifold • May be poorly learned • Any input will result in an output along the learned manifold

  8. Recap: Learning a data-manifold
  [Figure: a decoder acting as a "sax dictionary"]
  • The decoder represents a source-specific generative dictionary
  • Exciting it will produce typical data from the source!
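The "exciting the dictionary" idea can be sketched as below. The decoder architecture, the dimensions, and the notion that it was trained on saxophone spectrogram frames are all assumptions made for illustration; the point is only that any hidden-layer value fed to a trained decoder decodes to something on the learned data manifold.

```python
import torch

# Hypothetical decoder half of an autoencoder trained on sax spectrogram frames.
# (Architecture and sizes are illustrative; in practice these come from training.)
decoder = torch.nn.Sequential(
    torch.nn.Linear(8, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 1025),
)

h = torch.randn(16, 8)          # arbitrary excitations of the hidden layer
frames = decoder(h)             # each row decodes to a point on the learned manifold
print(frames.shape)             # torch.Size([16, 1025])
```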

  9. Overview • Just as autoencoders can be viewed as performing a non-linear PCA, variational autoencoders can be viewed as performing a non-linear Factor Analysis (FA) • Variational autoencoders (VAEs) get their name from variational inference, a technique that can be used for parameter estimation • We will introduce Factor Analysis, variational inference and expectation maximization, and finally VAEs

  10. Why Generative Models? Training data • Unsupervised/Semi-supervised learning: More training data available • E.g. all of the videos on YouTube

  11. Why generative models? Many right answers
  • Caption -> Image (e.g. “A man in an orange jacket with sunglasses and a hat skis down a hill”) https://openreview.net/pdf?id=Hyvw0L9el
  • Outline -> Image https://arxiv.org/abs/1611.07004

  12. Why generative models? Intrinsic to task Example: Super resolution https://arxiv.org/abs/1609.04802

  13. Why generative models? Insight
  • What kind of structure can we find in complex observations (MEG recording of brain activity above, gene-expression network to the left)?
  • Is there a low dimensional manifold underlying these complex observations?
  • What can we learn about the brain, cellular function, etc. if we know more about these manifolds?
  https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-327

  14. Factor Analysis • Generative model: Assumes that data are generated from real-valued latent variables Bishop – Pattern Recognition and Machine Learning

  15. Factor Analysis model
  Factor analysis assumes a generative model where the $i$-th observation, $x_i \in \mathbb{R}^D$, is conditioned on a vector of real-valued latent variables $z_i \in \mathbb{R}^L$.
  Here we assume the prior distribution is Gaussian:
  $p(z_i) = \mathcal{N}(z_i \mid \mu_0, \Sigma_0)$
  We also use a Gaussian for the data likelihood:
  $p(x_i \mid z_i, W, \mu, \Psi) = \mathcal{N}(W z_i + \mu, \Psi)$
  where $W \in \mathbb{R}^{D \times L}$, $\Psi \in \mathbb{R}^{D \times D}$, and $\Psi$ is diagonal.
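A minimal sketch of drawing data from this generative model; the sizes and parameter values below are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L, N = 5, 2, 10                         # observed dim, latent dim, number of samples (assumed)
W = rng.normal(size=(D, L))                # factor loading matrix
mu = rng.normal(size=D)                    # offset
Psi = np.diag(rng.uniform(0.1, 0.5, D))    # diagonal observation covariance
mu0, Sigma0 = np.zeros(L), np.eye(L)       # Gaussian prior on the latents

for i in range(N):
    z_i = rng.multivariate_normal(mu0, Sigma0)           # z_i ~ N(mu_0, Sigma_0)
    x_i = rng.multivariate_normal(W @ z_i + mu, Psi)     # x_i ~ N(W z_i + mu, Psi)
    print(x_i)
```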

  16. Marginal distribution of observed $x_i$
  $p(x_i \mid W, \mu, \Psi) = \int \mathcal{N}(W z_i + \mu, \Psi)\, \mathcal{N}(z_i \mid \mu_0, \Sigma_0)\, dz_i = \mathcal{N}(x_i \mid W \mu_0 + \mu,\; \Psi + W \Sigma_0 W^T)$
  Note that we can rewrite this as:
  $p(x_i \mid \widehat{W}, \widehat{\mu}, \Psi) = \mathcal{N}(x_i \mid \widehat{\mu},\; \Psi + \widehat{W} \widehat{W}^T)$
  where $\widehat{\mu} = W \mu_0 + \mu$ and $\widehat{W} = W \Sigma_0^{1/2}$.
  Thus, without loss of generality (since $\mu_0, \Sigma_0$ are absorbed into the learnable parameters), we let:
  $p(z_i) = \mathcal{N}(z_i \mid 0, I)$
  And find:
  $p(x_i \mid W, \mu, \Psi) = \mathcal{N}(x_i \mid \mu,\; \Psi + W W^T)$
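The marginal can be checked numerically: with the simplified prior $\mathcal{N}(0, I)$, samples of $x$ should have mean $\mu$ and covariance $\Psi + W W^T$. A sketch with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
D, L, N = 4, 2, 200_000
W = rng.normal(size=(D, L))
mu = rng.normal(size=D)
Psi = np.diag(rng.uniform(0.2, 0.8, D))

Z = rng.normal(size=(N, L))                                       # z_i ~ N(0, I)
X = Z @ W.T + mu + rng.multivariate_normal(np.zeros(D), Psi, size=N)

print(np.round(np.cov(X.T), 2))          # empirical covariance of the observations
print(np.round(Psi + W @ W.T, 2))        # predicted marginal covariance Psi + W W^T
```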

  17. Marginal distribution interpretation
  • We can see from $p(x_i \mid W, \mu, \Psi) = \mathcal{N}(x_i \mid \mu,\; \Psi + W W^T)$ that the covariance matrix of the data distribution is broken into 2 terms
  • A diagonal part $\Psi$: variance not shared between variables
  • A low-rank matrix $W W^T$: shared variance due to latent factors

  18. Special Case: Probabilistic PCA (PPCA)
  • Probabilistic PCA is a special case of Factor Analysis
  • We further restrict $\Psi = \sigma^2 I$ (assume isotropic independent variance)
  • It is possible to show that when the data are centered ($\mu = 0$), the limiting case $\sigma \to 0$ gives back the same solution for $W$ as PCA
  • Factor analysis is a generalization of PCA that models non-shared variance (which can be thought of as noise in some situations, or individual variation in others)
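The difference between modeling total variance (PCA) and shared variance (FA) can be seen on toy data with strongly non-isotropic noise. The sketch below uses scikit-learn's PCA and FactorAnalysis estimators; the data sizes and noise levels are illustrative assumptions, not values from the slides.

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

# Toy data with one dominant shared factor plus per-dimension (non-shared) noise.
rng = np.random.default_rng(0)
z = rng.normal(size=(2000, 1))
W_true = np.array([[2.0], [1.0], [0.5]])
noise = rng.normal(scale=[0.1, 0.5, 1.0], size=(2000, 3))   # non-isotropic noise
X = z @ W_true.T + noise

pca = PCA(n_components=1).fit(X)
fa = FactorAnalysis(n_components=1).fit(X)
print(pca.components_)        # direction of maximum total variance
print(fa.components_)         # loading that explains the *shared* variance
print(fa.noise_variance_)     # per-dimension (non-shared) variance estimate
```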

  19. Inference in FA
  • To find the parameters of the FA model, we use the Expectation Maximization (EM) algorithm
  • EM is very similar to variational inference
  • We’ll derive EM by first finding a lower bound on the log-likelihood we want to maximize, and then maximizing this lower bound

  20. Evidence Lower Bound decomposition
  • For any distributions $q(z)$, $p(z)$ we have:
  $\mathrm{KL}\big(q(z)\,\|\,p(z)\big) \triangleq \int q(z) \log \frac{q(z)}{p(z)}\, dz$
  • Consider the KL divergence of an arbitrary weighting distribution $q(z)$ from a conditional distribution $p(z \mid x, \theta)$:
  $\mathrm{KL}\big(q(z)\,\|\,p(z \mid x, \theta)\big) \triangleq \int q(z) \log \frac{q(z)}{p(z \mid x, \theta)}\, dz = \int q(z)\,[\log q(z) - \log p(z \mid x, \theta)]\, dz$
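As a quick numerical aside, the definition above is easy to evaluate for discrete distributions. The two distributions below are made up; the point is that the KL divergence is non-negative and asymmetric.

```python
import numpy as np

# KL divergence between two made-up discrete distributions over 3 outcomes.
q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])

kl_qp = np.sum(q * np.log(q / p))   # KL(q || p)
kl_pq = np.sum(p * np.log(p / q))   # KL(p || q): generally different (KL is asymmetric)
print(kl_qp, kl_pq)                 # both >= 0, and zero only when q == p
```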

  21. Applying Bayes
  $\log p(z \mid x, \theta) = \log \frac{p(x \mid z, \theta)\, p(z \mid \theta)}{p(x \mid \theta)} = \log p(x \mid z, \theta) + \log p(z \mid \theta) - \log p(x \mid \theta)$
  Then:
  $\mathrm{KL}\big(q(z)\,\|\,p(z \mid x, \theta)\big) = \int q(z)\,[\log q(z) - \log p(z \mid x, \theta)]\, dz$
  $= \int q(z)\,[\log q(z) - \log p(x \mid z, \theta) - \log p(z \mid \theta) + \log p(x \mid \theta)]\, dz$

  22. Rewriting the divergence
  • Since the last term does not depend on $z$, and we know $\int q(z)\, dz = 1$, we can pull it out of the integral:
  $\int q(z)\,[\log q(z) - \log p(x \mid z, \theta) - \log p(z \mid \theta) + \log p(x \mid \theta)]\, dz$
  $= \int q(z)\,[\log q(z) - \log p(x \mid z, \theta) - \log p(z \mid \theta)]\, dz + \log p(x \mid \theta)$
  $= \int q(z) \log \frac{q(z)}{p(x \mid z, \theta)\, p(z \mid \theta)}\, dz + \log p(x \mid \theta)$
  $= \int q(z) \log \frac{q(z)}{p(x, z \mid \theta)}\, dz + \log p(x \mid \theta)$
  Then we have:
  $\mathrm{KL}\big(q(z)\,\|\,p(z \mid x, \theta)\big) = \mathrm{KL}\big(q(z)\,\|\,p(x, z \mid \theta)\big) + \log p(x \mid \theta)$

  23. Evidence Lower Bound
  • From the previous derivation we have:
  $\mathrm{KL}\big(q(z)\,\|\,p(z \mid x, \theta)\big) = \mathrm{KL}\big(q(z)\,\|\,p(x, z \mid \theta)\big) + \log p(x \mid \theta)$
  • We can rearrange the terms to get the following decomposition:
  $\log p(x \mid \theta) = \mathrm{KL}\big(q(z)\,\|\,p(z \mid x, \theta)\big) - \mathrm{KL}\big(q(z)\,\|\,p(x, z \mid \theta)\big)$
  • We define the evidence lower bound (ELBO) as:
  $\mathcal{L}(q, \theta) \triangleq -\mathrm{KL}\big(q(z)\,\|\,p(x, z \mid \theta)\big)$
  Then:
  $\log p(x \mid \theta) = \mathrm{KL}\big(q(z)\,\|\,p(z \mid x, \theta)\big) + \mathcal{L}(q, \theta)$
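The decomposition can be sanity-checked on a tiny discrete latent-variable model, where every quantity is computable exactly. All numbers below are made up for illustration; the final assertion holds for any choice of them.

```python
import numpy as np

# Tiny discrete model: prior p(z) and likelihood p(x|z) for one fixed observation x,
# with 3 latent states (numbers chosen arbitrarily).
p_z = np.array([0.5, 0.3, 0.2])            # prior p(z|theta)
p_x_given_z = np.array([0.9, 0.2, 0.4])    # likelihood p(x|z,theta) for this x

p_xz = p_z * p_x_given_z                   # joint p(x,z|theta)
log_evidence = np.log(p_xz.sum())          # log p(x|theta)
p_z_given_x = p_xz / p_xz.sum()            # posterior p(z|x,theta)

q = np.array([0.6, 0.3, 0.1])              # arbitrary weighting distribution q(z)
kl_post = np.sum(q * np.log(q / p_z_given_x))   # KL(q || p(z|x,theta))
elbo = np.sum(q * np.log(p_xz / q))             # -KL(q || p(x,z|theta))

# The decomposition log p(x|theta) = KL(q || p(z|x,theta)) + ELBO holds exactly.
assert np.isclose(log_evidence, kl_post + elbo)
```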

  24. Why the name evidence lower bound?
  • Rearranging the decomposition $\log p(x \mid \theta) = \mathrm{KL}\big(q(z)\,\|\,p(z \mid x, \theta)\big) + \mathcal{L}(q, \theta)$, we have
  $\mathcal{L}(q, \theta) = \log p(x \mid \theta) - \mathrm{KL}\big(q(z)\,\|\,p(z \mid x, \theta)\big)$
  • Since $\mathrm{KL}\big(q(z)\,\|\,p(z \mid x, \theta)\big) \ge 0$, $\mathcal{L}(q, \theta)$ is a lower bound on the log-likelihood we want to maximize
  • $p(x \mid \theta)$ is sometimes called the evidence
  • When is this bound tight? When $q(z) = p(z \mid x, \theta)$
  • The ELBO is also sometimes called the variational bound

  25. Visualizing the ELBO decomposition Bishop – Pattern Recognition and Machine Learning
  • Note: all we have done so far is decompose the log probability of the data; we still have exact equality
  • This holds for any distribution $q$

  26. Expectation Maximization
  • Expectation Maximization alternately optimizes the ELBO, $\mathcal{L}(q, \theta)$, with respect to $q$ (the E step) and $\theta$ (the M step)
  • Initialize $\theta^{(0)}$
  • At each iteration $t = 1, \ldots$
  • E step: Hold $\theta^{(t-1)}$ fixed, find $q^{(t)}$ which maximizes $\mathcal{L}(q, \theta^{(t-1)})$
  • M step: Hold $q^{(t)}$ fixed, find $\theta^{(t)}$ which maximizes $\mathcal{L}(q^{(t)}, \theta)$
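To ground the abstract recipe, here is a compact numpy sketch of EM applied to the factor analysis model from the earlier slides. The E step computes the exact Gaussian posterior $q(z_i) = \mathcal{N}(m_i, S)$, and the M step uses the standard closed-form FA updates; those update formulas are not derived on these slides, and the synthetic data, sizes, and iteration count below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, L = 2000, 5, 2
# Synthetic data roughly matching the FA model (sizes and noise level are arbitrary).
X = rng.normal(size=(N, L)) @ rng.normal(size=(L, D)) + rng.normal(scale=0.3, size=(N, D))

mu = X.mean(axis=0)                     # the MLE of mu is the sample mean
Xc = X - mu
W = rng.normal(size=(D, L))             # random initialization of theta = (W, Psi)
psi = np.ones(D)                        # diagonal of Psi

for _ in range(100):
    # E step: q(z_i) = N(m_i, S), the exact posterior under the current parameters
    S = np.linalg.inv(np.eye(L) + W.T @ np.diag(1.0 / psi) @ W)
    M = Xc @ np.diag(1.0 / psi) @ W @ S            # row i is the posterior mean m_i
    # M step: maximize the ELBO over W and Psi with q fixed (standard FA updates)
    Ezz = N * S + M.T @ M                          # sum_i E[z_i z_i^T]
    W = (Xc.T @ M) @ np.linalg.inv(Ezz)
    psi = np.diag(Xc.T @ Xc - W @ (M.T @ Xc)) / N

print(np.round(W @ W.T + np.diag(psi), 2))         # model covariance Psi + W W^T ...
print(np.round(np.cov(Xc.T), 2))                   # ... should approach the data covariance
```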

  27. The E step Bishop – Pattern Recognition and Machine Learning
  • Suppose we are at iteration $t$ of our algorithm. How do we maximize $\mathcal{L}(q, \theta^{(t-1)})$ with respect to $q$? We know that:
  $\arg\max_q \mathcal{L}(q, \theta^{(t-1)}) = \arg\max_q \left[ \log p(x \mid \theta^{(t-1)}) - \mathrm{KL}\big(q(z)\,\|\,p(z \mid x, \theta^{(t-1)})\big) \right]$
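Continuing the tiny discrete example from slide 23 (same made-up numbers): choosing $q(z)$ to be the exact posterior drives the KL term to zero, so the ELBO touches the log evidence, which is exactly what the E step aims for.

```python
import numpy as np

p_z = np.array([0.5, 0.3, 0.2])            # prior p(z|theta), made-up numbers
p_x_given_z = np.array([0.9, 0.2, 0.4])    # likelihood p(x|z,theta) for one observation x
p_xz = p_z * p_x_given_z                   # joint p(x,z|theta)
log_evidence = np.log(p_xz.sum())          # log p(x|theta)

q = p_xz / p_xz.sum()                      # E step: q(z) = p(z|x,theta)
elbo = np.sum(q * np.log(p_xz / q))        # ELBO = -KL(q || p(x,z|theta))
print(np.isclose(elbo, log_evidence))      # True: the bound is tight at the posterior
```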
