

  1. Latent Variable Models
     Stefano Ermon, Aditya Grover (AI Lab), Stanford University
     Deep Generative Models, Lecture 6

  2. Plan for today
     - Latent variable models: learning deep generative models
     - Stochastic optimization: the reparameterization trick
     - Inference amortization

  3. Variational Autoencoder
     A mixture of an infinite number of Gaussians:
     1. z ∼ N(0, I)
     2. p(x | z) = N(μ_θ(z), Σ_θ(z)), where μ_θ, Σ_θ are neural networks
     3. Even though p(x | z) is simple, the marginal p(x) is very complex/flexible
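
The two-step generative process can be sketched directly. The snippet below is an illustration only: the tanh mean and sigmoid-based standard deviation are assumed stand-ins for the learned networks μ_θ and Σ_θ (here a diagonal covariance), not the lecture's models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the neural networks mu_theta and Sigma_theta
# (hypothetical choices for illustration; a real VAE learns these).
def mu_theta(z):
    return np.tanh(z)                           # nonlinear mean

def sigma_theta(z):
    return 0.1 + 0.9 / (1.0 + np.exp(-z))       # positive std

def sample_x(n, d=2):
    z = rng.standard_normal((n, d))             # step 1: z ~ N(0, I)
    eps = rng.standard_normal((n, d))
    return mu_theta(z) + sigma_theta(z) * eps   # step 2: x|z ~ N(mu(z), diag(sigma(z)^2))

x = sample_x(10_000)
print(x.shape)
```

Each x is Gaussian given its z, but marginalizing over z mixes infinitely many Gaussians with different means and scales, so the samples no longer follow any single Gaussian.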

  4. Recap: Latent Variable Models
     - Allow us to define complex models p(x) in terms of simple building blocks p(x | z)
     - Natural for unsupervised learning tasks (clustering, unsupervised representation learning, etc.)
     - No free lunch: much more difficult to learn compared to fully observed, autoregressive models

  5. Recap: Variational Inference
     Suppose q(z) is any probability distribution over the hidden variables:
     D_KL(q(z) ‖ p(z | x; θ)) = −∑_z q(z) log p(z, x; θ) + log p(x; θ) − H(q) ≥ 0
     The evidence lower bound (ELBO) holds for any q:
     log p(x; θ) ≥ ∑_z q(z) log p(z, x; θ) + H(q)
     Equality holds if q = p(z | x; θ):
     log p(x; θ) = ∑_z q(z) log p(z, x; θ) + H(q)
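
These identities can be checked numerically in a model where everything is tractable. The sketch below uses an assumed toy model (not from the lecture): z ∼ N(0, 1) and x | z ∼ N(z, 1), so p(x) = N(0, 2) and the posterior is N(x/2, 1/2). The ELBO equals log p(x; θ) when q is the exact posterior, and falls strictly below it otherwise.

```python
import numpy as np

# Assumed toy model (for illustration only): z ~ N(0, 1), x | z ~ N(z, 1).
# Then p(x) = N(0, 2) and the posterior is p(z | x) = N(x/2, 1/2).
def elbo(x, m, s):
    # Closed-form ELBO with q(z) = N(m, s^2): E_q[log p(z, x)] + H(q)
    e_log_joint = (-0.5 * np.log(2 * np.pi) - 0.5 * (m**2 + s**2)
                   - 0.5 * np.log(2 * np.pi) - 0.5 * ((x - m)**2 + s**2))
    entropy = 0.5 * np.log(2 * np.pi * np.e * s**2)
    return e_log_joint + entropy

x = 1.5
log_px = -0.5 * np.log(2 * np.pi * 2) - x**2 / 4     # exact log N(x; 0, 2)
tight = elbo(x, m=x / 2, s=np.sqrt(0.5))             # q = exact posterior
loose = elbo(x, m=0.0, s=1.0)                        # q = prior
print(log_px, tight, loose)
```

`tight` matches `log_px` to machine precision (equality case), while `loose` is smaller by exactly D_KL(q ‖ p(z | x)).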

  6. Recap: The Evidence Lower Bound
     What if the posterior p(z | x; θ) is intractable to compute?
     Suppose q(z; φ) is a (tractable) probability distribution over the hidden variables, parameterized by φ (the variational parameters). For example, a Gaussian with mean and covariance specified by φ:
     q(z; φ) = N(φ_1, φ_2)
     Variational inference: pick φ so that q(z; φ) is as close as possible to p(z | x; θ). In the figure, the posterior p(z | x; θ) (blue) is better approximated by N(2, 2) (orange) than by N(−4, 0.75) (green).
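
For Gaussians the divergence being minimized has a closed form, so "pick φ to make q close to the posterior" can be shown directly. The snippet below is illustrative: the target N(2, 2²) is an assumed stand-in for the figure's posterior (with the second argument read as a standard deviation), not taken from the lecture. A grid search over φ = (m, s) recovers the target exactly.

```python
import numpy as np

def kl_gauss(m1, s1, m2, s2):
    # KL( N(m1, s1^2) || N(m2, s2^2) ), closed form for 1-D Gaussians
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2.0 * s2**2) - 0.5

# Assumed stand-in posterior p(z | x) = N(2, 2^2); grid-search the
# variational parameters phi = (m, s) that minimize KL(q || p).
means = np.linspace(-5.0, 5.0, 101)
stds = np.linspace(0.25, 4.0, 76)
best = min((kl_gauss(m, s, 2.0, 2.0), m, s) for m in means for s in stds)
print(best)
```

The minimum KL is 0, attained at m = 2, s = 2: when the variational family contains the posterior, variational inference is exact.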

  7. Recap: The Evidence Lower Bound
     log p(x; θ) ≥ ∑_z q(z; φ) log p(z, x; θ) + H(q(z; φ)) = L(x; θ, φ)   (the ELBO)
     log p(x; θ) = L(x; θ, φ) + D_KL(q(z; φ) ‖ p(z | x; θ))
     The better q(z; φ) approximates the posterior p(z | x; θ), the smaller D_KL(q(z; φ) ‖ p(z | x; θ)) we can achieve, and the closer the ELBO is to log p(x; θ).
     Next: jointly optimize over θ and φ to maximize the ELBO over a dataset.

  8. Variational Learning
     L(x; θ, φ_1) and L(x; θ, φ_2) are both lower bounds. We want to jointly optimize θ and φ.

  9. The Evidence Lower Bound applied to the entire dataset
     The ELBO holds for any q(z; φ):
     log p(x; θ) ≥ ∑_z q(z; φ) log p(z, x; θ) + H(q(z; φ)) = L(x; θ, φ)
     Maximum likelihood learning over the entire dataset:
     ℓ(θ; D) = ∑_{x_i ∈ D} log p(x_i; θ) ≥ ∑_{x_i ∈ D} L(x_i; θ, φ_i)
     Therefore
     max_θ ℓ(θ; D) ≥ max_{θ, φ_1, …, φ_M} ∑_{x_i ∈ D} L(x_i; θ, φ_i)
     Note that we use different variational parameters φ_i for every data point x_i, because the true posterior p(z | x_i; θ) is different across datapoints x_i.

  10. A variational approximation to the posterior
     Assume p(z, x_i; θ) is close to p_data(z, x_i). Suppose z captures information such as the digit identity (label), style, etc. For simplicity, assume z ∈ {0, 1, 2, …, 9}.
     Suppose q(z; φ_i) is a (categorical) probability distribution over the hidden variable z, parameterized by φ_i = [p_0, p_1, …, p_9]:
     q(z; φ_i) = ∏_{k ∈ {0,1,…,9}} (φ_{i,k})^{1[z=k]}
     If φ_i = [0, 0, 0, 1, 0, …, 0], is q(z; φ_i) a good approximation of p(z | x_1; θ) (x_1 is the leftmost datapoint)? Yes.
     If φ_i = [0, 0, 0, 1, 0, …, 0], is q(z; φ_i) a good approximation of p(z | x_3; θ) (x_3 is the rightmost datapoint)? No.
     For each x_i, we need to find a good φ_i* (via optimization, which can be expensive).

  11. Learning via stochastic variational inference (SVI)
     Optimize ∑_{x_i ∈ D} L(x_i; θ, φ_i) as a function of θ, φ_1, …, φ_M using (stochastic) gradient descent:
     L(x_i; θ, φ_i) = ∑_z q(z; φ_i) log p(z, x_i; θ) + H(q(z; φ_i)) = E_{q(z; φ_i)}[log p(z, x_i; θ) − log q(z; φ_i)]
     1. Initialize θ, φ_1, …, φ_M
     2. Randomly sample a data point x_i from D
     3. Optimize L(x_i; θ, φ_i) as a function of φ_i: repeat φ_i ← φ_i + η ∇_{φ_i} L(x_i; θ, φ_i) until convergence to φ_i* ≈ arg max_φ L(x_i; θ, φ)
     4. Compute ∇_θ L(x_i; θ, φ_i*)
     5. Update θ in the gradient direction; go to step 2
     How do we compute the gradients? There might not be a closed-form solution for the expectations, so we use Monte Carlo sampling.
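
The inner loop (step 3) can be sketched on an assumed toy model, z ∼ N(0, 1) and x | z ∼ N(z, 1), chosen for illustration because its ELBO gradient with respect to φ_i = (m_i, s_i) is available in closed form (in general it must be estimated by Monte Carlo, as the next slide shows). The per-datapoint parameters converge to the exact posterior N(x_i/2, 1/2).

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy model (illustration only): z ~ N(0, 1), x | z ~ N(z, 1),
# so p(x) = N(0, 2) and the exact posterior is p(z | x) = N(x/2, 1/2).
data = rng.normal(0.0, np.sqrt(2.0), size=5)

def elbo_grad(x, m, s):
    # Closed-form gradient of the ELBO wrt phi = (m, s) for this model
    # (in general this would be a Monte Carlo estimate).
    dm = -m + (x - m)
    ds = -2.0 * s + 1.0 / s
    return dm, ds

M = [(0.0, 1.0) for _ in data]      # initialize phi_i = (m_i, s_i)
lr = 0.05
for _ in range(500):                # step 3, run for every datapoint
    for i, x in enumerate(data):
        m, s = M[i]
        dm, ds = elbo_grad(x, m, s)
        M[i] = (m + lr * dm, s + lr * ds)

for x, (m, s) in zip(data, M):
    print(round(m, 3), round(x / 2, 3), round(s**2, 3))   # m -> x/2, s^2 -> 1/2
```

Note a separate φ_i is stored per datapoint; the amortization slide later replaces this table with a single function of x.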

  12. Learning Deep Generative Models
     L(x; θ, φ) = ∑_z q(z; φ) log p(z, x; θ) + H(q(z; φ)) = E_{q(z; φ)}[log p(z, x; θ) − log q(z; φ)]
     (Note: the superscript i on φ_i is dropped for compactness.)
     To evaluate the bound, sample z_1, …, z_K from q(z; φ) and estimate
     E_{q(z; φ)}[log p(z, x; θ) − log q(z; φ)] ≈ (1/K) ∑_k [log p(z_k, x; θ) − log q(z_k; φ)]
     Key assumption: q(z; φ) is tractable, i.e., easy to sample from and evaluate.
     We want to compute ∇_θ L(x; θ, φ) and ∇_φ L(x; θ, φ). The gradient with respect to θ is easy:
     ∇_θ E_{q(z; φ)}[log p(z, x; θ) − log q(z; φ)] = E_{q(z; φ)}[∇_θ log p(z, x; θ)] ≈ (1/K) ∑_k ∇_θ log p(z_k, x; θ)
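
The Monte Carlo evaluation of the bound can be checked on an assumed toy model (illustration only): z ∼ N(0, 1), x | z ∼ N(z, 1). When q is the exact posterior N(x/2, 1/2), every sample gives log p(z, x; θ) − log q(z; φ) = log p(x; θ), so the estimator has zero variance and matches log p(x; θ) exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model (illustration, not from the lecture):
# z ~ N(0, 1), x | z ~ N(z, 1), so p(z | x) = N(x/2, 1/2).
x, m, s = 1.5, 0.75, np.sqrt(0.5)        # q(z; phi) = exact posterior for x = 1.5

def log_joint(z):
    # log p(z, x) = log N(z; 0, 1) + log N(x; z, 1)
    return (-0.5 * np.log(2 * np.pi) - 0.5 * z**2
            - 0.5 * np.log(2 * np.pi) - 0.5 * (x - z)**2)

def log_q(z):
    return -0.5 * np.log(2 * np.pi * s**2) - (z - m)**2 / (2 * s**2)

z = rng.normal(m, s, size=1000)          # z_1, ..., z_K ~ q(z; phi)
elbo_mc = np.mean(log_joint(z) - log_q(z))
log_px = -0.5 * np.log(2 * np.pi * 2) - x**2 / 4   # exact log p(x) = log N(x; 0, 2)
print(elbo_mc, log_px)
```

With any other q the per-sample terms fluctuate and the estimate is a noisy lower bound; the closer q is to the posterior, the lower the variance.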

  13. Learning Deep Generative Models
     L(x; θ, φ) = ∑_z q(z; φ) log p(z, x; θ) + H(q(z; φ)) = E_{q(z; φ)}[log p(z, x; θ) − log q(z; φ)]
     We want to compute ∇_θ L(x; θ, φ) and ∇_φ L(x; θ, φ). The gradient with respect to φ is more complicated because the expectation itself depends on φ.
     We still want to estimate it with a Monte Carlo average. Later in the course we'll see a general technique called REINFORCE (from reinforcement learning). For now, a better but less general alternative that only works for continuous z (and only some distributions).

  14. Reparameterization
     We want to compute a gradient with respect to φ of
     E_{q(z; φ)}[r(z)] = ∫ q(z; φ) r(z) dz
     where z is now continuous. Suppose q(z; φ) = N(μ, σ²I) is Gaussian with parameters φ = (μ, σ). These are equivalent ways of sampling:
     - Sample z ∼ q_φ(z)
     - Sample ε ∼ N(0, I), then set z = μ + σε = g(ε; φ)
     Using this equivalence, we can compute the expectation in two ways:
     E_{z ∼ q(z; φ)}[r(z)] = E_{ε ∼ N(0, I)}[r(g(ε; φ))] = ∫ p(ε) r(μ + σε) dε
     ∇_φ E_{q(z; φ)}[r(z)] = ∇_φ E_ε[r(g(ε; φ))] = E_ε[∇_φ r(g(ε; φ))]
     This is easy to estimate via Monte Carlo if r and g are differentiable w.r.t. φ and ε is easy to sample from (backpropagation):
     E_ε[∇_φ r(g(ε; φ))] ≈ (1/K) ∑_k ∇_φ r(g(ε_k; φ)), where ε_1, …, ε_K ∼ N(0, I)
     This estimator typically has much lower variance than REINFORCE.
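
A minimal sketch of the trick, under the assumed choice r(z) = z² (so E_q[r(z)] = μ² + σ² and the exact gradients are 2μ and 2σ): sampling ε first and differentiating through z = μ + σε recovers both gradients from samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choice (assumption): r(z) = z^2 and q(z; phi) = N(mu, sigma^2),
# so E_q[r(z)] = mu^2 + sigma^2 and the exact gradients are (2*mu, 2*sigma).
mu, sigma = 1.5, 0.8
K = 200_000
eps = rng.standard_normal(K)        # eps_1, ..., eps_K ~ N(0, 1)
z = mu + sigma * eps                # z = g(eps; phi)

# Differentiate through g: dr/dmu = 2z * dz/dmu = 2z, dr/dsigma = 2z * eps
grad_mu = np.mean(2 * z)
grad_sigma = np.mean(2 * z * eps)
print(grad_mu, grad_sigma)          # close to 2*mu = 3.0 and 2*sigma = 1.6
```

The gradient lands inside the expectation only because the sampling distribution of ε does not depend on φ; that is the whole content of the trick.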

  15. Learning Deep Generative Models
     L(x; θ, φ) = ∑_z q(z; φ) log p(z, x; θ) + H(q(z; φ)) = E_{q(z; φ)}[log p(z, x; θ) − log q(z; φ)], where the term inside the expectation is r(z, φ)
     Our case is slightly more complicated because we have E_{q(z; φ)}[r(z, φ)] instead of E_{q(z; φ)}[r(z)]: the term inside the expectation also depends on φ. We can still use reparameterization. Assume z = μ + σε = g(ε; φ) as before. Then
     E_{q(z; φ)}[r(z, φ)] = E_ε[r(g(ε; φ), φ)] ≈ (1/K) ∑_k r(g(ε_k; φ), φ)

  16. Amortized Inference
     max_θ ℓ(θ; D) ≥ max_{θ, φ_1, …, φ_M} ∑_{x_i ∈ D} L(x_i; θ, φ_i)
     So far we have used a separate set of variational parameters φ_i for each data point x_i. This does not scale to large datasets.
     Amortization: instead, we learn a single parametric function f_λ that maps each x to a set of (good) variational parameters, like doing regression on x_i ↦ φ_i*. For example, if q(z | x_i) are Gaussians with different means μ_1, …, μ_m, we learn a single neural network f_λ mapping x_i to μ_i.
     We then approximate the true posteriors p(z | x_i; θ) using this distribution q_λ(z | x).
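
The regression view can be sketched on an assumed toy model (illustration only): z ∼ N(0, 1), x | z ∼ N(z, 1), whose exact posterior mean is x/2. A single linear map f_λ(x) = a·x + b (a hypothetical two-parameter "network") can therefore amortize inference exactly, replacing the per-datapoint table of φ_i with two shared parameters.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed toy model (illustration only): z ~ N(0, 1), x | z ~ N(z, 1), whose
# exact posterior mean is x/2. Amortize: one linear map f_lambda(x) = a*x + b
# outputs the variational mean for every x, instead of a separate phi_i each.
x = rng.normal(0.0, np.sqrt(2.0), size=1000)
a, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    m = a * x + b                   # predicted variational means, all points at once
    g = x - 2.0 * m                 # dELBO/dm for this model, per datapoint
    a += lr * np.mean(g * x)        # chain rule: dm/da = x
    b += lr * np.mean(g)            # chain rule: dm/db = 1
print(round(a, 2), round(b, 2))     # converges to a = 0.5, b = 0.0
```

The fit recovers a = 0.5, b = 0, i.e. f_λ(x) = x/2: inference for a new datapoint is now a single forward pass rather than a fresh optimization.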
