Latent Variable Models Stefano Ermon, Aditya Grover Stanford - PowerPoint PPT Presentation

Latent Variable Models Stefano Ermon, Aditya Grover Stanford University Lecture 6 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 6 1 / 25

Plan for today
1. Latent Variable Models
   - Learning deep generative models
   - Stochastic optimization: Reparameterization trick
   - Inference Amortization

Variational Autoencoder
A mixture of an infinite number of Gaussians:
1. $z \sim \mathcal{N}(0, I)$
2. $p(x \mid z) = \mathcal{N}(\mu_\theta(z), \Sigma_\theta(z))$, where $\mu_\theta, \Sigma_\theta$ are neural networks
3. Even though $p(x \mid z)$ is simple, the marginal $p(x)$ is very complex/flexible
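
The two-step generative process above can be sketched in a few lines. This is a minimal one-dimensional illustration, not the lecture's model: `mu_theta` and `sigma_theta` are hypothetical fixed functions standing in for the neural networks $\mu_\theta, \Sigma_\theta$, and the names `sample_x` and `samples` are made up for this example.

```python
import math
import random

def mu_theta(z):
    # Hypothetical stand-in for the mean network mu_theta(z).
    return 3.0 * math.tanh(2.0 * z)

def sigma_theta(z):
    # Hypothetical stand-in for the standard deviation network.
    # The softplus keeps the output strictly positive.
    return 0.1 + math.log1p(math.exp(z))

def sample_x(rng):
    # Step 1: draw the latent z from the prior N(0, 1).
    z = rng.gauss(0.0, 1.0)
    # Step 2: draw x from the Gaussian whose parameters depend on z.
    x = rng.gauss(mu_theta(z), sigma_theta(z))
    return z, x

rng = random.Random(0)
samples = [sample_x(rng)[1] for _ in range(5000)]
```

Each draw of z selects a different Gaussian component, so the collection of x values comes from a continuous mixture rather than from a single Gaussian.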

Recap
Latent Variable Models
- Allow us to define complex models $p(x)$ in terms of simple building blocks $p(x \mid z)$
- Natural for unsupervised learning tasks (clustering, unsupervised representation learning, etc.)
- No free lunch: much more difficult to learn compared to fully observed, autoregressive models

Recap: Variational Inference
Suppose $q(z)$ is any probability distribution over the hidden variables.
$$D_{KL}(q(z) \,\|\, p(z \mid x; \theta)) = -\sum_z q(z) \log p(z, x; \theta) + \log p(x; \theta) - H(q) \ge 0$$
Evidence lower bound (ELBO) holds for any $q$:
$$\log p(x; \theta) \ge \sum_z q(z) \log p(z, x; \theta) + H(q)$$
Equality holds if $q = p(z \mid x; \theta)$:
$$\log p(x; \theta) = \sum_z q(z) \log p(z, x; \theta) + H(q)$$
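
A small numerical check of these three statements, assuming a made-up joint table over a three-valued latent z for one fixed x (the numbers in `joint` are arbitrary and only serve the illustration):

```python
import math

# Hypothetical values of p(z, x; theta) for one fixed x and z in {0, 1, 2}.
joint = [0.10, 0.25, 0.05]
log_px = math.log(sum(joint))                # log p(x) = log sum_z p(z, x)
posterior = [p / sum(joint) for p in joint]  # p(z | x)

def elbo(q):
    # sum_z q(z) log p(z, x) + H(q), skipping zero-probability terms.
    expected_log_joint = sum(qz * math.log(pz) for qz, pz in zip(q, joint) if qz > 0)
    entropy = -sum(qz * math.log(qz) for qz in q if qz > 0)
    return expected_log_joint + entropy

def kl(q, p):
    # D_KL(q || p) for discrete distributions.
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

q_uniform = [1 / 3, 1 / 3, 1 / 3]
gap = log_px - elbo(q_uniform)  # equals D_KL(q || p(z | x)) up to rounding
```

Here `gap` matches `kl(q_uniform, posterior)`, and `elbo(posterior)` recovers `log_px`, which is the equality case.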

Recap: The Evidence Lower bound
- What if the posterior $p(z \mid x; \theta)$ is intractable to compute?
- Suppose $q(z; \phi)$ is a (tractable) probability distribution over the hidden variables parameterized by $\phi$ (variational parameters)
- For example, a Gaussian with mean and covariance specified by $\phi$: $q(z; \phi) = \mathcal{N}(\phi_1, \phi_2)$
- Variational inference: pick $\phi$ so that $q(z; \phi)$ is as close as possible to $p(z \mid x; \theta)$. In the figure, the posterior $p(z \mid x; \theta)$ (blue) is better approximated by $\mathcal{N}(2, 2)$ (orange) than $\mathcal{N}(-4, 0.75)$ (green)

Recap: The Evidence Lower bound
$$\log p(x; \theta) \ge \sum_z q(z; \phi) \log p(z, x; \theta) + H(q(z; \phi)) = \underbrace{\mathcal{L}(x; \theta, \phi)}_{\text{ELBO}}$$
$$\log p(x; \theta) = \mathcal{L}(x; \theta, \phi) + D_{KL}(q(z; \phi) \,\|\, p(z \mid x; \theta))$$
The better $q(z; \phi)$ can approximate the posterior $p(z \mid x; \theta)$, the smaller the $D_{KL}(q(z; \phi) \,\|\, p(z \mid x; \theta))$ we can achieve, and the closer the ELBO will be to $\log p(x; \theta)$.
Next: jointly optimize over $\theta$ and $\phi$ to maximize the ELBO over a dataset

Variational learning
$\mathcal{L}(x; \theta, \phi^1)$ and $\mathcal{L}(x; \theta, \phi^2)$ are both lower bounds on $\log p(x; \theta)$. We want to jointly optimize $\theta$ and $\phi$

The Evidence Lower bound applied to the entire dataset
Evidence lower bound (ELBO) holds for any $q(z; \phi)$:
$$\log p(x; \theta) \ge \sum_z q(z; \phi) \log p(z, x; \theta) + H(q(z; \phi)) = \underbrace{\mathcal{L}(x; \theta, \phi)}_{\text{ELBO}}$$
Maximum likelihood learning (over the entire dataset):
$$\ell(\theta; \mathcal{D}) = \sum_{x^i \in \mathcal{D}} \log p(x^i; \theta) \ge \sum_{x^i \in \mathcal{D}} \mathcal{L}(x^i; \theta, \phi^i)$$
Therefore
$$\max_\theta \ell(\theta; \mathcal{D}) \ge \max_{\theta, \phi^1, \cdots, \phi^M} \sum_{x^i \in \mathcal{D}} \mathcal{L}(x^i; \theta, \phi^i)$$
Note that we use different variational parameters $\phi^i$ for every data point $x^i$, because the true posterior $p(z \mid x^i; \theta)$ is different across datapoints $x^i$

A variational approximation to the posterior
- Assume $p(z, x^i; \theta)$ is close to $p_{data}(z, x^i)$. Suppose $z$ captures information such as the digit identity (label), style, etc. For simplicity, assume $z \in \{0, 1, 2, \cdots, 9\}$.
- Suppose $q(z; \phi^i)$ is a (categorical) probability distribution over the hidden variable $z$ parameterized by $\phi^i = [p_0, p_1, \cdots, p_9]$:
$$q(z; \phi^i) = \prod_{k \in \{0, 1, 2, \cdots, 9\}} (\phi^i_k)^{1[z = k]}$$
- If $\phi^i = [0, 0, 0, 1, 0, \cdots, 0]$, is $q(z; \phi^i)$ a good approximation of $p(z \mid x^1; \theta)$ ($x^1$ is the leftmost datapoint)? Yes
- If $\phi^i = [0, 0, 0, 1, 0, \cdots, 0]$, is $q(z; \phi^i)$ a good approximation of $p(z \mid x^3; \theta)$ ($x^3$ is the rightmost datapoint)? No
- For each $x^i$, need to find a good $\phi^{i,*}$ (via optimization, can be expensive).

Learning via stochastic variational inference (SVI)
Optimize $\sum_{x^i \in \mathcal{D}} \mathcal{L}(x^i; \theta, \phi^i)$ as a function of $\theta, \phi^1, \cdots, \phi^M$ using (stochastic) gradient descent
$$\mathcal{L}(x^i; \theta, \phi^i) = \sum_z q(z; \phi^i) \log p(z, x^i; \theta) + H(q(z; \phi^i)) = E_{q(z; \phi^i)}[\log p(z, x^i; \theta) - \log q(z; \phi^i)]$$
1. Initialize $\theta, \phi^1, \cdots, \phi^M$
2. Randomly sample a data point $x^i$ from $\mathcal{D}$
3. Optimize $\mathcal{L}(x^i; \theta, \phi^i)$ as a function of $\phi^i$:
   1. Repeat $\phi^i = \phi^i + \eta \nabla_{\phi^i} \mathcal{L}(x^i; \theta, \phi^i)$
   2. until convergence to $\phi^{i,*} \approx \arg\max_\phi \mathcal{L}(x^i; \theta, \phi)$
4. Compute $\nabla_\theta \mathcal{L}(x^i; \theta, \phi^{i,*})$
5. Update $\theta$ in the gradient direction. Go to step 2
How to compute the gradients? There might not be a closed form solution for the expectations. So we use Monte Carlo sampling
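
The five steps can be sketched on a toy model where the ELBO and its gradients are available in closed form, so no Monte Carlo is needed yet. Everything specific here is an assumption for illustration: a binary latent z with p(z) = 1/2, p(x | z) = N(theta[z], 1), q(z = 1; phi) = sigmoid(phi), synthetic data from two clusters, and a fixed number of inner steps in place of a convergence check.

```python
import math
import random

def log_joint(z, x, theta):
    # log p(z, x; theta) with p(z) = 1/2 and p(x | z) = N(theta[z], 1).
    return math.log(0.5) - 0.5 * math.log(2 * math.pi) - 0.5 * (x - theta[z]) ** 2

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def elbo(x, theta, phi):
    # Exact ELBO: sum over z in {0, 1} of q(z) [log p(z, x) - log q(z)].
    q1 = sigmoid(phi)
    total = 0.0
    for z, qz in ((0, 1.0 - q1), (1, q1)):
        if qz > 0.0:
            total += qz * (log_joint(z, x, theta) - math.log(qz))
    return total

def fit_phi(x, theta, phi, eta=0.5, steps=50):
    # Step 3: gradient ascent on phi for a fixed number of steps.
    # dL/dphi = q1 (1 - q1) [log p(1, x) - log p(0, x) - phi].
    for _ in range(steps):
        q1 = sigmoid(phi)
        grad = q1 * (1.0 - q1) * (log_joint(1, x, theta) - log_joint(0, x, theta) - phi)
        phi += eta * grad
    return phi

rng = random.Random(0)
data = [rng.gauss(-2.0, 1.0) for _ in range(100)] + [rng.gauss(2.0, 1.0) for _ in range(100)]

theta = [-0.5, 0.5]        # step 1: initialize theta ...
phis = [0.0] * len(data)   # ... and one phi per data point
lr = 0.05
for _ in range(3000):
    i = rng.randrange(len(data))                 # step 2: sample a data point
    phis[i] = fit_phi(data[i], theta, phis[i])   # step 3: optimize phi^i
    q1 = sigmoid(phis[i])
    # Step 4: dL/dtheta_z = q(z) (x - theta_z).
    grad = [(1.0 - q1) * (data[i] - theta[0]), q1 * (data[i] - theta[1])]
    theta = [t + lr * g for t, g in zip(theta, grad)]  # step 5: update theta
```

In this sketch the inner loop dominates the cost, since every visit to a data point reruns the optimization over its own phi.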

Learning Deep Generative models
$$\mathcal{L}(x; \theta, \phi) = \sum_z q(z; \phi) \log p(z, x; \theta) + H(q(z; \phi)) = E_{q(z; \phi)}[\log p(z, x; \theta) - \log q(z; \phi)]$$
Note: dropped $i$ superscript from $\phi^i$ for compactness
To evaluate the bound, sample $z^1, \cdots, z^k$ from $q(z; \phi)$ and estimate
$$E_{q(z; \phi)}[\log p(z, x; \theta) - \log q(z; \phi)] \approx \frac{1}{k} \sum_k \left( \log p(z^k, x; \theta) - \log q(z^k; \phi) \right)$$
Key assumption: $q(z; \phi)$ is tractable, i.e., easy to sample from and evaluate
Want to compute $\nabla_\theta \mathcal{L}(x; \theta, \phi)$ and $\nabla_\phi \mathcal{L}(x; \theta, \phi)$
The gradient with respect to $\theta$ is easy:
$$\nabla_\theta E_{q(z; \phi)}[\log p(z, x; \theta) - \log q(z; \phi)] = E_{q(z; \phi)}[\nabla_\theta \log p(z, x; \theta)] \approx \frac{1}{k} \sum_k \nabla_\theta \log p(z^k, x; \theta)$$
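
Both Monte Carlo estimates can be checked against closed forms on a toy model. The model is an assumption chosen for this sketch only: p(z) = N(0, 1), p(x | z) = N(z + theta, 1), and q(z; phi) = N(m, s^2), all one-dimensional. The function and variable names are hypothetical.

```python
import math
import random

def log_joint(z, x, theta):
    # log p(z, x; theta) for p(z) = N(0, 1) and p(x | z) = N(z + theta, 1).
    return -math.log(2 * math.pi) - 0.5 * z ** 2 - 0.5 * (x - z - theta) ** 2

def log_q(z, m, s):
    # log density of q(z; phi) = N(m, s^2).
    return -0.5 * math.log(2 * math.pi * s ** 2) - 0.5 * ((z - m) / s) ** 2

def mc_elbo_and_grad(x, theta, m, s, k, rng):
    # Sample z^1, ..., z^k from q and average over the samples.
    zs = [rng.gauss(m, s) for _ in range(k)]
    elbo = sum(log_joint(z, x, theta) - log_q(z, m, s) for z in zs) / k
    # For this model, d/dtheta log p(z, x; theta) = x - z - theta.
    grad_theta = sum(x - z - theta for z in zs) / k
    return elbo, grad_theta

x, theta, m, s = 1.5, 0.3, 0.4, 0.8
elbo_hat, grad_hat = mc_elbo_and_grad(x, theta, m, s, 20000, random.Random(0))

# Closed forms for comparison (Gaussian moments and Gaussian entropy).
elbo_exact = (-math.log(2 * math.pi)
              - 0.5 * (m ** 2 + s ** 2)
              - 0.5 * ((x - theta - m) ** 2 + s ** 2)
              + 0.5 * math.log(2 * math.pi * math.e * s ** 2))
grad_exact = x - m - theta
```

The estimator for the θ gradient only needs samples from q, because q itself does not depend on θ.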

Learning Deep Generative models
$$\mathcal{L}(x; \theta, \phi) = \sum_z q(z; \phi) \log p(z, x; \theta) + H(q(z; \phi)) = E_{q(z; \phi)}[\log p(z, x; \theta) - \log q(z; \phi)]$$
- Want to compute $\nabla_\theta \mathcal{L}(x; \theta, \phi)$ and $\nabla_\phi \mathcal{L}(x; \theta, \phi)$
- The gradient with respect to $\phi$ is more complicated because the expectation depends on $\phi$
- We still want to estimate with a Monte Carlo average
- Later in the course we'll see a general technique called REINFORCE (from reinforcement learning)
- For now, a better but less general alternative that only works for continuous $z$ (and only some distributions)

Reparameterization
Want to compute a gradient with respect to $\phi$ of
$$E_{q(z; \phi)}[r(z)] = \int q(z; \phi) r(z) \, dz$$
where $z$ is now continuous
Suppose $q(z; \phi) = \mathcal{N}(\mu, \sigma^2 I)$ is Gaussian with parameters $\phi = (\mu, \sigma)$. These are equivalent ways of sampling:
- Sample $z \sim q(z; \phi)$
- Sample $\epsilon \sim \mathcal{N}(0, I)$, $z = \mu + \sigma \epsilon = g(\epsilon; \phi)$
Using this equivalence we compute the expectation in two ways:
$$E_{z \sim q(z; \phi)}[r(z)] = E_{\epsilon \sim \mathcal{N}(0, I)}[r(g(\epsilon; \phi))] = \int p(\epsilon) r(\mu + \sigma \epsilon) \, d\epsilon$$
$$\nabla_\phi E_{q(z; \phi)}[r(z)] = \nabla_\phi E_\epsilon[r(g(\epsilon; \phi))] = E_\epsilon[\nabla_\phi r(g(\epsilon; \phi))]$$
Easy to estimate via Monte Carlo if $r$ and $g$ are differentiable w.r.t. $\phi$ and $\epsilon$ is easy to sample from (backpropagation)
$$E_\epsilon[\nabla_\phi r(g(\epsilon; \phi))] \approx \frac{1}{k} \sum_k \nabla_\phi r(g(\epsilon^k; \phi)) \quad \text{where } \epsilon^1, \cdots, \epsilon^k \sim \mathcal{N}(0, I)$$
Typically much lower variance than REINFORCE
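
A worked instance of the estimator, assuming the scalar test function r(z) = z^2 (chosen here only because its expectation under N(mu, sigma^2) is mu^2 + sigma^2, so the exact gradient is (2 mu, 2 sigma)). The chain rule is written out by hand instead of using backpropagation.

```python
import random

def r_prime(z):
    # Derivative of the assumed test function r(z) = z**2.
    return 2.0 * z

def reparam_grad(mu, sigma, k, rng):
    # Monte Carlo estimate of grad_phi E_q[r(z)] with phi = (mu, sigma).
    g_mu = 0.0
    g_sigma = 0.0
    for _ in range(k):
        eps = rng.gauss(0.0, 1.0)     # eps ~ N(0, 1), independent of phi
        z = mu + sigma * eps          # z = g(eps; phi)
        g_mu += r_prime(z) * 1.0      # chain rule: dz/dmu = 1
        g_sigma += r_prime(z) * eps   # chain rule: dz/dsigma = eps
    return g_mu / k, g_sigma / k

mu, sigma = 1.0, 0.5
g_mu, g_sigma = reparam_grad(mu, sigma, 20000, random.Random(0))
# Exact values for comparison: 2 * mu = 2.0 and 2 * sigma = 1.0.
```

The randomness enters only through eps, whose distribution does not depend on (mu, sigma), which is what allows the gradient to move inside the expectation.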

Learning Deep Generative models
$$\mathcal{L}(x; \theta, \phi) = \sum_z q(z; \phi) \log p(z, x; \theta) + H(q(z; \phi)) = E_{q(z; \phi)}[\underbrace{\log p(z, x; \theta) - \log q(z; \phi)}_{r(z, \phi)}]$$
Our case is slightly more complicated because we have $E_{q(z; \phi)}[r(z, \phi)]$ instead of $E_{q(z; \phi)}[r(z)]$. The term inside the expectation also depends on $\phi$.
Can still use reparameterization. Assume $z = \mu + \sigma \epsilon = g(\epsilon; \phi)$ like before. Then
$$E_{q(z; \phi)}[r(z, \phi)] = E_\epsilon[r(g(\epsilon; \phi), \phi)] \approx \frac{1}{k} \sum_k r(g(\epsilon^k; \phi), \phi)$$
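
A sketch of optimizing φ with this estimator, under assumptions made only for illustration: p(z) = N(0, 1) and p(x | z) = N(z, 1) with no θ, so the true posterior is N(x/2, 1/2) and the result can be checked. The code optimizes log σ instead of σ to keep σ positive, which is a choice of this sketch, not something the slides prescribe.

```python
import math
import random

def dlogp_dz(z, x):
    # d/dz log p(z, x) for p(z) = N(0, 1), p(x | z) = N(z, 1): -z + (x - z).
    return x - 2.0 * z

def elbo_grad(x, mu, log_sigma, k, rng):
    # Gradient of (1/k) sum_k r(g(eps^k; phi), phi) w.r.t. (mu, log sigma).
    # With z = mu + sigma * eps, log q(z; phi) = -log sigma - eps^2 / 2 - const,
    # so -log q contributes 0 to the mu gradient and +1 to the log sigma gradient.
    sigma = math.exp(log_sigma)
    g_mu = 0.0
    g_ls = 0.0
    for _ in range(k):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps
        g_mu += dlogp_dz(z, x)
        g_ls += dlogp_dz(z, x) * sigma * eps + 1.0
    return g_mu / k, g_ls / k

rng = random.Random(0)
x, mu, log_sigma = 3.0, 0.0, 0.0
for _ in range(2000):
    g_mu, g_ls = elbo_grad(x, mu, log_sigma, 10, rng)
    mu += 0.01 * g_mu          # stochastic gradient ascent on the ELBO
    log_sigma += 0.01 * g_ls
sigma = math.exp(log_sigma)
# True posterior for this model: mean x / 2 = 1.5, std sqrt(1/2).
```

Because the Gaussian family contains the true posterior in this toy model, the fitted (mu, sigma) should land close to the exact posterior parameters.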

Amortized Inference
$$\max_\theta \ell(\theta; \mathcal{D}) \ge \max_{\theta, \phi^1, \cdots, \phi^M} \sum_{x^i \in \mathcal{D}} \mathcal{L}(x^i; \theta, \phi^i)$$
- So far we have used a set of variational parameters $\phi^i$ for each data point $x^i$. This does not scale to large datasets.
- Amortization: Now we learn a single parametric function $f_\lambda$ that maps each $x$ to a set of (good) variational parameters. Like doing regression on $x^i \mapsto \phi^{i,*}$
- For example, if $q(z \mid x^i)$ are Gaussians with different means $\mu^1, \cdots, \mu^m$, we learn a single neural network $f_\lambda$ mapping $x^i$ to $\mu^i$
- We approximate the posteriors $q(z \mid x^i)$ using this distribution $q_\lambda(z \mid x)$
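
A minimal sketch of amortization on a toy model, with all specifics assumed for illustration: p(z) = N(0, 1), p(x | z) = N(z, 1), so the exact posterior mean is x/2 for every x. The hypothetical "network" f_λ is just the linear map x ↦ λx with a shared log σ, trained with reparameterized ELBO gradients over mini-batches.

```python
import math
import random

def dlogp_dz(z, x):
    # d/dz log p(z, x) for p(z) = N(0, 1), p(x | z) = N(z, 1).
    return x - 2.0 * z

rng = random.Random(0)
# Synthetic data from the model's marginal p(x) = N(0, 2).
data = [rng.gauss(0.0, math.sqrt(2.0)) for _ in range(200)]

lam, log_sigma = 0.0, 0.0   # shared parameters of q_lambda(z | x) = N(lam * x, sigma^2)
lr, batch = 0.005, 10
for _step in range(3000):
    sigma = math.exp(log_sigma)
    g_lam = 0.0
    g_ls = 0.0
    for _b in range(batch):
        x = rng.choice(data)
        eps = rng.gauss(0.0, 1.0)
        z = lam * x + sigma * eps        # reparameterized sample from q_lambda(z | x)
        d = dlogp_dz(z, x)
        g_lam += d * x                   # chain rule: dz/dlam = x
        g_ls += d * sigma * eps + 1.0    # chain rule plus the entropy term
    lam += lr * g_lam / batch
    log_sigma += lr * g_ls / batch

def encoder(x):
    # Amortized inference: one function returns (mean, std) for any x.
    return lam * x, math.exp(log_sigma)
```

After training, `encoder(x)` gives variational parameters for a new x without any per-datapoint optimization, which is the point of amortization. In this toy model the learned `lam` should sit near 0.5.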
