SLIDE 1

Discrete Latent Variable Models

Stefano Ermon, Aditya Grover

Stanford University

Lecture 15

SLIDE 2

Summary

Major themes in the course:

- Representation: latent variable vs. fully observed
- Objective function and optimization algorithm: many divergences and distances, optimized via likelihood-free (two-sample test) or likelihood-based methods
- Evaluation of generative models
- Combining different models and variants

Plan for today: discrete latent variable modeling

SLIDE 3

Why should we care about discreteness?

Discreteness is all around us!

- Decision making: should I attend the CS 236 lecture or not?
- Structure learning

SLIDE 4

Why should we care about discreteness?

Many data modalities are inherently discrete:

- Graphs
- Text, DNA sequences, program source code, molecules, and lots more

SLIDE 5

Stochastic Optimization

Consider the following optimization problem:

$$\max_\phi \; \mathbb{E}_{q_\phi(z)}[f(z)]$$

Recap example: think of $q_\phi(\cdot)$ as the inference distribution for a VAE:

$$\max_{\theta, \phi} \; \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right]$$

Gradients w.r.t. $\theta$ can be derived via linearity of expectation:

$$\nabla_\theta \mathbb{E}_{q(z;\phi)}[\log p(z, x; \theta) - \log q(z; \phi)] = \mathbb{E}_{q(z;\phi)}[\nabla_\theta \log p(z, x; \theta)] \approx \frac{1}{K} \sum_k \nabla_\theta \log p(z^k, x; \theta)$$

If $z$ is continuous, $q(\cdot)$ is reparameterizable, and $f(\cdot)$ is differentiable in $\phi$, then we can use the reparameterization trick to compute gradients w.r.t. $\phi$. What if any of these assumptions fail?
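For intuition, here is a minimal PyTorch sketch of the reparameterized (pathwise) gradient in the Gaussian case; the objective and parameter values are invented for illustration:

```python
import torch

# Reparameterization: z = mu + sigma * eps with eps ~ N(0, 1), so samples are a
# differentiable function of phi = (mu, log_sigma) and backprop gives gradients.
mu = torch.tensor(0.5, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)

eps = torch.randn(100_000)              # fixed base distribution
z = mu + log_sigma.exp() * eps          # reparameterized samples
f = (z - 2.0) ** 2                      # a differentiable f(z), for illustration
f.mean().backward()                     # pathwise Monte Carlo gradient
# E[(z - 2)^2] = (mu - 2)^2 + sigma^2, so d/dmu = 2(mu - 2) = -3 here
print(mu.grad, log_sigma.grad)
```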

SLIDE 6

Stochastic Optimization with REINFORCE

Consider the following optimization problem:

$$\max_\phi \; \mathbb{E}_{q_\phi(z)}[f(z)]$$

For many classes of problems, the reparameterization trick is inapplicable:

- Scenario 1: $f(\cdot)$ is non-differentiable in $\phi$, e.g., when optimizing a black-box reward function in reinforcement learning
- Scenario 2: $q_\phi(z)$ cannot be reparameterized as a differentiable function of $\phi$ with respect to a fixed base distribution, e.g., discrete distributions

REINFORCE is a general-purpose solution to both scenarios. We will first analyze it in the context of reinforcement learning and then extend it to latent variable models with discrete latent variables.

SLIDE 7

REINFORCE for reinforcement learning

Example: pulling the arms of slot machines. Which arm should we pull?

- A set $A$ of possible actions, e.g., pull arm 1, arm 2, etc.
- Each action $z \in A$ has a reward $f(z)$
- A randomized policy for choosing actions, $q_\phi(z)$, parameterized by $\phi$; for example, $\phi$ could be the parameters of a multinomial distribution
- Goal: learn the parameters $\phi$ that maximize our earnings (in expectation):

$$\max_\phi \; \mathbb{E}_{q_\phi(z)}[f(z)]$$

SLIDE 8

Policy Gradients

Want to compute the gradient with respect to $\phi$ of the expected reward:

$$\mathbb{E}_{q_\phi(z)}[f(z)] = \sum_z q_\phi(z) f(z)$$

$$\frac{\partial}{\partial \phi_i} \mathbb{E}_{q_\phi(z)}[f(z)] = \sum_z \frac{\partial q_\phi(z)}{\partial \phi_i} f(z) = \sum_z q_\phi(z) \frac{1}{q_\phi(z)} \frac{\partial q_\phi(z)}{\partial \phi_i} f(z) = \sum_z q_\phi(z) \frac{\partial \log q_\phi(z)}{\partial \phi_i} f(z) = \mathbb{E}_{q_\phi(z)}\left[f(z) \frac{\partial \log q_\phi(z)}{\partial \phi_i}\right]$$

SLIDE 9

REINFORCE Gradient Estimation

Want to compute the gradient with respect to $\phi$ of

$$\mathbb{E}_{q_\phi(z)}[f(z)] = \sum_z q_\phi(z) f(z)$$

The REINFORCE rule is

$$\nabla_\phi \mathbb{E}_{q_\phi(z)}[f(z)] = \mathbb{E}_{q_\phi(z)}\left[f(z) \nabla_\phi \log q_\phi(z)\right]$$

We can now construct a Monte Carlo estimate: sample $z^1, \dots, z^K$ from $q_\phi(z)$ and estimate

$$\nabla_\phi \mathbb{E}_{q_\phi(z)}[f(z)] \approx \frac{1}{K} \sum_k f(z^k) \nabla_\phi \log q_\phi(z^k)$$

Assumption: the distribution $q(\cdot)$ is easy to sample from, and its probabilities are easy to evaluate. This works for both discrete and continuous distributions.
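To make this concrete, here is a minimal numpy sketch of the REINFORCE estimator for a softmax-parameterized categorical $q_\phi$; the logits and reward table are invented for illustration, and the closed-form gradient is available only because the example is tiny:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-armed bandit: logits phi parameterize q_phi via a softmax,
# and f is a fixed reward table (both invented for illustration).
phi = np.array([0.5, -0.2, 0.1, 0.3])
f = np.array([1.0, 3.0, 0.5, 2.0])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce_grad(phi, K=100_000):
    """(1/K) sum_k f(z^k) * grad_phi log q_phi(z^k), with z^k ~ q_phi."""
    q = softmax(phi)
    z = rng.choice(len(q), size=K, p=q)
    score = np.eye(len(q))[z] - q        # grad_phi log q(z) for a softmax
    return (f[z, None] * score).mean(axis=0)

q = softmax(phi)
print(reinforce_grad(phi))               # Monte Carlo estimate
print(q * (f - q @ f))                   # exact gradient for this small example
```

Note that REINFORCE itself needs nothing beyond samples from $q_\phi$ and the ability to evaluate $\nabla_\phi \log q_\phi(z)$.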

SLIDE 10

Variational Learning of Latent Variable Models

To learn the variational approximation, we need to compute the gradient with respect to $\phi$ of

$$\mathcal{L}(x; \theta, \phi) = \sum_z q_\phi(z|x) \log p(z, x; \theta) + H(q_\phi(z|x)) = \mathbb{E}_{q_\phi(z|x)}[\log p(z, x; \theta) - \log q_\phi(z|x)]$$

The function inside the brackets also depends on $\phi$ (and $\theta$, $x$). Want to compute the gradient with respect to $\phi$ of

$$\mathbb{E}_{q_\phi(z|x)}[f(\phi, \theta, z, x)] = \sum_z q_\phi(z|x) f(\phi, \theta, z, x)$$

The REINFORCE rule is

$$\nabla_\phi \mathbb{E}_{q_\phi(z|x)}[f(\phi, \theta, z, x)] = \mathbb{E}_{q_\phi(z|x)}\left[f(\phi, \theta, z, x) \nabla_\phi \log q_\phi(z|x) + \nabla_\phi f(\phi, \theta, z, x)\right]$$

We can now construct a Monte Carlo estimate of $\nabla_\phi \mathcal{L}(x; \theta, \phi)$.

SLIDE 11

REINFORCE Gradient Estimates have High Variance

Want to compute the gradient with respect to $\phi$ of

$$\mathbb{E}_{q_\phi(z)}[f(z)] = \sum_z q_\phi(z) f(z)$$

The REINFORCE rule is

$$\nabla_\phi \mathbb{E}_{q_\phi(z)}[f(z)] = \mathbb{E}_{q_\phi(z)}\left[f(z) \nabla_\phi \log q_\phi(z)\right]$$

Monte Carlo estimate, with $z^1, \dots, z^K$ sampled from $q_\phi(z)$:

$$\nabla_\phi \mathbb{E}_{q_\phi(z)}[f(z)] \approx \frac{1}{K} \sum_k f(z^k) \nabla_\phi \log q_\phi(z^k) =: f_{\mathrm{MC}}(z^1, \dots, z^K)$$

Monte Carlo estimates of gradients are unbiased:

$$\mathbb{E}_{z^1, \dots, z^K \sim q_\phi(z)}\left[f_{\mathrm{MC}}(z^1, \dots, z^K)\right] = \nabla_\phi \mathbb{E}_{q_\phi(z)}[f(z)]$$

The plain estimator is almost never used in practice because of its high variance. The variance can be reduced via carefully designed control variates.

SLIDE 12

Control Variates

The REINFORCE rule is

$$\nabla_\phi \mathbb{E}_{q_\phi(z)}[f(z)] = \mathbb{E}_{q_\phi(z)}\left[f(z) \nabla_\phi \log q_\phi(z)\right]$$

Given any constant $B$ (a control variate):

$$\nabla_\phi \mathbb{E}_{q_\phi(z)}[f(z)] = \mathbb{E}_{q_\phi(z)}\left[(f(z) - B) \nabla_\phi \log q_\phi(z)\right]$$

To see why:

$$\mathbb{E}_{q_\phi(z)}[B \nabla_\phi \log q_\phi(z)] = B \sum_z q_\phi(z) \nabla_\phi \log q_\phi(z) = B \sum_z \nabla_\phi q_\phi(z) = B \nabla_\phi \sum_z q_\phi(z) = B \nabla_\phi 1 = 0$$

Monte Carlo gradient estimates based on $f(z)$ and $f(z) - B$ have the same expectation; the estimates can, however, have different variances.

SLIDE 13

Control variates

Suppose we want to compute

$$\mathbb{E}_{q_\phi(z)}[f(z)] = \sum_z q_\phi(z) f(z)$$

Define

$$\tilde f(z) = f(z) + a\left(h(z) - \mathbb{E}_{q_\phi(z)}[h(z)]\right)$$

where $h(z)$ is referred to as a control variate and $a$ is a scalar coefficient. Assumption: $\mathbb{E}_{q_\phi(z)}[h(z)]$ is known.

Monte Carlo gradient estimates of $\tilde f(z)$ and $f(z)$ have the same expectation,

$$\mathbb{E}_{z^1, \dots, z^K \sim q_\phi(z)}[\tilde f_{\mathrm{MC}}(z^1, \dots, z^K)] = \mathbb{E}_{z^1, \dots, z^K \sim q_\phi(z)}[f_{\mathrm{MC}}(z^1, \dots, z^K)]$$

but different variances. One can also try to learn and update the control variate during training.

SLIDE 14

Control variates

We can derive an alternative Monte Carlo estimate of the REINFORCE gradient based on control variates. Sample $z^1, \dots, z^K$ from $q_\phi(z)$:

$$\nabla_\phi \mathbb{E}_{q_\phi(z)}[f(z)] = \nabla_\phi \mathbb{E}_{q_\phi(z)}\left[f(z) + a\left(h(z) - \mathbb{E}_{q_\phi(z)}[h(z)]\right)\right]$$

$$\approx \frac{1}{K} \sum_k f(z^k) \nabla_\phi \log q_\phi(z^k) + a\left(\frac{1}{K} \sum_{k=1}^K h(z^k) - \mathbb{E}_{q_\phi(z)}[h(z)]\right)$$

$$=: f_{\mathrm{MC}}(z^1, \dots, z^K) + a\left(h_{\mathrm{MC}}(z^1, \dots, z^K) - \mathbb{E}_{q_\phi(z)}[h(z)]\right) =: \tilde f_{\mathrm{MC}}(z^1, \dots, z^K)$$

What is $\mathrm{Var}(\tilde f_{\mathrm{MC}})$ vs. $\mathrm{Var}(f_{\mathrm{MC}})$?

SLIDE 15

Control variates

Comparing $\mathrm{Var}(\tilde f_{\mathrm{MC}})$ vs. $\mathrm{Var}(f_{\mathrm{MC}})$:

$$\mathrm{Var}(\tilde f_{\mathrm{MC}}) = \mathrm{Var}\left(f_{\mathrm{MC}} + a\left(h_{\mathrm{MC}} - \mathbb{E}_{q_\phi(z)}[h(z)]\right)\right) = \mathrm{Var}(f_{\mathrm{MC}} + a h_{\mathrm{MC}}) = \mathrm{Var}(f_{\mathrm{MC}}) + a^2 \mathrm{Var}(h_{\mathrm{MC}}) + 2a\,\mathrm{Cov}(f_{\mathrm{MC}}, h_{\mathrm{MC}})$$

To get the optimal coefficient $a^*$ that minimizes the variance, take the derivative w.r.t. $a$ and set it to 0:

$$a^* = -\frac{\mathrm{Cov}(f_{\mathrm{MC}}, h_{\mathrm{MC}})}{\mathrm{Var}(h_{\mathrm{MC}})}$$

SLIDE 16

Control variates

Comparing $\mathrm{Var}(\tilde f_{\mathrm{MC}})$ vs. $\mathrm{Var}(f_{\mathrm{MC}})$:

$$\mathrm{Var}(\tilde f_{\mathrm{MC}}) = \mathrm{Var}(f_{\mathrm{MC}}) + a^2 \mathrm{Var}(h_{\mathrm{MC}}) + 2a\,\mathrm{Cov}(f_{\mathrm{MC}}, h_{\mathrm{MC}})$$

Setting the coefficient $a = a^* = -\mathrm{Cov}(f_{\mathrm{MC}}, h_{\mathrm{MC}}) / \mathrm{Var}(h_{\mathrm{MC}})$:

$$\mathrm{Var}(\tilde f_{\mathrm{MC}}) = \mathrm{Var}(f_{\mathrm{MC}}) - \frac{\mathrm{Cov}(f_{\mathrm{MC}}, h_{\mathrm{MC}})^2}{\mathrm{Var}(h_{\mathrm{MC}})} = \left(1 - \frac{\mathrm{Cov}(f_{\mathrm{MC}}, h_{\mathrm{MC}})^2}{\mathrm{Var}(h_{\mathrm{MC}})\,\mathrm{Var}(f_{\mathrm{MC}})}\right)\mathrm{Var}(f_{\mathrm{MC}}) = \left(1 - \rho(f_{\mathrm{MC}}, h_{\mathrm{MC}})^2\right)\mathrm{Var}(f_{\mathrm{MC}})$$

The correlation coefficient $\rho(f_{\mathrm{MC}}, h_{\mathrm{MC}})$ lies between −1 and 1. For maximum variance reduction, we want $f_{\mathrm{MC}}$ and $h_{\mathrm{MC}}$ to be highly correlated (a numerical check follows below).
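The following numpy sketch checks the $(1 - \rho^2)$ variance reduction empirically; the integrand $f(z) = e^z$, the control variate $h(z) = z$, and the standard normal $q$ are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E[f(z)] with z ~ N(0, 1), using h(z) = z as a control variate
# with known mean E_q[h] = 0.
z = rng.normal(size=(1000, 200))     # 1000 independent batches of K=200 samples
f_mc = np.exp(z).mean(axis=1)        # plain MC estimates, one per batch
h_mc = z.mean(axis=1)                # control-variate estimates

# Optimal coefficient a* = -Cov(f_mc, h_mc) / Var(h_mc), estimated empirically
a_star = -np.cov(f_mc, h_mc)[0, 1] / h_mc.var()
f_cv = f_mc + a_star * (h_mc - 0.0)  # tilde-f_MC

print(f_mc.var(), f_cv.var())        # variance drops with the control variate
rho = np.corrcoef(f_mc, h_mc)[0, 1]
print((1 - rho**2) * f_mc.var())     # matches f_cv.var() up to sampling noise
```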

SLIDE 17

Neural Variational Inference and Learning (NVIL)

Latent variable models with discrete latent variables are often referred to as belief networks. The variational learning objective is the same ELBO as before:

$$\mathcal{L}(x; \theta, \phi) = \sum_z q_\phi(z|x) \log p(z, x; \theta) + H(q_\phi(z|x)) = \mathbb{E}_{q_\phi(z|x)}[\log p(z, x; \theta) - \log q_\phi(z|x)] =: \mathbb{E}_{q_\phi(z|x)}[f(\phi, \theta, z, x)]$$

Here $z$ is discrete, and hence we cannot use reparameterization.

SLIDE 18

Neural Variational Inference and Learning (NVIL)

NVIL (Mnih & Gregor, 2014) learns belief networks via REINFORCE + control variates. Learning objective:

$$\mathcal{L}(x; \theta, \phi, \psi, B) = \mathbb{E}_{q_\phi(z|x)}[f(\phi, \theta, z, x) - h_\psi(x) - B]$$

- Control variate 1: a constant baseline $B$
- Control variate 2: an input-dependent baseline $h_\psi(x)$

Both $B$ and $\psi$ are learned via gradient descent. Gradient estimates w.r.t. $\phi$ (a sketch of the corresponding update follows below):

$$\nabla_\phi \mathcal{L}(x; \theta, \phi, \psi, B) = \mathbb{E}_{q_\phi(z|x)}\left[(f(\phi, \theta, z, x) - h_\psi(x) - B)\nabla_\phi \log q_\phi(z|x) + \nabla_\phi f(\phi, \theta, z, x)\right]$$
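A hedged PyTorch sketch of one NVIL-style update is below. The architectures, sizes, and hyperparameters are invented, and this is a sketch of the idea under simplifying assumptions (factorized Bernoulli encoder/decoder, uniform prior, binarized inputs), not the authors' exact implementation:

```python
import torch

enc = torch.nn.Linear(784, 200)          # logits of q_phi(z|x)
dec = torch.nn.Linear(200, 784)          # logits of p_theta(x|z)
h_psi = torch.nn.Linear(784, 1)          # input-dependent baseline h_psi(x)
B = torch.zeros(1, requires_grad=True)   # constant baseline
params = [*enc.parameters(), *dec.parameters(), *h_psi.parameters(), B]
opt = torch.optim.Adam(params, lr=1e-3)

def nvil_step(x):
    q = torch.distributions.Bernoulli(logits=enc(x))
    z = q.sample()                                           # discrete sample
    log_qz = q.log_prob(z).sum(-1)
    log_pxz = torch.distributions.Bernoulli(logits=dec(z)).log_prob(x).sum(-1)
    log_pz = torch.distributions.Bernoulli(
        probs=0.5 * torch.ones_like(z)).log_prob(z).sum(-1)
    f = log_pxz + log_pz - log_qz                            # ELBO integrand
    signal = f.detach() - h_psi(x).squeeze(-1) - B           # centered signal
    # Score-function term (detached signal) plus pathwise terms via f;
    # the squared signal trains the baselines psi and B.
    loss = -(signal.detach() * log_qz + f).mean() + (signal ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return f.mean().item()

x = (torch.rand(32, 784) < 0.5).float()  # toy binarized batch
nvil_step(x)
```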

SLIDE 19

Towards reparameterized, continuous relaxations

Consider the following optimization problem:

$$\max_\phi \; \mathbb{E}_{q_\phi(z)}[f(z)]$$

What if $z$ is a discrete random variable, e.g., over categories or permutations?

- The reparameterization trick is not directly applicable
- REINFORCE is a general-purpose solution, but needs careful design of control variates
- Today: relax $z$ to a continuous random variable with a reparameterizable distribution

SLIDE 20

Gumbel Distribution

Setting: we are given i.i.d. samples $y_1, y_2, \dots, y_n$ from some underlying distribution. How can we model the distribution of $g = \max\{y_1, y_2, \dots, y_n\}$? E.g., predicting the maximum water level of a river based on historical data, in order to detect flooding.

The Gumbel distribution is very useful for modeling extreme, rare events, e.g., natural disasters, finance.

The CDF of a Gumbel random variable $g$ is parameterized by a location parameter $\mu$ and a scale parameter $\beta$:

$$F(g; \mu, \beta) = \exp\left(-\exp\left(-\frac{g - \mu}{\beta}\right)\right)$$

Note the double exponential in the CDF: if $g$ is a Gumbel(0, 1) r.v., then $\exp(-g)$ is an Exponential(1) r.v. For this reason, Gumbel r.v.s are often referred to as doubly exponential r.v.s.
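Since the CDF inverts in closed form, Gumbel samples are cheap to generate. A small numpy sketch; the sanity check against maxima of exponentials is our own illustration of the extreme-value story:

```python
import numpy as np

rng = np.random.default_rng(0)

# Inverting F(g) = exp(-exp(-(g - mu)/beta)) at u ~ Uniform(0, 1) gives
# g = mu - beta * log(-log(u)).
def sample_gumbel(mu=0.0, beta=1.0, size=()):
    u = rng.uniform(size=size)
    return mu - beta * np.log(-np.log(u))

# The max of n i.i.d. Exponential(1) draws is approximately Gumbel(log n, 1),
# whose mean is log n + 0.5772... (the Euler-Mascheroni constant).
n, reps = 1000, 20_000
maxima = rng.exponential(size=(reps, n)).max(axis=1)
print(maxima.mean(), np.log(n) + 0.5772)
print(sample_gumbel(mu=np.log(n), size=reps).mean())
```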

SLIDE 21

Categorical Distributions

Let $z$ denote a $k$-dimensional categorical random variable with distribution $q$ parameterized by class probabilities $\pi = (\pi_1, \pi_2, \dots, \pi_k)$. We will represent $z$ as a one-hot vector.

Gumbel-Max reparameterization trick for sampling from categorical random variables:

$$z = \text{one\_hot}\left(\arg\max_i \,(g_i + \log \pi_i)\right)$$

where $g_1, g_2, \dots, g_k$ are i.i.d. samples drawn from Gumbel(0, 1).

This is reparameterizable, since the randomness is transferred to a fixed Gumbel(0, 1) distribution! Problem: arg max is non-differentiable w.r.t. $\pi$.
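A quick numpy check of the trick (the class probabilities are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Check that argmax_i (g_i + log pi_i), with g_i ~ Gumbel(0, 1),
# samples exactly from Categorical(pi).
pi = np.array([0.60, 0.25, 0.15])
K = 200_000
g = rng.gumbel(size=(K, len(pi)))          # fixed base distribution
z = np.argmax(g + np.log(pi), axis=1)      # Gumbel-Max trick
print(np.bincount(z) / K)                  # ~ [0.60, 0.25, 0.15]
```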

SLIDE 22

Relaxing Categorical Distributions to Gumbel-Softmax

Gumbel-Max sampler (non-differentiable w.r.t. $\pi$):

$$z = \text{one\_hot}\left(\arg\max_i \,(g_i + \log \pi_i)\right)$$

Key idea: replace the arg max with a softmax to get a Gumbel-Softmax random variable $\hat z$. The output of the softmax is differentiable w.r.t. $\pi$.

Gumbel-Softmax sampler (differentiable w.r.t. $\pi$):

$$\hat z_i = \frac{\exp\left((g_i + \log \pi_i)/\tau\right)}{\sum_{j=1}^k \exp\left((g_j + \log \pi_j)/\tau\right)}$$

where $\tau > 0$ is a tunable parameter referred to as the temperature.
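A minimal numpy sketch of the relaxed sampler, which also previews the temperature behavior discussed on the next slide (probabilities invented):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gumbel_softmax(pi, tau, size=()):
    """Relaxed one-hot sample z_hat = softmax((g + log pi) / tau)."""
    g = rng.gumbel(size=size + (len(pi),))
    return softmax((g + np.log(pi)) / tau)

pi = np.array([0.60, 0.25, 0.15])
print(gumbel_softmax(pi, tau=0.1))    # nearly one-hot
print(gumbel_softmax(pi, tau=10.0))   # nearly uniform over the simplex
```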

SLIDE 23

Bias-variance tradeoff via temperature control

The Gumbel-Softmax distribution is parameterized by both the class probabilities $\pi$ and the temperature $\tau > 0$:

$$\hat z = \mathrm{softmax}\left(\frac{g + \log \pi}{\tau}\right)$$

The temperature $\tau$ controls the degree of relaxation via a bias-variance tradeoff:

- As $\tau \to 0$, samples from Gumbel-Softmax($\pi$, $\tau$) approach samples from Categorical($\pi$). Pro: low bias in the approximation. Con: high variance in the gradients
- As $\tau \to \infty$, samples from Gumbel-Softmax($\pi$, $\tau$) approach samples from Categorical($\frac{1}{k}, \frac{1}{k}, \dots, \frac{1}{k}$), i.e., uniform over the $k$ categories

SLIDE 24

Geometric Interpretation

Consider a categorical distribution with class probability vector $\pi = [0.60, 0.25, 0.15]$. Define a probability simplex with the one-hot vectors as vertices. For a categorical distribution, all probability mass is concentrated at the vertices of the probability simplex; Gumbel-Softmax samples points in the interior of the simplex (in the figure, lighter color intensity implies higher probability).

SLIDE 25

Gumbel-Softmax in action

Original optimization problem:

$$\max_\phi \; \mathbb{E}_{q_\phi(z)}[f(z)]$$

where $q_\phi(z)$ is a categorical distribution and $\phi = \pi$.

Relaxed optimization problem:

$$\max_\phi \; \mathbb{E}_{q_\phi(\hat z)}[f(\hat z)]$$

where $q_\phi(\hat z)$ is a Gumbel-Softmax distribution and $\phi = \{\pi, \tau\}$.

Usually, the temperature $\tau$ is explicitly annealed: start high for low-variance gradients, then gradually reduce it to tighten the approximation.

Note that $\hat z$ is not a discrete category. If the function $f(\cdot)$ explicitly requires a discrete $z$, we estimate straight-through gradients (see the sketch below):

- Use a hard sample $z \sim \mathrm{Categorical}(\pi)$ to evaluate the objective in the forward pass
- Use the soft sample $\hat z \sim \mathrm{GumbelSoftmax}(\pi, \tau)$ to evaluate gradients in the backward pass
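A common way to implement this in an autodiff framework is the "hard forward, soft backward" trick; this PyTorch sketch is one standard formulation, not the only one:

```python
import torch

def st_gumbel_softmax(logits, tau=1.0):
    """Hard one-hot sample in the forward pass, soft Gumbel-Softmax
    gradients in the backward pass (straight-through estimator)."""
    u = torch.rand_like(logits).clamp_min(1e-20)
    g = -torch.log(-torch.log(u))                       # Gumbel(0, 1) noise
    z_soft = torch.softmax((logits + g) / tau, dim=-1)
    idx = z_soft.argmax(dim=-1, keepdim=True)
    z_hard = torch.zeros_like(z_soft).scatter_(-1, idx, 1.0)
    return z_hard + z_soft - z_soft.detach()            # value: hard; grad: soft

logits = torch.log(torch.tensor([0.60, 0.25, 0.15])).requires_grad_()
reward = torch.tensor([1.0, 3.0, 0.5])                  # invented objective
z = st_gumbel_softmax(logits, tau=0.5)                  # exactly one-hot
(reward * z).sum().backward()                           # differentiable w.r.t. logits
print(z, logits.grad)
```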

SLIDE 26

Combinatorial, Discrete Objects: Permutations

For discovering rankings and matchings in an unsupervised manner, $z$ is represented as a permutation. A $k$-dimensional permutation $z$ is a ranked list of the $k$ indices $\{1, 2, \dots, k\}$.

Stochastic optimization problem:

$$\max_\phi \; \mathbb{E}_{q_\phi(z)}[f(z)]$$

where $q_\phi(z)$ is a distribution over $k$-dimensional permutations.

First attempt: view each permutation as a distinct category and relax the categorical distribution to a Gumbel-Softmax. This is infeasible: the number of possible $k$-dimensional permutations is $k!$, and Gumbel-Softmax does not scale to a combinatorially large number of categories.

SLIDE 27

Plackett-Luce (PL) Distribution

In many fields, such as information retrieval and social choice theory, we often want to rank our preferences over $k$ items. The Plackett-Luce (PL) distribution is a common modeling assumption for such rankings. A $k$-dimensional PL distribution is defined over the set of permutations $S_k$ and parameterized by $k$ positive scores $s$.

Sequential sampler for the PL distribution (implemented in the sketch below):

- Sample $z_1$ with probability proportional to the scores of all $k$ items: $p(z_1 = i) \propto s_i$
- Sample the remaining entries the same way, without replacement: repeat for $z_2, z_3, \dots, z_k$

PDF of the PL distribution:

$$q_s(z) = \frac{s_{z_1}}{Z} \cdot \frac{s_{z_2}}{Z - s_{z_1}} \cdot \frac{s_{z_3}}{Z - \sum_{i=1}^{2} s_{z_i}} \cdots \frac{s_{z_k}}{Z - \sum_{i=1}^{k-1} s_{z_i}}$$

where $Z = \sum_{i=1}^k s_i$ is the normalizing constant.

SLIDE 28

Relaxing PL Distribution to Gumbel-PL

Gumbel-PL reparameterized sampler:

- Add i.i.d. standard Gumbel noise $g_1, g_2, \dots, g_k$ to the log scores $\log s_1, \log s_2, \dots, \log s_k$: $\tilde s_i = g_i + \log s_i$
- Set $z$ to be the permutation that sorts the Gumbel-perturbed log-scores $\tilde s_1, \tilde s_2, \dots, \tilde s_k$ in decreasing order

Figure: (a) the sequential sampler versus (b) the reparameterized sampler ($\log s + g$, followed by a sort); squares and circles denote deterministic and stochastic nodes.

Challenge: the sorting operation is non-differentiable in its inputs. Solution: use a differentiable relaxation; see the paper "Stochastic Optimization for Sorting Networks via Continuous Relaxations" for more details.
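The reparameterized sampler is a two-liner; the empirical check below relies on the top-ranked item following Categorical($s/Z$), with invented scores:

```python
import numpy as np

rng = np.random.default_rng(0)

# Gumbel-PL sketch: perturb the log-scores with Gumbel(0, 1) noise and sort.
# The resulting permutation is distributed as Plackett-Luce(s).
def sample_pl_gumbel(s):
    s_tilde = np.log(s) + rng.gumbel(size=len(s))
    return np.argsort(-s_tilde)          # indices in decreasing perturbed score

s = np.array([5.0, 2.0, 1.0])
samples = [tuple(sample_pl_gumbel(s)) for _ in range(100_000)]
# P(item 0 ranked first) should be 5 / (5 + 2 + 1) = 0.625
print(sum(p[0] == 0 for p in samples) / len(samples))
```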

SLIDE 29

Summary

Discovering discrete latent structure, e.g., categories, rankings, matchings, etc., has several applications. Stochastic optimization w.r.t. parameterized discrete distributions is challenging:

- REINFORCE is the general-purpose technique for gradient estimation, but suffers from high variance
- Control variates can help control the variance
- Continuous relaxations of discrete distributions offer a biased but reparameterizable alternative, trading some bias for significantly lower variance
