Latent Variable Models Stefano Ermon, Aditya Grover Stanford - PowerPoint PPT Presentation

Latent Variable Models Stefano Ermon, Aditya Grover Stanford University Lecture 5 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 1 / 28

Recap of last lecture 1 Autoregressive models: Chain rule based factorization is fully general Compact representation via conditional independence and/or neural parameterizations 2 Autoregressive models Pros: Easy to evaluate likelihoods Easy to train 3 Autoregressive models Cons: Requires an ordering Generation is sequential Cannot learn features in an unsupervised way Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 2 / 28

Plan for today 1 Latent Variable Models Mixture models Variational autoencoder Variational inference and learning Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 3 / 28

Latent Variable Models: Motivation 1 Lots of variability in images x due to gender, eye color, hair color, pose, etc. However, unless images are annotated, these factors of variation are not explicitly available (latent). 2 Idea : explicitly model these factors using latent variables z Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 4 / 28

Latent Variable Models: Motivation 1 Only shaded variables x are observed in the data (pixel values) 2 Latent variables z correspond to high level features If z chosen properly, p ( x | z ) could be much simpler than p ( x ) If we had trained this model, then we could identify features via p ( z | x ), e.g., p ( EyeColor = Blue | x ) 3 Challenge: Very difficult to specify these conditionals by hand Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 5 / 28

Deep Latent Variable Models 1 z ∼ N (0 , I ) 2 p ( x | z ) = N ( µ θ ( z ) , Σ θ ( z )) where µ θ ,Σ θ are neural networks 3 Hope that after training, z will correspond to meaningful latent factors of variation ( features ). Unsupervised representation learning. 4 As before, features can be computed via p ( z | x ) Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 6 / 28

Mixture of Gaussians: a Shallow Latent Variable Model Mixture of Gaussians. Bayes net: z → x . 1 z ∼ Categorical (1 , · · · , K ) 2 p ( x | z = k ) = N ( µ k , Σ k ) Generative process 1 Pick a mixture component k by sampling z 2 Generate a data point by sampling from that Gaussian Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 7 / 28

Mixture of Gaussians: a Shallow Latent Variable Model Mixture of Gaussians: 1 z ∼ Categorical (1 , · · · , K ) 2 p ( x | z = k ) = N ( µ k , Σ k ) 3 Clustering: The posterior p ( z | x ) identifies the mixture component 4 Unsupervised learning: We are hoping to learn from unlabeled data (ill-posed problem) Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 8 / 28

Unsupervised learning Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 9 / 28

Unsupervised learning Shown is the posterior probability that a data point was generated by the i -th mixture component, P ( z = i | x ) Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 10 / 28

Unsupervised learning Unsupervised clustering of handwritten digits. Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 11 / 28

Mixture models Combine simple models into a more complex and expressive one K � � � p ( x ) = p ( x , z ) = p ( z ) p ( x | z ) = p ( z = k ) N ( x ; µ k , Σ k ) � �� z z k =1 component Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 12 / 28

Variational Autoencoder A mixture of an infinite number of Gaussians: 1 z ∼ N (0 , I ) 2 p ( x | z ) = N ( µ θ ( z ) , Σ θ ( z )) where µ θ ,Σ θ are neural networks µ θ ( z ) = σ ( A z + c ) = ( σ ( a 1 z + c 1 ) , σ ( a 2 z + c 2 )) = ( µ 1 ( z ) , µ 2 ( z )) � � exp( σ ( b 1 z + d 1 )) 0 Σ θ ( z ) = diag (exp( σ ( B z + d ))) = 0 exp( σ ( b 2 z + d 2 )) θ = ( A , B , c , d ) 3 Even though p ( x | z ) is simple, the marginal p ( x ) is very complex/flexible Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 13 / 28

Recap Latent Variable Models Allow us to define complex models p ( x ) in terms of simple building blocks p ( x | z ) Natural for unsupervised learning tasks (clustering, unsupervised representation learning, etc.) No free lunch: much more difficult to learn compared to fully observed, autoregressive models Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 14 / 28

Marginal Likelihood Suppose some pixel values are missing at train time (e.g., top half) Let X denote observed random variables, and Z the unobserved ones (also called hidden or latent) Suppose we have a model for the joint distribution (e.g., PixelCNN) p ( X , Z ; θ ) What is the probability p ( X = ¯ x ; θ ) of observing a training data point ¯ x ? � � p ( X = ¯ x , Z = z ; θ ) = p (¯ x , z ; θ ) z z Need to consider all possible ways to complete the image (fill green part) Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 15 / 28

Variational Autoencoder Marginal Likelihood A mixture of an infinite number of Gaussians: 1 z ∼ N (0 , I ) 2 p ( x | z ) = N ( µ θ ( z ) , Σ θ ( z )) where µ θ ,Σ θ are neural networks 3 Z are unobserved at train time (also called hidden or latent) 4 Suppose we have a model for the joint distribution. What is the probability p ( X = ¯ x ; θ ) of observing a training data point ¯ x ? � � p ( X = ¯ p (¯ x , Z = z ; θ ) d z = x , z ; θ ) d z z z Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 16 / 28

Partially observed data Suppose that our joint distribution is p ( X , Z ; θ ) We have a dataset D , where for each datapoint the X variables are observed (e.g., pixel values) and the variables Z are never observed (e.g., cluster or class id.). D = { x (1) , · · · , x ( M ) } . Maximum likelihood learning: � � � � log p ( x ; θ ) = log p ( x ; θ ) = log p ( x , z ; θ ) z x ∈D x ∈D x ∈D Evaluating log � z p ( x , z ; θ ) can be intractable. Suppose we have 30 binary latent features, z ∈ { 0 , 1 } 30 . Evaluating � z p ( x , z ; θ ) involves a sum with 2 30 terms. For continuous variables, log � z p ( x , z ; θ ) d z is often intractable. Gradients ∇ θ also hard to compute. Need approximations . One gradient evaluation per training data point x ∈ D , so approximation needs to be cheap. Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 17 / 28

First attempt: Naive Monte Carlo Likelihood function p θ ( x ) for Partially Observed Data is hard to compute: 1 � � p θ ( x ) = p θ ( x , z ) = |Z| |Z| p θ ( x , z ) = |Z| E z ∼ Uniform ( Z ) [ p θ ( x , z )] All values of z z ∈Z We can think of it as an (intractable) expectation. Monte Carlo to the rescue: Sample z (1) , · · · , z ( k ) uniformly at random 1 Approximate expectation with sample average 2 k p θ ( x , z ) ≈ |Z| 1 � � p θ ( x , z ( j ) ) k z j =1 Works in theory but not in practice. For most z , p θ ( x , z ) is very low (most completions don’t make sense). Some are very large but will never ”hit” likely completions by uniform random sampling. Need a clever way to select z ( j ) to reduce variance of the estimator. Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 18 / 28

Second attempt: Importance Sampling Likelihood function p θ ( x ) for Partially Observed Data is hard to compute: � p θ ( x , z ) � q ( z ) � � p θ ( x ) = p θ ( x , z ) = q ( z ) p θ ( x , z ) = E z ∼ q ( z ) q ( z ) All possible values of z z ∈Z Monte Carlo to the rescue: Sample z (1) , · · · , z ( k ) from q ( z ) 1 Approximate expectation with sample average 2 k p θ ( x , z ( j ) ) p θ ( x ) ≈ 1 � q ( z ( j ) ) k j =1 What is a good choice for q ( z )? Intuitively, choose likely completions. It would then be tempting to estimate the log -likelihood as:   k � p θ ( x , z (1) ) � p θ ( x , z ( j ) )  1 �  k =1 log ( p θ ( x )) ≈ log ≈ log q ( z ( j ) ) q ( z (1) ) k j =1 � � �� p θ ( x , z (1) ) p θ ( x , z (1) ) However, it’s clear that E z (1) ∼ q ( z ) log � = log E z (1) ∼ q ( z ) q ( z (1) ) q ( z (1) ) Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 19 / 28

Evidence Lower Bound Log-Likelihood function for Partially Observed Data is hard to compute: �� p θ ( x , z ) �� q ( z ) log p θ ( x , z ) = log q ( z ) p θ ( x , z ) = log E z ∼ q ( z ) q ( z ) z ∈Z z ∈Z log() is a concave function. log( px + (1 − p ) x ′ ) ≥ p log( x ) + (1 − p ) log( x ′ ). Idea: use Jensen Inequality (for concave functions) �� log E z ∼ q ( z ) [ f ( z )] = log q ( z ) f ( z ) ≥ q ( z ) log f ( z ) z z Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 20 / 28

Latent Variable Models Stefano Ermon, Aditya Grover Stanford - PowerPoint PPT Presentation

Latent Variable Models Stefano Ermon, Aditya Grover Stanford University Lecture 5 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 1 / 28 Recap of last lecture 1 Autoregressive models: Chain rule based factorization is

1 Latent variable models In the next section we will discuss latent variable models for

Latent Variable Models CS3750 Xiaoting Li 1 Out utli line Latent Variable Models

Learning Overcomplete Latent Variable Models through Tensor Methods Anima Anandkumar UC Irvine

Part III: Latent Tree Models Le Song ICML 2012 Tutorial on Spectral Algorithms for Latent

Pengtao Xie Joint work with Yuntian Deng and Eric Xing Carnegie Mellon University 1 Latent

Latent Variable Models Stefano Ermon, Aditya Grover Stanford University Lecture 6 Stefano

Learning Latent Variable Models through Tensor Methods Anima Anandkumar U.C. Irvine Challenges

Numberjack User Guide May 27, 2013 1 Variables Constructor for the class Variable : Constructor

Guaranteed Learning of Latent Variable Models through Tensor Methods Furong Huang University of

Discrete Latent Variable Models Stefano Ermon, Aditya Grover Stanford University Lecture 15

Outline Latent Variable Generative Models Cooperative Vector Quantizer Model Model

Maximum Reconstruction Estimation for Generative Latent-Variable Models Yong Cheng joint work

Latent Variable models for GWAs Oliver Stegle Machine Learning and Computational Biology Research

Distributed Variational Inference in Sparse Gaussian Process Regression and Latent Variable Models

Learning Overcomplete Latent Variable Models through Tensor Methods Majid Janzamin UC Irvine

Variable selection bias Bias in Ensemble Bias in Ensemble Methods Methods Variable selection

Commit as late as possible Analogy Real life actions | system transactions Goal is to:

Ab bo ou ut t t th he e M Ma ag gic ic L La an n e ter rn n S Sli lid d es

What Answer for Which Audience? Identifying Categories of Answers on Stack Overflow Damien &

Characterizing Late Roadblocks in Ribosome Assembly Jessica Rabuck-Gibbons 1,2 , Joseph Davis 1 ,

Models w/ Latent Random Variables Graham Neubig Site https://phontron.com/class/nn4nlp2017/

Models w/ Latent Random Variables Chunting Zhou Site https://phontron.com/class/nn4nlp2019/

Latent Class Models: The Latent Class Logit Model Accouting for unobserved heterogeneity:

Demystifying Relational Latent Representations Sebastijan Dumani, Hendrik Blockeel DTAI, KU

Latent Variable Models Stefano Ermon, Aditya Grover Stanford - PowerPoint PPT Presentation

Latent Variable Models Stefano Ermon, Aditya Grover Stanford University Lecture 5 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 1 / 28 Recap of last lecture 1 Autoregressive models: Chain rule based factorization is

1 Latent variable models In the next section we will discuss latent variable models for

Latent Variable Models CS3750 Xiaoting Li 1 Out utli line Latent Variable Models

Learning Overcomplete Latent Variable Models through Tensor Methods Anima Anandkumar UC Irvine

Part III: Latent Tree Models Le Song ICML 2012 Tutorial on Spectral Algorithms for Latent

Pengtao Xie Joint work with Yuntian Deng and Eric Xing Carnegie Mellon University 1 Latent

Latent Variable Models Stefano Ermon, Aditya Grover Stanford University Lecture 6 Stefano

Learning Latent Variable Models through Tensor Methods Anima Anandkumar U.C. Irvine Challenges

Numberjack User Guide May 27, 2013 1 Variables Constructor for the class Variable : Constructor

Guaranteed Learning of Latent Variable Models through Tensor Methods Furong Huang University of

Discrete Latent Variable Models Stefano Ermon, Aditya Grover Stanford University Lecture 15

Outline Latent Variable Generative Models Cooperative Vector Quantizer Model Model

Maximum Reconstruction Estimation for Generative Latent-Variable Models Yong Cheng joint work

Latent Variable models for GWAs Oliver Stegle Machine Learning and Computational Biology Research

Distributed Variational Inference in Sparse Gaussian Process Regression and Latent Variable Models

Learning Overcomplete Latent Variable Models through Tensor Methods Majid Janzamin UC Irvine

Variable selection bias Bias in Ensemble Bias in Ensemble Methods Methods Variable selection

Commit as late as possible Analogy Real life actions | system transactions Goal is to:

Ab bo ou ut t t th he e M Ma ag gic ic L La an n e ter rn n S Sli lid d es

What Answer for Which Audience? Identifying Categories of Answers on Stack Overflow Damien &amp;

Characterizing Late Roadblocks in Ribosome Assembly Jessica Rabuck-Gibbons 1,2 , Joseph Davis 1 ,

Models w/ Latent Random Variables Graham Neubig Site https://phontron.com/class/nn4nlp2017/

Models w/ Latent Random Variables Chunting Zhou Site https://phontron.com/class/nn4nlp2019/

Latent Class Models: The Latent Class Logit Model Accouting for unobserved heterogeneity:

Demystifying Relational Latent Representations Sebastijan Dumani, Hendrik Blockeel DTAI, KU

What Answer for Which Audience? Identifying Categories of Answers on Stack Overflow Damien &