Variational Autoencoders + Deep Generative Models
Matt Gormley, Lecture 27 (Dec. 4, 2019)
10-418 / 10-618 Machine Learning for Structured Data, Machine Learning Department, School of Computer Science, Carnegie Mellon University


slide-1
SLIDE 1

Variational Autoencoders + Deep Generative Models

1

10-418 / 10-618 Machine Learning for Structured Data

Matt Gormley Lecture 27

  • Dec. 4, 2019

Machine Learning Department School of Computer Science Carnegie Mellon University

slide-2
SLIDE 2

Reminders

  • Final Exam

– Evening Exam: Thu, Dec. 5, 6:30pm – 9:00pm

  • 618 Final Poster:

– Submission: Tue, Dec. 10 at 11:59pm
– Presentation: Wed, Dec. 11 (time will be announced on Piazza)

3

slide-3
SLIDE 3

FINAL EXAM LOGISTICS

6

slide-4
SLIDE 4

Final Exam

  • Time / Location

– Time: Evening Exam, Thu, Dec. 5, 6:30pm – 9:00pm
– Room: Doherty Hall A302
– Seats: There will be assigned seats. Please arrive early to find yours.
– Please watch Piazza carefully for announcements

  • Logistics

– Covered material: Lecture 1 – Lecture 26 (not the new material in Lecture 27)
– Format of questions:

  • Multiple choice
  • True / False (with justification)
  • Derivations
  • Short answers
  • Interpreting figures
  • Implementing algorithms on paper

– No electronic devices
– You are allowed to bring one 8½ x 11 sheet of notes (front and back)

7

slide-5
SLIDE 5

Final Exam

  • Advice (for during the exam)

– Solve the easy problems first (e.g. multiple choice before derivations)

  • if a problem seems extremely complicated, you're likely missing something

– Don't leave any answer blank!
– If you make an assumption, write it down
– If you look at a question and don't know the answer:

  • we probably haven’t told you the answer
  • but we’ve told you enough to work it out
  • imagine arguing for some answer and see if you like it

8

slide-6
SLIDE 6

Final Exam

  • Exam Contents

– ~30% of material comes from topics covered before the Midterm Exam
– ~70% of material comes from topics covered after the Midterm Exam

9

slide-7
SLIDE 7

Topics from before Midterm Exam

  • Search-Based Structured Prediction

– Reductions to Binary Classification
– Learning to Search
– RNN-LMs
– seq2seq models

  • Graphical Model Representation

– Directed GMs vs. Undirected GMs vs. Factor Graphs
– Bayesian Networks vs. Markov Random Fields vs. Conditional Random Fields

  • Graphical Model Learning

– Fully observed Bayesian Network learning
– Fully observed MRF learning
– Fully observed CRF learning
– Parameterization of a GM
– Neural potential functions

  • Exact Inference

– Three inference problems: (1) marginals, (2) partition function, (3) most probable assignment
– Variable Elimination
– Belief Propagation (sum-product and max-product)
– MAP Inference via MILP

10

slide-8
SLIDE 8

Topics from after Midterm Exam

  • Learning for Structured Prediction

– Structured Perceptron
– Structured SVM
– Neural network potentials

  • Approximate MAP Inference

– MAP Inference via MILP
– MAP Inference via LP relaxation

  • Approximate Inference by Sampling

– Monte Carlo Methods
– Gibbs Sampling
– Metropolis-Hastings
– Markov Chains and MCMC

  • Approximate Inference by Optimization

– Variational Inference
– Mean Field Variational Inference
– Coordinate Ascent V.I. (CAVI)
– Variational EM
– Variational Bayes

  • Bayesian Nonparametrics

– Dirichlet Process
– DP Mixture Model

  • Deep Generative Models

– Variational Autoencoders

11

slide-9
SLIDE 9

VARIATIONAL EM

12

slide-10
SLIDE 10

Variational EM

Whiteboard

– Example: Unsupervised POS Tagging
– Variational Bayes
– Variational EM

13

slide-11
SLIDE 11

Unsupervised POS Tagging

14

Figure from Wang & Blunsom (2013)

CGS full conditional:

$$p(z_t = k \mid \mathbf{x}, \mathbf{z}_{\neg t}, \alpha, \beta) \propto \frac{C^{\neg t}_{k,w} + \beta}{C^{\neg t}_{k,\cdot} + W\beta} \cdot \frac{C^{\neg t}_{z_{t-1},k} + \alpha}{C^{\neg t}_{z_{t-1},\cdot} + K\alpha} \cdot \frac{C^{\neg t}_{k,z_{t+1}} + \alpha + \delta(z_{t-1} = k = z_{t+1})}{C^{\neg t}_{k,\cdot} + K\alpha + \delta(z_{t-1} = k)}$$

Algo 1 mean field update:

$$q(z_t = k) \propto \frac{\mathbb{E}_{q(\mathbf{z}_{\neg t})}[C^{\neg t}_{k,w}] + \beta}{\mathbb{E}_{q(\mathbf{z}_{\neg t})}[C^{\neg t}_{k,\cdot}] + W\beta} \cdot \frac{\mathbb{E}_{q(\mathbf{z}_{\neg t})}[C^{\neg t}_{z_{t-1},k}] + \alpha}{\mathbb{E}_{q(\mathbf{z}_{\neg t})}[C^{\neg t}_{z_{t-1},\cdot}] + K\alpha} \cdot \frac{\mathbb{E}_{q(\mathbf{z}_{\neg t})}[C^{\neg t}_{k,z_{t+1}}] + \alpha + \mathbb{E}_{q(\mathbf{z}_{\neg t})}[\delta(z_{t-1} = k = z_{t+1})]}{\mathbb{E}_{q(\mathbf{z}_{\neg t})}[C^{\neg t}_{k,\cdot}] + K\alpha + \mathbb{E}_{q(\mathbf{z}_{\neg t})}[\delta(z_{t-1} = k)]}$$

Bayesian Inference for HMMs

  • Task: unsupervised POS tagging
  • Data: 1 million words (i.e. unlabeled sentences) of WSJ text
  • Dictionary: defines legal part-of-speech (POS) tags for each word type
  • Models:

– EM: standard HMM
– VB: uncollapsed variational Bayesian HMM
– Algo 1 (CVB): collapsed variational Bayesian HMM (strong indep. assumption)
– Algo 2 (CVB): collapsed variational Bayesian HMM (weaker indep. assumption)
– CGS: collapsed Gibbs Sampler for Bayesian HMM
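The CGS full conditional above can be made concrete with a short sketch. This is an illustrative implementation, not code from the lecture: the count matrices `C_emit[k, w]` and `C_trans[k, k']` are assumed to already exclude position t, boundary positions are ignored, and all names are hypothetical.

```python
import numpy as np

def cgs_full_conditional(t, z, x, C_emit, C_trans, alpha, beta, K, W):
    """Normalized probabilities p(z_t = k | x, z_{-t}) for a Bayesian HMM,
    following the Wang & Blunsom update (sketch; assumes 1 <= t <= T-2)."""
    w = x[t]
    prev_k, next_k = z[t - 1], z[t + 1]
    scores = np.empty(K)
    for k in range(K):
        emit = (C_emit[k, w] + beta) / (C_emit[k].sum() + W * beta)
        trans_in = (C_trans[prev_k, k] + alpha) / (C_trans[prev_k].sum() + K * alpha)
        same = float(prev_k == k == next_k)      # delta(z_{t-1} = k = z_{t+1})
        seen = float(prev_k == k)                # delta(z_{t-1} = k)
        trans_out = (C_trans[k, next_k] + alpha + same) / (C_trans[k].sum() + K * alpha + seen)
        scores[k] = emit * trans_in * trans_out
    return scores / scores.sum()
```

The Algo 1 mean-field update has the same form, with the counts replaced by their expectations under q.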

slide-12
SLIDE 12

Unsupervised POS Tagging

Bayesian Inference for HMMs

  • Task: unsupervised POS tagging
  • Data: 1 million words (i.e. unlabeled sentences) of WSJ text
  • Dictionary: defines legal part-of-speech (POS) tags for each word type
  • Models:

– EM: standard HMM
– VB: uncollapsed variational Bayesian HMM
– Algo 1 (CVB): collapsed variational Bayesian HMM (strong indep. assumption)
– Algo 2 (CVB): collapsed variational Bayesian HMM (weaker indep. assumption)
– CGS: collapsed Gibbs Sampler for Bayesian HMM

15

Figure from Wang & Blunsom (2013)

[Figure: test perplexity and accuracy vs. number of iterations for the variational algorithms (VB, Algo 1, Algo 2) and CGS; runtimes: EM 28 min, VB 35 min, Algo 1 15 min, Algo 2 50 min, CGS 480 min]

slide-13
SLIDE 13

Speed:

Unsupervised POS Tagging

Bayesian Inference for HMMs

  • Task: unsupervised POS tagging
  • Data: 1 million words (i.e. unlabeled sentences) of WSJ text
  • Dictionary: defines legal part-of-speech (POS) tags for each word type
  • Models:

– EM: standard HMM
– VB: uncollapsed variational Bayesian HMM
– Algo 1 (CVB): collapsed variational Bayesian HMM (strong indep. assumption)
– Algo 2 (CVB): collapsed variational Bayesian HMM (weaker indep. assumption)
– CGS: collapsed Gibbs Sampler for Bayesian HMM

16

Figure from Wang & Blunsom (2013)

EM (28mins) VB (35mins) Algo 1 (15mins) Algo 2 (50mins) CGS (480mins)

  • EM is slow b/c of log-space computations
  • VB is slow b/c of digamma computations
  • Algo 1 (CVB) is the fastest!
  • Algo 2 (CVB) is slow b/c it computes dynamic parameters
  • CGS: an order of magnitude slower than any deterministic algorithm

slide-14
SLIDE 14

Stochastic Variational Bayesian HMM

  • Task: Human Chromatin Segmentation
  • Goal: unsupervised segmentation of the genome
  • Data: from ENCODE, “250 million observations consisting of twelve assays carried out in the chronic myeloid leukemia cell line K562”
  • Metric: “the false discovery rate (FDR) of predicting active promoter elements in the sequence”
  • Models:

– DBN HMM: dynamic Bayesian HMM trained with standard EM
– SVIHMM: stochastic variational inference for a Bayesian HMM

  • Main Takeaway:

– the two models perform at similar levels of FDR
– SVIHMM takes one hour
– DBNHMM takes days

17

Figure from Foti et al. (2014)

[Figure: error ||A||_F vs. subchain length L/2, and held-out log-probability vs. iteration, for the Diag. Dom. and Rev. Cycles settings, comparing GrowBuffer off/on and step-size parameters κ ∈ {0.1, 0.3, 0.5, 0.7}]

Figure from Mammana & Chung (2015)

slide-15
SLIDE 15

Grammar Induction

Question: Can maximizing (unsupervised) marginal likelihood produce useful results? Answer: Let’s look at an example…

  • Babies learn the syntax of their native language (e.g. English) just by hearing many sentences
  • Can a computer similarly learn syntax of a human language just by looking at lots of example sentences?

– This is the problem of Grammar Induction!
– It's an unsupervised learning problem
– We try to recover the syntactic structure for each sentence without any supervision

18

slide-16
SLIDE 16

Grammar Induction

19

[Figure: four candidate parse trees for the sentence “time flies like an arrow”]

No semantic interpretation

slide-17
SLIDE 17

Grammar Induction

20

Training Data: Sentences only, without parses (x(1), x(2), x(3), x(4)), e.g.:

  time like flies an arrow
  real like flies soup
  flies with fly their wings
  with you time will see

Test Data: Sentences with parses, so we can evaluate accuracy

slide-18
SLIDE 18

Grammar Induction

21

[Figure: scatter plot of attachment accuracy (%) vs. log-likelihood (per sentence); Pearson's r = 0.63 (strong correlation)]

Dependency Model with Valence (Klein & Manning, 2004)

Figure from Gimpel & Smith (NAACL 2012) - slides

Q: Does likelihood correlate with accuracy on a task we care about?
A: Yes, but there is still a wide range of accuracies for a particular likelihood value

slide-19
SLIDE 19

Grammar Induction

22

[Graphical model with parameters μ_k, Σ_k, η_k, θ_k, latent parse y, and observed sentence x]

Graphical Model for the Logistic Normal Probabilistic Grammar: y = syntactic parse, x = observed sentence

Settings:

EM: Maximum likelihood estimate of θ using the EM algorithm to optimize p(x | θ) [14].
EM-MAP: Maximum a posteriori estimate of θ using the EM algorithm and a fixed symmetric Dirichlet prior with α > 1 to optimize p(x, θ | α). Tune α to maximize the likelihood of an unannotated development dataset, using grid search over [1.1, 30].
VB-Dirichlet: Use variational Bayes inference to estimate the posterior distribution p(θ | x, α), which is a Dirichlet. Tune the symmetric Dirichlet prior's parameter α to maximize the likelihood of an unannotated development dataset, using grid search over [0.0001, 30]. Use the mean of the posterior Dirichlet as a point estimate for θ.
VB-EM-Dirichlet: Use variational Bayes EM to optimize p(x | α) with respect to α. Use the mean of the learned Dirichlet as a point estimate for θ (similar to [5]).
VB-EM-Log-Normal: Use variational Bayes EM to optimize p(x | µ, Σ) with respect to µ and Σ. Use the (exponentiated) mean of this Gaussian as a point estimate for θ.

Results:

Attachment accuracy (%):

                                Viterbi decoding           MBR decoding
                              |x|≤10  |x|≤20   all     |x|≤10  |x|≤20   all
Attach-Right                   38.4    33.4    31.7     38.4    33.4    31.7
EM                             45.8    39.1    34.2     46.1    39.9    35.9
EM-MAP, α = 1.1                45.9    39.5    34.9     46.2    40.6    36.7
VB-Dirichlet, α = 0.25         46.9    40.0    35.7     47.1    41.1    37.6
VB-EM-Dirichlet                45.9    39.4    34.9     46.1    40.6    36.9
VB-EM-Log-Normal, Σ(0)_k = I   56.6    43.3    37.4     59.1    45.9    39.9
VB-EM-Log-Normal, families     59.3    45.1    39.0     59.4    45.9    40.5

Table 1: Attachment accuracy of different learning methods on unseen test data from the Penn Treebank of varying levels of difficulty imposed through a length filter. Attach-Right attaches each word to the word on its right and the last word to $. EM and EM-MAP with a Dirichlet prior (α > 1) are reproductions of earlier results [14, 18].

Figures from Cohen et al. (2009)

slide-20
SLIDE 20

AUTOENCODERS

23

slide-21
SLIDE 21

Idea #3: Unsupervised Pre-training

1. Unsupervised Pre-training

– Use unlabeled data
– Work bottom-up

  • Train hidden layer 1. Then fix its parameters.
  • Train hidden layer 2. Then fix its parameters.
  • Train hidden layer n. Then fix its parameters.

2. Supervised Fine-tuning

– Use labeled data to train following “Idea #1”
– Refine the features by backpropagation so that they become tuned to the end-task

24

— Idea: (Two Steps)

— Use supervised learning, but pick a better starting point
— Train each level of the model in a greedy way

slide-22
SLIDE 22

The solution: Unsupervised pre-training

25

[Diagram: network with Input, Hidden Layer, and Output]

Unsupervised pre-training of the first layer:

  • What should it predict?
  • What else do we observe?
  • The input!

This topology defines an Auto-encoder.

slide-23
SLIDE 23

The solution: Unsupervised pre-training

Unsupervised pre-training of the first layer:

  • What should it predict?
  • What else do we observe?
  • The input!

This topology defines an Auto-encoder.

26

[Diagram: Input layer, Hidden Layer, and reconstructed “Input” layer]

slide-24
SLIDE 24

Auto-Encoders

Key idea: Encourage z to give small reconstruction error:

– x′ is the reconstruction of x
– Loss = ||x − DECODER(ENCODER(x))||²
– Train with the same backpropagation algorithm as for 2-layer neural networks, with x_m as both input and output.

27

[Diagram: Input layer, Hidden Layer, and reconstructed “Input” layer]

Slide adapted from Raman Arora

ENCODER: z = h(Wx)
DECODER: x′ = h(W′z)
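A minimal numpy sketch of this single-layer autoencoder, assuming sigmoid units and illustrative layer sizes (nothing here is prescribed by the slides; the function name and sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
h = lambda a: 1.0 / (1.0 + np.exp(-a))          # sigmoid nonlinearity

def autoencoder_step(x, W, Wp, lr=0.1):
    """One SGD step on the reconstruction loss ||x - x'||^2 for one example x.
    W encodes (z = h(W x)); Wp decodes (x' = h(Wp z))."""
    z  = h(W @ x)                                # ENCODER
    xr = h(Wp @ z)                               # DECODER (reconstruction x')
    d_xr = 2 * (xr - x) * xr * (1 - xr)          # backprop through squared error + sigmoid
    d_z  = (Wp.T @ d_xr) * z * (1 - z)
    Wp = Wp - lr * np.outer(d_xr, z)
    W  = W  - lr * np.outer(d_z, x)
    return np.sum((x - xr) ** 2), W, Wp

# Example with illustrative sizes (784 inputs -> 100 hidden units):
D, H = 784, 100
W, Wp = 0.01 * rng.standard_normal((H, D)), 0.01 * rng.standard_normal((D, H))
loss, W, Wp = autoencoder_step(rng.random(D), W, Wp)
```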

slide-25
SLIDE 25

The solution: Unsupervised pre-training

Unsupervised pre-training

  • Work bottom-up

– Train hidden layer 1. Then fix its parameters.
– Train hidden layer 2. Then fix its parameters.
– …
– Train hidden layer n. Then fix its parameters.

28

[Diagram: Input layer, Hidden Layer, and reconstructed “Input” layer]

slide-26
SLIDE 26

The solution: Unsupervised pre-training

Unsupervised pre-training

  • Work bottom-up

– Train hidden layer 1. Then fix its parameters.
– Train hidden layer 2. Then fix its parameters.
– …
– Train hidden layer n. Then fix its parameters.

29

[Diagram: Input layer, Hidden Layer 1, and Hidden Layer 2 reconstructing Hidden Layer 1]

slide-27
SLIDE 27

The solution: Unsupervised pre-training

Unsupervised pre-training

  • Work bottom-up

– Train hidden layer 1. Then fix its parameters.
– Train hidden layer 2. Then fix its parameters.
– …
– Train hidden layer n. Then fix its parameters.

30

[Diagram: Input layer and three stacked Hidden Layers, each trained as an autoencoder on the layer below]

slide-28
SLIDE 28

The solution: Unsupervised pre-training

Unsupervised pre-training

  • Work bottom-up

– Train hidden layer 1. Then fix its parameters.
– Train hidden layer 2. Then fix its parameters.
– …
– Train hidden layer n. Then fix its parameters.

Supervised fine-tuning: Backprop and update all parameters

31

[Diagram: Input layer, three Hidden Layers, and Output layer for supervised fine-tuning]
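A sketch of the greedy layer-wise procedure from the last few slides, reusing the autoencoder step from above. Layer sizes, sigmoid units, and the per-example SGD loop are simplifying assumptions; the returned weights would initialize supervised fine-tuning with backprop.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def train_autoencoder_layer(X, n_hidden, lr=0.1, epochs=5):
    """Fit one autoencoder layer on inputs X (n_examples x n_in); return encoder weights."""
    n_in = X.shape[1]
    W  = 0.01 * rng.standard_normal((n_hidden, n_in))   # encoder
    Wp = 0.01 * rng.standard_normal((n_in, n_hidden))   # decoder (discarded afterwards)
    for _ in range(epochs):
        for x in X:
            z  = sigmoid(W @ x)
            xr = sigmoid(Wp @ z)
            d_xr = 2 * (xr - x) * xr * (1 - xr)
            d_z  = (Wp.T @ d_xr) * z * (1 - z)
            Wp -= lr * np.outer(d_xr, z)
            W  -= lr * np.outer(d_z, x)
    return W

def pretrain_stack(X, layer_sizes):
    """Train each hidden layer bottom-up on the previous layer's codes; fix it, then move up."""
    weights, H = [], X
    for n_hidden in layer_sizes:
        W = train_autoencoder_layer(H, n_hidden)
        weights.append(W)
        H = sigmoid(H @ W.T)          # fixed features feed the next layer
    return weights                    # initialization for supervised fine-tuning
```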

slide-29
SLIDE 29

Deep Network Training

32

— Idea #3:
  1. Unsupervised layer-wise pre-training
  2. Supervised fine-tuning

— Idea #2:
  1. Supervised layer-wise pre-training
  2. Supervised fine-tuning

— Idea #1:
  1. Supervised fine-tuning only

slide-30
SLIDE 30

Comparison on MNIST

[Bar chart: % error for Shallow Net, Idea #1 (Deep Net, no pre-training), Idea #2 (Deep Net, supervised pre-training), Idea #3 (Deep Net, unsupervised pre-training)]

33

  • Results from Bengio et al. (2006) on the MNIST digit classification task
  • Percent error (lower is better)
slide-31
SLIDE 31

Comparison on MNIST

[Bar chart: % error for Shallow Net, Idea #1 (Deep Net, no pre-training), Idea #2 (Deep Net, supervised pre-training), Idea #3 (Deep Net, unsupervised pre-training)]

34

  • Results from Bengio et al. (2006) on the MNIST digit classification task
  • Percent error (lower is better)
slide-32
SLIDE 32

VARIATIONAL AUTOENCODERS

35

slide-33
SLIDE 33

Variational Autoencoders

Whiteboard

– Variational Autoencoder = VAE
– VAE as a Probability Model
– Parameterizing the VAE with Neural Nets
– Variational EM for VAEs

36

slide-34
SLIDE 34

Reparameterization Trick

37

Figure from Doersch (2016)
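A minimal sketch of the reparameterization trick, assuming a diagonal-Gaussian encoder and a Bernoulli decoder with stand-in linear maps (the names W_mu, W_logvar, W_dec and the sizes are hypothetical, not from the lecture). It shows a single-sample Monte Carlo estimate of the ELBO.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

D, Z = 784, 20
W_mu, W_logvar = 0.01 * rng.standard_normal((Z, D)), 0.01 * rng.standard_normal((Z, D))
W_dec = 0.01 * rng.standard_normal((D, Z))

def elbo_estimate(x):
    # Encoder q(z|x) = N(mu, diag(sigma^2))
    mu, logvar = W_mu @ x, W_logvar @ x
    # Reparameterization: z = mu + sigma * eps with eps ~ N(0, I), so the sampling
    # noise is moved outside the parameters and gradients can flow through mu, sigma.
    eps = rng.standard_normal(Z)
    z = mu + np.exp(0.5 * logvar) * eps
    # Decoder p(x|z): Bernoulli means
    x_hat = sigmoid(W_dec @ z)
    log_px_z = np.sum(x * np.log(x_hat + 1e-9) + (1 - x) * np.log(1 - x_hat + 1e-9))
    # KL( q(z|x) || N(0, I) ) in closed form for diagonal Gaussians
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    return log_px_z - kl              # maximize this ELBO estimate
```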

slide-35
SLIDE 35

UNIFYING GANS AND VAES

Z. Hu, Z. Yang, R. Salakhutdinov, E. Xing, “On Unifying Deep Generative Models”, arXiv 1706.00550 (Slides in this section from Eric Xing)

38

slide-36
SLIDE 36

39

slide-37
SLIDE 37

40

slide-38
SLIDE 38

41

slide-39
SLIDE 39

42

slide-40
SLIDE 40

43

slide-41
SLIDE 41

DEEP GENERATIVE MODELS

44

slide-42
SLIDE 42

Question: How does this relate to Graphical Models?

The first “Deep Learning” papers in 2006 were innovations in training a particular flavor of Belief Network. Those models happen to also be neural nets.

45


slide-43
SLIDE 43

MNIST Digit Generation

  • This section: Suppose you want to build a generative model capable of explaining handwritten digits

  • Goal:

– To have a model p(x) from which we can sample digits that look realistic
– Learn an unsupervised hidden representation of an image

46

DBNs

Figure from (Hinton et al., 2006)

slide-44
SLIDE 44

Sigmoid Belief Networks

  • Directed graphical model of binary variables in fully connected layers
  • Only the bottom layer is observed
  • Specific parameterization of the conditional probabilities:

47

DBNs

$$p(x_i \mid \mathrm{parents}(x_i)) = \frac{1}{1 + \exp\!\left(-\sum_j w_{ij} x_j\right)}$$

Figure from Marcus Frean, MLSS Tutorial 2010

Note: this is a GM diagram not a NN!
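To make the conditional concrete, here is a sketch of ancestral (top-down) sampling in a sigmoid belief net. The layer ordering convention, uniform sampling of the top layer, and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def sample_sigmoid_belief_net(weights, top_size):
    """weights[0] maps the top hidden layer down one level; weights[-1] maps to the visible layer."""
    x = (rng.random(top_size) < 0.5).astype(float)     # top-most hidden layer
    for W in weights:                                   # generate downward
        p = sigmoid(W @ x)                              # p(x_i = 1 | parents(x_i))
        x = (rng.random(p.shape) < p).astype(float)
    return x                                            # bottom (visible) layer
```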

slide-45
SLIDE 45

Contrastive Divergence Training

48

DBNs

Slide from Marcus Frean, MLSS Tutorial 2010

log likelihood of a dataset of v:

$$\log L = \log P(\mathcal{D}) = \sum_{v \in \mathcal{D}} \log P(v) = \sum_{v \in \mathcal{D}} \log\!\big[P^*(v)/Z\big] \;\propto\; \frac{1}{N} \sum_{v \in \mathcal{D}} \log P^*(v) \;-\; \log Z$$

Contrastive Divergence is a general tool for learning a generative distribution, where the derivative of the log partition function is intractable to compute.

slide-46
SLIDE 46

gradient as a whole

$$\frac{\partial}{\partial w} \log L \;\propto\; \underbrace{\frac{1}{N} \sum_{v \in \mathcal{D}}}_{\text{data}} \; \underbrace{\sum_{h} P(h \mid v)}_{\text{av. over posterior}} \frac{\partial}{\partial w} \log P^*(x) \;-\; \underbrace{\sum_{v,h} P(v,h)}_{\text{av. over joint}} \frac{\partial}{\partial w} \log P^*(x)$$

Both terms involve averaging over $\frac{\partial}{\partial w} \log P^*(x)$. Another way to write it:

$$\left\langle \frac{\partial}{\partial w} \log P^*(x) \right\rangle_{v \in \mathcal{D},\, h \sim P(h \mid v)} \;-\; \left\langle \frac{\partial}{\partial w} \log P^*(x) \right\rangle_{x \sim P(x)}$$

clamped / wake phase (conditioned hypotheses) vs. unclamped / sleep / free phase (random fantasies)

Contrastive Divergence Training

49

DBNs

Slide from Marcus Frean, MLSS Tutorial 2010

Contrastive Divergence estimates the second term with a Monte Carlo estimate from 1 step of a Gibbs sampler!
slide-47
SLIDE 47

Contrastive Divergence Training

50

DBNs

Slide from Marcus Frean, MLSS Tutorial 2010

example: sigmoid belief nets

For a belief net the joint is automatically normalised: Z is a constant (= 1), so the 2nd term is zero! For the weight w_ij from j into i, the gradient is

$$\frac{\partial \log L}{\partial w_{ij}} = (x_i - p_i)\, x_j$$

stochastic gradient ascent:

$$\Delta w_{ij} \propto (x_i - p_i)\, x_j \qquad \text{(the “delta rule”)}$$

So this is a stochastic version of the EM algorithm that you may have heard of. We iterate the following two steps:
  E step: get samples from the posterior
  M step: apply the learning rule that makes them more likely
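A sketch of one delta-rule (M step) update, assuming the E step has already produced a joint sample (visibles clamped, hiddens drawn from the posterior by some sampler). The weight layout and names are illustrative assumptions.

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def delta_rule_update(W, x_child, x_parents, lr=0.1):
    """W[i, j] is the weight from parent j into child i."""
    p = sigmoid(W @ x_parents)                  # p_i = p(x_i = 1 | parents)
    W += lr * np.outer(x_child - p, x_parents)  # Δw_ij ∝ (x_i − p_i) x_j
    return W
```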

slide-48
SLIDE 48

Sigmoid Belief Networks

  • In practice, applying CD to a deep Sigmoid Belief Net fails
  • Sampling from the posterior of many (deep) hidden layers doesn't approach the equilibrium distribution quickly enough

51

DBNs

Figure from Marcus Frean, MLSS Tutorial 2010

Note: this is a GM diagram not a NN!

slide-49
SLIDE 49

Boltzmann Machines

  • Undirected graphical model of binary variables with pairwise potentials
  • Parameterization of the potentials:

52

DBNs

ψij(xi, xj) = exp(xiWijxj)

(In English: higher value of parameter Wij leads to higher correlation between Xi and Xj on value 1)

[Diagram: pairwise-connected binary nodes Xi, Xj, …]

slide-50
SLIDE 50

Assume visible units are one layer, and hidden units are another. Throw out all the connections within each layer.

h_j ⊥ h_k | v

the posterior P(h | v) factors (cf. in a belief net, it is the prior P(h) that factors): no explaining away

Restricted Boltzmann Machines

53

DBNs

Slide from Marcus Frean, MLSS Tutorial 2010

slide-51
SLIDE 51

Alternating Gibbs sampling

Since none of the units within a layer are interconnected, we can do Gibbs sampling by updating the whole layer at a time (with time running from left → right).

Restricted Boltzmann Machines

54

DBNs

Slide from Marcus Frean, MLSS Tutorial 2010

slide-52
SLIDE 52

Restricted Boltzmann Machines

55

DBNs

Slide from Marcus Frean, MLSS Tutorial 2010

learning in an RBM

Repeat for all data:

1. start with a training vector on the visible units
2. then alternate between updating all the hidden units in parallel and updating all the visible units in parallel

$$\Delta w_{ij} = \eta \left[ \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_\infty \right]$$

restricted connectivity is trick #1: it saves waiting for equilibrium in the clamped phase.

slide-53
SLIDE 53

Restricted Boltzmann Machines

56

DBNs

Slide from Marcus Frean, MLSS Tutorial 2010

trick #2: curtail the Markov chain during learning

Repeat for all data:

1. start with a training vector on the visible units
2. update all the hidden units in parallel
3. update all the visible units in parallel to get a “reconstruction”
4. update the hidden units again

$$\Delta w_{ij} = \eta \left[ \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_1 \right]$$

This is not following the correct gradient, but works well in practice. Geoff Hinton calls it learning by “contrastive divergence”.
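A sketch of this CD-1 recipe for a binary RBM: positive phase from the data, one alternating-Gibbs reconstruction for the negative phase. Sizes, the learning rate, and the absence of bias terms are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

def cd1_update(W, v0, lr=0.05):
    """One CD-1 step. W has shape (n_visible, n_hidden); v0 is one training vector."""
    ph0 = sigmoid(v0 @ W)              # p(h = 1 | v0)  -- positive phase
    h0  = sample(ph0)
    pv1 = sigmoid(W @ h0)              # reconstruction p(v = 1 | h0)
    v1  = sample(pv1)
    ph1 = sigmoid(v1 @ W)              # p(h = 1 | v1)  -- negative phase
    # Δw_ij = η [ <v_i h_j>_0  -  <v_i h_j>_1 ]
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    return W
```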

slide-54
SLIDE 54

sampling from this is the same as sampling from the network on the right.

Deep Belief Networks (DBNs)

57

DBNs

Slide from Marcus Frean, MLSS Tutorial 2010

RBMs are equivalent to infinitely deep belief networks

slide-55
SLIDE 55

Deep Belief Networks (DBNs)

58

DBNs

Slide from Marcus Frean, MLSS Tutorial 2010

RBMs are equivalent to infinitely deep belief networks

So when we train an RBM, we’re really training an ∞ly deep sigmoid belief net! It’s just that the weights of all layers are tied.

slide-56
SLIDE 56

If we freeze the first RBM, and then train another RBM atop it, we are untying the weights of layers 2+ in the ∞ net (which remain tied together).

Deep Belief Networks (DBNs)

59

DBNs

Slide from Marcus Frean, MLSS Tutorial 2010

Un-tie the weights from layers 2 to infinity

slide-57
SLIDE 57

and ditto for the 3rd layer...

Deep Belief Networks (DBNs)

60

DBNs

Slide from Marcus Frean, MLSS Tutorial 2010

Un-tie the weights from layers 3 to infinity

slide-58
SLIDE 58

Deep Belief Networks (DBNs)

61

DBNs

Slide from Marcus Frean, MLSS Tutorial 2010

fine-tuning with the wake-sleep algorithm

So far, the up and down weights have been symmetric, as required by the Boltzmann machine learning algorithm. And we didn't change the lower levels after “freezing” them.

wake: do a bottom-up pass, starting with a pattern from the training set. Use the delta rule to make this more likely under the generative model.
sleep: do a top-down pass, starting from an equilibrium sample from the top RBM. Use the delta rule to make this more likely under the recognition model.

[CD version: start top RBM at the sample from the wake phase, and don’t wait for equilibrium before doing the top-down pass].

wake-sleep learning algorithm unties the recognition weights from the generative ones

slide-59
SLIDE 59

Unsupervised Learning of DBNs

62

DBNs

Figure from (Hinton & Salakhutdinov, 2006)

Setting A: DBN Autoencoder
  I. Pre-train a stack of RBMs in greedy layerwise fashion
  II. Unroll the RBMs to create an autoencoder (i.e. bottom-up and top-down weights are untied)
  III. Fine-tune the parameters using backpropagation

slide-60
SLIDE 60

Unsupervised Learning of DBNs

63

DBNs

Figure from (Hinton & Salakhutdinov, 2006)

Setting A: DBN Autoencoder
  I. Pre-train a stack of RBMs in greedy layerwise fashion
  II. Unroll the RBMs to create an autoencoder (i.e. bottom-up and top-down weights are untied)
  III. Fine-tune the parameters using backpropagation

[Figure, Pretraining: a stack of RBMs with layer sizes 2000, 1000, 500, 30 and weights W1 … W4, with the top 30-unit RBM at the top]
slide-61
SLIDE 61

Unsupervised Learning of DBNs

64

DBNs

Figure from (Hinton & Salakhutdinov, 2006)

Setting A: DBN Autoencoder
  I. Pre-train a stack of RBMs in greedy layerwise fashion
  II. Unroll the RBMs to create an autoencoder (i.e. bottom-up and top-down weights are untied)
  III. Fine-tune the parameters using backpropagation

[Figure, Unrolling: the stack becomes an encoder (W1 … W4), a 30-dimensional code layer, and a decoder (W4ᵀ … W1ᵀ) through layers 2000, 1000, 500, 30, 500, 1000, 2000]

slide-62
SLIDE 62

Unsupervised Learning of DBNs

65

DBNs

Figure from (Hinton & Salakhutdinov, 2006)

Setting A: DBN Autoencoder
  I. Pre-train a stack of RBMs in greedy layerwise fashion
  II. Unroll the RBMs to create an autoencoder (i.e. bottom-up and top-down weights are untied)
  III. Fine-tune the parameters using backpropagation

[Figure, Fine-tuning: all encoder and decoder weights W1 … W4 and their transposes are adjusted (Wk + ε) by backpropagation through the unrolled network]
slide-63
SLIDE 63

Supervised Learning of DBNs

66

DBNs

Figure from (Hinton & Salakhutdinov, 2006)

Setting B: DBN classifier
  I. Pre-train a stack of RBMs in greedy layerwise fashion (unsupervised)
  II. Fine-tune the parameters using backpropagation by minimizing classification error on the training data

slide-64
SLIDE 64

MNIST Digit Generation

67

DBNs

  • Comparison of deep autoencoder, logistic PCA, and PCA
  • Each method projects the real data down to a vector of 30 real numbers
  • Then reconstructs the data from the low-dimensional projection

Figure from Hinton, NIPS Tutorial 2007 [rows: real data, 30-D deep autoencoder, 30-D logistic PCA, 30-D PCA]

slide-65
SLIDE 65

Learning Deep Belief Networks (DBNs)

68

DBNs

Figure from (Hinton & Salakhutdinov, 2006)

Setting B: DBN Autoencoder
  I. Pre-train a stack of RBMs in greedy layerwise fashion
  II. Unroll the RBMs to create an autoencoder (i.e. bottom-up and top-down weights are untied)
  III. Fine-tune the parameters using backpropagation

slide-66
SLIDE 66

MNIST Digit Generation

  • This section: Suppose you want to build a generative model capable of explaining handwritten digits

  • Goal:

– To have a model p(x) from which we can sample digits that look realistic
– Learn an unsupervised hidden representation of an image

69

DBNs

Figure from (Hinton et al., 2006): Samples from a DBN trained on MNIST

Figure 8: Each row shows 10 samples from the generative model with a particular label clamped on. The top-level associative memory is run for 1000 iterations of alternating Gibbs sampling between samples.
slide-67
SLIDE 67

MNIST Digit Recognition

70

DBNs

Slide from Hinton, NIPS Tutorial 2007

Examples of correctly recognized handwritten digits that the neural network had never seen before

It's very good. Experimental evaluation of a DBN with greedy layer-wise pre-training and fine-tuning via the wake-sleep algorithm.

slide-68
SLIDE 68

MNIST Digit Recognition

71

DBNs

Slide from Hinton, NIPS Tutorial 2007

How well does it discriminate on MNIST test set with no extra information about geometric distortions?

  • Generative model based on RBMs: 1.25%
  • Support Vector Machine (Decoste et al.): 1.4%
  • Backprop with 1000 hiddens (Platt): ~1.6%
  • Backprop with 500 --> 300 hiddens: ~1.6%
  • K-Nearest Neighbor: ~3.3%
  • See LeCun et al. 1998 for more results
  • It's better than backprop and much more neurally plausible, because the neurons only need to send one kind of signal, and the teacher can be another sensory input.

Experimental evaluation of a DBN with greedy layer-wise pre-training and fine-tuning via the wake-sleep algorithm

slide-69
SLIDE 69

Document Clustering and Retrieval

72

DBNs

Slide from Hinton, NIPS Tutorial 2007

  • We train the neural network to reproduce its input vector as its output
  • This forces it to compress as much information as possible into the 10 numbers in the central bottleneck.
  • These 10 numbers are then a good way to compare documents.

[Diagram: autoencoder from a 2000-dimensional word-count input vector, through layers of 500, 250, and 10 neurons at the bottleneck, back out to 2000 reconstructed counts as the output vector]

slide-70
SLIDE 70

Document Clustering and Retrieval

73

DBNs

Slide from Hinton, NIPS Tutorial 2007

Performance of the autoencoder at document retrieval

  • Train on bags of 2000 words for 400,000 training cases of business documents.
    – First train a stack of RBMs. Then fine-tune with backprop.
  • Test on a separate 400,000 documents.
    – Pick one test document as a query. Rank order all the other test documents by using the cosine of the angle between codes.
    – Repeat this using each of the 400,000 test documents as the query (requires 0.16 trillion comparisons).
  • Plot the number of retrieved documents against the proportion that are in the same hand-labeled class as the query document. (A retrieval sketch follows below.)
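A sketch of the ranking step described above: encode every document to a low-dimensional code (here stand-in random vectors; in the slides these are the 10-D bottleneck activations), then rank the other documents by the cosine of the angle between codes. Names and sizes are illustrative.

```python
import numpy as np

def cosine_retrieval(codes, query_idx, top_k=10):
    """codes: (n_docs, code_dim) array of document codes; returns the indices of the
    top_k documents most similar to codes[query_idx], excluding the query itself."""
    normed = codes / np.linalg.norm(codes, axis=1, keepdims=True)
    sims = normed @ normed[query_idx]         # cosine similarities
    sims[query_idx] = -np.inf                 # don't retrieve the query
    return np.argsort(-sims)[:top_k]

# Example with stand-in codes:
codes = np.random.default_rng(0).random((1000, 10))
print(cosine_retrieval(codes, query_idx=0, top_k=5))
```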

slide-71
SLIDE 71

Document Clustering and Retrieval

Retrieval Results

  • Goal: given a query document, retrieve the relevant test documents
  • Figure shows accuracy for varying numbers of retrieved test docs

74

DBNs

[Figure: 20 Newsgroups dataset; accuracy vs. number of retrieved documents for 10-D Autoencoder, 10-D LLE, and 10-D LSA codes]

Figure from (Hinton and Salakhutdinov, 2006)

slide-72
SLIDE 72

Outline

  • Motivation
  • Deep Neural Networks (DNNs)

– Background: Decision functions
– Background: Neural Networks
– Three ideas for training a DNN
– Experiments: MNIST digit classification

  • Deep Belief Networks (DBNs)

– Sigmoid Belief Network
– Contrastive Divergence learning
– Restricted Boltzmann Machines (RBMs)
– RBMs as infinitely deep Sigmoid Belief Nets
– Learning DBNs

  • Deep Boltzmann Machines (DBMs)

– Boltzmann Machines
– Learning Boltzmann Machines
– Learning DBMs

75

slide-73
SLIDE 73

Deep Boltzmann Machines

  • DBNs are a hybrid directed/undirected graphical model
  • DBMs are a purely undirected graphical model

76

DBMs

[Figure 2: Left, a three-layer Deep Belief Network; right, a three-layer Deep Boltzmann Machine, with layers v, h1, h2, h3 and weights W1, W2, W3]

slide-74
SLIDE 74

Deep Boltzmann Machines

Can we use the same techniques to train a DBM?

77

DBMs

[Figure: a three-layer Deep Boltzmann Machine with layers v, h1, h2, h3 and weights W1, W2, W3]

slide-75
SLIDE 75

Learning Standard Boltzmann Machines

  • Undirected graphical model of binary variables with pairwise potentials
  • Parameterization of the potentials:

78

DBMs

ψij(xi, xj) = exp(xiWijxj)

(In English: higher value of parameter Wij leads to higher correlation between Xi and Xj on value 1)

[Diagram: pairwise-connected binary nodes Xi, Xj, …]

slide-76
SLIDE 76

Learning Standard Boltzmann Machines

79

DBMs

Visible units: v ∈ {0, 1}^D
Hidden units: h ∈ {0, 1}^P

Energy:

$$E(v, h; \theta) = -\tfrac{1}{2} v^\top L v - \tfrac{1}{2} h^\top J h - v^\top W h$$

Likelihood:

$$p(v; \theta) = \frac{p^*(v; \theta)}{Z(\theta)} = \frac{1}{Z(\theta)} \sum_h \exp\big(-E(v, h; \theta)\big), \qquad Z(\theta) = \sum_v \sum_h \exp\big(-E(v, h; \theta)\big)$$
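A short sketch of the energy above and the unnormalized probability p*(v) = Σ_h exp(−E(v, h)). Enumerating all hidden states is only feasible for tiny P; it is shown purely to make the definitions concrete.

```python
import numpy as np
from itertools import product

def energy(v, h, L, J, W):
    return -0.5 * v @ L @ v - 0.5 * h @ J @ h - v @ W @ h

def p_star(v, L, J, W):
    """Unnormalized p*(v) by brute-force summation over all hidden configurations."""
    P = J.shape[0]
    return sum(np.exp(-energy(v, np.array(h, dtype=float), L, J, W))
               for h in product([0.0, 1.0], repeat=P))
```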

slide-77
SLIDE 77

Learning Standard Boltzmann Machines

80

DBMs

Full conditionals for the Gibbs sampler:

$$p(h_j = 1 \mid v, h_{-j}) = \sigma\!\left( \sum_{i=1}^{D} W_{ij} v_i + \sum_{m \neq j} J_{jm} h_m \right), \qquad p(v_i = 1 \mid h, v_{-i}) = \sigma\!\left( \sum_{j=1}^{P} W_{ij} h_j + \sum_{k \neq i} L_{ik} v_k \right)$$

Delta updates to each of the model parameters:

$$\Delta W = \alpha \left( \mathbb{E}_{P_\text{data}}[v h^\top] - \mathbb{E}_{P_\text{model}}[v h^\top] \right)$$
$$\Delta L = \alpha \left( \mathbb{E}_{P_\text{data}}[v v^\top] - \mathbb{E}_{P_\text{model}}[v v^\top] \right)$$
$$\Delta J = \alpha \left( \mathbb{E}_{P_\text{data}}[h h^\top] - \mathbb{E}_{P_\text{model}}[h h^\top] \right)$$

where α is a learning rate and E denotes an expectation.

(Old) idea from Hinton & Sejnowski (1983): For each iteration of optimization, run a separate MCMC chain for each of the data and model expectations to approximate the parameter updates.
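A sketch of one Hinton & Sejnowski-style update: the data expectation is estimated with Gibbs sampling over h given each clamped v, and the model expectation with a free-running Gibbs chain. Chain lengths, the learning rate, and the parallel layer updates (strict Gibbs updates one unit at a time) are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
bern = lambda p: (rng.random(p.shape) < p).astype(float)

def gibbs_sweep(v, h, W, L, J, clamp_v=False):
    """One sweep of the full conditionals above (visible block, then hidden block)."""
    if not clamp_v:
        v = bern(sigmoid(W @ h + L @ v - np.diag(L) * v))   # exclude the k = i self-term
    h = bern(sigmoid(W.T @ v + J @ h - np.diag(J) * h))     # exclude the m = j self-term
    return v, h

def bm_update(V_data, W, L, J, alpha=0.01, k=20):
    D, P = W.shape
    pos_vh = np.zeros_like(W); pos_vv = np.zeros_like(L); pos_hh = np.zeros_like(J)
    for v in V_data:                           # data phase: v clamped, sample h
        h = bern(0.5 * np.ones(P))
        for _ in range(k):
            _, h = gibbs_sweep(v, h, W, L, J, clamp_v=True)
        pos_vh += np.outer(v, h); pos_vv += np.outer(v, v); pos_hh += np.outer(h, h)
    v = bern(0.5 * np.ones(D)); h = bern(0.5 * np.ones(P))
    for _ in range(k):                         # model phase: free-running chain
        v, h = gibbs_sweep(v, h, W, L, J)
    n = len(V_data)
    W += alpha * (pos_vh / n - np.outer(v, h))
    L += alpha * (pos_vv / n - np.outer(v, v))
    J += alpha * (pos_hh / n - np.outer(h, h))
    return W, L, J
```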

slide-78
SLIDE 78

Learning Standard Boltzmann Machines

81

DBMs

Full conditionals for the Gibbs sampler:

$$p(h_j = 1 \mid v, h_{-j}) = \sigma\!\left( \sum_{i=1}^{D} W_{ij} v_i + \sum_{m \neq j} J_{jm} h_m \right), \qquad p(v_i = 1 \mid h, v_{-i}) = \sigma\!\left( \sum_{j=1}^{P} W_{ij} h_j + \sum_{k \neq i} L_{ik} v_k \right)$$

Delta updates to each of the model parameters:

$$\Delta W = \alpha \left( \langle v h^\top \rangle_{v \in \mathcal{D},\, h \sim p(h \mid v)} - \langle v h^\top \rangle_{v,h \sim p(v,h)} \right)$$
$$\Delta L = \alpha \left( \langle v v^\top \rangle_{v \in \mathcal{D},\, h \sim p(h \mid v)} - \langle v v^\top \rangle_{v,h \sim p(v,h)} \right)$$
$$\Delta J = \alpha \left( \langle h h^\top \rangle_{v \in \mathcal{D},\, h \sim p(h \mid v)} - \langle h h^\top \rangle_{v,h \sim p(v,h)} \right)$$

(Old) idea from Hinton & Sejnowski (1983): For each iteration of optimization, run a separate MCMC chain for each of the data and model expectations to approximate the parameter updates. But it doesn't work very well! The MCMC chains take too long to mix, especially for the data distribution.

slide-79
SLIDE 79

Learning Standard Boltzmann Machines

82

DBMs

Delta updates to each of the model parameters:

$$\Delta W = \alpha \left( \langle v h^\top \rangle_{v \in \mathcal{D},\, h \sim p(h \mid v)} - \langle v h^\top \rangle_{v,h \sim p(v,h)} \right)$$
$$\Delta L = \alpha \left( \langle v v^\top \rangle_{v \in \mathcal{D},\, h \sim p(h \mid v)} - \langle v v^\top \rangle_{v,h \sim p(v,h)} \right)$$
$$\Delta J = \alpha \left( \langle h h^\top \rangle_{v \in \mathcal{D},\, h \sim p(h \mid v)} - \langle h h^\top \rangle_{v,h \sim p(v,h)} \right)$$

(New) idea from Salakhutdinov & Hinton (2009):
  • Step 1) Approximate the data distribution by variational inference.
  • Step 2) Approximate the model distribution with a “persistent” Markov chain (from iteration to iteration).

slide-80
SLIDE 80

83

Delta updates to each of the model parameters (as above):

$$\Delta W = \alpha \left( \langle v h^\top \rangle_{v \in \mathcal{D},\, h \sim p(h \mid v)} - \langle v h^\top \rangle_{v,h \sim p(v,h)} \right) \quad \text{(and analogously for } \Delta L \text{ and } \Delta J\text{)}$$

Step 1) Approximate the data distribution…

Mean-field approximation: use a fully factorized distribution $q(h; \mu) = \prod_{j=1}^{P} q(h_j)$, with $q(h_j = 1) = \mu_j$ (where $P$ is the number of hidden units), to approximate the true posterior.

Variational lower bound of the log-likelihood:

$$\ln p(v; \theta) \ge \sum_{h} q(h \mid v; \mu) \ln p(v, h; \theta) + \mathcal{H}(q)$$

Fixed-point equations for the variational parameters:

$$\mu_j \leftarrow \sigma\!\left( \sum_i W_{ij} v_i + \sum_{m \neq j} J_{mj} \mu_m \right)$$

Learning Standard Boltzmann Machines

DBMs

(New) idea from Salakhutdinov & Hinton (2009):
  • Step 1) Approximate the data distribution by variational inference.
  • Step 2) Approximate the model distribution with a “persistent” Markov chain (from iteration to iteration).
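A sketch of the mean-field fixed-point iteration above, used for the data-dependent expectations: given a clamped visible vector v, iterate µ_j ← σ(Σ_i W_ij v_i + Σ_{m≠j} J_mj µ_m) to convergence. The iteration count and undamped updates are simplifying assumptions.

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def mean_field(v, W, J, n_iters=50):
    """Returns mu with mu[j] approximating q(h_j = 1 | v)."""
    P = J.shape[0]
    mu = np.full(P, 0.5)
    for _ in range(n_iters):
        mu = sigmoid(W.T @ v + J @ mu - np.diag(J) * mu)   # exclude the m = j term
    return mu

# The data expectation E_q[v h^T] in the ΔW update is then approximated by np.outer(v, mu).
```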

slide-81
SLIDE 81

84

Delta updates to each of the model parameters (as above):

$$\Delta W = \alpha \left( \langle v h^\top \rangle_{v \in \mathcal{D},\, h \sim p(h \mid v)} - \langle v h^\top \rangle_{v,h \sim p(v,h)} \right) \quad \text{(and analogously for } \Delta L \text{ and } \Delta J\text{)}$$

Step 2) Approximate the model distribution…

Why not use variational inference for the model expectation as well?

Learning Standard Boltzmann Machines

DBMs

(New) idea from Salakhutdinov & Hinton (2009):
  • Step 1) Approximate the data distribution by variational inference.
  • Step 2) Approximate the model distribution with a “persistent” Markov chain (from iteration to iteration).

Taking the difference of the two mean-field approximated expectations above would cause the learning algorithm to maximize the divergence between the true and the mean-field distributions. Persistent CD adds correlations between successive iterations, but this is not an issue.

slide-82
SLIDE 82

Deep Boltzmann Machines

  • DBNs are a hybrid directed/undirected graphical model
  • DBMs are a purely undirected graphical model

85

DBMs

[Figure 2: Left, a three-layer Deep Belief Network; right, a three-layer Deep Boltzmann Machine, with layers v, h1, h2, h3 and weights W1, W2, W3]

slide-83
SLIDE 83

Learning Deep Boltzmann Machines

Can we use the same techniques to train a DBM?
  I. Pre-train a stack of RBMs in greedy layerwise fashion (requires some caution to avoid double counting)
  II. Use those parameters to initialize the two-step mean-field approach to learning the full Boltzmann machine (i.e. the full DBM)

86

DBMs

[Figure: a three-layer Deep Boltzmann Machine with layers v, h1, h2, h3 and weights W1, W2, W3]

slide-84
SLIDE 84

Document Clustering and Retrieval

Clustering Results

  • Goal: cluster related documents
  • Figures show projection to 2 dimensions
  • Color shows true categories

87

DBMs

Figure from (Salakhutdinov and Hinton, 2009)

[Left panel: PCA projection; right panel: DBN projection]

slide-85
SLIDE 85

Course Level Objectives

You should be able to…
1. Formalize new tasks as structured prediction problems.
2. Develop new models by incorporating domain knowledge about constraints on or interactions between the outputs.
3. Combine deep neural networks and graphical models.
4. Identify appropriate inference methods, either exact or approximate, for a probabilistic graphical model.
5. Employ learning algorithms that make the best use of available data.
6. Implement from scratch state-of-the-art approaches to learning and inference for structured prediction models.

88

slide-86
SLIDE 86

Q&A

89