

SLIDE 1

CS 533: Natural Language Processing

Latent-Variable Generative Models and the Expectation Maximization (EM) Algorithm

Karl Stratos

Rutgers University


SLIDE 2

Motivation: Bag-Of-Words (BOW) Document Model

◮ Fixed-length documents x ∈ V^T
◮ BOW parameters: a word distribution p_W over V defining

    p_X(x) = \prod_{t=1}^{T} p_W(x_t)

◮ Model's generative story: any word in any document is independently generated.
◮ What if the true generative story underlying the data is different?

    V = {a, b},  T = 10
    x^{(1)} = (a, a, a, a, a, a, a, a, a, a)
    x^{(2)} = (b, b, b, b, b, b, b, b, b, b)

◮ MLE: p_X(x^{(1)}) = p_X(x^{(2)}) = (1/2)^{10}
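A quick numeric check in Python (a minimal sketch, not from the slides) reproduces the (1/2)^10 value:

    # BOW probability of a document: product of per-word probabilities.
    p_W = {"a": 0.5, "b": 0.5}      # MLE word distribution on the toy corpus

    def bow_prob(doc):
        prob = 1.0
        for w in doc:
            prob *= p_W[w]
        return prob

    print(bow_prob(("a",) * 10))    # 0.0009765625 = (1/2)**10
    print(bow_prob(("b",) * 10))    # 0.0009765625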

SLIDE 3

Latent-Variable BOW (LV-BOW) Document Model

◮ LV-BOW parameters:
    ◮ p_Z: "topic" distribution over {1, ..., K}
    ◮ p_{W|Z}: conditional word distribution over V
  defining

    p_{X|Z}(x \mid z) = \prod_{t=1}^{T} p_{W|Z}(x_t \mid z)    ∀ z ∈ {1, ..., K}

    p_X(x) = \sum_{z=1}^{K} p_Z(z) \times p_{X|Z}(x \mid z)

◮ Model's generative story: for each document, a topic is generated, and conditioned on that topic the words are independently generated.

SLIDE 4

Back to the Example

    V = {a, b},  T = 10
    x^{(1)} = (a, a, a, a, a, a, a, a, a, a)
    x^{(2)} = (b, b, b, b, b, b, b, b, b, b)

◮ K = 2 with p_Z(1) = p_Z(2) = 1/2
◮ p_{W|Z}(a \mid 1) = p_{W|Z}(b \mid 2) = 1
◮ p_X(x^{(1)}) = p_X(x^{(2)}) = 1/2 ≫ (1/2)^{10}

Key idea: introduce a latent variable Z to model the true generative process more faithfully.
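Extending the check above to the mixture (again a sketch with illustrative names) reproduces the probability 1/2 for each document:

    # LV-BOW: p_X(x) = sum_z p_Z(z) * prod_t p_{W|Z}(x_t | z)
    p_Z = {1: 0.5, 2: 0.5}
    p_WZ = {1: {"a": 1.0, "b": 0.0},   # topic 1 always emits "a"
            2: {"a": 0.0, "b": 1.0}}   # topic 2 always emits "b"

    def lvbow_prob(doc):
        total = 0.0
        for z, prior in p_Z.items():
            prob = prior
            for w in doc:
                prob *= p_WZ[z][w]
            total += prob
        return total

    print(lvbow_prob(("a",) * 10))   # 0.5
    print(lvbow_prob(("b",) * 10))   # 0.5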

SLIDE 5

The Latent-Variable Generative Model Paradigm

◮ Model. Joint distribution over X and Z:

    p_{XZ}(x, z) = p_Z(z) \times p_{X|Z}(x \mid z)

◮ Learning. We don't observe Z!

    \max_{p_{XZ}} \; \mathbb{E}_{x \sim \mathrm{pop}_X}\Big[\log \underbrace{\sum_{z \in \mathcal{Z}} p_{XZ}(x, z)}_{p_X(x)}\Big]

SLIDE 6

The Learning Problem

◮ How can we solve

    \max_{p_{XZ}} \; \mathbb{E}_{x \sim \mathrm{pop}_X}\Big[\log \sum_{z \in \mathcal{Z}} p_{XZ}(x, z)\Big]  ?

◮ Specifically for LV-BOW, given N documents x^{(1)}, ..., x^{(N)} ∈ V^T, how can we learn a topic distribution p_Z and a conditional word distribution p_{W|Z} that maximize

    \sum_{i=1}^{N} \log \sum_{z \in \mathcal{Z}} p_Z(z) \times \prod_{t=1}^{T} p_{W|Z}\big(x^{(i)}_t \mid z\big)  ?

SLIDE 7

A Proposed Algorithm

1. Initialize p_Z and p_{W|Z} as random distributions.
2. Repeat until convergence:

   2.1 For i = 1, ..., N compute the conditional posterior distribution

       p_{Z|X}(z \mid x^{(i)}) = \frac{p_Z(z) \times \prod_{t=1}^{T} p_{W|Z}(x^{(i)}_t \mid z)}
                                      {\sum_{z'=1}^{K} p_Z(z') \times \prod_{t=1}^{T} p_{W|Z}(x^{(i)}_t \mid z')}

   2.2 Update the model parameters by

       p_Z(z) = \frac{\sum_{i=1}^{N} p_{Z|X}(z \mid x^{(i)})}
                     {\sum_{z'=1}^{K} \sum_{i=1}^{N} p_{Z|X}(z' \mid x^{(i)})}

       p_{W|Z}(w \mid z) = \frac{\sum_{i=1}^{N} p_{Z|X}(z \mid x^{(i)}) \times \mathrm{count}(w \mid x^{(i)})}
                                {\sum_{w' \in V} \sum_{i=1}^{N} p_{Z|X}(z \mid x^{(i)}) \times \mathrm{count}(w' \mid x^{(i)})}

       where count(w | x^{(i)}) is the number of times w ∈ V appears in x^{(i)}.

SLIDE 8

Code
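The implementation shown on this slide is an embedded screenshot and does not survive the text export. A minimal NumPy sketch of the algorithm from SLIDE 7, with illustrative function and variable names (em_lvbow, docs, vocab are assumptions, not the original code), might look like this:

    import numpy as np

    def em_lvbow(docs, vocab, K, num_iters=100, seed=0):
        # EM for the LV-BOW model. docs: list of token lists. Returns (p_Z, p_WZ, posteriors).
        rng = np.random.default_rng(seed)
        word_id = {w: j for j, w in enumerate(vocab)}
        # Precompute count(w | x^(i)) as an N x |V| matrix.
        counts = np.zeros((len(docs), len(vocab)))
        for i, doc in enumerate(docs):
            for w in doc:
                counts[i, word_id[w]] += 1

        # 1. Initialize p_Z and p_{W|Z} as random distributions.
        p_Z = rng.random(K); p_Z /= p_Z.sum()
        p_WZ = rng.random((K, len(vocab))); p_WZ /= p_WZ.sum(axis=1, keepdims=True)

        for _ in range(num_iters):
            # 2.1 E-step: posterior p_{Z|X}(z | x^(i)), computed in log space for stability.
            log_joint = np.log(p_Z) + counts @ np.log(p_WZ + 1e-12).T   # N x K; epsilon avoids log(0)
            post = np.exp(log_joint - log_joint.max(axis=1, keepdims=True))
            post /= post.sum(axis=1, keepdims=True)                      # N x K posteriors

            # 2.2 M-step: re-estimate p_Z and p_{W|Z} from expected counts.
            p_Z = post.sum(axis=0) / post.sum()
            p_WZ = post.T @ counts                                        # K x |V| expected word counts
            p_WZ /= p_WZ.sum(axis=1, keepdims=True)

        return p_Z, p_WZ, post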


SLIDE 9

Code in Action
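The "in action" screenshot is likewise missing. Running the sketch above on the two-document toy corpus (illustrative call and output) typically recovers one pure topic per document:

    docs = [["a"] * 10, ["b"] * 10]
    p_Z, p_WZ, post = em_lvbow(docs, vocab=["a", "b"], K=2, num_iters=50, seed=0)
    print(np.round(p_Z, 3))    # roughly [0.5, 0.5]
    print(np.round(p_WZ, 3))   # rows roughly one-hot: one topic generates "a", the other "b"
    print(np.round(post, 3))   # each document assigned to its own topic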


SLIDE 10

Code in Action: Bad Initialization
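This screenshot is also missing. One concrete way an initialization can fail (an illustrative assumption; the original slide may show a different case) is a perfectly symmetric start: if both topics begin with identical distributions, the posteriors stay uniform and the updates never separate them. Continuing from the sketch above:

    counts = np.array([[10, 0],    # counts of ("a", "b") in x^(1)
                       [0, 10]])   # and in x^(2)
    p_Z = np.array([0.5, 0.5])
    p_WZ = np.array([[0.5, 0.5],   # both topics start identical (bad initialization)
                     [0.5, 0.5]])
    for _ in range(20):
        log_joint = np.log(p_Z) + counts @ np.log(p_WZ + 1e-12).T
        post = np.exp(log_joint - log_joint.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        p_Z = post.sum(axis=0) / post.sum()
        p_WZ = post.T @ counts
        p_WZ /= p_WZ.sum(axis=1, keepdims=True)
    print(post)    # stays [[0.5, 0.5], [0.5, 0.5]]: the symmetry is never broken
    print(p_WZ)    # stays uniform over {a, b} for both topics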


SLIDE 11

Another Example

(Figure: parameter tables showing the initial values and the values after convergence.)

SLIDE 12

Again Possible to Get Stuck in a Local Optimum

(Figure: parameter tables showing the initial values and the values after convergence.)

SLIDE 13

Why Does It Work?

◮ A special case of the expectation maximization (EM) algorithm adapted for LV-BOW
◮ EM is an extremely important and general concept
◮ Another special case: variational autoencoders (VAEs, next class)

SLIDE 14

Setting

◮ Original problem: difficult to optimize (nonconvex)

    \max_{p_{XZ}} \; \mathbb{E}_{x \sim \mathrm{pop}_X}\Big[\log \sum_{z \in \mathcal{Z}} p_{XZ}(x, z)\Big]

◮ Alternative problem: easy to optimize (often concave)

    \max_{p_{XZ}} \; \mathbb{E}_{\substack{x \sim \mathrm{pop}_X \\ z \sim q_{Z|X}(\cdot|x)}}\big[\log p_{XZ}(x, z)\big]

  where q_{Z|X} is some arbitrary posterior distribution that is easy to compute

SLIDE 15

Solving the Alternative Problem

◮ Many models we have considered (LV-BOW, HMM, PCFG) can be written as

    p_{XZ}(x, z) = \prod_{(\tau, a) \in E} p_\tau(a)^{\mathrm{count}_\tau(a \mid x, z)}

    ◮ E is a set of possible event type-value pairs.
    ◮ count_τ(a | x, z) is the number of times τ = a happens in (x, z).
    ◮ The model has a distribution p_τ over the possible values of type τ.

◮ Example

    p_{XZ}((a, a, a, b, b), 2) = p_Z(2) \times p_{W|Z}(a \mid 2)^3 \times p_{W|Z}(b \mid 2)^2    (LV-BOW)

    p_{XZ}((\mathrm{La}, \mathrm{La}, \mathrm{La}), (\mathrm{N}, \mathrm{N}, \mathrm{N}))
        = o(\mathrm{La} \mid \mathrm{N})^3 \times t(\mathrm{N} \mid *) \times t(\mathrm{N} \mid \mathrm{N})^2 \times t(\mathrm{STOP} \mid \mathrm{N})    (HMM)
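As a tiny illustration of this event-count factorization (the event names and numeric values below are hypothetical, chosen only for the example):

    from math import prod

    def joint_from_counts(event_probs, event_counts):
        # p_XZ(x, z) = product over event (type, value) pairs of p_tau(a) ** count_tau(a | x, z)
        return prod(p ** event_counts.get(event, 0) for event, p in event_probs.items())

    # LV-BOW example with x = (a, a, a, b, b), z = 2 and hypothetical parameter values:
    event_probs  = {("Z", 2): 0.5, ("W|Z=2", "a"): 0.6, ("W|Z=2", "b"): 0.4}
    event_counts = {("Z", 2): 1,   ("W|Z=2", "a"): 3,   ("W|Z=2", "b"): 2}
    print(joint_from_counts(event_probs, event_counts))   # 0.5 * 0.6**3 * 0.4**2 = 0.01728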

SLIDE 16

Closed-Form Solution

If x^{(1)}, ..., x^{(N)} ∼ pop_X are iid samples,

    \max_{p_{XZ}} \; \mathbb{E}_{\substack{x \sim \mathrm{pop}_X \\ z \sim q_{Z|X}(\cdot|x)}}\big[\log p_{XZ}(x, z)\big]
      \approx \max_{p_{XZ}} \sum_{i=1}^{N} \sum_{z} q_{Z|X}(z \mid x^{(i)}) \log p_{XZ}(x^{(i)}, z)
      = \max_{p_\tau} \sum_{i=1}^{N} \sum_{z} q_{Z|X}(z \mid x^{(i)}) \sum_{(\tau, a) \in E} \mathrm{count}_\tau(a \mid x^{(i)}, z) \log p_\tau(a)
      = \max_{p_\tau} \sum_{(\tau, a) \in E} \Big( \sum_{i=1}^{N} \sum_{z} q_{Z|X}(z \mid x^{(i)}) \, \mathrm{count}_\tau(a \mid x^{(i)}, z) \Big) \log p_\tau(a)

MLE solution!

    p_\tau(a) = \frac{\sum_{i=1}^{N} \sum_{z} q_{Z|X}(z \mid x^{(i)}) \, \mathrm{count}_\tau(a \mid x^{(i)}, z)}
                     {\sum_{a'} \sum_{i=1}^{N} \sum_{z} q_{Z|X}(z \mid x^{(i)}) \, \mathrm{count}_\tau(a' \mid x^{(i)}, z)}

SLIDE 17

This is How We Derived LV-BOW EM Updates

Using q_{Z|X} = p_{Z|X}:

    p_Z(z) = \frac{\sum_{i=1}^{N} \sum_{z'} p_{Z|X}(z' \mid x^{(i)}) \, \mathrm{count}_\tau(z' = z \mid x^{(i)}, z')}
                  {\sum_{z''} \sum_{i=1}^{N} \sum_{z'} p_{Z|X}(z' \mid x^{(i)}) \, \mathrm{count}_\tau(z' = z'' \mid x^{(i)}, z')}
           = \frac{\sum_{i=1}^{N} p_{Z|X}(z \mid x^{(i)})}
                  {\sum_{z''} \sum_{i=1}^{N} p_{Z|X}(z'' \mid x^{(i)})}

    p_{W|Z}(w \mid z) = \frac{\sum_{i=1}^{N} \sum_{z'} p_{Z|X}(z' \mid x^{(i)}) \, \mathrm{count}_\tau(z' = z, w \mid x^{(i)}, z')}
                             {\sum_{w' \in V} \sum_{i=1}^{N} \sum_{z'} p_{Z|X}(z' \mid x^{(i)}) \, \mathrm{count}_\tau(z' = z, w' \mid x^{(i)}, z')}
                      = \frac{\sum_{i=1}^{N} p_{Z|X}(z \mid x^{(i)}) \, \mathrm{count}(w \mid x^{(i)})}
                             {\sum_{w' \in V} \sum_{i=1}^{N} p_{Z|X}(z \mid x^{(i)}) \, \mathrm{count}(w' \mid x^{(i)})}

SLIDE 18

Game Plan

◮ So we have established that it is often easy to solve the alternative problem

    \max_{p_{XZ}} \; \mathbb{E}_{\substack{x \sim \mathrm{pop}_X \\ z \sim q_{Z|X}(\cdot|x)}}\big[\log p_{XZ}(x, z)\big]

  where q_{Z|X} is any posterior distribution that is easy to compute.

◮ We will relate the original log-likelihood objective to this quantity on the following slide.

SLIDE 19

ELBO: Evidence Lower Bound

For any q_{Z|X} we define

    \mathrm{ELBO}(p_{XZ}, q_{Z|X}) = \mathbb{E}_{\substack{x \sim \mathrm{pop}_X \\ z \sim q_{Z|X}(\cdot|x)}}\big[\log p_{XZ}(x, z)\big] + H(q_{Z|X})

where

    H(q_{Z|X}) = \mathbb{E}_{\substack{x \sim \mathrm{pop}_X \\ z \sim q_{Z|X}(\cdot|x)}}\big[-\log q_{Z|X}(z \mid x)\big].

Claim. For all q_{Z|X},

    \mathrm{ELBO}(p_{XZ}, q_{Z|X}) \le \mathbb{E}_{x \sim \mathrm{pop}_X}\Big[\log \sum_{z \in \mathcal{Z}} p_{XZ}(x, z)\Big]

with equality iff q_{Z|X} = p_{Z|X}. (Proof on board)
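The board proof is not included in the deck; a standard derivation, consistent with the ELBO decomposition on SLIDE 22, is:

    \mathrm{ELBO}(p_{XZ}, q_{Z|X})
      = \mathbb{E}_{x, z}\Big[\log \frac{p_{XZ}(x, z)}{q_{Z|X}(z \mid x)}\Big]
      = \mathbb{E}_{x, z}\Big[\log p_X(x) + \log \frac{p_{Z|X}(z \mid x)}{q_{Z|X}(z \mid x)}\Big]
      = \mathrm{LL}(p_{XZ}) - \mathbb{E}_{x \sim \mathrm{pop}_X}\big[D_{\mathrm{KL}}\big(q_{Z|X}(\cdot \mid x) \,\|\, p_{Z|X}(\cdot \mid x)\big)\big]
      \le \mathrm{LL}(p_{XZ}),

where the expectations are over x ∼ pop_X and z ∼ q_{Z|X}(·|x), and the KL term is zero iff q_{Z|X}(·|x) = p_{Z|X}(·|x).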

SLIDE 20

EM: Coordinate Ascent on ELBO

Input: sampling access to pop_X, definition of p_{XZ}
Output: local optimum of

    \max_{p_{XZ}} \; \mathbb{E}_{x \sim \mathrm{pop}_X}\Big[\log \sum_{z \in \mathcal{Z}} p_{XZ}(x, z)\Big]

1. Initialize p_{XZ} (e.g., random distribution).
2. Repeat until convergence:

    q_{Z|X} \leftarrow \arg\max_{\bar{q}_{Z|X}} \mathrm{ELBO}(p_{XZ}, \bar{q}_{Z|X})
    p_{XZ} \leftarrow \arg\max_{\bar{p}_{XZ}} \mathrm{ELBO}(\bar{p}_{XZ}, q_{Z|X})

3. Return p_{XZ}

SLIDE 21

Equivalently

Input: sampling access to pop_X, definition of p_{XZ}
Output: local optimum of

    \max_{p_{XZ}} \; \mathbb{E}_{x \sim \mathrm{pop}_X}\Big[\log \sum_{z \in \mathcal{Z}} p_{XZ}(x, z)\Big]

1. Initialize p_{XZ} (e.g., random distribution).
2. Repeat until convergence:

    p_{XZ} \leftarrow \arg\max_{\bar{p}_{XZ}} \; \mathbb{E}_{\substack{x \sim \mathrm{pop}_X \\ z \sim p_{Z|X}(\cdot|x)}}\big[\log \bar{p}_{XZ}(x, z)\big]

   (where p_{Z|X} is the posterior under the current p_{XZ})

3. Return p_{XZ}

SLIDE 22

EM Can Only Increase the Objective (Or Leave It Unchanged)

(Figure: LL(p_XZ) and its ELBO lower bounds; LL(p_XZ) = ELBO(p_XZ, p_Z|X) at the current model, and the maximized ELBO(p'_XZ, q'_Z|X) lies below LL(p'_XZ).)

    \mathrm{LL}(p_{XZ}) = \mathbb{E}_{x \sim \mathrm{pop}_X}\Big[\log \sum_{z \in \mathcal{Z}} p_{XZ}(x, z)\Big]

    \mathrm{ELBO}(p_{XZ}, q_{Z|X}) = \mathrm{LL}(p_{XZ}) - D_{\mathrm{KL}}\big(q_{Z|X} \,\|\, p_{Z|X}\big)
                                   = \mathbb{E}_{\substack{x \sim \mathrm{pop}_X \\ z \sim q_{Z|X}(\cdot|x)}}\big[\log p_{XZ}(x, z)\big] + H(q_{Z|X})

SLIDE 23

EM Can Only Increase the Objective (Or Leave It Unchanged)

From https://media.nature.com/full/nature-assets/nbt/journal/v26/n8/extref/nbt1406-S1.pdf

SLIDE 24

Sample Version

Input: N iid samples from pop_X, definition of p_{XZ}
Output: local optimum of

    \max_{p_{XZ}} \; \frac{1}{N} \sum_{i=1}^{N} \log \sum_{z \in \mathcal{Z}} p_{XZ}(x^{(i)}, z)

1. Initialize p_{XZ} (e.g., random distribution).
2. Repeat until convergence:

    p_{XZ} \leftarrow \arg\max_{\bar{p}_{XZ}} \sum_{i=1}^{N} \sum_{z \in \mathcal{Z}} p_{Z|X}(z \mid x^{(i)}) \log \bar{p}_{XZ}(x^{(i)}, z)

3. Return p_{XZ}

SLIDE 25

EM for HMM (Baum-Welch)

Input: sequences x^{(1)}, ..., x^{(N)} ∈ V^T

1. Initialize emission o(w|y) and transition t(y'|y) probabilities.
2. Repeat until convergence:

    o, t \leftarrow \arg\max_{\bar{o}, \bar{t}} \sum_{i=1}^{N} \sum_{z \in \mathcal{Y}^T} p_{Z|X}(z \mid x^{(i)}) \log p^{\bar{o}, \bar{t}}_{XZ}(x^{(i)}, z)

   where

    p^{o,t}_{XZ}(x, z) = \prod_{y, w} o(w \mid y)^{\mathrm{count}((y, w) \mid x, z)} \times \prod_{y, y'} t(y' \mid y)^{\mathrm{count}((y, y') \mid x, z)}

SLIDE 26

Baum-Welch Updates: Emission Probabilities

    o(w \mid y) = \frac{\sum_{i=1}^{N} \sum_{z} p_{Z|X}(z \mid x^{(i)}) \, \mathrm{count}((y, w) \mid x^{(i)}, z)}
                       {\sum_{w' \in V} \sum_{i=1}^{N} \sum_{z} p_{Z|X}(z \mid x^{(i)}) \, \mathrm{count}((y, w') \mid x^{(i)}, z)}
                = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T} \mu(y \mid x^{(i)}, t) \, [x^{(i)}_t = w]}
                       {\sum_{w' \in V} \sum_{i=1}^{N} \sum_{t=1}^{T} \mu(y \mid x^{(i)}, t) \, [x^{(i)}_t = w']}

where µ(y | x^{(i)}, t) is the conditional probability that the t-th label is equal to y in x^{(i)}, which can be calculated from the forward/backward probabilities:

    \mu(y \mid x^{(i)}, t) = \frac{\alpha(t, y) \times \beta(t, y)}{p_X(x^{(i)})}

SLIDE 27

Baum-Welch Updates: Transition Probabilities

    t(y' \mid y) = \frac{\sum_{i=1}^{N} \sum_{z} p_{Z|X}(z \mid x^{(i)}) \, \mathrm{count}((y, y') \mid x^{(i)}, z)}
                        {\sum_{y'' \in \mathcal{Y}} \sum_{i=1}^{N} \sum_{z} p_{Z|X}(z \mid x^{(i)}) \, \mathrm{count}((y, y'') \mid x^{(i)}, z)}
                 = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T-1} \mu(y, y' \mid x^{(i)}, t)}
                        {\sum_{y'' \in \mathcal{Y}} \sum_{i=1}^{N} \sum_{t=1}^{T-1} \mu(y, y'' \mid x^{(i)}, t)}

where µ(y, y' | x^{(i)}, t) is the conditional probability that the t-th label pair is equal to (y, y') in x^{(i)}, which can be calculated from the forward/backward probabilities:

    \mu(y, y' \mid x^{(i)}, t) = \frac{\alpha(t, y) \times t(y' \mid y) \times o(x^{(i)}_{t+1} \mid y') \times \beta(t+1, y')}{p_X(x^{(i)})}

SLIDE 28

Summary of Baum-Welch

◮ Given N unlabeled sequences, find a local optimum of

    \arg\max_{o, t} \; \frac{1}{N} \sum_{i=1}^{N} \log \sum_{z \in \mathcal{Y}^T} p^{o,t}_{XZ}(x^{(i)}, z)

  where o and t are the emission/transition probabilities of the HMM.

◮ Initialize o, t and repeat until convergence (a sketch of one such loop appears below):
    ◮ Run the forward-backward algorithm on x^{(1)}, ..., x^{(N)} using the current o, t values.
    ◮ Use the forward/backward probabilities to compute marginals.
    ◮ Use the marginals to compute "expected counts" of word-tag pairs (w, y) and tag pairs (y, y') across all the data.
    ◮ Get new o, t by the previous updates.
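A compact NumPy sketch of these steps, with illustrative names (forward_backward, baum_welch, and the array layouts are assumptions). As simplifications relative to the slides, it fixes the STOP probability to 1 and uses no scaling, so it is only suitable for short sequences:

    import numpy as np

    def forward_backward(x, o, t, t_init, t_stop):
        # Forward-backward for one sequence x of word ids.
        # o[y, w] = o(w|y), t[y, y2] = t(y2|y), t_init[y] = t(y|*), t_stop[y] = t(STOP|y).
        T, Y = len(x), o.shape[0]
        alpha = np.zeros((T, Y))
        beta = np.zeros((T, Y))
        alpha[0] = t_init * o[:, x[0]]
        for s in range(1, T):
            alpha[s] = (alpha[s - 1] @ t) * o[:, x[s]]
        beta[T - 1] = t_stop
        for s in range(T - 2, -1, -1):
            beta[s] = t @ (o[:, x[s + 1]] * beta[s + 1])
        return alpha, beta, alpha[T - 1] @ t_stop        # p_X(x)

    def baum_welch(seqs, Y, V, num_iters=20, seed=0):
        # EM for an HMM on unlabeled sequences of word ids in {0, ..., V-1}.
        rng = np.random.default_rng(seed)
        o = rng.random((Y, V)); o /= o.sum(axis=1, keepdims=True)
        t = rng.random((Y, Y)); t /= t.sum(axis=1, keepdims=True)
        t_init = np.full(Y, 1.0 / Y)
        t_stop = np.ones(Y)                              # simplification: STOP probability fixed to 1
        for _ in range(num_iters):
            exp_emit, exp_trans, exp_init = np.zeros((Y, V)), np.zeros((Y, Y)), np.zeros(Y)
            for x in seqs:
                alpha, beta, p_x = forward_backward(x, o, t, t_init, t_stop)
                mu = alpha * beta / p_x                  # mu[s, y] = mu(y | x, s)
                for s, w in enumerate(x):
                    exp_emit[:, w] += mu[s]              # expected (y, w) counts
                exp_init += mu[0]
                for s in range(len(x) - 1):              # expected (y, y') counts
                    exp_trans += np.outer(alpha[s], o[:, x[s + 1]] * beta[s + 1]) * t / p_x
            # M-step: renormalize the expected counts.
            o = exp_emit / exp_emit.sum(axis=1, keepdims=True)
            t = exp_trans / exp_trans.sum(axis=1, keepdims=True)
            t_init = exp_init / exp_init.sum()
        return o, t, t_init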

SLIDE 29

EM for PCFG

Input: sequences x^{(1)}, ..., x^{(N)} ∈ V^T

1. Initialize rule probabilities q(α → β).
2. Repeat until convergence:

    q \leftarrow \arg\max_{\bar{q}} \sum_{i=1}^{N} \sum_{z \in \mathrm{GEN}(x^{(i)})} p_{Z|X}(z \mid x^{(i)}) \log p^{\bar{q}}_{XZ}(x^{(i)}, z)

   where

    p^{q}_{XZ}(x, z) = \prod_{\alpha \to \beta} q(\alpha \to \beta)^{\mathrm{count}(\alpha \to \beta \mid x, z)}
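Computing p_{Z|X} here requires the inside/outside algorithm. As a partial sketch (the inside pass only, for a grammar in Chomsky normal form, with a hypothetical toy grammar), the quantity p_X(x) = Σ_z p^q_XZ(x, z) needed by the updates on the next slides can be computed as follows:

    from collections import defaultdict

    def inside(x, unary, binary, root="S"):
        # Inside (CKY-style) chart for a PCFG in Chomsky normal form.
        # unary[a][w] = q(a -> w); binary[a][(b, c)] = q(a -> b c).
        # chart[(i, j)][a] = total probability that nonterminal a yields x[i..j].
        T = len(x)
        chart = defaultdict(lambda: defaultdict(float))
        for i, w in enumerate(x):                        # length-1 spans: unary rules
            for a, rules in unary.items():
                if w in rules:
                    chart[(i, i)][a] = rules[w]
        for length in range(2, T + 1):                   # longer spans: binary rules
            for i in range(T - length + 1):
                j = i + length - 1
                for a, rules in binary.items():
                    total = 0.0
                    for (b, c), q in rules.items():
                        for k in range(i, j):            # split point
                            total += q * chart[(i, k)][b] * chart[(k + 1, j)][c]
                    if total > 0:
                        chart[(i, j)][a] = total
        return chart, chart[(0, T - 1)][root]            # p_X(x) is the root's full-span value

    # Hypothetical toy grammar: S -> A B (0.6) | A A (0.4), A -> a (1.0), B -> b (1.0)
    binary = {"S": {("A", "B"): 0.6, ("A", "A"): 0.4}}
    unary = {"A": {"a": 1.0}, "B": {"b": 1.0}}
    _, p_x = inside(["a", "b"], unary, binary)
    print(p_x)                                           # 0.6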

SLIDE 30

Unary Rule Probability Updates

    q(a \to w) = \frac{\sum_{i=1}^{N} \sum_{z} p_{Z|X}(z \mid x^{(i)}) \, \mathrm{count}(a \to w \mid x^{(i)}, z)}
                      {\sum_{w'} \sum_{i=1}^{N} \sum_{z} p_{Z|X}(z \mid x^{(i)}) \, \mathrm{count}(a \to w' \mid x^{(i)}, z)}
               = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T} \mu(a \mid x^{(i)}, t) \, [x^{(i)}_t = w]}
                      {\sum_{w'} \sum_{i=1}^{N} \sum_{t=1}^{T} \mu(a \mid x^{(i)}, t) \, [x^{(i)}_t = w']}

where µ(a | x^{(i)}, t) is the conditional probability that a spans x^{(i)}_t, which can be calculated from the inside/outside probabilities:

    \mu(a \mid x^{(i)}, t) = \frac{\alpha(a, t, t) \times \beta(a, t, t)}{p_X(x^{(i)})}

SLIDE 31

Binary Rule Probability Updates

    q(a \to b\,c) = \frac{\sum_{i=1}^{N} \sum_{z} p_{Z|X}(z \mid x^{(i)}) \, \mathrm{count}(a \to b\,c \mid x^{(i)}, z)}
                         {\sum_{(b', c')} \sum_{i=1}^{N} \sum_{z} p_{Z|X}(z \mid x^{(i)}) \, \mathrm{count}(a \to b'\,c' \mid x^{(i)}, z)}
                  = \frac{\sum_{i=1}^{N} \sum_{1 \le t \le k < s \le T} \mu(a \to b\,c \mid x^{(i)}, t, k, s)}
                         {\sum_{(b', c')} \sum_{i=1}^{N} \sum_{1 \le t \le k < s \le T} \mu(a \to b'\,c' \mid x^{(i)}, t, k, s)}

where µ(a → b c | x^{(i)}, t, k, s) is the conditional probability that rule a → b c spans x^{(i)}_t ... x^{(i)}_s with split point k, which can be calculated from the inside/outside probabilities:

    \mu(a \to b\,c \mid x^{(i)}, t, k, s) = \frac{\beta(a, t, s) \times q(a \to b\,c) \times \alpha(b, t, k) \times \alpha(c, k+1, s)}{p_X(x^{(i)})}

SLIDE 32

Summary Points

◮ Latent-variable generative models

    p_{XZ}(x, z) = p_Z(z) \times p_{X|Z}(x \mid z)

◮ Learning objective

    \mathrm{LL}(p_{XZ}) = \mathbb{E}_{x \sim \mathrm{pop}_X}\Big[\log \sum_{z \in \mathcal{Z}} p_{XZ}(x, z)\Big]

◮ ELBO is a "variational" lower bound on the objective

    \mathrm{ELBO}(p_{XZ}, q_{Z|X}) \le \mathrm{LL}(p_{XZ}) \quad \forall q_{Z|X},    tight when q_{Z|X} = p_{Z|X}

◮ EM is an alternating maximization of the ELBO

    q_{Z|X} \leftarrow \arg\max_{\bar{q}_{Z|X}} \mathrm{ELBO}(p_{XZ}, \bar{q}_{Z|X}) = p_{Z|X}

    p_{XZ} \leftarrow \arg\max_{\bar{p}_{XZ}} \mathrm{ELBO}(\bar{p}_{XZ}, q_{Z|X}) = \arg\max_{\bar{p}_{XZ}} \; \mathbb{E}_{\substack{x \sim \mathrm{pop}_X \\ z \sim q_{Z|X}(\cdot|x)}}\big[\log \bar{p}_{XZ}(x, z)\big]