CSE 446: Machine Learning Lecture
(An example of) The Expectation-Maximization (EM) Algorithm
Instructor: Sham Kakade

1 An example: the problem of document clustering/topic modeling

Suppose we have $N$ documents $x_1, \ldots, x_N$. Each document is of length $T$, and we only keep track of the word count in each document. Let us say $\mathrm{Count}^{(n)}(w)$ is the number of times word $w$ appears in the $n$-th document. We are interested in a "soft" grouping of the documents along with estimating a model for document generation. Let us start with a simple model.

2 A generative model for documents

For a moment, put aside the document clustering problem. Let us instead posit a (probabilistic) procedure which underlies how our documents were generated.

2.1 "Bag of words" model: a (single) topic model

Random variables: a "hidden" (or latent) topic $i \in \{1, \ldots, k\}$ and $T$ word outcomes $w_1, w_2, \ldots, w_T$ which take on discrete values (these $T$ outcomes constitute a document).

Parameters: the mixing weights $\pi_i = \Pr(\text{topic} = i)$ and the topics $b_{wi} = \Pr(\text{word} = w \mid \text{topic} = i)$.

The generative model for a $T$-word document, where every document is about only one topic, is specified as follows:

1. Sample a topic $i$, which has probability $\pi_i$.
2. Generate $T$ words $w_1, w_2, \ldots, w_T$ independently; in particular, we choose word $w_t$ as the $t$-th word with probability $b_{w_t i}$.

Note that this generative model ignores the word order, so it is not a particularly faithful generative model. Due to the "graph" (i.e. the conditional independencies implied by the generative procedure), we can write the joint probability of the topic $i$ occurring with a document containing the words $w_1, w_2, \ldots, w_T$ as:

$$
\begin{aligned}
\Pr(\text{topic} = i \text{ and } w_1, w_2, \ldots, w_T) &= \Pr(\text{topic} = i)\,\Pr(w_1, w_2, \ldots, w_T \mid \text{topic} = i) \\
&= \Pr(\text{topic} = i)\,\Pr(w_1 \mid \text{topic} = i)\,\Pr(w_2 \mid \text{topic} = i)\cdots\Pr(w_T \mid \text{topic} = i) \\
&= \pi_i\, b_{w_1 i}\, b_{w_2 i} \cdots b_{w_T i}
\end{aligned}
$$

where the second-to-last step follows because the words are generated independently given the topic $i$.
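The following short Python sketch makes the generative procedure concrete. The toy vocabulary, the parameter arrays `pi` and `B`, and the helper names are all illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for k = 2 topics over a 4-word vocabulary.
vocab = ["ball", "game", "vote", "law"]
pi = np.array([0.6, 0.4])               # mixing weights: pi[i] = Pr(topic = i)
B = np.array([[0.4, 0.1],               # B[w, i] = Pr(word = w | topic = i)
              [0.4, 0.1],
              [0.1, 0.4],
              [0.1, 0.4]])

def sample_document(T):
    """Sample (topic, words) from the single-topic bag-of-words model."""
    i = rng.choice(len(pi), p=pi)                        # 1. sample a topic i with probability pi[i]
    words = rng.choice(len(vocab), size=T, p=B[:, i])    # 2. sample T words i.i.d. from B[:, i]
    return i, words

def joint_probability(i, words):
    """Pr(topic = i and w_1, ..., w_T) = pi_i * prod_t b_{w_t i}."""
    return pi[i] * np.prod(B[words, i])

topic, words = sample_document(T=5)
print([vocab[w] for w in words], joint_probability(topic, words))
```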

Inference

Suppose we were given a document with words $w_1, w_2, \ldots, w_T$. One inference question would be: what is the probability that the underlying topic is $i$? By Bayes rule, we have:

$$
\Pr(\text{topic} = i \mid w_1, w_2, \ldots, w_T) = \frac{1}{\Pr(w_1, w_2, \ldots, w_T)}\,\Pr(\text{topic} = i \text{ and } w_1, w_2, \ldots, w_T) = \frac{1}{Z}\, \pi_i\, b_{w_1 i}\, b_{w_2 i} \cdots b_{w_T i}
$$

where $Z$ is a number chosen so that the probabilities sum to $1$. Critically, note that $Z$ is not a function of $i$.

2.2 Maximum Likelihood estimation

Given the $N$ documents, we could estimate the parameters as follows:

$$
\hat{b}, \hat{\pi} = \arg\min_{b, \pi} \; -\log \Pr(x_1, \ldots, x_N \mid b, \pi)
$$

How can we do this efficiently?

3 The Expectation-Maximization algorithm (EM): by example

The EM algorithm is a general procedure to estimate the parameters in a model with latent (unobserved) factors. We present an example of the algorithm. EM improves the log likelihood function at every step and will converge; however, it may not converge to the global optimum. Think of it as a more general (and probabilistic) adaptation of the $K$-means algorithm.

3.1 The algorithm: an example for the topic modeling case

The EM algorithm is an alternating minimization algorithm. We start at some initialization and then alternate between the E and M steps as follows (a code sketch of both steps appears at the end of these notes).

Initialization: Start with some guess $\hat{b}$ and $\hat{\pi}$ (where the guess is not "symmetric").

The E step: Estimate the posterior probabilities, i.e. the soft assignments, of each document:

$$
\widehat{\Pr}(\text{topic } i \mid x_n) = \frac{1}{\hat{Z}}\, \hat{\pi}_i\, \hat{b}_{w_1 i}\, \hat{b}_{w_2 i} \cdots \hat{b}_{w_T i}
$$

The M step: Note that $\mathrm{Count}^{(n)}(w)/T$ is the empirical frequency of word $w$ in the $n$-th document. Given the posterior probabilities (which we can view as "soft" assignments), we go back and re-estimate the topic probabilities and the mixing weights as follows:

$$
\hat{b}_{wi} = \frac{\sum_{n=1}^{N} \widehat{\Pr}(\text{topic } i \mid x_n)\, \mathrm{Count}^{(n)}(w)/T}{\sum_{n=1}^{N} \widehat{\Pr}(\text{topic } i \mid x_n)}
$$

and

$$
\hat{\pi}_i = \frac{1}{N} \sum_{n=1}^{N} \widehat{\Pr}(\text{topic } i \mid x_n)
$$

Now go back to the E step.

3.2 (Local) convergence

For a general class of latent variable models (models which have unobserved random variables), we can say the following about EM:

• If the algorithm has not converged, then, after every M step, the negative log likelihood function decreases in value.

• The algorithm will converge in the limit (to some point, under mild assumptions). Unfortunately, this point may not be the global minimum. This is related to the fact that the log likelihood objective function (for these latent variable models) is typically not convex.
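Here is a minimal NumPy sketch of the E and M steps described in Section 3.1, assuming the documents are given as a word-count matrix and all documents have the same length $T$. The names (`e_step`, `m_step`, `counts`, and so on) are hypothetical; the updates follow the formulas above, with the E step computed in log space for numerical stability, and $B$ is assumed to be strictly positive.

```python
import numpy as np

def e_step(counts, pi, B):
    """Soft assignments: Pr_hat(topic i | x_n) proportional to pi_i * prod_w B[w, i]**Count^(n)(w).
    counts: (N, V) word-count matrix; pi: (k,); B: (V, k) with columns summing to 1."""
    log_post = np.log(pi)[None, :] + counts @ np.log(B)     # (N, k): log pi_i + sum_w Count^(n)(w) log B[w, i]
    log_post -= log_post.max(axis=1, keepdims=True)         # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)           # divide by Z so each row sums to 1

def m_step(counts, post):
    """Re-estimate B and pi from the soft assignments, following the M-step formulas above."""
    T = counts.sum(axis=1, keepdims=True)                   # document lengths
    freq = counts / T                                       # Count^(n)(w) / T
    B = (post.T @ freq).T                                   # numerator: sum_n post[n, i] * freq[n, w]
    B /= post.sum(axis=0)[None, :]                          # denominator: sum_n post[n, i]
    pi = post.mean(axis=0)                                  # pi_i = (1/N) sum_n post[n, i]
    return pi, B

def neg_log_likelihood(counts, pi, B):
    """-log Pr(x_1, ..., x_N | b, pi) under the single-topic bag-of-words model."""
    log_joint = np.log(pi)[None, :] + counts @ np.log(B)    # (N, k)
    m = log_joint.max(axis=1, keepdims=True)
    return -np.sum(m.squeeze(1) + np.log(np.exp(log_joint - m).sum(axis=1)))
```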

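As a sanity check on the convergence claim in Section 3.2, one can continue with the functions from the sketch above, run the alternation on synthetic counts, and watch the negative log likelihood, which should not increase from one iteration to the next. The toy data, initialization, and constants below are made up for illustration.

```python
# Toy run: 100 documents of length T = 20 over a 6-word vocabulary, k = 2 topics.
rng = np.random.default_rng(1)
counts = rng.multinomial(20, [1.0 / 6] * 6, size=100).astype(float)

pi = np.array([0.5, 0.5])
B = rng.random((6, 2)) + 0.1           # random positive guess for B breaks the symmetry
B /= B.sum(axis=0, keepdims=True)      # normalize columns to valid distributions

for step in range(20):
    post = e_step(counts, pi, B)       # E step: soft assignments
    pi, B = m_step(counts, post)       # M step: re-estimate parameters
    print(step, neg_log_likelihood(counts, pi, B))  # should be non-increasing
```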