CSC 2515 Lecture 7: Expectation-Maximization


  1. CSC 2515 Lecture 7: Expectation-Maximization Marzyeh Ghassemi Material and slides developed by Roger Grosse, University of Toronto UofT CSC 2515: 07-EM 1 / 53

  2. Motivating Examples Some examples of situations where you’d use unsupervised learning: You want to understand how a scientific field has changed over time. You take a large database of papers and model how the distribution of topics changes from year to year. But what are the topics? You’re a biologist studying animal behavior, so you want to infer a high-level description of their behavior from video. You don’t know the set of behaviors ahead of time. You want to reduce your energy consumption, so you take a time series of your energy consumption over time and try to break it down into separate components (refrigerator, washing machine, etc.). Common theme: you have some data, and you want to infer the causal structure underlying the data. This structure is latent, which means it’s never observed. UofT CSC 2515: 07-EM 2 / 53

  3. Overview In the last lecture, we looked at density modeling where all the random variables were fully observed. The more interesting case is when some of the variables are latent, or never observed. These are called latent variable models. Today, we’ll see how to cluster data by fitting a latent variable model. This will require a new algorithm called Expectation-Maximization (E-M). UofT CSC 2515: 07-EM 3 / 53

  4. Recall: K-means Initialization: randomly initialize the cluster centers. The algorithm iteratively alternates between two steps: Assignment step: assign each data point to the closest cluster. Refitting step: move each cluster center to the center of gravity of the data assigned to it. (Figure: cluster assignments and refitted means.) UofT CSC 2515: 07-EM 4 / 53

  5. Recall: K-Means K-means objective: find cluster centers $\{\mathbf{m}_k\}$ and assignments $\{r^{(i)}\}$ to minimize the sum of squared distances of the data points $\{\mathbf{x}^{(i)}\}$ to their assigned cluster centers: $$\min_{\{\mathbf{m}\},\{\mathbf{r}\}} J(\{\mathbf{m}\},\{\mathbf{r}\}) = \min_{\{\mathbf{m}\},\{\mathbf{r}\}} \sum_{i=1}^{N} \sum_{k=1}^{K} r_k^{(i)} \, \|\mathbf{m}_k - \mathbf{x}^{(i)}\|^2 \quad \text{s.t.} \quad \sum_k r_k^{(i)} = 1 \;\forall i, \quad r_k^{(i)} \in \{0, 1\} \;\forall k, i$$ where $r_k^{(i)} = 1$ means that $\mathbf{x}^{(i)}$ is assigned to cluster $k$ (with center $\mathbf{m}_k$). The assignment and refitting steps were each doing coordinate descent on this objective. This means the objective improves in each iteration, so the algorithm can’t diverge, get stuck in a cycle, etc. UofT CSC 2515: 07-EM 5 / 53
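As a concrete check on the notation, here is a minimal NumPy sketch (an illustration of mine, not code from the lecture) that evaluates this objective; the names are assumptions: `X` is an N x D data matrix, `m` a K x D matrix of centers, and `r` an N x K one-hot assignment matrix.

```python
import numpy as np

def kmeans_objective(X, m, r):
    """Sum over i and k of r_k^(i) * ||m_k - x^(i)||^2."""
    # pairwise squared Euclidean distances between points and centers, shape N x K
    sq_dists = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=-1)
    return float((r * sq_dists).sum())
```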

  6. Recall: K-Means Initialization: set the K means $\{\mathbf{m}_k\}$ to random values. Repeat until convergence (until the assignments do not change): Assignment (hard assignments): $\hat{k}^{(i)} = \arg\min_k d(\mathbf{m}_k, \mathbf{x}^{(i)})$, i.e. $r_k^{(i)} = 1$ iff $k = \hat{k}^{(i)}$. Assignment (soft assignments): $r_k^{(i)} = \dfrac{\exp[-\beta\, d(\mathbf{m}_k, \mathbf{x}^{(i)})]}{\sum_j \exp[-\beta\, d(\mathbf{m}_j, \mathbf{x}^{(i)})]}$. Refitting: $\mathbf{m}_k = \dfrac{\sum_i r_k^{(i)} \mathbf{x}^{(i)}}{\sum_i r_k^{(i)}}$. UofT CSC 2515: 07-EM 6 / 53
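These updates translate almost directly into NumPy. The sketch below is a rough illustration under stated assumptions (squared Euclidean distance for $d$, centers initialized at random data points, empty clusters not handled), not the course's reference implementation:

```python
import numpy as np

def kmeans(X, K, n_iters=50, soft=False, beta=1.0, seed=0):
    """Alternate assignment and refitting steps on an N x D data matrix X."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=K, replace=False)]  # initialize centers at random data points
    for _ in range(n_iters):
        # squared Euclidean distances d(m_k, x^(i)), shape N x K
        d = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=-1)
        if soft:
            # soft assignments: responsibilities proportional to exp(-beta * d)
            r = np.exp(-beta * d)
            r /= r.sum(axis=1, keepdims=True)
        else:
            # hard assignments: one-hot vector for the closest center
            r = np.zeros_like(d)
            r[np.arange(len(X)), d.argmin(axis=1)] = 1.0
        # refitting: each center becomes the (weighted) mean of its assigned points
        # (this sketch does not guard against clusters that receive no points)
        m = (r.T @ X) / r.sum(axis=0)[:, None]
    return m, r
```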

  7. A Generative View of Clustering What if the data don’t look like spherical blobs? (Examples: elongated clusters, discrete data.) This lecture: formulating clustering as a probabilistic model. Specify assumptions about how the observations relate to latent variables, then use an algorithm called E-M to (approximately) maximize the likelihood of the observations. This lets us generalize clustering to non-spherical clusters or to non-Gaussian observation models (as you do in Homework 4). UofT CSC 2515: 07-EM 7 / 53

  8. Generative Models Recap Recall generative classifiers: $p(\mathbf{x}, t) = p(\mathbf{x} \mid t)\, p(t)$. We fit $p(t)$ and $p(\mathbf{x} \mid t)$ using labeled data. If $t$ is never observed, we call it a latent variable, or hidden variable, and generally denote it with $z$ instead. The things we can observe (i.e. $\mathbf{x}$) are called observables. By marginalizing out $z$, we get a density over the observables: $$p(\mathbf{x}) = \sum_z p(\mathbf{x}, z) = \sum_z p(\mathbf{x} \mid z)\, p(z)$$ This is called a latent variable model. If $p(z)$ is a categorical distribution, this is a mixture model, and different values of $z$ correspond to different components. UofT CSC 2515: 07-EM 8 / 53

  9. Gaussian Mixture Model (GMM) Most common mixture model: the Gaussian mixture model (GMM). A GMM represents a distribution as $$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$ with mixing coefficients $\pi_k$, where $\sum_{k=1}^{K} \pi_k = 1$ and $\pi_k \ge 0 \;\forall k$. This defines a density over $\mathbf{x}$, so we can fit the parameters using maximum likelihood. We’re trying to match the data density of $\mathbf{x}$ as closely as possible. This is a hard optimization problem (and the focus of this lecture). GMMs are universal approximators of densities (if you have enough components). Even diagonal GMMs are universal approximators. UofT CSC 2515: 07-EM 9 / 53

  10. Gaussian Mixture Model (GMM) Can also write the model as a generative process: for $i = 1, \ldots, N$: $$z^{(i)} \sim \mathrm{Categorical}(\boldsymbol{\pi}), \qquad \mathbf{x}^{(i)} \mid z^{(i)} \sim \mathcal{N}(\boldsymbol{\mu}_{z^{(i)}}, \boldsymbol{\Sigma}_{z^{(i)}})$$ UofT CSC 2515: 07-EM 10 / 53
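To make the generative process concrete, here is a small sampling sketch of mine; the two-component parameter values are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-component GMM in 2D (values chosen only for illustration)
pi = np.array([0.6, 0.4])                                # mixing coefficients
mu = np.array([[0.0, 0.0], [3.0, 3.0]])                  # component means
Sigma = np.stack([np.eye(2),
                  np.array([[2.0, 0.8], [0.8, 1.0]])])   # component covariances

N = 500
z = rng.choice(len(pi), size=N, p=pi)                    # z^(i) ~ Categorical(pi)
x = np.stack([rng.multivariate_normal(mu[k], Sigma[k])   # x^(i) | z^(i) ~ N(mu_z, Sigma_z)
              for k in z])
```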

  11. Visualizing a Mixture of Gaussians – 1D Gaussians (Figures: fitting a single Gaussian to the data, then fitting a GMM with K = 2 in this example.) [Slide credit: K. Kutulakos] UofT CSC 2515: 07-EM 11 / 53

  12. Visualizing a Mixture of Gaussians – 2D Gaussians UofT CSC 2515: 07-EM 12 / 53

  13. Questions? UofT CSC 2515: 07-EM 13 / 53

  14. Fitting GMMs: Maximum Likelihood Some shorthand notation: let $\theta = \{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}$ denote the full set of model parameters, and let $\mathbf{X} = \{\mathbf{x}^{(i)}\}$ and $\mathbf{Z} = \{z^{(i)}\}$. Maximum likelihood objective: $$\log p(\mathbf{X}; \theta) = \sum_{i=1}^{N} \log \left[ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}^{(i)}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]$$ In general, there is no closed-form solution. The model is not identifiable: the solution is invariant to permutations of the components. Challenges in optimizing this using gradient descent? It is non-convex (due to permutation symmetry, just like neural nets); we need to enforce the non-negativity constraint on $\pi_k$ and the PSD constraint on $\boldsymbol{\Sigma}_k$; and derivatives w.r.t. $\boldsymbol{\Sigma}_k$ are expensive/complicated. We need a different approach! UofT CSC 2515: 07-EM 14 / 53
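Although this objective is hard to maximize, it is easy to evaluate for given parameters. A minimal sketch of mine, using SciPy and the log-sum-exp trick for numerical stability, is below:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mu, Sigma):
    """log p(X; theta) = sum_i log sum_k pi_k N(x^(i); mu_k, Sigma_k)."""
    N, K = len(X), len(pi)
    log_terms = np.empty((N, K))
    for k in range(K):
        # log[ pi_k * N(x^(i); mu_k, Sigma_k) ] evaluated at every data point
        log_terms[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
    # log-sum-exp over components, then sum over data points
    return float(logsumexp(log_terms, axis=1).sum())
```

Maximizing this directly runs into the issues listed above, which is what motivates the different approach (E-M) in the rest of the lecture.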
