SLIDE 1

Stat 451 Lecture Notes 04: EM Algorithm

Ryan Martin UIC www.math.uic.edu/~rgmartin

Based on Ch. 4 in Givens & Hoeting and Ch. 13 in Lange. Updated: March 9, 2016.

SLIDE 2

Outline

1. Problem and motivation
2. Definition of the EM algorithm
3. Properties of EM
4. Examples
5. Estimating standard errors
6. Different versions of EM
7. Summary

SLIDE 3

Notion of “missing data”

Let X denote the observable data and θ the parameter to be estimated. The EM algorithm is particularly suited for problems in which there is a notion of "missing data". The missing data can be actual data that is missing, or some "imaginary" data that exists only in our minds (and is therefore necessarily missing). The point is that IF the missing data were available, then finding the MLE for θ would be relatively straightforward.

SLIDE 4

Notation

Again, X is the observable data. Let Y denote the complete data.3 Usually we think of Y as being composed of observable data X and missing data Z, that is, Y = (X, Z). Perhaps more generally, we think of the observable data X as a sort of projection of the complete data, i.e., “X = M(Y )”. This suggests a notion of marginalization... The basic idea behind the EM algorithm is to iteratively impute the missing data.

3 This is the notation used in G&H which, as they admit, is not standard in the EM literature.

SLIDE 5

Example – mixture model

Here is an example where the "missing data" is not real. Suppose X = (X1, . . . , Xn) consists of iid samples from the mixture α N(µ1, 1) + (1 − α) N(µ2, 1), where θ = (α, µ1, µ2) is to be estimated. IF we knew which of the two groups each Xi was from, then it would be straightforward to get the MLE for θ, i.e., just calculate the group means. The missing part Z = (Z1, . . . , Zn) consists of the group labels, i.e.,

Zi = 1 if Xi ∼ N(µ1, 1),  Zi = 2 if Xi ∼ N(µ2, 1),  i = 1, . . . , n.

SLIDE 6

Outline

1. Problem and motivation
2. Definition of the EM algorithm
3. Properties of EM
4. Examples
5. Estimating standard errors
6. Different versions of EM
7. Summary

SLIDE 7

More notation

Complete data Y = (X, Z) splits to the observed data X and missing data Z. The complete data likelihood θ → LY (θ) is the joint distribution of (X, Z). The observed likelihood θ → LX(θ) is obtained by marginalizing the joint distribution of (X, Z). The conditional distribution of Z, given X, is an essential piece: θ → LZ|X(θ). Though the same notation “L” is used for all the likelihoods, it should be clear that these are all distinct functions of θ.

SLIDE 8

Example – mixture model (cont.)

Complete data Y = (Y1, . . . , Yn), where each Yi consists of the observed data Xi together with the missing group label Zi. The observed data likelihood is

LX(θ) = ∏_{i=1}^n {α N(Xi | µ1, 1) + (1 − α) N(Xi | µ2, 1)};

not a nice function: the sum is inside the product. The complete data likelihood is much nicer (write it out!). The conditional distribution of Z, given X, is determined by the conditional probabilities

Pθ(Zi = 1 | Xi) = α N(Xi | µ1, 1) / {α N(Xi | µ1, 1) + (1 − α) N(Xi | µ2, 1)}.
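Although LX(θ) is awkward to maximize directly, it is easy to evaluate. Here is a minimal R sketch (the function name is my own, not from the notes) that will be handy later for checking the ascent property:

# observed-data log-likelihood for the two-component normal mixture
# theta = c(alpha, mu1, mu2), unit variances
loglik_obs <- function(theta, x) {
  sum(log(theta[1] * dnorm(x, theta[2], 1) + (1 - theta[1]) * dnorm(x, theta[3], 1)))
}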

SLIDE 9

EM formulation

The EM works with some new function: Q(θ′ | θ) = Eθ{log LY (θ′) | X}, the conditional expectation of the complete data log-likelihood, at θ′, given X and the particular value θ. Implicit in this expression is that, given X, the only "random" part of Y is the missing data Z. So, in this expression, the expectation is actually with respect to Z, given X, i.e.,

Q(θ′ | θ) = ∫ log{L(X,z)(θ′)} Lz|X(θ) dz.

SLIDE 10

EM formulation (cont.)

The EM algorithm alternates between computing Q(θ′ | θ), which involves an expectation, and maximizing it. Start with a fixed θ(0). At iteration t ≥ 1 do:
E-step. Evaluate Qt(θ) := Q(θ | θ(t−1)).
M-step. Update θ(t) = arg maxθ Qt(θ).
Repeat these steps until practical convergence is reached.
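To make the recipe concrete, here is a hedged R sketch of these two steps for the two-component normal mixture from earlier (the function name, starting values, and stopping rule are my own choices):

# EM for the mixture alpha*N(mu1, 1) + (1 - alpha)*N(mu2, 1); theta = (alpha, mu1, mu2)
em_mixture <- function(x, theta0 = c(0.5, min(x), max(x)), tol = 1e-8, maxit = 1000) {
  alpha <- theta0[1]; mu1 <- theta0[2]; mu2 <- theta0[3]
  for (t in seq_len(maxit)) {
    # E-step: responsibilities gamma_i = P_theta(Z_i = 1 | X_i) at the current guess
    num <- alpha * dnorm(x, mu1, 1)
    gam <- num / (num + (1 - alpha) * dnorm(x, mu2, 1))
    # M-step: closed-form maximizers of Q (weighted proportions and means)
    alpha_new <- mean(gam)
    mu1_new   <- sum(gam * x) / sum(gam)
    mu2_new   <- sum((1 - gam) * x) / sum(1 - gam)
    done  <- max(abs(c(alpha_new - alpha, mu1_new - mu1, mu2_new - mu2))) < tol
    alpha <- alpha_new; mu1 <- mu1_new; mu2 <- mu2_new
    if (done) break
  }
  c(alpha = alpha, mu1 = mu1, mu2 = mu2)
}

Each pass through the loop increases the observed-data log-likelihood; the loglik_obs function sketched above can be used to verify this on simulated data.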

SLIDE 11

A super-simple example

Goal is to maximize the observed data likelihood. But EM iteratively maximizes some other function, so it's not clear that we are doing something reasonable. Before we get to theory, it helps to consider a simple example to see that EM is doing the right thing. Y = (X, Z), where X, Z are iid N(θ, 1), but Z is missing. Observed data MLE is θ̂ = X. The Q function in the E-step is

Q(θ | θ(t)) = −(1/2){(θ − X)² + (θ − θ(t))²}.

Find the M-step update—what should happen as t → ∞?
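One way to see what happens, without giving the algebra away, is to carry out the M-step numerically and watch the iterates; this little R sketch (my own, not from the notes) uses a generic optimizer:

x     <- 1.7   # the single observed value, so the observed-data MLE is x
theta <- 7     # arbitrary starting point theta^(0)
Q <- function(theta_new, theta_old) {
  -0.5 * ((theta_new - x)^2 + (theta_new - theta_old)^2)
}
for (t in 1:50) {
  # M-step: maximize Q(. | theta^(t-1)) over a wide interval
  theta <- optimize(Q, interval = c(-100, 100), theta_old = theta, maximum = TRUE)$maximum
}
theta          # the iterates settle at x, the observed-data MLE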

SLIDE 12

Outline

1. Problem and motivation
2. Definition of the EM algorithm
3. Properties of EM
4. Examples
5. Estimating standard errors
6. Different versions of EM
7. Summary

SLIDE 13

Ascent property

The claimed ascent property of EM is as follows: LX(θ(t+1)) ≥ LX(θ(t)), ∀ t. To prove this, we first need a simple identity involving joint, conditional, and marginal densities: log fV(v) = log fU,V(u, v) − log fU|V(u | v). The next general fact is the non-negativity of the relative entropy, or Kullback–Leibler divergence:

∫ log{p(x)/q(x)} p(x) dx ≥ 0, with equality iff p = q.

This follows from Jensen's inequality, since y → − log y is convex.

SLIDE 14

Ascent property (cont.)

Using the density identity, we can write log LX(θ) = log LY (θ) − log LZ|X(θ). Taking expectations wrt Z, given X and θ(t), gives log LX(θ) = Q(θ | θ(t)) − H(θ | θ(t)), where H(θ | θ(t)) = Eθ(t){log LZ|X(θ) | X}. It follows from the non-negativity of KL that H(θ(t) | θ(t)) − H(θ | θ(t)) ≥ 0, ∀ θ.

SLIDE 15

Ascent property (cont.)

Key observation: picking θ(t+1) such that Q(θ(t+1) | θ(t)) ≥ Q(θ(t) | θ(t)) will not decrease either term in the expression for log LX(·). So maximizing Q(· | θ(t)) in the M-step will result in updates with the desired ascent property: LX(θ(t+1)) ≥ LX(θ(t)), ∀ t. This does not imply that the EM updates will necessarily converge to the MLE, just that they are surely moving in the right direction.

SLIDE 16

Further properties

One can express the EM updates through an abstract mapping Ψ, i.e., θ(t+1) = Ψ(θ(t)). If EM converges to θ̂, then θ̂ must be a fixed point of Ψ. A Taylor approximation of Ψ(θ(t)) near θ̂ gives

θ(t+1) − θ̂ = Ψ(θ(t)) − Ψ(θ̂) ≈ Ψ′(θ̂)(θ(t) − θ̂).

If the parameter is one-dimensional, then the (linear) rate of convergence can be seen to be Ψ′(θ̂), provided that θ̂ is a (local) maximum.

SLIDE 17

EM for exponential family models

Recall that a model/joint distribution Pθ for data Y is a natural exponential family if the log-likelihood is of the form log LY (θ) = const + log a(θ) + θ⊤s(y), where s(y) is the “sufficient statistic.” For problems where the complete data Y is modeled as an exponential family, EM takes a relatively simple form. This is an important case since many examples involve exponential families.

SLIDE 18

EM for exponential family models (cont.)

For exponential families, the Q function looks like

Q(θ | θ(t)) = const + log a(θ) + ∫ θ⊤s(y) Lz|X(θ(t)) dz.

To maximize this, take the derivative wrt θ and set it to zero:

−a′(θ)/a(θ) = ∫ s(y) Lz|X(θ(t)) dz.

From Stat 411, you know that the left-hand side is Eθ{s(Y)}. Let s(t) be the right-hand side. The M-step updates θ(t) → θ(t+1) by solving the equation Eθ{s(Y)} = s(t).

SLIDE 19

EM for exponential family models (cont.)

E-step. Compute s(t) based on the current guess θ(t).
M-step. Update the guess to θ(t+1) by solving the equation Eθ{s(Y)} = s(t).

SLIDE 20

Outline

1. Problem and motivation
2. Definition of the EM algorithm
3. Properties of EM
4. Examples
5. Estimating standard errors
6. Different versions of EM
7. Summary

SLIDE 21

Example 1 – censored exponential model

Complete data Y1, . . . , Yn iid ∼ Exp(θ), with θ the rate. The complete data log-likelihood is

log LY(θ) = n log θ − θ Σ_{i=1}^n Yi,  so s(Y) = Σ_{i=1}^n Yi is the sufficient statistic.

Suppose some observations are right-censored, i.e., only a lower bound is observed. Write the observed data as pairs (Xi, δi), where Xi = min(Yi, ci) (the ci's are non-random) and δi = I{Xi = Yi}. The missing data Z consists of the actual event times for the censored observations.

SLIDE 22

Example 1 – censored exponential model (cont.)

For EM, we first need to compute s(t)... Only the censored cases are of concern. If an observation Yi is right-censored at ci, then we know that ci is a lower bound. Recall that the exponential distribution has the memoryless property. So, the E-step of the EM requires

s(t) = Σ_{i=1}^n {δi Xi + (1 − δi) Eθ(t)(Yi | censored)}
     = Σ_{i=1}^n {δi Xi + (1 − δi)(Xi + 1/θ(t))}
     = n X̄ + (1/θ(t)) Σ_{i=1}^n (1 − δi),

where X̄ is the sample mean of the Xi's.

SLIDE 23

Example 1 – censored exponential model (cont.)

Clearly, Eθ{s(Y)} = n/θ. So, the M-step requires that we solve for θ in

n X̄ + (1/θ(t)) Σ_{i=1}^n (1 − δi) = n/θ.

In particular, the EM update in this case is

θ(t+1) = { X̄ + (1/θ(t)) · (1/n) Σ_{i=1}^n (1 − δi) }^(−1).

Iterate this update till convergence.
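A hedged R sketch of this update (the function and variable names are my own); the simulated data mimic the settings on the next slide, though the exact draws there cannot be reproduced here:

# EM for iid Exp(theta) data with right-censoring at known constants c_i
# x = min(y, c), delta = 1 if uncensored, 0 if censored
em_cens_exp <- function(x, delta, theta0 = 1, tol = 1e-10, maxit = 1000) {
  theta <- theta0
  for (t in seq_len(maxit)) {
    theta_new <- 1 / (mean(x) + mean(1 - delta) / theta)  # the update derived above
    if (abs(theta_new - theta) < tol) break
    theta <- theta_new
  }
  theta_new
}

# in the spirit of the next slide: n = 30, theta = 3, censoring at 0.632, start at 7
set.seed(451)
y <- rexp(30, rate = 3)
x <- pmin(y, 0.632); delta <- as.numeric(y <= 0.632)
em_cens_exp(x, delta, theta0 = 7)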

SLIDE 24

Example 1 – censored exponential model (cont.)

Simulated data: n = 30, θ = 3, censored at 0.632. Picture below shows the observed data likelihood and the EM steps starting at θ(0) = 7.

[Figure: LX(θ) plotted against θ, with the EM steps from θ(0) = 7 marked.]

SLIDE 25

Example 2 – probit regression

Recall the probit regression model: Xi ∼ Ber(Φ(ui⊤θ)). We can use the EM algorithm to easily get the MLE of θ. Write the complete data as Y = (Y1, . . . , Yn), where Yi ∼ N(ui⊤θ, 1), and

Xi = 1 if Yi > 0,  Xi = 0 if Yi ≤ 0.

Exercise: check that Xi defined in this way has the same distribution as that given by the probit model... Basically, we observe the sign of the complete data, but the actual values are missing.

SLIDE 26

Example 2 – probit regression (cont.)

The complete-data problem is easy, just a normal linear regression with known variance (an exponential family), with s(Y) = U⊤Y, where U is the design matrix. The observed data tell us the sign of Yi, so the conditional expectation is that of a truncated normal distribution:4

Eθ(t)(Yi | Xi) = µi(t) + wi ϕ(µi(t)) / Φ(wi µi(t)),  where wi = 2Xi − 1 and µi(t) = ui⊤θ(t);

write vi(t) for the second term on the right. This completes the E-step; the M-step requires solving

U⊤Uθ = U⊤Uθ(t) + U⊤v(t),

where the left-hand side is Eθ{s(Y)} and the right-hand side is s(t).
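A hedged R sketch of these two steps (the function name is my own; the update line solves the M-step equation above):

# EM for probit regression: X_i ~ Ber(Phi(u_i' theta)); complete data Y_i ~ N(u_i' theta, 1)
# U is the n x p design matrix (include a column of 1's for an intercept)
em_probit <- function(X, U, theta0 = rep(0, ncol(U)), tol = 1e-8, maxit = 500) {
  theta <- theta0
  UtU <- crossprod(U)                      # U'U
  for (t in seq_len(maxit)) {
    mu <- drop(U %*% theta)                # mu_i^(t) = u_i' theta^(t)
    w  <- 2 * X - 1
    v  <- w * dnorm(mu) / pnorm(w * mu)    # truncated-normal correction v_i^(t)
    # M-step: solve U'U theta = U'U theta^(t) + U'v
    theta_new <- drop(solve(UtU, UtU %*% theta + crossprod(U, v)))
    if (max(abs(theta_new - theta)) < tol) break
    theta <- theta_new
  }
  theta_new
}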

4 http://en.wikipedia.org/wiki/Truncated_normal_distribution

SLIDE 27

Example 2 – probit regression (cont.)

Simulated data: n = 50; intercept θ1 = 0, slope θ2 = 1; predictor values iid N(0, 4²). The plot below shows the data and the EM-fitted probit regression line.

[Figure: binary responses plotted against the predictor, with the EM-fitted probit regression line.]

SLIDE 28

Example 3 – robust regression

Consider the linear model yi = xi⊤β + σεi. Least-squares estimators, based on normal errors, are sensitive (not robust) to "outlier" observations. Remedy: fit a model with heavier-than-normal tails. One approach to robust regression is to model ε with a Student-t distribution with small degrees of freedom. This model can be fit with standard optimization tools, but a clever approach makes application of EM quite simple. Key observation: Student-t is a scale mixture of normals,

f(ε) = ∫ N(ε | 0, ν/z) ChiSq(z | ν) dz.

SLIDE 29

Example 3 – robust regression (cont.)

For simplicity, we assume that ν = df is known. Think of the Zi values implicitly attached to the Student-t error distribution for εi as "missing data." If we knew Z = (Z1, . . . , Zn), then the problem would just be a simple modification of the basic normal model... For θ = (β, σ), the complete data log-likelihood is

log LY(θ) = Σ_{i=1}^n log N(yi − xi⊤β | 0, νσ²/Zi).

E-step requires expectation wrt conditional distribution of Z, given data and a guess θ(t)...

SLIDE 30

Example 3 – robust regression (cont.)

It can be shown (HW?) that the conditional distribution of Zi, given the observed data and a guess θ(t), is

{ (1/ν) ((yi − xi⊤β(t))/σ(t))² + 1 }^(−1) × ChiSq(ν + 1),  i = 1, . . . , n.

Then the Q function in the E-step is obtained by plugging in the conditional expectation of Zi, i.e.,

Q(θ | θ(t)) = −(n/2) log σ² − (1/(2σ²)) Σ_{i=1}^n wi(t) (yi − xi⊤β)²,

where

wi(t) = (ν + 1) { ((yi − xi⊤β(t))/σ(t))² + ν }^(−1),  i = 1, . . . , n.

The M-step is equivalent to a weighted least squares problem...
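A minimal R sketch of the resulting iteratively reweighted least squares scheme (function names are my own; lm with a weights argument would work equally well for the M-step):

# EM (iteratively reweighted least squares) for regression with Student-t errors, known df = nu
em_t_regression <- function(y, X, nu = 4, tol = 1e-8, maxit = 500) {
  beta  <- drop(solve(crossprod(X), crossprod(X, y)))   # start from ordinary least squares
  sigma <- sqrt(mean((y - X %*% beta)^2))
  for (t in seq_len(maxit)) {
    r <- drop(y - X %*% beta) / sigma
    w <- (nu + 1) / (r^2 + nu)                           # E-step: weights w_i^(t)
    # M-step: weighted least squares for beta, then the weighted residual variance for sigma^2
    beta_new  <- drop(solve(t(X) %*% (w * X), t(X) %*% (w * y)))
    sigma_new <- sqrt(mean(w * drop(y - X %*% beta_new)^2))
    delta <- max(abs(c(beta_new - beta, sigma_new - sigma)))
    beta <- beta_new; sigma <- sigma_new
    if (delta < tol) break
  }
  list(beta = beta, sigma = sigma)
}

# e.g., for the Belgian phone data on the next slide, something like
# em_t_regression(MASS::phones$calls, cbind(1, MASS::phones$year), nu = 4)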

SLIDE 31

Example 3 – robust regression (cont.)

Belgian phone call data – in R’s MASS library. Compare fit of LS versus Student-t via EM (df=4).

[Figure: Calls (in millions) plotted against Year, with the EM (Student-t) and LS fitted lines.]

SLIDE 32

Outline

1. Problem and motivation
2. Definition of the EM algorithm
3. Properties of EM
4. Examples
5. Estimating standard errors
6. Different versions of EM
7. Summary

SLIDE 33

Challenge

The EM algorithm is designed to return the MLE θ̂. It does not, however, say anything about standard errors. Recall that if we run, say, "BFGS" via the function optim in R, then we can request that the Hessian at the MLE be returned, which can be used to approximate the standard errors of θ̂. The challenge is that the EM doesn't work directly with the observed data log-likelihood.

Question: how to compute standard errors within EM?
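For reference, the optim-based calculation alluded to above looks like this small hedged sketch (a toy uncensored exponential model, my own choice, just to show the mechanics):

# standard errors from the Hessian of the negative log-likelihood returned by optim
set.seed(1)
y <- rexp(30, rate = 3)
negloglik <- function(theta) -sum(dexp(y, rate = theta, log = TRUE))
fit <- optim(par = 1, fn = negloglik, method = "BFGS", hessian = TRUE)
fit$par                          # MLE
sqrt(diag(solve(fit$hessian)))   # approximate standard error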

SLIDE 34

Analytical calculation

Of course, if we can write down a formula for the negative second derivative of log LX(θ) at θ = θ̂, or for the Fisher information I(θ̂), then we have an estimator of the standard errors. For the probit regression model (Example 2 above), we have a formula for the Fisher information:

In(θ) = Σ_{i=1}^n [ ϕ(ui⊤θ)² / ( Φ(ui⊤θ){1 − Φ(ui⊤θ)} ) ] ui ui⊤.

So, we can just plug our MLE θ̂ from the EM into this formula and (numerically) invert the matrix. Can also numerically differentiate − log LX(θ)...
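In R, the plug-in calculation might look like the following sketch (assumes a design matrix U and an EM estimate theta_hat, e.g., from the em_probit sketch earlier):

# plug-in Fisher information for probit regression, evaluated at the EM solution
probit_se <- function(theta_hat, U) {
  eta <- drop(U %*% theta_hat)
  h   <- dnorm(eta)^2 / (pnorm(eta) * (1 - pnorm(eta)))   # per-observation weights
  info <- t(U) %*% (h * U)                                 # I_n(theta_hat) = sum_i h_i u_i u_i'
  sqrt(diag(solve(info)))                                  # approximate standard errors
}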

SLIDE 35

Bootstrap

We will talk in more detail about the bootstrap later, but here is a little taste of the main idea. We want to estimate the variance of θ̂, as computed by EM, but it's hard because we only have one copy of θ̂. If we had many samples/copies of θ̂, then the variance would be easy to calculate. How to get multiple copies of θ̂? The bootstrap principle is to resample (with replacement) from the observed data X = (X1, . . . , Xn), many times.

SLIDE 36

Bootstrap (cont.)

Fix a large number B. For b = 1, . . . , B, compute an estimate θ̂b as follows:
• Sample X⋆b = (X⋆b1, . . . , X⋆bn) with replacement from the observed data X = (X1, . . . , Xn).
• Compute θ̂b by applying the EM algorithm to X⋆b.
Estimate the variance of the MLE θ̂ with just the sample variance (covariance) of θ̂1, . . . , θ̂B. Motivation for this idea comes from the fact that the empirical distribution of X ought to be similar to the true sampling model, at least for large n. This may be expensive in the EM context because it requires B separate EM runs...
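A hedged R sketch of this recipe, here bootstrapping the censored-exponential EM from Example 1 (the em_cens_exp sketch above) and resampling the (Xi, δi) pairs together:

# nonparametric bootstrap standard error for an EM-based estimator
boot_se_em <- function(x, delta, B = 500) {
  n <- length(x)
  theta_b <- numeric(B)
  for (b in seq_len(B)) {
    idx <- sample.int(n, n, replace = TRUE)          # resample (x_i, delta_i) pairs
    theta_b[b] <- em_cens_exp(x[idx], delta[idx])    # re-run EM on the bootstrap sample
  }
  sd(theta_b)                                        # bootstrap standard error
}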

SLIDE 37

Other methods

• Numerical differentiation of the score function (∂/∂θ) log LX(θ) at θ = θ̂, where θ̂ is the EM solution.
• Supplemented EM (SEM) uses multiple EM runs; apparently more stable than numerical differentiation.
• Louis's method looks interesting (based on a missing information principle: "iX(θ) = iY(θ) − iZ|X(θ)") but involves some specialized computations.
• Using the empirical information seems attractive: it is based on the idea that the Fisher information is the variance of the score.

SLIDE 38

Outline

1. Problem and motivation
2. Definition of the EM algorithm
3. Properties of EM
4. Examples
5. Estimating standard errors
6. Different versions of EM
7. Summary

SLIDE 39

Considerations

In each of the two main steps of the EM algorithm, there are potentially some non-trivial computations involved. In the E-step, an expectation is required and, in general, this cannot be done analytically. Similarly, in the M-step, optimization is required and, in general, this cannot be done analytically. We know that both integration and optimization can be done numerically, but there are concerns about efficiency, i.e., nested loops. So, in general, there are questions about how to efficiently design EM algorithms.

SLIDE 40

Modifying the E-step

In the E-step, we need to compute an expectation with respect to the conditional distribution of Z, given X. In some cases, this boils down to several one-dimensional integrals, which we could possibly do with quadrature. An alternative is to replace numerical integration with Monte Carlo (more on this general approach later).

This is attractive but also may be expensive due to having to run Monte Carlo at every E-step in the EM. Gives EM some kind of Bayesian flavor...

If you haven’t noticed, EM folks like acronyms, so the Monte Carlo EM is called MCEM.
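As a toy illustration of the idea (my own sketch, not from the notes), here is the censored-exponential E-step from Example 1 with the exact conditional expectation replaced by a Monte Carlo average; by memorylessness, draws of a censored Yi are just the censoring point plus fresh exponentials:

# Monte Carlo E-step for the censored exponential example (MCEM flavor)
mc_s <- function(x, delta, theta_t, M = 1000) {
  ey <- x                                           # uncensored cases contribute x_i directly
  for (i in which(delta == 0)) {
    ey[i] <- mean(x[i] + rexp(M, rate = theta_t))   # approx E(Y_i | Y_i > c_i) by simulation
  }
  sum(ey)                                           # Monte Carlo estimate of s^(t)
}
# the M-step is unchanged: theta^(t+1) = n / mc_s(x, delta, theta_t)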

SLIDE 41

Modifying the M-step

In the M-step, we need to maximize Q(θ | θ(t)) wrt θ. If this is not doable analytically, then we can use any of the numerical optimization routines considered previously. The concern is that many numerical optimizations may be expensive. Other ideas:

• Maximize Q one component at a time (the ECM algorithm).
• Do only one iteration of Newton's method at each M-step (the EM gradient algorithm).

SLIDE 42

One specific extension – PX-EM

Ordinary EM is based on the idea that computations can be simplified IF some “missing data” were known. The EM is a very powerful tool, but often suffers from slow convergence. A counter-intuitive idea is to consider introducing more parameters to speed up convergence. This is called PX-EM, where PX = “parameter expansion”. The PX-EM enjoys the same ascent property as EM, but its rate of convergence is no slower.

SLIDE 43

PX-EM (cont.)

Treat parameter θ as a (not one-to-one) function of (ψ, α). Intuition is that the original model corresponds to the case where α is held fixed at some specified value α0, i.e., θ = f (ψ, α0). Start with the complete-data log likelihood for (ψ, α), which we write as log LY (ψ, α). For exponential families, this will be a linear function of the sufficient statistics for the expanded (ψ, α)-model. Then we can proceed to iteratively compute conditional expectation and maximization. There is a slight difference in the PX E-step, however.

SLIDE 44

PX-EM (cont.)

At iteration t, suppose we have (ψ(t), α(t)), which defines θ(t).
PX E-step. Compute Q(ψ, α | ψ(t), α0), the conditional expectation of the complete-data log-likelihood, but note that we use α0 instead of the current guess α(t).
PX M-step. Maximize Q to get (ψ(t+1), α(t+1)), and compute θ(t+1) = f(ψ(t+1), α(t+1)).
The advantage of this PX version of EM is that it improves the M-step by using extra information in the enlarged model.

SLIDE 45

Example 2 – probit regression with PX-EM

Recall the probit regression model: Xi ∼ Ber(Φ(ui⊤θ)). The complete data in this model is Yi ∼ N(ui⊤θ, 1). Expand the parameter θ by introducing a variance parameter,

Yi ∼ N(ui⊤ψ, α²),  with α0 = 1.

Now the sufficient statistics for the complete-data model are s(Y) = (U⊤Y, Y⊤Y). Properties (mean and variance) of the truncated normal distribution help us to carry out the PX E-step. The PX M-step is pretty straightforward, like for the ordinary EM. See the R code for the details.
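The R code referenced in the notes is not reproduced here; the following is my own hedged sketch of one standard way to implement PX-EM for probit (reduction θ = ψ/α, truncated-normal first and second moments in the E-step), so details may differ from the actual code:

# PX-EM sketch for probit regression: expanded model Y_i ~ N(u_i' psi, alpha^2), theta = psi/alpha
pxem_probit <- function(X, U, theta0 = rep(0, ncol(U)), tol = 1e-8, maxit = 500) {
  theta <- theta0
  UtU <- crossprod(U)
  n <- length(X)
  for (t in seq_len(maxit)) {
    mu <- drop(U %*% theta)
    w  <- 2 * X - 1
    v  <- w * dnorm(mu) / pnorm(w * mu)
    ey  <- mu + v                        # E(Y_i | X_i): truncated-normal mean
    ey2 <- 1 + mu^2 + mu * v             # E(Y_i^2 | X_i): truncated-normal second moment
    # PX M-step: regress E(Y | X) on U, then take the expected residual variance as alpha^2
    psi    <- drop(solve(UtU, crossprod(U, ey)))
    alpha2 <- (sum(ey2) - sum(ey * drop(U %*% psi))) / n
    theta_new <- psi / sqrt(alpha2)      # reduce back to the original parameter
    if (max(abs(theta_new - theta)) < tol) break
    theta <- theta_new
  }
  theta_new
}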

SLIDE 46

Outline

1. Problem and motivation
2. Definition of the EM algorithm
3. Properties of EM
4. Examples
5. Estimating standard errors
6. Different versions of EM
7. Summary

SLIDE 47

Remarks

The EM algorithm is a nice tool for maximizing non-standard likelihood functions, particularly in cases where there is some notion of "missing data." It requires some effort to derive the E- and M-steps. There is a huge literature5 on EM, and mixture models (HW?) and censored-data problems are important applications. The idea of "randomly imputing" missing values is clever and has some Bayesian flavor (data augmentation)... The main challenge with EM is that its convergence may be slow, but there are some remedies available. Interesting question: can EM be parallelized?

5 As of today, the original EM paper (Dempster, Laird, and Rubin, JRSS-B 1977) has been cited over 44,000 times!