SLIDE 1

Stat 451 Lecture Notes 04: EM Algorithm

Ryan Martin UIC www.math.uic.edu/~rgmartin

Based on Ch. 4 in Givens & Hoeting and Ch. 13 in Lange. Updated: March 9, 2016.

SLIDE 2

Outline

1. Problem and motivation
2. Definition of the EM algorithm
3. Properties of EM
4. Examples
5. Estimating standard errors
6. Different versions of EM
7. Summary

SLIDE 3

Notion of “missing data”

Let X denote the observable data and θ the parameter to be estimated. The EM algorithm is particularly suited for problems in which there is a notion of "missing data". The missing data can be actual data that is missing, or some "imaginary" data that exists only in our minds (and is therefore necessarily missing). The point is that IF the missing data were available, then finding the MLE for θ would be relatively straightforward.

SLIDE 4

Notation

Again, X is the observable data. Let Y denote the complete data.3 Usually we think of Y as being composed of observable data X and missing data Z, that is, Y = (X, Z). Perhaps more generally, we think of the observable data X as a sort of projection of the complete data, i.e., “X = M(Y )”. This suggests a notion of marginalization... The basic idea behind the EM algorithm is to iteratively impute the missing data.

3 This is the notation used in G&H which, as they admit, is not standard in the EM literature.

SLIDE 5

Example – mixture model

Here is an example where the "missing data" is not real. Suppose X = (X1, . . . , Xn) consists of iid samples from the mixture α N(µ1, 1) + (1 − α) N(µ2, 1), where θ = (α, µ1, µ2) is to be estimated. IF we knew which of the two groups each Xi was from, then it would be straightforward to get the MLE for θ, i.e., just calculate the group means. The missing part Z = (Z1, . . . , Zn) consists of the group labels, i.e.,

Zi = 1 if Xi ∼ N(µ1, 1),  Zi = 2 if Xi ∼ N(µ2, 1),  i = 1, . . . , n.

SLIDE 6

Outline

1. Problem and motivation
2. Definition of the EM algorithm
3. Properties of EM
4. Examples
5. Estimating standard errors
6. Different versions of EM
7. Summary

SLIDE 7

More notation

Complete data Y = (X, Z) splits to the observed data X and missing data Z. The complete data likelihood θ → LY (θ) is the joint distribution of (X, Z). The observed likelihood θ → LX(θ) is obtained by marginalizing the joint distribution of (X, Z). The conditional distribution of Z, given X, is an essential piece: θ → LZ|X(θ). Though the same notation “L” is used for all the likelihoods, it should be clear that these are all distinct functions of θ.

SLIDE 8

Example – mixture model (cont.)

Complete data Y = (Y1, . . . , Yn), where each Yi consists of the observed data Xi together with the missing group label Zi. The observed data likelihood is

LX(θ) = ∏_{i=1}^n {α N(Xi | µ1, 1) + (1 − α) N(Xi | µ2, 1)};

not a nice function: the sum is inside the product. The complete data likelihood is much nicer (write it out!). The conditional distribution of Z, given X, is determined by the conditional probabilities

Pθ(Zi = 1 | Xi) = α N(Xi | µ1, 1) / {α N(Xi | µ1, 1) + (1 − α) N(Xi | µ2, 1)}.
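Although LX(θ) is awkward to maximize directly, it is easy to evaluate. Here is a minimal R sketch (the function name is my own, not from the notes) that will be handy later for checking the ascent property:

# observed-data log-likelihood for the two-component normal mixture
# theta = c(alpha, mu1, mu2), unit variances
loglik_obs <- function(theta, x) {
  sum(log(theta[1] * dnorm(x, theta[2], 1) + (1 - theta[1]) * dnorm(x, theta[3], 1)))
}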

SLIDE 9

EM formulation

The EM works with some new function: Q(θ′ | θ) = Eθ{log LY (θ′) | X}, the conditional expectation of the complete data log-likelihood, at θ′, given X and the particular value θ. Implicit in this expression is that, given X, the only "random" part of Y is the missing data Z. So, in this expression, the expectation is actually with respect to Z, given X, i.e.,

Q(θ′ | θ) = ∫ log{L(X,z)(θ′)} Lz|X(θ) dz.

SLIDE 10

EM formulation (cont.)

The EM algorithm alternates between computing Q(θ′ | θ), which involves an expectation, and maximizing it. Start with a fixed θ(0). At iteration t ≥ 1 do:
E-step. Evaluate Qt(θ) := Q(θ | θ(t−1)).
M-step. Update θ(t) = arg maxθ Qt(θ).
Repeat these steps until practical convergence is reached.
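To make the recipe concrete, here is a hedged R sketch of these two steps for the two-component normal mixture from earlier (the function name, starting values, and stopping rule are my own choices):

# EM for the mixture alpha*N(mu1, 1) + (1 - alpha)*N(mu2, 1); theta = (alpha, mu1, mu2)
em_mixture <- function(x, theta0 = c(0.5, min(x), max(x)), tol = 1e-8, maxit = 1000) {
  alpha <- theta0[1]; mu1 <- theta0[2]; mu2 <- theta0[3]
  for (t in seq_len(maxit)) {
    # E-step: responsibilities gamma_i = P_theta(Z_i = 1 | X_i) at the current guess
    num <- alpha * dnorm(x, mu1, 1)
    gam <- num / (num + (1 - alpha) * dnorm(x, mu2, 1))
    # M-step: closed-form maximizers of Q (weighted proportions and means)
    alpha_new <- mean(gam)
    mu1_new   <- sum(gam * x) / sum(gam)
    mu2_new   <- sum((1 - gam) * x) / sum(1 - gam)
    done  <- max(abs(c(alpha_new - alpha, mu1_new - mu1, mu2_new - mu2))) < tol
    alpha <- alpha_new; mu1 <- mu1_new; mu2 <- mu2_new
    if (done) break
  }
  c(alpha = alpha, mu1 = mu1, mu2 = mu2)
}

Each pass through the loop increases the observed-data log-likelihood; the loglik_obs function sketched above can be used to verify this on simulated data.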

SLIDE 11

A super-simple example

Goal is to maximize the observed data likelihood. But EM iteratively maximizes some other function, so it's not clear that we are doing something reasonable. Before we get to theory, it helps to consider a simple example to see that EM is doing the right thing. Y = (X, Z), where X, Z are iid N(θ, 1), but Z is missing. Observed data MLE is θ̂ = X. The Q function in the E-step is

Q(θ | θ(t)) = −(1/2){(θ − X)² + (θ − θ(t))²}.

Find the M-step update—what should happen as t → ∞?
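One way to see what happens, without giving the algebra away, is to carry out the M-step numerically and watch the iterates; this little R sketch (my own, not from the notes) uses a generic optimizer:

x     <- 1.7   # the single observed value, so the observed-data MLE is x
theta <- 7     # arbitrary starting point theta^(0)
Q <- function(theta_new, theta_old) {
  -0.5 * ((theta_new - x)^2 + (theta_new - theta_old)^2)
}
for (t in 1:50) {
  # M-step: maximize Q(. | theta^(t-1)) over a wide interval
  theta <- optimize(Q, interval = c(-100, 100), theta_old = theta, maximum = TRUE)$maximum
}
theta          # the iterates settle at x, the observed-data MLE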

SLIDE 12

Outline

1. Problem and motivation
2. Definition of the EM algorithm
3. Properties of EM
4. Examples
5. Estimating standard errors
6. Different versions of EM
7. Summary

SLIDE 13

Ascent property

The claimed ascent property of EM is as follows: LX(θ(t+1)) ≥ LX(θ(t)), ∀ t. To prove this, we first need a simple identity involving joint, conditional, and marginal densities: log fV(v) = log fU,V(u, v) − log fU|V(u | v). The next general fact is the non-negativity of the relative entropy, or Kullback–Leibler divergence:

∫ log{p(x)/q(x)} p(x) dx ≥ 0, with equality iff p = q.

This follows from Jensen's inequality, since y → − log y is convex.

SLIDE 14

Ascent property (cont.)

Using the density identity, we can write log LX(θ) = log LY (θ) − log LZ|X(θ). Taking expectations wrt Z, given X and θ(t), gives log LX(θ) = Q(θ | θ(t)) − H(θ | θ(t)), where H(θ | θ(t)) = Eθ(t){log LZ|X(θ) | X}. It follows from the non-negativity of KL that H(θ(t) | θ(t)) − H(θ | θ(t)) ≥ 0, ∀ θ.

SLIDE 15

Ascent property (cont.)

Key observation: picking θ(t+1) such that Q(θ(t+1) | θ(t)) ≥ Q(θ(t) | θ(t)) will not decrease either term in the expression for log LX(·). So maximizing Q(· | θ(t)) in the M-step will result in updates with the desired ascent property: LX(θ(t+1)) ≥ LX(θ(t)), ∀ t. This does not imply that the EM updates will necessarily converge to the MLE, just that they are surely moving in the right direction.

SLIDE 16

Further properties

One can express the EM updates through an abstract mapping Ψ, i.e., θ(t+1) = Ψ(θ(t)). If EM converges to θ̂, then θ̂ must be a fixed point of Ψ. A Taylor approximation of Ψ(θ(t)) near θ̂ gives

θ(t+1) − θ̂ = Ψ(θ(t)) − Ψ(θ̂) ≈ Ψ′(θ̂)(θ(t) − θ̂).

If the parameter is one-dimensional, then the (linear) rate of convergence can be seen to be Ψ′(θ̂), provided that θ̂ is a (local) maximum.

SLIDE 17

EM for exponential family models

Recall that a model/joint distribution Pθ for data Y is a natural exponential family if the log-likelihood is of the form log LY (θ) = const + log a(θ) + θ⊤s(y), where s(y) is the “sufficient statistic.” For problems where the complete data Y is modeled as an exponential family, EM takes a relatively simple form. This is an important case since many examples involve exponential families.

SLIDE 18

EM for exponential family models (cont.)

For exponential families, the Q function looks like

Q(θ | θ(t)) = const + log a(θ) + ∫ θ⊤s(y) Lz|X(θ(t)) dz.

To maximize this, take the derivative wrt θ and set it to zero:

−a′(θ)/a(θ) = ∫ s(y) Lz|X(θ(t)) dz.

From Stat 411, you know that the left-hand side is Eθ{s(Y)}. Let s(t) be the right-hand side. The M-step updates θ(t) → θ(t+1) by solving the equation Eθ{s(Y)} = s(t).

SLIDE 19

EM for exponential family models (cont.)

E-step. Compute s(t) based on the current guess θ(t).
M-step. Update the guess to θ(t+1) by solving the equation Eθ{s(Y)} = s(t).

SLIDE 20

Outline

1. Problem and motivation
2. Definition of the EM algorithm
3. Properties of EM
4. Examples
5. Estimating standard errors
6. Different versions of EM
7. Summary

SLIDE 21

Example 1 – censored exponential model

Complete data Y1, . . . , Yn iid ∼ Exp(θ), with θ the rate. The complete data log-likelihood is

log LY(θ) = n log θ − θ Σ_{i=1}^n Yi,  so s(Y) = Σ_{i=1}^n Yi is the sufficient statistic.

Suppose some observations are right-censored, i.e., only a lower bound is observed. Write the observed data as pairs (Xi, δi), where Xi = min(Yi, ci) (the ci's are non-random) and δi = I{Xi = Yi}. The missing data Z consists of the actual event times for the censored observations.

SLIDE 22

Example 1 – censored exponential model (cont.)

For EM, we first need to compute s(t)... Only the censored cases are of concern. If an observation Yi is right-censored at ci, then we know that ci is a lower bound. Recall that the exponential distribution has the memoryless property. So, the E-step of the EM requires

s(t) = Σ_{i=1}^n {δi Xi + (1 − δi) Eθ(t)(Yi | censored)}
     = Σ_{i=1}^n {δi Xi + (1 − δi)(Xi + 1/θ(t))}
     = n X̄ + (1/θ(t)) Σ_{i=1}^n (1 − δi),

where X̄ is the sample mean of the Xi's.

SLIDE 23

Example 1 – censored exponential model (cont.)

Clearly, Eθ{s(Y)} = n/θ. So, the M-step requires that we solve for θ in

n X̄ + (1/θ(t)) Σ_{i=1}^n (1 − δi) = n/θ.

In particular, the EM update in this case is

θ(t+1) = { X̄ + (1/θ(t)) · (1/n) Σ_{i=1}^n (1 − δi) }^(−1).

Iterate this update till convergence.
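A hedged R sketch of this update (the function and variable names are my own); the simulated data mimic the settings on the next slide, though the exact draws there cannot be reproduced here:

# EM for iid Exp(theta) data with right-censoring at known constants c_i
# x = min(y, c), delta = 1 if uncensored, 0 if censored
em_cens_exp <- function(x, delta, theta0 = 1, tol = 1e-10, maxit = 1000) {
  theta <- theta0
  for (t in seq_len(maxit)) {
    theta_new <- 1 / (mean(x) + mean(1 - delta) / theta)  # the update derived above
    if (abs(theta_new - theta) < tol) break
    theta <- theta_new
  }
  theta_new
}

# in the spirit of the next slide: n = 30, theta = 3, censoring at 0.632, start at 7
set.seed(451)
y <- rexp(30, rate = 3)
x <- pmin(y, 0.632); delta <- as.numeric(y <= 0.632)
em_cens_exp(x, delta, theta0 = 7)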

SLIDE 24

Example 1 – censored exponential model (cont.)

Simulated data: n = 30, θ = 3, censored at 0.632. Picture below shows the observed data likelihood and the EM steps starting at θ(0) = 7.

[Figure: LX(θ) plotted against θ, with the EM steps from θ(0) = 7 marked.]

SLIDE 25

Example 2 – probit regression

Recall the probit regression model: Xi ∼ Ber(Φ(ui⊤θ)). We can use the EM algorithm to easily get the MLE of θ. Write the complete data as Y = (Y1, . . . , Yn), where Yi ∼ N(ui⊤θ, 1), and

Xi = 1 if Yi > 0,  Xi = 0 if Yi ≤ 0.

Exercise: check that Xi defined in this way has the same distribution as that given by the probit model... Basically, we observe the sign of the complete data, but the actual values are missing.

SLIDE 26

Example 2 – probit regression (cont.)

The complete-data problem is easy, just a normal linear regression with known variance (an exponential family), with s(Y) = U⊤Y, where U is the design matrix. The observed data tell us the sign of Yi, so the conditional expectation is that of a truncated normal distribution:4

Eθ(t)(Yi | Xi) = µi(t) + wi ϕ(µi(t)) / Φ(wi µi(t)),  where wi = 2Xi − 1 and µi(t) = ui⊤θ(t);

write vi(t) for the second term on the right. This completes the E-step; the M-step requires solving

U⊤Uθ = U⊤Uθ(t) + U⊤v(t),

where the left-hand side is Eθ{s(Y)} and the right-hand side is s(t).
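A hedged R sketch of these two steps (the function name is my own; the update line solves the M-step equation above):

# EM for probit regression: X_i ~ Ber(Phi(u_i' theta)); complete data Y_i ~ N(u_i' theta, 1)
# U is the n x p design matrix (include a column of 1's for an intercept)
em_probit <- function(X, U, theta0 = rep(0, ncol(U)), tol = 1e-8, maxit = 500) {
  theta <- theta0
  UtU <- crossprod(U)                      # U'U
  for (t in seq_len(maxit)) {
    mu <- drop(U %*% theta)                # mu_i^(t) = u_i' theta^(t)
    w  <- 2 * X - 1
    v  <- w * dnorm(mu) / pnorm(w * mu)    # truncated-normal correction v_i^(t)
    # M-step: solve U'U theta = U'U theta^(t) + U'v
    theta_new <- drop(solve(UtU, UtU %*% theta + crossprod(U, v)))
    if (max(abs(theta_new - theta)) < tol) break
    theta <- theta_new
  }
  theta_new
}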

4 http://en.wikipedia.org/wiki/Truncated_normal_distribution

SLIDE 27

Example 2 – probit regression (cont.)

Simulated data: n = 50; intercept θ1 = 0, slope θ2 = 1; predictor values iid N(0, 4²). The plot below shows the data and the EM-fitted probit regression line.

[Figure: binary responses plotted against the predictor, with the EM-fitted probit regression line.]

SLIDE 28

Example 3 – robust regression

Consider the linear model yi = xi⊤β + σεi. Least-squares estimators, based on normal errors, are sensitive (not robust) to "outlier" observations. Remedy: fit a model with heavier-than-normal tails. One approach to robust regression is to model ε with a Student-t distribution with small degrees of freedom. This model can be fit with standard optimization tools, but a clever approach makes application of EM quite simple. Key observation: Student-t is a scale mixture of normals,

f(ε) = ∫ N(ε | 0, ν/z) ChiSq(z | ν) dz.

SLIDE 29

Example 3 – robust regression (cont.)

For simplicity, we assume that ν = df is known. Think of the Zi values implicitly attached to the Student-t error distribution for εi as "missing data." If we knew Z = (Z1, . . . , Zn), then the problem would just be a simple modification of the basic normal model... For θ = (β, σ), the complete data log-likelihood is

log LY(θ) = Σ_{i=1}^n log N(yi − xi⊤β | 0, νσ²/Zi).

E-step requires expectation wrt conditional distribution of Z, given data and a guess θ(t)...

SLIDE 30

Example 3 – robust regression (cont.)

It can be shown (HW?) that the conditional distribution of Zi, given the observed data and a guess θ(t), is

{ (1/ν) ((yi − xi⊤β(t))/σ(t))² + 1 }^(−1) × ChiSq(ν + 1),  i = 1, . . . , n.

Then the Q function in the E-step is obtained by plugging in the conditional expectation of Zi, i.e.,

Q(θ | θ(t)) = −(n/2) log σ² − (1/(2σ²)) Σ_{i=1}^n wi(t) (yi − xi⊤β)²,

where

wi(t) = (ν + 1) { ((yi − xi⊤β(t))/σ(t))² + ν }^(−1),  i = 1, . . . , n.

The M-step is equivalent to a weighted least squares problem...
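A minimal R sketch of the resulting iteratively reweighted least squares scheme (function names are my own; lm with a weights argument would work equally well for the M-step):

# EM (iteratively reweighted least squares) for regression with Student-t errors, known df = nu
em_t_regression <- function(y, X, nu = 4, tol = 1e-8, maxit = 500) {
  beta  <- drop(solve(crossprod(X), crossprod(X, y)))   # start from ordinary least squares
  sigma <- sqrt(mean((y - X %*% beta)^2))
  for (t in seq_len(maxit)) {
    r <- drop(y - X %*% beta) / sigma
    w <- (nu + 1) / (r^2 + nu)                           # E-step: weights w_i^(t)
    # M-step: weighted least squares for beta, then the weighted residual variance for sigma^2
    beta_new  <- drop(solve(t(X) %*% (w * X), t(X) %*% (w * y)))
    sigma_new <- sqrt(mean(w * drop(y - X %*% beta_new)^2))
    delta <- max(abs(c(beta_new - beta, sigma_new - sigma)))
    beta <- beta_new; sigma <- sigma_new
    if (delta < tol) break
  }
  list(beta = beta, sigma = sigma)
}

# e.g., for the Belgian phone data on the next slide, something like
# em_t_regression(MASS::phones$calls, cbind(1, MASS::phones$year), nu = 4)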

SLIDE 31

Example 3 – robust regression (cont.)

Belgian phone call data – in R’s MASS library. Compare fit of LS versus Student-t via EM (df=4).

[Figure: Calls (in millions) plotted against Year, with the EM (Student-t) and LS fitted lines.]

SLIDE 32

Outline

1. Problem and motivation
2. Definition of the EM algorithm
3. Properties of EM
4. Examples
5. Estimating standard errors
6. Different versions of EM
7. Summary

SLIDE 33

Challenge

The EM algorithm is designed to return the MLE θ̂. It does not, however, say anything about standard errors. Recall that if we run, say, "BFGS" via the function optim in R, then we can request that the Hessian at the MLE be returned, which can be used to approximate the standard errors of θ̂. The challenge is that the EM doesn't work directly with the observed data log-likelihood.

Question: how to compute standard errors within EM?
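For reference, the optim-based calculation alluded to above looks like this small hedged sketch (a toy uncensored exponential model, my own choice, just to show the mechanics):

# standard errors from the Hessian of the negative log-likelihood returned by optim
set.seed(1)
y <- rexp(30, rate = 3)
negloglik <- function(theta) -sum(dexp(y, rate = theta, log = TRUE))
fit <- optim(par = 1, fn = negloglik, method = "BFGS", hessian = TRUE)
fit$par                          # MLE
sqrt(diag(solve(fit$hessian)))   # approximate standard error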

SLIDE 34

Analytical calculation

Of course, if we can write down a formula for the negative second derivative of log LX(θ) at θ = θ̂, or for the Fisher information I(θ̂), then we have an estimator of the standard errors. For the probit regression model (Example 2 above), we have a formula for the Fisher information:

In(θ) = Σ_{i=1}^n [ ϕ(ui⊤θ)² / ( Φ(ui⊤θ){1 − Φ(ui⊤θ)} ) ] ui ui⊤.

So, we can just plug our MLE θ̂ from the EM into this formula and (numerically) invert the matrix. Can also numerically differentiate − log LX(θ)...
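In R, the plug-in calculation might look like the following sketch (assumes a design matrix U and an EM estimate theta_hat, e.g., from the em_probit sketch earlier):

# plug-in Fisher information for probit regression, evaluated at the EM solution
probit_se <- function(theta_hat, U) {
  eta <- drop(U %*% theta_hat)
  h   <- dnorm(eta)^2 / (pnorm(eta) * (1 - pnorm(eta)))   # per-observation weights
  info <- t(U) %*% (h * U)                                 # I_n(theta_hat) = sum_i h_i u_i u_i'
  sqrt(diag(solve(info)))                                  # approximate standard errors
}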

SLIDE 35

Bootstrap

We will talk in more detail about the bootstrap later, but here is a little taste of the main idea. We want to estimate the variance of θ̂, as computed by EM, but it's hard because we only have one copy of θ̂. If we had many samples/copies of θ̂, then the variance would be easy to calculate. How to get multiple copies of θ̂? The bootstrap principle is to resample (with replacement) from the observed data X = (X1, . . . , Xn), many times.

SLIDE 36

Bootstrap (cont.)

Fix a large number B. For b = 1, . . . , B, compute an estimate θ̂b as follows:
• Sample X⋆b = (X⋆b1, . . . , X⋆bn) with replacement from the observed data X = (X1, . . . , Xn).
• Compute θ̂b by applying the EM algorithm to X⋆b.
Estimate the variance of the MLE θ̂ with just the sample variance (covariance) of θ̂1, . . . , θ̂B. Motivation for this idea comes from the fact that the empirical distribution of X ought to be similar to the true sampling model, at least for large n. This may be expensive in the EM context because it requires B separate EM runs...
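A hedged R sketch of this recipe, here bootstrapping the censored-exponential EM from Example 1 (the em_cens_exp sketch above) and resampling the (Xi, δi) pairs together:

# nonparametric bootstrap standard error for an EM-based estimator
boot_se_em <- function(x, delta, B = 500) {
  n <- length(x)
  theta_b <- numeric(B)
  for (b in seq_len(B)) {
    idx <- sample.int(n, n, replace = TRUE)          # resample (x_i, delta_i) pairs
    theta_b[b] <- em_cens_exp(x[idx], delta[idx])    # re-run EM on the bootstrap sample
  }
  sd(theta_b)                                        # bootstrap standard error
}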

SLIDE 37

Other methods

• Numerical differentiation of the score function (∂/∂θ) log LX(θ) at θ = θ̂, where θ̂ is the EM solution.
• Supplemented EM (SEM) uses multiple EM runs; apparently more stable than numerical differentiation.
• Louis's method looks interesting (based on a missing information principle: "iX(θ) = iY(θ) − iZ|X(θ)") but involves some specialized computations.
• Using the empirical information seems attractive: it is based on the idea that the Fisher information is the variance of the score.

SLIDE 38

Outline

1. Problem and motivation
2. Definition of the EM algorithm
3. Properties of EM
4. Examples
5. Estimating standard errors
6. Different versions of EM
7. Summary

SLIDE 39

Considerations

In each of the two main steps of the EM algorithm, there are potentially some non-trivial computations involved. In the E-step, an expectation is required and, in general, this cannot be done analytically. Similarly, in the M-step, optimization is required and, in general, this cannot be done analytically. We know that both integration and optimization can be done numerically, but there are concerns about efficiency, i.e., nested loops. So, in general, there are questions about how to efficiently design EM algorithms.

SLIDE 40

Modifying the E-step

In the E-step, we need to compute an expectation with respect to the conditional distribution of Z, given X. In some cases, this boils down to several one-dimensional integrals, which we could possibly do with quadrature. An alternative is to replace numerical integration with Monte Carlo (more on this general approach later).

This is attractive but also may be expensive due to having to run Monte Carlo at every E-step in the EM. Gives EM some kind of Bayesian flavor...

If you haven’t noticed, EM folks like acronyms, so the Monte Carlo EM is called MCEM.
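As a toy illustration of the idea (my own sketch, not from the notes), here is the censored-exponential E-step from Example 1 with the exact conditional expectation replaced by a Monte Carlo average; by memorylessness, draws of a censored Yi are just the censoring point plus fresh exponentials:

# Monte Carlo E-step for the censored exponential example (MCEM flavor)
mc_s <- function(x, delta, theta_t, M = 1000) {
  ey <- x                                           # uncensored cases contribute x_i directly
  for (i in which(delta == 0)) {
    ey[i] <- mean(x[i] + rexp(M, rate = theta_t))   # approx E(Y_i | Y_i > c_i) by simulation
  }
  sum(ey)                                           # Monte Carlo estimate of s^(t)
}
# the M-step is unchanged: theta^(t+1) = n / mc_s(x, delta, theta_t)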

SLIDE 41

Modifying the M-step

In the M-step, we need to maximize Q(θ | θ(t)) wrt θ. If this is not doable analytically, then we can use any of the numerical optimization routines considered previously. The concern is that many numerical optimizations may be expensive. Other ideas:

• Maximize Q one component at a time (the ECM algorithm).
• Do only one iteration of Newton's method at each M-step (the EM gradient algorithm).

SLIDE 42

One specific extension – PX-EM

Ordinary EM is based on the idea that computations can be simplified IF some “missing data” were known. The EM is a very powerful tool, but often suffers from slow convergence. A counter-intuitive idea is to consider introducing more parameters to speed up convergence. This is called PX-EM, where PX = “parameter expansion”. The PX-EM enjoys the same ascent property as EM, but its rate of convergence is no slower.

SLIDE 43

PX-EM (cont.)

Treat parameter θ as a (not one-to-one) function of (ψ, α). Intuition is that the original model corresponds to the case where α is held fixed at some specified value α0, i.e., θ = f (ψ, α0). Start with the complete-data log likelihood for (ψ, α), which we write as log LY (ψ, α). For exponential families, this will be a linear function of the sufficient statistics for the expanded (ψ, α)-model. Then we can proceed to iteratively compute conditional expectation and maximization. There is a slight difference in the PX E-step, however.

SLIDE 44

PX-EM (cont.)

At iteration t, suppose we have (ψ(t), α(t)), which defines θ(t).
PX E-step. Compute Q(ψ, α | ψ(t), α0), the conditional expectation of the complete-data log-likelihood, but note that we use α0 instead of the current guess α(t).
PX M-step. Maximize Q to get (ψ(t+1), α(t+1)), and compute θ(t+1) = f(ψ(t+1), α(t+1)).
The advantage of this PX version of EM is that it improves the M-step by using extra information in the enlarged model.

SLIDE 45

Example 2 – probit regression with PX-EM

Recall the probit regression model: Xi ∼ Ber(Φ(ui⊤θ)). The complete data in this model is Yi ∼ N(ui⊤θ, 1). Expand the parameter θ by introducing a variance parameter,

Yi ∼ N(ui⊤ψ, α²),  with α0 = 1.

Now the sufficient statistics for the complete-data model are s(Y) = (U⊤Y, Y⊤Y). Properties (mean and variance) of the truncated normal distribution help us to carry out the PX E-step. The PX M-step is pretty straightforward, like for the ordinary EM. See the R code for the details.
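The R code referenced in the notes is not reproduced here; the following is my own hedged sketch of one standard way to implement PX-EM for probit (reduction θ = ψ/α, truncated-normal first and second moments in the E-step), so details may differ from the actual code:

# PX-EM sketch for probit regression: expanded model Y_i ~ N(u_i' psi, alpha^2), theta = psi/alpha
pxem_probit <- function(X, U, theta0 = rep(0, ncol(U)), tol = 1e-8, maxit = 500) {
  theta <- theta0
  UtU <- crossprod(U)
  n <- length(X)
  for (t in seq_len(maxit)) {
    mu <- drop(U %*% theta)
    w  <- 2 * X - 1
    v  <- w * dnorm(mu) / pnorm(w * mu)
    ey  <- mu + v                        # E(Y_i | X_i): truncated-normal mean
    ey2 <- 1 + mu^2 + mu * v             # E(Y_i^2 | X_i): truncated-normal second moment
    # PX M-step: regress E(Y | X) on U, then take the expected residual variance as alpha^2
    psi    <- drop(solve(UtU, crossprod(U, ey)))
    alpha2 <- (sum(ey2) - sum(ey * drop(U %*% psi))) / n
    theta_new <- psi / sqrt(alpha2)      # reduce back to the original parameter
    if (max(abs(theta_new - theta)) < tol) break
    theta <- theta_new
  }
  theta_new
}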

SLIDE 46

Outline

1. Problem and motivation
2. Definition of the EM algorithm
3. Properties of EM
4. Examples
5. Estimating standard errors
6. Different versions of EM
7. Summary

SLIDE 47

Remarks

The EM algorithm is a nice tool for maximizing non-standard likelihood functions, particularly in cases where there is some notion of "missing data." It requires some effort to derive the E- and M-steps. There is a huge literature5 on EM, and mixture models (HW?) and censored-data problems are important applications. The idea of "randomly imputing" missing values is clever and has some Bayesian flavor (data augmentation)... The main challenge with EM is that its convergence may be slow, but there are some remedies available. Interesting question: can EM be parallelized?

5 As of today, the original EM paper (Dempster, Laird, and Rubin, JRSS-B 1977) has been cited over 44,000 times!