SLIDE 1

Notes on Neal and Hinton’s Generalized Expectation Maximization (GEM) Algorithm

Mark Johnson, Brown University
February 2005, updated November 2008

SLIDE 2

Talk overview

  • What kinds of problems does expectation maximization solve?
  • An example of EM
  • Relaxation, and proving that EM converges
  • Sufficient statistics and EM
  • The generalized EM algorithm

SLIDE 3

Hidden Markov Models

[Figure: chain y0 → y1 → y2 → y3 → y4 (states, e.g., parts of speech), each yi emitting xi (observations, e.g., words)]

$$P(Y, X \mid \theta) = \prod_{i=1}^{n} P(Y_i \mid Y_{i-1}, \theta)\, P(X_i \mid Y_i, \theta)$$

$$P(y_i \mid y_{i-1}, \theta) = \theta_{y_i, y_{i-1}} \qquad P(x_i \mid y_i, \theta) = \theta_{x_i, y_i}$$
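To make the factorization concrete, here is a minimal Python sketch that evaluates P(y, x|θ) for a toy tag set. The parameter tables, the vocabulary, and the START symbol standing in for y0 are all invented for illustration.

```python
import math

# Hypothetical toy parameters: trans[(y, y_prev)] = P(y | y_prev) = theta_{y, y_prev},
# emit[(x, y)] = P(x | y) = theta_{x, y}.
trans = {("D", "START"): 0.7, ("N", "START"): 0.3,
         ("N", "D"): 0.9, ("D", "D"): 0.1,
         ("D", "N"): 0.4, ("N", "N"): 0.6}
emit = {("the", "D"): 0.8, ("a", "D"): 0.2,
        ("dog", "N"): 0.5, ("cat", "N"): 0.5}

def log_joint(y, x, trans, emit):
    """log P(y, x | theta): sum over positions of
    log P(y_i | y_{i-1}, theta) + log P(x_i | y_i, theta)."""
    logp, y_prev = 0.0, "START"          # "START" plays the role of y0
    for yi, xi in zip(y, x):
        logp += math.log(trans[(yi, y_prev)]) + math.log(emit[(xi, yi)])
        y_prev = yi
    return logp

print(math.exp(log_joint(("D", "N"), ("the", "dog"), trans, emit)))
# 0.7 * 0.8 * 0.9 * 0.5 = 0.252
```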

SLIDE 4

Maximum likelihood estimation

  • Given visible data (y, x), how can we estimate θ?
  • Maximum likelihood principle:

$$\hat\theta = \operatorname*{argmax}_\theta L_{(y,x)}(\theta), \quad \text{where}\quad L_{(y,x)}(\theta) = \log P_\theta(y, x) = \log P(y, x \mid \theta)$$

  • For a HMM, these are simple to calculate:

$$\hat\theta_{y_i, y_j} = \frac{n_{y_i, y_j}(y, x)}{\sum_{y'_i} n_{y'_i, y_j}(y, x)} \qquad\qquad \hat\theta_{x_i, y_i} = \frac{n_{x_i, y_i}(y, x)}{\sum_{x'_i} n_{x'_i, y_i}(y, x)}$$
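In code, these estimators are one pass of counting followed by normalization. A minimal sketch, assuming the same toy representation as the previous sketch (count keys ordered as (outcome, conditioning event)):

```python
from collections import Counter, defaultdict

def mle(tagged_data):
    """Visible-data MLE for a HMM: theta_{y_i, y_{i-1}} and theta_{x_i, y_i}
    are counts divided by the total count for the conditioning tag."""
    trans_n, emit_n = Counter(), Counter()
    for y, x in tagged_data:             # each item: (tag sequence, word sequence)
        y_prev = "START"
        for yi, xi in zip(y, x):
            trans_n[(yi, y_prev)] += 1   # n_{y_i, y_{i-1}}(y, x)
            emit_n[(xi, yi)] += 1        # n_{x_i, y_i}(y, x)
            y_prev = yi
    def normalize(counts):
        totals = defaultdict(float)      # total count per conditioning event
        for (_, cond), n in counts.items():
            totals[cond] += n
        return {k: n / totals[k[1]] for k, n in counts.items()}
    return normalize(trans_n), normalize(emit_n)

trans, emit = mle([(("D", "N"), ("the", "dog")),
                   (("D", "N"), ("a", "cat"))])
print(trans[("N", "D")])                 # 2/2 = 1.0
```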

SLIDE 5

ML estimation from hidden data

  • Our model defines P(Y, X), but our data only contains values for X, i.e., the variable Y is hidden
    – HMM example: D only contains words x but not their labels y
  • Maximum likelihood principle still applies:

$$\hat\theta = \operatorname*{argmax}_\theta L_x(\theta), \quad \text{where}\quad L_x(\theta) = \log P(x \mid \theta) = \log \sum_{y \in \mathcal{Y}} P(y, x \mid \theta)$$

  • But maximizing Lx(θ) may now be a non-trivial problem!
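To see why, here is a brute-force sketch of Lx(θ): the sum ranges over all |Y|^n label sequences, and because it sits inside the log it couples the parameters across positions, so the closed-form relative-frequency solution no longer applies. (Same hypothetical table layout as the earlier sketches; feasible only for tiny inputs.)

```python
import itertools, math

def log_marginal(x, tags, trans, emit):
    """L_x(theta) = log sum_{y in Y} P(y, x | theta), computed by
    enumerating every possible label sequence y (exponential in len(x))."""
    total = 0.0
    for y in itertools.product(tags, repeat=len(x)):
        p, y_prev = 1.0, "START"
        for yi, xi in zip(y, x):
            p *= trans.get((yi, y_prev), 0.0) * emit.get((xi, yi), 0.0)
            y_prev = yi
        total += p                       # accumulate P(y, x | theta)
    return math.log(total)

# With the toy tables from the earlier sketches:
# log_marginal(("the", "dog"), ("D", "N"), trans, emit)
```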

SLIDE 6

What does Expectation Maximization do?

  • Expectation Maximization (EM) is a maximum likelihood estimation procedure for problems with hidden variables
  • EM is good for problems where:
    – our model P(Y, X|θ) involves variables Y and X
    – our training data contains x but not y
    – maximizing P(x|θ) is hard
    – maximizing P(y, x|θ) is easy
  • In the HMM example: the training data consists of words x alone, and does not contain their labels y

SLIDE 7

The EM algorithm

  • The EM algorithm:
    – Guess an initial model θ(0)
    – For t = 1, 2, . . ., compute Q(t)(y) and θ(t), where:

$$\begin{aligned}
Q^{(t)}(y) &= P(y \mid x, \theta^{(t-1)}) && \text{(E-step)} \\
\theta^{(t)} &= \operatorname*{argmax}_\theta\; E_{Y \sim Q^{(t)}}[\log P(Y, x \mid \theta)] && \text{(M-step)} \\
&= \operatorname*{argmax}_\theta\; \sum_{y \in \mathcal{Y}} Q^{(t)}(y) \log P(y, x \mid \theta) \\
&= \operatorname*{argmax}_\theta\; \prod_{y \in \mathcal{Y}} P(y, x \mid \theta)^{Q^{(t)}(y)}
\end{aligned}$$

  • Q(t)(y) is probability of “pseudo-data” y using model θ(t−1)
  • θ(t) is the MLE based on pseudo-data (y, x), where each (y, x) is weighted according to Q(t)(y)
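As a self-contained illustration of this loop, here is EM for a model deliberately simpler than the HMM: a mixture of two biased coins, where the hidden y for each item says which coin produced it. The model, initialization, and data are all invented for illustration.

```python
def em_two_coins(data, iters=50):
    """EM for a two-coin mixture. Each item (h, t) counts the heads
    and tails produced by one unknown coin y in {0, 1}.
    theta = (mix, p): P(y = 0) and the two heads-probabilities."""
    mix, p = 0.6, [0.4, 0.7]                  # theta^(0): arbitrary guess
    for _ in range(iters):
        heads = [0.0, 0.0]                    # Q-weighted pseudo-counts
        tails = [0.0, 0.0]
        mass = [0.0, 0.0]
        for h, t in data:
            # E-step: Q(y = k) = P(y = k | item, theta^(t-1))
            lik = [mix * p[0]**h * (1 - p[0])**t,
                   (1 - mix) * p[1]**h * (1 - p[1])**t]
            z = lik[0] + lik[1]
            for k in (0, 1):
                q = lik[k] / z
                heads[k] += q * h
                tails[k] += q * t
                mass[k] += q
        # M-step: MLE from the Q-weighted pseudo-data
        mix = mass[0] / (mass[0] + mass[1])
        p = [heads[k] / (heads[k] + tails[k]) for k in (0, 1)]
    return mix, p

print(em_two_coins([(9, 1), (8, 2), (2, 8), (1, 9)]))
# converges to roughly mix = 0.5, p = [0.15, 0.85]
```

Each E-step turns the data into Q-weighted pseudo-data; each M-step is exactly the supervised relative-frequency estimator applied to those weights.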

SLIDE 8

HMM example

  • For a HMM, the EM formulae are:

$$Q^{(t)}(y) = P(y \mid x, \theta^{(t-1)}) = \frac{P(y, x \mid \theta^{(t-1)})}{\sum_{y' \in \mathcal{Y}} P(y', x \mid \theta^{(t-1)})}$$

$$\theta^{(t)}_{y_i, y_j} = \frac{\sum_{y \in \mathcal{Y}} Q^{(t)}(y)\, n_{y_i, y_j}(y, x)}{\sum_{y'_i} \sum_{y \in \mathcal{Y}} Q^{(t)}(y)\, n_{y'_i, y_j}(y, x)}$$

$$\theta^{(t)}_{x_i, y_i} = \frac{\sum_{y \in \mathcal{Y}} Q^{(t)}(y)\, n_{x_i, y_i}(y, x)}{\sum_{x'_i} \sum_{y \in \mathcal{Y}} Q^{(t)}(y)\, n_{x'_i, y_i}(y, x)}$$
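A sketch of one such iteration for a toy HMM, computing Q(t)(y) by brute-force enumeration of label sequences; a practical implementation would use the forward-backward algorithm (not covered in these slides) to obtain the same expected counts in polynomial time. Table layout as in the earlier sketches.

```python
import itertools
from collections import Counter, defaultdict

def em_step(x, tags, trans, emit):
    """One EM iteration for a HMM: weight every label sequence y by
    Q(y) = P(y, x | theta) / P(x | theta), then re-estimate theta from
    the Q-weighted counts (exponential-time; tiny inputs only)."""
    weights = {}
    for y in itertools.product(tags, repeat=len(x)):
        p, y_prev = 1.0, "START"
        for yi, xi in zip(y, x):
            p *= trans.get((yi, y_prev), 0.0) * emit.get((xi, yi), 0.0)
            y_prev = yi
        weights[y] = p                        # P(y, x | theta^(t-1))
    z = sum(weights.values())                 # P(x | theta^(t-1))
    trans_n, emit_n = Counter(), Counter()
    for y, w in weights.items():
        q, y_prev = w / z, "START"            # Q^(t)(y)
        for yi, xi in zip(y, x):
            trans_n[(yi, y_prev)] += q        # expected n_{y_i, y_{i-1}}
            emit_n[(xi, yi)] += q             # expected n_{x_i, y_i}
            y_prev = yi
    def normalize(counts):
        totals = defaultdict(float)
        for (_, cond), n in counts.items():
            totals[cond] += n
        return {k: n / totals[k[1]] for k, n in counts.items()}
    return normalize(trans_n), normalize(emit_n)
```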

SLIDE 9

EM converges — overview

  • Neal and Hinton define a function F(Q, θ) where:
    – Q(Y) is a probability distribution over the hidden variables
    – θ are the model parameters

$$\begin{aligned}
\operatorname*{argmax}_\theta \max_Q F(Q, \theta) &= \hat\theta, \text{ the MLE of } \theta \\
\max_Q F(Q, \theta) &= L_x(\theta), \text{ the log likelihood of } \theta \\
\operatorname*{argmax}_Q F(Q, \theta) &= P(Y \mid x, \theta) \text{ for all } \theta
\end{aligned}$$

  • The EM algorithm is an alternating maximization of F:

$$\begin{aligned}
Q^{(t)} &= \operatorname*{argmax}_Q F(Q, \theta^{(t-1)}) && \text{(E-step)} \\
\theta^{(t)} &= \operatorname*{argmax}_\theta F(Q^{(t)}, \theta) && \text{(M-step)}
\end{aligned}$$

SLIDE 10

The EM algorithm converges

$$\begin{aligned}
F(Q, \theta) &= E_{Y \sim Q}[\log P(Y, x \mid \theta)] + H(Q) \\
&= L_x(\theta) - \mathrm{KL}(Q(Y) \,\|\, P(Y \mid x, \theta))
\end{aligned}$$

where H(Q) is the entropy of Q, Lx(θ) = log P(x|θ) is the log likelihood of θ, and KL(Q||P) is the KL divergence between Q and P.

$$\begin{aligned}
Q^{(t)}(Y) &= P(Y \mid x, \theta^{(t-1)}) = \operatorname*{argmax}_Q F(Q, \theta^{(t-1)}) && \text{(E-step)} \\
\theta^{(t)} &= \operatorname*{argmax}_\theta\; E_{Y \sim Q^{(t)}}[\log P(Y, x \mid \theta)] = \operatorname*{argmax}_\theta F(Q^{(t)}, \theta) && \text{(M-step)}
\end{aligned}$$

  • The maximum value of F is achieved at θ = θ̂ and Q(Y) = P(Y|x, θ̂).
  • The sequence of F values produced by the EM algorithm is non-decreasing and bounded above by Lx(θ̂).
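Both identities are easy to verify numerically. A sketch with a two-valued hidden variable and invented probabilities; x is held fixed throughout, so P(Y, x|θ) is just a table of two numbers:

```python
import math

joint = {"y1": 0.12, "y2": 0.08}             # P(y, x | theta), x fixed
Z = sum(joint.values())                      # P(x | theta)
Lx = math.log(Z)                             # L_x(theta)
posterior = {y: p / Z for y, p in joint.items()}   # P(y | x, theta)

def F(Q):
    """F(Q, theta) = E_{Y~Q}[log P(Y, x | theta)] + H(Q)."""
    return sum(q * (math.log(joint[y]) - math.log(q))
               for y, q in Q.items() if q > 0)

def kl(Q, P):
    """KL(Q || P)."""
    return sum(q * math.log(q / P[y]) for y, q in Q.items() if q > 0)

Q = {"y1": 0.5, "y2": 0.5}                   # an arbitrary distribution
print(F(Q), Lx - kl(Q, posterior))           # equal: the two forms of F
print(F(posterior), Lx)                      # equal: E-step attains the bound
```

The second printed pair shows the E-step property: taking Q to be the posterior closes the KL gap and makes F equal the log likelihood.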

SLIDE 11

Generalized EM

  • Idea: anything that increases F gets you closer to θ̂
  • Idea: insert any additional operations you want into the EM algorithm so long as they don’t decrease F
    – Update θ after each data item has been processed
    – Visit some data items more often than others
    – Only update some components of θ on some iterations

SLIDE 12

Incremental EM for factored models

  • Data and model both factor: Y = (Y1, . . . , Yn), X = (X1, . . . , Xn)

$$P(Y, X \mid \theta) = \prod_{i=1}^{n} P(Y_i, X_i \mid \theta)$$

  • Incremental EM algorithm:
    – Initialize $\theta^{(0)}$ and $Q^{(0)}_i(Y_i)$ for i = 1, . . . , n
    – E-step: choose some data item i to be updated

$$Q^{(t)}_j = Q^{(t-1)}_j \;\text{ for all } j \neq i, \qquad Q^{(t)}_i(Y_i) = P(Y_i \mid x_i, \theta^{(t-1)})$$

    – M-step:

$$\theta^{(t)} = \operatorname*{argmax}_\theta\; E_{Y \sim Q^{(t)}}[\log P(Y, x \mid \theta)]$$
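A skeleton of this procedure; e_step_item and m_step are placeholder arguments, not part of the slides (for the two-coin mixture above, e_step_item would return one item's posterior over coins and m_step the Q-weighted MLE):

```python
def incremental_em(items, e_step_item, m_step, theta0, schedule):
    """Incremental EM: refresh Q_i for one data item at a time, leaving
    every other Q_j untouched, then re-run the M-step.
    e_step_item(x_i, theta) -> Q_i(Y_i)   (per-item posterior)
    m_step(items, Qs)       -> theta      (MLE from Q-weighted data)
    schedule: indices to visit, e.g. several round-robin passes."""
    theta = theta0
    Qs = [e_step_item(x, theta) for x in items]   # Q_i^(0)
    for i in schedule:
        Qs[i] = e_step_item(items[i], theta)      # E-step for item i only
        theta = m_step(items, Qs)                 # M-step over all items
    return theta
```

Because the stored Q_j of unvisited items are reused, each visit costs one item's E-step rather than a full pass over the data.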

SLIDE 13

EM using sufficient statistics

  • Model parameters θ estimated from sufficient statistics S:

$$(Y, X) \to S \to \theta$$

  • In HMM example, pseudo-counts are sufficient statistics
  • EM algorithm with sufficient statistics:

$$\begin{aligned}
\tilde{s}^{(t)} &= E_{Y \sim P(Y \mid x, \theta^{(t-1)})}[S] && \text{(E-step)} \\
\theta^{(t)} &= \text{maximum likelihood value for } \theta \text{ based on } \tilde{s}^{(t)} && \text{(M-step)}
\end{aligned}$$
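For the two-coin mixture used earlier, the sufficient statistics are per-coin expected heads, tails, and posterior mass. This sketch refactors that example so the (Y, X) → S → θ pipeline is explicit; the representation of s̃ is an assumption of the sketch:

```python
def expected_stats(data, theta):
    """E-step: s~ = E_{Y ~ P(Y | x, theta)}[S] for the two-coin mixture,
    where S collects per-coin (heads, tails, mass) pseudo-counts."""
    mix, p = theta
    s = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # [heads, tails, mass] per coin
    for h, t in data:
        lik = [mix * p[0]**h * (1 - p[0])**t,
               (1 - mix) * p[1]**h * (1 - p[1])**t]
        z = lik[0] + lik[1]
        for k in (0, 1):
            q = lik[k] / z                   # posterior weight for coin k
            s[k][0] += q * h
            s[k][1] += q * t
            s[k][2] += q
    return s

def mle_from_stats(s):
    """M-step: theta depends on the data only through s~."""
    mix = s[0][2] / (s[0][2] + s[1][2])
    p = [s[k][0] / (s[k][0] + s[k][1]) for k in (0, 1)]
    return mix, p

theta = (0.6, [0.4, 0.7])                    # theta^(0)
for _ in range(50):
    theta = mle_from_stats(expected_stats([(9, 1), (8, 2), (2, 8), (1, 9)], theta))
print(theta)
```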

SLIDE 14

Incremental EM using sufficient statistics

  • Incremental EM algorithm with sufficient statistics:

$$(Y_i, X_i) \to S_i \to S \to \theta, \qquad S = \sum_i S_i$$

    – Initialize $\theta^{(0)}$ and $\tilde{s}^{(0)}_i$ for i = 1, . . . , n
    – E-step: choose some data item i to be updated

$$\begin{aligned}
\tilde{s}^{(t)}_j &= \tilde{s}^{(t-1)}_j \;\text{ for all } j \neq i \\
\tilde{s}^{(t)}_i &= E_{Y_i \sim P(Y_i \mid x_i, \theta^{(t-1)})}[S_i] \\
\tilde{s}^{(t)} &= \tilde{s}^{(t-1)} + (\tilde{s}^{(t)}_i - \tilde{s}^{(t-1)}_i)
\end{aligned}$$

    – M-step: θ(t) = maximum likelihood value for θ based on s̃(t)
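A sketch of the bookkeeping, with per-item statistics stored as numpy arrays so the running total can be updated by one subtraction and one addition; the decomposition into item_stats and mle_from_stats helpers is this sketch's assumption, not the slides' notation:

```python
import numpy as np

def item_stats(item, theta):
    """E-step for one item: s~_i = E_{Y_i ~ P(Y_i | x_i, theta)}[S_i],
    a 2x3 array of per-coin (heads, tails, mass) for the coin mixture."""
    mix, p = theta
    h, t = item
    lik = np.array([mix * p[0]**h * (1 - p[0])**t,
                    (1 - mix) * p[1]**h * (1 - p[1])**t])
    q = lik / lik.sum()                       # posterior over the two coins
    return np.outer(q, [h, t, 1.0])

def mle_from_stats(s):
    """M-step: read theta off the total statistics s~ = sum_i s~_i."""
    mix = s[0, 2] / s[:, 2].sum()
    p = [s[k, 0] / (s[k, 0] + s[k, 1]) for k in (0, 1)]
    return mix, p

def incremental_em(items, theta, passes=20):
    s_i = [item_stats(x, theta) for x in items]     # per-item contributions
    s = sum(s_i)                                    # s~ = sum_i s~_i
    for _ in range(passes):
        for i, x in enumerate(items):               # round-robin schedule
            new = item_stats(x, theta)
            s = s + (new - s_i[i])                  # s~ <- s~ + (new - old)
            s_i[i] = new
            theta = mle_from_stats(s)               # M-step after every item
    return theta

print(incremental_em([(9, 1), (8, 2), (2, 8), (1, 9)], (0.6, [0.4, 0.7])))
```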

SLIDE 15

Conclusion

  • The Expectation-Maximization algorithm is a general technique for using supervised maximum likelihood estimators to solve unsupervised estimation problems
  • The E-step and the M-step can be viewed as steps of an alternating maximization procedure
    – The functional F is bounded above by the log likelihood
    – No E-step or M-step ever decreases F
    ⇒ Eventually the EM algorithm converges to a local optimum (not necessarily a global optimum)
  • We can insert any steps we like into the EM algorithm so long as they do not decrease F
    ⇒ Incremental versions of the EM algorithm
