

SLIDE 1

Probabilistic & Unsupervised Learning: Expectation Maximisation

Maneesh Sahani
maneesh@gatsby.ucl.ac.uk

Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept. of Computer Science, University College London. Term 1, Autumn 2018

SLIDES 2–9

Log-likelihoods

◮ Exponential family models: p(x|θ) = f(x) e^{θᵀT(x)} / Z(θ)

  ℓ(θ) = θᵀ Σₙ T(xₙ) − N log Z(θ)  (+ constants)

◮ Concave function.
◮ Maximum may be closed-form.
◮ If not, numerical optimisation is still generally straightforward.

◮ Latent variable models:

  p(x|θx, θz) = ∫ dz [ fx(x) e^{φ(θx,z)ᵀ Tx(x)} / Zx(φ(θx, z)) ] × [ fz(z) e^{θzᵀ Tz(z)} / Zz(θz) ]

  where the first factor is p(x|z, θx) and the second is p(z|θz).

  ℓ(θx, θz) = Σₙ log ∫ dz [ fx(xₙ) e^{φ(θx,z)ᵀ Tx(xₙ)} / Zx(φ(θx, z)) ] × [ fz(z) e^{θzᵀ Tz(z)} / Zz(θz) ]

◮ Usually no closed-form optimum.
◮ Often multiple local maxima.
◮ Direct numerical optimisation may be possible but is rarely easy.
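The exponential-family case can be made concrete with a small sketch (illustrative, not from the slides): the Bernoulli distribution in natural-parameter form, where T(x) = x, Z(θ) = 1 + e^θ, and the closed-form maximum is θ̂ = logit(x̄).

```python
import math

def bernoulli_loglik(theta, xs):
    # Exponential family form: p(x|theta) = exp(theta * x) / Z(theta),
    # with T(x) = x, f(x) = 1 and Z(theta) = 1 + exp(theta).
    return theta * sum(xs) - len(xs) * math.log(1.0 + math.exp(theta))

def bernoulli_mle(xs):
    # Closed-form maximum: moment matching <T(x)> = mean of T(x_n)
    # gives sigmoid(theta) = x_bar, i.e. theta = logit(x_bar).
    xbar = sum(xs) / len(xs)
    return math.log(xbar / (1.0 - xbar))

xs = [1, 0, 1, 1, 0, 1, 0, 1]          # x_bar = 5/8
theta_hat = bernoulli_mle(xs)
# Concavity: the closed-form optimum beats nearby values of theta.
assert bernoulli_loglik(theta_hat, xs) >= bernoulli_loglik(theta_hat + 0.1, xs)
assert bernoulli_loglik(theta_hat, xs) >= bernoulli_loglik(theta_hat - 0.1, xs)
```
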

SLIDE 10

Example: mixture of Gaussians

Data: X = {x₁ … x_N}
Latent process: sᵢ ∼ⁱⁱᵈ Disc[π]
Component distributions: xᵢ | (sᵢ = m) ∼ Pₘ[θₘ] = N(µₘ, Σₘ)
Marginal distribution: P(xᵢ) = Σ_{m=1}^{k} πₘ Pₘ(xᵢ; θₘ)

Log-likelihood:

  ℓ({µₘ}, {Σₘ}, π) = Σ_{i=1}^{N} log Σ_{m=1}^{k} πₘ |2πΣₘ|^{−1/2} exp{ −½ (xᵢ − µₘ)ᵀ Σₘ⁻¹ (xᵢ − µₘ) }
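As a quick numerical sketch (not from the slides), the mixture log-likelihood above can be evaluated stably in the univariate case with the log-sum-exp trick:

```python
import math

def mog_loglik(xs, pis, mus, sigmas):
    """Univariate mixture-of-Gaussians log-likelihood,
    computed with log-sum-exp over components for stability."""
    total = 0.0
    for x in xs:
        comp_logs = [
            math.log(pi) - math.log(sigma) - 0.5 * math.log(2 * math.pi)
            - 0.5 * ((x - mu) / sigma) ** 2
            for pi, mu, sigma in zip(pis, mus, sigmas)
        ]
        m = max(comp_logs)
        total += m + math.log(sum(math.exp(c - m) for c in comp_logs))
    return total

ll = mog_loglik([0.3, -1.2], pis=[0.5, 0.5], mus=[-1.0, 1.0], sigmas=[1.0, 1.0])
```
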

SLIDES 11–15

The joint-data likelihood and EM

◮ For many models, maximisation might be straightforward if z were not latent, and we could just maximise the joint-data likelihood:

  ℓ(θx, θz) = Σₙ φ(θx, zₙ)ᵀ Tx(xₙ) + θzᵀ Σₙ Tz(zₙ) − Σₙ log Zx(φ(θx, zₙ)) − N log Zz(θz)

◮ Conversely, if we knew θ, we might easily compute (the posterior over) the values of z.
◮ Idea: update θ and (the distribution on) z in alternation, to reach a self-consistent answer. Will this yield the right answer?
◮ Typically, it will (as we shall see). This is the Expectation Maximisation (EM) algorithm.

SLIDES 16–22

The Expectation Maximisation (EM) algorithm

The EM algorithm (Dempster, Laird & Rubin, 1977; but with significant earlier precedents) finds a (local) maximum of a latent-variable model likelihood. Start from arbitrary values of the parameters, and iterate two steps:

E step: Fill in values of the latent variables according to their posterior given the data.
M step: Maximise the likelihood as if the latent variables were not hidden.

◮ Decomposes difficult problems into a series of tractable steps.
◮ An alternative to gradient-based iterative methods.
◮ No learning rate.
◮ In ML, the E step is called inference, and the M step learning. In statistics, these are often called imputation and inference (or estimation).
◮ Not essential for simple models (like MoGs or FA), though often more efficient than the alternatives. Crucial for learning in complex settings.
◮ Provides a framework for principled approximations.

SLIDES 23–31

Jensen's inequality

One view: EM iteratively refines a lower bound on the log-likelihood.

[Figure: the graph of log(x) with points x₁ and x₂; the chord value α log(x₁) + (1 − α) log(x₂) lies below the curve value log(αx₁ + (1 − α)x₂).]

In general, for αᵢ ≥ 0 with Σᵢ αᵢ = 1 (and {xᵢ > 0}):

  log Σᵢ αᵢxᵢ ≥ Σᵢ αᵢ log(xᵢ)

For a probability measure α and a concave function f:

  f(E_α[x]) ≥ E_α[f(x)]

Equality (if and) only if f(x) is almost surely constant or linear on the (convex) support of α.
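The inequality is easy to verify numerically; a short illustrative check (not part of the slides) with random weights and points:

```python
import math, random

random.seed(0)
# Random weights alpha_i >= 0 summing to one, and positive points x_i.
n = 5
raw = [random.random() for _ in range(n)]
alphas = [r / sum(raw) for r in raw]
xs = [random.uniform(0.1, 10.0) for _ in range(n)]

lhs = math.log(sum(a * x for a, x in zip(alphas, xs)))   # log E[x]
rhs = sum(a * math.log(x) for a, x in zip(alphas, xs))   # E[log x]
assert lhs >= rhs   # Jensen: log is concave
```
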

SLIDES 32–39

The lower bound for EM – "free energy"

Observed data X = {xᵢ}; latent variables Z = {zᵢ}; parameters θ = {θx, θz}.

Log-likelihood:

  ℓ(θ) = log P(X|θ) = log ∫ dZ P(Z, X|θ)

By Jensen, any distribution q(Z) over the latent variables generates a lower bound:

  ℓ(θ) = log ∫ dZ q(Z) P(Z, X|θ)/q(Z) ≥ ∫ dZ q(Z) log [P(Z, X|θ)/q(Z)] ≝ F(q, θ).

Now,

  ∫ dZ q(Z) log [P(Z, X|θ)/q(Z)] = ∫ dZ q(Z) log P(Z, X|θ) − ∫ dZ q(Z) log q(Z)
                                 = ∫ dZ q(Z) log P(Z, X|θ) + H[q],

where H[q] is the entropy of q(Z). So:

  F(q, θ) = ⟨log P(Z, X|θ)⟩_{q(Z)} + H[q]

SLIDES 40–42

The E and M steps of EM

The free-energy lower bound on ℓ(θ) is a function of θ and a distribution q:

  F(q, θ) = ⟨log P(Z, X|θ)⟩_{q(Z)} + H[q].

The EM steps can be re-written:

◮ E step: optimise F(q, θ) wrt the distribution over hidden variables, holding the parameters fixed:

  q^{(k)}(Z) := argmax_{q(Z)} F(q(Z), θ^{(k−1)}).

◮ M step: maximise F(q, θ) wrt the parameters, holding the hidden distribution fixed:

  θ^{(k)} := argmax_θ F(q^{(k)}(Z), θ) = argmax_θ ⟨log P(Z, X|θ)⟩_{q^{(k)}(Z)}

The second equality comes from the fact that H[q^{(k)}(Z)] does not depend directly on θ.
slide-43
SLIDE 43

The E Step

The free energy can be re-written

slide-44
SLIDE 44

The E Step

The free energy can be re-written

F(q, θ) =

  • q(Z) log P(Z, X|θ)

q(Z) dZ

slide-45
SLIDE 45

The E Step

The free energy can be re-written

F(q, θ) =

  • q(Z) log P(Z, X|θ)

q(Z) dZ

=

  • q(Z) log P(Z|X, θ)P(X|θ)

q(Z) dZ

slide-46
SLIDE 46

The E Step

The free energy can be re-written

F(q, θ) =

  • q(Z) log P(Z, X|θ)

q(Z) dZ

=

  • q(Z) log P(Z|X, θ)P(X|θ)

q(Z) dZ

=

  • q(Z) log P(X|θ) dZ +
  • q(Z) log P(Z|X, θ)

q(Z) dZ

slide-47
SLIDE 47

The E Step

The free energy can be re-written

F(q, θ) =

  • q(Z) log P(Z, X|θ)

q(Z) dZ

=

  • q(Z) log P(Z|X, θ)P(X|θ)

q(Z) dZ

=

  • q(Z) log P(X|θ) dZ +
  • q(Z) log P(Z|X, θ)

q(Z) dZ

= ℓ(θ) − KL[q(Z)P(Z|X, θ)]

The second term is the Kullback-Leibler divergence.

slide-48
SLIDE 48

The E Step

The free energy can be re-written

F(q, θ) =

  • q(Z) log P(Z, X|θ)

q(Z) dZ

=

  • q(Z) log P(Z|X, θ)P(X|θ)

q(Z) dZ

=

  • q(Z) log P(X|θ) dZ +
  • q(Z) log P(Z|X, θ)

q(Z) dZ

= ℓ(θ) − KL[q(Z)P(Z|X, θ)]

The second term is the Kullback-Leibler divergence. This means that, for fixed θ, F is bounded above by ℓ, and achieves that bound when KL[q(Z)P(Z|X, θ)] = 0.

slide-49
SLIDE 49

The E Step

The free energy can be re-written

F(q, θ) =

  • q(Z) log P(Z, X|θ)

q(Z) dZ

=

  • q(Z) log P(Z|X, θ)P(X|θ)

q(Z) dZ

=

  • q(Z) log P(X|θ) dZ +
  • q(Z) log P(Z|X, θ)

q(Z) dZ

= ℓ(θ) − KL[q(Z)P(Z|X, θ)]

The second term is the Kullback-Leibler divergence. This means that, for fixed θ, F is bounded above by ℓ, and achieves that bound when KL[q(Z)P(Z|X, θ)] = 0. But KL[qp] is zero if and only if q = p (see appendix.)

slide-50
SLIDE 50

The E Step

The free energy can be re-written

F(q, θ) =

  • q(Z) log P(Z, X|θ)

q(Z) dZ

=

  • q(Z) log P(Z|X, θ)P(X|θ)

q(Z) dZ

=

  • q(Z) log P(X|θ) dZ +
  • q(Z) log P(Z|X, θ)

q(Z) dZ

= ℓ(θ) − KL[q(Z)P(Z|X, θ)]

The second term is the Kullback-Leibler divergence. This means that, for fixed θ, F is bounded above by ℓ, and achieves that bound when KL[q(Z)P(Z|X, θ)] = 0. But KL[qp] is zero if and only if q = p (see appendix.) So, the E step sets q(k)(Z) = P(Z|X, θ(k−1)) [inference / imputation] and, after an E step, the free energy equals the likelihood.
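The decomposition F(q, θ) = ℓ(θ) − KL[q‖posterior] can be verified numerically. A minimal sketch (a toy model with one binary latent variable, chosen here for illustration, not taken from the slides):

```python
import math

def npdf(x, mu, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Tiny model with one binary latent z: p(z=1) = pi, x|z ~ N(mu_z, 1).
pi, mus, x = 0.5, (-1.0, 1.0), 0.3
prior = (1 - pi, pi)
joint = [prior[z] * npdf(x, mus[z]) for z in (0, 1)]   # P(z, x)
lik = sum(joint)                                        # P(x | theta)
ell = math.log(lik)                                     # log-likelihood
post = [j / lik for j in joint]                         # P(z | x, theta)

def free_energy(q):
    # F(q, theta) = sum_z q(z) log [P(z, x | theta) / q(z)]
    return sum(q[z] * math.log(joint[z] / q[z]) for z in (0, 1))

def kl(q, p):
    return sum(q[z] * math.log(q[z] / p[z]) for z in (0, 1))

for r in (0.1, 0.5, 0.9):
    q = (1 - r, r)
    # Identity: F(q, theta) = ell(theta) - KL[q || posterior]
    assert abs(free_energy(q) - (ell - kl(q, post))) < 1e-12
    assert free_energy(q) <= ell + 1e-12

# E step: q = posterior makes the bound tight.
assert abs(free_energy(post) - ell) < 1e-12
```
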

SLIDE 51

Coordinate Ascent in F (Demo)

To visualise, we consider a one-parameter / one-latent-variable mixture:

  s ∼ Bernoulli[π],  x|s = 0 ∼ N[−1, 1],  x|s = 1 ∼ N[1, 1],

with a single data point x₁ = 0.3. Here q(s) is a distribution on a single binary latent variable, and so is represented by a single number r₁ ∈ [0, 1].

[Figure: the two component densities over −2 ≤ x ≤ 2, with the data point marked.]
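A rough sketch of this coordinate-ascent demo in code (assuming, for concreteness, that the single free parameter is the mixing proportion π; the slide does not specify which parameter is free):

```python
import math

def npdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

x1 = 0.3
a, b = npdf(x1, 1.0), npdf(x1, -1.0)   # component likelihoods at the datum

def loglik(pi):
    return math.log((1 - pi) * b + pi * a)

# Coordinate ascent in F(r1, pi): alternate the E step (set r1 to the
# posterior responsibility) and the M step (set pi to its MLE given q).
pi, lls = 0.5, []
for _ in range(20):
    r1 = pi * a / (pi * a + (1 - pi) * b)   # E step
    pi = r1                                 # M step (one data point)
    lls.append(loglik(pi))

# EM never decreases the likelihood.
assert all(l2 >= l1 - 1e-12 for l1, l2 in zip(lls, lls[1:]))
```
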


EM Never Decreases the Likelihood

The E and M steps together never decrease the log likelihood:

  • θ(k−1)
slide-78
SLIDE 78

EM Never Decreases the Likelihood

The E and M steps together never decrease the log likelihood:

  • θ(k−1)

=

E step

F

  • q(k), θ(k−1)

◮ The E step brings the free energy to the likelihood.

slide-79
SLIDE 79

EM Never Decreases the Likelihood

The E and M steps together never decrease the log likelihood:

  • θ(k−1)

=

E step

F

  • q(k), θ(k−1)

M step

F

  • q(k), θ(k)

◮ The E step brings the free energy to the likelihood. ◮ The M-step maximises the free energy wrt θ.

slide-80
SLIDE 80

EM Never Decreases the Likelihood

The E and M steps together never decrease the log likelihood:

  • θ(k−1)

=

E step

F

  • q(k), θ(k−1)

M step

F

  • q(k), θ(k)

Jensen

  • θ(k)

,

◮ The E step brings the free energy to the likelihood. ◮ The M-step maximises the free energy wrt θ. ◮ F ≤ ℓ by Jensen – or, equivalently, from the non-negativity of KL

slide-81
SLIDE 81

EM Never Decreases the Likelihood

The E and M steps together never decrease the log likelihood:

  • θ(k−1)

=

E step

F

  • q(k), θ(k−1)

M step

F

  • q(k), θ(k)

Jensen

  • θ(k)

,

◮ The E step brings the free energy to the likelihood. ◮ The M-step maximises the free energy wrt θ. ◮ F ≤ ℓ by Jensen – or, equivalently, from the non-negativity of KL

If the M-step is executed so that θ(k) = θ(k−1) iff F increases, then the overall EM iteration will step to a new value of θ iff the likelihood increases.

slide-82
SLIDE 82

EM Never Decreases the Likelihood

The E and M steps together never decrease the log likelihood:

  • θ(k−1)

=

E step

F

  • q(k), θ(k−1)

M step

F

  • q(k), θ(k)

Jensen

  • θ(k)

,

◮ The E step brings the free energy to the likelihood. ◮ The M-step maximises the free energy wrt θ. ◮ F ≤ ℓ by Jensen – or, equivalently, from the non-negativity of KL

If the M-step is executed so that θ(k) = θ(k−1) iff F increases, then the overall EM iteration will step to a new value of θ iff the likelihood increases. Can also show that fixed points of EM (generally) correspond to maxima of the likelihood (see appendices).

SLIDES 83–87

EM Summary

◮ An iterative algorithm that finds (local) maxima of the likelihood of a latent variable model:

  ℓ(θ) = log P(X|θ) = log ∫ dZ P(X|Z, θ) P(Z|θ)

◮ Increases a variational lower bound on the likelihood by coordinate ascent:

  F(q, θ) = ⟨log P(Z, X|θ)⟩_{q(Z)} + H[q] = ℓ(θ) − KL[q(Z)‖P(Z|X, θ)] ≤ ℓ(θ)

◮ E step:

  q^{(k)}(Z) := argmax_{q(Z)} F(q(Z), θ^{(k−1)}) = P(Z|X, θ^{(k−1)})

◮ M step:

  θ^{(k)} := argmax_θ F(q^{(k)}(Z), θ) = argmax_θ ⟨log P(Z, X|θ)⟩_{q^{(k)}(Z)}

◮ After an E step, F(q, θ) = ℓ(θ) ⇒ a maximum of the free energy is a maximum of the likelihood.

SLIDE 88

Partial M steps and Partial E steps

Partial M steps: The proof holds even if we just increase F wrt θ rather than maximising it. (Dempster, Laird and Rubin (1977) call this the generalised EM, or GEM, algorithm.) In fact, immediately after an E step,

  ∂/∂θ |_{θ^{(k−1)}} ⟨log P(X, Z|θ)⟩_{q^{(k)}(Z)} = ∂/∂θ |_{θ^{(k−1)}} log P(X|θ),   with q^{(k)}(Z) = P(Z|X, θ^{(k−1)})

[cf. mixture gradients from the last lecture]. So the E step (inference) can be used to construct other gradient-based optimisation schemes (e.g. "Expectation Conjugate Gradient", Salakhutdinov et al., ICML 2003).

Partial E steps: We can also just increase F wrt some of the qs. For example, sparse or online versions of the EM algorithm would compute the posterior for a subset of the data points, or as the data arrives, respectively. One might also update the posterior over a subset of the hidden variables, while holding the others fixed.
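The gradient identity after an E step can be checked by finite differences. A sketch with a two-component unit-variance mixture whose free parameter is one mean (the specific model and values are illustrative choices, not from the slides):

```python
import math

def npdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

# Two-component MoG with unit variances and fixed pi = 0.5; free parameter mu1.
xs = [0.3, -1.2, 0.8]
pi, mu0, mu1 = 0.5, -1.0, 0.7

def loglik(m1):
    return sum(math.log((1 - pi) * npdf(x, mu0) + pi * npdf(x, m1)) for x in xs)

# E step at mu1: responsibilities r_i = P(s_i = 1 | x_i, mu1).
rs = [pi * npdf(x, mu1) / ((1 - pi) * npdf(x, mu0) + pi * npdf(x, mu1)) for x in xs]

def expected_cll(m1):
    # <log P(x, s | m1)>_q, with q frozen at the responsibilities above.
    return sum(
        (1 - r) * (math.log(1 - pi) + math.log(npdf(x, mu0)))
        + r * (math.log(pi) + math.log(npdf(x, m1)))
        for x, r in zip(xs, rs)
    )

# Immediately after the E step, the two gradients agree at mu1.
eps = 1e-6
g_free = (expected_cll(mu1 + eps) - expected_cll(mu1 - eps)) / (2 * eps)
g_lik = (loglik(mu1 + eps) - loglik(mu1 - eps)) / (2 * eps)
assert abs(g_free - g_lik) < 1e-5
```
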

SLIDE 89

EM for MoGs

◮ Evaluate responsibilities:

  r_im = πₘ Pₘ(xᵢ) / Σ_{m′} π_{m′} P_{m′}(xᵢ)

◮ Update parameters:

  µₘ ← Σᵢ r_im xᵢ / Σᵢ r_im

  Σₘ ← Σᵢ r_im (xᵢ − µₘ)(xᵢ − µₘ)ᵀ / Σᵢ r_im

  πₘ ← Σᵢ r_im / N
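These updates translate directly into code. A minimal univariate sketch (the deterministic initialisation of the means spread over the data range is an illustrative choice):

```python
import math

def npdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_mog(xs, k, iters=50):
    """EM for a univariate k-component Gaussian mixture."""
    n = len(xs)
    pis = [1.0 / k] * k
    span = max(xs) - min(xs)
    mus = [min(xs) + span * (m + 0.5) / k for m in range(k)]  # spread over range
    sigmas = [1.0] * k
    for _ in range(iters):
        # E step: responsibilities r[i][m] proportional to pi_m P_m(x_i).
        r = []
        for x in xs:
            w = [pis[m] * npdf(x, mus[m], sigmas[m]) for m in range(k)]
            tot = sum(w)
            r.append([wm / tot for wm in w])
        # M step: weighted mean / variance / proportion per component.
        for m in range(k):
            nm = sum(r[i][m] for i in range(n))
            mus[m] = sum(r[i][m] * xs[i] for i in range(n)) / nm
            var = sum(r[i][m] * (xs[i] - mus[m]) ** 2 for i in range(n)) / nm
            sigmas[m] = math.sqrt(max(var, 1e-6))   # guard against collapse
            pis[m] = nm / n
    return pis, mus, sigmas

data = [-5.2, -4.9, -5.1, -4.8, 4.9, 5.1, 5.2, 4.8]   # two separated clusters
pis, mus, sigmas = em_mog(data, 2)
```

For this well-separated toy data set, the fitted means land near the two cluster centres at −5 and +5.
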

SLIDES 90–94

The Gaussian mixture model (E-step)

In a univariate Gaussian mixture model, the density of a data point x is:

  p(x|θ) = Σ_{m=1}^{k} p(s = m|θ) p(x|s = m, θ) ∝ Σ_{m=1}^{k} (πₘ/σₘ) exp{ −(x − µₘ)² / 2σₘ² },

where θ is the collection of parameters: means µₘ, variances σₘ², and mixing proportions πₘ = p(s = m|θ).

The hidden variable sᵢ indicates which component generated observation xᵢ.

The E step computes the posterior for sᵢ given the current parameters:

  q(sᵢ) = p(sᵢ|xᵢ, θ) ∝ p(xᵢ|sᵢ, θ) p(sᵢ|θ)

  r_im ≝ q(sᵢ = m) = ⟨δ_{sᵢ=m}⟩_q ∝ (πₘ/σₘ) exp{ −(xᵢ − µₘ)² / 2σₘ² }   (responsibilities)

with the normalisation such that Σₘ r_im = 1.

SLIDES 95–98

The Gaussian mixture model (M-step)

In the M step we optimise the sum (since s is discrete):

  E = ⟨log p(x, s|θ)⟩_{q(s)} = Σ q(s) log[p(s|θ) p(x|s, θ)]
    = Σ_{i,m} r_im [ log πₘ − log σₘ − (xᵢ − µₘ)² / 2σₘ² ].

The optimum is found by setting the partial derivatives of E to zero:

  ∂E/∂µₘ = Σᵢ r_im (xᵢ − µₘ)/σₘ² = 0  ⇒  µₘ = Σᵢ r_im xᵢ / Σᵢ r_im,

  ∂E/∂σₘ = Σᵢ r_im [ −1/σₘ + (xᵢ − µₘ)²/σₘ³ ] = 0  ⇒  σₘ² = Σᵢ r_im (xᵢ − µₘ)² / Σᵢ r_im,

  ∂E/∂πₘ = Σᵢ r_im / πₘ;  ∂E/∂πₘ + λ = 0  ⇒  πₘ = (1/N) Σᵢ r_im,

where λ is a Lagrange multiplier ensuring that the mixing proportions sum to unity.

SLIDE 99

EM for Factor Analysis

[Figure: the factor analysis graphical model, with latents z₁ … z_K connected to observations x₁ … x_D.]

The model for x:

  p(x|θ) = ∫ p(z|θ) p(x|z, θ) dz = N(0, ΛΛᵀ + Ψ)

Model parameters: θ = {Λ, Ψ}.

E step: For each data point xₙ, compute the posterior distribution of the hidden factors given the observed data: qₙ(zₙ) = p(zₙ|xₙ, θₜ).

M step: Find the θₜ₊₁ that maximises F(q, θ):

  F(q, θ) = Σₙ ∫ qₙ(zₙ) [log p(zₙ|θ) + log p(xₙ|zₙ, θ) − log qₙ(zₙ)] dzₙ
          = Σₙ ∫ qₙ(zₙ) [log p(zₙ|θ) + log p(xₙ|zₙ, θ)] dzₙ + c.
slide-100
SLIDE 100

The E step for Factor Analysis

E step: For each data point xn, compute the posterior distribution of hidden factors given the

  • bserved data: qn(zn) = p(zn|xn, θ) = p(zn, xn|θ)/p(xn|θ)

Tactic: write p(zn, xn|θ), consider xn to be fixed. What is this as a function of zn? p(zn, xn)

=

p(zn)p(xn|zn)

= (2π)− K

2 exp{−1

2zT

nzn} |2πΨ|− 1

2 exp{−1

2(xn − Λzn)TΨ−1(xn − Λzn)}

=

c × exp{−1 2[zT

nzn + (xn − Λzn)TΨ−1(xn − Λzn)]}

=

c’ × exp{−1 2[zT

n(I + ΛTΨ−1Λ)zn − 2zT nΛTΨ−1xn]}

=

c” × exp{−1 2[zT

nΣ−1zn − 2zT nΣ−1µn + µT nΣ−1µn]}

So Σ = (I + ΛTΨ−1Λ)−1 = I − βΛ and µn = ΣΛTΨ−1xn = βxn. Where β = ΣΛTΨ−1. Note that µn is a linear function of xn and Σ does not depend on xn.
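These posterior moments can be sanity-checked numerically against direct Gaussian conditioning on the joint of (z, x). A sketch with randomly drawn parameters (the dimensions and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 2, 4                                      # latent and observed dimensions
Lam = rng.normal(size=(D, K))                    # loading matrix Lambda
Psi = np.diag(rng.uniform(0.5, 2.0, size=D))     # diagonal noise covariance
x = rng.normal(size=D)                           # one observed data point

# E step posterior, as on the slide:
#   Sigma = (I + Lam^T Psi^-1 Lam)^-1,  mu_n = Sigma Lam^T Psi^-1 x = beta x
Psi_inv = np.linalg.inv(Psi)
Sigma = np.linalg.inv(np.eye(K) + Lam.T @ Psi_inv @ Lam)
beta = Sigma @ Lam.T @ Psi_inv
mu = beta @ x

# Cross-check via conditioning the joint Gaussian of (z, x):
#   E[z|x] = Lam^T (Lam Lam^T + Psi)^-1 x,
#   Cov[z|x] = I - Lam^T (Lam Lam^T + Psi)^-1 Lam   (Woodbury / push-through).
C = Lam @ Lam.T + Psi
G = Lam.T @ np.linalg.inv(C)
assert np.allclose(mu, G @ x)
assert np.allclose(Sigma, np.eye(K) - G @ Lam)
```
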

slide-108
SLIDE 108

The M step for Factor Analysis

M step: Find θ_{t+1} by maximising

$$F = \sum_n \big\langle \log p(z_n|\theta) + \log p(x_n|z_n,\theta)\big\rangle_{q_n(z_n)} + c$$

$$\log p(z_n|\theta) + \log p(x_n|z_n,\theta)
 = c - \tfrac12 z_n^\top z_n - \tfrac12\log|\Psi| - \tfrac12(x_n-\Lambda z_n)^\top\Psi^{-1}(x_n-\Lambda z_n)$$
$$= c' - \tfrac12\log|\Psi| - \tfrac12\big[x_n^\top\Psi^{-1}x_n - 2x_n^\top\Psi^{-1}\Lambda z_n + z_n^\top\Lambda^\top\Psi^{-1}\Lambda z_n\big]$$
$$= c' - \tfrac12\log|\Psi| - \tfrac12\big[x_n^\top\Psi^{-1}x_n - 2x_n^\top\Psi^{-1}\Lambda z_n + \mathrm{Tr}\big[\Lambda^\top\Psi^{-1}\Lambda\, z_n z_n^\top\big]\big]$$

Taking expectations wrt q_n(z_n):

$$= c' - \tfrac12\log|\Psi| - \tfrac12\big[x_n^\top\Psi^{-1}x_n - 2x_n^\top\Psi^{-1}\Lambda \mu_n + \mathrm{Tr}\big[\Lambda^\top\Psi^{-1}\Lambda\,(\mu_n\mu_n^\top + \Sigma)\big]\big]$$

Note that we don't need to know everything about q(z_n), just the moments ⟨z_n⟩ and ⟨z_n z_n^⊤⟩.

These are the expected sufficient statistics.

slide-116
SLIDE 116

The M step for Factor Analysis (cont.)

$$F = c' - \frac{N}{2}\log|\Psi| - \frac12\sum_n\Big[x_n^\top\Psi^{-1}x_n - 2x_n^\top\Psi^{-1}\Lambda\mu_n + \mathrm{Tr}\big[\Lambda^\top\Psi^{-1}\Lambda\,(\mu_n\mu_n^\top + \Sigma)\big]\Big]$$

Taking derivatives wrt Λ and Ψ⁻¹, using $\frac{\partial\,\mathrm{Tr}[AB]}{\partial B} = A^\top$ and $\frac{\partial \log|A|}{\partial A} = A^{-\top}$:

$$\frac{\partial F}{\partial\Lambda} = \Psi^{-1}\sum_n x_n\mu_n^\top - \Psi^{-1}\Lambda\Big(N\Sigma + \sum_n\mu_n\mu_n^\top\Big) = 0
 \;\Rightarrow\; \hat\Lambda = \Big(\sum_n x_n\mu_n^\top\Big)\Big(N\Sigma + \sum_n \mu_n\mu_n^\top\Big)^{-1}$$

$$\frac{\partial F}{\partial\Psi^{-1}} = \frac{N}{2}\Psi - \frac12\sum_n\Big[x_nx_n^\top - \Lambda\mu_n x_n^\top - x_n\mu_n^\top\Lambda^\top + \Lambda(\mu_n\mu_n^\top + \Sigma)\Lambda^\top\Big]$$

$$\Rightarrow\; \hat\Psi = \frac1N\sum_n\Big[x_nx_n^\top - \Lambda\mu_n x_n^\top - x_n\mu_n^\top\Lambda^\top + \Lambda(\mu_n\mu_n^\top + \Sigma)\Lambda^\top\Big]
 = \Lambda\Sigma\Lambda^\top + \frac1N\sum_n(x_n - \Lambda\mu_n)(x_n - \Lambda\mu_n)^\top$$

(squared residuals)

Note: we should actually only take derivatives w.r.t. Ψ_dd since Ψ is diagonal. As Σ → 0 these become the equations for ML linear regression.
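The M-step updates above can be sketched directly from the E-step moments. Here `Mu` (rows ⟨z_n⟩) and the shared posterior covariance `Sigma` stand in for E-step output; the data are placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)
D, K, N = 5, 2, 200
X = rng.normal(size=(N, D))
Mu = rng.normal(size=(N, K))               # stand-in for E-step means <z_n>
Sigma = np.eye(K) * 0.1                    # stand-in for posterior covariance

S_xz = X.T @ Mu                            # sum_n x_n mu_n^T
S_zz = N * Sigma + Mu.T @ Mu               # N Sigma + sum_n mu_n mu_n^T
Lam = S_xz @ np.linalg.inv(S_zz)           # new factor loadings Lambda-hat

resid = X - Mu @ Lam.T
Psi_full = Lam @ Sigma @ Lam.T + resid.T @ resid / N
Psi = np.diag(np.diag(Psi_full))           # keep only Psi_dd: Psi is diagonal
```

The final line implements the slide's note: only the diagonal of the full expression is retained.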

slide-117
SLIDE 117

Mixtures of Factor Analysers

Simultaneous clustering and dimensionality reduction.

$$p(x|\theta) = \sum_k \pi_k\,\mathcal{N}(\mu_k,\ \Lambda_k\Lambda_k^\top + \Psi)$$

where π_k is the mixing proportion for FA k, µ_k is its centre, Λ_k is its "factor loading matrix", and Ψ is a common sensor noise model. θ = {{π_k, µ_k, Λ_k}_{k=1...K}, Ψ}.

We can think of this model as having two sets of hidden latent variables:

◮ A discrete indicator variable s_n ∈ {1, . . . , K}
◮ For each factor analyzer, a continuous factor vector z_{n,k} ∈ R^{D_k}

$$p(x|\theta) = \sum_{s_n=1}^K p(s_n|\theta)\int p(z|s_n,\theta)\,p(x_n|z, s_n,\theta)\,dz$$

As before, an EM algorithm can be derived for this model:

E step: We need moments of p(z_n, s_n|x_n, θ), specifically ⟨δ_{s_n=m}⟩, ⟨δ_{s_n=m} z_n⟩ and ⟨δ_{s_n=m} z_n z_n^⊤⟩.

M step: Similar to M-step for FA with responsibility-weighted moments.

See http://www.learning.eng.cam.ac.uk/zoubin/papers/tr-96-1.pdf
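Since each component is a Gaussian with covariance Λ_kΛ_k^⊤ + Ψ, the MFA density and the responsibilities ⟨δ_{s_n=m}⟩ can be evaluated as in any mixture model. A sketch with illustrative names and sizes (all made up for the example):

```python
import numpy as np

rng = np.random.default_rng(3)
D, Kf, M = 4, 2, 3                         # data dim, factors per FA, components
mus = rng.normal(size=(M, D))
Lams = rng.normal(size=(M, D, Kf))
Psi = np.diag(rng.uniform(0.5, 1.0, D))    # shared sensor noise
pis = np.full(M, 1.0 / M)
X = rng.normal(size=(50, D))

def gauss_logpdf(X, mu, C):
    # log N(x; mu, C) for each row of X
    diff = X - mu
    _, logdet = np.linalg.slogdet(C)
    quad = np.einsum('nd,de,ne->n', diff, np.linalg.inv(C), diff)
    return -0.5 * (mu.size * np.log(2 * np.pi) + logdet + quad)

logp = np.stack([np.log(pis[k]) + gauss_logpdf(X, mus[k], Lams[k] @ Lams[k].T + Psi)
                 for k in range(M)], axis=1)
resp = np.exp(logp - logp.max(axis=1, keepdims=True))
resp /= resp.sum(axis=1, keepdims=True)    # responsibilities <delta_{s_n=m}>
```

The component-conditional factor posteriors then follow exactly as in the single-FA E-step.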

slide-125
SLIDE 125

EM for exponential families

EM is often applied to models whose joint over ξ = (z, x) has exponential-family form:

$$p(\xi|\theta) = f(\xi)\exp\{\theta^\top T(\xi)\}/Z(\theta), \qquad Z(\theta) = \int f(\xi)\exp\{\theta^\top T(\xi)\}\,d\xi,$$

but whose marginal p(x) ∉ ExpFam.

The free energy dependence on θ is given by:

$$F(q,\theta) = \int q(z)\log p(z,x|\theta)\,dz + H[q]
 = \int q(z)\big[\theta^\top T(z,x) - \log Z(\theta)\big]\,dz + \text{const wrt }\theta$$
$$= \theta^\top\big\langle T(z,x)\big\rangle_{q(z)} - \log Z(\theta) + \text{const wrt }\theta$$

So, in the E step all we need to compute are the expected sufficient statistics under q. We also have:

$$\frac{\partial}{\partial\theta}\log Z(\theta) = \frac{1}{Z(\theta)}\frac{\partial}{\partial\theta}Z(\theta)
 = \frac{1}{Z(\theta)}\frac{\partial}{\partial\theta}\int f(\xi)\exp\{\theta^\top T(\xi)\}\,d\xi
 = \int \underbrace{\frac{1}{Z(\theta)}f(\xi)\exp\{\theta^\top T(\xi)\}}_{p(\xi|\theta)}\;T(\xi)\,d\xi
 = \big\langle T(\xi)\big\rangle_{p(\xi|\theta)}$$

Thus, the M step solves:

$$\frac{\partial F}{\partial\theta} = \big\langle T(z,x)\big\rangle_{q(z)} - \big\langle T(z,x)\big\rangle_{p(\xi|\theta)} = 0$$
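The identity ∂logZ/∂θ = ⟨T(ξ)⟩ is easy to check numerically. Our own toy example (not from the slides): a Bernoulli in natural form, with T(x) = x, f(x) = 1, Z(θ) = 1 + e^θ.

```python
import numpy as np

theta = 0.7
eps = 1e-6
logZ = lambda t: np.log(1 + np.exp(t))

grad = (logZ(theta + eps) - logZ(theta - eps)) / (2 * eps)  # numerical d/dtheta log Z
mean_T = np.exp(theta) / (1 + np.exp(theta))                # <x> under p(x|theta)
```

The M-step "moment-matching" condition is this identity rearranged: set θ so that the model moment equals the expected sufficient statistic under q.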

slide-129
SLIDE 129

EM for exponential family mixtures

To derive EM formally for models with discrete latents (including mixtures) it is useful to introduce an indicator vector **s** in place of the discrete s:

$$s_i = m \;\Leftrightarrow\; \mathbf{s}_i = [0, 0, \ldots, \underbrace{1}_{m\text{th position}}, \ldots, 0]$$

Collecting the M component distributions' natural params into a matrix Θ = [θ_m]:

$$\log P(X, S) = \sum_i \Big[(\log\pi)^\top \mathbf{s}_i + \mathbf{s}_i^\top\Theta^\top T(x_i) - \mathbf{s}_i^\top\log Z(\Theta)\Big] + \text{const}$$

where log Z(Θ) collects the log-normalisers for all components into an M-element vector.

Then, the expected sufficient statistics (E-step) are:

$$\sum_i \langle \mathbf{s}_i\rangle_q \quad\text{(responsibilities } r_{im}\text{)}, \qquad
 \sum_i T(x_i)\,\langle \mathbf{s}_i^\top\rangle_q \quad\text{(responsibility-weighted sufficient stats)}$$

And maximisation of the expected log-joint (M-step) gives:

$$\pi^{(k+1)} \propto \sum_i \langle \mathbf{s}_i\rangle_q, \qquad
 \big\langle T(x)\,\big|\,\theta_m^{(k+1)}\big\rangle = \frac{\sum_i T(x_i)\,\langle[\mathbf{s}_i]_m\rangle_q}{\sum_i \langle[\mathbf{s}_i]_m\rangle_q}$$
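One EM sweep for a mixture of Bernoullis — the simplest exponential-family mixture, with T(x) = x — illustrates the recipe: the M-step just moment-matches the responsibility-weighted sufficient statistics. A sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = (rng.random((300, 8)) < 0.5).astype(float)   # binary data, n x d
M = 2
pi = np.ones(M) / M
p = rng.uniform(0.3, 0.7, size=(M, 8))           # mean parameters per component

# E-step: responsibilities r_im proportional to pi_m p(x_i | theta_m)
log_lik = X @ np.log(p).T + (1 - X) @ np.log(1 - p).T
log_r = np.log(pi) + log_lik
r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
r /= r.sum(axis=1, keepdims=True)

# M-step: match the responsibility-weighted sufficient statistics
pi = r.mean(axis=0)
p = (r.T @ X) / r.sum(axis=0)[:, None]
```

The last line is exactly ⟨T(x)|θ_m⟩ = Σᵢ T(xᵢ)⟨[sᵢ]_m⟩ / Σᵢ ⟨[sᵢ]_m⟩ for this model.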
slide-137
SLIDE 137

EM for MAP

What if we have a prior?

$$p(\xi|\theta) = f(\xi)\exp\{\theta^\top T(\xi)\}/Z(\theta), \qquad
 p(\theta) = F(\nu,\tau)\exp\{\theta^\top\tau\}/Z(\theta)^\nu$$

Augment the free energy by adding the log prior:

$$F^{\mathrm{MAP}}(q,\theta) = \int q(Z)\log p(Z, X, \theta)\,dZ + H[q] \;\le\; \log P(X|\theta) + \log P(\theta)$$
$$= \int q(Z)\Big[\theta^\top\Big(\sum_i T(\xi_i) + \tau\Big) - (N+\nu)\log Z(\theta)\Big]\,dZ + \text{const wrt }\theta$$
$$= \theta^\top\Big(\sum_i \langle T(\xi_i)\rangle_{q} + \tau\Big) - (N+\nu)\log Z(\theta) + \text{const wrt }\theta$$

So, the expected sufficient statistics in the E step are unchanged. Thus, after an E-step the augmented free-energy equals the log-joint, and so free-energy maxima are log-joint maxima (i.e. MAP values). Can we find posteriors? Only approximately – we'll return to this later as "Variational Bayes".
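As the derivation shows, a conjugate prior just adds pseudo-counts τ (and ν) to the expected sufficient statistics. Our own illustration (not from the slides): MAP mixing proportions with a Dirichlet(α) prior, given responsibilities from an E-step:

```python
import numpy as np

rng = np.random.default_rng(5)
r = rng.dirichlet(np.ones(3), size=50)   # placeholder responsibilities, n x M
alpha = np.array([2.0, 2.0, 2.0])        # Dirichlet prior pseudo-counts

N_m = r.sum(axis=0)                      # expected counts (E-step statistics)
pi_map = (N_m + alpha - 1) / (N_m.sum() + (alpha - 1).sum())  # MAP update
```

With α = 1 this reduces to the ML update π_m = N_m / N; larger α smooths the estimate toward uniform.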

slide-138
SLIDE 138

References

◮ A. P. Dempster, N. M. Laird and D. B. Rubin (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B (Methodological), Vol. 39, No. 1, pp. 1–38.

http://www.jstor.org/stable/2984875

◮ R. M. Neal and G. E. Hinton (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (editor), Learning in Graphical Models, pp. 355–368, Dordrecht: Kluwer Academic Publishers.

http://www.cs.utoronto.ca/∼radford/ftp/emk.pdf

◮ R. Salakhutdinov, S. Roweis and Z. Ghahramani (2003). Optimization with EM and expectation-conjugate-gradient. In ICML, pp. 672–679.

http://www.cs.utoronto.ca/∼rsalakhu/papers/emecg.pdf

◮ Z. Ghahramani and G. E. Hinton (1996). The EM Algorithm for Mixtures of Factor Analyzers. University of Toronto Technical Report CRG-TR-96-1.

http://learning.eng.cam.ac.uk/zoubin/papers/tr-96-1.pdf

slide-139
SLIDE 139

Proof of the Matrix Inversion Lemma

$$(A + XBX^\top)^{-1} = A^{-1} - A^{-1}X(B^{-1} + X^\top A^{-1}X)^{-1}X^\top A^{-1}$$

Need to prove:

$$\big[A^{-1} - A^{-1}X(B^{-1} + X^\top A^{-1}X)^{-1}X^\top A^{-1}\big](A + XBX^\top) = I$$

Expand:

$$I + A^{-1}XBX^\top - A^{-1}X(B^{-1} + X^\top A^{-1}X)^{-1}X^\top - A^{-1}X(B^{-1} + X^\top A^{-1}X)^{-1}X^\top A^{-1}XBX^\top$$

Regroup:

$$= I + A^{-1}X\big[BX^\top - (B^{-1} + X^\top A^{-1}X)^{-1}X^\top - (B^{-1} + X^\top A^{-1}X)^{-1}X^\top A^{-1}XBX^\top\big]$$
$$= I + A^{-1}X\big[BX^\top - (B^{-1} + X^\top A^{-1}X)^{-1}B^{-1}BX^\top - (B^{-1} + X^\top A^{-1}X)^{-1}X^\top A^{-1}XBX^\top\big]$$
$$= I + A^{-1}X\big[BX^\top - (B^{-1} + X^\top A^{-1}X)^{-1}(B^{-1} + X^\top A^{-1}X)BX^\top\big]$$
$$= I + A^{-1}X(BX^\top - BX^\top) = I$$
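The lemma is also easy to spot-check numerically on random matrices (our check, using `X` of a lower rank than `A`, as in the FA application):

```python
import numpy as np

rng = np.random.default_rng(6)
D, K = 5, 2
A = np.diag(rng.uniform(1.0, 2.0, D))    # e.g. a diagonal noise term
B = np.eye(K)
Xm = rng.normal(size=(D, K))

lhs = np.linalg.inv(A + Xm @ B @ Xm.T)
Ainv = np.linalg.inv(A)
rhs = Ainv - Ainv @ Xm @ np.linalg.inv(np.linalg.inv(B) + Xm.T @ Ainv @ Xm) @ Xm.T @ Ainv
```

The right-hand side inverts only K×K matrices (plus the cheap diagonal A), which is why the lemma is useful when K ≪ D.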

slide-140
SLIDE 140

KL[q(x)‖p(x)] ≥ 0, with equality iff ∀x : p(x) = q(x)

First consider discrete distributions; the Kullback–Leibler divergence is:

$$\mathrm{KL}[q\|p] = \sum_i q_i\log\frac{q_i}{p_i}.$$

To minimise wrt the distribution q we need a Lagrange multiplier to enforce normalisation:

$$E \stackrel{\mathrm{def}}{=} \mathrm{KL}[q\|p] + \lambda\Big(1 - \sum_i q_i\Big)
 = \sum_i q_i\log\frac{q_i}{p_i} + \lambda\Big(1 - \sum_i q_i\Big)$$

Find conditions for stationarity:

$$\frac{\partial E}{\partial q_i} = \log q_i - \log p_i + 1 - \lambda = 0 \;\Rightarrow\; q_i = p_i\exp(\lambda - 1)$$
$$\frac{\partial E}{\partial \lambda} = 1 - \sum_i q_i = 0 \;\Rightarrow\; \sum_i q_i = 1$$
$$\Rightarrow\; q_i = p_i.$$

Check sign of curvature (Hessian):

$$\frac{\partial^2 E}{\partial q_i\partial q_i} = \frac{1}{q_i} > 0, \qquad \frac{\partial^2 E}{\partial q_i\partial q_j} = 0,$$

so the unique stationary point q_i = p_i is indeed a minimum. Easily verified that at that minimum, KL[q‖p] = KL[p‖p] = 0. A similar proof holds for continuous densities, using functional derivatives.
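A quick numeric illustration of the result (our own, on random discrete distributions):

```python
import numpy as np

rng = np.random.default_rng(7)
q = rng.dirichlet(np.ones(10))
p = rng.dirichlet(np.ones(10))

kl = np.sum(q * np.log(q / p))           # KL[q || p] for q != p
kl_self = np.sum(q * np.log(q / q))      # KL[q || q] = 0
```

For distinct random q and p the divergence comes out strictly positive, and it vanishes exactly when the two distributions coincide.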

slide-147
SLIDE 147

Fixed Points of EM are Stationary Points in ℓ

Let a fixed point of EM occur with parameter θ∗. Then:

$$\frac{\partial}{\partial\theta}\Big\langle \log P(Z, X\,|\,\theta)\Big\rangle_{P(Z|X,\theta^*)}\bigg|_{\theta^*} = 0$$

Now,

$$\ell(\theta) = \log P(X|\theta) = \Big\langle \log P(X|\theta)\Big\rangle_{P(Z|X,\theta^*)}
 = \Big\langle \log\frac{P(Z, X|\theta)}{P(Z|X,\theta)}\Big\rangle_{P(Z|X,\theta^*)}$$
$$= \Big\langle \log P(Z, X|\theta)\Big\rangle_{P(Z|X,\theta^*)} - \Big\langle \log P(Z|X,\theta)\Big\rangle_{P(Z|X,\theta^*)}$$

so,

$$\frac{d}{d\theta}\,\ell(\theta) = \frac{d}{d\theta}\Big\langle \log P(Z, X|\theta)\Big\rangle_{P(Z|X,\theta^*)} - \frac{d}{d\theta}\Big\langle \log P(Z|X,\theta)\Big\rangle_{P(Z|X,\theta^*)}$$

The second term is 0 at θ∗ if the derivative exists (minimum of KL[·‖·]), and thus:

$$\frac{d}{d\theta}\,\ell(\theta)\bigg|_{\theta^*} = \frac{d}{d\theta}\Big\langle \log P(Z, X|\theta)\Big\rangle_{P(Z|X,\theta^*)}\bigg|_{\theta^*} = 0$$

So, EM converges to a stationary point of ℓ(θ).

slide-152
SLIDE 152

Maxima in F correspond to maxima in ℓ

Let θ∗ now be the parameter value at a local maximum of F (and thus at a fixed point). Differentiating the previous expression wrt θ again we find:

$$\frac{d^2}{d\theta^2}\,\ell(\theta) = \frac{d^2}{d\theta^2}\Big\langle \log P(Z, X|\theta)\Big\rangle_{P(Z|X,\theta^*)} - \frac{d^2}{d\theta^2}\Big\langle \log P(Z|X,\theta)\Big\rangle_{P(Z|X,\theta^*)}$$

The first term on the right is negative (a maximum) and the second term is positive (a minimum). Thus the curvature of the likelihood is negative and θ∗ is a maximum of ℓ.

[. . . as long as the derivatives exist. They sometimes don't (zero-noise ICA).]