Statistical Modeling and Analysis of Neural Data (NEU 560) Princeton University, Spring 2018 Jonathan Pillow

Lecture 17 notes: Latent variable models and EM

Thurs, 4.12

1 Quick recap of EM

$$
\mathcal{L} \;=\; \underbrace{\log p(x\mid\theta)}_{\text{log-likelihood}} \;=\; \log \int \underbrace{p(x,z\mid\theta)}_{\text{total-data likelihood}}\,dz \;\;\geq\;\; \int \underbrace{q(z\mid\phi)}_{\text{variational distribution}} \log \frac{p(x,z\mid\theta)}{q(z\mid\phi)}\,dz \;=\; \underbrace{F(\phi,\theta)}_{\text{negative free energy}} \tag{1}
$$

The negative free energy has two convenient forms, which we exploit in the two alternating phases of EM:

$$
F(\phi,\theta) = \log p(x\mid\theta) - \mathrm{KL}\big(q(z\mid\phi)\,\|\,p(z\mid x,\theta)\big) \qquad \text{(used in E-step)} \tag{2}
$$

$$
F(\phi,\theta) = \int q(z\mid\phi)\,\log p(x,z\mid\theta)\,dz + \underbrace{H[q(z\mid\phi)]}_{\text{indep of }\theta} \qquad \text{(used in M-step)} \tag{3}
$$

Specifically, EM involves alternating between:

  • E-step: Update φ by setting q(z|φ) = p(z|x, θ), with θ held fixed.
  • M-step: Update θ by maximizing ∫ q(z|φ) log p(x, z|θ) dz, with φ held fixed.

Note that for discrete latent variable models, where the latent z takes on finitely or countably infinitely many discrete values, the integral over z is replaced by a sum:

$$
F(\phi,\theta) = \sum_{j=1}^{m} q(z=\alpha_j\mid\phi)\,\log p(x, z=\alpha_j\mid\theta) + \underbrace{H[q(z\mid\phi)]}_{\text{indep of }\theta} \tag{4}
$$

where {α1, . . . , αm} are the possible values of z. See slides at the end for two graphical depictions of EM.
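As a concrete illustration, here is a minimal Python sketch of eq. (4) for a finite latent alphabet; the function name and argument layout are our own choices, not from the notes:

```python
import numpy as np

def neg_free_energy(q, log_joint):
    """Negative free energy (eq. 4) for a discrete latent variable.

    q         : (m,) variational probabilities q(z = alpha_j | phi), summing to 1
    log_joint : (m,) values of log p(x, z = alpha_j | theta)
    """
    expected_log_joint = np.sum(q * log_joint)               # first term of eq. (4)
    entropy = -np.sum(q * np.log(np.clip(q, 1e-300, 1.0)))   # H[q], guarding log(0)
    return expected_log_joint + entropy
```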


2 EM for mixture of Gaussians

The model:

$$
z \sim \mathrm{Ber}(p) \tag{5}
$$

$$
x \mid z \;\sim\; \begin{cases} \mathcal{N}(\mu_0, C_0), & \text{if } z = 0 \\ \mathcal{N}(\mu_1, C_1), & \text{if } z = 1 \end{cases} \tag{6}
$$

Suppose we have a dataset consisting of N samples $\{x_i\}_{i=1}^N$. Our model describes these in terms of a set of iid pairs of random variables $\{(z_i, x_i)\}_{i=1}^N$, each consisting of a latent and an observation. These samples are independent under the model, meaning that the negative free energy is given by a sum of independent terms:

$$
F = \sum_{i=1}^{N} \Big[\, q(z_i=0\mid\phi_i)\,\log p(x_i, z_i=0\mid\theta) + q(z_i=1\mid\phi_i)\,\log p(x_i, z_i=1\mid\theta) \,\Big] \tag{7}
$$

where we have dropped the entropy term H[q(z|φ)], which does not depend on θ (cf. eq. 3). Here φi is the variational parameter associated with the i'th latent variable zi.
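Before turning to the updates, it may help to see the generative process of eqs. (5)-(6) in code. This is a minimal sketch with made-up parameter values; all names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for the two-component model (eqs. 5-6)
p = 0.3                                        # P(z = 1)
mu0, mu1 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
C0 = np.array([[1.0, 0.2], [0.2, 1.0]])
C1 = np.array([[0.5, 0.0], [0.0, 0.5]])

N = 500
z = rng.binomial(1, p, size=N)                 # z_i ~ Ber(p), eq. (5)
x = np.where(z[:, None] == 0,                  # eq. (6): pick component by z_i
             rng.multivariate_normal(mu0, C0, size=N),
             rng.multivariate_normal(mu1, C1, size=N))
```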

2.1 E-step

The E-step involves setting q(zi|φi) equal to the conditional distribution of zi given the data and current parameters θ. We will denote these binary probabilities by φi0 and φi1, given by the recognition distribution of the zi under the model:

$$
\phi_{i0} = p(z_i=0\mid x_i,\theta) = \frac{(1-p)\,\mathcal{N}_0(x_i)}{(1-p)\,\mathcal{N}_0(x_i) + p\,\mathcal{N}_1(x_i)} \tag{8}
$$

$$
\phi_{i1} = p(z_i=1\mid x_i,\theta) = \frac{p\,\mathcal{N}_1(x_i)}{(1-p)\,\mathcal{N}_0(x_i) + p\,\mathcal{N}_1(x_i)}, \tag{9}
$$

where $\mathcal{N}_0(x_i) = \mathcal{N}(x_i\mid\mu_0, C_0)$ and $\mathcal{N}_1(x_i) = \mathcal{N}(x_i\mid\mu_1, C_1)$; note that φi0 + φi1 = 1. At the end of the E-step we have a pair of these probabilities for each sample, which can be represented as an N × 2 matrix:

$$
\phi = \begin{bmatrix} \phi_{10} & \phi_{11} \\ \phi_{20} & \phi_{21} \\ \vdots & \vdots \\ \phi_{N0} & \phi_{N1} \end{bmatrix} \tag{10}
$$
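A minimal Python sketch of this E-step (eqs. 8-10), using scipy's multivariate normal density. In practice one would compute in log space for numerical stability, but the direct form mirrors the equations:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(x, p, mu0, C0, mu1, C1):
    """Return the N x 2 responsibility matrix phi of eq. (10)."""
    w0 = (1 - p) * multivariate_normal.pdf(x, mean=mu0, cov=C0)  # (1-p) N_0(x_i)
    w1 = p * multivariate_normal.pdf(x, mean=mu1, cov=C1)        # p N_1(x_i)
    phi = np.column_stack([w0, w1])
    return phi / phi.sum(axis=1, keepdims=True)  # normalize: eqs. (8)-(9)
```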

2.2 M-step

The M-step involves updating the parameters θ = {p, µ0, µ1, C0, C1} using the current variational distribution q(z|φ). To do this, we plug the assignment probabilities {φi0, φi1} from the E-step into the negative free energy (eq. 7) to obtain:

$$
F = \sum_{i=1}^{N} \Big[\, \phi_{i0}\,\log p(x_i, z_i=0\mid\theta) + \phi_{i1}\,\log p(x_i, z_i=1\mid\theta) \,\Big] \tag{11}
$$

$$
= \sum_{i=1}^{N} \Big[\, \phi_{i0}\big(\log(1-p) + \log \mathcal{N}(x_i\mid\mu_0,C_0)\big) + \phi_{i1}\big(\log p + \log \mathcal{N}(x_i\mid\mu_1,C_1)\big) \,\Big] \tag{12}
$$

Maximizing this expression for the model parameters (see next section for derivations) gives updates:

$$
\hat\mu_0 = \frac{1}{\sum_i \phi_{i0}} \sum_i \phi_{i0}\, x_i \tag{13}
$$

$$
\hat\mu_1 = \frac{1}{\sum_i \phi_{i1}} \sum_i \phi_{i1}\, x_i \tag{14}
$$

$$
\hat C_0 = \frac{1}{\sum_i \phi_{i0}} \sum_i \phi_{i0}\,(x_i - \hat\mu_0)(x_i - \hat\mu_0)^\top \tag{15}
$$

$$
\hat C_1 = \frac{1}{\sum_i \phi_{i1}} \sum_i \phi_{i1}\,(x_i - \hat\mu_1)(x_i - \hat\mu_1)^\top \tag{16}
$$

$$
\hat p = \frac{1}{N} \sum_i \phi_{i1} \tag{17}
$$

Note that the mean and covariance updates are formed by taking the weighted average and weighted covariance of the samples, with weights given by the assignment probabilities φi0 and φi1.
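The corresponding M-step in Python might look like the following sketch; the variable names are our own:

```python
import numpy as np

def m_step(x, phi):
    """Update theta = {p, mu0, mu1, C0, C1} from responsibilities (eqs. 13-17)."""
    N = x.shape[0]
    w0, w1 = phi[:, 0], phi[:, 1]
    mu0 = w0 @ x / w0.sum()                    # eq. (13): weighted mean
    mu1 = w1 @ x / w1.sum()                    # eq. (14)
    d0, d1 = x - mu0, x - mu1
    C0 = (w0[:, None] * d0).T @ d0 / w0.sum()  # eq. (15): weighted covariance
    C1 = (w1[:, None] * d1).T @ d1 / w1.sum()  # eq. (16)
    p = w1.sum() / N                           # eq. (17): mean cluster-1 weight
    return p, mu0, C0, mu1, C1
```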

3 Derivation of M-step updates

3.1 Updates for µ0, µ1

To derive the update for µ0, we collect the terms from the free energy (eq. 12) that involve µ0, giving:

$$
F(\mu_0) = \sum_{i=1}^{N} \phi_{i0}\,\log \mathcal{N}(x_i\mid\mu_0, C_0) + \text{const} \tag{18}
$$

$$
= -\tfrac{1}{2} \sum_{i=1}^{N} \phi_{i0}\,(x_i-\mu_0)^\top C_0^{-1}(x_i-\mu_0) + \text{const} \tag{19}
$$

$$
= -\tfrac{1}{2} \sum_{i=1}^{N} \phi_{i0}\,\big(-2\mu_0^\top C_0^{-1} x_i + \mu_0^\top C_0^{-1}\mu_0\big) + \text{const} \tag{20}
$$

$$
= \mu_0^\top C_0^{-1} \Big(\sum_{i=1}^{N} \phi_{i0}\, x_i\Big) - \tfrac{1}{2}\Big(\sum_{i=1}^{N}\phi_{i0}\Big)\,\mu_0^\top C_0^{-1}\mu_0 + \text{const}. \tag{21}
$$


Differentiating with respect to µ0 and setting to zero gives:

$$
\frac{\partial}{\partial \mu_0} F = C_0^{-1} \sum_{i=1}^{N} \phi_{i0}\, x_i - \Big(\sum_{i=1}^{N}\phi_{i0}\Big)\, C_0^{-1}\mu_0 = 0 \tag{22}
$$

$$
\implies \sum_{i=1}^{N} \phi_{i0}\, x_i = \Big(\sum_{i=1}^{N}\phi_{i0}\Big)\,\mu_0 \tag{23}
$$

$$
\implies \hat\mu_0 = \frac{\sum_{i=1}^{N} \phi_{i0}\, x_i}{\sum_{i=1}^{N} \phi_{i0}}. \tag{24}
$$

A similar approach leads to the update for µ1, with weights φi1 instead of φi0.
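As a quick sanity check, the following sketch verifies that the gradient of eq. (22), evaluated at the weighted mean of eq. (24), is zero; the data, weights, and C0 here are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 2))
w = rng.uniform(size=100)                 # stand-ins for the weights phi_i0
C0inv = np.linalg.inv(np.array([[1.0, 0.3], [0.3, 1.0]]))

mu_hat = w @ x / w.sum()                              # eq. (24): weighted mean
grad = C0inv @ (w @ x) - w.sum() * (C0inv @ mu_hat)   # eq. (22)
print(np.allclose(grad, 0.0))             # True: gradient vanishes at mu_hat
```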

3.2 Updates for C0, C1

Matrix derivative identities: Assume C is a symmetric, positive definite matrix. We have the following identities ([1]):

  • log-determinant:

$$
\frac{\partial}{\partial C} \log |C| = C^{-1} \tag{25}
$$

  • quadratic form:

$$
\frac{\partial}{\partial C}\, x^\top C\, x = x x^\top \tag{26}
$$

Derivation: The simplest approach for deriving the update for C0 is to differentiate the negative free energy F with respect to $C_0^{-1}$ and then solve for C0. We assume we already have the updated mean µ̂0 (which did not depend on C0 or any other parameters). The free energy as a function of C0 can be written:

$$
F(C_0) = \sum_{i=1}^{N} \phi_{i0}\,\log \mathcal{N}(x_i\mid\hat\mu_0, C_0) + \text{const} \tag{27}
$$

$$
= \sum_{i=1}^{N} \phi_{i0}\,\Big[ \tfrac{1}{2}\log|C_0^{-1}| - \tfrac{1}{2}(x_i-\hat\mu_0)^\top C_0^{-1}(x_i-\hat\mu_0) \Big] + \text{const} \tag{28}
$$

Differentiating with respect to $C_0^{-1}$ (applying eqs. 25-26 with $C_0^{-1}$ in place of C) gives us:

$$
\frac{\partial}{\partial C_0^{-1}} F = \tfrac{1}{2}\Big(\sum_{i=1}^{N}\phi_{i0}\Big)\, C_0 - \tfrac{1}{2}\sum_{i=1}^{N} \phi_{i0}\,(x_i-\hat\mu_0)(x_i-\hat\mu_0)^\top = 0 \tag{29}
$$

$$
\implies \hat C_0 = \frac{1}{\sum_{i=1}^{N}\phi_{i0}} \sum_{i=1}^{N} \phi_{i0}\,(x_i-\hat\mu_0)(x_i-\hat\mu_0)^\top, \tag{30}
$$


which, as noted above, is simply the covariance matrix of the samples weighted by their recognition weights. The same derivation can be used for C1.
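Eq. (30) also matches NumPy's built-in weighted covariance, which gives a one-line cross-check of the update; the data and weights below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(200, 2))
w = rng.uniform(size=200)                 # stand-ins for the weights phi_i0

mu_hat = w @ x / w.sum()                  # eq. (24)
d = x - mu_hat
C_hat = (w[:, None] * d).T @ d / w.sum()  # eq. (30)

# np.cov with aweights and ddof=0 normalizes by sum(w), matching eq. (30)
print(np.allclose(C_hat, np.cov(x, rowvar=False, aweights=w, ddof=0)))  # True
```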

3.3 Mixing probability p update

Finally, the update for p is obtained by collecting the terms involving p:

$$
F(p) = \sum_{i=1}^{N} \Big[\, \phi_{i0}\log(1-p) + \phi_{i1}\log p \,\Big] + \text{const} \tag{31}
$$

$$
= \log(1-p)\Big(\sum_{i=1}^{N}\phi_{i0}\Big) + (\log p)\Big(\sum_{i=1}^{N}\phi_{i1}\Big) + \text{const} \tag{32}
$$

Differentiating and setting to zero gives:

$$
\frac{\partial}{\partial p} F = \frac{1}{p-1}\sum_{i=1}^{N}\phi_{i0} + \frac{1}{p}\sum_{i=1}^{N}\phi_{i1} = 0 \tag{33}
$$

$$
\implies p \sum_{i=1}^{N}\phi_{i0} + (p-1)\sum_{i=1}^{N}\phi_{i1} = 0 \tag{34}
$$

$$
\implies p \sum_{i=1}^{N}\big(\phi_{i0} + \phi_{i1}\big) = \sum_{i=1}^{N}\phi_{i1} \tag{35}
$$

$$
\implies \hat p = \frac{1}{N} \sum_{i=1}^{N}\phi_{i1}, \tag{36}
$$

where we have used φi0 + φi1 = 1 for all i. Thus the M-step estimate of p is simply the average probability assigned to cluster 1.
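Putting the pieces together, here is a sketch of the full EM loop, reusing the hypothetical e_step and m_step functions from the sketches in Sections 2.1 and 2.2. The initialization below is an arbitrary choice, not from the notes; the marginal log-likelihood is included because EM guarantees it never decreases across iterations, which makes it a handy convergence check:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(x, p, mu0, C0, mu1, C1):
    """Marginal log-likelihood: sum_i log[(1-p) N_0(x_i) + p N_1(x_i)]."""
    mix = ((1 - p) * multivariate_normal.pdf(x, mean=mu0, cov=C0)
           + p * multivariate_normal.pdf(x, mean=mu1, cov=C1))
    return np.log(mix).sum()

def fit_em(x, n_iters=50):
    """Alternate E-steps and M-steps (e_step/m_step as sketched above)."""
    # Crude initialization: two data points as means, shared sample covariance.
    p, mu0, mu1 = 0.5, x[0].copy(), x[-1].copy()
    C0 = C1 = np.cov(x, rowvar=False)
    for _ in range(n_iters):
        phi = e_step(x, p, mu0, C0, mu1, C1)   # E-step (eqs. 8-10)
        p, mu0, C0, mu1, C1 = m_step(x, phi)   # M-step (eqs. 13-17)
    return p, mu0, C0, mu1, C1
```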

References

[1] K. B. Petersen and M. S. Pedersen. The Matrix Cookbook, Oct 2008. Version 20081110.