SLIDE 1

EM & Variational Bayes

Hanxiao Liu, September 9, 2014

SLIDE 2

Outline

1. EM Algorithm
   1.1 Introduction
   1.2 Example: Mixture of vMFs

2. Variational Bayes
   2.1 Introduction
   2.2 Example: Bayesian Mixture of Gaussians

SLIDE 3

MLE by Gradient Ascent

Goal: maximize L(θ; X) = log p(X|θ) w.r.t. θ

Gradient Ascent (GA)

◮ One-step view: θ^{t+1} ← θ^t + ∇L(θ^t; X)

◮ Two-step view:
  1. Q(θ; θ^t) = L(θ^t; X) + (θ − θ^t)^⊤ ∇L(θ^t; X) − (1/2) ‖θ − θ^t‖₂²
  2. θ^{t+1} ← argmax_θ Q(θ; θ^t)

Drawbacks

1. ∇L can be too complicated to work with
2. Too general to be efficient for structured problems
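To make the one-step view concrete, here is a minimal gradient-ascent sketch (not from the slides); `grad_log_lik` is a hypothetical user-supplied gradient of the log-likelihood, and `lr` is a step size the slide leaves implicit:

```python
import numpy as np

def mle_gradient_ascent(grad_log_lik, theta0, X, lr=0.1, n_iters=100):
    """One-step view: theta <- theta + lr * grad L(theta; X)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta + lr * grad_log_lik(theta, X)  # ascend the log-likelihood
    return theta
```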

SLIDE 4

MLE by EM

Expectation-Maximization (EM)

1. Expectation: Q(θ; θ^t) = E_{Z|X,θ^t} L(θ; X, Z)
2. Maximization: θ^{t+1} ← argmax_θ Q(θ; θ^t)

◮ Replace L(θ; X) (the log-likelihood) by L(θ; X, Z) (the complete log-likelihood)
◮ L(θ; X, Z) is a random function w.r.t. Z; use the expected function as a surrogate
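The two steps translate into a generic loop; a minimal sketch with hypothetical `e_step`/`m_step` callbacks standing in for a concrete model (the vMF mixture below instantiates both):

```python
def em(e_step, m_step, theta0, X, n_iters=100):
    """Generic EM: alternate Expectation and Maximization.

    e_step(theta, X) returns whatever statistics define Q(.; theta);
    m_step(stats, X) returns the maximizer of that surrogate.
    """
    theta = theta0
    for _ in range(n_iters):
        stats = e_step(theta, X)   # Expectation: build the surrogate Q
        theta = m_step(stats, X)   # Maximization: maximize Q
    return theta
```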

SLIDE 5

Why EM is superior

A comparison between the Q(θ; θ^t)'s, i.e., the local concave models:

1. EM:
   Q(θ; θ^t) = E_{Z|X,θ^t} L(θ; X, Z) = L(θ; X) − D_KL(p(Z|X, θ^t) ‖ p(Z|X, θ)) + C

2. GA:
   Q(θ; θ^t) = L(θ^t; X) + (θ − θ^t)^⊤ ∇L(θ^t; X) − (1/2) ‖θ − θ^t‖₂²
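The EM line follows from Bayes' rule in one step; writing p(X, Z|θ) = p(X|θ) p(Z|X, θ):

```latex
\begin{aligned}
Q(\theta;\theta^t)
&= \mathbb{E}_{Z|X,\theta^t}\bigl[\log p(X\mid\theta) + \log p(Z\mid X,\theta)\bigr] \\
&= \mathcal{L}(\theta;X)
   - D_{\mathrm{KL}}\bigl(p(Z\mid X,\theta^t)\,\big\|\,p(Z\mid X,\theta)\bigr)
   + \underbrace{\mathbb{E}_{Z|X,\theta^t}\log p(Z\mid X,\theta^t)}_{C,\ \text{free of }\theta}.
\end{aligned}
```

So EM's surrogate lower-bounds L and touches it at θ^t with no step-size or smoothness assumptions, whereas GA's quadratic model is only a local guess.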

SLIDE 6

Example: vMF mixture

Notations

◮ X = {x_i}_{i=1}^n, θ = (π ∈ Δ^{k−1}, {(µ_j, κ_j)}_{j=1}^k)
◮ Z = {z_ij ∈ {0, 1}}
◮ z_ij = 1 ⟹ x_i ∼ the j-th mixture component

Log-likelihood

L(θ; X) = Σ_{i=1}^n log p(x_i|θ) = Σ_{i=1}^n log Σ_{j=1}^k π_j vMF(x_i|µ_j, κ_j)    (log-sum coupling)

Complete log-likelihood

L(θ; X, Z) = Σ_{i=1}^n log p(x_i, z_i|θ) = Σ_{i=1}^n Σ_{j=1}^k z_ij log(π_j vMF(x_i|µ_j, κ_j))
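As a concrete helper for the steps that follow, a sketch of the vMF log-density (normalizer C_p(κ) as given on the κ-update M-step slide, Slide 11), using SciPy's exponentially scaled Bessel function for numerical stability:

```python
import numpy as np
from scipy.special import ive  # ive(v, k) = iv(v, k) * exp(-k)

def log_vmf(X, mu, kappa):
    """Row-wise log vMF(x_i | mu, kappa) for unit vectors X (n x p).

    log C_p(kappa) = (p/2 - 1) log kappa - (p/2) log 2*pi - log I_{p/2-1}(kappa),
    with log I_v(kappa) = log ive(v, kappa) + kappa for kappa > 0.
    """
    p = X.shape[1]
    v = p / 2.0 - 1.0
    log_cp = v * np.log(kappa) - (p / 2.0) * np.log(2.0 * np.pi) \
             - (np.log(ive(v, kappa)) + kappa)
    return log_cp + kappa * (X @ mu)
```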
SLIDE 7

E-step

Compute Q(θ; θ^t) ≜ E_{Z|X,θ^t} L(θ; X, Z):

Q(π, µ, κ; π^t, µ^t, κ^t) = E_{Z|X,π^t,µ^t,κ^t} Σ_{i=1}^n Σ_{j=1}^k z_ij log(π_j vMF(x_i|µ_j, κ_j))
                          = Σ_{i=1}^n Σ_{j=1}^k w^t_ij log vMF(x_i|µ_j, κ_j) + w^t_ij log π_j

where

w^t_ij = E_{z_ij|X,π^t,µ^t,κ^t}[z_ij] = p(z_ij = 1|x_i, π^t, µ^t, κ^t)
       = π^t_j · vMF(x_i|µ^t_j, κ^t_j) / Σ_{u=1}^k π^t_u · vMF(x_i|µ^t_u, κ^t_u)
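A sketch of this E-step, reusing `log_vmf` from above and working in log-space so that small densities do not underflow:

```python
import numpy as np

def e_step(X, log_pi, mus, kappas):
    """Responsibilities w[i, j] = p(z_ij = 1 | x_i, pi^t, mu^t, kappa^t)."""
    # n x k matrix of log pi_j + log vMF(x_i | mu_j, kappa_j)
    log_w = np.stack([log_pi[j] + log_vmf(X, mus[j], kappas[j])
                      for j in range(len(log_pi))], axis=1)
    log_w -= log_w.max(axis=1, keepdims=True)   # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)     # normalize over the k components
```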

SLIDE 8

M-step

Maximize

Q(π, µ, κ; π^t, µ^t, κ^t) = Σ_{i=1}^n Σ_{j=1}^k w^t_ij log vMF(x_i|µ_j, κ_j) + w^t_ij log π_j

w.r.t. π, µ and κ, s.t. ‖π‖₁ = 1 and ‖µ_j‖₂ = 1, ∀j ∈ [k].

To impose the constraints, maximize the Lagrangian

Q̃ ≜ Q + λ(1 − π^⊤1) + Σ_{j=1}^k ν_j(1 − µ_j^⊤µ_j)
SLIDE 9

M-step

Q̃(π, µ, κ; π^t, µ^t, κ^t) = Σ_{i=1}^n Σ_{j=1}^k [w^t_ij log vMF(x_i|µ_j, κ_j) + w^t_ij log π_j]
                           + λ(1 − π^⊤1) + Σ_{j=1}^k ν_j(1 − µ_j^⊤µ_j)

Updating π^t_j

Combining Σ_{j=1}^k π_j = Σ_{j=1}^k w^t_ij = 1 with

∂_{π_j} Q̃ = (Σ_{i=1}^n w^t_ij) / π_j − λ = 0  ⟹  π^{t+1}_j = (Σ_{i=1}^n w^t_ij) / n

SLIDE 10

M-step

Q̃(π, µ, κ; π^t, µ^t, κ^t) = Σ_{i=1}^n Σ_{j=1}^k [w^t_ij log vMF(x_i|µ_j, κ_j) + w^t_ij log π_j]
                           + λ(1 − π^⊤1) + Σ_{j=1}^k ν_j(1 − µ_j^⊤µ_j)

Updating µ^t_j

log vMF(x_i|µ_j, κ_j) = κ_j µ_j^⊤ x_i + C    (w.r.t. µ_j)

∂_{µ_j} Q̃ = κ_j Σ_{i=1}^n w^t_ij x_i − 2ν_j µ_j = 0

⟹  µ^{t+1}_j = r_j / ‖r_j‖₂, where r_j = Σ_{i=1}^n w^t_ij x_i
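Both closed-form updates (the π and µ slides above) in one sketch, with `w` the n × k responsibility matrix from the E-step:

```python
import numpy as np

def m_step_pi_mu(X, w):
    """pi_j = (1/n) sum_i w_ij;  mu_j = r_j / ||r_j||_2 with r_j = sum_i w_ij x_i."""
    pi = w.sum(axis=0) / X.shape[0]   # stays on the simplex by construction
    R = w.T @ X                       # k x p matrix whose rows are the r_j
    mus = R / np.linalg.norm(R, axis=1, keepdims=True)
    return pi, mus
```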

SLIDE 11

M-step

Updating κ^t_j

◮ the vMF normalizing constant: C_p(κ_j) = κ_j^{p/2−1} / ((2π)^{p/2} I_{p/2−1}(κ_j))
◮ the recurrence property of the modified Bessel function¹:

  ∂_{κ_j} log I_{p/2−1}(κ_j) = (p/2 − 1)/κ_j + I_{p/2}(κ_j) / I_{p/2−1}(κ_j)

∂_{κ_j} Q̃ = Σ_{i=1}^n w^t_ij (µ_j^⊤ x_i − I_{p/2}(κ_j) / I_{p/2−1}(κ_j)) = 0

⟹  I_{p/2}(κ_j) / I_{p/2−1}(κ_j) = r̄_j  ⟹  κ^{t+1}_j ≈ (r̄_j p − r̄_j³) / (1 − r̄_j²)    (Banerjee et al., 2005)

where r̄_j = Σ_{i=1}^n w^t_ij µ_j^⊤ x_i / Σ_{i=1}^n w^t_ij

¹ http://functions.wolfram.com/Bessel-TypeFunctions/BesselK/introductions/Bessels/05/
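A sketch of the approximate κ update under the same conventions as the previous snippets:

```python
import numpy as np

def m_step_kappa(X, w, mus, p):
    """kappa_j ~= (r_bar_j * p - r_bar_j**3) / (1 - r_bar_j**2), the approximate
    inversion of the Bessel-function ratio cited on this slide."""
    # r_bar_j = sum_i w_ij mu_j^T x_i / sum_i w_ij; (X @ mus.T)[i, j] = mu_j^T x_i
    r_bar = np.einsum('ij,ij->j', w, X @ mus.T) / w.sum(axis=0)
    return (r_bar * p - r_bar**3) / (1.0 - r_bar**2)
```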

SLIDE 12

An alternative view of EM

EM - original definition

1. Expectation: Q(θ; θ^t) = E_{Z|X,θ^t} L(θ; X, Z)    (why?)
2. Maximization: θ^{t+1} ← argmax_θ Q(θ; θ^t)

For any distribution q(Z),

L(θ; X) = E_q log p(X|θ) = E_q[log (p(X, Z|θ) / q(Z))] + E_q[log (q(Z) / p(Z|X, θ))]
        = VLB(q, θ) + D_KL(q(Z) ‖ p(Z|X, θ))

EM - coordinate ascent

1. q^{t+1} = argmax_q VLB(q, θ^t)
2. θ^{t+1} = argmax_θ VLB(q^{t+1}, θ)

Show the equivalence?
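To show the equivalence: L(θ; X) does not depend on q and D_KL ≥ 0, so the first step is solved exactly by the posterior, and the second step then maximizes the original Q up to a constant:

```latex
\begin{aligned}
q^{t+1} &= \operatorname*{argmax}_q \mathrm{VLB}(q,\theta^t)
         = p(Z\mid X,\theta^t)
         \qquad\text{(drives the KL term to } 0\text{)},\\
\mathrm{VLB}(q^{t+1},\theta)
        &= \mathbb{E}_{Z|X,\theta^t}\log p(X,Z\mid\theta)
         + H\bigl(p(Z\mid X,\theta^t)\bigr)
         = Q(\theta;\theta^t) + C.
\end{aligned}
```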

SLIDE 13

Bayes Inference

Notations

◮ θ: hyperparameters
◮ Z: hidden variables + random parameters

Goals

1. find a good posterior approximation q(Z) ≈ p(Z|X; θ)
2. estimate θ by Empirical Bayes, i.e., maximize L(θ; X) w.r.t. θ

L(θ; X) = E_q[log (p(X, Z|θ) / q(Z))] + E_q[log (q(Z) / p(Z|X, θ))] = VLB(q, θ) + D_KL(q(Z) ‖ p(Z|X, θ))

Both goals can be achieved via the same procedure as EM.

SLIDE 14

Variational Bayes Inference

One should have q → p(Z|X, θ*) by alternating between

1. q^{t+1} = argmax_q VLB(q, θ^t)
2. θ^{t+1} = argmax_θ VLB(q^{t+1}, θ)

However, we do not want q to be too complicated

◮ e.g., Q(θ; θ^t) = E_q L(θ; X, Z) can be intractable

Solution: modify the first step as q^{t+1} = argmax_{q∈Q} VLB(q, θ^t), where Q is some tractable family of distributions.

◮ Recall: without the constraint q ∈ Q, q^{t+1} ≡ p(Z|X, θ^t)

SLIDE 15

Variational Bayes Inference

Goal: solve argmax_{q∈Q} VLB(q, θ^t). Usually Q = {q | q(Z) = Π_{i=1}^M q_i(Z_i)}, abbreviating q_i(Z_i) as q_i (the mean-field family).

Coordinate ascent

VLB(q_j; q_{−j}, θ^t) = E_q[log (p(X, Z; θ^t) / q(Z))]
                      = E_q log p(X, Z; θ^t) − Σ_{i=1}^M E_{q_i} log q_i
                      = E_{q_j}[E_{q_{−j}} log p(X, Z; θ^t)] − E_{q_j} log q_j + C
                      = −D_KL(q_j ‖ q̃_j) + C,  where q̃_j ∝ exp(E_{q_{−j}} log p(X, Z; θ^t))

⟹  log q*_j = E_{q_{−j}} log p(X, Z; θ^t) + const

SLIDE 16

Example: Bayes Mixture of Gaussians

Consider putting a prior over the means in a Gaussian mixture²:

◮ For k = 1, 2, ..., K: µ_k ∼ N(0, τ²)
◮ For i = 1, 2, ..., N:
  1. z_i ∼ Mult(π)
  2. x_i ∼ N(µ_{z_i}, σ²)

p(z, µ|X) = p(X|z, µ) p(z) p(µ) / p(X)
          = Π_{i=1}^N p(z_i) p(x_i|z_i, µ) Π_{k=1}^K p(µ_k) / (Σ_z ∫ Π_{i=1}^N p(z_i) p(x_i|z_i, µ) Π_{k=1}^K p(µ_k) dµ)

The mean-field family:

q(z, µ) = Π_{i=1}^N q(z_i; φ_i) Π_{k=1}^K q(µ_k; µ̃_k, σ̃²_k)

² https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf
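A sketch of this generative process (function and variable names are illustrative only):

```python
import numpy as np

def sample_bmog(N, K, pi, tau, sigma, seed=0):
    """mu_k ~ N(0, tau^2); z_i ~ Mult(pi); x_i ~ N(mu_{z_i}, sigma^2)."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(0.0, tau, size=K)   # component means, drawn once
    z = rng.choice(K, size=N, p=pi)     # cluster assignments
    x = rng.normal(mu[z], sigma)        # one observation per assignment
    return x, z, mu
```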

SLIDE 17

Example: Bayes Mixture of Gaussians

log q*(z_j) = E_{q\z_j} log p(z, µ, X)
            = E_{q\z_j}[Σ_{i=1}^N (log p(z_i) + log p(x_i|z_i, µ)) + Σ_{k=1}^K log p(µ_k)]
            = log p(z_j) + E_{q(µ_{z_j})} log p(x_j|z_j, µ_{z_j}) + C
            = log π_{z_j} + x_j E_{q(µ_{z_j})}[µ_{z_j}] − (1/2) E_{q(µ_{z_j})}[µ²_{z_j}] + C

where E_{q(µ_{z_j})}[µ_{z_j}] = µ̃_{z_j} and E_{q(µ_{z_j})}[µ²_{z_j}] = µ̃²_{z_j} + σ̃²_{z_j}.

By observation q*(z_j) ∼ Mult, so we can update φ_j accordingly.
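In code, the update just exponentiates and normalizes the line above; a sketch, taking σ² = 1 as the slide's expansion implicitly does:

```python
import numpy as np

def update_phi(x, log_pi, mu_tilde, var_tilde):
    """phi[i, k] ∝ pi_k exp(x_i E[mu_k] - E[mu_k^2]/2), E[mu_k^2] = mu~_k^2 + var~_k."""
    log_phi = log_pi + np.outer(x, mu_tilde) - 0.5 * (mu_tilde**2 + var_tilde)
    log_phi -= log_phi.max(axis=1, keepdims=True)  # drop the additive constant C
    phi = np.exp(log_phi)
    return phi / phi.sum(axis=1, keepdims=True)
```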

SLIDE 18

Example: Bayes Mixture of Gaussians

log q*(µ_j) = E_{q\µ_j} log p(z, µ, X)
            = E_{q\µ_j}[Σ_{i=1}^N (log p(z_i) + log p(x_i|z_i, µ_{z_i})) + Σ_{k=1}^K log p(µ_k)]
            = E_{q\µ_j}[Σ_{i=1}^N Σ_{k=1}^K δ_{z_i=k} log N(x_i|µ_k)] + log p(µ_j) + C
            = Σ_{i=1}^N E_{z_i}[δ_{z_i=j}] log N(x_i|µ_j) + log p(µ_j) + C,  where E_{z_i}[δ_{z_i=j}] = φ^j_i

Observing that q*(µ_j) ∼ N, µ̃_j and σ̃²_j can be updated accordingly.
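Completing the square in log q*(µ_j) (prior N(0, τ²), and again σ² = 1) gives a Gaussian with precision 1/τ² + Σ_i φ^j_i and mean σ̃²_j Σ_i φ^j_i x_i. A sketch of that update and the full coordinate-ascent loop, reusing `update_phi` from the previous slide:

```python
import numpy as np

def update_mu(x, phi, tau):
    """q*(mu_k) = N(mu~_k, var~_k) by completing the square."""
    var_tilde = 1.0 / (1.0 / tau**2 + phi.sum(axis=0))  # posterior variances
    mu_tilde = var_tilde * (phi.T @ x)                  # posterior means
    return mu_tilde, var_tilde

def cavi(x, K, pi, tau, n_iters=50, seed=0):
    """Alternate the two mean-field coordinate updates."""
    rng = np.random.default_rng(seed)
    mu_tilde = rng.normal(0.0, tau, size=K)  # random initialization
    var_tilde = np.ones(K)
    for _ in range(n_iters):
        phi = update_phi(x, np.log(pi), mu_tilde, var_tilde)
        mu_tilde, var_tilde = update_mu(x, phi, tau)
    return phi, mu_tilde, var_tilde
```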

SLIDE 19

Stay tuned

Next topics

◮ LDA (Wanli)
◮ Bayes vMF