SLIDE 1

Variational Inference for Bayes vMF Mixture

Hanxiao Liu
September 23, 2014

SLIDE 2

Variational Inference Review

Lower bound the likelihood:

$$
\mathcal{L}(\theta; X) = \mathbb{E}_q \log p(X \mid \theta)
= \underbrace{\mathbb{E}_q \log \frac{p(X, Z \mid \theta)}{q(Z)}}_{\mathrm{VLB}(q,\,\theta)}
+ \underbrace{\mathbb{E}_q \log \frac{q(Z)}{p(Z \mid X, \theta)}}_{D_{\mathrm{KL}}(q(Z)\,\|\,p(Z \mid X,\theta))}
$$

Raise $\mathrm{VLB}(q, \theta)$ by coordinate ascent:

1. $q^{t+1} = \operatorname*{argmax}_{q = \prod_{i=1}^{M} q_i} \mathrm{VLB}(q, \theta^t)$
2. $\theta^{t+1} = \operatorname*{argmax}_{\theta} \mathrm{VLB}(q^{t+1}, \theta)$
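To make the two steps concrete, here is a minimal generic skeleton in Python. It is a sketch, not part of the original slides: `update_q`, `update_theta`, and `vlb` are hypothetical callbacks that a specific model (such as the vMF mixture below) must supply.

```python
import numpy as np

def variational_em(X, q, theta, update_q, update_theta, vlb, max_iters=100, tol=1e-6):
    """Generic coordinate ascent on VLB(q, theta).

    update_q:     step 1, q^{t+1} = argmax_q VLB(q, theta^t)
    update_theta: step 2, theta^{t+1} = argmax_theta VLB(q^{t+1}, theta)
    vlb:          evaluates the variational lower bound, used to monitor convergence
    """
    prev = -np.inf
    for _ in range(max_iters):
        q = update_q(X, q, theta)
        theta = update_theta(X, q)
        cur = vlb(X, q, theta)
        if cur - prev < tol:  # VLB is non-decreasing under coordinate ascent
            break
        prev = cur
    return q, theta
```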
SLIDE 3

Variational Inference Review

Goal: solve

$$
\operatorname*{argmax}_{q = \prod_{i=1}^{M} q_i} \mathrm{VLB}(q, \theta^t)
$$

by coordinate ascent, i.e. sequentially updating a single $q_i$ in each iteration. Each coordinate step has a closed-form solution:

$$
\begin{aligned}
\mathrm{VLB}(q_j; q_{-j}, \theta^t)
&= \mathbb{E}_q \log \frac{p(X, Z \mid \theta^t)}{q(Z)}
= \mathbb{E}_q \log p(X, Z \mid \theta^t) - \sum_{i=1}^{M} \mathbb{E}_q \log q_i \\
&= \mathbb{E}_{q_j}\big[\underbrace{\mathbb{E}_{q_{-j}} \log p(X, Z \mid \theta^t)}_{\log \tilde{q}_j + \text{const}}\big] - \mathbb{E}_{q_j} \log q_j + \text{const} \\
&= \mathbb{E}_{q_j} \log \frac{\tilde{q}_j}{q_j} + \text{const}
= -D_{\mathrm{KL}}(q_j \,\|\, \tilde{q}_j) + \text{const}
\end{aligned}
$$

$$
\Longrightarrow\quad \log q_j^* = \mathbb{E}_{q_{-j}} \log p(X, Z \mid \theta^t) + \text{const}
$$

SLIDE 4

Bayes vMF Mixture

[Gopal and Yang, 2014]

Generative model:

◮ $\pi \sim \mathrm{Dirichlet}(\cdot \mid \alpha)$
◮ $\mu_k \sim \mathrm{vMF}(\cdot \mid \mu_0, C_0)$
◮ $\kappa_k \sim \mathrm{logNormal}(\cdot \mid m, \sigma^2)$
◮ $z_i \sim \mathrm{Multi}(\cdot \mid \pi)$
◮ $x_i \sim \mathrm{vMF}(\cdot \mid \mu_{z_i}, \kappa_{z_i})$

Mean-field ansatz (each $\stackrel{?}{\equiv}$ marks a guessed family, to be checked against the optimal updates derived below):

◮ $q(\pi) \stackrel{?}{\equiv} \mathrm{Dirichlet}(\cdot \mid \rho)$
◮ $q(\mu_k) \stackrel{?}{\equiv} \mathrm{vMF}(\cdot \mid \psi_k, \gamma_k)$
◮ $q(\kappa_k) \stackrel{?}{\equiv} \mathrm{logNormal}(\cdot \mid a_k, b_k)$
◮ $q(z_i) \stackrel{?}{\equiv} \mathrm{Multi}(\cdot \mid \lambda_i)$
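For concreteness, here is a sketch of forward sampling from this generative model (not part of the original slides). It assumes SciPy ≥ 1.11 for `scipy.stats.vonmises_fisher`, and that `mu0` is a unit vector:

```python
import numpy as np
from scipy.stats import vonmises_fisher  # requires SciPy >= 1.11

def sample_vmf_mixture(N, K, alpha, mu0, C0, m, sigma2, rng=None):
    """Forward-sample one dataset (X, z) from the Bayes vMF mixture."""
    rng = np.random.default_rng(rng)
    D = len(mu0)  # mu0 must be a unit vector in R^D, D >= 2
    pi = rng.dirichlet(np.full(K, alpha))                    # pi ~ Dirichlet(alpha)
    mu = vonmises_fisher(mu0, C0).rvs(K, random_state=rng)   # mu_k ~ vMF(mu0, C0), shape (K, D)
    kappa = rng.lognormal(m, np.sqrt(sigma2), K)             # kappa_k ~ logNormal(m, sigma^2)
    z = rng.choice(K, size=N, p=pi)                          # z_i ~ Multi(pi)
    X = np.empty((N, D))
    for k in range(K):                                       # x_i ~ vMF(mu_{z_i}, kappa_{z_i})
        idx = np.flatnonzero(z == k)
        if idx.size:
            X[idx] = vonmises_fisher(mu[k], kappa[k]).rvs(idx.size, random_state=rng)
    return X, z
```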

SLIDE 5

Compute log p (X, Z|θ)

$$
p(X, Z \mid \theta) = \mathrm{Dirichlet}(\pi \mid \alpha)
\times \prod_{i=1}^{N} \mathrm{Multi}(z_i \mid \pi)\,\mathrm{vMF}(x_i \mid \mu_{z_i}, \kappa_{z_i})
\times \prod_{k=1}^{K} \mathrm{vMF}(\mu_k \mid \mu_0, C_0)\,\mathrm{logNormal}(\kappa_k \mid m, \sigma^2)
$$

$$
\begin{aligned}
\log p(X, Z \mid \theta) = {}& -\log B(\alpha)
+ \sum_{k=1}^{K} (\alpha - 1)\log\pi_k
+ \sum_{i=1}^{N}\sum_{k=1}^{K} z_{ik}\log\pi_k \\
&+ \sum_{i=1}^{N}\sum_{k=1}^{K} z_{ik}\left(\log C_D(\kappa_k) + \kappa_k x_i^{\top}\mu_k\right)
+ \sum_{k=1}^{K}\left(\log C_D(C_0) + C_0\,\mu_k^{\top}\mu_0\right) \\
&+ \sum_{k=1}^{K}\left(-\log\kappa_k - \tfrac{1}{2}\log(2\pi\sigma^2) - \frac{(\log\kappa_k - m)^2}{2\sigma^2}\right)
\end{aligned}
$$
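The updates that follow repeatedly evaluate $\log C_D(\kappa)$. Assuming the standard vMF normalizer $C_D(\kappa) = \kappa^{D/2-1}\big/\big((2\pi)^{D/2} I_{D/2-1}(\kappa)\big)$ [Banerjee et al., 2005], here is a numerically stable helper (a sketch, reused in later snippets):

```python
import numpy as np
from scipy.special import ive  # ive(v, x) = iv(v, x) * exp(-x), stable for large x

def log_CD(kappa, D):
    """log C_D(kappa) = (D/2 - 1) log kappa - (D/2) log(2 pi) - log I_{D/2-1}(kappa).

    log I_v(x) is recovered from the exponentially scaled Bessel function as
    log(ive(v, x)) + x, which avoids overflow for large kappa.
    """
    v = D / 2 - 1
    return v * np.log(kappa) - (D / 2) * np.log(2 * np.pi) - (np.log(ive(v, kappa)) + kappa)
```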
SLIDE 6

Updating q (π)

q (π)

?

≡ Dirichlet (·|ρ) log q∗ (π) = Eq\π log p (X, Z|θ) + const = Eq\π

K

  • k=1

(α − 1) log πk +

N

  • i=1

K

  • k=1

zik log πk

  • + const

=

K

  • k=1
  • α +

N

  • i=1

Eq [zik] − 1

  • log πk + const

= ⇒ q∗ (π) ∝

K

  • k=1

π

α+N

i=1 Eq[zik]−1

k

∼ Dirichlet = ⇒ ρ∗

k = α + N

  • i=1

Eq [zik]

6 / 14
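In code this update is a single line; a sketch assuming `lam` is the $(N, K)$ matrix of responsibilities with $\mathbb{E}_q[z_{ik}]$ stored at `lam[i, k]` (an illustrative name, not from the slides):

```python
def update_rho(alpha, lam):
    """rho*_k = alpha + sum_i E_q[z_ik]."""
    return alpha + lam.sum(axis=0)  # shape (K,)
```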

SLIDE 7

Updating q (zi)

$$q(z_i) \stackrel{?}{\equiv} \mathrm{Multi}(\cdot \mid \lambda_i)$$

$$
\begin{aligned}
\log q^*(z_i) &= \mathbb{E}_{q\setminus z_i} \log p(X, Z \mid \theta) + \text{const} \\
&= \mathbb{E}_{q\setminus z_i}\left[\sum_{i=1}^{N}\sum_{k=1}^{K} z_{ik}\log\pi_k + \sum_{i=1}^{N}\sum_{k=1}^{K} z_{ik}\left(\log C_D(\kappa_k) + \kappa_k x_i^{\top}\mu_k\right)\right] + \text{const} \\
&= \sum_{k=1}^{K} z_{ik}\left(\mathbb{E}_q\log\pi_k + \mathbb{E}_q\log C_D(\kappa_k) + \mathbb{E}_q[\kappa_k]\,x_i^{\top}\mathbb{E}_q[\mu_k]\right) + \text{const}
\end{aligned}
$$

$$
\Longrightarrow\quad q^*(z_i) \sim \mathrm{Multi}, \qquad
\lambda_{ik}^* \propto e^{\,\mathbb{E}_q\log\pi_k + \mathbb{E}_q\log C_D(\kappa_k) + \mathbb{E}_q[\kappa_k]\,x_i^{\top}\mathbb{E}_q[\mu_k]}
$$

Assume $\mathbb{E}_q\log\pi_k$, $\mathbb{E}_q\log C_D(\kappa_k)$, $\mathbb{E}_q[\kappa_k]$ and $\mathbb{E}_q[\mu_k]$ are already known; we will compute them explicitly later.
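A sketch of the normalized update, with the expectations passed in as arrays (the names are illustrative assumptions, not from the slides); the log-sum-exp keeps the normalization numerically stable:

```python
import numpy as np
from scipy.special import logsumexp

def update_lambda(X, E_log_pi, E_log_CD, E_kappa, E_mu):
    """lambda*_ik ∝ exp(E[log pi_k] + E[log C_D(kappa_k)] + E[kappa_k] x_i^T E[mu_k]).

    X: (N, D) unit vectors; E_log_pi, E_log_CD, E_kappa: (K,); E_mu: (K, D).
    """
    logits = E_log_pi + E_log_CD + E_kappa * (X @ E_mu.T)  # (N, K)
    return np.exp(logits - logsumexp(logits, axis=1, keepdims=True))
```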

SLIDE 8

Updating q (µk)

$$q(\mu_k) \stackrel{?}{\equiv} \mathrm{vMF}(\cdot \mid \psi_k, \gamma_k)$$

$$
\begin{aligned}
\log q^*(\mu_k) &= \mathbb{E}_{q\setminus\mu_k} \log p(X, Z \mid \theta) + \text{const} \\
&= \mathbb{E}_{q\setminus\mu_k}\left[\sum_{i=1}^{N}\sum_{j=1}^{K} z_{ij}\,\kappa_j x_i^{\top}\mu_j + \sum_{j=1}^{K} C_0\,\mu_j^{\top}\mu_0\right] + \text{const} \\
&= \mathbb{E}_q[\kappa_k]\sum_{i=1}^{N}\mathbb{E}_q[z_{ik}]\,x_i^{\top}\mu_k + C_0\,\mu_k^{\top}\mu_0 + \text{const}
\end{aligned}
$$

$$
\Longrightarrow\quad q^*(\mu_k) \propto e^{\left(\mathbb{E}_q[\kappa_k]\sum_{i=1}^{N}\mathbb{E}_q[z_{ik}]\,x_i + C_0\mu_0\right)^{\!\top}\mu_k} \sim \mathrm{vMF}
$$

$$
\gamma_k^* = \left\lVert\,\mathbb{E}_q[\kappa_k]\sum_{i=1}^{N}\mathbb{E}_q[z_{ik}]\,x_i + C_0\mu_0\right\rVert, \qquad
\psi_k^* = \frac{\mathbb{E}_q[\kappa_k]\sum_{i=1}^{N}\mathbb{E}_q[z_{ik}]\,x_i + C_0\mu_0}{\gamma_k^*}
$$
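A sketch of this update, keeping the assumed array names (`lam`, `E_kappa`) from the earlier snippets:

```python
import numpy as np

def update_mu(X, lam, E_kappa, C0, mu0):
    """t_k = E[kappa_k] sum_i lam_ik x_i + C0 mu0,
    then gamma*_k = ||t_k|| and psi*_k = t_k / gamma*_k."""
    t = E_kappa[:, None] * (lam.T @ X) + C0 * mu0  # (K, D)
    gamma = np.linalg.norm(t, axis=1)              # (K,) concentration parameters
    psi = t / gamma[:, None]                       # (K, D) unit mean directions
    return psi, gamma
```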

SLIDE 9

Updating q (κk)

$$q(\kappa_k) \stackrel{?}{\equiv} \mathrm{logNormal}(\cdot \mid a_k, b_k)$$

$$
\begin{aligned}
\log q^*(\kappa_k) &= \mathbb{E}_{q\setminus\kappa_k} \log p(X, Z \mid \theta) + \text{const} \\
&= \mathbb{E}_{q\setminus\kappa_k}\left[\sum_{i=1}^{N}\sum_{j=1}^{K} z_{ij}\left(\log C_D(\kappa_j) + \kappa_j x_i^{\top}\mu_j\right) + \sum_{j=1}^{K}\left(-\log\kappa_j - \frac{(\log\kappa_j - m)^2}{2\sigma^2}\right)\right] + \text{const} \\
&= \sum_{i=1}^{N}\mathbb{E}_q[z_{ik}]\left(\log C_D(\kappa_k) + \kappa_k x_i^{\top}\mathbb{E}_q[\mu_k]\right) - \log\kappa_k - \frac{(\log\kappa_k - m)^2}{2\sigma^2} + \text{const}
\end{aligned}
$$

$\Longrightarrow q^*(\kappa_k) \not\sim \mathrm{logNormal}$, due to the presence of the $\log C_D(\kappa_k)$ term.

SLIDE 10

Intermediate Quantities

Some intermediate quantities are available in closed form:

◮ $q(z_i) \equiv \mathrm{Multi}(z_i \mid \lambda_i) \;\Longrightarrow\; \mathbb{E}_q[z_{ij}] = \lambda_{ij}$
◮ $q(\pi) \equiv \mathrm{Dirichlet}(\pi \mid \rho) \;\Longrightarrow\; \mathbb{E}_q\log\pi_k = \Psi(\rho_k) - \Psi\big(\sum_j \rho_j\big)$
◮ $q(\mu_k) \equiv \mathrm{vMF}(\mu_k \mid \psi_k, \gamma_k) \;\Longrightarrow\; \mathbb{E}_q[\mu_k] = \frac{I_{D/2}(\gamma_k)}{I_{D/2-1}(\gamma_k)}\,\psi_k$ ¹ [Rothenbuehler, 2005]

Some are not: $\mathbb{E}_q[\kappa_k]$ and $\mathbb{E}_q\log C_D(\kappa_k)$, because

1. there is no good parametric form for $q(\kappa_k)$
   ◮ apply sampling
2. even if $\kappa_k \sim \mathrm{logNormal}$ is assumed, $\mathbb{E}_q\log C_D(\kappa_k)$ is still hard to deal with
   ◮ bound $\log C_D(\cdot)$ by some simple functions

¹ This can be derived from the characteristic function of the vMF distribution.
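The two closed-form expectations translate directly to code; `scipy.special.ive` gives the exponentially scaled Bessel function, whose scaling factors cancel in the ratio (a sketch under the same array conventions as above):

```python
import numpy as np
from scipy.special import digamma, ive

def expected_log_pi(rho):
    """E_q[log pi_k] = Psi(rho_k) - Psi(sum_j rho_j)."""
    return digamma(rho) - digamma(rho.sum())

def expected_mu(psi, gamma, D):
    """E_q[mu_k] = (I_{D/2}(gamma_k) / I_{D/2-1}(gamma_k)) psi_k.

    ive(v, x) = iv(v, x) exp(-x); the exp(-x) factors cancel in the
    ratio, keeping the computation stable for large gamma_k.
    """
    ratio = ive(D / 2, gamma) / ive(D / 2 - 1, gamma)  # (K,)
    return ratio[:, None] * psi
```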

SLIDE 11

Sampling

In principle we could sample $\kappa_k$ from $p(\kappa_k \mid X, \theta)$. Unfortunately, such a sampling procedure requires samples of $z_i, \mu_k, \pi, \dots$, which are not maintained by variational inference.

Recall the optimal posterior for $\kappa_k$ satisfies²

$$
\log q^*(\kappa_k) = \sum_{i=1}^{N}\mathbb{E}[z_{ik}]\left(\log C_D(\kappa_k) + \kappa_k x_i^{\top}\mathbb{E}_q[\mu_k]\right) - \log\kappa_k - \frac{(\log\kappa_k - m)^2}{2\sigma^2} + \text{const}
$$

$$
\Longrightarrow\quad q^*(\kappa_k) \propto \exp\!\left(\sum_{i=1}^{N}\mathbb{E}[z_{ik}]\left(\log C_D(\kappa_k) + \kappa_k x_i^{\top}\mathbb{E}_q[\mu_k]\right)\right) \times \mathrm{logNormal}(\kappa_k \mid m, \sigma^2)
$$

We can sample from $q^*(\kappa_k)$!

² See the derivation on Slide 9.
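The slides do not fix a particular sampler. As one plausible choice (an assumption, not necessarily what Gopal and Yang [2014] implement), here is a random-walk Metropolis sampler on $\log\kappa$, reusing the `log_CD` helper sketched earlier; working in log-space, the Jacobian $e^u$ cancels the $-\log\kappa_k$ term of the logNormal prior:

```python
import numpy as np

def sample_q_kappa(Nk, Sk, m, sigma2, D, n_samples=2000, step=0.3, rng=None):
    """Random-walk Metropolis on u = log(kappa_k), targeting q*(kappa_k).

    Nk = sum_i E[z_ik];  Sk = sum_i E[z_ik] x_i^T E[mu_k].
    """
    rng = np.random.default_rng(rng)

    def log_target(u):  # log-density of u, Jacobian already absorbed
        kappa = np.exp(u)
        return Nk * log_CD(kappa, D) + kappa * Sk - (u - m) ** 2 / (2 * sigma2)

    u = m                                       # start at the prior mode in log-space
    lp = log_target(u)
    draws = np.empty(n_samples)
    for t in range(n_samples):
        u_new = u + step * rng.standard_normal()
        lp_new = log_target(u_new)
        if np.log(rng.random()) < lp_new - lp:  # symmetric proposal: plain MH ratio
            u, lp = u_new, lp_new
        draws[t] = np.exp(u)
    return draws
```

The required expectations are then Monte Carlo averages over the draws, e.g. `draws.mean()` for $\mathbb{E}_q[\kappa_k]$ and `log_CD(draws, D).mean()` for $\mathbb{E}_q\log C_D(\kappa_k)$.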

SLIDE 12

Bounding

Outline:

◮ Assume $q(\kappa_k) \equiv \mathrm{logNormal}(\cdot \mid a_k, b_k)$.
◮ Lower bound $\mathbb{E}_q \log C_D(\kappa_k)$ in the VLB by some simple terms.
◮ To optimize $q(\kappa_k)$, use gradient ascent w.r.t. $a_k$ and $b_k$ to raise the VLB.

Empirically, sampling outperforms bounding.

SLIDE 13

Empirical Bayes for Hyperparameters

Raise $\mathrm{VLB}(q, \theta)$ by coordinate ascent:

1. $q^{t+1} = \operatorname*{argmax}_{q = \prod_{i=1}^{M} q_i} \mathrm{VLB}(q, \theta^t)$
2. $\theta^{t+1} = \operatorname*{argmax}_{\theta} \mathrm{VLB}(q^{t+1}, \theta) = \operatorname*{argmax}_{\theta} \mathbb{E}_{q^{t+1}} \log p(X, Z \mid \theta)$

For example, one can use gradient ascent to optimize $\alpha$:

$$
\max_{\alpha > 0}\; -\log B(\alpha) + (\alpha - 1)\sum_{k=1}^{K}\mathbb{E}_{q^{t+1}}[\log\pi_k]
$$

$m$, $\sigma^2$, $\mu_0$ and $C_0$ can be optimized in a similar manner.³

³ Unlike $\alpha$, their solutions can be written in closed form.
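A sketch of the $\alpha$ step, assuming the symmetric Dirichlet prior that the slide's scalar-$\alpha$ notation suggests, so that $\log B(\alpha) = K\log\Gamma(\alpha) - \log\Gamma(K\alpha)$; optimizing $\log\alpha$ keeps $\alpha > 0$:

```python
import numpy as np
from scipy.special import digamma

def update_alpha(E_log_pi, alpha0=1.0, lr=1e-2, n_steps=500):
    """Gradient ascent on f(alpha) = -log B(alpha) + (alpha - 1) sum_k E[log pi_k]."""
    K = len(E_log_pi)
    s = np.sum(E_log_pi)
    log_a = np.log(alpha0)
    for _ in range(n_steps):
        a = np.exp(log_a)
        grad = -K * digamma(a) + K * digamma(K * a) + s  # df/dalpha
        log_a += lr * a * grad                           # chain rule for the log parameterization
    return np.exp(log_a)
```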

SLIDE 14

Reference I

Banerjee, A., Dhillon, I. S., Ghosh, J., and Sra, S. (2005). Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6:1345–1382.

Gopal, S. and Yang, Y. (2014). Von Mises-Fisher clustering models. In Proceedings of the 31st International Conference on Machine Learning, pages 154–162.

Rothenbuehler, J. (2005). Dependence Structures beyond Copulas: A New Model of a Multivariate Regular Varying Distribution Based on a Finite von Mises-Fisher Mixture Model. PhD thesis, Cornell University.