SLIDE 1
Variational Inference for Bayes vMF Mixture
Hanxiao Liu September 23, 2014
SLIDE 2 Variational Inference Review
Lower bound the log-likelihood:

$$\mathcal{L}(\theta; X) = \log p(X \mid \theta) = \underbrace{\mathbb{E}_q \log \frac{p(X, Z \mid \theta)}{q(Z)}}_{\mathrm{VLB}(q,\, \theta)} + \underbrace{\mathbb{E}_q \log \frac{q(Z)}{p(Z \mid X, \theta)}}_{D_{\mathrm{KL}}\left(q \,\|\, p(Z \mid X, \theta)\right) \,\geq\, 0}$$

Raise $\mathrm{VLB}(q, \theta)$ by coordinate ascent:

1. $q^{t+1} = \operatorname*{argmax}_{q = \prod_{i=1}^{M} q_i} \mathrm{VLB}(q, \theta^t)$
2. $\theta^{t+1} = \operatorname*{argmax}_{\theta} \mathrm{VLB}(q^{t+1}, \theta)$
SLIDE 3 Variational Inference Review
Goal: solve

$$\operatorname*{argmax}_{q = \prod_{i=1}^{M} q_i} \mathrm{VLB}(q, \theta)$$

by coordinate ascent, i.e. by sequentially updating a single factor $q_j$ in each iteration. Each coordinate step has a closed-form solution:

$$\mathrm{VLB}(q, \theta) = \mathbb{E}_q \log \frac{p(X, Z \mid \theta)}{q(Z)} = \mathbb{E}_q \log p(X, Z \mid \theta) - \sum_{i=1}^{M} \mathbb{E}_q \log q_i$$
$$= \mathbb{E}_{q_j} \big[ \underbrace{\mathbb{E}_{q_{-j}} \log p(X, Z \mid \theta)}_{\log \tilde{q}_j \, + \, \text{const}} \big] - \mathbb{E}_{q_j} \log q_j + \text{const}$$
$$= \mathbb{E}_{q_j} \log \frac{\tilde{q}_j}{q_j} + \text{const} = -D_{\mathrm{KL}}(q_j \,\|\, \tilde{q}_j) + \text{const}$$
$$\implies \log q_j^* = \mathbb{E}_{q_{-j}} \log p(X, Z \mid \theta) + \text{const}$$
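In Python-style pseudocode, the overall scheme looks as follows. This is a purely structural sketch: the `update_*` helpers are hypothetical names standing in for the closed-form updates derived on the following slides, not an existing implementation.

```python
# Structural sketch only: the update_* helpers are hypothetical stand-ins for the
# closed-form coordinate updates derived on the following slides, not a real API.
def mean_field_vi(X, q, theta, n_iters=100):
    for _ in range(n_iters):
        # Step 1: update each factor in turn, using
        # log q_j* = E_{q_{-j}} log p(X, Z | theta) + const.
        q["z"] = update_q_z(X, q, theta)           # slide 7
        q["pi"] = update_q_pi(q, theta)            # slide 6
        q["mu"] = update_q_mu(X, q, theta)         # slide 8
        q["kappa"] = update_q_kappa(X, q, theta)   # slides 9-12 (sampling or bounding)
        # Step 2: empirical Bayes update of the hyperparameters (slide 13).
        theta = update_theta(q, theta)
    return q, theta
```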
SLIDE 4 Bayes vMF Mixture
[Gopal and Yang, 2014]
Model:

◮ $\pi \sim \mathrm{Dirichlet}(\cdot \mid \alpha)$
◮ $\mu_k \sim \mathrm{vMF}(\cdot \mid \mu_0, C_0)$
◮ $\kappa_k \sim \mathrm{logNormal}(\cdot \mid m, \sigma^2)$
◮ $z_i \sim \mathrm{Multi}(\cdot \mid \pi)$
◮ $x_i \sim \mathrm{vMF}(\cdot \mid \mu_{z_i}, \kappa_{z_i})$

Assumed variational factors (the "$\overset{?}{\equiv}$" marks a parametric form assumed up front):

◮ $q(\pi) \overset{?}{\equiv} \mathrm{Dirichlet}(\cdot \mid \rho)$
◮ $q(\mu_k) \overset{?}{\equiv} \mathrm{vMF}(\cdot \mid \psi_k, \gamma_k)$
◮ $q(\kappa_k) \overset{?}{\equiv} \mathrm{logNormal}(\cdot \mid a_k, b_k)$
◮ $q(z_i) \overset{?}{\equiv} \mathrm{Multi}(\cdot \mid \lambda_i)$
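For concreteness, here is a minimal generative sketch of this model, assuming SciPy >= 1.11 for `scipy.stats.vonmises_fisher`; the dimensions and hyperparameter values are made up for illustration:

```python
import numpy as np
from scipy.stats import vonmises_fisher  # requires SciPy >= 1.11

rng = np.random.default_rng(0)
N, K, D = 500, 3, 5                  # observations, components, dimension (made up)
alpha, m, sigma = 1.0, 2.0, 0.5      # made-up hyperparameter values
mu0, C0 = np.eye(D)[0], 2.0          # vMF prior: mean direction and concentration

pi = rng.dirichlet(alpha * np.ones(K))                     # pi ~ Dirichlet(. | alpha)
mu = vonmises_fisher(mu0, C0).rvs(K, random_state=rng)     # mu_k ~ vMF(. | mu0, C0)
kappa = rng.lognormal(mean=m, sigma=sigma, size=K)         # kappa_k ~ logNormal(. | m, sigma^2)
z = rng.choice(K, size=N, p=pi)                            # z_i ~ Multi(. | pi)

X = np.empty((N, D))                                       # x_i ~ vMF(. | mu_{z_i}, kappa_{z_i})
for k in range(K):
    idx = np.flatnonzero(z == k)
    if idx.size:
        X[idx] = vonmises_fisher(mu[k], kappa[k]).rvs(idx.size, random_state=rng)
```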
SLIDE 5 Compute log p (X, Z|θ)
$$p(X, Z \mid \theta) = \mathrm{Dirichlet}(\pi \mid \alpha) \times \prod_{i=1}^{N} \mathrm{Multi}(z_i \mid \pi)\, \mathrm{vMF}(x_i \mid \mu_{z_i}, \kappa_{z_i}) \times \prod_{k=1}^{K} \mathrm{vMF}(\mu_k \mid \mu_0, C_0)\, \mathrm{logNormal}(\kappa_k \mid m, \sigma^2)$$

$$\begin{aligned} \log p(X, Z \mid \theta) = {} & -\log B(\alpha) + \sum_{k=1}^{K} (\alpha - 1) \log \pi_k + \sum_{i=1}^{N} \sum_{k=1}^{K} z_{ik} \log \pi_k \\ & + \sum_{i=1}^{N} \sum_{k=1}^{K} z_{ik} \left( \log C_D(\kappa_k) + \kappa_k\, x_i^\top \mu_k \right) + \sum_{k=1}^{K} \left( \log C_D(C_0) + C_0\, \mu_k^\top \mu_0 \right) \\ & + \sum_{k=1}^{K} \left( -\frac{1}{2} \log 2\pi\sigma^2 - \log \kappa_k - \frac{(\log \kappa_k - m)^2}{2\sigma^2} \right) \end{aligned}$$
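The log joint translates directly into code, which is a useful check on the algebra. The sketch below assumes the standard vMF density $\mathrm{vMF}(x \mid \mu, \kappa) = C_D(\kappa)\, e^{\kappa \mu^\top x}$ with normalizer $C_D(\kappa) = \kappa^{D/2-1} / \big( (2\pi)^{D/2} I_{D/2-1}(\kappa) \big)$ and a symmetric Dirichlet prior; $\log C_D$ is evaluated through the exponentially scaled Bessel function for numerical stability:

```python
import numpy as np
from scipy.special import gammaln, ive

def log_cd(kappa, D):
    # log C_D(kappa) for the vMF normalizer kappa^{D/2-1} / ((2 pi)^{D/2} I_{D/2-1}(kappa));
    # log I_nu(kappa) is computed via the exponentially scaled Bessel function ive.
    nu = D / 2 - 1
    return nu * np.log(kappa) - (D / 2) * np.log(2 * np.pi) - (np.log(ive(nu, kappa)) + kappa)

def log_joint(X, Z, pi, mu, kappa, alpha, mu0, C0, m, sigma2):
    # log p(X, Z | theta) as on this slide; Z is one-hot with shape (N, K).
    N, D = X.shape
    K = pi.shape[0]
    log_B = K * gammaln(alpha) - gammaln(K * alpha)              # log B(alpha), symmetric Dirichlet
    out = -log_B + (alpha - 1) * np.log(pi).sum()                # Dirichlet prior on pi
    out += (Z * np.log(pi)).sum()                                # sum_ik z_ik log pi_k
    out += (Z * (log_cd(kappa, D) + kappa * (X @ mu.T))).sum()   # vMF likelihood terms
    out += (log_cd(C0, D) + C0 * (mu @ mu0)).sum()               # vMF prior on each mu_k
    out += (-0.5 * np.log(2 * np.pi * sigma2) - np.log(kappa)
            - (np.log(kappa) - m) ** 2 / (2 * sigma2)).sum()     # logNormal prior on each kappa_k
    return out
```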
SLIDE 6 Updating q (π)
$q(\pi) \overset{?}{\equiv} \mathrm{Dirichlet}(\cdot \mid \rho)$

$$\begin{aligned} \log q^*(\pi) &= \mathbb{E}_{q \backslash \pi} \log p(X, Z \mid \theta) + \text{const} = \mathbb{E}_{q \backslash \pi} \left[ \sum_{k=1}^{K} (\alpha - 1) \log \pi_k + \sum_{i=1}^{N} \sum_{k=1}^{K} z_{ik} \log \pi_k \right] + \text{const} \\ &= \sum_{k=1}^{K} \left( \alpha + \sum_{i=1}^{N} \mathbb{E}_q[z_{ik}] - 1 \right) \log \pi_k + \text{const} \end{aligned}$$

$$\implies q^*(\pi) \propto \prod_{k=1}^{K} \pi_k^{\alpha + \sum_{i=1}^{N} \mathbb{E}_q[z_{ik}] - 1} \sim \mathrm{Dirichlet} \implies \rho_k^* = \alpha + \sum_{i=1}^{N} \mathbb{E}_q[z_{ik}]$$
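Given the responsibilities $\lambda_{ik} = \mathbb{E}_q[z_{ik}]$, this update is a one-liner; a minimal sketch:

```python
import numpy as np

def update_q_pi(lam, alpha):
    # rho_k* = alpha + sum_i E_q[z_ik], with lam[i, k] = E_q[z_ik] of shape (N, K).
    return alpha + lam.sum(axis=0)   # shape (K,)
```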
SLIDE 7 Updating q (zi)
$q(z_i) \overset{?}{\equiv} \mathrm{Multi}(\cdot \mid \lambda_i)$

$$\begin{aligned} \log q^*(z_i) &= \mathbb{E}_{q \backslash z_i} \log p(X, Z \mid \theta) + \text{const} = \mathbb{E}_{q \backslash z_i} \left[ \sum_{k=1}^{K} z_{ik} \log \pi_k + \sum_{k=1}^{K} z_{ik} \left( \log C_D(\kappa_k) + \kappa_k\, x_i^\top \mu_k \right) \right] + \text{const} \\ &= \sum_{k=1}^{K} z_{ik} \left( \mathbb{E}_q \log \pi_k + \mathbb{E}_q \log C_D(\kappa_k) + \mathbb{E}_q[\kappa_k]\, x_i^\top \mathbb{E}_q[\mu_k] \right) + \text{const} \end{aligned}$$

$$\implies q^*(z_i) \sim \mathrm{Multi}, \qquad \lambda_{ik}^* \propto e^{\mathbb{E}_q \log \pi_k + \mathbb{E}_q \log C_D(\kappa_k) + \mathbb{E}_q[\kappa_k]\, x_i^\top \mathbb{E}_q[\mu_k]}$$

Assume $\mathbb{E}_q \log \pi_k$, $\mathbb{E}_q \log C_D(\kappa_k)$, $\mathbb{E}_q[\kappa_k]$ and $\mathbb{E}_q[\mu_k]$ are already known; we will compute them explicitly later.
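In code, the $\lambda$ update is a softmax over components, computed in log space to avoid overflow. A sketch, assuming the four expectations above (see slide 10) are already available as arrays:

```python
import numpy as np
from scipy.special import logsumexp

def update_q_z(X, e_log_pi, e_log_cd, e_kappa, e_mu):
    # lambda_ik* is proportional to
    # exp(E_q log pi_k + E_q log C_D(kappa_k) + E_q[kappa_k] x_i^T E_q[mu_k]);
    # e_log_pi, e_log_cd, e_kappa have shape (K,), e_mu has shape (K, D), X has shape (N, D).
    logits = e_log_pi + e_log_cd + e_kappa * (X @ e_mu.T)             # shape (N, K)
    return np.exp(logits - logsumexp(logits, axis=1, keepdims=True))  # rows sum to 1
```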
SLIDE 8 Updating q (µk)
$q(\mu_k) \overset{?}{\equiv} \mathrm{vMF}(\cdot \mid \psi_k, \gamma_k)$

$$\begin{aligned} \log q^*(\mu_k) &= \mathbb{E}_{q \backslash \mu_k} \log p(X, Z \mid \theta) + \text{const} = \mathbb{E}_{q \backslash \mu_k} \left[ \sum_{i=1}^{N} \sum_{j=1}^{K} z_{ij} \kappa_j\, x_i^\top \mu_j + \sum_{j=1}^{K} C_0\, \mu_j^\top \mu_0 \right] + \text{const} \\ &= \mathbb{E}_q[\kappa_k] \sum_{i=1}^{N} \mathbb{E}_q[z_{ik}]\, x_i^\top \mu_k + C_0\, \mu_k^\top \mu_0 + \text{const} \end{aligned}$$

$$\implies q^*(\mu_k) \propto e^{\left( \mathbb{E}_q[\kappa_k] \sum_{i=1}^{N} \mathbb{E}_q[z_{ik}]\, x_i + C_0 \mu_0 \right)^\top \mu_k} \sim \mathrm{vMF}$$

$$\gamma_k^* = \left\| \mathbb{E}_q[\kappa_k] \sum_{i=1}^{N} \mathbb{E}_q[z_{ik}]\, x_i + C_0 \mu_0 \right\|, \qquad \psi_k^* = \frac{\mathbb{E}_q[\kappa_k] \sum_{i=1}^{N} \mathbb{E}_q[z_{ik}]\, x_i + C_0 \mu_0}{\gamma_k^*}$$
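A sketch of this update: form the unnormalized natural parameter, then split it into its norm $\gamma_k^*$ and direction $\psi_k^*$:

```python
import numpy as np

def update_q_mu(X, lam, e_kappa, mu0, C0):
    # q*(mu_k) = vMF(psi_k, gamma_k), where the natural parameter is
    # t_k = E_q[kappa_k] sum_i E_q[z_ik] x_i + C0 mu0,
    # gamma_k = ||t_k||, psi_k = t_k / gamma_k.
    t = e_kappa[:, None] * (lam.T @ X) + C0 * mu0   # shape (K, D)
    gamma = np.linalg.norm(t, axis=1)               # concentrations gamma_k*
    psi = t / gamma[:, None]                        # unit mean directions psi_k*
    return psi, gamma
```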
SLIDE 9 Updating q (κk)
$q(\kappa_k) \overset{?}{\equiv} \mathrm{logNormal}(\cdot \mid a_k, b_k)$

$$\begin{aligned} \log q^*(\kappa_k) &= \mathbb{E}_{q \backslash \kappa_k} \log p(X, Z \mid \theta) + \text{const} \\ &= \mathbb{E}_{q \backslash \kappa_k} \left[ \sum_{i=1}^{N} \sum_{j=1}^{K} z_{ij} \left( \log C_D(\kappa_j) + \kappa_j\, x_i^\top \mu_j \right) + \sum_{j=1}^{K} \left( -\log \kappa_j - \frac{(\log \kappa_j - m)^2}{2\sigma^2} \right) \right] + \text{const} \\ &= \sum_{i=1}^{N} \mathbb{E}_q[z_{ik}] \left( \log C_D(\kappa_k) + \kappa_k\, x_i^\top \mathbb{E}_q[\mu_k] \right) - \log \kappa_k - \frac{(\log \kappa_k - m)^2}{2\sigma^2} + \text{const} \end{aligned}$$

$\implies q^*(\kappa_k)$ is not logNormal, due to the presence of the $\log C_D(\kappa_k)$ term.
SLIDE 10 Intermediate Quantities
Some intermediate quantities are available in closed form:

◮ $q(z_i) \equiv \mathrm{Multi}(z_i \mid \lambda_i) \implies \mathbb{E}_q[z_{ij}] = \lambda_{ij}$
◮ $q(\pi) \equiv \mathrm{Dirichlet}(\pi \mid \rho) \implies \mathbb{E}_q \log \pi_k = \Psi(\rho_k) - \Psi\big(\textstyle\sum_j \rho_j\big)$
◮ $q(\mu_k) \equiv \mathrm{vMF}(\mu_k \mid \psi_k, \gamma_k) \implies \mathbb{E}_q[\mu_k] = \frac{I_{D/2}(\gamma_k)}{I_{D/2-1}(\gamma_k)}\, \psi_k$ ¹ [Rothenbuehler, 2005]

Some are not: $\mathbb{E}_q[\kappa_k]$ and $\mathbb{E}_q \log C_D(\kappa_k)$, because

1. there is no good parametric form for $q(\kappa_k)$
   ◮ apply sampling
2. even if $\kappa_k \sim \mathrm{logNormal}$ is assumed, $\mathbb{E}_q \log C_D(\kappa_k)$ is still hard to deal with
   ◮ bound $\log C_D(\cdot)$ by some simple functions

¹ Can be derived from the characteristic function of the vMF distribution.
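The two closed-form expectations map directly onto SciPy calls; note that the exponential scaling in `ive` cancels in the Bessel-function ratio, which keeps $\mathbb{E}_q[\mu_k]$ stable even for large $\gamma_k$:

```python
import numpy as np
from scipy.special import digamma, ive

def e_log_pi(rho):
    # q(pi) = Dirichlet(rho)  =>  E_q[log pi_k] = Psi(rho_k) - Psi(sum_j rho_j)
    return digamma(rho) - digamma(rho.sum())

def e_mu(psi, gamma, D):
    # q(mu_k) = vMF(psi_k, gamma_k)  =>  E_q[mu_k] = (I_{D/2}(gamma_k) / I_{D/2-1}(gamma_k)) psi_k;
    # ive(nu, x) = I_nu(x) exp(-x), and the exponential scaling cancels in the ratio.
    ratio = ive(D / 2, gamma) / ive(D / 2 - 1, gamma)
    return ratio[:, None] * psi
```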
SLIDE 11 Sampling
In principle we could sample $\kappa_k$ from $p(\kappa_k \mid X, \theta)$. Unfortunately, such a sampling procedure requires samples of $z_i, \mu_k, \pi, \ldots$, which are not maintained by variational inference. Recall instead that the optimal posterior for $\kappa_k$ satisfies²

$$\log q^*(\kappa_k) = \sum_{i=1}^{N} \mathbb{E}_q[z_{ik}] \left( \log C_D(\kappa_k) + \kappa_k\, x_i^\top \mathbb{E}_q[\mu_k] \right) - \log \kappa_k - \frac{(\log \kappa_k - m)^2}{2\sigma^2} + \text{const}$$

$$\implies q^*(\kappa_k) \propto \exp\left( \sum_{i=1}^{N} \mathbb{E}_q[z_{ik}] \left( \log C_D(\kappa_k) + \kappa_k\, x_i^\top \mathbb{E}_q[\mu_k] \right) - \log \kappa_k - \frac{(\log \kappa_k - m)^2}{2\sigma^2} \right)$$

We can sample from $q^*(\kappa_k)$ directly!

² See the derivation on slide 9.
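The slide leaves the choice of sampler open; one simple, illustrative option (not necessarily the one used by Gopal and Yang) is random-walk Metropolis on $\log \kappa_k$:

```python
import numpy as np
from scipy.special import ive

def log_q_star(kappa, D, n_k, s_k, m, sigma2):
    # Unnormalized log q*(kappa_k) from this slide, where
    # n_k = sum_i E_q[z_ik] and s_k = sum_i E_q[z_ik] x_i^T E_q[mu_k].
    nu = D / 2 - 1
    log_cd = nu * np.log(kappa) - (D / 2) * np.log(2 * np.pi) - (np.log(ive(nu, kappa)) + kappa)
    return n_k * log_cd + kappa * s_k - np.log(kappa) - (np.log(kappa) - m) ** 2 / (2 * sigma2)

def sample_q_kappa(D, n_k, s_k, m, sigma2, n_samples=2000, step=0.1, seed=0):
    # Random-walk Metropolis on u = log kappa; the +u term below is the log-Jacobian
    # of the change of variables, so we target the density of u, not of kappa.
    rng = np.random.default_rng(seed)
    u, samples = m, []
    for _ in range(n_samples):
        u_prop = u + step * rng.standard_normal()
        log_acc = (log_q_star(np.exp(u_prop), D, n_k, s_k, m, sigma2) + u_prop) \
            - (log_q_star(np.exp(u), D, n_k, s_k, m, sigma2) + u)
        if np.log(rng.uniform()) < log_acc:
            u = u_prop
        samples.append(np.exp(u))
    return np.array(samples)
```

$\mathbb{E}_q[\kappa_k]$ and $\mathbb{E}_q \log C_D(\kappa_k)$ can then be estimated by Monte Carlo averages over the post-burn-in draws.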
SLIDE 12 Bounding
Outline:

◮ Assume $q(\kappa_k) \equiv \mathrm{logNormal}(\cdot \mid a_k, b_k)$
◮ Lower bound $\mathbb{E}_q \log C_D(\kappa_k)$ in the VLB by some simple terms
◮ To optimize $q(\kappa_k)$, use gradient ascent w.r.t. $a_k$ and $b_k$ to raise the VLB

Empirically, sampling outperforms bounding.
SLIDE 13 Empirical Bayes for Hyperparameters
Raise $\mathrm{VLB}(q, \theta)$ by coordinate ascent:

1. $q^{t+1} = \operatorname*{argmax}_{q = \prod_{i=1}^{M} q_i} \mathrm{VLB}(q, \theta^t)$
2. $\theta^{t+1} = \operatorname*{argmax}_{\theta} \mathrm{VLB}(q^{t+1}, \theta) = \operatorname*{argmax}_{\theta} \mathbb{E}_{q^{t+1}} \log p(X, Z \mid \theta)$

For example, one can use gradient ascent to optimize $\alpha$:

$$\max_{\alpha > 0} \; -\log B(\alpha) + (\alpha - 1) \sum_{k=1}^{K} \mathbb{E}_{q^{t+1}}[\log \pi_k]$$

$m$, $\sigma^2$, $\mu_0$ and $C_0$ can be optimized in a similar manner³

³ Unlike $\alpha$, their solutions can be written in closed form.
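A sketch of that gradient ascent for a symmetric Dirichlet, where $\log B(\alpha) = K \log \Gamma(\alpha) - \log \Gamma(K\alpha)$; ascending in $\log \alpha$ enforces the constraint $\alpha > 0$:

```python
import numpy as np
from scipy.special import digamma

def update_alpha(e_log_pi_vals, alpha=1.0, lr=1e-3, n_steps=1000):
    # Maximize -log B(alpha) + (alpha - 1) sum_k E_q[log pi_k] over alpha > 0, where
    # log B(alpha) = K log Gamma(alpha) - log Gamma(K alpha) for a symmetric Dirichlet.
    # The gradient in alpha is K (Psi(K alpha) - Psi(alpha)) + sum_k E_q[log pi_k];
    # the step size is illustrative, not tuned.
    K = len(e_log_pi_vals)
    log_alpha = np.log(alpha)
    for _ in range(n_steps):
        a = np.exp(log_alpha)
        grad = K * (digamma(K * a) - digamma(a)) + np.sum(e_log_pi_vals)
        log_alpha += lr * a * grad   # chain rule: d/d log(alpha) = alpha * d/d alpha
    return np.exp(log_alpha)
```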
SLIDE 14 Reference I

Banerjee, A., Dhillon, I. S., Ghosh, J., and Sra, S. (2005). Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6:1345–1382.

Gopal, S. and Yang, Y. (2014). Von Mises-Fisher clustering models. In Proceedings of the 31st International Conference on Machine Learning, pages 154–162.

Rothenbuehler, J. (2005). Dependence Structures beyond Copulas: A New Model of a Multivariate Regular Varying Distribution Based on a Finite von Mises-Fisher Mixture Model. PhD thesis, Cornell University.