

SLIDE 1

Memoized Online Variational Inference for Dirichlet Process Mixture Models

Michael C. Hughes Erik B. Sudderth

Department of Computer Science, Brown University

26 June 2014

Advances in Neural Information Processing Systems (2013)

Presented by Kyle Ulrich


SLIDE 2

Review: Dirichlet process mixture models

A draw, G, from a DP consists of an infinite collection of atoms:

G ∼ DP(α_0 H),   G = ∑_{k=1}^{∞} w_k δ_{φ_k}   (1)

The mixture weights w_k are represented by the stick-breaking process and the data-generating parameters φ_k are drawn from the base measure H:

w_k = v_k ∏_{ℓ=1}^{k−1} (1 − v_ℓ),   v_k ∼ Beta(1, α_0),   φ_k ∼ H(λ_0)   (2)

Each data point n = 1, . . . , N has cluster assignment z_n and observation x_n distributed according to

z_n ∼ Cat(w),   x_n ∼ F(φ_{z_n})   (3)

Often, H and F are assumed to belong to the exponential family.
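
To make the generative process concrete, here is a minimal NumPy sketch that samples data from a truncated stick-breaking approximation of this model, assuming a 1-D Gaussian base measure and likelihood; the truncation level and all numeric settings are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha0, K_trunc, N = 1.0, 20, 500      # concentration, truncation level, number of points

# Stick-breaking weights: w_k = v_k * prod_{l<k} (1 - v_l), with v_k ~ Beta(1, alpha0)
v = rng.beta(1.0, alpha0, size=K_trunc)
w = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))

# Component parameters phi_k ~ H (illustrative Gaussian base measure over means)
phi = rng.normal(0.0, 5.0, size=K_trunc)

# Assignments z_n ~ Cat(w) and observations x_n ~ F(phi_{z_n})
z = rng.choice(K_trunc, size=N, p=w / w.sum())   # renormalize the truncated weights
x = rng.normal(phi[z], 1.0)
```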


SLIDE 3

Overview of inference for DPM models

1 Variational inference is attractive for large-scale datasets

However, full-dataset variational inference scales poorly and often converges to poor local optima

2 Stochastic online (SO) variational inference, alternatively, scales to large datasets

On the downside, it is sensitive to the learning rate decay schedule and the choice of batch size

3 The proposed memoized online (MO) variational inference avoids these noisy gradients and learning rates

It requires multiple full passes through the data. Birth and merge moves naturally help MO escape local optima


SLIDE 4

Mean-field variational inference for DP mixture models

With mean-field inference, we seek a variational distribution

q(z, v, φ) = ∏_{n=1}^{N} q(z_n | r̂_n) ∏_{k=1}^{K} q(v_k | α̂_{k1}, α̂_{k0}) q(φ_k | λ̂_k),   (4)

with the following distributions on the individual factors:

q(z_n) = Cat(r̂_{n1}, . . . , r̂_{nK}),   q(v_k) = Beta(α̂_{k1}, α̂_{k0}),   q(φ_k) = H(λ̂_k).

The parameters of q are optimized so that the KL divergence from the true posterior is minimized; this is equivalent to maximizing the ELBO,

L(q) ≜ E_q[ log p(x, v, z, φ | α_0, λ_0) − log q(v, z, φ) ]   (5)

Maximizing this ELBO, we iteratively update r̂_n, α̂_{k1}, α̂_{k0}, and λ̂_k. These batch updates are standard and presented in the paper.
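
As a rough illustration of one coordinate-ascent sweep, the snippet below updates the responsibilities r̂_{nk}; the expected log stick-breaking weights use standard digamma identities. The 1-D unit-variance plug-in Gaussian likelihood and the variable names are illustrative assumptions, not the paper's exact updates.

```python
import numpy as np
from scipy.special import digamma

def update_responsibilities(x, alpha1, alpha0, mu_hat):
    """Local step: r_hat[n, k] ∝ exp(E[log w_k] + log-likelihood of x_n under component k).

    x              : (N,) observations, treated as 1-D Gaussians with unit variance (illustrative)
    alpha1, alpha0 : (K,) parameters of q(v_k) = Beta(alpha1_k, alpha0_k)
    mu_hat         : (K,) expected component means under q(phi_k)
    """
    # Expected log stick fractions under the Beta factors
    e_log_v   = digamma(alpha1) - digamma(alpha1 + alpha0)
    e_log_1mv = digamma(alpha0) - digamma(alpha1 + alpha0)
    # Stick-breaking: E[log w_k] = E[log v_k] + sum_{l<k} E[log(1 - v_l)]
    e_log_w = e_log_v + np.concatenate(([0.0], np.cumsum(e_log_1mv[:-1])))

    # Plug-in Gaussian log-likelihood using the expected mean
    # (ignores the posterior variance of phi_k for brevity)
    log_lik = -0.5 * (x[:, None] - mu_hat[None, :]) ** 2

    log_r = e_log_w[None, :] + log_lik
    log_r -= log_r.max(axis=1, keepdims=True)        # stabilize before exponentiating
    r_hat = np.exp(log_r)
    return r_hat / r_hat.sum(axis=1, keepdims=True)  # rows sum to 1
```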


SLIDE 5

Truncation strategy

There are many methods to set the truncation level K of the DP:

1 Place artificially large mass on the final component, i.e., q(v_K = 1) = 1

2 Set the stick-breaking ‘tail’ to the prior, i.e., q(v_k) = p(v_k | α_0) for k > K

3 Truncate the assignments to enforce q(z_n = k) = 0 for k > K

This work uses method 3 above, which has several advantages:

1 All data is explained by the first K components, so the data are conditionally independent of all parameters with index k > K

2 Therefore, inference only needs to consider a finite set of K atoms

3 This minimizes unnecessary computation while still approximating the infinite posterior

4 Truncation is nested: any q with truncation K can be represented exactly under truncation K + 1 with zero mass on the final component


SLIDE 6

Stochastic online (SO) variational inference

At each iteration t, SO processes only a subset of data, B_t, sampled uniformly at random from the large corpus of data.

SO first updates the local factors q(z_n) for n ∈ B_t. Then, with a noisy gradient step, SO updates the sufficient statistics of the global factors. For example, for λ̂_k, compute¹

λ̂_k^∗ = λ_0 + (N / |B_t|) ∑_{n∈B_t} r̂_{nk} t(x_n)

Then update the global parameter as

λ̂_k^{(t)} ← ρ_t λ̂_k^∗ + (1 − ρ_t) λ̂_k^{(t−1)}

ρ_t is the learning rate. Convergence is guaranteed for appropriate decay schedules of ρ_t.

Performance:

This has computational advantages and sometimes achieves better solutions than the full-dataset algorithm. However, it is sensitive to the learning rate decay and the choice of batch size.

¹ t(x_n) denotes the sufficient statistics of the observation distribution.
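
A minimal sketch of this stochastic step for one global parameter, assuming a helper suff_stats for t(x_n) and a power-law decay for ρ_t; the names and the decay exponent are illustrative, not the paper's settings.

```python
import numpy as np

def so_global_step(lam_prev, lam0, x_batch, r_hat, N, t, suff_stats, kappa=0.6):
    """One stochastic online update of lambda_hat for all K components.

    lam_prev   : (K, D) current lambda_hat^(t-1)
    lam0       : (D,)   prior lambda_0
    x_batch    : data in the sampled batch B_t
    r_hat      : (|B_t|, K) responsibilities for the batch
    suff_stats : callable returning t(x_n) for the batch, shape (|B_t|, D) -- assumed helper
    """
    rho_t = (t + 1.0) ** (-kappa)                  # learning rate; 0.5 < kappa <= 1 gives a valid decay
    scale = N / r_hat.shape[0]                     # amplify the batch as if it were the full dataset
    lam_star = lam0[None, :] + scale * (r_hat.T @ suff_stats(x_batch))   # noisy target lambda*_k
    return rho_t * lam_star + (1.0 - rho_t) * lam_prev                   # blend with previous estimate
```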

SLIDE 7

Memoized online variational inference

The data is divided into B fixed batches {B_b}_{b=1}^{B}

Maintain memoized sufficient statistics² S_k^b = [N̂_k(B_b), s_k(B_b)]

Track full-dataset statistics S_k^0 = [N̂_k, s_k(x)]

Visit each distinct batch once per full pass through the data:

1 Update local parameters for the current batch, i.e., r̂_n for n ∈ B_b

2 Update cached global sufficient statistics for each component:

S_k^0 ← S_k^0 − S_k^b,   S_k^b ← [N̂_k(B_b), s_k(B_b)],   S_k^0 ← S_k^0 + S_k^b   (6)

3 Update global parameters, i.e., α̂_{k1}, α̂_{k0} and λ̂_k

Advantages:

1 Unlike SO, MO is guaranteed to improve the ELBO at every step

2 MO updates reduce to standard full-dataset updates

3 More scalable and converges faster than the full-dataset algorithm

4 Same computational complexity as SO, without the need for learning rates

² For notation, the sufficient statistics are defined as N̂_k ≜ E_q[∑_{n=1}^{N} z_{nk}] and s_k(x) ≜ E_q[∑_{n=1}^{N} z_{nk} t(x_n)].
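
A minimal sketch of one memoized pass built around the subtract-recompute-add swap of Eq. (6), assuming helper callables for the standard local step, global step, and per-datum sufficient statistics t(x_n); the data layout and names are illustrative.

```python
import numpy as np

def memoized_pass(batches, stats_total, stats_batch, global_params,
                  local_step, global_step, suff_stats):
    """One full pass of memoized online VB (birth/merge moves omitted).

    stats_total : dict with 'N' (K,) and 's' (K, D): full-dataset statistics S^0_k
    stats_batch : list of B dicts caching the per-batch statistics S^b_k
                  (initialize to zeros before the very first pass)
    local_step, global_step, suff_stats : assumed helpers for the responsibility
                  update, the global parameter update, and t(x_n)
    """
    for b, x_b in enumerate(batches):
        # Step 1: update local parameters r_hat for the current batch only
        r_hat = local_step(x_b, global_params)                     # shape (|B_b|, K)

        # Step 2: swap this batch's cached contribution (Eq. 6)
        new_N, new_s = r_hat.sum(axis=0), r_hat.T @ suff_stats(x_b)
        stats_total['N'] += new_N - stats_batch[b]['N']
        stats_total['s'] += new_s - stats_batch[b]['s']
        stats_batch[b] = {'N': new_N, 's': new_s}

        # Step 3: global update uses exact full-dataset statistics, no learning rate
        global_params = global_step(stats_total)
    return global_params, stats_total, stats_batch
```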

SLIDE 8

Birth moves

To escape local optima, we may wish to propose birth moves. This is done in three steps (a control-flow sketch follows below):

Collection: During pass 1, subsample data assigned to the targeted component k′

Creation: Before pass 2, fit a DPM to the subsampled data, yielding K′ new components

Adoption: During pass 2, update parameters with all K + K′ components. Future merge moves will eliminate unnecessary components.
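
A heavily simplified control-flow sketch of the three phases, with every helper (subsample, fit_dpm, expand) passed in as an assumed callable rather than the paper's actual routines.

```python
def birth_move(batches, target_k, global_params, stats, subsample, fit_dpm, expand):
    """Sketch of a birth proposal around component k' = target_k (all helpers assumed).

    Collection: gather data currently explained by the targeted component.
    Creation  : fit a small fresh DPM to that subsample, giving K' new components.
    Adoption  : expand the model to K + K' components; later memoized passes and
                merge moves decide which components survive.
    """
    x_target = subsample(batches, target_k, global_params)   # pass 1: collection
    new_components = fit_dpm(x_target)                       # before pass 2: creation
    return expand(global_params, stats, new_components)      # pass 2: adoption
```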


SLIDE 9

Merge moves

To reduce computation costs, we may wish to propose merge moves. A merge move has three steps (see the sketch after this list):

1 Select components k_a and k_b to merge into k_m

2 Form the candidate configuration q′ by exploiting the additive property of the statistics:

r̂_{n k_m} = r̂_{n k_a} + r̂_{n k_b},   S_{k_m}^0 = S_{k_a}^0 + S_{k_b}^0   (7)

3 Accept q′ only if the ELBO improves
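
A small sketch of constructing and testing a merge candidate, assuming k_a < k_b, dictionary-valued statistics as in the earlier memoized-pass sketch, and an assumed elbo helper that scores a configuration.

```python
import numpy as np

def propose_merge(ka, kb, r_hat, stats, elbo):
    """Build the merge candidate via the additive property (Eq. 7); assumes ka < kb.

    r_hat : (N, K) responsibilities; stats: dict with 'N' (K,) and 's' (K, D).
    elbo  : assumed helper scoring a configuration from responsibilities and statistics.
    """
    # Merged responsibilities: column ka absorbs column kb, which is removed
    r_new = np.delete(r_hat, kb, axis=1)
    r_new[:, ka] = r_hat[:, ka] + r_hat[:, kb]

    # Merged cached statistics: S^0_{km} = S^0_{ka} + S^0_{kb}
    stats_new = {key: np.delete(val, kb, axis=0) for key, val in stats.items()}
    for key in stats_new:
        stats_new[key][ka] = stats[key][ka] + stats[key][kb]

    # Accept the candidate only if the variational bound improves
    if elbo(r_new, stats_new) > elbo(r_hat, stats):
        return r_new, stats_new
    return r_hat, stats
```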

For each pass over the data, the authors' proposed algorithm performs:

One birth move

Memoized ascent steps for all batches

Several merges after the final batch


SLIDE 10

Results – toy data

Data: N = 100,000 synthetic image patches generated by a zero-mean GMM with 8 equally common components.

Each component has a 25 × 25 covariance matrix producing 5 × 5 patches. We wish to recover these covariance matrices and the number of components.


SLIDE 11

Results – MNIST digit clustering

Clustering N = 60,000 MNIST images of handwritten digits 0-9. As preprocessing, all images are projected to D = 50 dimensions via PCA.


SLIDE 12

Questions?
