Variational Inference for Dirichlet Process Mixtures


SLIDE 1

Variational Inference for Dirichlet Process Mixtures

By David Blei and Michael Jordan Presented by Daniel Acuna

SLIDE 2

Motivation

• Non-parametric Bayesian models seem to be the right idea: do not fix the number of mixture components.

• The Dirichlet process is an elegant and principled way to set the number of components "automatically".

• We need to explore new methods that cope with the intractable nature of marginalization and conditioning.

• MCMC sampling methods are widely used in this context, but there are other ideas.

SLIDE 3

Motivation

• Variational inference has proved to be faster and more predictable (deterministic) than sampling.

• The basic idea:
  1. Reformulate inference as an optimization problem.
  2. Relax the optimization problem.
  3. Optimize (find a bound on the original problem).

SLIDE 4

Background

• The Dirichlet process mixture is built on the Dirichlet process, a measure on measures: $G \sim \mathrm{DP}(\alpha, G_0)$.

• It admits multiple representations and interpretations:
  - Ferguson's existence theorem
  - Blackwell-MacQueen urn scheme
  - Chinese restaurant process
  - Stick-breaking construction

SLIDE 5

Dirichlet process mixture model

• Base distribution $G_0$ and positive scaling parameter $\alpha$.

• The draws $\eta_1, \ldots, \eta_{n-1}$ exhibit a clustering effect.

• The DP mixture has a natural interpretation as a flexible mixture model in which the number of components is random and grows as new data are observed.
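A minimal sketch of the model these bullets describe, in standard DP mixture notation, where $\eta_n$ is the parameter of the n-th observation:

$$G \mid \{\alpha, G_0\} \sim \mathrm{DP}(\alpha, G_0)$$
$$\eta_n \mid G \sim G$$
$$X_n \mid \eta_n \sim p(x_n \mid \eta_n)$$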

SLIDE 6

Stick-breaking representation

• Two infinite collections of independent random variables, for $i = \{1, 2, \ldots\}$:

$$V_i \sim \mathrm{Beta}(1, \alpha), \qquad \eta_i^* \sim G_0$$

• Stick-breaking representation of G:

$$\pi_i(v) = v_i \prod_{j=1}^{i-1} (1 - v_j)$$

$$G = \sum_{i=1}^{\infty} \pi_i(v)\, \delta_{\eta_i^*}$$

• G is discrete!
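A quick way to see the construction is to sample a truncated version of it. The sketch below assumes a truncation level T and a standard normal base measure $G_0$; both are illustrative choices, not fixed by the slides:

```python
# Truncated stick-breaking draw of G ~ DP(alpha, G0).
import numpy as np

def stick_breaking(alpha, T=100, rng=None):
    """Return weights pi_i(v) and atoms eta*_i of a truncated DP draw."""
    rng = np.random.default_rng(rng)
    v = rng.beta(1.0, alpha, size=T)           # V_i ~ Beta(1, alpha)
    atoms = rng.standard_normal(T)             # eta*_i ~ G0 (here N(0, 1))
    # pi_i(v) = v_i * prod_{j<i} (1 - v_j): break off a fraction v_i of
    # whatever stick length remains after the first i-1 breaks.
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    weights = v * remaining
    return weights, atoms

weights, atoms = stick_breaking(alpha=2.0, rng=0)
print(weights[:5], weights.sum())  # weights nearly sum to 1 for large T
```

Smaller $\alpha$ concentrates the mass on a few atoms, which is why draws of G are discrete with probability one.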

SLIDE 7

Stick-breaking rep.

The data can be described as arising from the following process:

1) Draw $V_i \mid \alpha \sim \mathrm{Beta}(1, \alpha)$, for $i = \{1, 2, \ldots\}$
2) Draw $\eta_i^* \mid G_0 \sim G_0$, for $i = \{1, 2, \ldots\}$
3) For the n-th data point:
   1) Draw $Z_n \mid \{v_1, v_2, \ldots\} \sim \mathrm{Mult}(\pi(v))$
   2) Draw $X_n \mid z_n \sim p(x_n \mid \eta_{z_n}^*)$
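A sketch of this generative process, reusing stick_breaking() from the previous snippet; the Gaussian emission $p(x \mid \eta^*) = \mathcal{N}(\eta^*, 1)$ is an illustrative assumption:

```python
# Generate data from a (truncated) DP mixture via the process above.
import numpy as np

def sample_dp_mixture(n, alpha, T=100, rng=None):
    rng = np.random.default_rng(rng)
    weights, atoms = stick_breaking(alpha, T=T, rng=rng)
    weights = weights / weights.sum()        # renormalize truncated weights
    z = rng.choice(T, size=n, p=weights)     # Z_n ~ Mult(pi(v))
    x = rng.normal(loc=atoms[z], scale=1.0)  # X_n ~ p(x_n | eta*_{z_n})
    return x, z

x, z = sample_dp_mixture(n=500, alpha=2.0, rng=1)
print(len(np.unique(z)), "components used")  # grows (slowly) with n and alpha
```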

SLIDE 8

DP mixture for exponential families

• The observable data are drawn from an exponential family, and the base distribution $G_0$ is the corresponding conjugate prior.
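As a sketch of what conjugacy buys here, in the standard exponential family form with $\lambda = (\lambda_1, \lambda_2)$ the natural parameter of $G_0$:

$$p(x \mid \eta) = h(x) \exp\{\eta^T x - a(\eta)\}$$
$$p(\eta \mid \lambda) \propto \exp\{\lambda_1^T \eta - \lambda_2\, a(\eta)\}$$

Conjugacy means the posterior over $\eta$ stays in the same family, with $\lambda_1$ incremented by the sufficient statistics and $\lambda_2$ by the number of observations.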

SLIDE 9

Variational inf. for DP mix.

• In the DP mixture, our goal is to compute the posterior over the latent variables.

• But this posterior is complex (intractable to compute exactly).

• Variational inference uses a proposal distribution that breaks the dependency among the latent variables.

SLIDE 10

Variational inf. for DP mix.

• In general, consider a model with hyperparameters $\theta$, latent variables $w = \{w_1, \ldots, w_M\}$, and observations $x = \{x_1, \ldots, x_N\}$.

• The posterior distribution:

$$p(w \mid x, \theta) = \frac{p(x, w \mid \theta)}{\int p(x, w \mid \theta)\, dw}$$

Difficult!

SLIDE 11

Variational inf. for DP mix

• This is difficult because the latent variables become dependent when conditioning on the observed data.

• We reformulate the problem using the mean-field method, which optimizes the KL divergence with respect to a variational distribution.

SLIDE 12

Variational inf. for DP mix

• That is, we aim to minimize the KL divergence between the variational distribution $q_\nu(w)$ and the posterior $p(w \mid x, \theta)$.

• Or equivalently, we try to maximize the lower bound on the log marginal likelihood $\log p(x \mid \theta)$.
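The two views are connected by the standard decomposition of the log marginal likelihood, stated here in the notation above:

$$\log p(x \mid \theta) = \mathrm{KL}\big(q_\nu(w) \,\|\, p(w \mid x, \theta)\big) + \big( E_q[\log p(x, W \mid \theta)] - E_q[\log q_\nu(W)] \big)$$

Since the left-hand side does not depend on $q_\nu$, minimizing the KL term and maximizing the second term (the lower bound) are the same problem.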

SLIDE 13

Mean field of exponential fam.

• For each latent variable, the conditional is a member of an exponential family:

$$p(w_i \mid w_{-i}, x, \theta) = h(w_i) \exp\{ g_i(w_{-i}, x, \theta)^T w_i - a(g_i(w_{-i}, x, \theta)) \}$$

• where $g_i(w_{-i}, x, \theta)$ is the natural parameter of $w_i$ when conditioned on the remaining latent variables.

• Here the family of variational distributions is the fully factorized

$$q_\nu(w) = \prod_{i=1}^{M} \exp\{ \nu_i^T w_i - a(\nu_i) \}\, h(w_i)$$

with variational parameters $\nu = \{\nu_1, \ldots, \nu_M\}$.

SLIDE 14

Mean-field of exponential family

• Optimizing the KL divergence yields, after derivation (see Appendix), the coordinate update

$$\nu_i = E_q[\, g_i(W_{-i}, x, \theta)\, ]$$

• Notice:
  - In Gibbs sampling, we draw $w_i$ from $p(w_i \mid w_{-i}, x, \theta)$.
  - Here, we update $\nu_i$ to set it equal to $E_q[\, g_i(W_{-i}, x, \theta)\, ]$.

SLIDE 15

DP mixtures

• The latent variables are the stick lengths $V$, the atoms $\eta^*$, and the cluster assignments $Z$.

• The hyperparameters are the scaling parameter $\alpha$ and the parameter $\lambda$ of the conjugate base distribution.

• And the bound is now the one sketched below.
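A sketch of that bound, assembled from the stick-breaking model above: one expected log-probability term per factor, plus the entropy $H(q)$ of the variational distribution:

$$\log p(x \mid \alpha, \lambda) \ge E_q[\log p(V \mid \alpha)] + E_q[\log p(\eta^* \mid \lambda)] + \sum_{n=1}^{N} \big( E_q[\log p(Z_n \mid V)] + E_q[\log p(x_n \mid Z_n)] \big) + H(q)$$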

SLIDE 16

Relaxation of optimization

• To exploit this bound with a family q, we need to approximate G.

• G is an infinite-dimensional random measure.

• One approximation is to truncate the stick-breaking representation!

SLIDE 17

Relaxation of optimization

• Fix a truncation level T and set $q(v_T = 1) = 1$; then the proportions $\pi_t(v)$ are equal to zero for $t > T$
  (remember $\pi_i(v) = v_i \prod_{j=1}^{i-1}(1 - v_j)$ from the stick-breaking representation).

• Propose the factorized family

$$q(v, \eta^*, z) = \prod_{t=1}^{T-1} q_{\gamma_t}(v_t) \prod_{t=1}^{T} q_{\tau_t}(\eta_t^*) \prod_{n=1}^{N} q_{\phi_n}(z_n)$$

where the $q_{\gamma_t}(v_t)$ are Beta distributions, the $q_{\tau_t}(\eta_t^*)$ are exponential family distributions, and the $q_{\phi_n}(z_n)$ are multinomial distributions.

SLIDE 18

Optimization

• The optimization is performed by a coordinate ascent algorithm.

• From the bound, the term $E_q[\log p(Z_n \mid V)]$ involves the full stick-breaking representation of G: infinite!

SLIDE 19

Optimization

• But

$$p(z_n \mid v) = \prod_{i=1}^{\infty} (1 - v_i)^{\mathbf{1}[z_n > i]}\, v_i^{\mathbf{1}[z_n = i]}$$

• Then, under the truncated q,

$$E_q[\log p(z_n \mid V)] = \sum_{i=1}^{T} \big( q(z_n > i)\, E_q[\log(1 - V_i)] + q(z_n = i)\, E_q[\log V_i] \big)$$

• Where

$$q(z_n = i) = \phi_{n,i}, \qquad q(z_n > i) = \sum_{j=i+1}^{T} \phi_{n,j}$$
$$E_q[\log V_i] = \Psi(\gamma_{i,1}) - \Psi(\gamma_{i,1} + \gamma_{i,2}), \qquad E_q[\log(1 - V_i)] = \Psi(\gamma_{i,2}) - \Psi(\gamma_{i,1} + \gamma_{i,2})$$

with $\Psi$ the digamma function, so every term is finite and computable.

SLIDE 20

Optimization

• Finally, the mean-field coordinate ascent algorithm boils down to cycling through the updates

$$\gamma_{t,1} = 1 + \sum_n \phi_{n,t}, \qquad \gamma_{t,2} = \alpha + \sum_n \sum_{j=t+1}^{T} \phi_{n,j}$$
$$\tau_{t,1} = \lambda_1 + \sum_n \phi_{n,t}\, x_n, \qquad \tau_{t,2} = \lambda_2 + \sum_n \phi_{n,t}$$
$$\phi_{n,t} \propto \exp\Big( E_q[\log V_t] + \sum_{j=1}^{t-1} E_q[\log(1 - V_j)] + E_q[\eta_t^*]^T x_n - E_q[a(\eta_t^*)] \Big)$$

until the bound converges. A sketch in code follows.
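A minimal sketch of these updates for a DP mixture of univariate Gaussians with unit variance and a $\mathcal{N}(0, 1)$ base measure; the model choice, truncation level, and all variable names are illustrative assumptions, not taken from the slides:

```python
# Coordinate ascent variational inference for a truncated DP mixture
# of unit-variance Gaussians with base measure G0 = N(0, 1).
import numpy as np
from scipy.special import digamma

def dp_mixture_cavi(x, alpha=1.0, T=20, n_iter=100, rng=None):
    rng = np.random.default_rng(rng)
    phi = rng.dirichlet(np.ones(T), size=len(x))  # q(z_n): responsibilities
    for _ in range(n_iter):
        counts = phi.sum(axis=0)                  # sum_n phi_{n,t}
        # q(v_t) = Beta(gamma_{t,1}, gamma_{t,2}) for t < T (v_T fixed at 1)
        gamma1 = 1.0 + counts[:-1]
        gamma2 = alpha + np.cumsum(counts[::-1])[::-1][1:]  # sum_{j>t} counts_j
        # q(mu_t) = N(m_t, s2_t): conjugate Gaussian update
        s2 = 1.0 / (1.0 + counts)
        m = s2 * (phi.T @ x)
        # E_q[log pi_t(V)] from the digamma identities on the Beta parameters
        e_log_v = digamma(gamma1) - digamma(gamma1 + gamma2)
        e_log_1mv = digamma(gamma2) - digamma(gamma1 + gamma2)
        e_log_pi = np.append(e_log_v, 0.0)
        e_log_pi += np.concatenate(([0.0], np.cumsum(e_log_1mv)))
        # q(z_n) update: expected log weight + expected Gaussian log-likelihood
        logit = e_log_pi + np.outer(x, m) - 0.5 * (m**2 + s2)
        phi = np.exp(logit - logit.max(axis=1, keepdims=True))
        phi /= phi.sum(axis=1, keepdims=True)
    return phi, m

# Usage: two well-separated Gaussians; only two components keep real mass.
rng = np.random.default_rng(0)
x = np.concatenate((rng.normal(-3, 1, 200), rng.normal(3, 1, 200)))
phi, means = dp_mixture_cavi(x)
print(np.round(means[phi.sum(axis=0) > 10], 2))  # roughly [-3, 3]
```

Each pass is one sweep of the coordinate ascent: the stick, atom, and assignment factors are updated in turn, and each update is deterministic, which is what makes the method more predictable than sampling.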

SLIDE 21

Predictive distribution
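Under the truncated variational approximation, the predictive density for a new point can be sketched as a q-weighted mixture over the T components (expectations taken under the fitted q):

$$p(x_{N+1} \mid x, \alpha, \lambda) \approx \sum_{t=1}^{T} E_q[\pi_t(V)]\; E_q[p(x_{N+1} \mid \eta_t^*)]$$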

SLIDE 22

Empirical comparison

SLIDE 23

Conclusion

• Faster than sampling for particular problems.

• It is unlikely that one method will dominate the other; both have their pros and cons.

• This is the simplest variational method (mean-field). Other methods are worth exploring.

• Check www.videolectures.net