Hierarchical Dirichlet Processes. Presenters: Micah Hodosh, Yizhou Sun



slide-1
SLIDE 1

1

Hierarchical Dirichlet Processes

Presenters: Micah Hodosh, Yizhou Sun 4/7/2010

slide-2
SLIDE 2

2

Content

  • Introduction and Motivation
  • Dirichlet Processes
  • Hierarchical Dirichlet Processes
    – Definition
    – Three Analogs
  • Inference
    – Three Sampling Strategies

slide-3
SLIDE 3

3

Introduction

  • Hierarchical approach to model-based clustering of grouped data
  • Find an unknown number of clusters to capture the structure of each group and allow for sharing among the groups
  • Example: documents with an arbitrary number of topics, which are shared globally across the set of corpora
  • A Dirichlet process (DP) will be used as a prior on mixture components
  • The DP will be extended to an HDP to allow for sharing clusters among related clustering problems

slide-4
SLIDE 4

4

Motivation

  • Interested in problems with observations organized into groups
  • Let xji be the ith observation of group j, with xj = {xj1, xj2, ...}
  • xji is exchangeable with any other element of xj
  • For all j, k, xj is exchangeable with xk

slide-5
SLIDE 5

5

Motivation

  • Assume each observation is drawn independently from a mixture model
  • Factor θji is the mixture component associated with xji
  • Let F(θji) be the distribution of xji given θji
  • Let Gj be the prior distribution of θj1, θj2, ..., which are conditionally independent given Gj

slide-6
SLIDE 6

6

Content

  • Introduction and Motivation
  • Dirichlet Processes
  • Hierarchical Dirichlet Processes
    – Definition
    – Three Analogs
  • Inference
    – Three Sampling Strategies

slide-7
SLIDE 7

7

The Dirichlet Process

  • Let (Θ, ℬ) be a measurable space
  • Let G0 be a probability measure on that space
  • Let A = (A1, A2, ..., Ar) be a finite partition of that space
  • Let α0 be a positive real number
  • G ~ DP(α0, G0) is defined s.t. for all such partitions A:
    $(G(A_1), \dots, G(A_r)) \sim \mathrm{Dir}(\alpha_0 G_0(A_1), \dots, \alpha_0 G_0(A_r))$

slide-8
SLIDE 8

8

Stick Breaking Construction

  • The general idea is that the distribution G will be a weighted average of point masses placed at an infinite set of random variables
  • Two infinite sets of i.i.d. random variables:
    – ϕk ~ G0: samples from the initial probability measure
    – πk' ~ Beta(1, α0): defines the weights of these samples

slide-9
SLIDE 9

9

Stick Breaking Construction

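  • πk' ~ Beta(1, α0)
  • Define πk as
    $\pi_k = \pi_k' \prod_{l=1}^{k-1} (1 - \pi_l')$

[Figure: a stick of unit length broken into pieces of length π1', (1−π1')π2', ...; after the first break the remaining length is 1−π1']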

slide-10
SLIDE 10

10

Stick Breaking Construction

  • π = (πk) ~ GEM(α0)
  • These πk define the weight of drawing the value corresponding to ϕk:
    $G = \sum_{k=1}^{\infty} \pi_k \delta_{\phi_k}$
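
The construction on the last three slides can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the presentation: the infinite sequence is cut off at a hypothetical `truncation` level, and the function names and the standard-normal base measure in the usage note are assumptions made for the example.

```python
import random

def stick_breaking_weights(alpha0, truncation, rng):
    """Return the first `truncation` weights pi_k and the unbroken leftover mass."""
    weights = []
    remaining = 1.0                      # length of stick not yet broken off
    for _ in range(truncation):
        fraction = rng.betavariate(1.0, alpha0)   # pi_k' ~ Beta(1, alpha0)
        weights.append(fraction * remaining)      # pi_k = pi_k' * prod_{l<k}(1 - pi_l')
        remaining *= 1.0 - fraction
    return weights, remaining

def sample_truncated_dp(alpha0, base_sampler, truncation, rng):
    """Approximate G ~ DP(alpha0, G0) by its first `truncation` atoms."""
    weights, leftover = stick_breaking_weights(alpha0, truncation, rng)
    atoms = [base_sampler(rng) for _ in range(truncation)]   # phi_k ~ G0
    return atoms, weights, leftover
```

For example, `sample_truncated_dp(1.0, lambda r: r.gauss(0.0, 1.0), 50, random.Random(0))` returns 50 atoms and weights; the leftover mass indicates how much of G the truncation ignores.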

slide-11
SLIDE 11

11

Polya urn scheme/ CRP

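  • Let θ1, θ2, ... be i.i.d. random variables distributed according to G
  • Consider the distribution of θi given θ1, ..., θi-1, integrating out G:
    $\theta_i \mid \theta_1, \dots, \theta_{i-1}, \alpha_0, G_0 \sim \sum_{l=1}^{i-1} \frac{1}{i-1+\alpha_0} \delta_{\theta_l} + \frac{\alpha_0}{i-1+\alpha_0} G_0$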

slide-12
SLIDE 12

12

Polya urn scheme

  • Consider a simple urn model representation. Each sample is a ball of a certain color
  • Balls are drawn equiprobably, and when a ball of color x is drawn, both that ball and a new ball of color x are returned to the urn
  • With probability proportional to α0, a new atom is created from G0
    – A new ball of a new color is added to the urn

slide-13
SLIDE 13

13

Polya urn scheme

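  • Let ϕ1, ..., ϕK be the distinct values taken on by θ1, ..., θi-1
  • If mk is the number of values among θ1, ..., θi-1 equal to ϕk:
    $\theta_i \mid \theta_1, \dots, \theta_{i-1}, \alpha_0, G_0 \sim \sum_{k=1}^{K} \frac{m_k}{i-1+\alpha_0} \delta_{\phi_k} + \frac{\alpha_0}{i-1+\alpha_0} G_0$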

slide-14
SLIDE 14

14

Chinese restaurant process:

[Figure: Chinese restaurant process. Customers θ1, θ2, θ3, θ4 are seated at tables serving dishes ϕ1, ϕ2, ϕ3, ...]
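
The seating rule of the Chinese restaurant process can be simulated directly. The sketch below assumes the standard rule (customer i+1 joins an existing table k with probability m_k/(i+α0) and opens a new table with probability α0/(i+α0)); the function name is made up for this example.

```python
import random

def chinese_restaurant_process(n_customers, alpha0, rng):
    """Seat customers one at a time; return each customer's table and the table sizes."""
    table_sizes = []     # m_k: number of customers at table k
    seating = []         # table index chosen by each customer
    for i in range(n_customers):
        # Total weight is i (customers already seated) plus alpha0 (new table).
        threshold = rng.random() * (i + alpha0)
        running = 0.0
        chosen = len(table_sizes)          # default: open a new table
        for k, m_k in enumerate(table_sizes):
            running += m_k
            if threshold < running:
                chosen = k
                break
        if chosen == len(table_sizes):
            table_sizes.append(0)          # a new dish would be drawn from G0 here
        table_sizes[chosen] += 1
        seating.append(chosen)
    return seating, table_sizes
```

Larger values of `alpha0` open new tables more often, so the number of clusters grows faster with the number of customers.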

slide-15
SLIDE 15

15

Dirichlet Process Mixture Model

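  • Dirichlet process as nonparametric prior on the parameters of a mixture model:
    $G \mid \alpha_0, G_0 \sim \mathrm{DP}(\alpha_0, G_0)$
    $\theta_i \mid G \sim G$
    $x_i \mid \theta_i \sim F(\theta_i)$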

slide-16
SLIDE 16

16

Dirichlet Process Mixture Model

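  • From the stick-breaking representation:
    – θi takes the value ϕk with probability πk
  • Let zi be the indicator variable representing which ϕk the factor θi is associated with:
    $\pi \mid \alpha_0 \sim \mathrm{GEM}(\alpha_0)$, $z_i \mid \pi \sim \pi$
    $\phi_k \mid G_0 \sim G_0$, $x_i \mid z_i, (\phi_k) \sim F(\phi_{z_i})$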

slide-17
SLIDE 17

17

Infinite Limit of Finite Mixture Model

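  • Consider a multinomial on L mixture components with parameters π = (π1, ..., πL)
  • Let π have a symmetric Dirichlet prior with hyperparameters (α0/L, ..., α0/L)
  • If xi is drawn from a mixture component zi according to the defined distribution:
    $\pi \mid \alpha_0 \sim \mathrm{Dir}(\alpha_0/L, \dots, \alpha_0/L)$, $z_i \mid \pi \sim \pi$
    $\phi_k \mid G_0 \sim G_0$, $x_i \mid z_i, (\phi_k) \sim F(\phi_{z_i})$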

slide-18
SLIDE 18

18

Infinite Limit of Finite Mixture Model

  • If $G^L = \sum_{k=1}^{L} \pi_k \delta_{\phi_k}$, then as L approaches ∞:
    – The marginal distribution of x1, x2, ... approaches that of a Dirichlet process mixture model
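
One way to see this limit numerically is to compare the expected number of occupied components after n draws. The sketch below relies on two standard facts that are not stated on the slides: under a DP the expected number of distinct clusters is Σ_{i=0}^{n-1} α0/(α0+i), and under a symmetric Dir(α0/L, ..., α0/L) prior a given component is empty after n draws with probability Π_{i=0}^{n-1} (α0 − α0/L + i)/(α0 + i). The function names are hypothetical.

```python
def expected_clusters_dp(n, alpha0):
    """Expected number of distinct clusters among n draws under DP(alpha0, G0)."""
    return sum(alpha0 / (alpha0 + i) for i in range(n))

def expected_occupied_finite(n, alpha0, n_components):
    """Expected number of non-empty components under a symmetric Dirichlet prior."""
    a = alpha0 / n_components            # per-component hyperparameter alpha0 / L
    p_empty = 1.0                        # probability a fixed component gets no data
    for i in range(n):
        p_empty *= (alpha0 - a + i) / (alpha0 + i)
    return n_components * (1.0 - p_empty)

if __name__ == "__main__":
    for n_components in (10, 100, 1000, 10000):
        print(n_components, expected_occupied_finite(100, 1.0, n_components))
    print("DP limit", expected_clusters_dp(100, 1.0))
```

As L grows, the finite-model expectation increases towards the DP value, which is consistent with the limit stated on the slide.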

slide-19
SLIDE 19

19

Content

  • Introduction and Motivation
  • Dirichlet Processes
  • Hierarchical Dirichlet Processes
    – Definition
    – Three Analogs
  • Inference
    – Three Sampling Strategies

slide-20
SLIDE 20

20

HDP Definition

  • General idea
    – To model grouped data
      • Each group j <=> a Dirichlet process mixture model
      • Hierarchical prior to link these mixture models <=> hierarchical Dirichlet process
    – A hierarchical Dirichlet process is
      • A distribution over a set of random probability measures ( )

slide-21
SLIDE 21

21

HDP Definition (Cont.)

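  • Formally, a hierarchical Dirichlet process defines
    – A set of random probability measures Gj, one for each group j
    – A global random probability measure G0
  • G0 is distributed as a Dirichlet process: $G_0 \mid \gamma, H \sim \mathrm{DP}(\gamma, H)$
  • The Gj are conditionally independent given G0, and also follow a DP: $G_j \mid \alpha_0, G_0 \sim \mathrm{DP}(\alpha_0, G_0)$
  • G0 is discrete!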

slide-22
SLIDE 22

22

Hierarchical Dirichlet Process Mixture Model

  • Hierarchical Dirichlet process as prior distribution over the factors for grouped data
  • For each group j
    – Each observation xji corresponds to a factor θji
    – The factors θji are i.i.d. random variables distributed as Gj:
      $\theta_{ji} \mid G_j \sim G_j$, $x_{ji} \mid \theta_{ji} \sim F(\theta_{ji})$

slide-23
SLIDE 23

23

Some Notes

  • HDP can be extended to more than two levels
    – The base measure H can be drawn from a DP, and so on and so forth
    – A tree can be formed
      • Each node is a DP
      • Child nodes are conditionally independent given their parent, which serves as their base measure
      • The atoms at a given node are shared among all its descendant nodes

slide-24
SLIDE 24

24

Analog I: The stick-breaking construction

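  • Stick-breaking representation of G0:
    $G_0 = \sum_{k=1}^{\infty} \beta_k \delta_{\phi_k}$, with $\phi_k \sim H$ and $\beta = (\beta_k) \sim \mathrm{GEM}(\gamma)$,
    i.e., $\beta_k' \sim \mathrm{Beta}(1, \gamma)$, $\beta_k = \beta_k' \prod_{l=1}^{k-1} (1 - \beta_l')$
  • Stick-breaking representation of Gj:
    $G_j = \sum_{k=1}^{\infty} \pi_{jk} \delta_{\phi_k}$, with $\pi_j \mid \alpha_0, \beta \sim \mathrm{DP}(\alpha_0, \beta)$,
    i.e., $\pi_{jk}' \sim \mathrm{Beta}\big(\alpha_0 \beta_k, \alpha_0 (1 - \sum_{l=1}^{k} \beta_l)\big)$, $\pi_{jk} = \pi_{jk}' \prod_{l=1}^{k-1} (1 - \pi_{jl}')$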
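
A truncated sketch of the two-level construction may make the sharing of atoms more concrete. It assumes the form given in Teh et al. (2006): global weights β ~ GEM(γ), and group weights built from π_jk' ~ Beta(α0 β_k, α0 (1 − Σ_{l≤k} β_l)). The truncation level, the small numerical `floor`, and the function names are assumptions made for this example.

```python
import random

def gem_weights(gamma, truncation, rng):
    """First `truncation` global weights beta_k of beta ~ GEM(gamma)."""
    weights, remaining = [], 1.0
    for _ in range(truncation):
        fraction = rng.betavariate(1.0, gamma)
        weights.append(fraction * remaining)
        remaining *= 1.0 - fraction
    return weights

def group_weights(alpha0, beta, rng, floor=1e-12):
    """Group-level weights pi_jk over the same atoms as beta (truncated)."""
    weights, remaining, cumulative = [], 1.0, 0.0
    for beta_k in beta:
        cumulative += beta_k
        # Mass of beta beyond index k; floored to keep Beta parameters positive.
        tail = max(1.0 - cumulative, floor)
        fraction = rng.betavariate(max(alpha0 * beta_k, floor), alpha0 * tail)
        weights.append(fraction * remaining)
        remaining *= 1.0 - fraction
    return weights
```

Every group reuses the same atom indices k, so components are shared across groups while each group gets its own mixing proportions; a larger `alpha0` keeps each group's weights closer to β.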

slide-25
SLIDE 25

25

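Equivalent representation using conditional distributions

$\beta \mid \gamma \sim \mathrm{GEM}(\gamma)$
$\pi_j \mid \alpha_0, \beta \sim \mathrm{DP}(\alpha_0, \beta)$, $z_{ji} \mid \pi_j \sim \pi_j$
$\phi_k \mid H \sim H$, $x_{ji} \mid z_{ji}, (\phi_k) \sim F(\phi_{z_{ji}})$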

slide-26
SLIDE 26

26

Analog II: the Chinese restaurant franchise

  • General idea:
    – Allow multiple restaurants to share a common menu, which includes a set of dishes
    – A restaurant has infinitely many tables; each table serves only one dish

slide-27
SLIDE 27

27

Notations

  • θji – The factor (dish) corresponding to xji (customer i in restaurant j)
  • ϕ1, ..., ϕK – The factors (dishes) drawn from H
  • ψjt – The dish chosen by table t in restaurant j
  • tji: the index of the ψjt associated with θji
  • kjt: the index of the ϕk associated with ψjt

slide-28
SLIDE 28

28

Conditional distributions

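  • Integrate out Gj (sampling table for customer):
    $\theta_{ji} \mid \theta_{j1}, \dots, \theta_{j,i-1}, \alpha_0, G_0 \sim \sum_{t=1}^{m_{j\cdot}} \frac{n_{jt\cdot}}{i-1+\alpha_0} \delta_{\psi_{jt}} + \frac{\alpha_0}{i-1+\alpha_0} G_0$
  • Integrate out G0 (sampling dish for table):
    $\psi_{jt} \mid \psi_{11}, \psi_{12}, \dots, \psi_{21}, \dots, \psi_{j,t-1}, \gamma, H \sim \sum_{k=1}^{K} \frac{m_{\cdot k}}{m_{\cdot\cdot}+\gamma} \delta_{\phi_k} + \frac{\gamma}{m_{\cdot\cdot}+\gamma} H$

Count notation: $n_{jtk}$, number of customers in restaurant j, at table t, eating dish k; $m_{jk}$, number of tables in restaurant j serving dish k. A dot in place of an index denotes a sum over that index.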
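
The two-stage seating process can be simulated as follows. This is a sketch under the usual Chinese restaurant franchise assumptions (a customer joins table t with weight n_jt or opens a new table with weight α0; a new table orders dish k with weight m_·k, the number of tables serving k across all restaurants, or a new dish with weight γ). Dish parameters are not drawn here, and all names are made up for the example.

```python
import random

def sample_index(weights, rng):
    """Draw an index with probability proportional to the given weights."""
    threshold = rng.random() * sum(weights)
    running = 0.0
    for index, weight in enumerate(weights):
        running += weight
        if threshold < running:
            return index
    return len(weights) - 1              # guard against floating-point round-off

def simulate_franchise(customers_per_restaurant, alpha0, gamma, rng):
    dish_table_counts = []               # m_.k, shared by all restaurants
    restaurants = []
    for n_customers in customers_per_restaurant:
        table_sizes = []                 # n_jt. for this restaurant
        table_dishes = []                # k_jt for this restaurant
        for _ in range(n_customers):
            t = sample_index(table_sizes + [alpha0], rng)
            if t == len(table_sizes):    # customer opens a new table
                k = sample_index(dish_table_counts + [gamma], rng)
                if k == len(dish_table_counts):
                    dish_table_counts.append(0)   # brand-new dish on the menu
                dish_table_counts[k] += 1
                table_sizes.append(0)
                table_dishes.append(k)
            table_sizes[t] += 1
        restaurants.append((table_sizes, table_dishes))
    return restaurants, dish_table_counts
```

Because `dish_table_counts` is shared, a dish that is popular in one restaurant is more likely to be ordered by new tables in every other restaurant.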

slide-29
SLIDE 29

29

Analog III: The infinite limit of finite mixture models

  • Two different finite models both yield the HDP mixture model (HDPM)
    – Global mixing proportions place a prior on the group-specific mixing proportions

As L goes to infinity, this model approaches the HDP mixture model

slide-30
SLIDE 30

30

    – Each group chooses a subset of T mixture components

As L and T go to infinity, this model approaches the HDP mixture model

slide-31
SLIDE 31

31

Content

  • Introduction and Motivation
  • Dirichlet Processes
  • Hierarchical Dirichlet Processes
    – Definition
    – Three Analogs
  • Inference
    – Three Sampling Strategies

slide-32
SLIDE 32

32

Introduction to three MCMC schemes

  • Assumption: H is conjugate to F
    – A straightforward Gibbs sampler based on the Chinese restaurant franchise
    – An augmented representation involving both the Chinese restaurant franchise and the posterior for G0
    – A variation on scheme 2 with streamlined bookkeeping

slide-33
SLIDE 33

33

Conditional density of data under mixture component k

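  • For data xji, the conditional density under component k given all data items except xji is:
    $f_k^{-x_{ji}}(x_{ji}) = \frac{\int f(x_{ji} \mid \phi_k) \prod_{j'i' \neq ji,\, z_{j'i'} = k} f(x_{j'i'} \mid \phi_k)\, h(\phi_k)\, d\phi_k}{\int \prod_{j'i' \neq ji,\, z_{j'i'} = k} f(x_{j'i'} \mid \phi_k)\, h(\phi_k)\, d\phi_k}$
  • For data set xjt, the conditional density $f_k^{-x_{jt}}(x_{jt})$ is similarly defined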
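
With a conjugate pair this conditional density has a closed form. As an illustration only (the slides do not fix F and H), assume F is a categorical distribution over a vocabulary of V words and H is a symmetric Dirichlet with parameter η; then the density of word w under component k is (n_kw + η)/(n_k· + Vη), where the counts exclude the item being scored. The function and argument names are hypothetical.

```python
def predictive_density(word, component_word_counts, eta, vocab_size):
    """Dirichlet-categorical predictive probability of `word` under one component.

    `component_word_counts` maps word id -> count of items currently assigned to
    the component, with the item being scored already removed.
    """
    total = sum(component_word_counts.values())
    return (component_word_counts.get(word, 0) + eta) / (total + vocab_size * eta)
```

For a brand-new component the counts are empty, so every word gets probability 1/V, which plays the role of the density under a new component in the sampling schemes.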

slide-34
SLIDE 34

34

Scheme I: Posterior sampling in the Chinese restaurant franchise

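  • Sampling t and k
    – Sampling t:
      $p(t_{ji} = t \mid \mathbf{t}^{-ji}, \mathbf{k}) \propto n_{jt\cdot}^{-ji} f_{k_{jt}}^{-x_{ji}}(x_{ji})$ if t is previously used, and $\propto \alpha_0\, p(x_{ji} \mid \mathbf{t}^{-ji}, t_{ji} = t^{new}, \mathbf{k})$ if $t = t^{new}$
      • If tji is a new t, sample the k corresponding to it by
        $p(k_{jt^{new}} = k \mid \mathbf{t}, \mathbf{k}^{-jt^{new}}) \propto m_{\cdot k} f_k^{-x_{ji}}(x_{ji})$ if k is previously used, and $\propto \gamma f_{k^{new}}^{-x_{ji}}(x_{ji})$ if $k = k^{new}$
      • And
        $p(x_{ji} \mid \mathbf{t}^{-ji}, t_{ji} = t^{new}, \mathbf{k}) = \sum_{k=1}^{K} \frac{m_{\cdot k}}{m_{\cdot\cdot}+\gamma} f_k^{-x_{ji}}(x_{ji}) + \frac{\gamma}{m_{\cdot\cdot}+\gamma} f_{k^{new}}^{-x_{ji}}(x_{ji})$
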
slide-35
SLIDE 35

35

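    – Sampling k:
      $p(k_{jt} = k \mid \mathbf{t}, \mathbf{k}^{-jt}) \propto m_{\cdot k}^{-jt} f_k^{-x_{jt}}(x_{jt})$ if k is previously used, and $\propto \gamma f_{k^{new}}^{-x_{jt}}(x_{jt})$ if $k = k^{new}$
      • Where xjt is all the observations for table t in restaurant j
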
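
The table-sampling step of Scheme I can be sketched as a small function that turns counts and likelihood values into a probability vector. It assumes the conditional from Teh et al. (2006): an existing table t has weight n_jt · f_{k_jt}(x), and a new table has weight α0 · [Σ_k m_·k f_k(x) + γ f_new(x)] / (m_·· + γ). All counts are assumed to already exclude the customer being resampled, and every name below is hypothetical.

```python
def table_probabilities(table_sizes, table_dishes, dish_table_counts,
                        dish_likelihoods, new_dish_likelihood, alpha0, gamma):
    """Return P(t_ji = t) for each existing table, with the new table as last entry."""
    # Existing tables: customers at the table times likelihood under its dish.
    weights = [n_jt * dish_likelihoods[k]
               for n_jt, k in zip(table_sizes, table_dishes)]
    # New table: likelihood of x averaged over which dish the new table would order.
    total_tables = sum(dish_table_counts)
    mixture = sum(m_k * f_k for m_k, f_k in zip(dish_table_counts, dish_likelihoods))
    mixture += gamma * new_dish_likelihood
    weights.append(alpha0 * mixture / (total_tables + gamma))
    normaliser = sum(weights)
    return [w / normaliser for w in weights]
```

Drawing an index from the returned vector gives the new table assignment; if the last index is drawn, a dish for the new table is sampled next.
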
slide-36
SLIDE 36

36

Scheme II: Posterior sampling with an augmented representation

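  • Posterior of G0 given ψ:
    $G_0 \mid \gamma, H, \psi \sim \mathrm{DP}\Big(\gamma + m_{\cdot\cdot}, \frac{\gamma H + \sum_{k=1}^{K} m_{\cdot k} \delta_{\phi_k}}{\gamma + m_{\cdot\cdot}}\Big)$
  • An explicit construction for G0 is given:
    $\beta = (\beta_1, \dots, \beta_K, \beta_u) \sim \mathrm{Dir}(m_{\cdot 1}, \dots, m_{\cdot K}, \gamma)$, $G_u \sim \mathrm{DP}(\gamma, H)$
    $G_0 = \sum_{k=1}^{K} \beta_k \delta_{\phi_k} + \beta_u G_u$
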
slide-37
SLIDE 37

37

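  • Given a sample of G0, the posterior for each group factorizes, and sampling in each group can be performed separately
  • Sampling t and k:
    – Almost the same as in Scheme I
      • Except using $\beta_k$ to replace $m_{\cdot k}$ (and $\beta_u$ to replace γ)
      • When a new component knew is instantiated, draw $b \sim \mathrm{Beta}(1, \gamma)$, and set $\beta_{k^{new}} = b \beta_u$ and $\beta_u^{new} = (1 - b) \beta_u$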

slide-38
SLIDE 38

38

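    – Sampling for β:
      $(\beta_1, \dots, \beta_K, \beta_u) \mid \mathbf{m}, \mathbf{k} \sim \mathrm{Dir}(m_{\cdot 1}, \dots, m_{\cdot K}, \gamma)$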

slide-39
SLIDE 39

39

Scheme III: Posterior sampling by direct assignment

  • Difference from Schemes I and II:
    – In I and II, data items are first assigned to some table t, and the tables are then assigned to some component k
    – In III, data items are assigned directly to a component via the variable $z_{ji}$, which is equivalent to $k_{j t_{ji}}$
      • Tables are collapsed to the counts $m_{jk}$

slide-40
SLIDE 40

40

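  • Sampling z:
    $p(z_{ji} = k \mid \mathbf{z}^{-ji}, \mathbf{m}, \beta) \propto (n_{j\cdot k}^{-ji} + \alpha_0 \beta_k) f_k^{-x_{ji}}(x_{ji})$ if k is previously used, and $\propto \alpha_0 \beta_u f_{k^{new}}^{-x_{ji}}(x_{ji})$ if $k = k^{new}$
  • Sampling m:
    $p(m_{jk} = m \mid \mathbf{z}, \mathbf{m}^{-jk}, \beta) = \frac{\Gamma(\alpha_0 \beta_k)}{\Gamma(\alpha_0 \beta_k + n_{j\cdot k})} s(n_{j\cdot k}, m) (\alpha_0 \beta_k)^m$, where s(n, m) are unsigned Stirling numbers of the first kind
  • Sampling β:
    $(\beta_1, \dots, \beta_K, \beta_u) \mid \mathbf{m} \sim \mathrm{Dir}(m_{\cdot 1}, \dots, m_{\cdot K}, \gamma)$
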
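
The table counts m_jk need not be drawn by evaluating Stirling numbers. A commonly used alternative, which is not taken from the slides, is to simulate the number of tables occupied when n_j·k customers are seated by a Chinese restaurant process with concentration α0 β_k: the i-th customer (counting from 0) opens a new table with probability α0 β_k / (α0 β_k + i), and the resulting count has the Stirling-number distribution. The function name is made up for this sketch.

```python
import random

def sample_table_count(n_customers, concentration, rng):
    """Number of tables after seating n_customers in a CRP with the given concentration."""
    tables = 0
    for i in range(n_customers):
        # Customer i opens a new table with probability c / (c + i).
        if rng.random() < concentration / (concentration + i):
            tables += 1
    return tables
```

In the direct assignment scheme this would be called once per group j and component k with `n_customers` set to n_j·k and `concentration` set to α0 β_k; components with no data in the group get zero tables.
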
slide-41
SLIDE 41

41

Comparison of Sampling Schemes

  • In terms of ease of implementation
    – Direct assignment is better
  • In terms of convergence speed
    – Direct assignment changes the component membership of data items one at a time
    – In Schemes I and II, changing the component membership of one table changes the membership of multiple data items at the same time, leading to better performance

slide-42
SLIDE 42

42

Applications

  • Hierarchical DP extension of LDA

– In CRF representation: dishes are topics, customers are the observed words

slide-43
SLIDE 43

43

Applications

  • HDP-HMM
slide-44
SLIDE 44

44

References

  • Yee Whye Teh et al., Hierarchical Dirichlet Processes, 2006