

SLIDE 1

Hierarchical Dirichlet Processes

Sharing Clusters Among Related Groups

Dongruo Zhou, Difan Zou, Yaodong Yu

University of Virginia

12/15/2017

Dongruo Zhou, Difan Zou, Yaodong Yu (University of Virginia) Hierarchical Dirichlet Processes 12/15/2017 1 / 31

SLIDE 2

Outline

1. Model Introduction: General Problem Setting; Dirichlet Process; Hierarchical Dirichlet Process
2. Inference: Posterior Sampling
3. Experiments: Document Modeling; Multiple Corpora
4. Questions

SLIDE 4

Mixture Model

We are interested in problems where the observations are organized into groups, and assumed exchangeable both within each group and across groups. Let j index the groups and i index the observations within each group. Then

θji | Gj ∼ Gj, for each j, i
xji | θji ∼ F(θji), for each j, i

where θji is the factor variable, F(θji) is the distribution of xji given θji, and Gj is the prior distribution for the factors in group j.
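As a toy illustration of this setting (all specific values below are ours, not the paper's), grouped data can be generated with a discrete per-group prior Gj over shared factor values and F(θ) = Normal(θ, 1):

```python
import random

random.seed(0)

# Hypothetical toy instance of the grouped setting: the atoms, the group
# weights and F(theta) = Normal(theta, 1) are all illustrative choices.
atoms = [-4.0, 0.0, 4.0]              # values the factor theta_ji can take

def sample_group(n):
    # Each group j gets its own discrete prior G_j over the shared atoms.
    w = [random.random() for _ in atoms]
    weights = [v / sum(w) for v in w]
    xs = []
    for _ in range(n):
        theta = random.choices(atoms, weights=weights)[0]   # theta_ji ~ G_j
        xs.append(random.gauss(theta, 1.0))                 # x_ji ~ F(theta_ji)
    return xs

groups = [sample_group(50) for j in range(3)]
print([len(g) for g in groups])  # [50, 50, 50]
```

Observations are exchangeable within a group (same weights) but the groups differ, which is exactly the structure the HDP will later exploit.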

SLIDE 6

Definition

A Dirichlet process DP(α0, G0) is defined to be the distribution of a random probability measure G over a measurable space (Θ, B). We write G ∼ DP(α0, G0) if for every finite measurable partition (A1, . . . , Ar) of Θ,

(G(A1), . . . , G(Ar)) ∼ Dir(α0 G0(A1), . . . , α0 G0(Ar)),

where y ∼ Dir(β1, . . . , βr) means that the density of (y1, . . . , yr) on the simplex {yi ≥ 0, Σ_{i=1}^{r} yi = 1} is proportional to Π_{i=1}^{r} yi^{βi − 1}.

A Dirichlet process is thus a distribution over distributions.
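The finite-partition property can be checked empirically with the standard Gamma construction of the Dirichlet distribution; the base measure and partition below are illustrative choices, not from the slides:

```python
import random

random.seed(1)

# Sketch of the defining property. Take G0 = Uniform(0, 1) and the partition
# A1 = [0, 0.2), A2 = [0.2, 0.7), A3 = [0.7, 1]. If G ~ DP(alpha0, G0), then
# (G(A1), G(A2), G(A3)) ~ Dir(alpha0*G0(A1), alpha0*G0(A2), alpha0*G0(A3)).
alpha0 = 5.0
g0_masses = [0.2, 0.5, 0.3]            # G0(A1), G0(A2), G0(A3)

def dirichlet(betas):
    # Standard construction: normalized independent Gamma(beta, 1) draws.
    gammas = [random.gammavariate(b, 1.0) for b in betas]
    total = sum(gammas)
    return [g / total for g in gammas]

draws = [dirichlet([alpha0 * m for m in g0_masses]) for _ in range(10000)]
means = [sum(d[r] for d in draws) / len(draws) for r in range(3)]
print(means)  # close to [0.2, 0.5, 0.3], since E[G(A_r)] = G0(A_r)
```

The concentration α0 only controls the spread around G0: larger α0 makes G(Ar) hug G0(Ar) more tightly.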

SLIDE 7

Direct view from Chinese restaurant process

It is hard to describe G directly from the formal definition! Can we describe draws θi ∼ G directly from α0 and G0, without constructing G? Yes, via the Chinese restaurant process. Suppose θ1, θ2, . . . are conditionally independent given G. Integrating out G,

θi | θ1, . . . , θi−1, α0, G0 ∼ Σ_{l=1}^{i−1} δ_{θl} / (i − 1 + α0) + α0 / (i − 1 + α0) · G0.

With probability (i − 1)/(i − 1 + α0), θi takes one of the existing values θ1, . . . , θi−1; with probability α0/(i − 1 + α0), θi is a fresh draw from G0.
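The predictive rule above translates directly into a sampler. A minimal sketch (G0 = Normal(0, 1) and α0 = 1 are illustrative choices, not from the slides):

```python
import random

random.seed(2)

def crp_draws(n, alpha0, base_draw):
    """Draw theta_1, ..., theta_n by the Chinese restaurant process:
    theta_i is a fresh draw from G0 with probability alpha0/(i-1+alpha0),
    and otherwise repeats a uniformly chosen past draw, giving each
    existing value theta_l probability 1/(i-1+alpha0)."""
    thetas = []
    for i in range(1, n + 1):
        if random.random() < alpha0 / (i - 1 + alpha0):
            thetas.append(base_draw())            # new value from G0
        else:
            thetas.append(random.choice(thetas))  # uniform over past draws
    return thetas

samples = crp_draws(200, alpha0=1.0, base_draw=lambda: random.gauss(0, 1))
print(len(samples), len(set(samples)))  # 200 draws, far fewer distinct values
```

The clustering is automatic: the number of distinct values among n draws grows only logarithmically in n, which is what makes the DP useful as a mixture prior.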

SLIDE 9

Definition

How about putting a distribution on G0 as well? We let G0 itself be drawn from a Dirichlet process DP(γ, H), and let the Gj be conditionally independent given G0, each with distribution DP(α0, G0):

G0 | γ, H ∼ DP(γ, H),
Gj | α0, G0 ∼ DP(α0, G0).

SLIDE 10

Definition

Figure: Graphical models. Left: a DP mixture (G, α, θi, xi). Right: the HDP mixture (H, γ, G0, α, Gj, θji, xji).

SLIDE 11

Interpretation of HDP as Chinese restaurant process

Figure: The Chinese restaurant franchise. In each restaurant j, customers θji sit at tables; table t serves dish ψjt, and each table-level dish is identified with one of the global dishes φk shared across all restaurants.

SLIDE 12

Interpretation of HDP as Chinese restaurant process

From the previous definition of the Chinese restaurant process, integrating out Gj gives

θji | θj1, . . . , θj,i−1, α0, G0 ∼ Σ_{l=1}^{i−1} δ_{θjl} / (i − 1 + α0) + α0 / (i − 1 + α0) · G0,

which can also be written as

θji | θj1, . . . , θj,i−1, α0, G0 ∼ Σ_{t=1}^{mj·} njt· / (i − 1 + α0) · δ_{ψjt} + α0 / (i − 1 + α0) · G0,

where ψj1, . . . , ψj,mj· are the distinct values (tables) appearing in θj1, . . . , θj,i−1, mj· is the number of such distinct values, and njt· is the number of times ψjt appears in θj1, . . . , θj,i−1.
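Read as the seating rule of the restaurant metaphor, the second form can be simulated directly. A minimal sketch (α0 = 2 and 100 customers are arbitrary illustrative values):

```python
import random

random.seed(3)

def seat_customer(counts, alpha0):
    """Return the table for the next customer in restaurant j: existing
    table t is chosen with probability n_jt / (i - 1 + alpha0), a new table
    with probability alpha0 / (i - 1 + alpha0), where i - 1 = sum(counts)."""
    weights = counts + [alpha0]
    return random.choices(range(len(weights)), weights=weights)[0]

counts = []          # counts[t] = n_jt, customers seated at table t
for _ in range(100):
    t = seat_customer(counts, alpha0=2.0)
    if t == len(counts):
        counts.append(1)      # open a new table
    else:
        counts[t] += 1
print(sum(counts), len(counts))  # 100 customers at a handful of tables
```

Popular tables get richer ("rich get richer"), which is what concentrates the θji on a few values ψjt.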

SLIDE 13

Interpretation of HDP as Chinese restaurant process

Integrating out G0, we similarly have

ψjt | ψ11, ψ12, . . . , ψ21, . . . , ψj,t−1, γ, H ∼ Σ_{k=1}^{K} m·k / (m·· + γ) · δ_{φk} + γ / (m·· + γ) · H,

where φ1, . . . , φK are the distinct values (dishes) appearing among the earlier ψ's, K is the number of such distinct values, m·k is the number of tables serving φk, and m·· = Σ_{k=1}^{K} m·k.
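The franchise-level draw has the same form one level up: a newly opened table orders a dish in proportion to franchise-wide table counts m·k. A sketch with an assumed base measure H = Normal(0, 1):

```python
import random

random.seed(4)

def draw_dish(m, gamma, base_draw, dishes):
    """A newly opened table orders existing dish phi_k with probability
    m_k / (m_.. + gamma), or a brand-new dish from H with probability
    gamma / (m_.. + gamma); m[k] counts tables franchise-wide serving phi_k."""
    weights = m + [gamma]
    k = random.choices(range(len(weights)), weights=weights)[0]
    if k == len(m):                    # brand-new dish drawn from H
        dishes.append(base_draw())
        m.append(0)
    m[k] += 1
    return k

m, dishes = [], []                     # H = Normal(0, 1) is illustrative
ks = [draw_dish(m, 1.0, lambda: random.gauss(0, 1), dishes) for _ in range(50)]
print(len(dishes), sum(m))  # distinct dishes vs. 50 tables in total
```

Because tables in different restaurants can order the same dish φk, clusters are shared across groups; this is exactly the sharing the HDP was designed for.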

SLIDE 15

Posterior Sampling

Observations: xji ∼ F(θji).

Factors θji ∼ Gj, with Gj integrated out:

θji | θj1, . . . , θj,i−1, α0, G0 ∼ Σ_{t=1}^{mj·} njt· / (i − 1 + α0) · δ_{ψjt} + α0 / (i − 1 + α0) · G0.

Table values ψjt ∼ G0, with G0 integrated out:

ψjt | ψ11, . . . , ψ21, . . . , ψj,t−1, γ, H ∼ Σ_{k=1}^{K} m·k / (m·· + γ) · δ_{φk} + γ / (m·· + γ) · H.

SLIDE 16

Posterior Sampling in the Chinese Restaurant Franchise

Purpose: sample θji and ψjt given the observations x.
Simplification: we sample the index variables t and k rather than θji and ψjt directly.
We first need the conditional density of xji under mixture component k (i.e. under φk), given all data items except xji:

f_k^{−xji}(xji) = ∫ f(xji | φk) Π_{j′i′ ≠ ji} f(xj′i′ | φk) h(φk) dφk / ∫ Π_{j′i′ ≠ ji} f(xj′i′ | φk) h(φk) dφk,

where h denotes the density of H and the products run over all other data items currently assigned to component k.
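When F and H form a conjugate pair, this density is available in closed form. As an illustration (the Gaussian choice here is our assumption, not the slides'), take H = Normal(0, τ²) and F(φ) = Normal(φ, σ²):

```python
import math

def cond_density(x_new, x_others, tau2=1.0, sigma2=1.0):
    """Closed form of f_k^{-x_ji}(x_ji) for the conjugate pair
    H = Normal(0, tau2), F(phi) = Normal(phi, sigma2): integrating phi_k
    against its posterior given the other members of component k yields a
    Gaussian posterior predictive."""
    n = len(x_others)
    post_var = 1.0 / (1.0 / tau2 + n / sigma2)       # Var[phi_k | x_others]
    post_mean = post_var * sum(x_others) / sigma2    # E[phi_k | x_others]
    pred_var = post_var + sigma2                     # predictive variance
    return math.exp(-(x_new - post_mean) ** 2 / (2 * pred_var)) \
        / math.sqrt(2 * math.pi * pred_var)

# A point near the other component members scores higher than a far one.
near = cond_density(1.0, [0.9, 1.1, 1.0])
far = cond_density(5.0, [0.9, 1.1, 1.0])
print(near > far)  # True
```

With an empty `x_others`, the same function gives the prior predictive, i.e. the density used for a brand-new component k^{new}.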

SLIDE 17

Sampling t

If tji takes on a particular previously used value t, we have

p(tji = t | t^{−ji}, k) ∝ n_{jt·}^{−ji}.

The posterior probability p(tji | t^{−ji}, k, x) then satisfies

p(tji = t | t^{−ji}, k, x) ∝ p(xji | tji = t, t^{−ji}, k) · p(tji = t | t^{−ji}, k) = n_{jt·}^{−ji} f_{kjt}^{−xji}(xji).

If tji takes on a new value t^{new}, we have p(tji = t^{new} | t^{−ji}, k) ∝ α0, thus

p(tji = t^{new} | t^{−ji}, k, x) ∝ α0 · p(xji | tji = t^{new}, t^{−ji}, k),

where

p(xji | tji = t^{new}, t^{−ji}, k) = Σ_{k=1}^{K} m·k / (m·· + γ) · f_k^{−xji}(xji) + γ / (m·· + γ) · f_{k^{new}}^{−xji}(xji).

SLIDE 18

Sampling k

Following the previous slide: if the sampled value of tji is t^{new}, then

p(k_{jt^{new}} = k | t, k^{−jt^{new}}) ∝ m·k f_k^{−xji}(xji), if k is previously used;
p(k_{jt^{new}} = k^{new} | t, k^{−jt^{new}}) ∝ γ f_{k^{new}}^{−xji}(xji).

For an existing table t, we resample its dish kjt via

p(kjt = k | t, k^{−jt}) ∝ m·k f_k^{−xjt}(xjt), if k is previously used;
p(kjt = k^{new} | t, k^{−jt}) ∝ γ f_{k^{new}}^{−xjt}(xjt),

where xjt = (xji : all i with tji = t) is the set of observations at table t, and f_k^{−xjt} is defined analogously with all of xjt removed.
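For intuition, the two sampling steps collapse in the single-group special case to a collapsed Gibbs sampler over per-observation component indicators. This is a simplified sketch under an assumed conjugate Gaussian model, not the full franchise sampler:

```python
import math
import random

random.seed(5)

def predictive(x, members, tau2=1.0, sigma2=1.0):
    # f_k^{-x}(x) in closed form for the conjugate pair H = Normal(0, tau2),
    # F(phi) = Normal(phi, sigma2): a Gaussian posterior predictive.
    post_var = 1.0 / (1.0 / tau2 + len(members) / sigma2)
    post_mean = post_var * sum(members) / sigma2
    var = post_var + sigma2
    return math.exp(-(x - post_mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gibbs_sweep(x, z, alpha0):
    """One collapsed Gibbs sweep over component indicators z_i: an existing
    component gets weight count * f_k^{-x_i}(x_i), a new component gets
    weight alpha0 * f_knew^{-x_i}(x_i)."""
    for i in range(len(x)):
        z[i] = None                                  # remove x_i
        labels = sorted({v for v in z if v is not None})
        weights = []
        for k in labels:
            members = [x[j] for j in range(len(x)) if z[j] == k]
            weights.append(len(members) * predictive(x[i], members))
        weights.append(alpha0 * predictive(x[i], []))
        pick = random.choices(range(len(weights)), weights=weights)[0]
        z[i] = labels[pick] if pick < len(labels) else max(labels, default=-1) + 1
    return z

# Two well-separated hypothetical clusters; start with one big component.
x = [random.gauss(-3, 1) for _ in range(20)] + [random.gauss(3, 1) for _ in range(20)]
z = [0] * len(x)
for _ in range(20):
    z = gibbs_sweep(x, z, alpha0=1.0)
print(len(set(z)))  # a small number of occupied components
```

The franchise sampler of the slides layers the same idea twice: tables within a restaurant, then dishes across restaurants.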

SLIDE 20

Document Modeling

Dataset: corpus of nematode biology abstracts¹; 5,838 abstracts in total.
Data processing: remove standard stop words and words appearing fewer than 10 times, leaving 476,441 words in total and a vocabulary size of 5,699.
Representation: use a "bag of words" to represent each document.

¹Available at http://elegans.swmed.edu/wli/cgcbib.
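The preprocessing pipeline described above can be sketched as follows; the mini-corpus, stop-word list and frequency threshold are illustrative stand-ins for the real abstracts:

```python
from collections import Counter

# Hypothetical mini-corpus standing in for the abstracts; the stop words and
# the count threshold (10 in the slides, 2 here so the toy example works)
# are illustrative choices.
docs = ["the larva develops into an adult worm",
        "the adult worm produces larva",
        "gene expression in the adult"]
stop_words = {"the", "an", "in", "into"}

tokens = [[w for w in d.split() if w not in stop_words] for d in docs]
counts = Counter(w for doc in tokens for w in doc)
vocab = sorted(w for w, c in counts.items() if c >= 2)   # keep frequent words

# Bag of words: each document becomes a vector of counts over the vocabulary,
# discarding word order.
bows = [[doc.count(w) for w in vocab] for doc in tokens]
print(vocab)
print(bows)
```

The HDP mixture then treats each such count vector as one group of exchangeable word observations.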

SLIDE 21

Document Modeling

Figure: The HDP mixture model (H, γ, G0, α, Gj, θji, xji).

Correspondence between model and corpus:
Groups ↔ abstracts
Observations ↔ words
Mixture components ↔ topics

SLIDE 22

Document Modeling

We compare with latent Dirichlet allocation (LDA)². For LDA the number of topics must be fixed in advance, so we vary it between 10 and 120.

Figure: Perplexity on test abstracts of LDA and the HDP mixture. x-axis: number of LDA topics (10–120); y-axis: perplexity (roughly 750–1050).

²D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
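Perplexity here is the exponentiated negative average per-word held-out log-likelihood; lower is better. A small sanity-check sketch (the uniform-model example is ours, not from the slides):

```python
import math

def perplexity(log_probs, n_words):
    """Perplexity as used to compare LDA and the HDP mixture: exp of the
    negative average per-word log-likelihood on held-out text. A uniform
    model over a V-word vocabulary scores exactly V."""
    return math.exp(-sum(log_probs) / n_words)

# Uniform model over a 5,699-word vocabulary (the corpus's vocabulary size):
V = 5699
uniform = perplexity([math.log(1.0 / V)] * 100, 100)
print(round(uniform))  # 5699
```

The figure's values near 750–1050 therefore correspond to models far sharper than uniform over the 5,699-word vocabulary.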

SLIDE 24

Multiple Corpora

Dataset: articles from the proceedings of the Neural Information Processing Systems (NIPS) conference for the years 1988–1999³; 1,447 articles in total.
Data processing: remove standard stop words, words appearing more than 4,000 times, and words appearing fewer than 50 times.
Representation: use a "bag of words" to represent each document.

³http://www.cs.utoronto.ca/roweis/nips.

SLIDE 25

Multiple Corpora

Nine Sections: algorithms and architectures (AA), applications (AP), cognitive science (CS), control and navigation (CN), implementations (IM), learning theory (LT), neuroscience (NS), signal processing (SP), vision sciences (VS). We treat these sections as “corpora”, and are interested in the pattern of sharing of topics among these corpora. Our test section is always the VS (vision sciences) section, while the additional section is varied across the other eight.

SLIDE 26

Multiple Corpora

Figure: Three models for the NIPS data. From left to right: M1 (an HDP over the VS training and test documents only), M2 (a single HDP over the VS training, additional training, and VS test documents together), and M3 (separate HDPs G1 and G2 for the VS and additional sections, tied together by a common DP prior G0 with base measure H). In all cases, perplexity is evaluated on VS test documents.

SLIDE 27

Multiple Corpora

M1 simply ignores documents from the additional section and uses an HDP to model the VS documents only. M2 uses an HDP mixture model with one group per document, lumping together the training documents from both sections. M3 takes a hierarchical approach: it models each section separately with its own HDP mixture model, and places another DP prior over the common base distributions of the two submodels.

SLIDE 28

Multiple Corpora

The training set consists of 80 documents from the additional section, so that larger sections like AA (algorithms and architectures) do not get an unfair advantage, together with 0 to 80 documents from VS.

SLIDE 29

Multiple Corpora

Figure: Perplexity of test VS documents given training documents from VS and another section, for the three models (M1: additional section ignored; M2: flat, additional section; M3: hierarchical, additional section). x-axis: number of VS training documents (10–80); y-axis: average perplexity over NIPS sections.

Figure: Perplexity of test VS documents given LT, AA and AP documents respectively, using M3.

SLIDE 30

Questions

Any Questions?

SLIDE 31

Questions

Thank you!
