Dr. Nonparametric Bayes
Or: How I Learned to Stop Worrying and Love the Dirichlet Process
Kurt Miller CS 294: Practical Machine Learning November 19, 2009
Introduction
[Diagram: Data → Build model → Predict using model, contrasting the parametric and nonparametric approaches.]
Statistics: Bayesian Basics
The Bayesian approach treats statistical problems by maintaining probability distributions over possible parameter values. That is, we treat the parameters themselves as random variables having distributions:
1. We have some beliefs about our parameter values θ before we see any data. These beliefs are encoded in the prior distribution P(θ).
2. Treating the parameters θ as random variables, we can write the likelihood of the data X as a conditional probability: P(X|θ).
3. We would like to update our beliefs about θ based on the data by obtaining P(θ|X), the posterior distribution. Solution: by Bayes’ theorem,

P(θ|X) = P(X|θ)P(θ) / P(X), where P(X) = ∫ P(X|θ)P(θ) dθ.
(Slide from tutorial lecture)
You can take a course on this question. One answer: Infinite Exchangeability:

∀n, p(x1, …, xn) = p(xσ(1), …, xσ(n))

De Finetti’s Theorem (1955): If (x1, x2, …) are infinitely exchangeable, then for all n,

p(x1, …, xn) = ∫ ∏_{i=1}^n p(xi|θ) dP(θ)

for some random variable θ.
Task: Toss a (potentially biased) coin N times. Compute θ, the probability of heads.

Suppose we observe: {T, H, H, T}. What do we think θ is? The maximum likelihood estimate is θ = 1/2. Seems reasonable.

Now suppose we observe: {H, H, H, H}. What do we think θ is? The maximum likelihood estimate is θ = 1. Seem reasonable? Not really. Why?
When we observe {H, H, H, H}, why does θ = 1 seem unreasonable? Prior knowledge! We believe coins generally have θ ≈ 1/2. How to encode this? By using a Beta prior on θ.
Place a Beta(a, b) prior on θ. This prior has the form p(θ) ∝ θ^(a−1)(1 − θ)^(b−1). What does this distribution look like?

[Plot: Beta densities for (α1, α2) = (1.0, 0.1), (1.0, 1.0), (1.0, 5.0), (1.0, 10.0), (9.0, 3.0).]
After observing X, a sequence with n heads and m tails, the posterior on θ is:

p(θ|X) ∝ p(X|θ)p(θ) ∝ θ^(a+n−1)(1 − θ)^(b+m−1) ∼ Beta(a + n, b + m).

If a = b = 1 and we observe 5 heads and 2 tails, the posterior Beta(6, 3) looks like:

[Plot: the Beta(6, 3) density.]
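This conjugate update is a one-liner in code. A minimal sketch in Python using scipy.stats, with the Beta(1, 1) prior and the 5-heads/2-tails data from above:

```python
from scipy import stats

a, b = 1.0, 1.0          # Beta(a, b) prior on theta
heads, tails = 5, 2      # observed coin flips

# Conjugacy: the posterior is Beta(a + heads, b + tails) = Beta(6, 3)
posterior = stats.beta(a + heads, b + tails)

print("Posterior mean of theta:", posterior.mean())         # 6/9, about 0.667
print("95% credible interval:", posterior.interval(0.95))
```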
Nonparametric Bayesian Methods overview
Now we know what nonparametric and Bayesian mean. What should we expect from nonparametric Bayesian methods?
Latent class models - often used in the clustering context
Latent feature models
Regression

Today we focus on the Dirichlet Process!
The Dirichlet Process: a nonparametric approach to clustering. It can be used in any probabilistic model for clustering. Before diving into the details, we first introduce several key ideas.
Preliminaries
We must specify two things: p(X|θ) and p(θ). We will slowly develop what these are in the Bayesian clustering context.
How many clusters?
Frequentist approach: Gaussian Mixture Models with K mixtures. Distribution over classes: π = (π1, …, πK). Each cluster has a mean and covariance: φk = (µk, Σk). Then

p(x|π, φ) = Σ_{k=1}^K πk p(x|φk)

Use Expectation Maximization (EM) to maximize the likelihood of the data with respect to (π, φ).
Alternate definition:

G = Σ_{k=1}^K πk δφk

where δφk is an atom at φk. Then

θi ∼ G
xi ∼ p(x|θi)

[Graphical model: G → θi → xi, with a plate over the N data points.]
Parametric Bayesian Clustering
Bayesian approach: Bayesian Gaussian Mixture Models with K mixtures. Distribution over classes: π = (π1, …, πK), with π ∼ Dirichlet(α/K, …, α/K). (We’ll review the Dirichlet distribution in a few slides.) Each cluster has a mean and covariance: φk = (µk, Σk), with (µk, Σk) ∼ Normal-Inverse-Wishart(ν). We still have

p(x|π, φ) = Σ_{k=1}^K πk p(x|φk)
G is now a random measure:

φk ∼ G0
π ∼ Dirichlet(α/K, …, α/K)
G = Σ_{k=1}^K πk δφk
θi ∼ G
xi ∼ p(x|θi)

[Graphical model: α and G0 generate G; G → θi → xi, with a plate over the N data points.]
We had π ∼ Dirichlet(α1, …, αK). The Dirichlet density is defined as

p(π|α) = [Γ(Σ_{k=1}^K αk) / ∏_{k=1}^K Γ(αk)] π1^(α1−1) π2^(α2−1) ⋯ πK^(αK−1)

where πK = 1 − Σ_{k=1}^{K−1} πk. The expectations of π are

E(πi) = αi / Σ_{k=1}^K αk
A special case of the Dirichlet distribution is the Beta distribution, for K = 2:

p(π|α1, α2) = [Γ(α1 + α2) / (Γ(α1)Γ(α2))] π^(α1−1)(1 − π)^(α2−1)

[Plot: Beta densities for (α1, α2) = (1.0, 0.1), (1.0, 1.0), (1.0, 5.0), (1.0, 10.0), (9.0, 3.0).]
In three dimensions:

p(π|α1, α2, α3) = [Γ(α1 + α2 + α3) / (Γ(α1)Γ(α2)Γ(α3))] π1^(α1−1) π2^(α2−1) (1 − π1 − π2)^(α3−1)

[Plots: the density for α = (2, 2, 2), (5, 5, 5), (2, 2, 25), and bar plots of sampled π for α = (2, 2, 2), (5, 5, 5), (2, 2, 5).]
The Aggregation Property: If

(π1, …, πi, πi+1, …, πK) ∼ Dir(α1, …, αi, αi+1, …, αK)

then

(π1, …, πi + πi+1, …, πK) ∼ Dir(α1, …, αi + αi+1, …, αK)

This is also valid for any aggregation: summing any group of the πk gives a Dirichlet whose corresponding parameter is the sum of the associated αk.
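A quick empirical check of the aggregation property (a sketch with arbitrary α values; we aggregate the last two components and compare moments against a direct Dirichlet draw):

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([2.0, 3.0, 1.0, 4.0])             # arbitrary illustration values
samples = rng.dirichlet(alpha, size=200_000)

# Aggregate the last two components of each sample
aggregated = np.column_stack([samples[:, 0], samples[:, 1], samples[:, 2] + samples[:, 3]])

# The aggregation property says this should look like Dir(2, 3, 1 + 4)
reference = rng.dirichlet(np.array([2.0, 3.0, 5.0]), size=200_000)

print("Aggregated means:    ", aggregated.mean(axis=0))
print("Reference means:     ", reference.mean(axis=0))
print("Aggregated variances:", aggregated.var(axis=0))
print("Reference variances: ", reference.var(axis=0))
```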
Let Z ∼ Multinomial(π) with counts z = (z1, …, zK), and π ∼ Dir(α). Posterior:

p(π|z) ∝ p(z|π)p(π) = (π1^z1 ⋯ πK^zK)(π1^(α1−1) ⋯ πK^(αK−1)) = π1^(z1+α1−1) ⋯ πK^(zK+αK−1)

which is Dir(α + z).
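In code, this conjugate update is again just adding counts to the prior parameters (a sketch with made-up counts):

```python
import numpy as np

rng = np.random.default_rng(1)

alpha = np.array([1.0, 1.0, 1.0])        # symmetric Dir(1, 1, 1) prior over 3 classes
z = np.array([12, 3, 7])                 # observed class counts (illustration only)

# Conjugacy: the posterior over pi is Dir(alpha + z)
posterior_alpha = alpha + z
print("Posterior parameters:", posterior_alpha)
print("Posterior mean of pi:", posterior_alpha / posterior_alpha.sum())

# Draw a few posterior samples of pi
print(rng.dirichlet(posterior_alpha, size=3))
```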
Recall the Bayesian Gaussian Mixture Model with K mixtures defined above. We now turn to inference.
We no longer want just the maximum likelihood parameters; we want the full posterior:

p(π, φ|X) ∝ p(X|π, φ)p(π, φ)

Unfortunately, this is not analytically tractable. The two main approaches to approximate inference are sampling (MCMC) and variational methods; here we focus on sampling.
Suppose we wish to reason about p(θ|X), but we cannot compute this distribution exactly. If instead we can sample θ ∼ p(θ|X), what can we do?

[Figure: the density p(θ|X) next to a set of samples from p(θ|X).]

This is the idea behind Monte Carlo methods.
We do not have access to an oracle that will give us samples θ ∼ p(θ|X). How do we get these samples? Markov Chain Monte Carlo (MCMC) methods have been developed to solve this problem. We focus on Gibbs sampling, a special case of the Metropolis-Hastings algorithm.
Gibbs sampling: an MCMC technique. Assume θ consists of several parameters, θ = (θ1, …, θm). In the finite mixture model, θ = (π, µ1, …, µK, Σ1, …, ΣK). Then:

1. Initialize θ^(0) = (θ1^(0), …, θm^(0)) at time step 0.
2. At each time step t, sample each component in turn from its conditional given the current values of all the others and the data:

θi^(t) ∼ p(θi | θ1^(t), …, θi−1^(t), θi+1^(t−1), …, θm^(t−1), X)

If we repeat this for any model we discuss today, theory tells us that eventually the θ^(t) are samples from p(θ|X).

Example: θ = (θ1, θ2) and p(θ) ∼ N(µ, Σ).

[Plots: the first 50 and the first 500 Gibbs samples overlaid on the target bivariate Gaussian.]
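A minimal sketch of that bivariate-Gaussian example in Python. The conditionals of a 2-D Gaussian are themselves Gaussian; the particular µ and Σ below are made-up illustration values, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: p(theta) = N(mu, Sigma) with illustrative parameters
mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

def gibbs_bivariate_gaussian(n_samples):
    theta = np.zeros(2)                    # arbitrary initialization
    samples = np.empty((n_samples, 2))
    for t in range(n_samples):
        # Sample theta_1 | theta_2 (the conditional of a bivariate Gaussian is Gaussian)
        cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (theta[1] - mu[1])
        cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
        theta[0] = rng.normal(cond_mean, np.sqrt(cond_var))
        # Sample theta_2 | theta_1
        cond_mean = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (theta[0] - mu[0])
        cond_var = Sigma[1, 1] - Sigma[1, 0] ** 2 / Sigma[0, 0]
        theta[1] = rng.normal(cond_mean, np.sqrt(cond_var))
        samples[t] = theta
    return samples

samples = gibbs_bivariate_gaussian(500)
print("Sample mean:", samples.mean(axis=0))   # approaches mu as the chain runs
```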
Introduce “membership” indicators zi, where zi ∼ Multinomial(π) indicates which cluster the ith data point belongs to.

p(π, Z, φ|X) ∝ p(X|Z, φ)p(Z|π)p(π, φ)

[Graphical model: α → π → zi → xi ← φk; G0 → φk; plates over the N data points and the K clusters.]
Gibbs sampler for the finite mixture: randomly initialize Z, π, φ. Repeat until we have enough samples:

zi | Z−i, π, φ, X ∝ Σ_{k=1}^K πk p(xi|φk) 1{zi = k}

π | Z, φ, X ∼ Dir(n1 + α/K, …, nK + α/K)

where nk is the number of points assigned to cluster k.
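A compact sketch of such a sampler in Python (not the lecture's Matlab demo): 1-D data, known unit cluster variance, and a N(0, 10) prior on each cluster mean are all simplifying assumptions made here so the conditionals stay one line each.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data from three well-separated groups (illustration only)
X = np.concatenate([rng.normal(-4, 1, 50), rng.normal(0, 1, 50), rng.normal(4, 1, 50)])
N, K, alpha = len(X), 3, 1.0

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Random initialization of assignments, weights, and cluster means
z = rng.integers(K, size=N)
pi = np.full(K, 1.0 / K)
mu = rng.normal(0.0, 1.0, K)

for sweep in range(200):
    # 1. Sample each z_i given pi and the cluster parameters
    for i in range(N):
        probs = pi * normal_pdf(X[i], mu, 1.0)
        z[i] = rng.choice(K, p=probs / probs.sum())
    # 2. Sample pi | Z from its Dirichlet posterior
    counts = np.bincount(z, minlength=K)
    pi = rng.dirichlet(counts + alpha / K)
    # 3. Sample each cluster mean from its conjugate Gaussian posterior
    #    (unit likelihood variance, N(0, 10) prior on mu_k; both are assumptions, see above)
    for k in range(K):
        xk = X[z == k]
        prec = 1.0 / 10.0 + len(xk)
        mu[k] = rng.normal(xk.sum() / prec, np.sqrt(1.0 / prec))

print("Cluster means:", np.sort(mu))
print("Mixture weights:", pi)
```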
[Matlab demo]
Idea for an improvement: we can marginalize out some variables due to conjugacy, so we do not need to sample them. This is called a collapsed Gibbs sampler (here, π is marginalized out).

Randomly initialize Z, φ. Repeat:

zi | Z−i, φ, X ∝ Σ_{k=1}^K (nk + α/K) p(xi|φk) 1{zi = k}
For easy visualization, we used a Gaussian mixture model. You should use the appropriate likelihood model for your application!
Generic model selection: cross-validation, AIC, BIC, MDL, etc. We can also place a parametric prior on K. What if we just let K → ∞ in our parametric model?
Let K → ∞:

φk ∼ G0
π ∼ Dirichlet(α/K, …, α/K)
G = Σ_{k=1}^K πk δφk
θi ∼ G
xi ∼ p(x|θi)
Randomly initialize Z, φ. Repeat:

zi | Z−i, φ, X ∝ Σ_{k=1}^K (nk + α/K) p(xi|φk) 1{zi = k} → Σ_k nk p(xi|φk) 1{zi = k} as K → ∞

Note that nk = 0 for empty clusters.
What about empty clusters? Lump all empty clusters together. Let K+ be the number of occupied clusters. Then the posterior probability of sitting at any empty cluster is:

zi | Z−i, φ, X ∝ (K − K+) × α/K × f(xi|G0) → α f(xi|G0) as K → ∞

for f(xi|G0) = ∫ p(xi|φ) dG0(φ), the likelihood of xi with the cluster parameter integrated out under the prior.
The Dirichlet Process Model
We must again specify two things:

p(X|θ): identical to the parametric case.
p(θ): the Dirichlet Process!

Exact posterior inference is still intractable, but we have already derived the Gibbs update equations!
Image from http://www.nature.com/nsmb/journal/v7/n6/fig_tab/nsb0600_443_F1.html
(G(A1), …, G(An)) ∼ Dir(α0 G0(A1), …, α0 G0(An))
A flexible, nonparametric prior over an infinite number of clusters/classes as well as the parameters for those classes.
The Dirichlet Process (DP) is a distribution over distributions. We write G ∼ DP(α, G0) to indicate that G is a distribution drawn from the DP. It will become clearer in a bit what α and G0 are.
Three ways to think about the DP:

G ∼ DP(α, G0): the random measure itself.
The Stick-Breaking Process: just the weights.
The Chinese Restaurant Process (CRP): describes the partitions of θ when G is marginalized out.

[Graphical model: α, G0 → G → θi → xi, with a plate over the N data points.]
Definition: Let G0 be a probability measure on the measurable space (Ω, B) and α ∈ R+. The Dirichlet Process DP(α, G0) is the distribution on probability measures G such that for any finite partition (A1, …, Am) of Ω,

(G(A1), …, G(Am)) ∼ Dir(αG0(A1), …, αG0(Am))

[Figure: a partition of Ω into regions A1, …, A5.]

(Ferguson, ’73)
Suppose we sample θ1 ∼ G. What is the posterior distribution of G given θ1?

G | θ1 ∼ DP(α + 1, (α/(α + 1)) G0 + (1/(α + 1)) δθ1)

More generally,

G | θ1, …, θn ∼ DP(α + n, (α/(α + n)) G0 + (1/(α + n)) Σ_{i=1}^n δθi)
With probability 1, a sample G ∼ DP(α, G0) is of the form

G = Σ_{k=1}^∞ πk δφk

(Sethuraman, ’94)
Draw G ∼ DP(α, G0) to get G = Σ_{k=1}^∞ πk δφk. Use this in a mixture model:

θi ∼ G
xi ∼ p(x|θi)

[Graphical model: α, G0 → G → θi → xi, with a plate over the N data points.]
The Stick-Breaking Process constructs the weights πk as follows:

βk ∼ Beta(1, α), k = 1, 2, …
π1 = β1
πk = βk ∏_{l=1}^{k−1} (1 − βl), k = 2, 3, …

[Figure: breaking a unit-length stick; the first piece has length β1, the next β2(1 − β1), and so on.]
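A small sketch of stick-breaking in Python, truncated at a finite number of sticks for illustration; the standard-normal base measure G0 here is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, n_sticks=1000):
    """Truncated stick-breaking draw of DP weights pi and atoms phi."""
    beta = rng.beta(1.0, alpha, size=n_sticks)
    # pi_k = beta_k * prod_{l < k} (1 - beta_l)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - beta)[:-1]])
    pi = beta * remaining
    phi = rng.normal(0.0, 1.0, size=n_sticks)   # atoms phi_k ~ G0 = N(0, 1), an assumption
    return pi, phi

pi, phi = stick_breaking(alpha=5.0)
print("Total weight captured by 1000 sticks:", pi.sum())    # close to 1
print("Ten largest weights:", np.sort(pi)[::-1][:10])
```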
Why does Σ_{k=1}^∞ πk = 1 (with probability 1)? With πk = βk ∏_{l=1}^{k−1} (1 − βl),

1 − Σ_{k=1}^K πk = 1 − β1 − β2(1 − β1) − β3(1 − β1)(1 − β2) − ⋯ = (1 − β1)(1 − β2 − β3(1 − β2) − ⋯) = ∏_{k=1}^K (1 − βk) → 0 (with probability 1 as K → ∞)

So G = Σ_{k=1}^∞ πk δφk has a clean definition as a random measure.
[Graphical model for the stick-breaking representation: the infinite collections of weights πk and atoms φk ∼ G0 together define G, with θi ∼ G, xi ∼ p(x|θi), and a plate over the N data points.]
The Chinese Restaurant Process (CRP): imagine a restaurant with an infinite number of tables. Customers enter one at a time, and each sits down according to the distribution:

P(previously occupied table i | Fm−1) ∝ ni
P(the next unoccupied table | Fm−1) ∝ α

where ni is the number of customers currently at table i and Fm−1 denotes the state of the restaurant after m − 1 customers have been seated.
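A quick simulation of this seating rule (a sketch; tables are just integer labels and α = 2 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def crp(n_customers, alpha):
    """Sample a partition of n_customers customers from the CRP."""
    tables = []                       # tables[k] = number of customers at table k
    assignments = []
    for m in range(n_customers):
        probs = np.array(tables + [alpha], dtype=float)   # occupied tables, then a new one
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(1)          # the customer starts a new table
        else:
            tables[k] += 1
        assignments.append(k)
    return assignments, tables

assignments, tables = crp(100, alpha=2.0)
print("Number of occupied tables:", len(tables))
print("Table sizes:", tables)
```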
To use the CRP for clustering, interpret the partition it induces over customers (and the number of tables it creates) as a clustering: each table corresponds to a cluster, and each table chooses the parameter vector for that table (φk) from the prior G0. Every data point seated at table k then shares that table's parameters, which is exactly the structure we might care about in the clustering setting.

[Figure: tables labeled φ1, φ2, φ3, φ4 with customers seated around them.]
For example, with a Gaussian likelihood:

φk = (µk, Σk) ∼ N(a, b) ⊗ IW(α, β)
xi ∼ N(φk) for a data point i sitting at table k
OK, so we’ve seen how the CRP relates to clustering. How does it relate to the DP?

Important fact: the CRP is exchangeable. Remember De Finetti’s Theorem: if (x1, x2, …) are infinitely exchangeable, then for all n,

p(x1, …, xn) = ∫ ∏_{i=1}^n p(xi|G) dP(G)

for some random variable G.
That means that when we integrate out G, we get the CRP:

p(θ1, …, θn) = ∫ ∏_{i=1}^n p(θi|G) dP(G)

[Graphical model: α, G0 → G → θi → xi, with a plate over the N data points.]
To summarize the three views of the DP: G ∼ DP(α, G0) is the random measure itself, the Stick-Breaking Process gives just the weights, and the CRP describes the partitions of θ when G is marginalized out.
Inference for the Dirichlet Process
We introduce the indicators zi and use the CRP representation. Randomly initialize Z, φ. Repeat:

zi | Z−i, φ, X ∝ Σ_{k=1}^K nk p(xi|φk) 1{zi = k} + α f(xi|G0) 1{zi = K + 1}

where K is the number of currently occupied clusters. This is the sampler we saw earlier, but now with some theoretical basis.
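To make this concrete, here is a minimal sketch of a CRP-based Gibbs sampler for a DP mixture of 1-D Gaussians in Python (not the lecture's Matlab demo). It assumes unit likelihood variance and a N(0, 10) base measure G0, so that f(xi|G0) is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data (illustration only); likelihood N(phi_k, 1), base measure G0 = N(0, tau0)
X = np.concatenate([rng.normal(-4, 1, 40), rng.normal(0, 1, 40), rng.normal(4, 1, 40)])
N, alpha, tau0 = len(X), 1.0, 10.0

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

z = np.zeros(N, dtype=int)                 # start with everyone in one cluster
phi = [rng.normal(0.0, np.sqrt(tau0))]

for sweep in range(100):
    for i in range(N):
        # Cluster sizes with point i removed (an emptied cluster gets probability 0)
        counts = np.bincount(np.delete(z, i), minlength=len(phi))
        # Occupied clusters: n_k * p(x_i | phi_k); new cluster: alpha * f(x_i | G0)
        probs = np.append(counts * normal_pdf(X[i], np.array(phi), 1.0),
                          alpha * normal_pdf(X[i], 0.0, 1.0 + tau0))
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(phi):                  # new cluster: draw phi from its posterior given x_i
            prec = 1.0 / tau0 + 1.0
            phi.append(rng.normal(X[i] / prec, np.sqrt(1.0 / prec)))
        z[i] = k
    # Resample each phi_k from its conjugate posterior and prune empty clusters
    counts = np.bincount(z, minlength=len(phi))
    keep = np.where(counts > 0)[0]
    z = np.searchsorted(keep, z)
    phi = [rng.normal(X[z == j].sum() / (1.0 / tau0 + counts[k]),
                      np.sqrt(1.0 / (1.0 / tau0 + counts[k])))
           for j, k in enumerate(keep)]

print("Inferred number of clusters:", len(phi))
print("Cluster means:", np.round(np.sort(phi), 2))
```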
[Matlab demo]
The Gibbs sampler above gives approximate inference. This is based on the CRP representation.
Hierarchical Dirichlet Process
View parameters as random variables - place a prior on them.
Often the priors themselves need parameters.
Place a prior on these parameters!
Example: xij ∼ N(θi, σ^2) in m different groups.

[Graphical model: a separate θi for each group i, generating observations xij, j = 1, …, Ni.]

How do we estimate θi for each group?
Treat the θi as random variables sampled from a common prior: θi ∼ N(θ0, σ0^2).

[Graphical model: a shared θ0 generates θ1, …, θm, each of which generates its group's xij.]
[The compact plate-notation model (θ0 → θi → xij, with plates over the m groups and the Ni observations) is equivalent to the expanded model with θ1, …, θm drawn explicitly.]
Independent estimation vs. hierarchical Bayes:

[Diagram: independent per-group models, with each θi estimated separately, versus the hierarchical model in which the θi share a common prior centered at θ0.]

What do we do if we have DPs for multiple related datasets?

[Diagram: independent DP mixtures G1, …, Gm, each with its own base measure Hi and concentration αi, versus a hierarchical model in which the Gi share a common base measure G0 drawn from H.]
[Graphical model: H → G0; α, G0 → Gi; Gi → θij → xij, with plates over the m groups and the Ni observations per group.]

What kind of distribution do we use for G0? For H? Suppose the θij are mean parameters for a Gaussian, where Gi ∼ DP(α, G0) and G0 is a Gaussian with unknown mean: G0 = N(θ0, σ0^2).

This does NOT work! Why?
The problem: if G0 is continuous, then with probability ONE, Gi and Gj will share ZERO atoms. ⇒ This means NO clusters are shared between groups!

[Figure: draws Gi and Gj from a DP with a continuous base measure G0 place their atoms at entirely different locations.]
So G0 must be discrete. What discrete prior can we use on G0? How about a parametric prior? Gee, if only we had a nonparametric prior on discrete measures...
Solution: the Hierarchical Dirichlet Process.

G0 ∼ DP(γ, H)
Gi ∼ DP(α, G0)
θij | Gi ∼ Gi
xij | θij ∼ p(xij|θij)

[Graphical model: γ, H → G0; α, G0 → Gi; Gi → θij → xij, with plates over groups and observations.]

(Teh, Jordan, Beal, Blei, 2004)
Since G0 ∼ DP(γ, H) and Gi ∼ DP(α, G0), we know

G0 = Σ_{k=1}^∞ πk δφk
Gi = Σ_{k=1}^∞ πik δφk

so the Gi share the atoms φk of G0. What is the relationship between πk and πik?
Let (A1, …, Am) be a partition of Ω.

[Figure: a partition of Ω into regions A1, …, A5.]

By properties of the DP,

(Gi(A1), …, Gi(Am)) ∼ Dir(αG0(A1), …, αG0(Am))

⇒ (Σ_{k∈K1} πik, …, Σ_{k∈Km} πik) ∼ Dir(α Σ_{k∈K1} πk, …, α Σ_{k∈Km} πk)

where Kj is the set of atom indices falling in Aj.
Two equivalent representations of the HDP. In terms of random measures:

G0 ∼ DP(γ, H)
Gi ∼ DP(α, G0)
θij | Gi ∼ Gi
xij | θij ∼ p(xij|θij)

In terms of stick-breaking weights and indicators:

π ∼ GEM(γ)
πi ∼ DP(α, π)
φk ∼ H
zij ∼ πi
xij ∼ p(xij|φzij)

[Graphical models for both representations, with plates over the m groups, the Ni observations per group, and the infinitely many atoms.]
Remember:

(Σ_{k∈K1} πik, …, Σ_{k∈Km} πik) ∼ Dir(α Σ_{k∈K1} πk, …, α Σ_{k∈Km} πk)

Explicit relationship between π and πi:

βk ∼ Beta(1, γ),  πk = βk ∏_{j=1}^{k−1} (1 − βj)

βik ∼ Beta(α πk, α(1 − Σ_{j=1}^k πj)),  πik = βik ∏_{j=1}^{k−1} (1 − βij)
π ∼ GEM(γ), πi ∼ DP(α, π).

[Plots: one draw of π with γ = 2, and draws of πi for α = 1, 5, 20, and 100; the larger α is, the more closely each πi tracks the shared π.]
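A rough sketch of this construction in Python (truncated to 50 sticks; the hyperparameter values mirror the plots above, and the clip is only a numerical guard):

```python
import numpy as np

rng = np.random.default_rng(0)

def gem(gamma, n_sticks=50):
    """Truncated GEM(gamma) stick-breaking weights."""
    beta = rng.beta(1.0, gamma, size=n_sticks)
    return beta * np.concatenate([[1.0], np.cumprod(1.0 - beta)[:-1]])

def group_weights(pi, alpha):
    """Truncated draw of pi_i ~ DP(alpha, pi) via the HDP stick-breaking construction."""
    # beta_ik ~ Beta(alpha * pi_k, alpha * (1 - sum_{j<=k} pi_j))
    b = np.clip(alpha * (1.0 - np.cumsum(pi)), 1e-8, None)
    beta_i = rng.beta(alpha * pi, b)
    return beta_i * np.concatenate([[1.0], np.cumprod(1.0 - beta_i)[:-1]])

pi = gem(gamma=2.0)
for alpha in [1.0, 5.0, 20.0, 100.0]:
    pi_i = group_weights(pi, alpha)
    # Larger alpha: pi_i concentrates around the shared pi
    print(f"alpha = {alpha:6.1f}   L1 distance between pi_i and pi: {np.abs(pi_i - pi).sum():.3f}")
```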
For the DP, we had three views: G ∼ DP(α, G0), the Stick-Breaking Process (just the weights), and the CRP, which describes the partitions of θ when G is marginalized out. For the HDP, we have analogous representations; the marginalized, restaurant-style view is developed next.
First integrate out the Gi.

[Graphical models: marginalizing the Gi turns the full HDP (γ, H → G0; α, G0 → Gi → θij → xij) into a model in which the θij within each group are drawn directly given G0.]
What is the generative process when we integrate out the Gi? G0 is still of the form Σ_{k=1}^∞ πk δφk, and within each group the θij follow a CRP: customers sit at tables, and each table is served one of the atoms φk of G0.

[Figure: atoms φ1, φ2, φ3, φ4 of G0, with the θij of three groups seated at tables labeled by the atom each table serves.]
First integrate out the Gi, then integrate out G0.

[Graphical models: after marginalizing G0 as well, the model on the θij is governed only by γ, α, and H.]
[Figure: the same seating arrangement as above, shown before and after G0 is integrated out.]
For the DP, we had: G ∼ DP(α, G0), the Stick-Breaking Process (just the weights), and the CRP describing the partitions of θ when G is marginalized out. For the HDP, we have the corresponding representations: the hierarchy of random measures, the hierarchical stick-breaking weights, and the Chinese Restaurant Franchise process obtained by marginalizing out the Gi and G0.
For inference in the HDP, the same classes of algorithms used for the DP apply. We will not go into these.
Finite Hidden Markov Models (HMMs): an HMM with m states s1, …, sm has an m × m transition matrix, whose row i gives the distribution over the next state, and emissions y ∼ p(y|φi) from the current state i:

      s1    s2   ···   sm
s1    π11   π12  ···   π1m
s2    π21   π22  ···   π2m
⋮      ⋮     ⋮    ⋱     ⋮
sm    πm1   πm2  ···   πmm

How do we let m → ∞?
How do we let m → ∞? Think a bit outside the traditional clustering context: let each state si correspond to a group.

π | γ ∼ GEM(γ)
πi | α, π ∼ DP(α, π)
φk | H ∼ H
xt | xt−1, (πi)_{i=1}^∞ ∼ πxt−1
yt | xt, (φk)_{k=1}^∞ ∼ p(yt|φxt)
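A truncated forward-sampling sketch of this infinite HMM in Python (generation only, not inference; the truncation level, hyperparameters, and Gaussian emissions are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def gem(gamma, n_sticks):
    beta = rng.beta(1.0, gamma, size=n_sticks)
    return beta * np.concatenate([[1.0], np.cumprod(1.0 - beta)[:-1]])

def hdp_weights(pi, alpha):
    b = np.clip(alpha * (1.0 - np.cumsum(pi)), 1e-8, None)
    beta_i = rng.beta(np.clip(alpha * pi, 1e-8, None), b)
    return beta_i * np.concatenate([[1.0], np.cumprod(1.0 - beta_i)[:-1]])

# Truncated HDP-HMM: shared weights pi, one transition row pi_i per state, Gaussian emissions
gamma, alpha, K, T = 4.0, 10.0, 30, 200
pi = gem(gamma, K)
trans = np.array([hdp_weights(pi, alpha) for _ in range(K)])
trans /= trans.sum(axis=1, keepdims=True)        # renormalize the truncated rows
phi = rng.normal(0.0, 5.0, size=K)               # emission mean for each state: H = N(0, 25)

x = np.zeros(T, dtype=int)
y = np.zeros(T)
for t in range(1, T):
    x[t] = rng.choice(K, p=trans[x[t - 1]])      # x_t | x_{t-1} ~ pi_{x_{t-1}}
    y[t] = rng.normal(phi[x[t]], 1.0)            # y_t | x_t ~ N(phi_{x_t}, 1)

print("Distinct states visited:", len(np.unique(x)), "of", K, "truncated states")
```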
References: http://npbayes.wikidot.com/references. Includes both the “classics” as well as modern applications.