SLIDE 1

Bayesian Nonparametrics: Models Based on the Dirichlet Process

Alessandro Panella

Department of Computer Science University of Illinois at Chicago

Machine Learning Seminar Series

February 18, 2013

SLIDE 2

Sources and Inspirations

Tutorials (slides)

• P. Orbanz and Y.W. Teh, Modern Bayesian Nonparametrics. NIPS 2011.
• M. Jordan, Dirichlet Process, Chinese Restaurant Process, and All That. NIPS 2005.

Articles etc.

• E.B. Sudderth, chapter in PhD thesis, 2006.
• E. Fox, chapter in PhD thesis, 2008.
• Y.W. Teh, Dirichlet Processes. Encyclopedia of Machine Learning. Springer, 2010.
• ...

SLIDE 3

Outline

1. Introduction and background: Bayesian learning; Nonparametric models
2. Finite mixture models: Bayesian models; Clustering with FMMs; Inference
3. Dirichlet process mixture models: Going nonparametric!; The Dirichlet process; DP mixture models; Inference
4. A little more theory...: De Finetti's REDUX; Dirichlet process REDUX
5. The hierarchical Dirichlet process

SLIDE 5


The meaning of it all

BAYESIAN NONPARAMETRICS

SLIDE 8


Bayesian statistics

Estimate a parameter θ ∈ Θ after observing data x.

Frequentist: Maximum Likelihood (ML):

    θ̂_MLE = argmax_θ p(x|θ) = argmax_θ L(θ; x)

Bayesian: Bayes' rule:

    p(θ|x) = p(x|θ) p(θ) / p(x)

Bayesian prediction (using the whole posterior, not just one estimator):

    p(x_new|x) = ∫_Θ p(x_new|θ) p(θ|x) dθ

Maximum A Posteriori (MAP):

    θ̂_MAP = argmax_θ p(x|θ) p(θ)
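A minimal numeric sketch of the contrast above (ours, not from the talk): for a Bernoulli model with a uniform prior, approximate the prediction integral p(x_new|x) = ∫_Θ p(x_new|θ) p(θ|x) dθ on a grid over Θ = [0, 1], and compare it with the MAP plug-in estimate.

    # Sketch (ours): Bayesian prediction by grid integration vs. MAP plug-in.
    import numpy as np

    x = np.array([1, 0, 1, 1])                  # observed tosses (H=1, T=0)
    theta = np.linspace(1e-6, 1 - 1e-6, 1000)   # grid over Θ = [0, 1]
    dtheta = theta[1] - theta[0]

    prior = np.ones_like(theta)                                 # p(θ): uniform
    lik = theta ** x.sum() * (1 - theta) ** (len(x) - x.sum())  # p(x|θ)
    post = lik * prior
    post /= (post * dtheta).sum()               # p(θ|x) = p(x|θ)p(θ)/p(x)

    p_heads = (theta * post * dtheta).sum()     # ∫ p(x_new = H|θ) p(θ|x) dθ
    theta_map = theta[np.argmax(post)]
    print(p_heads, theta_map)                   # ≈ 0.667 (Bayes) vs 0.75 (MAP)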

SLIDE 11

De Finetti's theorem

A premise:

Definition. An infinite sequence of random variables (x1, x2, ...) is said to be (infinitely) exchangeable if, for every N and every permutation π of (1, ..., N),

    p(x1, x2, ..., xN) = p(x_π(1), x_π(2), ..., x_π(N))

Note: exchangeability ≠ i.i.d.!

Example (Polya urn). An urn contains some red balls and some black balls; an infinite sequence of colors is drawn recursively as follows: draw a ball, mark down its color, then put the ball back in the urn along with an additional ball of the same color.
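The urn is easy to simulate; here is a small sketch (ours, not from the talk). Successive colors are exchangeable but clearly not independent: every draw changes the urn's composition for all later draws.

    # Sketch (ours): simulate the Polya urn.
    import random

    def polya_urn(n_draws, n_red=1, n_black=1, seed=0):
        rng = random.Random(seed)
        urn = ["red"] * n_red + ["black"] * n_black
        draws = []
        for _ in range(n_draws):
            ball = rng.choice(urn)    # draw a ball and mark down its color...
            draws.append(ball)
            urn.append(ball)          # ...put it back plus one more of that color
        return draws

    print(polya_urn(10))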

SLIDE 13

De Finetti's theorem (cont'd)

Theorem (De Finetti, 1935; aka the Representation Theorem). A sequence of random variables (x1, x2, ...) is infinitely exchangeable if and only if, for all N, there exists a random variable θ and a probability measure p on it such that

    p(x1, x2, ..., xN) = ∫_Θ p(θ) ∏_{i=1}^N p(xi|θ) dθ

i.e., there exists a parameter space and a measure on it that makes the variables conditionally i.i.d.! The representation theorem motivates (and encourages!) the use of Bayesian statistics.

SLIDE 15

Bayesian learning

Hypothesis space H. Given data D, compute

    p(h|D) = p(D|h) p(h) / p(D)

Then we typically want to predict some future data D′, by either:

• averaging over H, i.e. p(D′|D) = ∫_H p(D′|h) p(h|D) dh
• choosing the MAP h (or computing it directly), i.e. p(D′|D) = p(D′|h_MAP)
• sampling from the posterior
• ...

H can be anything! Bayesian learning is a general learning framework. We will consider the case in which h is a probabilistic model itself, i.e. a parameter vector θ.

SLIDE 16

A simple example

Infer the bias θ ∈ [0, 1] of a coin after observing N tosses. Encode H = 1, T = 0, p(H) = θ; here h = θ, hence H = [0, 1].

Sequence of Bernoulli trials:

    p(x1, ..., xN|θ) = θ^{nH} (1 − θ)^{N−nH},   where nH = # heads

Unknown θ:

    p(x1, ..., xN) = ∫_0^1 θ^{nH} (1 − θ)^{N−nH} p(θ) dθ

Need to find a "good" prior p(θ)... Beta distribution!

[Graphical model: θ → x1, x2, ..., xN; equivalently θ → xi with a plate over i = 1...N]

SLIDE 17

A simple example (cont'd)

Beta distribution: θ ∼ Beta(a, b),

    p(θ|a, b) = (1/B(a, b)) θ^{a−1} (1 − θ)^{b−1}

Bayesian learning: p(h|D) ∝ p(D|h) p(h); for us:

    p(θ|x1, ..., xN) ∝ p(x1, ..., xN|θ) p(θ)
                     = θ^{nH} (1 − θ)^{nT} · (1/B(a, b)) θ^{a−1} (1 − θ)^{b−1}
                     ∝ θ^{nH+a−1} (1 − θ)^{nT+b−1}

i.e. θ|x1, ..., xN ∼ Beta(a + nH, b + nT). We're lucky! The Beta distribution is a conjugate prior to the binomial distribution.

[Plots: Beta(0.1, 0.1), Beta(1, 1), Beta(2, 3), Beta(10, 10) densities]
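Conjugacy makes the update a one-liner; a sketch (ours) of the Beta-Bernoulli posterior from this slide, with hypothetical counts:

    # Sketch (ours): conjugate Beta update, θ|x ~ Beta(a + nH, b + nT).
    def beta_posterior(tosses, a=1.0, b=1.0):
        n_heads = sum(tosses)
        n_tails = len(tosses) - n_heads
        return a + n_heads, b + n_tails

    a_post, b_post = beta_posterior([1, 0, 1, 1], a=2.0, b=3.0)
    print(a_post, b_post)                 # Beta(5.0, 4.0)
    print(a_post / (a_post + b_post))     # posterior mean of θ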

SLIDE 19

A simple example (cont'd)

Three sequences of four tosses:

    H T H H        H H H T        H H H H

SLIDE 20

Nonparametric models

"Nonparametric" doesn't mean "no parameters"! Rather:

• The number of parameters grows as more data are observed.
• ∞-dimensional parameter space.
• Finite data ⇒ bounded number of parameters.

Definition. A nonparametric model is a Bayesian model on an ∞-dimensional parameter space.

[Figure: a parametric density estimate p(x) vs. a nonparametric one; from Orbanz and Teh, NIPS 2011]

SLIDE 23

Models in Bayesian data analysis

Model: a generative process. Expresses how we think the data is generated. Contains hidden variables (the subject of learning). Specifies relations between variables, e.g. graphical models.

Posterior inference: knowing p(D|M, θ) ("how the data is generated"), compute p(θ|D, M). Akin to "reversing" the generative process.

[Diagram: model M with prior p(θ) generates data D via p(D|M, θ); inference recovers p(θ|D, M)]

SLIDE 25

Finite mixture models (FMMs)

Bayesian approach to clustering. Each data point is assumed to belong to one of K clusters.

General form: a sequence of data points x = (x1, ..., xN), each with probability

    p(xi|π, θ1, ..., θK) = Σ_{k=1}^K πk f(xi|θk),    π ∈ Π_{K−1} (the (K−1)-simplex)

Generative process (see the sketch below). For each i:
• Draw a cluster assignment zi ∼ π.
• Draw a data point xi ∼ F(θ_{zi}).
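A sketch of this generative process (ours, not from the talk), using the univariate-Gaussian components of the next slide as F(θk):

    # Sketch (ours): sample N points from a finite mixture of Gaussians.
    import numpy as np

    def sample_fmm(n, pi, mus, sigmas, seed=0):
        rng = np.random.default_rng(seed)
        z = rng.choice(len(pi), size=n, p=pi)   # zi ~ π
        x = rng.normal(mus[z], sigmas[z])       # xi ~ F(θ_{zi}) = N(µ_{zi}, σ_{zi})
        return z, x

    pi = np.array([0.15, 0.25, 0.60])
    mus = np.array([1.0, 4.0, 6.0])
    sigmas = np.array([1.0, 0.5, 0.7])
    z, x = sample_fmm(1000, pi, mus, sigmas)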

SLIDE 26

FMMs (example)

Mixture of univariate Gaussians: θk = (µk, σk), xi ∼ N(µk, σk),

    p(xi|π, µ, σ) = Σ_{k=1}^K πk f_N(xi; µk, σk)

[Plot: mixture density with components N(1, 1), N(4, 0.5), N(6, 0.7) and weights π = (0.15, 0.25, 0.6)]

SLIDE 27

FMMs (cont'd)

Clustering with FMMs: we need priors for π and θ. Usually, π is given a (symmetric) Dirichlet distribution prior, and the θk's are given a suitable prior H, depending on the data.

[Graphical model: α → π → zi → xi ← θk ← H; plates over N data points and K clusters]

    π ∼ Dir(α/K, ..., α/K)
    θk|H ∼ H                  k = 1 ... K
    zi|π ∼ π
    xi|θ, zi ∼ F(θ_{zi})      i = 1 ... N

SLIDE 28

Dirichlet distribution

Multivariate generalization of the Beta.

[Plots: Dirichlet densities on the simplex for Dir(1, 1, 1), Dir(2, 2, 2), Dir(5, 5, 5), Dir(5, 5, 2), Dir(5, 2, 2), Dir(0.7, 0.7, 0.7); from Teh, MLSC 2008]

SLIDE 29

Dirichlet distribution (cont'd)

π ∼ Dir(α/K, ..., α/K) iff

    p(π1, ..., πK) = (Γ(α) / ∏_k Γ(α/K)) ∏_{k=1}^K πk^{α/K − 1}

Conjugate prior to the categorical/multinomial, i.e.

    π ∼ Dir(α/K, ..., α/K),    zi ∼ π,  i = 1 ... N

implies

    π|z1, ..., zN ∼ Dir(α/K + n1, α/K + n2, ..., α/K + nK)

Moreover,

    p(z1, ..., zN|α) = (Γ(α) / Γ(α + N)) ∏_{k=1}^K Γ(nk + α/K) / Γ(α/K)

and

    p(zi = k|z_{−i}, α) = (n_k^{(−i)} + α/K) / (α + N − 1)
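A sketch of these formulas (ours): the posterior update simply adds counts to the Dirichlet parameters, and the predictive for zi has the closed form above. The function names are ours.

    # Sketch (ours): Dirichlet-categorical conjugacy and the zi predictive.
    import numpy as np

    def dirichlet_posterior(z, K, alpha):
        counts = np.bincount(z, minlength=K)
        return alpha / K + counts                   # params of Dir(α/K + nk)

    def predictive(z_minus_i, K, alpha):
        counts = np.bincount(z_minus_i, minlength=K)
        N = len(z_minus_i) + 1                      # z_{-i} plus the held-out zi
        return (counts + alpha / K) / (alpha + N - 1)

    z = np.array([0, 0, 1, 2, 2, 2])
    print(dirichlet_posterior(z, K=3, alpha=1.0))   # [2.33 1.33 3.33]
    print(predictive(z, K=3, alpha=1.0))            # sums to 1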

SLIDE 30

Inference in FMMs

Clustering: infer z (marginalize over π, θ):

    p(z|x, α, H) = p(x|z, H) p(z|α) / Σ_z p(x|z, H) p(z|α),

where

    p(z|α) = (Γ(α) / Γ(α + N)) ∏_{k=1}^K Γ(nk + α/K) / Γ(α/K)
    p(x|z, H) = ∫_Θ ∏_{i=1}^N p(xi|θ_{zi}) ∏_{k=1}^K H(θk) dθ

Parameter estimation: infer π, θ:

    p(π, θ|x, α, H) = Σ_z [ p(π|z, α) ∏_{k=1}^K p(θk|x, H) ] p(z|x, α, H)

⇒ No analytic procedure.

SLIDE 35

Approximate inference for FMMs

No exact inference because of the unknown cluster indicators z.

Expectation-Maximization (EM): widely used, but we will focus on MCMC because of the connection with the Dirichlet process.

Gibbs sampling: a Markov chain Monte Carlo (MCMC) integration method. Set of random variables v = {v1, v2, ..., vM}; we want to compute p(v). Randomly initialize their values. At each iteration, sample one variable vi and hold the rest constant:

    vi^(t) ∼ p(vi | vj^(t−1), j ≠ i)    ← usually tractable
    vj^(t) = vj^(t−1)    for j ≠ i

This creates a Markov chain with p(v) as its equilibrium distribution.

SLIDE 36

Gibbs sampling for FMMs

State variables: z1, ..., zN, θ1, ..., θK, π. Conditional distributions:

    p(π|z, θ) = Dir(α/K + n1, ..., α/K + nK)
    p(θk|x, z) ∝ p(θk) ∏_{i: zi=k} p(xi|θk) = H(θk) ∏_{i: zi=k} F_{θk}(xi)
    p(zi = k|π, θ, x) ∝ p(zi = k|π) p(xi|zi = k, θk) = πk F_{θk}(xi)

We can avoid sampling π:

    p(zi = k|z_{−i}, θ, x) ∝ p(xi|θk) p(zi = k|z_{−i}) ∝ F_{θk}(xi) (n_k^{(−i)} + α/K)
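A sketch of one collapsed Gibbs sweep (ours, under simplifying assumptions): univariate Gaussian components with known variance σ² and a N(0, τ²) prior on each mean standing in for H, so the θk update is a conjugate normal update; π is collapsed out as above. All names are ours.

    # Sketch (ours): one Gibbs sweep for an FMM with Gaussian components.
    import numpy as np

    def gibbs_sweep(x, z, mu, alpha, sigma2=1.0, tau2=10.0, rng=None):
        rng = rng or np.random.default_rng()
        K, N = len(mu), len(x)
        # p(θk|x, z) ∝ H(θk) ∏_{i: zi=k} F_{θk}(xi): conjugate normal update
        for k in range(K):
            xk = x[z == k]
            prec = 1.0 / tau2 + len(xk) / sigma2
            mu[k] = rng.normal((xk.sum() / sigma2) / prec, np.sqrt(1.0 / prec))
        # p(zi = k|z_{-i}, θ, x) ∝ F_{θk}(xi) (n_k^{(-i)} + α/K), π collapsed
        for i in range(N):
            z[i] = -1                               # remove xi from the counts
            counts = np.bincount(z[z >= 0], minlength=K)
            logw = np.log(counts + alpha / K) - 0.5 * (x[i] - mu) ** 2 / sigma2
            w = np.exp(logw - logw.max())
            z[i] = rng.choice(K, p=w / w.sum())
        return z, mu

Calling gibbs_sweep repeatedly yields (approximate) samples from p(z, θ|x, α, H).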

SLIDE 37

Gibbs sampling for FMMs (example)

Mixture of 4 bivariate Gaussians. Normal-inverse-Wishart prior on θk = (µk, Σk), conjugate to the normal distribution: Σk ∼ W(ν, ∆), µk ∼ N(ϑ, Σk/κ).

[Plots: Gibbs iterations T = 2, 10, 40, with log p(x|π, θ) = −539.17, −404.18, −397.40; from Sudderth, 2008]

SLIDE 38

FMMs: alternative representation

Indicator form:

    π ∼ Dir(α/K, ..., α/K)    θk ∼ H
    zi ∼ π                    xi ∼ F(θ_{zi})

Measure form: define the discrete mixing measure

    G(θ) = Σ_{k=1}^K πk δ(θ, θk),    θk ∼ H,    π ∼ Dir(α/K, ..., α/K)

and draw

    θ̄i ∼ G    xi ∼ F(θ̄i)

[Graphical models: α → π → zi → xi ← θk ← H (plates N, K), and H, α → G → θ̄i → xi (plate N)]

SLIDE 40

Going nonparametric!

The problem with finite FMMs: what if K is unknown? How many parameters?

Idea: let's use ∞ parameters! We want something of the kind:

    p(xi|π, θ1, θ2, ...) = Σ_{k=1}^∞ πk p(xi|θk)

How to define such a measure? We'd like the nice conjugacy properties of the Dirichlet to carry over... Is there such a thing as the ∞ limit of a Dirichlet?

SLIDE 43

The (practical) Dirichlet process

The Dirichlet process DP(α, H) is a distribution over probability measures over Θ.

• H(θ) is the base (mean) measure. Think of µ for a Gaussian... but in the space of probability measures.
• α is the concentration parameter. It controls the dispersion around the mean H.

SLIDE 44

The Dirichlet process (cont'd)

A draw G ∼ DP(α, H) is an infinite discrete probability measure:

    G(θ) = Σ_{k=1}^∞ πk δ(θ, θk),

where θk ∼ H and π is sampled from a "stick-breaking prior."

[Plot: a sample draw G on Θ; from Orbanz & Teh, 2008]

Break a stick: imagine a stick of length one. For k = 1 ... ∞: break off a fraction βk ∼ Beta(1, α) of the remaining stick, let πk be the length broken off, and keep the remainder. Following standard convention, we write π ∼ GEM(α). (Details in the second part of the talk; a sampling sketch follows below.)
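A sketch of a truncated stick-breaking sampler (ours, not from the talk): weights π ∼ GEM(α) up to a truncation level, with the leftover stick mass lumped into the last weight, plus a helper drawing a (truncated) G ∼ DP(α, H).

    # Sketch (ours): truncated GEM / DP sampling by stick-breaking.
    import numpy as np

    def gem(alpha, trunc=100, rng=None):
        rng = rng or np.random.default_rng()
        beta = rng.beta(1.0, alpha, size=trunc)          # βk ~ Beta(1, α)
        stick = np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))
        pi = beta * stick                                # πk = βk ∏_{l<k}(1 - βl)
        pi[-1] += 1.0 - pi.sum()                         # absorb leftover mass
        return pi

    def sample_dp(alpha, base_sampler, trunc=100, rng=None):
        """Truncated G ~ DP(α, H): weights πk and atoms θk ~ H."""
        pi = gem(alpha, trunc, rng)
        return pi, base_sampler(trunc)

    rng = np.random.default_rng(0)
    pi, atoms = sample_dp(5.0, lambda n: rng.standard_normal(n), rng=rng)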

SLIDE 46

Stick-breaking, intuitively

[Figure: the unit stick split into weights π1, π2, π3, ... by successive breaks β1, β2, ...; sampled weights πk for α = 1 and α = 5; from Sudderth, 2008]

• Small α ⇒ lots of weight assigned to few θk's ⇒ G will be very different from the base measure H.
• Large α ⇒ weights spread over many θk's ⇒ G will resemble the base measure H.

SLIDE 47

[Figure: a base measure H and draws G ∼ DP(α, H); from Navarro et al., 2005]

SLIDE 48

The DP mixture model (DPMM)

Let's use G ∼ DP(α, H) to build an infinite mixture model.

[Graphical model: H, α → G → θ̄i → xi, with a plate over N observations]

    G ∼ DP(α, H)
    θ̄i ∼ G
    xi ∼ F(θ̄i)

SLIDE 49

DPM (cont'd)

Using explicit cluster indicators z = (z1, z2, ..., zN):

[Graphical model: α → π → zi → xi ← θk ← H; plates over N observations and ∞ clusters]

    π ∼ GEM(α)
    θk ∼ H            k = 1, ..., ∞
    zi ∼ π
    xi ∼ F(θ_{zi})    i = 1, ..., N

SLIDE 50

Chinese restaurant process

So far, we only have a generative model. Is there a "nice" conjugacy property to use during inference? It turns out (details in part 2) that, if π ∼ GEM(α) and zi ∼ π, the marginal distribution p(z|α) = ∫ p(z|π) p(π) dπ is easily tractable, and is known as the Chinese restaurant process (CRP).

SLIDE 51

Chinese restaurant process (cont'd)

A restaurant with ∞ tables, each with ∞ capacity. zi = table at which customer i sits upon entering.

[Figure: customers incrementally seated at tables 1, 2, 3, 4, ...]

• Customer 1 sits at table 1.
• Customer 2 sits at table 1 w. prob. ∝ 1, at table 2 w. prob. ∝ α.
• Customer i sits at table k w. prob. ∝ nk (# of people at table k), at a new table w. prob. ∝ α.

    p(zi = k) = nk / (α + i − 1)        p(zi = k_new) = α / (α + i − 1)

A simulation sketch follows below.
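A sketch simulating the seating process with exactly these probabilities (ours, not from the talk):

    # Sketch (ours): simulate CRP table assignments.
    import numpy as np

    def crp(n_customers, alpha, rng=None):
        rng = rng or np.random.default_rng()
        counts = []                                 # nk: people at table k
        seating = []
        for i in range(1, n_customers + 1):
            probs = np.array(counts + [alpha], dtype=float)
            probs /= alpha + i - 1                  # p(zi = k) and p(zi = k_new)
            table = rng.choice(len(probs), p=probs)
            if table == len(counts):
                counts.append(1)                    # customer opens a new table
            else:
                counts[table] += 1
            seating.append(table)
        return seating

    print(crp(20, alpha=1.0, rng=np.random.default_rng(0)))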

SLIDE 56

Gibbs sampling for DPMMs

Via the CRP, we can find the conditional distributions for Gibbs sampling. State: θ1, ..., θK, z.

    p(θk|x, z) ∝ p(θk) ∏_{i: zi=k} p(xi|θk) = h(θk) ∏_{i: zi=k} f(xi|θk)

    p(zi = k|z_{−i}, θ, x) ∝ p(xi|θk) p(zi = k|z_{−i})
                           ∝ n_k^{(−i)} f(xi|θk)    for an existing k
                             α f(xi|θk)             for a new k

K grows as more data are observed, asymptotically as α log n.
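A sketch of the zi update (ours, under simplifying assumptions): Gaussian likelihood with known variance σ², base measure H = N(0, τ²), and a single fresh candidate θ ∼ H standing in for the new-cluster term, in the spirit of Neal's auxiliary-variable Algorithm 8 with one auxiliary parameter. All names are ours; the θk updates would follow the FMM sketch earlier.

    # Sketch (ours): CRP-style resampling of zi for a Gaussian DPMM.
    import numpy as np

    def gauss(xi, mu, sigma2):                       # f(xi|θ) for N(θ, σ²)
        return np.exp(-0.5 * (xi - mu) ** 2 / sigma2) / np.sqrt(2*np.pi*sigma2)

    def sample_zi(i, x, z, thetas, alpha, sigma2=1.0, tau2=10.0, rng=None):
        """thetas: dict cluster id -> mean. Mutates z and thetas in place."""
        rng = rng or np.random.default_rng()
        z[i] = -1                                    # unseat observation i
        for k in [k for k in list(thetas) if not np.any(z == k)]:
            del thetas[k]                            # drop emptied clusters
        ids = list(thetas)
        theta_new = rng.normal(0.0, np.sqrt(tau2))   # candidate θ ~ H
        w = np.array(
            [np.sum(z == k) * gauss(x[i], thetas[k], sigma2) for k in ids]
            + [alpha * gauss(x[i], theta_new, sigma2)])
        j = rng.choice(len(w), p=w / w.sum())
        if j == len(ids):                            # open a new cluster
            new_id = max(thetas, default=-1) + 1
            thetas[new_id] = theta_new
            z[i] = new_id
        else:
            z[i] = ids[j]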

SLIDE 57

Gibbs sampling for DPMMs (example)

Mixture of bivariate Gaussians.

[Plots: Gibbs iterations T = 2, 10, 40, with log p(x|π, θ) = −399.82, −399.08, −396.71; from Sudderth, 2008]

SLIDE 58


END OF FIRST PART.

SLIDE 60

De Finetti's REDUX

Theorem (De Finetti, 1935; aka the Representation Theorem). A sequence of random variables (x1, x2, ...) is infinitely exchangeable if and only if, for all N, there exists a random variable θ and a probability measure p on it such that

    p(x1, x2, ..., xN) = ∫_Θ p(θ) ∏_{i=1}^N p(xi|θ) dθ

The theorem wouldn't be true if θ's range were limited to Euclidean vector spaces; we need to allow θ to range over measures. ⇒ p(θ) is a distribution on measures, like the DP.

SLIDE 62

Dirichlet Process REDUX

Definition. Let Θ be a measurable space (of parameters), H a probability distribution on Θ, and α a positive scalar. A Dirichlet process is the distribution of a random probability measure G over Θ such that, for any finite partition (T1, ..., TK) of Θ,

    (G(T1), ..., G(TK)) ∼ Dir(αH(T1), ..., αH(TK))

[Figure: two finite partitions (T1, T2, T3) and (T̃1, ..., T̃5) of Θ; from Sudderth, 2008]

    E[G(Tk)] = H(Tk)

A numeric check follows below.
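A numeric sketch of the defining property (ours, not from the talk): draw G by truncated stick-breaking with H = N(0, 1), measure the masses of a fixed 3-cell partition of Θ = R, and check that their average matches H(Tk).

    # Sketch (ours): check E[G(Tk)] ≈ H(Tk) on a partition of the real line.
    import numpy as np

    rng = np.random.default_rng(0)
    alpha, trunc, n_draws = 2.0, 500, 2000
    edges = [-np.inf, -1.0, 1.0, np.inf]        # cells T1, T2, T3

    masses = np.zeros((n_draws, 3))
    for s in range(n_draws):
        beta = rng.beta(1.0, alpha, size=trunc)
        pi = beta * np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))
        atoms = rng.standard_normal(trunc)      # θk ~ H = N(0, 1)
        for c in range(3):
            cell = (atoms > edges[c]) & (atoms <= edges[c + 1])
            masses[s, c] = pi[cell].sum()

    print(masses.mean(axis=0))  # ≈ (0.159, 0.683, 0.159) = (H(T1), H(T2), H(T3))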

SLIDE 63

Posterior conjugacy

Via the conjugacy of the Dirichlet distribution, we know that

    p(G(T1), ..., G(TK) | θ̄ ∈ Tk) = Dir(αH(T1), ..., αH(Tk) + 1, ..., αH(TK))

Formalizing this analysis, we obtain that if G ∼ DP(α, H) and θ̄i ∼ G, i = 1, ..., N, the posterior measure also follows a Dirichlet process:

    p(G|θ̄1, ..., θ̄N, α, H) = DP(α + N, (1/(α + N)) (αH + Σ_{i=1}^N δ_{θ̄i}))

The DP defines a conjugate prior for distributions on arbitrary measure spaces.

SLIDE 64

Generating samples: stick-breaking

Sethuraman (1994): an equivalent definition of the Dirichlet process, through the stick-breaking construction. G ∼ DP(α, H) iff

    G(θ) = Σ_{k=1}^∞ πk δ(θ, θk),

where θk ∼ H and

    πk = βk ∏_{l=1}^{k−1} (1 − βl),    βl ∼ Beta(1, α)

[Figure: stick-breaking of the unit stick; sampled weights πk for α = 1 and α = 5; from Sudderth, 2008]

SLIDE 65

Stick-breaking (derivation) [Teh 2007]

We know that (posterior):

    G ∼ DP(α, H), θ|G ∼ G   ⇔   θ ∼ H, G|θ ∼ DP(α + 1, (αH + δθ)/(α + 1))

Consider the partition ({θ}, Θ \ {θ}) of Θ. We have:

    (G({θ}), G(Θ \ {θ})) ∼ Dir((α + 1) · (αH + δθ)/(α + 1) ({θ}), (α + 1) · (αH + δθ)/(α + 1) (Θ \ {θ}))
                         = Dir(1, α) = Beta(1, α)

So G has a point mass located at θ:

    G = β δθ + (1 − β) G′,    β ∼ Beta(1, α),

where G′ is the renormalized probability measure with the point mass removed. What is G′?

SLIDE 69

Stick-breaking (derivation) [Teh 2007]

We have:

    G ∼ DP(α, H), θ|G ∼ G   ⇒   θ ∼ H, G|θ ∼ DP(α + 1, (αH + δθ)/(α + 1)),
    G = β δθ + (1 − β) G′,    β ∼ Beta(1, α)

Consider a further partition ({θ}, T1, ..., TK) of Θ:

    (G({θ}), G(T1), ..., G(TK)) = (β, (1 − β) G′(T1), ..., (1 − β) G′(TK))
                                ∼ Dir(1, αH(T1), ..., αH(TK))

Using the agglomerative/decimative property of the Dirichlet, we get

    (G′(T1), ..., G′(TK)) ∼ Dir(αH(T1), ..., αH(TK))   ⇒   G′ ∼ DP(α, H)

SLIDE 72

Stick-breaking (derivation) [Teh 2007]

Therefore, for G ∼ DP(α, H):

    G = β1 δθ1 + (1 − β1) G1
      = β1 δθ1 + (1 − β1)(β2 δθ2 + (1 − β2) G2)
      = ...
      = Σ_{k=1}^∞ πk δθk,

where πk = βk ∏_{l=1}^{k−1} (1 − βl) and βl ∼ Beta(1, α), which is the stick-breaking construction.

SLIDE 73

Chinese restaurant (derivation)

Once again, we start from the posterior:

    p(G|θ̄1, ..., θ̄N, α, H) = DP(α + N, (1/(α + N)) (αH + Σ_{i=1}^N δ_{θ̄i}))

The expected measure of any subset T ⊂ Θ is:

    E[G(T)|θ̄1, ..., θ̄N, α, H] = (1/(α + N)) (αH(T) + Σ_{i=1}^N δ_{θ̄i}(T))

Since G is discrete, some of the {θ̄i}_{i=1}^N ∼ G take identical values. Assume K ≤ N unique values θ1, ..., θK, where θk occurs Nk times:

    E[G(T)|θ̄1, ..., θ̄N, α, H] = (1/(α + N)) (αH(T) + Σ_{k=1}^K Nk δ_{θk}(T))

SLIDE 76

Chinese restaurant (derivation)

A bit informally... let Tk contain θk and shrink it arbitrarily. In the limit, we have:

    p(θ̄_{N+1} = θ|θ̄1, ..., θ̄N, α, H) = (1/(α + N)) (α h(θ) + Σ_{k=1}^K Nk δ_{θk}(θ))

This is the generalized Polya urn scheme: an urn contains one ball for each preceding observation, with a different color for each distinct θk. For each ball drawn from the urn, we replace that ball and add one more ball of the same color. There is a special "weighted" ball which is drawn with probability proportional to α normal balls, and has a new, previously unseen color. [This description is from Sudderth, 2008.]

This allows us to sample from a Dirichlet process without explicitly constructing the underlying G ∼ DP(α, H).

SLIDE 78

Chinese restaurant (derivation)

The Dirichlet process implicitly partitions the data. Let zi indicate the subset (cluster) associated with the i-th observation, i.e. θ̄i = θ_{zi}. From the previous slide, we get:

    p(z_{N+1} = z|z1, ..., zN, α) = (1/(α + N)) (α δ(z, k̄) + Σ_{k=1}^K Nk δ(z, k)),

where k̄ denotes a new cluster. This is the Chinese restaurant process (CRP).

[Figure: customers seated at tables 1, 2, 3, 4, ...]

It induces an exchangeable distribution on partitions: the joint distribution is invariant to the order in which observations are assigned to clusters.

SLIDE 80

Take-away message

These representations are all equivalent!

Posterior DP:

    G ∼ DP(α, H), θ|G ∼ G   ⇔   θ ∼ H, G|θ ∼ DP(α + 1, (αH + δθ)/(α + 1))

Stick-breaking construction:

    G(θ) = Σ_{k=1}^∞ πk δ(θ, θk),    θk ∼ H,    π ∼ GEM(α)

Generalized Polya urn:

    p(θ̄_{N+1} = θ|θ̄1, ..., θ̄N, α, H) = (1/(α + N)) (α h(θ) + Σ_{k=1}^K Nk δ_{θk}(θ))

Chinese restaurant process:

    p(z_{N+1} = z|z1, ..., zN, α) = (1/(α + N)) (α δ(z, k̄) + Σ_{k=1}^K Nk δ(z, k))

SLIDE 83

Related subgroups of data

Dataset with J related groups: x = (x1, ..., xJ), where xj = (x_{j1}, ..., x_{jNj}) contains Nj observations. We want these groups to share clusters (transfer knowledge).

[Figure: grouped observations x_{ji}; from Jordan, 2005]

SLIDE 84

Hierarchical Dirichlet process (HDP)

Global probability measure G0 ∼ DP(γ, H) defines a set of shared clusters:

    G0(θ) = Σ_{k=1}^∞ βk δ(θ, θk),    θk ∼ H,    β ∼ GEM(γ)

Group-specific distributions Gj ∼ DP(α, G0):

    Gj(θ) = Σ_{t=1}^∞ π̃_{jt} δ(θ, θ̃_{jt}),    θ̃_{jt} ∼ G0,    π̃_j ∼ GEM(α)

Note G0 as the base measure! Since G0 is discrete, each local cluster has its parameter θ̃_{jt} copied from some global cluster.

For each group, data points are generated according to:

    θ̄_{ji} ∼ Gj,    x_{ji} ∼ F(θ̄_{ji})

SLIDE 87

The HDP mixture model (HDPMM)

[Graphical model: H, γ → G0; G0, α → Gj → θ̄_{ji} → x_{ji}; plate over J groups, inner plate over N observations]

    G0 ∼ DP(γ, H)
    Gj ∼ DP(α, G0)    j = 1, ..., J
    θ̄_{ji} ∼ Gj
    x_{ji} ∼ F(θ̄_{ji})

A sampling sketch follows below.
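A sketch of this generative model by truncated stick-breaking (ours, with assumed toy settings: H = N(0, 1), Gaussian emissions with σ = 0.5). Because every θ̃_{jt} is a draw from the discrete G0, all groups reuse the same global atoms θk, which is exactly the cluster sharing the HDP is built for.

    # Sketch (ours): truncated HDP sampler (gem() as in the DP section).
    import numpy as np

    def gem(alpha, trunc, rng):                      # truncated GEM(α) weights
        beta = rng.beta(1.0, alpha, size=trunc)
        pi = beta * np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))
        pi[-1] += 1.0 - pi.sum()
        return pi

    rng = np.random.default_rng(0)
    K, J, N = 200, 3, 50                             # truncation, groups, points
    gamma, alpha = 2.0, 1.0

    theta = rng.standard_normal(K)                   # shared atoms θk ~ H
    beta_w = gem(gamma, K, rng)                      # β ~ GEM(γ): G0 weights

    data = []
    for j in range(J):
        t_to_k = rng.choice(K, size=K, p=beta_w)     # θ̃_jt ~ G0: pick a global k
        pi_j = gem(alpha, K, rng)                    # π̃_j ~ GEM(α)
        z = t_to_k[rng.choice(K, size=N, p=pi_j)]    # global cluster of each xji
        data.append(rng.normal(theta[z], 0.5))       # xji ~ F(θ̄ji)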

SLIDE 88

The HDP mixture model (cont'd)

    Gj(θ) = Σ_{t=1}^∞ π̃_{jt} δ(θ, θ̃_{jt}),    θ̃_{jt} ∼ G0,    π̃_j ∼ GEM(α)

G0 is discrete, so each group might create several copies of the same global cluster. Aggregating the probabilities:

    Gj(θ) = Σ_{k=1}^∞ π_{jk} δ(θ, θk),    π_{jk} = Σ_{t: k_{jt} = k} π̃_{jt}

It can be shown that π_j ∼ DP(α, β).

• β = (β1, β2, ...): average weight of the clusters across groups.
• π_j = (π_{j1}, π_{j2}, ...): group-specific weights.
• α controls the variability of the cluster weights across groups.

SLIDE 89


THANK YOU. QUESTIONS?
