Dr. Nonparametric Bayes
Or: How I Learned to Stop Worrying and Love the Dirichlet Process
Kurt Miller CS 294: Practical Machine Learning November 19, 2009
Introduction
[Diagram: Data → Build model → Predict using model, contrasting the parametric and nonparametric approaches.]
Statistics: Bayesian Basics
The Bayesian approach treats statistical problems by maintaining probability distributions over possible parameter values. That is, we treat the parameters themselves as random variables having distributions:
1. We have some beliefs about our parameter values θ before we see any data. These beliefs are encoded in the prior distribution P(θ).
2. Treating the parameters θ as random variables, we can write the likelihood of the data X as a conditional probability: P(X|θ).
3. We would like to update our beliefs about θ based on the data by obtaining P(θ|X), the posterior distribution. Solution: by Bayes’ theorem,

P(θ|X) = P(X|θ)P(θ) / P(X), where P(X) = ∫ P(X|θ)P(θ) dθ.
(Slide from tutorial lecture)
You can take a course on this question. One answer: Infinite Exchangeability:

∀n, p(x1, …, xn) = p(xσ(1), …, xσ(n))

De Finetti’s Theorem (1955): If (x1, x2, …) are infinitely exchangeable, then for all n,

p(x1, …, xn) = ∫ ∏_{i=1}^n p(xi|θ) dP(θ)

for some random variable θ.
Task: Toss a (potentially biased) coin N times. Compute θ, the probability of heads.

Suppose we observe: {T, H, H, T}. What do we think θ is? The maximum likelihood estimate is θ = 1/2. Seems reasonable.

Now suppose we observe: {H, H, H, H}. What do we think θ is? The maximum likelihood estimate is θ = 1. Seem reasonable? Not really. Why?
When we observe {H, H, H, H}, why does θ = 1 seem unreasonable? Prior knowledge! We believe coins generally have θ ≈ 1/2. How to encode this? By using a Beta prior on θ.
Place a Beta(a, b) prior on θ. This prior has the form p(θ) ∝ θ^(a−1)(1 − θ)^(b−1). What does this distribution look like?

[Plot: Beta densities for (α1, α2) = (1.0, 0.1), (1.0, 1.0), (1.0, 5.0), (1.0, 10.0), (9.0, 3.0).]
After observing X, a sequence with n heads and m tails, the posterior on θ is:

p(θ|X) ∝ p(X|θ)p(θ) ∝ θ^(a+n−1)(1 − θ)^(b+m−1) ∼ Beta(a + n, b + m).

If a = b = 1 and we observe 5 heads and 2 tails, the posterior Beta(6, 3) looks like:

[Plot: the Beta(6, 3) density.]
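This conjugate update is a one-liner in code. A minimal sketch in Python using scipy.stats, with the Beta(1, 1) prior and the 5-heads/2-tails data from above:

```python
from scipy import stats

a, b = 1.0, 1.0          # Beta(a, b) prior on theta
heads, tails = 5, 2      # observed coin flips

# Conjugacy: the posterior is Beta(a + heads, b + tails) = Beta(6, 3)
posterior = stats.beta(a + heads, b + tails)

print("Posterior mean of theta:", posterior.mean())         # 6/9, about 0.667
print("95% credible interval:", posterior.interval(0.95))
```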
Nonparametric Bayesian Methods overview
Now we know what nonparametric and Bayesian mean. What should we expect from nonparametric Bayesian methods?
Latent class models - often used in the clustering context
Latent feature models
Regression

Today we focus on the Dirichlet Process!
The Dirichlet Process: a nonparametric approach to clustering. It can be used in any probabilistic model for clustering. Before diving into the details, we first introduce several key ideas.
Preliminaries
We must specify two things: p(X|θ) and p(θ). We will slowly develop what these are in the Bayesian clustering context.
How many clusters?
Frequentist approach: Gaussian Mixture Models with K mixtures. Distribution over classes: π = (π1, …, πK). Each cluster has a mean and covariance: φk = (µk, Σk). Then

p(x|π, φ) = Σ_{k=1}^K πk p(x|φk)

Use Expectation Maximization (EM) to maximize the likelihood of the data with respect to (π, φ).
Alternate definition:

G = Σ_{k=1}^K πk δφk

where δφk is an atom at φk. Then

θi ∼ G
xi ∼ p(x|θi)

[Graphical model: G → θi → xi, with a plate over the N data points.]
Parametric Bayesian Clustering
Bayesian approach: Bayesian Gaussian Mixture Models with K mixtures. Distribution over classes: π = (π1, …, πK), with π ∼ Dirichlet(α/K, …, α/K). (We’ll review the Dirichlet distribution in a few slides.) Each cluster has a mean and covariance: φk = (µk, Σk), with (µk, Σk) ∼ Normal-Inverse-Wishart(ν). We still have

p(x|π, φ) = Σ_{k=1}^K πk p(x|φk)
G is now a random measure:

φk ∼ G0
π ∼ Dirichlet(α/K, …, α/K)
G = Σ_{k=1}^K πk δφk
θi ∼ G
xi ∼ p(x|θi)

[Graphical model: α and G0 generate G; G → θi → xi, with a plate over the N data points.]
We had π ∼ Dirichlet(α1, …, αK). The Dirichlet density is defined as

p(π|α) = [Γ(Σ_{k=1}^K αk) / ∏_{k=1}^K Γ(αk)] π1^(α1−1) π2^(α2−1) ⋯ πK^(αK−1)

where πK = 1 − Σ_{k=1}^{K−1} πk. The expectations of π are

E(πi) = αi / Σ_{k=1}^K αk
A special case of the Dirichlet distribution is the Beta distribution, for K = 2:

p(π|α1, α2) = [Γ(α1 + α2) / (Γ(α1)Γ(α2))] π^(α1−1)(1 − π)^(α2−1)

[Plot: Beta densities for (α1, α2) = (1.0, 0.1), (1.0, 1.0), (1.0, 5.0), (1.0, 10.0), (9.0, 3.0).]
In three dimensions:

p(π|α1, α2, α3) = [Γ(α1 + α2 + α3) / (Γ(α1)Γ(α2)Γ(α3))] π1^(α1−1) π2^(α2−1) (1 − π1 − π2)^(α3−1)

[Plots: the density for α = (2, 2, 2), (5, 5, 5), (2, 2, 25), and bar plots of sampled π for α = (2, 2, 2), (5, 5, 5), (2, 2, 5).]
The Aggregation Property: If

(π1, …, πi, πi+1, …, πK) ∼ Dir(α1, …, αi, αi+1, …, αK)

then

(π1, …, πi + πi+1, …, πK) ∼ Dir(α1, …, αi + αi+1, …, αK)

This is also valid for any aggregation: summing any group of the πk gives a Dirichlet whose corresponding parameter is the sum of the associated αk.
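A quick empirical check of the aggregation property (a sketch with arbitrary α values; we aggregate the last two components and compare moments against a direct Dirichlet draw):

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([2.0, 3.0, 1.0, 4.0])             # arbitrary illustration values
samples = rng.dirichlet(alpha, size=200_000)

# Aggregate the last two components of each sample
aggregated = np.column_stack([samples[:, 0], samples[:, 1], samples[:, 2] + samples[:, 3]])

# The aggregation property says this should look like Dir(2, 3, 1 + 4)
reference = rng.dirichlet(np.array([2.0, 3.0, 5.0]), size=200_000)

print("Aggregated means:    ", aggregated.mean(axis=0))
print("Reference means:     ", reference.mean(axis=0))
print("Aggregated variances:", aggregated.var(axis=0))
print("Reference variances: ", reference.var(axis=0))
```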
Let Z ∼ Multinomial(π) with counts z = (z1, …, zK), and π ∼ Dir(α). Posterior:

p(π|z) ∝ p(z|π)p(π) = (π1^z1 ⋯ πK^zK)(π1^(α1−1) ⋯ πK^(αK−1)) = π1^(z1+α1−1) ⋯ πK^(zK+αK−1)

which is Dir(α + z).
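In code, this conjugate update is again just adding counts to the prior parameters (a sketch with made-up counts):

```python
import numpy as np

rng = np.random.default_rng(1)

alpha = np.array([1.0, 1.0, 1.0])        # symmetric Dir(1, 1, 1) prior over 3 classes
z = np.array([12, 3, 7])                 # observed class counts (illustration only)

# Conjugacy: the posterior over pi is Dir(alpha + z)
posterior_alpha = alpha + z
print("Posterior parameters:", posterior_alpha)
print("Posterior mean of pi:", posterior_alpha / posterior_alpha.sum())

# Draw a few posterior samples of pi
print(rng.dirichlet(posterior_alpha, size=3))
```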
Recall the Bayesian Gaussian Mixture Model with K mixtures defined above. We now turn to inference.
We no longer want just the maximum likelihood parameters; we want the full posterior:

p(π, φ|X) ∝ p(X|π, φ)p(π, φ)

Unfortunately, this is not analytically tractable. The two main approaches to approximate inference are sampling (MCMC) and variational methods; here we focus on sampling.
Suppose we wish to reason about p(θ|X), but we cannot compute this distribution exactly. If instead we can sample θ ∼ p(θ|X), what can we do?

[Figure: the density p(θ|X) next to a set of samples from p(θ|X).]

This is the idea behind Monte Carlo methods.
We do not have access to an oracle that will give us samples θ ∼ p(θ|X). How do we get these samples? Markov Chain Monte Carlo (MCMC) methods have been developed to solve this problem. We focus on Gibbs sampling, a special case of the Metropolis-Hastings algorithm.
Gibbs sampling: an MCMC technique. Assume θ consists of several parameters, θ = (θ1, …, θm). In the finite mixture model, θ = (π, µ1, …, µK, Σ1, …, ΣK). Then:

1. Initialize θ^(0) = (θ1^(0), …, θm^(0)) at time step 0.
2. At each time step t, sample each component in turn from its conditional given the current values of all the others and the data:

θi^(t) ∼ p(θi | θ1^(t), …, θi−1^(t), θi+1^(t−1), …, θm^(t−1), X)

If we repeat this for any model we discuss today, theory tells us that eventually the θ^(t) are samples from p(θ|X).

Example: θ = (θ1, θ2) and p(θ) ∼ N(µ, Σ).

[Plots: the first 50 and the first 500 Gibbs samples overlaid on the target bivariate Gaussian.]
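A minimal sketch of that bivariate-Gaussian example in Python. The conditionals of a 2-D Gaussian are themselves Gaussian; the particular µ and Σ below are made-up illustration values, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: p(theta) = N(mu, Sigma) with illustrative parameters
mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

def gibbs_bivariate_gaussian(n_samples):
    theta = np.zeros(2)                    # arbitrary initialization
    samples = np.empty((n_samples, 2))
    for t in range(n_samples):
        # Sample theta_1 | theta_2 (the conditional of a bivariate Gaussian is Gaussian)
        cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (theta[1] - mu[1])
        cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
        theta[0] = rng.normal(cond_mean, np.sqrt(cond_var))
        # Sample theta_2 | theta_1
        cond_mean = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (theta[0] - mu[0])
        cond_var = Sigma[1, 1] - Sigma[1, 0] ** 2 / Sigma[0, 0]
        theta[1] = rng.normal(cond_mean, np.sqrt(cond_var))
        samples[t] = theta
    return samples

samples = gibbs_bivariate_gaussian(500)
print("Sample mean:", samples.mean(axis=0))   # approaches mu as the chain runs
```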
Introduce “membership” indicators zi, where zi ∼ Multinomial(π) indicates which cluster the ith data point belongs to.

p(π, Z, φ|X) ∝ p(X|Z, φ)p(Z|π)p(π, φ)

[Graphical model: α → π → zi → xi ← φk; G0 → φk; plates over the N data points and the K clusters.]
Gibbs sampler for the finite mixture: randomly initialize Z, π, φ. Repeat until we have enough samples:

zi | Z−i, π, φ, X ∝ Σ_{k=1}^K πk p(xi|φk) 1{zi = k}

π | Z, φ, X ∼ Dir(n1 + α/K, …, nK + α/K)

where nk is the number of points assigned to cluster k.
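A compact sketch of such a sampler in Python (not the lecture's Matlab demo): 1-D data, known unit cluster variance, and a N(0, 10) prior on each cluster mean are all simplifying assumptions made here so the conditionals stay one line each.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data from three well-separated groups (illustration only)
X = np.concatenate([rng.normal(-4, 1, 50), rng.normal(0, 1, 50), rng.normal(4, 1, 50)])
N, K, alpha = len(X), 3, 1.0

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Random initialization of assignments, weights, and cluster means
z = rng.integers(K, size=N)
pi = np.full(K, 1.0 / K)
mu = rng.normal(0.0, 1.0, K)

for sweep in range(200):
    # 1. Sample each z_i given pi and the cluster parameters
    for i in range(N):
        probs = pi * normal_pdf(X[i], mu, 1.0)
        z[i] = rng.choice(K, p=probs / probs.sum())
    # 2. Sample pi | Z from its Dirichlet posterior
    counts = np.bincount(z, minlength=K)
    pi = rng.dirichlet(counts + alpha / K)
    # 3. Sample each cluster mean from its conjugate Gaussian posterior
    #    (unit likelihood variance, N(0, 10) prior on mu_k; both are assumptions, see above)
    for k in range(K):
        xk = X[z == k]
        prec = 1.0 / 10.0 + len(xk)
        mu[k] = rng.normal(xk.sum() / prec, np.sqrt(1.0 / prec))

print("Cluster means:", np.sort(mu))
print("Mixture weights:", pi)
```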
[Matlab demo]
Idea for an improvement: we can marginalize out some variables due to conjugacy, so we do not need to sample them. This is called a collapsed Gibbs sampler (here, π is marginalized out).

Randomly initialize Z, φ. Repeat:

zi | Z−i, φ, X ∝ Σ_{k=1}^K (nk + α/K) p(xi|φk) 1{zi = k}
For easy visualization, we used a Gaussian mixture model. You should use the appropriate likelihood model for your application!
Generic model selection: cross-validation, AIC, BIC, MDL, etc. We can also place a parametric prior on K. What if we just let K → ∞ in our parametric model?
Let K → ∞:

φk ∼ G0
π ∼ Dirichlet(α/K, …, α/K)
G = Σ_{k=1}^K πk δφk
θi ∼ G
xi ∼ p(x|θi)
Randomly initialize Z, φ. Repeat:

zi | Z−i, φ, X ∝ Σ_{k=1}^K (nk + α/K) p(xi|φk) 1{zi = k} → Σ_k nk p(xi|φk) 1{zi = k} as K → ∞

Note that nk = 0 for empty clusters.
What about empty clusters? Lump all empty clusters together. Let K+ be the number of occupied clusters. Then the posterior probability of sitting at any empty cluster is:

zi | Z−i, φ, X ∝ (K − K+) × α/K × f(xi|G0) → α f(xi|G0) as K → ∞

for f(xi|G0) = ∫ p(xi|φ) dG0(φ), the likelihood of xi with the cluster parameter integrated out under the prior.
The Dirichlet Process Model
We must again specify two things:

p(X|θ): identical to the parametric case.
p(θ): the Dirichlet Process!

Exact posterior inference is still intractable, but we have already derived the Gibbs update equations!
Image from http://www.nature.com/nsmb/journal/v7/n6/fig_tab/nsb0600_443_F1.html
(G(A1), …, G(An)) ∼ Dir(α0 G0(A1), …, α0 G0(An))
A flexible, nonparametric prior over an infinite number of clusters/classes as well as the parameters for those classes.
The Dirichlet Process (DP) is a distribution over distributions. We write G ∼ DP(α, G0) to indicate that G is a distribution drawn from the DP. It will become clearer in a bit what α and G0 are.
Three ways to think about the DP:

G ∼ DP(α, G0): the random measure itself.
The Stick-Breaking Process: just the weights.
The Chinese Restaurant Process (CRP): describes the partitions of θ when G is marginalized out.

[Graphical model: α, G0 → G → θi → xi, with a plate over the N data points.]
Definition: Let G0 be a probability measure on the measurable space (Ω, B) and α ∈ R+. The Dirichlet Process DP(α, G0) is the distribution on probability measures G such that for any finite partition (A1, …, Am) of Ω,

(G(A1), …, G(Am)) ∼ Dir(αG0(A1), …, αG0(Am))

[Figure: a partition of Ω into regions A1, …, A5.]

(Ferguson, ’73)
Suppose we sample θ1 ∼ G. What is the posterior distribution of G given θ1?

G | θ1 ∼ DP(α + 1, (α/(α + 1)) G0 + (1/(α + 1)) δθ1)

More generally,

G | θ1, …, θn ∼ DP(α + n, (α/(α + n)) G0 + (1/(α + n)) Σ_{i=1}^n δθi)
With probability 1, a sample G ∼ DP(α, G0) is of the form

G = Σ_{k=1}^∞ πk δφk

(Sethuraman, ’94)
Draw G ∼ DP(α, G0) to get G = Σ_{k=1}^∞ πk δφk. Use this in a mixture model:

θi ∼ G
xi ∼ p(x|θi)

[Graphical model: α, G0 → G → θi → xi, with a plate over the N data points.]
The Stick-Breaking Process constructs the weights πk as follows:

βk ∼ Beta(1, α), k = 1, 2, …
π1 = β1
πk = βk ∏_{l=1}^{k−1} (1 − βl), k = 2, 3, …

[Figure: breaking a unit-length stick; the first piece has length β1, the next β2(1 − β1), and so on.]
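A small sketch of stick-breaking in Python, truncated at a finite number of sticks for illustration; the standard-normal base measure G0 here is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, n_sticks=1000):
    """Truncated stick-breaking draw of DP weights pi and atoms phi."""
    beta = rng.beta(1.0, alpha, size=n_sticks)
    # pi_k = beta_k * prod_{l < k} (1 - beta_l)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - beta)[:-1]])
    pi = beta * remaining
    phi = rng.normal(0.0, 1.0, size=n_sticks)   # atoms phi_k ~ G0 = N(0, 1), an assumption
    return pi, phi

pi, phi = stick_breaking(alpha=5.0)
print("Total weight captured by 1000 sticks:", pi.sum())    # close to 1
print("Ten largest weights:", np.sort(pi)[::-1][:10])
```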
Why does Σ_{k=1}^∞ πk = 1 (with probability 1)? With πk = βk ∏_{l=1}^{k−1} (1 − βl),

1 − Σ_{k=1}^K πk = 1 − β1 − β2(1 − β1) − β3(1 − β1)(1 − β2) − ⋯ = (1 − β1)(1 − β2 − β3(1 − β2) − ⋯) = ∏_{k=1}^K (1 − βk) → 0 (with probability 1 as K → ∞)

So G = Σ_{k=1}^∞ πk δφk has a clean definition as a random measure.
[Graphical model for the stick-breaking representation: the infinite collections of weights πk and atoms φk ∼ G0 together define G, with θi ∼ G, xi ∼ p(x|θi), and a plate over the N data points.]
The Chinese Restaurant Process (CRP): imagine a restaurant with an infinite number of tables. Customers enter one at a time, and each sits down according to the distribution:

P(previously occupied table i | Fm−1) ∝ ni
P(the next unoccupied table | Fm−1) ∝ α

where ni is the number of customers currently at table i and Fm−1 denotes the state of the restaurant after m − 1 customers have been seated.
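A quick simulation of this seating rule (a sketch; tables are just integer labels and α = 2 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def crp(n_customers, alpha):
    """Sample a partition of n_customers customers from the CRP."""
    tables = []                       # tables[k] = number of customers at table k
    assignments = []
    for m in range(n_customers):
        probs = np.array(tables + [alpha], dtype=float)   # occupied tables, then a new one
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(1)          # the customer starts a new table
        else:
            tables[k] += 1
        assignments.append(k)
    return assignments, tables

assignments, tables = crp(100, alpha=2.0)
print("Number of occupied tables:", len(tables))
print("Table sizes:", tables)
```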
To use the CRP for clustering, interpret the partition it induces over customers (and the number of tables it creates) as a clustering: each table corresponds to a cluster, and each table chooses the parameter vector for that table (φk) from the prior G0. Every data point seated at table k then shares that table's parameters, which is exactly the structure we might care about in the clustering setting.

[Figure: tables labeled φ1, φ2, φ3, φ4 with customers seated around them.]
For example, with a Gaussian likelihood:

φk = (µk, Σk) ∼ N(a, b) ⊗ IW(α, β)
xi ∼ N(φk) for a data point i sitting at table k
OK, so we’ve seen how the CRP relates to clustering. How does it relate to the DP?

Important fact: the CRP is exchangeable. Remember De Finetti’s Theorem: if (x1, x2, …) are infinitely exchangeable, then for all n,

p(x1, …, xn) = ∫ ∏_{i=1}^n p(xi|G) dP(G)

for some random variable G.
That means that when we integrate out G, we get the CRP:

p(θ1, …, θn) = ∫ ∏_{i=1}^n p(θi|G) dP(G)

[Graphical model: α, G0 → G → θi → xi, with a plate over the N data points.]
To summarize the three views of the DP: G ∼ DP(α, G0) is the random measure itself, the Stick-Breaking Process gives just the weights, and the CRP describes the partitions of θ when G is marginalized out.
Inference for the Dirichlet Process
We introduce the indicators zi and use the CRP representation. Randomly initialize Z, φ. Repeat:

zi | Z−i, φ, X ∝ Σ_{k=1}^K nk p(xi|φk) 1{zi = k} + α f(xi|G0) 1{zi = K + 1}

where K is the number of currently occupied clusters. This is the sampler we saw earlier, but now with some theoretical basis.
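To make this concrete, here is a minimal sketch of a CRP-based Gibbs sampler for a DP mixture of 1-D Gaussians in Python (not the lecture's Matlab demo). It assumes unit likelihood variance and a N(0, 10) base measure G0, so that f(xi|G0) is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data (illustration only); likelihood N(phi_k, 1), base measure G0 = N(0, tau0)
X = np.concatenate([rng.normal(-4, 1, 40), rng.normal(0, 1, 40), rng.normal(4, 1, 40)])
N, alpha, tau0 = len(X), 1.0, 10.0

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

z = np.zeros(N, dtype=int)                 # start with everyone in one cluster
phi = [rng.normal(0.0, np.sqrt(tau0))]

for sweep in range(100):
    for i in range(N):
        # Cluster sizes with point i removed (an emptied cluster gets probability 0)
        counts = np.bincount(np.delete(z, i), minlength=len(phi))
        # Occupied clusters: n_k * p(x_i | phi_k); new cluster: alpha * f(x_i | G0)
        probs = np.append(counts * normal_pdf(X[i], np.array(phi), 1.0),
                          alpha * normal_pdf(X[i], 0.0, 1.0 + tau0))
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(phi):                  # new cluster: draw phi from its posterior given x_i
            prec = 1.0 / tau0 + 1.0
            phi.append(rng.normal(X[i] / prec, np.sqrt(1.0 / prec)))
        z[i] = k
    # Resample each phi_k from its conjugate posterior and prune empty clusters
    counts = np.bincount(z, minlength=len(phi))
    keep = np.where(counts > 0)[0]
    z = np.searchsorted(keep, z)
    phi = [rng.normal(X[z == j].sum() / (1.0 / tau0 + counts[k]),
                      np.sqrt(1.0 / (1.0 / tau0 + counts[k])))
           for j, k in enumerate(keep)]

print("Inferred number of clusters:", len(phi))
print("Cluster means:", np.round(np.sort(phi), 2))
```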
[Matlab demo]
The Gibbs sampler above gives approximate inference. This is based on the CRP representation.
Hierarchical Dirichlet Process
View parameters as random variables - place a prior on them.
Often the priors themselves need parameters.
Place a prior on these parameters!
Example: xij ∼ N(θi, σ^2) in m different groups.

[Graphical model: a separate θi for each group i, generating observations xij, j = 1, …, Ni.]

How do we estimate θi for each group?
Treat the θi as random variables sampled from a common prior: θi ∼ N(θ0, σ0^2).

[Graphical model: a shared θ0 generates θ1, …, θm, each of which generates its group's xij.]
[The compact plate-notation model (θ0 → θi → xij, with plates over the m groups and the Ni observations) is equivalent to the expanded model with θ1, …, θm drawn explicitly.]
Independent estimation vs. hierarchical Bayes:

[Diagram: independent per-group models, with each θi estimated separately, versus the hierarchical model in which the θi share a common prior centered at θ0.]

What do we do if we have DPs for multiple related datasets?

[Diagram: independent DP mixtures G1, …, Gm, each with its own base measure Hi and concentration αi, versus a hierarchical model in which the Gi share a common base measure G0 drawn from H.]
[Graphical model: H → G0; α, G0 → Gi; Gi → θij → xij, with plates over the m groups and the Ni observations per group.]

What kind of distribution do we use for G0? For H? Suppose the θij are mean parameters for a Gaussian, where Gi ∼ DP(α, G0) and G0 is a Gaussian with unknown mean: G0 = N(θ0, σ0^2).

This does NOT work! Why?
The problem: if G0 is continuous, then with probability ONE, Gi and Gj will share ZERO atoms. ⇒ This means NO clusters are shared between groups!

[Figure: draws Gi and Gj from a DP with a continuous base measure G0 place their atoms at entirely different locations.]
So G0 must be discrete. What discrete prior can we use on G0? How about a parametric prior? Gee, if only we had a nonparametric prior on discrete measures...
Solution: the Hierarchical Dirichlet Process.

G0 ∼ DP(γ, H)
Gi ∼ DP(α, G0)
θij | Gi ∼ Gi
xij | θij ∼ p(xij|θij)

[Graphical model: γ, H → G0; α, G0 → Gi; Gi → θij → xij, with plates over groups and observations.]

(Teh, Jordan, Beal, Blei, 2004)
Since G0 ∼ DP(γ, H) and Gi ∼ DP(α, G0), we know

G0 = Σ_{k=1}^∞ πk δφk
Gi = Σ_{k=1}^∞ πik δφk

so the Gi share the atoms φk of G0. What is the relationship between πk and πik?
Let (A1, …, Am) be a partition of Ω.

[Figure: a partition of Ω into regions A1, …, A5.]

By properties of the DP,

(Gi(A1), …, Gi(Am)) ∼ Dir(αG0(A1), …, αG0(Am))

⇒ (Σ_{k∈K1} πik, …, Σ_{k∈Km} πik) ∼ Dir(α Σ_{k∈K1} πk, …, α Σ_{k∈Km} πk)

where Kj is the set of atom indices falling in Aj.
Two equivalent representations of the HDP. In terms of random measures:

G0 ∼ DP(γ, H)
Gi ∼ DP(α, G0)
θij | Gi ∼ Gi
xij | θij ∼ p(xij|θij)

In terms of stick-breaking weights and indicators:

π ∼ GEM(γ)
πi ∼ DP(α, π)
φk ∼ H
zij ∼ πi
xij ∼ p(xij|φzij)

[Graphical models for both representations, with plates over the m groups, the Ni observations per group, and the infinitely many atoms.]
Remember:

(Σ_{k∈K1} πik, …, Σ_{k∈Km} πik) ∼ Dir(α Σ_{k∈K1} πk, …, α Σ_{k∈Km} πk)

Explicit relationship between π and πi:

βk ∼ Beta(1, γ),  πk = βk ∏_{j=1}^{k−1} (1 − βj)

βik ∼ Beta(α πk, α(1 − Σ_{j=1}^k πj)),  πik = βik ∏_{j=1}^{k−1} (1 − βij)
π ∼ GEM(γ), πi ∼ DP(α, π).

[Plots: one draw of π with γ = 2, and draws of πi for α = 1, 5, 20, and 100; the larger α is, the more closely each πi tracks the shared π.]
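A rough sketch of this construction in Python (truncated to 50 sticks; the hyperparameter values mirror the plots above, and the clip is only a numerical guard):

```python
import numpy as np

rng = np.random.default_rng(0)

def gem(gamma, n_sticks=50):
    """Truncated GEM(gamma) stick-breaking weights."""
    beta = rng.beta(1.0, gamma, size=n_sticks)
    return beta * np.concatenate([[1.0], np.cumprod(1.0 - beta)[:-1]])

def group_weights(pi, alpha):
    """Truncated draw of pi_i ~ DP(alpha, pi) via the HDP stick-breaking construction."""
    # beta_ik ~ Beta(alpha * pi_k, alpha * (1 - sum_{j<=k} pi_j))
    b = np.clip(alpha * (1.0 - np.cumsum(pi)), 1e-8, None)
    beta_i = rng.beta(alpha * pi, b)
    return beta_i * np.concatenate([[1.0], np.cumprod(1.0 - beta_i)[:-1]])

pi = gem(gamma=2.0)
for alpha in [1.0, 5.0, 20.0, 100.0]:
    pi_i = group_weights(pi, alpha)
    # Larger alpha: pi_i concentrates around the shared pi
    print(f"alpha = {alpha:6.1f}   L1 distance between pi_i and pi: {np.abs(pi_i - pi).sum():.3f}")
```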
For the DP, we had three views: G ∼ DP(α, G0), the Stick-Breaking Process (just the weights), and the CRP, which describes the partitions of θ when G is marginalized out. For the HDP, we have analogous representations; the marginalized, restaurant-style view is developed next.
First integrate out the Gi.

[Graphical models: marginalizing the Gi turns the full HDP (γ, H → G0; α, G0 → Gi → θij → xij) into a model in which the θij within each group are drawn directly given G0.]
What is the generative process when we integrate out the Gi? G0 is still of the form Σ_{k=1}^∞ πk δφk, and within each group the θij follow a CRP: customers sit at tables, and each table is served one of the atoms φk of G0.

[Figure: atoms φ1, φ2, φ3, φ4 of G0, with the θij of three groups seated at tables labeled by the atom each table serves.]
First integrate out the Gi, then integrate out G0.

[Graphical models: after marginalizing G0 as well, the model on the θij is governed only by γ, α, and H.]
[Figure: the same seating arrangement as above, shown before and after G0 is integrated out.]
For the DP, we had: G ∼ DP(α, G0), the Stick-Breaking Process (just the weights), and the CRP describing the partitions of θ when G is marginalized out. For the HDP, we have the corresponding representations: the hierarchy of random measures, the hierarchical stick-breaking weights, and the Chinese Restaurant Franchise process obtained by marginalizing out the Gi and G0.
For inference in the HDP, the same classes of algorithms used for the DP apply. We will not go into these.
Finite Hidden Markov Models (HMMs): an HMM with m states s1, …, sm has an m × m transition matrix, whose row i gives the distribution over the next state, and emissions y ∼ p(y|φi) from the current state i:

      s1    s2   ···   sm
s1    π11   π12  ···   π1m
s2    π21   π22  ···   π2m
⋮      ⋮     ⋮    ⋱     ⋮
sm    πm1   πm2  ···   πmm

How do we let m → ∞?
How do we let m → ∞? Think a bit outside the traditional clustering context: let each state si correspond to a group.

π | γ ∼ GEM(γ)
πi | α, π ∼ DP(α, π)
φk | H ∼ H
xt | xt−1, (πi)_{i=1}^∞ ∼ πxt−1
yt | xt, (φk)_{k=1}^∞ ∼ p(yt|φxt)
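A truncated forward-sampling sketch of this infinite HMM in Python (generation only, not inference; the truncation level, hyperparameters, and Gaussian emissions are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def gem(gamma, n_sticks):
    beta = rng.beta(1.0, gamma, size=n_sticks)
    return beta * np.concatenate([[1.0], np.cumprod(1.0 - beta)[:-1]])

def hdp_weights(pi, alpha):
    b = np.clip(alpha * (1.0 - np.cumsum(pi)), 1e-8, None)
    beta_i = rng.beta(np.clip(alpha * pi, 1e-8, None), b)
    return beta_i * np.concatenate([[1.0], np.cumprod(1.0 - beta_i)[:-1]])

# Truncated HDP-HMM: shared weights pi, one transition row pi_i per state, Gaussian emissions
gamma, alpha, K, T = 4.0, 10.0, 30, 200
pi = gem(gamma, K)
trans = np.array([hdp_weights(pi, alpha) for _ in range(K)])
trans /= trans.sum(axis=1, keepdims=True)        # renormalize the truncated rows
phi = rng.normal(0.0, 5.0, size=K)               # emission mean for each state: H = N(0, 25)

x = np.zeros(T, dtype=int)
y = np.zeros(T)
for t in range(1, T):
    x[t] = rng.choice(K, p=trans[x[t - 1]])      # x_t | x_{t-1} ~ pi_{x_{t-1}}
    y[t] = rng.normal(phi[x[t]], 1.0)            # y_t | x_t ~ N(phi_{x_t}, 1)

print("Distinct states visited:", len(np.unique(x)), "of", K, "truncated states")
```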
References: http://npbayes.wikidot.com/references. Includes both the “classics” as well as modern applications.