Dr. Nonparametric Bayes Or: How I Learned to Stop Worrying and Love the Dirichlet Process
SLIDE 1
  • Dr. Nonparametric Bayes

Or: How I Learned to Stop Worrying and Love the Dirichlet Process

Kurt Miller CS 294: Practical Machine Learning November 19, 2009

slide-2
SLIDE 2

Today we will discuss Nonparametric Bayesian methods.

SLIDE 3

Today we will discuss Nonparametric Bayesian methods.

“Nonparametric Bayesian methods”? What does that mean?

SLIDE 4

Introduction

Nonparametric: Does NOT mean there are no parameters.

SLIDE 5

Introduction

Example: Classification

Data ⇒ Build model ⇒ Predict using model

Parametric Approach vs. Nonparametric Approach

[Figure: labeled data with a parametric decision boundary and a nonparametric one]

SLIDE 6

Introduction

Example: Regression

Data ⇒ Build model ⇒ Predict using model

Parametric Approach vs. Nonparametric Approach

[Figure: a parametric regression fit and a nonparametric regression fit]

SLIDE 7

Introduction

Example: Clustering

Data ⇒ Build model

Parametric Approach vs. Nonparametric Approach

[Figure: a parametric clustering and a nonparametric clustering of the same data]

SLIDE 8

Introduction

So now we know what nonparametric means, but what does Bayesian mean?

Statistics: Bayesian Basics

The Bayesian approach treats statistical problems by maintaining probability distributions over possible parameter values. That is, we treat the parameters themselves as random variables having distributions:

1. We have some beliefs about our parameter values θ before we see any data. These beliefs are encoded in the prior distribution P(θ).

2. Treating the parameters θ as random variables, we can write the likelihood of the data X as a conditional probability: P(X|θ).

3. We would like to update our beliefs about θ based on the data by obtaining P(θ|X), the posterior distribution. Solution: by Bayes’ theorem,

   P(θ|X) = P(X|θ)P(θ) / P(X),  where P(X) = ∫ P(X|θ)P(θ) dθ

(Slide from tutorial lecture)

SLIDE 9

Introduction

Why Be Bayesian?

You can take a course on this question.

SLIDE 10

Introduction

Why Be Bayesian?

You can take a course on this question. One answer: Infinite Exchangeability: for all n and every permutation σ, p(x1, . . . , xn) = p(xσ(1), . . . , xσ(n))

SLIDE 11

Introduction

Why Be Bayesian?

You can take a course on this question. One answer: Infinite Exchangeability: for all n and every permutation σ, p(x1, . . . , xn) = p(xσ(1), . . . , xσ(n)).

De Finetti’s Theorem (1955): If (x1, x2, . . .) are infinitely exchangeable, then for all n

p(x1, . . . , xn) = ∫ ∏_{i=1}^n p(xi|θ) dP(θ)

for some random variable θ.

SLIDE 12

Introduction

Simple Example

Task: Toss a (potentially biased) coin N times. Compute θ, the probability of heads. Suppose we observe: {T, H, H, T}. What do we think θ is?

SLIDE 13

Introduction

Simple Example

Task: Toss a (potentially biased) coin N times. Compute θ, the probability of heads. Suppose we observe: {T, H, H, T}. What do we think θ is? The maximum likelihood estimate is θ = 1/2. Seems reasonable.

SLIDE 14

Introduction

Simple Example

Task: Toss a (potentially biased) coin N times. Compute θ, the probability of heads. Suppose we observe: {T, H, H, T}. What do we think θ is? The maximum likelihood estimate is θ = 1/2. Seems reasonable. Now suppose we observe: {H, H, H, H}. What do we think θ is?

SLIDE 15

Introduction

Simple Example

Task: Toss a (potentially biased) coin N times. Compute θ, the probability of heads. Suppose we observe: {T, H, H, T}. What do we think θ is? The maximum likelihood estimate is θ = 1/2. Seems reasonable. Now suppose we observe: {H, H, H, H}. What do we think θ is? The maximum likelihood estimate is θ = 1. Seem reasonable?

SLIDE 16

Introduction

Simple Example

Task: Toss a (potentially biased) coin N times. Compute θ, the probability of heads. Suppose we observe: {T, H, H, T}. What do we think θ is? The maximum likelihood estimate is θ = 1/2. Seems reasonable. Now suppose we observe: {H, H, H, H}. What do we think θ is? The maximum likelihood estimate is θ = 1. Seem reasonable? Not really. Why?

SLIDE 17

Introduction

Simple Example

When we observe {H, H, H, H}, why does θ = 1 seem unreasonable?

SLIDE 18

Introduction

Simple Example

When we observe {H, H, H, H}, why does θ = 1 seem unreasonable? Prior knowledge! We believe coins generally have θ ≈ 1/2. How to encode this? By using a Beta prior on θ.

SLIDE 19

Introduction

Bayesian Approach to Estimating θ

Place a Beta(a, b) prior on θ. This prior has the form p(θ) ∝ θ^(a−1)(1 − θ)^(b−1). What does this distribution look like?

SLIDE 20

Introduction

Bayesian Approach to Estimating θ

Place a Beta(a, b) prior on θ. This prior has the form p(θ) ∝ θ^(a−1)(1 − θ)^(b−1). What does this distribution look like?

[Figure: Beta densities for (α1, α2) = (1.0, 0.1), (1.0, 1.0), (1.0, 5.0), (1.0, 10.0), (9.0, 3.0)]

SLIDE 21

Introduction

Bayesian Approach to Estimating θ

After observing X, a sequence with n heads and m tails, the posterior on θ is: p(θ|X) ∝ p(X|θ)p(θ) ∝ θ^(a+n−1)(1 − θ)^(b+m−1) ∼ Beta(a + n, b + m).

SLIDE 22

Introduction

Bayesian Approach to Estimating θ

After observing X, a sequence with n heads and m tails, the posterior on θ is: p(θ|X) ∝ p(X|θ)p(θ) ∝ θ^(a+n−1)(1 − θ)^(b+m−1) ∼ Beta(a + n, b + m). If a = b = 1 and we observe 5 heads and 2 tails, Beta(6, 3) looks like

[Figure: the Beta(6, 3) density]
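As a quick sanity check, here is a minimal sketch of this update in Python; the prior values, flip sequence, and names are illustrative, not from the lecture:

```python
# Beta-Bernoulli conjugacy: the posterior after n heads and m tails is
# Beta(a + n, b + m).
from scipy.stats import beta

a, b = 1.0, 1.0                                  # Beta(a, b) prior on theta
flips = ["H", "H", "H", "H", "H", "T", "T"]      # 5 heads, 2 tails
n, m = flips.count("H"), flips.count("T")

posterior = beta(a + n, b + m)                   # Beta(6, 3)
print("posterior mean:", posterior.mean())       # (a + n)/(a + b + n + m) = 2/3
print("MLE for comparison:", n / (n + m))        # 5/7
```

The posterior mean 2/3 is pulled toward the prior mean 1/2 relative to the MLE 5/7, which is exactly the behavior the coin example asks for.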

SLIDE 23

Nonparametric Bayesian Methods overview

Nonparametric Bayesian Methods

Now we know what nonparametric and Bayesian mean. What should we expect from nonparametric Bayesian methods?

SLIDE 24

Nonparametric Bayesian Methods overview

Nonparametric Bayesian Methods

Now we know what nonparametric and Bayesian mean. What should we expect from nonparametric Bayesian methods?

  • Complexity of our model should be allowed to grow as we get more data.

SLIDE 25

Nonparametric Bayesian Methods overview

Nonparametric Bayesian Methods

Now we know what nonparametric and Bayesian mean. What should we expect from nonparametric Bayesian methods?

  • Complexity of our model should be allowed to grow as we get more data.
  • Place a prior on an unbounded number of parameters.

SLIDE 26

Nonparametric Bayesian Methods overview

Nonparametric Bayesian Methods overview

  • Dirichlet Process/Chinese Restaurant Process

Latent class models - often used in the clustering context

  • Beta Process/Indian Buffet Process

Latent feature models

  • Gaussian Process (No culinary metaphor - oh well)

Regression

Today we focus on the Dirichlet Process!

SLIDE 27

Nonparametric Bayesian Methods overview

Today’s topic: The Dirichlet Process

A nonparametric approach to clustering. It can be used in any probabilistic model for clustering. Before diving into the details, we first introduce several key ideas.

SLIDE 28

Nonparametric Bayesian Methods overview

Key ideas to be discussed today

  • A parametric Bayesian approach to clustering
  • Defining the model
  • Markov Chain Monte Carlo (MCMC) inference
  • A nonparametric approach to clustering
  • Defining the model - The Dirichlet Process!
  • MCMC inference
  • Extensions

SLIDE 29

Nonparametric Bayesian Methods overview

Key ideas to be discussed today

  • A parametric Bayesian approach to clustering
  • Defining the model
  • Markov Chain Monte Carlo (MCMC) inference
  • A nonparametric approach to clustering
  • Defining the model - The Dirichlet Process!
  • MCMC inference
  • Extensions

SLIDE 30

Preliminaries

A Bayesian Approach to Clustering

We must specify two things:

  • The likelihood term (how data is affected by the parameters): p(X|θ)
  • The prior (the prior distribution on the parameters): p(θ)

We will slowly develop what these are in the Bayesian clustering context.

SLIDE 31

Preliminaries

Motivating example: Clustering

How many clusters?

SLIDE 32

Preliminaries

Motivating example: Clustering

How many clusters?

SLIDE 33

Preliminaries

Clustering – A Parametric Approach

Frequentist approach: Gaussian Mixture Models with K mixtures. Distribution over classes: π = (π1, . . . , πK). Each cluster has a mean and covariance: φk = (µk, Σk). Then

p(x|π, φ) = Σ_{k=1}^K πk p(x|φk)

Use Expectation Maximization (EM) to maximize the likelihood of the data with respect to (π, φ).

SLIDE 34

Preliminaries

Clustering – A Parametric Approach

Frequentist approach: Gaussian Mixture Models with K mixtures. Alternate definition:

G = Σ_{k=1}^K πk δφk

where δφk is an atom at φk. Then θi ∼ G, xi ∼ p(x|θi).

[Plate diagram: G → θi → xi, repeated N times]

SLIDE 35

Parametric Bayesian Clustering

Clustering – A Parametric Approach

Bayesian approach: Bayesian Gaussian Mixture Models with K mixtures. Distribution over classes: π = (π1, . . . , πK), with π ∼ Dirichlet(α/K, . . . , α/K). (We’ll review the Dirichlet distribution in a few slides.) Each cluster has a mean and covariance: φk = (µk, Σk), with (µk, Σk) ∼ Normal-Inverse-Wishart(ν). We still have

p(x|π, φ) = Σ_{k=1}^K πk p(x|φk)

SLIDE 36

Parametric Bayesian Clustering

Clustering – A Parametric Approach

Bayesian approach: Bayesian Gaussian Mixture Models with K mixtures. G is now a random measure:

φk ∼ G0, π ∼ Dirichlet(α/K, . . . , α/K), G = Σ_{k=1}^K πk δφk, θi ∼ G, xi ∼ p(x|θi)

[Plate diagram: α, G0 → G → θi → xi, repeated N times]

SLIDE 37

Parametric Bayesian Clustering

The Dirichlet Distribution

We had π ∼ Dirichlet(α1, . . . , αK). The Dirichlet density is defined as

p(π|α) = [Γ(Σ_{k=1}^K αk) / ∏_{k=1}^K Γ(αk)] π1^(α1−1) π2^(α2−1) · · · πK^(αK−1)

where πK = 1 − Σ_{k=1}^{K−1} πk. The expectations of π are

E(πi) = αi / Σ_{k=1}^K αk
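A small numerical check of the expectation formula, with an arbitrary illustrative α (a sketch, not part of the lecture):

```python
# Sample from Dir(alpha) and compare the empirical mean of pi to
# E(pi_i) = alpha_i / sum(alpha).
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 2.0, 25.0])               # illustrative choice

samples = rng.dirichlet(alpha, size=100_000)     # each row lies on the simplex
print("empirical mean :", samples.mean(axis=0))
print("theoretical    :", alpha / alpha.sum())
```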

SLIDE 38

Parametric Bayesian Clustering

The Beta Distribution

A special case of the Dirichlet distribution is the Beta distribution, for K = 2:

p(π|α1, α2) = [Γ(α1 + α2) / (Γ(α1)Γ(α2))] π^(α1−1) (1 − π)^(α2−1)

[Figure: Beta densities for (α1, α2) = (1.0, 0.1), (1.0, 1.0), (1.0, 5.0), (1.0, 10.0), (9.0, 3.0)]

SLIDE 39

Parametric Bayesian Clustering

The Dirichlet Distribution

In three dimensions:

p(π|α1, α2, α3) = [Γ(α1 + α2 + α3) / (Γ(α1)Γ(α2)Γ(α3))] π1^(α1−1) π2^(α2−1) (1 − π1 − π2)^(α3−1)

[Figure: Dirichlet densities on the simplex for α = (2, 2, 2), α = (5, 5, 5), α = (2, 2, 25)]

SLIDE 40

Parametric Bayesian Clustering

Draws from the Dirichlet Distribution

[Figure: three draws each from Dir(α) for α = (2, 2, 2), α = (5, 5, 5), and α = (2, 2, 5), shown as bar plots over the three components]

SLIDE 41

Parametric Bayesian Clustering

Key Property of the Dirichlet Distribution

The Aggregation Property: If (π1, . . . , πi, πi+1, . . . , πK) ∼ Dir(α1, . . . , αi, αi+1, . . . , αK) then (π1, . . . , πi + πi+1, . . . , πK) ∼ Dir(α1, . . . , αi + αi+1, . . . , αK)

SLIDE 42

Parametric Bayesian Clustering

Key Property of the Dirichlet Distribution

The Aggregation Property: If (π1, . . . , πi, πi+1, . . . , πK) ∼ Dir(α1, . . . , αi, αi+1, . . . , αK), then (π1, . . . , πi + πi+1, . . . , πK) ∼ Dir(α1, . . . , αi + αi+1, . . . , αK). This is also valid for any aggregation, e.g.

(π1 + π2, Σ_{k=3}^K πk) ∼ Beta(α1 + α2, Σ_{k=3}^K αk)
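The property is easy to verify empirically; a small sketch with an illustrative α:

```python
# Aggregating components of a Dirichlet draw: (pi1 + pi2, pi3 + pi4) should be
# distributed as Dir(alpha1 + alpha2, alpha3 + alpha4), i.e. Beta(3, 7) here.
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([1.0, 2.0, 3.0, 4.0])
pi = rng.dirichlet(alpha, size=200_000)

agg = pi[:, 0] + pi[:, 1]                        # should be Beta(3, 7)
print("empirical mean:", agg.mean())             # Beta(3, 7) mean = 0.3
print("empirical var :", agg.var())              # Beta(3, 7) var ≈ 0.0191
```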

SLIDE 43

Parametric Bayesian Clustering

Multinomial-Dirichlet Conjugacy

Let Z ∼ Multinomial(π) and π ∼ Dir(α). Posterior:

p(π|z) ∝ p(z|π)p(π) = (π1^(z1) · · · πK^(zK))(π1^(α1−1) · · · πK^(αK−1)) = π1^(z1+α1−1) · · · πK^(zK+αK−1)

which is Dir(α + z).
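In code the conjugate update is a single line; a tiny sketch with made-up counts:

```python
# Multinomial-Dirichlet conjugacy: the posterior parameters are alpha + z.
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])    # Dir(alpha) prior over class proportions
z = np.array([10, 3, 7])             # observed counts for the 3 classes

alpha_post = alpha + z               # posterior is Dir(alpha + z)
print("posterior parameters:", alpha_post)
print("posterior mean:", alpha_post / alpha_post.sum())
```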

SLIDE 44

Parametric Bayesian Clustering

Clustering – A Parametric Approach

Bayesian approach: Bayesian Gaussian Mixture Models with K mixtures. G is now a random measure:

φk ∼ G0, π ∼ Dirichlet(α/K, . . . , α/K), G = Σ_{k=1}^K πk δφk, θi ∼ G, xi ∼ p(x|θi)

[Plate diagram: α, G0 → G → θi → xi, repeated N times]

SLIDE 45

Parametric Bayesian Clustering

Bayesian Mixture Models

We no longer want just the maximum likelihood parameters; we want the full posterior: p(π, φ|X) ∝ p(X|π, φ)p(π, φ). Unfortunately, this is not analytically tractable. Two main approaches to approximate inference:

  • Markov Chain Monte Carlo (MCMC) methods
  • Variational approximations

SLIDE 46

Parametric Bayesian Clustering

Monte Carlo Methods

Suppose we wish to reason about p(θ|X), but we cannot compute this distribution exactly. If instead, we can sample θ ∼ p(θ|X), what can we do?

[Figure: the density p(θ|X) side by side with a histogram of samples from p(θ|X)]

This is the idea behind Monte Carlo methods.

SLIDE 47

Parametric Bayesian Clustering

Markov Chain Monte Carlo (MCMC)

We do not have access to an oracle that will give us samples θ ∼ p(θ|X). How do we get these samples? Markov Chain Monte Carlo (MCMC) methods have been developed to solve this problem. We focus on Gibbs sampling, a special case of the Metropolis-Hastings algorithm.

SLIDE 48

Parametric Bayesian Clustering

Gibbs sampling

An MCMC technique

Assume θ consists of several parameters θ = (θ1, . . . , θm). In the finite mixture model, θ = (π, µ1, . . . , µK, Σ1, . . . , ΣK). Then:

  • Initialize θ^(0) = (θ1^(0), . . . , θm^(0)) at time step 0.
  • For t = 1, 2, . . ., draw θ^(t) given θ^(t−1) in such a way that eventually the θ^(t) are samples from p(θ|X).

SLIDE 49

Parametric Bayesian Clustering

Gibbs sampling

An MCMC technique

In Gibbs sampling, we only need to be able to sample

θi^(t) ∼ p(θi | θ1^(t), . . . , θ_{i−1}^(t), θ_{i+1}^(t−1), . . . , θm^(t−1), X).

If we repeat this for any model we discuss today, theory tells us that eventually we get samples θ^(t) from p(θ|X).

SLIDE 50

Parametric Bayesian Clustering

Gibbs sampling

An MCMC technique

In Gibbs sampling, we only need to be able to sample

θi^(t) ∼ p(θi | θ1^(t), . . . , θ_{i−1}^(t), θ_{i+1}^(t−1), . . . , θm^(t−1), X).

If we repeat this for any model we discuss today, theory tells us that eventually we get samples θ^(t) from p(θ|X). Example: θ = (θ1, θ2) and p(θ) ∼ N(µ, Σ).

[Figure: Gibbs samples overlaid on a 2D Gaussian; the first 50 samples and the first 500 samples]

SLIDE 51

Parametric Bayesian Clustering

Bayesian Mixture Models - MCMC inference

Introduce “membership” indicators zi, where zi ∼ Multinomial(π) indicates which cluster the ith data point belongs to.

p(π, Z, φ|X) ∝ p(X|Z, φ)p(Z|π)p(π, φ)

[Plate diagram: α → π → zi; G0 → φk (k = 1, . . . , K); (zi, φ) → xi, repeated N times]

SLIDE 52

Parametric Bayesian Clustering

Gibbs sampling for the Bayesian Mixture Model

Randomly initialize Z, π, φ. Repeat until we have enough samples:

  1. Sample each zi from

     p(zi = k | Z−i, π, φ, X) ∝ πk p(xi|φk)

  2. Sample π from

     π | Z, φ, X ∼ Dir(n1 + α/K, . . . , nK + α/K)

     where nk is the number of points assigned to cluster k.

  3. Sample each φk from the NIW posterior based on Z and X.

A small runnable sketch of these three updates follows below.
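This is a minimal Python sketch, not the lecture’s Matlab demo. To keep every conditional simple it assumes a 1-D mixture with known observation variance and a Normal prior on each cluster mean, standing in for the full Normal-Inverse-Wishart prior; the data and hyperparameter values are illustrative:

```python
# Gibbs sampler for a (simplified) Bayesian Gaussian mixture model.
import numpy as np

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-3, 1, 100), rng.normal(3, 1, 100)])
N, K, alpha = len(X), 3, 1.0
sigma2, mu0, tau2 = 1.0, 0.0, 100.0      # known obs. variance; N(mu0, tau2) prior

z = rng.integers(K, size=N)              # random initialization
mu = rng.normal(mu0, 1.0, K)
pi = np.full(K, 1.0 / K)

for it in range(200):
    # 1. Sample each z_i: p(z_i = k | ...) ∝ pi_k N(x_i | mu_k, sigma2)
    logp = np.log(pi) - 0.5 * (X[:, None] - mu[None, :]) ** 2 / sigma2
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(K, p=row) for row in p])

    # 2. Sample pi | Z ~ Dir(n_1 + alpha/K, ..., n_K + alpha/K)
    n = np.bincount(z, minlength=K)
    pi = rng.dirichlet(n + alpha / K)

    # 3. Sample each mu_k from its (conjugate) Normal posterior
    for k in range(K):
        prec = 1.0 / tau2 + n[k] / sigma2
        mean = (mu0 / tau2 + X[z == k].sum() / sigma2) / prec
        mu[k] = rng.normal(mean, np.sqrt(1.0 / prec))

print("cluster sizes:", np.bincount(z, minlength=K))
print("cluster means:", np.round(mu, 2))
```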

SLIDE 53

Parametric Bayesian Clustering

MCMC in Action

[Matlab demo: a bad initialization point, then iterations 25 and 65]

SLIDE 54

Parametric Bayesian Clustering

Collapsed Gibbs Sampler

Idea for an improvement: due to conjugacy, we can marginalize out some variables, so we do not need to sample them. This is called a collapsed sampler. Here we marginalize out π.

Randomly initialize Z, φ. Repeat:

  1. Sample each zi from

     p(zi = k | Z−i, φ, X) ∝ (nk + α/K) p(xi|φk)

  2. Sample each φk from the NIW posterior based on Z and X.

SLIDE 55

Parametric Bayesian Clustering

Note about the likelihood term

For easy visualization, we used a Gaussian mixture model. You should use the appropriate likelihood model for your application!

SLIDE 56

Parametric Bayesian Clustering

Summary: Parametric Bayesian clustering

  • First specify the likelihood - application specific.
  • Next specify a prior on all parameters.
  • Exact posterior inference is intractable. Can use a Gibbs sampler for approximate inference.

SLIDE 57

5 minute break

SLIDE 58

Parametric Bayesian Clustering

How to Choose K?

Generic model selection: cross-validation, AIC, BIC, MDL, etc. Can place a parametric prior on K.

SLIDE 59

Parametric Bayesian Clustering

How to Choose K?

Generic model selection: cross-validation, AIC, BIC, MDL, etc. Can place a parametric prior on K. What if we just let K → ∞ in our parametric model?

SLIDE 60

Parametric Bayesian Clustering

Thought Experiment

Let K → ∞.

φk ∼ G0, π ∼ Dirichlet(α/K, . . . , α/K), G = Σ_{k=1}^K πk δφk, θi ∼ G, xi ∼ p(x|θi)

SLIDE 61

Parametric Bayesian Clustering

Thought Experiment: Collapsed Gibbs Sampler

Randomly initialize Z, φ. Repeat:

  1. Sample each zi from

     p(zi = k | Z−i, φ, X) ∝ (nk + α/K) p(xi|φk) → nk p(xi|φk) as K → ∞

     Note that nk = 0 for empty clusters.

  2. Sample each φk based on Z and X.

SLIDE 62

Parametric Bayesian Clustering

Thought Experiment: Collapsed Gibbs Sampler

What about empty clusters? Lump all empty clusters together. Let K+ be the number of occupied clusters. Then the posterior probability of sitting at any empty cluster is:

p(zi = empty | Z−i, φ, X) ∝ (α/K) × (K − K+) f(xi|G0) → α f(xi|G0)

as K → ∞, where f(xi|G0) = ∫ p(x|φ) dG0(φ).

SLIDE 63

Parametric Bayesian Clustering

Key ideas to be discussed today

  • A parametric Bayesian approach to clustering
  • Defining the model
  • Markov Chain Monte Carlo (MCMC) inference
  • A nonparametric approach to clustering
  • Defining the model - The Dirichlet Process!
  • MCMC inference
  • Extensions

SLIDE 64

The Dirichlet Process Model

A Nonparametric Bayesian Approach to Clustering

We must again specify two things:

  • The likelihood term (how data is affected by the parameters): p(X|θ). Identical to the parametric case.
  • The prior (the prior distribution on the parameters): p(θ). The Dirichlet Process!

Exact posterior inference is still intractable. But we have already derived the Gibbs update equations!

SLIDE 65

The Dirichlet Process Model

What is the Dirichlet Process?

Image from http://www.nature.com/nsmb/journal/v7/n6/fig_tab/nsb0600_443_F1.html

SLIDE 66

The Dirichlet Process Model

What is the Dirichlet Process?

(G(A1), . . . , G(An)) ∼ Dir(α0 G0(A1), . . . , α0 G0(An))

SLIDE 67

The Dirichlet Process Model

The Dirichlet Process

A flexible, nonparametric prior over an infinite number of clusters/classes as well as the parameters for those classes.

SLIDE 68

The Dirichlet Process Model

Parameters for the Dirichlet Process

  • α - The concentration parameter.
  • G0 - The base measure. A prior distribution for the cluster-specific parameters.

The Dirichlet Process (DP) is a distribution over distributions. We write G ∼ DP(α, G0) to indicate G is a distribution drawn from the DP. It will become clearer in a bit what α and G0 are.

SLIDE 69

The Dirichlet Process Model

The DP, CRP, and Stick-Breaking Process

[Diagram: plate model α, G0 → G → θi → xi (N times), with G ∼ DP(α, G0); the Stick-Breaking Process gives just the weights of G; the CRP describes the partitions of θ when G is marginalized out]

SLIDE 70

The Dirichlet Process Model

The Dirichlet Process

Definition: Let G0 be a probability measure on the measurable space (Ω, B) and α ∈ R+. The Dirichlet Process DP(α, G0) is the distribution on probability measures G such that for any finite partition (A1, . . . , Am) of Ω,

(G(A1), . . . , G(Am)) ∼ Dir(αG0(A1), . . . , αG0(Am)).

[Figure: a partition of Ω into A1, . . . , A5]

(Ferguson, ’73)

SLIDE 71

The Dirichlet Process Model

Mathematical Properties of the Dirichlet Process

Suppose we sample

  • G ∼ DP(α, G0)
  • θ1 ∼ G

What is the posterior distribution of G given θ1?

SLIDE 72

The Dirichlet Process Model

Mathematical Properties of the Dirichlet Process

Suppose we sample

  • G ∼ DP(α, G0)
  • θ1 ∼ G

What is the posterior distribution of G given θ1?

G | θ1 ∼ DP(α + 1, α/(α + 1) G0 + 1/(α + 1) δθ1)

More generally,

G | θ1, . . . , θn ∼ DP(α + n, α/(α + n) G0 + 1/(α + n) Σ_{i=1}^n δθi)

SLIDE 73

The Dirichlet Process Model

Mathematical Properties of the Dirichlet Process

With probability 1, a sample G ∼ DP(α, G0) is of the form

G = Σ_{k=1}^∞ πk δφk

(Sethuraman, ’94)

SLIDE 74

The Dirichlet Process Model

The Dirichlet Process and Clustering

Draw G ∼ DP(α, G0) to get G = Σ_{k=1}^∞ πk δφk. Use this in a mixture model:

[Plate diagram: α, G0 → G → θi → xi, repeated N times]

SLIDE 75

The Dirichlet Process Model

The Stick-Breaking Process

  • Define an infinite sequence of Beta random variables:

    βk ∼ Beta(1, α), k = 1, 2, . . .

  • And then define an infinite sequence of mixing proportions as:

    π1 = β1,  πk = βk ∏_{l=1}^{k−1} (1 − βl), k = 2, 3, . . .

  • This can be viewed as breaking off portions of a stick:

    [Figure: a unit stick with pieces β1, β2(1 − β1), . . . broken off]

  • When π is drawn this way, we can write π ∼ GEM(α). (A short numerical sketch follows below.)
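A truncated sketch of this construction; the truncation level and α are illustrative choices, not from the slides:

```python
# Truncated stick-breaking: the first T weights of a GEM(alpha) draw.
import numpy as np

def gem_weights(alpha, truncation, rng):
    beta = rng.beta(1.0, alpha, size=truncation)
    # remaining[k] = prod_{l<k} (1 - beta_l): the stick left before break k
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - beta)[:-1]])
    return beta * remaining                    # pi_k = beta_k * remaining[k]

rng = np.random.default_rng(0)
pi = gem_weights(alpha=2.0, truncation=20, rng=rng)
print(np.round(pi, 3))
print("mass beyond the truncation:", 1.0 - pi.sum())
```

Smaller α pushes more mass onto the first few sticks; larger α spreads the mass out over more components.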

SLIDE 76

The Dirichlet Process Model

The Stick-Breaking Process

  • We now have an explicit formula for each πk: πk = βk ∏_{l=1}^{k−1} (1 − βl)

  • We can also easily see that Σ_{k=1}^∞ πk = 1 (with probability 1):

    1 − Σ_{k=1}^K πk = 1 − β1 − β2(1 − β1) − β3(1 − β1)(1 − β2) − · · ·
                     = (1 − β1)(1 − β2 − β3(1 − β2) − · · · )
                     = ∏_{k=1}^K (1 − βk) → 0 (wp1 as K → ∞)

  • So now G = Σ_{k=1}^∞ πk δφk has a clean definition as a random measure.

SLIDE 77

The Dirichlet Process Model

The Stick-Breaking Process

[Plate diagram: the stick-breaking representation: α → πk and G0 → φk, each in an infinite plate, composing G; then θi ∼ G and xi, repeated N times]

SLIDE 78

The Dirichlet Process Model

The Chinese Restaurant Process (CRP)

  • A random process in which n customers sit down in a Chinese restaurant with an infinite number of tables.
  • The first customer sits at the first table.
  • The mth subsequent customer sits at a table drawn from the following distribution:

    P(previously occupied table i | Fm−1) ∝ ni
    P(the next unoccupied table | Fm−1) ∝ α

    where ni is the number of customers currently at table i and Fm−1 denotes the state of the restaurant after m − 1 customers have been seated. (A short simulation follows below.)
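A compact simulation of this process; the customer count and α are illustrative:

```python
# Chinese Restaurant Process: seat customers one at a time; an occupied table
# is chosen with probability ∝ its size, a new table with probability ∝ alpha.
import numpy as np

def crp(n_customers, alpha, rng):
    tables = []                                   # tables[i] = size of table i
    for _ in range(n_customers):
        weights = np.array(tables + [alpha], dtype=float)
        choice = rng.choice(len(weights), p=weights / weights.sum())
        if choice == len(tables):
            tables.append(1)                      # open a new table
        else:
            tables[choice] += 1                   # join an occupied table
    return tables

rng = np.random.default_rng(0)
print(crp(100, alpha=2.0, rng=rng))               # a few big tables, many small
```

The rich-get-richer dynamics are visible directly: large tables keep attracting new customers.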

SLIDE 79

The Dirichlet Process Model

The CRP and Clustering

  • Data points are customers; tables are clusters.
  • The CRP defines a prior distribution on the partitioning of the data and on the number of tables.
  • This prior can be completed with:
  • a likelihood, e.g., associate a parameterized probability distribution with each table
  • a prior for the parameters: the first customer to sit at table k chooses the parameter vector for that table (φk) from the prior

[Figure: tables with parameters φ1, φ2, φ3, φ4]

  • So we now have a distribution, or can obtain one, for any quantity that we might care about in the clustering setting.

SLIDE 80

The Dirichlet Process Model

The CRP Prior, Gaussian Likelihood, Conjugate Prior

φk = (µk, Σk) ∼ N(a, b) ⊗ IW(α, β)

xi ∼ N(φk) for a data point i sitting at table k

SLIDE 81

The Dirichlet Process Model

The CRP and the DP

OK, so we’ve seen how the CRP relates to clustering. How does it relate to the DP?

SLIDE 82

The Dirichlet Process Model

The CRP and the DP

OK, so we’ve seen how the CRP relates to clustering. How does it relate to the DP? Important fact: The CRP is exchangeable. Remember De Finetti’s Theorem: If (x1, x2, . . .) are infinitely exchangeable, then for all n

p(x1, . . . , xn) = ∫ ∏_{i=1}^n p(xi|G) dP(G)

for some random variable G.

SLIDE 83

The Dirichlet Process Model

The CRP and the DP The Dirichlet Process is the De Finetti mixing distribution for the CRP.

SLIDE 84

The Dirichlet Process Model

The CRP and the DP The Dirichlet Process is the De Finetti mixing distribution for the CRP.

That means, when we integrate out G, we get the CRP.

p(θ1, . . . , θn) = ∫ ∏_{i=1}^n p(θi|G) dP(G)

[Plate diagram: α, G0 → G → θi → xi, repeated N times]

SLIDE 85

The Dirichlet Process Model

The CRP and the DP The Dirichlet Process is the De Finetti mixing distribution for the CRP. In English, this means that if the DP is the prior on G, then the CRP defines how points are assigned to clusters when we integrate out G.

SLIDE 86

The Dirichlet Process Model

The DP, CRP, and Stick-Breaking Process Summary

[Diagram: the DP plate model with G ∼ DP(α, G0); the Stick-Breaking Process gives just the weights of G; the CRP describes the partitions of θ when G is marginalized out]

SLIDE 87

Inference for the Dirichlet Process

Inference for the DP - Gibbs sampler

We introduce the indicators zi and use the CRP representation. Randomly initialize Z, φ. Repeat:

  1. Sample each zi from

     p(zi = k | Z−i, φ, X) ∝ nk p(xi|φk) for each occupied cluster k = 1, . . . , K
     p(zi = K + 1 | Z−i, φ, X) ∝ α f(xi|G0) for a new cluster

  2. Sample each φk based on Z and X, only for occupied clusters.

This is the sampler we saw earlier, but now with some theoretical basis. (A runnable sketch follows below.)
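Here is a minimal sketch of this sampler, again for a 1-D Gaussian model with known variance and G0 = N(mu0, tau2), so that the marginal f(x|G0) is available in closed form as N(x | mu0, sigma2 + tau2); all names and values are illustrative, not the lecture’s demo:

```python
# CRP-based Gibbs sampling for a DP mixture of 1-D Gaussians.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-4, 1, 60), rng.normal(4, 1, 60)])
alpha, sigma2, mu0, tau2 = 1.0, 1.0, 0.0, 25.0

z = np.zeros(len(X), dtype=int)                 # start with one cluster
mu = [X.mean()]

for it in range(100):
    for i in range(len(X)):
        # Counts with point i removed; emptied clusters get zero weight
        # (kept in the list for brevity, but never chosen again).
        counts = np.bincount(np.delete(z, i), minlength=len(mu)).astype(float)
        # Occupied clusters: n_k p(x_i|phi_k); new cluster: alpha f(x_i|G0).
        w = counts * norm.pdf(X[i], mu, np.sqrt(sigma2))
        w = np.append(w, alpha * norm.pdf(X[i], mu0, np.sqrt(sigma2 + tau2)))
        k = rng.choice(len(w), p=w / w.sum())
        if k == len(mu):                        # instantiate the new cluster's
            prec = 1 / tau2 + 1 / sigma2        # mean from its posterior | x_i
            mu.append(rng.normal((mu0 / tau2 + X[i] / sigma2) / prec,
                                 np.sqrt(1 / prec)))
        z[i] = k
    # Resample each occupied cluster mean from its Normal posterior.
    for k in range(len(mu)):
        xs = X[z == k]
        if len(xs) == 0:
            continue
        prec = 1 / tau2 + len(xs) / sigma2
        mu[k] = rng.normal((mu0 / tau2 + xs.sum() / sigma2) / prec,
                           np.sqrt(1 / prec))

print("occupied cluster sizes:", [c for c in np.bincount(z) if c > 0])
```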

SLIDE 88

Inference for the Dirichlet Process

MCMC in Action for the DP

What does this look like in action?


[Matlab demo]

SLIDE 89

Inference for the Dirichlet Process

Improvements to the MCMC algorithm

  • Collapse out the φk if the model is conjugate.
  • Split-merge algorithms.

SLIDE 90

Inference for the Dirichlet Process

Summary: Nonparametric Bayesian clustering

  • First specify the likelihood - application specific.
  • Next specify a prior on all parameters - the Dirichlet Process!
  • Exact posterior inference is intractable. Can use a Gibbs sampler for approximate inference. This is based on the CRP representation.

SLIDE 91

Inference for the Dirichlet Process

Key ideas to be discussed today

  • A parametric Bayesian approach to clustering
  • Defining the model
  • Markov Chain Monte Carlo (MCMC) inference
  • A nonparametric approach to clustering
  • Defining the model - The Dirichlet Process!
  • MCMC inference
  • Extensions

SLIDE 92

Hierarchical Dirichlet Process

Hierarchical Bayesian Models

Original Bayesian idea

View parameters as random variables - place a prior on them.

SLIDE 93

Hierarchical Dirichlet Process

Hierarchical Bayesian Models

Original Bayesian idea

View parameters as random variables - place a prior on them.

“Problem”?

Often the priors themselves need parameters.

SLIDE 94

Hierarchical Dirichlet Process

Hierarchical Bayesian Models

Original Bayesian idea

View parameters as random variables - place a prior on them.

“Problem”?

Often the priors themselves need parameters.

Solution

Place a prior on these parameters!

SLIDE 95

Hierarchical Dirichlet Process

Multiple Learning Problems

Example: xi ∼ N(θi, σ2) in m different groups.

[Diagram: m separate groups: θ1 → x1j (N1 points), θ2 → x2j (N2), . . . , θm → xmj (Nm)]

How to estimate θi for each group?

SLIDE 96

Hierarchical Dirichlet Process

Multiple Learning Problems

Example: xi ∼ N(θi, σ2) in m different groups. Treat the θi as random variables sampled from a common prior: θi ∼ N(θ0, σ0^2).

[Diagram: a shared θ0 above θ1, . . . , θm, each generating its own group’s data]

SLIDE 97

Hierarchical Dirichlet Process

Recall Plate Notation:

[Plate diagram: θ0 → θi → xij, with j = 1, . . . , Ni nested inside i = 1, . . . , m]

is equivalent to

[Diagram: the unrolled version with θ1, . . . , θm and their groups]

SLIDE 98

Hierarchical Dirichlet Process

Let’s Be Bold!

Independent estimation vs. hierarchical Bayesian:

[Diagram: independent per-group models θ1, . . . , θm side by side with the hierarchical model sharing θ0]

SLIDE 99

Hierarchical Dirichlet Process

Let’s Be Bold!

Independent estimation vs. hierarchical Bayesian:

[Diagram: as on the previous slide]

What do we do if we have DPs for multiple related datasets?

[Diagram: m unrelated DPs: Hi, αi → Gi → θij → xij for each dataset i]

SLIDE 100

Hierarchical Dirichlet Process

Let’s Be Bold!

Independent estimation vs. hierarchical Bayesian:

[Diagram: as before]

What do we do if we have DPs for multiple related datasets?

[Diagram: m unrelated DPs, and the proposed hierarchical version: H, α → G0, shared by Gi → θij → xij]

SLIDE 101

Hierarchical Dirichlet Process

Attempt 1

[Plate diagram: H → G0; α, G0 → Gi → θij → xij]

What kind of distribution do we use for G0? H? Suppose the θij are mean parameters for a Gaussian, where Gi ∼ DP(α, G0) and G0 is a Gaussian with unknown mean: G0 = N(θ0, σ0^2).

SLIDE 102

Hierarchical Dirichlet Process

Attempt 1

[Plate diagram: as before]

What kind of distribution do we use for G0? H? Suppose the θij are mean parameters for a Gaussian, where Gi ∼ DP(α, G0) and G0 is a Gaussian with unknown mean: G0 = N(θ0, σ0^2).

This does NOT work! Why?

SLIDE 103

Hierarchical Dirichlet Process

Attempt 1

[Plate diagram: as before]

The problem: If G0 is continuous, then with probability ONE, Gi and Gj will share ZERO atoms. ⇒ This means NO clustering!

[Figure: a continuous G0 and two discrete draws Gi, Gj with disjoint atoms]

SLIDE 104

Hierarchical Dirichlet Process

Attempt 2

[Plate diagram: as before]

So G0 must be discrete. What discrete prior can we use on G0?

SLIDE 105

Hierarchical Dirichlet Process

Attempt 2

[Plate diagram: as before]

So G0 must be discrete. What discrete prior can we use on G0? How about a parametric prior?

SLIDE 106

Hierarchical Dirichlet Process

Attempt 2

[Plate diagram: as before]

So G0 must be discrete. What discrete prior can we use on G0? How about a parametric prior? Gee, if only we had a nonparametric prior on discrete measures...

SLIDE 107

Hierarchical Dirichlet Process

The Hierarchical Dirichlet Process

Solution:

[Plate diagram: γ, H → G0; α, G0 → Gi → θij → xij]

G0 ∼ DP(γ, H)
Gi ∼ DP(α, G0)
θij | Gi ∼ Gi
xij | θij ∼ p(xij|θij)

(Teh, Jordan, Beal, Blei, 2004)

SLIDE 108

Hierarchical Dirichlet Process

G0 vs. Gi

Since G0 ∼ DP(γ, H) and Gi ∼ DP(α, G0), we know

G0 = Σ_{k=1}^∞ πk δφk,  Gi = Σ_{k=1}^∞ πik δφk

SLIDE 109

Hierarchical Dirichlet Process

G0 vs. Gi

Since G0 ∼ DP(γ, H) and Gi ∼ DP(α, G0), we know

G0 = Σ_{k=1}^∞ πk δφk,  Gi = Σ_{k=1}^∞ πik δφk

What is the relationship between πk and πik?

SLIDE 110

Hierarchical Dirichlet Process

Relationship between πk and πjk

Let (A1, . . . , Am) be a partition of Ω.

[Figure: a partition of Ω into A1, . . . , A5]

By properties of the DP,

(Gi(A1), . . . , Gi(Am)) ∼ Dir(αG0(A1), . . . , αG0(Am))

SLIDE 111

Hierarchical Dirichlet Process

Relationship between πk and πjk

Let (A1, . . . , Am) be a partition of Ω.

[Figure: a partition of Ω into A1, . . . , A5]

By properties of the DP,

(Gi(A1), . . . , Gi(Am)) ∼ Dir(αG0(A1), . . . , αG0(Am))

⇒ (Σ_{k∈K1} πik, . . . , Σ_{k∈Km} πik) ∼ Dir(α Σ_{k∈K1} πk, . . . , α Σ_{k∈Km} πk)

SLIDE 112

Hierarchical Dirichlet Process

Stick-Breaking Construction for the HDP

The two equivalent representations:

G0 ∼ DP(γ, H)  ⇔  π ∼ GEM(γ), φk ∼ H
Gi ∼ DP(α, G0)  ⇔  πi ∼ DP(α, π)
θij | Gi ∼ Gi  ⇔  zij ∼ πi
xij | θij ∼ p(xij|θij)  ⇔  xij ∼ p(xij|φzij)

[Plate diagrams: the measure-based representation and the equivalent stick-breaking representation with π, πi, zij, φk]

SLIDE 113

Hierarchical Dirichlet Process

Stick-Breaking Construction for the HDP

Remember:

(Σ_{k∈K1} πik, . . . , Σ_{k∈Km} πik) ∼ Dir(α Σ_{k∈K1} πk, . . . , α Σ_{k∈Km} πk)

Explicit relationship between π and πi:

βk ∼ Beta(1, γ),   πk = βk ∏_{j=1}^{k−1} (1 − βj)

βik ∼ Beta(α πk, α (1 − Σ_{j=1}^k πj)),   πik = βik ∏_{j=1}^{k−1} (1 − βij)
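A truncated sketch of this construction; γ, α, and the truncation level are illustrative:

```python
# HDP stick-breaking: global weights pi ~ GEM(gamma), then group weights
# pi_i via beta_ik ~ Beta(alpha*pi_k, alpha*(1 - sum_{j<=k} pi_j)).
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha, T = 2.0, 5.0, 15                  # T = truncation level

b = rng.beta(1.0, gamma, size=T)
pi = b * np.concatenate([[1.0], np.cumprod(1.0 - b)[:-1]])   # global weights

def group_weights(pi, alpha, rng):
    tails = np.maximum(1.0 - np.cumsum(pi), 1e-12)   # 1 - sum_{j<=k} pi_j
    bik = rng.beta(alpha * pi, alpha * tails)
    return bik * np.concatenate([[1.0], np.cumprod(1.0 - bik)[:-1]])

for i in range(3):                              # three related groups
    print(np.round(group_weights(pi, alpha, rng), 3))
```

Every group reuses the same atom indices, which is exactly what lets clusters be shared across groups.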

SLIDE 114

Hierarchical Dirichlet Process

The Effect of α

π ∼ GEM(γ), πi ∼ DP(α, π)

[Figure: a draw of π over 10 components with γ = 2, and three draws of πi each for α = 1, 5, 20, 100; as α grows, the πi concentrate around π]

SLIDE 115

Hierarchical Dirichlet Process

The Hierarchical Dirichlet Process

For the DP, we had:

  • Mathematical definition
  • Stick-breaking construction
  • Chinese restaurant process

[Diagram: the DP plate model with G ∼ DP(α, G0); the Stick-Breaking Process gives just the weights of G; the CRP describes the partitions of θ when G is marginalized out]

For the HDP, we have

  • Mathematical definition
  • Stick-breaking construction
  • ?

SLIDE 116

Hierarchical Dirichlet Process

The Chinese Restaurant Franchise (CRF) - Step 1

First integrate out the Gi.

[Plate diagrams: the HDP with the Gi, and the same model with the Gi integrated out]

SLIDE 117

Hierarchical Dirichlet Process

The Chinese Restaurant Franchise (CRF) - Step 1

What is the generative process when we integrate out Gi?

  1. Draw the global measure G0 = Σ_{k=1}^∞ πk δφk.
  2. Each group acts like a separate CRP.

[Figure: G0 with atoms φ1, φ2, φ3, φ4; three restaurants whose tables serve the shared dishes φk, seating customers θij]

SLIDE 118

Hierarchical Dirichlet Process

The Chinese Restaurant Franchise (CRF)

First integrate out the Gi, then integrate out G0

[Plate diagrams: the HDP, after integrating out the Gi, and after also integrating out G0]

SLIDE 119

Hierarchical Dirichlet Process

Chinese Restaurant Franchise (CRF)

[Figure: the Chinese Restaurant Franchise: a global menu of dishes φ1, . . . , φ4 drawn from G0 and shared across restaurants; each restaurant seats its customers θij at tables, and each table serves one shared dish]

SLIDE 120

Hierarchical Dirichlet Process

The Hierarchical Dirichlet Process

For the DP, we had:

  • Mathematical definition
  • Stick-breaking construction
  • Chinese restaurant process

[Diagram: the DP plate model with G ∼ DP(α, G0); the Stick-Breaking Process gives just the weights of G; the CRP describes the partitions of θ when G is marginalized out]

For the HDP, we have

  • Mathematical definition
  • Stick-breaking construction
  • Chinese restaurant franchise process

SLIDE 121

Hierarchical Dirichlet Process

Inference

Same classes of algorithms used for the DP:

  • MCMC
  • CRF representation
  • Stick-breaking representation
  • Variational

We will not go into these.

SLIDE 122

Hierarchical Dirichlet Process

Application of the HDP - Infinite Hidden Markov Model

Finite Hidden Markov Models (HMMs):

  • m states s1, . . . , sm
  • si has parameter φi with emission distribution y ∼ p(y|φi)
  • an m × m transition matrix:

         s1    s2   · · ·  sm
    s1   π11   π12  · · ·  π1m
    s2   π21   π22  · · ·  π2m
    ...
    sm   πm1   πm2  · · ·  πmm

How do we let m → ∞?

SLIDE 123

Hierarchical Dirichlet Process

Application of the HDP - Infinite Hidden Markov Model

How do we let m → ∞? Think a bit outside the traditional clustering context: let each state si correspond to a group.

π | γ ∼ GEM(γ)
πi | α, π ∼ DP(α, π)
φk | H ∼ H
xt | xt−1, (πi)_{i=1}^∞ ∼ π_{xt−1}
yt | xt, (φk)_{k=1}^∞ ∼ p(yt|φ_{xt})
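A truncated sketch of sampling a trajectory from this prior; all values are illustrative, the infinite state space is cut off at T, and each DP(α, π) row is approximated by its finite Dirichlet weak limit:

```python
# Infinite HMM prior, truncated: global state weights pi ~ GEM(gamma);
# each transition row pi_i is approximated by a Dirichlet centered on pi.
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha, T = 2.0, 5.0, 20

b = rng.beta(1.0, gamma, size=T)
pi = b * np.concatenate([[1.0], np.cumprod(1.0 - b)[:-1]])
pi /= pi.sum()                                   # renormalize after truncation

trans = rng.dirichlet(alpha * pi + 1e-10, size=T)  # row i ≈ DP(alpha, pi)
phi = rng.normal(0.0, 5.0, size=T)                 # emission means from H

x = [rng.choice(T, p=pi)]
for t in range(1, 50):
    x.append(rng.choice(T, p=trans[x[-1]]))
y = rng.normal(phi[np.array(x)], 1.0)              # Gaussian emissions
print("states visited:", sorted(set(x)))
```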

SLIDE 124

Questions?

Great set of references for the Machine Learning community:

http://npbayes.wikidot.com/references

Includes both the “classics” as well as modern applications.
