Nonparametric Bayesian Models
-- Learning and Reasoning in Open Possible Worlds --

Eric Xing (epxing@cs.cmu.edu)
Machine Learning Dept. / Language Technology Inst. / Computer Science Dept.
Carnegie Mellon University

VLPR09, Beijing, China, 8/6/2009

Outline

Motivation and challenge

Dirichlet Process and Infinite Mixture
  • Formulation
  • Approximate inference algorithms
  • Example: population clustering

Hierarchical Dirichlet Process and Multi-Task Clustering
  • Formulation
  • Transformed DP and HDP
  • Kernel stick-breaking process
  • Application: joint image segmentation

Dynamic Dirichlet Process
  • Hidden Markov DP
  • Temporal DPM
  • Application: evolutionary clustering of documents

Summary


Clustering


Image Segmentation

How to segment images?
  • Manual segmentation (very expensive)
  • Algorithmic segmentation: K-means, statistical mixture models, spectral clustering

Problems with most existing algorithms:
  • They ignore spatial information
  • They perform the segmentation one image at a time
  • They need the number of segments specified a priori


Discover Object Categories

Discover what objects are present in a collection of images, in an unsupervised way
Find those same objects in novel images
Determine which local image features correspond to which objects, i.e., segment the image


Learn and Recognize Natural Scene Categories


Object Recognition and Tracking

[Figure: objects tracked across frames t=1, 2, 3, each with an evolving attribute vector, e.g. (1.8, 7.4, 2.3) → (1.9, 9.0, 2.1) → (1.9, 6.1, 2.2) and (0.9, 5.8, 3.1) → (0.7, 5.1, 3.2) → (0.6, 5.9, 3.2)]

The Evolution of Science

[Figure: PNAS papers, research topics, and research circles (CS, Bio, Phy) evolving from 1900 to 2000, and beyond?]


A Classical Approach

Clustering as mixture modeling, then "model selection"


Partially Observed, Open and Evolving Possible Worlds

  • Unbounded # of objects/trajectories
  • Changing attributes
  • Birth/death, merge/split
  • Relational ambiguity

The parametric paradigm is finite and structurally unambiguous: a set of entities $\Xi^*_t$ evolves to $\Xi^*_{t+1}$ under a motion model $p\big(\{\phi_k^{t+1}\} \mid \{\phi_k^t\}\big)$; the entity space maps to the observation space through a sensor model $p\big(x \mid \{\phi_k\}\big)$; an event model $p\big(\{\phi_k\}\big)$ initializes the entities, giving an overall $p\big(\{\phi_k^{1:T}\}\big)$.

How to open it up?


Model Selection vs. Posterior Inference

Model selection, with $M \equiv \{\theta, K\}$:
  • "intelligent" guess: ???
  • cross validation: data-hungry
  • information theoretic (AIC, TIC, MDL): $\arg\min_K \mathrm{KL}\big(f(\cdot) \,\big\|\, g(\cdot \mid \hat{\theta}_{ML}, K)\big)$
  • Bayes factor: $p(M \mid D) \propto p(D \mid M)\, p(M)$; need to compute the data likelihood

Parsimony: Occam's Razor.

Posterior inference:
  • we want to handle uncertainty of model complexity explicitly
  • we favor a distribution that does not constrain M in a "closed" space!


Two "Recent" Developments

First order probabilistic languages (FOPLs)

Examples: PRM, BLOG … Lift graphical models to "open" world (#rv, relation, index, lifespan …) Focus on complete, consistent, and operating rules to instantiate possible

worlds, and formal language of expressing such rules

Operational way of defining distributions over possible worlds, via sampling

methods

Bayesian Nonparametrics

Examples: Dirichlet processes, stick-breaking processes … From finite, to infinite mixture, to more complex constructions (hierarchies,

spatial/temporal sequences, …)

Focus on the laws and behaviors of both the generative formalisms and

resulting distributions

Often offer explicit expression of distributions, and expose the structure of

the distributions --- motivate various approximate schemes


Outline

Motivation and challenge

Dirichlet Process and Infinite Mixture
  • Formulation
  • Approximate inference algorithms
  • Example: population clustering

Hierarchical Dirichlet Process and Multi-Task Clustering
  • Formulation
  • Transformed DP and HDP
  • Kernel stick-breaking process
  • Application: joint image segmentation

Dynamic Dirichlet Process
  • Hidden Markov DP
  • Temporal DPM
  • Application: evolutionary clustering of documents

Summary

Clustering

How to label them? How many clusters???


Random Partition of Probability Space

[Figure: the probability space partitioned into regions, region k carrying an atom-mass pair $\{\phi_k, \pi_k\}$, k = 1, …, 6]

centroid := φ; image element := (x, θ); (event, p_event)

Stick-breaking Process

$\beta_k \sim \mathrm{Beta}(1, \alpha), \qquad \pi_k = \beta_k \prod_{j=1}^{k-1} (1 - \beta_j), \qquad G(\theta) = \sum_{k=1}^{\infty} \pi_k\, \delta_{\theta_k}(\theta), \qquad \theta_k \sim G_0$

Each atom has a location $\theta_k$ (drawn from the base measure $G_0$) and a mass $\pi_k$.

[Figure: breaking a unit stick: take 0.4, leaving 0.6; break off 0.5 of the remainder, giving mass 0.3 and leaving 0.3; break off 0.8 of that, giving 0.24; …]
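To make the construction concrete, here is a minimal numerical sketch (not from the slides): a truncated stick-breaking sampler, with the truncation level and the N(0, 1) base measure chosen purely for illustration.

```python
import numpy as np

def stick_breaking(alpha, G0_sampler, truncation=100, rng=None):
    """Draw a truncated sample G ~ DP(alpha, G0) via stick-breaking.

    Returns atom locations theta_k and masses pi_k.
    """
    rng = np.random.default_rng(rng)
    betas = rng.beta(1.0, alpha, size=truncation)            # beta_k ~ Beta(1, alpha)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    pis = betas * remaining                                  # pi_k = beta_k * prod_{j<k}(1 - beta_j)
    thetas = np.array([G0_sampler(rng) for _ in range(truncation)])
    return thetas, pis

# Example: base measure G0 = N(0, 1); a small alpha concentrates mass on few atoms
thetas, pis = stick_breaking(alpha=1.0, G0_sampler=lambda r: r.normal(0.0, 1.0), rng=0)
print(pis[:5], pis.sum())   # first few masses; the total approaches 1 as truncation grows
```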


DP – a Pólya urn Process

  • Self-reinforcing property
  • Exchangeable partition of samples

$G \sim \mathrm{DP}(\alpha, G_0)$

Marginally, integrating out G (the Pólya urn scheme):

$\phi_i \mid \phi_1, \ldots, \phi_{i-1} \;\sim\; \sum_{k=1}^{K} \frac{n_k}{i - 1 + \alpha}\, \delta_{\phi_k^*} \;+\; \frac{\alpha}{i - 1 + \alpha}\, G_0$

[Figure: an urn holding 5 balls, 2 of one color and 3 of another; the next draw repeats the colors with probabilities $\frac{2}{5+\alpha}$ and $\frac{3}{5+\alpha}$, and introduces a new color with probability $\frac{\alpha}{5+\alpha}$]

Clustering and DP Mixture

We can associate mixture components with colors in the Pólya urn model and thereby define a clustering of the data.

[Figure: six data points assigned to colors, with next-draw probabilities $\frac{2}{5+\alpha}$, $\frac{3}{5+\alpha}$, and $\frac{\alpha}{5+\alpha}$ for a new color]


Chinese Restaurant Process

$P(c_i = k \mid c_1, \ldots, c_{i-1}) = \begin{cases} \dfrac{m_k}{i - 1 + \alpha} & \text{for an occupied table } k \text{ with } m_k \text{ customers} \\[6pt] \dfrac{\alpha}{i - 1 + \alpha} & \text{for a new table} \end{cases}$

[Figure: successive seatings, each table k serving a dish $\theta_k$: the second customer joins table 1 with probability $\frac{1}{1+\alpha}$ or opens a new table with probability $\frac{\alpha}{1+\alpha}$; the third chooses with probabilities $\frac{1}{2+\alpha}, \frac{1}{2+\alpha}, \frac{\alpha}{2+\alpha}$; the fourth with $\frac{1}{3+\alpha}, \frac{2}{3+\alpha}, \frac{\alpha}{3+\alpha}$; …]
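A small simulation of this seating scheme (a standard sketch; the choices n = 10 and α = 1 are arbitrary):

```python
import numpy as np

def crp_assignments(n, alpha, rng=None):
    """Seat n customers by the Chinese restaurant process; returns table labels."""
    rng = np.random.default_rng(rng)
    labels, counts = [], []                      # counts[k] = m_k, customers at table k
    for i in range(n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= i + alpha                       # existing: m_k/(i+alpha); new: alpha/(i+alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)                     # open a new table
        else:
            counts[k] += 1
        labels.append(k)
    return labels

print(crp_assignments(10, alpha=1.0, rng=0))
```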


Dirichlet Process

A CDF G on a space of random partitions follows a Dirichlet process if, for any measurable finite partition (φ1, φ2, …, φm):

(G(φ1), G(φ2), …, G(φm)) ~ Dirichlet(αG0(φ1), …, αG0(φm))

where G0 is the base measure and α is the scale parameter.

[Figure: the space partitioned into cells φ1, …, φ6]

Thus a Dirichlet process G defines a distribution over distributions: a draw from a DP is itself a distribution.
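As a sanity check on this definition, one can verify numerically that the expected cell masses of G match the base measure on any finite partition (the Dirichlet marginals above have mean $G_0(\phi_k)$). This sketch reuses the stick_breaking sampler from earlier, checks only the mean, and uses an illustrative 3-cell partition of the real line:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 2.0
cells = [(-np.inf, -1.0), (-1.0, 1.0), (1.0, np.inf)]   # a 3-cell partition of R
masses = []
for _ in range(500):
    thetas, pis = stick_breaking(alpha, lambda r: r.normal(), truncation=500, rng=rng)
    masses.append([pis[(thetas > lo) & (thetas <= hi)].sum() for lo, hi in cells])
emp = np.mean(masses, axis=0)
G0 = np.diff([0.0, 0.1587, 0.8413, 1.0])                # N(0,1) cell probabilities
print(emp, G0)   # E[G(cell)] should be close to G0(cell)
```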


Graphical Model Representations of DP

[Graphical models: the Pólya urn construction ($G_0, \alpha \to G$; $\theta_i \sim G$; $x_i$, i = 1…N) and the stick-breaking construction ($\alpha \to \pi$; $y_i \sim \pi$; $\theta_k \sim G_0$; $x_i$, i = 1…N)]

Example: DP-haplotyper [Xing et al., 2004]

Clustering human populations.

Inference: Markov chain Monte Carlo (MCMC)
  • Gibbs sampling
  • Metropolis-Hastings

[Graphical model: a DP prior (G0, α → G) over infinite mixture components A_k, the population haplotypes; a likelihood model (θ) for individual haplotypes H_n1, H_n2 and genotypes G_n, n = 1…N]


Inheritance and Observation Models

Ancestral pool → haplotypes → genotype: $A_{C_i^e} \to H_i^e$ and $\{H_i^1, H_i^2\} \to G_i$.

Single-locus mutation model:

$P_H(h_t \mid a_t) = \begin{cases} \theta & \text{for } h_t = a_t \\[4pt] \dfrac{1 - \theta}{|B| - 1} & \text{for } h_t \neq a_t \end{cases}$

Noisy observation model:

$P_G(g_t \mid h_t^1, h_t^2): \quad g_t = h_t^1 \oplus h_t^2 \;\text{ with probability } \lambda.$

MCMC for Haplotype Inference

Gibbs sampling for exploring the posterior distribution under the proposed model:
  • Integrate out the parameters such as θ or λ, and sample $c_i^e$, $a_k$, and $h_i^e$.
  • Gibbs sampling algorithm: draw samples of each random variable in turn, given the values of all the remaining variables.

$p\big(c_i^e = k \mid \mathbf{c}_{[-ie]}, \mathbf{h}, \mathbf{a}\big) \;\propto\; p\big(c_i^e = k \mid \mathbf{c}_{[-ie]}\big)\; p\big(h_i^e \mid a_k, c_i^e = k\big)$

Posterior ∝ Prior (Pólya urn) × Likelihood


MCMC for Haplotype Inference

1. Sample $c_i^{e(j)}$ from its conditional (the Pólya-urn prior × likelihood above)
2. Sample $a_k$ from its conditional
3. Sample $h_i^{e(j)}$ from its conditional

  • For the DP scale parameter α: a vague inverse-Gamma prior
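A schematic of step 1 under simplifying assumptions (ancestor haplotypes treated as known, and the new-cluster term approximated with a single candidate ancestor drawn from a uniform base measure); the actual sampler in Xing et al. (2004) integrates over more quantities, so treat this only as an illustration of "Pólya-urn prior × mutation likelihood":

```python
import numpy as np

def sample_ancestor_indicator(h_i, ancestors, counts, alpha, theta, B, rng):
    """One collapsed-Gibbs step for a haplotype's ancestor indicator c_i (a sketch)."""
    def mutation_lik(h, a):
        # single-locus mutation model: theta on a match, (1-theta)/(|B|-1) otherwise
        return np.where(h == a, theta, (1.0 - theta) / (B - 1)).prod()

    K = len(ancestors)
    w = np.empty(K + 1)
    for k in range(K):
        w[k] = counts[k] * mutation_lik(h_i, ancestors[k])  # urn prior x likelihood
    a_new = rng.integers(0, B, size=h_i.size)               # candidate new ancestor
    w[K] = alpha * mutation_lik(h_i, a_new)
    k = rng.choice(K + 1, p=w / w.sum())
    return k, (a_new if k == K else ancestors[k])
```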

Convergence of Ancestral Inference


DP vs. Finite Mixture via EM

[Figure: individual error on five data sets, roughly in the range 0.05-0.45, comparing the DP infinite mixture with a finite mixture fit by EM]

Variational Inference [Blei & Jordan 2005; Kurihara et al. 2007]

The Gibbs sampling solution is not efficient enough to scale up to large problems. A truncated stick-breaking approximation can be formulated in the space of explicit, non-exchangeable cluster labels, and variational inference can then be applied to the resulting finite-dimensional distribution.

Variational inference: for a complicated P(X1, X2, …, Xn), approximate it with a Q(X):
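The formula on this slide did not survive extraction; a standard statement of the factorized (mean-field) approximation it refers to is:

$$Q(X) = \prod_{i} q_i(X_i), \qquad Q^{*} = \arg\min_{Q} \mathrm{KL}\big(Q(X) \,\big\|\, P(X_1, \ldots, X_n)\big)$$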


Approximations to DP

Truncated stick-breaking (TSB) representation: the joint distribution can be expressed with the stick-breaking construction truncated at a finite level K.

Finite symmetric Dirichlet (FSD) approximation: the joint distribution can be expressed with a K-dimensional symmetric Dirichlet prior on the mixing weights.
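The two joint distributions referred to here are standard (cf. Kurihara et al., 2007); written out for data x, labels z, and truncation level K:

$$p_{\mathrm{TSB}}(x, z, v, \theta) = \prod_{n} p(x_n \mid \theta_{z_n})\, \pi_{z_n}(v) \prod_{k=1}^{K-1} \mathrm{Beta}(v_k \mid 1, \alpha) \prod_{k=1}^{K} G_0(\theta_k), \qquad \pi_k(v) = v_k \prod_{l<k}(1 - v_l),\; v_K = 1$$

$$p_{\mathrm{FSD}}(x, z, \pi, \theta) = \prod_{n} p(x_n \mid \theta_{z_n})\, \pi_{z_n}\; \mathrm{Dir}\big(\pi \mid \tfrac{\alpha}{K}, \ldots, \tfrac{\alpha}{K}\big) \prod_{k=1}^{K} G_0(\theta_k)$$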

TDP vs. TSB

TDP is size-biased: cluster labels are NOT interchangeable under TDP, but are interchangeable under TSB.


Marginalization

In the variational Bayesian approximation we assume a factorized form for the posterior distribution. However, this is not a good assumption, since changes in π have a considerable impact on z.

If we can integrate out π, the joint distribution is given in collapsed form, once for the TSB representation and once for the FSD representation.

VB inference

We can then apply VB inference to each of the four approximations. The approximate posterior distributions for TSB and FSD take the factorized form above; depending on whether we marginalize, v and π may be integrated out.


Experimental results


Outline

Motivation and challenge

Dirichlet Process and Infinite Mixture
  • Formulation
  • Approximate inference algorithms
  • Example: population clustering

Hierarchical Dirichlet Process and Multi-Task Clustering
  • Formulation
  • Transformed DP and HDP
  • Kernel stick-breaking process
  • Application: joint image segmentation

Dynamic Dirichlet Process
  • Hidden Markov DP
  • Temporal DPM
  • Application: evolutionary clustering of documents

Summary

Solving Multiple Clustering Problems

[Figure: one draw $G \sim \mathrm{DP}(\alpha, G_0)$ per corpus of Nature articles, PNAS articles, Science articles, …, with question marks over how the draws relate]

Solving Multiple Clustering Problems

Solve separately (one $G \sim \mathrm{DP}(\alpha, G_0)$ per corpus):
  • Fails to capture correlation
  • Fails to cross-reinforce shared information (i.e., topic-specific lexicon)
  • Data fragmentation

Solve together (a single $G \sim \mathrm{DP}(\alpha, G_0)$ shared by the Nature, PNAS, Science, … articles):
  • Then what is the difference between all these journals?

Hierarchical Dirichlet Process [Teh et al., 2005; Xing et al., 2005]

Two-level Pólya urn scheme. At the i-th step in the j-th "group":

  • Choose an existing local dish $\theta_{jk}$ with prob. $\dfrac{m_{jk}}{\sum_k m_{jk} + \alpha}$, or go to the upper-level DP (the "oracle") with prob. $\dfrac{\alpha}{\sum_k m_{jk} + \alpha}$.
  • At the oracle: choose an existing global dish $\theta_k$ with prob. $\dfrac{n_k}{\sum_k n_k + \gamma}$, or draw a new sample with prob. $\dfrac{\gamma}{\sum_k n_k + \gamma}$.

[Figure: several group-level urns over dishes θ1, θ2, …, all backed by a single oracle urn]

  • Draws from the stock (upper-level) urn define a Dirichlet process DP(γ, H):

$\theta_i \mid \theta_{1:i-1} \;\sim\; \sum_{k=1}^{K} \frac{n_k}{i - 1 + \gamma}\, \delta_{\phi_k^*}(\theta_i) \;+\; \frac{\gamma}{i - 1 + \gamma}\, H(\theta_i)$

  • Conditioning on DP(γ, H), the m-th draw from the j-th bottom-level urn also forms a Dirichlet measure:

$\theta_{j,m} \mid \theta_{j,1:m-1} \;\sim\; \sum_{k} \frac{m_{jk}}{m - 1 + \alpha}\, \delta_{\theta_{jk}^*}(\theta_{j,m}) \;+\; \frac{\alpha}{m - 1 + \alpha} \left( \sum_{k} \frac{n_k}{n + \gamma}\, \delta_{\phi_k^*}(\theta_{j,m}) + \frac{\gamma}{n + \gamma}\, H(\theta_{j,m}) \right)$

[Teh et al., 2005; Xing et al., 2005]
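A compact simulation of this two-level scheme, read as the usual Chinese restaurant franchise (a sketch: dish identities are bare integers standing in for draws from H, and α, γ, and the group sizes are arbitrary):

```python
import numpy as np

def hdp_urn(group_sizes, alpha, gamma, rng=None):
    """Two-level Polya urn: returns, per group, the global dish of every draw."""
    rng = np.random.default_rng(rng)
    n = []                                     # n[k]: times global dish k served by oracle
    assignments = []
    for size in group_sizes:
        m, dish_of_table, labels = [], [], []  # m[t]: customers at local table t
        for i in range(size):
            probs = np.array(m + [alpha], dtype=float) / (i + alpha)
            t = rng.choice(len(probs), p=probs)
            if t == len(m):                    # go to the oracle
                op = np.array(n + [gamma], dtype=float) / (sum(n) + gamma)
                k = rng.choice(len(op), p=op)
                if k == len(n):
                    n.append(1)                # brand-new global dish
                else:
                    n[k] += 1
                m.append(1)
                dish_of_table.append(k)
            else:
                m[t] += 1
            labels.append(dish_of_table[t])
        assignments.append(labels)
    return assignments

print(hdp_urn([20, 20, 20], alpha=1.0, gamma=1.0, rng=0))
```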


Recall: Graphical Model Representations of DP

[Graphical models, as before: the Pólya urn construction and the stick-breaking construction]

Hierarchical DP Mixture

[Graphical models: urn view, $H, \gamma \to G_0$, $\alpha \to G_j$, $\theta_i \sim G_j$, $x_i$, over J groups; stick-breaking view, $\gamma \to \beta$, $\alpha \to \pi_j$, $\theta_k \sim H$, $y_i \sim \pi_j$, $x_i$]

$\beta \sim \mathrm{Stick}(\gamma), \qquad G_0 = \sum_{k=1}^{\infty} \beta_k\, \delta_{\theta_k}, \qquad \theta_k \sim H$

$\pi_j \sim \mathrm{Stick}(\alpha, \beta), \qquad G_j = \sum_{k=1}^{\infty} \pi_{jk}\, \delta_{\theta_k}$

where $\mathrm{Stick}(\alpha, \beta)$ means: $\;\pi'_{jk} \sim \mathrm{Beta}\Big(\alpha\beta_k,\; \alpha\big(1 - \textstyle\sum_{l=1}^{k}\beta_l\big)\Big), \qquad \pi_{jk} = \pi'_{jk} \prod_{l=1}^{k-1}\big(1 - \pi'_{jl}\big).$
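The Stick(α, β) line transcribes directly into code; a truncated sketch with an invented global weight vector β (the β values and α are illustrative only):

```python
import numpy as np

def group_weights(beta, alpha, rng=None):
    """Sample pi_j ~ Stick(alpha, beta): group-level reweighting of the shared
    global weights beta, truncated to len(beta) components."""
    rng = np.random.default_rng(rng)
    beta = np.asarray(beta, dtype=float)
    tail = 1.0 - np.cumsum(beta)                        # 1 - sum_{l<=k} beta_l
    v = rng.beta(alpha * beta[:-1], alpha * tail[:-1])  # pi'_jk ~ Beta(ab_k, a(1-sum))
    v = np.append(v, 1.0)                               # close the stick at truncation
    return v * np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])

beta = np.array([0.5, 0.3, 0.15, 0.05])                 # toy global weights
print(group_weights(beta, alpha=5.0, rng=0))            # group weights centered on beta
```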


Topic Models for Images

[Graphical model: Latent Dirichlet Allocation (LDA) adapted to images, e.g. a "beach" scene: per-image mixing proportions π, class c, topic indicators z, visual words w, over N words and D images]

Image Representation

Each image is a set of segments carrying a representation matrix $[\mathbf{r}_1, \ldots, \mathbf{r}_n]$ and an annotation vector $\mathbf{w} = [w_1, \ldots, w_{|V|}]$:

  • representation vectors $\mathbf{r}_d$: real-valued, one per image segment
  • annotation vector $\mathbf{w}$: binary, the same for each segment (e.g., cat, grass, tiger, water)

Infinite Topic Model for Image

  • A single image with k topics: an LDA, with a Dirichlet prior on the topic proportions π.
  • A single image with infinitely many topics (an "LDA ∞"): a DP, with a stick-breaking prior on π.
  • J images with infinitely many shared topics: an HDP, with β ~ Stick(γ) shared across images and a πj per image.

Problem with HDP

Every group (i.e., image) has exactly the same set of visual-vocabulary topics, albeit with different frequencies.

Transformed Dirichlet Process [Sudderth et al., 2005]

An extension of HDP in which the global mixture components undergo a set of random transformations ρjk before being reused in each group.

Synthetic Data Results

HDP uses a large set of global clusters to discretize the transformations underlying the data, and may generalize poorly when modeling visual scenes.


Analyzing Street Scenes


Kernel stick-breaking process [Dunson and Park, 2006]

For image analysis, we want to impose the belief that spatially proximate patches are more likely to be associated with the same cluster. We augment the stick-breaking representation of the DP with a kernel function that encodes this additional spatial prior.


KSBP for image analysis

Consider an image composed of N patches; the feature vectors $\{\mathbf{x}_n\}_{n=1}^{N}$ and the associated locations $\{\mathbf{r}_n\}_{n=1}^{N}$ can be modeled as follows:

$\mathbf{x}_n \overset{ind}{\sim} f(\phi_n), \qquad \phi_n \overset{ind}{\sim} G_{\mathbf{r}_n}$

$G_{\mathbf{r}} = \sum_{h=1}^{\infty} \pi_h(\mathbf{r};\, V_h, \Gamma_h, \psi_h)\, \delta_{\theta_h}$

$\pi_h(\mathbf{r};\, V_h, \Gamma_h, \psi_h) = V_h\, K(\mathbf{r}, \Gamma_h, \psi_h) \prod_{l=1}^{h-1} \big[\, 1 - V_l\, K(\mathbf{r}, \Gamma_l, \psi_l) \,\big]$

$V_h \overset{iid}{\sim} \mathrm{Beta}(a, b), \qquad \theta_h \overset{iid}{\sim} G_0, \qquad \Gamma_h \overset{iid}{\sim} H$
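A sketch of the location-dependent weights π_h(r): the slides leave the kernel K unspecified, so a Gaussian kernel exp(−ψ‖r − Γ‖²) is assumed here, and all parameter values are illustrative. Note that at a finite truncation the weights need not sum exactly to 1.

```python
import numpy as np

def ksbp_weights(r, V, Gamma, psi):
    """Location-dependent stick weights pi_h(r; V_h, Gamma_h, psi_h)."""
    r = np.asarray(r, dtype=float)
    K = np.exp(-psi * np.sum((Gamma - r) ** 2, axis=1))   # assumed Gaussian kernel
    VK = V * K
    # pi_h = V_h K(r, ...) * prod_{l<h} [1 - V_l K(r, ...)]
    return VK * np.concatenate([[1.0], np.cumprod(1.0 - VK)[:-1]])

rng = np.random.default_rng(0)
H_trunc = 50
V = rng.beta(1.0, 1.0, size=H_trunc)            # V_h ~ Beta(a, b), here a = b = 1
Gamma = rng.uniform(0, 1, size=(H_trunc, 2))    # atom locations, Gamma_h ~ Unif([0,1]^2)
psi = np.full(H_trunc, 40.0)                    # kernel bandwidths psi_h
print(ksbp_weights([0.5, 0.5], V, Gamma, psi))  # atoms near r receive larger sticks
```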

Multi-task Image Segmentation

Segmentation results [An et al., 2008]

Outline

Motivation and challenge

Dirichlet Process and Infinite Mixture
  • Formulation
  • Approximate inference algorithms
  • Example: population clustering

Hierarchical Dirichlet Process and Multi-Task Clustering
  • Formulation
  • Transformed DP and HDP
  • Kernel stick-breaking process
  • Application: joint image segmentation

Dynamic Dirichlet Process
  • Hidden Markov DP
  • Temporal DPM
  • Application: evolutionary clustering of documents

Summary

Object Recognition and Tracking

Each chain corresponds to the trajectory of a specific object.

[Graphical model: hidden state chains S1, S2, S3, …, SN for the scene and x1, x2, x3, …, xN with appearance models A; observed traces y_{1,1}, …, y_{1,N} for person 1 through y_{k,1}, …, y_{k,N} for person k; "the clipper" selects which person's chain generates the observations]

Hidden Markov Dirichlet Process (Xing and Sohn, Bayesian Analysis, 2007; Sohn and Xing, ISMB 2007)

Hidden Markov Dirichlet process mixtures:
  • Extension of the HMM to an infinite ancestral space
  • Infinite-dimensional transition matrix
  • Each row of the transition matrix is modeled with a DP:

$G \mid \gamma, H \sim \mathrm{DP}(\gamma, H), \qquad G_i \mid \alpha, G \sim \mathrm{DP}(\alpha, G)$

[Diagram: transitions $C_t \to C_{t+1}$ over an unbounded state set 1, 2, 3, …]
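Because G is discrete, each row $G_i \sim \mathrm{DP}(\alpha, G)$ is, conditioned on the weights β of G, simply a Dirichlet draw $\pi_i \sim \mathrm{Dirichlet}(\alpha\beta)$ over the shared atoms. A truncated sketch (all hyperparameter values illustrative):

```python
import numpy as np

def hmdp_transition_rows(n_states, alpha, gamma, rng=None):
    """Truncated sketch: shared top-level weights beta ~ Stick(gamma); each row
    pi_i ~ DP(alpha, beta) over a discrete base is Dirichlet(alpha * beta)."""
    rng = np.random.default_rng(rng)
    b = rng.beta(1.0, gamma, size=n_states)
    beta = b * np.concatenate([[1.0], np.cumprod(1.0 - b)[:-1]])
    beta /= beta.sum()                           # renormalize the truncated sticks
    rows = rng.dirichlet(alpha * beta, size=n_states)
    return beta, rows                            # rows[i]: transitions out of state i

beta, P = hmdp_transition_rows(20, alpha=5.0, gamma=2.0, rng=0)
print(P.shape, P.sum(axis=1)[:3])                # each row is a distribution over states
```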

HMDP as a Graphical Model

[Graphical model: β ~ Stick(γ) and infinite transition rows π (concentration α); hidden trajectory-segment indicators C1, C2, C3, …, CN (the hidden trajectories); observed traces y1, y2, y3, …, yN; emission parameters A, H]

Evolutionary Clustering

Adapts the number of mixture components over time:
  • Mixture components can die out
  • New mixture components can be born at any time
  • Retained mixture components' parameters evolve according to Markovian dynamics

[Figure: research papers and topics evolving from 1900 to 2000 across CS, Bio, and Phy]


The Chinese Restaurant Process

  • Customers correspond to data points
  • Tables correspond to clusters/mixture components
  • Dishes correspond to the parameters of the mixture components

[Figure: customers θ1, θ2, … seated at tables serving dishes φ1, φ2]

The Recurrent Chinese Restaurant Process

Temporal DPM [Ahmed and Xing 2008]

  • The restaurant operates in epochs
  • The restaurant is closed at the end of each epoch
  • The state of the restaurant at epoch t depends on its state at epoch t-1
  • Can be extended to higher-order dependencies


Generative Process

T=1: customers are seated as in an ordinary CRP:
  • Choose table j with probability ∝ N_{j,1}, and sample x_i ~ f(φ_{j,1})
  • Choose a new table K+1 with probability ∝ α, sample φ_{K+1,1} ~ G0, and sample x_i ~ f(φ_{K+1,1})

(φ_{3,1} denotes the dish eaten at table 3 in epoch 1, i.e., the parameters of cluster 3 at epoch 1.)

T=2: the restaurant reopens with the epoch-1 tables as a prior (N_{1,1}=2, N_{2,1}=3, N_{3,1}=1):
  • The first customer of epoch 2 chooses table k with probability N_{k,1}/(6+α), or a new table with probability α/(6+α).
  • When a table is reused, its dish evolves: sample φ_{1,2} ~ P(· | φ_{1,1}).
  • Subsequent customers count both the epoch-1 pseudo-counts and the epoch-2 seatings: e.g., after one customer has joined table 1, the probabilities become (2+1)/(6+1+α), 3/(6+1+α), 1/(6+1+α), and α/(6+1+α) for a new table. And so on.
  • At the end of epoch 2, a newly born cluster φ_{4,2} has appeared and cluster 3 has died out.

T=3: the process repeats, now with tables φ_{1,2}, φ_{2,2}, φ_{4,2} and counts N_{1,2}=2, N_{2,2}=2, N_{4,2}=1.
68 68

Temporal DPM

  • Can be extended to model higher-order dependencies
  • Can decay dependencies over time

The pseudo-count for table k at time t is

$\tilde{N}_{k,t} = \sum_{w=1}^{W} e^{-w/\lambda}\, N_{k,t-w}$

where W is the history size, λ the decay factor, and $N_{k,t-w}$ the number of customers sitting at table k in epoch t-w.


[Figure: at T=3, the pseudo-count for table 2 combines decayed counts from the two previous epochs, $\tilde{N}_{2,3} = e^{-1/\lambda} N_{2,2} + e^{-2/\lambda} N_{2,1}$]

TDPM Generative Power

  • W = T, λ = ∞: recovers the DPM
  • W = 0, any λ: independent DPMs at each epoch
  • In between (e.g., W = 4, λ = 0.4): the TDPM

[Figure: power-law curve of cluster sizes]


Experiments

Simulated data: chain dynamics modeled as a random walk with Gaussian emissions; 30 epochs simulated with 100 data points in each epoch.

Can the TDPM recover the ground-truth clustering?
  • Posterior inference run using Gibbs sampling [Ahmed and Xing 2008]
  • Compared against fixed-dimension dynamic models

TDPM Adaptability over Time


Results: NIPS 12

Building a simple dynamic topic model:
  • Chain dynamics as before
  • Emission model for document x_{k,t}: project φ_{k,t} onto the simplex and sample $x_{k,t} \mid c_{t,i} \sim \mathrm{Multinomial}\big(\cdot \mid \mathrm{Logistic}(\phi_{k,t})\big)$
  • Unlike LDA, here a document belongs to a single topic

We use this model to analyze the NIPS12 corpus: proceedings of the NIPS conference, 1987-1999.


The Big Picture

[Diagram: models arranged along two axes, model dimension and time: K-means / finite mixture (fixed K, static) → DPM (K = ∞, static); fixed-dimension dynamic clustering → TDPM (K = ∞, dynamic)]

Summary

A nonparametric Bayesian framework for pattern discovery:
  • Finite mixture models of latent patterns (e.g., image segments, objects)
  • Infinite mixtures of prototypes: an alternative to model selection
  • Hierarchical infinite mixtures
  • Infinite hidden Markov models
  • Temporal infinite mixture models

Applications in general data mining …

How to Model Semantics?

Q: What is it about?  A: Mainly MT, with syntax, some learning.

"A Hierarchical Phrase-Based Model for Statistical Machine Translation": We present a statistical phrase-based translation model that uses hierarchical phrases (phrases that contain sub-phrases). The model is formally a synchronous context-free grammar but is learned from a bitext without any syntactic information. Thus it can be seen as a shift to the formal machinery of syntax-based translation systems without any linguistic commitment. In our experiments using BLEU as a metric, the hierarchical phrase-based model achieves a relative improvement of 7.5% over Pharaoh, a state-of-the-art phrase-based system.

[Concept map: MT notions (source/target, alignment, BLEU score); syntax notions (parse tree, noun phrase, CFG, grammar); learning notions (likelihood, EM, argmax estimation of hidden parameters); topic models summarize a document by a mixing proportion (e.g., 0.6, 0.3, 0.1) over topics, each a unigram distribution over the vocabulary]

Admixture Models

  • Objects are bags of elements
  • Mixtures are distributions over elements
  • Each object has a mixing vector θ representing each mixture's contribution

An object is generated as follows (see the sketch below):
  • Pick a mixture component from θ
  • Pick an element from that component

[Figure: documents as bags of topic-tagged words (money1, bank1, loan1, river2, stream2, …), each document with its own mixing vector, e.g. (0.1, 0.1, 0.5, …), (0.1, 0.5, 0.1, …), (0.5, 0.1, 0.1, …)]
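A minimal sketch of this generative process on the slide's toy vocabulary; the two topic distributions are invented for illustration:

```python
import numpy as np

def generate_document(theta, topics, n_words, rng=None):
    """Generate one 'object' (document) from an admixture: for each word,
    pick a topic from theta, then a word from that topic's distribution."""
    rng = np.random.default_rng(rng)
    vocab = np.array(["money", "bank", "loan", "river", "stream"])
    words = []
    for _ in range(n_words):
        z = rng.choice(len(theta), p=theta)        # mixture component from theta
        w = rng.choice(len(vocab), p=topics[z])    # element from that component
        words.append(f"{vocab[w]}{z + 1}")         # tag each word with its topic, as on the slide
    return words

topics = np.array([[0.35, 0.35, 0.25, 0.03, 0.02],   # a "finance" topic (assumed numbers)
                   [0.02, 0.28, 0.05, 0.35, 0.30]])  # a "rivers" topic (assumed numbers)
print(generate_document(theta=[0.7, 0.3], topics=topics, n_words=12, rng=0))
```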

Topic Models = Admixture Models

Generating a document:
  • Draw θ from the prior
  • For each word n: draw $z_n \sim \mathrm{Multinomial}(\theta)$, then draw $w_n \sim \mathrm{Multinomial}(\beta_{z_n})$

Which prior to use?

[Graphical model: prior → θ → z → w, with topic parameters β; N_d words per document, D documents, K topics]

Variational Inference

Approximate the posterior $P(\gamma, z_{1:n} \mid \mathcal{D})$ with a factorized distribution

$q\big(\gamma, z_{1:n} \mid \mu^*, \Sigma^*, \phi^*_{1:n}\big) = q(\gamma \mid \mu^*, \Sigma^*) \prod_{n} q(z_n \mid \phi^*_n)$

Optimization problem: $\;(\mu^*, \Sigma^*, \phi^*_{1:n}) = \arg\min\, \mathrm{KL}(q \,\|\, p)$

  • Approximate the posterior
  • Approximate the integral
  • Solve for μ*, Σ*, φ1:n*