SLIDE 1

Clustering Relational Data using the Infinite Relational Model

Ana Daglis

Supervised by: Matthew Ludkin

September 4, 2015

SLIDE 2

Outline

1. Clustering
2. Model
3. Gibbs Sampling
   • Methodology
   • Results
4. Split-Merge Algorithm
   • Methodology
   • Results
5. Future Work

SLIDE 3

Clustering

Clustering

Cluster Analysis: given unlabelled data, we want algorithms that automatically group the datapoints into coherent subsets (clusters). Applications:

• recommendation engines (Netflix, iTunes, Quora, ...)
• image compression
• targeted marketing
• Google News

[Figures: block structure obtained; trace-plot of the number of blocks]

SLIDE 4

Model

Infinite Relational Model

The Infinite Relational Model (IRM) is a model in which each node is assigned to a cluster. The number of clusters is not known initially and is learned from the data as part of the statistical inference. The IRM is represented by the following parameters:

• z_i - the cluster containing node i, for i = 1, ..., n.
• φ_{i,j} - the probability of an edge between the i-th and j-th clusters.

SLIDE 5

Model

Assumptions

Given the adjacency matrix of the graph, X, as our data, we assume that X_{i,j} ∼ Bernoulli(φ_{z_i, z_j}). Since z and φ are not known, hierarchical (CRP) and Beta priors, respectively, are imposed:

• z ∼ CRP(A)
• φ_{i,j} ∼ Beta(a, b).
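
To make these assumptions concrete, here is a minimal Python sketch (my own, not from the slides) of the second half of the generative model: given a fixed assignment vector z, it draws φ from the Beta prior and then the adjacency matrix X entrywise from Bernoulli(φ_{z_i, z_j}). The function name and interface are hypothetical.

```python
import numpy as np

def sample_irm_graph(z, a=1.0, b=1.0, rng=None):
    """Draw phi ~ Beta(a, b) for every block pair and X_ij ~ Bernoulli(phi[z_i, z_j])."""
    rng = np.random.default_rng(rng)
    K = z.max() + 1
    phi = rng.beta(a, b, size=(K, K))          # edge probability between each pair of blocks
    X = rng.binomial(1, phi[np.ix_(z, z)])     # adjacency matrix: X[i, j] ~ Bernoulli(phi[z_i, z_j])
    return X, phi

# Example: 6 nodes in 2 blocks
z = np.array([0, 0, 0, 1, 1, 1])
X, phi = sample_irm_graph(z, rng=0)
```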

SLIDE 6

Model

Chinese Restaurant Process (CRP(A))

The Chinese restaurant process is a discrete process whose value at time n is a partition of {1, 2, ..., n}. At time n = 1, we have the trivial partition {{1}}. At time n + 1, element n + 1 either:

1. is added to an existing block with probability |b|/(n + A), where |b| is the size of the block, or
2. creates a completely new block with probability A/(n + A), as the sketch below simulates.
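
A small Python sketch of these two rules (my own; the function name is hypothetical), drawing one partition from CRP(A) by adding elements sequentially:

```python
import numpy as np

def sample_crp(n, A=1.0, rng=None):
    """Draw one partition of {1, ..., n} from the Chinese restaurant process CRP(A)."""
    rng = np.random.default_rng(rng)
    assignments = [0]                     # element 1 starts the first block
    block_sizes = [1]
    for m in range(1, n):                 # elements 2, ..., n; m elements placed so far
        probs = np.array(block_sizes + [A], dtype=float) / (m + A)
        k = rng.choice(len(probs), p=probs)   # existing block k, or a brand-new block
        if k == len(block_sizes):
            block_sizes.append(1)
        else:
            block_sizes[k] += 1
        assignments.append(int(k))
    return np.array(assignments)

# Example: one partition of 10 elements with A = 1
print(sample_crp(10, A=1, rng=0))
```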

SLIDE 7

Model

Chinese Restaurant Process (CRP(A))

[Figure: CRP illustration - the first element forms a block with probability 1]

SLIDE 8

Model

Chinese Restaurant Process (CRP(A))

[Figure: CRP illustration - element 2 joins the existing block with probability 1/(1+A) or starts a new block with probability A/(1+A)]

SLIDE 9

Model

Chinese Restaurant Process (CRP(A))

[Figure: CRP illustration - element 3 joins either existing block with probability 1/(2+A) each, or starts a new block with probability A/(2+A)]

SLIDE 10

Model

Chinese Restaurant Process (CRP(A))

[Figure: CRP illustration - element 4 joins the blocks of size 1 and 2 with probabilities 1/(3+A) and 2/(3+A), or starts a new block with probability A/(3+A)]

SLIDE 11

Gibbs Sampling Methodology

Gibbs Sampling

Want: a sample from a multivariate distribution θ = (θ_1, θ_2, ..., θ_d). Algorithm:

1. Initialize with θ^(0) = (θ_1^(0), θ_2^(0), ..., θ_d^(0)).
2. For i = 1, 2, ..., n:
   • Simulate θ_1^(i) from the conditional θ_1 | (θ_2^(i−1), ..., θ_d^(i−1)).
   • Simulate θ_2^(i) from the conditional θ_2 | (θ_1^(i), θ_3^(i−1), ..., θ_d^(i−1)).
   • ...
   • Simulate θ_d^(i) from the conditional θ_d | (θ_1^(i), θ_2^(i), ..., θ_{d−1}^(i)).
3. Discard the first k iterations and estimate the posterior distribution using (θ_1^(k+1), θ_2^(k+1), ..., θ_d^(k+1)), ..., (θ_1^(n), θ_2^(n), ..., θ_d^(n)).
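
To make the recipe concrete, a toy Python sketch (my own example, unrelated to the IRM) that Gibbs-samples a bivariate normal with correlation ρ, where both full conditionals are univariate normals:

```python
import numpy as np

rho, n_iter, burn_in = 0.8, 10_000, 1_000
rng = np.random.default_rng(0)
theta = np.zeros(2)
samples = []
for _ in range(n_iter):
    theta[0] = rng.normal(rho * theta[1], np.sqrt(1 - rho**2))   # theta_1 | theta_2
    theta[1] = rng.normal(rho * theta[0], np.sqrt(1 - rho**2))   # theta_2 | theta_1
    samples.append(theta.copy())
samples = np.array(samples)[burn_in:]                  # discard the first k = 1000 iterations
print(samples.mean(axis=0), np.corrcoef(samples.T)[0, 1])   # should be near (0, 0) and 0.8
```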

SLIDE 12

Gibbs Sampling Methodology

Gibbs Sampling

We use Gibbs sampling to infer the posterior distribution of z. The cluster assignments z_i are iteratively sampled from their conditional distribution, P(z_i = k | z_{-i}, X) ∝ P(X | z) P(z_i = k | z_{-i}), where z_{-i} denotes all cluster assignments except z_i.
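
A hedged sketch of what one such update could look like in Python, assuming φ is integrated out analytically using Beta-Bernoulli conjugacy, so that the P(X | z) below is the marginal likelihood (the slides do not spell this out, and all names are my own):

```python
import numpy as np
from scipy.special import betaln

def log_marginal_likelihood(X, z, a=1.0, b=1.0):
    """log P(X | z) with phi integrated out: one Beta-Bernoulli term per ordered block pair."""
    K = z.max() + 1
    logp = 0.0
    for k in range(K):
        for l in range(K):
            sub = X[np.ix_(z == k, z == l)]     # entries of X between block k and block l
            n1 = sub.sum()                      # observed edges
            n0 = sub.size - n1                  # observed non-edges
            logp += betaln(a + n1, b + n0) - betaln(a, b)
    return logp

def gibbs_update_node(X, z, i, A=1.0, a=1.0, b=1.0, rng=None):
    """Resample z[i] from P(z_i = k | z_-i, X), proportional to P(X | z) P(z_i = k | z_-i)."""
    rng = np.random.default_rng(rng)
    z = z.copy()
    others = np.delete(z, i)
    blocks, counts = np.unique(others, return_counts=True)
    candidates = list(blocks) + [others.max() + 1]        # existing blocks, plus one brand-new block
    log_post = []
    for k, prior in zip(candidates, list(counts) + [A]):  # CRP weights: |b| for existing, A for new
        z_prop = z.copy()
        z_prop[i] = k
        z_prop = np.unique(z_prop, return_inverse=True)[1]    # relabel blocks as 0..K-1
        log_post.append(np.log(prior) + log_marginal_likelihood(X, z_prop, a, b))
    log_post = np.array(log_post) - max(log_post)             # stabilise before exponentiating
    probs = np.exp(log_post) / np.exp(log_post).sum()
    z[i] = candidates[rng.choice(len(candidates), p=probs)]
    return np.unique(z, return_inverse=True)[1]
```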

SLIDE 13

Gibbs Sampling Methodology

Simulated Data

We applied the Gibbs sampling algorithm to a simulated network with the following parameters:

• 96 nodes split into 6 blocks
• φ_{i,i} = 0.85 for every block i
• φ_{i,j} = 0.05 for i ≠ j
• a = b = 1 for a uniform prior
• A = 1.

[Figure (a): simulated network]
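
For reference, a network with these parameters could be generated along the following lines (a sketch; the slide does not state the block sizes, so six equal blocks of 16 nodes are assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 96, 6
z = np.repeat(np.arange(K), n // K)      # assumed: six equal blocks of 16 nodes
phi = np.full((K, K), 0.05)              # between-block edge probability
np.fill_diagonal(phi, 0.85)              # within-block edge probability
X = rng.binomial(1, phi[np.ix_(z, z)])   # adjacency matrix of the simulated network
```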

SLIDE 14

Gibbs Sampling Results

Simulated Data

Same simulated network and parameters as on the previous slide.

[Figure (b): supplied network]

SLIDE 15

Gibbs Sampling Results

Block structure obtained

SLIDE 16

Gibbs Sampling Results

Trace-plot of the number of blocks

[Figure: trace-plot of the number of blocks (1-6) against iteration, up to 10,000 iterations]

SLIDE 17

Gibbs Sampling Results

Gibbs Sampling Summary

The algorithm fails to split the data into 6 clusters within 10,000 iterations and is stuck in a five-cluster configuration for a long time. The main problem with the Gibbs sampler is that it is slow to converge and often becomes trapped in a local mode (5 blocks in this case). A possible improvement is the split-merge algorithm, which updates a group of nodes simultaneously and avoids these problems.

SLIDE 18

Split-Merge Algorithm Methodology

Split-Merge Algorithm

Algorithm:

1. Select two distinct nodes, i and j, uniformly at random.
2. If i and j belong to the same cluster, split that cluster into two by assigning its elements to either of the two new clusters independently with equal probability.
3. If i and j belong to different clusters, merge those clusters.
4. Evaluate the Metropolis-Hastings acceptance probability. If accepted, the new cluster assignment becomes the next state of the algorithm; otherwise, the initial cluster assignment remains as the next state.

a(z*, z) = min{ 1, [q(z | z*) P(z*) L(X | z*)] / [q(z* | z) P(z) L(X | z)] },

where q is the proposal probability, P(z) the prior and L(X | z) the likelihood. A code sketch of one such step follows below.
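
A sketch of one step in Python, under the assumption that L(X | z) is the Beta-Bernoulli marginal likelihood with φ integrated out and P(z) the CRP prior; this is the simple random split described in step 2, not Jain and Neal's restricted-Gibbs variant, and all names are my own:

```python
import numpy as np
from scipy.special import gammaln, betaln

def log_crp_prior(z, A=1.0):
    """log P(z) under CRP(A): A^K * prod_blocks (|b| - 1)! / prod_{m=0}^{n-1} (m + A)."""
    _, sizes = np.unique(z, return_counts=True)
    return (len(sizes) * np.log(A) + gammaln(sizes).sum()
            - np.log(np.arange(len(z)) + A).sum())

def log_lik(X, z, a=1.0, b=1.0):
    """log L(X | z): Beta-Bernoulli marginal likelihood with phi integrated out."""
    B = (z[:, None] == np.arange(z.max() + 1)).astype(float)   # n x K block indicators
    n1 = B.T @ X @ B                                           # edge counts per block pair
    tot = B.sum(0)[:, None] * B.sum(0)[None, :]                # possible edges per block pair
    return (betaln(a + n1, b + tot - n1) - betaln(a, b)).sum()

def split_merge_step(X, z, A=1.0, a=1.0, b=1.0, rng=None):
    """One simple random-split / merge Metropolis-Hastings step."""
    rng = np.random.default_rng(rng)
    z = z.copy()
    i, j = rng.choice(len(z), size=2, replace=False)
    z_prop = z.copy()
    if z[i] == z[j]:
        # split: i keeps the old label, j gets a new one, the rest choose between them with prob 1/2
        members = np.where(z == z[i])[0]
        coins = rng.integers(0, 2, size=len(members))
        coins[members == i], coins[members == j] = 0, 1
        z_prop[members[coins == 1]] = z.max() + 1
        log_q_fwd, log_q_rev = (len(members) - 2) * np.log(0.5), 0.0   # reverse merge is deterministic
    else:
        # merge the two clusters; the reverse move is the corresponding split
        members = np.where((z == z[i]) | (z == z[j]))[0]
        z_prop[z == z[j]] = z[i]
        log_q_fwd, log_q_rev = 0.0, (len(members) - 2) * np.log(0.5)
    z_prop = np.unique(z_prop, return_inverse=True)[1]
    log_ratio = (log_q_rev - log_q_fwd
                 + log_crp_prior(z_prop, A) - log_crp_prior(z, A)
                 + log_lik(X, z_prop, a, b) - log_lik(X, z, a, b))
    if np.log(rng.uniform()) < min(0.0, log_ratio):
        return z_prop                                   # accept the proposed assignment
    return np.unique(z, return_inverse=True)[1]         # reject: keep the current assignment
```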

SLIDE 19

Split-Merge Algorithm Methodology

Split-Merge Algorithm

SLIDE 20

Split-Merge Algorithm Methodology

Split-Merge Algorithm

SLIDE 21

Split-Merge Algorithm Methodology

Split-Merge Algorithm

SLIDE 22

Split-Merge Algorithm Methodology

Split-Merge Algorithm

SLIDE 23

Split-Merge Algorithm Methodology

Split-Merge Algorithm

SLIDE 24

Split-Merge Algorithm Methodology

Split-Merge Algorithm

SLIDE 25

Split-Merge Algorithm Results

Gibbs Sampler + Split-Merge

We applied the Gibbs sampler together with the split-merge algorithm to the earlier network: for every nine full Gibbs sampling scans, one split-merge step was used. The algorithm appropriately splits the data into six clusters, has a short burn-in time and mixes well.
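
One possible way to organise that 9:1 schedule in code (a sketch, reusing the hypothetical gibbs_update_node and split_merge_step functions from the earlier sketches):

```python
import numpy as np

def run_sampler(X, n_iter=1000, A=1.0, a=1.0, b=1.0, seed=0):
    """Hypothetical driver: nine full Gibbs scans for every split-merge move.
    Assumes gibbs_update_node and split_merge_step from the earlier sketches are in scope."""
    rng = np.random.default_rng(seed)
    z = np.zeros(X.shape[0], dtype=int)       # start with every node in one cluster
    trace = []
    for t in range(n_iter):
        if (t + 1) % 10 == 0:                 # every tenth iteration: one split-merge move
            z = split_merge_step(X, z, A, a, b, rng)
        else:                                 # otherwise: a full Gibbs scan over all nodes
            for i in range(len(z)):
                z = gibbs_update_node(X, z, i, A, a, b, rng)
        trace.append(z.max() + 1)             # number of blocks, for the trace-plot
    return z, trace
```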

SLIDE 26

Split-Merge Algorithm Results

Block structure obtained

SLIDE 27

Split-Merge Algorithm Results

Trace-plot of the number of blocks

[Figure: trace-plot of the number of blocks (1-7) against iteration, up to 1,000 iterations]

SLIDE 28

Future Work

Future Work

• Assess the performance of the algorithms when the blocks significantly vary in size.
• Evaluate the complexities of the algorithms.
• Explore more advanced algorithms (such as the Restricted Gibbs Sampling Split-Merge).

SLIDE 29

Future Work

References

Schmidt, M. N. and Mørup, M. (2013). Non-parametric Bayesian modeling of complex networks. IEEE Signal Processing Magazine, 30:110–128.

Jain, S. and Neal, R. M. (2004). A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13:158–182.
