SLIDE 1

Clustering Relational Data using the Infinite Relational Model

Ana Daglis

Supervised by: Matthew Ludkin

September 4, 2015

SLIDE 2

Outline

1. Clustering
2. Model
3. Gibbs Sampling
   • Methodology
   • Results
4. Split-Merge Algorithm
   • Methodology
   • Results
5. Future Work

SLIDE 3

Clustering

Clustering

Cluster Analysis: given unlabelled data, we want algorithms that automatically group the datapoints into coherent subsets (clusters). Applications:

• recommendation engines (Netflix, iTunes, Quora, ...)
• image compression
• targeted marketing
• Google News

[Figures: block structure obtained; trace-plot of the number of blocks]

SLIDE 4

Model

Infinite Relational Model

The Infinite Relational Model (IRM) is a model in which each node is assigned to a cluster. The number of clusters is not known initially and is learned from the data as part of the statistical inference. The IRM is represented by the following parameters:

• z_i - the cluster containing node i, for i = 1, ..., n.
• φ_{i,j} - the probability of an edge between the i-th and j-th clusters.

SLIDE 5

Model

Assumptions

Given the adjacency matrix of the graph, X, as our data, we assume that X_{i,j} ∼ Bernoulli(φ_{z_i, z_j}). Since z and φ are not known, hierarchical (CRP) and Beta priors, respectively, are imposed:

• z ∼ CRP(A)
• φ_{i,j} ∼ Beta(a, b).
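
To make these assumptions concrete, here is a minimal Python sketch (my own, not from the slides) of the second half of the generative model: given a fixed assignment vector z, it draws φ from the Beta prior and then the adjacency matrix X entrywise from Bernoulli(φ_{z_i, z_j}). The function name and interface are hypothetical.

```python
import numpy as np

def sample_irm_graph(z, a=1.0, b=1.0, rng=None):
    """Draw phi ~ Beta(a, b) for every block pair and X_ij ~ Bernoulli(phi[z_i, z_j])."""
    rng = np.random.default_rng(rng)
    K = z.max() + 1
    phi = rng.beta(a, b, size=(K, K))          # edge probability between each pair of blocks
    X = rng.binomial(1, phi[np.ix_(z, z)])     # adjacency matrix: X[i, j] ~ Bernoulli(phi[z_i, z_j])
    return X, phi

# Example: 6 nodes in 2 blocks
z = np.array([0, 0, 0, 1, 1, 1])
X, phi = sample_irm_graph(z, rng=0)
```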

SLIDE 6

Model

Chinese Restaurant Process (CRP(A))

The Chinese restaurant process is a discrete process whose value at time n is a partition of {1, 2, ..., n}. At time n = 1, we have the trivial partition {{1}}. At time n + 1, element n + 1 either:

1. is added to an existing block with probability |b|/(n + A), where |b| is the size of the block, or
2. creates a completely new block with probability A/(n + A), as the sketch below simulates.
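
A small Python sketch of these two rules (my own; the function name is hypothetical), drawing one partition from CRP(A) by adding elements sequentially:

```python
import numpy as np

def sample_crp(n, A=1.0, rng=None):
    """Draw one partition of {1, ..., n} from the Chinese restaurant process CRP(A)."""
    rng = np.random.default_rng(rng)
    assignments = [0]                     # element 1 starts the first block
    block_sizes = [1]
    for m in range(1, n):                 # elements 2, ..., n; m elements placed so far
        probs = np.array(block_sizes + [A], dtype=float) / (m + A)
        k = rng.choice(len(probs), p=probs)   # existing block k, or a brand-new block
        if k == len(block_sizes):
            block_sizes.append(1)
        else:
            block_sizes[k] += 1
        assignments.append(int(k))
    return np.array(assignments)

# Example: one partition of 10 elements with A = 1
print(sample_crp(10, A=1, rng=0))
```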

SLIDE 7

Model

Chinese Restaurant Process (CRP(A))

[Figure: CRP illustration - the first element forms a block with probability 1]

SLIDE 8

Model

Chinese Restaurant Process (CRP(A))

[Figure: CRP illustration - element 2 joins the existing block with probability 1/(1+A) or starts a new block with probability A/(1+A)]

SLIDE 9

Model

Chinese Restaurant Process (CRP(A))

[Figure: CRP illustration - element 3 joins either existing block with probability 1/(2+A) each, or starts a new block with probability A/(2+A)]

SLIDE 10

Model

Chinese Restaurant Process (CRP(A))

[Figure: CRP illustration - element 4 joins the blocks of size 1 and 2 with probabilities 1/(3+A) and 2/(3+A), or starts a new block with probability A/(3+A)]

SLIDE 11

Gibbs Sampling Methodology

Gibbs Sampling

Want: a sample from a multivariate distribution θ = (θ_1, θ_2, ..., θ_d). Algorithm:

1. Initialize with θ^(0) = (θ_1^(0), θ_2^(0), ..., θ_d^(0)).
2. For i = 1, 2, ..., n:
   • Simulate θ_1^(i) from the conditional θ_1 | (θ_2^(i−1), ..., θ_d^(i−1)).
   • Simulate θ_2^(i) from the conditional θ_2 | (θ_1^(i), θ_3^(i−1), ..., θ_d^(i−1)).
   • ...
   • Simulate θ_d^(i) from the conditional θ_d | (θ_1^(i), θ_2^(i), ..., θ_{d−1}^(i)).
3. Discard the first k iterations and estimate the posterior distribution using (θ_1^(k+1), θ_2^(k+1), ..., θ_d^(k+1)), ..., (θ_1^(n), θ_2^(n), ..., θ_d^(n)).
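
To make the recipe concrete, a toy Python sketch (my own example, unrelated to the IRM) that Gibbs-samples a bivariate normal with correlation ρ, where both full conditionals are univariate normals:

```python
import numpy as np

rho, n_iter, burn_in = 0.8, 10_000, 1_000
rng = np.random.default_rng(0)
theta = np.zeros(2)
samples = []
for _ in range(n_iter):
    theta[0] = rng.normal(rho * theta[1], np.sqrt(1 - rho**2))   # theta_1 | theta_2
    theta[1] = rng.normal(rho * theta[0], np.sqrt(1 - rho**2))   # theta_2 | theta_1
    samples.append(theta.copy())
samples = np.array(samples)[burn_in:]                  # discard the first k = 1000 iterations
print(samples.mean(axis=0), np.corrcoef(samples.T)[0, 1])   # should be near (0, 0) and 0.8
```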

SLIDE 12

Gibbs Sampling Methodology

Gibbs Sampling

We use Gibbs sampling to infer the posterior distribution of z. The cluster assignments z_i are iteratively sampled from their conditional distribution, P(z_i = k | z_{-i}, X) ∝ P(X | z) P(z_i = k | z_{-i}), where z_{-i} denotes all cluster assignments except z_i.
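
A hedged sketch of what one such update could look like in Python, assuming φ is integrated out analytically using Beta-Bernoulli conjugacy, so that the P(X | z) below is the marginal likelihood (the slides do not spell this out, and all names are my own):

```python
import numpy as np
from scipy.special import betaln

def log_marginal_likelihood(X, z, a=1.0, b=1.0):
    """log P(X | z) with phi integrated out: one Beta-Bernoulli term per ordered block pair."""
    K = z.max() + 1
    logp = 0.0
    for k in range(K):
        for l in range(K):
            sub = X[np.ix_(z == k, z == l)]     # entries of X between block k and block l
            n1 = sub.sum()                      # observed edges
            n0 = sub.size - n1                  # observed non-edges
            logp += betaln(a + n1, b + n0) - betaln(a, b)
    return logp

def gibbs_update_node(X, z, i, A=1.0, a=1.0, b=1.0, rng=None):
    """Resample z[i] from P(z_i = k | z_-i, X), proportional to P(X | z) P(z_i = k | z_-i)."""
    rng = np.random.default_rng(rng)
    z = z.copy()
    others = np.delete(z, i)
    blocks, counts = np.unique(others, return_counts=True)
    candidates = list(blocks) + [others.max() + 1]        # existing blocks, plus one brand-new block
    log_post = []
    for k, prior in zip(candidates, list(counts) + [A]):  # CRP weights: |b| for existing, A for new
        z_prop = z.copy()
        z_prop[i] = k
        z_prop = np.unique(z_prop, return_inverse=True)[1]    # relabel blocks as 0..K-1
        log_post.append(np.log(prior) + log_marginal_likelihood(X, z_prop, a, b))
    log_post = np.array(log_post) - max(log_post)             # stabilise before exponentiating
    probs = np.exp(log_post) / np.exp(log_post).sum()
    z[i] = candidates[rng.choice(len(candidates), p=probs)]
    return np.unique(z, return_inverse=True)[1]
```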

SLIDE 13

Gibbs Sampling Methodology

Simulated Data

We applied the Gibbs sampling algorithm to a simulated network with the following parameters:

• 96 nodes split into 6 blocks
• φ_{i,i} = 0.85 for every block i
• φ_{i,j} = 0.05 for i ≠ j
• a = b = 1 for a uniform prior
• A = 1.

[Figure (a): simulated network]
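
For reference, a network with these parameters could be generated along the following lines (a sketch; the slide does not state the block sizes, so six equal blocks of 16 nodes are assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 96, 6
z = np.repeat(np.arange(K), n // K)      # assumed: six equal blocks of 16 nodes
phi = np.full((K, K), 0.05)              # between-block edge probability
np.fill_diagonal(phi, 0.85)              # within-block edge probability
X = rng.binomial(1, phi[np.ix_(z, z)])   # adjacency matrix of the simulated network
```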

SLIDE 14

Gibbs Sampling Results

Simulated Data

Same simulated network and parameters as on the previous slide.

[Figure (b): supplied network]

SLIDE 15

Gibbs Sampling Results

Block structure obtained

SLIDE 16

Gibbs Sampling Results

Trace-plot of the number of blocks

[Figure: trace-plot of the number of blocks (1-6) against iteration, up to 10,000 iterations]

SLIDE 17

Gibbs Sampling Results

Gibbs Sampling Summary

The algorithm fails to split the data into 6 clusters within 10,000 iterations and is stuck in a five-cluster configuration for a long time. The main problem with the Gibbs sampler is that it is slow to converge and often becomes trapped in a local mode (5 blocks in this case). A possible improvement is the split-merge algorithm, which updates a group of nodes simultaneously and avoids these problems.

SLIDE 18

Split-Merge Algorithm Methodology

Split-Merge Algorithm

Algorithm:

1. Select two distinct nodes, i and j, uniformly at random.
2. If i and j belong to the same cluster, split that cluster into two by assigning its elements to either of the two new clusters independently with equal probability.
3. If i and j belong to different clusters, merge those clusters.
4. Evaluate the Metropolis-Hastings acceptance probability. If accepted, the new cluster assignment becomes the next state of the algorithm; otherwise, the initial cluster assignment remains as the next state.

a(z*, z) = min{ 1, [q(z | z*) P(z*) L(X | z*)] / [q(z* | z) P(z) L(X | z)] },

where q is the proposal probability, P(z) the prior and L(X | z) the likelihood. A code sketch of one such step follows below.
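
A sketch of one step in Python, under the assumption that L(X | z) is the Beta-Bernoulli marginal likelihood with φ integrated out and P(z) the CRP prior; this is the simple random split described in step 2, not Jain and Neal's restricted-Gibbs variant, and all names are my own:

```python
import numpy as np
from scipy.special import gammaln, betaln

def log_crp_prior(z, A=1.0):
    """log P(z) under CRP(A): A^K * prod_blocks (|b| - 1)! / prod_{m=0}^{n-1} (m + A)."""
    _, sizes = np.unique(z, return_counts=True)
    return (len(sizes) * np.log(A) + gammaln(sizes).sum()
            - np.log(np.arange(len(z)) + A).sum())

def log_lik(X, z, a=1.0, b=1.0):
    """log L(X | z): Beta-Bernoulli marginal likelihood with phi integrated out."""
    B = (z[:, None] == np.arange(z.max() + 1)).astype(float)   # n x K block indicators
    n1 = B.T @ X @ B                                           # edge counts per block pair
    tot = B.sum(0)[:, None] * B.sum(0)[None, :]                # possible edges per block pair
    return (betaln(a + n1, b + tot - n1) - betaln(a, b)).sum()

def split_merge_step(X, z, A=1.0, a=1.0, b=1.0, rng=None):
    """One simple random-split / merge Metropolis-Hastings step."""
    rng = np.random.default_rng(rng)
    z = z.copy()
    i, j = rng.choice(len(z), size=2, replace=False)
    z_prop = z.copy()
    if z[i] == z[j]:
        # split: i keeps the old label, j gets a new one, the rest choose between them with prob 1/2
        members = np.where(z == z[i])[0]
        coins = rng.integers(0, 2, size=len(members))
        coins[members == i], coins[members == j] = 0, 1
        z_prop[members[coins == 1]] = z.max() + 1
        log_q_fwd, log_q_rev = (len(members) - 2) * np.log(0.5), 0.0   # reverse merge is deterministic
    else:
        # merge the two clusters; the reverse move is the corresponding split
        members = np.where((z == z[i]) | (z == z[j]))[0]
        z_prop[z == z[j]] = z[i]
        log_q_fwd, log_q_rev = 0.0, (len(members) - 2) * np.log(0.5)
    z_prop = np.unique(z_prop, return_inverse=True)[1]
    log_ratio = (log_q_rev - log_q_fwd
                 + log_crp_prior(z_prop, A) - log_crp_prior(z, A)
                 + log_lik(X, z_prop, a, b) - log_lik(X, z, a, b))
    if np.log(rng.uniform()) < min(0.0, log_ratio):
        return z_prop                                   # accept the proposed assignment
    return np.unique(z, return_inverse=True)[1]         # reject: keep the current assignment
```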

SLIDE 19

Split-Merge Algorithm Methodology

Split-Merge Algorithm

SLIDE 20

Split-Merge Algorithm Methodology

Split-Merge Algorithm

SLIDE 21

Split-Merge Algorithm Methodology

Split-Merge Algorithm

SLIDE 22

Split-Merge Algorithm Methodology

Split-Merge Algorithm

SLIDE 23

Split-Merge Algorithm Methodology

Split-Merge Algorithm

SLIDE 24

Split-Merge Algorithm Methodology

Split-Merge Algorithm

SLIDE 25

Split-Merge Algorithm Results

Gibbs Sampler + Split-Merge

We applied the Gibbs sampler together with the split-merge algorithm to the earlier network: for every nine full Gibbs sampling scans, one split-merge step was used. The algorithm appropriately splits the data into six clusters, has a short burn-in time and mixes well.
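
One possible way to organise that 9:1 schedule in code (a sketch, reusing the hypothetical gibbs_update_node and split_merge_step functions from the earlier sketches):

```python
import numpy as np

def run_sampler(X, n_iter=1000, A=1.0, a=1.0, b=1.0, seed=0):
    """Hypothetical driver: nine full Gibbs scans for every split-merge move.
    Assumes gibbs_update_node and split_merge_step from the earlier sketches are in scope."""
    rng = np.random.default_rng(seed)
    z = np.zeros(X.shape[0], dtype=int)       # start with every node in one cluster
    trace = []
    for t in range(n_iter):
        if (t + 1) % 10 == 0:                 # every tenth iteration: one split-merge move
            z = split_merge_step(X, z, A, a, b, rng)
        else:                                 # otherwise: a full Gibbs scan over all nodes
            for i in range(len(z)):
                z = gibbs_update_node(X, z, i, A, a, b, rng)
        trace.append(z.max() + 1)             # number of blocks, for the trace-plot
    return z, trace
```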

SLIDE 26

Split-Merge Algorithm Results

Block structure obtained

SLIDE 27

Split-Merge Algorithm Results

Trace-plot of the number of blocks

[Figure: trace-plot of the number of blocks (1-7) against iteration, up to 1,000 iterations]

SLIDE 28

Future Work

Future Work

• Assess the performance of the algorithms when the blocks significantly vary in size.
• Evaluate the complexities of the algorithms.
• Explore more advanced algorithms (such as the Restricted Gibbs Sampling Split-Merge).

SLIDE 29

Future Work

References

Schmidt, M. N. and Mørup, M. (2013). Non-parametric Bayesian modeling of complex networks. IEEE Signal Processing Magazine, 30:110–128.

Jain, S. and Neal, R. M. (2004). A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13:158–182.
