

SLIDE 1

Stochastic Blockmodel with Cluster Overlap, Relevance Selection, and Similarity-Based Smoothing

Joyce Jiyoung Whang¹, Piyush Rai², Inderjit S. Dhillon¹

¹The University of Texas at Austin   ²Duke University

International Conference on Data Mining

  • Dec. 7 - Dec. 10, 2013

Joyce Jiyoung Whang, The University of Texas at Austin International Conference on Data Mining (1/24)

SLIDE 2

Contents

Introduction and Background
  • Stochastic Blockmodel
  • Indian Buffet Process

The Proposed Model
  • Basic Model
  • Relevance Selection Mechanism
  • Exploiting Pairwise Similarities

Experiments
  • Synthetic Data
  • Facebook Data
  • Drug-Protein Interaction Data
  • Lazega Lawyers Data

Conclusions

SLIDE 3

Introduction

Stochastic Blockmodel

  • Generative model
  • Expresses objects as low-dimensional representations U_i, U_j
  • Models the link probability of a pair of objects: P(A_ij) = f(U_i, U_j, θ)
  • e.g., latent class model, mixed membership stochastic blockmodel

Applications

  • Revealing structures in networks
  • (Overlapping) clustering, link prediction

SLIDE 4

Introduction

Overlapping stochastic blockmodels

Objects have hard memberships in multiple clusters.

Contributions of this paper

  • Extend the overlapping stochastic blockmodel to bipartite graphs
  • Relevance selection mechanism
  • Make use of additionally available object features
  • Nonparametric Bayesian approach

SLIDE 5

Background

Indian Buffet Process (IBP) (Griffiths and Ghahramani, 2011)

  • N objects, K clusters, overlapping clustering U ∈ {0, 1}^(N×K)
  • Object: customer; cluster: dish
  • The first customer selects Poisson(α) dishes to begin with
  • Each subsequent customer n:
      • selects an already-selected dish k with probability m_k / n (m_k: number of previous customers who selected dish k)
      • selects Poisson(α/n) new dishes
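The buffet metaphor above maps directly to code. A minimal sketch (not from the paper; function name and NumPy usage are my own) of drawing the binary assignment matrix U from the IBP:

```python
import numpy as np

def sample_ibp(num_objects, alpha, rng=None):
    """Draw a binary cluster-assignment matrix from the Indian Buffet Process.

    Customer n takes an existing dish k with probability m_k / n, where m_k
    is the number of previous customers who took dish k, and then tries
    Poisson(alpha / n) brand-new dishes.
    """
    rng = rng or np.random.default_rng(0)
    dishes = []  # dishes[k] = list of customers who took dish k
    for n in range(1, num_objects + 1):
        for takers in dishes:
            if rng.random() < len(takers) / n:   # existing dish: prob m_k / n
                takers.append(n)
        for _ in range(rng.poisson(alpha / n)):  # new dishes: Poisson(alpha/n)
            dishes.append([n])
    # Convert the dish lists into an N x K binary matrix U.
    U = np.zeros((num_objects, len(dishes)), dtype=int)
    for k, takers in enumerate(dishes):
        for n in takers:
            U[n - 1, k] = 1
    return U
```

The number of columns K is not fixed in advance; it grows with the Poisson draws, which is what makes the prior nonparametric.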

SLIDE 6

The Proposed Model

SLIDE 7

Basic Model

Bipartite graph (N × M binary adjacency matrix, |A| = N, |B| = M)

  U ∼ IBP(α_u)
  V ∼ IBP(α_v)
  W ∼ Nor(0, σ_w²)
  A ∼ Ber(σ(UWV⊤))

  • IBP(α): IBP prior distribution
  • Nor(0, σ²): Gaussian distribution
  • σ(x) = 1 / (1 + exp(−x))
  • Ber(p): Bernoulli distribution
  • U ∈ {0, 1}^(N×K), V ∈ {0, 1}^(M×L): cluster assignment matrices

  P(A_nm = 1) = σ(u_n W v_m⊤) = σ(Σ_{k,l} u_nk W_kl v_ml)

  • W_kl: the interaction strength between two nodes due to their memberships in cluster k and cluster l
  • Example (u_n in clusters {1, 3}, v_m in clusters {2, 3}): P(A_nm = 1) = σ(W_12 + W_13 + W_32 + W_33)
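The link probability above is just a logistic function of a bilinear form. A small illustrative sketch (the function name and the toy W are assumptions, not the paper's):

```python
import numpy as np

def link_probability(U, W, V):
    """P(A = 1) = sigmoid(U W V^T) for every (row-object, column-object) pair."""
    logits = U @ W @ V.T
    return 1.0 / (1.0 + np.exp(-logits))

# Toy example matching the slide: u_n is in clusters {1, 3} and v_m is in
# clusters {2, 3} (1-indexed), so the logit is W_12 + W_13 + W_32 + W_33.
u_n = np.array([[1, 0, 1]])             # memberships of the row object
v_m = np.array([[0, 1, 1]])             # memberships of the column object
W = np.arange(9.0).reshape(3, 3) / 10   # arbitrary interaction strengths
p = link_probability(u_n, W, v_m)[0, 0]  # = sigma(0.1 + 0.2 + 0.7 + 0.8)
```

Only the W entries whose row is an active cluster of u_n and whose column is an active cluster of v_m contribute to the logit.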

SLIDE 8

Basic Model

Unipartite graph (A ∈ {0, 1}^(N×N))

  U ∼ IBP(α_u)
  W ∼ Nor(0, σ_w²)
  A ∼ Ber(σ(UWU⊤))

  • IBP(α): IBP prior distribution
  • Nor(0, σ²): Gaussian distribution
  • σ(x) = 1 / (1 + exp(−x))
  • Ber(p): Bernoulli distribution
  • U ∈ {0, 1}^(N×K): cluster assignment matrix

  P(A_nm = 1) = σ(u_n W u_m⊤) = σ(Σ_{k,l} u_nk W_kl u_ml)

  • Example (u_n in clusters {1, 3}, u_m in clusters {2, 3}): P(A_nm = 1) = σ(W_12 + W_13 + W_32 + W_33)

SLIDE 9

Relevance Selection Mechanism

Motivation

  • In real-world networks, there may be noisy objects (e.g., spammers), which can lead to bad parameter estimates.
  • Maintain two random binary relevance vectors R^A ∈ {0, 1}^N and R^B ∈ {0, 1}^M.

SLIDE 10

Relevance Selection Mechanism

  • Background noise link probability: φ ∼ Bet(a, b)
  • If one or both of the objects n ∈ A and m ∈ B are irrelevant, A_nm is drawn from Ber(φ)
  • If both n and m are relevant, A_nm is drawn from Ber(p) = Ber(σ(u_n W v_m⊤))

Generative model:

  φ ∼ Bet(a, b)
  R^A_n ∼ Ber(ρ^A_n),  R^B_m ∼ Ber(ρ^B_m)
  u_n ∼ IBP(α_u) if R^A_n = 1; zeros otherwise
  v_m ∼ IBP(α_v) if R^B_m = 1; zeros otherwise
  p = σ(u_n W v_m⊤)
  A_nm ∼ Ber(p^(R^A_n R^B_m) φ^(1 − R^A_n R^B_m))
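The last line of the generative model switches between the blockmodel probability and the background noise probability via the relevance bits. A hedged sketch of that sampling step (the function name and vectorized form are my own, not the paper's):

```python
import numpy as np

def sample_adjacency(U, V, W, R_A, R_B, phi, rng=None):
    """Sample A_nm ~ Ber(p^(R_A R_B) * phi^(1 - R_A R_B)): the blockmodel
    probability p when both endpoints are relevant, the background noise
    probability phi otherwise."""
    rng = rng or np.random.default_rng(0)
    p = 1.0 / (1.0 + np.exp(-(U @ W @ V.T)))  # N x M blockmodel probabilities
    relevant = np.outer(R_A, R_B)             # 1 iff both endpoints relevant
    probs = np.where(relevant == 1, p, phi)
    return (rng.random(probs.shape) < probs).astype(int)
```

Since an irrelevant object also has its membership row zeroed out, every link touching it is explained purely by φ rather than by spurious cluster structure.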

SLIDE 11

Exploiting Pairwise Similarities

We may have access to side information

e.g., a similarity matrix between objects

The IBP does not consider the pairwise similarity information.

Customer n chooses an existing dish regardless of the similarity of this customer with other customers.

Two objects n and m have a high pairwise similarity ⇒ u_n and u_m should also be similar.

  • Encourage a customer to select a dish if the customer has high similarity with the other customers who chose that dish.
  • Let the customer select many new dishes if the customer has low similarity with previous customers.

SLIDE 12

Exploiting Pairwise Similarities

Modify the sampling scheme in the IBP-based generative model:

  • The probability that object n gets membership in cluster k is proportional to

      (Σ_{n′≠n} S^A_{nn′} u_{n′k}) / (Σ_{n′=1}^{n} S^A_{nn′})

  • Σ_{n′=1}^{n} S^A_{nn′}: effective total number of objects
  • Σ_{n′≠n} S^A_{nn′} u_{n′k}: effective number of objects (other than n) that belong to cluster k
  • Standard IBP: (Σ_{n′≠n} u_{n′k}) / n = m_k / n

  • The number of new clusters for object n is given by Poisson(α / Σ_{n′=1}^{n} S^A_{nn′}).
  • If object n has low similarities with the previous objects, it is encouraged more to get memberships in its own new clusters.
  • Standard IBP: Poisson(α/n)
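The two modified quantities can be computed in a few lines. A sketch assuming 0-indexed objects (the function name and conventions are mine, not the paper's):

```python
import numpy as np

def sim_ibp_probs(S, U, n, alpha):
    """For incoming object n (0-indexed), return the per-cluster membership
    probabilities and the Poisson rate for new clusters under the
    similarity-smoothed IBP.

    S: pairwise similarity matrix; U: binary memberships of objects 0..n
    (row n itself is excluded from the existing-cluster counts).
    """
    denom = S[n, : n + 1].sum()                # effective total number of objects
    weights = S[n, : n + 1].astype(float).copy()
    weights[n] = 0.0                           # exclude object n itself
    p_existing = weights @ U[: n + 1] / denom  # effective count in each cluster k
    new_rate = alpha / denom                   # Poisson rate for new clusters
    return p_existing, new_rate
```

With an all-ones similarity matrix the weights all become 1, the denominator becomes the customer index, and both quantities reduce to the standard IBP's m_k / n and Poisson(α/n).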

SLIDE 13

The Final Model

ROCS (Relevance-based Overlapping Clustering with Similarity-based Smoothing)

  φ ∼ Bet(a, b)
  ρ^A_n ∼ Bet(c, d),  ρ^B_m ∼ Bet(e, f)
  R^A_n ∼ Ber(ρ^A_n),  R^B_m ∼ Ber(ρ^B_m)
  u_n ∼ SimIBP(α_u, S^A)
  v_m ∼ SimIBP(α_v, S^B)
  W ∼ Nor(0, σ_w²)
  p = σ(u_n W v_m⊤)
  A_nm ∼ Ber(p^(R^A_n R^B_m) φ^(1 − R^A_n R^B_m))

  • SimIBP(α, S): similarity-information-augmented variant of the IBP

For inference, we use MCMC (Gibbs sampling)
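Putting the pieces together, a toy forward sample from a ROCS-style model can be sketched as below. This is a simplification, not the paper's procedure: a finite K × L Bernoulli(0.5) membership draw stands in for the nonparametric SimIBP, and all names are my own.

```python
import numpy as np

def rocs_generate(N, M, K, L, a=1, b=1, c=1, d=1, e=1, f=1,
                  sigma_w=1.0, rng=None):
    """One forward draw from a finite stand-in for the ROCS generative model."""
    rng = rng or np.random.default_rng(0)
    phi = rng.beta(a, b)                           # background noise probability
    rho_A, rho_B = rng.beta(c, d, N), rng.beta(e, f, M)
    R_A = (rng.random(N) < rho_A).astype(int)      # row-object relevance bits
    R_B = (rng.random(M) < rho_B).astype(int)      # column-object relevance bits
    # Stand-in for SimIBP: Bernoulli(0.5) memberships, zeroed for irrelevant objects.
    U = (rng.random((N, K)) < 0.5) * R_A[:, None]
    V = (rng.random((M, L)) < 0.5) * R_B[:, None]
    W = rng.normal(0.0, sigma_w, (K, L))           # cluster interaction strengths
    p = 1.0 / (1.0 + np.exp(-(U @ W @ V.T)))
    gate = np.outer(R_A, R_B)                      # 1 iff both endpoints relevant
    A = (rng.random((N, M)) < np.where(gate == 1, p, phi)).astype(int)
    return A, U, V, R_A, R_B
```

In the actual model, K and L are not inputs: they are inferred from the data by the Gibbs sampler through the SimIBP priors.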

SLIDE 14

Experiments

SLIDE 15

Experiments

Tasks

  • The correct number of clusters
  • Identify relevant objects
  • Use pairwise similarity information
  • Overlapping clustering
  • Link prediction

Baselines

  • Overlapping Clustering using Nonnegative Matrix Factorization (OCNMF) (Psorakis et al. 2011)
  • Kernelized Probabilistic Matrix Factorization (KPMF) (Zhou et al. 2012)
  • Bayesian Community Detection (BCD) (Mørup et al. 2012)
  • Latent Feature Relational Model (LFRM) (Miller et al. 2009)

SLIDE 16

Experiments

Synthetic Data

  • 30 relevant objects, 20 irrelevant objects
  • Three overlapping clusters

SLIDE 17

Experiments

Overlapping clustering

SLIDE 18

Experiments

Table 1: Link Prediction on Synthetic Data

  Method   0-1 Test Error (%)   AUC
  OCNMF    44.82 (±12.59)       0.7164 (±0.1987)
  KPMF     39.70 (±1.78)        0.6042 (±0.0517)
  BCD      20.05 (±1.49)        0.8504 (±0.0197)
  LFRM      9.59 (±0.36)        0.8619 (±0.0374)
  ROCS      9.05 (±0.42)        0.8787 (±0.0303)

Results Summary

  • ROCS perfectly identifies relevant/irrelevant objects.
  • ROCS identifies the correct number of clusters.
  • For the link prediction task, ROCS is better than the other methods in terms of both 0-1 test error and AUC score.

SLIDE 19

Experiments

Facebook Data

  • An ego-network in Facebook (228 nodes)
  • User profile features (e.g., age, gender, etc.); 92 features selected
  • Known number of clusters: 14

Table 2: Link Prediction on Facebook Data

  Method   0-1 Test Error (%)   AUC
  OCNMF    36.58 (±19.74)       0.7215 (±0.1666)
  KPMF     35.76 (±2.76)        0.7013 (±0.0174)
  BCD      13.59 (±0.31)        0.9187 (±0.0242)
  LFRM     12.38 (±2.82)        0.9156 (±0.0134)
  ROCS     11.96 (±1.44)        0.9388 (±0.0156)

  • BCD overestimated the number of clusters (20-22 across multiple runs).
  • LFRM and ROCS almost correctly inferred the ground-truth number of clusters (13-15 across multiple runs).

SLIDE 20

Experiments

Drug-Protein Interaction Data

  • Bipartite graph (200 drug molecules, 150 target proteins)
  • Drug-drug similarity matrix, protein-protein similarity matrix

Table 3: Link Prediction on Drug-Protein Interaction Data

  Method   0-1 Test Error (%)   AUC
  KPMF     16.65 (±0.36)        0.8734 (±0.0133)
  LFRM      2.75 (±0.04)        0.9032 (±0.0156)
  ROCS      2.31 (±0.06)        0.9276 (±0.0142)

  • OCNMF and BCD are not applicable to bipartite graphs.
  • LFRM here denotes ROCS without similarity information.
  • KPMF takes the similarity information into account but does not assume overlapping clustering.

SLIDE 21

Experiments

Lazega Lawyers Data

  • Directed graph, social network (71 partners)
  • Each entry has features (gender, office location, age, etc.)

Table 4: Link Prediction on Lazega-Lawyers Data

  Method   0-1 Test Error (%)   AUC
  OCNMF    35.36 (±20.71)       0.6388 (±0.1527)
  KPMF     34.69 (±1.13)        0.7203 (±0.0229)
  BCD      16.58 (±0.56)        0.7876 (±0.0168)
  LFRM     14.05 (±2.04)        0.8025 (±0.0205)
  ROCS     12.98 (±0.32)        0.8248 (±0.01642)

  • Even weak similarity information can yield reasonable improvements in prediction accuracy.

SLIDE 22

Conclusions

SLIDE 23

Conclusions

ROCS: a flexible model for unipartite/bipartite graphs.

  • Each object can belong to multiple clusters (hard membership).
  • Nonparametric Bayesian approach.
  • Irrelevant objects can be dealt with in a principled manner.
  • Pairwise similarity between objects can be exploited to regularize the cluster memberships of objects.
  • Future work: make the model scalable.

SLIDE 24

References

  • T. L. Griffiths and Z. Ghahramani. The Indian buffet process: An introduction and review. JMLR, 2011.
  • K. Miller, T. Griffiths, and M. Jordan. Nonparametric latent feature models for link prediction. NIPS, 2009.
  • M. Mørup and M. N. Schmidt. Bayesian community detection. Neural Computation, 24(9):2434-2456, 2012.
  • I. Psorakis, S. Roberts, M. Ebden, and B. Sheldon. Overlapping community detection using Bayesian non-negative matrix factorization. Physical Review E, 2011.
  • T. Zhou, H. Shan, A. Banerjee, and G. Sapiro. Kernelized probabilistic matrix factorization: Exploiting graphs and side information. SDM, 2012.
