Leman Akoglu (Carnegie Mellon University), Hanghang Tong (IBM T. J. Watson), Brendan Meeder (Carnegie Mellon University), Christos Faloutsos (Carnegie Mellon University)
PICS: Parameter-free Identification of Cohesive Subgroups 2 Leman Akoglu (CMU)
Given a graph with node attributes (features)
- social networks + user interests
- phone call networks + customer demographics
- gene interaction networks + gene expression info
Find cohesive clusters, bridges, anomalies
cohesive cluster: similar connectivity & attribute coherence
Given adjacency matrix A and feature matrix F, find homogeneous blocks (clusters) in A and F. The method should be:
- parameter-free
- scalable
[Figure: adjacency matrix A (people × people) and binary feature matrix F (people × features), each reordered into people groups and feature groups]
Flat clustering, graph clustering
Additional feature nodes
- heterogeneous graph
Edges weighted by both connectivity and feature similarity
- quadratic pairwise computations!
- choice of similarity function
Flat clustering (e.g. k-means) [Kriegel+] [Leeuwen+]
METIS [Karypis and Kumar], [Flake+] [Girvan and Newman] [Andersen+] spectral [Ng+], co-clustering [Dhillon+]
SA-cluster [Zhou+], spectral relational clustering [Long+], CoPaM [Moser+], GAMer [Günnemann+]
Autopart and cross-associations [Chakrabarti+], GraphScope [Sun+], PaCK [He+]
1. How many node- and attribute-clusters?
2. How to assign nodes and attributes to clusters?
Main idea: employ Minimum Description Length (MDL). Good clustering implies good compression.
Total cost: L(M) + L(D|M)
- L(M): encoding length of the clustering
- L(D|M): encoding length of the blocks
Given a database D and a set of models for D, MDL selects the model M that minimizes L(M) + L(D|M)
- L(M): length in bits of the description of model M
- L(D|M): length in bits of the data, encoded by M
Example (polynomial curve fitting, figure from Bishop: PR&ML):
- d = 1: a1x + a0, plus deltas
- d = 9: a9x^9 + … + a1x + a0, plus {} (no deltas)
L(M): Model description cost
1. n: #nodes, f: #attributes
2. k: #node-clusters, l: #attribute-clusters
3. size of node cluster i, size of attribute cluster j
Optimal cost of assigning nodes to clusters: with r_i the size of node cluster i and p_i = r_i / n, a node in cluster i costs \log(1/p_i) = \log(n/r_i) bits, so the total is
$$\sum_{i=1}^{k} r_i \log\frac{n}{r_i} = -n \sum_{i=1}^{k} p_i \log p_i = n H(P)$$
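The model-cost terms can be illustrated with a small Python sketch. It assumes that the integers n, f, k and l are sent with Rissanen's universal code for integers (`log_star`), which is a common choice in MDL-based clustering but is an assumption here, and that cluster memberships cost n·H(P) bits as in the derivation above. The function names `log_star` and `membership_bits` are hypothetical.

```python
import math


def log_star(n: int) -> float:
    """Approximate universal code length, in bits, of a positive integer n.

    Assumption: Rissanen's log* code, i.e.
    log2(c) + log2(n) + log2(log2(n)) + ... (positive terms only),
    with normalizing constant c ~= 2.865064.
    """
    bits = math.log2(2.865064)
    x = float(n)
    while x > 1.0:
        x = math.log2(x)
        bits += x
    return bits


def membership_bits(cluster_sizes) -> float:
    """n * H(P) = sum_i r_i * log2(n / r_i), with r_i the size of cluster i."""
    n = sum(cluster_sizes)
    return sum(r * math.log2(n / r) for r in cluster_sizes if r > 0)


# Two equal clusters of 4 nodes: each node costs log2(8 / 4) = 1 bit.
even_split = membership_bits([4, 4])      # 8.0 bits
# A single cluster carries no membership information.
single = membership_bits([8])             # 0.0 bits
# Sizes 2, 2, 4: 2*2 + 2*2 + 4*1 bits.
uneven_split = membership_bits([2, 2, 4])  # 12.0 bits
```

Under these assumptions, adding clusters always raises the membership cost, so a finer clustering has to pay for itself through a cheaper data description.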
L(D|M): Data description cost given the model
1. For each block in A and F, the number of 1s
2. The encoding cost of each block
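As a rough illustration of the data cost, the sketch below encodes one binary block in two parts: the number of 1s, then the cells themselves at the block's binary entropy. This follows the style of cross-association coding and is an assumption, not necessarily the exact formula used on the slide; `binary_entropy` and `block_bits` are hypothetical names.

```python
import math

import numpy as np


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) source; 0 for a constant source."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def block_bits(block: np.ndarray) -> float:
    """Assumed cost of one block: bits for the count of 1s, plus size * H(density)."""
    size = block.size
    if size == 0:
        return 0.0
    ones = int(block.sum())
    count_bits = math.ceil(math.log2(size + 1))  # the count lies in 0..size
    return count_bits + size * binary_entropy(ones / size)


# A 4x4 block-diagonal matrix: mixed as one block, homogeneous once split.
M = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
one_block = block_bits(M)  # 5 + 16 * 1.0 = 21 bits
four_blocks = sum(block_bits(M[r:r + 2, c:c + 2])
                  for r in (0, 2) for c in (0, 2))  # 4 * 3 = 12 bits
```

Splitting the matrix into four homogeneous blocks lowers the data cost from 21 to 12 bits, which is the sense in which good clustering implies good compression.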
L(M): Model description cost
1. n: #nodes, f: #attributes
2. k: #node-clusters, l: #attribute-clusters
3. size of node-cluster i, size of attribute-cluster j
L(D|M): Data description cost given the model
1. For each block in A and F, the number of 1s
2. The encoding cost of each block
A similar problem (column re-ordering for minimum total run length) has been shown to be NP-hard [Johnson+], by reduction from Hamiltonian Path.
The algorithm is iterative and monotonic: the total cost never increases from one iteration to the next, so it converges to a local optimum.
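The iterate-and-never-get-worse idea can be sketched in Python. This is a simplified illustration, not the PICS algorithm itself: it co-clusters a single binary matrix, keeps the numbers of clusters k and l fixed (PICS also searches for them), and uses only an assumed entropy-based data cost. All function names are hypothetical. A reassignment is accepted only if it lowers the cost, so the cost history is non-increasing by construction.

```python
import numpy as np


def block_bits(ones, size):
    """Assumed block cost: size * binary entropy of the block density."""
    if size == 0 or ones == 0 or ones == size:
        return 0.0
    p = ones / size
    return -size * (p * np.log2(p) + (1 - p) * np.log2(1 - p))


def data_cost(M, row_of, col_of, k, l):
    """Sum of block costs over all (row-cluster, column-cluster) pairs."""
    total = 0.0
    for i in range(k):
        for j in range(l):
            block = M[np.ix_(row_of == i, col_of == j)]
            total += block_bits(int(block.sum()), block.size)
    return total


def reassign_rows(M, row_of, col_of, k, l, eps=1e-9):
    """Move each row to the row-cluster whose block densities encode it most cheaply."""
    dens = np.zeros((k, l))
    for i in range(k):
        for j in range(l):
            block = M[np.ix_(row_of == i, col_of == j)]
            dens[i, j] = block.mean() if block.size else 0.5
    dens = np.clip(dens, eps, 1 - eps)  # avoid log2(0)
    width = np.array([(col_of == j).sum() for j in range(l)])
    ones = np.stack([M[:, col_of == j].sum(axis=1) for j in range(l)], axis=1)
    zeros = width - ones
    # bits[r, i]: cost of row r if coded with the densities of row-cluster i
    bits = -(ones @ np.log2(dens).T + zeros @ np.log2(1 - dens).T)
    return bits.argmin(axis=1)


def fit(M, k, l, iters=20, seed=0):
    """Alternate row and column reassignment; keep a move only if the cost drops."""
    rng = np.random.default_rng(seed)
    row_of = rng.integers(k, size=M.shape[0])
    col_of = rng.integers(l, size=M.shape[1])
    cost = data_cost(M, row_of, col_of, k, l)
    history = [cost]
    for _ in range(iters):
        improved = False
        new_rows = reassign_rows(M, row_of, col_of, k, l)
        c = data_cost(M, new_rows, col_of, k, l)
        if c < cost:
            row_of, cost, improved = new_rows, c, True
        new_cols = reassign_rows(M.T, col_of, row_of, l, k)
        c = data_cost(M, row_of, new_cols, k, l)
        if c < cost:
            col_of, cost, improved = new_cols, c, True
        history.append(cost)
        if not improved:
            break  # local optimum reached
    return row_of, col_of, history


# Toy input: an 8x6 matrix with two dense diagonal blocks.
M = np.kron(np.eye(2, dtype=int), np.ones((4, 3), dtype=int))
row_of, col_of, history = fit(M, k=2, l=2)
```

Because the search only ever accepts improvements, the result depends on the random start and may be a local rather than a global optimum, which matches the convergence statement above.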
Computational complexity:
[Figure: time per iteration (s) vs. number of non-zeros]
Graph | Description | n | f | nnz
1. Phone call | users, titles | 94 | 7 | 391
2. Device | users, titles | 94 | 7 | 5K
3. PolBooks | books, inclination | 92 | 2 | 840
4. PolBlogs | blogs, inclination | 1.5K | 2 | 20K
5. Twitter | users, hashtags | 9.6K | 10K | 82K
6. YouTube | users, groups | 77K | 30K | 1M
7. YeastGene | genes, articles | 844 | 17K | 64K
(n: #nodes, f: #attributes, nnz: #non-zeros)
[Figure: PolBooks. Books vs. book groups; liberal vs. conservative clusters with a “core and periphery” structure]
[Figure: examples of “core” liberal and conservative books, and examples of bridging ‘conservative’ books]
[Figure: Phone calls and Device scans. Subjects vs. subjects and subjects vs. title; groups labeled casual, business, grad, call-center]
[Figure: YeastGene. Yeast genes vs. yeast genes and yeast genes vs. articles (844 genes, 17K articles); regions labeled 1, 2, 3 and A1, A2, A3]
[Figure annotation: survey]
[Figure: Twitter. Twitter users vs. @hashtags (9.6K users, 10K hashtags)]
[Figure annotations: casual Italian bloggers heavy-hitters]
[Figure: YouTube. YouTube users vs. YouTube groups (77K users, 30K groups)]
[Figure annotations: familiar strangers, anime lovers, bridges]
Novel clustering model:
- PICS finds groups of nodes in an attributed graph with (1) similar connectivity, and (2) attribute homogeneity.
- It also groups the node attributes into attribute-clusters.
Parameter-free nature:
- No user input, e.g. number of clusters, similarity functions/thresholds
Effectiveness:
- Insightful clusters, bridges and outliers in diverse real-world datasets including YouTube and Twitter.
Scalability:
- Linearly growing run time with graph + attribute size
lakoglu@cs.cmu.edu http://www.cs.cmu.edu/~lakoglu/