Brendan Meeder Carnegie Mellon University Christos Faloutsos - - PowerPoint PPT Presentation


SLIDE 1

Leman Akoglu (Carnegie Mellon University)
Hanghang Tong (IBM T. J. Watson)
Brendan Meeder (Carnegie Mellon University)
Christos Faloutsos (Carnegie Mellon University)

SLIDE 2

PICS: Parameter-free Identification of Cohesive Subgroups 2 Leman Akoglu (CMU)

Given a graph with node attributes (features):

  • social networks + user interests
  • phone call networks + customer demographics
  • gene interaction networks + gene expression info

Find cohesive clusters, bridges, anomalies

  • cohesive cluster: similar connectivity & attribute coherence

SLIDE 3

[Figure: adjacency matrix A (People × People) and binary feature matrix F (People × Features), reordered into People Groups and Feature Groups]

Given adjacency matrix A and feature matrix F, find homogeneous blocks (clusters) in A and F.

  • parameter-free
  • scalable
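To make "homogeneous blocks" concrete, the following minimal sketch reorders a small adjacency matrix and feature matrix by group labels so that the blocks become visible. The toy matrices, the group labels, and the helper name `reorder` are all hypothetical and chosen only for illustration; how PICS finds the labels is not shown here.

```python
def reorder(matrix, row_groups, col_groups):
    """Permute rows and columns so that members of the same group are adjacent."""
    row_order = sorted(range(len(row_groups)), key=lambda i: row_groups[i])
    col_order = sorted(range(len(col_groups)), key=lambda j: col_groups[j])
    return [[matrix[i][j] for j in col_order] for i in row_order]


# Hypothetical toy data: 4 people, 3 binary features.
A = [
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
]
F = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
people_groups = [0, 1, 0, 1]   # assumed people-group labels
feature_groups = [0, 1, 0]     # assumed feature-group labels

A_blocks = reorder(A, people_groups, people_groups)
F_blocks = reorder(F, people_groups, feature_groups)

for row in F_blocks:
    print(row)
```

After reordering, every block of `F_blocks` is either all 1s or all 0s, which is the kind of homogeneity the problem statement asks for.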

SLIDE 4

• Flat clustering
• Graph clustering
• Additional feature nodes
  • heterogeneous graph
• Weighted edges by both connectivity and feature similarity
  • quadratic pairwise computations!
  • choice of similarity function

SLIDE 5

[Table: comparison of related methods; only the method names are recoverable]

  • Flat clustering (e.g. k-means), [Kriegel+], [Leeuwen+]
  • METIS [Karypis and Kumar], [Flake+], [Girvan and Newman], [Andersen+], spectral [Ng+], co-clustering [Dhillon+]
  • SA-cluster [Zhou+], Spect. rel. clus. [Long+], CoPaM [Moser+], Gamer [Gunneman+]
  • Autopart and cross-assoc.s [Chakrabarti+], GraphScope [Sun+], PaCK [He+]

SLIDE 6

Main idea: employ Minimum Description Length

  • 1. How many node- & attribute-clusters?
  • 2. How to assign nodes and attributes to clusters?

L(M) + L(D|M)

  • L(M): encoding length of clustering
  • L(D|M): encoding length of blocks

Good Clustering implies Good Compression

SLIDE 7

Given a database D and a set of models for D, MDL selects the model M that minimizes

L(M) + L(D|M)

  • L(M): length in bits of the description of model M
  • L(D|M): length in bits of the data, encoded by M

Example (Bishop: PR&ML), polynomial fit of degree d:

  • d = 1: a1x+a0 plus deltas
  • d = 9: a9x^9+…+a1x+a0 plus {}
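The trade-off between L(M) and L(D|M) can be illustrated with a toy that is deliberately simpler than the polynomial example above: choosing between two ways of encoding a binary sequence. This is only a sketch under stated assumptions. The two candidate models ("uniform" and "biased"), the function names, and the idealized code lengths (entropy rate, log2(n + 1) bits for a count) are illustrative choices, not taken from the slides.

```python
import math


def entropy_bits(p):
    """Binary Shannon entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def cost_uniform(bits):
    """'Fair coin' model: nothing to describe, one bit per symbol."""
    return 0.0, float(len(bits))


def cost_biased(bits):
    """'Biased coin' model: state the count of 1s, then code at the entropy rate."""
    n = len(bits)
    if n == 0:
        return 0.0, 0.0
    ones = sum(bits)
    model_cost = math.log2(n + 1)            # the count is one of n + 1 values
    data_cost = n * entropy_bits(ones / n)   # idealized code length of the data
    return model_cost, data_cost


def mdl_select(bits):
    """Return the name of the model with the smallest L(M) + L(D|M)."""
    candidates = {"uniform": cost_uniform(bits), "biased": cost_biased(bits)}
    return min(candidates, key=lambda name: sum(candidates[name]))


sparse = [1] + [0] * 63      # mostly zeros: paying for the parameter is worth it
balanced = [0, 1] * 32       # half ones: the extra parameter buys nothing

print(mdl_select(sparse), mdl_select(balanced))
```

The richer model wins only when the bits it saves on the data exceed the bits spent describing it, which is the same reasoning that decides between the d = 1 and d = 9 fits.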

SLIDE 8

L(M): model description cost

  • 1. n: #nodes, f: #attributes
  • 2. k: #node-clusters, l: #attribute-clusters
  • 3. size of node cluster i, size of attribute cluster j

Optimal cost (#bits) of the node clustering, where r_i is the size of node cluster i and p_i = r_i / n:

$$\sum_i r_i \log \frac{n}{r_i} = -n \sum_i \frac{r_i}{n} \log \frac{r_i}{n} = -n \sum_i p_i \log p_i = n H(P)$$
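The node-clustering cost n H(P) can be checked numerically with a short sketch. The function names and the toy label vectors below are hypothetical; the code only evaluates the sum of r_i log2(n / r_i) over cluster sizes and compares it with n times the entropy of the size distribution.

```python
import math
from collections import Counter


def assignment_cost_bits(labels):
    """Sum of r_i * log2(n / r_i) over clusters, with r_i the size of cluster i."""
    n = len(labels)
    return sum(r * math.log2(n / r) for r in Counter(labels).values())


def n_times_entropy(labels):
    """n * H(P), where p_i = r_i / n is the fraction of nodes in cluster i."""
    n = len(labels)
    probs = [r / n for r in Counter(labels).values()]
    return n * -sum(p * math.log2(p) for p in probs)


balanced = [0, 0, 1, 1]   # two clusters of equal size
skewed = [0, 0, 0, 1]     # one large and one small cluster
single = [0, 0, 0, 0]     # everything in one cluster

for labels in (balanced, skewed, single):
    print(labels, assignment_cost_bits(labels), n_times_entropy(labels))
```

Both functions return the same value for each label vector, and the cost drops to zero when all nodes share one cluster.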

SLIDE 9

L(D|M): data description cost given the model

  • 1. For each block in A and F, the number of 1s
  • 2. The encoding cost of each block
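As a rough illustration of a per-block data cost, the sketch below assumes a common entropy-based encoding: for each block, log2(size + 1) bits to state the number of 1s, plus size × H(density) bits for the cells. This cost function is an assumption made for the example and is not confirmed by the slide text; the function names and the toy matrix are hypothetical as well.

```python
import math


def entropy_bits(p):
    """Binary Shannon entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def block_cost_bits(size, ones):
    """Assumed cost: bits to state the count of 1s, plus entropy-coded cells."""
    return math.log2(size + 1) + size * entropy_bits(ones / size)


def data_cost_bits(matrix, row_labels, col_labels):
    """Sum the assumed block cost over every (row cluster, column cluster) block."""
    blocks = {}
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            key = (row_labels[i], col_labels[j])
            size, ones = blocks.get(key, (0, 0))
            blocks[key] = (size + 1, ones + value)
    return sum(block_cost_bits(size, ones) for size, ones in blocks.values())


# Hypothetical toy feature matrix: 4 nodes, 3 binary attributes.
F = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

good = data_cost_bits(F, [0, 1, 0, 1], [0, 1, 0])       # homogeneous blocks
one_block = data_cost_bits(F, [0, 0, 0, 0], [0, 0, 0])  # a single mixed block

print(good, one_block)
```

Under this assumed cost, the clustering with pure blocks needs far fewer bits for the data than a single mixed block, which is the sense in which good clustering implies good compression.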

SLIDE 10

L(M): model description cost

  • 1. n: #nodes, f: #attributes
  • 2. k: #node-clusters, l: #attribute-clusters
  • 3. size of node-cluster i, size of attribute-cluster j

L(D|M): data description cost given the model

  • 1. For each block in A and F, the number of 1s
  • 2. The encoding cost of each block

A similar problem (column re-ordering for minimum total run length) is shown to be NP-hard [Johnson+], by reduction from Hamiltonian Path.

SLIDE 11

The algorithm is iterative and monotonic: the total cost never increases from one iteration to the next, so it converges to a local optimum.
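The "iterative and monotonic" property can be sketched with a simple greedy reassignment loop. This is not the authors' procedure: it is a swapped-in illustration that keeps the number of row clusters `k` and the column labels fixed, moves one row at a time, and accepts a move only when an assumed total cost (cluster-assignment bits plus entropy-coded blocks) strictly decreases. All names and the toy matrix are hypothetical.

```python
import math


def _h(p):
    """Binary entropy in bits, with H(0) = H(1) = 0."""
    return 0.0 if p <= 0.0 or p >= 1.0 else -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def total_cost(matrix, row_labels, col_labels):
    """Assumed cost: cluster-assignment bits plus entropy-coded blocks."""
    cost = 0.0
    for labels in (row_labels, col_labels):
        n = len(labels)
        for c in set(labels):
            r = labels.count(c)
            cost += r * math.log2(n / r)
    blocks = {}
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            key = (row_labels[i], col_labels[j])
            size, ones = blocks.get(key, (0, 0))
            blocks[key] = (size + 1, ones + value)
    for size, ones in blocks.values():
        cost += math.log2(size + 1) + size * _h(ones / size)
    return cost


def refine_rows(matrix, row_labels, col_labels, k, max_iters=20):
    """Greedy sketch: move one row at a time, keep a move only if the cost drops."""
    labels = list(row_labels)
    best = total_cost(matrix, labels, col_labels)
    history = [best]
    for _ in range(max_iters):
        improved = False
        for i in range(len(labels)):
            keep = labels[i]
            for c in range(k):
                if c == keep:
                    continue
                labels[i] = c
                cost = total_cost(matrix, labels, col_labels)
                if cost < best - 1e-12:
                    best, keep, improved = cost, c, True
            labels[i] = keep   # restore the cheapest label seen for this row
        history.append(best)
        if not improved:
            break
    return labels, history


M = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
labels, history = refine_rows(M, [0, 0, 1, 1], [0, 1, 0], k=2)
print(labels, history)
```

Because a move is kept only when the cost strictly decreases, `history` is non-increasing by construction. The loop stops at the first pass with no improving move, which need not be the globally cheapest clustering.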

SLIDE 12

SLIDE 13

Computational complexity:

[Figure: time per iteration (s) versus number of non-zeros]

SLIDE 14

Graphs          Description        n       f      nnz
1. Phone call   users, titles      94      7      391
2. Device       users, titles      94      7      5K
3. PolBooks     books, incl.       92      2      840
4. PolBlogs     blogs, incl.       1.5K    2      20K
5. Twitter      users, h-tags      9.6K    10K    82K
6. YouTube      users, groups      77K     30K    1M
7. YeastGene    genes, articles    844     17K    64K

SLIDE 15

[Figure: PolBooks clusters. Axis labels: Books, Book groups. Annotations: “core and periphery”, liberal vs. conservative]

SLIDE 16

[Figure: PolBooks clusters. Axis labels: Books, Book groups. Annotations: “core and periphery”, liberal vs. conservative, examples of bridging ‘conservative’ books, examples of “core” liberal and conservative books]

SLIDE 17

[Figure: Phone calls and Device scans clusters. Axis labels: Subjects, title. Cluster annotations: casual business grad call-center]

SLIDE 18

Yeast genes: 844 genes, 17K articles

[Figure: YeastGene clusters. Axis labels: Yeast genes, Articles. Gene clusters 1, 2, 3; article clusters A1, A2, A3; annotation: survey]

SLIDE 19

Twitter: 9.6K users, 10K hashtags

[Figure: Twitter clusters. Axis labels: Twitter users, @hashtags. Cluster annotations: casual Italian bloggers heavy-hitters]

SLIDE 20

YouTube: 77K users, 30K groups

[Figure: YouTube clusters. Axis labels: YouTube users, YouTube groups. Cluster annotations: familiar strangers anime lovers bridges]

SLIDE 21

• Novel clustering model:
  ▪ PICS finds groups of nodes in an attributed graph with (1) similar connectivity and (2) attribute homogeneity.
  ▪ It also groups the node attributes into attribute-clusters.
• Parameter-free nature:
  ▪ No user input, e.g. number of clusters, similarity functions/thresholds
• Effectiveness:
  ▪ Insightful clusters, bridges and outliers in diverse real-world datasets including YouTube and Twitter.
• Scalability:
  ▪ Run time grows linearly with graph + attribute size

SLIDE 22

lakoglu@cs.cmu.edu
http://www.cs.cmu.edu/~lakoglu/

Source code: www.cs.cmu.edu/~lakoglu/#pics