Personalized PageRank based Community Detection
David F. Gleich Purdue University
Joint work with C. Seshadhri, Joyce Jiyoung Whang, and Inderjit S. Dhillon, supported by NSF CAREER 1149756-CCF Code bit.ly/dgleich-codes
David Gleich · Purdue
MLG2013
250 node GEOP network in 2 dimensions
[Andersen et al. 2006]
(total edges in the set)
[Andersen et al. 2006]
import collections

# G is the graph as a dictionary of sets (vertex -> set of neighbors, no self-loops)
# seed is a list of seed vertices
alpha = 0.99
tol = 1e-4

x = {}  # approximate PageRank solution
r = {}  # initialize residual
Q = collections.deque()  # initialize queue
for s in seed:
    r[s] = 1.0 / len(seed)
    Q.append(s)
while len(Q) > 0:
    v = Q.popleft()  # v has r[v] > tol*deg(v)
    if v not in x: x[v] = 0.
    x[v] += (1 - alpha) * r[v]
    mass = alpha * r[v] / (2 * len(G[v]))
    for u in G[v]:  # for neighbors of v
        if u not in r: r[u] = 0.
        if r[u] < len(G[u]) * tol and \
           r[u] + mass >= len(G[u]) * tol:
            Q.append(u)  # add u to queue if large
        r[u] = r[u] + mass
    r[v] = mass * len(G[v])
    if r[v] >= len(G[v]) * tol:
        Q.append(v)  # requeue v if its residual is still large
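The push loop above yields a sparse vector x; a community is usually read off it with a sweep cut over vertices ordered by x[v]/deg(v), as in the Andersen et al. approach cited on these slides. The following is a minimal sketch under stated assumptions, not the code behind the talk: the function names `ppr_push` and `sweep_cut` and the toy graph are hypothetical, and the graph is assumed to be undirected, without self-loops, stored as a dictionary of sets.

```python
import collections

def ppr_push(G, seed, alpha=0.99, tol=1e-4):
    """Approximate personalized PageRank with the queue-based push loop."""
    x, r = {}, {}
    Q = collections.deque()
    for s in seed:
        r[s] = 1.0 / len(seed)
        Q.append(s)
    while Q:
        v = Q.popleft()
        x[v] = x.get(v, 0.0) + (1 - alpha) * r[v]
        mass = alpha * r[v] / (2 * len(G[v]))
        for u in G[v]:
            ru = r.get(u, 0.0)
            # queue u only when its residual crosses the threshold
            if ru < len(G[u]) * tol <= ru + mass:
                Q.append(u)
            r[u] = ru + mass
        r[v] = mass * len(G[v])
        if r[v] >= len(G[v]) * tol:
            Q.append(v)
    return x

def sweep_cut(G, x):
    """Scan prefixes of vertices sorted by x[v]/deg(v); return the
    prefix with the smallest conductance and that conductance."""
    total_vol = sum(len(nbrs) for nbrs in G.values())
    order = sorted(x, key=lambda v: x[v] / len(G[v]), reverse=True)
    S, vol, cut = set(), 0, 0
    best_set, best_cond = None, float("inf")
    for v in order:
        inside = sum(1 for u in G[v] if u in S)
        vol += len(G[v])
        cut += len(G[v]) - 2 * inside  # new boundary edges minus absorbed ones
        S.add(v)
        denom = min(vol, total_vol - vol)
        if denom == 0:
            continue  # prefix covers the whole graph
        cond = cut / denom
        if cond < best_cond:
            best_cond, best_set = cond, set(S)
    return best_set, best_cond

# Hypothetical toy graph: two triangles joined by the edge 2-3.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
community, cond = sweep_cut(toy, ppr_push(toy, [0], alpha=0.85, tol=1e-6))
print(sorted(community), cond)
```

On this toy graph the sweep should return the seed's own triangle, which is cut from the rest by a single edge.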
[Figure: log probability vs. log degree]
Graph              Verts     Edges      κ      C̄
ca-AstroPh         17903     196972     0.318  0.633
email-Enron        33696     180811     0.085  0.509
cond-mat-2005      36458     171735     0.243  0.657
arxiv              86376     517563     0.560  0.678
dblp               226413    716460     0.383  0.635
hollywood-2009     1069126   56306653   0.310  0.766
fb-Penn94          41536     1362220    0.098  0.212
fb-A-oneyear       1138557   4404989    0.038  0.060
fb-A               3097165   23667394   0.048  0.097
soc-LiveJournal1   4843953   42845684   0.118  0.274
(unnamed)          11461     32730      0.037  0.352
p2p-Gnutella25     22663     54693      0.005  0.005
as-22july06        22963     48436      0.011  0.230
itdk0304           190914    607610     0.061  0.158
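The slide does not define the κ and C̄ columns. Assuming κ is the global clustering coefficient (closed wedges over all wedges) and C̄ is the mean local clustering coefficient, both can be computed as in the sketch below. The function name `clustering_coefficients` is hypothetical, and averaging C̄ only over vertices of degree at least 2 is one convention among several.

```python
from itertools import combinations

def clustering_coefficients(G):
    """Return (global, mean_local) clustering coefficients of an
    undirected simple graph stored as a dictionary of sets."""
    wedge_total, closed_total = 0, 0
    local = []
    for v, nbrs in G.items():
        d = len(nbrs)
        wedges = d * (d - 1) // 2  # wedges centered at v
        # a wedge a-v-b is closed when a and b are adjacent
        closed = sum(1 for a, b in combinations(nbrs, 2) if b in G[a])
        wedge_total += wedges
        closed_total += closed
        if wedges > 0:
            local.append(closed / wedges)
    global_cc = closed_total / wedge_total if wedge_total else 0.0
    mean_local = sum(local) / len(local) if local else 0.0
    return global_cc, mean_local

# Hypothetical toy graph: a triangle 0-1-2 with a pendant vertex 3 on 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(clustering_coefficients(toy))
```

The two numbers can differ substantially, as several rows of the table show: the global coefficient is dominated by high-degree vertices, which contribute most of the wedges.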
[Figure: two log-log plots with axes spanning 10^0 to 10^5 and 10^-4 to 10^0; "max deg" is marked on each]
Approximate canonical shape found by Leskovec, Lang, Dasgupta, and Mahoney. Holds for a variety
to conductance.
[Figure: two log-log plots with axes spanning 10^0 to 10^5 and 10^-4 to 10^0; "max deg" is marked on each; axis label: (Degree + 1)]
“Egonet community profile” shows the same shape, 3 secs to compute.
1.1M verts, 4M edges
The Fiedler community computed from the normalized Laplacian is a neighborhood!
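An egonet community profile of the kind described on this slide can be sketched by scoring each vertex's neighborhood (the vertex plus its neighbors) by conductance and pairing that score with degree + 1. This is an illustration under assumptions rather than the code used for the talk: the names `conductance` and `egonet_profile` are hypothetical, and the graph is assumed to be an undirected dictionary of sets without self-loops.

```python
def conductance(G, S, total_vol):
    """Edges leaving S divided by the smaller of vol(S) and vol(complement)."""
    vol = sum(len(G[v]) for v in S)
    cut = sum(1 for v in S for u in G[v] if u not in S)
    denom = min(vol, total_vol - vol)
    return cut / denom if denom > 0 else float("inf")

def egonet_profile(G):
    """Map each vertex to (degree + 1, conductance of its egonet)."""
    total_vol = sum(len(nbrs) for nbrs in G.values())
    profile = {}
    for v, nbrs in G.items():
        ego = nbrs | {v}  # the vertex together with its neighbors
        profile[v] = (len(nbrs) + 1, conductance(G, ego, total_vol))
    return profile

# Hypothetical toy graph: two triangles joined by the edge 2-3.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
for v, (size, cond) in sorted(egonet_profile(toy).items()):
    print(v, size, cond)
```

Plotting the minimum conductance for each size on log-log axes would give a profile comparable in form to the ones shown. Each egonet costs work proportional to the degrees of its members, which is consistent with the short running time quoted on the slide.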
Facebook data from Wilson et
[Figure: log-log plots with axes spanning 10^0 to 10^5 and 10^-4 to 10^0; "max deg" is marked on each]
arXiv – 86k verts, 500k edges
soc-LiveJournal – 5M verts, 42M edges
[Figure: two log-log plots with axes spanning 10^0 to 10^5 and 10^-4 to 10^0; "max deg" is marked on each]
We are missing a region of the NCP when we just look at neighborhoods
(Degree + 1)
Facebook Sample - 1.1M verts, 4M edges
[Figure: two log-log plots with axes spanning 10^0 to 10^5 and 10^-4 to 10^0; "max deg" is marked on each]
This region fills when using the PPR method (like now!)
Facebook Sample - 1.1M verts, 4M edges
In Zachary’s Karate Club network, there are four locally minimal communities, the two leaders and two peripheral nodes.
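The slide does not spell out "locally minimal". One reading, taken here as an assumption, is that a vertex is locally minimal when none of its neighbors has an egonet with lower conductance than its own. The sketch below applies that test to a small hypothetical graph rather than to the Karate Club data; the function names are illustrative, and ties are counted as minimal.

```python
def egonet_conductance(G, v, total_vol):
    """Conductance of the set made of v and its neighbors."""
    S = G[v] | {v}
    vol = sum(len(G[u]) for u in S)
    cut = sum(1 for u in S for w in G[u] if w not in S)
    denom = min(vol, total_vol - vol)
    return cut / denom if denom > 0 else float("inf")

def locally_minimal_vertices(G):
    """Vertices whose egonet conductance is no larger than any neighbor's."""
    total_vol = sum(len(nbrs) for nbrs in G.values())
    score = {v: egonet_conductance(G, v, total_vol) for v in G}
    return sorted(v for v in G if all(score[v] <= score[u] for u in G[v]))

# Hypothetical toy graph: two triangles joined by the edge 2-3.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(locally_minimal_vertices(toy))
```

Here the two bridge endpoints are excluded because their egonets spill across the bridge, while the remaining vertices of each triangle tie with one another. Using a strict inequality instead would break such ties differently.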
[Figure: two log-log plots with axes spanning 10^0 to 10^5 and 10^-4 to 10^0; "max deg" is marked on each]
The red circles – the best local mins – find the extremes in the egonet profile.
Facebook Sample - 1.1M verts, 4M edges
[Figure: two log-log plots with axes spanning 10^0 to 10^5 and 10^-4 to 10^0; "max deg" is marked on each; legend: Full NCP, Locally min NCP, Original Egonet]
[Figure: two plots with axes spanning 10^-3 to 10^0 and 10^-2 to 10^0; annotations on each: 0.7% =0.10, 2.2% =0.25, 29.8% =0.88]
Log-Log scale!
[Figure: Maximum Conductance (0.1 to 1) and Coverage (percentage, 10 to 100); methods: egonet, graclus centers, spread hubs, random, bigclam]
Flickr social network: 2M vertices, 22M edges
We can cover 95% of network with communities
flickr sample - 2M verts, 22M edges
[Figure: F1 and F2 scores on DBLP (axis range 0.1 to 0.24); methods: demon, bigclam, graclus centers, spread hubs, random, egonet]
Using datasets from Yang and Leskovec (WSDM 2013) with known overlapping community structure, our method outperforms current state-of-the-art detection methods. Even randomly seeded!
PPR community detection is fast
[Andersen et al. FOCS06]
PPR communities look real
[Abrahao et al. KDD2012; Zhu et al. ICML2013]
Egonet analysis reveals basis of NCP
Partitioning for seeding yields high coverage & real communities.
“Caveman” communities?
Gleich & Seshadhri KDD2012
Whang, Gleich & Dhillon CIKM2013
PPR Sample
bit.ly/18khzO5
bit.ly/dgleich-code
Best conductance cut at intersection of communities?
[Figure: CDF of Number of Wedges (0.2 to 1) vs. Degree (10^0 to 10^4)]