

slide-1
SLIDE 1

Personalized PageRank based Community Detection

David F. Gleich Purdue University

Joint work with C. Seshadhri, Joyce Jiyoung Whang, and Inderjit S. Dhillon. Supported by NSF CAREER 1149756-CCF.

Code: bit.ly/dgleich-codes

slide-2
SLIDE 2

Today’s talk

  • 1. Personalized PageRank based community detection
  • 2. Conductance, Egonets, and Network Community Profiles

  • 3. Egonet seeding
  • 4. Improved seeding

David Gleich · Purdue

2

MLG2013

slide-3
SLIDE 3

A community is a set of vertices that is denser inside than out.


slide-4
SLIDE 4

250 node GEOP network in 2 dimensions


slide-5
SLIDE 5

250 node GEOP network in 2 dimensions


slide-6
SLIDE 6

We can find communities using Personalized PageRank (PPR)

[Andersen et al. 2006]

PPR is a Markov chain on nodes

  • 1. with probability α, follow a random edge
  • 2. with probability 1-α, restart at a seed

aka random surfer, aka random walk with restart

The chain has a unique stationary distribution.
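A minimal sketch of this Markov chain, assuming the graph is stored as a dictionary-of-sets with no isolated vertices. The function name ppr_power_iteration, the six-vertex example graph, and the value alpha=0.85 are illustrative choices, not from the talk; alpha here is the probability of following an edge. The stationary distribution is approximated by simply applying the chain many times.

```python
def ppr_power_iteration(G, seeds, alpha=0.85, iters=200):
    """Approximate the PPR stationary distribution by iterating the chain.

    G     : dict mapping vertex -> set of neighbors (undirected, no isolated vertices)
    seeds : vertices the walk restarts at (uniformly)
    alpha : probability of following a random edge; 1-alpha is the restart probability
    """
    seeds = list(seeds)
    restart = {v: 0.0 for v in G}
    for s in seeds:
        restart[s] += 1.0 / len(seeds)
    x = dict(restart)  # start the chain at the seeds
    for _ in range(iters):
        # with probability 1-alpha, restart at a seed
        nxt = {v: (1.0 - alpha) * restart[v] for v in G}
        # with probability alpha, follow a random edge out of the current vertex
        for v in G:
            share = alpha * x[v] / len(G[v])
            for u in G[v]:
                nxt[u] += share
        x = nxt
    return x


# Hypothetical example: two triangles joined by the edge 2-3.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
x = ppr_power_iteration(G, seeds=[0])
left = x[0] + x[1] + x[2]
right = x[3] + x[4] + x[5]
print(left > right)  # the seed's triangle holds most of the probability
```

This touches every vertex on every iteration, so it is not a local operation; it is only meant to make the definition concrete.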


slide-7
SLIDE 7

Personalized PageRank community detection

  • 1. Given a seed, approximate the stationary distribution.

  • 2. Extract the community.

Both are local operations.


slide-8
SLIDE 8

Demo!


slide-9
SLIDE 9

Conductance communities

Conductance is one of the most important community scores [Schaeffer07].

The conductance of a set of vertices is the ratio of edges leaving to total edges:

φ(S) = cut(S) / min(vol(S), vol(S̄)) = (edges leaving the set) / (total edges in the set)

Equivalently, it’s the probability that a random edge leaves the set.

Small conductance ⇔ Good community

Example: cut(S) = 7, vol(S) = 33, vol(S̄) = 11, so φ(S) = 7/11
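The definition can be sketched directly in code, assuming a dictionary-of-sets graph. The function name conductance, the example graph, and the convention of returning 1.0 when one side has zero volume are assumptions made for illustration.

```python
def conductance(G, S):
    """phi(S) = cut(S) / min(vol(S), vol(complement of S)).

    G : dict mapping vertex -> set of neighbors (undirected)
    S : iterable of vertices
    """
    S = set(S)
    vol_S = sum(len(G[v]) for v in S)          # sum of degrees inside S
    vol_total = sum(len(G[v]) for v in G)      # sum of all degrees
    cut = sum(1 for v in S for u in G[v] if u not in S)  # edges leaving S
    denom = min(vol_S, vol_total - vol_S)
    # Assumed convention: an empty or all-encompassing set gets conductance 1.
    return cut / denom if denom > 0 else 1.0


# Hypothetical example: two triangles joined by the edge 2-3.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
phi_triangle = conductance(G, {0, 1, 2})  # cut = 1, vol = 7
phi_single = conductance(G, {0})          # cut = 2, vol = 2
print(phi_triangle, phi_single)
```

For the numbers on the slide, the same arithmetic is 7 / min(33, 11) = 7/11.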


slide-10
SLIDE 10

Andersen-Chung-Lang personalized PageRank community theorem

[Andersen et al. 2006]

Informally

Suppose the seeds are in a set of good conductance; then the personalized PageRank method will find a set with conductance that’s nearly as good.

… also, it’s really fast.


slide-11
SLIDE 11

    import collections

    # G is the graph as a dictionary-of-sets; seed is a list of seed vertices
    alpha = 0.99
    tol = 1e-4

    x = {}  # Store x, r as dictionaries
    r = {}  # initialize residual
    Q = collections.deque()  # initialize queue
    for s in seed:
        r[s] = 1.0 / len(seed)
        Q.append(s)
    while len(Q) > 0:
        v = Q.popleft()  # v has r[v] > tol*deg(v)
        if v not in x:
            x[v] = 0.
        x[v] += (1 - alpha) * r[v]
        mass = alpha * r[v] / (2 * len(G[v]))
        for u in G[v]:  # for neighbors of v
            if u not in r:
                r[u] = 0.
            if r[u] < len(G[u]) * tol and \
               r[u] + mass >= len(G[u]) * tol:
                Q.append(u)  # add u to queue if large
            r[u] = r[u] + mass
        r[v] = mass * len(G[v])
        if r[v] >= len(G[v]) * tol:
            Q.append(v)  # (added) re-queue v if its residual is still large
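The slides do not spell out how the second step, extracting the community, is done. A common choice for a PPR vector like x is a sweep cut: sort vertices by x[v]/deg(v) and keep the prefix with the smallest conductance. The sketch below wraps the push loop above in a function and adds such a sweep; the names ppr_push and sweep_cut, the re-queue step, and the example graph are assumptions for illustration, not the talk's code.

```python
import collections


def ppr_push(G, seed, alpha=0.99, tol=1e-4):
    """Push-style approximation of PPR, following the loop on the slide."""
    x, r = {}, {}
    Q = collections.deque()
    for s in seed:
        r[s] = 1.0 / len(seed)
        Q.append(s)
    while Q:
        v = Q.popleft()
        x[v] = x.get(v, 0.0) + (1 - alpha) * r[v]
        mass = alpha * r[v] / (2 * len(G[v]))  # share sent to each neighbor
        for u in G[v]:
            old = r.get(u, 0.0)
            if old < len(G[u]) * tol <= old + mass:
                Q.append(u)  # u's residual just became large
            r[u] = old + mass
        r[v] = mass * len(G[v])  # half of the remaining mass stays at v
        if r[v] >= len(G[v]) * tol:
            Q.append(v)  # assumed re-queue step so no large residual is left
    return x


def sweep_cut(G, x):
    """Return (best_set, best_conductance) over prefixes sorted by x[v]/deg(v)."""
    total_vol = sum(len(G[v]) for v in G)
    order = sorted(x, key=lambda v: x[v] / len(G[v]), reverse=True)
    S, vol, cut = set(), 0, 0
    best_set, best_phi = None, float("inf")
    for v in order:
        inside = sum(1 for u in G[v] if u in S)
        vol += len(G[v])
        cut += len(G[v]) - 2 * inside  # edges into S stop being cut edges
        S.add(v)
        denom = min(vol, total_vol - vol)
        if denom == 0:
            break  # the prefix is the whole graph
        phi = cut / denom
        if phi < best_phi:
            best_phi, best_set = phi, set(S)
    return best_set, best_phi


# Hypothetical example: two triangles joined by the edge 2-3, seeded at vertex 0.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
x = ppr_push(G, [0])
community, phi = sweep_cut(G, x)
print(sorted(community), phi)
```

Both functions only look at vertices reached by the push and their neighbors, except for the total volume, which is a single number that can be precomputed once per graph.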


slide-12
SLIDE 12

Demo 2!


slide-13
SLIDE 13

Problem 1, which seeds?


slide-14
SLIDE 14

Problem 2, not fast enough.


slide-15
SLIDE 15

Gleich-Seshadhri, KDD 2012


Neighborhoods are good communities


slide-16
SLIDE 16

Egonets and Conductance

Gleich-Seshadhri, KDD 2012

Vertex neighborhoods (egonets) are good conductance communities

… in graphs that look like social and information networks

slide-17
SLIDE 17

Vertex neighborhoods or Egonets

The induced subgraph of the set consisting of a vertex and its neighbors.

Prior research on egonets of social networks from the “structural holes” perspective [Burt95, Kleinberg08].

Used for anomaly detection [Akoglu10], community seeds [Huang11, Schaeffer11], overlapping communities [Schaeffer07, Rees10].
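As a small illustration of the definition, here is a sketch that builds the egonet of a vertex from a dictionary-of-sets graph. The function name egonet and the example graph are hypothetical.

```python
def egonet(G, v):
    """Induced subgraph on v and its neighbors, as a dictionary-of-sets."""
    nodes = {v} | set(G[v])
    # keep only edges whose endpoints are both in the neighborhood
    return {u: G[u] & nodes for u in nodes}


# Hypothetical example: two triangles joined by the edge 2-3.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
ego2 = egonet(G, 2)
print(ego2)
```

Vertex 3 appears in the egonet of vertex 2, but its edges to 4 and 5 are dropped because those vertices are not neighbors of 2.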


slide-18
SLIDE 18

Simple version of theorem

If global clustering coefficient = 1, then the graph is a disjoint union of cliques. Vertex neighborhoods are optimal communities!


slide-19
SLIDE 19

Theorem

Condition: Let graph G have clustering coefficient κ and have vertex degrees bounded by a power-law function with exponent γ less than 3.

[Figure: log-log plot of probability versus degree, with the degree distribution bounded between α₁n/d^γ and α₂n/d^γ]

Theorem: Then there exists a vertex neighborhood with conductance

φ ≤ 4(1 − κ)/(3 − 2κ)

slide-20
SLIDE 20

Confession: The theory is weak

φ(S) ≤ 4(1 − κ)/(3 − 2κ)

Graph               Verts      Edges       κ      C̄
ca-AstroPh          17903      196972      0.318  0.633
email-Enron         33696      180811      0.085  0.509
cond-mat-2005       36458      171735      0.243  0.657
arxiv               86376      517563      0.560  0.678
dblp                226413     716460      0.383  0.635
hollywood-2009      1069126    56306653    0.310  0.766
fb-Penn94           41536      1362220     0.098  0.212
fb-A-oneyear        1138557    4404989     0.038  0.060
fb-A                3097165    23667394    0.048  0.097
soc-LiveJournal1    4843953    42845684    0.118  0.274
oregon2-010526      11461      32730       0.037  0.352
p2p-Gnutella25      22663      54693       0.005  0.005
as-22july06         22963      48436       0.011  0.230
itdk0304            190914     607610      0.061  0.158

Collaboration networks κ ~ [0.1 – 0.5]
Social networks κ ~ [0.05 – 0.1]
Tech. networks κ ~ [0.005 – 0.05]

This bound is useless unless κ ≥ 1/2
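To see why the bound is weak, it helps to evaluate it. The sketch below plugs a few clustering coefficients from the table into 4(1 − κ)/(3 − 2κ); the function name neighborhood_bound is made up for this example. Since conductance never exceeds 1, any value of 1 or more says nothing.

```python
def neighborhood_bound(kappa):
    """Upper bound 4(1 - kappa)/(3 - 2*kappa) on the best neighborhood conductance."""
    return 4 * (1 - kappa) / (3 - 2 * kappa)


# Clustering coefficients taken from the table on this slide.
examples = [("arxiv", 0.560), ("ca-AstroPh", 0.318), ("fb-A-oneyear", 0.038)]
for name, kappa in examples:
    bound = neighborhood_bound(kappa)
    verdict = "informative" if bound < 1 else "trivial"
    print(name, round(bound, 3), verdict)
```

The bound equals exactly 1 at κ = 1/2 and drops to 0 at κ = 1, so among these three graphs only arxiv (κ = 0.560) gets a bound below 1.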


slide-21
SLIDE 21

We view this theory as “intuition for the truth”


slide-22
SLIDE 22

Empirical Evaluation using Network Community Profiles

[Figure: network community profile for fb-A-oneyear, log-log axes. x-axis: Community Size (10^0 to 10^5, max deg marked); y-axis: Minimum conductance for any community of the given size (10^-4 to 10^0)]

Approximate canonical shape found by Leskovec, Lang, Dasgupta, and Mahoney. Holds for a variety of approximations to conductance.

slide-23
SLIDE 23

Empirical Evaluation using Network Community Profiles

[Figure: egonet community profile for fb-A-oneyear (1.1M verts, 4M edges), log-log axes. x-axis: Community Size (Degree + 1); y-axis: Minimum conductance for any neighborhood of the given size]

“Egonet community profile” shows the same shape, 3 secs to compute.

The Fiedler community computed from the normalized Laplacian is a neighborhood!
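A naive sketch of an egonet community profile, assuming a dictionary-of-sets graph: for every vertex, score its neighborhood by conductance and keep the minimum for each size (degree + 1). The names conductance and egonet_profile and the example graph are assumptions; this is not the implementation behind the 3-second timing.

```python
def conductance(G, S):
    """cut(S) / min(vol(S), vol(complement)); returns 1.0 if either side is empty."""
    S = set(S)
    vol_S = sum(len(G[v]) for v in S)
    vol_total = sum(len(G[v]) for v in G)
    cut = sum(1 for v in S for u in G[v] if u not in S)
    denom = min(vol_S, vol_total - vol_S)
    return cut / denom if denom > 0 else 1.0


def egonet_profile(G):
    """Map neighborhood size (degree + 1) -> minimum conductance at that size."""
    profile = {}
    for v in G:
        S = {v} | set(G[v])
        phi = conductance(G, S)
        if phi < profile.get(len(S), float("inf")):
            profile[len(S)] = phi
    return profile


# Hypothetical example: two triangles joined by the edge 2-3.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
profile = egonet_profile(G)
print(profile)
```

Plotting the resulting size and conductance pairs on log-log axes gives the kind of curve shown in the figure.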

Facebook data from Wilson et al. 2009

slide-24
SLIDE 24

Not just one graph

[Figure: egonet community profiles for further graphs, log-log axes as on the previous slide (x-axis: verts, max deg marked). Panels: arXiv – 86k verts, 500k edges; soc-LiveJournal – 5M verts, 42M edges]

15 more graphs available www.cs.purdue.edu/~dgleich/codes/neighborhoods

slide-25
SLIDE 25

Filling in the Network Community Profile

[Figure: egonet community profile for fb-A-oneyear (Facebook Sample - 1.1M verts, 4M edges), log-log axes. x-axis: Community Size (Degree + 1); y-axis: Minimum conductance for any neighborhood of the given size]

We are missing a region of the NCP when we just look at neighborhoods

slide-26
SLIDE 26

Filling in the Network Community Profile

[Figure: network community profile for fb-A-oneyear (Facebook Sample - 1.1M verts, 4M edges), log-log axes. x-axis: Community Size; y-axis: Minimum conductance for any community of the given size. 7807 seconds to compute]

This region fills when using the PPR method (like now!)

slide-27
SLIDE 27

Am I a good seed? Locally Minimal Communities

“My conductance is the best locally.”

φ(N(v)) ≤ φ(N(w)) for all w adjacent to v

In Zachary’s Karate Club network, there are four locally minimal communities, the two leaders and two peripheral nodes.
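The locally minimal condition translates almost directly into code. This is a sketch under the assumption of a dictionary-of-sets graph; the names conductance and locally_minimal_seeds and the small example graph are hypothetical, and ties are treated as minimal because the condition uses ≤.

```python
def conductance(G, S):
    """cut(S) / min(vol(S), vol(complement)); returns 1.0 if either side is empty."""
    S = set(S)
    vol_S = sum(len(G[v]) for v in S)
    vol_total = sum(len(G[v]) for v in G)
    cut = sum(1 for v in S for u in G[v] if u not in S)
    denom = min(vol_S, vol_total - vol_S)
    return cut / denom if denom > 0 else 1.0


def locally_minimal_seeds(G):
    """Vertices v with phi(N(v)) <= phi(N(w)) for all w adjacent to v."""
    # N(v) is the vertex together with its neighbors
    phi = {v: conductance(G, {v} | set(G[v])) for v in G}
    return [v for v in G if all(phi[v] <= phi[w] for w in G[v])]


# Hypothetical example: two triangles joined by the edge 2-3.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
seeds = locally_minimal_seeds(G)
print(sorted(seeds))
```

In this toy graph the bridge endpoints 2 and 3 are not locally minimal, because each has a neighbor whose neighborhood is a whole triangle with lower conductance.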


slide-28
SLIDE 28

Locally minimal communities capture extremal neighborhoods

[Figure: egonet profile for fb-A-oneyear (Facebook Sample - 1.1M verts, 4M edges) with locally minimal communities overlaid, log-log axes. x-axis: Community Size]

Red dots are conductance and size of a locally minimal community. Usually about 1% of # of vertices.

The red circles – the best local mins – find the extremes in the egonet profile.

slide-29
SLIDE 29

Filling in the NCP: Growing locally minimal comm.

[Figure: three community profiles for fb-A-oneyear, log-log axes. x-axis: Community Size. Curves: Full NCP, Locally min NCP, Original Egonet]

PPR growing only locally min communities, seeded from entire egonet.

Original Egonet: 3 seconds. Locally min NCP: 283 seconds. Full NCP: 7807 seconds.

slide-30
SLIDE 30

But there’s a small problem. Most people want to cover a network with communities! We just looked at the best.


slide-31
SLIDE 31

The coverage of egonet-grown communities is really bad.

[Figure: Coverage versus Max Conductance for the Facebook network, log-log scale. Marked points: 0.7% coverage at max conductance 0.10, 2.2% at 0.25, 29.8% at 0.88]

With a conductance of 0.1 (not so good) we only cover 1% of the vertices in the network.

Log-Log scale!

slide-32
SLIDE 32

Whang-Gleich-Dhillon, CIKM2013 [upcoming…]

  • 1. Extract part of the graph that might have overlapping communities.
  • 2. Compute a partitioning of the network into many pieces (think sqrt(n)) using Graclus.
  • 3. Find the center of these partitions.
  • 4. Use PPR to grow egonets of these centers.

slide-33
SLIDE 33

A good partitioning helps

[Figure: Maximum Conductance (0.1 to 1) versus Coverage (percentage, 10 to 100) on the Flickr social network (flickr sample - 2M verts, 22M edges). Seeding methods compared: egonet, graclus centers, spread hubs, random, bigclam]

We can cover 95% of network with communities of cond. ~0.15.

slide-34
SLIDE 34

And helps to find real-world overlapping communities too.

[Figure: F1 and F2 scores (axis range 0.1 to 0.24) on DBLP. Methods compared: demon, bigclam, graclus centers, spread hubs, random, egonet]

Using datasets from Yang and Leskovec (WSDM 2013) with known overlapping community structure.

Our method outperforms current state-of-the-art overlapping community detection methods. Even randomly seeded!

slide-35
SLIDE 35

Conclusion & Discussion

  • PPR community detection is fast [Andersen et al. FOCS06]
  • PPR communities look real [Abrahao et al. KDD2012; Zhu et al. ICML2013]
  • Egonet analysis reveals basis of NCP
  • Partitioning for seeding yields high coverage & real communities.

Discussion

  • “Caveman” communities?
  • Best conductance cut at intersection of communities?

References

  • Gleich & Seshadhri KDD2012
  • Whang, Gleich & Dhillon CIKM2013

Code

  • PPR Sample: bit.ly/18khzO5
  • Egonet seeding: bit.ly/dgleich-code

slide-36
SLIDE 36

Proof Sketch

1) Large clustering coefficient ⇒ many wedges are closed
2) Heavy tailed degree dist ⇒ a few vertices have a very large degree
3) Large degree ⇒ O(d²) wedges ⇒ “most” of wedges

Thus, there must exist a vertex with a high edge density ⇒ “good” conductance
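Step 3 rests on simple counting: a vertex of degree d is the center of d(d − 1)/2 wedges, so a handful of high-degree vertices can own most of them. The sketch below measures that share on a dictionary-of-sets graph; the function names and the star-plus-edge example are made up for illustration.

```python
def wedge_counts(G):
    """Number of wedges (paths of length 2) centered at each vertex: d*(d-1)/2."""
    return {v: len(G[v]) * (len(G[v]) - 1) // 2 for v in G}


def top_degree_wedge_share(G, k):
    """Fraction of all wedges centered at the k highest-degree vertices."""
    w = wedge_counts(G)
    top = sorted(G, key=lambda v: len(G[v]), reverse=True)[:k]
    total = sum(w.values())
    return sum(w[v] for v in top) / total if total else 0.0


# Hypothetical example: a hub with five neighbors, two of which are also linked.
G = {0: {1, 2, 3, 4, 5}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: {0}, 5: {0}}
counts = wedge_counts(G)
share = top_degree_wedge_share(G, 1)
print(counts[0], share)
```

Here the single hub accounts for 10 of the 12 wedges, which is the effect the proof sketch relies on in heavy-tailed graphs.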

Use the probabilistic method to formalize

[Figure: CDF of Number of Wedges (0.2 to 1) versus Degree (log scale, 10^0 to 10^4)]