

SLIDE 1

Parallel Community Detection for Massive Graphs

  • E. Jason Riedy, Henning Meyerhenke, David Ediger, and David A. Bader
  • 10th DIMACS Implementation Challenge, 14 February 2012

SLIDE 2

Exascale data analysis

  • Health care: finding outbreaks, population epidemiology
  • Social networks: advertising, searching, grouping
  • Intelligence: decisions at scale, regulating algorithms
  • Systems biology: understanding interactions, drug design
  • Power grid: disruptions, conservation
  • Simulation: discrete events, cracking meshes

  • Graph clustering is common in all application areas.

SLIDE 3

These are not easy graphs.

Yifan Hu’s (AT&T) visualization of the in-2004 data set: http://www2.research.att.com/~yifanhu/gallery.html

SLIDE 4

But no shortage of structure...

Protein interactions: Giot et al., “A Protein Interaction Map of Drosophila melanogaster”, Science 302, 1722-1736, 2003. Jason’s network via LinkedIn Labs.

  • Locally, there are clusters or communities.
  • First pass over a massive social graph:
  • Find smaller communities of interest.
  • Analyze / visualize top-ranked communities.
  • Our part: Community detection at massive scale. (Or kinda large, given available data.)

SLIDE 5

Outline

  • Motivation
  • Shooting for massive graphs
  • Our parallel method
  • Implementation and platform details
  • Performance
  • Conclusions and plans

SLIDE 6

Can we tackle massive graphs now?

Parallel, of course...

  • Massive needs distributed memory, right?
  • Well... Not really. Can buy a 2 TiB Intel-based Dell server on-line for around $200k USD, a 1.5 TiB from IBM, etc.

Image: dell.com.

Not an endorsement, just evidence!

  • Publicly available “real-world” data fits...
  • Start with shared memory to see what needs done.
  • Specialized architectures provide larger shared-memory views over distributed implementations (e.g. Cray XMT).

SLIDE 7

Designing for parallel algorithms

What should we avoid in algorithms?

Rules of thumb:

  • “We order the vertices (or edges) by...” unless followed by bisecting searches.
  • “We look at a region of size more than two steps...” Many target massive graphs have diameter of ≈ 20. More than two steps swallows much of the graph.
  • “Our algorithm requires more than Õ(|E|/#)...” Massive means you hit asymptotic bounds, and |E| is plenty of work.
  • “For each vertex, we do something sequential...” The few high-degree vertices will be large bottlenecks.

Remember: Rules of thumb can be broken with reason.

SLIDE 8

Designing for parallel implementations

What should we avoid in implementations?

Rules of thumb:

  • Scattered memory accesses through traditional sparse matrix representations like CSR. Use your cache lines (see the layout sketch below).

[Layout diagram: separate arrays idx: 32b, idx: 32b, ... and val: 64b, val: 64b, ... versus interleaved records idx1: 32b, idx2: 32b, val1: 64b, val2: 64b, ...]

  • Using too much memory, which is a painful trade-off with parallelism. Think Fortran and workspace...
  • Synchronizing too often. There will be work imbalance; try to use the imbalance to reduce “hot-spotting” on locks or cache lines.

Remember: Rules of thumb can be broken with reason. Some of these help when extending to PGAS / message-passing.
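
A minimal C sketch of the interleaved layout above (illustrative, not the talk's code): keeping an edge's indices and value in one record means touching an edge costs one cache line, instead of gathering from separate index and value arrays as in CSR.

    #include <stddef.h>
    #include <stdint.h>

    /* Separate arrays (CSR-style): reading edge k gathers from two places. */
    struct edges_split {
      uint32_t *idx;   /* idx[k]: endpoint index of edge k */
      int64_t  *val;   /* val[k]: weight of edge k         */
    };

    /* Interleaved record: both 32-bit indices and the 64-bit weight share
     * a cache line, so a pass over the edges streams one contiguous array. */
    struct edge_rec {
      uint32_t idx1, idx2;   /* endpoints */
      int64_t  val;          /* weight    */
    };

    /* Example pass: summing weights walks one stream of 16-byte records. */
    static int64_t sum_weights(const struct edge_rec *e, size_t n)
    {
      int64_t s = 0;
      for (size_t k = 0; k < n; ++k)
        s += e[k].val;
      return s;
    }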

SLIDE 9

Sequential agglomerative method

[Diagram: vertices A through G merged pairwise into growing communities, one edge at a time.]

  • A common method (e.g. Clauset, Newman, & Moore) agglomerates vertices into communities.
  • Each vertex begins in its own community.
  • An edge is chosen to contract.
  • Merging maximally increases modularity.
  • Priority queue.
  • Known often to fall into an O(n²) performance trap with modularity (Wakita & Tsurumi).
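
As a worked sketch of the merge criterion above (standard modularity algebra, not the talk's code): merging communities a and b, with m_ab edges between them, total edge weight m, and community degrees d_a and d_b, changes modularity by ΔQ = m_ab/m - d_a·d_b/(2m²). A CNM-style sequential pass keeps these gains in a priority queue and always contracts the best edge.

    #include <stdio.h>

    /* Modularity gain of merging communities a and b:
     *   dQ = m_ab / m  -  (d_a * d_b) / (2 * m * m)
     * m    : total edge weight in the graph
     * m_ab : edge weight between a and b
     * d_a, d_b : total degree (edge-end weight) of each community */
    static double merge_gain(double m, double m_ab, double d_a, double d_b)
    {
      return m_ab / m - (d_a * d_b) / (2.0 * m * m);
    }

    int main(void)
    {
      /* Toy numbers: 100 edges total, 5 between the two communities,
       * community degrees 20 and 30:  dQ = 0.05 - 0.03 = 0.02. */
      printf("dQ = %g\n", merge_gain(100.0, 5.0, 20.0, 30.0));
      return 0;
    }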


SLIDE 13

Parallel agglomerative method

[Diagram: vertices A through G; matched edges are contracted simultaneously.]

  • We use a matching to avoid the queue.
  • Compute a heavy weight, large matching.
  • Simple greedy algorithm.
  • Maximal matching.
  • Within factor of 2 in weight.
  • Merge all communities at once.
  • Maintains some balance.
  • Produces different results.
  • Agnostic to weighting, matching...
  • Can maximize modularity, minimize conductance.
  • Modifying matching permits easy exploration.
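
A minimal sequential sketch of a greedy weighted matching (the classical 1/2-approximation: scan edges from best score down, keep an edge when both endpoints are free). This illustrates the "simple greedy, maximal, within a factor of 2" bullets; it is not the talk's parallel variant, which appears with the matching routine later.

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct { uint32_t i, j; double score; } scored_edge_t;

    static int by_score_desc(const void *a, const void *b)
    {
      double sa = ((const scored_edge_t *)a)->score;
      double sb = ((const scored_edge_t *)b)->score;
      return (sa < sb) - (sa > sb);   /* sort descending by score */
    }

    /* Greedy matching: accept an edge when both endpoints are unmatched.
     * match[v] holds the matched neighbor, or UINT32_MAX if v stays single. */
    static void greedy_match(scored_edge_t *el, size_t nedge,
                             uint32_t *match, uint32_t nvtx)
    {
      for (uint32_t v = 0; v < nvtx; ++v) match[v] = UINT32_MAX;
      qsort(el, nedge, sizeof *el, by_score_desc);
      for (size_t k = 0; k < nedge; ++k) {
        uint32_t i = el[k].i, j = el[k].j;
        if (i != j && match[i] == UINT32_MAX && match[j] == UINT32_MAX) {
          match[i] = j;
          match[j] = i;
        }
      }
    }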


SLIDE 16

Platform: Cray XMT2

Tolerates latency by massive multithreading.

  • Hardware: 128 threads per processor
  • Context switch on every cycle (500 MHz)
  • Many outstanding memory requests (180/proc)
  • “No” caches...
  • Flexibly supports dynamic load balancing
  • Globally hashed address space, no data cache
  • Support for fine-grained, word-level synchronization
  • Full/empty bit with every memory word
  • 64 processor XMT2 at CSCS, the Swiss National Supercomputing Centre
  • 500 MHz processors, 8192 threads, 2 TiB of shared memory

Image: cray.com

SLIDE 17

Platform: Intel® E7-8870-based server

Tolerates some latency by hyperthreading.

  • Hardware: 2 threads / core, 10 cores / socket, four sockets.
  • Fast cores (2.4 GHz), fast memory (1066 MHz).
  • Not so many outstanding memory requests (60/socket), but large caches (30 MiB L3 per socket).
  • Good system support
  • Transparent hugepages reduce TLB costs.
  • Fast, user-level locking. (HLE would be better...)
  • OpenMP, although I didn’t tune it...
  • mirasol, #17 on Graph500 (thanks to UCB)
  • Four processors (80 threads), 256 GiB memory
  • gcc 4.6.1, Linux kernel 3.2.0-rc5

Image: Intel® press kit

SLIDE 18

Implementation: Data structures

Extremely basic for graph G = (V, E)

  • An array of (i, j; w) weighted edge pairs, each i, j stored only once and packed, uses 3|E| space
  • An array to store self-edges, d(i) = w, |V|
  • A temporary floating-point array for scores, |E|
  • Additional temporary arrays using 4|V| + 2|E| to store degrees, matching choices, offsets...

  • Weights count number of agglomerated vertices or edges.
  • Scoring methods (modularity, conductance) need only vertex-local counts.
  • Storing an undirected graph in a symmetric manner reduces memory usage drastically and works with our simple matcher.
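
A minimal C sketch of these arrays (names are illustrative, not the talk's code): a packed weighted edge list with each undirected edge stored once, per-vertex self-edge weights, and a parallel score array.

    #include <stdint.h>
    #include <stdlib.h>

    /* One packed record per undirected edge {i, j}, stored once: 3|E| words. */
    typedef struct {
      uint32_t i, j;   /* endpoints */
      uint32_t w;      /* weight: count of agglomerated edges */
    } edge_t;

    /* Graph plus the per-step arrays the agglomeration loop reuses. */
    typedef struct {
      uint32_t  nv;      /* |V| */
      size_t    ne;      /* |E| */
      edge_t   *el;      /* packed edge array, 3|E| */
      uint32_t *d;       /* self-edge weights d(i), |V| */
      double   *score;   /* per-edge scores (modularity, conductance), |E| */
    } graph_t;

    static int graph_alloc(graph_t *g, uint32_t nv, size_t ne)
    {
      g->nv = nv;  g->ne = ne;
      g->el    = malloc(ne * sizeof *g->el);
      g->d     = calloc(nv, sizeof *g->d);
      g->score = malloc(ne * sizeof *g->score);
      return (g->el && g->d && g->score) ? 0 : -1;   /* 0 on success */
    }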

SLIDE 19

Implementation: Data structures

Extremely basic for graph G = (V, E)

  • An array of (i, j; w) weighted edge pairs, each i, j stored only once and packed, uses 3|E| 32-bit space
  • An array to store self-edges, d(i) = w, |V|
  • A temporary floating-point array for scores, |E|
  • Additional temporary arrays using 2|V| + |E| 64-bit, 2|V| 32-bit to store degrees, matching choices, offsets...

  • Need to fit uk-2007-05 into 256 GiB.
  • Cheat: Use 32-bit integers for indices. Know we won’t contract so far as to need 64-bit weights.
  • Could cheat further and use 32-bit floats for scores.
  • (Note: Code didn’t bother optimizing workspace size.)

SLIDE 20

Implementation: Data structures

Extremely basic for graph G = (V, E)

  • An array of (i, j; w) weighted edge pairs, each i, j stored only once and packed, uses 3|E| space
  • An array to store self-edges, d(i) = w, |V|
  • A temporary floating-point array for scores, |E|
  • Additional temporary arrays using 2|V| + |E| 64-bit, 2|V| 32-bit to store degrees, matching choices, offsets...

  • Original ignored order in edge array, killed OpenMP.
  • New: Roughly bucket edge array by first stored index. Non-adjacent CSR-like structure.
  • New: Hash i, j to determine order. Scatter among buckets. (A hashing sketch follows below.)
  • (New = MTAAP 2012)
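
A small C sketch of the hashed ordering (illustrative, not the talk's code): a cheap avalanching mix of the endpoint pair decides which endpoint is stored first, so edges scatter among buckets instead of piling onto a few high-degree vertices.

    #include <stdint.h>

    /* Cheap avalanching mix (a MurmurHash3-style finalizer). */
    static uint64_t mix64(uint64_t x)
    {
      x ^= x >> 33;  x *= 0xff51afd7ed558ccdULL;
      x ^= x >> 33;  x *= 0xc4ceb9fe1a85ec53ULL;
      x ^= x >> 33;
      return x;
    }

    /* Decide which endpoint of {i, j} is stored first.  Hashing the pair,
     * rather than always storing min(i, j) first, spreads bucket sizes. */
    static void stored_order(uint32_t i, uint32_t j,
                             uint32_t *first, uint32_t *second)
    {
      uint32_t lo = i < j ? i : j, hi = i < j ? j : i;
      uint64_t key = ((uint64_t)lo << 32) | hi;
      if (mix64(key) & 1) { *first = lo; *second = hi; }
      else                { *first = hi; *second = lo; }
    }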

SLIDE 21

Implementation: Routines

Three primitives: Scoring, matching, contracting

Scoring: trivial.

Matching: repeat until there is no ready, unmatched vertex (a lock-based sketch follows below):

  1. For each unmatched vertex in parallel, find the best unmatched neighbor in its bucket.
  2. Try to point the remote match at that edge (lock, check if best, unlock).
  3. If pointing succeeded, try to point the self-match at that edge.
  4. If both succeeded, yeah! If not and there was some eligible neighbor, re-add self to the ready, unmatched list. (Possibly too simple, but...)
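
A compact OpenMP sketch of one matching round (a simplification under assumed data structures, not the talk's exact lock protocol): each unmatched vertex proposes its best unmatched neighbor, and mutual proposals become matches under per-vertex locks taken in a fixed order.

    #include <omp.h>
    #include <stddef.h>
    #include <stdint.h>

    #define UNMATCHED UINT32_MAX

    /* One matching round.  cand[v] is assumed to hold v's best unmatched
     * neighbor (or UNMATCHED), computed in a prior parallel pass; lock[]
     * holds per-vertex locks already initialized by the caller.  Only
     * mutual proposals are accepted; the caller repeats rounds until no
     * vertex matches. */
    static size_t match_round(uint32_t nv, const uint32_t *cand,
                              uint32_t *match, omp_lock_t *lock)
    {
      size_t nmatched = 0;
      #pragma omp parallel for reduction(+:nmatched) schedule(dynamic, 1024)
      for (uint32_t v = 0; v < nv; ++v) {
        if (match[v] != UNMATCHED) continue;
        uint32_t u = cand[v];
        if (u == UNMATCHED || u == v || cand[u] != v) continue;
        uint32_t lo = v < u ? v : u, hi = v < u ? u : v;
        omp_set_lock(&lock[lo]);        /* fixed lock order: no deadlock */
        omp_set_lock(&lock[hi]);
        if (match[v] == UNMATCHED && match[u] == UNMATCHED) {
          match[v] = u;
          match[u] = v;
          nmatched += 2;
        }
        omp_unset_lock(&lock[hi]);
        omp_unset_lock(&lock[lo]);
      }
      return nmatched;
    }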

SLIDE 22

Implementation: Routines

Contracting (a count / prefix-sum / copy sketch follows below):

  1. Map each i, j to new vertices, re-order by hashing.
  2. Accumulate counts for new i′ bins, prefix-sum for offsets.
  3. Copy into new bins.

  • Only synchronizing in the prefix-sum. That could be removed if I don’t re-order the i′, j′ pair; haven’t timed the difference.
  • Actually, the current code copies twice... On short list for fixing.
  • Binning as opposed to original list-chasing enabled Intel/OpenMP support with reasonable performance.
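
A sequential C sketch of steps 2 and 3 (illustrative names, not the talk's code): count relabeled edges per new first-index bin, prefix-sum the counts into offsets, then copy each edge into its bin. The offsets give the non-adjacent, CSR-like bucket structure used by matching; the hash re-ordering of step 1 is omitted here.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct { uint32_t i, j, w; } edge_t;

    /* Bin relabeled edges by new first index: count, prefix-sum, copy.
     * map[v] is the new community id of old vertex v.  off[] must have
     * nnew + 1 entries; on return off[b]..off[b+1] delimits bin b. */
    static edge_t *bin_edges(const edge_t *el, size_t ne,
                             const uint32_t *map, uint32_t nnew, size_t *off)
    {
      edge_t *out = malloc(ne * sizeof *out);
      size_t *cur = malloc(nnew * sizeof *cur);
      if (!out || !cur) { free(out); free(cur); return NULL; }

      memset(off, 0, (nnew + 1) * sizeof *off);
      for (size_t k = 0; k < ne; ++k)        /* step 2a: per-bin counts   */
        off[map[el[k].i] + 1] += 1;
      for (uint32_t b = 0; b < nnew; ++b)    /* step 2b: prefix sum       */
        off[b + 1] += off[b];
      memcpy(cur, off, nnew * sizeof *cur);  /* next free slot per bin    */

      for (size_t k = 0; k < ne; ++k) {      /* step 3: copy into bins    */
        edge_t e = { map[el[k].i], map[el[k].j], el[k].w };
        out[cur[e.i]++] = e;
      }
      free(cur);
      return out;
    }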

SLIDE 23

Implementation: Routines

[Plot: fraction of running time spent in each primitive (score, match, contract, other) for each DIMACS test graph, on the Intel E7-8870 and the Cray XMT2.]

SLIDE 24

Performance: Time by platform

[Plot: running time per graph, 10^-2 to 10^3 s (log scale), on the Intel E7-8870 and the Cray XMT2; points colored by the number of communities (10^0 to 10^6).]

SLIDE 25

Performance: Rate by platform

[Plot: edges processed per second per graph, 10^4 to 10^7 (log scale), on the Intel E7-8870 and the Cray XMT2; points colored by the number of communities.]

SLIDE 26

Performance: Rate by metric (on Intel)

[Plot: edges processed per second per graph, 10^5 to 10^7 (log scale), on the Intel E7-8870, comparing the cnm and cond scoring metrics; points colored by the number of communities.]

SLIDE 27

Performance: Scaling

[Plot: running time versus number of threads / processors (2^0 to 2^6, log-log) for uk-2002 and kron_g500-simple-logn20. Labeled times include 368.0 s, 33.4 s, 84.9 s, and 6.6 s on the Intel E7-8870, and 1188.9 s, 285.4 s, 349.6 s, and 72.1 s on the Cray XMT2.]

SLIDE 28

Performance: Modularity at coverage ≈ 0.5

[Plot: modularity (0.0 to 0.8) per graph at coverage ≈ 0.5, comparing the cnm, mb, and cond scoring methods.]

SLIDE 29

Performance: Avg. conductance at coverage ≈ 0.5

[Plot: average inter-cluster conductance (AIXC, 0.0 to 0.8) per graph at coverage ≈ 0.5, comparing the cnm, mb, and cond scoring methods.]

SLIDE 30

Performance: Modularity by step

[Plot: modularity (0.0 to 0.8) versus contraction step for coAuthorsCiteseer, eu-2005, and uk-2002.]

SLIDE 31

Performance: Coverage by step

[Plot: coverage (0.0 to 0.8) versus contraction step for coAuthorsCiteseer, eu-2005, and uk-2002.]

SLIDE 32

Performance: # of communities

[Plot: number of communities (10^4 to 10^7, log scale) versus contraction step for coAuthorsCiteseer, eu-2005, and uk-2002.]

SLIDE 33

Performance: AIXC by step

[Plot: average inter-cluster conductance (AIXC, log scale) versus contraction step for coAuthorsCiteseer, eu-2005, and uk-2002.]

SLIDE 34

Performance: Comm. volume by step

[Plot: communication volume (cut weight dashed; 10^5.5 to 10^8, log scale) versus contraction step for coAuthorsCiteseer, eu-2005, and uk-2002.]

SLIDE 35

Conclusions and plans

  • Code: http://www.cc.gatech.edu/~jriedy/community-detection/
  • First: Fix the low-hanging fruit.
  • Eliminate a copy during contraction.
  • Deal with stars (next presentation).
  • Then... Practical experiments.
  • How volatile are modularity and conductance to perturbations?
  • What matching schemes work well?
  • How do different metrics compare in applications?
  • Extending to streaming graph data!
  • Includes developing parallel refinement... (distance-2 matching)
  • And possibly de-clustering or manipulating the dendrogram.

SLIDE 36

Acknowledgment of support
