

SLIDE 1

Scaling Betweenness Centrality using Communication-Efficient Sparse Matrix Multiplication

Edgar Solomonik1,2, Maciej Besta1, Flavio Vella1, and Torsten Hoefler1

1 Department of Computer Science

ETH Zurich

2 Department of Computer Science

University of Illinois at Urbana-Champaign

November 2017


SLIDE 2

Outline

1. Betweenness Centrality: Problem Definition, All-Pairs Shortest-Paths, Brandes’ Algorithm, Parallel Brandes’ Algorithm

2. Sparse Matrix Multiplication: Algebraic Shortest Path Computation, Parallel Sparse Matrix Multiplication

3. Algebraic Parallel Programming: Cyclops Tensor Framework, Performance Results

4. Conclusion


SLIDE 3

Betweenness Centrality Problem Definition

Centrality in Graphs

Betweenness centrality: for each vertex v in G = (V, E), sum the fractions of shortest paths s ∼ t that pass through v,

λ(v) = Σ_{s,t∈V} σv(s, t)/σ(s, t)

  • σ(s, t) is the number (multiplicity) of shortest paths s ∼ t
  • σv(s, t) is the number of shortest paths s ∼ t that pass through v
  • Shortest paths can be unweighted or weighted
  • Centrality is important in the analysis of biological, transport, and social network graphs


SLIDE 4

Betweenness Centrality Problem Definition

Path Multiplicities

  • Let d(s, t) be the shortest distance between vertex s and vertex t
  • The multiplicity of shortest paths σ(s, t) is the number of distinct paths s ∼ t with distance d(s, t)
  • If v is on some shortest path s ∼ t, then d(s, t) = d(s, v) + d(v, t)
  • Consequently, all σv(s, t) and λ(v) can be computed given all distances:

σv(s, t) = σ(s, v) σ(v, t) if d(s, t) = d(s, v) + d(v, t), and 0 otherwise
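To make this concrete, here is a minimal sequential C++ sketch (illustrative only, not the paper's code) that accumulates λ from precomputed all-pairs distances d and multiplicities sigma using exactly the relation above; the function name and data layout are assumptions.

#include <vector>

// Accumulate lambda(v) = sum over s,t of sigma_v(s,t)/sigma(s,t), where
// sigma_v(s,t) = sigma(s,v)*sigma(v,t) if d(s,t) = d(s,v) + d(v,t), else 0.
// Assumes d and sigma were already produced by some APSP routine.
std::vector<double> betweenness_from_apsp(const std::vector<std::vector<long long>>& d,
                                          const std::vector<std::vector<long long>>& sigma) {
  int n = (int)d.size();
  std::vector<double> lambda(n, 0.0);
  for (int s = 0; s < n; ++s)
    for (int t = 0; t < n; ++t) {
      if (s == t || sigma[s][t] == 0) continue;              // skip trivial or unreachable pairs
      for (int v = 0; v < n; ++v) {
        if (v == s || v == t) continue;                      // endpoints are not counted
        if (sigma[s][v] == 0 || sigma[v][t] == 0) continue;  // v cannot lie on an s~t path
        if (d[s][v] + d[v][t] == d[s][t])                    // v is on some shortest s~t path
          lambda[v] += double(sigma[s][v]) * double(sigma[v][t]) / double(sigma[s][t]);
      }
    }
  return lambda;
}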


SLIDE 5

Betweenness Centrality All-Pairs Shortest-Paths

Betweenness Centrality by All-Pairs Shortest-Paths

We can obtain d(s, t) for all s, t by all-pairs shortest-paths (APSP). Multiplicities (σ and σv for each v) are easy to get given the distances. However, the cost of APSP is prohibitive; for n-node graphs:

  • Q = Θ(n³) work with typical algorithms (e.g. Floyd-Warshall)
  • D = Θ(log(n)) depth1
  • M = Θ(n²/p) memory footprint per processor

APSP does not effectively exploit graph sparsity.

1Tiskin, Alexander. ”All-pairs shortest paths computation in the BSP model.” Automata, Languages and Programming (2001): 178-189.
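For reference, the Θ(n³) work bound corresponds to a textbook (min,+) Floyd-Warshall loop like the sketch below (not the paper's method); the distance matrix is assumed to be initialized with edge weights, ∞ for missing edges, and 0 on the diagonal.

#include <vector>
#include <algorithm>
#include <limits>

// Textbook Floyd-Warshall APSP: Theta(n^3) work and Theta(n^2) storage,
// regardless of how sparse the graph is.
void floyd_warshall(std::vector<std::vector<double>>& d) {
  int n = (int)d.size();
  for (int k = 0; k < n; ++k)
    for (int i = 0; i < n; ++i)
      for (int j = 0; j < n; ++j)
        d[i][j] = std::min(d[i][j], d[i][k] + d[k][j]);  // relax paths through vertex k
}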


SLIDE 6

Betweenness Centrality Brandes’ Algorithm

Brandes’ Algorithm for Betweenness Centrality

  • Ulrik Brandes proposed a memory-efficient method1
  • Compute d(s, ⋆) and σ(s, ⋆) for a given source vertex s
  • Using these, calculate partial centrality factors ζ(s, v), so that

ζ(s, v) = Σ_{t∈V, d(s,v)+d(v,t)=d(s,t)} σ(v, t)/σ(s, t)

  • Construct the centrality scores from the partial centrality factors:

λ(v) = Σ_s σ(s, v) ζ(s, v)

1Brandes, Ulrik. ”A faster algorithm for betweenness centrality.” Journal of Mathematical Sociology 25.2 (2001): 163-177.


SLIDE 7

Betweenness Centrality Brandes’ Algorithm

Shortest Path Tree

If any multiplicity σ(s, t) > 1, the shortest path tree has cross edges


SLIDE 8

Betweenness Centrality Brandes’ Algorithm

Shortest Path Tree Multiplicities

The σ(s, v) value is displayed for each node v, given the colored source vertex s


SLIDE 9

Betweenness Centrality Brandes’ Algorithm

Partial Centrality Factors in Shortest Path Tree

If π(s, v) are the children of v in the shortest path tree from s, then

ζ(s, v) = Σ_{c∈π(s,v)} ( 1/σ(s, c) + ζ(s, c) )


SLIDE 10

Betweenness Centrality Brandes’ Algorithm

Brandes’ Algorithm Overview

For each source vertex s ∈ V (or a batch of source vertices):

  • Compute single-source shortest-paths (SSSP) from s
    For unweighted graphs, use breadth-first search (BFS)
    More viable choices for weighted graphs: Dijkstra, Bellman-Ford, ∆-stepping, ...
  • Perform back-propagation of centrality scores on the shortest path tree from s
    Roughly as hard as BFS regardless of whether G is weighted
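As a reference point, a minimal sequential sketch of one source iteration for an unweighted graph is shown below; it mirrors the BFS plus back-propagation structure above and the ζ recurrence from the previous slide, but it is not the paper's distributed implementation, and all names are illustrative.

#include <queue>
#include <vector>

// One iteration of Brandes' algorithm for an unweighted graph in adjacency-list form:
// BFS from source s yields d(s,*) and sigma(s,*); a sweep in reverse BFS order then
// computes zeta(s,*) and adds sigma(s,v)*zeta(s,v) into the global scores lambda.
void brandes_iteration(const std::vector<std::vector<int>>& adj, int s,
                       std::vector<double>& lambda) {
  int n = (int)adj.size();
  std::vector<int> dist(n, -1);
  std::vector<double> sigma(n, 0.0), zeta(n, 0.0);
  std::vector<int> order;                                  // vertices in nondecreasing-distance order
  dist[s] = 0;
  sigma[s] = 1.0;
  std::queue<int> q;
  q.push(s);
  while (!q.empty()) {                                     // forward phase: BFS/SSSP
    int v = q.front(); q.pop();
    order.push_back(v);
    for (int c : adj[v]) {
      if (dist[c] < 0) { dist[c] = dist[v] + 1; q.push(c); }
      if (dist[c] == dist[v] + 1) sigma[c] += sigma[v];    // c is a child of v in the SP tree
    }
  }
  for (int i = (int)order.size() - 1; i >= 0; --i) {       // backward phase: back-propagation
    int v = order[i];
    for (int c : adj[v])
      if (dist[c] == dist[v] + 1)
        zeta[v] += 1.0 / sigma[c] + zeta[c];               // zeta(s,v) = sum_c 1/sigma(s,c) + zeta(s,c)
    if (v != s) lambda[v] += sigma[v] * zeta[v];           // lambda(v) += sigma(s,v) * zeta(s,v)
  }
}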


SLIDE 11

Betweenness Centrality Parallel Brandes’ Algorithm

Parallelism in Brandes’ Algorithm

Sources of parallelism in Brandes’ algorithm:

  • Computation of SSSP and back-propagation
    Concurrency and efficiency like BFS on graphs
    Bellman-Ford provides maximal concurrency for weighted graphs at the cost of extra work
  • Different source vertices can be processed in parallel as a batch
    Key additional source of concurrency
    Maintaining more distances requires a greater memory footprint, M = Ω(bn/p) for batch size b


SLIDE 12

Sparse Matrix Multiplication Algebraic Shortest Path Computation

Algebraic shortest path computations

Tropical (geodetic) semiring:

  • additive (idempotent) operator: a ⊕ b = min(a, b), identity: ∞
  • multiplicative operator: a ⊗ b = a + b, identity: 0
  • matrix multiplication defined accordingly: C = A ⊗ B ⇒ cij = min_k (aik + bkj)
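A dense (min,+) matrix product over this semiring can be sketched as follows (illustrative C++, with ∞ as the additive identity):

#include <algorithm>
#include <limits>
#include <vector>

// C = A (x) B over the tropical semiring: c_ij = min_k (a_ik + b_kj).
std::vector<std::vector<double>> tropical_matmul(const std::vector<std::vector<double>>& A,
                                                 const std::vector<std::vector<double>>& B) {
  const double INF = std::numeric_limits<double>::infinity();   // additive identity
  int n = (int)A.size(), kk = (int)B.size(), m = (int)B[0].size();
  std::vector<std::vector<double>> C(n, std::vector<double>(m, INF));
  for (int i = 0; i < n; ++i)
    for (int k = 0; k < kk; ++k)
      for (int j = 0; j < m; ++j)
        C[i][j] = std::min(C[i][j], A[i][k] + B[k][j]);         // a (x) b = a + b, a (+) b = min(a, b)
  return C;
}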


SLIDE 13

Sparse Matrix Multiplication Algebraic Shortest Path Computation

Algebraic shortest path computations

Tropical (geodetic) semiring:

  • additive (idempotent) operator: a ⊕ b = min(a, b), identity: ∞
  • multiplicative operator: a ⊗ b = a + b, identity: 0
  • matrix multiplication defined accordingly: C = A ⊗ B ⇒ cij = min_k (aik + bkj)

Bellman-Ford algorithm (SSSP) for an n × n adjacency matrix A:

  1. initialize v(1) = (0, ∞, ∞, . . .)
  2. compute v(n) via the recurrence v(i+1) = v(i) ⊕ (A ⊗ v(i))
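The recurrence above translates almost directly into code; the sketch below uses a dense adjacency matrix purely for clarity (the paper's point is to do this with sparse operands), and the orientation a[i][k] = weight of edge k → i is a convention chosen for this sketch.

#include <algorithm>
#include <limits>
#include <vector>

// Bellman-Ford as repeated algebraic steps: v(1) = (0, inf, inf, ...),
// v(i+1) = v(i) (+) (A (x) v(i)), where (+) = elementwise min and (x) = (min,+) mat-vec.
std::vector<double> bellman_ford(const std::vector<std::vector<double>>& A, int src) {
  const double INF = std::numeric_limits<double>::infinity();
  int n = (int)A.size();
  std::vector<double> v(n, INF);
  v[src] = 0.0;                                            // v(1)
  for (int it = 1; it < n; ++it) {                         // n-1 rounds reach a fixed point
    std::vector<double> y(n, INF);
    for (int i = 0; i < n; ++i)                            // y = A (x) v
      for (int k = 0; k < n; ++k)
        y[i] = std::min(y[i], A[i][k] + v[k]);
    for (int i = 0; i < n; ++i)
      v[i] = std::min(v[i], y[i]);                         // v = v (+) y
  }
  return v;
}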


SLIDE 14

Sparse Matrix Multiplication Algebraic Shortest Path Computation

Algebraic View of Brandes’ Algorithm

Given the frontier vector x(i) and tentative distances w(i):

  • y(i) = A ⊗ x(i) and w(i+1) = w(i) ⊕ y(i)
  • x(i+1) is given by the entries of w(i+1) that differ from w(i)
  • For BFS, each tentative distance changes only once
  • For Bellman-Ford, tentative distances can change multiple times
  • Thus both algorithms require iterative SpMSpV
  • Having a batch size b > 1 transforms the problem into sparse matrix multiplication (SpGEMM or SpMSpM)
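A sequential stand-in for one such SpMSpV step might look like the sketch below, with a hash map playing the role of the sparse frontier vector; this only illustrates the update w(i+1) = w(i) ⊕ (A ⊗ x(i)), not the CTF implementation.

#include <unordered_map>
#include <vector>

struct Edge { int dst; double w; };

// One frontier step: y = A (x) x, w = w (+) y, and the next frontier holds
// exactly the entries of w that improved (differ from the previous w).
std::unordered_map<int, double> frontier_step(const std::vector<std::vector<Edge>>& adj,
                                              std::vector<double>& dist,
                                              const std::unordered_map<int, double>& frontier) {
  std::unordered_map<int, double> next;
  for (const auto& [u, du] : frontier)
    for (const Edge& e : adj[u]) {
      double cand = du + e.w;              // a (x) b = a + b
      if (cand < dist[e.dst]) {            // a (+) b = min(a, b) changed this entry
        dist[e.dst] = cand;
        next[e.dst] = cand;
      }
    }
  return next;
}

For BFS (unit weights) a vertex enters the frontier at most once; for Bellman-Ford it may re-enter whenever its tentative distance improves. With a batch of b sources, the distance and frontier vectors become n × b matrices and the step becomes a sparse-matrix times sparse-matrix product.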


SLIDE 15

Sparse Matrix Multiplication Parallel Sparse Matrix Multiplication

Communication Avoiding Sparse Matrix Multiplication

  • Let the bandwidth cost W be the maximum amount of data communicated by any processor
  • We use an analogue of 1D/2D/3D rectangular matrix multiplication
  • The bandwidth cost of the matrix multiplication Y = AX is then

W = min_{p1p2p3=p} [ nnz(A)/(p1p2) + nnz(X)/(p2p3) + nnz(Y)/(p1p3) ]

  • In our context, nnz(A) = |E| = m, while X holds the current frontiers for b starting vertices, so nnz(X) ≤ nb
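To illustrate the cost expression, a brute-force sketch that evaluates it over all processor-grid factorizations p1·p2·p3 = p and returns the minimum is shown below; this only demonstrates the formula, not how CTF actually chooses its grids.

#include <algorithm>
#include <cstdint>
#include <limits>

// W = min over p1*p2*p3 = p of nnzA/(p1*p2) + nnzX/(p2*p3) + nnzY/(p1*p3).
double min_bandwidth_cost(std::int64_t p, double nnzA, double nnzX, double nnzY) {
  double best = std::numeric_limits<double>::infinity();
  for (std::int64_t p1 = 1; p1 <= p; ++p1) {
    if (p % p1 != 0) continue;
    for (std::int64_t p2 = 1; p2 <= p / p1; ++p2) {
      if ((p / p1) % p2 != 0) continue;
      std::int64_t p3 = p / (p1 * p2);
      double W = nnzA / double(p1 * p2) + nnzX / double(p2 * p3) + nnzY / double(p1 * p3);
      best = std::min(best, W);
    }
  }
  return best;
}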


SLIDE 16

Sparse Matrix Multiplication Parallel Sparse Matrix Multiplication

Communication Avoiding Betweenness Centrality

  • Latency cost is proportional to the number of SpMSpM calls
  • Replication of A across SpMSpMs minimizes the bandwidth cost W
  • It then suffices to communicate the frontiers X and reduce the results Y
  • For undirected graphs with b starting vertices, the total number of nonzeros in X over all iterations is nb, and for Y it is O(nb)
  • The best choice of b with sufficient memory gives W = O(n√m / p^(2/3))
  • A memory-limited communication cost bound is given in the paper


SLIDE 17

Algebraic Parallel Programming Cyclops Tensor Framework

Cyclops Tensor Framework (CTF) 1

  • Distributed-memory symmetric/sparse tensors in C++ or Python
  • For betweenness centrality, we only use CTF matrices

Matrix<int> A(n, n, AS|SP, World(MPI_COMM_WORLD));
A.read(...); A.write(...); A.slice(...); A.permute(...);

Matrix summation in CTF notation is

B["ij"] += A["ij"];

Matrix multiplication in CTF notation is

Y["ij"] += T["ik"]*X["kj"];

User-defined elementwise functions can be used with either

Y["ij"] += Function<>([](double x){ return 1/x; })(X["ij"]);
Y["ij"] += Function<int,double,double>(...)(A["ik"],X["kj"]);

1E. Solomonik, D. Matthews, J. Hammond, J. Demmel, JPDC 2014
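For orientation, a hypothetical minimal driver around the calls above might look like the following; the <ctf.hpp> header, CTF namespace, and MPI setup are assumptions based on typical CTF usage, and the matrix attributes are copied from the constructor shown on this slide.

#include <mpi.h>
#include <ctf.hpp>
using namespace CTF;

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  {
    int n = 1000;
    World dw(MPI_COMM_WORLD);            // distributed-memory execution context
    Matrix<int> A(n, n, AS|SP, dw);      // attributes as in the constructor shown above
    Matrix<int> X(n, n, SP, dw);         // assumed: SP alone requests sparse storage
    Matrix<int> Y(n, n, SP, dw);
    Y["ij"] += A["ik"] * X["kj"];        // distributed sparse matrix multiplication
  }                                      // CTF objects destroyed before MPI_Finalize
  MPI_Finalize();
  return 0;
}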

SLIDE 18

Algebraic Parallel Programming Cyclops Tensor Framework

CTF Code for Betweenness Centrality

void btwn_central(Matrix<int> A, Matrix<path> P, int n, int k){   // k: batch size (added; implied by Q below)
  Monoid<path> mon(...,
    [](path a, path b){
      if      (a.w < b.w) return a;            // keep the shorter distance
      else if (b.w < a.w) return b;
      else return path(a.w, a.m + b.m);        // tie: sum path multiplicities
    }, ...);

  Matrix<path> Q(n, k, mon);                   // shortest path matrix for a batch of k sources
  Q["ij"] = P["ij"];

  Function<int,path> append([](int w, path p){
    return path(w + p.w, p.m);                 // extend a path by an edge of weight w
  });

  for (int i = 0; i < n; i++)
    Q["ij"] = append(A["ik"], Q["kj"]);
  ...
}
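The monoid's operator realizes the tropical ⊕ on (distance, multiplicity) pairs: the shorter path wins, and on ties the multiplicities are added; append realizes ⊗ by extending a path by one edge while carrying its multiplicity. Iterating the contraction n times is thus the algebraic Bellman-Ford recurrence from earlier, applied to a batch of sources.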


SLIDE 19

Algebraic Parallel Programming Cyclops Tensor Framework

Symmetry and Sparsity by Cyclicity

A cyclic layout provides:

  • preservation of packed symmetric storage format
  • load balance for sparse 1D/2D (vertex/edge) graph blocking
  • obliviousness with respect to graph structure/topology

SLIDE 20

Algebraic Parallel Programming Cyclops Tensor Framework

Data Mapping and Autotuning

The CTF workflow is as follows:

  • All operations are executed bulk synchronously
  • For each product, matrices can be redistributed globally
  • Arbitrary sparsity is supported via compressed-sparse-row (CSR) storage
    Modularity permits alternative sparse matrix representations
  • A performance model is used to select the best contraction algorithm
    Leverages the randomized distribution of nonzeros (edges)
    Model coefficients are tuned using linear regression
  • Layout and algorithm choices are made at runtime using the model (a schematic sketch follows)
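Schematically, that kind of model-driven choice can be pictured as below: each candidate algorithm gets a predicted time that is linear in its estimated message count, word volume, and flops, with coefficients obtained by regression, and the cheapest candidate is picked at runtime. This is an illustration of the idea, not CTF's actual model.

#include <cstddef>
#include <limits>
#include <vector>

struct CostModel { double alpha, beta, gamma; };        // regressed coefficients
struct Candidate { double messages, words, flops; };    // per-algorithm estimates

// Pick the candidate algorithm with the smallest predicted time
// alpha*messages + beta*words + gamma*flops.
std::size_t pick_algorithm(const CostModel& m, const std::vector<Candidate>& cands) {
  std::size_t best = 0;
  double best_time = std::numeric_limits<double>::infinity();
  for (std::size_t i = 0; i < cands.size(); ++i) {
    double t = m.alpha * cands[i].messages + m.beta * cands[i].words + m.gamma * cands[i].flops;
    if (t < best_time) { best_time = t; best = i; }
  }
  return best;
}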


SLIDE 21

Algebraic Parallel Programming Performance Results

CTF Performance for Betweenness Centrality

  • Implementation uses CTF SpGEMM adaptively with sparse or dense output (push or pull)
  • We compare with CombBLAS, which uses semirings and BFS (unweighted only)

[Plot: Strong scaling of MFBC for real graphs; MTEPS/node vs. #nodes (2 to 128); series: CTF-MFBC on Friendster, Orkut, LiveJournal, and Patents]

[Plot: Strong scaling for R-MAT S=22 graph; MTEPS/node vs. #nodes (2 to 128); series: CTF-MFBC (unweighted and weighted) and CombBLAS (unweighted) for edge factors E=128 and E=8]

Friendster has 66 million vertices and 1.8 billion edges (results on Blue Waters, Cray XE6)


SLIDE 22

Conclusion

Conclusions and Future Work

Summary of algorithmic contributions

  • Parallel communication-avoiding betweenness centrality algorithm
  • Better sparse matrix multiplication for unbalanced nonzero counts
  • Algorithms and implementation are general to weighted graphs

Future work

  • Use of ∆-stepping or other more work-efficient SSSP algorithms
  • Optimizations in conjunction with approximation algorithms

Cyclops Tensor Framework

  • Graphs are one of many applications; other highlights include
    Petascale high-accuracy quantum chemistry
    56-qubit (largest ever) quantum computing simulation

Already provides most functionality proposed in GraphBLAS 1, plus all of that for tensors (hypergraphs with uniform size nets)


SLIDE 23

Conclusion

Backup slides


SLIDE 24

Conclusion

Sparse tensor application: strong scaling

We study the time to solution of the sparse MP3 code, with (1) dense V and T, (2) sparse V and dense T, (3) sparse V and T

[Plot: Strong scaling of MP3 with no=40, nv=160; seconds/iteration vs. #cores (24 to 768); series: dense, and sparse*dense / sparse*sparse at 10%, 1%, and 0.1% sparsity]


SLIDE 25

Conclusion

Sparse tensor application: weak scaling

We study the scaling to larger problems of the sparse MP3 code, with (1) dense V and T, (2) sparse V and dense T, (3) sparse V and T

[Plot: Weak scaling of MP3 with no=40, nv=160; seconds/iteration vs. #cores (24 to 6144); series: dense, and sparse*dense / sparse*sparse at 10%, 1%, and 0.1% sparsity]


SLIDE 26

Conclusion

Data mapping and autotuning

Transitions between contractions require redistribution and refolding. Base distribution for each tensor:

  • default over all processors
  • or the user can specify any processor grid mapping

To contract, a tensor is redistributed globally and matricized locally. Arbitrary sparsity is supported via compressed-sparse-row (CSR) storage. A performance model is used to select the best contraction algorithm.
