Slide 1

Parallel Combinatorial BLAS and Applications in Graph Computations

Aydın Buluç John R. Gilbert

University of California, Santa Barbara Adapted from talks at SIAM conferences

Slide 2

Primitives for Graph Computations

  • By analogy to numerical linear algebra: what would the combinatorial BLAS look like?

[Chart: fraction of machine peak attained by BLAS 3 (n-by-n matrix-matrix multiply), BLAS 2 (n-by-n matrix-vector multiply), and BLAS 1 (sum of scaled n-vectors)]

Slide 3

Real-World Graphs

Properties:

  • Huge (billions of vertices/edges)
  • Very sparse (typically m = O(n))
  • Scale-free [maybe]
  • Community structure [maybe]

Examples:

  • World-wide web
  • Science citation graphs
  • Online social networks
Slide 4

What Kinds of Computations?

  • Some are inherently latency-bound.

→ S-T connectivity

  • Many graph mining algorithms are computationally intensive.

→ Graph clustering
→ Centrality computations

Huge graphs + expensive kernels ⇒ massive parallelism
Very sparse graphs ⇒ sparse data structures (matrices)

Slide 5

The Case for Sparse Matrices

  • Many irregular applications contain sufficient coarse-grained parallelism that can ONLY be exploited using abstractions at the proper level.

Traditional graph computations vs. graphs in the language of linear algebra:

  • Data driven; unpredictable communication patterns ↔ Fixed communication patterns; opportunities for overlap
  • Irregular and unstructured; poor locality of reference ↔ Operations on matrix blocks; exploits the memory hierarchy
  • Fine-grained data accesses; dominated by latency ↔ Coarse-grained parallelism; bandwidth limited

Slide 6

The Case for Primitives

It takes a “certain” level of expertise to get any kind of performance in this jungle of parallel computing.

  • I think you’ll agree with me by the end of the talk :)

480× speedup: all-pairs shortest paths on the GPU, using the right primitive.
[Cartoon: “What’s bandwidth anyway? I can just implement it (w/ enough coffee)” vs. “The right primitive!”]

Slide 7

Identification of Primitives

  • Sparse matrix-matrix multiplication (SpGEMM)

Most general and challenging parallel primitive.

  • Sparse matrix-vector multiplication (SpMV)
  • Sparse matrix-transpose-vector multiplication (SpMVT)

Equivalently, multiplication from the left

  • Addition and other point-wise operations (SpAdd)

Included in SpGEMM, “proudly” parallel

  • Indexing and assignment (SpRef, SpAsgn)

A(I,J), where I and J are arrays of indices. Reduces to SpGEMM.

Matrices on semirings, e.g. (×, +), (and, or), (+, min)
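To make the semiring idea concrete, here is a minimal sketch (not the actual Combinatorial BLAS API) of SpGEMM parameterized by a semiring, with matrices stored as dicts of nonzeros:

```python
# Sketch (not actual Combinatorial BLAS code): SpGEMM over a semiring.
# Matrices are dicts {(row, col): value}; only nonzeros are stored.

def spgemm(A, B, add, mul, identity):
    """C = A * B, where the scalar * and + come from (mul, add)."""
    rows_of_B = {}
    for (k, j), b in B.items():          # group B's nonzeros by row
        rows_of_B.setdefault(k, []).append((j, b))
    C = {}
    for (i, k), a in A.items():          # each A[i,k] meets row k of B
        for j, b in rows_of_B.get(k, []):
            C[(i, j)] = add(C.get((i, j), identity), mul(a, b))
    return C

# Tropical semiring (+, min): one multiply relaxes two-hop path lengths.
INF = float("inf")
A = {(0, 1): 3.0, (1, 2): 4.0}
C = spgemm(A, A, add=min, mul=lambda x, y: x + y, identity=INF)
# C[(0, 2)] holds the length of the path 0 -> 1 -> 2, i.e. 7.0
```

Swapping in (and, or) turns the same routine into boolean reachability, which is exactly why one primitive serves so many graph algorithms.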

Slide 8

Why focus on SpGEMM?

  • Graph clustering (Markov, peer pressure)
  • Shortest path calculations
  • Betweenness centrality
  • Subgraph / submatrix indexing
  • Graph contraction
  • Cycle detection
  • Multigrid interpolation & restriction
  • Colored intersection searching
  • Applying constraints in finite element computations
  • Context-free parsing
  • ...

Slide 9

Comparative Speedup of Sparse 1D & 2D

In practice, 2D algorithms have the potential to scale if implemented correctly; overlapping communication and maintaining load balance are crucial.

Slide 10

2-D example: Sparse SUMMA

[Figure: C = A * B; block Cij accumulates Aik · Bkj as the k-th block column of A meets the k-th block row of B]

  • Cij += Aik * Bkj
  • Based on dense SUMMA
  • Generalizes to nonsquare matrices, etc.
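The SUMMA schedule above can be sketched serially; in this illustrative Python stand-in (the real Sparse SUMMA uses MPI broadcasts of sparse blocks), the "broadcast" at stage k is just indexing A[i][k] and B[k][j]:

```python
# Serial sketch of the SUMMA schedule on a p-by-p logical process grid.

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def madd(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def summa(A, B, p):
    """A, B: p-by-p grids of equally sized square blocks. Accumulates
    C[i][j] += A[i][k] * B[k][j] one stage per k, as in dense SUMMA."""
    nb = len(A[0][0])                    # block dimension
    C = [[[[0] * nb for _ in range(nb)] for _ in range(p)]
         for _ in range(p)]
    for k in range(p):                   # stage k: block column k of A
        for i in range(p):               # is broadcast along grid rows,
            for j in range(p):           # block row k of B along columns
                C[i][j] = madd(C[i][j], matmul(A[i][k], B[k][j]))
    return C
```

Communication-wise, each stage is one broadcast along every grid row and one along every grid column, which is what makes the communication pattern fixed and overlappable.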
Slide 11

Sequential Kernel

  • Strictly O(nnz) data structure
  • Outer-product formulation
  • Work-efficient

[Chart comparing flops, nnz, and n for typical sparse operands]

Standard algorithm is O(nnz + flops + n)
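A minimal sketch of the outer-product formulation (hypothetical helper, not the paper's kernel): C is accumulated as the sum over k of (column k of A) outer (row k of B), so the loop touches only nonzeros and nothing scales with the dimension n:

```python
# Sketch of outer-product SpGEMM: C = sum_k A(:,k) (outer) B(k,:).
# Inputs are sparse per-column / per-row nonzero lists, so the work
# is O(flops) with no O(n) scan per column.

def outer_product_spgemm(A_cols, B_rows):
    """A_cols[k]: nonzeros (i, a) of A's column k;
       B_rows[k]: nonzeros (j, b) of B's row k."""
    C = {}
    for k in A_cols.keys() & B_rows.keys():   # only shared k contribute
        for i, a in A_cols[k]:
            for j, b in B_rows[k]:
                C[(i, j)] = C.get((i, j), 0) + a * b
    return C
```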

Slide 12

Node Level Considerations

Submatrices are hypersparse (i.e. nnz << n):

  • With a √p × √p grid of blocks and an average of c nonzeros per column, total storage for CSC-style blocks grows as O(n√p + nnz).
  • A data structure or algorithm whose cost depends on the matrix dimension n (e.g. CSR or CSC) is asymptotically too wasteful for submatrices.
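The Combinatorial BLAS answers this with DCSC (doubly compressed sparse columns). A simplified sketch (the real DCSC also carries an AUX array for fast column lookup, omitted here): column pointers exist only for nonempty columns, so storage is O(nnz) independent of n:

```python
# Simplified DCSC construction: unlike CSC, column pointers are kept
# only for nonempty columns, so total size is O(nnz) regardless of n.

def to_dcsc(triples):
    """triples: (row, col, val) tuples. Returns (JC, CP, IR, NUM):
    JC = indices of nonempty columns, CP = start of each such column
    in IR/NUM, IR = row indices, NUM = numerical values."""
    triples = sorted(triples, key=lambda t: (t[1], t[0]))  # column-major
    JC, CP, IR, NUM = [], [0], [], []
    for r, c, v in triples:
        if not JC or JC[-1] != c:      # a new nonempty column begins
            if JC:
                CP.append(len(IR))
            JC.append(c)
        IR.append(r)
        NUM.append(v)
    CP.append(len(IR))
    return JC, CP, IR, NUM
```

A CSC representation of the same matrix would need n+1 column pointers even if only two columns were nonempty.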

Slide 13

Addressing the Load Balance

  • Random permutations are useful, but...
  • Bulk synchronous algorithms may still suffer (every stage waits for the slowest processor).
  • Asynchronous algorithms have no notion of stages.
  • Overall, no significant imbalance.

RMat: Model for graphs with high variance on degrees
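A toy illustration of why a random symmetric permutation helps a 2D block layout (the permutation below is hand-picked to stand in for a random one, so the example is deterministic; it is not the paper's experiment):

```python
# Sketch: a symmetric permutation spreads clustered nonzeros over the grid.

def block_nnz(coords, n, p):
    """Count nonzeros in each block of a p-by-p block decomposition."""
    b = n // p
    counts = [[0] * p for _ in range(p)]
    for i, j in coords:
        counts[i // b][j // b] += 1
    return counts

n, p = 8, 2
# A tight "community": all nonzeros sit in the top-left quadrant.
coords = [(i, j) for i in range(4) for j in range(4)]
perm = [3, 6, 1, 4, 7, 2, 5, 0]          # stand-in for a random permutation
permuted = [(perm[i], perm[j]) for i, j in coords]

before = block_nnz(coords, n, p)          # all 16 nonzeros in one block
after = block_nnz(permuted, n, p)         # evenly spread across the grid
```

Note this helps clustered structure, not pure degree skew: a single ultra-high-degree row stays concentrated in one block row under any symmetric permutation, which is where asynchronous execution earns its keep.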

Slide 14

Asynchronous Implementation

[Diagram: Sparse2D<I,N> holds a grid of DCSC<I,N> blocks, O(nnz) storage; remote get using MPI-2]

  • Two-dimensional block layout
  • (Passive target) remote-memory access
  • Avoids hot spots: with very high probability, a block is accessed by at most one remote get operation at any given time

Slide 15

Scaling Results for SpGEMM

  • Asynchronous implementation

One-sided MPI-2

  • Runs on TACC’s Lonestar cluster
  • Dual-core, dual-socket Intel Xeon, 2.66 GHz

  • RMat × RMat product; average degree (nnz/n) ≈ 8

Slide 16

Applications and Algorithms

Betweenness Centrality

CB(v): among all shortest paths, what fraction passes through the vertex of interest?

Brandes’ algorithm
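For reference (standard definitions, added here for completeness rather than copied from the slides), betweenness centrality and the dependency recurrence that powers Brandes' algorithm are:

```latex
C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}},
\qquad
\delta_s(v) = \sum_{w \,:\, v \in \mathrm{pred}_s(w)}
              \frac{\sigma_{sv}}{\sigma_{sw}} \bigl(1 + \delta_s(w)\bigr),
\qquad
C_B(v) = \sum_{s \neq v} \delta_s(v),
```

where σ_st is the number of shortest s–t paths and σ_st(v) the number of those passing through v. The recurrence lets one source's contribution to every vertex be tallied in a single backward sweep over the BFS levels.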

A typical software stack for an application enabled with the Combinatorial BLAS

Slide 17

Betweenness Centrality using Sparse Matrices [Robinson, Kepner]

  • Adjacency matrix: sparse array w/ nonzeros for graph edges
  • Storage-efficient implementation from sparse data structures
  • Betweenness Centrality algorithm:
    1. Pick a starting vertex, v.
    2. Compute shortest paths from v to all other nodes.
    3. Starting with the most distant nodes, roll back and tally paths.

[Figure: adjacency matrix AT of a 7-vertex example graph, multiplied by the indicator vector x of the starting vertex]

Slide 18

Betweenness Centrality using BFS

[Figure: repeated update x ← (ATx) .* ¬x on the 7-vertex example; successive frontiers t1, t2, t3, t4 are accumulated into x]

  • Every iteration, another level of the BFS is discovered.
  • Sparsity is preserved, but sparse matrix times sparse vector has very little potential parallelism (only o(nnz) work).
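The update x ← (ATx) .* ¬x can be sketched directly; in this illustrative Python version the matvec over a boolean semiring is mirrored by following sparse out-neighbor lists, and the ¬x mask by the visited check:

```python
# Sketch of BFS as repeated sparse matvec over a boolean semiring:
# each step computes frontier <- (A^T x) .* (not visited).

def bfs_levels(edges, n, source):
    """edges: directed (u, v) pairs; returns each vertex's BFS level
    from `source`, or -1 if unreachable."""
    out = {}
    for u, v in edges:
        out.setdefault(u, []).append(v)
    level = [-1] * n
    level[source] = 0
    frontier = {source}                   # the sparse vector x
    d = 0
    while frontier:
        d += 1
        nxt = set()
        for u in frontier:                # (A^T x): follow edges out of x
            for v in out.get(u, []):
                if level[v] == -1:        # the .* (not visited) mask
                    level[v] = d
                    nxt.add(v)
        frontier = nxt
    return level
```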

Slide 19

Parallelism: Multiple-source BFS

[Figure: X ← (ATX) .* ¬X on the 7-vertex example, with a block of frontier columns X, one per source]

  • Batch processing of multiple source vertices
  • Sparse matrix-matrix multiplication => work efficient
  • Potential parallelism is much higher
  • Same applies to the tallying phase
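The batched form can be sketched as follows (a serial simulation: the per-source loop below stands in for what the real code does as one sparse matrix-matrix product per level, with one frontier column per source):

```python
# Sketch of multiple-source BFS: the frontier is conceptually an n-by-b
# sparse matrix X, so one step is (A^T X) .* (not Visited) rather than
# b separate matvecs.

def multi_bfs(edges, n, sources):
    """Returns {source: {vertex: level}} for every reachable vertex."""
    out = {}
    for u, v in edges:
        out.setdefault(u, []).append(v)
    levels = {s: {s: 0} for s in sources}
    frontier = {s: {s} for s in sources}   # the columns of X
    d = 0
    while any(frontier.values()):
        d += 1
        for s in sources:                  # one batched SpGEMM in reality
            nxt = set()
            for u in frontier[s]:
                for v in out.get(u, []):
                    if v not in levels[s]:
                        levels[s][v] = d
                        nxt.add(v)
            frontier[s] = nxt
    return levels
```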
Slide 20

Betweenness Centrality on Combinatorial BLAS

Batch processing greatly helps for large p. An RMAT graph of scale N has 2^N vertices and 8·2^N edges.

  • Likely to perform better on large inputs
  • Code only a few lines longer than the Matlab version

[Plots: BC performance in TEPS (Traversed Edges per Second), roughly 10–140 million, vs. core count (4–256), for RMAT scale 16 with batch sizes 256 and 512]

Slide 21

Betweenness Centrality on Combinatorial BLAS

Fundamental trade-off: Parallelism vs memory usage

[Plot: BC performance in TEPS, roughly 20–140 million, vs. core count (64–256), for RMAT scales 16 and 17 with batch sizes 256 and 512]

Slide 22

Thank You !

Questions?