Parallel Combinatorial BLAS and Applications in Graph Computations
Aydın Buluç and John R. Gilbert, University of California, Santa Barbara
Adapted from talks at SIAM conferences
Primitives for Graph Computations
- By analogy to numerical linear algebra, what would the combinatorial BLAS look like?
[Chart: peak performance of BLAS 3 (n-by-n matrix-matrix multiply), BLAS 2 (n-by-n matrix-vector multiply), and BLAS 1 (sum of scaled n-vectors)]
Real-World Graphs
Properties:
- Huge (billions of vertices/edges)
- Very sparse (typically m = O(n))
- Scale-free [maybe]
- Community structure [maybe]
Examples:
- World-wide web
- Science citation graphs
- Online social networks
What Kinds of Computations?
- Some are inherently latency-bound.
→ S-T connectivity
- Many graph mining algorithms are computationally intensive.
→ Graph clustering
→ Centrality computations
[Diagram: huge graphs + expensive kernels ⇒ massive parallelism; very sparse graphs ⇒ sparse data structures (matrices)]
The Case for Sparse Matrices
- Many irregular applications contain sufficient coarse-grained parallelism that can ONLY be exploited using abstractions at the proper level.
Traditional graph computations vs. graphs in the language of linear algebra:
- Data driven, with unpredictable communication patterns vs. fixed communication patterns with overlapping opportunities
- Irregular and unstructured, with poor locality of reference vs. operations on matrix blocks that exploit the memory hierarchy
- Fine-grained data accesses dominated by latency vs. coarse-grained parallelism that is bandwidth limited
The Case for Primitives
It takes a certain level of expertise to get any kind of performance in this jungle of parallel computing.
- I think you'll agree with me by the end of the talk :)
[Chart: all-pairs shortest paths on the GPU; the right primitive is 480x faster than a naive hand-rolled implementation ("What's bandwidth anyway? I can just implement it with enough coffee")]
Identification of Primitives
- Sparse matrix-matrix multiplication (SpGEMM)
Most general and challenging parallel primitive.
- Sparse matrix-vector multiplication (SpMV)
- Sparse matrix-transpose-vector multiplication (SpMVT)
Equivalently, multiplication from the left
- Addition and other point-wise operations (SpAdd)
Included in SpGEMM, “proudly” parallel
- Indexing and assignment (SpRef, SpAsgn)
A(I,J) where I and J are arrays of indices Reduces to SpGEMM
Matrices over semirings, e.g. (×, +), (and, or), (+, min)
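The semiring abstraction is what lets one multiply routine serve many graph algorithms. A minimal Python sketch (the Combinatorial BLAS itself is a C++ library with templated semirings; the dict-based matrix format here is purely illustrative):

```python
# Sketch: sparse matrix-vector multiply parameterized by a semiring
# (add, mul, zero). Swapping the semiring changes the algorithm:
# (or, and, False) gives one reachability/BFS step, (min, +, inf)
# gives one shortest-path relaxation, and so on.

def spmv_semiring(A, x, add, mul, zero):
    """y = A x over the semiring (add, mul) with identity `zero`.
    A is a dict-of-dicts {row: {col: value}}, x is a dict {index: value}."""
    y = {}
    for i, row in A.items():
        acc = zero
        for j, a_ij in row.items():
            if j in x:                          # only touch stored entries
                acc = add(acc, mul(a_ij, x[j]))
        if acc != zero:                         # keep the result sparse
            y[i] = acc
    return y

# (or, and) semiring: row i holds the in-neighbors of vertex i, so one
# multiply advances a reachability frontier by one step.
A = {1: {0: True}, 2: {1: True}}    # edges 0 -> 1 -> 2
x = {0: True}                       # start from vertex 0
y = spmv_semiring(A, x, lambda a, b: a or b, lambda a, b: a and b, False)
print(y)   # → {1: True}
```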
Why focus on SpGEMM?
- Graph clustering (Markov, peer pressure)
- Shortest path calculations
- Betweenness centrality
- Subgraph / submatrix indexing
- Graph contraction
- Cycle detection
- Multigrid interpolation & restriction
- Colored intersection searching
- Applying constraints in finite element computations
- Context-free parsing ...
Comparative Speedup of Sparse 1D & 2D
- In practice, 2D algorithms have the potential to scale, if implemented correctly; overlapping communication and maintaining load balance are crucial.
2-D example: Sparse SUMMA
[Diagram: block Cij accumulates the products of blocks Aik and Bkj over stages k]
- Cij += Aik * Bkj
- Based on dense SUMMA
- Generalizes to nonsquare matrices, etc.
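The Cij += Aik * Bkj schedule can be sketched on a single node, with scipy.sparse blocks standing in for the distributed submatrices (the grid size and block layout below are illustrative choices, not CombBLAS parameters; the real algorithm broadcasts block column k of A and block row k of B at stage k):

```python
import scipy.sparse as sp

def sparse_summa(A_blocks, B_blocks, pb):
    """Multiply two pb-by-pb grids of sparse blocks, SUMMA-style."""
    C = [[None] * pb for _ in range(pb)]
    for k in range(pb):                  # one "broadcast" stage per k
        for i in range(pb):
            for j in range(pb):
                prod = A_blocks[i][k] @ B_blocks[k][j]   # Cij += Aik * Bkj
                C[i][j] = prod if C[i][j] is None else C[i][j] + prod
    return C

# usage: split random sparse matrices into a 2x2 grid and check the result
n, pb = 8, 2
bs = n // pb
A = sp.random(n, n, density=0.3, random_state=0, format='csr')
B = sp.random(n, n, density=0.3, random_state=1, format='csr')
split = lambda M: [[M[i*bs:(i+1)*bs, j*bs:(j+1)*bs] for j in range(pb)]
                   for i in range(pb)]
C = sparse_summa(split(A), split(B), pb)
C_full = sp.vstack([sp.hstack(row) for row in C])
assert abs(C_full - (A @ B)).max() < 1e-12   # matches the unblocked product
```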
Sequential Kernel
- Strictly O(nnz) data structure
- Outer-product formulation
- Work-efficient
[Diagram: cost terms flops, nnz, and n for the sequential kernel]
The standard algorithm is O(nnz + flops + n).
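The outer-product formulation can be sketched as follows: C is the sum over k of (column k of A) times (row k of B). The column/row-map storage below is an illustrative stand-in for the real O(nnz) data structure, but it shows why the work is proportional to flops rather than to the matrix dimension n:

```python
from collections import defaultdict

def spgemm_outer(A_cols, B_rows):
    """Outer-product SpGEMM sketch.
    A_cols[k] = {i: A[i,k]}, B_rows[k] = {j: B[k,j]}; returns {(i,j): C[i,j]}."""
    C = defaultdict(float)
    for k in A_cols:                 # each k contributes a rank-1 update
        if k in B_rows:
            for i, a in A_cols[k].items():
                for j, b in B_rows[k].items():
                    C[(i, j)] += a * b   # one flop per nonzero pair
    return dict(C)

# usage: A = [[1,0],[0,2]], B = [[0,3],[4,0]]
A_cols = {0: {0: 1.0}, 1: {1: 2.0}}
B_rows = {0: {1: 3.0}, 1: {0: 4.0}}
print(spgemm_outer(A_cols, B_rows))   # → {(0, 1): 3.0, (1, 0): 8.0}
```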
Node Level Considerations
- Submatrices are hypersparse (i.e. nnz << n)
[Figure: the matrix is partitioned into a grid of blocks; with an average of c nonzeros per column overall, each block becomes sparser as the number of blocks grows]
- A data structure or algorithm that depends on the matrix dimension n (e.g. CSR or CSC) is asymptotically too wasteful for submatrices
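A back-of-the-envelope sketch of the waste (the index-entry counts below are the textbook CSR and coordinate layouts, not the deck's DCSC structure, which also achieves O(nnz)):

```python
# CSR needs n+1 row pointers regardless of nnz, while coordinate
# (or DCSC-style) storage is O(nnz). Counting index entries only:
def csr_index_entries(n, nnz):
    return (n + 1) + nnz          # row pointers + column indices

def coo_index_entries(nnz):
    return 2 * nnz                # one (row, col) pair per nonzero

# A 1M-by-1M hypersparse submatrix with only 1000 nonzeros:
n, nnz = 10**6, 10**3
print(csr_index_entries(n, nnz))  # → 1001001, dominated by the n+1 pointers
print(coo_index_entries(nnz))     # → 2000
```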
Addressing the Load Balance
- Random permutations are useful. But...
- Bulk synchronous algorithms may still suffer.
- Asynchronous algorithms have no notion of stages.
- Overall, no significant imbalance.
RMat: a model for graphs with high variance in vertex degrees
Asynchronous Implementation
[Diagram: a Sparse2D<I,N> distributed matrix built from local DCSC<I,N> blocks with O(nnz) storage; blocks are fetched by remote get using MPI-2]
- Two-dimensional block layout
- (Passive target) remote-memory access
- Avoids hot spots
- With very high probability, a block is accessed by at most a single remote get operation at any given time
Scaling Results for SpGEMM
- Asynchronous implementation using one-sided MPI-2
- Runs on TACC's Lonestar cluster (dual-core, dual-socket Intel Xeon 2.66 GHz)
- RMat × RMat product, with average degree (nnz/n) ≈ 8
Applications and Algorithms
Betweenness Centrality
CB(v): Among all the shortest paths, what fraction of them pass through the node of interest?
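Written out, this fraction is the standard betweenness centrality definition, where σst is the number of shortest s-t paths and σst(v) is the number of those that pass through v:

```latex
C_B(v) \;=\; \sum_{s \neq v \neq t \,\in\, V} \frac{\sigma_{st}(v)}{\sigma_{st}}
```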
Brandes’ algorithm
A typical software stack for an application enabled with the Combinatorial BLAS
Betweenness Centrality using Sparse Matrices [Robinson, Kepner]
- Adjacency matrix: sparse array w/ nonzeros for graph edges
- Storage-efficient implementation from sparse data structures
- Betweenness Centrality Algorithm:
1. Pick a starting vertex, v
2. Compute shortest paths from v to all other nodes
3. Starting with the most distant nodes, roll back and tally paths
[Figure: A^T times the indicator vector x of the starting vertex, on a 7-vertex example graph]
Betweenness Centrality using BFS
[Figure: on the 7-vertex example, each BFS step computes (A^T x) .* ¬x; the successive frontiers t1, t2, t3, t4 are accumulated into x via x += t_i]
- Every iteration discovers another level of the BFS.
- Sparsity is preserved, but sparse matrix times sparse vector has very little potential parallelism (it performs only o(nnz) work).
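The x += (A^T x) .* ¬x iteration can be sketched with scipy (0/1 vectors stand in for the sparse vectors; the 5-vertex edge list is made up for illustration):

```python
import numpy as np
import scipy.sparse as sp

def bfs_levels(A, source, n):
    """BFS via repeated matrix-vector products; returns vertices per level."""
    x = np.zeros(n, dtype=bool)               # visited set
    x[source] = True
    frontier = x.copy()
    levels = []
    while frontier.any():
        levels.append(np.flatnonzero(frontier))
        reached = (A.T @ frontier).astype(bool)   # one step along edges
        frontier = reached & ~x                   # .* ¬x: drop visited
        x |= frontier                             # x += new frontier
    return levels

# usage: edges 0->1, 0->2, 1->3, 2->3, 3->4
rows, cols = (0, 0, 1, 2, 3), (1, 2, 3, 3, 4)
A = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(5, 5))
print([l.tolist() for l in bfs_levels(A, 0, 5)])  # → [[0], [1, 2], [3], [4]]
```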
Parallelism: Multiple-source BFS
[Figure: (A^T X) .* ¬X on the 7-vertex example, where X has one column per source vertex]
- Batch processing of multiple source vertices
- Sparse matrix-matrix multiplication => work efficient
- Potential parallelism is much higher
- Same applies to the tallying phase
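A sketch of one batched step: the frontier becomes an n-by-b sparse matrix X with one column per source, so one sparse matrix-matrix product advances all b searches at once (the chain graph and source choice below are illustrative):

```python
import numpy as np
import scipy.sparse as sp

def multi_bfs_step(A, X, visited):
    """One level of batched BFS: (A^T X) .* ¬X, one column per source."""
    reached = (A.T @ X).toarray().astype(bool)   # one SpGEMM for all sources
    return reached & ~visited                    # mask out visited vertices

# usage: chain 0 -> 1 -> 2 -> 3, BFS batch from sources 0 and 1
n = 4
A = sp.csr_matrix((np.ones(3), ([0, 1, 2], [1, 2, 3])), shape=(n, n))
sources = [0, 1]
X = sp.csr_matrix((np.ones(len(sources)),
                   (sources, list(range(len(sources))))),
                  shape=(n, len(sources)))
visited = X.toarray().astype(bool)
front = multi_bfs_step(A, X, visited)
print(front.T.tolist())
# → [[False, True, False, False], [False, False, True, False]]
```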
Betweenness Centrality on Combinatorial BLAS
Batch processing greatly helps for large p. RMAT scale N has 2^N vertices and 8·2^N edges.
- Likely to perform better on large inputs
- Code only a few lines longer than the Matlab version
[Chart: BC performance in TEPS (Traversed Edges per Second) vs. processor count (4-256) for RMAT scale 16, with batch sizes 256 and 512]
Betweenness Centrality on Combinatorial BLAS
Fundamental trade-off: Parallelism vs memory usage
[Chart: TEPS vs. processor count (64-256) for RMAT scales 16 and 17, with batch sizes 256 and 512]