

SLIDE 1

Towards a GraphBLAS Library in Chapel

Ariful Azad & Aydin Buluç, Lawrence Berkeley National Laboratory (LBNL), CHIUW, IPDPS 2017

SLIDE 2

Overview

GraphBLAS: building blocks for graph algorithms in the language of sparse linear algebra.

Chapel: an emerging parallel language designed for productive parallel computing at scale.

Both promise: productivity + performance.

  • High-level research objective:
    – Enable productive and high-performance graph analytics
    – We used GraphBLAS and Chapel to achieve this goal
  • Scope of this paper: a GraphBLAS library in Chapel

SLIDE 3
Outline

  • 1. Overview of GraphBLAS primitives
  • 2. Implementation of a subset of GraphBLAS primitives in Chapel, with experimental results

Warning: this is just an early evaluation, as Chapel's sparse matrix support is actively under development. All experiments were conducted on Chapel 1.13.1. The performance numbers are expected to improve significantly in future releases of Chapel.

SLIDE 4

Part 1. GraphBLAS overview

SLIDE 5

GraphBLAS analogy: a ready-to-assemble furniture shop (IKEA)

Building blocks → Objects (algorithms) → Final product (applications)

SLIDE 6

Graph algorithm building blocks

  • GraphBLAS (http://graphblas.org)
    – Standard building blocks for graph algorithms in the language of sparse linear algebra
    – Inspired by the Basic Linear Algebra Subprograms (BLAS)
    – Participants from industry, academia, and national labs
    – A C API is available on the website
      (Design of the GraphBLAS API for C; A. Buluç, T. Mattson, S. McMillan, J. Moreira, C. Yang; IPDPS Workshops 2017)

SLIDE 7

GraphBLAS as algorithm building blocks

  • Employs graph-matrix duality
    – Graphs => sparse matrices
    – A subset of vertices/edges => sparse/dense vectors
  • Benefits
    – Standard set of operations
    – Learn from the rich history of numerical linear algebra
    – Offers structured and regular memory accesses and communication (as opposed to the irregular memory accesses of traditional graph algorithms)
    – Opportunity for communication-avoiding algorithms

SLIDE 8

Some GraphBLAS basic primitives

| Function (name)              | Parameters                                                     | Returns                            | Matlab notation            |
|------------------------------|----------------------------------------------------------------|------------------------------------|----------------------------|
| MxM (SpGEMM)                 | sparse matrices A and B; optional unary functions              | sparse matrix                      | C = A * B                  |
| MxV (SpM{Sp}V)               | sparse matrix A; sparse/dense vector x                         | sparse/dense vector                | y = A * x                  |
| EwiseMult, Add, … (SpEWiseX) | sparse matrices or vectors; binary function, optional unaries  | in place, or sparse matrix/vector  | C = A .* B, C = A + B      |
| Reduce (Reduce)              | sparse matrix A and function                                   | dense vector                       | y = sum(A, op)             |
| Extract (SpRef)              | sparse matrix A; index vectors p and q                         | sparse matrix                      | B = A(p, q)                |
| Assign (SpAsgn)              | sparse matrices A and B; index vectors p and q                 | none                               | A(p, q) = B                |
| BuildMatrix (Sparse)         | list of edges/triples (i, j, v)                                | sparse matrix                      | A = sparse(i, j, v, m, n)  |
| ExtractTuples (Find)         | sparse matrix A                                                | edge list                          | [i, j, v] = find(A)        |

SLIDE 9

General-purpose operations via semirings (overloading the addition and multiplication operations)

  • Real field (R, +, ×): classical numerical linear algebra
  • Boolean algebra ({0, 1}, |, &): graph traversal
  • Tropical semiring (R ∪ {∞}, min, +): shortest paths
  • (S, select, select): select a subgraph, or contract nodes to form a quotient graph
  • (edge/vertex attributes, vertex data aggregation, edge data processing): schema for user-specified computation at vertices and edges
  • (R, max, +): graph matching and network alignment
  • (R, min, times): maximal independent set

  • Shortened semiring notation: (Set, Add, Multiply). Both identities omitted.
  • Multiply: traverses edges; Add: combines edges/paths at a vertex.
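To make the overloading idea concrete, here is a minimal Chapel sketch of the tropical semiring (min, +) used for shortest paths; the procedure names srAdd and srMultiply are illustrative and not part of any GraphBLAS API.

```chapel
// Tropical semiring sketch: "multiply" extends a path along an edge,
// "add" keeps the shorter of two alternative paths.
proc srMultiply(pathLen: real, edgeWeight: real): real {
  return pathLen + edgeWeight;
}
proc srAdd(a: real, b: real): real {
  return min(a, b);
}

// Extend a length-2.5 path along an edge of weight 3.0, then merge it
// with an existing length-4.0 path to the same vertex.
writeln(srAdd(srMultiply(2.5, 3.0), 4.0));   // prints 4.0
```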
SLIDE 10

Example: exploring the next-level vertices via SpMSpV

[Figure: the current frontier is a sparse vector; multiplying the adjacency matrix by it, with (multiply, add) overloaded as (select2nd, min), produces the next frontier.]
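As a worked illustration of the (select2nd, min) overload, here is a small, serial Chapel sketch of one BFS level in the SpMSpV style; all names, and the edge encoding (a nonzero A[v, u] meaning an edge from u to v), are assumptions for this example and not the paper's code.

```chapel
// One BFS level: "multiply" returns the parent u (select2nd); "add" keeps a
// single parent per destination vertex v (here: the first one found).
config const n = 8;
const D = {0..n-1, 0..n-1};
var spAD: sparse subdomain(D);            // nonzero A[v, u] = edge u -> v
var inFrontier: [0..n-1] bool;            // current frontier as a dense mask
var parent: [0..n-1] int = -1;            // -1 = not yet visited

var nextFrontier: domain(int);            // vertices discovered at this level
for (v, u) in spAD {
  if inFrontier[u] && parent[v] == -1 {
    parent[v] = u;
    nextFrontier += v;
  }
}
```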

SLIDE 11

Algorithmic coverage

GraphBLAS primitives, in increasing arithmetic intensity: Sparse Matrix-Sparse Vector (SpMSpV), Sparse Matrix-Dense Vector (SpMV), Sparse Matrix Times Multiple Dense Vectors (SpMM), Sparse-Sparse Matrix Product (SpGEMM), Sparse-Dense Matrix Product (SpDM3).

Higher-level combinatorial and machine learning algorithms built on these primitives: shortest paths (all-pairs, single-source, temporal); graph clustering (Markov cluster, peer pressure, spectral, local); miscellaneous: connectivity, traversal (BFS), independent sets (MIS), graph matching; centrality (PageRank, betweenness, closeness); classification (support vector machines, logistic regression); dimensionality reduction (NMF, PCA).

  • Develop high-performance algorithms for 10-12 primitives.
  • Use them in many algorithms (boost productivity).
SLIDE 12

Expectation: two-layer productivity

[Diagram: graph algorithms (user space) use GraphBLAS operations (library), which in turn use Chapel's productivity features (language).]

SLIDE 13

Part 2. Implementing a subset of GraphBLAS operations in Chapel

SLIDE 14

For Chapel: a subset of GraphBLAS operations

| Operation | Parameters                                        | Returns                  | Semantics          |
|-----------|---------------------------------------------------|--------------------------|--------------------|
| Apply     | x: sparse matrix/vector; f: unary function        | none                     | x[i] = f(x[i])     |
| Assign    | x: sparse matrix/vector; y: sparse matrix/vector  | none                     | x[i] = y[i]        |
| eWiseMult | x: sparse matrix/vector; y: sparse matrix/vector  | z: sparse matrix/vector  | z[i] = x[i] * y[i] |
| SpMSpV    | A: sparse matrix; x: sparse vector                | y: sparse vector         | y = Ax             |
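To give a feel for these signatures, here is a hedged, serial Chapel sketch of eWiseMult on sparse vectors sharing a parent domain; the names are illustrative, and the domain membership query is `contains` in recent Chapel releases (older releases used `member`).

```chapel
// eWiseMult sketch: z[i] = x[i] * y[i], keeping only indices present in both
// sparse inputs (the intersection of the two sparse index sets).
config const n = 10;
const D = {0..n-1};
var spXD, spYD, spZD: sparse subdomain(D);
var x: [spXD] real, y: [spYD] real, z: [spZD] real;

for i in spXD do
  if spYD.contains(i) {
    spZD += i;              // grow the output's sparse index set
    z[i] = x[i] * y[i];
  }
```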

SLIDE 15

Experimental platform

  • Chapel details
    – Chapel 1.13.1 (the latest version before the IPDPS deadline)
    – Chapel built from source
    – CHPL_COMM: gasnet/gemini
    – Job launcher: slurm-srun
  • Experiment platform: NERSC/Edison
    – Intel Ivy Bridge processors
    – 24 cores on 2 sockets
    – 64 GB memory per node
    – 30 MB L3 cache

SLIDE 16

Sparse matrices in Chapel

  • Block-distributed sparse matrices: the dense container is block-distributed.
  • We used the compressed sparse row (CSR) layout to store local matrices.

    use BlockDist;

    var n = 6;
    const D = {0..n-1, 0..n-1} dmapped Block({0..n-1, 0..n-1});
    var spD: sparse subdomain(D);
    var A: [spD] real;

In this example, #locales = 9 (a 3 × 3 locale grid). In our results, we did not include the time to construct arrays.
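Continuing the example above, nonzeros are introduced by adding indices to the sparse domain, which grows every array declared over it; the specific indices and values below are only illustrative.

```chapel
// Grow the sparse domain, then assign the corresponding values.
spD += (0, 1);
spD += (2, 4);
A[0, 1] = 1.0;
A[2, 4] = 3.5;
writeln("number of nonzeros: ", spD.size);
```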

SLIDE 17

The simplest GraphBLAS operation: Apply ( x[i] = f(x[i]) )

Apply1: high-level (Chapel style). Apply2: manipulating internal arrays (MPI style).
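The slide's code is not reproduced in this transcript, so the following is only a hedged reconstruction of the two styles, not the paper's implementation. Apply1 is a plain data-parallel loop; the Apply2 sketch assumes a per-locale `localSubdomain()` query is available for the sparse block distribution, which may not hold in every Chapel release.

```chapel
use BlockDist;

config const n = 100;
const D = {0..n-1} dmapped Block({0..n-1});
var spD: sparse subdomain(D);
var X: [spD] real;

// Apply1: high-level, Chapel-style data-parallel loop over the sparse domain.
forall i in spD do
  X[i] = 2 * X[i];             // f(x) = 2x stands in for the unary function

// Apply2 (sketch): explicit task per locale, touching only locally owned
// entries, in the spirit of the MPI/C++ style.
coforall loc in Locales do on loc {
  forall i in spD.localSubdomain() do
    X[i] = 2 * X[i];
}
```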

SLIDE 18

Example, simple case: Apply ( x[i] = f(x[i]) )

[Plots: time (ms) vs. number of threads on a single node, and time (s) vs. number of nodes (24 threads per node), comparing Apply1 and Apply2.]

Apply1: high-level (Chapel style). Apply2: manipulating internal arrays (C++ style). x: 10M nonzeros. Platform: NERSC/Edison.

Data-parallel loops perform well in shared memory, but do not perform well in distributed memory.

SLIDE 19

Performance on distributed memory

[chplvis traces on four locales for Apply1 and Apply2; red: data in, blue: data out. For Apply1, all work happens at locale 0.]

This issue with sparse arrays was addressed about a week ago.

SLIDE 20

Assign ( x[i] = y[i] )

Assign1: high-level (Chapel style). Assign2: manipulating internal arrays (MPI style).

SLIDE 21

Shared-memory performance: Assign ( x[i] = y[i] )

Assign1: high-level (Chapel style). Assign2: manipulating internal arrays (C++ style). x: 1M nonzeros. Platform: NERSC/Edison.

[Plot: time (ms) vs. number of threads on a single node for Assign1 and Assign2.]

There is a big performance gap, even in shared memory.

Why? Indexing a sparse domain uses binary search; for assignment, it can be avoided.
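To illustrate why the binary search matters, here is a small hedged sketch: element-wise indexing must locate each index in the sparse index set, whereas arrays declared over the same sparse domain can be copied with a whole-array assignment. The declarations are illustrative, not the benchmark code.

```chapel
config const n = 10;
const D = {0..n-1};
var spVD: sparse subdomain(D);
spVD += 3;
spVD += 7;
var x, y: [spVD] real;

// Assign1-style: each x[i] / y[i] access may binary-search the sparse
// index set to find the stored position of i.
forall i in spVD do
  x[i] = y[i];

// When both arrays share the same sparse domain, whole-array assignment
// sidesteps the per-element lookup.
x = y;
```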

SLIDE 22

Distributed-memory performance: Assign ( x[i] = y[i] )

Assign1: high-level (Chapel style). Assign2: manipulating internal arrays (C++ style). x: 1M nonzeros. Platform: NERSC/Edison.

[Plot: time (s) vs. number of nodes (24 threads per node) for Assign1 and Assign2.]

The big performance gap persists in distributed memory.

SLIDE 23

x"

=

*" A"

SPA$

gather" sca-er/" accumulate"

y" Example, complex case: SpMSpV (y = Ax)

Algorithm overview
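Before the distributed results, here is a hedged, serial Chapel sketch of this gather / multiply / scatter-accumulate structure using a dense SPA; all names are illustrative, and this is not the distributed implementation evaluated below.

```chapel
config const n = 8;
const D  = {0..n-1, 0..n-1};
const Dv = {0..n-1};
var spAD: sparse subdomain(D);
var A: [spAD] real;                 // sparse matrix
var spXD: sparse subdomain(Dv);
var x: [spXD] real;                 // sparse input vector
var spYD: sparse subdomain(Dv);
var y: [spYD] real;                 // sparse output vector

var spa: [Dv] real;                 // dense sparse accumulator (SPA)

// Gather + multiply: for each nonzero A[i, j] whose column j appears in x,
// accumulate the product into the SPA.
for (i, j) in spAD do
  if spXD.contains(j) then
    spa[i] += A[i, j] * x[j];

// Scatter: compact the SPA into the sparse result y.
for i in Dv {
  if spa[i] != 0 {
    spYD += i;
    y[i] = spa[i];
  }
}
```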

SLIDE 24

Sparse matrix-sparse vector multiply (SpMSpV)

P processors are arranged in a 2D processor grid.

  • 1. Gather vertices in the processor column
  • 2. Local multiplication
  • 3. Scatter results in the processor row

Algorithm (Chapel style): multiply, accessing remote data as needed; no collective communication. Algorithm (MPI style): bulk-synchronous gather, local multiply, and scatter.

SLIDE 25

Distributed-memory performance of SpMSpV on Edison

[Plot: time (s) vs. number of nodes (24 threads/node), broken down into gathering the input, the local multiply, and scattering the output.]

A: random, 16M nonzeros. x: random, 2000 nonzeros.

Remote atomics are expensive in Chapel; we do not know the reason.

SLIDE 26

Requirements for achieving high performance

  • Exploit the available spatial locality in sparse manipulations
    – Efficient access to the nonzeros of sparse matrices/vectors
    – Chapel is almost there; it needs improved parallel iterators
  • Use bulk-synchronous communication whenever possible
    – Avoid latency-bound communication
    – Team collectives are useful

SLIDE 27

Our experience: productivity vs. performance

Productivity (easy to develop a prototype):

| Task            | Hardness | Why?                                       |
|-----------------|----------|--------------------------------------------|
| Data structure  | medium   | Manipulating domains and arrays            |
| Functionality   | easy     | Fewer lines of code with built-in features |
| Parallelization | easy     | No need to think about communication       |

Performance (hard to achieve performance):

| Task               | Hardness | Why?                                                          |
|--------------------|----------|---------------------------------------------------------------|
| Data structure     | hard     | Manipulating low-level data structures                        |
| Shared memory      | medium   | Data-parallel iterators for sparse data                       |
| Distributed memory | hard     | Needs bulk-synchronous communication, team collectives, etc.  |

SLIDE 28

Summary

  • We have implemented a prototype GraphBLAS library in Chapel
    – Implemented breadth-first search as a representative algorithm using these primitives
  • Library development in Chapel is easy (relative to C++)
  • Chapel's distributed sparse matrix support is still under development; the distributed-memory performance is expected to improve over time

SLIDE 29

Future directions

  • Finish a complete GraphBLAS-compliant library in a PGAS language (including Chapel)
    – Achieving high performance is our focus
    – Benchmark our library against other programming models and languages
  • Design complex graph algorithms using the library to demonstrate its utility
    – Understand the impact of programming models on graph analytics

SLIDE 30

Acknowledgements and relevant references

  • Funded in part by DOD/ACS and in part by DOE/ASCR
  • Acknowledgements: Costin Iancu (LBNL), Brad Chamberlain (Cray), Michael Ferguson (Cray), Engin Kayraklioglu (George Washington University)
  • References:
    – A. Azad and A. Buluç. Towards a GraphBLAS library in Chapel. IPDPS Workshops 2017.
    – A. Azad and A. Buluç. A work-efficient parallel sparse matrix-sparse vector multiplication algorithm. IPDPS 2017.

Questions?