SLIDE 1

Generalised vectorisation for sparse matrix–vector multiplication

Albert-Jan Yzelman 22nd of July, 2014

© 2014, ExaScience Lab - A. N. Yzelman

SLIDE 2

Context

Solve y = Ax, with A an m × n input matrix, x an input vector of length n, and y an output vector of length m.

Structured and unstructured sparse matrices, for example Emilia 923 (unstructured mesh computations) and RH Pentium (circuit simulation).

SLIDE 3

Context

Past work on high-level approaches to SpMV multiplication:

             4 x 10    8 x 8
OpenMP CRS     8.8      7.2
PThread 1D    13.6     20.0
Cilk CSB      22.9     26.9
BSP 2D        21.3     30.8

4 x 10: HP DL-580, 4 sockets, 10-core Intel Xeon E7-4870
8 x 8: HP DL-980, 8 sockets, 8-core Intel Xeon E7-2830

Yzelman and Roose, High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication, IEEE Trans. Parallel and Distributed Systems, doi:10.1109/TPDS.2013.31 (2014).

Yzelman, Bisseling, Roose, and Meerbergen, MulticoreBSP for C: a high-performance library for shared-memory parallel programming, Intl. J. Parallel Programming, doi:10.1007/s10766-013-0262-9 (2014).

This talk instead focuses on sequential, low-level optimisations.

SLIDE 4

Context

One operation, one flop:
scalar addition: a := b + c
scalar multiplication: a := b · c

SLIDE 5

Context

One operation, l flops: vectorised addition

$$\begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_{l-1} \end{pmatrix} := \begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_{l-1} \end{pmatrix} + \begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_{l-1} \end{pmatrix}$$

SLIDE 6

Context

One operation, l flops: vectorised multiplication

$$\begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_{l-1} \end{pmatrix} := \begin{pmatrix} b_0 \cdot c_0 \\ b_1 \cdot c_1 \\ \vdots \\ b_{l-1} \cdot c_{l-1} \end{pmatrix}$$
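For concreteness, a minimal sketch of such single-instruction, l-flop operations, assuming AVX on x86 with l = 4 doubles; the function name and the choice of l are illustrative, not part of the talk:

```cpp
#include <immintrin.h>

// One AVX instruction performs l = 4 double-precision additions or multiplications.
void vector_add_mul(const double* b, const double* c, double* add_out, double* mul_out) {
    __m256d vb = _mm256_loadu_pd(b);   // (b0, b1, b2, b3)
    __m256d vc = _mm256_loadu_pd(c);   // (c0, c1, c2, c3)
    _mm256_storeu_pd(add_out, _mm256_add_pd(vb, vc)); // (b0+c0, ..., b3+c3), one instruction
    _mm256_storeu_pd(mul_out, _mm256_mul_pd(vb, vc)); // (b0*c0, ..., b3*c3), one instruction
}
```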

SLIDE 7

Context

Exploiting sparsity through computation using only nonzeroes leads to sparse data structures:

i = (0, 0, 1, 1, 2, 2, 2, 3)
j = (0, 4, 2, 4, 1, 3, 5, 2)
v = (a00, a04, ..., a32)

for k = 0 to nz − 1
    y_{i_k} := y_{i_k} + v_k · x_{j_k}

The coordinate (COO) format: two flops versus five data words.
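A direct C++ rendering of the COO kernel above; the container types and the function name are illustrative:

```cpp
#include <vector>
#include <cstddef>

// Coordinate (COO) SpMV, y := y + A x: per nonzero, two flops against five data
// accesses (i[k], j[k], v[k], x[j[k]], y[i[k]]).
void spmv_coo(const std::vector<std::size_t>& i, const std::vector<std::size_t>& j,
              const std::vector<double>& v, const std::vector<double>& x,
              std::vector<double>& y) {
    for (std::size_t k = 0; k < v.size(); ++k)
        y[i[k]] += v[k] * x[j[k]];
}
```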

SLIDE 8

Context

[Roofline figure: attainable Gflop/s versus arithmetic intensity (flop/byte), showing the peak memory bandwidth bound and the peak floating-point bound with vectorisation.]

Theoretical turnover points for the Intel Xeon E3-1225:
• 64 operations per word (with vectorisation),
• 16 operations per word (without vectorisation).

(Image courtesy of Prof. Wim Vanroose, UA)
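As a rough check, combining the two-flops-versus-five-words count of the COO format with these turnover points (reading "operations per word" as flops per data word) shows how far a plain sparse kernel sits below the compute-bound regime:

$$\frac{2\ \text{flops}}{5\ \text{words}} = 0.4\ \text{flops/word} \ll 16 \ll 64,$$

so the SpMV sits firmly on the bandwidth-bound side of the roofline.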

SLIDE 9

Context

[Roofline figure as on the previous slide (image courtesy of Prof. Wim Vanroose, UA).]

Theoretical turnover points for the Intel Xeon Phi:
• 28 operations per word (with vectorisation),
• 4 operations per word (without vectorisation).


SLIDE 10

Motivation

Why care about vectorisation (in sparse computations)?

• Computations can become latency bound.
• Good caching increases the effective bandwidth and reduces data access latencies.
• Vectorisation allows retrieving multiple data elements per CPU cycle, giving better latency hiding.

SLIDE 11

Motivation

Vectorisation also strongly relates to blocking and tiling. Lots of earlier work:

1. classical vector computing (ELLPACK / segmented scans),
2. SpMV register blocking (Blocked CRS, OSKI),
3. sparse blocking and tiling.

This work generalises the earlier approaches (1, 2) and illustrates sparse blocking and tiling (3) through the SpMV multiplication as well as the sparse matrix powers kernel.

SLIDE 12

Sequential SpMV

Blocking to fit subvectors into cache, cache-obliviousness to increase cache efficiency.

Ref.: Yzelman and Roose, “High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication”, IEEE Trans. Parallel and Distributed Systems, doi:10.1109/TPDS.2013.31 (2014).


SLIDE 16

SpMV vectorisation

What is needed for vectorisation:
• support for arbitrary nonzero traversals,
• handling of non-contiguous columns,
• handling of non-contiguous rows.

Basic operation using vector registers r_i: r1 := r1 + r2 · r3, the vectorised multiply-add.

How to get the right data into the vector registers?
• nonzeroes of A: streaming loads;
• elements from x, y: gather/scatter.
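A minimal sketch of one such step, assuming AVX2 with FMA and l = 4 doubles; since AVX2 lacks a scatter instruction (AVX-512 adds one), the y elements are kept contiguous here, so this illustrates only the gather and the multiply-add, not the talk's actual kernel:

```cpp
#include <immintrin.h>
#include <cstdint>

// One vectorised multiply-add r1 := r1 + r2 * r3 for l = 4 doubles (AVX2 + FMA).
// Nonzero values of A are loaded contiguously; x is gathered through 32-bit column indices.
void spmv_quad(const double* a_vals, const int32_t* col_idx,
               const double* x, double* y_block /* 4 contiguous outputs */)
{
    __m128i idx = _mm_loadu_si128(reinterpret_cast<const __m128i*>(col_idx));
    __m256d r3  = _mm256_loadu_pd(a_vals);         // streaming-style load of nonzeroes
    __m256d r2  = _mm256_i32gather_pd(x, idx, 8);  // gather x[col_idx[0..3]]
    __m256d r1  = _mm256_loadu_pd(y_block);        // load current y values
    r1 = _mm256_fmadd_pd(r2, r3, r1);              // r1 := r2 * r3 + r1
    _mm256_storeu_pd(y_block, r1);                 // write back; a scatter would be needed
                                                   // if the y elements were not contiguous
}
```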

SLIDE 17

SpMV vectorisation

Streaming loads only apply to the sparse matrix data structure:

for k = 0 to nz − 1
    y_{i_k} := y_{i_k} + v_k · x_{j_k}

The streaming loads (shown in blue on the slide) cover i, j, and v; for accesses to x and y, alternatives are necessary.

SLIDE 18

SpMV vectorisation

‘Gather’: read from random memory areas into a single vector register.

SLIDE 20

SpMV vectorisation

‘Scatter’: write back to random memory areas (the inverse of gather).

SLIDE 21

ELLPACK

Sketch of the (sliced) ELLPACK SpMV multiply (l = 3):

Kincaid and Fassiotto, ITPACK software and parallelization, in Advances in Computer Methods for Partial Differential Equations, Proc. 7th Intl. Conf. on Computer Methods for Partial Differential Equations (1992).
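A scalar sketch of a sliced-ELLPACK style SpMV, assuming slices of l rows padded to the longest row within each slice and stored column-major; the struct layout and names are illustrative, not the talk's implementation:

```cpp
#include <vector>
#include <cstddef>

// Sliced ELLPACK: rows are grouped into slices of l rows; every row in a slice is
// padded to the slice's longest row. Within a slice, entries are stored column-major
// so that one "column" of the slice maps onto one vector register of length l.
struct SlicedEllpack {
    std::size_t l;                       // slice height (vector length)
    std::vector<std::size_t> slice_ptr;  // start of each slice in val/col (length #slices + 1)
    std::vector<std::size_t> col;        // column indices; padded entries point at column 0
    std::vector<double>      val;        // values; padded entries are explicit zeroes
};

// Assumes y has been padded to a multiple of l entries.
void spmv(const SlicedEllpack& A, const std::vector<double>& x, std::vector<double>& y) {
    const std::size_t numSlices = A.slice_ptr.size() - 1;
    for (std::size_t s = 0; s < numSlices; ++s) {
        const std::size_t begin = A.slice_ptr[s], end = A.slice_ptr[s + 1];
        const std::size_t width = (end - begin) / A.l;   // padded row length of this slice
        for (std::size_t c = 0; c < width; ++c)          // one vector op per padded column
            for (std::size_t r = 0; r < A.l; ++r) {      // the l lanes of that vector op
                const std::size_t k = begin + c * A.l + r;
                y[s * A.l + r] += A.val[k] * x[A.col[k]];
            }
    }
}
```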

SLIDE 25

Segmented scan/reduce

Sketch of a segmented-reduce enabled SpMV multiply:

j = (0, 4, 2, 4, 1, 3, 5, 2)  →  (x0, x4, x2, x4, x1, x3, x5, x2)   [gather]
v = (a00, a04, a12, ..., a32)  →  (x0a00, x4a04, x2a12, ..., x2a32)   [vectorised multiply]
summation control array  →  (x0a00 + x4a04,  x2a12 + x4a14,  x1a21 + x3a23 + x5a25,  x2a32)   [segmented reduce]
i = (0, 0, 1, 1, 2, 2, 2, 3)  →  (y0, y1, y2, y3)   [gather/load + vectorised summation]
→ scatter to y

Blelloch, Heroux, and Zagha, Segmented operations for sparse matrix computation on vector multiprocessors (1993).
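The semantics of the segmented-reduce step, sketched in scalar C++; the flag array marking segment starts stands in for the summation control array, while a vectorised version performs the same reduction inside vector registers:

```cpp
#include <vector>
#include <cstddef>

// Segmented reduce: sum consecutive products that belong to the same output row.
// flag[k] is true when product k starts a new segment (i.e. a new row of A).
std::vector<double> segmented_reduce(const std::vector<double>& products,
                                     const std::vector<bool>& flag) {
    std::vector<double> sums;
    for (std::size_t k = 0; k < products.size(); ++k) {
        if (flag[k] || sums.empty()) sums.push_back(0.0);  // open a new segment
        sums.back() += products[k];                        // accumulate into the current segment
    }
    return sums;
}
```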

SLIDE 26

Blocked CRS

The compressed row storage (CRS) format:

s = (0, 2, 4, 7, 8)
j = (0, 4, 2, 4, 1, 3, 5, 2)
v = (a00, a04, ..., a32)

for i = 0 to m − 1
    for k = s_i to s_{i+1} − 1
        y_i := y_i + v_k · x_{j_k}

Eisenstat, Gursky, Schultz, and Sherman, Yale sparse matrix package II: the nonsymmetric codes, Defense Technical Information Center, technical report (1977).
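The CRS pseudocode above translates directly into C++; containers and the function name are illustrative:

```cpp
#include <vector>
#include <cstddef>

// Plain CRS SpMV, y := y + A x. 's' holds row start offsets (length m + 1),
// 'j' the column index and 'v' the value of each nonzero, stored row by row.
void spmv_crs(const std::vector<std::size_t>& s, const std::vector<std::size_t>& j,
              const std::vector<double>& v, const std::vector<double>& x,
              std::vector<double>& y) {
    const std::size_t m = s.size() - 1;
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t k = s[i]; k < s[i + 1]; ++k)
            y[i] += v[k] * x[j[k]];
}
```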

SLIDE 27

Blocked CRS

Blocking the example matrix on 2 × 2 blocks yields small dense blocks V_k, a block-row pointer array s, and block-column indices j:

$$V_0 = \begin{pmatrix} a_{00} & \cdot \\ \cdot & \cdot \end{pmatrix},\quad V_1 = \begin{pmatrix} \cdot & \cdot \\ a_{12} & \cdot \end{pmatrix},\quad V_2 = \begin{pmatrix} a_{04} & \cdot \\ a_{14} & \cdot \end{pmatrix},\quad V_3,\ V_4,\ V_5$$

s = (0, 3, 6), j = (0, 1, 2, 0, 1, 2)

for i = 0 to m/2 − 1
    for k = s_i to s_{i+1} − 1
        $\begin{pmatrix} y_{2i} \\ y_{2i+1} \end{pmatrix} := \begin{pmatrix} y_{2i} \\ y_{2i+1} \end{pmatrix} + V_k \cdot \begin{pmatrix} x_{2j_k} \\ x_{2j_k+1} \end{pmatrix}$

Pinar & Heath, Improving performance of sparse matrix–vector multiplication (1999); Vuduc & Moon, Fast sparse matrix–vector multiplication by exploiting variable block structure (2005).
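A scalar sketch of the 2 × 2 Blocked CRS multiply, assuming each block V_k is stored as four contiguous values in row-major order (that storage order is an assumption, not stated on the slide):

```cpp
#include <vector>
#include <cstddef>

// Blocked CRS with fixed 2x2 blocks: V stores each block as 4 contiguous values
// (row-major within the block), s holds block-row offsets, j the block-column index.
void spmv_bcrs2x2(const std::vector<std::size_t>& s, const std::vector<std::size_t>& j,
                  const std::vector<double>& V, const std::vector<double>& x,
                  std::vector<double>& y) {
    const std::size_t blockRows = s.size() - 1;       // m / 2
    for (std::size_t i = 0; i < blockRows; ++i)
        for (std::size_t k = s[i]; k < s[i + 1]; ++k) {
            const double* b = &V[4 * k];              // the 2x2 block V_k
            const std::size_t c = 2 * j[k];           // first column this block touches
            y[2 * i]     += b[0] * x[c] + b[1] * x[c + 1];
            y[2 * i + 1] += b[2] * x[c] + b[3] * x[c + 1];
        }
}
```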

SLIDE 28

Vectorised BICRS

while there are nonzeroes do
    load nonzeroes into r3 (load nonzero indices into r4, r5)
    gather corresponding elements from x into r2 (using r4)
    gather corresponding elements from y into r1 (using r5)
    do the vectorised multiply-add r1 := r1 + r2 · r3
    sum-reduce r1 into the unique elements from y
    scatter the unique elements from r1 back to y
end while

Freedom to choose the number p of unique elements from y:
• choosing p = 1 leads to a segmented-reduce type SpMV,
• choosing p = l leads to sliced ELLPACK,
• other choices lead to generic p × q blocks, reminiscent of Blocked CRS, with pq = l the vector register width.
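A scalar emulation of one generic p × q step, to make the lane bookkeeping explicit; absolute row and column indices are used here for illustration, whereas the actual BICRS structure stores compressed relative increments:

```cpp
#include <vector>
#include <cstddef>

// Generic p x q vectorised step, emulated with scalar code: each step consumes
// l = P*Q (possibly zero-padded) nonzeroes, gathers their x entries, performs one
// multiply per lane, sum-reduces the Q lanes of every row, and scatters into y.
// Lane layout assumption: lanes p*Q .. p*Q+Q-1 belong to output row rowIdx[p].
template <std::size_t P, std::size_t Q>
void vectorised_block_step(const double* vals,         // l packed nonzero values (r3)
                           const std::size_t* colIdx,  // l column indices (r4)
                           const std::size_t* rowIdx,  // P unique row indices (r5)
                           const std::vector<double>& x, std::vector<double>& y) {
    double r2[P * Q], r1[P * Q];
    for (std::size_t lane = 0; lane < P * Q; ++lane)
        r2[lane] = x[colIdx[lane]];                    // gather from x
    for (std::size_t lane = 0; lane < P * Q; ++lane)
        r1[lane] = vals[lane] * r2[lane];              // vectorised multiply(-add)
    for (std::size_t p = 0; p < P; ++p) {
        double sum = 0.0;
        for (std::size_t q = 0; q < Q; ++q)            // sum-reduce the Q lanes of row p
            sum += r1[p * Q + q];
        y[rowIdx[p]] += sum;                           // scatter back to y
    }
}
```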

SLIDE 35

Vectorised BICRS

Example multiplication using 2 × 2 blocking. The example matrix A has ten nonzeroes (values 4, 1, 3, 2, 3, 1, 2, 7, 1, 1); they are stored as zero-padded blocks of four values (V), with corresponding index arrays J̃ and Ĩ:

V = [ 7 1 0 0 | 4 1 2 0 | 3 3 0 0 | 2 0 1 1 ]
J̃ = [ 0 0 0 0 | 4 1 2 0 | 2 1 0 0 | 5 0 -1 0 ]
Ĩ = [ 3 2 | -3 -1 | 2 1 ]

The multiplication proceeds block by block:
1. Multiply (7, 1, 0, 0) with (x0, x0, 0, 0) and add the result to (y3, y2).
2. Multiply (4, 1, 2, 0) with (x0, x1, x2, 0) and add the result to (y0, y1).
3. Multiply (3, 3, 0, 0) with (x2, x3, 0, 0) and add the result to (y0, y1).
4. Multiply (2, 0, 1, 1) with (x3, 0, x2, x3) and add the result to (y2, y3).

SLIDE 39

SpMV parallelisation

Implementation on the Intel Xeon Phi:
• load-balanced row distribution over 240 threads;
• each thread locally uses Hilbert-ordered sparse blocks;
• nonzeroes within blocks are stored using vectorised BICRS;
• 1 × 8, 2 × 4, 4 × 2, and 8 × 1 block sizes.

Yzelman and Roose, “High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication”, IEEE Trans. Parallel and Distributed Systems, doi:10.1109/TPDS.2013.31 (2014).

SLIDE 40

SpMV results

[Bar chart: speed of SpMV multiplication on the Intel Xeon Phi, in Gflop/s, for nd24k, s3dkt3m2, Freescale1, wiki2007, cage15, and adaptive, comparing the 1x1 (not vectorised), 1x8, 2x4, 4x2, and 8x1 block sizes.]

Results of the partially distributed parallel SpMV multiplication using vectorised BICRS on an Intel Xeon Phi 7120A.

SLIDE 41

SpMV results

[Bar chart: fill-in relative to non-vectorised BICRS for the 1x8, 2x4, 4x2, and 8x1 block sizes on the same six matrices.]

In general, blocking sizes that result in the least fill-in achieve the highest performance.

SLIDE 42

SpMV results

[Bar chart: SpMV performance in Gflop/s on the same six matrices for the Intel Xeon Phi 7120A (vectorised BICRS), a non-vectorised dual-socket Intel Xeon (Ivy Bridge), and an NVIDIA K20x running cuSPARSE (HYB).]

Comparing vectorised, non-vectorised, and a competing approach on the Xeon Phi, a dual-socket machine, and a GPU.

SLIDE 43

SpMV results

Aggregate results over 24 matrices:

                    Structured   Unstructured   Average
Intel Xeon Phi         21.6           8.7         15.2
2x Ivy Bridge CPU      23.5          14.6         19.0
NVIDIA K20X GPU        16.7          13.3         15.0

Generalised statements:
• Large structured matrices: GPUs perform best.
• Large unstructured matrices: CPUs perform best.
• Smaller matrices: Xeon Phi or CPUs.

But mostly:

no one solution fits all.

SLIDE 44

Conclusions

Optimised the BICRS data structure so that it:
• fully exploits vector operations through gather/scatter,
• controls fill-in through block size selection,
• supports arbitrary nonzero traversals,
• retains all opportunities for compression.

The new approach generalises several earlier approaches.

Software: http://albert-jan.yzelman.net/software#SL
