slide-1
SLIDE 1

Parallel SpMV multiplication

Albert-Jan Yzelman (ExaScience Lab / KU Leuven) Dirk Roose (KU Leuven) December 2013

© 2013, ExaScience Lab - A. N. Yzelman, D. Roose

slide-2
SLIDE 2

Acknowledgements

This presentation outlines the paper

  • A. N. Yzelman and D. Roose, “High-level strategies for parallel shared-memory sparse matrix–vector multiplication”, IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS), in press: http://dx.doi.org/10.1109/TPDS.2013.31

This work is funded by Intel and by the Institute for the Promotion of Innovation through Science and Technology (IWT), in the framework of the Flanders ExaScience Lab, part of Intel Labs Europe.


slide-3
SLIDE 3

Introduction

Given a sparse m × n matrix A and an n × 1 input vector x, we consider both sequential and parallel computation of

y = Ax.
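Written out componentwise (a standard identity, included here only to fix notation), each output entry gathers the nonzeroes of its row:

    y_i = Σ_{j : a_ij ≠ 0} a_ij · x_j,    for i = 0, ..., m − 1,

so one sequential SpMV performs 2 nz floating-point operations: one multiplication and one addition per nonzero.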


slide-7
SLIDE 7

Introduction

First obstacle: inefficient cache use

Row-major ordering of nonzeroes: linear access of the output vector y; irregular access of the input vector x.


slide-9
SLIDE 9

Introduction

Second obstacle: the SpMV is bandwidth bound

[Figure: roofline plot of attainable GFLOP/sec versus arithmetic intensity (FLOP/byte), showing the peak memory bandwidth slope and the peak floating-point ceiling with vectorization.]

SpMV has low arithmetic intensity: 0.2–0.25. Compression is mandatory!

(Image courtesy of Prof. Wim Vanroose, UA)
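A back-of-the-envelope justification of the 0.2–0.25 figure (an illustration assuming double-precision values, not taken from the slides): each nonzero contributes one multiplication and one addition, i.e. 2 flops, and moves at least its own 8-byte value from memory, giving at most 2/8 = 0.25 flop per byte; also counting a 4-byte column index lowers this to 2/12 ≈ 0.17 flop per byte. Either way the kernel lies far to the left of the roofline's ridge point, i.e. it is bandwidth bound.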


slide-10
SLIDE 10

Introduction

Third obstacle: NUMA architectures

[Figure: a quad-core processor; each core has a 32kB L1 cache, and each pair of neighbouring cores shares a 4MB L2 cache connected to the system interface.]

Processor-level NUMAness affects cache behaviour: data from x or y can be shared between neighbouring cores.


slide-11
SLIDE 11

Introduction

Third obstacle: NUMA architectures

Socket-level NUMAness affects the effective bandwidth. If each processor moves data to (and from) the same memory bank, we are bound by the bandwidth of that single memory bank.


slide-12
SLIDE 12

Introduction: summary

Three factors impede creating an efficient shared-memory parallel SpMV multiplication kernel:

1. inefficient cache use,
2. limited memory bandwidth, and
3. non-uniform memory access (NUMA).


slide-13
SLIDE 13

Increasing cache-efficiency

Adapt the nonzero order:

Original situation

linear access of the output vector y; irregular access of the input vector x.

Ref.: A. N. Yzelman and Rob H. Bisseling, “Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods”, SIAM Journal on Scientific Computing 31(4), pp. 3128-3154 (2009).

slide-14
SLIDE 14

Increasing cache-efficiency

Adapt the nonzero order:

Zig-zag CRS

retains linear access of the output vector y; imposes O(m) additional locality on accesses to the input vector x.
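A minimal sketch of the zig-zag access pattern (illustrative, not the authors' data structure; names are assumed): standard CRS arrays, defined later in the deck, are traversed backwards on odd rows. ZZ-CRS itself stores the nonzeroes already in this order, so that a single forward pass over its arrays suffices.

    #include <stddef.h>

    /* y += A x with zig-zag row traversal over CRS arrays
     * (V: values, J: column indices, I: row pointers of length m+1). */
    void zzcrs_spmv( const size_t m, const double *V, const size_t *J,
                     const size_t *I, const double *x, double *y )
    {
        for( size_t i = 0; i < m; ++i ) {
            if( i % 2 == 0 ) {                /* even rows: left to right */
                for( size_t k = I[ i ]; k < I[ i + 1 ]; ++k )
                    y[ i ] += V[ k ] * x[ J[ k ] ];
            } else {                          /* odd rows: right to left */
                for( size_t k = I[ i + 1 ]; k > I[ i ]; --k )
                    y[ i ] += V[ k - 1 ] * x[ J[ k - 1 ] ];
            }
        }
    }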

Ref.: A. N. Yzelman and Rob H. Bisseling, “Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods”, SIAM Journal on Scientific Computing 31(4), pp. 3128-3154 (2009).

slide-15
SLIDE 15

Increasing cache-efficiency

Adapt the nonzero order using space-filling curves:

Fractal storage using the coordinate format, COO

no linear access of y, but better combined locality on x and y.

Ref.: Haase, Liebmann and Plank, “A Hilbert-Order Multiplication Scheme for Unstructured Sparse Matrices”, International Journal of Parallel, Emergent and Distributed Systems 22(4), pp. 213-220 (2007).

slide-16
SLIDE 16

Increasing cache-efficiency

Or reordering matrix rows and columns:

Reordering based on sparse matrix partitioning

combines with adapting the nonzero order; models an upper bound on cache misses (with ZZ-CRS).

Ref.: A. N. Yzelman and Rob H. Bisseling, “Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods”, SIAM Journal on Scientific Computing 31(4), pp. 3128-3154 (2009).


slide-19
SLIDE 19

Increasing cache-efficiency

Sparse blocking enhances reordering: the corresponding vector elements fit into cache, and block-wise reordering is faster.
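An illustrative calculation with assumed numbers (not from the slides): for β = 2^17 and 8-byte doubles, the parts of x and y touched by a single β × β block take at most 2 · 2^17 · 8 bytes = 2 MB, which fits in the 4MB L2 cache of the quad-core processor sketched earlier.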


slide-20
SLIDE 20

Increasing cache-efficiency

Two options: space-filling curves within blocks or across blocks. Within blocks: Compressed Sparse Blocks (CSB).

Ref.: Buluç, Williams, Oliker, and Demmel, “Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication”, Proc. Parallel & Distributed Processing (IPDPS), IEEE International, pp. 721-733 (2011).


slide-22
SLIDE 22

Increasing cache-efficiency

Two options: space-filling curves within blocks or across blocks. Across blocks: Morton-ordered or recursive block storage.

Ref.: Lorton and Wise, “Analyzing block locality in Morton-order and Morton-hybrid matrices”, SIGARCH Computer Architecture News 35(4), pp. 6-12 (2007).
Ref.: Martone, Filippone, Tucci, Paprzycki, and Ganzha, “Utilizing recursive storage in sparse matrix-vector multiplication - preliminary considerations”, Proceedings of the ISCA 25th International Conference on Computers and Their Applications (CATA), pp. 300-305 (2010).


slide-24
SLIDE 24

Increasing cache-efficiency

Techniques: adapting the nonzero order, adapting the row and column order, and sparse blocking.

(These are orthogonal approaches.)

Other techniques: cache-aware, matrix-aware blocking, e.g., Sparsity and OSKI.


slide-25
SLIDE 25

Compression

Assuming a Hilbert order of nonzeroes, take

    A = | 4 1 3 0 |
        | 0 0 2 3 |
        | 1 0 0 2 |
        | 7 0 1 1 |

The coordinate format (COO) then stores

    V = [ 7 1 4 1 2 3 3 2 1 1 ]
    J = [ 0 0 0 1 2 2 3 3 3 2 ]
    I = [ 3 2 0 0 1 0 1 2 3 3 ]

Storage requirements: Θ(3 nz).
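A minimal SpMV kernel over this COO storage (a sketch, not the authors' implementation; names are illustrative). The triplets are visited in whatever order they are stored in, here the Hilbert order:

    #include <stddef.h>

    /* y += A x over coordinate storage: the k-th triplet is a_{I[k],J[k]} = V[k]. */
    void coo_spmv( const size_t nz, const double *V, const size_t *I,
                   const size_t *J, const double *x, double *y )
    {
        for( size_t k = 0; k < nz; ++k )
            y[ I[ k ] ] += V[ k ] * x[ J[ k ] ];
    }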


slide-26
SLIDE 26

Compression

Compressed COO: Use COO to store each β × β block. Elements of I, J then require only log2 β bits, for a total storage of O(2nz + m).
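For example (an illustration; β is a tunable block size): with β = 2^16, a block-local row or column index fits in 16 bits instead of a full 32- or 64-bit word, so the I and J arrays shrink by a factor of two to four while V stays unchanged.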

Ref.: Buluç, Williams, Oliker, and Demmel, “Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication”, Proc. Parallel & Distributed Processing (IPDPS), IEEE International, pp. 721-733 (2011).

slide-27
SLIDE 27

Compression

Assuming a row-major order of nonzeroes over the same matrix

    A = | 4 1 3 0 |
        | 0 0 2 3 |
        | 1 0 0 2 |
        | 7 0 1 1 |

CRS stores

    V = [ 4 1 3 2 3 1 2 7 1 1 ]
    J = [ 0 1 2 2 3 0 3 0 2 3 ]
    Î = [ 0 3 5 7 10 ]

Storage requirements: Θ(2 nz + m + 1).

(nz is the number of nonzeroes in A.)
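For reference, a minimal sequential kernel over these CRS arrays (a sketch; the row-pointer array Î is written I here, and the names are illustrative):

    #include <stddef.h>

    /* y += A x over CRS storage: row i owns the nonzeroes I[i] .. I[i+1]-1. */
    void crs_spmv( const size_t m, const double *V, const size_t *J,
                   const size_t *I, const double *x, double *y )
    {
        for( size_t i = 0; i < m; ++i )
            for( size_t k = I[ i ]; k < I[ i + 1 ]; ++k )
                y[ i ] += V[ k ] * x[ J[ k ] ];
    }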


slide-28
SLIDE 28

Compression

For the same matrix, bi-directional incremental CRS (BICRS) stores the Hilbert-ordered nonzeroes as

    V  = [ 7 1 4 1 2 3 3 2 1 1 ]
    ∆J = [ 0 4 4 1 5 4 5 4 3 1 ]
    ∆I = [ 3 -1 -2 1 -1 1 1 1 ]

Storage requirements, allowing arbitrary traversals: Θ(2 nz + row jumps + 1).
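For the 4 × 4 example (nz = 10, m = 4, and 7 row jumps under the Hilbert order) this gives: COO 3 · 10 = 30 array entries, BICRS 2 · 10 + 7 + 1 = 28 entries, and row-major CRS 2 · 10 + 4 + 1 = 25 entries. CRS is smallest but fixes the row-major traversal; BICRS supports the Hilbert traversal at close to CRS cost, and Compressed BICRS (next slide) shrinks ∆I and ∆J further.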

Ref.: Yzelman and Bisseling, “A cache-oblivious sparse matrix–vector multiplication scheme based on the Hilbert curve”, Progress in Industrial Mathematics at ECMI 2010, pp. 627-634 (2012).

slide-29
SLIDE 29

Compression

Compressed BICRS: Using BICRS to store each β × β block, compressing ∆I, ∆J. Total storage may be less than O(2nz + m).

Ref.: Yzelman and Roose, “High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication”, IEEE Transactions on Parallel and Distributed Systems, doi: 10.1109/TPDS.2013.31, in press (2013).

slide-30
SLIDE 30

Distributions and NUMA

Implicit distribution, centralised local allocation: If each processor moves data to the same single memory element, the bandwidth is limited by a single controller.


slide-31
SLIDE 31

Distributions and NUMA

Implicit distribution, centralised interleaved allocation: If each processor moves data from all memory elements, the bandwidth multiplies if accesses are uniformly random.


slide-32
SLIDE 32

Distributions and NUMA

Explicit distribution, distributed local allocation: If each processor moves data from and to its own unique memory element, the bandwidth multiplies.


slide-33
SLIDE 33

Distributions and NUMA

Use interleaved allocation for everything:
  • no bounds on non-local data movement!

Explicitly distribute everything:
  • only local data; explicit inter-process data movement.

Mixed approach, y and A explicit, x interleaved:
  • no bounds on non-local data movement for x.
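A sketch of how the mixed approach could be realised with libnuma (an illustration under assumed names, not the authors' code; error handling and the matching numa_free calls are omitted): the local parts of A and y are placed on the socket that owns them, while x is interleaved across all sockets.

    #include <stddef.h>
    #include <numa.h>   /* libnuma: numa_alloc_onnode, numa_alloc_interleaved */

    /* Allocate the per-socket parts of A (values) and y on a given socket,
     * and the globally shared input vector x interleaved over all sockets. */
    double * alloc_mixed( const size_t local_nnz, const size_t local_m,
                          const size_t n, const int socket,
                          double **V, double **y )
    {
        *V = (double*) numa_alloc_onnode( local_nnz * sizeof(double), socket );
        *y = (double*) numa_alloc_onnode( local_m   * sizeof(double), socket );
        /* pages of x are spread round-robin, so no single memory bank bounds it */
        return (double*) numa_alloc_interleaved( n * sizeof(double) );
    }

Alternatively, under the default first-touch policy each thread can simply initialise its own parts of A and y so that their pages land on its local socket.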


slide-34
SLIDE 34

Choices

A state-of-the-art SpMV strategy has to choose:

  • a cache-efficient strategy (aware, oblivious, ...),
  • a competitive data structure (compression), and
  • a NUMA-efficient distribution strategy.

Fine-grained examples:

  • OpenMP CRS: regular CRS, interleaved;

        #pragma omp parallel for private( i, k ) schedule( dynamic, 8 )
        for i = 0 to m − 1 do
            for k = Î_i to Î_{i+1} − 1 do
                add V_k · x_{J_k} to y_i

  • CSB: Z-curves, compressed (blocked) COO, interleaved.

When fine-grained, is interleaving everything mandatory?
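A compilable rendering of the OpenMP CRS example above (a sketch over the CRS arrays from the Compression slides; the schedule parameters come from the slide, everything else is illustrative):

    #include <stddef.h>

    /* Fine-grained parallel y += A x over CRS: rows are handed out to threads
     * in chunks of 8; all arrays are assumed to be allocated interleaved. */
    void openmp_crs_spmv( const size_t m, const double *V, const size_t *J,
                          const size_t *I, const double *x, double *y )
    {
        #pragma omp parallel for schedule( dynamic, 8 )
        for( long i = 0; i < (long) m; ++i )
            for( size_t k = I[ i ]; k < I[ i + 1 ]; ++k )
                y[ i ] += V[ k ] * x[ J[ k ] ];
    }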


slide-35
SLIDE 35

Coarse-grained SpMVs

1D: Hilbert curve, compressed (blocked) BICRS, explicit allocation of y and A.

Ref.: Yzelman and Roose, “High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication”, IEEE Transactions on Parallel and Distributed Systems, doi: 10.1109/TPDS.2013.31, in press (2013).

slide-36
SLIDE 36

Coarse-grained SpMVs

2D: using partitioning and reordering:

Ref.: Yzelman and Bisseling, “Two-dimensional cache-oblivious sparse matrix-vector multiplication”, Parallel Computing 37(12), pp. 806-819 (2011).


slide-41
SLIDE 41

Coarse-grained SpMVs

2D: using partitioning and reordering:

Ref.: Yzelman and Roose, “High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication”, IEEE Transactions on Parallel and Distributed Systems, doi: 10.1109/TPDS.2013.31, in press (2013).

slide-42
SLIDE 42

BSP 2D implementation

for each local a_ij ≠ 0 do
    if x_j is not local then
        bsp_get x_j from the remote process
bsp_sync
multiply y = Ax (using local nonzeroes only)
for each local a_ij ≠ 0 do
    if y_i is non-local then
        bsp_send (i, y_i) to the remote process
bsp_sync
while bsp_qsize() > 0 do
    bsp_move (i, α) from the queue
    add α to the local y_i

Ref.: Yzelman, Bisseling, Roose, and Meerbergen, “MulticoreBSP for C: a high-performance library for shared-memory parallel programming”, International Journal of Parallel Programming, doi: 10.1007/s10766-013-0262-9, in press (2013).

slide-43
SLIDE 43

Experiments: CPUs

Matrix        Size          Nonzeroes     Origin
Freescale1     3 428 755    17 052 626    Semiconductor industry
adaptive       6 815 744    27 248 640    Numerical simulation
ldoor            952 203    42 493 817    Structural engineering
wiki2007       3 566 907    45 030 389    Link matrix
road_usa      23 947 347    57 708 624    Road network

Average speedups relative to sequential CRS:

              4 x 10    8 x 8
OpenMP CRS      8.8       7.2
PThread 1D     13.6      20.0
Cilk CSB       22.9      26.9
BSP 2D         21.3      30.8

4 x 10: HP DL-580, 4 sockets, 10-core Intel Xeon E7-4870
8 x 8: HP DL-980, 8 sockets, 8-core Intel Xeon E7-2830

Ref.: Yzelman, Bisseling, Roose, and Meerbergen, “MulticoreBSP for C: a high-performance library for shared-memory parallel programming”, International Journal of Parallel Programming, doi: 10.1007/s10766-013-0262-9, in press (2013).

slide-44
SLIDE 44

Experiments: Xeon Phi

60 cores, 4 HW threads per core, scales ‘too slowly’:

[Figure: average SpMV speed (Gflop/s) on an Intel Xeon Phi versus the number of HW threads, for the matrices adaptive and wiki05 using the 1D PThreads scheme.]

Latency bound; add more threads, or employ vectorisation


slide-45
SLIDE 45

Experiments: Xeon Phi

[Figure: average SpMV multiplication speed (Gflop/s) on the Xeon Phi with 240 threads for the matrices bcsstk17, lhr34, bcsstk32, nug30, s3dkt3m2, tbdlinux, Freescale1, cage14, wiki05, adaptive, nd24k, RM07R, and GL7d18, comparing blockings 1x1, 1x8, 2x4, 4x2, and 8x1.]


slide-46
SLIDE 46

Conclusion

We have seen techniques for cache-oblivious SpMV multiplication and for sparse matrix compression. For new and upcoming architectures: highly NUMA architectures require 2D partitioning, and vectorisation can help hide latency (even within a bandwidth-bound computation).

http://www.multicorebsp.com

people.cs.kuleuven.be/~albert-jan.yzelman/software.php
