SLIDE 1

Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply

Richard Vuduc, James Demmel, Katherine Yelick, Shoaib Kamil, Rajesh Nishtala, Benjamin Lee
Wednesday, November 20, 2002

Berkeley Benchmarking and OPtimization (BeBOP) Project
www.cs.berkeley.edu/~richie/bebop

Computer Science Division, U.C. Berkeley, Berkeley, California, USA
SC 2002: Session on Sparse Linear Algebra


SLIDE 2

Context: Performance Tuning in the Sparse Case

Application performance is dominated by a few computational kernels.

Performance tuning today:
  • Vendor-tuned libraries (e.g., BLAS), or the user hand-tunes
  • Automatic tuning (e.g., PHiPAC/ATLAS, FFTW/SPIRAL/UHFFT)

Tuning sparse linear algebra kernels is hard. Sparse code has:
  • high bandwidth requirements (extra storage)
  • poor locality (indirect, irregular memory access)
  • poor instruction mix (data structure manipulation)

Sparse matrix-vector multiply (SpM×V) performance: less than 10% of machine peak.
Performance depends on the kernel, the architecture, and the matrix.

SLIDE 3

Example: Matrix olafu

[Spy plot of 03-olafu.rua: 1,015,156 non-zeros]

N = 16146, nnz = 1.0M, kernel = SpM×V

SLIDE 4

Example: Matrix olafu

[Spy plot, zoom-in on 03-olafu.rua: 1188 non-zeros shown]

N = 16146, nnz = 1.0M, kernel = SpM×V

A natural choice: 6×6 blocked compressed sparse row (BCSR).

Experiment: measure the performance of all r×c block sizes for r, c ∈ {1, 2, 3, 6}.
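For reference, a minimal C sketch of the r×c BCSR kernel y ← y + Ax follows. The struct fields and function name are illustrative, not SPARSITY's actual data structure; in practice SPARSITY generates a fully unrolled mini-multiply for each fixed (r, c), so that the r destination values stay in registers.

    #include <stddef.h>

    /* Illustrative r-by-c BCSR storage: blocks are r-by-c, row-major,
       stored contiguously, including any explicitly filled-in zeros.
       Assumes r divides the number of matrix rows (pad otherwise). */
    typedef struct {
        int r, c;          /* register block size */
        int n_brows;       /* number of block rows */
        int *brow_ptr;     /* start of each block row in bcol_ind; n_brows+1 entries */
        int *bcol_ind;     /* leftmost matrix column of each block */
        double *bvalues;   /* block entries, r*c per block */
    } bcsr_t;

    /* y <- y + A*x for a BCSR matrix A (sketch, unoptimized). */
    void bcsr_spmv(const bcsr_t *A, const double *x, double *y)
    {
        const double *v = A->bvalues;
        for (int I = 0; I < A->n_brows; I++) {            /* block row I */
            double *yb = &y[(size_t)I * A->r];
            for (int k = A->brow_ptr[I]; k < A->brow_ptr[I+1]; k++) {
                const double *xb = &x[A->bcol_ind[k]];    /* x slice under block */
                for (int i = 0; i < A->r; i++)            /* r-by-c mini-multiply */
                    for (int j = 0; j < A->c; j++)
                        yb[i] += v[i * A->c + j] * xb[j];
                v += A->r * A->c;
            }
        }
    }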


SLIDE 5

Speedups on Itanium: The Need for Search

[Figure: r×c register blocking performance (Mflop/s) for 03-olafu.rua on Itanium (itanium-linux-ecc); peak machine speed: 3.2 Gflop/s]

Speedup over the 1×1 (CSR) code at each block size:

          c=1   c=2   c=3   c=6
    r=1  1.00  0.86  0.99  0.93
    r=2  1.21  1.27  0.97  0.78
    r=3  1.44  0.99  1.07  0.64
    r=6  1.20  0.81  0.75  0.95

The best block size (3×1, a 1.44x speedup) is not the natural 6×6 choice, which actually slows the code down (0.95x).


SLIDE 10

Key Questions and Conclusions

How do we choose the best data structure automatically?
  • New heuristic for choosing optimal (or near-optimal) block sizes

What are the limits on the performance of blocked SpM×V?
  • Derive performance upper bounds for blocking
  • Often within 20% of the upper bound, placing limits on improvement from more "low-level" tuning
  • Performance is memory-bound: reducing data structure size is critical

Where are the new opportunities (kernels, techniques) for achieving higher performance?
  • Identify cases in which blocking does and does not work
  • Identify new kernels and opportunities for reducing memory traffic

SLIDE 11

Related Work

Automatic tuning systems and code generation:
  • PHiPAC [BACD97], ATLAS [WPD01], SPARSITY [Im00]
  • FFTW [FJ98], SPIRAL [PSVM01], UHFFT [MMJ00]
  • MPI collective ops (Vadhiyar, et al. [VFD01])
  • Sparse compilers (Bik [BW99], Bernoulli [Sto97])
  • Generic programming (Blitz++ [Vel98], MTL [SL98], GMCL [Neu98], ...)
  • FLAME [GGHvdG01]

Sparse performance modeling and tuning:
  • Temam and Jalby [TJ92]
  • Toledo [Tol97], White and Sadayappan [WS97], Pinar [PH99]
  • Navarro [NGLPJ96], Heras [HPDR99], Fraguela [FDZ99]
  • Gropp, et al. [GKKS99], Geus [GR99]

Sparse kernel interfaces:
  • Sparse BLAS Standard [BCD+01]
  • NIST SparseBLAS [RP96], SPARSKIT [Saa94], PSBLAS [FC00]
  • PETSc, hypre, ...


SLIDE 13

Approach to Automatic Tuning

For each kernel:
  • Identify and generate a space of implementations
  • Search to find the fastest (using models, experiments)

The SPARSITY system for SpM×V [Im & Yelick ’99]:

Interface:
  • Input: your sparse matrix (CSR)
  • Output: data structure + routine tuned to your matrix & machine

Implementation space:
  • register-level blocking (r×c), cache blocking, multiple vectors, ...

Search:
  • Off-line: benchmarking (once per architecture)
  • Run-time: estimate matrix properties ("search") and predict the best data structure parameters



SLIDE 17

Register-Level Blocking (SPARSITY): 3x3 Example

[Figure: 3×3 register blocking example; (688 true non-zeros) + (383 explicit zeros) = 1071 stored nz]

BCSR imposes a uniform, aligned grid of blocks; fill-in zeros trade extra flops for better storage and locality efficiency.

In this example, 50% fill led to a 1.5x speedup on a Pentium III.
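The fill ratio this trade-off hinges on can be computed exactly in one pass over the matrix. Below is a C sketch (not SPARSITY's code, which instead estimates the ratio by examining only a fraction of the block rows):

    #include <stdlib.h>

    /* Exact fill ratio of r-by-c BCSR for an m-by-n CSR matrix:
       fill(r,c) = (stored values, incl. explicit zeros) / (true non-zeros).
       Blocks are aligned to a uniform grid; runs in O(nnz) per (r,c). */
    double bcsr_fill_ratio(int m, int n, const int *row_ptr,
                           const int *col_ind, int r, int c)
    {
        int n_bcols = (n + c - 1) / c;
        long nnz = row_ptr[m], n_blocks = 0;
        /* last[J] = block row that most recently touched block column J */
        int *last = malloc(n_bcols * sizeof *last);
        for (int J = 0; J < n_bcols; J++) last[J] = -1;

        for (int i = 0; i < m; i++) {            /* rows in ascending order */
            int I = i / r;                       /* enclosing block row */
            for (int k = row_ptr[i]; k < row_ptr[i+1]; k++) {
                int J = col_ind[k] / c;          /* enclosing block column */
                if (last[J] != I) { last[J] = I; n_blocks++; }
            }
        }
        free(last);
        return (double)(n_blocks * r * c) / (double)nnz;
    }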


SLIDE 18

Search: Choosing the Block Size

Off-line benchmarking (once per architecture):
  • Measure Dense Performance(r,c): the Mflop/s of a dense matrix stored in sparse r×c blocked format

At run-time, when the matrix is known:
  • Estimate Fill Ratio(r,c) for all r, c:
    Fill Ratio(r,c) = (number of stored values) / (number of true non-zeros)
  • Choose the r, c that maximize

    Estimated Performance(r,c) = Dense Performance(r,c) / Fill Ratio(r,c)

(This replaces the previous SPARSITY heuristic.)
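A minimal sketch of the selection loop, assuming the dense profile is stored as a table dense_mflops[r][c] from the off-line benchmark and reusing the bcsr_fill_ratio sketch above (names are illustrative, not SPARSITY's API):

    /* Pick the (r,c), up to 12x12 as in the profiles below, that
       maximizes dense_mflops[r][c] / fill_ratio(r,c). */
    enum { RMAX = 12, CMAX = 12 };

    void choose_block_size(int m, int n, const int *row_ptr, const int *col_ind,
                           const double dense_mflops[RMAX+1][CMAX+1],
                           int *r_best, int *c_best)
    {
        double best = 0.0;
        *r_best = *c_best = 1;
        for (int r = 1; r <= RMAX; r++)
            for (int c = 1; c <= CMAX; c++) {
                double fill = bcsr_fill_ratio(m, n, row_ptr, col_ind, r, c);
                double est  = dense_mflops[r][c] / fill;  /* estimated Mflop/s */
                if (est > best) { best = est; *r_best = r; *c_best = c; }
            }
    }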


SLIDE 19

Off-line Benchmarking [Intel Pentium III]

[Figure: register blocking performance (Mflop/s) for a dense matrix (n=1000) in sparse r×c format, r, c = 1..12, on Pentium III (pentium3-linux-icc)]

Top 10 codes labeled by speedup over the unblocked code: 2.54, 2.49, 2.45, 2.44, 2.41, 2.39, 2.38, 2.39, 2.37, 2.36. Max speedup = 2.54 (at 2×10).

SLIDE 20

[Figure: dense-matrix register blocking profiles (Mflop/s, r, c = 1..12) on four platforms; each panel labeled with its maximum speedup over the unblocked code]

  • 333 MHz Sun Ultra 2i (2.03): dense n=1000, ultra-solaris
  • 500 MHz Intel Pentium III (2.54): dense n=1000, pentium3-linux-icc
  • 375 MHz IBM Power3 (1.22): dense n=2000, power3-aix
  • 800 MHz Intel Itanium (1.55): dense n=1000, itanium-linux-ecc

SLIDE 21

Performance Bounds for Register Blocking

How close are we to the speed limit of blocking?

Upper bound on performance = (flops) / (lower bound on time):
  • Flops = 2 × (number of true non-zeros)
  • Lower bound on time rests on two key assumptions: consider only memory operations, and count only compulsory misses (i.e., ignore conflicts)
  • Other features of the model: accounts for the cache line size; the bound is a function of the matrix and of (r, c)

Model of execution time (see paper): charge the full latency α_i for hits at each cache level i, e.g., with two cache levels:

    T = (L1 hits)·α₁ + (L2 hits)·α₂ + (memory accesses)·α_mem
      = (loads)·α₁ + (L1 misses)·(α₂ − α₁) + (L2 misses)·(α_mem − α₂)
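Turning this model into an upper bound on Mflop/s is mechanical. Here is a hedged C sketch for a two-level cache, with latencies in cycles; the load count and the lower-bound miss counts come from the compulsory-miss model, and the names are ours, not the paper's:

    /* Upper bound on SpMxV performance implied by the latency model:
       T >= loads*a1 + m1*(a2 - a1) + m2*(a_mem - a2) cycles. */
    double mflops_upper_bound(long true_nnz, double loads,
                              double m1 /* L1 miss lower bound */,
                              double m2 /* L2 miss lower bound */,
                              double alpha1, double alpha2, double alpha_mem,
                              double clock_mhz)
    {
        double t_cycles = loads * alpha1
                        + m1 * (alpha2 - alpha1)
                        + m2 * (alpha_mem - alpha2);
        double flops  = 2.0 * (double)true_nnz; /* one multiply + one add per nz */
        double t_usec = t_cycles / clock_mhz;   /* cycles / (cycles per usec) */
        return flops / t_usec;                  /* flops per usec = Mflop/s */
    }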


SLIDE 22

Overview of Performance Results

Experimental setup:
  • Four machines: Pentium III, Ultra 2i, Power3, Itanium
  • 44 matrices: dense, finite element (FEM), mixed, linear programming (LP)
  • Misses and cycles measured using PAPI [BDG+00]
  • Reference: CSR (i.e., 1×1, "unblocked")

Main observations:
  • SPARSITY vs. reference: up to 2.5x faster, especially on FEM matrices
  • Block size selection: chooses within 10% of the best
  • SPARSITY performance is typically within 20% of the upper bound
  • SPARSITY is least effective on the Power3


SLIDE 26

Performance Results: Intel Pentium III

[Figure: per-matrix performance summary (Mflop/s) on Pentium III (pentium3-linux-icc); matrices grouped as dense (D), FEM, FEM (variable block), mixed, and LP; series: analytic upper bound, upper bound (PAPI), Sparsity (exhaustive), Sparsity (heuristic), reference]

DGEMV (n=1000): 96 Mflop/s


SLIDE 27

Performance Results: Sun Ultra 2i

[Figure: per-matrix performance summary (Mflop/s) on Ultra 2i (ultra-solaris), same series and matrix groups as above]

DGEMV (n=1000): 58 Mflop/s


SLIDE 28

Performance Results: Intel Itanium

[Figure: per-matrix performance summary (Mflop/s) on Itanium (itanium-linux-ecc), same series and matrix groups as above]

DGEMV (n=1000): 310 Mflop/s


SLIDE 29

Performance Results: IBM Power3

[Figure: per-matrix performance summary (Mflop/s) on Power3 (power3-aix); matrix groups dense (D), FEM, FEM (variable block), and LP; same series as above]

DGEMV (n=2000): 260 Mflop/s


SLIDE 30

Conclusions

Tuning can be difficult, even when the matrix structure is known: performance is a complicated function of the architecture and the matrix.

The new heuristic for choosing the block size selects an optimal or near-optimal implementation (performance within 5–10% of the best).

The limits of low-level tuning for blocking are near: performance is often within 20% of the upper bound, particularly for FEM matrices.

Unresolved: closing the gap on the Power3.

SLIDE 31

BeBOP: Current and Future Work (1/2)

Further performance improvements:
  • symmetry (1.5–2x speedups)
  • diagonals, block diagonals, and bands (1.2–2x)
  • splitting for variable block structure (1.3–1.7x)
  • reordering to create dense structure (1.7x)
  • cache blocking (1.5–4x)
  • multiple vectors (2–7x), and combinations ...
  • How to choose the optimizations & tuning parameters?

Sparse triangular solve (ICS'02 POHLL workshop paper):
  • hybrid sparse/dense structure (1.2–1.8x)

Higher-level kernels that permit reuse (see the sketch after this list):
  • AAᵀx, AᵀAx (1.5–3x)
  • Ax and Aᵀy simultaneously, Aᵏx, RARᵀ, ... (future work)
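The reuse win in these kernels comes from touching each row of A only once. For example, since AᵀAx = Σᵢ aᵢ(aᵢᵀx) over the rows aᵢ of A, both uses of a row happen while it is still in registers or cache, instead of streaming A through memory twice for two separate SpM×Vs. A hedged C sketch of that idea (illustrative, not BeBOP's code):

    /* y <- A^T * A * x for an m-by-n CSR matrix A; x and y have length n.
       Each row is read from memory once and used twice: t = a_i . x,
       then y += t * a_i. Caller must zero y. */
    void spmv_ata(int m, const int *row_ptr, const int *col_ind,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < m; i++) {
            int start = row_ptr[i], end = row_ptr[i+1];
            double t = 0.0;
            for (int k = start; k < end; k++)   /* t = a_i . x */
                t += val[k] * x[col_ind[k]];
            for (int k = start; k < end; k++)   /* y += t * a_i */
                y[col_ind[k]] += t * val[k];
        }
    }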


SLIDE 32

BeBOP: Current and Future Work (2/2)

An automatically tuned sparse matrix library:
  • Code generation via sparse compilers (Bernoulli; Bik)
  • Plan to extend the new Sparse BLAS standard by one routine to support tuning

Architectural issues:
  • Improvements for the Power3?
  • Latency vs. bandwidth (see paper)
  • Using models to explore the architectural design space

SLIDE 33

EXTRA SLIDES


SLIDE 34

Example: No Big Surprises on Sun Ultra 2i

[Figure: r×c register blocking performance (Mflop/s) for 03-olafu.rua on Ultra 2i (ultra-solaris)]

Speedup over the 1×1 (CSR) code at each block size:

          c=1   c=2   c=3   c=6
    r=1  1.00  1.09  1.19  1.21
    r=2  1.17  1.25  1.30  1.30
    r=3  1.31  1.34  1.41  1.39
    r=6  1.40  1.39  1.42  1.53

Here the natural 6×6 block size is also the best (1.53x).

SLIDE 35

Where in Memory is the Time Spent?

[Figure: modeled fraction of cycles spent at each memory level (L1, L2, L3, memory), using model hit/miss counts, for the exhaustive best implementation averaged over matrices, on Ultra 2i (L1/L2), Pentium III (L1/L2), Power3 (L1/L2), and Itanium (L1/L2/L3)]

SLIDE 36

Cache Miss Bound Verification: Sun Ultra 2i (L1)

[Figure: L1 misses (millions, log scale) per matrix: modeled upper bound, PAPI measurement, modeled lower bound; ultra-solaris]

SLIDE 37

Cache Miss Bound Verification: Sun Ultra 2i (L2)

[Figure: L2 misses (millions, log scale) per matrix: modeled upper bound, PAPI measurement, modeled lower bound; ultra-solaris]

SLIDE 38

Cache Miss Bound Verification: Intel Pentium III (L1)

[Figure: L1 misses (millions, log scale) per matrix: modeled upper bound, PAPI measurement, modeled lower bound; pentium3-linux-icc]

SLIDE 39

Cache Miss Bound Verification: Intel Pentium III (L2)

[Figure: L2 misses (millions, log scale) per matrix: modeled upper bound, PAPI measurement, modeled lower bound; pentium3-linux-icc]

SLIDE 40

Cache Miss Bound Verification: IBM Power3 (L1)

[Figure: L1 misses (millions, log scale) per matrix: modeled upper bound, PAPI measurement, modeled lower bound; power3-aix]

SLIDE 41

Cache Miss Bound Verification: IBM Power3 (L2)

[Figure: L2 misses (millions, log scale) per matrix: modeled upper bound, PAPI measurement, modeled lower bound; power3-aix]

SLIDE 42

Cache Miss Bound Verification: Intel Itanium (L2)

[Figure: L2 misses (millions, log scale) per matrix: modeled upper bound, PAPI measurement, modeled lower bound; itanium-linux-ecc]

SLIDE 43

Cache Miss Bound Verification: Intel Itanium (L3)

[Figure: L3 misses (millions, log scale) per matrix: modeled upper bound, PAPI measurement, modeled lower bound; itanium-linux-ecc]

SLIDE 44

Latency vs. Bandwidth

[Figure: fraction of peak main memory bandwidth (sustainable per STREAM) achieved on Ultra 2i, Pentium III, Power3, and Itanium by a range of kernels (a[i]=b[i]; a[i]=α*b[i]; a[i]=b[i]+c[i]; a[i]=b[i]+α*c[i]; s+=a[i]; s+=a[i]*b[i]; s+=a[i], k+=ind[i]; s+=a[i]*x[ind[i]] with x on-chip; s+=a[i]*x[ind[i]] with x off-chip; DGEMV; SpM×V on a dense matrix, 1×1 and best block size), compared against the full-latency model]

SLIDE 45

Related Work (2/2)

Compilers (analysis and models); run-time selection:
  • CROPS (UCSD/Carter, Ferrante, et al.)
  • TUNE (Chatterjee, et al.)
  • Iterative compilation (O'Boyle, et al., 1998)
  • Broadway (Guyer and Lin, ’99)
  • Brewer (’95); ADAPT (Voss, 2000)

Interfaces: Sparse BLAS; PSBLAS; PETSc

Sparse triangular solve:
  • SuperLU / MUMPS / SPOOLES / UMFPACK / PSPASES ...
  • Approximation: Alvarado (’93); Raghavan (’98)
  • Scalability: Rothberg (’92; ’95); Gupta (’95); Li, Coleman (’88)

SLIDE 46

What is the Cost of Search?

[Figure: block-size selection overhead on Itanium (itanium-linux-ecc), per matrix, in units of reference SpM×V operations (up to roughly 45), split into the heuristic and the data structure rebuild; per-matrix labels range from .71 to .86]

SLIDE 47

Where in Memory is the Time Spent? (PAPI Data)

[Figure: fraction of cycles spent at each memory level (L1, L2, L3, memory), using PAPI hit/miss counts, for the exhaustive best implementation averaged over matrices, on Ultra 2i (L1/L2), Pentium III (L1/L2), Power3 (L1/L2), and Itanium (L1/L2/L3)]

SLIDE 48

Fill: Some Surprises!

Sometimes it is faster to fill in many zeros.

    Matrix  Dense speedup  Fill ratio  Size(BCSR)/Size(CSR)  Perf(BCSR)/Perf(CSR)  Platform
      11        1.09          1.70            1.26                  1.55           Itanium
      13        1.50          1.52            1.07                  2.30           Pentium 3
      17        1.04          1.59            1.23                  1.54           Itanium
      27        1.16          1.94            1.47                  1.54           Itanium
      27        1.10          1.53            1.25                  1.23           Ultra 2i
      29        1.00          1.98            1.44                  1.89           Pentium 3

SLIDE 49

References

[BACD97] J. Bilmes, K. Asanović, C.W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In Proceedings of the International Conference on Supercomputing, Vienna, Austria, July 1997. ACM SIGARC. See http://www.icsi.berkeley.edu/~bilmes/phipac.

[BCD+01] S. Blackford, G. Corliss, J. Demmel, J. Dongarra, I. Duff, S. Hammarling, G. Henry, M. Heroux, C. Hu, W. Kahan, L. Kaufman, B. Kearfott, F. Krogh, X. Li, Z. Maany, A. Petitet, R. Pozo, K. Remington, W. Walster, C. Whaley, and J. Wolff von Gudenberg. Document for the Basic Linear Algebra Subprograms (BLAS) standard: BLAS Technical Forum, 2001. www.netlib.org/blast.

[BDG+00] S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. A scalable cross-platform infrastructure for application performance tuning using hardware counters. In Proceedings of Supercomputing, November 2000.

[BW99] Aart J. C. Bik and Harry A. G. Wijshoff. Automatic nonzero structure analysis. SIAM Journal on Computing, 28(5):1576–1587, 1999.

[FC00] Salvatore Filippone and Michele Colajanni. PSBLAS: A library for parallel linear algebra computation on sparse matrices. ACM Transactions on Mathematical Software, 26(4):527–550, December 2000.

[FDZ99] Basilio B. Fraguela, Ramón Doallo, and Emilio L. Zapata. Memory hierarchy performance prediction for sparse blocked algorithms. Parallel Processing Letters, 9(3), March 1999.

[FJ98] Matteo Frigo and Stephen Johnson. FFTW: An adaptive software architecture for the FFT. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Seattle, Washington, May 1998.

[GGHvdG01] John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Transactions on Mathematical Software, 27(4), December 2001.

SLIDE 50

[GKKS99] William D. Gropp, D. K. Kaushik, David E. Keyes, and Barry F. Smith. Towards realistic bounds for implicit CFD codes. In Proceedings of Parallel Computational Fluid Dynamics, pages 241–248, 1999.

[GR99] Roman Geus and S. Röllin. Towards a fast parallel sparse matrix-vector multiplication. In E. H. D'Hollander, J. R. Joubert, F. J. Peters, and H. Sips, editors, Proceedings of the International Conference on Parallel Computing (ParCo), pages 308–315. Imperial College Press, 1999.

[HPDR99] Dora Blanco Heras, Vicente Blanco Perez, Jose Carlos Cabaleiro Dominguez, and Francisco F. Rivera. Modeling and improving locality for irregular problems: sparse matrix-vector product on cache memories as a case study. In HPCN Europe, pages 201–210, 1999.

[Im00] Eun-Jin Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, May 2000.

[MMJ00] Dragan Mirkovic, Rishad Mahasoom, and Lennart Johnsson. An adaptive software library for fast Fourier transforms. In Proceedings of the International Conference on Supercomputing, pages 215–224, Santa Fe, NM, May 2000.

[Neu98] T. Neubert. Anwendung von generativen Programmiertechniken am Beispiel der Matrixalgebra. Master's thesis, Technische Universität Chemnitz, 1998.

[NGLPJ96] J. J. Navarro, E. García, J. L. Larriba-Pey, and T. Juan. Algorithms for sparse matrix computations on high-performance workstations. In Proceedings of the 10th ACM International Conference on Supercomputing, pages 301–308, Philadelphia, PA, USA, May 1996.

[PH99] Ali Pinar and Michael Heath. Improving performance of sparse matrix-vector multiplication. In Proceedings of Supercomputing, 1999.

[PSVM01] Markus Püschel, Bryan Singer, Manuela Veloso, and José M. F. Moura. Fast automatic generation of DSP algorithms. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS, pages 97–106, San Francisco, CA, May 2001. Springer.

SLIDE 51

[RP96] K. Remington and R. Pozo. NIST Sparse BLAS: User's Guide. Technical report, NIST, 1996. gams.nist.gov/spblas.

[Saa94] Yousef Saad. SPARSKIT: A basic toolkit for sparse matrix computations, 1994. www.cs.umn.edu/Research/arpa/SPARSKIT/sparskit.html.

[SL98] Jeremy G. Siek and Andrew Lumsdaine. A rational approach to portable high performance: the Basic Linear Algebra Instruction Set (BLAIS) and the Fixed Algorithm Size Template (FAST) library. In Proceedings of ECOOP, 1998.

[Sto97] Paul Stodghill. A Relational Approach to the Automatic Generation of Sequential Sparse Matrix Codes. PhD thesis, Cornell University, August 1997.

[TJ92] O. Temam and W. Jalby. Characterizing the behavior of sparse algorithms on caches. In Proceedings of Supercomputing ’92, 1992.

[Tol97] Sivan Toledo. Improving memory-system performance of sparse matrix-vector multiplication. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, March 1997.

[Vel98] Todd Veldhuizen. Arrays in Blitz++. In Proceedings of ISCOPE, volume 1505 of LNCS. Springer-Verlag, 1998.

[VFD01] Sathish S. Vadhiyar, Graham E. Fagg, and Jack J. Dongarra. Towards an accurate model for collective communications. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS, pages 41–50, San Francisco, CA, May 2001. Springer.

[WPD01] R. Clint Whaley, Antoine Petitet, and Jack Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1):3–25, 2001.

[WS97] James B. White and P. Sadayappan. On improving the performance of sparse matrix-vector multiplication. In Proceedings of the International Conference on High-Performance Computing, 1997.