Memory Hierarchy Optimizations and Performance Bounds for Sparse A^T A x

SLIDE 1

Memory Hierarchy Optimizations and Performance Bounds for Sparse A^T A x

Richard Vuduc, Attila Gyulassy, James Demmel, Katherine Yelick
Monday, June 2, 2003

Berkeley Benchmarking and OPtimization (BeBOP) Project
bebop.cs.berkeley.edu

Department of Electrical Engineering and Computer Science
U.C. Berkeley, California, USA

Workshop on Parallel Linear Algebra (at ICCS-2003)

SLIDE 2

Context: Automatic Tuning of Sparse Kernels

Roadblocks to writing efficient sparse code:
- high bandwidth requirements (extra storage)
- poor locality (indirect, irregular memory access)
- poor instruction mix (data structure manipulation)
- typical sparse matrix-vector multiply (SpM×V) performance: less than 10% of machine peak; performance depends on kernel, architecture, and matrix
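To make the indirect, irregular access concrete, here is a minimal CSR SpM×V sketch (illustrative only, not the SPARSITY-generated code; array names are conventional):

```c
/* y = A*x for an m-row matrix in compressed sparse row (CSR) format:
 * ptr[i]..ptr[i+1] delimit row i's entries, ind[j] is the column of
 * the j-th stored value val[j].  Note the indirect access x[ind[j]],
 * which is the locality problem described above. */
void spmv_csr(int m, const int *ptr, const int *ind, const double *val,
              const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double t = 0.0;
        for (int j = ptr[i]; j < ptr[i+1]; j++)
            t += val[j] * x[ind[j]];   /* indirect, irregular load of x */
        y[i] = t;
    }
}
```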

Goal: automatically choose “best” data structure and implementation (code), given matrix and machine

Inspiration: ATLAS/PHiPAC, FFTW/SPIRAL/UHFFT, SPARSITY, . . .

This talk: the kernel y = A^T A x, or SpA^TA.

Applications: iterative interior point methods [WO95], SVD [Dem97], HITS algorithm [Kle99].

Other kernels: SpM×V [SC'02], trisolve [ICS/POHLL '02], A^k x, R A R^T, . . .

SLIDE 3

Motivating Example: Matrix raefsky3

[Spy plot: Matrix 02-raefsky3, 1.49 million non-zeros.]

N = 21200, nnz = 1.5M, Kernel = SpA^TA

SLIDE 4

Motivating Example: Matrix raefsky3

[Spy plot: Matrix 02-raefsky3, 8x8 blocked submatrix (1:80, 1:80).]

N = 21200, nnz = 1.5M, Kernel = SpA^TA

A natural choice: 8×8 blocked compressed sparse row (BCSR).
Experiment: Measure performance of all r×c block sizes for r, c ∈ {1, 2, 4, 8}.

SLIDE 5

Speedups on Itanium: The Need for Search

[Register profile for A^T A x, Matrix 02-raefsky3.rua, itanium-linux-ecc; reference = 78.6 Mflop/s. Speedups of the 16 block sizes r, c ∈ {1, 2, 4, 8} range from 1.00 to 2.75.]

(Peak machine speed: 3.2 Gflop/s)

SLIDE 6

Key Questions and Conclusions

How can we exploit the memory hierarchy for SpA^TA?
- Interleave the multiplications by A and A^T (up to 1.6x speedup)
- Combine with the prior SPARSITY register blocking optimization (4.2x)

How do we choose the best data structure automatically?
- The matrix may be unknown until run-time
- Heuristic for choosing optimal (or near-optimal) block sizes

What are the limits on performance of blocked SpA^TA?
- Derive performance upper bounds for blocking
- For SpM×V, within 20% of the bound: tuning is limited [SC'02]
- For SpA^TA, within 40% of the bound: tuning opportunities a la ATLAS

SLIDE 7

Related Work

Automatic tuning systems and code generation:
- PHiPAC [BACD97], ATLAS [WPD01], SPARSITY [Im00]
- FFTW [FJ98], SPIRAL [PSVM01], UHFFT [MMJ00]
- MPI collective ops (Vadhiyar, et al. [VFD01])
- Sparse compilers (Bik [BW99], Bernoulli [Sto97])
- Generic programming (Blitz++ [Vel98], MTL [SL98], GMCL [Neu98], . . . )
- FLAME [GGHvdG01]

Sparse performance modeling and tuning:
- Temam and Jalby [TJ92]
- Toledo [Tol97], White and Sadayappan [JBWS97], Pinar [PH99]
- Navarro [NGLPJ96], Heras [HPDR99], Fraguela [FDZ99]
- Gropp, et al. [GKKS99], Geus [GR99]

Sparse kernel interfaces:
- Sparse BLAS Standard [BCD+01]
- NIST SparseBLAS [RP96], SPARSKIT [Saa94], PSBLAS [FC00]
- PETSc, hypre, . . .

SLIDE 8

Memory Hierarchy Optimizations for SpA^TA

Cache-level: Interleave the multiplications by A and A^T:

    y = A^T A x = (a_1 ... a_m) (a_1^T ; ... ; a_m^T) x = Σ_{i=1}^m a_i (a_i^T x)    (1)

where a_i^T is the i-th row of A. For each row: a dot-product t_i = a_i^T x, followed by an "axpy" y += t_i a_i, reusing a_i while it is still in cache.
.

SLIDE 9

Memory Hierarchy Optimizations for SpA^TA (continued)

Cache-level, as in eq. (1): a dot-product followed by an "axpy", reusing a_i.

Register-level: Take each a_i^T to be a block row composed of r×c blocks.
Question: How to choose r and c?

SLIDE 10

2x2 Code Example

for( i = 0; i < mb; i++, t += 2 ) {                          /* for each block row A_I */
    register double t0 = 0, t1 = 0;
    for( j = ptr[i]; j < ptr[i+1]; j++, ind++, val += 2*2 ) {    /* t = A_I x */
        register double x0 = x[ind[0]+0], x1 = x[ind[0]+1];
        t0 += val[0*2+0] * x0;
        t1 += val[1*2+0] * x0;
        t0 += val[0*2+1] * x1;
        t1 += val[1*2+1] * x1;
    }
    RESET( ind, val );    /* rewind ind, val to the start of block row i */
    for( j = ptr[i]; j < ptr[i+1]; j++, ind++, val += 2*2 ) {    /* y += A_I^T t */
        register double y0 = 0, y1 = 0;
        y0 += val[0*2+0] * t0;
        y1 += val[0*2+1] * t0;
        y0 += val[1*2+0] * t1;
        y1 += val[1*2+1] * t1;
        y[ind[0]+0] += y0;  y[ind[0]+1] += y1;
    }
}

SLIDE 11

Register-Level Blocking (SPARSITY): 3x3 Example

[Spy plot: 3 x 3 Register Blocking Example, 688 true non-zeros.]

SLIDE 12

Register-Level Blocking (SPARSITY): 3x3 Example

[Spy plot: 3 x 3 Register Blocking Example, 688 true non-zeros.]

BCSR with uniform, aligned grid

SLIDE 13

Register-Level Blocking (SPARSITY): 3x3 Example

[Spy plot: 3 x 3 Register Blocking Example, (688 true non-zeros) + (383 explicit zeros) = 1071 nz.]

Fill-in zeros: trade off extra flops for better efficiency.

This example: 50% fill led to a 1.5x speedup of SpM×V on the Pentium III.
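The fill can be counted without actually building the blocked matrix. A sketch for a uniform, aligned r×c grid over a CSR matrix (illustrative; the helper name is made up):

```c
#include <stdlib.h>

/* Fill ratio = (stored values) / (true non-zeros) for an r x c BCSR
 * blocking of an m x n CSR matrix on a uniform, aligned grid, as in
 * the example above.  Each r x c block that contains at least one
 * nonzero is stored in full (explicit zeros included).  O(nnz) time
 * with an O(n/c) marker array. */
double fill_ratio(int m, int n, const int *ptr, const int *ind,
                  int r, int c)
{
    long nnz = ptr[m], stored = 0;
    int nbc = (n + c - 1) / c;                  /* number of block columns */
    int *last = malloc(nbc * sizeof *last);     /* last block row touching
                                                   each block column */
    for (int b = 0; b < nbc; b++) last[b] = -1;

    for (int I = 0; I < (m + r - 1) / r; I++) { /* block rows */
        int top = I * r, rows = (top + r <= m) ? r : m - top;
        for (int i = top; i < top + rows; i++)
            for (int j = ptr[i]; j < ptr[i+1]; j++) {
                int b = ind[j] / c;
                if (last[b] != I) {             /* first hit of block (I,b) */
                    last[b] = I;
                    stored += (long)r * c;      /* whole block is stored */
                }
            }
    }
    free(last);
    return (double)stored / (double)nnz;
}
```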

SLIDE 14

Approach to Automatic Tuning: Choosing r and c

For each kernel (e.g., SpA^TA):
- Identify and generate a space of implementations: r×c blocked, interleaved code up to 8×8
- Search for the fastest using models and experiments

Off-line benchmarking (once per architecture):
- Measure blocked, interleaved SpA^TA Mflop/s on a dense matrix in sparse format

Run-time:
- Estimate the fill for all r×c

Inspiration: SPARSITY for SpM×V [Im & Yelick '99]

See also: SC'02 paper

SLIDE 15

Off-line Benchmarking [Intel Itanium 1]

[Register profile for A^T A x on a dense matrix in sparse format, itanium-linux-ecc; block sizes r, c ∈ {1, ..., 8}.]

Top 10 codes labeled by speedup over unblocked, interleaved code. Max speedup = 3.59.

SLIDE 16

Off-line benchmark profiles on four platforms (maximum speedup over unblocked, interleaved code in parentheses):

- 333 MHz Sun Ultra 2i (2.88)
- 500 MHz Intel Pentium III (2.40)
- 375 MHz IBM Power3 (2.02)
- 800 MHz Intel Itanium (3.59)

[Four register profiles for A^T A x on a dense matrix in sparse format, r, c ∈ {1, ..., 8}, one per platform.]

SLIDE 17

Performance Bounds for Block Interleaved SpA^TA

How close are we to the speed limit of blocking?

Upper bound on performance = (flops) / (lower bound on time):
- Flops = 4 × (number of true non-zeros)
- Lower bound on time, under two key assumptions:
  - Consider only memory operations
  - Count only compulsory and capacity misses (i.e., ignore conflicts)
- The bound is a function of: cache capacity, cache line size, access latency at each level, and the matrix and r×c (through the fill ratio)

For details and justification, see the paper and [SC'02]
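The style of bound can be illustrated with a deliberately simplified model that charges only compulsory misses for the matrix data (the paper's actual model also accounts for the vectors, capacity misses, and the per-block-row working set; all names here are illustrative):

```c
/* Simplified sketch of an upper bound in Mflop/s of the form
 * (flops) / (lower bound on time).  Flops = 4*nnz: each true nonzero
 * is used once in the A*x pass and once in the A^T*t pass, a multiply
 * and an add each time.  Time is bounded below by charging every byte
 * of r x c BCSR matrix data one compulsory miss at each cache level,
 * latency alpha[i] cycles per line of line_bytes[i] bytes. */
double perf_upper_bound_mflops(long nnz, double fill_ratio, int r, int c,
                               double clock_mhz,
                               const double alpha[], const int line_bytes[],
                               int levels)
{
    /* BCSR storage: 8 bytes per stored value (true nz * fill) and a
       4-byte column index per block; row pointers ignored for brevity. */
    double stored = (double)nnz * fill_ratio;
    double bytes  = stored * 8.0 + (stored / (r * c)) * 4.0;

    double cycles = 0.0;
    for (int i = 0; i < levels; i++)            /* compulsory misses only */
        cycles += alpha[i] * (bytes / line_bytes[i]);

    double flops = 4.0 * (double)nnz;
    return flops / (cycles / clock_mhz);        /* cycles/MHz = microseconds,
                                                   flops/us = Mflop/s */
}
```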

SLIDE 18

Cache Miss Model: Parameters

Model parameters:
- A is m×n with k non-zeros, stored in r×c BCSR format
- L_i cache, 1 ≤ i ≤ κ, with capacity C_i, line size l_i, and access latency α_i
- Memory access latency α_mem
- One double == γ integers (e.g., γ = 2 for 64-bit values and 32-bit indices)

Working set: volume of data needed for each block row
- Ŵ = matrix data working set per block row
- V̂ = vector data working set per block row

SLIDE 19

Cache Miss Model: Lower Bound

Number of compulsory misses at level i: (total volume of matrix and vector data) / (line size l_i); every byte must be loaded at least once.

Lower bound on misses M_i at level i: two cases
1. The entire working set fits in cache (Ŵ + V̂ ≤ C_i): only the compulsory misses occur.
2. The working set exceeds capacity (Ŵ + V̂ > C_i): additionally count capacity misses, per block row, for the portion of the working set that cannot stay resident.

Assumes full associativity and complete "user" control of data placement in cache (see paper).

SLIDE 20

Overview of Performance Results

Experimental setup:
- Four machines: Pentium III, Ultra 2i, Power3, Itanium
- 44 matrices: dense, finite element, mixed, linear programming
- Reference: CSR (i.e., 1×1 "unblocked"), no interleaving
- Measured misses and cycles using PAPI [BDG+00]

Main observations:
- Interleaving (cache optimization): up to 1.6x speedup
- Register + cache: up to 4.2x speedup; 1.8x over blocked, non-interleaved
- Heuristic choice within 10% of best performance
- Performance is 20-40% of the bound: more tuning is possible

SLIDE 21

Performance Results: Sun Ultra 2i

[Bar chart: A^T A x performance (Mflop/s) on the 333 MHz Sun Ultra 2i across 22 test matrices, grouped as Dense, FEM, FEM (var. blk.), Mixed, and LP. Series: Upper bound; Upper bound (PAPI); Cache + Reg (best); Cache + Reg (heuristic); Reg only; Cache only; Reference.]

DGEMV (n=1000): 58 Mflop/s

SLIDE 27

Performance Results: Intel Pentium III

[Bar chart: A^T A x performance (Mflop/s) on the 500 MHz Intel Pentium III across 44 test matrices (Dense, FEM, FEM (var. blk.), Mixed, LP). Series: Upper bound; Upper bound (PAPI); Cache + Reg (best); Cache + Reg (heur); Reg only; Cache only; Reference.]

DGEMV (n=1000): 96 Mflop/s

SLIDE 28

Performance Results: Intel Itanium 1

[Bar chart: A^T A x performance (Mflop/s) on the 800 MHz Intel Itanium across 22 test matrices (Dense, FEM, FEM (var. blk.), Mixed, LP). Series: Upper bound; Upper bound (PAPI); Cache + Reg (best); Cache + Reg (heur); Reg only; Cache only; Reference.]

DGEMV (n=1000): 315 Mflop/s

SLIDE 29

Performance Results: IBM Power3

[Bar chart: A^T A x performance (Mflop/s) on the 375 MHz IBM Power3 across 12 test matrices (Dense, FEM, FEM (var. blk.), LP). Series: Upper bound; Upper bound (PAPI); Cache + Reg (best); Cache + Reg (heur); Reg only; Cache only; Reference.]

DGEMV (n=1000): 260 Mflop/s

SLIDE 30

Conclusions

Tuning can be difficult, even when matrix structure is known

Performance is a complicated function of architecture and matrix.
With memory hierarchy optimizations, a 4.2x speedup is possible.

Heuristic for choosing the block size selects the optimal implementation, or a near-optimal one (performance within 5-10% of the best).

Opportunities to apply ATLAS/PHiPAC/FFTW/. . . style tuning

Performance is often within 20–40% of upper-bound, particularly with FEM matrices

SLIDE 31

BeBOP: Current and Future Work (1/2)

Further performance improvements for SpM×V:
- symmetry (up to 2x speedups)
- diagonals, block diagonals, and bands (2x)
- splitting for variable block structure (1.5x)
- reordering to create dense structure (1.7x)
- cache blocking (4x)
- multiple vectors (7x), and combinations . . .
- How to choose optimizations & tuning parameters?

Sparse triangular solve (ICS'02 POHLL workshop paper):
- hybrid sparse/dense structure (1.8x)

Higher-level kernels that permit reuse:
- A A^T x and A^T A x (4.2x)
- A x and A^T y simultaneously, A^k x, R A R^T, . . . (future work)

SLIDE 32

BeBOP: Current and Future Work (2/2)

An automatically tuned sparse matrix library:
- Code generation via sparse compilers (Bernoulli; Bik)
- Plan to extend the new Sparse BLAS standard by one routine to support tuning

Architectural issues:
- Latency vs. bandwidth (see paper)
- Using models to explore the architectural design space

SLIDE 33

Extra Slides

SLIDE 34

Speedups on Ultra 2i: Block Interleaved

[Register profile for A^T A x, Matrix 02-raefsky3.rua, ultra-solaris. Speedups of the 16 block sizes r, c ∈ {1, 2, 4, 8} range from 1.00 to 2.31.]

(Peak machine speed: 667 Mflop/s)

SLIDE 35

Speedups on Itanium: Block Interleaved

[Register profile for A^T A x, Matrix 02-raefsky3.rua, itanium-linux-ecc. Speedups of the 16 block sizes r, c ∈ {1, 2, 4, 8} range from 1.00 to 3.51.]

(Peak machine speed: 3.2 Gflop/s)

SLIDE 36

Search: Choosing the Block Size

Off-line benchmarking (once per architecture):
- Measure Dense_Performance(r,c): the performance (Mflop/s) of optimized SpA^TA for a dense matrix stored in sparse r×c blocked format

At run-time, when the matrix is known:
- Estimate Fill_Ratio(r,c) for all r, c:
  Fill_Ratio(r,c) = (number of stored values) / (number of true non-zeros)
- Choose the r, c that maximize
  Estimated_Performance(r,c) = Dense_Performance(r,c) / Fill_Ratio(r,c)

(See SC'02)
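Given the two tables, the selection step is a small argmax (illustrative sketch; the measured and estimated tables are assumed inputs, flattened row-major):

```c
/* Pick the block size (r, c) in 1..R x 1..C maximizing
 * dense_mflops[(r-1)*C + (c-1)] / fill[(r-1)*C + (c-1)], i.e.,
 * Estimated_Performance = Dense_Performance / Fill_Ratio. */
void choose_block_size(int R, int C,
                       const double *dense_mflops, const double *fill,
                       int *r_best, int *c_best)
{
    double best = -1.0;
    for (int r = 0; r < R; r++)
        for (int c = 0; c < C; c++) {
            double est = dense_mflops[r*C + c] / fill[r*C + c];
            if (est > best) {
                best = est;
                *r_best = r + 1;
                *c_best = c + 1;
            }
        }
}
```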

SLIDE 37

Where Does the Time Go? Model Prediction (Power3)

[Stacked bar chart: model-predicted fraction of total cycles spent at L1, L2, and memory, per matrix, power3-aix.]

SLIDE 38

Where Does the Time Go? PAPI (Power3)

[Stacked bar chart: measured (PAPI) fraction of total cycles spent at L1, L2, and memory, per matrix, power3-aix.]

SLIDE 39

Platforms

                         Sun        Intel        IBM         Intel
                         Ultra 2i   Pentium III  Power3      Itanium
Clock (MHz)              333        500          375         800
Bandwidth (MB/s)         500        680          1600        2100
Peak (Mflop/s)           667        500          1500        3200
DGEMM (Mflop/s)          425        331          1300        2200
DGEMV (Mflop/s)          58         96           260         315
STREAM TRIAD (MB/s)      250        350          715         1100
No. double FP regs       16         8            32          128
L1 (KB:line B:lat cy)    16:16:1    16:32:1      64:128:.5   16:32:1
L2                       2048:64:7  512:32:18    8192:128:9  96:64:6-9
L3                       n/a        n/a          n/a         2048:64:21-24
Memory lat (cy)          36-66      26-60        35-139      36-85
Compiler                 cc 6       icc 7        xlc 5.1     icc 7

SLIDE 40

References

[BACD97] J. Bilmes, K. Asanović, C.-W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In Proceedings of the International Conference on Supercomputing, Vienna, Austria, July 1997. ACM SIGARC. See http://www.icsi.berkeley.edu/~bilmes/phipac.

[BCD+01] S. Blackford, G. Corliss, J. Demmel, J. Dongarra, I. Duff, S. Hammarling, G. Henry, M. Heroux, C. Hu, W. Kahan, L. Kaufman, B. Kearfott, F. Krogh, X. Li, Z. Maany, A. Petitet, R. Pozo, K. Remington, W. Walster, C. Whaley, and J. Wolff von Gudenberg. Document for the Basic Linear Algebra Subprograms (BLAS) standard: BLAS Technical Forum, 2001. Chapter 3: www.netlib.org/blast.

[BDG+00] S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. A scalable cross-platform infrastructure for application performance tuning using hardware counters. In Proceedings of Supercomputing, November 2000.

[BW99] Aart J. C. Bik and Harry A. G. Wijshoff. Automatic nonzero structure analysis. SIAM Journal on Computing, 28(5):1576–1587, 1999.

[Dem97] James W. Demmel. Applied Numerical Linear Algebra. SIAM, 1997.

[FC00] Salvatore Filippone and Michele Colajanni. PSBLAS: A library for parallel linear algebra computation on sparse matrices. ACM Transactions on Mathematical Software, 26(4):527–550, December 2000.

[FDZ99] Basilio B. Fraguela, Ramón Doallo, and Emilio L. Zapata. Memory hierarchy performance prediction for sparse blocked algorithms. Parallel Processing Letters, 9(3), March 1999.

[FJ98] Matteo Frigo and Stephen Johnson. FFTW: An adaptive software architecture for the FFT. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Seattle, Washington, May 1998.

[GGHvdG01] John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Transactions on Mathematical Software, 27(4), December 2001.

[GKKS99] William D. Gropp, D. K. Kasushik, David E. Keyes, and Barry F. Smith. Towards realistic bounds for implicit CFD codes. In Proceedings of Parallel Computational Fluid Dynamics, pages 241–248, 1999.

[GR99] Roman Geus and S. Röllin. Towards a fast parallel sparse matrix-vector multiplication. In E. H. D'Hollander, J. R. Joubert, F. J. Peters, and H. Sips, editors, Proceedings of the International Conference on Parallel Computing (ParCo), pages 308–315. Imperial College Press, 1999.

[HPDR99] Dora Blanco Heras, Vicente Blanco Perez, Jose Carlos Cabaleiro Dominguez, and Francisco F. Rivera. Modeling and improving locality for irregular problems: sparse matrix-vector product on cache memories as a case study. In HPCN Europe, pages 201–210, 1999.

[Im00] Eun-Jin Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, May 2000.

[JBWS97] James B. White III and P. Sadayappan. On improving the performance of sparse matrix-vector multiplication. In Proceedings of the International Conference on High-Performance Computing, 1997.

[Kle99] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.

[MMJ00] Dragan Mirkovic, Rishad Mahasoom, and Lennart Johnsson. An adaptive software library for fast Fourier transforms. In Proceedings of the International Conference on Supercomputing, pages 215–224, Santa Fe, NM, May 2000.

[Neu98] T. Neubert. Anwendung von generativen Programmiertechniken am Beispiel der Matrixalgebra. Master's thesis, Technische Universität Chemnitz, 1998.

[NGLPJ96] J. J. Navarro, E. García, J. L. Larriba-Pey, and T. Juan. Algorithms for sparse matrix computations on high-performance workstations. In Proceedings of the 10th ACM International Conference on Supercomputing, pages 301–308, Philadelphia, PA, USA, May 1996.

[PH99] Ali Pinar and Michael Heath. Improving performance of sparse matrix-vector multiplication. In Proceedings of Supercomputing, 1999.

[PSVM01] Markus Püschel, Bryan Singer, Manuela Veloso, and José M. F. Moura. Fast automatic generation of DSP algorithms. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS, pages 97–106, San Francisco, CA, May 2001. Springer.

[RP96] K. Remington and R. Pozo. NIST Sparse BLAS: User's Guide. Technical report, NIST, 1996. gams.nist.gov/spblas.

[Saa94] Yousef Saad. SPARSKIT: A basic toolkit for sparse matrix computations, 1994. www.cs.umn.edu/Research/arpa/SPARSKIT/sparskit.html.

[SL98] Jeremy G. Siek and Andrew Lumsdaine. A rational approach to portable high performance: the Basic Linear Algebra Instruction Set (BLAIS) and the Fixed Algorithm Size Template (FAST) library. In Proceedings of ECOOP, 1998.

[Sto97] Paul Stodghill. A Relational Approach to the Automatic Generation of Sequential Sparse Matrix Codes. PhD thesis, Cornell University, August 1997.

[TJ92] O. Temam and W. Jalby. Characterizing the behavior of sparse algorithms on caches. In Proceedings of Supercomputing '92, 1992.

[Tol97] Sivan Toledo. Improving memory-system performance of sparse matrix-vector multiplication. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, March 1997.

[Vel98] Todd Veldhuizen. Arrays in Blitz++. In Proceedings of ISCOPE, volume 1505 of LNCS. Springer-Verlag, 1998.

[VFD01] Sathish S. Vadhiyar, Graham E. Fagg, and Jack J. Dongarra. Towards an accurate model for collective communications. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS, pages 41–50, San Francisco, CA, May 2001. Springer.

[WO95] Weichung Wang and Dianne P. O'Leary. Adaptive use of iterative methods in interior point methods for linear programming. Technical Report UMIACS-95-111, University of Maryland at College Park, College Park, MD, USA, 1995.

[WPD01] R. Clint Whaley, Antoine Petitet, and Jack Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1):3–25, 2001.