 
              ☎ � ✁ ✆ Memory Hierarchy Optimizations and Performance Bounds for Sparse �✄✂ Richard Vuduc, Attila Gyulassy, James Demmel, Katherine Yelick Monday, June 2, 2003 Berkeley Benchmarking and OPtimization (BeBOP) Project bebop.cs.berkeley.edu Department of Electrical Engineering and Computer Science U.C. Berkeley, California, USA Workshop on Parallel Linear Algebra (at ICCS-2003) [ <--> ] ☎✞✝ Memory Hierarchy Optimizations and Performance Bounds for Sparse – p.1/30
✆ ✡ ✕ ✔ ✏ ✔ ✏ ✟ ☞ ✌ ☞ ☎ ☞ ☛ ✠ ✌ ✠ ✟ Context: Automatic Tuning of Sparse Kernels Road-blocks to writing efficient sparse code high bandwidth requirements (extra storage) poor locality (indirect, irregular memory access) poor instruction mix (data structure manipulation) typical sparse matrix-vector multiply (SpM V) performance: less than 10% of machine peak performance depends on kernel, architecture, and matrix Goal: automatically choose “best” data structure and implementation (code), given matrix and machine Inspiration: ATLAS/PHiPAC, FFTW/SPRIAL/UHFFT, S PARSITY . . . This talk: ☞✎✍ , or Sp Iterative interior point methods [WO95], SVD [Dem97], HITS algorithm [Kle99] ✑✓✒ Other kernels: SpM V [SC’02], trisolve [ICS/POHLL ’02], , , . . . [ <--> ] ☎✞✝ Memory Hierarchy Optimizations and Performance Bounds for Sparse – p.2/30
✆ ✏ ✕ ✏ ☎ Motivating Example: Matrix raefsky3 Matrix 02−raefsky3 0 N = 21200 2656 nnz = 1.5M 5312 Kernel = Sp 7936 10592 13248 15904 18560 21216 0 2656 5312 7936 10592 13248 15904 18560 21216 1.49 million non−zeros [ <--> ] ☎✞✝ Memory Hierarchy Optimizations and Performance Bounds for Sparse – p.3/30
✖ ✛ ☎ ✟ ✖ ✗ ✟ ✘ ✘ ✜ ✏ ✢ ✙ ✣ ✙ ✤ ✙ ✖ ✥ ✕ ✏ ✆ Motivating Example: Matrix raefsky3 Matrix 02−raefsky3: 8x8 blocked submatrix (1:80, 1:80) 0 N = 21200 8 nnz = 1.5M 16 Kernel = Sp 24 A natural choice: 32 blocked compressed 40 sparse row (BCSR). 48 Experiment: 56 Measure performance of 64 all block sizes for . 72 ✗✚✙ 80 0 8 16 24 32 40 48 56 64 72 80 [ <--> ] ☎✞✝ Memory Hierarchy Optimizations and Performance Bounds for Sparse – p.3/30
✆ ☎ Speedups on Itanium: The Need for Search Register Profile for A T Ax: Matrix 02−raefsky3.rua [itanium−linux−ecc; Ref=78.6 Mflop/s] 216 209 2.75 1.80 1.97 2.03 8 199 189 179 169 2.55 2.48 1.38 1.52 4 row block size (r) 159 149 139 1.74 2.31 2.37 1.30 2 129 119 109 99 1.00 1.52 1.91 2.22 1 89 79 1 2 4 8 column block size (c) [ <--> ] (Peak machine speed: 3.2 Gflop/s) ☎✞✝ Memory Hierarchy Optimizations and Performance Bounds for Sparse – p.4/30
☞ ✕ ✦ ✏ ✕ ✏ ✦ ✟ ✌ ☎ ☞ ✏ ✏ ☞ ✌ ☞ ✆ Key Questions and Conclusions How can we exploit the memory hierarchy for Sp ? Interleave multiplication by , (up to 1.6x speedup) Combine with prior S PARSITY register blocking optimization (4.2x) How do we choose the best data structure automatically? Matrix may be unknown until run-time Heuristic for choosing optimal (or near-optimal) block sizes What are the limits on performance of blocked Sp ? Derive performance upper bounds for blocking For SpM V, within 20% of bound tuning limited [SC’02] For Sp , within 40% of bound tuning opportunities a la ATLAS [ <--> ] ☎✞✝ Memory Hierarchy Optimizations and Performance Bounds for Sparse – p.5/30
☎ ✆ ✧ Related Work Automatic tuning systems and code generation PHiPAC [BACD97], ATLAS [WPD01], S PARSITY [Im00] FFTW [FJ98], SPIRAL [PSVM01], UHFFT [MMJ00] MPI collective ops (Vadhiyar, et al. [VFD01]) Sparse compilers (Bik [BW99], Bernoulli [Sto97]) Generic programming (Blitz++ [Vel98], MTL [SL98], GMCL [Neu98], . . . ) FLAME [GGHvdG01] Sparse performance modeling and tuning Temam and Jalby [TJ92] Toledo [Tol97], White and Sadayappan [JBWS97], Pinar [PH99] Navarro [NGLPJ96], Heras [HPDR99], Fraguela [FDZ99] Gropp, et al. [GKKS99], Geus [GR99] Sparse kernel interfaces Sparse BLAS Standard [BCD 01] NIST SparseBLAS [RP96], SPARSKIT [Saa94], PSBLAS [FC00] PETSc, hypre, . . . [ <--> ] ☎✞✝ Memory Hierarchy Optimizations and Performance Bounds for Sparse – p.6/30
✮ ✯ ☎ ✬ ✮ ✮ ✪ ✌ ✯ ✍ ★ ✬ ✰ ✪ ✴ ✌ ✴ ✍ ✰ ✮ ✪ ✌ ✴ ✪ ✌ ✯ ☞ � ✁ � ✍ ☞ ☞ ✌ ✠ ★ ✆ ✌ ✪ ✮ ✮ ★ ✮ Memory Hierarchy Optimizations for Sp Cache-level : Interleave multiplication of by , : ✱✲✱ ✳✲✳ ☞✎✍ ✩✫✪✭✬ ✩✫✪ (1) ✴✶✵ Dot-product, followed by “axpy”: reuse . [ <--> ] ☎✞✝ Memory Hierarchy Optimizations and Performance Bounds for Sparse – p.7/30
✍ ✴ ✮ ✮ ✪ ✌ ☎ ★ ✯ ✬ ✪ ✴ ✌ ✍ ✬ ✰ ✮ ✪ ✌ ✴ ✪ ✌ ✴ ✷ ✸ ✹ ✹ ✮ ✯ ✌ ✌ � ✁ � ✍ ☞ ☞ ✌ ✪ ✠ ★ ☞ ✆ ✮ ★ ✰ ✮ ✮ ✯ ✪ Memory Hierarchy Optimizations for Sp Cache-level : Interleave multiplication of by , : ✱✲✱ ✳✲✳ ☞✎✍ ✩✫✪✭✬ ✩✫✪ (1) ✴✶✵ Dot-product, followed by “axpy”: reuse . Register-level : Take to be a block row composed of blocks. Question: How to choose ? ✷✻✺ [ <--> ] ☎✞✝ Memory Hierarchy Optimizations and Performance Bounds for Sparse – p.7/30
✆ ✏ ✿ ✾ ✼ ✒ ☎ ✕ ✒ ✼ ✏ ✿ ✾ 2x2 Code Example 1 for( i = 0 ; i < mb ; i ++ , t += 2 ) { /* for each block row */ ✏✽✼ register double t0 = 0, t1 = 0; 2 3 for( j = ptr [ i ] ; j < ptr [ i + 1 ] ; val += 2 * 2 ) { /* */ 4 register double x0 = x [ ind [ 0 ]+ 0 ] , x1 = x [ ind [ 0 ]+ 1 ] , ; 5 t0 += val [ 0 * 2 + 0 ] * x0 ; 6 t1 += val [ 1 * 2 + 0 ] * x0 ; 7 t0 += val [ 0 * 2 + 1 ] * x1 ; t1 += val [ 1 * 2 + 1 ] * x1 ; 8 } 9 RESET( ind, val ); 10 for( j = ptr [ i ] ; j < ptr [ i + 1 ] ; val += 2 * 2 ) { /* */ 11 register double y0 = 0, y1 = 0 ; 12 y0 += val [ 0 * 2 + 0 ] * t0 ; y1 += val [ 0 * 2 + 1 ] * t0 ; 13 y0 += val [ 1 * 2 + 0 ] * t1 ; 14 15 y1 += val [ 1 * 2 + 1 ] * t1 ; 16 y [ ind [ 0 ]+ 0 ] += y0 ; y [ ind [ 0 ]+ 1 ] += y1 ; } } [ <--> ] ☎✞✝ Memory Hierarchy Optimizations and Performance Bounds for Sparse – p.8/30
✆ ☎ Register-Level Blocking (S PARSITY ): 3x3 Example 3 x 3 Register Blocking Example 0 5 10 15 20 25 30 35 40 45 50 0 10 20 30 40 50 688 true non−zeros [ <--> ] ☎✞✝ Memory Hierarchy Optimizations and Performance Bounds for Sparse – p.9/30
✆ ☎ Register-Level Blocking (S PARSITY ): 3x3 Example 3 x 3 Register Blocking Example 0 5 10 15 20 25 30 35 40 45 50 0 10 20 30 40 50 688 true non−zeros BCSR with uniform, aligned grid [ <--> ] ☎✞✝ Memory Hierarchy Optimizations and Performance Bounds for Sparse – p.9/30
✆ ✟ ☎ Register-Level Blocking (S PARSITY ): 3x3 Example 3 x 3 Register Blocking Example 0 5 10 15 20 25 30 35 40 45 50 0 10 20 30 40 50 (688 true non−zeros) + (383 explicit zeros) = 1071 nz Fill-in zeros: trade-off extra flops for better efficiency This example: 50% fill led to 1.5x speedup on SpM V on Pentium III [ <--> ] ☎✞✝ Memory Hierarchy Optimizations and Performance Bounds for Sparse – p.9/30
✆ ✖ ✗ ✸ ✝ ✏ ✕ ✏ ☎ ☎ ✖ ✟ ✟ ✘ ✘ ✟ ✗ ☞ ✌ ☞ ❃ Approach to Automatic Tuning: Choosing ❀❂❁ For each kernel ( e.g. , Sp ): Identify and generate a space of implementations blocked, interleaved code up to Search to find the fastest using models, experiments Off-line benchmarking (once per architecture): Measure blocked interleaved Sp Mflop/s on dense matrix in sparse format Run-time: Estimate fill for all Inspiration: S PARSITY for SpM V [Im & Yelick ’99] See also: SC’02 paper [ <--> ] Memory Hierarchy Optimizations and Performance Bounds for Sparse – p.10/30
✆ ☎ ✝ ☎ Off-line Benchmarking [Intel Itanium 1] Register Profile for A T Ax: Dense Matrix [itanium−linux−ecc] 337 8 3.26 3.57 3.12 3.36 3.38 3.59 317 7 3.11 2.92 3.22 3.21 3.48 297 277 6 2.83 3.02 3.17 3.31 257 row block size (r) 5 2.85 3.08 237 217 4 3.06 197 177 3 2.87 157 2.78 2 137 117 1 97 1 2 3 4 5 6 7 8 column block size (c) Top 10 codes labeled by speedup over unblocked, interleaved code. Max speedup = 3.59. [ <--> ] Memory Hierarchy Optimizations and Performance Bounds for Sparse – p.11/30
Recommend
More recommend