  1. Mary Hall, November 11, 2020

  2. • Anand Venkat, Utah PhD, now at Intel
     • Other Utah students: Khalid Ahmad, John Jolly, Mahesh Lakshminaranan, Payal Nandy, Tuowen Zhao
     • University of Arizona collaborators: Michelle Strout, Mahdi Mohammadi
     • Boise State collaborators: Cathie Olschanowsky, Eddie Davis, Tobi Popoola
     • Intel collaborators: Jongsoo Park, Hongbo Rong, Raj Barik
     This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration, and by the National Science Foundation under CCF-1564074.

  3. • Sparse matrices/tensors appear frequently in large systems of equations
     • Sparse matrices/tensors have diverse applications
     • Density δ is often << 0.1
     Examples: Network Theory (web connectivity), Epidemiology (2D Markov model of an epidemic), Finance (portfolio model)
     Slide images: SuiteSparse Matrix Collection, sparse.tamu.edu

  4. /* SpMM from LOBPCG on symmetric matrix */
     for (i = 0; i < n; i++) {
       for (j = index[i]; j < index[i+1]; j++)
         for (k = 0; k < m; k++)
           y[i][k] += A[j] * x[col[j]][k];
       /* transposed computation exploiting symmetry */
       for (j = index[i]; j < index[i+1]; j++)
         for (k = 0; k < m; k++)
           y[col[j]][k] += A[j] * x[i][k];
     }

     Code A (above): multiple SpMV computations (SpMM), 7 lines of code.
     Code B: manually-optimized SpMM from LOBPCG, 2109 lines of code, applying:
     • Data transformation: convert matrix format CSR → CSB, 11 different block sizes/implementations
     • Parallelism: thread-level (OpenMP w/ schedule)
     • Parallelism: SIMD (AVX2)
     • Other: indexing simplification

     Question: Can a compiler generate Code B starting with Code A?
     Answer: YES (rest of talk)

  5. Optimizing Dense Linear Algebra – COMPUTE BOUND
     • Exploit all forms of parallelism to approach peak flop rate
     • Exploit locality of reused data in cache and registers
     • Hide latency of initial cold misses
     Optimizing Sparse Linear Algebra – BOUND BY DATA MOVEMENT
     • Maximize memory bandwidth utilization
     • Manage load imbalance
     • Memory access pattern unpredictable – try to hide latency
     • Select best sparse matrix representation – depends on nonzero pattern
     These optimizations are usually architecture specific.

  6. • Inspector/Executor: Integrate runtime optimization based on input data into generated code
     • Integration: Incorporate into the Sparse Polyhedral Framework (SPF)
     • Data dependent (PARALLEL SCHEDULE): Support parallelization in the presence of data dependences
     • Format (DATA REPRESENTATION): Convert from one format to another (e.g., CSR to BCSR)
     • Value (DATA REPRESENTATION): Use mixed-precision data values

  7. Code A → Code B

     /* SpMM from LOBPCG on symmetric matrix */
     for (i = 0; i < n; i++) {
       for (j = index[i]; j < index[i+1]; j++)
         for (k = 0; k < m; k++)
           y[i][k] += A[j] * x[col[j]][k];
       /* transposed computation exploiting symmetry */
       for (j = index[i]; j < index[i+1]; j++)
         for (k = 0; k < m; k++)
           y[col[j]][k] += A[j] * x[i][k];
     }

  8. • Mathematically represents loop nest computations and the transformations applied to them
     • Enables composition of transformations and correct code generation
     • Abstractions representing loop nest computations:
       – Iteration spaces as integer sets of points
       – Transformations as relations on iteration spaces
       – Statement macros as functions of loop index variables
       – Underlying dependence graph to reason about safety of transformations

  9. Stage 1: Extract loop bounds and construct iteration spaces
       Input code:
         for (i = 0; i < n; i++)
           s0: a[i+4] = b[i+4];
       Iteration space (IS): s0 = {[i] : 0 ≤ i < n}

     Stage 2: Affine loop transformation (T)
       Input IS: {[i] : 0 ≤ i < n}
       T = {[i] → [i+4]}
       Output IS: {[i] : 4 ≤ i < n+4}

     Stage 3: Code generation
       T_inv = {[i] → [i-4]} modifies array subscripts; then polyhedra scanning
       Output code:
         for (i = 4; i < n+4; i++)
           s0: a[i] = b[i];
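     As a quick sanity check on the example above (my own sketch, not from the talk): the two loops do the same work, since composing T_inv with the original subscript gives (i-4)+4 = i.

       #include <assert.h>
       #include <string.h>
       #define N 16

       int main(void) {
         double b[N + 4], a1[N + 4] = {0}, a2[N + 4] = {0};
         for (int i = 0; i < N + 4; i++) b[i] = 0.5 * i;

         /* original: iterate over {[i] : 0 <= i < N} */
         for (int i = 0; i < N; i++)
           a1[i + 4] = b[i + 4];

         /* transformed: iterate over T(IS) = {[i] : 4 <= i < N+4};
            T_inv rewrites the subscript: (i-4)+4 == i */
         for (int i = 4; i < N + 4; i++)
           a2[i] = b[i];

         assert(memcmp(a1, a2, sizeof a1) == 0);  /* identical results */
         return 0;
       }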

  10. Non-affine loop bounds

      for (i = 0; i < n; i++)
        for (j = index[i]; j < index[i+1]; j++)
          y[i] += a[j] * x[col[j]];

      Non-affine subscript, col: column for each element in A
      Non-affine loop bounds, index: first location of row i in A
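      For concreteness, here is a small CSR instance of the arrays used above; the matrix and its values are my own illustration, not from the slides:

        /* 3x3 matrix with 4 nonzeros:
           [ 5 0 1 ]
           [ 0 2 0 ]
           [ 0 0 7 ]
           a[]     stored values, row by row (nonzeros only)
           col[]   column of each stored value
           index[] row i occupies a[index[i]] .. a[index[i+1]-1] */
        double a[]     = {5.0, 1.0, 2.0, 7.0};
        int    col[]   = {0,   2,   1,   2};
        int    index[] = {0, 2, 3, 4};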

  11. for (i = 0; i < n; i++)
        for (j = index[i]; j < index[i+1]; j++)
          s0: y[i] += a[j] * x[col[j]];

      Most polyhedral compilers can't represent the bounds of loop j.

      Uninterpreted function: represent index as a function in relations [Pugh and Wonnacott, TOPLAS 1998]. Extend to support:
      • Loop bounds
      • Parameters beyond loop indices
      • Transformations
      • Code generation

      Observations:
      • index is invariant within the loop nest
      • some loop transformations may be safe if index can be represented

  12. Original code:
        for (i = 0; i < n; i++)
          for (j = index[i]; j < index[i+1]; j++)
            s0: y[i] += a[j] * x[col[j]];

      Represent the j loop bounds as uninterpreted functions:
        IS = {[i,j] : 0 ≤ i < n ∧ index(i) ≤ j < index(i+1)}

      T_tile = {[i,j] → [i,jj,j] | ∃a : jj = 4a ∧ a ≥ 0 ∧ jj ≤ j < jj + 4}

      Now tiling is possible!
        for (i = 0; i < n; i++)
          for (jj = index[i]; jj < index[i+1]; jj += 4)
            for (j = jj; j < min(index[i+1], jj + 4); j++)
              y[i] += a[j] * x[col[j]];

      [CGO14] Venkat et al.

  14. Runtime information is needed for many optimizations to understand the memory access pattern and sparse matrix nonzero structure.
      • Inspector analyzes indirect accesses at runtime and/or reorders data
        (Inspector code: matrix format conversion / runtime parallelization)
      • Executor is the reordered computation
        (Executor code: iterate using the new representation)
      Both inspector and executor are generated at compile time, but the inspector examines the input matrix once at runtime.
      Original concept: Mirchandaney and Saltz, ICS 1988. Similar to sparse matrix libraries like OSKI, PETSc.
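      A minimal inspector/executor sketch (my illustration over the CSR arrays a, col, index used earlier, not generated CHiLL output): the inspector examines the nonzero structure once at runtime and builds a row permutation, here simply scheduling heavy rows first; the executor is the reordered SpMV.

        /* Inspector: runs once at runtime; reads only index[] and builds
           a permutation placing rows with the most nonzeros first
           (a simple load-balance heuristic). */
        void inspector(int n, const int *index, int *perm) {
          for (int i = 0; i < n; i++) perm[i] = i;
          for (int i = 1; i < n; i++) {          /* insertion sort, fine for a sketch */
            int r = perm[i];
            int len = index[r+1] - index[r];
            int k = i - 1;
            while (k >= 0 && index[perm[k]+1] - index[perm[k]] < len) {
              perm[k+1] = perm[k];
              k--;
            }
            perm[k+1] = r;
          }
        }

        /* Executor: the reordered computation; iterates rows in inspector order. */
        void executor(int n, const int *index, const int *col,
                      const double *a, const double *x, double *y,
                      const int *perm) {
          for (int p = 0; p < n; p++) {
            int i = perm[p];
            double sum = 0.0;
            for (int j = index[i]; j < index[i+1]; j++)
              sum += a[j] * x[col[j]];
            y[i] = sum;
          }
        }

      In the talk's uses, the inspector instead converts formats (e.g., CSR to BCSR) or builds a dependence graph for parallelization.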

  15. Specialize matrix representation for nonzero structure
      • Compressed Sparse Row (CSR) is a general structure that is widely used
      • Blocked Compressed Sparse Row (BCSR):
        – Uses fixed-size dense blocks if any elements are nonzero
        – Pads with explicit zeros for elements not in the CSR representation; computing with 0 retains meaning
        – Code for a dense block is very efficient; profitable if padding is limited

      A = [ 1 5 3 6 ]
          [ 7 2 0 4 ]

      A (in CSR):  [ 1 5 3 6 7 2 4 ]       (nonzeros only)
      A (in BCSR): [ 1 5 7 2 | 3 6 0 4 ]   (2x2 blocks, one explicit zero)

  16. (Matrix example from the previous slide: i rows, k columns, [ 1 5 3 6 ; 7 2 0 4 ], blocked 2x2.)

      Original code:
        for (i = 0; i < n; i++)
          for (j = index[i]; j < index[i+1]; j++)
            s0: y[i] += a[j] * x[col[j]];

      make-dense(s0, col[j]) introduces a dense k loop guarded by the indirection:
        for (i = 0; i < n; i++)
          for (k = 0; k < n; k++)
            for (j = index[i]; j < index[i+1]; j++)
              if (k == col[j])
                s0: y[i] += a[j] * x[col[j]];

      tile(0, 2, c, counted) and tile(0, 2, r, counted) produce the blocked loop nest:
        for (ii = 0; ii < n/r; ii++)
          for (kk = 0; kk < n/c; kk++)
            for (i = 0; i < r; i++)
              for (k = 0; k < c; k++)
                for (j = index[ii*r+i]; j < index[ii*r+i+1]; j++)
                  if ((kk*c+k) == col[j])
                    s0: y[ii*r+i] += a[j] * x[kk*c+k];

      compact-and-pad(s0, kk, A) then converts the data layout to padded BCSR blocks.

      [PLDI15] Venkat et al.
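      For reference, a minimal sketch of the executor that compact-and-pad produces conceptually (the array names A_blk, b_index, b_col and this layout are my assumptions, not CHiLL's generated names): each stored r x c block multiplies densely with no guard, since padded zeros contribute nothing.

        /* BCSR SpMV: b_index[ib] .. b_index[ib+1]-1 are the blocks of
           block-row ib; b_col[b] is block b's block-column; A_blk stores
           r*c values per block, contiguously. */
        void spmv_bcsr(int nb, int r, int c,
                       const int *b_index, const int *b_col,
                       const double *A_blk, const double *x, double *y) {
          for (int ib = 0; ib < nb; ib++)                      /* block rows */
            for (int b = b_index[ib]; b < b_index[ib+1]; b++)  /* blocks in row */
              for (int i = 0; i < r; i++)
                for (int k = 0; k < c; k++)   /* dense block: no guard needed */
                  y[ib*r + i] += A_blk[b*r*c + i*c + k] * x[b_col[b]*c + k];
        }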

  18. Dense (Lower) Triangular (Forward) Solve

      L = [ 1  0  0  0 ]
          [ 9  2  0  0 ]
          [ 3  7 10  0 ]
          [ 4  8  5 12 ]

      • Rows cannot be processed in parallel
      • x[0] has to be computed before x[1], x[1] before x[2], ...
      • Outer i loop cannot be parallelized
      Dependence graph over rows 0-3: each row depends on all earlier rows.
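      A standard dense forward-substitution loop (my sketch, not slide code) makes the dependence chain explicit:

        /* Solve L x = b for lower-triangular L (row-major, n x n).
           x[i] reads every earlier x[j], so the i iterations form a chain. */
        void forward_solve(int n, const double *L, const double *b, double *x) {
          for (int i = 0; i < n; i++) {
            double sum = b[i];
            for (int j = 0; j < i; j++)
              sum -= L[i*n + j] * x[j];   /* depends on x[0..i-1] */
            x[i] = sum / L[i*n + i];
          }
        }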

  19. Sparse (Lower) Triangular (Forward) Solve Kernel

      L = [ 1  0  0  0 ]
          [ 0 10  0  0 ]
          [ 0  5  7  0 ]
          [ 9  0  6  8 ]

      • Some rows can be processed in parallel
      • Dependence graph: wavefront 0 = {0, 1}, wavefront 1 = {2}, wavefront 2 = {3}
      • Inspector builds the dependence graph; executor traverses wavefronts
      • Parallel wavefront-scheduled computation (i loop partially parallel)
      • Parallelism is dependent on input structure
      [SC16] Venkat et al. [PLDI19] Mohammadi et al.
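      A minimal sketch of the wavefront inspector/executor pair (mine, assuming CSR arrays rowptr, col, val for L and a precomputed bucketing of rows by level): the inspector derives each row's wavefront level from the nonzero structure; the executor runs one level at a time, solving the rows within a level in parallel.

        /* Inspector: level[i] = length of the longest dependence chain
           ending at row i. Row i depends on row c for every off-diagonal
           nonzero L(i,c), c < i. Rows are visited in order, so levels of
           earlier rows are already final. */
        void inspect_wavefronts(int n, const int *rowptr, const int *col,
                                int *level) {
          for (int i = 0; i < n; i++) {
            int lev = 0;
            for (int j = rowptr[i]; j < rowptr[i+1]; j++)
              if (col[j] < i && level[col[j]] + 1 > lev)
                lev = level[col[j]] + 1;
            level[i] = lev;
          }
        }

        /* Executor: process wavefronts in order; rows within one level are
           independent. Assumes rows_by_level/level_ptr were bucketed from
           level[] beforehand. */
        void solve_wavefronts(int nlevels, const int *level_ptr,
                              const int *rows_by_level,
                              const int *rowptr, const int *col,
                              const double *val, const double *b, double *x) {
          for (int l = 0; l < nlevels; l++) {
            #pragma omp parallel for
            for (int p = level_ptr[l]; p < level_ptr[l+1]; p++) {
              int i = rows_by_level[p];
              double sum = b[i], diag = 1.0;
              for (int j = rowptr[i]; j < rowptr[i+1]; j++) {
                if (col[j] < i)       sum -= val[j] * x[col[j]];
                else if (col[j] == i) diag = val[j];
              }
              x[i] = sum / diag;
            }
          }
        }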

  21. BCSR inspector speedup (speedup over OSKI, across matrices): inspector code is 1.5x faster than OSKI.
      BCSR executor performance (GFLOPS, CHiLL vs. OSKI, across matrices): executor code is within 1% of OSKI performance.

  22. Symmetric Gauss-Seidel relaxation: performance (GB/s) of generated code vs. serial MKL across matrices (tmt_sym, nd24k, crankseg_2, offshore, Hook_1498, af_shell3, Emilia_923, Flan_1565, bmwcra_1, Geo_1438, inline_1, StocF-1465, ecology2, G3_circuit, thermal2, apache2, parabolic_fem).

  26. Intel i7-4770 (Haswell) CPU, 8 OpenMP threads
      • Baseline CHiLL performance falls short of the manual implementation
      • Further optimization reduces data movement of index arrays (short vectors)
      • #pragma simd for vector execution of the innermost loop
      Optimized Code A outperforms Code B!
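      As an illustration of the last bullet (my sketch, written with the standard OpenMP spelling #pragma omp simd rather than the Intel-specific #pragma simd named on the slide): the pragma goes on the unit-stride innermost k loop of Code A.

        #include <stddef.h>

        /* Vectorized inner loop of Code A (sketch); x and y are n x m row-major. */
        void spmm_row(int i, int m, const int *index, const int *col,
                      const double *A, const double *x, double *y) {
          for (int j = index[i]; j < index[i+1]; j++) {
            const double aj = A[j];
            const double *xr = &x[(size_t)col[j] * m];  /* row gathered via col[j] */
            #pragma omp simd
            for (int k = 0; k < m; k++)   /* unit stride: vectorizes well */
              y[(size_t)i * m + k] += aj * xr[k];
          }
        }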

  27. Inspector/Executor:
      • Mirchandaney, Saltz et al., ICS 1988
      • Rauchwerger, 1998
      • Basumallik and Eigenmann, PPoPP 2006
      • Ravishankar et al., SC 2012
      Polyhedral Support for Indirection:
      • Omega: Pugh and Wonnacott, TOPLAS 1998
      • SPF: Strout et al., LCPC 2012
      Sparse Data Representations:
      • Sublimation: Bik and Wijshoff, TPDS 1996
      • Ding and Kennedy, PLDI 1999
      • Mellor-Crummey et al., IJHPCA 2004
      • LL: Gilad et al., ICFP 2010
      • van der Spek and Wijshoff, LCPC 2010
      Compilers for Sparse Computations:
      • SIPR: Shpeisman and Pugh, LCPC 1998
      • Bernoulli: Mateev et al., ICS 2000
      • taco: Kjolstad et al., OOPSLA 2017, PLDI 2020
      Prior work did not integrate all of these optimizations, and mostly did not compose with other optimizations.
