  1. Mary Hall, November 11, 2020

  2. • Anand Venkat, Utah PhD, now at Intel
     • Other Utah students: Khalid Ahmad, John Jolly, Mahesh Lakshminaranan, Payal Nandy, Tuowen Zhao
     • University of Arizona collaborators: Michelle Strout, Mahdi Mohammadi
     • Boise State collaborators: Cathie Olschanowsky, Eddie Davis, Tobi Popoola
     • Intel collaborators: Jongsoo Park, Hongbo Rong, Raj Barik
     This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration, and by the National Science Foundation under CCF-1564074.

  3. • Sparse matrices/tensors appear frequently in large systems of equations
     • Sparse matrices/tensors have diverse applications
     • Density δ is often << 0.1
     Examples: Network Theory (web connectivity), Epidemiology (2D Markov model of an epidemic), Finance (portfolio model)
     Slide images: SuiteSparse Matrix Collection, sparse.tamu.edu

  4. /* SpMM from LOBPCG on symmetric matrix */
     for (i = 0; i < n; i++) {
       for (j = index[i]; j < index[i+1]; j++)
         for (k = 0; k < m; k++)
           y[i][k] += A[j] * x[col[j]][k];
       /* transposed computation exploiting symmetry */
       for (j = index[i]; j < index[i+1]; j++)
         for (k = 0; k < m; k++)
           y[col[j]][k] += A[j] * x[i][k];
     }

     Code A (above): multiple SpMV computations (SpMM), 7 lines of code.
     Code B: manually-optimized SpMM from LOBPCG, 2109 lines of code, applying:
     • Data transformation: convert matrix format CSR → CSB, 11 different block sizes/implementations
     • Parallelism: thread-level (OpenMP w/ schedule)
     • Parallelism: SIMD (AVX2)
     • Other: indexing simplification

     Question: Can a compiler generate Code B starting with Code A?
     Answer: YES (rest of talk)

  5. Optimizing Dense Linear Algebra – COMPUTE BOUND
     • Exploit all forms of parallelism to approach peak flop rate
     • Exploit locality of reused data in cache and registers
     • Hide latency of initial cold misses
     Optimizing Sparse Linear Algebra – BOUND BY DATA MOVEMENT
     • Maximize memory bandwidth utilization
     • Manage load imbalance
     • Memory access pattern unpredictable – try to hide latency
     • Select best sparse matrix representation – depends on nonzero pattern
     These optimizations are usually architecture specific.

  6. • Inspector/Executor: Integrate runtime optimization based on input data into generated code
     • Integration: Incorporate into the Sparse Polyhedral Framework (SPF)
     • Data dependent (PARALLEL SCHEDULE): Support parallelization in the presence of data dependences
     • Format (DATA REPRESENTATION): Convert from one format to another (e.g., CSR to BCSR)
     • Value (DATA REPRESENTATION): Use mixed-precision data values

  7. Code A → Code B

     /* SpMM from LOBPCG on symmetric matrix */
     for (i = 0; i < n; i++) {
       for (j = index[i]; j < index[i+1]; j++)
         for (k = 0; k < m; k++)
           y[i][k] += A[j] * x[col[j]][k];
       /* transposed computation exploiting symmetry */
       for (j = index[i]; j < index[i+1]; j++)
         for (k = 0; k < m; k++)
           y[col[j]][k] += A[j] * x[i][k];
     }

  8. • Mathematically represents loop nest computations and the transformations applied to them
     • Enables composition of transformations and correct code generation
     • Abstractions representing loop nest computations:
       – Iteration spaces as integer sets of points
       – Transformations as relations on iteration spaces
       – Statement macros as functions of loop index variables
       – Underlying dependence graph to reason about safety of transformations

  9. Stage 1: Extract loop bounds and construct iteration spaces
       Input code:
         for (i = 0; i < n; i++)
           s0: a[i+4] = b[i+4];
       Iteration space (IS): s0 = {[i] : 0 ≤ i < n}

     Stage 2: Affine loop transformation (T)
       Input IS: {[i] : 0 ≤ i < n}
       T = {[i] → [i+4]}
       Output IS: {[i] : 4 ≤ i < n+4}

     Stage 3: Code generation
       T_inv = {[i] → [i-4]} modifies array subscripts; then polyhedra scanning
       Output code:
         for (i = 4; i < n+4; i++)
           s0: a[i] = b[i];
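     As a quick sanity check on the example above (my own sketch, not from the talk): the two loops do the same work, since composing T_inv with the original subscript gives (i-4)+4 = i.

       #include <assert.h>
       #include <string.h>
       #define N 16

       int main(void) {
         double b[N + 4], a1[N + 4] = {0}, a2[N + 4] = {0};
         for (int i = 0; i < N + 4; i++) b[i] = 0.5 * i;

         /* original: iterate over {[i] : 0 <= i < N} */
         for (int i = 0; i < N; i++)
           a1[i + 4] = b[i + 4];

         /* transformed: iterate over T(IS) = {[i] : 4 <= i < N+4};
            T_inv rewrites the subscript: (i-4)+4 == i */
         for (int i = 4; i < N + 4; i++)
           a2[i] = b[i];

         assert(memcmp(a1, a2, sizeof a1) == 0);  /* identical results */
         return 0;
       }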

  10. Non-affine loop bounds

      for (i = 0; i < n; i++)
        for (j = index[i]; j < index[i+1]; j++)
          y[i] += a[j] * x[col[j]];

      Non-affine subscript, col: column for each element in A
      Non-affine loop bounds, index: first location of row i in A
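      For concreteness, here is a small CSR instance of the arrays used above; the matrix and its values are my own illustration, not from the slides:

        /* 3x3 matrix with 4 nonzeros:
           [ 5 0 1 ]
           [ 0 2 0 ]
           [ 0 0 7 ]
           a[]     stored values, row by row (nonzeros only)
           col[]   column of each stored value
           index[] row i occupies a[index[i]] .. a[index[i+1]-1] */
        double a[]     = {5.0, 1.0, 2.0, 7.0};
        int    col[]   = {0,   2,   1,   2};
        int    index[] = {0, 2, 3, 4};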

  11. for (i = 0; i < n; i++)
        for (j = index[i]; j < index[i+1]; j++)
          s0: y[i] += a[j] * x[col[j]];

      Most polyhedral compilers can't represent the bounds of loop j.

      Uninterpreted function: represent index as a function in relations [Pugh and Wonnacott, TOPLAS 1998]. Extend to support:
      • Loop bounds
      • Parameters beyond loop indices
      • Transformations
      • Code generation

      Observations:
      • index is invariant within the loop nest
      • some loop transformations may be safe if index can be represented

  12. Original code:
        for (i = 0; i < n; i++)
          for (j = index[i]; j < index[i+1]; j++)
            s0: y[i] += a[j] * x[col[j]];

      Represent the j loop bounds as uninterpreted functions:
        IS = {[i,j] : 0 ≤ i < n ∧ index(i) ≤ j < index(i+1)}

      T_tile = {[i,j] → [i,jj,j] | ∃a : jj = 4a ∧ a ≥ 0 ∧ jj ≤ j < jj + 4}

      Now tiling is possible!
        for (i = 0; i < n; i++)
          for (jj = index[i]; jj < index[i+1]; jj += 4)
            for (j = jj; j < min(index[i+1], jj + 4); j++)
              y[i] += a[j] * x[col[j]];

      [CGO14] Venkat et al.

  14. Runtime information is needed for many optimizations to understand the memory access pattern and sparse matrix nonzero structure.
      • Inspector analyzes indirect accesses at runtime and/or reorders data
        (Inspector code: matrix format conversion / runtime parallelization)
      • Executor is the reordered computation
        (Executor code: iterate using the new representation)
      Both inspector and executor are generated at compile time, but the inspector examines the input matrix once at runtime.
      Original concept: Mirchandaney and Saltz, ICS 1988. Similar to sparse matrix libraries like OSKI, PETSc.
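      A minimal inspector/executor sketch (my illustration over the CSR arrays a, col, index used earlier, not generated CHiLL output): the inspector examines the nonzero structure once at runtime and builds a row permutation, here simply scheduling heavy rows first; the executor is the reordered SpMV.

        /* Inspector: runs once at runtime; reads only index[] and builds
           a permutation placing rows with the most nonzeros first
           (a simple load-balance heuristic). */
        void inspector(int n, const int *index, int *perm) {
          for (int i = 0; i < n; i++) perm[i] = i;
          for (int i = 1; i < n; i++) {          /* insertion sort, fine for a sketch */
            int r = perm[i];
            int len = index[r+1] - index[r];
            int k = i - 1;
            while (k >= 0 && index[perm[k]+1] - index[perm[k]] < len) {
              perm[k+1] = perm[k];
              k--;
            }
            perm[k+1] = r;
          }
        }

        /* Executor: the reordered computation; iterates rows in inspector order. */
        void executor(int n, const int *index, const int *col,
                      const double *a, const double *x, double *y,
                      const int *perm) {
          for (int p = 0; p < n; p++) {
            int i = perm[p];
            double sum = 0.0;
            for (int j = index[i]; j < index[i+1]; j++)
              sum += a[j] * x[col[j]];
            y[i] = sum;
          }
        }

      In the talk's uses, the inspector instead converts formats (e.g., CSR to BCSR) or builds a dependence graph for parallelization.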

  15. Specialize matrix representation for nonzero structure
      • Compressed Sparse Row (CSR) is a general structure that is widely used
      • Blocked Compressed Sparse Row (BCSR):
        – Uses fixed-size dense blocks if any elements are nonzero
        – Pads with explicit zeros for elements not in the CSR representation; computing with 0 retains meaning
        – Code for a dense block is very efficient; profitable if padding is limited

      A = [ 1 5 3 6 ]
          [ 7 2 0 4 ]

      A (in CSR):  [ 1 5 3 6 7 2 4 ]       (nonzeros only)
      A (in BCSR): [ 1 5 7 2 | 3 6 0 4 ]   (2x2 blocks, one explicit zero)

  16. (Matrix example from the previous slide: i rows, k columns, [ 1 5 3 6 ; 7 2 0 4 ], blocked 2x2.)

      Original code:
        for (i = 0; i < n; i++)
          for (j = index[i]; j < index[i+1]; j++)
            s0: y[i] += a[j] * x[col[j]];

      make-dense(s0, col[j]) introduces a dense k loop guarded by the indirection:
        for (i = 0; i < n; i++)
          for (k = 0; k < n; k++)
            for (j = index[i]; j < index[i+1]; j++)
              if (k == col[j])
                s0: y[i] += a[j] * x[col[j]];

      tile(0, 2, c, counted) and tile(0, 2, r, counted) produce the blocked loop nest:
        for (ii = 0; ii < n/r; ii++)
          for (kk = 0; kk < n/c; kk++)
            for (i = 0; i < r; i++)
              for (k = 0; k < c; k++)
                for (j = index[ii*r+i]; j < index[ii*r+i+1]; j++)
                  if ((kk*c+k) == col[j])
                    s0: y[ii*r+i] += a[j] * x[kk*c+k];

      compact-and-pad(s0, kk, A) then converts the data layout to padded BCSR blocks.

      [PLDI15] Venkat et al.
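      For reference, a minimal sketch of the executor that compact-and-pad produces conceptually (the array names A_blk, b_index, b_col and this layout are my assumptions, not CHiLL's generated names): each stored r x c block multiplies densely with no guard, since padded zeros contribute nothing.

        /* BCSR SpMV: b_index[ib] .. b_index[ib+1]-1 are the blocks of
           block-row ib; b_col[b] is block b's block-column; A_blk stores
           r*c values per block, contiguously. */
        void spmv_bcsr(int nb, int r, int c,
                       const int *b_index, const int *b_col,
                       const double *A_blk, const double *x, double *y) {
          for (int ib = 0; ib < nb; ib++)                      /* block rows */
            for (int b = b_index[ib]; b < b_index[ib+1]; b++)  /* blocks in row */
              for (int i = 0; i < r; i++)
                for (int k = 0; k < c; k++)   /* dense block: no guard needed */
                  y[ib*r + i] += A_blk[b*r*c + i*c + k] * x[b_col[b]*c + k];
        }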

  18. Dense (Lower) Triangular (Forward) Solve

      L = [ 1  0  0  0 ]
          [ 9  2  0  0 ]
          [ 3  7 10  0 ]
          [ 4  8  5 12 ]

      • Rows cannot be processed in parallel
      • x[0] has to be computed before x[1], x[1] before x[2], ...
      • Outer i loop cannot be parallelized
      Dependence graph over rows 0-3: each row depends on all earlier rows.
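      A standard dense forward-substitution loop (my sketch, not slide code) makes the dependence chain explicit:

        /* Solve L x = b for lower-triangular L (row-major, n x n).
           x[i] reads every earlier x[j], so the i iterations form a chain. */
        void forward_solve(int n, const double *L, const double *b, double *x) {
          for (int i = 0; i < n; i++) {
            double sum = b[i];
            for (int j = 0; j < i; j++)
              sum -= L[i*n + j] * x[j];   /* depends on x[0..i-1] */
            x[i] = sum / L[i*n + i];
          }
        }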

  19. Sparse (Lower) Triangular (Forward) Solve Kernel

      L = [ 1  0  0  0 ]
          [ 0 10  0  0 ]
          [ 0  5  7  0 ]
          [ 9  0  6  8 ]

      • Some rows can be processed in parallel
      • Dependence graph: wavefront 0 = {0, 1}, wavefront 1 = {2}, wavefront 2 = {3}
      • Inspector builds the dependence graph; executor traverses wavefronts
      • Parallel wavefront-scheduled computation (i loop partially parallel)
      • Parallelism is dependent on input structure
      [SC16] Venkat et al. [PLDI19] Mohammadi et al.
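      A minimal sketch of the wavefront inspector/executor pair (mine, assuming CSR arrays rowptr, col, val for L and a precomputed bucketing of rows by level): the inspector derives each row's wavefront level from the nonzero structure; the executor runs one level at a time, solving the rows within a level in parallel.

        /* Inspector: level[i] = length of the longest dependence chain
           ending at row i. Row i depends on row c for every off-diagonal
           nonzero L(i,c), c < i. Rows are visited in order, so levels of
           earlier rows are already final. */
        void inspect_wavefronts(int n, const int *rowptr, const int *col,
                                int *level) {
          for (int i = 0; i < n; i++) {
            int lev = 0;
            for (int j = rowptr[i]; j < rowptr[i+1]; j++)
              if (col[j] < i && level[col[j]] + 1 > lev)
                lev = level[col[j]] + 1;
            level[i] = lev;
          }
        }

        /* Executor: process wavefronts in order; rows within one level are
           independent. Assumes rows_by_level/level_ptr were bucketed from
           level[] beforehand. */
        void solve_wavefronts(int nlevels, const int *level_ptr,
                              const int *rows_by_level,
                              const int *rowptr, const int *col,
                              const double *val, const double *b, double *x) {
          for (int l = 0; l < nlevels; l++) {
            #pragma omp parallel for
            for (int p = level_ptr[l]; p < level_ptr[l+1]; p++) {
              int i = rows_by_level[p];
              double sum = b[i], diag = 1.0;
              for (int j = rowptr[i]; j < rowptr[i+1]; j++) {
                if (col[j] < i)       sum -= val[j] * x[col[j]];
                else if (col[j] == i) diag = val[j];
              }
              x[i] = sum / diag;
            }
          }
        }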

  21. BCSR inspector speedup (speedup over OSKI, across matrices): inspector code is 1.5x faster than OSKI.
      BCSR executor performance (GFLOPS, CHiLL vs. OSKI, across matrices): executor code is within 1% of OSKI performance.

  22. Symmetric Gauss-Seidel relaxation: performance (GB/s) of generated code vs. serial MKL across matrices (tmt_sym, nd24k, crankseg_2, offshore, Hook_1498, af_shell3, Emilia_923, Flan_1565, bmwcra_1, Geo_1438, inline_1, StocF-1465, ecology2, G3_circuit, thermal2, apache2, parabolic_fem).

  26. Intel i7-4770 (Haswell) CPU, 8 OpenMP threads
      • Baseline CHiLL performance falls short of the manual implementation
      • Further optimization reduces data movement of index arrays (short vectors)
      • #pragma simd for vector execution of the innermost loop
      Optimized Code A outperforms Code B!
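      As an illustration of the last bullet (my sketch, written with the standard OpenMP spelling #pragma omp simd rather than the Intel-specific #pragma simd named on the slide): the pragma goes on the unit-stride innermost k loop of Code A.

        #include <stddef.h>

        /* Vectorized inner loop of Code A (sketch); x and y are n x m row-major. */
        void spmm_row(int i, int m, const int *index, const int *col,
                      const double *A, const double *x, double *y) {
          for (int j = index[i]; j < index[i+1]; j++) {
            const double aj = A[j];
            const double *xr = &x[(size_t)col[j] * m];  /* row gathered via col[j] */
            #pragma omp simd
            for (int k = 0; k < m; k++)   /* unit stride: vectorizes well */
              y[(size_t)i * m + k] += aj * xr[k];
          }
        }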

  27. Inspector/Executor:
      • Mirchandaney, Saltz et al., ICS 1988
      • Rauchwerger, 1998
      • Basumallik and Eigenmann, PPoPP 2006
      • Ravishankar et al., SC 2012
      Polyhedral Support for Indirection:
      • Omega: Pugh and Wonnacott, TOPLAS 1998
      • SPF: Strout et al., LCPC 2012
      Sparse Data Representations:
      • Sublimation: Bik and Wijshoff, TPDS 1996
      • Ding and Kennedy, PLDI 1999
      • Mellor-Crummey et al., IJHPCA 2004
      • LL: Gilad et al., ICFP 2010
      • van der Spek and Wijshoff, LCPC 2010
      Compilers for Sparse Computations:
      • SIPR: Shpeisman and Pugh, LCPC 1998
      • Bernoulli: Mateev et al., ICS 2000
      • taco: Kjolstad et al., OOPSLA 2017, PLDI 2020
      Prior work did not integrate all of these optimizations, and mostly did not compose with other optimizations.
