Combining Polyhedral and AST Transformations in CHiLL
Huihui Zhang, Anand Venkat, Protonu Basu, Mary Hall
University of Utah
January 19, 2016

Outline
• Introduction
  • Problem
  • Limitations of polyhedral transformation
• CHiLL Compiler Abstractions
  • Combining polyhedral and AST transformations
• Case Studies
  • Inspector/executor transformation for sparse matrix computation
  • Partial sum transformation for stencil optimization
  • Parallel code generation
    • CUDA
    • OpenMP
• Related Work
• Conclusion

Introduction
• Limitations of typical polyhedral transformation
  • Limited to affine domain
  • Transform iteration spaces
  • Array indices of statements updated
• Complicated optimizations
  • AST transformation as a post-pass outside of polyhedral framework
• Challenges
  • Leverage the power of composability of polyhedral framework

CHiLL Compiler Abstractions (polyhedral transformation: shift by 4)

Input code:
    for(i=0; i < n; i++)
      s0: a[i+1]=a[i] + 5;

CHiLL Abstractions (loop transformation framework):
    Dep: <+1>
    Statement: s0: a[i+1]=a[i] + 5;
    IS: {[i] : 0 <= i < n}              (polyhedral)
    xform: {[i]->[0,i+4,0]}             (polyhedral)
    code: a[i+1]=a[i] + 5;
    xform_inv = {[i]->[i-4]}

Generated code:
    for(i=4; i < n+4; i++)
      s0: a[i-3]=a[i-4]+5;

[Diagram: Input code → CHiLL Compiler (loop transformation framework, then code generation) → Generated code]
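The shift example above can be checked by hand: the new iterator is i' = i + 4, so the statement body is rewritten with xform_inv (i = i' - 4), giving a[i'-3] = a[i'-4] + 5. The following is a minimal C sketch of that check, not CHiLL output. The function names, the size N, and the initial array values are assumptions made for illustration.

```c
#include <assert.h>

#define N 16  /* illustrative problem size (assumption, not from the slides) */

/* Original loop: iteration space {[i] : 0 <= i < n}. */
void run_original(int *a, int n) {
    for (int i = 0; i < n; i++)
        a[i + 1] = a[i] + 5;
}

/* Shifted loop: the iterator runs over i + 4, and every use of i in the
 * statement body is replaced by xform_inv(i) = i - 4. */
void run_shifted(int *a, int n) {
    for (int i = 4; i < n + 4; i++)
        a[i - 3] = a[i - 4] + 5;
}

/* Returns 1 when both loops leave identical arrays behind. */
int shift_preserves_result(void) {
    int a[N + 1], b[N + 1];
    for (int i = 0; i <= N; i++)
        a[i] = b[i] = i * i;  /* arbitrary initial contents */
    run_original(a, N);
    run_shifted(b, N);
    for (int i = 0; i <= N; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

Because the shift only renames iterations and keeps their order, the loop-carried dependence <+1> is preserved and both versions compute the same values.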

CHiLL Compiler Abstractions (AST transformation)

Input code:
    for(i=0; i < n; i++)
      s0: a[i+1]=a[i] + 5;

CHiLL Abstractions (loop transformation framework):
    Dep: <+1>
    Statement: s0: a[i+1]=a[i] + 5;
    IS: {[i] : 0 <= i < n}              (polyhedral)
    xform: {[i] -> [0,1,0]}             (polyhedral)
    code: AST → Modified AST

[Diagram: Input code → CHiLL Compiler (loop transformation framework, then code generation) → Generated code]

Non-Affine Extension - Coalesce Transformation
• Sparse matrix computation
  • Non-affine indirection through index arrays
    • Subscript expressions
      • x[col[j]]
    • Upper/lower loop bounds
      • index[i], index[i+1]
• Uninterpreted function symbol abstraction
  • Model functions or mappings (non-affine)
• Inspector/Executor mechanism
  • Inspector collects information at runtime
  • used by optimized executor

CSR:
    for(i=0; i < n; i++)
      for(j=index[i];j<index[i+1];j++)
        y[i]+=a[j]*x[col[j]]

Inspector Construction - Coalesce Transformation

Input code:
    for(i=0; i < n; i++)
      for(j=index[i];j<index[i+1];j++)
        y[i]+=a[j]*x[col[j]]

AST & Iteration Space Manipulation:
    T_coalesce = {[i,j]->[k] | k=c(i,j) ∧ 0 ≤ k < NNZ}

    struct c {
      int c_inv[][2];
      int k;
      void create_mapping(int i, int j) {
        c_inv[k][0] = i;
        c_inv[k][1] = j;
        k++;
      }}

Inspector code (AST):
    for(i = 0; i < n; i++)
      for(j = index[i]; j < index[i+1]; j++)
        c.create_mapping(i,j);

Executor code (polyhedral iteration space, statement updated):
    for (k = 0; k < NNZ; k++)
      y[c_inv[k][0]] += a[c_inv[k][1]]*x[col[c_inv[k][1]]];
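The inspector/executor pair above can be tried out in plain C. The sketch below is an illustration under stated assumptions rather than CHiLL-generated code: the 4x4 matrix, the names idx, col, val, inspect and spmv_coalesced, and the use of a free function in place of the slide's struct method are all made up for the example.

```c
#include <assert.h>

enum { ROWS = 4, NNZ = 6 };  /* illustrative sizes (assumption) */

/* A small CSR matrix; the contents are arbitrary example data. */
static const int    idx[ROWS + 1] = {0, 2, 3, 3, 6};
static const int    col[NNZ]      = {0, 2, 1, 0, 1, 3};
static const double val[NNZ]      = {1, 2, 3, 4, 5, 6};

/* Original CSR loop nest with non-affine bounds and subscripts. */
void spmv_csr(const double *x, double *y) {
    for (int i = 0; i < ROWS; i++)
        for (int j = idx[i]; j < idx[i + 1]; j++)
            y[i] += val[j] * x[col[j]];
}

/* Inspector: walks the original iteration space once at run time and
 * records, for each coalesced index k, the (i, j) pair it came from. */
int inspect(int c_inv[][2]) {
    int k = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = idx[i]; j < idx[i + 1]; j++) {
            c_inv[k][0] = i;
            c_inv[k][1] = j;
            k++;
        }
    return k;  /* number of coalesced iterations */
}

/* Executor: a single loop over k, with i and j read back from c_inv. */
void spmv_coalesced(int c_inv[][2], int nnz, const double *x, double *y) {
    for (int k = 0; k < nnz; k++)
        y[c_inv[k][0]] += val[c_inv[k][1]] * x[col[c_inv[k][1]]];
}

/* Returns 1 when the executor reproduces the CSR result. */
int coalesce_matches_csr(void) {
    const double x[ROWS] = {1, 2, 3, 4};
    double y_ref[ROWS] = {0}, y_new[ROWS] = {0};
    int c_inv[NNZ][2];
    int nnz = inspect(c_inv);
    spmv_csr(x, y_ref);
    spmv_coalesced(c_inv, nnz, x, y_new);
    for (int i = 0; i < ROWS; i++)
        if (y_ref[i] != y_new[i])
            return 0;
    return nnz == NNZ;
}
```

The executor visits the nonzeros in the same order as the original nest, so the two results match exactly here; the benefit of the coalesced form is a single loop with known bounds 0 ≤ k < NNZ.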

More Complicated I/E Transformations - BCSR

Input code:
    for(i = 0; i < n; i++)
      for(j = index[i]; j < index[i+1]; j++)
        y[i] += a[j]*x[col[j]];

After make-dense:
    for(i = 0; i < n; i++)
      for(k = 0; k < n; k++)
        for(j = index[i]; j < index[i+1]; j++)
          if(k == col[j])
            y[i]+=a[j]*x[k];

After Tile(i,k):
    for(ii=0; ii < n/r; ii++)
      for(kk=0; kk < n/c; kk++)
        for(i=0; i < r; i++)
          for(k=0; k < c; k++)
            for(j=index[ii*r+i]; j < index[ii*r+i+1]; j++)
              if(kk*c+k == col[j])
                y[ii*r+i] += a[j]*x[kk*c+k];

After Compact-and-pad(kk,a,a'), inspector code:
    for(ii=0; ii < n/r; ii++){
      //reset marked to false (code not shown)
      for(i=0; i < r; i++)
        for(j=index[ii*r+i]; j < index[ii*r+i+1]; j++) {
          kk = col[j]/c;
          k = col[j] - kk*c;
          if(marked[kk] == false){
            marked[kk] = true;
            explicit_index[kk] = count;
            //initialize a'[count][0-r][0-c] to 0
            count++;
          }
          a'[count][i][k] = a[j];
        }
      offset_index[ii+1] = count;
    }
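The first two steps of the BCSR derivation (make-dense, then tiling) do not change the computed result, which a small C sketch can confirm. The matrix data, the block sizes R and C, and all function names below are assumptions chosen for illustration; the compact-and-pad inspector is deliberately left out.

```c
#include <assert.h>

enum { N = 4, NNZ = 6, R = 2, C = 2 };  /* illustrative sizes; R and C divide N */

/* A small CSR matrix; the contents are arbitrary example data. */
static const int    idx[N + 1] = {0, 2, 3, 3, 6};
static const int    col[NNZ]   = {0, 2, 1, 0, 1, 3};
static const double a[NNZ]     = {1, 2, 3, 4, 5, 6};

/* Original CSR loop nest. */
void spmv_csr(const double *x, double *y) {
    for (int i = 0; i < N; i++)
        for (int j = idx[i]; j < idx[i + 1]; j++)
            y[i] += a[j] * x[col[j]];
}

/* After make-dense: a dense column loop k is introduced, and the guard
 * k == col[j] selects the iterations that correspond to real nonzeros. */
void spmv_dense(const double *x, double *y) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = idx[i]; j < idx[i + 1]; j++)
                if (k == col[j])
                    y[i] += a[j] * x[k];
}

/* After tiling i by R and k by C: (ii, kk) name an R x C block. */
void spmv_tiled(const double *x, double *y) {
    for (int ii = 0; ii < N / R; ii++)
        for (int kk = 0; kk < N / C; kk++)
            for (int i = 0; i < R; i++)
                for (int k = 0; k < C; k++)
                    for (int j = idx[ii * R + i]; j < idx[ii * R + i + 1]; j++)
                        if (kk * C + k == col[j])
                            y[ii * R + i] += a[j] * x[kk * C + k];
}

/* Returns 1 when all three variants produce the same vector. */
int variants_agree(void) {
    const double x[N] = {1, 2, 3, 4};
    double y_csr[N] = {0}, y_dense[N] = {0}, y_tiled[N] = {0};
    spmv_csr(x, y_csr);
    spmv_dense(x, y_dense);
    spmv_tiled(x, y_tiled);
    for (int i = 0; i < N; i++)
        if (y_csr[i] != y_dense[i] || y_csr[i] != y_tiled[i])
            return 0;
    return 1;
}
```

The dense and tiled versions are far slower than CSR as written. They serve as an intermediate form whose affine loops over i and k can be tiled before the inspector compacts the nonzero blocks.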

Partial Sum Transformation - Stencil Optimization
• Constant-coefficient Stencils
  • Weighted sum
• High-order Stencils

[Figure: Jacobi stencil shapes for p = 2, p = 4, p = 6, p = 10]

Partial Sum Transformation - Reuse

2D 9-point stencil:
    for (j=0; j<N; j++)
      for (i=0; i<N; i++) {
        out[j][i] = w1*( in[j-1][i] + in[j+1][i] + in[j][i-1] + in[j][i+1] ) +
                    w2*( in[j-1][i-1] + in[j+1][i-1] + in[j-1][i+1] + in[j+1][i+1] ) +
                    w3*( in[j][i] );
      }

Loop body after the partial sum transformation (AST, still affine):
    r1 = in[j][i+1];
    r2 = in[j+1][i+1] + in[j-1][i+1];
    R[i] = w1 * r1 + w2 * r2;      (1)
    C[i+1] = w3 * r1 + w1 * r2;    (2)
    L[i+2] = R[i];                 (3)
    out[j][i] = L[i] + C[i] + R[i];

[Figure: the column read at i+1 contributes (1) to R at point (j,i), (2) to C at point (j,i+1), and (3) to L at point (j,i+2)]

• Composable with communication-avoiding optimizations
  • Overlapped tiling
  • Loop fusion
  • Wavefront
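To see why the partial sums reproduce the stencil, note that the column at i+1 is read once and reused three times: as the right neighbour of point i, the centre of point i+1, and the left neighbour of point i+2. The C sketch below compares the direct stencil with the buffered form on interior points. The grid size, the weights, the buffer offset of 1 (so that the warm-up iteration i = -1 stays in bounds), and the function names are assumptions made for this illustration.

```c
#include <assert.h>

#define N 8  /* illustrative grid size (assumption) */

/* Reference: direct 2D 9-point stencil on interior points. */
void stencil_direct(double in[N][N], double out[N][N],
                    double w1, double w2, double w3) {
    for (int j = 1; j < N - 1; j++)
        for (int i = 1; i < N - 1; i++)
            out[j][i] = w1 * (in[j-1][i] + in[j+1][i] + in[j][i-1] + in[j][i+1])
                      + w2 * (in[j-1][i-1] + in[j+1][i-1] + in[j-1][i+1] + in[j+1][i+1])
                      + w3 * in[j][i];
}

/* Partial-sum form: each iteration reads only column i+1 and updates the
 * buffers R[i], C[i+1], L[i+2]. Buffers are stored with an offset of 1,
 * so R[i] lives at R[i+1], C[i+1] at C[i+2], and L[i+2] at L[i+3]. */
void stencil_partial_sum(double in[N][N], double out[N][N],
                         double w1, double w2, double w3) {
    double L[N + 2], C[N + 2], R[N + 2];
    for (int j = 1; j < N - 1; j++) {
        for (int i = -1; i <= N - 2; i++) {
            double r1 = in[j][i+1];
            double r2 = in[j+1][i+1] + in[j-1][i+1];
            R[i+1] = w1 * r1 + w2 * r2;  /* (1) right-column sum for point i   */
            C[i+2] = w3 * r1 + w1 * r2;  /* (2) centre-column sum for point i+1 */
            L[i+3] = R[i+1];             /* (3) left-column sum for point i+2   */
            if (i >= 1)                  /* all three buffers are ready from i = 1 */
                out[j][i] = L[i+1] + C[i+1] + R[i+1];
        }
    }
}

/* Returns the largest absolute difference between the two versions. */
double partial_sum_max_error(void) {
    static double in[N][N], ref[N][N], out[N][N];
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            in[j][i] = (double)(j * N + i);
    stencil_direct(in, ref, 0.5, 0.25, 2.0);
    stencil_partial_sum(in, out, 0.5, 0.25, 2.0);
    double max = 0.0;
    for (int j = 1; j < N - 1; j++)
        for (int i = 1; i < N - 1; i++) {
            double d = ref[j][i] - out[j][i];
            if (d < 0) d = -d;
            if (d > max) max = d;
        }
    return max;
}
```

Per output point, the buffered form loads three input values instead of nine, at the cost of three small line buffers. In general the summation order differs from the direct form, so floating-point results should be compared with a tolerance.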

Parallel Code Generation
• Introduces
  • Parallel threads
  • Synchronization
  • Scaffolding code
• Approach
  • Apply transformations to set up for parallelization
    • E.g., tiling, datacopy
  • Annotate AST with aspects of parallel code generation
  • AST and polyhedral abstractions preserved until code generation, to facilitate composing transformations
  • Code generation emits specialized code

Parallel Code Generation - CUDA

Input code:
    void MM(int c[N][N], int a[N][N], int b[N][N]) {
      for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
          for (k = 0; k < N; k++)
            c[j][i] = c[j][i] + a[k][i] * b[j][k];
    }

Tiling:
    tile_by_index(0,{"i","j"},{Ti,Tj},
                  {l1_control="ii",l2_control="jj"},
                  {"ii","jj","i","j","k"})

Loop nest after tiling:
    for(t2 = 0; t2 <= 7; t2++) {                      // loop ii, block dimension x
      for(t4 = 0; t4 <= 15; t4++) {                   // loop jj, block dimension y
        for(t6 = 128*t2; t6 <= 128*t2+127; t6++) {    // loop i
          for(t8 = 64*t4; t8 <= 64*t4+63; t8++) {     // loop j
            for(t10 = 0; t10 <= 1023; t10++) {        // loop k
              s0(t2,t4,t6,t8,t10);
    }}}}}

CUDA mapping:
    cudaize(0,"mm_GPU",{},
            {block={"ii","jj"},thread={"i","j"}},{})

• Impact on AST
  • AST annotation of block/thread loops
  • Loops are marked for elimination
  • Polyhedral and AST abstractions remain until code generation
  • Loop iterators are replaced with block/thread index
    • E.g., ii, jj replaced with blockIdx.x, blockIdx.y
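The effect of eliminating the block and thread loops can be emulated on the CPU. The plain C sketch below is not CHiLL or CUDA output: the body function receives hypothetical block and thread indices (bx, by, tx, ty, standing in for blockIdx and threadIdx) and rebuilds i and j from them, while ordinary loops play the role of the GPU launch. The sizes N, TI and TJ and all function names are assumptions; the slide's example uses a 1024 x 1024 problem with 128 x 64 tiles.

```c
#include <assert.h>

enum { N = 8, TI = 4, TJ = 2 };  /* illustrative sizes (assumption) */

/* Original loop nest from the slide. */
void mm_original(int c[N][N], int a[N][N], int b[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[j][i] = c[j][i] + a[k][i] * b[j][k];
}

/* Work done by one (block, thread) pair: the ii, jj, i and j loops are gone,
 * and i and j are recomputed from the indices. Only the k loop remains. */
static void mm_body(int c[N][N], int a[N][N], int b[N][N],
                    int bx, int by, int tx, int ty) {
    int i = TI * bx + tx;
    int j = TJ * by + ty;
    for (int k = 0; k < N; k++)
        c[j][i] = c[j][i] + a[k][i] * b[j][k];
}

/* Sequential stand-in for a launch over an (N/TI) x (N/TJ) grid
 * of TI x TJ thread blocks. */
void mm_emulated_launch(int c[N][N], int a[N][N], int b[N][N]) {
    for (int bx = 0; bx < N / TI; bx++)
        for (int by = 0; by < N / TJ; by++)
            for (int tx = 0; tx < TI; tx++)
                for (int ty = 0; ty < TJ; ty++)
                    mm_body(c, a, b, bx, by, tx, ty);
}

/* Returns 1 when the emulated launch matches the original loop nest. */
int tiled_matches_original(void) {
    int a[N][N], b[N][N], c0[N][N], c1[N][N];
    for (int p = 0; p < N; p++)
        for (int q = 0; q < N; q++) {
            a[p][q] = p + q;
            b[p][q] = p - q;
            c0[p][q] = c1[p][q] = 0;
        }
    mm_original(c0, a, b);
    mm_emulated_launch(c1, a, b);
    for (int p = 0; p < N; p++)
        for (int q = 0; q < N; q++)
            if (c0[p][q] != c1[p][q])
                return 0;
    return 1;
}
```

Each (i, j) pair is handled by exactly one index combination and writes a distinct element of c, which is why the four outer loops can run in any order, or in parallel, without synchronization.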

Parallel Code Generation - CUDA
• Data Copy Transformation
  • Synchronization
  • AST annotation
• Scaffolding code
  • Kernel inlining

Before data copy:
    for (kk = 0; kk <= 63; kk += 1)
      for (iii = 0; iii <= 7; iii += 1)
        for (jjj = 0; jjj <= 3; jjj += 1)
          for (k = 16 * kk; k <= 16 * kk + 15; k += 1)
            c[...][...] = c[...][...] + a[...][...] * b[...][...];

Transformation:
    copy_to_shared(0,"tx","a",-16)

After data copy (AST):
    for (kk = 0; kk <= 63; kk += 1) {
      for (tmp_tx = 0; tmp_tx <= 7; tmp_tx += 1)
        _P1[...][...] = a[...][...];
      __syncthreads();
      for (iii = 0; iii <= 7; iii += 1)
        for (jjj = 0; jjj <= 3; jjj += 1)
          for (k = 16 * kk; k <= 16 * kk + 15; k += 1)
            c[...][...] = c[...][...] + _P1[...][...] * b[...][...];
      __syncthreads();
    }

Scaffolding code (AST):
    ...
    mm_GPU <<<dimGrid0 ,dimBlock0 >>>(...);
    ...
    __global__ void mm_GPU(...)
    { ... }
