Mary Hall, October 24, 2017
- Postdoctoral Researcher Opening: ~Jan. 2018
- >10 Open Faculty Positions, 1 in Programming Languages
Funded in part by the Department of Energy Office of Advanced Scientific Computing Research under award DE-SC0008682 and Scientific Discovery through Advanced Computing (SciDAC) award DE-SC0006947, and by National Science Foundation award CCF-1018881.
Stencils and Geometric Multigrid: Protonu Basu, Sam Williams, Brian Van Straalen, Lenny Oliker, Phil Colella
Sparse Matrix Computations: Anand Venkat, Khalid Ahmad, Michelle Strout, Huihui Zhang
Tensor Contractions: Thomas Nelson, Axel Rivera (Intel), Prasanna Balaprakash, Paul Hovland, Liz Jessup, Boyana Norris
/* Laplacian 7-point variable-coefficient stencil */
for (k=0; k<N; k++)
  for (j=0; j<N; j++)
    for (i=0; i<N; i++)
      temp[k][j][i] = b * h2inv * (
            beta_i[k][j][i+1] * ( phi[k][j][i+1] - phi[k][j][i]   )
          - beta_i[k][j][i]   * ( phi[k][j][i]   - phi[k][j][i-1] )
          + beta_j[k][j+1][i] * ( phi[k][j+1][i] - phi[k][j][i]   )
          - beta_j[k][j][i]   * ( phi[k][j][i]   - phi[k][j-1][i] )
          + beta_k[k+1][j][i] * ( phi[k+1][j][i] - phi[k][j][i]   )
          - beta_k[k][j][i]   * ( phi[k][j][i]   - phi[k-1][j][i] ) );

/* Helmholtz */
for (k=0; k<N; k++)
  for (j=0; j<N; j++)
    for (i=0; i<N; i++)
      temp[k][j][i] = a * alpha[k][j][i] * phi[k][j][i] - temp[k][j][i];

/* Gauss-Seidel red-black update */
for (k=0; k<N; k++)
  for (j=0; j<N; j++)
    for (i=0; i<N; i++) {
      if ((i+j+k+color)%2 == 0)
        phi[k][j][i] = phi[k][j][i] - lambda[k][j][i] * (temp[k][j][i] - rhs[k][j][i]);
    }
Code A: miniGMG baseline smooth operator, ~13 lines of code
Code B: miniGMG optimized smooth operator, ~170 lines of code
Which version would you prefer to write?
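Code A's three steps can be sketched in plain Python. This is a minimal, illustrative sketch only: the interior-only loop bounds stand in for miniGMG's ghost zones, and the `smooth()` wrapper and its argument order are ours, not miniGMG's API.

```python
# Sketch of the miniGMG smooth step: variable-coefficient 7-point
# Laplacian, Helmholtz, then a Gauss-Seidel red-black update.
# Loops cover interior points only (in place of ghost zones).

def smooth(phi, rhs, beta_i, beta_j, beta_k, alpha, lam, a, b, h2inv, color):
    N = len(phi)
    temp = [[[0.0] * N for _ in range(N)] for _ in range(N)]
    # 7-point variable-coefficient Laplacian
    for k in range(1, N - 1):
        for j in range(1, N - 1):
            for i in range(1, N - 1):
                temp[k][j][i] = b * h2inv * (
                      beta_i[k][j][i+1] * (phi[k][j][i+1] - phi[k][j][i])
                    - beta_i[k][j][i]   * (phi[k][j][i]   - phi[k][j][i-1])
                    + beta_j[k][j+1][i] * (phi[k][j+1][i] - phi[k][j][i])
                    - beta_j[k][j][i]   * (phi[k][j][i]   - phi[k][j-1][i])
                    + beta_k[k+1][j][i] * (phi[k+1][j][i] - phi[k][j][i])
                    - beta_k[k][j][i]   * (phi[k][j][i]   - phi[k-1][j][i]))
    # Helmholtz
    for k in range(1, N - 1):
        for j in range(1, N - 1):
            for i in range(1, N - 1):
                temp[k][j][i] = a * alpha[k][j][i] * phi[k][j][i] - temp[k][j][i]
    # Gauss-Seidel red-black update: half the points per sweep, by parity
    for k in range(1, N - 1):
        for j in range(1, N - 1):
            for i in range(1, N - 1):
                if (i + j + k + color) % 2 == 0:
                    phi[k][j][i] -= lam[k][j][i] * (temp[k][j][i] - rhs[k][j][i])
    return phi
```

A quick sanity check: if phi already satisfies the discretized equation (so temp equals rhs), the red-black update leaves phi unchanged.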
- Prefetch; data staged in registers/buffers
- AVX SIMD intrinsics
- Memory hierarchy parallelism
- Ghost zones: trade off computation for communication
- Spin locks in OpenMP
- Nested OpenMP and MPI
- Parallel wavefronts: reduce sweeps over the 3D grid
And now GPU code?
Code C: miniGMG optimized smooth operator for GPU, 308 lines of code for just the kernel
Which version would you prefer to write?
/* SpMM from LOBPCG on a symmetric matrix */
for (i = 0; i < n; i++) {
  for (j = index[i]; j < index[i+1]; j++)
    for (k = 0; k < m; k++)
      y[i][k] += A[j] * x[col[j]][k];
  /* transposed computation exploiting symmetry */
  for (j = index[i]; j < index[i+1]; j++)
    for (k = 0; k < m; k++)
      y[col[j]][k] += A[j] * x[i][k];
}
Code A: Multiple SpMV computations (SpMM), 7 lines of code
Code B: Manually-optimized SpMM from LOBPCG, 2109 lines of code
- Convert matrix format: CSR to CSB
- 11 different block sizes/implementations
- OpenMP w/ scheduling pragmas
- AVX SIMD
- Indexing simplification
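The symmetric SpMM above translates directly to Python. In this sketch we assume the CSR arrays (`index` row pointers, `col` column indices, values `A`) store only one strict triangle of the matrix, so the transposed inner loop supplies the other triangle without double-counting the diagonal; the small example matrix in the usage below is illustrative.

```python
# Symmetric SpMM: y = (L + L^T) x, where L is the stored (strict) triangle.
# index/col/A are a CSR matrix; x and y are n-by-m dense blocks of vectors.

def spmm_sym(n, m, index, col, A, x):
    y = [[0.0] * m for _ in range(n)]
    for i in range(n):
        # stored entries of row i
        for j in range(index[i], index[i + 1]):
            for k in range(m):
                y[i][k] += A[j] * x[col[j]][k]
        # transposed computation exploiting symmetry
        for j in range(index[i], index[i + 1]):
            for k in range(m):
                y[col[j]][k] += A[j] * x[i][k]
    return y
```

For the 3x3 symmetric matrix with stored strict-upper entries (0,1)=2, (0,2)=3, (1,2)=4 and x a single column of ones, each output row is the corresponding row sum of the full symmetric matrix.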
Which version would you prefer to write?
/* local_grad_3 computation from nek5000 */
w[nelt i j k] += Dt[l k] U[nelt n m l] D[j m] D[i n]
/* local_grad3 from nek5000, generated CUDA code */
Code A: 1-line mathematical representation, input to OCTOPI
Code B: Generated CUDA + harness, 122 lines of code
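The one-line contraction reads with implicit summation over the repeated indices l, m, n. A naive Python rendering of that semantics (loop bounds and the small test sizes are illustrative; OCTOPI generates the optimized CUDA from the same one-line input):

```python
# Naive local_grad_3 contraction: w[e][i][j][k] = sum over l, m, n of
# Dt[l][k] * U[e][n][m][l] * D[j][m] * D[i][n].
# p is the number of points per direction; nelt the number of elements.

def local_grad_3(nelt, p, Dt, U, D):
    w = [[[[0.0] * p for _ in range(p)] for _ in range(p)] for _ in range(nelt)]
    for e in range(nelt):
        for i in range(p):
            for j in range(p):
                for k in range(p):
                    s = 0.0
                    for l in range(p):        # implicit sums made explicit
                        for m in range(p):
                            for n in range(p):
                                s += Dt[l][k] * U[e][n][m][l] * D[j][m] * D[i][n]
                    w[e][i][j][k] = s
    return w
```

With Dt and D both set to the identity, the contraction collapses to w[e][i][j][k] = U[e][i][j][k], a convenient correctness check.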
- Performance portability?
  - Particularly across fundamentally different CPU and GPU architectures
- Programmer productivity?
  - High-performance implementations will require low-level specification that exposes the architecture
- Software maintainability and portability?
  - May require multiple implementations of the application
Current solutions
- Follow MPI and OpenMP standards
  - Same code unlikely to perform well across CPU and GPU
  - Vendor C and Fortran compilers not optimized for HPC workloads
- Some domain-specific framework strategies
  - Libraries, C++ template expansion, standalone DSLs
  - Not composable with other optimizations
Code B/C is not Unusual
- CHiLL: polyhedral compiler transformation and code generation framework with domain-specific specialization (supports C-like C++)
  - Target is loop-based scientific applications
- Composable transformations
- Optimization strategy can be specified or derived with transformation recipes
  - Optimization parameters are also exposed
  - Separates code from mapping!
- Autotuning
  - Systematic exploration of alternate transformation recipes and their optimization parameter values
  - Search technology to prune the combinatorial space

Automate the process of generating Code B from Code A.
Our Approach
- Immediate: Improve performance of production applications
- Medium term: New research ideas
- Long term:
  - Change the workflow for HPC application development
  - Move facilities toward more rapid adoption of new tools
  - Impact compiler and autotuning technology
- Projects:
  - DOE Exascale Computing Project
  - DOE Scientific Discovery through Advanced Computing (SciDAC)
  - NSF Blue Waters PAID project
a. Original loop nest and iteration space:
for (i=0; i<N; i++)
  for (j=1; j<M; j++)
S0: a[i][j] = b[j] - a[i][j-1];
I = {[i,j] | 0<=i<N ∧ 1<=j<M}

b. Statement macro:
#define S0(i,j) a[(i)][(j)] = b[(j)] - a[(i)][(j-1)]

c. Transformed loop nest, T = {[i,j] → [j,i]}:
for (j=1; j<M; j++)
  for (i=0; i<N; i++)
S0: a[i][j] = b[j] - a[i][j-1];

d. Dependence relation for S0:
{[i,j] → [i',j'] | 0<=i,i'<N ∧ 1<=j,j'<M ∧ i=i' ∧ j=j'-1}
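The interchange is legal because the only dependence, (i,j) → (i,j+1), is carried by the j loop, and j still runs in increasing order for each fixed i after the transformation. A quick check in Python (the small N, M, and b values are illustrative):

```python
# Original and interchanged loop nests for S0: a[i][j] = b[j] - a[i][j-1].
# Both must produce the same array for the interchange to be legal.

def original(N, M, b):
    a = [[0.0] * M for _ in range(N)]
    for i in range(N):
        for j in range(1, M):      # j-carried dependence: reads a[i][j-1]
            a[i][j] = b[j] - a[i][j - 1]
    return a

def interchanged(N, M, b):
    a = [[0.0] * M for _ in range(N)]
    for j in range(1, M):          # j still increases for each fixed i
        for i in range(N):
            a[i][j] = b[j] - a[i][j - 1]
    return a
```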
Inspector code: matrix format conversion, non-affine transformation, and parallelization
Executor code: iterations are optimized and use the new representation
- Inspector/executor methodology
  - Inspector analyzes indirect accesses at runtime and/or reorganizes the data representation
  - Executor is the reordered computation
- Compose with polyhedral transformations
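The inspector/executor split can be sketched for a simple indirect update, y[row[j]] += A[j] * x[col[j]]. In this sketch (the COO arrays and helper names are ours, for illustration), the inspector examines the indirect row accesses at runtime and groups iterations by the row they write, so the executor's per-group accumulations are independent and could run in parallel:

```python
# Inspector: analyze indirect accesses at runtime.
def inspector(row, nnz):
    # Group iteration indices by the row they write to.
    groups = {}
    for j in range(nnz):
        groups.setdefault(row[j], []).append(j)
    return groups

# Executor: the reordered computation over the inspector's groups.
def executor(groups, A, x, col, n):
    y = [0.0] * n
    for r, its in groups.items():   # each group writes one row: no conflicts
        for j in its:
            y[r] += A[j] * x[col[j]]
    return y
```

The inspection cost is paid once and amortized when the executor runs many times on the same sparsity structure, which is the usual setting for iterative solvers.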
Stencils and GMG
- Memory-bandwidth bound: communication-avoiding optimizations
- Compute bound: eliminate redundant computation (partial sums)
[HiPC'13], [WOSC'13], [WOSC'14], [IPDPS'15], [PARCO'17]
Three Application Domains, One Compiler
Sparse Linear Algebra
- Specialize matrix representation: data transformations
- Incorporate runtime information: inspector/executor
- Support non-affine input/transformations
[CGO'14], [PLDI'15], [IPDPS'16], [LCPC'16], [IA^3'16], [SC'16]
Tensor Contractions
- Reduce computation: reassociate
- Optimize memory access pattern: modify loop order to best match data layout and memory hierarchy
- Adjust parallelism
[ICPP'15]
miniGMG w/CHiLL
- Fused operations
- Communication-avoiding wavefront
- Parallelized (OpenMP)

Autotuning finds the best implementation for each box size:
- wavefront depth
- nested OpenMP configuration
- inter-thread synchronization (barrier vs. point-to-point)

For fine grids (large arrays), CHiLL attains nearly a 4.5x speedup over the baseline.
[Figure: GSRB Smooth (Edison). Speedup over baseline smoother vs. box size (== level in the V-cycle), for CHiLL-generated, manually tuned, and baseline versions.]
Communication Avoiding: Sometimes Code A Beats Code B!
[Figure: GSRB Smooth on 64^3 boxes. Time (seconds) vs. 2D thread blocks <TX,TY>, for CUDA-CHiLL, Handtuned, and Handtuned-VL.]
CHiLL can obviate the need for architecture-specific programming models like CUDA
- CUDA-CHiLL took the sequential GSRB implementation (.c) and generated CUDA that runs on NVIDIA GPUs
- CUDA-CHiLL autotuned over the thread block sizes and is ultimately 2% faster than the hand-optimized minigmg-cuda (Code C)
- Adaptable to new GPU generations
Retargetable and Performance Portable: Optimized Code A can beat Code C!
- These can be manually written (miniGMG, LOBPCG)
- or automatically generated (tensor contraction)
Example TransformaEon Recipes
# jacobi_box_4_64.py, 27-pt stencil, 64^3 box size
from chill import *

# select which computation to optimize
source('jacobi_box_4_64.c')
procedure('smooth_box_4_64')
loop(0)

original()  # fuse wherever possible

# create a parallel wavefront
skew([0,1,2,3,4,5],2,[2,1])
permute([2,1,3,4])

# partial sums for high-order stencils, and fuse the result
distribute([0,1,2,3,4,5],2)
stencil_temp(0)
stencil_temp(5)
fuse([2,3,4,5,6,7,8,9],1)
fuse([2,3,4,5,6,7,8,9],2)
fuse([2,3,4,5,6,7,8,9],3)
fuse([2,3,4,5,6,7,8,9],4)
-- gsrb.lua, variable-coefficient GSRB, 64^3 box size
init("gsrb_mod.cu", "gsrb", 0, 0)
dofile("cudaize.lua")  -- custom commands in Lua

-- set up parallel decomposition, adjust via autotuning
TI=32 TJ=4 TK=64 TZ=64
tile_by_index(0, {"box","k","j","i"}, {TZ,TK,TJ,TI},
  {l1_control="bb", l2_control="kk", l3_control="jj", l4_control="ii"},
  {"bb","box","kk","k","jj","j","ii","i"})
cudaize(0, "kernel_GPU",
  {_temp=N*N*N*N, _beta_i=N*N*N*N, _phi=N*N*N*N},
  {block={"ii","jj","box"}, thread={"i","j"}}, {})
- Data layout
  - A brick is a 4x4x4 mini-domain without a ghost zone
  - Application of a stencil reaches into other bricks (affinity important)
  - Implemented with contiguous storage and adjacency lists
- Optimization advantages
  - Flexible mapping to SIMD, threads
  - Rapid copying; simplifies scheduling and code generation; can improve ghost zone exchange
  - Better memory hierarchy behavior (including TLB on KNL)

Collaboration with Tuowen Zhao (Utah), Protonu Basu, Sam Williams, Hans Johansen (LBNL)
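The contiguous-storage-plus-adjacency-list idea can be sketched as follows. This is our illustrative reconstruction, not the project's actual data structure: each brick owns a flat 4^3 block, and a 6-entry adjacency list per brick (-x, +x, -y, +y, -z, +z) lets a stencil that steps off a brick follow a link instead of indexing one global array.

```python
BDIM = 4  # brick edge length: a brick is a 4x4x4 mini-domain

def build_bricks(nbx, nby, nbz):
    """Contiguous storage + 6-way adjacency lists for an nbz x nby x nbx brick grid."""
    def bid(bz, by, bx):
        return (bz * nby + by) * nbx + bx
    nbricks = nbz * nby * nbx
    storage = [[0.0] * (BDIM ** 3) for _ in range(nbricks)]  # one flat 4^3 block per brick
    adj = [[None] * 6 for _ in range(nbricks)]               # -x,+x,-y,+y,-z,+z (None at domain edge)
    for bz in range(nbz):
        for by in range(nby):
            for bx in range(nbx):
                b = bid(bz, by, bx)
                if bx > 0:       adj[b][0] = bid(bz, by, bx - 1)
                if bx < nbx - 1: adj[b][1] = bid(bz, by, bx + 1)
                if by > 0:       adj[b][2] = bid(bz, by - 1, bx)
                if by < nby - 1: adj[b][3] = bid(bz, by + 1, bx)
                if bz > 0:       adj[b][4] = bid(bz - 1, by, bx)
                if bz < nbz - 1: adj[b][5] = bid(bz + 1, by, bx)
    return storage, adj
```

Because each brick is a fixed-size contiguous block, copying a brick (e.g., for ghost zone exchange) is a single dense memcpy-style operation, and the 4^3 shape maps naturally onto SIMD lanes and threads.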
- New project: CHiLL-I/E (w/ Michelle Strout)
  - Develop new abstractions for automating and composing inspector code generation
Inspector Dependence Graph
- Overhead
  - Tuning search can be expensive
  - Off-line tuning is expensive and a programmer burden
  - Specifying the search space and transformations
  - Selection and configuration of algorithms
- Scope
  - Tuning must be repeated for new execution contexts
  - Exascale resources vary during execution; the platform may not be available for training
  - Economies of data scale: learning based on a community's code
- Other programmer concerns
  - Correctness concerns with dynamically-changing code
  - Long-term tool availability
Autotuning in High-Performance Computing Applications, Balaprakash, Dongarra, Gamblin, Hall, Hollingsworth, Norris, Vuduc. To appear in a special issue of Proceedings of the IEEE, 2018.