Mary Hall, October 24, 2017



SLIDE 1

Mary Hall October 24, 2017

SLIDE 2
  • Postdoctoral Researcher Opening: ~Jan. 2018
  • >10 Open Faculty Positions, 1 in Programming Languages

SLIDE 3

Funded in part by Department of Energy Office of Advanced Scientific Computing Research under awards DE-SC0008682 and Scientific Discovery through Advanced Computation (SciDAC) award DE-SC0006947, and by the National Science Foundation award CCF-1018881.

  • Stencils and Geometric Multigrid: Protonu Basu, Sam Williams, Brian Van Straalen, Lenny Oliker, Phil Colella
  • Sparse Matrix Computations: Anand Venkat, Khalid Ahmad, Michelle Strout, Huihui Zhang
  • Tensor Contractions: Thomas Nelson, Axel Rivera (Intel), Prasanna Balaprakash, Paul Hovland, Liz Jessup, Boyana Norris

SLIDE 4

/* Laplacian 7-point Variable-Coefficient Stencil */
for (k=0; k<N; k++)
  for (j=0; j<N; j++)
    for (i=0; i<N; i++)
      temp[k][j][i] = b * h2inv * (
          beta_i[k][j][i+1] * ( phi[k][j][i+1] - phi[k][j][i]   )
        - beta_i[k][j][i]   * ( phi[k][j][i]   - phi[k][j][i-1] )
        + beta_j[k][j+1][i] * ( phi[k][j+1][i] - phi[k][j][i]   )
        - beta_j[k][j][i]   * ( phi[k][j][i]   - phi[k][j-1][i] )
        + beta_k[k+1][j][i] * ( phi[k+1][j][i] - phi[k][j][i]   )
        - beta_k[k][j][i]   * ( phi[k][j][i]   - phi[k-1][j][i] ) );

/* Helmholtz */
for (k=0; k<N; k++)
  for (j=0; j<N; j++)
    for (i=0; i<N; i++)
      temp[k][j][i] = a * alpha[k][j][i] * phi[k][j][i] - temp[k][j][i];

/* Gauss-Seidel Red-Black Update */
for (k=0; k<N; k++)
  for (j=0; j<N; j++)
    for (i=0; i<N; i++) {
      if ((i+j+k+color)%2 == 0)
        phi[k][j][i] = phi[k][j][i] - lambda[k][j][i] * (temp[k][j][i] - rhs[k][j][i]);
    }
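The red-black coloring in the update above can be exercised in isolation. A minimal standalone sketch (the grid size GN and the helper gsrb_sweep are illustrative names, not part of miniGMG); with temp and rhs zeroed the values stay fixed and only the count of colored points is checked:

```c
#include <assert.h>

#define GN 4  /* tiny grid for illustration */

/* One red-black sweep on a GN^3 grid; returns how many points matched
   the given color and were updated. phi starts at 1.0 and temp = rhs = 0,
   so updated values stay 1.0 and only the count changes. */
static int gsrb_sweep(int color) {
    static double phi[GN][GN][GN], temp[GN][GN][GN],
                  rhs[GN][GN][GN], lambda[GN][GN][GN];
    int i, j, k, updated = 0;

    for (k = 0; k < GN; k++)
        for (j = 0; j < GN; j++)
            for (i = 0; i < GN; i++) {
                phi[k][j][i] = 1.0;
                lambda[k][j][i] = 0.5;
                temp[k][j][i] = 0.0;
                rhs[k][j][i] = 0.0;
            }

    for (k = 0; k < GN; k++)
        for (j = 0; j < GN; j++)
            for (i = 0; i < GN; i++)
                if ((i + j + k + color) % 2 == 0) {
                    phi[k][j][i] -= lambda[k][j][i]
                                  * (temp[k][j][i] - rhs[k][j][i]);
                    updated++;
                }
    return updated;
}
```

Each color touches exactly half of the grid points, which is what makes the two half-sweeps independent and parallelizable.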

Code A: miniGMG baseline smooth operator, approximately 13 lines of code
Code B: miniGMG optimized smooth operator, approximately 170 lines of code

Which version would you prefer to write?

  • Memory hierarchy: Prefetch; data staged in registers/buffers; AVX SIMD intrinsics
  • Parallelism: Ghost zones (tradeoff computation for communication); spin locks in OpenMP; nested OpenMP and MPI; parallel wavefronts (reduce sweeps over 3D grid)

SLIDE 5

And now GPU code?

Code C: miniGMG optimized smooth operator for GPU, 308 lines of code for the kernel alone

SLIDE 6

Which version would you prefer to write?

/* SpMM from LOBPCG on symmetric matrix */
for (i=0; i<n; i++) {
  for (j=index[i]; j<index[i+1]; j++)
    for (k=0; k<m; k++)
      y[i][k] += A[j] * x[col[j]][k];
  /* transposed computation exploiting symmetry */
  for (j=index[i]; j<index[i+1]; j++)
    for (k=0; k<m; k++)
      y[col[j]][k] += A[j] * x[i][k];
}

Code A: Multiple SpMV computations (SpMM), 7 lines of code
Code B: Manually-optimized SpMM from LOBPCG, 2109 lines of code
  • Convert matrix format: CSR → CSB
  • 11 different block sizes/implementations
  • OpenMP with scheduling pragmas
  • AVX SIMD
  • Indexing simplification
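The symmetry trick in Code A can be made concrete. A small self-contained sketch (spmm_sym, the separate diag array, and the strictly-upper-triangle storage convention are illustrative assumptions, not the LOBPCG source): each stored off-diagonal entry contributes once directly and once transposed, so only half the matrix is stored.

```c
#include <assert.h>

#define SN 3  /* rows */
#define SM 2  /* vectors (the "MM" in SpMM) */

/* Symmetric SpMM: only the strictly-upper triangle is stored in CSR
   (index/col/A); the diagonal is kept in a separate array. Each stored
   entry contributes twice, once transposed, as in Code A above. */
static void spmm_sym(const int *index, const int *col, const double *A,
                     const double *diag, double x[SN][SM],
                     double y[SN][SM]) {
    int i, j, k;
    for (i = 0; i < SN; i++)
        for (k = 0; k < SM; k++)
            y[i][k] = diag[i] * x[i][k];
    for (i = 0; i < SN; i++)
        for (j = index[i]; j < index[i + 1]; j++)
            for (k = 0; k < SM; k++) {
                y[i][k]      += A[j] * x[col[j]][k];
                /* transposed computation exploiting symmetry */
                y[col[j]][k] += A[j] * x[i][k];
            }
}
```

For the matrix [[2,1,0],[1,3,4],[0,4,5]], only the entries (0,1)=1 and (1,2)=4 plus the diagonal need to be stored.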

SLIDE 7

Which version would you prefer to write?

/* local_grad_3 computation from nek5000 */

w[nelt,i,j,k] += Dt[l,k] * U[nelt,n,m,l] * D[j,m] * D[i,n]

/* local_grad3 from nek5000, generated CUDA code */

Code A: 1-line mathematical representation, input to OCTOPI
Code B: Generated CUDA + harness, 122 lines of code

SLIDE 8
  • Performance portability?
    • Particularly across fundamentally different CPU and GPU architectures
  • Programmer productivity?
    • High-performance implementations will require low-level specification that exposes architecture
  • Software maintainability and portability?
    • May require multiple implementations of application

Current solutions

  • Follow MPI and OpenMP standards
    • Same code unlikely to perform well across CPU and GPU
    • Vendor C and Fortran compilers not optimized for HPC workloads
  • Some domain-specific framework strategies
    • Libraries, C++ template expansion, standalone DSLs
    • Not composable with other optimizations

Code B/C is not unusual

SLIDE 9
  • CHiLL: polyhedral compiler transformation and code generation framework with domain-specific specialization (supports C-like C++)
    • Target is loop-based scientific applications
    • Composable transformations
  • Optimization strategy can be specified or derived with transformation recipes
    • Also optimization parameters exposed
    • Separates code from mapping!
  • Autotuning
    • Systematic exploration of alternate transformation recipes and their optimization parameter values
    • Search technology to prune combinatorial space

Automate process of generating Code B from Code A.

Our Approach

for (i=0; i<N; i++)
  for (j=1; j<M; j++)
S0: a[i][j] = b[j] - a[i][j-1];

I = {[i,j] | 0<=i<N ∧ 1<=j<M}

SLIDE 10
  • Immediate: Improve performance of production application
  • Medium term: New research ideas
  • Long term:
    • Change workflow for HPC application development
    • Move facilities into more rapid adoption of new tools
    • Impact compiler and autotuning technology
  • Projects:
    • DOE Exascale Computing Project
    • DOE Scientific Discovery through Advanced Computing
    • NSF Blue Waters PAID project
SLIDE 11

/* a. Original loop nest and iteration space */
for (i=0; i<N; i++)
  for (j=1; j<M; j++)
S0: a[i][j] = b[j] - a[i][j-1];
I = {[i,j] | 0<=i<N ∧ 1<=j<M}

/* b. Statement macro */
#define S0(i,j) a[(i)][(j)] = b[(j)] - a[(i)][(j-1)]

/* c. Transformed loop nest, T = {[i,j] → [j,i]} */
for (j=1; j<M; j++)
  for (i=0; i<N; i++)
S0: a[i][j] = b[j] - a[i][j-1];

/* d. Dependence relation for S0 */
{[i,j] → [i',j'] | 0<=i,i'<N ∧ 1<=j,j'<M ∧ i=i' ∧ j=j'-1}
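The legality of this interchange can be checked directly. A sketch (the helper names s0_orig and s0_perm are illustrative) that runs S0 in both loop orders and compares the results; the only dependence, from (i, j-1) to (i, j), still flows forward after the permutation:

```c
#include <assert.h>
#include <string.h>

#define LN 4
#define LM 5

/* Original order: i outer, j inner. */
static void s0_orig(double a[LN][LM], const double *b) {
    for (int i = 0; i < LN; i++)
        for (int j = 1; j < LM; j++)
            a[i][j] = b[j] - a[i][j-1];
}

/* Interchanged order T = {[i,j] -> [j,i]}: legal because the dependence
   (i=i', j=j'-1) is still satisfied, since a[i][j-1] is written on an
   earlier iteration of the (now outer) j loop. */
static void s0_perm(double a[LN][LM], const double *b) {
    for (int j = 1; j < LM; j++)
        for (int i = 0; i < LN; i++)
            a[i][j] = b[j] - a[i][j-1];
}
```

Both orders produce bitwise-identical arrays, which is exactly what the dependence relation in (d) guarantees.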
SLIDE 12

Inspector Code: Matrix format conversion, non-affine transformation and parallelization
Executor Code: Iterations are optimized and use new representation

  • Inspector/executor methodology
    • Inspector analyzes indirect accesses at runtime and/or reorganizes data representation
    • Executor is the reordered computation
    • Compose with polyhedral transformations
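A concrete miniature of this methodology (the functions inspector/executor and the group-by-row strategy are an illustrative assumption, not CHiLL's generated code): the inspector counting-sorts the indirect update indices by the row they touch at runtime, and the executor replays the same updates in the reordered schedule.

```c
#include <assert.h>

#define EN 4   /* number of target rows */
#define ENZ 6  /* number of indirect updates */

/* Inspector: group update indices by the row they touch (a counting
   sort over col[]), so the executor can process each row's updates
   together; row-parallel execution would then need no atomics. */
static void inspector(const int *col, int *start, int *order) {
    int count[EN + 1] = {0};
    for (int j = 0; j < ENZ; j++) count[col[j] + 1]++;
    for (int r = 0; r < EN; r++) count[r + 1] += count[r];
    for (int r = 0; r <= EN; r++) start[r] = count[r];
    for (int j = 0; j < ENZ; j++) order[count[col[j]]++] = j;
}

/* Executor: same updates as `for j: y[col[j]] += v[j]`, reordered so
   that all updates to a given row are contiguous. */
static void executor(const int *col, const double *v,
                     const int *start, const int *order, double *y) {
    for (int r = 0; r < EN; r++)
        for (int s = start[r]; s < start[r + 1]; s++)
            y[col[order[s]]] += v[order[s]];
}
```

The inspector's cost is paid once at runtime; the executor can then be run many times over the reorganized schedule.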

SLIDE 13

Three Application Domains, One Compiler

Stencils and GMG
  • Memory-bandwidth bound: Communication-avoiding optimizations
  • Compute bound: Eliminate redundant computation (partial sums)
[HIPC'13], [WOSC'13], [WOSC'14], [IPDPS'15], [PARCO'17]

Sparse Linear Algebra
  • Specialize matrix representation: Data transformations
  • Incorporate runtime information: Inspector/executor
  • Support non-affine input/transformations
[CGO'14], [PLDI'15], [IPDPS'16], [LCPC'16], [IA^3'16], [SC'16]

Tensor Contractions
  • Reduce computation: Reassociate
  • Optimize memory access pattern: Modify loop order to best match data layout and memory hierarchy
  • Adjust parallelism
[ICPP'15]
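The flop savings from reassociation can be seen in a toy contraction (contract_direct and contract_reassoc are illustrative helpers, not OCTOPI output): computing w = D·U·D^T all at once costs O(N^4) multiply-adds, while factoring it into two matrix products costs O(N^3).

```c
#include <assert.h>

#define TN 3

/* Direct contraction: w[i][j] = sum over m,n of D[i][n]*U[n][m]*D[j][m].
   Returns the number of multiply-add terms evaluated: O(TN^4). */
static int contract_direct(double D[TN][TN], double U[TN][TN],
                           double w[TN][TN]) {
    int ops = 0;
    for (int i = 0; i < TN; i++)
        for (int j = 0; j < TN; j++) {
            w[i][j] = 0.0;
            for (int n = 0; n < TN; n++)
                for (int m = 0; m < TN; m++) {
                    w[i][j] += D[i][n] * U[n][m] * D[j][m];
                    ops++;
                }
        }
    return ops;
}

/* Reassociated: t = D*U first, then w = t*D^T; O(TN^3) terms total. */
static int contract_reassoc(double D[TN][TN], double U[TN][TN],
                            double w[TN][TN]) {
    double t[TN][TN];
    int ops = 0;
    for (int i = 0; i < TN; i++)
        for (int m = 0; m < TN; m++) {
            t[i][m] = 0.0;
            for (int n = 0; n < TN; n++) { t[i][m] += D[i][n] * U[n][m]; ops++; }
        }
    for (int i = 0; i < TN; i++)
        for (int j = 0; j < TN; j++) {
            w[i][j] = 0.0;
            for (int m = 0; m < TN; m++) { w[i][j] += t[i][m] * D[j][m]; ops++; }
        }
    return ops;
}
```

For TN=3 the direct form evaluates 81 terms and the reassociated form 54; the gap widens rapidly as TN grows.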

SLIDE 14

miniGMG w/CHiLL
  • Fused operations
  • Communication-avoiding wavefront
  • Parallelized (OpenMP)

Autotuning finds the best implementation for each box size:
  • wavefront depth
  • nested OpenMP configuration
  • inter-thread synchronization (barrier vs. point-to-point)

For fine grids (large arrays) CHiLL attains nearly a 4.5x speedup over baseline.
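The parallel-wavefront idea can be sketched in 2D (a hypothetical sweep with the same dependence shape as a Gauss-Seidel neighbor read, not miniGMG code): points on an anti-diagonal d = i+j are mutually independent, so sweeping diagonal-by-diagonal exposes parallelism while reproducing the sequential result.

```c
#include <assert.h>
#include <string.h>

#define WN 5

/* Sequential sweep: each point reads its north and west neighbors,
   both already updated in this sweep. */
static void sweep_seq(double a[WN][WN]) {
    for (int i = 1; i < WN; i++)
        for (int j = 1; j < WN; j++)
            a[i][j] += a[i-1][j] + a[i][j-1];
}

/* Wavefront sweep: both dependences land on diagonal d-1, so the
   points of each diagonal could run in parallel; executing diagonals
   in order preserves the sequential semantics exactly. */
static void sweep_wavefront(double a[WN][WN]) {
    for (int d = 2; d <= 2 * (WN - 1); d++)
        for (int i = 1; i < WN; i++) {
            int j = d - i;
            if (j >= 1 && j < WN)
                a[i][j] += a[i-1][j] + a[i][j-1];
        }
}
```

The inner i loop of each diagonal is the unit that CHiLL's wavefront transformation hands to OpenMP threads.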

[Chart: GSRB Smooth (Edison). Speedup over Baseline Smoother (0.0x-5.0x) vs. Box Size, 64^3 down to 4^3 (== Level in the V-Cycle); series: CHiLL generated, Manually Tuned, Baseline]

Communication Avoiding: Sometimes Code A Beats Code B!

SLIDE 15

[Chart: GSRB Smooth on 64^3 boxes. Time (seconds) vs. 2D thread blocks <TX,TY>; series: CUDA-CHiLL, Handtuned, Handtuned-VL]

CHiLL can obviate the need for architecture-specific programming models like CUDA.

  • CUDA-CHiLL took the sequential GSRB implementation (.c) and generated CUDA that runs on NVIDIA GPUs
  • CUDA-CHiLL autotuned over the thread block sizes and is ultimately 2% faster than the hand-optimized minigmg-cuda (Code C)
  • Adaptable to new GPU generations

Retargetable and Performance Portable: Optimized Code A can beat Code C!

SLIDE 16
  • These can be manually-written (miniGMG, LOBPCG)
  • or automatically generated (tensor contraction)

Example Transformation Recipes

/* jacobi_box_4_64.py, 27-pt stencil, 64^3 box size */
from chill import *
# select which computation to optimize
source('jacobi_box_4_64.c')
procedure('smooth_box_4_64')
loop(0)
original()  # fuse wherever possible
# create a parallel wavefront
skew([0,1,2,3,4,5],2,[2,1])
permute([2,1,3,4])
# partial sum for high order stencils and fuse result
distribute([0,1,2,3,4,5],2)
stencil_temp(0)
stencil_temp(5)
fuse([2,3,4,5,6,7,8,9],1)
fuse([2,3,4,5,6,7,8,9],2)
fuse([2,3,4,5,6,7,8,9],3)
fuse([2,3,4,5,6,7,8,9],4)

/* gsrb.lua, variable coefficient GSRB, 64^3 box size */
init("gsrb_mod.cu", "gsrb", 0, 0)
dofile("cudaize.lua")  -- custom commands in lua
-- set up parallel decomposition, adjust via autotuning
TI=32 TJ=4 TK=64 TZ=64
tile_by_index(0, {"box","k","j","i"}, {TZ,TK,TJ,TI},
  {l1_control="bb", l2_control="kk", l3_control="jj", l4_control="ii"},
  {"bb","box","kk","k","jj","j","ii","i"})
cudaize(0, "kernel_GPU",
  {_temp=N*N*N*N, _beta_i=N*N*N*N, _phi=N*N*N*N},
  {block={"ii","jj","box"}, thread={"i","j"}}, {})
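The tiling the gsrb.lua recipe requests can be shown directly in C (fill_ref, fill_tiled, and the PN/PTI/PTJ sizes are illustrative, not CHiLL output): the controlling ii/jj loops step by the tile size, and the tiled nest visits exactly the same iterations as the original.

```c
#include <assert.h>
#include <string.h>

#define PN  64  /* problem size, divisible by both tile sizes */
#define PTI 32  /* tile sizes mirroring TI/TJ in the recipe */
#define PTJ 4

/* Untiled reference: c[i][j] = i + j over a PN x PN grid. */
static void fill_ref(int c[PN][PN]) {
    for (int i = 0; i < PN; i++)
        for (int j = 0; j < PN; j++)
            c[i][j] = i + j;
}

/* Tiled version: controlling loops ii/jj step by the tile size; this
   is the loop structure a tiling command asks the compiler to generate,
   with the tile loops then available for mapping to blocks/threads. */
static void fill_tiled(int c[PN][PN]) {
    for (int ii = 0; ii < PN; ii += PTI)
        for (int jj = 0; jj < PN; jj += PTJ)
            for (int i = ii; i < ii + PTI; i++)
                for (int j = jj; j < jj + PTJ; j++)
                    c[i][j] = i + j;
}
```

In the CUDA mapping, the controlling loops (ii, jj) become block indices and the intra-tile loops (i, j) become thread indices, which is what the cudaize call above specifies.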

SLIDE 17
  • Data layout
    • A brick is a 4x4x4 mini domain without a ghost zone
    • Application of a stencil reaches into other bricks (affinity important)
    • Implemented with contiguous storage and adjacency lists
  • Optimization advantages
    • Flexible mapping to SIMD, threads
    • Rapid copying; simplifies scheduling and code generation; can improve ghost zone exchange
    • Better memory hierarchy behavior (including TLB on KNL)

Collaboration with Tuowen Zhao (Utah), Protonu Basu, Sam Williams, Hans Johansen (LBNL)
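A 1-D analogue of the brick layout (struct bricks, brick_get, and the sizes are illustrative sketches, not the actual implementation): bricks are stored contiguously with no ghost zones, an adjacency list links each brick to its neighbors, and a stencil read that falls off the end of a brick reaches into the adjacent one.

```c
#include <assert.h>

#define BDIM 4     /* brick is a 4-element mini domain (1-D analogue of 4x4x4) */
#define NBRICKS 3

/* Bricks stored contiguously, no ghost zones; the adjacency list gives
   each brick its left/right neighbor (-1 at the domain edges). */
struct bricks {
    double data[NBRICKS][BDIM];
    int left[NBRICKS], right[NBRICKS];
};

/* Read element i of brick b, reaching into adjacent bricks when the
   index falls outside [0, BDIM). Returns 0.0 past the domain edge. */
static double brick_get(const struct bricks *bk, int b, int i) {
    if (i < 0)
        return bk->left[b] < 0 ? 0.0 : bk->data[bk->left[b]][i + BDIM];
    if (i >= BDIM)
        return bk->right[b] < 0 ? 0.0 : bk->data[bk->right[b]][i - BDIM];
    return bk->data[b][i];
}
```

Because each brick is contiguous, copying a brick (e.g. for a ghost zone exchange) is a single dense memcpy rather than a strided gather.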

SLIDE 18
  • Inspector/executor methodology
    • Inspector analyzes indirect accesses at runtime and/or reorganizes data representation
    • Executor is the reordered computation
    • Compose with polyhedral transformations
  • New project, CHiLL-I/E (w/ Michelle Strout)
    • Develop new abstractions for automating and composing inspector code generation

Inspector Dependence Graph

SLIDE 19
  • Overhead
    • Tuning search can be expensive
    • Off-line tuning expensive, programmer burden
    • Specifying search space, transformations
    • Selection and configuration of algorithms
  • Scope
    • Tuning must be repeated for new execution contexts
    • Exascale resources vary during execution; platform may not be available for training
    • Economies of data scale: Learning based on a community's code
  • Other programmer concerns
    • Correctness concerns with dynamically-changing code
    • Long-term tool availability

Autotuning in High-Performance Computing Applications, Balaprakash, Dongarra, Gamblin, Hall, Hollingsworth, Norris, Vuduc, to appear in Special Issue of Proceedings of the IEEE, 2018.