Mary Hall, October 24, 2017



SLIDE 1

Mary Hall October 24, 2017

SLIDE 2
  • Postdoctoral Researcher Opening: ~Jan. 2018
  • >10 Open Faculty Positions, 1 in Programming Languages

SLIDE 3

Funded in part by Department of Energy Office of Advanced Scientific Computing Research under awards DE-SC0008682 and Scientific Discovery through Advanced Computation (SciDAC) award DE-SC0006947, and by the National Science Foundation award CCF-1018881.

  • Stencils and Geometric Multigrid: Protonu Basu, Sam Williams, Brian Van Straalen, Lenny Oliker, Phil Colella
  • Sparse Matrix Computations: Anand Venkat, Khalid Ahmad, Michelle Strout, Huihui Zhang
  • Tensor Contractions: Thomas Nelson, Axel Rivera (Intel), Prasanna Balaprakash, Paul Hovland, Liz Jessup, Boyana Norris

SLIDE 4

/* Laplacian 7-point Variable-Coefficient Stencil */
for (k=0; k<N; k++)
  for (j=0; j<N; j++)
    for (i=0; i<N; i++)
      temp[k][j][i] = b * h2inv * (
          beta_i[k][j][i+1] * ( phi[k][j][i+1] - phi[k][j][i]   )
        - beta_i[k][j][i]   * ( phi[k][j][i]   - phi[k][j][i-1] )
        + beta_j[k][j+1][i] * ( phi[k][j+1][i] - phi[k][j][i]   )
        - beta_j[k][j][i]   * ( phi[k][j][i]   - phi[k][j-1][i] )
        + beta_k[k+1][j][i] * ( phi[k+1][j][i] - phi[k][j][i]   )
        - beta_k[k][j][i]   * ( phi[k][j][i]   - phi[k-1][j][i] ) );

/* Helmholtz */
for (k=0; k<N; k++)
  for (j=0; j<N; j++)
    for (i=0; i<N; i++)
      temp[k][j][i] = a * alpha[k][j][i] * phi[k][j][i] - temp[k][j][i];

/* Gauss-Seidel Red-Black Update */
for (k=0; k<N; k++)
  for (j=0; j<N; j++)
    for (i=0; i<N; i++) {
      if ((i+j+k+color)%2 == 0)
        phi[k][j][i] = phi[k][j][i] - lambda[k][j][i] * (temp[k][j][i] - rhs[k][j][i]);
    }
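The red-black coloring in the update above can be exercised in isolation. A minimal standalone sketch (the grid size GN and the helper gsrb_sweep are illustrative names, not part of miniGMG); with temp and rhs zeroed the values stay fixed and only the count of colored points is checked:

```c
#include <assert.h>

#define GN 4  /* tiny grid for illustration */

/* One red-black sweep on a GN^3 grid; returns how many points matched
   the given color and were updated. phi starts at 1.0 and temp = rhs = 0,
   so updated values stay 1.0 and only the count changes. */
static int gsrb_sweep(int color) {
    static double phi[GN][GN][GN], temp[GN][GN][GN],
                  rhs[GN][GN][GN], lambda[GN][GN][GN];
    int i, j, k, updated = 0;

    for (k = 0; k < GN; k++)
        for (j = 0; j < GN; j++)
            for (i = 0; i < GN; i++) {
                phi[k][j][i] = 1.0;
                lambda[k][j][i] = 0.5;
                temp[k][j][i] = 0.0;
                rhs[k][j][i] = 0.0;
            }

    for (k = 0; k < GN; k++)
        for (j = 0; j < GN; j++)
            for (i = 0; i < GN; i++)
                if ((i + j + k + color) % 2 == 0) {
                    phi[k][j][i] -= lambda[k][j][i]
                                  * (temp[k][j][i] - rhs[k][j][i]);
                    updated++;
                }
    return updated;
}
```

Each color touches exactly half of the grid points, which is what makes the two half-sweeps independent and parallelizable.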

Code A: miniGMG baseline smooth operator, approximately 13 lines of code
Code B: miniGMG optimized smooth operator, approximately 170 lines of code

Which version would you prefer to write?

  • Memory hierarchy: Prefetch; data staged in registers/buffers; AVX SIMD intrinsics
  • Parallelism: Ghost zones (tradeoff computation for communication); spin locks in OpenMP; nested OpenMP and MPI; parallel wavefronts (reduce sweeps over 3D grid)

SLIDE 5

And now GPU code?

Code C: miniGMG optimized smooth operator for GPU, 308 lines of code for the kernel alone

SLIDE 6

Which version would you prefer to write?

/* SpMM from LOBPCG on symmetric matrix */
for (i=0; i<n; i++) {
  for (j=index[i]; j<index[i+1]; j++)
    for (k=0; k<m; k++)
      y[i][k] += A[j] * x[col[j]][k];
  /* transposed computation exploiting symmetry */
  for (j=index[i]; j<index[i+1]; j++)
    for (k=0; k<m; k++)
      y[col[j]][k] += A[j] * x[i][k];
}

Code A: Multiple SpMV computations (SpMM), 7 lines of code
Code B: Manually-optimized SpMM from LOBPCG, 2109 lines of code
  • Convert matrix format: CSR → CSB
  • 11 different block sizes/implementations
  • OpenMP with scheduling pragmas
  • AVX SIMD
  • Indexing simplification
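The symmetry trick in Code A can be made concrete. A small self-contained sketch (spmm_sym, the separate diag array, and the strictly-upper-triangle storage convention are illustrative assumptions, not the LOBPCG source): each stored off-diagonal entry contributes once directly and once transposed, so only half the matrix is stored.

```c
#include <assert.h>

#define SN 3  /* rows */
#define SM 2  /* vectors (the "MM" in SpMM) */

/* Symmetric SpMM: only the strictly-upper triangle is stored in CSR
   (index/col/A); the diagonal is kept in a separate array. Each stored
   entry contributes twice, once transposed, as in Code A above. */
static void spmm_sym(const int *index, const int *col, const double *A,
                     const double *diag, double x[SN][SM],
                     double y[SN][SM]) {
    int i, j, k;
    for (i = 0; i < SN; i++)
        for (k = 0; k < SM; k++)
            y[i][k] = diag[i] * x[i][k];
    for (i = 0; i < SN; i++)
        for (j = index[i]; j < index[i + 1]; j++)
            for (k = 0; k < SM; k++) {
                y[i][k]      += A[j] * x[col[j]][k];
                /* transposed computation exploiting symmetry */
                y[col[j]][k] += A[j] * x[i][k];
            }
}
```

For the matrix [[2,1,0],[1,3,4],[0,4,5]], only the entries (0,1)=1 and (1,2)=4 plus the diagonal need to be stored.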

SLIDE 7

Which version would you prefer to write?

/* local_grad_3 computation from nek5000 */

w[nelt,i,j,k] += Dt[l,k] * U[nelt,n,m,l] * D[j,m] * D[i,n]

/* local_grad3 from nek5000, generated CUDA code */

Code A: 1-line mathematical representation, input to OCTOPI
Code B: Generated CUDA + harness, 122 lines of code

SLIDE 8
  • Performance portability?
    • Particularly across fundamentally different CPU and GPU architectures
  • Programmer productivity?
    • High-performance implementations will require low-level specification that exposes architecture
  • Software maintainability and portability?
    • May require multiple implementations of application

Current solutions

  • Follow MPI and OpenMP standards
    • Same code unlikely to perform well across CPU and GPU
    • Vendor C and Fortran compilers not optimized for HPC workloads
  • Some domain-specific framework strategies
    • Libraries, C++ template expansion, standalone DSLs
    • Not composable with other optimizations

Code B/C is not unusual

SLIDE 9
  • CHiLL: polyhedral compiler transformation and code generation framework with domain-specific specialization (supports C-like C++)
    • Target is loop-based scientific applications
    • Composable transformations
  • Optimization strategy can be specified or derived with transformation recipes
    • Also optimization parameters exposed
    • Separates code from mapping!
  • Autotuning
    • Systematic exploration of alternate transformation recipes and their optimization parameter values
    • Search technology to prune combinatorial space

Automate process of generating Code B from Code A.

Our Approach

for (i=0; i<N; i++)
  for (j=1; j<M; j++)
S0: a[i][j] = b[j] - a[i][j-1];

I = {[i,j] | 0<=i<N ∧ 1<=j<M}

SLIDE 10
  • Immediate: Improve performance of production application
  • Medium term: New research ideas
  • Long term:
    • Change workflow for HPC application development
    • Move facilities into more rapid adoption of new tools
    • Impact compiler and autotuning technology
  • Projects:
    • DOE Exascale Computing Project
    • DOE Scientific Discovery through Advanced Computing
    • NSF Blue Waters PAID project
SLIDE 11

/* a. Original loop nest and iteration space */
for (i=0; i<N; i++)
  for (j=1; j<M; j++)
S0: a[i][j] = b[j] - a[i][j-1];
I = {[i,j] | 0<=i<N ∧ 1<=j<M}

/* b. Statement macro */
#define S0(i,j) a[(i)][(j)] = b[(j)] - a[(i)][(j-1)]

/* c. Transformed loop nest, T = {[i,j] → [j,i]} */
for (j=1; j<M; j++)
  for (i=0; i<N; i++)
S0: a[i][j] = b[j] - a[i][j-1];

/* d. Dependence relation for S0 */
{[i,j] → [i',j'] | 0<=i,i'<N ∧ 1<=j,j'<M ∧ i=i' ∧ j=j'-1}
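The legality of this interchange can be checked directly. A sketch (the helper names s0_orig and s0_perm are illustrative) that runs S0 in both loop orders and compares the results; the only dependence, from (i, j-1) to (i, j), still flows forward after the permutation:

```c
#include <assert.h>
#include <string.h>

#define LN 4
#define LM 5

/* Original order: i outer, j inner. */
static void s0_orig(double a[LN][LM], const double *b) {
    for (int i = 0; i < LN; i++)
        for (int j = 1; j < LM; j++)
            a[i][j] = b[j] - a[i][j-1];
}

/* Interchanged order T = {[i,j] -> [j,i]}: legal because the dependence
   (i=i', j=j'-1) is still satisfied, since a[i][j-1] is written on an
   earlier iteration of the (now outer) j loop. */
static void s0_perm(double a[LN][LM], const double *b) {
    for (int j = 1; j < LM; j++)
        for (int i = 0; i < LN; i++)
            a[i][j] = b[j] - a[i][j-1];
}
```

Both orders produce bitwise-identical arrays, which is exactly what the dependence relation in (d) guarantees.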
SLIDE 12

Inspector Code: Matrix format conversion, non-affine transformation and parallelization
Executor Code: Iterations are optimized and use new representation

  • Inspector/executor methodology
    • Inspector analyzes indirect accesses at runtime and/or reorganizes data representation
    • Executor is the reordered computation
    • Compose with polyhedral transformations
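A concrete miniature of this methodology (the functions inspector/executor and the group-by-row strategy are an illustrative assumption, not CHiLL's generated code): the inspector counting-sorts the indirect update indices by the row they touch at runtime, and the executor replays the same updates in the reordered schedule.

```c
#include <assert.h>

#define EN 4   /* number of target rows */
#define ENZ 6  /* number of indirect updates */

/* Inspector: group update indices by the row they touch (a counting
   sort over col[]), so the executor can process each row's updates
   together; row-parallel execution would then need no atomics. */
static void inspector(const int *col, int *start, int *order) {
    int count[EN + 1] = {0};
    for (int j = 0; j < ENZ; j++) count[col[j] + 1]++;
    for (int r = 0; r < EN; r++) count[r + 1] += count[r];
    for (int r = 0; r <= EN; r++) start[r] = count[r];
    for (int j = 0; j < ENZ; j++) order[count[col[j]]++] = j;
}

/* Executor: same updates as `for j: y[col[j]] += v[j]`, reordered so
   that all updates to a given row are contiguous. */
static void executor(const int *col, const double *v,
                     const int *start, const int *order, double *y) {
    for (int r = 0; r < EN; r++)
        for (int s = start[r]; s < start[r + 1]; s++)
            y[col[order[s]]] += v[order[s]];
}
```

The inspector's cost is paid once at runtime; the executor can then be run many times over the reorganized schedule.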

SLIDE 13

Three Application Domains, One Compiler

Stencils and GMG
  • Memory-bandwidth bound: Communication-avoiding optimizations
  • Compute bound: Eliminate redundant computation (partial sums)
[HIPC'13], [WOSC'13], [WOSC'14], [IPDPS'15], [PARCO'17]

Sparse Linear Algebra
  • Specialize matrix representation: Data transformations
  • Incorporate runtime information: Inspector/executor
  • Support non-affine input/transformations
[CGO'14], [PLDI'15], [IPDPS'16], [LCPC'16], [IA^3'16], [SC'16]

Tensor Contractions
  • Reduce computation: Reassociate
  • Optimize memory access pattern: Modify loop order to best match data layout and memory hierarchy
  • Adjust parallelism
[ICPP'15]
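The flop savings from reassociation can be seen in a toy contraction (contract_direct and contract_reassoc are illustrative helpers, not OCTOPI output): computing w = D·U·D^T all at once costs O(N^4) multiply-adds, while factoring it into two matrix products costs O(N^3).

```c
#include <assert.h>

#define TN 3

/* Direct contraction: w[i][j] = sum over m,n of D[i][n]*U[n][m]*D[j][m].
   Returns the number of multiply-add terms evaluated: O(TN^4). */
static int contract_direct(double D[TN][TN], double U[TN][TN],
                           double w[TN][TN]) {
    int ops = 0;
    for (int i = 0; i < TN; i++)
        for (int j = 0; j < TN; j++) {
            w[i][j] = 0.0;
            for (int n = 0; n < TN; n++)
                for (int m = 0; m < TN; m++) {
                    w[i][j] += D[i][n] * U[n][m] * D[j][m];
                    ops++;
                }
        }
    return ops;
}

/* Reassociated: t = D*U first, then w = t*D^T; O(TN^3) terms total. */
static int contract_reassoc(double D[TN][TN], double U[TN][TN],
                            double w[TN][TN]) {
    double t[TN][TN];
    int ops = 0;
    for (int i = 0; i < TN; i++)
        for (int m = 0; m < TN; m++) {
            t[i][m] = 0.0;
            for (int n = 0; n < TN; n++) { t[i][m] += D[i][n] * U[n][m]; ops++; }
        }
    for (int i = 0; i < TN; i++)
        for (int j = 0; j < TN; j++) {
            w[i][j] = 0.0;
            for (int m = 0; m < TN; m++) { w[i][j] += t[i][m] * D[j][m]; ops++; }
        }
    return ops;
}
```

For TN=3 the direct form evaluates 81 terms and the reassociated form 54; the gap widens rapidly as TN grows.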

SLIDE 14

miniGMG w/CHiLL
  • Fused operations
  • Communication-avoiding wavefront
  • Parallelized (OpenMP)

Autotuning finds the best implementation for each box size:
  • wavefront depth
  • nested OpenMP configuration
  • inter-thread synchronization (barrier vs. point-to-point)

For fine grids (large arrays) CHiLL attains nearly a 4.5x speedup over baseline.
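The parallel-wavefront idea can be sketched in 2D (a hypothetical sweep with the same dependence shape as a Gauss-Seidel neighbor read, not miniGMG code): points on an anti-diagonal d = i+j are mutually independent, so sweeping diagonal-by-diagonal exposes parallelism while reproducing the sequential result.

```c
#include <assert.h>
#include <string.h>

#define WN 5

/* Sequential sweep: each point reads its north and west neighbors,
   both already updated in this sweep. */
static void sweep_seq(double a[WN][WN]) {
    for (int i = 1; i < WN; i++)
        for (int j = 1; j < WN; j++)
            a[i][j] += a[i-1][j] + a[i][j-1];
}

/* Wavefront sweep: both dependences land on diagonal d-1, so the
   points of each diagonal could run in parallel; executing diagonals
   in order preserves the sequential semantics exactly. */
static void sweep_wavefront(double a[WN][WN]) {
    for (int d = 2; d <= 2 * (WN - 1); d++)
        for (int i = 1; i < WN; i++) {
            int j = d - i;
            if (j >= 1 && j < WN)
                a[i][j] += a[i-1][j] + a[i][j-1];
        }
}
```

The inner i loop of each diagonal is the unit that CHiLL's wavefront transformation hands to OpenMP threads.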

[Chart: GSRB Smooth (Edison). Speedup over Baseline Smoother (0.0x-5.0x) vs. Box Size, 64^3 down to 4^3 (== Level in the V-Cycle); series: CHiLL generated, Manually Tuned, Baseline]

Communication Avoiding: Sometimes Code A Beats Code B!

SLIDE 15

[Chart: GSRB Smooth on 64^3 boxes. Time (seconds) vs. 2D thread blocks <TX,TY>; series: CUDA-CHiLL, Handtuned, Handtuned-VL]

CHiLL can obviate the need for architecture-specific programming models like CUDA.

  • CUDA-CHiLL took the sequential GSRB implementation (.c) and generated CUDA that runs on NVIDIA GPUs
  • CUDA-CHiLL autotuned over the thread block sizes and is ultimately 2% faster than the hand-optimized minigmg-cuda (Code C)
  • Adaptable to new GPU generations

Retargetable and Performance Portable: Optimized Code A can beat Code C!

SLIDE 16
  • These can be manually-written (miniGMG, LOBPCG)
  • or automatically generated (tensor contraction)

Example Transformation Recipes

/* jacobi_box_4_64.py, 27-pt stencil, 64^3 box size */
from chill import *
# select which computation to optimize
source('jacobi_box_4_64.c')
procedure('smooth_box_4_64')
loop(0)
original()  # fuse wherever possible
# create a parallel wavefront
skew([0,1,2,3,4,5],2,[2,1])
permute([2,1,3,4])
# partial sum for high order stencils and fuse result
distribute([0,1,2,3,4,5],2)
stencil_temp(0)
stencil_temp(5)
fuse([2,3,4,5,6,7,8,9],1)
fuse([2,3,4,5,6,7,8,9],2)
fuse([2,3,4,5,6,7,8,9],3)
fuse([2,3,4,5,6,7,8,9],4)

/* gsrb.lua, variable coefficient GSRB, 64^3 box size */
init("gsrb_mod.cu", "gsrb", 0, 0)
dofile("cudaize.lua")  -- custom commands in lua
-- set up parallel decomposition, adjust via autotuning
TI=32 TJ=4 TK=64 TZ=64
tile_by_index(0, {"box","k","j","i"}, {TZ,TK,TJ,TI},
  {l1_control="bb", l2_control="kk", l3_control="jj", l4_control="ii"},
  {"bb","box","kk","k","jj","j","ii","i"})
cudaize(0, "kernel_GPU",
  {_temp=N*N*N*N, _beta_i=N*N*N*N, _phi=N*N*N*N},
  {block={"ii","jj","box"}, thread={"i","j"}}, {})
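The tiling the gsrb.lua recipe requests can be shown directly in C (fill_ref, fill_tiled, and the PN/PTI/PTJ sizes are illustrative, not CHiLL output): the controlling ii/jj loops step by the tile size, and the tiled nest visits exactly the same iterations as the original.

```c
#include <assert.h>
#include <string.h>

#define PN  64  /* problem size, divisible by both tile sizes */
#define PTI 32  /* tile sizes mirroring TI/TJ in the recipe */
#define PTJ 4

/* Untiled reference: c[i][j] = i + j over a PN x PN grid. */
static void fill_ref(int c[PN][PN]) {
    for (int i = 0; i < PN; i++)
        for (int j = 0; j < PN; j++)
            c[i][j] = i + j;
}

/* Tiled version: controlling loops ii/jj step by the tile size; this
   is the loop structure a tiling command asks the compiler to generate,
   with the tile loops then available for mapping to blocks/threads. */
static void fill_tiled(int c[PN][PN]) {
    for (int ii = 0; ii < PN; ii += PTI)
        for (int jj = 0; jj < PN; jj += PTJ)
            for (int i = ii; i < ii + PTI; i++)
                for (int j = jj; j < jj + PTJ; j++)
                    c[i][j] = i + j;
}
```

In the CUDA mapping, the controlling loops (ii, jj) become block indices and the intra-tile loops (i, j) become thread indices, which is what the cudaize call above specifies.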

SLIDE 17
  • Data layout
    • A brick is a 4x4x4 mini domain without a ghost zone
    • Application of a stencil reaches into other bricks (affinity important)
    • Implemented with contiguous storage and adjacency lists
  • Optimization advantages
    • Flexible mapping to SIMD, threads
    • Rapid copying; simplifies scheduling and code generation; can improve ghost zone exchange
    • Better memory hierarchy behavior (including TLB on KNL)

Collaboration with Tuowen Zhao (Utah), Protonu Basu, Sam Williams, Hans Johansen (LBNL)
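A 1-D analogue of the brick layout (struct bricks, brick_get, and the sizes are illustrative sketches, not the actual implementation): bricks are stored contiguously with no ghost zones, an adjacency list links each brick to its neighbors, and a stencil read that falls off the end of a brick reaches into the adjacent one.

```c
#include <assert.h>

#define BDIM 4     /* brick is a 4-element mini domain (1-D analogue of 4x4x4) */
#define NBRICKS 3

/* Bricks stored contiguously, no ghost zones; the adjacency list gives
   each brick its left/right neighbor (-1 at the domain edges). */
struct bricks {
    double data[NBRICKS][BDIM];
    int left[NBRICKS], right[NBRICKS];
};

/* Read element i of brick b, reaching into adjacent bricks when the
   index falls outside [0, BDIM). Returns 0.0 past the domain edge. */
static double brick_get(const struct bricks *bk, int b, int i) {
    if (i < 0)
        return bk->left[b] < 0 ? 0.0 : bk->data[bk->left[b]][i + BDIM];
    if (i >= BDIM)
        return bk->right[b] < 0 ? 0.0 : bk->data[bk->right[b]][i - BDIM];
    return bk->data[b][i];
}
```

Because each brick is contiguous, copying a brick (e.g. for a ghost zone exchange) is a single dense memcpy rather than a strided gather.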

SLIDE 18
  • Inspector/executor methodology
    • Inspector analyzes indirect accesses at runtime and/or reorganizes data representation
    • Executor is the reordered computation
    • Compose with polyhedral transformations
  • New project, CHiLL-I/E (w/ Michelle Strout)
    • Develop new abstractions for automating and composing inspector code generation

Inspector Dependence Graph

SLIDE 19
  • Overhead
    • Tuning search can be expensive
    • Off-line tuning expensive, programmer burden
    • Specifying search space, transformations
    • Selection and configuration of algorithms
  • Scope
    • Tuning must be repeated for new execution contexts
    • Exascale resources vary during execution; platform may not be available for training
    • Economies of data scale: Learning based on a community's code
  • Other programmer concerns
    • Correctness concerns with dynamically-changing code
    • Long-term tool availability

Autotuning in High-Performance Computing Applications, Balaprakash, Dongarra, Gamblin, Hall, Hollingsworth, Norris, Vuduc, to appear in Special Issue of Proceedings of the IEEE, 2018.