Orthogonal Abstractions for Scheduling and Storage Mappings
Dagstuhl, October 26, 2017
Michelle Mills Strout (University of Arizona)
Collaborators and Funding
- Ian Bertolacci (Ph.D. student at U Arizona)
- Catherine Olschanowsky (faculty at Boise State University)
- Eddie Davis (Ph.D. student at Boise State University)
- Mary Hall (University of Utah)
- Anand Venkat (Intel, PhD in 2016)
- Brad Chamberlain and Ben Harshbarger (Cray)
- Stephen Guzik (Colorado State University)
- Xinfeng Gao (Colorado State University)
- Christopher Wilcox (Ph.D., CSU instructor)
- Andrew Stone (Ph.D., now at Mathworks)
- Christopher Krieger (Ph.D., now at University of Maryland)

The presented research was supported by a Department of Energy Early Career Grant DE-SC0003956, a National Science Foundation CAREER grant CCF-0746693, and a National Science Foundation grant CCF-1422725.
University of Arizona is Hiring
- Tenure track faculty
- Teaching faculty
- Graduate students
(send email to mstrout@cs.arizona.edu)
Example: Need to Schedule and Do Storage Mapping Across Operators
- Modularized per operator
- To reduce synchronization costs, need to aggregate
- To improve data locality, need to aggregate based on data reuse
- Approaches to do this need to schedule across function calls and loops (see the sketch below)
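To make the need concrete, here is a minimal hypothetical 1D sketch (the names flux_op/update_op and arrays u/flux/u_new are illustrative, not from a specific solver) of the modularized-per-operator structure that scheduling and storage mapping must cross:

    #include <vector>

    // Hypothetical flux operator: one full sweep over the mesh faces.
    // u, flux, and u_new are assumed to have at least N+1 elements.
    void flux_op(const std::vector<double>& u, std::vector<double>& flux, int N) {
      for (int i = 1; i <= N; ++i)
        flux[i] = 0.5 * (u[i - 1] + u[i]);
    }

    // Hypothetical update operator: a second full sweep that consumes the fluxes.
    void update_op(const std::vector<double>& u, const std::vector<double>& flux,
                   std::vector<double>& u_new, int N) {
      for (int i = 1; i <= N - 1; ++i)
        u_new[i] = u[i] - (flux[i + 1] - flux[i]);
    }

    // Called back to back, the two sweeps synchronize at the loop boundary and
    // stream the flux array through memory twice; fusing them or remapping flux's
    // storage requires scheduling across the function calls and loops.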
Schedule across loops for Asynchronous Parallelism
[Figure: task graph and full sparse tiled iteration space]
Break computation that sweeps over the mesh/sparse matrix into chunks (sparse tiles).
Scheduling Across Loops
Shift, Fuse, and Overlap Tile
- Tiles now include extra flux computations
- All tiles can be executed in parallel
- Pros and cons
  - Performs the best
  - Fuse provides good temporal locality (see the shift-and-fuse sketch below)
  - Tiling provides good spatial locality
  - Overlap enables all tiles to start in parallel
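Continuing the hypothetical flux/update sketch from the earlier slide, a minimal shift-and-fuse version (overlapped tiling, which recomputes boundary fluxes so tiles can start independently, is omitted for brevity):

    #include <vector>

    // Shift-and-fuse of the two sweeps: at iteration i we produce flux[i] and
    // already have flux[i-1], so cell i-1 can be updated in the same iteration.
    void fused_flux_update(const std::vector<double>& u, std::vector<double>& flux,
                           std::vector<double>& u_new, int N) {
      for (int i = 1; i <= N; ++i) {
        flux[i] = 0.5 * (u[i - 1] + u[i]);                     // flux operator
        if (i >= 2)
          u_new[i - 1] = u[i - 1] - (flux[i] - flux[i - 1]);   // update operator, shifted by 1
      }
    }
    // Each flux value is consumed while it is still in cache (temporal locality),
    // and only one loop boundary remains to synchronize.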
Results When Done By Hand WITH Storage Mapping (SC14)
(SC 2014) A Study on Balancing Parallelism, Data Locality, and Recomputation in Existing PDE Solvers, Olschanowsky et al.
Goal: programmable scheduling for data locality and parallelism
The Compiler Needs Help
- A general purpose compiler does well with some optimizations, but …
  - program analysis to determine code structure is hard
  - automatic parallelization and optimizations for improving memory hierarchy use are hard
- We look at ways to
  - Enable the programmer to specify the WHAT and HOW separately (programmer wants control)
  - Still provide the compiler with enough information so that it can provide performance portability
Domain-Specific Library:
OP2 – Unstructured Mesh DSL
- Exposes data access information
- Global (compiler flag) and local data layout specification (SoA vs. AoS sketched below)

    op_par_loop(adt_calc, "adt_calc", cells,
        op_arg_dat(p_x, 0, pcell, 2, "double:soa", OP_READ),
        op_arg_dat(p_x, 1, pcell, 2, "double:soa", OP_READ),
        op_arg_dat(p_x, 2, pcell, 2, "double:soa", OP_READ),
        op_arg_dat(p_x, 3, pcell, 2, "double:soa", OP_READ),
        op_arg_dat(p_q, -1, OP_ID, PDIM*2, "double:soa", OP_READ),
        op_arg_dat(p_adt, -1, OP_ID, 1, "double", OP_WRITE));
[Figure callouts: the arguments above identify the data arrays, the access functions (mesh mappings), and the data layout.]
Reguly et al. “Acceleration of a Full-Scale Industrial CFD Application with OP2”, TPDS 2016.
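The "double:soa" strings above select a struct-of-arrays layout per dataset. As a generic C++ layout sketch (not OP2 code), the same logical data can be laid out either way, and the better choice depends on the access pattern:

    #include <cstddef>
    #include <vector>

    // Array-of-structs (AoS): the fields of one element are adjacent in memory;
    // good when a loop touches every field of an element together.
    struct PointAoS { double x, y; };
    using PointsAoS = std::vector<PointAoS>;

    // Struct-of-arrays (SoA): all x values are contiguous, all y values are
    // contiguous; good for vectorization and loops that touch only one field.
    struct PointsSoA {
      std::vector<double> x, y;
      explicit PointsSoA(std::size_t n) : x(n), y(n) {}
    };

    // Usage: PointsAoS aos(1000); PointsSoA soa(1000);
    // aos[i].x and soa.x[i] name the same logical value under two storage mappings.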
Task-Like Library:
CnC – Concurrent Collections
- Computation is specified as steps (tasks)
- Steps get and put tagged data
- Ability to specify run-time scheduling and garbage collection policies orthogonally (a schematic sketch follows below)
Schlimbach, F. and Brodman, J.C. and Knobe, K. “Concurrent Collections on Distributed Memory Theory Put into Practice”, PDP 2013.
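A minimal C++ schematic of the step/item/tag idea (an illustration, not the Intel CnC API): steps only get and put tagged items, so a runtime can pick scheduling and garbage-collection policies independently of the program:

    #include <iostream>
    #include <map>

    // Tagged item collection: data items are addressed by tag, not by location.
    using Tag = int;
    using ItemCollection = std::map<Tag, double>;

    // A "step": gets the item with tag t from `in` and puts a derived item with
    // the same tag into `out`. Steps have no other side effects.
    void scale_step(Tag t, const ItemCollection& in, ItemCollection& out) {
      double v = in.at(t);   // get
      out[t] = 2.0 * v;      // put
    }

    int main() {
      ItemCollection in{{0, 1.5}, {1, 2.5}}, out;
      // A trivial "runtime" that prescribes one step instance per tag; because
      // steps only get/put by tag, a real runtime is free to reorder, parallelize,
      // or garbage-collect items under whatever policy it chooses.
      for (Tag t : {0, 1}) scale_step(t, in, out);
      std::cout << out[0] << " " << out[1] << "\n";   // prints: 3 5
    }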
Data Parallel Library:
Kokkos
- Can orthogonally specify multi-dimensional array layout (see the sketch below)
- Row-major
- Column-major
- Morton order
- Tiled
- Per dimension padding
- ...
    View<double**, Tile<8,8>, Space> m("matrix", N, N);
H. Carter Edwards and Christian Trott. "Kokkos, Manycore Device Performance Portability for C++ HPC Applications", GPU Tech Conference, 2015.
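A minimal Kokkos sketch of layout as an orthogonal template parameter, using the standard LayoutRight/LayoutLeft policies rather than the tiled layout above (array names and size are illustrative):

    #include <Kokkos_Core.hpp>

    int main(int argc, char* argv[]) {
      Kokkos::initialize(argc, argv);
      {
        const int N = 512;
        // The same logical 2-D array with two different storage mappings, chosen
        // orthogonally to the computation via the layout template parameter.
        Kokkos::View<double**, Kokkos::LayoutRight> a("row_major", N, N);  // row-major
        Kokkos::View<double**, Kokkos::LayoutLeft>  b("col_major", N, N);  // column-major
        Kokkos::parallel_for("init", N, KOKKOS_LAMBDA(const int i) {
          for (int j = 0; j < N; ++j) { a(i, j) = 1.0; b(i, j) = 1.0; }
        });
      }
      Kokkos::finalize();
      return 0;
    }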
Pragma-Based Approach:
Loop Chain Abstraction (HIPS13, WACCPD16)
    #pragma omplc loopchain schedule(fuse, tile((10,10),wavefront,serial))
    {
      // 2-Dimensional loop nest, from 1 to N (inclusive)
      #pragma omplc domain(1:N,1:N) ...
      for( int i = 1; i <= N; i += 1 )
        for( int j = 1; j <= N; j += 1 )
          /* Stuff */

      // 2-Dimensional loop nest, from 1 to N (inclusive)
      #pragma omplc domain(1:N,1:N) ...
      for( int i = 1; i <= N; i += 1 )
        for( int j = 1; j <= N; j += 1 )
          /* Things */
    }

Schedule callouts: fuse all loops; tile the resulting loop with tile size (10,10); wavefront over tiles; serial within tiles.
(WACCPD 2016) Identifying and Scheduling Loop Chains Using Directives, Bertolacci et al.
Loop Chain Schedule Example (1/3)
[Figure: the original loop chain (two loop nests) and the single loop nest after loop fusion]
schedule( fuse )
Loop Chain Schedule Example (2/3)
[Figure: the original loop chain (two loop nests), the fused loop nest after loop fusion, and the tiled loop nest after tiling]
schedule( fuse, tile( (2,2), parallel, serial ) )
Loop Chain Schedule Example (3/3)
schedule( tile( (2,2), serial, serial ), fuse )
[Figure: the original loop chain, each loop nest tiled separately after tiling, and the combined tiles after fusing]
Compiler Scripting Approach:
Inspector-Executor Transformations Composed by Compiler (CGO14, PLDI15, SC16)

[Figure: at compile time, the CHiLL/CUDA-CHiLL compiler and scripting/autotuning frameworks that use uninterpreted functions compose programmer-defined inspectors (specified as explicit functions over index arrays through an inspector/executor API) into a composed inspector; at runtime, the inspectors (e.g., Inspector 1: index set splitting, Inspector 2: compact-and-pad, ..., Inspector K) run before the executor, the transformed irregular computation.]

CHiLL-I/E Example: Sparse Triangular Solve

Irregular computation:

    for (i=0; i<N; i++) {
      for (j=idx[i]; j<idx[i+1]; j++) {
        x[i] = x[i] - A[j]*x[col[j]];
      }
    }

Dependences and scheduling using uninterpreted functions:

    Deps  = {[i]->[i']: i<i' and i=col(j) and idx(i)<=j<idx(i+1)}
    Sched = {[i,j]->[l,i,j]: i in level_set(l)}   (l sequential, i parallel, j sequential)

Inspector X (a programmer-defined explicit function; scripting: level_set() = part-par(<i loop>)):

    CHiLLIE::func part-par(EF col, EF idx){
      CHiLLIE::func level_set;
      // BFS traversal of Deps doing gets: ... idx(i) ... col(j) ...
      // Place appropriate i's in each set: ... level_set(l).insert(i) ...
      return level_set;
    }

Executor (runtime):

    for (l=0; l<M; l++){
      #pragma omp parallel for
      for (i in level_set(l)){
        for (j=idx[i]; j<idx[i+1]; j++){
          x[i] = x[i] - A[j]*x[col[j]];
        }
      }
    }
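To make the level-set schedule concrete, here is a self-contained C++ sketch of an inspector that computes level sets from the CSR arrays and an executor that runs each level's rows in parallel. It illustrates the technique and is not code generated by CHiLL; it assumes the strictly lower-triangular update x[i] -= A[j]*x[col[j]] shown on the slide.

    #include <algorithm>
    #include <vector>

    // Inspector: for a lower-triangular CSR matrix (strictly lower part in idx/col),
    // level[i] = 1 + max level of the rows that row i depends on.
    std::vector<std::vector<int>> level_set_inspector(int N, const std::vector<int>& idx,
                                                      const std::vector<int>& col) {
      std::vector<int> level(N, 0);
      int max_level = 0;
      for (int i = 0; i < N; ++i) {
        for (int j = idx[i]; j < idx[i + 1]; ++j)
          level[i] = std::max(level[i], level[col[j]] + 1);   // col[j] < i
        max_level = std::max(max_level, level[i]);
      }
      std::vector<std::vector<int>> level_set(max_level + 1);
      for (int i = 0; i < N; ++i) level_set[level[i]].push_back(i);
      return level_set;
    }

    // Executor: levels run in order; rows within one level have no dependences,
    // so the loop over rows in a level can run in parallel.
    void level_set_executor(const std::vector<std::vector<int>>& level_set,
                            const std::vector<int>& idx, const std::vector<int>& col,
                            const std::vector<double>& A, std::vector<double>& x) {
      for (const auto& rows : level_set) {
        #pragma omp parallel for
        for (int k = 0; k < static_cast<int>(rows.size()); ++k) {
          const int i = rows[k];
          for (int j = idx[i]; j < idx[i + 1]; ++j)
            x[i] -= A[j] * x[col[j]];
        }
      }
    }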
Moving Forward: Making schedules available in libraries (ICS 2015)
- Diamond slab tiling written in C
- Diamond slab tiling made available as a Chapel iterator
We want to transform our original schedule:

    for t in timeRange do
      forall (x,y) in spaceDomain do
        computation( t, x, y );

into a faster schedule:

    forall (t,x,y) in diamondTileIterator(…) do
      computation( t, x, y );
Diamond slab tiling written in C:

    int Li=0, Ui=N, Lj=0, Uj=N;
    for (int ts=0; ts<T; ts+=subset_s)
      // loops over bottom left, middle, top right
      for (int c0 = -2; c0<=0; c0+=1)
        // loops horizontally?
        for (int c1 = 0; c1 <= (Uj+tau-3)/(tau-3); c1+=1)
          // loops vertically?, but without skew
          for (int x = (-Ui-tau+2)/(tau-3); x<=0; x += 1) {
            int c2 = x-c1;  // skew
            // loops for time steps within a slab (slices within slabs)
            for (int c3 = 1; c3<=subset_s; c3 += 1)
              for (int c4 = max(max(max(-tau*c1 - tau*c2 + 2*c3 - (2*tau-2),
                                        -Uj - tau*c2 + c3 - (tau-2)),
                                    tau*c0 - tau*c1 - tau*c2 - c3), Li);
                   c4 <= min(min(min(tau*c0 - tau*c1 - tau*c2 - c3 + (tau-1),
                                     -tau*c1 - tau*c2 + 2*c3),
                                 -Lj - tau*c2 + c3), Ui - 1);
                   c4 += 1)
                for (int c5 = max(max(tau*c1 - c3, Lj), -tau*c2 + c3 - c4 - (tau-1));
                     c5 <= min(min(Uj - 1, -tau*c2 + c3 - c4), tau*c1 - c3 + (tau-1));
                     c5 += 1)
                  computation(c3, c4, c5);
          }
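The Chapel iterator hides that complexity behind the forall above. As a rough C++ analogue (a hypothetical helper tiled_forall, far simpler than diamond slab tiling), a library can own the loop order while the caller supplies only the per-point computation:

    #include <algorithm>

    // Hypothetical library "iterator": owns the (here, simple 2-D tiled) loop order;
    // the caller supplies only the per-point computation, as with the Chapel forall.
    template <typename Body>
    void tiled_forall(int N, int tile, Body computation) {
      for (int ii = 1; ii <= N; ii += tile)                      // visit tiles
        for (int jj = 1; jj <= N; jj += tile)
          for (int i = ii; i <= std::min(ii + tile - 1, N); ++i) // points in a tile
            for (int j = jj; j <= std::min(jj + tile - 1, N); ++j)
              computation(i, j);
    }

    // Usage: swapping in a smarter schedule (e.g., diamond slab tiling) changes
    // only the library, not the call site:
    //   tiled_forall(N, 8, [&](int i, int j) { /* computation(i, j) */ });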
Pragma-Based Approach:
Lookup Table Optimization (JSP 2011, SCAM 2012, JSEP 2013)
- Source-to-source translation tool called Mesa, built on the ROSE compiler
- Up to 6.8x speedup on a set of 5 applications, including Small Angle X-ray Scattering
- Provides the user a Pareto-optimal set of LUT transformations to choose from (see the sketch below)
[Figure: Mesa workflow from original code to optimized code — performance profiling & scope identification, expression enumeration & domain profiling, error analysis & performance modeling, construct & solve optimization problem, code generation & integration, performance & accuracy evaluation.]
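A minimal sketch of the core lookup-table idea that Mesa automates (hypothetical LUT struct and parameters, not Mesa output): sample an expensive function once over its input domain, then replace calls with an interpolated table read, trading accuracy for speed:

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Hypothetical lookup-table wrapper: sample f over [lo, hi] once, then answer
    // later calls with linear interpolation between the two nearest samples.
    struct LUT {
      double lo, step;
      std::vector<double> table;
      LUT(double (*f)(double), double lo_, double hi_, int size)
          : lo(lo_), step((hi_ - lo_) / (size - 1)), table(size) {
        for (int k = 0; k < size; ++k) table[k] = f(lo + k * step);
      }
      double operator()(double x) const {            // assumes lo <= x <= hi
        double t = (x - lo) / step;
        int k = std::min(static_cast<int>(t), static_cast<int>(table.size()) - 2);
        return table[k] + (t - k) * (table[k + 1] - table[k]);
      }
    };

    // Usage: LUT fast_exp([](double v) { return std::exp(v); }, 0.0, 1.0, 4096);
    //        double y = fast_exp(0.31);   // approximates std::exp(0.31)
    // The table size controls the accuracy/speed (and memory) trade-off that
    // Mesa's error analysis and optimization step explores automatically.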
Fortran Library-Based Approach:
GridWeaver for semi-regular grids (ICS 2013)
[Figure: example semi-regular grids — tripole, cubed sphere, geodesic]
- Specify stencil computation, grid, and parallel distribution orthogonally.
Summary: Potential Stack of Abstractions for Specifying Schedules and/or Storage Mappings
[Summary diagram: a stack of abstractions from high level (DSLs, loop chain commands, CHiLL scripts, CnC tuners, Chapel iterators and domain maps, tree rewriting) down to low level (ASTs, pointers, trees, arrays, graphs, "tables", polyhedral sets), organized across computation, data, scheduling, and storage mapping. Scheduling entries include task graphs, map/reduce, polyhedral mappings, and non-affine (SPF) mappings; storage-mapping entries include Kokkos data layouts and OP2 SoA or AoS. Arrows indicate which higher-level controls can possibly be lowered to which lower-level representations.]
Discussion Questions
- What might an orthogonal abstraction stack for scheduling and storage mapping (aka performance programming) look like?
- What are ways to separate out the "smarts" to enable programmer control while still leveraging existing compiler and run-time scheduling and data layout algorithms?
- How do all of these concerns interact with performance portability?

(mstrout@cs.arizona.edu)