Orthogonal Abstractions for Scheduling and Storage Mappings - PowerPoint PPT Presentation


SLIDE 1

Orthogonal Abstractions for Scheduling and Storage Mappings

Dagstuhl October 26, 2017 Michelle Mills Strout (University of Arizona)

SLIDE 2

Collaborators and Funding

  • Ian Bertolacci (Ph.D. student at U Arizona)
  • Catherine Olschanowsky (faculty at Boise State University)
  • Eddie Davis (Ph.D. student at Boise State University)
  • Mary Hall (University of Utah)
  • Anand Venkat (Intel, Ph.D. in 2016)
  • Brad Chamberlain and Ben Harshbarger (Cray)
  • Stephen Guzik (Colorado State University)
  • Xinfeng Gao (Colorado State University)
  • Christopher Wilcox (Ph.D., CSU instructor)
  • Andrew Stone (Ph.D., now at MathWorks)
  • Christopher Krieger (Ph.D., now at University of Maryland)

The presented research was supported by a Department of Energy Early Career Grant DE-SC0003956, a National Science Foundation CAREER grant CCF-0746693, and a National Science Foundation grant CCF-1422725.

SLIDE 3

University of Arizona is Hiring

  • Tenure track faculty
  • Teaching faculty
  • Graduate students

(send email to mstrout@cs.arizona.edu)

SLIDE 4

Example: Need to Schedule and Do Storage Mapping Across Operators

  • Modularized per operator
  • To reduce synchronization costs, need to aggregate
  • To improve data locality, need to aggregate based on data reuse
  • Approaches to do this need to schedule across function calls and loops
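A toy sketch of the aggregation described above, assuming a hypothetical two-operator pipeline (the names `modular`, `fused`, and the stencil bodies are invented for illustration). Fusing the two operator loops removes the full intermediate array and the synchronization between them:

```cpp
#include <cstddef>
#include <vector>

// Modular version: operator 1 fills a full intermediate "flux" array,
// then operator 2 consumes it in a second sweep.
std::vector<double> modular(const std::vector<double>& u) {
    const std::size_t n = u.size();
    std::vector<double> flux(n);
    for (std::size_t i = 0; i + 1 < n; ++i)      // operator 1
        flux[i] = u[i + 1] - u[i];
    std::vector<double> out(n, 0.0);
    for (std::size_t i = 1; i + 1 < n; ++i)      // operator 2
        out[i] = u[i] + 0.5 * (flux[i] - flux[i - 1]);
    return out;
}

// Fused version: the same computation with the loops aggregated; the
// two flux values each point needs are computed in place, so the
// intermediate array (and one synchronization point) disappears.
std::vector<double> fused(const std::vector<double>& u) {
    const std::size_t n = u.size();
    std::vector<double> out(n, 0.0);
    for (std::size_t i = 1; i + 1 < n; ++i) {
        double f_left  = u[i] - u[i - 1];
        double f_right = u[i + 1] - u[i];
        out[i] = u[i] + 0.5 * (f_right - f_left);
    }
    return out;
}
```

Both versions compute identical results; the fused one trades per-point recomputation of neighboring flux values for locality, the same trade-off the overlapped-tiling slides explore.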

SLIDE 5

Schedule across loops for Asynchronous Parallelism

[Figure: task graph and full sparse tiled iteration space]

Break the computation that sweeps over the mesh/sparse matrix into chunks (sparse tiles).

SLIDE 6

Scheduling Across Loops

Shift, Fuse, and Overlap Tile

  • Tiles now include extra flux computations
  • All tiles can be executed in parallel
  • Pros and cons:
    • Performs the best
    • Fuse provides good temporal locality
    • Tiling provides good spatial locality
    • Overlap enables all tiles to start in parallel

SLIDE 7

Results When Done By Hand WITH Storage Mapping (SC14)

(SC 2014) A Study on Balancing Parallelism, Data Locality, and Recomputation in Existing PDE Solvers, Olschanowsky et al.

Goal: programmable scheduling for data locality and parallelism

SLIDE 8

The Compiler Needs Help

  • A general-purpose compiler does well with some optimizations, but …
    • program analysis to determine code structure is hard
    • automatic parallelization and optimizations for improving memory hierarchy use are hard
  • We look at ways to:
    • Enable the programmer to specify the WHAT and HOW separately (the programmer wants control)
    • Still provide the compiler with enough information so that it can provide performance portability

SLIDE 9

Domain-Specific Library:

OP2 – Unstructured Mesh DSL

  • Exposes data access information
  • Global (compiler flag) and local data layout specification

    op_par_loop(adt_calc, "adt_calc", cells,
        op_arg_dat(p_x,   0, pcell,  2,      "double:soa", OP_READ),
        op_arg_dat(p_x,   1, pcell,  2,      "double:soa", OP_READ),
        op_arg_dat(p_x,   2, pcell,  2,      "double:soa", OP_READ),
        op_arg_dat(p_x,   3, pcell,  2,      "double:soa", OP_READ),
        op_arg_dat(p_q,  -1, OP_ID,  PDIM*2, "double:soa", OP_READ),
        op_arg_dat(p_adt, -1, OP_ID, 1,      "double",     OP_WRITE));

[Figure labels: data array, access functions, data layout]

Reguly et al. “Acceleration of a Full-Scale Industrial CFD Application with OP2”, TPDS 2016.

SLIDE 10

Task-Like Library:

CNC – Concurrent Collections

  • Computation is specified as steps (tasks)
  • Steps get and put tagged data
  • Ability to specify run-time scheduling and garbage collection policies orthogonally

Schlimbach, F. and Brodman, J.C. and Knobe, K. “Concurrent Collections on Distributed Memory Theory Put into Practice”, PDP 2013.
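The put/get-on-tags model can be illustrated with a toy graph; this is emphatically not the CnC API, just a sketch of the idea that steps communicate only through a tag-indexed data collection, so the execution policy in `run()` can be swapped without touching step code:

```cpp
#include <functional>
#include <map>
#include <queue>

// Toy illustration of the CnC idea (invented types, not the CnC API):
// data items live in a tag-indexed collection, and steps only interact
// with the world through tagged put/get.
struct ToyGraph {
    std::map<int, double> items;                        // tagged item collection
    std::queue<std::function<void(ToyGraph&)>> steps;   // prescribed steps

    void put(int tag, double v) { items[tag] = v; }
    double get(int tag) const { return items.at(tag); }
    void prescribe(std::function<void(ToyGraph&)> s) { steps.push(std::move(s)); }

    // The scheduling policy lives here and only here: FIFO in this toy,
    // but it could be replaced (priority, work stealing, distributed)
    // without changing any step.
    void run() {
        while (!steps.empty()) {
            auto s = steps.front();
            steps.pop();
            s(*this);
        }
    }
};
```

The separation mirrors the slide's point: step bodies specify WHAT, while `run()` owns HOW (scheduling, and in real CnC, garbage collection of items).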

SLIDE 11

Data Parallel Library:

Kokkos

  • Can orthogonally specify multi-dimensional array layout:
    • Row-major
    • Column-major
    • Morton order
    • Tiled
    • Per-dimension padding
    • ...

View<double**, Tile<8,8>, Space> m("matrix", N, N);

H. Carter Edwards and Christian Trott. “Kokkos: Manycore Device Performance Portability for C++ HPC Applications”, GPU Tech Conference, 2015.
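The idea behind an orthogonal layout policy can be sketched outside of Kokkos; `RowMajor` and `Tiled` below are stand-in layout types invented for this example, not Kokkos classes. The same logical index (i, j) maps to different linear offsets depending on which policy is plugged in, while indexing code stays unchanged:

```cpp
#include <cstddef>

// Stand-in layout policies (illustrative, not Kokkos internals): each
// maps a logical 2-D index to a linear offset.
struct RowMajor {
    std::size_t rows, cols;
    std::size_t operator()(std::size_t i, std::size_t j) const {
        return i * cols + j;
    }
};

template <std::size_t TI, std::size_t TJ>
struct Tiled {
    std::size_t rows, cols;  // assumed multiples of TI, TJ for brevity
    std::size_t operator()(std::size_t i, std::size_t j) const {
        // Tiles are stored contiguously: first locate the tile, then
        // the element's position within it.
        std::size_t tile = (i / TI) * (cols / TJ) + (j / TJ);
        return tile * (TI * TJ) + (i % TI) * TJ + (j % TJ);
    }
};
```

Because the layout is a separate policy type, a kernel templated on it can be retuned for a different memory system by changing one template argument, which is the performance-portability lever the slide describes.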

SLIDE 12

Pragma-Based Approach:

Loop Chain Abstraction (HIPS13, WACCPD16)

#pragma omplc loopchain schedule(fuse, tile((10,10), wavefront, serial))
{
  // 2-dimensional loop nest, from 1 to N (inclusive)
  #pragma omplc domain(1:N, 1:N) ...
  for( int i = 1; i <= N; i += 1 )
    for( int j = 1; j <= N; j += 1 )
      /* Stuff */;

  // 2-dimensional loop nest, from 1 to N (inclusive)
  #pragma omplc domain(1:N, 1:N) ...
  for( int i = 1; i <= N; i += 1 )
    for( int j = 1; j <= N; j += 1 )
      /* Things */;
}

Annotations: fuse all loops; (10,10) is the tile size; wavefront over tiles; serial within tiles; tile the resulting fused loop.

(WACCPD 2016) Identifying and Scheduling Loop Chains Using Directives, Bertolacci et al.

SLIDE 13

Loop Chain Schedule Example (1/3)

[Figure: original loop chain (two loop nests) and the single loop nest after loop fusion]

schedule( fuse )
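The effect of schedule(fuse) can be sketched in C++ on a 1-D domain for brevity (the loop bodies here are made up, standing in for the "Stuff"/"Things" of the pragma example):

```cpp
#include <vector>

// Two separate loops over the same 1..N domain, as in the original
// loop chain (bodies invented for illustration).
void original(std::vector<int>& a, std::vector<int>& b, int N) {
    for (int i = 1; i <= N; ++i) a[i] = i * i;      // loop 1: "Stuff"
    for (int i = 1; i <= N; ++i) b[i] = a[i] + 1;   // loop 2: "Things"
}

// After fuse: one loop nest; a[i] is still hot in cache when the
// second body reads it, which is the temporal-locality win.
void fusedChain(std::vector<int>& a, std::vector<int>& b, int N) {
    for (int i = 1; i <= N; ++i) {
        a[i] = i * i;
        b[i] = a[i] + 1;
    }
}
```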

SLIDE 14

Loop Chain Schedule Example (2/3)

[Figure: original loop chain (two loop nests), the loop nest after loop fusion, and the tiles after tiling]

schedule( fuse, tile( (2,2), parallel, serial ) )
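A sketch of what the fuse-then-tile schedule produces on a 2-D 1..N domain (the loop body is invented). The two outer loops enumerate 2x2 tiles, which is the level the `parallel` specifier applies to; the inner loops run serially within each tile:

```cpp
#include <algorithm>
#include <vector>

// Sketch of schedule(fuse, tile((2,2), parallel, serial)).
// Tiles run serially here, but each tile in this example is
// independent, so the ti/tj loops are where `parallel` would land.
void fusedAndTiled(std::vector<std::vector<int>>& a, int N) {
    const int T = 2;                          // tile size from tile((2,2), ...)
    for (int ti = 1; ti <= N; ti += T)        // over tiles (parallelizable)
        for (int tj = 1; tj <= N; tj += T)
            for (int i = ti; i <= std::min(ti + T - 1, N); ++i)  // within tile
                for (int j = tj; j <= std::min(tj + T - 1, N); ++j)
                    a[i][j] = i + j;          // fused "Stuff"/"Things" body
}
```

The std::min clamps handle partial tiles when N is not a multiple of the tile size.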

SLIDE 15

Loop Chain Schedule Example (3/3)

schedule( tile( (2,2), serial, serial ), fuse )

[Figure: original loop chain (two loop nests); each loop nest tiled separately; then corresponding tiles fused across the nests]

SLIDE 16

Compiler Scripting Approach:

Inspector-Executor Transformations Composed by Compiler (CGO14, PLDI15, SC16)

[Figure: at compile time, the CHiLL/CUDA-CHiLL compiler, driven by scripting and autotuning frameworks that use uninterpreted functions, an inspector/executor API, and programmer-defined explicit functions, composes the inspectors (Inspector 1, e.g. index set splitting; Inspector 2, e.g. compact-and-pad; ...; Inspector K). At runtime, the composed inspector reads the irregular computation's index arrays and produces the executor (the transformed irregular computation).]

SLIDE 17

CHiLL-I/E Example: Sparse Triangular Solve

Irregular computation:

    for (i=0; i<N; i++)
      for (j=idx[i]; j<idx[i+1]; j++)
        x[i] = x[i] - A[j]*x[col[j]];

Dependences and scheduling using uninterpreted functions:

    Deps  = {[i]->[i']: i<i' and i=col(j) and idx(i)<=j<idx(i+1)}
    Sched = {[i,j]->[l,i,j]: i in level_set(l)}   (l sequential, i parallel, j sequential)

Inspector (runtime), built from programmer-defined explicit functions:

    CHiLLIE::func part_par(EF col, EF idx) {
      CHiLLIE::func level_set;
      // BFS traversal of Deps doing gets: ... idx(i) ... col(j) ...
      // Place appropriate i's in each set: ... level_set(l).insert(i) ...
      return level_set;
    }

    Scripting: level_set() = part_par(<i loop>)

Executor (runtime):

    for (l=0; l<M; l++)
      #pragma omp parallel for
      for (i in level_set(l))
        for (j=index[i]; j<index[i+1]; j++)
          x[i] = x[i] - A[j]*x[col[j]];
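One way the level_set() inspector could work is sketched below; this is an illustrative implementation, not CHiLL's generated code. Row i depends on every earlier row it reads through col, so its level is one more than the deepest level among its dependences, and all rows in the same level can run in parallel:

```cpp
#include <algorithm>
#include <vector>

// Runtime inspector sketch for a level-set schedule over a sparse
// triangular solve. CSR arrays idx/col play the roles of the
// uninterpreted functions idx() and col() in the Deps relation.
std::vector<std::vector<int>> levelSets(int N, const std::vector<int>& idx,
                                        const std::vector<int>& col) {
    std::vector<int> level(N, 0);
    for (int i = 0; i < N; ++i)
        for (int j = idx[i]; j < idx[i + 1]; ++j)
            if (col[j] < i)  // dependence on a strictly earlier row
                level[i] = std::max(level[i], level[col[j]] + 1);
    int L = N ? *std::max_element(level.begin(), level.end()) + 1 : 0;
    std::vector<std::vector<int>> sets(L);
    for (int i = 0; i < N; ++i) sets[level[i]].push_back(i);
    return sets;  // sets[l] holds the rows executable in parallel at level l
}
```

The executor then walks levels sequentially and rows within a level in parallel, matching the (l sequential, i parallel, j sequential) ordering of Sched.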

SLIDE 18

Moving Forward: Making schedules available in libraries (ICS 2015)

  • Diamond slab tiling written in C
  • Diamond slab tiling made available as a Chapel iterator

We want to transform our original schedule:

    for t in timeRange do
      forall (x,y) in spaceDomain do
        computation( t, x, y );

into a faster schedule:

    forall (t,x,y) in diamondTileIterator(…) do
      computation( t, x, y );

The diamond slab tiling schedule written in C:

    int Li=0, Ui=N, Lj=0, Uj=N;
    for (int ts=0; ts<T; ts+=subset_s)
      // loops over bottom left, middle, top right
      for (int c0 = -2; c0 <= 0; c0 += 1)
        // loops horizontally?
        for (int c1 = 0; c1 <= (Uj+tau-3)/(tau-3); c1 += 1)
          // loops vertically?, but without skew
          for (int x = (-Ui-tau+2)/(tau-3); x <= 0; x += 1) {
            int c2 = x - c1;  // skew
            // loops for time steps within a slab (slices within slabs)
            for (int c3 = 1; c3 <= subset_s; c3 += 1)
              for (int c4 = max(max(max(-tau*c1 - tau*c2 + 2*c3 - (2*tau-2),
                                        -Uj - tau*c2 + c3 - (tau-2)),
                                    tau*c0 - tau*c1 - tau*c2 - c3), Li);
                   c4 <= min(min(min(tau*c0 - tau*c1 - tau*c2 - c3 + (tau-1),
                                     -tau*c1 - tau*c2 + 2*c3),
                                 -Lj - tau*c2 + c3), Ui - 1);
                   c4 += 1)
                for (int c5 = max(max(tau*c1 - c3, Lj),
                                  -tau*c2 + c3 - c4 - (tau-1));
                     c5 <= min(min(Uj - 1, -tau*c2 + c3 - c4),
                               tau*c1 - c3 + (tau-1));
                     c5 += 1)
                  computation(c3, c4, c5);
          }

SLIDE 19

Pragma-Based Approach:

Lookup Table Optimization (JSP 2011, SCAM 2012, JSEP 2013)

  • Source-to-source translation tool called Mesa built on the ROSE compiler
  • Up to 6.8x speedup on a set of 5 applications including Small Angle X-ray Scattering
  • Provides the user a Pareto-optimal set of LUT transformations to choose from

[Figure: Mesa workflow, taking original code to optimized code through performance profiling & scope identification, expression enumeration & domain profiling, error analysis & performance modeling, construct & solve optimization problem, code generation & integration, and performance & accuracy evaluation]
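A generic sketch of the transformation Mesa automates (this is not Mesa's actual output; the `SinTable` type is invented for illustration): an expensive elementary function over a bounded domain is replaced by a sampled table plus nearest-sample lookup, trading a bounded accuracy loss for speed:

```cpp
#include <cmath>
#include <vector>

// Lookup-table replacement for sin(x) on a bounded domain [lo, hi].
// Error is bounded by roughly half the sample spacing times |cos|,
// which is the accuracy/performance trade-off a LUT optimizer models.
struct SinTable {
    static constexpr int SAMPLES = 4096;
    double lo, hi, step;
    std::vector<double> tab;

    SinTable(double lo_, double hi_)
        : lo(lo_), hi(hi_), step((hi_ - lo_) / (SAMPLES - 1)), tab(SAMPLES) {
        for (int k = 0; k < SAMPLES; ++k)   // precompute once
            tab[k] = std::sin(lo + k * step);
    }

    double operator()(double x) const {     // nearest-sample lookup
        int k = static_cast<int>((x - lo) / step + 0.5);
        return tab[k];
    }
};
```

A real tool also enumerates candidate expressions, models the error of each table size, and presents the Pareto-optimal size/accuracy choices, as the bullets above describe.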

SLIDE 20

Fortran Library-Based Approach:

GridWeaver for semi-regular grids (ICS 2013)

Specify stencil computation, grid, and parallel distribution orthogonally.

[Figure: tripole, cubed-sphere, and geodesic grids]

SLIDE 21

Summary: Potential Stack of Abstractions for Specifying Schedules and/or Storage Mappings

[Figure: stack of abstractions, high level to low level. Computation: DSLs lowering to ASTs, trees, pointers, arrays, graphs, "tables", and polyhedral sets. Scheduling: loop chain commands; Chapel iterators; task graphs; domains; map, reduce, ...; polyhedral mappings; non-affine (SPF); CHiLL scripts; CnC tuners; tree rewriting. Storage mapping: Kokkos data layouts; OP2 SoA or AoS; Chapel domain maps; polyhedral mappings; non-affine (SPF). Arrows mark what it is possible to lower and what controls what.]

SLIDE 22

Discussion Questions

  • What might an orthogonal abstraction stack for scheduling and storage mapping (aka performance programming) look like?
  • What are ways to separate out the "smarts" to enable programmer control while still leveraging existing compiler and run-time scheduling and data layout algorithms?
  • How do all of these concerns interact with performance portability?

(mstrout@cs.arizona.edu)