Minimizing Completion Time for Loop Tiling with Computation and Communication Overlapping


SLIDE 1

Minimizing Completion Time for Loop Tiling with Computation and Communication Overlapping

Georgios Goumas, Aristidis Sotiropoulos and Nectarios Koziris

National Technical University of Athens, Greece
Department of Electrical and Computer Engineering, Division of Computer Science, Computing Systems Lab

www.cslab.ece.ntua.gr, nkoziris@cslab.ece.ntua.gr

IPDPS 2001, San Francisco

SLIDE 2

Overview

Goal: minimize the overall execution time of nested loops on message-passing multiprocessor architectures. How? Loop tiling for parallelism + overlapping the otherwise interleaved communication and pure computation sub-phases.

OVERALL SCHEDULE IS LIKE A PIPELINED DATAPATH!

Is it possible? The software communication layer and the hardware should assist.

SLIDE 3

What is tiling (supernode transformation)?

  • A loop transformation
  • Partitioning of the iteration space J^n into n-dimensional parallelepiped areas formed by n families of hyperplanes
  • Each tile (supernode) contains many iteration points within its boundary area
  • A tile is defined by a square matrix H; each row vector h_i is perpendicular to a family of hyperplanes
  • Dually, a tile is defined by the n column vectors p_i that form its sides, P = [p_1 ... p_n]; it holds that P = H^{-1}

SLIDE 4

Multilevel Tiling: tiling at all levels of the memory hierarchy!

  • To increase reuse of register files
  • To increase reuse of cache lines (tiling for locality)
  • To increase locality in virtual memory
  • And at the upper level: tiling to exploit parallelism!

SLIDE 5

Why use tiling for parallelism?

  • Increases the grain of computation: reduces synchronization points (atomic tile execution)
  • Reduces overall communication cost (increases intra-processor communication)

TRY TO FULLY UTILIZE ALL PROCESSORS (CPUs!)

SLIDE 6

   

     − = →

− Hj

H j Hj j r Z Z r

n n 1 2

) ( , :

Tiles are atomic, identical, bounded and sweep the index space

identifies the coordinates of the tile that j is mapped to gives the coordinates of j within that tile relative to the tile origin

 

Hj

 

Hj H j

1 −

Tiling Transformation

SLIDE 7

      =       = 2 2

2 1 2 1

P H

p1 p2 h1 h2

p1 p2 h1 h2

{ }

5 , | ) , (

2 1 2 1 2

≤ ≤ = j j j j J

for j1 = 0 to 5 for j2 = 0 to 5 a(j1, j2) = a(j1-1, j2) + a(j1-1, j2-1 );

j1 j2

Example: A simple 2-D Tiling

SLIDE 8

Example (cont.)

Tile (supernode) space:

$$ J^S = \{\, j^S = (j_1^S, j_2^S) \mid j^S = \lfloor Hj \rfloor,\; j \in J^2 \,\} $$

Tile origin space:

$$ TOS = \{\, (j_1, j_2) \in Z^n \mid (j_1, j_2) = H^{-1}(j_1^S, j_2^S) = (2 j_1^S, 2 j_2^S),\; (j_1^S, j_2^S) \in J^S \,\} $$

For example, point (4, 3) maps to tile $\lfloor H(4,3)^T \rfloor = (2, 1)$ at offset (0, 1) from that tile's origin.

[Figure: the tiled 6x6 iteration space with tiles labelled (0,0) through (2,2).]
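To make the transformation concrete, here is a small standalone sketch (illustrative code, not from the original slides) that applies r(j) = (⌊Hj⌋, j - P⌊Hj⌋) for the 2-D example with H = diag(1/2, 1/2) and P = diag(2, 2):

#include <stdio.h>
#include <math.h>

/* Tiling transformation r(j) = (floor(Hj), j - P*floor(Hj)) for the 2-D example:
 * H = diag(1/2, 1/2), P = H^{-1} = diag(2, 2).                                   */
static void tile_coords(const int j[2], int tile[2], int offset[2])
{
    const double H[2][2] = { {0.5, 0.0}, {0.0, 0.5} };
    const int    P[2][2] = { {2, 0}, {0, 2} };

    for (int r = 0; r < 2; r++)                      /* tile coordinates floor(Hj) */
        tile[r] = (int)floor(H[r][0] * j[0] + H[r][1] * j[1]);
    for (int r = 0; r < 2; r++)                      /* offset within the tile     */
        offset[r] = j[r] - (P[r][0] * tile[0] + P[r][1] * tile[1]);
}

int main(void)
{
    int j[2] = {4, 3}, tile[2], offset[2];
    tile_coords(j, tile, offset);
    /* Expected: tile (2, 1), offset (0, 1). */
    printf("j=(%d,%d) -> tile (%d,%d), offset (%d,%d)\n",
           j[0], j[1], tile[0], tile[1], offset[0], offset[1]);
    return 0;
}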

SLIDE 9

      − − = 3 1 2 4 10 1 H

      = 4 1 2 3 P

            =               3 2 2 5 8 r

i j p1 p2 h1 h2 (0,0) (1,1) (0,1) (1,1) (2,0) (1,-1) (2,-1)

0 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7

Another Example:

SLIDE 10

Tile Computation - Communication Cost

The number of iteration points contained in a supernode j^S expresses the tile computation cost. The tile communication cost is proportional to the number of iteration points that need to send data to neighboring tiles.

SLIDE 11

$$ \text{Minimise} \quad V_{comm}(H) = \frac{1}{\det(H)} \sum_{k=1}^{n} \sum_{j=1}^{m} \sum_{i=1}^{n} h_{k,i}\, d_{i,j} \qquad (1) $$

$$ \text{Subject to} \quad V_{comp}(H) = \frac{1}{\det(H)} = \nu, \qquad HD \ge 0 \qquad (2) $$

Example: for the same dependence matrix D, two alternative tiling matrices P1 and P2 with equal computation volume (V_comp1 = V_comp2 = 20) yield different communication volumes (V_comm1 = 27, V_comm2 = 19).
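For intuition, a minimal sketch (not from the paper; it simply evaluates formulas (1)-(2)) computing V_comp and V_comm for the earlier 2-D example, using the dependence vectors d1 = (1,0) and d2 = (1,1) of the simple loop:

#include <stdio.h>

/* Evaluate V_comp and V_comm of formulas (1)-(2) for a 2-D tiling matrix H
 * and a dependence matrix D whose columns are the dependence vectors.
 * Values follow the simple 2-D example: H = diag(1/2), d1 = (1,0), d2 = (1,1). */
int main(void)
{
    const int n = 2, m = 2;
    double H[2][2] = { {0.5, 0.0}, {0.0, 0.5} };
    double D[2][2] = { {1.0, 1.0},               /* columns: d1 = (1,0), d2 = (1,1) */
                       {0.0, 1.0} };

    double detH   = H[0][0] * H[1][1] - H[0][1] * H[1][0];
    double v_comp = 1.0 / detH;                  /* iteration points per tile       */

    double v_comm = 0.0;                         /* formula (1)                     */
    for (int k = 0; k < n; k++)
        for (int j = 0; j < m; j++)
            for (int i = 0; i < n; i++)
                v_comm += H[k][i] * D[i][j];
    v_comm /= detH;

    printf("V_comp = %.1f  V_comm = %.1f\n", v_comp, v_comm);   /* 4.0 and 6.0 */
    return 0;
}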

SLIDE 12

Objectives when Tiling for Parallelism

Most methods try to: given a computation tile volume, minimize the communication needs. Re-shaping tiles = reducing communication. But what about the iteration space size and boundaries? The objective is to minimize overall execution time... thus we need efficient scheduling.

SLIDE 13

Scheduling of Tiles

If HD ≥ 0, tiles are atomic and preserve the lexicographic execution ordering.

How can we schedule tiles to exploit parallelism?

Use methods similar to those for scheduling loop iterations!

Time scheduling solution: LINEAR TIME SCHEDULING OF TILES
What about space scheduling? Solution: CHAINS OF TILES MAPPED TO THE SAME PROCESSOR

SLIDE 14

Linear Schedule

$$ t(j^S) = \left\lfloor \frac{\Pi\, j^S + t_0}{\operatorname{disp}\Pi} \right\rfloor, \qquad t_0 = -\min\{\Pi\, i \mid i \in J^S\} $$

Which is the optimal Π?

For the non-overlapping schedule: Π = [1 1 1 ... 1]

SLIDE 15

For coarse grain tiles, all iteration dependencies are contained within a tile area. Coarse grain is needed because processors are very fast relative to the communication latency, so the communication-to-computation ratio should remain meaningful. The supernode dependence set then contains only unitary dependencies; in other words, every tile communicates only with its neighbors, one in each dimension. For these unitary inter-tile dependence vectors, the optimal Π is [1 1 1 ... 1].

SLIDE 16

The total number P of time hyperplanes depends on the tile grain g:

$$ H = \begin{bmatrix} 1/2 & 0 \\ 0 & 1/2 \end{bmatrix} \Rightarrow g = \det(H^{-1}) = 4, \qquad H' = \begin{bmatrix} 1/3 & 0 \\ 0 & 1/3 \end{bmatrix} \Rightarrow g' = \det(H'^{-1}) = 9 $$

[Figure: the same iteration space (axes j1, j2) tiled with 2x2 tiles (grain g = 4) and with 3x3 tiles (grain g' = 9).]

SLIDE 17

Each tile execution phase involves two sub-phases: (a) compute and (b) communicate the results to neighbors. How many such phases? P(g), where P(g) is the number of time hyperplanes. Total execution time: T = P(g)·(Tcomp + Tcomm), where Tcomp = g·tc is the overall computation time for all iterations within a tile, and Tcomm = Tstartup + Ttransmit is the communication cost for sending data to neighboring tiles.
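A toy model of this cost can illustrate the contrast with the overlapping schedule of the following slides; the parameter values below are placeholders, and the sketch ignores that the overlapping schedule has a different number of hyperplanes P'(g) and additional startup latencies:

#include <stdio.h>

/* Toy cost model from the slide:
 *   blocking:    T  = P(g) * (Tcomp + Tcomm)
 *   overlapping: T ~ P'(g) * max(Tcomp, Tcomm)
 * All numbers are placeholders; P'(g) is taken equal to P(g) for simplicity. */
int main(void)
{
    double g         = 4096.0;    /* iterations per tile                  */
    double tc        = 50e-9;     /* computation time per iteration (s)   */
    double tstartup  = 100e-6;    /* per-step communication startup (s)   */
    double ttransmit = 150e-6;    /* data transmission time per step (s)  */
    double P         = 1000.0;    /* number of time hyperplanes           */

    double tcomp = g * tc;                      /* Tcomp = g * tc          */
    double tcomm = tstartup + ttransmit;        /* Tcomm                   */

    double t_block   = P * (tcomp + tcomm);
    double t_overlap = P * (tcomp > tcomm ? tcomp : tcomm);

    printf("blocking    T = %.4f s\n", t_block);
    printf("overlapping T = %.4f s\n", t_overlap);
    return 0;
}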

SLIDE 18

Mapping along the maximal dimension:

The optimal linear schedule is given by Π = [1 2]. For a tile $(j_1^S, j_2^S)$:

$$ t(j_1^S, j_2^S) = j_1^S + 2 j_2^S + 1 $$

The final tile is executed at time instance t = 5 + 2·3 + 1 = 12.

[Figure: tile space (j1^S up to 5, j2^S up to 3); each chain of tiles along j1^S is mapped to the same processor (P1, P2, ...).]

SLIDE 19

Unit Execution Time - Unit Communication Time (UET-UCT) GRIDS

GRIDS are task graphs with unitary dependencies ONLY! Assume each supernode is a task: the overlapping tile schedule is like a UET-UCT GRID scheduling problem!

The optimal time schedule for a tile $j^S = (j_1^S, j_2^S, \ldots, j_n^S)$ is:

$$ t(j^S) = 2 \sum_{\substack{i=1 \\ i \ne k}}^{n} j_i^S \;+\; j_k^S $$

where k is the "largest" dimension. We map all tiles along dimension k to the same processor.
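A small helper (illustrative only, following the formula above) that evaluates the overlapping schedule for an n-dimensional tile when tile chains along dimension k are mapped to the same processor; the two 2-D calls reproduce the t = 12 of the previous slide and the t = 14 of the next one:

#include <stdio.h>

/* Overlapping (UET-UCT) time schedule from the formula above:
 *   t(j^S) = 2 * sum_{i != k} j_i^S + j_k^S   (+1 offset, as in the 2-D examples,
 *   so that the first tile executes at time step 1).                             */
static long overlap_time(const long jS[], int n, int k)
{
    long t = 0;
    for (int i = 0; i < n; i++)
        t += (i == k) ? jS[i] : 2 * jS[i];
    return t + 1;
}

int main(void)
{
    long last[2] = {5, 3};                       /* last tile of the 2-D example */
    printf("map along j1 (maximal):     t = %ld\n", overlap_time(last, 2, 0));  /* 12 */
    printf("map along j2 (non-maximal): t = %ld\n", overlap_time(last, 2, 1));  /* 14 */
    return 0;
}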

SLIDE 20

Mapping along the non-maximal dimension:

The linear schedule is now given by Π = [2 1]: WORSE than before.

The final tile is executed at time instance t = 2·5 + 3 + 1 = 14.

[Figure: tile space with each chain of tiles along j2^S mapped to the same processor (P1, P2, ...).]

SLIDE 21

Overlapping case:

2 sub-phases: communication + computation. Communication in one time step, computation in the next.

[Figure: pipelined schedule on processors P1, P2 with the communication and computation sub-phases of successive tiles overlapped.]
SLIDE 22

Blocking (non-overlapping) case:

Communication + computation in each time step.

[Figure: schedule on processors P1, P2 with the communication and computation sub-phases serialized within each time step.]

SLIDE 23

Non-overlapping case

Each timestep contains a triplet of receive-compute-send primitives, or equivalently: compute-communicate.

There exist times when every processor is only sending or receiving!

BAD processor utilization!

SLIDE 24

Various levels of computation to communication overlapping:

SLIDE 25

Overlapping case

Each timestep is (ideally) either a compute or a send+receive primitive. Every processor computes its tile at step k and receives the data to be used at step k+1, while it sends the data produced at step k-1.

SLIDE 26

In-depth analysis of a time step

Thus the overall time is T = P'(g)·max(A1+A2+A3, B1+B2+B3+B4).

However, there exist unavoidable startup latencies:

SLIDE 27

Communication Layer Internals

Buffering + copying from user to kernel space. Sending through a syscall + transmitting through the media. Startup latency is unavoidable (at the moment!). But what about writing to the NIC and transmitting?

(At least it is not the process' job, but the kernel's! It steals CPU cycles anyway!)

SLIDE 28

Experimental Results

  • Linux cluster (16 nodes + 100 Mbps Ethernet + MPICH)
  • Test app: single-statement triple nested loop with rectangular tiling
  • The k dimension is the largest one
  • Each tile is a cube with ij, ik and kj sides
  • Mapping along the k dimension, so every processor in the ij plane (tile coordinates (i, j)):

1. Receives from neighbors (i-1, j) and (i, j-1)
2. Computes
3. Sends to neighbors (i+1, j) and (i, j+1)

SLIDE 29

Timing and Extra buffering for the overlapping case:

SLIDE 30

Blocking primitives

SLIDE 31

Blocking case

for i = 0 to max_i_tile-1
  for j = 0 to max_j_tile-1
    ProcB(i, j)

where ProcB(i, j) is:
  for k = 0 to max_k_tile-1 {
    MPI_Recv(T(i-1, j), results(T(i-1, j), k));
    MPI_Recv(T(i, j-1), results(T(i, j-1), k));
    compute();
    MPI_Send(T(i+1, j), results(T(i, j), k));
    MPI_Send(T(i, j+1), results(T(i, j), k));
  }

SLIDE 32

Non-blocking primitives

SLIDE 33

Non-blocking case

for i = 0 to max_i_tile-1
  for j = 0 to max_j_tile-1
    ProcNB(i, j)

where ProcNB(i, j) is:
  for k = 0 to max_k_tile-1 {
    MPI_Isend(T(i+1, j), results(T(i, j), k-1), &s1);
    MPI_Isend(T(i, j+1), results(T(i, j), k-1), &s2);
    MPI_Irecv(T(i-1, j), results(T(i-1, j), k+1), &r1);
    MPI_Irecv(T(i, j-1), results(T(i, j-1), k+1), &r2);
    compute();
    MPI_Wait(s1); MPI_Wait(s2);
    MPI_Wait(r1); MPI_Wait(r2);
  }
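For reference, a compilable sketch of one such overlapped step using standard MPI calls (the buffer names, neighbor ranks, tag and count are assumptions for illustration, not the authors' actual code): the sends of the step k-1 results and the receives for step k+1 are posted before the tile computation, and their completion is awaited afterwards. Whether the transfers really overlap the computation still depends on the MPI implementation and the NIC, which is exactly the issue examined next.

#include <mpi.h>

#define TILE_TAG 0

/* One overlapped time step on the processor that owns tile column (i, j):
 * post non-blocking sends of the previous step's results and non-blocking
 * receives of the data needed in the next step, compute the current tile,
 * then wait for all four transfers to complete.                            */
void overlapped_step(double *send_buf1, double *send_buf2,   /* to (i+1,j), (i,j+1)   */
                     double *recv_buf1, double *recv_buf2,   /* from (i-1,j), (i,j-1) */
                     int count,
                     int rank_send1, int rank_send2,
                     int rank_recv1, int rank_recv2,
                     void (*compute_tile)(void))
{
    MPI_Request req[4];

    MPI_Isend(send_buf1, count, MPI_DOUBLE, rank_send1, TILE_TAG, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(send_buf2, count, MPI_DOUBLE, rank_send2, TILE_TAG, MPI_COMM_WORLD, &req[1]);
    MPI_Irecv(recv_buf1, count, MPI_DOUBLE, rank_recv1, TILE_TAG, MPI_COMM_WORLD, &req[2]);
    MPI_Irecv(recv_buf2, count, MPI_DOUBLE, rank_recv2, TILE_TAG, MPI_COMM_WORLD, &req[3]);

    compute_tile();                                /* computation overlaps the transfers */

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}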

SLIDE 34

A x B x C (i, j, k) iteration spaces. 16 processors are used: 4 processors in each of the dimensions i, j.
Spaces: 16 x 16 x 16384, 16 x 16 x 32768, 32 x 32 x 4096.
Tiles of size 4x4xV and 8x8xV, for variable V, thus variable g.

Methodology:
1. Find V_experimental and g_experimental for which T is minimal (T_min)
2. Calculate tc (the computation time for one iteration)
3. Calculate T_fill_MPI_buffer experimentally for V_experimental
4. Determine P(g_experimental) (the number of hyperplanes)
5. Compute T_theoret by the formula, using P(g_experimental)
6. Compare T_min and T_theoret

SLIDE 35

16_16_16384

SLIDE 36

16_16_32768

SLIDE 37

32_32_4096

SLIDE 38

Table of Results

SLIDE 39

Can we find analytical expressions for Ai(g), Bi(g)? Too difficult. Do we need lower-latency layers? High-level communication layers seem to abstract away zero-copy protocols + DMA.

SLIDE 40

Timestep Analysis using kernel level DMA

SLIDE 41

Overlapping time schedule using DMA

SLIDE 42

Kernel-level initiation of DMA

Using DMA avoids the CPU/OS cycle stealing when copying from kernel space to NIC buffers. However, when DMA is started from the kernel:

  • the OS kernel checks the size of the user memory area segment
  • the OS kernel translates virtual memory to contiguous physical memory (DMA needs physical addresses)
  • the OS kernel writes the arguments and size to the DMA engine registers

Thus data are still copied from user space to contiguous kernel-space memory by the CPU, and the DMA startup latency (due to OS operations) grows relative to the transmission time. Solution: a USER-LEVEL NETWORKING LAYER.

SLIDE 43

Ongoing Work

  • We use SCI (Scalable Coherent Interface) with DMA capabilities (Dolphin D330 cards)
  • Two threads of control per process
  • The CPU does very little work, thus small startup latencies (even with DMA engine startups)
  • Coarser tile grains than before!
SLIDE 44

User Level Networking

AM, FM, U-Net and BIP, then VIA = the standard. Messages are sent directly from user space without OS intervention, through user-level communication endpoints. How about starting DMA from user level?

SLIDE 45

Evolution to DMA

  • It would be nice if we could write from user level directly to contiguous physical memory (mmap a "RAM device"): we save the CPU from the copy to contiguous memory areas.
  • It would be nice if we could initiate DMA from user level (with support from the OS and the device): we save the CPU from the memory-to-device copy.

SLIDE 46

MPI with Ethernet simple send

SLIDE 47

MPI with Ethernet DMA send

SLIDE 48

SCI with Shared memory Send

SLIDE 49

Our approach: SCI with DMA send