Minimizing Completion Time for Loop Tiling with Computation and Communication Overlapping


SLIDE 1

Minimizing Completion Time for Loop Tiling with Computation and Communication Overlapping

Georgios Goumas, Aristidis Sotiropoulos and Nectarios Koziris

National Technical University of Athens, Greece
Department of Electrical and Computer Engineering, Division of Computer Science, Computing Systems Lab

www.cslab.ece.ntua.gr, nkoziris@cslab.ece.ntua.gr

IPDPS 2001, San Francisco

SLIDE 2

Overview

Goal: minimize the overall execution time of nested loops on message-passing multiprocessor architectures. How? Loop tiling for parallelism + overlapping the otherwise interleaved communication and pure computation sub-phases.

OVERALL SCHEDULE IS LIKE A PIPELINED DATAPATH!

Is it possible? The software communication layer and the hardware should assist.

SLIDE 3

What is tiling (supernode transformation)?

  • A loop transformation
  • Partitioning of the iteration space J^n into n-dimensional parallelepiped areas formed by n families of hyperplanes
  • Each tile (supernode) contains many iteration points within its boundary area
  • A tile is defined by a square matrix H; each row vector h_i is perpendicular to a family of hyperplanes
  • Dually, a tile is defined by the n column vectors p_i that form its sides, P = [p_1 ... p_n]; it holds that P = H^{-1}

SLIDE 4

Multilevel Tiling: tiling at all levels of the memory hierarchy!

  • To increase reuse of register files
  • To increase reuse of cache lines (tiling for locality)
  • To increase locality in virtual memory
  • And at the upper level: tiling to exploit parallelism!

SLIDE 5

Why use tiling for parallelism?

  • Increases the grain of computation: reduces synchronization points (atomic tile execution)
  • Reduces overall communication cost (increases intra-processor communication)

TRY TO FULLY UTILIZE ALL PROCESSORS (CPUs!)

SLIDE 6

   

     − = →

− Hj

H j Hj j r Z Z r

n n 1 2

) ( , :

Tiles are atomic, identical, bounded and sweep the index space

identifies the coordinates of the tile that j is mapped to gives the coordinates of j within that tile relative to the tile origin

 

Hj

 

Hj H j

1 −

Tiling Transformation

SLIDE 7

      =       = 2 2

2 1 2 1

P H

p1 p2 h1 h2

p1 p2 h1 h2

{ }

5 , | ) , (

2 1 2 1 2

≤ ≤ = j j j j J

for j1 = 0 to 5 for j2 = 0 to 5 a(j1, j2) = a(j1-1, j2) + a(j1-1, j2-1 );

j1 j2

Example: A simple 2-D Tiling

SLIDE 8

Example (cont.)

Tile (supernode) space:

$$ J^S = \{\, j^S = (j_1^S, j_2^S) \mid j^S = \lfloor Hj \rfloor,\; j \in J^2 \,\} $$

Tile origin space:

$$ TOS = \{\, (j_1, j_2) \in Z^n \mid (j_1, j_2) = H^{-1}(j_1^S, j_2^S) = (2 j_1^S, 2 j_2^S),\; (j_1^S, j_2^S) \in J^S \,\} $$

For example, point (4, 3) maps to tile $\lfloor H(4,3)^T \rfloor = (2, 1)$ at offset (0, 1) from that tile's origin.

[Figure: the tiled 6x6 iteration space with tiles labelled (0,0) through (2,2).]
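To make the transformation concrete, here is a small standalone sketch (illustrative code, not from the original slides) that applies r(j) = (⌊Hj⌋, j - P⌊Hj⌋) for the 2-D example with H = diag(1/2, 1/2) and P = diag(2, 2):

#include <stdio.h>
#include <math.h>

/* Tiling transformation r(j) = (floor(Hj), j - P*floor(Hj)) for the 2-D example:
 * H = diag(1/2, 1/2), P = H^{-1} = diag(2, 2).                                   */
static void tile_coords(const int j[2], int tile[2], int offset[2])
{
    const double H[2][2] = { {0.5, 0.0}, {0.0, 0.5} };
    const int    P[2][2] = { {2, 0}, {0, 2} };

    for (int r = 0; r < 2; r++)                      /* tile coordinates floor(Hj) */
        tile[r] = (int)floor(H[r][0] * j[0] + H[r][1] * j[1]);
    for (int r = 0; r < 2; r++)                      /* offset within the tile     */
        offset[r] = j[r] - (P[r][0] * tile[0] + P[r][1] * tile[1]);
}

int main(void)
{
    int j[2] = {4, 3}, tile[2], offset[2];
    tile_coords(j, tile, offset);
    /* Expected: tile (2, 1), offset (0, 1). */
    printf("j=(%d,%d) -> tile (%d,%d), offset (%d,%d)\n",
           j[0], j[1], tile[0], tile[1], offset[0], offset[1]);
    return 0;
}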

SLIDE 9

      − − = 3 1 2 4 10 1 H

      = 4 1 2 3 P

            =               3 2 2 5 8 r

i j p1 p2 h1 h2 (0,0) (1,1) (0,1) (1,1) (2,0) (1,-1) (2,-1)

0 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7

Another Example:

SLIDE 10

Tile Computation - Communication Cost

The number of iteration points contained in a supernode j^S expresses the tile computation cost. The tile communication cost is proportional to the number of iteration points that need to send data to neighboring tiles.

SLIDE 11

$$ \text{Minimise} \quad V_{comm}(H) = \frac{1}{\det(H)} \sum_{k=1}^{n} \sum_{j=1}^{m} \sum_{i=1}^{n} h_{k,i}\, d_{i,j} \qquad (1) $$

$$ \text{Subject to} \quad V_{comp}(H) = \frac{1}{\det(H)} = \nu, \qquad HD \ge 0 \qquad (2) $$

Example: for the same dependence matrix D, two alternative tiling matrices P1 and P2 with equal computation volume (V_comp1 = V_comp2 = 20) yield different communication volumes (V_comm1 = 27, V_comm2 = 19).
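For intuition, a minimal sketch (not from the paper; it simply evaluates formulas (1)-(2)) computing V_comp and V_comm for the earlier 2-D example, using the dependence vectors d1 = (1,0) and d2 = (1,1) of the simple loop:

#include <stdio.h>

/* Evaluate V_comp and V_comm of formulas (1)-(2) for a 2-D tiling matrix H
 * and a dependence matrix D whose columns are the dependence vectors.
 * Values follow the simple 2-D example: H = diag(1/2), d1 = (1,0), d2 = (1,1). */
int main(void)
{
    const int n = 2, m = 2;
    double H[2][2] = { {0.5, 0.0}, {0.0, 0.5} };
    double D[2][2] = { {1.0, 1.0},               /* columns: d1 = (1,0), d2 = (1,1) */
                       {0.0, 1.0} };

    double detH   = H[0][0] * H[1][1] - H[0][1] * H[1][0];
    double v_comp = 1.0 / detH;                  /* iteration points per tile       */

    double v_comm = 0.0;                         /* formula (1)                     */
    for (int k = 0; k < n; k++)
        for (int j = 0; j < m; j++)
            for (int i = 0; i < n; i++)
                v_comm += H[k][i] * D[i][j];
    v_comm /= detH;

    printf("V_comp = %.1f  V_comm = %.1f\n", v_comp, v_comm);   /* 4.0 and 6.0 */
    return 0;
}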

SLIDE 12

Objectives when Tiling for Parallelism

Most methods try to: given a computation tile volume, minimize the communication needs. Re-shaping tiles = reducing communication. But what about the iteration space size and boundaries? The objective is to minimize overall execution time... thus we need efficient scheduling.

SLIDE 13

Scheduling of Tiles

If HD ≥ 0, tiles are atomic and preserve the lexicographic execution ordering.

How can we schedule tiles to exploit parallelism?

Use methods similar to those for scheduling loop iterations!

Time scheduling solution: LINEAR TIME SCHEDULING OF TILES
What about space scheduling? Solution: CHAINS OF TILES MAPPED TO THE SAME PROCESSOR

SLIDE 14

Linear Schedule

$$ t(j^S) = \left\lfloor \frac{\Pi\, j^S + t_0}{\operatorname{disp}\Pi} \right\rfloor, \qquad t_0 = -\min\{\Pi\, i \mid i \in J^S\} $$

Which is the optimal Π?

For the non-overlapping schedule: Π = [1 1 1 ... 1]

SLIDE 15

For coarse grain tiles, all iteration dependencies are contained within a tile area. Coarse grain is needed because processors are very fast relative to the communication latency, so the communication-to-computation ratio should remain meaningful. The supernode dependence set then contains only unitary dependencies; in other words, every tile communicates only with its neighbors, one in each dimension. For these unitary inter-tile dependence vectors, the optimal Π is [1 1 1 ... 1].

SLIDE 16

The total number P of time hyperplanes depends on the tile grain g:

$$ H = \begin{bmatrix} 1/2 & 0 \\ 0 & 1/2 \end{bmatrix} \Rightarrow g = \det(H^{-1}) = 4, \qquad H' = \begin{bmatrix} 1/3 & 0 \\ 0 & 1/3 \end{bmatrix} \Rightarrow g' = \det(H'^{-1}) = 9 $$

[Figure: the same iteration space (axes j1, j2) tiled with 2x2 tiles (grain g = 4) and with 3x3 tiles (grain g' = 9).]

SLIDE 17

Each tile execution phase involves two sub-phases: (a) compute and (b) communicate the results to neighbors. How many such phases? P(g), where P(g) is the number of time hyperplanes. Total execution time: T = P(g)·(Tcomp + Tcomm), where Tcomp = g·tc is the overall computation time for all iterations within a tile, and Tcomm = Tstartup + Ttransmit is the communication cost for sending data to neighboring tiles.
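A toy model of this cost can illustrate the contrast with the overlapping schedule of the following slides; the parameter values below are placeholders, and the sketch ignores that the overlapping schedule has a different number of hyperplanes P'(g) and additional startup latencies:

#include <stdio.h>

/* Toy cost model from the slide:
 *   blocking:    T  = P(g) * (Tcomp + Tcomm)
 *   overlapping: T ~ P'(g) * max(Tcomp, Tcomm)
 * All numbers are placeholders; P'(g) is taken equal to P(g) for simplicity. */
int main(void)
{
    double g         = 4096.0;    /* iterations per tile                  */
    double tc        = 50e-9;     /* computation time per iteration (s)   */
    double tstartup  = 100e-6;    /* per-step communication startup (s)   */
    double ttransmit = 150e-6;    /* data transmission time per step (s)  */
    double P         = 1000.0;    /* number of time hyperplanes           */

    double tcomp = g * tc;                      /* Tcomp = g * tc          */
    double tcomm = tstartup + ttransmit;        /* Tcomm                   */

    double t_block   = P * (tcomp + tcomm);
    double t_overlap = P * (tcomp > tcomm ? tcomp : tcomm);

    printf("blocking    T = %.4f s\n", t_block);
    printf("overlapping T = %.4f s\n", t_overlap);
    return 0;
}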

SLIDE 18

Mapping along the maximal dimension:

The optimal linear schedule is given by Π = [1 2]. For a tile $(j_1^S, j_2^S)$:

$$ t(j_1^S, j_2^S) = j_1^S + 2 j_2^S + 1 $$

The final tile is executed at time instance t = 5 + 2·3 + 1 = 12.

[Figure: tile space (j1^S up to 5, j2^S up to 3); each chain of tiles along j1^S is mapped to the same processor (P1, P2, ...).]

SLIDE 19

Unit Execution Time - Unit Communication Time (UET-UCT) GRIDS

GRIDS are task graphs with unitary dependencies ONLY! Assume each supernode is a task: the overlapping tile schedule is like a UET-UCT GRID scheduling problem!

The optimal time schedule for a tile $j^S = (j_1^S, j_2^S, \ldots, j_n^S)$ is:

$$ t(j^S) = 2 \sum_{\substack{i=1 \\ i \ne k}}^{n} j_i^S \;+\; j_k^S $$

where k is the "largest" dimension. We map all tiles along dimension k to the same processor.
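A small helper (illustrative only, following the formula above) that evaluates the overlapping schedule for an n-dimensional tile when tile chains along dimension k are mapped to the same processor; the two 2-D calls reproduce the t = 12 of the previous slide and the t = 14 of the next one:

#include <stdio.h>

/* Overlapping (UET-UCT) time schedule from the formula above:
 *   t(j^S) = 2 * sum_{i != k} j_i^S + j_k^S   (+1 offset, as in the 2-D examples,
 *   so that the first tile executes at time step 1).                             */
static long overlap_time(const long jS[], int n, int k)
{
    long t = 0;
    for (int i = 0; i < n; i++)
        t += (i == k) ? jS[i] : 2 * jS[i];
    return t + 1;
}

int main(void)
{
    long last[2] = {5, 3};                       /* last tile of the 2-D example */
    printf("map along j1 (maximal):     t = %ld\n", overlap_time(last, 2, 0));  /* 12 */
    printf("map along j2 (non-maximal): t = %ld\n", overlap_time(last, 2, 1));  /* 14 */
    return 0;
}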

SLIDE 20

Mapping along the non-maximal dimension:

The linear schedule is now given by Π = [2 1]: WORSE than before.

The final tile is executed at time instance t = 2·5 + 3 + 1 = 14.

[Figure: tile space with each chain of tiles along j2^S mapped to the same processor (P1, P2, ...).]

SLIDE 21

Overlapping case:

2 sub-phases: communication + computation. Communication in one time step, computation in the next.

[Figure: pipelined schedule on processors P1, P2 with the communication and computation sub-phases of successive tiles overlapped.]
SLIDE 22

Blocking (non-overlapping) case:

Communication + computation in each time step.

[Figure: schedule on processors P1, P2 with the communication and computation sub-phases serialized within each time step.]

SLIDE 23

Non-overlapping case

Each timestep contains a triplet of receive-compute-send primitives, or equivalently: compute-communicate.

There exist times when every processor is only sending or receiving!

BAD processor utilization!

SLIDE 24

Various levels of computation to communication overlapping:

SLIDE 25

Overlapping case

Each timestep is (ideally) either a compute or a send+receive primitive. Every processor computes its tile at step k and receives the data to be used at step k+1, while it sends the data produced at step k-1.

SLIDE 26

In-depth analysis of a time step

Thus the overall time is T = P'(g)·max(A1+A2+A3, B1+B2+B3+B4).

However, there exist unavoidable startup latencies:

SLIDE 27

Communication Layer Internals

Buffering + copying from user to kernel space. Sending through a syscall + transmitting through the media. Startup latency is unavoidable (at the moment!). But what about writing to the NIC and transmitting?

(At least it is not the process' job, but the kernel's! It steals CPU cycles anyway!)

SLIDE 28

Experimental Results

  • Linux cluster (16 nodes + 100 Mbps Ethernet + MPICH)
  • Test app: single-statement triple nested loop with rectangular tiling
  • The k dimension is the largest one
  • Each tile is a cube with ij, ik and kj sides
  • Mapping along the k dimension, so every processor in the ij plane (tile coordinates (i, j)):

1. Receives from neighbors (i-1, j) and (i, j-1)
2. Computes
3. Sends to neighbors (i+1, j) and (i, j+1)

SLIDE 29

Timing and Extra buffering for the overlapping case:

SLIDE 30

Blocking primitives

SLIDE 31

Blocking case

for i = 0 to max_i_tile-1
  for j = 0 to max_j_tile-1
    ProcB(i, j)

where ProcB(i, j) is:
  for k = 0 to max_k_tile-1 {
    MPI_Recv(T(i-1, j), results(T(i-1, j), k));
    MPI_Recv(T(i, j-1), results(T(i, j-1), k));
    compute();
    MPI_Send(T(i+1, j), results(T(i, j), k));
    MPI_Send(T(i, j+1), results(T(i, j), k));
  }

SLIDE 32

Non-blocking primitives

SLIDE 33

Non-blocking case

for i = 0 to max_i_tile-1
  for j = 0 to max_j_tile-1
    ProcNB(i, j)

where ProcNB(i, j) is:
  for k = 0 to max_k_tile-1 {
    MPI_Isend(T(i+1, j), results(T(i, j), k-1), &s1);
    MPI_Isend(T(i, j+1), results(T(i, j), k-1), &s2);
    MPI_Irecv(T(i-1, j), results(T(i-1, j), k+1), &r1);
    MPI_Irecv(T(i, j-1), results(T(i, j-1), k+1), &r2);
    compute();
    MPI_Wait(s1); MPI_Wait(s2);
    MPI_Wait(r1); MPI_Wait(r2);
  }
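For reference, a compilable sketch of one such overlapped step using standard MPI calls (the buffer names, neighbor ranks, tag and count are assumptions for illustration, not the authors' actual code): the sends of the step k-1 results and the receives for step k+1 are posted before the tile computation, and their completion is awaited afterwards. Whether the transfers really overlap the computation still depends on the MPI implementation and the NIC, which is exactly the issue examined next.

#include <mpi.h>

#define TILE_TAG 0

/* One overlapped time step on the processor that owns tile column (i, j):
 * post non-blocking sends of the previous step's results and non-blocking
 * receives of the data needed in the next step, compute the current tile,
 * then wait for all four transfers to complete.                            */
void overlapped_step(double *send_buf1, double *send_buf2,   /* to (i+1,j), (i,j+1)   */
                     double *recv_buf1, double *recv_buf2,   /* from (i-1,j), (i,j-1) */
                     int count,
                     int rank_send1, int rank_send2,
                     int rank_recv1, int rank_recv2,
                     void (*compute_tile)(void))
{
    MPI_Request req[4];

    MPI_Isend(send_buf1, count, MPI_DOUBLE, rank_send1, TILE_TAG, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(send_buf2, count, MPI_DOUBLE, rank_send2, TILE_TAG, MPI_COMM_WORLD, &req[1]);
    MPI_Irecv(recv_buf1, count, MPI_DOUBLE, rank_recv1, TILE_TAG, MPI_COMM_WORLD, &req[2]);
    MPI_Irecv(recv_buf2, count, MPI_DOUBLE, rank_recv2, TILE_TAG, MPI_COMM_WORLD, &req[3]);

    compute_tile();                                /* computation overlaps the transfers */

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}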

SLIDE 34

A x B x C (i, j, k) iteration spaces. 16 processors are used: 4 processors in each of the dimensions i, j.
Spaces: 16 x 16 x 16384, 16 x 16 x 32768, 32 x 32 x 4096.
Tiles of size 4x4xV and 8x8xV, for variable V, thus variable g.

Methodology:
1. Find V_experimental and g_experimental for which T is minimal (T_min)
2. Calculate tc (the computation time for one iteration)
3. Calculate T_fill_MPI_buffer experimentally for V_experimental
4. Determine P(g_experimental) (the number of hyperplanes)
5. Compute T_theoret by the formula, using P(g_experimental)
6. Compare T_min and T_theoret

SLIDE 35

16_16_16384

SLIDE 36

16_16_32768

SLIDE 37

32_32_4096

SLIDE 38

Table of Results

SLIDE 39

Can we find analytical expressions for Ai(g), Bi(g)? Too difficult. Do we need lower-latency layers? High-level communication layers seem to abstract away zero-copy protocols + DMA.

SLIDE 40

Timestep Analysis using kernel level DMA

SLIDE 41

Overlapping time schedule using DMA

SLIDE 42

Kernel-level initiation of DMA

Using DMA avoids the CPU/OS cycle stealing when copying from kernel space to NIC buffers. However, when DMA is started from the kernel:

  • the OS kernel checks the size of the user memory area segment
  • the OS kernel translates virtual memory to contiguous physical memory (DMA needs physical addresses)
  • the OS kernel writes the arguments and size to the DMA engine registers

Thus data are still copied from user space to contiguous kernel-space memory by the CPU, and the DMA startup latency (due to OS operations) grows relative to the transmission time. Solution: a USER-LEVEL NETWORKING LAYER.

SLIDE 43

Ongoing Work

  • We use SCI (Scalable Coherent Interface) with DMA capabilities (Dolphin D330 cards)
  • Two threads of control per process
  • The CPU does very little work, thus small startup latencies (even with DMA engine startups)
  • Coarser tile grains than before!
SLIDE 44

User Level Networking

AM, FM, U-Net and BIP, then VIA = the standard. Messages are sent directly from user space without OS intervention, through user-level communication endpoints. How about starting DMA from user level?

SLIDE 45

Evolution to DMA

  • It would be nice if we could write from user level directly to contiguous physical memory (mmap a "RAM device"): we save the CPU from the copy to contiguous memory areas.
  • It would be nice if we could initiate DMA from user level (with support from the OS and the device): we save the CPU from the memory-to-device copy.

SLIDE 46

MPI with Ethernet simple send

SLIDE 47

MPI with Ethernet DMA send

SLIDE 48

SCI with Shared memory Send

SLIDE 49

Our approach: SCI with DMA send