SLIDE 1

Programming Distributed Memory Systems Using OpenMP

Rudolf Eigenmann, Ayon Basumallik, Seung-Jai Min, School of Electrical and Computer Engineering, Purdue University, http://www.ece.purdue.edu/ParaMount

SLIDE 2

Is OpenMP a useful programming model for distributed systems?

OpenMP is a parallel programming model that assumes a shared address space

#pragma omp parallel for
for (i = 1; i < n; i++) { a[i] = b[i]; }

Why is it difficult to implement OpenMP for distributed processors?

The compiler or runtime system will need to
• partition and place data onto the distributed memories
• send/receive messages to orchestrate remote data accesses

HPF (High Performance Fortran) was a large-scale effort to do so – without success.

So, why should we try (again)?

OpenMP is an easier (higher-productivity?) programming model. It
• allows programs to be incrementally parallelized, starting from the serial versions,
• relieves the programmer of the task of managing the movement of logically shared data.

SLIDE 3

Two Translation Approaches

• Use a Software Distributed Shared Memory System
• Translate OpenMP directly to MPI

SLIDE 4

Approach 1: Compiling OpenMP for Software Distributed Shared Memory

SLIDE 5

Inter-procedural Shared Data Analysis

SUBROUTINE DCDTZ(A, B, C)
  INTEGER A, B, C
C$OMP PARALLEL
C$OMP+PRIVATE (B, C)
  A = ...
  CALL CCRANK
  ...
C$OMP END PARALLEL
END

SUBROUTINE DUDTZ(X, Y, Z)
  INTEGER X, Y, Z
C$OMP PARALLEL
C$OMP+REDUCTION(+:X)
  X = X + ...
C$OMP END PARALLEL
END

SUBROUTINE SUB0
  INTEGER DELTAT
  CALL DCDTZ(DELTAT, ...)
  CALL DUDTZ(DELTAT, ...)
END

SUBROUTINE CCRANK()
  ...
  beta = 1 - alpha
  ...
END

SLIDE 6

Access Pattern Analysis

DO istep = 1, itmax, 1
!$OMP PARALLEL DO
  rsd(i, j, k) = ...
!$OMP END PARALLEL DO
!$OMP PARALLEL DO
  rsd(i, j, k) = ...
!$OMP END PARALLEL DO
!$OMP PARALLEL DO
  u(i, j, k) = rsd(i, j, k)
!$OMP END PARALLEL DO
  CALL RHS()
ENDDO

SUBROUTINE RHS()
!$OMP PARALLEL DO
  u(i, j, k) = ...
!$OMP END PARALLEL DO
!$OMP PARALLEL DO
  ... = u(i, j, k) ...
  rsd(i, j, k) = rsd(i, j, k) ...
!$OMP END PARALLEL DO
!$OMP PARALLEL DO
  ... = u(i, j, k) ...
  rsd(i, j, k) = rsd(i, j, k) ...
!$OMP END PARALLEL DO
!$OMP PARALLEL DO
  ... = u(i, j, k) ...
  rsd(i, j, k) = ...
!$OMP END PARALLEL DO

SLIDE 7

=> Data Distribution-Aware Optimization

(Same code fragment as on Slide 6, shown again in the context of data distribution-aware optimization.)

SLIDE 8

Adding Redundant Computation to Eliminate Communication

OpenMP program:

DO k = 1, z
!$OMP PARALLEL DO
  DO j = 1, N, 1
    flux(m, j) = u(3, i, j, k) + ...
  ENDDO
!$OMP PARALLEL DO
  DO j = 1, N, 1
    DO m = 1, 5, 1
      rsd(m, i, j, k) = ... + flux(m, j+1) - flux(m, j-1))
    ENDDO
  ENDDO
ENDDO

S-DSM program:

init00 = (N/proc_num)*(pid-1)...
limit00 = (N/proc_num)*pid...
DO k = 1, z
  DO j = init00, limit00, 1
    flux(m, j) = u(3, i, j, k) + ...
  ENDDO
  CALL TMK_BARRIER(0)
  DO j = init00, limit00, 1
    DO m = 1, 5, 1
      rsd(m, i, j, k) = ... + flux(m, j+1) - flux(m, j-1))
    ENDDO
  ENDDO
ENDDO

Optimized S-DSM code (each process redundantly computes one extra flux element on each side of its block, so the flux(m, j+1) and flux(m, j-1) references are satisfied locally):

init00 = (N/proc_num)*(pid-1)...
limit00 = (N/proc_num)*pid...
new_init = init00 - 1
new_limit = limit00 + 1
DO k = 1, z
  DO j = new_init, new_limit, 1
    flux(m, j) = u(3, i, j, k) + ...
  ENDDO
  CALL TMK_BARRIER(0)
  DO j = init00, limit00, 1
    DO m = 1, 5, 1
      rsd(m, i, j, k) = ... + flux(m, j+1) - flux(m, j-1))
    ENDDO
  ENDDO
ENDDO

SLIDE 9

Access Privatization

Example from equake (SPEC OMPM2001)

Read-only shared variables (original, accessed through the shared structure):

if (master) {
  shared->ARCHnodes = ...;
  shared->ARCHduration = ...;
  ...
}
/* Parallel Region */
N = shared->ARCHnodes;
iter = shared->ARCHduration;
...

Privatized variables (the initialization is done by all nodes):

{
  ARCHnodes = ...;
  ARCHduration = ...;
  ...
}
/* Parallel Region */
N = ARCHnodes;
iter = ARCHduration;
...

SLIDE 10

Optimized Performance of OMPM2001 Benchmarks

[Chart: SPEC OMPM2001 performance on 1, 2, 4, and 8 processors – baseline performance vs. optimized performance for wupwise, swim, mgrid, art, equake, and applu.]

SLIDE 11

A Key Question: How Close Are We to MPI Performance?

[Chart: SPEC OMP2001 performance on 1, 2, 4, and 8 processors – baseline, optimized, and MPI performance for wupwise, swim, mgrid, and applu.]

SLIDE 12

Towards Adaptive Optimization

A combined compiler-runtime scheme:
• The compiler identifies repetitive access patterns.
• The runtime system learns the actual remote addresses and sends data early (a sketch follows below).

Ideal program characteristics:
• Outer, serial loop
• Inner, parallel loops
• Communication points at barriers
• Data addresses are invariant, or form a linear sequence, w.r.t. the outer loop
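As a concrete illustration of the combined scheme, here is a minimal C sketch (my own, with illustrative names such as comm_point_t, record_request, and push_early; not the authors' implementation): during one outer iteration the runtime records which remote address ranges were requested at a compiler-marked communication point, and in later iterations it pushes the same ranges to their consumers before they ask.

#include <stddef.h>

#define MAX_RANGES 64

typedef struct {                 /* one compiler-marked communication point */
    int learned;                 /* pattern captured in a previous iteration? */
    int n;                       /* number of recorded (consumer, range) pairs */
    int consumer[MAX_RANGES];    /* rank that needed the data */
    size_t lo[MAX_RANGES], hi[MAX_RANGES]; /* remote range it needed */
} comm_point_t;

/* Called by the S-DSM layer when a remote fetch is serviced during the
 * learning iteration. */
static void record_request(comm_point_t *cp, int consumer, size_t lo, size_t hi)
{
    if (!cp->learned && cp->n < MAX_RANGES) {
        cp->consumer[cp->n] = consumer;
        cp->lo[cp->n] = lo;
        cp->hi[cp->n] = hi;
        cp->n++;
    }
}

/* Called just before the barrier at this communication point in later
 * iterations: send each recorded range to its consumer ahead of demand. */
static void push_early(comm_point_t *cp,
                       void (*send)(int rank, size_t lo, size_t hi))
{
    if (!cp->learned)
        return;
    for (int i = 0; i < cp->n; i++)
        send(cp->consumer[i], cp->lo[i], cp->hi[i]);
}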
SLIDE 13

Current Best Performance of OpenMP for S-DSM

[Chart: speedups on 1, 2, 4, and 8 processors for wupwise, swim, applu, SpMul, and CG – baseline (no opt.) vs. locality opt. vs. locality opt. + compiler/runtime opt.]

SLIDE 14

Approach 2: Translating OpenMP directly to MPI

• Baseline translation
• Overlapping computation and communication for irregular accesses

SLIDE 15

Baseline Translation of OpenMP to MPI

Execution Model
• SPMD model
  • Serial regions are replicated on all processes
  • Iterations of parallel for loops are distributed, using static block scheduling (a sketch follows this list)
• Shared data is allocated on all nodes
  • There is no concept of "owner" – only producers and consumers of shared data
  • At the end of a parallel loop, producers communicate shared data to "potential" future consumers
  • Array section analysis is used for summarizing array accesses
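The sketch below (my own, not the authors' generated code) illustrates this execution model for the simple loop from Slide 2: each process executes its static block of iterations, and the block it produced is then made available to all potential consumers. The all-to-all broadcast at the end is the deliberately naive baseline; the real translation narrows it to the array sections that later consumers actually read.

#include <mpi.h>

void translated_loop(double *a, const double *b, int n)
{
    int pid, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* static block schedule: process pid executes iterations [lo, hi) */
    int chunk = (n + nprocs - 1) / nprocs;
    int lo = pid * chunk;
    int hi = (lo + chunk < n) ? lo + chunk : n;

    for (int i = lo; i < hi; i++)      /* body of the parallel loop */
        a[i] = b[i];

    /* unoptimized producer -> consumer step: every producer's block is
     * broadcast so the shared array is consistent on all processes */
    for (int p = 0; p < nprocs; p++) {
        int plo = p * chunk;
        int phi = (plo + chunk < n) ? plo + chunk : n;
        if (phi > plo)
            MPI_Bcast(&a[plo], phi - plo, MPI_DOUBLE, p, MPI_COMM_WORLD);
    }
}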

SLIDE 16

Baseline Translation

Translation Steps:

1. Identify all shared data.
2. Create annotations for accesses to shared data (use regular section descriptors to summarize array accesses; a sketch of such a descriptor follows this list).
3. Use interprocedural data flow analysis to identify potential consumers; incorporate OpenMP relaxed consistency specifications.
4. Create message sets to communicate data between producers and consumers.
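A regular section descriptor from step 2 might be represented roughly as follows; the struct layout and field names are my illustration, not the compiler's actual data structure.

typedef enum { RSD_READ, RSD_WRITE } rsd_kind_t;

typedef struct {
    rsd_kind_t kind;      /* read or write                          */
    const char *array;    /* which shared array, e.g. "A"           */
    int dim;              /* subscript dimension being summarized   */
    long (*lower)(int p); /* lower bound as a function of process p */
    long (*upper)(int p); /* upper bound as a function of process p */
} rsd_t;

/* Example: the descriptor <write, A, 1, l1(p), u1(p)> from the next slide
 * would be represented as { RSD_WRITE, "A", 1, l1, u1 }. */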

SLIDE 17

Message Set Generation

Principle: for every write, determine all future reads.

Example: RSD vertex V1 contains the write <write, A, 1, l1(p), u1(p)>. Later in the RSD graph appear the reads <read, A, 1, l2(p), u2(p)> and <read, A, 1, l4(p), u4(p)>, an intervening write <write, A, 1, l3(p), u3(p)>, and a further read <read, A, 1, l5(p), u5(p)>.

The message set at RSD vertex V1, for array A, from process p to process q, is computed as the elements of A with subscripts in the set

  SApq = {[l1(p), u1(p)] ∩ [l2(q), u2(q)]}
       ∪ {[l1(p), u1(p)] ∩ [l4(q), u4(q)]}
       ∪ ([l1(p), u1(p)] ∩ {[l5(q), u5(q)] − [l3(p), u3(p)]}),

where the subtraction accounts for the intervening write.
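Each term of SApq is an interval intersection between what process p writes and what process q reads. A small illustrative helper (mine, not from the paper) for one such term:

typedef struct { long lo, hi; int empty; } interval_t;

/* Intersect the section written by p, [wlo, whi], with the section read by q,
 * [rlo, rhi]; the surviving interval is what p must send to q for this
 * write/read pair. */
static interval_t intersect(long wlo, long whi, long rlo, long rhi)
{
    interval_t s;
    s.lo = (wlo > rlo) ? wlo : rlo;
    s.hi = (whi < rhi) ? whi : rhi;
    s.empty = (s.lo > s.hi);
    return s;
}

/* Usage sketch: the message set SApq at V1 is the union of such intersections
 * over all reachable reads, e.g. intersect(l1(p), u1(p), l2(q), u2(q)) and
 * intersect(l1(p), u1(p), l4(q), u4(q)). */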

SLIDE 18

Baseline Translation of Irregular Accesses

• Irregular access – A[B[i]], A[f(i)]
  • Reads: assume the whole array is accessed
  • Writes: inspect at runtime, communicate at the end of the parallel loop
• We can often do better than "conservative":
  • Monotonic array values => sharpen access regions (a sketch follows this list)
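A minimal sketch of the monotonicity refinement (my illustration, assuming B is monotonically non-decreasing): instead of communicating all of A, a process that executes iterations lo..hi-1 of a loop reading A[B[i]] only needs the section A[B[lo]] .. A[B[hi-1]].

typedef struct { long lo, hi; } section_t;

/* Access region of A touched by iterations [lo, hi) of a read A[B[i]],
 * assuming B is monotonically non-decreasing and lo < hi. */
static section_t read_section_of_A(const long *B, long lo, long hi)
{
    section_t s;
    s.lo = B[lo];        /* smallest subscript, by monotonicity */
    s.hi = B[hi - 1];    /* largest subscript, by monotonicity  */
    return s;
}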

SLIDE 19

Optimizations based on Collective Communication

• Recognition of reduction idioms
  • Translate to MPI_Reduce / MPI_Allreduce functions (a sketch follows this list).
• Casting sends/receives in terms of all-to-all calls
  • Beneficial where the producer-consumer relationship is many-to-many and there is insufficient distance between producers and consumers.
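A minimal sketch (assumed, not the authors' generated code) of the reduction idiom translation: an OpenMP reduction(+:sum) loop becomes a local partial sum over this process's block of iterations, followed by MPI_Allreduce, so that, as in OpenMP, every process ends up holding the reduced value.

#include <mpi.h>

double translated_reduction(const double *a, int n)
{
    int pid, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* static block of iterations for this process */
    int chunk = (n + nprocs - 1) / nprocs;
    int lo = pid * chunk;
    int hi = (lo + chunk < n) ? lo + chunk : n;

    double local = 0.0, global = 0.0;
    for (int i = lo; i < hi; i++)   /* originally: #pragma omp parallel for reduction(+:sum) */
        local += a[i];

    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;                  /* reduced value available on all processes */
}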

SLIDE 20

Performance of the Baseline OpenMP to MPI Translation

Platform II – Sixteen IBM SP-2 WinterHawk-II nodes connected by a high-performance switch.

SLIDE 21

We can do more for Irregular Applications?

L1 : #pragma omp parallel for
     for (i = 0; i < 10; i++)
         A[i] = ...

L2 : #pragma omp parallel for
     for (j = 0; j < 20; j++)
         B[j] = A[C[j]] + ...

Subscripts of accesses to shared arrays are not always analyzable at compile time.

Baseline OpenMP-to-MPI translation:
• Conservatively estimate that each process accesses the entire array.
• Try to deduce properties such as monotonicity for the irregular subscript to refine the estimate.
• Still, there may be redundant communication.

Runtime tests (inspection) are needed to resolve accesses.

[Figure: array A with one section produced by process 1 and another produced by process 2; the subscript array C = 1, 3, 5, 0, 2, ... (accesses on process 1) and 2, 4, 8, 1, 2, ... (accesses on process 2).]

SLIDE 22

Inspection

Inspection allows accesses to be differentiated (at runtime) as local and non-local accesses.

Inspection can also map iterations to accesses. This mapping can then be used to re-order iterations so that iterations with the same data source are clubbed together.

Communication of remote data can be overlapped with the computation of iterations that access local data (or data already received)

[Figure: array A indexed through C[i]. Before reordering, process 1 executes iterations i = 0, 1, 2, 3, 4, ... with subscripts C[i] = 1, 3, 5, 0, 2, ... and process 2 executes iterations i = 10, 11, 12, 13, 14, ... with subscripts 2, 5, 8, 1, 2, .... After reordering, process 1 executes its iterations in the order 3, 0, 4, 1, 2, ... (subscripts 0, 1, 2, 3, 5, ...) and process 2 in the order 11, 12, 13, 10, 14, ... (subscripts 5, 8, 1, 2, 2, ...), so that iterations with the same data source are grouped.]
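A minimal inspector sketch (my illustration, not the authors' runtime) for the loop B[j] = A[C[j]] + ... from Slide 21: it classifies iterations at runtime by whether the subscript C[j] falls into the locally owned section of A, so the local iterations can run while remote data is still in flight.

static void inspect(const int *C, int n_iters,
                    int my_lo, int my_hi,          /* locally owned range of A */
                    int *local, int *n_local,
                    int *remote, int *n_remote)
{
    *n_local = *n_remote = 0;
    for (int j = 0; j < n_iters; j++) {
        if (C[j] >= my_lo && C[j] < my_hi)
            local[(*n_local)++] = j;      /* can run before communication  */
        else
            remote[(*n_remote)++] = j;    /* run after remote A[] arrives  */
    }
}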

SLIDE 23

Loop Restructuring

Simple iteration reordering may not be sufficient to expose the full set of possibilities for computation-communication overlap.

L1 : #pragma omp parallel for
     for (i = 0; i < N; i++)
         p[i] = x[i] + alpha*r[i];

L2 : #pragma omp parallel for
     for (j = 0; j < N; j++) {
         w[j] = 0;
         for (k = rowstr[j]; k < rowstr[j+1]; k++)
S2:          w[j] = w[j] + a[k]*p[col[k]];
     }

Reordering loop L2 may still not club together accesses from different sources. Distribute loop L2 to form loops L2-1 and L2-2:

L1   : #pragma omp parallel for
       for (i = 0; i < N; i++)
           p[i] = x[i] + alpha*r[i];

L2-1 : #pragma omp parallel for
       for (j = 0; j < N; j++) {
           w[j] = 0;
       }

L2-2 : #pragma omp parallel for
       for (j = 0; j < N; j++) {
           for (k = rowstr[j]; k < rowstr[j+1]; k++)
S2:            w[j] = w[j] + a[k]*p[col[k]];
       }

SLIDE 24

Loop Restructuring contd.

Starting from the distributed loops L1, L2-1, and L2-2 of Slide 23, coalesce nested loop L2-2 to form loop L3, then reorder the iterations of loop L3 to achieve computation-communication overlap. The final restructured and reordered loop is:

L1   : #pragma omp parallel for
       for (i = 0; i < N; i++)
           p[i] = x[i] + alpha*r[i];

L2-1 : #pragma omp parallel for
       for (j = 0; j < N; j++) {
           w[j] = 0;
       }

L3   : for (i = 0; i < num_iter; i++)
           w[T[i].j] = w[T[i].j] + a[T[i].k]*p[T[i].col];

The T[i] data structure is created and filled in by the inspector (a possible layout is sketched below).
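A plausible layout for the T[i] entries, assumed for illustration only (the slides do not give the exact structure): each entry records the output row j, the position k into a[] and col[], and the resolved subscript col[k], so loop L3 needs no further indirect lookup at execution time.

typedef struct {
    int j;      /* row of w being updated         */
    int k;      /* index into a[] and col[]       */
    int col;    /* col[k], the subscript into p[] */
} titer_t;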

SLIDE 25

Achieving actual overlap of computation and communication

• Non-blocking send/recv calls may not actually progress concurrently with computation.
• Use a multi-threaded runtime system with separate computation and communication threads – on dual-CPU machines these threads can progress concurrently.
• The compiler extracts the send/recvs, along with the packing/unpacking of message buffers, into a communication thread (a sketch follows this list).
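A minimal sketch of the two-thread idea (my own, heavily simplified, not the authors' runtime system): the communication thread performs the MPI receive and signals the computation thread, which first runs the iterations that need only local data and blocks only when it reaches the iterations that need the remote data. It assumes an MPI library initialized for multi-threaded use (e.g. MPI_THREAD_MULTIPLE); packing, multiple peers, and error handling are omitted.

#include <mpi.h>
#include <pthread.h>

typedef struct { double *buf; int count; int src; } recv_task_t;

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int remote_data_ready = 0;

/* Communication thread: receive remote data, then announce its arrival. */
static void *communication_thread(void *arg)
{
    recv_task_t *t = (recv_task_t *)arg;
    MPI_Recv(t->buf, t->count, MPI_DOUBLE, t->src, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    pthread_mutex_lock(&lock);
    remote_data_ready = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Computation thread: overlap local iterations with the receive. */
void overlapped_region(recv_task_t *task)
{
    pthread_t comm;
    pthread_create(&comm, NULL, communication_thread, task);

    /* ... execute iterations that access only local data ... */

    pthread_mutex_lock(&lock);          /* now wait for the remote data */
    while (!remote_data_ready)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);

    /* ... execute iterations that access the received data ... */

    pthread_join(comm, NULL);
}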

SLIDE 26

Program Timeline (on process p):

Computation thread:
• Wake up the communication thread and initiate sends to processes q and r
• Execute iterations that access local data (tcomp-p)
• Wait for receives from process q to complete (twait-q)
• Execute iterations that access data received from process q (tcomp-q)
• Wait for receives from process r to complete (twait-r)
• Execute iterations that access data received from process r (tcomp-r)

Communication thread:
• Pack data and send to processes q and r (tsend)
• Receive data from process q (trecv-q)
• Receive data from process r (trecv-r)

SLIDE 27

Performance of Equake

[Chart: execution time (seconds) on 1, 2, 4, 8, and 16 nodes for hand-coded MPI, baseline (no inspection), inspection (no reordering), and inspection and reordering.]

Computation-communication overlap in Equake

[Chart: time (seconds) on 1, 2, 4, 8, and 16 processors – actual time spent in send/recv, computation available for overlapping, and actual wait time.]

SLIDE 28

Performance of Moldyn

[Chart: execution time (seconds) on 1, 2, 4, 8, and 16 nodes for hand-coded MPI, baseline, inspector without reordering, and inspection and reordering.]

Computation-communication overlap in Moldyn

[Chart: time (seconds) on 1, 2, 4, 8, and 16 nodes – time spent in send/recv, computation available for overlapping, and actual wait time.]
SLIDE 29

Performance of CG

[Chart: execution time (seconds) on 1, 2, 4, 8, and 16 nodes for NPB-2.3-MPI, baseline translation, inspector without reordering, and inspector with iteration reordering.]

Computation-communication overlap in CG

[Chart: time (seconds) on 1, 2, 4, 8, and 16 nodes – time spent in send/recv, computation available for overlap, and actual wait time.]
SLIDE 30

Conclusions

• There is hope for easier programming models on distributed systems.
• OpenMP can be translated effectively onto DPS; we have used benchmarks from
  • SPEC OMP
  • NAS
  • additional irregular codes
• Direct translation of OpenMP to MPI outperforms translation via S-DSM.
• The "fall back" of S-DSM for irregular accesses incurs significant overhead.
• Caveats:
  • Data scalability is an issue.
  • Black-belt programmers will always be able to do better.
  • Advanced compiler technology is involved. There will be performance surprises.