programming distributed memory systems using openmp

Programming Distributed Memory Systems Using OpenMP, Rudolf Eigenmann - PowerPoint PPT Presentation

Programming Distributed Memory Systems Using OpenMP. Rudolf Eigenmann, Ayon Basumallik, Seung-Jai Min, School of Electrical and Computer Engineering, Purdue University, http://www.ece.purdue.edu/ParaMount. Is OpenMP a useful programming model for distributed systems?


  1. Programming Distributed Memory Systems Using OpenMP. Rudolf Eigenmann, Ayon Basumallik, Seung-Jai Min, School of Electrical and Computer Engineering, Purdue University, http://www.ece.purdue.edu/ParaMount

  2. Is OpenMP a useful programming model for distributed systems? OpenMP is a parallel programming model that assumes a shared address space:

     #pragma omp parallel for
     for (i=1; i<n; i++) { a[i] = b[i]; }

     Why is it difficult to implement OpenMP for distributed processors? The compiler or runtime system needs to partition and place data onto the distributed memories and send/receive messages to orchestrate remote data accesses. HPF (High Performance Fortran) was a large-scale effort to do so, without success. So, why should we try (again)? OpenMP is an easier (higher-productivity?) programming model: it allows programs to be parallelized incrementally, starting from the serial version, and it relieves the programmer of the task of managing the movement of logically shared data.
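To make the shared-address-space assumption concrete, here is a minimal, self-contained OpenMP example in C; the array size and values are illustrative, not from the slides. Every thread reads and writes the same arrays without any explicit data placement or messaging, which is exactly what a distributed-memory implementation has to supply behind the scenes.

     #include <stdio.h>

     #define N 1000

     int main(void) {
         static double a[N], b[N];
         int i;

         for (i = 0; i < N; i++)        /* serial initialization */
             b[i] = (double) i;

         /* OpenMP assumes a and b live in one shared address space:
            each thread handles a slice of the iterations, but no
            thread needs to know where the data physically resides. */
         #pragma omp parallel for
         for (i = 1; i < N; i++)
             a[i] = b[i];

         printf("a[1] = %f\n", a[1]);
         return 0;
     }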

  3. Two Translation Approaches: use a Software Distributed Shared Memory system, or translate OpenMP directly to MPI.

  4. Approach 1: Compiling OpenMP for Software Distributed Shared Memory

  5. Inter-procedural Shared Data Analysis

     SUBROUTINE SUB0
       INTEGER DELTAT
       CALL DCDTZ(DELTAT,…)
       CALL DUDTZ(DELTAT,…)
     END

     SUBROUTINE DCDTZ(A, B, C)
       INTEGER A,B,C
     C$OMP PARALLEL
     C$OMP+PRIVATE (B, C)
       A = …
       CALL CCRANK
       …
     C$OMP END PARALLEL
     END

     SUBROUTINE DUDTZ(X, Y, Z)
       INTEGER X,Y,Z
     C$OMP PARALLEL
     C$OMP+REDUCTION(+:X)
       X = X + …
     C$OMP END PARALLEL
     END

     SUBROUTINE CCRANK()
       … beta = 1 – alpha …
     END

  6. Access Pattern Analysis

     Main time-step loop:

     DO istep = 1, itmax, 1
     !$OMP PARALLEL DO
        rsd(i, j, k) = …
     !$OMP END PARALLEL DO
     !$OMP PARALLEL DO
        rsd(i, j, k) = …
     !$OMP END PARALLEL DO
     !$OMP PARALLEL DO
        … = u(i, j, k)..
        rsd(i, j, k) = ...
     !$OMP END PARALLEL DO
        CALL RHS()
     ENDDO

     SUBROUTINE RHS()
     !$OMP PARALLEL DO
        u(i, j, k) = …
     !$OMP END PARALLEL DO
     !$OMP PARALLEL DO
        … = u(i, j, k)..
        rsd(i, j, k) = rsd(i, j, k)..
     !$OMP END PARALLEL DO
     !$OMP PARALLEL DO
        … = u(i, j, k)..
        rsd(i, j, k) = rsd(i, j, k)..
     !$OMP END PARALLEL DO
     !$OMP PARALLEL DO
        u(i, j, k) = rsd(i, j, k)
     !$OMP END PARALLEL DO

  7. => Data Distribution-Aware Optimization: the same time-step loop and SUBROUTINE RHS() as on the previous slide, with the parallel loops over u and rsd partitioned so that each process's iterations match the portion of the arrays placed in its local memory.
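The effect of the data distribution-aware optimization can be illustrated with a hedged C sketch; the block bounds, array shapes, and names are assumptions made for this sketch, not the compiler's output. Because every parallel loop is partitioned over the same block of the same dimension, each process keeps touching the same pages of u and rsd in every loop and every time step, so those pages stay resident on its node.

     /* Sketch: all parallel loops use one consistent block partition,
        so each process reuses its own portion of u and rsd. */
     #define NZ 64

     void time_steps(double u[NZ], double rsd[NZ], int itmax,
                     int rank, int nprocs) {
         int block = NZ / nprocs;        /* assume NZ % nprocs == 0 */
         int lo = rank * block, hi = lo + block;

         for (int istep = 1; istep <= itmax; istep++) {
             for (int k = lo; k < hi; k++)   /* parallel loop 1 */
                 rsd[k] = 0.5 * u[k];
             for (int k = lo; k < hi; k++)   /* parallel loop 2: same block */
                 u[k] = rsd[k] + 1.0;
             /* a barrier / consistency point would follow in S-DSM code */
         }
     }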

  8. Adding Redundant Computation to Eliminate Communication

     OpenMP program:

     DO k = 1, z
     !$OMP PARALLEL DO
        DO j = 1, N, 1
           flux(m, j) = u(3, i, j, k) + …
        ENDDO
     !$OMP PARALLEL DO
        DO j = 1, N, 1
           DO m = 1, 5, 1
              rsd(m, i, j, k) = … + flux(m, j+1) - flux(m, j-1))
           ENDDO
        ENDDO
     ENDDO

     S-DSM program:

     init00 = (N/proc_num)*(pid-1)…
     limit00 = (N/proc_num)*pid…
     DO k = 1, z
        DO j = init00, limit00, 1
           flux(m, j) = u(3, i, j, k) + …
        ENDDO
        CALL TMK_BARRIER(0)
        DO j = init00, limit00, 1
           DO m = 1, 5, 1
              rsd(m, i, j, k) = … + flux(m, j+1) - flux(m, j-1))
           ENDDO
        ENDDO
     ENDDO

     Optimized S-DSM code:

     init00 = (N/proc_num)*(pid-1)…
     limit00 = (N/proc_num)*pid…
     new_init = init00 - 1
     new_limit = limit00 + 1
     DO k = 1, z
        DO j = new_init, new_limit, 1
           flux(m, j) = u(3, i, j, k) + …
        ENDDO
        CALL TMK_BARRIER(0)
        DO j = init00, limit00, 1
           DO m = 1, 5, 1
              rsd(m, i, j, k) = … + flux(m, j+1) - flux(m, j-1))
           ENDDO
        ENDDO
     ENDDO
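The same transformation can be paraphrased in C; the 1-D shape, the halo width of one, and all names are assumptions of this sketch. Each process extends its flux loop by one iteration on each side, so the neighboring flux values needed by the rsd update are computed locally instead of being fetched from another node through the S-DSM.

     #define N 1024                      /* assume N % nprocs == 0 */

     void sweep(double *flux, const double *u, double *rsd,
                int rank, int nprocs) {
         int block = N / nprocs;
         int lo = rank * block;          /* first j owned by this process */
         int hi = lo + block;            /* one past the last owned j     */

         int ext_lo = (lo > 0) ? lo - 1 : lo;   /* redundant left halo  */
         int ext_hi = (hi < N) ? hi + 1 : hi;   /* redundant right halo */

         for (int j = ext_lo; j < ext_hi; j++)
             flux[j] = u[j];             /* + ... */

         /* no exchange of boundary flux values is needed here */

         for (int j = lo; j < hi; j++)
             if (j > 0 && j < N - 1)
                 rsd[j] = flux[j + 1] - flux[j - 1];
     }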

  9. Access Privatization. Example from equake (SPEC OMPM2001):

     Shared variables:

     if (master) {
        shared->ARCHnodes = …
        shared->ARCHduration = …
        ...
     }
     /* Parallel Region */
     N = shared->ARCHnodes;
     iter = shared->ARCHduration;
     …

     Privatized read-only variables:

     /* Done by all nodes */
     {
        ARCHnodes = …
        ARCHduration = …
        ...
     }
     /* Parallel Region */
     N = ARCHnodes;
     iter = ARCHduration;
     …
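A hedged C sketch of the same idea follows; the variable names mirror the slide, while the surrounding program structure and values are invented for illustration. Instead of keeping read-only configuration values in the shared segment and initializing them on the master only, every node computes its own private copy, so later reads inside parallel regions never touch shared memory.

     #include <stdio.h>

     static int    ARCHnodes;       /* private, read-only after init */
     static double ARCHduration;

     static void read_config(void) {    /* done by ALL nodes */
         ARCHnodes = 1024;              /* illustrative values */
         ARCHduration = 60.0;
     }

     int main(void) {
         read_config();

         /* parallel region: only private, node-local data is read */
         int N = ARCHnodes;
         double iter = ARCHduration;
         printf("N=%d iter=%f\n", N, iter);
         return 0;
     }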

  10. Optimized Performance of OMPM2001 Benchmarks. [Bar chart: SPEC OMP2001M performance (y-axis 0 to 6) on 1, 2, 4, and 8 processors for swim, mgrid, applu, equake, art, and wupwise, comparing baseline and optimized performance.]

  11. A Key Question: How Close Are We to MPI Performance? [Bar chart: SPEC OMP2001 performance (y-axis 0 to 8) on 1, 2, 4, and 8 processors for wupwise, mgrid, applu, and swim, comparing baseline, optimized, and MPI performance.]

  12. Towards Adaptive Optimization: a combined compiler-runtime scheme. The compiler identifies repetitive access patterns; the runtime system learns the actual remote addresses and sends the data early. Ideal program characteristics: an outer serial loop enclosing inner parallel loops, data addresses that are invariant or form a linear sequence with respect to the outer loop, and communication points at barriers.
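One possible shape of the runtime side of this scheme is sketched below in C; the data structure, the logging calls, and the send_early hook are all hypothetical names invented for this sketch. During the first outer iteration the runtime records which remote elements had to be fetched after each barrier; because the pattern repeats, later iterations can push those elements to the consumer as soon as the producing barrier completes.

     #define MAX_LOG 4096

     typedef struct {
         int addr[MAX_LOG];   /* recorded remote indices              */
         int n;               /* number of recorded indices           */
         int learned;         /* pattern captured after iteration 1?  */
     } comm_log_t;

     /* called by the runtime whenever a remote element must be fetched */
     void on_remote_fetch(comm_log_t *log, int index) {
         if (!log->learned && log->n < MAX_LOG)
             log->addr[log->n++] = index;
     }

     /* called at each barrier of the outer serial loop */
     void after_barrier(comm_log_t *log, int istep) {
         if (istep == 1) {
             log->learned = 1;            /* access pattern is now known */
         } else if (log->learned) {
             for (int i = 0; i < log->n; i++) {
                 /* send_early(log->addr[i]);   hypothetical runtime call
                    that ships the element before it is requested */
             }
         }
     }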

  13. Current Best Performance of OpenMP for S-DSM. [Bar chart: performance (y-axis 0 to 6) on 1, 2, 4, and 8 processors for wupwise, swim, applu, SpMul, and CG, comparing baseline (no optimization), locality optimization, and locality optimization plus compiler/runtime optimization.]

  14. Approach 2: Translating OpenMP directly to MPI  Baseline translation  Overlapping computation and communication for irregular accesses

  15. Baseline Translation of OpenMP to MPI  Execution Model  SPMD model  Serial Regions are replicated on all processes  Iterations of parallel for loops are distributed (using static block scheduling)  Shared Data is allocated on all nodes  There is no concept of “owner” – only producers and consumers of shared data  At the end of a parallel loop, producers communicate shared data to “potential” future consumers  Array section analysis is used for summarizing array accesses
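A hedged C/MPI sketch of this execution model follows; the loop, the block bounds, and the use of MPI_Allgather are illustrative choices for this sketch, not the translator's actual output. The serial region is replicated on every rank, the parallel loop's iterations are split into contiguous blocks by rank, and after the loop the written blocks are made available to all potential consumers.

     #include <mpi.h>

     #define N 1024   /* assume N % nprocs == 0 for simplicity */

     int main(int argc, char **argv) {
         static double a[N], b[N];
         int rank, nprocs;

         MPI_Init(&argc, &argv);
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
         MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

         for (int i = 0; i < N; i++)     /* serial region: replicated */
             b[i] = i;

         /* static block schedule of the parallel loop's iterations */
         int block = N / nprocs;
         int lo = rank * block, hi = lo + block;

         for (int i = lo; i < hi; i++)   /* my share of the parallel loop */
             a[i] = 2.0 * b[i];

         /* producers -> consumers: here every rank is treated as a
            potential reader of all of a, so gather the blocks everywhere */
         MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                       a, block, MPI_DOUBLE, MPI_COMM_WORLD);

         MPI_Finalize();
         return 0;
     }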

  16. Baseline Translation. Translation Steps: 1. Identify all shared data. 2. Create annotations for accesses to shared data (use regular section descriptors to summarize array accesses). 3. Use interprocedural data flow analysis to identify potential consumers; incorporate OpenMP relaxed consistency specifications. 4. Create message sets to communicate data between producers and consumers.

  17. Message Set Generation. For every write, determine all future reads. Given the regular section descriptor (RSD) vertices for array A (bounds are functions of the process number p)

     V1: <write, A, 1, l1(p), u1(p)>
     …
     <read, A, 1, l2(p), u2(p)>
     <write, A, 1, l3(p), u3(p)>
     <read, A, 1, l5(p), u5(p)>
     …
     <read, A, 1, l4(p), u4(p)>

     the message set at RSD vertex V1 for array A, from process p to process q, is computed as the elements of A with subscripts in the set

     SApq = { [l1(p), u1(p)] ∩ [l2(q), u2(q)] }
          ∪ { [l1(p), u1(p)] ∩ [l4(q), u4(q)] }
          ∪ ( [l1(p), u1(p)] ∩ { [l5(q), u5(q)] - [l3(p), u3(p)] } )
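For one-dimensional regular sections the set operations above reduce to interval arithmetic, as in the hedged C sketch below; the interval type, the bounds, and the process pair are invented for illustration.

     #include <stdio.h>

     /* A 1-D regular section [lo, hi] with stride 1; empty when lo > hi. */
     typedef struct { int lo, hi; } interval_t;

     static interval_t isect(interval_t a, interval_t b) {
         interval_t r = { a.lo > b.lo ? a.lo : b.lo,
                          a.hi < b.hi ? a.hi : b.hi };
         return r;                       /* empty if r.lo > r.hi */
     }

     int main(void) {
         /* illustrative bounds: writer p and reader q */
         interval_t write_p = { 0, 49 };    /* [l1(p), u1(p)] */
         interval_t read_q  = { 40, 99 };   /* [l2(q), u2(q)] */

         interval_t msg = isect(write_p, read_q);
         if (msg.lo <= msg.hi)
             printf("send A[%d..%d] from p to q\n", msg.lo, msg.hi);
         else
             printf("nothing to send\n");
         return 0;
     }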

  18. Baseline Translation of Irregular Accesses  Irregular access – A[B[i]], A[f(i)]  Reads: assume the whole array is accessed  Writes: inspect at runtime, communicate at the end of the parallel loop  We can often do better than this conservative estimate: monotonic array values allow the access regions to be sharpened.
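A hedged sketch of the runtime inspection for irregular writes is shown below; the arrays, the loop partition, and the min/max bounding summary are assumptions made for this sketch. While executing its share of the loop, each process records which elements of A it wrote through the index array, and at the end of the parallel loop only that region needs to be communicated.

     #include <limits.h>
     #include <stdio.h>

     #define N 100
     #define M 20

     int main(void) {
         static double A[N];
         static int C[M];
         int lo_w = INT_MAX, hi_w = INT_MIN;   /* inspected write bounds */

         for (int j = 0; j < M; j++)           /* illustrative index array */
             C[j] = (j * 7) % N;

         /* this process's share of the parallel loop (bounds illustrative) */
         for (int j = 0; j < M / 2; j++) {
             int idx = C[j];
             A[idx] = 1.0;                     /* irregular write A[C[j]] */
             if (idx < lo_w) lo_w = idx;       /* inspect at runtime */
             if (idx > hi_w) hi_w = idx;
         }

         /* at the end of the parallel loop, communicate only A[lo_w..hi_w] */
         if (lo_w <= hi_w)
             printf("communicate A[%d..%d]\n", lo_w, hi_w);
         return 0;
     }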

  19. Optimizations based on Collective Communication  Recognition of reduction idioms: translate them to MPI_Reduce / MPI_Allreduce calls.  Casting sends/receives in terms of all-to-all calls: beneficial where the producer-consumer relationship is many-to-many and there is insufficient distance between producers and consumers.
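For the reduction idiom, one way the translation could look is sketched below in C/MPI; the loop, the data, and the block partition are illustrative. Each process reduces over its own block of iterations and MPI_Allreduce combines the partial sums, leaving the result on every rank, just as an OpenMP reduction leaves the result visible to all threads.

     #include <mpi.h>
     #include <stdio.h>

     #define N 1024   /* assume N % nprocs == 0 */

     int main(int argc, char **argv) {
         static double a[N];
         int rank, nprocs;

         MPI_Init(&argc, &argv);
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
         MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

         for (int i = 0; i < N; i++) a[i] = 1.0;   /* replicated init */

         /* original idiom: #pragma omp parallel for reduction(+:sum) */
         int block = N / nprocs, lo = rank * block, hi = lo + block;
         double local = 0.0, sum = 0.0;
         for (int i = lo; i < hi; i++)
             local += a[i];

         /* combine partial sums; the result is available on every rank */
         MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

         if (rank == 0) printf("sum = %f\n", sum);
         MPI_Finalize();
         return 0;
     }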

  20. Performance of the Baseline OpenMP to MPI Translation. Platform II: sixteen IBM SP-2 WinterHawk-II nodes connected by a high-performance switch.

  21. We can do more for Irregular Applications?

     L1: #pragma omp parallel for
         for (i=0; i<10; i++)
            A[i] = ...

     L2: #pragma omp parallel for
         for (j=0; j<20; j++)
            B[j] = A[ C[j] ] + ...

     Subscripts of accesses to shared arrays are not always analyzable at compile time. The baseline OpenMP-to-MPI translation conservatively estimates that each process accesses the entire array; deducing properties such as monotonicity of the irregular subscript can refine that estimate, but there may still be redundant communication. Runtime tests (inspection) are needed to resolve the accesses, as in the sketch below. [Figure: array A is produced partly by process 1 and partly by process 2; the index array C (e.g., 1, 3, 5, 0, 2 … on process 1 and 2, 4, 8, 1, 2 … on process 2) determines which elements of A each process actually reads in loop L2.]
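A hedged sketch of using monotonicity to sharpen the access region follows; the index array contents and the loop partition are illustrative. If C is known, or verified at run time, to be non-decreasing, the elements of A read by a process's block of j iterations lie in the single section [C[lo], C[hi-1]], so only that section has to be received before the loop; otherwise the translation falls back to the conservative whole-array estimate.

     #include <stdio.h>

     #define M 20

     /* cheap runtime test: is C[lo..hi-1] non-decreasing? */
     static int is_monotonic(const int *C, int lo, int hi) {
         for (int j = lo + 1; j < hi; j++)
             if (C[j] < C[j - 1]) return 0;
         return 1;
     }

     int main(void) {
         static int C[M];
         for (int j = 0; j < M; j++)
             C[j] = 2 * j;                 /* illustrative, sorted indices */

         int lo = 0, hi = M / 2;           /* this process's block of j */

         if (is_monotonic(C, lo, hi))
             /* sharpened estimate: only A[C[lo]..C[hi-1]] is read here */
             printf("receive A[%d..%d] before the loop\n", C[lo], C[hi - 1]);
         else
             /* fall back: conservatively receive the whole array A */
             printf("receive all of A\n");
         return 0;
     }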
