SLIDE 1

Programming Distributed Memory Systems Using OpenMP

Rudolf Eigenmann, Ayon Basumallik, Seung-Jai Min, School of Electrical and Computer Engineering, Purdue University, http://www.ece.purdue.edu/ParaMount

SLIDE 2

Is OpenMP a useful programming model for distributed systems?

OpenMP is a parallel programming model that assumes a shared address space

#pragma omp parallel for
for (i = 1; i < n; i++) { a[i] = b[i]; }

Why is it difficult to implement OpenMP for distributed processors?

The compiler or runtime system will need to
• partition and place data onto the distributed memories
• send/receive messages to orchestrate remote data accesses

HPF (High Performance Fortran) was a large-scale effort to do so – without success.

So, why should we try (again)?

OpenMP is an easier (higher-productivity?) programming model. It
• allows programs to be incrementally parallelized, starting from the serial versions,
• relieves the programmer of the task of managing the movement of logically shared data.

SLIDE 3

Two Translation Approaches

• Use a Software Distributed Shared Memory System
• Translate OpenMP directly to MPI

SLIDE 4

Approach 1: Compiling OpenMP for Software Distributed Shared Memory

SLIDE 5

Inter-procedural Shared Data Analysis

SUBROUTINE DCDTZ(A, B, C)
  INTEGER A, B, C
C$OMP PARALLEL
C$OMP+PRIVATE (B, C)
  A = ...
  CALL CCRANK
  ...
C$OMP END PARALLEL
END

SUBROUTINE DUDTZ(X, Y, Z)
  INTEGER X, Y, Z
C$OMP PARALLEL
C$OMP+REDUCTION(+:X)
  X = X + ...
C$OMP END PARALLEL
END

SUBROUTINE SUB0
  INTEGER DELTAT
  CALL DCDTZ(DELTAT, ...)
  CALL DUDTZ(DELTAT, ...)
END

SUBROUTINE CCRANK()
  ...
  beta = 1 - alpha
  ...
END

SLIDE 6

Access Pattern Analysis

DO istep = 1, itmax, 1
!$OMP PARALLEL DO
  rsd(i, j, k) = ...
!$OMP END PARALLEL DO
!$OMP PARALLEL DO
  rsd(i, j, k) = ...
!$OMP END PARALLEL DO
!$OMP PARALLEL DO
  u(i, j, k) = rsd(i, j, k)
!$OMP END PARALLEL DO
  CALL RHS()
ENDDO

SUBROUTINE RHS()
!$OMP PARALLEL DO
  u(i, j, k) = ...
!$OMP END PARALLEL DO
!$OMP PARALLEL DO
  ... = u(i, j, k) ...
  rsd(i, j, k) = rsd(i, j, k) ...
!$OMP END PARALLEL DO
!$OMP PARALLEL DO
  ... = u(i, j, k) ...
  rsd(i, j, k) = rsd(i, j, k) ...
!$OMP END PARALLEL DO
!$OMP PARALLEL DO
  ... = u(i, j, k) ...
  rsd(i, j, k) = ...
!$OMP END PARALLEL DO

SLIDE 7

=> Data Distribution-Aware Optimization

(Same code fragment as on Slide 6, shown again in the context of data distribution-aware optimization.)

SLIDE 8

Adding Redundant Computation to Eliminate Communication

OpenMP program:

DO k = 1, z
!$OMP PARALLEL DO
  DO j = 1, N, 1
    flux(m, j) = u(3, i, j, k) + ...
  ENDDO
!$OMP PARALLEL DO
  DO j = 1, N, 1
    DO m = 1, 5, 1
      rsd(m, i, j, k) = ... + flux(m, j+1) - flux(m, j-1))
    ENDDO
  ENDDO
ENDDO

S-DSM program:

init00 = (N/proc_num)*(pid-1)...
limit00 = (N/proc_num)*pid...
DO k = 1, z
  DO j = init00, limit00, 1
    flux(m, j) = u(3, i, j, k) + ...
  ENDDO
  CALL TMK_BARRIER(0)
  DO j = init00, limit00, 1
    DO m = 1, 5, 1
      rsd(m, i, j, k) = ... + flux(m, j+1) - flux(m, j-1))
    ENDDO
  ENDDO
ENDDO

Optimized S-DSM code (each process redundantly computes one extra flux element on each side of its block, so the flux(m, j+1) and flux(m, j-1) references are satisfied locally):

init00 = (N/proc_num)*(pid-1)...
limit00 = (N/proc_num)*pid...
new_init = init00 - 1
new_limit = limit00 + 1
DO k = 1, z
  DO j = new_init, new_limit, 1
    flux(m, j) = u(3, i, j, k) + ...
  ENDDO
  CALL TMK_BARRIER(0)
  DO j = init00, limit00, 1
    DO m = 1, 5, 1
      rsd(m, i, j, k) = ... + flux(m, j+1) - flux(m, j-1))
    ENDDO
  ENDDO
ENDDO

SLIDE 9

Access Privatization

Example from equake (SPEC OMPM2001)

Read-only shared variables (original, accessed through the shared structure):

if (master) {
  shared->ARCHnodes = ...;
  shared->ARCHduration = ...;
  ...
}
/* Parallel Region */
N = shared->ARCHnodes;
iter = shared->ARCHduration;
...

Privatized variables (the initialization is done by all nodes):

{
  ARCHnodes = ...;
  ARCHduration = ...;
  ...
}
/* Parallel Region */
N = ARCHnodes;
iter = ARCHduration;
...

SLIDE 10

Optimized Performance of OMPM2001 Benchmarks

[Chart: SPEC OMPM2001 performance on 1, 2, 4, and 8 processors – baseline performance vs. optimized performance for wupwise, swim, mgrid, art, equake, and applu.]

SLIDE 11

A Key Question: How Close Are We to MPI Performance?

[Chart: SPEC OMP2001 performance on 1, 2, 4, and 8 processors – baseline, optimized, and MPI performance for wupwise, swim, mgrid, and applu.]

SLIDE 12

Towards Adaptive Optimization

A combined compiler-runtime scheme:
• The compiler identifies repetitive access patterns.
• The runtime system learns the actual remote addresses and sends data early (a sketch follows below).

Ideal program characteristics:
• Outer, serial loop
• Inner, parallel loops
• Communication points at barriers
• Data addresses are invariant, or form a linear sequence, w.r.t. the outer loop
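As a concrete illustration of the combined scheme, here is a minimal C sketch (my own, with illustrative names such as comm_point_t, record_request, and push_early; not the authors' implementation): during one outer iteration the runtime records which remote address ranges were requested at a compiler-marked communication point, and in later iterations it pushes the same ranges to their consumers before they ask.

#include <stddef.h>

#define MAX_RANGES 64

typedef struct {                 /* one compiler-marked communication point */
    int learned;                 /* pattern captured in a previous iteration? */
    int n;                       /* number of recorded (consumer, range) pairs */
    int consumer[MAX_RANGES];    /* rank that needed the data */
    size_t lo[MAX_RANGES], hi[MAX_RANGES]; /* remote range it needed */
} comm_point_t;

/* Called by the S-DSM layer when a remote fetch is serviced during the
 * learning iteration. */
static void record_request(comm_point_t *cp, int consumer, size_t lo, size_t hi)
{
    if (!cp->learned && cp->n < MAX_RANGES) {
        cp->consumer[cp->n] = consumer;
        cp->lo[cp->n] = lo;
        cp->hi[cp->n] = hi;
        cp->n++;
    }
}

/* Called just before the barrier at this communication point in later
 * iterations: send each recorded range to its consumer ahead of demand. */
static void push_early(comm_point_t *cp,
                       void (*send)(int rank, size_t lo, size_t hi))
{
    if (!cp->learned)
        return;
    for (int i = 0; i < cp->n; i++)
        send(cp->consumer[i], cp->lo[i], cp->hi[i]);
}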
SLIDE 13

Current Best Performance of OpenMP for S-DSM

[Chart: speedups on 1, 2, 4, and 8 processors for wupwise, swim, applu, SpMul, and CG – baseline (no opt.) vs. locality opt. vs. locality opt. + compiler/runtime opt.]

SLIDE 14

Approach 2: Translating OpenMP directly to MPI

• Baseline translation
• Overlapping computation and communication for irregular accesses

SLIDE 15

Baseline Translation of OpenMP to MPI

Execution Model
• SPMD model
  • Serial regions are replicated on all processes
  • Iterations of parallel for loops are distributed, using static block scheduling (a sketch follows this list)
• Shared data is allocated on all nodes
  • There is no concept of "owner" – only producers and consumers of shared data
  • At the end of a parallel loop, producers communicate shared data to "potential" future consumers
  • Array section analysis is used for summarizing array accesses
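The sketch below (my own, not the authors' generated code) illustrates this execution model for the simple loop from Slide 2: each process executes its static block of iterations, and the block it produced is then made available to all potential consumers. The all-to-all broadcast at the end is the deliberately naive baseline; the real translation narrows it to the array sections that later consumers actually read.

#include <mpi.h>

void translated_loop(double *a, const double *b, int n)
{
    int pid, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* static block schedule: process pid executes iterations [lo, hi) */
    int chunk = (n + nprocs - 1) / nprocs;
    int lo = pid * chunk;
    int hi = (lo + chunk < n) ? lo + chunk : n;

    for (int i = lo; i < hi; i++)      /* body of the parallel loop */
        a[i] = b[i];

    /* unoptimized producer -> consumer step: every producer's block is
     * broadcast so the shared array is consistent on all processes */
    for (int p = 0; p < nprocs; p++) {
        int plo = p * chunk;
        int phi = (plo + chunk < n) ? plo + chunk : n;
        if (phi > plo)
            MPI_Bcast(&a[plo], phi - plo, MPI_DOUBLE, p, MPI_COMM_WORLD);
    }
}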

SLIDE 16

Baseline Translation

Translation Steps:

1. Identify all shared data.
2. Create annotations for accesses to shared data (use regular section descriptors to summarize array accesses; a sketch of such a descriptor follows this list).
3. Use interprocedural data flow analysis to identify potential consumers; incorporate OpenMP relaxed consistency specifications.
4. Create message sets to communicate data between producers and consumers.
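A regular section descriptor from step 2 might be represented roughly as follows; the struct layout and field names are my illustration, not the compiler's actual data structure.

typedef enum { RSD_READ, RSD_WRITE } rsd_kind_t;

typedef struct {
    rsd_kind_t kind;      /* read or write                          */
    const char *array;    /* which shared array, e.g. "A"           */
    int dim;              /* subscript dimension being summarized   */
    long (*lower)(int p); /* lower bound as a function of process p */
    long (*upper)(int p); /* upper bound as a function of process p */
} rsd_t;

/* Example: the descriptor <write, A, 1, l1(p), u1(p)> from the next slide
 * would be represented as { RSD_WRITE, "A", 1, l1, u1 }. */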

SLIDE 17

Message Set Generation

Principle: for every write, determine all future reads.

Example: RSD vertex V1 contains the write <write, A, 1, l1(p), u1(p)>. Later in the RSD graph appear the reads <read, A, 1, l2(p), u2(p)> and <read, A, 1, l4(p), u4(p)>, an intervening write <write, A, 1, l3(p), u3(p)>, and a further read <read, A, 1, l5(p), u5(p)>.

The message set at RSD vertex V1, for array A, from process p to process q, is computed as the elements of A with subscripts in the set

  SApq = {[l1(p), u1(p)] ∩ [l2(q), u2(q)]}
       ∪ {[l1(p), u1(p)] ∩ [l4(q), u4(q)]}
       ∪ ([l1(p), u1(p)] ∩ {[l5(q), u5(q)] − [l3(p), u3(p)]}),

where the subtraction accounts for the intervening write.
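Each term of SApq is an interval intersection between what process p writes and what process q reads. A small illustrative helper (mine, not from the paper) for one such term:

typedef struct { long lo, hi; int empty; } interval_t;

/* Intersect the section written by p, [wlo, whi], with the section read by q,
 * [rlo, rhi]; the surviving interval is what p must send to q for this
 * write/read pair. */
static interval_t intersect(long wlo, long whi, long rlo, long rhi)
{
    interval_t s;
    s.lo = (wlo > rlo) ? wlo : rlo;
    s.hi = (whi < rhi) ? whi : rhi;
    s.empty = (s.lo > s.hi);
    return s;
}

/* Usage sketch: the message set SApq at V1 is the union of such intersections
 * over all reachable reads, e.g. intersect(l1(p), u1(p), l2(q), u2(q)) and
 * intersect(l1(p), u1(p), l4(q), u4(q)). */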

SLIDE 18

Baseline Translation of Irregular Accesses

• Irregular access – A[B[i]], A[f(i)]
  • Reads: assume the whole array is accessed
  • Writes: inspect at runtime, communicate at the end of the parallel loop
• We can often do better than "conservative":
  • Monotonic array values => sharpen access regions (a sketch follows this list)
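A minimal sketch of the monotonicity refinement (my illustration, assuming B is monotonically non-decreasing): instead of communicating all of A, a process that executes iterations lo..hi-1 of a loop reading A[B[i]] only needs the section A[B[lo]] .. A[B[hi-1]].

typedef struct { long lo, hi; } section_t;

/* Access region of A touched by iterations [lo, hi) of a read A[B[i]],
 * assuming B is monotonically non-decreasing and lo < hi. */
static section_t read_section_of_A(const long *B, long lo, long hi)
{
    section_t s;
    s.lo = B[lo];        /* smallest subscript, by monotonicity */
    s.hi = B[hi - 1];    /* largest subscript, by monotonicity  */
    return s;
}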

SLIDE 19

Optimizations based on Collective Communication

• Recognition of reduction idioms
  • Translate to MPI_Reduce / MPI_Allreduce functions (a sketch follows this list).
• Casting sends/receives in terms of all-to-all calls
  • Beneficial where the producer-consumer relationship is many-to-many and there is insufficient distance between producers and consumers.
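A minimal sketch (assumed, not the authors' generated code) of the reduction idiom translation: an OpenMP reduction(+:sum) loop becomes a local partial sum over this process's block of iterations, followed by MPI_Allreduce, so that, as in OpenMP, every process ends up holding the reduced value.

#include <mpi.h>

double translated_reduction(const double *a, int n)
{
    int pid, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* static block of iterations for this process */
    int chunk = (n + nprocs - 1) / nprocs;
    int lo = pid * chunk;
    int hi = (lo + chunk < n) ? lo + chunk : n;

    double local = 0.0, global = 0.0;
    for (int i = lo; i < hi; i++)   /* originally: #pragma omp parallel for reduction(+:sum) */
        local += a[i];

    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;                  /* reduced value available on all processes */
}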

SLIDE 20

Performance of the Baseline OpenMP to MPI Translation

Platform II – Sixteen IBM SP-2 WinterHawk-II nodes connected by a high-performance switch.

SLIDE 21

We can do more for Irregular Applications?

L1 : #pragma omp parallel for
     for (i = 0; i < 10; i++)
         A[i] = ...

L2 : #pragma omp parallel for
     for (j = 0; j < 20; j++)
         B[j] = A[C[j]] + ...

Subscripts of accesses to shared arrays are not always analyzable at compile time.

Baseline OpenMP-to-MPI translation:
• Conservatively estimate that each process accesses the entire array.
• Try to deduce properties such as monotonicity for the irregular subscript to refine the estimate.
• Still, there may be redundant communication.

Runtime tests (inspection) are needed to resolve accesses.

[Figure: array A with one section produced by process 1 and another produced by process 2; the subscript array C = 1, 3, 5, 0, 2, ... (accesses on process 1) and 2, 4, 8, 1, 2, ... (accesses on process 2).]

SLIDE 22

Inspection

Inspection allows accesses to be differentiated (at runtime) as local and non-local accesses.

Inspection can also map iterations to accesses. This mapping can then be used to re-order iterations so that iterations with the same data source are clubbed together.

Communication of remote data can be overlapped with the computation of iterations that access local data (or data already received)

[Figure: array A indexed through C[i]. Before reordering, process 1 executes iterations i = 0, 1, 2, 3, 4, ... with subscripts C[i] = 1, 3, 5, 0, 2, ... and process 2 executes iterations i = 10, 11, 12, 13, 14, ... with subscripts 2, 5, 8, 1, 2, .... After reordering, process 1 executes its iterations in the order 3, 0, 4, 1, 2, ... (subscripts 0, 1, 2, 3, 5, ...) and process 2 in the order 11, 12, 13, 10, 14, ... (subscripts 5, 8, 1, 2, 2, ...), so that iterations with the same data source are grouped.]
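A minimal inspector sketch (my illustration, not the authors' runtime) for the loop B[j] = A[C[j]] + ... from Slide 21: it classifies iterations at runtime by whether the subscript C[j] falls into the locally owned section of A, so the local iterations can run while remote data is still in flight.

static void inspect(const int *C, int n_iters,
                    int my_lo, int my_hi,          /* locally owned range of A */
                    int *local, int *n_local,
                    int *remote, int *n_remote)
{
    *n_local = *n_remote = 0;
    for (int j = 0; j < n_iters; j++) {
        if (C[j] >= my_lo && C[j] < my_hi)
            local[(*n_local)++] = j;      /* can run before communication  */
        else
            remote[(*n_remote)++] = j;    /* run after remote A[] arrives  */
    }
}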

SLIDE 23

Loop Restructuring

Simple iteration reordering may not be sufficient to expose the full set of possibilities for computation-communication overlap.

L1 : #pragma omp parallel for
     for (i = 0; i < N; i++)
         p[i] = x[i] + alpha*r[i];

L2 : #pragma omp parallel for
     for (j = 0; j < N; j++) {
         w[j] = 0;
         for (k = rowstr[j]; k < rowstr[j+1]; k++)
S2:          w[j] = w[j] + a[k]*p[col[k]];
     }

Reordering loop L2 may still not club together accesses from different sources. Distribute loop L2 to form loops L2-1 and L2-2:

L1   : #pragma omp parallel for
       for (i = 0; i < N; i++)
           p[i] = x[i] + alpha*r[i];

L2-1 : #pragma omp parallel for
       for (j = 0; j < N; j++) {
           w[j] = 0;
       }

L2-2 : #pragma omp parallel for
       for (j = 0; j < N; j++) {
           for (k = rowstr[j]; k < rowstr[j+1]; k++)
S2:            w[j] = w[j] + a[k]*p[col[k]];
       }

SLIDE 24

Loop Restructuring contd.

Starting from the distributed loops L1, L2-1, and L2-2 of Slide 23, coalesce nested loop L2-2 to form loop L3, then reorder the iterations of loop L3 to achieve computation-communication overlap. The final restructured and reordered loop is:

L1   : #pragma omp parallel for
       for (i = 0; i < N; i++)
           p[i] = x[i] + alpha*r[i];

L2-1 : #pragma omp parallel for
       for (j = 0; j < N; j++) {
           w[j] = 0;
       }

L3   : for (i = 0; i < num_iter; i++)
           w[T[i].j] = w[T[i].j] + a[T[i].k]*p[T[i].col];

The T[i] data structure is created and filled in by the inspector (a possible layout is sketched below).
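A plausible layout for the T[i] entries, assumed for illustration only (the slides do not give the exact structure): each entry records the output row j, the position k into a[] and col[], and the resolved subscript col[k], so loop L3 needs no further indirect lookup at execution time.

typedef struct {
    int j;      /* row of w being updated         */
    int k;      /* index into a[] and col[]       */
    int col;    /* col[k], the subscript into p[] */
} titer_t;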

SLIDE 25

Achieving actual overlap of computation and communication

• Non-blocking send/recv calls may not actually progress concurrently with computation.
• Use a multi-threaded runtime system with separate computation and communication threads – on dual-CPU machines these threads can progress concurrently.
• The compiler extracts the send/recvs, along with the packing/unpacking of message buffers, into a communication thread (a sketch follows this list).
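A minimal sketch of the two-thread idea (my own, heavily simplified, not the authors' runtime system): the communication thread performs the MPI receive and signals the computation thread, which first runs the iterations that need only local data and blocks only when it reaches the iterations that need the remote data. It assumes an MPI library initialized for multi-threaded use (e.g. MPI_THREAD_MULTIPLE); packing, multiple peers, and error handling are omitted.

#include <mpi.h>
#include <pthread.h>

typedef struct { double *buf; int count; int src; } recv_task_t;

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int remote_data_ready = 0;

/* Communication thread: receive remote data, then announce its arrival. */
static void *communication_thread(void *arg)
{
    recv_task_t *t = (recv_task_t *)arg;
    MPI_Recv(t->buf, t->count, MPI_DOUBLE, t->src, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    pthread_mutex_lock(&lock);
    remote_data_ready = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Computation thread: overlap local iterations with the receive. */
void overlapped_region(recv_task_t *task)
{
    pthread_t comm;
    pthread_create(&comm, NULL, communication_thread, task);

    /* ... execute iterations that access only local data ... */

    pthread_mutex_lock(&lock);          /* now wait for the remote data */
    while (!remote_data_ready)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);

    /* ... execute iterations that access the received data ... */

    pthread_join(comm, NULL);
}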

SLIDE 26

Program Timeline (on process p):

Computation thread:
• Wake up the communication thread and initiate sends to processes q and r
• Execute iterations that access local data (tcomp-p)
• Wait for receives from process q to complete (twait-q)
• Execute iterations that access data received from process q (tcomp-q)
• Wait for receives from process r to complete (twait-r)
• Execute iterations that access data received from process r (tcomp-r)

Communication thread:
• Pack data and send to processes q and r (tsend)
• Receive data from process q (trecv-q)
• Receive data from process r (trecv-r)

SLIDE 27

Performance of Equake

[Chart: execution time (seconds) on 1, 2, 4, 8, and 16 nodes for hand-coded MPI, baseline (no inspection), inspection (no reordering), and inspection and reordering.]

Computation-communication overlap in Equake

[Chart: time (seconds) on 1, 2, 4, 8, and 16 processors – actual time spent in send/recv, computation available for overlapping, and actual wait time.]

SLIDE 28

Performance of Moldyn

[Chart: execution time (seconds) on 1, 2, 4, 8, and 16 nodes for hand-coded MPI, baseline, inspector without reordering, and inspection and reordering.]

Computation-communication overlap in Moldyn

[Chart: time (seconds) on 1, 2, 4, 8, and 16 nodes – time spent in send/recv, computation available for overlapping, and actual wait time.]
SLIDE 29

Performance of CG

[Chart: execution time (seconds) on 1, 2, 4, 8, and 16 nodes for NPB-2.3-MPI, baseline translation, inspector without reordering, and inspector with iteration reordering.]

Computation-communication overlap in CG

[Chart: time (seconds) on 1, 2, 4, 8, and 16 nodes – time spent in send/recv, computation available for overlap, and actual wait time.]
SLIDE 30

Conclusions

• There is hope for easier programming models on distributed systems.
• OpenMP can be translated effectively onto DPS; we have used benchmarks from
  • SPEC OMP
  • NAS
  • additional irregular codes
• Direct translation of OpenMP to MPI outperforms translation via S-DSM.
• The "fall back" of S-DSM for irregular accesses incurs significant overhead.
• Caveats:
  • Data scalability is an issue.
  • Black-belt programmers will always be able to do better.
  • Advanced compiler technology is involved. There will be performance surprises.