
OpenMP: a shared-memory parallel programming model

Eduard Ayguadé

Computer Sciences Department Associate Director (BSC); Professor of the Computer Architecture Department (UPC)

OpenMP for shared memory

First definition in 1996

Today an industry standard, supported by the main vendors

Advantages

Easy to program, debug, modify and maintain

Incremental parallelization from the beginning

Improves programming productivity

Neither communication nor data distribution needed

Language extensions to Fortran77/90 and C/C++

Directives or pragmas that can be ignored when compiling the sequential version

Intrinsic functions in the OpenMP library

Environment variables


Three components of OpenMP

OMP directives/pragmas

These form the major elements of OpenMP programming; they

Create threads

Share the work amongst threads

Synchronize threads

Library routines

These routines can be used to control and query the parallel execution environment, such as the number of processors that are available for use

Environment variables

The execution environment, such as the number of threads to be made available to an OMP program, can also be set at the operating system level before program execution starts (an alternative to calling library routines). A minimal example combining the three components is sketched below.
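A minimal sketch tying the three components together (the printed message is illustrative; omp_set_num_threads overrides any OMP_NUM_THREADS setting):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        /* library routine: request a team of 4 threads */
        omp_set_num_threads(4);

        /* directive: create the team of threads */
        #pragma omp parallel
        {
            /* library routines: query the execution environment */
            printf("thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }   /* implicit barrier; back to a single thread */
        return 0;
    }

Alternatively, setting the environment variable (e.g. setenv OMP_NUM_THREADS 4) selects the team size without touching the code.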

PARALLEL region construct

Specification of parallel region

Fortran: C$OMP [END] PARALLEL [clause[[,] clause]…]
C/C++: #pragma omp parallel [clause [clause]…]

Execution model:

When a thread encounters a parallel region, it creates a team of threads and becomes the master of the team. The number of threads in a team remains constant for the duration of the parallel region

Parallelism is added incrementally, i.e. the sequential program evolves into a parallel program

[Figure: fork/join execution model. A fork at the beginning of a parallel region creates the team; the implicit barrier at the end of the region joins it. A nested parallel region forks and joins its own team.]


Some useful intrinsic functions

To identify individual threads by number

Fortran: INTEGER FUNCTION OMP_GET_THREAD_NUM()
C/C++: int omp_get_thread_num(void);
Returns a value between 0 and OMP_GET_NUM_THREADS()-1

To find out how many threads are being used

Fortran: INTEGER FUNCTION OMP_GET_NUM_THREADS()
C/C++: int omp_get_num_threads(void);
Returns 1 outside a parallel region, else the number of threads in the team

PARALLEL region construct

Each thread executes the same code redundantly

    double A[1000];

    omp_set_num_threads(4);
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        pooh(ID, A);
    }
    printf("all done\n");

[Figure: the four threads execute pooh(0,A) … pooh(3,A) concurrently, then join before printf.]

A single copy of A is shared between all threads

Threads wait here for all threads to finish before proceeding (i.e. a barrier)


PARALLEL region construct

Clauses:

NUM_THREADS(integer_exp), IF(logical_exp), PRIVATE(list), SHARED(list), FIRSTPRIVATE(list), REDUCTION({operator|intrinsic}:list), COPYIN(list)

Number of threads at each level:

Environment variable OMP_NUM_THREADS

Intrinsic function omp_set_num_threads (called in the serial part)

NUM_THREADS clause

[Figure: fork/join diagram. A serial region, whose team size is set with omp_set_num_threads(3) or setenv OMP_NUM_THREADS 3, forks a parallel region with NUM_THREADS=3; one of its threads forks a nested parallel region with NUM_THREADS=2.]
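A minimal sketch of those levels (illustrative: nested parallelism must be enabled, here with omp_set_nested, for the inner region to fork its own team):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        omp_set_nested(1);        /* allow nested parallel regions */
        omp_set_num_threads(3);   /* default team size, set in the serial part */

        #pragma omp parallel num_threads(3)       /* outer region: 3 threads */
        {
            int outer = omp_get_thread_num();
            #pragma omp parallel num_threads(2)   /* nested region: 2 threads each */
            {
                printf("outer %d, inner %d\n", outer, omp_get_thread_num());
            }   /* implicit barrier of the nested region */
        }       /* implicit barrier of the outer region */
        return 0;
    }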

First example: computation of PI

Mathematically, we know that:

∫₀¹ 4.0/(1+x²) dx = π

We can approximate the integral as a sum of rectangles:

∑ᵢ F(xᵢ) ∆x ≈ π,   i = 0 … N

where each rectangle has width ∆x and height F(xᵢ) at the middle of interval i.

[Figure: plot of F(x) = 4.0/(1+x²) over [0,1]; y axis marked 0.0 to 4.0.]


First example: computation of PI

    static long num_steps = 100000;
    double step;

    void main ()
    {
        int i;
        double x, pi, sum = 0.0;

        step = 1.0/(double) num_steps;
        for (i = 1; i <= num_steps; i++) {
            x = (i-0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
        }
        pi = step * sum;
    }

[Figure: plot of F(x) = 4.0/(1+x²) over [0,1], with the iteration space split among Processor 0 … Processor 3.]

First example: computation of PI

    #include <omp.h>
    static long num_steps = 100000;
    double step;
    #define NUM_THREADS 2

    void main ()
    {
        int i, id;
        double x, pi, sum = 0.0;

        step = 1.0/(double) num_steps;
        omp_set_num_threads(NUM_THREADS);
        #pragma omp parallel private(x, i, id) reduction(+:sum)
        {
            id = omp_get_thread_num();
            for (i = id+1; i <= num_steps; i = i+NUM_THREADS) {
                x = (i-0.5)*step;
                sum = sum + 4.0/(1.0+x*x);
            }
        }
        pi = sum * step;
    }



Work distribution

Work sharing constructs

Split up loop iterations among the threads in the team

Give a different structured block to each thread in the team

Give a structured block to just one thread in the team

[Figure: fork/join diagrams; each worksharing construct distributes work inside a parallel region and ends with an implicit barrier.]

Work distribution: DO loops

Syntax:

C/C++: #pragma omp for [clause [clause]…]
Fortran: C$OMP [END] DO [clause[[,] clause]…]

Clauses:

Data scope: PRIVATE(list), LASTPRIVATE(list), FIRSTPRIVATE(list), REDUCTION(list)

Iteration scheduling: SCHEDULE(type[,chunk])

Synchronization: NOWAIT, ORDERED (a short sketch of these clauses follows)
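A minimal sketch of two of these clauses in use (array names and sizes are illustrative): NOWAIT removes the implicit barrier after the first loop, which is safe here because the two loops touch different arrays, and LASTPRIVATE copies the value from the sequentially last iteration back to the shared variable:

    #include <omp.h>
    #define N 1000
    double a[N], b[N];

    double example(void)
    {
        int i;
        double last;

        #pragma omp parallel
        {
            /* nowait: threads proceed to the next loop without waiting */
            #pragma omp for schedule(static) nowait
            for (i = 0; i < N; i++)
                a[i] = 2.0 * i;

            /* lastprivate(last): after the loop, last holds the value
               assigned in the sequentially final iteration (i = N-1) */
            #pragma omp for lastprivate(last)
            for (i = 0; i < N; i++)
                last = b[i] = 3.0 * i;
        }
        return last;
    }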


First example: computation of PI

    #include <omp.h>
    static long num_steps = 100000;
    double step;
    #define NUM_THREADS 2

    void main ()
    {
        int i;
        double x, pi, sum = 0.0;

        step = 1.0/(double) num_steps;
        omp_set_num_threads(NUM_THREADS);
        #pragma omp parallel for reduction(+:sum) private(x)
        for (i = 1; i <= num_steps; i++) {
            x = (i-0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
        }
        pi = step * sum;
    }

Loop scheduling strategies

Loop schedules:

SCHEDULE(STATIC[,chunk]): iterations are divided into pieces of the size specified by chunk. Pieces are statically assigned to threads in a round-robin fashion following thread number.

SCHEDULE(DYNAMIC[,chunk]): iterations are broken into pieces of the size specified by chunk. Pieces are dynamically assigned to threads.

SCHEDULE(GUIDED[,chunk]): the chunk size is reduced in an exponentially decreasing manner with each dispatched piece of the iteration space; chunk specifies the minimum size. The C syntax for the three schedules is sketched below; the synthetic Fortran example on the following slides compares their behaviour.
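For reference, the same loop under the three schedules in C (the chunk value of 8 and the work routine are illustrative):

    #include <omp.h>
    #define N 1024
    extern void work(int i);   /* hypothetical routine whose cost varies with i */

    void compare_schedules(void)
    {
        int i;

        #pragma omp parallel for schedule(static, 8)   /* fixed round-robin pieces of 8 */
        for (i = 0; i < N; i++) work(i);

        #pragma omp parallel for schedule(dynamic, 8)  /* pieces of 8, grabbed on demand */
        for (i = 0; i < N; i++) work(i);

        #pragma omp parallel for schedule(guided, 8)   /* shrinking pieces, minimum size 8 */
        for (i = 0; i < N; i++) work(i);
    }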


Synthetic example: work unbalance

      PROGRAM test
      PARAMETER (N=1024)
      REAL dummy(N), factor
      INTEGER i, iter, time
      factor = 1/1.0000001
      DO iter = 1, 5
C$OMP PARALLEL DO SCHEDULE(STATIC)
C$OMP& SHARED(dummy) PRIVATE(i, time)
      DO i = 1, N
         dummy(i) = dummy(i)*factor
         time = i/100
         call delay(time)
      ENDDO
      ENDDO
      END

High unbalance: the delay grows with i, so the thread that is statically assigned the last block of iterations does most of the work



Synthetic example: work unbalance

      PROGRAM test
      PARAMETER (N=1024)
      REAL dummy(N), factor
      INTEGER i, iter, time
      factor = 1/1.0000001
      DO iter = 1, 5
C$OMP PARALLEL DO SCHEDULE(DYNAMIC)
C$OMP& SHARED(dummy) PRIVATE(i, time)
      DO i = 1, N
         dummy(i) = dummy(i)*factor
         time = i/100
         call delay(time)
      ENDDO
      ENDDO
      END

Low unbalance, but high overhead: with the default chunk of 1, every iteration is dispatched individually

Synthetic example: work unbalance

Less overhead, some imbalance: heavy chunks towards the end

      PROGRAM test
      PARAMETER (N=1024)
      REAL dummy(N), factor
      INTEGER i, iter, time
      factor = 1/1.0000001
      DO iter = 1, 5
C$OMP PARALLEL DO SCHEDULE(DYNAMIC, 50)
C$OMP& SHARED(dummy) PRIVATE(i, time)
      DO i = 1, N
         dummy(i) = dummy(i)*factor
         time = i/100
         call delay(time)
      ENDDO
      ENDDO
      END


Synthetic example: work unbalance

Less overhead, good load balance: heavy chunks towards the beginning. Dynamic assignment: non-repetitive pattern across the outer iterations

      PROGRAM test
      PARAMETER (N=1024)
      REAL dummy(N), factor
      INTEGER i, iter, time
      factor = 1/1.0000001
      DO iter = 1, 5
C$OMP PARALLEL DO SCHEDULE(GUIDED)
C$OMP& SHARED(dummy) PRIVATE(i, time)
      DO i = 1, N
         dummy(i) = dummy(i)*factor
         time = i/100
         call delay(time)
      ENDDO
      ENDDO
      END

Synthetic example: work unbalance

[Figure: execution traces for DYNAMIC, DYNAMIC,50 and GUIDED, shown at the same scale.]


Work distribution: SECTIONS

SECTIONS: worksharing construct that gives a different structured block to each thread in a team

Fortran:

!$OMP SECTIONS [clause [[,] clause] …]
!$OMP SECTION
    :
!$OMP SECTION
    :
!$OMP END SECTIONS [clause]

C/C++:

#pragma omp sections [clause [clause] …]
{
    #pragma omp section
        structured-block
    #pragma omp section
        structured-block
    :
}

Work distribution: SECTIONS

Example

!$OMP PARALLEL
!$OMP SECTIONS
!$OMP SECTION
      call init(x)
      call processA(x)
!$OMP SECTION
      call init(y)
      call processB(y)
!$OMP SECTION
      call init(z)
      call processC(z)
!$OMP END SECTIONS
!$OMP END PARALLEL

[Figure: timeline. Three threads each execute one section (x, y and z); the remaining threads sit idle until the implicit barrier at the end of SECTIONS.]
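For comparison, a C/C++ sketch of the same pattern (the data and the init/process routines are the hypothetical ones from the Fortran example):

    extern void init(double *);
    extern void processA(double *), processB(double *), processC(double *);

    void do_sections(double *x, double *y, double *z)
    {
        #pragma omp parallel
        {
            #pragma omp sections
            {
                #pragma omp section
                { init(x); processA(x); }   /* one thread */
                #pragma omp section
                { init(y); processB(y); }   /* another thread */
                #pragma omp section
                { init(z); processC(z); }   /* a third thread */
            }   /* implicit barrier at the end of sections */
        }
    }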


Work distribution: SINGLE

SINGLE: worksharing construct that gives a structured block to a single thread in a team

Example

    #pragma omp parallel
    {
        setup(x);
        #pragma omp single
        {
            input(y);
        }
        work(x,y);
    }

[Figure: timeline. All threads run setup(x); one thread runs input(y) while the others wait at the implicit barrier of SINGLE; then all threads run work(x,y).]

How do threads interact?

OpenMP is a shared memory model

Threads communicate by sharing variables

Unintended sharing of data causes race conditions

Race condition: when the program’s outcome changes as the threads are scheduled differently

To control race conditions

Use synchronization to protect data conflicts

Synchronization is expensive, so change how data is accessed to minimize the need for it (a minimal example of a race and its fix follows)
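A minimal sketch of an unintended race and its fix (the counter and the iteration count are illustrative; the ATOMIC construct used here is described later):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int count = 0;               /* shared by all threads */

        #pragma omp parallel
        {
            int i;
            for (i = 0; i < 100000; i++)
                count = count + 1;   /* RACE: read-modify-write is not atomic */
        }
        printf("racy count = %d\n", count);    /* usually < threads * 100000 */

        count = 0;
        #pragma omp parallel
        {
            int i;
            for (i = 0; i < 100000; i++)
            {
                #pragma omp atomic
                count = count + 1;   /* protected update: no lost increments */
            }
        }
        printf("atomic count = %d\n", count);  /* exactly threads * 100000 */
        return 0;
    }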


OpenMP synchronization constructs

Mutual exclusion:

Fortran: C$OMP [END] CRITICAL [(name)]
C/C++: #pragma omp critical [(name)]

Atomic execution:

Fortran: C$OMP ATOMIC
C/C++: #pragma omp atomic

Barrier synchronization:

Fortran: C$OMP BARRIER
C/C++: #pragma omp barrier

Ordered execution for loops:

Fortran: C$OMP [END] ORDERED
C/C++: #pragma omp ordered

[Figure: fork/join diagrams showing a critical region, an explicit barrier and ordered execution inside parallel regions.]

Synchronization: CRITICAL sections

Only one thread at a time can enter a critical section

    float res;
    #pragma omp parallel
    {
        float B;
        int i;
        #pragma omp for
        for (i = 0; i < niters; i++) {
            B = big_job(i);
            #pragma omp critical
            consum(B, res);
        }
    }

Threads wait their turn – only one at a time calls consum


Synchronization: ATOMIC access

Atomic is a special case of a critical section that can be used for certain simple statements. It applies only to the update of a memory location (the read and update of X in the following example) and makes use of special instructions in the processor

C$OMP PARALLEL PRIVATE(B)
      B = DOIT(I)
C$OMP ATOMIC
      X = X + B
C$OMP END PARALLEL
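The C/C++ counterpart would be (a sketch; doit stands in for any per-thread computation):

    #include <omp.h>

    extern double doit(int i);   /* hypothetical per-thread computation */
    double X = 0.0;

    void accumulate(void)
    {
        #pragma omp parallel
        {
            double B = doit(omp_get_thread_num());
            #pragma omp atomic
            X = X + B;   /* only the read-modify-write of X is protected */
        }
    }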

Synchronization: BARRIER

Each thread waits until all threads arrive

    #pragma omp parallel shared(A, B, C) private(id)
    {
        int i;
        id = omp_get_thread_num();
        A[id] = big_calc1(id);
    #pragma omp barrier
    #pragma omp for
        for (i = 0; i < N; i++)
            C[i] = big_calc3(i, A);   /* implicit barrier at the end of the for worksharing */
    #pragma omp for nowait
        for (i = 0; i < N; i++)
            B[i] = big_calc2(C, i);   /* no implicit barrier due to nowait */
        A[id] = big_calc4(id);
    }                                 /* implicit barrier at the end of the parallel region */


Synchronization: ORDERED execution

The ordered construct enforces sequential order for a block (in the example, following the sequential ordering of the loop iterations)

    #pragma omp parallel private(tmp)
    #pragma omp for ordered
    for (i = 0; i < N; i++) {
        tmp = NEAT_STUFF(i);
    #pragma omp ordered
        res += consum(tmp);
    }

Memory consistency: FLUSH

The flush construct denotes a sequence point where a thread tries to create a consistent view of memory

All memory operations (both reads and writes) defined prior to the sequence point must complete

All memory operations (both reads and writes) defined after the sequence point must follow the flush

Variables in registers or write buffers must be updated in memory

Arguments to flush specify which variables are flushed; with no arguments, all thread-visible variables are flushed


Memory consistency: FLUSH

This example shows how FLUSH is used to implement pair-wise synchronization

      integer ISYNC(NUM_THREADS)
C$OMP PARALLEL DEFAULT(PRIVATE) SHARED(ISYNC)
      IAM = OMP_GET_THREAD_NUM()
      ISYNC(IAM) = 0
C$OMP BARRIER
      CALL WORK()
      ISYNC(IAM) = 1                   ! I'm all done; signal this to others
C$OMP FLUSH(ISYNC)                     ! make sure other threads can see my write
      DO WHILE (ISYNC(NEIGH) .EQ. 0)   ! spin on the neighbour's flag
C$OMP FLUSH(ISYNC)                     ! make sure the read picks up the latest copy from memory
      END DO
C$OMP END PARALLEL

Dynamic work generation schemes

Until now (version 2.5), OpenMP has offered no support for this kind of parallelization strategy


Dynamic work generation schemes

Decouples work generation and execution

One thread generates all work

Amount of work unknown

Intel extension to OpenMP 2.0

Another example: handling recursion

      ...
C$OMP SINGLE
      CALL traverse(1, list, next)
C$OMP END SINGLE
      ...

      SUBROUTINE traverse(i, list, next)
      INTEGER i, list(100), next(100)
      INTEGER res
C$OMP TASK
      CALL compute(list, list(i), res)
C$OMP CRITICAL
      total = total + res
C$OMP END CRITICAL
C$OMP END TASK
      IF (next(i) .NE. 0) THEN
         CALL traverse(next(i), list, next)
      END IF
      END

OpenMP extension in 3.0
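For comparison, a C sketch of the same pattern with the OpenMP 3.0 task construct (the node layout and the compute routine are illustrative):

    #include <omp.h>

    typedef struct node { int value; struct node *next; } node;

    extern int compute(int value);   /* hypothetical per-node work */
    int total = 0;

    void traverse(node *p)
    {
        while (p != NULL) {
            #pragma omp task firstprivate(p)
            {
                int res = compute(p->value);
                #pragma omp critical
                total = total + res;
            }
            p = p->next;
        }
    }

    void run(node *head)
    {
        #pragma omp parallel
        {
            #pragma omp single    /* one thread generates all the tasks */
            traverse(head);
        }   /* implicit barrier: all tasks complete before the region ends */
    }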