SLIDE 1

OpenMP 4.0 and Beyond!

Aidan Chalk, Hartree Centre, STFC

SLIDE 3

What is OpenMP?

  • OpenMP is an API & standard for shared-memory parallel computing.
  • Works with C, C++ and Fortran.
  • It was first released in 1997, and version 4.5 was released in 2015.
  • Can now be used with accelerators such as GPUs, Xeon Phi & FPGAs.

SLIDE 4

The Basics

  • The OpenMP API uses pragmas to tell the compiler what to parallelise.
  • OpenMP is commonly used to do fork-join parallelism.
SLIDE 5

The Basics

  • All parallel code is performed inside a parallel region:

#pragma omp parallel
{
  //Parallel code goes here.
}

!$omp parallel
!Parallel code goes here
!$omp end parallel

SLIDE 6

The Basics

  • The number of threads to use in a parallel region can be controlled in 3 ways:
    – export OMP_NUM_THREADS=x
    – void omp_set_num_threads(int x)
    – C: #pragma omp parallel num_threads(x)
    – FORTRAN: !$omp parallel num_threads(x)

SLIDE 7

The Basics

  • The most common use case is a parallel loop:

#pragma omp parallel for
for(int i = 0; i < 100000; i++)
  c[i] = a[i] + b[i];

!$omp parallel do
do i=1, 100000
  c(i) = a(i) + b(i)
end do
!$omp end parallel do

SLIDE 8

Data-Sharing Clauses

  • One of the most important things to get correct when using OpenMP is data clauses.
  • ALWAYS start with #pragma omp parallel default(none)
  • Makes bugs less likely and easier to track down.
SLIDE 9

Commonly Used Data-Sharing Clauses

  • shared: Allows a variable to be accessed by all of the threads inside a parallel region – care for race conditions.
  • private: Creates an uninitialized copy of the variable for each thread. At the end of the region, the data is lost.
  • reduction: Creates a copy of the variable for each thread, initialised depending on the type of reduction chosen. Example options are +, *, -, & etc. At the end of the region, the original variable contains the reduction of all of the threads.

SLIDE 10

Using Data-Sharing Clauses

int i, sum, a[100000];
int b[100000], c[100000];
#pragma omp parallel for default(none) \
    shared(a,b,c) private(i) reduction(+:sum)
for(i = 0; i < 100000; i++) {
  c[i] = a[i] + b[i];
  sum += c[i];
}

Integer, Dimension(100000)::a,b,c
Integer :: i, sums
!$omp parallel do default(none) &
!$omp shared(a, b, c) private(i) &
!$omp reduction(+:sums)
do i=1,100000
  c(i) = b(i) + a(i)
  sums = sums + c(i)
end do
!$omp end parallel do

SLIDE 11

Controlling Loop Scheduling

  • OpenMP also allows the user to specify how the loop is executed, using the schedule clause.
  • The default option is static. The loop is broken into nr_threads equal chunks, and each thread executes a chunk.
  • You can specify the size of the chunks manually: schedule(static,100) will create chunks of size 100.
  • Other options: guided, dynamic, auto, runtime.
  • Usually static or guided will give best performance.
SLIDE 12

Controlling Loop Scheduling

  • The other commonly used options are:
  • guided, chunksize: The iterations are assigned to threads in chunks. Each chunk will be proportional to the number of remaining iterations, and no less than the chunk size.
  • dynamic, chunksize: The iterations are distributed to threads in chunks. Each thread executes a chunk, then requests another chunk once it has completed.
  • Usually static or guided will give best performance.
SLIDE 13

Controlling Loop Scheduling

#pragma omp parallel for default(none) \
    shared(a, b, c) private(i) \
    schedule(guided,1000)
for(i = 0; i < 100000; i++) {
  c[i] = a[i] + b[i];
}

!$omp parallel do default(none) &
!$omp shared(a, b, c) private(i) &
!$omp schedule(dynamic, 1000)
do i=1,100000
  c(i) = b(i) + a(i)
end do
!$omp end parallel do

SLIDE 14

Thread ID

  • Each thread has its own ID, which can be retrieved with omp_get_thread_num()
  • The total number of threads can also be retrieved with omp_get_num_threads()
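The two calls above can be sketched as follows. This is a minimal example, not from the slides; the serial fallback stubs are an added convenience so it builds with or without OpenMP enabled:

```c
#include <stdio.h>

/* Fall back to serial stubs when compiled without OpenMP support, so
   the sketch still builds and runs either way. */
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Each thread in the parallel region reports its 0-based ID and the
   size of the current thread team. */
void report_threads(void) {
  #pragma omp parallel
  {
    int tid = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    printf("Thread %d of %d\n", tid, nthreads);
  }
}
```

Note that outside any parallel region, omp_get_thread_num() returns 0 and omp_get_num_threads() returns 1.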
SLIDE 15

First exercise - setup

  • Copy the exercises from:
    – cp /home/aidanchalk/md.tar.gz .
    – cp /home/aidanchalk/OpenMP_training.pdf .
  • Extract them to a folder:
    – tar -xvf md.tar.gz
  • Load the intel compiler using source /opt/intel/parallel_studio_xe_2017.2.050/psxevars.sh
  • Compile the initial code with make
  • Test it on Xeon using the jobscript:
    – qsub initial.pbs
  • Check the output by looking at output.txt. Record the runtime.

SLIDE 16

First exercise

  • Copy the original code to ex_1:
    – cp initial/md.* ex_1/.
  • Add OpenMP loop-based parallelism to the compute_step and update routines.
  • To build it use make ex1
  • Test it on Xeon and on Xeon Phi KNC:
    – Xeon (copy from /home/aidanchalk/ex1_xeon.pbs): qsub ex1_xeon.pbs
    – Xeon Phi: qsub ex1_phi.pbs
  • How does the runtime compare to the original serial version? The runtimes.out file contains the runtime on each number of cores from 1 to 32.

SLIDE 17

First exercise

  • Add schedule(runtime) to the OpenMP loop in compute_step, add export OMP_SCHEDULE="guided,8" to the jobscript, and compare the runtime.
  • Try other schedules (dynamic, auto) and other chunksizes to see how it affects the runtime.

SLIDE 18

First exercise – potential issue.

  • If the performance is worse, use make ex1_opt and check the optrpt to see if the compute_step loop was vectorised.
  • If not, write the loop without using a reduction variable; this should allow the compiler to vectorise the code and get better performance than the serial version.

SLIDE 19

Task-Based Parallelism

  • Task-Based Parallelism is an alternative methodology for shared-memory parallel computing.
  • Rather than managing threads explicitly, we break the work into parallelisable chunks, known as tasks.
  • Between tasks, we keep track of data flow/dependencies (and potential race conditions).
  • With this information, we can safely execute independent tasks in parallel.

SLIDE 20

Diagram of TBP

[Figure: diagram of task-based parallelism, built up over three animation slides; images not recoverable]

SLIDE 23

OpenMP Tasks

  • OpenMP added tasks in 3.0, and additions to them have been included in both 4.0 and 4.5.
  • The earliest addition was the task keyword.
  • OpenMP 4.5 added an easier option – the taskloop.
  • In OpenMP, tasks are not (usually) executed until the next barrier or unless you use a taskwait barrier.
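As a minimal sketch of that completion guarantee (the helper name fill_squares is hypothetical, not from the exercises): tasks spawned from a single region are only guaranteed finished once a taskwait (or the region's barrier) is reached.

```c
/* Sketch: each loop iteration is spawned as a task from a single
   region; the taskwait guarantees every task has run before the
   function returns. Also compiles and runs correctly serially. */
void fill_squares(int *out, int n) {
  #pragma omp parallel shared(out, n)
  #pragma omp single
  {
    for (int i = 0; i < n; i++) {
      #pragma omp task firstprivate(i)
      out[i] = i * i;   /* out is shared from the enclosing scope */
    }
    #pragma omp taskwait /* all spawned tasks are done past this point */
  }
}
```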

SLIDE 24

Taskloop

  • Used to execute a loop using tasks.
  • Note: taskloop is not a worksharing construct (like OpenMP for) – you need to run it inside a single region unless you want to perform the loop multiple times.
  • You can define either num_tasks or grainsize to control the amount of work in each task.
  • (gcc-6.3 & gcc-7 bug – always declare the loop variable in the loop, i.e. for(int i = 0; …), for taskloop).

SLIDE 25

Taskloop

#pragma omp parallel default(none) \
    shared(a, b, c)
{
  #pragma omp single
  #pragma omp taskloop grainsize(1000)
  for(int i = 0; i < 100000; i++) {
    c[i] = a[i] + b[i];
  }
}

!$omp parallel default(none) &
!$omp shared(a, b, c) private(i)
!$omp single
!$omp taskloop num_tasks(1000)
do i=1,100000
  c(i) = b(i) + a(i)
end do
!$omp end taskloop
!$omp end single
!$omp end parallel

SLIDE 26

Second Exercise

  • Create a new directory called ex_2 and copy the ex_1/md.XXX files to it.
  • Alter your implementation of compute_step to use taskloop rather than the do/for loop you used before.
  • Note – you can't use a reduction variable for the energy with taskloop.
  • Build with make ex2
  • How does the runtime compare to your previous version?
  • How does altering the grainsize (or number of tasks) affect your results?

SLIDE 27

Second Exercise

  • If your new code is substantially slower, use make ex2_opt and look at the optimization report.
  • If the code doesn't vectorise, avoid updating the arrays directly in the inner loop – instead sum to a temporary and sum to the array in the outer loop.

SLIDE 28

Explicit Tasks

  • Taskloop is helpful for simply using tasks with an unbalanced loop; however, sometimes we want more control over how our tasks are spawned.
  • We can spawn tasks ourselves, using the task pragma.
  • To create an explicit task, we put the task pragma around a section of code we want to have executed as a task, and apply the relevant data-sharing clauses.
  • The firstprivate clause is useful: Any task-private data that needs to be input to a task region should be declared as firstprivate.

SLIDE 29

Explicit Tasks

  • Usually we will spawn tasks in a single region (and certain OpenMP definitions will only work if we do).
  • If we have completely independent tasks, we may be better off spawning them inside a parallel for.
  • Note: we cannot use reduction variables inside tasks (this is in discussion for OpenMP 5.0).
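Until then, a common workaround is to reduce by hand. A sketch (the helper name tasked_sum is hypothetical): each task accumulates into a local temporary and commits it with a single atomic update.

```c
/* Sketch of a reduction-free workaround: reduction clauses cannot be
   used on task constructs in OpenMP 4.5, so each task sums into a
   local variable and commits it once via an atomic update. */
long tasked_sum(int n, int chunk) {
  long total = 0;
  #pragma omp parallel shared(total)
  #pragma omp single
  for (int start = 0; start < n; start += chunk) {
    #pragma omp task firstprivate(start) shared(total)
    {
      int end = (start + chunk > n) ? n : start + chunk;
      long local = 0;               /* per-task partial sum */
      for (int i = start; i < end; i++)
        local += i;
      #pragma omp atomic
      total += local;               /* one atomic commit per task */
    }
  }
  /* the implicit barrier at the end of the parallel region completes
     any remaining tasks before total is read */
  return total;
}
```

One atomic per task (rather than per element) keeps the contention low, which is why the per-task temporary matters.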

SLIDE 30

Explicit Tasks

#pragma omp parallel default(none) \
    shared(a, b, c) private(i)
{
  #pragma omp single
  for(i = 0; i < 100000; i+=1000) {
    #pragma omp task default(none) \
        shared(a,b,c) firstprivate(i)
    for(int j = 0; j < 1000; j++)
      c[i+j] = a[i+j] + b[i+j];
  }
}

!$omp parallel default(none) &
!$omp shared(a, b, c) private(i,j)
!$omp do
do i=1,100000,1000
  !$omp task default(none) &
  !$omp shared(a, b, c) private(j) &
  !$omp firstprivate(i)
  do j=0, 999
    c(i+j) = b(i+j) + a(i+j)
  end do
  !$omp end task
end do
!$omp end do
!$omp end parallel

SLIDE 31

Third Exercise

  • Create a new folder (ex_3) and copy the original files (initial/md.XX) to ex_3/md.XX
  • Break down the outer loop in compute_step and create explicit tasks. Copy your code from ex_1 to parallelise the update function.
  • Build with make ex3
  • How does the runtime compare to your previous versions?
  • What size tasks perform best for explicit tasks?
  • Parallelise the update function using explicit tasks.
  • Does this improve the performance?
SLIDE 32

Dependencies

  • OpenMP 4.0 introduced the depend keyword extension to tasks.
  • When declaring a task, we can specify data as in, out, or inout.
  • Tasks with in dependencies on some data are dependent on any previously created task with an out or inout dependency on the same data.
  • Tasks with an out or inout dependency are dependent on any previously created task with any dependency on the same data.

SLIDE 33

Dependencies

  • Note 1: Dependencies on array sections need to be identically sized sections.
  • Note 2: Dependent tasks must be spawned by the same thread.
  • Note 3: Tasks may have as many depend clauses as necessary.

SLIDE 34

Dependencies

#pragma omp parallel default(none) \
    shared(x,n)
{
  #pragma omp single nowait
  for(int i = 0; i < n; i++)
    for(int j = 0; j < n; j++) {
      #pragma omp task firstprivate(i,j) \
          depend(inout: x[i]) shared(x)
      some_function(i, j, x);
    }
}

!$omp parallel default(none) &
!$omp shared(x,n) private(i,j)
!$omp single
do i=1, n
  do j=1, n
    !$omp task firstprivate(i,j) &
    !$omp depend(inout: x(i)) shared(x)
    Call some_function(i,j,x)
    !$omp end task
  end do
end do
!$omp end single nowait
!$omp end parallel

SLIDE 35

Fourth Exercise

  • Copy your code from ex_3/ to ex_4/
  • Create dependencies between the tasks in compute_step and in update. Build with make ex4
  • Make sure there are no OpenMP barriers between the start of compute_step and the end of update, else the synchronization will result in OpenMP ignoring the dependencies.
  • Advanced: If there is time, try to change the code from one-sided updates (i.e. updates only to particle i) to symmetric updates.

SLIDE 36

Taskyield keyword

  • OpenMP 4.0 also added the taskyield keyword.
  • When a thread reaches the keyword, it may suspend execution of the current task in favour of a different task.
  • Not supported in all runtimes:
  • In Intel, the current task remains in the executing thread's stack, and will be completed after all other tasks are complete.
  • Unimplemented in gcc-6 and gcc-7 (will compile, but the taskyield function in the runtime is empty).

SLIDE 37

Taskyield keyword

void task_function(){
  some_work();
  while(!omp_test_lock(x)) {
    #pragma omp taskyield
  }
  some_critical_work();
  omp_unset_lock(x);
}

Subroutine task_function()
  Call some_work()
  do while(.not. omp_test_lock(x))
    !$omp taskyield
  end do
  call some_critical_work()
  call omp_unset_lock(x)
End Subroutine

SLIDE 38

Task priorities

  • OpenMP 4.5 introduced the priority keyword that can be added to task constructs.
  • Task priorities are used in many task-based libraries to hint to the runtime that some tasks are more important to compute earlier during the computation (due to length, number of dependents etc.).
  • Neither Intel nor gcc currently implements this feature of the standard (ignoring it is standard compliant).

SLIDE 39

Task priorities

  • Priorities can be specified between 0 and OMP_MAX_TASK_PRIORITY.
  • This can only be set as an environment variable: export OMP_MAX_TASK_PRIORITY=100000
  • It can be retrieved at runtime by calling omp_get_max_task_priority().
SLIDE 40

Task priorities

#pragma omp parallel default(none) \
    shared(a, b, c) private(i)
{
  #pragma omp single
  for(i = 0; i < 100000; i+=1000) {
    #pragma omp task default(none) \
        shared(a,b,c) firstprivate(i) \
        priority(i)
    for(int j = 0; j < 1000; j++)
      c[i+j] = a[i+j] + b[i+j];
  }
}

!$omp parallel default(none) &
!$omp shared(a, b, c) private(i,j)
!$omp do
do i=1,100000,1000
  !$omp task default(none) &
  !$omp shared(a, b, c) private(j) &
  !$omp firstprivate(i) priority(i)
  do j=0, 999
    c(i+j) = b(i+j) + a(i+j)
  end do
  !$omp end task
end do
!$omp end do
!$omp end parallel

SLIDE 41

Upcoming features

Features announced for OpenMP 5.0 and other features in discussion.

SLIDE 42

Upcoming features

  • OpenMP 5.0 – current draft:
  • Reduction variables can be used with task and taskloop.
  • Can give a taskwait dependencies – i.e. provides a synchronization point on a specific variable – useful for many-to-many dependencies.
  • In discussion:
  • Commutative dependencies – i.e. dependencies which can be done in any order but not simultaneously (to avoid race conditions).

SLIDE 43

Reduction Variables with Tasks.

  • Rather than declaring data with a reduction clause, we instead use a task_reduction clause.
  • For a task to use the reduction variable, it must be passed to the task with the in_reduction clause.
  • The reduction will be complete at the end of the region.
SLIDE 44

Reduction Variables with Tasks

#pragma omp parallel for \
    default(none) task_reduction(+:sum)
for(int i = 0; i < 10000; i++){
  #pragma omp task firstprivate(i) \
      in_reduction(+:sum)
  {
    sum += i;
  }
}

!$omp parallel do default(none) &
!$omp task_reduction(+:sum) private(i)
do i=1,10000
  !$omp task in_reduction(+:sum) &
  !$omp firstprivate(i)
  sum = sum + i
  !$omp end task
end do
!$omp end parallel do

SLIDE 45

Taskwait Dependencies

  • In OpenMP 4.5, taskwait specifies a barrier which forces all previously spawned tasks in the same region to be completed before progression.
  • In OpenMP 5.0, we can provide a depend clause to a taskwait.
  • In this case, the taskwait behaves as though we specified an empty task with this dependency.

SLIDE 46

Taskwait Dependencies

// Using tasks with inout dependencies (OpenMP 4.5):
for(int i = 0; i < 100; i++){
  #pragma omp task depend(in:a)
  function_using(a);
  #pragma omp task depend(in:a,b)
  function_using_both(a,b);
  #pragma omp task depend(in:c)
  function_using(c);
}
#pragma omp task depend(inout:a)
function_writes(a);
#pragma omp task depend(inout:b)
function_writes(b);

// Using a taskwait with dependencies (OpenMP 5.0):
for(int i = 0; i < 100; i++){
  #pragma omp task depend(in:a)
  function_using(a);
  #pragma omp task depend(in:a,b)
  function_using_both(a,b);
  #pragma omp task depend(in:c)
  function_using(c);
}
#pragma omp taskwait depend(out:a,b)
#pragma omp task
function_writes(a);
#pragma omp task
function_writes(b);

SLIDE 47

Commutative Dependencies

  • While not yet part of the draft OpenMP 5.0 standard, commutative dependencies (aka concurrent dependencies or conflicts) are expected to come to OpenMP in the future.
  • Many other task-based systems make use of these, e.g. StarPU, QUARK, SWIFT (Cosmology code).

SLIDE 48

Commutative Dependencies

  • If dependencies are marked as commutative, it means the order in which they execute is not important; however, there is (usually) a race condition on data writes between the tasks.
  • We can use taskyield or recursive tasks, combined with locks, to create something similar in OpenMP 4.5.

SLIDE 49

Commutative Dependencies: taskyield

void task_function(omp_lock_t **locks, int nr_locks) {
  int success = 0;
  while(!success) {
    int i;
    for(i = 0; i < nr_locks; i++)
      if(!omp_test_lock(locks[i]))
        break;
    if(i < nr_locks){
      success = 0;
      for(int j = i-1; j >= 0; j--)
        omp_unset_lock(locks[j]);
      #pragma omp taskyield
    } else
      success = 1;
  }
  do_work();
  for(int i = 0; i < nr_locks; i++)
    omp_unset_lock(locks[i]);
}

SLIDE 50

Commutative Dependencies: recursive tasks.

void task_function(omp_lock_t **locks, int nr_locks) {
  int i;
  for(i = 0; i < nr_locks; i++)
    if(!omp_test_lock(locks[i]))
      break;
  if(i < nr_locks){
    for(int j = i-1; j >= 0; j--)
      omp_unset_lock(locks[j]);
    #pragma omp task
    task_function(locks, nr_locks);
    return;
  }
  do_work();
  for(i = 0; i < nr_locks; i++)
    omp_unset_lock(locks[i]);
}

SLIDE 51

How to use KNL

aidanchalk@sl142-mic1:~$ export LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/mic:$LD_LIBRARY_PATH
aidanchalk@sl142-mic1:~$ ./MIC

OR

[aidanchalk@sl142 ~]$ export SINK_LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/mic
[aidanchalk@sl142 ~]$ micnativeloadex MIC