

  1. OpenMP 4.0 and Beyond! Aidan Chalk, Hartree Centre, STFC

  2. What is OpenMP? • OpenMP is an API & standard for shared-memory parallel computing. • It works with C, C++ and Fortran. • It was first released in 1997, and version 4.5 was released in 2015. • It can now be used with accelerators such as GPUs, Xeon Phi and FPGAs.

  3. The Basics • The OpenMP API uses pragmas (compiler directives) to tell the compiler what to parallelise. • OpenMP is most commonly used for fork-join parallelism.

  4. The Basics
  • All parallel code is performed inside a parallel region:

  C:
    #pragma omp parallel
    {
      //Parallel code goes here.
    }

  Fortran:
    !$omp parallel
    !Parallel code goes here
    !$omp end parallel

  5. The Basics • The number of threads to use in a parallel region can be controlled in 3 ways: – the environment variable: export OMP_NUM_THREADS=x – the runtime API call: void omp_set_num_threads(int x) – the num_threads clause: C: #pragma omp parallel num_threads(x) / Fortran: !$omp parallel num_threads(x)
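
  As a rough combined illustration (a sketch, not taken from the slides), the small C program below uses all three methods at once; the num_threads clause overrides omp_set_num_threads(), which in turn overrides OMP_NUM_THREADS:

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        /* 1) Environment variable (set in the shell): export OMP_NUM_THREADS=8 */

        /* 2) Runtime API call - overrides the environment variable. */
        omp_set_num_threads(4);

        /* 3) num_threads clause - overrides both of the above for this region. */
        #pragma omp parallel num_threads(2)
        {
            #pragma omp single
            printf("Team size: %d threads\n", omp_get_num_threads());
        }
        return 0;
    }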

  6. The Basics
  • The most common use case is a parallel loop:

  C:
    #pragma omp parallel for
    for(int i = 0; i < 100000; i++)
      c[i] = a[i] + b[i];

  Fortran:
    !$omp parallel do
    do i=1, 100000
      c(i) = a(i) + b(i)
    end do
    !$omp end parallel do

  7. Data-Sharing Clauses • One of the most important things to get correct when using OpenMP is the data-sharing clauses. • Always start with #pragma omp parallel default(none). • This makes bugs less likely and easier to track down.

  8. Commonly Used Data-Sharing Clauses • shared: allows a variable to be accessed by all of the threads inside the parallel region (beware of race conditions). • private: creates an uninitialised copy of the variable for each thread. At the end of the region, the data is lost. • reduction: creates a copy of the variable for each thread, initialised according to the type of reduction chosen. Example operators are +, *, -, &, etc. At the end of the region, the original variable contains the reduction over all of the threads.

  9. Using Data-Sharing Clauses

  C:
    int i, sum = 0, a[100000];
    int b[100000], c[100000];

    #pragma omp parallel for default(none) \
        shared(a,b,c) private(i) reduction(+:sum)
    for(i = 0; i < 100000; i++) {
      c[i] = a[i] + b[i];
      sum += c[i];
    }

  Fortran:
    Integer, Dimension(100000) :: a, b, c
    Integer :: i, sums

    sums = 0
    !$omp parallel do default(none) &
    !$omp shared(a, b, c) private(i) &
    !$omp reduction(+:sums)
    do i=1,100000
      c(i) = b(i) + a(i)
      sums = sums + c(i)
    end do
    !$omp end parallel do

  10. Controlling Loop Scheduling • OpenMP also allows the user to specify how a loop is executed, using the schedule clause. • The default option is static: the loop is broken into roughly equal chunks, one per thread, and each thread executes its chunk. • You can specify the size of the chunks manually; schedule(static,100) will create chunks of size 100. • Other options: guided, dynamic, auto, runtime. • Usually static or guided will give the best performance.

  11. Controlling Loop Scheduling • The other commonly used options are: • guided, chunksize: the iterations are assigned to threads in chunks. Each chunk is proportional to the number of remaining iterations, and no smaller than the chunk size. • dynamic, chunksize: the iterations are distributed to threads in chunks. Each thread executes a chunk, then requests another chunk once it has completed. • Usually static or guided will give the best performance.

  12. Controlling Loop Scheduling

  C:
    #pragma omp parallel for default(none) \
        shared(a, b, c) private(i) \
        schedule(guided,1000)
    for(i = 0; i < 100000; i++) {
      c[i] = a[i] + b[i];
    }

  Fortran:
    !$omp parallel do default(none) &
    !$omp shared(a, b, c) private(i) &
    !$omp schedule(dynamic, 1000)
    do i=1,100000
      c(i) = b(i) + a(i)
    end do
    !$omp end parallel do

  13. Thread ID • Each thread has its own ID, which can be retrieved with omp_get_thread_num() • The total number of threads can also be retrieved with omp_get_num_threads()
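
  A minimal C sketch (not from the slides) using both calls:

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        #pragma omp parallel
        {
            int tid      = omp_get_thread_num();   /* this thread's ID, starting at 0 */
            int nthreads = omp_get_num_threads();  /* number of threads in the team */
            printf("Hello from thread %d of %d\n", tid, nthreads);
        }
        return 0;
    }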

  14. First exercise - setup • Copy the exercises from: – cp /home/aidanchalk/md.tar.gz . – cp /home/aidanchalk/OpenMP_training.pdf . • Extract them to a folder: – tar -xvf md.tar.gz • Load the Intel compiler using source /opt/intel/parallel_studio_xe_2017.2.050/psxevars.sh • Compile the initial code with make • Test it on the Xeon using the jobscript: – qsub initial.pbs • Check the output by looking at output.txt. Record the runtime.

  15. First exercise • Copy the original code to ex_1: – cp initial/md.* ex_1/. • Add OpenMP loop-based parallelism to the compute_step and update routines. • To build it use make ex1 • Test it on Xeon and on Xeon Phi KNC: – Xeon (copy from /home/aidanchalk/ex1_xeon.pbs): qsub ex1_xeon.pbs – Xeon Phi: qsub ex1_phi.pbs • How does the runtime compare to the original serial version? The runtimes.out file contains the runtime on each number of cores from 1 to 32.

  16. First exercise • Add schedule(runtime) to the OpenMP loop in compute_step, add export OMP_SCHEDULE="guided,8" to the jobscript, and compare the runtime. • Try other schedules (dynamic, auto) and other chunksizes to see how it affects the runtime.
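
  A hedged sketch of the schedule(runtime) pattern (illustrative loop only, not the md code); the actual schedule is then picked up from OMP_SCHEDULE in the jobscript:

    #include <stdio.h>
    #define N 100000

    /* Illustrative only: the real loop lives in compute_step in md.*.
     * The schedule is chosen at run time, e.g. in the jobscript:
     *   export OMP_SCHEDULE="guided,8"                              */
    int main(void)
    {
        static double a[N], b[N], c[N];

        #pragma omp parallel for default(none) shared(a, b, c) schedule(runtime)
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[0] = %f\n", c[0]);
        return 0;
    }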

  17. First exercise – potential issue • If the performance is worse, use make ex1_opt and check the optrpt to see if the compute_step loop was vectorised. • If not, write the loop without using a reduction variable; this should allow the compiler to vectorise the code and give better performance than the serial version (one possible rewrite is sketched below).
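
  One possible reading of that advice, shown as a sketch with invented array names rather than the md code: each iteration writes its own contribution to an array element, so the hot loop contains no reduction variable, and the total is formed in a separate cheap pass:

    #include <stdio.h>
    #define N 100000

    /* Illustrative pattern only - names are made up, not from md.c. */
    int main(void)
    {
        static double a[N], b[N], contrib[N];

        /* Hot loop: each iteration writes its own element, so there is no
         * reduction variable and the compiler is free to vectorise it.    */
        #pragma omp parallel for default(none) shared(a, b, contrib)
        for (int i = 0; i < N; i++)
            contrib[i] = a[i] * b[i];

        /* Cheap second pass to form the total (could itself be a reduction). */
        double total = 0.0;
        for (int i = 0; i < N; i++)
            total += contrib[i];

        printf("total = %f\n", total);
        return 0;
    }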

  18. Task-Based Parallelism • Task-Based Parallelism is an alternative methodology for shared-memory parallel computing. • Rather than managing threads explicitly, we break the work into parallelisable chunks, known as tasks . • Between tasks, we keep track of data flow/dependencies (and potential race conditions). • With this information, we can safely execute independent tasks in parallel.

  19.–21. Diagrams of Task-Based Parallelism (figures only)

  22. OpenMP Tasks • OpenMP added tasks in 3.0, and additions to them have been included in both 4.0 and 4.5. • The earliest addition was the task construct. • OpenMP 4.5 added an easier option – the taskloop. • In OpenMP, tasks are not (usually) executed until the next barrier, or unless you use a taskwait.
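
  A minimal C sketch (not from the slides) of task together with taskwait; the two tasks may run on any threads in the team, and taskwait blocks the spawning thread until both have completed:

    #include <stdio.h>

    int main(void)
    {
        int x = 0, y = 0;

        #pragma omp parallel
        #pragma omp single           /* one thread spawns the tasks */
        {
            #pragma omp task shared(x)
            x = 1;                   /* may execute on any thread in the team */

            #pragma omp task shared(y)
            y = 2;

            #pragma omp taskwait     /* wait here until both tasks have completed */
            printf("x = %d, y = %d\n", x, y);
        }
        return 0;
    }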

  23. Taskloop • Used to execute a loop using tasks. • Note: taskloop is not a worksharing construct (unlike OpenMP for) – you need to run it inside a single region unless you want the loop to be performed multiple times. • You can use either num_tasks or grainsize to control the amount of work in each task. • (gcc-6.3 & gcc-7 bug – always declare the loop variable in the for statement, i.e. for(int i = 0; …), when using taskloop.)

  24. Taskloop

  C:
    #pragma omp parallel default(none) \
        shared(a, b, c)
    {
      #pragma omp single
      #pragma omp taskloop grainsize(1000)
      for(int i = 0; i < 100000; i++)
      {
        c[i] = a[i] + b[i];
      }
    }

  Fortran:
    !$omp parallel default(none) &
    !$omp shared(a, b, c) private(i)
    !$omp single
    !$omp taskloop num_tasks(1000)
    do i=1,100000
      c(i) = b(i) + a(i)
    end do
    !$omp end taskloop
    !$omp end single
    !$omp end parallel

  25. Second Exercise • Create a new directory called ex_2 and copy the ex_1/md.XXX files to it. • Alter your implementation of compute_step to use taskloop rather than the do/for loop you used before. • Note – you can't use a reduction variable for the energy with taskloop. • Build with make ex2 • How does the runtime compare to your previous version? • How does altering the grainsize (or number of tasks) affect your results?

  26. Second Exercise • If your new code is substantially slower, use make ex2_opt and look at the optimization report. • If the code doesn’t vectorise, avoid updating the arrays directly in the inner loop – instead sum to a temporary and sum to the array in the outer loop.
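
  A hedged sketch of that pattern with invented names (not the md code): the inner loop sums into a scalar temporary, and the array is updated once per outer iteration:

    #define N 1000

    /* Illustrative only: sum the contributions for row i into a scalar,
     * then update the output array once per outer iteration.            */
    void accumulate_rows(const double f[N][N], double out[N])
    {
        for (int i = 0; i < N; i++) {
            double tmp = 0.0;              /* scalar temporary for this i */
            for (int j = 0; j < N; j++)
                tmp += f[i][j];            /* inner loop only reads and sums */
            out[i] += tmp;                 /* single array update per outer iteration */
        }
    }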

  27. Explicit Tasks • Taskloop is convenient for simply using tasks with an unbalanced loop; however, sometimes we want more control over how our tasks are spawned. • We can spawn tasks ourselves, using the task pragma. • To create an explicit task, we put the task pragma around a section of code we want executed as a task, and apply the relevant data-sharing clauses. • The firstprivate clause is useful here: any task-private data that needs to be an input to a task region should be declared firstprivate.

  28. Explicit Tasks • Usually we will spawn tasks in a single region (and certain OpenMP definitions will only work if we do). • If we have completely independent tasks, we may be better off spawning them inside a parallel for/do loop. • Note: we cannot use reduction variables inside tasks (task reductions are under discussion for OpenMP 5.0).

  29. Explicit Tasks

  C:
    #pragma omp parallel default(none) \
        shared(a, b, c) private(i)
    {
      #pragma omp single
      for(i = 0; i < 100000; i+=1000)
      {
        #pragma omp task default(none) \
            shared(a,b,c) firstprivate(i)
        for(int j = 0; j < 1000; j++)
          c[i+j] = a[i+j] + b[i+j];
      }
    }

  Fortran:
    !$omp parallel default(none) &
    !$omp shared(a, b, c) private(i,j)
    !$omp do
    do i=1,100000,1000
      !$omp task default(none) &
      !$omp shared(a, b, c) private(j) &
      !$omp firstprivate(i)
      do j=0, 999
        c(i+j) = b(i+j) + a(i+j)
      end do
      !$omp end task
    end do
    !$omp end do
    !$omp end parallel

  30. Third Exercise • Create a new folder (ex_3) and copy the original files (initial/md.XX) to ex_3/md.XX • Break down the outer loop in compute_step and create explicit tasks. Copy your code from ex_1 to parallelise the update function. • Build with make ex3 • How does the runtime compare to your previous versions? • What size tasks perform best for explicit tasks? • Now parallelise the update function using explicit tasks. • Does this improve the performance?
