OpenMP 4.0 and Beyond!
Aidan Chalk, Hartree Centre, STFC
What is OpenMP?
- OpenMP is an API & standard for shared memory parallel
computing.
- Works with C, C++ and Fortran.
- It was first released in 1997, and version 4.5 was released
in 2015.
- Can now be used with accelerators such as GPUs, Xeon
Phi & FPGA.
The Basics
- OpenMP API uses pragmas to tell the compiler what to
parallelise.
- OpenMP is commonly used to do fork-join parallelism.
The Basics
- All parallel code is performed inside a parallel region:
C:
#pragma omp parallel
{
    //Parallel code goes here.
}

Fortran:
!$omp parallel
!Parallel code goes here
!$omp end parallel
The Basics
- The number of threads to use in a parallel region can be
controlled in 3 ways:
– export OMP_NUM_THREADS=x
– void omp_set_num_threads(int x)
– C: #pragma omp parallel num_threads(x)
  FORTRAN: !$omp parallel num_threads(x)
The Basics
- The most common use case is a parallel loop:
C:
#pragma omp parallel for
for(int i = 0; i < 100000; i++)
    c[i] = a[i] + b[i];

Fortran:
!$omp parallel do
do i=1, 100000
    c(i) = a(i) + b(i)
end do
!$omp end parallel do
Data-Sharing Clauses
- One of the most important things to get correct when
using OpenMP is data clauses.
- ALWAYS start with #pragma omp parallel default(none)
- Makes bugs less likely and easier to track down.
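The effect of default(none) can be shown with a small sketch (the function name and data are illustrative, not from the exercise code): if shared(a, n) or reduction(+:sum) were omitted, the compiler would reject the pragma instead of silently racing.

```c
/* Sketch: default(none) forces every variable used in the region
   to be scoped explicitly, turning a forgotten clause into a
   compile-time error rather than a silent data race. */
int sum_array(const int *a, int n) {
    int sum = 0;
    #pragma omp parallel for default(none) shared(a, n) reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i];   /* each thread sums a private copy; combined at the end */
    return sum;
}
```

The loop variable i is automatically private, so only a, n and sum need clauses.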
Commonly Used Data-Sharing Clauses
- shared: Allows a variable to be accessed by all of the
threads inside a parallel region – care for race conditions.
- private: Creates an uninitialized copy of the variable for
each thread. At the end of the region, the data is lost.
- reduction: Creates a copy of the variable for each
  thread, initialised depending on the type of reduction
  chosen. Example options are +, *, -, & etc. At the end of
  the region, the original variable contains the reduction of all of the threads.
Using Data-Sharing Clauses
C:
int i, sum = 0, a[100000];
int b[100000], c[100000];
#pragma omp parallel for default(none) \
    shared(a,b,c) private(i) reduction(+:sum)
for(i = 0; i < 100000; i++) {
    c[i] = a[i] + b[i];
    sum += c[i];
}

Fortran:
Integer, Dimension(100000) :: a, b, c
Integer :: i, sums
!$omp parallel do default(none) &
!$omp shared(a, b, c) private(i) &
!$omp reduction(+:sums)
do i=1,100000
    c(i) = b(i) + a(i)
    sums = sums + c(i)
end do
!$omp end parallel do
Controlling Loop Scheduling
- OpenMP also allows the user to specify how the loop is
executed, using the schedule option.
- The default option is static. The loop is broken into
nr_threads equal chunks, and each thread executes a chunk.
- You can specify the size of the chunks manually:
  schedule(static, 100) will create chunks of size 100.
- Other options: guided, dynamic, auto, runtime.
- Usually static or guided will give best performance.
Controlling Loop Scheduling
- The other commonly used options are:
- guided, chunksize : The iterations are assigned to
threads in chunks. Each chunk will be proportional to the number of remaining iterations, and no less than the chunk size.
- dynamic, chunksize : The iterations are distributed to
threads in chunks. Each thread executes a chunk, then requests another chunk once it has completed.
- Usually static or guided will give best performance.
Controlling Loop Scheduling
C:
#pragma omp parallel for default(none) \
    shared(a, b, c) private(i) \
    schedule(guided, 1000)
for(i = 0; i < 100000; i++) {
    c[i] = a[i] + b[i];
}

Fortran:
!$omp parallel do default(none) &
!$omp shared(a, b, c) private(i) &
!$omp schedule(dynamic, 1000)
do i=1,100000
    c(i) = b(i) + a(i)
end do
!$omp end parallel do
Thread ID
- Each thread has its own ID, which can be retrieved with
  omp_get_thread_num()
- The total number of threads can also be retrieved with
  omp_get_num_threads()
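A minimal sketch using both calls (the function name is illustrative). The #ifdef stubs let the same file compile and run serially when OpenMP is disabled, since the pragmas are then ignored:

```c
/* Sketch: query per-thread IDs inside a parallel region. */
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_thread_num(void)  { return 0; }  /* serial fallback */
static int omp_get_num_threads(void) { return 1; }
#endif

int highest_thread_id(void) {
    int max_id = 0;
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();   /* 0 .. num_threads-1 */
        #pragma omp critical
        if (tid > max_id) max_id = tid;
    }
    return max_id;   /* equals omp_get_num_threads() - 1 inside the region */
}
```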
First exercise - setup
- Copy the exercises from:
  – cp /home/aidanchalk/md.tar.gz .
  – cp /home/aidanchalk/OpenMP_training.pdf .
- Extract them to a folder:
  – tar -xvf md.tar.gz
- Load the intel compiler using source
/opt/intel/parallel_studio_xe_2017.2.050/psxevars.sh
- Compile the initial code with make
- Test it on Xeon using the jobscript:
– qsub initial.pbs
- Check the output by looking at output.txt. Record the
  runtime.
First exercise
- Copy the original code to ex_1:
– cp initial/md.* ex_1/.
- Add OpenMP loop-based parallelism to the compute_step
and update routines.
- To build it use make ex1
- Test it on Xeon and on Xeon Phi KNC:
– Xeon (copy from /home/aidanchalk/ex1_xeon.pbs): qsub ex1_xeon.pbs
– Xeon Phi: qsub ex1_phi.pbs
- How does the runtime compare to the original serial
version? The runtimes.out file contains the runtime on each number of cores from 1 to 32.
First exercise
- Add schedule(runtime) to the OpenMP loop in
  compute_step, add export OMP_SCHEDULE="guided,8" to the jobscript, and compare the runtime.
- Try other schedules (dynamic, auto) and other chunksizes
  to see how it affects the runtime.
First exercise – potential issue.
- If the performance is worse, use make ex1_opt and
check the optrpt to see if the compute_step loop was vectorised.
- If not, write the loop without using a reduction variable,
this should allow the compiler to vectorise the code and get better performance than the serial version.
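One way to write such a loop, sketched below under stated assumptions: the function name, the energy array, and the 64-thread cap are all illustrative, not taken from md.c. Each thread accumulates into a local scalar (vectorisable) and stores its partial sum into a padded slot, avoiding both the reduction clause and false sharing:

```c
/* Sketch: per-thread partial sums instead of reduction(+:...). */
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_thread_num(void)  { return 0; }  /* serial fallback */
static int omp_get_max_threads(void) { return 1; }
#endif

#define PAD 16   /* doubles per slot: keeps threads on separate cache lines */

double sum_energy(const double *e, int n) {
    double partial[64 * PAD] = {0.0};   /* assumes at most 64 threads */
    #pragma omp parallel
    {
        double local = 0.0;             /* scalar accumulator: plain vectorisable sum */
        #pragma omp for
        for (int i = 0; i < n; i++)
            local += e[i];
        partial[omp_get_thread_num() * PAD] = local;
    }
    double total = 0.0;                 /* serial combine of the partials */
    for (int t = 0; t < omp_get_max_threads(); t++)
        total += partial[t * PAD];
    return total;
}
```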
Task-Based Parallelism
- Task-Based Parallelism is an alternative methodology for
shared-memory parallel computing.
- Rather than managing threads explicitly, we break the
work into parallelisable chunks, known as tasks.
- Between tasks, we keep track of data flow/dependencies
(and potential race conditions).
- With this information, we can safely execute independent
tasks in parallel.
Diagram of TBP [task-graph diagrams not reproduced]
OpenMP Tasks
- OpenMP added tasks in 3.0, and additions to them have
  been included in both 4.0 and 4.5.
- The earliest addition was the task keyword.
- OpenMP 4.5 added an easier option – the taskloop.
- In OpenMP, tasks are not (usually) executed until the next
  barrier, or until you use a taskwait.
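The taskwait guarantee can be sketched as follows (variable names are illustrative): after the taskwait, both child tasks are known to have completed, so their results can be read safely.

```c
/* Sketch: explicit tasks are guaranteed complete after a taskwait. */
int run_two_tasks(void) {
    int x = 0, y = 0;
    #pragma omp parallel shared(x, y)
    #pragma omp single       /* one thread spawns; the team executes */
    {
        #pragma omp task shared(x)
        x = 1;
        #pragma omp task shared(y)
        y = 2;
        #pragma omp taskwait /* both child tasks have finished here */
    }
    return x + y;            /* safe: taskwait ordered the writes */
}
```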
Taskloop
- Used to execute a loop using tasks.
- Note: taskloop is not a worksharing construct (like
OpenMP for) – you need to run it inside a single region unless you want to perform the loop multiple times.
- You can define either num_tasks or grainsize to
control the amount of work in each task.
- (gcc-6.3 & gcc-7 bug – always declare the loop variable in
  the for statement, i.e. for(int i = 0; ….), for taskloop).
Taskloop
C:
#pragma omp parallel default(none) \
    shared(a, b, c)
{
    #pragma omp single
    #pragma omp taskloop grainsize(1000)
    for(int i = 0; i < 100000; i++) {
        c[i] = a[i] + b[i];
    }
}

Fortran:
!$omp parallel default(none) &
!$omp shared(a, b, c) private(i)
!$omp single
!$omp taskloop num_tasks(1000)
do i=1,100000
    c(i) = b(i) + a(i)
end do
!$omp end taskloop
!$omp end single
!$omp end parallel
Second Exercise
- Create a new directory called ex_2 and copy the
ex_1/md.XXX files to it.
- Alter your implementation of compute_step to use
taskloop rather than the do/for loop you used before.
- Note – you can't use a reduction variable for the energy
  with taskloop.
- Build with make ex2
- How does the runtime compare to your previous version?
- How does altering the grainsize (or number of tasks)
  affect your results?
Second Exercise
- If your new code is substantially slower, use make
ex2_opt and look at the optimization report.
- If the code doesn't vectorise, avoid updating the arrays
  directly in the inner loop – instead sum into a temporary and add it to the array in the outer loop.
Explicit Tasks
- Taskloop is useful when all we need is to run an
  unbalanced loop with tasks; sometimes, however, we want more control over how our tasks are spawned.
- We can spawn tasks ourselves, using the task pragma.
- To create an explicit task, we put the task pragma around
a section of code we want to have executed as a task, and apply the relevant data-sharing clauses.
- The firstprivate clause is useful: Any task-private
data that needs to be input to a task region should be declared as firstprivate.
Explicit Tasks
- Usually we will spawn tasks in a single region (and certain
OpenMP definitions will only work if we do).
- If we have completely independent tasks, we may be
  better off spawning them inside a parallel for.
- Note: we cannot use reduction variables inside tasks
(this is in discussion for OpenMP 5.0).
Explicit Tasks
C:
#pragma omp parallel default(none) \
    shared(a, b, c) private(i)
{
    #pragma omp single
    for(i = 0; i < 100000; i += 1000) {
        #pragma omp task default(none) \
            shared(a,b,c) firstprivate(i)
        for(int j = 0; j < 1000; j++)
            c[i+j] = a[i+j] + b[i+j];
    }
}

Fortran:
!$omp parallel default(none) &
!$omp shared(a, b, c) private(i,j)
!$omp do
do i=1,100000,1000
    !$omp task default(none) &
    !$omp shared(a, b, c) private(j) &
    !$omp firstprivate(i)
    do j=0, 999
        c(i+j) = b(i+j) + a(i+j)
    end do
    !$omp end task
end do
!$omp end do
!$omp end parallel
Third Exercise
- Create a new folder (ex_3) and copy the original files
(initial/md.XX) to ex_3/md.XX
- Break down the outer loop in the compute_step and
create explicit tasks. Copy your code from ex_1 to parallelise the update function.
- Build with make ex3
- How does the runtime compare to your previous versions?
- What size tasks perform best for explicit tasks?
- Parallelise the update function using explicit tasks.
- Does this improve the performance?
Dependencies
- OpenMP 4.0 introduced the depend keyword extension
to tasks.
- When declaring a task, we can specify data as in, out, or
inout.
- Tasks with in dependencies on some data are dependent
  on any previously created task with an out or inout
  dependency on the same data.
- Tasks with out or inout dependency are dependent on
any previously created task with any dependency on the same data.
Dependencies
- Note 1: Dependencies on array sections need to be
  identically sized sections.
- Note 2: Dependent tasks must be spawned by the same
thread.
- Note 3: Tasks may have as many depend clauses as
necessary.
Dependencies
C:
#pragma omp parallel default(none) \
    shared(x, n)
{
    #pragma omp single nowait
    for(int i = 0; i < n; i++)
        for(int j = 0; j < n; j++) {
            #pragma omp task firstprivate(i,j) \
                depend(inout: x[i]) shared(x)
            some_function(i, j, x);
        }
}

Fortran:
!$omp parallel default(none) &
!$omp shared(x,n) private(i,j)
!$omp single
do i=1, n
    do j=1, n
        !$omp task firstprivate(i,j) &
        !$omp depend(inout: x(i)) shared(x)
        Call some_function(i,j,x)
        !$omp end task
    end do
end do
!$omp end single nowait
!$omp end parallel
Fourth Exercise
- Copy your code from ex_3/ to ex_4/
- Create dependencies between the tasks in compute_step
  and in update. Build with make ex4
- Make sure there are no OpenMP barriers between the
  start of compute_step and the end of update, otherwise the synchronization will result in OpenMP ignoring the dependencies.
- Advanced: If there is time, try to change the code from
  one-sided updates (i.e. updates only to particle i) to
  symmetric updates.
Taskyield keyword
- OpenMP 4.0 also added the taskyield keyword.
- When a thread reaches the keyword, it may suspend
execution of the current task in favour of a different task.
- Not supported in all runtimes:
- In Intel, the current task remains in the executing thread’s
stack, and will be completed after all other tasks are complete.
- Unimplemented in gcc-6 and gcc-7 (will compile, but the
taskyield function in the runtime is empty).
Taskyield keyword
C:
void task_function() {
    some_work();
    while(!omp_test_lock(x)) {
        #pragma omp taskyield
    }
    some_critical_work();
    omp_unset_lock(x);
}

Fortran:
Subroutine task_function()
    Call some_work()
    do while(.not. omp_test_lock(x))
        !$omp taskyield
    end do
    call some_critical_work()
    call omp_unset_lock(x)
End Subroutine
Task priorities
- OpenMP 4.5 introduced the priority clause that can
  be added to task constructs.
- Task priorities are used in many task-based libraries to
hint to the compiler that some tasks are more important to compute earlier during the computation (due to length, number of dependents etc.).
- Neither Intel nor gcc currently implements this feature of
  the standard (ignoring it is standard-compliant).
Task priorities
- Priorities can be specified between 0 and
OMP_MAX_TASK_PRIORITY.
- This can only be set as an environment variable:
- export OMP_MAX_TASK_PRIORITY=100000
- It can be retrieved during runtime, by calling:
- omp_get_max_task_priority().
Task priorities
C:
#pragma omp parallel default(none) \
    shared(a, b, c) private(i)
{
    #pragma omp single
    for(i = 0; i < 100000; i += 1000) {
        #pragma omp task default(none) \
            shared(a,b,c) firstprivate(i) \
            priority(i)
        for(int j = 0; j < 1000; j++)
            c[i+j] = a[i+j] + b[i+j];
    }
}

Fortran:
!$omp parallel default(none) &
!$omp shared(a, b, c) private(i,j)
!$omp do
do i=1,100000,1000
    !$omp task default(none) &
    !$omp shared(a, b, c) private(j) &
    !$omp firstprivate(i) priority(i)
    do j=0, 999
        c(i+j) = b(i+j) + a(i+j)
    end do
    !$omp end task
end do
!$omp end do
!$omp end parallel
Upcoming features
Features announced for OpenMP 5.0 and other features in discussion.
Upcoming features
- OpenMP 5.0 – current draft:
- Reduction variables can be used with task and taskloop
- A taskwait can be given dependencies – i.e. it provides a
  synchronization point on a specific variable – useful for many-to-many dependencies.
- In discussion:
- Commutative dependencies – i.e. dependencies which can
be done in any order but not simultaneously (to avoid race conditions).
Reduction Variables with Tasks.
- Rather than declaring data with a reduction clause, we
instead use a task_reduction clause.
- For a task to use the reduction variable, it must be passed
to the task with the in_reduction clause.
- The reduction will be complete at the end of the region.
Reduction Variables with Tasks
C:
#pragma omp parallel for \
    default(none) task_reduction(+:sum)
for(int i = 0; i < 10000; i++){
    #pragma omp task firstprivate(i) \
        in_reduction(+:sum)
    {
        sum += i;
    }
}

Fortran:
!$omp parallel do default(none) &
!$omp task_reduction(+:sum) private(i)
do i=1,10000
    !$omp task in_reduction(+:sum) &
    !$omp firstprivate(i)
    sum = sum + i
    !$omp end task
end do
!$omp end parallel do
Taskwait Dependencies
- In OpenMP 4.5, taskwait specifies a barrier which forces
all previously spawned tasks in the same region to be completed before progression.
- In OpenMP 5.0, we can provide a depend clause to a
taskwait.
- In this case, the taskwait behaves as though we
specified an empty task with this dependency.
Taskwait Dependencies
OpenMP 4.5 (the writer tasks must carry inout dependencies):

for(int i = 0; i < 100; i++){
    #pragma omp task depend(in:a)
    function_using(a);
    #pragma omp task depend(in:a,b)
    function_using_both(a,b);
    #pragma omp task depend(in:c)
    function_using(c);
}
#pragma omp task depend(inout:a)
function_writes(a);
#pragma omp task depend(inout:b)
function_writes(b);

OpenMP 5.0 (a taskwait with a depend clause replaces the writer dependencies):

for(int i = 0; i < 100; i++){
    #pragma omp task depend(in:a)
    function_using(a);
    #pragma omp task depend(in:a,b)
    function_using_both(a,b);
    #pragma omp task depend(in:c)
    function_using(c);
}
#pragma omp taskwait depend(out:a,b)
#pragma omp task
function_writes(a);
#pragma omp task
function_writes(b);
Commutative Dependencies
- While not yet part of the draft OpenMP 5.0 standard,
  commutative dependencies (aka concurrent dependencies
  or conflicts) are expected to come to OpenMP in the
  future.
- Many other task-based systems make use of these, e.g.
StarPU, QUARK, SWIFT (Cosmology code).
Commutative Dependencies
- If dependencies are marked as commutative, it means the
  order in which they execute is not important; however,
  there is (usually) a race condition on data writes between the tasks.
- We can use taskyield or recursive tasks, combined with
locks, to create something similar in OpenMP 4.5
Commutative Dependencies: taskyield
void task_function(omp_lock_t **locks, int nr_locks) {
    int success = 0;
    while(!success) {
        int i;
        for(i = 0; i < nr_locks; i++)
            if(!omp_test_lock(locks[i]))
                break;
        if(i < nr_locks) {
            /* Failed to get lock i: release those already held. */
            for(int j = i-1; j >= 0; j--)
                omp_unset_lock(locks[j]);
            #pragma omp taskyield
        } else {
            success = 1;
        }
    }
    do_work();
    for(int i = 0; i < nr_locks; i++)
        omp_unset_lock(locks[i]);
}
Commutative Dependencies: recursive tasks.
void task_function(omp_lock_t **locks, int nr_locks) {
    int i;
    for(i = 0; i < nr_locks; i++)
        if(!omp_test_lock(locks[i]))
            break;
    if(i < nr_locks) {
        /* Failed: release held locks and respawn ourselves as a new task. */
        for(int j = i-1; j >= 0; j--)
            omp_unset_lock(locks[j]);
        #pragma omp task
        task_function(locks, nr_locks);
        return;
    }
    do_work();
    for(i = 0; i < nr_locks; i++)
        omp_unset_lock(locks[i]);
}
How to use KNL
- On the card itself:
  aidanchalk@sl142-mic1:~$ export LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/mic:$LD_LIBRARY_PATH
  aidanchalk@sl142-mic1:~$ ./MIC
- OR, from the host:
  [aidanchalk@sl142 ~]$ export SINK_LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/mic
  [aidanchalk@sl142 ~]$ micnativeloadex MIC