SLIDE 1

OpenMP 4.0 and Beyond!

Aidan Chalk, Hartree Centre, STFC

SLIDE 3

What is OpenMP?

  • OpenMP is an API & standard for shared-memory parallel computing.
  • Works with C, C++ and Fortran.
  • It was first released in 1997, and version 4.5 was released in 2015.
  • Can now be used with accelerators such as GPUs, Xeon Phi & FPGAs.

SLIDE 4

The Basics

  • The OpenMP API uses pragmas to tell the compiler what to parallelise.
  • OpenMP is commonly used to do fork-join parallelism.
SLIDE 5

The Basics

  • All parallel code is performed inside a parallel region:

#pragma omp parallel
{
  //Parallel code goes here.
}

!$omp parallel
!Parallel code goes here
!$omp end parallel

SLIDE 6

The Basics

  • The number of threads to use in a parallel region can be controlled in 3 ways:
    – export OMP_NUM_THREADS=x
    – void omp_set_num_threads(int x)
    – C: #pragma omp parallel num_threads(x)
    – FORTRAN: !$omp parallel num_threads(x)

SLIDE 7

The Basics

  • The most common use case is a parallel loop:

#pragma omp parallel for
for(int i = 0; i < 100000; i++)
  c[i] = a[i] + b[i];

!$omp parallel do
do i=1, 100000
  c(i) = a(i) + b(i)
end do
!$omp end parallel do

SLIDE 8

Data-Sharing Clauses

  • One of the most important things to get correct when using OpenMP is data clauses.
  • ALWAYS start with #pragma omp parallel default(none)
  • Makes bugs less likely and easier to track down.
SLIDE 9

Commonly Used Data-Sharing Clauses

  • shared: Allows a variable to be accessed by all of the threads inside a parallel region – care for race conditions.
  • private: Creates an uninitialized copy of the variable for each thread. At the end of the region, the data is lost.
  • reduction: Creates a copy of the variable for each thread, initialised depending on the type of reduction chosen. Example options are +, *, -, & etc. At the end of the region, the original variable contains the reduction of all of the threads.

SLIDE 10

Using Data-Sharing Clauses

int i, sum, a[100000];
int b[100000], c[100000];
#pragma omp parallel for default(none) \
    shared(a,b,c) private(i) reduction(+:sum)
for(i = 0; i < 100000; i++) {
  c[i] = a[i] + b[i];
  sum += c[i];
}

Integer, Dimension(100000)::a,b,c
Integer :: i, sums
!$omp parallel do default(none) &
!$omp shared(a, b, c) private(i) &
!$omp reduction(+:sums)
do i=1,100000
  c(i) = b(i) + a(i)
  sums = sums + c(i)
end do
!$omp end parallel do

SLIDE 11

Controlling Loop Scheduling

  • OpenMP also allows the user to specify how the loop is executed, using the schedule clause.
  • The default option is static. The loop is broken into nr_threads equal chunks, and each thread executes a chunk.
  • You can specify the size of the chunks manually: schedule(static,100) will create chunks of size 100.
  • Other options: guided, dynamic, auto, runtime.
  • Usually static or guided will give best performance.
SLIDE 12

Controlling Loop Scheduling

  • The other commonly used options are:
  • guided, chunksize: The iterations are assigned to threads in chunks. Each chunk will be proportional to the number of remaining iterations, and no less than the chunk size.
  • dynamic, chunksize: The iterations are distributed to threads in chunks. Each thread executes a chunk, then requests another chunk once it has completed.
  • Usually static or guided will give best performance.
SLIDE 13

Controlling Loop Scheduling

#pragma omp parallel for default(none) \
    shared(a, b, c) private(i) \
    schedule(guided,1000)
for(i = 0; i < 100000; i++) {
  c[i] = a[i] + b[i];
}

!$omp parallel do default(none) &
!$omp shared(a, b, c) private(i) &
!$omp schedule(dynamic, 1000)
do i=1,100000
  c(i) = b(i) + a(i)
end do
!$omp end parallel do

SLIDE 14

Thread ID

  • Each thread has its own ID, which can be retrieved with omp_get_thread_num()
  • The total number of threads can also be retrieved with omp_get_num_threads()
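The two calls above can be sketched as follows. This is a minimal example, not from the slides; the serial fallback stubs are an added convenience so it builds with or without OpenMP enabled:

```c
#include <stdio.h>

/* Fall back to serial stubs when compiled without OpenMP support, so
   the sketch still builds and runs either way. */
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Each thread in the parallel region reports its 0-based ID and the
   size of the current thread team. */
void report_threads(void) {
  #pragma omp parallel
  {
    int tid = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    printf("Thread %d of %d\n", tid, nthreads);
  }
}
```

Note that outside any parallel region, omp_get_thread_num() returns 0 and omp_get_num_threads() returns 1.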
SLIDE 15

First exercise - setup

  • Copy the exercises from:
    – cp /home/aidanchalk/md.tar.gz .
    – cp /home/aidanchalk/OpenMP_training.pdf .
  • Extract them to a folder:
    – tar -xvf md.tar.gz
  • Load the intel compiler using source /opt/intel/parallel_studio_xe_2017.2.050/psxevars.sh
  • Compile the initial code with make
  • Test it on Xeon using the jobscript:
    – qsub initial.pbs
  • Check the output by looking at output.txt. Record the runtime.

SLIDE 16

First exercise

  • Copy the original code to ex_1:
    – cp initial/md.* ex_1/.
  • Add OpenMP loop-based parallelism to the compute_step and update routines.
  • To build it use make ex1
  • Test it on Xeon and on Xeon Phi KNC:
    – Xeon (copy from /home/aidanchalk/ex1_xeon.pbs): qsub ex1_xeon.pbs
    – Xeon Phi: qsub ex1_phi.pbs
  • How does the runtime compare to the original serial version? The runtimes.out file contains the runtime on each number of cores from 1 to 32.

SLIDE 17

First exercise

  • Add schedule(runtime) to the OpenMP loop in compute_step, add export OMP_SCHEDULE="guided,8" to the jobscript, and compare the runtime.
  • Try other schedules (dynamic, auto) and other chunksizes to see how it affects the runtime.

SLIDE 18

First exercise – potential issue.

  • If the performance is worse, use make ex1_opt and check the optrpt to see if the compute_step loop was vectorised.
  • If not, write the loop without using a reduction variable; this should allow the compiler to vectorise the code and get better performance than the serial version.

SLIDE 19

Task-Based Parallelism

  • Task-Based Parallelism is an alternative methodology for shared-memory parallel computing.
  • Rather than managing threads explicitly, we break the work into parallelisable chunks, known as tasks.
  • Between tasks, we keep track of data flow/dependencies (and potential race conditions).
  • With this information, we can safely execute independent tasks in parallel.

SLIDE 20

Diagram of TBP

[Figure: diagram of task-based parallelism, built up over three animation slides; images not recoverable]

SLIDE 23

OpenMP Tasks

  • OpenMP added tasks in 3.0, and additions to them have been included in both 4.0 and 4.5.
  • The earliest addition was the task keyword.
  • OpenMP 4.5 added an easier option – the taskloop.
  • In OpenMP, tasks are not (usually) executed until the next barrier or unless you use a taskwait barrier.
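As a minimal sketch of that completion guarantee (the helper name fill_squares is hypothetical, not from the exercises): tasks spawned from a single region are only guaranteed finished once a taskwait (or the region's barrier) is reached.

```c
/* Sketch: each loop iteration is spawned as a task from a single
   region; the taskwait guarantees every task has run before the
   function returns. Also compiles and runs correctly serially. */
void fill_squares(int *out, int n) {
  #pragma omp parallel shared(out, n)
  #pragma omp single
  {
    for (int i = 0; i < n; i++) {
      #pragma omp task firstprivate(i)
      out[i] = i * i;   /* out is shared from the enclosing scope */
    }
    #pragma omp taskwait /* all spawned tasks are done past this point */
  }
}
```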

SLIDE 24

Taskloop

  • Used to execute a loop using tasks.
  • Note: taskloop is not a worksharing construct (like OpenMP for) – you need to run it inside a single region unless you want to perform the loop multiple times.
  • You can define either num_tasks or grainsize to control the amount of work in each task.
  • (gcc-6.3 & gcc-7 bug – always declare the loop variable in the loop, i.e. for(int i = 0; …), for taskloop).

SLIDE 25

Taskloop

#pragma omp parallel default(none) \
    shared(a, b, c)
{
  #pragma omp single
  #pragma omp taskloop grainsize(1000)
  for(int i = 0; i < 100000; i++) {
    c[i] = a[i] + b[i];
  }
}

!$omp parallel default(none) &
!$omp shared(a, b, c) private(i)
!$omp single
!$omp taskloop num_tasks(1000)
do i=1,100000
  c(i) = b(i) + a(i)
end do
!$omp end taskloop
!$omp end single
!$omp end parallel

SLIDE 26

Second Exercise

  • Create a new directory called ex_2 and copy the ex_1/md.XXX files to it.
  • Alter your implementation of compute_step to use taskloop rather than the do/for loop you used before.
  • Note – you can't use a reduction variable for the energy with taskloop.
  • Build with make ex2
  • How does the runtime compare to your previous version?
  • How does altering the grainsize (or number of tasks) affect your results?

SLIDE 27

Second Exercise

  • If your new code is substantially slower, use make ex2_opt and look at the optimization report.
  • If the code doesn't vectorise, avoid updating the arrays directly in the inner loop – instead sum to a temporary and sum to the array in the outer loop.

SLIDE 28

Explicit Tasks

  • Taskloop is helpful for simply using tasks with an unbalanced loop; however, sometimes we want more control over how our tasks are spawned.
  • We can spawn tasks ourselves, using the task pragma.
  • To create an explicit task, we put the task pragma around a section of code we want to have executed as a task, and apply the relevant data-sharing clauses.
  • The firstprivate clause is useful: Any task-private data that needs to be input to a task region should be declared as firstprivate.

SLIDE 29

Explicit Tasks

  • Usually we will spawn tasks in a single region (and certain OpenMP definitions will only work if we do).
  • If we have completely independent tasks, we may be better off spawning them inside a parallel for.
  • Note: we cannot use reduction variables inside tasks (this is in discussion for OpenMP 5.0).
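Until then, a common workaround is to reduce by hand. A sketch (the helper name tasked_sum is hypothetical): each task accumulates into a local temporary and commits it with a single atomic update.

```c
/* Sketch of a reduction-free workaround: reduction clauses cannot be
   used on task constructs in OpenMP 4.5, so each task sums into a
   local variable and commits it once via an atomic update. */
long tasked_sum(int n, int chunk) {
  long total = 0;
  #pragma omp parallel shared(total)
  #pragma omp single
  for (int start = 0; start < n; start += chunk) {
    #pragma omp task firstprivate(start) shared(total)
    {
      int end = (start + chunk > n) ? n : start + chunk;
      long local = 0;               /* per-task partial sum */
      for (int i = start; i < end; i++)
        local += i;
      #pragma omp atomic
      total += local;               /* one atomic commit per task */
    }
  }
  /* the implicit barrier at the end of the parallel region completes
     any remaining tasks before total is read */
  return total;
}
```

One atomic per task (rather than per element) keeps the contention low, which is why the per-task temporary matters.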

SLIDE 30

Explicit Tasks

#pragma omp parallel default(none) \
    shared(a, b, c) private(i)
{
  #pragma omp single
  for(i = 0; i < 100000; i+=1000) {
    #pragma omp task default(none) \
        shared(a,b,c) firstprivate(i)
    for(int j = 0; j < 1000; j++)
      c[i+j] = a[i+j] + b[i+j];
  }
}

!$omp parallel default(none) &
!$omp shared(a, b, c) private(i,j)
!$omp do
do i=1,100000,1000
  !$omp task default(none) &
  !$omp shared(a, b, c) private(j) &
  !$omp firstprivate(i)
  do j=0, 999
    c(i+j) = b(i+j) + a(i+j)
  end do
  !$omp end task
end do
!$omp end do
!$omp end parallel

SLIDE 31

Third Exercise

  • Create a new folder (ex_3) and copy the original files (initial/md.XX) to ex_3/md.XX
  • Break down the outer loop in compute_step and create explicit tasks. Copy your code from ex_1 to parallelise the update function.
  • Build with make ex3
  • How does the runtime compare to your previous versions?
  • What size tasks perform best for explicit tasks?
  • Parallelise the update function using explicit tasks.
  • Does this improve the performance?
SLIDE 32

Dependencies

  • OpenMP 4.0 introduced the depend keyword extension to tasks.
  • When declaring a task, we can specify data as in, out, or inout.
  • Tasks with in dependencies on some data are dependent on any previously created task with an out or inout dependency on the same data.
  • Tasks with an out or inout dependency are dependent on any previously created task with any dependency on the same data.

SLIDE 33

Dependencies

  • Note 1: Dependencies on array sections need to be identically sized sections.
  • Note 2: Dependent tasks must be spawned by the same thread.
  • Note 3: Tasks may have as many depend clauses as necessary.

SLIDE 34

Dependencies

#pragma omp parallel default(none) \
    shared(x,n)
{
  #pragma omp single nowait
  for(int i = 0; i < n; i++)
    for(int j = 0; j < n; j++) {
      #pragma omp task firstprivate(i,j) \
          depend(inout: x[i]) shared(x)
      some_function(i, j, x);
    }
}

!$omp parallel default(none) &
!$omp shared(x,n) private(i,j)
!$omp single
do i=1, n
  do j=1, n
    !$omp task firstprivate(i,j) &
    !$omp depend(inout: x(i)) shared(x)
    Call some_function(i,j,x)
    !$omp end task
  end do
end do
!$omp end single nowait
!$omp end parallel

SLIDE 35

Fourth Exercise

  • Copy your code from ex_3/ to ex_4/
  • Create dependencies between the tasks in compute_step and in update. Build with make ex4
  • Make sure there are no OpenMP barriers between the start of compute_step and the end of update, else the synchronization will result in OpenMP ignoring the dependencies.
  • Advanced: If there is time, try to change the code from one-sided updates (i.e. updates only to particle i) to symmetric updates.

SLIDE 36

Taskyield keyword

  • OpenMP 4.0 also added the taskyield keyword.
  • When a thread reaches the keyword, it may suspend execution of the current task in favour of a different task.
  • Not supported in all runtimes:
  • In Intel, the current task remains in the executing thread's stack, and will be completed after all other tasks are complete.
  • Unimplemented in gcc-6 and gcc-7 (will compile, but the taskyield function in the runtime is empty).

SLIDE 37

Taskyield keyword

void task_function(){
  some_work();
  while(!omp_test_lock(x)) {
    #pragma omp taskyield
  }
  some_critical_work();
  omp_unset_lock(x);
}

Subroutine task_function()
  Call some_work()
  do while(.not. omp_test_lock(x))
    !$omp taskyield
  end do
  call some_critical_work()
  call omp_unset_lock(x)
End Subroutine

SLIDE 38

Task priorities

  • OpenMP 4.5 introduced the priority keyword that can be added to task constructs.
  • Task priorities are used in many task-based libraries to hint to the runtime that some tasks are more important to compute earlier during the computation (due to length, number of dependents etc.).
  • Neither Intel nor gcc currently implements this feature of the standard (ignoring it is standard compliant).

SLIDE 39

Task priorities

  • Priorities can be specified between 0 and OMP_MAX_TASK_PRIORITY.
  • This can only be set as an environment variable: export OMP_MAX_TASK_PRIORITY=100000
  • It can be retrieved at runtime by calling omp_get_max_task_priority().
SLIDE 40

Task priorities

#pragma omp parallel default(none) \
    shared(a, b, c) private(i)
{
  #pragma omp single
  for(i = 0; i < 100000; i+=1000) {
    #pragma omp task default(none) \
        shared(a,b,c) firstprivate(i) \
        priority(i)
    for(int j = 0; j < 1000; j++)
      c[i+j] = a[i+j] + b[i+j];
  }
}

!$omp parallel default(none) &
!$omp shared(a, b, c) private(i,j)
!$omp do
do i=1,100000,1000
  !$omp task default(none) &
  !$omp shared(a, b, c) private(j) &
  !$omp firstprivate(i) priority(i)
  do j=0, 999
    c(i+j) = b(i+j) + a(i+j)
  end do
  !$omp end task
end do
!$omp end do
!$omp end parallel

SLIDE 41

Upcoming features

Features announced for OpenMP 5.0 and other features in discussion.

SLIDE 42

Upcoming features

  • OpenMP 5.0 – current draft:
  • Reduction variables can be used with task and taskloop.
  • Can give a taskwait dependencies – i.e. provides a synchronization point on a specific variable – useful for many-to-many dependencies.
  • In discussion:
  • Commutative dependencies – i.e. dependencies which can be done in any order but not simultaneously (to avoid race conditions).

SLIDE 43

Reduction Variables with Tasks.

  • Rather than declaring data with a reduction clause, we instead use a task_reduction clause.
  • For a task to use the reduction variable, it must be passed to the task with the in_reduction clause.
  • The reduction will be complete at the end of the region.
SLIDE 44

Reduction Variables with Tasks

#pragma omp parallel for \
    default(none) task_reduction(+:sum)
for(int i = 0; i < 10000; i++){
  #pragma omp task firstprivate(i) \
      in_reduction(+:sum)
  {
    sum += i;
  }
}

!$omp parallel do default(none) &
!$omp task_reduction(+:sum) private(i)
do i=1,10000
  !$omp task in_reduction(+:sum) &
  !$omp firstprivate(i)
  sum = sum + i
  !$omp end task
end do
!$omp end parallel do

SLIDE 45

Taskwait Dependencies

  • In OpenMP 4.5, taskwait specifies a barrier which forces all previously spawned tasks in the same region to be completed before progression.
  • In OpenMP 5.0, we can provide a depend clause to a taskwait.
  • In this case, the taskwait behaves as though we specified an empty task with this dependency.

SLIDE 46

Taskwait Dependencies

// Using tasks with inout dependencies (OpenMP 4.5):
for(int i = 0; i < 100; i++){
  #pragma omp task depend(in:a)
  function_using(a);
  #pragma omp task depend(in:a,b)
  function_using_both(a,b);
  #pragma omp task depend(in:c)
  function_using(c);
}
#pragma omp task depend(inout:a)
function_writes(a);
#pragma omp task depend(inout:b)
function_writes(b);

// Using a taskwait with dependencies (OpenMP 5.0):
for(int i = 0; i < 100; i++){
  #pragma omp task depend(in:a)
  function_using(a);
  #pragma omp task depend(in:a,b)
  function_using_both(a,b);
  #pragma omp task depend(in:c)
  function_using(c);
}
#pragma omp taskwait depend(out:a,b)
#pragma omp task
function_writes(a);
#pragma omp task
function_writes(b);

SLIDE 47

Commutative Dependencies

  • While not yet part of the draft OpenMP 5.0 standard, commutative dependencies (aka concurrent dependencies or conflicts) are expected to come to OpenMP in the future.
  • Many other task-based systems make use of these, e.g. StarPU, QUARK, SWIFT (Cosmology code).

SLIDE 48

Commutative Dependencies

  • If dependencies are marked as commutative, it means the order in which they execute is not important; however, there is (usually) a race condition on data writes between the tasks.
  • We can use taskyield or recursive tasks, combined with locks, to create something similar in OpenMP 4.5.

SLIDE 49

Commutative Dependencies: taskyield

void task_function(omp_lock_t **locks, int nr_locks) {
  int success = 0;
  while(!success) {
    int i;
    for(i = 0; i < nr_locks; i++)
      if(!omp_test_lock(locks[i]))
        break;
    if(i < nr_locks){
      success = 0;
      for(int j = i-1; j >= 0; j--)
        omp_unset_lock(locks[j]);
      #pragma omp taskyield
    } else
      success = 1;
  }
  do_work();
  for(int i = 0; i < nr_locks; i++)
    omp_unset_lock(locks[i]);
}

SLIDE 50

Commutative Dependencies: recursive tasks.

void task_function(omp_lock_t **locks, int nr_locks) {
  int i;
  for(i = 0; i < nr_locks; i++)
    if(!omp_test_lock(locks[i]))
      break;
  if(i < nr_locks){
    for(int j = i-1; j >= 0; j--)
      omp_unset_lock(locks[j]);
    #pragma omp task
    task_function(locks, nr_locks);
    return;
  }
  do_work();
  for(i = 0; i < nr_locks; i++)
    omp_unset_lock(locks[i]);
}

SLIDE 51

How to use KNL

aidanchalk@sl142-mic1:~$ export LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/mic:$LD_LIBRARY_PATH
aidanchalk@sl142-mic1:~$ ./MIC

OR

[aidanchalk@sl142 ~]$ export SINK_LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/mic
[aidanchalk@sl142 ~]$ micnativeloadex MIC