How to Get Good Performance by Using OpenMP


  1. Agenda
     • Loop optimizations
     • Measuring OpenMP performance
     • Best practices
     • Task parallelism

     Memory access patterns: a major goal is to organize data accesses so that values are used as often as possible while they are still in the cache.

     Correctness versus performance: it may be easy to write a correctly functioning OpenMP program, but not so easy to create a program that provides the desired level of performance.

  2. Two-dimensional array access
     In C, a two-dimensional array is stored row by row (row-major order).

     Empirical test on alvin (n = 50000):
     • row-wise access: 34.8 seconds
     • column-wise access: 213.3 seconds

     Loop fusion
     Loop unrolling
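     A minimal sketch of the two access patterns behind those numbers (the array size and summation are illustrative, not from the slides):

         #include <stdio.h>
         #include <stdlib.h>

         #define N 4096

         int main(void) {
             double *a = calloc((size_t)N * N, sizeof *a);
             double sum = 0.0;

             /* Row-wise: the inner loop walks contiguous memory, so
                consecutive iterations reuse the cache lines just fetched. */
             for (int i = 0; i < N; i++)
                 for (int j = 0; j < N; j++)
                     sum += a[i * N + j];

             /* Column-wise: the inner loop strides by N doubles, touching
                a different cache line on essentially every access. */
             for (int j = 0; j < N; j++)
                 for (int i = 0; i < N; i++)
                     sum += a[i * N + j];

             printf("%f\n", sum);
             free(a);
             return 0;
         }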

  3. Loop fission
     Loop tiling
     (cont'd on next page)

     Measuring OpenMP performance
     (1) Using the time command available on Unix systems:

         $ time program
         real  5.4
         user  3.2
         sys   2.0

     (2) Using the omp_get_wtime() function.
         Returns the wall clock time (in seconds) relative to an arbitrary reference time.
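     A minimal sketch of the omp_get_wtime() approach; the parallel region body is a placeholder for whatever is being measured:

         #include <stdio.h>
         #include <omp.h>

         int main(void) {
             double t_start = omp_get_wtime();

             #pragma omp parallel
             {
                 /* ... the work to be measured ... */
             }

             double t_elapsed = omp_get_wtime() - t_start;
             printf("elapsed wall clock time: %.3f s\n", t_elapsed);
             return 0;
         }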

  4. Parallel overhead
     The amount of time required to coordinate parallel threads, as opposed to doing useful work. Parallel overhead can include factors such as:
     • Thread start-up time
     • Synchronization
     • Software overhead imposed by parallel compilers, libraries, tools, operating system, etc.
     • Thread termination time

     A simple performance model

         T_CPU(P)     = (1 + O_P * P) * T_serial
         T_Elapsed(P) = (f/P + 1 - f + O_P * P) * T_serial

     where P is the number of threads, f the parallel fraction of the program, and O_P the per-thread parallel overhead.

     Performance factors
     • Manner in which memory is accessed by the individual threads.
     • Sequential overheads: sequential work that is replicated.
     • (OpenMP) Parallelization overheads: the amount of time spent handling OpenMP constructs.
     • Load imbalance overheads: the load imbalance between synchronization points.
     • Synchronization overheads: time wasted waiting to enter critical regions.

         Speedup(P) = T_serial / T_Elapsed(P) = 1 / (f/P + 1 - f + O_P * P)

         e.g. with f = 0.95 and O_P = 0.02:
         Speedup(P) = 1 / (0.95/P + 0.05 + 0.02 * P)

         Efficiency(P) = Speedup(P) / P
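     Plugging numbers into that example model shows how the O_P * P term comes to dominate: for P = 8, Speedup(8) = 1 / (0.95/8 + 0.05 + 0.02 * 8) = 1 / 0.329 ≈ 3.0, so Efficiency(8) ≈ 0.38. In fact the model peaks near P ≈ 7 at a speedup of only about 3.1; adding threads beyond that makes the program slower.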

  5. Overheads of OpenMP directives on alvin (gcc)
     [Chart: measured overheads of the parallel, parallel for, for, single, barrier, and reduction constructs]

     Overhead of OpenMP scheduling on alvin (gcc)
     [Chart: measured overheads of the schedule kinds]
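     The transcript does not say how these overheads were obtained; a common approach (used, for example, by the EPCC microbenchmarks) is to time many repetitions of a construct and divide. A rough sketch for the barrier construct, with the repetition count chosen arbitrarily:

         #include <stdio.h>
         #include <omp.h>

         #define REPS 100000

         int main(void) {
             double t0 = omp_get_wtime();

             /* Every thread executes the same sequence of barriers;
                the loop amortizes the timer resolution over REPS. */
             #pragma omp parallel
             for (int r = 0; r < REPS; r++) {
                 #pragma omp barrier
             }

             double t = omp_get_wtime() - t0;
             printf("approx. barrier overhead: %g s\n", t / REPS);
             return 0;
         }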

  6. Best practices
     Optimize barrier use.

     Avoid the ordered construct
     The ordered construct is expensive and can often be avoided. It might be better to perform I/O outside the parallel loop.

     Avoid the critical region construct
     If at all possible, an atomic update is to be preferred.
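     A minimal sketch of the preferred form (the counter and loop are illustrative):

         #include <stdio.h>

         int main(void) {
             int count = 0;

             #pragma omp parallel for
             for (int i = 0; i < 1000000; i++) {
                 /* atomic protects just this single update; it is
                    typically much cheaper than a critical region,
                    which would serialize a whole block of code. */
                 #pragma omp atomic
                 count += 1;
             }

             printf("count = %d\n", count);
             return 0;
         }

     (For a plain counter like this one, a reduction(+:count) clause would avoid the per-iteration synchronization entirely.)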

  7. Avoid large critical regions
     Time is lost while threads wait for locks:

         #pragma omp parallel
         {
             #pragma omp critical
             {
                 ...
             }
             ...
         }

     [Timeline: each thread's time divided into busy, idle, and in-critical]

     Maximize parallel regions
     Large parallel regions offer more opportunities for using data in cache and provide a bigger context for compiler optimizations.

     Avoid parallel regions in inner loops (see the sketch below).
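     A sketch of that last point, with illustrative sizes and a placeholder work() function: a parallel region inside an inner loop pays the fork/join cost on every outer iteration, whereas hoisting it out pays that cost once.

         #define N 1000
         #define M 1000

         double a[N][M];
         double work(int i, int j) { return i * 0.5 + j; }

         /* Pays the parallel-region start-up cost N times. */
         void inner_parallel(void) {
             for (int i = 0; i < N; i++) {
                 #pragma omp parallel for
                 for (int j = 0; j < M; j++)
                     a[i][j] = work(i, j);
             }
         }

         /* One enclosing region; only the cheap work-sharing
            construct is re-entered on each outer iteration. */
         void outer_parallel(void) {
             #pragma omp parallel
             for (int i = 0; i < N; i++) {
                 #pragma omp for
                 for (int j = 0; j < M; j++)
                     a[i][j] = work(i, j);
             }
         }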

  8. Load imbalance
     Unequal work loads lead to idle threads and wasted time:

         #pragma omp parallel
         {
             #pragma omp for
             for ( ; ; ) {
                 ...
             }
         }

     [Timeline: each thread's time divided into busy and idle]

     Load balancing
     • Load balancing is an important aspect of performance.
     • For regular computations (e.g. vector addition), load balancing is not an issue.
     • For less regular workloads, care needs to be taken in distributing the work over the threads.
     • Examples of irregular workloads:
       - multiplication of triangular matrices
       - parallel searches in a linked list
     • For these irregular situations, the schedule clause supports various iteration scheduling algorithms (see the sketch below).

     Address poor load balancing
     (cont'd on next page)
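     A minimal sketch of the schedule clause on a triangular workload (matrix size and chunk size are illustrative): with the default static schedule, the threads given the last rows do far more work; schedule(dynamic) hands out chunks on demand and keeps the threads evenly busy.

         #define N 2000

         double a[N][N], x[N], y[N];

         void lower_triangular_matvec(void) {
             /* Row i touches i+1 elements, so the work grows with i;
                dynamic scheduling rebalances this at run time. */
             #pragma omp parallel for schedule(dynamic, 16)
             for (int i = 0; i < N; i++) {
                 double sum = 0.0;
                 for (int j = 0; j <= i; j++)
                     sum += a[i][j] * x[j];
                 y[i] = sum;
             }
         }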

  9. False sharing
     The state bits of a cache line do not track the line's state on a byte basis, but at the line level instead. Thus, if independent data items happen to reside on the same cache line (cache block), each update will cause the cache line to "ping-pong" between the threads. This is called false sharing.

     False sharing is likely to significantly impact performance under the following conditions:
     1. Shared data are modified by multiple threads.
     2. The access pattern is such that multiple threads modify the same cache line(s).
     3. These modifications occur in rapid succession.

     False sharing example
     Array elements are contiguous in memory and hence share cache lines. Result: false sharing may lead to poor scaling.

     Solutions:
     • When updates to an array are frequent, work with local copies of the array instead of an array indexed by the thread ID (see the sketch below).
     • Pad arrays so elements you use are on distinct cache lines.
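     A brief sketch of the first solution (names and iteration count are illustrative): each thread accumulates into a private local variable and touches the shared, thread-indexed array only once at the end, so the line can "ping-pong" at most once per thread.

         #include <omp.h>

         #define NITERS 1000000

         int result[64];   /* one slot per thread; adjacent slots share cache lines */

         void accumulate(void) {
             #pragma omp parallel
             {
                 int id = omp_get_thread_num();   /* assumes <= 64 threads */
                 int local = 0;                   /* private: no false sharing */

                 for (int i = 0; i < NITERS; i++)
                     local += i % 2;

                 result[id] = local;              /* single write to shared array */
             }
         }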

  10. Array padding
      Before: adjacent int elements of a share cache lines, and schedule(static,1) makes the threads update neighbouring elements, so they fight over the same lines:

          int a[Nthreads];
          #pragma omp parallel for shared(Nthreads,a) schedule(static,1)
          for (int i=0; i<Nthreads; i++)
              a[i] += i;

      After: padding each element out to a full cache line puts every a[i][0] on its own line:

          int a[Nthreads][cache_line_size];
          #pragma omp parallel for shared(Nthreads,a) schedule(static,1)
          for (int i=0; i<Nthreads; i++)
              a[i][0] += i;

      (Here cache_line_size must be measured in ints, i.e. the line size in bytes divided by sizeof(int).)

      Case study: Matrix times vector product
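      The case-study code itself is not in the transcript; a minimal OpenMP matrix-times-vector sketch, as it might look (the function and parameter names are assumptions):

          /* y = A * x, where A is m x n in row-major order. */
          void matvec(int m, int n, const double *a,
                      const double *x, double *y) {
              /* Parallelize over rows: each thread writes a disjoint
                 range of y, so no synchronization is needed. */
              #pragma omp parallel for
              for (int i = 0; i < m; i++) {
                  double sum = 0.0;
                  for (int j = 0; j < n; j++)
                      sum += a[i * n + j] * x[j];
                  y[i] = sum;
              }
          }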

  11. Task Parallelism in OpenMP 3.0

  12. What are tasks?
      • Tasks are independent units of work.
      • Threads are assigned to perform the work of each task.
      • Tasks may be deferred, or may be executed immediately; the runtime system decides which.
      • Tasks are composed of:
        - code to execute
        - data environment (a task owns its data)
        - internal control variables
      [Figure: serial vs. parallel task execution]

      Tasks in OpenMP
      OpenMP has always had tasks, but they were not called that.
      • A thread encountering a parallel construct packages up a set of implicit tasks, one per thread.
      • A team of threads is created.
      • Each thread is assigned to one of the tasks (and tied to it).
      • A barrier holds the master thread until all implicit tasks are finished.
      OpenMP 3.0 adds a way to create a task explicitly for the team to execute.

      The task construct

          #pragma omp task [clause [[,] clause] ... ]
              structured block

      Each encountering thread creates a new task:
      • code and data are packaged up
      • tasks can be nested

      An OpenMP barrier (implicit or explicit): all tasks created by any thread of the current team are guaranteed to be completed at barrier exit.
      Task barrier (taskwait): the encountering thread suspends until all child tasks it has generated are complete.

      Simple example of using tasks for pointer chasing:

          void process_list(elem_t *elem) {
              #pragma omp parallel
              {
                  #pragma omp single
                  {
                      while (elem != NULL) {
                          #pragma omp task
                          {
                              process(elem);
                          }
                          elem = elem->next;
                      }
                  }
              }
          }

      elem is firstprivate by default.

  13. Simple example of using tasks in a recursive algorithm
      Computation of Fibonacci numbers: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, ...

          int fib(int n) {
              int i, j;
              if (n < 2)
                  return n;
              #pragma omp task shared(i)
              i = fib(n - 1);
              #pragma omp task shared(j)
              j = fib(n - 2);
              #pragma omp taskwait
              return i + j;
          }

          int main() {
              int n = 10;
              #pragma omp parallel
              #pragma omp single
              printf("fib(%d) = %d\n", n, fib(n));
          }

      Using tasks for tree traversal:

          struct node {
              struct node *left, *right;
          };

          void traverse(struct node *p, int postorder) {
              if (p->left != NULL)
                  #pragma omp task
                  traverse(p->left, postorder);
              if (p->right != NULL)
                  #pragma omp task
                  traverse(p->right, postorder);
              if (postorder) {
                  #pragma omp taskwait
                  process(p);
              }
          }

      Collapsing of loops
      The collapse clause (in OpenMP 3.0) handles perfectly nested multi-dimensional loops:

          #pragma omp for collapse(2)
          for (i = 0; i < N; i++)
              for (j = 0; j < M; j++)
                  for (k = 0; k < K; k++)
                      foo(i, j, k);

      The iteration spaces of the i-loop and j-loop are collapsed into a single one, if the two loops are perfectly nested and form a rectangular iteration space.

      Task switching
      Certain constructs have suspend/resume points at defined positions within them. When a thread encounters a suspend point, it is allowed to suspend the current task and resume another. It can then return to the original task and resume it.

      A tied task must be resumed by the same thread that suspended it. Tasks are tied by default. A task can be specified to be untied using

          #pragma omp task untied
