CSL 860: Modern Parallel Computation – Hello OpenMP
Hello OpenMP

Parallel Construct:

#pragma omp parallel
{
  // I am now thread i of n
  switch (omp_get_thread_num()) {
    case 0: blah1(); break;
    case 1: blah2(); break;
  }
} // Back to normal
- Extremely simple to use and incredibly powerful
- Fork-Join model
- Every thread has its own execution context
- Variables can be declared shared or private
Execution Model
- Encountering thread creates a team:
– Itself (master) + zero or more additional threads.
- Applies to structured block immediately following
– Each thread executes a copy of the code in {}
- But, also see: Work-sharing constructs
- There’s an implicit barrier at the end of block
- Only master continues beyond the barrier
- Parallel regions may be nested
– Nesting is sometimes disabled by default
Memory Model
- Notion of temporary view of memory
– Allows local caching
– Memory must be flushed to make writes visible:
T1 writes -> T1 flushes -> T2 flushes -> T2 reads
– Flushes to a variable are seen in the same order by all threads
- Supports threadprivate memory
- Variables declared before parallel construct:
– Shared by default
– May be designated as private
– For private variables, n-1 new copies of the original are created
- Private copies are not initialized by the system
Shared Variables
- Shared:
– Heap-allocated storage
– Static data members
– const-qualified objects (with no mutable members)
- Private:
– Variables declared in a scope inside the construct
– The loop variable of a for construct (private to that construct)
- Other variables are shared unless declared private
– You can change the default
- Arguments passed by reference inherit the data-sharing attribute of the original
Beware of Compiler Re-ordering
// Initially: a = b = 0

thread 1              thread 2
b = 1;                a = 1;
flush(b);             flush(a);
flush(a);             flush(b);
if (a == 0) {         if (b == 0) {
  critical section      critical section
}                     }
Beware more of Compiler Re-ordering
// Inside a parallel construct
{
  int b = initialSalary;
  print("Initial Salary was %d\n", initialSalary);
  Book-keeping();  // No read of b or write of initialSalary
  if (b < 10000) {
    raiseSalary(500);
  }
}
Thread Control
Environment variable   Modify / retrieve value                       Initial value
OMP_NUM_THREADS *      omp_set_num_threads / omp_get_max_threads     Implementation defined
OMP_DYNAMIC            omp_set_dynamic / omp_get_dynamic             Implementation defined
OMP_NESTED             omp_set_nested / omp_get_nested               false
OMP_SCHEDULE *         (environment variable only)                   Implementation defined

* Also see construct clauses: num_threads, schedule
Parallel Construct
#pragma omp parallel \
  if(boolean) \
  private(var1, var2, var3) \
  firstprivate(var1, var2, var3) \
  default(shared | none) \
  shared(var1, var2) \
  copyin(var2) \
  reduction(operator : list) \
  num_threads(n)
{
}
Parallel Loop
#pragma omp parallel for
for (i = 0; i < N; ++i) {
  blah …
}
- The number of iterations must be known when the construct is encountered
– Must be the same for each thread
- Compiler puts a barrier at the end of parallel for
– But see nowait
Parallel For
#pragma omp for \
  private(var1, var2, var3) \
  firstprivate(var1, var2, var3) \
  lastprivate(var1, var2) \
  reduction(operator : list) \
  ordered \
  schedule(kind[, chunk_size]) \
  nowait

- Requires a canonical for loop
- No break out of the loop
Schedule(kind[, chunk_size])
- Divide iterations into contiguous sets, called chunks
– Chunks are assigned transparently to threads
- static: iterations are divided among threads in a round-robin fashion
– When no chunk_size is specified, approximately equal chunks are made
- dynamic: iterations are assigned to threads in 'request order'
– When no chunk_size is specified, it defaults to 1
- guided: like dynamic, but the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads
– If chunk_size = k, chunks have at least k iterations (except the last)
– When no chunk_size is specified, it defaults to 1
- runtime: the schedule is taken from the OMP_SCHEDULE environment variable
Single
#pragma omp parallel
{
  #pragma omp for
  for (int i = 0; i < N; i++)
    a[i] = f0(i);

  #pragma omp single
  x = f1(a);

  #pragma omp for
  for (int i = 0; i < N; i++)
    b[i] = x * f2(i);
}
- Only one of the threads executes
- Other threads wait for it
– unless nowait is specified
- Hidden complexity
– Threads may be at different instructions
Sections
#pragma omp sections
{
  #pragma omp section
  { // do this …
  }
  #pragma omp section
  { // do that …
  }
  // …
}
- The omp section directives must be closely nested in a sections construct,
where no other work-sharing construct may appear.
Private Variables
#pragma omp parallel for private(size, …)
for (int i = 0; i < numThreads; i++) {
  int size = numTasks / numThreads;
  int extra = numTasks - numThreads * size;
  if (i < extra) size++;
  doTask(i, size, numThreads);
}

doTask(int start, int count, int stride) {
  // Each thread's instance has its own activation record
  for (int i = 0, t = start; i < count; i++, t += stride)
    doit(t);
}
Firstprivate and Lastprivate
- The initial value of a private variable is unspecified
– firstprivate initializes each copy with the original
– Once per thread (not once per iteration)
– The original must exist before the construct
- Only the original copy is retained after the construct
- lastprivate forces sequential-like behavior
– The thread executing the sequentially last iteration (or the last listed section) writes its copy back to the original
Firstprivate and Lastprivate
#pragma omp parallel for firstprivate(simple)
for (int i = 0; i < N; i++) {
  simple += a[f1(i, omp_get_thread_num())];
  f2(simple);
}

#pragma omp parallel for lastprivate(doneEarly)
for (int i = 0; i < N; i++) {
  doneEarly = f0(i);
}
Other Synchronization Directives
#pragma omp master { }
– Binds to the innermost enclosing parallel region
– Only the master thread executes the block
– No implied barrier
Master Directive
#pragma omp parallel
{
  #pragma omp for
  for (int i = 0; i < 100; i++)
    a[i] = f0(i);

  #pragma omp master
  x = f1(a);   // Only master executes; no synchronization
}
Critical Section
#pragma omp critical (accessBankBalance)
{
}

– A single thread at a time
– Applies to all threads
– The name is optional; no name implies the global critical region
Barrier Directive
#pragma omp barrier
– Stand-alone directive
– Binds to the innermost enclosing parallel region
– All threads in the team must execute it
- They all wait for each other at this instruction
- Dangerous:
if (!ready)
    #pragma omp barrier
– Same sequence of work-sharing and barrier for the entire team
Ordered Directive
#pragma omp ordered { }
- Binds to the innermost enclosing loop
- The structured block is executed in sequential order
- The loop must declare the ordered clause
- Each iteration may encounter only one ordered region
Flush Directive
#pragma omp flush (var1, var2)
– Stand-alone, like barrier
– Only directly affects the encountering thread
– The list of vars ensures that any compiler re-ordering moves all flushes together
Atomic Directive
#pragma omp atomic
i++;
- Light-weight critical section
- Only for some expressions
– x = expr (no mutual exclusion on the evaluation of expr)
– x++
– ++x
– x--
– --x
Reductions
- Reductions are so common that OpenMP provides support for them
- May add a reduction clause to the parallel for pragma
- Specify the reduction operation and the reduction variable
- OpenMP takes care of storing partial results in private variables and combining the partial results after the loop
reduction Clause
- reduction (<op> :<variable>)
– +  Sum
– *  Product
– &  Bitwise and
– |  Bitwise or
– ^  Bitwise exclusive or
– && Logical and
– || Logical or
- Add to parallel for
– OpenMP creates a loop to combine copies of the variable – The resulting loop may not be parallel
Nesting Restrictions
- A work-sharing region may not be closely nested inside a
work-sharing, critical, ordered, or master region.
- A barrier region may not be closely nested inside a work-
sharing, critical, ordered, or master region.
- A master region may not be closely nested inside a work-sharing region.
- An ordered region may not be closely nested inside a
critical region.
- An ordered region must be closely nested inside a loop
region (or parallel loop region) with an ordered clause.
- A critical region may not be nested (closely or otherwise)
inside a critical region with the same name. Note that this restriction is not sufficient to prevent deadlock.
EXAMPLES
OpenMP Matrix Multiply
#pragma omp parallel for
for (int i = 0; i < n; i++)
  for (int j = 0; j < n; j++) {
    c[i][j] = 0.0;
    for (int k = 0; k < n; k++)
      c[i][j] += a[i][k] * b[k][j];
  }
- a, b, c are shared
- i, j, k are private
OpenMP Matrix Multiply: Triangular
#pragma omp parallel for schedule(dynamic, 1)
for (int i = 0; i < n; i++)
  for (int j = i; j < n; j++) {
    c[i][j] = 0.0;
    for (int k = i; k < n; k++)
      c[i][j] += a[i][k] * b[k][j];
  }
- This multiplies upper-triangular matrix A with B
- Unbalanced workload
– Schedule improves this
OpenMP Jacobi
for some number of timesteps/iterations {
  #pragma omp parallel for
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j]
                          + grid[i][j-1] + grid[i][j+1] );

  #pragma omp parallel for
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      grid[i][j] = temp[i][j];
}
- This could be improved by using just one parallel region
- Implicit barrier after loops eliminates race on grid
OpenMP Jacobi
for some number of timesteps/iterations {
  #pragma omp parallel for
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++) {
      temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j]
                          + grid[i][j-1] + grid[i][j+1] );
      #pragma omp barrier
      grid[i][j] = temp[i][j];
    }
}
- Is barrier sufficient?
- What change to the code is needed?
– Recall barrier is per-team