CSL 860: Modern Parallel Computation - Hello OpenMP



SLIDE 1

CSL 860: Modern Parallel Computation

SLIDE 2

Hello OpenMP

#pragma omp parallel        // Parallel construct
{
    // I am now thread i of n
    switch (omp_get_thread_num()) {
        case 0: blah1..
        case 1: blah2..
    }
}
// Back to normal

  • Extremely simple to use and incredibly powerful
  • Fork-Join model
  • Every thread has its own execution context
  • Variables can be declared shared or private
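A minimal compilable version of this sketch (the printf messages stand in for the slide's blah placeholders) can be built with gcc -fopenmp:

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        // Every thread in the team executes this block.
        switch (omp_get_thread_num()) {
            case 0:  printf("Thread 0 does blah1\n"); break;
            case 1:  printf("Thread 1 does blah2\n"); break;
            default: printf("Thread %d idles\n", omp_get_thread_num()); break;
        }
    } // implicit barrier; only the master thread continues
    return 0;
}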
SLIDE 3

Execution Model

  • Encountering thread creates a team:

– Itself (master) + zero or more additional threads.

  • Applies to structured block immediately following

– Each thread executes a copy of the code in {}

  • But, also see: Work-sharing constructs
  • There’s an implicit barrier at the end of the block
  • Only master continues beyond the barrier
  • May be nested

– Sometimes disabled by default (see the sketch below)
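A small sketch of fork-join nesting; omp_set_nested is the classic way to enable it (newer OpenMP revisions prefer omp_set_max_active_levels):

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_nested(1);   // nested parallelism is often off by default
    #pragma omp parallel num_threads(2)
    {
        int outer = omp_get_thread_num();
        #pragma omp parallel num_threads(2)
        {
            // Each outer thread is master of its own inner team.
            printf("outer %d / inner %d\n", outer, omp_get_thread_num());
        } // implicit barrier for the inner team
    }     // implicit barrier for the outer team; only master continues
    return 0;
}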

SLIDE 4

Memory Model

  • Notion of temporary view of memory

– Allows local caching
– Need to flush memory
– T1 writes -> T1 flushes -> T2 flushes -> T2 reads
– Same order seen by all threads

  • Supports threadprivate memory
  • Variables declared before parallel construct:

– Shared by default
– May be designated as private
– n-1 copies of the original variable are created

  • May not be initialized by the system
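A minimal sketch of threadprivate memory (the variable name counter is illustrative):

#include <stdio.h>
#include <omp.h>

int counter = 0;                    // file-scope original
#pragma omp threadprivate(counter)  // every thread gets its own copy

int main(void) {
    #pragma omp parallel
    {
        counter = omp_get_thread_num();  // no race: per-thread copies
        printf("thread %d sees counter = %d\n",
               omp_get_thread_num(), counter);
    }
    return 0;
}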
SLIDE 5

Shared Variables

  • Heap allocated storage
  • Static data members
  • const-qualified (no mutable members)
  • Private:

– Variables declared in a scope inside the construct
– Loop variable in for construct

  • private to the construct
  • Others are shared unless declared private

– You can change default

  • Arguments passed by reference inherit from original
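A hedged sketch making these defaults concrete (the names n, t, and local are illustrative):

#include <omp.h>

void example(void) {
    int n = 100;   // declared before the construct: shared by default
    int t;         // explicitly made private below
    #pragma omp parallel private(t)
    {
        t = omp_get_thread_num();  // each thread's own t (value unspecified on entry)
        int local = t * n;         // declared inside the construct: private
        (void)local;
    }
}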
SLIDE 6

Beware of Compiler Re-ordering

a = b = 0

thread 1                 thread 2
b = 1;                   a = 1;
flush(b); flush(a);      flush(a); flush(b);
if (a == 0) {            if (b == 0) {
    critical section         critical section
}                        }

SLIDE 7

Beware more of Compiler Re-ordering

// Parallel construct
{
    int b = initialSalary;
    print("Initial Salary was %d\n", initialSalary);
    Book-keeping();  // No read of b or write of initialSalary
    if (b < 10000) {
        raiseSalary(500);
    }
}

SLIDE 8

Thread Control

Environment Variable   Ways to modify value    Way to retrieve value   Initial value
OMP_NUM_THREADS *      omp_set_num_threads     omp_get_max_threads     Implementation defined
OMP_DYNAMIC            omp_set_dynamic         omp_get_dynamic         Implementation defined
OMP_NESTED             omp_set_nested          omp_get_nested          false
OMP_SCHEDULE *         -                       -                       Implementation defined

* Also see the construct clauses: num_threads, schedule
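A hedged sketch of the runtime-API side of this table:

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_num_threads(4);   // overrides OMP_NUM_THREADS for later regions
    omp_set_dynamic(0);       // forbid the runtime from shrinking teams
    omp_set_nested(0);        // disable nested parallelism
    printf("max threads: %d, dynamic: %d, nested: %d\n",
           omp_get_max_threads(), omp_get_dynamic(), omp_get_nested());
    return 0;
}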

SLIDE 9

Parallel Construct

#pragma omp parallel \
    if(boolean) \
    private(var1, var2, var3) \
    firstprivate(var1, var2, var3) \
    default(shared | none) \
    shared(var1, var2) \
    copyin(var2) \
    reduction(operator: list) \
    num_threads(n)
{
}
SLIDE 10

Parallel Loop

#pragma omp parallel for
for (i = 0; i < N; ++i) {
    blah ...
}

  • Number of iterations must be known when the construct is encountered

– Must be the same for each thread

  • Compiler puts a barrier at the end of parallel for

– But see nowait (sketch below)
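A hedged sketch of nowait (the two loop bodies are illustrative placeholders; dropping the barrier is safe only because the second loop never reads a):

#include <omp.h>

void compute(double *a, double *b, int n) {
    #pragma omp parallel
    {
        #pragma omp for nowait   // no barrier: threads fall through to the next loop
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * i;      // stands in for f(i)
        #pragma omp for          // safe only because b[] does not depend on a[]
        for (int i = 0; i < n; i++)
            b[i] = i + 1.0;      // stands in for g(i)
    }
}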

SLIDE 11

Parallel For

#pragma omp for \
    private(var1, var2, var3) \
    firstprivate(var1, var2, var3) \
    lastprivate(var1, var2) \
    reduction(operator: list) \
    ordered \
    schedule(kind[, chunk_size]) \
    nowait

  • Canonical for loop
  • No loop break

SLIDE 12

Schedule(kind[, chunk_size])

  • Divide iterations into contiguous sets, chunks

– Chunks are assigned transparently to threads

  • static: iterations are divided among threads in a round-robin fashion

– When no chunk_size is specified, approximately equal chunks are made

  • dynamic: iterations are assigned to threads in ‘request order’

– When no chunk_size is specified, it defaults to 1

  • guided: like dynamic, but the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads

– If chunk_size = k, chunks have at least k iterations (except the last)
– When no chunk_size is specified, it defaults to 1

  • runtime: kind and chunk_size are taken from the OMP_SCHEDULE environment variable (sketch below)
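A hedged sketch of the schedule clause (work is an assumed placeholder for an iteration of uneven cost):

#include <omp.h>

void work(int i);  // placeholder

void run(int n) {
    // Chunks of 4 iterations are handed out in request order, which helps
    // when iteration costs are uneven.
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < n; i++)
        work(i);

    // schedule(runtime) defers the choice to OMP_SCHEDULE, e.g.
    //   OMP_SCHEDULE="guided,8" ./a.out
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < n; i++)
        work(i);
}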
SLIDE 13

Single

#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < N; i++)
        a[i] = f0(i);
    #pragma omp single
    x = f1(a);
    #pragma omp for
    for (int i = 0; i < N; i++)
        b[i] = x * f2(i);
}

  • Only one of the threads executes
  • Other threads wait for it

– unless NOWAIT is specified

  • Hidden complexity

– Threads may be at different instructions

SLIDE 14

Sections

#pragma omp sections
{
    #pragma omp section
    {
        // do this ...
    }
    #pragma omp section
    {
        // do that ...
    }
    // ...
}

  • The omp section directives must be closely nested in a sections construct, where no other work-sharing construct may appear.

SLIDE 15

Private Variables

#pragma omp parallel for private(size, …)
for (int i = 0; i < numThreads; i++) {
    int size = numTasks / numThreads;
    int extra = numTasks - numThreads * size;
    if (i < extra) size++;
    doTask(i, size, numThreads);
}

doTask(int start, int count, int stride) {
    // Each thread’s instance has its own activation record
    for (int i = 0, t = start; i < count; i++, t += stride)
        doit(t);
}

SLIDE 16

Firstprivate and Lastprivate

  • Initial value of private variable is unspecified

– firstprivate initializes copies with the original
– Once per thread (not once per iteration)
– Original exists before the construct

  • Only the original copy is retained after the construct
  • lastprivate forces sequential-like behavior

– thread executing the sequentially last iteration (or last listed section) writes to the original copy

SLIDE 17

Firstprivate and Lastprivate

#pragma omp parallel for firstprivate(simple)
for (int i = 0; i < N; i++) {
    simple += a[f1(i, omp_get_thread_num())];
    f2(simple);
}

#pragma omp parallel for lastprivate(doneEarly)
for (i = 0; i < N && !doneEarly; i++) {
    doneEarly = f0(i);
}

SLIDE 18

Other Synchronization Directives

#pragma omp master { }

– Binds to the innermost enclosing parallel region
– Only the master executes
– No implied barrier

SLIDE 19

Master Directive

#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < 100; i++)
        a[i] = f0(i);
    #pragma omp master
    x = f1(a);
}

  • Only master executes. No synchronization.

SLIDE 20

Critical Section

#pragma omp critical (accessBankBalance)
{
}

– A single thread at a time
– Applies to all threads
– The name is optional; no name implies the global critical region
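A hedged sketch of a named critical section guarding a shared update (balance and the deposit amount are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int balance = 0;
    #pragma omp parallel
    {
        // All threads serialize on the same named region, even across
        // different parallel constructs that use this name.
        #pragma omp critical (accessBankBalance)
        {
            balance += 100;  // one depositor at a time
        }
    }
    printf("balance = %d\n", balance);
    return 0;
}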

SLIDE 21

Barrier Directive

#pragma omp barrier

– Stand-alone
– Binds to the innermost parallel region
– All threads in the team must execute it

  • they will all wait for each other at this instruction
  • Dangerous:

if (!ready)
    #pragma omp barrier

– The entire team must see the same sequence of work-sharing and barrier regions

SLIDE 22

Ordered Directive

#pragma omp ordered { }

  • Binds to the innermost enclosing loop
  • The structured block is executed in sequential order (sketch below)
  • The loop must declare the ordered clause
  • Each iteration may encounter at most one ordered region
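A hedged sketch (compute is an assumed placeholder):

#include <stdio.h>
#include <omp.h>

int compute(int i);  // placeholder for the parallel work

void run(int n) {
    #pragma omp parallel for ordered schedule(dynamic)
    for (int i = 0; i < n; i++) {
        int r = compute(i);        // runs in parallel, out of order
        #pragma omp ordered
        printf("%d: %d\n", i, r);  // prints in the sequential order of i
    }
}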
SLIDE 23

Flush Directive

#pragma omp flush (var1, var2)

– Stand-alone, like barrier
– Only directly affects the encountering thread
– The list of variables ensures that any compiler re-ordering moves all flushes together
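A hedged sketch of the classic flush-based handshake, following the pattern in the OpenMP specification's examples (data and done are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int data = 0, done = 0;
    #pragma omp parallel sections
    {
        #pragma omp section
        {                              // producer
            data = 42;
            #pragma omp flush(data)    // publish data before the flag
            done = 1;
            #pragma omp flush(done)
        }
        #pragma omp section
        {                              // consumer
            int seen;
            do {
                #pragma omp flush(done)   // re-read the flag from memory
                seen = done;
            } while (!seen);
            #pragma omp flush(data)       // pick up the published value
            printf("data = %d\n", data);
        }
    }
    return 0;
}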

SLIDE 24

Atomic Directive

#pragma omp atomic
i++;

  • Light-weight critical section
  • Only for some expressions

– x = expr (no mutual exclusion on expr evaluation)
– x++
– ++x
– x--
– --x
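A hedged sketch (the counter hits and the i % 3 test are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int hits = 0;
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++) {
        if (i % 3 == 0) {   // stands in for some test(i)
            #pragma omp atomic
            hits++;          // cheaper than a critical section for one update
        }
    }
    printf("hits = %d\n", hits);
    return 0;
}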

SLIDE 25

Reductions

  • Reductions are so common that OpenMP provides support for them
  • May add a reduction clause to the parallel for pragma
  • Specify the reduction operation and reduction variable
  • OpenMP takes care of storing partial results in private variables and combining the partial results after the loop (see the sketch below)
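A minimal sketch, assuming a summation over an array a:

#include <stdio.h>
#include <omp.h>

int main(void) {
    enum { N = 1000 };
    static double a[N];
    for (int i = 0; i < N; i++) a[i] = 1.0;

    double sum = 0.0;
    // Each thread accumulates into a private copy of sum; the copies are
    // combined with + after the loop.
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);  // 1000.0
    return 0;
}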

SLIDE 26

reduction Clause

  • reduction (<op> : <variable>)

– +  Sum
– *  Product
– &  Bitwise and
– |  Bitwise or
– ^  Bitwise exclusive or
– && Logical and
– || Logical or

  • Add to parallel for

– OpenMP creates a loop to combine copies of the variable
– The resulting loop may not be parallel

SLIDE 27

Nesting Restrictions

  • A work-sharing region may not be closely nested inside a work-sharing, critical, ordered, or master region.
  • A barrier region may not be closely nested inside a work-sharing, critical, ordered, or master region.
  • A master region may not be closely nested inside a work-sharing region.
  • An ordered region may not be closely nested inside a critical region.
  • An ordered region must be closely nested inside a loop region (or parallel loop region) with an ordered clause.
  • A critical region may not be nested (closely or otherwise) inside a critical region with the same name. Note that this restriction is not sufficient to prevent deadlock.

SLIDE 28

EXAMPLES

SLIDE 29

OpenMP Matrix Multiply

#pragma omp parallel for
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++) {
        c[i][j] = 0.0;
        for (int k = 0; k < n; k++)
            c[i][j] += a[i][k] * b[k][j];
    }

  • a, b, c are shared
  • i, j, k are private
SLIDE 30

OpenMP Matrix Multiply: Triangular

#pragma omp parallel for schedule(dynamic, 1)
for (int i = 0; i < n; i++)
    for (int j = i; j < n; j++) {
        c[i][j] = 0.0;
        for (int k = i; k < n; k++)
            c[i][j] += a[i][k] * b[k][j];
    }

  • This multiplies upper-triangular matrix A with B
  • Unbalanced workload

– Schedule improves this

SLIDE 31

OpenMP Jacobi

for some number of timesteps/iterations {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                 grid[i][j-1] + grid[i][j+1]);
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            grid[i][j] = temp[i][j];
}

  • This could be improved by using just one parallel region (see the sketch below)
  • Implicit barrier after loops eliminates race on grid
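A hedged sketch of that single-region version, paying the thread fork/join cost once (the function signature and interior-only loop bounds are assumptions; boundary handling is elided):

#include <omp.h>

void jacobi(int T, int n, double grid[n][n], double temp[n][n]) {
    #pragma omp parallel
    {
        for (int t = 0; t < T; t++) {   // every thread runs the timestep loop
            #pragma omp for
            for (int i = 1; i < n - 1; i++)
                for (int j = 1; j < n - 1; j++)
                    temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                         grid[i][j-1] + grid[i][j+1]);
            // implicit barrier: all of temp is ready before the copy-back
            #pragma omp for
            for (int i = 1; i < n - 1; i++)
                for (int j = 1; j < n - 1; j++)
                    grid[i][j] = temp[i][j];
            // implicit barrier: the copy finishes before the next timestep
        }
    }
}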
SLIDE 32

OpenMP Jacobi

for some number of timesteps/iterations {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                 grid[i][j-1] + grid[i][j+1]);
            #pragma omp barrier
            grid[i][j] = temp[i][j];
        }
}

  • Is barrier sufficient?
  • What change to the code is needed?

– Recall barrier is per-team