
CSL 860: Modern Parallel Computation - Hello OpenMP



  1. CSL 860: Modern Parallel Computation

  2. Hello OpenMP

     #pragma omp parallel                    // parallel construct
     {
        // I am now thread i of n
        switch (omp_get_thread_num()) {
           case 0: blah1(); break;
           case 1: blah2(); break;
        }
     }                                       // Back to normal

     • Extremely simple to use and incredibly powerful
     • Fork-Join model
     • Every thread has its own execution context
     • Variables can be declared shared or private
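     A minimal complete program in the same spirit (the blah1/blah2 bodies above
     are placeholders; this sketch just prints from each thread, and assumes
     compilation with OpenMP enabled, e.g. gcc -fopenmp):

        #include <stdio.h>
        #include <omp.h>

        int main(void) {
           #pragma omp parallel
           {
              // Each thread in the team executes this block
              printf("Hello from thread %d of %d\n",
                     omp_get_thread_num(), omp_get_num_threads());
           }  // implicit barrier; only the master continues
           return 0;
        }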

  3. Execution Model
     • Encountering thread creates a team:
       – Itself (master) + zero or more additional threads
     • Applies to the structured block immediately following
       – Each thread executes a copy of the code in { }
       – But, also see: work-sharing constructs
     • There is an implicit barrier at the end of the block
       – Only the master continues beyond the barrier
     • May be nested
       – Sometimes disabled by default
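     A small sketch of nesting; the omp_set_nested call is needed because nested
     parallelism is often disabled by default (thread counts are illustrative):

        #include <stdio.h>
        #include <omp.h>

        int main(void) {
           omp_set_nested(1);                      // enable nested teams
           #pragma omp parallel num_threads(2)     // outer team of 2
           {
              int outer = omp_get_thread_num();
              #pragma omp parallel num_threads(2)  // each outer thread forks an inner team
              printf("outer %d, inner %d\n", outer, omp_get_thread_num());
           }  // implicit barrier; only the master continues past here
           return 0;
        }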

  4. Memory Model
     • Notion of a temporary view of memory
       – Allows local caching
       – Need to flush memory
       – T1 writes -> T1 flushes -> T2 flushes -> T2 reads
       – Same order seen by all threads
     • Supports threadprivate memory
     • Variables declared before the parallel construct:
       – Shared by default
       – May be designated as private
         • n - 1 copies of the original variable are created
         • May not be initialized by the system
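     A minimal threadprivate sketch (the variable name counter is illustrative):

        #include <stdio.h>
        #include <omp.h>

        int counter = 0;                    // file-scope variable
        #pragma omp threadprivate(counter)  // each thread keeps its own persistent copy

        int main(void) {
           #pragma omp parallel
           {
              counter = omp_get_thread_num();  // no race: each thread writes its own copy
              printf("my counter = %d\n", counter);
           }
           return 0;
        }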

  5. Shared Variables
     • Heap-allocated storage
     • Static data members
     • const-qualified (no mutable members)
     • Private:
       – Variables declared in a scope inside the construct
       – Loop variable in a for construct
         • private to the construct
     • Others are shared unless declared private
       – You can change the default
     • Arguments passed by reference inherit from the original
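     Changing the default is usually done with default(none), which forces every
     variable's sharing to be stated explicitly; a small sketch (scale is a
     hypothetical helper):

        void scale(double *a, int n, double factor) {
           // default(none): the compiler rejects any unlisted variable
           #pragma omp parallel for default(none) shared(a, n, factor)
           for (int i = 0; i < n; i++)   // i, declared in the for, is private
              a[i] *= factor;
        }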

  6. Beware of Compiler Re-ordering

     a = b = 0

     thread 1                thread 2
     --------                --------
     b = 1;                  a = 1;
     flush(b); flush(a);     flush(a); flush(b);
     if (a == 0) {           if (b == 0) {
        critical section        critical section
     }                       }

  7. Beware more of Compiler Re-ordering

     // Parallel construct
     {
        int b = initialSalary;
        print("Initial Salary was %d\n", initialSalary);
        BookKeeping();   // No read of b or write of initialSalary
        if (b < 10000) {
           raiseSalary(500);
        }
     }

  8. Thread Control

     Environment Variable   Ways to modify value   Way to retrieve value   Initial value
     OMP_NUM_THREADS        omp_set_num_threads    omp_get_max_threads     Implementation defined*
     OMP_DYNAMIC            omp_set_dynamic        omp_get_dynamic         Implementation defined
     OMP_NESTED             omp_set_nested         omp_get_nested          false
     OMP_SCHEDULE           -                      -                       Implementation defined*

     * Also see the construct clauses: num_threads, schedule
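     The corresponding runtime calls look like this (printed values are
     implementation dependent):

        #include <stdio.h>
        #include <omp.h>

        int main(void) {
           omp_set_num_threads(4);   // overrides OMP_NUM_THREADS for later regions
           omp_set_dynamic(0);       // ask for exactly the requested thread count
           printf("max threads: %d\n", omp_get_max_threads());
           printf("dynamic: %d, nested: %d\n", omp_get_dynamic(), omp_get_nested());
           return 0;
        }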

  9. Parallel Construct

     #pragma omp parallel \
        if(boolean) \
        private(var1, var2, var3) \
        firstprivate(var1, var2, var3) \
        default(shared | none) \
        shared(var1, var2) \
        copyin(var2) \
        reduction(operator: list) \
        num_threads(n)
     {
     }
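     A sketch combining a few of these clauses (the names and the threshold are
     illustrative):

        #include <omp.h>

        int work(int n) {
           int local = 0, total = 0;
           // Parallelize only when n is large enough; each thread gets its own
           // initialized copy of local; total is combined on exit.
           #pragma omp parallel if(n > 1000) num_threads(8) \
                   default(shared) firstprivate(local) reduction(+:total)
           {
              local += omp_get_thread_num();
              total += local;
           }
           return total;
        }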

  10. Parallel Loop

     #pragma omp parallel for
     for (i = 0; i < N; ++i) {
        blah ...
     }

     • Number of iterations must be known when the construct is encountered
       – Must be the same for each thread
     • The compiler puts a barrier at the end of the parallel for
       – But see nowait (sketched below)
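     The nowait sketch mentioned above: the first loop's barrier can be dropped
     when the next loop does not read its results (a, b, f0, f1 are hypothetical):

        #pragma omp parallel
        {
           #pragma omp for nowait    // no barrier: threads fall straight through
           for (int i = 0; i < N; i++)
              a[i] = f0(i);

           #pragma omp for           // safe only because b[] does not depend on a[]
           for (int i = 0; i < N; i++)
              b[i] = f1(i);
        }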

  11. Parallel For

     #pragma omp for \
        private(var1, var2, var3) \
        firstprivate(var1, var2, var3) \
        lastprivate(var1, var2) \
        reduction(operator: list) \
        ordered \
        schedule(kind[, chunk_size]) \
        nowait

     • Canonical for loop
     • No loop break

  12. Schedule(kind[, chunk_size])

     Divide iterations into contiguous sets, chunks
     – chunks are assigned transparently to threads

     • static: iterations are divided among threads in a round-robin fashion
       – When no chunk_size is specified, approximately equal chunks are made
     • dynamic: iterations are assigned to threads in 'request order'
       – When no chunk_size is specified, it defaults to 1
     • guided: like dynamic, but the size of each chunk is proportional to the
       number of unassigned iterations divided by the number of threads
       – If chunk_size = k, chunks have at least k iterations (except the last)
       – When no chunk_size is specified, it defaults to 1
     • runtime: taken from the environment variable OMP_SCHEDULE
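     For iterations with very uneven cost, dynamic scheduling with a modest chunk
     is a common choice; a sketch (cost and result are hypothetical):

        // Hand out iterations 4 at a time, on demand
        #pragma omp parallel for schedule(dynamic, 4)
        for (int i = 0; i < N; i++)
           result[i] = cost(i);   // running time varies widely with i

        // Or defer the decision to run time via OMP_SCHEDULE:
        // #pragma omp parallel for schedule(runtime)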

  13. Single

     #pragma omp parallel
     {
        #pragma omp for
        for (int i = 0; i < N; i++)
           a[i] = f0(i);

        #pragma omp single
        x = f1(a);

        #pragma omp for
        for (int i = 0; i < N; i++)
           b[i] = x * f2(i);
     }

     • Only one of the threads executes the single block
     • Other threads wait for it
       – unless nowait is specified
     • Hidden complexity
       – Threads may be at different instructions

  14. Sections

     #pragma omp sections
     {
        #pragma omp section
        {
           // do this ...
        }
        #pragma omp section
        {
           // do that ...
        }
        // ...
     }

     • The omp section directives must be closely nested in a sections construct,
       where no other work-sharing construct may appear.

  15. Private Variables

     #pragma omp parallel for private(size, …)
     for (int i = 0; i < numThreads; i++) {
        int size = numTasks / numThreads;
        int extra = numTasks - numThreads * size;
        if (i < extra) size++;
        doTask(i, size, numThreads);
     }

     doTask(int start, int count, int stride) {
        // Each thread's instance has its own activation record
        for (int i = 0, t = start; i < count; i++, t += stride)
           doit(t);
     }

  16. Firstprivate and Lastprivate
     • Initial value of a private variable is unspecified
       – firstprivate initializes copies with the original
       – Once per thread (not once per iteration)
       – Original exists before the construct
     • Only the original copy is retained after the construct
     • lastprivate forces sequential-like behavior
       – The thread executing the sequentially last iteration (or last listed
         section) writes to the original copy

  17. Firstprivate and Lastprivate

     #pragma omp parallel for firstprivate(simple)
     for (int i = 0; i < N; i++) {
        simple += a[f1(i, omp_get_thread_num())];
        f2(simple);
     }

     #pragma omp parallel for lastprivate(doneEarly)
     for (i = 0; i < N && !doneEarly; i++) {
        doneEarly = f0(i);
     }

  18. Other Synchronization Directives

     #pragma omp master
     {
     }
     – Binds to the innermost enclosing parallel region
     – Only the master executes
     – No implied barrier

  19. Master Directive

     #pragma omp parallel
     {
        #pragma omp for
        for (int i = 0; i < 100; i++)
           a[i] = f0(i);

        #pragma omp master
        x = f1(a);      // Only master executes; no synchronization
     }

  20. Critical Section

     #pragma omp critical (accessBankBalance)
     {
     }
     – A single thread at a time
     – Applies to all threads
     – The name is optional; no name implies a global critical region
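     Distinct names give independent locks, so unrelated critical sections do not
     serialize against each other; a sketch (balance, logCount, deposit are
     illustrative):

        #include <omp.h>

        double balance  = 0;
        long   logCount = 0;

        void update(const double *deposit) {
           #pragma omp parallel
           {
              #pragma omp critical (accessBankBalance)
              balance += deposit[omp_get_thread_num()];

              #pragma omp critical (writeLog)   // different name: independent lock
              logCount++;
           }
        }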

  21. Barrier Directive

     #pragma omp barrier
     – Stand-alone
     – Binds to the innermost enclosing parallel region
     – All threads in the team must execute it
       • They will all wait for each other at this instruction
     • Dangerous:
          if (!ready)
             #pragma omp barrier
       – Requires the same sequence of work-sharing and barrier regions for the
         entire team
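     A legal use, where every thread reaches the barrier (contrast with the
     dangerous conditional form above; producePhase/consumePhase are hypothetical):

        #pragma omp parallel
        {
           producePhase();       // e.g., each thread fills its part of a shared buffer

           #pragma omp barrier   // everyone must arrive before anyone proceeds

           consumePhase();       // may now read what other threads produced
        }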

  22. Ordered Directive

     #pragma omp ordered
     {
     }
     • Binds to the innermost enclosing loop
     • The structured block is executed in sequential order
     • The loop must declare the ordered clause
     • Each iteration may encounter at most one ordered region
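     A sketch: the loop body runs in parallel, but the printing happens in
     iteration order (compute is hypothetical):

        #pragma omp parallel for ordered
        for (int i = 0; i < N; i++) {
           int v = compute(i);    // runs in parallel, in any order

           #pragma omp ordered    // this block executes in sequential i order
           printf("%d: %d\n", i, v);
        }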

  23. Flush Directive

     #pragma omp flush (var1, var2)
     – Stand-alone, like barrier
     – Only directly affects the encountering thread
     – The list of vars ensures that any compiler re-ordering moves all flushes
       together

  24. Atomic Directive

     #pragma omp atomic
     i++;

     • Light-weight critical section
     • Only for some expressions
       – x = expr (no mutual exclusion on the evaluation of expr)
       – x++
       – ++x
       – x--
       – --x
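     A common use is a shared histogram, where a full critical section per
     increment would be heavy (bin, data, hist are hypothetical):

        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
           int b = bin(data[i]);   // hypothetical bucketing function

           #pragma omp atomic      // protects only the single update to hist[b]
           hist[b]++;
        }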

  25. Reductions
     • Reductions are so common that OpenMP provides direct support for them
     • Add a reduction clause to the parallel for pragma
     • Specify the reduction operation and the reduction variable
     • OpenMP takes care of storing partial results in private variables and
       combining the partial results after the loop
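     The classic illustration is numerical integration for pi; a minimal sketch:

        #include <stdio.h>

        int main(void) {
           const long n = 100000000;
           const double h = 1.0 / n;
           double sum = 0.0;

           // Each thread accumulates a private partial sum; OpenMP combines them
           #pragma omp parallel for reduction(+:sum)
           for (long i = 0; i < n; i++) {
              double x = (i + 0.5) * h;
              sum += 4.0 / (1.0 + x * x);
           }
           printf("pi ~= %.12f\n", sum * h);
           return 0;
        }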

  26. reduction Clause

     reduction(<op> : <variable>)
       +   Sum
       *   Product
       &   Bitwise and
       |   Bitwise or
       ^   Bitwise exclusive or
       &&  Logical and
       ||  Logical or

     • Add to parallel for
       – OpenMP creates a loop to combine copies of the variable
       – The resulting combination loop may not itself be parallel

  27. Nesting Restrictions
     • A work-sharing region may not be closely nested inside a work-sharing,
       critical, ordered, or master region.
     • A barrier region may not be closely nested inside a work-sharing,
       critical, ordered, or master region.
     • A master region may not be closely nested inside a work-sharing region.
     • An ordered region may not be closely nested inside a critical region.
     • An ordered region must be closely nested inside a loop region (or
       parallel loop region) with an ordered clause.
     • A critical region may not be nested (closely or otherwise) inside a
       critical region with the same name.
       – Note that this restriction is not sufficient to prevent deadlock.

  28. EXAMPLES

  29. OpenMP Matrix Multiply

     #pragma omp parallel for
     for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
           c[i][j] = 0.0;
           for (int k = 0; k < n; k++)
              c[i][j] += a[i][k] * b[k][j];
        }

     • a, b, c are shared
     • i, j, k are private
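     A compilable wrapper around the same kernel; collapse(2) (OpenMP 3.0) is a
     common refinement that merges the i and j loops into one iteration space, and
     the private accumulator t is a judgment call, not part of the original slide:

        void matmul(int n, double **a, double **b, double **c) {
           #pragma omp parallel for collapse(2)
           for (int i = 0; i < n; i++)
              for (int j = 0; j < n; j++) {
                 double t = 0.0;              // private per-iteration accumulator
                 for (int k = 0; k < n; k++)
                    t += a[i][k] * b[k][j];
                 c[i][j] = t;
              }
        }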
