
AMath 483/583 — Lecture 13
R.J. LeVeque, University of Washington

Outline:

  • Parallel computing
  • Amdahl’s law
  • Speed up, strong and weak scaling
  • OpenMP

Reading:

  • class notes: bibliography for general books on parallel programming
  • class notes: OpenMP section of Bibliography


Amdahl’s Law

Typically only part of a computation can be parallelized. Suppose 50% of the computation is inherently sequential, and the other 50% can be parallelized.

Question: How much faster could the computation potentially run on many processors?

Answer: At most a factor of 2, no matter how many processors. The sequential part takes half the time, and that time is still required even if the parallel part is reduced to zero time.


Amdahl’s Law

Suppose 10% of the computation is inherently sequential, and the other 90% can be parallelized.

Question: How much faster could the computation potentially run on many processors?

Answer: At most a factor of 10, no matter how many processors. The sequential part takes 1/10 of the time, and that time is still required even if the parallel part is reduced to zero time.


Amdahl’s Law

Suppose 1/S of the computation is inherently sequential, and the other (1 − 1/S) can be parallelized. Then we can gain at most a factor of S, no matter how many processors.

If T_S is the time required on a sequential machine and we run on P processors, then the time required will be (at least):

    T_P = (1/S) T_S + (1 − 1/S) T_S / P

Note that T_P → (1/S) T_S as P → ∞.


Amdahl’s Law

Suppose 1/S of the computation is inherently sequential, so that

    T_P = (1/S) T_S + (1 − 1/S) T_S / P

Example: If 5% of the computation is inherently sequential (S = 20), then the reduction in time is:

    P       T_P
    1       1.000 T_S
    2       0.525 T_S
    4       0.288 T_S
    32      0.080 T_S
    128     0.057 T_S
    1024    0.051 T_S
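For reference, a short Fortran sketch (the program name amdahl.f90 is made up, not one of the lecture codes) that reproduces this table from the formula above:

    program amdahl
        implicit none
        real(kind=8) :: S, TP_over_TS
        integer :: P, k
        integer, dimension(6) :: procs = (/ 1, 2, 4, 32, 128, 1024 /)

        S = 20.d0          ! 1/S = 5% of the work is inherently sequential
        do k = 1, 6
            P = procs(k)
            ! Amdahl's law: T_P = (1/S) T_S + (1 - 1/S) T_S / P,
            ! printed here as the ratio T_P / T_S.
            TP_over_TS = 1.d0/S + (1.d0 - 1.d0/S) / P
            print "(i6, f10.3)", P, TP_over_TS
        enddo
    end program amdahl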


Speedup

The ratio T_S/T_P of the time on a sequential machine to the time running in parallel is the speedup. This is generally less than P for P processors, perhaps much less: Amdahl's Law, plus the overhead costs of starting processes/threads, communication, etc.

Caveat: You may (rarely) see speedup greater than P, for example if the data doesn't all fit in one cache but does fit in the combined caches of multiple processors.
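Combining this definition of speedup with Amdahl's law from the previous slides gives the bound (written here in LaTeX notation):

\[
\text{speedup} \;=\; \frac{T_S}{T_P}
\;=\; \frac{T_S}{(1/S)\,T_S + (1 - 1/S)\,T_S/P}
\;=\; \frac{1}{1/S + (1 - 1/S)/P}
\;\le\; S .
\]

As P → ∞ the speedup approaches S, consistent with the factor-of-2 and factor-of-10 examples above.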


Scaling

Some algorithms scale better than others as the number of processors increases.

Typically we are interested in how well algorithms work for large problems requiring lots of time, e.g.

  • particle methods for n particles,
  • algorithms for solving systems of n equations,
  • algorithms for solving PDEs on an n × n × n grid in 3D.

For large n there may be lots of inherent parallelism, but this depends on many factors: dependencies between calculations, communication as well as flops, and the nature of the problem and the algorithm chosen.


Scaling

Typically we are interested in how well algorithms work for large problems requiring lots of time.

Strong scaling: How does the algorithm perform as the number of processors P increases for a fixed problem size n? Any algorithm will eventually break down (consider P > n).

Weak scaling: How does the algorithm perform when the problem size increases with the number of processors? E.g., if we double the number of processors, can we solve a problem “twice as large” in the same time?


Weak scaling

What does “twice as large” mean? That depends on how the algorithm's complexity scales with n.

Example: Solving an n × n linear system with Gaussian elimination requires O(n^3) flops, so doubling n requires 8 times as many operations. The problem is “twice as large” if we increase n by a factor of 2^(1/3) ≈ 1.26, e.g. from 100 × 100 to 126 × 126. (Or it may be better to count memory accesses!)
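The factor 1.26 is just the value of n that doubles the O(n^3) flop count:

\[
2\,n_{\mathrm{old}}^{\,3} \;=\; n_{\mathrm{new}}^{\,3}
\qquad\Longrightarrow\qquad
n_{\mathrm{new}} \;=\; 2^{1/3}\, n_{\mathrm{old}} \;\approx\; 1.26\, n_{\mathrm{old}} .
\]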


Weak scaling

Solving the steady state heat equation on an n × n × n grid: n^3 grid points ⟹ a linear system with this many unknowns.

If we used Gaussian elimination (a very bad idea!) we would require ∼ (n^3)^3 = n^9 flops, so doubling n would require 2^9 = 512 times more flops.

Good iterative methods can do the job in O(n^3 log2(n)) work or less (e.g. multigrid).

Developing better algorithms is as important as better hardware!!
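As a rough illustration (taking the O(n^3 log2 n) estimate at face value and ignoring constants), doubling n costs:

\[
\frac{(2n)^9}{n^9} \;=\; 2^9 \;=\; 512,
\qquad\qquad
\frac{(2n)^3\,\log_2(2n)}{n^3\,\log_2(n)}
\;=\; 8\,\frac{\log_2(n) + 1}{\log_2(n)}
\;\approx\; 9.2 \quad\text{for } n = 100 .
\]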


Speedup for problems like steady state heat equation

[Figure omitted: speedup plot. Source: SIAM Review 43 (2001), p. 168.]


OpenMP

“Open Specifications for MultiProcessing”

A standard for shared memory parallel programming, i.e. for shared memory computers such as multi-core machines. Can be used with Fortran (77/90/95/2003), C, and C++.

Complete specifications at http://www.openmp.org


OpenMP References

  • http://www.openmp.org
  • http://www.openmp.org/wp/resources/
  • B. Chapman, G. Jost, R. van der Pas, Using OpenMP: Portable Shared Memory Parallel Programming, MIT Press, 2007.
  • R. Chandra, L. Dagum, et al., Parallel Programming in OpenMP, Academic Press, 2001.

Other references in:

  • class notes bibliography: OpenMP
  • class notes bibliography: Other courses


OpenMP — Basic Idea

Explicit programmer control of parallelization, using the fork-join model of parallel execution:

  • All OpenMP programs begin as a single process, the master thread, which executes until a parallel region construct is encountered.
  • FORK: the master thread creates a team of parallel threads.
  • JOIN: when the threads complete the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread.

[Diagram omitted: execution alternates between serial sections and parallel regions, with a fork at the start of each parallel region and a join at its end.]
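A minimal sketch of the fork-join pattern (the program name and print strings are illustrative, not from the lecture codes; compile with gfortran -fopenmp):

    program forkjoin
        implicit none
        print *, "before parallel region: master thread only"
        !$omp parallel
        ! FORK: this block is executed by every thread in the team
        print *, "inside parallel region"
        !$omp end parallel
        ! JOIN: threads have synchronized; only the master thread continues
        print *, "after parallel region: master thread only"
    end program forkjoin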


OpenMP — Basic Idea

  • Rule of thumb: one thread per processor (or core).
  • The user inserts compiler directives telling the compiler how statements are to be executed:
      • which parts are parallel,
      • how to assign code in parallel regions to threads,
      • what data is private (local) to threads.
  • The compiler generates explicit threaded code.
  • Dependencies in parallel parts require synchronization between threads.
  • It is the user’s job to remove dependencies in parallel parts or to use synchronization. (Tools exist to look for race conditions.)


OpenMP compiler directives

OpenMP uses compiler directives that start with !$ (pragmas in C). These look like comments to standard Fortran but are recognized when compiled with the flag -fopenmp.

OpenMP statements:

  • Ordinary Fortran statements, conditionally compiled:

        !$ print *, "Compiled with -fopenmp"

  • OpenMP compiler directives, e.g.

        !$omp parallel do

  • Calls to OpenMP library routines:

        use omp_lib   ! need this module
        !$ call omp_set_num_threads(2)


OpenMP directives

The general form is

    !$omp directive [clause ...]

where the optional clauses include:

    if (scalar_expression)
    private (list)
    shared (list)
    default (shared | none)
    firstprivate (list)
    reduction (operator: list)
    copyin (list)
    num_threads (integer-expression)
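A sketch of how a few of these clauses might be combined on a single directive (the program, array names, and values are illustrative, not from the lecture codes):

    program clauses_demo
        implicit none
        integer :: i
        integer, parameter :: n = 1000
        real(kind=8) :: x(n), y(n), tmp, total

        x = 1.d0
        y = 2.d0
        total = 0.d0

        ! i and tmp are private to each thread, the arrays are shared,
        ! and the partial sums are combined by the reduction clause.
        !$omp parallel do num_threads(4) private(tmp) shared(x, y) reduction(+: total)
        do i = 1, n
            tmp = x(i) * y(i)
            total = total + tmp
        enddo
        !$omp end parallel do

        print *, "total = ", total    ! should print 2000.0
    end program clauses_demo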


A few OpenMP directives

    !$omp parallel [clause]
        ! block of code
    !$omp end parallel

    !$omp parallel do [clause]
        ! do loop
    !$omp end parallel do

    !$omp barrier    ! wait until all threads arrive

Several others we’ll see later...
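A small sketch of how the barrier directive might be used inside a parallel region (the two phases of work are made up for illustration; compile with gfortran -fopenmp):

    program barrier_demo
        use omp_lib
        implicit none
        integer :: tid

        !$ call omp_set_num_threads(2)

        !$omp parallel private(tid)
        tid = omp_get_thread_num()
        print *, "phase 1 done on thread ", tid     ! output lines may interleave
        !$omp barrier      ! no thread starts phase 2 until all have finished phase 1
        print *, "phase 2 starts on thread ", tid
        !$omp end parallel
    end program barrier_demo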


OpenMP

The API also provides for (but an implementation may not support):

  • nested parallelism (parallel constructs inside other parallel constructs),
  • dynamically altering the number of threads in different parallel regions.

The standard says nothing about parallel I/O.

OpenMP provides a "relaxed-consistency" view of memory: threads can cache their data and are not required to maintain exact consistency with real memory all the time. !$omp flush can be used as a memory fence at a point where all threads must have a consistent view of memory.


OpenMP test code — $UWHPSC/codes/openmp

    program test
        use omp_lib
        integer :: thread_num

        ! Specify number of threads to use:
        !$ call omp_set_num_threads(2)

        print *, "Testing openmp ..."

        !$omp parallel
        !$omp critical
        !$ thread_num = omp_get_thread_num()
        !$ print *, "This thread = ", thread_num
        !$omp end critical
        !$omp end parallel

    end program test


OpenMP test code output

Compiled with OpenMP:

    $ gfortran -fopenmp test.f90
    $ ./a.out
     Testing openmp ...
     This thread =  0
     This thread =  1

(or the threads might print in the other order!)

Compiled without OpenMP:

    $ gfortran test.f90
    $ ./a.out
     Testing openmp ...


OpenMP test code

    ! Specify number of threads to use:
    !$ call omp_set_num_threads(2)

You can specify more threads than processors, but they won’t execute in parallel.

The number of threads is determined by (in order):

  • evaluation of the if clause of a directive (if it evaluates to zero or false ⟹ serial execution),
  • the num_threads clause,
  • the omp_set_num_threads() library function,
  • the OMP_NUM_THREADS environment variable,
  • the implementation default.
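A sketch of this precedence in practice: the num_threads clause on the directive takes precedence over the earlier library call, so the region below should run with 4 threads rather than 2 (the program name and print strings are illustrative):

    program nthreads_demo
        use omp_lib
        implicit none

        !$ call omp_set_num_threads(2)     ! library routine: request 2 threads

        ! The num_threads clause overrides the call above for this region.
        !$omp parallel num_threads(4)
        !$omp critical
        !$ print *, "thread ", omp_get_thread_num(), " of ", omp_get_num_threads()
        !$omp end critical
        !$omp end parallel
    end program nthreads_demo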


OpenMP test code

    !$omp parallel
    !$omp critical
    !$ thread_num = omp_get_thread_num()
    !$ print *, "This thread = ", thread_num
    !$omp end critical
    !$omp end parallel

The !$omp parallel block spawns two threads and each one works independently, doing all the instructions in the block. The threads are destroyed at !$omp end parallel.

However, the statements are also in a !$omp critical block, which indicates that this section of the code can be executed by only one thread at a time, so in fact they are not done in parallel.

So why do this? The function omp_get_thread_num() returns a unique number for each thread, and we want to print both of these numbers.


OpenMP test code

Incorrect code without the critical section:

    !$omp parallel
    !$ thread_num = omp_get_thread_num()
    !$ print *, "This thread = ", thread_num
    !$omp end parallel

Why not do these in parallel?

  1. If the prints are done simultaneously they may come out garbled (characters of one interspersed in the other).
  2. thread_num is a shared variable. If this were not in a critical section, the following would be possible:

        Thread 0 executes the function, sets thread_num = 0
        Thread 1 executes the function, sets thread_num = 1
        Thread 0 executes the print statement: "This thread = 1"
        Thread 1 executes the print statement: "This thread = 1"

     There is a data race or race condition.


OpenMP test code

We could instead add a private clause:

    !$omp parallel private(thread_num)
    !$ thread_num = omp_get_thread_num()
    !$omp critical
    !$ print *, "This thread = ", thread_num
    !$omp end critical
    !$omp end parallel

Then each thread has its own version of the thread_num variable.


OpenMP parallel do loops

    !$omp parallel do
    do i=1,n
        ! do stuff for each i
    enddo
    !$omp end parallel do    ! OPTIONAL

This indicates that the do loop can be done in parallel. It requires that what’s done for each value of i is independent of the others, so that different values of i can be done in any order.

The iteration variable i is private to the thread: each thread has its own version. By default, all other variables are shared between threads unless specified otherwise, so we need to be careful that threads use shared variables properly.
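A minimal runnable sketch of such a loop (the array names, sizes, and values are illustrative; compile with gfortran -fopenmp):

    program paralleldo_demo
        implicit none
        integer :: i
        integer, parameter :: n = 100000
        real(kind=8) :: x(n), y(n)

        x = 1.d0
        y = 2.d0

        ! Each iteration only touches x(i) and y(i), so the iterations are
        ! independent and can be done in any order.  i is private to each
        ! thread; the arrays x and y are shared.
        !$omp parallel do
        do i = 1, n
            y(i) = 2.d0 * x(i) + y(i)
        enddo
        !$omp end parallel do

        print *, "y(1) = ", y(1), "  y(n) = ", y(n)
    end program paralleldo_demo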
