
AMath 483/583 — Lecture 13
R.J. LeVeque, University of Washington

Outline:

  • Parallel computing
  • Amdahl’s law
  • Speed up, strong and weak scaling
  • OpenMP

Reading:

  • class notes: bibliography for general books on parallel programming
  • class notes: OpenMP section of Bibliography


Amdahl’s Law

Typically only part of a computation can be parallelized. Suppose 50% of the computation is inherently sequential, and the other 50% can be parallelized.

Question: How much faster could the computation potentially run on many processors?

Answer: At most a factor of 2, no matter how many processors. The sequential part takes half the time, and that time is still required even if the parallel part is reduced to zero time.


Amdahl’s Law

Suppose 10% of the computation is inherently sequential, and the other 90% can be parallelized.

Question: How much faster could the computation potentially run on many processors?

Answer: At most a factor of 10, no matter how many processors. The sequential part takes 1/10 of the time, and that time is still required even if the parallel part is reduced to zero time.


Amdahl’s Law

Suppose 1/S of the computation is inherently sequential, and the other (1 − 1/S) can be parallelized. Then we can gain at most a factor of S, no matter how many processors.

If T_S is the time required on a sequential machine and we run on P processors, then the time required will be (at least):

    T_P = (1/S) T_S + (1 − 1/S) T_S / P

Note that T_P → (1/S) T_S as P → ∞.


Amdahl’s Law

Suppose 1/S of the computation is inherently sequential, so that

    T_P = (1/S) T_S + (1 − 1/S) T_S / P

Example: If 5% of the computation is inherently sequential (S = 20), then the reduction in time is:

    P       T_P
    1       1.000 T_S
    2       0.525 T_S
    4       0.288 T_S
    32      0.080 T_S
    128     0.057 T_S
    1024    0.051 T_S
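For reference, a short Fortran sketch (the program name amdahl.f90 is made up, not one of the lecture codes) that reproduces this table from the formula above:

    program amdahl
        implicit none
        real(kind=8) :: S, TP_over_TS
        integer :: P, k
        integer, dimension(6) :: procs = (/ 1, 2, 4, 32, 128, 1024 /)

        S = 20.d0          ! 1/S = 5% of the work is inherently sequential
        do k = 1, 6
            P = procs(k)
            ! Amdahl's law: T_P = (1/S) T_S + (1 - 1/S) T_S / P,
            ! printed here as the ratio T_P / T_S.
            TP_over_TS = 1.d0/S + (1.d0 - 1.d0/S) / P
            print "(i6, f10.3)", P, TP_over_TS
        enddo
    end program amdahl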


Speedup

The ratio T_S/T_P of the time on a sequential machine to the time running in parallel is the speedup. This is generally less than P for P processors, perhaps much less: Amdahl's Law, plus the overhead costs of starting processes/threads, communication, etc.

Caveat: You may (rarely) see speedup greater than P, for example if the data doesn't all fit in one cache but does fit in the combined caches of multiple processors.
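Combining this definition of speedup with Amdahl's law from the previous slides gives the bound (written here in LaTeX notation):

\[
\text{speedup} \;=\; \frac{T_S}{T_P}
\;=\; \frac{T_S}{(1/S)\,T_S + (1 - 1/S)\,T_S/P}
\;=\; \frac{1}{1/S + (1 - 1/S)/P}
\;\le\; S .
\]

As P → ∞ the speedup approaches S, consistent with the factor-of-2 and factor-of-10 examples above.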


Scaling

Some algorithms scale better than others as the number of processors increases.

Typically we are interested in how well algorithms work for large problems requiring lots of time, e.g.

  • particle methods for n particles,
  • algorithms for solving systems of n equations,
  • algorithms for solving PDEs on an n × n × n grid in 3D.

For large n there may be lots of inherent parallelism, but this depends on many factors: dependencies between calculations, communication as well as flops, and the nature of the problem and the algorithm chosen.


Scaling

Typically we are interested in how well algorithms work for large problems requiring lots of time.

Strong scaling: How does the algorithm perform as the number of processors P increases for a fixed problem size n? Any algorithm will eventually break down (consider P > n).

Weak scaling: How does the algorithm perform when the problem size increases with the number of processors? E.g., if we double the number of processors, can we solve a problem “twice as large” in the same time?


Weak scaling

What does “twice as large” mean? That depends on how the algorithm's complexity scales with n.

Example: Solving an n × n linear system with Gaussian elimination requires O(n^3) flops, so doubling n requires 8 times as many operations. The problem is “twice as large” if we increase n by a factor of 2^(1/3) ≈ 1.26, e.g. from 100 × 100 to 126 × 126. (Or it may be better to count memory accesses!)
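The factor 1.26 is just the value of n that doubles the O(n^3) flop count:

\[
2\,n_{\mathrm{old}}^{\,3} \;=\; n_{\mathrm{new}}^{\,3}
\qquad\Longrightarrow\qquad
n_{\mathrm{new}} \;=\; 2^{1/3}\, n_{\mathrm{old}} \;\approx\; 1.26\, n_{\mathrm{old}} .
\]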


Weak scaling

Solving the steady state heat equation on an n × n × n grid: n^3 grid points ⟹ a linear system with this many unknowns.

If we used Gaussian elimination (a very bad idea!) we would require ∼ (n^3)^3 = n^9 flops, so doubling n would require 2^9 = 512 times more flops.

Good iterative methods can do the job in O(n^3 log2(n)) work or less (e.g. multigrid).

Developing better algorithms is as important as better hardware!!
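As a rough illustration (taking the O(n^3 log2 n) estimate at face value and ignoring constants), doubling n costs:

\[
\frac{(2n)^9}{n^9} \;=\; 2^9 \;=\; 512,
\qquad\qquad
\frac{(2n)^3\,\log_2(2n)}{n^3\,\log_2(n)}
\;=\; 8\,\frac{\log_2(n) + 1}{\log_2(n)}
\;\approx\; 9.2 \quad\text{for } n = 100 .
\]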


Speedup for problems like steady state heat equation

[Figure omitted: speedup plot. Source: SIAM Review 43 (2001), p. 168.]


OpenMP

“Open Specifications for MultiProcessing”

A standard for shared memory parallel programming, i.e. for shared memory computers such as multi-core machines. Can be used with Fortran (77/90/95/2003), C, and C++.

Complete specifications at http://www.openmp.org


OpenMP References

  • http://www.openmp.org
  • http://www.openmp.org/wp/resources/
  • B. Chapman, G. Jost, R. van der Pas, Using OpenMP: Portable Shared Memory Parallel Programming, MIT Press, 2007.
  • R. Chandra, L. Dagum, et al., Parallel Programming in OpenMP, Academic Press, 2001.

Other references in:

  • class notes bibliography: OpenMP
  • class notes bibliography: Other courses


OpenMP — Basic Idea

Explicit programmer control of parallelization, using the fork-join model of parallel execution:

  • All OpenMP programs begin as a single process, the master thread, which executes until a parallel region construct is encountered.
  • FORK: the master thread creates a team of parallel threads.
  • JOIN: when the threads complete the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread.

[Diagram omitted: execution alternates between serial sections and parallel regions, with a fork at the start of each parallel region and a join at its end.]
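A minimal sketch of the fork-join pattern (the program name and print strings are illustrative, not from the lecture codes; compile with gfortran -fopenmp):

    program forkjoin
        implicit none
        print *, "before parallel region: master thread only"
        !$omp parallel
        ! FORK: this block is executed by every thread in the team
        print *, "inside parallel region"
        !$omp end parallel
        ! JOIN: threads have synchronized; only the master thread continues
        print *, "after parallel region: master thread only"
    end program forkjoin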


OpenMP — Basic Idea

  • Rule of thumb: one thread per processor (or core).
  • The user inserts compiler directives telling the compiler how statements are to be executed:
      • which parts are parallel,
      • how to assign code in parallel regions to threads,
      • what data is private (local) to threads.
  • The compiler generates explicit threaded code.
  • Dependencies in parallel parts require synchronization between threads.
  • It is the user’s job to remove dependencies in parallel parts or to use synchronization. (Tools exist to look for race conditions.)


OpenMP compiler directives

OpenMP uses compiler directives that start with !$ (pragmas in C). These look like comments to standard Fortran but are recognized when compiled with the flag -fopenmp.

OpenMP statements:

  • Ordinary Fortran statements, conditionally compiled:

        !$ print *, "Compiled with -fopenmp"

  • OpenMP compiler directives, e.g.

        !$omp parallel do

  • Calls to OpenMP library routines:

        use omp_lib   ! need this module
        !$ call omp_set_num_threads(2)


OpenMP directives

The general form is

    !$omp directive [clause ...]

where the optional clauses include:

    if (scalar_expression)
    private (list)
    shared (list)
    default (shared | none)
    firstprivate (list)
    reduction (operator: list)
    copyin (list)
    num_threads (integer-expression)
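A sketch of how a few of these clauses might be combined on a single directive (the program, array names, and values are illustrative, not from the lecture codes):

    program clauses_demo
        implicit none
        integer :: i
        integer, parameter :: n = 1000
        real(kind=8) :: x(n), y(n), tmp, total

        x = 1.d0
        y = 2.d0
        total = 0.d0

        ! i and tmp are private to each thread, the arrays are shared,
        ! and the partial sums are combined by the reduction clause.
        !$omp parallel do num_threads(4) private(tmp) shared(x, y) reduction(+: total)
        do i = 1, n
            tmp = x(i) * y(i)
            total = total + tmp
        enddo
        !$omp end parallel do

        print *, "total = ", total    ! should print 2000.0
    end program clauses_demo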


A few OpenMP directives

    !$omp parallel [clause]
        ! block of code
    !$omp end parallel

    !$omp parallel do [clause]
        ! do loop
    !$omp end parallel do

    !$omp barrier    ! wait until all threads arrive

Several others we’ll see later...
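A small sketch of how the barrier directive might be used inside a parallel region (the two phases of work are made up for illustration; compile with gfortran -fopenmp):

    program barrier_demo
        use omp_lib
        implicit none
        integer :: tid

        !$ call omp_set_num_threads(2)

        !$omp parallel private(tid)
        tid = omp_get_thread_num()
        print *, "phase 1 done on thread ", tid     ! output lines may interleave
        !$omp barrier      ! no thread starts phase 2 until all have finished phase 1
        print *, "phase 2 starts on thread ", tid
        !$omp end parallel
    end program barrier_demo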


OpenMP

The API also provides for (but an implementation may not support):

  • nested parallelism (parallel constructs inside other parallel constructs),
  • dynamically altering the number of threads in different parallel regions.

The standard says nothing about parallel I/O.

OpenMP provides a "relaxed-consistency" view of memory: threads can cache their data and are not required to maintain exact consistency with real memory all the time. !$omp flush can be used as a memory fence at a point where all threads must have a consistent view of memory.


OpenMP test code — $UWHPSC/codes/openmp

    program test
        use omp_lib
        integer :: thread_num

        ! Specify number of threads to use:
        !$ call omp_set_num_threads(2)

        print *, "Testing openmp ..."

        !$omp parallel
        !$omp critical
        !$ thread_num = omp_get_thread_num()
        !$ print *, "This thread = ", thread_num
        !$omp end critical
        !$omp end parallel

    end program test


OpenMP test code output

Compiled with OpenMP:

    $ gfortran -fopenmp test.f90
    $ ./a.out
     Testing openmp ...
     This thread =  0
     This thread =  1

(or the threads might print in the other order!)

Compiled without OpenMP:

    $ gfortran test.f90
    $ ./a.out
     Testing openmp ...


OpenMP test code

    ! Specify number of threads to use:
    !$ call omp_set_num_threads(2)

You can specify more threads than processors, but they won’t execute in parallel.

The number of threads is determined by (in order):

  • evaluation of the if clause of a directive (if it evaluates to zero or false ⟹ serial execution),
  • the num_threads clause,
  • the omp_set_num_threads() library function,
  • the OMP_NUM_THREADS environment variable,
  • the implementation default.
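A sketch of this precedence in practice: the num_threads clause on the directive takes precedence over the earlier library call, so the region below should run with 4 threads rather than 2 (the program name and print strings are illustrative):

    program nthreads_demo
        use omp_lib
        implicit none

        !$ call omp_set_num_threads(2)     ! library routine: request 2 threads

        ! The num_threads clause overrides the call above for this region.
        !$omp parallel num_threads(4)
        !$omp critical
        !$ print *, "thread ", omp_get_thread_num(), " of ", omp_get_num_threads()
        !$omp end critical
        !$omp end parallel
    end program nthreads_demo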


OpenMP test code

    !$omp parallel
    !$omp critical
    !$ thread_num = omp_get_thread_num()
    !$ print *, "This thread = ", thread_num
    !$omp end critical
    !$omp end parallel

The !$omp parallel block spawns two threads and each one works independently, doing all the instructions in the block. The threads are destroyed at !$omp end parallel.

However, the statements are also in a !$omp critical block, which indicates that this section of the code can be executed by only one thread at a time, so in fact they are not done in parallel.

So why do this? The function omp_get_thread_num() returns a unique number for each thread, and we want to print both of these numbers.


OpenMP test code

Incorrect code without the critical section:

    !$omp parallel
    !$ thread_num = omp_get_thread_num()
    !$ print *, "This thread = ", thread_num
    !$omp end parallel

Why not do these in parallel?

  1. If the prints are done simultaneously they may come out garbled (characters of one interspersed in the other).
  2. thread_num is a shared variable. If this were not in a critical section, the following would be possible:

        Thread 0 executes the function, sets thread_num = 0
        Thread 1 executes the function, sets thread_num = 1
        Thread 0 executes the print statement: "This thread = 1"
        Thread 1 executes the print statement: "This thread = 1"

     There is a data race or race condition.


OpenMP test code

We could instead add a private clause:

    !$omp parallel private(thread_num)
    !$ thread_num = omp_get_thread_num()
    !$omp critical
    !$ print *, "This thread = ", thread_num
    !$omp end critical
    !$omp end parallel

Then each thread has its own version of the thread_num variable.


OpenMP parallel do loops

    !$omp parallel do
    do i=1,n
        ! do stuff for each i
    enddo
    !$omp end parallel do    ! OPTIONAL

This indicates that the do loop can be done in parallel. It requires that what’s done for each value of i is independent of the others, so that different values of i can be done in any order.

The iteration variable i is private to the thread: each thread has its own version. By default, all other variables are shared between threads unless specified otherwise, so we need to be careful that threads use shared variables properly.
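A minimal runnable sketch of such a loop (the array names, sizes, and values are illustrative; compile with gfortran -fopenmp):

    program paralleldo_demo
        implicit none
        integer :: i
        integer, parameter :: n = 100000
        real(kind=8) :: x(n), y(n)

        x = 1.d0
        y = 2.d0

        ! Each iteration only touches x(i) and y(i), so the iterations are
        ! independent and can be done in any order.  i is private to each
        ! thread; the arrays x and y are shared.
        !$omp parallel do
        do i = 1, n
            y(i) = 2.d0 * x(i) + y(i)
        enddo
        !$omp end parallel do

        print *, "y(1) = ", y(1), "  y(n) = ", y(n)
    end program paralleldo_demo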
