Parallel Computing
Eero Vainikko
Spring 2019 eero.vainikko@ut.ee University of Tartu Institute of Computer Science
Lectures -- WED 10:15 - 12:00 -- Liivi 2 - 512 Practicals -- FRI 12:15 - 14:00 -- Liivi 2 - 512
Bring your laptop!
Final grade
○ Exam - blackboard exam
○ with pair-programming, sometimes
Themes
linear equations, ...
Computer classes
○ Numpy ○ Scipy
ENIAC
Pre-history Even before electronic computers were invented, parallel computing existed!
Forecasts - computed using arrays of human 'computers'
Computations in the war industry - using Felix-like mechanical calculators
1950 - Emergence of first electronic computers
1960 - Mainframe era. IBM
IBM 1620
1970 - Era of Minis 1980 - Era of PCs 1990 - Era of parallel computers 2000 - Clusters / Grids 2010 … - Cloud computing era 20xx - Quantum computers? What’s next?
Something else?
Much-cited legend: In 1947 computer engineer Howard Aiken said that the USA would need at most 6 computers in the future! 1950: Thomas J. Watson likewise: 6 computers will be needed in the world. 1977: Seymour Cray: the Cray-1 computer will attract only 100 clients worldwide. 1980: A study by IBM – at most about 50 Cray-class computers could be sold per year worldwide. Reality: how much Cray-class processing power do we currently carry in our pockets?
Gordon Moore's (co-founder of Intel) law (1965): the number of switches doubles every second year. 1975 – refinement of the above: the number of switches on a CPU doubles every 18 months. Continuing at this rate, by about 2030 we would reach the atomic level – or quantum computers! But why do we need all this computing power?
for i in range(1000000000000):   # 10^12 operations
    z[i] = x[i] + y[i]           # i.e. 3*10^12 memory accesses
Assuming that data travels at the speed of light, 3 ∗ 10⁸ m/s, to finish the operation in 1 second the average distance of a memory cell from the processor can be at most 10⁻⁴ m. Typical computer memory is laid out as a rectangular mesh; packing 3 ∗ 10¹² cells into such an area, the length of each cell edge would be on the order of 10⁻¹⁰ m – atomic dimensions.
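A quick back-of-the-envelope check of this argument in Python (the square-mesh layout at the end is only an illustrative assumption):

import math

c = 3e8              # speed of light, m/s
accesses = 3e12      # memory accesses needed within 1 second

# average distance a signal may travel per memory access
avg_distance = c / accesses          # = 1e-4 m, i.e. 0.1 mm
print(avg_distance)

# if the 3*10^12 memory cells are laid out on a square mesh of that size,
# each cell gets an edge of roughly:
edge = avg_distance / math.sqrt(accesses)
print(edge)                          # ~6e-11 m, i.e. atomic dimensions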
https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/
First computers: 10² - hundred flops (100 Flops)
Desktop computers: 10⁹ - Gigaflops (GFlops), billion flops
Supercomputers: 10¹² - Teraflops (TFlops), trillion flops
Fastest clusters: 10¹⁵ - Petaflops (PFlops), quadrillion flops
Aim now: 10¹⁸ - Exaflops (EFlops), quintillion flops
Next step: 10²¹ - Zettaflops (ZFlops), sextillion flops
Further goal: 10²⁴ - Yottaflops (YFlops), septillion flops
How can parallel computers help to achieve these goals?
Weather forecast for Europe for the next 48 hours, from sea level up to 20 km – need to solve a PDE in 3D (x, y, z and t). The volume of the region: 5000 ∗ 4000 ∗ 20 km³. Stepsize in the xy-plane: 1 km (Cartesian mesh); in the z-direction: 0.1 km. Timestep: 1 min. Ca 1000 flops per gridpoint at each timestep. 5000 ∗ 4000 ∗ 20 km³ × 10 cells per km³ = 4 ∗ 10⁹ meshpoints. 4 ∗ 10⁹ meshpoints ∗ 48 ∗ 60 timesteps × 10³ flop ≈ 1.15 ∗ 10¹⁵ flop. 1.15 ∗ 10¹⁵ / 10⁹ = 1.15 ∗ 10⁶ seconds ≈ 13 days! But with 10¹² flop/s: 1.15 ∗ 10³ seconds, i.e. < 20 min.
Refining the model (e.g. the xy-stepsize reduced to 0.1 km and the timestep to 10 seconds) → the total time would grow from 20 min to 8 days!
A further goal is to extend the forecast to the whole globe (the area: 2 ∗ 10⁷ km² → 5 ∗ 10⁸ km²) and to combine the model with an ocean model.
Whatever the available possibilities, one only needs to decrease ε and h to become unsatisfied again!
https://www.top500.org/resources/presentations/
modeling, simulation, design, signal processing, engineering, geology, tectonics, ... In-class exercise #1:
For hints, some pictures from https://computing.llnl.gov/tutorials/parallel_comp/
piazza.com/ut.ee/spring2019/mtat08020 In order to get used to Piazza: ➢ Look (google) around for challenging parallel computing problems ➢ choose your favorite HPC application ➢ Post your findings at Piazza ○ Describe the application! ■ What makes it challenging? ■ (If you can, find out approximately how much time the calculation/simulation takes) ○ Include links to project page, pictures, videos etc.!
Engineering applications
(optimisation, cost, shape, safety etc)
design
Scientific applications
mechanics, macromolecules, composite materials, chemical reactions etc).
thermonuclear reactions, postprocessing of large datasets produced by telescopes)
Commercial applications
space shuttles ) Applications in computer systems
attacks)
➢ Find a set of Exascale computing applications ➢ Post your findings at Piazza ○ Describe the application! ■ Why is it Exascale? ■ Include links to project page, pictures, videos etc.!
In-class exercise #2
piazza.com/ut.ee/spring2019/mtat08020
Useful Videos:
Alamos National Lab
Exascale, what's happening?:
Computing in 2021
Simulation Code: From Laptops to Exascale Computers
computing and big data
Exascale computing and its applications from various sources ○ Springer ○ Kothe2018.pdf ○ Presentation
Some interesting links (student’s choice)
Serial computing
○ The problem is broken into a series of instructions ○ Instructions are executed one by one ○ On a single processor only one instruction can be executed at any moment in time
[Figure: a problem decomposed into a single stream of instructions t1 … t16, executed one after another on one processor]
Parallel computing
○ simultaneous use of multiple compute resources to solve a computational problem:
■ A problem is broken into discrete parts that can be solved concurrently
■ Each part is further broken down to a series of instructions
■ Instructions from each part execute simultaneously on different processors
■ An overall control/coordination mechanism is employed
[Figure: the problem split into four parts, each with its own instruction stream t1 … t16, executed simultaneously on four processors]
The compute resources are typically:
○ a single computer with multiple cores/CPUs
○ an arbitrary number of such computers connected by a network
The computational problem should be able to: ○ Be broken apart into discrete pieces of work that can be solved simultaneously ○ Execute multiple program instructions at any moment in time ○ Be solved in less time with multiple compute resources than with a single compute resource
Virtually all stand-alone computers today are parallel from a hardware perspective: ○ Multiple functional units (L1 cache, L2 cache, branch, prefetch, decode, floating-point, graphics processing (GPU), integer, etc.) ○ Multiple execution units/cores ○ Multiple hardware threads
IBM BG/Q Compute Chip with 18 cores (PU) and 16 L2 Cache units (L2):
IBM Blue Gene
By Courtesy Argonne National Laboratory, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=24653857
# 3 in top500 - 2012 June # 21 in top500 - 2018 Nov
Fine-grain parallelism
○ e.g. Instruction-Level Parallelism
○ Relatively small amounts of computational work are done between communication/synchronisation events
○ Low computation-to-communication ratio
○ High communication overhead and little opportunity for performance increase
Coarse-grain parallelism
○ Relatively large amounts of computational work are done between communication/synchronization events
○ High computation-to-communication ratio
○ More opportunity for performance increase, but harder to load-balance efficiently
Embarrassingly parallel
○ The coarsest granularity one can imagine: tasks need few or no synchronization points
○ Examples: Monte-Carlo method; processing a huge set of images in parallel
ILP - measure of how many of the instructions in a computer program can be executed simultaneously Parallel instructions are a set of instructions that do not depend on each other when executed
○ 16-bit add on 8-bit CPU
for i in range(10000):
    x[i] = x[i] + y[i]
○ Multi-core CPUs Example:
e = a + b
f = c + d
g = e * f
(the first two statements are independent and can execute simultaneously; the third depends on both)
ILP - a challenge for compilers and processor designers. ILP allows the compiler and the processor to use techniques such as:
❖ Instruction pipelining
➢ Similar to a car production line - different sub-operations are performed on the 'moving' instructions
Basic five-stage pipeline in a RISC machine (IF - instruction fetch, ID - decode, EX - execute, MEM - memory access, WB - register write-back)
In the fourth clock cycle (the green column), the earliest instruction is in the MEM stage, and the latest instruction has not yet entered the pipeline.
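As an illustration (not tied to any particular CPU), a few lines of Python print the classic overlap diagram for such a pipeline:

# Instruction i enters pipeline stage s in clock cycle i + s (0-based),
# so once the pipeline is full one instruction completes every cycle.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]
n_instr = 5
n_cycles = n_instr + len(STAGES) - 1

print("cycle:    " + "".join("%5d" % (c + 1) for c in range(n_cycles)))
for i in range(n_instr):
    row = ["     "] * n_cycles
    for s, name in enumerate(STAGES):
        row[i + s] = "%5s" % name
    print("instr %d:  " % (i + 1) + "".join(row))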
➢ Superscalar CPU architecture
○ Implements ILP inside a single CPU ○ More than one instruction per clock cycle ○ Dispatches multiple instructions to multiple redundant functional units inside the processor ■ Each separate functional unit not a separate CPU core but an execution resource inside CPU, like:
Example: Fetching-dispatching 2 instructions a time:
➢ Out-of-Order execution
○ Technique used in most high-performance CPUs
○ The key idea - allow the processor to avoid a certain class of delays caused by data being unavailable
Processing of instructions is broken into these steps:
➔ Instruction fetch
➔ Instruction dispatch to an instruction queue (also called instruction buffer)
➔ The instruction waits in the queue until its input operands are available
➔ The instruction is issued to the appropriate functional unit and executed there
➔ The results are queued (re-order buffer)
➔ The results are written back to the register file
➢ Register renaming
○ technique used to avoid unnecessary serialization of program operations imposed by the reuse of registers by those operations; used to enable out-of-order execution
○ technique that abstracts logical registers from physical registers
Original code (reuses r1):
r1 = m[1024]
r1 = r1 + 2
m[1032] = r1
r1 = m[2048]
r1 = r1 + 4
m[2056] = r1
After renaming (the second use of r1 is renamed to r2, so the two sequences become independent):
r1 = m[1024]
r1 = r1 + 2
m[1032] = r1
r2 = m[2048]
r2 = r2 + 4
m[2056] = r2
➢ Speculative execution
○ allows the execution of complete instructions or parts of instructions before being certain whether this execution should take place ■ control flow speculation ■ value prediction ■ memory dependence prediction ■ cache latency prediction
○ → execute both (all) possible scenarios
○ → predict the most likely scenario!
➢ Branch prediction
○ used to avoid stalling for control dependencies to be resolved ○ used with speculative execution
○ Pattern History Table
The memory system, not processor speed, often appears as the bottleneck
○ latency - the time it takes from a request until delivery from the memory
○ bandwidth - the amount of data delivered per unit of time (between the processor and memory)
Example: Memory latency
blocksize 1 word
memory access takes 100 clock cycles
[Figure: memory hierarchy - registers, cache, RAM, storage (network storage, clouds); data moves up the hierarchy before operations and back down after them]
Motivation: Consider an arbitrary algorithm. Denote
○ f - number of floating point operations (flops)
○ m - number of memory references
○ tf - time per flop, tm - time per memory access
The total time is then approximately f tf + m tm = f tf (1 + tm / (tf q)).
Introduce q = f/m. Why is this number important?
In general tm ⋙ tf, therefore the total time reflects the processor speed only if q is large.
Example: Gauss elimination method
for each i – key operations (1) and (2):
for i in range(n):
    A[i+1:n, i] = A[i+1:n, i] / A[i, i]                                     # op. (1)
    A[i+1:n, i+1:n] = A[i+1:n, i+1:n] - np.outer(A[i+1:n, i], A[i, i+1:n])  # op. (2)
Operation (1) of type: y = a*x + y   # saxpy
○ 2n + 1 reads ○ n writes
Operation (2) of type: A = A − vwᵀ, A ∈ ℝⁿˣⁿ, v, w ∈ ℝⁿ
○ n² + 2n reads ○ n² writes
1st order operation: O(n) flops; 2nd order operation: O(n²) flops.
But q = O(1) in both cases!
Example of 3rd order operation
Faster results in case of 3rd order operations (O(n³) operations with O(n²) memory references). For example, matrix multiplication: C = AB + C, where A, B, C ∈ ℝⁿˣⁿ. Here m = 4n² and f = n²(2n − 1) + n² = 2n³ ⇒ q = n/2 → ∞ as n → ∞. This operation can keep the processor working near peak performance, given good algorithm scheduling!
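A small Python check of the flop-to-memory-reference ratios q = f/m for the three operation orders discussed above (the operation counts follow the formulas on this and the previous slide):

def q_saxpy(n):          # y = a*x + y
    f = 2 * n            # n multiplications + n additions
    m = (2 * n + 1) + n  # reads of x, y, a  +  writes of y
    return f / m

def q_rank1(n):          # A = A - v w^T
    f = 2 * n * n
    m = (n * n + 2 * n) + n * n
    return f / m

def q_matmul(n):         # C = A B + C
    f = 2 * n ** 3
    m = 4 * n ** 2       # read A, B, C and write C
    return f / m

for n in (100, 1000, 10000):
    print(n, round(q_saxpy(n), 3), round(q_rank1(n), 3), round(q_matmul(n), 1))
# q stays O(1) for the 1st and 2nd order operations but grows like n/2 for matmul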
○ BLAS – freeware, available for example from netlib (http://www.netlib.org/blas/) ○ Processor vendors often supply their own implementation ○ ATLAS (http://math-atlas.sourceforge.net/) – a self-optimising BLAS implementation
Example of using BLAS (fortran90):
http://www.ut.ee/~eero/SC/konspekt/Naited/lu1blas3.f90.html
http://www.ut.ee/~eero/SC/konspekt/Naited/testblas3.f90.html )
between main memory and DRAM
storage
Example: Cache effect
➢ Cache size: 32 KB with latency 1 ns
➢ Multiply small matrices, so that all matrices A, B and C fit simultaneously into cache
➢ We observe that:
○ reading 2 matrices into cache (2K words) takes ca 200 μs
○ multiplying two n × n matrices takes 2n³ ≈ 64K operations, performed in 16K cycles (4 ops/cycle), i.e. ca 16 μs
➢ ⇒ total time (data access + calculations) = 200 μs + 16 μs = 216 μs
➢ performance: 64K operations / 216 μs ≈ 303 Mflops
➢ Access to the same data corresponds to temporal locality. In the given example there are O(n²) data accesses and O(n³) computations; such an asymptotic difference works very well with a cache. Data reuse is of critical importance for performance!
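To see the effect of data reuse in practice, one can time the same O(n³) multiplication once as a naive Python triple loop and once through numpy's BLAS3 routine. (In Python much of the gap is interpreter overhead, but the cache blocking done inside BLAS matters on any implementation; timings are indicative only.)

import time
import numpy as np

n = 200
A = np.random.rand(n, n)
B = np.random.rand(n, n)

t0 = time.time()
C = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        s = 0.0
        for k in range(n):
            s += A[i, k] * B[k, j]
        C[i, j] = s
t_naive = time.time() - t0

t0 = time.time()
C_blas = A @ B          # optimised, cache-blocked BLAS3 routine
t_blas = time.time() - t0

print("naive loop: %.3f s,  BLAS (numpy): %.6f s" % (t_naive, t_blas))
print("max difference:", np.max(np.abs(C - C_blas)))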
Anant Vithal Nori, Jayesh Gaur, Siddharth Rai, Sreenivas Subramoney and Hong Wang, Criticality Aware Tiered Cache Hierarchy: A Fundamental Relook at Multi-level Cache Hierarchies, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), DOI: 10.1109/ISCA.2018.00019 https://ieeexplore.ieee.org/abstract/document/8416821 Ivy Bo Peng, Roberto Gioiosa, Gokcen Kestor, Pietro Cicotti, Erwin Laure and Stefano Markidis, Exploring the Performance Benefit of Hybrid Memory System on HPC Environments, 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), DOI: 10.1109/IPDPSW.2017.115 https://ieeexplore.ieee.org/abstract/document/7965110 In-class reading:
Flynn's taxonomy:
                          Single instr. stream   Multiple instr. streams
Single data stream        SISD                   MISD
Multiple data streams     SIMD                   MIMD
SISD - Single Instruction Single Data
○ A serial (non-parallel) computer
○ Single Instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
○ Single Data: only one data stream is being used as input during any one clock cycle
○ Examples: older generation mainframes, minicomputers and workstations, early laptops etc.
(https://computing.llnl.gov/tutorials/parallel_comp/#Flynn)
SIMD - Single Instruction Multiple Data
○ Single Instruction: all processing units execute the same instruction at any given clock cycle
○ Multiple Data: each processing unit can operate on a different data element
○ Best suited for problems with a high degree of regularity, such as graphics/image processing
○ Examples:
■ Processor arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV
■ Vector pipelines: Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10
(https://computing.llnl.gov/tutorials/parallel_comp/#Flynn)
MISD - Multiple Instruction Single Data
○ Multiple Instruction: each processing unit operates on the data independently via separate instruction streams
○ Single Data: a single data stream is fed into multiple processing units
○ Few (if any) actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971).
○ Some conceivable uses:
○ multiple frequency filters operating on a single signal stream
○ multiple cryptography algorithms attempting to crack a single coded message
(Carnegie-Mellon C.mmp: https://en.wikipedia.org/wiki/C.mmp)
MIMD - Multiple Instruction Multiple Data
○ Multiple Instruction: every processor may be executing a different instruction stream
○ Multiple Data: every processor may be working with a different data stream
○ Execution can be synchronous or asynchronous, deterministic or non-deterministic
○ Currently the most common type of parallel computer
○ Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs
Note: many MIMD architectures also include SIMD execution sub-components.
Flynn-Johnson classification – the MIMD class is further subdivided according to the memory organisation (global vs distributed) and the communication mechanism (shared variables vs message passing):

                      Shared variables                          Message passing
Global memory         GMSV: shared-memory multiprocessors       GMMP: rarely used
Distributed memory    DMSV: distributed shared memory           DMMP: distributed-memory multicomputers
Shared-memory multiprocessors
○ Multiple (multicore) CPUs sharing the same memory resource
○ Classified according to memory access times: UMA / NUMA
[Figure: several (multicore) CPUs attached to one shared memory]
Shared-memory multiprocessors
Uniform Memory Access (UMA):
○ Most commonly represented by Symmetric Multiprocessor (SMP) machines
○ Identical processors, equal access times to memory
○ Sometimes called CC-UMA - Cache Coherent UMA:
○ if one processor updates a location in shared memory, all the other processors know about the update
○ implemented at the hardware level
[Figure: four CPUs attached to one shared memory]
Distributed shared-memory
Non-Uniform Memory Access (NUMA):
○ Often made by physically linking two or more SMPs
○ One SMP can directly access the memory of another SMP
○ Not all processors have equal access time to all memories
○ Memory access across the link is slower
○ If cache coherency is maintained, may also be called CC-NUMA - Cache Coherent NUMA
Advantages:
○ Global address space provides a user-friendly programming perspective to memory
○ Data sharing between tasks is fast and uniform due to the proximity of memory to CPUs
[Figure: several SMP nodes (memory + CPUs) connected by a bus interconnect into one global address space]
Disadvantages:
○ Lack of scalability between memory and CPUs: adding more CPUs increases traffic on the shared memory-CPU path and, for cache-coherent systems, the traffic associated with cache/memory management
○ The programmer is responsible for synchronization constructs that ensure "correct" access of global memory
Distributed memory architectures
○ Networked processors with their private memories
○ Processors exchange data:
○ via message-passing
○ explicit communication operations
[Figure: several nodes, each a CPU with its own memory, connected by a network]
Advantages:
○ Memory is scalable with the number of processors
○ Each processor can rapidly access its own local memory, without the overhead of maintaining cache coherency
○ Cost effectiveness: can use commodity, off-the-shelf processors and networking components
Disadvantages:
○ Data communication between processors is the responsibility of the programmer
○ Mapping existing data structures, based on global memory, to the system can be complicated
○ Non-uniform memory access times
Hybrid Distributed-Shared Memory
○ Memory components shared between CPUs and GPUs within a machine
○ Network communication required to move data between the nodes
[Figures: nodes of (multicore) CPUs and GPUs, each node with its own shared memory, connected to each other by a network]
1. Look around in the Top500 list and choose your favorite computer system there for a short review of the system's highlights! In particular, also address the architectural aspects we have discussed so far in the course! 2. Post the short review to the course Piazza: piazza.com/ut.ee/spring2019/mtat08020
Programs
Systems
Granularity
Performance
Minimum Cost-Optimal Execution Time
Programs
Based on: Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Introduction to Parallel Computing, Second Edition. Addison-Wesley, 2003
Analytical Modeling - Basics
A sequential algorithm is usually evaluated in terms of its runtime, expressed as a function of input size n:
– T(n) = O(f(n)) if there exist positive constants c and n0 such that T(n) ≤ c f(n) for all n ≥ n0
– T(n) = Ω(f(n)) if there exist positive constants c and n0 such that T(n) ≥ c f(n) for all n ≥ n0
– T(n) = Θ(f(n)) if T(n) = O(f(n)) and T(n) = Ω(f(n))
The runtime of a parallel algorithm depends on:
– the input size
– the number of processors
– the communication parameters of the machine
Analytical Modeling - Basics
○ Wall clock time - the time from the start of the first processor to the stopping time of the last processor in a parallel ensemble.
○ But how does this scale when the number of processors is changed or the program is ported to another machine altogether?
○ This begs the obvious follow-up question - what is the baseline serial version with which we compare? Can we use a suboptimal serial program to make our parallel program look more attractive?
Sources of Overhead in Parallel Programs
If I use two processing elements, shouldn't my program run twice as fast?
○ No - a number of overheads, including wasted computation, communication, idling, and contention cause degradation in performance.
○ Interprocess interactions: processors working on any non-trivial parallel problem will need to talk to each other.
○ Idling: processes may idle while waiting for results from other processes - because of load imbalance, synchronization, or serial components.
○ Excess computation: computation not performed by the serial version. This might be because the serial algorithm is difficult to parallelize, or because some computations are repeated across processors to minimize communication.
Performance Metrics for Parallel Systems: Execution Time, Speedup
Execution time = time elapsed between
➢ the beginning and the end of execution on a sequential computer
➢ the beginning of execution on the first processor and the end of execution on the last processor on a parallel computer
We denote the serial runtime by TS and the parallel runtime by TP. Let Tall be the total time collectively spent by all the processing elements: Tall = p TP (p is the number of processors).
Observe that Tall − TS is the total time spent by all processors combined on non-useful work. This is called the total overhead:
To = p TP − TS
Speedup (S) is the ratio of the time taken to solve a problem on a single processor to the time required to solve the same problem on a parallel computer with p identical processing elements: S = TS / TP.
Performance Metrics: Example
Consider the problem of adding n numbers by using n processing elements.
If n is a power of two, we can perform this operation in log n steps by propagating partial sums up a logical binary tree of processors.
[Figure: computing the global sum of 16 partial sums using 16 processing elements; Σᵢ..ⱼ denotes the sum of the numbers with consecutive labels from i to j]
Performance Metrics: Example (continued)
If an addition takes constant time tc and communication of a single word takes time ts + tw, we have the parallel time TP = Θ(log n).
Performance Metrics: Speedup
These algorithms may have different asymptotic runtimes and may be parallelizable to different degrees.
For the purpose of computing speedup, we always consider the best sequential program as the baseline.
Performance Metrics: Speedup Example
seconds.
○ In this case, the speedup is 12/14 = 0.86. This is a more realistic assessment of the system.
Performance Metrics: Speedup Bounds
○ Speedup can be as low as 0 (the parallel program never terminates).
○ Speedup can, in theory, be at most p: a speedup greater than p is possible only if each processing element spends less than time TS / p solving the problem.
○ In that case a single processing element could emulate the p processing elements and obtain a faster serial program, which contradicts our assumption of the fastest serial program as the basis for speedup.
Performance Metrics: Superlinear Speedups
One reason for superlinearity is that the parallel version does less work than the corresponding serial algorithm. Example: searching an unstructured tree for a node with a given label 'S' on two processing elements using depth-first traversal. The two-processor version, with processor 0 searching the left subtree and processor 1 searching the right subtree, expands only the shaded nodes before the solution is found. The corresponding serial formulation expands the entire tree before finding the solution - it does more work than the parallel algorithm.
Performance Metrics: Superlinear Speedups
Resource-based superlinearity: The higher aggregate cache/memory bandwidth can result in better cache-hit ratios, and therefore superlinearity. Example: A processor with 64KB of cache yields an 80% hit ratio. If two processors are used, since the problem size/processor is smaller, the hit ratio goes up to 90%. Of the remaining 10% access, 8% come from local memory and 2% from remote memory. If DRAM access time is 100 ns, cache access time is 2 ns, and remote memory access time is 400ns, this corresponds to a speedup of 2.43!
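The arithmetic of this example can be reproduced directly in Python (all values are taken from the text above):

t_cache, t_dram, t_remote = 2.0, 100.0, 400.0   # ns

# one processor: 80% cache hits, 20% local DRAM accesses
t1 = 0.80 * t_cache + 0.20 * t_dram

# two processors: 90% cache hits, 8% local DRAM, 2% remote memory
t2 = 0.90 * t_cache + 0.08 * t_dram + 0.02 * t_remote

# same total work split over 2 processors, each with a faster average access time
speedup = 2 * t1 / t2
print(t1, t2, round(speedup, 2))   # 21.6 ns, 17.8 ns, speedup ~2.43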
Performance Metrics: Efficiency
Efficiency is a measure of the fraction of time for which a processing element is usefully employed: E = S / p = TS / (p TP).
Following the bounds on speedup, efficiency can be as low as 0 and as high as 1.
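A minimal sketch implementing these definitions (To = p·TP − TS, S = TS/TP, E = S/p), applied to the earlier example of adding n numbers on n processing elements:

import math

def overhead(Ts, Tp, p):
    return p * Tp - Ts

def speedup(Ts, Tp):
    return Ts / Tp

def efficiency(Ts, Tp, p):
    return speedup(Ts, Tp) / p

# adding n numbers on n processing elements in Theta(log n) time
n = 1024
Ts, Tp, p = n, math.log2(n), n
print(speedup(Ts, Tp), efficiency(Ts, Tp, p))   # ~102.4 and ~0.1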
Performance Metrics: Efficiency Example
Parallel Time, Speedup, and Efficiency Example
Consider the problem of edge detection in images. The problem requires us to apply a 3 x 3 template to each pixel. If each multiply-add operation takes time tc, the serial time for an n x n image is given by TS = tc n².
Example of edge detection: (a) an 8 x 8 image; (b) typical templates for detecting edges; and (c) partitioning of the image across four processors with shaded regions indicating image data that must be communicated from neighboring processors to processor 1.
Parallel Time, Speedup, and Efficiency Example (continued)
The image is partitioned among the processors into equal vertical segments, each with n² / p pixels.
Each processor must also exchange boundary pixel values with its neighbours; this communication takes time 2(ts + tw n).
Parallel Time, Speedup, and Efficiency Example (continued)
and
Cost of a Parallel System
Cost is the product of parallel runtime and the number of processing elements used: p × TP.
Cost reflects the sum of the time that each processing element spends solving the problem.
Cost of a Parallel System: Example
Consider the problem of adding n numbers on n processors.
The parallel time is Θ(log n), so the cost is Θ(n log n), while the serial cost is Θ(n) ⇒ the algorithm is not cost optimal.
Impact of Non-Cost Optimality
Consider a sorting algorithm that uses n processing elements to sort a list of n numbers in time (log n)².
Since the serial runtime of a comparison-based sort is n log n, the speedup and efficiency of this algorithm are given by n / log n and 1 / log n, respectively.
Effect of Granularity on Performance
Using fewer than the maximum possible number of processing elements to execute a parallel algorithm is called scaling down a parallel system.
A naive way of scaling down is to think of each processor in the original case as a virtual processor and to assign virtual processors equally to the scaled-down (physical) processors.
Since the number of processing elements decreases by a factor of n / p, the computation at each processing element increases by a factor of n / p.
The communication cost should not increase by this factor, since some of the virtual processors assigned to a physical processor might talk to each other.
Building Granularity: Example
Consider the problem of adding n numbers on p processing elements such that p < n and both n and p are powers of 2.
Use the parallel algorithm for n processors, except that now we think of them as virtual processors.
○ virtual processing element i is simulated by the physical processing element labeled i mod p
The first log p of the log n steps are simulated in (n / p) log p steps on p processing elements.
Building Granularity: Example (continued)
The overall parallel time of this simulation is Θ((n / p) log p).
The cost is Θ(n log p), which is asymptotically higher than the Θ(n) cost of adding n numbers sequentially. Therefore, the parallel system is not cost-optimal.
Building Granularity: Example (continued)
Can we build granularity in the example in a cost-optimal fashion?
Each processing element first locally adds its n / p numbers in time Θ(n / p).
The p partial sums are then added up the tree in log p steps. [Figure: a cost-optimal way of computing the sum of 16 numbers using four processing elements]
Building Granularity: Example (continued)
TP = Θ (n / p + log p),
Scaling Characteristics of Parallel Programs
As we increase the number of processing elements, To increases.
parallel programs.
Scaling Characteristics of Parallel Programs: Example
Consider the problem of adding n numbers on p processing elements.
Scaling Characteristics of Parallel Programs: Example (continued)
Plotting the speedup for various input sizes gives us: [Figure: speedup versus the number of processing elements for adding a list of numbers, for several values of n]
Speedup tends to saturate and efficiency drops as a consequence of Amdahl's law
In-class exercise 6:
One possible starting point can be: https://www.slant.co/topics/6024/~programming-languages-for-concurrent-programming
Amdahl’s law
In each algorithm there exist parts that cannot be parallelised – the sequential part; denote its fraction by σ and assume the rest (1 − σ) can be parallelised optimally. Then the speedup on P processors is bounded by
S(N, P) = 1 / (σ + (1 − σ)/P) ≤ 1/σ.
P : max S(N,P)
2 : 1.9
4 : 3.5
10 : 6.9
20 : 10.3
100 : 16.8
∞ : 20
Example 1. Assume 5% of the algorithm is not parallelisable (i.e. σ = 0.05) ⇒ S(N, P) ≤ 1/0.05 = 20 (see the table above). Example 2. σ = 0.67 (33% parallelisable), P = 10 ⇒ S(N, 10) = 1/(0.67 + 0.33/10) ≈ 1.42.
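A short sketch reproducing the table and the examples above from Amdahl's formula S(P) = 1/(σ + (1 − σ)/P):

def amdahl(sigma, P):
    return 1.0 / (sigma + (1.0 - sigma) / P)

sigma = 0.05
for P in (2, 4, 10, 20, 100):
    print(P, round(amdahl(sigma, P), 1))   # reproduces the table: 1.9, 3.5, 6.9, 10.3, 16.8
print("limit:", 1 / sigma)                 # 20

print(round(amdahl(0.67, 10), 2))          # Example 2: ~1.42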
John Gustafson & Ed Barsis (Scania Laboratory) 1988:
they beat Amdahl's law!
(while Amdahl's law would have predicted a maximum speedup of only 125...250) How was it possible? Does Amdahl's law hold?
It is not a very good idea to solve a problem with a fixed size N on an arbitrarily large number of processors!
if σ → 0 with N → ∞ Scaled efficiency (to avoid misunderstandings:)
Gustafson-Barsis’ law
Scaled efficiency
The problem size increases accordingly as new processors are added – does the time remain the same?
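For comparison, a sketch of the Gustafson-Barsis scaled speedup, S(P) = P − σ(P − 1), where σ is the sequential fraction of the runtime measured on the parallel machine (the problem size is assumed to grow with P so that the runtime stays fixed):

def gustafson(sigma, P):
    return P - sigma * (P - 1)

for P in (2, 4, 10, 100, 1024):
    print(P, gustafson(0.05, P))   # grows almost linearly instead of saturating at 1/sigma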
Methods to increase efficiency
Factors influencing efficiency:
Profiling parallel programs
Overlapping communication and computations Example: Parallel Ax-operation for sparse matrices in Domain Decomposition setups
Start communication (non-blocking) and meanwhile do the calculations on the interior parts of the region ⇒ savings in waiting time.
Extra computations instead of communication
○ (e.g. instead of sending intermediate results over the network)
○ Example: random number generation - broadcast only the seed and generate the numbers in parallel (deterministic algorithm).
Computer Benchmarks
Some HPC (High Performance Computing) benchmarks:
Jack Dongarra. HPL - High Performance Linpack, using MPI and BLAS. Solving systems of linear equations with dense matrices. The aim is to fit a problem with maximal size (advisably utilising 80% of memory). http://www.netlib.org/benchmark/hpl/ Used for http://www.top500.org
Gflops (usually <80%memory size)
Gflops
the better, but usually in range 32...256.
Numerical Aerodynamic Simulation (NAS) Parallel Benchmarks (NPB) [1] https://www.nas.nasa.gov/publications/npb.html (1992 v1, 1996 v2.1, 2002 v2.2) The original eight benchmarks specified in NPB 1 mimic the computation and data movement in CFD applications:
○ IS - Integer Sort, random memory access ○ EP - Embarrassingly Parallel ○ CG - Conjugate Gradient, irregular memory access and communication ○ MG - Multi-Grid on a sequence of meshes, long- and short-distance communication, memory intensive ○ FT - discrete 3D fast Fourier Transform, all-to-all communication
○ BT - Block Tri-diagonal solver ○ SP - Scalar Penta-diagonal solver ○ LU - Lower-Upper Gauss-Seidel solver
Numerical Aerodynamic Simulation (NAS) Parallel Benchmarks (NPB) [2] Multi-zone versions of NPB (NPB-MZ) are designed to exploit multiple levels of parallelism in applications and to test the effectiveness of multi-level and hybrid parallelization paradigms and tools.
applications of NPB:
○ BT-MZ - uneven-size zones within a problem class, increased number of zones as problem class grows ○ SP-MZ - even-size zones within a problem class, increased number of zones as problem class grows ○ LU-MZ - even-size zones within a problem class, a fixed number of zones for all problem classes
Numerical Aerodynamic Simulation (NAS) Parallel Benchmarks (NPB) [3]
movement
○ UA - Unstructured Adaptive mesh, dynamic and irregular memory access ○ BT-IO - test of different parallel I/O techniques ○ DC - Data Cube ○ DT - Data Traffic
communicating tasks derived from the NPB. They symbolize distributed applications typically run on grids.
○ ED - Embarrassingly Distributed ○ HC - Helical Chain ○ VP - Visualization Pipeline ○ MB - Mixed Bag
The HINT (Hierarchical INTegration). Graphical view of:
https://web.archive.org/web/20130724124556/http://hint.byu.edu/
A) Find out at OpenBenchmarking.org
a) What are currently the most popular benchmarks related to parallel computing? b) Find a benchmark that you like the most and describe it! c) Any other interesting aspect you find fascinating?
B) Find out about top500 vs graph500
a) What is their difference? b) What inspired the creation of graph500? c) How different are these lists? d) Some other interesting aspect you notice when comparing the two benchmarks?
○ Data mininig ○ Molecular dynamics ○ Cryptographic algorithms ○
○ Integral equations ■ Numerical solution usually implies dense matrices ■ Parallelisation of LU-factorisation ○ Numerical solution of Partial Differential Equations (PDEs) ■ Sparse matrix structures ■ → methods for iterative solution of systems with sparse matrices
EXAMPLE Finite element method for solving Poisson equation
Example: 2D Finite Element Method for the Poisson Equation
Poisson equation: −Δu = f in Ω, u = 0 on Γ,
where the Laplacian Δ is defined by Δu = ∂²u/∂x² + ∂²u/∂y².
The solution u can be, for example, the displacement of an elastic membrane fixed at the boundary under a transversal load of intensity f.
[Figure: domain Ω with boundary Γ in the (x, y)-plane]
The divergence of a vector function F = (F1, F2) is defined by ∇·F = ∂F1/∂x + ∂F2/∂y.
Divergence theorem: ∫_Ω ∇·F dx = ∫_Γ F·n ds, where F = (F1, F2) is a vector-valued function defined on Ω, n = (n1, n2) is the outward unit normal to Γ, dx is the element of area in ℝ², and ds is the arc length along Γ.
[Figure: domain Ω with boundary Γ and outward unit normal n]
Finite element method for solving Poisson equation
Applying the divergence theorem to F = (vw, 0) and F = (0, vw) (details not shown here) – Green's first identity – we come to a variational formulation of the Poisson problem on V.
Poisson problem in variational formulation: find u ∈ V such that a(u, v) = (f, v) for all v ∈ V, where in the case of the Poisson equation a(u, v) = ∫_Ω ∇u·∇v dx and (f, v) = ∫_Ω f v dx.
The gradient ∇ of a scalar function f(x, y) is defined by ∇f = (∂f/∂x, ∂f/∂y).
Finite element method for solving Poisson equation
( a(u, v) is also called a bilinear form )
Again, the variational formulation of the Poisson equation is: find u ∈ V such that a(u, v) = (f, v) for all v ∈ V. With discretisation we replace the space V with Vh – the space of piecewise linear functions. Each function in Vh can be written as v(x, y) = Σⱼ ξⱼ φⱼ(x, y), where the φⱼ(x, y) are basis functions (or 'hat' functions).
Finite element method for solving Poisson equation
With 2D FEM we demand that the equation in the variational formulation is satisfied for the M basis functions φi ∈ Vh, i.e. a(uh, φi) = (f, φi), i = 1, ..., M. Writing uh = Σⱼ ξⱼ φⱼ, we thus have M linear equations with respect to the unknowns ξⱼ: Σⱼ a(φⱼ, φᵢ) ξⱼ = (f, φᵢ), i = 1, ..., M.
Finite element method for solving Poisson equation
The stiffness matrix A = (aij), aij = a(φj, φi), and the right-hand side b = (bi), bi = (f, φi), are assembled by computing integrals only where the pairs ∇φi · ∇φj get in touch (have mutual support).
( The support of a function f = f(x) is defined as the region of values x for which f(x) ≠ 0. )
Finite element method for solving Poisson equation
Example: two basis functions φi and φj for nodes Ni and Nj. Their common support is τ ∪ τ′, so that aij = ∫_{τ∪τ′} ∇φi · ∇φj dx.
Finite element method for solving Poisson equation
Element matrices. Consider a single element τ and pick two basis functions φi and φj (out of the three). Each φk is piecewise linear ⇒ denoting by pk(x, y) = φk|τ its restriction to τ, the gradients ∇pi and ∇pj are constant on τ, and so is their dot product.
Finite element method for solving Poisson equation
Finding the coefficients α and β – put the three points (xi, yi, 1), (xj, yj, 0), (xk, yk, 0) into the plane equation and solve the system. The integral is computed by multiplication with the triangle area 0.5 × |det D|. The element matrix for τ is the 3 × 3 matrix with entries (area of τ) · ∇pi · ∇pj.
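A hedged numpy sketch of this construction for a single linear (P1) triangle; the function name and interface are illustrative only:

import numpy as np

def p1_element_stiffness(xy):
    """Element stiffness matrix of a linear (P1) triangle.
    xy : (3, 2) array of vertex coordinates.
    Basis function p_k is the plane with value 1 at vertex k and 0 at the other two;
    its gradient is constant on the triangle."""
    # plane coefficients of each basis function: p_k(x, y) = alpha + beta*x + gamma*y,
    # determined from p_k(vertex m) = delta_km
    D = np.column_stack([np.ones(3), xy])     # rows: [1, x_m, y_m]
    coeff = np.linalg.solve(D, np.eye(3))     # column k: coefficients of p_k
    grads = coeff[1:, :]                      # (2, 3): the constant gradients
    area = 0.5 * abs(np.linalg.det(D))
    return area * grads.T @ grads             # K[i, j] = area * grad p_i . grad p_j

# Example: reference triangle (0,0), (1,0), (0,1)
K = p1_element_stiffness(np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
print(K)    # [[ 1. -0.5 -0.5], [-0.5  0.5  0. ], [-0.5  0.   0.5]]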
Finite element method for solving Poisson equation
Assembled stiffness matrix
different stiffness matrices
○ Dirichlet ○ Neumann
○ free boundary condition ○ Robin boundary condition ○ Special boundary conditions for special PDEs (like Impedance boundary conditions for the Helmholtz equation)
Finite element method for solving Poisson equation
Right-hand side: approximate calculation of the integral ∫_τ f φi dx through the value of f in the middle of the triangle τ
(somewhat similar to the stiffness matrix assembly, but simpler)
Finite element method for solving Poisson equation
EXAMPLE - Poisson equation on the unit square. Variational formulation: Vh – space of piecewise linear functions on Ω with value zero on Γ and values defined at the inner nodes. Discretised problem in variational formulation: find uh ∈ Vh such that a(uh, v) = (f, v) for all v ∈ Vh, where in the case of the Poisson equation a(u, v) = ∫_Ω ∇u·∇v dx. (1)
Finite element method for solving Poisson equation
In FEM, equation (1) needs to be satisfied on a set of test functions (also called hat functions) φi = φi(x), which are defined such that φi equals 1 at node i and 0 at all other nodes, and it is demanded that (1) is satisfied for each φi (i = 1, ..., N).
Finite element method for solving Poisson equation
where
Soving systems of linear equations with sparse matrices
sparse matrices
○ Analysis for the best ordering of unknowns (typically 100 time units) ○ Sparse LU-factorisation (10 time units) ○ Solving the corresponding triangular systems (1 time unit)
○ e.g. Gauss-Seidel method ○ Krylov subspace methods (CG, GMRES)
○ Direct solvers, ■ e.g. the MUMPS solver ○ Iterative parallel solvers ■ Most efficient ones based on Krylov subspace methods ■ But for large problems good preconditioners are needed ■ → Domain Decomposition preconditioning techniques
Domain Decomposition Method for solving Poisson equation
In-class exercise 8: Continuing with the task from In-class exercise 6, which was:
The task is to get some corresponding real examples working using appropriate Jupyter extension kernel! Like always, posting your achievements’ results to Course Piazza!
Domain Decomposition Method for solving linear systems with sparse matrices
Domain Decomposition Method for solving linear systems with sparse matrices Classical Alternating Schwarz method
○ Available only on regular domains
○ Called also multiplicative overlapping Schwarz method
[Figure: two overlapping subdomains Ω1 and Ω2 with artificial interface boundaries Γ1 and Γ2]
Domain Decomposition Method for solving linear systems with sparse matrices
As discussed before - when solving systems with sparse matrices, iterative methods are usually the best, due to the sparsity (and easier data distribution) of the matrices.
○ But plain iterative methods can converge slowly - e.g. 1 million iterations is way too many! Solution: use a preconditioner.
Domain Decomposition Method for solving linear systems with sparse matrices
Preconditioned Conjugate Gradient (PCG) method
Calculate r(0) = b − A x(0) with given starting vector x(0)
for i = 1, 2, ...
    solve M z(i−1) = r(i−1)          # M⁻¹ is called the preconditioner
    ρi−1 = (r(i−1))ᵀ z(i−1)
    if i == 1:
        p(1) = z(0)
    else:
        βi−1 = ρi−1 / ρi−2
        p(i) = z(i−1) + βi−1 p(i−1)
    q(i) = A p(i)
    αi = ρi−1 / (p(i))ᵀ q(i)
    x(i) = x(i−1) + αi p(i)
    r(i) = r(i−1) − αi q(i)
    check for convergence; continue if needed
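A minimal numpy sketch of this iteration; apply_Minv stands for the preconditioner solve M z = r and is an assumed helper, here demonstrated with a simple Jacobi (diagonal) preconditioner:

import numpy as np

def pcg(A, b, apply_Minv, x0=None, tol=1e-8, maxiter=1000):
    """Preconditioned Conjugate Gradient following the pseudocode above.
    A          : symmetric positive definite matrix
    apply_Minv : function r -> z solving M z = r (the preconditioner solve)"""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x
    rho_prev = None
    for i in range(maxiter):
        z = apply_Minv(r)                 # solve M z = r
        rho = r @ z
        if i == 0:
            p = z.copy()
        else:
            p = z + (rho / rho_prev) * p  # beta = rho_{i-1} / rho_{i-2}
        q = A @ p
        alpha = rho / (p @ q)
        x = x + alpha * p
        r = r - alpha * q
        rho_prev = rho
        if np.linalg.norm(r) < tol * np.linalg.norm(b):   # convergence check
            break
    return x, i + 1

# Example: 1D Laplacian with a Jacobi (diagonal) preconditioner
n = 100
A = np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
b = np.ones(n)
d = np.diag(A)
x, iters = pcg(A, b, lambda r: r / d)
print(iters, np.linalg.norm(A @ x - b))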
Conjugate Gradient (CG) method
○ = PCG without preconditioning (M = I)
○ works for a symmetric (positive definite) matrix A
Preconditioned CG method (PCG)
○ the preconditioner M must be symmetric (positive definite) as well
Domain Decomposition Method for solving linear systems with sparse matrices Parallelizing PCG method
Calculate r(0) = b − A x(0) with given starting vector x(0)
for i = 1, 2, ...
    solve M z(i−1) = r(i−1)          # preconditioner solve
    ρi−1 = (r(i−1))ᵀ z(i−1)          # dot product
    if i == 1:
        p(1) = z(0)
    else:
        βi−1 = ρi−1 / ρi−2
        p(i) = z(i−1) + βi−1 p(i−1)
    q(i) = A p(i)                    # matrix-vector multiplication
    αi = ρi−1 / (p(i))ᵀ q(i)         # dot product
    x(i) = x(i−1) + αi p(i)
    r(i) = r(i−1) − αi q(i)
    check for convergence; continue if needed
Parallelization
❖ Preconditioner solve
➢ Suitable M needed
■ symmetric
■ allowing parallelization
■ easy to solve
❖ Matrix-vector operation
➢ Parallel version of the operation
❖ Dot product
➢ Parallelizable
➢ all-to-all communication
Which of the above operations can be implemented with hiding communication behind computations?
Domain Decomposition Method for solving linear systems with sparse matrices Preconditioners Instead of solving Ax=b, solve M -1Ax=M -1b How to choose a preconditioner? A. Problem Mz = r being easy to solve B. Matrix B=M -1A being better conditioned than A meaning that κ2(B) < κ2(A), where κ2 - condition number = {largest eigenvalue}/{smallest eigenvalue}
Extreme cases: ○ M = I - trivial to "solve", but does not improve the conditioning ○ M = A - perfectly conditioned, but as expensive to solve as the original system
Domain Decomposition Method for solving linear systems with sparse matrices. Block-Jacobi method: take M1 and M2 to be preconditioners for the diagonal blocks A11 and A22 of the matrix A. For example, M1 = A11.
Domain Decomposition Method for solving linear systems with sparse matrices Multiple Subdomain case
[Figure: the domain partitioned into a 3 × 3 array of subdomains Ω1 ... Ω9]
Subdomain restriction operators Restriction operator Ri – matrix applied to global vector x returns components xi=Ri x belonging to subdomain Ωi . One-level Additive Schwarz preconditioner:
N = 9:
Denote:
➢ Ri - restriction operator
➢ Ai⁻¹ - subdomain solves
○ Subdomain solvers – exact or inexact
■ Typical inexact solvers
○ In case of CG, symmetry of M⁻¹ is a prerequisite!
Domain Decomposition Method for solving linear systems with sparse matrices
One-level Additive Schwarz preconditioner: M⁻¹ = Σᵢ₌₁..N Rᵢᵀ Aᵢ⁻¹ Rᵢ, where Aᵢ = Rᵢ A Rᵢᵀ.
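A rough serial sketch of applying such a preconditioner with scipy.sparse; the subdomain index sets, the function name and the use of explicit local inverses are illustrative assumptions (a real implementation would factorise the local matrices and distribute the subdomains over processes):

import numpy as np
import scipy.sparse as sp

def as_preconditioner(A, subdomains):
    """One-level Additive Schwarz:  M^{-1} r = sum_i R_i^T A_i^{-1} R_i r,
    where R_i restricts a global vector to the (possibly overlapping) index
    set subdomains[i] and A_i = R_i A R_i^T is the local subdomain matrix."""
    A = sp.csr_matrix(A)
    local = []
    for idx in subdomains:
        idx = np.asarray(idx)
        Ai = A[idx, :][:, idx].toarray()        # local subdomain matrix (small, dense here)
        local.append((idx, np.linalg.inv(Ai)))  # explicit inverse: fine for a small sketch
    def apply_Minv(r):
        z = np.zeros_like(r, dtype=float)
        for idx, Ai_inv in local:
            z[idx] += Ai_inv @ r[idx]           # R_i^T A_i^{-1} R_i r
        return z
    return apply_Minv

# Example: 1D Laplacian split into two overlapping subdomains
n = 100
A = sp.diags([-np.ones(n - 1), 2.0 * np.ones(n), -np.ones(n - 1)], [-1, 0, 1], format='csr')
Minv = as_preconditioner(A, [range(0, 60), range(40, 100)])
print(Minv(np.ones(n))[:5])

The returned function can be passed, for example, as the apply_Minv argument of the PCG sketch shown earlier.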
Domain Decomposition Method for solving linear systems with sparse matrices
Multilevel methods
Coarse grid; restriction and interpolation operators:
○ The restriction operator R0 is defined as the transpose of the interpolation operator R0ᵀ, which interpolates values on the coarse grid linearly (or bilinearly) to the fine grid nodes.
Coarse grid matrix A0 can be found:
○ analytically, or
○ through the formula A0 = R0 A R0ᵀ
The two-level Additive Schwarz preconditioner is defined as M⁻¹ = R0ᵀ A0⁻¹ R0 + Σᵢ Rᵢᵀ Aᵢ⁻¹ Rᵢ.
Requirements to the coarse grid:
○ it must cover all the fine grid nodes
○ each coarse grid cell must contain fine grid nodes
○ if the fine grid density varies, the coarse grid must adapt according to the fine grid in terms of the number of nodes in each cell
Parallel programming models
How are the computations structured?
○ Organized by data: Linear? → Geometric Decomposition; Recursive? → Recursive Data
○ Organized by tasks: Linear? → Task Parallelism; Recursive? → Divide and Conquer
○ Organized by data flow: Regular? → Pipeline; Irregular? → Event-based Coordination
A simplified view:
Partitioned global address space (PGAS)
○ High Performance Fortran
○ Clojure
○ Charm++ ○ Smalltalk
Parallel programming models
Different models for expressing parallelism in programming languages
○ MPI ○ PVM
○ PRAM (Parallel Random Acess Machine) model
○ Concurrent Haskell
○ Linda
○ CUDA ○ OpenCL ○ OpenACC ○ OpenHMPP (HMPP for Hybrid Multicore Parallel Programming)
○ Erlang ○ ScalaDataflow ○ SISAL (Streams and Iteration in a Single Assignment Language)
Parallel programming models
Different models for expressing parallelism in programming languages
○ Sequoia ○ Bloom
○ Verilog hardware description language (HDL)
CSP (Communicating Sequential Processes)
○ FortranM ○ Occam
PRAM (Parallel Random Access Machine) model
processors work in parallel
Variations
○ common CRCW PRAM - write succeeds only if all have the same value ○ arbitrary CRCW PRAM - arbitrary process succeeds ○ priority CRCW PRAM - with predefined priorities
Partitioned global address space parallel programming model; a Fortran90 extension
○ SPMD (Single Program Multiple Data) model
○ each processor operates on its own part of the data
○ the programmer specifies (with directives) which processor gets which part of the data
○ commands/directives based on Fortran90
HPF example: matrix multiplication:
PROGRAM ABmult
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 100
  INTEGER, DIMENSION (N,N) :: A, B, C
  INTEGER :: i, j
!HPF$ PROCESSORS square(2,2)
!HPF$ DISTRIBUTE (BLOCK,BLOCK) ONTO square :: C
!HPF$ ALIGN A(i,*) WITH C(i,*)
! replicate copies of row A(i,*) onto processors which compute C(i,j)
!HPF$ ALIGN B(*,j) WITH C(*,j)
! replicate copies of column B(*,j) onto processors which compute C(i,j)
  A = 1
  B = 2
  C = 0
  DO i = 1, N
    DO j = 1, N
      ! All the work is local due to the ALIGNs
      C(i,j) = DOT_PRODUCT (A(i,:), B(:,j))
    END DO
  END DO
  WRITE(*,*) C
END
○ find a balanced load based on the owner-computes rule ○ data locality ○ minimise communication
OpenMP tutorial (https://computing.llnl.gov/tutorials/openMP/)
○ C/C++, Fortran* ○ PyMP (TBC)
OpenMP example
/********************************************************** * FILE: omp_mm.c https://computing.llnl.gov/tutorials/openMP/samples/C/omp_m m.c * DESCRIPTION: * OpenMp Example - Matrix Multiply - C Version * Demonstrates a matrix multiply using OpenMP. Threads share row iterations * according to a predefined chunk size. * AUTHOR: Blaise Barney * LAST REVISED: 06/28/05 ****************************************************************/ #include <omp.h> #include <stdio.h> #include <stdlib.h> #define NRA 62 /* number of rows in matrix A */ #define NCA 15 /* number of columns in matrix A */ #define NCB 7 /* number of columns in matrix B */ int main (int argc, char *argv[]) { int tid, nthreads, i, j, k, chunk; double a[NRA][NCA], /* matrix A to be multiplied */ b[NCA][NCB], /* matrix B to be multiplied */ c[NRA][NCB]; /* result matrix C */ chunk = 10; /* set loop iteration chunk size */ /*** Spawn a parallel region explicitly scoping all variables ***/ #pragma omp parallel shared(a,b,c,nthreads,chunk) private(tid,i,j,k) { tid = omp_get_thread_num(); if (tid == 0) { nthreads = omp_get_num_threads(); printf("Starting matrix multiple example with %d threads\n",nthreads); printf("Initializing matrices...\n"); } /*** Initialize matrices ***/ #pragma omp for schedule (static, chunk) for (i=0; i<NRA; i++) for (j=0; j<NCA; j++) a[i][j]= i+j; #pragma omp for schedule (static, chunk) for (i=0; i<NCA; i++) for (j=0; j<NCB; j++) b[i][j]= i*j; #pragma omp for schedule (static, chunk) for (i=0; i<NRA; i++) for (j=0; j<NCB; j++) c[i][j]= 0; /*** Do matrix multiply sharing iterations on outer loop ***/
/*** Display who does which iterations for demonstration purposes ***/
printf("Thread %d starting matrix multiply...\n",tid); #pragma omp for schedule (static, chunk) for (i=0; i<NRA; i++) { printf("Thread=%d did row=%d\n",tid,i); for(j=0; j<NCB; j++) for (k=0; k<NCA; k++) c[i][j] += a[i][k] * b[k][j]; } } /*** End of parallel region ***/ /*** Print results ***/ printf("**************************** ***n"); printf("Result Matrix:\n"); for (i=0; i<NRA; i++) { for (j=0; j<NCB; j++.) printf("%6.2f ", c[i][j]); printf("\n"); } printf("**************************** ***\n"); printf ("Done.\n"); }
When to use MPI
○ Libraries
processor basis When not to use MPI
HPF/OpenMP
○ A collection of MPI-based libraries (probably a little outdated): http://www.mcs.anl.gov/mpi/libraries.html
○ Sockets
○ CORBA, DCOM, etc.
Outline
○ Communicators groups ○ Topologies ○ One-sided communication ○ MPI fault tolerance
abstraction in parallel application development
(Covered at Lecture 4)
○ – extended message-passing model ○ – not a language or compiler specification ○ – not a specific implementation or product
heterogeneous networks
parallel hardware for end users, library writers, and tool developers
MPI main goals
Goals of the MPI standard MPI’s prime goals are:
MPI also offers:
MPI History
environments
○ library for Caltech's N-cube (later commercially available as Express) ○ P4 library (Argonne) ○ PICL and PVM (Oak Ridge) ○ LAM (Ohio Supercomputer Center)
common standard – MPI
○ ISIS – a different direction, towards fault tolerance (e.g. business applications); ○ MPI – to serve supercomputing
○ completed in May 1994 ○ first implementations 1995 ■ MPICH (based on P4 and Chameleon) ■ LAM MPI ■ many supercomputer vendors – their own
○ completed 1998 ○ first implementations Nov 2002 ○ Also parallel IO defined here ■ first in MPI-IO (NASA)
September 21, 2012. MPI 3.0 Documents
MPI implementations
○ P2P-MPI ○ mpiJava API ○ ...
○ MPICH ○ LAM MPI ○ OpenMPI ○ Intel MPI ○ HP MPI ○ Microsoft Message Passing Interface
○ pyMPI ○ mpi4py ○ Pypar ○ MYMPI
Compiling and running MPI programs
○ typically with something like:
■ mpicc -g file.c
■ mpif77 -g file.f
■ mpif90 -O file.f90
○ running, for example: mpirun -np 8 a.out
MPI as standard
Goals of the MPI standard MPI’s prime goals are:
MPI also offers:
4 basic types of MPI calls 1. Calls used to initialize, manage, and terminate communications 2. Calls used to communicate between pairs
→ point-to-point communication (P2P) 3. Calls used to communicate among groups → collective communication 4. Calls to create data types Communication can be:
Bindings for:
6 basic MPI calls
MPI_Init: initialise MPI
MPI_Comm_size: how many PEs?
MPI_Comm_rank: identify the PE
MPI_Send: send a message
MPI_Recv: receive a message
MPI_Finalize: close MPI
EXAMPLE (fortran90): http://www.ut.ee/~eero/SC/konspekt/Naited/greetings.f90.html Example (C): https://github.com/wesleykendall/mpitutorial/blob/gh-pages/tutor ials/mpi-hello-world/code/mpi_hello_world.c
Full range of MPI calls: http://www.mpich.org/static/docs/latest/
from mpi4py import MPI

comm = MPI.COMM_WORLD            # Defines the default communicator
num_procs = comm.Get_size()      # Number of processes
rank = comm.Get_rank()           # Rank (pid) of the current process
stat = MPI.Status()

msg = "Hello world, says process %s!" % rank
if rank == 0:
    # Master work
    print msg
    for i in range(num_procs - 1):
        msg = comm.recv(source=i + 1, tag=MPI.ANY_TAG, status=stat)
        print msg
else:
    # Worker work
    comm.send(msg, dest=0)
Send, Ssend, Bsend, Rsend - blocking calls Isend, Issend, Ibsend, Irsend - nonblocking calls
Non-Blocking Send and Receive, Avoiding Deadlocks
separation between the initiation of the communication and the completion
completion the program could do other useful computation (latency hiding)
insert code to check for completion
○ Blocking commands: send, recv ○ Non-blocking commands: isend, irecv ■ Return request object to be able to check the message status
○ Blocking commands: Send, Recv ○ Non-blocking commands: Isend, Irecv ■ Return request object to be able to check the message status
# Non-blocking send and receive of a Python object:
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = {'a': 7, 'b': 3.14}
    req = comm.isend(data, dest=1, tag=11)   # non-blocking
    req.wait()
elif rank == 1:
    req = comm.irecv(source=0, tag=11)       # non-blocking
    data = req.wait()

# Blocking send and receive of a Python object:
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = {'a': 7, 'b': 3.14}
    comm.send(data, dest=1, tag=11)          # blocking
elif rank == 1:
    data = comm.recv(source=0, tag=11)       # blocking
Unidirectional communication between processors
from mpi4py import MPI import numpy as np size = MPI.COMM_WORLD.size rank = MPI.COMM_WORLD.rank comm = MPI.COMM_WORLD len = 100 data = np.arange(len,dtype=float) # (or similar) tag = 99 ... # 1. Blocking send and blocking receive if rank==0: print "[0] Sending: ", data comm.Send([data, MPI.FLOAT], 1, tag) elif rank == 1: print "[1] Receiving..." comm.Recv([data, MPI.FLOAT], 0, tag) print "[1] Data: ", data # 2. Non-blocking send and blocking receive if rank==0: print "[0] Sending: ", data request = comm.Isend([data, MPI.FLOAT], 1, tag) ... # calculate or do something useful... request.Wait() elif rank == 1: print "[1] Receiving..." comm.Recv([data, MPI.FLOAT], 0, tag) print "[1] Data: ", data # 3. Blocking send and non-blocking receive if rank==0: print "[0] Sending: ", data comm.Send([data, MPI.FLOAT], 1, tag) elif rank == 1: print "[1] Receiving..." request=comm.Irecv([data, MPI.FLOAT], 0, tag) ... # calculate or do something useful... request.Wait() print "[1] Data: ", data
# 4. Non-blocking send and non-blocking receive if rank==0: print "[0] Sending: ", data request = comm.Isend([data, MPI.FLOAT], 1, tag) ... # calculate or do something useful... request.Wait() elif rank == 1: print "[1] Receiving..." request=comm.Irecv([data, MPI.FLOAT], 0, tag) ... # calculate or do something useful... request.Wait() print "[1] Data: ", data
Unidirectional communication between processors
Wildcards: MPI.ANY_SOURCE and MPI.ANY_TAG
Can be used in *recv calls
Possibilities for checking received message’s details
#probe.py from mpi4py import MPI import numpy comm = MPI.COMM_WORLD nproc = comm.Get_size() myid = comm.Get_rank() if myid == 0: data = myid*numpy.ones(5,dtype = numpy.float64) comm.Send([data,3,MPI.DOUBLE],dest=1,tag=1) if myid == 1: info = MPI.Status() comm.Probe(MPI.ANY_SOURCE,MPI.ANY_TAG,info) count = info.Get_elements(MPI.DOUBLE) data = numpy.empty(count,dtype = numpy.float64) comm.Recv(data,MPI.ANY_SOURCE,MPI.ANY_TAG,info) print 'on',myid, 'data: ',data #status.py from mpi4py import MPI import numpy comm = MPI.COMM_WORLD nproc = comm.Get_size() myid = comm.Get_rank() data = myid*numpy.ones(5,dtype = numpy.float64) if myid == 0: comm.Send([data,3,MPI.DOUBLE],dest=1,tag=1) if myid == 1: info = MPI.Status() comm.Recv(data,MPI.ANY_SOURCE,MPI.ANY_TAG,info) source = info.Get_source() tag = info.Get_tag() count = info.Get_elements(MPI.DOUBLE) size = info.Get_count() print 'on',myid, 'source, tag, count, size is',source, tag, count, size
Mutual communication and avoiding deadlocks
Non-blocking operations can be used also for avoiding deadlocks Deadlock is a situation where processes wait after each other without any of them able to do anything useful. Deadlocks can occur:
In case of mutual communication there are 3 possibilities: 1. Both processes start with send followed by receive 2. Both processes start with receive followed by send 3. One process starts with send followed by receive, another vica versa Depending on blocking there are different possibilities:
# 1.1 Send followed by receive (vers. 2)
if rank == 0:
    request = comm.Isend([sendbuf, MPI.FLOAT], 1, tag)
    comm.Recv([recvbuf, MPI.FLOAT], 1, tag)
    request.Wait()
elif rank == 1:
    request = comm.Isend([sendbuf, MPI.FLOAT], 0, tag)
    comm.Recv([recvbuf, MPI.FLOAT], 0, tag)
    request.Wait()
Is this deadlock-free?
Why Wait() cannot follow right after Isend(...) ?
Mutual communication and avoiding deadlocks
# 1. Send followed by receive (vers. 1)
if rank == 0:
    comm.Send([sendbuf, MPI.FLOAT], 1, tag)
    comm.Recv([recvbuf, MPI.FLOAT], 1, tag)
elif rank == 1:
    comm.Send([sendbuf, MPI.FLOAT], 0, tag)
    comm.Recv([recvbuf, MPI.FLOAT], 0, tag)
Is this OK?
Works as long as the messages fit into the system message send-buffer. But what about large messages?
# 2. Receive followed by send (version 2)
if rank == 0:
    request = comm.Irecv([recvbuf, MPI.FLOAT], 1, tag)
    comm.Send([sendbuf, MPI.FLOAT], 1, tag)
    request.Wait()
elif rank == 1:
    request = comm.Irecv([recvbuf, MPI.FLOAT], 0, tag)
    comm.Send([sendbuf, MPI.FLOAT], 0, tag)
    request.Wait()
… deadlock-free?
# 2. Receive followed by send (version 1)
if rank == 0:
    comm.Recv([recvbuf, MPI.FLOAT], 1, tag)
    comm.Send([sendbuf, MPI.FLOAT], 1, tag)
elif rank == 1:
    comm.Recv([recvbuf, MPI.FLOAT], 0, tag)
    comm.Send([sendbuf, MPI.FLOAT], 0, tag)
… Is this OK?
○ Produces a deadlock with any message buffer size
Mutual communication and avoiding deadlocks
Mutual communication and avoiding deadlocks
# Generally, the following communication pattern is advised:
if rank == 0:
    req1 = comm.Isend([sendbuf, MPI.FLOAT], 1, tag)
    req2 = comm.Irecv([recvbuf, MPI.FLOAT], 1, tag)
else:
    req1 = comm.Isend([sendbuf, MPI.FLOAT], 0, tag)
    req2 = comm.Irecv([recvbuf, MPI.FLOAT], 0, tag)
req1.Wait()
req2.Wait()

# 3. One process starts with send, the other one with receive
if rank == 0:
    comm.Send([sendbuf, MPI.FLOAT], 1, tag)
    comm.Recv([recvbuf, MPI.FLOAT], 1, tag)
else:
    comm.Recv([recvbuf, MPI.FLOAT], 0, tag)
    comm.Send([sendbuf, MPI.FLOAT], 0, tag)
… Could we use non-blocking commands instead?
whichever call here as well)
Alternatively, use comm.Sendrecv
Docstring: Comm.Sendrecv(self, sendbuf, int dest, int sendtag=0, recvbuf=None, int source=ANY_SOURCE, int recvtag=ANY_TAG, Status status=None) Send and receive a message .. note:: This function is guaranteed not to deadlock in situations where pairs of blocking sends and receives may deadlock. .. caution:: A common mistake when using this function is to mismatch the tags with the source and destination ranks, which can result in deadlock. Type: builtin_function_or_method
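A minimal mpi4py example using the lowercase sendrecv (for Python objects) to exchange data between two processes without worrying about the send/receive ordering; it assumes exactly two ranks:

# run with 2 processes, e.g.: mpirun -np 2 python sendrecv_example.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
other = 1 - rank                   # the partner process (assumes exactly 2 ranks)

senddata = {'from': rank}
recvdata = comm.sendrecv(senddata, dest=other, sendtag=0,
                         source=other, recvtag=0)
print rank, "received", recvdata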
Collective communication
# Broadcast of a Python object (lowercase bcast):
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = {'key1': [7, 2.72, 2+3j], 'key2': ('abc', 'xyz')}
else:
    data = None
data = comm.bcast(data, root=0)

# Broadcast of a numpy array (uppercase Bcast):
from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = np.arange(100, dtype='i')
else:
    data = np.empty(100, dtype='i')
comm.Bcast(data, root=0)
for i in range(100):
    assert data[i] == i
Collective communication
All processes in the communicator should make the call!
comm.barrier()
[Diagram: comm.Bcast(...) - the block A0 held by p0 is copied to every process p0 ... p4]
Collective communication
[Diagram: comm.Scatter(...) - blocks A0 ... A4 held by p0 are distributed so that process pi receives block Ai]
# gather: every process sends one Python object to the root
from mpi4py import MPI
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
data = (rank + 1) ** 2
data = comm.gather(data, root=0)
if rank == 0:
    for i in range(size):
        assert data[i] == (i + 1) ** 2
else:
    assert data is None

# scatter: the root distributes one Python object to every process
from mpi4py import MPI
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
if rank == 0:
    data = [(i + 1) ** 2 for i in range(size)]
else:
    data = None
data = comm.scatter(data, root=0)
assert data == (rank + 1) ** 2
comm.Gather(...)
# Scatter of a numpy array (uppercase Scatter):
from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
sendbuf = None
if rank == 0:
    sendbuf = np.empty([size, 100], dtype='i')
    sendbuf.T[:, :] = range(size)
recvbuf = np.empty(100, dtype='i')
comm.Scatter(sendbuf, recvbuf, root=0)
assert np.allclose(recvbuf, rank)
Collective communication
[Diagram: comm.Allgather(...) - each process pi contributes its block Ai and every process ends up with all blocks A0 ... A4]
[Diagram: comm.Alltoall(...) - each process pi holds a row of blocks; after the call process pj holds the j-th block of every process, i.e. the block matrix is transposed]
Collective communication
[Diagram: comm.Scatterv(...) / comm.Gatherv(...) - like Scatter/Gather, but with variable-sized blocks per process]
https://www.programcreek.com/python/example/89115/mpi4py.MPI.DOUBLE
Collective communication
[Diagram: comm.Allgatherv(...) - like Gatherv, but every process receives all the variable-sized blocks]
https://www.programcreek.com/python/example/89115/mpi4py.MPI.DOUBLE
Collective communication
[Diagram: comm.Reduce(..., op=MPI.SUM, ...) - the values 1, 2, 4, 8, 16 held by p0 ... p4 are combined; the root p0 receives the sum 31]
comm.Reduce(sendbuf, recvbuf, op=MPI.SUM, root=0)
comm.reduce(sendobj=None, recvobj=None, op=MPI.SUM, root=0)

a_size = 3
recvdata = numpy.zeros(a_size, dtype=numpy.int)
senddata = (rank + 1) * numpy.arange(a_size, dtype=numpy.int)
comm.Reduce(senddata, recvdata, root=0, op=MPI.PROD)
print 'on task', rank, 'after Reduce:    data = ', recvdata
comm.Allreduce(senddata, recvdata)
print 'on task', rank, 'after Allreduce: data = ', recvdata
https://info.gwdg.de/~ceulig/docs-dev/doku.php?id=en:services:application_services:high_performance_computing:mpi4py
Example sendbuf recvbuf
Collective communication
[Diagram: comm.Allreduce(..., op=MPI.SUM, ...) - the values 1, 2, 4, 8, 16 are summed and every process receives 31]
comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)
comm.allreduce(sendobj=None, recvobj=None, op=MPI.SUM)

a_size = 3
recvdata = numpy.zeros(a_size, dtype=numpy.int)
senddata = (rank + 1) * numpy.arange(a_size, dtype=numpy.int)
comm.Reduce(senddata, recvdata, root=0, op=MPI.PROD)
print 'on task', rank, 'after Reduce:    data = ', recvdata
comm.Allreduce(senddata, recvdata)
print 'on task', rank, 'after Allreduce: data = ', recvdata
https://info.gwdg.de/~ceulig/docs-dev/doku.php?id=en:services:application_services:high_performance_computing:mpi4py
Example sendbuf recvbuf
Reduction operations
MPI.MIN - minimum
MPI.MAX - maximum
MPI.SUM - sum
MPI.PROD - product
MPI.LAND - logical and
MPI.BAND - bitwise and
MPI.LOR - logical or
MPI.BOR - bitwise or
MPI.LXOR - logical xor
MPI.BXOR - bitwise xor
MPI.MAXLOC - max value and location
MPI.MINLOC - min value and location
Collective communication
MPI derived datatypes MPI packing Communicators & process groups Intra and Intercommunicators Processor topologies One-sided communication Library creation support Fault-tolerance issues MPI-IO
MPI’s main abstraction for programmer:
○ Powerful ○ Low-level ■ Is it too low level?
Secure MPI libraries
'MPI programming is difficult. It is "schizophrenic programming" in that you are writing a single program with multiple threads of execution - "many voices in one head".' link
In-class reading-exercise 4:
○ Read it and choose one of the technologies mentioned in the article and post a review of main points of it to Piazza