

slide-1
SLIDE 1

Parallel Computing

Eero Vainikko

Spring 2019 eero.vainikko@ut.ee University of Tartu Institute of Computer Science

slide-2
SLIDE 2

Overview

Lectures -- WED 10:15 - 12:00 -- Liivi 2 - 512 Practicals -- FRI 12:15 - 14:00 -- Liivi 2 - 512

Bring your laptop!

Final grade

  • Homework/class exercises - 70%
  • Exam / Presentation 30%

○ Exam - blackboard exam

  • Mainly Individual work

○ with pair-programming, sometimes

slide-3
SLIDE 3

Syllabus (preliminary)

Themes

  • Introduction to parallel computing
  • Parallel computer architectures
  • Parallel programming models
  • Designing parallel programs
  • Shared memory parallel programming
  • MPI
  • GPGPU
  • Profiling parallel programs
  • Applications: Parallel solution of systems of linear equations, ...

  • Iterative methods
  • Domain Decomposition methods
  • Parallel programming languages
  • Parallel Algorithms

Computer classes

  • Jupyter notebook
  • JupyterHUB
  • Computational problems using

○ Numpy ○ Scipy

  • Ipyparallel
  • Mpi4py
  • Using UT HPC resources
  • ...
slide-4
SLIDE 4

History of computing

ENIAC

Pre-history: Even before electronic computers were invented, parallel computing existed!

  • 1929 - parallelisation of weather forecasts using arrays of human 'computers'
  • ≈ 1940 - parallel computations in the war industry using Felix-like (mechanical) calculators
  • 1950 - emergence of the first electronic computers, based on vacuum tubes: ENIAC and others
slide-5
SLIDE 5

History of computing

1960 - Mainframe era. IBM

IBM 1620

1970 - Era of minis
1980 - Era of PCs
1990 - Era of parallel computers
2000 - Clusters / Grids
2010 ... - Cloud computing era
20xx - Quantum computers? What's next?

Something else?

slide-6
SLIDE 6

Expert's predictions

Much-cited legend: In 1947 computer engineer Howard Aiken said that the USA would need at most 6 computers in the future! 1950: Thomas J. Watson made a similar claim: 6 computers would be needed in the whole world. 1977: Seymour Cray: the Cray-1 would attract only 100 customers worldwide. 1980: A study by IBM concluded that at most about 50 Cray-class computers could be sold per year worldwide. Reality: how much Cray-class processing power do we currently carry in our pockets?

slide-7
SLIDE 7

Expert's predictions: Moore’s law

Gordon Moore's (co-founder of Intel) law, 1965: the number of switches doubles every second year. 1975 refinement: the number of switches on a CPU doubles every 18 months. Continuing at this rate, by about 2030 we would reach the atomic level, or quantum computers! But why?

slide-8
SLIDE 8

Example

for i in range(1000000000000): z[i] = x[i] + y[i]   # i.e. 3*10^12 memory accesses

Assuming that data travels at the speed of light, 3 ∗ 10⁸ m/s, then to finish the operation in 1 second the average distance to a memory cell can be at most 3 ∗ 10⁸ / 3 ∗ 10¹² = 10⁻⁴ m. A typical computer memory is a rectangular mesh; packing 3 ∗ 10¹² cells into a square with side 10⁻⁴ m, the length of each cell's edge would be about 10⁻⁴ / √(3 ∗ 10¹²) ≈ 0.5 ∗ 10⁻¹⁰ m -

  • the size of a small atom!
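A back-of-the-envelope check of this argument (a minimal sketch in Python; the 1-second budget and the 3 ∗ 10¹² accesses are the slide's assumptions):

    import math

    c = 3e8                 # speed of light, m/s
    accesses = 3e12         # memory accesses needed within 1 second
    r = c * 1.0 / accesses  # average distance data may travel per access (m)
    cells_per_side = math.sqrt(accesses)  # memory laid out as a square mesh
    cell_edge = r / cells_per_side        # edge length available per memory cell (m)
    print(f"max distance per access: {r:.1e} m")          # ~1e-04 m
    print(f"edge length per cell:    {cell_edge:.1e} m")  # ~6e-11 m, atomic scale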
slide-9
SLIDE 9

Computer hardware developments

https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/

slide-10
SLIDE 10

FLOPS and goals

First computers      10²     100 Flops (hundred flops)
Desktop computers    10⁹     Gigaflops (GFlops, billion flops)
Supercomputers       10¹²    Teraflops (TFlops, trillion flops)
Fastest clusters     10¹⁵    Petaflops (PFlops, quadrillion flops)
Aim now              10¹⁸    Exaflops (EFlops, quintillion flops)
Next step            10²¹    Zettaflops (ZFlops, sextillion flops)
Further goal         10²⁴    Yottaflops (YFlops, septillion flops)

slide-11
SLIDE 11

How can parallel computers help achieving these goals?

Example

Weather forecast for Europe for the next 48 hours, from sea level up to 20 km - we need to solve a PDE in 3D (x, y, z and t). The volume of the region: 5000 ∗ 4000 ∗ 20 km³. Stepsize in the xy-plane: 1 km; Cartesian mesh in the z-direction: 0.1 km; timestep: 1 min. Ca 1000 flops per gridpoint at each timestep.
5000 ∗ 4000 ∗ 20 km³ × 10 cells per km³ = 4 ∗ 10⁹ meshpoints
4 ∗ 10⁹ meshpoints ∗ 48 ∗ 60 timesteps × 10³ flop ≈ 1.15 ∗ 10¹⁵ flop
At 10⁹ flops: 1.15 ∗ 10¹⁵ / 10⁹ = 1.15 ∗ 10⁶ seconds ≈ 13 days!
But with 10¹² flops: 1.15 ∗ 10³ seconds, i.e. < 20 min.

slide-12
SLIDE 12
  • Reduce the mesh stepsize to 0.1 km in each direction and the timestep to 10 seconds → the total time would grow from 20 min to 8 days!

Example continued...

  • We could replace the Europe model with a whole-world model (the area grows from 2 ∗ 10⁷ km² to 5 ∗ 10⁸ km²) and combine it with an ocean model.

  • The needs of science and technology grow faster than the available possibilities - one only needs to change ε and h to get unsatisfied again!

slide-13
SLIDE 13

TOP500

https://www.top500.org/resources/presentations/

slide-14
SLIDE 14

HPC Applications

modeling, simulation, design, signal processing, engineering, geology, tectonics, ...

In-class exercise #1:

For hints, some pictures from https://computing.llnl.gov/tutorials/parallel_comp/

piazza.com/ut.ee/spring2019/mtat08020

In order to get used to Piazza:
➢ Look (google) around for challenging parallel computing problems
➢ Choose your favorite HPC application
➢ Post your findings on Piazza
  ○ Describe the application!
    ■ What makes it challenging?
    ■ If you can, find out approximately the time / order of Flops a typical calculation/simulation takes
  ○ Include links to the project page, pictures, videos etc.!

slide-15
SLIDE 15

Engineering applications

  • Wind tunnels
  • Combustion engine optimisation
  • High frequency circuits
  • Buildings and mechanical structures (optimisation, cost, shape, safety etc.)
  • Effective and safe nuclear power station design
  • Simulation of nuclear explosions

Scientific applications

  • Bioinformatics
  • Computational physics and chemistry (quantum mechanics, macromolecules, composite materials, chemical reactions etc.)
  • Astrophysics (development of galaxies, thermonuclear reactions, postprocessing of large datasets produced by telescopes)
  • Weather modelling
  • Climate and environment modelling
  • Looking for minerals
  • Flood predictions

Usage areas and applications for Petaflop computer

slide-16
SLIDE 16

Usage areas and applications for Petaflop computer

Commercial applications

  • Oil industry
  • Market and monetary predictions
  • Global economy modelling
  • Car industry (like crash-tests)
  • Aviation (design of modern aircraft and space shuttles)

Applications in computer systems

  • Computer security (discovery of network attacks)
  • Cryptography
  • Embedded systems (for example, in cars)
  • etc.
slide-17
SLIDE 17

➢ Find a set of Exascale computing applications ➢ Post your findings on Piazza ○ Describe the application! ■ Why is it Exascale? ■ Include links to the project page, pictures, videos etc.!

Exascale Computing

In-class exercise #2

piazza.com/ut.ee/spring2019/mtat08020

slide-18
SLIDE 18

Exascale Computing

Useful Videos:

  • The Next Frontier in Computing - Los Alamos National Lab
  • Exascale Computing Project Update

Exascale, what's happening?

  • Japan Tests Silicon for Exascale Computing in 2021
  • Extremely Scalable Spiking Neuronal Network Simulation Code: From Laptops to Exascale Computers
  • Why We Need Exascale Computing
  • Link to paper and video on ACM: Exascale computing and big data
  • Some interesting articles and presentations about Exascale computing and its applications from various sources
    ○ Springer ○ Kothe2018.pdf ○ Presentation

Some interesting links (student’s choice)

slide-19
SLIDE 19

What is parallel Computing?

Serial computing

  • Traditional software is written in a serial way:

○ The problem is broken into a series of instructions ○ Instructions are executed one by one ○ On a single processor, only one instruction can be executed at any moment in time

(Figure: a problem decomposed into a series of instructions t1 ... t16 executed one after another on a single processor.)

slide-20
SLIDE 20

What is parallel Computing?

Parallel computing

○ simultaneous use of multiple compute resources to solve a computational problem:

A problem is broken into discrete parts that can be solved concurrently

Each part is further broken down to a series of instructions

Instructions from each part execute simultaneously on different processors

An overall control/coordination mechanism is employed

(Figure: the problem is split into four parts; each part's instructions t1 ... t16 execute simultaneously on a different processor.)

The compute resources are typically:

  • a computer with multiple cores/CPUs
  • a number of such computers connected by a network

slide-21
SLIDE 21

To be able to parallelise computations:

  • The computational problem should be able to:

○ Be broken apart into discrete pieces of work that can be solved simultaneously ○ Execute multiple program instructions at any moment in time ○ Be solved in less time with multiple compute resources than with a single compute resource

slide-22
SLIDE 22

Parallel Computers:

  • Virtually all stand-alone computers

today are parallel from a hardware perspective: ○ Multiple functional units (L1 cache, L2 cache, branch, prefetch, decode, floating-point, graphics processing (GPU), integer, etc.) ○ Multiple execution units/cores ○ Multiple hardware threads

IBM BG/Q Compute Chip with 18 cores (PU) and 16 L2 Cache units (L2):

slide-23
SLIDE 23

Parallel Computers:

IBM Blue Gene

By Courtesy Argonne National Laboratory, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=24653857

# 3 in top500 - 2012 June # 21 in top500 - 2018 Nov

slide-24
SLIDE 24

Granularity

Fine-grain parallelism

  • Example: ILP - Instruction-Level Parallelism
  • Relatively small amounts of computational work are done between communication/synchronisation events
  • Low computation to communication ratio
  • Easier to load-balance
  • Easier to automate

Coarse-grain parallelism

  • Relatively large amounts of computational work are done between communication/synchronization events
  • High computation to communication ratio
  • Implies more opportunity for performance increase
  • Harder to load-balance efficiently
  • More difficult to automate

Embarrassingly parallel

  • The coarsest granularity one can imagine
  • Examples: Monte-Carlo methods; processing a huge set of images in parallel

Granularity = the size of the parallelisable (independent) parts of the program between synchronization points

slide-25
SLIDE 25

Instruction Level Parallelism (ILP)

ILP - a measure of how many of the instructions in a computer program can be executed simultaneously. Parallel instructions are a set of instructions that do not depend on each other when executed.

  • Bit-level parallelism
    ○ e.g. a 16-bit add on an 8-bit CPU
  • Instruction-level parallelism
  • Loop-level parallelism

for i in range(10000): x[i] = x[i] + y[i]

  • Thread-level parallelism
    ○ Multi-core CPUs

Example:

e = a + b
f = c + d
g = e * f

  • In how many units of time can this be done? (see the sketch after the list below)

ILP is a challenge for compilers and processor designers. ILP allows the use of

  • pipelining,
  • superscalar execution,
  • out-of-order execution,
  • register renaming,
  • speculative execution,
  • branch prediction.
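A quick way to see the answer to the question above (a toy sketch, not from the slides): e and f are independent and can run in the same time unit, while g depends on both, so two parallel functional units need 2 time units instead of 3.

    # Toy dependency analysis for the example e = a + b; f = c + d; g = e * f
    deps = {"e": set(), "f": set(), "g": {"e", "f"}}   # results each instruction needs

    level = {}                        # earliest time unit an instruction can run in
    for instr in ("e", "f", "g"):     # program order
        level[instr] = 1 + max((level[d] for d in deps[instr]), default=0)

    print(level)                                        # {'e': 1, 'f': 1, 'g': 2}
    print("time units needed:", max(level.values()))    # 2, given >= 2 functional units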
slide-26
SLIDE 26

Micro-architectural techniques of ILP

❖ Instruction pipelining

➢ (similar to a car production line) - performing different sub-operations on moving objects

Basic five-stage pipeline in a RISC machine:

  • IF = Instruction Fetch
  • ID = Instruction Decode
  • EX = Execute
  • MEM = Memory access
  • WB = Register write back

In the fourth clock cycle (the green column), the earliest instruction is in the MEM stage, and the latest instruction has not yet entered the pipeline.

slide-27
SLIDE 27

Micro-architectural techniques of ILP

➢ Superscalar CPU architecture

○ Implements ILP inside a single CPU ○ More than one instruction per clock cycle ○ Dispatches multiple instructions to multiple redundant functional units inside the processor ■ Each separate functional unit is not a separate CPU core but an execution resource inside the CPU, such as:

  • arithmetic logic unit,
  • floating point unit,
  • a bit shifter,
  • multiplier

Example: Fetching-dispatching 2 instructions a time:

slide-28
SLIDE 28

Micro-architectural techniques of ILP

➢ Out-of-Order execution

○ A technique used in most high-performance CPUs
○ The key idea - allow the processor to avoid a certain class of delays occurring due to operand data being unavailable

Processing of instructions is broken into these steps:
➔ Instruction fetch
➔ Instruction dispatch to an instruction queue (also called an instruction buffer)
➔ The instruction waits in the queue until its input operands are available
➔ The instruction is issued to the appropriate functional unit and executed there
➔ The results are queued (re-order buffer)
➔ The results are written back to the registers

slide-29
SLIDE 29

Micro-architectural techniques of ILP

➢ Register renaming

○ a technique used to avoid unnecessary serialization of program operations imposed by the reuse of registers by those operations; used to enable out-of-order execution
○ a technique that abstracts logical registers from physical registers

Original code (r1 reused):      After renaming (second use mapped to r2):
r1 = m[1024]                    r1 = m[1024]
r1 = r1 + 2                     r1 = r1 + 2
m[1032] = r1                    m[1032] = r1
r1 = m[2048]                    r2 = m[2048]
r1 = r1 + 4                     r2 = r2 + 4
m[2056] = r1                    m[2056] = r2

slide-30
SLIDE 30

Micro-architectural techniques of ILP

➢ Speculative execution

○ allows the execution of complete instructions or parts of instructions before being certain whether this execution should take place ■ control flow speculation ■ value prediction ■ memory dependence prediction ■ cache latency prediction

  • Eager execution

○ → execute both (all) possible scenarios

  • Predictive execution

○ → predict the most likely scenario!

slide-31
SLIDE 31

Micro-architectural techniques of ILP

➢ Branch prediction

○ used to avoid stalling for control dependencies to be resolved ○ used with speculative execution

  • Static branch prediction
  • Dynamic branch prediction
  • Random branch prediction
  • One-level branch prediction
  • Two-level branch prediction
    ○ Pattern History Table

slide-32
SLIDE 32

Memory performance

The memory system, not processor speed, often turns out to be the bottleneck.

  • 2 main parameters:

latency - the time it takes from a request until delivery from the memory
bandwidth - the amount of data delivered per unit of time (between the processor and memory)

Example: Memory latency

  • Simplified architecture: 1 GHz processor
  • memory fetch time 100 ns (no cache), blocksize 1 word
  • 2 multiply-add units, i.e. 4 multiplication or addition operations per clock cycle
  • => peak performance - 4 Gflops
  • Suppose we need to perform a dot product. Each memory access takes 100 clock cycles
  • ⇒ 1 flop per 100 ns
  • ⇒ actual performance?
  • Only 10 Mflops!
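A small sketch of the arithmetic behind this example (assuming, as above, one 100 ns memory access per operand and no data reuse):

    peak_flops  = 4e9      # 4 Gflops peak (2 multiply-add units, 1 GHz)
    access_time = 100e-9   # 100 ns per memory fetch, no cache
    # In a dot product every flop consumes a freshly fetched operand,
    # so the achievable rate is limited to roughly 1 flop per memory access:
    effective = 1.0 / access_time
    print(f"peak:      {peak_flops / 1e6:.0f} Mflops")   # 4000 Mflops
    print(f"effective: {effective / 1e6:.0f} Mflops")    # 10 Mflops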
slide-33
SLIDE 33

Example: BLAS (Basic Linear Algebra Subroutines)

(Figure: the memory hierarchy - registers, cache, RAM, storage, network storage / clouds; data moves up the hierarchy before operations and back down after operations.)

Motivation: Consider an arbitrary algorithm. Denote

  • f - # flops
  • m - # memory references

Introduce q = f / m. Why is this number important?

  • tf - time spent on 1 flop
  • tm - time for a memory access

In general tm ⋙ tf, therefore the total time reflects the processor speed only if q is large.
slide-34
SLIDE 34

Example: Gauss elimination method

For each i the key operations are (1) and (2) (assuming A is a NumPy array):

for i in range(n):
    A[i+1:n, i] = A[i+1:n, i] / A[i, i]                                        # op. (1)
    A[i+1:n, i+1:n] = A[i+1:n, i+1:n] - np.outer(A[i+1:n, i], A[i, i+1:n])     # op. (2)

Operation (1) is of type y = a x + y  (saxpy):

  • m = 3n + 1 memory references:
    ○ 2n + 1 reads
    ○ n writes
  • Computations take f = 2n flops
  • ⇒ q = 2/3 + O(1/n) ≈ 2/3 for large n

Operation (2) is of type A = A − vwᵀ, A ∈ ℝⁿˣⁿ, v, w ∈ ℝⁿ:

  • Rank-one update
  • m = 2n² + 2n memory references:
    ○ n² + 2n reads
    ○ n² writes
  • Computations - f = 2n² flops
  • ⇒ q = 1 + O(1/n) ≈ 1 for large n

Operation (1) is a 1st order operation, O(n) flops; operation (2) is a 2nd order operation, O(n²) flops.

But q = O(1) in both cases!

slide-35
SLIDE 35

Example of 3rd order operation

Faster results in the case of 3rd order operations (O(n³) operations with O(n²) memory references). For example, matrix multiplication: C = AB + C, where A, B, C ∈ ℝⁿˣⁿ. Here m = 4n² and f = n²(2n − 1) + n² = 2n³ ⇒ q = n/2 → ∞ as n → ∞. This operation can keep the processor working near peak performance, given good algorithm scheduling!
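A short sketch comparing q = f/m for the three operation levels, using the flop and memory-reference counts from these slides:

    # q = f/m (flops per memory reference) for 1st, 2nd and 3rd order operations
    def q_saxpy(n):   return (2 * n) / (3 * n + 1)            # y = a*x + y
    def q_rank1(n):   return (2 * n**2) / (2 * n**2 + 2 * n)  # A = A - v w^T
    def q_matmul(n):  return (2 * n**3) / (4 * n**2)          # C = A B + C

    for n in (10, 100, 1000):
        print(n, round(q_saxpy(n), 3), round(q_rank1(n), 3), round(q_matmul(n), 1))
    # q_saxpy -> 2/3, q_rank1 -> 1, q_matmul -> n/2 (grows with n)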

slide-36
SLIDE 36

BLAS implementations

  • BLAS - a standard library for simple 1st, 2nd and 3rd order operations
    ○ The reference BLAS is freely available, for example from netlib (http://www.netlib.org/blas/)
    ○ Processor vendors often supply their own implementations
    ○ ATLAS (http://math-atlas.sourceforge.net/) - a self-optimising BLAS implementation
    ○ OpenBLAS - supporting x86, x86-64, MIPS and ARM processors

Example of using BLAS (Fortran 90):

  • LU factorisation using BLAS3 operations
    http://www.ut.ee/~eero/SC/konspekt/Naited/lu1blas3.f90.html
  • main program for testing different BLAS levels
    http://www.ut.ee/~eero/SC/konspekt/Naited/testblas3.f90.html

slide-37
SLIDE 37

Memory latency hiding with the help of cache

  • Cache - a small but fast memory between the processor and the main memory (DRAM)
  • works as low-latency, high-throughput storage

Example: Cache effect

➢ Cache size: 32 KB with latency 1 ns
➢ Operation C = AB, where A and B are 32 × 32 matrices (i.e. all matrices A, B and C fit simultaneously into the cache)
➢ We observe that:
  ○ reading the 2 matrices into the cache (2K words) takes ca 200 μs
  ○ multiplying two n × n matrices takes 2n³ operations ⇒ 64K operations, which can be performed in 16K cycles (4 op/cycle)
➢ ⇒ total time (data access + calculations) = 200 μs + 16 μs
➢ close to peak performance: 64K/216 μs, or 303 Mflops
➢ Access to the same data corresponds to temporal locality

In the given example there are O(n²) data accesses and O(n³) computations. Such an asymptotic difference is very good for the cache. Data reuse is of critical importance for performance!

  • Cache helps to decrease the effective memory latency only if enough reuse of data is present
  • the fraction of data accesses served by the cache without a main-memory access - the cache hit ratio
  • often the hit ratio is a major factor in performance
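A sketch reproducing the numbers of this example (the 1 GHz clock, 100 ns per word from main memory and 4 op/cycle are the assumptions stated in these slides):

    n            = 32
    words_in     = 2 * n * n           # A and B: 2K words to load into the cache
    mem_per_word = 100e-9              # 100 ns per word from main memory
    flops        = 2 * n**3            # ~64K operations for the multiplication
    cycle        = 1e-9                # 1 GHz clock
    compute_time = flops / 4 * cycle   # 4 operations per cycle -> ~16K cycles

    load_time  = words_in * mem_per_word      # ~205 us
    total_time = load_time + compute_time     # ~221 us
    print(f"load {load_time * 1e6:.0f} us, compute {compute_time * 1e6:.0f} us")
    print(f"effective rate: {flops / total_time / 1e6:.0f} Mflops")   # ~300 Mflops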
slide-38
SLIDE 38

Some trends in HPC concerning cache hierarchies

Anant Vithal Nori, Jayesh Gaur, Siddharth Rai, Sreenivas Subramoney and Hong Wang, Criticality Aware Tiered Cache Hierarchy: A Fundamental Relook at Multi-level Cache Hierarchies, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), DOI: 10.1109/ISCA.2018.00019 https://ieeexplore.ieee.org/abstract/document/8416821 Ivy Bo Peng, Roberto Gioiosa, Gokcen Kestor, Pietro Cicotti, Erwin Laure and Stefano Markidis, Exploring the Performance Benefit of Hybrid Memory System on HPC Environments, 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), DOI: 10.1109/IPDPSW.2017.115 https://ieeexplore.ieee.org/abstract/document/7965110 In-class reading:

slide-39
SLIDE 39

Flynn’s taxonomy of parallel computers

                       Single instr. stream    Multiple instr. stream
Single data stream     SISD                    MISD
Multiple data stream   SIMD                    MIMD

SISD - Single Instruction Single Data

  • A serial (non-parallel) computer
  • Single Instruction: Only one instruction stream is

being acted on by the CPU during any one clock cycle

  • Single Data: Only one data stream is being used

as input during any one clock cycle

  • Deterministic execution
  • Examples: older generation mainframes,

minicomputers and workstations, early laptops etc.

(https://computing.llnl.gov/tutorials/parallel_comp/#Flynn)

Parallel Architectures

slide-40
SLIDE 40

SISD

Flynn’s taxonomy of parallel computers

Single instr. stream Multiple instr. stream

SIMD MIMD MISD

Multiple data stream Single data stream

SIMD - Single Instruction Multiple Data

  • A type of parallel computer
  • Single Instruction: All processing units execute the same

instruction at any given clock cycle

  • Multiple Data: Each processing unit can operate on a different data

element

  • Best suited for specialized problems characterized by a high

degree of regularity, such as graphics/image processing

  • Synchronous (lockstep) and deterministic execution
  • Two varieties: Processor Arrays and Vector Pipelines
  • Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV

(https://computing.llnl.gov/tutorials/parallel_comp/#Flynn)

  • Vector Pipelines: IBM 9000, Cray

X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10

  • GPUs
slide-41
SLIDE 41

MISD (Multiple Instruction Single Data)

  • A type of parallel computer
  • Multiple Instruction: Each processing unit operates on the data independently via separate instruction streams

  • Single Data: A single data stream is fed into

multiple processing units.

  • Few actual examples of this class of parallel

computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971).

  • Some conceivable uses might be:

○ multiple frequency filters operating on a single signal stream ○ multiple cryptography algorithms attempting to crack a single coded message.

SISD

Flynn’s taxonomy of parallel computers

Single instr. stream Multiple instr. stream

SIMD MISD

Multiple data stream Single data stream (Carnegie-MellonC.mmp https://en.wikipedia.org/wiki/C.mmp)

slide-42
SLIDE 42

MIMD - Multiple Instruction Multiple Data

  • Parallel computer
  • Multiple Instruction: Every processor may be

executing a different instruction stream

  • Multiple Data: Every processor may be working

with a different data stream

  • Execution can be synchronous or asynchronous,

deterministic or nondeterministic

  • Currently, the most common type of parallel

computer

  • Examples: most current supercomputers,

networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs

  • Note: many MIMD architectures also include

SIMD execution sub-components

MIMD SISD

Flynn’s taxonomy of parallel computers

Single instr. stream Multiple instr. stream

SIMD MISD

Multiple data stream Single data stream (https://computing.llnl.gov/tutorials/parallel_comp/#Flynn)

slide-43
SLIDE 43

DMMP

Distributed-memory multicomputers

MIMD SISD

Flynn-Johnson classification

Single instr. stream Multiple instr. stream

SIMD MISD

Multiple data stream Single data stream

GMSV

Shared-memory multiprocessors Global memory Distributed memory

GMMP

Rarely used

DMSV

Distributed shared memory Message passing Shared variables

slide-44
SLIDE 44

DMMP

Distributed-memory multicomputers

MIMD SISD

Flynn-Johnson classification

Single instr. stream Multiple instr. stream

SIMD MISD

Multiple data stream Single data stream

GMSV

Shared-memory multiprocessors Global memory Distributed memory

GMMP

Rarely used

DMSV

Distributed shared memory Message passing Shared variables

Shared-memory multiprocessors

  • Common shared memory
  • Global address space
  • Multiple processors operate independently

sharing the same memory resource

  • Changes in memory state affect all

processors

  • Historically classified as UMA and NUMA

(according to memory access times)

(Figure: several multicore CPUs connected to a common shared memory.)

slide-45
SLIDE 45

DMMP

Distributed-memory multicomputers

MIMD SISD

Flynn-Johnson classification

Single instr. stream Multiple instr. stream

SIMD MISD

Multiple data stream Single data stream

GMSV

Shared-memory multiprocessors Global memory Distributed memory

GMMP

Rarely used

DMSV

Distributed shared memory Message passing Shared variables

Shared-memory multiprocessors

Uniform Memory Access (UMA):

  • Most commonly represented today by

Symmetric Multiprocessor (SMP) machines

  • Identical processors
  • Equal access and access times to

memory

  • Sometimes called CC-UMA - Cache

Coherent UMA

  • Cache coherence:

○ if one processor updates a location in shared memory, all the other processors know about the update. ○ implemented at the hardware level. Shared memory

CPU CPU CPU CPU

slide-46
SLIDE 46

GMSV

Shared-memory multiprocessors

DMMP

Distributed-memory multicomputers

MIMD SISD

Flynn-Johnson classification

Single instr. stream Multiple instr. stream

SIMD MISD

Multiple data stream Single data stream Global memory Distributed memory

GMMP

Rarely used

DMSV

Distributed shared memory Message passing Shared variables

Distributed shared-memory

Non-Uniform Memory Access (NUMA):

  • Often made by physically linking two or more

SMPs

  • One SMP can directly access memory of

another SMP

  • Not all processors have equal access time to

all memories

  • Memory access across link is slower
  • If cache coherency is maintained, then may

also be called CC-NUMA - Cache Coherent NUMA

Advantages:

  • Global address space provides a user-friendly

programming perspective to memory

  • Data sharing between tasks is both fast and uniform due

to the proximity of memory to CPUs

(Figure: several SMP nodes, each with its own memory and CPUs, connected by a bus interconnect.)

Disadvantages:

  • lack of scalability between memory and CPUs
  • for cache-coherent systems, geometrically increasing traffic associated with cache/memory management
  • programmer responsibility for synchronization constructs that ensure "correct" access to global memory

slide-47
SLIDE 47

DMMP

Distributed-memory multicomputers

MIMD SISD

Flynn-Johnson classification

Single instr. stream Multiple instr. stream

SIMD MISD

Multiple data stream Single data stream

GMSV

Shared-memory multiprocessors Global memory Distributed memory

GMMP

Rarely used

DMSV

Distributed shared memory Message passing Shared variables

Distributed memory architectures

Networked processors with their private memories

  • No global memory address space
  • All processors are independent
  • Communication
    ○ via message-passing
    ○ explicit communication operations

(Figure: CPUs, each with its own memory, connected by a network.)

Advantages

  • Memory is scalable with the number of processors
  • No overhead when accessing local memory
  • Easy to use off-the-shelf components

Disadvantages

  • Communication has to be orchestrated by the programmer
  • Mapping of data structures onto the system can be complicated
  • Access times vary a lot due to non-uniform memory
slide-48
SLIDE 48

DMSV

Distributed shared memory

GMSV

Shared-memory multiprocessors

DMMP

Distributed-memory multicomputers

MIMD SISD

Flynn-Johnson classification

Single instr. stream Multiple instr. stream

SIMD MISD

Multiple data stream Single data stream Global memory Distributed memory

GMMP

Rarely used Message passing Shared variables

Hybrid Distributed-Shared Memory

  • Each node is a shared-memory machine:
    ○ Memory components shared between CPUs and GPUs
  • The nodes are connected as in a distributed-memory machine
    ○ Network communication is required to move data between the nodes

(Figures: nodes containing CPUs and GPUs with a shared memory, and nodes containing only CPUs with a shared memory, connected by a network.)

  • Used in most high-end computing systems nowadays
  • All signs show that this will be the prevailing architecture type for the foreseeable future

slide-49
SLIDE 49

In-class exercise 5

1. Look around in the Top500 list and choose your favorite computer system there for a short review of the highlights of the system! In particular, also address the architectural aspects we have discussed so far in the course! 2. Post the short review to the course Piazza: piazza.com/ut.ee/spring2019/mtat08020

slide-50
SLIDE 50

Analytical modeling of Parallel Systems

  • Sources of Overhead in Parallel Programs
  • Performance Metrics for Parallel Systems
  • Effect of Granularity on Performance
  • Scalability of Parallel Systems
  • Minimum Execution Time and Minimum Cost-Optimal Execution Time
  • Amdahl's law
  • Gustafson-Barsis' law
  • Asymptotic Analysis of Parallel Programs
  • Other Scalability Metrics

Based on: Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Introduction to Parallel Computing, Second Edition. Addison-Wesley, 2003

slide-51
SLIDE 51

Analytical Modeling - Basics

  • A sequential algorithm is evaluated by its runtime (in general, asymptotic runtime T(n) as a

function of input size n ) –

T(n) = O(f(n)) if there exist positive constants c and n0 such that T(n) ≤ cf(n), for all n ≥ n0

T(n) = Ω(f(n)) if there exist positive constants c and n0 such that T(n) ≥ cf(n), for all n ≥ n0

T(n) = Θ(f(n)) if T(n) = O(f(n)) and T(n) = Ω(f(n))

  • The asymptotic runtime is independent of the platform. Analysis “at a constant factor”.
  • A parallel algorithm is evaluated by its runtime as a function of

the input size

the number of processors

communication parameters of the machine

  • An algorithm must therefore be analyzed in the context of the underlying platform
slide-52
SLIDE 52

Analytical Modeling - Basics

  • A parallel system is a combination of a parallel algorithm and an underlying platform
  • Wall clock time - the time from the start of the first processor to the stopping time of the last processor in a parallel ensemble.
    ○ But how does this scale when the number of processors is changed or the program is ported to another machine altogether?
  • How much faster is the parallel version?
    ○ This begs the obvious follow-up question - what is the baseline serial version with which we compare? Can we use a suboptimal serial program to make our parallel program look more attractive?
  • Raw FLOP count - what good are FLOP counts when they don't solve a problem?
slide-53
SLIDE 53

Sources of Overhead in Parallel Programs

  • If I use two processors, shouldn’t my program

run twice as fast?

  • No - a number of overheads, including wasted

computation, communication, idling, and contention cause degradation in performance

  • Interprocess interactions: Processors

working on any non-trivial parallel problem will need to talk to each other.

  • Dependencies: Computation depends on

results from other processes

  • Idling: Processes may idle because of load

imbalance, synchronization, or serial components.

  • Load imbalance: Due to algorithm or system
  • Excess Computation: This is computation not

performed by the serial version. This might be because the serial algorithm is difficult to parallelize, or that some computations are repeated across processors to minimize communication.

slide-54
SLIDE 54

Performance Metrics for Parallel Systems: Execution Time, Speedup

Execution time = the time elapsed between
➢ the beginning and the end of execution on a sequential computer
➢ the beginning on the first processor and the end on the last processor on a parallel computer

We denote the serial runtime by TS and the parallel runtime by TP. Let Tall be the total time collectively spent by all the processing elements.

  • Observe that Tall - TS is then the total time spent by all processors combined in non-useful work. This is called the total overhead.
  • The total time collectively spent by all the processing elements: Tall = p TP (p is the number of processors).
  • The overhead function (To) is therefore given by To = p TP - TS
  • Speedup (S) is the ratio of the time taken to solve a problem on a single processor to the time required to solve the same problem on a parallel computer with p identical processing elements.
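These definitions translate directly into a few helper functions (a minimal sketch; TS, TP and p are whatever measurements you plug in, and the numbers in the usage line are made up):

    def overhead(ts, tp, p):
        """Total overhead To = p*TP - TS."""
        return p * tp - ts

    def speedup(ts, tp):
        """S = TS / TP."""
        return ts / tp

    def efficiency(ts, tp, p):
        """E = S / p."""
        return speedup(ts, tp) / p

    # Example: a 52 s serial run that takes 14 s on 4 processing elements
    print(speedup(52, 14), efficiency(52, 14, 4), overhead(52, 14, 4))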

slide-55
SLIDE 55

Performance Metrics: Example

  • Consider the problem of adding n numbers by using n processing elements.
  • If n is a power of two, we can perform this operation in log n steps by propagating partial sums up a logical binary tree of processors.

(Figure: computing the global sum of 16 partial sums using 16 processing elements; Σᵢʲ denotes the sum of the numbers with consecutive labels from i to j.)

slide-56
SLIDE 56

Performance Metrics: Example (continued)

  • If an addition takes constant time, say, tc and

communication of a single word takes time ts + tw, we have the parallel time TP = Θ (log n)

  • We know that TS = Θ (n)
  • Speedup S is given by S = Θ (n / log n)
slide-57
SLIDE 57

Performance Metrics: Speedup

  • For a given problem, there might be many serial algorithms available.

These algorithms may have different asymptotic runtimes and may be parallelizable to different degrees.

  • For the purpose of computing speedup, we always consider the best

sequential program as the baseline.

slide-58
SLIDE 58

Performance Metrics: Speedup Example

  • Consider the problem of parallel bubble sort.
  • Suppose serial time for bubblesort is 52 seconds.
  • The parallel time for odd-even sort (efficient parallelization of bubble sort) is 14

seconds.

  • The speedup would appear to be 52/14 = 3.71
  • But is this really a fair assessment of the system?
  • What if serial quicksort only took 12 seconds?

○ In this case, the speedup is 12/14 = 0.86. This is a more realistic assessment of the system.

slide-59
SLIDE 59

Performance Metrics: Speedup Bounds

  • Speedup can be as low as 0 (the parallel program never terminates).
  • Speedup, in theory, should be upper bounded by p - after all, we can only expect a p-fold speedup if we use p times as many resources.
  • A speedup greater than p is possible only if each processing element spends less than time TS / p solving the problem.
  • In this case, a single processor could be timesliced to achieve a faster serial program, which contradicts our assumption of the fastest serial program as the basis for speedup.

slide-60
SLIDE 60

Performance Metrics: Superlinear Speedups

One reason for superlinearity is that the parallel version does less work than the corresponding serial algorithm.

Example: searching an unstructured tree for a node with a given label, 'S', on two processing elements using depth-first traversal. The two-processor version, with processor 0 searching the left subtree and processor 1 searching the right subtree, expands only the shaded nodes before the solution is found. The corresponding serial formulation expands the entire tree. It is clear that the serial algorithm does more work than the parallel algorithm.

slide-61
SLIDE 61

Performance Metrics: Superlinear Speedups

Resource-based superlinearity: The higher aggregate cache/memory bandwidth can result in better cache-hit ratios, and therefore superlinearity. Example: A processor with 64KB of cache yields an 80% hit ratio. If two processors are used, since the problem size/processor is smaller, the hit ratio goes up to 90%. Of the remaining 10% access, 8% come from local memory and 2% from remote memory. If DRAM access time is 100 ns, cache access time is 2 ns, and remote memory access time is 400ns, this corresponds to a speedup of 2.43!
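The 2.43 figure can be checked with a few lines (a sketch; it assumes the run time is entirely dominated by memory accesses, so the speedup is the ratio of average access times multiplied by the two processors):

    t_cache, t_dram, t_remote = 2, 100, 400   # ns
    # Single processor: 80% cache hits, 20% DRAM
    t1 = 0.80 * t_cache + 0.20 * t_dram                     # 21.6 ns
    # Two processors: 90% hits, 8% local DRAM, 2% remote memory
    t2 = 0.90 * t_cache + 0.08 * t_dram + 0.02 * t_remote   # 17.8 ns
    speedup = 2 * t1 / t2    # each processor also handles only half of the accesses
    print(round(t1, 1), round(t2, 1), round(speedup, 2))    # 21.6 17.8 2.43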

slide-62
SLIDE 62

Performance Metrics: Efficiency

  • Efficiency is a measure of the fraction of time for which a processing element is usefully employed
  • Mathematically, it is given by E = S / p
  • Following the bounds on speedup, efficiency can be as low as 0 and as high as 1.

slide-63
SLIDE 63

Performance Metrics: Efficiency Example

  • The speedup of adding n numbers on n processing elements is given by S = Θ(n / log n)
  • The efficiency is given by E = S / n = Θ(1 / log n)
slide-64
SLIDE 64

Parallel Time, Speedup, and Efficiency Example

Consider the problem of edge detection in images. The problem requires us to apply a 3 × 3 template to each pixel. If each multiply-add operation takes time tc, the serial time for an n × n image is given by TS = 9 tc n².

(Figure: example of edge detection: (a) an 8 × 8 image; (b) typical templates for detecting edges; (c) partitioning of the image across four processors, with shaded regions indicating image data that must be communicated from neighboring processors to processor 1.)

slide-65
SLIDE 65

Parallel Time, Speedup, and Efficiency Example (continued)

  • One possible parallelization partitions the image equally into vertical segments, each with n² / p pixels.
  • The boundary of each segment is 2n pixels. This is also the number of pixel values that will have to be communicated. This takes time 2(ts + tw n).
  • Templates may now be applied to all n² / p pixels in time 9 tc n² / p.
slide-66
SLIDE 66

Parallel Time, Speedup, and Efficiency Example (continued)

  • The total time for the algorithm is therefore given by TP = 9 tc n²/p + 2(ts + tw n)
  • The corresponding values of speedup and efficiency are given by
    S = 9 tc n² / (9 tc n²/p + 2(ts + tw n))
    and
    E = 9 tc n² / (9 tc n² + 2 p (ts + tw n))

slide-67
SLIDE 67

Cost of a Parallel System

  • Cost is the product of parallel runtime and the number of processing elements used (p × TP).
  • Cost reflects the sum of the time that each processing element spends solving the problem.
  • A parallel system is said to be cost-optimal if the cost of solving a problem on a parallel computer is asymptotically identical to the serial cost.
  • Since E = TS / (p TP), for cost-optimal systems E = O(1).
  • Cost is sometimes referred to as work or processor-time product.
slide-68
SLIDE 68

Cost of a Parallel System: Example

Consider the problem of adding n numbers on n processors.

  • We have TP = log n (for p = n).
  • The cost of this system is given by p TP = n log n.
  • Since the serial runtime of this operation is Θ(n), the algorithm is not cost optimal.

slide-69
SLIDE 69

Impact of Non-Cost Optimality

Consider a sorting algorithm that uses n processing elements to sort a list in time (log n)².

  • Since the serial runtime of a (comparison-based) sort is n log n, the speedup and efficiency of this algorithm are given by n / log n and 1 / log n, respectively.
  • The p TP product of this algorithm is n (log n)².
  • This algorithm is not cost optimal, but only by a factor of log n.
  • If p < n, assigning n tasks to p processors gives TP = n (log n)² / p.
  • The corresponding speedup of this formulation is p / log n.
  • This speedup goes down as the problem size n is increased for a given p!
slide-70
SLIDE 70

Effect of Granularity on Performance

  • Often, using fewer processors improves the performance of parallel systems.
  • Using fewer than the maximum possible number of processing elements to execute a parallel algorithm is called scaling down a parallel system.
  • A naive way of scaling down is to think of each processor in the original case as a virtual processor and to assign virtual processors equally to the scaled-down processors.
  • Since the number of processing elements decreases by a factor of n / p, the computation at each processing element increases by a factor of n / p.
  • The communication cost should not increase by this factor, since some of the virtual processors assigned to a physical processor might talk to each other. This is the basic reason for the improvement from building granularity.
slide-71
SLIDE 71

Building Granularity: Example

  • Consider the problem of adding n numbers on p processing

elements such that p < n and both n and p are powers of 2

  • Use the parallel algorithm for n processors, except, in this case, we

think of them as virtual processors

  • Each of the p processors is now assigned n / p virtual processors;

○ virtual processing element i is simulated by the physical processing element labeled i mod p

  • The first log p of the log n steps of the original algorithm are simulated

in (n / p) log p steps on p processing elements.

  • Subsequent log n - log p steps do not require any communication.
slide-72
SLIDE 72

Building Granularity: Example (continued)

  • The overall parallel execution time of this parallel system is

Θ ( (n / p) log p).

  • The cost is Θ (n log p), which is asymptotically higher than the Θ (n)

cost of adding n numbers sequentially. Therefore, the parallel system is not cost-optimal.

slide-73
SLIDE 73

Building Granularity: Example (continued)

Can we build granularity in the example in a cost-optimal fashion?

  • Each processing element locally adds its n / p numbers in time Θ(n / p).
  • The p partial sums on the p processing elements can then be added in time Θ(log p).

(Figure: a cost-optimal way of computing the sum of 16 numbers using four processing elements.)

slide-74
SLIDE 74

Building Granularity: Example (continued)

  • The parallel runtime of this algorithm is

TP = Θ (n / p + log p),

  • The cost is Θ (n + p log p)
  • This is cost-optimal, so long as n = Ω(p log p) !
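A sketch of this cost-optimal scheme and of its time model (a pure-Python simulation where the 'processors' are just list slices; it assumes p is a power of two and divides n):

    import math

    def parallel_sum(numbers, p):
        """Local sums on p processors, then a log p tree reduction."""
        n = len(numbers)
        chunk = n // p
        partial = [sum(numbers[i*chunk:(i+1)*chunk]) for i in range(p)]  # Theta(n/p) phase
        while len(partial) > 1:                                          # Theta(log p) phase
            partial = [partial[i] + partial[i+1] for i in range(0, len(partial), 2)]
        return partial[0]

    def model_time(n, p):
        """T_P = Theta(n/p + log p), ignoring constants."""
        return n / p + math.log2(p)

    print(parallel_sum(list(range(16)), 4))            # 120
    print(model_time(1024, 4), model_time(1024, 64))   # 258.0 vs 22.0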
slide-75
SLIDE 75

Scaling Characteristics of Parallel Programs

  • The efficiency of a parallel program can be written as E = S / p = TS / (p TP), or equivalently E = 1 / (1 + To / TS).
  • The total overhead function To is an increasing function of p.
  • For a given problem size (i.e., the value of TS remains constant), as we increase the number of processing elements, To increases.
  • So the overall efficiency of the parallel program goes down. This is the case for all parallel programs.

slide-76
SLIDE 76

Scaling Characteristics of Parallel Programs: Example

  • Consider the problem of adding n numbers on p processing elements.
  • We have seen that TP = Θ(n/p + log p), hence S = Θ(n / (n/p + log p)) and E = Θ(1 / (1 + (p log p)/n)).
slide-77
SLIDE 77

Scaling Characteristics of Parallel Programs: Example (continued)

Plotting the speedup for various input sizes gives us:

(Figure: speedup versus the number of processing elements for adding a list of numbers.)

Speedup tends to saturate and efficiency drops as a consequence of Amdahl's law.

slide-78
SLIDE 78

In-class exercise 6:

  • Search for best parallel programming languages
  • Choose one of them (your favorite!) and post a brief review to Course Piazza!

One possible starting point can be: https://www.slant.co/topics/6024/~programming-languages-for-concurrent-programming

slide-79
SLIDE 79

Amdahl’s law

In each algorithm there exist parts that cannot be parallelised.

  • Let σ (0 < σ ≤ 1) denote the sequential part
  • Assume that the rest, 1 − σ, is parallelised optimally
  • Then, in the best case:

    S(N, P) = 1 / (σ + (1 − σ)/P) ≤ 1/σ

Example 1. Assume 5% of the algorithm is not parallelisable (i.e. σ = 0.05) =>

    P       max S(N,P)
    2       1.9
    4       3.5
    10      6.9
    20      10.3
    100     16.8
    ∞       20

Example 2. σ = 0.67 (only 33% parallelisable), P = 10: S ≈ 1 / (0.67 + 0.33/10) ≈ 1.4
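A short sketch of Amdahl's law that reproduces the σ = 0.05 table above:

    def amdahl(sigma, p):
        """Best-case speedup with sequential fraction sigma on p processors."""
        return 1.0 / (sigma + (1.0 - sigma) / p)

    for p in (2, 4, 10, 20, 100):
        print(p, round(amdahl(0.05, p), 1))    # 1.9  3.5  6.9  10.3  16.8
    print("limit:", 1 / 0.05)                  # 20 as P -> infinity
    print(round(amdahl(0.67, 10), 2))          # Example 2: about 1.42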

slide-80
SLIDE 80

John Gustafson & Ed Barsis (Sandia National Laboratories), 1988:

  • with a 1024-processor nCube/10 they claimed: we beat Amdahl's law!
  • Their σ ≈ 0.004...0.008
  • but they got S ≈ 1000
  • (According to Amdahl's law, S could be at most 125...250)

How was it possible? Does Amdahl's law hold?

  • Mathematically - yes. But in practice it is not a very good idea to solve a problem of fixed size N on an arbitrary number of processors!
  • In general, σ = σ(N) ≠ const
  • Usually, σ decreases as N grows!
  • An algorithm is said to be effectively parallel if σ → 0 as N → ∞

Scaled efficiency (to avoid misunderstandings):

Gustafson-Barsis' law
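A sketch contrasting fixed-size (Amdahl) and scaled (Gustafson-Barsis) speedup for the nCube example; the σ values are the ones quoted on this slide:

    def amdahl(sigma, p):
        return 1.0 / (sigma + (1.0 - sigma) / p)   # fixed problem size

    def gustafson(sigma, p):
        return p - sigma * (p - 1)                 # problem size scaled with p

    p = 1024
    for sigma in (0.004, 0.008):
        print(sigma, round(gustafson(sigma, p)), round(amdahl(sigma, p)))
    # sigma = 0.004: scaled speedup ~1020, Amdahl ~201
    # sigma = 0.008: scaled speedup ~1016, Amdahl ~111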

slide-81
SLIDE 81

Scaled efficiency

The problem size increases proportionally when adding new processors - does the execution time remain the same?

  • 0 < ES(N, P) ≤ 1
  • If ES(N, P) = 1 - linear speedup
slide-82
SLIDE 82

Methods to increase efficiency

Factors influencing efficiency:

  • communication time
  • waiting time
  • additional computations
  • changing/improving the algorithm

Profiling parallel programs

  • MPE - jumpshot, LMPI, MpiP
  • Valgrind, Totalview, Vampir, Allinea OPT
  • Linux - gprof (compiler switch -pg)
  • SUN - prof, gprof, prism
  • Many other commercial applications

Overlapping communication and computations
Example: parallel Ax-operation for sparse matrices in Domain Decomposition setups

  • The matrix is partitioned and divided between the processors. Start the communication (non-blocking) and do the calculations on the interior parts of the region => savings in waiting times

Extra computations instead of communication

  • Computations done in place instead of importing the results over the network
  • Sometimes it pays off!

Example: random number generation - broadcast only the seed and generate the numbers in parallel (deterministic algorithm).

slide-83
SLIDE 83

Computer Benchmarks

Some HPC (High Performance Computing) benchmarks:

  • HPL (High Performance Linpack)
  • NPB (NAS Parallel Benchmarks)
  • HINT (Hierarchical INTegration)
  • Perf
  • IOzone
  • Graph 500
slide-84
SLIDE 84

Linpack

Jack Dongarra. HPL - High Performance Linpack, using MPI and BLAS. Solving systems of linear equations with dense matrices. The aim is to fit a problem of maximal size (advisably utilising 80% of memory). http://www.netlib.org/benchmark/hpl/ Used for http://www.top500.org

  • Rpeak - peak performance in Gflops
  • N - size of the matrix giving peak performance in Gflops (usually < 80% of memory size)
  • Rmax - maximal achieved performance in Gflops
  • NB - blocksize. In general, the smaller the better, but usually in the range 32...256.

slide-85
SLIDE 85

Numerical Aerodynamic Simulation (NAS) Parallel Benchmarks (NPB) [1] https://www.nas.nasa.gov/publications/npb.html (1992 v1, 1996 v2.1, 2002 v2.2) The original eight benchmarks specified in NPB 1 mimic the computation and data movement in CFD applications:

  • five kernels

○ IS - Integer Sort, random memory access ○ EP - Embarrassingly Parallel ○ CG - Conjugate Gradient, irregular memory access and communication ○ MG - Multi-Grid on a sequence of meshes, long- and short-distance communication, memory intensive ○ FT - discrete 3D fast Fourier Transform, all-to-all communication

  • three pseudo applications

○ BT - Block Tri-diagonal solver ○ SP - Scalar Penta-diagonal solver ○ LU - Lower-Upper Gauss-Seidel solver

slide-86
SLIDE 86

Numerical Aerodynamic Simulation (NAS) Parallel Benchmarks (NPB) [2] Multi-zone versions of NPB (NPB-MZ) are designed to exploit multiple levels of parallelism in applications and to test the effectiveness of multi-level and hybrid parallelization paradigms and tools.

  • Three types of benchmark problems derived from single-zone pseudo

applications of NPB:

○ BT-MZ - uneven-size zones within a problem class, increased number of zones as problem class grows ○ SP-MZ - even-size zones within a problem class, increased number of zones as problem class grows ○ LU-MZ - even-size zones within a problem class, a fixed number of zones for all problem classes

slide-87
SLIDE 87

Numerical Aerodynamic Simulation (NAS) Parallel Benchmarks (NPB) [3]

  • NPB 3: Benchmarks for unstructured computation, parallel I/O, and data

movement

○ UA - Unstructured Adaptive mesh, dynamic and irregular memory access ○ BT-IO - test of different parallel I/O techniques ○ DC - Data Cube ○ DT - Data Traffic

  • GridNPB is designed specifically to rate the performance of computational grids. Each of the four benchmarks in the set consists of a collection of communicating tasks derived from the NPB. They symbolize distributed applications typically run on grids.
    ○ ED - Embarrassingly Distributed ○ HC - Helical Chain ○ VP - Visualization Pipeline ○ MB - Mixed Bag

slide-88
SLIDE 88

HINT benchmark

The HINT (Hierarchical INTegration). Graphical view of:

  • floating point performance
  • integer operation performance
  • performances with different memory hierarchies

https://web.archive.org/web/20130724124556/http://hint.byu.edu/

Some other benchmarks

  • Perf
  • IOzone
  • etc.
slide-89
SLIDE 89

In-class exercise 7

A) Find out at OpenBenchmarking.org

a) What are currently the most popular benchmarks related to parallel computing? b) Find a benchmark that you like the most and describe it! c) Any other interesting aspect you find fascinating?

B) Find out about top500 vs graph500

a) What is their difference? b) What inspired the creation of graph500? c) How different are these lists? d) Some other interesting aspect you notice when comparing the two benchmarks?

slide-90
SLIDE 90

Applications with the need for parallel computing

  • Embarrassingly parallel applications
    ○ Data mining ○ Molecular dynamics ○ Cryptographic algorithms
    ○ etc. etc.
  • Applications depending on good ways for communication and synchronisation
    ○ Integral equations ■ Numerical solution usually implies dense matrices ■ Parallelisation of LU-factorisation ○ Numerical solution of Partial Differential Equations (PDEs) ■ Sparse matrix structures ■ → Methods for iterative solution of systems with sparse matrices

slide-91
SLIDE 91

EXAMPLE Finite element method for solving Poisson equation

Example: 2D Finite Element Method for the Poisson Equation

Poisson equation: −Δu = f in Ω, u = 0 on Γ, where the Laplacian Δ is defined by Δu = ∂²u/∂x² + ∂²u/∂y²

u can be, for example,

  • temperature
  • electro-magnetic potential
  • the displacement of an elastic membrane fixed at the boundary under a transversal load of intensity f

(Figure: domain Ω with boundary Γ in the xy-plane.)

slide-92
SLIDE 92

The divergence of a vector function F = (F1, F2) is defined by div F = ∂F1/∂x + ∂F2/∂y.

Divergence theorem: ∫_Ω div F dx = ∫_Γ F · n ds, where F = (F1, F2) is a vector-valued function defined on Ω and n = (n1, n2) is the outward unit normal to Γ. Here dx is the element of area in ℝ² and ds the arc length along Γ.

(Figure: domain Ω with boundary Γ and outward normal n.)

Finite element method for solving Poisson equation

slide-93
SLIDE 93

Applying the divergence theorem to F = (vw, 0) and F = (0, vw) (details not shown here) - Green's first identity - we come to a variational formulation of the Poisson problem on V:

Poisson problem in variational formulation: find u ∈ V such that a(u, v) = (f, v) for all v ∈ V, where in the case of the Poisson equation a(u, v) = ∫_Ω ∇u · ∇v dx and (f, v) = ∫_Ω f v dx.

The gradient ∇ of a scalar function f(x, y) is defined by ∇f = (∂f/∂x, ∂f/∂y).

Finite element method for solving Poisson equation

(a(u, v) is also called the bilinear form)

slide-94
SLIDE 94

Again, in the variational formulation of the Poisson equation the task is: find uh ∈ Vh such that a(uh, v) = (f, v) for all v ∈ Vh. With discretisation we replace the space V with Vh - the space of piecewise linear functions. Each function in Vh can be written as v(x, y) = Σj ξj φj(x, y), where the φj(x, y) are basis functions (or 'hat' functions).

Finite element method for solving Poisson equation

slide-95
SLIDE 95

With 2D FEM we demand that the equation in the variational formulation is satisfied for the M basis functions φi ∈ Vh, i.e. a(uh, φi) = (f, φi) for i = 1, ..., M. Writing uh = Σj ξj φj, we get M linear equations with respect to the unknowns ξj: Σj a(φj, φi) ξj = (f, φi).

Finite element method for solving Poisson equation

slide-96
SLIDE 96

The stiffness matrix A = (aij) elements and right-hand side b = (bi) calculation: Integrals computed only where the pairs ∇φi · ∇φj get in touch (have mutual support)

( Support of a function f=f(x) is defined as the region of values x for which f(x)≠0 )

Finite element method for solving Poisson equation

slide-97
SLIDE 97

Example Two basis functions φi and φj for nodes Ni and Nj. Their common support is τ ∪ τ’ so that

Finite element method for solving Poisson equation

slide-98
SLIDE 98

Element matrices Consider single element τ Pick two basis functions φi and φj (out of three). φk – piecewise linear ⇒ denoting by pk(x, y) = φk|τ : their dot product

Finite element method for solving Poisson equation

slide-99
SLIDE 99

Finding coefficients α and β – put three points (xi, yi, 1), (xj, yj, 0) , (xk, yk, 0) to the plane equation and solve the system The integral is computed by the multiplication with the triangle area 0.5 ×|det D| The element matrix for τ is

Finite element method for solving Poisson equation

slide-100
SLIDE 100

Assembled stiffness matrix

  • created by appropriately adding all the element matrices together
  • Different types of boundary values used in the problem setup result in slightly different stiffness matrices
  • Most typical boundary conditions:
    ○ Dirichlet ○ Neumann
  • but also:
    ○ free boundary condition ○ Robin boundary condition ○ special boundary conditions for special PDEs (like impedance boundary conditions for the Helmholtz equation)

Finite element method for solving Poisson equation

slide-101
SLIDE 101

Right-hand side: approximate calculation of the integral bi = ∫_τ f φi dx using the value of f in the middle of the triangle τ:

  • support - the area of the triangle
  • φi is a pyramid (integrating φi over τ gives the pyramid volume)
  • Assembling procedure for the RHS vector b from the values bi over the triangles τ (somewhat similar to the matrix assembly, but simpler)
  • Introduction of the Dirichlet boundary condition just eliminates the corresponding rows.

Finite element method for solving Poisson equation

slide-102
SLIDE 102

EXAMPLE - Poisson equation on the unit square. Variational formulation: Vh - the space of piecewise linear functions on Ω with value zero on Γ and values defined at the inner nodes. Discretised problem in variational formulation: find uh ∈ Vh such that (1) a(uh, v) = (f, v) for all v ∈ Vh, where in the case of the Poisson equation a(u, v) = ∫_Ω ∇u · ∇v dx.

Finite element method for solving Poisson equation

slide-103
SLIDE 103

In FEM, equation (1) needs to be satisfied on a set of test functions (also called hat functions) φi = φi(x), which are defined to be 1 at node i and 0 at all other nodes, and it is demanded that (1) is satisfied for each φi (i = 1, ..., N).

  • As a result, a matrix of linear equations is obtained
  • In the general case the n²×n² Laplace matrix has a block-tridiagonal form: blocks B = tridiag(−1, 4, −1) on the diagonal and −I (minus the n×n identity) on the off-diagonals

Finite element method for solving Poisson equation
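A small sketch assembling this n² × n² matrix with SciPy (the standard 5-point stencil built from Kronecker products; the grid size n is an arbitrary choice):

    import scipy.sparse as sp

    def laplace_2d(n):
        """n^2 x n^2 matrix with blocks B = tridiag(-1, 4, -1) on the diagonal
        and -I on the off-diagonals (5-point stencil on an n x n grid)."""
        I = sp.identity(n, format="csr")
        T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
        return sp.kron(I, T) + sp.kron(T, I)   # diagonal entries become 4

    A = laplace_2d(3)
    print(A.toarray())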

slide-104
SLIDE 104

Solving systems of linear equations with sparse matrices

  • Sparse storage schemes
  • Serial solvers: direct solvers for sparse matrices
    ○ Analysis for the best ordering of unknowns (typically ~100 time units)
    ○ Sparse LU-factorisation (~10 time units)
    ○ Solving the corresponding triangular systems (~1 time unit)
  • Example software: SuperLU package
  • Serial solvers: iterative methods
    ○ e.g. Gauss-Seidel method
    ○ Krylov subspace methods (CG, GMRES)
  • Parallel solvers
    ○ Direct solvers, e.g. the MUMPS solver
    ○ Iterative parallel solvers
      ■ The most efficient ones are based on Krylov subspace methods
      ■ But for large problems good preconditioners are needed
      ■ → Domain Decomposition preconditioner techniques

slide-105
SLIDE 105

Domain Decomposition Method for solving Poisson equation

  • regular 2D grid with stepsize h = 1/(n+2), where N = n²
  • Let P = p²; n = m × p.
  • For example, if p = 3 and m = 3 => n = 9 in the figures below.
slide-106
SLIDE 106

In-class exercise 8: Continuing with the task from In-class exercise 6, which was:

  • Search for best parallel programming languages
  • Choose one of them (your favorite!) and compose a brief review!

The task is to get some corresponding real examples working using the appropriate Jupyter extension kernel! As always, post your results to the course Piazza!

slide-107
SLIDE 107

Domain Decomposition Method for solving linear systems with sparse matrices

  • Some examples:
slide-108
SLIDE 108

Domain Decomposition Method for solving linear systems with sparse matrices Classical Alternating Schwarz method

  • H. A. Schwarz, 1870
  • Solving Poisson equation using spectral method

○ Available only on regular domains

  • Alternating iterative process

○ Called also multiplicative overlapping Schwarz method

Ω1 Γ2 Ω2 Γ1

slide-109
SLIDE 109

Domain Decomposition Method for solving linear systems with sparse matrices

As discussed before, for solving systems with sparse matrices iterative methods are the best choice due to

  • memory restrictions
  • sparsity of the matrices
  • But the convergence rate of iterative methods depends on the spectrum (eigenvalue distribution) of the matrix
  • If, e.g., the system dimension is N = 10⁶, the number of iterations would be of order N
    ○ 1 million iterations - way too many!

Solution: use a preconditioner

slide-110
SLIDE 110

Domain Decomposition Method for solving linear systems with sparse matrices

Preconditioned Conjugate Gradient (PCG) method

Calculate r(0) = b − A x(0) with a given starting vector x(0)
for i = 1, 2, ...
    solve M z(i−1) = r(i−1)            # M⁻¹ is called the preconditioner
    ρ(i−1) = (r(i−1))ᵀ z(i−1)
    if i == 1:
        p(1) = z(0)
    else:
        β(i−1) = ρ(i−1) / ρ(i−2)
        p(i) = z(i−1) + β(i−1) p(i−1)
    q(i) = A p(i)
    α(i) = ρ(i−1) / (p(i))ᵀ q(i)
    x(i) = x(i−1) + α(i) p(i)
    r(i) = r(i−1) − α(i) q(i)
    check for convergence; continue if needed

Conjugate Gradient (CG) method

  • If M = I => CG (without preconditioning)
  • converges only in the case of a symmetric (positive definite) matrix A

Preconditioned CG method (PCG)

  • the application of M⁻¹ needs to be symmetric as well
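The pseudocode above translates almost line by line into NumPy (a minimal sketch; here the preconditioner solve is simply division by the diagonal of A, i.e. Jacobi/diagonal preconditioning, which keeps M⁻¹ symmetric; the small test matrix is made up):

    import numpy as np

    def pcg(A, b, x0, M_diag, tol=1e-8, maxit=1000):
        """Preconditioned CG for SPD A; M_diag = diag(A), so Mz = r gives z = r / M_diag."""
        x = x0.copy()
        r = b - A @ x
        rho_old = p = None
        for i in range(1, maxit + 1):
            z = r / M_diag                       # preconditioner solve  M z = r
            rho = r @ z                          # dot product
            p = z if i == 1 else z + (rho / rho_old) * p
            q = A @ p                            # matrix-vector product
            alpha = rho / (p @ q)                # dot product
            x += alpha * p
            r -= alpha * q
            rho_old = rho
            if np.linalg.norm(r) < tol:          # convergence check
                break
        return x, i

    A = np.array([[4., 1., 0.], [1., 3., 1.], [0., 1., 2.]])
    b = np.array([1., 2., 3.])
    x, iters = pcg(A, b, np.zeros(3), np.diag(A))
    print(x, iters, np.allclose(A @ x, b))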

slide-111
SLIDE 111

Domain Decomposition Method for solving linear systems with sparse matrices

Parallelizing the PCG method

Calculate r(0) = b − A x(0) with a given starting vector x(0)
for i = 1, 2, ...
    solve M z(i−1) = r(i−1)            # preconditioner solve
    ρ(i−1) = (r(i−1))ᵀ z(i−1)          # dot product
    if i == 1:
        p(1) = z(0)
    else:
        β(i−1) = ρ(i−1) / ρ(i−2)
        p(i) = z(i−1) + β(i−1) p(i−1)
    q(i) = A p(i)                      # matrix-vector multiplication
    α(i) = ρ(i−1) / (p(i))ᵀ q(i)       # dot product
    x(i) = x(i−1) + α(i) p(i)
    r(i) = r(i−1) − α(i) q(i)
    check for convergence; continue if needed

Parallelization
❖ Preconditioner solve
  ➢ A suitable M is needed
    ■ symmetric
    ■ allowing parallelization
    ■ easy to solve
❖ Matrix-vector operation
  ➢ A parallel version of the operation
❖ Dot product
  ➢ Parallelizable
  ➢ requires all-to-all communication

Which of the above operations can be implemented with communication hidden behind computations?
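For the dot products in the loop, the typical parallel pattern is: each process computes a partial dot product over its local slice of the vectors and an allreduce combines the partial results (a sketch with mpi4py, which the practicals use; the vector length and values are made up):

    # run with e.g.:  mpiexec -n 4 python parallel_dot.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    n_local = 250_000                  # each process owns one slice of the vectors
    x = np.full(n_local, 1.0)          # made-up local data
    y = np.full(n_local, 2.0)

    local_dot = x @ y                                     # local work, no communication
    global_dot = comm.allreduce(local_dot, op=MPI.SUM)    # all-to-all combination

    if rank == 0:
        print("global dot product:", global_dot, "over", size * n_local, "elements")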

slide-112
SLIDE 112

Domain Decomposition Method for solving linear systems with sparse matrices

Preconditioners

Instead of solving Ax = b, solve M⁻¹Ax = M⁻¹b

How to choose a preconditioner?
A. Problem Mz = r being easy to solve
B. Matrix B = M⁻¹A being better conditioned than A, meaning that κ₂(B) < κ₂(A), where κ₂ - condition number = {largest eigenvalue}/{smallest eigenvalue}

  • In extreme cases:

○ M=I ○ M=A
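
A small numpy illustration (not from the slides) of criterion B: a diagonal (Jacobi) preconditioner applied to a badly scaled matrix typically lowers the condition number considerably:

# Compare condition numbers of A and B = M^{-1} A for a diagonal M (sketch).
import numpy as np

n = 100
rng = np.random.default_rng(0)
D = np.diag(np.logspace(0, 4, n))           # widely varying diagonal (cond ~ 1e4)
A = D + 1e-3 * rng.standard_normal((n, n))  # small off-diagonal perturbation
M_inv = np.diag(1.0 / np.diag(A))           # Jacobi (diagonal) preconditioner
B = M_inv @ A

print("cond(A) =", np.linalg.cond(A))       # large, dominated by the scaling
print("cond(B) =", np.linalg.cond(B))       # close to 1 for this example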

slide-113
SLIDE 113

Domain Decomposition Method for solving linear systems with sparse matrices

Block-Jacobi method

Take M1 and M2 as preconditioners for the diagonal blocks A11 and A22 of matrix A.

For example, M1 = A11⁻¹, M2 = A22⁻¹
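
A minimal scipy sketch of the two-block case (the 50/50 split and the matrix are illustrative): each diagonal block is factored once, and the two block solves are independent of each other, which is what makes the method attractive in parallel:

# Block-Jacobi preconditioner solve M z = r with M = diag(A11, A22) (sketch).
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 200
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csc')
m = n // 2
A11 = A[:m, :m].tocsc()                     # first diagonal block
A22 = A[m:, m:].tocsc()                     # second diagonal block
lu1, lu2 = spla.splu(A11), spla.splu(A22)   # factor each block once

def M_solve(r):
    z = np.empty_like(r)
    z[:m] = lu1.solve(r[:m])                # independent block solves -> parallelizable
    z[m:] = lu2.solve(r[m:])
    return z

M_solve can be passed directly into the PCG sketch shown a few slides earlier.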
slide-114
SLIDE 114

Domain Decomposition Method for solving linear systems with sparse matrices

Multiple Subdomain case

  • N overlapping subdomains

(Figure: 3×3 array of overlapping subdomains Ω1 ... Ω9.)

Subdomain restriction operators

Restriction operator Ri – matrix applied to the global vector x returns the components xi = Ri x belonging to subdomain Ωi.

One-level Additive Schwarz preconditioner (for N = 9 subdomains in the Figure):

M⁻¹ = Σi Riᵀ Ai⁻¹ Ri

slide-115
SLIDE 115

Domain Decomposition Method for solving linear systems with sparse matrices

Denote:

Ri - Restriction operator
Ai⁻¹ - i-th subdomain solve

Subdomain solvers – exact or inexact
  ■ Typical inexact solvers
    • (symmetric) Gauss-Seidel method
    • Incomplete LU-factorization (ILU)

In case of CG, symmetry of M⁻¹ is a prerequisite!

One-level Additive Schwarz preconditioner:

M⁻¹ = Σi Riᵀ Ai⁻¹ Ri,  where Ai = Ri A Riᵀ
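
A dense numpy sketch of applying this preconditioner, with each Ri represented simply by the index set of its subdomain's unknowns (illustrative, not the course implementation):

# One-level Additive Schwarz: z = sum_i R_i^T A_i^{-1} R_i r (dense sketch).
import numpy as np

def schwarz_1level(A, r, subdomains):
    # subdomains: list of index arrays, one per (overlapping) subdomain Omega_i
    z = np.zeros_like(r)
    for idx in subdomains:
        A_i = A[np.ix_(idx, idx)]                # A_i = R_i A R_i^T
        z[idx] += np.linalg.solve(A_i, r[idx])   # add R_i^T A_i^{-1} R_i r
    return z

# Example: 1D Poisson split into 3 overlapping subdomains
n = 30
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
subs = [np.arange(0, 12), np.arange(10, 22), np.arange(20, 30)]
z = schwarz_1level(A, np.ones(n), subs)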

slide-116
SLIDE 116

Domain Decomposition Method for solving linear systems with sparse matrices

Multilevel methods

Coarse grid Restriction and Interpolation operators

Restriction operator R0 is defined as the transpose of the interpolation operator R0ᵀ, which interpolates values on the coarse grid linearly (or bilinearly) to the fine grid nodes.

Coarse grid matrix A0

  • Assembled
    ○ analytically or
    ○ through the formula: A0 = R0 A R0ᵀ

  • Define the Two-level Additive Schwarz method as

M⁻¹ = R0ᵀ A0⁻¹ R0 + Σi Riᵀ Ai⁻¹ Ri

Requirements to the coarse grid:

  • Coarse grid should cover all the fine grid nodes
  • Each coarse grid cell must contain fine grid nodes
  • If the fine grid has uneven density, the coarse grid must adapt to the fine grid in terms of the number of nodes in each cell
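
Continuing the one-level sketch from the previous slide, a hypothetical two-level apply just adds the coarse-grid correction R0ᵀ A0⁻¹ R0 r to the sum of subdomain solves (R0 here is any full-rank coarse restriction matrix, e.g. built by aggregating fine nodes):

# Two-level Additive Schwarz: coarse correction + one-level subdomain solves (sketch).
import numpy as np

def schwarz_2level(A, r, subdomains, R0):
    # R0: (n0 x n) coarse restriction matrix; R0.T interpolates back to the fine grid
    A0 = R0 @ A @ R0.T                        # coarse matrix A0 = R0 A R0^T
    z = R0.T @ np.linalg.solve(A0, R0 @ r)    # coarse-grid correction
    for idx in subdomains:                    # plus the one-level subdomain solves
        A_i = A[np.ix_(idx, idx)]
        z[idx] += np.linalg.solve(A_i, r[idx])
    return z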

slide-117
SLIDE 117

Parallel programming models

How are the computations structured?

  • Organized by data
    ○ Linear? → Geometric Decomposition
    ○ Recursive? → Recursive Data
  • Organized by tasks
    ○ Linear? → Task Parallelism
    ○ Recursive? → Divide and Conquer
  • Organized by data flow
    ○ Regular? → Pipeline
    ○ Irregular? → Event-based Coordination

A simplified view:

slide-118
SLIDE 118
Parallel programming models

Different models for expressing parallelism in programming languages

  • Partitioned global address space (PGAS)
    ○ High Performance Fortran
  • Multi-threaded
    ○ Clojure
  • Object-oriented
    ○ Charm++
    ○ Smalltalk
  • Message-passing
    ○ MPI
    ○ PVM
  • Shared memory model
    ○ PRAM (Parallel Random Access Machine) model

slide-119
SLIDE 119
Parallel programming models

Different models for expressing parallelism in programming languages

  • Functional
    ○ Concurrent Haskell
  • Coordination languages
    ○ Linda
  • GPU languages
    ○ CUDA
    ○ OpenCL
    ○ OpenACC
    ○ OpenHMPP (HMPP for Hybrid Multicore Parallel Programming)
  • Actor model --> Actor model in 10 min
    ○ Erlang
    ○ Scala
  • Dataflow
    ○ SISAL (Streams and Iteration in a Single Assignment Language)
  • Distributed
    ○ Sequoia
    ○ Bloom
  • Event-driven and hardware description
    ○ Verilog hardware description language (HDL)
  • CSP-based (Communicating Sequential Processes)
    ○ FortranM
    ○ Occam

slide-120
SLIDE 120

Shared memory programming model

PRAM (Parallel Random Access Machine) model

  • consisting of P deterministic RAM processors (von Neumann machines) working in parallel
  • communication and synchronisation through shared memory

Variations

  • EREW PRAM (Exclusive Read Exclusive Write)
  • CREW PRAM (Concurrent Read Exclusive Write)
  • CRCW PRAM (Concurrent Read Concurrent Write)
    ○ common CRCW PRAM - write succeeds only if all write the same value
    ○ arbitrary CRCW PRAM - an arbitrary process succeeds
    ○ priority CRCW PRAM - with predefined priorities

slide-121
SLIDE 121

HPF (High Performance Fortran)

Partitioned global address space parallel programming model
Fortran90 extension

  • SPMD (Single Program Multiple Data) model
  • each process operates with its own part of the data
  • HPF commands specify which processor gets which part of the data
  • Concurrency is defined by HPF commands based on Fortran90
  • HPF directives as comments: !HPF$ <directive>

HPF example: matrix multiplication:

PROGRAM ABmult
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 100
  INTEGER, DIMENSION (N,N) :: A, B, C
  INTEGER :: i, j
!HPF$ PROCESSORS square(2,2)
!HPF$ DISTRIBUTE (BLOCK,BLOCK) ONTO square :: C
!HPF$ ALIGN A(i,*) WITH C(i,*)
! replicate copies of row A(i,*) onto processors which compute C(i,j)
!HPF$ ALIGN B(*,j) WITH C(*,j)
! replicate copies of column B(*,j) onto processors which compute C(i,j)
  A = 1
  B = 2
  C = 0
  DO i = 1, N
    DO j = 1, N
      ! All the work is local due to ALIGNs
      C(i,j) = DOT_PRODUCT(A(i,:), B(:,j))
    END DO
  END DO
  WRITE(*,*) C
END

  • Owner calculates rule
slide-122
SLIDE 122

HPF (High Performance Fortran)

  • Need to find a balance between concurrency and communication
  • the more processes, the more communication; aim at
    ○ balanced load based on the owner-calculates rule
    ○ data locality
  • Easy to write a program in HPF, but difficult to gain good efficiency
  • Programming in HPF goes more or less like this:
    1. Write a correctly working serial program, test and debug it
    2. Add distribution directives, introducing as little communication as possible

slide-123
SLIDE 123

OpenMP

OpenMP tutorial (https://computing.llnl.gov/tutorials/openMP/)

  • Programming model based on thread parallelism
  • A helper tool for the programmer
  • Based on compiler directives
    ○ C/C++, Fortran*
    ○ PyMP (TBC)
  • Nested parallelism is supported (though not all implementations support it)
  • Dynamic threads
  • OpenMP has become a standard
slide-124
SLIDE 124

OpenMP

  • Fork-join model
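
OpenMP itself is used from C/C++/Fortran, but the fork-join idea can be illustrated in Python with a thread pool: the "master" forks a team of workers for the parallel region and joins them before continuing serially (illustration only, not OpenMP):

# Fork-join illustration with a thread pool (not OpenMP itself).
from concurrent.futures import ThreadPoolExecutor
import numpy as np

a = np.arange(1_000_000, dtype=float)

def partial_sum(chunk):                 # work done by one "thread" of the team
    return chunk.sum()

with ThreadPoolExecutor(max_workers=4) as pool:            # fork: spawn the team
    results = list(pool.map(partial_sum, np.array_split(a, 4)))
total = sum(results)                                        # join: back to serial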
slide-125
SLIDE 125

OpenMP example

/******************************************************************************
* FILE: omp_mm.c   https://computing.llnl.gov/tutorials/openMP/samples/C/omp_mm.c
* DESCRIPTION:
*   OpenMP Example - Matrix Multiply - C Version
*   Demonstrates a matrix multiply using OpenMP. Threads share row iterations
*   according to a predefined chunk size.
* AUTHOR: Blaise Barney
* LAST REVISED: 06/28/05
******************************************************************************/
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NRA 62   /* number of rows in matrix A */
#define NCA 15   /* number of columns in matrix A */
#define NCB 7    /* number of columns in matrix B */

int main (int argc, char *argv[])
{
  int tid, nthreads, i, j, k, chunk;
  double a[NRA][NCA],   /* matrix A to be multiplied */
         b[NCA][NCB],   /* matrix B to be multiplied */
         c[NRA][NCB];   /* result matrix C */

  chunk = 10;           /* set loop iteration chunk size */

  /*** Spawn a parallel region explicitly scoping all variables ***/
  #pragma omp parallel shared(a,b,c,nthreads,chunk) private(tid,i,j,k)
  {
    tid = omp_get_thread_num();
    if (tid == 0)
    {
      nthreads = omp_get_num_threads();
      printf("Starting matrix multiple example with %d threads\n", nthreads);
      printf("Initializing matrices...\n");
    }

    /*** Initialize matrices ***/
    #pragma omp for schedule (static, chunk)
    for (i=0; i<NRA; i++)
      for (j=0; j<NCA; j++)
        a[i][j] = i+j;
    #pragma omp for schedule (static, chunk)
    for (i=0; i<NCA; i++)
      for (j=0; j<NCB; j++)
        b[i][j] = i*j;
    #pragma omp for schedule (static, chunk)
    for (i=0; i<NRA; i++)
      for (j=0; j<NCB; j++)
        c[i][j] = 0;

    /*** Do matrix multiply sharing iterations on outer loop ***/
    /*** Display who does which iterations for demonstration purposes ***/
    printf("Thread %d starting matrix multiply...\n", tid);
    #pragma omp for schedule (static, chunk)
    for (i=0; i<NRA; i++)
    {
      printf("Thread=%d did row=%d\n", tid, i);
      for (j=0; j<NCB; j++)
        for (k=0; k<NCA; k++)
          c[i][j] += a[i][k] * b[k][j];
    }
  }   /*** End of parallel region ***/

  /*** Print results ***/
  printf("******************************************************\n");
  printf("Result Matrix:\n");
  for (i=0; i<NRA; i++)
  {
    for (j=0; j<NCB; j++)
      printf("%6.2f   ", c[i][j]);
    printf("\n");
  }
  printf("******************************************************\n");
  printf("Done.\n");
}

slide-126
SLIDE 126

Message Passing model

When to use MPI

  • Portability and Performance
  • Irregular Data Structures
  • Building Tools for Others
    ○ Libraries
  • Need to manage memory on a per-processor basis

When not to use MPI

  • Regular computation matches HPF/OpenMP
  • Solution (e.g., library) already exists
    ○ A collection (probably a little outdated): http://www.mcs.anl.gov/mpi/libraries.html
  • Require Fault Tolerance
    ○ Sockets
  • Distributed Computing
    ○ CORBA, DCOM, etc.

slide-127
SLIDE 127

MPI (Message Passing Interface) & mpi4py

Outline

  • Basic concept
  • MPI 6 basic routines
  • Point-to-point communication
  • Collective communication
  • MPI reduction commands
  • Further discussions

○ Communicators groups ○ Topologies ○ One-sided communication ○ MPI fault tolerance

  • From messages to higher levels of abstraction in parallel application development

(Covered at Lecture 4)

slide-128
SLIDE 128

MPI main concepts

  • A message-passing library specification
    ○ extended message-passing model
    ○ not a language or compiler specification
    ○ not a specific implementation or product
  • For parallel computers, clusters, and heterogeneous networks
  • Full-featured
  • Designed to provide access to advanced parallel hardware for end users, library writers, and tool developers

MPI main goals

Goals of the MPI standard MPI’s prime goals are:

  • To provide source-code portability
  • To allow efficient implementations

MPI also offers:

  • A great deal of functionality
  • Support for heterogeneous parallel architectures
slide-129
SLIDE 129

MPI History

  • early 80s – various message passing environments
    ○ library for Caltech's N-cube (later commercially available as Express)
    ○ P4 library (Argonne)
    ○ PICL and PVM (Oak Ridge)
    ○ LAM (Ohio Supercomputer Center)
  • Supercomputing 1992 – decision to develop a common standard – MPI
    ○ ISIS – a different way, towards fault tolerance (e.g. business applications)
    ○ MPI – to serve supercomputing
  • MPI-1 standard
    ○ completed in May 1994
    ○ first implementations 1995
      ■ MPICH (based on P4 and Chameleon)
      ■ LAM MPI
      ■ many supercomputer vendors – their own
  • MPI-2 standard
    ○ completed 1998
    ○ first implementations Nov 2002
    ○ parallel IO also defined here
      ■ first in MPI-IO (NASA)
  • MPI-3 – approved by the MPI Forum on September 21, 2012. MPI 3.0 Documents
  • Currently MPI-3.1
slide-130
SLIDE 130

MPI implementations

  • Other

○ P2P-MPI ○ mpiJava API ○ ...

  • Classical (clusters and supercomputers)

○ MPICH ○ LAM MPI ○ OpenMPI ○ Intel MPI ○ HP MPI ○ Microsoft Message Passing Interface

  • Python

○ pyMPI ○ mpi4py ○ Pypar ○ MYMPI

Compiling and running MPI programs

  • Compiling
    ○ typically with something like:
      ■ mpicc -g file.c
      ■ mpif77 -g file.f
      ■ mpif90 -O file.f90
  • Running
    ○ Often, special MPI daemons are started first
    ○ for example: mpirun -np 8 a.out
    ○ On some systems: mpiexec -np 8 a.out
slide-131
SLIDE 131

MPI as standard

Goals of the MPI standard MPI’s prime goals are:

  • To provide source-code portability
  • To allow efficient implementations

MPI also offers:

  • A great deal of functionality
  • Support for heterogeneous parallel architectures

4 basic types of MPI calls

1. Calls used to initialize, manage, and terminate communications
2. Calls used to communicate between pairs of processors
   → point-to-point communication (P2P)
3. Calls used to communicate among groups
   → collective communication
4. Calls to create data types

Communication can be:

  • Blocking (P2P, collective)
  • Nonblocking (P2P)

Bindings for:

  • C
  • C++
  • Java
  • Fortran
  • etc.
slide-132
SLIDE 132

6 basic MPI calls

MPI_Init: initialise MPI
MPI_Comm_size: how many PEs?
MPI_Comm_rank: identify the PE
MPI_Send
MPI_Recv
MPI_Finalize: close MPI

EXAMPLE (fortran90): http://www.ut.ee/~eero/SC/konspekt/Naited/greetings.f90.html
Example (C): https://github.com/wesleykendall/mpitutorial/blob/gh-pages/tutorials/mpi-hello-world/code/mpi_hello_world.c

Full range of MPI calls: http://www.mpich.org/static/docs/latest/

from mpi4py import MPI

comm = MPI.COMM_WORLD            # the default communicator
num_procs = comm.Get_size()      # number of processes
rank = comm.Get_rank()           # rank (pid) of the current process
stat = MPI.Status()

msg = "Hello world, says process %s!" % rank
if rank == 0:                    # Master work
    print msg
    for i in range(num_procs - 1):
        msg = comm.recv(source=i+1, tag=MPI.ANY_TAG, status=stat)
        print msg
else:                            # Worker work
    comm.send(msg, dest=0)

Send, Ssend, Bsend, Rsend - blocking calls
Isend, Issend, Ibsend, Irsend - nonblocking calls

slide-133
SLIDE 133

Non-Blocking Send and Receive, Avoiding Deadlocks

  • Non-blocking communication allows the separation between the initiation of the communication and its completion
  • Advantages: between initiation and completion the program can do other useful computation (latency hiding)
  • Disadvantages: the programmer has to insert code to check for completion
  • Sending objects (pickling underneath)
    ○ Blocking commands: send, recv
    ○ Non-blocking commands: isend, irecv
      ■ Return a request object to be able to check the message status
  • Sending contiguous memory contents:
    ○ Blocking commands: Send, Recv
    ○ Non-blocking commands: Isend, Irecv
      ■ Return a request object to be able to check the message status

from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = {'a': 7, 'b': 3.14}
    req = comm.isend(data, dest=1, tag=11)   # non-blocking
    req.wait()
elif rank == 1:
    req = comm.irecv(source=0, tag=11)       # non-blocking
    data = req.wait()

from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = {'a': 7, 'b': 3.14}
    comm.send(data, dest=1, tag=11)          # blocking
elif rank == 1:
    data = comm.recv(source=0, tag=11)       # blocking

slide-134
SLIDE 134

Unidirectional communication between processors

from mpi4py import MPI
import numpy as np

size = MPI.COMM_WORLD.size
rank = MPI.COMM_WORLD.rank
comm = MPI.COMM_WORLD
len = 100
data = np.arange(len, dtype=np.float32)   # dtype must match MPI.FLOAT below (or similar)
tag = 99
...
# 1. Blocking send and blocking receive
if rank == 0:
    print "[0] Sending: ", data
    comm.Send([data, MPI.FLOAT], 1, tag)
elif rank == 1:
    print "[1] Receiving..."
    comm.Recv([data, MPI.FLOAT], 0, tag)
    print "[1] Data: ", data

# 2. Non-blocking send and blocking receive
if rank == 0:
    print "[0] Sending: ", data
    request = comm.Isend([data, MPI.FLOAT], 1, tag)
    ...   # calculate or do something useful...
    request.Wait()
elif rank == 1:
    print "[1] Receiving..."
    comm.Recv([data, MPI.FLOAT], 0, tag)
    print "[1] Data: ", data

# 3. Blocking send and non-blocking receive
if rank == 0:
    print "[0] Sending: ", data
    comm.Send([data, MPI.FLOAT], 1, tag)
elif rank == 1:
    print "[1] Receiving..."
    request = comm.Irecv([data, MPI.FLOAT], 0, tag)
    ...   # calculate or do something useful...
    request.Wait()
    print "[1] Data: ", data

slide-135
SLIDE 135

# 4. Non-blocking send and non-blocking receive
if rank == 0:
    print "[0] Sending: ", data
    request = comm.Isend([data, MPI.FLOAT], 1, tag)
    ...   # calculate or do something useful...
    request.Wait()
elif rank == 1:
    print "[1] Receiving..."
    request = comm.Irecv([data, MPI.FLOAT], 0, tag)
    ...   # calculate or do something useful...
    request.Wait()
    print "[1] Data: ", data

Unidirectional communication between processors

Wildcards:

  • ANY_SOURCE
  • ANY_TAG

Can be used in *recv calls

slide-136
SLIDE 136

Possibilities for checking received message’s details

#probe.py
from mpi4py import MPI
import numpy
comm = MPI.COMM_WORLD
nproc = comm.Get_size()
myid = comm.Get_rank()
if myid == 0:
    data = myid*numpy.ones(5, dtype=numpy.float64)
    comm.Send([data, 3, MPI.DOUBLE], dest=1, tag=1)
if myid == 1:
    info = MPI.Status()
    comm.Probe(MPI.ANY_SOURCE, MPI.ANY_TAG, info)
    count = info.Get_elements(MPI.DOUBLE)
    data = numpy.empty(count, dtype=numpy.float64)
    comm.Recv(data, MPI.ANY_SOURCE, MPI.ANY_TAG, info)
    print 'on', myid, 'data: ', data

#status.py
from mpi4py import MPI
import numpy
comm = MPI.COMM_WORLD
nproc = comm.Get_size()
myid = comm.Get_rank()
data = myid*numpy.ones(5, dtype=numpy.float64)
if myid == 0:
    comm.Send([data, 3, MPI.DOUBLE], dest=1, tag=1)
if myid == 1:
    info = MPI.Status()
    comm.Recv(data, MPI.ANY_SOURCE, MPI.ANY_TAG, info)
    source = info.Get_source()
    tag = info.Get_tag()
    count = info.Get_elements(MPI.DOUBLE)
    size = info.Get_count()
    print 'on', myid, 'source, tag, count, size is', source, tag, count, size

slide-137
SLIDE 137

Mutual communication and avoiding deadlocks

Non-blocking operations can also be used for avoiding deadlocks. A deadlock is a situation where processes wait for each other without any of them being able to do anything useful. Deadlocks can occur:

  • because of a wrong order of send and receive
  • because the system send-buffer fills up

In case of mutual communication there are 3 possibilities:
1. Both processes start with send followed by receive
2. Both processes start with receive followed by send
3. One process starts with send followed by receive, the other vice versa

Depending on blocking there are different possibilities:

slide-138
SLIDE 138

# 1.1 Send followed by receive (vers. 2)
if rank == 0:
    request = comm.Isend([sendbuf, MPI.FLOAT], 1, tag)
    comm.Recv([recvbuf, MPI.FLOAT], 1, tag)
    request.Wait()
elif rank == 1:
    request = comm.Isend([sendbuf, MPI.FLOAT], 0, tag)
    comm.Recv([recvbuf, MPI.FLOAT], 0, tag)
    request.Wait()

Is this deadlock-free?

  • It is!

Why can't Wait() follow right after Isend(...)?

Mutual communication and avoiding deadlocks

# 1. Send followed by receive (vers. 1)
if rank == 0:
    comm.Send([sendbuf, MPI.FLOAT], 1, tag)
    comm.Recv([recvbuf, MPI.FLOAT], 1, tag)
elif rank == 1:
    comm.Send([sendbuf, MPI.FLOAT], 0, tag)
    comm.Recv([recvbuf, MPI.FLOAT], 0, tag)

Is this OK?

  • OK with small messages only (if sendbuf is smaller than the system message send-buffer)

But what about large messages?

  • Large messages produce Deadlock!
slide-139
SLIDE 139

# 2. Receive followed by send (version 2)
if rank == 0:
    request = comm.Irecv([recvbuf, MPI.FLOAT], 1, tag)
    comm.Send([sendbuf, MPI.FLOAT], 1, tag)
    request.Wait()
elif rank == 1:
    request = comm.Irecv([recvbuf, MPI.FLOAT], 0, tag)
    comm.Send([sendbuf, MPI.FLOAT], 0, tag)
    request.Wait()

… deadlock-free?

  • Yes, no deadlock!

# 2. Receive followed by send (version 1)
if rank == 0:
    comm.Recv([recvbuf, MPI.FLOAT], 1, tag)
    comm.Send([sendbuf, MPI.FLOAT], 1, tag)
elif rank == 1:
    comm.Recv([recvbuf, MPI.FLOAT], 0, tag)
    comm.Send([sendbuf, MPI.FLOAT], 0, tag)

… Is this OK?

  • No, it is not!

○ Produces a deadlock for any message buffer size

Mutual communication and avoiding deadlocks

slide-140
SLIDE 140

Mutual communication and avoiding deadlocks

# Generally, the following communication pattern is advised:
if rank == 0:
    req1 = comm.Isend([sendbuf, MPI.FLOAT], 1, tag)
    req2 = comm.Irecv([recvbuf, MPI.FLOAT], 1, tag)
else:
    req1 = comm.Isend([sendbuf, MPI.FLOAT], 0, tag)
    req2 = comm.Irecv([recvbuf, MPI.FLOAT], 0, tag)
req1.Wait()
req2.Wait()

# 3. One starts with send, the other one with receive
if rank == 0:
    comm.Send([sendbuf, MPI.FLOAT], 1, tag)
    comm.Recv([recvbuf, MPI.FLOAT], 1, tag)
else:
    comm.Recv([recvbuf, MPI.FLOAT], 0, tag)
    comm.Send([sendbuf, MPI.FLOAT], 0, tag)

… Could we use non-blocking commands instead?

  • (Non-blocking commands can be used for any of these calls as well)

Alternatively, use comm.Sendrecv

Docstring:
Comm.Sendrecv(self, sendbuf, int dest, int sendtag=0, recvbuf=None,
              int source=ANY_SOURCE, int recvtag=ANY_TAG, Status status=None)

Send and receive a message.

.. note:: This function is guaranteed not to deadlock in situations where pairs of blocking sends and receives may deadlock.

.. caution:: A common mistake when using this function is to mismatch the tags with the source and destination ranks, which can result in deadlock.

Type: builtin_function_or_method
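
A minimal mpi4py sketch of the two-process exchange done with Sendrecv (buffer names and the two-process assumption are illustrative):

# Two-process exchange with Sendrecv (deadlock-safe pairing of send + receive).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
other = 1 - rank                       # assumes exactly 2 processes

sendbuf = np.full(10, rank, dtype='d')
recvbuf = np.empty(10, dtype='d')
comm.Sendrecv(sendbuf, dest=other, sendtag=0,
              recvbuf=recvbuf, source=other, recvtag=0)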

slide-141
SLIDE 141

Collective communication

from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = {'key1': [7, 2.72, 2+3j], 'key2': ('abc', 'xyz')}
else:
    data = None
data = comm.bcast(data, root=0)

from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = np.arange(100, dtype='i')
else:
    data = np.empty(100, dtype='i')
comm.Bcast(data, root=0)
for i in range(100):
    assert data[i] == i

Collective communication

  • Always blocking
  • All participating processes must make the call!
  • Inherently deadlock-free

comm.barrier()

(Diagram: comm.Bcast(...) — A0 stored on p0 is copied to all processes p0 ... p4.)

slide-142
SLIDE 142

Collective communication

(Diagram: comm.Scatter(...) — p0 holds A0 ... A4; after the call process pi holds Ai.)

from mpi4py import MPI
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
data = (rank+1)**2
data = comm.gather(data, root=0)
if rank == 0:
    for i in range(size):
        assert data[i] == (i+1)**2
else:
    assert data is None

from mpi4py import MPI
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
if rank == 0:
    data = [(i+1)**2 for i in range(size)]
else:
    data = None
data = comm.scatter(data, root=0)
assert data == (rank+1)**2

comm.Gather(...)

from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
sendbuf = None
if rank == 0:
    sendbuf = np.empty([size, 100], dtype='i')
    sendbuf.T[:, :] = range(size)
recvbuf = np.empty(100, dtype='i')
comm.Scatter(sendbuf, recvbuf, root=0)
assert np.allclose(recvbuf, rank)

slide-143
SLIDE 143

Collective communication

(Diagrams: comm.Allgather(...) — each pi contributes Ai and every process ends up with the full set A0 ... A4; comm.Alltoall(...) — process pi sends its j-th block to pj, i.e. the block matrix is transposed across processes.)

slide-144
SLIDE 144

Collective communication

(Diagrams: comm.Scatterv(...) distributes variable-sized pieces from p0 to the processes; comm.Gatherv(...) collects variable-sized pieces back onto p0.)
https://www.programcreek.com/python/example/89115/mpi4py.MPI.DOUBLE

  • Available for numpy arrays only
slide-145
SLIDE 145

Collective communication

(Diagram: comm.Allgatherv(...) — variable-sized pieces from all processes are gathered and replicated on every process.)
https://www.programcreek.com/python/example/89115/mpi4py.MPI.DOUBLE

  • Available for numpy arrays only
slide-146
SLIDE 146

Collective communication

(Diagram: comm.Reduce(..., op=MPI.SUM, ...) — the values 1, 2, 4, 8, 16 held on p0 ... p4 are summed to 31 on the root p0.)

comm.Reduce(sendbuf, recvbuf, op=MPI.SUM, root=0)
comm.reduce(sendobj=None, recvobj=None, op=MPI.SUM, root=0)

a_size = 3
recvdata = numpy.zeros(a_size, dtype=numpy.int)
senddata = (rank+1)*numpy.arange(a_size, dtype=numpy.int)
comm.Reduce(senddata, recvdata, root=0, op=MPI.PROD)
print 'on task', rank, 'after Reduce:    data = ', recvdata
comm.Allreduce(senddata, recvdata)
print 'on task', rank, 'after Allreduce: data = ', recvdata

https://info.gwdg.de/~ceulig/docs-dev/doku.php?id=en:services:application_services:high_performance_computing:mpi4py


slide-147
SLIDE 147

Collective communication

(Diagram: comm.Allreduce(..., op=MPI.SUM, ...) — the values 1, 2, 4, 8, 16 held on p0 ... p4 are summed to 31 on every process.)

comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)
comm.allreduce(sendobj=None, recvobj=None, op=MPI.SUM)

a_size = 3
recvdata = numpy.zeros(a_size, dtype=numpy.int)
senddata = (rank+1)*numpy.arange(a_size, dtype=numpy.int)
comm.Reduce(senddata, recvdata, root=0, op=MPI.PROD)
print 'on task', rank, 'after Reduce:    data = ', recvdata
comm.Allreduce(senddata, recvdata)
print 'on task', rank, 'after Allreduce: data = ', recvdata

https://info.gwdg.de/~ceulig/docs-dev/doku.php?id=en:services:application_services:high_performance_computing:mpi4py


slide-148
SLIDE 148

Reduction operations

MPI.MIN     minimum
MPI.MAX     maximum
MPI.SUM     sum
MPI.PROD    product
MPI.LAND    logical and
MPI.BAND    bitwise and
MPI.LOR     logical or
MPI.BOR     bitwise or
MPI.LXOR    logical xor
MPI.BXOR    bitwise xor
MPI.MAXLOC  max value and location
MPI.MINLOC  min value and location

Collective communication

slide-149
SLIDE 149

Further discussion

MPI derived datatypes
MPI packing
Communicators & process groups
Intra- and Intercommunicators
Processor topologies
One-sided communication
Library creation support
Fault-tolerance issues
MPI-IO

MPI's main abstraction for the programmer:

  • → Message

○ Powerful ○ Low-level ■ Is it too low level?

Secure MPI libraries

  • FT-MPI

‘ MPI programming is difficult. It is “schizophrenic programming” in that you are writing a single programming with multiple threads of execution “many voices in one head”. ‘ link

In-class reading-exercise 4:

  • Jonathan Dursi, HPC is dying, and MPI is killing it

○ Read it, choose one of the technologies mentioned in the article, and post a review of its main points to Piazza