SLIDE 1

Technische Universität München

Parallel Programming and High-Performance Computing

Part 1: Introduction

  • Dr. Ralf-Peter Mundani

CeSIM / IGSSE / CiE Technische Universität München

SLIDE 2

General Remarks

  • Ralf-Peter Mundani

email mundani@tum.de, phone 289-25057, room 3181 (city centre); consultation hour: by appointment

  • Atanas Atanasov

email atanasoa@in.tum.de, phone 289-18615, room 02.05.036

  • lecture (2 SWS)

weekly Tuesday, 14:00—15:30, room 02.07.23

  • exercise (1 SWS)

fortnightly Wednesday, 08:30—10:00, room 02.07.23

  • materials: http://www5.in.tum.de/
SLIDE 3

General Remarks

  • content

part 1: introduction
part 2: high-performance networks
part 3: foundations
part 4: programming memory-coupled systems
part 5: programming message-coupled systems
part 6: dynamic load balancing
part 7: examples of parallel algorithms

SLIDE 4

Overview

  • motivation
  • hardware excursion
  • supercomputers
  • classification of parallel computers
  • levels of parallelism
  • quantitative performance evaluation

I think there is a world market for maybe five computers. —Thomas Watson, chairman of IBM, 1943

SLIDE 5

Motivation

  • numerical simulation: from phenomena to predictions

starting point: a physical phenomenon or a technical process

  • 1. modelling

determination of parameters, expression of relations

  • 2. numerical treatment

model discretisation, algorithm development

  • 3. implementation

software development, parallelisation

  • 4. visualisation

illustration of abstract simulation results

  • 5. validation

comparison of results with reality

  • 6. embedding

insertion into the working process

(diagram: the six steps span mathematics, computer science, and the application discipline)

SLIDE 6

Motivation

  • why numerical simulation?

because experiments are sometimes impossible (life cycle of galaxies, weather forecast, terror attacks, e. g.)
because experiments are sometimes not welcome (avalanches, nuclear tests, medicine, e. g.)

(picture: bomb attack on WTC, 1993)

SLIDE 7

Motivation

  • why numerical simulation? (cont’d)

because experiments are sometimes very costly and time-consuming (protein folding, material sciences, e. g.)
because experiments are sometimes more expensive than a simulation (aerodynamics, crash tests, e. g.)

(picture: Mississippi basin model, Jackson, MS)

SLIDE 8

Motivation

  • why parallel programming and HPC?

complex problems (especially the so-called “grand challenges”) demand more computing power
  climate or geophysics simulation (tsunami, e. g.)
  structure or flow simulation (crash test, e. g.)
  development systems (CAD, e. g.)
  large data analysis (Large Hadron Collider at CERN, e. g.)
  military applications (cryptanalysis, e. g.)
  …
performance increase due to
  faster hardware, more memory (“work harder”)
  more efficient algorithms, optimisation (“work smarter”)
  parallel computing (“get some help”)

SLIDE 9

Motivation

  • objectives (in case all resources were available N times)

throughput: compute N problems simultaneously
  running N instances of a sequential program with different data sets (“embarrassing parallelism”); SETI@home, e. g.
  drawback: limited resources of single nodes
response time: compute one problem in a fraction (1/N) of the time
  running one instance (i. e. N processes) of a parallel program for jointly solving one problem; finding prime numbers, e. g.
  drawback: writing a parallel program; communication
problem size: compute one problem with N-times larger data
  running one instance (i. e. N processes) of a parallel program, using the sum of all local memories for computing larger problem sizes; iterative solution of SLE, e. g.
  drawback: writing a parallel program; communication

SLIDE 10

Overview

  • motivation
  • hardware excursion
  • supercomputers
  • classification of parallel computers
  • levels of parallelism
  • quantitative performance evaluation
SLIDE 11

Hardware Excursion

  • definition of parallel computers

“A collection of processing elements that communicate and cooperate to solve large problems” (ALMASI and GOTTLIEB, 1989)

  • possible appearances of such processing elements

specialised units (steps of a vector pipeline, e. g.)
parallel features in modern monoprocessors (instruction pipelining, superscalar architectures, VLIW, multithreading, multicore, …)
several uniform arithmetical units (processing elements of array computers or GPUs, e. g.)
complete stand-alone computers connected via LAN (workstation or PC clusters, so-called virtual parallel computers)
parallel computers or clusters connected via WAN (so-called metacomputers)

SLIDE 12

Hardware Excursion

  • reminder: arithmetic logical unit (ALU)

schematic layout of the (classical 32-bit) arithmetic logical unit

(diagram: registers A and B feed the ALU, the result is written to register C; the registers are connected to main memory via a 32-bit data bus)

C ← A ⊗ B with arithmetic operation ⊗

SLIDE 13

Hardware Excursion

  • reminder: memory hierarchy

register → cache → main memory → background memory → archive memory
(access granularity ranges from single access via block and page access to serial access; capacity increases and access speed decreases towards the bottom of the hierarchy)

SLIDE 14

Hardware Excursion

  • instruction pipelining
  • instruction execution involves several operations

1. instruction fetch (IF)
2. decode (DE)
3. fetch operands (OP)
4. execute (EX)
5. write back (WB)

which are executed successively

  • hence, only one part of the CPU works at a given moment

(diagram: instruction N passes through IF DE OP EX WB before instruction N+1 starts)

SLIDE 15

Hardware Excursion

  • instruction pipelining (cont‘d)
  • observation: while a particular stage of an instruction is processed, the other stages are idle

hence, multiple instructions are overlapped in execution: instruction pipelining (similar to assembly lines)
advantage: no additional hardware necessary

(diagram: instructions N to N+4, each shifted by one stage, so that the IF DE OP EX WB phases of successive instructions overlap in time)

SLIDE 16

Hardware Excursion

  • superscalar

CPU (containing several ALUs) might execute several instructions in parallel (with static or dynamic (i. e. out-of-order execution) scheduling)

(diagram: instructions N to N+9 issued two at a time, i. e. two IF DE OP EX WB pipelines running side by side over time)

SLIDE 17

Hardware Excursion

  • very long instruction word (VLIW)
  • in contrast to superscalar architectures, the compiler groups parallel executable instructions during compilation (pipelining still possible)

  • advantage: no additional hardware logic necessary
  • drawback: not always fully usable (→ dummy filling with NOPs)

(diagram: one VLIW instruction word bundling instr. 1 to instr. 4, all operating on a common register set)

SLIDE 18

Hardware Excursion

  • dual core, quad core, manycore, and multicore
  • observation: increasing frequency (and thus core voltage) over the past years

problem: thermal power dissipation P ∼ f⋅v² (f: frequency; v: core voltage)

SLIDE 19

Hardware Excursion

  • dual core, quad core, manycore, and multicore (cont’d)

25% reduction in performance (i. e. core voltage) leads to approx. 50% reduction in dissipation

(chart: dissipation and performance of a normal CPU vs. a voltage-reduced CPU)
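a short worked calculation behind these numbers (using P ∼ f⋅v² from the previous slide): reducing the core voltage v by 25% gives

  P′ ∼ f⋅(0.75⋅v)² ≈ 0.56⋅f⋅v²

i. e. roughly half the dissipation at (nearly) constant frequency; if the frequency f has to be lowered proportionally to v as well, the reduction is even larger (0.75³ ≈ 0.42)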

SLIDE 20

Hardware Excursion

  • dual core, quad core, manycore, and multicore (cont’d)

idea: installation of two cores per die with same dissipation as single core system

(chart: dissipation and performance of a single-core system vs. a dual-core system with the same overall dissipation)
SLIDE 21

Hardware Excursion

  • dual core, quad core, manycore, and multicore (cont’d)

single vs. dual / quad core (diagram):
  single core: core 0 with private L1 and L2, one FSB
  dual core: cores 0 and 1, each with a private L1, sharing one L2, one FSB
  quad core: cores 0 to 3, each with a private L1, one shared L2 per pair of cores, one FSB

FSB: front side bus (i. e. connection to memory (via north bridge))

SLIDE 22

Hardware Excursion

  • INTEL Nehalem Core i7

(diagram: cores 0 to 3, each with private L1+L2, one shared L3, and the QPI link; source: www.samrathacks.com)

QPI: QuickPath Interconnect replaces FSB (QPI is a point-to-point interconnection – with a memory controller now on-die – in order to allow both reduced latency and higher bandwidth up to (theoretically) 25.6 GByte/s data transfer, i. e. 2× FSB)

SLIDE 23

Hardware Excursion

(figure only; source: www.computerbase.de)

SLIDE 24

Overview

  • motivation
  • hardware excursion
  • supercomputers
  • classification of parallel computers
  • levels of parallelism
  • quantitative performance evaluation
SLIDE 25

Supercomputers

  • arrival of clusters

in the late eighties, PCs became a commodity market with rapidly increasing performance, mass production, and decreasing prices
growing attractiveness for parallel computers
1994: Beowulf, the first parallel computer built completely out of commodity hardware
  NASA Goddard Space Flight Centre
  16 Intel DX4 processors
  multiple 10 Mbit Ethernet links
  Linux with GNU compilers
  MPI library
1996: Beowulf cluster performing more than 1 GFlops
1997: a 140-node cluster performing more than 10 GFlops

SLIDE 26

Supercomputers

  • supercomputers

supercomputing or high-performance scientific computing as the most important application of the big number crunchers
national initiatives due to huge budget requirements
Accelerated Strategic Computing Initiative (ASCI) in the U.S.
  in the sequel of the nuclear testing moratorium in 1992/93
  decision: develop, build, and install a series of five supercomputers of up to $100 million each in the U.S.
  start: ASCI Red (1997, Intel-based, Sandia National Laboratory, the world’s first TFlops computer)
  then: ASCI Blue Pacific (1998, LLNL), ASCI Blue Mountain, ASCI White, …
  meanwhile a new high-end computing memorandum (2004)

SLIDE 27

Supercomputers

  • supercomputers (cont’d)

federal “Bundeshöchstleistungsrechner” initiative in Germany
  decision in the mid-nineties
  three federal supercomputing centres in Germany (Munich, Stuttgart, and Jülich)
  one new installation every second year (i. e. a six-year upgrade cycle for each centre)
  the newest one to be among the top 10 of the world

  • overview and state of the art: Top500 list (updated every six months), see http://www.top500.org

  • finally (a somewhat different definition)

Supercomputer: Turns CPU-bound problems into I/O-bound problems. —Ken Batcher

SLIDE 28

Supercomputers

  • MOORE’s law
  • observation of Intel co-founder Gordon E. MOORE, describing an important trend in the history of computer hardware (1965)

the number of transistors that can be placed on an integrated circuit increases exponentially, doubling approximately every two years

SLIDES 29–36

Supercomputers

  • some numbers: Top500 (charts only; see http://www.top500.org)

SLIDE 37

Supercomputers

  • The Earth Simulator – world’s #1 from 2002–04

installed in 2002 in Yokohama, Japan; ES building (approx. 50 m × 65 m × 17 m)
based on the NEC SX-6 architecture, developed by three governmental agencies
highly parallel vector supercomputer
consists of 640 nodes (plus 2 control and 128 data-switching nodes), each with 8 vector processors (8 GFlops each) and 16 GB shared memory
5120 processors (40.96 TFlops peak performance) and 10 TB memory; 35.86 TFlops sustained performance (Linpack)
nodes connected by a 640×640 single-stage crossbar (83,200 cables with a total extension of 2400 km; 8 TBps total bandwidth)
further 700 TB disc space and 1.60 PB mass storage

SLIDE 38

Supercomputers

  • BlueGene/L – world’s #1 from 2004–08

installed in 2005 at LLNL, CA, USA (beta system in 2004 at IBM)
cooperation of DoE, LLNL, and IBM
massively parallel supercomputer
consists of 65,536 nodes (plus 12 front-end and 1024 I/O nodes), each with 2 PowerPC 440d processors (2.8 GFlops each) and 512 MB memory
131,072 processors (367.00 TFlops peak performance) and 33.50 TB memory; 280.60 TFlops sustained performance (Linpack)
nodes configured as a 3D torus (32 × 32 × 64); global reduction tree for fast operations (global max / sum) in a few microseconds
1024 Gbps link to the global parallel file system
further 806 TB disc space; operating system SuSE SLES 9

SLIDE 39

Supercomputers

  • Roadrunner – world’s #1 from 2008–09

installed in 2008 at LANL, NM, USA; installation costs about $120 million
first “hybrid” supercomputer: dual-core Opterons plus Cell Broadband Engines
129,600 cores (1456.70 TFlops peak performance) and 98 TB memory; 1144.00 TFlops sustained performance (Linpack)
standard processing (file system I/O, e. g.) is handled by the Opterons, while mathematically and CPU-intensive tasks are handled by the Cells
2.35 MW power consumption (→ 437 MFlops per watt)
primary usage: ensuring safety and reliability of the nation’s nuclear weapons stockpile; real-time applications (cause and effect in capital markets, renderings of bone structures and tissues as patients are being examined, e. g.)

SLIDE 40

Supercomputers

  • Jaguar – world’s #1 since 2009

installed in 2009 at ORNL, TN, USA
each compute node contains two hex-core Opterons (2.6 GHz, 10.4 GFlops) and 16 GB memory
224,162 cores (2331.00 TFlops peak performance); 1759.00 TFlops sustained performance (Linpack)

SLIDE 41

Supercomputers

  • HLRB II (world’s #6 for 04/2006)

installed in 2006 at LRZ, Garching
installation costs 38 M€; monthly costs approx. 400,000 €
upgrade in 2007 (finished)

  • one of Germany’s three supercomputers

SGI Altix 4700
consists of 19 nodes (SGI NUMAlink 2D torus) with 256 blades each (ccNUMA link with partition fat tree)
Intel Itanium2 Montecito Dual Core (12.80 GFlops); 4 GB memory per core
9728 processor cores (62.30 TFlops peak performance) and 39 TB memory; 56.50 TFlops sustained performance (Linpack)
footprint 24 m × 12 m; total weight 103 metric tons

SLIDE 42

Overview

  • motivation
  • hardware excursion
  • supercomputers
  • classification of parallel computers
  • levels of parallelism
  • quantitative performance evaluation
SLIDE 43

Classification of Parallel Computers

  • standard classification according to FLYNN

global data and instruction streams as criterion
  instruction stream: sequence of commands to be executed
  data stream: sequence of data subject to instruction streams
two-dimensional subdivision according to
  the amount of instructions a computer can execute per time
  the amount of data elements a computer can process per time
hence, FLYNN distinguishes four classes of architectures
  SISD: single instruction, single data
  SIMD: single instruction, multiple data
  MISD: multiple instruction, single data
  MIMD: multiple instruction, multiple data
drawback: very different computers may belong to the same class

SLIDE 44

Classification of Parallel Computers

  • standard classification according to FLYNN (cont’d)

SISD

one processing unit that has access to one data memory and to one program memory
classical monoprocessor following VON NEUMANN’s principle

(diagram: one processor connected to one program memory and one data memory)

SLIDE 45

Classification of Parallel Computers

  • standard classification according to FLYNN (cont’d)

SIMD
several processing units, each with separate access to a (shared or distributed) data memory; one program memory
synchronous execution of instructions
example: array computer, vector computer
advantage: easy programming model due to a control flow with strict synchronous-parallel execution of all instructions
drawback: specialised hardware necessary, easily becomes outdated due to recent developments at the commodity market

(diagram: several processors and one program memory; one data memory per processor)

SLIDE 46

Classification of Parallel Computers

  • standard classification according to FLYNN (cont’d)

MISD
several processing units that have access to one data memory; several program memories
not a very popular class (mainly for special applications such as digital signal processing)
operating on a single stream of data, forwarding results from one processing unit to the next
example: systolic array (network of primitive processing elements that “pump” data)

(diagram: several processors, each with its own program memory, sharing one data memory)

SLIDE 47

Classification of Parallel Computers

  • standard classification according to FLYNN (cont’d)

MIMD
several processing units, each with separate access to a (shared or distributed) data memory; several program memories
classification according to the (physical) memory organisation
  shared memory → shared (global) address space
  distributed memory → distributed (local) address space
example: multiprocessor systems, networks of computers

(diagram: several processors, each with its own program memory and data memory)

SLIDE 48

Classification of Parallel Computers

  • processor coupling

cooperation of processors / computers as well as their shared use of various resources require communication and synchronisation
the following types of processor coupling can be distinguished
  memory-coupled multiprocessor systems (MemMS)
  message-coupled multiprocessor systems (MesMS)

                        distributed address space   shared address space
  distributed memory    MesMS                       Mem-MesMS (hybrid)
  global memory         ∅                           MemMS, SMP

SLIDE 49

Classification of Parallel Computers

  • processor coupling (cont’d)

central issues
  scalability: costs for adding new nodes / processors
  programming model: costs for writing parallel programs
  portability: costs for porting (migration), i. e. transfer from one system to another while preserving executability and flexibility
  load distribution: costs for obtaining a uniform load distribution among all nodes / processors
MemMS are advantageous concerning scalability; MesMS are typically better concerning the rest
hence, a combination of MemMS and MesMS for exploiting all advantages
  distributed / virtual shared memory (DSM / VSM)
  physically distributed memory with a global shared address space

SLIDE 50

Classification of Parallel Computers

  • processor coupling (cont’d)

uniform memory access (UMA)
each processor P has direct access via the network to each memory module M, with the same access times to all data
the standard programming model can be used (i. e. no explicit send / receive of messages necessary)
communication and synchronisation via shared variables (inconsistencies (write conflicts, e. g.) have in general to be prevented by the programmer)

(diagram: processors P and memory modules M connected by a network)
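to make the shared-variable model concrete, a minimal sketch in C with OpenMP (one possible realisation of this programming model; compiled with gcc -fopenmp, e. g.); the critical section is what prevents the write conflict mentioned above:

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
      double sum = 0.0;                /* shared variable, visible to all threads */
      #pragma omp parallel for
      for (int i = 1; i <= 1000; i++) {
          #pragma omp critical         /* prevents concurrent writes to sum */
          sum += 1.0 / i;
      }
      printf("sum = %f\n", sum);
      return 0;
  }

without the critical section, the concurrent updates of sum would conflict and the result would be wrong, exactly the inconsistency the programmer has to prevent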

SLIDE 51

Classification of Parallel Computers

  • processor coupling (cont’d)

symmetric multiprocessor (SMP)

only a small number of processors, in most cases a central bus, one address space (UMA), but bad scalability
cache coherence implemented in hardware (i. e. a read always provides a variable’s value from its last write)
example: double or quad boards, SGI Challenge

(diagram: processors P, each with a cache C, connected via a central bus to the memory M)

SLIDE 52

Classification of Parallel Computers

  • processor coupling (cont’d)

non-uniform memory access (NUMA)
memory modules physically distributed among the processors
shared address space, but access times depend on the location of the data (i. e. local addresses are faster than remote ones)
differences in access times are visible in the program
example: DSM / VSM, Cray T3E

(diagram: processor–memory pairs connected by a network)

SLIDE 53

Classification of Parallel Computers

  • processor coupling (cont’d)

cache-coherent non-uniform memory access (ccNUMA)
caches for local and remote addresses; cache coherence implemented in hardware for the entire address space
problem with scalability due to frequent cache updates
example: SGI Origin 2000

(diagram: processor–cache–memory triples connected by a network)

SLIDE 54

Classification of Parallel Computers

  • processor coupling (cont’d)

cache-only memory access (COMA)
each processor has only cache memory
the entirety of all cache memories = global shared memory
cache coherence implemented in hardware
example: Kendall Square Research KSR-1

(diagram: processor–cache pairs connected by a network)

SLIDE 55

Classification of Parallel Computers

  • processor coupling (cont’d)

no remote memory access (NORMA)
each processor has direct access to its local memory only
access to remote memory is only possible via explicit message exchange (due to the distributed address space)
synchronisation implicitly via the exchange of messages
performance improvement between memory and I/O possible due to parallel data transfer (Direct Memory Access, e. g.)
example: IBM SP2, ASCI Red / Blue / White

(diagram: processor–memory pairs connected by a network)
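a minimal sketch in C with MPI of such an explicit message exchange (the transferred value and the message tag are invented for illustration; run with two processes, mpirun -np 2, e. g.):

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv) {
      int rank, value;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) {            /* process 0 owns the data in its local memory */
          value = 42;
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {     /* process 1 has to receive it explicitly */
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("process 1 received %d\n", value);
      }
      MPI_Finalize();
      return 0;
  }

note that the receive also synchronises the two processes implicitly, as stated above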

SLIDE 56

Overview

  • motivation
  • hardware excursion
  • supercomputers
  • classification of parallel computers
  • levels of parallelism
  • quantitative performance evaluation
SLIDE 57

Levels of Parallelism

  • the suitability of a parallel architecture for a given parallel program strongly depends on the granularity of parallelism

  • some remarks on granularity

quantitative meaning: ratio of computational effort to communication / synchronisation effort (≈ amount of instructions between two necessary communication / synchronisation steps)
qualitative meaning: level on which work is done in parallel

program level → process level → block level → instruction level → sub-instruction level (from coarse-grain to fine-grain parallelism)

SLIDE 58

Levels of Parallelism

  • program level

parallel processing of different programs
independent units without any shared data
no or only a small amount of communication / synchronisation
organised by the OS

  • process level

a program is subdivided into processes to be executed in parallel
each process consists of a larger amount of sequential instructions and has a private address space
synchronisation necessary (in case all processes belong to one program)
communication in most cases necessary (data exchange, e. g.)
support by the OS via routines for process management, process communication, and process synchronisation
a process is often referred to as a heavy-weight process

SLIDE 59

Levels of Parallelism

  • block level

blocks of instructions are executed in parallel
each block consists of a smaller amount of instructions and shares the address space with other blocks
communication via shared variables; synchronisation mechanisms
a block is often referred to as a light-weight process (thread)

  • instruction level

parallel execution of machine instructions
optimising compilers can increase this potential by modifying the order of commands (better exploitation of superscalar architectures and pipelining mechanisms)

  • sub-instruction level

instructions are further subdivided in units to be executed in parallel or via overlapping (vector operations, e. g.)

SLIDE 60

Levels of Parallelism

  • difference between processes and threads

(diagram: in the process model, several processes, each a program (*.exe, e. g.) with a private address space, exchange messages; in the thread model, several threads share one address space within a single program)
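a minimal sketch of the thread model in C with POSIX threads (the counter and the choice of four threads are invented for illustration; compiled with gcc -pthread, e. g.):

  #include <stdio.h>
  #include <pthread.h>

  int shared = 0;                               /* lives in the common address space */
  pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

  void *worker(void *arg) {
      pthread_mutex_lock(&lock);                /* threads synchronise via the mutex */
      shared++;
      pthread_mutex_unlock(&lock);
      return NULL;
  }

  int main(void) {
      pthread_t t[4];
      for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
      for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
      printf("shared = %d\n", shared);          /* 4: all threads saw the same variable */
      return 0;
  }

in the process model, each of the four workers would get a private copy of shared and would have to exchange messages instead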

SLIDE 61

Overview

  • motivation
  • hardware excursion
  • supercomputers
  • classification of parallel computers
  • levels of parallelism
  • quantitative performance evaluation
SLIDE 62

Quantitative Performance Evaluation

  • execution time

time T of a parallel program: from the start of the execution on one processor to the end of all computations on the last processor
during execution, all processors are in one of the following states
  compute (TCOMP): time spent for computations
  communicate (TCOMM): time spent for send and receive operations
  idle (TIDLE): time spent waiting (for messages to be sent / received)
hence T = TCOMP + TCOMM + TIDLE

  • communication–computation ratio (CCR)

important quantity measuring the success of a parallelisation
ratio of pure communication time to pure computing time
a small CCR is favourable
typically: the CCR decreases with increasing problem size
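a sketch of how TCOMP and TCOMM could be measured with MPI_Wtime (the loop stands in for the real computation; the measured communication time also absorbs idle time spent waiting):

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);

      double local = 0.0, global, t0;

      t0 = MPI_Wtime();
      for (int i = 1; i <= 10000000; i++)    /* placeholder computation */
          local += 1.0 / i;
      double t_comp = MPI_Wtime() - t0;      /* ~ TCOMP */

      t0 = MPI_Wtime();
      MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      double t_comm = MPI_Wtime() - t0;      /* ~ TCOMM (plus TIDLE while waiting) */

      printf("CCR = %f\n", t_comm / t_comp); /* small values are favourable */

      MPI_Finalize();
      return 0;
  }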

SLIDE 63

Quantitative Performance Evaluation

  • comparison multiprocessor / monoprocessor

correlation of multi- and monoprocessor systems’ performance
important: a program that can be executed on both systems
definitions
  P(1): amount of unit operations of the program on the monoprocessor system
  P(p): amount of unit operations of the program on the multiprocessor system with p processors
  T(1): execution time of the program on the monoprocessor system (measured in steps or clock cycles)
  T(p): execution time of the program on the multiprocessor system with p processors (measured in steps or clock cycles)

SLIDE 64

Quantitative Performance Evaluation

  • comparison multiprocessor / monoprocessor (cont’d)

simplifying preconditions
  T(1) = P(1): exactly one operation is executed per step on the monoprocessor system
  T(p) ≤ P(p): more than one operation can be executed per step on the multiprocessor system with p processors (for p ≥ 2)
SLIDE 65

Quantitative Performance Evaluation

  • comparison multiprocessor / monoprocessor (cont’d)

speed-up

  S(p) = T(1) / T(p)

  indicates the improvement in processing speed; in general, 1 ≤ S(p) ≤ p

efficiency

  E(p) = S(p) / p

  indicates the relative improvement in processing speed, normalised by the amount of processors p; in general, 1/p ≤ E(p) ≤ 1
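a small worked example (numbers invented for illustration): with T(1) = 100 s and T(8) = 20 s on p = 8 processors, S(8) = 100 / 20 = 5 and E(8) = 5 / 8 ≈ 0.63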

SLIDE 66

Quantitative Performance Evaluation

  • comparison multiprocessor / monoprocessor (cont’d)

speed-up and efficiency can be seen in two different ways
algorithm-independent
  the best known sequential algorithm for the monoprocessor system is compared to the respective parallel algorithm for the multiprocessor system
  → absolute speed-up, absolute efficiency
algorithm-dependent
  the parallel algorithm is treated as a sequential one to measure the execution time on the monoprocessor system; “unfair” due to communication and synchronisation overhead
  → relative speed-up, relative efficiency

SLIDE 67

Quantitative Performance Evaluation

  • scalability
  • objective: adding further processing elements to the system shall reduce the execution time without any program modifications, i. e. a linear performance increase with an efficiency close to 1

important for scalability is a sufficient problem size
  one porter may carry one suitcase in a minute; 60 porters won’t do it in a second
  but 60 porters may carry 60 suitcases in a minute
in case of a fixed problem size and an increasing amount of processors, saturation will occur for a certain value of p; hence, scalability is limited
when the amount of processors is scaled together with the problem size (so-called scaled problem analysis), this effect will not appear for well-scalable hard- and software systems

SLIDE 68

Quantitative Performance Evaluation

  • AMDAHL’s law

probably the most important and most famous estimate for the speed-up (even if quite pessimistic)
underlying model
  each program consists of a sequential part s, 0 ≤ s ≤ 1, that can only be executed sequentially: synchronisation, data I/O, …
  furthermore, each program consists of a parallelisable part 1−s that can be executed in parallel by several processes; finding the maximum value within a set of numbers, e. g.
hence, the execution time of the parallel program executed on p processors can be written as

  T(p) = s⋅T(1) + ((1−s)/p)⋅T(1)

SLIDE 69

Quantitative Performance Evaluation

  • AMDAHL’s law (cont’d)

the speed-up can thus be computed as

  S(p) = T(1) / T(p) = T(1) / (s⋅T(1) + ((1−s)/p)⋅T(1)) = 1 / (s + (1−s)/p)

when increasing p → ∞ we finally get AMDAHL’s law

  lim_{p→∞} S(p) = lim_{p→∞} 1 / (s + (1−s)/p) = 1/s

speed-up is bounded: S(p) ≤ 1/s
the sequential part can have a dramatic impact on the speed-up
therefore, the central effort of all (parallel) algorithms: keep s small
many parallel programs have a small sequential part (s < 0.1)

SLIDE 70

Quantitative Performance Evaluation

  • AMDAHL’s law (cont’d)

example
  s = 0.1 and, thus, S(p) ≤ 10
  independent of p, the speed-up is bounded by this limit
  where’s the error?

(plot: S(p) over p for s = 0.1, approaching 10 but never exceeding it)
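the saturation is easy to reproduce; a small C sketch evaluating S(p) = 1 / (s + (1−s)/p) for s = 0.1:

  #include <stdio.h>

  int main(void) {
      double s = 0.1;                  /* sequential part */
      for (int p = 1; p <= 1024; p *= 2)
          printf("p = %4d   S(p) = %.2f\n", p, 1.0 / (s + (1.0 - s) / p));
      return 0;                        /* S(p) approaches, but never reaches, 1/s = 10 */
  }

even for p = 1024 the printed speed-up stays below 10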

SLIDE 71

Quantitative Performance Evaluation

  • GUSTAFSON’s law

addresses the shortcomings of AMDAHL’s law, as it states that any sufficiently large problem can be efficiently parallelised
instead of a fixed problem size, it supposes a fixed time concept
underlying model
  the execution time on the parallel machine is normalised to 1
  this contains a non-parallelisable part σ, 0 ≤ σ ≤ 1
hence, the execution time of the sequential program on the monoprocessor can be written as

  T(1) = σ + p⋅(1−σ)

the speed-up can thus be computed as

  S(p) = σ + p⋅(1−σ) = p + σ⋅(1−p)

SLIDE 72

Quantitative Performance Evaluation

  • GUSTAFSON’s law (cont’d)

difference to AMDAHL: the sequential part s(p) is not constant, but gets smaller with increasing p

  s(p) = σ / (σ + p⋅(1−σ)),   s(p) ∈ ]0, 1[

often more realistic, because more processors are used for a larger problem size, and there the parallelisable parts typically increase (more computations, less declarations, …)
the speed-up is not bounded for increasing p
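for comparison with the AMDAHL sketch above, the same loop for GUSTAFSON’s speed-up S(p) = p + σ⋅(1−p), with σ = 0.1 chosen to mirror the earlier example:

  #include <stdio.h>

  int main(void) {
      double sigma = 0.1;              /* non-parallelisable part of the normalised run time */
      for (int p = 1; p <= 1024; p *= 2)
          printf("p = %4d   S(p) = %.1f\n", p, p + sigma * (1.0 - p));
      return 0;                        /* grows without bound, with slope 1 - sigma */
  }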

SLIDE 73

Quantitative Performance Evaluation

  • GUSTAFSON’s law (cont’d)

some more thoughts about speed-up
theory tells us: a superlinear speed-up does not exist
  each parallel algorithm can be simulated on a monoprocessor system by emulating, in a loop, always the next step of a processor of the multiprocessor system
but a superlinear speed-up can be observed
  when improving an inferior sequential algorithm
  when a parallel program (that does not fit into the main memory of the monoprocessor system) completely runs in the caches and main memories of the nodes of the multiprocessor system

SLIDE 74

Twelve ways…

…to fool the masses when giving performance results on parallel computers. —David H. Bailey, NASA Ames Research Centre, 1991

  • 1. Quote only 32-bit performance results, not 64-bit results.
  • 2. Present performance figures for an inner kernel, and then represent these figures as the performance of the entire application.

  • 3. Quietly employ assembly code and other low-level language constructs.
  • 4. Scale up the problem size with the number of processors, but omit any mention of this fact.

  • 5. Quote performance results projected to a full system.
  • 6. Compare your results against scalar, unoptimised codes on Crays.
SLIDE 75

Twelve ways…

  • 7. When direct run time comparisons are required, compare with an old code on an obsolete system.
  • 8. If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation.
  • 9. Quote performance in terms of processor utilisation, parallel speed-ups, or MFLOPS per dollar.
  • 10. Mutilate the algorithm used in the parallel implementation to match the architecture.
  • 11. Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment.
  • 12. If all else fails, show pretty pictures and animated videos, and don’t talk about performance.