Parallel Programming and High-Performance Computing
Part 3: Foundations

SLIDE 1

Technische Universität München

Parallel Programming and High-Performance Computing

Part 3: Foundations

  • Dr. Ralf-Peter Mundani

CeSIM / IGSSE

SLIDE 2

3 Foundations

Overview

  • terms and definitions
  • process interaction for MemMS
  • process interaction for MesMS
  • example of a parallel program

"A distributed system is one that prevents you from working because of the failure of a machine that you had never heard of."

—Leslie Lamport

SLIDE 3

3 Foundations

Terms and Definitions

  • sequential vs. parallel: an algorithm analysis

– sequential algorithms are characterised as follows

  • all instructions U are processed in a certain sequence
  • this sequence is given by the causal ordering of U, i. e. the causal dependencies on other instructions’ results

– hence, for the set U a partial order ≤ can be declared

  • x ≤ y for x, y ∈ U
  • ≤ representing a reflexive, antisymmetric, transitive relation

– often, for (U, ≤) more than one sequence can be found so that all computations (on the monoprocessor) are executed correctly

[diagram: three possible execution sequences (blockwise) that are all consistent with the partial order ≤]

SLIDE 4

3 Foundations

Terms and Definitions

  • sequential vs. parallel: an algorithm analysis (cont’d)

– first step towards a parallel program: concurrency

  • via (U, ≤) identification of independent blocks (of instructions)
  • simple parallel processing of independent blocks possible (due to only a few communication / synchronisation points)

– suited for both parallel processing (multiprocessor) and distributed processing (metacomputer, grid)

[diagram: independent blocks of (U, ≤) processed concurrently over time]

SLIDE 5

3 Foundations

Terms and Definitions

  • sequential vs. parallel: an algorithm analysis (cont’d)

– further parallelisation of sequential blocks

  • subdivision of suitable blocks (loop constructs, e. g.) for parallel processing
  • here, communication / synchronisation is indispensable

– mostly suitable for parallel processing (MemMS and MesMS)

[diagram: blocks A–E over time, with suitable blocks further subdivided for parallel processing]

SLIDE 6

3 Foundations

Terms and Definitions

  • general design questions

– several considerations have to be taken into account for writing a parallel program (either from scratch or based on an existing sequential program)
– standard questions comprise

  • which part of the (sequential) program can be parallelised
  • what kind of structure to be used for parallelisation
  • which parallel programming model to be used
  • which parallel programming language to be used
  • what kind of compiler to be used
  • what about load balancing strategies
  • what kind of architecture is the target machine
SLIDE 7

3 Foundations

Terms and Definitions

  • dependence analysis

– processes / (blocks of) instructions cannot be executed simultaneously if there exist dependencies between them
– hence, a dependence analysis of a given algorithm is necessary
– example

  for_all_processes (i = 0; i < N; ++i)
      a[i] = 0

– what about the following code

  for_all_processes (i = 1; i < N; ++i)
      x = i − 2*i + i*i
      a[i] = a[x]

– as it is not always obvious, an algorithmic way of recognising dependencies (via the compiler, e. g.) would be preferable

SLIDE 8

3 Foundations

Terms and Definitions

  • dependence analysis (cont’d)

– BERNSTEIN (1966) established a set of conditions sufficient for determining whether two processes can be executed in parallel
– definitions

  • Ii (input): set of memory locations read by process Pi
  • Oi (output): set of memory locations written by process Pi

– BERNSTEIN’s conditions

  I1 ∩ O2 = ∅
  I2 ∩ O1 = ∅
  O1 ∩ O2 = ∅

– example

  P1: a = x + y
  P2: b = x + z

  I1 = {x, y}, O1 = {a}, I2 = {x, z}, O2 = {b}  ⇒  all conditions fulfilled, P1 and P2 can be executed in parallel (see the sketch below)
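A minimal C sketch (my illustration, not part of the slides; OpenMP chosen only to express the concurrency): since P1 and P2 fulfil BERNSTEIN's conditions, they may safely run as two parallel sections.

    /* Hedged sketch: P1 and P2 fulfil BERNSTEIN's conditions, so they may
       run concurrently, expressed here as two OpenMP sections. */
    #include <stdio.h>

    int main(void)
    {
        int x = 1, y = 2, z = 3, a, b;

        #pragma omp parallel sections
        {
            #pragma omp section
            a = x + y;                        /* P1: I1 = {x, y}, O1 = {a} */
            #pragma omp section
            b = x + z;                        /* P2: I2 = {x, z}, O2 = {b} */
        }

        printf("a = %d, b = %d\n", a, b);     /* result independent of execution order */
        return 0;
    }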

SLIDE 9

3 Foundations

Terms and Definitions

  • dependence analysis (cont’d)

– further example

  P1: a = x + y
  P2: b = a + b

  I1 = {x, y}, O1 = {a}, I2 = {a, b}, O2 = {b}  ⇒  I2 ∩ O1 ≠ ∅, hence P1 and P2 cannot be executed in parallel

– BERNSTEIN’s conditions help to identify instruction-level parallelism or coarser parallelism (loops, e. g.)
– hence, sometimes dependencies within loops can be resolved
– example: two loops with dependencies – which one can be resolved?

  loop A:                           loop B:
  for (i = 2; i < 100; ++i)         for (i = 2; i < 100; ++i)
      a[i] = a[i−1] + 4                 a[i] = a[i−2] + 4

SLIDE 10

3 Foundations

Terms and Definitions

  • dependence analysis (cont’d)

– expansion of loop B

  a[2] = a[0] + 4
  a[3] = a[1] + 4
  a[4] = a[2] + 4
  a[5] = a[3] + 4
  a[6] = a[4] + 4
  a[7] = a[5] + 4
  …

– hence, a[3] can only be computed after a[1], a[4] after a[2], …
– the computation can be split into two independent loops (a C sketch follows below)

  a[0] = …                          a[1] = …
  for (i = 1; i < 50; ++i)          for (i = 1; i < 50; ++i)
      j = 2*i                           j = 2*i + 1
      a[j] = a[j−2] + 4                 a[j] = a[j−2] + 4

– many other techniques for recognising / creating parallelism exist (see also Chapter 4: Dependence Analysis)
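A hedged C sketch of the same splitting (the OpenMP annotation is my illustration, not part of the slide): the even and the odd chain carry no dependence on each other and can therefore be processed in parallel.

    /* Sketch: loop B split into two independent chains (even / odd indices). */
    #include <stdio.h>

    #define N 100

    int main(void)
    {
        static double a[N] = {1.0, 2.0};          /* a[0], a[1] given */

        #pragma omp parallel sections
        {
            #pragma omp section                   /* even chain: a[2], a[4], ... */
            for (int i = 1; i < N / 2; ++i) {
                int j = 2 * i;
                a[j] = a[j - 2] + 4;
            }
            #pragma omp section                   /* odd chain: a[3], a[5], ... */
            for (int i = 1; i < N / 2; ++i) {
                int j = 2 * i + 1;
                a[j] = a[j - 2] + 4;
            }
        }

        printf("a[98] = %g, a[99] = %g\n", a[98], a[99]);
        return 0;
    }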


SLIDE 11

3 Foundations

Terms and Definitions

  • structures of parallel programs

– examples of structures

[diagram: structures of parallel programs — function parallelism (incl. macropipelining), data parallelism (static or dynamic, the latter via commissioning or order acceptance), competitive parallelism, …]

SLIDE 12

3 Foundations

Terms and Definitions

  • structures of parallel programs (cont’d)

– function parallelism

  • parallel execution (on different processors) of components such as functions, procedures, or blocks of instructions
  • drawback

– separate program for each processor necessary
– limited degree of parallelism ⇒ limited scalability

  • macropipelining for data transfer between single components

– overlapping parallelism similar to pipelining in processors
– one component (producer) hands its processed data to the next one (consumer) ⇒ stream of results
– components should be of the same complexity (otherwise idle times)
– data transfer can either be synchronous (all components communicate simultaneously) or asynchronous (buffered)

SLIDE 13

3 Foundations

Terms and Definitions

  • structures of parallel programs (cont’d)

– data parallelism (1)

  • parallel execution of the same instructions (functions or even programs) on different parts of the data (SIMD)
  • advantages

– only one program for all processors necessary
– in most cases ideal scalability

  • drawback: explicit distribution of data necessary (MesMS)
  • structuring of data parallel programs

– static: compiler decides about parallel and sequential processing of concurrent parts
– dynamic: decision about parallel processing at run time, i. e. a dynamic structure allows for load balancing (at the expense of higher organisation / synchronisation costs)

SLIDE 14

3 Foundations

Terms and Definitions

  • structures of parallel programs (cont’d)

– data parallelism (2)

  • dynamic structuring

– commissioning (master-slave)
  » one master process assigns data to slave processes
  » both a master and a slave program are necessary
  » the master becomes a potential bottleneck in case of too many slaves (⇒ hierarchical organisation)
– order polling (bag-of-tasks)
  » processes pick the next part of available data “from a bag” as soon as they have finished their computations (see the sketch below)
  » mostly suitable for MemMS as the bag has to be accessible from all processes (⇒ communication overhead for MesMS)
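A minimal Pthreads sketch of the bag-of-tasks idea on shared memory (my illustration; process(), the task count, and the thread count are placeholders): the "bag" is simply a shared index protected by a mutex.

    /* Bag-of-tasks on shared memory: each thread repeatedly picks the next
       unprocessed task index from a mutex-protected shared counter. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTASKS   1000
    #define NTHREADS 4

    static int next_task = 0;                              /* the "bag" */
    static pthread_mutex_t bag = PTHREAD_MUTEX_INITIALIZER;

    static void process(int t) { (void)t; /* placeholder for real work */ }

    static void *worker(void *arg)
    {
        for (;;) {
            pthread_mutex_lock(&bag);                      /* pick the next task */
            int t = (next_task < NTASKS) ? next_task++ : -1;
            pthread_mutex_unlock(&bag);
            if (t < 0) break;                              /* bag is empty */
            process(t);
        }
        return arg;
    }

    int main(void)
    {
        pthread_t th[NTHREADS];
        for (int i = 0; i < NTHREADS; ++i) pthread_create(&th[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; ++i) pthread_join(th[i], NULL);
        puts("all tasks processed");
        return 0;
    }

OpenMP's dynamic loop scheduling realises the same idea without hand-written locking.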

SLIDE 15

3 Foundations

Terms and Definitions

  • structures of parallel programs (cont’d)

– competitive parallelism

  • parallel execution of different processes (based on different algorithms or strategies) all solving the same problem

  • advantages

– as soon as the first process has found the solution, the computations of all remaining processes are allowed to stop
– on average, superlinear speed-up possible

  • drawback

– lots of different programs necessary

  • nevertheless, rare case of parallel programs
  • examples

– sorting algorithms
– theorem proving within computational semantics

SLIDE 16

3 Foundations

Terms and Definitions

  • parallel programming models

– representation of parallel activities and their respective data
– definition of abstractions for

  • sequential activities (process, task, thread, e. g.)
  • dependencies among sequential activities (messages, remote memory accesses, synchronisation, e. g.)

– examples of (abstract) parallel programming models

  • memory-coupling: shared address space; direct access to shared variables; synchronisation mechanism
  • message-coupling: exchange of messages; communication either based on ports (⇒ channels) or on process identifiers
  • data parallelism: same operation applied to all data elements simultaneously; data distribution done by the programmer

– in any case, a parallel programming language is necessary

SLIDE 17

3 Foundations

Terms and Definitions

  • parallel programming languages

– explicit parallelism

  • parallel programming interfaces

– extension of sequential languages (C, Fortran, e. g.) by additional parallel language constructs
– implementation via procedure calls from respective libraries
– examples: MPI, PVM, Linda

  • parallel programming environments

– parallel programming interface plus additional tools such as compiler, libraries, debugger, …
– most (machine-dependent) environments come along with a parallel computer
– example: MPICH

SLIDE 18

3 Foundations

Terms and Definitions

  • parallel programming languages (cont’d)

– implicit parallelism

  • mapping of programs (written in a sequential language) to the parallel computer via compiler directives

  • primarily for the parallelisation of loops
  • only minor modifications of source code necessary
  • level of parallelism

– block level for parallelising compilers (⇒ threads)
– instruction / sub-instruction level for vectorising compilers

  • example: OpenMP (parallelising), Intel compiler (vectorising)
SLIDE 19

3 Foundations

Terms and Definitions

  • differences between processes and threads

– process (heavy-weight process)

  • functional unit with separate, exclusive memory for instructions and data (the so-called environment)
  • context: register values of the processor executing a process
  • difficult to handle: process creation, process changes (both environment and context), access protection, inter-process communication

– thread (light-weight process)

  • threads share the environment, i. e. the address space, of a process
  • a change of thread only implies a change of context
  • but a synchronisation mechanism is necessary
  • standardisation effort: POSIX-Thread or Pthread (1995)
SLIDE 20

3 Foundations

Terms and Definitions

  • administration of threads

– observation: programs are almost never perfectly parallel, i. e. sequential and parallel parts alternate
– three different models for the administration of parallel threads

  • fork/join
  • single program, multiple data (SPMD)
  • reusable thread pool

[diagram: alternation of sequential (initialise / finalise) and parallel parts over time]

SLIDE 21

3 Foundations

Terms and Definitions

  • administration of threads (cont’d)

– fork/join-model

  • main process starts execution of the program
  • at the beginning of each parallel block, the main process generates threads (fork operation) that can work in parallel
  • at the end of each parallel block, all threads are synchronised with the main process and terminated (join operation)
  • generation / termination is repeated for each block ⇒ overhead

[diagram: fork/join model, threads created for every parallel part and terminated afterwards]
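A hedged OpenMP sketch of the fork/join model (assuming an OpenMP-capable C compiler): each parallel region forks a team of threads and joins it again at the end of the region.

    /* Fork/join: threads are created at every parallel region ("fork") and
       synchronised/terminated at its end ("join"). */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        printf("sequential (initialise)\n");          /* main thread only */

        #pragma omp parallel                          /* fork */
        printf("parallel work by thread %d\n", omp_get_thread_num());
                                                      /* implicit join here */
        printf("sequential part\n");

        #pragma omp parallel                          /* fork again -> overhead */
        printf("more parallel work by thread %d\n", omp_get_thread_num());

        printf("sequential (finalise)\n");
        return 0;
    }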

SLIDE 22

3 Foundations

Terms and Definitions

  • administration of threads (cont’d)

– SPMD-model

  • partially avoids the overhead of the fork/join-model
  • threads are generated once at program start and collectively terminate at program end
  • sequential parts are executed by all threads ⇒ redundant multiple processing
  • parallel parts are executed by just one thread each

[diagram: SPMD model, threads exist from program start to program end]
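The slide describes threads; a hedged sketch of the same SPMD structure with MPI processes (the classic SPMD setting) looks as follows: every process executes the same program from start to end, sequential parts are executed redundantly, and the parallel work is split by rank.

    /* SPMD: one program, started once per process; work is split by rank. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);                  /* processes exist from program start */

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int n = 16;                              /* sequential part: done redundantly */

        for (int i = rank; i < n; i += size)     /* parallel part: split by rank */
            printf("rank %d handles element %d\n", rank, i);

        MPI_Finalize();                          /* collective termination at program end */
        return 0;
    }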

SLIDE 23

3 Foundations

Terms and Definitions

  • administration of threads (cont’d)

– reusable-thread-pool-model

  • combines both models (fork/join and SPMD) to avoid overhead (generation / termination) and redundant processing
  • threads are generated the first time they are needed
  • at the end of a parallel block, threads are set into the idle state, to be reactivated at the beginning of the next one
  • hence, sequential parts are processed by just one thread

[diagram: reusable thread pool, idle threads reactivated for each parallel part]

SLIDE 24

3 Foundations

Terms and Definitions

  • development phases

[diagram: development phases (specification, decomposition, performance estimation, mapping, coding, debugging, performance analysis, visualisation, dynamic load balancing, verification), grouped into early and late phases of design]
SLIDE 25

3 Foundations

Terms and Definitions

  • development phases (cont’d)

[diagram: development phases (as on SLIDE 24)]

– definition of parallel processes and communication points
– usage of specific tools and parallel programming environments (LAM, mpC, ASKALON, SUN Studio Tools, e. g.)
– depending on that, the development of a parallel program can be manual, semi-automatic, or automatic (parallelising compiler)
– programming language with explicit or implicit parallelism (OpenMP, MPI, PVM, Linda, Orca, e. g.)

SLIDE 26

3 Foundations

Terms and Definitions

  • development phases (cont’d)

[diagram: development phases (as on SLIDE 24)]

– parallelisation (especially for already existing sequential programs)
– rough estimation of resources, run times, memory usage, I/O, and communication ⇒ profilers and simulators
– mapping of processes (and data) to processors

SLIDE 27

3 Foundations

Terms and Definitions

  • development phases (cont’d)

[diagram: development phases (as on SLIDE 24)]

– besides performance evaluation, nothing is done here on the target architecture
  ↓ development runs (for optimisation)
  ↓ production runs (for results)

SLIDE 28

3 Foundations

Terms and Definitions

  • development phases (cont’d)

[diagram: development phases (as on SLIDE 24)]

– execution of the program on the target architecture (at least once) necessary for monitoring purposes
– interactive parallel debuggers for testing (xpdb, e. g.)
– performance analysis and visualisation of program execution on the target architecture based on the utilisation of resources and communication events (VAMPIR for MPI programs, e. g.)
– observation of load distribution (Ganglia, e. g.) for manual or automated load balancing strategies

SLIDE 29

3 Foundations

Terms and Definitions

  • alternative phase model

– heuristic approach with four phases (according to FOSTER)

  • partitioning: instructions and respective data are subdivided into smaller tasks (ignoring machine-specific aspects) for a maximum degree of parallelism
  • communication: regulation of necessary communication
  • agglomeration: cost and performance evaluation of the first two phases ⇒ bundling of tasks possible
  • mapping: tasks are mapped statically (via compiler) or dynamically during run time (via load balancing strategies) to processors for maximum utilisation of processors and minimum communication costs

– the first two phases focus on parallelism and scalability
– the latter two phases focus on locality and parallel efficiency

SLIDE 30

3 Foundations

Overview

  • terms and definitions
  • process interaction for MemMS
  • process interaction for MesMS
  • example of a parallel program
SLIDE 31

3 Foundations

Process Interaction for MemMS

  • principles

– processes depend on each other if they have to be executed in a certain order; this can have two reasons

  • cooperation: processes execute parts of a common task

    – producer/consumer: one process generates data to be processed by another one
    – client/server: same as above, but the second process also returns some data (the result of a computation, e. g.)
    – …

  • competition: activities of one process hinder other processes

– synchronisation: management of cooperation / competition of processes ⇒ ordering of the processes’ activities
– communication: data exchange among processes
– MemMS: realised via shared variables with read / write access

SLIDE 32

3 Foundations

Process Interaction for MemMS

  • synchronisation

– two types of synchronisation can be distinguished

  • unilateral: if activity A2 depends on the results of activity A1, then A1 has to be executed before A2 (i. e. A2 has to wait until A1 finishes); synchronisation does not affect A1
  • multilateral: the order of execution of A1 and A2 does not matter, but A1 and A2 are not allowed to be executed in parallel (due to write / write or write / read conflicts, e. g.)

– activities affected by multilateral synchronisation are mutually exclusive, i. e. they cannot be executed in parallel and appear atomic to each other (no activity can interrupt another one)
– instructions requiring mutual exclusion are called critical sections
– synchronisation might lead to deadlocks (mutual blocking) or lockout (“starvation”) of processes, i. e. an indefinitely long delay

SLIDE 33

3 Foundations

Process Interaction for MemMS

  • synchronisation (cont’d)

– necessary and sufficient constraints for deadlocks

  • resources are only exclusively useable
  • resources cannot be withdrawn from a process
  • processes do not release assigned resources while waiting for the allocation of other resources
  • there exists a cyclic chain of processes, each using at least one resource needed by the next process within the chain

[diagram: processes P1, P2 and resources A, B forming a cyclic wait, with edges for “resource requested by process” and “resource allocated to process”]

SLIDE 34

3 Foundations

Process Interaction for MemMS

  • synchronisation (cont’d)

– possibilities to handle deadlocks

  • deadlock detection

– techniques to detect deadlocks (identification of cycles in waiting graphs, e. g.) and measures to eliminate them (rollback, e. g.)

  • deadlock avoidance

– by rules: paying attention that at least one of the four constraints for deadlocks is not fulfilled
– by requirements analysis: analysing future resource allocations of processes and forbidding states that could lead to deadlocks (HABERMANN’s / banker’s algorithm, well known from OS, e. g.)

SLIDE 35

3 Foundations

Process Interaction for MemMS

  • methods of synchronisation

– lock variable / mutex
– semaphore
– monitor
– barrier

SLIDE 36

3 Foundations

Process Interaction for MemMS

  • lock variable / mutex

– used to control the access to critical sections
– when entering a critical section, a process

  • has to wait until the respective lock is open
  • enters and closes the lock, thus no other process can follow
  • opens the lock and leaves when finished
  • lock / unlock have to be executed by the same process

– lock variables are abstract data types consisting of

  • a boolean variable of type mutex
  • at least two functions lock and unlock
  • further functions (Pthreads): init, destroy, trylock, …

– the function lock consists of two operations “test” and “set”, which together form a non-interruptible (i. e. atomic) activity
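A minimal Pthreads sketch (my illustration; the counter and the two threads are arbitrary): pthread_mutex_lock/unlock play the role of the lock/unlock operations around the critical section.

    /* A critical section protected by a Pthreads mutex (lock variable). */
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                               /* shared data */
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;  /* the lock */

    static void *work(void *arg)
    {
        for (int i = 0; i < 100000; ++i) {
            pthread_mutex_lock(&m);    /* wait until the lock is open, then close it */
            ++counter;                 /* critical section */
            pthread_mutex_unlock(&m);  /* open the lock again */
        }
        return arg;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, work, NULL);
        pthread_create(&t2, NULL, work, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);   /* 200000 with correct locking */
        return 0;
    }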

SLIDE 37

3 Foundations

Process Interaction for MemMS

  • semaphore

– abstract data type consisting of

  • nonnegative variable of type integer (semaphore counter)
  • two atomic operations P (“passeeren”) and V (“vrijgeven”)

– after initialisation of semaphore S the counter can only be manipulated with the operations P(S) and V(S)

  • P(S): if S > 0 then S = S − 1, else the process executing P(S) will be suspended
  • V(S): S = S + 1

– after a V-operation, any suspended process may be reactivated (busy waiting); alternative: always the next process in the queue
– binary semaphore: has only the values “0” and “1” (similar to a lock variable, but P and V can be executed by different processes)
– general semaphore: can take any nonnegative value

SLIDE 38

3 Foundations

Process Interaction for MemMS

  • semaphore (cont’d)

– the initial value of the semaphore counter defines the maximum number of processes that can enter a critical section simultaneously
– critical section enclosed by operations P and V

  # mutual exclusion (binary)
  semaphore s; s = 1
  execute p1 and p2 in parallel

  begin procedure p1                begin procedure p2
      while (true) do                   while (true) do
          P(s)                              P(s)
          critical section 1                critical section 2
          V(s)                              V(s)
      od                                od
  end                               end

SLIDE 39

3 Foundations

Process Interaction for MemMS

  • semaphore (cont’d)

– consumer/producer-problem: semaphore indicates difference between produced and consumed elements
– assumption: unlimited buffer, atomic operations store and remove

  # consumer/producer (general)
  semaphore s; s = 0
  execute producer and consumer in parallel

  begin procedure producer          begin procedure consumer
      while (true) do                   while (true) do
          produce X                         P(s)
          store X                           remove X
          V(s)                              consume X
      od                                od
  end                               end
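A hedged C sketch of the same scheme with a POSIX unnamed semaphore (sem_wait corresponds to P, sem_post to V); the fixed buffer size and the trivially "atomic" store/remove are simplifications for illustration only.

    /* Producer/consumer with a POSIX semaphore counting stored elements. */
    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    #define ITEMS 10

    static int buffer[ITEMS];
    static int in = 0, out = 0;
    static sem_t s;                                  /* counts stored elements */

    static void *producer(void *arg)
    {
        for (int x = 0; x < ITEMS; ++x) {
            buffer[in++] = x;                        /* produce and store X */
            sem_post(&s);                            /* V(s) */
        }
        return arg;
    }

    static void *consumer(void *arg)
    {
        for (int i = 0; i < ITEMS; ++i) {
            sem_wait(&s);                            /* P(s): wait for an element */
            printf("consumed %d\n", buffer[out++]);  /* remove and consume X */
        }
        return arg;
    }

    int main(void)
    {
        pthread_t p, c;
        sem_init(&s, 0, 0);                          /* counter starts at 0 */
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        sem_destroy(&s);
        return 0;
    }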

SLIDE 40

3 Foundations

Process Interaction for MemMS

  • monitor

– semaphores solve synchronisation on a very low level ⇒ already one wrong semaphore operation might cause the breakdown of the entire system
– better: synchronisation on a higher level with monitors

  • abstract data type with an implicit synchronisation mechanism, i. e. implementation details (such as access to shared data or mutual exclusion) are hidden from the user
  • all access operations are mutually exclusive, thus all resources (controlled by the monitor) are only exclusively useable

– monitors consist of

  • several monitor variables and monitor procedures
  • a monitor body (instructions executed after program start for initialisation of the monitor variables)

SLIDE 41

3 Foundations

Process Interaction for MemMS

  • monitor (cont’d)

– access to monitor-bound variables is only possible via monitor procedures; direct access from outside the monitor is not possible
– only one process can enter a monitor at each point in time; all others are suspended and have to wait outside the monitor
– synchronisation via condition variables (based on mutex)

  • wait(c): the calling process is blocked and appended to an internal queue of processes also blocked due to condition c, i. e. immediate entry to the monitor is possible for other processes
  • signal(c): if the queue for condition c is not empty, the process at the queue’s head is reactivated (and also preferred to processes waiting outside for entering the monitor)

– condition variables are monitor-bound and only accessible via the operations wait and signal (⇒ no manipulation from outside)
SLIDE 42

3 Foundations

Process Interaction for MemMS

  • monitor (cont’d)

– consumer/producer-problem with limited buffer (1)

  # consumer/producer
  const size = …
  monitor limitedbuffer
      buffer[size] of integer
      in, out: integer
      n: integer
      notempty, notfull: condition

      begin procedure store(X)
          if n = size then wait(notfull)
          buffer[in] = X; in = in + 1
          if in = size then set in = 0
          n = n + 1
          signal(notempty)
      end

[diagram: ring buffer example with out = 3, in = 10, n = 7]

SLIDE 43

3 Foundations

Process Interaction for MemMS

  • monitor (cont’d)

– consumer/producer-problem with limited buffer (2)

      begin procedure remove(X)
          if n = 0 then wait(notempty)
          X = buffer[out]; out = out + 1
          if out = size then set out = 0
          n = n − 1
          signal(notfull)
      end

      monitor body: in = 0; out = 0; n = 0

  begin procedure producer          begin procedure consumer
      while (true) do                   while (true) do
          produce X                         remove X
          store X                           consume X
      od                                od
  end                               end
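C offers no monitor construct; a common realisation uses a mutex plus condition variables. The following hedged sketch rebuilds the limited buffer that way (buffer size, function names, and the small demo are my choices): the mutex provides the monitor's implicit mutual exclusion, and pthread_cond_wait/signal correspond to wait(c)/signal(c).

    /* The limited-buffer monitor expressed with a Pthreads mutex and
       condition variables. */
    #include <pthread.h>
    #include <stdio.h>

    #define SIZE 8

    static int buffer[SIZE];
    static int in = 0, out = 0, n = 0;                   /* monitor variables */
    static pthread_mutex_t mon = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t notempty = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t notfull  = PTHREAD_COND_INITIALIZER;

    static void store(int x)                             /* monitor procedure */
    {
        pthread_mutex_lock(&mon);                        /* enter the monitor */
        while (n == SIZE)
            pthread_cond_wait(&notfull, &mon);           /* wait(notfull) */
        buffer[in] = x; in = (in + 1) % SIZE; n = n + 1;
        pthread_cond_signal(&notempty);                  /* signal(notempty) */
        pthread_mutex_unlock(&mon);                      /* leave the monitor */
    }

    static int remove_item(void)                         /* monitor procedure */
    {
        pthread_mutex_lock(&mon);
        while (n == 0)
            pthread_cond_wait(&notempty, &mon);          /* wait(notempty) */
        int x = buffer[out]; out = (out + 1) % SIZE; n = n - 1;
        pthread_cond_signal(&notfull);                   /* signal(notfull) */
        pthread_mutex_unlock(&mon);
        return x;
    }

    static void *producer(void *arg) { for (int x = 0; x < 32; ++x) store(x); return arg; }
    static void *consumer(void *arg) { for (int i = 0; i < 32; ++i) printf("%d\n", remove_item()); return arg; }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }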

SLIDE 44

3 Foundations

Process Interaction for MemMS

  • monitor (cont’d)

– some remarks

  • compared to semaphores, once correctly programmed, monitors cannot be disturbed by adding further processes
  • semaphores can be implemented via monitors and vice versa
  • nowadays, multithreaded OS use lock variables and condition variables instead of monitors
  • nevertheless, monitors are used within Java for the synchronisation of threads

SLIDE 45

3 Foundations

Process Interaction for MemMS

  • barrier

– synchronisation point for several processes, i. e. each process has to wait until the last one has also arrived
– initialisation of counter C before usage with the number of processes that should wait (init-barrier operation)
– each process executes a wait-barrier operation

  • counter C is decremented by one
  • a process is suspended if C > 0, otherwise all processes are reactivated and the counter C is set back to the initial value

– useful for setting all processes (after independent processing steps) into the same state and for debugging purposes
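A minimal sketch with a POSIX barrier (my illustration; the thread count is arbitrary): pthread_barrier_init corresponds to the init-barrier operation, pthread_barrier_wait to the wait-barrier operation.

    /* Synchronisation point: every thread waits until the last one arrives. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static pthread_barrier_t barrier;

    static void *work(void *arg)
    {
        long id = (long)arg;
        printf("thread %ld: independent processing step\n", id);
        pthread_barrier_wait(&barrier);      /* wait for the last thread */
        printf("thread %ld: all threads are in the same state now\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t th[NTHREADS];
        pthread_barrier_init(&barrier, NULL, NTHREADS);   /* init counter C */
        for (long i = 0; i < NTHREADS; ++i)
            pthread_create(&th[i], NULL, work, (void *)i);
        for (int i = 0; i < NTHREADS; ++i)
            pthread_join(th[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }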

SLIDE 46

3 Foundations

Overview

  • terms and definitions
  • process interaction for MemMS
  • process interaction for MesMS
  • example of a parallel program
SLIDE 47

3 Foundations

Process Interaction for MesMS

  • message passing paradigm

– no shared memory for synchronisation and communication
– hence, a transfer mechanism for information interchange is necessary
– message passing

  • messages: data units transferred between processes
  • send / receive operations instead of read / write operations

– implicit (sequential) order during send/receive-stage

  • a message can only be received after a prior send
  • communication via message passing (independent of the transferred data) leads to an implicit synchronisation
  • synchronisation due to availability / unavailability of messages
  • messages are resources that do not exist before the send operation and, in general, also not after the receive operation

SLIDE 48

3 Foundations

Process Interaction for MesMS

  • messages

– created whenever a process performs a send
– necessary information to be provided by the sender

  • destination (process, node, communication channel, e. g.)
  • unique identifier of the message (number, e. g.)
  • memory (so-called send buffer) containing the data to be transferred with the message
  • data type and number of elements within the send buffer

– data type and number of elements have to match on the receiver side, otherwise a correct interpretation of the data is in doubt
slide-49
SLIDE 49

Technische Universität München

  • Dr. Ralf-Peter Mundani - Parallel Programming and High-Performance Computing - Summer Term 2008

3−49

3 Foundations

Process Interaction for MesMS

  • sending messages

– send operations can be

  • synchronous / asynchronous: a process performing a send is either blocked (synchronous) or not (asynchronous) until a respective receive operation is executed
  • buffered / unbuffered: a send operation may first copy the data from the send buffer to a system buffer (buffered) or directly perform the transfer (unbuffered ⇒ faster, but risk of overwriting the send buffer due to possible parallel execution of transfer (NIC) and next send operation (CPU))
  • blocking / non-blocking: a send operation can either be blocked until the send buffer has been emptied (blocking) or immediately give control to the next instruction of the sending process (non-blocking ⇒ risk of overwriting the send buffer)

SLIDE 50

3 Foundations

Process Interaction for MesMS

  • receiving messages

– a process has to specify which message to receive (via message identifier or wildcard) and where to store the data (the so-called receive buffer)
– receive operations can be

  • destructive / non-destructive: a receive operation copies the data transferred with the message into the receive buffer and either destroys the message (destructive) or keeps it for later usage (non-destructive)
  • synchronous / asynchronous: a process performing a receive is either blocked until a message has arrived (synchronous) or not (asynchronous), thus it can continue with its execution and check again at a later point in time
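A hedged MPI sketch of the ingredients listed on the last two slides (ranks, tag, and buffer contents are arbitrary): destination, message identifier (tag), send/receive buffer, data type, and element count all appear as arguments of MPI_Send and MPI_Recv; both calls are blocking.

    /* Blocking send/receive of four doubles from rank 0 to rank 1. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                              /* sender */
            double sendbuf[4] = {1.0, 2.0, 3.0, 4.0};
            MPI_Send(sendbuf, 4, MPI_DOUBLE,          /* send buffer, count, type */
                     1, 42, MPI_COMM_WORLD);          /* destination, identifier (tag) */
        } else if (rank == 1) {                       /* receiver */
            double recvbuf[4];
            MPI_Recv(recvbuf, 4, MPI_DOUBLE,          /* type and count must match */
                     0, 42, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("received %.1f ... %.1f\n", recvbuf[0], recvbuf[3]);
        }

        MPI_Finalize();
        return 0;
    }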

SLIDE 51

3 Foundations

Process Interaction for MesMS

  • synchronisation characteristics

– synchronous message passing

  • both sender and receiver use synchronous operations
  • content and destination of a message are known on both sides at the same time
  • hence, the message can be transferred directly from the send buffer to the receive buffer

– asynchronous message passing

  • at least the sender or the receiver uses asynchronous operations (typically the sender)
  • as not both processes are available for communication at the same time, some buffer for the message transfer from sender to receiver is necessary

SLIDE 52

3 Foundations

Process Interaction for MesMS

  • addressing modes

– different addressing modes can be distinguished

  • direct naming: process identifiers are used for sender and receiver ⇒ identifiers have to be known during development
  • mailbox: global memory where processes can store (send) and remove (receive) messages (used in the Distributed Execution and Communication Kernel (DECK), e. g.)
  • port: a port is bound to one process and can be used in one direction only, i. e. either for sending or for receiving messages
  • connection / channel: for the usage of ports the setup of connections or channels is required, i. e. the connection of a send port of one process with the receive port of another process ⇒ data written by the sender to its port can be read by the receiver on the other port

SLIDE 53

3 Foundations

Process Interaction for MesMS

  • activity-oriented communication

– the communication discussed so far can be seen as data-oriented
– different approach: communication that concentrates on the activities of other processes
– basic schema of activity-oriented communication

  • the client sends a service request to the server
  • the server performs the requested activity while the client waits for a response
  • when finished, the server sends its response (maybe together with the computed data) back to the client
  • hence, two data-oriented communications are necessary

– this is also referred to as remote procedure call (RPC)
– in general, RPCs are synchronous communications as either the client (busy server) or the server (no client) might be blocked

SLIDE 54

3 Foundations

Overview

  • terms and definitions
  • process interaction for MemMS
  • process interaction for MesMS
  • example of a parallel program
SLIDE 55

3 Foundations

Example of a Parallel Program

  • warehousing

– a problem from the field of nonlinear optimisation
– some company sells one certain product P
– disposal of P at discrete points in time t0 < t1 < … < tN, i. e. within a complete planning period of N intervals [ti, ti+1], i = 0, …, N−1
– the company runs a warehouse that is supplied with P at the beginning of each interval [ti, ti+1], i = 0, …, N−1
– objective: planning of warehousing to minimise the overall costs
– let

  • Ai: warehouse stock of P (before delivery) at time ti
  • Ri: request for P within interval [ti, ti+1]
  • Ui: delivery quantity of P at the beginning of interval [ti, ti+1]
  • A0 = α ≥ 0: warehouse stock of P at time t0
SLIDE 56

3 Foundations

Example of a Parallel Program

  • warehousing (cont’d)

– changes of Ai within the following intervals are computed via the warehouse balance equation Ai+1 = Ai − Ri + Ui, i = 0, …, N−1
– costs (delivery, purchase, warehousing, e. g.) within each interval are given via the function fi(Ai, Ui); all Ui should be chosen in such a way as to minimise the overall costs for the entire planning period [t0, tN]
– furthermore, the warehouse stock AN at time tN should also be very small, i. e. the objective function looks as follows

  f(A, U) = ∑ (i = 0, …, N−1) fi(Ai, Ui) + ρ · AN²,   with ρ ≥ 0
SLIDE 57

3 Foundations

Example of a Parallel Program

  • warehousing (cont’d)

– altogether, this leads to the following problem (WH)

  minimise   f(A, U) = ∑ (i = 0, …, N−1) fi(Ai, Ui) + ρ · AN²

  subject to the constraints Ai+1 = Ai − Ri + Ui, i = 0, …, N−1 and A0 = α ≥ 0

– the variable U defines a series of decisions, also called a policy
– for each U a unique A can be found, called the corresponding state
– if (Ã, Ũ) is a solution of (WH), then Ũ is called an optimal policy
– in general, the requests Ri are unknown, but such models can be used to examine the dependence of an optimal policy on typical request profiles
– question: how to find an optimal policy, i. e. Ũ

SLIDE 58

3 Foundations

Example of a Parallel Program

  • warehousing (cont’d)

– solution via mutation-selection-method (MS-solver)

  • biologically motivated ⇒ make random mutations of the current iteration point and select the “good ones”

  • basic structure

  1) choose starting point x(0) ∈ R^N; set k = 0
  2) compute new v(k) via random mutation of x(k)
  3) if f(v(k)) < f(x(k)) set x(k+1) = v(k), otherwise set x(k+1) = x(k)
  4) set k = k + 1; continue with step 2

  • possible mutation

  vi(k) = xi(k) + σk · di,   i = 1, …, N

  with search direction di ∈ [−0.5, 0.5] and step size σk

  • sequential algorithm so far; next step: parallelisation
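A sequential C sketch of the mutation-selection loop under stated assumptions (the dummy objective f and the fixed iteration count stand in for the warehousing model and the halt condition):

    /* Mutation-selection method, sequential version (steps 1-4). */
    #include <stdlib.h>

    #define N 16

    static double f(const double x[N])               /* dummy objective: ||x||^2 */
    {
        double s = 0.0;
        for (int i = 0; i < N; ++i) s += x[i] * x[i];
        return s;
    }

    static void ms_solver(double x[N], double sigma, int kmax)
    {
        double v[N];
        for (int k = 0; k < kmax; ++k) {             /* halt condition: k < kmax */
            for (int i = 0; i < N; ++i) {            /* random mutation of x(k) */
                double d = (double)rand() / RAND_MAX - 0.5;   /* d in [-0.5, 0.5] */
                v[i] = x[i] + sigma * d;
            }
            if (f(v) < f(x))                         /* selection */
                for (int i = 0; i < N; ++i)
                    x[i] = v[i];                     /* x(k+1) = v(k) */
            /* otherwise x(k+1) = x(k): nothing to do */
        }
    }

    int main(void)
    {
        double x[N] = {1.0};                         /* arbitrary starting point */
        ms_solver(x, 0.1, 1000);
        return 0;
    }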

SLIDE 59

3 Foundations

Example of a Parallel Program

  • warehousing (cont’d)

– parallelisation of warehousing problem

  • basic questions

– which parts to be parallelised
– which structure: function or data parallelism
– which model: shared or distributed memory

  • possible candidates

– computation of (WH), i. e. fi(Ai, Ui), i = 0, …, N−1
– computation of new v(k) (step 2 of the MS-solver)

SLIDE 60

3 Foundations

Example of a Parallel Program

  • warehousing (cont’d)

– variant A: data parallelism (1)

  • all fvali = fi(Ai, Ui) can be computed independently

– each process computes some fvali
– no communication and synchronisation necessary
– works perfectly for both MemMS and MesMS

  • summation of all fvali

– each process computes a partial sum SP = ∑ fvali
– MemMS: each process computes S = S + SP, but synchronisation is necessary due to parallel write access (see the sketch below)
– MesMS: each process sends its SP to one dedicated (master) process that computes the global sum S
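A hedged OpenMP sketch of the MemMS case (fi() is a placeholder for fi(Ai, Ui)): each thread accumulates a partial sum SP, and the reduction clause provides the exclusive update of the global sum S.

    /* Variant A on shared memory: partial sums per thread, combined safely. */
    #include <stdio.h>

    #define N 1024

    static double fi(int i) { return (double)i; }   /* placeholder for fi(Ai, Ui) */

    int main(void)
    {
        double S = 0.0;

        #pragma omp parallel for reduction(+:S)
        for (int i = 0; i < N; ++i)
            S += fi(i);                             /* fval_i, summed without write conflicts */

        printf("S = %f\n", S);
        return 0;
    }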

SLIDE 61

3 Foundations

Example of a Parallel Program

  • warehousing (cont’d)

– variant A: data parallelism (2)

  • computation of v(k) can be processed independently

– each process computes some parts of v(k), i. e. some components vi(k)
– MemMS: each process updates v(k) with its computed components vi(k); no synchronisation necessary
– MesMS: each process sends its computed components vi(k) to one dedicated (master) process that assembles v(k)
– here, MemMS are advantageous

SLIDE 62

3 Foundations

Example of a Parallel Program

  • warehousing (cont’d)

– variant A: data parallelism (3)

  • parallel program for shared memory (MemMS)

  choose starting point x(0) and set k = 0
  while (halt condition not true) do
      parallel block
          compute some vi(k) and update v(k)
      wait for all threads to be finished
      evaluate constraints Ai+1 = Ai − Ri + Ui
      parallel block
          compute some fi(Ai, Ui) and partial sum SP
          compute S = S + SP with exclusive write access
      if f(v(k)) < f(x(k)) then x(k+1) = v(k) else x(k+1) = x(k)
      k = k + 1
  od
SLIDE 63

3 Foundations

Example of a Parallel Program

  • warehousing (cont’d)

– variant A: data parallelism (4)

  • parallel program for distributed memory (MesMS)

  choose starting point x(0) and set k = 0
  while (halt condition not true) do
      compute some vi(k) and send to master process (MP)
      MP: receive all vi(k) and assemble v(k)
      MP: evaluate constraints Ai and distribute to all
      compute some fi(Ai, Ui)
      compute partial sum SP and send to MP
      MP: receive all SP and compute S
      MP: check for f(v(k)) < f(x(k)) and update x(k+1)
      MP: send new x(k+1) to all
      k = k + 1
  od
SLIDE 64

3 Foundations

Example of a Parallel Program

  • warehousing (cont’d)

– variant B: function parallelism (1)

  • compute different mutations of v(k) and proceed with the best one

– each process computes a different mutation vP(k)
– each process computes f(vP(k))
– find the minimal f(vmin(k)) within all f(vP(k))
– check if f(vmin(k)) < f(x(k)) and set x(k+1) correspondingly
– MemMS: synchronisation necessary due to finding the global minimum f(vmin(k)) of the local values f(vP(k))
– MesMS: each process sends its f(vP(k)) to one dedicated (master) process and retrieves x(k+1)

SLIDE 65

3 Foundations

Example of a Parallel Program

  • warehousing (cont’d)

– variant B: function parallelism (2)

  • parallel program for shared memory (MemMS)

  choose starting point x(0) and set k = 0
  while (halt condition not true) do
      parallel block
          compute vP(k) and f(vP(k))
      wait for all threads to be finished
      find f(vmin(k)) within all f(vP(k))
      check for f(vmin(k)) < f(x(k)) and update x(k+1)
      k = k + 1
  od
SLIDE 66

3 Foundations

Example of a Parallel Program

  • warehousing (cont’d)

– variant B: function parallelism (3)

  • parallel program for distributed memory (MesMS)

  choose starting point x(0) and set k = 0
  while (halt condition not true) do
      compute vP(k)
      compute f(vP(k)) and send to master process (MP)
      MP: receive all f(vP(k)) and find f(vmin(k))
      MP: check for f(vmin(k)) < f(x(k)) and update x(k+1)
      MP: send new x(k+1) to all
      k = k + 1
  od
SLIDE 67

3 Foundations

Example of a Parallel Program

  • warehousing (cont’d)

– variant C: competitive parallelism

  • parallel program for shared memory (MemMS)

  STOP = false
  choose starting point x(0) and set k = 0
  parallel block
      while (halt condition not true) do
          compute one iteration x(k) with algorithm AlgP
          check for STOP, else continue
          k = k + 1
      od
      if (STOP) finish else STOP = true