Terminology: Programmierung Paralleler und Verteilter Systeme (PPV)



SLIDE 1

Terminology

Programmierung Paralleler und Verteilter Systeme (PPV) Sommer 2015

Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze
SLIDE 2

Terminology

Parallel Programming Concepts | 2013 / 2014

SLIDE 3

Terminology

"When two trains approach each other at a crossing, both shall come to a full stop and neither shall start up again until the other has gone." [Kansas legislature, early 20th century]

SLIDE 4

Terminology

■ Concurrency
  □ Capability of a system to have two or more activities in progress at the same time
  □ May be independent, loosely coupled, or closely coupled
  □ Classical operating system responsibility for better utilization of CPU, memory, network, and other resources
  □ Demands scheduling and synchronization
■ Parallelism
  □ Capability of a system to execute activities simultaneously
  □ Demands parallel hardware, concurrency support (and communication)
■ Any parallel program is a concurrent program
■ Some concurrent programs cannot be run as parallel programs

SLIDE 5

Terminology

■ Concurrency vs. parallelism vs. distribution
  □ Two threads started by the application
    ◊ Define concurrent activities in the program code
    ◊ Might (!) be executed in parallel
    ◊ Can be distributed on different machines
■ Management of concurrent activities in an operating system
  □ Multiple applications being executed at the same time
  □ Single application leveraging threads for speedup / scaleup
  □ Non-sequential operating system activities

"The vast majority of programmers today don't grok concurrency, just as the vast majority of programmers 15 years ago didn't yet grok objects." [Herb Sutter, 2005]

SLIDE 6

Concurrency [Breshears]

■ Processes / threads represent the execution of atomic statements
  □ "Atomic" can be defined on different granularity levels, e.g. source code line
  □ Concurrency should be treated as an abstract concept
■ Concurrent execution
  □ Interleaving of multiple atomic instruction streams
  □ Leads to unpredictable results
    ◊ Non-deterministic scheduling, interrupts
  □ A concurrent algorithm should maintain its properties for all possible interleavings of sequential activities
  □ Example: All instructions are eventually included (fairness)
■ Some literature distinguishes between interleaving (uniprocessor) and overlapping (multiprocessor) of statements

SLIDE 7

Concurrency

■ In hardware
  □ Context switch support
■ In operating systems
  □ Native process / thread support
  □ Synchronization support
■ In virtual runtime environments
  □ Java / .NET thread support
■ In middleware
  □ J2EE / CORBA thread pooling
■ In programming languages
  □ Asynchronous and event-based programming

Figure: server applications stacked on middleware, virtual runtime, and operating system layers.

SLIDE 8

Example: Operating System

Figure: single-threaded process (code, data, files, registers, stack) vs. multi-threaded process, where code, data, and files are shared and each thread has its own registers and stack.

SLIDE 9

Concurrency Is Hard

■ Sharing of global resources
  □ Concurrent reads and writes on the same global resource (variable) make ordering a critical issue
■ Optimal management of resource allocation
  □ A process gets control over an I/O channel and is then suspended before using it
■ Programming errors become non-deterministic
  □ The order of interleaving may or may not activate the bug
■ All of this happens even on uniprocessors
■ Race condition
  □ The result of an operation depends on the order of execution
  □ Well-known issue since the 1960s, identified by E. Dijkstra

SLIDE 10

Race Condition

■ One piece of code in one process, executed at the same time …
  □ … by two threads on a single core.
  □ … by two threads on two cores.
■ What happens?

void echo() {
    char_in = getchar();   /* char_in and char_out are shared globals */
    char_out = char_in;
    putchar(char_out);
}

SLIDE 11

Potential Deadlock


[Stallings]

SLIDE 12

Actual Deadlock


[Stallings]

SLIDE 13


Terminology

Deadlock
■ Two or more processes / threads are unable to proceed
■ Each is waiting for one of the others to do something

Livelock
■ Two or more processes / threads continuously change their states in response to changes in the other processes / threads
■ No global progress for the application

Race condition
■ Two or more processes / threads are executed concurrently
■ The final result of the application depends on the relative timing of their execution

SLIDE 14


Terminology

Starvation
■ A runnable process / thread is overlooked indefinitely
■ Although it is able to proceed, it is never chosen to run (dispatching / scheduling)

Atomic operation
■ Function or action implemented as a sequence of one or more instructions
■ Appears to be indivisible: no other process / thread can see an intermediate state or interrupt the operation
■ Executed as a group, or not executed at all

Mutual exclusion
■ The requirement that while one process / thread is using a resource, no other is allowed to do so

SLIDE 15

From Concurrency to Parallelism

Figure: from concurrency to parallelism: programs map to processes (tasks) on a processor; multiple processors share memory within a node; nodes are connected by a network.

SLIDE 16

Parallelism for …

■ Speedup – compute faster
■ Throughput – compute more in the same time
■ Scalability – compute faster / more with additional resources
■ Price / performance – be as fast as possible for given money
■ Scavenging – compute faster / more with idle resources

Figure: scaling up adds processing elements to one main memory; scaling out adds nodes, each with its own main memory.

SLIDE 17

The Parallel Programming Problem

Figure: does the parallel application match the execution environment? (configuration, flexibility, type)

SLIDE 18

Parallelism [Mattson et al.]

■ Task: a parallel program breaks a problem into tasks
■ Execution unit
  □ Representation of a concurrently running task (e.g. thread)
  □ Tasks are mapped to execution units during development time
■ Processing element
  □ Hardware element running one execution unit
  □ Depends on scenario: logical processor vs. core vs. machine
  □ Execution units run simultaneously on processing elements, controlled by the scheduling entity
■ Synchronization
  □ Mechanism to order activities of parallel tasks
■ Race condition
  □ Program result depends on the scheduling order


SLIDE 19

Parallel Processing

■ Inside the processor
  □ Instruction-level parallelism (ILP)
  □ Multicore
  □ Shared memory
■ With multiple processing elements in one machine
  □ Multiprocessing
  □ Shared memory
■ With multiple processing elements in many machines
  □ Multicomputer
  □ Shared nothing (in terms of a globally accessible memory)


SLIDE 20

Multiprocessor: Flynn's Taxonomy (1966)

■ Classifies multiprocessor architectures along the instruction and data processing dimensions

■ Single Instruction, Single Data (SISD)
■ Single Instruction, Multiple Data (SIMD)
■ Multiple Instruction, Single Data (MISD)
■ Multiple Instruction, Multiple Data (MIMD)

(C) Blaise Barney

SLIDE 21

Another Taxonomy (Tanenbaum)

MIMD parallel and distributed computers split into:
■ Multiprocessors (shared memory): bus-based or switched
■ Multicomputers (private memory): bus-based or switched

SLIDE 22

Another Taxonomy (Foster)

■ Multicomputer
  □ Set of connected von Neumann computers (DM-MIMD)
  □ Each computer runs a local program in local memory and sends / receives messages
  □ Local memory access is less expensive than remote memory access

Figure: von Neumann units (control unit, arithmetic logic unit, memory, input / output on a bus) connected by an interconnect.

SLIDE 23

Shared Memory vs. Shared Nothing

■ Organization of parallel processing hardware as …
  □ Shared memory system
    ◊ Concurrent processes can directly access a common address space
    ◊ Typically implemented as a memory hierarchy, with different cache levels
    ◊ Examples: SMP systems, distributed shared memory systems, virtual runtime environments
  □ Shared nothing system
    ◊ Concurrent processes can only access local memory and exchange messages with other processes
    ◊ Message exchange is typically orders of magnitude slower than memory access
    ◊ Examples: cluster systems, distributed systems (Hadoop, grids, …)


SLIDE 24

Shared Memory vs. Shared Nothing

■ Pfister: "shared memory" vs. "distributed memory"
■ Foster: "multiprocessor" vs. "multicomputer"
■ Tanenbaum: "shared memory" vs. "private memory"

Figure: processors with processes accessing one shared memory vs. processors whose processes hold local data and exchange messages.

SLIDE 25

Shared Memory

■ All processors act independently and use the same global address space; changes in one memory location are visible to all others
■ Uniform memory access (UMA) system
  □ Equal load and store access for all processors to all memory
  □ Default approach for SMP systems of the past
■ Non-uniform memory access (NUMA) system
  □ Delay on memory access depends on the accessed region
  □ Typically realized by processor networks and local memories
    ◊ Cache-coherent NUMA (CC-NUMA), completely implemented in hardware
    ◊ Became the standard approach with recent x86 chips


SLIDE 26

UMA Example

Two dual-core chips (2 cores/socket):
■ P = processor core
■ L1D = Level 1 cache, data (fastest)
■ L2 = Level 2 cache (fast)
■ Memory = main memory (slow)
■ Chipset = enforces cache coherence and mediates connections to memory

SLIDE 27

NUMA Example

Eight cores (4 cores/socket); L3 = Level 3 cache. The memory interface establishes a coherent link to enable one 'logical' single address space over 'physically distributed memory'.

SLIDE 28

NUMA Example: Intel Nehalem

Figure: four sockets with four cores each, per-socket L3 cache, memory controller, and local memory, connected by QPI links and I/O.

SLIDE 29

Example: Intel Xeon Phi

■ Tag Directory (TD) per L2 cache
■ 4 groups of 16 cores, 4 threads per core
■ 512-bit SIMD vector unit per core
■ Multiple rings
  □ Data, addresses, coherence information
■ Gather / scatter address machinery


(George Chrysos, Intel)

SLIDE 30

Shared Nothing

■ Processing elements no longer share a common global memory, but can only exchange messages
■ Allows an easy scale-out by just adding machines to the network
■ Messaging has more overhead than shared memory access
■ Cluster computing: combine machines with cheap interconnect
  □ Compute cluster: speedup for an application
    ◊ Batch processing (embarrassingly parallel workload)
    ◊ Scalable data parallelism (Google search)
    ◊ Parallel applications
  □ Load-balancing cluster: better throughput for a service
  □ High Availability (HA) cluster: fault tolerance
■ Trade-off: investment in more memory for one shared memory system vs. investment in more machines for a shared nothing cluster


SLIDE 31

Shared Nothing Example

Processors communicate via network interfaces (NI); the NI mediates the connection to a communication network. From a programming-model view, this setup is rarely used in its pure form today.

SLIDE 32

Clusters

Figure: cluster nodes, each with a processor, a process, and local data, cooperating by exchanging messages.

SLIDE 33

Shared Nothing

■ Always there, but widely ignored by the 'average' developer
■ High-end systems
  □ Toy Story (1995): 100 dual-processor machines as a render farm
  □ Toy Story 2 (1999): 1,400 processors
  □ Monsters, Inc. (2001): 250 servers with 14 processors each = 3,500 CPUs
  □ June 2013 TOP500 #1
    ◊ Tianhe-2 cluster, 3,120,000 Xeon E5 cores, 1,024,000 GB RAM
■ Massively parallel processing (MPP)
  □ Clusters at maximum scale for scientific High-Performance Computing (HPC)
■ Grid computing: combination of clusters at different locations


SLIDE 34

MPP Systems


SLIDE 35

Shared-Nothing Workload [Pfister]

Figure: shared-nothing workloads plotted by synchronization traffic (inter-node latency stress) against bulk data traffic (inter-node and memory bandwidth stress), spanning LSLD ('Parallel Nirvana'), LSHD, HSLD, and HSHD ('Parallel Hell'); UMA, NUMA, and cluster systems suit different quadrants.

SLIDE 36

Terminology: Distributed System

■ Tanenbaum (Distributed Operating Systems): "A distributed system is a collection of independent computers that appear to the users of the system as a single computer."
■ Coulouris et al.: "... [system] in which hardware or software components located at networked computers communicate and coordinate their actions only by passing messages."
■ Lamport: "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."
■ Concurrency, no global clock, independent failures, heterogeneity, openness, security, scalability, failure handling, …

SLIDE 37

Hybrid Environments

Figure: two shared-memory nodes (processors with caches and a common memory each) coupled by a high-speed interconnect.

SLIDE 38

Hybrid Environments

Shared-memory nodes (here ccNUMA) with local NIs; the NI mediates connections to other remote 'SMP nodes'.

SLIDE 39

Hybrid Environments

■ Intra-request parallelism for response time: speedup → SM-MIMD (SMP)
■ Inter-request parallelism for throughput and fault tolerance: scaleup → DM-MIMD (cluster)

Figure: clients issuing requests against replicated server stacks (server application, middleware, virtual runtime, operating system) running on multiple processors.

SLIDE 40

Example: Cluster of Nehalem SMPs

Figure: several Nehalem SMP nodes connected by a network.

SLIDE 41

The Parallel Programming Problem

Figure: does the parallel application match the execution environment? (configuration, flexibility, type)

SLIDE 42

Programming Paradigm

■ Programming paradigm
  □ Coding convention or standard
  □ Something a majority of people agrees upon
■ Parallel programming is one of these paradigms
  □ Others: declarative, constraint-based, object-oriented
■ Each paradigm can be realized by a set of programming models
■ Programming model: "set of rules for a game" [Almasi]
  □ Programs and algorithms as game strategies
  □ Point where execution environment and application meet
  □ High-level view of the application on its runtime environment
  □ Hardware might imply a model, but does not enforce it
  □ For uniprocessors there is no question, due to the "von Neumann" model
  □ Goal: delivering performance while raising the level of abstraction


SLIDE 43

Programming Models

■ High-level view of the application on its execution environment
  □ Intended as a contract: if you follow the rules, scalability should be achievable
  □ Decouples software development and execution environment architecture
■ Classification in this course
  □ Multi-tasking: typically used for SM-MIMD execution environments, recently also relevant for SM-SIMD
  □ Message passing: typically used for DM-MIMD environments
  □ Implicit parallelism: works for all environments
  □ No enforcement; different mappings are possible
■ Programming models are implemented by languages and libraries


SLIDE 44

Parallel Programming Languages

■ Programming languages consist of syntax and a standard library
  □ Example: C + libc, Python + class library
  □ Allows adding parallel programming to sequential languages
  □ Languages often implement more than one programming model
■ Languages are often categorized by their feasibility for a workload: "data-parallel languages" vs. "task-parallel languages"
■ Most algorithmic problems match one kind of execution environment: "data-parallel problem" vs. "task-parallel problem"


■ Multi-Tasking: PThreads, OpenMP, OpenCL, Linda, Cilk, ...
■ Message Passing: MPI, PVM, CSP Channels, Actors, ...
■ Implicit Parallelism: Map/Reduce, PLINQ, HPF, Lisp, Fortress, ...
■ Mixed Approaches: Ada, Scala, Clojure, Erlang, X10, ...