SLIDE 1 Terminology
Programmierung Paralleler und Verteilter Systeme (PPV) Sommer 2015
Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc.
SLIDE 2
Terminology
Parallel Programming Concepts | 2013 / 2014
SLIDE 3 Terminology
"When two trains approach each other at a crossing, both shall come to a full stop and neither shall start up again until the other has gone." [Kansas legislature, early 20th century]
SLIDE 4 Terminology
■ Concurrency
□ Capability of a system to have two or more activities in progress at the same time
□ May be independent, loosely coupled, or closely coupled
□ Classical operating system responsibility for better utilization of CPU, memory, network, and other resources
□ Demands scheduling and synchronization
■ Parallelism
□ Capability of a system to execute activities simultaneously
□ Demands parallel hardware, concurrency support, (and communication)
■ Any parallel program is a concurrent program
■ Some concurrent programs cannot be run as parallel programs
SLIDE 5
Terminology
■ Concurrency vs. parallelism vs. distribution
□ Two threads started by the application
◊ Define concurrent activities in the program code
◊ Might (!) be executed in parallel
◊ Can be distributed on different machines
■ Management of concurrent activities in an operating system
□ Multiple applications being executed at the same time
□ Single application leveraging threads for speedup / scaleup
□ Non-sequential operating system activities
"The vast majority of programmers today don't grok concurrency, just as the vast majority of programmers 15 years ago didn't yet grok objects" [Herb Sutter, 2005]
SLIDE 6
Concurrency [Breshears]
■ Processes / threads represent the execution of atomic statements
□ "Atomic" can be defined on different granularity levels, e.g. a source code line
□ Concurrency should be treated as an abstract concept
■ Concurrent execution
□ Interleaving of multiple atomic instruction streams
□ Leads to unpredictable results
◊ Non-deterministic scheduling, interrupts
□ A concurrent algorithm should maintain its properties for all possible interleavings of sequential activities
□ Example: All instructions are eventually included (fairness)
■ Some literature distinguishes between interleaving (uniprocessor) and overlapping (multiprocessor) of statements
SLIDE 7
Concurrency
■ In hardware
□ Context switch support
■ In operating systems
□ Native process / thread support
□ Synchronization support
■ In virtual runtime environments
□ Java / .NET thread support
■ In middleware
□ J2EE / CORBA thread pooling
■ In programming languages
□ Asynchronous and event-based programming
(Figure: layered stack – operating system, virtual runtime, middleware, and server applications on top)
SLIDE 8
Example: Operating System
(Figure: single-threaded vs. multi-threaded process – code, data, and files are shared by all threads, while each thread has its own registers and stack)
SLIDE 9
Concurrency Is Hard
■ Sharing of global resources
□ Concurrent reads and writes on the same global resource (variable) make ordering a critical issue
■ Optimal management of resource allocation
□ A process gets control over an I/O channel and is then suspended before using it
■ Programming errors become non-deterministic
□ The order of interleaving may or may not activate the bug
■ Happens even on uniprocessors
■ Race condition
□ The result of an operation depends on the order of execution
□ Well-known issue since the 1960s, identified by E. Dijkstra
SLIDE 10
Race Condition
■ One piece of code in one process, executed at the same time …
□ … by two threads on a single core.
□ … by two threads on two cores.
■ What happens?
void echo() {
    char_in = getchar();     /* char_in and char_out are shared global variables */
    char_out = char_in;
    putchar(char_out);
}
SLIDE 11 Potential Deadlock
[Stallings]
SLIDE 12 Actual Deadlock
[Stallings]
SLIDE 13
Terminology
Deadlock
■ Two or more processes / threads are unable to proceed
■ Each is waiting for one of the others to do something
Livelock
■ Two or more processes / threads continuously change their states in response to changes in the other processes / threads
■ No global progress for the application
Race condition
■ Two or more processes / threads are executed concurrently
■ The final result of the application depends on the relative timing of their execution
SLIDE 14
Terminology
Starvation
■ A runnable process / thread is overlooked indefinitely
■ Although it is able to proceed, it is never chosen to run (dispatching / scheduling)
Atomic Operation
■ Function or action implemented as a sequence of one or more instructions
■ Appears to be indivisible - no other process / thread can see an intermediate state or interrupt the operation
■ Executed as a group, or not executed at all
Mutual Exclusion
■ The requirement that while one process / thread is using a resource, no other shall be allowed to do so
SLIDE 15 From Concurrency to Parallelism
(Figure: from concurrency to parallelism – programs spawn processes and tasks, which are mapped to processors; shared-memory nodes with several processors and a common memory are connected by a network)
SLIDE 16
Parallelism for …
■ Speedup – compute faster
■ Throughput – compute more in the same time
■ Scalability – compute faster / more with additional resources
■ Price / performance – be as fast as possible for given money
■ Scavenging – compute faster / more with idle resources
(Figure: scaling up – more processing elements attached to one main memory; scaling out – more nodes, each with its own main memory)
SLIDE 17
The Parallel Programming Problem
(Figure: does the parallel application match the execution environment? Configuration vs. flexible type)
SLIDE 18
Parallelism [Mattson et al.]
■ Task – a parallel program breaks a problem into tasks
■ Execution unit
□ Representation of a concurrently running task (e.g. thread)
□ Tasks are mapped to execution units during development time
■ Processing element
□ Hardware element running one execution unit
□ Depends on scenario – logical processor vs. core vs. machine
□ Execution units run simultaneously on processing elements, controlled by the scheduling entity
■ Synchronization
□ Mechanism to order activities of parallel tasks
■ Race condition
□ Program result depends on the scheduling order
SLIDE 19
Parallel Processing
■ Inside the processor
□ Instruction-level parallelism (ILP)
□ Multicore
□ Shared memory
■ With multiple processing elements in one machine
□ Multiprocessing
□ Shared memory
■ With multiple processing elements in many machines
□ Multicomputer
□ Shared nothing (in terms of a globally accessible memory)
SLIDE 20 Multiprocessor: Flynn‘s Taxonomy (1966)
■ Classify multiprocessor architectures along the instruction and data processing dimensions
■ Single Instruction, Single Data (SISD)
■ Single Instruction, Multiple Data (SIMD)
■ Multiple Instruction, Single Data (MISD)
■ Multiple Instruction, Multiple Data (MIMD)
(Figure: (C) Blaise Barney)
SLIDE 21
Another Taxonomy (Tanenbaum)
(Figure: MIMD parallel and distributed computers split into multiprocessors (shared memory) and multicomputers (private memory), each either bus-based or switched)
SLIDE 22 Another Taxonomy (Foster)
■ Multicomputer
□ Set of connected von Neumann computers (DM-MIMD)
□ Each computer runs a local program in local memory and sends / receives messages
□ Local memory access is less expensive than remote memory access
(Figure: von Neumann computers – central unit with control unit and arithmetic logic unit, memory, and input / output on a bus – connected by an interconnect)
SLIDE 23
Shared Memory vs. Shared Nothing
■ Organization of parallel processing hardware as …
□ Shared memory system
◊ Concurrent processes can directly access a common address space
◊ Typically implemented as a memory hierarchy, with different cache levels
◊ Examples: SMP systems, distributed shared memory systems, virtual runtime environments
□ Shared nothing system
◊ Concurrent processes can only access local memory and exchange messages with other processes
◊ Message exchange is typically orders of magnitude slower than memory access
◊ Examples: cluster systems, distributed systems (Hadoop, Grids, …)
SLIDE 24
Shared Memory vs. Shared Nothing
■ Pfister: "shared memory" vs. "distributed memory"
■ Foster: "multiprocessor" vs. "multicomputer"
■ Tanenbaum: "shared memory" vs. "private memory"
(Figure: shared memory – processes on several processors access one common shared memory; shared nothing – each process holds its own data and the processes exchange messages)
SLIDE 25
Shared Memory
■ All processors act independently and use the same global address space; changes in one memory location are visible to all others
■ Uniform memory access (UMA) system
□ Equal load and store access for all processors to all memory
□ Default approach for SMP systems of the past
■ Non-uniform memory access (NUMA) system
□ Delay on memory access according to the accessed region
□ Typically realized by processor networks and local memories
◊ Cache-coherent NUMA (CC-NUMA), completely implemented in hardware
◊ Became the standard approach with recent x86 chips
SLIDE 26
UMA Example
Two dual-core chips (2 cores/socket)
P = processor core
L1D = Level 1 cache, data (fastest)
L2 = Level 2 cache (fast)
Memory = main memory (slow)
Chipset = enforces cache coherence and mediates connections to memory
SLIDE 27
NUMA Example
Eight cores (4 cores/socket); L3 = Level 3 cache
Memory interface = establishes a coherent link to enable one 'logical' single address space over 'physically distributed memory'
SLIDE 28 NUMA Example: Intel Nehalem
(Figure: Intel Nehalem – four sockets with four cores each, a per-socket L3 cache and memory controller with local memory, and QPI links between the sockets and to I/O)
SLIDE 29 Example: Intel Xeon Phi
■ Tag Directory (TD) per L2 cache
■ 4 groups of 16 cores, 4 threads per core
■ 512-bit SIMD vector unit / core
■ Multiple rings
□ Data, addresses, coherence information
■ Gather / scatter address machinery
(George Chrysos, Intel)
SLIDE 30
Shared Nothing
■ Processing elements no longer share a common global memory, but can only exchange messages
■ Allows an easy scale-out by just adding machines to the network
■ Messaging has more overhead than shared memory access
■ Cluster computing: combine machines with cheap interconnect
□ Compute cluster: speedup for an application
◊ Batch processing (embarrassingly parallel workload)
◊ Scalable data parallelism (Google search)
◊ Parallel applications
□ Load-balancing cluster: better throughput for a service
□ High Availability (HA) cluster: fault tolerance
■ Trade-off: investment for more memory in one shared memory system vs. investment for more machines in a shared nothing cluster
SLIDE 31
Shared Nothing Example
Processors communicate via network interfaces (NI)
The NI mediates the connection to a communication network
This setup is rarely used as a programming model view today
SLIDE 32 Clusters
(Figure: cluster – each node runs a process on its own processor with private data; the nodes exchange messages)
SLIDE 33
Shared Nothing
■ Always there, but widely ignored by the 'average' developer
■ High-end systems
□ Toy Story (1995) – 100 dual-processor machines as render farm
□ Toy Story 2 (1999) – 1,400 processors
□ Monsters Inc. (2001) – 250 servers with 14 processors each = 3,500 CPUs
□ June 2013 TOP 500 #1
◊ Tianhe-2 cluster, 3,120,000 Xeon E5 cores, 1,024,000 GB RAM
■ Massively parallel processing (MPP)
□ Clusters at maximum scale for scientific High-Performance Computing (HPC)
■ Grid computing: combination of clusters at different locations
SLIDE 34
MPP Systems
SLIDE 35 Shared-Nothing Workload [Pfister]
(Figure: shared-nothing workload quadrants – axes are bulk data traffic (inter-node and memory bandwidth stress) and synchronization traffic (inter-node latency stress); LSLD is 'Parallel Nirvana', HSHD is 'Parallel Hell'; UMA, NUMA, and cluster systems suit different quadrants)
SLIDE 36 Terminology: Distributed System
■ Tanenbaum (Distributed Operating Systems): "A distributed system is a collection of independent computers that appear to the users of the system as a single computer."
■ Coulouris et al.: "... [system] in which hardware or software components located at networked computers communicate and coordinate their actions only by passing messages."
■ Lamport: "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."
■ Concurrency, no global clock, independent failures, heterogeneity, openness, security, scalability, failure handling, …
SLIDE 37
Hybrid Environments
(Figure: hybrid environment – two shared-memory nodes, each with processors, caches, and local memory, coupled by a high-speed interconnect)
SLIDE 38
Hybrid Environments
Shared-memory nodes (here ccNUMA) with local NIs
The NI mediates connections to other remote 'SMP nodes'
SLIDE 39 Hybrid Environments
(Figure: clients issuing requests against several server application stacks, each running on an operating system, virtual runtime, and middleware with multiple processors)
■ Intra-request parallelism for response time
■ Inter-request parallelism for throughput and fault tolerance
■ Speedup -> intra-request -> SM-MIMD (SMP)
■ Scaleup -> inter-request -> DM-MIMD (Cluster)
SLIDE 40
Example: Cluster of Nehalem SMPs
(Figure: cluster of Nehalem SMP nodes connected by a network)
SLIDE 41
The Parallel Programming Problem
(Figure: does the parallel application match the execution environment? Configuration vs. flexible type)
SLIDE 42
Programming Paradigm
■ Programming paradigm
□ Coding convention or standard
□ Something a majority of people agrees upon
■ Parallel programming is one of these paradigms
□ Others: declarative, constraint-based, object-oriented
■ Each paradigm can be realized by a set of programming models
■ Programming model: "set of rules for a game" [Almasi]
□ Programs and algorithms as game strategies
□ The point where execution environment and application meet
□ High-level view of the application on its runtime environment
□ Hardware might imply a model, but does not enforce it
□ For uniprocessors, no question due to the "von Neumann" model
□ Delivering performance while raising the level of abstraction
SLIDE 43
Programming Models
■ High-level view of the application on its execution environment
□ Intended as a contract – if you follow the rules, scalability should be achievable
□ Decouples software and execution environment architecture development
■ Classification in this course
□ Multi-tasking: typically used for SM-MIMD execution environments, recently also relevant for SM-SIMD
□ Message passing: typically used for DM-MIMD environments
□ Implicit parallelism: works for all environments
□ No enforcement, different mappings are possible
■ Programming models are implemented by languages and libraries
SLIDE 44
Parallel Programming Languages
■ Programming languages consist of syntax and a standard library
□ Example: C + libc, Python + class library
□ Allows adding parallel programming to sequential languages
□ Often languages implement more than one programming model
■ Languages are often categorized by their feasibility for a workload – "data-parallel languages" vs. "task-parallel languages"
■ Most algorithmic problems match one kind of execution environment: "data-parallel problem" vs. "task-parallel problem"
Multi-Tasking: PThreads, OpenMP, OpenCL, Linda, Cilk, ...
Message Passing: MPI, PVM, CSP Channels, Actors, ...
Implicit Parallelism: Map/Reduce, PLINQ, HPF, Lisp, Fortress, ...
Mixed Approaches: Ada, Scala, Clojure, Erlang, X10, ...