Parallel Programming and Heterogeneous Computing
A2 - Parallel Hardware
Max Plauth, Sven Köhler, Felix Eberhardt, Lukas Wenzel and Andreas Polze
Operating Systems and Middleware Group
Types of Parallel Hardware
Lukas Wenzel ParProg 2020 A2 Parallel Hardware
Data Level Parallelism
The same operation is applied in parallel to multiple units of data.
Task Level Parallelism
Multiple operations are executed in parallel.
[Figure: instruction streams (I) and data streams (D) illustrating data level and task level parallelism]
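The two definitions above can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names `square`, `data_level` and `task_level` are hypothetical, and the threads of a standard CPython interpreter show the structure of the two kinds of parallelism rather than a real speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """The single operation used in the data level example."""
    return x * x


def data_level(values):
    """Data level parallelism: the same operation applied to many data items."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(square, values))


def task_level(values):
    """Task level parallelism: different operations executed concurrently."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {
            "sum": pool.submit(sum, values),
            "max": pool.submit(max, values),
            "min": pool.submit(min, values),
        }
        return {name: future.result() for name, future in futures.items()}


values = [1, 2, 3, 4]
print(data_level(values))  # [1, 4, 9, 16]
print(task_level(values))  # {'sum': 10, 'max': 4, 'min': 1}
```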
Hardware Taxonomy [Flynn1966]
Classification by number of instruction streams and number of data streams:
SISD: Single Instruction stream, Single Data stream
SIMD: Single Instruction stream, Multiple Data streams
MISD: Multiple Instruction streams, Single Data stream
MIMD: Multiple Instruction streams, Multiple Data streams
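The four class names follow mechanically from the two stream counts. A minimal sketch, assuming a hypothetical helper `flynn_class` that is not part of the lecture:

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn class name for the given numbers of streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("at least one instruction stream and one data stream required")
    i = "S" if instruction_streams == 1 else "M"  # Single or Multiple Instruction
    d = "S" if data_streams == 1 else "M"         # Single or Multiple Data
    return f"{i}I{d}D"


print(flynn_class(1, 1))  # SISD
print(flynn_class(1, 8))  # SIMD
print(flynn_class(3, 1))  # MISD
print(flynn_class(4, 4))  # MIMD
```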
Hardware Taxonomy [Flynn1966]
[Figure: example instruction sequences (LD, ADD, SUB, MUL, DIV, CMP, BGE, ST) for the four classes SISD, SIMD, MISD and MIMD, operating on operands such as A, B, C and on vector elements A0..An, B0..Bn, C0..Cn]
MISD Hardware
Most exotic class of parallel hardware, not in mainstream use.
= Redundant systems like safety-critical embedded controllers or high-reliability mainframes
■ Parallelism not for performance, but dependability
Not covered in this lecture.
[Figure: Example: Triple Modular Redundant Architecture. The Input feeds Sub-System A, Sub-System B and Sub-System C; a Voter produces the Output.]
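The voter in a triple modular redundant architecture can be sketched as a majority vote over the three sub-system results. This is a simplified illustration, not a description of any particular system: `vote`, `tmr` and the three sub-system functions are hypothetical names, and results are assumed to be hashable values that can be compared for equality.

```python
from collections import Counter


def vote(results):
    """Majority voter: return the value that more than half of the results agree on."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among sub-system results")
    return value


def tmr(sub_systems, x):
    """Feed the same input to every sub-system and vote on the outputs."""
    return vote([sub_system(x) for sub_system in sub_systems])


def sub_system_a(x):
    return x * 2


def sub_system_b(x):
    return x + x


def sub_system_c(x):
    return x * 2 + 1  # deliberately faulty sub-system for the demonstration


print(tmr([sub_system_a, sub_system_b, sub_system_c], 21))  # 42
```

A single faulty sub-system is outvoted by the other two, which is why the parallelism here serves dependability rather than performance.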
SIMD Hardware
Popular class of parallel hardware for special purpose systems.
= Vector processors
■ Early examples: ILLIAC IV, Cray-1, ...
Recently in widespread use:
■ GPUs
■ Instruction Set Extensions (AltiVec, SSE, AVX, ...)
Covered in chapter C.
[Figures: ILLIAC IV Control Unit; Cray-1; NVidia Pascal GPU Module]
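The core idea of a vector instruction, one operation applied to all lanes of a vector register, can be emulated in plain Python. This is only a sketch: `simd_apply` and `broadcast` are hypothetical helpers, and Python processes the lanes one after another, whereas SIMD hardware processes them simultaneously.

```python
import operator


def simd_apply(op, a, b):
    """Apply one operation to every pair of lanes (one instruction, many data)."""
    if len(a) != len(b):
        raise ValueError("both vector registers need the same number of lanes")
    return [op(x, y) for x, y in zip(a, b)]


def broadcast(scalar, lanes):
    """Replicate a scalar into every lane of a vector register."""
    return [scalar] * lanes


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
c = simd_apply(operator.add, a, b)                # one vector ADD
a = simd_apply(operator.mul, b, broadcast(2, 4))  # one vector MUL by 2
print(c)  # [11, 22, 33, 44]
print(a)  # [20, 40, 60, 80]
```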
MIMD Hardware
Classic and most general class of parallel hardware.
= Wide range of systems from Multicore CPUs to Supercomputers and Clusters
➢ Variety of architectures and characteristics requires further distinction
[Figures: POWER9 Die with 24 Cores; Summit Supercomputer]
MIMD Hardware Taxonomy
MIMD
SM-MIMD (Shared Memory)
Processing elements can directly access a common address space
e.g. Multicore CPUs
■ Low interaction overhead due to high coupling between processing elements
~ Shared Memory Parallelism
Covered in chapter B.
DM-MIMD (Distributed Memory)
Processing elements can access their private address spaces and exchange messages
e.g. Clusters
■ Highly scalable due to low coupling between processing elements
~ Shared Nothing Parallelism
Covered in chapter D.
[Figure: SM-MIMD: processing elements running tasks access data in a shared memory. DM-MIMD: processing elements with private memories exchange messages over an interconnect / network.]
Terminology
shared memory system vs. distributed memory system
SM-MIMD vs. DM-MIMD
Multiprocessor vs. Multicomputer
see [Tanenbaum1985], [Foster1995], [Pfister1998]
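The difference between the two programming styles can be sketched with Python threads. This is an emulation under assumptions: both variants run inside one process, so the "private" data in the second variant is private only by convention, and a real DM-MIMD system would send its messages over a network. The function names are hypothetical.

```python
import queue
import threading


def run_workers(worker, chunks):
    """Start one thread per chunk and wait for all of them to finish."""
    threads = [threading.Thread(target=worker, args=(chunk,)) for chunk in chunks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


def shared_memory_sum(chunks):
    """SM-MIMD style: all workers update one shared variable, guarded by a lock."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:  # direct access to the common address space
            total += partial

    run_workers(worker, chunks)
    return total


def message_passing_sum(chunks):
    """DM-MIMD style: workers keep private data and only exchange messages."""
    mailbox = queue.Queue()

    def worker(chunk):
        private = list(chunk)      # private copy, never touched by other workers
        mailbox.put(sum(private))  # the only interaction is a message

    run_workers(worker, chunks)
    return sum(mailbox.get() for _ in chunks)


chunks = [[1, 2, 3], [4, 5], [6]]
print(shared_memory_sum(chunks))    # 21
print(message_passing_sum(chunks))  # 21
```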
SM-MIMD Hardware
MIMD
SM-MIMD (Shared Memory)
UMA (Uniform Memory Access)
NUMA (Non-Uniform Memory Access)
DM-MIMD (Distributed Memory)
Processing elements can directly access a common address space
■ Uniform memory access (UMA) system
Processing elements observe the same memory access characteristics over the entire memory.
➢ Simple to program against, but scalability issues
■ Non-uniform memory access (NUMA) system
Processing elements have different access characteristics for different memory regions
➢ Scales well, but unaware programs can exhibit performance issues
[Figure: UMA: several PEs attached to one Memory. NUMA: several Nodes, each with its own Memory and PE.]
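Why NUMA-unaware programs can run into performance issues can be illustrated with a toy cost model. The latency values and function names below are made up for illustration and do not describe any real machine.

```python
LOCAL_NS = 100   # hypothetical latency for memory on the PE's own node
REMOTE_NS = 300  # hypothetical latency for memory on another node


def access_cost(pe_node, memory_node, uniform=False):
    """Cost of one access; under UMA every access costs the same."""
    if uniform or pe_node == memory_node:
        return LOCAL_NS
    return REMOTE_NS


def total_cost(pe_node, memory_nodes, uniform=False):
    """Sum the cost of a sequence of accesses issued by one processing element."""
    return sum(access_cost(pe_node, node, uniform) for node in memory_nodes)


scattered = [0, 0, 1, 3]  # data spread over several nodes (NUMA-unaware)
local = [0, 0, 0, 0]      # data placed on the PE's own node (NUMA-aware)

print(total_cost(0, scattered, uniform=True))  # 400
print(total_cost(0, scattered))                # 800
print(total_cost(0, local))                    # 400
```

In this model the same access pattern costs twice as much on the NUMA system unless the data is placed on the node of the processing element that uses it.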
DM-MIMD Hardware
Processing elements can access their private address spaces and exchange messages
Cluster: Multiple independent machines connected through a network
□ Compute cluster: Speedup
□ Load Balancing cluster: Throughput
□ High Availability cluster: Dependability
All clusters are distributed systems, but only compute clusters are intended for parallel workloads.
This lecture considers only compute clusters.
DM-MIMD Hardware
Simple way of scaling available compute resources: Just connect multiple machines in a network.
Dominant architecture for High-End Systems: Especially High-Performance Computing
[Figures: Cluster of Desktop Computers; Cluster of Raspberry Pi Single-Board Computers; Summit Cluster]
1995 Toy Story Render Farm: 117 nodes × 2 CPUs = 234 CPUs
2001 Monsters Inc. Render Farm: 250 nodes × 14 CPUs = 3500 CPUs
2019 Summit cluster (TOP500 #1 in 2019): 4608 nodes, 2 PB RAM, 10 MW power
4608 nodes × 2 CPUs × 22 Cores = 202 752 Cores
4608 nodes × 6 GPUs = 27 648 GPUs
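The totals on this slide are plain products of node count and per-node resources. A short check, using a hypothetical helper `total_units`:

```python
def total_units(nodes, per_node, per_unit=1):
    """Multiply node count by units per node and sub-units per unit."""
    return nodes * per_node * per_unit


toy_story_cpus = total_units(117, 2)      # 117 nodes x 2 CPUs
monsters_inc_cpus = total_units(250, 14)  # 250 nodes x 14 CPUs
summit_cores = total_units(4608, 2, 22)   # 4608 nodes x 2 CPUs x 22 cores
summit_gpus = total_units(4608, 6)        # 4608 nodes x 6 GPUs

print(toy_story_cpus)     # 234
print(monsters_inc_cpus)  # 3500
print(summit_cores)       # 202752
print(summit_gpus)        # 27648
```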
Literature
[Flynn1966] "Very High-Speed Computing Systems" Flynn, Michael J. Proceedings of the IEEE 54.12 (1966) IEEE
[Tanenbaum1985] "Distributed Operating Systems" Tanenbaum, Andrew S. and Van Renesse, Robbert. ACM Computing Surveys 17.4 (1985) ACM
[Foster1995] "Designing and Building Parallel Programs" Foster, Ian (1995) Addison-Wesley
[Pfister1998] "In Search of Clusters" Pfister, Gregory F. 2nd edition (1998) Prentice-Hall Inc
And now for a break and a bowl of Sencha.
*or beverage of your choice