
slide-1
SLIDE 1

Parallel Programming and Heterogeneous Computing

A2 - Parallel Hardware

Max Plauth, Sven Köhler, Felix Eberhardt, Lukas Wenzel and Andreas Polze Operating Systems and Middleware Group

slide-2
SLIDE 2

Chart 2

Types of Parallel Hardware

Lukas Wenzel ParProg 2020 A2 Parallel Hardware

Data Level Parallelism

The same operation is applied in parallel to multiple units of data.

Task Level Parallelism

Multiple operations are executed in parallel.

[Figure: instruction streams (I) and data streams (D). Data level parallelism: one instruction stream applied to many data streams. Task level parallelism: several instruction streams, each with its own data streams.]
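The two definitions above can be contrasted in a minimal Python sketch. The function names and the sample data are made up for illustration, and a thread pool merely stands in for parallel hardware (CPython threads show the program structure, not necessarily a speedup for CPU-bound work).

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """The single operation applied to every unit of data."""
    return x * x


def total(values):
    """One of several distinct operations that run side by side."""
    return sum(values)


def largest(values):
    """Another distinct operation, independent of total()."""
    return max(values)


data = [1, 2, 3, 4, 5, 6, 7, 8]

with ThreadPoolExecutor(max_workers=4) as pool:
    # Data level parallelism: the same operation on multiple units of data.
    squares = list(pool.map(square, data))

    # Task level parallelism: multiple different operations at the same time.
    sum_future = pool.submit(total, data)
    max_future = pool.submit(largest, data)
    data_sum = sum_future.result()
    data_max = max_future.result()

print(squares, data_sum, data_max)
```

In the first case the work is split along the data, in the second along the operations; real programs often combine both.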

slide-3
SLIDE 3

Hardware Taxonomy [Flynn1966]

SISD: Single Instruction stream, Single Data stream

SIMD: Single Instruction stream, Multiple Data streams

MISD: Multiple Instruction streams, Single Data stream

MIMD: Multiple Instruction streams, Multiple Data streams

[Figure: 2×2 grid with the axes "Multiple Data Streams" and "Multiple Instruction Streams"; each cell shows the instruction (I) and data (D) streams of its class.]
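As a small illustration of how the four class names are formed, the following Python sketch derives the class from the number of instruction and data streams. The function name `flynn_class` is hypothetical and not part of the lecture material.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn class name for the given numbers of streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("need at least one instruction and one data stream")
    instr = "S" if instruction_streams == 1 else "M"  # Single or Multiple
    data = "S" if data_streams == 1 else "M"
    return f"{instr}I{data}D"


for streams in [(1, 1), (1, 8), (3, 1), (4, 4)]:
    print(streams, flynn_class(*streams))
```

The taxonomy only distinguishes "one" from "more than one" on each axis, which is why a scalar core and a 24-core CPU end up in different classes while a 2-node and a 4608-node cluster share one.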

slide-4
SLIDE 4

Hardware Taxonomy [Flynn1966]

[Figure: 2×2 grid with the axes "Multiple Data Streams" and "Multiple Instruction Streams". Each cell (SISD, SIMD, MISD, MIMD) shows example instruction sequences built from LD, ADD, SUB, MUL, DIV, CMP, BGE and ST operating on operands such as A, B, C, D, T; the SIMD cell uses vector operands A0 ... An, B0 ... Bn, C0 ... Cn.]

slide-5
SLIDE 5

MISD Hardware

Most exotic class of parallel hardware, not in mainstream use.

Typical systems: redundant systems such as safety-critical embedded controllers or high-reliability mainframes.

Parallelism serves dependability, not performance.

Not covered in this lecture.

Example: Triple Modular Redundant Architecture

[Figure: the input feeds Sub-System A, Sub-System B and Sub-System C; a Voter combines their results into the output.]
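A triple modular redundant architecture can be sketched in a few lines of Python. This is an illustrative model only: the three sub-system functions and the injected fault are invented for the example, and the sub-systems run one after another here, whereas real hardware evaluates them in parallel.

```python
from collections import Counter


def vote(results):
    """Return the value reported by a strict majority of sub-systems."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among sub-system results")
    return value


def tmr(sub_systems, value):
    """Feed the same input to every sub-system and vote on the outputs."""
    return vote([sub_system(value) for sub_system in sub_systems])


# Hypothetical sub-systems: three implementations of "double the input",
# one of which is faulty.
def sub_system_a(x):
    return 2 * x


def sub_system_b(x):
    return x + x


def sub_system_c(x):
    return 2 * x + 1  # injected fault


output = tmr([sub_system_a, sub_system_b, sub_system_c], 21)
print(output)  # prints 42: the faulty result 43 is outvoted
```

The voter masks a single faulty sub-system; if all three disagree, no majority exists and the sketch raises an error instead of guessing.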

slide-6
SLIDE 6

SIMD Hardware

Popular class of parallel hardware for special purpose systems.

Typical systems: vector processors.

Early examples: ILLIAC IV, Cray-1, ...

Recently in widespread use: GPUs and instruction set extensions (AltiVec, SSE, AVX, ...).

Covered in chapter C.

[Figure: ILLIAC IV control unit; Cray-1; NVIDIA Pascal GPU module.]

slide-7
SLIDE 7

MIMD Hardware

Classic and most general class of parallel hardware.

Typical systems: a wide range, from multicore CPUs to supercomputers and clusters.

The variety of architectures and characteristics requires further distinction.

[Figure: POWER9 die with 24 cores; Summit supercomputer.]

slide-8
SLIDE 8

MIMD Hardware Taxonomy

MIMD divides into two classes:

SM-MIMD (Shared Memory)

Processing elements can directly access a common address space.

DM-MIMD (Distributed Memory)

Processing elements can access their private address spaces and exchange messages.

[Figure: SM-MIMD: several processing elements, each running multiple tasks, access data in one shared memory. DM-MIMD: several processing elements, each running multiple tasks and holding data in a private memory, exchange messages over an interconnect / network.]
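The difference between the two interaction styles can be sketched in Python. This is a simplified model under stated assumptions: threads stand in for processing elements, and because threads of one process always share memory, the message-based variant only imitates the distributed-memory discipline (private data, interaction by messages alone). All names are illustrative.

```python
import queue
import threading

values = list(range(1, 101))
chunks = [values[i::4] for i in range(4)]  # one chunk per processing element

# SM-MIMD style: all workers update one shared variable directly.
shared = {"total": 0}
lock = threading.Lock()


def shared_worker(chunk):
    partial = sum(chunk)
    with lock:  # coordinate access to the common address space
        shared["total"] += partial


# DM-MIMD style: each worker keeps private data and only sends a message.
mailbox = queue.Queue()


def message_worker(chunk):
    partial = sum(chunk)  # private to this worker
    mailbox.put(partial)  # the only interaction is the message


def run(worker):
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


run(shared_worker)
run(message_worker)
message_total = sum(mailbox.get() for _ in chunks)

print(shared["total"], message_total)  # both are 5050
```

The shared-memory variant needs a lock to keep concurrent updates consistent; the message-based variant needs no lock, but every interaction has to be made explicit.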

slide-9
SLIDE 9

MIMD Hardware Taxonomy

SM-MIMD

e.g. multicore CPUs

Low interaction overhead due to high coupling between processing elements.

Corresponds to shared memory parallelism. Covered in chapter B.

DM-MIMD

e.g. clusters

Highly scalable due to low coupling between processing elements.

Corresponds to shared nothing parallelism. Covered in chapter D.

Terminology

shared memory system vs. distributed memory system

SM-MIMD vs. DM-MIMD

multiprocessor vs. multicomputer

see [Tanenbaum1985], [Foster1995], [Pfister1998]

slide-10
SLIDE 10

SM-MIMD Hardware

Processing elements can directly access a common address space.

Uniform memory access (UMA) system

Processing elements observe the same memory access characteristics over the entire memory.

Simple to program against, but scales poorly.

Non-uniform memory access (NUMA) system

Processing elements observe different access characteristics for different memory regions.

Scales well, but NUMA-unaware programs can exhibit performance issues.
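Why NUMA-unaware programs can suffer is easiest to see with a toy cost model. The Python sketch below is an assumption-laden illustration: the latencies of 100 ns (local), 200 ns (remote) and 150 ns (UMA) are invented round numbers, not measurements, and the model ignores caches and contention.

```python
# Assumed latencies in nanoseconds; real values depend on the machine.
LOCAL_NS = 100
REMOTE_NS = 200


def average_latency_ns(local_fraction, local_ns=LOCAL_NS, remote_ns=REMOTE_NS):
    """Average latency when `local_fraction` of accesses hit local memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


nodes = 4
# NUMA-unaware: data spread evenly, so only 1/nodes of the accesses are local.
unaware = average_latency_ns(1 / nodes)
# NUMA-aware: data placed next to the processing element that uses it.
aware = average_latency_ns(1.0)
# UMA: every access costs the same, modelled here as local == remote.
uma = average_latency_ns(1 / nodes, local_ns=150, remote_ns=150)

print(unaware, aware, uma)  # 175.0 100.0 150.0
```

Under these assumed numbers, placement does not matter on the UMA system, while on the NUMA system the same program is either the fastest or the slowest of the three cases depending on where its data lives.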

slide-11
SLIDE 11

SM-MIMD Hardware

MIMD
  SM-MIMD (Shared Memory)
    UMA (Uniform Memory Access)
    NUMA (Non-Uniform Memory Access)
  DM-MIMD (Distributed Memory)

[Figure: UMA: several PEs attached to a single memory. NUMA: several nodes, each pairing a PE with its own memory.]

slide-12
SLIDE 12

DM-MIMD Hardware

Processing elements can access their private address spaces and exchange messages.

Cluster: multiple independent machines connected through a network.

Compute cluster: speedup

Load balancing cluster: throughput

High availability cluster: dependability

All clusters are distributed systems, but only compute clusters are intended for parallel workloads. This lecture considers only compute clusters.

slide-13
SLIDE 13

DM-MIMD Hardware

Simple way of scaling available compute resources: just connect multiple machines in a network.

Dominant architecture for high-end systems, especially in high-performance computing.

[Figure: cluster of desktop computers; cluster of Raspberry Pi single-board computers; Summit cluster.]

1995 Toy Story render farm

117 nodes × 2 CPUs = 234 CPUs

2001 Monsters Inc. render farm

250 nodes × 14 CPUs = 3500 CPUs

2019 Summit cluster (TOP500 #1 in 2019)

4608 nodes, 2 PB RAM, 10 MW power

4608 nodes × 2 CPUs × 22 cores = 202 752 cores

4608 nodes × 6 GPUs = 27 648 GPUs
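The unit counts quoted for the three clusters follow from multiplying the node count by the units per node at each level. A short Python check, using a hypothetical helper `total_units`, reproduces the figures:

```python
def total_units(nodes, *per_level):
    """Multiply the node count by the unit count at each nested level."""
    total = nodes
    for count in per_level:
        total *= count
    return total


toy_story_cpus = total_units(117, 2)      # nodes × CPUs per node
monsters_inc_cpus = total_units(250, 14)  # nodes × CPUs per node
summit_cores = total_units(4608, 2, 22)   # nodes × CPUs × cores per CPU
summit_gpus = total_units(4608, 6)        # nodes × GPUs per node

print(toy_story_cpus, monsters_inc_cpus, summit_cores, summit_gpus)
# prints 234 3500 202752 27648
```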

slide-14
SLIDE 14

Literature

[Flynn1966] "Very High-Speed Computing Systems" Flynn, Michael J. Proceedings of the IEEE 54.12 (1966) IEEE

[Tanenbaum1985] "Distributed Operating Systems" Tanenbaum, Andrew S. and Van Renesse, Robbert. ACM Computing Surveys 17.4 (1985) ACM

[Foster1995] "Designing and Building Parallel Programs" Foster, Ian (1995) Addison-Wesley

[Pfister1998] "In Search of Clusters" Pfister, Gregory F. 2nd edition (1998) Prentice-Hall Inc.

slide-15
SLIDE 15

And now for a break and a bowl of Sencha*.

*or beverage of your choice