parallel programming Paolo Burgio paolo.burgio@unimore.it - - PowerPoint PPT Presentation



slide-1
SLIDE 1

An introduction to parallel programming

Paolo Burgio

paolo.burgio@unimore.it

slide-2
SLIDE 2

Definitions

› Parallel computing

– Partition computation across different compute engines

› Distributed computing

– Partition computation across different machines

Same principle, more general

slide-3
SLIDE 3

Outline

› Introduction to "traditional" programming

– Writing code
– Operating systems
– …

› Why do we need parallel programming?

– Focus on programming shared memory

› Different ways of parallel programming

– PThreads
– OpenMP
– MPI?
– GPU/accelerators programming

slide-4
SLIDE 4

As an aside…

› A bit of computer architecture

– We will understand why…
– Focus on shared memory systems

› A bit of algorithms

– We will understand why…

› A bit of performance analysis

– Which is our ultimate goal!
– Being able to identify bottlenecks

slide-5
SLIDE 5

Programming basics

slide-6
SLIDE 6

Take-aways

› Programming basics

– Variables
– Functions
– Loops

› Programming stacks

– BSP
– Operating systems
– Runtimes

› Computer architectures

– Computing domains
– Single processor/multiple processors
– From single- to multi- to many-core

slide-7
SLIDE 7

Why do we need parallel computing?

Increase performance of our machines

› Scale-up

– Solve a "bigger" problem in the same time

› Scale-out

– Solve the same problem in less time

7

slide-8
SLIDE 8

Yes but..

› Why (highly) parallel machines…
› …and not faster single-core machines?

slide-9
SLIDE 9

The answer #1 - Money

9

slide-10
SLIDE 10

The answer #2 – the "hot" one

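Moore's law

› "The number of transistors that we can pack in a given die area doubles every 18 months" (Moore's revised 1975 estimate was two years; 18 months is a commonly quoted variant)

Dennard scaling

› As transistors shrink, power density stays constant, so "performance per watt of computing is growing exponentially at roughly the same rate"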

slide-11
SLIDE 11

The answer #2 – the "hot" one

[Figure: trends over time in transistors (K's), clock (MHz), power (W), and performance per clock (ILP)]

› SoC design paradigm
› Gordon Moore

– His law is still valid, but…

Performance → frequency

slide-12
SLIDE 12

The answer #2 – the "hot" one

[Figure: trends over time in transistors (K's), clock (MHz), power (W), and performance per clock (ILP)]

› SoC design paradigm
› Gordon Moore

– His law is still valid, but…

› "The free lunch is over"

– Herb Sutter, 2005

Performance → parallelism

slide-13
SLIDE 13

In other words…

[Figure: timeline from 1970 to 2020. Heat reference labels: summer temperature, hot plate, nuclear reactor, rocket nozzle, surface of the sun. Era labels: first PCs, the explosion of the web, modern computers.]

slide-14
SLIDE 14

Instead of going faster..

› ..(go faster but through) parallelism!

Problem #1

› New computer architectures
› At least, three architectural templates

Problem #2

› Need to efficiently program them
› HPC already has this problem!

The problem

› Programmers must know a bit of the architecture!
› To make parallelization effective
› "Let's run this on a GPU. It certainly goes faster" (cit.)

slide-15
SLIDE 15

The Big problem

› Effectively programming in parallel is difficult

“Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?”

– Brian Kernighan

slide-16
SLIDE 16

I am *really* sorry guys..

› I will give you code…
› ..but first I need to give you some maths…
› …and then, some architectural principles

slide-17
SLIDE 17

Amdahl's Law

slide-18
SLIDE 18

Amdahl's law

› A sequential program that takes 100 sec to execute
› Only 95% can run in parallel (it's a lot)
› And.. you are an extremely good programmer, and you have a machine with 1 billion cores, so that part takes 0 sec
› So, $T_{par} = 100\,\mathrm{sec} - 95\,\mathrm{sec} = 5\,\mathrm{sec}$

$\mathit{Speedup} = \frac{100\,\mathrm{sec}}{5\,\mathrm{sec}} = 20\mathrm{x}$

…20x, on one billion cores!!!

slide-19
SLIDE 19

Computer architecture

slide-20
SLIDE 20

Step-by-step

1. "Traditional" multi-cores

– Typically, shared-memory
– Max 8-16 cores
– This laptop

2. Many-cores

– GPUs but not only
– Heterogeneous architectures

3. More advanced stuff

– Field-Programmable Gate Arrays
– Neural Networks

slide-21
SLIDE 21

Symmetric multi-processing

› Memory: centralized with bus interconnect, I/O
› Typically, multi-core (sub)systems

– Examples: Sun Enterprise 6000, SGI Challenge, Intel (this laptop)

[Figure: four CPUs, each with one or more cache levels, connected to main memory and the I/O system. The interconnect can be 1 bus, N busses, or any network.]

slide-22
SLIDE 22

Asymmetric multi-processing

› Memory: centralized with uniform access time (UMA) and bus interconnect, I/O
› Typically, multi-core (sub)systems

– Examples: ARM big.LITTLE, NVIDIA Tegra X2 (Drive PX)

[Figure: CPU 0, CPU 1, CPU A, and CPU B, each with one or more cache levels, connected to main memory and the I/O system. The interconnect can be 1 bus, N busses, or any network.]

slide-23
SLIDE 23

SMP – distributed shared memory

› Non-Uniform Memory Access (NUMA): access time is not uniform
› Scalable interconnect

– Typically, many cores
– Examples: embedded accelerators, GPUs

[Figure: four CPUs, each with a cache ($) and a scratchpad memory (SPM), linked by a scalable interconnection to main memory and the I/O system]

slide-24
SLIDE 24

Go complex: NVIDIA's Tegra

› Complex heterogeneous system

– 3 ISAs
– 2 subdomains
– Shared memory between Big.SUPER host and GP-GPU

slide-25
SLIDE 25

UMA vs. NUMA

› Shared mem: every thread can access every memory item

– (Not considering security issues…)

› Uniform Memory Access (UMA) vs Non-Uniform Memory Access (NUMA)

– NUMA: access time differs depending on which memory space is accessed

[Figure: UMA, four CPUs sharing one main memory; NUMA, sixteen CPUs in four groups of four, each group with its own scratchpad memory (SPM)]

slide-26
SLIDE 26

UMA vs. NUMA

            MEM0      MEM1      MEM2      MEM3
CPU0…3      0 clock   10 clock  20 clock  10 clock
CPU4…7      10 clock  0 clock   10 clock  20 clock
CPU8…11     20 clock  10 clock  0 clock   10 clock
CPU12…15    10 clock  20 clock  10 clock  0 clock

slide-27
SLIDE 27

Some definitions

slide-28
SLIDE 28

What is…

› ..a core?

– An electronic circuit that executes instructions (=> programs)

› …a program?

– The implementation of an algorithm

› …a process?

– A program that is executing

› …a thread?

– A unit of execution (of a process)

› ..a task?

– A unit of work (of a program)

27

slide-29
SLIDE 29

What is…

[Figure labels: CORE 0 to CORE 3, MEM, SHARED MEM, code.c, DATA, process (P), threads (T), task (t)]

slide-30
SLIDE 30

What is a task?

› Operating System task
› Real-time task
› OpenMP task

slide-31
SLIDE 31

Symmetric multi-processing

› Memory: centralized with bus interconnect, I/O
› Typically, multi-core (sub)systems

– Examples: Sun Enterprise 6000, SGI Challenge, Intel (this laptop)

[Figure: four CPUs, each with one or more cache levels, connected to main memory and the I/O system by 1 bus, N busses, or any network; processes P0 and P1 and their threads (T) are mapped onto the CPUs]

slide-32
SLIDE 32

..start simple…

slide-33
SLIDE 33

Something you're used to..

› Multiple processes
› That communicate via shared data

[Figure: processes P0 and P1, one thread (T) each, read and write a shared DATUM; the mechanism connecting them is marked "????"]

slide-34
SLIDE 34

Howto #1 - MPI

› Multiple processes
› That communicate via shared data

[Figure: processes P0 and P1, one thread (T) each, exchange DATUM through MPI_Send and MPI_Recv]

slide-35
SLIDE 35

Howto #2 – UNIX pipes

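› Multiple processes
› That communicate via shared data

[Figure: processes P0 and P1, one thread (T) each, exchange DATUM through a pipe]

#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Both ends of a pipe must come from the same pipe() call, so the two
   processes are created with fork() after the pipe exists. */
int main(void)
{
    int fd[2];
    ssize_t nbytes;
    char string[] = "Hello, world!\n";
    char readbuffer[80];

    if (pipe(fd) == -1)
        return 1;

    if (fork() == 0) {
        /* Child (P0): send "string" through the output side of pipe */
        close(fd[0]);
        write(fd[1], string, strlen(string) + 1);
        return 0;
    }

    /* Parent (P1): receive "string" from the input side of pipe */
    close(fd[1]);
    nbytes = read(fd[0], readbuffer, sizeof(readbuffer));
    if (nbytes > 0)
        printf("Received: %s", readbuffer);
    wait(NULL);
    return 0;
}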

slide-36
SLIDE 36

Howto #3 – Files

› Multiple processes
› That communicate via shared data

[Figure: processes P0 and P1, one thread (T) each, exchange DATUM through File.txt]

slide-37
SLIDE 37

Shared memory

› Coherence problem

– Memory consistency issue
– Data races

› Can share data ptrs

– Easy to use

[Figure: process P0 with three threads (T) that read and write a DATUM in shared memory]

slide-38
SLIDE 38

References

› "Calcolo parallelo" website

– http://hipert.unimore.it/people/paolob/pub/PhD/index.html

› My contacts

– paolo.burgio@unimore.it
– http://hipert.mat.unimore.it/people/paolob/

› Useful links
› A "small blog"

– http://www.google.com