parallel programming Paolo Burgio paolo.burgio@unimore.it - PowerPoint PPT Presentation

An introduction to parallel programming Paolo Burgio paolo.burgio@unimore.it

Definitions
› Parallel computing
  – Partition computation across different compute engines
› Distributed computing
  – Partition computation across different machines
Same principle, more general

Outline
› Introduction to "traditional" programming
  – Writing code
  – Operating systems
  – …
› Why do we need parallel programming?
  – Focus on programming shared memory
› Different ways of parallel programming
  – PThreads
  – OpenMP
  – MPI?
  – GPU/accelerators programming

As a side…
› A bit of computer architecture
  – We will understand why…
  – Focus on shared memory systems
› A bit of algorithms
  – We will understand why…
› A bit of performance analysis
  – Which is our ultimate goal!
  – Being able to identify bottlenecks

Programming basics

Take-aways
› Programming basics
  – Variables
  – Functions
  – Loops
› Programming stacks
  – BSP
  – Operating systems
  – Runtimes
› Computer architectures
  – Computing domains
  – Single processor/multiple processors
  – From single- to multi- to many-core

Why do we need parallel computing?
Increase performance of our machines
› Scale-up
  – Solve a "bigger" problem in the same time
› Scale-out
  – Solve the same problem in less time

Yes but..
› Why (highly) parallel machines…
› …and not faster single-core machines?

The answer #1 - Money

The answer #2 – the "hot" one
Moore's law
› "The number of transistors that we can pack in a given die area doubles every 18 months"
Dennard's scaling
› "performance per watt of computing is growing exponentially at roughly the same rate"

The answer #2 – the "hot" one
› SoC design paradigm
› Gordon Moore
  – His law is still valid, but…
› “The free lunch is over”
  – Herb Sutter, 2005
[Figure: curves over time labelled Transistors (K's), Clock frequency (MHz), Power (W), Perf/Clock (ILP), Performance, and parallelism]

In other words…
[Figure: timeline from 1970 to 2020 placing the first PCs, the explosion of the web, and modern computers against reference levels: summer temperature, hot plate, nuclear reactor, rocket nozzle, surface of the sun]

Instead of going faster..
› ..(go faster but through) parallelism!
Problem #1
› New computer architectures
› At least, three architectural templates
Problem #2
› Need to efficiently program them
› HPC already has this problem!
The problem
› Programmers must know a bit of the architecture!
› To make parallelization effective
› "Let's run this on a GPU. It certainly goes faster" (cit.)

The Big problem
› Effectively programming in parallel is difficult
“Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?”

I am *really* sorry guys..
› I will give you code…
› ..but first I need to give you some maths…
› …and then, some architectural principles

Amdahl's Law

Amdahl's law
› A sequential program that takes 100 sec to execute
› Only 95% can run in parallel (it's a lot)
› And.. you are an extremely good programmer, and you have a machine with 1 billion cores, so that part takes 0 sec
› So,
  $T_{par} = 100\,\mathrm{sec} - 95\,\mathrm{sec} = 5\,\mathrm{sec}$
  $Speedup = \frac{100\,\mathrm{sec}}{5\,\mathrm{sec}} = 20\times$
…20x, on one billion cores!!!
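The arithmetic on this slide is one instance of the general form of Amdahl's law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the program that can run in parallel and n is the number of cores. The sketch below is only an illustration of that formula; the function names amdahl_speedup and amdahl_limit are made up for this example and do not come from the slides.

```c
#include <assert.h>

/*
 * Amdahl's law (illustrative helper, name is hypothetical).
 * parallel_fraction: share of the run time that can be parallelized, in [0, 1].
 * cores:             number of cores working on the parallel part, >= 1.
 * Returns the speedup relative to the sequential run.
 */
double amdahl_speedup(double parallel_fraction, double cores)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(cores >= 1.0);

    double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / cores);
}

/*
 * Upper bound on the speedup when the parallel part takes no time at all,
 * which is the "1 billion cores" case on the slide.
 */
double amdahl_limit(double parallel_fraction)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction < 1.0);
    return 1.0 / (1.0 - parallel_fraction);
}
```

With p = 0.95 the limit is 20x, matching the slide, and on 8 cores the same formula gives only about 5.9x, since the sequential 5% already accounts for a large share of the remaining run time.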

Computer architecture

Step-by-step
1. "Traditional" multi-cores
  – Typically, shared-memory
  – Max 8-16 cores
  – This laptop
2. Many-cores
  – GPUs but not only
  – Heterogeneous architectures
3. More advanced stuff
  – Field-programmable Gate Arrays
  – Neural Networks

Symmetric multi-processing
› Memory: centralized with bus interconnect, I/O
› Typically, multi-core (sub)systems
  – Examples: Sun Enterprise 6000, SGI Challenge, Intel (this laptop)
[Figure: CPU 0 to 3, each with one or more cache levels, connected through an interconnect (can be 1 bus, N busses, or any network) to the main memory and the I/O system]

Asymmetric multi-processing
› Memory: centralized with uniform access time (UMA) and bus interconnect, I/O
› Typically, multi-core (sub)systems
  – Examples: ARM big.LITTLE, NVIDIA Tegra X2 (Drive PX)
[Figure: CPU 0, CPU 1, CPU A, and CPU B, each with one or more cache levels, connected through an interconnect (can be 1 bus, N busses, or any network) to the main memory and the I/O system]

SMP – distributed shared memory
› Non-Uniform Access Time - NUMA
› Scalable interconnect
  – Typically, many cores
  – Examples: embedded accelerators, GPUs
[Figure: CPU 0 to 3, each with a cache ($) and a ScratchPad Memory (SPM), connected through a scalable interconnection to the main memory and the I/O system]

Go complex: NVIDIA's Tegra
› Complex heterogeneous system
  – 3 ISAs
  – 2 subdomains
  – Shared memory between Big.SUPER host and GP-GPU

UMA vs. NUMA
› Shared mem: every thread can access every memory item
  – (Not considering security issues…)
› Uniform Memory Access (UMA) vs Non-Uniform Memory Access (NUMA)
  – Different access time for accessing different memory spaces

Access time, in clocks, from each CPU group to each memory (NUMA):

              MEM0  MEM1  MEM2  MEM3
  CPU0…3         0    10    20    10
  CPU4…7        10     0    10    20
  CPU8…11       20    10     0    10
  CPU12…15      10    20    10     0

[Figure: UMA, CPU 0 to 3 around one main memory; NUMA, four groups of four CPUs (0 to 3, 4 to 7, 8 to 11, 12 to 15), each group next to an SPM]
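To make the NUMA access-time table concrete, here is a minimal sketch that encodes it as a lookup. It assumes, as the zero diagonal of the table suggests, that MEMi is the memory local to the i-th group of four CPUs; the names numa_latency and access_latency are invented for this illustration.

```c
#include <assert.h>

#define NUM_GROUPS     4   /* CPU0…3, CPU4…7, CPU8…11, CPU12…15 */
#define CPUS_PER_GROUP 4

/* Access time in clocks, copied from the slide: row = CPU group, column = memory. */
static const int numa_latency[NUM_GROUPS][NUM_GROUPS] = {
    {  0, 10, 20, 10 },
    { 10,  0, 10, 20 },
    { 20, 10,  0, 10 },
    { 10, 20, 10,  0 },
};

/* Clocks needed by CPU number `cpu` (0..15) to reach memory `mem` (0..3). */
int access_latency(int cpu, int mem)
{
    assert(cpu >= 0 && cpu < NUM_GROUPS * CPUS_PER_GROUP);
    assert(mem >= 0 && mem < NUM_GROUPS);

    return numa_latency[cpu / CPUS_PER_GROUP][mem];
}
```

On a UMA machine the same function would return one constant for every pair. On NUMA, where a thread runs relative to its data changes the cost of every access, which is why data placement matters.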

Some definitions

What is…
› …a core?
  – An electronic circuit that executes instructions (=> programs)
› …a program?
  – The implementation of an algorithm
› …a process?
  – A program that is executing
› …a thread?
  – A unit of execution (of a process)
› …a task?
  – A unit of work (of a program)
[Figure: four cores (CORE 0 to 3) attached to a memory (MEM), with code.c as the program, P as a process with SHARED MEM, T as threads, t as a task, and DATA blocks]

What is a task?
› OpenMP task
› Operating System task
› Real-time task

Symmetric multi-processing
› Memory: centralized with bus interconnect, I/O
› Typically, multi-core (sub)systems
  – Examples: Sun Enterprise 6000, SGI Challenge, Intel (this laptop)
[Figure: processes P0 and P1 with their threads (T) placed on CPU 0 to 3; each CPU has one or more cache levels and reaches the main memory and the I/O system through an interconnect (can be 1 bus, N busses, or any network)]

..start simple…

Something you're used to..
› Multiple processes
› That communicate via shared data
[Figure: processes P0 and P1, each with a thread T, read and write a DATUM through a mechanism still to be defined (????)]

Howto #1 - MPI
› Multiple processes
› That communicate via shared data
[Figure: process P0 calls MPI_Send and process P1 calls MPI_Recv to exchange the DATUM]

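Howto #2 – UNIX pipes
› Multiple processes
› That communicate via shared data
[Figure: process P0 sends the DATUM to process P1 through a pipe]

P0 (sender):

    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd[2];
        char string[] = "Hello, world!\n";

        pipe(fd);
        /* Send "string" through the output side of pipe */
        write(fd[1], string, (strlen(string)+1));
        return(0);
    }

P1 (receiver):

    #include <unistd.h>

    int main(void)
    {
        int fd[2], nbytes;
        char readbuffer[80];   /* size chosen for illustration */

        pipe(fd);
        /* Receive "string" from the input side of pipe */
        nbytes = read(fd[0], readbuffer, sizeof(readbuffer));
        return(0);
    }

Note: both sides must use the same pipe, so in practice pipe(fd) is called once and the two descriptors are inherited across fork().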
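The slide shows only the sending and the receiving side of the pipe. For two processes to share one pipe, pipe() has to be called before fork() so that the child inherits the descriptors. The sketch below puts both sides into one helper under that assumption; the function name send_through_pipe and the choice of child as sender are illustrative, not taken from the slides.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Forks a child (playing P0) that writes msg, including its terminating
 * NUL, into a pipe. The parent (playing P1) reads it into buf.
 * Returns the number of bytes received, or -1 on error.
 * If msg does not fit into bufsize bytes, buf is left without a NUL.
 */
ssize_t send_through_pipe(const char *msg, char *buf, size_t bufsize)
{
    int fd[2];

    assert(msg != NULL && buf != NULL && bufsize > 0);

    if (pipe(fd) == -1)           /* fd[0] = read end, fd[1] = write end */
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: sender. It does not read, so close the read end. */
        close(fd[0]);
        size_t len = strlen(msg) + 1;
        ssize_t written = write(fd[1], msg, len);
        close(fd[1]);
        _exit(written == (ssize_t)len ? 0 : 1);
    }

    /* Parent: receiver. Closing the write end lets read() see end-of-file. */
    close(fd[1]);
    size_t total = 0;
    ssize_t n = 0;
    while (total < bufsize &&
           (n = read(fd[0], buf + total, bufsize - total)) > 0)
        total += (size_t)n;
    close(fd[0]);

    int status;
    waitpid(pid, &status, 0);     /* reap the child */

    return n < 0 ? -1 : (ssize_t)total;
}
```

Calling it with the slide's string "Hello, world!\n" and an 80-byte buffer should deliver 15 bytes: 14 characters plus the terminating NUL.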

Howto #3 – Files
› Multiple processes
› That communicate via shared data
[Figure: processes P0 and P1 exchange the DATUM through File.txt]
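As a rough sketch of the file-based approach, one process can write the datum to a file and another can read it back later. The helpers below (write_datum and read_datum are hypothetical names) deliberately leave out any synchronization: they assume the reader runs only after the writer has finished, which real code would have to enforce, for example with file locking.

```c
#include <assert.h>
#include <stdio.h>

/* Writer side (P0): stores one integer datum in the file at `path`.
 * Returns 0 on success, -1 on error. */
int write_datum(const char *path, int datum)
{
    assert(path != NULL);

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = fprintf(f, "%d\n", datum) > 0;
    if (fclose(f) != 0)           /* fclose flushes; a failure means data loss */
        ok = 0;
    return ok ? 0 : -1;
}

/* Reader side (P1): loads the integer datum from the file at `path`.
 * Returns 0 on success, -1 if the file is missing or malformed. */
int read_datum(const char *path, int *datum)
{
    assert(path != NULL && datum != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int ok = fscanf(f, "%d", datum) == 1;
    fclose(f);
    return ok ? 0 : -1;
}
```

Compared with a pipe, the file outlives both processes and needs no parent/child relationship, at the price of going through the file system and having to coordinate access by hand.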

Shared memory
› Coherence problem
  – Memory consistency issue
  – Data races
› Can share data pointers
  – Easy to use
[Figure: a single process P0 whose threads (T) read and write a DATUM in shared memory]
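The data-race problem named on this slide can be illustrated with POSIX threads, one of the approaches listed in the outline. In the sketch below several threads of one process increment the same counter through a shared pointer, and a mutex protects each increment. The names shared_counter, worker, and count_with_threads are made up for this example. Without the lock and unlock calls the increments could interleave and updates could be lost.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* State shared by all threads of the process (the DATUM of the slide). */
struct shared_counter {
    pthread_mutex_t lock;
    long value;        /* shared datum, protected by lock */
    long increments;   /* how many times each thread increments it */
};

/* Thread body: every thread receives a pointer to the same structure. */
static void *worker(void *arg)
{
    struct shared_counter *c = arg;

    for (long i = 0; i < c->increments; i++) {
        pthread_mutex_lock(&c->lock);    /* without the lock: a data race */
        c->value++;
        pthread_mutex_unlock(&c->lock);
    }
    return NULL;
}

/* Runs nthreads workers and returns the final counter value, or -1 on error. */
long count_with_threads(int nthreads, long increments)
{
    assert(nthreads > 0 && increments >= 0);

    struct shared_counter c = { .value = 0, .increments = increments };
    pthread_mutex_init(&c.lock, NULL);

    pthread_t *tids = malloc((size_t)nthreads * sizeof *tids);
    if (tids == NULL) {
        pthread_mutex_destroy(&c.lock);
        return -1;
    }

    int started = 0;
    while (started < nthreads &&
           pthread_create(&tids[started], NULL, worker, &c) == 0)
        started++;

    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    free(tids);
    pthread_mutex_destroy(&c.lock);
    return started == nthreads ? c.value : -1;
}
```

With 4 threads and 10000 increments each, the locked version always ends at 40000. Depending on the toolchain, the program may need to be built with the -pthread option.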

References
› "Calcolo parallelo" website
  – http://hipert.unimore.it/people/paolob/pub/PhD/index.html
› My contacts
  – paolo.burgio@unimore.it
  – http://hipert.mat.unimore.it/people/paolob/
› Useful links
› A "small blog"
  – http://www.google.com
