SLIDE 1

Parallel Programming

Instructor: PanteA Zardoshti
Department of Computer Engineering, Sharif University of Technology
e-mail: azad@sharif.edu

SLIDE 2

Objective

  • Learn how to program numerical methods

SLIDE 3

Single CPU Performance

  • Past:
  • Doubled every 2 years for 40 years until 9 years ago.
  • Current Situation:
  • Marginal improvement in the last 9 years.
  • Main Reasons:
  • Memory Wall
  • Power Wall
  • Processor Design Complexity

SLIDE 4

Memory Wall

[Figure: the processor-memory performance gap is growing. Source: Intel]

SLIDE 5

Power Wall

[Figure: power trends in microprocessors. Source: Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006]

SLIDE 6

What is Serial Computing?

  • Traditionally, software has been written for serial computation:
  • To be run on a single computer having a single Central Processing Unit (CPU);
  • A problem is broken into a discrete series of instructions.
  • Instructions are executed one after another.
  • Only one instruction may execute at any moment in time.

[Figure: a single CPU executing one instruction per time step t1, t2, t3, ...]
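To make this concrete, here is a minimal sketch (not from the slides) of serial computation in C: one thread of control executes the iterations strictly one after another.

```c
/* Serial computation sketch: a single CPU executes each iteration
 * one after another, one instruction at a time. */
#include <stdio.h>

int main(void) {
    double sum = 0.0;
    /* Iterations run strictly in order: i = 0, then 1, then 2, ... */
    for (int i = 0; i < 1000000; i++)
        sum += (double)i;
    printf("sum = %.0f\n", sum);
    return 0;
}
```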

SLIDE 7

What is Parallel Computing?

  • Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem.
  • To be run using multiple CPUs
  • A problem is broken into discrete parts that can be solved concurrently
  • Each part is further broken down to a series of instructions
  • Instructions from each part execute simultaneously on different CPUs

[Figure: multiple CPUs, each executing its own instruction stream at time steps t1, t2, t3, ...]
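Since this course uses OpenMP, here is a hedged sketch (not from the slides) of the same loop as above, but with its iterations divided among threads that run simultaneously. Compile with, e.g., gcc -fopenmp.

```c
/* OpenMP sketch: the problem (a large sum) is broken into parts that
 * different CPUs work on at the same time. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    double sum = 0.0;
    /* Each thread sums a chunk of the iterations; reduction(+) combines
     * the per-thread partial sums into one result. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000000; i++)
        sum += (double)i;
    printf("sum = %.0f (up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}
```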

SLIDE 8

Serial vs. Parallel

[Figure: side-by-side comparison of serial and parallel execution]

SLIDE 9

Parallel Computing: Resources

  • The compute resources can include:
  • A single computer with multiple processors;
  • A single computer with (multiple) processor(s) and some specialized computer resources (GPU, Xeon Phi, …);
  • An arbitrary number of computers connected by a network;
  • A combination of both.

SLIDE 10

Why Parallel Computing?

  • The primary reasons for using parallel computing:
  • Save time
  • Solve larger problems
  • Provide concurrency (do multiple things at the same time)

SLIDE 11

Principles of Parallel Computing

  • Finding enough parallelism (Amdahl's Law)
  • Granularity
  • Locality
  • Load balance
  • Coordination and synchronization

SLIDE 12

Amdahl's Law

  • Let (1 − f) be the fraction of work that must be done sequentially, so f is the fraction that is parallelizable
  • Let P be the number of processors

Speedup(P) = Time(1) / Time(P)
Maximum Speedup ≤ 1 / ((1 − f) + f / P)

  • Example:
  • Let f be 80% (0.8)
  • Speedup cannot be more than 5 even if you have hundreds of processors
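As a worked check of the slide's claim (a sketch, not from the original deck): with f = 0.8 the bound 1 / ((1 − f) + f/P) approaches 1 / 0.2 = 5 as P grows, no matter how many processors are added.

```c
/* Evaluating Amdahl's bound 1 / ((1 - f) + f / P) for f = 0.8. */
#include <stdio.h>

int main(void) {
    double f = 0.8;  /* parallelizable fraction from the slide's example */
    int procs[] = {1, 2, 4, 8, 100, 1000};
    for (int i = 0; i < 6; i++) {
        int P = procs[i];
        double speedup = 1.0 / ((1.0 - f) + f / P);
        printf("P = %4d  max speedup = %.2f\n", P, speedup);
    }
    return 0;  /* the bound never exceeds 1 / (1 - f) = 5 */
}
```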

SLIDE 13

Granularity

  • Overhead of Parallelism
  • cost of starting a thread or process
  • cost of communicating shared data
  • cost of synchronizing
  • extra (redundant) computation
  • Tradeoff
  • Large units of work reduce the overhead of parallelism
  • Large units of work reduce the available parallelism
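In OpenMP this tradeoff shows up directly in the chunk size of a schedule clause. A hedged sketch (illustrative, not from the slides): large chunks mean fewer scheduling events, hence less overhead, but fewer independent units left for balancing across threads.

```c
/* Granularity sketch: the chunk size in schedule(dynamic, chunk) sets
 * the unit of work handed out to each thread. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

/* Hypothetical per-element computation standing in for real work. */
static double work(int i) { return (double)i * 0.5; }

int main(void) {
    static double out[N];
    double t = omp_get_wtime();
    /* Coarse grain (chunk = 10000): few scheduling events, low overhead,
     * but only N/10000 units available for load balancing.
     * Fine grain (chunk = 1) would maximize available parallelism at the
     * cost of one scheduling event per iteration. */
    #pragma omp parallel for schedule(dynamic, 10000)
    for (int i = 0; i < N; i++) out[i] = work(i);
    printf("done in %f s, out[42] = %f\n", omp_get_wtime() - t, out[42]);
    return 0;
}
```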

SLIDE 14

Locality

  • Large memories are slow
  • Fast memories are small
  • Algorithms should do most of the work on local data
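A classic illustration in C (a sketch, not from the deck): traversing a matrix in row-major order touches memory contiguously and stays in the small fast cache, while column order strides across the large slow memory.

```c
/* Locality sketch: C stores matrices row-major, so the i-then-j loop
 * walks consecutive addresses a[i][0], a[i][1], ... (cache-friendly). */
#include <stdio.h>
#define N 2048

static double a[N][N];

int main(void) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)      /* good locality: consecutive */
        for (int j = 0; j < N; j++)  /* addresses within each row   */
            sum += a[i][j];
    /* Swapping the loops (j outer, i inner) computes the same sum but
     * touches a[0][j], a[1][j], ... which are N*8 bytes apart: the same
     * work runs far slower on real machines due to cache misses. */
    printf("%f\n", sum);
    return 0;
}
```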

SLIDE 15

Load Balance

  • Load imbalance is the time that some processors in the system are idle due to
  • Insufficient parallelism (during that phase)
  • Unequal size tasks
  • Algorithm needs to balance the load
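When tasks have unequal sizes, OpenMP's dynamic schedule is one way to rebalance; a hedged sketch (illustrative, not from the slides) follows.

```c
/* Load-balance sketch: iterations do unequal work, so a static split
 * would leave some threads idle; schedule(dynamic) hands out iterations
 * on demand, so threads that finish early keep taking new work. */
#include <omp.h>
#include <stdio.h>

/* Hypothetical task whose cost grows with i: unequal task sizes. */
static double task(int i) {
    double x = 0.0;
    for (int k = 0; k < i * 100; k++) x += 1.0 / (k + 1.0);
    return x;
}

int main(void) {
    double total = 0.0;
    #pragma omp parallel for schedule(dynamic) reduction(+:total)
    for (int i = 0; i < 1000; i++)
        total += task(i);
    printf("total = %f\n", total);
    return 0;
}
```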

SLIDE 16

Coordination and Synchronization

  • Communications
  • Parallel tasks typically need to exchange data. There are several ways this can be accomplished, such as through a shared memory bus or over a network; however, the actual event of data exchange is commonly referred to as communications regardless of the method employed.
  • Synchronization
  • The coordination of parallel tasks in real time, very often associated with communications. It is often implemented by establishing a synchronization point within an application where a task may not proceed until another task (or tasks) reaches the same or a logically equivalent point.
  • Synchronization usually involves waiting by at least one task.
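A hedged OpenMP sketch (not from the slides) of both ideas: a critical section makes threads take turns updating shared data, and a barrier is a synchronization point no thread may pass until all threads reach it.

```c
/* Synchronization sketch: critical = one thread at a time (others wait);
 * barrier = no thread proceeds until every thread has arrived. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    int ready = 0;                /* shared counter */
    #pragma omp parallel
    {
        int id = omp_get_thread_num();

        #pragma omp critical      /* threads take turns updating ready */
        ready++;

        #pragma omp barrier       /* wait until every thread has counted in */

        if (id == 0)              /* after the barrier, the count is complete */
            printf("%d threads reached the sync point\n", ready);
    }
    return 0;
}
```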

SLIDE 17

Parallel Computing Platforms

  • Shared Memory vs. Distributed Memory
  • SIMD vs. MIMD
  • ILP vs. TLP

SLIDE 18

Shared vs. Distributed Memory

  • Shared Memory:
  • Single physical address space
  • All processors have access to a pool of shared memory
  • Example: most existing multiprocessors
  • Distributed Memory:
  • Each processor has its own memory
  • No direct access to remote memory (message passing is required)
  • Example: clusters

[Figure: four CPUs sharing one memory vs. four CPU-memory pairs connected by a network]
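OpenMP itself is a shared-memory model: all threads see one address space, as the small sketch below shows (an illustration, not from the slides). On a distributed-memory cluster the same exchange would require explicit message passing, e.g. with MPI.

```c
/* Shared-memory sketch: one array lives in a single address space and
 * every OpenMP thread reads and writes it directly -- no messages needed. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    int slot[64] = {0};               /* shared by all threads */
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        if (id < 64) slot[id] = id + 1;   /* each thread writes its own slot */
    }
    /* After the parallel region the initial thread reads every other
     * thread's write directly. */
    int writers = 0;
    for (int i = 0; i < 64; i++) if (slot[i]) writers++;
    printf("%d threads wrote into the shared array\n", writers);
    return 0;
}
```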

SLIDE 19

SIMD vs. MIMD

  • Single Instruction Multiple Data (SIMD) model:
  • A large number of (usually) small processors.
  • A single "control processor" issues each instruction.
  • Each processor executes the same instruction.
  • Examples:
  • Processor Arrays: Connection Machine CM-2, MasPar MP-1, MP-2
  • Vector Pipelines: IBM 9000, Cray C90, Fujitsu VP, NEC SX-2, Hitachi S820
  • Multiple Instruction Multiple Data (MIMD) model:
  • Each processor executes its own sequence of instructions independently
  • Examples: Intel iPSC Hypercube, Intel Paragon

SLIDE 20

SIMD Example: SSE

  • Stands for: Streaming SIMD Extensions
  • Available in most x86 (and x86_64) CPUs today
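A minimal sketch of SSE in C (not from the slides; assumes GCC or Clang on an x86 machine): two float arrays are added four elements at a time, so each instruction operates on multiple data.

```c
/* SSE sketch: one _mm_add_ps performs four float additions in a single
 * instruction. Compile with e.g. gcc -msse sse_add.c */
#include <xmmintrin.h>  /* SSE intrinsics */
#include <stdio.h>

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    for (int i = 0; i < 8; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);    /* 4 adds at once */
        _mm_storeu_ps(&c[i], vc);          /* store 4 results */
    }

    for (int i = 0; i < 8; i++) printf("%.0f ", c[i]);  /* prints 9 eight times */
    printf("\n");
    return 0;
}
```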

SLIDE 21

ILP vs. TLP

  • Instruction-Level Parallelism (ILP)
  • Parallelism among individual instructions
  • Automatically extracted by the microprocessor
  • Implicit
  • Limited by
  • Pipeline width
  • Instruction dependencies
  • Thread-Level Parallelism (TLP)
  • Multiple concurrent tasks
  • Requires programmer's involvement
  • Explicit

Slide Source: Dr. Khunjush, Shiraz University
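A hedged sketch of explicit TLP with OpenMP sections (not from the slides): the programmer names two independent tasks and the runtime may run them on different threads, whereas any ILP inside each loop is extracted implicitly by the processor.

```c
/* TLP sketch: the programmer explicitly marks two independent tasks;
 * OpenMP may run each section on its own thread. ILP *within* each
 * loop body is left to the CPU's pipeline. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    double a = 0.0, b = 0.0;
    #pragma omp parallel sections
    {
        #pragma omp section                                   /* task 1 */
        for (int i = 1; i <= 1000000; i++) a += 1.0 / i;

        #pragma omp section                                   /* task 2 */
        for (int i = 1; i <= 1000000; i++) b += 1.0 / ((double)i * i);
    }
    printf("a = %f, b = %f\n", a, b);
    return 0;
}
```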