SLIDE 1

Parallel Programming

Instructor: PanteA Zardoshti
Department of Computer Engineering, Sharif University of Technology
e-mail: azad@sharif.edu

SLIDE 2

Objective

  • Learn how to program numerical methods

SLIDE 3

Single CPU Performance

  • Past:
  • Doubled every 2 years for 40 years until 9 years ago.
  • Current Situation:
  • Marginal improvement in the last 9 years.
  • Main Reasons:
  • Memory Wall
  • Power Wall
  • Processor Design Complexity

SLIDE 4

Memory Wall

[Figure: the processor-memory performance gap is growing. Source: Intel]

SLIDE 5

Power Wall

[Figure: power trends in microprocessors. Source: Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006]

SLIDE 6

What is Serial Computing?

  • Traditionally, software has been written for serial computation:
  • To be run on a single computer having a single Central Processing Unit (CPU);
  • A problem is broken into a discrete series of instructions.
  • Instructions are executed one after another.
  • Only one instruction may execute at any moment in time.

[Figure: a single CPU executing one instruction per time step t1, t2, t3, ...]
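To make this concrete, here is a minimal sketch (not from the slides) of serial computation in C: one thread of control executes the iterations strictly one after another.

```c
/* Serial computation sketch: a single CPU executes each iteration
 * one after another, one instruction at a time. */
#include <stdio.h>

int main(void) {
    double sum = 0.0;
    /* Iterations run strictly in order: i = 0, then 1, then 2, ... */
    for (int i = 0; i < 1000000; i++)
        sum += (double)i;
    printf("sum = %.0f\n", sum);
    return 0;
}
```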

SLIDE 7

What is Parallel Computing?

  • Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem.
  • To be run using multiple CPUs
  • A problem is broken into discrete parts that can be solved concurrently
  • Each part is further broken down to a series of instructions
  • Instructions from each part execute simultaneously on different CPUs

[Figure: multiple CPUs, each executing its own instruction stream at time steps t1, t2, t3, ...]
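Since this course uses OpenMP, here is a hedged sketch (not from the slides) of the same loop as above, but with its iterations divided among threads that run simultaneously. Compile with, e.g., gcc -fopenmp.

```c
/* OpenMP sketch: the problem (a large sum) is broken into parts that
 * different CPUs work on at the same time. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    double sum = 0.0;
    /* Each thread sums a chunk of the iterations; reduction(+) combines
     * the per-thread partial sums into one result. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000000; i++)
        sum += (double)i;
    printf("sum = %.0f (up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}
```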

SLIDE 8

Serial vs. Parallel

[Figure: side-by-side comparison of serial and parallel execution]

SLIDE 9

Parallel Computing: Resources

  • The compute resources can include:
  • A single computer with multiple processors;
  • A single computer with (multiple) processor(s) and some specialized computer resources (GPU, Xeon Phi, …);
  • An arbitrary number of computers connected by a network;
  • A combination of both.

SLIDE 10

Why Parallel Computing?

  • The primary reasons for using parallel computing:
  • Save time
  • Solve larger problems
  • Provide concurrency (do multiple things at the same time)

SLIDE 11

Principles of Parallel Computing

  • Finding enough parallelism (Amdahl's Law)
  • Granularity
  • Locality
  • Load balance
  • Coordination and synchronization

SLIDE 12

Amdahl's Law

  • Let (1 − f) be the fraction of work that must be done sequentially, so f is the fraction that is parallelizable
  • Let P be the number of processors

Speedup(P) = Time(1) / Time(P)
Maximum Speedup ≤ 1 / ((1 − f) + f / P)

  • Example:
  • Let f be 80% (0.8)
  • Speedup cannot be more than 5 even if you have hundreds of processors
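As a worked check of the slide's claim (a sketch, not from the original deck): with f = 0.8 the bound 1 / ((1 − f) + f/P) approaches 1 / 0.2 = 5 as P grows, no matter how many processors are added.

```c
/* Evaluating Amdahl's bound 1 / ((1 - f) + f / P) for f = 0.8. */
#include <stdio.h>

int main(void) {
    double f = 0.8;  /* parallelizable fraction from the slide's example */
    int procs[] = {1, 2, 4, 8, 100, 1000};
    for (int i = 0; i < 6; i++) {
        int P = procs[i];
        double speedup = 1.0 / ((1.0 - f) + f / P);
        printf("P = %4d  max speedup = %.2f\n", P, speedup);
    }
    return 0;  /* the bound never exceeds 1 / (1 - f) = 5 */
}
```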

SLIDE 13

Granularity

  • Overhead of Parallelism
  • cost of starting a thread or process
  • cost of communicating shared data
  • cost of synchronizing
  • extra (redundant) computation
  • Tradeoff
  • Large units of work reduce the overhead of parallelism
  • Large units of work reduce the available parallelism
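In OpenMP this tradeoff shows up directly in the chunk size of a schedule clause. A hedged sketch (illustrative, not from the slides): large chunks mean fewer scheduling events, hence less overhead, but fewer independent units left for balancing across threads.

```c
/* Granularity sketch: the chunk size in schedule(dynamic, chunk) sets
 * the unit of work handed out to each thread. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

/* Hypothetical per-element computation standing in for real work. */
static double work(int i) { return (double)i * 0.5; }

int main(void) {
    static double out[N];
    double t = omp_get_wtime();
    /* Coarse grain (chunk = 10000): few scheduling events, low overhead,
     * but only N/10000 units available for load balancing.
     * Fine grain (chunk = 1) would maximize available parallelism at the
     * cost of one scheduling event per iteration. */
    #pragma omp parallel for schedule(dynamic, 10000)
    for (int i = 0; i < N; i++) out[i] = work(i);
    printf("done in %f s, out[42] = %f\n", omp_get_wtime() - t, out[42]);
    return 0;
}
```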

SLIDE 14

Locality

  • Large memories are slow
  • Fast memories are small
  • Algorithms should do most of the work on local data
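A classic illustration in C (a sketch, not from the deck): traversing a matrix in row-major order touches memory contiguously and stays in the small fast cache, while column order strides across the large slow memory.

```c
/* Locality sketch: C stores matrices row-major, so the i-then-j loop
 * walks consecutive addresses a[i][0], a[i][1], ... (cache-friendly). */
#include <stdio.h>
#define N 2048

static double a[N][N];

int main(void) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)      /* good locality: consecutive */
        for (int j = 0; j < N; j++)  /* addresses within each row   */
            sum += a[i][j];
    /* Swapping the loops (j outer, i inner) computes the same sum but
     * touches a[0][j], a[1][j], ... which are N*8 bytes apart: the same
     * work runs far slower on real machines due to cache misses. */
    printf("%f\n", sum);
    return 0;
}
```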

SLIDE 15

Load Balance

  • Load imbalance is the time that some processors in the system are idle due to
  • Insufficient parallelism (during that phase)
  • Unequal size tasks
  • Algorithm needs to balance the load
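When tasks have unequal sizes, OpenMP's dynamic schedule is one way to rebalance; a hedged sketch (illustrative, not from the slides) follows.

```c
/* Load-balance sketch: iterations do unequal work, so a static split
 * would leave some threads idle; schedule(dynamic) hands out iterations
 * on demand, so threads that finish early keep taking new work. */
#include <omp.h>
#include <stdio.h>

/* Hypothetical task whose cost grows with i: unequal task sizes. */
static double task(int i) {
    double x = 0.0;
    for (int k = 0; k < i * 100; k++) x += 1.0 / (k + 1.0);
    return x;
}

int main(void) {
    double total = 0.0;
    #pragma omp parallel for schedule(dynamic) reduction(+:total)
    for (int i = 0; i < 1000; i++)
        total += task(i);
    printf("total = %f\n", total);
    return 0;
}
```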

SLIDE 16

Coordination and Synchronization

  • Communications
  • Parallel tasks typically need to exchange data. There are several ways this can be accomplished, such as through a shared memory bus or over a network; however, the actual event of data exchange is commonly referred to as communications regardless of the method employed.
  • Synchronization
  • The coordination of parallel tasks in real time, very often associated with communications. It is often implemented by establishing a synchronization point within an application where a task may not proceed until another task (or tasks) reaches the same or a logically equivalent point.
  • Synchronization usually involves waiting by at least one task.
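A hedged OpenMP sketch (not from the slides) of both ideas: a critical section makes threads take turns updating shared data, and a barrier is a synchronization point no thread may pass until all threads reach it.

```c
/* Synchronization sketch: critical = one thread at a time (others wait);
 * barrier = no thread proceeds until every thread has arrived. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    int ready = 0;                /* shared counter */
    #pragma omp parallel
    {
        int id = omp_get_thread_num();

        #pragma omp critical      /* threads take turns updating ready */
        ready++;

        #pragma omp barrier       /* wait until every thread has counted in */

        if (id == 0)              /* after the barrier, the count is complete */
            printf("%d threads reached the sync point\n", ready);
    }
    return 0;
}
```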

SLIDE 17

Parallel Computing Platforms

  • Shared Memory vs. Distributed Memory
  • SIMD vs. MIMD
  • ILP vs. TLP

SLIDE 18

Shared vs. Distributed Memory

  • Shared Memory:
  • Single physical address space
  • All processors have access to a pool of shared memory
  • Example: most existing multiprocessors
  • Distributed Memory:
  • Each processor has its own memory
  • No direct access to remote memory (message passing is required)
  • Example: clusters

[Figure: four CPUs sharing one memory vs. four CPU-memory pairs connected by a network]
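OpenMP itself is a shared-memory model: all threads see one address space, as the small sketch below shows (an illustration, not from the slides). On a distributed-memory cluster the same exchange would require explicit message passing, e.g. with MPI.

```c
/* Shared-memory sketch: one array lives in a single address space and
 * every OpenMP thread reads and writes it directly -- no messages needed. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    int slot[64] = {0};               /* shared by all threads */
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        if (id < 64) slot[id] = id + 1;   /* each thread writes its own slot */
    }
    /* After the parallel region the initial thread reads every other
     * thread's write directly. */
    int writers = 0;
    for (int i = 0; i < 64; i++) if (slot[i]) writers++;
    printf("%d threads wrote into the shared array\n", writers);
    return 0;
}
```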

SLIDE 19

SIMD vs. MIMD

  • Single Instruction Multiple Data (SIMD) model:
  • A large number of (usually) small processors.
  • A single "control processor" issues each instruction.
  • Each processor executes the same instruction.
  • Examples:
  • Processor Arrays: Connection Machine CM-2, MasPar MP-1, MP-2
  • Vector Pipelines: IBM 9000, Cray C90, Fujitsu VP, NEC SX-2, Hitachi S820
  • Multiple Instruction Multiple Data (MIMD) model:
  • Each processor executes its own sequence of instructions independently
  • Examples: Intel iPSC Hypercube, Intel Paragon

SLIDE 20

SIMD Example: SSE

  • Stands for: Streaming SIMD Extensions
  • Available in most x86 (and x86_64) CPUs today
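A minimal sketch of SSE in C (not from the slides; assumes GCC or Clang on an x86 machine): two float arrays are added four elements at a time, so each instruction operates on multiple data.

```c
/* SSE sketch: one _mm_add_ps performs four float additions in a single
 * instruction. Compile with e.g. gcc -msse sse_add.c */
#include <xmmintrin.h>  /* SSE intrinsics */
#include <stdio.h>

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    for (int i = 0; i < 8; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);    /* 4 adds at once */
        _mm_storeu_ps(&c[i], vc);          /* store 4 results */
    }

    for (int i = 0; i < 8; i++) printf("%.0f ", c[i]);  /* prints 9 eight times */
    printf("\n");
    return 0;
}
```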

SLIDE 21

ILP vs. TLP

  • Instruction-Level Parallelism (ILP)
  • Parallelism among individual instructions
  • Automatically extracted by the microprocessor
  • Implicit
  • Limited by
  • Pipeline width
  • Instruction dependencies
  • Thread-Level Parallelism (TLP)
  • Multiple concurrent tasks
  • Requires programmer's involvement
  • Explicit

Slide Source: Dr. Khunjush, Shiraz University
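A hedged sketch of explicit TLP with OpenMP sections (not from the slides): the programmer names two independent tasks and the runtime may run them on different threads, whereas any ILP inside each loop is extracted implicitly by the processor.

```c
/* TLP sketch: the programmer explicitly marks two independent tasks;
 * OpenMP may run each section on its own thread. ILP *within* each
 * loop body is left to the CPU's pipeline. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    double a = 0.0, b = 0.0;
    #pragma omp parallel sections
    {
        #pragma omp section                                   /* task 1 */
        for (int i = 1; i <= 1000000; i++) a += 1.0 / i;

        #pragma omp section                                   /* task 2 */
        for (int i = 1; i <= 1000000; i++) b += 1.0 / ((double)i * i);
    }
    printf("a = %f, b = %f\n", a, b);
    return 0;
}
```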