
ECE-451/ECE-566 - Introduction to Parallel and Distributed Programming

Lecture 2: Parallel Architectures and Programming Models

Department of Electrical & Computer Engineering, Rutgers University

Machine Architectures and Interconnection Networks


Architecture Spectrum

Shared-Everything

– Symmetric Multiprocessors

Shared Memory

– NUMA, CC-NUMA

Distributed Memory

– DSM, Message Passing


Shared-Nothing

– Clusters, NOWs (Networks of Workstations)

Client/Server

Pros and Cons

Shared Memory

– Pros

flexible, easier to program

– Cons

not scalable, synchronization/coherency issues

Distributed Memory

– Pros

scalable

– Cons

difficult to program, requires explicit message passing


Conventional Computer

Consists of a processor executing a program stored in a (main) memory:

[Figure: a processor connected to main memory; instructions flow to the processor, data flows to and from it.]

Each main memory location is identified by its address. Addresses start at 0 and extend to 2^b - 1 when there are b bits (binary digits) in the address.

Shared Memory Multiprocessor System

Natural way to extend single processor model - have multiple processors connected to multiple memory modules, such that each processor can access any memory module

[Figure: processors connected through an interconnection network to memory modules, forming one address space.]


Simplistic view of a small shared memory multiprocessor

[Figure: processors connected by a bus to a shared memory.]

Examples: dual Pentiums, quad Pentiums.

Quad Pentium Shared Memory Multiprocessor

[Figure: four processors, each with an L1 cache, an L2 cache, and a bus interface, sharing a processor/memory bus; a memory controller and an I/O interface connect the bus to the shared memory and the I/O bus.]


Programming Shared Memory Multiprocessors

Use:

Threads - the programmer decomposes the program into individual parallel sequences (threads), each able to access variables declared outside the threads. Example: Pthreads.

Sequential programming language with preprocessor compiler directives to declare shared variables and specify parallelism. Example: OpenMP - industry standard - needs an OpenMP compiler.

Sequential programming language with added syntax to declare shared variables and specify parallelism. Example: UPC (Unified Parallel C) - needs a UPC compiler.

Parallel programming language with syntax to express parallelism - the compiler creates executable code for each processor (not now common).

Sequential programming language with a parallelizing compiler that converts it into parallel executable code - also not now common.
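For flavor, a minimal sketch of the directive-based (OpenMP) approach - illustrative, not from the slides; compile with an OpenMP-aware compiler, e.g. gcc -fopenmp:

    #include <stdio.h>

    int main(void) {
        const int n = 1000000;
        double sum = 0.0;

        /* The directive asks the compiler to run loop iterations in
           parallel; reduction(+:sum) gives each thread a private
           partial sum that is combined at the end. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += 1.0 / (i + 1);

        printf("sum = %f\n", sum);
        return 0;
    }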

Distributed Shared Memory

Making the main memory of a group of interconnected computers look like a single memory with a single address space.

Shared memory programming techniques can then be used.

[Figure: computers, each with a processor and memory, connected by an interconnection network; the combined memories form one shared address space, with messages carrying remote accesses.]


Message-Passing Multicomputer

Complete computers connected through an interconnection network.

[Figure: computers, each with its own processor and local memory, exchanging messages through an interconnection network.]

Interconnection Networks

Limited and exhaustive interconnections
– 2- and 3-dimensional meshes
– Hypercube (not now common)

Using switches
– Crossbar
– Trees
– Multistage interconnection networks


Two-dimensional array (mesh)

[Figure: two-dimensional mesh of computers/processors connected by links.]

Also three-dimensional - used in some large high-performance systems.

Three-dimensional hypercube

[Figure: three-dimensional hypercube with nodes labeled 000 through 111; nodes whose labels differ in one bit are connected.]


Four-dimensional hypercube

[Figure: four-dimensional hypercube with nodes labeled 0000 through 1111.]

Hypercubes were popular in the 1980s - not now.

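The labeling above reflects the hypercube's defining property: each of the 2^d nodes in a d-dimensional hypercube links to the d nodes whose binary labels differ from its own in exactly one bit. A minimal C sketch (illustrative; the function name is ours) that lists a node's neighbors:

    #include <stdio.h>

    /* Print the neighbors of `node' in a d-dimensional hypercube.
       Flipping bit i of the label gives the neighbor along dimension i. */
    void hypercube_neighbors(unsigned node, int d) {
        for (int i = 0; i < d; i++) {
            unsigned neighbor = node ^ (1u << i);  /* flip bit i */
            printf("%u <-> %u\n", node, neighbor);
        }
    }

    int main(void) {
        hypercube_neighbors(5, 3);  /* node 101 in a 3-cube: 100, 111, 001 */
        return 0;
    }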

Crossbar switch

[Figure: crossbar switch - a grid of switches connecting each processor to each memory module.]


Tree

[Figure: tree network - processors at the leaves, switch elements at the internal nodes and root, connected by links.]

Multistage Interconnection Network

Example: Omega network

2 × 2 switch elements (straight-through or crossover connections).

[Figure: 8×8 omega network with inputs and outputs labeled 000 through 111.]
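Routing in an omega network is usually done with destination tags: each of the log2(N) stages examines one bit of the destination address, most significant bit first, and sets its 2 × 2 switch to the upper output for a 0 or the lower output for a 1. A sketch of this standard rule (names are ours, assuming the usual perfect-shuffle wiring):

    #include <stdio.h>

    /* Destination-tag routing in an N x N omega network (N a power of 2).
       Stage k inspects the k-th most significant destination bit:
       0 selects the upper switch output, 1 the lower one. */
    void omega_route(unsigned src, unsigned dst, int stages) {
        printf("routing %u -> %u:", src, dst);
        for (int i = stages - 1; i >= 0; i--) {
            int bit = (dst >> i) & 1;
            printf(" %s", bit ? "lower" : "upper");
        }
        printf("\n");
    }

    int main(void) {
        omega_route(2, 6, 3);  /* 8x8 network: 3 stages of 2x2 switches */
        return 0;
    }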


Taxonomy of HPC Architectures

Taxonomy of Architectures

Flynn (1966) created a simple classification for computers based upon the number of instruction streams and data streams:
– SISD - conventional
– SIMD - data parallel, vector computing
– MISD - systolic arrays
– MIMD - very general, multiple approaches.

Current focus is on the MIMD model, using general-purpose processors or multicomputers.


HPC Architecture Examples

SISD - mainframes, workstations, PCs.
SIMD Shared Memory - vector machines, Cray...
MIMD Shared Memory - Sequent, KSR, Tera, SGI, Sun.
SIMD Distributed Memory - DAP, TMC CM-2...
MIMD Distributed Memory - Cray T3D, Intel, Transputers, TMC CM-5, plus recent workstation clusters (IBM SP2, DEC, Sun, HP).

Note: Modern sequential machines are not purely SISD - advanced RISC processors use many concepts from vector and parallel architectures (pipelining, parallel execution of instructions, prefetching of data, etc.) in order to achieve one or more arithmetic operations per clock cycle.

SISD: A Conventional Computer

[Figure: SISD - a single processor with one instruction stream, one data input, and one data output.]

Single processor computer - a single stream of instructions is generated from the program. Instructions operate upon a single stream of data items.

Speed is limited by the rate at which the computer can transfer information internally.

e.g. PC, Macintosh, Workstations


The MISD Architecture

[Figure: MISD - a single data input stream passes through processors A, B, and C in sequence, each driven by its own instruction stream (A, B, C), producing a single data output stream.]

More of an intellectual exercise than a practical configuration. Few built, but commercially not available.

Single Instruction Stream-Multiple Data Stream (SIMD) Computer

A specially designed computer - a single instruction stream from a single program, but multiple data streams exist.

A single source program is written, and each processor executes its personal copy of this program, although independently and not in synchronism.

Developed because a number of important applications mostly operate upon arrays of data.

The source program can be constructed so that parts of the program are executed by certain computers and not others, depending upon the identity of the computer.


SIMD Architecture

[Figure: SIMD - a single instruction stream drives processors A, B, and C, each with its own data input and data output streams; each processor i computes Ci <= Ai * Bi.]

e.g., Cray vector processing, Thinking Machines CM*.

Multiple Instruction Stream-Multiple Data Stream (MIMD) Computer

General-purpose multiprocessor system. Each processor has a separate program, and one instruction stream is generated from each program for each processor. Each instruction operates upon different data.

[Figure: two programs, each generating its own instruction stream for its own processor, each operating on its own data.]


MIMD Architecture

[Figure: MIMD - processors A, B, and C, each with its own instruction stream (A, B, C) and its own data input and output streams.]

Unlike SISD and MISD machines, an MIMD computer works asynchronously.
» Shared memory (tightly coupled) MIMD
» Distributed memory (loosely coupled) MIMD

Shared Memory MIMD machine

Communication: the source PE writes data to global memory and the destination PE retrieves it.

Easy to build; conventional OSes for SISD machines can easily be ported.

Limitation: reliability and expandability. A memory component or any processor failure affects the whole system. Increasing the number of processors leads to memory contention.

E.g.: SGI machines...

[Figure: processors A, B, and C connected by memory buses to a global memory system.]


Distributed Memory MIMD

Communication: IPC over a high-speed network.

The network can be configured as a tree, mesh, cube, etc.

Unlike shared memory MIMD, easily/readily expandable.

Highly reliable (any CPU failure does not affect the whole system).

[Figure: processors A, B, and C, each with its own memory bus and memory system, connected by IPC channels.]

Towards Cluster and Distributed Computing


Parallel Processing Paradox

Time required to develop a parallel application for solving a GCA (Grand Challenge Application) is equal to:

– the half-life of parallel supercomputers.


An Alternative Supercomputing Resource

Vast numbers of under-utilized workstations available to use.

Huge numbers of unused processor cycles and resources that could be put to good use in a wide variety of application areas.

Reluctance to buy supercomputers due to their cost and short life span.

Distributed compute resources “fit” better into today's funding model.


Networked Computers as a Computing Platform

A network of computers became a very attractive alternative to expensive supercomputers and parallel computer systems for high-performance computing in the early 1990s.

Several early projects. Notable:

– Berkeley NOW (network of workstations) project.

– NASA Beowulf project. (Will look at this one later.)

Key advantages:

Very high performance workstations and PCs readily available at low cost.

The latest processors can easily be incorporated into the system as they become available.

Existing software can be used or modified.


Software Tools for Clusters

Based upon message-passing parallel programming:

Parallel Virtual Machine (PVM) - developed in the late 1980s. Became very popular.

Message-Passing Interface (MPI) - standard defined in the 1990s.

Both provide a set of user-level libraries for message passing, for use with regular programming languages (C, C++, ...).
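For flavor, the canonical minimal MPI program (a sketch; built with mpicc and launched with mpirun, assuming an MPI installation):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size;

        MPI_Init(&argc, &argv);                 /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total process count */

        printf("Hello from process %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }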

Beowulf Clusters*

A group of interconnected “commodity” computers achieving high performance at low cost.

Typically using commodity interconnects - high-speed Ethernet - and the Linux OS.

* Beowulf comes from the name given to the NASA Goddard Space Flight Center cluster project.


Cluster Interconnects

Originally fast Ethernet on low-cost clusters.

Gigabit Ethernet - easy upgrade path.

More specialized/higher performance:
– Myrinet - 2.4 Gbits/sec - disadvantage: single vendor
– cLan
– SCI (Scalable Coherent Interface)
– QNet
– Infiniband - may be important as Infiniband interfaces may be integrated on next-generation PCs

Dedicated cluster with a master node

[Figure: a dedicated cluster - users reach the master node from the external network via an uplink; the master node's second Ethernet interface connects through a switch to the compute nodes.]


Scalable Parallel Computers


Design Space of Competing Computer Architecture



Machine and Programming Models Applied to Parallel Systems

Generic Parallel Architecture

[Figure: a generic parallel architecture - processors (P) connected by an interconnection network to memory modules (M).]

° Where is the memory physically located?


Parallel Programming Models

Control
– How is parallelism created?
– What orderings exist between operations?
– How do different threads of control synchronize?

Data
– What data is private vs. shared?
– How is logically shared data accessed or communicated?

Operations
– What are the atomic operations?

Cost

– How do we account for the cost of each of the above?

Trivial Example

Compute s = f(A[1]) + ... + f(A[n])

Parallel Decomposition:
– Each evaluation and each partial sum is a task.

Assign n/p numbers to each of p processors:
– Each computes independent “private” results and a partial sum.
– One (or all) collects the p partial sums and computes the global sum.

Two Classes of Data:
– Logically Shared
» The original n numbers, the global sum.
– Logically Private
» The individual function evaluations.
» What about the individual partial sums?


Programming Model 1: Shared Address Space

Program consists of a collection of threads of control.

Each has a set of private variables, e.g., local variables on the stack.

Collectively with a set of shared variables, e.g., static variables, shared common blocks, global heap.

Threads communicate implicitly by writing and reading shared variables.

Threads coordinate explicitly by synchronization operations on shared variables -- writing and reading flags, locks, or semaphores.

Like concurrent programming on a uniprocessor.

[Figure: shared address space - threads P share a region holding the shared variables (x = ...; y = ..x ...) while each keeps its own private variables.]

Machine Model 1: Shared Memory Multiprocessor

Processors all connected to a large shared memory.

“Local” memory is not (usually) part of the hardware.
– Sun, DEC, Intel SMPs in Millennium, SGI Origin

Cost: much cheaper to access data in cache than in main memory.

[Figure: processors P1..Pn, each with a cache ($), connected by a network to a shared memory.]


Shared Memory Code for Computing a Sum

[s = 0 initially]

Thread 1:
    local_s1 = 0
    for i = 0, n/2-1
        local_s1 = local_s1 + f(A[i])
    s = s + local_s1

Thread 2:
    local_s2 = 0
    for i = n/2, n-1
        local_s2 = local_s2 + f(A[i])
    s = s + local_s2

What could go wrong?

Pitfall and Solution via Synchronization

° Pitfall in computing a global sum s = local_s1 + local_s2 (time flows downward; the two threads' instructions may interleave):

Thread 1 (initially s=0):
    load s              [from mem to reg]
    s = s + local_s1    [= local_s1, in reg]
    store s             [from reg to mem]

Thread 2 (initially s=0):
    load s              [from mem to reg; initially 0]
    s = s + local_s2    [= local_s2, in reg]
    store s             [from reg to mem]

° Instructions from different threads can be interleaved arbitrarily.
° What can the final result s stored in memory be?
° Problem: race condition.
° Possible solution: mutual exclusion with locks

Thread 1:                 Thread 2:
    lock                      lock
    load s                    load s
    s = s + local_s1          s = s + local_s2
    store s                   store s
    unlock                    unlock

° Locks must be atomic (execute completely without interruption).
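A runnable C/Pthreads version of the locked sum above (a sketch with data and names of our choosing, making the slide's pseudocode concrete; compile with gcc -pthread):

    #include <stdio.h>
    #include <pthread.h>

    #define N 1000000

    static double A[N];
    static double s = 0.0;   /* shared global sum */
    static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;

    static double f(double x) { return x * x; }

    /* Each thread sums its half of A into a private variable, then
       adds it to the shared sum under the lock - avoiding the race. */
    static void *worker(void *arg) {
        long id = (long)arg;   /* 0 or 1 */
        double local_s = 0.0;
        for (long i = id * N / 2; i < (id + 1) * N / 2; i++)
            local_s += f(A[i]);
        pthread_mutex_lock(&s_lock);
        s += local_s;
        pthread_mutex_unlock(&s_lock);
        return NULL;
    }

    int main(void) {
        for (long i = 0; i < N; i++) A[i] = 1.0;
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, (void *)0);
        pthread_create(&t2, NULL, worker, (void *)1);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("s = %f\n", s);   /* expect N * f(1.0) = 1000000 */
        return 0;
    }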


Programming Model 2: Message Passing

Program consists of a collection of named processes.
– Thread of control plus local address space -- NO shared data.
– Local variables, static variables, common blocks, heap.

Processes communicate by explicit data transfers -- matching send and receive pairs by source and destination processors.

Coordination is implicit in every communication event.

Logically shared data is partitioned over local processes.

Like distributed programming - program with MPI, PVM.

[Figure: processes with private address spaces communicating by matching operations (send P0,X on one side, recv Pn,Y on the other) over a network.]

Machine Model 2: Distributed Memory

Cray XT, IBM SP2, BlueGene, Roadrunner, NOW, etc.

Each processor is connected to its own memory and cache, but cannot directly access another processor's memory.

Each “node” has a network interface (NI) for all communication and synchronization.

[Figure: nodes P1..Pn, each with its own memory and network interface, connected by an interconnect.]


Computing s = x(1)+x(2) on each processor

° First possible solution:

Processor 1:
    send xlocal, proc2      [xlocal = x(1)]
    receive xremote, proc2
    s = xlocal + xremote

Processor 2:
    receive xremote, proc1
    send xlocal, proc1      [xlocal = x(2)]
    s = xlocal + xremote

° Second possible solution -- what could go wrong?

Processor 1:
    send xlocal, proc2      [xlocal = x(1)]
    receive xremote, proc2
    s = xlocal + xremote

Processor 2:
    send xlocal, proc1      [xlocal = x(2)]
    receive xremote, proc1
    s = xlocal + xremote

° What if send/receive acts like the telephone system? The post office?
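The first solution written with real MPI calls (a sketch; run with exactly two processes). In the second solution both ranks would call MPI_Send first, which deadlocks if sends are synchronous (telephone-style) but completes if they are buffered (post-office-style):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank;
        double xlocal, xremote, s;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        xlocal = rank + 1.0;        /* stand-in for x(1), x(2) */
        int other = 1 - rank;

        /* The two ranks send and receive in opposite orders, so a
           matching receive is always waiting for each send. */
        if (rank == 0) {
            MPI_Send(&xlocal, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
            MPI_Recv(&xremote, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(&xremote, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&xlocal, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
        }
        s = xlocal + xremote;
        printf("rank %d: s = %f\n", rank, s);

        MPI_Finalize();
        return 0;
    }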

Programming Model 3: Data Parallel

Single sequential thread of control consisting of parallel operations.

Parallel operations applied to all (or a defined subset) of a data structure.

Communication is implicit in parallel operators and “shifted” data structures.

Elegant and easy to understand and reason about. Like marching in a regiment.

Used by Matlab, accelerators, GPUs, etc.

Drawback: not all problems fit this model.

A = array of all data
fA = f(A)
s = sum(fA)

[Figure: array A mapped elementwise through f to fA, then reduced to the scalar s.]
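The slide's three data-parallel lines, emulated in plain C for concreteness (a sketch; in a true data-parallel language each loop would be a single implicit parallel operation):

    #include <stdio.h>

    #define N 8

    static double f(double x) { return 2.0 * x; }  /* illustrative f */

    int main(void) {
        double A[N], fA[N], s = 0.0;
        for (int i = 0; i < N; i++) A[i] = i;        /* A = array of all data */
        for (int i = 0; i < N; i++) fA[i] = f(A[i]); /* fA = f(A) */
        for (int i = 0; i < N; i++) s += fA[i];      /* s = sum(fA) */
        printf("s = %f\n", s);   /* 2*(0+1+...+7) = 56 */
        return 0;
    }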


Machine Model 3: SIMD System

A large number of (usually) small processors.

A single “control processor” issues each instruction.

Each processor executes the same instruction.

Some processors may be turned off on some instructions.

Machines are not popular (CM2), but the programming model is.

Applicable to emerging accelerators (GPGPUs, CellBE, etc.).

[Figure: a control processor broadcasting instructions to nodes P1..Pn, each with its own memory and network interface, connected by an interconnect.]

Machine Model 4: Clusters of SMPs

Since small shared memory machines (SMPs) are the fastest commodity machines, why not build a larger machine by connecting many of them with a network?

CLUMP = Cluster of SMPs.

Shared memory within one SMP, but message passing outside of an SMP.

Two programming models:
– Treat the machine as “flat”, always using message passing, even within an SMP (simple, but ignores an important part of the memory hierarchy).
– Expose two layers: shared memory and message passing (usually higher performance, but ugly to program) - see the sketch below.
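The two-layer option is commonly realized as MPI across SMP nodes plus OpenMP threads within each node; a minimal hybrid sketch (illustrative, assuming an MPI library built with thread support and an OpenMP compiler):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int provided, rank;

        /* Ask for an MPI threading level that tolerates OpenMP threads. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 0.0;
        /* Shared-memory layer: threads within one SMP node. */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < 1000; i++)
            local += 1.0;

        /* Message-passing layer: combine results across nodes. */
        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);
        if (rank == 0)
            printf("global = %f\n", global);

        MPI_Finalize();
        return 0;
    }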


Programming Model 4: Bulk Synchronous

Used within the message passing or shared memory models as a programming convention.

Phases are separated by global barriers:
– Compute phases: all operate on local data (in distributed memory) or have read access to global data (in shared memory).
– Communication phases: all participate in rearrangement or reduction of global data.

Generally all doing the “same thing” in a phase:
– all do f, but may all do different things within f.

Features the simplicity of data parallelism, but without the restrictions of a strict data parallel model.
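A skeletal bulk-synchronous superstep in MPI (a sketch with names of our choosing; the explicit barrier marks the phase boundary, though a collective such as MPI_Allreduce synchronizes on its own):

    #include <mpi.h>

    /* One bulk-synchronous superstep: compute on local data, hit a
       global barrier, then all participate in a global reduction. */
    void bsp_step(double *local, double *global, MPI_Comm comm) {
        /* Compute phase: every process works only on its local data. */
        *local = *local * 2.0;

        MPI_Barrier(comm);  /* global barrier ends the compute phase */

        /* Communication phase: reduction of global data. */
        MPI_Allreduce(local, global, 1, MPI_DOUBLE, MPI_SUM, comm);
    }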

Summary So Far

Historically, each parallel machine was unique, along with its programming model and programming language.

It was necessary to throw away software and start over with each new kind of machine - ugh!!!

Now we distinguish the programming model from the underlying machine, so we can write portably correct codes that run on many machines.
– MPI is now the most portable option, but can be tedious.

Writing portably fast code requires tuning for the architecture.
– The algorithm design challenge is to make this process easy.
– Example: picking a block size, not rewriting the whole algorithm.


Steps in Writing Parallel Programs

Creating a Parallel Program

Identify work that can be done in parallel.

Partition work and perhaps data among logical processes (threads).

Manage the data access, communication, and synchronization.

Goal: maximize speedup due to parallelism.

Speedup_prob(P procs) = Time to solve prob with “best” sequential solution
                        / Time to solve prob in parallel on P processors  <= P (?)

Efficiency(P) = Speedup(P) / P  <= 1
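For instance (numbers illustrative, not from the slides): if the best sequential solution takes 100 s and the parallel version takes 20 s on P = 8 processors, then Speedup(8) = 100/20 = 5 and Efficiency(8) = 5/8 ≈ 0.63.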

° Key question is when you can solve each piece:

  • statically, if information is known in advance.
  • dynamically, otherwise.

Steps in the Process

[Figure: Overall Computation -> (Decomposition) -> Grains of Work -> (Assignment) -> Processes/Threads -> (Orchestration) -> Processes/Threads -> (Mapping) -> Processors.]

Task: arbitrarily defined piece of work that forms the basic unit of concurrency.

Process/Thread: abstract entity that performs tasks.
– Tasks are assigned to threads via an assignment mechanism.
– Threads must coordinate to accomplish their collective tasks.

Processor: physical entity that executes a thread.

Decomposition

Break the overall computation into individual grains of work (tasks).
– Identify concurrency and decide at what level to exploit it.
– Concurrency may be statically identifiable or may vary dynamically.
– It may depend only on problem size, or it may depend on the particular input data.

Goal: identify enough tasks to keep the target range of processors busy, but not too many.
– Establishes an upper limit on the number of useful processors (i.e., scaling).

Tradeoff: sufficient concurrency vs. task control overhead.

Assignment

Determine the mechanism to divide work among threads.
– Functional partitioning:
» Assign logically distinct aspects of work to different threads, e.g., pipelining.
– Structural mechanisms:
» Assign iterations of a “parallel loop” according to a simple rule, e.g., proc j gets iterates j*n/p through (j+1)*n/p-1 (see the sketch after this list).
» Throw tasks in a bowl (task queue) and let threads feed.
– Data/domain decomposition:
» Data describing the problem has a natural decomposition.
» Break up the data and assign work associated with regions, e.g., parts of the physical system being simulated.

Goals:
– Balance the workload to keep everyone busy (all the time).
– Allow efficient orchestration.
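The block rule quoted in the list above, as a tiny C helper (a sketch; j, n, p as on the slide, assuming p divides n):

    /* Iterations owned by proc j under the block rule:
       j*n/p through (j+1)*n/p - 1. */
    void block_range(int j, int n, int p, int *lo, int *hi) {
        *lo = j * n / p;
        *hi = (j + 1) * n / p - 1;
    }

With n = 100 and p = 4, proc 2 gets iterates 50 through 74.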

Orchestration

Provide a means of:
– Naming and accessing shared data.
– Communication and coordination among threads of control.

Goals:
– Correctness of the parallel solution -- respect the inherent dependencies within the algorithm.
– Avoid serialization.
– Reduce the cost of communication, synchronization, and management.
– Preserve locality of data reference.


Mapping

Binding processes to physical processors.

Time to reach a processor across the network does not depend on which processor (roughly).
– Lots of old literature on “network topology”, no longer so important.

Basic issue is how many remote accesses.

[Figure: each processor has a fast cache, a slow local memory, and a really slow path across the network to remote memory.]

Example

s = f(A[1]) + … + f(A[n])

Decomposition
– computing each f(A[j])
– n-fold parallelism, where n may be >> p
– computing the sum s

Assignment
– thread k sums sk = f(A[k*n/p]) + … + f(A[(k+1)*n/p - 1])
– thread 1 sums s = s1 + … + sp (for simplicity of this example)
– thread 1 communicates s to the other threads

Orchestration
– starting up threads
– communicating, synchronizing with thread 1

Mapping
– processor j runs thread j