ECE-451/ECE-566 - Introduction to Parallel and Distributed Programming
Lecture 2: Parallel Architectures and Programming Models
Department of Electrical and Computer Engineering
Architecture Spectrum
Shared-Everything
– Symmetric Multiprocessors
Shared Memory
– NUMA, CC-NUMA
Distributed Memory
– DSM, Message Passing
Shared-Nothing
– Clusters, NOWs (networks of workstations)
Client/Server
Pros and Cons
Shared Memory
– Pros
flexible, easier to program
– Cons
not scalable, synchronization/coherency issues
Distributed Memory
– Pros
scalable
– Cons
difficult to program, requires explicit message passing
Conventional Computer
Consists of a processor executing a program stored in a (main) memory:
[Figure: main memory holding the program; instructions flow to the processor, data flows to and from the processor]
Each main memory location is identified by its address. Addresses start at 0 and extend to 2^b - 1 when there are b bits (binary digits) in the address.
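For example (illustrative numbers, not from the slides): with b = 32 address bits, addresses run from 0 to 2^32 - 1 = 4,294,967,295.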
Shared Memory Multiprocessor System
Natural way to extend single processor model - have multiple processors connected to multiple memory modules, such that each processor can access any memory module
[Figure: processors connected through an interconnection network to memory modules forming one address space]
Simplistic view of a small shared memory multiprocessor
[Figure: processors connected by a bus to shared memory]
Examples:
Dual Pentiums
Quad Pentiums
Quad Pentium Shared Memory Multiprocessor
[Figure: four processors, each with its own L1 cache, L2 cache, and bus interface, sharing a processor/memory bus; a memory controller and I/O interface connect the bus to shared memory and an I/O bus]
Programming Shared Memory Multiprocessors
Use:
Threads - programmer decomposes the program into individual parallel sequences (threads), each able to access variables declared outside the threads. Example: Pthreads.
Sequential programming language with preprocessor compiler directives to declare shared variables and specify parallelism. Example: OpenMP - industry standard - needs an OpenMP compiler.
Sequential programming language with added syntax to declare shared variables and specify parallelism. Example: UPC (Unified Parallel C) - needs a UPC compiler.
Parallel programming language with syntax to express parallelism - compiler
creates executable code for each processor (not now common)
Sequential programming language with a parallelizing compiler that converts it into parallel executable code - also not now common.
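As a small illustration of the directive-based approach listed above (a sketch written for these notes, not code from the course), the following C fragment uses an OpenMP pragma to mark the parallel loop and declare the shared array; it needs an OpenMP-aware compiler (e.g. gcc -fopenmp):

    #include <stdio.h>

    int main(void) {
        const int n = 8;
        int a[8];                            /* shared array, declared outside the parallel region */

        #pragma omp parallel for shared(a)   /* directive creates and distributes the parallel loop */
        for (int i = 0; i < n; i++)
            a[i] = i * i;                    /* loop index i is private to each thread */

        for (int i = 0; i < n; i++)          /* back to a single thread */
            printf("%d ", a[i]);
        printf("\n");
        return 0;
    }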
Distributed Shared Memory
Making the main memory of a group of interconnected computers look like a single memory with a single address space.
Shared memory programming techniques can then be used.
[Figure: computers, each with a processor and memory, connected by an interconnection network carrying messages; together the memories appear as one shared memory]
Message-Passing Multicomputer
Complete computers connected through an
interconnection network
[Figure: complete computers, each with a processor and local memory, connected by an interconnection network carrying messages]
Interconnection Networks
Limited and exhaustive interconnections
– 2- and 3-dimensional meshes
– Hypercube (not now common)
Using switches
– Crossbar
– Trees
– Multistage interconnection networks
Two-dimensional array (mesh)
[Figure: two-dimensional mesh of computers/processors connected by links]
Also three-dimensional - used in some large high performance systems.
Three-dimensional hypercube
[Figure: three-dimensional hypercube with eight nodes labeled 000 through 111]
Four-dimensional hypercube
[Figure: four-dimensional hypercube with sixteen nodes labeled 0000 through 1111]
Hypercubes were popular in the 1980's - not now.
Crossbar switch
[Figure: crossbar switch connecting processors to memories through a grid of switches]
Tree
[Figure: tree network - a root switch element connected by links through intermediate switch elements down to processors at the leaves]
Multistage Interconnection Network
Example: Omega network
2 × 2 switch elements (straight-through or crossover connections)
[Figure: Omega network connecting inputs 000-111 to outputs 000-111 through stages of 2 × 2 switch elements]
Taxonomy of HPC Architectures
Taxonomy of Architectures
Flynn (1966) created a simple classification for computers based upon the number of instruction streams and data streams:
– SISD - conventional
– SIMD - data parallel, vector computing
– MISD - systolic arrays
– MIMD - very general, multiple approaches.
Current focus is on the MIMD model, using general purpose processors or multicomputers.
HPC Architecture Examples
SISD - mainframes, workstations, PCs.
SIMD Shared Memory - vector machines, Cray...
MIMD Shared Memory - Sequent, KSR, Tera, SGI, Sun.
SIMD Distributed Memory - DAP, TMC CM-2...
MIMD Distributed Memory - Cray T3D, Intel, Transputers, TMC CM-5, plus recent workstation clusters (IBM SP2, DEC, Sun, HP).
Note: Modern sequential machines are not purely SISD - advanced RISC processors use many concepts from vector and parallel architectures (pipelining, parallel execution of instructions, prefetching of data, etc.) in order to achieve one or more arithmetic operations per clock cycle.
SISD : A Conventional Computer
[Figure: a single processor receives an instruction stream and data input, and produces data output]
Single processor computer - single stream of instructions generated from program. Instructions operate upon a single stream of data items.
Speed is limited by the rate at which the computer can transfer information internally.
e.g. PC, Macintosh, Workstations
The MISD Architecture
[Figure: processors A, B, and C each receive their own instruction stream (A, B, C) while operating on a single data input stream and producing a single data output stream]
More of an intellectual exercise than a practical configuration. Few were built, and none are commercially available.
Single Instruction Stream-Multiple Data Stream (SIMD) Computer
A specially designed computer - a single instruction stream from a single program, but multiple data streams exist.
Single source program written and each processor
executes its personal copy of this program, although independently and not in synchronism.
Developed because a number of important applications mostly operate upon arrays of data.
Source program can be constructed so that parts of the
program are executed by certain computers and not others depending upon the identity of the computer.
SIMD Architecture
[Figure: a single instruction stream drives processors A, B, and C, each with its own data input stream and data output stream]
e.g. CRAY machine vector processing, Thinking Machines CM*. Example operation: Ci <= Ai * Bi, performed by every enabled processor on its own data elements.
Multiple Instruction Stream-Multiple Data Stream (MIMD) Computer
General-purpose multiprocessor system. Each processor has a separate program, and one instruction stream is generated from each program for each processor. Each instruction operates upon different data.
[Figure: two programs, each generating an instruction stream for its own processor, with each processor operating on its own data]
MIMD Architecture
[Figure: processors A, B, and C, each with its own instruction stream (A, B, C), data input stream, and data output stream]
Unlike SISD and MISD machines, a MIMD computer works asynchronously.
» Shared memory (tightly coupled) MIMD
» Distributed memory (loosely coupled) MIMD
Shared Memory MIMD machine
Communication: a source PE writes data to global memory (GM) and the destination PE retrieves it.
Easy to build; conventional OSes of SISD machines can easily be ported.
Limitations: reliability & expandability.
A memory component or any processor failure affects the whole system.
Increasing the number of processors leads to memory contention.
E.g.: SGI machines.
[Figure: processors A, B, and C connected through memory buses to a global memory system]
Distributed Memory MIMD
Communication: IPC on a high-speed network.
Network can be configured as a tree, mesh, cube, etc.
Unlike shared memory MIMD:
– easily/readily expandable
– highly reliable (any CPU failure does not affect the whole system)
[Figure: processors A, B, and C, each attached through its own memory bus to its own memory system (A, B, C), communicating over IPC channels]
Towards Cluster and Distributed Computing
Parallel Processing Paradox
Time required to develop a parallel application for solving a GCA (Grand Challenge Application) is equal to:
– the half-life of parallel supercomputers.
An Alternative Supercomputing Resource
Vast numbers of under-utilised workstations available to use.
Huge numbers of unused processor cycles and resources that could be put to good use in a wide variety of application areas.
Reluctance to buy supercomputers due to their cost and short life span.
Distributed compute resources “fit” better into
today's funding model.
Networked Computers as a Computing Platform
A network of computers became a very attractive alternative to expensive supercomputers and parallel computer systems for high-performance computing in the early 1990's.
Several early projects. Notable:
– Berkeley NOW (network of workstations) project.
– NASA Beowulf project. (Will look at this one later)
Key advantages
Very high performance workstations and PCs readily available at low cost.
The latest processors can easily be incorporated into
the system as they become available.
Existing software can be used or modified.
Software Tools for Clusters
Based upon message-passing parallel programming.
Parallel Virtual Machine (PVM) - developed in late 1980's. Became very popular.
Message-Passing Interface (MPI) - standard defined in 1990s.
Both provide a set of user-level libraries for message passing.
Use with regular programming languages (C, C++, ...).
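For concreteness, a minimal sketch (not from the slides) of what an MPI program in C looks like; it assumes an MPI installation, is built with mpicc, and is launched with mpirun/mpiexec:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);                 /* start the message-passing system */
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's identity          */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes        */
        printf("hello from process %d of %d\n", rank, size);
        MPI_Finalize();                         /* shut the library down            */
        return 0;
    }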
Beowulf Clusters*
A group of interconnected "commodity" computers achieving high performance with low cost.
Typically using commodity interconnects - high speed Ethernet - and the Linux OS.
* Beowulf comes from the name given by the NASA Goddard Space Flight Center cluster project.
Cluster Interconnects
Originally fast Ethernet on low cost clusters
Gigabit Ethernet - easy upgrade path
More Specialized/Higher Performance
Myrinet - 2.4 Gbits/sec - disadvantage: single vendor
cLan
SCI (Scalable Coherent Interface)
QNet
Infiniband - may be important as Infiniband interfaces may be integrated on next generation PCs
Dedicated cluster with a master node
[Figure: dedicated cluster - users reach a master node over the external network; the master node's second Ethernet interface uplinks through a switch to the compute nodes]
Scalable Parallel Computers
Design Space of Competing Computer Architecture
Machine and Programming Models applied to Parallel Systems
Generic Parallel Architecture
[Figure: generic parallel architecture - processors (P) and memories (M) connected through an interconnection network]
° Where is the memory physically located?
Parallel Programming Models
Control
– How is parallelism created?
– What orderings exist between operations?
– How do different threads of control synchronize?
Data
– What data is private vs. shared?
– How is logically shared data accessed or communicated?
Operations
– What are the atomic operations?
Cost
– How do we account for the cost of each of the above?
Trivial Example
Parallel decomposition of computing the sum Σ f(A[i]) over the array A:
– Each evaluation and each partial sum is a task.
Assign n/p numbers to each of p processors:
– Each computes independent "private" results and a partial sum.
– One (or all) collects the p partial sums and computes the global sum.
Two Classes of Data:
– Logically Shared
» The original n numbers, the global sum.
– Logically Private
» The individual function evaluations.
» What about the individual partial sums?
Programming Model 1: Shared Address Space
Program consists of a collection of threads of control. Each has a set of private variables, e.g. local variables on the stack.
Collectively with a set of shared variables, e.g., static variables, shared common blocks, global heap.
Threads communicate implicitly by writing and reading shared variables.
Threads coordinate explicitly by synchronization operations on shared variables -- writing and reading flags, locks or semaphores.
Like concurrent programming on a uniprocessor
[Figure: shared address space - each thread has private variables, and all threads read and write shared variables (e.g. x = ..., y = ..x ...) in a common address space]
Machine Model 1: Shared Memory Multiprocessor
Processors all connected to a large shared memory.
"Local" memory is not (usually) part of the hardware.
– Sun, DEC, Intel SMPs in Millennium, SGI Origin
[Figure: processors P1..Pn, each with a cache ($), connected through a network to shared memory]
Cost: much cheaper to access data in cache than in main memory.
Shared Memory Code for Computing a Sum
Thread 1                                Thread 2
[s = 0 initially]                       [s = 0 initially]
local_s1 = 0                            local_s2 = 0
for i = 0, n/2-1                        for i = n/2, n-1
    local_s1 = local_s1 + f(A[i])           local_s2 = local_s2 + f(A[i])
s = s + local_s1                        s = s + local_s2

What could go wrong?
Pitfall and Solution via Synchronization
° Pitfall in computing a global sum s = local_s1 + local_s2:
Thread 1 (initially s=0)                 Thread 2 (initially s=0)
load s  [from mem to reg]                load s  [from mem to reg; initially 0]
s = s+local_s1  [=local_s1, in reg]      s = s+local_s2  [=local_s2, in reg]
store s  [from reg to mem]               store s  [from reg to mem]
(Time runs downward; the two threads' instructions may interleave.)
° Instructions from different threads can be interleaved arbitrarily.
° What can final result s stored in memory be?
° Problem: race condition.
° Possible solution: mutual exclusion with locks
Thread 1                  Thread 2
  lock                      lock
  load s                    load s
  s = s+local_s1            s = s+local_s2
  store s                   store s
  unlock                    unlock
° Locks must be atomic (execute completely without interruption).
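A minimal Pthreads sketch of the lock-based fix above (written for these notes, not the course's code): each thread updates the shared sum inside a mutex, so the load/add/store sequence cannot be interleaved. Build with -pthread; the sample partial sums are made up.

    #include <pthread.h>
    #include <stdio.h>

    static double s = 0.0;                                /* shared global sum     */
    static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *add_partial(void *arg) {
        double local = *(double *)arg;                    /* this thread's local_s */
        pthread_mutex_lock(&s_lock);                      /* lock                  */
        s = s + local;                                    /* load s, add, store s  */
        pthread_mutex_unlock(&s_lock);                    /* unlock                */
        return NULL;
    }

    int main(void) {
        double local_s1 = 3.0, local_s2 = 4.0;            /* made-up partial sums  */
        pthread_t t1, t2;
        pthread_create(&t1, NULL, add_partial, &local_s1);
        pthread_create(&t2, NULL, add_partial, &local_s2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("s = %g\n", s);                            /* always 7 - no race    */
        return 0;
    }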
Programming Model 2: Message Passing
Program consists of a collection of named processes.
Thread of control plus local address space -- NO shared data.
Local variables, static variables, common blocks, heap.
Processes communicate by explicit data transfers -- matching send and receive
pair by source and destination processors.
Coordination is implicit in every communication event.
Logically shared data is partitioned over local processes.
Like distributed programming - program with MPI, PVM.
[Figure: named processes, each with private memory and its own copy of array A; one process executes send P0,X while process P0 executes recv Pn,Y, transferring the value of X into Y]
Machine Model 2: Distributed Memory
Cray XT, IBM SP2, BlueGene, Roadrunner, NOW, etc.
Each processor is connected to its own memory and cache, but cannot directly access another processor's memory.
Each “node” has a network interface (NI) for all
communication and synchronization
[Figure: nodes P1..Pn, each with its own memory, cache, and network interface (NI), joined by an interconnect]
Computing s = x(1)+x(2) on each processor
° First possible solution:
Processor 1                                Processor 2
  send xlocal, proc2   [xlocal = x(1)]       receive xremote, proc1
  receive xremote, proc2                     send xlocal, proc1   [xlocal = x(2)]
  s = xlocal + xremote                       s = xlocal + xremote

° Second possible solution -- what could go wrong?
Processor 1                                Processor 2
  send xlocal, proc2   [xlocal = x(1)]       send xlocal, proc1   [xlocal = x(2)]
  receive xremote, proc2                     receive xremote, proc1
  s = xlocal + xremote                       s = xlocal + xremote

° What if send/receive acts like the telephone system? The post office?
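A rough MPI rendering of the first (safe) ordering above, written for these notes rather than taken from the course: process 0 plays processor 1 and process 1 plays processor 2, and the mismatched send/receive order prevents deadlock even with blocking calls. Run with two processes (mpirun -np 2 ...).

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double xlocal = (rank == 0) ? 1.0 : 2.0;   /* stand-ins for x(1) and x(2)       */
        double xremote, s;
        int other = 1 - rank;

        if (rank == 0) {                           /* "processor 1": send, then receive */
            MPI_Send(&xlocal, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
            MPI_Recv(&xremote, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {                                   /* "processor 2": receive, then send */
            MPI_Recv(&xremote, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&xlocal, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
        }
        s = xlocal + xremote;
        printf("process %d: s = %g\n", rank, s);
        MPI_Finalize();
        return 0;
    }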
Programming Model 3: Data Parallel
Single sequential thread of control consisting of parallel operations.
Parallel operations applied to all (or a defined subset) of a data structure.
Communication is implicit in parallel operators and "shifted" data structures.
Elegant and easy to understand and reason about.
Like marching in a regiment.
Used by Matlab, accelerators, GPUs, etc.
Drawback: not all problems fit this model.
A = array of all data
fA = f(A)
s = sum(fA)
[Figure: array A mapped elementwise by f to array fA, then reduced by sum to scalar s]
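The slides cite Matlab and GPUs; as a rough C analogue (an illustration written for these notes, with a made-up f and array size), OpenMP loop and reduction directives express the same array-at-a-time pattern fA = f(A); s = sum(fA):

    #include <stdio.h>

    static double f(double x) { return 2.0 * x; }   /* placeholder elementwise function */

    int main(void) {
        enum { n = 100 };
        double A[n], fA[n], s = 0.0;
        for (int i = 0; i < n; i++) A[i] = 1.0;     /* sample data                          */

        #pragma omp parallel for                    /* fA = f(A): apply f to every element  */
        for (int i = 0; i < n; i++) fA[i] = f(A[i]);

        #pragma omp parallel for reduction(+:s)     /* s = sum(fA): global reduction        */
        for (int i = 0; i < n; i++) s += fA[i];

        printf("s = %g\n", s);                      /* 200 for this sample data             */
        return 0;
    }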
Machine Model 3: SIMD System
A large number of (usually) small processors.
A single "control processor" issues each instruction.
Each processor executes the same instruction.
Some processors may be turned off on some instructions.
Machines are not popular (CM2), but the programming model is - applicable to emerging accelerators (GPGPUs, CellBE, etc.).
[Figure: a control processor broadcasts instructions to processors P1..Pn, each with its own memory and network interface (NI), joined by an interconnect]
Machine Model 4: Clusters of SMPs
Since small shared memory machines (SMPs) are the fastest commodity machines, why not build a larger machine by connecting many of them with a network?
CLUMP = Cluster of SMPs. Shared memory within one SMP, but message passing outside of an SMP.
Two programming models:
– Treat machine as "flat", always use message passing, even within an SMP (simple, but ignores an important part of the memory hierarchy).
– Expose two layers: shared memory and message passing (usually higher performance, but ugly to program).
Programming Model 4: Bulk Synchronous
Used within the message passing or shared memory models as a programming convention.
Phases are separated by global barriers:
– Compute phases: all operate on local data (in distributed memory) or read access to global data (in shared memory).
– Communication phases: all participate in rearrangement or reduction of global data.
Generally all doing the “same thing” in a phase:
– all do f, but may all do different things within f.
Features the simplicity of data parallelism, but without the
restrictions of a strict data parallel model.
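A rough sketch (written for these notes, not from the course) of the bulk synchronous pattern on top of MPI: each step alternates a compute phase on local data with a communication phase that reduces global data, with a barrier marking the phase boundary; the local work function is a placeholder.

    #include <mpi.h>
    #include <stdio.h>

    static double compute_local(int step, int rank) {
        return (double)(step + rank);                   /* placeholder local compute phase */
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double global = 0.0;
        for (int step = 0; step < 4; step++) {
            double local = compute_local(step, rank);   /* compute phase: local data only  */
            MPI_Barrier(MPI_COMM_WORLD);                /* global barrier between phases   */
            MPI_Allreduce(&local, &global, 1, MPI_DOUBLE,
                          MPI_SUM, MPI_COMM_WORLD);     /* communication phase: reduction  */
        }
        if (rank == 0) printf("final global value = %g\n", global);
        MPI_Finalize();
        return 0;
    }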
Summary So Far
Historically, each parallel machine was unique, along with
its programming model and programming language
It was necessary to throw away software and start over with
each new kind of machine – ugh !!!
Now we distinguish the programming model from the
underlying machine, so we can write portably correct codes that run on many machines
– MPI now the most portable option, but can be tedious
Writing portably fast code requires tuning for architecture
– Algorithm design challenge is to make this process easy
– Example: picking a block size, not rewriting whole algorithm
Steps in Writing Parallel Programs
Creating a Parallel Program
Identify work that can be done in parallel.
Partition work and perhaps data among logical processes (threads).
Manage the data access, communication, synchronization.
Goal: maximize speedup due to parallelism
Speedup_prob(P procs) = (Time to solve prob with "best" sequential solution) / (Time to solve prob in parallel on P processors) <= P (?)
Efficiency(P) = Speedup(P) / P <= 1
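For instance (illustrative numbers, not from the slides): if the best sequential solution takes 100 s and the parallel program takes 30 s on P = 4 processors, then Speedup(4) = 100/30 ≈ 3.3 and Efficiency(4) = 3.3/4 ≈ 0.83.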
° Key question is when you can solve each piece:
- statically, if information is known in advance.
- dynamically, otherwise.
Steps in the Process
[Figure: Overall Computation -> (Decomposition) -> Grains of Work -> (Assignment) -> Processes/Threads -> (Orchestration) -> Processes/Threads -> (Mapping) -> Processors]
Task: arbitrarily defined piece of work that forms the basic unit of concurrency.
Process/Thread: abstract entity that performs tasks
– tasks are assigned to threads via an assignment mechanism.
– threads must coordinate to accomplish their collective tasks.
Processor: physical entity that executes a thread.
Decomposition
Break the overall computation into individual grains of work (tasks).
– Identify concurrency and decide at what level to exploit it.
– Concurrency may be statically identifiable or may vary dynamically.
– It may depend only on problem size, or it may depend on the particular input data.
Goal: identify enough tasks to keep the target range of
processors busy, but not too many.
– Establishes upper limit on number of useful processors (i.e., scaling).
Tradeoff: sufficient concurrency vs. task control overhead.
Assignment
Determine mechanism to divide work among threads
– Functional partitioning:
» Assign logically distinct aspects of work to different threads, e.g. pipelining.
– Structural mechanisms:
» Assign iterations of "parallel loop" according to a simple rule, e.g. proc j gets iterates j*n/p through (j+1)*n/p-1 (see the sketch after this slide's goals).
» Throw tasks in a bowl (task queue) and let threads feed.
– Data/domain decomposition:
» Data describing the problem has a natural decomposition.
» Break up the data and assign work associated with regions, e.g. parts of the physical system being simulated.
Goals:
– Balance the workload to keep everyone busy (all the time).
– Allow efficient orchestration.
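A minimal sketch of the block rule quoted above (proc j gets iterates j*n/p through (j+1)*n/p-1), written for these notes; work() is a stand-in for the body of the parallel loop and the outer loop over j stands in for p concurrent threads.

    #include <stdio.h>

    static void work(int i) { printf("iteration %d\n", i); }   /* placeholder loop body */

    static void run_block(int j, int p, int n) {
        int lo = j * n / p;              /* first iterate owned by proc j       */
        int hi = (j + 1) * n / p;        /* one past the last iterate of proc j */
        for (int i = lo; i < hi; i++)
            work(i);
    }

    int main(void) {
        int n = 10, p = 3;
        for (int j = 0; j < p; j++)      /* sequential stand-in for p threads   */
            run_block(j, p, n);
        return 0;
    }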
Orchestration
Provide a means of
– Naming and accessing shared data.
– Communication and coordination among threads of control.
Goals:
– Correctness of parallel solution -- respect the inherent dependencies within the algorithm.
– Avoid serialization.
– Reduce cost of communication, synchronization, and management.
– Preserve locality of data reference.
Mapping
Binding processes to physical processors.
Time to reach a processor across the network does not depend on which processor (roughly).
– lots of old literature on "network topology", no longer so important.
Basic issue is how many remote accesses.
[Figure: two nodes, each with a processor, cache, and memory, connected by a network; cache access is fast, local memory is slow, and remote access over the network is really slow]
Example
s = f(A[1]) + … + f(A[n])

Decomposition
– computing each f(A[j])
– n-fold parallelism, where n may be >> p
– computing sum s

Assignment
– thread k sums sk = f(A[k*n/p]) + … + f(A[(k+1)*n/p - 1])
– thread 1 sums s = s1 + … + sp (for simplicity of this example)
– thread 1 communicates s to other threads

Orchestration
– starting up threads
– communicating, synchronizing with thread 1

Mapping
– processor j runs thread j
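To tie the four steps together, here is a minimal Pthreads sketch of this example (written for these notes, not the course's code): thread k sums f(A[k*n/p]) through f(A[(k+1)*n/p - 1]) into its own slot, and the main thread then combines the partial sums; f, N, and P are made-up placeholders. Build with -pthread.

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000                                   /* made-up problem size      */
    #define P 4                                      /* made-up number of threads */

    static double A[N];
    static double partial[P];                        /* s_k for each thread k     */

    static double f(double x) { return x * x; }      /* placeholder for f         */

    static void *sum_block(void *arg) {
        long k = (long)arg;
        double s = 0.0;
        for (long j = k * (long)N / P; j < (k + 1) * (long)N / P; j++)  /* assignment rule */
            s += f(A[j]);
        partial[k] = s;                              /* private partial sum       */
        return NULL;
    }

    int main(void) {
        for (long i = 0; i < N; i++) A[i] = 1.0;     /* sample data               */

        pthread_t tid[P];
        for (long k = 0; k < P; k++)                 /* orchestration: start threads */
            pthread_create(&tid[k], NULL, sum_block, (void *)k);
        for (long k = 0; k < P; k++)
            pthread_join(tid[k], NULL);

        double s = 0.0;                              /* main thread combines partials */
        for (int k = 0; k < P; k++) s += partial[k];
        printf("s = %g\n", s);                       /* 1000 for this sample data     */
        return 0;
    }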