

SLIDE 1

I: Performance metrics (cont’d) II: Parallel programming models and mechanics

Prof. Richard Vuduc
Georgia Institute of Technology
CSE/CS 8803 PNA, Spring 2008 [L.05]
Tuesday, January 22, 2008

SLIDE 2

Algorithms for 2-D (3-D) Poisson, N = n^2 (= n^3):

Algorithm          Serial              PRAM                      Memory               # procs
Dense LU           N^3                 N                         N^2                  N^2
Band LU            N^2 (N^{7/3})       N                         N^{3/2} (N^{5/3})    N (N^{4/3})
Jacobi             N^2 (N^{5/3})       N (N^{2/3})               N                    N
Explicit inverse   N^2                 log N                     N^2                  N^2
Conj. grad.        N^{3/2} (N^{4/3})   N^{1/2} (N^{1/3}) log N   N                    N
RB SOR             N^{3/2} (N^{4/3})   N^{1/2} (N^{1/3})         N                    N
Sparse LU          N^{3/2} (N^2)       N^{1/2}                   N log N (N^{4/3})    N
FFT                N log N             log N                     N                    N
Multigrid          N                   log^2 N                   N                    N
Lower bound        N                   log N                     N                    N

PRAM = idealized parallel model with zero communication cost. Source: Demmel (1997)

SLIDE 3

Sources for today’s material

Mike Heath (UIUC)
CS 267 course materials (Yelick & Demmel, UCB)

SLIDE 4

Efficiency and scalability metrics (wrap-up)

SLIDE 5

Example: Summation using a tree algorithm

Efficiency

E_p ≡ C_1 / C_p ≈ n / (n + p log p) = 1 / (1 + (p log p) / n)

SLIDE 6

Basic definitions

M   Memory complexity          Storage for given problem (e.g., words)
W   Computational complexity   Amount of work for given problem (e.g., flops)
V   Processor speed            Ops / time (e.g., flop/s)
T   Execution time             Elapsed wallclock (e.g., secs)
C   Computational cost         (No. procs) * (exec. time) (e.g., processor-hours)

SLIDE 7

Parallel scalability

An algorithm is scalable if

    E_p ≡ C_1 / C_p = Θ(1) as p → ∞

Why use more processors?
    Solve a fixed problem in less time
    Solve a larger problem in the same time (or in any feasible time)
    Obtain sufficient aggregate memory
    Tolerate latency and/or use all available bandwidth (Little's Law)
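(Little's Law relates the last two: the concurrency needed to saturate a link is latency times bandwidth. For example, hiding 1 μsec of latency on a 1 GB/s link requires roughly 1000 bytes in flight at all times.)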

SLIDE 8

Is this algorithm scalable?

No, not for fixed problem size (nor for fixed execution time or fixed work per processor). Determine instead the isoefficiency function n(p) for which efficiency is constant. But then execution time grows with p:

E_p ≡ C_1 / C_p ≈ n / (n + p log p) = 1 / (1 + (p log p) / n) = E (const.)

⟹ n(p) = Θ(p log p)

T_p = n(p)/p + log p = Θ(log p)
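A small numeric check of these formulas (a sketch; the base-2 logarithm and the sample values of n and p are my choices, not from the slides):

#include <math.h>
#include <stdio.h>

/* Efficiency model from the slides: E = 1 / (1 + p*log(p)/n). */
static double efficiency(double n, double p)
{
    return 1.0 / (1.0 + p * log2(p) / n);
}

int main(void)
{
    /* Fixed problem size: efficiency decays as p grows. */
    for (double p = 4; p <= 1024; p *= 4)
        printf("n = 1e6, p = %4.0f: E = %.4f\n", p, efficiency(1e6, p));

    /* Isoefficiency scaling n(p) = p*log(p): efficiency stays constant (0.5). */
    for (double p = 4; p <= 1024; p *= 4)
        printf("n = p log p, p = %4.0f: E = %.4f\n", p, efficiency(p * log2(p), p));
    return 0;
}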

SLIDE 9

A simple model of communication performance

SLIDE 10

[Figure: measured time to send a message, time (μsec) vs. size (bytes), for T3E/Shm, T3E/MPI, IBM/LAPI, IBM/MPI, Quadrics/Shm, Quadrics/MPI, Myrinet/GM, Myrinet/MPI, GigE/VIPL, and GigE/MPI]

SLIDE 11

Latency and bandwidth model

Model the time to send a message of n bytes in terms of latency α and bandwidth β:

    t(n) = α + n / β

Usually, cost(flop) << 1/β << α, so:
    One long message is cheaper than many short ones
    Hundreds or thousands of flops can be done in the time of a single message
    Efficiency demands a large computation-to-communication ratio
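A minimal sketch of the model in C (the α and β values are made-up placeholders, not measurements from the slides), illustrating why one long message beats many short ones:

#include <stdio.h>

/* Latency-bandwidth model: time (usec) to send n bytes, where alpha is
   the per-message latency (usec) and beta is bandwidth (bytes/usec). */
static double msg_time(double n, double alpha, double beta)
{
    return alpha + n / beta;
}

int main(void)
{
    const double alpha = 10.0;   /* assumed latency: 10 usec */
    const double beta  = 300.0;  /* assumed bandwidth: 300 bytes/usec */
    printf("1 x 64 KB: %7.1f usec\n", msg_time(65536.0, alpha, beta));
    printf("64 x 1 KB: %7.1f usec\n", 64.0 * msg_time(1024.0, alpha, beta));
    return 0;
}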

SLIDE 12

Empirical latency and (inverse) bandwidth (μsec) on real machines

SLIDE 13

[Figure (repeated): measured time to send a message, time (μsec) vs. size (bytes), for the same ten machine/API combinations]

SLIDE 14

[Figure: modeled time to send a message, t(n) = α + n/β, time (μsec) vs. size (bytes), for the same ten machine/API combinations]

SLIDE 15

Latency on some current machines (MPI round-trip)

[Figure: 8-byte MPI ping-pong roundtrip latency (μsec) for Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed; the measured values range from about 6.6 to 24.2 μsec]

Source: Yelick (UCB/LBNL)
SLIDE 16

Latency over Time

[Figure: end-to-end latency (1/2 round-trip, μsec, log scale) vs. year, 1988-2006, for many machines]

End-to-end latency (1/2 round-trip) over time

Source: Yelick (UCB/LBNL)
SLIDE 17

Bandwidth vs. Message Size

Source: Mike Welcome (NERSC)
SLIDE 18

Parallel programming models

SLIDE 19

A generic parallel architecture

Where are the memories and processors physically located? How are they connected?

[Diagram: several processors and memories connected by an interconnection network; some memories attach directly to processors, others sit across the network]

SLIDE 20

What is a “parallel programming model?”

Languages + libraries that compose an abstract view of the machine

Major constructs:
    Control: How is parallelism created? What is the execution model?
    Data: Which data are private, which shared?
    Synchronization: How are tasks coordinated? Which operations are atomic?

Variations in models:
    Reflect the diversity of machine architectures
    Imply variations in cost

SLIDE 21

Running example: Summation

Compute the sum

    s = Σ_{i=1..n} f(a_i)

Questions:
    Where is A? Which processors do what? How do we combine the partial results?

[Diagram: apply f elementwise to A[1..n], then combine the values f(A[1..n]) with ⊕ to produce s]
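A serial C baseline for this running example, as a reference point for the parallel versions that follow (f and the data are illustrative placeholders):

#include <stdio.h>

static double f(double x) { return x * x; }   /* stand-in for f */

int main(void)
{
    double A[] = { 1.0, 2.0, 3.0, 4.0 };      /* stand-in for A[1..n] */
    int n = sizeof(A) / sizeof(A[0]);
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += f(A[i]);       /* apply f, then combine with + (the "⊕") */
    printf("s = %g\n", s);  /* 1 + 4 + 9 + 16 = 30 */
    return 0;
}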

SLIDE 22

Programming model 1: Shared memory

Program = a collection of threads of control
Each thread has private variables
Threads may access shared variables, which they use to communicate implicitly and to synchronize

[Diagram: threads P0, P1, ..., Pn, each with a private variable i (i: 8, i: 2, ..., i: 5), reading and writing a variable s in shared memory (s = ...; y = ..s ...)]

SLIDE 23

Need to avoid race conditions: Use locks

Race condition (data race): two threads access the same variable, at least one access is a write, and the accesses are concurrent (not ordered by synchronization)

shared int s = 0;

Thread 1                      Thread 2
for i = 0, n/2-1              for i = n/2, n-1
    s = s + f(A[i])               s = s + f(A[i])

SLIDE 24

Need to avoid race conditions: Use locks

Explicitly lock to guarantee atomic operations:

shared int s = 0;
shared lock lk;

Thread 1                            Thread 2
local_s1 = 0                        local_s2 = 0
for i = 0, n/2-1                    for i = n/2, n-1
    local_s1 = local_s1 + f(A[i])       local_s2 = local_s2 + f(A[i])
lock(lk);                           lock(lk);
s = s + local_s1                    s = s + local_s2
unlock(lk);                         unlock(lk);
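A compilable PThreads rendering of the same pattern (a hedged sketch: f, N, and the initialization are illustrative placeholders; the two-way split mirrors the pseudocode above):

#include <pthread.h>
#include <stdio.h>

#define N 1000000
static double A[N];
static double s = 0.0;                         /* shared accumulator */
static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;

static double f(double x) { return x * x; }    /* stand-in for f */

struct range { int lo, hi; };

static void *worker(void *arg)
{
    struct range *r = (struct range *)arg;
    double local_s = 0.0;                      /* private partial sum */
    for (int i = r->lo; i < r->hi; ++i)
        local_s += f(A[i]);
    pthread_mutex_lock(&lk);                   /* one locked update per thread */
    s += local_s;
    pthread_mutex_unlock(&lk);
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N; ++i) A[i] = 1.0;
    pthread_t t1, t2;
    struct range r1 = { 0, N/2 }, r2 = { N/2, N };
    pthread_create(&t1, NULL, worker, &r1);
    pthread_create(&t2, NULL, worker, &r2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("s = %g\n", s);
    return 0;
}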

SLIDE 25

Machine model 1a: Symmetric multiprocessors (SMPs)

All processors connect to a large shared memory
Challenging to scale both the hardware and the software beyond ~32 processors

[Diagram: processors P1..Pn, each with a cache ($), on a shared bus to memory; possibly also a shared cache]

SLIDE 26

Source: Pat Worley (ORNL)

SLIDE 27

Machine model 1b: Simultaneous multithreaded processor (SMT)

Multiple thread contexts share memory and functional units
Hardware switches among threads during long-latency memory operations

[Diagram: thread contexts T0, T1, ..., Tn over shared memory, a shared cache ($), shared floating-point units, etc.]

SLIDE 28

Cray El Dorado processor Source: John Feo (Cray)

SLIDE 29

Machine model 1c: Distributed shared memory

Memory is logically shared but physically distributed
Challenging to scale cache-coherence protocols beyond ~512 processors

[Diagram: processors P1..Pn, each with a cache ($) and a local memory, connected by a network]

Cache lines (pages) must be large to amortize overhead ⇒ locality is critical to performance

SLIDE 30

Programming model 2: Message passing

Program = a set of named processes
No shared address space
Processes communicate via explicit send/receive operations

[Diagram: processes P0..Pn, each with its own private s and i; data moves only via matched operations such as "send P1,s" on one process and "receive Pn,s" on another, across the network]

SLIDE 31

Example: Computing A[1]+A[2]

What could go wrong in the following code?

Processor 1:               Processor 2:
x = A[1]                   x = A[2]
SEND x → Proc. 2           SEND x → Proc. 1
RECEIVE y ← Proc. 2        RECEIVE y ← Proc. 1
s = x + y                  s = x + y

Scenario A: send/receive works like the telephone system (the send blocks until the receiver picks up)
Scenario B: send/receive works like the post office (the send deposits the message and returns)
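In MPI terms, under scenario A both processes would block in their sends and deadlock; one standard fix is the combined MPI_Sendrecv. A hedged sketch, assuming exactly two ranks (the data values are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double A[2] = { 1.0, 2.0 };        /* illustrative data */
    double x = A[rank], y;
    int other = 1 - rank;
    /* Combined send + receive: the library pairs them internally,
       so neither rank can block forever waiting on the other's send. */
    MPI_Sendrecv(&x, 1, MPI_DOUBLE, other, 0,
                 &y, 1, MPI_DOUBLE, other, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    double s = x + y;
    printf("rank %d: s = %g\n", rank, s);
    MPI_Finalize();
    return 0;
}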

SLIDE 32

Machine model 2a: Distributed memory

Each node has its own processor(s) and memory
Nodes communicate through a network interface (NI) over an interconnect

[Diagram: nodes P0..Pn, each with local memory and an NI, attached to an interconnect]

SLIDE 33

Programming model 2b: Global address space (GAS)

Program = a set of named threads
Data are shared, but partitioned over the local processes
Implied cost model: remote accesses cost more than local ones

[Diagram: threads P0..Pn, each with a private i; a shared array s[] is partitioned so that s[k] is local to thread k (s[0]: 27, s[1]: 27, ..., s[n]: 27); code: s[myThread] = ...; y = ..s[i] ...]

SLIDE 34

Machine model 2b: Global address space

Same as distributed memory, but the NI can access memory without interrupting the CPU
Enables one-sided communication and remote direct memory access (RDMA)

[Diagram: nodes P0..Pn, each with local memory and an NI, attached to an interconnect]

SLIDE 35

Programming model 3: Data parallel

Program = a single thread of control performing parallel operations on whole data sets
Communication and coordination are implicit; easy to understand
Drawback: not always applicable
Examples: HPF, MATLAB/StarP

[Diagram: data-parallel application of f to A[1..n], followed by a ⊕-reduction to s]

SLIDE 36

Machine model 3a: Single instruction, multiple data (SIMD)

A control processor issues each instruction; (usually simpler) processors all execute it in lockstep
May "turn off" some processors for conditional execution
Examples: CM-2, MasPar

[Diagram: a control processor driving many processor + memory + NI nodes over an interconnect]

SLIDE 37

Machine model 3b: Vector processors

Single processor with multiple functional units, all performing the same operation
An instruction specifies a large amount of parallelism; the hardware executes it on a subset of elements at a time
Relies on the compiler to find the parallelism
Resurgent interest:
    Large scale: Earth Simulator, Cray X1
    Small scale: SIMD units (e.g., SSE, AltiVec, VIS)

SLIDE 38

Vector hardware

Operations act on vector registers holding O(10-100) elements per register
Actual hardware has 2-4 vector pipes or lanes

[Diagram: scalar add r3 = r1 + r2 vs. vector add vr3 = vr1 + vr2, which logically performs (# elements) adds in parallel]

SLIDE 39

Programming model 4: Hybrid

May mix any combination of the preceding models:
    MPI + threads
    DARPA HPCS languages, which mix threads and data parallelism in a global address space

SLIDE 40

Machine model 4: Clusters of SMPs (CLUMPs)

Use SMPs as building-block nodes
Many clusters are built this way (e.g., the GT "warp" cluster)
Best programming model?
    "Flat" MPI across all processors
    Shared memory within an SMP, MPI between nodes

SLIDE 41

Administrivia

SLIDE 42

Administrative stuff

No office hours today (maybe "virtual" only: AIM VuducOfficeHours)
Accounts: apparently, you already have them or will soon (!)
    Try logging into 'warp1' with your UNIX account password
    If it doesn't work, go see the TSO Help Desk (and good luck!): CCB 148 / M-F 7a-5p / 404.894.7065 / AIM tsohlpdsk
IHPCL mailing list: https://mailman.cc.gatech.edu/mailman/listinfo/ihpc-lab

SLIDE 43

Shared memory programming: POSIX Threads and OpenMP

SLIDE 44

Programming model 1: Shared memory

Program = a collection of threads of control
Each thread has private variables
Threads may access shared variables, which they use to communicate implicitly and to synchronize

[Diagram: threads P0, P1, ..., Pn, each with a private variable i (i: 8, i: 2, ..., i: 5), reading and writing a variable s in shared memory (s = ...; y = ..s ...)]

SLIDE 45

Shared memory programming

Libraries for existing languages:
    POSIX Threads (PThreads), Solaris threads: portable, low-level libraries
    OpenMP: pragma-based, targets scientific computing apps
    Intel Threading Building Blocks (TBB): combines ideas from PThreads and OpenMP

Language extensions

SLIDE 46

Common notions of thread creation

cobegin/coend: the statements in the block may run in parallel
    cobegin
        task1 (a1);
        task2 (a2);
    coend

fork/join: the forked task runs in parallel with the rest of the program
    id = fork (task1, a1);
    task2 (a2);
    join (id);

futures: v is computed asynchronously; any use of v waits for the value
    v = future (task1 (a1));
    ...
    ... = ... v ...

SLIDE 47

POSIX Threads (PThreads)

Portable library interface for creating and synchronizing threads
Threads share all global variables
Fork/join style:

errcode = pthread_create (&thread_id, &thread_attribute, &thread_fun, &fun_arg);
...
errcode = pthread_join (thread_id, NULL);

Reference: https://computing.llnl.gov/tutorials/pthreads/

SLIDE 48

Loop-level parallelism

May fork threads at any time, e.g., within a loop
Tasks must have sufficient granularity to mask the thread-creation overhead

... A[n];
for (i = 0; i < n; ++i)
    pthread_create (..., &task, &i);   /* caution: every thread receives the same pointer &i */
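A compilable variant that sidesteps the shared &i pitfall by passing the index by value, cast through the void* argument (a sketch; task's body and N are illustrative):

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define N 8
static double A[N];

static void *task(void *arg)
{
    intptr_t i = (intptr_t)arg;        /* recover the index */
    A[i] = 2.0 * i;                    /* illustrative per-element work */
    return NULL;
}

int main(void)
{
    pthread_t tid[N];
    for (intptr_t i = 0; i < N; ++i)
        pthread_create(&tid[i], NULL, task, (void *)i);
    for (int i = 0; i < N; ++i)
        pthread_join(tid[i], NULL);
    printf("A[N-1] = %g\n", A[N - 1]);
    return 0;
}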

SLIDE 49

Low-level policy control

Detached state: avoids the need for pthread_join calls
Scheduling parameters: priority, policy (FIFO vs. round-robin)
Contention scope: with which threads does this thread compete for the CPU?

SLIDE 50

Barriers for global synchronization (Optional extension)

Usage outline

pthread_barrier_t b;
pthread_barrier_init (&b, NULL, 3);   // expect 3 threads
...
pthread_barrier_wait (&b);            // each thread blocks until all 3 arrive
...
pthread_barrier_destroy (&b);

SLIDE 51

Mutual exclusion locks (mutexes)

Basic usage:

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   // static initialization, or:
pthread_mutex_init (&lock, NULL);
...
pthread_mutex_lock (&lock);
// ... do critical work ...
pthread_mutex_unlock (&lock);

Beware of deadlock:

Thread 1        Thread 2
lock (a);       lock (b);
lock (b);       lock (a);
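One standard way to avoid the deadlock shown above: impose a global lock order that every thread follows (a sketch; the names a and b come from the slide's pseudocode):

#include <pthread.h>

static pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

/* Both threads acquire a before b, so the circular wait
   (Thread 1 holds a and wants b, Thread 2 holds b and wants a)
   cannot arise. */
static void critical_work(void)
{
    pthread_mutex_lock(&a);
    pthread_mutex_lock(&b);
    /* ... work that needs both resources ... */
    pthread_mutex_unlock(&b);
    pthread_mutex_unlock(&a);
}

int main(void)
{
    critical_work();
    return 0;
}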

SLIDE 52

OpenMP: An API for multithreaded shared-memory programming

Programmer identifies serial and parallel regions, not threads
Library + compiler directives (requires compiler support)
Official website: http://www.openmp.org
Also: https://computing.llnl.gov/tutorials/openMP/

SLIDE 53

Simple example

int main() {
    printf ("hello, world!\n");   // want: execute in parallel
    return 0;
}

SLIDE 54

Simple example

int main() {
    omp_set_num_threads (16);
    #pragma omp parallel
    {
        printf ("hello, world!\n");   // executed in parallel by all 16 threads
    }   // implicit barrier/join
    return 0;
}
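With an OpenMP-aware compiler this builds as, e.g., gcc -fopenmp hello.c; each of the 16 threads executes the parallel region and prints the line once, in some interleaved order. Note that omp_set_num_threads is declared in <omp.h>.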

SLIDE 55

Concurrent loops

You may parallelize a loop, but you must check for dependencies first:

// Serial
s = 0;
for (i = 0; i < n; ++i)
    s += x[i];

// Parallel, with a reduction
#pragma omp parallel for \
        reduction(+: s)
for (i = 0; i < n; ++i)
    s += x[i];

// Parallel, with a critical section (correct, but serializes the updates)
#pragma omp parallel for \
        shared(s)
for (i = 0; i < n; ++i)
    #pragma omp critical
    s += x[i];
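A self-contained version of the reduction variant (a sketch; the array contents and size are illustrative): each thread accumulates a private copy of s, and OpenMP combines the copies with + at the end of the loop.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    enum { N = 1000000 };
    static double x[N];                    /* static: keep it off the stack */
    for (int i = 0; i < N; ++i)
        x[i] = 1.0;

    double s = 0.0;
    #pragma omp parallel for reduction(+: s)
    for (int i = 0; i < N; ++i)
        s += x[i];                         /* each thread updates its private copy */

    printf("s = %g\n", s);                 /* expect 1e6 */
    return 0;
}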

SLIDE 56

Loop scheduling

Use the "schedule" clause to partition loop iterations:
    Static: k iterations per thread, assigned statically
    Dynamic: k iterations per thread, assigned via a logical work queue
    Guided: k iterations per thread initially, reduced with each successive allocation
    Run-time: take the schedule from the environment variable OMP_SCHEDULE

#pragma omp parallel for schedule(static, k)
...
#pragma omp parallel for schedule(dynamic, k)
...
#pragma omp parallel for schedule(guided, k)
...
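A sketch of where the schedule choice matters: here later iterations do more work, so a static partition would leave the last thread with most of it, while schedule(dynamic, 64) hands out chunks from a queue as threads finish (the chunk size 64 is an arbitrary choice):

#include <stdio.h>

int main(void)
{
    enum { N = 10000 };
    static double t[N];
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < N; ++i) {
        double s = 0.0;
        for (int j = 0; j <= i; ++j)   /* cost grows with i */
            s += j;
        t[i] = s;
    }
    printf("t[N-1] = %g\n", t[N - 1]); /* expect (N-1)*N/2 = 49995000 */
    return 0;
}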

SLIDE 57

Synchronization primitives

Critical sections (no explicit locks needed):
    #pragma omp critical
    { ... }

Barriers (may require flushing):
    #pragma omp barrier

Explicit locks:
    omp_set_lock (l);
    ...
    omp_unset_lock (l);

Single-thread regions (inside parallel regions):
    #pragma omp single
    { /* executed once */ }

SLIDE 58

“In conclusion…”

SLIDE 59

Backup slides

SLIDE 60

Network topology

Historically of great interest, particularly for mapping algorithms onto networks
    Key metric: minimize hops
Modern networks hide the per-hop cost, so topology matters less
    There is a large gap between hardware and software latency: on the IBM SP, cf. 1.5 μsec to 36 μsec
Topology still affects bisection bandwidth, so it remains relevant

SLIDE 61

Bisection bandwidth

Bandwidth across the smallest cut that divides the network into two equal halves
Important for all-to-all communication patterns

[Diagram: examples of a bisection cut and a non-bisection cut; for a linear array, bisection bw = link bw; for a 2-D mesh, bisection bw = sqrt(n) * link bw]

SLIDE 62

Linear and ring networks

Linear array: diameter ~ n/3, bisection = 1
Ring/torus: diameter ~ n/4, bisection = 2

SLIDE 63

Multidimensional meshes and tori

2-D mesh: diameter ~ 2 * sqrt(n), bisection = sqrt(n)
2-D torus: diameter ~ sqrt(n), bisection = 2 * sqrt(n)

SLIDE 64

Hypercubes

No. of nodes n = 2^d for dimension d
Diameter = d; bisection = n/2

[Diagram: hypercubes of dimension d = 0, 1, 2, 3, 4]
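A convenient property of this numbering: node IDs are d-bit integers, and the neighbor across dimension j is obtained by flipping bit j. A small illustration (the choice of d = 3 and node 5 is mine, not the slides'):

#include <stdio.h>

int main(void)
{
    int d = 3;        /* 2^3 = 8 nodes */
    int node = 5;     /* binary 101 */
    for (int j = 0; j < d; ++j)
        printf("neighbor of node %d across dim %d: %d\n",
               node, j, node ^ (1 << j));   /* flip bit j */
    return 0;
}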

SLIDE 65

Trees

Diameter = log n; bisection bandwidth = 1
Fat trees: avoid the bisection bottleneck by using fatter links near the top of the tree

SLIDE 66

Butterfly networks

Diameter = log n; bisection = n
Cost: wiring

SLIDE 67

Topologies in real machines

Machine                        Network
Cray XT3, XT4                  3-D torus        (newer)
BG/L                           3-D torus
SGI Altix                      fat tree
Cray X1                        4-D hypercube*
Millennium (UCB, Myricom)      arbitrary*
HP Alphaserver (Quadrics)      fat tree
IBM SP                         ~ fat tree
SGI Origin                     hypercube
Intel Paragon                  2-D mesh
BBN Butterfly                  butterfly        (older)

SLIDE 68

Evolution of distributed memory machine networks

Message queues have been replaced by direct memory access (DMA)
Wormhole routing: the processor packs/copies the data and initiates the transfer, then moves on
Message-passing libraries provide a store-and-forward abstraction:
    May send/receive between any pair of nodes
    Time is proportional to distance, since each processor along the path participates
