  1. I: Performance metrics (cont’d); II: Parallel programming models and mechanics. Prof. Richard Vuduc, Georgia Institute of Technology. CSE/CS 8803 PNA, Spring 2008 [L.05]. Tuesday, January 22, 2008.

  2. Algorithms for 2-D (3-D) Poisson, N = n^2 (= n^3). Parenthesized entries are for the 3-D problem.

       Algorithm         Serial           PRAM                  Memory            # procs
       Dense LU          N^3              N                     N^2               N^2
       Band LU           N^2 (N^7/3)      N                     N^3/2 (N^5/3)     N (N^4/3)
       Jacobi            N^2 (N^5/3)      N (N^2/3)             N                 N
       Explicit inverse  N^2              log N                 N^2               N^2
       Conj. grad.       N^3/2 (N^4/3)    N^1/2 (N^1/3) log N   N                 N
       RB SOR            N^3/2 (N^4/3)    N^1/2 (N^1/3)         N                 N
       Sparse LU         N^3/2 (N^2)      N^1/2                 N log N (N^4/3)   N
       FFT               N log N          log N                 N                 N
       Multigrid         N                log^2 N               N                 N
       Lower bound       N                log N                 N

     PRAM = idealized parallel model with zero communication cost. Source: Demmel (1997)

  3. Sources for today’s material: Mike Heath (UIUC); CS 267 (Yelick & Demmel, UC Berkeley).

  4. Efficiency and scalability metrics (wrap-up)

  5. Example: Summation using a tree algorithm. Efficiency: E_p ≡ C_1 / C_p = n / (n + p log p) ≈ 1 / (1 + (p/n) log p)

  6. Basic definitions:
       M — Memory complexity: storage for a given problem (e.g., words)
       W — Computational complexity: amount of work for a given problem (e.g., flops)
       V — Processor speed: ops / time (e.g., flop/s)
       T — Execution time: elapsed wallclock (e.g., secs)
       C — Computational cost: (no. procs) × (exec. time) (e.g., processor-hours)

  7. Parallel scalability: an algorithm is scalable if E_p ≡ C_1 / C_p = Θ(1) as p → ∞. Why use more processors? To solve a fixed problem in less time; to solve a larger problem in the same time (or any time); to obtain sufficient aggregate memory; to tolerate latency and/or use all available bandwidth (Little’s Law).

  8. Is this algorithm scalable? Not for fixed problem size, fixed execution time, or fixed work per processor. Instead, determine the isoefficiency function, i.e., the problem size n(p) for which efficiency stays constant: E_p ≡ C_1 / C_p = n / (n + p log p) ≈ 1 / (1 + (p/n) log p) = E (const.) ⟹ n(p) = Θ(p log p). But then execution time grows with p: T_p = n/p + log p = Θ(log p).
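     As a quick numerical check of this model, the sketch below (my own, not from the slides; the fixed n and the range of p are arbitrary) evaluates E_p = n / (n + p log2 p) both for a fixed problem size and along the isoefficiency curve n(p) = p log2 p: efficiency decays in the first case and holds constant in the second.

```c
/* Sketch: efficiency of the tree summation, E_p = n / (n + p*log2(p)),
 * for a fixed n and along the isoefficiency curve n(p) = p*log2(p).
 * The numbers are illustrative only.  Compile with -lm. */
#include <math.h>
#include <stdio.h>

static double efficiency(double n, double p) {
    return n / (n + p * log2(p));
}

int main(void) {
    const double n_fixed = 1e6;   /* arbitrary fixed problem size */
    printf("%10s %18s %18s\n", "p", "E_p (fixed n)", "E_p (n = p log p)");
    for (double p = 2; p <= 1 << 20; p *= 4) {
        double n_iso = p * log2(p);        /* n(p) = Theta(p log p) */
        printf("%10.0f %18.4f %18.4f\n",
               p, efficiency(n_fixed, p), efficiency(n_iso, p));
    }
    return 0;
}
```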

  9. A simple model of communication performance

  10. Time to send a message: measured time (μsec, log scale) vs. message size (8 bytes to 128 KB) for T3E/Shm, T3E/MPI, IBM/LAPI, IBM/MPI, Quadrics/Shm, Quadrics/MPI, Myrinet/GM, Myrinet/MPI, GigE/VIPL, and GigE/MPI.

  11. Latency and bandwidth model: model the time to send an n-byte message in terms of latency α and bandwidth β as t(n) = α + n/β. Usually cost(flop) << 1/β << α, so one long message is cheaper than many short ones, and hundreds or thousands of flops can be done for each message. Efficiency demands a large computation-to-communication ratio.
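     To illustrate why one long message beats many short ones, this sketch (my own; α and β are made-up placeholder values, not the measured numbers from the plots) totals the cost of sending the same payload as k separate messages under t(n) = α + n/β. The bandwidth term stays fixed while every extra message pays another α.

```c
/* Sketch of the latency-bandwidth model t(n) = alpha + n/beta with
 * illustrative placeholder constants. */
#include <stdio.h>

int main(void) {
    const double alpha = 10e-6;       /* latency: 10 usec per message (assumed) */
    const double beta  = 1e9;         /* bandwidth: 1 GB/s (assumed) */
    const double n     = 1 << 20;     /* total payload: 1 MiB */

    /* Sending the same payload as k messages costs k*alpha + n/beta. */
    for (int k = 1; k <= 4096; k *= 8) {
        double t = k * alpha + n / beta;
        printf("%5d message(s): %10.1f usec\n", k, t * 1e6);
    }
    return 0;
}
```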

  12. Empirical latency and inverse bandwidth (μsec) on real machines.

  13. Time to send a message (measured): the same plot of time (μsec) vs. message size (bytes) for the ten machine/API combinations above.

  14. Time to send a message (model): the α–β model fit to the same data, time (μsec) vs. message size (bytes) for the same machines.

  15. Latency on some current machines: 8-byte MPI ping-pong round-trip latency (μsec) on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed, ranging from about 6.6 to 24.2 μsec. Source: Yelick (UCB/LBNL)

  16. End-to-end latency (half round-trip) over time, 1988–2006: measured latencies (μsec, log scale) trending down from on the order of 100 μsec to a few μsec. Source: Yelick (UCB/LBNL)

  17. Bandwidth vs. message size. Source: Mike Welcome (NERSC)

  18. Parallel programming models

  19. A generic parallel architecture: processors and memories joined by an interconnection network. Where are the memories physically located relative to the processors? What is the connectivity?

  20. What is a “parallel programming model?” Languages plus libraries that compose an abstract view of the machine. Major constructs: control (how is parallelism created? what is the execution model?), data (private vs. shared?), and synchronization (how are tasks coordinated? what is atomic?). Variations in models reflect the diversity of machine architectures and imply variations in cost.

  21. Running example: Summation. Compute the sum s = Σ_{i=1}^{n} f(a_i). Questions: Where is “A”? Which processors do what? How do we combine results? Dataflow: A[1..n] → f(·) → f(A[1..n]) → ⊕ → s

  22. Programming model 1: Shared memory. Program = collection of threads of control; each thread has private variables and may access shared variables, used to communicate implicitly and to synchronize. Diagram: threads P0…Pn, each with a private i, reading and writing a shared variable s (e.g., s = ..., y = ..s...).

  23. Need to avoid race conditions: Use locks. Race condition (data race): two threads access a variable, at least one access is a write, and the accesses are concurrent.

       shared int s = 0;

       Thread 1:                    Thread 2:
         for i = 0, n/2-1             for i = n/2, n-1
           s = s + f(A[i])              s = s + f(A[i])

  24. Need to avoid race conditions: Use locks. Explicitly lock to guarantee atomic operations.

       shared int s = 0;
       shared lock lk;

       Thread 1:                             Thread 2:
         local_s1 = 0                          local_s2 = 0
         for i = 0, n/2-1                      for i = n/2, n-1
           local_s1 = local_s1 + f(A[i])         local_s2 = local_s2 + f(A[i])
         lock(lk);                             lock(lk);
         s = s + local_s1                      s = s + local_s2
         unlock(lk);                           unlock(lk);
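     A concrete version of this pattern may help. The sketch below is a minimal POSIX-threads rendering of the slide’s pseudocode; the array A, its size N, and the function f are placeholders invented for the example, not part of the lecture. Each thread accumulates a private partial sum and touches the shared s only under the mutex.

```c
/* Minimal pthreads sketch of the locked summation above.  Compile with -lpthread. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static double A[N];
static double s = 0.0;                      /* shared accumulator */
static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;

static double f(double x) { return x * x; } /* placeholder for f(.) */

static void *worker(void *arg) {
    long tid = (long)arg;                   /* 0 or 1 */
    long lo = tid * (N / 2), hi = lo + N / 2;
    double local_s = 0.0;
    for (long i = lo; i < hi; i++)
        local_s += f(A[i]);                 /* private: no race */
    pthread_mutex_lock(&lk);                /* one atomic update per thread */
    s += local_s;
    pthread_mutex_unlock(&lk);
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) A[i] = 1.0;
    pthread_t t[2];
    for (long tid = 0; tid < 2; tid++)
        pthread_create(&t[tid], NULL, worker, (void *)tid);
    for (long tid = 0; tid < 2; tid++)
        pthread_join(t[tid], NULL);
    printf("s = %g\n", s);                  /* expect N * f(1.0) = 1e6 */
    return 0;
}
```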

  25. Machine model 1a: Symmetric multiprocessors (SMPs). All processors connect to a large shared memory over a bus, each through its own cache ($). Challenging to scale both hardware and software beyond roughly 32 processors.

  26. [Figure] Source: Pat Worley (ORNL)

  27. Machine model 1b: Simultaneous multithreaded processor (SMT). Multiple thread contexts T0…Tn share memory and functional units (shared caches, shared floating-point units, etc.); the processor switches among threads during long-latency memory operations.

  28. Cray El Dorado processor. Source: John Feo (Cray)

  29. Machine model 1c: Distributed shared memory. Memory is logically shared but physically distributed: processors with caches ($) connect over a network to separate memories. Challenging to scale cache-coherency protocols beyond roughly 512 processors. Cache lines (or pages) must be large to amortize overhead, so locality is critical to performance.

  30. Programming model 2: Message passing. Program = named processes; there is no shared address space, and processes communicate via explicit send/receive operations. Diagram: processes P0…Pn on a network, each with its own s and i; e.g., one process executes send(P1, s) while another executes receive(Pn, s) and then y = ..s...

  31. Example: Computing A[1]+A[2]

       Processor 1:              Processor 2:
         x = A[1]                  x = A[2]
         SEND x → Proc. 2          SEND x → Proc. 1
         RECEIVE y ← Proc. 2       RECEIVE y ← Proc. 1
         s = x + y                 s = x + y

     What could go wrong in this code? Scenario A: send/receive work like the telephone system (both parties must be on the line at once, so the two sends can deadlock). Scenario B: send/receive work like the post office (messages are buffered, so the exchange completes).
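     One way to write this exchange so that it cannot deadlock under either scenario is MPI’s combined send-receive. The sketch below is my own mapping of the pseudocode onto MPI, not code from the lecture, and assumes exactly two ranks.

```c
/* Sketch: the two-processor exchange via MPI_Sendrecv, which pairs the
 * send and the receive internally, so the exchange completes whether the
 * underlying sends are buffered ("post office") or synchronous ("telephone"). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    double A[2] = {1.0, 2.0};   /* stand-in for the slide's A[1], A[2] */
    double x = A[rank], y;
    int other = 1 - rank;

    MPI_Sendrecv(&x, 1, MPI_DOUBLE, other, 0,
                 &y, 1, MPI_DOUBLE, other, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    double s = x + y;           /* both ranks compute A[1] + A[2] = 3 */
    printf("rank %d: s = %g\n", rank, s);
    MPI_Finalize();
    return 0;
}
```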

  32. Machine model 2a: Distributed memory. Separate processing nodes, each with its own memory and a network interface (NI), communicate through the network interface over an interconnect.

  33. Programming model 2b: Global address space (GAS). Program = named threads. Data are shared but partitioned over local processes; the implied cost model is that remote accesses cost more. Diagram: threads P0…Pn, each with a private i, sharing a partitioned array s[0..n] (e.g., s[myThread] = ..., y = ..s[i]...).

  34. Machine model 2b: Global address space. Same as distributed memory, but the NI can access memory without interrupting the CPU: one-sided communication, i.e., remote direct memory access (RDMA).
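     To make the one-sided idea concrete, here is a minimal sketch of my own using MPI’s RMA interface: rank 0 reads a value out of rank 1’s exposed window without rank 1 ever posting a receive. It assumes at least two ranks and is an illustration of the concept, not code from the slides.

```c
/* Sketch: one-sided (RDMA-style) access with MPI RMA. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    double mine = (rank + 1) * 10.0;   /* each rank's locally owned datum */
    double remote = 0.0;

    MPI_Win win;
    MPI_Win_create(&mine, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);             /* open an access epoch */
    if (rank == 0)
        MPI_Get(&remote, 1, MPI_DOUBLE, /* target rank */ 1,
                /* displacement */ 0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);             /* close the epoch: the get is complete */

    if (rank == 0)
        printf("rank 0 read %g from rank 1\n", remote);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```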

  35. Programming model 3: Data parallel. Program = a single thread of control performing parallel operations on data; communication and coordination are implicit, which makes the model easy to understand. Drawback: not always applicable. Examples: HPF, MATLAB/StarP. Dataflow: A[1..n] → f(·) → f(A[1..n]) → ⊕ → s
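     The slide’s examples are HPF and MATLAB/StarP; as a C-level stand-in of my own choosing, the sketch below expresses the running summation as one parallel loop with a reduction and leaves the partitioning and combining to the OpenMP runtime. The array contents and f are placeholders.

```c
/* Sketch: the summation written as a single data-parallel loop; the
 * communication and coordination are implicit in the reduction clause.
 * Compile with -fopenmp (the pragma is ignored otherwise). */
#include <stdio.h>

#define N 1000000

static double f(double x) { return x * x; }  /* placeholder for f(.) */

int main(void) {
    static double A[N];
    for (int i = 0; i < N; i++) A[i] = 1.0;

    double s = 0.0;
    /* One logical operation over the whole array; the runtime splits the
     * iterations and combines the partial sums. */
    #pragma omp parallel for reduction(+ : s)
    for (int i = 0; i < N; i++)
        s += f(A[i]);

    printf("s = %g\n", s);   /* expect 1e6 */
    return 0;
}
```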

  36. Machine model 3a: Single instruction, multiple data (SIMD). A control processor issues each instruction and (usually) simpler processors execute it; some processors may be “turned off” for a given instruction. Examples: CM-2, MasPar. Diagram: a control processor driving many processor + NI + memory elements over an interconnect.

  37. Machine model 3b: Vector processors. A single processor with multiple functional units performing the same operation; an instruction specifies a large amount of parallelism, and the hardware executes it on a subset at a time. Relies on the compiler to find parallelism. Resurgent interest at large scale (Earth Simulator, Cray X1) and at small scale (SIMD units, e.g., SSE, Altivec, VIS).

  38. Vector hardware: operations act on vector registers holding O(10–100) elements each; e.g., vr3 = vr1 + vr2 logically performs (# elements) additions in parallel, by analogy with the scalar r3 = r1 + r2. Actual hardware has 2–4 vector pipes or lanes.
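     For the small-scale SIMD units mentioned on slide 37 (SSE etc.), a brief sketch with x86 SSE intrinsics (my own example, not from the lecture) shows four float additions issued as a single vector operation, the same element-wise register add as above but with short vectors.

```c
/* Sketch: element-wise vector addition via SSE intrinsics (x86 only);
 * each _mm_add_ps performs four float additions in parallel. */
#include <stdio.h>
#include <xmmintrin.h>

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    for (int i = 0; i < 8; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);    /* 4 adds in one instruction */
        _mm_storeu_ps(&c[i], vc);
    }

    for (int i = 0; i < 8; i++)
        printf("%g ", c[i]);               /* prints 9 eight times */
    printf("\n");
    return 0;
}
```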

  39. Programming model 4: Hybrid. May mix any combination of the preceding models, e.g., MPI + threads; the DARPA HPCS languages mix threads and data parallelism in a global address space.

  40. Machine model 4: Clusters of SMPs (CLUMPs). Use SMPs as the building-block nodes; many clusters are built this way (e.g., the GT “warp” cluster). What is the best programming model: “flat” MPI everywhere, or shared memory within each SMP node and MPI between nodes?
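     One common realization of the second option is MPI between nodes with OpenMP threads inside each node. The sketch below is a minimal illustration of my own under stated assumptions (one MPI rank per node, a placeholder per-rank array size N_LOCAL, uniform data), not a recommendation from the slides.

```c
/* Sketch of the MPI + threads hybrid: OpenMP reduces within a node,
 * MPI_Allreduce combines across nodes.  Compile with mpicc -fopenmp. */
#include <mpi.h>
#include <stdio.h>

#define N_LOCAL 250000   /* elements owned by each MPI process (assumed) */

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    static double A[N_LOCAL];
    for (int i = 0; i < N_LOCAL; i++) A[i] = 1.0;

    /* Shared memory inside the node: OpenMP threads reduce locally. */
    double local_s = 0.0;
    #pragma omp parallel for reduction(+ : local_s)
    for (int i = 0; i < N_LOCAL; i++)
        local_s += A[i];

    /* Message passing between nodes: combine the per-process sums. */
    double s = 0.0;
    MPI_Allreduce(&local_s, &s, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) printf("global sum = %g\n", s);

    MPI_Finalize();
    return 0;
}
```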

  41. Administrivia
