

SLIDE 1

I: Performance metrics (cont’d) II: Parallel programming models and mechanics

Prof. Richard Vuduc
Georgia Institute of Technology
CSE/CS 8803 PNA, Spring 2008 [L.05]
Tuesday, January 22, 2008

SLIDE 2

Algorithms for 2-D (3-D) Poisson, N = n^2 (= n^3):

Algorithm          Serial              PRAM                      Memory               # procs
Dense LU           N^3                 N                         N^2                  N^2
Band LU            N^2 (N^{7/3})       N                         N^{3/2} (N^{5/3})    N (N^{4/3})
Jacobi             N^2 (N^{5/3})       N (N^{2/3})               N                    N
Explicit inverse   N^2                 log N                     N^2                  N^2
Conj. grad.        N^{3/2} (N^{4/3})   N^{1/2} (N^{1/3}) log N   N                    N
RB SOR             N^{3/2} (N^{4/3})   N^{1/2} (N^{1/3})         N                    N
Sparse LU          N^{3/2} (N^2)       N^{1/2}                   N log N (N^{4/3})    N
FFT                N log N             log N                     N                    N
Multigrid          N                   log^2 N                   N                    N
Lower bound        N                   log N                     N                    N

PRAM = idealized parallel model with zero communication cost. Source: Demmel (1997)

SLIDE 3

Sources for today’s material

Mike Heath (UIUC)
CS 267 course materials (Yelick & Demmel, UCB)

SLIDE 4

Efficiency and scalability metrics (wrap-up)

SLIDE 5

Example: Summation using a tree algorithm

Efficiency

E_p ≡ C_1 / C_p ≈ n / (n + p log p) = 1 / (1 + (p log p) / n)

SLIDE 6

Basic definitions

M   Memory complexity          Storage for given problem (e.g., words)
W   Computational complexity   Amount of work for given problem (e.g., flops)
V   Processor speed            Ops / time (e.g., flop/s)
T   Execution time             Elapsed wallclock (e.g., secs)
C   Computational cost         (No. procs) * (exec. time) (e.g., processor-hours)

SLIDE 7

Parallel scalability

An algorithm is scalable if

    E_p ≡ C_1 / C_p = Θ(1) as p → ∞

Why use more processors?
    Solve a fixed problem in less time
    Solve a larger problem in the same time (or in any feasible time)
    Obtain sufficient aggregate memory
    Tolerate latency and/or use all available bandwidth (Little's Law)
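(Little's Law relates the last two: the concurrency needed to saturate a link is latency times bandwidth. For example, hiding 1 μsec of latency on a 1 GB/s link requires roughly 1000 bytes in flight at all times.)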

SLIDE 8

Is this algorithm scalable?

No, not for fixed problem size (nor for fixed execution time or fixed work per processor). Determine instead the isoefficiency function n(p) for which efficiency is constant. But then execution time grows with p:

E_p ≡ C_1 / C_p ≈ n / (n + p log p) = 1 / (1 + (p log p) / n) = E (const.)

⟹ n(p) = Θ(p log p)

T_p = n(p)/p + log p = Θ(log p)
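A small numeric check of these formulas (a sketch; the base-2 logarithm and the sample values of n and p are my choices, not from the slides):

#include <math.h>
#include <stdio.h>

/* Efficiency model from the slides: E = 1 / (1 + p*log(p)/n). */
static double efficiency(double n, double p)
{
    return 1.0 / (1.0 + p * log2(p) / n);
}

int main(void)
{
    /* Fixed problem size: efficiency decays as p grows. */
    for (double p = 4; p <= 1024; p *= 4)
        printf("n = 1e6, p = %4.0f: E = %.4f\n", p, efficiency(1e6, p));

    /* Isoefficiency scaling n(p) = p*log(p): efficiency stays constant (0.5). */
    for (double p = 4; p <= 1024; p *= 4)
        printf("n = p log p, p = %4.0f: E = %.4f\n", p, efficiency(p * log2(p), p));
    return 0;
}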

SLIDE 9

A simple model of communication performance

SLIDE 10

[Figure: measured time to send a message, time (μsec) vs. size (bytes), for T3E/Shm, T3E/MPI, IBM/LAPI, IBM/MPI, Quadrics/Shm, Quadrics/MPI, Myrinet/GM, Myrinet/MPI, GigE/VIPL, and GigE/MPI]

SLIDE 11

Latency and bandwidth model

Model the time to send a message of n bytes in terms of latency α and bandwidth β:

    t(n) = α + n / β

Usually, cost(flop) << 1/β << α, so:
    One long message is cheaper than many short ones
    Hundreds or thousands of flops can be done in the time of a single message
    Efficiency demands a large computation-to-communication ratio
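A minimal sketch of the model in C (the α and β values are made-up placeholders, not measurements from the slides), illustrating why one long message beats many short ones:

#include <stdio.h>

/* Latency-bandwidth model: time (usec) to send n bytes, where alpha is
   the per-message latency (usec) and beta is bandwidth (bytes/usec). */
static double msg_time(double n, double alpha, double beta)
{
    return alpha + n / beta;
}

int main(void)
{
    const double alpha = 10.0;   /* assumed latency: 10 usec */
    const double beta  = 300.0;  /* assumed bandwidth: 300 bytes/usec */
    printf("1 x 64 KB: %7.1f usec\n", msg_time(65536.0, alpha, beta));
    printf("64 x 1 KB: %7.1f usec\n", 64.0 * msg_time(1024.0, alpha, beta));
    return 0;
}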

SLIDE 12

Empirical latency and (inverse) bandwidth (μsec) on real machines

SLIDE 13

[Figure (repeated): measured time to send a message, time (μsec) vs. size (bytes), for the same ten machine/API combinations]

SLIDE 14

[Figure: modeled time to send a message, t(n) = α + n/β, time (μsec) vs. size (bytes), for the same ten machine/API combinations]

SLIDE 15

Latency on some current machines (MPI round-trip)

[Figure: 8-byte MPI ping-pong roundtrip latency (μsec) for Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed; the measured values range from about 6.6 to 24.2 μsec]

Source: Yelick (UCB/LBNL)
SLIDE 16

Latency over Time

[Figure: end-to-end latency (1/2 round-trip, μsec, log scale) vs. year, 1988-2006, for many machines]

End-to-end latency (1/2 round-trip) over time

Source: Yelick (UCB/LBNL)
SLIDE 17

Bandwidth vs. Message Size

Source: Mike Welcome (NERSC)
SLIDE 18

Parallel programming models

SLIDE 19

A generic parallel architecture

Where are the memories and processors physically located? How are they connected?

[Diagram: several processors and memories connected by an interconnection network; some memories attach directly to processors, others sit across the network]

SLIDE 20

What is a “parallel programming model?”

Languages + libraries that compose an abstract view of the machine

Major constructs:
    Control: How is parallelism created? What is the execution model?
    Data: Which data are private, which shared?
    Synchronization: How are tasks coordinated? Which operations are atomic?

Variations in models:
    Reflect the diversity of machine architectures
    Imply variations in cost

SLIDE 21

Running example: Summation

Compute the sum

    s = Σ_{i=1..n} f(a_i)

Questions:
    Where is A? Which processors do what? How do we combine the partial results?

[Diagram: apply f elementwise to A[1..n], then combine the values f(A[1..n]) with ⊕ to produce s]
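A serial C baseline for this running example, as a reference point for the parallel versions that follow (f and the data are illustrative placeholders):

#include <stdio.h>

static double f(double x) { return x * x; }   /* stand-in for f */

int main(void)
{
    double A[] = { 1.0, 2.0, 3.0, 4.0 };      /* stand-in for A[1..n] */
    int n = sizeof(A) / sizeof(A[0]);
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += f(A[i]);       /* apply f, then combine with + (the "⊕") */
    printf("s = %g\n", s);  /* 1 + 4 + 9 + 16 = 30 */
    return 0;
}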

SLIDE 22

Programming model 1: Shared memory

Program = a collection of threads of control
Each thread has private variables
Threads may access shared variables, which they use to communicate implicitly and to synchronize

[Diagram: threads P0, P1, ..., Pn, each with a private variable i (i: 8, i: 2, ..., i: 5), reading and writing a variable s in shared memory (s = ...; y = ..s ...)]

SLIDE 23

Need to avoid race conditions: Use locks

Race condition (data race): two threads access the same variable, at least one access is a write, and the accesses are concurrent (not ordered by synchronization)

shared int s = 0;

Thread 1                      Thread 2
for i = 0, n/2-1              for i = n/2, n-1
    s = s + f(A[i])               s = s + f(A[i])

SLIDE 24

Need to avoid race conditions: Use locks

Explicitly lock to guarantee atomic operations:

shared int s = 0;
shared lock lk;

Thread 1                            Thread 2
local_s1 = 0                        local_s2 = 0
for i = 0, n/2-1                    for i = n/2, n-1
    local_s1 = local_s1 + f(A[i])       local_s2 = local_s2 + f(A[i])
lock(lk);                           lock(lk);
s = s + local_s1                    s = s + local_s2
unlock(lk);                         unlock(lk);
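A compilable PThreads rendering of the same pattern (a hedged sketch: f, N, and the initialization are illustrative placeholders; the two-way split mirrors the pseudocode above):

#include <pthread.h>
#include <stdio.h>

#define N 1000000
static double A[N];
static double s = 0.0;                         /* shared accumulator */
static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;

static double f(double x) { return x * x; }    /* stand-in for f */

struct range { int lo, hi; };

static void *worker(void *arg)
{
    struct range *r = (struct range *)arg;
    double local_s = 0.0;                      /* private partial sum */
    for (int i = r->lo; i < r->hi; ++i)
        local_s += f(A[i]);
    pthread_mutex_lock(&lk);                   /* one locked update per thread */
    s += local_s;
    pthread_mutex_unlock(&lk);
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N; ++i) A[i] = 1.0;
    pthread_t t1, t2;
    struct range r1 = { 0, N/2 }, r2 = { N/2, N };
    pthread_create(&t1, NULL, worker, &r1);
    pthread_create(&t2, NULL, worker, &r2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("s = %g\n", s);
    return 0;
}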

SLIDE 25

Machine model 1a: Symmetric multiprocessors (SMPs)

All processors connect to a large shared memory
Challenging to scale both the hardware and the software beyond ~32 processors

[Diagram: processors P1..Pn, each with a cache ($), on a shared bus to memory; possibly also a shared cache]

SLIDE 26

Source: Pat Worley (ORNL)

SLIDE 27

Machine model 1b: Simultaneous multithreaded processor (SMT)

Multiple thread contexts share memory and functional units
Hardware switches among threads during long-latency memory operations

[Diagram: thread contexts T0, T1, ..., Tn over shared memory, a shared cache ($), shared floating-point units, etc.]

SLIDE 28

Cray El Dorado processor Source: John Feo (Cray)

SLIDE 29

Machine model 1c: Distributed shared memory

Memory is logically shared but physically distributed
Challenging to scale cache-coherence protocols beyond ~512 processors

[Diagram: processors P1..Pn, each with a cache ($) and a local memory, connected by a network]

Cache lines (pages) must be large to amortize overhead ⇒ locality is critical to performance

SLIDE 30

Programming model 2: Message passing

Program = a set of named processes
No shared address space
Processes communicate via explicit send/receive operations

[Diagram: processes P0..Pn, each with its own private s and i; data moves only via matched operations such as "send P1,s" on one process and "receive Pn,s" on another, across the network]

SLIDE 31

Example: Computing A[1]+A[2]

What could go wrong in the following code?

Processor 1:               Processor 2:
x = A[1]                   x = A[2]
SEND x → Proc. 2           SEND x → Proc. 1
RECEIVE y ← Proc. 2        RECEIVE y ← Proc. 1
s = x + y                  s = x + y

Scenario A: send/receive works like the telephone system (the send blocks until the receiver picks up)
Scenario B: send/receive works like the post office (the send deposits the message and returns)
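In MPI terms, under scenario A both processes would block in their sends and deadlock; one standard fix is the combined MPI_Sendrecv. A hedged sketch, assuming exactly two ranks (the data values are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double A[2] = { 1.0, 2.0 };        /* illustrative data */
    double x = A[rank], y;
    int other = 1 - rank;
    /* Combined send + receive: the library pairs them internally,
       so neither rank can block forever waiting on the other's send. */
    MPI_Sendrecv(&x, 1, MPI_DOUBLE, other, 0,
                 &y, 1, MPI_DOUBLE, other, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    double s = x + y;
    printf("rank %d: s = %g\n", rank, s);
    MPI_Finalize();
    return 0;
}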

SLIDE 32

Machine model 2a: Distributed memory

Each node has its own processor(s) and memory
Nodes communicate through a network interface (NI) over an interconnect

[Diagram: nodes P0..Pn, each with local memory and an NI, attached to an interconnect]

SLIDE 33

Programming model 2b: Global address space (GAS)

Program = a set of named threads
Data are shared, but partitioned over the local processes
Implied cost model: remote accesses cost more than local ones

[Diagram: threads P0..Pn, each with a private i; a shared array s[] is partitioned so that s[k] is local to thread k (s[0]: 27, s[1]: 27, ..., s[n]: 27); code: s[myThread] = ...; y = ..s[i] ...]

SLIDE 34

Machine model 2b: Global address space

Same as distributed memory, but the NI can access memory without interrupting the CPU
Enables one-sided communication and remote direct memory access (RDMA)

[Diagram: nodes P0..Pn, each with local memory and an NI, attached to an interconnect]

SLIDE 35

Programming model 3: Data parallel

Program = a single thread of control performing parallel operations on whole data sets
Communication and coordination are implicit; easy to understand
Drawback: not always applicable
Examples: HPF, MATLAB/StarP

[Diagram: data-parallel application of f to A[1..n], followed by a ⊕-reduction to s]

SLIDE 36

Machine model 3a: Single instruction, multiple data (SIMD)

A control processor issues each instruction; (usually simpler) processors all execute it in lockstep
May "turn off" some processors for conditional execution
Examples: CM-2, MasPar

[Diagram: a control processor driving many processor + memory + NI nodes over an interconnect]

SLIDE 37

Machine model 3b: Vector processors

Single processor with multiple functional units, all performing the same operation
An instruction specifies a large amount of parallelism; the hardware executes it on a subset of elements at a time
Relies on the compiler to find the parallelism
Resurgent interest:
    Large scale: Earth Simulator, Cray X1
    Small scale: SIMD units (e.g., SSE, AltiVec, VIS)

SLIDE 38

Vector hardware

Operations act on vector registers holding O(10-100) elements per register
Actual hardware has 2-4 vector pipes or lanes

[Diagram: scalar add r3 = r1 + r2 vs. vector add vr3 = vr1 + vr2, which logically performs (# elements) adds in parallel]

SLIDE 39

Programming model 4: Hybrid

May mix any combination of the preceding models:
    MPI + threads
    DARPA HPCS languages, which mix threads and data parallelism in a global address space

SLIDE 40

Machine model 4: Clusters of SMPs (CLUMPs)

Use SMPs as building-block nodes
Many clusters are built this way (e.g., the GT "warp" cluster)
Best programming model?
    "Flat" MPI across all processors
    Shared memory within an SMP, MPI between nodes

SLIDE 41

Administrivia

SLIDE 42

Administrative stuff

No office hours today (maybe "virtual" only: AIM VuducOfficeHours)
Accounts: apparently, you already have them or will soon (!)
    Try logging into 'warp1' with your UNIX account password
    If it doesn't work, go see the TSO Help Desk (and good luck!): CCB 148 / M-F 7a-5p / 404.894.7065 / AIM tsohlpdsk
IHPCL mailing list: https://mailman.cc.gatech.edu/mailman/listinfo/ihpc-lab

SLIDE 43

Shared memory programming: POSIX Threads and OpenMP

SLIDE 44

Programming model 1: Shared memory

Program = a collection of threads of control
Each thread has private variables
Threads may access shared variables, which they use to communicate implicitly and to synchronize

[Diagram: threads P0, P1, ..., Pn, each with a private variable i (i: 8, i: 2, ..., i: 5), reading and writing a variable s in shared memory (s = ...; y = ..s ...)]

SLIDE 45

Shared memory programming

Libraries for existing languages:
    POSIX Threads (PThreads), Solaris threads: portable, low-level libraries
    OpenMP: pragma-based, targets scientific computing apps
    Intel Threading Building Blocks (TBB): combines ideas from PThreads and OpenMP

Language extensions

SLIDE 46

Common notions of thread creation

cobegin/coend: the statements in the block may run in parallel
    cobegin
        task1 (a1);
        task2 (a2);
    coend

fork/join: the forked task runs in parallel with the rest of the program
    id = fork (task1, a1);
    task2 (a2);
    join (id);

futures: v is computed asynchronously; any use of v waits for the value
    v = future (task1 (a1));
    ...
    ... = ... v ...

SLIDE 47

POSIX Threads (PThreads)

Portable library interface for creating and synchronizing threads
Threads share all global variables
Fork/join style:

errcode = pthread_create (&thread_id, &thread_attribute, &thread_fun, &fun_arg);
...
errcode = pthread_join (thread_id, NULL);

Reference: https://computing.llnl.gov/tutorials/pthreads/

SLIDE 48

Loop-level parallelism

May fork threads at any time, e.g., within a loop
Tasks must have sufficient granularity to mask the thread-creation overhead

... A[n];
for (i = 0; i < n; ++i)
    pthread_create (..., &task, &i);   /* caution: every thread receives the same pointer &i */
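A compilable variant that sidesteps the shared &i pitfall by passing the index by value, cast through the void* argument (a sketch; task's body and N are illustrative):

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define N 8
static double A[N];

static void *task(void *arg)
{
    intptr_t i = (intptr_t)arg;        /* recover the index */
    A[i] = 2.0 * i;                    /* illustrative per-element work */
    return NULL;
}

int main(void)
{
    pthread_t tid[N];
    for (intptr_t i = 0; i < N; ++i)
        pthread_create(&tid[i], NULL, task, (void *)i);
    for (int i = 0; i < N; ++i)
        pthread_join(tid[i], NULL);
    printf("A[N-1] = %g\n", A[N - 1]);
    return 0;
}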

SLIDE 49

Low-level policy control

Detached state: avoids the need for pthread_join calls
Scheduling parameters: priority, policy (FIFO vs. round-robin)
Contention scope: with which threads does this thread compete for the CPU?

SLIDE 50

Barriers for global synchronization (Optional extension)

Usage outline

pthread_barrier_t b;
pthread_barrier_init (&b, NULL, 3);   // expect 3 threads
...
pthread_barrier_wait (&b);            // each thread blocks until all 3 arrive
...
pthread_barrier_destroy (&b);

SLIDE 51

Mutual exclusion locks (mutexes)

Basic usage:

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   // static initialization, or:
pthread_mutex_init (&lock, NULL);
...
pthread_mutex_lock (&lock);
// ... do critical work ...
pthread_mutex_unlock (&lock);

Beware of deadlock:

Thread 1        Thread 2
lock (a);       lock (b);
lock (b);       lock (a);
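One standard way to avoid the deadlock shown above: impose a global lock order that every thread follows (a sketch; the names a and b come from the slide's pseudocode):

#include <pthread.h>

static pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

/* Both threads acquire a before b, so the circular wait
   (Thread 1 holds a and wants b, Thread 2 holds b and wants a)
   cannot arise. */
static void critical_work(void)
{
    pthread_mutex_lock(&a);
    pthread_mutex_lock(&b);
    /* ... work that needs both resources ... */
    pthread_mutex_unlock(&b);
    pthread_mutex_unlock(&a);
}

int main(void)
{
    critical_work();
    return 0;
}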

SLIDE 52

OpenMP: An API for multithreaded shared-memory programming

Programmer identifies serial and parallel regions, not threads
Library + compiler directives (requires compiler support)
Official website: http://www.openmp.org
Also: https://computing.llnl.gov/tutorials/openMP/

SLIDE 53

Simple example

int main() {
    printf ("hello, world!\n");   // want: execute in parallel
    return 0;
}

SLIDE 54

Simple example

int main() {
    omp_set_num_threads (16);
    #pragma omp parallel
    {
        printf ("hello, world!\n");   // executed in parallel by all 16 threads
    }   // implicit barrier/join
    return 0;
}
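With an OpenMP-aware compiler this builds as, e.g., gcc -fopenmp hello.c; each of the 16 threads executes the parallel region and prints the line once, in some interleaved order. Note that omp_set_num_threads is declared in <omp.h>.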

SLIDE 55

Concurrent loops

You may parallelize a loop, but you must check for dependencies first:

// Serial
s = 0;
for (i = 0; i < n; ++i)
    s += x[i];

// Parallel, with a reduction
#pragma omp parallel for \
        reduction(+: s)
for (i = 0; i < n; ++i)
    s += x[i];

// Parallel, with a critical section (correct, but serializes the updates)
#pragma omp parallel for \
        shared(s)
for (i = 0; i < n; ++i)
    #pragma omp critical
    s += x[i];
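A self-contained version of the reduction variant (a sketch; the array contents and size are illustrative): each thread accumulates a private copy of s, and OpenMP combines the copies with + at the end of the loop.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    enum { N = 1000000 };
    static double x[N];                    /* static: keep it off the stack */
    for (int i = 0; i < N; ++i)
        x[i] = 1.0;

    double s = 0.0;
    #pragma omp parallel for reduction(+: s)
    for (int i = 0; i < N; ++i)
        s += x[i];                         /* each thread updates its private copy */

    printf("s = %g\n", s);                 /* expect 1e6 */
    return 0;
}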

SLIDE 56

Loop scheduling

Use the "schedule" clause to partition loop iterations:
    Static: k iterations per thread, assigned statically
    Dynamic: k iterations per thread, assigned via a logical work queue
    Guided: k iterations per thread initially, reduced with each successive allocation
    Run-time: take the schedule from the environment variable OMP_SCHEDULE

#pragma omp parallel for schedule(static, k)
...
#pragma omp parallel for schedule(dynamic, k)
...
#pragma omp parallel for schedule(guided, k)
...
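A sketch of where the schedule choice matters: here later iterations do more work, so a static partition would leave the last thread with most of it, while schedule(dynamic, 64) hands out chunks from a queue as threads finish (the chunk size 64 is an arbitrary choice):

#include <stdio.h>

int main(void)
{
    enum { N = 10000 };
    static double t[N];
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < N; ++i) {
        double s = 0.0;
        for (int j = 0; j <= i; ++j)   /* cost grows with i */
            s += j;
        t[i] = s;
    }
    printf("t[N-1] = %g\n", t[N - 1]); /* expect (N-1)*N/2 = 49995000 */
    return 0;
}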

SLIDE 57

Synchronization primitives

Critical sections (no explicit locks needed):
    #pragma omp critical
    { ... }

Barriers (may require flushing):
    #pragma omp barrier

Explicit locks:
    omp_set_lock (l);
    ...
    omp_unset_lock (l);

Single-thread regions (inside parallel regions):
    #pragma omp single
    { /* executed once */ }

SLIDE 58

“In conclusion…”

SLIDE 59

Backup slides

SLIDE 60

Network topology

Historically of great interest, particularly for mapping algorithms onto networks
    Key metric: minimize hops
Modern networks hide the per-hop cost, so topology matters less
    There is a large gap between hardware and software latency: on the IBM SP, cf. 1.5 μsec to 36 μsec
Topology still affects bisection bandwidth, so it remains relevant

SLIDE 61

Bisection bandwidth

Bandwidth across the smallest cut that divides the network into two equal halves
Important for all-to-all communication patterns

[Diagram: examples of a bisection cut and a non-bisection cut; for a linear array, bisection bw = link bw; for a 2-D mesh, bisection bw = sqrt(n) * link bw]

SLIDE 62

Linear and ring networks

Linear array: diameter ~ n/3, bisection = 1
Ring/torus: diameter ~ n/4, bisection = 2

SLIDE 63

Multidimensional meshes and tori

2-D mesh: diameter ~ 2 * sqrt(n), bisection = sqrt(n)
2-D torus: diameter ~ sqrt(n), bisection = 2 * sqrt(n)

SLIDE 64

Hypercubes

No. of nodes n = 2^d for dimension d
Diameter = d; bisection = n/2

[Diagram: hypercubes of dimension d = 0, 1, 2, 3, 4]
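A convenient property of this numbering: node IDs are d-bit integers, and the neighbor across dimension j is obtained by flipping bit j. A small illustration (the choice of d = 3 and node 5 is mine, not the slides'):

#include <stdio.h>

int main(void)
{
    int d = 3;        /* 2^3 = 8 nodes */
    int node = 5;     /* binary 101 */
    for (int j = 0; j < d; ++j)
        printf("neighbor of node %d across dim %d: %d\n",
               node, j, node ^ (1 << j));   /* flip bit j */
    return 0;
}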

SLIDE 65

Trees

Diameter = log n; bisection bandwidth = 1
Fat trees: avoid the bisection bottleneck by using fatter links near the top of the tree

SLIDE 66

Butterfly networks

Diameter = log n; bisection = n
Cost: wiring

SLIDE 67

Topologies in real machines

Machine                        Network
Cray XT3, XT4                  3-D torus        (newer)
BG/L                           3-D torus
SGI Altix                      fat tree
Cray X1                        4-D hypercube*
Millennium (UCB, Myricom)      arbitrary*
HP Alphaserver (Quadrics)      fat tree
IBM SP                         ~ fat tree
SGI Origin                     hypercube
Intel Paragon                  2-D mesh
BBN Butterfly                  butterfly        (older)

SLIDE 68

Evolution of distributed memory machine networks

Message queues have been replaced by direct memory access (DMA)
Wormhole routing: the processor packs/copies the data and initiates the transfer, then moves on
Message-passing libraries provide a store-and-forward abstraction:
    May send/receive between any pair of nodes
    Time is proportional to distance, since each processor along the path participates
