SLIDE 1

CS 5220: Performance basics

David Bindel 2017-08-24

SLIDE 2

Starting on the Soap Box

  • The goal is right enough, fast enough — not flop/s.
  • Performance is not all that matters.
  • Portability, readability, debuggability matter too!
  • Want to make intelligent trade-offs.
  • The road to good performance starts with a single core.
  • Even single-core performance is hard.
  • Helps to build on well-engineered libraries.
  • Parallel efficiency is hard!
  • p processors ≠ speedup of p
  • Different algorithms parallelize differently.
  • Speedup vs a naive, untuned serial algorithm is cheating!

SLIDE 3

The Cost of Computing

Consider a simple serial code:

// Accumulate C += A*B for n-by-n matrices
for (i = 0; i < n; ++i)
    for (j = 0; j < n; ++j)
        for (k = 0; k < n; ++k)
            C[i+j*n] += A[i+k*n] * B[k+j*n];

Simplest model:

  • Dominant cost is 2n³ flops (adds and multiplies)
  • One flop per clock cycle
  • Expected time is

Time (s) ≈ 2n³ flops / (2.4 · 10⁹ cycle/s × 1 flop/cycle)

Problem: Model assumptions are wrong!
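For reference, here is a small sketch (mine, not from the original slides) of what this naive estimate predicts for a few values of n, using the same 2.4 GHz clock rate; the next slides explain why the assumptions behind it fail.

#include <stdio.h>

/* Naive estimate: 2n^3 flops at one flop per 2.4 GHz clock cycle. */
double naive_matmul_time(double n)
{
    double flops = 2.0 * n * n * n;      /* adds and multiplies */
    double flop_rate = 2.4e9 * 1.0;      /* 2.4 GHz x 1 flop/cycle */
    return flops / flop_rate;
}

int main(void)
{
    for (int n = 500; n <= 4000; n *= 2)
        printf("n = %4d: predicted time %.3f s\n", n, naive_matmul_time(n));
    return 0;
}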

SLIDE 4

The Cost of Computing

Dominant cost is 2n³ flops (adds and multiplies)?

  • Dominant cost is often memory traffic!
  • Special case of a communication cost
  • Two pieces to cost of fetching data

Latency: time from operation start to first result (s)
Bandwidth: rate at which data arrives (bytes/s)

  • Usually latency ≫ bandwidth⁻¹ ≫ time per flop
  • Latency to L3 cache is 10s of ns, DRAM is 3–4× slower
  • Partial solution: caches (to discuss next time)

See: Latency numbers every programmer should know
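A minimal sketch (my own illustration, with placeholder numbers rather than measured ones) of the two-piece communication cost: the time to fetch data is roughly latency plus size divided by bandwidth, and for small transfers the latency term dominates.

#include <stdio.h>

/* Two-piece cost model: time = latency + bytes / bandwidth. */
double transfer_time(double bytes, double latency_s, double bandwidth_Bps)
{
    return latency_s + bytes / bandwidth_Bps;
}

int main(void)
{
    double latency = 100e-9;    /* ~100 ns, illustrative DRAM-scale latency */
    double bandwidth = 10e9;    /* 10 GB/s, illustrative */
    printf("8 bytes: %7.1f ns\n", 1e9 * transfer_time(8.0, latency, bandwidth));
    printf("1 MiB:   %7.1f us\n", 1e6 * transfer_time(1048576.0, latency, bandwidth));
    return 0;
}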

SLIDE 5

The Cost of Computing

One flop per clock cycle? For cluster CPU cores:

(2 flops / FMA) × (4 FMA / vector FMA) × (2 vector FMA / cycle) = 16 flops/cycle

Theoretical peak (one core) is

Time (s) ≈ 2n³ flops / (2.4 · 10⁹ cycle/s × 16 flop/cycle)

Makes DRAM latency look even worse! DRAM latency ∼ 100 ns:

100 ns × 2.4 cycle/ns × 16 flops/cycle = 3840 flops

SLIDE 6

The Cost of Computing

Theoretical peak for matrix-matrix product (one core) is

Time (s) ≈ 2n³ flops / (2.4 · 10⁹ cycle/s × 16 flop/cycle)

For a 12-core node, the theoretical peak is 12× faster.

  • But lose orders of magnitude if too many memory refs
  • And getting full vectorization is also not easy!
  • We’ll talk more about (single-core) arch next week
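As a rough sanity check (my own sketch, reusing the 2.4 GHz, 16 flop/cycle, and 12-core figures from the slides), the peak-based estimate works out as follows; the caveats in the list above are why measured times are usually much worse.

#include <stdio.h>

int main(void)
{
    double clock_hz = 2.4e9;          /* cycles/s */
    double flops_per_cycle = 16.0;    /* 2 flops/FMA x 4-wide x 2 FMA/cycle */
    int    cores = 12;

    double core_peak = clock_hz * flops_per_cycle;   /* ~38.4 Gflop/s */
    double node_peak = core_peak * cores;            /* ~460.8 Gflop/s */

    double n = 4096.0;
    double flops = 2.0 * n * n * n;                  /* matmul flop count */

    printf("core peak: %.1f Gflop/s, node peak: %.1f Gflop/s\n",
           core_peak / 1e9, node_peak / 1e9);
    printf("n = %.0f matmul at peak: %.2f s (one core), %.2f s (node)\n",
           n, flops / core_peak, flops / node_peak);
    return 0;
}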

SLIDE 7

The Cost of Computing

Sanity check: What is the theoretical peak of a Xeon Phi 5110P accelerator? Wikipedia to the rescue!

SLIDE 8

The Cost of Computing

What to take away from this performance modeling example?

  • Start with a simple model
  • Simplest model is asymptotic complexity (e.g. O(n³) flops)
  • Counting every detail just complicates life
  • But we want enough detail to predict something
  • Watch out for hidden costs
  • Flops are not the only cost!
  • Memory/communication costs are often killers
  • Integer computation may play a role as well
  • Account for instruction-level parallelism, too!

And we haven’t even talked about more than one core yet!

SLIDE 9

The Cost of (Parallel) Computing

Simple model:

  • Serial task takes time T (or T(n))
  • Deploy p processors
  • Parallel time is T(n)/p

... and you should be suspicious by now!

SLIDE 10

The Cost of (Parallel) Computing

Why is parallel time not T/p?

  • Overheads: communication, synchronization, extra computation and memory overheads

  • Intrinsically serial work
  • Idle time due to synchronization
  • Contention for resources

We will talk about all of these in more detail.

SLIDE 11

Quantifying Parallel Performance

  • Starting point: good serial performance
  • Scaling study: compare parallel to serial time as a function of the number of processors (p); see the sketch after this list

Speedup = Serial time / Parallel time
Efficiency = Speedup / p

  • Ideally, speedup = p. Usually, speedup < p.
  • Barriers to perfect speedup
  • Serial work (Amdahl’s law)
  • Parallel overheads (communication, synchronization)
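A small sketch (mine, with made-up timing numbers) of how a scaling study is usually tabulated from the definitions above: for each processor count, report the speedup and efficiency relative to the serial time.

#include <stdio.h>

int main(void)
{
    /* Hypothetical measured times (s); a real study would use your own data. */
    int    procs[] = {1, 2, 4, 8, 16};
    double times[] = {10.0, 5.3, 2.9, 1.7, 1.1};
    double t_serial = times[0];

    printf("%4s %8s %9s %11s\n", "p", "time", "speedup", "efficiency");
    for (int i = 0; i < 5; ++i) {
        double speedup = t_serial / times[i];
        printf("%4d %8.2f %9.2f %10.0f%%\n",
               procs[i], times[i], speedup, 100.0 * speedup / procs[i]);
    }
    return 0;
}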

SLIDE 12

Amdahl’s Law

Parallel scaling study where some serial code remains:

p = number of processors
s = fraction of work that is serial
t_s = serial time
t_p = parallel time ≥ s·t_s + (1 − s)·t_s/p

Amdahl’s law:

Speedup = t_s / t_p ≤ 1 / (s + (1 − s)/p) < 1/s

So 1% serial work ⇒ max speedup < 100×, regardless of p.
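A quick sketch (mine, not from the slides) of the bound in action: with 1% serial work, the Amdahl speedup creeps toward but never reaches 100×, no matter how many processors are thrown at it.

#include <stdio.h>

/* Amdahl bound: speedup <= 1 / (s + (1 - s)/p). */
double amdahl_bound(double s, double p)
{
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void)
{
    double s = 0.01;   /* 1% of the work is serial */
    for (long p = 1; p <= 65536; p *= 4)
        printf("p = %6ld: speedup <= %6.2f\n", p, amdahl_bound(s, (double)p));
    return 0;
}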

SLIDE 13

Strong and weak scaling

Amdahl looks bad! But there are two types of scaling studies:

Strong scaling: fix problem size, vary p
Weak scaling: fix work per processor, vary p

For weak scaling, study the scaled speedup

S(p) = T_serial(n(p)) / T_parallel(n(p), p)

Gustafson’s Law: S(p) ≤ p − α(p − 1), where α is the fraction of work that is serial.
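For contrast with the Amdahl sketch above, here is a matching sketch (again mine) of the Gustafson bound with the same 1% serial fraction; because the problem size grows with p in a weak-scaling study, the scaled speedup keeps growing.

#include <stdio.h>

/* Gustafson bound on scaled speedup: S(p) <= p - alpha*(p - 1). */
double gustafson_bound(double alpha, double p)
{
    return p - alpha * (p - 1.0);
}

int main(void)
{
    double alpha = 0.01;   /* serial fraction */
    for (long p = 1; p <= 1024; p *= 4)
        printf("p = %5ld: scaled speedup <= %8.2f\n", p, gustafson_bound(alpha, (double)p));
    return 0;
}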

SLIDE 14

Pleasing Parallelism

A task is “pleasingly parallel” (aka “embarrassingly parallel”) if it requires very little coordination, for example:

  • Monte Carlo computations with many independent trials
  • Big data computations mapping many data items independently

Result is “high-throughput” computing – easy to get impressive speedups! Says nothing about hard-to-parallelize tasks.
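As an illustration of the Monte Carlo case (a sketch of mine, not part of the slides), estimating π by throwing random points at the unit square: every trial is independent, so the loop parallelizes with nothing more than a per-thread random stream and a single reduction. Build with something like cc -O2 -fopenmp.

#include <stdio.h>
#include <omp.h>

/* Tiny LCG so each thread has its own reproducible random stream. */
static double next_uniform(unsigned int *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state / 4294967296.0;      /* map to [0, 1) */
}

int main(void)
{
    const long n_trials = 100000000;
    long hits = 0;

    #pragma omp parallel reduction(+:hits)
    {
        unsigned int state = 1234u + 7919u * (unsigned)omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < n_trials; ++i) {
            double x = next_uniform(&state);
            double y = next_uniform(&state);
            if (x * x + y * y <= 1.0)
                ++hits;                /* point landed inside the quarter circle */
        }
    }
    printf("pi is roughly %f\n", 4.0 * (double)hits / (double)n_trials);
    return 0;
}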

SLIDE 15

Dependencies

Main pain point: dependency between computations

a = f(x)
b = g(x)
c = h(a,b)

Compute a and b in parallel, but finish both before c! Limits amount of parallel work available. This is a true dependency (read-after-write). Also have false dependencies (write-after-read and write-after-write) that can be dealt with more easily.
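One way the fragment above might look with the parallelism made explicit (a sketch of mine using OpenMP tasks; f, g, and h are hypothetical stand-ins): a and b are computed concurrently, and taskwait forces both to finish before the true dependency on c is honored.

#include <stdio.h>

/* Hypothetical stand-ins for f, g, h. */
static double f(double x) { return x + 1.0; }
static double g(double x) { return 2.0 * x; }
static double h(double a, double b) { return a * b; }

int main(void)
{
    double x = 3.0, a = 0.0, b = 0.0, c = 0.0;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task shared(a)
        a = f(x);                 /* independent of b */
        #pragma omp task shared(b)
        b = g(x);                 /* independent of a */
        #pragma omp taskwait      /* wait for both: true dependency */
        c = h(a, b);
    }
    printf("c = %g\n", c);
    return 0;
}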

SLIDE 16

Granularity

  • Coordination is expensive — including parallel start/stop!
  • Need to do enough work to amortize parallel costs
  • Not enough to have parallel work, need big chunks!
  • How big the chunks must be depends on the machine (see the sketch below).
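A sketch of the granularity point (mine, not from the slides): both functions do identical work, but the first opens a fresh parallel region for every row and pays the start/stop cost n times, while the second opens one region and gives each thread large contiguous chunks. Build with cc -O2 -fopenmp.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Fine-grained: a new parallel region per row, so overhead is paid n times. */
void scale_fine(double *y, const double *x, int n, int m)
{
    for (int i = 0; i < n; ++i) {
        #pragma omp parallel for
        for (int j = 0; j < m; ++j)
            y[i * m + j] = 2.0 * x[i * m + j];
    }
}

/* Coarse-grained: one parallel region, whole rows per thread. */
void scale_coarse(double *y, const double *x, int n, int m)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            y[i * m + j] = 2.0 * x[i * m + j];
}

int main(void)
{
    int n = 2000, m = 2000;
    double *x = malloc((size_t)n * m * sizeof(double));
    double *y = malloc((size_t)n * m * sizeof(double));
    for (int i = 0; i < n * m; ++i) x[i] = i;

    double t0 = omp_get_wtime(); scale_fine(y, x, n, m);
    double t1 = omp_get_wtime(); scale_coarse(y, x, n, m);
    double t2 = omp_get_wtime();

    printf("fine-grained:   %.4f s\ncoarse-grained: %.4f s\n", t1 - t0, t2 - t1);
    free(x); free(y);
    return 0;
}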

SLIDE 17

Patterns and Benchmarks

If your task is not pleasingly parallel, you ask:

  • What is the best performance I reasonably expect?
  • How do I get that performance?

Look at examples somewhat like yours – a parallel pattern – and maybe seek an informative benchmark. Better yet: reduce to a previously well-solved problem (build on tuned kernels). NB: Easy to pick uninformative benchmarks and go astray.
