CS 5220: Optimization basics
David Bindel
2017-08-31

Reminder: Modern processors
• Modern CPUs are
  • Wide: start / retire multiple instructions per cycle
  • Pipelined: overlap instruction executions
  • Out-of-order: dynamically schedule instructions
• Lots of opportunities for instruction-level parallelism (ILP)
• Complicated! Want the compiler to handle the details
• Implication: we should give the compiler
  • Good instruction mixes
  • Independent operations
  • Vectorizable operations

Reminder: Memory systems
• Memory accesses are expensive!
  • Flop time ≪ 1/bandwidth ≪ latency
• Caches provide intermediate cost/capacity points
• Cache benefits from
  • Spatial locality (regular local access)
  • Temporal locality (small working sets)

Goal: (Trans)portable performance
• Attention to detail has orders-of-magnitude impact
• Different systems = different micro-architectures, caches
• Want (trans)portable performance across HW
• Need principles for high-perf code along with tricks

Basic principles
• Think before you write
• Time before you tune
• Stand on the shoulders of giants
• Help your tools help you
• Tune your data structures

Think before you write

Premature optimization
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
– Don Knuth

Premature optimization
Wrong reading: “Performance doesn’t matter”
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
– Don Knuth

Premature optimization
What he actually said (with my emphasis)
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
– Don Knuth
• Don’t forget the big efficiencies!
• Don’t forget the 3%!
• Your code is not premature forever!

Don’t sweat the small stuff
• OK to write high-level stuff in Matlab or Python
• OK if configuration file reader is un-tuned
• OK if O(n²) prelude to O(n³) algorithm is not hyper-tuned?
Speed-up from tuning ϵ of code < (1 − ϵ)⁻¹ ≈ 1 + ϵ
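
The speed-up bound on this slide follows from an Amdahl's-law style argument. A short derivation, assuming a fraction ϵ of the original run time T is sped up by a factor s (the symbols T and s are introduced here for illustration and are not in the original slide):

```latex
\[
T_{\mathrm{new}} = (1-\epsilon)\,T + \frac{\epsilon}{s}\,T,
\qquad
\frac{T}{T_{\mathrm{new}}}
  = \frac{1}{(1-\epsilon) + \epsilon/s}
  < \frac{1}{1-\epsilon}
  = 1 + \epsilon + O(\epsilon^2).
\]
```

Even with s → ∞, tuning a part that accounts for 1% of the run time buys at most about a 1% speed-up.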

Lay-of-the-land thinking

    for (i = 0; i < n; ++i)
        for (j = 0; j < n; ++j)
            for (k = 0; k < n; ++k)
                C[i+j*n] += A[i+k*n] * B[k+j*n];

• What are the “big computations” in my code?
• What are the natural algorithmic variants?
  • Vary loop orders? Different interpretations!
  • Lower complexity algorithm (Strassen?)
• Should I rule out some options in advance?
• How can I code so it is easy to experiment?
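
One way to make the loop-order question concrete is to keep two variants side by side behind the same interface, so that they are easy to swap and time. The sketch below assumes column-major n-by-n matrices of doubles, as the indexing on the slide suggests; the function names `matmul_ijk` and `matmul_jki` are made up for illustration.

```c
#include <assert.h>

/* C += A*B for n-by-n column-major matrices, loops ordered i, j, k
   (the order shown on the slide). */
void matmul_ijk(int n, const double* A, const double* B, double* C)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                C[i+j*n] += A[i+k*n] * B[k+j*n];
}

/* Same computation with loops ordered j, k, i: the innermost loop now
   walks down a column of C and a column of A with unit stride. */
void matmul_jki(int n, const double* A, const double* B, double* C)
{
    assert(n >= 0);
    for (int j = 0; j < n; ++j)
        for (int k = 0; k < n; ++k) {
            double bkj = B[k+j*n];
            for (int i = 0; i < n; ++i)
                C[i+j*n] += A[i+k*n] * bkj;
        }
}
```

Both variants accumulate each entry of C over k in the same order, so they should produce identical results; only the memory access pattern differs. Which one is faster depends on the machine and on n, which is the point of making experiments cheap.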

How big is n?
Typical analysis: time is O(f(n))
• Meaning: ∃ C, N : ∀ n ≥ N, T_n ≤ C f(n).
• Says nothing about constant factors: O(10n) = O(n)
• Ignores lower order terms: O(n³ + 1000n²) = O(n³)
• Behavior at small n may not match behavior at large n!
Beware asymptotic complexity arguments about small-n codes!

Avoid work

    #include <stdbool.h>

    // Scans all n entries, even after a negative one is found
    bool any_negative1(int* x, int n)
    {
        bool result = false;
        for (int i = 0; i < n; ++i)
            result = (result || x[i] < 0);
        return result;
    }

    // Returns as soon as the first negative entry is found
    bool any_negative2(int* x, int n)
    {
        for (int i = 0; i < n; ++i)
            if (x[i] < 0)
                return true;
        return false;
    }

Be cheap
Fast enough, right enough ⟹ Approximate when you can get away with it.

Do more with less (data)
Want lots of work relative to data loads:
• Keep data compact to fit in cache
• Use short data types for better vectorization
• But be aware of tradeoffs!
  • For integers: may want 64-bit ints sometimes!
  • For floating-point: will discuss in detail in other lectures

Remember the I/O!
Example: Explicit PDE time stepper on 256² mesh
• 0.25 MB per frame (three fit in L3 cache)
• Constant work per element (a few flops)
• Time to write to disk ≈ 5 ms
If I write once every 100 frames, how much time is I/O?
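
The question on this slide can be answered with a few lines of arithmetic once a compute rate is assumed. The sketch below is one way to set it up; the slide gives only the mesh size, the write time, and the write interval, so the flops per element and the achieved flop rate are assumptions to be supplied by the reader, and the function names are made up for illustration.

```c
#include <assert.h>

/* Per-frame compute time in seconds, from an element count, an assumed
   flop count per element, and an assumed achieved flop rate (flops/s). */
double step_time(long n_elements, double flops_per_element, double flop_rate)
{
    assert(n_elements >= 0 && flops_per_element >= 0 && flop_rate > 0);
    return n_elements * flops_per_element / flop_rate;
}

/* Fraction of total run time spent writing output.
   t_step:  compute time per frame in seconds
   t_write: time per disk write in seconds (about 5e-3 on the slide)
   every:   number of frames computed per write (100 on the slide) */
double io_fraction(double t_step, double t_write, int every)
{
    assert(every > 0 && t_step >= 0 && t_write >= 0);
    assert(t_step + t_write > 0);
    double t_compute = every * t_step;
    return t_write / (t_compute + t_write);
}
```

For instance, assuming 10 flops per element at an achieved 1 Gflop/s, `io_fraction(step_time(256L*256L, 10.0, 1e9), 5e-3, 100)` comes out to roughly 7%. At an assumed 10 Gflop/s the same write schedule takes roughly 43% of the run time, so the faster the kernel, the more the I/O matters.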

Time before you tune

Hot spots and bottlenecks
• Often a little bit of code takes most of the time
• Usually called a “hot spot” or bottleneck
• Goal: Find and eliminate
  • Cute coinage: “de-slugging”

Practical timing
Need to worry about:
• System timer resolutions
• Wall-clock time vs CPU time
• Size of data collected vs how informative it is
• Cross-interference with other tasks
• Cache warm-start on repeated timings
• Overlooked issues from too-small timings

Manual instrumentation
Basic picture:
• Identify stretch of code to be timed
• Run it several times with “characteristic” data
• Accumulate the total time spent
Caveats: Effects from repetition, “characteristic” data

Manual instrumentation
• Hard to get portable high-resolution wall-clock time!
• Solution: omp_get_wtime()
• Requires OpenMP support (still not Clang)
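
The basic picture of manual instrumentation can be sketched as a small harness. To keep the example free of an OpenMP dependency, this version swaps in the POSIX call `clock_gettime` with `CLOCK_MONOTONIC` in place of `omp_get_wtime()`; that substitution is an assumption of this sketch, not something from the slides. The helper names (`wall_time`, `time_kernel`) and the example kernel are likewise made up for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <time.h>

/* Wall-clock time in seconds from the POSIX monotonic clock. */
double wall_time(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

/* Run kernel(arg) ntrials times and return the average time per call. */
double time_kernel(void (*kernel)(void*), void* arg, int ntrials)
{
    assert(ntrials > 0);
    kernel(arg);                 /* Untimed warm-up run */
    double t0 = wall_time();
    for (int t = 0; t < ntrials; ++t)
        kernel(arg);
    double t1 = wall_time();
    return (t1 - t0) / ntrials;
}

/* Hypothetical example kernel: sums n doubles into a result field. */
typedef struct { const double* x; int n; double result; } sum_args_t;

void sum_kernel(void* arg)
{
    sum_args_t* a = (sum_args_t*) arg;
    double s = 0;
    for (int i = 0; i < a->n; ++i)
        s += a->x[i];
    a->result = s;
}
```

A call such as `time_kernel(sum_kernel, &args, 1000)` returns the mean time per call. Where OpenMP is available, the body of `wall_time` could simply return `omp_get_wtime()`. The warm-up run means the timed runs start with a warm cache, which is exactly the repetition effect the caveats warn about: the number may not match what the kernel costs inside the full application.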

Types of profiling tools
• Sampling vs instrumenting
  • Sampling: Interrupt every t_profile cycles
  • Instrumenting: Rewrite code to insert timers
  • Instrument at binary or source level
• Function level or line-by-line
  • Function: Inlining can cause mis-attribution
  • Line-by-line: Usually requires debugging symbols (-g)
• Context information?
  • Distinguish full call stack or not?
• Time full run, or just part?

Hardware counters
• Counters track cache misses, instruction counts, etc.
• Present on most modern chips
• May require significant permissions to access...

Automated analysis tools
• Examples: MAQAO and IACA
• Symbolic execution of model of a code segment
• Usually only practical for short segments
• But can give detailed feedback on (assembly) quality

Shoulders of giants

What makes a good kernel?
Computational kernels are
• Small and simple to describe
• General building blocks (amortize tuning work)
• Ideally high arithmetic intensity
  • Arithmetic intensity = flops/byte
  • Amortizes memory costs

Case study: BLAS
Basic Linear Algebra Subprograms
• Level 1: O(n) work on O(n) data
• Level 2: O(n²) work on O(n²) data
• Level 3: O(n³) work on O(n²) data
Level 3 BLAS are key for high-perf transportable LA.
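
To see why the Level 3 routines stand out, it can help to put rough numbers on flops per byte for one representative kernel at each level. The sketch below uses idealized counts for double-precision axpy, gemv, and gemm on size-n vectors and n-by-n matrices, assuming each array element is counted once at 8 bytes; the choice of kernels and the function names are assumptions made for illustration.

```c
#include <assert.h>

/* Level 1, axpy (y = alpha*x + y): 2n flops on 2n doubles. */
double intensity_axpy(double n)
{
    assert(n > 0);
    return (2*n) / (8 * 2*n);
}

/* Level 2, gemv (y = A*x + y): 2n^2 flops on n^2 + 2n doubles. */
double intensity_gemv(double n)
{
    assert(n > 0);
    return (2*n*n) / (8 * (n*n + 2*n));
}

/* Level 3, gemm (C = A*B + C): 2n^3 flops on 3n^2 doubles. */
double intensity_gemm(double n)
{
    assert(n > 0);
    return (2*n*n*n) / (8 * 3*n*n);
}
```

Under these counts the axpy intensity is fixed at 1/8 flop per byte and gemv approaches 1/4 as n grows, while gemm grows like n/12. Only the Level 3 kernel has enough work per byte to hide memory costs, provided the implementation is blocked to reuse data in cache.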

Other common kernels
• Apply sparse matrix (or sparse matrix powers)
• Compute an FFT
• Sort a list

Kernel trade-offs
• Critical to get properly tuned kernels
  • Kernel interface is consistent across HW types
  • Kernel implementation varies according to arch details
• General kernels may leave performance on the table
  • Ex: General matrix-matrix multiply for structured matrices
• Overheads may be an issue for small n cases
  • Ex: Usefulness of batched BLAS extensions
• But: Ideally, someone else writes the kernel!
  • Or it may be automatically tuned

Help your tools help you

What can your compiler do for you?
In decreasing order of effectiveness:
• Local optimization
  • Especially restricted to a “basic block”
  • More generally, in “simple” functions
• Loop optimizations
• Global (cross-function) optimizations

Local optimizations
• Register allocation: compiler > human
• Instruction scheduling: compiler > human
• Branch joins and jump elim: compiler > human?
• Constant folding and propagation: humans OK
• Common subexpression elimination: humans OK
• Algebraic reductions: humans definitely help

Loop optimizations
Mostly leave these to modern compilers
• Loop invariant code motion
• Loop unrolling
• Loop fusion
• Software pipelining
• Vectorization
• Induction variable substitution

Obstacles for the compiler
• Long dependency chains
• Excessive branching
• Pointer aliasing
• Complex loop logic
• Cross-module optimization
• Function pointers and virtual functions
• Unexpected FP costs
• Missed algebraic reductions
• Lack of instruction diversity
Let’s look at a few...

Ex: Long dependency chains
Sometimes these can be decoupled (e.g. reduction loops)

    // Version 0
    float s = 0;
    for (int i = 0; i < n; ++i)
        s += x[i];

Apparent linear dependency chain. Compilers might handle this, but let’s try ourselves...

Ex: Long dependency chains

    // Version 1
    float sub[4] = {0, 0, 0, 0};
    int i;

    // Sum start of list in four independent sub-sums
    for (i = 0; i < n-3; i += 4)
        for (int j = 0; j < 4; ++j)
            sub[j] += x[i+j];

    // Combine sub-sums and handle trailing elements
    float s = (sub[0]+sub[1]) + (sub[2]+sub[3]);
    for (; i < n; ++i)
        s += x[i];

Key: Break up chains to expose parallel opportunities
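
For experimenting, the two versions of the reduction can be wrapped as functions and compared on the same input. This is a minimal sketch; the function names `sum_v0` and `sum_v1` are made up for illustration.

```c
#include <assert.h>

/* Version 0: one accumulator, so each add waits on the previous one. */
float sum_v0(const float* x, int n)
{
    assert(n >= 0);
    float s = 0;
    for (int i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Version 1: four independent sub-sums, combined at the end. */
float sum_v1(const float* x, int n)
{
    assert(n >= 0);
    float sub[4] = {0, 0, 0, 0};
    int i;

    /* Sum start of list in four independent sub-sums */
    for (i = 0; i < n-3; i += 4)
        for (int j = 0; j < 4; ++j)
            sub[j] += x[i+j];

    /* Combine sub-sums and handle trailing elements */
    float s = (sub[0]+sub[1]) + (sub[2]+sub[3]);
    for (; i < n; ++i)
        s += x[i];
    return s;
}
```

Because floating-point addition is not associative, the two versions can differ in the last few bits on general data. That is also why a compiler will typically not make this transformation on its own unless it is given permission to reassociate (for example, a fast-math style option).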

Ex: Pointer aliasing
Why can this not vectorize easily?

    void add_vecs(int n, double* result, double* a, double* b)
    {
        for (int i = 0; i < n; ++i)
            result[i] = a[i] + b[i];
    }

Q: What if result overlaps a or b?

Ex: Pointer aliasing
C99: Use restrict keyword

    void add_vecs(int n, double* restrict result,
                  double* restrict a, double* restrict b);

Implicit promise: these point to different things in memory.
Fortran forbids aliasing, which is part of why naive Fortran speed beats naive C speed!
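
A full definition matching the restrict-qualified declaration might look like the sketch below. It also marks the inputs `const`, which is an addition of this sketch and not part of the declaration on the slide.

```c
#include <assert.h>

/* result[i] = a[i] + b[i]. The restrict qualifiers promise the compiler
   that result does not overlap a or b, so iterations are independent. */
void add_vecs(int n, double* restrict result,
              const double* restrict a, const double* restrict b)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i)
        result[i] = a[i] + b[i];
}
```

The promise is the caller's responsibility: the compiler does not check it, and a call such as `add_vecs(n, a, a, b)`, where the output overlaps an input, breaks it and has undefined behavior.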
