  1. CS 5220: Optimization basics
     David Bindel
     2017-08-31

  2. Reminder: Modern processors
     • Modern CPUs are
       • Wide: start / retire multiple instructions per cycle
       • Pipelined: overlap instruction executions
       • Out-of-order: dynamically schedule instructions
     • Lots of opportunities for instruction-level parallelism (ILP)
     • Complicated! Want the compiler to handle the details
     • Implication: we should give the compiler
       • Good instruction mixes
       • Independent operations
       • Vectorizable operations

  3. Reminder: Memory systems
     • Memory accesses are expensive!
       • Flop time ≪ 1/bandwidth ≪ latency
     • Caches provide intermediate cost/capacity points
     • Cache benefits from
       • Spatial locality (regular local access)
       • Temporal locality (small working sets)

  4. Goal: (Trans)portable performance
     • Attention to detail has orders-of-magnitude impact
     • Different systems = different micro-architectures, caches
     • Want (trans)portable performance across HW
     • Need principles for high-perf code along with tricks

  5. Basic principles
     • Think before you write
     • Time before you tune
     • Stand on the shoulders of giants
     • Help your tools help you
     • Tune your data structures

  6. Think before you write

  7. Premature optimization
     We should forget about small efficiencies, say about 97% of the time:
     premature optimization is the root of all evil.
        – Don Knuth

  8. Premature optimization
     Wrong reading: “Performance doesn’t matter”
     We should forget about small efficiencies, say about 97% of the time:
     premature optimization is the root of all evil.
        – Don Knuth

  9. Premature optimization
     What he actually said (with my emphasis):
     We should forget about small efficiencies, say about 97% of the time:
     premature optimization is the root of all evil.
        – Don Knuth
     • Don’t forget the big efficiencies!
     • Don’t forget the 3%!
     • Your code is not premature forever!

  10. Don’t sweat the small stuff
      • Speed-up from tuning a fraction ϵ of the code < (1 − ϵ)⁻¹ ≈ 1 + ϵ
      • OK to write high-level stuff in Matlab or Python
      • OK if configuration file reader is un-tuned
      • OK if O(n²) prelude to O(n³) algorithm is not hyper-tuned?

  11. Lay-of-the-land thinking

      for (i = 0; i < n; ++i)
          for (j = 0; j < n; ++j)
              for (k = 0; k < n; ++k)
                  C[i+j*n] += A[i+k*n] * B[k+j*n];

      • What are the “big computations” in my code?
      • What are the natural algorithmic variants?
        • Vary loop orders? Different interpretations! (See the sketch below.)
        • Lower complexity algorithm (Strassen?)
      • Should I rule out some options in advance?
      • How can I code so it is easy to experiment?
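      For example, one natural variant of the triple loop above is the j-k-i
      ordering. This is a minimal sketch, assuming double elements and the
      same column-major indexing as the snippet; the name matmul_jki is mine,
      not from the slides:

      // j-k-i loop order: same arithmetic as the i-j-k version above, but
      // the innermost loop now streams down contiguous columns of C and A
      // (column-major storage), which is usually friendlier to caches and
      // vectorization.
      void matmul_jki(int n, double* C, const double* A, const double* B)
      {
          for (int j = 0; j < n; ++j)
              for (int k = 0; k < n; ++k)
                  for (int i = 0; i < n; ++i)
                      C[i+j*n] += A[i+k*n] * B[k+j*n];
      }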

  12. How big is n?
      Typical analysis: time is O(f(n))
      • Meaning: ∃ C, N : ∀ n ≥ N, T(n) ≤ C f(n).
      • Says nothing about constant factors: O(10n) = O(n)
      • Ignores lower-order terms: O(n³ + 1000 n²) = O(n³)
      • Behavior at small n may not match behavior at large n!
      Beware asymptotic complexity arguments about small-n codes!

  13. Avoid work

      bool any_negative1(int* x, int n)
      {
          bool result = false;
          for (int i = 0; i < n; ++i)
              result = (result || x[i] < 0);
          return result;
      }

      bool any_negative2(int* x, int n)
      {
          for (int i = 0; i < n; ++i)
              if (x[i] < 0)
                  return true;
          return false;
      }

  14. Be cheap
      Fast enough, right enough: approximate when you can get away with it.

  15. Do more with less (data)
      Want lots of work relative to data loads:
      • Keep data compact to fit in cache
      • Use short data types for better vectorization
      • But be aware of tradeoffs!
        • For integers: may want 64-bit ints sometimes!
        • For floating-point: will discuss in detail in other lectures

  16. Remember the I/O!
      Example: explicit PDE time stepper on a 256² mesh
      • 0.25 MB per frame (three fit in L3 cache)
      • Constant work per element (a few flops)
      • Time to write to disk ≈ 5 ms
      If I write once every 100 frames, how much time is I/O?
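      A rough sanity check, under assumed rates that are not from the slide:
      256² ≈ 65,000 cells at a few flops each is on the order of 10⁵–10⁶ flops
      per frame, so 100 frames of compute might take anywhere from a few
      milliseconds to a few tens of milliseconds on one core. Against a ≈ 5 ms
      write, I/O could plausibly be anywhere from a few percent to more than
      half the runtime, so it is not automatically negligible.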

  17. Time before you tune

  18. Hot spots and bottlenecks
      • Often a little bit of code takes most of the time
      • Usually called a “hot spot” or bottleneck
      • Goal: find and eliminate
      • Cute coinage: “de-slugging”

  19. Practical timing
      Need to worry about:
      • System timer resolutions
      • Wall-clock time vs CPU time
      • Size of data collected vs how informative it is
      • Cross-interference with other tasks
      • Cache warm-start on repeated timings
      • Overlooked issues from too-small timings

  20. Manual instrumentation
      Basic picture:
      • Identify stretch of code to be timed
      • Run it several times with “characteristic” data
      • Accumulate the total time spent
      Caveats: effects from repetition, “characteristic” data

  21. Manual instrumentation
      • Hard to get portable high-resolution wall-clock time!
      • Solution: omp_get_wtime()
      • Requires OpenMP support (still not Clang)
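      A minimal sketch of the manual-instrumentation picture above, using
      omp_get_wtime(); the workload do_work and the trial count are
      placeholders of my own, not from the slides:

      #include <stdio.h>
      #include <omp.h>   // omp_get_wtime(); compile with e.g. gcc -fopenmp

      // Placeholder standing in for the "stretch of code to be timed".
      void do_work(void)
      {
          volatile double s = 0;
          for (int i = 0; i < 1000000; ++i)
              s += i * 1e-6;
      }

      int main(void)
      {
          const int ntrials = 100;          // repeat to beat timer resolution
          double t0 = omp_get_wtime();
          for (int trial = 0; trial < ntrials; ++trial)
              do_work();
          double t1 = omp_get_wtime();
          printf("average time per run: %g s\n", (t1 - t0) / ntrials);
          return 0;
      }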

  22. Types of profiling tools
      • Sampling vs instrumenting
        • Sampling: interrupt every t_profile cycles
        • Instrumenting: rewrite code to insert timers
          • Instrument at binary or source level
      • Function level or line-by-line
        • Function: inlining can cause mis-attribution
        • Line-by-line: usually requires debugging symbols (-g)
      • Context information?
        • Distinguish full call stack or not?
        • Time full run, or just part?

  23. Hardware counters
      • Counters track cache misses, instruction counts, etc.
      • Present on most modern chips
      • May require significant permissions to access...

  24. Automated analysis tools
      • Examples: MAQAO and IACA
      • Symbolic execution of a model of a code segment
      • Usually only practical for short segments
      • But can give detailed feedback on (assembly) quality

  25. Shoulders of giants

  26. What makes a good kernel?
      Computational kernels are
      • Small and simple to describe
      • General building blocks (amortize tuning work)
      • Ideally high arithmetic intensity
        • Arithmetic intensity = flops/byte
        • Amortizes memory costs
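      A standard back-of-the-envelope comparison (mine, not from the slides):
      a dot product of two length-n double vectors does about 2n flops while
      reading 16n bytes, an intensity of roughly 0.125 flops/byte; an n-by-n
      matrix-matrix multiply does about 2n³ flops on roughly 24n² bytes of
      operands, so its intensity grows linearly with n. That is why Level 3
      operations (next slide) reward tuning so well.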

  27. Case study: BLAS
      Basic Linear Algebra Subroutines
      • Level 1: O(n) work on O(n) data
      • Level 2: O(n²) work on O(n²) data
      • Level 3: O(n³) work on O(n²) data
      Level 3 BLAS are key for high-perf transportable LA.
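      For illustration, a Level 3 call through the CBLAS interface might look
      like the sketch below; the wrapper name square_dgemm is mine, and the
      exact header and link line depend on which BLAS implementation
      (OpenBLAS, MKL, ATLAS, ...) is installed:

      #include <cblas.h>   // C interface to BLAS

      // Compute C := A*B + C for column-major n x n matrices.
      // All of the tuning effort lives inside the library's dgemm.
      void square_dgemm(int n, const double* A, const double* B, double* C)
      {
          cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                      n, n, n,
                      1.0, A, n,
                           B, n,
                      1.0, C, n);
      }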

  28. Other common kernels
      • Apply sparse matrix (or sparse matrix powers)
      • Compute an FFT
      • Sort a list

  29. Kernel trade-offs
      • Critical to get properly tuned kernels
        • Kernel interface is consistent across HW types
        • Kernel implementation varies according to arch details
      • General kernels may leave performance on the table
        • Ex: general matrix-matrix multiply for structured matrices
      • Overheads may be an issue for small-n cases
        • Ex: usefulness of batched BLAS extensions
      • But: ideally, someone else writes the kernel!
        • Or it may be automatically tuned

  30. Help your tools help you

  31. What can your compiler do for you?
      In decreasing order of effectiveness:
      • Local optimization
        • Especially restricted to a “basic block”
        • More generally, in “simple” functions
      • Loop optimizations
      • Global (cross-function) optimizations

  32. Local optimizations
      • Register allocation: compiler > human
      • Instruction scheduling: compiler > human
      • Branch joins and jump elimination: compiler > human?
      • Constant folding and propagation: humans OK
      • Common subexpression elimination: humans OK
      • Algebraic reductions: humans definitely help

  33. Loop optimizations
      Mostly leave these to modern compilers:
      • Loop invariant code motion
      • Loop unrolling
      • Loop fusion
      • Software pipelining
      • Vectorization
      • Induction variable substitution
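      For intuition only, here is a hand-written sketch of what the first two
      transformations do to a simple scaling loop; it is my own illustration
      of the idea, not a claim about what any particular compiler emits:

      // Before: a/b does not change inside the loop.
      void scale0(int n, double a, double b, double* x)
      {
          for (int i = 0; i < n; ++i)
              x[i] *= a / b;
      }

      // After (conceptually): invariant hoisted, loop unrolled by 4,
      // with a cleanup loop for leftover iterations.
      void scale1(int n, double a, double b, double* x)
      {
          double c = a / b;                 // loop-invariant code motion
          int i;
          for (i = 0; i + 3 < n; i += 4) {  // 4-way unrolling
              x[i]   *= c;
              x[i+1] *= c;
              x[i+2] *= c;
              x[i+3] *= c;
          }
          for (; i < n; ++i)                // remainder
              x[i] *= c;
      }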

  34. Obstacles for the compiler
      • Long dependency chains
      • Excessive branching
      • Pointer aliasing
      • Complex loop logic
      • Cross-module optimization
      • Function pointers and virtual functions
      • Unexpected FP costs
      • Missed algebraic reductions
      • Lack of instruction diversity
      Let’s look at a few...

  35. Ex: Long dependency chains
      Sometimes these can be decoupled (e.g. reduction loops):

      // Version 0
      float s = 0;
      for (int i = 0; i < n; ++i)
          s += x[i];

      Apparent linear dependency chain. Compilers might handle this, but
      let’s try ourselves...

  36. Ex: Long dependency chains
      Key: break up chains to expose parallel opportunities

      // Version 1
      float s[4] = {0, 0, 0, 0};
      int i;

      // Sum start of list in four independent sub-sums
      for (i = 0; i < n-3; i += 4)
          for (int j = 0; j < 4; ++j)
              s[j] += x[i+j];

      // Combine sub-sums and handle trailing elements
      float stot = (s[0]+s[1]) + (s[2]+s[3]);
      for (; i < n; ++i)
          stot += x[i];
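      A note in passing: a compiler can often make this transformation itself
      if it is allowed to reassociate floating-point additions (for example,
      GCC and Clang under -ffast-math or -fassociative-math), at the cost of
      bit-for-bit reproducibility of the sum.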

  37. Ex: Pointer aliasing
      Why can this not vectorize easily?

      void add_vecs(int n, double* result, double* a, double* b)
      {
          for (int i = 0; i < n; ++i)
              result[i] = a[i] + b[i];
      }

      Q: What if result overlaps a or b?

  38. Ex: Pointer aliasing
      C99: use the restrict keyword

      void add_vecs(int n, double* restrict result,
                    double* restrict a, double* restrict b);

      Implicit promise: these point to different things in memory.
      Fortran forbids aliasing, part of why naive Fortran speed beats
      naive C speed!
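      For completeness, a definition matching that prototype might look like
      the sketch below (the body is mine; the slide only shows the
      declaration):

      // The restrict qualifiers promise the compiler that result, a, and b
      // do not overlap, so the loop can be vectorized without runtime
      // overlap checks. Violating the promise is undefined behavior.
      void add_vecs(int n, double* restrict result,
                    double* restrict a, double* restrict b)
      {
          for (int i = 0; i < n; ++i)
              result[i] = a[i] + b[i];
      }

      In C++, where restrict is not standard, GCC and Clang accept the
      spelling __restrict__ with the same meaning.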
