S9624: Performance Analysis of GPU-Accelerated Applications using the Roofline Model
GTC 2019, San Jose
Samuel Williams (Senior Staff Scientist, CRD, LBNL) and Charlene Yang (Application Performance Specialist, NERSC, LBNL)


  1. S9624: Performance Analysis of GPU-Accelerated Applications using the Roofline Model
     GTC 2019, San Jose
     Samuel Williams, Senior Staff Scientist, CRD, LBNL (swwilliams@lbl.gov)
     Charlene Yang, Application Performance Specialist, NERSC, LBNL (cjyang@lbl.gov)

  2. You just bought a $10,000 throughput-optimized GPU! Are you making good use of your investment?

  3. You could just run benchmarks
     § Imagine a mix of benchmarks or kernels…
     § GFLOP/s alone may not be particularly insightful
     § Moreover, speedup relative to a Xeon may seem random
     [Figure: GFLOP/s vs. kernel (or app)]

  4. Making good use of your GPU?
     1. Are you operating it in the throughput-limited regime?
        o Not sensitive to Amdahl effects
        o Not sensitive to D2H/H2D transfers
        o Not sensitive to launch overheads
        o Not sensitive to latencies
     2. If in the throughput-limited regime, are you making good use of the GPU's compute and bandwidth capabilities?

  5. The Roofline Model
     § Roofline is a throughput-oriented performance model
     § Premised on the interplay between FLOP/s, bandwidth, and reuse
     § Tracks rates, not times
     § Independent of ISA and architecture (applies to CPUs, GPUs, Google TPUs, etc.)
     https://crd.lbl.gov/departments/computer-science/PAR/research/roofline
     Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit", ISCA, 2017.

  6. (DRAM) Roofline
     § One could hope to always attain peak performance (GFLOP/s)
     § However, finite locality (reuse) and bandwidth limit performance
     § Assume:
        o Idealized processor/caches
        o Cold start (data in DRAM)

     Time = max( #FLOPs / Peak GFLOP/s , #Bytes / Peak GB/s )

     [Figure: GPU (compute, GFLOP/s) connected to DRAM (data, GB) via DRAM bandwidth (GB/s)]

  7. (DRAM) Roofline
     § One could hope to always attain peak performance (GFLOP/s)
     § However, finite locality (reuse) and bandwidth limit performance
     § Assume:
        o Idealized processor/caches
        o Cold start (data in DRAM)

     GFLOP/s = min( Peak GFLOP/s , AI * Peak GB/s )

     Note: Arithmetic Intensity (AI) = FLOPs / Bytes (as presented to DRAM)

  8. Arithmetic Intensity
     § Arithmetic Intensity (AI) is the most important concept in Roofline
     § Measure of data locality (data reuse)
     § Ratio of total FLOPs performed to total bytes moved
     § For the DRAM Roofline…
        o Total bytes to/from DRAM, including all cache and prefetcher effects
        o Can be very different from total loads/stores (bytes requested) due to cache reuse

  9. (DRAM) Roofline
     § Plot the Roofline bound using Arithmetic Intensity as the x-axis
     § Log-log scale makes it easy to doodle, extrapolate performance along Moore's Law, etc.
     § Kernels with AI less than the machine balance are ultimately DRAM-bound (we'll refine this later…)
     § Transition @ AI == Peak GFLOP/s / Peak GB/s == 'machine balance'
     [Figure: attainable GFLOP/s vs. AI (FLOP:Byte); DRAM-bound region left of the machine balance, compute-bound region right]

  10. Example
      § Consider 3 kernels (A, B, C)
        o Calculate or measure the Arithmetic Intensity for each
        o Determine the Roofline intercept for each kernel
      Ø Kernels A and B are bound by memory bandwidth
      Ø Kernel C is bound by peak FLOP/s
      [Figure: Roofline with A and B on the bandwidth ceiling, C at Peak GFLOP/s]

  11. Scaling to Future GPUs
      § Imagine you run on a future GPU with twice the peak FLOP/s…
      Ø Kernel C's performance could double
      ✘ Kernels A and B will be no faster

  12. Scaling to Future GPUs
      § What if that future GPU also doubled its memory bandwidth…
      Ø Kernel A and B's performance could also double

  13. Why is Roofline Useful?
      § Think back to our mix of loop nests where GFLOP/s alone wasn't useful…
      [Figure: GFLOP/s vs. kernel (or app)]

  14. Why is Roofline Useful?
      § We can sort kernels by AI…
      [Figure: GFLOP/s vs. AI (FLOP:Byte)]

  15. Why is Roofline Useful?
      § We can sort kernels by AI…
      § …and compare performance relative to machine capabilities
      [Figure: GFLOP/s vs. AI (FLOP:Byte), with the Peak GFLOP/s ceiling]

  16. Why is Roofline Useful?
      § Kernels near the roofline are making good use of computational resources…
      Ø Kernels can have low performance (GFLOP/s) but make good use of a machine
      Ø Kernels can have high performance (GFLOP/s) but make poor use of a machine
      [Figure: Roofline with Peak GFLOP/s and 50%-of-peak ceilings]

  17. Can Performance Be Below Roofline?
      § Analogous to asking whether one can always attain either…
        o Peak Bandwidth
        o Peak GFLOP/s
      § Sure, there can be other performance bottlenecks…
        o Cache bandwidth / locality
        o Lack of FMA / tensor instructions
        o Thread divergence / predication
        o Too many non-FP instructions
        o …

  18. Cache Effects…
      § Hierarchical Roofline Model
      § Construct a superposition of Rooflines…
        o Measure AI and bandwidth for each level of memory/cache
        o Loop nests will have multiple AIs and multiple performance bounds…
        o …but performance is ultimately the minimum of these bounds
      [Figure: L2-bound example, where L2 AI*BW is less than DDR AI*BW]

  19. Cache Effects…
      § Hierarchical Roofline Model
      § Construct a superposition of Rooflines…
        o Measure AI and bandwidth for each level of memory/cache
        o Loop nests will have multiple AIs and multiple performance bounds…
        o …but performance is ultimately the minimum of these bounds
      § Extend to other memories…
        o L1 / shared
        o System

  20. Insights – Exploiting Caches
      § Widely separated Arithmetic Intensities indicate high reuse in the cache
      [Figure: L2 and DRAM AIs far apart, indicating high reuse]

  21. Insights – Exploiting Caches
      § Widely separated Arithmetic Intensities indicate high reuse in the cache
      § Similar Arithmetic Intensities indicate effectively no cache reuse (== streaming)
      § As one changes problem size, L2 and DRAM arithmetic intensities can behave very differently

  22. Failure to Exploit CISC Instructions
      § Death of Moore's Law is motivating a return of Complex Instruction Set Computing (CISC)
      § Modern CPUs and GPUs are increasingly reliant on special (fused) instructions that perform multiple operations:
        o FMA (fused multiply-add): z = a*x + y … z, x, y are vectors or scalars
        o 4FMA (quad FMA): z = A*x + z … A is an FP32 matrix; x, z are vectors
        o HMMA (Tensor Core): Z = AB + C … Z, A, B, C are FP16 matrices
        o …
      Ø Performance is now a weighted average of Mul/Add, FMA, and HMMA operations.

  23. Failure to Exploit CISC Instructions
      § Total lack of FMA reduces Volta performance by 2x…
        o FMA.f64 peak creates an ADD.f64 ceiling
      § In reality, applications are a mix of FMA.f64, ADD.f64, and MUL.f64…
        o Performance is a weighted average
      Ø Produces a partial-FMA ceiling that bounds kernel performance
      [Figure: Roofline with FMA.f64 peak, partial-FMA ceiling, and ADD.f64 ceiling]

  24. Failure to Exploit CISC Instructions
      § On Volta, Tensor Cores provide 125 TFLOP/s of FP16 performance (vs. 15 for FP32)
      § However, kernels/apps will mix HMMA with FMA, MULs, ADDs, …
      Ø A few non-HMMA operations can quickly limit Tensor Core performance
      [Figure: Roofline with HMMA.f16 peak, partial-HMMA ceiling, and ADD.f32 ceiling]

  25. Using Roofline To Drive Optimization

  26. Driving Performance Optimization
      § Broadly speaking, there are three approaches to improving performance:
      [Figure: Roofline with Peak GFLOP/s and No-FMA ceilings]

  27. Driving Performance Optimization
      § Broadly speaking, there are three approaches to improving performance:
      § Maximize SM performance (e.g. minimize predication)
      [Figure: kernel plotted at its current AI, pushed up toward the ceilings]

  28. Driving Performance Optimization
      § Broadly speaking, there are three approaches to improving performance:
      § Maximize SM performance (e.g. minimize predication)
      § Maximize memory bandwidth (e.g. avoid pathological memory access patterns)

  29. Driving Performance Optimization
      § Broadly speaking, there are three approaches to improving performance:
      § Maximize SM performance (e.g. minimize predication)
      § Maximize memory bandwidth (e.g. avoid pathological memory access patterns)
      § Minimize data movement, i.e. exploit reuse (moving the kernel from its current AI toward its compulsory AI)

  30. Estimating Arithmetic Intensity

  31. DRAM vs. L1 Arithmetic Intensity
      § Consider a 7-point constant-coefficient stencil…
        o 7 FLOPs per point
        o 8 memory references (7 reads, 1 store) per point
        o AI = 0.11 FLOPs per byte (L1)

      #pragma omp parallel for
      for(k=1;k<dim+1;k++){
        for(j=1;j<dim+1;j++){
          for(i=1;i<dim+1;i++){
            new[k][j][i] = -6.0*old[k  ][j  ][i  ]
                               + old[k  ][j  ][i-1]
                               + old[k  ][j  ][i+1]
                               + old[k  ][j-1][i  ]
                               + old[k  ][j+1][i  ]
                               + old[k-1][j  ][i  ]
                               + old[k+1][j  ][i  ];
      }}}

      [Figure: GPU (compute, GFLOP/s) connected to DRAM (data, GB) via DRAM bandwidth (GB/s)]
