S9624: Performance Analysis of GPU-Accelerated Applications using the Roofline Model
GTC 2019, San Jose

Charlene Yang
Application Performance Specialist, NERSC, LBNL
cjyang@lbl.gov

Samuel Williams
Senior Staff Scientist, CRD, LBNL
swwilliams@lbl.gov
1
[Slides 2-3: measuring the performance of a kernel (or app) in GFLOP/s]
Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit", ISCA, 2017. https://crd.lbl.gov/departments/computer-science/PAR/research/roofline
4
Time = max{ #FP ops (compute) / Peak GFLOP/s,
            #Bytes (data, GB) / DRAM Bandwidth (GB/s) }
5
Dividing #FP ops by that time gives the attainable performance:
GFLOP/s = min{ Peak GFLOP/s, AI * DRAM GB/s }
Note: Arithmetic Intensity (AI) = FLOPs / Bytes (as presented to DRAM)
6
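To make the bound concrete, here is a minimal Python sketch of the DRAM Roofline formula (the peak GFLOP/s and GB/s values are placeholders, not measurements of any particular machine):

def attainable_gflops(ai, peak_gflops, peak_gbs):
    # Roofline bound: the minimum of the compute ceiling and AI x memory bandwidth
    return min(peak_gflops, ai * peak_gbs)

# assumed machine: 7000 GFLOP/s peak compute, 800 GB/s DRAM bandwidth
for ai in (0.1, 1.0, 8.75, 100.0):   # 8.75 is this machine's balance point
    print(ai, attainable_gflops(ai, 7000.0, 800.0))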
7
[Roofline plot: Attainable GFLOP/s vs. Arithmetic Intensity (FLOP:Byte), capped by Peak GFLOP/s; the transition from the bandwidth slope to the compute ceiling occurs at AI == Peak GFLOP/s / Peak GB/s == 'Machine Balance']
8
[Roofline plot: kernels with AI below the machine balance are DRAM-bound, kernels above it are Compute-bound; Attainable GFLOP/s vs. Arithmetic Intensity (FLOP:Byte), Peak GFLOP/s ceiling]
9
Measure the Arithmetic Intensity for each kernel (A, B, C):
Ø kernels A and B are bound by memory bandwidth
Ø kernel C is bound by peak FLOP/s
[Roofline plot: kernels A, B, C plotted on Attainable GFLOP/s vs. Arithmetic Intensity (FLOP:Byte)]
10
If the peak GFLOP/s is doubled (2x GFLOP/s ceiling):
Ø kernel C's performance could double
✘ kernels A and B will be no faster
[Roofline plot: 2x GFLOP/s ceiling with kernels A, B, C; Attainable GFLOP/s vs. Arithmetic Intensity (FLOP:Byte)]
11
If the memory bandwidth is also doubled:
Ø kernel A and B's performance could also double
[Roofline plot: 2x GFLOP/s ceiling and doubled bandwidth slope with kernels A, B, C]
[Figure: a kernel (or app) and its measured GFLOP/s]
12
[Figure: the kernel's GFLOP/s plotted against its Arithmetic Intensity (FLOP:Byte)]
13
[Roofline plot: the kernel's GFLOP/s compared against Peak GFLOP/s at its Arithmetic Intensity (FLOP:Byte)]
14
[Roofline plot: Peak GFLOP/s ceiling, the kernel's GFLOP/s and Arithmetic Intensity (FLOP:Byte), annotated '50% of Peak']
15
Ø kernels can have low performance (GFLOP/s), but make good use of a machine
Ø kernels can have high performance (GFLOP/s), but make poor use of a machine
[Roofline plot: Attainable GFLOP/s vs. Arithmetic Intensity (FLOP:Byte), Peak GFLOP/s ceiling]
16
Ø Measure a separate Arithmetic Intensity for each level of memory/cache...
Ø ... which creates multiple performance bounds...
Ø ... and performance is the minimum of these bounds.
Ø Here the kernel is L2 Bound: its L2 AI*BW is less than its DDR AI*BW
[Roofline plot: Attainable GFLOP/s vs. Arithmetic Intensity (FLOP:Byte), Peak GFLOP/s ceiling]
17
Ø Measuring AI at each level of memory/cache creates multiple performance bounds; performance is the minimum of these bounds.
[Roofline plot: Attainable GFLOP/s vs. Arithmetic Intensity (FLOP:Byte), Peak GFLOP/s ceiling]
18
19
[Roofline plot: the high-reuse case; Attainable GFLOP/s vs. Arithmetic Intensity (FLOP:Byte), Peak GFLOP/s ceiling]
20
[Roofline plot: the no-reuse (streaming) case]
21
Example kernels with different amounts of FMA:
  z = a*x + y    ... z, x, y are vectors or scalars
  z = A*x + z    ... A is an FP32 matrix; x, z are vectors
  Z = A*B + C    ... Z, A, B, C are FP16 matrices
[Roofline plot: FMA.f64 Peak, ADD.f64 Ceiling, and a Partial FMA ceiling in between; Attainable GFLOP/s vs. Arithmetic Intensity (FLOP:Byte)]
Ø A mix of FMA and non-FMA instructions produces a partial FMA ceiling that bounds kernel performance
22
[Roofline plot: HMMA.f16 (Tensor Core) Peak, ADD.f32 Ceiling, and a Partial HMMA Ceiling; Attainable GFLOP/s vs. Arithmetic Intensity (FLOP:Byte)]
Ø A few non-HMMA operations can quickly limit Tensor Core performance
23
[Roofline plots, slides 25-28: Peak GFLOP/s and No-FMA ceilings; the kernel is plotted at its Current AI, with its Compulsory AI marked for comparison]
#pragma omp parallel for
for(k=1;k<dim+1;k++){
  for(j=1;j<dim+1;j++){
    for(i=1;i<dim+1;i++){
      new[k][j][i] = -6.0*old[k  ][j  ][i  ]
                         + old[k  ][j  ][i-1]
                         + old[k  ][j  ][i+1]
                         + old[k  ][j-1][i  ]
                         + old[k  ][j+1][i  ]
                         + old[k-1][j  ][i  ]
                         + old[k+1][j  ][i  ];
    }}}
For the DRAM Roofline we need the kernel's compute (GFLOP/s) and its data movement (GB) over the DRAM Bandwidth (GB/s).
30
(same 7-point stencil kernel as above)
Adding a cache level: besides compute (GFLOP/s) and DRAM data (GB) over DRAM Bandwidth (GB/s), we also account for cache traffic over the Cache Bandwidth (GB/s); with an ideal cache, DRAM sees only compulsory misses.
31
[Roofline plot: the 7-point stencil lands at AI ≈ 0.44 FLOP:Byte, so 7-point Stencil GFLOP/s ≤ 0.44 * DRAM GB/s, well below Peak GFLOP/s]
(same 7-point stencil kernel as above)
32
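For reference, the 0.44 FLOP/byte figure can be reproduced with a back-of-the-envelope calculation. A minimal Python sketch, assuming one 8-byte read of old and one 8-byte write of new per point (i.e. perfect caching of the stencil neighbours and no write-allocate traffic); the DRAM bandwidth used for the bound is just an example value:

flops_per_point = 7          # 1 multiply + 6 adds in the stencil above
bytes_per_point = 8 + 8      # one FP64 read of old[k][j][i] + one FP64 write of new[k][j][i]
ai = flops_per_point / bytes_per_point
print(ai)                    # 0.4375, i.e. ~0.44 FLOP/byte

dram_gbs = 800.0             # assumed DRAM bandwidth in GB/s
print("GFLOP/s bound:", ai * dram_gbs)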
Most kernels are more complicated than the 7-point stencil…
34
Most kernels are more complicated than the 7-point stencil...
Ø How do we measure the total number of FLOPs?
Ø How do we measure the total number of bytes moved (read/write, L1/L2/HBM)?
Ø How do we measure the runtime of each kernel?
Ø How do we know the peak bandwidth (L1/L2/HBM) and the peak FLOP/s of the architecture?
35
36
§ Empirical Roofline Toolkit (ERT)
  – Different from the architecture specs: MORE REALISTIC
  – Reflects the actual execution environment (power constraints, etc.)
  – Sweeps through a range of configurations, and is statistically stable
Empirical Roofline Toolkit (ERT). https://bitbucket.org/berkeleylab/cs-roofline-toolkit/
job script: ./ert config.txt

ert (Python): creates directories, loops over ERT_FLOPS and ERT_GPU_BLOCKS/THREADS, calls the driver and kernel

config.txt:
  ERT_FLOPS 1,2,4,8,16,32,64,128,256
  ERT_GPU_BLOCKS 80,160,320,640,1280,2560
  ERT_GPU_THREADS 64,128,256,512,1024
  ERT_MEMORY_MAX 1073741824
  ERT_WORKING_SET_MIN 128
  ERT_TRIALS_MIN 1
  ...

Driver.c (uses some macros from config.txt):
  initialize MPI, CUDA
  loop over dataset sizes <= ERT_MEMORY_MAX
    loop over trial sizes >= ERT_TRIALS_MIN
      cudaMemcpy
      start timer
      call kernel
      end timer

Kernel.c:
  loop over ntrials
  distribute the dataset over threads; each thread computes ERT_FLOPS

Kernel.h:
  ERT_FLOPS=1: a = b + c
  ERT_FLOPS=2: a = a * b + c
38
[Empirical Roofline Graph (Results.cori.nersc.gov.03/Run.001): 7068.9 GFLOP/s maximum, L1 2996.8 GB/s, DRAM 828.8 GB/s; GFLOP/s vs. FLOPs/Byte on log-log axes]
ERT outputs: roofline.json, roofline.ps
39
[The same Empirical Roofline Graph, measured on an NVIDIA V100 (Voltar at UOregon); an L2 ceiling is also shown]
§ Theoretical FP64 compute ceilings on V100 (arithmetic sketched below):
  – FMA:    80 SMs x 32 FP64 cores x 1.53 GHz x 2 = 7.83 TFLOP/s
  – no FMA: 80 SMs x 32 FP64 cores x 1.53 GHz     = 3.92 TFLOP/s
§ Theoretical memory bandwidths on V100:
  – HBM: 900 GB/s
  – L2:  ~4.1 TB/s
  – L1:  ~14 TB/s
§ You may never achieve the theoretical 7.8 TFLOP/s
§ You may be closer to the (empirical) ceiling than you think you are
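The compute-ceiling arithmetic is easy to reproduce. A minimal Python sketch (the SM count, FP64 cores per SM, and boost clock are the published V100 numbers quoted in the bullets above):

sms, fp64_cores_per_sm, clock_ghz = 80, 32, 1.53

fma_tflops    = sms * fp64_cores_per_sm * clock_ghz * 2 / 1000   # FMA = 2 FLOPs per cycle per core
no_fma_tflops = sms * fp64_cores_per_sm * clock_ghz * 1 / 1000
print(fma_tflops, no_fma_tflops)   # ~7.83 and ~3.92 TFLOP/s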
[Annotation: the empirical ceilings are roughly 10% below the theoretical values]
Voltar at UOregon
42
43
44
Where to put these dots?
Require three raw measurements:
  – Runtime
  – FLOPs
  – Bytes (on each cache level)
to calculate AI and GFLOP/s:
Performance (GFLOP/s) = FLOPs / Runtime
Arithmetic Intensity (FLOPs/Byte) = FLOPs / Data Movement
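A minimal Python sketch of that conversion from raw measurements to Roofline coordinates (the kernel values below are made-up examples, not measurements):

runtime_s   = 2.0e-3                                     # seconds
flops       = 4.0e9                                      # total FLOPs
bytes_moved = {"L1": 6.0e9, "L2": 3.0e9, "HBM": 1.5e9}   # bytes per memory level

gflops = flops / runtime_s / 1e9                         # y coordinate
ai = {lvl: flops / b for lvl, b in bytes_moved.items()}  # one x coordinate per level
print(gflops, ai)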
45
Where to put these dots?
§ Runtime:
  – Time per invocation of a kernel: nvprof --print-gpu-trace ./application
  – Average time over multiple invocations: nvprof --print-gpu-summary ./application
  – The same kernel launched with different input parameters is grouped separately
§ FLOPs:
  – Predication aware and complex-operation aware (such as divides)
  – nvprof --kernels 'kernel_name' --metrics 'flop_count_xx' ./application
  – e.g. flop_count_{dp/dp_add/dp_mul/dp_fma, sp*, hp*}
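If the individual counters are collected, they can be combined by hand. A minimal sketch (counter values are made up; it assumes the usual convention that one FMA contributes 2 FLOPs, which is also how the aggregate flop_count_dp is defined):

dp_add, dp_mul, dp_fma = 1.0e8, 0.5e8, 2.0e8    # hypothetical per-kernel nvprof counter values
total_dp_flops = dp_add + dp_mul + 2 * dp_fma   # FMA = one multiply + one add
print(total_dp_flops)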
46
§ Bytes for different cache levels, in order to construct the hierarchical Roofline:
  – Bytes = (read transactions + write transactions) x transaction size
  – nvprof --kernels 'kernel_name' --metrics 'metric_name' ./application
§ Note: surface and texture transactions are ignored here for simplicity (HPC applications)

Level                 Metrics                                                 Transaction Size
First Level Cache*    gld_transactions, gst_transactions,                     32B
                      atomic_transactions,
                      local_load_transactions, local_store_transactions,
                      shared_load_transactions, shared_store_transactions
Second Level Cache    l2_read_transactions, l2_write_transactions             32B
Device Memory         dram_read_transactions, dram_write_transactions         32B
System Memory         system_read_transactions, system_write_transactions     32B
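Applying the "transactions x transaction size" rule to the table above, as a minimal Python sketch (the transaction counts are made-up examples; only two levels are shown):

TXN_SIZE = 32  # bytes per transaction for every level in the table above

transactions = {  # hypothetical per-kernel nvprof counts
    "L2":  {"l2_read_transactions": 5.0e7, "l2_write_transactions": 2.0e7},
    "HBM": {"dram_read_transactions": 1.2e7, "dram_write_transactions": 6.0e6},
}
bytes_per_level = {lvl: sum(c.values()) * TXN_SIZE for lvl, c in transactions.items()}
print(bytes_per_level)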
47
[cjyang@voltar source]$ nvprof --kernels "1:7:smooth_kernel:1" \
    --metrics flop_count_dp \
    --metrics gld_transactions --metrics gst_transactions \
    --metrics l2_read_transactions --metrics l2_write_transactions \
    --metrics dram_read_transactions --metrics dram_write_transactions \
    --metrics sysmem_read_bytes --metrics sysmem_write_bytes \
    ./hpgmg-fv-fp 5 8
§ Export to CSV: --csv -o nvprof.out
48
(the --kernels specifier is context : stream : kernel : invocation, e.g. "1:7:smooth_kernel:1" above)
§ Calculate Arithmetic Intensity and GFLOP/s performance
  – x coordinate: Arithmetic Intensity
  – y coordinate: GFLOP/s performance
§ Plot the Roofline with Python Matplotlib
  – Example scripts: https://github.com/cyanguwa/nersc-roofline/tree/master/Plotting
  – Tweak as needed for more complex Rooflines
Performance (GFLOP/s) = FLOPs / Runtime,   Arithmetic Intensity (FLOPs/Byte) = FLOPs / Data Movement
49
§ Quick example: plot_roofline.py data.txt
§ Accepts a space-delimited list of values
§ Use quotes to separate names/labels

data.txt:
  # all data is space delimited
  memroofs 14336.0 2996.8 828.758
  mem_roof_names 'L1' 'L2' 'HBM'
  comproofs 7068.86 3535.79
  comp_roof_names 'FMA' 'No-FMA'
  # omit the following if only plotting roofs
  # AI: arithmetic intensity; GFLOPs: performance
  AI 0.87 2.25 2.58
  GFLOPs 2085.756683
  labels 'Kernel'
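For readers who want to roll their own plot rather than use the scripts above, here is a minimal matplotlib sketch in the same spirit (this is not the nersc-roofline script itself; the ceiling and kernel values are simply the ones from the data.txt example):

import numpy as np
import matplotlib.pyplot as plt

mem_roofs  = [("L1", 14336.0), ("L2", 2996.8), ("HBM", 828.758)]  # GB/s
comp_roofs = [("FMA", 7068.86), ("No-FMA", 3535.79)]              # GFLOP/s
ai_pts, kernel_gflops = [0.87, 2.25, 2.58], 2085.756683           # kernel dots

ai = np.logspace(-2, 2, 200)
plt.figure()
for name, bw in mem_roofs:
    # bandwidth slope, capped at the highest compute ceiling
    plt.loglog(ai, np.minimum(ai * bw, comp_roofs[0][1]), label=f"{name} {bw} GB/s")
for name, peak in comp_roofs:
    plt.loglog(ai, np.full_like(ai, peak), "--", label=f"{name} {peak} GFLOP/s")
plt.plot(ai_pts, [kernel_gflops] * len(ai_pts), "o", label="Kernel")
plt.xlabel("Arithmetic Intensity (FLOP:Byte)")
plt.ylabel("GFLOP/s")
plt.legend()
plt.savefig("roofline.png")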
50
§ Machine characterization (ceilings):
  – ERT: https://bitbucket.org/berkeleylab/cs-roofline-toolkit
  – compute (FMA/no FMA) and bandwidth (DRAM, L2, …) ceilings
§ Application characterization:
  – nvprof: --metrics, --events, --print-gpu-trace
  – FLOPs, bytes (DRAM, L2, …), runtime
§ Plotting:
  – arithmetic intensity, GFLOP/s performance, ceilings
  – example scripts: https://github.com/cyanguwa/nersc-roofline
51
§ GPP (General Plasmon Pole) kernel from BerkeleyGW (Material Science)
§ https://github.com/cyanguwa/BerkeleyGW-GPP
§ Medium problem size: 512 2 32768 20
§ Tensor-contraction, abundant parallelism, large reductions
§ Low FMA counts, divides, complex double data type, 1.5 GB of HBM data

Pseudo code:
  do band = 1, nbands        # blockIdx.x
    do igp = 1, ngpown       # blockIdx.y
      do ig = 1, ncouls      # threadIdx.x
        do iw = 1, nw        # unrolled
          compute; reductions
56
§ Three experiments (detailed on the next slide)
§ Note that nvprof has already taken care of:
  – appropriate counting of FLOPs for complex instructions
  – appropriate counting of FLOPs for predicated-out threads
57
§ Highly parameterizable; three experiments:
  1. Vary nw from 1 to 6: to study the impact of varying Arithmetic Intensity on performance
  2. Compile with/without FMA: to study the impact of instruction mix on performance
  3. Stride the ig loop: to study the impact of suboptimal memory coalescing on performance

Pseudo code:
  do band = 1, nbands        # blockIdx.x
    do igp = 1, ngpown       # blockIdx.y
      do ig = 1, ncouls      # threadIdx.x
        do iw = 1, nw        # unrolled
          compute; reductions
58
§ Highly parameterizable (stride-2 version):

Stride 2 pseudo code:
  do band = 1, nbands              # blockIdx.x
    do igp = 1, ngpown             # blockIdx.y
      do igs = 0, stride - 1
        do ig = 1, ncouls/stride   # threadIdx.x
          do iw = 1, nw            # unrolled
            compute; reductions
59
§ Experiment 1: study the impact of varying AI on performance
§ HBM Roofline, i.e. bytes are HBM bytes
  – AI increases as nw grows
  – GPP moves from a bandwidth-bound region to a compute-bound region
§ Roofline captures the change in AI
60
§ Experiments 1 & 2: study the impact of instruction mix on performance
§ HBM Roofline, i.e. bytes are HBM bytes
  – No-FMA performance converges to the no-FMA ceiling, but FMA performance is still far from the FMA ceiling
  – The FMA ceiling is not reached due to the lack of FMA instructions
§ Roofline captures the effects of instruction mix
61
§ Experiments 1 & 2: study the impact of instruction mix on performance
§ At nw=6, GPP executes a fraction β of its FP64 instructions as FMA:
    β = FMA FP64 instr. / (FMA FP64 instr. + non-FMA FP64 instr.)
§ The expected performance, as a fraction of the FMA peak, is
    γ = (β × 2 + (1 − β)) / 2 = 91%
§ But at nw=6, GPP is only achieving 66%
§ Other FP/non-FP instructions may be taking up the instruction issue/execution pipeline
§ Partial Roofline can show you the headroom
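The β-to-γ relationship can be checked in a couple of lines. A minimal sketch (the β values are examples for illustration, not the measured GPP instruction mix):

def expected_fraction_of_fma_peak(beta):
    # beta = fraction of FP64 instructions that are FMA (2 FLOPs) rather than non-FMA (1 FLOP)
    return (beta * 2 + (1 - beta)) / 2

for beta in (0.5, 0.82, 1.0):   # e.g. beta = 0.82 corresponds to ~91% of the FMA peak
    print(beta, expected_fraction_of_fma_peak(beta))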
62
§ Experiments 1 & 2: what else is going on?
§ Hierarchical Roofline, i.e. bytes are HBM, L2 and unified L1 cache bytes
  – GPP is HBM bound at low nw's and compute bound at high nw's
  – FLOPs ∝ nw
  – HBM bytes: constant
  – L2 bytes: increasing at a rate β > 1
  – L1 bytes: constant
  – Spike in the L2 curve at nw=2, 3
§ Hierarchical Roofline captures more details about cache locality
63
§ Experiment 3: study the effects of suboptimal memory coalescing (at nw=6)
§ Hierarchical Roofline, i.e. bytes are HBM, L2 and unified L1 cache bytes
  – L1/L2 bytes double from stride 1 to stride 2, but stay almost constant afterwards
  – At nw=6, GPP moves from compute bound to bandwidth bound
  – Eventually all dots converge to the HBM roofline
§ Roofline captures the effects of memory coalescing
64
§ HPGMG (High-performance Geometric Multigrid) from Adaptive Mesh Refinement codes
§ https://bitbucket.org/nsakharnykh/hpgmg-cuda
§ Stencil code, F-cycles and V-cycles, GSRB smoother kernel (Gauss-Seidel Red-Black)
§ Hybrid GPU and CPU code
  – Example: hpgmg-fv 7 8
  – 128³ boxes x 8; Levels 5-8 run on the GPU, Levels 1-4 on the CPU
§ Three versions of the GSRB kernel
  – GSRB_FP, GSRB_BRANCH, GSRB_STRIDE2
66
GSRB_FP:
  for(int k=klo; k<(klo+kdim); k++){
    const int ijk = i + j*jStride + k*kStride;
    const double *__restrict__ RedBlack = level.RedBlack_FP + ghosts*(1+jStride)
                                          + ((k^color000)&1)*kStride;
    const double Ax = apply_op_ijk();
    const double lambda = Dinv_ijk();
    const int ij = i + j*jStride;
    xo[ijk] = X(ijk) + RedBlack[ij]*lambda*(rhs[ijk]-Ax);
  }
[Figure: a sweep of 8 elements mapped to 8 threads, using the red-black FP mask]
67
GSRB_FP
§ Hierarchical Roofline, i.e. bytes are HBM, L2 and unified L1 cache bytes
§ Highly bandwidth bound, inherent to stencil codes
§ From Level 5 to Level 8:
  – AI slightly increases due to a better surface-to-volume ratio
  – More HBM bound as more data is read in
§ Roofline captures the computational characteristics of the algorithm
68
§ GSRB_BRANCH has half the FLOPs of GSRB_FP but the same HBM/L1/L2 bytes

GSRB_FP:
  for(int k=klo; k<(klo+kdim); k++){
    const int ijk = i + j*jStride + k*kStride;
    const double *__restrict__ RedBlack = level.RedBlack_FP + ghosts*(1+jStride)
                                          + ((k^color000)&1)*kStride;
    const double Ax = apply_op_ijk();
    const double lambda = Dinv_ijk();
    const int ij = i + j*jStride;
    xo[ijk] = X(ijk) + RedBlack[ij]*lambda*(rhs[ijk]-Ax);
  }

GSRB_BRANCH:
  for(int k=klo; k<klo+kdim; k++){
    const int ijk = i + j*jStride + k*kStride;
    if(((i^j^k^color000^1)&1)){
      const double Ax = apply_op_ijk();
      const double lambda = Dinv_ijk();
      xo[ijk] = X(ijk) + lambda*(rhs[ijk]-Ax);
    }else{
      xo[ijk] = X(ijk);
    }
  }
[Figure: both versions map 8 elements to 8 threads; sweep]
69
GSRB_FP vs. GSRB_BRANCH
§ FLOPs halve, bytes don't change; thus AI halves and GFLOP/s halves
§ Runtime is comparable even though GFLOP/s has halved
§ The same number of threads is occupied, only with half of them predicated out in GSRB_BRANCH
70
§ GSRB_STRIDE2 should have the same FLOPs as GSRB_BRANCH, but the same bytes? More writes than GSRB_BRANCH?

GSRB_STRIDE2:
  for(int k=klo; k<klo+kdim; k++){
    i = ilo + !((ilo^j^k^color000)&1) + threadIdx.x*2;
    if(i < ilo+idim){
      const int ijk = i + j*jStride + k*kStride;
      xo[ijk] = X(ijk);
    }
    i = ilo + ((ilo^j^k^color000)&1) + threadIdx.x*2;
    if(i < ilo+idim){
      const int ijk = i + j*jStride + k*kStride;
      const double Ax = apply_op_ijk();
      const double lambda = Dinv_ijk();
      xo[ijk] = X(ijk) + lambda*(rhs[ijk]-Ax);
    }
  }
[Figure: 8 elements handled by 4 threads; each thread copies one element and computes its neighbour]
71
GSRB_BRANCH vs. GSRB_STRIDE2
§ Extra writes in GSRB_STRIDE2 cause more capacity misses in L2, leading to a drop in AI
§ Runtime almost doubled and GFLOP/s halved
72
§ Roofline can gracefully capture various aspects of application performance and architecture characteristics, such as arithmetic intensity, instruction mix, memory coalescing and thread predication.
§ The proposed methodology is effective in collecting machine characteristics and application data on NVIDIA GPUs to construct the hierarchical Roofline.
§ The Roofline model provides insights that profilers alone cannot:
  – identify the most immediate bottleneck
  – prioritize optimization efforts
  – tell you when you can stop
73
§ S. Williams, A. Waterman and D. Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures," Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009
§ Empirical Roofline Toolkit (ERT): https://bitbucket.org/berkeleylab/cs-roofline-toolkit
§ Example scripts for plotting Roofline: https://github.com/cyanguwa/nersc-roofline
§ General Plasmon Pole kernel: https://github.com/cyanguwa/BerkeleyGW-GPP
§ HPGMG-CUDA kernel: https://bitbucket.org/nsakharnykh/hpgmg-cuda
74
§ This material is based upon work supported by the Advanced Scientific Computing Research Program in the U.S. Department of Energy, Office of Science, under Award Number DE-AC02-05CH11231.
§ This material is based upon work supported by the DOE RAPIDS SciDAC Institute.
§ This research used resources of the National Energy Research Scientific Computing Center (NERSC), which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-05CH11231.
75