S7444 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS - PowerPoint PPT Presentation



SLIDE 1

Christoph Angerer, Jakob Progsch, GTC 2017

S7444 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS

SLIDE 2

BEFORE YOU START

1. Know your application

  • What does it compute? How is it parallelized? What final performance is expected?

2. Know your hardware

  • What are the target machines, how many nodes? Machine-specific optimizations okay?

3. Know your tools

  • Strengths and weaknesses of each tool? Learn how to use them (and learn one well!)

4. Know your process

  • Performance optimization is a constant learning process

5. Make it so!

The five steps to enlightenment

SLIDE 3

THE APOD CYCLE

  • 1. Assess
  • Identify Performance Limiter
  • Analyze Profile
  • Find Indicators
  • 2. Parallelize
  • 3. Optimize
  • 3b. Build Knowledge
  • 4. Deploy and Test

SLIDE 4

Scope

GUIDING OPTIMIZATION EFFORT

  • Challenge: How to know where to start?
  • Top-down Approach:
  • Find Hotspot Kernel
  • Identify Performance Limiter of the Hotspot
  • Find performance bottleneck indicators related to the limiter
  • Identify associated regions in the source code
  • Come up with strategy to fix and change the code
  • Start again

“Drilling Down into the Metrics”

SLIDE 5

KNOW YOUR APPLICATION: HPGMG

SLIDE 6

HPGMG

High-Performance Geometric Multi-Grid, Hybrid Implementation

Fine levels are executed on throughput-optimized processors (GPU); coarse levels are executed on latency-optimized processors (CPU)

[Diagram: F-cycle and V-cycle; fine levels run on the GPU, coarse levels below a threshold run on the CPU; direct solve at the coarsest level, smoother & residual on the levels above]

http://crd.lbl.gov/departments/computer-science/PAR/research/hpgmg/

SLIDE 7

MULTI-GRID BOTTLENECK

Cost of operations

[Chart: kernel time / total time for levels 1-6, split by smoother, interpolation, copy_blocks, residual, restriction, apply_bc]

MOST TIME SPENT ON STENCILS

[Chart: kernel time / level time for levels 1-6, same kernels; volume work dominates on fine levels, surface work on coarse levels]

SLIDE 8

KNOW YOUR HARDWARE: PASCAL ARCHITECTURE

SLIDE 9

GPU COMPARISON

                            P100 (SXM2)    M40           K40
Double/Single/Half TFlop/s  5.3/10.6/21.2  0.2/7.0/NA    1.4/4.3/NA
Memory Bandwidth (GB/s)     732            288           288
Memory Size                 16 GB          12 GB, 24 GB  12 GB
L2 Cache Size               4096 KB        3072 KB       1536 KB
Base/Boost Clock (MHz)      1328/1480      948/1114      745/875
TDP (Watts)                 300            250           235

SLIDE 10

GP100 SM

GP100 SM: 64 CUDA Cores, 256 KB Register File, 64 KB Shared Memory, 2048 Active Threads, 32 Active Blocks

SLIDE 11

KNOW YOUR TOOLS: PROFILERS

SLIDE 12

PROFILING TOOLS

From NVIDIA

  • nvprof
  • NVIDIA Visual Profiler
  • Standalone (nvvp)
  • Integrated into Nsight Eclipse

Edition (nsight)

  • Nsight Visual Studio Edition

Third Party

  • TAU Performance System
  • VampirTrace
  • PAPI CUDA component
  • HPC Toolkit
  • (Tools using CUPTI)

Many Options!

Without loss of generality, in this talk we will be showing nvvp screenshots

SLIDE 13

THE NVVP PROFILER WINDOW

[Screenshot: Timeline, Analysis, Results, Summary, and Guide panes]

  • S7824 – DEVELOPER TOOLS UPDATE, Wed 4:00 PM
  • S7495 – OPTIMIZING APPLICATION PERFORMANCE WITH CUDA PROFILING TOOLS, Thur 10:00 AM

SLIDE 14

MAKE IT SO: ITERATION 1

2ND ORDER 7-POINT STENCIL

SLIDE 15

IDENTIFY HOTSPOT

Identify the hotspot: smooth_kernel()

Kernel             Time       Speedup
Original Version   0.109443s  1.00x

SLIDE 16

IDENTIFY PERFORMANCE LIMITER

[Screenshot: memory ops (load/store) dominate memory utilization. Any issues?]

SLIDE 17

PERFORMANCE LIMITER CATEGORIES

Memory Utilization vs Compute Utilization: four possible combinations

  • Comp high, Mem low:  Compute Bound
  • Comp low,  Mem high: Bandwidth Bound
  • Comp low,  Mem low:  Latency Bound
  • Comp high, Mem high: Compute and Bandwidth Bound

(~60% utilization marks the threshold for "high")

SLIDE 18

DRILLING DOWN: LATENCY ANALYSIS

SLIDE 19

OCCUPANCY

Each SM has limited resources:

  • max. 64K Registers (32 bit) distributed between threads
  • max. 48KB of shared memory per block (96KB per SMM)
  • max. 32 Active Blocks per SMM
  • Full occupancy: 2048 threads per SM (64 warps)

When a resource is used up, occupancy is reduced

GPU Utilization

(*) Values vary with Compute Capability

SLIDE 20

LATENCY

GPUs cover latencies by having a lot of work in flight

[Timeline: warps 0-9 alternate between issuing and waiting; with enough warps some warp can always issue, so the latency is fully covered]

[Timeline: with only warps 0-3 there are cycles in which no warp can issue: exposed latency, not enough warps]

SLIDE 21

LATENCY AT HIGH OCCUPANCY

Many active warps, but with high-latency instructions

[Timeline: warps 0-9 all waiting on long-latency instructions; no warp issuing: exposed latency even at high occupancy]

SLIDE 22

LOOKING FOR MORE INDICATORS

12 Global Load Transactions per 1 Request

Source code association: for line numbers, compile with nvcc -lineinfo

SLIDE 23

MEMORY TRANSACTIONS: BEST CASE

A warp issues a 32x4B aligned and consecutive load/store request; threads read different elements of the same 128B segment

  • 1x L1 transaction: 128B needed / 128B transferred
  • 4x L2 transactions: 128B needed / 128B transferred

Per warp: 1x 128B load/store request, 1x 128B L1 transaction, 4x 32B L2 transactions

SLIDE 24

MEMORY TRANSACTIONS: WORST CASE

Threads in a warp read/write 4B words, 128B between words; each thread reads the first 4B of a different 128B segment (stride: 32x4B)

  • 32x L1 transactions: 128B needed / 32x 128B transferred
  • 32x L2 transactions: 128B needed / 32x 32B transferred

Per warp: 1x 128B load/store request, 1x 128B L1 transaction per thread, 1x 32B L2 transaction per thread

SLIDE 25

TRANSACTIONS AND REPLAYS

With replays, requests take more time and use more resources: more instructions issued, more memory traffic, increased execution time

[Diagram: instructions 0-2 each issue for threads 0-7/24-31, then replay for threads 8-15 and for threads 16-23 before completing; data is transferred separately for each replay]

Cost: extra latency, extra work for the SM, extra memory traffic

SLIDE 26

FIX: BETTER GPU TILING

Before/After: Block Size up, Memory Utilization up, Transactions Per Access down

Kernel                   Time       Speedup
Original Version         0.109443s  1.00x
Better Memory Accesses   0.076051s  1.44x

SLIDE 27

Category: Latency Bound – Occupancy
Problem: Latency is exposed due to low occupancy
Goal: Hide latency behind more parallel work
Indicators: Occupancy low (< 60%), Execution Dependency high
Strategy: Increase occupancy by:

  • Varying block size
  • Varying shared memory usage
  • Varying register count (use __launch_bounds__)

PERF-OPT QUICK REFERENCE CARD

SLIDE 28

Category: Latency Bound – Coalescing
Problem: Memory is accessed inefficiently => high latency
Goal: Reduce #transactions/request to reduce latency
Indicators: Low global load/store efficiency, high #transactions/#requests compared to ideal
Strategy: Improve memory coalescing by:

  • Cooperative loading inside a block
  • Change block layout
  • Aligning data
  • Changing data layout to improve locality

PERF-OPT QUICK REFERENCE CARD

SLIDE 29

Category: Bandwidth Bound – Coalescing
Problem: Too much unused data clogging the memory system
Goal: Reduce traffic, move more useful data per request
Indicators: Low global load/store efficiency, high #transactions/#requests compared to ideal
Strategy: Improve memory coalescing by:

  • Cooperative loading inside a block
  • Change block layout
  • Aligning data
  • Changing data layout to improve locality

PERF-OPT QUICK REFERENCE CARD

SLIDE 30

ITERATION 2: REGISTER OPTIMIZATION AND CACHING

SLIDE 31

NEW PERFORMANCE LIMITER: MEMORY BANDWIDTH

SLIDE 32

GPU MEMORY HIERARCHY

P100 (SXM2)

[Diagram: SMs, each with a Register File, Functional Units, Shared Memory, and Unified Cache, backed by a shared L2$ and Global Memory (Framebuffer)]

Bring reused data closer to the SMs:

  • Registers (256 KB/SM): good for intra-thread data reuse
  • Shared memory (64 KB/SM): good for explicit intra-block data reuse
  • L1$/Tex$, L2$ (4096 KB): implicit data reuse

SLIDE 33

STENCILS ON GPU

Register caching

// load k and k-1 planes into registers
double xc0 = x[ijk - kStride];
double xc1 = x[ijk];
...
for (k = 0; k < dimz; k++) {
    // load k+1 plane into registers
    xc2 = x[ijk + kStride];
    ...
    // apply operator
    const double Ax = apply_op_ijk();
    // smoother
    xo[ijk] = xc1 + ...;
    // update k and k-1 planes in registers
    xc0 = xc1;
    xc1 = xc2;
    ...
}

const double Ax = -b*h2inv*(
    STENCIL_TWELFTH*(
        + bic1 * ( 15.0*(xl1-xc1) - (xll-xr1) )
        + bir1 * ( 15.0*(xr1-xc1) - (xrr-xl1) )
        + bjc1 * ( 15.0*(xu1-xc1) - (xuu-xd1) )
        + bjd1 * ( 15.0*(xd1-xc1) - (xdd-xu1) )
        + bkc1 * ( 15.0*(xc0-xc1) - (xbb-xc2) )
        + bkc2 * ( 15.0*(xc2-xc1) - (xff-xc0) ) )
    + 0.25*STENCIL_TWELFTH*(
        + (bid  - biu ) * (xld - xd1 - xlu + xu1)
        + (bic2 - bic0) * (xl2 - xc2 - xl0 + xc0)
        + (bjr  - bjl ) * (xru - xr1 - xlu + xl1)
        + (bjc2 - bjc0) * (xu2 - xc2 - xu0 + xc0)
        + (bkr1 - bkl1) * (xr0 - xr1 - xl0 + xl1)
        + (bkd1 - bku1) * (xd0 - xd1 - xu0 + xu1)
        + (bird - biru) * (xrd - xd1 - xru + xu1)
        + (bir2 - bir0) * (xr2 - xc2 - xr0 + xc0)
        + (bjrd - bjld) * (xrd - xr1 - xld + xl1)
        + (bjd2 - bjd0) * (xd2 - xc2 - xd0 + xc0)
        + (bkr2 - bkl2) * (xr2 - xr1 - xl2 + xl1)
        + (bkd2 - bku2) * (xd2 - xd1 - xu2 + xu1) ));

4TH ORDER STENCIL, 90 REGS (38 REGS IN KERNEL WITHOUT STENCIL)

const double Ax = -b*h2inv*(
    STENCIL_TWELFTH*(
        + bir1 * (xr1 - xc1)
        + bic1 * (xl1 - xc1)
        + bju1 * (xu1 - xc1)
        + bjc1 * (xd1 - xc1)
        + bkc2 * (xc2 - xc1)
        + bkc1 * (xc0 - xc1) ));

7-POINT STENCIL, 18 REGS (TOTAL REG USAGE: 56 FOR FV2, 128 FOR FV4)

Up to 1.5x speed-up! Higher register usage may result in reduced occupancy => trade-off (run experiments!)

SLIDE 34

THE EFFECT OF REGISTER CACHING

Transactions for cached loads reduced by a factor of 8

Memory utilization still high, but transferring more useful data. Still future optimization potential?

Kernel                   Time       Speedup
Original Version         0.109443s  1.00x
Better Memory Accesses   0.076051s  1.44x
Register Caching         0.065127s  1.68x

SLIDE 35

GPU SM ARCHITECTURE

Pascal SM

[Diagram: multiple SMs, each with a Register File, Unified Cache, Functional Units (CUDA cores), and Shared Memory]

GP100 SM: 64 CUDA Cores, 256 KB Register File, 64 KB Shared Memory, Constant Cache; 56 SMs on Tesla P100

SLIDE 36

TEX/L1

Maxwell and Pascal: unified tex/L1 cache. Global loads are cached by default (-dlcm=ca is the default); transactions are 32B (128B on K40), so for scattered access there is no need to turn L1 off to reduce the transaction size. On GP104 the default is uncached; to ensure caching on both GP100 and GP104, use __ldg

Selective caching to reduce thrashing:

  • Use -dlcm=cg to turn off L1 caching
  • Add __ldg explicitly to selected variables

SLIDE 37

SHARED MEMORY

Programmer-managed cache; great for caching data reused across threads in a CTA

64KB per SM; each block can use at most 48KB. No longer split with L1: a previous call to cudaDeviceSetCacheConfig is simply ignored on Pascal

__global__ void sharedMemExample(int *d, int n) {
    __shared__ int s[64];
    int t  = threadIdx.x;
    int tr = n - t - 1;
    s[t] = d[t];
    __syncthreads();
    d[t] = s[tr];  // reverse via shared memory
}

SLIDE 38

Category: Bandwidth Bound – Register Caching
Problem: Data is reused within threads and memory bandwidth utilization is high
Goal: Reduce amount of data traffic to/from global memory
Indicators: High device memory usage, latency exposed, data reuse within threads and a small-ish working set, low arithmetic intensity of the kernel
Strategy:

  • Assign registers to cache data
  • Avoid storing and reloading data (possibly by assigning work to threads differently)

  • Avoid register spilling

PERF-OPT QUICK REFERENCE CARD

SLIDE 39

Category: Latency Bound – Texture Cache
Problem: The Load/Store Unit becomes the bottleneck
Goal: Relieve the Load/Store Unit of read-only data
Indicators: High utilization of the Load/Store Unit, pipe-busy stall reason, significant amount of read-only data
Strategy: Load read-only data through the Texture Units:

  • Annotate read-only pointers with const __restrict__

  • Use __ldg() intrinsic

PERF-OPT QUICK REFERENCE CARD

SLIDE 40

Category: Device Mem Bandwidth Bound – Shared Memory
Problem: Too much data movement
Goal: Reduce amount of data traffic to/from global memory
Indicators: Higher than expected memory traffic to/from global memory, low arithmetic intensity of the kernel
Strategy: (Cooperatively) move data closer to the SM:

  • Shared Memory
  • (or Registers)
  • (or Constant Memory)
  • (or Texture Cache)

PERF-OPT QUICK REFERENCE CARD

SLIDE 41

Category: Shared Mem Bandwidth Bound – Shared Memory
Problem: Shared memory bandwidth bottleneck
Goal: Reduce shared memory traffic
Indicators: Shared memory loads or stores saturate
Strategy:

  • Reduce bank conflicts (insert padding)
  • Move data from shared memory into registers
  • Change the data layout in shared memory

PERF-OPT QUICK REFERENCE CARD

SLIDE 42

ITERATION 3: KERNELS WITH INCREASED ARITHMETIC INTENSITY

SLIDE 43

HPGMG

4th order vs 2nd order

  • Performs 4x the FP operations
  • MPI: sends 3x the messages, doubles the size (2-deep halos)
  • DRAM memory footprint is the same (assuming no overfetch)
  • Attains a lower relative residual: ~10^-9 for a single F-cycle

[Diagram: planes K-2 through K+2 of the 2-deep halo]

SLIDE 44

FUNCTION UNIT UTILIZATION AND STALL REASONS

Functional units are not the bottlenecks in HPGMG, even with higher order stencils!

Execution Dependencies starting to become significant!

SLIDE 45

INSTRUCTION THROUGHPUT

[Charts: per-pipe utilization (Shared Mem, Texture, Control Flow, ALU) and scheduler load for three cases: schedulers saturated (utilization 90%), schedulers and functional units saturated (utilization 92%), functional units saturated (utilization 64%)]

SLIDE 46

INSTRUCTION THROUGHPUT

Each SM has 4 schedulers (Maxwell); schedulers issue instructions to function units

  • Each scheduler can issue up to 2 instructions per cycle
  • A scheduler issues instructions from a single warp
  • Cannot issue to a pipe if its issue slot is full

[Diagram: SM with 4 schedulers feeding functional units, TEX/L1$, a 256KB register file, and 96KB shared memory]

SLIDE 47

STALL REASONS: EXECUTION DEPENDENCY

Memory accesses may influence execution dependencies

Global accesses create longer dependencies than shared accesses Read-only/texture dependencies are counted in Texture

Instruction level parallelism can reduce dependencies

a = b + c; // ADD
d = a + e; // ADD, depends on the previous ADD

a = b[i];  // LOAD
d = a + e; // ADD, depends on the LOAD (longer latency)

a = b + c; // independent ADDs
d = e + f;

SLIDE 48

ILP AND MEMORY ACCESSES

#pragma unroll is useful to extract ILP; manually rewrite the code if it is not a simple loop

No ILP:

float a = 0.0f;
for (int i = 0; i < N; ++i)
    a += logf(b[i]);

2-way ILP (with loop unrolling):

float a, a0 = 0.0f, a1 = 0.0f;
for (int i = 0; i < N; i += 2) {
    a0 += logf(b[i]);
    a1 += logf(b[i+1]);
}
a = a0 + a1;

[Diagram: without ILP each logf waits on the previous accumulate; with 2-way ILP the two independent chains interleave]

SLIDE 49

Category: Latency Bound – Instruction Level Parallelism
Problem: Not enough independent work per thread
Goal: Do more parallel work inside single threads
Indicators: High execution dependency, increasing occupancy has no/little positive effect, registers still available
Strategy:

  • Unroll loops (#pragma unroll)
  • Refactor threads to compute n output values at the same time (code duplication)

PERF-OPT QUICK REFERENCE CARD

SLIDE 50

Category: Compute Bound – Algorithmic Changes
Problem: The GPU is already computing as fast as possible
Goal: Reduce the computation if possible
Indicators: Clearly compute bound, speedup only possible with less computation
Strategy:

  • Pre-compute or store (intermediate) results
  • Trade memory for compute time
  • Use a computationally less expensive algorithm
  • Possibly: run with low occupancy and high ILP

PERF-OPT QUICK REFERENCE CARD

SLIDE 51

SUMMARY

SLIDE 52

SUMMARY

  • 1. Know your application
  • 2. Know your hardware
  • 3. Know your tools
  • 4. Know your process
  • Identify the Hotspot
  • Classify the Performance Limiter
  • Look for indicators
  • 5. Make it so!

Performance Optimization is a Constant Learning Process

SLIDE 53

REFERENCES

CUDA Documentation

Best Practices: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
Kepler Tuning Guide: http://docs.nvidia.com/cuda/kepler-tuning-guide
Maxwell Tuning Guide: http://docs.nvidia.com/cuda/maxwell-tuning-guide
Pascal Tuning Guide: http://docs.nvidia.com/cuda/pascal-tuning-guide

Parallel Forall devblog

http://devblogs.nvidia.com/parallelforall/

Upcoming GTC 2017 Sessions:

S7132 – New CUDA Features and Beyond, Wed 2:30 PM
S7824 – Developer Tools Update, Wed 4:00 PM
S7495 – Optimizing Application Performance with CUDA Profiling Tools, Thur 10:00 AM

SLIDE 54

THANK YOU

JOIN THE NVIDIA DEVELOPER PROGRAM AT

developer.nvidia.com/join