S8630 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS



SLIDE 1

Jakob Progsch, Mathias Wagner GTC 2018

S8630 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS

SLIDE 2

BEFORE YOU START

1. Know your hardware

  • What are the target machines, how many nodes? Machine-specific optimizations okay?

2. Know your tools

  • Strengths and weaknesses of each tool? Learn how to use them (and learn one well!)

3. Know your application

  • What does it compute? How is it parallelized? What final performance is expected?

4. Know your process

  • Performance optimization is a constant learning process

5. Make it so!

The five steps to enlightenment

SLIDE 3

THE APOD CYCLE

  • 1. Assess: identify the performance limiter, analyze the profile, find indicators
  • 2. Parallelize
  • 3. Optimize (3b. Build knowledge)
  • 4. Deploy and test

SLIDE 4


GUIDING OPTIMIZATION EFFORT

  • Challenge: How to know where to start?
  • Top-down Approach:
  • Find Hotspot Kernel
  • Identify Performance Limiter of the Hotspot
  • Find performance bottleneck indicators related to the limiter
  • Identify associated regions in the source code
  • Come up with strategy to fix and change the code
  • Start again

“Drilling Down into the Metrics”

SLIDE 5

KNOW YOUR HARDWARE: VOLTA ARCHITECTURE

SLIDE 6

VOLTA V100 FEATURES

  • Volta Architecture: most productive GPU
  • Tensor Core: 120 programmable TFLOPS for deep learning
  • Improved SIMT Model: new algorithms
  • Volta MPS: inference utilization
  • Improved NVLink & HBM2: efficient bandwidth

SLIDE 7

GPU COMPARISON

                            P100 (SXM2)     V100 (SXM2)
Double/Single/Half TFlop/s  5.3/10.6/21.2   7.8/15.7/125 (Tensor Cores)
Memory Bandwidth (GB/s)     732             900
Memory Size                 16 GB           16 GB
L2 Cache Size               4096 KB         6144 KB
Base/Boost Clock (MHz)      1328/1480       1312/1530
TDP (Watts)                 300             300

SLIDE 8

VOLTA GV100 SM

GV100
FP32 units                64
FP64 units                32
INT32 units               64
Tensor Cores              8
Register File             256 KB
Unified L1/Shared memory  128 KB
Active Threads            2048

SLIDE 9

IMPROVED L1 CACHE

Pascal SM: separate shared memory (64 KB) and L1$ (24 KB) behind the load/store units; 4 MB L2$.

Volta SM: unified L1$ and shared memory (128 KB) behind the load/store units, giving low latency and streaming; 6 MB L2$.

SLIDE 10

KNOW YOUR TOOLS: PROFILERS

SLIDE 11

PROFILING TOOLS

From NVIDIA

  • nvprof
  • NVIDIA Visual Profiler (nvvp)
  • Nsight Visual Studio Edition

Coming Soon:

  • NVIDIA Nsight Systems
  • NVIDIA Nsight Compute

Third Party

  • TAU Performance System
  • VampirTrace
  • PAPI CUDA component
  • HPC Toolkit
  • (Tools using CUPTI)

Many Options!

Without loss of generality, in this talk we will be showing nvvp screenshots

SLIDE 12

THE NVVP PROFILER WINDOW

Panes: Timeline, Analysis Results, Summary, Guide

SLIDE 13

KNOW YOUR APPLICATION: HPGMG

SLIDE 14

HPGMG

High-Performance Geometric Multi-Grid, Hybrid Implementation

Fine levels are executed on throughput-optimized processors (GPU); coarse levels are executed on latency-optimized processors (CPU).

3/24/2018

[Diagram: F-cycle and V-cycle; fine levels run smoother & residual on the GPU, and below a threshold the coarse levels run the smoother and a direct solve on the CPU]

http://crd.lbl.gov/departments/computer-science/PAR/research/hpgmg/

SLIDE 15

MAKE IT SO: ITERATION 1

2ND ORDER 7-POINT STENCIL

SLIDE 16

IDENTIFY HOTSPOT

Identify the hotspot: smooth_kernel()

Kernel            Time      Speedup
Original Version  2.079 ms  1.00x

SLIDE 17

IDENTIFY PERFORMANCE LIMITER

Memory utilization vs compute utilization

SLIDE 18

PERFORMANCE LIMITER CATEGORIES

Memory utilization vs compute utilization: four possible combinations.

  • Compute bound: high compute, low memory utilization
  • Bandwidth bound: low compute, high memory utilization
  • Latency bound: both low
  • Compute and bandwidth bound: both high

SLIDE 19

LATENCY BOUND ON P100

SLIDE 20

BANDWIDTH BOUND ON V100

SLIDE 21

DRILLING DOWN: LATENCY ANALYSIS (V100)

The profiler warns about low occupancy: limited by a block size of only 8x4 = 32 threads.
SLIDE 22

OCCUPANCY

Each SM has limited resources:

  • max. 64K registers (32-bit) distributed between threads
  • max. 48 KB (96 KB opt-in) of shared memory per block (96 KB per SM)
  • max. 32 active blocks per SM
  • Full occupancy: 2048 threads per SM (64 warps)

When a resource is used up, occupancy is reduced.

GPU Utilization

Values vary with Compute Capability

SLIDE 23

LATENCY

GPUs cover latencies by having a lot of work in flight

[Diagram: warps 0-9 alternate between issuing and waiting (latency); while some warps wait, others issue, so the latency is fully covered]

[Diagram: with only warps 0-3, there are cycles where no warp issues; the latency is exposed because there are not enough warps]

SLIDE 24

LATENCY AT HIGH OCCUPANCY

Many active warps, but with high-latency instructions: latency can be exposed even at high occupancy.

[Diagram: warps 0-9 all waiting on high-latency instructions at the same time, so no warp issues]

SLIDE 25

LOOKING FOR MORE INDICATORS

12 global load transactions per 1 request

Source code association: for line numbers, compile with nvcc -lineinfo

SLIDE 26

MEMORY TRANSACTIONS: BEST CASE

A warp issues a 32x4B aligned and consecutive load/store request; threads read different elements of the same 128B segment.

  • 1x L1 transaction: 128B needed / 128B transferred
  • 4x L2 transactions: 128B needed / 128B transferred

Per warp: 1x 128B load/store request, 1x 128B L1 transaction, 4x 32B L2 transactions.

SLIDE 27

MEMORY TRANSACTIONS: WORST CASE

Threads in a warp read/write 4B words with 128B between consecutive words (stride: 32x4B); each thread reads the first 4B of a different 128B segment.

  • 32x L1 transactions: 128B needed / 32x 128B transferred
  • 32x L2 transactions: 128B needed / 32x 32B transferred

Per warp: 1x 128B load/store request, but 1x 128B L1 transaction and 1x 32B L2 transaction per thread.

SLIDE 28

TRANSACTIONS AND REPLAYS

With replays, requests take more time and use more resources: more instructions issued, more memory traffic, increased execution time.

[Diagram: each instruction issues once for threads 0-7/24-31, is replayed for threads 8-15 and again for threads 16-23; data is transferred for every replay, delaying completion]

Extra latency, extra work (SM), extra memory traffic

SLIDE 29

FIX: BETTER GPU TILING

Before vs after: block size up, memory utilization up, transactions per access down.

Kernel                  Time      Speedup
Original Version        2.079 ms  1.00x
Better Memory Accesses  1.756 ms  1.18x  (+10%)

SLIDE 30

SLIDE 31

ITERATION 2: DATA MIGRATION

SLIDE 32

PAGE FAULTS

Details

SLIDE 33

MEMORY MANAGEMENT

Using Unified Memory

  • No changes to data structures
  • No explicit data movements
  • Single pointer for CPU and GPU data
  • Use cudaMallocManaged for allocations

[Diagram: developer view with Unified Memory]

SLIDE 34

UNIFIED MEMORY

Eliminating page migrations and faults

Solution: allocate the first CPU level with cudaMallocHost (zero-copy memory)

[Diagram: F-cycle with page faults occurring at the GPU/CPU threshold level]

SLIDE 35

PAGE FAULTS

Almost gone

SLIDE 36

PAGE FAULTS

Significant speedup for affected kernel

SLIDE 37

MEM ADVICE API

Not used here

cudaMemPrefetchAsync(ptr, length, destDevice, stream)

  • Migrates data to destDevice: overlap with compute
  • Updates the page table: much lower overhead than a page fault in the kernel
  • Async operation that follows CUDA stream semantics

cudaMemAdvise(ptr, length, advice, device)

  • Specifies allocation and usage policy for a memory region
  • User can set and unset at any time

SLIDE 38

ITERATION 3: REGISTER OPTIMIZATION AND CACHING

SLIDE 39

LIMITER: STILL MEMORY BANDWIDTH

SLIDE 40

GPU MEMORY HIERARCHY

V100: bring reused data closer to the SMs.

  • Registers (256 KB/SM): good for intra-thread data reuse
  • Shared memory / L1$ (128 KB/SM): good for explicit intra-block data reuse
  • L2$ (6144 KB): implicit data reuse

[Diagram: each SM holds functional units, a register file, and shared memory/L1$; all SMs share the L2$ in front of global memory (framebuffer)]

SLIDE 41

CACHING IN REGISTERS

No data loaded initially


SLIDE 42

CACHING IN REGISTERS

Load first set of data

load

SLIDE 43

CACHING IN REGISTERS

Perform calculation

Stencil

SLIDE 44

CACHING IN REGISTERS

Naively load next set of data?

load

SLIDE 45

CACHING IN REGISTERS

Reusing already loaded data is better

load, keep, keep

SLIDE 46

CACHING IN REGISTERS

Repeat

Stencil

Higher register usage may result in reduced occupancy => trade-off (run experiments!)

SLIDE 47

THE EFFECT OF REGISTER CACHING

Transactions for cached loads reduced by a factor of 8

Memory utilization still high, but transferring less redundant data.

Kernel                  Time      Speedup
Original Version        2.079 ms  1.00x
Better Memory Accesses  1.756 ms  1.18x
Register Caching        1.486 ms  1.40x

SLIDE 48

SHARED MEMORY

  • Programmer-managed cache
  • Great for caching data reused across threads in a CTA
  • 128 KB per SM, split between shared memory and L1 cache

Each block can use at most 96 KB of shared memory on GV100; search for cudaFuncAttributePreferredSharedMemoryCarveout in the docs.

__global__ void sharedMemExample(float *d) {     // d: global memory
    __shared__ float s[64];                      // shared memory, per block
    int t = threadIdx.x;                         // t: registers
    s[t] = d[t];                                 // global -> shared
    __syncthreads();                             // wait for all loads
    if (t > 0 && t < 63)
        d[t] = -2.0f*s[t] + s[t-1] + s[t+1];     // neighbors read from shared
}

SLIDE 49

SLIDE 50

ITERATION 4: KERNELS WITH INCREASED ARITHMETIC INTENSITY

SLIDE 51

OPERATIONAL INTENSITY

  • Operational intensity = arithmetic operations / bytes read and written
  • Our stencil kernels have very low operational intensity
  • It might be beneficial to use a different algorithm with higher operational intensity
  • In this case this might be achieved by using higher-order stencils
SLIDE 52

ILP VS OCCUPANCY

  • Earlier we looked at how occupancy helps hide latency by providing independent threads of execution.
  • When our code requires many registers, occupancy will be limited, but we can still get instruction-level parallelism inside the threads.
  • Occupancy helps performance but is not always required.
  • Some algorithms, such as matrix multiplications, allow increases in operational intensity by using more registers for local storage while simultaneously offering decent ILP. In these cases it may be beneficial to maximize ILP and operational intensity at the cost of occupancy.

Independent instr.:  a = b + c;  d = e + f;
Dependent instr.:    a = b + c;  d = a + f;

SLIDE 53

SLIDE 54

SUMMARY

SLIDE 55

SUMMARY

  • 1. Know your application
  • 2. Know your hardware
  • 3. Know your tools
  • 4. Know your process
  • Identify the Hotspot
  • Classify the Performance Limiter
  • Look for indicators
  • 5. Make it so!

Performance Optimization is a Constant Learning Process

SLIDE 56

REFERENCES

CUDA Documentation

  • Best Practices: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
  • Pascal Tuning Guide: http://docs.nvidia.com/cuda/pascal-tuning-guide
  • Volta Tuning Guide: http://docs.nvidia.com/cuda/volta-tuning-guide/

NVIDIA Developer Blog

http://devblogs.nvidia.com/

Pointers to GTC 2018 Sessions:

  • S8718 - Optimizing HPC Simulation and Visualization Codes using the NVIDIA System Profiler (previous talk, check out the recording)
  • S8430 - Everything You Need to Know About Unified Memory (Tue, 4:30 PM)
  • S8106 - Volta: Architecture and Performance Optimization (Thur, 10:30 AM)
  • S8481 - CUDA Kernel Profiling: Deep-Dive Into NVIDIA's Next-Gen Tools (Thur, 11:00 AM)

SLIDE 57

THANK YOU

JOIN THE NVIDIA DEVELOPER PROGRAM AT

developer.nvidia.com/join