CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION
Christoph Angerer, NVIDIA; Julien Demouth, NVIDIA
PowerPoint PPT Presentation


SLIDE 1

CHRISTOPH ANGERER, NVIDIA JULIEN DEMOUTH, NVIDIA

CUDA OPTIMIZATION WITH NVIDIA NSIGHT™ ECLIPSE EDITION

SLIDE 2

An iterative method to optimize your GPU code A way to conduct that method with NVIDIA Nsight EE Companion Code: https://github.com/chmaruni/nsight-gtc2015

WHAT YOU WILL LEARN

SLIDE 3

Blur

INTRODUCING THE APPLICATION

Grayscale Edges

SLIDE 4

Grayscale Conversion

// r, g, b: red, green, blue components of the pixel p
foreach pixel p:
    p = 0.298839f*r + 0.586811f*g + 0.114350f*b;

INTRODUCING THE APPLICATION
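The per-pixel formula above can be sketched as a host-side C helper; the function name and the normalized float inputs are illustrative, not taken from the companion code.

```c
#include <math.h>

/* Grayscale conversion using the slide's luminance weights.
   r, g, b in [0, 1]; returns the gray value of the pixel.
   Note the weights sum to 1.0, so white stays white. */
static float grayscale(float r, float g, float b)
{
    return 0.298839f * r + 0.586811f * g + 0.114350f * b;
}
```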

SLIDE 5

Blur: 7x7 Gaussian Filter

foreach pixel p: p = weighted sum of p and its 48 neighbors

(7x7 grid of integer Gaussian filter weights, shown as a figure on the slide)

Image from Wikipedia

INTRODUCING THE APPLICATION
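A CPU reference of the weighted sum makes the kernel's math concrete. The separable binomial weights (1, 6, 15, 20, 15, 6, 1)/64 below are an illustrative stand-in for the slide's integer mask, and `blur7x7` is a hypothetical helper, not part of the companion code.

```c
#include <math.h>

/* Reference 7x7 Gaussian blur at one pixel. Weights are the outer
   product of a normalized 1D binomial kernel with itself; image
   borders are clamped. */
static float blur7x7(const float *img, int w, int h, int x, int y)
{
    static const float k[7] = { 1.f/64, 6.f/64, 15.f/64, 20.f/64,
                                15.f/64, 6.f/64, 1.f/64 };
    float sum = 0.0f;
    for (int dy = -3; dy <= 3; ++dy) {
        for (int dx = -3; dx <= 3; ++dx) {
            int xx = x + dx, yy = y + dy;
            if (xx < 0) xx = 0;
            if (xx >= w) xx = w - 1;
            if (yy < 0) yy = 0;
            if (yy >= h) yy = h - 1;
            sum += k[dy + 3] * k[dx + 3] * img[yy * w + xx];
        }
    }
    return sum;
}
```

Because the weights sum to 1, a constant image passes through unchanged, which is a quick sanity check for any blur implementation.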

SLIDE 6

Edges: 3x3 Sobel Filters

foreach pixel p:
    Gx = weighted sum of p and its 8 neighbors
    Gy = weighted sum of p and its 8 neighbors
    p = sqrt(Gx*Gx + Gy*Gy)

Weights for Gx:
-1 0 1
-2 0 2
-1 0 1

Weights for Gy:
-1 -2 -1
 0  0  0
 1  2  1

INTRODUCING THE APPLICATION

SLIDE 7

NVIDIA Tesla K40m

GK110B, SM 3.5, ECC off; 3004 MHz memory clock, 875 MHz SM clock

NVIDIA CUDA 7.0 release candidate Similar results are obtained on Windows

ENVIRONMENT

SLIDE 8

PERFORMANCE OPTIMIZATION CYCLE

  • 1. Profile Application
  • 2. Identify Performance Limiter
  • 3. Analyze Profile & Find Indicators
  • 4. Reflect (4b. Build Knowledge)
  • 5. Change and Test Code

Chameleon from http://www.vectorportal.com, Creative Commons

SLIDE 9

Basic understanding of the GPU Memory Hierarchy

Global Memory (slow, generous)
Shared Memory (fast, limited)
Registers (very fast, very limited)
(Texture Cache)

Basic understanding of the CUDA execution model

Grid: 1D/2D/3D
Block: 1D/2D/3D
Warp-synchronous execution (32 threads per warp)

PREREQUISITES

SLIDE 10

ITERATION 1

SLIDE 11

CREATE A NEW NVVP SESSION

SLIDE 12

THE PROFILER WINDOW: Timeline, Analysis Results, Summary, Guide

SLIDE 13

TIMELINE

SLIDE 14

EXAMINE INDIVIDUAL KERNELS

(GUIDED ANALYSIS)

SLIDE 15

Identify the hotspot: gaussian_filter_7x7_v0()

IDENTIFY HOTSPOT

Hotspot

Kernel            Time     Speedup
Original Version  5.233ms  1.00x

SLIDE 16

PERFORM KERNEL ANALYSIS

SLIDE 17

IDENTIFY PERFORMANCE LIMITER

SLIDE 18

Memory Utilization vs Compute Utilization
Four possible combinations:

PERFORMANCE LIMITER CATEGORIES

Compute Bound (high compute, low memory utilization)
Bandwidth Bound (low compute, high memory utilization)
Latency Bound (both utilizations low, below ~60%)
Compute and Bandwidth Bound (both utilizations high)

SLIDE 19

IDENTIFY PERFORMANCE LIMITER: Memory Ops (Load/Store). Memory related issues?

SLIDE 20

LOOKING FOR INDICATORS

Large number of memory operations stalling the LSU (Load/Store Unit)
SLIDE 21

LOOKING FOR MORE INDICATORS (Unguided Analysis): 4-5 global load/store transactions per request

SLIDE 22

MEMORY TRANSACTIONS: BEST CASE

A warp issues a 32x4B aligned and consecutive load/store request.
Threads read different elements of the same 128B segment.

1x L1 transaction: 128B needed / 128B transferred
4x L2 transactions: 128B needed / 128B transferred

Per warp: 1x 128B load/store request, 1x 128B L1 transaction, 4x 32B L2 transactions.

SLIDE 23

MEMORY TRANSACTIONS: WORST CASE

Threads in a warp read/write 4B words, 128B between words (stride: 32x4B).
Each thread reads the first 4B of a different 128B segment.

32x L1 transactions: 128B needed / 32x 128B transferred
32x L2 transactions: 128B needed / 32x 32B transferred

Per warp: 1x 128B load/store request; per thread: 1x 128B L1 transaction and 1x 32B L2 transaction.

SLIDE 24

TRANSACTIONS AND REPLAYS

A warp reads from addresses spanning 3 lines of 128B.
1 instruction executed and 2 replays = 1 request and 3 transactions:

Instruction issued: threads 0-7/24-31 (1st line)
Instruction re-issued, 1st replay: threads 8-15 (2nd line)
Instruction re-issued, 2nd replay: threads 16-23 (3rd line)

SLIDE 25

TRANSACTIONS AND REPLAYS

With replays, requests take more time and use more resources:

More instructions issued
More memory traffic
Increased execution time

Each replayed instruction occupies an issue slot again before it completes and transfers data again. The result is extra latency, extra work for the SM, and extra memory traffic.

SLIDE 26

CHANGING THE BLOCK LAYOUT

Our blocks are 8x8. We should use blocks of size 32x2 instead: threadIdx.x then runs stride-1 over the uchar pixels, so each warp reads one contiguous row segment. With 8x8 blocks, a warp spans four short rows of the image and overfetches data it does not use.

SLIDE 27

IMPROVED MEMORY ACCESS

Kernel                  Time     Speedup
Original Version        5.233ms  1.00x
Better Memory Accesses  1.589ms  3.29x

Blocks of size 32x2 Memory is used more efficiently

SLIDE 28

Category: Latency Bound – Coalescing
Problem: Memory is accessed inefficiently => high latency
Goal: Reduce #transactions/request to reduce latency
Indicators: Low global load/store efficiency; high #transactions/#request compared to ideal
Strategy: Improve memory coalescing by:

  • Loading cooperatively inside a block
  • Changing the block layout
  • Aligning data
  • Changing the data layout to improve locality

PERF-OPT QUICK REFERENCE CARD

SLIDE 29

Category: Bandwidth Bound – Coalescing
Problem: Too much unused data clogging the memory system
Goal: Reduce traffic, move more useful data per request
Indicators: Low global load/store efficiency; high #transactions/#request compared to ideal
Strategy: Improve memory coalescing by:

  • Loading cooperatively inside a block
  • Changing the block layout
  • Aligning data
  • Changing the data layout to improve locality

PERF-OPT QUICK REFERENCE CARD

SLIDE 30

ITERATION 2

SLIDE 31

gaussian_filter_7x7_v0() still the hotspot

IDENTIFY HOTSPOT

Hotspot

Kernel                  Time     Speedup
Original Version        5.233ms  1.00x
Better Memory Accesses  1.589ms  3.29x

SLIDE 32

IDENTIFY PERFORMANCE LIMITER: Still Latency Bound

SLIDE 33

LOOKING FOR INDICATORS

A lot of idle time

Not enough work inside a thread to hide latency?

SLIDE 34

STALL REASONS: EXECUTION DEPENDENCY

Memory accesses may influence execution dependencies

Global accesses create longer dependencies than shared accesses Read-only/texture dependencies are counted in Texture

Instruction level parallelism can reduce dependencies

a = b + c; // ADD
d = a + e; // ADD (depends on the previous ADD)

a = b[i];  // LOAD
d = a + e; // ADD (depends on the LOAD)

a = b + c; // Independent ADDs
d = e + f;

SLIDE 35

ILP AND MEMORY ACCESSES

#pragma unroll is useful to extract ILP
Manually rewrite code if not a simple loop

No ILP:

float a = 0.0f;
for( int i = 0 ; i < N ; ++i )
    a += logf(b[i]);

2-way ILP (with loop unrolling):

float a, a0 = 0.0f, a1 = 0.0f;
for( int i = 0 ; i < N ; i += 2 ) {
    a0 += logf(b[i]);
    a1 += logf(b[i+1]);
}
a = a0 + a1;

In the unrolled version the a0 and a1 dependency chains are independent, so their loads and logf/add operations can overlap in flight.

SLIDE 36

LOOKING FOR MORE INDICATORS

SLIDE 37

Not enough active warps to hide latencies?

LOOKING FOR MORE INDICATORS

SLIDE 38

LATENCY

GPUs cover latencies by having a lot of work in flight: while one warp waits on a long-latency operation, other warps issue. Latency is fully covered as long as some warp can issue every cycle; when no warp can issue, the latency is exposed.

SLIDE 39

LATENCY: LACK OF OCCUPANCY

Not enough active warps: the schedulers cannot find eligible warps at every cycle, so no warp issues and latency is exposed.

SLIDE 40

IMPROVED OCCUPANCY

Kernel                  Time     Speedup
Original Version        5.233ms  1.00x
Better Memory Accesses  1.589ms  3.29x
Higher Occupancy        1.562ms  3.35x

Bigger blocks of size 32x4 Increases achieved occupancy slightly (from 47.6% to 52.4%)

SLIDE 41

Category: Latency Bound – Occupancy
Problem: Latency is exposed due to low occupancy
Goal: Hide latency behind more parallel work
Indicators: Low occupancy (< 60%); high execution dependency
Strategy: Increase occupancy by:

  • Varying block size
  • Varying shared memory usage
  • Varying register count

PERF-OPT QUICK REFERENCE CARD
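The occupancy reasoning on this card can be sketched as a toy per-SM calculator. The limits below are assumed SM 3.5 (Kepler) values (64 warps, 16 blocks, 65536 registers, 48KB shared memory per SM); the real numbers come from the profiler or the CUDA occupancy calculator.

```c
/* Rough theoretical occupancy for one SM, given a block configuration.
   Occupancy = resident warps / max warps; residency is capped by the
   tightest of the warp, register, shared-memory, and block limits. */
static float occupancy(int threads_per_block, int regs_per_thread,
                       int smem_per_block)
{
    const int max_warps = 64, max_blocks = 16;
    const int regs_per_sm = 65536, smem_per_sm = 49152;

    int warps_per_block = (threads_per_block + 31) / 32;
    int by_warps = max_warps / warps_per_block;
    int by_regs  = regs_per_thread
                 ? regs_per_sm / (regs_per_thread * threads_per_block)
                 : max_blocks;
    int by_smem  = smem_per_block
                 ? smem_per_sm / smem_per_block
                 : max_blocks;

    int blocks = by_warps;
    if (by_regs < blocks)    blocks = by_regs;
    if (by_smem < blocks)    blocks = by_smem;
    if (max_blocks < blocks) blocks = max_blocks;

    return (float)(blocks * warps_per_block) / max_warps;
}
```

For example, 128-thread blocks using 40 registers/thread and 6KB of shared memory are limited by shared memory to 8 resident blocks (32 warps), i.e. 50% theoretical occupancy.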

SLIDE 42

Category: Latency Bound – Instruction Level Parallelism
Problem: Not enough independent work per thread
Goal: Do more parallel work inside single threads
Indicators: High execution dependency; increasing occupancy has little/no positive effect; registers still available
Strategy:

  • Unroll loops (#pragma unroll)
  • Refactor threads to compute n output values at the same time (code duplication)

PERF-OPT QUICK REFERENCE CARD

SLIDE 43

ITERATION 3

SLIDE 44

gaussian_filter_7x7_v0() still the hotspot

IDENTIFY HOTSPOT

Hotspot

Kernel                  Time     Speedup
Original Version        5.233ms  1.00x
Better Memory Accesses  1.589ms  3.29x
Higher Occupancy        1.562ms  3.35x

SLIDE 45

IDENTIFY PERFORMANCE LIMITER: Still Latency Bound

SLIDE 46

LOOKING FOR INDICATORS

Still high execution dependency, but occupancy is OK
SLIDE 47

LOOKING FOR MORE INDICATORS

Is our working set mostly in L2$?
Medium L2 bandwidth utilization
Very low device memory bandwidth utilization

Launch

SLIDE 48

CHECKING L2 HIT RATE: 98.9%

Our working set is mostly in L2$ Can we move it even closer?

SLIDE 49

Adjacent pixels access similar neighbors in Gaussian Filter We should use shared memory to store those common pixels

SHARED MEMORY

__shared__ unsigned char smem_pixels[10][64];
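The payoff of this tile can be estimated with back-of-envelope arithmetic. The sketch assumes 32x4 thread blocks (introduced in the occupancy step) and the 10x64 uchar tile declared above; the helper names are illustrative.

```c
/* Without shared memory, every thread reads its full 7x7 neighborhood
   from global memory (through caches). */
static int global_loads_per_thread_no_smem(void)
{
    return 7 * 7;
}

/* With the smem_pixels[10][64] tile and 32x4 blocks, the block
   cooperatively loads the tile once; each thread fetches only its
   share of the tile from global memory. */
static int global_loads_per_thread_with_smem(void)
{
    const int tile_bytes = 10 * 64;  /* shared tile, uchar pixels */
    const int threads    = 32 * 4;   /* threads per block */
    return tile_bytes / threads;     /* cooperative loads per thread */
}
```

Roughly a 10x reduction in global loads per output pixel, with the reuse served from fast shared memory instead.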

SLIDE 50

SHARED MEMORY

Kernel                  Time     Speedup
Original Version        5.233ms  1.00x
Better Memory Accesses  1.589ms  3.29x
Higher Occupancy        1.562ms  3.35x
Shared Memory           0.911ms  5.74x

Using shared memory for the Gaussian filter: significant speedup, now < 1ms

SLIDE 51

Category: Latency Bound – Shared Memory
Problem: Long memory latencies are difficult to hide
Goal: Reduce latency by moving data to faster memory
Indicators: Shared memory is not the occupancy limiter; high L2 hit rate; data reuse between threads and a small-ish working set
Strategy: (Cooperatively) move data to:

  • Shared Memory
  • (or Registers)
  • (or Constant Memory)
  • (or Texture Cache)

PERF-OPT QUICK REFERENCE CARD

SLIDE 52

Category: Memory Bound – Shared Memory
Problem: Too much data movement
Goal: Reduce the amount of data traffic to/from global memory
Indicators: Higher than expected memory traffic to/from global memory; low arithmetic intensity of the kernel
Strategy: (Cooperatively) move data closer to the SM:

  • Shared Memory
  • (or Registers)
  • (or Constant Memory)
  • (or Texture Cache)

PERF-OPT QUICK REFERENCE CARD

SLIDE 53

ITERATION 4

SLIDE 54

gaussian_filter_7x7_v0() still the hotspot

IDENTIFY HOTSPOT

Hotspot

Kernel                  Time     Speedup
Original Version        5.233ms  1.00x
Better Memory Accesses  1.589ms  3.29x
Higher Occupancy        1.562ms  3.35x
Shared Memory           0.911ms  5.74x

SLIDE 55

IDENTIFY PERFORMANCE LIMITER

Aha! Getting into the high utilization region

SLIDE 56

LOOKING FOR INDICATORS


SLIDE 57

LOOKING FOR MORE INDICATORS

Load/Store Unit is really busy! Can we reduce the load?

SLIDE 58

INSTRUCTION THROUGHPUT

Each SM has 4 schedulers (Kepler). Schedulers issue instructions to pipes.
A scheduler issues up to 2 instructions/cycle; the sustainable peak is 7 instructions/cycle per SM (not 4x2 = 8).
A scheduler issues instructions from a single warp and cannot issue to a pipe if its issue slot is full.

SLIDE 59

INSTRUCTION THROUGHPUT

Three situations to distinguish (scheduler vs pipe utilization, from the profiler):

Schedulers saturated: SM utilization 90%, no single pipe saturated
Schedulers and pipe saturated: SM utilization 92%, one pipe at 90%
Pipe saturated: SM utilization 64%, one pipe at 78%; the pipe limits throughput even though the schedulers are not saturated

SLIDE 60

READ-ONLY CACHE (TEXTURE UNITS)

Loads through the Texture Units skip the Load/Store Unit: the texture units cache read-only loads on the path between the SM (SMEM/L1$, registers) and L2$/global memory (framebuffer), relieving the LSU.

SLIDE 61

READ-ONLY PATH

Annotate read-only parameters with const __restrict.
The compiler generates LDG instructions: 0.808ms.

__global__ void gaussian_filter_7x7_v2(int w, int h, const uchar *__restrict src, uchar *dst)

Kernel                  Time     Speedup
Original Version        5.233ms  1.00x
Better Memory Accesses  1.589ms  3.29x
Higher Occupancy        1.562ms  3.35x
Shared Memory           0.911ms  5.74x
Read-Only Path          0.808ms  6.48x

SLIDE 62

Category: Latency Bound – Texture Cache
Problem: The Load/Store Unit becomes the bottleneck
Goal: Relieve the Load/Store Unit of read-only data
Indicators: High utilization of the Load/Store Unit; pipe-busy stall reason; significant amount of read-only data
Strategy: Load read-only data through the Texture Units:

  • Annotate read-only pointers with const __restrict__
  • Use the __ldg() intrinsic

PERF-OPT QUICK REFERENCE CARD

SLIDE 63

THE RESULT: 6.5X

Looking much better. Things to investigate next:

Reduce computational intensity (separable filter)
Increase instruction level parallelism (process two elements per thread)

The Sobel filter is starting to become the bottleneck.

SLIDE 64

MORE IN OUR COMPANION CODE

Kernel                                                     Time     Speedup
Original Version                                           5.233ms  1.00x
Better Memory Accesses                                     1.589ms  3.29x
Higher Occupancy                                           1.562ms  3.35x
Shared Memory                                              0.911ms  5.74x
Read-Only Path                                             0.808ms  6.48x
Separable Filter                                           0.481ms  10.88x
Process two pixels per thread (memory efficiency + ILP)    0.415ms  12.61x
Use 64-bit shared memory (remove bank conflicts)           0.403ms  12.99x
Use float instead of int (increase instruction throughput) 0.363ms  14.42x
Your next idea!!!

Companion Code: https://github.com/chmaruni/nsight-gtc2015
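The separable-filter entry in the table above exploits the fact that a 7x7 Gaussian mask is the outer product of a 1D kernel with itself, so one 2D pass (49 multiply-adds per pixel) equals a horizontal pass followed by a vertical pass (14 per pixel). A host-side check, with illustrative binomial weights:

```c
#include <math.h>

/* Compare a full 2D 7x7 convolution of one neighborhood against the
   horizontal-then-vertical separable version; returns |difference|. */
static float full_vs_separable_diff(void)
{
    static const float k[7] = { 1.f/64, 6.f/64, 15.f/64, 20.f/64,
                                15.f/64, 6.f/64, 1.f/64 };
    float v[7][7], tmp[7];
    for (int i = 0; i < 7; ++i)
        for (int j = 0; j < 7; ++j)
            v[i][j] = (float)(i * 7 + j);   /* arbitrary neighborhood */

    float full = 0.0f;                      /* one 2D pass */
    for (int i = 0; i < 7; ++i)
        for (int j = 0; j < 7; ++j)
            full += k[i] * k[j] * v[i][j];

    for (int i = 0; i < 7; ++i) {           /* horizontal pass */
        tmp[i] = 0.0f;
        for (int j = 0; j < 7; ++j)
            tmp[i] += k[j] * v[i][j];
    }
    float sep = 0.0f;                       /* vertical pass */
    for (int i = 0; i < 7; ++i)
        sep += k[i] * tmp[i];

    return fabsf(full - sep);
}
```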

SLIDE 65

SUMMARY

SLIDE 66

ITERATIVE OPTIMIZATION WITH NSIGHT EE

Trace the Application Identify the Hotspot and Profile it Identify the Performance Limiter

Memory Bandwidth Instruction Throughput Latency

Look for indicators

Take nvvp guided analysis as a starting point, but don't follow it too closely

Optimize the Code Iterate

SLIDE 67

REFERENCES

Performance Optimization: Programming Guidelines and GPU Architecture Details Behind Them, GTC 2013

http://on-demand.gputechconf.com/gtc/2013/video/S3466-Performance-Optimization-Guidelines-GPU-Architecture-Details.mp4
http://on-demand.gputechconf.com/gtc/2013/presentations/S3466-Programming-Guidelines-GPU-Architecture.pdf

CUDA Best Practices Guide

http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/

Parallel Forall devblog

http://devblogs.nvidia.com/parallelforall/

Upcoming GTC 2015 Sessions:

S5655 CUDA Application Development Life Cycle with Nsight Eclipse Edition (Hands-on lab), Nikita Shulga, Thursday 2pm
S5353+S5376 Memory Bandwidth Bootcamp (and Beyond), Tony Scudiero, Thursday 3:30pm and 5pm

SLIDE 68

NVIDIA REGISTERED DEVELOPER PROGRAMS

Everything you need to develop with NVIDIA products Membership is your first step in establishing a working relationship with NVIDIA Engineering

Exclusive access to pre-releases
Submit bug and feature requests
Stay informed about the latest releases and training opportunities
Access to exclusive downloads
Exclusive activities and special offers
Interact with other developers in the NVIDIA Developer Forums

REGISTER FOR FREE AT: developer.nvidia.com

SLIDE 69

THANK YOU