CUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION


SLIDE 1

CHRISTOPH ANGERER, NVIDIA
JULIEN DEMOUTH, NVIDIA

CUDA OPTIMIZATION WITH NVIDIA NSIGHT™ VISUAL STUDIO EDITION

SLIDE 2

  • An iterative method to optimize your GPU code
  • A way to conduct that method with NVIDIA Nsight VSE
  • Companion Code: https://github.com/chmaruni/nsight-gtc2015

WHAT YOU WILL LEARN

SLIDE 3

Pipeline stages: Grayscale → Blur → Edges

INTRODUCING THE APPLICATION

SLIDE 4

Grayscale Conversion

// r, g, b: red, green, blue components of the pixel p
foreach pixel p:
    p = 0.298839f*r + 0.586811f*g + 0.114350f*b;

INTRODUCING THE APPLICATION
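As a minimal CUDA sketch of this conversion (the kernel name, the packed uchar3 input layout, and the one-thread-per-pixel mapping are illustrative assumptions, not necessarily the companion code):

    // Hypothetical grayscale kernel: one thread per pixel.
    __global__ void rgb_to_grayscale(const uchar3 *rgb, unsigned char *gray,
                                     int w, int h)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h)
            return;
        uchar3 p = rgb[y * w + x];  // p.x = r, p.y = g, p.z = b
        gray[y * w + x] =
            (unsigned char)(0.298839f * p.x + 0.586811f * p.y + 0.114350f * p.z);
    }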

SLIDE 5

Blur: 7x7 Gaussian Filter

foreach pixel p:
    p = weighted sum of p and its 48 neighbors

Weights (7x7 Gaussian kernel):

 1  2  3  4  3  2  1
 2  4  6  8  6  4  2
 3  6  9 12  9  6  3
 4  8 12 16 12  8  4
 3  6  9 12  9  6  3
 2  4  6  8  6  4  2
 1  2  3  4  3  2  1

Image from Wikipedia

INTRODUCING THE APPLICATION
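A naive v0 kernel in the spirit of the hotspot profiled below might look like this (a sketch: every one of the 49 taps reads global memory; the constant-memory weights and border clamping are assumptions):

    __constant__ float gauss[7][7];  // normalized 7x7 Gaussian weights

    __global__ void gaussian_filter_7x7_v0(const unsigned char *src,
                                           unsigned char *dst, int w, int h)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h)
            return;
        float sum = 0.0f;
        for (int j = -3; j <= 3; ++j)        // 49 neighbor loads per pixel,
            for (int i = -3; i <= 3; ++i) {  // all straight from global memory
                int xx = min(max(x + i, 0), w - 1);  // clamp at the borders
                int yy = min(max(y + j, 0), h - 1);
                sum += gauss[j + 3][i + 3] * src[yy * w + xx];
            }
        dst[y * w + x] = (unsigned char)(sum + 0.5f);
    }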

SLIDE 6

Edges: 3x3 Sobel Filters

foreach pixel p:
    Gx = weighted sum of p and its 8 neighbors
    Gy = weighted sum of p and its 8 neighbors
    p = sqrt(Gx*Gx + Gy*Gy)

Weights for Gx:

-1  0  1
-2  0  2
-1  0  1

Weights for Gy:

 1  2  1
 0  0  0
-1 -2 -1

INTRODUCING THE APPLICATION
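In CUDA this stage could be sketched as follows (border threads are simply skipped here; the kernel name and border policy are assumptions):

    __global__ void sobel_3x3(const unsigned char *src, unsigned char *dst,
                              int w, int h)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1)
            return;  // skip the 1-pixel border
        #define P(i, j) (float)src[(y + (j)) * w + (x + (i))]
        float gx = -P(-1,-1) + P(1,-1) - 2.0f*P(-1,0) + 2.0f*P(1,0)
                   - P(-1,1) + P(1,1);
        float gy =  P(-1,-1) + 2.0f*P(0,-1) + P(1,-1)
                   - P(-1,1) - 2.0f*P(0,1) - P(1,1);
        #undef P
        dst[y * w + x] = (unsigned char)fminf(255.0f, sqrtf(gx*gx + gy*gy));
    }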

SLIDE 7

NVIDIA GTX Titan X (GM200, SM 5.2)

Windows 7
NVIDIA Nsight Visual Studio Edition 4.6

ENVIRONMENT

SLIDE 8

PERFORMANCE OPTIMIZATION CYCLE

  • 1. Profile Application
  • 2. Identify Performance Limiter
  • 3. Analyze Profile & Find Indicators
  • 4. Reflect (4b. Build Knowledge)
  • 5. Change and Test Code

Chameleon from http://www.vectorportal.com, Creative Commons

SLIDE 9

Basic understanding of the GPU Memory Hierarchy

  • Global Memory (slow, generous)
  • Shared Memory (fast, limited)
  • Registers (very fast, very limited)
  • (Texture Cache)

Basic understanding of the CUDA execution model

  • Grid: 1D/2D/3D
  • Block: 1D/2D/3D
  • Warp-synchronous execution (32 threads per warp); see the sketch below

PREREQUISITES
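For example, a typical 2D launch and the per-thread index computation (a generic sketch; my_kernel and its arguments are placeholders):

    dim3 block(8, 8);   // 1D/2D/3D block; 8x8 is this application's initial choice
    dim3 grid((width + block.x - 1) / block.x,    // enough blocks to cover the image
              (height + block.y - 1) / block.y);
    my_kernel<<<grid, block>>>(d_src, d_dst, width, height);

    // Inside the kernel, each thread computes its own pixel coordinates:
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row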

SLIDE 10

ITERATION 1

SLIDE 11

TRACING THE APPLICATION

  • Select Trace Application
  • Activate CUDA tracing
  • Launch and verify parameters

SLIDE 12

NAVIGATING THE ANALYSIS REPORTS

  • Timeline
  • CUDA Summary
  • CUDA Launches

SLIDE 13

TIMELINE

SLIDE 14

Identify the hotspot: gaussian_filter_7x7_v0()

IDENTIFY HOTSPOT (CUDA SUMMARY)

Hotspot

Kernel                    Time      Speedup
Original Version          1.971ms   1.00x

SLIDE 15

PERFORM KERNEL ANALYSIS

  • Select Profile CUDA Application
  • Select the Kernel Launch
  • Select the Experiments (All)

SLIDE 16

THE CUDA LAUNCHES VIEW

  • Select Kernel
  • Select Experiment
  • Experiment Results

SLIDE 17

Memory Utilization vs Compute Utilization. Four possible combinations:

  • Compute Bound: compute utilization high, memory utilization low
  • Bandwidth Bound: memory utilization high, compute utilization low
  • Latency Bound: both low
  • Compute and Bandwidth Bound: both high

(~60% utilization is the rough threshold between "low" and "high")

IDENTIFY MAIN PERFORMANCE LIMITER

SLIDE 18

MEMORY BANDWIDTH

[Diagram: each SM has its own Registers and SMEM/L1$; all SMs share the L2$ in front of Global Memory (Framebuffer)]

SLIDE 19

L2$ bandwidth (BW) utilization is low and DRAM BW utilization is < 2%. The kernel is not limited by memory bandwidth.

IDENTIFY PERFORMANCE LIMITER

SLIDE 20

INSTRUCTION THROUGHPUT

  • Each SM has 4 schedulers (Maxwell)
  • Schedulers issue instructions to pipes
  • Each scheduler schedules up to 2 instructions per cycle
  • A scheduler issues instructions from a single warp
  • It cannot issue to a pipe if that pipe's issue slot is full

[SM diagram: four schedulers, each feeding its own pipes; the 256KB register file, Tex/L1$, and 96KB shared memory are shared across the SM]

SLIDE 21

INSTRUCTION THROUGHPUT

Three example profiles (issue-slot utilization plus per-pipe utilization for the Shared Mem, Texture, Control Flow, and ALU pipes):

  • Schedulers saturated: issue-slot utilization 90%, no pipe saturated
  • Schedulers and pipe saturated: issue-slot utilization 92%, busiest pipe at 90%
  • Pipe saturated: issue-slot utilization only 64%, busiest pipe at 78%

SLIDE 22

WARP ISSUE EFFICIENCY

Percentage of issue slots used (blue), aggregated over all the schedulers.

SLIDE 23

PIPE UTILIZATION

Percentage of issue slots used per pipe, accounting for pipe throughputs. Four groups of pipes:

  • Shared Memory
  • Texture
  • Control Flow
  • Arithmetic (ALU)

SLIDE 24

INSTRUCTION THROUGHPUT

Neither the schedulers nor the pipes are saturated. The kernel is not limited by instruction throughput.

Our Kernel is Latency Bound

SLIDE 25

  • 56% of theoretical occupancy
  • 29.35 active warps per cycle
  • 1.18 eligible warps per cycle

Let's start with occupancy.

LOOKING FOR INDICATORS

SLIDE 26

OCCUPANCY

Each SM has limited resources:

  • 64K registers (32-bit), shared by threads
  • Up to 48KB of shared memory per block (96KB per SMM)
  • 32 active blocks per SMM
  • Full occupancy: 2048 threads per SM (64 warps)

Values vary with Compute Capability
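These limits can also be queried programmatically; a sketch using the standard CUDA occupancy API (the 64-thread block size matches the 8x8 blocks used so far):

    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, gaussian_filter_7x7_v0,
        64 /* threads per block */, 0 /* dynamic smem */);
    // Theoretical occupancy = resident warps / 64 warps per SM
    float occupancy = (blocksPerSM * 64 / 32) / 64.0f;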

SLIDE 27

LATENCY

GPUs cover latencies by having a lot of work in flight

[Timeline: warps 0-9 alternate between issuing and waiting (latency); with enough warps in flight, some warp issues every cycle (fully covered latency); otherwise some cycles have no warp issuing (exposed latency)]

SLIDE 28

LATENCY: LACK OF OCCUPANCY

Not enough active warps: the schedulers cannot find eligible warps at every cycle.

[Timeline: with only warps 0-3 resident, some cycles have no warp issuing]

SLIDE 29

LOOKING FOR MORE INDICATORS

The block size seems OK. We don't want to change the register count yet.

SLIDE 30

CONTINUE LOOKING FOR INDICATORS

4-8 L2 transactions per 1 request

SLIDE 31

MEMORY TRANSACTIONS: BEST CASE

  • A warp issues a 32x4B aligned and consecutive load/store request
  • Threads read different elements of the same 128B segment
  • 4x L2 transactions: 128B needed / 128B transferred

1x 128B load/store request per warp
4x 32B L2 transactions per warp
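The best case in code is simply consecutive, aligned 4B accesses (illustrative sketch, not the application's kernel):

    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];  // a warp touches one 128B segment: 4x 32B transactions
    }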

SLIDE 32

MEMORY TRANSACTIONS: WORST CASE

  • Threads in a warp read/write 4B words, with 128B between consecutive words
  • Each thread reads the first 4B of a different 128B segment
  • 32x L2 transactions: 128B needed / 32x 32B transferred

1x 128B load/store request per warp
1x 32B L2 transaction per thread

Stride: 32x4B

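The worst case corresponds to a large stride between threads (illustrative sketch):

    __global__ void copy_strided(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * 32 < n)
            out[i * 32] = in[i * 32];  // 128B between threads: one 32B transaction
                                       // per thread, 32 transactions per request
    }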

SLIDE 33


TRANSACTIONS AND REPLAYS

A warp reads from addresses spanning 3 cache lines of 128B. 1 instruction executed and 2 replays = 1 request and 3 transactions.

[Timeline: the instruction is issued and services threads 0-7 and 24-31 (1st line); the 1st replay services threads 8-15 (2nd line); the 2nd replay services threads 16-23 (3rd line)]

SLIDE 34

TRANSACTIONS AND REPLAYS

With replays, requests take more time and use more resources

  • More instructions issued
  • More memory traffic
  • Increased execution time

[Timeline: inst. 0, 1, and 2 are each issued and re-issued (replays) to serve threads 0-7/24-31, 8-15, and 16-23 before completing; the data transfers for each instruction stretch out the total execution time]

  • Extra latency
  • Extra work (SM)
  • Extra memory traffic

SLIDE 35

CHANGING THE BLOCK LAYOUT

Our blocks are 8x8. We should use blocks of size 32x2.

[Figure: with 8x8 blocks, warps 0 and 1 each span four 8-pixel rows, so consecutive threadIdx.x values touch short scattered runs of uchar data and neighboring blocks overfetch the same sectors (data overfetch). With 32x2 blocks, each warp covers one full 32-pixel row, making threadIdx.x stride-1 over uchar data.]
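The change itself is only in the launch configuration (sketch; d_src and d_dst are placeholder device pointers):

    dim3 block(32, 2);  // one warp per 32-pixel row: stride-1 uchar accesses coalesce
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    gaussian_filter_7x7_v0<<<grid, block>>>(d_src, d_dst, width, height);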

SLIDE 36

IMPROVED MEMORY ACCESS

Kernel                    Time      Speedup
Original Version          1.971ms   1.00x
Better Memory Accesses    0.725ms   2.72x

Blocks of size 32x2: memory is used more efficiently.

SLIDE 37

Category: Latency Bound – Occupancy
Problem: Latency is exposed due to low occupancy
Goal: Hide latency behind more parallel work
Indicators: Occupancy low (< 60%), execution dependency high
Strategy: Increase occupancy by:

  • Varying block size
  • Varying shared memory usage
  • Varying register count

PERF-OPT QUICK REFERENCE CARD

SLIDE 38

Category: Latency Bound – Coalescing
Problem: Memory is accessed inefficiently => high latency
Goal: Reduce #transactions/request to reduce latency
Indicators: Low global load/store efficiency, high #transactions/#request compared to ideal
Strategy: Improve memory coalescing by:

  • Cooperative loading inside a block
  • Changing the block layout
  • Aligning data
  • Changing data layout to improve locality

PERF-OPT QUICK REFERENCE CARD

SLIDE 39

Category: Bandwidth Bound – Coalescing
Problem: Too much unused data clogging the memory system
Goal: Reduce traffic, move more useful data per request
Indicators: Low global load/store efficiency, high #transactions/#request compared to ideal
Strategy: Improve memory coalescing by:

  • Cooperative loading inside a block
  • Changing the block layout
  • Aligning data
  • Changing data layout to improve locality

PERF-OPT QUICK REFERENCE CARD

SLIDE 40

ITERATION 2

SLIDE 41

gaussian_filter_7x7_v0() still the hotspot

IDENTIFY HOTSPOT

Kernel                    Time      Speedup
Original Version          1.971ms   1.00x
Better Memory Accesses    0.725ms   2.72x

Hotspot

SLIDE 42

L2$ bandwidth (BW) utilization is low and DRAM BW utilization is < 4%. Still not limited by memory bandwidth.

IDENTIFY PERFORMANCE LIMITER

SLIDE 43

The schedulers are starting to be busy, but the Tex pipe is clearly the limiter.

IDENTIFY PERFORMANCE LIMITER

Load/Store pipeline is saturated

SLIDE 44

98.89% hit rate in the L2 cache: the kernel is mostly working from the L2 cache.

LOOKING FOR INDICATORS

SLIDE 45

The kernel transfers 8MB to/from device memory but 360MB to/from the L2 cache.

LOOKING FOR MORE INDICATORS

Can we move the data closer to the SM?

SLIDE 46

Adjacent pixels access similar neighbors in the Gaussian filter. We should use shared memory to store those common pixels.

SHARED MEMORY

__shared__ unsigned char smem_pixels[10][64]; // on-chip tile: the block's pixels plus the filter halo
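A cooperative load into that tile might look like this inside the kernel (a sketch assuming a 3-pixel halo and kernel parameters src, w, h; the companion code's exact tile-to-block mapping may differ):

    // Each thread loads several bytes of the 10x64 tile, clamping reads at the
    // image border; the block synchronizes before anyone reads a neighbor.
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    for (int idx = tid; idx < 10 * 64; idx += blockDim.x * blockDim.y) {
        int ty = idx / 64, tx = idx % 64;
        int gx = min(max((int)(blockIdx.x * blockDim.x) + tx - 3, 0), w - 1);
        int gy = min(max((int)(blockIdx.y * blockDim.y) + ty - 3, 0), h - 1);
        smem_pixels[ty][tx] = src[gy * w + gx];
    }
    __syncthreads();
    // ... each thread now reads its 7x7 neighborhood from smem_pixels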

SLIDE 47

SHARED MEMORY

Kernel                    Time      Speedup
Original Version          1.971ms   1.00x
Better Memory Accesses    0.725ms   2.72x
Shared Memory             0.334ms   5.90x

Using shared memory for the Gaussian filter: a significant speedup, now < 0.5ms.

SLIDE 48

Category: Latency Bound – Shared Memory
Problem: Long memory latencies are harder to hide
Goal: Reduce latency, move data to faster memory
Indicators: Shared memory is not the occupancy limiter; high L2 hit rate; data reuse between threads and small-ish working set
Strategy: (Cooperatively) move data to:

  • Shared Memory
  • (or Registers)
  • (or Constant Memory)
  • (or Texture Cache)

PERF-OPT QUICK REFERENCE CARD

SLIDE 49

Category: Memory Bound – Shared Memory
Problem: Too much data movement
Goal: Reduce the amount of data traffic to/from global memory
Indicators: Higher than expected memory traffic to/from global memory; low arithmetic intensity of the kernel
Strategy: (Cooperatively) move data closer to the SM:

  • Shared Memory
  • (or Registers)
  • (or Constant Memory)
  • (or Texture Cache)

PERF-OPT QUICK REFERENCE CARD

SLIDE 50

ITERATION 3

SLIDE 51

gaussian_filter_7x7_v0() still the hotspot

IDENTIFY HOTSPOT

Kernel                    Time      Speedup
Original Version          1.971ms   1.00x
Better Memory Accesses    0.725ms   2.72x
Shared Memory             0.334ms   5.90x

Hotspot

SLIDE 52

L2$ bandwidth (BW) utilization is moderate and DRAM BW utilization is < 8%. Still not limited by memory bandwidth.

IDENTIFY PERFORMANCE LIMITER

SLIDE 53

IDENTIFY PERFORMANCE LIMITER

The Kernel is Compute Bound

SLIDE 54

No Divergence in our code

LOOKING FOR INDICATORS

SLIDE 55

BRANCH DIVERGENCE

Threads of a warp take different branches of a conditional

if( threadIdx.x < 12 ) {} else {}

[Timeline: first the threads with threadIdx.x < 12 execute the "if" branch while the rest idle, then the remaining threads execute the "else" branch]

Execution time = “if” branch + “else” branch
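One common refactoring replaces the data-dependent branch with a select, which the compiler can typically predicate so the warp stays converged (illustrative sketch; a, b, x, y are placeholder floats):

    // Divergent: the two halves of the warp execute serially.
    if (threadIdx.x < 12) y = a * x;
    else                  y = b * x;

    // Branch-free: pick the operand, one instruction stream for the whole warp.
    y = (threadIdx.x < 12 ? a : b) * x;  // usually compiles to a predicated select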

SLIDE 56

Execution dependency is the largest stall reason; not a clear indicator by itself, however.

LOOKING FOR MORE INDICATORS

SLIDE 57

STALL REASONS: EXECUTION DEPENDENCY

Memory accesses may influence execution dependencies

Global accesses create longer dependencies than shared accesses. Read-only/texture dependencies are counted in Texture.

Instruction level parallelism can reduce dependencies

a = b + c; // ADD
d = a + e; // ADD (depends on the previous ADD)

a = b[i];  // LOAD
d = a + e; // ADD (depends on the load)

a = b + c; // Independent ADDs
d = e + f;

SLIDE 58

ILP AND MEMORY ACCESSES

#pragma unroll is useful to extract ILP. Manually rewrite the code if it is not a simple loop.

No ILP:

    float a = 0.0f;
    for( int i = 0 ; i < N ; ++i )
        a += logf(b[i]);

2-way ILP (with loop unrolling):

    float a, a0 = 0.0f, a1 = 0.0f;
    for( int i = 0 ; i < N ; i += 2 ) {
        a0 += logf(b[i]);
        a1 += logf(b[i+1]);
    }
    a = a0 + a1;

[Instruction streams: without ILP each step waits on the previous one (c = b[0]; a += logf(c); c = b[1]; a += logf(c); ...); with 2-way ILP the a0 and a1 chains interleave (c0 = b[0]; a0 += logf(c0); c1 = b[1]; a1 += logf(c1); ...)]

SLIDE 59

>4 TIOP/second (tera integer operations per second)

LOOKING FOR MORE INDICATORS

The Kernel is simply computing a lot

SLIDE 60

Separable Filter:

Gaussian filters are circularly symmetric and separable. Compute the horizontal and vertical convolutions separately.

REDUCING COMPUTATIONAL COMPLEXITY

The 7x7 kernel is the outer product of the row [1 2 3 4 3 2 1] with itself:

[1 2 3 4 3 2 1]ᵀ ∗ [1 2 3 4 3 2 1] =

 1  2  3  4  3  2  1
 2  4  6  8  6  4  2
 3  6  9 12  9  6  3
 4  8 12 16 12  8  4
 3  6  9 12  9  6  3
 2  4  6  8  6  4  2
 1  2  3  4  3  2  1
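A two-pass sketch of the separable filter (row pass shown; the column pass is analogous; the kernel names and the float temporary buffer are assumptions):

    __constant__ float gauss_1d[7];  // {1,2,3,4,3,2,1}, normalized

    __global__ void gaussian_row_pass(const unsigned char *src, float *tmp,
                                      int w, int h)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h)
            return;
        float sum = 0.0f;
        for (int i = -3; i <= 3; ++i)  // 7 taps per pass: 14 total instead of 49
            sum += gauss_1d[i + 3] * src[y * w + min(max(x + i, 0), w - 1)];
        tmp[y * w + x] = sum;
    }
    // A second, analogous kernel convolves tmp along y and writes the uchar output.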

SLIDE 61

SEPARABLE FILTER + INCREASED ILP

Kernel                        Time      Speedup
Original Version              1.971ms   1.00x
Better Memory Accesses        0.725ms   2.72x
Shared Memory                 0.334ms   5.90x
Separable Filter + incr. ILP  0.179ms   11.01x

The separable filter reduces the computational load. Processing two elements per thread increases instruction-level parallelism.

SLIDE 62

Category: Compute Bound – Branch Divergence
Problem: Diverging threads
Goal: Reduce divergence within warps
Indicators: Low warp execution efficiency, high control flow utilization
Strategy:

  • Refactor code to avoid intra-warp divergence
  • Restructure data (sorting?) to avoid data-dependent branch divergence

PERF-OPT QUICK REFERENCE CARD

SLIDE 63

Category: Latency Bound – Instruction Level Parallelism
Problem: Not enough independent work per thread
Goal: Do more parallel work inside single threads
Indicators: High execution dependency; increasing occupancy has little/no positive effect; registers still available
Strategy:

  • Unroll loops (#pragma unroll)
  • Refactor threads to compute n output values at the same time (code duplication)

PERF-OPT QUICK REFERENCE CARD

SLIDE 64

Category: Compute Bound – Algorithmic Changes
Problem: The GPU is computing as fast as possible
Goal: Reduce computation if possible
Indicators: Clearly compute bound problem; speedup only possible with less computation
Strategy:

  • Pre-compute or store (intermediate) results
  • Trade memory for compute time
  • Use a computationally less expensive algorithm

PERF-OPT QUICK REFERENCE CARD

SLIDE 65

THE RESULT: 11.01X

Much better utilization. The Sobel filter is starting to become the bottleneck.

SLIDE 66

MORE IN OUR COMPANION CODE

Companion Code: https://github.com/chmaruni/nsight-gtc2015

Gaussian Filter:

Kernel                        Time      Speedup
Original Version              1.971ms   1.00x
Better Memory Accesses        0.725ms   2.72x
Shared Memory                 0.334ms   5.90x
Separable Filter + incr. ILP  0.179ms   11.01x
Floats instead of int ops     0.153ms   12.88x

Sobel Filter:

Kernel                        Time      Speedup
Baseline                      0.200ms   1.00x
Floats+Intrinsics+fast_math   0.152ms   1.32x
Your Next Idea!

SLIDE 67

SUMMARY

SLIDE 68

  • Trace the Application
  • Identify the Hotspot and Profile It
  • Identify the Performance Limiter (Memory Bandwidth, Instruction Throughput, Latency)
  • Look for Indicators
  • Reflect and Optimize the Code
  • Iterate

ITERATIVE OPTIMIZATION WITH NSIGHT VSE

SLIDE 69

REFERENCES

Performance Optimization: Programming Guidelines and GPU Architecture Details Behind Them, GTC 2013

http://on-demand.gputechconf.com/gtc/2013/video/S3466-Performance-Optimization-Guidelines-GPU-Architecture-Details.mp4
http://on-demand.gputechconf.com/gtc/2013/presentations/S3466-Programming-Guidelines-GPU-Architecture.pdf

CUDA Best Practices Guide

http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/

Parallel Forall devblog

http://devblogs.nvidia.com/parallelforall/

Upcoming GTC 2015 Sessions:

S5655 – CUDA Application Development Life Cycle with Nsight Eclipse Edition (Hands-on lab), Nikita Shulga, Thursday 2pm
S5353+S5376 – Memory Bandwidth Bootcamp (and Beyond), Tony Scudiero, Thursday 3:30pm and 5pm

SLIDE 70

NVIDIA REGISTERED DEVELOPER PROGRAMS

Everything you need to develop with NVIDIA products. Membership is your first step in establishing a working relationship with NVIDIA Engineering.

  • Exclusive access to pre-releases
  • Submit bugs and feature requests
  • Stay informed about the latest releases and training opportunities
  • Access to exclusive downloads
  • Exclusive activities and special offers
  • Interact with other developers in the NVIDIA Developer Forums

REGISTER FOR FREE AT: developer.nvidia.com

SLIDE 71

THANK YOU