CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION
Christoph Angerer, NVIDIA; Julien Demouth, NVIDIA
PowerPoint PPT Presentation


SLIDE 1

CHRISTOPH ANGERER, NVIDIA JULIEN DEMOUTH, NVIDIA

CUDA OPTIMIZATION WITH NVIDIA NSIGHT™ ECLIPSE EDITION

SLIDE 2

An iterative method to optimize your GPU code A way to conduct that method with NVIDIA Nsight EE Companion Code: https://github.com/chmaruni/nsight-gtc2015

WHAT YOU WILL LEARN

SLIDE 3

Blur

INTRODUCING THE APPLICATION

Grayscale Edges

SLIDE 4

Grayscale Conversion

// r, g, b: red, green, blue components of the pixel p
foreach pixel p:
    p = 0.298839f*r + 0.586811f*g + 0.114350f*b;

INTRODUCING THE APPLICATION
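The per-pixel formula above can be sketched as a host-side C helper; the function name and the normalized float inputs are illustrative, not taken from the companion code.

```c
#include <math.h>

/* Grayscale conversion using the slide's luminance weights.
   r, g, b in [0, 1]; returns the gray value of the pixel.
   Note the weights sum to 1.0, so white stays white. */
static float grayscale(float r, float g, float b)
{
    return 0.298839f * r + 0.586811f * g + 0.114350f * b;
}
```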

SLIDE 5

Blur: 7x7 Gaussian Filter

foreach pixel p: p = weighted sum of p and its 48 neighbors

(7x7 grid of integer Gaussian filter weights, shown as a figure on the slide)

Image from Wikipedia

INTRODUCING THE APPLICATION
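A CPU reference of the weighted sum makes the kernel's math concrete. The separable binomial weights (1, 6, 15, 20, 15, 6, 1)/64 below are an illustrative stand-in for the slide's integer mask, and `blur7x7` is a hypothetical helper, not part of the companion code.

```c
#include <math.h>

/* Reference 7x7 Gaussian blur at one pixel. Weights are the outer
   product of a normalized 1D binomial kernel with itself; image
   borders are clamped. */
static float blur7x7(const float *img, int w, int h, int x, int y)
{
    static const float k[7] = { 1.f/64, 6.f/64, 15.f/64, 20.f/64,
                                15.f/64, 6.f/64, 1.f/64 };
    float sum = 0.0f;
    for (int dy = -3; dy <= 3; ++dy) {
        for (int dx = -3; dx <= 3; ++dx) {
            int xx = x + dx, yy = y + dy;
            if (xx < 0) xx = 0;
            if (xx >= w) xx = w - 1;
            if (yy < 0) yy = 0;
            if (yy >= h) yy = h - 1;
            sum += k[dy + 3] * k[dx + 3] * img[yy * w + xx];
        }
    }
    return sum;
}
```

Because the weights sum to 1, a constant image passes through unchanged, which is a quick sanity check for any blur implementation.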

SLIDE 6

Edges: 3x3 Sobel Filters

foreach pixel p:
    Gx = weighted sum of p and its 8 neighbors
    Gy = weighted sum of p and its 8 neighbors
    p = sqrt(Gx*Gx + Gy*Gy)

Weights for Gx:
-1 0 1
-2 0 2
-1 0 1

Weights for Gy:
-1 -2 -1
 0  0  0
 1  2  1

INTRODUCING THE APPLICATION

SLIDE 7

NVIDIA Tesla K40m

GK110B, SM 3.5, ECC off; 3004 MHz memory clock, 875 MHz SM clock

NVIDIA CUDA 7.0 release candidate Similar results are obtained on Windows

ENVIRONMENT

SLIDE 8

PERFORMANCE OPTIMIZATION CYCLE

  • 1. Profile Application
  • 2. Identify Performance Limiter
  • 3. Analyze Profile & Find Indicators
  • 4. Reflect (4b. Build Knowledge)
  • 5. Change and Test Code

Chameleon from http://www.vectorportal.com, Creative Commons

SLIDE 9

Basic understanding of the GPU Memory Hierarchy

Global Memory (slow, generous)
Shared Memory (fast, limited)
Registers (very fast, very limited)
(Texture Cache)

Basic understanding of the CUDA execution model

Grid: 1D/2D/3D
Block: 1D/2D/3D
Warp-synchronous execution (32 threads per warp)

PREREQUISITES

SLIDE 10

ITERATION 1

SLIDE 11

CREATE A NEW NVVP SESSION

SLIDE 12

THE PROFILER WINDOW: Timeline, Analysis Results, Summary, Guide

SLIDE 13

TIMELINE

SLIDE 14

EXAMINE INDIVIDUAL KERNELS

(GUIDED ANALYSIS)

SLIDE 15

Identify the hotspot: gaussian_filter_7x7_v0()

IDENTIFY HOTSPOT

Hotspot

Kernel            Time     Speedup
Original Version  5.233ms  1.00x

SLIDE 16

PERFORM KERNEL ANALYSIS

SLIDE 17

IDENTIFY PERFORMANCE LIMITER

SLIDE 18

Memory Utilization vs Compute Utilization
Four possible combinations:

PERFORMANCE LIMITER CATEGORIES

Compute Bound (high compute, low memory utilization)
Bandwidth Bound (low compute, high memory utilization)
Latency Bound (both utilizations low, below ~60%)
Compute and Bandwidth Bound (both utilizations high)

SLIDE 19

IDENTIFY PERFORMANCE LIMITER: Memory Ops (Load/Store). Memory related issues?

SLIDE 20

LOOKING FOR INDICATORS

Large number of memory operations stalling the LSU (Load/Store Unit)
SLIDE 21

LOOKING FOR MORE INDICATORS (Unguided Analysis): 4-5 global load/store transactions per request

SLIDE 22

MEMORY TRANSACTIONS: BEST CASE

A warp issues a 32x4B aligned and consecutive load/store request.
Threads read different elements of the same 128B segment.

1x L1 transaction: 128B needed / 128B transferred
4x L2 transactions: 128B needed / 128B transferred

Per warp: 1x 128B load/store request, 1x 128B L1 transaction, 4x 32B L2 transactions.

SLIDE 23

MEMORY TRANSACTIONS: WORST CASE

Threads in a warp read/write 4B words, 128B between words (stride: 32x4B).
Each thread reads the first 4B of a different 128B segment.

32x L1 transactions: 128B needed / 32x 128B transferred
32x L2 transactions: 128B needed / 32x 32B transferred

Per warp: 1x 128B load/store request; per thread: 1x 128B L1 transaction and 1x 32B L2 transaction.

SLIDE 24

TRANSACTIONS AND REPLAYS

A warp reads from addresses spanning 3 lines of 128B.
1 instruction executed and 2 replays = 1 request and 3 transactions:

Instruction issued: threads 0-7/24-31 (1st line)
Instruction re-issued, 1st replay: threads 8-15 (2nd line)
Instruction re-issued, 2nd replay: threads 16-23 (3rd line)

SLIDE 25

TRANSACTIONS AND REPLAYS

With replays, requests take more time and use more resources:

More instructions issued
More memory traffic
Increased execution time

Each replayed instruction occupies an issue slot again before it completes and transfers data again. The result is extra latency, extra work for the SM, and extra memory traffic.

SLIDE 26

CHANGING THE BLOCK LAYOUT

Our blocks are 8x8. We should use blocks of size 32x2 instead: threadIdx.x then runs stride-1 over the uchar pixels, so each warp reads one contiguous row segment. With 8x8 blocks, a warp spans four short rows of the image and overfetches data it does not use.

SLIDE 27

IMPROVED MEMORY ACCESS

Kernel                  Time     Speedup
Original Version        5.233ms  1.00x
Better Memory Accesses  1.589ms  3.29x

Blocks of size 32x2 Memory is used more efficiently

SLIDE 28

Category: Latency Bound – Coalescing
Problem: Memory is accessed inefficiently => high latency
Goal: Reduce #transactions/request to reduce latency
Indicators: Low global load/store efficiency; high #transactions/#request compared to ideal
Strategy: Improve memory coalescing by:

  • Loading cooperatively inside a block
  • Changing the block layout
  • Aligning data
  • Changing the data layout to improve locality

PERF-OPT QUICK REFERENCE CARD

SLIDE 29

Category: Bandwidth Bound – Coalescing
Problem: Too much unused data clogging the memory system
Goal: Reduce traffic, move more useful data per request
Indicators: Low global load/store efficiency; high #transactions/#request compared to ideal
Strategy: Improve memory coalescing by:

  • Loading cooperatively inside a block
  • Changing the block layout
  • Aligning data
  • Changing the data layout to improve locality

PERF-OPT QUICK REFERENCE CARD

SLIDE 30

ITERATION 2

SLIDE 31

gaussian_filter_7x7_v0() still the hotspot

IDENTIFY HOTSPOT

Hotspot

Kernel                  Time     Speedup
Original Version        5.233ms  1.00x
Better Memory Accesses  1.589ms  3.29x

SLIDE 32

IDENTIFY PERFORMANCE LIMITER: Still Latency Bound

SLIDE 33

LOOKING FOR INDICATORS

A lot of idle time

Not enough work inside a thread to hide latency?

SLIDE 34

STALL REASONS: EXECUTION DEPENDENCY

Memory accesses may influence execution dependencies

Global accesses create longer dependencies than shared accesses Read-only/texture dependencies are counted in Texture

Instruction level parallelism can reduce dependencies

a = b + c; // ADD
d = a + e; // ADD (depends on the previous ADD)

a = b[i];  // LOAD
d = a + e; // ADD (depends on the LOAD)

a = b + c; // Independent ADDs
d = e + f;

SLIDE 35

ILP AND MEMORY ACCESSES

#pragma unroll is useful to extract ILP
Manually rewrite code if not a simple loop

No ILP:

float a = 0.0f;
for( int i = 0 ; i < N ; ++i )
    a += logf(b[i]);

2-way ILP (with loop unrolling):

float a, a0 = 0.0f, a1 = 0.0f;
for( int i = 0 ; i < N ; i += 2 ) {
    a0 += logf(b[i]);
    a1 += logf(b[i+1]);
}
a = a0 + a1;

In the unrolled version the a0 and a1 dependency chains are independent, so their loads and logf/add operations can overlap in flight.

SLIDE 36

LOOKING FOR MORE INDICATORS

SLIDE 37

Not enough active warps to hide latencies?

LOOKING FOR MORE INDICATORS

SLIDE 38

LATENCY

GPUs cover latencies by having a lot of work in flight: while one warp waits on a long-latency operation, other warps issue. Latency is fully covered as long as some warp can issue every cycle; when no warp can issue, the latency is exposed.

SLIDE 39

LATENCY: LACK OF OCCUPANCY

Not enough active warps: the schedulers cannot find eligible warps at every cycle, so no warp issues and latency is exposed.

SLIDE 40

IMPROVED OCCUPANCY

Kernel                  Time     Speedup
Original Version        5.233ms  1.00x
Better Memory Accesses  1.589ms  3.29x
Higher Occupancy        1.562ms  3.35x

Bigger blocks of size 32x4 Increases achieved occupancy slightly (from 47.6% to 52.4%)

SLIDE 41

Category: Latency Bound – Occupancy
Problem: Latency is exposed due to low occupancy
Goal: Hide latency behind more parallel work
Indicators: Low occupancy (< 60%); high execution dependency
Strategy: Increase occupancy by:

  • Varying block size
  • Varying shared memory usage
  • Varying register count

PERF-OPT QUICK REFERENCE CARD
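The occupancy reasoning on this card can be sketched as a toy per-SM calculator. The limits below are assumed SM 3.5 (Kepler) values (64 warps, 16 blocks, 65536 registers, 48KB shared memory per SM); the real numbers come from the profiler or the CUDA occupancy calculator.

```c
/* Rough theoretical occupancy for one SM, given a block configuration.
   Occupancy = resident warps / max warps; residency is capped by the
   tightest of the warp, register, shared-memory, and block limits. */
static float occupancy(int threads_per_block, int regs_per_thread,
                       int smem_per_block)
{
    const int max_warps = 64, max_blocks = 16;
    const int regs_per_sm = 65536, smem_per_sm = 49152;

    int warps_per_block = (threads_per_block + 31) / 32;
    int by_warps = max_warps / warps_per_block;
    int by_regs  = regs_per_thread
                 ? regs_per_sm / (regs_per_thread * threads_per_block)
                 : max_blocks;
    int by_smem  = smem_per_block
                 ? smem_per_sm / smem_per_block
                 : max_blocks;

    int blocks = by_warps;
    if (by_regs < blocks)    blocks = by_regs;
    if (by_smem < blocks)    blocks = by_smem;
    if (max_blocks < blocks) blocks = max_blocks;

    return (float)(blocks * warps_per_block) / max_warps;
}
```

For example, 128-thread blocks using 40 registers/thread and 6KB of shared memory are limited by shared memory to 8 resident blocks (32 warps), i.e. 50% theoretical occupancy.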

SLIDE 42

Category: Latency Bound – Instruction Level Parallelism
Problem: Not enough independent work per thread
Goal: Do more parallel work inside single threads
Indicators: High execution dependency; increasing occupancy has little/no positive effect; registers still available
Strategy:

  • Unroll loops (#pragma unroll)
  • Refactor threads to compute n output values at the same time (code duplication)

PERF-OPT QUICK REFERENCE CARD

SLIDE 43

ITERATION 3

SLIDE 44

gaussian_filter_7x7_v0() still the hotspot

IDENTIFY HOTSPOT

Hotspot

Kernel                  Time     Speedup
Original Version        5.233ms  1.00x
Better Memory Accesses  1.589ms  3.29x
Higher Occupancy        1.562ms  3.35x

SLIDE 45

IDENTIFY PERFORMANCE LIMITER: Still Latency Bound

SLIDE 46

LOOKING FOR INDICATORS

Still high execution dependency, but occupancy is OK
SLIDE 47

LOOKING FOR MORE INDICATORS

Is our working set mostly in L2$?
Medium L2 bandwidth utilization
Very low device memory bandwidth utilization

Launch

SLIDE 48

CHECKING L2 HIT RATE: 98.9%

Our working set is mostly in L2$ Can we move it even closer?

SLIDE 49

Adjacent pixels access similar neighbors in Gaussian Filter We should use shared memory to store those common pixels

SHARED MEMORY

__shared__ unsigned char smem_pixels[10][64];
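The payoff of this tile can be estimated with back-of-envelope arithmetic. The sketch assumes 32x4 thread blocks (introduced in the occupancy step) and the 10x64 uchar tile declared above; the helper names are illustrative.

```c
/* Without shared memory, every thread reads its full 7x7 neighborhood
   from global memory (through caches). */
static int global_loads_per_thread_no_smem(void)
{
    return 7 * 7;
}

/* With the smem_pixels[10][64] tile and 32x4 blocks, the block
   cooperatively loads the tile once; each thread fetches only its
   share of the tile from global memory. */
static int global_loads_per_thread_with_smem(void)
{
    const int tile_bytes = 10 * 64;  /* shared tile, uchar pixels */
    const int threads    = 32 * 4;   /* threads per block */
    return tile_bytes / threads;     /* cooperative loads per thread */
}
```

Roughly a 10x reduction in global loads per output pixel, with the reuse served from fast shared memory instead.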

SLIDE 50

SHARED MEMORY

Kernel                  Time     Speedup
Original Version        5.233ms  1.00x
Better Memory Accesses  1.589ms  3.29x
Higher Occupancy        1.562ms  3.35x
Shared Memory           0.911ms  5.74x

Using shared memory for the Gaussian filter: significant speedup, now < 1ms

SLIDE 51

Category: Latency Bound – Shared Memory
Problem: Long memory latencies are difficult to hide
Goal: Reduce latency by moving data to faster memory
Indicators: Shared memory is not the occupancy limiter; high L2 hit rate; data reuse between threads and a small-ish working set
Strategy: (Cooperatively) move data to:

  • Shared Memory
  • (or Registers)
  • (or Constant Memory)
  • (or Texture Cache)

PERF-OPT QUICK REFERENCE CARD

SLIDE 52

Category: Memory Bound – Shared Memory
Problem: Too much data movement
Goal: Reduce the amount of data traffic to/from global memory
Indicators: Higher than expected memory traffic to/from global memory; low arithmetic intensity of the kernel
Strategy: (Cooperatively) move data closer to the SM:

  • Shared Memory
  • (or Registers)
  • (or Constant Memory)
  • (or Texture Cache)

PERF-OPT QUICK REFERENCE CARD

SLIDE 53

ITERATION 4

SLIDE 54

gaussian_filter_7x7_v0() still the hotspot

IDENTIFY HOTSPOT

Hotspot

Kernel                  Time     Speedup
Original Version        5.233ms  1.00x
Better Memory Accesses  1.589ms  3.29x
Higher Occupancy        1.562ms  3.35x
Shared Memory           0.911ms  5.74x

SLIDE 55

IDENTIFY PERFORMANCE LIMITER

Aha! Getting into the high utilization region

SLIDE 56

LOOKING FOR INDICATORS


SLIDE 57

LOOKING FOR MORE INDICATORS

Load/Store Unit is really busy! Can we reduce the load?

SLIDE 58

INSTRUCTION THROUGHPUT

Each SM has 4 schedulers (Kepler). Schedulers issue instructions to pipes.
A scheduler issues up to 2 instructions/cycle; the sustainable peak is 7 instructions/cycle per SM (not 4x2 = 8).
A scheduler issues instructions from a single warp and cannot issue to a pipe if its issue slot is full.

SLIDE 59

INSTRUCTION THROUGHPUT

Three situations to distinguish (scheduler vs pipe utilization, from the profiler):

Schedulers saturated: SM utilization 90%, no single pipe saturated
Schedulers and pipe saturated: SM utilization 92%, one pipe at 90%
Pipe saturated: SM utilization 64%, one pipe at 78%; the pipe limits throughput even though the schedulers are not saturated

SLIDE 60

READ-ONLY CACHE (TEXTURE UNITS)

Loads through the Texture Units skip the Load/Store Unit: the texture units cache read-only loads on the path between the SM (SMEM/L1$, registers) and L2$/global memory (framebuffer), relieving the LSU.

SLIDE 61

READ-ONLY PATH

Annotate read-only parameters with const __restrict.
The compiler generates LDG instructions: 0.808ms.

__global__ void gaussian_filter_7x7_v2(int w, int h, const uchar *__restrict src, uchar *dst)

Kernel                  Time     Speedup
Original Version        5.233ms  1.00x
Better Memory Accesses  1.589ms  3.29x
Higher Occupancy        1.562ms  3.35x
Shared Memory           0.911ms  5.74x
Read-Only Path          0.808ms  6.48x

SLIDE 62

Category: Latency Bound – Texture Cache
Problem: The Load/Store Unit becomes the bottleneck
Goal: Relieve the Load/Store Unit of read-only data
Indicators: High utilization of the Load/Store Unit; pipe-busy stall reason; significant amount of read-only data
Strategy: Load read-only data through the Texture Units:

  • Annotate read-only pointers with const __restrict__
  • Use the __ldg() intrinsic

PERF-OPT QUICK REFERENCE CARD

SLIDE 63

THE RESULT: 6.5X

Looking much better. Things to investigate next:

Reduce computational intensity (separable filter)
Increase instruction level parallelism (process two elements per thread)

The Sobel filter is starting to become the bottleneck.

SLIDE 64

MORE IN OUR COMPANION CODE

Kernel                                                     Time     Speedup
Original Version                                           5.233ms  1.00x
Better Memory Accesses                                     1.589ms  3.29x
Higher Occupancy                                           1.562ms  3.35x
Shared Memory                                              0.911ms  5.74x
Read-Only Path                                             0.808ms  6.48x
Separable Filter                                           0.481ms  10.88x
Process two pixels per thread (memory efficiency + ILP)    0.415ms  12.61x
Use 64-bit shared memory (remove bank conflicts)           0.403ms  12.99x
Use float instead of int (increase instruction throughput) 0.363ms  14.42x
Your next idea!!!

Companion Code: https://github.com/chmaruni/nsight-gtc2015
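The separable-filter entry in the table above exploits the fact that a 7x7 Gaussian mask is the outer product of a 1D kernel with itself, so one 2D pass (49 multiply-adds per pixel) equals a horizontal pass followed by a vertical pass (14 per pixel). A host-side check, with illustrative binomial weights:

```c
#include <math.h>

/* Compare a full 2D 7x7 convolution of one neighborhood against the
   horizontal-then-vertical separable version; returns |difference|. */
static float full_vs_separable_diff(void)
{
    static const float k[7] = { 1.f/64, 6.f/64, 15.f/64, 20.f/64,
                                15.f/64, 6.f/64, 1.f/64 };
    float v[7][7], tmp[7];
    for (int i = 0; i < 7; ++i)
        for (int j = 0; j < 7; ++j)
            v[i][j] = (float)(i * 7 + j);   /* arbitrary neighborhood */

    float full = 0.0f;                      /* one 2D pass */
    for (int i = 0; i < 7; ++i)
        for (int j = 0; j < 7; ++j)
            full += k[i] * k[j] * v[i][j];

    for (int i = 0; i < 7; ++i) {           /* horizontal pass */
        tmp[i] = 0.0f;
        for (int j = 0; j < 7; ++j)
            tmp[i] += k[j] * v[i][j];
    }
    float sep = 0.0f;                       /* vertical pass */
    for (int i = 0; i < 7; ++i)
        sep += k[i] * tmp[i];

    return fabsf(full - sep);
}
```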

SLIDE 65

SUMMARY

SLIDE 66

ITERATIVE OPTIMIZATION WITH NSIGHT EE

Trace the Application Identify the Hotspot and Profile it Identify the Performance Limiter

Memory Bandwidth Instruction Throughput Latency

Look for indicators

Take nvvp guided analysis as a starting point, but don't follow it too closely

Optimize the Code Iterate

SLIDE 67

REFERENCES

Performance Optimization: Programming Guidelines and GPU Architecture Details Behind Them, GTC 2013

http://on-demand.gputechconf.com/gtc/2013/video/S3466-Performance-Optimization-Guidelines-GPU-Architecture-Details.mp4
http://on-demand.gputechconf.com/gtc/2013/presentations/S3466-Programming-Guidelines-GPU-Architecture.pdf

CUDA Best Practices Guide

http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/

Parallel Forall devblog

http://devblogs.nvidia.com/parallelforall/

Upcoming GTC 2015 Sessions:

S5655 CUDA Application Development Life Cycle with Nsight Eclipse Edition (Hands-on lab), Nikita Shulga, Thursday 2pm
S5353+S5376 Memory Bandwidth Bootcamp (and Beyond), Tony Scudiero, Thursday 3:30pm and 5pm

SLIDE 68

NVIDIA REGISTERED DEVELOPER PROGRAMS

Everything you need to develop with NVIDIA products Membership is your first step in establishing a working relationship with NVIDIA Engineering

Exclusive access to pre-releases
Submit bug and feature requests
Stay informed about the latest releases and training opportunities
Access to exclusive downloads
Exclusive activities and special offers
Interact with other developers in the NVIDIA Developer Forums

REGISTER FOR FREE AT: developer.nvidia.com

SLIDE 69

THANK YOU