[PPT] - PERFWORKS A LIBRARY FOR GPU PERFORMANCE ANALYSIS Avinash Baliga, PowerPoint Presentation

SLIDE 1

April 4-7, 2016 | Silicon Valley

Avinash Baliga, NVIDIA Developer Tools Software Architect April 5, 2016 @ 3:00 p.m. Room 211B

PERFWORKS

A LIBRARY FOR GPU PERFORMANCE ANALYSIS

SLIDE 2

2

NVIDIA PerfWorks SDK

New API for collecting performance metrics from NVIDIA GPUs.

Cross-API:

CUDA, OpenGL, OpenGL ES, D3D11, and D3D12

Cross-Platform:

Windows, Linux, Mobile

GPUs:

Kepler, Maxwell, Pascal Tegra, GeForce, Tesla, Quadro

Target Audience:

tools developers, engine developers Successor to the NVIDIA Perfkit SDK (NVPMAPI)

Adds range-based profiling
Supports next-gen APIs featuring multi-threaded GPU work submission

SLIDE 3

3

GPU Counters and Metrics

PerfWorks delivers actionable, high-level metrics, allowing you to recognize top performance limiters quickly and directly.

Raw Counters : elapsed_cycles, time_duration
Metric : average_clock_rate = elapsed_cycles / time_duration

Metric Categories

Cumulative Work : compute warps launched, shaded pixels
Timing : elapsed cycles, duration in nanoseconds
Activity : active, stalled, idle cycles
Throughput : rate of operations, memory transactions, instruction issue, etc.

SLIDE 4

4

“Speed of Light” Metrics

SOL = “Speed of Light” = peak throughput of a given piece of hardware max instructions per cycle, max bytes per cycle, etc. SOL% = achieved throughput, as % of the peak; “how close are you to perfection?” Unit SOL% takes the max across sub-unit SOL%s. SM, partition, sub-partition, ALU Example: the SM SOL% is the max of

Instruction Issue utilization
ALU utilization
Shared memory utilization
Texture/L1 utilization

Image of Maxwell SM sub-partition from NVIDIA GeForce GTX 750 Ti Whitepaper

SLIDE 5

5

Compute Metrics

SM L1 Tex Shared Cache Hit/Miss Utilization Efficiency Instruction Issue-Efficiency Instruction Pipeline Statistics Stall Reasons L2 Device System Cache Hit/Miss Utilization by Op Type Utilization by Client

% utilization % utilization % utilization

Utilization

SLIDE 6

6

Compute Metrics: Compute-Bound

SM L1 Tex Shared High instruction issue utilization High pipeline utilization L2 Device System Medium-low utilization on all other units

SLIDE 7

7

Compute Metrics: Memory-Bound

SM L1 Tex Shared L2 Device System Medium-low utilization in the SM. One of the memory units has reached close to its maximum throughput.

SLIDE 8

8

Compute Metrics: Latency-Bound

SM L1 Tex Shared L2 Device System Stalls High number of pipeline stalls. Medium-low utilization on everything. Same amount of data transferred from both L1 and L2. Or same amount from both L2 and memory.

SLIDE 9

9

Graphics Metrics

IA

(Vertex Fetch)

Vertex Shader Hull Shader Tess Domain Shader Geom Shader Raster Pixel Shader CROP ZROP Front End (decoder) SM (unified shaders) L2 CPU Image L1 Tex System Device XFB

SLIDE 10

10

Range Based Profiling

Previous tools profile one kernel or draw-call at a time: With PerfWorks, you can profile them as a range, allowing for inherent parallelism: Optimizing these 2 cases is very different!

Improving individual duration may increase resource usage per kernel, which can

prevent parallelism or harm parallel execution time.

Ranges can include diverse workloads, and setup cost.

SLIDE 11

11

Multi-Pass Profiling

The hardware has a limited number of physical counters. To collect more than the physical limit, PerfWorks requires the application to deterministically replay the GPU work multiple times. During each replay pass:

the application must make the same GPU calls, with the same range delimiters
a different set of counters is collected

4/11/2016

BeginPass EndPass Range A Range B BeginPass EndPass Range A Range B ctr0 ctr0 ctr1 ctr1

SLIDE 12

12

CUDA Example

{ kernel1<<<1, N, 0, s0>>>(...); kernel2<<<1, N, 0, s1>>>(...); cuLaunchKernel(...); cudaDeviceSynchronize(); }

SLIDE 13

13

CUDA Example

{ NVPA_CUDA_PushRange('A'); kernel1<<<1, N, 0, s0>>>(...); kernel2<<<1, N, 0, s1>>>(...); NVPA_CUDA_PopRange(); NVPA_CUDA_PushRange('B'); cuLaunchKernel(...); NVPA_CUDA_PopRange(); cudaDeviceSynchronize(); } Range ‘A’ Range ‘B’

SLIDE 14

14

CUDA Example

do { cuCtxGetCurrent(&ctx); NVPA_Context_BeginPass(ctx); NVPA_CUDA_PushRange('A'); kernel1<<<1, N, 0, s0>>>(...); kernel2<<<1, N, 0, s1>>>(...); NVPA_CUDA_PopRange(); NVPA_CUDA_PushRange('B'); cuLaunchKernel(...); NVPA_CUDA_PopRange(); NVPA_Context_EndPass(ctx); cudaDeviceSynchronize(); } while ( ! IsDataReady(ctx) ); Replay Pass

SLIDE 15

15

CUDA Example

do { cuCtxGetCurrent(&ctx); NVPA_Context_BeginPass(ctx); NVPA_CUDA_PushRange('A'); kernel1<<<1, N, 0, s0>>>(...); kernel2<<<1, N, 0, s1>>>(...); NVPA_CUDA_PopRange(); NVPA_CUDA_PushRange('B'); cuLaunchKernel(...); NVPA_CUDA_PopRange(); NVPA_Context_EndPass(ctx); cudaDeviceSynchronize(); } while ( ! IsDataReady(ctx) ); Range ‘A’ Range ‘B’ Replay Pass Range IDs gpu__dispatch_count A 2 B 1

SLIDE 16

16

OpenGL Example

do { glContext = wglGetCurrentContext(); NVPA_Context_BeginPass(glContext); NVPA_OpenGL_PushRange('A'); glDrawElements(...); glDrawElements(...); NVPA_OpenGL_PopRange(); NVPA_OpenGL_PushRange('B'); glDrawElements(...); NVPA_OpenGL_PopRange(); NVPA_Context_EndPass(glContext); SwapBuffers(...); } while ( ! IsDataReady(ctx) ); Range ‘A’ Range ‘B’ Replay Pass Range IDs gpu__draw_count A 2 B 1

SLIDE 17

17

D3D12 Example

ID3D12GraphicsCommandList* pCmd = ...; NVPA_Object_PushRange(pCmd, 'A'); pCmd->DrawInstanced(...); pCmd->DrawInstanced(...); NVPA_Object_PopRange(pCmd); NVPA_Object_PushRange(pCmd, 'B'); pCmd->DrawInstanced(...); NVPA_Object_PopRange(pCmd); Range ‘A’ ID3D12CommandQueue* pQueue = ...; NVPA_Context_BeginPass(pQueue); NVPA_Object_PushRange(pQueue, 'F'); pQueue->ExecuteCommandLists(1, &pCmd); NVPA_Object_PopRange(pQueue); NVPA_Context_EndPass(pQueue); pSwapChain->Present(0, 0); Range ‘F’ Replay Pass Range ‘B’ Prebake draw calls into a CommandList. Submit rendering work.

SLIDE 18

18

D3D12 Metric Data

This example produces nested ranges. The CommandList ranges {A, B} are nested under the Queue range ‘F’. Deterministic counters like draw count or shaded pixels will sum perfectly. Activity and throughput are NOT summable, due to parallel execution.

Range IDs gpu__draw_count gpu__time_duration F 3 800 usec F .A 2 700 usec F .B 1 500 usec

SLIDE 19

19

NVIDIA Nsight Range Profiler

The new Range Profiler in the Nsight VSE Graphics Debugger allows you to define ranges by performance markers, render targets, shader programs, etc. This lets you see an overview of performance first, before drilling down into details. Every requested metric is re-collected per range.

Image from NVIDIA Nsight VSE, showing perf markers from Unreal Engine 4 demo

SLIDE 20

20

Future: NVIDIA Developer Tools

NVIDIA Developer Tools are moving to PerfWorks.

Nsight Visual Studio Edition : new Graphics Range Profiler, Analysis CUDA Profiler
CUDA Profiler Suite : CUDA Visual Profiler, nvprof

Consistent metrics across tools and APIs. Bringing CUDA profiler features to OpenGL and D3D tools.

SLIDE 21

21

Future: NVIDIA Developer Tools

SLIDE 22

22

Future: PerfWorks SDK

Source-level counters for compute and graphics shaders. GPU shader PC sampling, as in the Visual Profiler. Lower overhead, realtime counters – usable for perf stats in a HUD. Frequency-based sampling of GPU counters. GPU workload trace – events that produce an execution timeline.

4/11/2016

SLIDE 23

April 4-7, 2016 | Silicon Valley

THANK YOU

JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join SEND QUESTIONS TO devtools-support@nvidia.com

SLIDE 24

24

BACKUP SLIDES...

SLIDE 25

25

D3D11 Sample

ID3D11DeviceContext* pContext = ...; NVPA_Context_BeginPass(pContext); NVPA_Object_PushRange(pContext, 'A'); pContext->DrawElements(...); pContext->DrawElements(...); NVPA_Object_PopRange(pContext); NVPA_Object_PushRange(pContext, 'B'); pContext->DrawElements(...); NVPA_Object_PopRange(pContext); NVPA_Context_EndPass(pContext); pSwapChain->Present(0, 0); Range ‘A’ Range ‘B’ Replay Pass