April 4-7, 2016 | Silicon Valley
Avinash Baliga, NVIDIA Developer Tools Software Architect April 5, 2016 @ 3:00 p.m. Room 211B
PERFWORKS A LIBRARY FOR GPU PERFORMANCE ANALYSIS Avinash Baliga, - - PowerPoint PPT Presentation
April 4-7, 2016 | Silicon Valley PERFWORKS A LIBRARY FOR GPU PERFORMANCE ANALYSIS Avinash Baliga, NVIDIA Developer Tools Software Architect April 5, 2016 @ 3:00 p.m. Room 211B NVIDIA PerfWorks SDK New API for collecting performance metrics
April 4-7, 2016 | Silicon Valley
Avinash Baliga, NVIDIA Developer Tools Software Architect April 5, 2016 @ 3:00 p.m. Room 211B
2
New API for collecting performance metrics from NVIDIA GPUs.
CUDA, OpenGL, OpenGL ES, D3D11, and D3D12
Windows, Linux, Mobile
Kepler, Maxwell, Pascal Tegra, GeForce, Tesla, Quadro
tools developers, engine developers Successor to the NVIDIA Perfkit SDK (NVPMAPI)
3
PerfWorks delivers actionable, high-level metrics, allowing you to recognize top performance limiters quickly and directly.
Metric Categories
4
SOL = “Speed of Light” = peak throughput of a given piece of hardware max instructions per cycle, max bytes per cycle, etc. SOL% = achieved throughput, as % of the peak; “how close are you to perfection?” Unit SOL% takes the max across sub-unit SOL%s. SM, partition, sub-partition, ALU Example: the SM SOL% is the max of
Image of Maxwell SM sub-partition from NVIDIA GeForce GTX 750 Ti Whitepaper
5
SM L1 Tex Shared Cache Hit/Miss Utilization Efficiency Instruction Issue-Efficiency Instruction Pipeline Statistics Stall Reasons L2 Device System Cache Hit/Miss Utilization by Op Type Utilization by Client
% utilization % utilization % utilization
Utilization
6
SM L1 Tex Shared High instruction issue utilization High pipeline utilization L2 Device System Medium-low utilization on all other units
7
SM L1 Tex Shared L2 Device System Medium-low utilization in the SM. One of the memory units has reached close to its maximum throughput.
8
SM L1 Tex Shared L2 Device System Stalls High number of pipeline stalls. Medium-low utilization on everything. Same amount of data transferred from both L1 and L2. Or same amount from both L2 and memory.
9
IA
(Vertex Fetch)
Vertex Shader Hull Shader Tess Domain Shader Geom Shader Raster Pixel Shader CROP ZROP Front End (decoder) SM (unified shaders) L2 CPU Image L1 Tex System Device XFB
10
Previous tools profile one kernel or draw-call at a time: With PerfWorks, you can profile them as a range, allowing for inherent parallelism: Optimizing these 2 cases is very different!
prevent parallelism or harm parallel execution time.
11
The hardware has a limited number of physical counters. To collect more than the physical limit, PerfWorks requires the application to deterministically replay the GPU work multiple times. During each replay pass:
4/11/2016
BeginPass EndPass Range A Range B BeginPass EndPass Range A Range B ctr0 ctr0 ctr1 ctr1
12
{ kernel1<<<1, N, 0, s0>>>(...); kernel2<<<1, N, 0, s1>>>(...); cuLaunchKernel(...); cudaDeviceSynchronize(); }
13
{ NVPA_CUDA_PushRange('A'); kernel1<<<1, N, 0, s0>>>(...); kernel2<<<1, N, 0, s1>>>(...); NVPA_CUDA_PopRange(); NVPA_CUDA_PushRange('B'); cuLaunchKernel(...); NVPA_CUDA_PopRange(); cudaDeviceSynchronize(); } Range ‘A’ Range ‘B’
14
do { cuCtxGetCurrent(&ctx); NVPA_Context_BeginPass(ctx); NVPA_CUDA_PushRange('A'); kernel1<<<1, N, 0, s0>>>(...); kernel2<<<1, N, 0, s1>>>(...); NVPA_CUDA_PopRange(); NVPA_CUDA_PushRange('B'); cuLaunchKernel(...); NVPA_CUDA_PopRange(); NVPA_Context_EndPass(ctx); cudaDeviceSynchronize(); } while ( ! IsDataReady(ctx) ); Replay Pass
15
do { cuCtxGetCurrent(&ctx); NVPA_Context_BeginPass(ctx); NVPA_CUDA_PushRange('A'); kernel1<<<1, N, 0, s0>>>(...); kernel2<<<1, N, 0, s1>>>(...); NVPA_CUDA_PopRange(); NVPA_CUDA_PushRange('B'); cuLaunchKernel(...); NVPA_CUDA_PopRange(); NVPA_Context_EndPass(ctx); cudaDeviceSynchronize(); } while ( ! IsDataReady(ctx) ); Range ‘A’ Range ‘B’ Replay Pass Range IDs gpu__dispatch_count A 2 B 1
16
do { glContext = wglGetCurrentContext(); NVPA_Context_BeginPass(glContext); NVPA_OpenGL_PushRange('A'); glDrawElements(...); glDrawElements(...); NVPA_OpenGL_PopRange(); NVPA_OpenGL_PushRange('B'); glDrawElements(...); NVPA_OpenGL_PopRange(); NVPA_Context_EndPass(glContext); SwapBuffers(...); } while ( ! IsDataReady(ctx) ); Range ‘A’ Range ‘B’ Replay Pass Range IDs gpu__draw_count A 2 B 1
17
ID3D12GraphicsCommandList* pCmd = ...; NVPA_Object_PushRange(pCmd, 'A'); pCmd->DrawInstanced(...); pCmd->DrawInstanced(...); NVPA_Object_PopRange(pCmd); NVPA_Object_PushRange(pCmd, 'B'); pCmd->DrawInstanced(...); NVPA_Object_PopRange(pCmd); Range ‘A’ ID3D12CommandQueue* pQueue = ...; NVPA_Context_BeginPass(pQueue); NVPA_Object_PushRange(pQueue, 'F'); pQueue->ExecuteCommandLists(1, &pCmd); NVPA_Object_PopRange(pQueue); NVPA_Context_EndPass(pQueue); pSwapChain->Present(0, 0); Range ‘F’ Replay Pass Range ‘B’ Prebake draw calls into a CommandList. Submit rendering work.
18
This example produces nested ranges. The CommandList ranges {A, B} are nested under the Queue range ‘F’. Deterministic counters like draw count or shaded pixels will sum perfectly. Activity and throughput are NOT summable, due to parallel execution.
Range IDs gpu__draw_count gpu__time_duration F 3 800 usec F .A 2 700 usec F .B 1 500 usec
19
The new Range Profiler in the Nsight VSE Graphics Debugger allows you to define ranges by performance markers, render targets, shader programs, etc. This lets you see an overview of performance first, before drilling down into details. Every requested metric is re-collected per range.
Image from NVIDIA Nsight VSE, showing perf markers from Unreal Engine 4 demo
20
NVIDIA Developer Tools are moving to PerfWorks.
Consistent metrics across tools and APIs. Bringing CUDA profiler features to OpenGL and D3D tools.
21
22
Source-level counters for compute and graphics shaders. GPU shader PC sampling, as in the Visual Profiler. Lower overhead, realtime counters – usable for perf stats in a HUD. Frequency-based sampling of GPU counters. GPU workload trace – events that produce an execution timeline.
4/11/2016
April 4-7, 2016 | Silicon Valley
JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join SEND QUESTIONS TO devtools-support@nvidia.com
24
25
ID3D11DeviceContext* pContext = ...; NVPA_Context_BeginPass(pContext); NVPA_Object_PushRange(pContext, 'A'); pContext->DrawElements(...); pContext->DrawElements(...); NVPA_Object_PopRange(pContext); NVPA_Object_PushRange(pContext, 'B'); pContext->DrawElements(...); NVPA_Object_PopRange(pContext); NVPA_Context_EndPass(pContext); pSwapChain->Present(0, 0); Range ‘A’ Range ‘B’ Replay Pass