Optimizing GPU-accelerated Applications with HPCToolkit
Keren Zhou and John Mellor-Crummey Department of Computer Science Rice University
7/28/2019 1
Problems with Existing Tools
OpenMP Target, Kokkos, and RAJA generate sophisticated code.
Hidden info: NVCC generates a loop at the end of every function. [Figure: calling context with loop, call, and kernel-launch nodes.]
HPCToolkit measures kernel launches, memory copies, and synchronizations in large-scale applications and attributes them to the corresponding worker threads.
Workflow: OMPT/CUPTI collect GPU measurements; nvdisasm analyzes cubins; the instruction mix is associated with source lines; an approximate GPU calling context tree is built with instruction-mix metrics.
A worker thread T creates a correlation record when it launches a kernel and tags the kernel with a correlation ID C, notifying the monitor thread that C belongs to T. The monitor thread gathers measurements associated with C and communicates measurement records back to thread T.
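The worker/monitor correlation protocol above can be sketched as a producer/consumer exchange. This is an illustrative sketch only, not HPCToolkit's actual API: the names (`CorrelationRecord`, `launch_kernel`, `monitor_loop`) are hypothetical.

```python
import itertools
import queue

# Hypothetical sketch of the worker/monitor correlation protocol: a worker
# tags each kernel launch with a fresh correlation ID C and tells the monitor
# that C belongs to it; the monitor routes GPU measurements for C back to
# the launching worker.

_next_id = itertools.count(1)

class CorrelationRecord:
    def __init__(self, thread_id, calling_context):
        self.correlation_id = next(_next_id)    # the ID "C"
        self.thread_id = thread_id              # the worker thread "T"
        self.calling_context = calling_context  # CPU call path at launch

correlation_q = queue.Queue()   # worker -> monitor: "C belongs to T"
measurement_qs = {}             # per-worker queues: monitor -> worker

def launch_kernel(thread_id, calling_context):
    """Worker side: tag a kernel launch with a fresh correlation ID."""
    rec = CorrelationRecord(thread_id, calling_context)
    correlation_q.put(rec)
    return rec.correlation_id

def monitor_loop(gpu_measurements):
    """Monitor side: route GPU measurements back to the launching worker.
    gpu_measurements maps correlation IDs to measurement lists; a None
    record on the queue is a shutdown sentinel."""
    while True:
        rec = correlation_q.get()
        if rec is None:
            break
        for m in gpu_measurements.get(rec.correlation_id, []):
            measurement_qs[rec.thread_id].put((rec.correlation_id, m))
```

In the real tool the monitor consumes records asynchronously; running it after a sentinel here just keeps the sketch single-threaded.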
Push the call-path chain onto the thread's private stack.
[Figure: for the call path main → f0 → cudaLaunchKernel → pc0, f1, pc1, pc2, a worker thread produces correlation groups that the monitor thread consumes, and the monitor produces measurement groups that the worker consumes, preserving program order.]
[Figure: call stacks from successive samples, each an instruction pointer above a chain of return addresses, with marks identifying the prefix shared with the previous unwind.]

Eager LCA (Arnold & Sweeney, IBM TR, 1999): frames are marked while unwinding, and returning from a frame clears its mark; on the next unwind, the still-marked frames are the common prefix, so only unmarked frame RAs need to be recorded.

Lazy LCA (Froyd et al., ICS '05): returning from a marked frame moves the mark; on the next unwind, the mark indicates the common prefix, so only unmarked frames need to be recorded.
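Both marking schemes exist to find the common prefix with the previous unwind cheaply, so the profiler only records the frames below it. The sketch below shows the effect of that optimization, not the marking mechanics themselves (a real unwinder walks innermost-first and uses the marks to stop early); the function names are illustrative.

```python
# Illustrative sketch (not HPCToolkit's implementation): once the common
# prefix with the previous sample's call stack is known, only the frames
# below it must be unwound and inserted into the calling context tree.

def common_prefix_len(prev_stack, cur_stack):
    """Length of the shared outermost prefix of two call stacks,
    each given as a list of frames from outermost to innermost."""
    n = 0
    for a, b in zip(prev_stack, cur_stack):
        if a != b:
            break
        n += 1
    return n

def incremental_unwind(prev_stack, cur_stack):
    """Split the current stack into (reused, newly_recorded) frames."""
    n = common_prefix_len(prev_stack, cur_stack)
    return cur_stack[:n], cur_stack[n:]
```

For example, if the previous sample saw main → f0 → g and the current one sees main → f0 → h → k, only h and k need to be recorded.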
CPU importance:

max((CPUTIME − SUM(GPU_APITIME)) / CPUTIME, 0) × (CPUTIME / EXECUTIONTIME)

The second factor is the ratio of a procedure's time to the whole execution time; the first factor is the ratio of the procedure's pure CPU time. If more time is spent on the GPU than on the CPU, the ratio is set to 0.
Memory copy importance:

GPU_APITIME / SUM(GPU_APITIME)

Considers the importance of a memory copy relative to all GPU time.
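The two importance metrics above are straightforward ratios; a minimal sketch, assuming all times are in the same unit and the function names are illustrative:

```python
def cpu_importance(cputime, gpu_apitime_sum, execution_time):
    """max((CPUTIME - SUM(GPU_APITIME)) / CPUTIME, 0) * CPUTIME / EXECUTIONTIME"""
    # Fraction of the procedure's time that is pure CPU work; clamped to 0
    # when more time is spent on the GPU than on the CPU.
    pure_cpu_ratio = max((cputime - gpu_apitime_sum) / cputime, 0.0)
    # Weight by the procedure's share of the whole execution time.
    return pure_cpu_ratio * cputime / execution_time

def gpu_copy_importance(gpu_apitime, gpu_apitime_sum):
    """GPU_APITIME / SUM(GPU_APITIME): one copy's share of all GPU API time."""
    return gpu_apitime / gpu_apitime_sum
```

A procedure with 4s of CPU time, of which 1s is GPU API time, in a 10s run scores max(3/4, 0) × 4/10 = 0.3; a GPU-dominated procedure scores 0.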
[Figure (animated over several slides): a GPU control flow graph with basic blocks A (0x10), B (0x70), C (0x80), D, and E (0x30). A depth-first walk assigns discovery numbers and on-stack flags; a back edge reveals a strongly connected component (SCC) containing the loop, and the SCC is collapsed into a new node E' (SCC') so the graph becomes acyclic.]
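The SCC collapse illustrated above is a standard application of Tarjan's algorithm. A compact sketch follows; the example CFG is a guess at the slide's figure, not taken from the talk, and node names are illustrative.

```python
# Tarjan's strongly-connected-components algorithm: SCCs in a control flow
# graph correspond to loops and can each be collapsed into a single node.

def tarjan_sccs(graph):
    """graph: dict mapping node -> list of successors. Returns list of SCCs
    (each a set of nodes), in reverse topological order of the condensation."""
    index = {}       # discovery order of each node
    lowlink = {}     # smallest discovery index reachable from the node
    on_stack = set()
    stack, sccs = [], []
    counter = [0]

    def strongconnect(v):
        index[v] = lowlink[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                strongconnect(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])  # back/cross edge
        if lowlink[v] == index[v]:  # v is the root of an SCC
            scc = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.add(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return sccs
```

On a CFG where B and C form a loop (A → B → C → B, C → D, B → E), the algorithm reports {B, C} as one SCC, which can then be collapsed into a single node.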
Memory copies accounted for about 60% of execution time; using pinned memory helped reduce the copy cost.
Memory requests do not always hit cache → use shared memory (+shared memory).
Global memory reads are not scheduled properly to hide latency → reorder global memory reads (+reorder global memory read).
Integer division is expensive → precompute the reciprocal to simplify division (+precompute reciprocal).
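The "+precompute reciprocal" step above is the classic strength-reduction trick: when many values are divided by the same denominator, computing the reciprocal once replaces each costly division with a cheap multiplication. A hypothetical sketch of the idea (the real optimization is applied inside a CUDA kernel; this Python version only shows the arithmetic transformation):

```python
# Strength reduction: divide once, multiply many times. The function names
# are illustrative, not from the talk's code.

def scale_naive(values, d):
    """One division per element."""
    return [v / d for v in values]

def scale_reciprocal(values, d):
    """Precompute 1/d, then multiply per element."""
    inv_d = 1.0 / d          # single (expensive) division
    return [v * inv_d for v in values]
```

Note that on real hardware the reciprocal version can differ from exact division in the last bit, so it is only safe when that rounding is acceptable.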
[Chart: DP GFLOPS (0–2000) for the baseline, +shared, +reorder, and +reciprocal variants, comparing versions tuned with PC sampling, Nsight, and by hand.]
[Roofline chart: DP GFLOPS vs. arithmetic intensity (DP FLOP/byte, 1/4 to 8192); peak performance 7065 GFLOPS, theoretical performance 1890 GFLOPS, achieved performance 1573 GFLOPS.]

AI = (12N⁴E + 15N³E) / (8N³E × 8) ≈ 2.1

E: number of elements; N: tensor dimension.
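The arithmetic-intensity formula above can be checked numerically. E cancels out of the ratio, so AI = (12N + 15)/64; the slide does not state N, but N = 10 is an assumed example value that reproduces the quoted AI ≈ 2.1.

```python
# Numeric check of AI = (12*N^4*E + 15*N^3*E) / (8*N^3*E * 8).
# N = 10 below is an assumed example, not stated on the slide.

def arithmetic_intensity(N, E=1):
    flops = 12 * N**4 * E + 15 * N**3 * E   # DP floating-point operations
    bytes_moved = 8 * N**3 * E * 8          # N^3*E values x 8-byte doubles, x8
    return flops / bytes_moved
```

For N = 10 this gives 135/64 ≈ 2.11, consistent with the slide's ≈ 2.1, and the result is independent of E.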