

  1. Understanding the Performance of GPGPU Applications from a Data-Centric View
     Hui Zhang (w.hzhang86@samsung.com)
     Jeffrey K. Hollingsworth (hollings@umd.edu)
     SC’19 - ProTools’19, 11/17/19

  2. Motivation
  • It’s hard for programmers to write efficient code on highly parallel and heterogeneous architectures
  • There are few performance tools for CUDA users that can locate inefficient source code and guide user-level optimizations
  • The traditional code-centric profiling approach is insufficient for investigating data-placement issues

  3. Contributions
  • First, the tool offers fine-grained, in-depth performance analysis of kernel execution, giving programmers finer insight into GPU kernels.
  • Second, the tool uses a data-centric performance analysis technique to map performance data back to variables in the source code.
  • Third, it proposes a method to obtain complete calling-context profiles, including the CPU call stack before a kernel is launched and the GPU call stack within the kernel.

  4. CUDA Programming Overview
  • __global__ stencil_1d(): a kernel entry point, launchable from the host
  • __device__ stencil_helper(): a helper function, callable only from device code
  * Picture obtained from NVIDIA: https://www.nvidia.com/docs/IO/116711/sc11-cuda-c-basics.pdf
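  A minimal sketch in the spirit of the cited NVIDIA deck (the function bodies and RADIUS are my assumptions; the deck's full example also stages data in shared memory). It shows the two qualifiers named on the slide: a __global__ kernel that calls a __device__ helper.

    // Sketch only: a 1D stencil where each output is the sum of a
    // 2*RADIUS+1 neighborhood. Bodies are assumed, not taken from the slide.
    #include <cuda_runtime.h>
    #define RADIUS 3

    __device__ int stencil_helper(const int *in, int i) {
        int result = 0;
        for (int off = -RADIUS; off <= RADIUS; off++)
            result += in[i + off];
        return result;
    }

    __global__ void stencil_1d(const int *in, int *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x + RADIUS;
        out[i] = stencil_helper(in, i);   // device-to-device call
    }

    int main(void) {
        const int N = 1024;
        int *in, *out;
        cudaMallocManaged(&in,  (N + 2 * RADIUS) * sizeof(int));
        cudaMallocManaged(&out, (N + 2 * RADIUS) * sizeof(int));
        for (int i = 0; i < N + 2 * RADIUS; i++) in[i] = 1;
        stencil_1d<<<N / 256, 256>>>(in, out);    // host-side kernel launch
        cudaDeviceSynchronize();
        return 0;
    }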

  5. Data-centric vs. Code-centric Profiling

    // Reconstructed from the slide (declarations added so the sketch
    // compiles; the loop starts at 1 to keep C[i-1] in bounds).
    enum { N = 100 };
    int A[N], B[N], C[N];
    int complex(void);   // stands in for some expensive computation

    int busy(int *x) {   // hotspot function
        *x = complex();
        return *x;
    }

    int main() {
        for (int i = 1; i < N - 1; i++)
            A[i] = busy(&B[i]) + busy(&C[i-1]) + busy(&C[i+1]);
    }

  • Code-centric profiling: main: 100%, busy: 100%, complex: 100%
  • Data-centric profiling: A: 100%, B: 33.3%, C: 66.7%
    (every result is stored into A; one of the three busy() calls writes into B, the other two write into C)

  6. Properly Assign Blame
  “I didn’t say you were to blame… I said I am blaming you.”

  7. CUDABlamer Framework
  • Static Analysis
    o Data-flow analysis
    o Control-flow analysis
    o Intra-procedural blame analysis
    o Exit-variable analysis
  • Monitored Execution
    o CUPTI Callback API: tag each kernel invocation
    o Libunwind: CPU stack unwinding
    o CUPTI Activity API: GPU kernel sampling
  • Postmortem Process
    o Process runtime information
    o Reconstruct the CPU & GPU calling context
    o Inter-procedural blame analysis
    o Determine blame attribution to variables/functions
  • GUI Presentation
    o Data-centric profiling result
    o Code-centric profiling result
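  To make the monitored-execution phase concrete, here is a hypothetical sketch (not CUDABlamer's actual code) of tagging kernel launches with CUPTI's correlation ID and unwinding the CPU stack with libunwind, assuming the standard CUPTI Callback API; the Activity API side would separately enable PC sampling (e.g., via cuptiActivityEnable with the PC_SAMPLING activity kind).

    #define UNW_LOCAL_ONLY
    #include <libunwind.h>
    #include <cupti.h>
    #include <stdio.h>

    // On entry to cudaLaunchKernel, record the correlation ID that will
    // also appear in the GPU activity records, then walk the CPU frames.
    static void CUPTIAPI on_runtime_api(void *userdata, CUpti_CallbackDomain domain,
                                        CUpti_CallbackId cbid, const void *cbdata) {
        const CUpti_CallbackData *cb = (const CUpti_CallbackData *)cbdata;
        if (cb->callbackSite != CUPTI_API_ENTER ||
            cbid != CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernel_v7000)
            return;

        printf("kernel launch, correlationId=%u\n", cb->correlationId);

        unw_context_t ctx;
        unw_cursor_t cursor;
        unw_getcontext(&ctx);                 // capture the current CPU context
        unw_init_local(&cursor, &ctx);
        while (unw_step(&cursor) > 0) {       // walk CPU frames outward
            unw_word_t ip;
            unw_get_reg(&cursor, UNW_REG_IP, &ip);
            printf("  frame ip=0x%lx\n", (unsigned long)ip);
        }
    }

    void start_monitoring(void) {
        CUpti_SubscriberHandle sub;
        cuptiSubscribe(&sub, (CUpti_CallbackFunc)on_runtime_api, NULL);
        cuptiEnableDomain(1, sub, CUPTI_CB_DOMAIN_RUNTIME_API);
    }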

  8. CUDABlamer – Static Analysis
  • Graphical representation to resolve blame relations, e.g.:

        var a : int = 6;
        var b : int = 7;
        var c : int = a + b;

  • Resolve composite LLVM instructions to propagate blame hierarchically
    [Figure: (a) normal GEP instruction vs. (b) composite GEP instruction]
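  For intuition, here is my own illustration (not the slide's figure) of where the two GEP forms can arise, under the assumption that "composite GEP" refers to a getelementptr constant expression embedded inside another instruction rather than a standalone instruction; the IR in the comments is abbreviated clang output without optimization.

    // Blame analysis must look inside composite GEPs to attribute the
    // access to G, since no separate GEP instruction exists for them.
    int G[10];

    int runtime_index(int i) {
        return G[i];
        // %p = getelementptr [10 x i32], [10 x i32]* @G, i64 0, i64 %i  ; normal GEP
        // %v = load i32, i32* %p
    }

    int constant_index(void) {
        return G[4];
        // %v = load i32, i32* getelementptr inbounds
        //        ([10 x i32], [10 x i32]* @G, i64 0, i64 4)             ; composite GEP
    }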

  9. CUDABlamer – Postmortem Process
  Construct the calling context for the CPU-GPU hybrid model:
  • CPU stack: keep the call stack with the kernel launch ID (correlationID)
  • GPU stack for kernel execution: find all paths from the sample point to the kernel entry using depth-first search (top & bottom node info from the Activity API)
  • Reconstruct the calling context: connect the CPU & GPU stacks through the correlationID

  Example (the sketch after this slide enumerates the paths):

    __global__ void kernelFunc(...) {
        foo(); ...
        bar(); ...
    }
    __device__ void foo() {
        bar(); ...
        x = 1; ...        // Sample 1
        y = 2; ...        // Sample 2
    }
    __device__ void bar() {
        A[i] = B[i] * s;  // Sample 3
    }

  Ambiguity: there are 2 possible call paths from the sample point (Sample 3) back to kernelFunc: kernelFunc → bar and kernelFunc → foo → bar.
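  A sketch of the path-enumeration step, assuming an acyclic static call graph (no recursion); Graph, dfs, and the string node names are mine for illustration, not CUDABlamer's data structures. Run on the example above, it finds exactly the two ambiguous paths.

    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    // Enumerate every call path from the kernel entry (top node) to the
    // sampled function (bottom node) by depth-first search.
    using Graph = std::map<std::string, std::vector<std::string>>;

    void dfs(const Graph &g, const std::string &node, const std::string &target,
             std::vector<std::string> &path,
             std::vector<std::vector<std::string>> &out) {
        path.push_back(node);
        if (node == target)
            out.push_back(path);                      // record a complete path
        else if (auto it = g.find(node); it != g.end())
            for (const auto &callee : it->second)
                dfs(g, callee, target, path, out);
        path.pop_back();
    }

    int main() {
        Graph g = {{"kernelFunc", {"foo", "bar"}}, {"foo", {"bar"}}};
        std::vector<std::string> path;
        std::vector<std::vector<std::string>> paths;
        dfs(g, "kernelFunc", "bar", path, paths);
        printf("%zu call paths found\n", paths.size());  // prints: 2 call paths found
        return 0;
    }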

  10. Precision Evaluation
  • Coverage metric, evaluated across the SHOC benchmarks (Stencil2D, BFS, Sort, MD5Hash, Neuralnet, Reduction, Scan, Triad) and the Rodinia benchmarks (pathFinder, cfd, hotspot, gaussian, heartwall, nn, particlefilter, streamcluster)
  [Bar chart: per-benchmark coverage, y-axis 0%–100%; one bar is labeled 68.01%]

  11. Tool Evaluation – Particlefilter
  • Single node: 2 NVIDIA Tesla P100 GPUs; each P100 has 16 GB of memory and 56 streaming multiprocessors (SMs); each SM has 64 KB of shared memory; the GPU also provides 48 KB of constant memory
  • Compilers: nvcc 8.0, gcc 4.8.5, and clang 4.0.1

    Variable        Type     Context                     Blame
    ye/xe           double   main.particleFilter         100%
    arrayX/arrayY   *double  main.particleFilter         100%
    xj              *double  main.particleFilter         97.9%
    yj              *double  main.particleFilter         97.8%
    xj_GPU          *double  main.particleFilter         97.9%
    yj_GPU          *double  main.particleFilter         97.8%
    index           int      main.particleFilter.kernel  95.7%

  12. Tool Evaluation – Particlefilter
  • Optimization: use constant memory for the read-only variables arrayX_GPU, arrayY_GPU, u_GPU, and CDF_GPU (a sketch follows)
  • Kernel execution time drops from 163.1 ms to 3.5 ms, a 46.6x speedup
  [Bar chart: original vs. optimized kernel execution time]
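  A sketch of this change, with the array names taken from the slide but a hypothetical compile-time size NP (the real benchmark sizes these at run time). Read-only arrays move from global-memory pointers into __constant__ memory, which is cached and efficient when threads in a warp read the same element.

    #include <cuda_runtime.h>
    #define NP 1024   // assumed particle count; all four arrays must fit
                      // within the device's constant-memory budget

    __constant__ double arrayX_GPU[NP];
    __constant__ double arrayY_GPU[NP];
    __constant__ double u_GPU[NP];
    __constant__ double CDF_GPU[NP];

    // Host-side upload replaces cudaMemcpy to device pointers.
    void upload(const double *x, const double *y,
                const double *u, const double *cdf) {
        cudaMemcpyToSymbol(arrayX_GPU, x,   NP * sizeof(double));
        cudaMemcpyToSymbol(arrayY_GPU, y,   NP * sizeof(double));
        cudaMemcpyToSymbol(u_GPU,      u,   NP * sizeof(double));
        cudaMemcpyToSymbol(CDF_GPU,    cdf, NP * sizeof(double));
    }

  Kernels then read the __constant__ arrays directly instead of taking them as pointer parameters.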

  13. Tool Evaluation – Gesummv
  • Gesummv is part of the Polybench test suite and has a kernel that does scalar, vector, and matrix multiplication

    Variable         Type    Context           Blame
    y_outputFromGpu  *float  main              100%
    y_gpu            *float  main.gesummvCuda  100%
    tmp_gpu          *float  main.gesummvCuda  52.1%
    j                int     gesummv_kernel    4.3%
    A_gpu/B_gpu      *float  main.gesummvCuda  1.2%
    x_gpu            *float  main.gesummvCuda  1.2%
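  For context, a minimal sketch of a gesummv-style kernel based on the description above (the actual Polybench kernel may differ in details); one thread handles one row. Note that tmp_gpu[i] and y_gpu[i] live in global memory and are updated on every iteration of the j loop, which is the pattern the blame table points at.

    __global__ void gesummv_kernel(int n, float alpha, float beta,
                                   const float *A_gpu, const float *B_gpu,
                                   const float *x_gpu,
                                   float *tmp_gpu, float *y_gpu) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            tmp_gpu[i] = 0.0f;
            y_gpu[i]   = 0.0f;
            for (int j = 0; j < n; j++) {
                tmp_gpu[i] += A_gpu[i * n + j] * x_gpu[j];  // global-memory update per iteration
                y_gpu[i]   += B_gpu[i * n + j] * x_gpu[j];  // same costly pattern
            }
            y_gpu[i] = alpha * tmp_gpu[i] + beta * y_gpu[i];
        }
    }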

  14. Tool Evaluation – Gesummv
  • Optimization: y_gpu is allocated in global memory, and updating it iteratively is costly. Use temporary variables to hold the intermediate results in the for loop, and assign the final value to the corresponding array element once at the end (see the sketch below)
  • Kernel execution time drops from 14.82 ms to 9.95 ms, a 1.5x speedup
  [Bar chart: original vs. optimized kernel execution time]
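  A sketch of the fix, continuing the assumed kernel body from the previous slide: the accumulators become registers, and each thread performs a single global store at the end.

    __global__ void gesummv_kernel_opt(int n, float alpha, float beta,
                                       const float *A_gpu, const float *B_gpu,
                                       const float *x_gpu, float *y_gpu) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float tmp = 0.0f, y = 0.0f;          // registers, not global memory
            for (int j = 0; j < n; j++) {
                tmp += A_gpu[i * n + j] * x_gpu[j];
                y   += B_gpu[i * n + j] * x_gpu[j];
            }
            y_gpu[i] = alpha * tmp + beta * y;   // single global store per thread
        }
    }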

  15. Tool Evaluation – Gramschm

  Data-centric:
    Variable         Type    Context               Blame
    A_outputFromGpu  *float  main                  99.1%
    A_gpu            *float  main.gramschmidtCuda  99.1%
    R_gpu            *float  main.gramschmidtCuda  60.6%
    nrm              float   main.gramschmidtCuda  19.5%
    i                int     gramschmidt_kernel3   6.7%
    Q_gpu            *float  main.gramschmidtCuda  2.8%

  Code-centric:
    Function             Scope  Blame
    main                 CPU    100%
    gramschmidtCuda      CPU    100%
    gramschmidt_kernel3  GPU    78.2%
    gramschmidt_kernel1  GPU    19.9%
    gramschmidt_kernel2  GPU    1.9%

  16. Tool Evaluation – Gramschm
  • Optimization (a sketch follows):
    o R_gpu: use a temporary variable to hold the incremental value of R_gpu and do a one-time assignment after the loop
    o Q_gpu: use shared memory instead of global memory to store a per-block copy of it, and change the column-based access to row-based access
  • Kernel execution time drops from 2.89 ms to 0.51 ms, a 5.7x speedup
  [Bar chart: original vs. optimized kernel execution time]
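  An illustrative sketch of both ideas (my example, not the paper's exact kernel): each block stages a tile of a Q column in shared memory once with a coalesced, row-based load, every thread then reuses the staged values, and the R accumulation happens in a register with one global store after the loop.

    #define TILE 256   // launch with blockDim.x == TILE

    __global__ void project_onto_q(int n, const float *Q_gpu,
                                   const float *A_gpu, float *R_gpu) {
        __shared__ float q_tile[TILE];                    // per-block copy of Q values
        int col = blockIdx.x * blockDim.x + threadIdx.x;

        float r = 0.0f;                                   // accumulate R in a register
        for (int base = 0; base < n; base += TILE) {
            int k = base + threadIdx.x;
            q_tile[threadIdx.x] = (k < n) ? Q_gpu[k] : 0.0f;  // coalesced load
            __syncthreads();
            if (col < n)
                for (int t = 0; t < TILE && base + t < n; t++)
                    r += q_tile[t] * A_gpu[(base + t) * n + col];  // reuse from shared memory
            __syncthreads();
        }
        if (col < n) R_gpu[col] = r;                      // one global store after the loop
    }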

  17. CUDABlamer Overhead

    Benchmark       Clean      Static    Monitored  Post        Runtime   Total
    name            execution  analysis  execution  processing  overhead  overhead
    Hotspot         10.43      1.61      10.82      0.83        3.7%      27.0%
    Streamcluster   16.96      2.54      115.35     55.46       580%      922%
    Particlefilter  10.21      1.34      11.1       1.74        8.7%      38.9%
    (Times in seconds)

  • Static analysis runs once for each benchmark with different problem sizes
  • Post-processing overhead depends on the number of samples and the number of blame variables per sample
  • Runtime overhead = (Monitored execution / Clean execution) - 1
  • Total overhead = (Total profiling time / Clean execution) - 1
  • Runtime overhead can be high due to the poor performance of NVIDIA's CUPTI library when using the PC_SAMPLING mechanism

  18. Conclusion
  • New Performance Attribution for Emerging Programming Models
    o Developed a data-centric CUDA profiler: CUDABlamer
  • Complete User-level Calling Context
    o Uses static and runtime information to interpolate the complete calling context for heterogeneous architectures
  • Valuable Performance Insights
    o Manual optimization gained speedups of up to 47x for selected CUDA kernels
