

SLIDE 1

Hui Zhang SC’19 - Protools’19 11/17/19

Understanding the Performance of GPGPU Applications from a Data-Centric View

Hui Zhang w.hzhang86@samsung.com Jeffrey K. Hollingsworth hollings@umd.edu

SLIDE 2


Motivation

  • It’s hard for programmers to write efficient code on highly parallel and heterogeneous architectures
  • There are few performance tools for CUDA users that can locate inefficient source code and guide user-level optimizations
  • The traditional code-centric profiling approach is insufficient for investigating data-placement issues


SLIDE 3


Contributions

  • First, the tool offers fine-grained, in-depth performance analysis of kernel execution, giving programmers deeper insight into GPU kernel behavior.
  • Second, the tool uses a data-centric performance analysis technique to map performance data back to variables in the source code.
  • Third, it proposes a method to obtain complete calling-context profiles, including the CPU call stack before a kernel is launched and the GPU call stack within a kernel.


SLIDE 4


CUDA Programming Overview


* Picture obtained from Nvidia: https://www.nvidia.com/docs/IO/116711/sc11-cuda-c-basics.pdf

The referenced figure walks through a 1D stencil example built from a __global__ kernel, stencil_1d(), and a __device__ helper, stencil_helper().
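For orientation, here is a minimal 1D-stencil sketch in the spirit of the referenced NVIDIA deck; RADIUS, BLOCK_SIZE, and the kernel bodies are illustrative assumptions, not the slide's exact code.

    #define RADIUS 3
    #define BLOCK_SIZE 256

    // __device__ functions are callable only from GPU code.
    __device__ int stencil_helper(const int *in, int gindex) {
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            result += in[gindex + offset];      // sum the (2*RADIUS+1)-wide neighborhood
        return result;
    }

    // __global__ functions are kernels launched from the host.
    __global__ void stencil_1d(const int *in, int *out) {
        int gindex = threadIdx.x + blockIdx.x * blockDim.x + RADIUS;
        out[gindex] = stencil_helper(in, gindex);
    }

    // Host-side launch, assuming d_in is padded by RADIUS elements on each side:
    //   stencil_1d<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_out);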

SLIDE 5


Data-centric Profiling

    int busy(int *x) {        // hotspot function
        *x = complex();
        return *x;
    }

    int main() {
        for (int i = 0; i < n; i++) {
            A[i] = busy(&B[i]) + busy(&C[i-1]) + busy(&C[i+1]);
        }
    }

Code-centric profiling:  main: 100%   busy: 100%   complex: 100%
Data-centric profiling:  A: 100%   B: 33.3%   C: 66.7%

A code-centric profile only reports that main, busy, and complex are on the hot path; the data-centric profile attributes the same samples to the variables involved: A receives the result of every call (100%), B is passed to one of the three busy calls (33.3%), and C to the other two (66.7%).


SLIDE 6


Properly Assign Blame

“I didn’t say you were to blame… I said I am blaming you.”

SLIDE 7


CUDABlamer Framework


Static Analysis
  • Data-flow analysis
  • Control-flow analysis
  • Intra-procedural blame analysis
  • Exit-variable analysis

Monitored Execution (see the sketch after this list)
  • CUPTI Callback API: tag kernel invocations
  • libunwind: CPU stack unwinding
  • CUPTI Activity API: GPU kernel sampling

Postmortem Process
  • Process runtime information
  • Reconstruct the CPU & GPU calling context
  • Inter-procedural blame analysis
  • Determine blame attribution to variables/functions

GUI Presentation
  • Data-centric profiling result
  • Code-centric profiling result
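As a concrete illustration of the Monitored Execution stage, the sketch below combines the CUPTI Callback API (to catch each kernel launch and read its correlationId) with libunwind (to walk the CPU stack at that moment). This is a minimal, hedged sketch of the general approach, not CUDABlamer's actual code; error handling is omitted and the callback-ID check assumes the CUDA ≥ 7.0 runtime launch entry point.

    #define UNW_LOCAL_ONLY
    #include <cupti.h>
    #include <libunwind.h>
    #include <stdio.h>

    // Called by CUPTI on every subscribed runtime-API entry/exit.
    static void CUPTIAPI launchCallback(void *userdata, CUpti_CallbackDomain domain,
                                        CUpti_CallbackId cbid,
                                        const CUpti_CallbackData *cbInfo) {
        if (cbid == CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernel_v7000 &&
            cbInfo->callbackSite == CUPTI_API_ENTER) {
            // Tag this launch: the same correlationId shows up in the Activity records.
            printf("kernel launch, correlationId = %u\n", cbInfo->correlationId);

            // Unwind the CPU call stack that led to this launch.
            unw_context_t ctx;
            unw_cursor_t cursor;
            unw_word_t ip;
            unw_getcontext(&ctx);
            unw_init_local(&cursor, &ctx);
            while (unw_step(&cursor) > 0) {
                unw_get_reg(&cursor, UNW_REG_IP, &ip);
                printf("  frame ip = 0x%lx\n", (unsigned long)ip);
            }
        }
    }

    int main(void) {
        CUpti_SubscriberHandle subscriber;
        cuptiSubscribe(&subscriber, (CUpti_CallbackFunc)launchCallback, NULL);
        cuptiEnableDomain(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API);
        // ... run the CUDA application here; each launch triggers launchCallback ...
        cuptiUnsubscribe(subscriber);
        return 0;
    }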
SLIDE 8


CUDABlamer – Static Analysis

  • Graphical representation to resolve blame relations
  • Resolve LLVM composite instructions to propagate blame hierarchically

[Figure: (a) a normal GEP instruction vs. (b) a composite GEP instruction]

    var a : int = 6;
    var b : int = 7;
    var c : int = a + b;

SLIDE 9


CUDABlamer – Postmortem Process

  • Construct the calling context for the CPU-GPU hybrid model
  • CPU stack: keep the call stack together with the kernel launch ID (correlationId)
  • GPU stack for kernel execution: find all paths from the sample point to the kernel using depth-first search (top & bottom node info from the Activity API)
  • Reconstruct the calling context: connect the CPU & GPU stacks through the correlationId

Example:

    __global__ void kernelFunc(…) {
        foo();
        …
        bar();
        …
    }
    __device__ void foo() {
        bar();
        …
        x = 1; …   // Sample 1
        y = 2; …   // Sample 2
    }
    __device__ void bar() {
        A[i] = B[i] * s;   // Sample 3
    }

Ambiguity: there are 2 possible call paths from the sample point (Sample 3 in bar) back to kernelFunc, namely kernelFunc → bar and kernelFunc → foo → bar.
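A minimal sketch of the path-reconstruction step (my own illustration, not CUDABlamer's code): run a depth-first search over the kernel's static call graph and collect every path from the kernel entry down to the function that contains the sample. On the example above it reports the two ambiguous paths.

    #include <map>
    #include <string>
    #include <vector>
    #include <iostream>

    using Graph = std::map<std::string, std::vector<std::string>>;  // caller -> callees

    void dfs(const Graph &g, const std::string &node, const std::string &target,
             std::vector<std::string> &path, std::vector<std::vector<std::string>> &out) {
        path.push_back(node);
        if (node == target)
            out.push_back(path);                    // found one kernel-to-sample path
        else {
            auto it = g.find(node);
            if (it != g.end())
                for (const auto &callee : it->second)
                    dfs(g, callee, target, path, out);
        }
        path.pop_back();
    }

    int main() {
        // Call graph of the slide's example: kernelFunc calls foo and bar; foo calls bar.
        Graph g = {{"kernelFunc", {"foo", "bar"}}, {"foo", {"bar"}}};
        std::vector<std::string> path;
        std::vector<std::vector<std::string>> paths;
        dfs(g, "kernelFunc", "bar", path, paths);   // Sample 3 falls inside bar()
        for (const auto &p : paths) {               // prints the 2 ambiguous paths
            for (const auto &f : p) std::cout << f << " -> ";
            std::cout << "sample\n";
        }
    }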

SLIDE 10


Precision Evaluation

  • Coverage Metric:

[Bar chart: coverage metric per benchmark on a 0–100% scale, for SHOC (Stencil2D, BFS, Sort, MD5Hash, Neuralnet, Reduction, Scan, Triad) and Rodinia (pathFinder, cfd, hotspot, gaussian, heartwall, nn, particlefilter, streamcluster); one labeled value: 68.01%.]

SLIDE 11


Tool Evaluation – Particlefilter

Variable         Type      Context                       Blame
ye/xe            double    main.particleFilter           100%
arrayX/arrayY    *double   main.particleFilter           100%
xj               *double   main.particleFilter           97.9%
yj               *double   main.particleFilter           97.8%
xj_GPU           *double   main.particleFilter           97.9%
yj_GPU           *double   main.particleFilter           97.8%
index            int       main.particleFilter.kernel    95.7%


Single node: 2 NVIDIA Tesla P100 GPUs; each P100 has 16 GB of HBM2 memory and 56 SMs (streaming multiprocessors). Each SM has 64 KB of shared memory, and the GPU provides 48 KB of constant memory.

Compilers: nvcc 8.0, gcc 4.8.5, and clang 4.0.1

SLIDE 12


Tool Evaluation – Particlefilter

  • Optimization
  • Use constant memory for the read-only variables arrayX_GPU, arrayY_GPU, u_GPU, and CDF_GPU (see the sketch below)
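A hedged sketch of the change: declare the read-only arrays in __constant__ memory and copy the data to them once from the host. The array size, kernel name, and body below are illustrative assumptions, not the actual Rodinia particlefilter code.

    #define N_PARTICLES 1000                       // illustrative size

    __constant__ double arrayX_c[N_PARTICLES];     // read-only data in constant memory
    __constant__ double arrayY_c[N_PARTICLES];

    __global__ void use_particles(double *xj, double *yj, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            xj[i] = arrayX_c[i];                   // reads go through the constant cache
            yj[i] = arrayY_c[i];
        }
    }

    // Host side: populate the constant arrays once before launching the kernel, e.g.
    //   cudaMemcpyToSymbol(arrayX_c, arrayX, N_PARTICLES * sizeof(double));
    //   cudaMemcpyToSymbol(arrayY_c, arrayY, N_PARTICLES * sizeof(double));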

[Bar chart] Particlefilter kernel execution time (ms): 163.1 original vs. 3.5 optimized. Speedup: 46.6x.

SLIDE 13


Tool Evaluation - Gesummv

  • Gesummv is part of the PolyBench test suite and has a kernel that does scalar, vector, and matrix multiplication


Variable           Type     Context             Blame
y_outputFromGpu    *float   main                100%
y_gpu              *float   main.gesummvCuda    100%
tmp_gpu            *float   main.gesummvCuda    52.1%
j                  int      gesummv_kernel      4.3%
A_gpu/B_gpu        *float   main.gesummvCuda    1.2%
x_gpu              *float   main.gesummvCuda    1.2%

SLIDE 14


Tool Evaluation - Gesummv

  • Optimization
  • y_gpu is allocated in global memory, and updating it iteratively is costly. We use temporary variables to hold intermediate results inside the for loop and assign the final value to the corresponding array element once at the end (see the sketch below).
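A hedged sketch of the described change (simplified; names and indexing are illustrative, not the exact PolyBench source): keep the two running sums in registers and write y once per element after the loop, instead of updating y_gpu in global memory on every iteration.

    // gesummv computes y = alpha * A * x + beta * B * x; one thread per output element.
    __global__ void gesummv_kernel(int n, float alpha, float beta,
                                   const float *A, const float *B,
                                   const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float tmp1 = 0.0f, tmp2 = 0.0f;        // registers, not global memory
            for (int j = 0; j < n; j++) {
                tmp1 += A[i * n + j] * x[j];
                tmp2 += B[i * n + j] * x[j];
            }
            y[i] = alpha * tmp1 + beta * tmp2;     // single global write per element
        }
    }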


[Bar chart] Gesummv kernel execution time (ms): 14.82 original vs. 9.95 optimized. Speedup: 1.5x.

SLIDE 15


Tool Evaluation - Gramschm


Data-centric:

Variable           Type     Context                 Blame
A_outputFromGpu    *float   main                    99.1%
A_gpu              *float   main.gramschmidtCuda    99.1%
R_gpu              *float   main.gramschmidtCuda    60.6%
nrm                float    main.gramschmidtCuda    19.5%
i                  int      Gramschmidt_kernel3     6.7%
Q_gpu              *float   main.gramschmidtCuda    2.8%

Code-centric:

Function              Scope   Blame
main                  CPU     100%
gramschmidtCuda       CPU     100%
gramschmidt_kernel3   GPU     78.2%
gramschmidt_kernel1   GPU     19.9%
gramschmidt_kernel2   GPU     1.9%

SLIDE 16


Tool Evaluation - Gramschm

  • Optimization
  • R_gpu: use a temporary variable to hold the incremental value of R_gpu and do a one-time assignment after the loop
  • Q_gpu: use shared memory instead of global memory to store a per-block copy of it, and change the column-based access to row-based access (see the sketch below)
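A hedged sketch of the two ideas, shown on a simplified matrix-vector reduction rather than the actual PolyBench gramschmidt kernels: a register accumulator replaces repeated global-memory updates of the result, and a per-block shared-memory copy replaces repeated global reads of the reused vector.

    // Assumes n * sizeof(float) fits within the per-block shared-memory limit.
    __global__ void rowdot_sketch(int n, const float *M, const float *q, float *r) {
        extern __shared__ float q_s[];             // per-block copy of q
        for (int t = threadIdx.x; t < n; t += blockDim.x)
            q_s[t] = q[t];                         // cooperative load, one global read each
        __syncthreads();

        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n) {
            float acc = 0.0f;                      // temporary held in a register
            for (int j = 0; j < n; j++)
                acc += M[row * n + j] * q_s[j];    // q re-read from shared memory
            r[row] = acc;                          // one-time global write after the loop
        }
    }

    // Launch with a dynamic shared-memory allocation of n * sizeof(float):
    //   rowdot_sketch<<<grid, block, n * sizeof(float)>>>(n, d_M, d_q, d_r);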


[Bar chart] Gramschm kernel execution time (ms): 2.89 original vs. 0.51 optimized. Speedup: 5.7x.

SLIDE 17


CUDABlamer Overhead

Benchmark        Clean execution   Static analysis   Monitored execution   Post processing   Runtime overhead   Total overhead
Hotspot          10.43             1.61              10.82                 0.83              3.7%               27.0%
Streamcluster    16.96             2.54              115.35                55.46             580%               922%
Particlefilter   10.21             1.34              11.1                  1.74              8.7%               38.9%


  • Static analysis runs only once per benchmark, even across different problem sizes
  • Post-processing overhead depends on the number of samples and the number of blamed variables per sample
  • Runtime overhead = (Monitored execution / Clean execution) - 1
  • Total overhead = (Total profiling time / Clean execution) - 1 (see the worked example below)
  • Runtime overhead can be high due to the poor performance of NVIDIA’s CUPTI library when using the PC_SAMPLING mechanism
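As a worked example using the Hotspot row of the table: runtime overhead = 10.82 / 10.43 - 1 ≈ 3.7%, and total overhead = (1.61 + 10.82 + 0.83) / 10.43 - 1 ≈ 27%.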

Unit: seconds

SLIDE 18


Conclusion

  • New Performance Attribution for Emerging Programming Models
  • Developed a data-centric CUDA profiler: CUDABlamer
  • Complete User-level Calling Context
  • Uses static and runtime information to interpolate the complete calling context on heterogeneous architectures
  • Valuable Performance Insights
  • Manual optimizations achieved speedups of up to 47x for selected CUDA kernels
