

SLIDE 1

Hui Zhang SC’19 - Protools’19 11/17/19

Understanding the Performance of GPGPU Applications from a Data-Centric View

Hui Zhang w.hzhang86@samsung.com Jeffrey K. Hollingsworth hollings@umd.edu

SLIDE 2


Motivation

  • It’s hard for programmers to write efficient code on highly parallel and heterogeneous architectures
  • There are few performance tools for CUDA users that can locate inefficient source code and guide user-level optimizations
  • The traditional code-centric profiling approach is insufficient for investigating data-placement issues


SLIDE 3


Contributions

  • First, the tool offers fine-grained, in-depth performance analysis of kernel execution, giving programmers deeper insight into GPU kernel behavior.
  • Second, the tool uses a data-centric performance analysis technique to map performance data back to variables in the source code.
  • Third, it proposes a method to obtain complete calling-context profiles, including the CPU call stack before a kernel is launched and the GPU call stack within a kernel.


SLIDE 4


CUDA Programming Overview


* Picture obtained from Nvidia: https://www.nvidia.com/docs/IO/116711/sc11-cuda-c-basics.pdf

The referenced figure walks through a 1D stencil example built from a __global__ kernel, stencil_1d(), and a __device__ helper, stencil_helper().
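For orientation, here is a minimal 1D-stencil sketch in the spirit of the referenced NVIDIA deck; RADIUS, BLOCK_SIZE, and the kernel bodies are illustrative assumptions, not the slide's exact code.

    #define RADIUS 3
    #define BLOCK_SIZE 256

    // __device__ functions are callable only from GPU code.
    __device__ int stencil_helper(const int *in, int gindex) {
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            result += in[gindex + offset];      // sum the (2*RADIUS+1)-wide neighborhood
        return result;
    }

    // __global__ functions are kernels launched from the host.
    __global__ void stencil_1d(const int *in, int *out) {
        int gindex = threadIdx.x + blockIdx.x * blockDim.x + RADIUS;
        out[gindex] = stencil_helper(in, gindex);
    }

    // Host-side launch, assuming d_in is padded by RADIUS elements on each side:
    //   stencil_1d<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_out);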

SLIDE 5


Data-centric Profiling

    int busy(int *x) {        // hotspot function
        *x = complex();
        return *x;
    }

    int main() {
        for (int i = 0; i < n; i++) {
            A[i] = busy(&B[i]) + busy(&C[i-1]) + busy(&C[i+1]);
        }
    }

Code-centric profiling:  main: 100%   busy: 100%   complex: 100%
Data-centric profiling:  A: 100%   B: 33.3%   C: 66.7%

A code-centric profile only reports that main, busy, and complex are on the hot path; the data-centric profile attributes the same samples to the variables involved: A receives the result of every call (100%), B is passed to one of the three busy calls (33.3%), and C to the other two (66.7%).


SLIDE 6


Properly Assign Blame

“I didn’t say you were to blame… I said I am blaming you.”

SLIDE 7


CUDABlamer Framework


Static Analysis
  • Data-flow analysis
  • Control-flow analysis
  • Intra-procedural blame analysis
  • Exit-variable analysis

Monitored Execution (see the sketch after this list)
  • CUPTI Callback API: tag kernel invocations
  • libunwind: CPU stack unwinding
  • CUPTI Activity API: GPU kernel sampling

Postmortem Process
  • Process runtime information
  • Reconstruct the CPU & GPU calling context
  • Inter-procedural blame analysis
  • Determine blame attribution to variables/functions

GUI Presentation
  • Data-centric profiling result
  • Code-centric profiling result
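As a concrete illustration of the Monitored Execution stage, the sketch below combines the CUPTI Callback API (to catch each kernel launch and read its correlationId) with libunwind (to walk the CPU stack at that moment). This is a minimal, hedged sketch of the general approach, not CUDABlamer's actual code; error handling is omitted and the callback-ID check assumes the CUDA ≥ 7.0 runtime launch entry point.

    #define UNW_LOCAL_ONLY
    #include <cupti.h>
    #include <libunwind.h>
    #include <stdio.h>

    // Called by CUPTI on every subscribed runtime-API entry/exit.
    static void CUPTIAPI launchCallback(void *userdata, CUpti_CallbackDomain domain,
                                        CUpti_CallbackId cbid,
                                        const CUpti_CallbackData *cbInfo) {
        if (cbid == CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernel_v7000 &&
            cbInfo->callbackSite == CUPTI_API_ENTER) {
            // Tag this launch: the same correlationId shows up in the Activity records.
            printf("kernel launch, correlationId = %u\n", cbInfo->correlationId);

            // Unwind the CPU call stack that led to this launch.
            unw_context_t ctx;
            unw_cursor_t cursor;
            unw_word_t ip;
            unw_getcontext(&ctx);
            unw_init_local(&cursor, &ctx);
            while (unw_step(&cursor) > 0) {
                unw_get_reg(&cursor, UNW_REG_IP, &ip);
                printf("  frame ip = 0x%lx\n", (unsigned long)ip);
            }
        }
    }

    int main(void) {
        CUpti_SubscriberHandle subscriber;
        cuptiSubscribe(&subscriber, (CUpti_CallbackFunc)launchCallback, NULL);
        cuptiEnableDomain(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API);
        // ... run the CUDA application here; each launch triggers launchCallback ...
        cuptiUnsubscribe(subscriber);
        return 0;
    }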
SLIDE 8


CUDABlamer – Static Analysis

  • Graphical representation to resolve blame relations
  • Resolve LLVM composite instructions to propagate blame hierarchically

[Figure: (a) a normal GEP instruction vs. (b) a composite GEP instruction]

    var a : int = 6;
    var b : int = 7;
    var c : int = a + b;

SLIDE 9


CUDABlamer – Postmortem Process

  • Construct the calling context for the CPU-GPU hybrid model
  • CPU stack: keep the call stack together with the kernel launch ID (correlationId)
  • GPU stack for kernel execution: find all paths from the sample point to the kernel using depth-first search (top & bottom node info from the Activity API)
  • Reconstruct the calling context: connect the CPU & GPU stacks through the correlationId

Example:

    __global__ void kernelFunc(…) {
        foo();
        …
        bar();
        …
    }
    __device__ void foo() {
        bar();
        …
        x = 1; …   // Sample 1
        y = 2; …   // Sample 2
    }
    __device__ void bar() {
        A[i] = B[i] * s;   // Sample 3
    }

Ambiguity: there are 2 possible call paths from the sample point (Sample 3 in bar) back to kernelFunc, namely kernelFunc → bar and kernelFunc → foo → bar.
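A minimal sketch of the path-reconstruction step (my own illustration, not CUDABlamer's code): run a depth-first search over the kernel's static call graph and collect every path from the kernel entry down to the function that contains the sample. On the example above it reports the two ambiguous paths.

    #include <map>
    #include <string>
    #include <vector>
    #include <iostream>

    using Graph = std::map<std::string, std::vector<std::string>>;  // caller -> callees

    void dfs(const Graph &g, const std::string &node, const std::string &target,
             std::vector<std::string> &path, std::vector<std::vector<std::string>> &out) {
        path.push_back(node);
        if (node == target)
            out.push_back(path);                    // found one kernel-to-sample path
        else {
            auto it = g.find(node);
            if (it != g.end())
                for (const auto &callee : it->second)
                    dfs(g, callee, target, path, out);
        }
        path.pop_back();
    }

    int main() {
        // Call graph of the slide's example: kernelFunc calls foo and bar; foo calls bar.
        Graph g = {{"kernelFunc", {"foo", "bar"}}, {"foo", {"bar"}}};
        std::vector<std::string> path;
        std::vector<std::vector<std::string>> paths;
        dfs(g, "kernelFunc", "bar", path, paths);   // Sample 3 falls inside bar()
        for (const auto &p : paths) {               // prints the 2 ambiguous paths
            for (const auto &f : p) std::cout << f << " -> ";
            std::cout << "sample\n";
        }
    }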

SLIDE 10


Precision Evaluation

  • Coverage Metric:

[Bar chart: coverage metric per benchmark on a 0–100% scale, for SHOC (Stencil2D, BFS, Sort, MD5Hash, Neuralnet, Reduction, Scan, Triad) and Rodinia (pathFinder, cfd, hotspot, gaussian, heartwall, nn, particlefilter, streamcluster); one labeled value: 68.01%.]

SLIDE 11


Tool Evaluation – Particlefilter

Variable         Type      Context                       Blame
ye/xe            double    main.particleFilter           100%
arrayX/arrayY    *double   main.particleFilter           100%
xj               *double   main.particleFilter           97.9%
yj               *double   main.particleFilter           97.8%
xj_GPU           *double   main.particleFilter           97.9%
yj_GPU           *double   main.particleFilter           97.8%
index            int       main.particleFilter.kernel    95.7%


Single node: 2 NVIDIA Tesla P100 GPUs; each P100 has 16 GB of HBM2 memory and 56 SMs (streaming multiprocessors). Each SM has 64 KB of shared memory, and the GPU provides 48 KB of constant memory.

Compilers: nvcc 8.0, gcc 4.8.5, and clang 4.0.1

SLIDE 12


Tool Evaluation – Particlefilter

  • Optimization
  • Use constant memory for the read-only variables arrayX_GPU, arrayY_GPU, u_GPU, and CDF_GPU (see the sketch below)
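A hedged sketch of the change: declare the read-only arrays in __constant__ memory and copy the data to them once from the host. The array size, kernel name, and body below are illustrative assumptions, not the actual Rodinia particlefilter code.

    #define N_PARTICLES 1000                       // illustrative size

    __constant__ double arrayX_c[N_PARTICLES];     // read-only data in constant memory
    __constant__ double arrayY_c[N_PARTICLES];

    __global__ void use_particles(double *xj, double *yj, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            xj[i] = arrayX_c[i];                   // reads go through the constant cache
            yj[i] = arrayY_c[i];
        }
    }

    // Host side: populate the constant arrays once before launching the kernel, e.g.
    //   cudaMemcpyToSymbol(arrayX_c, arrayX, N_PARTICLES * sizeof(double));
    //   cudaMemcpyToSymbol(arrayY_c, arrayY, N_PARTICLES * sizeof(double));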

[Bar chart] Particlefilter kernel execution time (ms): 163.1 original vs. 3.5 optimized. Speedup: 46.6x.

SLIDE 13


Tool Evaluation - Gesummv

  • Gesummv is part of the PolyBench test suite and has a kernel that does scalar, vector, and matrix multiplication


Variable           Type     Context             Blame
y_outputFromGpu    *float   main                100%
y_gpu              *float   main.gesummvCuda    100%
tmp_gpu            *float   main.gesummvCuda    52.1%
j                  int      gesummv_kernel      4.3%
A_gpu/B_gpu        *float   main.gesummvCuda    1.2%
x_gpu              *float   main.gesummvCuda    1.2%

SLIDE 14


Tool Evaluation - Gesummv

  • Optimization
  • y_gpu is allocated in global memory, and updating it iteratively is costly. We use temporary variables to hold intermediate results inside the for loop and assign the final value to the corresponding array element once at the end (see the sketch below).
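A hedged sketch of the described change (simplified; names and indexing are illustrative, not the exact PolyBench source): keep the two running sums in registers and write y once per element after the loop, instead of updating y_gpu in global memory on every iteration.

    // gesummv computes y = alpha * A * x + beta * B * x; one thread per output element.
    __global__ void gesummv_kernel(int n, float alpha, float beta,
                                   const float *A, const float *B,
                                   const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float tmp1 = 0.0f, tmp2 = 0.0f;        // registers, not global memory
            for (int j = 0; j < n; j++) {
                tmp1 += A[i * n + j] * x[j];
                tmp2 += B[i * n + j] * x[j];
            }
            y[i] = alpha * tmp1 + beta * tmp2;     // single global write per element
        }
    }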


[Bar chart] Gesummv kernel execution time (ms): 14.82 original vs. 9.95 optimized. Speedup: 1.5x.

SLIDE 15


Tool Evaluation - Gramschm


Data-centric:

Variable           Type     Context                 Blame
A_outputFromGpu    *float   main                    99.1%
A_gpu              *float   main.gramschmidtCuda    99.1%
R_gpu              *float   main.gramschmidtCuda    60.6%
nrm                float    main.gramschmidtCuda    19.5%
i                  int      Gramschmidt_kernel3     6.7%
Q_gpu              *float   main.gramschmidtCuda    2.8%

Code-centric:

Function              Scope   Blame
main                  CPU     100%
gramschmidtCuda       CPU     100%
gramschmidt_kernel3   GPU     78.2%
gramschmidt_kernel1   GPU     19.9%
gramschmidt_kernel2   GPU     1.9%

SLIDE 16


Tool Evaluation - Gramschm

  • Optimization
  • R_gpu: use a temporary variable to hold the incremental value of R_gpu and do a one-time assignment after the loop
  • Q_gpu: use shared memory instead of global memory to store a per-block copy of it, and change the column-based access to row-based access (see the sketch below)
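A hedged sketch of the two ideas, shown on a simplified matrix-vector reduction rather than the actual PolyBench gramschmidt kernels: a register accumulator replaces repeated global-memory updates of the result, and a per-block shared-memory copy replaces repeated global reads of the reused vector.

    // Assumes n * sizeof(float) fits within the per-block shared-memory limit.
    __global__ void rowdot_sketch(int n, const float *M, const float *q, float *r) {
        extern __shared__ float q_s[];             // per-block copy of q
        for (int t = threadIdx.x; t < n; t += blockDim.x)
            q_s[t] = q[t];                         // cooperative load, one global read each
        __syncthreads();

        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n) {
            float acc = 0.0f;                      // temporary held in a register
            for (int j = 0; j < n; j++)
                acc += M[row * n + j] * q_s[j];    // q re-read from shared memory
            r[row] = acc;                          // one-time global write after the loop
        }
    }

    // Launch with a dynamic shared-memory allocation of n * sizeof(float):
    //   rowdot_sketch<<<grid, block, n * sizeof(float)>>>(n, d_M, d_q, d_r);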


[Bar chart] Gramschm kernel execution time (ms): 2.89 original vs. 0.51 optimized. Speedup: 5.7x.

SLIDE 17


CUDABlamer Overhead

Benchmark        Clean execution   Static analysis   Monitored execution   Post processing   Runtime overhead   Total overhead
Hotspot          10.43             1.61              10.82                 0.83              3.7%               27.0%
Streamcluster    16.96             2.54              115.35                55.46             580%               922%
Particlefilter   10.21             1.34              11.1                  1.74              8.7%               38.9%


  • Static analysis runs only once per benchmark, even across different problem sizes
  • Post-processing overhead depends on the number of samples and the number of blamed variables per sample
  • Runtime overhead = (Monitored execution / Clean execution) - 1
  • Total overhead = (Total profiling time / Clean execution) - 1 (see the worked example below)
  • Runtime overhead can be high due to the poor performance of NVIDIA’s CUPTI library when using the PC_SAMPLING mechanism
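As a worked example using the Hotspot row of the table: runtime overhead = 10.82 / 10.43 - 1 ≈ 3.7%, and total overhead = (1.61 + 10.82 + 0.83) / 10.43 - 1 ≈ 27%.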

Unit: seconds

SLIDE 18


Conclusion

  • New Performance Attribution for Emerging Programming Models
  • Developed a data-centric CUDA profiler: CUDABlamer
  • Complete User-level Calling Context
  • Uses static and runtime information to interpolate the complete calling context on heterogeneous architectures
  • Valuable Performance Insights
  • Manual optimizations achieved speedups of up to 47x for selected CUDA kernels
