Performance Analysis of GPU Programming Models using the Roofline Scaling Trajectories
Khaled Ibrahim, Sam Williams, and Leonid Oliker
Bench’19 Conference,
Nov. 14, 2019
Denver, Colorado
UNIVERSITY OF CALIFORNIA
Acknowledgements
§ This material is based upon work supported by the Advanced Scientific Computing Research Program in the U.S. Department of Energy, Office of Science, under Award Number DE-AC02-05CH11231.
§ This material is based upon work supported by the DOE RAPIDS SciDAC Institute.
§ This research used resources of the National Energy Research Scientific Computing Center (NERSC), which is supported by the Office of Science of the U.S. Department of Energy.
§ This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
Layout and Contributions
Contributions
§ Introduce the Roofline Scaling Trajectories for performance analysis on GPU architectures.
§ Compare the CUDA and OpenACC models for GPU offloading.
§ Optimize NAS benchmark implementations (> 2x improvement).
Performance Analysis: Micro vs. Macro
Micro Analysis: typically attributed to a particular code site. Miss rate, vectorization, load imbalance? But are they impacting performance?
Macro Analysis (model-based): Are we utilizing resources effectively? Roofline technique, LogGP, etc. Clear performance expectation, but no source-code attribution.
Vendor Hardware Events: could serve both micro and macro analyses! But they are many, hard to understand, and time-consuming to collect.
#pragma omp parallel for
for ( int i = 0; i < n; i++ ) {               /* Is it balanced? */
    double s = 0;
    for ( int j = ia[i]; j < ia[i+1]; j++ )   /* Is it vectorized? */
        s += A[j] * x[ja[j]];
    y[i] = s;
}
Roofline Performance Model (Empirical)
§ Performance Model Bounds
§ DRAM Bandwidth
§ Compute
§ Typical HPC application
§ Performance: Flop/s
§ Arithmetic intensity: Flop/Byte
§ DRAM Roofline
§ Define an AI for each level of cache/memory
§ CARM Roofline (Cache-Aware Roofline)
§ Define a single AI for all levels of memory based on L1 data movement.
[Figure: DRAM Roofline, attainable Flop/s vs. arithmetic intensity (Flop:Byte), with DRAM GB/s and peak Flop/s ceilings separating the DRAM-bound and compute-bound regions]
Williams et al, "Roofline: An Insightful Visual Performance Model For Multicore Architectures", CACM, 2009.
[Figure: CARM Roofline, attainable Flop/s vs. arithmetic intensity (Flop:Byte), with L2 GB/s, MCDRAM-cache GB/s, and peak Flop/s ceilings]
Aleksandar Ilic et al., "Cache-aware Roofline Model: Upgrading the Loft", IEEE Computer Architecture Letters, vol. 13, no. 1, pp. 21-24, January 2014.
GPU Parallelism and Performance
Multiple levels of parallelism:
§ Warp level: ideally, threads execute the same instruction.
§ Thread-block level: cooperative execution.
§ SM level: resource sharing to hide latency. Occupancy of multiple blocks depends on resource requirements.
§ GPU level: occupancy scaling depends on the level of parallelism.
Performance-influencing factors:
§ Warp efficiency
§ Occupancy
§ Locality (especially spatial, not necessarily temporal)
§ etc.
[Diagram: GPU hierarchy, warps grouped into Streaming Multiprocessors (SMs), each with a register file and cache/shared memory, all sharing an L2 cache and global memory (HBM)]
Warp Efficiency
Warp level (32 threads per warp): "The ratio of the average active threads per warp to the maximum number of threads per warp."
i = map_func(thread, block);
s = 0;
for ( int j = ia[i]; j < ia[i+1]; j++ )
    s += A[j] * x[ja[j]];
y[i] = s;

if ( threadIdx.x < N ) {
    X; Y;
} else {
    A; B;
}
Z;

Control divergence: the branch halves serialize (X Y, then A B, then Z). Latency divergence: threads finish their inner loops at different times. nvprof reports event-based warp efficiency due only to control divergence.
Occupancy
"The ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor."
§ nvprof reports active occupancy (while the GPU is actively executing a kernel).
§ Thread-block/SM level: 64 warps per SM.
§ GPU level: 80+ SMs per Volta GPU.
§ Resource sharing impacting occupancy: block level (register file, shared memory).
Scaling Curves
§ Volta GPU: possible to control the SM count.
§ NAS Parallel Benchmarks: LU, CUDA.
Typical Scaling Plot
§ Provides performance as the SM count changes, but no insight into root causes. Why does Class B scale better than Class A, while Class C is no better than B?
[Figure: LU scaling plot, MFlop/s vs. SM count (2 to 80); legend: Class B, Class C; markers at SM=2, SM=32, SM=80]
Roofline Strong Scaling Trajectory
Roofline Scaling Trajectories: a diagnostic technique for scaling issues.
§ Track performance while changing the level of concurrency.
§ Ideal behavior: Δy = increase in computational resources or share of bandwidth; Δx = 0 (no change in arithmetic intensity).
[Figure: strong-scaling trajectory with points at SM=2, SM=32, and SM=80]
§ Inefficiency at low SM count is typically correlated with low warp efficiency.
§ AI degradation due to excessive HBM data movement to the L2 cache.
§ Potential throughput improvement with AI degradation vs. measured throughput improvement: Measured < Potential indicates loss of scaling.
Typical Strong Scaling Trajectory
AI = canonical flop count / ((dram read transactions + dram write transactions) × 32)
Evaluation Testbed
§ NAS Parallel Benchmarks: FT (spectral methods), CG (sparse linear algebra), MG (multi-grid PDE), LU (regular-sparse lower and upper triangular systems), BT and SP (mini-apps), adaptive mesh benchmark.
§ Programming models: CUDA, OpenACC.
§ Problem sizes: Classes A, B, and C (4x problem-size increase between classes).
§ Low-level tools: nvprof for data movement. Flop count is based on the application estimate (constant per class while scaling).
OLCF Summit, P9 - V100 node
(e.g., texture, shared memory, etc.)
GPU Programming Model Influence
§ Vendor programming language (CUDA)
§ Pragma-based languages (OpenACC, OpenMP)
[Diagram: kernels/loops offloaded to the GPU under each model]
Roofline Scaling Trajectories Case Studies
§ Good scaling (e.g., ACC MG): locality at high concurrency; some influence of problem size on locality.
§ Needs improvement (e.g., ACC FT): scaling problem due to occupancy.
[Figures: Roofline scaling trajectories for ACC MG and ACC FT, Classes B and C, with HBM ceilings at SMs=2 (35 GFlop/s) and SMs=80 (829 GFlop/s)]
GPU Scaling Trajectories
Scaling plot vs. Roofline scaling trajectory
[Figure: LU scaling plot (MFlop/s vs. SM count, 2 to 80) beside its Roofline scaling trajectory, Classes B and C, with ADD ceilings at SMs=2 (88 GFlop/s) and SMs=80 (3536 GFlop/s) and HBM ceilings at SMs=2 (35 GFlop/s) and SMs=80 (829 GFlop/s)]
§ Behavior at low concurrency: warp efficiency vs. occupancy.
§ Influence of the programming model.
[Figure: Roofline scaling trajectory, Classes B and C, with HBM ceilings at SMs=2 (35 GFlop/s) and SMs=80 (829 GFlop/s)]
Scaling of NAS BT
[Figure: BT Roofline scaling trajectory, Class B, with HBM ceilings at SMs=2 (35 GFlop/s) and SMs=80 (829 GFlop/s)]
BT CUDA Implementation
CUDA AI
BT OpenACC Implementation
[Bar charts: warp efficiency (0.0 to 1.0) per BT kernel (x_solve, y_solve, z_solve, compute_rhs) for Classes A, B, and C]
Warp Efficiency (CUDA vs. OpenACC)
BT CUDA Implementation BT OpenACC Implementation
Low-level metric confirmation
Occupancy (CUDA vs. OpenACC)
BT CUDA Implementation BT OpenACC Implementation
[Bar chart: occupancy (0.0 to 1.0) per LU kernel (jacu_buts, jacld_blts, rhs_kern_x, rhs_kern_y, rhs_kern_z) across SM counts 2 to 80]
LU: variation per kernel!
Tuning the tile size improves warp efficiency.
Optimization of CUDA BT
[Bar charts: warp efficiency (0.0 to 1.0) per BT kernel (x_solve, y_solve, z_solve, compute_rhs), Classes A, B, and C: BT CUDA base implementation vs. BT CUDA optimized implementation]
[Bar charts: occupancy (0.0 to 1.0) per BT kernel across SM counts 2 to 80: base vs. optimized implementation]
Reducing occupancy is not a problem as long as the trajectory gradient does not change!
[Figure: LU Roofline scaling trajectory with Amdahl shadow curve, Class B, HBM ceilings at SMs=2 (35 GFlop/s) and SMs=80 (829 GFlop/s); bar chart of invocation counts (up to 1e6) per kernel (jacu_buts, jacld_blts, rhs, others, cuRoutines) for Classes A, B, and C]
Roofline Shadow Curve: performance projected as if invocation overhead were removed.
Insights:
§ Hitting the roof with the shadow curve indicates inactive occupancy.
§ Influence of the programming model: incremental loop parallelization; kernel-launch influence on performance.
Distinguishing Occupancy Issues (Active vs. Inactive)
[Figure: LU Roofline scaling trajectory with Amdahl shadow curve, Class B, HBM ceilings at SMs=2 (35 GFlop/s) and SMs=80 (829 GFlop/s); bar chart of invocation counts per kernel (jacu_buts, jacld_blts, rhs_kernel_x, rhs_kernel_y, rhs_kernel_z, others, cuRoutines) for Classes A, B, and C]
LU CUDA vs. LU OpenACC
[Diagram: kernels/loops launched under each model]
Other Case Studies
More case studies appear in the paper.
[Figures: Roofline scaling trajectories for the remaining benchmarks, Classes A, B, and C, each with HBM ceilings at SMs=2 (35 GFlop/s) and SMs=80 (829 GFlop/s)]
Summary and Conclusions
§ Performance analysis for post-Moore architectures is a daunting problem. Both micro and macro analyses are important; the Roofline Scaling Trajectory bridges the two sorts of analyses.
§ Diagnostics: warp efficiency, occupancy gradient, locality.
§ Improving performance involves tradeoffs between various low-level optimizations (warp efficiency vs. occupancy). More than 2x speedup for multiple applications. The paper carries multiple case studies.
Additional Resources
§ LBL Roofline: https://crd.lbl.gov/departments/computer-science/PAR/research/roofline
§ nvprof: https://docs.nvidia.com/cuda/profiler-users-guide/index.html
§ NERSC recipe: http://www.nersc.gov/users/application-performance/measuring-arithmetic-intensity/