
SLIDE 1

Performance Analysis of GPU Programming Models using the Roofline Scaling Trajectories

Khaled Ibrahim, Sam Williams, and Leonid Oliker

Bench’19 Conference, Nov. 14, 2019

Denver, Colorado

UNIVERSITY OF CALIFORNIA

SLIDE 2

Acknowledgements

  • This material is based upon work supported by the Advanced Scientific Computing Research Program in the U.S. Department of Energy, Office of Science, under Award Number DE-AC02-05CH11231.
  • This material is based upon work supported by the DOE RAPIDS SciDAC Institute.
  • This research used resources of the National Energy Research Scientific Computing Center (NERSC), which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-05CH11231.
  • This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

SLIDE 3

Layout and Contributions

Contributions

  • The roofline scaling trajectory technique for performance analysis on GPU architectures.
  • A comparison between the CUDA and OpenACC models for GPU offloading.
  • Case studies of tuning for performance (> 2x improvement).

[Figure: preview of the annotated roofline scaling trajectory (SM=2, 32, 80) and the MFlop/s vs. SM count scaling plot for Classes A, B, and C; the annotations are discussed on Slide 11.]

SLIDE 4

Performance Analysis: Micro vs. Macro

Micro analysis (typically attributed to a particular code site): miss rate, vectorization, load imbalance. But are they impacting performance?

Macro analysis (model-based): Are we utilizing resources effectively? Roofline technique, LogGP, etc. Clear performance expectation, but no source code attribution.

Vendor hardware events: could serve both micro and macro analyses, but they are numerous, hard to understand, and time-consuming to collect.

Macro view of an SpMV loop (are we using resources effectively?):

    #pragma omp parallel for
    for ( int i = 0; i < n; i++ ) {
        double s = 0;
        for ( int j = ia[i]; j < ia[i+1]; j++ )
            s += A[j] * x[ja[j]];
        y[i] = s;
    }

Micro view of the same loop (per-site questions):

    /* Is it balanced? */
    #pragma omp parallel for
    for ( int i = 0; i < n; i++ ) {
        double s = 0;
        /* Is it vectorized? */
        for ( int j = ia[i]; j < ia[i+1]; j++ )
            s += A[j] * x[ja[j]];
        y[i] = s;
    }

SLIDE 5

Roofline Performance Model (Empirical)

  • Performance model bounds:
      • DRAM bandwidth
      • Compute
  • Typical HPC application:
      • Performance: Flop/s
      • Arithmetic intensity: Flop/Byte
  • DRAM Roofline: define an AI for each level of cache/memory.
  • CARM Roofline (Cache-Aware Roofline Model): define a single AI for all levels of memory based on L1 data movement.
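For reference, the roofline bound sketched in the figures below can be written (standard form, following the cited Williams et al. paper) as:

$$\text{Attainable Flop/s} \;=\; \min\bigl(\text{Peak Flop/s},\ \mathrm{AI} \times \text{DRAM bandwidth}\bigr)$$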


[Figure: DRAM roofline — attainable Flop/s vs. arithmetic intensity (Flop:Byte), bounded by DRAM GB/s and peak Flop/s, with DRAM-bound and compute-bound regions.]

Williams et al., "Roofline: An Insightful Visual Performance Model for Multicore Architectures", CACM, 2009.

[Figure: cache-aware roofline — attainable Flop/s vs. arithmetic intensity, with MCDRAM cache GB/s, L2 GB/s, and peak Flop/s ceilings.]

Aleksandar Ilic et al., "Cache-aware Roofline Model: Upgrading the Loft", IEEE Computer Architecture Letters, vol. 13, no. 1, pp. 21-24, January 2014.

SLIDE 6

GPU Parallelism and Performance

Multiple levels of parallelism:
  • Warp level: ideally, threads execute the same instruction.
  • Thread-block level: cooperative execution.
  • SM level: resource sharing to hide latency; occupancy of multiple blocks depends on resource requirements.
  • GPU level: occupancy scaling depends on the level of parallelism.

Performance-influencing factors: warp efficiency, occupancy, locality (especially spatial, not necessarily temporal), etc. See the sketch below.
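As a reminder of how these levels appear in code, here is a minimal, hypothetical CUDA sketch (a simple scaling kernel, not from the benchmarks); grid and block sizes are illustrative only.

    #include <cstdio>

    // Each thread handles one element. Threads are grouped into warps of 32
    // (warp level), warps into thread blocks (block level), and blocks are
    // scheduled across the SMs of the GPU (SM and GPU level).
    __global__ void scale(float *x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)          // divergence only in the last, partially used warp
            x[i] = a * x[i];
    }

    int main() {
        int n = 1 << 20;
        float *x;
        cudaMallocManaged(&x, n * sizeof(float));
        for (int i = 0; i < n; i++) x[i] = 1.0f;

        int block = 128;                      // threads per block = 4 warps
        int grid  = (n + block - 1) / block;  // blocks spread over the SMs
        scale<<<grid, block>>>(x, 2.0f, n);
        cudaDeviceSynchronize();

        printf("x[0] = %f\n", x[0]);
        cudaFree(x);
        return 0;
    }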

[Figure: GPU hierarchy — multiple streaming multiprocessors (SMs), each running many warps and sharing a register file and cache/shared memory, backed by an L2 cache and global memory (HBM).]

SLIDE 7

Warp Efficiency

Warp level (32 threads per warp): "The ratio of the average active threads per warp to the maximum number of threads per warp."

    i = map_func(thread, block);
    s = 0;
    for ( int j = ia[i]; j < ia[i+1]; j++ )
        s += A[j] * x[ja[j]];
    y[i] = s;

    if ( threadIdx.x < N ) {
        X;
        Y;
    } else {
        A;
        B;
    }
    Z;

[Diagram: per-warp execution timeline showing the X/Y and A/B paths serialized, followed by Z.]

Control divergence vs. latency divergence: nvprof reports an event-based warp efficiency that accounts only for control divergence.
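Written as a formula, the quoted definition is:

$$\text{Warp efficiency} = \frac{\text{average active threads per warp}}{32}$$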

SLIDE 8

Occupancy

Thread-block level / SM level (64 warps per SM), GPU level (80+ SMs per Volta GPU): "The ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor."

nvprof reports active occupancy (measured while the GPU is actively executing a kernel).

Resource sharing impacting occupancy: block-level resources (register file, shared memory).
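Written as a formula, the quoted definition is (64 being the maximum number of warps per SM on Volta):

$$\text{Occupancy} = \frac{\text{average active warps per active cycle}}{64}$$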

[Figure: GPU hierarchy diagram (SMs, warps, register file, shared memory, L2 cache, HBM), as on Slide 6.]

SLIDE 9

Scaling Curves

Volta GPU: possible to control the SM count.

NAS Parallel Benchmarks: LU, CUDA.

Typical scaling plot: provides performance as the SM count changes, but gives no insight into root causes. Why does Class B scale better than Class A, while Class C is no better than Class B?


[Figure: MFlop/s vs. SM count (2 to 80) for NAS LU (CUDA), Classes A, B, and C.]

SLIDE 10

Roofline Strong Scaling Trajectory

[Figure: roofline scaling trajectory with points at SM=2, SM=32, and SM=80.]

Roofline scaling trajectories are a diagnostic technique for scaling issues: track performance while changing the level of concurrency. Ideal behavior: Δy = increase in computational resources or share of bandwidth; Δx = 0 (no change in arithmetic intensity).
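One way to state this ideal behavior (my paraphrase, with p denoting the number of active SMs): the trajectory point moves straight up on the roofline plot, in proportion to the added compute or bandwidth share,

$$\Delta \mathrm{AI} = 0, \qquad \frac{\text{Flop/s}(p_2)}{\text{Flop/s}(p_1)} \approx \frac{p_2}{p_1}.$$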

SLIDE 11

Typical Strong Scaling Trajectory

[Figure: annotated roofline scaling trajectory with points at SM=2, SM=32, and SM=80.]

  • Inefficiency at low SM count is typically correlated with low warp efficiency.
  • AI degradation is due to excessive HBM data movement to the L2 cache.
  • A measured throughput improvement smaller than the potential improvement under AI degradation indicates loss of occupancy while scaling.

AI is computed as

$$\mathrm{AI} \leftarrow \frac{\text{canonical flop count}}{(\text{DRAM read transactions} + \text{DRAM write transactions}) \times 32}$$
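A minimal helper for evaluating this formula (my sketch; the transaction counts correspond to nvprof's dram_read_transactions and dram_write_transactions, each covering 32 bytes):

    /* HBM arithmetic intensity from a canonical flop count and DRAM
       transaction counts, assuming 32 bytes per transaction. */
    double hbm_arithmetic_intensity(double canonical_flops,
                                    double dram_read_trans,
                                    double dram_write_trans) {
        double hbm_bytes = (dram_read_trans + dram_write_trans) * 32.0;
        return canonical_flops / hbm_bytes;
    }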

SLIDE 12

Evaluation Testbed

  • NAS Parallel Benchmarks: FT (spectral methods), CG (sparse linear algebra), MG (multi-grid PDE), LU (regular-sparse lower and upper triangular systems), BT and SP (mini-apps), adaptive mesh benchmark.
  • Programming models: CUDA, OpenACC.
  • Problem sizes: Classes A, B, and C (4x increase in problem size between classes).
  • Low-level tools: nvprof for data movement; flop count is based on the application estimate (constant per class while scaling).

OLCF Summit, POWER9 + V100 node

SLIDE 13

GPU Programming Model Influence

Vendor programming language (CUDA):
  • Full code rewrite
  • Full leverage of architectural features (e.g., texture, shared memory, etc.)
  • Possible change of data layout

Pragma-based language (OpenACC, OpenMP):
  • Preserves the data layout
  • Incremental source code annotation
  • Same loop structure for CPU and GPU

In both cases the application is decomposed into offloaded kernels/loops; a minimal sketch contrasting the two models follows.
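To make the contrast concrete, here is a minimal, hypothetical sketch (a simple vector scale, not taken from the NPB codes) of the same loop in each model; kernel and variable names are illustrative only.

    /* Pragma-based model (OpenACC): the CPU loop structure and data layout
       are preserved; offloading is an incremental annotation. */
    void scale_acc(float *x, float a, int n) {
        #pragma acc parallel loop copy(x[0:n])
        for (int i = 0; i < n; i++)
            x[i] = a * x[i];
    }

    /* Vendor model (CUDA): the loop becomes a kernel, and the code can
       exploit architectural features (shared memory, textures, layout
       changes) at the cost of a larger rewrite. */
    __global__ void scale_kernel(float *x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = a * x[i];
    }

    void scale_cuda(float *x_dev, float a, int n) {
        int block = 128;
        scale_kernel<<<(n + block - 1) / block, block>>>(x_dev, a, n);
    }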

SLIDE 14

Roofline Scaling Trajectories Case Studies

Good scaling (e.g., ACC MG): locality at high concurrency; some influence of problem size on locality.

Needs improvement (e.g., ACC FT): scaling problem due to occupancy.

[Figures: roofline scaling trajectories for the two cases, Classes A, B, and C, with HBM ceilings at SMs=2 (35 GFlop/s) and SMs=80 (829 GFlop/s).]

SLIDE 15

GPU Scaling Trajectories

Scaling plot vs. Roofline scaling trajectory

[Figures: side by side, the MFlop/s vs. SM count scaling plot (2 to 80 SMs) and the roofline scaling trajectory for the same runs, Classes A, B, and C, with ADD ceilings at SMs=2 (88 GFlop/s) and SMs=80 (3536 GFlop/s) and HBM ceilings at SMs=2 (35 GFlop/s) and SMs=80 (829 GFlop/s).]

SLIDE 16

Scaling of NAS BT

Behavior at low concurrency (warp efficiency) vs. occupancy; influence of the programming model.

[Figures: roofline scaling trajectories for the BT CUDA and BT OpenACC implementations (Classes A, B, and C where available), with HBM ceilings at SMs=2 (35 GFlop/s) and SMs=80 (829 GFlop/s); the CUDA implementation's AI is annotated.]

SLIDE 17

Warp Efficiency (CUDA vs. OpenACC)

[Figures: warp efficiency (0.0 to 1.0) per kernel (x_solve, y_solve, z_solve, compute_rhs) for the BT CUDA and BT OpenACC implementations, Classes A, B, and C.]

Low-level metric confirmation.

SLIDE 18

Occupancy (CUDA vs. OpenACC)

[Figures: occupancy (0.0 to 1.0) vs. SM count (2 to 80) per kernel for the CUDA and OpenACC implementations; the per-kernel breakdown covers jacu_buts, jacld_blts, rhs_kern_x, rhs_kern_y, and rhs_kern_z.]

LU shows variation per kernel!

SLIDE 19

Optimization of CUDA BT

Tuning the tile size improves warp efficiency.

[Figures: warp efficiency and occupancy (0.0 to 1.0) per kernel (x_solve, y_solve, z_solve, compute_rhs), across SM counts (2 to 80) and Classes A, B, and C, for the BT CUDA base and optimized implementations.]

Reducing occupancy is not a problem as long as the trajectory gradient does not change!
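As an illustration of the kind of tile-size tuning involved (a hypothetical kernel and tile choice, not the actual BT code): making the fastest-varying tile dimension a multiple of the warp size keeps warps fully populated, trading some occupancy for warp efficiency.

    // Hypothetical tiled kernel. A 25x4 tile gives 100 threads per block,
    // so the last of its 4 warps has only 4 active threads (warp efficiency
    // ~78% even without branching). Padding to TILE_X = 32 keeps every warp
    // full; fewer blocks may fit per SM, but as the slide notes, lower
    // occupancy is acceptable as long as the trajectory gradient holds.
    #define TILE_X 32
    #define TILE_Y 4

    __global__ void rhs_tile(const float *in, float *out, int nx, int ny) {
        int i = blockIdx.x * TILE_X + threadIdx.x;
        int j = blockIdx.y * TILE_Y + threadIdx.y;
        if (i < nx && j < ny)               // guard for the padded region
            out[j * nx + i] = 0.5f * in[j * nx + i];
    }

    void launch_rhs(const float *in, float *out, int nx, int ny) {
        dim3 block(TILE_X, TILE_Y);
        dim3 grid((nx + TILE_X - 1) / TILE_X, (ny + TILE_Y - 1) / TILE_Y);
        rhs_tile<<<grid, block>>>(in, out, nx, ny);
    }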

SLIDE 20

Distinguishing Occupancy Issues (Active vs. Inactive)

Roofline shadow curve: the performance if invocation overhead were negligible.

Insights:
  • Hitting the roof with the shadow curve indicates inactive occupancy of SMs.
  • Influence of the programming model: incremental loop parallelization, and the influence of kernel launches on performance.

[Figures: for LU CUDA and LU OpenACC, roofline scaling trajectories (Classes A and B, plus an "Amdahl Shadow" curve) with HBM ceilings at SMs=2 (35 GFlop/s) and SMs=80 (829 GFlop/s), and per-class kernel invocation counts (up to about 1e+06) for jacu_buts, jacld_blts, the rhs kernels, other kernels, and cuRoutines; one configuration is marked "out of memory".]

SLIDE 21

Other Case Studies

Other case studies appear in the paper.

[Figures: roofline scaling trajectories (Classes A, B, and C where available) with HBM ceilings at SMs=2 (35 GFlop/s) and SMs=80 (829 GFlop/s), for LU CUDA rhs_kernel_x, LU CUDA rhs_kernel_z, CG OpenACC (total), LU OpenACC (total), LU CUDA (total), and SP CUDA (total).]

SLIDE 22

Summary and Conclusions

Performance analysis for post-Moore architectures is a daunting problem. Both micro and macro analyses are important; the roofline scaling trajectory bridges the two kinds of analysis.

  • Visually captures low-level issues: warp efficiency, occupancy gradient, locality.
  • Assesses application behavior against a performance model.
  • Improves performance, which involves trade-offs between low-level optimizations (e.g., warp efficiency vs. occupancy): more than 2x speedup for multiple applications; the paper carries multiple case studies.

[Figure: annotated roofline scaling trajectory (SM=2, 32, 80), as on Slide 11.]
SLIDE 23

Additional Resources

  • LBL Roofline: https://crd.lbl.gov/departments/computer-science/PAR/research/roofline
  • nvprof: https://docs.nvidia.com/cuda/profiler-users-guide/index.html
  • NERSC Recipe: http://www.nersc.gov/users/application-performance/measuring-arithmetic-intensity/