Performance Analysis of GPU Programming Models using the Roofline - PowerPoint PPT Presentation

Performance Analysis of GPU Programming Models using the Roofline Scaling Trajectories Khaled Ibrahim, Sam Williams, and Leonid Oliker Bench’19 Conference, Nov. 14, 2019 Denver, Colorado UNIVERSITY OF CALIFORNIA

Acknowledgements § This material is based upon work supported by the Advanced Scientific Computing Research Program in the U.S. Department of Energy, Office of Science, under Award Number DE-AC02-05CH11231. § This material is based upon work supported by the DOE RAPIDS SciDAC Institute. § This research used resources of the National Energy Research Scientific Computing Center (NERSC), which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-05CH11231. § This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Layout and Contributions CLASS A 250000 ● ● ● CLASS B ● Contributions ● CLASS C MFlop/s • The roofline-scaling trajectory technique ● 100000 for the performance analysis on GPU ● architecture. ● ● 0 • Comparison between CUDA and 02 04 08 16 32 48 64 80 SM count OpenACC models for GPU offloading. Potential • Case studies for tuning for performance throughput improvement with AI degradation (> 2x improvement). Measured < Potential indicates loss of occupancy while SM=80 scaling SM=32 Inefficiency at low SM count is typically correlated Measured throughput SM=2 with low warp efficiency improvement AI degradation due to excessive HBM data movement to the L2 cache

Performance Analysis: Micro vs. Macro Micro Analysis: (typically attributed to a particular code site.) Miss rate, vectorization, Load imbalance? But, are they impacting performance? #pragma #pragma omp omp parallel parallel for for Macro Analysis: (Model-Based) for for ( ( int int i = 0; = 0; i < n; < n; i++ ) { ++ ) { s = 0; s = 0; Are we utilizing resources effectively? for for( ( int int j = j = ia ia[i]; j < ]; j < ia ia[i+1]; [i+1]; j++ j++ ) ) s += A[j] * x[ja[j]]; s += A[j] * x[ja[j]]; Roofline technique, logGP, etc y[i] = s; y[ ] = s; } Clear performance expectation. /* Is it vectorized? */ /* Is it vectorized? */ /* Is it balances? */ /* Is it balances? */ But, no source code attribution for for( ( int int j = j = ia ia[i]; j < ]; j < ia ia[i+1]; [i+1]; j++ j++ ) ) #pragma omp #pragma omp parallel parallel for for Vendor Hardware Events: for ( for ( int s += A[j] * x[ja[j]]; s += A[j] * x[ja[j]]; int i = 0; = 0; i < n; < n; i++ ) { ++ ) { Could serve both micro/macro analyses! But, they are many, hard to understand and time consuming to collect

7 Roofline Performance Model (Empirical) § Performance Model Bounds DRAM Bandwidth § Peak Flop/s Peak Flop/s Compute s / § B G s § Typical HPC application e Attainable Flop/s Attainable Flop/s / B h c G a c 2 s L M / Performance: Flop/s B A G § R M D C A Arithmetic intensity: Flop/Byte M R § D § DRAM Roofline DRAM-bound Compute-bound § Define an AI for each level of cache/memory Arithmetic Intensity (Flop:Byte) Arithmetic Intensity (Flop:Byte) § CARM Roofline (Cache-aware Roofline) Williams et al, "Roofline: An Insightful Visual Performance Model For Multicore Architectures", CACM, 2009. § Define a single AI for all levels of Aleksandar Ilic, et al. Cache-aware Roofline model: Upgrading the loft, IEEE memory based on L1 data movement. Computer Architecture Letters, vol. 13, n. 1, pp. 21-24, January 2014

GPU Parallelism and Performance Streaming Multiprocessor (SM) . . . Streaming Multiprocessor (SM) Streaming Multiprocessor (SM) Multi-Level of parallelism Warp . . . Warp Level : Ideally threads execute the same instruction Warp Warp Thread block Level: Cooperative execution Warp SM Level: Resource sharing to hide latency. Occupancy of multiple blocks depends on resource requirements. GPU level: Occupancy Scaling depends on level of parallelism Performance influencing factor Cache/Shared Register File Warp efficiency Memory Occupancy L2 Cache Locality (especially spatial, not necessarily temporal) Global Memory (HBM) etc.

Warp Efficiency “The ratio of the average active threads per warp to the maximum number of threads per warp” Control Divergence: nvprof : event based warp efficiency due only to control divergence. X Y Z Warp Level : (32 per warp) Latency Divergence: A B Z If If ( ( threadIdx.X threadIdx.X < N ) { < N ) { X; X; i = = map_func map_func(thread,block thread,block) ) Y; Y; s = 0; s = 0; } else { } else { for( for ( int int j = j = ia ia[i]; j < ]; j < ia ia[i+1]; [i+1]; j++ j++ ) ) A; A; s += A[j] * x[ja[j]]; s += A[j] * x[ja[j]]; B; B; y[ y[i] = s; ] = s; } Z; Z;

Occupancy Streaming Multiprocessor (SM) . . . Streaming Multiprocessor (SM) Streaming Multiprocessor (SM) “The ratio of the average active warps per active cycle Warp . . . to the maximum number of warps supported on a Warp multiprocessor” Warp nvprof: report active occupancy (while GPU is active Warp executing a kernel) Thread block Level/SM Level: 64 warps per SM GPU level: Cache/Shared Register File Memory 80+ SM per Volta GPU Resource Sharing, impacting occupancy: L2 Cache Block level (register file, Shared mem) Global Memory (HBM)

12 Scaling Curves CLASS A 250000 ● ● Volta GPU ● CLASS B ● ● CLASS C Possible to control SM count MFlop/s NAS Parallel Benchmarks ● 100000 LU, CUDA ● Typical Scaling Plot ● Provide performance with SM change, ● No insights into root causes. 0 Why Class B scale better than A, but Class C is not better than B? 02 04 08 16 32 48 64 80 SM count

Roofline Strong Scaling Trajectory Roofline Scaling Trajectories Diagnostic technique for scaling issue. Track performance while changing the level of concurrency. SM=80 Ideal behavior: SM=32 △ y = increase in computational resources or share of BW SM=2 △ x=0 (No change in arithmetic intensity)

Typical Strong Scaling Trajectory Potential cannonical flop count AI ← throughput ( dram read trans + dram write trans ) × 32 improvement with AI degradation Measured < Potential indicates loss of occupancy while SM=80 scaling SM=32 Inefficiency at low SM count is typically Measured throughput SM=2 correlated with low warp improvement efficiency AI degradation due to excessive HBM data movement to the L2 cache

Evaluation Testbed NAS Parallel benchmarks OLCF Summit, P9 - V100 node FT : spectral methods CG : sparse linear algebra MG : multi-grid PDE LU : regular-sparse lower and upper triangular system BT, SP : mini-apps adaptive mesh benchmark. Programming Models: Cuda, OpenACC Problem Sizes: Class A, B, and C (4x problem increase while changing classes) Low level Tools: nvprof for data movement. Flop count is based on application estimate (constant per class while scaling)

GPU Programming Model Influence Pragma-based Language (OpenACC, OpenMP) Vender Programming Language (CUDA) • Full code rewrite • Preserve data layout • Full leverage of architectural features • Incremental source code annotation (e.g., texture, shared memory, etc) • Same loop structure for CPU and GPU. • Possible change of data layout Kernel/Loop Kernel/Loop Kernel/Loop Kernel/Loop Kernel/Loop Kernel/Loop

Roofline Scaling Trajectories Case Studies Good Scaling: (e.g., ACC MG) 500 500 Locality at high concurrency Class A Class A Some Influence of problem size on locality Class B ● Class B ● Class C Class C ) HBM(SMs=80) (829) 9 2 GFlop/s 8 GFlop/s ● ● ● ( ● ) 50 0 ● 50 8 Need improvement = s M ● S ( M ● ● ● ● ● B ● H ● (e.g., ACC FT) ● ● ● ) Scaling problem due to occupancy HBM(SMs=2) (35) 5 5 5 3 ( ) 2 = s M S ( M B H 1 1 0.01 0.01 0.02 0.02 0.05 0.05 0.10 0.10 0.20 0.20 0.50 0.50 1.00 1.00 Arithmetic Intensity (Flops/Byte) Arithmetic Intensity (Flops/Byte)

18 GPU Scaling Trajectories Scaling plot vs. Roofline scaling trajectory 5000 ADD(SMs=80) (3536) Class A CLASS A 250000 ● ● ● Class B ● CLASS B ● Class C ● CLASS C 500 MFlop/s ● GFlop/s HBM(SMs=80) (829) ADD(SMs=2) (88) 100000 ● ● ● ● 50 ● ● ● ● ● ● 0 HBM(SMs=2) (35) 5 ● 02 04 08 16 32 48 64 80 SM count 1 0.01 0.02 0.05 0.10 0.20 0.50 1.00 2.00 5.00 Arithmetic Intensity (Flops/Byte)

Scaling of NAS BT Behavior at low concurrency (warp efficiency) vs. occupancy Influence of programming model BT OpenACC Implementation BT CUDA Implementation 500 500 Class A Class A Class B ● Class B ● Class C HBM(SMs=80) (829) ● ● ● GFlop/s ● ● CUDA AI ● ● ● 50 GFlop/s ● 50 HBM(SMs=80) (829) ● ● ● ● ● ● HBM(SMs=2) (35) 5 ● 5 HBM(SMs=2) (35) 1 1 0.01 0.02 0.05 0.10 0.20 0.50 1.00 2.00 0.01 0.02 0.05 0.10 0.20 0.50 1.00 2.00 Arithmetic Intensity (Flops/Byte) Arithmetic Intensity (Flops/Byte)

Performance Analysis of GPU Programming Models using the Roofline - PowerPoint PPT Presentation

Performance Analysis of GPU Programming Models using the Roofline Scaling Trajectories Khaled Ibrahim, Sam Williams, and Leonid Oliker Bench19 Conference, Nov. 14, 2019 Denver, Colorado UNIVERSITY OF CALIFORNIA Acknowledgements This

GPU programming in Haskell Henning Thielemann 2015-01-23 GPU programming in Haskell Motivation:

GPU PROGRAMMING 2 GPU Programming Assignment 4 Consists of

Status of GPU offloading on Wayland Axel Davy FOSDEM 2014 Status of GPU offloading on Wayland

Motivation to Learn GPGPU Julius Parulek Why to Learn About GPU? Computational power of GPU vs.

GPU programming Dr. Bernhard Kainz 1 Overview About myself Last week Motivation GPU

Performance Evaluation of a Multithreaded GPU Using CUDA GPU architecture GeForce 8800 GPU

Understanding GPU performance How to get peak FLOPS (GPU version) Kenjiro Taura 1 / 7 Contents

UNIFIED MEMORY ON PASCAL AND VOLTA Nikolay Sakharnykh - May 10, 2017 1 HETEROGENEOUS

Advancements in V-Ray RT GPU Vlado Koylazov, CTO & Co-founder Blagovest Taskov, RT GPU Team

Super GPU & Super Kernels: Make programming of multi-GPU systems easy Michael Frumkin, May 8,

GPU peak performance vs. CPU Squeezing GPU performance Peak Double Precision FLOPS Peak Memory

GPU Architecture and chitecture and GPU Ar The good The good The bad The bad

Real-Time GPU Management Heechul Yun 1 This Week Topic: General Purpose Graphic Processing

HIGH-PERFORMANCE GPU VIDEO ENCODING ABHIJIT PATAIT SR. MANAGER, NVIDIA AGENDA GPU Video

Use Tesla to provide first GPU VM Service in China Feng Zhu

THEIA GPU Open Source multicore programmable GPU Problem Statement Develop an open source 3D

E NCR YP T E D ME S S AGE S FR OM T HE HE IGHT S OF CR Y PT OMANIA Craig Gentry,

Developing Exercise Control Tools ; Linking Simulation Models with Others LTC Son, Eun Chul ;

Dawn Song dawnsong@cs.berkeley.edu 1 What is a botnet? An army of compromised hosts

Feature Extraction with Description Logics Functional Subsumption Rodrigo de Salvo Braz Dan

Prospects for a Robust Poll terror attack risk higher . BBC News, 18th April 2005.

Inclusive Growth in Ghana By Robert D. Osei, Richmond Atta-Ankomah & Monica Lambon-Quayefio

Evolving Secure Information Systems through Attack Simulation Elmar Kiesling, Andreas Ekelhart,

w-Eiux.ee#IE.XI=lXI,qIdadicmodce- Couve KEE ) Ho ( Xp ht Classica wth modeler forms Kao ) 1=0

Performance Analysis of GPU Programming Models using the Roofline - PowerPoint PPT Presentation

Performance Analysis of GPU Programming Models using the Roofline Scaling Trajectories Khaled Ibrahim, Sam Williams, and Leonid Oliker Bench19 Conference, Nov. 14, 2019 Denver, Colorado UNIVERSITY OF CALIFORNIA Acknowledgements This

GPU programming in Haskell Henning Thielemann 2015-01-23 GPU programming in Haskell Motivation:

GPU PROGRAMMING 2 GPU Programming Assignment 4 Consists of

Status of GPU offloading on Wayland Axel Davy FOSDEM 2014 Status of GPU offloading on Wayland

Motivation to Learn GPGPU Julius Parulek Why to Learn About GPU? Computational power of GPU vs.

GPU programming Dr. Bernhard Kainz 1 Overview About myself Last week Motivation GPU

Performance Evaluation of a Multithreaded GPU Using CUDA GPU architecture GeForce 8800 GPU

Understanding GPU performance How to get peak FLOPS (GPU version) Kenjiro Taura 1 / 7 Contents

UNIFIED MEMORY ON PASCAL AND VOLTA Nikolay Sakharnykh - May 10, 2017 1 HETEROGENEOUS

Advancements in V-Ray RT GPU Vlado Koylazov, CTO &amp; Co-founder Blagovest Taskov, RT GPU Team

Super GPU &amp; Super Kernels: Make programming of multi-GPU systems easy Michael Frumkin, May 8,

GPU peak performance vs. CPU Squeezing GPU performance Peak Double Precision FLOPS Peak Memory

GPU Architecture and chitecture and GPU Ar The good The good The bad The bad

Real-Time GPU Management Heechul Yun 1 This Week Topic: General Purpose Graphic Processing

HIGH-PERFORMANCE GPU VIDEO ENCODING ABHIJIT PATAIT SR. MANAGER, NVIDIA AGENDA GPU Video

Use Tesla to provide first GPU VM Service in China Feng Zhu

THEIA GPU Open Source multicore programmable GPU Problem Statement Develop an open source 3D

E NCR YP T E D ME S S AGE S FR OM T HE HE IGHT S OF CR Y PT OMANIA Craig Gentry,

Developing Exercise Control Tools ; Linking Simulation Models with Others LTC Son, Eun Chul ;

Dawn Song dawnsong@cs.berkeley.edu 1 What is a botnet? An army of compromised hosts

Feature Extraction with Description Logics Functional Subsumption Rodrigo de Salvo Braz Dan

Prospects for a Robust Poll terror attack risk higher . BBC News, 18th April 2005.

Inclusive Growth in Ghana By Robert D. Osei, Richmond Atta-Ankomah &amp; Monica Lambon-Quayefio

Evolving Secure Information Systems through Attack Simulation Elmar Kiesling, Andreas Ekelhart,

w-Eiux.ee#IE.XI=lXI,qIdadicmodce- Couve KEE ) Ho ( Xp ht Classica wth modeler forms Kao ) 1=0

Advancements in V-Ray RT GPU Vlado Koylazov, CTO & Co-founder Blagovest Taskov, RT GPU Team

Super GPU & Super Kernels: Make programming of multi-GPU systems easy Michael Frumkin, May 8,

Inclusive Growth in Ghana By Robert D. Osei, Richmond Atta-Ankomah & Monica Lambon-Quayefio