SLIDE 1

Optimizing Large Reductions in BerkeleyGW on GPUs

Using OpenMP and OpenACC

Rahulkumar Gayatri, Charlene Yang

National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory
March 8, 2019

rgayatri@lbl.gov, cjyang@lbl.gov

slide-2
SLIDE 2

Motivation

Why Attend this Talk

  • 5 of the top 10 supercomputers use NVIDIA GPUs
  • Most codes optimized for CPUs now have to be rewritten for GPUs
  • Compiler-directive-based approaches are attractive due to their ease of use
  • Large codes can be ported incrementally
  • This talk provides a detailed analysis of the current state of directive-based programming models:
  • Their performance compared to optimized CUDA code
  • Supported compilers
  • Differences in compiler implementations

SLIDE 3

Overview

Outline of the Presentation

  • BerkeleyGW, a materials science code
  • General Plasmon Pole (GPP), a mini-app
  • Baseline CPU implementation
  • GPU programming models (OpenMP, OpenACC, CUDA)
  • GPP on GPU
  • Naive implementation
  • Optimized implementation
  • Compare the approaches and performance of each implementation
  • Backport the GPU implementation to the CPU for performance portability

SLIDE 4

BerkeleyGW

BerkeleyGW

  • The GW method is an accurate approach to simulating the excited-state properties of materials:
  • What happens when you add or remove an electron from a system?
  • How do electrons behave when you apply a voltage?
  • How does the system respond to light or X-rays?
  • Stand-alone kernels are extracted from BerkeleyGW and run as mini-apps

SLIDE 5

Test Case Kernel

General Plasmon Pole (GPP)

  • Mini-app from BerkeleyGW
  • Computes the electron self-energy using the General Plasmon Pole approximation
  • Characteristics of GPP:
  • Reduction over a series of double-complex arrays involving multiply, divide and add instructions (partial FMA)
  • For typical calculations the arithmetic intensity (Flops/Byte) is between 1 and 10, i.e., the kernel has to be optimized for memory locality and vectorization/SIMT efficiency

SLIDE 6

Complex Number in C/C++

Complex Number Class

  • BerkeleyGW consists of double-complex number calculations
  • std::complex difficulties:
  • Performance issues
  • Difficult to vectorize
  • Cannot offload operations onto the device using OpenMP 4.5
  • thrust::complex:
  • Challenges in offloading complex operator routines onto the device
  • Built an in-house complex class (minimal sketch below)
  • 2 doubles on CPU
  • double2 vector type on GPU
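
A minimal sketch of what such an in-house class can look like, assuming the CPU build stores two doubles and the CUDA build maps them onto the double2 layout; the name GPUComplex and the operator set shown here are illustrative, not the actual BerkeleyGW implementation:

// Sketch of an in-house double-complex type (illustrative)
#ifdef __CUDACC__
  #define CPLX_DECOR __host__ __device__
#else
  #define CPLX_DECOR
#endif

class GPUComplex {
public:
    double re, im;   // two doubles on the CPU; same layout as double2 on the GPU

    CPLX_DECOR GPUComplex(double r = 0.0, double i = 0.0) : re(r), im(i) {}

    // Only the operators the kernel needs, so they inline cleanly on the device
    CPLX_DECOR GPUComplex operator+(const GPUComplex& b) const {
        return GPUComplex(re + b.re, im + b.im);
    }
    CPLX_DECOR GPUComplex operator*(const GPUComplex& b) const {
        return GPUComplex(re * b.re - im * b.im, re * b.im + im * b.re);
    }
    CPLX_DECOR GPUComplex operator*(double s) const {
        return GPUComplex(re * s, im * s);
    }
};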

SLIDE 7

GPP

GPP pseudocode - reduction in the innermost loop

Code

for(X){                              // X = 512
  for(N){                            // N = 1638
    for(M){                          // M = 32768
      for(int iw = 0; iw < 3; ++iw){
        // Some computation
        output[iw] += ...
      }
    }
  }
}

  • Memory footprint: O(2 GB)
  • Typical single-node problem size
  • output is double complex

SLIDE 8

GPP On CPU

SLIDE 9

GPP CPU Parallelization

OpenMP 3.0 parallelization of GPP

#pragma omp parallel for reduction(+: output_re[0-2], output_im[0-2])
for(X){
  for(N){
    for(M){                              // Vectorize
      for(int iw = 0; iw < 3; ++iw){     // Unroll
        // Store local
      }
    }
    for(int iw = 0; iw < 3; ++iw){
      output_re[iw] += ...
      output_im[iw] += ...
    }
  }
}

  • Unroll the innermost iw-loop
  • Vectorize the M-loop
  • collapse increased the runtime by 10%
  • Check compiler reports (intel/2018) to guarantee vectorization and unrolling
  • Flatten the arrays into scalars with compilers that do not support array reduction (sketched below)
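
Where the compiler does not support array reductions, the flattening works as in the following sketch; the loop bounds come from the slide above, while the loop body is a stand-in for the real GPP computation:

#include <cstdio>

int main() {
    const int X = 512, N = 1638;
    // Flatten the 3-element output array into scalars so the reduction clause
    // also works with compilers that lack array-reduction support.
    double out0 = 0.0, out1 = 0.0, out2 = 0.0;

    #pragma omp parallel for reduction(+: out0, out1, out2)
    for (int x = 0; x < X; ++x) {
        for (int n = 0; n < N; ++n) {
            double t = 1.0 / (1.0 + x + n);   // stand-in for the real kernel body
            out0 += t;
            out1 += 2.0 * t;
            out2 += 3.0 * t;
        }
    }
    std::printf("%f %f %f\n", out0, out1, out2);
    return 0;
}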

SLIDE 10

GPP Performance on CPU

Runtime of GPP on Cori

[Figure: Performance of GPP on Cori; runtime T (secs) per CPU architecture (Haswell, Xeon Phi); lower is better]

  • Performance numbers from Cori at NERSC, LBNL
  • Haswell
  • Xeon Phi
  • intel/2018 compilers
  • Perfect scaling would make the KNL execution 4× faster than Haswell
  • The KNL implementation of GPP is approximately 3.5× faster than Haswell

SLIDE 11

GPP Performance on CPU

Runtime of GPP on Cori

Xeon Phi - 2.2 seconds

[Figure: Performance of GPP on Cori; runtime T (secs) per CPU architecture (Haswell, Xeon Phi); lower is better]

  • Performance numbers from Cori at LBNL
  • Haswell
  • Xeon Phi
  • intel/2018 compilers
  • Perfect scaling would make the KNL execution 4× faster than Haswell
  • The KNL implementation of GPP is 3× faster than Haswell

SLIDE 12

GPP On GPU

SLIDE 13

Parallelism on GPU: KNL to Volta

GPU Hardware

KNL vs GPU

  • Going from 272 threads on KNL to ~164K threads on the GPU
  • 80 SMs
  • 2048 threads within an SM

SLIDE 14

GPU Programming Models

Programming models used to port GPP to the GPU

  • OpenMP 4.5
  • Cray
  • XL(IBM)
  • Clang
  • GCC
  • OpenACC
  • PGI
  • Cray
  • CUDA

Volta GPU available on Cori and Summit

SLIDE 15

GPU Programming Models: OpenMP 4.5

OpenMP offloading to GPU

  • OpenMP 4.5
  • Cray
  • XL(IBM)
  • Clang
  • GCC
  • OpenACC
  • PGI
  • Cray
  • CUDA

Volta GPU available on Cori and Summit

SLIDE 16

OpenMP 4.5 Offload Directives

OpenMP directives to offload code-blocks onto GPUs

Directives to distribute work across GPU threads

target       - offload the code block onto the device
teams        - spawn one or more thread teams
distribute   - distribute loop iterations onto the master threads of the teams
parallel for - distribute loop iterations among the threads of a threadblock
simd         - implementation is compiler dependent

#pragma omp target teams distribute
for()   // Distribute the loop across threadblocks
  #pragma omp parallel for
  for()   // Distribute the loop across threads within a threadblock

SLIDE 17

OpenMP 4.5 Data Movement

OpenMP 4.5 directives to move data between host and device

Allocate and delete data on the device

#pragma omp target enter data map(alloc: list-of-data-structures[:])
#pragma omp target exit data map(delete: list-of-data-structures[:])

Update data on device and host

#pragma omp target update to/from (list-of-data-structures[:])
to   - HostToDevice
from - DeviceToHost

Clauses to use with target directives

map(to:...) map(from:...) map(tofrom:...)
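
Putting these directives together, a minimal sketch of the typical allocate/update/compute/free lifecycle; the array name, its size and the reduction loop are illustrative, not taken from GPP:

#include <cstdio>

int main() {
    const int n = 1 << 20;                                  // illustrative size
    double *input = new double[n];
    for (int i = 0; i < n; ++i) input[i] = 1.0;
    double sum = 0.0;

    #pragma omp target enter data map(alloc: input[0:n])   // allocate on the device
    #pragma omp target update to(input[0:n])                // copy host -> device

    #pragma omp target teams distribute parallel for reduction(+: sum) map(tofrom: sum)
    for (int i = 0; i < n; ++i)
        sum += input[i];

    #pragma omp target exit data map(delete: input[0:n])    // release the device copy
    delete[] input;
    std::printf("sum = %f\n", sum);
    return 0;
}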

SLIDE 18

OpenMP 4.5 Routines on Device

OpenMP 4.5 directives to offload routines onto the device

Routines

#pragma omp declare target
void foo();
#pragma omp end declare target

Not necessary if routines are inlined
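
For example, a routine called from inside a target region can be declared as in the sketch below; scale and run are illustrative names, not BerkeleyGW routines:

#pragma omp declare target
double scale(double x) { return 2.0 * x; }   // callable from device code
#pragma omp end declare target

void run(double *a, int n) {
    #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = scale(a[i]);
}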

SLIDE 19

OpenMP Offload of GPP

Naive OpenMP 4.5 implementation of GPP

#pragma omp target teams distribute map(to:...) \
        map(tofrom: output_re[0-2], output_im[0-2])
for(X){
  #pragma omp parallel for
  for(N){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    for(int iw = 0; iw < 3; ++iw){
      #pragma omp atomic
      output_re[iw] += ...
      #pragma omp atomic
      output_im[iw] += ...
    }
  }
}

  • Distribute the M-loop across threadblocks
  • Distribute the N-loop among threads in a threadblock
  • No array reduction with OpenMP 4.5 directives, hence atomics are used to maintain correctness
  • Parallelizing the M-loop increases the overhead of synchronization

SLIDE 20

Optimized Implementation

Optimized implementation with OpenMP 4.5

#pragma omp target enter data map(alloc: input[0:X])
#pragma omp target update to(input[0:X])
#pragma omp target teams distribute \
        parallel for collapse(2) \
        reduction(+: output_re(0,1,2), output_im(0,1,2))
for(X){
  for(N){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    output_re(0,1,2) += ...
    output_im(0,1,2) += ...
  }
}
#pragma omp target exit data map(delete: input)

  • XL, Clang, Cray and GCC gave the best performance with the same parallelization technique
  • Collapse the N and M loops and distribute them across threadblocks and threads within a block
  • Allocating device memory beforehand with #pragma omp target enter/exit data improved the performance of the kernel by 10%
  • The reduction gave a 3× boost in performance
  • Flatten the arrays to scalars

SLIDE 21

GPP on GPU

Performance of GPP on V100 with OpenMP 4.5

[Figure: Performance of GPP on V100 with OpenMP 4.5; runtime T (secs) for GCC, Cray, Clang and XL, with the Xeon Phi time as reference; lower is better]

  • Cray is 3× slower than XL
  • Clang is 30% slower than XL
  • The GCC implementation takes 26 seconds

SLIDE 22

OpenMP 4.5 Directives: Compiler Interpretations

OpenMP 4.5 directives map onto hardware

Compiler   Grid (threadblock)     Thread
GCC        teams distribute       parallel for
XL         teams distribute       parallel for
Clang      teams distribute       parallel for
Cray       teams distribute       simd

Table 1: OpenMP 4.5 mapping onto GPU hardware

SLIDE 23

XL Implementation

Optimized implementation with XL

#pragma omp target enter data map(alloc: input[0:X])
#pragma omp target teams distribute \
        parallel for collapse(2) \
        map(to: input[0:X]) \
        reduction(+: output_re(0,1,2), output_im(0,1,2))
for(X){
  for(N){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    output_re(0,1,2) += ...
    output_im(0,1,2) += ...
  }
}
#pragma omp target exit data map(delete: input)

  • Did not support class operators in older versions
  • Variables passed to the reduction clause should not be passed to any other clause in the same directive
  • All data accessed inside the target region has to be passed via a map clause
  • simd has no effect

SLIDE 24

Clang Implementation

Optimized implementation with Clang

#pragma omp target enter data map(alloc: input[0:X])
#pragma omp target update to(input[0:X])
#pragma omp target teams distribute \
        parallel for collapse(2) \
        map(tofrom: output_re(0,1,2), output_im(0,1,2)) \
        reduction(+: output_re(0,1,2), output_im(0,1,2))
for(X){
  for(N){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    output_re(0,1,2) += ...
    output_im(0,1,2) += ...
  }
}
#pragma omp target exit data map(delete: input)

  • Data allocated on the device using OpenMP 4.5 directives need not be passed via map clauses
  • Variables passed to the reduction clause also have to be passed to map clauses

SLIDE 25

Cray Implementation

Optimized Cray implementation

#pragma omp target enter data map(alloc: input[0:X])
#pragma omp target update to(input[0:X])
#pragma omp target teams distribute \
        simd collapse(2) \
        map(tofrom: output_re(0,1,2), output_im(0,1,2)) \
        reduction(+: output_re(0,1,2), output_im(0,1,2))
for(X){
  for(N){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    output_re(0,1,2) += ...
    output_im(0,1,2) += ...
  }
}
#pragma omp target exit data map(delete: input)

  • parallel for is executed sequentially inside the target region
  • simd distributes the loop across the threads of a threadblock
  • reduction variables have to be passed to the map clauses
  • Previously allocated data need not be passed via the map clauses
  • printf is not supported inside routines annotated with declare target

SLIDE 26

GCC Implementation

Optimized GCC implementation

#pragma omp target enter data map(alloc: input[0:X])
#pragma omp target teams distribute \
        parallel for collapse(2) \
        map(tofrom: output_re(0,1,2), output_im(0,1,2)) \
        reduction(+: output_re(0,1,2), output_im(0,1,2))
for(X){
  for(N){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    output_re(0,1,2) += ...
    output_im(0,1,2) += ...
  }
}
#pragma omp target exit data map(delete: input)

  • simd gives a compiler error
  • If data is allocated beforehand using map(alloc:...) clauses, it need not be passed to map clauses again
  • Variables passed to the reduction clause also have to be passed to map clauses

SLIDE 27

OpenMP 4.5 Summary

Cheat Sheet of Do's and Don'ts

  • XL
  • Everything accessed inside the target region has to be mapped explicitly via map clauses
    ⊲ Even if it is allocated on the device beforehand
  • Do not pass the same data to two different clauses in the same directive
    ⊲ Even if one of them is a reduction clause
  • Clang, GCC, Cray
  • Always pass the directionality information for the reduction variables via map clauses
  • GCC: do not use simd

SLIDE 28

OpenACC

OpenACC offloading to GPU

  • OpenMP
  • Cray
  • XL(IBM)
  • Clang
  • GCC
  • OpenACC
  • PGI
  • Cray
  • CUDA

SLIDE 29

OpenACC Directives: OpenMP-OpenACC 1-1 Directive Map

OpenACC directive map on GPU

OpenACC mapping onto the GPU

gang   - threadblock
vector - threads in a threadblock
worker - y dimension inside a threadblock (PGI compiler)

Corresponding directives

OpenMP:  teams distribute, parallel for, simd
OpenACC: #pragma acc parallel loop gang, #pragma acc loop vector, #pragma acc loop worker

SLIDE 30

OpenACC Directives: Data Directionality Clauses

OpenACC directives for memory movement

#pragma acc enter data copyin(...)
#pragma acc enter data create(...)
#pragma acc exit data copyout(...)
#pragma acc exit data delete(...)
#pragma acc data copy(...)
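
A minimal sketch of the allocate/update/compute/free lifecycle these clauses enable, matching the pattern used for GPP on the following slides; array name, size and the reduction loop are illustrative:

#include <cstdio>

int main() {
    const int n = 1 << 20;                          // illustrative size
    double *input = new double[n];
    for (int i = 0; i < n; ++i) input[i] = 1.0;
    double sum = 0.0;

    #pragma acc enter data create(input[0:n])       // allocate on the device
    #pragma acc update device(input[0:n])           // copy host -> device

    #pragma acc parallel loop gang vector present(input[0:n]) reduction(+: sum)
    for (int i = 0; i < n; ++i)
        sum += input[i];

    #pragma acc exit data delete(input[0:n])        // release the device copy
    delete[] input;
    std::printf("sum = %f\n", sum);
    return 0;
}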

SLIDE 31

OpenACC Implementation of GPP: PGI vs Cray

Optimized GPP implementation with PGI OpenACC

#pragma acc enter data create(input[0:X])
#pragma acc update device(input[0:X])
#pragma acc parallel loop gang collapse(2) present(input) \
        reduction(+: output_re(0,1,2), output_im(0,1,2))
for(X){
  for(N){
    #pragma acc loop vector \
            reduction(+: output_re(0,1,2), output_im(0,1,2))
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    output_re(0,1,2) += ...
    output_im(0,1,2) += ...
  }
}

  • Collapse the X and N loops to distribute across threadblocks
  • Distribute the M loop across the threads of a threadblock
  • reduction is required at both the gang and vector level since the output variables are updated by every thread

SLIDE 32

OpenACC Implementation of GPP: PGI vs Cray

Optimized GPP implementation with Cray OpenACC

#pragma acc enter data create(input[0:X])
#pragma acc update device(input[0:X])
#pragma acc parallel loop gang vector collapse(2) present(input[0:X]) \
        reduction(+: output_re(0,1,2), output_im(0,1,2))
for(X){
  for(N){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    output_re(0,1,2) += ...
    output_im(0,1,2) += ...
  }
}

  • Collapse the X and N loops to distribute them across threadblocks and threads within a block
  • The dimensions of the data structures have to be passed to the present clause

SLIDE 33

GPP on GPU: OpenACC Performance on Volta

Cray and PGI implementations of GPP using OpenACC

[Figure: Performance of GPP on V100 with OpenACC; runtime T (secs) for Cray and PGI, with the Xeon Phi time as reference; lower is better]

  • Cray is 3× slower than PGI
  • Cray is 50% slower than the optimized Xeon Phi runtime

SLIDE 34

Comparison of Performance on GPU: Volta

Performance comparison of all GPU implementations

[Figure: Performance comparison of GPP on V100; Time (secs) for OpenACC(Cray), OpenMP(Cray), OpenACC(PGI), OpenMP(Clang), OpenMP(XL) and CUDA, with the Xeon Phi reference; lower is better]

  • The dashed line is the Xeon Phi reference time
  • Cray OpenMP and OpenACC give similar performance and are slower than Xeon Phi
  • CUDA is 2× faster than the 2nd-best implementation

SLIDE 35

CUDA Implementation of GPP (cuda/10.0)

CUDA Implementation of GPP

CUDA

for(X){                              // blockIdx.x
  for(N){                            // blockIdx.y
    for(M){                          // threadIdx.x
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    output_re{0,1,2} += ...          // Atomic Add
    output_im{0,1,2} += ...          // Atomic Add
  }
}

  • 2-dimensional grid for the X and N loops
  • Distribute the M-loop across threads in a threadblock
  • CUDA atomics to maintain correctness

dim3 numBlocks(X, N, 1);
dim3 numThreads(64, 1, 1);
gpp_kernel<<<numBlocks, numThreads>>>(...);
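
A sketch of what a kernel with this mapping can look like: one block per (X, N) pair, threads striding over M, and atomicAdd for the reduction. The kernel body, argument list and names are illustrative, not the actual BerkeleyGW kernel (double-precision atomicAdd assumes compute capability 6.0 or newer, which V100 satisfies):

// Illustrative kernel: blockIdx.x -> X, blockIdx.y -> N, threadIdx.x strides over M
__global__ void gpp_kernel(const double *in, int M, double *output_re) {
    const int x = blockIdx.x;
    const int n = blockIdx.y;
    double local[3] = {0.0, 0.0, 0.0};

    for (int m = threadIdx.x; m < M; m += blockDim.x) {       // M-loop across threads
        double t = in[(size_t)x * gridDim.y + n] * m;          // stand-in computation
        for (int iw = 0; iw < 3; ++iw)
            local[iw] += t * (iw + 1);
    }
    for (int iw = 0; iw < 3; ++iw)
        atomicAdd(&output_re[iw], local[iw]);                  // keep the reduction correct
}

// Launched with the 2-D grid shown above:
//   dim3 numBlocks(X, N, 1);
//   dim3 numThreads(64, 1, 1);
//   gpp_kernel<<<numBlocks, numThreads>>>(d_in, M, d_output_re);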

SLIDE 36

OpenMP Implementation to Match CUDA: Loop Reordering

OpenMP loop reordering to match the CUDA implementation

CUDA

for(X){                              // blockIdx.x
  for(N){                            // blockIdx.y
    for(M){                          // threadIdx.x
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    output_re{0,1,2} += ...          // Atomic
    output_im{0,1,2} += ...          // Atomic
  }
}

OpenMP

#pragma omp target teams distribute \
        parallel for collapse(2) \
        map(to:...) \
        reduction(+: output_re(0,1,2), output_im(0,1,2))
for(N){
  for(X){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    output_re(0,1,2) += ...
    output_im(0,1,2) += ...
  }
}

SLIDE 37

Comparison of Performance on GPU: Volta

Performance of GPP implementations after loop reordering

[Figure: Performance of GPP on V100 after loop reordering; Time (secs) for OpenACC(Cray), OpenMP(Cray), OpenACC(PGI), OpenMP(Clang), OpenMP(XL) and CUDA, with the Xeon Phi reference; lower is better]

  • OpenMP (XL and Clang) are 2× faster after loop reordering
  • OpenACC (PGI) is 30% faster
  • OpenACC (Cray) is 3× faster
  • XL and Clang OpenMP are now similar to optimized CUDA

SLIDE 38

Performance Portability

SLIDE 39

Performance Portability

Interpretation of OpenMP 4.5 directives on the CPU

#pragma omp target enter data map(alloc: input[0:X])
#pragma omp target update to(input[0:X])
#pragma omp target teams distribute \
        parallel for collapse(2) \
        map(tofrom: output_re(0,1,2), output_im(0,1,2)) \
        reduction(+: output_re(0,1,2), output_im(0,1,2))
for(N){
  for(X){
    for(M){
      for(int iw = 0; iw < 3; ++iw){
        // Store local
      }
    }
    output_re(0,1,2) += ...
    output_im(0,1,2) += ...
  }
}
#pragma omp target exit data map(delete: input)

  • intel/2018 compilers
  • teams creates a single team and associates all threads with that team
  • Reverse the order of the X and N loops and distribute them across threads
  • Ignores other OpenMP 4.5 device-related directives, for example device memory allocation directives

SLIDE 40

Performance Portability

Performance of GPU implementations on CPU

[Figure: Runtime T (secs) of the CPU-optimized and GPU-optimized OpenMP 4.5 versions on Volta and Xeon Phi; lower is better]

GPU: Clang compiler; CPU: intel/2018 compilers

  • GPU-optimized OpenMP is 10% slower than optimized Xeon Phi
  • CPU-optimized OpenMP is 30× slower on Volta

SLIDE 41

Conclusions

Summary of the Presentation

  • Multiple implementations of OpenMP offloading gave us close to optimized CUDA performance
  • Differences exist in compiler interpretations of OpenMP 4.5 offload directives
  • Loop reordering might provide benefits due to changes in data access patterns
  • OpenACC had difficulty with CPU vectorization
  • Portable code, but not performance portable
