Optimizing Large Reductions in BerkeleyGW on GPUs Using OpenMP and OpenACC


  1. Optimizing Large Reductions in BerkeleyGW on GPUs Using OpenMP and OpenACC
     Rahulkumar Gayatri, Charlene Yang
     National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory
     March 8, 2019
     rgayatri@lbl.gov, cjyang@lbl.gov

  2. Motivation: Why Attend This Talk
     • 5 of the top 10 supercomputers use NVIDIA GPUs
     • Most codes optimized for CPUs now have to be rewritten for GPUs
     • Compiler-directive-based approaches are attractive due to their ease of use
       ◦ Big codes can be ported incrementally
     • This talk provides a detailed analysis of the current state of the directive-based programming models
       ◦ Their performance compared to optimized CUDA code
       ◦ Supported compilers
       ◦ Differences in compiler implementations

  3. Overview: Outline of the Presentation
     • BerkeleyGW, a material science code
       ◦ General Plasmon Pole (GPP), a mini-app
     • Baseline CPU implementation
     • GPU programming models (OpenMP, OpenACC, CUDA)
     • GPP on GPU
       ◦ Naive implementation
       ◦ Optimized implementation
       ◦ Compare approaches and performance of each implementation
     • Backport the GPU implementation to the CPU for performance portability

  4. BerkeleyGW
     • The GW method is an accurate approach to simulate the excited-state properties of materials
       ◦ What happens when you add or remove an electron from a system
       ◦ How do electrons behave when you apply a voltage
       ◦ How does the system respond to light or x-rays
     • Stand-alone kernels are extracted from BerkeleyGW so they can be run as mini-apps

  5. Test Case Kernel: General Plasmon Pole (GPP)
     • Mini-app from BerkeleyGW
       ◦ Computes the electron self-energy using the General Plasmon Pole approximation
     • Characteristics of GPP
       ◦ Reduction over a series of double-complex arrays involving multiply, divide and add instructions (partial FMA)
       ◦ For typical calculations the arithmetic intensity (flops/byte) is between 1 and 10, i.e., the kernel has to be optimized for memory locality and vectorization/SIMT efficiency

  6. Complex Number in C/C++: Complex Number Class
     • BerkeleyGW consists of double-complex number calculations
     • std::complex difficulties
       ◦ Performance issues
       ◦ Difficult to vectorize
       ◦ Cannot offload operations onto the device using OpenMP 4.5
     • thrust::complex
       ◦ Challenges in offloading complex operator routines onto the device
     • Built an in-house complex class (see the sketch below)
       ◦ 2 doubles on the CPU
       ◦ double2 vector type on the GPU
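
     A minimal sketch of what such an in-house complex type might look like; the class name GPUComplex and the exact operator set are illustrative assumptions, not the actual BerkeleyGW class:

         // Minimal sketch of a two-double complex type, assuming only the
         // operations GPP needs (add, multiply, scale by a double).
         // On the GPU the same layout maps naturally onto a double2 vector type.
         struct GPUComplex {
             double re, im;

             GPUComplex(double r = 0.0, double i = 0.0) : re(r), im(i) {}

             GPUComplex operator+(const GPUComplex &o) const {
                 return GPUComplex(re + o.re, im + o.im);
             }
             GPUComplex operator*(const GPUComplex &o) const {
                 // (a + bi)(c + di) = (ac - bd) + (ad + bc)i
                 return GPUComplex(re * o.re - im * o.im,
                                   re * o.im + im * o.re);
             }
             GPUComplex operator*(double s) const {
                 return GPUComplex(re * s, im * s);
             }
         };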

  7. GPP: Pseudo Code, Reduction in the Innermost Loop
     • Memory: O(2 GB)
     • Typical single-node problem size
     • output: double complex

         for (X) {                              // X = 512
             for (N) {                          // N = 1638
                 for (M) {                      // M = 32768
                     for (int iw = 0; iw < 3; ++iw) {
                         // some computation
                         output[iw] += ...
                     }
                 }
             }
         }

  8. GPP On CPU

  9. GPP CPU Parallelization: OpenMP 3.0 Parallelization of GPP
     • Unroll the innermost iw-loop
     • Vectorize the M-loop
     • Collapsing the loops increased the runtime by 10%
     • Check the compiler reports (intel/2018) to guarantee vectorization and unrolling
     • Flatten the arrays into scalars with compilers that do not support array reduction

         #pragma omp parallel for \
             reduction(+: output_re[0-2], output_im[0-2])
         for (X) {
             for (N) {
                 for (M) {                            // vectorize
                     for (int iw = 0; iw < 3; ++iw) { // unroll
                         // store local
                     }
                 }
                 for (int iw = 0; iw < 3; ++iw) {
                     output_re[iw] += ...
                     output_im[iw] += ...
                 }
             }
         }
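
     A compilable sketch of the CPU parallelization described above; the loop bounds follow the pseudocode on slide 7, the GPP arithmetic is elided, and the array-section reduction assumes a compiler that supports it (otherwise flatten to scalars, as noted above):

         // Sketch of the OpenMP CPU parallelization of the GPP loop nest.
         // output_re/output_im hold the real/imaginary parts of output[0..2].
         void gpp_cpu(int X, int N, int M, double output_re[3], double output_im[3])
         {
             #pragma omp parallel for reduction(+: output_re[0:3], output_im[0:3])
             for (int x = 0; x < X; ++x) {
                 for (int n = 0; n < N; ++n) {
                     double re_loc[3] = {0.0, 0.0, 0.0};  // per-(x,n) accumulators
                     double im_loc[3] = {0.0, 0.0, 0.0};
                     for (int m = 0; m < M; ++m) {         // vectorized loop
                         for (int iw = 0; iw < 3; ++iw) {  // unrolled loop
                             // re_loc[iw] += ...;  im_loc[iw] += ...;  (GPP arithmetic elided)
                         }
                     }
                     for (int iw = 0; iw < 3; ++iw) {
                         output_re[iw] += re_loc[iw];
                         output_im[iw] += im_loc[iw];
                     }
                 }
             }
         }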

  10. GPP Performance on CPU: Runtime of GPP on Cori
     • Performance numbers from Cori at NERSC, LBNL
       ◦ Haswell
       ◦ Xeon Phi
     • intel/2018 compilers
     • A perfect scaling would allow a KNL execution to be 4× faster than Haswell
       ◦ The KNL implementation of GPP is approximately 3.5× faster than Haswell
     [Figure: Performance of GPP on Cori; runtime T in seconds vs. CPU architecture (Haswell, Xeon Phi); lower is better]

  11. GPP Performance on CPU: Runtime of GPP on Cori
     • Performance numbers from Cori at LBNL
       ◦ Haswell
       ◦ Xeon Phi: 2.2 seconds
     • intel/2018 compilers
     • A perfect scaling would allow a KNL execution to be 4× faster than Haswell
       ◦ The KNL implementation of GPP is 3× faster than Haswell
     [Figure: Performance of GPP on Cori; runtime T in seconds vs. CPU architecture (Haswell, Xeon Phi); lower is better]

  12. GPP On GPU

  13. Parallelism on GPU: KNL to Volta GPU Hardware
     • Going from 272 threads on KNL to roughly 164K threads on the GPU
     • 164K threads
       ◦ 80 SMs
       ◦ 2048 threads within an SM
       ◦ 80 × 2048 = 163,840 ≈ 164K concurrent threads

  14. GPU Programming Models: Programming Models Used to Port GPP on GPU
     Volta GPUs are available on Cori and Summit
     • OpenMP 4.5
       ◦ Cray
       ◦ XL (IBM)
       ◦ Clang
       ◦ GCC
     • OpenACC
       ◦ PGI
       ◦ Cray
     • CUDA

  15. GPU Programming Models: OpenMP 4.5 Offloading to GPU
     Volta GPUs are available on Cori and Summit
     • OpenMP 4.5
       ◦ Cray
       ◦ XL (IBM)
       ◦ Clang
       ◦ GCC
     • OpenACC
       ◦ PGI
       ◦ Cray
     • CUDA

  16. OpenMP 4.5 Offload Directives: OpenMP Directives to Offload Code Blocks onto GPUs
     Directives to distribute work across GPU threads:
     • target: offload the code block onto the device
     • teams: spawn one or more thread teams
     • distribute: distribute the iterations of the loop onto the master threads of the teams
     • parallel for: distribute loop iterations among the threads within a threadblock
     • simd: implementation is compiler dependent

         #pragma omp target teams distribute
         for ()   // distribute the loop across threadblocks
             #pragma omp parallel for
             for ()   // distribute the loop across threads within a threadblock

  17. OpenMP 4.5 Data Movement: Directives to Move Data Between Host and Device
     • Allocate and delete data on the device:

         #pragma omp target enter data map(alloc: list-of-data-structures[:])
         #pragma omp target exit data map(delete: list-of-data-structures[:])

     • Update data on the device and the host (to: host to device, from: device to host):

         #pragma omp target update to/from(list-of-data-structures[:])

     • Clauses to use with target directives: map(to: ...), map(from: ...), map(tofrom: ...)
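
     A small, hypothetical usage of these directives around the GPP kernel; the names input, output_re, output_im and the extent x_size are placeholders, not the actual BerkeleyGW data structures:

         // Create device copies of the input and the accumulators, run the
         // offloaded kernel, copy the results back, and free the device copies.
         void run_gpp_offload(double *input, int x_size,
                              double output_re[3], double output_im[3])
         {
             #pragma omp target enter data \
                 map(alloc: input[0:x_size], output_re[0:3], output_im[0:3])
             #pragma omp target update to(input[0:x_size])        // host -> device

             // ... offloaded GPP kernel would run here (see the following slides) ...

             #pragma omp target update from(output_re[0:3], output_im[0:3])  // device -> host
             #pragma omp target exit data \
                 map(delete: input[0:x_size], output_re[0:3], output_im[0:3])
         }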

  18. OpenMP 4.5 Routines on Device: Directives to Offload Routines onto the Device

         #pragma omp declare target
         void foo();
         #pragma omp end declare target

     • Not necessary if the routines are inlined
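
     For instance, a complex-multiply helper like the one sketched after slide 6 could be made callable inside target regions as follows; the helper name is hypothetical, and when the operators are inlined (as in BerkeleyGW) the directive may be unnecessary:

         // A plain aggregate type usable on both host and device.
         struct GPUComplex { double re, im; };

         // Hypothetical helper compiled for the device as well as the host.
         #pragma omp declare target
         GPUComplex gpu_complex_mult(const GPUComplex &a, const GPUComplex &b)
         {
             GPUComplex c;
             c.re = a.re * b.re - a.im * b.im;   // ac - bd
             c.im = a.re * b.im + a.im * b.re;   // ad + bc
             return c;
         }
         #pragma omp end declare target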

  19. OpenMP Offload of GPP: Naive OpenMP 4.5 Implementation of GPP
     • Distribute the X-loop across threadblocks
     • Distribute the N-loop among the threads within a threadblock
     • No array reduction with OpenMP 4.5 directives, hence atomics are used to maintain correctness
     • Parallelizing the M-loop increases the overhead of synchronization

         #pragma omp target teams distribute \
             map(to: ...) map(tofrom: output_re[0-2], output_im[0-2])
         for (X) {
             #pragma omp parallel for
             for (N) {
                 for (M) {
                     for (int iw = 0; iw < 3; ++iw) {
                         // store local
                     }
                 }
                 for (int iw = 0; iw < 3; ++iw) {
                     #pragma omp atomic
                     output_re[iw] += ...
                     #pragma omp atomic
                     output_im[iw] += ...
                 }
             }
         }
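
     Expanded into a self-contained shape, the naive offload might look as follows; the loop bounds and the elided arithmetic follow the earlier pseudocode, and the function name and input extent are placeholders:

         // Naive OpenMP 4.5 offload of GPP: X across thread teams, N across the
         // threads of each team, atomics protecting the shared accumulators.
         void gpp_gpu_naive(const double *input, int X, int N, int M,
                            double output_re[3], double output_im[3])
         {
             #pragma omp target teams distribute \
                 map(to: input[0:X]) map(tofrom: output_re[0:3], output_im[0:3])
             for (int x = 0; x < X; ++x) {
                 #pragma omp parallel for
                 for (int n = 0; n < N; ++n) {
                     double re_loc[3] = {0.0, 0.0, 0.0};  // per-(x,n) accumulators
                     double im_loc[3] = {0.0, 0.0, 0.0};
                     for (int m = 0; m < M; ++m) {
                         for (int iw = 0; iw < 3; ++iw) {
                             // re_loc[iw] += ...;  im_loc[iw] += ...;  (GPP arithmetic elided)
                         }
                     }
                     for (int iw = 0; iw < 3; ++iw) {
                         #pragma omp atomic
                         output_re[iw] += re_loc[iw];
                         #pragma omp atomic
                         output_im[iw] += im_loc[iw];
                     }
                 }
             }
         }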

  20. Optimized Implementation with OpenMP 4.5
     • XL, Clang, Cray and GCC gave the best performance with the same parallelization technique
       ◦ Collapse the X and N loops and distribute the combined iterations across threadblocks and the threads within a block
     • Explicit memory allocation improved the performance of the kernel by 10%
       ◦ #pragma omp target enter/exit data
     • Reduction gave a 3× boost in performance
       ◦ Flatten the arrays to scalars

         #pragma omp target enter data map(alloc: input[0:X])
         #pragma omp target update to(input[0:X])
         #pragma omp target teams distribute \
             parallel for collapse(2) \
             reduction(+: output_re(0,1,2), output_im(0,1,2))  // shorthand for the three flattened scalars
         for (X) {
             for (N) {
                 for (M) {
                     for (int iw = 0; iw < 3; ++iw) {
                         // store local
                     }
                 }
                 output_re(0,1,2) += ...
                 output_im(0,1,2) += ...
             }
         }
         #pragma omp target exit data map(delete: input)
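
     A corresponding sketch of the optimized version with the arrays flattened to scalars; the bounds again follow the earlier pseudocode, the GPP arithmetic is elided, and the explicit map(tofrom:) of the scalars is an addition for portability across OpenMP 4.5 compilers:

         // Optimized OpenMP 4.5 offload of GPP: collapse the two outer loops,
         // distribute the combined iterations across teams and threads, and use
         // a scalar reduction instead of atomics.
         void gpp_gpu_optimized(const double *input, int X, int N, int M,
                                double output_re[3], double output_im[3])
         {
             double re0 = 0.0, re1 = 0.0, re2 = 0.0;   // flattened accumulators
             double im0 = 0.0, im1 = 0.0, im2 = 0.0;

             #pragma omp target enter data map(alloc: input[0:X])
             #pragma omp target update to(input[0:X])

             #pragma omp target teams distribute parallel for collapse(2) \
                 map(tofrom: re0, re1, re2, im0, im1, im2) \
                 reduction(+: re0, re1, re2, im0, im1, im2)
             for (int x = 0; x < X; ++x) {
                 for (int n = 0; n < N; ++n) {
                     for (int m = 0; m < M; ++m) {
                         for (int iw = 0; iw < 3; ++iw) {
                             // local accumulation of the GPP terms (elided)
                         }
                     }
                     // re0 += ...;  re1 += ...;  re2 += ...;
                     // im0 += ...;  im1 += ...;  im2 += ...;
                 }
             }

             #pragma omp target exit data map(delete: input[0:X])

             output_re[0] = re0;  output_re[1] = re1;  output_re[2] = re2;
             output_im[0] = im0;  output_im[1] = im1;  output_im[2] = im2;
         }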

  21. GPP on GPU: Performance of GPP on V100 with OpenMP 4.5
     • Cray is 3× slower than XL
     • Clang is 30% slower than XL
     • The GCC implementation takes 26 seconds
     [Figure: runtime T in seconds of GPP on V100 with OpenMP 4.5 for GCC, Cray, Clang and XL, with the Xeon Phi time shown for reference; lower is better]

  22. OpenMP 4.5 Directives: Compiler Interpretations
     How the OpenMP 4.5 directives map onto the GPU hardware:

         Compiler   Grid               Thread
         GCC        teams distribute   parallel for
         XL         teams distribute   parallel for
         Clang      teams distribute   parallel for
         Cray       teams distribute   simd

     Table 1: OpenMP 4.5 mapping onto GPU hardware
