Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers

Haohuan Fu (haohuan@tsinghua.edu.cn), High Performance Geo-Computing (HPGC) Group, Center for Earth System Science, Tsinghua University. April 5, 2016.


  1. Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers. Haohuan Fu (haohuan@tsinghua.edu.cn), High Performance Geo-Computing (HPGC) Group, Center for Earth System Science, Tsinghua University. April 5, 2016.

  2. Tsinghua HPGC Group
     • HPGC: high performance geo-computing (http://www.thuhpgc.org)
     • High-performance computational solutions for geoscience applications
       • simulation-oriented research: providing highly efficient and highly scalable simulation applications (exploration geophysics, climate modeling)
       • data-oriented research: data processing, data compression, and data mining
     • Combines optimizations from three perspectives (Application, Algorithm, and Architecture), with a particular focus on new accelerator architectures

  3. A Design Process That Combines Optimizations from Different Layers
     [Diagram: Application, Algorithm, and Architecture layers leading to The “Best” Computational Solution]

  4. Tsinghua HPGC Group: A Quick Overview of Existing Projects
     • Exploration Geophysics
       • GPU-based BEAM Migration (sponsored by Statoil)
       • GPU-based ETE Forward Modeling (sponsored by BGP)
       • Parallel Finite Element Electromagnetic Forward Modeling Method (sponsored by NSFC)
       • FPGA-based RTM (sponsored by NSFC and IBM)
     • Climate Modeling Application
       • global-scale atmospheric simulation (800 Tflops Shallow Water Equation Solver on Tianhe-1A, 1.4 Pflops atmospheric simulation 3D Euler Equation Solver on Tianhe-2)
       • FPGA-based atmospheric simulation (selected as one of the 27 significant papers in the 25 years of the FPL conference)
     • Remote Sensing Data Processing
       • data analysis and visualization (sponsored by Microsoft)
       • deep learning based land cover mapping
     • Parallel Stencil on Different HPC Architectures
     • Parallel Sparse Matrix Solver Algorithm
     • Parallel Data Compression (PLZMA) (sponsored by ZTE)
     • Hardware-Based Gaussian Mixture Model Clustering Engine: 517x speedup
     • multi-core/many-core (CPU, GPU, MIC) architecture
     • reconfigurable hardware (FPGA)

  5. A Highly Scalable Framework for Atmospheric Modeling on Heterogeneous Supercomputers

  6. The Gap between Software and Hardware
     • China’s supercomputers: heterogeneous systems with GPUs or MICs; millions of cores
     • China’s models: pure CPU code; scaling to hundreds or thousands of cores
     • The gap: millions of lines of legacy code; poor scalability; written for multi-core, rather than many-core
     • Scale labels in the figure: 50P and 100T

  7. Our Research Goals
     • highly scalable framework that can efficiently utilize many-core accelerators
     • automated tools to deal with the legacy code
     • China’s supercomputers: heterogeneous systems with GPUs or MICs; millions of cores
     • China’s models: pure CPU code; scaling to hundreds or thousands of cores
     • Scale label in the figure: 100T~1P

  9. Example: Highly-Scalable Atmospheric Simulation Framework
     • Team
       • Yang, Chao (Institute of Software, CAS): computational mathematics
       • Wang, Lanning (Beijing Normal University): climate modeling
       • Xue, Wei (Tsinghua University): computer science
       • Fu, Haohuan (Tsinghua University): geo-computing
     • Application: cloud resolving; cube-sphere grid or other grid
     • Algorithm: explicit, implicit, or semi-implicit method
     • Architecture: CPU, GPU, MIC, FPGA; C/C++, Fortran, MPI, CUDA, Java, …
     • The “Best” Computational Solution

  10. A Highly Scalable Framework for Atmospheric Modeling on Heterogeneous Supercomputers: Previous Efforts

  11. Highly-Scalable Framework for Atmospheric Modeling
      • 2012: solving 2D SWE using CPU + GPU
        • 800 Tflops on 40,000 CPU cores and 3750 GPUs
      For more details, please refer to our PPoPP 2013 paper: “A Peta-Scalable CPU-GPU Algorithm for Global Atmospheric Simulations”, in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 1-12, Shenzhen, 2013.

  12. Highly-Scalable Framework for Atmospheric Modeling
      • 2012: solving 2D SWE using CPU + GPU
        • 800 Tflops on 40,000 CPU cores and 3750 GPUs
      • 2013: 2D SWE on MIC and FPGA
        • 1.26 Pflops on 207,456 CPU cores and 25,932 MICs
        • another 10x on FPGA
      For more details, please refer to our IPDPS 2014 paper: “Enabling and Scaling a Global Shallow-Water Atmospheric Model on Tianhe-2”; and our FPL 2013 paper: “Accelerating Solvers for Global Atmospheric Equations Through Mixed-Precision Data Flow Engine”.

  13. Highly-Scalable Framework for Atmospheric Modeling
      • 2012: solving 2D SWE using CPU + GPU
        • 800 Tflops on 40,000 CPU cores and 3750 GPUs
      • 2013: 2D SWE on CPU+MIC and CPU+FPGA
        • 1.26 Pflops on 207,456 CPU cores and 25,932 MICs
        • another 10x on FPGA
      • 2014: 3D Euler on MIC
        • 1.7 Pflops on 147,456 CPU cores and 18,432 MICs
      For more details, please refer to our paper: “Ultra-scalable CPU-MIC Acceleration of Mesoscale Atmospheric Modeling on Tianhe-2”, IEEE Transactions on Computers.

  14. A Highly Scalable Framework for Atmospheric Modeling on Heterogeneous Supercomputers: 3D Euler on CPU+GPU

  15. CPU-only Algorithm
      • Parallel version
        • Multi-node & multi-core
        • MPI parallelism
      [Figure labels: 25-point stencil; 3D channel]

  16. CPU-only Algorithm
      • Parallel version
        • Multi-node & multi-core
        • MPI parallelism
      • CPU algorithm workflow, per stencil sweep, for each subdomain:
        ① Update halo
        ② Calculate Euler stencil
           a. Compute local coordinate
           b. Compute fluxes
           c. Compute source terms
      [Figure: CPU workflow per stencil sweep: ① halo updating, ② stencil computation]

  17. Hybrid (CPU+GPU) Algorithm
      • Hybrid partition
        • GPU: inner stencil computation
        • CPU: halo updating & outer stencil computation
      • CPU-GPU hybrid algorithm, per stencil sweep, for each subdomain:
        • GPU side: inner-part Euler stencil
        • CPU side: ① update halo; ② outer-part Euler stencil
        • BARRIER
      [Figure labels: PETSc; 3D channel; inner part (GPU); outer part (CPU), 4 layers; CPU-GPU exchange]
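The inner/outer partition on this slide can be illustrated with a small point count. This is a minimal sketch, not the authors' code: `split_subdomain` is a hypothetical helper, and it assumes the CPU-owned outer part is a shell 4 layers wide on the four horizontal faces only, with the vertical direction left unsplit. The slide gives only the "4 layers" figure.

```python
def split_subdomain(nx, ny, nz, layers=4):
    """Return (inner_points, outer_points) for one nx * ny * nz subdomain.

    Assumption (not stated on the slide): the outer part is a shell
    `layers` cells wide on the four horizontal faces; the vertical
    direction is not split.
    """
    if nx <= 2 * layers or ny <= 2 * layers:
        raise ValueError("subdomain too small to have an inner part")
    # Inner block: what remains after peeling `layers` cells off each side.
    inner = (nx - 2 * layers) * (ny - 2 * layers) * nz
    # Outer shell: everything else, handled by the CPU next to the halo.
    outer = nx * ny * nz - inner
    return inner, outer
```

For a 16 x 16 x 10 subdomain this gives 640 inner points and 1920 outer points, so under these assumptions a small subdomain leaves most of the work on the CPU; the GPU share grows as the horizontal extent grows.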

  18. Hybrid Algorithm Design
      • CPU-only workflow, per stencil sweep: ① halo updating, ② stencil computation (CPU)
      • Hybrid workflow, per stencil sweep:
        • GPU row: inner stencil computation, G2C
        • CPU row: halo updating, outer stencil computation, C2G
        • Steps ①, ②, ③, ending at a barrier
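The overlap in the hybrid workflow can be sketched on the host side as follows. This is an illustrative stand-in under stated assumptions: a worker thread plays the role of the asynchronous GPU work, the three callables are hypothetical placeholders, and the G2C/C2G transfers are left out because their exact placement is not recoverable from the slide text.

```python
from concurrent.futures import ThreadPoolExecutor


def hybrid_sweep(inner_stencil, update_halo, outer_stencil):
    """Run one stencil sweep with inner and outer work overlapped.

    All three arguments are hypothetical callables supplied by the caller.
    """
    with ThreadPoolExecutor(max_workers=1) as device:
        # Launch the inner-part computation asynchronously
        # (stand-in for a GPU kernel launch).
        inner_future = device.submit(inner_stencil)
        # Meanwhile the CPU updates the halo, then computes the outer part.
        update_halo()
        outer_result = outer_stencil()
        # Barrier: wait for the inner part before the sweep ends.
        inner_result = inner_future.result()
    return inner_result, outer_result
```

The sweep time is then roughly the longer of the two sides rather than their sum, which is the reason for splitting the subdomain in the first place.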

  19. A Highly Scalable Framework for Atmospheric Modeling on Heterogeneous Supercomputers: GPU-related Optimizations

  20. Optimizations
      • GPU Opt: Pinned Memory; SMEM/L1; AoS -> SoA; Register Adjustment; Kernel Splitting; Other Methods (Customizable Data Cache, Inner-thread Rescheduling)
      • CPU Opt: OpenMP; SIMD Vectorization; Cache blocking

  22. Optimizations: Pinned Memory
      [Diagram labels: Virtual Memory, Physical Memory, GPU; transfer times T1 and T2]
      • Theoretical: T2 = 1/3 * T1
      • In practice: T2 < 1/2 * T1

  23. Optimizations: SMEM/L1
      • Compiler option: -Xptxas -dlcm=ca

  25. Optimizations: Register Adjustment
      • Streaming Multiprocessor (SM): 64K registers, 2048 threads
      • Rt: registers per thread
      • Occupancy = (64*1024) / (2048*Rt)
      • 256 registers per thread (Rt = 256): 1 block per SM
      • Occupancy = (64*1024) / (2048*256) = 12.5%

  26. Optimizations: Register Adjustment
      • Streaming Multiprocessor (SM): 64K registers, 2048 threads
      • Rt: registers per thread
      • Occupancy = (64*1024) / (2048*Rt)
      • 64 registers per thread (Rt = 64): 4 blocks per SM
      • Occupancy = (64*1024) / (2048*64) = 50%
      • Compiler option: -maxrregcount=64
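The occupancy formula on the register-adjustment slides can be checked with a few lines of arithmetic. This is a minimal sketch of the register-limited estimate only: it ignores shared-memory limits and register allocation granularity, and the 256-thread block size is an assumption inferred from the "1 block per SM" and "4 blocks per SM" figures, not stated on the slides. Both function names are hypothetical.

```python
def occupancy(regs_per_thread, regs_per_sm=64 * 1024, max_threads_per_sm=2048):
    """Register-limited occupancy: resident threads / maximum threads per SM."""
    # Threads that fit in the register file, capped by the SM thread limit.
    resident = min(regs_per_sm // regs_per_thread, max_threads_per_sm)
    return resident / max_threads_per_sm


def blocks_per_sm(regs_per_thread, threads_per_block=256,
                  regs_per_sm=64 * 1024, max_threads_per_sm=2048):
    """Resident blocks per SM, assuming 256-thread blocks (an assumption)."""
    resident = min(regs_per_sm // regs_per_thread, max_threads_per_sm)
    return resident // threads_per_block
```

With these definitions, `occupancy(256)` is 0.125 and `occupancy(64)` is 0.5, matching the 12.5% and 50% on the slides, and `blocks_per_sm` returns 1 and 4 respectively.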

  28. Optimizations

  29. Optimizations

  30. Optimizations

  31. A Highly Scalable Framework for Atmospheric Modeling on Heterogeneous Supercomputers: Results

  32. Experimental Result
      • CPU Opt (OpenMP, SIMD Vectorization, Cache blocking): 19.7s
      • GPU Opt (Pinned Memory, SMEM/L1, AoS -> SoA): 5.91s (70%)
      • GPU Opt (Kernel Splitting, Register Adjustment): 1.80s (69%)
      • Other Methods (Customized Data Cache, Inner-thread Rescheduling): 0.92s (49%)
      • 31.64x speedup over 12-core CPU (E5-2697 v2)
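The percentages on this slide appear to be the time reductions between successive measurements. The sketch below checks that reading, assuming the four times (19.7s, 5.91s, 1.80s, 0.92s) are consecutive stages; `stage_reductions` is a hypothetical helper, not something from the talk.

```python
def stage_reductions(times):
    """Fractional time reduction between each consecutive pair of timings."""
    return [(before - after) / before for before, after in zip(times, times[1:])]


# Timings in seconds, in the order they appear on the slide.
times = [19.7, 5.91, 1.80, 0.92]
reductions = stage_reductions(times)
```

Under that assumption the three reductions come out at roughly 0.70, 0.695, and 0.489, in line with the 70%, 69%, and 49% shown on the slide.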

  33. Experimental Result
