SLIDE 1

Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers

Haohuan Fu (haohuan@tsinghua.edu.cn)
High Performance Geo-Computing (HPGC) Group
Center for Earth System Science, Tsinghua University
April 5th, 2016

SLIDE 2

Tsinghua HPGC Group

  • HPGC: High Performance Geo-Computing, http://www.thuhpgc.org
  • High-performance computational solutions for geoscience applications
  • simulation-oriented research: providing highly efficient and highly scalable simulation applications (exploration geophysics, climate modeling)
  • data-oriented research: data processing, data compression, and data mining
  • Combining optimizations from three different perspectives (Application, Algorithm, and Architecture), with a focus on new accelerator architectures

SLIDE 3

A Design Process That Combines Optimizations from Different Layers

The “Best” Computational Solution: combining Application, Algorithm, and Architecture.

SLIDE 4
Tsinghua HPGC Group: a Quick Overview of Existing Projects

Application
  • Exploration Geophysics
  • GPU-based BEAM Migration (sponsored by Statoil)
  • GPU-based ETE Forward Modeling (sponsored by BGP)
  • Parallel Finite Element Electromagnetic Forward Modeling Method (sponsored by NSFC)
  • FPGA-based RTM (sponsored by NSFC and IBM)
  • Climate Modeling
  • global-scale atmospheric simulation (800 Tflops Shallow Water Equation Solver on Tianhe-1A; 1.4 Pflops atmospheric simulation with a 3D Euler Equation Solver on Tianhe-2)
  • FPGA-based atmospheric simulation (selected as one of the 27 significant papers in the 25 years of the FPL conference)
  • Remote Sensing Data Processing
  • data analysis and visualization (sponsored by Microsoft)
  • deep-learning-based land cover mapping

Algorithm
  • Parallel Stencil on Different HPC Architectures
  • Parallel Sparse Matrix Solver
  • Parallel Data Compression (PLZMA) (sponsored by ZTE)
  • Hardware-Based Gaussian Mixture Model Clustering Engine: 517x speedup

Architecture
  • multi-core/many-core (CPU, GPU, MIC)
  • reconfigurable hardware (FPGA)

SLIDE 5

A Highly Scalable Framework for Atmospheric Modeling on Heterogeneous Supercomputers


SLIDE 6

The Gap between Software and Hardware

China’s supercomputers (50P scale):
  • heterogeneous systems with GPUs or MICs
  • millions of cores

China’s models (~100T):
  • pure CPU code
  • scaling to hundreds or thousands of cores
  • millions of lines of legacy code
  • poor scalability
  • written for multi-core, rather than many-core

SLIDE 7

Our Research Goals

China’s supercomputers (50P scale):
  • heterogeneous systems with GPUs or MICs
  • millions of cores

China’s models (100T~1P):
  • pure CPU code
  • scaling to hundreds or thousands of cores

Goals:
  • a highly scalable framework that can efficiently utilize many-core accelerators
  • automated tools to deal with the legacy code
SLIDE 9

Example: Highly-Scalable Atmospheric Simulation Framework

The “Best” Computational Solution combines:
  • Application: cloud resolving
  • Algorithm: explicit, implicit, or semi-implicit method; cube-sphere grid or other grid
  • Architecture: CPU, GPU, MIC, FPGA; C/C++, Fortran, MPI, CUDA, Java, …

Team:
  • Wang, Lanning (Beijing Normal University, climate modeling)
  • Yang, Chao (Institute of Software, CAS, computational mathematics)
  • Xue, Wei (Tsinghua University, computer science)
  • Fu, Haohuan (Tsinghua University, geo-computing)

SLIDE 10

A Highly Scalable Framework for Atmospheric Modeling on Heterogeneous Supercomputers: Previous Efforts


SLIDE 11

Highly-Scalable Framework for Atmospheric Modeling

  • 2012: solving 2D SWE using CPU + GPU
  • 800 Tflops on 40,000 CPU cores and 3,750 GPUs

For more details, please refer to our PPoPP 2013 paper: “A Peta-Scalable CPU-GPU Algorithm for Global Atmospheric Simulations”, in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 1-12, Shenzhen, 2013.

SLIDE 12

Highly-Scalable Framework for Atmospheric Modeling

  • 2012: solving 2D SWE using CPU + GPU
  • 800 Tflops on 40,000 CPU cores and 3,750 GPUs
  • 2013: 2D SWE on MIC and FPGA
  • 1.26 Pflops on 207,456 CPU cores and 25,932 MICs
  • another 10x on FPGA

For more details, please refer to our IPDPS 2014 paper: “Enabling and Scaling a Global Shallow-Water Atmospheric Model on Tianhe-2”, and our FPL 2013 paper: “Accelerating Solvers for Global Atmospheric Equations Through Mixed-Precision Data Flow Engine”.

SLIDE 13

Highly-Scalable Framework for Atmospheric Modeling

  • 2012: solving 2D SWE using CPU + GPU
  • 800 Tflops on 40,000 CPU cores and 3,750 GPUs
  • 2013: 2D SWE on CPU+MIC and CPU+FPGA
  • 1.26 Pflops on 207,456 CPU cores and 25,932 MICs
  • another 10x on FPGA
  • 2014: 3D Euler on MIC
  • 1.7 Pflops on 147,456 CPU cores and 18,432 MICs

For more details, please refer to our paper: “Ultra-scalable CPU-MIC Acceleration of Mesoscale Atmospheric Modeling on Tianhe-2”, IEEE Transactions on Computers.

SLIDE 14

A Highly Scalable Framework for Atmospheric Modeling on Heterogeneous Supercomputers: 3D Euler on CPU+GPU


SLIDE 15

CPU-only Algorithm
  • Parallel Version
  • Multi-node & Multi-core
  • MPI Parallelism

25-point stencil on a 3D channel domain.

SLIDE 16

CPU-only Algorithm
  • Parallel Version: multi-node & multi-core, MPI parallelism
  • CPU Algorithm Workflow

CPU algorithm, per stencil sweep, for each subdomain:
① Update halo
② Calculate Euler stencil
  a. Compute local coordinates
  b. Compute fluxes
  c. Compute source terms
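The per-sweep structure above can be sketched in plain C++. This is a minimal stand-in, not the solver's actual code: a 1D domain with a 1-cell periodic halo replaces the MPI halo exchange, and a 3-point average replaces the 25-point Euler stencil with its coordinate, flux, and source-term stages.

```cpp
#include <vector>
#include <cstddef>

// Periodic halo update of width 1 (the real code exchanges halos over MPI).
void update_halo(std::vector<double>& u) {
  u.front() = u[u.size() - 2];
  u.back()  = u[1];
}

// One stencil sweep: (1) update halo, (2) apply the stencil to the interior.
std::vector<double> stencil_sweep(std::vector<double> u) {
  update_halo(u);                                   // step 1: halo
  std::vector<double> out(u);
  for (std::size_t i = 1; i + 1 < u.size(); ++i)    // step 2: stencil
    out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0;
  return out;
}
```

The real solver repeats this sweep once per time step, which is why overlapping the halo update with computation (next slides) matters so much.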

SLIDE 17

Hybrid (CPU+GPU) Algorithm

The 3D channel is partitioned into an inner part (GPU) and an outer part (CPU).

  • Hybrid Partition
  • GPU: inner stencil computation
  • CPU: halo updating & outer stencil computation
  • CPU-GPU Hybrid Algorithm

CPU-GPU hybrid algorithm, per stencil sweep, for each subdomain:
  GPU side: inner-part Euler stencil
  CPU side: ① update halo ② outer-part Euler stencil
  BARRIER
  CPU-GPU exchange (4 layers); PETSc
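The inner/outer split can be sketched as below. This is an illustrative sequential sketch with made-up names, not the actual implementation: on the real system the inner call is a CUDA kernel launched asynchronously, the comment marked BARRIER is a device synchronization, and a boundary-layer exchange follows.

```cpp
#include <vector>
#include <cstddef>

// 3-point averaging stencil over [lo, hi): a stand-in for the 25-point
// Euler stencil.
void stencil_range(std::vector<double>& out, const std::vector<double>& u,
                   std::size_t lo, std::size_t hi) {
  for (std::size_t i = lo; i < hi; ++i)
    out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0;
}

// One hybrid sweep over an interior with 1 ghost cell on each side.
// outer_w is the width of the CPU-owned outer strip on each boundary.
std::vector<double> hybrid_sweep(const std::vector<double>& u,
                                 std::size_t outer_w) {
  std::vector<double> out(u);
  const std::size_t n = u.size();
  stencil_range(out, u, 1 + outer_w, n - 1 - outer_w);  // "GPU": inner part
  stencil_range(out, u, 1, 1 + outer_w);                // CPU: outer, left
  stencil_range(out, u, n - 1 - outer_w, n - 1);        // CPU: outer, right
  // BARRIER here, then exchange the outer_w boundary layers with the GPU.
  return out;
}
```

The design point is that the CPU's halo update and outer strip hide behind the GPU's much larger inner computation.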

SLIDE 18

Hybrid Algorithm Design

Per stencil sweep: ① the GPU computes the inner stencil while the CPU performs halo updating and outer stencil computation; ② barrier; ③ GPU-to-CPU (G2C) and CPU-to-GPU (C2G) transfers. The CPU-only workflow, by contrast, is simply ① halo updating, ② stencil computation.

SLIDE 19

A Highly Scalable Framework for Atmospheric Modeling on Heterogeneous Supercomputers: GPU-related Optimizations


SLIDE 20

Optimizations

GPU Opt: pinned memory; SMEM/L1; AoS -> SoA; register adjustment; kernel splitting; other methods (customizable data cache, inner-thread rescheduling)

CPU Opt: OpenMP; SIMD vectorization; cache blocking


SLIDE 22

Optimizations

Pinned memory: the transfer path involves host physical memory, GPU virtual memory, and GPU physical memory; pinning the host buffer removes an intermediate copy. With T1 the transfer time before pinning and T2 after: theoretic T2 = 1/3 * T1; in reality T2 < 1/2 * T1.

SLIDE 23

Optimizations

SMEM/L1 compiler option:
  • -Xptxas -dlcm=ca (cache global loads in L1 as well as L2)

SLIDE 24

Optimizations
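The GPU optimization list includes the AoS -> SoA layout change. A minimal host-side illustration follows; the field names are illustrative, not the solver's actual variables. In AoS, neighboring GPU threads reading the same field touch strided addresses (poorly coalesced); in SoA, each field is contiguous, so consecutive threads read consecutive addresses.

```cpp
#include <vector>
#include <cstddef>

// Array-of-Structs: one record per grid cell.
struct CellAoS {
  double density, momentum, energy;
};

// Struct-of-Arrays: each field stored contiguously for coalesced access.
struct FieldsSoA {
  std::vector<double> density, momentum, energy;
};

FieldsSoA aos_to_soa(const std::vector<CellAoS>& cells) {
  FieldsSoA f;
  f.density.reserve(cells.size());
  f.momentum.reserve(cells.size());
  f.energy.reserve(cells.size());
  for (const CellAoS& c : cells) {
    f.density.push_back(c.density);
    f.momentum.push_back(c.momentum);
    f.energy.push_back(c.energy);
  }
  return f;
}
```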

SLIDE 25

Optimizations: register adjustment

A streaming multiprocessor (SM) has a 64K-entry register file and up to 2048 resident threads. With 256 registers per thread (Rt = 256), only 1 block fits per SM:
Occupancy = (64*1024) / (2048 * Rt) = 12.5%

SLIDE 26

With the compiler option --maxrregcount=64, Rt = 64 registers per thread and 4 blocks fit per SM:
Occupancy = (64*1024) / (2048 * Rt) = 50%
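The occupancy formula from the slides can be written as a small function; the 64K register file and 2048-thread limit are the SM parameters quoted on the slides.

```cpp
// Occupancy model from the slides: the fraction of the SM's 2048 thread
// slots that can stay resident given a 64K-entry register file and
// regs_per_thread registers per thread, capped at 100%.
double occupancy_pct(int regs_per_thread) {
  const double reg_file    = 64 * 1024;  // registers per SM
  const double max_threads = 2048;       // resident-thread limit per SM
  double occ = reg_file / (max_threads * regs_per_thread);
  if (occ > 1.0) occ = 1.0;
  return occ * 100.0;
}
```

Plugging in the slide's values: Rt = 256 gives 12.5%, and Rt = 64 gives 50%.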

SLIDES 27-30

Optimizations: customized data cache and inner-thread rescheduling (presented as figures)
SLIDE 31

A Highly Scalable Framework for Atmospheric Modeling on Heterogeneous Supercomputers: Results


SLIDE 32

Experimental Result

Applying the optimization stages reduced the runtime from 19.7s to 5.91s, then 1.80s, then 0.92s (stage-by-stage reductions of about 70%, 69%, and 49%), for a 31.64x speedup over a 12-core CPU (E5-2697 v2).
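The percentages quoted with the timings (70%, 69%, 49%) read as stage-to-stage runtime reductions; a quick check of that reading against the 19.7s, 5.91s, 1.80s, and 0.92s figures:

```cpp
// Stage-to-stage runtime reduction in percent: how much of the previous
// stage's runtime was eliminated by the next batch of optimizations.
double reduction_pct(double before_s, double after_s) {
  return (1.0 - after_s / before_s) * 100.0;
}
```

With the slide's times, the three stages come out near 70%, 69.5%, and 48.9%, matching the quoted figures.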

SLIDE 33

Experimental Result


SLIDE 34

Experimental Results

  • First-Round Optimizations
  • five generally-used optimizations
  • 80 Gflops achieved on a single Tesla K40
  • Customized Optimizations
  • a customized cache mechanism & inter-thread rescheduling
  • 146 Gflops achieved on a single Tesla K40
  • Experimental Results
  • 451 Gflops on a single Tesla K80, a 31.64x speedup over a 12-core CPU (E5-2697 v2)
  • 16.87% of peak on the Tesla K80
  • Weak Scaling Result
  • 98.7% efficiency across 32 nodes

SLIDE 35

Acknowledgement


SLIDE 36

Thank You!

haohuan@tsinghua.edu.cn
