SLIDE 1

The Ramses Code for Numerical Astrophysics: Toward Full GPU Enabling

Claudio Gheller (ETH Zurich - CSCS)

Giacomo Rosilho de Souza (EPF Lausanne)

Marco Sutti (EPF Lausanne)

Romain Teyssier (University of Zurich)

SLIDE 2
Simulations in astrophysics

  • Numerical simulations are an extraordinary tool for studying and solving astrophysical problems
  • They are true virtual laboratories in which numerical experiments can be run
  • Sophisticated codes are used to run these simulations on the most powerful HPC systems

SLIDE 3

Evolution of the Large Scale Structure of the Universe

Visualization made with Splotch (https://github.com/splotchviz/splotch)

Magneticum Simulation, K.Dolag et al., http://www.magneticum.org

SLIDE 4

Multi-species/quantities physics


Visualization made with Splotch (https://github.com/splotchviz/splotch)

F. Vazza et al., Hamburg Observatory, CSCS, PRACE

SLIDE 5

Galaxy formation


IRIS simulation, L.Mayer et al., University of Zurich, CSCS

SLIDE 6

Formation of the moon


R.Canup et al., https://www.boulder.swri.edu/~robin/

SLIDE 7

Codes: RAMSES

  • RAMSES (R. Teyssier, A&A, 385, 2002): a code for the study of astrophysical problems
  • Various components (dark energy, dark matter, baryonic matter, photons) are treated
  • Includes a variety of physical processes (gravity, magnetohydrodynamics, chemical reactions, star formation, supernova and AGN feedback, etc.)
  • Adaptive Mesh Refinement is adopted to provide high spatial resolution ONLY where it is strictly necessary

  • Open Source
  • Fortran 90
  • Code size: about 70000 lines
  • MPI parallel (public version)
  • OpenMP support (restricted access)
  • OpenACC under development
SLIDE 8

HPC power: Piz Daint


“Piz Daint”: Cray XC30 system @ CSCS (No. 6 in the Top500)

Nodes: 5272, each with an 8-core Intel SandyBridge CPU, equipped with:

  • 32 GB DDR3 memory
  • One NVIDIA Tesla K20X GPU with 6 GB of GDDR5 memory

Overall system

  • 42176 cores and 5272 GPUs
  • 170 + 32 TB of memory (host DDR3 + GPU GDDR5)
  • Interconnect: Aries routing and communications ASIC, and dragonfly network topology

  • Peak performance: 7.787 Petaflops
SLIDE 9

Scope

Overall goal: Enable the RAMSES code to exploit hybrid, accelerated architectures


Adopted programming model: OpenACC (http://www.openacc-standard.org/)

Development follows an incremental “bottom-up” approach

SLIDE 10

RAMSES: modular physics

[Module diagram: Time loop driving AMR build, Load Balance, Gravity, Hydro, N-Body, MHD, Cooling, RT and More Physics]

SLIDE 11

Processor architecture (Piz Daint)

CPU: Intel SandyBridge Xeon E5-2670
  • 8 cores, 2.6 GHz/core
  • 166.4 GFlops DP
  • 115 W → 1.44 GF/W
  • 32 GB DDR3 shared memory, 50 GB/sec

GPU: NVIDIA Kepler K20X
  • 2688 CUDA cores, 14 SMX, 732 MHz/core
  • 1.31 TFlops DP, 3.95 TFlops SP
  • 235 W → 5.57 GF/W
  • 6 GB GDDR5 memory, 200 GB/sec

CPU-GPU link: PCI-E2, 8 GB/sec
Interconnect: CRAY Aries, 10 GB/sec peak
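As a quick sanity check on the quoted peak figures (the per-cycle FLOP counts below are standard values for these processors, assumed here rather than taken from the slide):

$$8\ \text{cores} \times 2.6\ \text{GHz} \times 8\ \tfrac{\text{FLOP}}{\text{cycle}} = 166.4\ \text{GFlops DP}, \qquad 166.4 / 115\ \text{W} \approx 1.44\ \text{GF/W}$$

$$2688\ \text{cores} \times 0.732\ \text{GHz} \times 2\ \tfrac{\text{FLOP}}{\text{cycle}} \approx 3.94\ \text{TFlops SP}, \qquad \text{DP} \approx \tfrac{1}{3}\ \text{SP} \approx 1.31\ \text{TFlops}, \qquad 1310 / 235\ \text{W} \approx 5.57\ \text{GF/W}$$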

SLIDE 12

RAMSES: Modular, incremental GPU implementation

[Module diagram as in Slide 10, with MPI communication attached to several modules and each physics module labelled Low, Mid or Hi GF]

GF = “GPU FRIENDLY”: computational intensity + data independence

SLIDE 13

First steps toward the GPU

[Same module diagram as in Slide 12: modules labelled by their degree of GPU friendliness (computational intensity + data independence)]

SLIDE 14

Step 1: solving fluid dynamics

  • Fluid dynamics is one of the key kernels;
  • It is also among the most computationally demanding;
  • It is a local problem;
  • Fluid dynamics is solved on a computational mesh by solving three conservation equations, for mass, momentum and energy:

[Schematic: cell (i, j) exchanging fluxes across its four interfaces; RAMSES module diagram (Time loop, AMR build, Communication/Balancing, Gravity, Hydro, N-Body, More physics)]

$$\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{u}) = 0$$

$$\frac{\partial}{\partial t}(\rho \mathbf{u}) + \nabla \cdot (\rho \mathbf{u} \otimes \mathbf{u}) + \nabla p = -\rho \nabla \phi$$

$$\frac{\partial}{\partial t}(\rho e) + \nabla \cdot \left[\rho \mathbf{u}\,(e + p/\rho)\right] = -\rho \mathbf{u} \cdot \nabla \phi$$
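In finite-volume form each of these conservation laws reduces to a per-cell update from the fluxes across the cell interfaces (written here in one dimension for brevity; the actual RAMSES solver is a second-order Godunov scheme, so the flux computation itself is more involved):

$$U_i^{n+1} = U_i^{n} - \frac{\Delta t}{\Delta x}\left(F_{i+1/2} - F_{i-1/2}\right) + \Delta t\, S_i$$

where $U_i$ collects the conserved variables $(\rho, \rho\mathbf{u}, \rho e)$ of cell $i$, $F_{i\pm 1/2}$ are the interface fluxes and $S_i$ the gravitational source terms. Each cell needs only its immediate neighbours, which is what makes the kernel local and GPU friendly.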

SLIDE 15

The challenge: RAMSES AMR Mesh

Fully Threaded Tree with Cartesian mesh

  • CELL BY CELL refinement
  • COMPLEX data structure
  • IRREGULAR memory distribution
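To make the data-structure challenge concrete, here is a hypothetical Fortran sketch of a fully threaded tree node (field names and sizes are illustrative only, not RAMSES' actual memory layout): connectivity is stored as integer indices into global arrays, so cells that are neighbours in the simulation volume are generally far apart in memory, which is exactly the irregular access pattern GPUs handle poorly.

  ! Illustrative oct of a 3D fully threaded tree (8 cells per oct)
  type oct_t
     integer :: level            ! refinement level of this oct
     integer :: parent_cell      ! index of the parent cell one level up
     integer :: neighbour(6)     ! indices of the six face-neighbouring octs
     integer :: child(8)         ! indices of the child octs (0 where a cell is a leaf)
     real(kind=8) :: u(5, 8)     ! conserved variables (mass, momentum, energy) per cell
  end type oct_t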
SLIDE 16

GPU implementation of the Hydro kernel

  1. Memory bandwidth:
     • reorganization of memory into spatially (and memory) contiguous large patches, so that work can easily be split into blocks with efficient memory access
     • further grouping of patches to increase data locality
  2. Parallelism:
     • patch-to-block assignment
     • one-cell-per-thread integration (see the OpenACC sketch after this list)
  3. Data transfer:
     • offload data only when and where necessary
  4. GPU memory size:
     • still an open issue…
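The patch-based scheme above can be sketched in Fortran with OpenACC roughly as follows. This is a minimal illustration, not the actual RAMSES hydro routine: the subroutine and array names are hypothetical, and the real solver computes Riemann fluxes on the device instead of receiving them as an argument.

  ! Sketch only: uold/unew hold the conserved variables of npatch contiguous
  ! patches of ncell cells each; flux holds interface fluxes already scaled by dt/dx.
  subroutine hydro_patches_gpu(uold, flux, unew, npatch, ncell, nvar)
    implicit none
    integer, intent(in)       :: npatch, ncell, nvar
    real(kind=8), intent(in)  :: uold(nvar, ncell, npatch)
    real(kind=8), intent(in)  :: flux(nvar, ncell+1, npatch)
    real(kind=8), intent(out) :: unew(nvar, ncell, npatch)
    integer :: ip, ic, iv

    ! Offload the patch data once, update every cell on the GPU, copy back the result
    !$acc data copyin(uold, flux) copyout(unew)
    !$acc parallel loop gang
    do ip = 1, npatch                    ! one patch -> one gang (block of GPU threads)
       !$acc loop vector collapse(2)
       do ic = 1, ncell                  ! one cell  -> one GPU thread
          do iv = 1, nvar
             ! conservative update: difference of the two interface fluxes
             unew(iv, ic, ip) = uold(iv, ic, ip) + flux(iv, ic, ip) - flux(iv, ic+1, ip)
          end do
       end do
    end do
    !$acc end data
  end subroutine hydro_patches_gpu

In the real code the data region would live higher up the call tree, so that patches stay resident on the GPU across the whole time step instead of being copied in and out at every call (the “offload data only when and where necessary” point above).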
SLIDE 17

Some Results: hydro only


[Plots: fraction of time saved using the GPU; scalability of the CPU and GPU versions (total time); scalability of the CPU and GPU versions (Hydro time)]

  • Data movement is still a 30-40% overhead: it can be worse with more complex AMR hierarchies
  • A large fraction of the code is still on the CPU
  • No overlap of GPU and CPU computation

We need to extend the fraction of the code enabled for the GPU, reducing data transfers and overlapping with the remaining CPU part as much as possible

SLIDE 18

Step 2: Adding the cooling module

  • Energy is corrected only on leaf cells, each independently
  • GPU implementation requires minimization of data transfer…
  • …and exploitation of the high degree of parallelism, with “automatic” load balancing:
  • Iterative procedure with a cell-by-cell timestep (see the sketch below)
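A minimal sketch of this pattern in Fortran with OpenACC (the subroutine name, the placeholder cooling law and the 10% substep limiter are assumptions, not the actual RAMSES cooling module): only the leaf-cell arrays are offloaded, and each GPU thread subcycles its own cell with a cell-local timestep until the full hydro step dt is covered, which balances the load across threads automatically.

  subroutine cool_leaf_cells_gpu(e, rho, nleaf, dt)
    implicit none
    integer, intent(in)         :: nleaf
    real(kind=8), intent(in)    :: rho(nleaf), dt
    real(kind=8), intent(inout) :: e(nleaf)            ! internal energy of the leaf cells
    real(kind=8), parameter     :: c0 = 1.0d-2         ! placeholder cooling constant
    real(kind=8) :: t, dt_cell, lambda
    integer :: i

    ! Only leaf-cell data cross the PCI-E bus; each thread iterates independently
    !$acc parallel loop copy(e) copyin(rho) private(t, dt_cell, lambda)
    do i = 1, nleaf                       ! one leaf cell per GPU thread
       t = 0.0d0
       do while (t < dt)                  ! cell-by-cell iterative timestep
          lambda  = -c0 * rho(i) * e(i)                            ! toy cooling rate
          dt_cell = min(dt - t, 0.1d0 * e(i) / max(abs(lambda), 1.0d-30))
          e(i)    = e(i) + dt_cell * lambda
          t       = t + dt_cell
       end do
    end do
  end subroutine cool_leaf_cells_gpu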

SLIDE 19

Adding the cooling


  • Comparing 64 GPUs to 64 CPUs: speed-up = 2.55

SLIDE 20

Toward full GPU enabling


  • Gravity is being moved to the GPU
  • ALL MPI communication is being moved to the GPU using the GPUDirect MPI implementation (see the sketch below)
  • N-body will stay on the CPU:
    • Low computational intensity
    • Can easily overlap with GPU computation
    • No need to transfer all particle data, saving time and, especially, GPU memory
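A minimal sketch of the GPUDirect idea with OpenACC and a CUDA-aware MPI (buffer names and the exchange pattern are illustrative, not RAMSES' actual communication layer): host_data hands the device addresses of GPU-resident buffers straight to MPI, so boundary data never have to be staged through host memory.

  subroutine exchange_ghosts_gpu(sendbuf, recvbuf, n, left, right)
    use mpi
    implicit none
    integer, intent(in)         :: n, left, right
    real(kind=8), intent(in)    :: sendbuf(n)   ! assumed already present on the GPU
    real(kind=8), intent(inout) :: recvbuf(n)   ! (e.g. inside an enclosing !$acc data region)
    integer, parameter :: tag = 100
    integer :: ierr

    ! With a CUDA-aware (GPUDirect) MPI, the library moves the data GPU to GPU
    !$acc host_data use_device(sendbuf, recvbuf)
    call MPI_Sendrecv(sendbuf, n, MPI_DOUBLE_PRECISION, right, tag, &
         &            recvbuf, n, MPI_DOUBLE_PRECISION, left,  tag, &
         &            MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
    !$acc end host_data
  end subroutine exchange_ghosts_gpu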

SLIDE 21

Summary

Objective: enable the RAMSES code on the GPU

Methodology: incremental approach exploiting RAMSES’ modular architecture and the OpenACC programming model

Current achievement: Hydro and Cooling kernels ported to the GPU; MHD kernel almost done

On-going work:

  • Move all MPI communication to the GPU
  • Enable gravity on the GPU
  • Data transfer minimization

SLIDE 22

Thanks for your attention
