PORTING VASP TO GPUS WITH OPENACC


SLIDE 1

PORTING VASP TO GPUS WITH OPENACC

Stefan Maintz, Dr. Markus Wetzstein | 03/26/2018
smaintz@nvidia.com; mwetzstein@nvidia.com

SLIDE 2

AGENDA

  • Short introduction to VASP
  • Status of the CUDA port
  • Prioritizing Use Cases for GPU Acceleration
  • OpenACC in VASP
  • Comparative Benchmarking

SLIDE 3

VASP OVERVIEW

  • Atomic-scale materials modeling from first principles
  • Simulates 1 to 1000s of atoms (mostly solids/surfaces): liquids, crystals, magnetism, semiconductors/insulators, surfaces, catalysts
  • Solves the many-body Schrödinger equation
  • Leading electronic-structure program for solids, surfaces, and interfaces; used to study chemical/physical properties, reaction paths, etc.

SLIDE 4

VASP OVERVIEW

Quantum-mechanical methods:
  • Density Functional Theory (DFT): enables solving sets of Kohn-Sham equations in a plane-wave based framework (PAW)
  • Hybrid DFT: adds (parts of) exact exchange (Hartree-Fock)
  • and even beyond!

SLIDE 5

VASP

The Vienna Ab initio Simulation Package

  • Developed in G. Kresse's group at the University of Vienna (with external contributors)
  • Under development/refactoring for about 25 years
  • 460K lines of Fortran 90, some FORTRAN 77
  • MPI-parallel; OpenMP recently added for multicore
  • First endeavors on GPU acceleration date back to before 2011, using CUDA C

SLIDE 6

VASP USERS / USAGE

Users and fields:
  • Materials science, chemical engineering, physics & physical chemistry
  • Academia: 12–25% of CPU cycles at supercomputing centers (CSC, Finland, 2012)
  • Companies: large semiconductor companies; oil & gas; chemicals (bulk or fine); materials (glass, rubber, ceramics, alloys, polymers, metals)

Top 5 HPC applications (Source: Intersect360 2017 Site Census, mentions):
  1. GROMACS
  2. ANSYS Fluent
  3. Gaussian
  4. VASP
  5. NAMD

SLIDE 7

AGENDA

  • Short introduction to VASP
  • Status of the CUDA port
  • Prioritizing Use Cases for GPU Acceleration
  • OpenACC in VASP
  • Comparative Benchmarking

SLIDE 8

VASP COLLABORATION ON CUDA PORT

Collaborators:
  • U of Chicago

CUDA port project scope:
  • Minimization algorithms to calculate the electronic ground state: blocked Davidson (ALGO = NORMAL & FAST) and RMM-DIIS (ALGO = VERYFAST & FAST)
  • Parallelization over k-points
  • Exact-exchange calculations

Earlier work:
  • Speeding up plane-wave electronic-structure calculations using graphics-processing units; Maintz, Eck, Dronskowski
  • VASP on a GPU: application to exact-exchange calculations of the stability of elemental boron; Hutchinson, Widom
  • Accelerating VASP Electronic Structure Calculations Using Graphic Processing Units; Hacene, Anciaux-Sedrakian, Rozanska, Klahr, Guignon, Fleurat-Lessard

SLIDE 9

CUDA ACCELERATED VERSION OF VASP

  • All GPU acceleration is done in CUDA C
  • Not all use cases are ported to GPUs
  • Different source trees for Fortran vs. CUDA C
  • The CPU code gets continuously updated and enhanced, as required for various platforms
  • Challenge to keep the CUDA C sources up to date
  • Long development cycles to port new solvers

Available today on NVIDIA Tesla GPUs

[Diagram: the upper levels of the code dispatch into either the CPU call tree or a duplicated GPU call tree (routine A, routine B).]

SLIDE 10

INTEGRATION WITH VASP 5.4.4 (CUDA)

  • Original routine (Fortran): davidson.F
  • GPU-accelerated routine, drop-in replacement (Fortran): davidson_gpu.F
  • Custom kernels and support code (CUDA C): davidson.cu, cuda_helpers.h, cuda_helpers.cu, …
  • A makefile switch selects between the original and the GPU-accelerated routine
SLIDE 11

CUDA Accelerated Version of VASP

Source code duplication in CUDA C in VASP led to:

  • increased maintenance cost
  • improvements in the CPU code needing to be replicated
  • long development cycles to port new solvers

Available today on NVIDIA Tesla GPUs


Explore OpenACC as an improvement for GPU acceleration

SLIDE 12

AGENDA

  • Short introduction to VASP
  • Status of the CUDA port
  • Prioritizing Use Cases for GPU Acceleration
  • OpenACC in VASP
  • Comparative Benchmarking

SLIDE 13

CATEGORIES FOR METHODOLOGICAL OPTIONS

This does not include options influencing parallelization.

Levels of theory:
  • Standard DFT
  • Hybrid DFT (exact exchange)
  • RPA (ACFDT, GW)
  • Bethe-Salpeter equations (BSE)
  • …

Solvers / main algorithm:
  • Davidson
  • RMM-DIIS
  • Davidson + RMM-DIIS
  • Damped
  • …

Projection scheme:
  • Real space
  • Real space (automatic optimization)
  • Reciprocal space

Executable flavors:
  • Standard variant
  • Gamma-point only (simplifications possible)
  • Non-collinear variant (more interactions)

SLIDE 14

EXAMPLE BENCHMARK: SILICA_IFPEN

[Diagram: the silica_IFPEN benchmark mapped onto the option categories. It exercises standard DFT (level of theory), the Davidson and RMM-DIIS solvers, real-space projection, the standard executable flavor, and the parallelization options KPAR, NSIM and NCORE. The remaining options (hybrid DFT, RPA, BSE; Dav.+RMM-DIIS and damped solvers; reciprocal or automatic projection; Gamma-point and non-collinear flavors) are not used by this benchmark.]

SLIDE 15

PARALLELIZATION OPTIONS

KPAR (distributes k-points):
  • Highest-level parallelism, more or less embarrassingly parallel
  • Can help for smaller systems
  • Not always possible

NCORE (distributes plane waves):
  • Lowest-level parallelism; needs a parallel 3D FFT and inserts lots of MPI messages
  • Can help with load-balancing problems
  • No support in the CUDA port

NSIM (blocking of orbitals):
  • No parallelism here, just a grouping that can influence communication
  • The ideal value differs between CPU and GPU and needs to be tuned
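As a purely illustrative example (the numbers are not from the talk): with 80 MPI ranks, KPAR=2 would split them into 2 k-point groups of 40 ranks each; within a group, NCORE=4 means 4 ranks share the plane-wave coefficients of each orbital, so 40/4 = 10 orbital groups are worked on concurrently; NSIM=8 then lets RMM-DIIS update blocks of 8 orbitals at a time.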

SLIDE 16

PARALLELIZATION LAYERS IN VASP

[Diagram: the wavefunction Ψ is decomposed hierarchically into spins (↑, ↓), k-points (distributed when KPAR>1), bands/orbitals (distributed by default), and plane-wave coefficients (distributed when NCORE>1); the figure distinguishes the physical quantities from the parallelization features.]

SLIDE 17

POSSIBLE USE CASES IN VASP

  • VASP supports a plethora of run-time options that define the workload (use case)
  • These methodological options can be grouped into categories
  • Some, but not all, are combinable; each combination has a different computational profile
  • The combination determines whether GPU acceleration is supported, and also how well
  • Benchmarking the complete situation is tremendously complex

SLIDE 18

WHERE TO START

You cannot accelerate everything (at least not soon):
  • Ideally, every use case would be ported
  • Standard and hybrid DFT alone give 72 use cases, ignoring parallelization options (2 levels of theory × 4 solvers × 3 projection schemes × 3 executable flavors)!
  • We need to select the most important use cases
  • The selection should be based on real-world or supercomputing-facility scenarios

SLIDE 19

STATISTICS ON VASP USE CASES

NERSC job submission data, 2014:
  • Zhengji Zhao collected such data (INCAR files) for 30,397 VASP jobs over nearly 2 months
  • Covers 130 unique users on Edison (a CPU-only system)
  • The data is based on job counts and has no timing information
  • No 1:1 mapping of parameters is possible, so expect large error margins
  • The data does not include calculation sizes, but it is a great start

SLIDE 20

EMPLOYED MAIN ALGORITHMS AND LEVELS OF THEORY

[Chart: share of jobs by main algorithm (Davidson, Dav+RMM, RMM-DIIS, Damped, Exact, RPA, Conjugate, BSE, EIGENVAL) and by level of theory (standard DFT, hybrid DFT, RPA, BSE); labeled values range from 51% down to 2%. Source: based on data provided by Zhengji Zhao, NERSC, 2014.]

SLIDE 21

SUMMARY

Where to start:
  • Start with standard DFT, to accelerate the most jobs
  • RMM-DIIS and Davidson are nearly equally important and share a lot of routines anyway
  • Real-space projection is more important for large setups
  • The Gamma-point executable flavor is as important as the standard one, so start with the general one
  • Support as many parallelization options as possible (KPAR, NSIM, NCORE)
  • Communication is important, but scaling to large node counts is low priority (62% of jobs fit into 4 nodes, 95% used ≤12 nodes)

SLIDE 22

VASP OPENACC PORTING PROJECT

A feasibility study:
  • Can we get a working version with today's compilers, tools and hardware?
  • Decision to focus on one algorithm: RMM-DIIS
  • Guidelines:
    • work out of the existing CPU code
    • stay minimally invasive to the CPU code
  • Goals:
    • allow for a performance comparison against the CUDA port
    • assess maintainability and the threshold for future porting efforts

SLIDE 23

AGENDA

  • Short introduction to VASP
  • Status of the CUDA port
  • Prioritizing Use Cases for GPU Acceleration
  • OpenACC in VASP
  • Comparative Benchmarking

SLIDE 24

OPENACC DIRECTIVES

Data directives are designed to be optional

Manage data movement, initiate parallel execution, optimize loop mappings:

    !$acc data copyin(a,b) copyout(c)
    ...
    !$acc parallel
    !$acc loop gang vector
    do i = 1, n
      c(i) = a(i) + b(i)
      ...
    enddo
    !$acc end parallel
    ...
    !$acc end data

SLIDE 25

DATA REGIONS IN OPENACC

Intrinsic data types, static and dynamic:
  • All static intrinsic data types of the programming language can appear in an OpenACC data directive, e.g. real, complex and integer scalar variables in Fortran.
  • The same holds for all fixed-size arrays of intrinsic types and for dynamically allocated arrays of intrinsic types, e.g. allocatable and pointer variables in Fortran.
  • The compiler knows the base address and the size (in C, the size needs to be specified in the directive).
  • So what about derived types? There are two variants:

    type stat_def
      integer a, b
      real c
    end type stat_def
    type(stat_def) :: var_stat

    type dyn_def
      integer m
      real, allocatable, dimension(:) :: r
    end type dyn_def
    type(dyn_def) :: var_dyn

SLIDE 26

DEEPCOPY IN OPENACC

Full vs. manual deep copy:
  • Handling the generic case is a main goal for a future OpenACC 3.0 specification; this is often referred to as full deep copy.
  • Until then, writing a manual deep copy is the best way to handle derived types:
    • OpenACC 2.6 provides the needed functionality (attach/detach).
    • Static members of a derived type are handled by the compiler.
    • The programmer manually copies every dynamic member of the derived type,
    • AND ensures correct pointer attachment/detachment in the parent!

For more, see Daniel Tian's talk, S8805, today 11:30, Grand Ballroom 220C!

SLIDE 27

DERIVED TYPE

Manual copy:

    type dyn_def
      integer m
      real, allocatable, dimension(:) :: r
    end type dyn_def
    type(dyn_def) :: var_dyn
    ...
    allocate(var_dyn%r(some_size))
    !$acc enter data copyin(var_dyn, var_dyn%r)
    ...
    !$acc exit data copyout(var_dyn%r, var_dyn)

On enter data, copying in the parent var_dyn:
  1. allocates device memory for var_dyn
  2. copies m (H2D)
  3. copies the host pointer for var_dyn%r, so the device pointer is invalid!

Copying in the member var_dyn%r afterwards:
  1. allocates device memory for r
  2. copies r (H2D)
  3. attaches the device copy's pointer var_dyn%r to the device copy of r

On exit data, copying out the member var_dyn%r first:
  1. copies r (D2H)
  2. deallocates device memory for r
  3. detaches var_dyn%r on the device, i.e. overwrites r with its host value, so the device pointer is invalid

Copying out the parent var_dyn afterwards:
  1. copies m (D2H)
  2. copies var_dyn%r, so the host pointer stays intact!
  3. deallocates device memory for var_dyn
SLIDE 28

MANUAL DEEPCOPY

Important:
  • The invalid pointers must not be dereferenced!
  • Use the update directive only on members, never on a parent (it would overwrite the member pointers)!
  • The OpenACC 2.6 directives/API calls (acc_attach/acc_detach) are invoked internally by data directives like copyin(var_dyn%r), or must be invoked explicitly if the parent information is missing (e.g. copyin(r) followed by attach(var_dyn%r)).
  • Typically, we need separate routines wrapping the create, copyin, copyout and delete directives (a sketch follows below).
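As an illustration of such wrapper routines, here is a minimal sketch (not taken from VASP; the module name dyn_acc_helpers and the routine names acc_copyin_dyn, acc_update_device_dyn and acc_copyout_dyn are hypothetical) for the dyn_def type used on the previous slides:

    module dyn_acc_helpers
      implicit none
      type dyn_def
        integer m
        real, allocatable, dimension(:) :: r
      end type dyn_def
    contains
      subroutine acc_copyin_dyn(v)
        type(dyn_def), intent(inout) :: v
        !$acc enter data copyin(v)    ! parent first: shallow copy, device pointer to r still invalid
        !$acc enter data copyin(v%r)  ! member: allocates, copies and attaches v%r on the device
      end subroutine acc_copyin_dyn

      subroutine acc_update_device_dyn(v)
        type(dyn_def), intent(inout) :: v
        !$acc update device(v%r)      ! update members only, never the parent
      end subroutine acc_update_device_dyn

      subroutine acc_copyout_dyn(v)
        type(dyn_def), intent(inout) :: v
        !$acc exit data copyout(v%r)  ! member first: copies r back (D2H) and detaches it
        !$acc exit data copyout(v)    ! parent last: copies m back and frees the device struct
      end subroutine acc_copyout_dyn
    end module dyn_acc_helpers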

SLIDE 29

OPENACC 2.6 MANUAL DEEPCOPY

VASP: managing one aggregate data structure

A single directive such as

    !$acc data copyin(array1)

has to become a call to a hand-written deep-copy routine:

    call my_copyin(array1)

The aggregate type is nested:
  • Derived type 1: 3 dynamic members, 1 member of derived type 2
  • Derived type 2: 21 dynamic members, 1 member of derived type 3, 1 member of derived type 4
  • Derived type 3: only static members
  • Derived type 4: 8 dynamic members, 4 members of derived type 5, 2 members of derived type 6
  • Derived type 5: 3 dynamic members
  • Derived type 6: 8 dynamic members

The per-type copy routines take >48, >26, >13, >12 and >8 lines of code, i.e. more than 107 lines of code just for COPYIN. Plus additional lines of code for COPYOUT, CREATE and UPDATE.

SLIDE 30

MANUAL DEEPCOPY IN VASP

Manual deep copy allowed porting RMM-DIIS:
  • A necessary step to port VASP with OpenACC (currently)
  • Increases the amount of code, but it is well encapsulated
  • Future OpenACC versions (3.0) will work without manual deep copy and hence with less code
  • Unified Memory (UM) is not an option right now: not all data is dynamically allocated! Ongoing work to support all types of data in UM; HMM will improve the situation

SLIDE 31

PORTING VASP WITH OPENACC

  • Successfully ported the RMM-DIIS solver, plus some additional functionality
  • Very little code refactoring was required
  • Interfacing to the cuFFT, cuBLAS and cuSolver math libraries (see the sketch below)
  • Manual deep copy was key
  • OpenACC is integrated into the latest VASP development source version
  • Public availability is expected with the next VASP release
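The library interfacing mentioned above typically goes through OpenACC's host_data construct, which hands device addresses to the library. Below is a minimal sketch, not taken from VASP, assuming the cuBLAS Fortran interfaces shipped with the PGI/NVIDIA compilers (built with something like pgfortran -acc -Mcudalib=cublas); the program name and sizes are made up:

    program acc_cublas_zgemm
      use cublas                      ! PGI/NVIDIA cuBLAS Fortran module (legacy BLAS-style interfaces)
      implicit none
      integer, parameter :: n = 512
      complex(8), allocatable :: a(:,:), b(:,:), c(:,:)
      allocate(a(n,n), b(n,n), c(n,n))
      a = (1d0, 0d0); b = (2d0, 0d0); c = (0d0, 0d0)
      !$acc data copyin(a, b) copy(c)
      !$acc host_data use_device(a, b, c)
      ! inside host_data, a, b and c refer to the device copies, so cuBLAS runs on the GPU
      call cublasZgemm('N', 'N', n, n, n, (1d0, 0d0), a, n, b, n, (0d0, 0d0), c, n)
      !$acc end host_data
      !$acc end data
      print *, 'c(1,1) =', c(1,1)      ! expect (1024, 0) for these fill values
    end program acc_cublas_zgemm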


SLIDE 32

AGENDA

  • Short introduction to VASP
  • Status of the CUDA port
  • Prioritizing Use Cases for GPU Acceleration
  • OpenACC in VASP
  • Comparative Benchmarking

SLIDE 33

VASP OPENACC PERFORMANCE

silica_IFPEN on V100

  • Total elapsed time for the entire benchmark: 634 s on the CPU, including the EDDRMM part, initialization, diagonalization, orthonormalization, etc.
  • Without MPS: same number of MPI ranks
  • With MPS: the number of MPI ranks is tuned to optimize the load on the GPU (for the CUDA and OpenACC versions individually)
  • For more than 2 GPUs, the OpenACC version with MPS is slower than without

CPU: dual-socket Broadwell E5-2698 v4, compiler Intel 17.0.1; CUDA version: Intel 17.0.1; OpenACC version: PGI 18.1

[Chart: full benchmark, speedup over CPU (NCORE=1) vs. number of V100 GPUs (1, 2, 4, 8), for the CUDA and OpenACC versions, each with and without MPS. NCORE>1 helps the CPU to perform (less work, more MPI); NCORE=1 gives the same workload/parallelization as on the GPU.]

SLIDE 34

VASP OPENACC PERFORMANCE

silica_IFPEN on V100

  • NCORE=40: a smaller workload on the CPU than in the GPU versions improves CPU performance
  • Compared against a 'tuned setup' on the CPU
  • The GPUs still outperform the dual-socket CPU node, in particular with the OpenACC version
  • 97 seconds on a Volta-based DGX-1 with OpenACC

CPU: dual-socket Broadwell E5-2698 v4, compiler Intel 17.0.1; CUDA version: Intel 17.0.1; OpenACC version: PGI 18.1

[Chart: full benchmark, speedup over CPU (NCORE=40) vs. number of V100 GPUs (1, 2, 4, 8), for the CUDA and OpenACC versions, each with and without MPS.]

SLIDE 35

VASP OPENACC PERFORMANCE

Kernel-level comparison for energy expectation values

                                          CUDA port      OpenACC port
    kernels per orbital                   1 (69 µs)      8 (90 µs total)
    kernels per NSIM block (4 orbitals)   1 (137 µs)     0 (0 µs)
    runtime per orbital                   104 µs         90 µs
    runtime per NSIM block (4 orbitals)   413 µs         360 µs

  • NSIM independent reductions (sketched below)
  • The additional NSIM-fused kernel was probably better on older GPU generations
  • Unfusing removes a synchronization point
  • OpenACC adapts the optimization to the architecture with a flag
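A minimal sketch (not VASP source; array shapes and names are made up) of the unfused pattern referenced above: one independent reduction per orbital in an NSIM block, each accumulating an expectation value from the plane-wave coefficients, assuming cw and hw already reside on the GPU in an enclosing data region:

    ! cw(ng, nsim): wavefunction coefficients, hw(ng, nsim): H applied to them,
    ! e(nsim): one expectation value per orbital (assembled on the host)
    subroutine energy_expectation(ng, nsim, cw, hw, e)
      implicit none
      integer, intent(in) :: ng, nsim
      complex(8), intent(in) :: cw(ng, nsim), hw(ng, nsim)
      real(8), intent(out) :: e(nsim)
      integer :: i, n
      real(8) :: s
      do n = 1, nsim                                      ! NSIM independent reduction kernels
        s = 0d0
        !$acc parallel loop reduction(+:s) present(cw, hw)
        do i = 1, ng
          s = s + dble(conjg(cw(i, n)) * hw(i, n))
        end do
        e(n) = s
      end do
    end subroutine energy_expectation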

SLIDE 36

VASP OPENACC PERFORMANCE

Section-level comparison for orthonormalization

                                    CUDA port                  OpenACC port
    Redistributing wavefunctions    host-only MPI (185 ms)     GPU-aware MPI (110 ms)
    Matrix-matrix multiplies        streamed data (19 ms)      GPU-local data (15 ms)
    Cholesky decomposition          CPU only (24 ms)           cuSolver (12 ms)
    Matrix-matrix multiplies        default scheme (30 ms)     better blocking (13 ms)
    Redistributing wavefunctions    host-only MPI (185 ms)     GPU-aware MPI (80 ms)

  • GPU-aware MPI benefits from NVLink latency and bandwidth (see the sketch below)
  • Data remains on the GPU; the CUDA port streamed data for the GEMMs
  • Cholesky on the CPU saves a (smaller) memory transfer
  • 180 ms (40%) are saved by GPU-aware MPI alone, 33 ms (7.5%) by the other changes
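The GPU-aware MPI path again relies on host_data: the MPI library receives device pointers and can move data directly over NVLink/GPUDirect instead of staging through host memory. A minimal sketch (not VASP source), assuming a CUDA-aware MPI and using an alltoall as a stand-in for the wavefunction redistribution:

    program gpu_aware_alltoall
      use mpi
      implicit none
      integer :: ierr, rank, nranks
      integer, parameter :: n = 1024                 ! elements exchanged with each rank
      real(8), allocatable :: sendbuf(:), recvbuf(:)
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)
      allocate(sendbuf(n*nranks), recvbuf(n*nranks))
      sendbuf = dble(rank)
      !$acc data copyin(sendbuf) copyout(recvbuf)
      !$acc host_data use_device(sendbuf, recvbuf)
      ! a CUDA-aware MPI receives device pointers here and avoids host staging
      call MPI_Alltoall(sendbuf, n, MPI_DOUBLE_PRECISION, &
                        recvbuf, n, MPI_DOUBLE_PRECISION, MPI_COMM_WORLD, ierr)
      !$acc end host_data
      !$acc end data
      if (rank == 0) print *, 'recvbuf(1) =', recvbuf(1)
      call MPI_Finalize(ierr)
    end program gpu_aware_alltoall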
SLIDE 37

VASP BENCHMARKS

Differences between the CUDA and OpenACC versions:

Full-benchmark timings are interesting for time-to-solution, but they are not an 'apples-to-apples' comparison between the CUDA and OpenACC versions:
  • Amdahl's law for the non-GPU-accelerated parts of the code affects both implementations, but it blurs the differences
  • OpenACC made it possible to port additional kernels with minimal effort; this has not been undertaken for the CUDA version
  • The OpenACC version uses GPU-aware MPI to help the more communication-heavy parts, like orthonormalization
  • The OpenACC version was forked from a more recent version of the CPU code, while the CUDA implementation is older

Can we find a subset which allows for a fairer comparison? Use EDDRMM.

SLIDE 38

VASP OPENACC PERFORMANCE

silica_IFPEN on V100

  • The EDDRMM part has comparable GPU coverage in the CUDA and OpenACC versions
  • The CUDA version uses kernel fusing; the OpenACC version uses two refactored kernels
  • Minimal amount of MPI communication
  • The OpenACC version improves scaling with the number of GPUs

CPU: dual-socket Broadwell E5-2698 v4, compiler Intel 17.0.1; CUDA version: Intel 17.0.1; OpenACC version: PGI 18.1

[Chart: EDDRMM part, speedup over CPU (NCORE=1) vs. number of V100 GPUs (1, 2, 4, 8), for the CUDA and OpenACC versions, each with and without MPS. NCORE>1 helps the CPU to perform (less work, more MPI); NCORE=1 gives the same workload/parallelization as on the GPU.]

SLIDE 39

VASP OPENACC PERFORMANCE

silica_IFPEN on V100

  • NCORE=40: a smaller workload on the CPU than in the GPU versions improves CPU performance
  • Compared against a 'tuned setup' on the CPU
  • The GPUs still outperform the dual-socket CPU node, in particular with the OpenACC version

CPU: dual-socket Broadwell E5-2698 v4, compiler Intel 17.0.1; CUDA version: Intel 17.0.1; OpenACC version: PGI 18.1

[Chart: EDDRMM part, speedup over CPU (NCORE=40) vs. number of V100 GPUs (1, 2, 4, 8), for the CUDA and OpenACC versions, each with and without MPS.]

SLIDE 40

VASP

"For VASP, OpenACC is the way forward for GPU acceleration. Performance is similar and in some cases better than CUDA C, and OpenACC dramatically decreases GPU development and maintenance efforts. We're excited to collaborate with NVIDIA and PGI as an early adopter of CUDA Unified Memory."

Prof. Georg Kresse
Computational Materials Physics, University of Vienna
The Vienna Ab Initio Simulation Package
