NEW GPU FUNCTIONALITY IN VASP WITH OPENACC AND CUDA LIBRARIES
Stefan Maintz, 2019/12/18



AGENDA

  • Introduction to VASP
  • GPU Acceleration in VASP 5
  • Prioritizing Use Cases for New Porting Efforts
  • OpenACC in VASP 6 and Supported Features
  • Comparative Benchmarking

INTRODUCTION TO VASP
Scientific Background

  • A leading electronic-structure program for solids, surfaces and interfaces
  • Used to study chemical and physical properties, reaction paths, etc.
  • Atomic-scale materials modeling from first principles, from 1 to 1000s of atoms
  • Liquids, crystals, magnetism, semiconductors/insulators, surfaces, catalysts
  • Solves the many-body Schrödinger equation

INTRODUCTION TO VASP
Quantum-mechanical methods

  • Density Functional Theory (DFT)
  • Enables solving sets of Kohn-Sham equations
  • In a plane-wave based framework (PAW)
  • Hybrid DFT adds (parts of) exact exchange (Hartree-Fock)
  • And VASP can go even beyond!

INTRODUCTION TO VASP

  • 12-25% of CPU cycles at supercomputing centers
  • Academia: material sciences, chemical engineering, physics & physical chemistry
  • Companies: large semiconductor companies; oil & gas; chemicals (bulk or fine); materials (glass, rubber, ceramic, alloys, polymers and metals)

Top 5 HPC applications (Source: Intersect360 2017 Site Census Mentions):
  1. GROMACS
  2. ANSYS Fluent
  3. Gaussian
  4. VASP
  5. NAMD

[Pie charts: application usage shares at CSC, Finland (2012) and Archer, UK (2019/03); legend: VASP, Gromacs, cp2k, NEMO, Other; shares visible on the charts: 18.2% and 9.3%. Source: http://www.archer.ac.uk/status/codes/ 2019/03/28]

INTRODUCTION TO VASP
Details on the code

  • Developed by Prof. Kresse's group at the University of Vienna (and external contributors)
  • Under development/refactoring for about 25 years
  • 460K lines of Fortran 90, some FORTRAN 77
  • MPI-parallel; OpenMP recently added for multicore
  • First endeavors on GPU acceleration date back to before 2011, with CUDA C

INTRODUCTION TO VASP
Computational characteristics

  • Lots of small fast Fourier transforms (grids of about 100×100×100 points)
  • Matrix-matrix and matrix-vector multiplications
  • Matrix diagonalizations
  • All-to-all communications
  • And of course some custom kernels
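These characteristics map directly onto GPU libraries: the small FFTs go to cuFFT and the matrix products to cuBLAS. As a toy illustration (our own sketch, not VASP code), a discrete Fourier transform is just the sum below; production codes evaluate it over many small 3D grids:

```python
import cmath

def dft(x):
    """Naive 1-D discrete Fourier transform, O(n^2)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n))
            for k in range(n)]

# A constant signal transforms to a single spike in the k=0 bin.
spec = dft([1.0] * 8)
```

FFT libraries such as cuFFT compute the same sum in O(n log n); the point for VASP is that the transforms are small, so many of them must be batched to keep a GPU busy.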

AGENDA

  • Introduction to VASP
  • GPU Acceleration in VASP 5
  • Prioritizing Use Cases for New Porting Efforts
  • OpenACC in VASP 6 and Supported Features
  • Comparative Benchmarking

COLLABORATION ON CUDA C PORT OF VASP 5

Collaborators: U of Chicago

Earlier work:
  • Speeding up plane-wave electronic-structure calculations using graphics-processing units (Maintz, Eck, Dronskowski)
  • VASP on a GPU: Application to exact-exchange calculations of the stability of elemental boron (Hutchinson, Widom)
  • Accelerating VASP Electronic Structure Calculations Using Graphic Processing Units (Hacene, Anciaux-Sedrakian, Rozanska, Klahr, Guignon, Fleurat-Lessard)

CUDA port project scope:
  • Minimization algorithms to calculate the electronic ground state: blocked Davidson (ALGO = Normal & Fast) and RMM-DIIS (ALGO = VeryFast & Fast)
  • Parallelization over k-points
  • Exact-exchange calculations

INSTRUCTIONS TO COMPILE AND RUN VASP 5 ON GPUS

  • NVIDIA offers step-by-step instructions to compile and run VASP 5 with the CUDA C port: https://www.nvidia.cn/data-center/gpu-accelerated-applications/vasp/
  • Exemplary benchmarks to test against expected performance
  • Hints on tuning important run-time parameters
  • English version at https://www.nvidia.com/vasp

CUDA SOURCE INTEGRATION IN VASP 5.4.4

  • Original source tree (Fortran): davidson.F, …
  • Accelerated call tree, drop-in replacements (Fortran): davidson_gpu.F, …
  • Custom kernels and support code (CUDA C): davidson.cu, cuda_helpers.h, cuda_helpers.cu, …
  • A makefile switch selects between the source trees

CUDA C ACCELERATED VERSION OF VASP
Available today on NVIDIA Tesla GPUs with VASP 5.4.4

  • All GPU acceleration with CUDA C
  • Only some cases are ported to GPUs
  • Different source trees for Fortran vs CUDA C
  • CPU code gets continuously updated and enhanced, as required for various platforms
  • Challenge to keep CUDA C sources up to date
  • Long development cycles to port new features

[Diagram: below the shared upper levels, separate CPU and GPU call trees, with routines A and B duplicated in each]

CUDA C ACCELERATED VERSION OF VASP

Source code duplication for CUDA C in VASP led to:

  • increased maintenance cost
  • improvements in the CPU code needing replication
  • long development cycles to port new solvers

=> Explore OpenACC as an improvement for GPU acceleration

AGENDA

  • Introduction to VASP
  • GPU Acceleration in VASP 5
  • Prioritizing Use Cases for New Porting Efforts
  • OpenACC in VASP 6 and Supported Features
  • Comparative Benchmarking

FEATURES AVAILABLE AND ACCELERATED IN VASP 5

Levels of theory: Standard DFT; Hybrid DFT (exact exchange); RPA (ACFDT, GW); Bethe-Salpeter equations (BSE); …
Solvers / main algorithm: Davidson; RMM-DIIS; Davidson+RMM-DIIS; direct optimizers (Damped, All); linear response; …
Projection scheme: real space; real space (automatic optimization); reciprocal space
Executable flavors: standard variant; Gamma-point simplification variant; non-collinear spin variant

(On the slide: light green = GPU accelerated in VASP 5, black = not accelerated in VASP 5)

EXAMPLE BENCHMARK: CUC_VDW

[Diagram: the option categories that define this use case, with one choice taken from the alternatives in each category]

  • Level of theory: Standard DFT / Hybrid DFT / RPA / BSE
  • Solver: Davidson / RMM-DIIS / Dav.+RMM-DIIS / Damped
  • Projection scheme: real space / automatic / reciprocal
  • Executable flavor: standard / Gamma-point / non-collinear
  • Parallelization options: KPAR, NSIM, NCORE

PARALLELIZATION LAYERS IN VASP

[Diagram: the wavefunction Ψ decomposes hierarchically into physical quantities, each a potential parallelization layer: spins (↑, ↓); k-points k1, k2, k3, … (distributed when KPAR>1); bands/orbitals o1, o2, … (distributed by default); and plane-wave coefficients D1, D2, D3, … (distributed when NCORE>1)]

PARALLELIZATION OPTIONS

KPAR
  • Distributes k-points
  • Highest-level parallelism, more or less embarrassingly parallel
  • Can help for smaller systems
  • Not always possible

NSIM
  • Blocking of orbitals
  • Grouping (not distributing); influences caching and communication
  • Ideal value differs between CPU and GPU
  • Needs to be tuned

NCORE
  • Distributes plane waves
  • Lowest-level parallelism; needs parallel 3D FFT and inserts lots of MPI messages
  • Can help with load-balancing problems
  • No support in the CUDA port
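A simplified model of how these options carve up the MPI ranks (the function and the example numbers are our illustration, not VASP's actual setup code): KPAR splits the ranks into k-point groups, and within each group NCORE ranks cooperate on the plane waves of one band.

```python
def partition_ranks(nranks, kpar, ncore):
    """Split nranks into KPAR k-point groups; within each group,
    NCORE ranks share one band, leaving nranks/kpar/ncore bands
    being worked on concurrently per k-point group."""
    assert nranks % kpar == 0, "KPAR must divide the rank count"
    ranks_per_kgroup = nranks // kpar
    assert ranks_per_kgroup % ncore == 0, "NCORE must divide the group size"
    band_groups = ranks_per_kgroup // ncore
    return ranks_per_kgroup, band_groups

# 16 ranks with KPAR=2 and NCORE=4: each k-point group gets 8 ranks,
# which work on 2 bands concurrently
groups = partition_ranks(16, 2, 4)
```

NSIM does not appear here because it only blocks bands into batches within a group; it changes caching and communication behavior, not the rank decomposition.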

POSSIBLE USE CASES IN VASP

  • VASP supports a plethora of run-time options that define the workload (use case)
  • Those methodological options can be grouped into categories
  • Some, but not all, are combinable
  • The combination determines whether GPU acceleration is supported, and also how well
  • Each combination has a different computational profile
  • Benchmarking the complete situation is tremendously complex

WHERE TO START
You cannot accelerate everything (at least not soon)

  • Ideally, every use case would be ported
  • Standard and hybrid DFT alone give 72 use cases (ignoring parallelization options)!
  • Need to select the most important use cases
  • Selection should be based on real-world or supercomputing-facility scenarios
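One plausible way to arrive at the count of 72: two levels of theory times four solvers times three projection schemes times three executable flavors (the category members below are our reading of the feature slide, not an official enumeration):

```python
from itertools import product

levels  = ["standard DFT", "hybrid DFT"]
solvers = ["Davidson", "RMM-DIIS", "Davidson+RMM-DIIS", "direct optimizers"]
schemes = ["real space", "real space (auto)", "reciprocal space"]
flavors = ["standard", "Gamma-point", "non-collinear"]

# 2 * 4 * 3 * 3 combinations, before any parallelization options
use_cases = list(product(levels, solvers, schemes, flavors))
```

Adding KPAR, NSIM and NCORE settings on top multiplies this further, which is why benchmarking every combination is intractable.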

STATISTICS ON VASP USE CASES
NERSC job submission data, 2014

  • Zhengji Zhao (NERSC) collected such data (INCAR files) for 30,397 VASP jobs over nearly 2 months
  • The data is based on job count and has no timing information
  • Includes 130 unique users on Edison (a CPU-only system)
  • No 1:1 mapping of parameters is possible; expect large error margins
  • The data does not include calculation sizes, but it's a great start

VASP FEATURE USAGE AT NERSC

[Pie charts: levels of theory (standard DFT, hybrid DFT, RPA, BSE) and solvers / main algorithms (Davidson, Dav+RMM, RMM-DIIS, direct optimizers, other, RPA, BSE), based on job count; values visible on the charts: 51% and 2%]

Source: based on data provided by Zhengji Zhao, NERSC, 2014

SUMMARY
Where to start

  • Start with standard DFT, to accelerate most jobs
  • RMM-DIIS and Davidson are nearly equally important, and share a lot of routines anyway
  • The real-space projection scheme is more important for large setups
  • The Gamma-point executable flavor is as important as the standard one, so start with the general one
  • Support as many parallelization options as possible (KPAR, NSIM, NCORE)
  • Communication is important, but scaling to large node counts is a low priority (62% of jobs fit into 4 nodes, 95% used ≤12 nodes)

AGENDA

  • Introduction to VASP
  • GPU Acceleration in VASP 5
  • Prioritizing Use Cases for New Porting Efforts
  • OpenACC in VASP 6 and Supported Features
  • Comparative Benchmarking

OPENACC DIRECTIVES

Manage data movement, initiate parallel execution, and optimize loop mappings:

    !$acc data copyin(a,b) copyout(c)
    ...
    !$acc parallel
    !$acc loop gang vector
    do i = 1, n
       c(i) = a(i) + b(i)
       ...
    enddo
    !$acc end parallel
    ...
    !$acc end data

Data directives are designed to be optional.

DERIVED TYPES IN OPENACC
Manual deepcopy

    type dyn_def
       integer m
       real, allocatable, dimension(:) :: r
    end type dyn_def
    type(dyn_def) :: var_dyn
    ...
    allocate(var_dyn%r(some_size))
    !$acc enter data copyin(var_dyn, var_dyn%r)
    ...
    !$acc exit data copyout(var_dyn%r, var_dyn)

copyin(var_dyn):
  1. allocates device memory for var_dyn
  2. copies m (H2D)
  3. copies the host pointer for var_dyn%r -> device pointer invalid!

copyin(var_dyn%r):
  1. allocates device memory for r
  2. copies r (H2D)
  3. attaches the device copy's pointer var_dyn%r to the device copy of r

copyout(var_dyn%r):
  1. copies r (D2H)
  2. deallocates device memory for r
  3. detaches var_dyn%r on the device, i.e. overwrites r with its host value -> device pointer invalid!

copyout(var_dyn):
  1. copies m (D2H)
  2. copies var_dyn%r -> host pointer intact!
  3. deallocates device memory for var_dyn
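The failure mode is the one Python programmers know from shallow copies of nested objects; this analogy (our sketch, not VASP code) shows why copying only the outer aggregate leaves the inner pointer referring to the original storage:

```python
import copy

class DynDef:
    """Stand-in for the Fortran derived type: a scalar plus a dynamic array."""
    def __init__(self, m, r):
        self.m = m
        self.r = r  # like var_dyn%r

host = DynDef(3, [1.0, 2.0, 3.0])
shallow = copy.copy(host)      # copies m, but r still points at host's array
deep = copy.deepcopy(host)     # copies the payload of r as well

host.r.append(4.0)
# shallow.r follows the change (shared storage); deep.r does not
```

In OpenACC terms, a plain copyin of the outer type is the shallow copy: the scalar members arrive on the device, but the pointer member must be copied and re-attached separately, which is exactly what the manual deepcopy directives above do.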

MANAGING VASP AGGREGATE DATA STRUCTURES

  • OpenACC + Unified Memory is not an option today; some aggregates have static members
  • OpenACC 2.6 manual deepcopy was key
  • It requires large numbers of directives in some cases, but they are well encapsulated (107 lines for COPYIN)
  • A future spec of OpenACC might add true deep copy and require far fewer data directives
  • When CUDA Unified Memory + HMM supports all classes of data, there is potential for a VASP port with no data directives at all

[Diagram: a chain of nested derived types and the directive code each adds (+12, +48, +26, +8 and +13 lines of code)]
  • Derived type 1: 3 dynamic members, 1 of derived type 2
  • Derived type 2: 21 dynamic members, 1 of derived type 3, 1 of derived type 4
  • Derived type 3: only static members
  • Derived type 4: 8 dynamic members, 4 of derived type 5, 2 of derived type 6
  • Derived type 5: 3 dynamic members
  • Derived type 6: 8 dynamic members

Manual deepcopy allowed porting VASP.

INTERFACING NVIDIA CUDA LIBRARIES

VASP 6 leverages cuBLAS, cuFFT, cuSolver, NCCL and CUDA-aware MPI transparently, via encapsulated routines:

    #ifdef _OPENACC
    #define myZGEMM ACC_ZGEMM
    #else
    #define myZGEMM ZGEMM
    #endif
    ...
    CALL myZGEMM('N','N',M,N,K,ALPHA,A,&
         LDA,B,LDB,BETA,C,LDC)

    SUBROUTINE ACC_ZGEMM(OPA,OPB,M,N,K,ALPHA,A,&
         LDA,B,LDB,BETA,C,LDC)
    ...
    IF (ACC_ACTIVE) THEN
    !$ACC HOST_DATA USE_DEVICE(A,B,C)
       CALL cublasZGEMM(OPA,OPB,M,N,K,ALPHA,A,&
            LDA,B,LDB,BETA,C,LDC)
    !$ACC END HOST_DATA
    ELSE
       CALL ZGEMM(OPA,OPB,M,N,K,ALPHA,A,&
            LDA,B,LDB,BETA,C,LDC)
    ENDIF
    END SUBROUTINE ACC_ZGEMM

VASP 6
Collaboration with developers

  • NVIDIA Devtech is collaborating with the VASP developers to migrate to OpenACC + CUDA libraries
  • Development uses the PGI compiler (the Community Edition is free of charge)
  • The developers own and maintain the GPU code in Fortran, in a single source tree
  • Targeting a much wider set of VASP features and improved performance
  • Will be part of VASP 6, announced at SC19 and to be released before Christmas 2019

GPU ACCELERATED FEATURES IN VASP 6

Levels of theory: Standard DFT; Hybrid DFT (exact exchange, double buffered); cubic-scaling RPA (ACFDT, GW); Bethe-Salpeter equations (BSE); …
Solvers / main algorithm: Davidson (+ Adaptively Compressed Exchange); RMM-DIIS; Davidson+RMM-DIIS; direct optimizers (Damped, All); linear response; …
Projection scheme: real space; real space (automatic optimization); reciprocal space
Executable flavors: standard variant; Gamma-point simplification variant; non-collinear spin variant
All parallelization options

(On the slide: light green = will be part of VASP 6, dark green = work in progress, black = on the roadmap, bold italics = added since VASP 5)

AGENDA

  • Introduction to VASP
  • GPU Acceleration in VASP 5
  • Prioritizing Use Cases for New Porting Efforts
  • OpenACC in VASP 6 and Supported Features
  • Comparative Benchmarking

DETAILS ON DATASET
CuC_vdW

  • Cell size: 10.3×10.3×31.5 Å³
  • Atoms: 96 Cu, 2 C (98 total)
  • 5 k-points, 638 bands, 400 eV energy cutoff, 52,405 plane waves
  • Standard DFT (GGA: PBE)
  • ALGO=VeryFast (RMM-DIIS)
  • Real-space projection scheme

BENCHMARK RESULTS: CUC_VDW

Speedup vs CPU (2x E5-2698 v4 = 1.0):

              VASP 5   VASP 6RC   VASP 6+
  1 V100        1.7       2.3        2.5
  2 V100        2.2       3.3        3.7
  4 V100        2.9       4.1        4.7
  8 V100        3.3       5.4        6.6
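Reading the chart as ratios against the dual-socket CPU baseline makes the CUDA-C-to-OpenACC gain explicit; for example, on 8 V100s (values taken from the chart above):

```python
# CuC_vdW speedups vs 2x E5-2698 v4, read from the 8x V100 results
speedup_8xV100 = {"VASP 5": 3.3, "VASP 6RC": 5.4, "VASP 6+": 6.6}

# ratio of the OpenACC port (VASP 6+) to the CUDA C port (VASP 5)
gain = speedup_8xV100["VASP 6+"] / speedup_8xV100["VASP 5"]
```

On this case the OpenACC port delivers twice the throughput of the CUDA C port at the same GPU count.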

DETAILS ON DATASET
Si-Huge

  • Cell size: 15.4×30.7×30.7 Å³
  • Atoms: 512 Si
  • 14 k-points, 1281 bands, 245.4 eV energy cutoff, 89,614 plane waves
  • Standard DFT (GGA: PW91)
  • ALGO=Normal (Davidson)
  • Real-space projection scheme

BENCHMARK RESULTS: SI-HUGE

Speedup vs CPU (2x E5-2698 v4 = 1.0):

              VASP 5   VASP 6RC   VASP 6+
  1 V100        2.6       3.1*       3.1*
  2 V100        2.9       5.5*       5.5*
  4 V100        3.8       7.0        6.3
  8 V100        4.7      10.5       10.7

*: 32GB V100

DETAILS ON DATASET
GaAsBi_512

  • Cell size: 22.6×22.6×22.6 Å³
  • Atoms: 256 Ga, 255 As, 1 Bi (512 total)
  • 4 k-points, 1536 bands, 313 eV energy cutoff, 145,484 plane waves
  • Standard DFT (GGA: PBE)
  • ALGO=Fast (Davidson + RMM-DIIS)
  • Real-space projection scheme

BENCHMARK RESULTS: GAASBI_512

Speedup vs CPU (2x E5-2698 v4 = 1.0):

              VASP 5   VASP 6RC   VASP 6+
  1 V100        2.8       3.8*       3.6*
  2 V100        4.5       6.2        6.2
  4 V100        7.0      10.7       10.6
  8 V100        9.8      15.9       16.4

*: 32GB V100

DETAILS ON DATASET
Si256_VJT_PBE0

  • Cell size: 18.9×18.9×18.9 Å³
  • Atoms: 256 Si
  • 1 k-point (Γ), 640 bands, 250 eV energy cutoff, 23,589 plane waves
  • Hybrid DFT (PBE0)
  • ALGO=Damped (direct minimizer)
  • Real-space projection scheme

BENCHMARK RESULTS: SI256_VJT_PBE0

Speedup vs CPU (2x E5-2698 v4 = 1.0); hybrid DFT is not yet GPU accelerated in VASP 5:

              VASP 5   VASP 6RC   VASP 6+
  1 V100       n/a        4.7        4.7
  2 V100       n/a        8.8        9.0
  4 V100       n/a       15.7       15.9
  8 V100       n/a       28.1       28.7

VASP 6RC ON THUNDERX2 (ARM) + TESLA V100-32GB-PCIE

Speedup vs CPU (2x ThunderX2 CN9975 = 1.0):

            CuC_vdW   Si-Huge   GaAsBi_512   Si256_VJT_PBE0
  1 V100      2.4       3.0        3.0            4.9
  2 V100      3.3       4.3        4.7            9.2

Benchmarked with experimental, pre-release versions of PGI compilers, CUDA toolkit, libraries and drivers. Performance is subject to change.

"For VASP, OpenACC is the way forward for GPU acceleration. Performance is similar and in some cases better than CUDA C, and OpenACC dramatically decreases GPU development and maintenance efforts."

  • Prof. Georg Kresse, Computational Materials Physics, University of Vienna; CEO of VASP Software GmbH
