SLIDE 1

PERFORMANCE OPTIMIZATION FOR SCIENTIFIC APPLICATIONS

Alan Gray, Developer Technology Engineer, NVIDIA
GTC, March 26-29, 2018

SLIDE 2

AGENDA

  • Introduction
  • Single-GPU
    • Exposing parallelism
    • Memory Coalescing
    • Data Management Optimization
    • Interoperability
    • ESCAPE performance results
  • Multi-GPU
    • CUDA-aware MPI
    • AlltoAll optimization
    • DGX-2 (with NVSwitch) vs DGX-1V
SLIDE 3

INTRODUCTION

SLIDE 4

ESCAPE

  • ESCAPE (Energy-efficient Scalable Algorithms for Weather Prediction at Exascale) is a European collaborative project.
  • NVIDIA’s role is to take existing GPU-enabled codes and optimize them.
SLIDE 5

ECMWF

  • The European Centre for Medium-Range Weather Forecasts (ECMWF) is an intergovernmental organisation.
  • Global forecasts:
    • used by more than 22 countries to drive their regional forecasts.
    • provide accurate weekly, monthly and seasonal predictions, including early warnings of severe events.
  • In 2012 ECMWF provided the most accurate prediction for the trajectory and landfall of Hurricane Sandy: information that undoubtedly saved lives.
  • We are working closely with ECMWF (and other partners) in ESCAPE to evaluate and improve the algorithms, techniques and software on GPUs.
  • This is done through use of dwarves: mini-apps designed to represent the key properties of the real simulation applications.

ESCAPE Project Leaders

SLIDE 6

ESCAPE DWARVES

  • ECMWF’s Integrated Forecasting System (IFS) is a global prediction system: the entire earth’s atmosphere is represented as a spherical grid.
  • Information in “grid-point” space can be equivalently represented in “spectral” space, i.e. in terms of the frequencies of the fluctuating waves, which is more suited to some calculations.
  • IFS therefore repeatedly transforms between these representations: Fourier transforms (FFTs) in longitude and Legendre transforms (DGEMMs) in latitude, with AlltoAll data movement in between.
  • This dwarf represents the spectral transforms from IFS.
  • NB: the number of points varies per latitude (e.g. most around the equator, fewest at the poles). Additionally, there are multiple altitude “levels” in the third dimension away from the surface of the earth, each with 3 “fields”.

Spherical Harmonics (SH) Dwarf

SLIDE 7

ESCAPE DWARVES

  • Advection: horizontal transport.
  • Uses an unstructured grid with nearest-neighbour stencils.
  • The MPDATA scheme is already used within COSMO-EULAG (PSNC), and is of interest to ECMWF for future developments.
  • Both the SH and MPDATA dwarves use Fortran+OpenACC+MPI. SH also interfaces to CUDA libraries.
  • Many of the optimizations I will present are transferable to other applications/languages etc.

MPDATA Dwarf

SLIDE 8

SINGLE-GPU: EXPOSING PARALLELISM

SLIDE 9

EXPOSING PARALLELISM

OpenACC Loop Mapping

Typical structure of the application (usually spanning multiple routines/files):

Loop over timesteps
  …
  Loop over 1st dimension
    Loop over 2nd dimension
      Loop over fields
        …
        Operation (involving multidimensional arrays)
  …
  Another loop over dimensions
  …

Aim is to expose as much parallelism as possible in these spatial loops (the red box on the original slide), as flexibly as possible.

SLIDE 10

EXPOSING PARALLELISM

OpenACC Loop Mapping

Loop over 1st dimension
  Loop over 2nd dimension
    Loop over fields
      …
      Operation

Before optimization:

SH
  • Loop over 1st dimension was sequential: parallelism not exposed to the GPU.

MPDATA
  • Naïve mapping of loops, using “kernels” and/or “loop” directives without restructuring.
  • The resulting decomposition chosen by the compiler did not work well, since the runtime loop extents didn’t map well to the GPU architecture.

SLIDE 11

EXPOSING PARALLELISM

OpenACC Loop Mapping

!$ACC parallel loop collapse(3)
Loop over 1st dimension
  Loop over 2nd dimension
    Loop over fields
      …
      Operation

  • Assuming the loops are independent, it is better to restructure so that the loops are tightly nested, and to use the “collapse” clause (a minimal sketch follows below).
  • The loops will be collapsed into a single loop, and the compiler will map all the parallelism to GPU blocks/threads in an efficient way.
  • This can require extensive restructuring, depending on how the application is originally written.
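For illustration, a minimal Fortran/OpenACC sketch of the collapsed loop nest. The routine name, array names and extents are hypothetical, and the arrays are assumed to be already resident on the GPU (e.g. inside an enclosing !$ACC data region):

subroutine update(a, b, ni, nj, nfields)
  implicit none
  integer, intent(in) :: ni, nj, nfields
  real(8), intent(in)  :: b(ni, nj, nfields)
  real(8), intent(out) :: a(ni, nj, nfields)
  integer :: i, j, f
!$ACC parallel loop collapse(3) present(a, b)
  do f = 1, nfields
    do j = 1, nj
      do i = 1, ni
        ! independent element-wise operation over all three collapsed loops
        a(i, j, f) = 2.0d0 * b(i, j, f)
      end do
    end do
  end do
end subroutine update

The three loops are exposed to the compiler as a single iteration space of ni*nj*nfields independent iterations, which it can split across gangs, workers and vectors as it sees fit.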

SLIDE 12

EXPOSING PARALLELISM

OpenACC Loop Mapping

!$ACC parallel loop collapse(2)
Loop over 1st dimension
  Loop over 2nd dimension
    !$ACC loop seq
    Loop with dependencies
      …
      Operation

  • Sometimes we have loop-carried dependencies.
  • These can be performed sequentially by each thread at the innermost level (see the sketch below).
  • Can still perform well if there is enough parallelism in the outermost loops.
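A minimal Fortran/OpenACC sketch of this pattern, with a recurrence in the innermost (k) direction. Names and extents are hypothetical, and the array is assumed to be already present on the GPU:

subroutine recurrence(a, ni, nj, nk)
  implicit none
  integer, intent(in) :: ni, nj, nk
  real(8), intent(inout) :: a(ni, nj, nk)
  integer :: i, j, k
!$ACC parallel loop collapse(2) present(a)
  do j = 1, nj
    do i = 1, ni
!$ACC loop seq
      do k = 2, nk
        ! loop-carried dependence: each k value needs the previous one,
        ! so this loop runs sequentially within each thread
        a(i, j, k) = a(i, j, k) + a(i, j, k-1)
      end do
    end do
  end do
end subroutine recurrence

Performance remains good as long as ni*nj provides enough parallel work to fill the GPU.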

SLIDE 13

EXPOSING PARALLELISM

OpenACC Loop Mapping

do i=1, N
  do j=1, i
    do k=1, P
      …
      Operation

!$ACC parallel loop collapse(3)
do i=1, N
  do j=1, MAX_J
    do k=1, P
      if (j .le. i) then
        …
        Operation

  • Sometimes the extent of a loop depends on the index of another (outer) loop, which prevents loop collapsing.
  • Can replace the extent with its maximum value, and use a conditional statement in the loop body (a compilable sketch follows below).
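A minimal compilable version of the same idea; the routine name, array layout and the increment performed in the loop body are hypothetical:

subroutine triangular(a, n, maxj, p)
  implicit none
  integer, intent(in) :: n, maxj, p
  real(8), intent(inout) :: a(p, maxj, n)
  integer :: i, j, k
!$ACC parallel loop collapse(3) present(a)
  do i = 1, n
    do j = 1, maxj          ! was: do j = 1, i  (extent depended on i)
      do k = 1, p
        if (j <= i) then    ! mask out the extra, padded iterations
          a(k, j, i) = a(k, j, i) + 1.0d0
        end if
      end do
    end do
  end do
end subroutine triangular

The masked iterations do no useful work, but the rectangular iteration space lets the compiler collapse and distribute all three loops.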

SLIDE 14

SINGLE-GPU: MEMORY COALESCING

SLIDE 15

KERNEL OPTIMIZATION

Memory Coalescing: data layout

!$ACC parallel loop collapse(3)
do k=1, …
  do j=1, …
    do i=1, …
      …
      Array(i,j,k)=…

  • For memory coalescing, the fastest-moving index in an array access should correspond to the vector level (CUDA thread), which will correspond to the innermost collapsed loop index (a contrasting sketch follows below).

!$ACC parallel loop collapse(2)
do m=1, …
  do n=1, …
!$ACC loop seq
    do p=1, …
      …
      Array(n,m,p)=…
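A small Fortran/OpenACC sketch contrasting an uncoalesced and a coalesced access pattern; the array names and extents are hypothetical and the arrays are assumed present on the device:

subroutine coalescing_demo(a, b, ni, nj)
  implicit none
  integer, intent(in) :: ni, nj
  real(8), intent(inout) :: a(ni, nj), b(nj, ni)
  integer :: i, j

  ! Strided: the innermost (vector) index i is the SECOND index of b,
  ! so consecutive threads touch memory locations nj elements apart.
!$ACC parallel loop collapse(2) present(b)
  do j = 1, nj
    do i = 1, ni
      b(j, i) = 2.0d0 * b(j, i)
    end do
  end do

  ! Coalesced: the innermost index i is the leading index of a,
  ! so consecutive threads touch consecutive memory locations.
!$ACC parallel loop collapse(2) present(a)
  do j = 1, nj
    do i = 1, ni
      a(i, j) = 2.0d0 * a(i, j)
    end do
  end do
end subroutine coalescing_demo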

SLIDE 16

KERNEL OPTIMIZATION

Memory Coalescing: transposing with tile clause

!$ACC parallel loop tile(16,32)
do j=1, …
  do i=1, …
    …
    array_t(i,j)=array(j,i)

  • If you need to transpose, either the read or the write will be in the wrong layout for coalescing.
  • But can use the OpenACC “tile” clause to improve performance (see the sketch below).
  • The compiler tiles the operation by generating new innermost loops.
  • For each tile, data is staged on-chip.
  • Results in better global memory access patterns.
  • Experiment with tile sizes.
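A minimal compilable sketch of the tiled transpose; the array names and the 16x32 tile size are illustrative, and the arrays are assumed present on the device:

subroutine transpose_tiled(array, array_t, n, m)
  implicit none
  integer, intent(in) :: n, m
  real(8), intent(in)  :: array(n, m)
  real(8), intent(out) :: array_t(m, n)
  integer :: i, j
!$ACC parallel loop tile(16,32) present(array, array_t)
  do j = 1, n
    do i = 1, m
      ! either the read or the write is strided; tiling lets the compiler
      ! stage each 16x32 block so global memory accesses are better coalesced
      array_t(i, j) = array(j, i)
    end do
  end do
end subroutine transpose_tiled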
SLIDE 17

KERNEL OPTIMIZATION

Memory Coalescing: transposing within CUDA BLAS

  • In some of our cases, the transpose kernels were adjacent to CUDA BLAS matrix multiplication (DGEMM) calls.
  • Coalescing was facilitated by replacing C = AB matrix multiplications with the equivalent Cᵀ = BᵀAᵀ.
  • This allows the transpose operations to be pushed into the DGEMM library calls, which have much higher-performing implementations of transposed data accesses (a sketch follows below).
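A hedged sketch of the transposed-DGEMM idea, using the BLAS-style cublasDgemm interface from the PGI/NVIDIA "cublas" Fortran module (compiled with, e.g., -acc -Mcudalib=cublas); the names and sizes are illustrative and the arrays are assumed to be in a device data region. Since (AB)ᵀ = BᵀAᵀ, asking DGEMM to transpose both operands yields Cᵀ directly, so no separate transpose kernel is needed:

subroutine dgemm_transposed(a, b, ct, m, n, k)
  use cublas
  implicit none
  integer, intent(in) :: m, n, k
  real(8), intent(in)  :: a(m, k), b(k, n)
  real(8), intent(out) :: ct(n, m)    ! holds C transposed, where C = A*B
!$ACC host_data use_device(a, b, ct)
  ! Ct (n x m) = B^T (n x k) * A^T (k x m)
  call cublasDgemm('T', 'T', n, m, k, 1.0d0, b, k, a, m, 0.0d0, ct, n)
!$ACC end host_data
end subroutine dgemm_transposed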

SLIDE 18

SINGLE-GPU: DATA MANAGEMENT OPTIMIZATION

SLIDE 19

DATA MANAGEMENT OPTIMIZATION

Minimizing data allocation and movement

Data allocation and movement is expensive.

Many codes have:

Loop over timestep:
  …
  loops over spatial dims
  …

  • Important to keep as much data as possible resident on the GPU within the timestep loop (see the sketch below).
  • All allocations/frees should be outside the timestep loop.
  • Copies for constant data should be outside the main timestep loop.
  • Re-use temporary scratch arrays on the device.
  • For any necessary repeated copies (e.g. halo regions in MPI code), the volume copied should be minimized.
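A minimal sketch of the recommended structure: allocate and transfer once outside the timestep loop, with only GPU-resident work inside it. The names, extents and operations are illustrative:

subroutine run(fields, ni, nj, nsteps)
  implicit none
  integer, intent(in) :: ni, nj, nsteps
  real(8), intent(inout) :: fields(ni, nj)
  real(8), allocatable :: scratch(:, :)
  integer :: step, i, j

  allocate(scratch(ni, nj))              ! allocate scratch once, outside the loop

!$ACC data copy(fields) create(scratch)  ! one host<->device transfer for the whole run
  do step = 1, nsteps
!$ACC parallel loop collapse(2) present(fields, scratch)
    do j = 1, nj
      do i = 1, ni
        scratch(i, j) = 0.25d0 * fields(i, j)
      end do
    end do
!$ACC parallel loop collapse(2) present(fields, scratch)
    do j = 1, nj
      do i = 1, ni
        fields(i, j) = fields(i, j) + scratch(i, j)
      end do
    end do
    ! only small, unavoidable transfers (e.g. halo regions) belong inside the loop
  end do
!$ACC end data

  deallocate(scratch)
end subroutine run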

SLIDE 20

SINGLE-GPU: INTEROPERABILITY

SLIDE 21

INTEROPERABILITY

Simple example: Calling C/CUDA from PGI Fortran

!main.f90
program main
  interface
    subroutine kernel_from_f(arg) bind(C, name='kernel_wrapper')
      use iso_c_binding
      integer(c_int), value :: arg
    end subroutine kernel_from_f
  end interface
  call kernel_from_f(1)
end program main

//kernel.cu
#include <stdio.h>
__global__ void kernel(int arg){
  if (threadIdx.x==0) printf("hello from kernel\n");
  return;
}
extern "C" void kernel_wrapper(int arg){
  kernel<<<1,1>>>(arg);
  cudaDeviceSynchronize();
  return;
}

$ nvcc -c -arch=sm_60 kernel.cu -o kernel.o
$ pgf90 -c main.f90 -o main.o
$ pgf90 main.o kernel.o -o code -L $CUDA_HOME/lib64 -lcudart
$ ./code
hello from kernel

CUDA libraries can be called from C code in the usual manner.

SLIDE 22

INTEROPERABILITY AND LIBRARIES: SH DWARF

Base language Fortran, with MPI for multi-GPU communications.

[Diagram: SH dwarf timestep structure, alternating OpenACC-ported regions with cuFFT and cublasDgemm library calls]

SLIDE 23

BLAS/FFT LIBRARY CALLS IN SH DWARF

  • At each timestep, the SH dwarf performs transforms using matrix multiplications and FFTs.
  • Multiple operations - one for each:
    • Field (associated with vertical levels)
    • Longitude (Matmult) / Latitude (FFT)
  • Can batch over fields, since the sizes are the same. But different longitudes/latitudes have different sizes: not supported by the batched versions of cublasDgemm/cuFFT.
  • So, originally we had many small calls: low parallelism exposure and launch latency sensitivity.
  • For DGEMM, we pad with zeros up to the largest size and batch over longitudes as well as fields: a single call to the library; the extra operations do not contribute to the result (see the padding sketch below).
  • But FFT does not allow padding in the same way. Worked around the launch latency problem by removing the sync after each call: allows launch latency to be hidden behind execution.
  • As will be seen, however, this is the only part of the dwarf which remains suboptimal. Future: a batched FFT with differing sizes should improve performance.
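A hedged sketch of the zero-padding idea for the DGEMM batching: matrices of differing size are packed into a single array sized for the largest longitude, with zeros in the unused rows so that the padded operations contribute nothing to the result. The names, layout and extents are illustrative, and the subsequent batched library call is only indicated by a comment:

subroutine pad_for_batch(a_in, nrows, a_padded, nlon, maxm, k)
  implicit none
  integer, intent(in) :: nlon, maxm, k
  integer, intent(in) :: nrows(nlon)              ! valid rows per longitude
  real(8), intent(in)  :: a_in(maxm, k, nlon)     ! only the first nrows(l) rows are meaningful
  real(8), intent(out) :: a_padded(maxm, k, nlon)
  integer :: l, j, i
!$ACC parallel loop collapse(3) present(a_in, a_padded, nrows)
  do l = 1, nlon
    do j = 1, k
      do i = 1, maxm
        if (i <= nrows(l)) then
          a_padded(i, j, l) = a_in(i, j, l)
        else
          a_padded(i, j, l) = 0.0d0               ! zero padding up to the largest size
        end if
      end do
    end do
  end do
  ! ...followed by a single batched DGEMM over all nlon (and field) matrices,
  ! e.g. a strided-batched cuBLAS GEMM; the interface details are omitted here.
end subroutine pad_for_batch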

SLIDE 24

SINGLE-GPU: ESCAPE RESULTS

SLIDE 25

MPDATA OPTIMIZATION: P100

[Charts: MPDATA performance on P100, "Before" and "After" optimization]

SLIDE 26

OPTIMIZED MPDATA: P100 VS V100

[Charts: optimized MPDATA performance, P100 vs V100]

SLIDE 27

ESCAPE DWARF V100 PERFORMANCE

SLIDE 28

MPDATA KERNEL PERFORMANCE

  • 100% Roofline is STREAM benchmark throughput, since all kernels are memory bandwidth bound
SLIDE 29

ESCAPE DWARF V100 PERFORMANCE

SLIDE 30

SH KERNEL PERFORMANCE

  • 100% Roofline is peak DP performance (compute-bound kernels) or STREAM benchmark throughput (memory-bandwidth-bound kernels).

Experiments with batching (using the average size) show a 4.6X speedup.

SLIDE 31

MULTI-GPU: CUDA-AWARE MPI

SLIDE 32

CUDA-AWARE MPI

  • Modern MPI implementations are CUDA-aware. This means that pointers to GPU memory can be passed directly into the MPI calls, avoiding unnecessary transfers (both in the application and in the underlying MPI implementation).

!non CUDA-aware:
!$ACC update host(array,…)
call MPI_Alltoallv(array,…)
!$ACC update device(array,…)

!CUDA-aware:
!$ACC host_data use_device(array,…)
call MPI_Alltoallv(array,…)
!$ACC end host_data

  • This also applies to point-to-point etc. Note that if explicit buffer packing is involved, it will also need to be ported from the CPU to the GPU (see the sketch below).
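A minimal sketch of GPU-side buffer packing combined with a CUDA-aware point-to-point exchange. The rank layout, array names and halo pattern are illustrative, and all arrays are assumed to be already present on the device:

subroutine exchange_halo(field, sendbuf, recvbuf, ni, nj, left, right, comm)
  use mpi
  implicit none
  integer, intent(in) :: ni, nj, left, right, comm
  real(8), intent(inout) :: field(ni, nj)
  real(8), intent(inout) :: sendbuf(nj), recvbuf(nj)
  integer :: j, ierr, status(MPI_STATUS_SIZE)

  ! pack the halo column directly on the GPU (no host staging)
!$ACC parallel loop present(field, sendbuf)
  do j = 1, nj
    sendbuf(j) = field(ni, j)
  end do

  ! pass device pointers straight into CUDA-aware MPI
!$ACC host_data use_device(sendbuf, recvbuf)
  call MPI_Sendrecv(sendbuf, nj, MPI_DOUBLE_PRECISION, right, 0, &
                    recvbuf, nj, MPI_DOUBLE_PRECISION, left,  0, &
                    comm, status, ierr)
!$ACC end host_data

  ! unpack the received halo on the GPU
!$ACC parallel loop present(field, recvbuf)
  do j = 1, nj
    field(1, j) = recvbuf(j)
  end do
end subroutine exchange_halo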

SLIDE 33

MULTI-GPU: ALLTOALL OPTIMIZATION

SLIDE 34

ALLTOALL: 4 GPUS ON DGX-1

  • The SH dwarf performs AlltoAll operations. Even with CUDA-aware MPI, there are inefficiencies:
  • Can instead use CUDA IPC directly, with CUDA streams to overlap.
  • Future MPI implementations are expected to take care of this.
SLIDE 35

CUDA IPC ALLTOALL

As a starting point, used the 0_Simple/simpleIPC code from the CUDA samples. Setup:

  • On each MPI rank, get IPC memory handle for array locally
  • Share IPC memory handles via MPI
  • Setup CUDA streams

AlltoAll:

  • On each rank, loop over all targetranks (including own)
  • targetrank=(targetrank+rank)%number_of_gpus (for better balance)
  • Push message to targetrank (in stream[targetrank])
  • Sync all streams
SLIDE 36

SH RESULTS ON 4 GPUS

SLIDE 37

MULTI-GPU: DGX-2 (WITH NVSWITCH) VS DGX-1V

SLIDE 38

SPHERICAL HARMONICS: SCALING BEYOND 4 GPUS

  • When using all 8 GPUs in a DGX-1V:
    • There is no all-to-all NVLink connectivity - some messages go through PCIe and system memory.
    • This limits performance.
  • When using 16 GPUs across 2 DGX-1V servers:
    • Some messages go across the InfiniBand network.
    • A further bottleneck.
SLIDE 39

DGX-2 WITH NVSWITCH

  • AlltoAll network architecture with NVSwitch maps perfectly to the problem.
  • Full bandwidth between each GPU pair.
SLIDE 40

SPHERICAL HARMONICS: DGX-2 VS DGX-1V

SLIDE 41

SUMMARY AND FUTURE WORK

  • Optimizing the exposure of parallelism, memory coalescing and data management can have dramatic effects on performance.
  • MPDATA single-GPU performance is now optimal.
  • MPDATA multi-GPU: the halo exchange is currently implemented via the MPI-abstracted, CPU-based ATLAS library (also developed by ECMWF). To be fully optimized, ATLAS needs to be made CUDA-aware; this is currently being worked on by others.
  • SH single-GPU performance is vastly improved, but the FFT part remains sub-optimal. An implementation of batching that allows different sizes within each batch is expected to fix this.
  • DGX-2/NVSwitch all-to-all connectivity allows SH to scale to all 16 GPUs.
  • These results indicate that multi-GPU systems can be effectively exploited, allowing forecasting agencies to continue to improve their weather predictions.
