PERFORMANCE OPTIMIZATION FOR SCIENTIFIC APPLICATIONS
Alan Gray, Developer Technology Engineer, NVIDIA
GTC, March 26-29 2018
AGENDA
- Introduction
- Single-GPU
  - Exposing parallelism
  - Memory Coalescing
  - Data Management Optimization
  - Interoperability
  - ESCAPE performance results
- Multi-GPU
  - CUDA-aware MPI
  - AlltoAll optimization
  - DGX-2 (with NVSwitch) vs DGX-1V
INTRODUCTION
ESCAPE
- NVIDIA’s role in ESCAPE is to take existing GPU-enabled codes and optimize them.
ECMWF
- The European Centre for Medium-Range Weather Forecasts (ECMWF) is an intergovernmental organisation.
- Global forecasts:
  - are used by more than 22 countries to drive their regional forecasts.
  - provide accurate weekly, monthly and seasonal predictions, including early warnings of severe events.
- In 2012 ECMWF provided the most accurate prediction of the trajectory and landfall of Hurricane Sandy: information that undoubtedly saved lives.
- We are working closely with ECMWF (and other partners) in ESCAPE to evaluate and
improve the algorithms, techniques and software on GPUs.
- This is done through use of dwarves: mini-apps designed to represent the key
properties of the real simulation applications.
ESCAPE Project Leaders
ESCAPE DWARVES
- ECMWF’s Integrated Forecasting System (IFS) is a global prediction system: the entire earth’s atmosphere is represented as a spherical grid.
- Information in “grid-point” space can be equivalently represented in “spectral” space, i.e. in terms of the frequencies of the fluctuating waves, which is better suited to some calculations.
- IFS therefore repeatedly transforms between these representations, using Fourier transforms (FFTs) in longitude and Legendre transforms (DGEMMs) in latitude, with AlltoAll data movement in between.
- This dwarf represents the spectral transforms from IFS.
- NB: the number of grid points varies (most around the equator, fewest at the poles). Additionally, there are multiple altitude “levels” in the third dimension away from the earth’s surface, each with 3 “fields”.
Spherical Harmonics (SH) Dwarf
ESCAPE DWARVES
- Advection: horizontal transport.
- Uses an unstructured grid with nearest-neighbour stencils.
- The MPDATA scheme is already used within COSMO-EULAG (PSNC), and is of interest to ECMWF for future developments.
- Both the SH and MPDATA dwarves use Fortran+OpenACC+MPI. SH also interfaces to CUDA libraries.
- Many of the optimizations I will present are transferable to other applications, languages, etc.
MPDATA Dwarf
SINGLE-GPU: EXPOSING PARALLELISM
EXPOSING PARALLELISM
OpenACC Loop Mapping

Typical structure of the application (usually spanning multiple routines/files):

Loop over timesteps
  Loop over 1st dimension
    Loop over 2nd dimension
      Loop over fields
        Operation (involving multidimensional arrays)
  Another loop over dimensions
  …

The aim is to expose as much of the parallelism within the loops inside each timestep as possible, as flexibly as possible.
EXPOSING PARALLELISM
OpenACC Loop Mapping

Before optimization:

Loop over 1st dimension
  Loop over 2nd dimension
    Loop over fields
      Operation

SH:
- The loop over the 1st dimension was sequential: parallelism not exposed to the GPU.

MPDATA:
- Naïve mapping of loops, using “kernels” and/or “loop” directives without restructuring.
- The resulting decomposition chosen by the compiler did not work well, since the runtime loop extents did not map well to the GPU architecture.
EXPOSING PARALLELISM
OpenACC Loop Mapping

!$ACC parallel loop collapse(3)
Loop over 1st dimension
  Loop over 2nd dimension
    Loop over fields
      … Operation

- Assuming the loops are independent, it is better to restructure such that the loops are tightly nested, and use the “collapse” clause.
- The loops will be collapsed into a single loop, and the compiler will map all the parallelism to GPU blocks/threads in an efficient way.
- This can require extensive restructuring, depending on how the application is originally written.
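As a concrete illustration, a minimal Fortran sketch of the collapsed pattern (array names and extents here are illustrative, not taken from the dwarves):

! Illustrative only: the three loops are independent and tightly
! nested, so they can be collapsed into one large iteration space
! that the compiler maps to GPU gangs/vectors.
real(8) :: a(n1,n2,nf), b(n1,n2,nf)
integer :: i, j, k

!$acc parallel loop collapse(3)
do k = 1, nf
  do j = 1, n2
    do i = 1, n1
      a(i,j,k) = 0.5d0*(a(i,j,k) + b(i,j,k))
    end do
  end do
end do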
EXPOSING PARALLELISM
OpenACC Loop Mapping

!$ACC parallel loop collapse(2)
Loop over 1st dimension
  Loop over 2nd dimension
    !$ACC loop seq
    Loop with dependencies
      … Operation
- Sometimes we have loop-carried
dependencies.
- These can be performed
sequentially by each thread at the innermost level.
- Can still perform well if there is
enough parallelism in outermost loops.
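A hedged sketch of the same idea, using a running sum through the vertical levels as an illustrative loop-carried dependency (names are not from the dwarves):

! Illustrative only: each (i,j) column is handled by one thread,
! which walks the dependent k loop sequentially.
!$acc parallel loop collapse(2)
do j = 1, n2
  do i = 1, n1
    colsum(i,j,1) = field(i,j,1)
    !$acc loop seq
    do k = 2, nlev
      colsum(i,j,k) = colsum(i,j,k-1) + field(i,j,k)
    end do
  end do
end do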
EXPOSING PARALLELISM
OpenACC Loop Mapping

Before:
do i=1, N
  do j=1, i
    do k=1, P
      … Operation

After:
!$ACC parallel loop collapse(3)
do i=1, N
  do j=1, MAX_J
    do k=1, P
      if (j .le. i) then
        … Operation

- Sometimes the extent of a loop depends on the index of another (outer) loop, which prevents loop collapsing.
- Can replace the extent with its maximum value, and use a conditional statement in the loop body.
SINGLE-GPU: MEMORY COALESCING
KERNEL OPTIMIZATION
Memory Coalescing: data layout

- For memory coalescing, the fastest-moving index in an array access should correspond to the vector level (CUDA thread), which corresponds to the innermost collapsed loop index. Both patterns below satisfy this: the first array index is driven by the innermost collapsed loop.

!$ACC parallel loop collapse(3)
do k=1, …
  do j=1, …
    do i=1, …
      Array(i,j,k)=…

!$ACC parallel loop collapse(2)
do m=1, …
  do n=1, …
    !$ACC loop seq
    do p=1, …
      Array(n,m,p)=…
KERNEL OPTIMIZATION
Memory Coalescing: transposing with the tile clause

!$ACC parallel loop tile(16,32)
do j=1, …
  do i=1, …
    array_t(i,j)=array(j,i)
- If you need to transpose, either
read or write will be in wrong layout for coalescing.
- But can use OpenACC “tile” clause
to improve performance
- Compiler tiles the operation by
generating new innermost loops
- For each tile, data is staged on-chip
- Results in better global memory
access patterns
- Experiment with tile sizes
KERNEL OPTIMIZATION
Memory Coalescing: transposing within CUDA BLAS
- In some of our cases, the transpose kernels were adjacent to CUDA BLAS matrix
multiplication (DGEMM) calls.
- Coalescing was facilitated by replacing the C = AB matrix multiplications with the equivalent transposed form Cᵀ = BᵀAᵀ.
- This allows the transpose operations to be pushed into the DGEMM library calls, which have much higher-performing implementations of transposed data access.
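A hedged sketch of the idea in standard DGEMM notation (matrix names and sizes are illustrative; that the BLAS-style cublasDgemm interface in the PGI cublas module takes the same argument list is an assumption to verify for your toolchain):

! Illustrative only. A is m x k, B is k x n, so C = A*B is m x n.
! Instead of computing C and then transposing it in a separate
! kernel, request the transposed result directly:
!   C^T = B^T * A^T   (an n x m matrix)
! so the strided/transposed accesses happen inside the library.
real(8) :: A(m,k), B(k,n), CT(n,m)

call dgemm('T', 'T', n, m, k, 1.0d0, B, k, A, m, 0.0d0, CT, n)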
SINGLE-GPU: DATA MANAGEMENT OPTIMIZATION
DATA MANAGEMENT OPTIMIZATION
Minimizing data allocation and movement

Data allocation and movement is expensive. Many codes have:

Loop over timesteps:
  … loops over spatial dims …

- Important to keep as much data as possible resident on the GPU within the timestep loop.
- All allocations/frees should be outside the timestep loop.
- Copies of constant data should be outside the main timestep loop.
- Re-use temporary scratch arrays on the device.
- For any necessary repeated copies (e.g. halo regions in MPI code), the volume copied should be minimized.
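A hedged OpenACC sketch of this structure (array names, loop bounds and the halo slices are illustrative):

! Allocate/copy once, outside the timestep loop; keep data resident
! on the GPU, and move only what is strictly needed each step.
!$acc data copyin(coeffs) copy(state) create(scratch)
do step = 1, nsteps

  ! ... OpenACC kernels updating state/scratch, all on the GPU ...

  ! Only the minimal data (e.g. halo regions) is moved per step:
  !$acc update host(state(:,1), state(:,ny))
  ! ... MPI halo exchange (or better, CUDA-aware MPI: see later) ...
  !$acc update device(state(:,1), state(:,ny))

end do
!$acc end data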
SINGLE-GPU: INTEROPERABILITY
INTEROPERABILITY
Simple example: Calling C/CUDA from PGI Fortran
!main.f90
program main
  interface
    subroutine kernel_from_f(arg) bind(C,name='kernel_wrapper')
      use iso_c_binding
      integer(c_int), value :: arg
    end subroutine kernel_from_f
  end interface
  call kernel_from_f(1)
end program main

//kernel.cu
#include <stdio.h>
__global__ void kernel(int arg){
  if (threadIdx.x==0) printf("hello from kernel\n");
  return;
}
extern "C" void kernel_wrapper(int arg){
  kernel<<<1,1>>>(arg);
  cudaDeviceSynchronize();
  return;
}

$ nvcc -c -arch=sm_60 kernel.cu -o kernel.o
$ pgf90 -c main.f90 -o main.o
$ pgf90 main.o kernel.o -o code -L $CUDA_HOME/lib64 -lcudart
$ ./code
hello from kernel
CUDA libraries can be called from C code in the usual manner.
INTEROPERABILITY AND LIBRARIES: SH DWARF
Base language is Fortran, with MPI for multi-GPU communications.
[Diagram: SH dwarf pipeline, with OpenACC kernels interleaved with cuFFT and cublasDgemm library calls]
BLAS/FFT LIBRARY CALLS IN SH DWARF
- At each timestep, SH dwarf performs transforms using Matrix Multiplications and FFTs.
- Multiple operations - one for each:
- Field (associated with vertical levels)
- Longitude (Matmult) / Latitude (FFT)
- Can batch over fields, since sizes are the same. But different longitudes/latitudes
have different sizes: not supported by batched versions of cublasDgemm/cuFFT.
- So, originally we had many small calls: low parallelism exposure and launch latency
sensitivity.
- For DGEMM, we pad with zeros up to the largest size and batch over longitudes as well as fields: a single call to the library; the extra operations do not contribute to the result (see the sketch after this list).
- But FFT does not allow padding in the same way. Worked around launch latency
problem by removing sync after each call: allows launch latency to be hidden behind execution.
- As will be seen, however, this is the only part of the dwarf which remains suboptimal.
Future: batched FFT with differing sizes should improve performance.
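A minimal sketch of the zero-padding step described above (the storage layouts here are hypothetical, not the dwarf's own):

! Illustrative only: per-longitude matrices have differing row
! counts m(l) <= mmax, so they are gathered into one zero-padded
! batch of uniform size. The padded rows are zero and therefore
! contribute nothing, so a single batched DGEMM over a_pad gives
! the same (unpadded) results as many small DGEMM calls.
real(8) :: a_pad(mmax, k, nlons)

a_pad = 0.0d0
do l = 1, nlons
  a_pad(1:m(l), :, l) = a_small(l)%mat   ! hypothetical ragged storage
end do
! ... followed by one batched DGEMM call covering all longitudes/fields ...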
SINGLE-GPU: ESCAPE RESULTS
MPDATA OPTIMIZATION: P100
[Charts: MPDATA timings on P100, before vs. after optimization]
OPTIMIZED MPDATA: P100 VS V100
ESCAPE DWARF V100 PERFORMANCE
MPDATA KERNEL PERFORMANCE
- 100% Roofline is STREAM benchmark throughput, since all kernels are memory bandwidth bound
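For reference, the roofline fraction for a memory-bound kernel follows the usual definition (my restatement, not taken from the slides):

achieved bandwidth = (bytes read + bytes written) / kernel time
% of roofline      = 100 x achieved bandwidth / STREAM bandwidth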
ESCAPE DWARF V100 PERFORMANCE
SH KERNEL PERFORMANCE
- 100% Roofline is peak DP Performance (compute bound kernels) or STREAM benchmark
throughput (memory bandwidth bound kernels)
Experiments with batching (using the average size) show a 4.6X speedup.
MULTI-GPU: CUDA-AWARE MPI
CUDA-AWARE MPI
- Modern MPI implementations are CUDA-aware. This means that pointers to GPU
memory can be passed directly into the MPI calls, avoiding unnecessary transfers (both in the application and in the underlying MPI implementation).
!non CUDA-aware:
!$ACC update host(array,…)
call MPI_Alltoallv(array,…)
!$ACC update device(array,…)

!CUDA-aware:
!$ACC host_data use_device(array,…)
call MPI_Alltoallv(array,…)
!$ACC end host_data
- This also applies to point-to-point communication etc.; a halo-exchange sketch follows below. Note that if explicit buffer packing is involved, this will also need to be ported from the CPU to the GPU.
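A hedged sketch of a CUDA-aware point-to-point halo exchange in the same style (neighbour ranks, tags and the halo slices are illustrative; field is assumed dimensioned field(n, 0:ny+1) with one halo column on each side):

! Pass the GPU copies of the halo slices directly to MPI.
integer :: req(4), ierr

!$acc host_data use_device(field)
call MPI_Irecv(field(:,0),    n, MPI_DOUBLE_PRECISION, left,  0, comm, req(1), ierr)
call MPI_Irecv(field(:,ny+1), n, MPI_DOUBLE_PRECISION, right, 1, comm, req(2), ierr)
call MPI_Isend(field(:,1),    n, MPI_DOUBLE_PRECISION, left,  1, comm, req(3), ierr)
call MPI_Isend(field(:,ny),   n, MPI_DOUBLE_PRECISION, right, 0, comm, req(4), ierr)
call MPI_Waitall(4, req, MPI_STATUSES_IGNORE, ierr)
!$acc end host_data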
MULTI-GPU: ALLTOALL OPTIMIZATION
ALLTOALL: 4 GPUS ON DGX-1
- The SH dwarf performs AlltoAll operations. Even with CUDA-aware MPI, there are inefficiencies.
- Can instead use CUDA IPC directly, with CUDA streams to overlap the transfers.
- Future MPI implementations are expected to take care of this.
CUDA IPC ALLTOALL
As a starting point, we used the 0_Simple/simpleIPC code from the CUDA samples. Setup:
- On each MPI rank, get IPC memory handle for array locally
- Share IPC memory handles via MPI
- Setup CUDA streams
AlltoAll:
- On each rank, loop over all target ranks (including its own)
- targetrank = (targetrank + rank) % number_of_gpus (rotates the schedule, for better balance)
- Push the message to targetrank (in stream[targetrank])
- Sync all streams
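A minimal Fortran sketch of the rotated schedule only (the actual pushes are cudaMemcpyAsync calls through the IPC-opened remote pointers, one per stream):

! Each rank starts its pushes at a different peer, so at every step
! all GPUs send to distinct targets and the links are used evenly.
do step = 0, number_of_gpus - 1
  targetrank = mod(step + rank, number_of_gpus)
  ! push this rank's block for targetrank, in stream(targetrank+1)
end do
! then synchronise all streams before the received data is used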
SH RESULTS ON 4 GPUS
MULTI-GPU: DGX-2 (WITH NVSWITCH) VS DGX-1V
SPHERICAL HARMONICS: SCALING BEYOND 4 GPUS
- When using all 8 GPUs in a DGX-1V:
  - There is no all-to-all NVLink connectivity: some messages go through PCIe and system memory.
  - This limits performance.
- When using 16 GPUs across 2 DGX-1V servers:
  - Some messages go across the InfiniBand network.
  - This is a further bottleneck.
DGX-2 WITH NVSWITCH
- AlltoAll network architecture with NVSwitch maps perfectly to the problem.
- Full bandwidth between each GPU pair.
SPHERICAL HARMONICS: DGX-2 VS DGX-1V
SUMMARY AND FUTURE WORK
- Optimizing the exposure of parallelism, memory coalescing and data management
can have dramatic effects on performance.
- MPDATA single-GPU performance is now optimal.
- MPDATA multi-GPU: the halo exchange is currently implemented via the MPI-abstracted, CPU-based ATLAS library (also developed by ECMWF). To be fully optimized, ATLAS needs to be made CUDA-aware, and this is currently being worked on by others.
- SH single-GPU performance is vastly improved, but the FFT part remains sub-optimal.
- An implementation of batching in which different sizes are allowed within each batch would be expected to fix this.
- DGX-2/NVSwitch all-to-all connectivity allows SH to scale to all 16 GPUs.
- These results indicate that multi-GPU systems can be exploited effectively, allowing forecasting agencies to continue to improve their weather predictions.