SLIDE 1

PERFORMANCE OPTIMIZATION FOR SCIENTIFIC APPLICATIONS

Alan Gray, Developer Technology Engineer, NVIDIA
GTC, March 26-29, 2018

SLIDE 2

AGENDA

  • Introduction
  • Single-GPU
    • Exposing parallelism
    • Memory Coalescing
    • Data Management Optimization
    • Interoperability
    • ESCAPE performance results
  • Multi-GPU
    • CUDA-aware MPI
    • AlltoAll optimization
    • DGX-2 (with NVSwitch) vs DGX-1V
SLIDE 3

INTRODUCTION

SLIDE 4

ESCAPE

  • ESCAPE (Energy-efficient Scalable Algorithms for Weather Prediction at Exascale) is a European collaborative project.
  • NVIDIA’s role is to take existing GPU-enabled codes and optimize them.
SLIDE 5

ECMWF

  • The European Centre for Medium-Range Weather Forecasts (ECMWF) is an intergovernmental organisation.
  • Global forecasts:
    • used by more than 22 countries to drive their regional forecasts.
    • provide accurate weekly, monthly and seasonal predictions, including early warnings of severe events.
  • In 2012 ECMWF provided the most accurate prediction for the trajectory and landfall of Hurricane Sandy: information that undoubtedly saved lives.
  • We are working closely with ECMWF (and other partners) in ESCAPE to evaluate and improve the algorithms, techniques and software on GPUs.
  • This is done through use of dwarves: mini-apps designed to represent the key properties of the real simulation applications.

ESCAPE Project Leaders

SLIDE 6

ESCAPE DWARVES

  • ECMWF’s Integrated Forecasting System (IFS) is a global prediction system: the entire earth’s atmosphere is represented as a spherical grid.
  • Information in “grid-point” space can be equivalently represented in “spectral” space, i.e. in terms of the frequencies of the fluctuating waves, which is more suited to some calculations.
  • IFS therefore repeatedly transforms between these representations: Fourier transforms (FFTs) in longitude and Legendre transforms (DGEMMs) in latitude, with AlltoAll data movement in between.
  • This dwarf represents the spectral transforms from IFS.
  • NB: the number of points varies per latitude (e.g. most around the equator, fewest at the poles). Additionally, there are multiple altitude “levels” in the third dimension away from the surface of the earth, each with 3 “fields”.

Spherical Harmonics (SH) Dwarf

SLIDE 7

ESCAPE DWARVES

  • Advection: horizontal transport.
  • Uses an unstructured grid with nearest-neighbour stencils.
  • The MPDATA scheme is already used within COSMO-EULAG (PSNC), and is of interest to ECMWF for future developments.
  • Both the SH and MPDATA dwarves use Fortran+OpenACC+MPI. SH also interfaces to CUDA libraries.
  • Many of the optimizations I will present are transferable to other applications/languages etc.

MPDATA Dwarf

SLIDE 8

SINGLE-GPU: EXPOSING PARALLELISM

SLIDE 9

EXPOSING PARALLELISM

OpenACC Loop Mapping

Typical structure of the application (usually spanning multiple routines/files):

Loop over timesteps
  …
  Loop over 1st dimension
    Loop over 2nd dimension
      Loop over fields
        …
        Operation (involving multidimensional arrays)
  …
  Another loop over dimensions
  …

Aim is to expose as much parallelism as possible in these spatial loops (the red box on the original slide), as flexibly as possible.

SLIDE 10

EXPOSING PARALLELISM

OpenACC Loop Mapping

Loop over 1st dimension
  Loop over 2nd dimension
    Loop over fields
      …
      Operation

Before optimization:

SH
  • Loop over 1st dimension was sequential: parallelism not exposed to the GPU.

MPDATA
  • Naïve mapping of loops, using “kernels” and/or “loop” directives without restructuring.
  • The resulting decomposition chosen by the compiler did not work well, since the runtime loop extents didn’t map well to the GPU architecture.

SLIDE 11

EXPOSING PARALLELISM

OpenACC Loop Mapping

!$ACC parallel loop collapse(3)
Loop over 1st dimension
  Loop over 2nd dimension
    Loop over fields
      …
      Operation

  • Assuming the loops are independent, it is better to restructure so that the loops are tightly nested, and to use the “collapse” clause (a minimal sketch follows below).
  • The loops will be collapsed into a single loop, and the compiler will map all the parallelism to GPU blocks/threads in an efficient way.
  • This can require extensive restructuring, depending on how the application is originally written.
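For illustration, a minimal Fortran/OpenACC sketch of the collapsed loop nest. The routine name, array names and extents are hypothetical, and the arrays are assumed to be already resident on the GPU (e.g. inside an enclosing !$ACC data region):

subroutine update(a, b, ni, nj, nfields)
  implicit none
  integer, intent(in) :: ni, nj, nfields
  real(8), intent(in)  :: b(ni, nj, nfields)
  real(8), intent(out) :: a(ni, nj, nfields)
  integer :: i, j, f
!$ACC parallel loop collapse(3) present(a, b)
  do f = 1, nfields
    do j = 1, nj
      do i = 1, ni
        ! independent element-wise operation over all three collapsed loops
        a(i, j, f) = 2.0d0 * b(i, j, f)
      end do
    end do
  end do
end subroutine update

The three loops are exposed to the compiler as a single iteration space of ni*nj*nfields independent iterations, which it can split across gangs, workers and vectors as it sees fit.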

SLIDE 12

EXPOSING PARALLELISM

OpenACC Loop Mapping

!$ACC parallel loop collapse(2)
Loop over 1st dimension
  Loop over 2nd dimension
    !$ACC loop seq
    Loop with dependencies
      …
      Operation

  • Sometimes we have loop-carried dependencies.
  • These can be performed sequentially by each thread at the innermost level (see the sketch below).
  • Can still perform well if there is enough parallelism in the outermost loops.
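A minimal Fortran/OpenACC sketch of this pattern, with a recurrence in the innermost (k) direction. Names and extents are hypothetical, and the array is assumed to be already present on the GPU:

subroutine recurrence(a, ni, nj, nk)
  implicit none
  integer, intent(in) :: ni, nj, nk
  real(8), intent(inout) :: a(ni, nj, nk)
  integer :: i, j, k
!$ACC parallel loop collapse(2) present(a)
  do j = 1, nj
    do i = 1, ni
!$ACC loop seq
      do k = 2, nk
        ! loop-carried dependence: each k value needs the previous one,
        ! so this loop runs sequentially within each thread
        a(i, j, k) = a(i, j, k) + a(i, j, k-1)
      end do
    end do
  end do
end subroutine recurrence

Performance remains good as long as ni*nj provides enough parallel work to fill the GPU.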

SLIDE 13

EXPOSING PARALLELISM

OpenACC Loop Mapping

do i=1, N
  do j=1, i
    do k=1, P
      …
      Operation

!$ACC parallel loop collapse(3)
do i=1, N
  do j=1, MAX_J
    do k=1, P
      if (j .le. i) then
        …
        Operation

  • Sometimes the extent of a loop depends on the index of another (outer) loop, which prevents loop collapsing.
  • Can replace the extent with its maximum value, and use a conditional statement in the loop body (a compilable sketch follows below).
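A minimal compilable version of the same idea; the routine name, array layout and the increment performed in the loop body are hypothetical:

subroutine triangular(a, n, maxj, p)
  implicit none
  integer, intent(in) :: n, maxj, p
  real(8), intent(inout) :: a(p, maxj, n)
  integer :: i, j, k
!$ACC parallel loop collapse(3) present(a)
  do i = 1, n
    do j = 1, maxj          ! was: do j = 1, i  (extent depended on i)
      do k = 1, p
        if (j <= i) then    ! mask out the extra, padded iterations
          a(k, j, i) = a(k, j, i) + 1.0d0
        end if
      end do
    end do
  end do
end subroutine triangular

The masked iterations do no useful work, but the rectangular iteration space lets the compiler collapse and distribute all three loops.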

SLIDE 14

SINGLE-GPU: MEMORY COALESCING

SLIDE 15

KERNEL OPTIMIZATION

Memory Coalescing: data layout

!$ACC parallel loop collapse(3)
do k=1, …
  do j=1, …
    do i=1, …
      …
      Array(i,j,k)=…

  • For memory coalescing, the fastest-moving index in an array access should correspond to the vector level (CUDA thread), which will correspond to the innermost collapsed loop index (a contrasting sketch follows below).

!$ACC parallel loop collapse(2)
do m=1, …
  do n=1, …
!$ACC loop seq
    do p=1, …
      …
      Array(n,m,p)=…
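A small Fortran/OpenACC sketch contrasting an uncoalesced and a coalesced access pattern; the array names and extents are hypothetical and the arrays are assumed present on the device:

subroutine coalescing_demo(a, b, ni, nj)
  implicit none
  integer, intent(in) :: ni, nj
  real(8), intent(inout) :: a(ni, nj), b(nj, ni)
  integer :: i, j

  ! Strided: the innermost (vector) index i is the SECOND index of b,
  ! so consecutive threads touch memory locations nj elements apart.
!$ACC parallel loop collapse(2) present(b)
  do j = 1, nj
    do i = 1, ni
      b(j, i) = 2.0d0 * b(j, i)
    end do
  end do

  ! Coalesced: the innermost index i is the leading index of a,
  ! so consecutive threads touch consecutive memory locations.
!$ACC parallel loop collapse(2) present(a)
  do j = 1, nj
    do i = 1, ni
      a(i, j) = 2.0d0 * a(i, j)
    end do
  end do
end subroutine coalescing_demo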

SLIDE 16

KERNEL OPTIMIZATION

Memory Coalescing: transposing with tile clause

!$ACC parallel loop tile(16,32)
do j=1, …
  do i=1, …
    …
    array_t(i,j)=array(j,i)

  • If you need to transpose, either the read or the write will be in the wrong layout for coalescing.
  • But can use the OpenACC “tile” clause to improve performance (see the sketch below).
  • The compiler tiles the operation by generating new innermost loops.
  • For each tile, data is staged on-chip.
  • Results in better global memory access patterns.
  • Experiment with tile sizes.
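A minimal compilable sketch of the tiled transpose; the array names and the 16x32 tile size are illustrative, and the arrays are assumed present on the device:

subroutine transpose_tiled(array, array_t, n, m)
  implicit none
  integer, intent(in) :: n, m
  real(8), intent(in)  :: array(n, m)
  real(8), intent(out) :: array_t(m, n)
  integer :: i, j
!$ACC parallel loop tile(16,32) present(array, array_t)
  do j = 1, n
    do i = 1, m
      ! either the read or the write is strided; tiling lets the compiler
      ! stage each 16x32 block so global memory accesses are better coalesced
      array_t(i, j) = array(j, i)
    end do
  end do
end subroutine transpose_tiled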
SLIDE 17

KERNEL OPTIMIZATION

Memory Coalescing: transposing within CUDA BLAS

  • In some of our cases, the transpose kernels were adjacent to CUDA BLAS matrix multiplication (DGEMM) calls.
  • Coalescing was facilitated by replacing C = AB matrix multiplications with the equivalent Cᵀ = BᵀAᵀ.
  • This allows the transpose operations to be pushed into the DGEMM library calls, which have much higher-performing implementations of transposed data accesses (a sketch follows below).
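A hedged sketch of the transposed-DGEMM idea, using the BLAS-style cublasDgemm interface from the PGI/NVIDIA "cublas" Fortran module (compiled with, e.g., -acc -Mcudalib=cublas); the names and sizes are illustrative and the arrays are assumed to be in a device data region. Since (AB)ᵀ = BᵀAᵀ, asking DGEMM to transpose both operands yields Cᵀ directly, so no separate transpose kernel is needed:

subroutine dgemm_transposed(a, b, ct, m, n, k)
  use cublas
  implicit none
  integer, intent(in) :: m, n, k
  real(8), intent(in)  :: a(m, k), b(k, n)
  real(8), intent(out) :: ct(n, m)    ! holds C transposed, where C = A*B
!$ACC host_data use_device(a, b, ct)
  ! Ct (n x m) = B^T (n x k) * A^T (k x m)
  call cublasDgemm('T', 'T', n, m, k, 1.0d0, b, k, a, m, 0.0d0, ct, n)
!$ACC end host_data
end subroutine dgemm_transposed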

SLIDE 18

SINGLE-GPU: DATA MANAGEMENT OPTIMIZATION

SLIDE 19

DATA MANAGEMENT OPTIMIZATION

Minimizing data allocation and movement

Data allocation and movement is expensive.

Many codes have:

Loop over timestep:
  …
  loops over spatial dims
  …

  • Important to keep as much data as possible resident on the GPU within the timestep loop (see the sketch below).
  • All allocations/frees should be outside the timestep loop.
  • Copies for constant data should be outside the main timestep loop.
  • Re-use temporary scratch arrays on the device.
  • For any necessary repeated copies (e.g. halo regions in MPI code), the volume copied should be minimized.
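A minimal sketch of the recommended structure: allocate and transfer once outside the timestep loop, with only GPU-resident work inside it. The names, extents and operations are illustrative:

subroutine run(fields, ni, nj, nsteps)
  implicit none
  integer, intent(in) :: ni, nj, nsteps
  real(8), intent(inout) :: fields(ni, nj)
  real(8), allocatable :: scratch(:, :)
  integer :: step, i, j

  allocate(scratch(ni, nj))              ! allocate scratch once, outside the loop

!$ACC data copy(fields) create(scratch)  ! one host<->device transfer for the whole run
  do step = 1, nsteps
!$ACC parallel loop collapse(2) present(fields, scratch)
    do j = 1, nj
      do i = 1, ni
        scratch(i, j) = 0.25d0 * fields(i, j)
      end do
    end do
!$ACC parallel loop collapse(2) present(fields, scratch)
    do j = 1, nj
      do i = 1, ni
        fields(i, j) = fields(i, j) + scratch(i, j)
      end do
    end do
    ! only small, unavoidable transfers (e.g. halo regions) belong inside the loop
  end do
!$ACC end data

  deallocate(scratch)
end subroutine run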

SLIDE 20

SINGLE-GPU: INTEROPERABILITY

SLIDE 21

INTEROPERABILITY

Simple example: Calling C/CUDA from PGI Fortran

!main.f90
program main
  interface
    subroutine kernel_from_f(arg) bind(C, name='kernel_wrapper')
      use iso_c_binding
      integer(c_int), value :: arg
    end subroutine kernel_from_f
  end interface
  call kernel_from_f(1)
end program main

//kernel.cu
#include <stdio.h>
__global__ void kernel(int arg){
  if (threadIdx.x==0) printf("hello from kernel\n");
  return;
}
extern "C" void kernel_wrapper(int arg){
  kernel<<<1,1>>>(arg);
  cudaDeviceSynchronize();
  return;
}

$ nvcc -c -arch=sm_60 kernel.cu -o kernel.o
$ pgf90 -c main.f90 -o main.o
$ pgf90 main.o kernel.o -o code -L $CUDA_HOME/lib64 -lcudart
$ ./code
hello from kernel

CUDA libraries can be called from C code in the usual manner.

SLIDE 22

INTEROPERABILITY AND LIBRARIES: SH DWARF

Base language Fortran, with MPI for multi-GPU communications.

[Diagram: SH dwarf timestep structure, alternating OpenACC-ported regions with cuFFT and cublasDgemm library calls]

SLIDE 23

BLAS/FFT LIBRARY CALLS IN SH DWARF

  • At each timestep, the SH dwarf performs transforms using matrix multiplications and FFTs.
  • Multiple operations - one for each:
    • Field (associated with vertical levels)
    • Longitude (Matmult) / Latitude (FFT)
  • Can batch over fields, since the sizes are the same. But different longitudes/latitudes have different sizes: not supported by the batched versions of cublasDgemm/cuFFT.
  • So, originally we had many small calls: low parallelism exposure and launch latency sensitivity.
  • For DGEMM, we pad with zeros up to the largest size and batch over longitudes as well as fields: a single call to the library; the extra operations do not contribute to the result (see the padding sketch below).
  • But FFT does not allow padding in the same way. Worked around the launch latency problem by removing the sync after each call: allows launch latency to be hidden behind execution.
  • As will be seen, however, this is the only part of the dwarf which remains suboptimal. Future: a batched FFT with differing sizes should improve performance.
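A hedged sketch of the zero-padding idea for the DGEMM batching: matrices of differing size are packed into a single array sized for the largest longitude, with zeros in the unused rows so that the padded operations contribute nothing to the result. The names, layout and extents are illustrative, and the subsequent batched library call is only indicated by a comment:

subroutine pad_for_batch(a_in, nrows, a_padded, nlon, maxm, k)
  implicit none
  integer, intent(in) :: nlon, maxm, k
  integer, intent(in) :: nrows(nlon)              ! valid rows per longitude
  real(8), intent(in)  :: a_in(maxm, k, nlon)     ! only the first nrows(l) rows are meaningful
  real(8), intent(out) :: a_padded(maxm, k, nlon)
  integer :: l, j, i
!$ACC parallel loop collapse(3) present(a_in, a_padded, nrows)
  do l = 1, nlon
    do j = 1, k
      do i = 1, maxm
        if (i <= nrows(l)) then
          a_padded(i, j, l) = a_in(i, j, l)
        else
          a_padded(i, j, l) = 0.0d0               ! zero padding up to the largest size
        end if
      end do
    end do
  end do
  ! ...followed by a single batched DGEMM over all nlon (and field) matrices,
  ! e.g. a strided-batched cuBLAS GEMM; the interface details are omitted here.
end subroutine pad_for_batch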

SLIDE 24

SINGLE-GPU: ESCAPE RESULTS

SLIDE 25

MPDATA OPTIMIZATION: P100

[Charts: MPDATA performance on P100, "Before" and "After" optimization]

SLIDE 26

OPTIMIZED MPDATA: P100 VS V100

[Charts: optimized MPDATA performance, P100 vs V100]

SLIDE 27

ESCAPE DWARF V100 PERFORMANCE

SLIDE 28

MPDATA KERNEL PERFORMANCE

  • 100% Roofline is STREAM benchmark throughput, since all kernels are memory bandwidth bound
SLIDE 29

ESCAPE DWARF V100 PERFORMANCE

SLIDE 30

SH KERNEL PERFORMANCE

  • 100% Roofline is peak DP performance (compute-bound kernels) or STREAM benchmark throughput (memory-bandwidth-bound kernels).

Experiments with batching (using the average size) show a 4.6X speedup.

SLIDE 31

MULTI-GPU: CUDA-AWARE MPI

SLIDE 32

CUDA-AWARE MPI

  • Modern MPI implementations are CUDA-aware. This means that pointers to GPU memory can be passed directly into the MPI calls, avoiding unnecessary transfers (both in the application and in the underlying MPI implementation).

!non CUDA-aware:
!$ACC update host(array,…)
call MPI_Alltoallv(array,…)
!$ACC update device(array,…)

!CUDA-aware:
!$ACC host_data use_device(array,…)
call MPI_Alltoallv(array,…)
!$ACC end host_data

  • This also applies to point-to-point etc. Note that if explicit buffer packing is involved, it will also need to be ported from the CPU to the GPU (see the sketch below).
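A minimal sketch of GPU-side buffer packing combined with a CUDA-aware point-to-point exchange. The rank layout, array names and halo pattern are illustrative, and all arrays are assumed to be already present on the device:

subroutine exchange_halo(field, sendbuf, recvbuf, ni, nj, left, right, comm)
  use mpi
  implicit none
  integer, intent(in) :: ni, nj, left, right, comm
  real(8), intent(inout) :: field(ni, nj)
  real(8), intent(inout) :: sendbuf(nj), recvbuf(nj)
  integer :: j, ierr, status(MPI_STATUS_SIZE)

  ! pack the halo column directly on the GPU (no host staging)
!$ACC parallel loop present(field, sendbuf)
  do j = 1, nj
    sendbuf(j) = field(ni, j)
  end do

  ! pass device pointers straight into CUDA-aware MPI
!$ACC host_data use_device(sendbuf, recvbuf)
  call MPI_Sendrecv(sendbuf, nj, MPI_DOUBLE_PRECISION, right, 0, &
                    recvbuf, nj, MPI_DOUBLE_PRECISION, left,  0, &
                    comm, status, ierr)
!$ACC end host_data

  ! unpack the received halo on the GPU
!$ACC parallel loop present(field, recvbuf)
  do j = 1, nj
    field(1, j) = recvbuf(j)
  end do
end subroutine exchange_halo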

SLIDE 33

MULTI-GPU: ALLTOALL OPTIMIZATION

SLIDE 34

ALLTOALL: 4 GPUS ON DGX-1

  • The SH dwarf performs AlltoAll operations. Even with CUDA-aware MPI, there are inefficiencies:
  • Can instead use CUDA IPC directly, with CUDA streams to overlap.
  • Future MPI implementations are expected to take care of this.
SLIDE 35

CUDA IPC ALLTOALL

As a starting point, used the 0_Simple/simpleIPC code from the CUDA samples. Setup:

  • On each MPI rank, get IPC memory handle for array locally
  • Share IPC memory handles via MPI
  • Setup CUDA streams

AlltoAll:

  • On each rank, loop over all targetranks (including own)
  • targetrank=(targetrank+rank)%number_of_gpus (for better balance)
  • Push message to targetrank (in stream[targetrank])
  • Sync all streams
SLIDE 36

SH RESULTS ON 4 GPUS

SLIDE 37

MULTI-GPU: DGX-2 (WITH NVSWITCH) VS DGX-1V

SLIDE 38

SPHERICAL HARMONICS: SCALING BEYOND 4 GPUS

  • When using all 8 GPUs in a DGX-1V:
    • There is no all-to-all NVLink connectivity - some messages go through PCIe and system memory.
    • This limits performance.
  • When using 16 GPUs across 2 DGX-1V servers:
    • Some messages go across the InfiniBand network.
    • A further bottleneck.
SLIDE 39

DGX-2 WITH NVSWITCH

  • AlltoAll network architecture with NVSwitch maps perfectly to the problem.
  • Full bandwidth between each GPU pair.
SLIDE 40

SPHERICAL HARMONICS: DGX-2 VS DGX-1V

SLIDE 41

SUMMARY AND FUTURE WORK

  • Optimizing the exposure of parallelism, memory coalescing and data management can have dramatic effects on performance.
  • MPDATA single-GPU performance is now optimal.
  • MPDATA multi-GPU: the halo exchange is currently implemented via the MPI-abstracted, CPU-based ATLAS library (also developed by ECMWF). To be fully optimized, ATLAS needs to be made CUDA-aware; this is currently being worked on by others.
  • SH single-GPU performance is vastly improved, but the FFT part remains sub-optimal. An implementation of batching that allows different sizes within each batch is expected to fix this.
  • DGX-2/NVSwitch all-to-all connectivity allows SH to scale to all 16 GPUs.
  • These results indicate that multi-GPU systems can be effectively exploited, allowing forecasting agencies to continue to improve their weather predictions.
