SLIDE 1

HIGH PERFORMANCE AND PRODUCTIVITY WITH UNIFIED MEMORY AND OPENACC: A LBM CASE STUDY

Jiri Kraus, Senior Devtech Compute, April 7th 2016

April 4-7, 2016 | Silicon Valley

SLIDE 2

OPENACC DIRECTIVES

• Incremental
• Single source
• Interoperable
• Performance portable (CPU, GPU, MIC)

    #pragma acc data copyin(a,b) copyout(c)   // Manage Data Movement
    {
      #pragma acc parallel                    // Initiate Parallel Execution
      {
        #pragma acc loop gang vector          // Optimize Loop Mappings
        for (i = 0; i < n; ++i) {
          z[i] = x[i] + y[i];
          ...
        }
      }
      ...
    }

SLIDE 3

UNIFIED MEMORY

[Figure: "Traditional Developer View" shows separate System Memory and GPU Memory; "Developer View With Unified Memory" shows a single Unified Memory.]

Dramatically Lower Developer Effort

SLIDE 4

UNIFIED MEMORY

Traditional Developer View:

    void foo(FILE *fp, int N) {
      float *x, *y, *z;
      x = (float *)malloc(N*sizeof(float));
      y = (float *)malloc(N*sizeof(float));
      z = (float *)malloc(N*sizeof(float));
      fread(x, sizeof(float), N, fp);
      fread(y, sizeof(float), N, fp);
      #pragma acc kernels copy(x[0:N],y[0:N],z[0:N])
      for (int i=0; i<N; ++i)
        z[i] = x[i] + y[i];
      use_data(z);
      free(z); free(y); free(x);
    }

Developer View With Unified Memory:

    void foo(FILE *fp, int N) {
      float *x, *y, *z;
      x = (float *)malloc(N*sizeof(float));
      y = (float *)malloc(N*sizeof(float));
      z = (float *)malloc(N*sizeof(float));
      fread(x, sizeof(float), N, fp);
      fread(y, sizeof(float), N, fp);
      #pragma acc kernels
      for (int i=0; i<N; ++i)
        z[i] = x[i] + y[i];
      use_data(z);
      free(z); free(y); free(x);
    }

SLIDE 5

• Identify Available Parallelism
• Express Parallelism
• Express Data Movement
• Optimize Loop Performance

SLIDE 6

OPENACC AND UNIFIED MEMORY

PGI Support for Unified Memory with OpenACC:

• All heap allocations are in managed memory (Unified Memory Heap)
• Pointers can be used on GPU and CPU
• Enabled with the compiler switch -ta=tesla:managed,…
• More info in "OpenACC and CUDA Unified Memory" by Michael Wolfe, PGI Compiler Engineer: https://www.pgroup.com/lit/articles/insider/v6n2a4.htm

SLIDE 7

OPENACC AND UNIFIED MEMORY

Advantages:

• Unified Memory can be used in CPU and GPU code
• No need for any data clauses
• No need to fully understand the data flow and allocation logic of the application
• Simplifies handling of complex data structures
• Incremental, profiler-driven acceleration: data movement becomes just another optimization
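To illustrate the "complex data structures" point, here is a minimal sketch (not taken from the talk) assuming compilation with -ta=tesla:managed: a structure containing a malloc'd pointer can be used inside a kernels region without any deep-copy data clauses, because both allocations land on the managed heap. The type and function names are illustrative.

    #include <stdlib.h>

    typedef struct { int n; double *vals; } field_t;     /* illustrative type */

    field_t *alloc_field(int n)
    {
      field_t *f = (field_t *)malloc(sizeof(field_t));   /* managed heap allocation */
      f->n = n;
      f->vals = (double *)malloc(n * sizeof(double));    /* also in managed memory */
      return f;
    }

    void scale_field(field_t *f, double s)
    {
      /* no copy/copyin clauses and no manual deep copy of f->vals needed */
      #pragma acc kernels
      #pragma acc loop independent
      for (int i = 0; i < f->n; ++i)
        f->vals[i] *= s;
    }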

SLIDE 8

OPENACC AND UNIFIED MEMORY

Implementation Details on Kepler and Maxwell:

• Applies only to heap data, not to stack, static, or global data
• Limits allocatable memory to the available device memory, even on the host, because all heap allocations are placed in device memory, including those never needed on the GPU. Depending on the application, this can significantly limit the maximum problem size.
• Data is coherent only at kernel launch and synchronization points: it is not allowed to access unified memory in host code while a kernel is running, and doing so may result in a segmentation fault.
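A minimal sketch of the coherence rule above (assuming a Kepler/Maxwell GPU and -ta=tesla:managed; function and variable names are illustrative): while an asynchronously launched kernel may still be running, host code must not touch the managed arrays and has to wait first.

    #include <stdio.h>

    void add_async(const float *x, const float *y, float *z, int n)
    {
      #pragma acc kernels async(1)
      #pragma acc loop independent
      for (int i = 0; i < n; ++i)
        z[i] = x[i] + y[i];

      /* Touching z here, while the kernel may still be in flight, can
         segfault on Kepler/Maxwell because the pages live on the GPU. */

      #pragma acc wait(1)            /* synchronize first ...          */
      printf("z[0] = %f\n", z[0]);   /* ... then host access is safe   */
    }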

SLIDE 9

LBM D2Q37

Lattice Boltzmann Method (LBM), D2Q37 model:

• Application developed at U Rome Tor Vergata/INFN, U Ferrara/INFN, TU Eindhoven
• Reproduces the dynamics of a fluid by simulating virtual particles that collide and propagate
• Simulation of large systems requires double-precision computation and many GPUs

SLIDE 10

LBM D2Q37

Versions:

• MPI + OpenMP + vector intrinsics using an AoS data layout
• MPI + OpenACC using an SoA data layout and traditional data staging with data regions and data clauses (this version, starting without OpenACC directives, was used for the following; see the data-layout sketch after this list)
• MPI + CUDA C using an SoA data layout
• OpenCL
• A paper comparing these variants was presented at Euro-Par 2015: "Accelerating Lattice Boltzmann Applications with OpenACC" – E. Calore, J. Kraus, S. F. Schifano and R. Tripiccione
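A hedged sketch of the two data layouts mentioned above (identifiers are illustrative, except data_t, NX, and NY, which appear in the propagate() listing later; data_t is assumed to be double, in line with the double-precision requirement): in the AoS layout all 37 populations of a lattice site are stored together, which suits CPU vectorization, while the SoA layout stores population k of all sites contiguously, so consecutive GPU threads access consecutive addresses.

    #define NPOP 37                 /* populations per lattice site in D2Q37 */

    /* AoS (CPU version): all populations of one site stored together.        */
    /* Access pattern: lattice_aos[site].p[k]                                 */
    typedef struct { double p[NPOP]; } site_aos_t;

    /* SoA (OpenACC/CUDA versions): one flat, population-major array.         */
    /* Access pattern: f[k*NX*NY + site], as in the propagate() kernel below. */
    typedef double data_t;
    /* data_t *f = (data_t *)malloc((size_t)NPOP * NX * NY * sizeof(data_t)); */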

SLIDE 11

LBM D2Q37 – INITIAL VERSION

CPU Profile (480x512) – 1 MPI rank

    Rank  Method      Time (s), Initial
    1     collide     17.01
    2     propagate   10.71
    3     other        2.26
    4     bc           0.17

Application reported solver time: 27.85 s
Profiler total time for process: 30.15 s

SLIDE 12

LBM D2Q37 – INITIAL VERSION

Change build environment:

• Enable OpenACC and managed memory: -acc -ta=tesla:managed,…
• Enable accelerator information: -Minfo=accel
• Enable CPU profiling information: -Mprof=func
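Put together, the build line for the PGI compiler might look like the following (a sketch: the compiler driver invocation and source file name are illustrative, the flags are the ones listed above):

    pgcc -acc -ta=tesla:managed -Minfo=accel -Mprof=func -o lbm lbm.c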

SLIDE 13

LBM D2Q37 – INITIAL VERSION

CPU Profile (480x512) using Unified Memory – 1 MPI rank

    Rank  Method      Time (s), UM   Time (s), Initial
    1     propagate   41.18          10.71
    2     collide     16.82          17.01
    3     other        6.58           2.26
    4     bc           0.17           0.17

Application reported solver time: 62.96 s (Initial: 27.85 s)
Profiler total time for process: 64.75 s (Initial: 30.15 s)

SLIDE 14

LBM D2Q37 – INITIAL VERSION

NVVP Timeline (480x512) using Unified Memory – 1 MPI rank

MPI handling of the periodic boundary conditions causes a flush of data to the GPU in every iteration.

SLIDE 15

LBM D2Q37 – INITIAL VERSION

NVVP Timeline (480x512) using Unified Memory - Zoom – 1 MPI rank

Propagate slowed down due to unified memory page migrations

SLIDE 16

LBM D2Q37 – ACCELERATING PROPAGATE

    inline void propagate(const data_t* restrict prv, data_t* restrict nxt)
    {
      int ix, iy, site_i;
      #pragma acc kernels
      #pragma acc loop independent device_type(NVIDIA) gang
      for ( ix=HX; ix < (HX+SIZEX); ix++) {
        #pragma acc loop independent device_type(NVIDIA) vector(LOCAL_WORK_SIZEX)
        for ( iy=HY; iy < (HY+SIZEY); iy++) {
          site_i = (ix*NY) + iy;
          nxt[          site_i] = prv[          site_i - 3*NY + 1];
          nxt[  NX*NY + site_i] = prv[  NX*NY + site_i - 3*NY    ];
          //...
          nxt[35*NX*NY + site_i] = prv[35*NX*NY + site_i + 3*NY    ];
          nxt[36*NX*NY + site_i] = prv[36*NX*NY + site_i + 3*NY - 1];
        }
      }
    }
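In this mapping, the device_type(NVIDIA) gang clause distributes the outer ix loop across thread blocks, while vector(LOCAL_WORK_SIZEX) maps the inner iy loop to the threads within a block; since site_i = ix*NY + iy, consecutive threads then read and write consecutive elements of each population array, which keeps the SoA accesses coalesced.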

SLIDE 17

LBM D2Q37 – PROPAGATE ACCELERATED

CPU Profile (480x512) using Unified Memory – 1 MPI rank (propagate on GPU)

    Rank  Method      Time (s), +propagate   Time (s), UM   Time (s), Initial
    1     bc          34.59                   0.17           0.17
    2     collide     16.75                  16.82          17.01
    3     other        6.94                   6.58           2.26
    4     propagate    2.14                  41.18          10.71

Application reported solver time: 57.65 s (UM: 62.96 s)
Profiler total time for process: 60.42 s (UM: 64.75 s)
SLIDE 18

LBM D2Q37 – PROPAGATE ACCELERATED

NVVP Timeline (480x512) using Unified Memory – 1 MPI rank

SLIDE 19

LBM D2Q37 – PROPAGATE ACCELERATED

NVVP Timeline (480x512) using Unified Memory - Zoom – 1 MPI rank

BC slowed down due to unified memory page migrations

SLIDE 20

LBM D2Q37 – BC ACCELERATED

CPU Profile (480x512) using Unified Memory – 1 MPI rank (propagate and bc on GPU)

    Rank  Method      Time (s), +bc   Time (s), +propagate   Time (s), UM   Time (s), Initial
    1     collide     49.99           16.75                  16.82          17.01
    2     other        7.61            6.94                   6.58           2.26
    3     propagate    2.15            2.14                  41.18          10.71
    4     bc           0.11           34.59                   0.17           0.17

Application reported solver time: 55.74 s (propagate: 57.65 s)
Profiler total time for process: 59.86 s (propagate: 60.42 s)

SLIDE 21

LBM D2Q37 – BC ACCELERATED

NVVP Timeline (480x512) using Unified Memory – 1 MPI rank

SLIDE 22

LBM D2Q37 – BC ACCELERATED

NVVP Timeline (480x512) using Unified Memory – 1 MPI rank

collide slowed down due to unified memory page migrations

SLIDE 23

LBM D2Q37 – COLLIDE ACCELERATED

CPU Profile (480x512) using Unified Memory – 1 MPI rank

    Rank  Method       Time (s), Final   Time (s), UM+propagate+bc   Time (s), Initial
    -     main          7.69              2.39                        1.89
    1     collide       0.52             49.99                       17.01
    2     lbm           0.41              4.72                        0.06
    3     init          0.19              0.19                        0.04
    4     printMass     0.15              0.17                        0.01
    5     propagate     0.13              2.15                       10.71
    6     bc            0.09              0.11                        0.17
    7     projection    0.05              0.05                        0.06

Application reported solver time: 0.96 s (bc: 55.74 s, Initial: 27.85 s)
Profiler total time for process: 9.33 s (bc: 59.86 s, Initial: 30.15 s)

SLIDE 24

LBM D2Q37 – COLLIDE ACCELERATED

NVVP Timeline (480x512) using Unified Memory – 1 MPI rank

Data stays on GPU while simulation is running

SLIDE 25

LBM D2Q37 – MULTI GPU

Requirements:

• CUDA-aware MPI with support for Unified Memory, e.g. OpenMPI since 1.8.5 or MVAPICH2-GDR since 2.2b with CUDA 7.0
• Start one MPI rank per GPU (see the halo-exchange sketch after this list)
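A minimal sketch of what this requirement buys (not code from the talk; buffer names, element counts, and neighbor ranks are illustrative): with a CUDA-aware MPI that understands Unified Memory, managed pointers can be passed directly to MPI calls when exchanging the periodic halos, without explicit host staging buffers.

    #include <mpi.h>

    /* Exchange the halo regions with the left and right neighbor ranks.     */
    /* All four buffers are assumed to live in managed memory (malloc plus   */
    /* -ta=tesla:managed); the CUDA-aware MPI resolves where the pages are.  */
    void exchange_halos(const double *send_left, const double *send_right,
                        double *recv_left, double *recv_right,
                        int halo_elems, int left, int right)
    {
      MPI_Sendrecv(send_right, halo_elems, MPI_DOUBLE, right, 0,
                   recv_left,  halo_elems, MPI_DOUBLE, left,  0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Sendrecv(send_left,  halo_elems, MPI_DOUBLE, left,  1,
                   recv_right, halo_elems, MPI_DOUBLE, right, 1,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }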

SLIDE 26

LBM D2Q37 – MULTI GPU

Handling GPU AFFINITY

    int rank = 0;
    int size = 1;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    #if _OPENACC
    int ngpus = acc_get_num_devices(acc_device_nvidia);
    int devicenum = rank % ngpus;
    acc_set_device_num(devicenum, acc_device_nvidia);
    acc_init(acc_device_nvidia);
    #endif /*_OPENACC*/
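The rank % ngpus mapping gives consecutive ranks on a node distinct device numbers; it assumes the MPI launcher places ranks so that the ranks sharing a node are numbered consecutively, matching the one-rank-per-GPU requirement from the previous slide.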

SLIDE 27

LBM D2Q37 – MULTI GPU

NVVP Timeline (480x512) using Unified Memory – 2 MPI ranks

SLIDE 28

LBM D2Q37 – MULTI GPU

NVVP Timeline (480x512) using Unified Memory - Zoom – 2 MPI ranks

MPI

SLIDE 29

LBM D2Q37 – MULTI GPU

Strong Scaling

[Figure: strong-scaling runtime (s) for 1000 steps on a 1440x10240 grid, on 1 GPU (1/2 K80), 2 GPUs (1 K80), 4 GPUs (2 K80), and 8 GPUs (4 K80), Tesla K80, compared against linear scaling.]

SLIDE 30

LBM D2Q37 – MULTI GPU

Overlapping Communication and Computation:

• Possible, but one must be careful not to use unified memory pointers in host code while kernels are running asynchronously.
• All kernel launches with -ta=tesla:managed are synchronous by default, i.e. PGI_ACC_SYNCHRONOUS=1.
• Set PGI_ACC_SYNCHRONOUS=0 to allow overlap (a sketch follows below).
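A minimal sketch of the overlap pattern, assuming PGI_ACC_SYNCHRONOUS=0 and managed memory (function and variable names are illustrative): launch the interior update asynchronously, perform the MPI halo exchange on other buffers meanwhile, and wait before any host code touches the managed arrays again.

    void step_overlapped(const double *restrict prv, double *restrict nxt, int n)
    {
      #pragma acc kernels async(1)
      #pragma acc loop independent
      for (int i = 0; i < n; ++i)
        nxt[i] = prv[i];           /* stands in for the interior update     */

      /* ... CUDA-aware MPI halo exchange on separate managed buffers ...   */

      #pragma acc wait(1)          /* required before host code reads nxt   */
    }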

SLIDE 31

LBM D2Q37 – MULTI GPU

Overlapping Communication and Computation Grid size: 1920x2048

SLIDE 32

LBM D2Q37 – MULTI GPU

Overlapping Communication and Computation Grid size: 1920x2048

SLIDE 33

CONCLUSIONS

• Unified Memory support for OpenACC makes GPU acceleration even more productive
• Profiler-guided incremental acceleration
• No need to insert any data clauses or to change allocation code

SLIDE 34

April 4-7, 2016 | Silicon Valley

THANK YOU

JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join

SLIDE 35

BACKUP

SLIDE 36

LBM D2Q37

Setup:

• PGI 15.5
• CUDA-aware build of OpenMPI 1.8.5 (GPUDirect P2P/RDMA disabled)
• CUDA 6.5
• Intel Xeon E5-2698 v3 @ 2.30 GHz
• Problem size: 100 iterations on a 480x512 grid
• The GPU-optimized SoA data layout is used, so the CPU runtime is not representative