April 4-7, 2016 | Silicon Valley
Jiri Kraus, Senior Devtech Compute, April 7th 2016
HIGH PERFORMANCE AND PRODUCTIVITY WITH UNIFIED MEMORY AND OPENACC: A LBM CASE STUDY

OPENACC DIRECTIVES (Incremental)
```c
#pragma acc data copyin(x, y) copyout(z)
{
  #pragma acc parallel
  {
    #pragma acc loop gang vector
    for (i = 0; i < n; ++i) {
      z[i] = x[i] + y[i];
      ...
    }
  }
  ...
}
```
- Manage data movement (`data`)
- Initiate parallel execution (`parallel`)
- Optimize loop mappings (`loop`)
[Diagram: Unified Memory, a single address space spanning System Memory and GPU Memory]
Without Unified Memory (explicit data clauses):

```c
void foo(FILE *fp, int N) {
    float *x, *y, *z;
    x = (float *)malloc(N * sizeof(float));
    y = (float *)malloc(N * sizeof(float));
    z = (float *)malloc(N * sizeof(float));
    fread(x, sizeof(float), N, fp);
    fread(y, sizeof(float), N, fp);
    #pragma acc kernels copy(x[0:N], y[0:N], z[0:N])
    for (int i = 0; i < N; ++i)
        z[i] = x[i] + y[i];
    use_data(z);
    free(z); free(y); free(x);
}
```

With Unified Memory (no data clauses needed):

```c
void foo(FILE *fp, int N) {
    float *x, *y, *z;
    x = (float *)malloc(N * sizeof(float));
    y = (float *)malloc(N * sizeof(float));
    z = (float *)malloc(N * sizeof(float));
    fread(x, sizeof(float), N, fp);
    fread(y, sizeof(float), N, fp);
    #pragma acc kernels
    for (int i = 0; i < N; ++i)
        z[i] = x[i] + y[i];
    use_data(z);
    free(z); free(y); free(x);
}
```
- All heap allocations are placed in managed memory (the Unified Memory heap)
- Pointers can be used on both GPU and CPU
- Enabled with the compiler switch -ta=tesla:managed,...
- More info in "OpenACC and CUDA Unified Memory" by Michael Wolfe, PGI Compiler Engineer: https://www.pgroup.com/lit/articles/insider/v6n2a4.htm
- Unified Memory can be used in CPU and GPU code
- No need for any data clauses
- No need to fully understand the data flow and allocation logic of the application
- Simplifies handling of complex data structures
- Enables incremental, profiler-driven acceleration: data movement becomes just another optimization step
- Does not apply to stack, static, or global data (heap data only)
- Limits allocatable memory to the available device memory, even on the host, because all heap allocations are placed in device memory, including those never needed on the GPU. Depending on the application, this can significantly limit the maximum problem size.
- Data is coherent only at kernel launch and synchronization points. It is not allowed to access unified memory from host code while a kernel is running; doing so may result in a segmentation fault.
D2Q37 model. Application developed at the University of Rome Tor Vergata/INFN, University of Ferrara/INFN, and TU Eindhoven. It reproduces the dynamics of a fluid by simulating virtual particles that collide and propagate. Simulation of large systems requires double-precision computation and many GPUs.
- MPI + OpenMP + vector intrinsics, using an AoS data layout
- MPI + OpenACC, using an SoA data layout and traditional data staging with data regions and data clauses (this version, starting without the OpenACC directives, was used for the following)
- MPI + CUDA C, using an SoA data layout
- OpenCL

A paper comparing these variants was presented at Euro-Par 2015: "Accelerating Lattice Boltzmann Applications with OpenACC", E. Calore, J. Kraus, S. F. Schifano, and R. Tripiccione.
| Rank | Method | Time (s) Initial |
|------|--------|------------------|
| 1 | collide | 17.01 |
| 2 | propagate | 10.71 |
| 3 | | 2.26 |
| 4 | bc | 0.17 |

Application reported solver time: 27.85 s
Profiler: total time for process: 30.15 s
- Enable OpenACC and Managed Memory
- Enable accelerator information
- Enable CPU profiling information
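These three switches correspond to PGI compiler flags; a sketch of the build line (the exact flag spellings are assumed from PGI 15.x documentation, not shown on the slide):

```shell
# Sketch; flag spellings assumed, not stated on the slide:
#   -ta=tesla:managed  enables OpenACC with managed (unified) memory
#   -Minfo=accel       prints accelerator information at compile time
#   -Mprof=ccff        embeds CPU profiling information
pgcc -fast -acc -ta=tesla:managed -Minfo=accel -Mprof=ccff lbm.c -o lbm
```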
| Rank | Method | Time (s) UM | Time (s) Initial |
|------|--------|-------------|------------------|
| 1 | propagate | 41.18 | 10.71 |
| 2 | collide | 16.82 | 17.01 |
| 3 | | 6.58 | 2.26 |
| 4 | bc | 0.17 | 0.17 |

Application reported solver time: 62.96 s (Initial: 27.85 s)
Profiler: total time for process: 64.75 s (Initial: 30.15 s)
MPI handling of the periodic boundary conditions causes a flush of data to the GPU in every iteration.
Propagate slowed down due to unified memory page migrations
```c
inline void propagate(const data_t* restrict prv, data_t* restrict nxt) {
    int ix, iy, site_i;
    #pragma acc kernels
    #pragma acc loop independent device_type(NVIDIA) gang
    for (ix = HX; ix < (HX + SIZEX); ix++) {
        #pragma acc loop independent device_type(NVIDIA) vector(LOCAL_WORK_SIZEX)
        for (iy = HY; iy < (HY + SIZEY); iy++) {
            site_i = (ix * NY) + iy;
            nxt[            site_i] = prv[            site_i - 3*NY + 1];
            nxt[   NX*NY + site_i] = prv[   NX*NY + site_i - 3*NY    ];
            //...
            nxt[35*NX*NY + site_i] = prv[35*NX*NY + site_i + 3*NY    ];
            nxt[36*NX*NY + site_i] = prv[36*NX*NY + site_i + 3*NY - 1];
        }
    }
}
```
| Rank | Method | Time (s) +propagate | Time (s) UM | Time (s) Initial |
|------|--------|---------------------|-------------|------------------|
| 1 | bc | 34.59 | 0.17 | 0.17 |
| 2 | collide | 16.75 | 16.82 | 17.01 |
| 3 | | 6.94 | 6.58 | 2.26 |
| 4 | propagate | 2.14 | 41.18 | 10.71 |

Application reported solver time: 57.65 s (UM: 62.96 s)
Profiler: total time for process: 60.42 s (UM: 64.75 s)
BC slowed down due to unified memory page migrations
Application reported solver time: 55.74 s (propagate: 57.65 s)
Profiler: total time for process: 59.86 s (propagate: 60.42 s)
bc on GPU

| Rank | Method | Time (s) +bc | Time (s) +propagate | Time (s) UM | Time (s) Initial |
|------|--------|--------------|---------------------|-------------|------------------|
| 1 | collide | 49.99 | 16.75 | 16.82 | 17.01 |
| 2 | | 7.61 | 6.94 | 6.58 | 2.26 |
| 3 | propagate | 2.15 | 2.14 | 41.18 | 10.71 |
| 4 | bc | 0.11 | 34.59 | 0.17 | 0.17 |
collide slowed down due to unified memory page migrations
| Rank | Method | Time (s) Final | Time (s) UM+propagate+bc | Time (s) Initial |
|------|--------|----------------|--------------------------|------------------|
| | main | 7.69 | 2.39 | 1.89 |
| 1 | collide | 0.52 | 49.99 | 17.01 |
| 2 | lbm | 0.41 | 4.72 | 0.06 |
| 3 | init | 0.19 | 0.19 | 0.04 |
| 4 | printMass | 0.15 | 0.17 | 0.01 |
| 5 | propagate | 0.13 | 2.15 | 10.71 |
| 6 | bc | 0.09 | 0.11 | 0.17 |
| 7 | projection | 0.05 | 0.05 | 0.06 |

Application reported solver time: 0.96 s (bc: 55.74 s, Initial: 27.85 s)
Profiler: total time for process: 9.33 s (bc: 59.86 s, Initial: 30.15 s)
Data stays on GPU while simulation is running
CUDA-aware MPI with support for Unified Memory
E.g. OpenMPI since 1.8.5 or MVAPICH2-GDR since 2.2b with CUDA 7.0
Start one MPI rank per GPU
```c
int rank = 0;
int size = 1;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
#if _OPENACC
int ngpus = acc_get_num_devices(acc_device_nvidia);
int devicenum = rank % ngpus;
acc_set_device_num(devicenum, acc_device_nvidia);
acc_init(acc_device_nvidia);
#endif /*_OPENACC*/
```
[Chart: Runtime (s) for 1000 steps on a 1440x10240 grid, Tesla K80 vs. linear scaling; 1 GPU (1/2 K80), 2 GPUs (1 K80), 4 GPUs (2 K80), 8 GPUs (4 K80)]
Asynchronous execution is possible, but one needs to be careful not to use unified memory pointers in host code while kernels are running asynchronously. All kernel launches with -ta=tesla:managed are synchronous by default, i.e. PGI_ACC_SYNCHRONOUS=1. Set PGI_ACC_SYNCHRONOUS=0 to allow overlap.
Unified Memory support in OpenACC makes GPU acceleration even more productive:

- Profiler-guided incremental acceleration
- No need to insert any data clauses or to change allocation code
JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join
Benchmark setup:

- PGI 15.5, CUDA-aware build of OpenMPI 1.8.5 (GPUDirect P2P/RDMA disabled), CUDA 6.5
- Intel Xeon E5-2698 v3 @ 2.30 GHz
- Problem size: 100 iterations on a 480x512 grid
- The GPU-optimized SoA data layout is used, so the CPU runtime is not representative.