SLIDE 1

HIGH PERFORMANCE AND PRODUCTIVITY WITH UNIFIED MEMORY AND OPENACC: A LBM CASE STUDY

Jiri Kraus, Senior Devtech Compute, April 7th 2016

April 4-7, 2016 | Silicon Valley

SLIDE 2

OPENACC DIRECTIVES

• Incremental
• Single source
• Interoperable
• Performance portable (CPU, GPU, MIC)

    #pragma acc data copyin(a,b) copyout(c)   // Manage Data Movement
    {
      #pragma acc parallel                    // Initiate Parallel Execution
      {
        #pragma acc loop gang vector          // Optimize Loop Mappings
        for (i = 0; i < n; ++i) {
          z[i] = x[i] + y[i];
          ...
        }
      }
      ...
    }

SLIDE 3

UNIFIED MEMORY

[Figure: "Traditional Developer View" shows separate System Memory and GPU Memory; "Developer View With Unified Memory" shows a single Unified Memory.]

Dramatically Lower Developer Effort

SLIDE 4

UNIFIED MEMORY

Traditional Developer View:

    void foo(FILE *fp, int N) {
      float *x, *y, *z;
      x = (float *)malloc(N*sizeof(float));
      y = (float *)malloc(N*sizeof(float));
      z = (float *)malloc(N*sizeof(float));
      fread(x, sizeof(float), N, fp);
      fread(y, sizeof(float), N, fp);
      #pragma acc kernels copy(x[0:N],y[0:N],z[0:N])
      for (int i=0; i<N; ++i)
        z[i] = x[i] + y[i];
      use_data(z);
      free(z); free(y); free(x);
    }

Developer View With Unified Memory:

    void foo(FILE *fp, int N) {
      float *x, *y, *z;
      x = (float *)malloc(N*sizeof(float));
      y = (float *)malloc(N*sizeof(float));
      z = (float *)malloc(N*sizeof(float));
      fread(x, sizeof(float), N, fp);
      fread(y, sizeof(float), N, fp);
      #pragma acc kernels
      for (int i=0; i<N; ++i)
        z[i] = x[i] + y[i];
      use_data(z);
      free(z); free(y); free(x);
    }

SLIDE 5

• Identify Available Parallelism
• Express Parallelism
• Express Data Movement
• Optimize Loop Performance

SLIDE 6

OPENACC AND UNIFIED MEMORY

PGI Support for Unified Memory with OpenACC:

• All heap allocations are in managed memory (Unified Memory Heap)
• Pointers can be used on GPU and CPU
• Enabled with the compiler switch -ta=tesla:managed,…
• More info in "OpenACC and CUDA Unified Memory" by Michael Wolfe, PGI Compiler Engineer: https://www.pgroup.com/lit/articles/insider/v6n2a4.htm

SLIDE 7

OPENACC AND UNIFIED MEMORY

Advantages:

• Unified Memory can be used in CPU and GPU code
• No need for any data clauses
• No need to fully understand the data flow and allocation logic of the application
• Simplifies handling of complex data structures
• Incremental, profiler-driven acceleration: data movement becomes just another optimization
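To illustrate the "complex data structures" point, here is a minimal sketch (not taken from the talk) assuming compilation with -ta=tesla:managed: a structure containing a malloc'd pointer can be used inside a kernels region without any deep-copy data clauses, because both allocations land on the managed heap. The type and function names are illustrative.

    #include <stdlib.h>

    typedef struct { int n; double *vals; } field_t;     /* illustrative type */

    field_t *alloc_field(int n)
    {
      field_t *f = (field_t *)malloc(sizeof(field_t));   /* managed heap allocation */
      f->n = n;
      f->vals = (double *)malloc(n * sizeof(double));    /* also in managed memory */
      return f;
    }

    void scale_field(field_t *f, double s)
    {
      /* no copy/copyin clauses and no manual deep copy of f->vals needed */
      #pragma acc kernels
      #pragma acc loop independent
      for (int i = 0; i < f->n; ++i)
        f->vals[i] *= s;
    }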

SLIDE 8

OPENACC AND UNIFIED MEMORY

Implementation Details on Kepler and Maxwell:

• Applies only to heap data, not to stack, static, or global data
• Limits allocatable memory to the available device memory, even on the host, because all heap allocations are placed in device memory, including those never needed on the GPU. Depending on the application, this can significantly limit the maximum problem size.
• Data is coherent only at kernel launch and synchronization points: it is not allowed to access unified memory in host code while a kernel is running, and doing so may result in a segmentation fault.
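A minimal sketch of the coherence rule above (assuming a Kepler/Maxwell GPU and -ta=tesla:managed; function and variable names are illustrative): while an asynchronously launched kernel may still be running, host code must not touch the managed arrays and has to wait first.

    #include <stdio.h>

    void add_async(const float *x, const float *y, float *z, int n)
    {
      #pragma acc kernels async(1)
      #pragma acc loop independent
      for (int i = 0; i < n; ++i)
        z[i] = x[i] + y[i];

      /* Touching z here, while the kernel may still be in flight, can
         segfault on Kepler/Maxwell because the pages live on the GPU. */

      #pragma acc wait(1)            /* synchronize first ...          */
      printf("z[0] = %f\n", z[0]);   /* ... then host access is safe   */
    }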

SLIDE 9

LBM D2Q37

Lattice Boltzmann Method (LBM), D2Q37 model:

• Application developed at U Rome Tor Vergata/INFN, U Ferrara/INFN, TU Eindhoven
• Reproduces the dynamics of a fluid by simulating virtual particles that collide and propagate
• Simulation of large systems requires double-precision computation and many GPUs

SLIDE 10

LBM D2Q37

Versions:

• MPI + OpenMP + vector intrinsics using an AoS data layout
• MPI + OpenACC using an SoA data layout and traditional data staging with data regions and data clauses (this version, starting without OpenACC directives, was used for the following; see the data-layout sketch after this list)
• MPI + CUDA C using an SoA data layout
• OpenCL
• A paper comparing these variants was presented at Euro-Par 2015: "Accelerating Lattice Boltzmann Applications with OpenACC" – E. Calore, J. Kraus, S. F. Schifano and R. Tripiccione
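A hedged sketch of the two data layouts mentioned above (identifiers are illustrative, except data_t, NX, and NY, which appear in the propagate() listing later; data_t is assumed to be double, in line with the double-precision requirement): in the AoS layout all 37 populations of a lattice site are stored together, which suits CPU vectorization, while the SoA layout stores population k of all sites contiguously, so consecutive GPU threads access consecutive addresses.

    #define NPOP 37                 /* populations per lattice site in D2Q37 */

    /* AoS (CPU version): all populations of one site stored together.        */
    /* Access pattern: lattice_aos[site].p[k]                                 */
    typedef struct { double p[NPOP]; } site_aos_t;

    /* SoA (OpenACC/CUDA versions): one flat, population-major array.         */
    /* Access pattern: f[k*NX*NY + site], as in the propagate() kernel below. */
    typedef double data_t;
    /* data_t *f = (data_t *)malloc((size_t)NPOP * NX * NY * sizeof(data_t)); */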

SLIDE 11

LBM D2Q37 – INITIAL VERSION

CPU Profile (480x512) – 1 MPI rank

    Rank  Method      Time (s), Initial
    1     collide     17.01
    2     propagate   10.71
    3     other        2.26
    4     bc           0.17

Application reported solver time: 27.85 s
Profiler total time for process: 30.15 s

SLIDE 12

LBM D2Q37 – INITIAL VERSION

Change build environment:

• Enable OpenACC and managed memory: -acc -ta=tesla:managed,…
• Enable accelerator information: -Minfo=accel
• Enable CPU profiling information: -Mprof=func
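Put together, the build line for the PGI compiler might look like the following (a sketch: the compiler driver invocation and source file name are illustrative, the flags are the ones listed above):

    pgcc -acc -ta=tesla:managed -Minfo=accel -Mprof=func -o lbm lbm.c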

SLIDE 13

LBM D2Q37 – INITIAL VERSION

CPU Profile (480x512) using Unified Memory – 1 MPI rank

    Rank  Method      Time (s), UM   Time (s), Initial
    1     propagate   41.18          10.71
    2     collide     16.82          17.01
    3     other        6.58           2.26
    4     bc           0.17           0.17

Application reported solver time: 62.96 s (Initial: 27.85 s)
Profiler total time for process: 64.75 s (Initial: 30.15 s)

SLIDE 14

LBM D2Q37 – INITIAL VERSION

NVVP Timeline (480x512) using Unified Memory – 1 MPI rank

MPI handling of the periodic boundary conditions causes a flush of data to the GPU in every iteration.

SLIDE 15

LBM D2Q37 – INITIAL VERSION

NVVP Timeline (480x512) using Unified Memory - Zoom – 1 MPI rank

Propagate slowed down due to unified memory page migrations

SLIDE 16

LBM D2Q37 – ACCELERATING PROPAGATE

    inline void propagate(const data_t* restrict prv, data_t* restrict nxt)
    {
      int ix, iy, site_i;
      #pragma acc kernels
      #pragma acc loop independent device_type(NVIDIA) gang
      for ( ix=HX; ix < (HX+SIZEX); ix++) {
        #pragma acc loop independent device_type(NVIDIA) vector(LOCAL_WORK_SIZEX)
        for ( iy=HY; iy < (HY+SIZEY); iy++) {
          site_i = (ix*NY) + iy;
          nxt[          site_i] = prv[          site_i - 3*NY + 1];
          nxt[  NX*NY + site_i] = prv[  NX*NY + site_i - 3*NY    ];
          //...
          nxt[35*NX*NY + site_i] = prv[35*NX*NY + site_i + 3*NY    ];
          nxt[36*NX*NY + site_i] = prv[36*NX*NY + site_i + 3*NY - 1];
        }
      }
    }
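In this mapping, the device_type(NVIDIA) gang clause distributes the outer ix loop across thread blocks, while vector(LOCAL_WORK_SIZEX) maps the inner iy loop to the threads within a block; since site_i = ix*NY + iy, consecutive threads then read and write consecutive elements of each population array, which keeps the SoA accesses coalesced.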

SLIDE 17

LBM D2Q37 – PROPAGATE ACCELERATED

CPU Profile (480x512) using Unified Memory – 1 MPI rank (propagate on GPU)

    Rank  Method      Time (s), +propagate   Time (s), UM   Time (s), Initial
    1     bc          34.59                   0.17           0.17
    2     collide     16.75                  16.82          17.01
    3     other        6.94                   6.58           2.26
    4     propagate    2.14                  41.18          10.71

Application reported solver time: 57.65 s (UM: 62.96 s)
Profiler total time for process: 60.42 s (UM: 64.75 s)
SLIDE 18

LBM D2Q37 – PROPAGATE ACCELERATED

NVVP Timeline (480x512) using Unified Memory – 1 MPI rank

SLIDE 19

LBM D2Q37 – PROPAGATE ACCELERATED

NVVP Timeline (480x512) using Unified Memory - Zoom – 1 MPI rank

BC slowed down due to unified memory page migrations

SLIDE 20

LBM D2Q37 – BC ACCELERATED

CPU Profile (480x512) using Unified Memory – 1 MPI rank (propagate and bc on GPU)

    Rank  Method      Time (s), +bc   Time (s), +propagate   Time (s), UM   Time (s), Initial
    1     collide     49.99           16.75                  16.82          17.01
    2     other        7.61            6.94                   6.58           2.26
    3     propagate    2.15            2.14                  41.18          10.71
    4     bc           0.11           34.59                   0.17           0.17

Application reported solver time: 55.74 s (propagate: 57.65 s)
Profiler total time for process: 59.86 s (propagate: 60.42 s)

SLIDE 21

LBM D2Q37 – BC ACCELERATED

NVVP Timeline (480x512) using Unified Memory – 1 MPI rank

SLIDE 22

LBM D2Q37 – BC ACCELERATED

NVVP Timeline (480x512) using Unified Memory – 1 MPI rank

collide slowed down due to unified memory page migrations

SLIDE 23

LBM D2Q37 – COLLIDE ACCELERATED

CPU Profile (480x512) using Unified Memory – 1 MPI rank

    Rank  Method       Time (s), Final   Time (s), UM+propagate+bc   Time (s), Initial
    -     main          7.69              2.39                        1.89
    1     collide       0.52             49.99                       17.01
    2     lbm           0.41              4.72                        0.06
    3     init          0.19              0.19                        0.04
    4     printMass     0.15              0.17                        0.01
    5     propagate     0.13              2.15                       10.71
    6     bc            0.09              0.11                        0.17
    7     projection    0.05              0.05                        0.06

Application reported solver time: 0.96 s (bc: 55.74 s, Initial: 27.85 s)
Profiler total time for process: 9.33 s (bc: 59.86 s, Initial: 30.15 s)

SLIDE 24

LBM D2Q37 – COLLIDE ACCELERATED

NVVP Timeline (480x512) using Unified Memory – 1 MPI rank

Data stays on GPU while simulation is running

SLIDE 25

LBM D2Q37 – MULTI GPU

Requirements:

• CUDA-aware MPI with support for Unified Memory, e.g. OpenMPI since 1.8.5 or MVAPICH2-GDR since 2.2b with CUDA 7.0
• Start one MPI rank per GPU (see the halo-exchange sketch after this list)
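A minimal sketch of what this requirement buys (not code from the talk; buffer names, element counts, and neighbor ranks are illustrative): with a CUDA-aware MPI that understands Unified Memory, managed pointers can be passed directly to MPI calls when exchanging the periodic halos, without explicit host staging buffers.

    #include <mpi.h>

    /* Exchange the halo regions with the left and right neighbor ranks.     */
    /* All four buffers are assumed to live in managed memory (malloc plus   */
    /* -ta=tesla:managed); the CUDA-aware MPI resolves where the pages are.  */
    void exchange_halos(const double *send_left, const double *send_right,
                        double *recv_left, double *recv_right,
                        int halo_elems, int left, int right)
    {
      MPI_Sendrecv(send_right, halo_elems, MPI_DOUBLE, right, 0,
                   recv_left,  halo_elems, MPI_DOUBLE, left,  0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Sendrecv(send_left,  halo_elems, MPI_DOUBLE, left,  1,
                   recv_right, halo_elems, MPI_DOUBLE, right, 1,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }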

SLIDE 26

LBM D2Q37 – MULTI GPU

Handling GPU AFFINITY

    int rank = 0;
    int size = 1;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    #if _OPENACC
    int ngpus = acc_get_num_devices(acc_device_nvidia);
    int devicenum = rank % ngpus;
    acc_set_device_num(devicenum, acc_device_nvidia);
    acc_init(acc_device_nvidia);
    #endif /*_OPENACC*/
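The rank % ngpus mapping gives consecutive ranks on a node distinct device numbers; it assumes the MPI launcher places ranks so that the ranks sharing a node are numbered consecutively, matching the one-rank-per-GPU requirement from the previous slide.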

SLIDE 27

LBM D2Q37 – MULTI GPU

NVVP Timeline (480x512) using Unified Memory – 2 MPI ranks

SLIDE 28

LBM D2Q37 – MULTI GPU

NVVP Timeline (480x512) using Unified Memory - Zoom – 2 MPI ranks

MPI

SLIDE 29

LBM D2Q37 – MULTI GPU

Strong Scaling

[Figure: strong-scaling runtime (s) for 1000 steps on a 1440x10240 grid, on 1 GPU (1/2 K80), 2 GPUs (1 K80), 4 GPUs (2 K80), and 8 GPUs (4 K80), Tesla K80, compared against linear scaling.]

SLIDE 30

LBM D2Q37 – MULTI GPU

Overlapping Communication and Computation:

• Possible, but one must be careful not to use unified memory pointers in host code while kernels are running asynchronously.
• All kernel launches with -ta=tesla:managed are synchronous by default, i.e. PGI_ACC_SYNCHRONOUS=1.
• Set PGI_ACC_SYNCHRONOUS=0 to allow overlap (a sketch follows below).
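A minimal sketch of the overlap pattern, assuming PGI_ACC_SYNCHRONOUS=0 and managed memory (function and variable names are illustrative): launch the interior update asynchronously, perform the MPI halo exchange on other buffers meanwhile, and wait before any host code touches the managed arrays again.

    void step_overlapped(const double *restrict prv, double *restrict nxt, int n)
    {
      #pragma acc kernels async(1)
      #pragma acc loop independent
      for (int i = 0; i < n; ++i)
        nxt[i] = prv[i];           /* stands in for the interior update     */

      /* ... CUDA-aware MPI halo exchange on separate managed buffers ...   */

      #pragma acc wait(1)          /* required before host code reads nxt   */
    }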

SLIDE 31

LBM D2Q37 – MULTI GPU

Overlapping Communication and Computation Grid size: 1920x2048

SLIDE 32

LBM D2Q37 – MULTI GPU

Overlapping Communication and Computation Grid size: 1920x2048

SLIDE 33

CONCLUSIONS

• Unified Memory support for OpenACC makes GPU acceleration even more productive
• Profiler-guided incremental acceleration
• No need to insert any data clauses or to change allocation code

SLIDE 34

April 4-7, 2016 | Silicon Valley

THANK YOU

JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join

SLIDE 35

BACKUP

SLIDE 36

LBM D2Q37

Setup:

• PGI 15.5
• CUDA-aware build of OpenMPI 1.8.5 (GPUDirect P2P/RDMA disabled)
• CUDA 6.5
• Intel Xeon E5-2698 v3 @ 2.30 GHz
• Problem size: 100 iterations on a 480x512 grid
• The GPU-optimized SoA data layout is used, so the CPU runtime is not representative