S7546 Multi-GPU Programming with OpenACC
Jeff Larkin, May 9, 2017, GTC17
Multi-GPU Approaches
OpenACC-only
- Uses OpenACC’s runtime library to select devices at runtime.
- One process manages multiple devices.

OpenACC + MPI
- Uses the Message Passing Interface (MPI) for domain decomposition and GPU isolation.
- Each process (usually) only interfaces with 1 GPU, but 1 GPU may interface with many processes.
- May be “free” for many apps.
OpenACC multi-device API
OpenACC for Multiple Devices
By default, the OpenACC runtime will choose what it believes is the most capable device. OpenACC provides an API for enumerating the available devices, selecting device types, and selecting an individual device. Developers can use this API to change which device the runtime will use. Devices are uniquely identified by a number and type tuple (e.g. NVIDIA device 0).
Getting the Number of Devices
The acc_get_num_devices function returns the number of devices of a particular type.

int acc_get_num_devices(type)

This function should be called before attempting to use multiple devices to determine how many suitable devices are available.
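A minimal usage sketch (not from the slides; assumes an OpenACC compiler, the openacc.h header, and NVIDIA devices):

#include <openacc.h>
#include <stdio.h>

int main(void)
{
    // Query how many NVIDIA devices are available before dividing work among them.
    int num_devices = acc_get_num_devices(acc_device_nvidia);
    printf("Found %d NVIDIA device(s)\n", num_devices);
    return 0;
}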
Setting the Desired Device
The acc_set_device_num function selects the specific device that should be used by all upcoming OpenACC directives.

void acc_set_device_num(number, type)

or

#pragma acc set device_num(number) device_type(type)

Once a specific device is selected, all OpenACC directives until the next call to acc_set_device_num will go to the selected device. This function may be called with different values from different threads.
Querying the Device Number
The acc_get_device_num function returns the device number currently being used for a given type.

int acc_get_device_num(type)

This function is frequently confused with acc_get_num_devices. One function queries how many devices are available of a type; the other queries which device of that type will be used.
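A short sketch contrasting the two calls (illustration only; assumes openacc.h and NVIDIA devices):

#include <openacc.h>
#include <stdio.h>

int main(void)
{
    // How many devices of this type exist?
    int count   = acc_get_num_devices(acc_device_nvidia);
    // Which device of this type will the runtime use right now?
    int current = acc_get_device_num(acc_device_nvidia);
    printf("%d device(s) available, currently using device %d\n", count, current);
    return 0;
}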
Set Device Example
int def = acc_get_device_num(acc_device_default);

#pragma acc parallel loop async
for(int i=0; i<N; ++i)
    A[i] = i;

acc_set_device_num(1, acc_device_default);

#pragma acc parallel loop async
for(int i=0; i<N; ++i)
    B[i] = 2. * i;
#pragma acc wait

acc_set_device_num(def, acc_device_default);
#pragma acc wait

#pragma acc parallel loop
for(int i=0; i<N; ++i)
    C[i] = A[i] + B[i];
Asynchronously run loop on default device. Asynchronously run loop on device 1, potentially concurrently with previous. Then wait for results from device 1. Change back to default device, wait for completion, and finally run summation loop on default device.
Multi-GPU Case Study (All-OpenACC)
Multi-Device Pipeline: A Case Study
In NVIDIA’s Fall 2016 OpenACC Course I demonstrated how to pipeline an image filter to overlap data copies and compute kernels.

Pipelining: breaking a large operation into smaller parts so that independent operations can overlap.

Since each part is independent, they can easily be run on different devices.
Pipelining in a Nutshell
[Figure: two independent operations, each an H2D copy, a kernel, and a D2H copy. Top: the two operations serialized. Bottom: copies and computation overlapping.]

NOTE: In real applications, your boxes will not be so evenly sized.
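For illustration, a minimal single-device sketch of this idea (not the course code; the array name data, its length N, the block count NBLOCKS, and the arithmetic are assumptions):

#include <openacc.h>

void pipeline_sketch(float *data, long N, int NBLOCKS)
{
    #pragma acc data create(data[0:N])
    {
        for (int b = 0; b < NBLOCKS; b++) {
            long start = b * (N / NBLOCKS);
            long count = N / NBLOCKS;
            int  queue = (b % 3) + 1;   // rotate over three async queues
            // Copy this block to the device asynchronously.
            #pragma acc update device(data[start:count]) async(queue)
            // Compute on the same queue: runs after its own copy, but may
            // overlap with the copies and kernels of other blocks.
            #pragma acc parallel loop async(queue)
            for (long i = start; i < start + count; i++)
                data[i] = 2.0f * data[i] + 1.0f;
            // Copy the results back asynchronously.
            #pragma acc update self(data[start:count]) async(queue)
        }
        #pragma acc wait   // drain all queues before leaving the data region
    }
}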
Multi-device Pipelining in a Nutshell
[Figure: the same pipeline of H2D copy, kernel, and D2H copy stages, with blocks distributed across Device 0 and Device 1 so the two devices work concurrently.]
Multi-GPU Pipelined Code
#pragma omp parallel num_threads(acc_get_num_devices(acc_device_default))
{
    acc_set_device_num(omp_get_thread_num(), acc_device_default);
    int queue = 1;
    #pragma acc data create(imgData[w*h*ch], out[w*h*ch])
    {
        #pragma omp for schedule(static)
        for (long blocky = 0; blocky < nblocks; blocky++)
        {
            // For data copies we need to include the ghost zones for the filter
            long starty = MAX(0, blocky * blocksize - filtersize/2);
            long endy   = MIN(h, starty + blocksize + filtersize/2);
            #pragma acc update device(imgData[starty*step:(endy-starty)*step]) async(queue)
            starty = blocky * blocksize;
            endy   = starty + blocksize;
            #pragma acc parallel loop collapse(2) gang vector async(queue)
            for (long y = starty; y < endy; y++) {
                for (long x = 0; x < w; x++) {
                    float blue = 0.0, green = 0.0, red = 0.0;
                    <filter code removed for space>
                    out[y * step + x * ch]     = 255 - (scale * blue);
                    out[y * step + x * ch + 1] = 255 - (scale * green);
                    out[y * step + x * ch + 2] = 255 - (scale * red);
                }
            }
            #pragma acc update self(out[starty*step:blocksize*step]) async(queue)
            queue = (queue % 3) + 1;
        }
        #pragma acc wait
    }
}
Spawn 1 thread per device. Set the device number per-thread. Divide the work among threads. Wait for each device in its thread.
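For reference, a plausible compile line for this OpenMP + OpenACC version with the PGI compiler used in the results below (the file name filter.c and exact flags are assumptions, not from the slides):

$ pgcc -fast -mp -acc -ta=tesla filter.c -o filter

Here -mp enables the OpenMP host threads, while -acc and -ta=tesla target NVIDIA GPUs with OpenACC.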
Multi-GPU Pipelined Performance
Speed-up from a single device (Source: PGI 17.3, NVIDIA Tesla P100, DGX-1):

Original (1 device, no pipelining)   0.51X
Pipelined (1 device)                 1.00X
Pipelined, 2 devices                 1.69X
Pipelined, 4 devices                 2.66X
Pipelined, 8 devices                 2.92X  (crosses the quad boundary)
Multi-GPU Case Study (MPI+OpenACC)
OpenACC with MPI
- Domain decomposition is performed using MPI ranks.
- Each rank should set its own device:
  - Maybe with acc_set_device_num
  - Maybe handled by an environment variable (CUDA_VISIBLE_DEVICES)
- GPU affinity can be handled by standard MPI task placement.
- Multiple MPI ranks per GPU (using MPS) can work in place of OpenACC work queues / CUDA streams.
Setting a device by local rank
// This is not portable to other MPI libraries
char *comm_local_rank = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
int local_rank = atoi(comm_local_rank);
char *comm_local_size = getenv("OMPI_COMM_WORLD_LOCAL_SIZE");
int local_size = atoi(comm_local_size);
int num_devices = acc_get_num_devices(acc_device_nvidia);
#pragma acc set device_num(local_rank%num_devices) \
            device_type(acc_device_nvidia)
There is no portable way to get a local rank or to map a rank to the GPU(s) with good affinity to the rank. The MPI launcher (mpirun, mpiexec, aprun, etc.) can generally help you place ranks on particular CPUs to improve affinity mapping.
Determine a unique ID for each rank on the same node. Use this unique ID to select a device per rank. For more details about best practices using MPI and GPUs, including GPU affinity, see S7133 - Multi-GPU Programming with MPI.
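As an alternative sketch (not from the slides; assumes an MPI-3 library), a node-local rank can also be derived from a shared-memory communicator instead of launcher-specific environment variables, though rank placement and CPU/GPU affinity still need to be handled by the launcher:

#include <mpi.h>
#include <openacc.h>

void select_device_by_local_rank(void)
{
    MPI_Comm node_comm;
    int local_rank, num_devices;

    // Split MPI_COMM_WORLD into one communicator per node (ranks that share memory).
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);

    // Bind this rank to one of the node's GPUs, round-robin by local rank.
    num_devices = acc_get_num_devices(acc_device_nvidia);
    acc_set_device_num(local_rank % num_devices, acc_device_nvidia);

    MPI_Comm_free(&node_comm);
}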
MPI Image Filter (pseudocode)
if (rank == 0) read_image();

// Distribute the image to all ranks
MPI_Scatterv(image);
MPI_Barrier();        // Ensures all ranks line up for timing
omp_get_wtime();

blur_filter();        // Contains OpenACC filter
MPI_Barrier();        // Ensures all ranks complete before timing
omp_get_wtime();

MPI_Gatherv(out);
if (rank == 0) write_image();

$ mpirun --bind-to core --npersocket 4 ...
There is a variety of ways to do MPI decomposition; this is what I used for this particular example. The image is decomposed across processes (ranks) with MPI_Scatterv, the final parts are received from all ranks with MPI_Gatherv, and the job is launched with good GPU/process affinity.
Multi-GPU Pipelined Performance (MPI)
Speed-up from one device (Source: PGI 17.3, NVIDIA Tesla P100, DGX-1; communication excluded):

1 Device      1.00X
2 Devices     1.53X
4 Devices     2.83X
8 Devices     4.89X  (crosses the quad boundary)
16 Devices    8.51X  (crosses the node boundary)
Conclusions
Conclusions
OpenACC provides an API for managing multiple devices on a single node.

One approach to multiple devices is to manage them all with a single process.
- The developer only needs to worry about a single process; all GPUs read from/write to the same host memory; no additional API dependency is required.
- Must be cautious of shared-memory race conditions and improper use of asynchronous directives; does not handle GPU affinity very well when significant …

Another approach is to use MPI with OpenACC.
- Many applications already use MPI (little/no changes required); it is a simple way to handle affinity; it opens the possibility of running on multiple nodes.
- Adds a dependency on MPI; domain decomposition can be tricky and may increase the memory footprint.