SLIDE 1

April 4-7, 2016 | Silicon Valley

Jiri Kraus, Senior Devtech Compute, April 4th 2016

MULTI GPU PROGRAMMING WITH MPI

SLIDE 2

MPI+CUDA

[Diagram: a GPU cluster of nodes Node 0 … Node n-1. In each node a CPU with system memory and a GPU with GDDR5 memory are connected via PCI-e, and a network card links the node to the other nodes.]

SLIDE 3

MPI+CUDA

[Same cluster diagram as Slide 2.]

SLIDE 4

MPI+CUDA

//MPI rank 0
MPI_Send(s_buf_d, size, MPI_CHAR, n-1, tag, MPI_COMM_WORLD);

//MPI rank n-1
MPI_Recv(r_buf_d, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);

SLIDE 5

YOU WILL LEARN

- What MPI is
- How to use MPI for inter-GPU communication with CUDA and OpenACC
- What CUDA-aware MPI is
- What the Multi Process Service (MPS) is and how to use it
- How to use NVIDIA tools in an MPI environment
- How to hide MPI communication times

SLIDE 6

MESSAGE PASSING INTERFACE - MPI

- Standard to exchange data between processes via messages
- Defines an API to exchange messages
  - Point to point: e.g. MPI_Send, MPI_Recv
  - Collectives: e.g. MPI_Reduce
- Multiple implementations (open source and commercial)
  - Bindings for C/C++, Fortran, Python, …
  - E.g. MPICH, OpenMPI, MVAPICH, IBM Platform MPI, Cray MPT, …

SLIDE 7

MPI - SKELETON

#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;
    /* Initialize the MPI library */
    MPI_Init(&argc, &argv);
    /* Determine the calling process' rank and the total number of ranks */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* Call MPI routines like MPI_Send, MPI_Recv, ... */
    ...
    /* Shut down the MPI library */
    MPI_Finalize();
    return 0;
}
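To make the skeleton concrete, here is a minimal sketch that fills in the communication part with a single message from rank 0 to rank 1; the payload variable msg is a hypothetical example, not code from the deck:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int msg = 42; /* hypothetical payload */
    if (rank == 0 && size > 1) {
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", msg);
    }

    MPI_Finalize();
    return 0;
}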

SLIDE 8

MPI

Compiling and Launching

$ mpicc -o myapp myapp.c
$ mpirun -np 4 ./myapp <args>

[Diagram: mpirun launches four instances of myapp, with rank = 0, 1, 2, 3.]

SLIDE 9

A SIMPLE EXAMPLE

SLIDE 10

EXAMPLE: JACOBI SOLVER

Solves the 2D Laplace equation on a rectangle:

    \Delta v(y,z) = 0 \quad \forall\, (y,z) \in \Omega \setminus \partial\Omega

Dirichlet boundary conditions (constant values on the boundary):

    v(y,z) = g(y,z) \quad \forall\, (y,z) \in \partial\Omega

2D domain decomposition with n x k domains:

[Diagram: ranks arranged in a k x n grid, from Rank (0,0), Rank (0,1), …, Rank (0,n-1) in the top row down to Rank (k-1,0), Rank (k-1,1), …, Rank (k-1,n-1) in the bottom row.]
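The deck does not show how ranks are mapped onto the k x n grid; a minimal sketch using MPI's Cartesian topology helpers is one common option (an assumption, not the speaker's code; rank and size come from the skeleton on Slide 7, and the neighbor names t_nb, b_nb, l_nb, r_nb anticipate the halo-exchange slides below):

/* Sketch (assumption): build a k x n process grid with MPI's Cartesian topology */
int dims[2] = {0, 0}, periods[2] = {0, 0}, coords[2];
MPI_Comm cart_comm;
MPI_Dims_create(size, 2, dims);               /* factor size into k x n */
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart_comm);
MPI_Cart_coords(cart_comm, rank, 2, coords);  /* coords[0] = row, coords[1] = column */
int t_nb, b_nb, l_nb, r_nb;                   /* top/bottom and left/right neighbors */
MPI_Cart_shift(cart_comm, 0, 1, &t_nb, &b_nb);
MPI_Cart_shift(cart_comm, 1, 1, &l_nb, &r_nb);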

SLIDE 11

EXAMPLE: JACOBI SOLVER

Single GPU

While not converged
    Do Jacobi step:
        for (int iy=1; iy < ny-1; ++iy)
            for (int ix=1; ix < nx-1; ++ix)
                u_new[ix][iy] = 0.0f - 0.25f*( u[ix-1][iy] + u[ix+1][iy]
                                             + u[ix][iy-1] + u[ix][iy+1]);
    Swap u_new and u
    Next iteration
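On the GPU the same stencil becomes a kernel; the following is a sketch (kernel name, launch geometry, and flat row-major indexing are illustrative assumptions, not code from the deck):

__global__ void jacobi_step(float *u_new, const float *u, int nx, int ny)
{
    /* one thread per interior grid point */
    int ix = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int iy = blockIdx.y * blockDim.y + threadIdx.y + 1;
    if (ix < nx-1 && iy < ny-1) {
        u_new[iy*nx + ix] = 0.0f - 0.25f * ( u[iy*nx + (ix-1)] + u[iy*nx + (ix+1)]
                                           + u[(iy-1)*nx + ix] + u[(iy+1)*nx + ix] );
    }
}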

SLIDE 12

EXAMPLE: JACOBI SOLVER

Multi GPU

While not converged
    Do Jacobi step:
        for (int iy=1; iy < ny-1; ++iy)
            for (int ix=1; ix < nx-1; ++ix)
                u_new[ix][iy] = 0.0f - 0.25f*( u[ix-1][iy] + u[ix+1][iy]
                                             + u[ix][iy-1] + u[ix][iy+1]);
    Exchange halo with 2 (4) neighbors
    Swap u_new and u
    Next iteration

SLIDE 13

EXAMPLE JACOBI

Top/Bottom Halo

MPI_Sendrecv(u_new+offset_first_row, m-2, MPI_DOUBLE, t_nb, 0,
             u_new+offset_bottom_boundary, m-2, MPI_DOUBLE, b_nb, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Sendrecv(u_new+offset_last_row, m-2, MPI_DOUBLE, b_nb, 1,
             u_new+offset_top_boundary, m-2, MPI_DOUBLE, t_nb, 1,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

SLIDE 14

EXAMPLE JACOBI

Top/Bottom Halo

OpenACC:

#pragma acc host_data use_device ( u_new )
{
    MPI_Sendrecv(u_new+offset_first_row, m-2, MPI_DOUBLE, t_nb, 0,
                 u_new+offset_bottom_boundary, m-2, MPI_DOUBLE, b_nb, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(u_new+offset_last_row, m-2, MPI_DOUBLE, b_nb, 1,
                 u_new+offset_top_boundary, m-2, MPI_DOUBLE, t_nb, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

CUDA:

MPI_Sendrecv(u_new_d+offset_first_row, m-2, MPI_DOUBLE, t_nb, 0,
             u_new_d+offset_bottom_boundary, m-2, MPI_DOUBLE, b_nb, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Sendrecv(u_new_d+offset_last_row, m-2, MPI_DOUBLE, b_nb, 1,
             u_new_d+offset_top_boundary, m-2, MPI_DOUBLE, t_nb, 1,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

SLIDE 15

EXAMPLE: JACOBI

Left/Right Halo (OpenACC)

//right neighbor omitted
#pragma acc parallel loop present ( u_new, to_left )
for ( int i=0; i<n-2; ++i )
    to_left[i] = u_new[(i+1)*m+1];

#pragma acc host_data use_device ( from_right, to_left )
{
    MPI_Sendrecv( to_left, n-2, MPI_DOUBLE, l_nb, 0,
                  from_right, n-2, MPI_DOUBLE, r_nb, 0,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE );
}

#pragma acc parallel loop present ( u_new, from_right )
for ( int i=0; i<n-2; ++i )
    u_new[(m-1)+(i+1)*m] = from_right[i];

SLIDE 16

EXAMPLE: JACOBI

Left/Right Halo (CUDA)

//right neighbor omitted
pack<<<gs,bs,0,s>>>(to_left_d, u_new_d, n, m);
cudaStreamSynchronize(s);
MPI_Sendrecv( to_left_d, n-2, MPI_DOUBLE, l_nb, 0,
              from_right_d, n-2, MPI_DOUBLE, r_nb, 0,
              MPI_COMM_WORLD, MPI_STATUS_IGNORE );
unpack<<<gs,bs,0,s>>>(u_new_d, from_right_d, n, m);

SLIDE 17

LAUNCH MPI+CUDA/OPENACC PROGRAMS

Launch one process per GPU.

- MVAPICH: MV2_USE_CUDA
  $ MV2_USE_CUDA=1 mpirun -np ${np} ./myapp <args>
- Open MPI: CUDA-aware features are enabled by default
- Cray: MPICH_RDMA_ENABLED_CUDA
- IBM Platform MPI: PMPI_GPU_AWARE

SLIDE 18

JACOBI RESULTS (1000 STEPS)

MVAPICH2-2.0b, FDR IB - weak scaling with 4k x 4k per process

[Chart: runtime (s) for 1, 2, 4, and 8 MPI ranks, with 1 CPU socket (10 OMP threads) or 1 GPU per rank; Tesla K20X vs. Xeon E5-2690 v2 @ 3.0 GHz.]

SLIDE 19

EXAMPLE JACOBI

Top/Bottom Halo without CUDA-aware MPI

OpenACC:

#pragma acc update host(u_new[offset_first_row:m-2],u_new[offset_last_row:m-2])
MPI_Sendrecv(u_new+offset_first_row, m-2, MPI_DOUBLE, t_nb, 0,
             u_new+offset_bottom_boundary, m-2, MPI_DOUBLE, b_nb, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Sendrecv(u_new+offset_last_row, m-2, MPI_DOUBLE, b_nb, 1,
             u_new+offset_top_boundary, m-2, MPI_DOUBLE, t_nb, 1,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
#pragma acc update device(u_new[offset_top_boundary:m-2],u_new[offset_bottom_boundary:m-2])

CUDA:

//send to bottom and receive from top omitted
cudaMemcpy(u_new+offset_first_row, u_new_d+offset_first_row,
           (m-2)*sizeof(double), cudaMemcpyDeviceToHost);
MPI_Sendrecv(u_new+offset_first_row, m-2, MPI_DOUBLE, t_nb, 0,
             u_new+offset_bottom_boundary, m-2, MPI_DOUBLE, b_nb, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
cudaMemcpy(u_new_d+offset_bottom_boundary, u_new+offset_bottom_boundary,
           (m-2)*sizeof(double), cudaMemcpyHostToDevice);

SLIDE 20

THE DETAILS

SLIDE 21

UNIFIED VIRTUAL ADDRESSING

[Diagram: without UVA, system memory and GPU memory are two separate address spaces, each with its own 0x0000-0xFFFF range; with UVA, CPU and GPU memory live in a single address space spanned across PCI-e.]

SLIDE 22

UNIFIED VIRTUAL ADDRESSING

One address space for all CPU and GPU memory:

- Determine the physical memory location from a pointer value
- Enables libraries to simplify their interfaces (e.g. MPI and cudaMemcpy)

Supported on devices with compute capability 2.0+ for 64-bit applications on Linux and Windows (+TCC).
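A minimal sketch of the first point: with UVA a library such as an MPI implementation can query where a user's buffer lives (buf is a hypothetical pointer; the memoryType field matches the CUDA runtime API of this era):

cudaPointerAttributes attr;
if (cudaPointerGetAttributes(&attr, buf) == cudaSuccess) {
    if (attr.memoryType == cudaMemoryTypeDevice) {
        /* buf is GPU memory: e.g. stage through host memory or use GPUDirect */
    } else {
        /* buf is host memory: handle like any regular MPI buffer */
    }
}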

SLIDE 23

NVIDIA GPUDIRECT™

Accelerated Communication with Network & Storage Devices

[Diagram: a node with GPU1, GPU2 and their memories, CPU, chipset, system memory, and an InfiniBand adapter on PCI-e; GPUDirect accelerates the data path between GPU memory and the network adapter through system memory.]
SLIDE 24

NVIDIA GPUDIRECT™

Accelerated Communication with Network & Storage Devices

[Same diagram as Slide 23.]

SLIDE 25

NVIDIA GPUDIRECT™

Peer to Peer Transfers

[Diagram: direct PCI-e transfers between GPU1 memory and GPU2 memory within the node, without going through system memory.]

SLIDE 26

NVIDIA GPUDIRECT™

Peer to Peer Transfers

[Same diagram as Slide 25.]

SLIDE 27

NVIDIA GPUDIRECT™

Support for RDMA

[Diagram: the InfiniBand adapter reads and writes GPU memory directly over PCI-e, bypassing system memory.]

SLIDE 28

NVIDIA GPUDIRECT™

Support for RDMA

[Same diagram as Slide 27.]

SLIDE 29

CUDA-AWARE MPI

Example: two GPUs in two nodes; MPI rank 0 calls MPI_Send on a GPU buffer and MPI rank 1 calls MPI_Recv on a GPU buffer.

The following slides show how CUDA+MPI works in principle; depending on the MPI implementation, message size, system setup, …, the situation might differ.

SLIDE 30

MPI GPU TO REMOTE GPU

Support for RDMA

MPI Rank 0:
MPI_Send(s_buf_d, size, MPI_CHAR, 1, tag, MPI_COMM_WORLD);

MPI Rank 1:
MPI_Recv(r_buf_d, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);

SLIDE 31

MPI GPU TO REMOTE GPU

Support for RDMA

[Timeline: with GPUDirect RDMA the whole exchange appears as a single MPI_Sendrecv on the time axis.]

SLIDE 32

REGULAR MPI GPU TO REMOTE GPU

MPI Rank 0:
cudaMemcpy(s_buf_h, s_buf_d, size, cudaMemcpyDeviceToHost);
MPI_Send(s_buf_h, size, MPI_CHAR, 1, tag, MPI_COMM_WORLD);

MPI Rank 1:
MPI_Recv(r_buf_h, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);
cudaMemcpy(r_buf_d, r_buf_h, size, cudaMemcpyHostToDevice);

SLIDE 33

REGULAR MPI GPU TO REMOTE GPU

[Timeline: memcpy D->H, then MPI_Sendrecv, then memcpy H->D, executed back to back.]

SLIDE 34

MPI GPU TO REMOTE GPU

without GPUDirect

MPI Rank 0:
MPI_Send(s_buf_d, size, MPI_CHAR, 1, tag, MPI_COMM_WORLD);

MPI Rank 1:
MPI_Recv(r_buf_d, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);

SLIDE 35

MPI GPU TO REMOTE GPU

without GPUDirect

[Timeline: the exchange appears as a single, longer MPI_Sendrecv on the time axis; the staging through host memory happens inside the MPI library.]

SLIDE 36

PERFORMANCE RESULTS TWO NODES

OpenMPI 1.10.2, MLNX FDR IB (4X), Tesla K40@875

[Chart: bandwidth (MB/s, up to ~7000) over message size (bytes) for CUDA-aware MPI with GPUDirect RDMA, CUDA-aware MPI, and regular MPI.]

Latency (1 byte):
- regular MPI: 24.99 us
- CUDA-aware MPI: 21.72 us
- CUDA-aware MPI with GPUDirect RDMA: 5.65 us

SLIDE 37

MULTI PROCESS SERVICE (MPS) FOR MPI APPLICATIONS

SLIDE 38

GPU ACCELERATION OF LEGACY MPI APPS

- Typical legacy application:
  - MPI parallel
  - Single or few threads per MPI rank (e.g. OpenMP)
  - Running with multiple MPI ranks per node
- GPU acceleration in phases:
  - Proof of concept prototype, …
  - Great speedup at kernel level
  - Application performance misses expectations

SLIDE 39

MULTI PROCESS SERVICE (MPS)

For Legacy MPI Applications

[Diagram: bars for N = 1, 2, 4, 8 ranks comparing "Multicore CPU only" with "With Hyper-Q/MPS"; each bar splits into a serial part, a CPU-parallel part, and a GPU-parallelizable part that becomes GPU-accelerated.]

Available on Tesla/Quadro with CC 3.5+ (e.g. K20, K40, K80, M40, …)

SLIDE 40

PROCESSES SHARING GPU WITHOUT MPS

No Overlap

[Diagram: Process A and Process B each own a separate GPU context (Context A, Context B); their work executes on the GPU one context at a time.]

SLIDE 41

PROCESSES SHARING GPU WITHOUT MPS

Context Switch Overhead

[Diagram: time-sliced use of the GPU, with a context switch between the processes' time slices.]

SLIDE 42

PROCESSES SHARING GPU WITH MPS

Maximum Overlap

[Diagram: Process A and Process B submit work through the MPS process; kernels from both processes run concurrently on the GPU under a shared context.]

SLIDE 43

PROCESSES SHARING GPU WITH MPS

No Context Switch Overhead

SLIDE 44

HYPER-Q/MPS CASE STUDY: UMT

- Enables overlap between copy and compute of different processes
- GPU sharing between MPI ranks increases utilization

SLIDE 45

HYPER-Q/MPS CASE STUDIES

CPU Scaling Speedup

[Chart: speedup vs. 1 rank/GPU (1x-6x) for HACC, MP2C, VASP, ENZO, and UMT when scaling the number of CPU ranks.]

SLIDE 46

HYPER-Q/MPS CASE STUDIES

Additional Speedup with MPS

[Chart: same applications (HACC, MP2C, VASP, ENZO, UMT); bars split into the CPU scaling speedup and the additional overlap/MPS speedup, relative to 1 rank/GPU.]

SLIDE 47

USING MPS

- No application modifications necessary
- Not limited to MPI applications
- The MPS control daemon spawns the MPS server upon CUDA application startup

#Typical Setup
nvidia-smi -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d

#On Cray XK/XC systems
export CRAY_CUDA_MPS=1
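A typical teardown, mirroring the setup above (the quit command is the documented way to stop the MPS control daemon; resetting the compute mode to DEFAULT is an assumption about the desired end state):

#Typical Teardown
echo quit | nvidia-cuda-mps-control
nvidia-smi -c DEFAULT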

SLIDE 48

MPS SUMMARY

- Easy path to get GPU acceleration for legacy applications
- Enables overlapping of memory copies and compute between different MPI ranks
- Remark: MPS adds some overhead!

SLIDE 49

DEBUGGING AND PROFILING

SLIDE 50

TOOLS FOR MPI+CUDA APPLICATIONS

- Memory checking: cuda-memcheck
- Debugging: cuda-gdb
- Profiling: nvprof and the NVIDIA Visual Profiler (nvvp)

SLIDE 51

MEMORY CHECKING WITH CUDA-MEMCHECK

- cuda-memcheck is a tool similar to Valgrind's memcheck
- It can be used in an MPI environment:
  mpiexec -np 2 cuda-memcheck ./myapp <args>
- Problem: the output of different processes is interleaved
- Solution: use the --save or --log-file command line options

mpirun -np 2 cuda-memcheck \
    --log-file name.%q{OMPI_COMM_WORLD_RANK}.log \
    --save name.%q{OMPI_COMM_WORLD_RANK}.memcheck \
    ./myapp <args>

OpenMPI: OMPI_COMM_WORLD_RANK
MVAPICH2: MV2_COMM_WORLD_RANK

SLIDE 52

MEMORY CHECKING WITH CUDA-MEMCHECK

SLIDE 53

MEMORY CHECKING WITH CUDA-MEMCHECK

Read Output Files with cuda-memcheck --read
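For example, to inspect the file rank 0 wrote with the --save pattern from Slide 51 (the file name here simply follows that pattern):

cuda-memcheck --read name.0.memcheck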

SLIDE 54

DEBUGGING MPI+CUDA APPLICATIONS

Using cuda-gdb with MPI Applications

- Use cuda-gdb just like gdb
- For smaller applications, just launch xterms and cuda-gdb:

mpiexec -x -np 2 xterm -e cuda-gdb ./myapp <args>

SLIDE 55

DEBUGGING MPI+CUDA APPLICATIONS

cuda-gdb Attach

if ( rank == 0 ) {
    int i=0;
    printf("rank %d: pid %d on %s ready for attach\n", rank, getpid(), name);
    while (0 == i) { sleep(5); }
}

> mpiexec -np 2 ./jacobi_mpi+cuda
Jacobi relaxation Calculation: 4096 x 4096 mesh with 2 processes and one
Tesla M2070 for each process (2049 rows per process).
rank 0: pid 30034 on judge107 ready for attach
> ssh judge107
jkraus@judge107:~> cuda-gdb --pid 30034

SLIDE 56

DEBUGGING MPI+CUDA APPLICATIONS

CUDA_DEVICE_WAITS_ON_EXCEPTION

SLIDE 57

DEBUGGING MPI+CUDA APPLICATIONS

With CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1, core dumps are generated in case of an exception:

- Can be used for offline debugging
- Helpful if live debugging is not possible, e.g. when too many nodes are needed to reproduce

- CUDA_ENABLE_CPU_COREDUMP_ON_EXCEPTION: enable/disable the CPU part of the core dump (enabled by default)
- CUDA_COREDUMP_FILE: specify the name of the core dump file

Open GPU core: (cuda-gdb) target cudacore core.cuda
Open CPU+GPU cores: (cuda-gdb) target core core.cpu core.cuda
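Putting this together, a minimal offline-debugging session might look like the following sketch (the application name and two-rank launch are placeholders; core.cuda matches the CUDA_COREDUMP_FILE value used above):

export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
export CUDA_COREDUMP_FILE=core.cuda
mpirun -np 2 ./myapp <args>     # a GPU exception now writes core.cuda
cuda-gdb ./myapp
(cuda-gdb) target cudacore core.cuda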

SLIDE 58

DEBUGGING MPI+CUDA APPLICATIONS

CUDA_ENABLE_COREDUMP_ON_EXCEPTION


SLIDE 60

DEBUGGING MPI+CUDA APPLICATIONS

Third Party Tools

- Allinea DDT debugger
- Rogue Wave TotalView

SLIDE 61

PROFILING MPI+CUDA APPLICATIONS

Using nvprof+NVVP

Embed the MPI rank in the output filename, process name, and context name (new with CUDA 7.5):

mpirun -np $np nvprof --output-profile profile.%q{OMPI_COMM_WORLD_RANK} \
    --process-name "rank %q{OMPI_COMM_WORLD_RANK}" \
    --context-name "rank %q{OMPI_COMM_WORLD_RANK}"

Alternatives:
- Only save the textual output (--log-file)
- Collect data from all processes that run on a node (--profile-all-processes)

OpenMPI: OMPI_COMM_WORLD_RANK
MVAPICH2: MV2_COMM_WORLD_RANK
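A sketch of the second alternative (the output file name and job size are illustrative): start nvprof once on the node in its own shell, then launch the MPI job as usual, and nvprof attaches to every CUDA process on that node.

# Shell 1, on the compute node (%p expands to the process id):
nvprof --profile-all-processes -o jacobi.%p.nvprof

# Shell 2: launch normally
mpirun -np 2 ./jacobi_mpi+cuda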

SLIDE 62

PROFILING MPI+CUDA APPLICATIONS

Using nvprof+NVVP

SLIDE 63

PROFILING MPI+CUDA APPLICATIONS

Using nvprof+NVVP

nvvp jacobi.*.nvprof

Or use the import wizard.

SLIDE 64

PROFILING MPI+CUDA APPLICATIONS

Third Party Tools

Multiple parallel profiling tools are CUDA-aware:

- Score-P
- Vampir
- Tau

These tools are good for discovering MPI issues as well as basic CUDA performance inhibitors.

SLIDE 65

ADVANCED MPI ON GPUS

SLIDE 66

BEST PRACTICE: USE NON-BLOCKING MPI

BLOCKING:

#pragma acc host_data use_device ( u_new )
{
    MPI_Sendrecv(u_new+offset_first_row, m-2, MPI_DOUBLE, t_nb, 0,
                 u_new+offset_bottom_boundary, m-2, MPI_DOUBLE, b_nb, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(u_new+offset_last_row, m-2, MPI_DOUBLE, b_nb, 1,
                 u_new+offset_top_boundary, m-2, MPI_DOUBLE, t_nb, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

NON-BLOCKING:

MPI_Request t_b_req[4];
#pragma acc host_data use_device ( u_new )
{
    MPI_Irecv(u_new+offset_top_boundary, m-2, MPI_DOUBLE, t_nb, 0, MPI_COMM_WORLD, t_b_req);
    MPI_Irecv(u_new+offset_bottom_boundary, m-2, MPI_DOUBLE, b_nb, 1, MPI_COMM_WORLD, t_b_req+1);
    MPI_Isend(u_new+offset_last_row, m-2, MPI_DOUBLE, b_nb, 0, MPI_COMM_WORLD, t_b_req+2);
    MPI_Isend(u_new+offset_first_row, m-2, MPI_DOUBLE, t_nb, 1, MPI_COMM_WORLD, t_b_req+3);
}
MPI_Waitall(4, t_b_req, MPI_STATUSES_IGNORE);

SLIDE 67

COMMUNICATION + COMPUTATION OVERLAP

MVAPICH2 2.0b - 8 Tesla K20X - FDR IB

[Chart: runtime (s) without overlap vs. the ideal runtime, for local problem sizes 4096x4096, 2048x2048, and 1024x1024.]

SLIDE 68

COMMUNICATION + COMPUTATION OVERLAP

No overlap: process the whole domain, then do the MPI exchange.

Overlap: process the boundary domain first, start the MPI exchange, and process the inner domain while the exchange is in flight. Only the boundary is a dependency of the communication, so boundary and inner-domain processing can overlap; hiding the MPI time behind the inner-domain work is the possible gain.

SLIDE 69

COMMUNICATION + COMPUTATION OVERLAP

CUDA with Streams

process_boundary_and_pack<<<gs_b,bs_b,0,s1>>>(u_new_d, u_d, to_left_d, to_right_d, n, m);
process_inner_domain<<<gs_id,bs_id,0,s2>>>(u_new_d, u_d, to_left_d, to_right_d, n, m);

cudaStreamSynchronize(s1); //wait for boundary

MPI_Request req[8];
//Exchange halo with left, right, top and bottom neighbor
MPI_Waitall(8, req, MPI_STATUSES_IGNORE);

unpack<<<gs_s,bs_s,0,s2>>>(u_new_d, from_left_d, from_right_d, n, m);
cudaDeviceSynchronize(); //wait for iteration to finish

SLIDE 70

COMMUNICATION + COMPUTATION OVERLAP

OpenACC with Async Queues

#pragma acc parallel loop present ( u_new, u, to_left, to_right ) async(1)
for ( ... ) //Process boundary and pack to_left and to_right

#pragma acc parallel loop present ( u_new, u ) async(2)
for ( ... ) //Process inner domain

#pragma acc wait(1) //wait for boundary

MPI_Request req[8];
#pragma acc host_data use_device ( from_left, to_left, from_right, to_right, u_new )
{
    //Exchange halo with left, right, top and bottom neighbor
}
MPI_Waitall(8, req, MPI_STATUSES_IGNORE);

#pragma acc parallel loop present ( u_new, from_left, from_right ) async(2)
for ( ... ) //unpack from_left and from_right

#pragma acc wait //wait for iteration to finish

SLIDE 71

COMMUNICATION + COMPUTATION OVERLAP

MVAPICH2 2.0b - 8 Tesla K20X - FDR IB

[Chart: runtime (s) with and without overlap, plus the resulting speedup (overlap vs. no overlap, 0.2-1.4), for local problem sizes 4096x4096, 2048x2048, and 1024x1024.]

SLIDE 72

HIGH PRIORITY STREAMS

- Improve scalability with high priority streams
- cudaStreamCreateWithPriority
- Use-case: MD simulations

[Diagram: Stream 1 computes local forces while Stream 2 exchanges non-local atom positions, computes non-local forces, and exchanges non-local forces; making Stream 2 high priority (HP) and Stream 1 low priority (LP) lets the communication-critical work start earlier, which is the possible gain.]
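A minimal sketch of creating such a stream pair (variable names are illustrative; note that the numerically smallest value returned is the greatest priority):

int least, greatest;
cudaDeviceGetStreamPriorityRange(&least, &greatest);

cudaStream_t s_low, s_high;
cudaStreamCreateWithPriority(&s_low,  cudaStreamNonBlocking, least);    /* e.g. local forces */
cudaStreamCreateWithPriority(&s_high, cudaStreamNonBlocking, greatest); /* e.g. non-local path */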

SLIDE 73

MPI AND UNIFIED MEMORY

On Kepler and Maxwell

Using Unified Memory with CUDA-aware MPI needs explicit support from the MPI implementation:

- Check with your MPI implementation of choice for its support
- Unified Memory is supported in OpenMPI since 1.8.5 and MVAPICH2-GDR since 2.2b

Unified Memory and regular (non CUDA-aware) MPI requires an unmanaged staging buffer:

- Regular MPI has no knowledge of Unified Memory
- Unified Memory does not play well with RDMA protocols
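A minimal sketch of such a staging buffer (function and buffer names are illustrative assumptions; assumes mpi.h, cuda_runtime.h, and stdlib.h are included): copy from the managed allocation into ordinary host memory before handing it to regular MPI.

/* Sketch: send n doubles from a managed buffer via regular (non CUDA-aware) MPI */
void send_managed(const double *u_m, int n, int dest, int tag)
{
    double *staging = (double *)malloc(n * sizeof(double)); /* unmanaged host buffer */
    /* cudaMemcpyDefault infers the direction thanks to UVA */
    cudaMemcpy(staging, u_m, n * sizeof(double), cudaMemcpyDefault);
    MPI_Send(staging, n, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
    free(staging);
}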

SLIDE 74

HANDLING MULTI GPU NODES

GPU-affinity

Use the local rank:

int local_rank = ...; //determine local rank
int num_devices = 0;
cudaGetDeviceCount(&num_devices);
cudaSetDevice(local_rank % num_devices);

Alternative: exclusive process mode

SLIDE 75

HANDLING MULTI GPU NODES

How to determine the local rank?

Rely on process placement (with one rank per GPU):

int rank = 0;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int num_devices = 0;
cudaGetDeviceCount(&num_devices); // num_devices == ranks per node
int local_rank = rank % num_devices;

Use environment variables provided by the MPI launcher:

OpenMPI:
int local_rank = atoi(getenv("OMPI_COMM_WORLD_LOCAL_RANK"));

MVAPICH2:
int local_rank = atoi(getenv("MV2_COMM_WORLD_LOCAL_RANK"));
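The two approaches combine naturally; a sketch (an assumption, not code from the deck; needs stdlib.h for getenv/atoi, and the fallback assumes one rank per GPU):

int num_devices = 1;
cudaGetDeviceCount(&num_devices);

int local_rank = 0;
const char *env = getenv("OMPI_COMM_WORLD_LOCAL_RANK");   /* OpenMPI */
if (!env) env = getenv("MV2_COMM_WORLD_LOCAL_RANK");      /* MVAPICH2 */
if (env) {
    local_rank = atoi(env);
} else { /* fall back to process placement */
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local_rank = rank % num_devices;
}
cudaSetDevice(local_rank % num_devices);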

SLIDE 76

DETECTING CUDA-AWARENESS

OpenMPI (since 2.0.0):

- Macro for compile-time decisions: MPIX_CUDA_AWARE_SUPPORT
- Function for runtime decisions: MPIX_Query_cuda_support()
- Include mpi-ext.h for both

See http://www.open-mpi.org/faq/?category=runcuda#mpi-cuda-aware-support
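A sketch combining both checks (the helper name and the printf are illustrative; OPEN_MPI is defined by OpenMPI's mpi.h):

#include <stdio.h>
#include <mpi.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h> /* provides the MPIX_ symbols on OpenMPI */
#endif

int cuda_aware(void)
{
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    return MPIX_Query_cuda_support(); /* runtime answer */
#else
    printf("This MPI library cannot report CUDA-awareness\n");
    return 0; /* be conservative: stage through host memory */
#endif
}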

slide-77
SLIDE 77

April 4-7, 2016 | Silicon Valley

THANK YOU

JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join

H6110 - Multi-GPU Programming: GPUDirect and MPI Tuesday 04/05, 15:00 - 16:00, Pod C S6411 - MVAPICH2-GDR: Pushing the Frontier of Designing MPI Libraries Enabling GPUDirect Tech. Wednesday 04/06, 14:30 - 14:55, Room 211A H6131 - Hangout: Open MPI Libraries Wednesday 04/06, 16:00 - 17:00, Pod A