Synchronization and Ordering Semantics in Hybrid MPI+GPU Programming



SLIDE 1

Synchronization and Ordering Semantics in Hybrid MPI+GPU Programming

Ashwin M. Aji (Ph.D. Candidate), Wu-chun Feng (Virginia Tech); Pavan Balaji, James Dinan, Rajeev Thakur (Argonne National Lab.)

SLIDE 2

Data Movement in CPU-GPU Clusters

SLIDE 3

Data Movement in CPU-GPU Clusters

[Figure: two cluster nodes, each with CPU main memory, connected by a network]

SLIDE 4

Data Movement in CPU-GPU Clusters

[Figure: each node now adds GPU device memory alongside CPU main memory]

SLIDE 5

Data Movement in CPU-GPU Clusters

[Figure: MPI Rank 0 and MPI Rank 1, each node with GPU device memory and CPU main memory, connected by a network]

SLIDE 6

Data Movement in CPU-GPU Clusters


if (rank == 0) {
}
if (rank == 1) {
}

SLIDE 7

Data Movement in CPU-GPU Clusters


if (rank == 0) {
    GPUMemcpy(host_buf, dev_buf, D2H);
}
if (rank == 1) {
}

SLIDE 8

Data Movement in CPU-GPU Clusters


if (rank == 0) {
    GPUMemcpy(host_buf, dev_buf, D2H);
    MPI_Send(host_buf, ...);
}
if (rank == 1) {
    MPI_Recv(host_buf, ...);
    GPUMemcpy(dev_buf, host_buf, H2D);
}

SLIDE 9

Data Movement in CPU-GPU Clusters (Pipelined)


SLIDE 10

Data Movement in CPU-GPU Clusters (Pipelined)


SLIDE 11

Data Movement in CPU-GPU Clusters (Pipelined)


int processed[chunks] = {0};
for (j = 0; j < chunks; j++) {
    /* Pipeline: asynchronously stage each chunk from GPU to host */
    cudaMemcpyAsync(host_buf+offset, gpu_buf+offset, D2H, streams[j], ...);
}
numProcessed = 0; j = 0; flag = 1;
while (numProcessed < chunks) {
    if (cudaStreamQuery(streams[j]) == cudaSuccess) {   /* chunk staged; start MPI */
        MPI_Isend(host_buf+offset, ...);
        numProcessed++;
        processed[j] = 1;
    }
    MPI_Testany(...);               /* check MPI progress */
    if (numProcessed < chunks)      /* advance to the next unprocessed chunk */
        while (flag) { j = (j+1) % chunks; flag = processed[j]; }
}
MPI_Waitall();

SLIDE 12

Data Movement in CPU-GPU Clusters (Pipelined)



  • Performance vs. Productivity tradeoff
  • Multiple optimizations for different…
    – …GPUs (AMD/Intel/NVIDIA)
    – …programming models (CUDA/OpenCL)
    – …library versions (CUDA v3/CUDA v4)
SLIDE 13

GPU-integrated MPI


SLIDE 14

GPU-integrated MPI


SLIDE 15

GPU-integrated MPI


if (rank == 0) {
    MPI_Send(any_buf, ...);
}
if (rank == 1) {
    MPI_Recv(any_buf, ...);
}


  • Examples: MPI-ACC, MVAPICH, Open MPI
  • Programmability/Productivity: multiple accelerators and programming models (CUDA, OpenCL)
  • Performance: system-specific and vendor-specific optimizations (pipelining, GPUDirect, pinned host memory, IOH affinity)

SLIDE 16

Need for Synchronization


foo_blocking(buf_1);     /* No sync needed */
bar(buf_1);

foo_nonblocking(buf_1);  /* Sync needed */
bar(buf_1);

foo_nonblocking(buf_1);  /* No sync needed */
bar(buf_2);
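The same rules apply to GPU data movement. Below is a minimal illustrative sketch (not from the original slides) using the CUDA runtime API; the buffer names, the size, and the use() helper are hypothetical placeholders.

float *host_buf, *other_buf, *dev_buf;
size_t size = 1 << 20;
cudaStream_t stream;
cudaMallocHost((void **)&host_buf, size);
cudaMallocHost((void **)&other_buf, size);
cudaMalloc((void **)&dev_buf, size);
cudaStreamCreate(&stream);

/* Blocking copy: host_buf is valid as soon as the call returns. */
cudaMemcpy(host_buf, dev_buf, size, cudaMemcpyDeviceToHost);
use(host_buf);                      /* no sync needed */

/* Asynchronous copy into the same buffer: explicit sync is needed. */
cudaMemcpyAsync(host_buf, dev_buf, size, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);      /* sync needed */
use(host_buf);

/* Asynchronous copy into a different buffer: no dependency, no sync. */
cudaMemcpyAsync(other_buf, dev_buf, size, cudaMemcpyDeviceToHost, stream);
use(host_buf);                      /* independent of the in-flight copy */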

SLIDE 17

Need for Synchronization in GPU-integrated MPI


if (rank == 0) {
    MPI_Send(any_buf, ...);
}
if (rank == 1) {
    MPI_Recv(any_buf, ...);
}

SLIDE 18

Need for Synchronization in GPU-integrated MPI


if (rank == 0) {
    GPUExec(any_buf, ...);
    MPI_Isend(other_buf, ...);
    MPI_Send(any_buf, ...);
    GPUMemcpy(any_buf, H2D);
    MPI_Waitall();
}
if (rank == 1) {
    GPUMemcpy(other_buf, H2D);
    GPUExec(other_buf, ...);
    MPI_Recv(any_buf, ...);
    GPUMemcpy(any_buf, H2D);
    MPI_Isend(other_buf, ...);
}

SLIDE 19

Need for Synchronization in GPU-integrated MPI



  • Interleaved MPI and GPU operations
  • Dependent vs. Independent
  • Blocking vs. Non-blocking
SLIDE 20

Rest of this talk…

[Diagram: GPU-Integrated MPI → UVA-based and Attribute-based designs]

  • With interleaved MPI and GPU operations, what are (or should be) the synchronization semantics?
    – Explicit
    – Implicit
  • How can the synchronization semantics affect performance and productivity?

SLIDE 21

GPU-integrated MPI: UVA-based Design

  • What is UVA?
    – Unified Virtual Addressing


[Figure: default case with multiple memory spaces vs. UVA with a single address space]

Source: Peer-to-Peer & Unified Virtual Addressing CUDA Webinar http://on-demand.gputechconf.com/gtc-express/2011/presentations/cuda_webinars_GPUDirect_uva.pdf

SLIDE 22

GPU-integrated MPI: UVA-based Design

  • void * for CPU or GPU0 or GPU1 or GPUn
  • cuPointerGetAttribute queries for the buffer type (CPU or GPUi); a rough usage sketch follows this list
  • Exclusive to CUDA v4.0+
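For illustration only (this helper is not part of the slides), a GPU-integrated MPI library could classify a void * argument roughly as follows using the CUDA driver API; is_device_pointer is a hypothetical name.

#include <stdint.h>
#include <cuda.h>

/* Returns 1 if buf resides in GPU device memory, 0 if it is host memory.
 * Relies on UVA (CUDA 4.0+); non-CUDA host pointers make the query fail. */
static int is_device_pointer(const void *buf)
{
    CUmemorytype mem_type = CU_MEMORYTYPE_HOST;
    CUresult err = cuPointerGetAttribute(&mem_type,
                                         CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                                         (CUdeviceptr)(uintptr_t)buf);
    if (err != CUDA_SUCCESS)
        return 0;                   /* unknown to CUDA: treat as host memory */
    return mem_type == CU_MEMORYTYPE_DEVICE;
}

An MPI_Send implementation could run such a check on the user buffer and, when it returns 1, switch to a pipelined device-to-host staging path without any change to the MPI interface.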

SLIDE 23

GPU-integrated MPI: UVA-based Design

    – buf: CPU or GPU buffer
    – The MPI implementation can perform pipelining if it is a GPU buffer
    – No change to the MPI standard


if (my_rank == sender) {       /* send from GPU (CUDA) */
    MPI_Send(dev_buf, ...);
} else {                       /* receive into host */
    MPI_Recv(host_buf, ...);
}


MPI_Send((void *)buf, count, MPI_CHAR, dest, tag, MPI_COMM_WORLD);

SLIDE 24

MPI_Send((void *)buf, count, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
MPI_Send((void *)buf, count, new_type, dest, tag, MPI_COMM_WORLD);

GPU-integrated MPI: Attribute-based Design

– new_type: add attributes to inbuilt datatypes (MPI_INT, MPI_CHAR)


double *cuda_dev_buf;
cl_mem ocl_dev_buf;
/* initialize a custom type */
MPI_Type_dup(MPI_CHAR, &type);
if (my_rank == sender) {       /* send from GPU (CUDA) */
    MPI_Type_set_attr(type, BUF_TYPE, BUF_TYPE_CUDA);
    MPI_Send(cuda_dev_buf, type, ...);
} else {                       /* receive into GPU (OpenCL) */
    MPI_Type_set_attr(type, BUF_TYPE, BUF_TYPE_OPENCL);
    MPI_Recv(ocl_dev_buf, type, ...);
}
MPI_Type_free(&type);

SLIDE 25

GPU-integrated MPI: Attribute-based Design

  • Extend inbuilt datatypes: MPI_INT, MPI_CHAR, etc.
  • Compatible with MPI
    – Needs more setup code than the UVA-based design
  • Compatible with many accelerator models
  • Highly extensible; lots of GPU-specific information can be passed into MPI


MPI_Type_dup(MPI_CHAR, &type);
MPI_Type_set_attr(type, BUF_TYPE, BUF_TYPE_CUDA);
MPI_Type_set_attr(type, DEVICE_ID, 1);
MPI_Type_set_attr(type, STREAM_OBJ, stream);
MPI_Type_set_attr(type, EVENT_OBJ, event);
...
MPI_Send(cuda_dev_buf, type, ...);

SLIDE 26

Examples of GPU-Integrated MPI

[Diagram: GPU-Integrated MPI → UVA-based (MVAPICH, Open MPI) and Attribute-based (MPI-ACC)]

SLIDE 27

Outline

  • GPU-Integrated MPI Frameworks
    – UVA-based
    – Attribute-based
  • Synchronization Semantics in GPU-Integrated MPI Programs
  • Evaluating the Impact of Synchronization Semantics
  • Conclusion

SLIDE 28

Synchronization Semantics in MPI


Blocking point-to-point communication (no synchronization needed):

MPI_Send(s_buf, dest, ...);
MPI_Recv(r_buf, source, ...);
foo(r_buf);

Non-blocking point-to-point communication (explicit synchronization only):

MPI_Request requests[2];
MPI_Isend(s_buf, &requests[0], ...);
MPI_Irecv(r_buf, &requests[1], ...);
MPI_Wait(&requests[0]);   // OR
MPI_Test(&requests[1]);   // OR
MPI_Waitall(requests);
foo(r_buf);

SLIDE 29

Synchronization Semantics in GPUs

  • What are Streams?
    – Workflow abstractions
    – Mechanism to enqueue GPU commands
      • H2DMemcpy, D2HMemcpy, KernelExec

SLIDE 30

Synchronization Semantics in GPUs


[Figure: timeline of H2D, Kernel, and D2H commands on stream 0 and stream 1]

H2DMemcpy(dbuf0, stream0);
H2DMemcpy(dbuf1, stream1);
KernelExec(dbuf0, stream0);
KernelExec(dbuf1, stream1);
D2HMemcpy(dbuf0, stream0);
D2HMemcpy(dbuf1, stream1, &event1);
WaitForEvent(stream0, event1);   /* explicit synchronization across streams */
KernelExec(dbuf0, stream0);      /* implicit synchronization within stream0 */

SLIDE 31

Synchronization Semantics in GPUs

  • What are Streams?
    – Workflow abstractions
    – Mechanism to enqueue GPU commands
      • H2DMemcpy, D2HMemcpy, KernelExec
    – Enforces a dependency between two commands (executed in issue order)
    – Implicit synchronization within the same stream
    – Explicit synchronization across streams (see the CUDA sketch after this list)
    – CUDA: stream; OpenCL: command queue
  • In this talk, we assume that all GPU commands are non-blocking
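For concreteness, the pseudo-calls used in these slides map roughly onto the CUDA runtime API as sketched below (my mapping, not the authors'); the kernel, launch configuration, buffers, and sizes are assumed to be defined elsewhere.

cudaEvent_t event1;
cudaEventCreate(&event1);

cudaMemcpyAsync(dbuf0, hbuf0, size, cudaMemcpyHostToDevice, stream0);   /* H2DMemcpy */
cudaMemcpyAsync(dbuf1, hbuf1, size, cudaMemcpyHostToDevice, stream1);
kernel<<<grid, block, 0, stream0>>>(dbuf0);                             /* KernelExec */
kernel<<<grid, block, 0, stream1>>>(dbuf1);
cudaMemcpyAsync(hbuf0, dbuf0, size, cudaMemcpyDeviceToHost, stream0);   /* D2HMemcpy */
cudaMemcpyAsync(hbuf1, dbuf1, size, cudaMemcpyDeviceToHost, stream1);
cudaEventRecord(event1, stream1);        /* completes when stream1's copy finishes */

/* Explicit synchronization across streams: stream0 waits for event1. */
cudaStreamWaitEvent(stream0, event1, 0);
/* Implicit synchronization within stream0: issue order is execution order. */
kernel<<<grid, block, 0, stream0>>>(dbuf0);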

SLIDE 32

Synchronization Semantics in GPU-integrated MPI


MPI → GPU:

MPI_Recv(r_dev_buf);               /* No sync needed before GPU */
KernelExec(r_dev_buf, stream1);

MPI_Irecv(r_dev_buf);              /* Explicit sync before GPU (per the MPI standard) */
MPI_Wait();
KernelExec(r_dev_buf, stream1);

GPU → MPI, UVA-based:

KernelExec(s_dev_buf, stream1);    /* No sync needed before MPI (independent buffer) */
MPI_Send(some_other_dev_buf);

KernelExec(s_dev_buf, stream1);    /* Explicit sync before MPI */
StreamSynchronize(stream1);
MPI_Send(s_dev_buf);

GPU → MPI, Attribute-based:

KernelExec(s_dev_buf, stream1);    /* No sync needed before MPI (independent buffer) */
MPI_Send(some_other_dev_buf);

KernelExec(s_dev_buf, stream1);    /* Implicit sync using attributes */
MPI_Type_set_attr(type, STREAM, stream1);
MPI_Send(s_dev_buf, type);

SLIDE 33

Outline

  • GPU-Integrated MPI Frameworks
    – UVA-based
    – Attribute-based
  • Synchronization Semantics in GPU-Integrated MPI Programs
  • Evaluating the Impact of Synchronization Semantics
    – Programmability (Implicit vs. Explicit)
    – Performance
  • Conclusion

SLIDE 34

Example User Scenario

  • MPI program with 3 processes
  • Rank 0: sender; Rank 1 and Rank 2: receivers
  • 1. MPI rank 0 launches two GPU kernels to compute a large array
    – GPU kernels can occur concurrently
  • ----- dependency -----
  • 2. MPI rank 0 scatters the large array to the 2 receiver processes (via send/receive)
    – MPI Send/Receive occurs sequentially in some order

SLIDE 35

UVA-based Design (Simple)


cudaStream_t myStream[N];
for (rank = 1; rank < N; rank++) {
    fooKernel<<< myStream[rank] >>>(dev_buf + offset);
}
for (rank = 1; rank < N; rank++) {
    /* explicit GPU stream sync before MPI */
    cudaStreamSynchronize(myStream[rank]);
    /* GPU-integrated MPI */
    MPI_Send(dev_buf + offset, rank, ...);
}


  • Issue Order: K1-K2-MPI1-MPI2
  • Execution Order: K1-K2-MPI1-MPI2
SLIDE 36

UVA-based Design (Simple)



[Figure: execution timeline (Kernel, MPI_Send) for dest rank = 1 and dest rank = 2; potential for overlap left unexploited]

SLIDE 37

UVA-based Design (Simple: Issue-order Progress)



  • Issue-order progress
  • Explicit synchronization between GPU and MPI
  • Poor performance, but reasonable programmability

SLIDE 38

Completion-order Progress

[Figure: execution timeline (Kernel, MPI_Send) for dest rank = 1 and dest rank = 2; MPI transfers overlapped with kernel execution]

  • Issue Order: K1-K2-MPI1-MPI2
  • Execution Order: K1-K2-MPI2-MPI1
SLIDE 39

UVA-based Design (Advanced: Completion-order Progress)

cudaStream_t myStream[N];
int processed[N] = {1, 0};
for (rank = 1; rank < N; rank++) {
    fooKernel<<<myStream[rank]>>>(dev_buf + offset);
}
numProcessed = 0; rank = 1;
while (numProcessed < N - 1) {
    /* explicit GPU stream query before MPI */
    if (cudaStreamQuery(myStream[rank]) == cudaSuccess) {
        /* GPU-integrated MPI */
        MPI_Isend(dev_buf + offset, rank, ...);
        numProcessed++;
        processed[rank] = 1;
    }
    MPI_Testany(...);             /* check MPI progress */
    flag = 1;
    if (numProcessed < N - 1)     /* find next rank */
        while (flag) { rank = (rank + 1) % N; flag = processed[rank]; }
}
MPI_Waitall();

SLIDE 40


UVA-based Design (Advanced: Completion-order Progress)


  • Completion-order progress
  • Explicit synchronization between GPU and MPI
  • Good performance, but poor programmability

SLIDE 41


Ideal Programmability with Performance


Ideally, the entire advanced code above reduces to a single call per destination: MPI_Isend(dev_buf+offset, ...)

SLIDE 42

Attribute-based Design: Completion-order Progress


cudaStream_t myStream[N];
for (rank = 1; rank < N; rank++) {
    fooKernel<<<b, t, myStream[rank]>>>(dev_buf + offset);
    /* implicit GPU stream sync before MPI */
    MPI_Type_dup(MPI_CHAR, &new_type);
    MPI_Type_set_attr(new_type, BUF_TYPE, BUF_TYPE_CUDA);
    MPI_Type_set_attr(new_type, STREAM_TYPE, myStream[rank]);
    MPI_Isend(dev_buf + offset, new_type, rank, ...);
    MPI_Type_free(&new_type);
}
/* explicit MPI sync (as per the MPI standard) */
MPI_Waitall();

  • MPI-ACC performs completion-order progress
  • Implicit synchronization between GPU and MPI
  • Good performance and good programmability

SLIDE 43

MPI-ACC’s Implementation of Completion-order Progress


[Diagram: MPI_Isend creates a GPU stream request object in a GPU stream request pool; on request completion, the progress engine loop (1) performs the handshake protocol and (2) sends the payload data, staged through a CPU buffer using multiple GPU streams]
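To make the diagram concrete, here is a much-simplified sketch of a completion-order progress loop of the kind the diagram describes; it is my illustration, not MPI-ACC source code, and request_t, pool, and issue_send_pipelined are hypothetical names.

typedef struct {
    cudaStream_t stream;     /* GPU stream attached to the send via the datatype attribute */
    void        *dev_buf;    /* device source buffer */
    int          active;     /* 1 while waiting for the stream to drain */
} request_t;

/* One pass of the progress engine: start the MPI transfer for every request
 * whose GPU stream has completed, regardless of the order it was issued in. */
static void progress_engine_poll(request_t *pool, int n)
{
    for (int i = 0; i < n; i++) {
        if (pool[i].active &&
            cudaStreamQuery(pool[i].stream) == cudaSuccess) {
            issue_send_pipelined(&pool[i]);   /* handshake + staged payload transfer */
            pool[i].active = 0;
        }
    }
}

The library would invoke such a poll from its regular progress engine (for example, while the application sits in MPI_Waitall), which is how completion-order progress is obtained without the user writing the polling loop by hand.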

SLIDE 44

Outline

  • GPU-Integrated MPI Frameworks
    – UVA-based
    – Attribute-based
  • Synchronization Semantics in GPU-Integrated MPI Programs
  • Evaluating the Impact of Synchronization Semantics
    – Programmability (Implicit vs. Explicit)
    – Performance (Issue-order vs. Completion-order progress)
  • Conclusion

SLIDE 45

Performance Evaluation


[Figure: timelines of K0-MPI0 and K1-MPI1 under issue-order vs. completion-order progress, for K0 < K1, K0 > K1, and K0 >> K1]

SLIDE 46

Performance Evaluation

  • Effect of kernel computation size
  • Effect of MPI data size


SLIDE 47

Performance Evaluation

  • Effect of kernel computation size
    – K0 > K1
    – Vary K0 and measure performance gains due to completion-order progress

[Chart: relative execution time (normalized to attribute-based) vs. variance in kernel computation time (K0-K1)/K1, comparing attribute-based, UVA-based (simple), and UVA-based (advanced); issue-order vs. completion-order regions marked]

SLIDE 48

Performance Evaluation

  • Effect of MPI data size
    – Fix K0 and K1, but K0 > K1
    – Vary data size of MPI1

[Chart: relative execution time (normalized to attribute-based) vs. ratio of data transfer sizes, comparing attribute-based, UVA-based (simple), and UVA-based (advanced); issue-order vs. completion-order regions marked]

SLIDE 49

Summary

  • GPU-Integrated MPI
    – UVA-based: explicit synchronization
      • Issue-order progress (poor performance)
      • Completion-order progress (poor programmability)
    – Attribute-based: implicit synchronization
      • Completion-order progress (good performance and programmability)

Contact: Ashwin Aji (aaji@cs.vt.edu), Pavan Balaji (balaji@mcs.anl.gov)