
Synchronization and Ordering Semantics in Hybrid MPI+GPU Programming - PowerPoint PPT Presentation



  1. Synchronization and Ordering Semantics in Hybrid MPI+GPU Programming. Ashwin M. Aji (Ph.D. Candidate), Wu-chun Feng (Virginia Tech); Pavan Balaji, James Dinan, Rajeev Thakur (Argonne National Laboratory). synergy.cs.vt.edu

  2. Data Movement in CPU-GPU Clusters

  3. Data Movement in CPU-GPU Clusters. [Diagram: two CPUs, each with its own main memory, connected by a network.]

  4. Data Movement in CPU-GPU Clusters. [Diagram: each CPU (main memory) now has an attached GPU with its own device memory; the nodes are connected by a network.]

  5. Data Movement in CPU-GPU Clusters. [Diagram: the two CPU+GPU nodes are labeled MPI Rank = 0 and MPI Rank = 1.]

  6. Data Movement in CPU-GPU Clusters. [Same diagram.] Code skeleton: if (rank == 0) { ... } if (rank == 1) { ... }

  7. Data Movement in CPU-GPU Clusters. [Same diagram.] if (rank == 0) { GPUMemcpy(host_buf, dev_buf, D2H); } if (rank == 1) { ... }

  8. Data Movement in CPU-GPU Clusters. [Same diagram.]
      if (rank == 0) {
          GPUMemcpy(host_buf, dev_buf, D2H);
          MPI_Send(host_buf, ...);
      }
      if (rank == 1) {
          MPI_Recv(host_buf, ...);
          GPUMemcpy(dev_buf, host_buf, H2D);
      }
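Spelled out with real API names, this manual staging pattern looks roughly as follows. This is a minimal sketch only: the slide's GPUMemcpy stands in for cudaMemcpy, and the buffer names, message size, tag, and omitted error handling are illustrative.

    /* Manual CPU-GPU data movement: stage through host memory, then use MPI.
       Assumes dev_buf on rank 0 was filled by a kernel and rank 1 wants the
       data in its own device memory. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void exchange(void *dev_buf, void *host_buf, size_t nbytes, int rank)
    {
        if (rank == 0) {
            /* GPU -> host staging copy, then send from host memory. */
            cudaMemcpy(host_buf, dev_buf, nbytes, cudaMemcpyDeviceToHost);
            MPI_Send(host_buf, (int)nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receive into host memory, then host -> GPU staging copy. */
            MPI_Recv(host_buf, (int)nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            cudaMemcpy(dev_buf, host_buf, nbytes, cudaMemcpyHostToDevice);
        }
    }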

  9. Data Movement in CPU-GPU Clusters (Pipelined). [Same two-node diagram.]

  10. Data Movement in CPU-GPU Clusters (Pipelined). [Same diagram; the transfer is shown split into pipelined chunks.]

  11. Data Movement in CPU-GPU Clusters (Pipelined). [Same diagram.] Sender-side pipelining code:
      int processed[chunks] = {0};
      for (j = 0; j < chunks; j++) { /* Pipeline */
          cudaMemcpyAsync(host_buf+offset, gpu_buf+offset, D2H, streams[j], ...);
      }
      numProcessed = 0; j = 0; flag = 1;
      while (numProcessed < chunks) {
          if (cudaStreamQuery(streams[j]) == cudaSuccess) { /* start MPI */
              MPI_Isend(host_buf+offset, ...);
              numProcessed++; processed[j] = 1;
          }
          MPI_Testany(...); /* check progress */
          if (numProcessed < chunks) /* next chunk */
              while (flag) { j = (j+1) % chunks; flag = processed[j]; }
      }
      MPI_Waitall();
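The slide's loop omits declarations and the offset bookkeeping. Below is a more self-contained sketch of the same pipelining idea, assuming per-chunk CUDA streams created by the caller, and using a simple poll-then-MPI_Waitall structure instead of the slide's MPI_Testany-based progress polling; chunk size, tags, and buffer setup are illustrative.

    /* Pipelined GPU->host copies overlapped with MPI sends: each chunk gets
       its own CUDA stream, and a chunk is sent as soon as its asynchronous
       D2H copy has completed. */
    #include <stdlib.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    void pipelined_send(char *gpu_buf, char *host_buf, size_t chunk_bytes,
                        int chunks, cudaStream_t *streams, int dest)
    {
        MPI_Request *reqs = (MPI_Request *)malloc(chunks * sizeof(MPI_Request));
        int *sent = (int *)calloc(chunks, sizeof(int));
        int num_sent = 0;

        /* Launch all D2H copies asynchronously, one stream per chunk. */
        for (int j = 0; j < chunks; j++) {
            size_t offset = (size_t)j * chunk_bytes;
            cudaMemcpyAsync(host_buf + offset, gpu_buf + offset, chunk_bytes,
                            cudaMemcpyDeviceToHost, streams[j]);
        }

        /* Poll the streams; start the MPI send for a chunk as soon as its
           copy is done. */
        while (num_sent < chunks) {
            for (int j = 0; j < chunks; j++) {
                if (!sent[j] && cudaStreamQuery(streams[j]) == cudaSuccess) {
                    size_t offset = (size_t)j * chunk_bytes;
                    MPI_Isend(host_buf + offset, (int)chunk_bytes, MPI_CHAR,
                              dest, j, MPI_COMM_WORLD, &reqs[j]);
                    sent[j] = 1;
                    num_sent++;
                }
            }
        }
        MPI_Waitall(chunks, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
        free(sent);
    }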

  12. Data Movement in CPU-GPU Clusters (Pipelined). [Same diagram.]
      • Performance vs. productivity tradeoff
      • Multiple optimizations for different…
        – …GPUs (AMD/Intel/NVIDIA)
        – …programming models (CUDA/OpenCL)
        – …library versions (CUDA v3/CUDA v4)

  13. GPU-integrated MPI. [Same two-node diagram.]

  14. GPU-integrated MPI. [Same diagram.]

  15. GPU-integrated MPI.
      • Examples: MPI-ACC, MVAPICH, Open MPI
      • Programmability/productivity: multiple accelerators and programming models (CUDA, OpenCL)
      • Performance: system-specific and vendor-specific optimizations (pipelining, GPUDirect, pinned host memory, IOH affinity)
      if (rank == 0) {
          MPI_Send(any_buf, ...);
      }
      if (rank == 1) {
          MPI_Recv(any_buf, ...);
      }

  16. Need for Synchronization.
      foo_blocking(buf_1);    /* No Sync Needed */
      bar(buf_1);

      foo_nonblocking(buf_1); /* Sync Needed */
      bar(buf_1);

      foo_nonblocking(buf_1); /* No Sync Needed */
      bar(buf_2);
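Mapped onto concrete CUDA and MPI calls, the same three cases look like this. This is an illustrative sketch, not the slide's notation: the buffers, sizes, stream, and tags are assumed to be set up by the caller.

    #include <mpi.h>
    #include <cuda_runtime.h>

    void sync_examples(char *host_buf, char *other_buf, char *dev_buf,
                       size_t nbytes, int dest, cudaStream_t stream)
    {
        /* Blocking copy: host_buf is valid as soon as the call returns,
           so the following MPI_Send needs no extra synchronization. */
        cudaMemcpy(host_buf, dev_buf, nbytes, cudaMemcpyDeviceToHost);
        MPI_Send(host_buf, (int)nbytes, MPI_CHAR, dest, 0, MPI_COMM_WORLD);

        /* Non-blocking copy into host_buf: synchronization IS needed before
           host_buf can be handed to MPI. */
        cudaMemcpyAsync(host_buf, dev_buf, nbytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);
        MPI_Send(host_buf, (int)nbytes, MPI_CHAR, dest, 1, MPI_COMM_WORLD);

        /* Non-blocking copy into host_buf, but the send uses an unrelated
           buffer: no synchronization is needed between the two operations. */
        cudaMemcpyAsync(host_buf, dev_buf, nbytes, cudaMemcpyDeviceToHost, stream);
        MPI_Send(other_buf, (int)nbytes, MPI_CHAR, dest, 2, MPI_COMM_WORLD);
    }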

  17. Need for Synchronization in GPU-integrated MPI.
      if (rank == 0) {
          MPI_Send(any_buf, ...);
      }
      if (rank == 1) {
          MPI_Recv(any_buf, ...);
      }

  18. Need for Synchronization in GPU-integrated MPI.
      if (rank == 0) {
          GPUExec(any_buf, ...);
          MPI_Isend(other_buf, ...);
          MPI_Send(any_buf, ...);
          GPUMemcpy(any_buf, H2D);
          MPI_Isend(other_buf, ...);
      }
      if (rank == 1) {
          GPUMemcpy(other_buf, H2D);
          GPUExec(other_buf, ...);
          MPI_Recv(any_buf, ...);
          GPUMemcpy(any_buf, H2D);
          MPI_Waitall();
      }

  19. Need for Synchronization in GPU-integrated MPI (continued). The same code as slide 18, highlighting the key issues:
      • Interleaved MPI and GPU operations
      • Dependent vs. independent operations
      • Blocking vs. non-blocking operations
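One way to make the dependent vs. independent distinction explicit for rank 0 above is shown below. This is a sketch assuming a UVA-based (CUDA-aware) MPI that accepts device pointers and leaves GPU synchronization to the user; the produce kernel, launch configuration, sizes, and tags are hypothetical placeholders.

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Placeholder kernel standing in for the slide's GPUExec(any_buf). */
    __global__ void produce(char *any_buf, size_t n)
    {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) any_buf[i] = (char)i;  /* placeholder computation */
    }

    void rank0(char *any_buf_dev, char *other_buf, size_t nbytes,
               cudaStream_t stream)
    {
        MPI_Request req;

        /* GPUExec(any_buf): asynchronous kernel launch on 'stream'. */
        produce<<<256, 256, 0, stream>>>(any_buf_dev, nbytes);

        /* Independent of the kernel: other_buf can be sent right away. */
        MPI_Isend(other_buf, (int)nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);

        /* Dependent on the kernel: any_buf must not be sent until the
           kernel has finished, so the stream is synchronized explicitly. */
        cudaStreamSynchronize(stream);
        MPI_Send(any_buf_dev, (int)nbytes, MPI_CHAR, 1, 1, MPI_COMM_WORLD);

        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }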

  20. Rest of this talk…
      • With interleaved MPI and GPU operations, what are (or should be) the synchronization semantics?
        – Explicit
        – Implicit
      • How can the synchronization semantics affect performance and productivity?
      [Diagram: GPU-integrated MPI, with two design points: UVA-based and attribute-based.]

  21. GPU-integrated MPI: UVA-based Design. What is UVA? Unified Virtual Addressing. Default case: multiple memory spaces; with UVA: a single address space. Source: Peer-to-Peer & Unified Virtual Addressing CUDA Webinar, http://on-demand.gputechconf.com/gtc-express/2011/presentations/cuda_webinars_GPUDirect_uva.pdf

  22. GPU-integrated MPI: UVA-based Design. A single void * may refer to the CPU or to GPU 0, GPU 1, ..., GPU n; cuPointerGetAttribute queries the buffer type (CPU or GPU i). Exclusive to CUDA v4.0+. Source: Peer-to-Peer & Unified Virtual Addressing CUDA Webinar, http://on-demand.gputechconf.com/gtc-express/2011/presentations/cuda_webinars_GPUDirect_uva.pdf
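The query itself is part of the CUDA driver API. A minimal sketch of how an implementation might classify a buffer follows; it assumes a CUDA context already exists (e.g. created implicitly by the runtime API) and simplifies error handling.

    #include <stdint.h>
    #include <cuda.h>

    /* Returns 1 if buf is GPU (device) memory, 0 if it is host memory,
       using the UVA pointer-attribute query available in CUDA 4.0+. */
    int is_device_pointer(const void *buf)
    {
        CUmemorytype mem_type = CU_MEMORYTYPE_HOST;
        CUresult res = cuPointerGetAttribute(&mem_type,
                                             CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                                             (CUdeviceptr)(uintptr_t)buf);
        /* For plain malloc'd host memory the query fails (e.g. with
           CUDA_ERROR_INVALID_VALUE); treat that as a host pointer. */
        if (res != CUDA_SUCCESS)
            return 0;
        return (mem_type == CU_MEMORYTYPE_DEVICE);
    }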

  23. GPU-integrated MPI: UVA-based Design.
      MPI_Send((void *)buf, count, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
      – buf: CPU or GPU buffer
      – The MPI implementation can pipeline the transfer if it is a GPU buffer
      – No change to the MPI standard
      if (my_rank == sender) {        /* send from GPU (CUDA) */
          MPI_Send(dev_buf, ...);
      } else {                        /* receive into host */
          MPI_Recv(host_buf, ...);
      }
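A slightly fuller sketch of this usage, assuming an MPI library with UVA-based GPU support; the buffer setup, size, and tag are illustrative.

    #include <stdlib.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Rank 0 sends directly from GPU memory; rank 1 receives into host
       memory. A UVA-based MPI detects the buffer type internally and
       stages or pipelines the transfer as needed. */
    void uva_example(int rank, size_t nbytes)
    {
        if (rank == 0) {
            char *dev_buf;
            cudaMalloc((void **)&dev_buf, nbytes);
            /* ... fill dev_buf with a kernel ... */
            MPI_Send(dev_buf, (int)nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            cudaFree(dev_buf);
        } else if (rank == 1) {
            char *host_buf = (char *)malloc(nbytes);
            MPI_Recv(host_buf, (int)nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            free(host_buf);
        }
    }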

  24. GPU-integrated MPI: Attribute-based Design.
      MPI_Send((void *)buf, count, new_type, dest, tag, MPI_COMM_WORLD);   /* new_type replaces MPI_CHAR */
      – new_type: adds attributes to built-in datatypes (MPI_INT, MPI_CHAR)
      double *cuda_dev_buf; cl_mem ocl_dev_buf;
      /* initialize a custom type */
      MPI_Type_dup(MPI_CHAR, &type);
      if (my_rank == sender) {   /* send from GPU (CUDA) */
          MPI_Type_set_attr(type, BUF_TYPE, BUF_TYPE_CUDA);
          MPI_Send(cuda_dev_buf, type, ...);
      } else {                   /* receive into GPU (OpenCL) */
          MPI_Type_set_attr(type, BUF_TYPE, BUF_TYPE_OPENCL);
          MPI_Recv(ocl_dev_buf, type, ...);
      }
      MPI_Type_free(&type);
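The slide's MPI_Type_set_attr calls presuppose a keyval for BUF_TYPE. Below is a hedged sketch of how such a keyval could be created with standard MPI attribute caching; BUF_TYPE_CUDA and BUF_TYPE_OPENCL are the design's own constants, given hypothetical values here, and the GPU-integrated MPI implementation is assumed to read the attribute off the datatype inside its send/receive paths.

    #include <stdint.h>
    #include <mpi.h>

    /* Hypothetical buffer-type constants matching the slide's design. */
    enum { BUF_TYPE_HOST = 0, BUF_TYPE_CUDA = 1, BUF_TYPE_OPENCL = 2 };

    static int BUF_TYPE = MPI_KEYVAL_INVALID;

    /* Create the keyval once, after MPI_Init. */
    void init_buf_type_keyval(void)
    {
        MPI_Type_create_keyval(MPI_TYPE_NULL_COPY_FN, MPI_TYPE_NULL_DELETE_FN,
                               &BUF_TYPE, NULL);
    }

    /* Duplicate a built-in datatype and tag it as describing CUDA memory. */
    MPI_Datatype make_cuda_char_type(void)
    {
        MPI_Datatype type;
        MPI_Type_dup(MPI_CHAR, &type);
        MPI_Type_set_attr(type, BUF_TYPE, (void *)(intptr_t)BUF_TYPE_CUDA);
        return type;
    }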
