Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication - PowerPoint PPT Presentation


  1. Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication Sreeram Potluri* Hao Wang* Devendar Bureddy* Ashish Kumar Singh* Carlos Rosales + Dhabaleswar K. Panda* *Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University + Texas Advanced Computing Center 1

  2. Outline • Motivation • Problem Statement • Basics of CUDA IPC • CUDA IPC based Designs in MVAPICH2 – Two Sided Communication – One-sided Communication • Experimental Evaluation • Conclusion and Future Work 2

  3. GPUs for HPC • GPUs are becoming a common component of modern clusters – higher compute density and performance/watt • 3 of the top 5 systems in the latest Top 500 list use GPUs • Increasing number of HPC workloads are being ported to GPUs - many of these use MPI • MPI libraries are being extended to support communication from GPU device memory 3

  4. MVAPICH/MVAPICH2 for GPU Clusters • Earlier – At Sender: cudaMemcpy (sbuf, sdev); MPI_Send (sbuf, . . . ) – At Receiver: MPI_Recv (rbuf, . . . ); cudaMemcpy (rdev, rbuf) • Now (MVAPICH2) – At Sender: MPI_Send (sdev, . . . ) – At Receiver: MPI_Recv (rdev, . . . ) • [Diagram: CPU, GPU and NIC inside the node connected over PCIe, nodes connected through a switch] • Efficient overlap of copies over PCIe with RDMA transfers over the network • Allows us to select efficient algorithms for MPI collectives and MPI datatype processing • Available with MVAPICH2 v1.8 ( http://mvapich.cse.ohio-state.edu ) 4
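
The two usage models above, as they look to an application, can be sketched roughly as follows, assuming a CUDA-aware MPI such as MVAPICH2 1.8 and two ranks on one node with one GPU each (buffer names and sizes are illustrative):

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    #define N (1 << 20)

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaSetDevice(rank);   /* two ranks on one node, one GPU each (illustrative) */

        float *dev_buf;        /* communication buffer in GPU device memory */
        cudaMalloc((void **)&dev_buf, N * sizeof(float));

        /* Earlier: stage the data through a host buffer around each MPI call */
        float *host_buf = (float *)malloc(N * sizeof(float));
        if (rank == 0) {
            cudaMemcpy(host_buf, dev_buf, N * sizeof(float), cudaMemcpyDeviceToHost);
            MPI_Send(host_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(host_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            cudaMemcpy(dev_buf, host_buf, N * sizeof(float), cudaMemcpyHostToDevice);
        }

        /* Now: hand the device pointer to MPI and let the library pipeline the transfer */
        if (rank == 0)
            MPI_Send(dev_buf, N, MPI_FLOAT, 1, 1, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(dev_buf, N, MPI_FLOAT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        free(host_buf);
        cudaFree(dev_buf);
        MPI_Finalize();
        return 0;
    }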

  5. Motivation • Multi-GPU node architectures are becoming common • Until CUDA 3.2 – Communication between processes staged through the host – Shared Memory (pipelined) – Network Loopback (asynchronous) • CUDA 4.0 – Inter-Process Communication (IPC) – Host bypass – Handled by a DMA engine – Low latency and asynchronous – Requires creation, exchange and mapping of memory handles - overhead • [Diagram: one node where Process 0 and Process 1 share host memory through the I/O hub, each driving its own GPU (GPU 0, GPU 1), with an HCA attached] 5

  6. Comparison of Costs • Comparison of bare copy costs between two processes on one node, each using a different GPU (outside MPI), for 8-byte messages • Copy via host: 49 usec • CUDA IPC copy + handle creation & mapping overhead: 228 usec • CUDA IPC copy alone: 3 usec 6

  7. Outline • Motivation • Problem Statement • Basics of CUDA IPC • CUDA IPC based Designs in MVAPICH2 – Two Sided Communication – One-sided Communication • Experimental Evaluation • Conclusion and Future Work 7

  8. Problem Statement • Can we take advantage of CUDA IPC to improve performance of MPI communication between GPUs on a node? • How do we address the memory handle creation and mapping overheads? • What kind of performance do the different MPI communication semantics deliver with CUDA IPC? – Two-sided Semantics – One-sided Semantics • How do CUDA IPC based designs impact the performance of end-applications? 8

  9. Outline • Motivation • Problem Statement • Basics of CUDA IPC • CUDA IPC based Designs in MVAPICH2 – Two Sided Communication – One-sided Communication • Experimental Evaluation • Conclusion and Future Work 9

  10. Basics of CUDA IPC • Process 0 (owner of sbuf_ptr): cuMemGetAddressRange (&base_ptr, sbuf_ptr); cudaIpcGetMemHandle (&handle, base_ptr); cudaIpcGetEventHandle (&event_handle, event); sends the IPC handles to Process 1 • Process 1: cudaIpcOpenMemHandle (&base_ptr, handle); cudaIpcOpenEventHandle (&ipc_event, event_handle); cudaMemcpy (rbuf_ptr, base_ptr + displ); cudaEventRecord (ipc_event) • Process 0: cudaStreamWaitEvent (0, event) before issuing other CUDA calls that can modify the sbuf • The IPC memory handle should be closed at Process 1 before the buffer is freed at Process 0 10
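
A condensed sketch of this handle exchange between two processes, assuming the handles are shipped over MPI, that ranks 0 and 1 each use their own GPU, and that the event on Process 0 was created with the cudaEventInterprocess and cudaEventDisableTiming flags (names, sizes, and the MPI-based acknowledgement are illustrative simplifications):

    #include <mpi.h>
    #include <cuda_runtime.h>

    #define NBYTES (1 << 20)

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaSetDevice(rank);                 /* rank 0 on GPU 0, rank 1 on GPU 1 */

        if (rank == 0) {
            /* Process 0 owns the source buffer and an inter-process event */
            void *sbuf;
            cudaEvent_t event;
            cudaMalloc(&sbuf, NBYTES);
            cudaEventCreateWithFlags(&event, cudaEventInterprocess | cudaEventDisableTiming);

            cudaIpcMemHandle_t mem_handle;
            cudaIpcEventHandle_t event_handle;
            cudaIpcGetMemHandle(&mem_handle, sbuf);       /* covers the whole allocation */
            cudaIpcGetEventHandle(&event_handle, event);
            MPI_Send(&mem_handle, sizeof(mem_handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Send(&event_handle, sizeof(event_handle), MPI_BYTE, 1, 1, MPI_COMM_WORLD);

            /* Wait until Process 1 has recorded the event and closed its mapping,
               then make sure the copy is complete before touching or freeing sbuf */
            MPI_Recv(NULL, 0, MPI_BYTE, 1, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            cudaStreamWaitEvent(0, event, 0);
            cudaStreamSynchronize(0);
            cudaEventDestroy(event);
            cudaFree(sbuf);
        } else if (rank == 1) {
            void *rbuf, *base_ptr;
            cudaEvent_t ipc_event;
            cudaMalloc(&rbuf, NBYTES);

            cudaIpcMemHandle_t mem_handle;
            cudaIpcEventHandle_t event_handle;
            MPI_Recv(&mem_handle, sizeof(mem_handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(&event_handle, sizeof(event_handle), MPI_BYTE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            cudaIpcOpenMemHandle(&base_ptr, mem_handle, cudaIpcMemLazyEnablePeerAccess);
            cudaIpcOpenEventHandle(&ipc_event, event_handle);

            /* Copy straight from Process 0's device buffer, then flag completion */
            cudaMemcpyAsync(rbuf, base_ptr, NBYTES, cudaMemcpyDefault, 0);
            cudaEventRecord(ipc_event, 0);
            cudaStreamSynchronize(0);

            cudaIpcCloseMemHandle(base_ptr);              /* close before the owner frees it */
            MPI_Send(NULL, 0, MPI_BYTE, 0, 2, MPI_COMM_WORLD);
            cudaFree(rbuf);
        }

        MPI_Finalize();
        return 0;
    }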

  11. Outline • Motivation • Problem Statement • Basics of CUDA IPC • CUDA IPC based Designs in MVAPICH2 – Two Sided Communication – One-sided Communication • Experimental Evaluation • Conclusion and Future Work 11

  12. Design of Two-sided Communication • MPI communication costs – synchronization – data movement • Small message communication – minimize synchronization overheads – pair-wise eager buffers for host-host communication – associated pair-wise IPC buffers on the GPU – synchronization using CUDA Events • Large message communication – minimize the number of copies - rendezvous protocol – minimize memory mapping overheads using a mapping cache (see the sketch below) 12
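
The mapping cache mentioned for the large-message path can be sketched as below: the result of cudaIpcOpenMemHandle is remembered per remote allocation, so the expensive handle-mapping cost is paid only on the first transfer from that buffer. The structure and helper names are illustrative, not MVAPICH2's actual internals:

    #include <cuda_runtime.h>
    #include <string.h>

    #define CACHE_SLOTS 64

    /* One cached mapping: the remote handle we opened and the local address it maps to */
    typedef struct {
        int valid;
        cudaIpcMemHandle_t handle;
        void *mapped_base;
    } ipc_cache_entry_t;

    static ipc_cache_entry_t cache[CACHE_SLOTS];

    /* Return a locally usable pointer for a remote allocation, opening the
       IPC handle only on a cache miss. */
    void *ipc_map_cached(const cudaIpcMemHandle_t *handle)
    {
        for (int i = 0; i < CACHE_SLOTS; i++) {
            if (cache[i].valid &&
                memcmp(&cache[i].handle, handle, sizeof(*handle)) == 0)
                return cache[i].mapped_base;      /* hit: reuse the earlier mapping */
        }
        /* miss: pay the mapping overhead once and remember the result */
        void *base = NULL;
        if (cudaIpcOpenMemHandle(&base, *handle, cudaIpcMemLazyEnablePeerAccess) != cudaSuccess)
            return NULL;
        for (int i = 0; i < CACHE_SLOTS; i++) {
            if (!cache[i].valid) {
                cache[i].valid = 1;
                cache[i].handle = *handle;
                cache[i].mapped_base = base;
                break;
            }
        }
        return base;
    }

    /* Mappings must be closed (and evicted) before the owner frees the allocation */
    void ipc_unmap_all(void)
    {
        for (int i = 0; i < CACHE_SLOTS; i++) {
            if (cache[i].valid) {
                cudaIpcCloseMemHandle(cache[i].mapped_base);
                cache[i].valid = 0;
            }
        }
    }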

  13. Design of One-sided Communication • Separates communication from synchronization • Window • Communication calls - put, get, accumulate • Synchronization calls – active - fence, post-wait/start-complete – passive – lock-unlock – period between two synchronization calls is a communication epoch • IPC memory handles created and mapped during window creation • Put/Get implemented as cudaMemcpyAsync • Synchronization using CUDA Events 13
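
A hedged sketch of how the one-sided path looks from the application, assuming a CUDA-aware MPI (such as MVAPICH2 1.8) that accepts device pointers for window creation and RMA calls, run with two ranks on one node (counts and names are illustrative):

    #include <mpi.h>
    #include <cuda_runtime.h>

    #define COUNT (1 << 18)

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaSetDevice(rank);   /* two ranks, one GPU each (illustrative) */

        /* Expose a GPU buffer as an RMA window: IPC handles can be created,
           exchanged and mapped once here instead of per message */
        double *win_buf, *local_buf;
        cudaMalloc((void **)&win_buf, COUNT * sizeof(double));
        cudaMalloc((void **)&local_buf, COUNT * sizeof(double));

        MPI_Win win;
        MPI_Win_create(win_buf, COUNT * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        /* Active synchronization: the fence - get - fence pair is one epoch */
        MPI_Win_fence(0, win);
        if (rank == 0)
            MPI_Get(local_buf, COUNT, MPI_DOUBLE, 1 /* target */, 0, COUNT, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
        cudaFree(local_buf);
        cudaFree(win_buf);
        MPI_Finalize();
        return 0;
    }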

  14. Outline • Motivation • Problem Statement • Basics of CUDA IPC • CUDA IPC based Designs in MVAPICH2 – Two Sided Communication – One-sided Communication • Experimental Evaluation • Conclusion and Future Work 14

  15. Experimental Setup • Intel Westmere node – 2 NVIDIA Tesla C2075 GPUs – Red Hat Linux 5.8 and CUDA Toolkit 4.1 • MVAPICH/MVAPICH2 - High Performance MPI Library for IB, 10GigE/iWARP and RoCE – Available since 2002 – Used by more than 1,930 organizations (HPC centers, industries and universities) in 68 countries – More than 111,000 downloads from the OSU site directly – Empowering many TOP500 clusters • 5th ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology • 7th ranked 111,104-core cluster (Pleiades) at NASA • 25th ranked 62,976-core cluster (Ranger) at TACC – http://mvapich.cse.ohio-state.edu 15

  16. Two-sided Communication Performance • [Charts: MPI latency (small and large messages) and bandwidth, SHARED-MEM vs CUDA IPC, message sizes 1 Byte to 4 MB] • 46% latency improvement for small messages, 70% for large messages, and 78% bandwidth improvement • Considerable improvement in MPI performance due to host bypass 16

  17. One-sided Communication Performance (get + active synchronization vs. send/recv) • [Charts: latency (small and large messages) and bandwidth, comparing SHARED-MEM-1SC, CUDA-IPC-1SC and CUDA-IPC-2SC] • Up to 30% latency improvement and 27% bandwidth improvement with one-sided semantics • Better performance compared to two-sided semantics 17

  18. One-sided Communication Performance (get + passive synchronization) • [Chart: latency (usec) vs. target busy loop duration (usec), SHARED-MEM vs CUDA IPC] • Lock + 8 Gets + Unlock with the target in a busy loop (128KB messages) • True asynchronous progress with CUDA IPC - the origin's latency does not grow with the target's busy loop 18
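
The benchmark pattern described above (lock + 8 gets + unlock while the target spins on the host) can be sketched as follows, again assuming a CUDA-aware MPI that accepts device-memory windows; the busy-loop length and message size follow the slide but the code itself is illustrative:

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    #define MSG_BYTES (128 * 1024)
    #define NUM_GETS  8

    /* Spin on the host CPU for roughly 'usec' microseconds */
    static void busy_loop(double usec) {
        double start = MPI_Wtime();
        while ((MPI_Wtime() - start) * 1e6 < usec)
            ;
    }

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaSetDevice(rank);   /* two ranks, one GPU each (illustrative) */

        char *win_buf, *local_buf;
        cudaMalloc((void **)&win_buf, MSG_BYTES);
        cudaMalloc((void **)&local_buf, (size_t)NUM_GETS * MSG_BYTES);

        MPI_Win win;
        MPI_Win_create(win_buf, MSG_BYTES, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        if (rank == 0) {
            /* Origin: one passive-target epoch with 8 gets from rank 1's GPU buffer */
            double t0 = MPI_Wtime();
            MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
            for (int i = 0; i < NUM_GETS; i++)
                MPI_Get(local_buf + (size_t)i * MSG_BYTES, MSG_BYTES, MPI_BYTE,
                        1, 0, MSG_BYTES, MPI_BYTE, win);
            MPI_Win_unlock(1, win);
            printf("lock + %d gets + unlock: %.1f usec\n",
                   NUM_GETS, (MPI_Wtime() - t0) * 1e6);
        } else if (rank == 1) {
            /* Target: busy on the host; with host bypass the gets still make progress */
            busy_loop(300.0);
        }

        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Win_free(&win);
        cudaFree(win_buf);
        cudaFree(local_buf);
        MPI_Finalize();
        return 0;
    }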

  19. Lattice Boltzmann Method • [Chart: LB step latency (msec) for datasets per GPU of 256x256x64, 256x512x64 and 512x512x64, comparing 2SIDED-SHARED-MEM, 2SIDED-IPC and 1SIDED-IPC] • Computational fluid dynamics code with support for multi-phase flows with large density ratios • Modified to use MPI communication from GPU device memory - one-sided and two-sided semantics • Up to 16% improvement in per-step latency 19

  20. Outline • Motivation • Problem Statement • Basics of CUDA IPC • CUDA IPC based Designs in MVAPICH2 – Two Sided Communication – One-sided Communication • Experimental Evaluation • Conclusion and Future Work 20

  21. Conclusion and Future Work • Take advantage of CUDA IPC to improve MPI communication between GPUs on a node • 70% improvement in latency and 78% improvement in bandwidth for two-sided communication • One-sided communication gives better performance and allows for truly asynchronous communication • 16% improvement in execution time of Lattice Boltzmann Method code • Studying the impact on other applications while exploiting computation-communication overlap • Exploring efficient designs for inter-node one-sided communication on GPU clusters 21

  22. Thank You! {potluri, wangh, bureddy, singhas, panda} @cse.ohio-state.edu carlos@tacc.utexas.edu Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/ MVAPICH Web Page http://mvapich.cse.ohio-state.edu/ 22
