

  1. Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning
  Ching-Hsiang Chu¹, Xiaoyi Lu¹, Ammar A. Awan¹, Hari Subramoni¹, Jahanzeb Hashmi¹, Bracy Elton², and Dhabaleswar K. (DK) Panda¹
  ¹ Department of Computer Science and Engineering, The Ohio State University
  ² Engility Corporation

  2. Outline
  • Introduction
    – Deep Learning on GPU and InfiniBand (IB) Clusters
    – Multi-Source Broadcast-Type Operations for Deep Learning
  • Analysis
  • Proposed Design
    – Streaming-based Design with IB Multicast and NVIDIA GPUDirect Features
  • Performance Evaluation
  • Conclusion and Future Work

  3. Trends in Modern HPC Architecture
  • Multi-core/many-core technologies
  • High-performance interconnects: InfiniBand (IB), Omni-Path (< 1 μsec latency, 100 Gbps bandwidth)
  • Accelerators/coprocessors are becoming common in high-end systems: high compute density, high performance/watt, > 1 Tflop/s double-precision on a chip
  • High-performance storage and compute devices: SSD, NVMe-SSD, NVRAM
  [Examples pictured: K Computer, Tianhe-2, Sunway TaihuLight, Titan]

  4. GPU in HPC Systems
  • Growth of GPU clusters in the last 3 years
    – NVIDIA GPUs boost many Top500 and Green500 systems
    – “Top 13 systems on the latest Green500 are all equipped with the P100 hardware”
  [Chart: number of NVIDIA Fermi, Kepler, and Pascal systems in the Top500, June 2014 through June 2017; data collected from http://top500.org]

  5. Architectures for Deep Learning (DL)
  • Past and current trend
    – Multi-core CPUs within a node
    – Multi-core CPUs across nodes (IB networks)
    – Multi-core CPUs + single GPU across nodes (IB networks)
    – Multi-core CPUs + multi-GPU within a node (e.g., NVIDIA DGX-1 systems)
  • Near future
    – Multi-core CPUs + multi-GPU across nodes (IB networks)

  6. High-Performance Deep Learning
  • Computation using GPUs
  • Communication using MPI
    – Exchanging partial gradients after each minibatch
    – All-to-all (multi-source) communication, e.g., MPI_Bcast (see the sketch below)
  • Challenges
    – High computation-communication overlap
    – Good scalability for upcoming large-scale GPU clusters
    – No application-level modification
  [Diagram: four GPU nodes exchanging partial gradients with one another]
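A minimal sketch of this multi-source gradient exchange, assuming a CUDA-aware MPI library that accepts GPU device pointers directly (as MVAPICH2-GDR does); the buffer layout and function name are illustrative assumptions, not the paper's code:

    /* Multi-source broadcast of partial gradients: every rank takes a turn as
     * the broadcast root, so each node ends up with every peer's gradients.
     * Sketch only; assumes a CUDA-aware MPI that accepts device pointers. */
    #include <mpi.h>

    void exchange_gradients(float *d_grads[], int count, MPI_Comm comm)
    {
        int nprocs;
        MPI_Comm_size(comm, &nprocs);

        /* d_grads[r] is a GPU buffer for rank r's partial gradients;
         * only rank r's copy is valid before its broadcast. */
        for (int root = 0; root < nprocs; root++)
            MPI_Bcast(d_grads[root], count, MPI_FLOAT, root, comm);
    }

Each call is a broadcast from a different source, which is exactly the multi-source pattern the rest of the talk analyzes and optimizes.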

  7. Outline
  • Introduction
  • Analysis
    – Existing Designs
    – Problem Statement
  • Proposed Design
  • Performance Evaluation
  • Conclusion and Future Work

  8. Evaluation Parameters
  Notation | Meaning                                                                       | Unit
  n        | Number of processes                                                           | N/A
  m        | Number of broadcast sources                                                   | N/A
  t_s      | Set-up time for sending data                                                  | sec
  t_o(n)   | Overhead for issuing an IB-MCAST packet                                       | sec
  M        | Original message size                                                         | bytes
  C        | Size of a data chunk                                                          | bytes
  MTU      | Maximum Transmission Unit for IB-MCAST, provided by the hardware manufacturer | bytes
  B_Host   | Bandwidth of reading host memory (B_Host >> B_GDR)                            | bytes/sec
  B_GDR    | Bandwidth of reading GPU memory (NVIDIA GPUDirect RDMA)                       | bytes/sec
  B_PCIe   | PCIe bandwidth between host and GPU memory                                    | bytes/sec

  9. Ring-Based Broadcast
  • Direct: (n − 1) × (t_s + M/B_GDR)
  • Staging: M/B_PCIe + (n − 1) × (t_s + M/B_Host)
  • Pipeline: (M/C + n − 2) × (t_s + C/B_GDR)
  • Poor scalability: the data is forwarded hop by hop (GDR read, network transfer, GDR write) from the source through Destination 1, 2, 3, ..., so latency grows with the number of processes (see the cost sketch below)
  [Diagram: source GPU forwards the data through each destination's IB HCA, CPU, and GPU in a ring]
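To make the comparison concrete, the three expressions above can be evaluated as a toy cost model; all parameter values below are made-up placeholders for illustration, not measurements from the paper:

    /* Toy cost model for the ring-based broadcast variants above.
     * All parameter values are illustrative placeholders, not measured. */
    #include <stdio.h>

    double ring_direct(int n, double t_s, double M, double B_gdr) {
        return (n - 1) * (t_s + M / B_gdr);
    }
    double ring_staging(int n, double t_s, double M,
                        double B_pcie, double B_host) {
        return M / B_pcie + (n - 1) * (t_s + M / B_host);
    }
    double ring_pipeline(int n, double t_s, double M, double C, double B_gdr) {
        return (M / C + n - 2) * (t_s + C / B_gdr);
    }

    int main(void) {
        double t_s = 1e-6;            /* per-send setup time (s), placeholder */
        double M = 64e6, C = 1e6;     /* 64 MB message, 1 MB chunks           */
        double B_gdr = 3e9, B_host = 10e9, B_pcie = 12e9;   /* bytes/s        */
        for (int n = 2; n <= 64; n *= 2)
            printf("n=%2d direct=%.3fs staging=%.3fs pipeline=%.3fs\n", n,
                   ring_direct(n, t_s, M, B_gdr),
                   ring_staging(n, t_s, M, B_pcie, B_host),
                   ring_pipeline(n, t_s, M, C, B_gdr));
        return 0;
    }

All three variants grow linearly with n, which is the scalability problem the multicast-based designs address.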

  10. K-nomial-Based Broadcast
  • Direct: log_k(n) × (t_s + M/B_GDR)
  • Staging: M/B_PCIe + log_k(n) × (t_s + M/B_Host)
  • Pipeline: (M/C) × (t_s + C/B_GDR) × log_k(n)
  • Scalability is still not optimized: every level of the k-nomial tree pays the GDR read, network transfer, and GDR write cost
  [Diagram: source GPU sends to Destination 1, which forwards to Destinations 2 and 3 through their IB HCAs, CPUs, and GPUs in a k-nomial tree]

  11. Hardware Multicast-Based Broadcast*
  • For GPU-resident data, using
    – GPUDirect RDMA (GDR)
    – InfiniBand hardware multicast (IB-MCAST)
  • Overhead
    – IB UD limit (each multicast packet carries at most one MTU)
    – GDR read limit
  • Latency: (M/MTU) × (t_s + t_o(n) + MTU/B_GDR) (see the sketch below)
  [Diagram: 1. IB gather + GDR read at the source; 2. IB hardware multicast through the IB switch; 3. IB scatter + GDR write at each of the N destinations]
  *A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters,” in HiPC 2014, Dec 2014.
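Continuing the toy cost model from the ring slide (placeholder parameters, not the paper's measurements), the IB-MCAST expression can be written as below; note that n enters only through the multicast-issue overhead t_o(n), not through per-hop forwarding:

    /* IB-MCAST-based broadcast: the message is split into M/MTU multicast
     * packets, each paying setup, multicast-issue overhead, and a GDR read
     * of one MTU at the source. t_o(n) is treated as a constant here. */
    double mcast_gdr(double t_s, double t_o_n, double M,
                     double mtu, double B_gdr) {
        return (M / mtu) * (t_s + t_o_n + mtu / B_gdr);
    }

Because B_GDR is much smaller than B_Host, the GDR read term dominates for large messages, which motivates the host-staged streaming design proposed next.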

  12. Problem Statement
  • How to leverage IB-MCAST and advanced GPU features such as GDR to design an efficient and scalable broadcast for large messages on GPU clusters?
  • How to achieve high overlap and scalability for multi-source broadcast operations?
  • How to determine the attainable theoretical and practical performance benefits for deep learning applications?

  13. Outline
  • Introduction
  • Analysis
  • Proposed Design
    – Streaming-based Design with IB Multicast and NVIDIA GPUDirect Features
  • Performance Evaluation
  • Conclusion and Future Work

  14. Overview of the Proposed Streaming Design
  • Optimized broadcast send operation
    – Streaming the GPU-resident data through host memory
    – Leveraging InfiniBand hardware multicast
      » Low latency: avoids the GDR read limit
      » Overlaps data transfers within and across nodes
  • Optimized broadcast receive operation
    – Zero-copy scheme leveraging the GDR feature
      » Low latency: avoids unnecessary data transfers

  15. Optimized Broadcast Send
  • Preparing an intermediate buffer (im_buf)
    – Page-locked (pinned) host buffer
      » Fast device-to-host data movement
    – Allocated at the initialization phase
      » Low overhead
  • Streaming data through the host
    – Fine-tuned data chunks
    – Asynchronous copy operations
      » Three-stage pipeline (see the sketch below)
  [Diagram: source calls MPI_Bcast(d_out, …); 1. data preparation (GPU d_out → pinned im_buf); 2. IB gather; 3. IB hardware multicast through the IB switch]
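A simplified sketch of the send-side pipeline. The assumptions are loud ones: post_mcast_send() is a hypothetical placeholder that collapses the IB gather and hardware-multicast stages into one call (the real design does all of this inside the MPI library), and the chunk size and double-buffering scheme are illustrative:

    /* Send-side streaming: stage GPU data into pinned host chunks and hand
     * each chunk to the IB-MCAST path, overlapping the D2H copy of the next
     * chunk with the multicast of the current one. Illustrative sketch. */
    #include <cuda_runtime.h>
    #include <stddef.h>

    #define CHUNK_SIZE (512 * 1024)          /* fine-tuned chunk size (assumed) */

    /* Hypothetical helper: IB gather + hardware multicast of one host chunk. */
    void post_mcast_send(const void *host_chunk, size_t len);

    void streaming_bcast_send(const char *d_out, size_t total, cudaStream_t stream)
    {
        static char *im_buf[2];              /* pinned buffers, allocated once  */
        if (!im_buf[0]) {
            cudaMallocHost((void **)&im_buf[0], CHUNK_SIZE);
            cudaMallocHost((void **)&im_buf[1], CHUNK_SIZE);
        }

        size_t nchunks = (total + CHUNK_SIZE - 1) / CHUNK_SIZE;

        /* Prime the pipeline: start copying chunk 0 into pinned host memory. */
        cudaMemcpyAsync(im_buf[0], d_out, total < CHUNK_SIZE ? total : CHUNK_SIZE,
                        cudaMemcpyDeviceToHost, stream);

        for (size_t i = 0; i < nchunks; i++) {
            size_t off = i * CHUNK_SIZE;
            size_t len = total - off < CHUNK_SIZE ? total - off : CHUNK_SIZE;

            cudaStreamSynchronize(stream);   /* chunk i has arrived on the host */

            if (i + 1 < nchunks) {           /* overlap: start staging chunk i+1 */
                size_t noff = off + CHUNK_SIZE;
                size_t nlen = total - noff < CHUNK_SIZE ? total - noff : CHUNK_SIZE;
                cudaMemcpyAsync(im_buf[(i + 1) % 2], d_out + noff, nlen,
                                cudaMemcpyDeviceToHost, stream);
            }

            post_mcast_send(im_buf[i % 2], len);   /* IB gather + IB-MCAST */
        }
    }

Because im_buf is pinned and the copies are asynchronous, the copy of chunk i+1 proceeds while chunk i is being multicast, which is the intra-node overlap illustrated two slides later.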

  16. Optimized Broadcast Receive
  • Zero-copy broadcast receive (see the sketch below)
    – Pre-posted user buffer (d_in)
    – Avoids additional data movement
    – Leverages the IB scatter and GDR write features
      » Low latency
      » Frees up PCIe resources for applications
  [Diagram: each destination calls MPI_Bcast(d_in, …); the IB hardware multicast is scattered and GDR-written directly into d_in in GPU memory]
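From the application's point of view, the zero-copy receive is simply a broadcast into a GPU buffer; a minimal sketch, assuming a CUDA-aware MPI, where the point is what does not appear: no host bounce buffer and no extra cudaMemcpy at the destination:

    /* Destination side: the pre-posted GPU buffer d_in is handed to MPI_Bcast
     * directly; with the zero-copy design, IB scatter + GDR write deliver the
     * multicast packets straight into GPU memory. Assumes a CUDA-aware MPI. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void broadcast_receive(int count, int root, MPI_Comm comm)
    {
        float *d_in;
        cudaMalloc((void **)&d_in, count * sizeof(float));
        MPI_Bcast(d_in, count, MPI_FLOAT, root, comm);  /* lands in GPU memory */
        /* ... consume d_in on the GPU ... */
        cudaFree(d_in);
    }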

  17. Overlap Opportunities
  • Overlap within a node and overlap across nodes
  • Latency: C/B_PCIe + (M/MTU) × (t_s + t_o(n) + MTU/B_Host) (see the sketch below)
  [Timeline diagram: broadcasts from Nodes A, B, and C; on each node the CPU, HCA, and GPU activities (cudaMemcpyAsync, IB hardware multicast, cudaStreamSynchronize, GDR write) from the different broadcasts overlap along the timeline]
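One way to expose the cross-node overlap at the application level is to issue the broadcasts from all sources as non-blocking collectives and wait on them together. This is only an illustration of the overlap idea, assuming a CUDA-aware MPI with non-blocking collective support; it is not necessarily how the paper's benchmarks drive the library:

    /* Every rank is the root of one broadcast; issuing them all as non-blocking
     * collectives lets the library overlap the per-source streaming pipelines.
     * All ranks must start the MPI_Ibcast calls in the same order. */
    #include <mpi.h>
    #include <stdlib.h>

    void overlapped_multi_source_bcast(float *d_bufs[], int count, MPI_Comm comm)
    {
        int nprocs;
        MPI_Comm_size(comm, &nprocs);

        MPI_Request *reqs = malloc(nprocs * sizeof(MPI_Request));
        for (int root = 0; root < nprocs; root++)
            MPI_Ibcast(d_bufs[root], count, MPI_FLOAT, root, comm, &reqs[root]);

        MPI_Waitall(nprocs, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
    }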

  18. Outline
  • Introduction
  • Analysis
  • Proposed Design
  • Performance Evaluation
    – OSU Micro-Benchmark (OMB)
    – Deep Learning Framework
  • Conclusion and Future Work

  19. Overview of the MVAPICH2 Project
  • High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
    – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0): started in 2001, first version available in 2002
    – MVAPICH2-X (MPI + PGAS): available since 2011
    – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC): available since 2014
    – Support for virtualization (MVAPICH2-Virt): available since 2015
    – Support for energy awareness (MVAPICH2-EA): available since 2015
    – Support for InfiniBand network analysis and monitoring (OSU INAM): since 2015
    – Used by more than 2,775 organizations in 85 countries
    – More than 420,000 (> 0.4 million) downloads from the OSU site directly
    – Empowering many TOP500 clusters (June ’17 ranking)
      » 1st: 10,649,600-core Sunway TaihuLight at the National Supercomputing Center in Wuxi, China
      » 15th: 241,108-core Pleiades at NASA
      » 20th: 462,462-core Stampede at TACC
      » 44th: 74,520-core Tsubame 2.5 at Tokyo Institute of Technology
    – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
    – http://mvapich.cse.ohio-state.edu
  • Empowering Top500 systems for over a decade: from System-X at Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) to Sunway TaihuLight (1st in Jun ’16, 10M cores, 100 PFlops)
