Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems


SLIDE 1

Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems

Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Dip S. Banerjee, Hari Subramoni and Dhabaleswar K. (DK) Panda, Network-based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University

SLIDE 2

Outline

  • Introduction
  • Proposed Designs
  • Performance Evaluation
  • Conclusion

SLIDE 3

Drivers of Modern HPC Cluster Architectures

  • Multi-core processors are ubiquitous
  • InfiniBand is very popular in HPC clusters
  • Accelerators/Coprocessors becoming common in high-end systems
  • Pushing the envelope for Exascale computing

[Figure: multi-core processors; high-performance interconnects (InfiniBand, <1 us latency, >100 Gbps bandwidth); accelerators/coprocessors (high compute density, high performance/watt, >1 Tflop/s double precision on a chip); example systems: Tianhe-2, Titan, Stampede, Tianhe-1A]

SLIDE 4

Accelerators in HPC Systems

  • Growth of accelerator-enabled clusters in the last 3 years
    – 22% of the Top 50 clusters are boosted by NVIDIA GPUs in Nov '15
    – From the Top500 list (http://www.top500.org)

[Chart: system count of accelerator-enabled systems from June 2013 to Nov 2015, broken down into NVIDIA Kepler, NVIDIA Fermi and Intel Xeon Phi]

SLIDE 5

Motivation

  • Parallel applications on GPU clusters
    – CUDA (Compute Unified Device Architecture):
      • Kernel computation on NVIDIA GPUs
    – CUDA-Aware MPI (Message Passing Interface):
      • Communications across processes/nodes
      • Non-blocking communication to overlap with CUDA kernels

MPI_Isend(Buf1, ..., request1);
MPI_Isend(Buf2, ..., request2);
/* Independent computations on CPU/GPU */
MPI_Wait(request1, status1);
MPI_Wait(request2, status2);
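To make the pattern above concrete, here is a minimal sketch of CUDA-aware non-blocking communication overlapped with independent GPU work; the kernel name my_kernel, the buffer names and the launch geometry are illustrative assumptions, not part of the original slides.

#include <mpi.h>
#include <cuda_runtime.h>

/* Hedged sketch: overlap independent GPU work with CUDA-aware non-blocking
   sends. my_kernel, d_buf1, d_buf2, d_work and n are placeholders. */
__global__ void my_kernel(double *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0;          /* stand-in for independent GPU work */
}

void overlapped_sends(double *d_buf1, double *d_buf2, double *d_work,
                      int n, int peer)
{
    MPI_Request reqs[2];
    MPI_Status  stats[2];

    /* CUDA-aware MPI accepts device pointers directly */
    MPI_Isend(d_buf1, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(d_buf2, n, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD, &reqs[1]);

    /* independent computations on the GPU proceed while MPI progresses */
    my_kernel<<<(n + 255) / 256, 256>>>(d_work, n);

    MPI_Waitall(2, reqs, stats);       /* complete the sends */
    cudaDeviceSynchronize();           /* complete the overlapped kernel */
}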

SLIDE 6

Motivation

  • Use of non-contiguous data is becoming common
    – Easy to represent complex data structures
      • MPI Datatypes
    – E.g., fluid dynamics, image processing…
  • What if the data are in GPU memory?
    1. Copy the data to the CPU to perform the packing/unpacking
       • Slower for large messages
       • Data movements between GPU and CPU are expensive
    2. Utilize a GPU kernel to perform the packing/unpacking*
       • No explicit copies, faster for large messages

*R. Shi et al., “HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement Using MPI Datatypes on GPU Clusters,” in 43rd ICPP, Sept 2014, pp. 221–230.
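As an illustration of approach 2 above, a simplified device-side packing kernel for a strided (vector-like) layout might look like the sketch below; the kernel name pack_vector, its arguments and the launch configuration are assumptions for illustration, not the implementation from the paper.

/* Hedged sketch of approach 2: pack a strided GPU layout into a contiguous
   staging buffer directly on the device, avoiding an explicit copy to the
   host. pack_vector, d_src, d_pack, count and stride are placeholders. */
__global__ void pack_vector(const double *src, double *dst,
                            int count, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        dst[i] = src[i * stride];      /* gather one strided element per thread */
}

/* launch example:
   pack_vector<<<(count + 255) / 256, 256>>>(d_src, d_pack, count, stride);
   the contiguous d_pack buffer can then be handed to the communication layer. */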

SLIDE 7

Motivation – Non-Contiguous Data Movement in MPI

Common Scenario:

MPI_Isend(Buf1, ..., req1);
MPI_Isend(Buf2, ..., req2);
/* Application work on the CPU/GPU */
MPI_Waitall(req, …);

*Buf1, Buf2, … contain non-contiguous MPI Datatypes

[Timeline figure: the packing/unpacking for each Isend and the subsequent communication are serialized, wasting computing resources on both the CPU and GPU]

SLIDE 8

Problem Statement

  • Low overlap between CPU and GPU for applications
    – Packing/Unpacking operations are serialized
  • CPU/GPU resources are not fully utilized
    – GPU threads remain idle for most of the time
    – Low utilization, low efficiency

[Radar chart: Overlap, Productivity, Performance and Resource Utilization for the Proposed, User Naive and User Advanced approaches; farther from the center is better]

Can we have designs to leverage new GPU technology to address these issues?

SLIDE 9

Goals of this work

  • Propose new designs that leverage new NVIDIA GPU technologies
    Ø Hyper-Q technology (Multi-Streaming)
    Ø CUDA Event and Callback
  • Achieving
    Ø High performance and resource utilization for applications
    Ø High productivity for developers

SLIDE 10

Outline

  • Introduction
  • Proposed Designs
    – Event-based
    – Callback-based
  • Performance Evaluation
  • Conclusion

SLIDE 11

Overview

[Timeline figure comparing the Existing Design and the Proposed Design: in the existing design each Isend initiates its packing kernel on a stream, waits for the kernel (WFK) and only then starts the send, so Isend(1), Isend(2) and Isend(3) proceed serially; in the proposed design the packing kernels are initiated back-to-back on streams and the WFK/send steps are overlapped with progress, so the proposed design is expected to finish earlier than the existing one]

SLIDE 12

Event-based Design

  • CUDA Event Management
    – Provides a mechanism to signal when tasks have occurred in a CUDA stream
  • Basic design idea
    1. CPU launches a CUDA packing/unpacking kernel
    2. CPU creates a CUDA event and then returns immediately
       • GPU sets the status as ‘completed’ when the kernel is completed
    3. In MPI_Wait/MPI_Waitall:
       • CPU queries the events when the packed/unpacked data is required for communication
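A minimal sketch of this event-based flow is shown below, assuming a hypothetical pack_kernel and a simplified progress loop; it illustrates the idea only and is not the MVAPICH2-GDR implementation.

/* Hedged sketch of the event-based design: launch the packing kernel,
   record an event, return immediately, and let MPI_Wait poll the event
   before handing the packed buffer to the network. pack_kernel, d_src,
   d_pack, count and start_network_send() are hypothetical placeholders. */
cudaStream_t stream;
cudaEvent_t  pack_done;
int threads = 256, blocks = (count + 255) / 256;

cudaStreamCreate(&stream);
cudaEventCreate(&pack_done);

/* steps 1-2: inside MPI_Isend, launch packing and record the event */
pack_kernel<<<blocks, threads, 0, stream>>>(d_src, d_pack, count);
cudaEventRecord(pack_done, stream);
/* MPI_Isend returns here without blocking */

/* step 3: inside MPI_Wait/MPI_Waitall, poll the event, then send */
while (cudaEventQuery(pack_done) != cudaSuccess) {
    /* make progress on other requests */
}
start_network_send(d_pack, count);     /* hypothetical send of the packed data */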

SLIDE 13

Event-based Design

[Sequence diagram across HCA, CPU and GPU: each MPI_Isend() launches pack_kernel1<<< >>>, pack_kernel2<<< >>>, pack_kernel3<<< >>> on the GPU and records an event with cudaEventRecord(); in MPI_Waitall() the CPU queries the events and makes progress, the HCA performs the sends, and the requests complete when the send completions arrive]

SLIDE 14

Event-based Design

  • Major benefits
    – Overlap between CPU communication and GPU packing kernel
    – GPU resources are highly utilized
  • Limitation
    – CPU is required to keep checking the status of the event
      • Lower CPU utilization

MPI_Isend(Buf1, ..., request1);
MPI_Isend(Buf2, ..., request2);
MPI_Wait(request1, status1);
MPI_Wait(request2, status2);

SLIDE 15

Callback-based Design

  • CUDA Stream Callback
    – Launching work automatically on the CPU when something has completed on the CUDA stream
    – Restrictions:
      • Callbacks are processed by a driver thread, where no CUDA APIs can be called
      • Overhead when initializing the callback function
  • Basic design idea
    1. CPU launches a CUDA packing/unpacking kernel
    2. CPU adds a Callback function and then returns immediately
    3. Callback function wakes up a helper thread to process the communication
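A minimal sketch of this callback-based flow, assuming a pthread-based helper thread and a hypothetical pack_kernel, is given below; because no CUDA APIs may be called from the callback, it only signals the helper thread. This is illustrative only, not the MVAPICH2-GDR implementation.

#include <pthread.h>
#include <cuda_runtime.h>

/* Hedged sketch of the callback-based design: the CUDA callback runs on a
   driver thread where CUDA APIs are not allowed, so it only signals a helper
   thread, which then hands the packed buffer to the network.
   pack_kernel and start_network_send() are hypothetical placeholders. */
static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static int pack_complete = 0;

static void CUDART_CB on_pack_done(cudaStream_t s, cudaError_t status, void *arg)
{
    pthread_mutex_lock(&lock);
    pack_complete = 1;                  /* signal only; no CUDA calls here */
    pthread_cond_signal(&ready);
    pthread_mutex_unlock(&lock);
}

static void *helper_thread(void *arg)
{
    pthread_mutex_lock(&lock);
    while (!pack_complete)
        pthread_cond_wait(&ready, &lock);
    pthread_mutex_unlock(&lock);
    /* start_network_send(d_pack, count): process the communication */
    return NULL;
}

/* inside MPI_Isend (steps 1-2): launch the kernel, add the callback, return
   pack_kernel<<<blocks, threads, 0, stream>>>(d_src, d_pack, count);
   cudaStreamAddCallback(stream, on_pack_done, NULL, 0);                 */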

SLIDE 16

Callback-based Design

[Sequence diagram across HCA, CPU (main, helper and callback threads) and GPU: each MPI_Isend() launches pack_kernel1<<< >>>, pack_kernel2<<< >>>, pack_kernel3<<< >>> and registers a callback with addCallback(); the main thread continues with CPU computations and MPI_Waitall(), each callback wakes the helper thread, the HCA performs the sends, and the requests complete when the send completions arrive]

SLIDE 17

Callback-based Design

  • Major benefits
    – Overlap between CPU communication and GPU packing kernel
    – Overlap between CPU communication and other computations
    – Higher CPU and GPU utilization

MPI_Isend(Buf1, ..., &requests[0]);
MPI_Isend(Buf2, ..., &requests[1]);
MPI_Isend(Buf3, ..., &requests[2]);
// Application work on the CPU
MPI_Waitall(requests, status);

SLIDE 18

Outline

  • Introduction
  • Proposed Designs
  • Performance Evaluation
    – Benchmark
    – HaloExchange-based Application Kernel
  • Conclusion

SLIDE 19

Overview of the MVAPICH2 Project

  • High Performance open-source MPI Library for InfiniBand, 10-40Gig/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
    – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
    – MVAPICH2-X (MPI + PGAS), available since 2011
    – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
    – Support for Virtualization (MVAPICH2-Virt), available since 2015
    – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
    – Used by more than 2,575 organizations in 80 countries
    – More than 376,000 (0.37 million) downloads from the OSU site directly
    – Empowering many TOP500 clusters (Nov ‘15 ranking)
      • 10th ranked 519,640-core cluster (Stampede) at TACC
      • 13th ranked 185,344-core cluster (Pleiades) at NASA
      • 25th ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
    – Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
    – http://mvapich.cse.ohio-state.edu
  • Empowering Top500 systems for over a decade
    – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
    – Stampede at TACC (10th in Nov ‘15, 519,640 cores, 5.168 PFlops)

SLIDE 20

Experimental Environments

  1. Wilkes cluster @ University of Cambridge
     – 2 NVIDIA K20c GPUs per node
     – Up to 32 GPU nodes
  2. CSCS cluster @ Swiss National Supercomputing Centre
     – Cray CS-Storm system
     – 8 NVIDIA K80 GPUs per node
     – Up to 96 GPUs over 12 nodes

SLIDE 21

Benchmark-level Evaluation - Performance

  • Modified ‘CUDA-Aware’ DDTBench (http://htor.inf.ethz.ch/research/datatypes/ddtbench/)

[Chart: normalized execution time vs. input size for the NAS_MG_y, SPECFEM3D_OC, WRF_sa and SPECFEM3D_CM tests, comparing Default, Event-based and Callback-based designs; lower is better; speedups of 2.7X, 3.4X, 2.6X and 1.5X are highlighted]

SLIDE 22

Benchmark-level Evaluation - Overlap

  • Modified ‘CUDA-Aware’ DDTBench for the NAS_MG_y test
    – Injected dummy computations

MPI_Isend(Buf1, ..., request1);
MPI_Isend(Buf2, ..., request2);
MPI_Isend(Buf3, ..., request3);
Dummy_comp(); // Application work on the CPU
MPI_Waitall(requests, status);

[Chart: achieved overlap (%) vs. input size for Default, Event-based and Callback-based designs; higher is better]

SLIDE 23

Application-level Evaluation - Halo Data Exchange

  • MeteoSwiss weather forecasting COSMO* application kernel @ CSCS cluster
  • Multi-dimensional data
    – Contiguous on one dimension
    – Non-contiguous on other dimensions
  • Halo data exchange
    – Duplicate the boundary
    – Exchange the boundary

*http://www.cosmo-model.org/
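To make the halo-exchange pattern concrete, below is a small sketch that exchanges one non-contiguous boundary column of a 2-D, row-major grid using an MPI vector datatype; the grid layout, dimensions and the left neighbor rank are illustrative assumptions rather than the COSMO kernel itself.

#include <mpi.h>

/* Hedged sketch: exchange the non-contiguous left boundary column of an
   nx-by-ny row-major grid with the left neighbor. grid may be a device
   pointer when a CUDA-aware MPI is used. Illustrative only. */
void halo_exchange_left(double *grid, int nx, int ny, int left)
{
    MPI_Datatype column;
    MPI_Request  reqs[2];

    /* one grid column: nx blocks of 1 double with stride ny */
    MPI_Type_vector(nx, 1, ny, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    MPI_Isend(&grid[1], 1, column, left, 0, MPI_COMM_WORLD, &reqs[0]); /* first interior column */
    MPI_Irecv(&grid[0], 1, column, left, 0, MPI_COMM_WORLD, &reqs[1]); /* ghost column */

    /* interior computation (e.g., a CUDA kernel) can overlap here */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    MPI_Type_free(&column);
}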

SLIDE 24

Application-level (HaloExchange) Evaluation

[Charts: normalized execution time vs. number of GPUs for Default, Callback-based and Event-based designs on the CSCS GPU cluster (16-96 GPUs) and the Wilkes GPU cluster (4-32 GPUs); lower is better; up to 2X and 1.6X improvements are highlighted]

MPI_Isend(Buf1, ..., request1);
MPI_Isend(Buf2, ..., request2);
// Computations on GPU
MPI_Wait(request1, status1);
MPI_Wait(request2, status2);

SLIDE 25

Conclusion

  • Proposed designs can improve the overall performance and utilization of the CPU as well as the GPU
    – Event-based design: overlapping CPU communication with GPU computation
    – Callback-based design: further overlapping with CPU computation
  • Future Work
    – Non-blocking collective operations
    – Contiguous data movements
    – Next-generation GPU architectures
    – Will be available in the MVAPICH2-GDR library

SLIDE 26


Thank You!

Ching-Hsiang Chu

chu.368@osu.edu

The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/

SLIDE 27

Motivation – NVIDIA GPU Feature

  • NVIDIA CUDA Hyper-Q (Multi-stream) technology
    – Multiple CPU threads/processes can launch kernels on a single GPU simultaneously
    – Increasing GPU utilization and reducing CPU idle times

[Image: http://www.hpc.co.jp/images/hyper-q.png]
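As a brief illustration of the multi-stream idea, the sketch below launches independent kernels on separate CUDA streams so that a Hyper-Q capable GPU can execute them concurrently; small_kernel, the buffer array and the launch geometry are illustrative assumptions.

#include <cuda_runtime.h>

/* Hedged sketch: independent kernels issued on different streams may run
   concurrently on a Hyper-Q capable GPU. small_kernel and bufs are placeholders. */
__global__ void small_kernel(double *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0;           /* stand-in for packing work */
}

void launch_on_streams(double **bufs, int n, int nstreams)
{
    cudaStream_t streams[8];            /* assume nstreams <= 8 for this sketch */

    for (int s = 0; s < nstreams; s++) {
        cudaStreamCreate(&streams[s]);
        small_kernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(bufs[s], n);
    }
    for (int s = 0; s < nstreams; s++) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}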

SLIDE 28

Motivation – Non-Contiguous Data Movement in MPI

Manual packing/unpacking:

sbuf = malloc(…); rbuf = malloc(…);
/* Packing */
for (i = 1; i < n; i += 2)
    sbuf[i] = matrix[i][0];
MPI_Send(sbuf, n, MPI_DOUBLE, …);
MPI_Recv(rbuf, n, MPI_DOUBLE, …);
/* Unpacking */
for (i = 1; i < n; i += 2)
    matrix[i][0] = rbuf[i];
free(sbuf); free(rbuf);

Using MPI Datatypes:

MPI_Datatype nt;
MPI_Type_vector(1, 1, n, MPI_DOUBLE, &nt);
MPI_Type_commit(&nt);
MPI_Send(matrix, 1, nt, …);
MPI_Recv(matrix, 1, nt, …);

  • No explicit copies in applications
    Ø Better performance
  • Less code
    Ø Higher productivity