Exploiting Computation and Communication Overlap in MVAPICH2 MPI Library
Keynote Talk at the Charm++ Workshop (April ’18)
Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: panda@cse.ohio-state.edu
High-End Computing (HEC): Towards Exascale
Expected to have an ExaFlop system in 2020-2022! (100 PFlops reached in 2016; 1 EFlops in 2020-2022?)
Parallel Programming Models Overview
[Diagram: three classes of programming models]
- Shared Memory Model (SHMEM, DSM): processes P1-P3 share a single memory
- Distributed Memory Model (MPI, the Message Passing Interface): each process has its own memory
- Partitioned Global Address Space (PGAS) (Global Arrays, UPC, Chapel, X10, CAF, …): a logical shared memory layered over per-process memories
- Programming models provide abstract machine models
- Models can be mapped onto different types of systems
  – e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
- PGAS models and hybrid MPI+PGAS models are gradually gaining importance
- Task-based models (Charm++) are being used extensively
Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges
[Diagram: co-design opportunities and challenges across various layers]
- Application Kernels/Applications
- Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Charm++, Hadoop (MapReduce), Spark (RDD, DAG)
- Middleware: communication library or runtime for programming models (point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance), with performance, scalability, and resilience as cross-cutting goals
- Networking Technologies (InfiniBand, 40/100 GigE, Aries, and Omni-Path), Multi-/Many-core Architectures, and Accelerators (GPU and FPGA)
Basic Concept of Overlapping Communication with Computation
[Diagram: Networking Technology -> Runtime (MPI, Charm++) -> Application]
- The networking technology exposes overlap capabilities through its network mechanisms
- The runtime designs primitives that exploit those overlap capabilities
- The application takes advantage of overlap, either transparently or through co-design
Overview of the MVAPICH2 Project
- High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002
– MVAPICH2-X (MPI + PGAS), available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
– Support for Virtualization (MVAPICH2-Virt), available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM), available since 2015
– Used by more than 2,875 organizations in 86 countries
– More than 462,000 (> 0.46 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov ‘17 ranking)
- 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
- 9th, 556,104 cores (Oakforest-PACS) in Japan
- 12th, 368,928-core (Stampede2) at TACC
- 17th, 241,108-core (Pleiades) at NASA
- 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology
– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
- Empowering Top500 systems for over a decade
MVAPICH2 Release Timeline and Downloads
[Figure: cumulative number of downloads from Sep ’04 to Jan ’18, annotated with release milestones from MV 0.9.4 and MV2 0.9.0 through MV2 2.3rc1, MV2-GDR 2.3a, MV2-X 2.3b, MV2-Virt 2.2, and OSU INAM 0.9.3]
Architecture of MVAPICH2 Software Family
- High-Performance Parallel Programming Models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
- High-Performance and Scalable Communication Runtime with diverse APIs and mechanisms: point-to-point primitives, collective algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis
- Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
- Transport protocols: RC, XRC, UD, DC
- Transport mechanisms: shared memory, CMA, IVSHMEM, XPMEM*
- Modern features: UMR, ODP, SR-IOV, multi-rail, MCDRAM*, NVLink*, CAPI*
(* upcoming)
MVAPICH2 Software Family
High-Performance Parallel Programming Libraries:
- MVAPICH2: support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
- MVAPICH2-X: advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime
- MVAPICH2-GDR: optimized MPI for clusters with NVIDIA GPUs
- MVAPICH2-Virt: high-performance and scalable MPI for hypervisor- and container-based HPC cloud
- MVAPICH2-EA: energy-aware and high-performance MPI
- MVAPICH2-MIC: optimized MPI for clusters with Intel KNC
Microbenchmarks:
- OMB: microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs
Tools:
- OSU INAM: network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
- OEMT: utility to measure the energy consumption of MPI applications
- MVAPICH2/MVAPICH2-X
– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication
- MVAPICH2-GDR
– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing
- Deep Learning Application: OSU Caffe
Presentation Outline
Overlapping Application Compute with MPI Startup
[Diagram 1: P0-P3 complete MPI_Init (initialize HCA, obtain endpoint address, exchange addresses) before the application sets up the problem, reads input files, and computes/communicates; no overlap between MPI_Init and application computation]
[Diagram 2: communication-independent tasks (set up problem, read input files) run while MPI continues to initialize in the background, so MPI_Init overlaps with application startup]
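A minimal sketch (my own illustration, not from the slides; read_input_files and setup_problem are placeholder routines) of the application-visible pattern this enables: MPI_Init returns early and communication-independent setup overlaps with the library's background wire-up.

    #include <mpi.h>
    #include <stdio.h>

    /* Placeholder setup routines (hypothetical, for illustration only). */
    static void read_input_files(void) { /* parse config, read meshes, ... */ }
    static void setup_problem(void)    { /* allocate and initialize local data */ }

    int main(int argc, char **argv)
    {
        /* With optimized startup, MPI_Init can return before all endpoint
           addresses are exchanged; wire-up continues in the background. */
        MPI_Init(&argc, &argv);

        /* Communication-independent work overlaps with the remaining startup. */
        read_input_files();
        setup_problem();

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* First real communication; connections complete by now or on demand. */
        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0) printf("startup overlapped with problem setup\n");

        MPI_Finalize();
        return 0;
    }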
- Near-constant MPI and OpenSHMEM initialization time at any process count
- 10x and 30x improvement in startup time of MPI and OpenSHMEM, respectively, at 16,384 processes
- Memory consumption for remote endpoint information reduced by O(processes per node)
- 1 GB of memory saved per node with 1M processes and 16 processes per node
Towards High Performance and Scalable Startup at Exascale
[Diagram legend: P = PGAS (state of the art), M = MPI (state of the art), O = PGAS/MPI (optimized); techniques shown: On-demand Connection, PMIX_Ring, PMIX_Ibarrier, PMIX_Iallgather, and Shmem-based PMI, evaluated on job startup performance and on memory required to store endpoint information]
- On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. S. Chakraborty, H. Subramoni, J. Perkins, A. A. Awan, and D. K. Panda, 20th International Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS ’15)
- PMI Extensions for Scalable MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, J. Perkins, M. Arnold, and D. K. Panda, Proceedings of the 21st European MPI Users' Group Meeting (EuroMPI/Asia ’14)
- Non-blocking PMI Extensions for Fast MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D. K. Panda, 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid ’15)
- SHMEMPMI – Shared Memory based PMI for Improved Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid ’16)
Startup Performance on KNL + Omni-Path
[Figure: MPI_Init time on TACC Stampede-KNL (Intel MPI 2018 beta vs. MVAPICH2 2.3a, 64 to 232K processes) and MPI_Init & Hello World time on Oakforest-PACS (MVAPICH2 2.3a, 64 to 64K processes)]
- MPI_Init takes 51 seconds on 231,956 processes on 3,624 KNL nodes (Stampede – Full scale)
- 8.8 times faster than Intel MPI at 128K processes (Courtesy: TACC)
- At 64K processes, MPI_Init and Hello World take 5.8 s and 21 s, respectively (Oakforest-PACS)
- All numbers reported with 64 processes per node
New designs available in MVAPICH2-2.3a and as patch for SLURM-15.08.8 and SLURM-16.05.1
- SHMEMPMI allows MPI processes to directly read remote endpoint (EP) information from the process manager through shared memory segments
- Only a single copy per node; O(processes per node) reduction in memory usage
- Estimated savings of 1 GB per node with 1 million processes and 16 processes per node
- Up to 1,000 times faster PMI Gets compared to the default design
- Available since MVAPICH2 2.2rc1 and SLURM-15.08.8
Process Management Interface (PMI) over Shared Memory (SHMEMPMI)
[Figure: time taken by one PMI_Get vs. processes per node (Default vs. SHMEMPMI; 1000x estimated, 16x measured improvement) and memory usage per node for remote EP information vs. processes per job (Fence and Allgather, Default vs. Shmem)]
On-demand Connection Management for OpenSHMEM+MPI
[Figure: breakdown of OpenSHMEM startup time (connection setup, PMI exchange, memory registration, shared memory setup, other) and performance of OpenSHMEM initialization and Hello World, static vs. on-demand, up to 8K processes]
- Static connection establishment wastes memory and takes considerable time
- On-demand connection management improves OpenSHMEM initialization time by 29.6 times
- Time taken for Hello World reduced by 8.31 times at 8,192 processes
- Available since MVAPICH2-X 2.1rc1
- MVAPICH2/MVAPICH2-X
– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication
- MVAPICH2-GDR
– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing
- Deep Learning Application: OSU Caffe
Presentation Outline
– Good communication performance for smaller messages – No synchronization required between sender and receiver – Cost of extra copies is high for large messages
Communication Costs of Point-to-point Protocols - Eager
[Diagram: the sender copies application data into a pre-registered communication buffer (cost: memcpy), the buffer is sent over the network (cost: network transfer), and the receiver copies it out of its pre-registered buffer into application data (cost: memcpy)]
Communication Costs of Point-to-point Protocols - Rendezvous
[Diagram: rendezvous handshake; three control messages costing half an RTT each, plus the network transfer of the payload]
– Avoid extra copies for larger messages – Synchronization required between sender and receiver – Can be based on RDMA Read or RDMA Write (shown here)
- Application processes schedule the communication operation
- Network adapter progresses the communication in the background
- Application process is free to perform useful compute in the foreground
- Overlap of computation and communication => better overall application performance (a sketch of this pattern follows below)
- Increased buffer requirement
- Poor communication performance if used for all types of communication operations
Analyzing Overlap Potential of Eager Protocol
[Diagram: each application process schedules its send/receive operation, the network interface card progresses it in the background, and a later completion check finds the operation complete]
Impact of changing Eager Threshold on performance of multi-pair message-rate benchmark with 32 processes on Stampede
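A minimal sketch of the overlap pattern analyzed above (my own illustration, not from the slides; the pairwise exchange and independent_compute routine are placeholders): post non-blocking send/receive, perform independent computation, and only then wait, so the network adapter can progress the transfer in the background.

    #include <mpi.h>
    #include <stdlib.h>

    /* Placeholder for computation that does not depend on the incoming data. */
    static void independent_compute(double *v, int n)
    {
        for (int i = 0; i < n; i++) v[i] = v[i] * 1.0001 + 1.0;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n = 1 << 20;              /* large enough to cross typical eager thresholds */
        double *sbuf = malloc(n * sizeof(double));
        double *rbuf = malloc(n * sizeof(double));
        double *work = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) { sbuf[i] = rank; work[i] = i; }

        int peer = rank ^ 1;                /* simple pairwise exchange */
        if (peer < size) {
            MPI_Request reqs[2];
            /* Schedule the exchange; the HCA can progress it in the background. */
            MPI_Irecv(rbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
            MPI_Isend(sbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

            independent_compute(work, n);   /* overlapped computation on a separate buffer */

            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        }

        free(sbuf); free(rbuf); free(work);
        MPI_Finalize();
        return 0;
    }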
- Application processes schedule the communication operation
- Application process is free to perform useful compute in the foreground
- Little communication progress in the background
- All communication takes place at the final synchronization
- Reduced buffer requirement
- Good communication performance if used for large message sizes and for operations where the communication library is progressed frequently
- Poor overlap of computation and communication => poor overall application performance
Analyzing Overlap Potential of Rendezvous Protocol
[Diagram: sender and receiver schedule their operations, but the RTS/CTS handshake and data transfer make little progress until the processes re-enter the MPI library; repeated completion checks return "not complete" until the final synchronization]
Impact of Tuning Eager Threshold on 3D-Stencil
[Figure: 3D-stencil with Default vs. Tuned eager threshold; panels show communication time (latency vs. message size), overlap potential (%), and overall performance]
- Increased eager threshold from 16KB to 512KB
- Very small degradation in raw communication performance
- Significant improvement in overlap of computation and communication
- ~18% Improvement in overall performance
Runtime parameters: MV2_IBA_EAGER_THRESHOLD=512K, MV2_SMP_EAGERSIZE=512K (applicable to both InfiniBand and Omni-Path); 8,192 processes, SandyBridge + FDR
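For reference, a hedged launch sketch (the stencil binary name and host file are placeholders; the environment variables are the ones listed above) of how such thresholds are typically passed to MVAPICH2's mpirun_rsh launcher:

    mpirun_rsh -np 8192 -hostfile hosts MV2_IBA_EAGER_THRESHOLD=512K MV2_SMP_EAGERSIZE=512K ./stencil3d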
Impact of Tuning Rendezvous Protocol on 3D-Stencil
- RDMA Read based protocol (RGET) used instead of RDMA Write
- Very minor penalty in raw performance
- Offers more overlap due to less synchronization overhead
- Up to 15% improvement in overall execution time
Runtime parameter: MV2_RNDV_PROTOCOL=RGET (applicable to InfiniBand); 64 processes, Broadwell + EDR
[Figure: 3D-stencil with Default (RDMA-Write based) vs. Tuned (RGET) rendezvous protocol; panels show communication time, overlap potential (%), and overall performance vs. message size]
Dynamic and Adaptive MPI Point-to-point Communication Protocols
- Different communication protocols have different trade-offs
– Need to consider performance, overlap, memory requirement – Manual tuning is difficult and time-consuming
- Can the MPI library select the best protocol at runtime?
– Use different protocols and thresholds between different pair of processes – Deliver good performance and minimize resource consumption – Dynamically adapt to the application’s communication requirements at runtime
Design | Metrics: Overlap & Memory Requirement | Metrics: Performance & Productivity
Default | Poor overlap; low memory requirement | Low performance; high productivity
Manually Tuned | Good overlap; high memory requirement | High performance; low productivity
Dynamic + Adaptive | Good overlap; optimal memory requirement | High performance; high productivity
Dynamic and Adaptive MPI Point-to-point Communication Protocols (Cont.)
[Diagram: eager thresholds for an example communication pattern between processes on Node 1 and Node 2; Default uses 16 KB for every pair, Manually Tuned uses 128 KB for every pair, and Dynamic + Adaptive uses per-pair thresholds of 32 KB, 64 KB, 128 KB, and 32 KB]
- H. Subramoni, S. Chakraborty, D. K. Panda, Designing Dynamic & Adaptive MPI Point-to-Point Communication Protocols for Efficient Overlap of Computation & Communication, ISC'17 - Best Paper
[Figure: execution time and relative memory consumption of Amber at 128-1K processes with Default, Threshold=17K, Threshold=64K, Threshold=128K, and Dynamic Threshold designs]
Desired Eager Threshold
Process Pair | Eager Threshold (KB)
0 – 4 | 32
1 – 5 | 64
2 – 6 | 128
3 – 7 | 32
- MVAPICH2/MVAPICH2-X
– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication
- MVAPICH2-GDR
– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing
- Deep Learning Application: OSU Caffe
Presentation Outline
- Non-blocking one-sided communication routines
– Put, Get (Rput, Rget) – Accumulate, Get_accumulate – Atomics
- Flexible synchronization operations to control initiation and completion
MPI-3 RMA: Communication and synchronization Primitives
MPI one-sided synchronization/completion primitives:
- Synchronization: Lock/Unlock, Lock_all/Unlock_all, Fence, Post-Wait/Start-Complete, Win_sync
- Completion: Flush, Flush_all, Flush_local, Flush_local_all
MVAPICH2 supports all RMA communication primitives with best performance and overlap (a sketch of the lock/flush overlap pattern follows below)
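A minimal sketch (my own illustration, mirroring the MPI_Put with Lock/Unlock and MPI_Win_flush pattern evaluated on the next slide; the neighbor target and local compute loop are placeholders) of overlapping a one-sided Put with computation:

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n = 1 << 18;
        double *win_buf, *local = malloc(n * sizeof(double));
        MPI_Win win;
        MPI_Win_allocate(n * sizeof(double), sizeof(double),
                         MPI_INFO_NULL, MPI_COMM_WORLD, &win_buf, &win);

        int target = (rank + 1) % size;
        for (int i = 0; i < n; i++) local[i] = rank;

        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);

        /* Initiate the one-sided transfer; the HCA can progress it via RDMA. */
        MPI_Put(local, n, MPI_DOUBLE, target, 0, n, MPI_DOUBLE, win);

        /* Placeholder computation overlapped with the RMA operation. */
        double acc = 0.0;
        for (int i = 0; i < n; i++) acc += local[i] * 0.5;

        /* Complete the outstanding Put at the target, then release the lock. */
        MPI_Win_flush(target, win);
        MPI_Win_unlock(target, win);

        MPI_Win_free(&win);
        free(local);
        (void)acc;
        MPI_Finalize();
        return 0;
    }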
[Figure: overlap (%) vs. message size for MPI_Put with Lock/Unlock and MPI_Win_flush, MVAPICH2-2.3rc1; platform: Intel Haswell (E5-2687W @ 3.10 GHz) node, 20 cores, Mellanox ConnectX-4 EDR HCA, Mellanox OFED 4.3]
Overlap between Computation and RMA Operations
- 67-99% overlap between MPI_Put and computation
- 75-99% overlap between MPI_Get and computation
[Figure: overlap (%) vs. message size for MPI_Get with Lock/Unlock and MPI_Win_flush, MVAPICH2-2.3rc1]
- The proposed design performs better than the default implementation
- For Weakly Connected Components (WCC) on 256 cores, the proposed design reduces total execution time by 2X compared with the default scheme
Graph Processing Framework with Optimized MPI RMA
[Figure: execution time vs. number of cores (64-256) for PageRank and WCC with the LiveJournal1 dataset, Mizan-Default vs. Mizan-RMA-Opt; up to 2X (WCC) and 3X improvements; lower is better]
- M. Li, X. Lu, K. Hamidouche, J. Zhang and D. K. Panda, "Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA," IEEE HiPC, 2016
- Proposed design shows good strong scaling
- Proposed design scales better than default implementation
Graph Processing Framework with Optimized MPI RMA
[Figure: total execution time of PageRank with the Arabic dataset at 128 processes, Mizan-Default vs. Mizan-RMA-Opt; up to 2.5X improvement; lower is better]
- M. Li, X. Lu, K. Hamidouche, J. Zhang and D. K. Panda, "Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA," IEEE HiPC, 2016
- MVAPICH2/MVAPICH2-X
– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication
- MVAPICH2-GDR
– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing
- Deep Learning Application: OSU Caffe
Presentation Outline
Collective Communication in MVAPICH2
Run-time flags:
- All shared-memory based collectives: MV2_USE_SHMEM_COLL (default: ON)
- Hardware multicast-based collectives: MV2_USE_MCAST (default: OFF)
- CMA-based collectives: MV2_USE_CMA_COLL (default: ON)
[Diagram: blocking and non-blocking collective algorithms in MV2, with conventional (flat) and multi-/many-core aware designs; inter-node communication via point-to-point, hardware multicast, SHARP, and RDMA; intra-node communication via point-to-point (SHMEM, LiMIC, CMA, XPMEM), direct shared memory, and direct kernel-assisted (CMA, XPMEM, LiMIC) mechanisms; designed for performance & overlap]
Hardware Multicast-aware MPI_Bcast on TACC Stampede
[Figure: MPI_Bcast latency with Default vs. Multicast designs on TACC Stampede; small and large messages at 102,400 cores, and 16-byte and 32-KByte messages vs. number of nodes]
- MCAST-based designs improve latency of MPI_Bcast by up to 85%
- Use MV2_USE_MCAST=1 to enable MCAST-based designs
Optimized CMA-based Collectives for Large Messages
[Figure: MPI_Gather latency vs. message size on KNL with 2 nodes/128 procs, 4 nodes/256 procs, and 8 nodes/512 procs; MVAPICH2-2.3a, Intel MPI 2017, OpenMPI 2.1.0, and Tuned CMA]
- Significant improvement over existing implementations for Scatter/Gather with 1 MB messages (up to 4x on KNL, 2x on Broadwell, 14x on OpenPOWER)
- New two-level algorithms for better scalability
- Improved performance for other collectives (Bcast, Allgather, and Alltoall)
- S. Chakraborty, H. Subramoni, and D. K. Panda, Contention-Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems, IEEE Cluster ’17, Best Paper Finalist
Performance of MPI_Gather on KNL nodes (64PPN)
Available in MVAPICH2-X 2.3b
Shared Address Space (XPMEM)-based Collectives Design
[Figure: OSU_Allreduce latency vs. message size (Broadwell, 256 processes): MVAPICH2-2.3b, IMPI-2017v1.132, MVAPICH2-Opt]
- “Shared Address Space”-based true zero-copy reduction collective designs in MVAPICH2
- Offloaded computation/communication to peer ranks in the reduction collective operation
- Up to 4X improvement for 4 MB Reduce and up to 1.8X improvement for 4 MB Allreduce
[Figure: OSU_Reduce latency vs. message size (Broadwell, 256 processes): MVAPICH2-2.3b, IMPI-2017v1.132, MVAPICH2-Opt]
- J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. K. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS ’18), May 2018
Will be available in a future release
Application-Level Benefits of XPMEM-Based Collectives
MiniAMR (Broadwell, ppn=16)
- Up to 20% benefit over IMPI for CNTK DNN training using Allreduce
- Up to 27% benefit over IMPI and up to 15% improvement over MVAPICH2 for the MiniAMR application kernel
[Figure: execution time vs. number of processes for CNTK AlexNet training (Broadwell, default batch size, 50 iterations, ppn=28) and for MiniAMR: IMPI-2017v1.132, MVAPICH2-2.3b, MVAPICH2-Opt]
Problems with Blocking Collective Operations
[Diagram: four application processes blocked in a collective; communication time cannot be used for computation]
- Communication time cannot be used for compute
– No overlap of computation and communication – Inefficient
- Application processes schedule the collective operation
- Check periodically if the operation is complete
- Overlap of computation and communication => better performance
- Catch: who will progress the communication?
Concept of Non-blocking Collectives
[Diagram: each application process schedules the operation with a communication support entity, computes, and periodically checks for completion]
- Enables overlap of computation with communication
- Non-blocking calls do not match blocking collective calls
– MPI may use different algorithms for blocking and non-blocking collectives – Blocking collectives: Optimized for latency – Non-blocking collectives: Optimized for overlap
- A process calling an NBC operation
  – Schedules the collective operation and immediately returns
  – Executes application computation code
  – Waits for the end of the collective
- The communication is progressed by
  – Application code through MPI_Test
  – Network adapter (HCA) with hardware support
  – Dedicated processes / threads in the MPI library
- There is a non-blocking equivalent for each blocking operation
  – Has an “I” in the name
  - MPI_Bcast -> MPI_Ibcast; MPI_Reduce -> MPI_Ireduce
Non-blocking Collective (NBC) Operations
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    …..
    MPI_Request req;
    MPI_Ialltoall(…, &req);
    /* Computation that does not depend on the result of the Alltoall */
    int flag;
    MPI_Test(&req, &flag, MPI_STATUS_IGNORE);   /* Check if complete (non-blocking) */
    /* Computation that does not depend on the result of the Alltoall */
    MPI_Wait(&req, MPI_STATUS_IGNORE);          /* Wait till complete (blocking) */
    …
    MPI_Finalize();
}
How do I write applications with NBC?
P3DFFT Performance with Non-Blocking Alltoall using RDMA Primitives
- Weak scaling experiments; problem size increases with job size
- RDMA-Aware delivers 19% improvement over Default @ 8,192 procs
- Default-Thread exhibits the worst performance
  – Possibly because threads steal CPU cycles from P3DFFT
  – Not considered for large-scale experiments
[Figure: CPU time per loop vs. number of processes; large-scale runs (Default vs. RDMA-Aware, up to 8K processes, 19% improvement) and small-scale runs (Default, RDMA-Aware, Default-Thread)]
- H. Subramoni, A. Awan, K. Hamidouche, D. Pekurovsky, A. Venkatesh, S. Chakraborty, K. Tomko, and D. K. Panda, Designing Non-Blocking Personalized Collectives with Near Perfect Overlap for RDMA-Enabled Clusters, ISC ’15, Jul 2015
Will be available in a future release
- Management and execution of MPI operations in the network by using SHArP
- Manipulation of data while it is being transferred in the switch network
- SHArP provides an abstraction to realize the reduction operation
- Defines Aggregation Nodes (AN), Aggregation Tree, and Aggregation Groups
- AN logic is implemented as an InfiniBand Target Channel Adapter (TCA) integrated into the switch ASIC *
- Uses RC for communication between ANs and between ANs and hosts in the Aggregation Tree *
Offloading with Scalable Hierarchical Aggregation Protocol (SHArP)
Physical Network Topology* Logical SHArP Tree*
* Bloch et al. Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction
[Figure: MPI_Iallreduce benchmark, 1 PPN, 8 nodes; pure communication latency (lower is better) and communication-computation overlap in % (higher is better) vs. message size, MVAPICH2 vs. MVAPICH2-SHArP; up to 2.3x annotated improvement]
Evaluation of SHArP based Non Blocking Allreduce
MPI_Iallreduce Benchmark
*PPN: Processes Per Node
- Complete offload of the Allreduce collective operation to the switch enables much higher overlap of communication and computation
Available since MVAPICH2 2.3a
- Mellanox ConnectX-2, ConnectX-3, Connect-IB, ConnectX-4, and ConnectX-5 adapters feature a “task-list” offload interface
  – Extension to existing InfiniBand APIs
- Collective communication with a `blocking’ feature is usually a scaling bottleneck
  – Matches the need for non-blocking collectives in MPI
- Accordingly, MPI software stacks need to be re-designed to leverage offload in a comprehensive manner
- Can applications be modified to take advantage of non-blocking collectives, and what will be the benefits?
Collective Offload in ConnectX-2, ConnectX-3, Connect-IB and ConnectX-4, ConnectX-5
Collective Offload Support in ConnectX InfiniBand Adapter (Recv followed by Multi-Send)
- Sender creates a task-list consisting of only send and wait WQEs
  – One send WQE is created for each registered receiver and is appended to the rear of a singly linked task-list
  – A wait WQE is added to make the ConnectX-2 HCA wait for an ACK packet from the receiver
[Diagram: InfiniBand HCA with Send Q, Recv Q, Send CQ, Recv CQ, MQ, and MCQ attached to the physical link; the task list holds Send, Wait, Send, Send, Send, Wait WQEs]
Co-designing HPL with Core-Direct and Performance Benefits
[Figure: normalized HPL performance vs. problem size (N as % of total memory) with 512 processes, and throughput (GFlops) / memory consumption vs. system size; HPL-Offload, HPL-1ring, HPL-Host]
- HPL-Offload consistently offers higher throughput than HPL-1ring and HPL-Host; improves peak throughput by up to 4.5% for large problem sizes
- HPL-Offload surpasses the peak throughput of HPL-1ring with significantly smaller problem sizes and run-times!
- K. Kandalla, H. Subramoni, J. Vienne, S. Pai Raikar, K. Tomko, S. Sur, and D. K. Panda, Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HOTI 2011
Pre-conditioned Conjugate Gradient (PCG) Solver Performance with Non-Blocking Allreduce based on CX-2 Collective Offload
[Figure: PCG run-time (s) vs. number of processes (64-512); PCG-Default vs. Modified-PCG-Offload]
- 64,000 unknowns per process; modified PCG with Offload-Allreduce performs 21% (21.8%) better than the default PCG (a sketch of the overlapped-Allreduce pattern follows below)
- K. Kandalla, U. Yang, J. Keasler, T. Kolev, A. Moody, H. Subramoni, K. Tomko, J. Vienne, and D. K. Panda, Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS ’12, May 2012
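A minimal sketch of one iteration of the general pattern (my own illustration, not the actual PCG code from the study; local_spmv and local_dot are placeholder kernels): the dot-product Allreduce is posted as MPI_Iallreduce and overlapped with local matrix-vector work before its result is needed.

    #include <mpi.h>

    /* Placeholder local kernels (hypothetical stand-ins for SpMV and dot product). */
    static void   local_spmv(const double *x, double *y, int n) { for (int i = 0; i < n; i++) y[i] = 2.0 * x[i]; }
    static double local_dot(const double *a, const double *b, int n) { double s = 0; for (int i = 0; i < n; i++) s += a[i] * b[i]; return s; }

    void cg_like_iteration(double *p, double *q, double *r, int n, MPI_Comm comm)
    {
        /* Start the global reduction of the local dot product ... */
        double local = local_dot(r, r, n), global = 0.0;
        MPI_Request req;
        MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm, &req);

        /* ... and overlap it with communication-independent local work (q = A*p). */
        local_spmv(p, q, n);

        /* The reduced value is needed only now. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        /* use 'global' to update the search direction / solution ... */
        (void)global;
    }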
- MVAPICH2/MVAPICH2-X
– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication
- MVAPICH2-GDR
– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing
- Deep Learning Application: OSU Caffe
Presentation Outline
At Sender:
    MPI_Send(s_devbuf, size, …);
At Receiver:
    MPI_Recv(r_devbuf, size, …);
Inside MVAPICH2:
- Standard MPI interfaces used for unified data movement
- Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
- Overlaps data movement from the GPU with RDMA transfers
High Performance and High Productivity
GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
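A minimal end-to-end sketch (my own, assuming a CUDA-aware MVAPICH2 build, two ranks, and CUDA support enabled, e.g. with MV2_USE_CUDA=1) of passing device pointers directly to MPI:

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;
        float *devbuf;
        cudaMalloc((void **)&devbuf, n * sizeof(float));
        cudaMemset(devbuf, 0, n * sizeof(float));

        /* Device pointers are passed directly; the CUDA-aware library moves the
           data (pipelined staging or GPUDirect RDMA) without an explicit
           cudaMemcpy in the application. */
        if (rank == 0)
            MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }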
CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases
- Support for MPI communication from NVIDIA GPU device memory
- High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
- High-performance intra-node point-to-point communication for nodes with multiple GPUs (GPU-GPU, GPU-Host, and Host-GPU)
- Takes advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication with multiple GPU adapters per node
- Optimized and tuned collectives for GPU device buffers
- MPI datatype support for point-to-point and collective communication from GPU device buffers
- Unified Memory support
- OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox
- OSU has a design of MVAPICH2 using GPUDirect RDMA
  – Hybrid design using GPUDirect RDMA and host-based pipelining
    - Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge; similar bottlenecks on Haswell
  – Support for communication using multi-rail
  – Support for Mellanox Connect-IB and ConnectX VPI adapters
  – Support for RoCE with Mellanox ConnectX VPI adapters
GPU-Direct RDMA (GDR) with CUDA
[Diagram: CPU, chipset, system memory, GPU, GPU memory, and IB adapter connectivity]
P2P bandwidth | SNB E5-2670 intra-socket | SNB E5-2670 inter-sockets | IVB E5-2680V2 intra-socket | IVB E5-2680V2 inter-sockets
P2P read | <1.0 GB/s | <300 MB/s | 3.5 GB/s | <300 MB/s
P2P write | 5.2 GB/s | <300 MB/s | 6.4 GB/s | <300 MB/s
[Figure: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth vs. message size; MV2 (no GDR) vs. MV2-GDR 2.3a; annotated ~10x bandwidth and ~9x bi-bandwidth improvements, and 1.88 us latency (11X improvement). Platform: Intel Haswell (E5-2687W @ 3.10 GHz) node, 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPUDirect RDMA]
Optimized MVAPICH2-GDR Design
[Figure: GPU-GPU inter-node and intra-node overlap (%) vs. message size; MVAPICH2 (no GDR) vs. MVAPICH2-GDR 2.3a. Platform: Intel Haswell (E5-2687W @ 3.10 GHz) node, 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPUDirect RDMA]
Overlap with Optimized MVAPICH2-GDR Design
- Up to 69% overlap* for intra-node GPU-GPU communication
- With GDR, up to 78% overlap* for inter-node small and medium message transfers
- With intelligent pipelining, up to 88% overlap* for inter-node large message transfers
*Overlap between GPU-to-GPU communication and CPU computation
- Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
- HoomdBlue Version 1.0.5
- GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
Application-Level Evaluation (HOOMD-blue)
[Figure: average time steps per second (TPS) vs. number of processes (4-32) for 64K and 256K particles; MV2 vs. MV2+GDR, ~2X improvement]
Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland
[Figure: normalized execution time vs. number of GPUs on the CSCS GPU cluster (16-96 GPUs) and the Wilkes GPU cluster (4-32 GPUs); Default, Callback-based, and Event-based designs]
- 2X improvement on 32 GPU nodes
- 30% improvement on 96 GPU nodes (8 GPUs/node)
- C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS ’16
On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and Cosmo Application
Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
- MVAPICH2/MVAPICH2-X
– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication
- MVAPICH2-GDR
– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing
- Deep Learning Application: OSU Caffe
Presentation Outline
- Applications use GPU/CPU resources for computation and MPI for communication directly from GPU buffers
- MPI collectives are common in GPU applications, e.g., Alltoall for FFTs
- Collectives become time-consuming at scale, so MPI-3.0 introduced non-blocking collectives (NBCs)
- Non-blocking communication operations from GPU buffers can
  – Allow the CPU to overlap GPU-based communication with CPU compute
  – Ease GPU-kernel redundancy in waiting for non-dependent communication
  – Allow power-efficient execution from the CPU perspective
- A rich set of GPU and network primitives is available for NBC designs, but architectural limitations must be addressed
Motivation: Exploiting CORE-Direct and GPUDirect RDMA
- A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, HiPC ’15
- Realized through mapping of the MPICH schedule abstraction
  – Schedule composed of sched_send, sched_barrier, sched_recv, sched_start, etc.
  – Mapped to Core-Direct primitives, with collective-specific GPU↔Host transfers handled additionally
- Multiple designs explored
  – Naïve Design: host-assisted GPU NBC (Scatter)
  – Offload-Staged: host-assisted GPU NBC + Core-Direct
  – Offload-GDR: (GDR + Core-Direct)-based NBC
  – Offload-Callback: (Core-Direct, GDR, CUDA)-based NBC
Overview of Core-Direct + GPUDirect Designs
- Use of GDR and CUDA callback mechanisms improves latency (comparable for Alltoall)
- Latency remains high for Alltoall even though the callback design avoids staging latency
Latency Comparison with Blocking Collectives
[Figure: 64-node Iallgather and Ialltoall latency (us) vs. message size (4K, 64K, 1MB); gdr-offload-cb, staged-offload, offload-gdr, and blocking designs]
- New schemes are able to exploit overlap well
Effect of Compute Location on Overlap/Latency
[Figure: 64-node Iallgather overlap (%) and latency (us) vs. message size (4K-256K); gdr-offload-cb and offload-gdr designs with compute placed on GPU vs. CPU]
Available in MVAPICH2-GDR 2.3a
- MVAPICH2/MVAPICH2-X
– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication
- MVAPICH2-GDR
– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing
- Deep Learning Application: OSU Caffe
Presentation Outline
- Scientific parallel applications spend a considerable amount of time in GPU-based collective communication operations
  – e.g., deep learning frameworks such as TensorFlow and Caffe
- Optimized computation-intensive collectives in MVAPICH2-GDR
  – MPI_Reduce and MPI_Allreduce
  – Exploring the best combinations of
    - Computation on: CPU or GPU
    - Communication through: host or GPU memory
GPU-kernel based Reduction
[Diagram: Node A and Node B, each with CPU, host memory, GPU, PCIe, and IB adapter; numbered data-movement paths for the reduction]
[Figure: MPI_Reduce latency vs. message size (small and large messages) at 96 GPUs on CSCS; Default, BD-DD, GR-DD, GR-HD, GR-HH, and GR-H-HH schemes]
Evaluation - MPI_Reduce @ CSCS (96 GPUs)
Gather-first approaches* win for small messages; the K-nomial GPU-based approach* wins for large messages
*Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan, and Dhabaleswar K. Panda, "CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters, ” IEEE/ACM CCGrid’16.
[Figure: MPI_Allreduce latency vs. message size at 96 GPUs on CSCS, and latency vs. system size (2-32 GPU nodes) on Wilkes showing good scalability; Default, RD-DD, and BRB-DD schemes]
Evaluation - MPI_Allreduce
Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan, and Dhabaleswar K. Panda, "CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters, ” IEEE/ACM CCGrid’16.
Available in MVAPICH2-GDR 2.3a
- MVAPICH2/MVAPICH2-X
– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication
- MVAPICH2-GDR
– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing
- Deep Learning Application: OSU Caffe
Presentation Outline
- Multi-dimensional data
  - Row-based organization
  - Contiguous in one dimension
  - Non-contiguous in the other dimensions
- Halo data exchange
  - Duplicate the boundary
  - Exchange the boundary in each iteration
[Figure: halo data exchange]
Non-contiguous Data Exchange
MPI Datatype support in MVAPICH2
- Datatypes support in MPI
– Operate on customized datatypes to improve productivity – Enable MPI library to optimize non-contiguous data
At Sender:
    MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
    MPI_Type_commit(&new_type);
    …
    MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);
- Inside MVAPICH2
- Use datatype specific CUDA Kernels to pack data in chunks
- Efficiently move data between nodes using RDMA
- In progress - currently optimizes vector and hindexed datatypes
- Transparent to the user
- H. Wang, S. Potluri, D. Bureddy, C. Rosales, and D. K. Panda, GPU-Aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation, IEEE Transactions on Parallel and Distributed Systems, Vol. 25, No. 10, pp. 2595-2605, Oct 2014
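A minimal sketch (my own illustration; the 2D grid dimensions, neighbor ranks, and halo columns are placeholders) of using MPI_Type_vector to exchange a non-contiguous column of a row-major 2D array, the pattern underlying the halo exchange above:

    #include <mpi.h>

    #define NX 64
    #define NY 64

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        static double grid[NX][NY];   /* row-major: columns are non-contiguous */

        /* One element per row, NY apart: describes a single column of the grid. */
        MPI_Datatype column_t;
        MPI_Type_vector(NX, 1, NY, MPI_DOUBLE, &column_t);
        MPI_Type_commit(&column_t);

        int right = (rank + 1) % size;
        int left  = (rank - 1 + size) % size;

        /* Send our last interior column right, receive the left halo column;
           the library packs/unpacks the strided data (on GPUs, via CUDA kernels). */
        MPI_Sendrecv(&grid[0][NY - 2], 1, column_t, right, 0,
                     &grid[0][0],      1, column_t, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        MPI_Type_free(&column_t);
        MPI_Finalize();
        return 0;
    }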
MPI Datatype Processing (Computation Optimization)
- Comprehensive support
- Targeted kernels for regular datatypes - vector, subarray, indexed_block
- Generic kernels for all other irregular datatypes
- Separate non-blocking stream for kernels launched by MPI library
- Avoids stream conflicts with application kernels
- Flexible set of parameters for users to tune kernels
- Vector
- MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE
- MV2_CUDA_KERNEL_VECTOR_YSIZE
- Subarray
- MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE
- MV2_CUDA_KERNEL_SUBARR_XDIM
- MV2_CUDA_KERNEL_SUBARR_YDIM
- MV2_CUDA_KERNEL_SUBARR_ZDIM
- Indexed_block
- MV2_CUDA_KERNEL_IDXBLK_XDIM
MPI Datatype Processing (Communication Optimization)
Waste of computing resources on CPU and GPU
MPI_Isend(Buf1, ..., req1);
MPI_Isend(Buf2, ..., req2);
Application work on the CPU/GPU
MPI_Waitall(requests, …)
Common Scenario
*Buf1, Buf2…contain non-contiguous MPI Datatype
- Modified ‘CUDA-Aware’ DDTBench for NAS_MG_y
– Up to 90% overlap between datatype processing and other computation
[Figure: overlap (%) for input sizes 32x16x16 through 512x256x256; Default, Event-based, and Callback-based designs]
- C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS ’16
MPI Datatype Processing (Communication Optimization)
Available in MVAPICH2-GDR 2.3a
- MVAPICH2/MVAPICH2-X
– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication
- MVAPICH2-GDR
– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing
- Deep Learning Application: OSU Caffe
Presentation Outline
- Deep learning frameworks are a different game altogether
  – Unusually large message sizes (on the order of megabytes)
  – Most communication based on GPU buffers
- Existing state of the art
  – cuDNN, cuBLAS, NCCL --> scale-up performance
  – NCCL2, CUDA-Aware MPI --> scale-out performance
    - For small and medium message sizes only!
- Proposed: can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
  – Efficient overlap of computation and communication
  – Efficient large-message communication (reductions)
  – What application co-designs are needed to exploit communication-runtime co-designs?
Deep Learning: New Challenges for MPI Runtimes
[Diagram: scale-up performance vs. scale-out performance; cuDNN, cuBLAS, and NCCL target scale-up, while gRPC, Hadoop, MPI, and NCCL2 target scale-out; the proposed co-designs target both]
- A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’17)
- To address the limitations of Caffe and existing MPI runtimes, we propose the OSU-Caffe (S-Caffe) framework
- At the application (DL framework) level
  – Develop a fine-grained workflow, i.e., layer-wise communication instead of communicating the entire model
- At the runtime (MPI) level
  – Develop support to perform reduction of very large GPU buffers
  – Perform reduction using GPU kernels
OSU-Caffe: Proposed Co-Design Overview
OSU-Caffe is available from the HiDL project page http://hidl.cse.ohio-state.edu
- Exploit non-blocking collective (NBC) operations in MPI-3
  – Divide communication into fine-grained steps
  – Overlap computation of layer “i” with communication of layer “i+1”
  – MPI_Ibcast to post all communication in advance (see the sketch below)
- Wait in an on-demand fashion
  – Allow for runtime selection of the data propagation design
    - Based on message (DL model) size, number of GPUs, and number of nodes
- Co-design gradient aggregation at the application level
  – Helper-thread based approach to realize a non-blocking MPI_Reduce
Optimized Data Propagation and Gradient Aggregation using NBC Designs
- A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, PPoPP ’17
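A minimal sketch of the layer-wise pattern described above (my own illustration, not S-Caffe code; the layer array, sizes, and compute_layer routine are placeholders): post MPI_Ibcast for every layer up front, then compute layer i while waiting only for the broadcast that layer needs.

    #include <mpi.h>
    #include <stdlib.h>

    #define NUM_LAYERS 8
    #define LAYER_ELEMS (1 << 20)

    /* Placeholder for the per-layer forward/backward computation. */
    static void compute_layer(float *params, int n) { for (int i = 0; i < n; i++) params[i] += 1.0f; }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        float *layers[NUM_LAYERS];
        MPI_Request reqs[NUM_LAYERS];
        for (int l = 0; l < NUM_LAYERS; l++)
            layers[l] = calloc(LAYER_ELEMS, sizeof(float));

        /* Post all layer-wise broadcasts in advance (rank 0 owns the model). */
        for (int l = 0; l < NUM_LAYERS; l++)
            MPI_Ibcast(layers[l], LAYER_ELEMS, MPI_FLOAT, 0, MPI_COMM_WORLD, &reqs[l]);

        /* Wait on demand: only the layer about to be computed must be complete,
           so communication of later layers overlaps with earlier compute. */
        for (int l = 0; l < NUM_LAYERS; l++) {
            MPI_Wait(&reqs[l], MPI_STATUS_IGNORE);
            compute_layer(layers[l], LAYER_ELEMS);
        }

        for (int l = 0; l < NUM_LAYERS; l++) free(layers[l]);
        MPI_Finalize();
        return 0;
    }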
S-Caffe vs. Inspur-Caffe and Microsoft CNTK
- AlexNet: notoriously hard to scale out on multiple nodes due to communication overhead!
  – Large number of parameters, ~64 million (communication buffer size = 256 MB)
- S-Caffe delivers better or comparable performance to other multi-node-capable DL frameworks
- Up to 14% improvement (scale-up)
Impact of HR
- GoogLeNet is a popular DNN
  – 13 million parameters (communication buffer size = ~50 MB)
- Exploiting overlap between computation and communication is significant in HPC
- Presented some of the approaches and results along these directions taken by the MVAPICH2 and MVAPICH2-GDR libraries
- Allows applications to take advantage of the overlap capabilities
- As exascale systems become more complicated in their architectures, solutions exploiting overlap capabilities will be important
Concluding Remarks
Funding Acknowledgments
Funding Support by Equipment Support by
Personnel Acknowledgments
Current Students (Graduate): A. Awan (Ph.D.), R. Biswas (M.S.), M. Bayatpour (Ph.D.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), S. Guganani (Ph.D.), J. Hashmi (Ph.D.), H. Javed (Ph.D.), P. Kousha (Ph.D.), D. Shankar (Ph.D.), H. Shi (Ph.D.), J. Zhang (Ph.D.)
Current Students (Undergraduate): N. Sarkauskas (B.S.)
Current Research Scientists: X. Lu, H. Subramoni
Current Research Specialist: J. Smith
Current Post-Docs: A. Ruhela, K. Manian
Past Students: A. Augustine (M.S.), P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), K. Kandalla (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), S. Krishnamoorthy (M.S.), K. Kulkarni (M.S.), R. Kumar (M.S.), P. Lai (M.S.), M. Li (Ph.D.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), H. Subramoni (Ph.D.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)
Past Research Scientists: K. Hamidouche, S. Sur
Past Post-Docs: D. Banerjee, X. Besseron, H.-W. Jin, J. Lin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne, H. Wang
Past Programmers: D. Bureddy, M. Arnold, J. Perkins
Upcoming 6th Annual MVAPICH User Group (MUG) Meeting
- August 6-8, 2018; Columbus, Ohio, USA
- Keynote Talks, Invited Talks, Contributed Presentations, and Tutorials on MVAPICH2, MVAPICH2-X, MVAPICH2-GDR, MVAPICH2-Virt, OSU INAM, and High-Performance Deep Learning optimization and tuning
- Student Travel Award
- More details at:
http://mug.mvapich.cse.ohio-state.edu
Thank You!
Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/
panda@cse.ohio-state.edu
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/