SLIDE 1

Exploiting Computation and Communication Overlap in MVAPICH2 MPI Library

Keynote Talk at the Charm++ Workshop (April ‘18)
by Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda

SLIDE 2

High-End Computing (HEC): Towards Exascale

Expected to have an ExaFlop system in 2020-2022!

100 PFlops in 2016; 1 EFlops in 2020-2022?

SLIDE 3

Parallel Programming Models Overview

[Figure: three abstract machine models. Shared Memory Model (SHMEM, DSM): processes P1-P3 over a single shared memory. Distributed Memory Model (MPI, Message Passing Interface): processes P1-P3, each with its own memory. Partitioned Global Address Space (PGAS) model (Global Arrays, UPC, Chapel, X10, CAF, …): processes P1-P3 with their own memories combined into a logical shared memory.]

  • Programming models provide abstract machine models
  • Models can be mapped on different types of systems

– e.g. Distributed Shared Memory (DSM), MPI within a node, etc.

  • PGAS models and hybrid MPI+PGAS models are gradually gaining importance
  • Task-based models (Charm++) are being used extensively
SLIDE 4

Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges

[Figure: software stack with co-design opportunities and challenges across the layers]
– Application Kernels/Applications
– Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Charm++, Hadoop (MapReduce), Spark (RDD, DAG)
– Middleware: communication library or runtime for programming models (point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance)
– Networking Technologies (InfiniBand, 40/100GigE, Aries, and Omni-Path); Multi-/Many-core Architectures; Accelerators (GPU and FPGA)
– Cross-cutting concerns: performance, scalability, resilience

SLIDE 5

Basic Concept of Overlapping Communication with Computation

[Figure: three layers and their roles]
– Networking Technology: provides overlap capabilities through network mechanisms
– Runtime (MPI, Charm++): designs runtime primitives exploiting the overlap capabilities of the network mechanisms
– Application: takes advantage of overlap, either transparently or through co-design
SLIDE 6

Overview of the MVAPICH2 Project

  • High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002 – MVAPICH2-X (MPI + PGAS), Available since 2011 – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014 – Support for Virtualization (MVAPICH2-Virt), Available since 2015 – Support for Energy-Awareness (MVAPICH2-EA), Available since 2015 – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015

– Used by more than 2,875 organizations in 86 countries – More than 462,000 (> 0.46 million) downloads from the OSU site directly

– Empowering many TOP500 clusters (Nov ‘17 ranking)

  • 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
  • 9th, 556,104 cores (Oakforest-PACS) in Japan
  • 12th, 368,928-core (Stampede2) at TACC
  • 17th, 241,108-core (Pleiades) at NASA
  • 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology

– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)

– http://mvapich.cse.ohio-state.edu

  • Empowering Top500 systems for over a decade
SLIDE 7

[Figure: cumulative number of downloads from Sep-04 to Jan-18, annotated with release milestones from MV 0.9.4 and MV2 0.9.0 through MV2 2.3rc1, MV2-GDR 2.3a, MV2-X 2.3b, MV2-Virt 2.2, and OSU INAM 0.9.3]

MVAPICH2 Release Timeline and Downloads

SLIDE 8

Architecture of MVAPICH2 Software Family

High Performance Parallel Programming Models: Message Passing Interface (MPI); PGAS (UPC, OpenSHMEM, CAF, UPC++); Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication Runtime, with diverse APIs and mechanisms: point-to-point primitives, collective algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis

Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
– Transport protocols: RC, XRC, UD, DC
– Modern features: UMR, ODP, SR-IOV, Multi-Rail, MCDRAM*, NVLink*, CAPI*, XPMEM* (* upcoming)
– Transport mechanisms: shared memory, CMA, IVSHMEM

SLIDE 9

MVAPICH2 Software Family

High-Performance Parallel Programming Libraries
– MVAPICH2: support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
– MVAPICH2-X: advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime
– MVAPICH2-GDR: optimized MPI for clusters with NVIDIA GPUs
– MVAPICH2-Virt: high-performance and scalable MPI for hypervisor- and container-based HPC cloud
– MVAPICH2-EA: energy-aware and high-performance MPI
– MVAPICH2-MIC: optimized MPI for clusters with Intel KNC
Microbenchmarks
– OMB: microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs
Tools
– OSU INAM: network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
– OEMT: utility to measure the energy consumption of MPI applications

SLIDE 10

  • MVAPICH2/MVAPICH2-X

– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication

  • MVAPICH2-GDR

– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing

  • Deep Learning Application: OSU Caffe

Presentation Outline

SLIDE 11

Overlapping Application Compute with MPI Startup

[Figure: timelines for processes P0-P3. Left: MPI_Init (Initialize HCA, Obtain Endpoint Address, Exchange Addresses) finishes before the application (Read Input Files, Set Up Problem, Compute / Communicate) begins, so there is no overlap between MPI_Init and application computation. Right: communication-independent tasks of the application proceed while addresses are exchanged; MPI can continue to initialize in the background while the application starts.]

SLIDE 12

  • Near-constant MPI and OpenSHMEM initialization time at any process count
  • 10x and 30x improvement in startup time of MPI and OpenSHMEM, respectively, at 16,384 processes
  • Memory consumption for remote endpoint information reduced by O(processes per node)
  • 1 GB of memory saved per node with 1M processes and 16 processes per node

Towards High Performance and Scalable Startup at Exascale

[Figure: job startup performance and memory required to store endpoint information. Legend: P = PGAS (state of the art), M = MPI (state of the art), O = PGAS/MPI (optimized); techniques labeled a-e: On-demand Connection, PMIX_Ring, PMIX_Ibarrier, PMIX_Iallgather, and Shmem-based PMI.]

On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. S. Chakraborty, H. Subramoni, J. Perkins, A. A. Awan, and D. K. Panda, 20th International Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS '15)
PMI Extensions for Scalable MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, J. Perkins, M. Arnold, and D. K. Panda, Proceedings of the 21st European MPI Users' Group Meeting (EuroMPI/Asia '14)
Non-blocking PMI Extensions for Fast MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D. K. Panda, 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '15)
SHMEMPMI – Shared Memory based PMI for Improved Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16)

SLIDE 13

Startup Performance on KNL + Omni-Path

[Figure: MPI_Init time (s) vs. number of processes. Left: MPI_Init on TACC Stampede-KNL (64 to 232K processes), Intel MPI 2018 beta vs. MVAPICH2 2.3a. Right: MPI_Init and Hello World time with MVAPICH2-2.3a on Oakforest-PACS (64 to 64K processes).]

  • MPI_Init takes 51 seconds on 231,956 processes on 3,624 KNL nodes (Stampede – Full scale)
  • 8.8 times faster than Intel MPI at 128K processes (Courtesy: TACC)
  • At 64K processes, MPI_Init and Hello World takes 5.8s and 21s respectively (Oakforest-PACS)
  • All numbers reported with 64 processes per node


New designs available in MVAPICH2-2.3a and as patch for SLURM-15.08.8 and SLURM-16.05.1

SLIDE 14

  • SHMEMPMI allows MPI processes to directly read remote endpoint (EP) information from the process manager through shared memory segments

  • Only a single copy per node - O(processes per node) reduction in memory usage
  • Estimated savings of 1GB per node with 1 million processes and 16 processes per node
  • Up to 1,000 times faster PMI Gets compared to default design
  • Available since MVAPICH2 2.2rc1 and SLURM-15.08.8

Process Management Interface (PMI) over Shared Memory (SHMEMPMI)

[Figure: (left) time taken by one PMI_Get (ms) vs. processes per node, Default vs. SHMEMPMI: up to 1,000x faster (estimated); (right) memory usage per node (MB) for remote EP information vs. processes per job, Fence/Allgather with default vs. shared-memory designs: 16x actual reduction.]

SLIDE 15

On-demand Connection Management for OpenSHMEM+MPI

[Figure: (left) breakdown of OpenSHMEM startup time (connection setup, PMI exchange, memory registration, shared memory setup, other) for 32 to 4K processes; (right) OpenSHMEM initialization and Hello World time for 16 to 8K processes, static vs. on-demand connection establishment.]

  • Static connection establishment wastes memory and takes a lot of time
  • On-demand connection management improves OpenSHMEM initialization time by 29.6 times
  • Time taken for Hello World reduced by 8.31 times at 8,192 processes
  • Available since MVAPICH2-X 2.1rc1
SLIDE 16

  • MVAPICH2/MVAPICH2-X

– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication

  • MVAPICH2-GDR

– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing

  • Deep Learning Application: OSU Caffe

Presentation Outline

SLIDE 17

– Good communication performance for smaller messages – No synchronization required between sender and receiver – Cost of extra copies is high for large messages

Communication Costs of Point-to-point Protocols - Eager

[Figure: eager protocol data path. Application data is copied into pre-registered communication buffers on the sender (cost: memcpy), transferred over the network (cost: network transfer), and copied from pre-registered buffers into application data on the receiver (cost: memcpy).]

SLIDE 18

Communication Costs of Point-to-point Protocols - Rendezvous

[Figure: rendezvous protocol data path; three control messages (cost: half RTT each) plus the data transfer (cost: network transfer).]

– Avoid extra copies for larger messages – Synchronization required between sender and receiver – Can be based on RDMA Read or RDMA Write (shown here)

SLIDE 19

  • Application process schedules the communication operation
  • Network adapter progresses the communication in the background
  • Application process is free to perform useful compute in the foreground
  • Overlap of computation and communication => better overall application performance
  • Increased buffer requirement
  • Poor communication performance if used for all types of communication operations

Analyzing Overlap Potential of Eager Protocol

[Figure: timeline of an eager transfer: each application process schedules its send/receive operation with the network interface card, computes while the NIC progresses the communication, and later checks for completion. Also shown: impact of changing the eager threshold on a multi-pair message-rate benchmark with 32 processes on Stampede.]

SLIDE 20

  • Application process schedules the communication operation
  • Application process is free to perform useful compute in the foreground
  • Little communication progress in the background
  • All communication takes place at the final synchronization
  • Reduced buffer requirement
  • Good communication performance if used for large message sizes and for operations where the communication library is progressed frequently
  • Poor overlap of computation and communication => poor overall application performance

Analyzing Overlap Potential of Rendezvous Protocol

[Figure: timeline of a rendezvous transfer: the RTS/CTS handshake makes little progress while both processes compute; repeated completion checks return "not complete" until the final synchronization, when the communication actually takes place.]
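One way applications mitigate this is to progress the MPI library explicitly from the compute loop. The sketch below is illustrative only and not taken from the slides; do_compute_chunk(), the peer rank, and the message size are placeholders.

/* Sketch: progressing a large (rendezvous-path) transfer with MPI_Test */
#include <mpi.h>

extern void do_compute_chunk(void);   /* placeholder for independent work */

void overlap_large_send(double *buf, int count, int peer, MPI_Comm comm)
{
    MPI_Request req;
    int done = 0;

    MPI_Isend(buf, count, MPI_DOUBLE, peer, 0, comm, &req);  /* large count: rendezvous path */

    while (!done) {
        do_compute_chunk();                          /* useful computation */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);    /* lets MPI progress the handshake/transfer */
    }
}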

SLIDE 21

Impact of Tuning Eager Threshold on 3D-Stencil

[Figure: 3D-stencil, Default vs. Tuned, vs. message size (1 B to 256 KB): (left) communication time (ms), (center) overlap potential (%), (right) overall performance (ms).]

  • Increased eager threshold from 16KB to 512KB
  • Very small degradation in raw communication performance
  • Significant improvement in overlap of computation and communication
  • ~18% Improvement in overall performance

Runtime parameters: MV2_IBA_EAGER_THRESHOLD=512K and MV2_SMP_EAGERSIZE=512K (applicable to both InfiniBand and Omni-Path); 8,192 processes, SandyBridge + FDR
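To illustrate the pattern that benefits from such tuning, here is a minimal, hedged sketch of a non-blocking halo exchange; compute_interior(), compute_boundary(), the neighbor ranks, and the launch command are placeholders and not from the slides. When the halo messages stay under the raised eager threshold, the sends complete without waiting for the receiver and the interior computation overlaps with the transfers.

/* Sketch: halo exchange whose medium-sized messages stay on the eager path.
 * Hypothetical launch with the tuned thresholds:
 *   mpirun_rsh -np 8192 -hostfile hosts MV2_IBA_EAGER_THRESHOLD=512K \
 *       MV2_SMP_EAGERSIZE=512K ./stencil                              */
#include <mpi.h>

extern void compute_interior(void);   /* work that does not need halo data */
extern void compute_boundary(void);   /* work that needs the halo data */

void halo_exchange(double *send_up, double *send_dn,
                   double *recv_up, double *recv_dn,
                   int n, int up, int dn, MPI_Comm comm)
{
    MPI_Request reqs[4];

    MPI_Irecv(recv_up, n, MPI_DOUBLE, up, 0, comm, &reqs[0]);
    MPI_Irecv(recv_dn, n, MPI_DOUBLE, dn, 1, comm, &reqs[1]);
    MPI_Isend(send_up, n, MPI_DOUBLE, up, 1, comm, &reqs[2]);
    MPI_Isend(send_dn, n, MPI_DOUBLE, dn, 0, comm, &reqs[3]);

    compute_interior();                        /* overlapped with transfers */

    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    compute_boundary();
}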

SLIDE 22

Impact of Tuning Rendezvous Protocol on 3D-Stencil

  • RDMA Read based protocol (RGET) used instead of RDMA Write
  • Very minor penalty in raw performance
  • Offers more overlap due to less synchronization overhead
  • Up to 15% improvement in overall execution time

Runtime parameter: MV2_RNDV_PROTOCOL=RGET (applicable to InfiniBand); 64 processes, Broadwell + EDR

[Figure: 3D-stencil with 64 processes, Default vs. Tuned (RGET), vs. message size: (left) communication time (us), (center) overlap potential (%), (right) overall performance (us).]

SLIDE 23

Dynamic and Adaptive MPI Point-to-point Communication Protocols

  • Different communication protocols have different trade-offs

– Need to consider performance, overlap, memory requirement – Manual tuning is difficult and time-consuming

  • Can the MPI library select the best protocol at runtime?

– Use different protocols and thresholds between different pair of processes – Deliver good performance and minimize resource consumption – Dynamically adapt to the application’s communication requirements at runtime

Design comparison
– Default: poor overlap; low memory requirement => low performance; high productivity
– Manually Tuned: good overlap; high memory requirement => high performance; low productivity
– Dynamic + Adaptive: good overlap; optimal memory requirement => high performance; high productivity

SLIDE 24

Dynamic and Adaptive MPI Point-to-point Communication Protocols (Cont.)

[Figure: eager thresholds chosen for an example communication pattern between processes on two nodes. Default: 16 KB for every pair; Manually Tuned: 128 KB for every pair; Dynamic + Adaptive: per-pair thresholds of 32 KB, 64 KB, 128 KB, and 32 KB.]

  • H. Subramoni, S. Chakraborty, D. K. Panda, Designing Dynamic & Adaptive MPI Point-to-Point Communication Protocols for Efficient Overlap of Computation & Communication, ISC'17 - Best Paper

[Figure: execution time (wall clock, s) and relative memory consumption of Amber at 128 to 1K processes for Default, Threshold=17K, Threshold=64K, Threshold=128K, and Dynamic Threshold.]

Desired eager threshold per process pair:
– pair 0-4: 32 KB
– pair 1-5: 64 KB
– pair 2-6: 128 KB
– pair 3-7: 32 KB

SLIDE 25

  • MVAPICH2/MVAPICH2-X

– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication

  • MVAPICH2-GDR

– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing

  • Deep Learning Application: OSU Caffe

Presentation Outline

SLIDE 26

  • Non-blocking one-sided communication routines

– Put, Get (Rput, Rget) – Accumulate, Get_accumulate – Atomics

  • Flexible synchronization operations to control initiation and completion

MPI-3 RMA: Communication and synchronization Primitives

MPI one-sided synchronization/completion primitives
– Synchronization: Win_sync, Lock/Unlock, Lock_all/Unlock_all, Fence, Post-Wait/Start-Complete
– Completion: Flush, Flush_all, Flush_local, Flush_local_all

MVAPICH2 supports all RMA communication with the best performance and overlap
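As a concrete illustration of overlapping computation with one-sided communication, here is a minimal sketch using a passive-target epoch; the window, target rank, and do_local_work() are placeholders, and this is not code from the library or the slides.

/* Sketch: MPI_Put overlapped with local computation under Lock/Unlock,
 * completed with MPI_Win_flush. */
#include <mpi.h>

extern void do_local_work(void);   /* placeholder for independent computation */

void put_with_overlap(const double *src, int count, int target,
                      MPI_Aint target_disp, MPI_Win win)
{
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);

    MPI_Put(src, count, MPI_DOUBLE, target,
            target_disp, count, MPI_DOUBLE, win);

    do_local_work();               /* overlapped with the RMA transfer */

    MPI_Win_flush(target, win);    /* Put is complete at origin and target */
    MPI_Win_unlock(target, win);
}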

SLIDE 27

Overlap between Computation and RMA Operations

[Figure: overlap (%) vs. message size (1 B to 4 MB) for MPI_Put and MPI_Get with Lock/Unlock and MPI_Win_flush, MVAPICH2-2.3rc1. Platform: Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, Mellanox ConnectX-4 EDR HCA, Mellanox OFED 4.3.]

  • 67-99% overlap between MPI_Put and computation
  • 75-99% overlap between MPI_Get and computation

SLIDE 28

  • Proposed design performs better than the default implementation
  • For Weakly Connected Components (WCC) on 256 cores, the proposed design reduces total execution time by 2X compared with the default scheme

Graph Processing Framework with Optimized MPI RMA

[Figure: execution time (s) of Mizan-Default vs. Mizan-RMA-Opt on 64, 128, and 256 cores; PageRank with LiveJournal1 (up to 2X better) and WCC with LiveJournal1 (up to 3X better); lower is better.]

  • M. Li, X. Lu, K. Hamidouche, J. Zhang and D. K. Panda, "Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA," IEEE HiPC, 2016
SLIDE 29

  • Proposed design shows good strong scaling
  • Proposed design scales better than default implementation

Graph Processing Framework with Optimized MPI RMA

[Figure: total execution time (s) of Mizan-Default vs. Mizan-RMA-Opt for PageRank with the Arabic dataset at 128 processes; up to 2.5X better; lower is better.]

  • M. Li, X. Lu, K. Hamidouche, J. Zhang and D. K. Panda, "Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA," IEEE HiPC, 2016
SLIDE 30

  • MVAPICH2/MVAPICH2-X

– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication

  • MVAPICH2-GDR

– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing

  • Deep Learning Application: OSU Caffe

Presentation Outline

SLIDE 31

Collective Communication in MVAPICH2

Run-time flags:
– All shared-memory based collectives: MV2_USE_SHMEM_COLL (default: ON)
– Hardware multicast-based collectives: MV2_USE_MCAST (default: OFF)
– CMA-based collectives: MV2_USE_CMA_COLL (default: ON)

[Figure: taxonomy of blocking and non-blocking collective algorithms in MV2; conventional (flat) vs. multi/many-core aware designs; inter-node communication via point-to-point, hardware multicast, SHARP, or RDMA; intra-node communication via point-to-point (SHMEM, LiMIC, CMA, XPMEM), direct shared memory, or direct kernel-assisted (CMA, XPMEM, LiMIC) mechanisms; designed for performance and overlap.]

SLIDE 32

Hardware Multicast-aware MPI_Bcast on TACC Stampede

[Figure: MPI_Bcast latency (us), Default vs. Multicast: small messages (2-512 B) and large messages (2K-128K) at 102,400 cores, plus 16-byte and 32-KByte message latency vs. number of nodes; up to 80-85% lower latency with multicast.]

  • MCAST-based designs improve latency of MPI_Bcast by up to 85%
  • Use MV2_USE_MCAST=1 to enable MCAST-based designs


SLIDE 33

Optimized CMA-based Collectives for Large Messages

[Figure: MPI_Gather latency (us) vs. message size (1K to 4M) on KNL nodes (64 PPN) at 2 nodes/128 procs, 4 nodes/256 procs, and 8 nodes/512 procs, comparing MVAPICH2-2.3a, Intel MPI 2017, OpenMPI 2.1.0, and Tuned CMA; Tuned CMA is roughly 2.5x to 17x better.]

  • Significant improvement over the existing implementation for Scatter/Gather with 1 MB messages (up to 4x on KNL, 2x on Broadwell, 14x on OpenPOWER)
  • New two-level algorithms for better scalability
  • Improved performance for other collectives (Bcast, Allgather, and Alltoall)

  • S. Chakraborty, H. Subramoni, and D. K. Panda, Contention-Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems, IEEE Cluster '17, Best Paper Finalist

Performance of MPI_Gather on KNL nodes (64PPN)

Available in MVAPICH2-X 2.3b

SLIDE 34

Shared Address Space (XPMEM)-based Collectives Design

[Figure: OSU_Allreduce latency (us) vs. message size (16K to 4M) on Broadwell with 256 procs, comparing MVAPICH2-2.3b, IMPI-2017v1.132, and MVAPICH2-Opt.]

  • “Shared Address Space”-based true zero-copy reduction collective designs in MVAPICH2
  • Offloads computation/communication to peer ranks in the reduction collective operation
  • Up to 4X improvement for 4 MB Reduce and up to 1.8X improvement for 4 MB Allreduce

[Figure: OSU_Reduce latency (us) vs. message size (16K to 4M) on Broadwell with 256 procs; MVAPICH2-Opt is up to 4X better for Reduce and up to 1.8X better for Allreduce at 4 MB.]

  • J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. K. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018.

Will be available in a future release

SLIDE 35

Application-Level Benefits of XPMEM-Based Collectives

  • Up to 20% benefit over IMPI for CNTK DNN training using Allreduce
  • Up to 27% benefit over IMPI and up to 15% improvement over MVAPICH2 for the MiniAMR application kernel

[Figure: execution time (s) vs. number of processes for IMPI-2017v1.132, MVAPICH2-2.3b, and MVAPICH2-Opt: CNTK AlexNet training (Broadwell, default batch size, 50 iterations, ppn=28) at 28 to 224 processes, and MiniAMR (Broadwell, ppn=16) at 16 to 256 processes; MVAPICH2-Opt improves by up to 20%/9% and 27%/15%, respectively.]

SLIDE 36

Problems with Blocking Collective Operations

[Figure: four application processes performing a blocking collective; time splits into separate computation and communication phases.]

  • Communication time cannot be used for compute

– No overlap of computation and communication – Inefficient

SLIDE 37

  • Application processes schedule collective operation
  • Check periodically if operation is complete
  • Overlap of computation and communication => Better Performance
  • Catch: who will progress the communication?

Concept of Non-blocking Collectives

[Figure: four application processes each schedule the collective operation with a communication support entity, compute, and periodically check whether the operation is complete; computation overlaps with communication.]

SLIDE 38

  • Enables overlap of computation with communication
  • Non-blocking calls do not match blocking collective calls

– MPI may use different algorithms for blocking and non-blocking collectives – Blocking collectives: Optimized for latency – Non-blocking collectives: Optimized for overlap

  • A process calling an NBC operation

– Schedules collective operation and immediately returns – Executes application computation code – Waits for the end of the collective

  • The communication is progressed by

– Application code through MPI_Test – Network adapter (HCA) with hardware support – Dedicated processes / thread in MPI library

  • There is a non-blocking equivalent for each blocking operation

– Has an “I” in the name

  • MPI_Bcast -> MPI_Ibcast; MPI_Reduce -> MPI_Ireduce

Non-blocking Collective (NBC) Operations

SLIDE 39

How do I write applications with NBC?

int main(int argc, char **argv) {
    MPI_Request req;
    int flag;
    MPI_Init(&argc, &argv);
    …
    /* sendbuf, recvbuf, and count are set up by the application */
    MPI_Ialltoall(sendbuf, count, MPI_INT, recvbuf, count, MPI_INT,
                  MPI_COMM_WORLD, &req);
    /* Computation that does not depend on the result of Alltoall */
    MPI_Test(&req, &flag, MPI_STATUS_IGNORE);   /* Check if complete (non-blocking) */
    /* Computation that does not depend on the result of Alltoall */
    MPI_Wait(&req, MPI_STATUS_IGNORE);          /* Wait till complete (blocking) */
    …
    MPI_Finalize();
}

SLIDE 40

P3DFFT Performance with Non-Blocking Alltoall using RDMA Primitives

  • Weak scaling experiments; problem size increases with job size
  • RDMA-Aware delivers 19% improvement over Default @ 8,192 procs
  • Default-Thread exhibits worst performance

– Possibly because threads steal CPU cycles from P3DFFT
– Not considered for large-scale experiments

[Figure: CPU time per loop (s) vs. number of processes: small-scale runs (128-512 processes; Default, RDMA-Aware, Default-Thread) and large-scale runs (128-8K processes; Default, RDMA-Aware); RDMA-Aware is 19% better at 8,192 processes.]

Designing Non-Blocking Personalized Collectives with Near Perfect Overlap for RDMA-Enabled Clusters, H. Subramoni, A. A. Awan, K. Hamidouche, D. Pekurovsky, A. Venkatesh, S. Chakraborty, K. Tomko, and D. K. Panda, ISC '15, Jul 2015

Will be available in a future release

SLIDE 41

  • Management and execution of MPI operations in the network by using SHArP
  • Manipulation of data while it is being transferred in the switch network
  • SHArP provides an abstraction to realize the reduction operation
  • Defines Aggregation Nodes (AN), Aggregation Tree, and Aggregation Groups
  • AN logic is implemented as an InfiniBand Target Channel Adapter (TCA) integrated into the switch ASIC*
  • Uses RC for communication between ANs and between ANs and hosts in the Aggregation Tree*

Offloading with Scalable Hierarchical Aggregation Protocol (SHArP)

[Figure: physical network topology and the corresponding logical SHArP aggregation tree.*]

* Bloch et al. Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction

SLIDE 42

Evaluation of SHArP-based Non-Blocking Allreduce (MPI_Iallreduce benchmark)

[Figure: 1 PPN (process per node), 8 nodes, MVAPICH2 vs. MVAPICH2-SHArP: (left) pure communication latency (us) vs. message size (4-128 B), lower is better; (right) communication-computation overlap (%) vs. message size, higher is better, up to 2.3x more overlap with SHArP.]

  • Complete offload of the Allreduce collective operation to the switch enables much higher overlap of communication and computation

Available since MVAPICH2 2.3a
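The pattern that benefits from this offload is the usual post-compute-wait structure around MPI_Iallreduce; the sketch below is illustrative (buffer setup and do_independent_work() are placeholders), not library code.

/* Sketch: non-blocking Allreduce overlapped with independent computation. */
#include <mpi.h>

extern void do_independent_work(void);   /* placeholder: work not needing the result */

void overlapped_allreduce(const double *in, double *out, int count, MPI_Comm comm)
{
    MPI_Request req;

    MPI_Iallreduce(in, out, count, MPI_DOUBLE, MPI_SUM, comm, &req);

    do_independent_work();                /* reduction can progress (e.g., in the switch) */

    MPI_Wait(&req, MPI_STATUS_IGNORE);    /* result now available in out */
}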

SLIDE 43

  • Mellanox ConnectX-2, ConnectX-3, Connect-IB, ConnectX-4, and ConnectX-5 adapters feature a "task-list" offload interface
– An extension to existing InfiniBand APIs
  • Collective communication with 'blocking' semantics is usually a scaling bottleneck
– Matches the need for non-blocking collectives in MPI
  • Accordingly, MPI software stacks need to be re-designed to leverage offload in a comprehensive manner
  • Can applications be modified to take advantage of non-blocking collectives, and what will be the benefits?

Collective Offload in ConnectX-2, ConnectX-3, Connect-IB and ConnectX-4, ConnectX-5

SLIDE 44

Collective Offload Support in ConnectX InfiniBand Adapter (Recv followed by Multi-Send)

  • The sender creates a task-list consisting of only send and wait WQEs
– One send WQE is created for each registered receiver and is appended to the rear of a singly linked task-list
– A wait WQE is added to make the ConnectX-2 HCA wait for an ACK packet from the receiver

[Figure: the application posts a task-list (Send, Wait, Send, Send, Send, Wait) to a management queue (MQ) on the InfiniBand HCA; the HCA executes it through its send/receive queues and completion queues (Send CQ, Recv CQ, MCQ) over the physical link.]

SLIDE 45

Co-designing HPL with Core-Direct and Performance Benefits

[Figure: HPL performance comparison with 512 processes: normalized HPL performance vs. problem size (N) as % of total memory, and throughput (GFlops) with memory consumption (%) vs. system size (64-512 processes), for HPL-Offload, HPL-1ring, and HPL-Host.]

  • HPL-Offload consistently offers higher throughput than HPL-1ring and HPL-Host; it improves peak throughput by up to 4.5% for large problem sizes
  • HPL-Offload surpasses the peak throughput of HPL-1ring with significantly smaller problem sizes and run-times!

  • K. Kandalla, H. Subramoni, J. Vienne, S. Pai Raikar, K. Tomko, S. Sur, and D. K. Panda, Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HOTI 2011

SLIDE 46

Pre-conditioned Conjugate Gradient (PCG) Solver Performance with Non-Blocking Allreduce based on CX-2 Collective Offload

[Figure: PCG run-time (s) vs. number of processes (64-512), PCG-Default vs. Modified-PCG-Offload.]

64,000 unknowns per process. Modified PCG with Offload-Allreduce performs 21.8% better than the default PCG.

  • K. Kandalla, U. Yang, J. Keasler, T. Kolev, A. Moody, H. Subramoni, K. Tomko, J. Vienne and D. K. Panda, Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12, May 2012.

SLIDE 47

  • MVAPICH2/MVAPICH2-X

– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication

  • MVAPICH2-GDR

– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing

  • Deep Learning Application: OSU Caffe

Presentation Outline

SLIDE 48

At Sender:
    MPI_Send(s_devbuf, size, …);
At Receiver:
    MPI_Recv(r_devbuf, size, …);

Inside MVAPICH2:
  • Standard MPI interfaces used for unified data movement
  • Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
  • Overlaps data movement from GPU with RDMA transfers
  • High performance and high productivity

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
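A minimal end-to-end sketch of the same idea follows; it assumes a CUDA-aware MPI build (e.g., MVAPICH2-GDR run with MV2_USE_CUDA=1), and the message size and rank roles are illustrative.

/* Sketch: passing GPU device pointers directly to MPI_Send/MPI_Recv. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank;
    const int n = 1 << 20;          /* illustrative message length */
    double *devbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&devbuf, n * sizeof(double));

    if (rank == 0)
        MPI_Send(devbuf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(devbuf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}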

SLIDE 49

CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases

  • Support for MPI communication from NVIDIA GPU device memory
  • High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
  • High-performance intra-node point-to-point communication for multi-GPU nodes (GPU-GPU, GPU-Host, and Host-GPU)
  • Takes advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
  • Optimized and tuned collectives for GPU device buffers
  • MPI datatype support for point-to-point and collective communication from GPU device buffers
  • Unified memory
SLIDE 50

GPU-Direct RDMA (GDR) with CUDA

  • OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox
  • OSU has a design of MVAPICH2 using GPUDirect RDMA
– Hybrid design using GPUDirect RDMA and host-based pipelining
  • Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
  • Similar bottlenecks on Haswell
– Support for communication using multi-rail
– Support for Mellanox Connect-IB and ConnectX VPI adapters
– Support for RoCE with Mellanox ConnectX VPI adapters

[Figure: IB adapter, chipset, CPU, system memory, GPU, and GPU memory data paths.]

P2P bandwidth limits (SNB E5-2670 / IVB E5-2680V2):
– P2P read: intra-socket <1.0 GB/s (SNB) and 3.5 GB/s (IVB); inter-socket <300 MB/s on both
– P2P write: intra-socket 5.2 GB/s (SNB) and 6.4 GB/s (IVB); inter-socket <300 MB/s on both

SLIDE 51

Optimized MVAPICH2-GDR Design

[Figure: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size (1 B to 8 KB), MV2 (no GDR) vs. MV2-GDR 2.3a: about 11X lower latency (1.88 us) and roughly 9-10X higher bandwidth and bi-bandwidth for small messages. Platform: Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPUDirect RDMA.]

SLIDE 52

Overlap with Optimized MVAPICH2-GDR Design

[Figure: overlap (%) vs. message size (1 B to 1 MB) for GPU-GPU intra-node and inter-node communication, MVAPICH2 (no GDR) vs. MVAPICH2-GDR-2.3a. Platform: Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPUDirect RDMA.]

  • Up to 69% overlap* for intra-node GPU-GPU communication
  • With GDR, up to 78% overlap* for inter-node small and medium message transfers
  • With intelligent pipelining, up to 88% overlap* for inter-node large message transfers

*Overlap between GPU-to-GPU communication and CPU computation

SLIDE 53

  • Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
  • HoomdBlue Version 1.0.5
  • GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

Application-Level Evaluation (HOOMD-blue)

[Figure: average time steps per second (TPS) vs. number of processes (4-32) for 64K and 256K particles, MV2 vs. MV2+GDR; about 2X higher TPS with GDR.]

SLIDE 54

Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland

[Figure: normalized execution time vs. number of GPUs for Default, Callback-based, and Event-based designs on the CSCS GPU cluster (16-96 GPUs) and the Wilkes GPU cluster (4-32 GPUs).]

  • 2X improvement on 32 GPU nodes
  • 30% improvement on 96 GPU nodes (8 GPUs/node)
  • C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16

On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and Cosmo Application

Cosmo model: http://www2.cosmo-model.org/content /tasks/operational/meteoSwiss/

SLIDE 55

  • MVAPICH2/MVAPICH2-X

– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication

  • MVAPICH2-GDR

– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing

  • Deep Learning Application: OSU Caffe

Presentation Outline

SLIDE 56

  • Applications use GPU/CPU resources for computation and MPI for communication directly from GPU buffers
  • MPI collectives are common in GPU applications, e.g., Alltoall for FFTs
  • Collectives are time-consuming at scale, so MPI-3.0 introduced NBCs
  • Non-blocking communication operations from GPU buffers can
– Allow the CPU to overlap GPU-based communication with CPU compute
– Ease GPU-kernel redundancy in waiting for non-dependent communication
– Allow power-efficient execution from the CPU perspective
  • A rich set of GPU and network primitives is available for NBC designs, but architectural limitations must be addressed

Motivation: Exploiting CORE-Direct and GPUDirect RDMA

  • A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, HiPC ’15
SLIDE 57

  • Realized through mapping of the MPICH schedule abstraction
– Schedule composed of sched_send, sched_barrier, sched_recv, sched_start, etc.
– Mapped to Core-Direct primitives, with collective-specific GPU↔Host movement done additionally
  • Multiple designs explored
– Naïve design: host-assisted GPU NBC (Scatter)
– Offload-Staged: host-assisted GPU NBC + Core-Direct
– Offload-GDR: (GDR + Core-Direct)-based NBC
– Offload-Callback: (Core-Direct, GDR, CUDA)-based NBC

Overview of Core-Direct + GPUDirect Designs

SLIDE 58

  • Use of GDR and CUDA callback mechanisms improves latency (comparable for Alltoall)
  • Latency remains high for Alltoall even though the callback design avoids staging latency

Latency Comparison with Blocking Collectives

[Figure: 64-node Iallgather latency and 64-node Ialltoall latency (us) vs. message size (4K, 64K, 1 MB) for gdr-offload-cb, staged-offload, offload-gdr, and blocking designs.]

SLIDE 59

  • New schemes are able to exploit overlap well

Effect of Compute Location on Overlap/Latency

[Figure: 64-node Iallgather overlap (%) and latency (us) vs. message size (4K to 256K) for gdr-offload-cb and offload-gdr variants with the compute placed on the GPU vs. the CPU.]

Available in MVAPICH2-GDR 2.3a

SLIDE 60

  • MVAPICH2/MVAPICH2-X

– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication

  • MVAPICH2-GDR

– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing

  • Deep Learning Application: OSU Caffe

Presentation Outline

SLIDE 61

  • Scientific parallel applications spend a considerable amount of time in GPU-based collective communication operations
– E.g., deep learning frameworks such as TensorFlow and Caffe
  • Optimized computation-intensive collectives in MVAPICH2-GDR
– MPI_Reduce and MPI_Allreduce
– Exploring the best combinations: computation on the CPU or the GPU, and communication through host or GPU memory

GPU-kernel based Reduction

[Figure: two nodes (A and B), each with CPU, host memory, GPU, PCIe, and IB adapter, with numbered data-movement steps for the reduction paths.]

SLIDE 62

[Figure: MPI_Reduce latency (us) vs. message size for Default, BD-DD, GR-DD, GR-HD, GR-HH, and GR-H-HH designs; small messages (4 B to 8 KB) and large messages (16 KB to 4 MB).]

Evaluation - MPI_Reduce @ CSCS (96 GPUs)

Gather-first approaches* win for small messages; K-nomial GPU-based approaches* win for large messages

*Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan, and Dhabaleswar K. Panda, "CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters, ” IEEE/ACM CCGrid’16.

SLIDE 63

[Figure: MPI_Allreduce latency (ms) for Default, RD-DD, and BRB-DD designs vs. message size on 96 GPUs at CSCS, and vs. system size (2-32 nodes) on Wilkes, showing good scalability.]

Evaluation - MPI_Allreduce


Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan, and Dhabaleswar K. Panda, "CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters, ” IEEE/ACM CCGrid’16.

Available in MVAPICH2-GDR 2.3a

SLIDE 64

  • MVAPICH2/MVAPICH2-X

– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication

  • MVAPICH2-GDR

– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing

  • Deep Learning Application: OSU Caffe

Presentation Outline

SLIDE 65

Non-contiguous Data Exchange

  • Multi-dimensional data
– Row-based organization
– Contiguous in one dimension
– Non-contiguous in the other dimensions
  • Halo data exchange
– Duplicate the boundary
– Exchange the boundary in each iteration

[Figure: halo data exchange between neighboring sub-domains.]

SLIDE 66

MPI Datatype support in MVAPICH2

  • Datatypes support in MPI

– Operate on customized datatypes to improve productivity – Enable MPI library to optimize non-contiguous data

At Sender:

MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
MPI_Type_commit(&new_type);
…
MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);

  • Inside MVAPICH2
  • Use datatype specific CUDA Kernels to pack data in chunks
  • Efficiently move data between nodes using RDMA
  • In progress - currently optimizes vector and hindexed datatypes
  • Transparent to the user
  • H. Wang, S. Potluri, D. Bureddy, C. Rosales and D. K. Panda, GPU-aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation, IEEE Transactions on Parallel and Distributed Systems, Vol. 25, No. 10, pp. 2595-2605, Oct 2014.
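To show the matching receive side of such a derived-datatype exchange, here is a small, hedged sketch: the grid layout, ranks, and the use of MPI_Sendrecv are illustrative and not taken from the slides; with device buffers, a CUDA-aware library such as MVAPICH2-GDR would pack/unpack the strided column with GPU kernels.

/* Sketch: exchanging one strided column of a row-major rows x cols array. */
#include <mpi.h>

void exchange_column(double *grid, int rows, int cols, int peer, MPI_Comm comm)
{
    MPI_Datatype column;

    /* one element per row, consecutive elements separated by `cols` */
    MPI_Type_vector(rows, 1, cols, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* send our last interior column, receive the peer's into our ghost column */
    MPI_Sendrecv(&grid[cols - 2], 1, column, peer, 0,
                 &grid[cols - 1], 1, column, peer, 0,
                 comm, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
}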

SLIDE 67

MPI Datatype Processing (Computation Optimization )

  • Comprehensive support
  • Targeted kernels for regular datatypes - vector, subarray, indexed_block
  • Generic kernels for all other irregular datatypes
  • Separate non-blocking stream for kernels launched by MPI library
  • Avoids stream conflicts with application kernels
  • Flexible set of parameters for users to tune kernels
  • Vector
  • MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE
  • MV2_CUDA_KERNEL_VECTOR_YSIZE
  • Subarray
  • MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE
  • MV2_CUDA_KERNEL_SUBARR_XDIM
  • MV2_CUDA_KERNEL_SUBARR_YDIM
  • MV2_CUDA_KERNEL_SUBARR_ZDIM
  • Indexed_block
  • MV2_CUDA_KERNEL_IDXBLK_XDIM
SLIDE 68

MPI Datatype Processing (Communication Optimization)

Common Scenario:
    MPI_Isend(Buf1, ..., req1);
    MPI_Isend(Buf2, ..., req2);
    Application work on the CPU/GPU
    MPI_Waitall(requests, …);

*Buf1, Buf2, … contain non-contiguous MPI Datatypes

Waste of computing resources on CPU and GPU

SLIDE 69

  • Modified ‘CUDA-Aware’ DDTBench for NAS_MG_y

– Up to 90% overlap between datatype processing and other computation

[Figure: overlap (%) for input sizes [32x16x16], [128x64x64], [256x128x128], and [512x256x256], comparing Default, Event-based, and Callback-based designs.]

  • C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16

MPI Datatype Processing (Communication Optimization )

Available in MVAPICH2-GDR 2.3a

SLIDE 70

  • MVAPICH2/MVAPICH2-X

– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication

  • MVAPICH2-GDR

– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing

  • Deep Learning Application: OSU Caffe

Presentation Outline

SLIDE 71

  • Deep Learning frameworks are a different game altogether
– Unusually large message sizes (on the order of megabytes)
– Most communication based on GPU buffers
  • Existing state of the art
– cuDNN, cuBLAS, NCCL --> scale-up performance
– NCCL2, CUDA-Aware MPI --> scale-out performance
  • For small and medium message sizes only!
  • Proposed: can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
– Efficient overlap of computation and communication
– Efficient large-message communication (reductions)
– What application co-designs are needed to exploit communication-runtime co-designs?

Deep Learning: New Challenges for MPI Runtimes

[Figure: communication substrates placed on scale-up vs. scale-out performance axes (cuDNN, cuBLAS, NCCL, NCCL2, gRPC, Hadoop, MPI), with the proposed co-designs targeting both.]

  • A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)


SLIDE 72

  • To address the limitations of Caffe and existing MPI runtimes, we propose the OSU-Caffe (S-Caffe) framework
  • At the application (DL framework) level
– Develop a fine-grained workflow, i.e., layer-wise communication instead of communicating the entire model
  • At the runtime (MPI) level
– Develop support to perform reduction of very large GPU buffers
– Perform the reduction using GPU kernels

OSU-Caffe: Proposed Co-Design Overview

OSU-Caffe is available from the HiDL project page http://hidl.cse.ohio-state.edu

SLIDE 73

Optimized Data Propagation and Gradient Aggregation using NBC Designs

  • Exploit Non-Blocking Collective (NBC) operations in MPI-3
– Divide communication into fine-grained steps
– Overlap computation of layer "i" with communication of layer "i+1"
– MPI_Ibcast to post all communication in advance (a sketch of this layer-wise pattern follows below)
  • Wait in an on-demand fashion
– Allows runtime selection of the data propagation design
  • Based on message (DL model) size, number of GPUs, and number of nodes
  • Co-design gradient aggregation at the application level
– Helper-thread based approach to realize a non-blocking MPI_Reduce

  • A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, PPoPP '17
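The following is a minimal illustration of the layer-wise "post in advance, wait on demand" pattern described above; it is not the S-Caffe code, and num_layers, param[], count[], MAX_LAYERS, and compute_layer() are placeholders.

/* Sketch: post an MPI_Ibcast per layer up front, then overlap computation of
 * layer i with the still-pending broadcasts of later layers. */
#include <mpi.h>

#define MAX_LAYERS 64                     /* illustrative bound */
extern void compute_layer(int layer);     /* placeholder for per-layer compute */

void forward_pass(float **param, const int *count, int num_layers,
                  int root, MPI_Comm comm)
{
    MPI_Request req[MAX_LAYERS];

    /* post all layer-wise broadcasts in advance */
    for (int i = 0; i < num_layers; i++)
        MPI_Ibcast(param[i], count[i], MPI_FLOAT, root, comm, &req[i]);

    /* wait on demand: complete layer i just before its compute, while the
     * broadcasts for layers i+1, i+2, ... continue in the background */
    for (int i = 0; i < num_layers; i++) {
        MPI_Wait(&req[i], MPI_STATUS_IGNORE);
        compute_layer(i);
    }
}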

SLIDE 74

S-Caffe vs. Inspur-Caffe and Microsoft CNTK

  • AlexNet: notoriously hard to scale out on multiple nodes due to communication overhead!
– Large number of parameters, ~64 million (communication buffer size = 256 MB)
  • GoogLeNet is a popular DNN
– 13 million parameters (communication buffer size = ~50 MB)
  • S-Caffe delivers better or comparable performance than other multi-node capable DL frameworks
– Up to 14% improvement (scale-up)

SLIDE 75

  • Exploiting overlap between computation and communication is significant in HPC
  • Presented some of the approaches and results along these directions taken by the MVAPICH2 and MVAPICH2-GDR libraries
  • These designs allow applications to take advantage of the overlap capabilities
  • As exascale systems become more complicated in their architectures, solutions exploiting overlap capabilities will be important

Concluding Remarks

SLIDE 76

Funding Acknowledgments

Funding Support by Equipment Support by

SLIDE 77

Personnel Acknowledgments

Current Students (Graduate)

  • A. Awan (Ph.D.)

  • R. Biswas (M.S.)

  • M. Bayatpour (Ph.D.)

  • S. Chakraborty (Ph.D.)

  • C.-H. Chu (Ph.D.)

  • S. Guganani (Ph.D.)

Past Students

  • A. Augustine (M.S.)

  • P. Balaji (Ph.D.)

  • S. Bhagvat (M.S.)

  • A. Bhat (M.S.)

  • D. Buntinas (Ph.D.)

  • L. Chai (Ph.D.)

  • B. Chandrasekharan (M.S.)

  • N. Dandapanthula (M.S.)

  • V. Dhanraj (M.S.)

  • T. Gangadharappa (M.S.)

  • K. Gopalakrishnan (M.S.)

  • R. Rajachandrasekar (Ph.D.)

  • G. Santhanaraman (Ph.D.)

  • A. Singh (Ph.D.)

  • J. Sridhar (M.S.)

  • S. Sur (Ph.D.)

  • H. Subramoni (Ph.D.)

  • K. Vaidyanathan (Ph.D.)

  • A. Vishnu (Ph.D.)

  • J. Wu (Ph.D.)

  • W. Yu (Ph.D.)

Past Research Scientist

  • K. Hamidouche

  • S. Sur

Past Post-Docs

  • D. Banerjee

  • X. Besseron

  • H.-W. Jin

  • W. Huang (Ph.D.)

  • W. Jiang (M.S.)

  • J. Jose (Ph.D.)

  • S. Kini (M.S.)

  • M. Koop (Ph.D.)

  • K. Kulkarni (M.S.)

  • R. Kumar (M.S.)

  • S. Krishnamoorthy (M.S.)

  • K. Kandalla (Ph.D.)

  • M. Li (Ph.D.)

  • P. Lai (M.S.)

  • J. Liu (Ph.D.)

  • M. Luo (Ph.D.)

  • A. Mamidala (Ph.D.)

  • G. Marsh (M.S.)

  • V. Meshram (M.S.)

  • A. Moody (M.S.)

  • S. Naravula (Ph.D.)

  • R. Noronha (Ph.D.)

  • X. Ouyang (Ph.D.)

  • S. Pai (M.S.)

  • S. Potluri (Ph.D.)

  • J. Hashmi (Ph.D.)

  • H. Javed (Ph.D.)

  • P. Kousha (Ph.D.)

  • D. Shankar (Ph.D.)

  • H. Shi (Ph.D.)

  • J. Zhang (Ph.D.)

  • J. Lin

  • M. Luo

  • E. Mancini

Current Research Scientists

  • X. Lu

  • H. Subramoni

Past Programmers

  • D. Bureddy

  • J. Perkins

Current Research Specialist

  • J. Smith

  • M. Arnold

  • S. Marcarelli

  • J. Vienne

  • H. Wang

Current Post-doc

  • A. Ruhela

  • K. Manian

Current Students (Undergraduate)

  • N. Sarkauskas (B.S.)
SLIDE 78

Upcoming 6th Annual MVAPICH User Group (MUG) Meeting

  • August 6-8, 2018; Columbus, Ohio, USA
  • Keynote talks, invited talks, contributed presentations, and tutorials on MVAPICH2, MVAPICH2-X, MVAPICH2-GDR, MVAPICH2-Virt, OSU INAM, and high-performance deep learning optimization and tuning

  • Student Travel Award
  • More details at:

http://mug.mvapich.cse.ohio-state.edu

SLIDE 79

Thank You!

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

panda@cse.ohio-state.edu

The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/