Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale Systems


SLIDE 1

Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale Systems

Keynote Talk at Bench ’19 Conference by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda

Follow us on https://twitter.com/mvapich

SLIDE 2

2 Network Based Computing Laboratory Bench ‘19

High-End Computing (HEC): PetaFlop to ExaFlop

Expected to have an ExaFlop system in 2020-2021!

  • 100 PFlops in 2017
  • 149 PFlops in 2018
  • 1 EFlops in 2020-2021?

SLIDE 3

Big Data

(Hadoop, Spark, HBase, Memcached, etc.)

Deep Learning

(Caffe, TensorFlow, BigDL, etc.)

HPC

(MPI, RDMA, Lustre, etc.)

Increasing Usage of HPC, Big Data and Deep Learning

Convergence of HPC, Big Data, and Deep Learning!

Increasing Need to Run these applications on the Cloud!!

SLIDE 4

Can We Run HPC, Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

Physical Compute

SLIDE 7

Can We Run HPC, Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

  • Spark Job
  • Hadoop Job
  • Deep Learning Job

SLIDE 8

  • MVAPICH Project

– MPI and PGAS Library with CUDA-Awareness

  • HiBD Project

– High-Performance Big Data Analytics Library

  • HiDL Project

– High-Performance Deep Learning

  • Public Cloud Deployment

– Microsoft-Azure and Amazon-AWS

  • Conclusions

Presentation Overview

SLIDE 9

Overview of the MVAPICH2 Project

  • High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002

– MVAPICH2-X (MPI + PGAS), Available since 2011

– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014

– Support for Virtualization (MVAPICH2-Virt), Available since 2015

– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015

– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015

– Used by more than 3,050 organizations in 89 countries

– More than 614,000 (> 0.6 million) downloads from the OSU site directly

– Empowering many TOP500 clusters (Nov ‘18 ranking)

  • 3rd, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
  • 5th, 448,448 cores (Frontera) at TACC
  • 8th, 391,680 cores (ABCI) in Japan
  • 15th, 570,020 cores (Nurion) in South Korea and many others

– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)

– http://mvapich.cse.ohio-state.edu

  • Empowering Top500 systems for over a decade

Partner in the TACC Frontera System

SLIDE 10

MVAPICH2 Release Timeline and Downloads

[Figure: Number of Downloads (axis up to 600,000) versus Timeline (Sep-04 to Apr-19), annotated with releases: MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-GDR 2.3.2, MV2-X 2.3 rc2, MV2 Virt 2.2, MV2 2.3.2, OSU INAM 0.9.3, MV2-Azure 2.3.2, MV2-AWS 2.3]

SLIDE 11

Architecture of MVAPICH2 Software Family

  • High Performance Parallel Programming Models

– Message Passing Interface (MPI)

– PGAS (UPC, OpenSHMEM, CAF, UPC++)

– Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)

  • High Performance and Scalable Communication Runtime: Diverse APIs and Mechanisms

– Point-to-point Primitives, Collectives Algorithms, Energy-Awareness, Remote Memory Access, I/O and File Systems, Fault Tolerance, Virtualization, Active Messages, Job Startup, Introspection & Analysis

  • Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter)

– Transport Protocols: RC, SRD, UD, DC

– Modern Features: UMR, ODP, SR-IOV, Multi Rail

  • Support for Modern Multi-/Many-core Architectures (Intel-Xeon, OpenPOWER, Xeon-Phi, ARM, NVIDIA GPGPU)

– Transport Mechanisms: Shared Memory, CMA, IVSHMEM, XPMEM

– Modern Features: Optane*, NVLink, CAPI*

* Upcoming

SLIDE 12

MVAPICH2 Software Family

Requirements and the corresponding library:

  • MPI with IB, iWARP, Omni-Path, and RoCE: MVAPICH2
  • Advanced MPI Features/Support, OSU INAM, PGAS and MPI+PGAS with IB, Omni-Path, and RoCE: MVAPICH2-X
  • MPI with IB, RoCE & GPU and Support for Deep Learning: MVAPICH2-GDR
  • HPC Cloud with MPI & IB: MVAPICH2-Virt
  • Energy-aware MPI with IB, iWARP and RoCE: MVAPICH2-EA
  • MPI Energy Monitoring Tool: OEMT
  • InfiniBand Network Analysis and Monitoring: OSU INAM
  • Microbenchmarks for Measuring MPI and PGAS Performance: OMB

SLIDE 13

Big Data

(Hadoop, Spark, HBase, Memcached, etc.)

Deep Learning

(Caffe, TensorFlow, BigDL, etc.)

HPC

(MPI, RDMA, Lustre, etc.)

Convergent Software Stacks for HPC, Big Data and Deep Learning

SLIDE 14

  • Message Passing Interface (MPI) is the common programming model in scientific computing
  • Has 100’s of APIs and Primitives (Point-to-point, RMA, Collectives, Datatypes, …)
  • Multiple challenges for MPI developers, users, and managers of HPC centers

– How to optimize the designs of these APIs on various hardware platforms and configurations?

  • Designers and developers

– How to compare the performance of an MPI library (at the API-level) across various platforms and configurations?

  • Designers, developers and users

– How to compare the performance of multiple MPI libraries (at the API-level) on a given platform and across platforms?

  • Procurement decision by managers

– How to correlate the performance from the micro-benchmark level to the overall application level?

  • Application developers and users, also beneficial for co-designs

Need for Micro-Benchmarks to Design and Evaluate Programming Models

SLIDE 15

  • Available since 2004 (https://mvapich.cse.ohio-state.edu/benchmarks)
  • Suite of microbenchmarks to study communication performance of various programming models
  • Benchmarks available for the following programming models

– Message Passing Interface (MPI)

– Partitioned Global Address Space (PGAS)

  • Unified Parallel C (UPC)
  • Unified Parallel C++ (UPC++)
  • OpenSHMEM
  • Benchmarks available for multiple accelerator based architectures

– Compute Unified Device Architecture (CUDA)

– OpenACC Application Program Interface

  • Part of various national resource procurement suites like NERSC-8 / Trinity Benchmarks
  • Continuing to add support for newer primitives and features

OSU Micro-Benchmarks (OMB)

SLIDE 16

  • Host-Based

– Point-to-point

– Collectives

  • Blocking and Non-Blocking
  • Job-startup
  • GPU-Based

– CUDA-aware

  • Point-to-point: Device-to-Device (DD), Device-to-Host (DH), Host-to-Device (HD)
  • Collectives

– Managed Memory

  • Point-to-point: Managed-Device-to-Managed-Device (MD-MD)

OSU Micro-Benchmarks (MPI): Examples and Capabilities

SLIDE 17

One-way Latency: MPI over IB with MVAPICH2

[Figure: two panels, Small Message Latency and Large Message Latency; x-axis: Message Size (bytes), y-axis: Latency (us); small-message latencies annotated on the plot: 1.11, 1.19, 1.01, 1.15, 1.04 and 1.1 us]

Platforms compared:

  • TrueScale-QDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
  • ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
  • ConnectIB-Dual FDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
  • ConnectX-4-EDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB Switch
  • Omni-Path - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch
  • ConnectX-6-HDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB Switch
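One-way latency figures of this kind are usually obtained with a ping-pong loop: one side sends a message, the peer echoes it back, and half of the average round-trip time is reported. The sketch below illustrates only the measurement method. It is not OMB code: it uses two Python threads and a local socket pair instead of MPI ranks, and the function names (`recv_exact`, `pingpong_latency_us`) and default iteration counts are assumptions made for this example.

```python
import socket
import threading
import time


def recv_exact(sock, nbytes):
    """Receive exactly nbytes from a stream socket."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def pingpong_latency_us(msg_size, iterations=1000, warmup=100):
    """Average one-way latency in microseconds for msg_size-byte messages."""
    left, right = socket.socketpair()
    total = iterations + warmup

    def echo_peer():
        # The peer bounces every message straight back (the "pong").
        for _ in range(total):
            right.sendall(recv_exact(right, msg_size))

    peer = threading.Thread(target=echo_peer)
    peer.start()
    payload = b"x" * msg_size
    start = 0.0
    for i in range(total):
        if i == warmup:
            start = time.perf_counter()  # warm-up iterations are not timed
        left.sendall(payload)
        recv_exact(left, msg_size)
    elapsed = time.perf_counter() - start
    peer.join()
    left.close()
    right.close()
    # Each timed iteration is a full round trip, so halve it.
    return elapsed * 1e6 / (2 * iterations)


if __name__ == "__main__":
    for size in (1, 64, 1024):
        print(size, "bytes:", round(pingpong_latency_us(size), 2), "us")
```

The absolute numbers from this stand-in are not comparable to the interconnect latencies on the slide; only the warm-up, timed loop and divide-by-two structure carries over.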

SLIDE 18

Bandwidth: MPI over IB with MVAPICH2

[Figure: two panels, Unidirectional Bandwidth and Bidirectional Bandwidth; x-axis: Message Size (bytes), y-axis: Bandwidth (MBytes/sec); series: TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR, ConnectX-4-EDR, Omni-Path and ConnectX-6 HDR; values annotated on the unidirectional plot: 12,590, 3,373, 6,356, 12,083, 12,366 and 24,532; values annotated on the bidirectional plot: 21,227, 12,161, 21,983, 6,228, 48,027 and 24,136]

Platforms compared:

  • TrueScale-QDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
  • ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
  • ConnectIB-Dual FDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
  • ConnectX-4-EDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB Switch
  • Omni-Path - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch
  • ConnectX-6-HDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB Switch

SLIDE 19

Intra-node Point-to-Point Performance on OpenPOWER

Platform: Two nodes of OpenPOWER (Power9-ppc64le) CPU using Mellanox EDR (MT4121) HCA

[Figure: four panels comparing MVAPICH2-X 2.3.2 and SpectrumMPI-10.3.0.01: Intra-Socket Small Message Latency (us, 4 bytes to 2K, annotation: 0.25us), Intra-Socket Large Message Latency (us, 4K to 2M), Intra-Socket Bandwidth (MB/s, 1 byte to 2M) and Intra-Socket Bi-directional Bandwidth (MB/s, 1 byte to 2M)]

SLIDE 20

Point-to-point: Latency & Bandwidth (Inter-socket) on ARM

[Figure: six panels comparing MVAPICH2-X and OpenMPI+UCX over Message Size (Bytes): Latency - Small Messages, Latency - Medium Messages and Latency - Large Messages (us); Bandwidth - Small Messages, Bandwidth - Medium Messages and Bandwidth - Large Messages (MB/s); annotations: 3.5x better, 8.3x better and 5x better]

SLIDE 22

MPI_Allreduce on KNL + Omni-Path (10,240 Processes)

[Figure: two panels of OSU Micro Benchmark MPI_Allreduce latency at 64 PPN, one for message sizes of 4 to 4096 bytes and one for 8K to 256K; x-axis: Message Size, y-axis: Latency (us); series: MVAPICH2, MVAPICH2-OPT and IMPI; annotation: 2.4X]

  • For MPI_Allreduce with 32K-byte messages, MVAPICH2-OPT reduces the latency by 2.4X
  • M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, SuperComputing '17.

Available since MVAPICH2-X 2.3b
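The cited paper's title names a data-partitioning, multi-leader design. One way to picture that idea is the three-phase scheme below, written as a plain-Python simulation over lists. This is an illustrative sketch under assumptions, not the MVAPICH2 implementation: the phase structure, the function names (`chunk_bounds`, `multi_leader_allreduce`) and the choice of sum as the reduction are all made up for this example.

```python
def chunk_bounds(length, parts):
    """Split range(length) into `parts` contiguous, near-equal slices."""
    base, extra = divmod(length, parts)
    bounds, start = [], 0
    for p in range(parts):
        end = start + base + (1 if p < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def multi_leader_allreduce(nodes, leaders):
    """Simulate a sum-allreduce with `leaders` leader ranks per node.

    nodes[n][r] is the input vector of local rank r on node n. The return
    value has the same shape, with every rank holding the global sum.
    """
    length = len(nodes[0][0])
    bounds = chunk_bounds(length, leaders)

    # Phase 1 (intra-node): leader j sums partition j over all local ranks.
    partial = [
        [[sum(rank[i] for rank in node) for i in range(lo, hi)]
         for (lo, hi) in bounds]
        for node in nodes
    ]

    # Phase 2 (inter-node): leader j combines partition j with leader j of
    # every other node; the per-partition exchanges are independent.
    combined = [
        [sum(partial[n][j][k] for n in range(len(nodes)))
         for k in range(hi - lo)]
        for j, (lo, hi) in enumerate(bounds)
    ]

    # Phase 3 (intra-node): partitions are concatenated and shared locally.
    result = [x for part in combined for x in part]
    return [[list(result) for _ in node] for node in nodes]


if __name__ == "__main__":
    data = [[[1, 2, 3, 4], [10, 20, 30, 40]],
            [[100, 200, 300, 400], [1000, 2000, 3000, 4000]]]
    print(multi_leader_allreduce(data, leaders=2)[0][0])
```

In this picture each leader handles only a fraction of the vector, so the inter-node step is spread over several ranks per node instead of one.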

SLIDE 23

Shared Address Space (XPMEM)-based Collectives Design

  • “Shared Address Space”-based true zero-copy Reduction collective designs in MVAPICH2
  • Offloaded computation/communication to peer ranks in reduction collective operation
  • Up to 4X improvement for 4MB Reduce and up to 1.8X improvement for 4MB AllReduce

[Figure: two panels, OSU_Allreduce (Broadwell 256 procs) and OSU_Reduce (Broadwell 256 procs); x-axis: Message Size (16K to 4M), y-axis: Latency (us, log scale from 1 to 100000); series: MVAPICH2-2.3b, IMPI-2017v1.132 and MVAPICH2-X-2.3rc1 (labeled MVAPICH2-2.3rc1 in the OSU_Reduce panel); annotations: 73.2 and 1.8X on OSU_Allreduce; 4X, 36.1, 37.9 and 16.8 on OSU_Reduce]

  • J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018.

Available since MVAPICH2-X 2.3rc1

SLIDE 24

Evaluation of SHArP based Non Blocking Allreduce

MPI_Iallreduce Benchmark

[Figure: two panels comparing MVAPICH2 and MVAPICH2-SHArP for message sizes of 4 to 128 bytes at 1 PPN*, 8 Nodes: Pure Communication Latency (us), Lower is Better; Communication-Computation Overlap (%), Higher is Better; annotation: 2.3x]

*PPN: Processes Per Node

  • Complete offload of Allreduce collective operation to Switch helps to have much higher overlap of communication and computation

Available since MVAPICH2 2.3a
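The overlap percentage plotted for a non-blocking collective can be derived from three timings: the pure communication latency, the injected compute time, and the overall time when both run together. The sketch below assumes one common definition (the fraction of communication time hidden behind computation). The slide does not spell out the exact formula, so treat the definition, the function name `overlap_percent` and its parameters as assumptions.

```python
def overlap_percent(pure_comm_us, compute_us, overall_us):
    """Percentage of communication time hidden behind computation.

    pure_comm_us: latency of the collective with no computation injected
    compute_us:   time spent in the injected computation alone
    overall_us:   time for initiating the collective, computing, and waiting
    """
    if pure_comm_us <= 0:
        raise ValueError("pure communication time must be positive")
    # Communication time that was NOT hidden by the computation.
    exposed = overall_us - compute_us
    overlap = 100.0 * (1.0 - exposed / pure_comm_us)
    # Clamp to [0, 100] to absorb timer noise.
    return max(0.0, min(100.0, overlap))


if __name__ == "__main__":
    # Half of a 10 us collective is hidden behind 50 us of computation.
    print(overlap_percent(pure_comm_us=10.0, compute_us=50.0, overall_us=55.0))
```

Under this definition, a collective that is fully offloaded and progresses without the host would approach 100%, while one that only makes progress inside the final wait would approach 0%.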

SLIDE 26

Startup Performance on TACC Frontera

  • MPI_Init takes 3.9 seconds on 57,344 processes on 1,024 nodes
  • All numbers reported with 56 processes per node

[Figure: MPI_Init on Frontera; x-axis: Number of Processes (56 to 57344), y-axis: Time Taken (Milliseconds); series: Intel MPI 2019 and MVAPICH2 2.3.2; annotations: 4.5s and 3.9s]

New designs available in MVAPICH2-2.3.2

SLIDE 28

Optimizing MPI Data Movement on GPU Clusters

  • Connected as PCIe devices – Flexibility but Complexity

[Diagram: two nodes (Node 0 and Node 1) with CPUs linked by QPI, GPUs attached over PCIe, and IB adapters connecting the nodes]

Memory buffers:

  • 1. Intra-GPU
  • 2. Intra-Socket GPU-GPU
  • 3. Inter-Socket GPU-GPU
  • 4. Inter-Node GPU-GPU
  • 5. Intra-Socket GPU-Host
  • 6. Inter-Socket GPU-Host
  • 7. Inter-Node GPU-Host
  • 8. Inter-Node GPU-GPU with IB adapter on remote socket

and more . . .

  • For each path different schemes: Shared_mem, IPC, GPUDirect RDMA, pipeline …
  • Critical for runtimes to optimize data movement while hiding the complexity

SLIDE 29

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU

At Sender: MPI_Send(s_devbuf, size, …);

At Receiver: MPI_Recv(r_devbuf, size, …);

inside MVAPICH2

  • Standard MPI interfaces used for unified data movement
  • Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
  • Overlaps data movement from GPU with RDMA transfers

High Performance and High Productivity
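The benefit of overlapping GPU data movement with RDMA transfers can be seen with a small cost model. The sketch assumes the message is split into equal chunks and that each chunk needs a GPU-side copy followed by a network transfer. These assumptions, the function names and the example timings are illustrative only and do not describe MVAPICH2's internal protocol.

```python
def serial_time(chunks, copy_us, net_us):
    """No overlap: every chunk is copied, then sent, one after another."""
    return chunks * (copy_us + net_us)


def pipelined_time(chunks, copy_us, net_us):
    """Two-stage pipeline: chunk i+1 is copied while chunk i is on the wire."""
    copy_done = 0.0
    net_done = 0.0
    for _ in range(chunks):
        # Copies are issued back to back on the GPU side.
        copy_done += copy_us
        # A transfer starts once its chunk is copied and the link is free.
        net_done = max(net_done, copy_done) + net_us
    return net_done


if __name__ == "__main__":
    n, copy_us, net_us = 8, 5.0, 10.0  # hypothetical timings
    print("serial   :", serial_time(n, copy_us, net_us), "us")
    print("pipelined:", pipelined_time(n, copy_us, net_us), "us")
```

With many chunks the pipelined time approaches the cost of the slower stage alone, which is why overlap matters most for large messages.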

SLIDE 30

Optimized MVAPICH2-GDR Design (D-D)

[Figure: three panels comparing MV2-(NO-GDR) and MV2-GDR-2.3 over Message Size (Bytes): GPU-GPU Inter-node Latency (us, 1 byte to 8K), GPU-GPU Inter-node Bandwidth (MB/s, 1 byte to 4K) and GPU-GPU Inter-node Bi-Bandwidth (MB/s, 1 byte to 4K); annotations: 1.85us, 11X, 10x and 9x]

Platform:

  • MVAPICH2-GDR-2.3
  • Intel Haswell (E5-2687W @ 3.10 GHz) node - 20 cores
  • NVIDIA Volta V100 GPU
  • Mellanox Connect-X4 EDR HCA
  • CUDA 9.0
  • Mellanox OFED 4.0 with GPU-Direct-RDMA

SLIDE 31

D-to-D Performance on OpenPOWER w/ GDRCopy (NVLink2 + Volta)

Platform: OpenPOWER (POWER9-ppc64le) nodes equipped with a dual-socket CPU, 4 Volta V100 GPUs, and 2-port EDR InfiniBand Interconnect

  • Intra-node Bandwidth: 62.79 GB/sec for 4MB (via NVLINK2)
  • Intra-node Latency: 0.90 us (with GDRCopy)
  • Inter-node Latency: 2.04 us (with GDRCopy)
  • Inter-node Bandwidth: 12.03 GB/sec (2 port EDR)

Available since MVAPICH2-GDR 2.3.2

[Figure: six panels over Message Size (Bytes): INTRA-NODE LATENCY (SMALL), INTRA-NODE LATENCY (LARGE) and INTRA-NODE BANDWIDTH, each with Intra-Socket and Inter-Socket series; INTER-NODE LATENCY (SMALL), INTER-NODE LATENCY (LARGE) and INTER-NODE BANDWIDTH; latency in us, bandwidth in GB/sec]

SLIDE 32

D-to-H & H-to-D Performance on OpenPOWER w/ GDRCopy (NVLink2 + Volta)

Platform: OpenPOWER (POWER9-ppc64le) nodes equipped with a dual-socket CPU, 4 Volta V100 GPUs, and 2-port EDR InfiniBand Interconnect

  • Intra-node D-H Bandwidth: 16.70 GB/sec for 2MB (via NVLINK2)
  • Intra-node D-H Latency: 0.49 us (with GDRCopy)
  • Intra-node H-D Latency: 0.49 us (with GDRCopy)
  • Intra-node H-D Bandwidth: 26.09 GB/sec for 2MB (via NVLINK2)

Available since MVAPICH2-GDR 2.3a

[Figure: six panels comparing Spectrum MPI and MV2-GDR over Message Size (Bytes): D-H INTRA-NODE LATENCY (SMALL), D-H INTRA-NODE LATENCY (LARGE), D-H INTRA-NODE BW, H-D INTRA-NODE LATENCY (SMALL), H-D INTRA-NODE LATENCY (LARGE) and H-D INTRA-NODE BW; latency in us, bandwidth in GB/sec]

SLIDE 33

MVAPICH2-GDR: Enhanced MPI_Allreduce at Scale

  • Optimized designs in upcoming MVAPICH2-GDR offer better performance for most cases
  • MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) up to 1,536 GPUs

[Figure: three panels. Latency on 1,536 GPUs: Latency (us) versus Message Size (4 bytes to 16K), MVAPICH2-GDR-2.3.2 vs. NCCL 2.4, annotation: 1.6X better. Bandwidth on 1,536 GPUs: Bandwidth (GB/s) versus Message Size (32M to 256M), MVAPICH2-GDR-2.3.2 vs. NCCL 2.4, annotation: 1.7X better. 128MB Message: Bandwidth (GB/s) versus Number of GPUs (24 to 1536), SpectrumMPI 10.2.0.11, OpenMPI 4.0.1, NCCL 2.4 and MVAPICH2-GDR-2.3.2, annotation: 1.7X better]

Platform: Dual-socket IBM POWER9 CPU, 6 NVIDIA Volta V100 GPUs, and 2-port InfiniBand EDR Interconnect

SLIDE 35

Managed Memory Performance (Inter-node x86) with MVAPICH2-GDR

[Figure: three panels for Managed-Device-to-Managed-Device (MD-MD) transfers: Latency, Bandwidth and Bi-Bandwidth]

SLIDE 36

Managed Memory Performance (OpenPOWER Intra-node)

[Figure: three panels for Managed-Device-to-Managed-Device (MD-MD) transfers: Latency, Bandwidth and Bi-Bandwidth]

SLIDE 38

  • Substantial impact on designing and utilizing data management and processing systems in multiple tiers

– Front-end data accessing and serving (Online)

  • Memcached + DB (e.g. MySQL), HBase

– Back-end data analytics (Offline)

  • HDFS, MapReduce, Spark

Data Management and Processing on Modern Datacenters

SLIDE 40

  • RDMA for Apache Spark
  • RDMA for Apache Hadoop 3.x (RDMA-Hadoop-3.x)
  • RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)

– Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions

  • RDMA for Apache Kafka
  • RDMA for Apache HBase
  • RDMA for Memcached (RDMA-Memcached)
  • RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
  • OSU HiBD-Benchmarks (OHB)

– HDFS, Memcached, HBase, and Spark Micro-benchmarks

  • http://hibd.cse.ohio-state.edu
  • User Base: 315 organizations from 35 countries
  • More than 31,600 downloads from the project site

The High-Performance Big Data (HiBD) Project

  • Available for InfiniBand and RoCE; also runs on Ethernet
  • Available for x86 and OpenPOWER
  • Support for Singularity and Docker

SLIDE 41

  • Hadoop Benchmarks

– DFSIO, Terasort, Teragen, HiBench, …

  • PUMA
  • YCSB
  • Spark Benchmarks
– GroupBy, PageRank, K-means, …
  • BigData Bench

Current set of Benchmarks for Big Data

SLIDE 42

  • The current benchmarks reveal some performance behavior
  • However, they do not provide any information to the designer/developer on:

– What is happening at the lower layer?

– Where are the benefits coming from?

– Which design is leading to benefits or bottlenecks?

– Which component in the design needs to be changed, and what will be its impact?

– Can performance gain/loss at the lower layer be correlated to the performance gain/loss observed at the upper layer?

Are the Current Benchmarks Sufficient for Big Data?

SLIDE 43

Challenges in Benchmarking of Optimized Designs

[Diagram: layered Big Data software stack]

  • Applications
  • Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached)
  • Programming Models (Sockets); Other Protocols?
  • Communication and I/O Library: Point-to-Point Communication, QoS & Fault Tolerance, Threaded Models and Synchronization, Performance Tuning, I/O and File Systems, Virtualization (SR-IOV), Benchmarks, RDMA Protocols
  • Networking Technologies (InfiniBand, 1/10/40/100 GigE and Intelligent NICs)
  • Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators)
  • Storage Technologies (HDD, SSD, NVM, and NVMe-SSD)

Diagram annotations: Current Benchmarks, No Benchmarks, Correlation?

SLIDE 44

Iterative Process – Requires Deeper Investigation and Design for Benchmarking Next Generation Big Data Systems and Applications

[Diagram: layered Big Data software stack]

  • Applications
  • Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached)
  • Programming Models (Sockets); Other Protocols?
  • Communication and I/O Library: Point-to-Point Communication, QoS & Fault Tolerance, Threaded Models and Synchronization, Performance Tuning, I/O and File Systems, Virtualization (SR-IOV), Benchmarks, RDMA Protocols
  • Networking Technologies (InfiniBand, 1/10/40/100 GigE and Intelligent NICs)
  • Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators)
  • Storage Technologies (HDD, SSD, NVM, and NVMe-SSD)

Diagram annotations: Applications-Level Benchmarks, Micro-Benchmarks

SLIDE 45

  • Evaluate the performance of standalone HDFS
  • Five different benchmarks

– Sequential Write Latency (SWL)

– Sequential or Random Read Latency (SRL or RRL)

– Sequential Write Throughput (SWT)

– Sequential Read Throughput (SRT)

– Sequential Read-Write Throughput (SRWT)

OSU HiBD Micro-Benchmark (OHB) Suite - HDFS

Configurable parameters (table columns): File Name, File Size, HDFS Parameter, Readers, Writers, Random/Sequential Read, Seek Interval

  • SWL: √ √ √
  • SRL/RRL: √ √ √ √ √ (RRL)
  • SWT: √ √ √
  • SRT: √ √ √
  • SRWT: √ √ √ √

  • N. S. Islam, X. Lu, M. W. Rahman, J. Jose, and D. K. Panda, A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Int'l Workshop on Big Data Benchmarking (WBDB '12), December 2012

SLIDE 46

  • Evaluate the performance of stand-alone MapReduce
  • Does not require or involve HDFS or any other distributed file system
  • Models shuffle data patterns in real-world Hadoop application workloads
  • Considers various factors that influence the data shuffling phase

– underlying network configuration, number of map and reduce tasks, intermediate shuffle data pattern, shuffle data size etc.

  • Two different micro-benchmarks based on generic intermediate shuffle patterns

– MR-AVG: intermediate data is evenly distributed (or approx. equal) among reduce tasks

  • MR-RR i.e., round-robin distribution and MR-RAND i.e., pseudo-random distribution

– MR-SKEW: intermediate data is unevenly distributed among reduce tasks

  • Total number of shuffle key/value pairs, max% per reducer, min% per reducer to configure skew

OSU HiBD Micro-Benchmark (OHB) Suite - MapReduce

  • D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks, BPOE-5 (2014)
  • D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, Characterizing and benchmarking stand-alone Hadoop MapReduce on modern HPC clusters, The Journal of Supercomputing (2016)
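The three shuffle patterns named on this slide (MR-RR, MR-RAND and MR-SKEW) differ only in how intermediate key/value pairs are spread over reduce tasks. The sketch below counts how many pairs each reducer would receive under each pattern. It is an illustration, not the OHB code: the function names are made up, and the skew rule in particular (shares falling linearly from max% to min%) is a hypothetical choice, since the slide only lists the parameters that configure skew.

```python
import random


def mr_rr(num_pairs, reducers):
    """MR-RR: pair i goes to reducer i mod reducers (round-robin)."""
    counts = [0] * reducers
    for i in range(num_pairs):
        counts[i % reducers] += 1
    return counts


def mr_rand(num_pairs, reducers, seed=0):
    """MR-RAND: each pair goes to a pseudo-randomly chosen reducer."""
    rng = random.Random(seed)
    counts = [0] * reducers
    for _ in range(num_pairs):
        counts[rng.randrange(reducers)] += 1
    return counts


def mr_skew(num_pairs, reducers, max_pct, min_pct):
    """MR-SKEW (hypothetical rule): shares fall linearly from max_pct to
    min_pct across reducers and are normalised to num_pairs in total."""
    if reducers == 1:
        return [num_pairs]
    step = (max_pct - min_pct) / (reducers - 1)
    weights = [max_pct - step * r for r in range(reducers)]
    total = sum(weights)
    counts = [int(num_pairs * w / total) for w in weights]
    counts[0] += num_pairs - sum(counts)  # rounding remainder to reducer 0
    return counts


if __name__ == "__main__":
    print("MR-RR  :", mr_rr(1000, 4))
    print("MR-RAND:", mr_rand(1000, 4))
    print("MR-SKEW:", mr_skew(1000, 4, max_pct=40, min_pct=10))
```

Round-robin and pseudo-random assignment both give near-equal loads (the MR-AVG case), while the skewed variant concentrates work on a few reducers, which stresses the shuffle differently.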

SLIDE 47

  • Two different micro-benchmarks to evaluate the performance of standalone Hadoop RPC

– Latency: Single Server, Single Client

– Throughput: Single Server, Multiple Clients

  • A simple script framework for job launching and resource monitoring
  • Calculates statistics like Min, Max, Average
  • Network configuration, Tunable parameters, DataType, CPU Utilization

OSU HiBD Micro-Benchmark (OHB) Suite - RPC

Latency benchmark parameters (table columns): Component, Network Address, Port, Data Type, Min Msg Size, Max Msg Size, No. of Iterations, Handlers, Verbose

  • lat_client: √ √ √ √ √ √ √
  • lat_server: √ √ √ √

Throughput benchmark parameters (table columns): Component, Network Address, Port, Data Type, Min Msg Size, Max Msg Size, No. of Iterations, No. of Clients, Handlers, Verbose

  • thr_client: √ √ √ √ √ √ √
  • thr_server: √ √ √ √ √ √

  • X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks, Int'l Workshop on Big Data Benchmarking (WBDB '13), July 2013

SLIDE 48

  • Evaluates the performance of stand-alone Memcached in different modes
  • Default API Latency benchmarks for Memcached in-memory mode

– SET Micro-benchmark: Micro-benchmark for memcached set operations

– GET Micro-benchmark: Micro-benchmark for memcached get operations

– MIX Micro-benchmark: Micro-benchmark for a mix of memcached set/get operations (Read:Write ratio is 90:10)

  • Latency benchmarks for Memcached hybrid-memory mode
  • Non-Blocking API Latency Benchmark for Memcached (both in-memory and hybrid-memory mode)
  • Calculates average latency of Memcached operations in different modes

OSU HiBD Micro-Benchmark (OHB) Suite - Memcached

  • D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, Benchmarking Key-Value Stores on High-Performance Storage and Interconnects for Web-Scale Workloads, IEEE International Conference on Big Data (IEEE BigData ‘15), Oct 2015
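The MIX micro-benchmark's 90:10 read:write ratio and its average-latency metric can be illustrated with a short loop. The sketch below uses an in-process Python dict as a stand-in for a Memcached server, so it exercises no real Memcached client API. The function name `mix_benchmark`, the key naming scheme and the rule "every tenth operation is a set" are assumptions made for this example.

```python
import time


def mix_benchmark(store, num_ops, num_keys=100, value_size=32):
    """Run a 90:10 get/set mix against `store` (any dict-like object).

    Returns (average latency per operation in microseconds, gets, sets).
    """
    value = b"v" * value_size
    keys = ["key-%d" % k for k in range(num_keys)]
    for key in keys:  # preload outside the timed region so every get hits
        store[key] = value

    gets = sets = 0
    start = time.perf_counter()
    for i in range(num_ops):
        key = keys[i % num_keys]
        if i % 10 == 9:  # one write after every nine reads (90:10)
            store[key] = value
            sets += 1
        else:
            _ = store[key]
            gets += 1
    elapsed = time.perf_counter() - start

    avg_us = elapsed * 1e6 / num_ops if num_ops else 0.0
    return avg_us, gets, sets


if __name__ == "__main__":
    avg_us, gets, sets = mix_benchmark({}, num_ops=100000)
    print("avg latency: %.3f us (%d gets, %d sets)" % (avg_us, gets, sets))
```

Replacing the dict with a real client object would turn the same loop into an end-to-end measurement that includes network and server time.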

SLIDE 49

  • HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation to have better fault-tolerance as well as performance. This mode is enabled by default in the package.
  • HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in-memory and obtain as much performance benefit as possible.
  • HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
  • HHH-L-BB: This mode deploys a Memcached-based burst buffer system to reduce the bandwidth bottleneck of shared file system access. The burst buffer design is hosted by Memcached servers, each of which has a local SSD.
  • MapReduce over Lustre, with/without local disks: Besides HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.
  • Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).

Different Modes of RDMA for Apache Hadoop 2.x

SLIDE 50

Using HiBD Packages for Big Data Processing on Existing HPC Infrastructure

SLIDE 51

  • High-Performance Design of Spark over RDMA-enabled Interconnects

– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Spark

– RDMA-based data shuffle and SEDA-based shuffle architecture

– Non-blocking and chunk-based data transfer

– Off-JVM-heap buffer management

– Support for OpenPOWER

– Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)

  • Current release: 0.9.5

– Based on Apache Spark 2.1.0

– Tested with

  • Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
  • RoCE support with Mellanox adapters
  • Various multi-core platforms (x86, POWER)
  • RAM disks, SSDs, and HDD

– http://hibd.cse.ohio-state.edu

RDMA for Apache Spark Distribution

slide-52
SLIDE 52

Using HiBD Packages for Big Data Processing on Existing HPC Infrastructure

slide-53
SLIDE 53

  • MVAPICH Project

– MPI and PGAS Library with CUDA-Awareness

  • HiBD Project

– High-Performance Big Data Analytics Library

  • HiDL Project

– High-Performance Deep Learning

  • Public Cloud Deployment

– Microsoft-Azure and Amazon-AWS

  • Conclusions

Presentation Overview

slide-54
SLIDE 54

  • Deep Learning frameworks are a different game altogether

– Unusually large message sizes (order of megabytes)
– Most communication based on GPU buffers

  • Existing State-of-the-art

– cuDNN, cuBLAS, NCCL --> scale-up performance
– NCCL2, CUDA-Aware MPI --> scale-out performance

  • For small and medium message sizes only!
  • Proposed: Can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?

– Efficient Overlap of Computation and Communication
– Efficient Large-Message Communication (Reductions)
– What application co-designs are needed to exploit communication-runtime co-designs?

Deep Learning: New Challenges for MPI Runtimes

[Figure axes: Scale-up Performance vs. Scale-out Performance]

  • A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)

[Figure labels: cuDNN, gRPC, Hadoop, MPI, MKL-DNN, NCCL2; the target region is marked "Desired"]
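
The large-message reduction challenge on this slide can be illustrated with a ring all-reduce, a widely used bandwidth-oriented algorithm. The sketch below is a pure-Python simulation under stated assumptions: it is not the design used by MVAPICH2-GDR or S-Caffe, the function name `ring_allreduce` is hypothetical, and each "rank" is just a list held in one process.

```python
def ring_allreduce(buffers):
    """Simulate a sum all-reduce over equal-length per-rank buffers on a ring.

    Illustrative only: ranks are plain Python lists, not MPI processes.
    """
    p = len(buffers)
    n = len(buffers[0])
    assert all(len(b) == n for b in buffers), "all ranks need equal-length buffers"

    # Split the n elements into p contiguous chunks (some may be empty).
    bounds = [(i * n // p, (i + 1) * n // p) for i in range(p)]
    data = [list(b) for b in buffers]

    # Phase 1: reduce-scatter. In step s, rank r sends chunk (r - s) mod p
    # to rank r + 1, which adds it into its own copy of that chunk.
    for s in range(p - 1):
        outgoing = []
        for r in range(p):
            lo, hi = bounds[(r - s) % p]
            outgoing.append(data[r][lo:hi])  # snapshot: sends are simultaneous
        for r in range(p):
            dst = (r + 1) % p
            lo, _ = bounds[(r - s) % p]
            for k, value in enumerate(outgoing[r]):
                data[dst][lo + k] += value

    # Now rank r holds the fully reduced chunk (r + 1) mod p.
    # Phase 2: all-gather. In step s, rank r forwards chunk (r + 1 - s) mod p
    # to rank r + 1, which overwrites its copy with the reduced values.
    for s in range(p - 1):
        outgoing = []
        for r in range(p):
            lo, hi = bounds[(r + 1 - s) % p]
            outgoing.append(data[r][lo:hi])
        for r in range(p):
            dst = (r + 1) % p
            lo, hi = bounds[(r + 1 - s) % p]
            data[dst][lo:hi] = outgoing[r]

    return data


if __name__ == "__main__":
    ranks = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50], [100, 200, 300, 400, 500]]
    for rank, result in enumerate(ring_allreduce(ranks)):
        print(rank, result)  # every rank ends with [111, 222, 333, 444, 555]
```

In this scheme each rank sends only one chunk per step, so the data a rank transmits stays close to twice the buffer size regardless of the number of ranks, which is why ring-style reductions are often considered for megabyte-scale messages.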

slide-55
SLIDE 55

Big Data

(Hadoop, Spark, HBase, Memcached, etc.)

Deep Learning

(Caffe, TensorFlow, BigDL, etc.)

HPC

(MPI, RDMA, Lustre, etc.)

Convergent Software Stacks for HPC, Big Data and Deep Learning

slide-56
SLIDE 56

  • CPU-based Deep Learning

– Using MVAPICH2-X

  • GPU-based Deep Learning

– Using MVAPICH2-GDR

High-Performance Deep Learning

slide-57
SLIDE 57

Large-Scale Benchmarking of DL Frameworks on Frontera

  • TensorFlow, PyTorch, and MXNet are widely used Deep Learning Frameworks
  • Optimized by Intel using Math Kernel Library for DNN (MKL-DNN) for Intel processors
  • Single Node performance can be improved by running Multiple MPI processes

[Figure panels: Impact of Batch Size on Performance for ResNet-50; Performance Improvement using Multiple MPI processes]

*Jain et al., “Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera”, DLS ’19 (in conjunction with SC ’19).

slide-58
SLIDE 58

ResNet-50 using various DL benchmarks on Frontera

  • Observed 260K images per sec for ResNet-50 on 2,048 nodes
  • Scaled MVAPICH2-X on 2,048 nodes on Frontera for Distributed Training using TensorFlow
  • ResNet-50 can be trained in 7 minutes on 2,048 nodes (114,688 cores)

*Jain et al., “Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera”, DLS ’19 (in conjunction with SC ’19).
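
The aggregate figures on this slide can be broken down per node and per core with simple arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative, and the per-node and per-core rates assume throughput is spread evenly across the machine.

```python
# Figures quoted on the slide.
NODES = 2048
TOTAL_CORES = 114_688
IMAGES_PER_SEC = 260_000  # "260K images per sec"

# Derived quantities (assumes an even split across nodes and cores).
cores_per_node = TOTAL_CORES // NODES
images_per_sec_per_node = IMAGES_PER_SEC / NODES
images_per_sec_per_core = IMAGES_PER_SEC / TOTAL_CORES

print(f"cores per node:        {cores_per_node}")
print(f"images/sec per node:   {images_per_sec_per_node:.1f}")
print(f"images/sec per core:   {images_per_sec_per_core:.2f}")
```

This works out to 56 cores per node and roughly 127 images per second per node.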

slide-59
SLIDE 59

Benchmarking TensorFlow (TF) and PyTorch

[Figure panels: PyTorch, TensorFlow]

  • Comprehensive and systematic performance benchmarking

– tf_cnn_benchmarks (TF)
– Horovod benchmark (PyTorch)

  • TensorFlow is up to 2.5X faster than PyTorch for 128 nodes.
  • TensorFlow: up to 125X speedup for ResNet-152 on 128 nodes
  • PyTorch: Scales well but overall lower performance than TensorFlow

*Jain et al., “Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters”, IEEE Cluster ’19.
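
A speedup figure is easier to judge as parallel efficiency, that is, speedup divided by the number of nodes. The sketch below applies that standard definition to the 125X figure above; it assumes the speedup is measured relative to a single node, which the slide does not state, and the function name is illustrative.

```python
def parallel_efficiency(speedup, nodes):
    """Fraction of ideal linear scaling achieved (1.0 means perfectly linear)."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return speedup / nodes


# TensorFlow, ResNet-152: up to 125X on 128 nodes (baseline assumed to be 1 node).
tf_efficiency = parallel_efficiency(125, 128)
print(f"TensorFlow ResNet-152 efficiency: {tf_efficiency:.1%}")
```

Under that assumption, 125X on 128 nodes corresponds to about 97.7% of linear scaling.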

slide-60
SLIDE 60

  • CPU based Hybrid-Parallel (Data Parallelism and Model Parallelism) training on Stampede2
  • Benchmark developed for various configurations

– Batch sizes
– No. of model partitions
– No. of model replicas

  • Evaluation on a very deep model

– ResNet-1000 (a 1,000-layer model)

Benchmarking HyPar-Flow on Stampede2

*Awan et al., “HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models”, arXiv ’19. https://arxiv.org/pdf/1911.05146.pdf

110x speedup on 128 Intel Xeon Skylake nodes (TACC Stampede2 Cluster)
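
The benchmark dimensions listed on this slide (batch size, model partitions, model replicas) can be enumerated for a given process count. The sketch below is a hypothetical illustration, not HyPar-Flow's API: it assumes one process per model partition per replica, so that replicas times partitions equals the total process count, and it assumes the effective batch size grows with the number of replicas as in ordinary data parallelism.

```python
def hybrid_configs(total_processes):
    """Return every (replicas, partitions) pair whose product is total_processes.

    Assumption: each replica is split into `partitions` processes.
    """
    return [
        (replicas, total_processes // replicas)
        for replicas in range(1, total_processes + 1)
        if total_processes % replicas == 0
    ]


def effective_batch_size(per_replica_batch, replicas):
    """Samples consumed per training step across all model replicas."""
    return per_replica_batch * replicas


if __name__ == "__main__":
    for replicas, partitions in hybrid_configs(8):
        batch = effective_batch_size(32, replicas)
        print(f"replicas={replicas} partitions={partitions} effective_batch={batch}")
```

With 8 processes, the extremes are pure model parallelism (1 replica, 8 partitions) and pure data parallelism (8 replicas, 1 partition); the configurations in between are the hybrid cases.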

slide-61
SLIDE 61

  • CPU-based Deep Learning

– Using MVAPICH2-X

  • GPU-based Deep Learning

– Using MVAPICH2-GDR

High-Performance Deep Learning

slide-62
SLIDE 62

Distributed Training with TensorFlow and MVAPICH2-GDR

  • ResNet-50 Training using TensorFlow benchmark on SUMMIT -- 1536 Volta GPUs!
  • 1,281,167 (1.2 mil.) images
  • Time/epoch = 3.6 seconds
  • Total Time (90 epochs) ≈ 332 seconds ≈ 5.5 minutes (3.6 x 90 = 324 seconds, so the per-epoch time is presumably rounded)

[Figure: images per second (thousands) vs. number of GPUs (1 to 1536), comparing NCCL-2.4 and MVAPICH2-GDR-2.3.2]

Platform: The Summit Supercomputer (#1 on Top500.org) – 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 9.2

*We observed errors for NCCL2 beyond 96 GPUs

MVAPICH2-GDR reaching ~0.35 million images per second for ImageNet-1k!

ImageNet-1k has 1.2 million images
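
The training-time estimate on this slide follows from the dataset size and the measured throughput. The sketch below redoes that arithmetic using the approximate "~0.35 million images per second" figure, so the results are estimates; the variable names are illustrative.

```python
# Figures quoted on the slide.
IMAGES_PER_EPOCH = 1_281_167   # ImageNet-1k training images
IMAGES_PER_SEC = 350_000       # "~0.35 million images per second" (approximate)
EPOCHS = 90

epoch_seconds = IMAGES_PER_EPOCH / IMAGES_PER_SEC
total_seconds = epoch_seconds * EPOCHS
total_minutes = total_seconds / 60

print(f"time per epoch: {epoch_seconds:.2f} s")
print(f"total time:     {total_seconds:.0f} s ({total_minutes:.1f} min)")
```

With these inputs the estimate comes to about 3.66 seconds per epoch and about 5.5 minutes for 90 epochs.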

slide-63
SLIDE 63

  • Near-linear scaling may be achieved by tuning Horovod/MPI
  • Optimizing MPI/Horovod towards large message sizes for high-resolution images
  • Develop a generic Image Segmentation benchmark
  • Tuned DeepLabV3+ model using the benchmark and Horovod – up to 1.3X better than default

New Benchmark for Image Segmentation on Summit

*Anthony et al., “Scaling Semantic Image Segmentation using Tensorflow and MVAPICH2-GDR on HPC Systems” (Submission under review)

slide-64
SLIDE 64

Using HiDL Packages for Deep Learning on Existing HPC Infrastructure

slide-65
SLIDE 65

  • MVAPICH Project

– MPI and PGAS Library with CUDA-Awareness

  • HiBD Project

– High-Performance Big Data Analytics Library

  • HiDL Project

– High-Performance Deep Learning

  • Public Cloud Deployment

– Microsoft-Azure and Amazon-AWS

  • Conclusions

Presentation Overview

slide-66
SLIDE 66

  • Released on 08/16/2019
  • Major Features and Enhancements

– Based on MVAPICH2-2.3.2
– Enhanced tuning for point-to-point and collective operations
– Targeted for Azure HB & HC virtual machine instances
– Flexibility for 'one-click' deployment
– Tested with Azure HB & HC VM instances

MVAPICH2-Azure 2.3.2

slide-67
SLIDE 67

  • Released on 08/12/2019
  • Major Features and Enhancements

– Based on MVAPICH2-X 2.3
– New design based on Amazon EFA adapter's Scalable Reliable Datagram (SRD) transport protocol
– Support for XPMEM based intra-node communication for point-to-point and collectives
– Enhanced tuning for point-to-point and collective operations
– Targeted for AWS instances with Amazon Linux 2 AMI and EFA support
– Tested with c5n.18xlarge instance

MVAPICH2-X-AWS 2.3

slide-68
SLIDE 68

  • Upcoming Exascale systems need to be designed with a holistic view of HPC, Big Data, Deep Learning, and Cloud
  • Presented an overview of designing convergent software stacks
  • Presented benchmarks and middleware to enable HPC, Big Data, and Deep Learning communities to take advantage of current and next-generation systems

Concluding Remarks

slide-69
SLIDE 69

  • Supported through X-ScaleSolutions (http://x-scalesolutions.com)
  • Benefits:

– Help and guidance with installation of the library
– Platform-specific optimizations and tuning
– Timely support for operational issues encountered with the library
– Web portal interface to submit issues and track their progress
– Advanced debugging techniques
– Application-specific optimizations and tuning
– Obtaining guidelines on best practices
– Periodic information on major fixes and updates
– Information on major releases
– Help with upgrading to the latest release
– Flexible Service Level Agreements

  • Support provided to Lawrence Livermore National Laboratory (LLNL) for the last two years

Commercial Support for MVAPICH2, HiBD, and HiDL Libraries

slide-70
SLIDE 70

  • Has joined the OpenPOWER Consortium as a silver ISV member
  • Provides flexibility:

– To have MVAPICH2, HiDL and HiBD libraries integrated into the OpenPOWER software stack
– A part of the OpenPOWER ecosystem
– Can participate with different vendors for bidding, installation and deployment process

  • Introduced two new integrated products with support for OpenPOWER systems (Presented at the OpenPOWER North America Summit)

– X-ScaleHPC
– X-ScaleAI
– Send an e-mail to contactus@x-scalesolutions.com for free trial!!

Silver ISV Member for the OpenPOWER Consortium + Products

slide-71
SLIDE 71

  • Presentations at OSU and X-Scale Booth (#2094)

– Members of the MVAPICH, HiBD and HiDL teams
– External speakers

  • Presentations at SC main program (Tutorials and Workshops)
  • Presentations at many other booths and satellite events
  • Complete details available at http://mvapich.cse.ohio-state.edu/conference/752/talks/

Multiple Events at SC ‘19

slide-72
SLIDE 72

Funding Acknowledgments

Funding Support by

Equipment Support by

slide-73
SLIDE 73

Personnel Acknowledgments

Current Students (Graduate)

  • A. Awan (Ph.D.)

  • M. Bayatpour (Ph.D.)

  • C.-H. Chu (Ph.D.)

  • J. Hashmi (Ph.D.)

  • A. Jain (Ph.D.)

  • K. S. Kandadi (M.S.)

Past Students

  • A. Augustine (M.S.)

  • P. Balaji (Ph.D.)

  • R. Biswas (M.S.)

  • S. Bhagvat (M.S.)

  • A. Bhat (M.S.)

  • D. Buntinas (Ph.D.)

  • L. Chai (Ph.D.)

  • B. Chandrasekharan (M.S.)

  • S. Chakraborthy (Ph.D.)

  • N. Dandapanthula (M.S.)

  • V. Dhanraj (M.S.)

  • R. Rajachandrasekar (Ph.D.)

  • D. Shankar (Ph.D.)

  • G. Santhanaraman (Ph.D.)

  • A. Singh (Ph.D.)

  • J. Sridhar (M.S.)

  • S. Sur (Ph.D.)

  • H. Subramoni (Ph.D.)

  • K. Vaidyanathan (Ph.D.)

  • A. Vishnu (Ph.D.)

  • J. Wu (Ph.D.)

  • W. Yu (Ph.D.)

  • J. Zhang (Ph.D.)

Past Research Scientist

  • K. Hamidouche

  • S. Sur

  • X. Lu

Past Post-Docs

  • D. Banerjee

  • X. Besseron

  • H.-W. Jin

  • T. Gangadharappa (M.S.)

  • K. Gopalakrishnan (M.S.)

  • W. Huang (Ph.D.)

  • W. Jiang (M.S.)

  • J. Jose (Ph.D.)

  • S. Kini (M.S.)

  • M. Koop (Ph.D.)

  • K. Kulkarni (M.S.)

  • R. Kumar (M.S.)

  • S. Krishnamoorthy (M.S.)

  • K. Kandalla (Ph.D.)

  • M. Li (Ph.D.)

  • P. Lai (M.S.)

  • J. Liu (Ph.D.)

  • M. Luo (Ph.D.)

  • A. Mamidala (Ph.D.)

  • G. Marsh (M.S.)

  • V. Meshram (M.S.)

  • A. Moody (M.S.)

  • S. Naravula (Ph.D.)

  • R. Noronha (Ph.D.)

  • X. Ouyang (Ph.D.)

  • S. Pai (M.S.)

  • S. Potluri (Ph.D.)

  • Kamal Raj (M.S.)

  • K. S. Khorassani (Ph.D.)

  • P. Kousha (Ph.D.)

  • A. Quentin (Ph.D.)

  • B. Ramesh (M.S.)

  • S. Xu (M.S.)

  • J. Lin

  • M. Luo

  • E. Mancini

Past Programmers

  • D. Bureddy

  • J. Perkins

Current Research Specialist

  • J. Smith

  • S. Marcarelli

  • J. Vienne

  • H. Wang

Current Post-doc

  • M. S. Ghazimeersaeed

  • A. Ruhela

  • K. Manian

Current Students (Undergraduate)

  • V. Gangal (B.S.)

  • N. Sarkauskas (B.S.)

Past Research Specialist

  • M. Arnold

Current Research Scientist

  • H. Subramoni

  • Q. Zhou (Ph.D.)
slide-74
SLIDE 74

Thank You!

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

panda@cse.ohio-state.edu

The High-Performance MPI/PGAS Project http://mvapich.cse.ohio-state.edu/

The High-Performance Deep Learning Project http://hidl.cse.ohio-state.edu/

The High-Performance Big Data Project http://hibd.cse.ohio-state.edu/

Follow us on https://twitter.com/mvapich