SLIDE 1

RDMA-based Networking Technologies and Middleware for Next-Generation Clusters and Data Centers

Keynote Talk at KBNet ‘18 by

Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda

SLIDE 2

High-End Computing (HEC): Towards Exascale

100 PFlops in 2016; 122 PFlops in 2018; expected to have an ExaFlop system in 2020-2021!

SLIDE 3

Big Data – How Much Data Is Generated Every Minute on the Internet?

The global Internet population grew 7.5% from 2016 and now represents 3.7 billion people.

Courtesy: https://www.domo.com/blog/data-never-sleeps-5/

SLIDE 4

Resurgence of AI/Machine Learning/Deep Learning

Courtesy: http://www.zdnet.com/article/caffe2-deep-learning-wide-ambitions-flexibility-scalability-and-advocacy/

SLIDE 5

  • Substantial impact on designing and utilizing data management and processing systems in multiple tiers

– Front-end data accessing and serving (Online)

  • Memcached + DB (e.g. MySQL), HBase

– Back-end data analytics (Offline)

  • HDFS, MapReduce, Spark

Data Management and Processing on Modern Datacenters

SLIDE 6

Communication and Computation Requirements

  • Requests are received from clients over the WAN
  • Proxy nodes perform caching, load balancing, resource monitoring, etc.
  • If the content is not cached, the request is forwarded to the next tier, the application server
  • Application server performs the business logic (CGI, Java servlets, etc.)

– Retrieves appropriate data from the database to process the requests

[Diagram: Clients -> WAN -> Proxy Server -> Web Server (Apache) -> Application Server (PHP) -> Database Server (MySQL) -> Storage; computation and communication requirements grow toward the back-end tiers]

SLIDE 7

Big Data

(Hadoop, Spark, HBase, Memcached, etc.)

Deep Learning

(Caffe, TensorFlow, BigDL, etc.)

HPC

(MPI, RDMA, Lustre, etc.)

Increasing Usage of HPC, Big Data and Deep Learning on Modern Datacenters

Convergence of HPC, Big Data, and Deep Learning! Increasing Need to Run these applications on the Cloud!!

SLIDE 8

Can We Run HPC, Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

Physical Compute

SLIDE 9

Can We Run HPC, Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

SLIDE 10

Can We Run HPC, Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

SLIDE 11

Can We Run HPC, Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

[Diagram: Spark, Hadoop, and Deep Learning jobs running on the shared HPC infrastructure]

SLIDE 12

Trends in Network Speed Acceleration

Ethernet (1979 - ): 10 Mbit/sec
Fast Ethernet (1993 - ): 100 Mbit/sec
Gigabit Ethernet (1995 - ): 1000 Mbit/sec
ATM (1995 - ): 155/622/1024 Mbit/sec
Myrinet (1993 - ): 1 Gbit/sec
Fibre Channel (1994 - ): 1 Gbit/sec
InfiniBand (2001 - ): 2 Gbit/sec (1X SDR)
10-Gigabit Ethernet (2001 - ): 10 Gbit/sec
InfiniBand (2003 - ): 8 Gbit/sec (4X SDR)
InfiniBand (2005 - ): 16 Gbit/sec (4X DDR), 24 Gbit/sec (12X SDR)
InfiniBand (2007 - ): 32 Gbit/sec (4X QDR)
40-Gigabit Ethernet (2010 - ): 40 Gbit/sec
InfiniBand (2011 - ): 54.6 Gbit/sec (4X FDR)
InfiniBand (2012 - ): 2 x 54.6 Gbit/sec (4X Dual-FDR)
25-/50-Gigabit Ethernet (2014 - ): 25/50 Gbit/sec
100-Gigabit Ethernet (2015 - ): 100 Gbit/sec
Omni-Path (2015 - ): 100 Gbit/sec
InfiniBand (2015 - ): 100 Gbit/sec (4X EDR)
InfiniBand (2016 - ): 200 Gbit/sec (4X HDR)

Network speeds have increased by roughly 100 times in the last 17 years.

SLIDE 13

Available Interconnects and Protocols for Data Centers

[Diagram: application/middleware interfaces and the protocol/adapter/switch combinations beneath them — kernel-space sockets over TCP/IP on Ethernet adapters and switches; hardware-offloaded TCP/IP on 1/10/25/40/50/100 GigE-TOE; IPoIB on InfiniBand adapters and switches; user-space RSockets and SDP on InfiniBand; user-space TCP/IP on iWARP adapters with Ethernet switches; user-space RDMA on RoCE adapters with Ethernet switches; native IB verbs (RDMA) on InfiniBand; and OFI over 100 Gb/s Omni-Path adapters and switches]

SLIDE 14

Open Standard InfiniBand Networking Technology

  • Introduced in Oct 2000
  • High-Performance Data Transfer
    – Interprocessor communication and I/O
    – Low latency (< 1.0 microsec), high bandwidth (up to 25 GigaBytes/sec -> 200 Gbps), and low CPU utilization (5-10%)
  • Flexibility for LAN and WAN communication
  • Multiple Transport Services
    – Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), Unreliable Datagram (UD), and Raw Datagram
    – Provides flexibility to develop upper layers
  • Multiple Operations (see the verbs-level sketch after this slide)
    – Send/Recv
    – RDMA Read/Write
    – Atomic Operations (a unique capability)
      • Enable high-performance and scalable implementations of distributed locks, semaphores, and collective communication operations
  • Leading to big changes in designing HPC clusters, file systems, cloud computing systems, grid computing systems, ...
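
To make the RDMA operations above concrete, here is a minimal libibverbs sketch of posting a one-sided RDMA Write (illustrative only, not code from the talk). It assumes a connected RC queue pair qp, a locally registered buffer described by mr, and that the peer's virtual address (remote_addr) and rkey have already been exchanged out of band; the function and variable names are made up for the example.

/* Minimal sketch: one-sided RDMA Write over a connected RC QP. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *local_buf,
                    uint32_t len, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,  /* registered source buffer */
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided: target CPU is not involved */
    wr.send_flags          = IBV_SEND_SIGNALED;  /* generate a CQE when done */
    wr.wr.rdma.remote_addr = remote_addr;        /* obtained from the peer at setup */
    wr.wr.rdma.rkey        = rkey;
    return ibv_post_send(qp, &wr, &bad_wr);      /* returns 0 on success */
}

The same skeleton with IBV_WR_SEND (and a matching receive posted by the peer) gives the channel-semantics Send/Recv path.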

SLIDE 15

Communication in the Memory Semantics (RDMA Model)

[Diagram: initiator and target processors, each with memory, an InfiniBand device, a QP (send/recv queues), and a CQ; registered memory segments on both sides; the send WQE describes the local send buffer (possibly multiple segments) and the remote buffer (a single segment); the target HCA generates the hardware ACK]

The initiator processor is involved only to (1) post the send WQE and (2) reap the completed CQE from the send CQ; there is no involvement from the target processor (see the CQ-polling sketch below).
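
A minimal sketch of that completion step (assumed usage of the standard verbs API, not code from the talk): the initiator busy-polls the send CQ until the CQE for the posted WQE appears.

/* Reap one completion from the send CQ; cq is the CQ bound to the QP's send queue. */
#include <infiniband/verbs.h>
#include <stdio.h>

int wait_for_send_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);   /* non-blocking; returns number of CQEs reaped */
    } while (n == 0);
    if (n < 0 || wc.status != IBV_WC_SUCCESS) {
        fprintf(stderr, "RDMA completion failed: %s\n",
                n < 0 ? "poll error" : ibv_wc_status_str(wc.status));
        return -1;
    }
    return 0;   /* the data is now visible in the target's memory */
}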

SLIDE 16

  • 139 IB Clusters (27.8%) in the Jun’18 Top500 list

– (http://www.top500.org)

Large-scale InfiniBand Installations

  • Installations in the Top 50 (19 systems):

– 2,282,544 cores (Summit) at ORNL (1st)
– 1,572,480 cores (Sierra) at LLNL (3rd)
– 391,680 cores (ABCI) at AIST/Japan (5th)
– 253,600 cores (HPC4) in Italy (13th)
– 114,480 cores (Juwels Module 1) at FZJ/Germany (23rd)
– 241,108 cores (Pleiades) at NASA/Ames (24th)
– 220,800 cores (Pangea) in France (30th)
– 144,900 cores (Cheyenne) at NCAR/USA (31st)
– 72,000 cores (IT0 – Subsystem A) in Japan (32nd)
– 79,488 cores (JOLIOT-CURIE SKL) at CEA/France (34th)
– 155,150 cores (JURECA) at FZJ/Germany (38th)
– 72,800 cores Cray CS-Storm in US (40th)
– 72,800 cores Cray CS-Storm in US (41st)
– 78,336 cores (Electra) at NASA/Ames (43rd)
– 124,200 cores (Topaz) at ERDC DSRC/USA (44th)
– 60,512 cores NVIDIA DGX-1 at Facebook/USA (45th)
– 60,512 cores (DGX Saturn V) at NVIDIA/USA (46th)
– 113,832 cores (Damson) at AWE/UK (47th)
– 72,000 cores (HPC2) in Italy (49th)
– and many more!

The #2 system (Sunway TaihuLight) also uses InfiniBand

SLIDE 17

High-speed Ethernet Consortium (10GE/25GE/40GE/50GE/100GE)

  • 10GE Alliance formed by several industry leaders to take the Ethernet family to the next speed step
  • Goal: to achieve a scalable and high-performance communication architecture while maintaining backward compatibility with Ethernet
  • http://www.ethernetalliance.org
  • 40-Gbps (servers) and 100-Gbps Ethernet (backbones, switches, routers): IEEE 802.3 WG
  • 25-Gbps Ethernet Consortium targeting 25/50 Gbps (July 2014)
    – http://25gethernet.org
  • Energy-efficient and power-conscious protocols
    – On-the-fly link speed reduction for under-utilized links
  • Ethernet Alliance Technology Forum looking forward to 2026
    – http://insidehpc.com/2016/08/at-ethernet-alliance-technology-forum/

SLIDE 18

TOE and iWARP Accelerators

  • TCP Offload Engines (TOE)
    – Hardware acceleration for the entire TCP/IP stack
    – Initially patented by Tehuti Networks
    – Strictly refers to the IC on the network adapter that implements TCP/IP
    – In practice, usually refers to the entire network adapter
  • Internet Wide-Area RDMA Protocol (iWARP)
    – Standardized by the IETF and the RDMA Consortium
    – Supports acceleration features (like IB) for Ethernet
    – http://www.ietf.org & http://www.rdmaconsortium.org

SLIDE 19

RDMA over Converged Enhanced Ethernet (RoCE)

[Diagram: network stack comparison — native InfiniBand (IB verbs / IB transport / IB network / InfiniBand link layer), RoCE (IB verbs / IB transport / IB network / Ethernet link layer), and RoCE v2 (IB verbs / IB transport / UDP/IP / Ethernet link layer)]

  • Takes advantage of IB and Ethernet
    – Software written with IB verbs
    – Link layer is Converged (Enhanced) Ethernet (CE)
    – 100 Gb/s support from latest EDR and ConnectX-3 Pro adapters
  • Pros (RoCE vs. IB)
    – Works natively in Ethernet environments
      • Entire Ethernet management ecosystem is available
    – Has all the benefits of IB verbs
    – Link layer is very similar to the link layer of native IB, so there are no missing features
  • RoCE v2: additional benefits over RoCE
    – Traditional network management tools apply
    – ACLs (metering, accounting, firewalling)
    – IGMP snooping for optimized multicast
    – Network monitoring tools

[Packet header comparison: RoCE — ETH L2 header (Ethertype) / IB GRH L3 header / IB BTH+ L4 header; RoCE v2 — ETH L2 header (Ethertype) / IP L3 header (proto #) / UDP header (port #) / IB BTH+ L4 header]

Courtesy: OFED, Mellanox

SLIDE 20

HSE Scientific Computing Installations

  • 171 HSE compute systems with ranking in the Jun’18 Top500 list

– 38,400-core installation in China (#95) – new
– 38,400-core installation in China (#96) – new
– 38,400-core installation in China (#97) – new
– 39,680-core installation in China (#99)
– 66,560-core installation in China (#157)
– 66,280-core installation in China (#159)
– 64,000-core installation in China (#160)
– 64,000-core installation in China (#161)
– 72,000-core installation in China (#164)
– 64,320-core installation in China (#185) – new
– 78,000-core installation in China (#187)
– 75,776-core installation in China (#188) – new
– 59,520-core installation in China (#192)
– 59,520-core installation in China (#193)
– 28,800-core installation in China (#195) – new
– 62,400-core installation in China (#197) – new
– 64,800-core installation in China (#198)
– 66,000-core installation in China (#209) – new
– and many more!

SLIDE 21

Omni-Path Fabric Overview

Courtesy: Intel Corporation

  • Derived from QLogic InfiniBand
  • Layer 1.5: Link Transfer Protocol

– Features

  • Traffic Flow Optimization
  • Packet Integrity Protection
  • Dynamic Lane Switching

– Error detection/replay occurs in Link Transfer Packet units
– Retransmit request via NULL LTP; carries replay command flit

  • Layer 2: Link Layer

– Supports 24-bit fabric addresses
– Allows 10KB of L4 payload; 10,368-byte max packet size
– Congestion Management

  • Adaptive / Dispersive Routing
  • Explicit Congestion Notification

– QoS support

  • Traffic Class, Service Level, Service Channel and Virtual Lane
  • Layer 3: Data Link Layer

– Fabric addressing, switching, resource allocation and partitioning support

SLIDE 22

  • 39 Omni-Path Clusters (7.8%) in the Jun’18 Top500 list

– (http://www.top500.org)

Large-scale Omni-Path Installations

– 570,020 cores (Nurion) at KISTI/South Korea (11th)
– 556,104 cores (Oakforest-PACS) at JCAHPC in Japan (12th)
– 367,024 cores (Stampede2) at TACC in USA (15th)
– 312,936 cores (Marconi XeonPhi) at CINECA in Italy (18th)
– 135,828 cores (Tsubame 3.0) at TiTech in Japan (19th)
– 153,216 cores (MareNostrum) at BSC in Spain (22nd)
– 127,520 cores (Cobra) in Germany (28th)
– 55,296 cores (Mustang) at AFRL/USA (48th)
– 95,472 cores (Quartz) at LLNL in USA (63rd)
– 95,472 cores (Jade) at LLNL in USA (64th)
– 53,300 cores (Makman-3) at Saudi Aramco/Saudi Arabia (78th)
– 34,560 cores (Gaffney) at Navy DSRC/USA (85th)
– 34,560 cores (Koehr) at Navy DSRC/USA (86th)
– 49,432 cores (Mogon II) in Germany (87th)
– 38,553 cores (Molecular Simulator) in Japan (93rd)
– 35,280 cores (Quriosity) at BASF in Germany (94th)
– 54,432 cores (Marconi Xeon) at CINECA in Italy (98th)
– 46,464 cores (Peta4) at Cambridge/UK (101st)
– 53,352 cores (Grizzly) at LANL in USA (136th)
– and many more!

SLIDE 23

IB, Omni-Path, and HSE: Feature Comparison

Features                 IB                  iWARP/HSE      RoCE                RoCE v2    Omni-Path
Hardware Acceleration    Yes                 Yes            Yes                 Yes        Yes
RDMA                     Yes                 Yes            Yes                 Yes        Yes
Congestion Control       Yes                 Optional       Yes                 Yes        Yes
Multipathing             Yes                 Yes            Yes                 Yes        Yes
Atomic Operations        Yes                 No             Yes                 Yes        Yes
Multicast                Optional            No             Optional            Optional   Optional
Data Placement           Ordered             Out-of-order   Ordered             Ordered    Ordered
Prioritization           Optional            Optional       Yes                 Yes        Yes
Fixed BW QoS (ETS)       No                  Optional       Yes                 Yes        Yes
Ethernet Compatibility   No                  Yes            Yes                 Yes        Yes
TCP/IP Compatibility     Yes (using IPoIB)   Yes            Yes (using IPoIB)   Yes        Yes

SLIDE 24

Designing RDMA-based Communication and I/O Libraries for Clusters and Data Center Middleware: Challenges

[Diagram: Applications sit on top of cluster and data center middleware (MPI, PGAS, Memcached, HDFS, MapReduce, HBase, and gRPC/TensorFlow).  The middleware can use conventional programming models (sockets) or an RDMA-based communication and I/O library — with an RDMA-based communication substrate, threaded models and synchronization, QoS & fault tolerance, performance tuning, I/O and file systems, and virtualization (SR-IOV) — over networking technologies (InfiniBand, 1/10/40/100 GigE and intelligent NICs), commodity computing system architectures (multi- and many-core architectures and accelerators), and storage technologies (HDD, SSD, NVM, and NVMe-SSD).  Key questions: where can RDMA be used, and what upper-level changes are needed?]

SLIDE 25

  • High-Performance Programming Models Support for HPC Clusters
  • RDMA-Enabled Communication Substrate for Common Services in Datacenters
  • High-Performance and Scalable Memcached
  • RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce)
  • Deep Learning with Scale-Up and Scale-Out

– Caffe and TensorFlow

  • Virtualization Support with SR-IOV and Containers

Designing RDMA-based Middleware for Clusters and Datacenters

SLIDE 26

Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges

Programming Models

MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.

Application Kernels/Applications

Networking Technologies

(InfiniBand, 40/100GigE, Aries, and Omni-Path)

Multi-/Many-core Architectures Accelerators (GPU and FPGA)

Middleware

Co-Design Opportunities and Challenges across Various Layers: Performance, Scalability, Resilience

Communication Library or Runtime for Programming Models: point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance

SLIDE 27

Overview of the MVAPICH2 Project

  • High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), started in 2001, first version available in 2002
– MVAPICH2-X (MPI + PGAS), available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
– Support for Virtualization (MVAPICH2-Virt), available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM), available since 2015

– Used by more than 2,925 organizations in 86 countries
– More than 487,000 (> 0.48 million) downloads from the OSU site directly

– Empowering many TOP500 clusters (Jul ‘18 ranking)

  • 2nd ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
  • 12th, 556,104 cores (Oakforest-PACS) in Japan
  • 15th, 367,024 cores (Stampede2) at TACC
  • 24th, 241,108-core (Pleiades) at NASA and many others

– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)

– http://mvapich.cse.ohio-state.edu

  • Empowering Top500 systems for over a decade

SLIDE 28

Architecture of MVAPICH2 Software Family

High Performance Parallel Programming Models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication Runtime with diverse APIs and mechanisms: point-to-point primitives, collective algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, and introspection & analysis

Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi (MIC, KNL), NVIDIA GPGPU)

Transport protocols: RC, XRC, UD, DC.  Transport mechanisms: shared memory, CMA, IVSHMEM.  Modern features: UMR, ODP, SR-IOV, multi-rail, NVLink*, CAPI*, XPMEM* (* upcoming)

SLIDE 29

One-way Latency: MPI over IB with MVAPICH2

[Figures: small-message and large-message one-way latency (us) vs. message size (bytes); small-message latencies of 0.98-1.19 us across the five adapters]

Platforms: TrueScale-QDR - 3.1 GHz deca-core (Haswell) Intel, PCIe Gen3, IB switch; ConnectX-3-FDR - 2.8 GHz deca-core (IvyBridge) Intel, PCIe Gen3, IB switch; ConnectIB-Dual FDR - 3.1 GHz deca-core (Haswell) Intel, PCIe Gen3, IB switch; ConnectX-5-EDR - 3.1 GHz deca-core (Haswell) Intel, PCIe Gen3, IB switch; Omni-Path - 3.1 GHz deca-core (Haswell) Intel, PCIe Gen3, Omni-Path switch
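
These results come from latency micro-benchmarks; as an illustration of what such a test measures, here is a minimal two-process MPI ping-pong sketch (in the spirit of osu_latency, not the actual OSU benchmark code). MSG_SIZE and ITERS are illustrative values.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE 4        /* bytes per message */
#define ITERS    10000    /* ping-pong iterations */

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(MSG_SIZE);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency = round-trip time / 2 */
        printf("%d bytes: %.2f us\n", MSG_SIZE, (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    free(buf);
    return 0;
}

Launched with two ranks on two nodes (e.g., mpirun -np 2 ./latency under MVAPICH2), this reports the one-way latency plotted above.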

SLIDE 30

Bandwidth: MPI over IB with MVAPICH2

[Figures: unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size (bytes) on the same platforms as the previous slide; peak unidirectional bandwidths range from 3,373 to 12,590 MB/s and peak bidirectional bandwidths from 6,228 to 24,136 MB/s across the five adapters]

SLIDE 31

Hardware Multicast-aware MPI_Bcast on Stampede

[Figures: broadcast latency (us) vs. message size for small messages (2 B - 512 B) and large messages (2 KB - 128 KB) at 102,400 cores, and latency vs. number of nodes for 16-byte and 32-KByte messages, comparing the Default and hardware-Multicast-based designs]

ConnectX-3-FDR (54 Gbps): 2.7 GHz dual octa-core (SandyBridge) Intel, PCIe Gen3, Mellanox IB FDR switch
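
For reference, the collective being accelerated is the standard MPI broadcast; a minimal sketch of the call whose latency is plotted above (illustrative only):

#include <mpi.h>

/* Every rank calls MPI_Bcast; buf is valid on all ranks afterwards.  With
 * MVAPICH2's multicast-aware design, small broadcasts are mapped onto
 * InfiniBand hardware multicast underneath this same interface. */
void broadcast_params(char *buf, int len)
{
    MPI_Bcast(buf, len, MPI_CHAR, 0 /* root */, MPI_COMM_WORLD);
}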

SLIDE 32

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU

At Sender:   MPI_Send(s_devbuf, size, ...);
At Receiver: MPI_Recv(r_devbuf, size, ...);
(GPU-to-network data movement is handled inside MVAPICH2)

  • Standard MPI interfaces used for unified data movement (see the sketch after this slide)
  • Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
  • Overlaps data movement from GPU with RDMA transfers

High Performance and High Productivity
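
A minimal sketch of the usage model (assumed example, not code from the talk): with a CUDA-aware MPI library such as MVAPICH2-GDR, device pointers are passed directly to MPI calls and the library handles staging or GPUDirect RDMA internally.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank;
    const int n = 1 << 20;     /* 1M floats */
    float *devbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaMalloc((void **)&devbuf, n * sizeof(float));   /* GPU buffer; no explicit host copy */

    if (rank == 0)
        MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}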

SLIDE 33

Optimized MVAPICH2-GDR Design

[Figures: GPU-GPU inter-node latency, unidirectional bandwidth, and bidirectional bandwidth vs. message size (1 B - 8 KB), comparing MV2 (no GDR) and MV2-GDR 2.3a; 1.88 us small-message latency (~11X better) and roughly 9-10X higher bandwidth and bi-bandwidth]

MVAPICH2-GDR 2.3a; Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores; NVIDIA Volta V100 GPU; Mellanox ConnectX-4 EDR HCA; CUDA 9.0; Mellanox OFED 4.0 with GPU-Direct RDMA

SLIDE 34

Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland

[Figures: normalized execution time vs. number of GPUs on the CSCS GPU cluster (16-96 GPUs) and the Wilkes GPU cluster (4-32 GPUs), comparing Default, Callback-based, and Event-based designs]

  • 2X improvement on 32 GPU nodes
  • 30% improvement on 96 GPU nodes (8 GPUs/node)
  • C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16

On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and Cosmo Application

Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/

SLIDE 35

  • High-Performance Programming Models Support for HPC Clusters
  • RDMA-Enabled Communication Substrate for Common Services in Datacenters
  • High-Performance and Scalable Memcached
  • RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce)
  • Deep Learning with Scale-Up and Scale-Out

– Caffe and TensorFlow

  • Virtualization Support with SR-IOV and Containers

Designing RDMA-based Middleware for Clusters and Datacenters

SLIDE 36

Data-Center Service Primitives

  • Common Services needed by Data-Centers

– Better resource management – Higher performance provided to higher layers

  • Service Primitives

– Soft Shared State – Distributed Lock Management – Global Memory Aggregator

  • Network Based Designs

– RDMA, Remote Atomic Operations (a verbs-level lock-acquisition sketch follows this slide)
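
As one concrete possibility (an assumption for illustration, not the talk's actual design), a distributed lock can be built on an RDMA atomic compare-and-swap against a 64-bit word in the lock manager's registered memory (0 = free, 1 = held):

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* result_buf is a locally registered 8-byte buffer that receives the old value. */
int post_lock_attempt(struct ibv_qp *qp, struct ibv_mr *result_mr,
                      uint64_t *result_buf, uint64_t remote_lock_addr,
                      uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)result_buf,
        .length = sizeof(uint64_t),
        .lkey   = result_mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.sg_list               = &sge;
    wr.num_sge               = 1;
    wr.opcode                = IBV_WR_ATOMIC_CMP_AND_SWP;  /* executed by the target HCA */
    wr.send_flags            = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = remote_lock_addr;
    wr.wr.atomic.rkey        = rkey;
    wr.wr.atomic.compare_add = 0;    /* expect the lock to be free */
    wr.wr.atomic.swap        = 1;    /* mark it held */
    return ibv_post_send(qp, &wr, &bad_wr);
    /* After the completion, *result_buf == 0 means the lock was acquired. */
}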

SLIDE 37

Soft Shared State

[Diagram: several data-center applications performing Get and Put operations on a soft shared state]

SLIDE 38

Active Caching

  • Dynamic data caching – challenging!
  • Cache consistency and coherence
    – Become more important than in the static case

[Diagram: user requests are served by proxy nodes, which cache dynamic content; updates propagate from the back-end nodes to the proxies]

SLIDE 39

RDMA based Client Polling Design

[Diagram: request handling between front-end and back-end — on a cache hit the front-end validates its cached copy with an RDMA read of a version maintained at the back-end before sending the response; on a cache miss the request is forwarded to the back-end, which produces the response]
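
A minimal verbs-level sketch of the version-read step (illustrative; the single-word version layout and all names are assumptions, not the published protocol): the front-end issues a one-sided RDMA Read of the back-end's version word and compares it against the version recorded when the response was cached.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* ver_buf is a locally registered 8-byte buffer; remote_ver_addr/rkey are exchanged at setup. */
int post_version_read(struct ibv_qp *qp, struct ibv_mr *ver_mr, uint64_t *ver_buf,
                      uint64_t remote_ver_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)ver_buf,
        .length = sizeof(uint64_t),
        .lkey   = ver_mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_READ;   /* back-end CPU is not involved */
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_ver_addr;
    wr.wr.rdma.rkey        = rkey;
    return ibv_post_send(qp, &wr, &bad_wr);
    /* Once the completion arrives, *ver_buf holds the current version; if it
     * matches the cached version, the cached response can be served directly. */
}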

SLIDE 40

Active Caching – Performance Benefits

[Figures: data-center throughput for traces with increasing update rate (Trace 2 - Trace 5), comparing No Cache, Invalidate All, and Dependency Lists; and throughput vs. load (1-64 compute threads), comparing No Cache and Dependency Lists]

  • Higher overall performance – Up to an order of magnitude
  • Performance is sustained under loaded conditions
  • S. Narravula, P. Balaji, K. Vaidyanathan, H.-W. Jin and D. K. Panda, Architecture for Caching Responses with Multiple Dynamic Dependencies in Multi-Tier Data-Centers over InfiniBand, CCGrid 2005

SLIDE 41

Resource Monitoring Services

  • Traditional approaches

– Coarse-grained in nature
– Assume resource usage is consistent throughout the monitoring granularity (on the order of seconds)

  • This assumption is no longer valid

– Resource usage is becoming increasingly divergent

  • Fine-grained monitoring is desired but has additional overheads

– High overheads, less accurate, slow in response

  • Can we design a fine-grained resource monitoring scheme with low overhead and accurate resource usage?

SLIDE 42

Synchronous Resource Monitoring using RDMA (RDMA-Sync)

[Diagram: the monitoring process on the front-end node uses RDMA to read resource-usage information (kernel data structures exposed via /proc) from the back-end node's memory, bypassing the back-end CPU and its application threads]

SLIDE 43

  • High-Performance Programming Models Support for HPC Clusters
  • RDMA-Enabled Communication Substrate for Common Services in Datacenters
  • High-Performance and Scalable Memcached
  • RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce)
  • Deep Learning with Scale-Up and Scale-Out

– Caffe and TensorFlow

  • Virtualization Support with SR-IOV and Containers

Designing RDMA-based Middleware for Clusters and Datacenters

SLIDE 44

Architecture Overview of Memcached

  • Three-layer architecture of Web 2.0
    – Web servers, Memcached servers, database servers
  • Memcached is a core component of the Web 2.0 architecture
  • Distributed caching layer
    – Allows aggregation of spare memory from multiple nodes
    – General purpose
  • Typically used to cache database queries and results of API calls
  • Scalable model, but typical usage is very network intensive

[Diagram: Internet-facing web servers backed by a Memcached caching tier in front of the database servers]

SLIDE 45

Memcached-RDMA Design

  • Server and client perform a negotiation protocol
    – Master thread assigns clients to the appropriate worker thread
  • Once a client is assigned a verbs worker thread, it can communicate directly and is "bound" to that thread
  • All other Memcached data structures are shared among RDMA and sockets worker threads
  • Memcached server can serve both sockets and verbs clients simultaneously
  • Memcached applications need not be modified; the verbs interface is used if available (a client-side sketch follows this slide)

[Diagram: sockets clients and RDMA clients connect through the master thread to sockets or verbs worker threads, which share the Memcached data structures (memory slabs, items)]
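
As a reminder of why no application changes are needed, here is a minimal client sketch using the standard libmemcached API (assumed usage; server address and keys are placeholders). The same program can be linked against the RDMA-enabled client library and served by the RDMA-capable server.

#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    memcached_st *memc = memcached_create(NULL);
    memcached_server_st *servers = memcached_servers_parse("127.0.0.1:11211");
    memcached_server_push(memc, servers);

    const char *key = "user:42", *val = "cached-profile";
    memcached_set(memc, key, strlen(key), val, strlen(val), (time_t)0, 0);

    size_t len; uint32_t flags; memcached_return_t rc;
    char *out = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (rc == MEMCACHED_SUCCESS)
        printf("GET %s -> %.*s\n", key, (int)len, out);

    free(out);
    memcached_server_list_free(servers);
    memcached_free(memc);
    return 0;
}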

SLIDE 46

Memcached Performance (FDR Interconnect)

[Figures: Memcached GET latency (us) vs. message size, and throughput (thousands of transactions per second) vs. number of clients, comparing OSU-IB (FDR) and IPoIB (FDR)]

  • Memcached GET latency
    – 4 bytes: OSU-IB 2.84 us; IPoIB 75.53 us
    – 2K bytes: OSU-IB 4.49 us; IPoIB 123.42 us
    – Latency reduced by nearly 20X
  • Memcached throughput (4 bytes)
    – 4080 clients: OSU-IB 556 Kops/sec; IPoIB 233 Kops/sec
    – Nearly 2X improvement in throughput

Experiments on TACC Stampede (Intel SandyBridge cluster, IB: FDR)

SLIDE 47

  • High-Performance Programming Models Support for HPC Clusters
  • RDMA-Enabled Communication Substrate for Common Services in Datacenters
  • High-Performance and Scalable Memcached
  • RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce)
  • Deep Learning with Scale-Up and Scale-Out

– Caffe and TensorFlow

  • Virtualization Support with SR-IOV and Containers

Designing RDMA-based Middleware for Clusters and Datacenters

SLIDE 48

  • RDMA for Apache Spark
  • RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)

– Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions

  • RDMA for Apache HBase
  • RDMA for Memcached (RDMA-Memcached)
  • RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
  • OSU HiBD-Benchmarks (OHB)

– HDFS, Memcached, HBase, and Spark Micro-benchmarks

  • http://hibd.cse.ohio-state.edu
  • Users Base: 290 organizations from 34 countries
  • More than 27,300 downloads from the project site

The High-Performance Big Data (HiBD) Project

Available for InfiniBand and RoCE (also runs on Ethernet); available for x86 and OpenPOWER; support for Singularity and Docker

SLIDE 49

[Figures: RandomWriter and TeraGen execution time (s) vs. data size (GB), comparing IPoIB (EDR) and OSU-IB (EDR); reduced by 3x and 4x respectively]

Performance Numbers of RDMA for Apache Hadoop 2.x – RandomWriter & TeraGen in OSU-RI2 (EDR)

Cluster with 8 Nodes with a total of 64 maps

  • RandomWriter

– 3x improvement over IPoIB for 80-160 GB file size

  • TeraGen

– 4x improvement over IPoIB for 80-240 GB file size


SLIDE 50

  • InfiniBand FDR, SSD, 32/64 Worker Nodes, 768/1536 Cores, (768/1536M 768/1536R)
  • RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node.

– 32 nodes/768 cores: total time reduced by 37% over IPoIB (56Gbps)
– 64 nodes/1536 cores: total time reduced by 43% over IPoIB (56Gbps)

Performance Evaluation of RDMA-Spark on SDSC Comet – HiBench PageRank

[Figures: PageRank total time (sec) for Huge, BigData, and Gigantic data sizes on 32 worker nodes (768 cores) and 64 worker nodes (1536 cores), comparing IPoIB and RDMA; reduced by 37% and 43% respectively]

SLIDE 51

  • High-Performance Programming Models Support for HPC Clusters
  • RDMA-Enabled Communication Substrate for Common Services in Datacenters
  • High-Performance and Scalable Memcached
  • RDMA-Enabled Spark and Hadoop (HDFS, HBase, MapReduce)
  • Deep Learning with Scale-Up and Scale-Out

– Caffe and TensorFlow

  • Virtualization Support with SR-IOV and Containers

Designing RDMA-based Middleware for Clusters and Datacenters

SLIDE 52

  • Deep Learning frameworks are a different game altogether
    – Unusually large message sizes (order of megabytes)
    – Most communication based on GPU buffers
  • Existing state of the art
    – cuDNN, cuBLAS, NCCL --> scale-up performance
    – CUDA-Aware MPI --> scale-out performance
      • For small and medium message sizes only!
  • Proposed: Can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
    – Efficient overlap of computation and communication
    – Efficient large-message communication (reductions) -- see the Allreduce sketch after this slide
    – What application co-designs are needed to exploit communication-runtime co-designs?

Deep Learning: New Challenges for Communication Runtimes

[Diagram: existing designs (cuDNN, cuBLAS, NCCL, MPI, gRPC, Hadoop) plotted by scale-up vs. scale-out performance, with the proposed co-designs targeting both]

  • A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)
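
The large-message reductions referred to above map onto MPI_Allreduce; a minimal sketch of how a data-parallel framework can hand its gradient buffer to the MPI runtime (illustrative, not S-Caffe's actual code; d_grad may be a GPU pointer when a CUDA-aware MPI such as MVAPICH2-GDR is used):

#include <mpi.h>

/* Sum gradients across all ranks, in place; each rank then divides by the
 * number of ranks (or folds the scaling into the learning rate). */
void allreduce_gradients(float *d_grad, int count)
{
    MPI_Allreduce(MPI_IN_PLACE, d_grad, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
}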

SLIDE 53

MVAPICH2-GDR vs. NCCL2 – Allreduce Operation

  • Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases
  • MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 16 GPUs (an API-level sketch of the NCCL call follows this slide)

*Will be available with upcoming MVAPICH2-GDR 2.3b

[Figures: Allreduce latency (us) vs. message size (bytes) on 16 GPUs, MVAPICH2-GDR vs. NCCL2 — up to ~3X better for small/medium messages and ~1.2X better for large messages]

Platform: Intel Xeon (Broadwell) nodes with dual-socket CPUs, 1 K-80 GPU per node, and EDR InfiniBand interconnect
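
For reference, the NCCL2 call being compared is ncclAllReduce; a minimal sketch of its use (illustrative; communicator and stream setup omitted), mirroring the MPI_Allreduce sketch shown earlier:

#include <nccl.h>
#include <cuda_runtime.h>

void allreduce_gradients_nccl(float *d_grad, size_t count,
                              ncclComm_t comm, cudaStream_t stream)
{
    /* In-place sum across all GPUs in the communicator, enqueued on a stream. */
    ncclAllReduce(d_grad, d_grad, count, ncclFloat, ncclSum, comm, stream);
    cudaStreamSynchronize(stream);   /* the collective is asynchronous; wait here */
}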

SLIDE 54

[Figures: Allreduce latency (us) vs. message size (bytes) on 16 GPUs across 4 nodes for small, medium, and large messages, comparing MVAPICH2, Baidu-allreduce, and OpenMPI]

  • 16 GPUs (4 nodes) MVAPICH2-GDR(*) vs. Baidu-Allreduce and OpenMPI 3.0

MVAPICH2: Allreduce Comparison with Baidu and OpenMPI

*Available with MVAPICH2-GDR 2.3a

[Figure annotations: MVAPICH2 is ~30X, ~10X, and ~4X better than OpenMPI across the message ranges shown; MV2 is ~2X better than Baidu-allreduce, while OpenMPI is ~5X slower than Baidu]

SLIDE 55

  • Caffe: a flexible and layered Deep Learning framework
  • Benefits and weaknesses
    – Multi-GPU training within a single node
    – Performance degradation for GPUs across different sockets
    – Limited scale-out
  • OSU-Caffe: MPI-based parallel training
    – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
    – Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
    – Scale-out on 128 GPUs for training the GoogLeNet network on the ImageNet dataset

OSU-Caffe: Scalable Deep Learning

[Figure: GoogLeNet (ImageNet) training time (seconds) vs. number of GPUs (8-128), comparing Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); one region is marked as an invalid use case]

OSU-Caffe is publicly available from http://hidl.cse.ohio-state.edu/

SLIDE 56

RDMA-TensorFlow Distribution

  • High-Performance Design of TensorFlow over RDMA-enabled Interconnects
    – High-performance RDMA-enhanced design with native InfiniBand support at the verbs level for gRPC and TensorFlow
    – RDMA-based data communication
    – Adaptive communication protocols
    – Dynamic message chunking and accumulation
    – Support for RDMA device selection
    – Easily configurable for different protocols (native InfiniBand and IPoIB)
  • Current release: 0.9.1
    – Based on Google TensorFlow 1.3.0
    – Tested with
      • Mellanox InfiniBand adapters (e.g., EDR)
      • NVIDIA GPGPU K80
      • CUDA 8.0 and CUDNN 5.0
    – http://hidl.cse.ohio-state.edu

SLIDE 57

Performance Benefit for TensorFlow (Inception3)

[Figures: images/second vs. batch size per GPU on 4, 8, and 12 nodes, comparing gRPC (IPoIB-100Gbps), Verbs (RDMA-100Gbps), MPI (RDMA-100Gbps), and AR-gRPC (RDMA-100Gbps)]

  • TensorFlow Inception3 performance evaluation on an IB EDR cluster
    – Up to 47% performance speedup over default gRPC (IPoIB) for 4 nodes
    – Up to 116% performance speedup over default gRPC (IPoIB) for 8 nodes
    – Up to 153% performance speedup over default gRPC (IPoIB) for 12 nodes

SLIDE 58

Concluding Remarks

  • Next-generation clusters and data centers need to be designed with a holistic view of HPC, Big Data, Deep Learning, and Cloud
  • Presented an overview of the networking technology trends exploiting RDMA
  • Presented some of the RDMA-based approaches and results along these directions
  • Enable the HPC, Big Data, Deep Learning, and Cloud communities to take advantage of modern RDMA-based networking technologies
  • Many other open issues need to be solved

SLIDE 59

Funding Acknowledgments

Funding Support by Equipment Support by

SLIDE 60

Personnel Acknowledgments

Current Students (Graduate)

  • A. Awan (Ph.D.)

  • M. Bayatpour (Ph.D.)

  • S. Chakraborthy (Ph.D.)

  • C.-H. Chu (Ph.D.)

  • S. Guganani (Ph.D.)

Past Students

  • A. Augustine (M.S.)

  • P. Balaji (Ph.D.)

  • R. Biswas (M.S.)

  • S. Bhagvat (M.S.)

  • A. Bhat (M.S.)

  • D. Buntinas (Ph.D.)

  • L. Chai (Ph.D.)

  • B. Chandrasekharan (M.S.)

  • N. Dandapanthula (M.S.)

  • V. Dhanraj (M.S.)

  • T. Gangadharappa (M.S.)

  • K. Gopalakrishnan (M.S.)

  • R. Rajachandrasekar (Ph.D.)

  • G. Santhanaraman (Ph.D.)

  • A. Singh (Ph.D.)

  • J. Sridhar (M.S.)

  • S. Sur (Ph.D.)

  • H. Subramoni (Ph.D.)

  • K. Vaidyanathan (Ph.D.)

  • A. Vishnu (Ph.D.)

  • J. Wu (Ph.D.)

  • W. Yu (Ph.D.)

  • J. Zhang (Ph.D.)

Past Research Scientist

  • K. Hamidouche

  • S. Sur

Past Post-Docs

  • D. Banerjee

  • X. Besseron

  • H.-W. Jin

  • W. Huang (Ph.D.)

  • W. Jiang (M.S.)

  • J. Jose (Ph.D.)

  • S. Kini (M.S.)

  • M. Koop (Ph.D.)

  • K. Kulkarni (M.S.)

  • R. Kumar (M.S.)

  • S. Krishnamoorthy (M.S.)

  • K. Kandalla (Ph.D.)

  • M. Li (Ph.D.)

  • P. Lai (M.S.)

  • J. Liu (Ph.D.)

  • M. Luo (Ph.D.)

  • A. Mamidala (Ph.D.)

  • G. Marsh (M.S.)

  • V. Meshram (M.S.)

  • A. Moody (M.S.)

  • S. Naravula (Ph.D.)

  • R. Noronha (Ph.D.)

  • X. Ouyang (Ph.D.)

  • S. Pai (M.S.)

  • S. Potluri (Ph.D.)

  • J. Hashmi (Ph.D.)

  • H. Javed (Ph.D.)

  • P. Kousha (Ph.D.)

  • D. Shankar (Ph.D.)

  • H. Shi (Ph.D.)

  • J. Lin

  • M. Luo

  • E. Mancini

Current Research Scientists

  • X. Lu

  • H. Subramoni

Past Programmers

  • D. Bureddy

  • J. Perkins

Current Research Specialist

  • J. Smith

  • M. Arnold

  • S. Marcarelli

  • J. Vienne

  • H. Wang

Current Post-doc

  • A. Ruhela

  • K. Manian

Current Students (Undergraduate)

  • N. Sarkauskas (B.S.)

  • V. Gangal (B.S.)

SLIDE 61

  • Looking for bright and enthusiastic personnel to join as
    – Post-Doctoral Researchers
    – PhD Students
    – MPI Programmer/Software Engineer
    – Hadoop/Spark/Big Data Programmer/Software Engineer
    – Deep Learning Programmer/Software Engineer
  • If interested, please contact me at this conference and/or send an e-mail to panda@cse.ohio-state.edu

Multiple Positions Available in My Group

SLIDE 62

Thank You!

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

panda@cse.ohio-state.edu

The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/