SLIDE 1

InfiniBand, Omni-Path, and High-Speed Ethernet: Advanced Features, Challenges in Designing HEC Systems, and Usage

A Tutorial at IT4 Innovations'18 by

Dhabaleswar K. (DK) Panda, The Ohio State University, E-mail: panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda
Hari Subramoni, The Ohio State University, E-mail: subramon@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~subramon

Latest version of the slides can be obtained from http://www.cse.ohio-state.edu/~panda/it4-advanced.pdf

SLIDE 2

High-End Computing (HEC): ExaFlop & ExaByte

Figure: ExaFlop & HPC: 100 PFlops in 2016; 1 EFlops in 2019-2020? Expected to have an ExaFlop system in 2019-2020! ExaByte & BigData: 10K-20K EBytes in 2016-2018; 40K EBytes in 2020?

SLIDE 3

  • Compute Clusters
  • Storage Clusters
  • Multi-tier Data Centers
  • Cloud Computing Environments
  • Big Data Processing (Hadoop and Spark)
  • Web 2.0 with Memcached

Various High-End Computing (HEC) Systems

SLIDE 4

Various Clusters (Compute, Storage and Datacenters)

Figure: A compute cluster (compute nodes behind a frontend, connected over a LAN) and a storage cluster (meta-data manager plus I/O server nodes holding data), linked over LAN/WAN to an enterprise multi-tier datacenter for visualization and mining: Tier1 routers/servers, Tier2 application servers, and Tier3 database servers, connected through switches.

SLIDE 5

Cloud Computing Environments

Figure: Physical machines hosting virtual machines, connected over LAN/WAN to a virtual network file system backed by a physical meta-data manager and physical I/O server nodes holding the data.

SLIDE 6

Overview of Apache Hadoop Architecture

  • Open-source implementation of Google MapReduce, GFS, and BigTable for Big Data Analytics
  • Hadoop Common Utilities (RPC, etc.), HDFS, MapReduce, YARN
  • http://hadoop.apache.org

Figure: Hadoop 1.x stack: MapReduce (cluster resource management & data processing) over the Hadoop Distributed File System (HDFS) and Hadoop Common/Core (RPC, ...). Hadoop 2.x stack: MapReduce and other models (data processing) over YARN (cluster resource management & job scheduling), HDFS, and Hadoop Common/Core (RPC, ...).

SLIDE 7

Big Data Processing with Hadoop Components

  • Major components included in this tutorial:
    – MapReduce (Batch)
    – HBase (Query)
    – HDFS (Storage)
    – RPC
  • Underlying Hadoop Distributed File System (HDFS) used by both MapReduce and HBase
  • Model scales, but the high amount of communication during intermediate phases can be further optimized

Figure: Hadoop framework: user applications run over MapReduce and HBase, which sit on HDFS and Hadoop Common (RPC).

SLIDE 8

Spark Architecture Overview

  • An in-memory data-processing framework
    – Iterative machine learning jobs
    – Interactive data analytics
    – Scala-based implementation
    – Standalone, YARN, Mesos
  • Scalable and communication intensive
    – Wide dependencies between Resilient Distributed Datasets (RDDs)
    – MapReduce-like shuffle operations to repartition RDDs
    – Sockets-based communication

http://spark.apache.org

SLIDE 9

Memcached Architecture

  • Distributed Caching Layer
    – Allows aggregation of spare memory from multiple nodes
    – General purpose
  • Typically used to cache database queries and results of API calls
  • Scalable model, but typical usage is very network intensive
SLIDE 10

Increasing Usage of HPC, Big Data and Deep Learning

  • HPC (MPI, RDMA, Lustre, etc.)
  • Big Data (Hadoop, Spark, HBase, Memcached, etc.)
  • Deep Learning (Caffe, TensorFlow, BigDL, etc.)

Convergence of HPC, Big Data, and Deep Learning! Increasing need to run these applications on the Cloud!!

SLIDE 11

Drivers of Modern HPC Cluster Architectures

  • Multi-core/many-core technologies
  • Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
  • Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
  • Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
  • Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.

Figure: Building blocks of modern HPC clusters: multi-core processors; high-performance interconnects such as InfiniBand (<1 usec latency, 100 Gbps bandwidth); SSD, NVMe-SSD, NVRAM; and accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip). Example systems: Tianhe-2, Titan, K-Computer, Sunway TaihuLight.

SLIDE 12

Modern Interconnects and Protocols with IB, HSE, and Omni-Path

Figure: Application/middleware interfaces (Sockets, Verbs, OFI) and the protocol/adapter/switch combinations beneath them: Sockets over kernel-space TCP/IP with an Ethernet adapter and switch (1/10/25/40/50/100 GigE); Sockets over hardware-offloaded TCP/IP (10/40 GigE TOE); Sockets over IPoIB with an InfiniBand adapter and switch; user-space RSockets and SDP over InfiniBand; user-space TCP/IP over an iWARP adapter and Ethernet switch; user-space RDMA over a RoCE adapter and Ethernet switch; Verbs with user-space RDMA over a native InfiniBand adapter and switch; and OFI with user-space RDMA over a 100 Gb/s Omni-Path adapter and switch.

SLIDE 13

Large-scale InfiniBand Installations

  • 163 IB Clusters (32.6%) in the Nov'17 Top500 list (http://www.top500.org)
  • Installations in the Top 50 (17 systems):
    – 19,860,000 core (Gyoukou) in Japan (4th)
    – 241,108 core (Pleiades) at NASA/Ames (17th)
    – 220,800 core (Pangea) in France (21st)
    – 144,900 core (Cheyenne) at NCAR/USA (24th)
    – 155,150 core (Jureca) in Germany (29th)
    – 72,800 core Cray CS-Storm in US (30th)
    – 72,800 core Cray CS-Storm in US (31st)
    – 78,336 core (Electra) at NASA/USA (33rd)
    – 124,200 core (Topaz) SGI ICE at ERDC DSRC in US (34th)
    – 60,512 core (NVIDIA DGX-1/Relion) at Facebook in USA (35th)
    – 60,512 core (DGX SATURN V) at NVIDIA/USA (36th)
    – 72,000 core (HPC2) in Italy (37th)
    – 152,692 core (Thunder) at AFRL/USA (40th)
    – 99,072 core (Mistral) at DKRZ/Germany (42nd)
    – 147,456 core (SuperMUC) in Germany (44th)
    – 86,016 core (SuperMUC Phase 2) in Germany (45th)
    – 74,520 core (Tsubame 2.5) at Japan/GSIC (48th)
    – 66,000 core (HPC3) in Italy (51st)
    – 194,616 core (Cascade) at PNNL (53rd)
    – and many more!

SLIDE 14

Large-scale Omni-Path Installations

  • 35 Omni-Path Clusters (7%) in the Nov'17 Top500 list (http://www.top500.org)
    – 556,104 core (Oakforest-PACS) at JCAHPC in Japan (9th)
    – 368,928 core (Stampede2) at TACC in USA (12th)
    – 135,828 core (Tsubame 3.0) at TiTech in Japan (13th)
    – 314,384 core (Marconi XeonPhi) at CINECA in Italy (14th)
    – 153,216 core (MareNostrum) at BSC in Spain (16th)
    – 95,472 core (Quartz) at LLNL in USA (49th)
    – 95,472 core (Jade) at LLNL in USA (50th)
    – 49,432 core (Mogon II) at Universitaet Mainz in Germany (65th)
    – 38,552 core (Molecular Simulator) in Japan (70th)
    – 35,280 core (Quriosity) at BASF in Germany (71st)
    – 54,432 core (Marconi Xeon) at CINECA in Italy (72nd)
    – 46,464 core (Peta4) at University of Cambridge in UK (75th)
    – 53,352 core (Grizzly) at LANL in USA (85th)
    – 45,680 core (Endeavor) at Intel in USA (86th)
    – 59,776 core (Cedar) at SFU in Canada (94th)
    – 27,200 core (Peta HPC) in Taiwan (95th)
    – 39,774 core (Nel) at LLNL in USA (101st)
    – 40,392 core (Serrano) at SNL in USA (112th)
    – 40,392 core (Cayenne) at SNL in USA (113th)
    – and many more!

SLIDE 15

Presentation Overview

  • Advanced Features for InfiniBand
  • Advanced Features for High Speed Ethernet
  • RDMA over Converged Ethernet
  • Open Fabrics Software Stack and RDMA Programming
  • Libfabrics Software Stack and Programming
  • Network Management Infrastructure and Tools
  • Common Challenges in Building HEC Systems with IB and HSE
    – Network Adapters and NUMA Interactions
    – Network Switches, Topology and Routing
    – Network Bridges
  • System Specific Challenges and Case Studies
    – HPC (MPI, PGAS and GPU/Xeon Phi Computing)
    – Deep Learning
    – Cloud Computing
  • Conclusions and Final Q&A

SLIDE 16

Advanced Features of InfiniBand

  • SRQ and XRC
  • DCT
  • User-Mode Memory Registration (UMR)
  • On-demand Paging
  • Core-Direct Offload
  • SHArP
SLIDE 17

Memory Overheads in Large-scale Systems

  • Different transport protocols with IB
    – Reliable Connection (RC) is the most common
    – Unreliable Datagram (UD) is used in some cases
  • Buffers need to be posted at each receiver to receive a message from any sender
    – Buffer requirement can increase with system size
  • Connections need to be established across processes under RC
    – Each connection requires a certain amount of memory for handling related data structures
    – Memory required for all connections can increase with system size
  • Both issues have become critical as large-scale IB deployments have taken place
    – Being addressed by both the IB specification and upper-level middleware

SLIDE 18

Shared Receive Queue (SRQ)

  • SRQ is a hardware mechanism for a process to share receive resources (memory) across multiple connections
    – Introduced in specification v1.2
  • 0 < Q << P * ((M*N) - 1)

Figure: Without SRQ, a process posts one receive queue per connection (P buffers on each of its (M*N) - 1 connections); with SRQ, one shared receive queue of Q buffers serves all connections.
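To make the mechanism concrete, here is a minimal, hedged sketch of creating an SRQ and attaching QPs to it with the OpenFabrics verbs API; the protection domain, completion queue, registered buffer, and queue sizes are assumptions (set up as on the later "Programming with OpenFabrics" slides), and error handling is omitted.

#include <infiniband/verbs.h>

/* Assumed already created: pd (ibv_alloc_pd), cq (ibv_create_cq),
 * mr (ibv_reg_mr over recv_buf of length buf_len) */
struct ibv_srq_init_attr srq_attr = {
    .attr = { .max_wr = 4096, .max_sge = 1 }     /* shared pool of receive WQEs (the "Q" above) */
};
struct ibv_srq *srq = ibv_create_srq(pd, &srq_attr);

/* Every RC QP is created pointing at the shared SRQ instead of its own receive queue */
struct ibv_qp_init_attr qp_attr = {
    .send_cq = cq, .recv_cq = cq, .srq = srq,
    .cap = { .max_send_wr = 128, .max_send_sge = 1 },
    .qp_type = IBV_QPT_RC
};
struct ibv_qp *qp = ibv_create_qp(pd, &qp_attr);

/* Receive buffers are replenished on the SRQ, not per connection */
struct ibv_sge sge = { .addr = (uintptr_t) recv_buf, .length = buf_len, .lkey = mr->lkey };
struct ibv_recv_wr rwr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 }, *bad_rwr;
ibv_post_srq_recv(srq, &rwr, &bad_rwr);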

SLIDE 19

eXtended Reliable Connection (XRC)

  • Each QP takes at least one page of memory
    – Connections between all processes are very costly for RC
  • New IB transport added: eXtended Reliable Connection
    – Allows connections between nodes instead of processes

Figure: RC requires M^2 x (N - 1) connections per node; XRC requires M x (N - 1) connections per node (M = # of processes/node, N = # of nodes). A worked example follows below.
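To put these counts in perspective, consider an illustrative calculation (the numbers are chosen for illustration and are not from the slides): with M = 16 processes per node and N = 1,024 nodes, fully connected RC needs 16^2 x 1,023, roughly 262K QPs per node, which at one 4 KB page per QP is about 1 GB of QP state, whereas XRC needs 16 x 1,023, roughly 16K QPs per node, or about 64 MB.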

SLIDE 20

XRC Addressing

  • XRC uses SRQ Numbers (SRQN) to direct where an operation should complete
  • Hardware does all routing of data, so p2 is not actually involved in the data transfer
  • Connections are not bi-directional, so p3 cannot send to p0

Figure: Four processes (Process 0 to Process 3), each with an SRQ (SRQ#1 or SRQ#2); sends are addressed to the target's SRQ number ("Send to #1", "Send to #2").

SLIDE 21

DC Connection Model, Communication Objects and Addressing Scheme

  • Constant connection cost
    – One QP for any peer
  • Full feature set
    – RDMA, Atomics, etc.
  • Communication Objects & Addressing Scheme
    – DCINI
      • Analogous to the send QPs
      • Can transmit data to any peer
    – DCTGT
      • Receive objects
      • Must be backed by an SRQ
      • Identified on a node by a "DCT Number"
    – Messages routed with a combination of DCT Number + LID
    – Requires a "DC Key" to enable communication
      • Must be the same across all processes

Figure: Nodes 0-3, each with two processes (P0-P7), connected through an IB network.
SLIDE 22

User-Mode Memory Registration (UMR)

  • Supports direct local and remote non-contiguous memory access
  • Avoids packing at the sender and unpacking at the receiver

Steps to create memory regions with UMR (numbered 1-4 in the figure, spanning the process, kernel, and HCA/RNIC):
  1. UMR creation request (send the number of blocks)
  2. HCA issues uninitialized memory keys for future UMR use
  3. Kernel maps virtual to physical addresses and pins the region into physical memory
  4. HCA caches the virtual-to-physical mapping
SLIDE 23

On-Demand Paging (ODP)

  • Applications no longer need to pin down the underlying physical pages
  • Memory Regions (MRs) are NEVER pinned by the OS
    – Paged in by the HCA when needed
    – Paged out by the OS when reclaimed
  • ODP can be divided into two classes
    – Explicit ODP
      • Applications still register memory buffers for communication, but this operation is used to define access control for IO rather than to pin down the pages
    – Implicit ODP
      • Applications are provided with a special memory key that represents their complete address space; there is no need to register any virtual address range
  • Advantages
    – Simplifies programming
    – Unlimited MR sizes
    – Physical memory optimization
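For reference, a hedged sketch of registering a buffer for explicit ODP with the verbs API; the IBV_ACCESS_ON_DEMAND access flag is the relevant knob, the protection domain and buffer are assumed to exist, and the call only succeeds on ODP-capable HCAs and drivers.

#include <infiniband/verbs.h>

/* Explicit ODP: register 'buf' without pinning its pages (assumes pd, buf, len exist) */
struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                               IBV_ACCESS_LOCAL_WRITE |
                               IBV_ACCESS_REMOTE_WRITE |
                               IBV_ACCESS_ON_DEMAND);   /* pages are faulted in by the HCA on first use */
if (!mr) {
    /* Fall back to a conventional (pinned) registration if ODP is not supported */
    mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}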

SLIDE 24

Implicit On-Demand Paging (ODP)

  • Introduced by Mellanox to avoid pinning the pages of registered memory regions
  • An ODP-aware runtime could reduce the size of pin-down buffers while maintaining performance

Figure: Execution time (s, log scale) of CG, EP, FT, IS, MG, LU, SP, AWP-ODC, and Graph500 at 256 processes, comparing Pin-down, Explicit-ODP, and Implicit-ODP.

M. Li, X. Lu, H. Subramoni, and D. K. Panda, "Designing Registration Caching Free High-Performance MPI Library with Implicit On-Demand Paging (ODP) of InfiniBand", HiPC '17

SLIDE 25

Collective Offload Support on the Adapters

  • Performance of collective operations (broadcast, barrier, reduction, all-reduce, etc.) is very critical to the overall performance of MPI applications
  • Currently being done with basic pt-to-pt operations (send/recv and RDMA) using host-based operations
  • Mellanox ConnectX-2, ConnectX-3, ConnectX-4, and ConnectX-5 adapters support offloading some of these operations to the adapters (CORE-Direct)
    – Provides overlap of computation and collective communication
    – Reduces OS jitter (since everything is done in hardware)

SLIDE 26

One-to-many Multi-Send

  • Sender creates a task-list consisting of only send and wait WQEs
    – One send WQE is created for each registered receiver and is appended to the rear of a singly linked task-list
    – A wait WQE is added to make the HCA wait for the ACK packet from the receiver

Figure: The application posts a task list (Send, Wait, Send, Send, Send, Wait) to the InfiniBand HCA, which maintains Send/Recv queues, Send/Recv CQs, a management queue (MQ), and an MCQ, and transmits the data on the physical link.

SLIDE 27

Scalable Hierarchical Aggregation Protocol (SHArP)

  • Management and execution of MPI operations in the network by using SHArP
  • Manipulation of data while it is being transferred in the switch network
  • SHArP provides an abstraction to realize the reduction operation
  • Defines Aggregation Nodes (AN), Aggregation Tree, and Aggregation Groups
  • AN logic is implemented as an InfiniBand Target Channel Adapter (TCA) integrated into the switch ASIC*
  • Uses RC for communication between ANs and between an AN and hosts in the Aggregation Tree*

Figures: Physical network topology* and the logical SHArP tree* built over it.

* Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction. R. L. Graham, D. Bureddy, P. Lui, G. Shainer, H. Rosenstock, G. Bloch, D. Goldenberg, M. Dubman, S. Kotchubievsky, V. Koushnir, L. Levi, A. Margolin, T. Ronen, A. Shpiner, O. Wertheim, E. Zahavi, Mellanox Technologies, Inc. First Workshop on Optimization of Communication in HPC Runtime Systems (COM-HPC 2016)

SLIDE 28

Presentation Overview

  • Advanced Features for InfiniBand
  • Advanced Features for High Speed Ethernet
  • RDMA over Converged Ethernet
  • Open Fabrics Software Stack and RDMA Programming
  • Libfabrics Software Stack and Programming
  • Network Management Infrastructure and Tools
  • Common Challenges in Building HEC Systems with IB and HSE
    – Network Adapters and NUMA Interactions
    – Network Switches, Topology and Routing
    – Network Bridges
  • System Specific Challenges and Case Studies
    – HPC (MPI, PGAS and GPU/Xeon Phi Computing)
    – Deep Learning
    – Cloud Computing
  • Conclusions and Final Q&A

SLIDE 29

The Ethernet Ecosystem

Courtesy: Scott Kipp @ Ethernet Alliance - http://www.ethernetalliance.org/roadmap/

SLIDE 30

Emergence of 25 GigE and Benefits

Figure: Slash top-of-rack switches (Source: IEEE 802.3)

Courtesy: http://www.eetimes.com/document.asp?doc_id=1323184, http://www.networkcomputing.com/data-centers/25-gbe-big-deal-will-arrive/1714647938, http://www.plexxi.com/2014/07/whats-25-gigabit-ethernet-want/, http://www.qlogic.com/Products/adapters/Pages/25Gb-Ethernet.aspx

SLIDE 31

Matching PCIe and Ethernet Speeds

  • Requires half the number of lanes compared to 40G (x4 instead of x8 PCIe lanes)
  • Better PCIe bandwidth utilization (25/32 = 78% vs. 40/64 = 62.5%) with lower power impact

Ethernet Rate (Gb/s)   PCIe Gen3 Lanes, Single Port   PCIe Gen3 Lanes, Dual Port
100                    16                             32 (Uncommon)
40                     8                              16
25                     4                              8
10                     2                              4

Courtesy: http://www.ieee802.org/3/cfi/0314_3/CFI_03_0314.pdf

SLIDE 32

Detailed Specifications for 25 and 50 GigE and Looking Forward

  • 25G & 50G Ethernet specification extends IEEE 802.3 to work at increased data rates
  • Features in Draft 1.4 of the specification
    – PCS/PMA operation at 25 Gb/s over a single lane
    – PCS/PMA operation at 25 Gb/s over two lanes
    – Optional Forward Error Correction modes
    – Optional auto-negotiation using an OUI next page
    – Optional link training
  • Standards for 50 Gb/s, 200 Gb/s and 400 Gb/s under development
    – Expected around 2017-2018?

Courtesy: Scott Kipp @ Ethernet Alliance - http://www.ethernetalliance.org/roadmap/

SLIDE 33

Ethernet Roadmap – To Terabit Speeds?

Figure: Ethernet roadmap: 50G, 100G, 200G and 400G by 2018-2019; Terabit speeds by 2025?!?!

Courtesy: Scott Kipp @ Ethernet Alliance - http://www.ethernetalliance.org/roadmap/

SLIDE 34

Presentation Overview

  • Advanced Features for InfiniBand
  • Advanced Features for High Speed Ethernet
  • RDMA over Converged Ethernet
  • Open Fabrics Software Stack and RDMA Programming
  • Libfabrics Software Stack and Programming
  • Network Management Infrastructure and Tools
  • Common Challenges in Building HEC Systems with IB and HSE
    – Network Adapters and NUMA Interactions
    – Network Switches, Topology and Routing
    – Network Bridges
  • System Specific Challenges and Case Studies
    – HPC (MPI, PGAS and GPU/Xeon Phi Computing)
    – Deep Learning
    – Cloud Computing
  • Conclusions and Final Q&A

SLIDE 35

RDMA over Converged Enhanced Ethernet

Figure (Network Stack Comparison, courtesy OFED/Mellanox): native InfiniBand runs IB Verbs applications over the IB transport, IB network, and InfiniBand link layers; RoCE runs IB Verbs over the IB transport and IB network layers on an Ethernet link layer; RoCE v2 runs IB Verbs over the IB transport layer on UDP/IP over an Ethernet link layer.

  • Takes advantage of IB and Ethernet
    – Software written with IB-Verbs
    – Link layer is Converged (Enhanced) Ethernet (CE)
  • Pros: IB vs. RoCE
    – Works natively in Ethernet environments
      • Entire Ethernet management ecosystem is available
    – Has all the benefits of IB verbs
    – Link layer is very similar to the link layer of native IB, so there are no missing features
  • RoCE v2: Additional Benefits over RoCE
    – Traditional network management tools apply
    – ACLs (Metering, Accounting, Firewalling)
    – IGMP Snooping for optimized Multicast
    – Network Monitoring Tools
  • Cons:
    – Network bandwidth might be limited to Ethernet switches
      • 10/40GE switches available; 56 Gbps IB is available

Figure (Packet Header Comparison): RoCE carries the IB GRH (L3) and IB BTH+ (L4) headers inside an Ethernet L2 header, identified by the Ethertype; RoCE v2 carries the IB BTH+ (L4) header inside UDP/IP (identified by the IP protocol number and UDP port number) over the Ethernet L2 header.

SLIDE 36

Presentation Overview

  • Advanced Features for InfiniBand
  • Advanced Features for High Speed Ethernet
  • RDMA over Converged Ethernet
  • Open Fabrics Software Stack and RDMA Programming
  • Libfabrics Software Stack and Programming
  • Network Management Infrastructure and Tools
  • Common Challenges in Building HEC Systems with IB and HSE
    – Network Adapters and NUMA Interactions
    – Network Switches, Topology and Routing
    – Network Bridges
  • System Specific Challenges and Case Studies
    – HPC (MPI, PGAS and GPU/Xeon Phi Computing)
    – Deep Learning
    – Cloud Computing
  • Conclusions and Final Q&A

SLIDE 37

Software Convergence with OpenFabrics

  • Open-source organization (formerly OpenIB)
    – www.openfabrics.org
  • Incorporates IB, RoCE, and iWARP in a unified manner
    – Support for Linux and Windows
  • Users can download the entire stack and run it
    – Latest stable release is OFED 4.8.1
  • New naming convention to get aligned with Linux kernel development
  • OFED 4.8.2 is under development

SLIDE 38

OpenFabrics Software Stack

Figure: The OpenFabrics stack spans hardware (InfiniBand HCA, iWARP R-NIC) with hardware-specific drivers; a mid-layer (kernel-level verbs/API, connection managers and the Connection Manager Abstraction (CMA), MAD, SA client, SMA); upper-layer protocols (IPoIB, SDP, SRP, iSER, RDS, NFS-RDMA RPC, cluster file systems); user-level APIs (user-level verbs/API, UDAPL, SDP library, user-level MAD API, OpenSM, diagnostic tools); and application-level access methods (IP-based apps, sockets-based access, various MPIs, access to file systems, block storage access, clustered DB access). User-level verbs and UDAPL take kernel-bypass paths.

Key: SA Subnet Administrator; MAD Management Datagram; SMA Subnet Manager Agent; PMA Performance Manager Agent; IPoIB IP over InfiniBand; SDP Sockets Direct Protocol; SRP SCSI RDMA Protocol (Initiator); iSER iSCSI RDMA Protocol (Initiator); RDS Reliable Datagram Service; UDAPL User Direct Access Programming Lib; HCA Host Channel Adapter; R-NIC RDMA NIC.

SLIDE 39

Programming with OpenFabrics: Sample Steps

  1. Create QPs (endpoints)
  2. Register memory for sending and receiving
  3. Send
     – Channel semantics
       • Post receive
       • Post send
     – RDMA semantics

Figure: Sender and receiver, each shown as a process / kernel / HCA stack. A minimal setup sketch for steps 1 and 2 follows below.
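The following is a minimal, hedged sketch of steps 1 and 2 with the verbs API; device selection, queue sizes, and error handling are simplified assumptions, and the QP still has to be connected to a peer (e.g., via rdma_cm or an out-of-band exchange of LID/QPN) before use. The actual post-receive and post-send calls appear on the next two slides.

#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    /* Open the first available RDMA device */
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    struct ibv_context *ctx = ibv_open_device(dev_list[0]);

    /* Protection domain and completion queue */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);

    /* Step 2: register a buffer for sending and receiving */
    size_t len = 4096;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);

    /* Step 1: create a QP (endpoint) */
    struct ibv_qp_init_attr attr = {
        .send_cq = cq, .recv_cq = cq,
        .cap = { .max_send_wr = 128, .max_recv_wr = 128,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);

    /* ... connect the QP, then post receives/sends as shown on the following slides ... */
    (void) mr; (void) qp;
    return 0;
}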

SLIDE 40

  • Prepare and post send descriptor (channel semantics)

Verbs: Post Send

struct ibv_send_wr *bad_wr;
struct ibv_send_wr sr;
struct ibv_sge sg_entry;

sr.next = NULL;
sr.opcode = IBV_WR_SEND;
sr.wr_id = 0;
sr.num_sge = 1;
if (len < max_inline_size) {
    sr.send_flags = IBV_SEND_SIGNALED | IBV_SEND_INLINE;
} else {
    sr.send_flags = IBV_SEND_SIGNALED;
}
sr.sg_list = &(sg_entry);

sg_entry.addr = (uintptr_t) buf;      /* local buffer */
sg_entry.length = len;
sg_entry.lkey = mr_handle->lkey;      /* from memory registration */

ret = ibv_post_send(qp, &sr, &bad_wr);
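Because sr.send_flags requests a signaled completion, the sender subsequently reaps the completion from the send CQ; a minimal sketch (assuming the CQ the QP was created with):

/* Poll the completion queue until the signaled send completes */
struct ibv_wc wc;
int n;
do {
    n = ibv_poll_cq(cq, 1, &wc);          /* returns the number of completions found */
} while (n == 0);
if (n < 0 || wc.status != IBV_WC_SUCCESS) {
    /* handle completion error (wc.status and wc.wr_id identify the failed work request) */
}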

SLIDE 41

  • Prepare and post RDMA write (memory semantics)

Verbs: Post RDMA Write

struct ibv_send_wr *bad_wr;
struct ibv_send_wr sr;
struct ibv_sge sg_entry;

sr.next = NULL;
sr.opcode = IBV_WR_RDMA_WRITE;            /* set type to RDMA Write */
sr.wr_id = 0;
sr.num_sge = 1;
sr.send_flags = IBV_SEND_SIGNALED;
sr.wr.rdma.remote_addr = remote_addr;     /* remote virtual addr. */
sr.wr.rdma.rkey = rkey;                   /* from remote node */
sr.sg_list = &(sg_entry);

sg_entry.addr = (uintptr_t) buf;          /* local buffer */
sg_entry.length = len;
sg_entry.lkey = mr_handle->lkey;

ret = ibv_post_send(qp, &sr, &bad_wr);
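For the RDMA write to succeed, the target node must have registered the destination buffer with remote-write permission and communicated its address and rkey to the initiator out of band; a brief sketch (the exchange mechanism, e.g. sockets or connection-manager private data, is left as an assumption):

/* On the target node: allow the peer to RDMA-write into 'dst_buf' */
struct ibv_mr *dst_mr = ibv_reg_mr(pd, dst_buf, dst_len,
                                   IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);

/* Values the initiator plugs into sr.wr.rdma.remote_addr / rkey;
 * how they are exchanged is up to the application */
uint64_t remote_addr = (uintptr_t) dst_buf;
uint32_t rkey = dst_mr->rkey;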

SLIDE 42

Presentation Overview

  • Advanced Features for InfiniBand
  • Advanced Features for High Speed Ethernet
  • RDMA over Converged Ethernet
  • Open Fabrics Software Stack and RDMA Programming
  • Libfabrics Software Stack and Programming
  • Network Management Infrastructure and Tools
  • Common Challenges in Building HEC Systems with IB and HSE
    – Network Adapters and NUMA Interactions
    – Network Switches, Topology and Routing
    – Network Bridges
  • System Specific Challenges and Case Studies
    – HPC (MPI, PGAS and GPU/Xeon Phi Computing)
    – Deep Learning
    – Cloud Computing
  • Conclusions and Final Q&A

SLIDE 43

Libfabrics Connection Model

Figure: Connection establishment between a server process and a client process, each running over an OFI provider (Sockets/Verbs/PSM) and a GigE/IB/TrueScale HCA. Both sides open the fabric (fi_fabric), a domain (fi_domain), and register memory (fi_mr_reg). The server opens a passive endpoint (fi_passive_ep) and an event queue (fi_eq_open), binds them (fi_pep_bind), listens for incoming connections (fi_listen), and waits on fi_eq_sread until a new event equal to FI_CONNREQ is detected. The client opens an endpoint (fi_endpoint) and a completion queue (fi_cq_open), binds the EP to the CQ (fi_ep_bind), and connects to the remote endpoint (fi_connect).

SLIDE 44

Libfabrics Connection Model (Cont.)

Figure: After the FI_CONNREQ event, the server opens its own endpoint and completion queue, binds them (fi_ep_bind), and accepts the connection (fi_accept); both sides then validate the FI_CONNECTED event with fi_eq_sread. Data transfer proceeds with fi_recv to post receives, fi_send to post sends, and fi_cq_read / fi_cq_sread to poll or wait for completions; finally each side calls fi_shutdown to shut down the channel and fi_close on all open resources.

SLIDE 45

Scalable EndPoints vs. Shared TX/RX Context

  • Normal EndPoint: similar to a socket / QP; simple / easy to use
  • Scalable EndPoints: use more HW resources; higher performance per EP
  • Shared TX/RX Context: share HW resources; # EP >> HW resources

Figure: Normal endpoint, scalable endpoints, and shared TX/RX context configurations of transmit, receive, and completion resources.

Courtesy: http://www.slideshare.net/seanhefty/ofa-workshop2015ofiwg?ref=http://ofiwg.github.io/libfabric/
SLIDE 46

Libfabrics: Fabric, Domain and Endpoint Creation

  • Open fabric, domain and EP

struct fi_info *info, *hints;
struct fid_fabric *fabric;
struct fid_domain *dom;
struct fid_ep *ep;

hints = fi_allocinfo();

/* Obtain fabric information */
rc = fi_getinfo(VERSION, node, service, flags, hints, &info);

/* Free fabric information */
fi_freeinfo(hints);

/* Open fabric */
rc = fi_fabric(info->fabric_attr, &fabric, NULL);

/* Open domain */
rc = fi_domain(fabric, entry.info, &dom, NULL);

/* Open end point */
rc = fi_endpoint(dom, entry.info, &ep, NULL);
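In practice, applications usually constrain what fi_getinfo returns by filling in the hints structure first; a small, hedged sketch follows (the capability and endpoint-type choices here are illustrative assumptions, not requirements of the library):

/* Illustrative hints: ask for a reliable, connection-oriented endpoint
 * that supports two-sided messages and RMA */
hints = fi_allocinfo();
hints->ep_attr->type = FI_EP_MSG;       /* connected endpoint (cf. FI_EP_RDM for connectionless) */
hints->caps = FI_MSG | FI_RMA;          /* capabilities the application needs */
hints->mode = FI_CONTEXT;               /* application supplies an fi_context with each operation */

rc = fi_getinfo(FI_VERSION(1, 5), node, service, 0, hints, &info);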

SLIDE 47

Libfabrics: Memory Registration

  • Open fabric / domain and create EQ, EP to end nodes
    – Connection establishment is abstracted out using connection management APIs (fi_cm)
    – fi_listen, fi_connect, fi_accept
    – Fabric provider can implement them with connection managers (rdma_cm or ibcm) or directly through verbs with out-of-band communication
  • Register memory

int fi_mr_reg(struct fid_domain *domain, const void *buf, size_t len,
              uint64_t access, uint64_t offset, uint64_t requested_key,
              uint64_t flags, struct fid_mr **mr, void *context);

rc = fi_mr_reg(domain, buffer, size, FI_SEND | FI_RECV,
               0, 0, 0, &mr, NULL);

rc = fi_mr_reg(domain, buffer, size, FI_REMOTE_READ | FI_REMOTE_WRITE,
               0, user_key, 0, &mr, NULL);

  • Permissions can be set as needed
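Once registered, the local descriptor and the remote protection key are retrieved from the MR object; a brief sketch (the variable names are illustrative):

/* Local descriptor passed to fi_send / fi_recv / fi_read / fi_write */
void *desc = fi_mr_desc(mr);

/* Remote protection key the peer uses for RMA operations (fi_read / fi_write) */
uint64_t key = fi_mr_key(mr);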

SLIDE 48

Libfabrics: Post Receive (Channel Semantics)

  • Prepare and post a receive request

ssize_t fi_recv(struct fid_ep *ep, void *buf, size_t len, void *desc,
                fi_addr_t src_addr, void *context);
  • For connected EPs

ssize_t fi_recvmsg(struct fid_ep *ep, const struct fi_msg *msg, uint64_t flags);
  • For connected and un-connected EPs

struct fid_ep *ep;
struct fid_mr *mr;

/* Post recv request */
rc = fi_recv(ep, buf, size, fi_mr_desc(mr), 0, (void *)(uintptr_t) RECV_WCID);
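Receives must be posted before the matching sends arrive, so runtimes typically pre-post a pool of receive buffers; a hedged sketch using the same fi_recv call (the pool size and context encoding are illustrative assumptions):

/* Pre-post a pool of receive buffers on the endpoint */
enum { NUM_RECVS = 64, RECV_SIZE = 8192 };
char recv_pool[NUM_RECVS][RECV_SIZE];    /* assumed to be covered by the registration 'mr' */

for (int i = 0; i < NUM_RECVS; i++) {
    rc = fi_recv(ep, recv_pool[i], RECV_SIZE, fi_mr_desc(mr),
                 0, (void *)(uintptr_t) i /* context identifies the buffer */);
}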

SLIDE 49

Libfabrics: Post Send (Channel Semantics)

  • Prepare and post a send descriptor

ssize_t fi_send(struct fid_ep *ep, void *buf, size_t len, void *desc,
                fi_addr_t dest_addr, void *context);
  • For connected EPs

ssize_t fi_sendmsg(struct fid_ep *ep, const struct fi_msg *msg, uint64_t flags);
  • For connected and un-connected EPs

ssize_t fi_inject(struct fid_ep *ep, void *buf, size_t len, fi_addr_t dest_addr);
  • Buffer available for re-use as soon as the function returns
  • No completion event generated for the send

struct fid_ep *ep;
struct fid_mr *mr;
static fi_addr_t remote_fi_addr;

rc = fi_send(ep, buf, size, fi_mr_desc(mr), 0, (void *)(uintptr_t) SEND_WCID);

rc = fi_inject(ep, buf, size, remote_fi_addr);
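For fi_send (unlike fi_inject) the completion is reaped from the completion queue bound to the endpoint; a minimal, hedged sketch:

/* Poll the completion queue bound to the endpoint */
struct fi_cq_entry comp;     /* entry format depends on the fi_cq_attr used at fi_cq_open */
ssize_t n;
do {
    n = fi_cq_read(cq, &comp, 1);         /* returns #completions, or -FI_EAGAIN if the CQ is empty */
} while (n == -FI_EAGAIN);
if (n < 0) {
    struct fi_cq_err_entry err;
    fi_cq_readerr(cq, &err, 0);           /* retrieve details of a failed operation */
}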

SLIDE 50

Libfabrics: Post Remote Read (Memory Semantics)

  • Prepare and post a remote read request

ssize_t fi_read(struct fid_ep *ep, void *buf, size_t len, void *desc,
                fi_addr_t src_addr, uint64_t addr, uint64_t key, void *context);
  • For connected EPs

ssize_t fi_readmsg(struct fid_ep *ep, const struct fi_msg_rma *msg, uint64_t flags);
  • For connected and un-connected EPs

struct fid_ep *ep;
struct fid_mr *mr;
struct fi_context fi_ctx_read;

/* Post remote read request */
ret = fi_read(ep, buf, size, fi_mr_desc(mr), local_addr,
              remote_addr, remote_key, &fi_ctx_read);

SLIDE 51

Libfabrics: Post Remote Write (Memory Semantics)

  • Prepare and post a remote write request

ssize_t fi_write(struct fid_ep *ep, const void *buf, size_t len, void *desc,
                 fi_addr_t dest_addr, uint64_t addr, uint64_t key, void *context);
  • For connected EPs

ssize_t fi_writemsg(struct fid_ep *ep, const struct fi_msg_rma *msg, uint64_t flags);
  • For connected and un-connected EPs

ssize_t fi_inject_write(struct fid_ep *ep, const void *buf, size_t len,
                        fi_addr_t dest_addr, uint64_t addr, uint64_t key);
  • Buffer available for re-use as soon as the function returns
  • No completion event generated for the write

ssize_t fi_writedata(struct fid_ep *ep, const void *buf, size_t len, void *desc,
                     uint64_t data, fi_addr_t dest_addr, uint64_t addr,
                     uint64_t key, void *context);
  • Similar to fi_write
  • Allows for the sending of remote CQ data
SLIDE 52

Presentation Overview

  • Advanced Features for InfiniBand
  • Advanced Features for High Speed Ethernet
  • RDMA over Converged Ethernet
  • Open Fabrics Software Stack and RDMA Programming
  • Libfabrics Software Stack and Programming
  • Network Management Infrastructure and Tools
  • Common Challenges in Building HEC Systems with IB and HSE
    – Network Adapters and NUMA Interactions
    – Network Switches, Topology and Routing
    – Network Bridges
  • System Specific Challenges and Case Studies
    – HPC (MPI, PGAS and GPU/Xeon Phi Computing)
    – Deep Learning
    – Cloud Computing
  • Conclusions and Final Q&A

SLIDE 53

Network Management Infrastructure and Tools

  • Management Infrastructure
    – Subnet Manager
    – Diagnostic tools
      • System Discovery Tools
      • System Health Monitoring Tools
      • System Performance Monitoring Tools
    – Fabric management tools

SLIDE 54

Concepts in IB Management

  • Agents
    – Processes or hardware units running on each adapter, switch, and router (everything on the network)
    – Provide the capability to query and set parameters
  • Managers
    – Make high-level decisions and implement them on the network fabric using the agents
  • Messaging schemes
    – Used for interactions between the manager and agents (or between agents)
  • Messages

SLIDE 55

InfiniBand Management

  • All IB management happens using packets called Management Datagrams
    – Popularly referred to as "MAD packets"
  • Four major classes of management mechanisms
    – Subnet Management
    – Subnet Administration
    – Communication Management
    – General Services

SLIDE 56

Subnet Management & Administration

  • Consists of at least one subnet manager (SM) and several subnet management agents (SMAs)
    – Each adapter, switch, and router has an agent running
    – Communication between the SM and agents, or between agents, happens using MAD packets called Subnet Management Packets (SMPs)
  • SM's responsibilities include:
    – Discovering the physical topology of the subnet
    – Assigning LIDs to the end nodes, switches and routers
    – Populating switches and routers with routing paths
    – Subnet sweeps to discover topology changes

SLIDE 57

Subnet Manager

Figure: A subnet of compute nodes and switches with active and inactive links; multicast join requests from nodes trigger multicast setup by the subnet manager.

SLIDE 58

Subnet Manager Sweep Behavior

  • The SM can be configured to sweep once or continuously
  • On the first sweep:
    – All ports are assigned LIDs
    – All routes are set up on the switches
  • On subsequent sweeps:
    – If there has been any change to the topology, appropriate routes are updated
    – If DLID X is down, the packet is not sent all the way
      • The first hop will not have a forwarding entry for LID X
  • Sweep time is configured by the system administrator
    – Cannot be too high or too low

SLIDE 59

Subnet Manager Scalability Issues

  • A single subnet manager has issues on large systems
    – Performance and overhead of scanning
  • Hardware implementations on switches are faster, but work only for small systems (memory usage)
  • Software implementations are more popular (OpenSM)
    – Multi-SM models
      • Two benefits: fault tolerance (if one SM dies) and scalability (different SMs can handle different portions of the network)
      • Current SMs only provide the fault-tolerance model
      • Network subsetting is still being investigated
  • Asynchronous events specified to improve scalability
    – E.g., TRAPs are events sent by an agent to the SM when a link goes down

SLIDE 60

Multicast Group Management

  • Creation, joining/leaving, and deleting multicast groups occur as SA requests
    – The requesting node sends a request to the SA
    – The SA sends MAD packets to SMAs on the switches to set up routes for the multicast packets
      • Each switch contains information on which ports to forward the multicast packet to
  • Multicast itself does not go through the subnet manager
    – Only the setup and teardown go through the SM

SLIDE 61

Network Management Infrastructure and Tools

  • Management Infrastructure
    – Subnet Manager
    – Diagnostic tools
      • System Discovery Tools
      • System Health Monitoring Tools
      • System Performance Monitoring Tools
    – Fabric management tools

SLIDE 62

Tools to Analyze InfiniBand Networks

  • Different types of tools exist:
    – High-level tools that internally talk to the subnet manager using management datagrams
    – Each hardware device exposes a few mandatory counters and a number of optional (sometimes vendor-specific) counters
  • Possible to write your own tools based on the management datagram interface
    – Several vendors provide such IB management tools

SLIDE 63

Network Discovery Tools

  • Starting with almost no knowledge about the system, we can identify several details of the network configuration
    – Example tools include:
      • ibstatus: shows adapter status
      • smpquery: SMP query tool
      • perfquery: reports performance/error counters of a port
      • ibportstate: shows the status of an IB port, enables/disables a port
      • ibhosts: finds all the network adapters in the system
      • ibswitches: finds all the network switches in the system
      • ibnetdiscover: finds the connectivity between the ports
      • ... and many others exist
    – Possible to write your own tools based on the management datagram interface
      • Several vendors provide such IB management tools

SLIDE 64

Health and Performance Monitoring Tools

  • Several tools exist to monitor the health and performance of the InfiniBand network
    – Example health monitoring tools include
      • ibdiagnet: queries for overall fabric health
      • ibportstate: identifies the state and link speed of an InfiniBand port
      • ibdatacounts: gets InfiniBand port data counters
    – Example performance monitoring tools include
      • ibv_send_lat, ibv_write_lat: IB verbs level performance tests
      • perfquery: queries performance counters in the IB HCA

SLIDE 65

Tools for Network Switching and Routing

% ibroute -G 0x66a000700067c
Lid    Out   Destination
       Port  Info
0x0001 001 : (Channel Adapter portguid 0x0002c9030001e3f3: ' HCA-1')
0x0002 013 : (Channel Adapter portguid 0x0002c9020023c301: ' HCA-1')
0x0003 014 : (Channel Adapter portguid 0x0002c9030001e603: ' HCA-1')
0x0004 015 : (Channel Adapter portguid 0x0002c9020023c305: ' HCA-2')
0x0005 016 : (Channel Adapter portguid 0x0011750000ffe005: ' HCA-1')
0x0014 017 : (Switch portguid 0x00066a0007000728: 'SilverStorm 9120 GUID=0x00066a00020001aa Leaf 8, Chip A')
0x0015 020 : (Channel Adapter portguid 0x0002c9020023c131: ' HCA-2')
0x0016 019 : (Switch portguid 0x00066a0007000732: 'SilverStorm 9120 GUID=0x00066a00020001aa Leaf 10, Chip A')
0x0017 019 : (Channel Adapter portguid 0x0002c9030001c937: ' HCA-1')
0x0018 019 : (Channel Adapter portguid 0x0002c9020023c039: ' HCA-2')
...

Packets to LID 0x0001 will be sent out on Port 001

SLIDE 66

Static Analysis of Network Contention

  • Based on destination LIDs and switching/routing information, the exact path of the packets can be identified
    – If the application communication pattern is known, we can statically identify possible network contention

Figure: A two-level fat-tree of leaf blocks and spine blocks, with example LIDs/ports annotated on the leaf blocks.

SLIDE 67

Dynamic Analysis of Network Contention

  • IB provides many optional counters for querying performance
    – PortXmitWait: number of ticks in which there was data to send, but no flow-control credits
    – RNR NAKs: number of times a message was sent, but the receiver had not yet posted a receive buffer
      • This can timeout, so it can be an error in some cases
    – PortXmitFlowPkts: number of (link-level) flow-control packets transmitted on the port
    – SWPortVLCongestion: number of packets dropped due to congestion

SLIDE 68

Network Management Infrastructure and Tools

  • Management Infrastructure
    – Subnet Manager
    – Diagnostic tools
      • System Discovery Tools
      • System Health Monitoring Tools
      • System Performance Monitoring Tools
    – Fabric management tools

SLIDE 69

In-band Management vs. Out-of-band Management

  • InfiniBand provides two forms of management
    – Out-of-band management (similar to other networks)
    – In-band management (used by the subnet manager)
  • Out-of-band management requires a separate Ethernet port on the switch, where an administrator can plug in his/her laptop
  • In-band management allows the switch to receive management commands directly over the regular communication network

Figure: A switch reachable both over InfiniBand connectivity (in-band management) and over Ethernet connectivity (out-of-band management).

SLIDE 70

Overview of OSU INAM

  • A network monitoring and analysis tool that is capable of analyzing traffic on the InfiniBand network with inputs from the MPI runtime
    – http://mvapich.cse.ohio-state.edu/tools/osu-inam/
  • Monitors IB clusters in real time by querying various subnet management entities and gathering input from the MPI runtimes
  • OSU INAM v0.9.2 released on 10/31/2017
  • Significant enhancements to the user interface to enable scaling to clusters with thousands of nodes
  • Improved database insert times by using "bulk inserts"
  • Capability to look up the list of nodes communicating through a network link
  • Capability to classify data flowing over a network link at job-level and process-level granularity in conjunction with MVAPICH2-X 2.3b
  • "Best practices" guidelines for deploying OSU INAM on different clusters
  • Capability to analyze and profile node-level, job-level and process-level activities for MPI communication
    – Point-to-Point, Collectives and RMA
  • Ability to filter data based on type of counters using a "drop down" list
  • Remotely monitor various metrics of MPI processes at user-specified granularity
  • "Job Page" to display jobs in ascending/descending order of various performance metrics in conjunction with MVAPICH2-X
  • Visualize the data transfer happening in a "live" or "historical" fashion for the entire network, a job, or a set of nodes
SLIDE 71

OSU INAM Features

  • Show the network topology of large clusters
  • Visualize traffic patterns on different links
  • Quickly identify congested links / links in error state
  • See the history unfold: play back the historical state of the network

Figures: Comet@SDSC clustered view (1,879 nodes, 212 switches, 4,377 network links); finding routes between nodes.

SLIDE 72

OSU INAM Features (Cont.)

  • Job level view (figure: visualizing a 5-node job)
    – Show different network metrics (load, error, etc.) for any live job
    – Play back historical data for completed jobs to identify bottlenecks
  • Node level view: details per process or per node
    – CPU utilization for each rank/node
    – Bytes sent/received for MPI operations (pt-to-pt, collective, RMA)
    – Network metrics (e.g. XmitDiscard, RcvError) per rank/node
  • Estimated Link Utilization view (figure: estimated process-level link utilization)
    – Classify data flowing over a network link at different granularity in conjunction with MVAPICH2-X 2.2rc1
      • Job level and
      • Process level
SLIDE 73

Presentation Overview

  • Advanced Features for InfiniBand
  • Advanced Features for High Speed Ethernet
  • RDMA over Converged Ethernet
  • Open Fabrics Software Stack and RDMA Programming
  • Libfabrics Software Stack and Programming
  • Network Management Infrastructure and Tools
  • Common Challenges in Building HEC Systems with IB and HSE
    – Network Adapters and NUMA Interactions
    – Network Switches, Topology and Routing
    – Network Bridges
  • System Specific Challenges and Case Studies
    – HPC (MPI, PGAS and GPU/Xeon Phi Computing)
    – Deep Learning
    – Cloud Computing
  • Conclusions and Final Q&A

SLIDE 74

Common Challenges for Large-Scale Installations

Common Challenges
 Adapters and Interactions
   I/O bus
   Multi-port adapters
   NUMA
 Switches
   Topologies
   Switching / Routing
 Bridges
   IB interoperability

SLIDE 75

Common Challenges in Building HEC Systems with IB and HSE

  • Network adapters and interactions with other components
    – I/O bus interactions and limitations
    – Multi-port adapters and bottlenecks
    – NUMA interactions
  • Network switches
  • Network bridges

SLIDE 76

I/O Bus Limitations

  • Data communication traverses three buses (or links) before it reaches the network switch
    – Memory bus (memory to IO hub)
    – I/O link (IO hub to the network adapter)
    – Network link (network adapter to switch)
  • For optimal communication, all of these need to be balanced
  • Network bandwidth:
    – 4X SDR (8 Gbps), 4X DDR (16 Gbps), 4X QDR (32 Gbps), 4X FDR (56 Gbps), 4X EDR (100 Gbps) and 4X HDR (200 Gbps)
    – 40 GigE (40 Gbps)
  • Memory bandwidth:
    – Shared bandwidth (incoming and outgoing)
    – For IB FDR (56 Gbps), memory bandwidth greater than 112 Gbps is required to fully utilize the network
  • I/O link bandwidth:
    – Tricky because several aspects need to be considered
    – Connector capacity vs. link capacity
    – I/O link communication headers, etc.

Figure: Two processors (P0, P1), each with four cores and local memory, connected through an I/O bus to the network adapter, which attaches to the network switch.

SLIDE 77

PCI Express

  • Common I/O interconnect used on most current platforms
    – Can be configured as multiple lanes (1X, 4X, 8X, 16X, 32X)
      • Generation 1 provided 2 Gbps bandwidth per lane, Gen 2 provides 4 Gbps, and Gen 3 provides 8 Gbps per lane
    – Compatible with adapters using fewer lanes
      • If a PCIe connector is 16X, it will still support an 8X adapter by using only 8 lanes
    – Provides multiplexing across a single lane
      • A 1X PCIe bus can be connected to an 8X PCIe connector (allowing an 8X adapter to be plugged in)
    – I/O interconnects are like networks with packetized communication
      • Communication headers for each packet
      • Reliability acknowledgments
      • Flow control acknowledgments
      • Typical efficiency is around 75-80% with 256 byte PCIe packets

SLIDE 78

Multi-port Adapters

  • Several multi-port adapters are available in the market
    – A single adapter can drive multiple network ports at full bandwidth
    – Important to measure other overheads (memory bandwidth and I/O link bandwidth) before assuming a performance benefit
  • Case Study: IB dual-port 4X QDR adapter
    – Each network link is 32 Gbps (dual-port adapters can drive 64 Gbps)
    – A PCIe Gen2 8X link can give a 32 Gbps data rate -> around 24 Gbps effective rate (20% encoding overheads!!)
      • Dual-port IB QDR is not expected to give any benefit in this case
    – A PCIe Gen3 8X link can give a 64 Gbps data rate -> 64 Gbps (minimal encoding overheads)
      • Delivers close to peak performance with dual-port IB adapters

SLIDE 79

Common Challenges in Building HEC Systems with IB and HSE

  • Network adapters and interactions with other components
    – I/O bus interactions and limitations
    – Multi-port adapters and bottlenecks
    – NUMA interactions
  • Network switches
  • Network bridges

SLIDE 80

NUMA Interactions

  • Different cores in a NUMA platform have different communication costs

Figure: A four-socket platform (Sockets 0-3, Cores 0-15), each socket with its own local memory, interconnected by QPI or HT; the network card is attached to one socket over PCIe.

SLIDE 81

Impact of NUMA on Inter-node Latency

  • Cores in Socket 0 (closest to the network card) have the lowest latency
  • Cores in Socket 1 (one hop from the network card) have the highest latency

Figure: Send latency (us) vs. message size (2 bytes to 2K) for core pairs 0->0 and 7->7 (Socket 0) and 14->14 and 27->27 (Socket 1).

ConnectX-4 EDR (100 Gbps): 2.4 GHz fourteen-core (Broadwell) Intel with IB (EDR) switches

SLIDE 82

Impact of NUMA on Inter-node Bandwidth

  • NUMA interactions have a significant impact on bandwidth

Figures: Send bandwidth (MBps) vs. message size for an AMD MagnyCours system (Cores 0, 6, 12, 18) and an Intel Broadwell system (Cores 0, 7, 14, 27).

ConnectX-2 QDR (36 Gbps): 2.5 GHz hex-core (MagnyCours) AMD with IB (QDR) switches
ConnectX-4 EDR (100 Gbps): 2.4 GHz fourteen-core (Broadwell) Intel with IB (EDR) switches
SLIDE 83

Common Challenges in Building HEC Systems with IB and HSE

  • Network adapters and interactions with other components
    – I/O bus interactions and limitations
    – Multi-port adapters and bottlenecks
    – NUMA interactions
  • Network switches
  • Network bridges

SLIDE 84

Common Challenges in Building HEC Systems with IB and HSE

  • Network adapters and interactions with other components
  • Network switches
    – Switch topologies
    – Switching and Routing
  • Network bridges

SLIDE 85

Switch Topologies

  • InfiniBand installations come in multiple topologies
    – Single crossbar switches (up to 36 ports for QDR or FDR)
      • Applicable only to very small systems (hard to scale to large clusters)
    – Fat-tree topologies (medium scale topologies)
      • Provide full bisection bandwidth: given independent communication between processes, you can find a switch configuration that provides fully non-blocking paths (though the same configuration might have contention if the communication pattern changes)
      • Issue: the number of switch components increases super-linearly with the number of nodes (not scalable for large-scale systems)
  • Large scale installations can use more conservative topologies
    – Partial fat-tree topologies (over-provisioning)
    – 3D Torus (Sandia Red Sky and SDSC Gordon), Hypercube (SGI Altix) topologies, and 10D Hypercube (NASA Pleiades)

SLIDE 86

Switch Topology: Absolute Performance vs. Scalability

Figure: Crossbar ASIC (all-to-all connectivity); full fat-tree topology of leaf and spine blocks (full bisection bandwidth); partial fat-tree topology (reduced inter-switch connectivity for more out-ports: super-linear scaling of switch components, but slower than a full fat-tree topology); torus/hypercube topology in which only a few links are connected (linear scaling of switch components).

SLIDE 87

Static Routing in IB + Adaptive Routing Models from Qlogic (Intel) and Mellanox

  • The IB standard only supports static routing
    – Not scalable for large systems where traffic might be non-deterministic, causing hot-spots
  • Next generation IB switches are supporting adaptive routing (in addition to static routing): outside the IB standard
  • Qlogic (Intel) support for adaptive routing
    – Continually monitors application messaging patterns and selects the optimum path for each traffic flow, eliminating slowdowns caused by pathway bottlenecks
    – Dispersive routing load-balances traffic among multiple pathways
    – http://ir.qlogic.com/phoenix.zhtml?c=85695&p=irol-newsarticle&id=1428788
  • Mellanox support for adaptive routing
    – Supports moving traffic via multiple parallel paths
    – Dynamically and automatically re-routes traffic to alleviate congested ports
    – http://www.mellanox.com/related-docs/prod_silicon/PB_InfiniScale_IV.pdf

SLIDE 88

Common Challenges in Building HEC Systems with IB and HSE

  • Network adapters and interactions with other components
  • Network switches
  • Network bridges
    – IB interoperability with Ethernet and FC

SLIDE 89

IB-Ethernet and IB-FC Bridging Solutions

  • Virtual Ethernet/FC Adapter
  • Mainly developed for backward compatibility with existing infrastructure
    – Ethernet over IB (EoIB)
    – Fibre Channel over IB (FCoIB)

Figure: A host whose IB adapter exposes a virtual Ethernet/FC adapter; Ethernet packets travel through a converter switch (e.g., Mellanox BridgeX) to hosts with native Ethernet/FC adapters.

SLIDE 90

Ethernet/FC over IB

  • Can be used in an infrastructure where a part of the nodes are connected over Ethernet or FC
    – All of the IB-connected nodes can communicate over IB
    – The same nodes can communicate with nodes in the older infrastructure using Ethernet-over-IB or FC-over-IB
  • These solutions do not have the performance benefits of IB
    – The host thinks it is using an Ethernet or FC adapter
    – For example, with Ethernet, communication will use TCP/IP
      • There is some hardware support for segmentation offload, but the rest of the IB features are unutilized
  • Note that this is different from VPI, as there is only one network connectivity from the adapter

SLIDE 91

Presentation Overview

  • Advanced Features for InfiniBand
  • Advanced Features for High Speed Ethernet
  • RDMA over Converged Ethernet
  • Open Fabrics Software Stack and RDMA Programming
  • Libfabrics Software Stack and Programming
  • Network Management Infrastructure and Tools
  • Common Challenges in Building HEC Systems with IB and HSE
    – Network Adapters and NUMA Interactions
    – Network Switches, Topology and Routing
    – Network Bridges
  • System Specific Challenges and Case Studies
    – HPC (MPI, PGAS and GPU/Xeon Phi Computing)
    – Deep Learning
    – Cloud Computing
  • Conclusions and Final Q&A

SLIDE 92

System Specific Challenges for HPC Systems

Common Challenges
 Adapters and Interactions
   I/O bus
   Multi-port adapters
   NUMA
 Switches
   Topologies
   Switching / Routing
 Bridges
   IB interoperability

HPC
 MPI
   Multi-rail
   Collectives
   Scalability
   Application Scalability
   Energy Awareness
 PGAS
   Programmability w/ Performance
   Optimized Resource Utilization
 GPU / Xeon Phi
   Programmability w/ Performance
   Hide data movement costs
   Heterogeneity aware design
   Streaming, Deep Learning

SLIDE 93

  • Message Passing Interface (MPI)
  • Partitioned Global Address Space (PGAS) models
  • GPU Computing
  • Xeon Phi Computing

HPC System Challenges and Case Studies

SLIDE 94

Overview of the MVAPICH2 Project

  • High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2, MPI-3.0, and MPI-3.1), started in 2001, first version available in 2002
– MVAPICH2-X (MPI + PGAS), available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
– Support for Virtualization (MVAPICH2-Virt), available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM), available since 2015

– Used by more than 2,850 organizations in 85 countries – More than 440,000 (> 0.44 million) downloads from the OSU site directly

– Empowering many TOP500 clusters (Nov ‘17 ranking)

  • 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
  • 12th, 368,928-core (Stampede2) at TACC
  • 17th, 241,108-core (Pleiades) at NASA
  • 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology

– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)

– http://mvapich.cse.ohio-state.edu

  • Empowering Top500 systems for over a decade

– System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) -> – Sunway TaihuLight (1st in Jun’17, 10M cores, 100 PFlops)

slide-95
SLIDE 95

Network Based Computing Laboratory 95 IT4 Innovations’18

  • Interaction with Multi-Rail Environments
  • Collective Communication
  • Scalability for Large-scale Systems
  • Energy Awareness

Design Challenges and Sample Results

slide-96
SLIDE 96

Network Based Computing Laboratory 96 IT4 Innovations’18

Impact of Multiple Rails on Inter-node MPI Bandwidth

[Figures: inter-node MPI bandwidth (MBytes/sec) vs. message size (1 B to 1 M) for Single Rail and Dual Rail configurations with 1, 2, 4, 8, and 16 communicating pairs]

Platform: ConnectX-4 EDR (100 Gbps), 2.4 GHz deca-core (Haswell) Intel nodes with IB (EDR) switches

Designs based on: S. Sur, M. J. Koop, L. Chai and D. K. Panda, "Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms", IEEE Hot Interconnects, 2007

slide-97
SLIDE 97

Network Based Computing Laboratory 97 IT4 Innovations’18

Hardware Multicast-aware MPI_Bcast on Stampede

[Figures: MPI_Bcast latency (us) for Default vs. hardware Multicast designs: small messages (2-512 bytes) and large messages (2K-128K bytes) at 102,400 cores, plus 16-byte and 32-KByte message latency as the number of nodes is scaled; a minimal MPI_Bcast sketch follows this slide]

Platform: ConnectX-3 FDR (54 Gbps), 2.7 GHz dual octa-core (SandyBridge) Intel, PCIe Gen3, Mellanox IB FDR switch
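As a reference point, the following is a minimal sketch of the broadcast pattern whose latency is plotted above. The MPI code itself is standard; whether the hardware-multicast path is taken is decided inside the library, and the MV2_USE_MCAST runtime parameter mentioned in the comment is an assumption to confirm against the MVAPICH2 user guide for the installed version.

/* Minimal sketch: an MPI_Bcast whose large-scale latency the multicast-aware
 * design targets. Assumes MVAPICH2; enabling hardware multicast through a
 * runtime parameter such as MV2_USE_MCAST=1 is an assumption to verify. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 16;                 /* 16-byte payload, as in the plots */
    char *buf = malloc(count);
    if (rank == 0) {
        for (int i = 0; i < count; i++) buf[i] = (char)i;   /* root fills data */
    }

    /* The same MPI_Bcast call maps either to the default point-to-point
     * algorithm or to the IB hardware-multicast path inside the library. */
    MPI_Bcast(buf, count, MPI_CHAR, 0, MPI_COMM_WORLD);

    free(buf);
    MPI_Finalize();
    return 0;
}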

slide-98
SLIDE 98

Network Based Computing Laboratory 98 IT4 Innovations’18

Hardware Multicast-aware MPI_Bcast on Broadwell + EDR

[Figures: MPI_Bcast latency (us) for Default vs. hardware Multicast designs on Broadwell + EDR: small messages (2-512 bytes) and large messages (2K-128K bytes) at 1,120 cores, plus 16-byte and 32-KByte message latency vs. number of nodes]

Platform: ConnectX-4 EDR (100 Gbps), 2.4 GHz fourteen-core (Broadwell) Intel with Mellanox IB (EDR) switches

slide-99
SLIDE 99

Network Based Computing Laboratory 99 IT4 Innovations’18

Advanced Allreduce Collective Designs Using SHArP and Multi-Leaders

  • Socket-based design can reduce the communication latency by 23% and 40% on Xeon + IB nodes
  • Support is available in MVAPICH2 2.3a and MVAPICH2-X 2.3b

[Figures: communication latency (lower is better) for MVAPICH2, Proposed-Socket-Based, and MVAPICH2+SHArP; HPCG (28 PPN, 56-448 processes) and OSU Micro Benchmark (16 nodes, 28 PPN, 4 B to 4 KB messages); 23% and 40% latency reductions annotated]

  • M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, Supercomputing '17

slide-100
SLIDE 100

Network Based Computing Laboratory 100 IT4 Innovations’18

Performance of MPI_Allreduce On Stampede2 (10,240 Processes)

[Figures: MPI_Allreduce latency (us) vs. message size for MVAPICH2, MVAPICH2-OPT, and IMPI over the 4 B to 4 KB and 8 KB to 256 KB ranges; OSU Micro Benchmark, 64 PPN; up to 2.4X improvement annotated]
  • MPI_Allreduce latency with 32K bytes reduced by 2.4X
slide-101
SLIDE 101

Network Based Computing Laboratory 101 IT4 Innovations’18

Network-Topology-Aware Placement of Processes

Can we design a highly scalable network topology detection service for IB?
How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?

[Figures: overall performance and split-up of physical communication for MILC on Ranger; performance for varying system sizes, plus Default vs. Topology-Aware placement for a 2048-core run; 15% improvement annotated]

  • H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC'12. BEST Paper and BEST STUDENT Paper Finalist
  • Reduces network topology discovery time from O(Nhosts^2) to O(Nhosts)
  • 15% improvement in MILC execution time @ 2048 cores
  • 15% improvement in Hypre execution time @ 1024 cores
slide-102
SLIDE 102

Network Based Computing Laboratory 102 IT4 Innovations’18

Dynamic and Adaptive Tag Matching

[Figures: normalized total tag matching time and normalized memory overhead per process at 512 processes, relative to the default design (lower is better)]

Adaptive and Dynamic Design for MPI Tag Matching; M. Bayatpour, H. Subramoni, S. Chakraborty, and D. K. Panda; IEEE Cluster 2016 [Best Paper Nominee]

Challenge

  • Tag matching is a significant overhead for receivers
  • Existing solutions are static and do not adapt dynamically to the communication pattern
  • Existing solutions do not consider memory overhead

Solution: a new tag matching design (see the sketch after this slide)

  • Dynamically adapts to communication patterns
  • Uses different strategies for different ranks
  • Decisions are based on the number of request objects that must be traversed before hitting the required one

Results

  • Better performance than other state-of-the-art tag-matching schemes
  • Minimum memory consumption
  • Will be available in future MVAPICH2 releases
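To make the traversal cost concrete, below is a generic, illustrative sketch of posted-receive tag matching: the library walks a list of request objects until (source, tag) match. The structure and function names are hypothetical and are not MVAPICH2 internals; they only illustrate the overhead the adaptive design targets.

/* Illustrative sketch of posted-receive tag matching (not MVAPICH2 source).
 * The cost the adaptive design reduces is the number of list entries walked
 * before the matching request is found. */
#include <stddef.h>

#define ANY_SOURCE (-1)
#define ANY_TAG    (-1)

struct recv_request {
    int source;                  /* expected source rank (or ANY_SOURCE) */
    int tag;                     /* expected tag (or ANY_TAG) */
    void *buffer;                /* user receive buffer */
    struct recv_request *next;   /* next posted receive */
};

/* Walk the posted-receive list and return the first matching request,
 * counting how many entries were traversed (the "matching overhead"). */
struct recv_request *match_posted(struct recv_request *head,
                                  int src, int tag, int *traversed)
{
    *traversed = 0;
    for (struct recv_request *r = head; r != NULL; r = r->next) {
        (*traversed)++;
        int src_ok = (r->source == ANY_SOURCE || r->source == src);
        int tag_ok = (r->tag == ANY_TAG || r->tag == tag);
        if (src_ok && tag_ok)
            return r;            /* found: a real library would unlink it here */
    }
    return NULL;                 /* no match: message goes to the unexpected queue */
}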

slide-103
SLIDE 103

Network Based Computing Laboratory 103 IT4 Innovations’18

  • Enhance existing support for MPI_T in MVAPICH2 to expose a richer set of performance and control variables
  • Get and display MPI Performance Variables (PVARs) made available by the runtime in TAU
  • Control the runtime's behavior via MPI Control Variables (CVARs); a sketch of the MPI_T CVAR interface follows this slide
  • Introduced support for new MPI_T based CVARs to MVAPICH2
    ○ MPIR_CVAR_MAX_INLINE_MSG_SZ, MPIR_CVAR_VBUF_POOL_SIZE, MPIR_CVAR_VBUF_SECONDARY_POOL_SIZE
  • TAU enhanced with support for setting MPI_T CVARs in a non-interactive mode for uninstrumented applications

Performance Engineering Applications using MVAPICH2 and TAU

[Figures: VBUF usage without and with CVAR-based tuning, as displayed by ParaProf]
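For reference, the sketch below uses the standard MPI-3 tools interface (MPI_T) to locate and write a control variable. The CVAR name MPIR_CVAR_VBUF_POOL_SIZE is taken from the slide above, but whether it is writable after startup depends on the MVAPICH2 build and the variable's reported scope, so treat this as illustrative rather than a guaranteed tuning recipe.

/* Minimal sketch of writing an MPI_T control variable (MPI-3 tools interface).
 * The CVAR name comes from the slide above; its writability depends on the
 * library build, so a production tool would check the reported scope first. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int provided, ncvars, i;
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);

    MPI_T_cvar_get_num(&ncvars);
    for (i = 0; i < ncvars; i++) {
        char name[256], desc[256];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, bind, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                            &enumtype, desc, &desc_len, &bind, &scope);

        if (strcmp(name, "MPIR_CVAR_VBUF_POOL_SIZE") == 0) {
            MPI_T_cvar_handle handle;
            int count, value = 1024;      /* illustrative pool size */
            MPI_T_cvar_handle_alloc(i, NULL, &handle, &count);
            MPI_T_cvar_write(handle, &value);   /* set the control variable */
            MPI_T_cvar_handle_free(&handle);
            printf("Set %s to %d\n", name, value);
        }
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}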

slide-104
SLIDE 104

Network Based Computing Laboratory 104 IT4 Innovations’18

Dynamic and Adaptive MPI Point-to-point Communication Protocols

[Diagram: eager threshold for an example communication pattern (processes 0-7 across Node 1 and Node 2) with different designs; Default: 16 KB for every pair; Manually Tuned: 128 KB for every pair; Dynamic + Adaptive: 32 KB, 64 KB, 128 KB, and 32 KB chosen per pair]

  • H. Subramoni, S. Chakraborty, D. K. Panda, Designing Dynamic & Adaptive MPI Point-to-Point Communication Protocols for Efficient Overlap of Computation & Communication, ISC'17 - Best Paper

[Figures: execution time (wall clock, seconds) and relative memory consumption of Amber at 128-1K processes for Default, Threshold=17K, Threshold=64K, Threshold=128K, and Dynamic Threshold]

– Default: poor overlap, low memory requirement; low performance, high productivity
– Manually Tuned: good overlap, high memory requirement; high performance, low productivity
– Dynamic + Adaptive: good overlap, optimal memory requirement; high performance, high productivity

Desired eager threshold per process pair: 0 - 4: 32 KB; 1 - 5: 64 KB; 2 - 6: 128 KB; 3 - 7: 32 KB

slide-105
SLIDE 105

Network Based Computing Laboratory 105 IT4 Innovations’18

Enhanced MPI_Bcast with Optimized CMA-based Design

[Figures: MPI_Bcast latency (us) vs. message size (1 KB to 4 MB) for MVAPICH2-2.3a, Intel MPI 2017, OpenMPI 2.1.0, and the Proposed design on KNL (64 processes), Broadwell (28 processes), and Power8 (160 processes); the proposed design uses CMA for large messages and shared memory (SHMEM) for small messages]

  • Up to 2x - 4x improvement over existing implementations for 1 MB messages
  • Up to 1.5x - 2x faster than Intel MPI and Open MPI for 1 MB messages

  • Improvements obtained for large messages only
  • p-1 copies with CMA, p copies with shared memory (see the CMA sketch after this slide)
  • Fallback to SHMEM for small messages
  • S. Chakraborty, H. Subramoni, and D. K. Panda, Contention Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems, IEEE Cluster '17, BEST Paper Finalist

Support is available in MVAPICH2-X 2.3b
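The CMA path above relies on the Linux cross-memory-attach system calls. The following is a minimal sketch, not MVAPICH2 source, of how one process can copy data directly out of a peer's address space with process_vm_readv, which is what allows the kernel-assisted broadcast to use only p-1 copies; the peer pid and remote buffer address would normally be exchanged over shared memory or another control channel.

/* Minimal sketch of Linux Cross Memory Attach (CMA); illustrative only. */
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>     /* process_vm_readv */

/* Copy 'len' bytes starting at 'remote_addr' in process 'pid' into 'local'. */
ssize_t cma_read(pid_t pid, void *local, void *remote_addr, size_t len)
{
    struct iovec local_iov  = { .iov_base = local,       .iov_len = len };
    struct iovec remote_iov = { .iov_base = remote_addr, .iov_len = len };

    /* Single copy: data moves peer -> this process without a shared-memory
     * bounce buffer, so a broadcast needs only p-1 such copies. */
    return process_vm_readv(pid, &local_iov, 1, &remote_iov, 1, 0);
}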

slide-106
SLIDE 106

Network Based Computing Laboratory 106 IT4 Innovations’18

Designing Energy-Aware (EA) MPI Runtime

[Diagram: overall application energy expenditure split into energy spent in communication routines (point-to-point, collective, and RMA routines) and energy spent in computation routines; MVAPICH2-EA designs cover MPI two-sided and collectives (e.g., MVAPICH2) and can impact other PGAS implementations (e.g., OSHMPI), one-sided runtimes (e.g., ComEx), and MPI-3 RMA implementations (e.g., MVAPICH2)]

slide-107
SLIDE 107

Network Based Computing Laboratory 107 IT4 Innovations’18

  • An energy-efficient runtime that provides energy savings without application knowledge
  • Automatically and transparently uses the best energy lever
  • Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation
  • Pessimistic MPI applies the energy reduction lever to each MPI call
  • Available for download from the MVAPICH project site since Aug '15

MVAPICH2-EA: Application Oblivious Energy-Aware-MPI (EAM)

  • A Case for Application-Oblivious Energy-Efficient MPI Runtime, A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, Supercomputing '15, Nov 2015 [Best Student Paper Finalist]

slide-108
SLIDE 108

Network Based Computing Laboratory 108 IT4 Innovations’18

  • Message Passing Interface (MPI)
  • Partitioned Global Address Space (PGAS) models
  • GPU Computing
  • Xeon Phi Computing

HPC System Challenges and Case Studies

slide-109
SLIDE 109

Network Based Computing Laboratory 109 IT4 Innovations’18

  • Global view improves programmer productivity
  • Idea is to decouple data movement from process synchronization
  • Processes should have asynchronous access to globally distributed data
  • Well suited for irregular applications and kernels that require dynamic access to different data
  • Different approaches
    – Library-based (Global Arrays, OpenSHMEM)
    – Compiler-based (Unified Parallel C (UPC), Co-Array Fortran (CAF))
    – HPCS Language-based (X10, Chapel, Fortress)

Partitioned Global Address Space (PGAS) Models

[Diagram: Shared Memory Model (SHMEM, DSM) with P1, P2, P3 over one shared memory; Distributed Memory Model (MPI) with P1, P2, P3 each owning private memory; Partitioned Global Address Space (PGAS) with P1, P2, P3 whose memories are combined into a logical shared memory]

slide-110
SLIDE 110

Network Based Computing Laboratory 110 IT4 Innovations’18

Hybrid (MPI+PGAS) Programming

  • Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics (a minimal hybrid sketch follows this slide)
  • Benefits:
    – Best of Distributed Computing Model
    – Best of Shared Memory Computing Model

[Diagram: an HPC application composed of Kernels 1..N, where every kernel runs over MPI and selected kernels (e.g., Kernel 2, Kernel N) are re-written using PGAS]
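The sketch below shows, in a hedged form, what a hybrid program can look like: MPI and OpenSHMEM calls in the same source file, which presumes a unified runtime such as MVAPICH2-X. The initialization/finalization ordering rules for combining MPI_Init with shmem_init are runtime-specific, so this is a sketch of the model rather than a portable recipe.

/* Minimal hybrid MPI + OpenSHMEM sketch (assumes a unified runtime such as
 * MVAPICH2-X; init ordering rules are runtime-specific). */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();                          /* OpenSHMEM 1.2+ initialization */

    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric allocation: every PE exposes a slot in the global address space. */
    int *sym = shmem_malloc(sizeof(int));
    *sym = -1;
    shmem_barrier_all();

    /* PGAS-style one-sided put to the next PE (no matching receive needed). */
    int value = me;
    shmem_int_put(sym, &value, 1, (me + 1) % npes);
    shmem_barrier_all();

    /* MPI-style two-sided collective in the same program. */
    int sum = 0;
    MPI_Allreduce(&value, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (me == 0)
        printf("received %d from PE %d, allreduce sum %d\n", *sym, npes - 1, sum);

    shmem_free(sym);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}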

slide-111
SLIDE 111

Network Based Computing Laboratory 111 IT4 Innovations’18

MVAPICH2-X for Hybrid MPI + PGAS Applications

  • Current model: separate runtimes for OpenSHMEM/UPC/UPC++/CAF and MPI
    – Possible deadlock if both runtimes are not progressed
    – Consumes more network resources
  • Unified communication runtime for MPI, UPC, UPC++, OpenSHMEM, CAF
    – Available since 2012 (starting with MVAPICH2-X 1.9)
    – http://mvapich.cse.ohio-state.edu

slide-112
SLIDE 112

Network Based Computing Laboratory 112 IT4 Innovations’18

UPC++ Collectives Performance

[Diagram: an MPI + {UPC++} application running either over the UPC++ runtime with GASNet interfaces and an MPI network conduit, or over the UPC++ runtime with MPI interfaces on the MVAPICH2-X Unified Communication Runtime (UCR)]

  • Full and native support for hybrid MPI + UPC++ applications
  • Better performance compared to IBV and MPI conduits
  • OSU Micro-benchmarks (OMB) support for UPC++
  • Available since MVAPICH2-X 2.2RC1

[Figure: inter-node broadcast time (us) vs. message size on 64 nodes (1 process per node) for GASNet_MPI, GASNET_IBV, and MV2-X; up to 14x improvement]

  • J. M. Hashmi, K. Hamidouche, and D. K. Panda, Enabling Performance Efficient Runtime Support for Hybrid MPI+UPC++ Programming Models, IEEE International Conference on High Performance Computing and Communications (HPCC 2016)

slide-113
SLIDE 113

Network Based Computing Laboratory 113 IT4 Innovations’18

Application Level Performance with Graph500 and Sort

Graph500 Execution Time

  • J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC'13), June 2013

[Figure: Graph500 execution time (s) at 4K-16K processes for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM); 7.6X and 13X improvements annotated]

  • Performance of Hybrid (MPI+ OpenSHMEM) Graph500 Design
  • 8,192 processes
  • 2.4X improvement over MPI-CSR
  • 7.6X improvement over MPI-Simple
  • 16,384 processes
  • 1.5X improvement over MPI-CSR
  • 13X improvement over MPI-Simple

[Figure: Sort execution time (seconds) for MPI vs. Hybrid designs on 500GB-512, 1TB-1K, 2TB-2K, and 4TB-4K input/process configurations; 51% improvement]

  • Performance of Hybrid (MPI+OpenSHMEM) Sort

Application

  • 4,096 processes, 4 TB Input Size
  • MPI – 2408 sec; 0.16 TB/min
  • Hybrid – 1172 sec; 0.36 TB/min
  • 51% improvement over MPI-design
  • J. Jose, S. Potluri, H. Subramoni, X. Lu, K. Hamidouche, K. Schulz, H. Sundar and D. Panda, Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models, PGAS'14

slide-114
SLIDE 114

Network Based Computing Laboratory 114 IT4 Innovations’18

Performance of PGAS Models on KNL using MVAPICH2-X

[Figures: intra-node PUT and GET latency (us) vs. message size (1 B to 1 M) on KNL for shmem_put/shmem_get, upc_putmem/upc_getmem, and upcxx_async_put/upcxx_async_get]

  • Intra-node performance of one-sided Put/Get operations of PGAS libraries/languages using the MVAPICH2-X communication conduit (a minimal put/get sketch follows this slide)
  • Near-native communication performance is observed on KNL
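For context, below is a minimal sketch of the one-sided put/get pattern named on the x-axis above, written with the OpenSHMEM shmem_putmem/shmem_getmem calls; the timing loop of the actual OSU benchmarks is omitted, and the message size is only an example.

/* Minimal sketch of the one-sided put/get pattern measured above (OpenSHMEM). */
#include <shmem.h>
#include <string.h>

#define MSG_SIZE 1024            /* one of the message sizes on the x-axis */

int main(void)
{
    shmem_init();
    int me = shmem_my_pe();

    /* Symmetric buffers are remotely accessible without a matching receive. */
    char *remote_buf = shmem_malloc(MSG_SIZE);
    char local_buf[MSG_SIZE];
    memset(local_buf, me, MSG_SIZE);
    shmem_barrier_all();

    if (me == 0) {
        shmem_putmem(remote_buf, local_buf, MSG_SIZE, 1);   /* PUT to PE 1 */
        shmem_quiet();                                       /* complete the put */
        shmem_getmem(local_buf, remote_buf, MSG_SIZE, 1);    /* GET from PE 1 */
    }

    shmem_barrier_all();
    shmem_free(remote_buf);
    shmem_finalize();
    return 0;
}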
slide-115
SLIDE 115

Network Based Computing Laboratory 115 IT4 Innovations’18

Optimized OpenSHMEM with AVX and MCDRAM: Application Kernels Evaluation

Heat Image Kernel

  • On heat-diffusion-based kernels, AVX-512 vectorization showed better performance
  • MCDRAM showed significant benefits on the Heat-Image kernel for all process counts; combined with AVX-512 vectorization, it showed up to 4X improved performance

[Figures: execution time (s) at 16-128 processes for KNL (Default), KNL (AVX-512), KNL (AVX-512+MCDRAM), and Broadwell; Heat-Image kernel and Heat-2D kernel using the Jacobi method]

slide-116
SLIDE 116

Network Based Computing Laboratory 116 IT4 Innovations’18

  • Message Passing Interface (MPI)
  • Partitioned Global Address Space (PGAS) models
  • GPU Computing
  • Xeon Phi Computing

HPC System Challenges and Case Studies

slide-117
SLIDE 117

Network Based Computing Laboratory 117 IT4 Innovations’18

At Sender:   MPI_Send(s_devbuf, size, …);
At Receiver: MPI_Recv(r_devbuf, size, …);
(GPU/network data movement is handled inside MVAPICH2)

  • Standard MPI interfaces used for unified data movement
  • Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
  • Overlaps data movement from GPU with RDMA transfers
  • High Performance and High Productivity (a fuller sketch follows this slide)

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
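The sketch below expands the two calls above into a complete program: device pointers obtained from cudaMalloc are passed directly to MPI_Send/MPI_Recv. This assumes a CUDA-aware MPI build such as MVAPICH2-GDR with GPU support enabled (e.g., the MV2_USE_CUDA=1 parameter listed on a later slide); how the data actually moves (staging vs. GPUDirect RDMA) is decided by the library.

/* Minimal sketch of GPU-aware MPI: device pointers passed directly to MPI. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t size = 1 << 20;           /* 1 MB message */
    char *devbuf;
    cudaMalloc((void **)&devbuf, size);    /* GPU-resident buffer */
    cudaMemset(devbuf, rank, size);

    if (rank == 0) {
        /* Send directly from device memory; the library stages through host
         * memory or uses GPUDirect RDMA internally. */
        MPI_Send(devbuf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(devbuf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received a GPU-resident message\n");
    }

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}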

slide-118
SLIDE 118

Network Based Computing Laboratory 118 IT4 Innovations’18

[Figures: GPU-GPU inter-node bandwidth, bi-directional bandwidth, and latency vs. message size (1 B to 8 KB) for MV2-(NO-GDR) vs. MV2-GDR-2.3a; annotated improvements of 10x, 9x, and 11X, with 1.88 us small-message latency]

Platform: MVAPICH2-GDR-2.3a, Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct-RDMA

Optimized MVAPICH2-GDR Design

slide-119
SLIDE 119

Network Based Computing Laboratory 119 IT4 Innovations’18

  • Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
  • HoomdBlue Version 1.0.5
  • GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

Application-Level Evaluation (HOOMD-blue)

[Figures: average time steps per second (TPS) for MV2 vs. MV2+GDR at 4-32 processes, with 64K and 256K particles; 2X improvement in both cases]

slide-120
SLIDE 120

Network Based Computing Laboratory 120 IT4 Innovations’18

Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland

[Figures: normalized execution time vs. number of GPUs for Default, Callback-based, and Event-based designs; CSCS GPU cluster (16-96 GPUs) and Wilkes GPU cluster (4-32 GPUs)]

  • 2X improvement on 32 GPUs nodes
  • 30% improvement on 96 GPU nodes (8 GPUs/node)
  • C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS'16

On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and Cosmo Application

Cosmo model: http://www2.cosmo-model.org/content /tasks/operational/meteoSwiss/

slide-121
SLIDE 121

Network Based Computing Laboratory 121 IT4 Innovations’18

Enhanced Support for GPU Managed Memory

  • CUDA Managed memory => no explicit memory pin-down
  • No IPC support for intra-node communication
  • No GDR support for inter-node communication
  • Significant productivity benefits due to abstraction of explicit allocation and cudaMemcpy()
  • Initial and basic support in MVAPICH2-GDR: for both intra- and inter-node transfers, "pipeline through" host memory
  • Enhanced intra-node managed-memory design using IPC
    – Double-buffering, pair-wise IPC-based scheme
    – Brings IPC performance to managed memory
    – High performance and high productivity
    – 2.5X improvement in bandwidth
  • OMB extended to evaluate the performance of point-to-point and collective communications using managed buffers (a minimal managed-memory sketch follows this slide)

[Figures: bandwidth (MB/s) for 32 KB - 2 MB messages, Enhanced design vs. MV2-GDR 2.2b (2.5X improvement); 2D stencil halo-exchange time (ms) for halo width 1 with Device vs. Managed buffers]

  • D. S. Banerjee, K. Hamidouche, and D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, GPGPU-9 Workshop, held in conjunction with PPoPP '16
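The sketch below shows the usage model the managed-memory design supports: a buffer from cudaMallocManaged is handed directly to MPI with no explicit cudaMemcpy() in user code. It assumes a CUDA-aware MPI library with managed-memory support (e.g., a recent MVAPICH2-GDR); the exact transfer path taken is a library implementation detail.

/* Minimal sketch: MPI communication from CUDA managed (unified) memory. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t size = 1 << 20;
    char *managed;
    /* Accessible from both host and device; no explicit pin-down or staging. */
    cudaMallocManaged((void **)&managed, size, cudaMemAttachGlobal);

    if (rank == 0) {
        memset(managed, 7, size);          /* produce data (host side here) */
        MPI_Send(managed, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(managed, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        /* Data can now be consumed by either a GPU kernel or host code. */
    }

    cudaFree(managed);
    MPI_Finalize();
    return 0;
}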

slide-122
SLIDE 122

Network Based Computing Laboratory 122 IT4 Innovations’18

  • Streaming applications on GPU clusters
    – Use a pipeline of broadcast operations to move host-resident data from a single source (typically live) to multiple GPU-based computing sites
    – Existing schemes require explicit data movements between host and GPU memories, giving poor performance and breaking the pipeline
  • IB hardware multicast + scatter-list
    – Efficient heterogeneous-buffer broadcast operation
  • CUDA Inter-Process Communication (IPC)
    – Efficient intra-node topology-aware broadcast operations for multi-GPU systems
  • Available in MVAPICH2-GDR 2.3a!

High-Performance Heterogeneous Broadcast for Streaming Applications

[Diagrams: multicast-based broadcast from the source node's host memory through the IB switch to GPU nodes using a scatter-list (SL) step, followed by intra-node IPC-based cudaMemcpy (device-to-device) distribution to multiple GPUs per node]

Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters. C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, SBAC-PAD'16, Oct 2016

slide-123
SLIDE 123

Network Based Computing Laboratory 123 IT4 Innovations’18

Control Flow Decoupling through GPUDirect Async

  • Latency oriented: able to hide the kernel launch overhead
    – 25% improvement at 256 bytes
  • Throughput oriented: asynchronously offload and queue the communication and computation tasks
    – 14% improvement at 1 KB message size
  • Platform: Intel Sandy Bridge, NVIDIA K20 and Mellanox FDR HCA
  • Will be available in a public release soon

[Diagram: GPU, CPU, and HCA interaction with the kernel launch overhead hidden]

  • CPU offloads the compute, communication, and synchronization tasks to the GPU
  • All operations are asynchronous from the CPU
  • Hides the overhead of kernel launch
  • Needs stream-based extensions to MPI semantics

[Figures: latency-oriented results (Kernel+Send and Recv+Kernel latency, 1 B to 4 KB) and overlap (%) with host computation/communication, Default MPI vs. Enhanced MPI+GDS]

slide-124
SLIDE 124

Network Based Computing Laboratory 124 IT4 Innovations’18

  • Message Passing Interface (MPI)
  • Partitioned Global Address Space (PGAS) models
  • GPU Computing
  • Xeon Phi Computing

HPC System Challenges and Case Studies

slide-125
SLIDE 125

Network Based Computing Laboratory 125 IT4 Innovations’18

  • On-load approach
    – Takes advantage of the idle cores
    – Dynamically configurable
    – Takes advantage of highly multithreaded cores
    – Takes advantage of MCDRAM of KNL processors
  • Applicable to other programming models such as PGAS, task-based, etc.
  • Provides portability, performance, and applicability to runtimes as well as applications in a transparent manner

Enhanced Designs for KNL: MVAPICH2 Approach

slide-126
SLIDE 126

Network Based Computing Laboratory 126 IT4 Innovations’18

Performance Benefits of the Enhanced Designs

  • New designs to exploit high concurrency and MCDRAM of KNL
  • Significant improvements for large message sizes
  • Benefits seen in varying message size as well as varying MPI processes

[Figures: very large message bi-directional bandwidth (2M-64M), 16-process intra-node all-to-all latency, and intra-node broadcast with a 64 MB message, MVAPICH2 vs. MVAPICH2-Optimized; improvements of 27%, 17.2%, and 52% annotated]

slide-127
SLIDE 127

Network Based Computing Laboratory 127 IT4 Innovations’18

Performance Benefits of the Enhanced Designs

[Figures: multi-pair bandwidth using 32 MPI processes (1M-64M messages) for MV2_Def_DRAM, MV2_Def_MCDRAM, MV2_Opt_DRAM, and MV2_Opt_MCDRAM (30% improvement); CNTK MLP training time using MNIST (batch size 64) for different MPI process : OMP thread combinations (15% improvement)]

  • Benefits observed on training time of the Multi-level Perceptron (MLP) model on the MNIST dataset using the CNTK Deep Learning Framework

Enhanced designs will be available in upcoming MVAPICH2 releases

slide-128
SLIDE 128

Network Based Computing Laboratory 128 IT4 Innovations’18

  • Advanced Features for InfiniBand
  • Advanced Features for High Speed Ethernet
  • RDMA over Converged Ethernet
  • Open Fabrics Software Stack and RDMA Programming
  • Libfabrics Software Stack and Programming
  • Network Management Infrastructure and Tool
  • Common Challenges in Building HEC Systems with IB and HSE

– Network Adapters and NUMA Interactions – Network Switches, Topology and Routing – Network Bridges

  • System Specific Challenges and Case Studies

– HPC (MPI, PGAS and GPU/Xeon Phi Computing) – Big Data – Cloud Computing

  • Conclusions and Final Q&A

Presentation Overview

slide-129
SLIDE 129

Network Based Computing Laboratory 129 IT4 Innovations’18

System Specific Challenges for Big Data Processing

Common Challenges
– Adapters and interactions: I/O bus, multi-port adapters, NUMA
– Switches: topologies, switching / routing
– Bridges: IB interoperability

Big Data
– Taking advantage of RDMA
– Performance, scalability
– Backward compatibility

HPC
– MPI: multi-rail, collectives, scalability, application scalability, energy awareness
– PGAS: programmability with performance, optimized resource utilization
– GPU / MIC: programmability with performance, hiding data movement costs, heterogeneity-aware design

slide-130
SLIDE 130

Network Based Computing Laboratory 130 IT4 Innovations’18

How can HPC clusters with high-performance interconnect and storage architectures benefit Big Data applications? Bring HPC and Big Data processing into a "convergent trajectory"!

  • What are the major bottlenecks in current Big Data processing middleware (e.g., Hadoop, Spark, and Memcached)?
  • Can the bottlenecks be alleviated with new designs by taking advantage of HPC technologies?
  • Can RDMA-enabled high-performance interconnects benefit Big Data processing?
  • Can HPC clusters with high-performance storage systems (e.g., SSD, parallel file systems) benefit Big Data applications?
  • How much performance benefit can be achieved through enhanced designs?
  • How do we design benchmarks for evaluating the performance of Big Data middleware on HPC clusters?

slide-131
SLIDE 131

Network Based Computing Laboratory 131 IT4 Innovations’18

Can We Run Big Data Jobs on Existing HPC Infrastructure?

slide-132
SLIDE 132

Network Based Computing Laboratory 132 IT4 Innovations’18

Can We Run Big Data Jobs on Existing HPC Infrastructure?

slide-133
SLIDE 133

Network Based Computing Laboratory 133 IT4 Innovations’18

Can We Run Big Data Jobs on Existing HPC Infrastructure?

slide-134
SLIDE 134

Network Based Computing Laboratory 134 IT4 Innovations’18

Designing Communication and I/O Libraries for Big Data Systems: Challenges

[Diagram: layered architecture; Applications over Big Data middleware (HDFS, MapReduce, HBase, Spark and Memcached), over programming models (sockets; RDMA?), over a communication and I/O library providing point-to-point communication, threaded models and synchronization, QoS & fault tolerance, performance tuning, I/O and file systems, virtualization (SR-IOV), and benchmarks (upper-level changes?), running on networking technologies (InfiniBand, 1/10/40/100 GigE and intelligent NICs), commodity computing system architectures (multi- and many-core architectures and accelerators), and storage technologies (HDD, SSD, NVM, and NVMe-SSD)]

slide-135
SLIDE 135

Network Based Computing Laboratory 135 IT4 Innovations’18

  • RDMA for Apache Spark
  • RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)

– Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions

  • RDMA for Apache HBase
  • RDMA for Memcached (RDMA-Memcached)
  • RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
  • OSU HiBD-Benchmarks (OHB)

– HDFS, Memcached, HBase, and Spark Micro-benchmarks

  • http://hibd.cse.ohio-state.edu
  • Users Base: 275 organizations from 34 countries
  • More than 24,700 downloads from the project site

The High-Performance Big Data (HiBD) Project

  • Available for InfiniBand and RoCE; also runs on Ethernet
  • Support for OpenPOWER is available

slide-136
SLIDE 136

Network Based Computing Laboratory 136 IT4 Innovations’18

  • HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation for better fault-tolerance as well as performance. This mode is enabled by default in the package.
  • HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in-memory and obtain as much performance benefit as possible.
  • HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
  • HHH-L-BB: This mode deploys a Memcached-based burst-buffer system to reduce the bandwidth bottleneck of shared file system access. The burst buffer design is hosted by Memcached servers, each of which has a local SSD.
  • MapReduce over Lustre, with/without local disks: Besides the HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Two different modes are introduced: with local disks and without local disks.
  • Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).

Different Modes of RDMA for Apache Hadoop 2.x

slide-137
SLIDE 137

Network Based Computing Laboratory 137 IT4 Innovations’18

[Figures: execution time (s) vs. data size (GB) for IPoIB (EDR) vs. OSU-IB (EDR); RandomWriter (80-160 GB) and TeraGen (80-240 GB)]

Performance Numbers of RDMA for Apache Hadoop 2.x – RandomWriter & TeraGen in OSU-RI2 (EDR)

Cluster with 8 Nodes with a total of 64 maps

  • RandomWriter

– 3x improvement over IPoIB for 80-160 GB file size

  • TeraGen

– 4x improvement over IPoIB for 80-240 GB file size


slide-138
SLIDE 138

Network Based Computing Laboratory 138 IT4 Innovations’18

[Figure: Sort execution time (s) vs. data size (80-160 GB) for IPoIB (EDR) vs. OSU-IB (EDR)]

Performance Numbers of RDMA for Apache Hadoop 2.x – Sort & TeraSort in OSU-RI2 (EDR)

Cluster with 8 Nodes with a total of 64 maps and 32 reduces

  • Sort

– 61% improvement over IPoIB for 80-160 GB data

  • TeraSort

– 18% improvement over IPoIB for 80-240 GB data

[Figure: TeraSort execution time (s) vs. data size (80-240 GB) for IPoIB (EDR) vs. OSU-IB (EDR); cluster with 8 nodes, 64 maps and 14 reduces; Sort reduced by 61%, TeraSort by 18%]

slide-139
SLIDE 139

Network Based Computing Laboratory 139 IT4 Innovations’18

  • Design features
    – RDMA-based shuffle plugin
    – SEDA-based architecture
    – Dynamic connection management and sharing
    – Non-blocking data transfer
    – Off-JVM-heap buffer management
    – InfiniBand/RoCE support

Design Overview of Spark with RDMA

  • Enables high performance RDMA communication, while supporting traditional socket interface
  • JNI Layer bridges Scala based Spark with communication library written in native code
  • X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014

  • X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData ‘16, Dec. 2016.

[Diagram: Apache Spark benchmarks/applications/libraries/frameworks over Spark Core, with the Shuffle Manager (Sort, Hash, Tungsten-Sort) and Block Transfer Service (Netty, NIO, RDMA-Plugin); the Netty and NIO servers/clients use the Java Socket Interface over 1/10/40/100 GigE or IPoIB networks, while the RDMA server/client use the Java Native Interface (JNI) to a native RDMA-based communication engine over RDMA-capable networks (IB, iWARP, RoCE, ...)]

slide-140
SLIDE 140

Network Based Computing Laboratory 140 IT4 Innovations’18

  • InfiniBand FDR, SSD, 64 Worker Nodes, 1536 Cores, (1536M 1536R)
  • RDMA vs. IPoIB with 1536 concurrent tasks, single SSD per node.

– SortBy: Total time reduced by up to 80% over IPoIB (56Gbps) – GroupBy: Total time reduced by up to 74% over IPoIB (56Gbps)

Performance Evaluation on SDSC Comet – SortBy/GroupBy

[Figures: total time (sec) vs. data size (64-256 GB) for IPoIB vs. RDMA; SortByTest and GroupByTest on 64 worker nodes with 1536 cores; up to 80% and 74% reductions]

slide-141
SLIDE 141

Network Based Computing Laboratory 141 IT4 Innovations’18

Application Evaluation on SDSC Comet

  • Kira Toolkit: distributed astronomy image processing toolkit implemented using Apache Spark
    – https://github.com/BIDS/Kira
  • Source extractor application, using a 65 GB dataset from the SDSS DR2 survey that comprises 11,150 image files

[Figure: execution times (sec) for the Kira SE benchmark using the 65 GB dataset on 48 cores; RDMA Spark vs. Apache Spark (IPoIB); 21% improvement]

  • M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE'16, July 2016

[Figure: one epoch time (sec) vs. number of cores (24-384) for IPoIB vs. RDMA; 4.58x improvement]

  • BigDL: distributed Deep Learning tool using Apache Spark
    – https://github.com/intel-analytics/BigDL
  • VGG training model on the CIFAR-10 dataset

slide-142
SLIDE 142

Network Based Computing Laboratory 142 IT4 Innovations’18

Using HiBD Packages on Existing HPC Infrastructure

slide-143
SLIDE 143

Network Based Computing Laboratory 143 IT4 Innovations’18

Using HiBD Packages on Existing HPC Infrastructure

slide-144
SLIDE 144

Network Based Computing Laboratory 144 IT4 Innovations’18

  • RDMA for Apache Hadoop 2.x and RDMA for Apache Spark are installed and available on SDSC Comet
    – Examples for various modes of usage are available in:
      • RDMA for Apache Hadoop 2.x: /share/apps/examples/HADOOP
      • RDMA for Apache Spark: /share/apps/examples/SPARK/
    – Please email help@xsede.org (reference Comet as the machine, and SDSC as the site) if you have any further questions about usage and configuration
  • RDMA for Apache Hadoop is also available on Chameleon Cloud as an appliance
    – https://www.chameleoncloud.org/appliances/17/

HiBD Packages on SDSC Comet and Chameleon Cloud

  • M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE'16, July 2016

slide-145
SLIDE 145

Network Based Computing Laboratory 145 IT4 Innovations’18

[Figures: Memcached GET latency (us) vs. message size (1 B to 4 KB) and Memcached throughput (thousands of transactions per second) vs. number of clients (16-4080), OSU-IB (FDR) vs. IPoIB (FDR)]

  • Memcached GET latency, reduced by nearly 20X (a minimal client sketch follows this slide)
    – 4 bytes: OSU-IB 2.84 us, IPoIB 75.53 us
    – 2K bytes: OSU-IB 4.49 us, IPoIB 123.42 us
  • Memcached throughput (4 bytes)
    – 4080 clients: OSU-IB 556 Kops/sec, IPoIB 233 Kops/sec
    – Nearly 2X improvement in throughput

Memcached Performance (FDR Interconnect)

Experiments on TACC Stampede (Intel SandyBridge Cluster, IB: FDR)

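For orientation, the sketch below shows the SET/GET pattern that the latency and throughput plots above measure, written against the libmemcached C client API. The server address, key, and value are illustrative; the RDMA-Memcached package keeps this client-side API while changing the transport underneath, so the same client code applies to both the IPoIB and OSU-IB configurations.

/* Minimal sketch of a Memcached SET/GET using the libmemcached C client. */
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    memcached_st *memc = memcached_create(NULL);
    memcached_return_t rc = memcached_server_add(memc, "localhost", 11211);
    if (rc != MEMCACHED_SUCCESS) {
        fprintf(stderr, "server add failed: %s\n", memcached_strerror(memc, rc));
        return 1;
    }

    const char *key = "bench-key";
    const char *value = "1234";                  /* 4-byte value, as in the plot */

    rc = memcached_set(memc, key, strlen(key), value, strlen(value), 0, 0);

    size_t len = 0;
    uint32_t flags = 0;
    char *fetched = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (fetched != NULL) {
        printf("GET returned %zu bytes: %.*s\n", len, (int)len, fetched);
        free(fetched);                           /* caller frees the result */
    }

    memcached_free(memc);
    return 0;
}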

slide-146
SLIDE 146

Network Based Computing Laboratory 146 IT4 Innovations’18

  • Advanced Features for InfiniBand
  • Advanced Features for High Speed Ethernet
  • RDMA over Converged Ethernet
  • Open Fabrics Software Stack and RDMA Programming
  • Libfabrics Software Stack and Programming
  • Network Management Infrastructure and Tool
  • Common Challenges in Building HEC Systems with IB and HSE

– Network Adapters and NUMA Interactions – Network Switches, Topology and Routing – Network Bridges

  • System Specific Challenges and Case Studies

– HPC (MPI, PGAS and GPU/Xeon Phi Computing) – Big Data – Cloud Computing

  • Conclusions and Final Q&A

Presentation Overview

slide-147
SLIDE 147

Network Based Computing Laboratory 147 IT4 Innovations’18

System Specific Challenges for Cloud Computing

Common Challenges
– Adapters and interactions: I/O bus, multi-port adapters, NUMA
– Switches: topologies, switching / routing
– Bridges: IB interoperability

Cloud Computing
– SR-IOV support
– Virtualization
– Containers

HPC
– MPI: multi-rail, collectives, scalability, application scalability, energy awareness
– PGAS: programmability with performance, optimized resource utilization
– GPU / Xeon Phi: programmability with performance, hiding data movement costs, heterogeneity-aware design

Big Data
– Taking advantage of RDMA
– Performance, scalability
– Backward compatibility

slide-148
SLIDE 148

Network Based Computing Laboratory 148 IT4 Innovations’18

  • Cloud Computing is widely adopted in industry computing environments
  • Cloud Computing provides high resource utilization and flexibility
  • Virtualization is the key technology to enable Cloud Computing
  • Intersect360 study shows cloud is the fastest growing class of HPC
  • HPC Meets Cloud: The convergence of Cloud Computing and HPC

HPC Meets Cloud Computing

slide-149
SLIDE 149

Network Based Computing Laboratory 149 IT4 Innovations’18

  • Virtualization has many benefits
    – Fault-tolerance
    – Job migration
    – Compaction
  • Has not been very popular in HPC due to the overhead associated with virtualization
  • New SR-IOV (Single Root - I/O Virtualization) support available with Mellanox InfiniBand adapters changes the field
  • Enhanced MVAPICH2 support for SR-IOV
  • MVAPICH2-Virt 2.2 supports:
    – OpenStack, Docker, and Singularity

Can HPC and Virtualization be Combined?

  • J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? EuroPar'14
  • J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC'14
  • J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: an Efficient Approach to build HPC Clouds, CCGrid'15

slide-150
SLIDE 150

Network Based Computing Laboratory 150 IT4 Innovations’18

[Figures: SPEC MPI2007 execution time (s) for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu, and Graph500 execution time (ms) for problem sizes (22,20) to (26,16), comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native]

  • 32 VMs, 6 cores/VM
  • Compared to Native, 2-5% overhead for Graph500 with 128 processes
  • Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 processes

Application-Level Performance on Chameleon (SPEC MPI2007 and Graph500)

slide-151
SLIDE 151

Network Based Computing Laboratory 151 IT4 Innovations’18

[Figures: NAS execution time (s) for MG.D, FT.D, EP.D, LU.D, and CG.D, and Graph 500 BFS execution time (ms) at scale 20, edgefactor 16, for Container-Def, Container-Opt, and Native]

  • 64 containers across 16 nodes, pinning 4 cores per container
  • Compared to Container-Def, up to 11% and 73% reduction in execution time for NAS and Graph 500
  • Compared to Native, less than 9% and 5% overhead for NAS and Graph 500

Application-Level Performance on Docker with MVAPICH2

slide-152
SLIDE 152

Network Based Computing Laboratory 152 IT4 Innovations’18

[Figures: Graph500 BFS execution time (ms) for problem sizes (22,16) to (26,20) and NPB Class D execution time (s) for CG, EP, FT, IS, LU, and MG; Singularity vs. Native]

  • 512 processes across 32 nodes
  • Less than 7% and 6% overhead for NPB and Graph500, respectively

Application-Level Performance on Singularity with MVAPICH2

  • J. Zhang, X. Lu and D. K. Panda, Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds?, UCC '17, Best Student Paper Award

slide-153
SLIDE 153

Network Based Computing Laboratory 153 IT4 Innovations’18

  • Challenges
    – Existing designs in Hadoop are not virtualization-aware
    – No support for automatic topology detection
  • Design
    – Automatic topology detection using a MapReduce-based utility
      • Requires no user input
      • Can detect topology changes during runtime without affecting running jobs
    – Virtualization- and topology-aware communication through map task scheduling and YARN container allocation policy extensions

Virtualization-aware and Automatic Topology Detection Schemes in Hadoop on InfiniBand

  • S. Gugnani, X. Lu, and D. K. Panda, Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds, CloudCom'16, December 2016

[Figures: execution time of Hadoop benchmarks (Sort, WordCount, PageRank at 40 GB and 60 GB) and Hadoop applications (CloudBurst, Self-join in default and distributed modes) for RDMA-Hadoop vs. Hadoop-Virt; reduced by up to 55% and 34%]

slide-154
SLIDE 154

Network Based Computing Laboratory 154 IT4 Innovations’18

  • Presented advanced features of InfiniBand, HSE, Omni-Path, and RoCE
  • Provided an overview of OpenFabrics verbs-level and Libfabrics-level programming and InfiniBand network management
  • Discussed a common set of challenges in designing HEC systems
  • Presented challenges and solutions in designing various High-End Computing systems with IB, Omni-Path, and HSE
  • IB, Omni-Path, and HSE are emerging as new architectures leading to a new generation of networked computing systems, opening many research issues needing novel solutions

Concluding Remarks

slide-155
SLIDE 155

Network Based Computing Laboratory 155 IT4 Innovations’18

Funding Acknowledgments

Funding Support by Equipment Support by

slide-156
SLIDE 156

Network Based Computing Laboratory 156 IT4 Innovations’18

Personnel Acknowledgments

Current Students (Graduate)

  • A. Awan (Ph.D.)

  • R. Biswas (M.S.)

  • M. Bayatpour (Ph.D.)

  • S. Chakraborthy (Ph.D.)

– C.-H. Chu (Ph.D.) –

  • S. Guganani (Ph.D.)

Past Students

  • A. Augustine (M.S.)

  • P. Balaji (Ph.D.)

  • S. Bhagvat (M.S.)

  • A. Bhat (M.S.)

  • D. Buntinas (Ph.D.)

  • L. Chai (Ph.D.)

  • B. Chandrasekharan (M.S.)

  • N. Dandapanthula (M.S.)

  • V. Dhanraj (M.S.)

  • T. Gangadharappa (M.S.)

  • K. Gopalakrishnan (M.S.)

  • R. Rajachandrasekar (Ph.D.)

  • G. Santhanaraman (Ph.D.)

  • A. Singh (Ph.D.)

  • J. Sridhar (M.S.)

  • S. Sur (Ph.D.)

  • H. Subramoni (Ph.D.)

  • K. Vaidyanathan (Ph.D.)

  • A. Vishnu (Ph.D.)

  • J. Wu (Ph.D.)

  • W. Yu (Ph.D.)

Past Research Scientist

  • K. Hamidouche

  • S. Sur

Past Post-Docs

  • D. Banerjee

  • X. Besseron

– H.-W. Jin –

  • W. Huang (Ph.D.)

  • W. Jiang (M.S.)

  • J. Jose (Ph.D.)

  • S. Kini (M.S.)

  • M. Koop (Ph.D.)

  • K. Kulkarni (M.S.)

  • R. Kumar (M.S.)

  • S. Krishnamoorthy (M.S.)

  • K. Kandalla (Ph.D.)

  • M. Li (Ph.D.)

  • P. Lai (M.S.)

  • J. Liu (Ph.D.)

  • M. Luo (Ph.D.)

  • A. Mamidala (Ph.D.)

  • G. Marsh (M.S.)

  • V. Meshram (M.S.)

  • A. Moody (M.S.)

  • S. Naravula (Ph.D.)

  • R. Noronha (Ph.D.)

  • X. Ouyang (Ph.D.)

  • S. Pai (M.S.)

  • S. Potluri (Ph.D.)

  • J. Hashmi (Ph.D.)

  • H. Javed (Ph.D.)

  • P. Kousha (Ph.D.)

  • D. Shankar (Ph.D.)

  • H. Shi (Ph.D.)

  • J. Zhang (Ph.D.)

  • J. Lin

  • M. Luo

  • E. Mancini

Current Research Scientists

  • X. Lu

  • H. Subramoni

Past Programmers

  • D. Bureddy

  • J. Perkins

Current Research Specialist

  • J. Smith

  • M. Arnold

  • S. Marcarelli

  • J. Vienne

  • H. Wang

Current Post-doc

  • A. Ruhela

Current Students (Undergraduate)

  • N. Sarkauskas (B.S.)
slide-157
SLIDE 157

Network Based Computing Laboratory 157 IT4 Innovations’18

Thank You!

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

panda@cse.ohio-state.edu

The High-Performance MPI/PGAS Project http://mvapich.cse.ohio-state.edu/ The High-Performance Deep Learning Project http://hidl.cse.ohio-state.edu/