

slide-1
SLIDE 1

InfiniBand, Omni-Path, and High-speed Ethernet for Dummies

Tutorial at IT4 Innovations ’18 by

Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Hari Subramoni The Ohio State University E-mail: subramon@cse.ohio-state.edu http://www.cse.ohio-state.edu/~subramon

Latest version of the slides can be obtained from http://www.cse.ohio-state.edu/~panda/it4i-ib-hse.pdf

slide-2
SLIDE 2

IT4 Innovations’18 2 Network Based Computing Laboratory

  • Introduction
  • Why InfiniBand and High-speed Ethernet?
  • Overview of IB, HSE, their Convergence and Features
  • Overview of Omni-Path Architecture
  • IB, Omni-Path, and HSE HW/SW Products and Installations
  • Sample Case Studies and Performance Numbers
  • Conclusions and Final Q&A

Presentation Overview

slide-3
SLIDE 3

IT4 Innovations’18 3 Network Based Computing Laboratory

  • Growth of High Performance Computing

– Growth in processor performance

  • Chip density doubles every 18 months

– Growth in commodity networking

  • Increase in speed/features + reducing cost
  • Clusters: popular choice for HPC

– Scalability, Modularity and Upgradeability

Current and Next Generation Applications and Computing Systems

slide-4
SLIDE 4

IT4 Innovations’18 4 Network Based Computing Laboratory

Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

[Chart: number of clusters and percentage of clusters in the Top500 list over time; the latest data point labels clusters at 87%]

slide-5
SLIDE 5

IT4 Innovations’18 5 Network Based Computing Laboratory

Integrated High-End Computing Environments

[Diagram: an integrated environment consisting of a compute cluster (compute nodes on a LAN with a frontend, a meta-data manager, and I/O server nodes forming a storage cluster), connected over a LAN/WAN to an enterprise multi-tier datacenter for visualization and mining (Tier 1 routers/servers, Tier 2 application servers, Tier 3 database servers, linked by switches)]

slide-6
SLIDE 6

IT4 Innovations’18 6 Network Based Computing Laboratory

Cloud Computing Environments

[Diagram: physical machines hosting virtual machines, connected over a LAN/WAN to a virtual network file system backed by a physical meta-data manager and physical I/O server nodes]

slide-7
SLIDE 7

IT4 Innovations’18 7 Network Based Computing Laboratory

Big Data Analytics with Hadoop

  • Underlying Hadoop Distributed File System (HDFS)
  • Fault tolerance by replicating data blocks
  • NameNode: stores information on data blocks
  • DataNodes: store blocks and host MapReduce computation
  • JobTracker: tracks jobs and detects failures
  • MapReduce (Distributed Computation)
  • HBase (Database component)
  • Model scales, but there is a high amount of communication during intermediate phases

slide-8
SLIDE 8

IT4 Innovations’18 8 Network Based Computing Laboratory

  • Good System Area Networks with excellent performance (low latency, high bandwidth, and low CPU utilization) for inter-processor communication (IPC) and I/O
  • Good Storage Area Networks with high-performance I/O
  • Good WAN connectivity in addition to intra-cluster SAN/LAN connectivity
  • Quality of Service (QoS) for interactive applications
  • RAS (Reliability, Availability, and Serviceability)
  • All at low cost

Networking and I/O Requirements

slide-9
SLIDE 9

IT4 Innovations’18 9 Network Based Computing Laboratory

  • Hardware components

– Processing cores and memory subsystem – I/O bus or links – Network adapters/switches

  • Software components

– Communication stack

  • Bottlenecks can artificially limit the network performance the user perceives

Major Components in Computing Systems

[Diagram: two processors (P0, P1), each with four cores and local memory, connected through an I/O bus to a network adapter and on to a network switch; labels mark processing bottlenecks, I/O interface bottlenecks, and network bottlenecks]

slide-10
SLIDE 10

IT4 Innovations’18 10 Network Based Computing Laboratory

  • Ex: TCP/IP, UDP/IP
  • Generic architecture for all networks
  • Host processor handles almost all aspects of communication

– Data buffering (copies on sender and receiver) – Data integrity (checksum) – Routing aspects (IP routing)

  • Signaling between different layers

– Hardware interrupt on packet arrival or transmission – Software signals between different layers to handle protocol processing in different priority levels

Processing Bottlenecks in Traditional Protocols

[Diagram: the same host architecture (processors, memory, I/O bus, network adapter, network switch), with the processing bottleneck highlighted at the host processors]

slide-11
SLIDE 11

IT4 Innovations’18 11 Network Based Computing Laboratory

  • Traditionally relied on bus-based technologies (last-mile bottleneck)

– E.g., PCI, PCI-X – One bit per wire – Performance increase through:

  • Increasing clock speed
  • Increasing bus width

– Not scalable:

  • Cross talk between bits
  • Skew between wires
  • Signal integrity makes it difficult to increase bus width significantly, especially for high clock speeds

Bottlenecks in Traditional I/O Interfaces and Networks

PCI (1990): 33 MHz / 32-bit – 1.05 Gbps (shared bidirectional)
PCI-X v1.0 (1998): 133 MHz / 64-bit – 8.5 Gbps (shared bidirectional)
PCI-X v2.0 (2003): 266-533 MHz / 64-bit – 17 Gbps (shared bidirectional)

[Diagram: the same host architecture, with the I/O interface bottleneck highlighted at the I/O bus between the processors and the network adapter]

slide-12
SLIDE 12

IT4 Innovations’18 12 Network Based Computing Laboratory

  • Network speeds saturated at around 1Gbps

– Features provided were limited – Commodity networks were not considered scalable enough for very large-scale systems

Bottlenecks on Traditional Networks

Ethernet (1979 – ): 10 Mbit/sec
Fast Ethernet (1993 – ): 100 Mbit/sec
Gigabit Ethernet (1995 – ): 1000 Mbit/sec
ATM (1995 – ): 155/622/1024 Mbit/sec
Myrinet (1993 – ): 1 Gbit/sec
Fibre Channel (1994 – ): 1 Gbit/sec

[Diagram: the same host architecture, with the network bottleneck highlighted at the network adapter and switch]

slide-13
SLIDE 13

IT4 Innovations’18 13 Network Based Computing Laboratory

  • Industry Networking Standards
  • InfiniBand and High-speed Ethernet were introduced into the market to address these bottlenecks
  • InfiniBand aimed at all three bottlenecks (protocol processing, I/O bus, and network speed)
  • Ethernet aimed at directly handling the network speed bottleneck, relying on complementary technologies to alleviate the protocol processing and I/O bus bottlenecks

Motivation for InfiniBand and High-speed Ethernet

slide-14
SLIDE 14

IT4 Innovations’18 14 Network Based Computing Laboratory

  • Introduction
  • Why InfiniBand and High-speed Ethernet?
  • Overview of IB, HSE, their Convergence and Features
  • Overview of Omni-Path Architecture
  • IB, Omni-Path, and HSE HW/SW Products and Installations
  • Sample Case Studies and Performance Numbers
  • Conclusions and Final Q&A

Presentation Overview

slide-15
SLIDE 15

IT4 Innovations’18 15 Network Based Computing Laboratory

  • IB Trade Association was formed with seven industry leaders (Compaq, Dell, HP, IBM, Intel, Microsoft, and Sun)
  • Goal: To design a scalable and high-performance communication and I/O architecture by taking an integrated view of computing, networking, and storage technologies
  • Many other industry members participated in the effort to define the IB architecture specification
  • IB Architecture (Volume 1, Version 1.0) was released to the public on Oct 24, 2000

– Several annexes released after that (RDMA_CM - Sep’06, iSER – Sep’06, XRC – Mar’09, RoCE – Apr’10, RoCEv2 – Sep’14, Virtualization – Nov’16) – Latest version 1.3.1 released November 2016

  • http://www.infinibandta.org

IB Trade Association

slide-16
SLIDE 16

IT4 Innovations’18 16 Network Based Computing Laboratory

  • 10GE Alliance formed by several industry leaders to take the Ethernet family to the next speed step
  • Goal: To achieve a scalable and high performance communication architecture while maintaining backward compatibility with Ethernet

  • http://www.ethernetalliance.org
  • 40-Gbps (Servers) and 100-Gbps Ethernet (Backbones, Switches, Routers): IEEE 802.3 WG
  • 25-Gbps Ethernet Consortium targeting 25/50Gbps (July 2014)

– http://25gethernet.org

  • Energy-efficient and power-conscious protocols

– On-the-fly link speed reduction for under-utilized links

  • Ethernet Alliance Technology Forum looking forward to 2026

– http://insidehpc.com/2016/08/at-ethernet-alliance-technology-forum/

High-speed Ethernet Consortium (10GE/25GE/40GE/50GE/100GE)

slide-17
SLIDE 17

IT4 Innovations’18 17 Network Based Computing Laboratory

  • Network speed bottlenecks
  • Protocol processing bottlenecks
  • I/O interface bottlenecks

Tackling Communication Bottlenecks with IB and HSE

slide-18
SLIDE 18

IT4 Innovations’18 18 Network Based Computing Laboratory

  • Bit serial differential signaling

– Independent pairs of wires to transmit independent data (called a lane) – Scalable to any number of lanes – Easy to increase clock speed of lanes (since each lane consists only of a pair of wires)

  • Theoretically, no perceived limit on the bandwidth

Network Bottleneck Alleviation: InfiniBand (“Infinite Bandwidth”) and High-speed Ethernet
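To make the lane arithmetic concrete, here is a small illustrative calculation (the numbers are not from the tutorial): with QDR-like values of 4 lanes, 10 Gb/s signaling per lane, and 8b/10b encoding, the effective data rate works out to the 32 Gbit/sec figure that appears in the speed timeline on the next slide.

```c
#include <stdio.h>

/* Toy calculation of the effective data rate of a multi-lane serial link.
 * The lane count, signaling rate, and encoding efficiency below are
 * illustrative (QDR-like) values, not part of the tutorial itself. */
int main(void)
{
    int    lanes        = 4;        /* e.g., a 4X link         */
    double signal_gbps  = 10.0;     /* per-lane signaling rate */
    double encoding_eff = 8.0 / 10; /* 8b/10b line encoding    */

    double data_rate = lanes * signal_gbps * encoding_eff;
    printf("Effective data rate: %.1f Gb/s\n", data_rate);  /* prints 32.0 */
    return 0;
}
```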

slide-19
SLIDE 19

IT4 Innovations’18 19 Network Based Computing Laboratory

Ethernet (1979 – ): 10 Mbit/sec
Fast Ethernet (1993 – ): 100 Mbit/sec
Gigabit Ethernet (1995 – ): 1000 Mbit/sec
ATM (1995 – ): 155/622/1024 Mbit/sec
Myrinet (1993 – ): 1 Gbit/sec
Fibre Channel (1994 – ): 1 Gbit/sec
InfiniBand (2001 – ): 2 Gbit/sec (1X SDR)
10-Gigabit Ethernet (2001 – ): 10 Gbit/sec
InfiniBand (2003 – ): 8 Gbit/sec (4X SDR)
InfiniBand (2005 – ): 16 Gbit/sec (4X DDR), 24 Gbit/sec (12X SDR)
InfiniBand (2007 – ): 32 Gbit/sec (4X QDR)
40-Gigabit Ethernet (2010 – ): 40 Gbit/sec
InfiniBand (2011 – ): 54.6 Gbit/sec (4X FDR)
InfiniBand (2012 – ): 2 x 54.6 Gbit/sec (4X Dual-FDR)
25-/50-Gigabit Ethernet (2014 – ): 25/50 Gbit/sec
100-Gigabit Ethernet (2015 – ): 100 Gbit/sec
Omni-Path (2015 – ): 100 Gbit/sec
InfiniBand (2015 – ): 100 Gbit/sec (4X EDR)
InfiniBand (2016 – ): 200 Gbit/sec (4X HDR)

Network Speed Acceleration with IB and HSE

100 times in the last 15 years

slide-20
SLIDE 20

IT4 Innovations’18 20 Network Based Computing Laboratory

InfiniBand Link Speed Standardization Roadmap

Courtesy: InfiniBand Trade Association

XDR = eXtreme Data Rate
NDR = Next Data Rate
HDR = High Data Rate
EDR = Enhanced Data Rate
FDR = Fourteen Data Rate
QDR = Quad Data Rate
DDR = Double Data Rate (not shown)
SDR = Single Data Rate (not shown)

slide-21
SLIDE 21

IT4 Innovations’18 21 Network Based Computing Laboratory

  • Network speed bottlenecks
  • Protocol processing bottlenecks
  • I/O interface bottlenecks

Tackling Communication Bottlenecks with IB and HSE

slide-22
SLIDE 22

IT4 Innovations’18 22 Network Based Computing Laboratory

  • Intelligent Network Interface Cards
  • Support entire protocol processing completely in hardware (hardware protocol offload engines)
  • Provide a rich communication interface to applications

– User-level communication capability – Gets rid of intermediate data buffering requirements

  • No software signaling between communication layers

– All layers are implemented on a dedicated hardware unit, and not on a shared host CPU

Capabilities of High-Performance Networks

slide-23
SLIDE 23

IT4 Innovations’18 23 Network Based Computing Laboratory

  • Fast Messages (FM)

– Developed by UIUC

  • Myricom GM

– Proprietary protocol stack from Myricom

  • These network stacks set the trend for high-performance communication requirements

– Hardware offloaded protocol stack – Support for fast and secure user-level access to the protocol stack

  • Virtual Interface Architecture (VIA)

– Standardized by Intel, Compaq, Microsoft – Precursor to IB

Previous High-Performance Network Stacks

slide-24
SLIDE 24

IT4 Innovations’18 24 Network Based Computing Laboratory

  • Some IB models have multiple hardware accelerators

– E.g., Mellanox IB adapters

  • Protocol Offload Engines

– Completely implement ISO/OSI layers 2-4 (link layer, network layer and transport layer) in hardware

  • Additional hardware supported features also present

– RDMA, Multicast, QoS, Fault Tolerance, and many more

IB Hardware Acceleration

slide-25
SLIDE 25

IT4 Innovations’18 25 Network Based Computing Laboratory

  • Interrupt Coalescing

– Improves throughput, but degrades latency

  • Jumbo Frames

– No latency impact; incompatible with existing switches

  • Hardware Checksum Engines

– Checksum performed in hardware → significantly faster – Shown to have minimal benefit independently

  • Segmentation Offload Engines (a.k.a. Virtual MTU)

– Host processor “thinks” that the adapter supports large Jumbo frames, but the adapter splits it into regular-sized (1500-byte) frames – Supported by most HSE products because of its backward compatibility → considered “regular” Ethernet

Ethernet Hardware Acceleration

slide-26
SLIDE 26

IT4 Innovations’18 26 Network Based Computing Laboratory

  • TCP Offload Engines (TOE)

– Hardware Acceleration for the entire TCP/IP stack – Initially patented by Tehuti Networks – Actually refers to the IC on the network adapter that implements TCP/IP – In practice, usually referred to as the entire network adapter

  • Internet Wide-Area RDMA Protocol (iWARP)

– Standardized by IETF and the RDMA Consortium – Support acceleration features (like IB) for Ethernet

  • http://www.ietf.org & http://www.rdmaconsortium.org

TOE and iWARP Accelerators

slide-27
SLIDE 27

IT4 Innovations’18 27 Network Based Computing Laboratory

  • Also known as “Datacenter Ethernet” or “Lossless Ethernet”

– Combines a number of optional Ethernet standards into one umbrella as mandatory requirements

  • Sample enhancements include:

– Priority-based Flow Control: Link-level flow control for each Class of Service (CoS) – Enhanced Transmission Selection (ETS): Bandwidth assignment to each CoS – Data Center Bridging Exchange (DCBX): Congestion notification, priority classes – End-to-end Congestion Notification: Per-flow congestion control to supplement per-link flow control

Converged (Enhanced) Ethernet (CEE or CE)

slide-28
SLIDE 28

IT4 Innovations’18 28 Network Based Computing Laboratory

  • Network speed bottlenecks
  • Protocol processing bottlenecks
  • I/O interface bottlenecks

Tackling Communication Bottlenecks with IB and HSE

slide-29
SLIDE 29

IT4 Innovations’18 29 Network Based Computing Laboratory

  • InfiniBand initially intended to replace I/O bus technologies with networking-like technology

– That is, bit serial differential signaling – With enhancements in I/O technologies that use a similar architecture (HyperTransport, PCI Express), this has become mostly irrelevant now

  • Both IB and HSE today come as network adapters that plug into existing I/O technologies

Interplay with I/O Technologies

slide-30
SLIDE 30

IT4 Innovations’18 30 Network Based Computing Laboratory

  • Recent trends in I/O interfaces show that they are nearly matching head-to-head with network speeds (though they still lag a little bit)

Trends in I/O Interfaces with Servers

PCI (1990): 33 MHz / 32-bit – 1.05 Gbps (shared bidirectional)
PCI-X, 1998 (v1.0) and 2003 (v2.0): 133 MHz / 64-bit – 8.5 Gbps; 266-533 MHz / 64-bit – 17 Gbps (shared bidirectional)
AMD HyperTransport (HT), 2001 (v1.0), 2004 (v2.0), 2006 (v3.0), 2008 (v3.1): 102.4 Gbps (v1.0), 179.2 Gbps (v2.0), 332.8 Gbps (v3.0), 409.6 Gbps (v3.1) (32 lanes)
PCI-Express (PCIe) by Intel, 2003 (Gen1), 2007 (Gen2), 2009 (Gen3 standard), 2017 (Gen4 standard): Gen1 – 4X (8 Gbps), 8X (16 Gbps), 16X (32 Gbps); Gen2 – 4X (16 Gbps), 8X (32 Gbps), 16X (64 Gbps); Gen3 – 4X (~32 Gbps), 8X (~64 Gbps), 16X (~128 Gbps); Gen4 – 4X (~64 Gbps), 8X (~128 Gbps), 16X (~256 Gbps)
Intel QuickPath Interconnect (QPI), 2009: 153.6-204.8 Gbps (20 lanes)

slide-31
SLIDE 31

IT4 Innovations’18 31 Network Based Computing Laboratory

  • Cache Coherence Interconnect for Accelerators (CCIX)

– https://www.ccixconsortium.com/

  • NVLink

– http://www.nvidia.com/object/nvlink.html

  • CAPI/OpenCAPI

– http://opencapi.org/

  • GenZ

– http://genzconsortium.org/

Upcoming I/O Interface Architectures

slide-32
SLIDE 32

IT4 Innovations’18 32 Network Based Computing Laboratory

  • Introduction
  • Why InfiniBand and High-speed Ethernet?
  • Overview of IB, HSE, their Convergence and Features
  • Overview of Omni-Path Architecture
  • IB, Omni-Path, and HSE HW/SW Products and Installations
  • Sample Case Studies and Performance Numbers
  • Conclusions and Final Q&A

Presentation Overview

slide-33
SLIDE 33

IT4 Innovations’18 33 Network Based Computing Laboratory

  • InfiniBand

– Architecture and Basic Hardware Components – Communication Model and Semantics – Novel Features – Subnet Management and Services

  • High-speed Ethernet Family

– Internet Wide Area RDMA Protocol (iWARP) – Alternate vendor-specific protocol stacks

  • InfiniBand/Ethernet Convergence Technologies

– Virtual Protocol Interconnect (VPI) – (InfiniBand) RDMA over Converged (Enhanced) Ethernet (RoCE)

IB, HSE and their Convergence

slide-34
SLIDE 34

IT4 Innovations’18 34 Network Based Computing Laboratory

Comparing InfiniBand with Traditional Networking Stack

InfiniBand stack:
– Application Layer: MPI, PGAS, File Systems
– Transport Layer: OpenFabrics Verbs; RC (reliable), UD (unreliable)
– Network Layer: Routing; OpenSM (management tool)
– Link Layer: Flow-control, Error Detection
– Physical Layer: Copper or Optical

Traditional Ethernet stack:
– Application Layer: HTTP, FTP, MPI, File Systems; Sockets Interface
– Transport Layer: TCP, UDP
– Network Layer: Routing; DNS management tools
– Link Layer: Flow-control and Error Detection
– Physical Layer: Copper, Optical or Wireless

slide-35
SLIDE 35

IT4 Innovations’18 35 Network Based Computing Laboratory Kernel Space

TCP/IP Stack and IPoIB

[Figure: two stacks compared by interface, protocol, adapter, and switch – applications/middleware using sockets over TCP/IP (Ethernet driver) on a 1/10/25/40/50/100 GigE Ethernet adapter and switch, and sockets over IPoIB on an InfiniBand adapter and switch]

slide-36
SLIDE 36

IT4 Innovations’18 36 Network Based Computing Laboratory Kernel Space

TCP/IP, IPoIB and Native IB Verbs

[Figure: stack options compared by interface, protocol, adapter, and switch – sockets over TCP/IP on 1/10/25/40/50/100 GigE Ethernet, sockets over IPoIB on InfiniBand, and verbs over user-space RDMA (native IB) on InfiniBand]

slide-37
SLIDE 37

IT4 Innovations’18 37 Network Based Computing Laboratory

  • InfiniBand

– Architecture and Basic Hardware Components – Communication Model and Semantics

  • Communication Model
  • Memory registration and protection
  • Channel and memory semantics

– Novel Features

  • Hardware Protocol Offload

– Link, network and transport layer features

– Subnet Management and Services – Sockets Direct Protocol (SDP) stack – RSockets Protocol Stack

IB Overview

slide-38
SLIDE 38

IT4 Innovations’18 38 Network Based Computing Laboratory

  • Used by processing and I/O units to

connect to fabric

  • Consume & generate IB packets
  • Programmable DMA engines with

protection features

  • May have multiple ports

– Independent buffering channeled through Virtual Lanes

  • Host Channel Adapters (HCAs)

Components: Channel Adapters

[Diagram: a channel adapter containing QPs, a DMA engine with local memory, transport (MTP) and Subnet Management Agent (SMA) logic, and multiple ports, each with its own set of Virtual Lanes (VLs)]

slide-39
SLIDE 39

IT4 Innovations’18 39 Network Based Computing Laboratory

  • Relay packets from a link to another
  • Switches: intra-subnet
  • Routers: inter-subnet
  • May support multicast

Components: Switches and Routers

[Diagram: a switch and a router, each shown as a packet relay with multiple ports and per-port Virtual Lanes; the router performs the relay based on the Global Route Header (GRH)]

slide-40
SLIDE 40

IT4 Innovations’18 40 Network Based Computing Laboratory

  • Network Links

– Copper, Optical, Printed Circuit wiring on Back Plane – Not directly addressable

  • Traditional adapters built for copper cabling

– Restricted by cable length (signal integrity) – For example, QDR copper cables are restricted to 7m

  • Intel Connects: Optical cables with Copper-to-optical conversion hubs

(acquired by Emcore)

– Up to 100m length – 550 picoseconds copper-to-optical conversion latency

  • Available from other vendors (Luxtera)
  • Repeaters (Vol. 2 of InfiniBand specification)

Components: Links & Repeaters

(Courtesy Intel)

slide-41
SLIDE 41

IT4 Innovations’18 41 Network Based Computing Laboratory

  • InfiniBand

– Architecture and Basic Hardware Components – Communication Model and Semantics

  • Communication Model
  • Memory registration and protection
  • Channel and memory semantics

– Novel Features

  • Hardware Protocol Offload

– Link, network and transport layer features

– Subnet Management and Services – Sockets Direct Protocol (SDP) stack – RSockets Protocol Stack

IB Overview

slide-42
SLIDE 42

IT4 Innovations’18 42 Network Based Computing Laboratory

IB Communication Model

Basic InfiniBand Communication Semantics

slide-43
SLIDE 43

IT4 Innovations’18 43 Network Based Computing Laboratory

Two-sided Communication Model

[Diagram: processes P1, P2, and P3, each with an HCA; P1 and P3 post send buffers and send to P2, while P2 posts receive buffers, polls its HCA, and eventually receives the data from P1 and P3]

slide-44
SLIDE 44

IT4 Innovations’18 44 Network Based Computing Laboratory

One-sided Communication Model

[Diagram: after a global region creation in which buffer information is exchanged, P1 posts a descriptor to its HCA to write data directly into a buffer at P2, and P2 likewise writes into a buffer at P3, without involving the target processes]

slide-45
SLIDE 45

IT4 Innovations’18 45 Network Based Computing Laboratory

  • Each QP has two queues

– Send Queue (SQ) – Receive Queue (RQ) – Work requests are queued to the QP (WQEs: “Wookies”)

  • Each QP is linked to a Completion Queue (CQ)

– Gives notification of operation completion from QPs – Completed WQEs are placed in the CQ with additional information (CQEs: “Cookies”)

Queue Pair Model

[Diagram: an InfiniBand device with a QP (Send and Recv queues) holding WQEs and an associated CQ holding CQEs]
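A minimal libibverbs sketch of the objects on this slide: one Completion Queue and one Queue Pair with its Send and Receive queues. It assumes a device context (ctx) and protection domain (pd) have already been opened; the queue depths are arbitrary example values.

```c
#include <stdio.h>
#include <infiniband/verbs.h>

/* Sketch: create a CQ and an RC QP whose completions (CQEs) land in that CQ.
 * 'ctx' and 'pd' are assumed to exist already; depths are illustrative. */
struct ibv_qp *create_qp_example(struct ibv_context *ctx, struct ibv_pd *pd)
{
    struct ibv_cq *cq = ibv_create_cq(ctx, 128, NULL, NULL, 0);
    if (!cq) { perror("ibv_create_cq"); return NULL; }

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,                 /* completions for send WQEs    */
        .recv_cq = cq,                 /* completions for receive WQEs */
        .cap     = { .max_send_wr = 64, .max_recv_wr = 64,
                     .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,         /* Reliable Connection          */
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    if (!qp) perror("ibv_create_qp");
    return qp;  /* the QP still needs INIT->RTR->RTS transitions before use */
}
```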

slide-46
SLIDE 46

IT4 Innovations’18 46 Network Based Computing Laboratory

  • 1. Registration Request
  • Send virtual address and length
  • 2. Kernel handles the virtual-to-physical mapping and pins the region into physical memory
  • A process cannot map memory that it does not own (security!)
  • 3. HCA caches the virtual-to-physical mapping and issues a handle
  • Includes an l_key and r_key
  • 4. Handle is returned to the application

Memory Registration

Before we do any communication: all memory used for communication must be registered

[Diagram: the registration request flows from the process (1) to the kernel (2) to the HCA/RNIC (3), and the handle is returned to the process (4)]
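A minimal libibverbs sketch of steps 1-4 above. The protection domain pd and the buffer size are assumed/illustrative; the returned handle carries the l_key and r_key mentioned in step 3.

```c
#include <stdlib.h>
#include <infiniband/verbs.h>

/* Sketch: register a buffer so the HCA can DMA to/from it.
 * 'pd' is an existing protection domain; 'len' is arbitrary. */
struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = malloc(len);
    if (!buf)
        return NULL;

    /* The kernel pins the pages; the HCA caches the virtual->physical map. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE  |
                                   IBV_ACCESS_REMOTE_READ  |
                                   IBV_ACCESS_REMOTE_WRITE);
    /* On success the handle carries the keys:
     *   mr->lkey - used locally in every SGE that refers to this buffer
     *   mr->rkey - handed to peers that will issue RDMA to this buffer */
    return mr;
}
```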

slide-47
SLIDE 47

IT4 Innovations’18 47 Network Based Computing Laboratory

  • To send or receive data, the l_key must be provided to the HCA
  • HCA verifies access to local memory
  • For RDMA, the initiator must have the r_key for the remote virtual address
  • Possibly exchanged with a send/recv
  • r_key is not encrypted in IB

Memory Protection

[Diagram: process, kernel, and HCA/NIC; the l_key authorizes access to local memory, while the r_key is needed for RDMA operations on a remote buffer]

For security, keys are required for all operations that touch buffers
slide-48
SLIDE 48

IT4 Innovations’18 48 Network Based Computing Laboratory

Communication in the Channel Semantics (Send/Receive Model)

[Diagram: two nodes, each with a processor, memory, and an InfiniBand device containing a QP (Send/Recv) and a CQ; the message moves between registered memory segments and the receiver returns a hardware ACK]

Send WQE contains information about the send buffer (multiple non-contiguous segments)

Receive WQE contains information on the receive buffer (multiple non-contiguous segments); incoming messages have to be matched to a receive WQE to know where to place the data

Processor is involved only to: 1. Post receive WQE; 2. Post send WQE; 3. Pull completed CQEs from the CQ
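A minimal libibverbs sketch of the channel semantics above: the receiver posts a receive WQE, the sender posts a send WQE, and each side pulls CQEs from its CQ. It assumes a connected RC QP and a memory region mr covering buf (as in the registration sketch earlier); error handling is abbreviated.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Receiver: post a receive WQE describing where an incoming message may land. */
int post_recv(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len,
                           .lkey = mr->lkey };
    struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad;
    return ibv_post_recv(qp, &wr, &bad);
}

/* Sender: post a send WQE; the HCA segments and transmits the whole buffer. */
int post_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len,
                           .lkey = mr->lkey };
    struct ibv_send_wr wr = { .wr_id = 2, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_SEND,
                              .send_flags = IBV_SEND_SIGNALED };
    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);
}

/* Either side: pull a completed CQE out of the CQ. */
int wait_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;                                   /* busy-poll for one CQE */
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```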

slide-49
SLIDE 49

IT4 Innovations’18 49 Network Based Computing Laboratory

Communication in the Memory Semantics (RDMA Model)

[Diagram: initiator and target nodes, each with a processor, memory, and an InfiniBand device (QP and CQ); data moves directly from the initiator's memory segments into the target's registered memory segment, with a hardware ACK]

Send WQE contains information about the send buffer (multiple segments) and the receive buffer (single segment)

Initiator processor is involved only to: 1. Post send WQE; 2. Pull the completed CQE from the send CQ. No involvement from the target processor.
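A minimal libibverbs sketch of the memory (RDMA) semantics above: the initiator posts a single RDMA-write WQE carrying the remote virtual address and r_key (assumed to have been exchanged earlier, e.g., via send/receive); only the initiator sees a completion.

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Sketch: write a local buffer directly into a remote buffer.
 * Assumes a connected RC QP, a local MR over 'buf', and a previously
 * exchanged remote address and rkey. */
int rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len,
               uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len,
                           .lkey = mr->lkey };
    struct ibv_send_wr wr, *bad;
    memset(&wr, 0, sizeof wr);
    wr.wr_id      = 3;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_RDMA_WRITE;   /* no receive WQE needed at target */
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;
    return ibv_post_send(qp, &wr, &bad); /* completion appears only on the
                                            initiator's send CQ */
}
```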

slide-50
SLIDE 50

IT4 Innovations’18 50 Network Based Computing Laboratory

Communication in the Memory Semantics (Atomics)

[Diagram: initiator and target nodes; the HCA performs the atomic operation (OP) on the destination memory segment and returns the original value to the source memory segment]

Send WQE contains information about the send buffer (single 64-bit segment) and the receive buffer (single 64-bit segment)

IB supports compare-and-swap and fetch-and-add atomic operations

Initiator processor is involved only to: 1. Post send WQE; 2. Pull the completed CQE from the send CQ. No involvement from the target processor.
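A minimal libibverbs sketch of a fetch-and-add atomic, matching the 64-bit operands described above. The remote address and r_key are assumed to have been exchanged earlier; a compare-and-swap differs only in the opcode (IBV_WR_ATOMIC_CMP_AND_SWP) and the extra swap operand.

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Sketch: the HCA atomically adds 'add' to a 64-bit word at the target and
 * returns the original value into the local 8-byte buffer 'result'.
 * Assumes a connected RC QP and a local MR covering 'result'. */
int fetch_and_add(struct ibv_qp *qp, struct ibv_mr *mr, uint64_t *result,
                  uint64_t remote_addr, uint32_t rkey, uint64_t add)
{
    struct ibv_sge sge = { .addr = (uintptr_t)result, .length = 8,
                           .lkey = mr->lkey };
    struct ibv_send_wr wr, *bad;
    memset(&wr, 0, sizeof wr);
    wr.wr_id      = 4;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_ATOMIC_FETCH_AND_ADD;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = remote_addr;  /* must be 8-byte aligned */
    wr.wr.atomic.rkey        = rkey;
    wr.wr.atomic.compare_add = add;          /* value to add           */
    return ibv_post_send(qp, &wr, &bad);
}
```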

slide-51
SLIDE 51

IT4 Innovations’18 51 Network Based Computing Laboratory

  • InfiniBand

– Architecture and Basic Hardware Components – Communication Model and Semantics

  • Communication Model
  • Memory registration and protection
  • Channel and memory semantics

– Novel Features

  • Hardware Protocol Offload

– Link, network and transport layer features

– Subnet Management and Services – Sockets Direct Protocol (SDP) stack – RSockets Protocol Stack

IB Overview

slide-52
SLIDE 52

IT4 Innovations’18 52 Network Based Computing Laboratory

Hardware Protocol Offload

Complete Hardware Implementations Exist

slide-53
SLIDE 53

IT4 Innovations’18 53 Network Based Computing Laboratory

  • Buffering and Flow Control
  • Virtual Lanes, Service Levels, and QoS
  • Switching and Multicast
  • Network Fault Tolerance
  • IB WAN Capability

Link/Network Layer Capabilities

slide-54
SLIDE 54

IT4 Innovations’18 54 Network Based Computing Laboratory

  • IB provides three-levels of communication throttling/control mechanisms

– Link-level flow control (link layer feature) – Message-level flow control (transport layer feature): discussed later – Congestion control (part of the link layer features)

  • IB provides an absolute credit-based flow-control

– Receiver guarantees that enough space is allotted for N blocks of data – Occasional update of available credits by the receiver

  • Has no relation to the number of messages, but only to the total amount of data being sent

– One 1MB message is equivalent to 1024 1KB messages (except for rounding off at message boundaries)

Buffering and Flow Control
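A toy model (not an IB API) of the absolute credit-based scheme described above: credits count blocks of data, so one 1 MB message consumes the same credits as 1024 separate 1 KB messages. The block size and initial credit count are made-up illustrative values.

```c
#include <stdio.h>

/* Toy model of credit-based link-level flow control: the receiver advertises
 * credits in units of data blocks; the sender may transmit only while it
 * holds enough credits, regardless of how the data is split into messages. */
#define BLOCK_BYTES 64          /* illustrative credit granularity */

static unsigned credits = 16;   /* blocks the receiver has buffer space for */

int try_send(unsigned msg_bytes)
{
    unsigned need = (msg_bytes + BLOCK_BYTES - 1) / BLOCK_BYTES;
    if (need > credits)
        return 0;               /* stall: wait for a credit update */
    credits -= need;            /* consume credits for this data   */
    return 1;
}

void credit_update(unsigned freed_blocks)
{
    credits += freed_blocks;    /* receiver freed buffer space */
}

int main(void)
{
    printf("1KB message sent? %d\n", try_send(1024)); /* needs all 16 blocks */
    printf("64B message sent? %d\n", try_send(64));   /* stalls: no credits  */
    credit_update(4);
    printf("64B message sent? %d\n", try_send(64));   /* now succeeds        */
    return 0;
}
```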

slide-55
SLIDE 55

IT4 Innovations’18 55 Network Based Computing Laboratory

  • Virtual Lanes (VL)

– Multiple (between 2 and 16) virtual links within same physical link

  • 0 – default data VL; 15 – VL for management traffic

– Separate buffers and flow control – Avoids Head-of-Line Blocking

  • Service Level (SL):

– Packets may operate at one of 16, user defined SLs

Virtual Lanes, Service Levels, and QoS

[Figure (Courtesy: Mellanox Technologies): traffic segregation – servers carrying IPC/load balancing/web cache/ASP traffic, storage area network traffic (RAID, NAS, backup), and IP network traffic (routers, switches, VPNs, DSLAMs) share one InfiniBand fabric, with each traffic class on its own Virtual Lanes]

  • SL to VL mapping:

– SL determines which VL on the next link is to be used – Each port (switches, routers, end nodes) has a SL to VL mapping table configured by the subnet management

  • Partitions:

– Fabric administration (through Subnet Manager) may assign specific SLs to different partitions to isolate traffic flows

slide-56
SLIDE 56

IT4 Innovations’18 56 Network Based Computing Laboratory

  • Each port has one or more associated LIDs (Local Identifiers)

– Switches look up which port to forward a packet to based on its destination LID (DLID) – This information is maintained at the switch

  • For multicast packets, the switch needs to maintain multiple output ports to forward the packet to

– Packet is replicated to each appropriate output port – Ensures at-most-once delivery & loop-free forwarding – There is an interface for a group management protocol

  • Create, join/leave, prune, delete group

Switching (Layer-2 Routing) and Multicast
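A toy illustration (not switch firmware) of the DLID-based lookup described above: a linear forwarding table maps each destination LID to an output port; in a real fabric the Subnet Manager populates these tables.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy unicast forwarding: DLID -> output port lookup.
 * Table size and entries are illustrative. */
#define MAX_LID 48

static uint8_t out_port[MAX_LID];          /* indexed by destination LID */

static uint8_t forward(uint16_t dlid)
{
    return out_port[dlid];                 /* single lookup per packet */
}

int main(void)
{
    out_port[2] = 1;                       /* LID 2 reachable via port 1 */
    out_port[4] = 4;                       /* LID 4 reachable via port 4 */
    printf("packet for LID 4 -> port %u\n", forward(4));
    return 0;
}
```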

slide-57
SLIDE 57

IT4 Innovations’18 57 Network Based Computing Laboratory

  • Basic unit of switching is a crossbar

– Current InfiniBand products use either 24-port (DDR) or 36-port (QDR and FDR) crossbars

  • Switches available in the market are typically collections of crossbars within a single cabinet

  • Do not confuse “non-blocking switches” with “crossbars”

– Crossbars provide all-to-all connectivity to all connected nodes

  • For any random node pair selection, all communication is non-blocking

– Non-blocking switches provide a fat-tree of many crossbars

  • For any random node pair selection, there exists a switch configuration such that communication is non-blocking
  • If the communication pattern changes, the same switch configuration might no longer provide fully non-blocking communication

Switch Complex

slide-58
SLIDE 58

IT4 Innovations’18 58 Network Based Computing Laboratory

  • Someone has to set up the forwarding tables and give every port an LID

– “Subnet Manager” does this work

  • Different routing algorithms give different paths

IB Switching/Routing: An Example

[Diagram: an example IB switch block diagram (Mellanox 144-port) built from leaf and spine blocks; end nodes with LID 2 and LID 4 are reached via a per-switch forwarding table (DLID → out-port: 2 → 1, 4 → 4)]

Switching: IB supports Virtual Cut-Through (VCT)
Routing: unspecified by the IB spec; Up*/Down* and Shift are popular routing engines supported by OFED

  • Fat-Tree is a popular topology for

IB Cluster

– Different over-subscription ratio may be used

  • Other topologies

– 3D Torus (Sandia Red Sky, SDSC Gordon) and SGI Altix (Hypercube) – 10D Hypercube (NASA Pleiades)

slide-59
SLIDE 59

IT4 Innovations’18 59 Network Based Computing Laboratory

  • Similar to basic switching, except…

– … sender can utilize multiple LIDs associated to the same destination port

  • Packets sent to one DLID take a fixed path
  • Different packets can be sent using different DLIDs
  • Each DLID can have a different path (switch can be configured differently for each DLID)
  • Can cause out-of-order arrival of packets

– IB uses a simplistic approach:

  • If packets in one connection arrive out-of-order, they are dropped

– Easier to use different DLIDs for different connections

  • This is what most high-level libraries using IB do!

More on Multipathing

slide-60
SLIDE 60

IT4 Innovations’18 60 Network Based Computing Laboratory

IB Multicast Example

Active Links Compute Node Switch Subnet Manager Multicast Join Multicast Setup Multicast Join Multicast Setup

slide-61
SLIDE 61

IT4 Innovations’18 61 Network Based Computing Laboratory

  • Automatically utilizes multipathing for network fault-tolerance (optional feature)
  • Idea is that the high-level library (or application) using IB will have one primary

path, and one fall-back path

– Enables migrating connections to a different path

  • Connection recovery in the case of failures
  • Available for RC, UC, and RD
  • Reliability guarantees for service type maintained during migration
  • Issue is that there is only one fall-back path (in hardware). If there is more than one failure (or a failure that affects both paths), the application will have to handle this in software

Network Level Fault Tolerance: Automatic Path Migration

slide-62
SLIDE 62

IT4 Innovations’18 62 Network Based Computing Laboratory

  • Getting increased attention for:

– Remote Storage, Remote Visualization – Cluster Aggregation (Cluster-of-clusters)

  • IB-Optical switches by multiple vendors

– Mellanox Technologies: www.mellanox.com – Obsidian Research Corporation: www.obsidianresearch.com & Bay Microsystems: www.baymicrosystems.com

  • Layer-1 changes from copper to optical; everything else stays the same

– Low-latency copper-optical-copper conversion

  • Large link-level buffers for flow-control

– Data messages do not have to wait for round-trip hops – Important in the wide-area network

  • Efforts underway to create InfiniBand connectivity around the world by the A*STAR Computational Resource Centre and partner organizations [1]

IB WAN Capability

[1] Michalewicz et al., “InfiniCortex: Present and Future” (invited paper), Proceedings of the ACM International Conference on Computing Frontiers.

slide-63
SLIDE 63

IT4 Innovations’18 63 Network Based Computing Laboratory

Hardware Protocol Offload

Complete Hardware Implementations Exist

slide-64
SLIDE 64

IT4 Innovations’18 64 Network Based Computing Laboratory

IB Transport Types and Associated Trade-offs

Attributes compared across Reliable Connection (RC), Reliable Datagram (RD), Dynamic Connected (DC), eXtended Reliable Connection (XRC), Unreliable Connection (UC), Unreliable Datagram (UD), and Raw Datagram:

Scalability (M processes, N nodes): RC – M²N QPs per HCA; RD – M QPs per HCA; DC – M QPs per HCA; XRC – MN QPs per HCA; UC – M²N QPs per HCA; UD – M QPs per HCA; Raw – 1 QP per HCA

Reliability:
– Corrupt data detected: yes, for all transports
– Data delivery guarantee: data delivered exactly once (RC, RD, DC, XRC); no guarantees (UC, UD, Raw)
– Data order guarantees: per connection (RC, DC, XRC); one source to multiple destinations (RD); unordered, duplicate data detected (UC); none (UD, Raw)
– Data loss detected: yes (RC, RD, DC, XRC, UC); no (UD, Raw)
– Error recovery: for the reliable transports, errors (retransmissions, alternate path, etc.) are handled by the transport layer and the client is only involved in handling fatal errors (broken links, protection violations, etc.); for UC, packets with errors and sequence errors are reported to the responder; none for UD and Raw

slide-65
SLIDE 65

IT4 Innovations’18 65 Network Based Computing Laboratory

  • Data Segmentation
  • Transaction Ordering
  • Message-level Flow Control
  • Static Rate Control and Auto-negotiation

Transport Layer Capabilities

slide-66
SLIDE 66

IT4 Innovations’18 66 Network Based Computing Laboratory

  • Message-level communication granularity, not byte-level (unlike TCP)

– Application can hand over a large message

  • Network adapter segments it to MTU sized packets
  • Single notification when the entire message is transmitted or received (not per packet)
  • Reduced host overhead to send/receive messages

– Depends on the number of messages, not the number of bytes

  • Strong transaction ordering for RC

– Sender network adapter transmits messages in the order in which WQEs were posted – Each QP utilizes a single LID

  • All WQEs posted on same QP take the same path
  • All packets are received by the receiver in the same order
  • All receive WQEs are completed in the order in which they were posted

Data Segmentation & Transaction Ordering

slide-67
SLIDE 67

IT4 Innovations’18 67 Network Based Computing Laboratory

  • Also called as End-to-end Flow-control

– Does not depend on the number of network hops – Separate from Link-level Flow-Control

  • Link-level flow-control only relies on the number of bytes being transmitted, not the number of messages
  • Message-level flow-control only relies on the number of messages transferred, not the number of bytes

– If 5 receive WQEs are posted, the sender can send 5 messages (can post 5 send WQEs)

  • If the sent messages are larger than the posted receive buffers, flow control cannot handle it
  • IB allows link rates to be statically changed to fixed values

– On a 4X link, we can set data to be sent at 1X

  • Cannot set rate requirement to 3.16 Gbps, for example

– For heterogeneous links, rate can be set to the lowest link rate – Useful for low-priority traffic

  • Auto-negotiation also available

– E.g., if you connect a 4X adapter to a 1X switch, data is automatically sent at 1X rate

Message-level Flow-Control & Rate Control
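A minimal sketch of how an upper-layer library might implement the message-level (end-to-end) flow control described above: the sender tracks how many receive WQEs the peer has pre-posted and stops when that credit count reaches zero. post_send is the helper from the earlier channel-semantics sketch; the credit bookkeeping itself is illustrative, not part of the verbs API.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* From the earlier channel-semantics sketch. */
int post_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len);

/* Sketch: send up to 'nmsgs' messages, consuming one credit per message.
 * '*peer_credits' mirrors the number of receive WQEs the peer has posted
 * (replenished out-of-band, e.g., via piggybacked credit updates). */
int send_with_credits(struct ibv_qp *qp, struct ibv_mr *mr,
                      void *msgs[], uint32_t len, int nmsgs, int *peer_credits)
{
    for (int i = 0; i < nmsgs; i++) {
        if (*peer_credits == 0)
            return i;               /* stop: peer has no receive WQE left */
        if (post_send(qp, mr, msgs[i], len))
            return -1;              /* post failure                       */
        (*peer_credits)--;          /* one message consumes one credit    */
    }
    return nmsgs;
}
```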

slide-68
SLIDE 68

IT4 Innovations’18 68 Network Based Computing Laboratory

  • InfiniBand

– Architecture and Basic Hardware Components – Communication Model and Semantics

  • Communication Model
  • Memory registration and protection
  • Channel and memory semantics

– Novel Features

  • Hardware Protocol Offload

– Link, network and transport layer features

– Subnet Management and Services – Sockets Direct Protocol (SDP) Stack – RSockets Protocol Stack

IB Overview

slide-69
SLIDE 69

IT4 Innovations’18 69 Network Based Computing Laboratory

  • Agents

– Processes or hardware units running on each adapter, switch, router (everything on the network) – Provide capability to query and set parameters

  • Managers

– Make high-level decisions and implement it on the network fabric using the agents

  • Messaging schemes

– Used for interactions between the manager and agents (or between agents)

  • Messages

Concepts in IB Management

slide-70
SLIDE 70

IT4 Innovations’18 70 Network Based Computing Laboratory

Subnet Manager

[Diagram: a fabric of compute nodes and switches with active and inactive links and a node running the Subnet Manager]

slide-71
SLIDE 71

IT4 Innovations’18 71 Network Based Computing Laboratory

  • InfiniBand

– Architecture and Basic Hardware Components – Communication Model and Semantics

  • Communication Model
  • Memory registration and protection
  • Channel and memory semantics

– Novel Features

  • Hardware Protocol Offload

– Link, network and transport layer features

– Subnet Management and Services – Sockets Direct Protocol (SDP) Stack – RSockets Protocol Stack

IB Overview

slide-72
SLIDE 72

IT4 Innovations’18 72 Network Based Computing Laboratory

IPoIB vs. SDP Architectural Models

Traditional Model vs. Possible SDP Model

[Figure (Source: InfiniBand Trade Association): in the traditional model, a sockets application uses the sockets API over the kernel TCP/IP sockets provider, TCP/IP transport driver, and IPoIB driver down to the InfiniBand CA; in the SDP model, the sockets API maps onto the Sockets Direct Protocol, which uses kernel-bypass RDMA semantics over the InfiniBand hardware]

slide-73
SLIDE 73

IT4 Innovations’18 73 Network Based Computing Laboratory

  • Implements various socket-like functions

– Functions take the same parameters as sockets

  • Can switch between regular sockets and RSockets using LD_PRELOAD

RSockets Overview

[Diagram: applications/middleware use the sockets interface; LD_PRELOAD redirects the calls into the RSockets library, which runs over RDMA_CM and verbs]
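A minimal RSockets client sketch using librdmacm's rsocket API directly; the calls mirror the regular sockets API (socket/connect/send become rsocket/rconnect/rsend). The server address and port are placeholders. Unmodified sockets binaries can instead be redirected at run time through the LD_PRELOAD mechanism mentioned above.

```c
#include <rdma/rsocket.h>   /* librdmacm's rsocket API (link with -lrdmacm) */
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>

/* Sketch: connect to a server and send one message over RSockets. */
int rsockets_client(void)
{
    int fd = rsocket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(7471);                       /* example port   */
    inet_pton(AF_INET, "192.168.1.10", &addr.sin_addr);  /* example server */

    if (rconnect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        rclose(fd);
        return -1;
    }
    const char msg[] = "hello over RDMA";
    rsend(fd, msg, sizeof msg, 0);                        /* RDMA underneath */
    rclose(fd);
    return 0;
}
```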

slide-74
SLIDE 74

IT4 Innovations’18 74 Network Based Computing Laboratory Kernel Space

TCP/IP, IPoIB, Native IB Verbs, SDP and RSockets

[Figure: stack options compared by interface, protocol, adapter, and switch – sockets over TCP/IP on Ethernet, sockets over IPoIB on InfiniBand, sockets over user-space RSockets or SDP on InfiniBand, and verbs over user-space RDMA (native IB) on InfiniBand]

slide-75
SLIDE 75

IT4 Innovations’18 75 Network Based Computing Laboratory

  • InfiniBand

– Architecture and Basic Hardware Components – Communication Model and Semantics – Novel Features – Subnet Management and Services

  • High-speed Ethernet Family

– Internet Wide Area RDMA Protocol (iWARP) – Alternate vendor-specific protocol stacks

  • InfiniBand/Ethernet Convergence Technologies

– Virtual Protocol Interconnect (VPI) – RDMA over Converged Enhanced Ethernet (RoCE)

IB, HSE and their Convergence

slide-76
SLIDE 76

IT4 Innovations’18 76 Network Based Computing Laboratory

  • High-speed Ethernet Family

– Internet Wide-Area RDMA Protocol (iWARP)

  • Architecture and Components
  • Features

– Out-of-order data placement – Dynamic and Fine-grained Data Rate control

– Alternate Vendor-specific Stacks

  • MX over Ethernet (for Myricom 10GE adapters)
  • Datagram Bypass Layer (for Myricom 10GE adapters)
  • Solarflare OpenOnload (for Solarflare 10/40GE adapters)
  • Emulex FastStack DBL (for OneConnect OCe12000-D 10GE adapters)

HSE Overview

slide-77
SLIDE 77

IT4 Innovations’18 77 Network Based Computing Laboratory

IB and 10/40GE RDMA Models: Commonalities and Differences

Features compared between IB and iWARP/HSE:

Hardware Acceleration: IB – Supported; iWARP/HSE – Supported
RDMA: IB – Supported; iWARP/HSE – Supported
Atomic Operations: IB – Supported; iWARP/HSE – Not supported
Multicast: IB – Supported; iWARP/HSE – Supported
Congestion Control: IB – Supported; iWARP/HSE – Supported
Data Placement: IB – Ordered; iWARP/HSE – Out-of-order
Data Rate-control: IB – Static and Coarse-grained; iWARP/HSE – Dynamic and Fine-grained
QoS: IB – Prioritization; iWARP/HSE – Prioritization and Fixed Bandwidth QoS
Multipathing: IB – Using DLIDs; iWARP/HSE – Using VLANs

slide-78
SLIDE 78

IT4 Innovations’18 78 Network Based Computing Laboratory

  • RDMA Protocol (RDMAP)

– Feature-rich interface – Security Management

  • Remote Direct Data Placement (RDDP)

– Data Placement and Delivery – Multi Stream Semantics – Connection Management

  • Marker PDU Aligned (MPA)

– Middle Box Fragmentation – Data Integrity (CRC)

iWARP Architecture and Components

[Figure (Courtesy iWARP Specification): an application or library uses RDMAP over RDDP over MPA over TCP (or SCTP) over IP; these layers are implemented by the iWARP offload engines on the network adapter (e.g., 10GigE), accessed through a device driver]

slide-79
SLIDE 79

IT4 Innovations’18 79 Network Based Computing Laboratory

  • Place data as it arrives, whether in or out-of-order
  • If data is out-of-order, place it at the appropriate offset
  • Issues from the application’s perspective:

– The second half of a message having been placed does not mean that the first half of the message has arrived as well – If one message has been placed, it does not mean that the previous messages have been placed

  • Issues from the protocol stack’s perspective

– The receiver network stack has to understand each frame of data

  • If the frame is unchanged during transmission, this is easy!

– The MPA protocol layer adds appropriate information at regular intervals to allow the receiver to identify fragmented frames

Decoupled Data Placement and Data Delivery

slide-80
SLIDE 80

IT4 Innovations’18 80 Network Based Computing Laboratory

  • High-speed Ethernet Family

– Internet Wide-Area RDMA Protocol (iWARP)

  • Architecture and Components
  • Features

– Out-of-order data placement – Dynamic and Fine-grained Data Rate control

– Alternate Vendor-specific Stacks

  • MX over Ethernet (for Myricom 10GE adapters)
  • Datagram Bypass Layer (for Myricom 10GE adapters)
  • Solarflare OpenOnload (for Solarflare 10/40GE adapters)
  • Emulex FastStack DBL (for OneConnect OCe12000-D 10GE adapters)

HSE Overview

slide-81
SLIDE 81

IT4 Innovations’18 81 Network Based Computing Laboratory

  • Part of the Ethernet standard, not iWARP

– Network vendors use a separate interface to support it

  • Dynamic bandwidth allocation to flows based on the interval between two packets in a flow

– E.g., one stall for every packet sent on a 10 Gbps network refers to a bandwidth allocation of 5 Gbps – Complicated because of TCP windowing behavior

  • Important for high-latency/high-bandwidth networks

– Large windows exposed on the receiver side – Receiver overflow controlled through rate control

Dynamic and Fine-grained Rate Control

slide-82
SLIDE 82

IT4 Innovations’18 82 Network Based Computing Laboratory

  • Can allow for simple prioritization:

– E.g., connection 1 performs better than connection 2 – 8 classes provided (a connection can be in any class)

  • Similar to SLs in InfiniBand

– Two priority classes for high-priority traffic

  • E.g., management traffic or your favorite application
  • Or can allow for specific bandwidth requests:

– E.g., can request for 3.62 Gbps bandwidth – Packet pacing and stalls used to achieve this

  • Query functionality to find out “remaining bandwidth”

Prioritization and Fixed Bandwidth QoS

slide-83
SLIDE 83

IT4 Innovations’18 83 Network Based Computing Laboratory Kernel Space

iWARP and TOE

[Figure: stack options compared by interface, protocol, adapter, and switch – sockets over TCP/IP on Ethernet, sockets over IPoIB on InfiniBand, sockets over hardware-offloaded TCP/IP on a 10/40 GigE TOE, sockets over user-space RSockets or SDP on InfiniBand, sockets over user-space iWARP on an iWARP adapter with an Ethernet switch, and verbs over user-space RDMA (native IB) on InfiniBand]

slide-84
SLIDE 84

IT4 Innovations’18 84 Network Based Computing Laboratory

  • High-speed Ethernet Family

– Internet Wide-Area RDMA Protocol (iWARP)

  • Architecture and Components
  • Features

– Out-of-order data placement – Dynamic and Fine-grained Data Rate control

– Alternate Vendor-specific Stack

  • Datagram Bypass Layer (for Myricom 10GE adapters)
  • Solarflare OpenOnload (for Solarflare 10/40GE adapters)
  • Emulex FastStack DBL (for OneConnect OCe12000-D 10GE adapters)

HSE Overview

slide-85
SLIDE 85

IT4 Innovations’18 85 Network Based Computing Laboratory

  • Another proprietary communication layer developed by Myricom

– Compatible with regular UDP sockets (embraces and extends) – Idea is to bypass the kernel stack and give UDP applications direct access to the network adapter

  • High performance and low-jitter
  • Primary motivation: Financial market applications (e.g., stock market)

– Applications prefer unreliable communication – Timeliness is more important than reliability

  • This stack is covered by NDA; more details can be requested from Myricom

Datagram Bypass Layer (DBL)

slide-86
SLIDE 86

IT4 Innovations’18 86 Network Based Computing Laboratory

Solarflare Communications: OpenOnload Stack

[Figure panels: Typical HPC Networking Stack, Typical Commodity Networking Stack, and the Solarflare approach to the networking stack]

  • HPC Networking Stack provides many performance benefits, but has limitations for certain types of scenarios, especially where applications tend to fork(), exec() and need asynchronous advancement (per application)
  • Solarflare approach:
  • Network hardware provides a user-safe interface to route packets directly to apps based on flow information in headers
  • Protocol processing can happen in both kernel and user space
  • Protocol state shared between app and kernel using shared memory

Courtesy Solarflare communications (www.openonload.org/openonload-google-talk.pdf)

slide-87
SLIDE 87

IT4 Innovations’18 87 Network Based Computing Laboratory

  • Proprietary communication layer developed by Emulex

– Compatible with regular UDP and TCP sockets – Idea is to bypass the kernel stack

  • High performance, low-jitter and low latency

– Available in multiple modes

  • Transparent Acceleration (TA)

– Accelerate existing sockets applications for UDP/TCP

  • DBL API

– UDP-only, socket-like semantics but requires application changes

  • Primary motivation: Financial market applications (e.g., stock market)

– Applications prefer unreliable communication – Timeliness is more important than reliability

  • This stack is covered by NDA; more details can be requested from Emulex

FastStack DBL

slide-88
SLIDE 88

IT4 Innovations’18 88 Network Based Computing Laboratory

  • InfiniBand

– Architecture and Basic Hardware Components – Communication Model and Semantics – Novel Features – Subnet Management and Services

  • High-speed Ethernet Family

– Internet Wide Area RDMA Protocol (iWARP) – Alternate vendor-specific protocol stacks

  • InfiniBand/Ethernet Convergence Technologies

– Virtual Protocol Interconnect (VPI) – RDMA over Converged Enhanced Ethernet (RoCE)

IB, HSE and their Convergence

slide-89
SLIDE 89

IT4 Innovations’18 89 Network Based Computing Laboratory

  • Single network firmware to support both IB and Ethernet
  • Autosensing of layer-2 protocol

– Can be configured to automatically work with either IB or Ethernet networks

  • Multi-port adapters can use one port on IB and another on Ethernet
  • Multiple use modes:

– Datacenters with IB inside the cluster and Ethernet outside – Clusters with IB network and Ethernet management

Virtual Protocol Interconnect (VPI)

[Diagram: a VPI adapter exposing an IB port (IB link layer) and an Ethernet port (Ethernet link layer with TCP/IP support) in hardware; applications use sockets over TCP/IP or IB verbs over the IB transport and network layers]

slide-90
SLIDE 90

IT4 Innovations’18 90 Network Based Computing Laboratory

RDMA over Converged Enhanced Ethernet (RoCE)

[Diagram: network stack comparison – native InfiniBand (IB verbs over IB transport, IB network, and IB link layers), RoCE (IB verbs over IB transport and IB network layers on an Ethernet link layer), and RoCE v2 (IB verbs over IB transport on UDP/IP over an Ethernet link layer)]

  • Takes advantage of IB and Ethernet

– Software written with IB-Verbs – Link layer is Converged (Enhanced) Ethernet (CE) – 100 Gb/s support from the latest EDR and ConnectX-3 Pro adapters

  • Pros: IB Vs RoCE

– Works natively in Ethernet environments

  • Entire Ethernet management ecosystem is available

– Has all the benefits of IB verbs – Link layer is very similar to the link layer of native IB, so there are no missing features

  • RoCE v2: Additional Benefits over RoCE

– Traditional Network Management Tools Apply – ACLs (Metering, Accounting, Firewalling) – GMP Snooping for Optimized Multicast – Network Monitoring Tools

Packet header comparison (Courtesy: OFED, Mellanox):
– RoCE: Ethernet L2 header (Ethertype identifies RoCE), IB GRH (L3 header), IB BTH+ (L4 header)
– RoCE v2: Ethernet L2 header (Ethertype), IP header (L3, protocol number identifies UDP), UDP header (destination port identifies RoCE v2), IB BTH+ (L4 header)

slide-91
SLIDE 91

IT4 Innovations’18 91 Network Based Computing Laboratory Kernel Space

RDMA over Converged Ethernet (RoCE)

[Figure: the same stack options as on the previous diagrams (TCP/IP on Ethernet, IPoIB, 10/40 GigE TOE, RSockets, SDP, user-space iWARP, native IB verbs), with the addition of RoCE – verbs over user-space RDMA on a RoCE adapter attached to an Ethernet switch]

slide-92
SLIDE 92

IT4 Innovations’18 92 Network Based Computing Laboratory

  • Introduction
  • Why InfiniBand and High-speed Ethernet?
  • Overview of IB, HSE, their Convergence and Features
  • Overview of Omni-Path Architecture
  • IB, Omni-Path, and HSE HW/SW Products and Installations
  • Sample Case Studies and Performance Numbers
  • Conclusions and Final Q&A

Presentation Overview

slide-93
SLIDE 93

IT4 Innovations’18 93 Network Based Computing Laboratory

  • PathScale (2003 – 2006) came up with the initial version of an IB-based product
  • QLogic enhanced the product with the PSM software interface
  • The IB product line of QLogic was acquired by Intel
  • Intel enhanced the QLogic IB product to create the Omni-Path product

A Brief History of Omni-Path

slide-94
SLIDE 94

IT4 Innovations’18 94 Network Based Computing Laboratory

Omni-Path Fabric Overview

Courtesy: Intel Corporation

  • Layer 1.5: Link Transfer Protocol

– Features

  • Traffic Flow Optimization
  • Packet Integrity Protection
  • Dynamic Lane Switching

– Error detection/replay occurs in Link Transfer Packet units – 1 Flit = 65b; LTP = 1056b = 16 flits + 14b CRC + 2b Credit – LTPs implicitly acknowledged – Retransmit request via NULL LTP; carries replay command flit

  • Layer 2: Link Layer

– Supports 24 bit fabric addresses – Allows 10KB of L4 payload; 10,368 byte max packet size – Congestion Management

  • Adaptive / Dispersive Routing
  • Explicit Congestion Notification

– QoS support

  • Traffic Class, Service Level, Service Channel and Virtual Lane
  • Layer 3: Data Link Layer

– Fabric addressing, switching, resource allocation and partitioning support

slide-95
SLIDE 95

IT4 Innovations’18 95 Network Based Computing Laboratory

Kernel Space

All Protocols Including Omni-Path

[Figure: all stack options from the previous diagrams (TCP/IP on Ethernet, IPoIB, 10/40 GigE TOE, RSockets, SDP, user-space iWARP, RoCE, native IB verbs), now including Omni-Path – applications use OFI over user-space RDMA on a 100 Gb/s Omni-Path adapter and switch]

slide-96
SLIDE 96

IT4 Innovations’18 96 Network Based Computing Laboratory

IB, Omni-Path, and HSE: Feature Comparison

Features compared across IB, iWARP/HSE, RoCE, RoCE v2, and Omni-Path:

Hardware Acceleration: IB – Yes; iWARP/HSE – Yes; RoCE – Yes; RoCE v2 – Yes; Omni-Path – Yes
RDMA: IB – Yes; iWARP/HSE – Yes; RoCE – Yes; RoCE v2 – Yes; Omni-Path – Yes
Congestion Control: IB – Yes; iWARP/HSE – Optional; RoCE – Yes; RoCE v2 – Yes; Omni-Path – Yes
Multipathing: IB – Yes; iWARP/HSE – Yes; RoCE – Yes; RoCE v2 – Yes; Omni-Path – Yes
Atomic Operations: IB – Yes; iWARP/HSE – No; RoCE – Yes; RoCE v2 – Yes; Omni-Path – Yes
Multicast: IB – Optional; iWARP/HSE – No; RoCE – Optional; RoCE v2 – Optional; Omni-Path – Optional
Data Placement: IB – Ordered; iWARP/HSE – Out-of-order; RoCE – Ordered; RoCE v2 – Ordered; Omni-Path – Ordered
Prioritization: IB – Optional; iWARP/HSE – Optional; RoCE – Yes; RoCE v2 – Yes; Omni-Path – Yes
Fixed BW QoS (ETS): IB – No; iWARP/HSE – Optional; RoCE – Yes; RoCE v2 – Yes; Omni-Path – Yes
Ethernet Compatibility: IB – No; iWARP/HSE – Yes; RoCE – Yes; RoCE v2 – Yes; Omni-Path – Yes
TCP/IP Compatibility: IB – Yes (using IPoIB); iWARP/HSE – Yes; RoCE – Yes (using IPoIB); RoCE v2 – Yes; Omni-Path – Yes

slide-97
SLIDE 97

IT4 Innovations’18 97 Network Based Computing Laboratory

  • Introduction
  • Why InfiniBand and High-speed Ethernet?
  • Overview of IB, HSE, their Convergence and Features
  • Overview of Omni-Path Architecture
  • IB, Omni-Path, and HSE HW/SW Products and Installations
  • Sample Case Studies and Performance Numbers
  • Conclusions and Final Q&A

Presentation Overview

slide-98
SLIDE 98

IT4 Innovations’18 98 Network Based Computing Laboratory

  • Many IB vendors: Mellanox, Voltaire (acquired by Mellanox) and QLogic (acquired by Intel)

– Aligned with many server vendors: Intel, IBM, Oracle, Dell – And many integrators: Appro, Advanced Clustering, Microway

  • New vendors like Oracle are entering the market with IB products
  • Broadly two kinds of adapters

– Offloading (Mellanox) and Onloading (Intel TrueScale / QLogic)

  • Adapters with different interfaces:

– Dual port 4X with PCI-X (64 bit/133 MHz), PCIe x8, PCIe 2.0, PCI 3.0 and HT

  • MemFree Adapter

– No memory on HCA → uses system memory (through PCIe) – Good for LOM designs (Tyan S2935, Supermicro 6015T-INFB)

  • Different speeds

– SDR (8 Gbps), DDR (16 Gbps), QDR (32 Gbps), FDR (56 Gbps), Dual-FDR (100Gbps), EDR (100 Gbps), HDR (200 Gbps)

  • ConnectX-2, ConnectX-3, Connect-IB, ConnectX-3 Pro, ConnectX-4, ConnectX-5, and ConnectX-6 adapters from Mellanox support offload for collectives (Barrier, Broadcast, etc.) and offload for tag matching

IB Hardware Products

slide-99
SLIDE 99

IT4 Innovations’18 99 Network Based Computing Laboratory

  • Switches:

– 4X SDR and DDR (8-288 ports); 12X SDR (small sizes) – 3456-port “Magnum” switch from SUN → used at TACC

  • 72-port “nano magnum”

– 36-port Mellanox InfiniScale IV QDR switch silicon in 2008

  • Up to 648-port QDR switch by Mellanox and SUN
  • Some internal ports are 96 Gbps (12X QDR)

– IB switch silicon from QLogic (Intel)

  • Up to 846-port QDR switch by QLogic

– FDR (54.6 Gbps) switch silicon (Bridge-X) and associated switches (18-648 ports) – EDR (100Gbps) switch from Oracle and Mellanox – Switch-X-2 silicon from Mellanox with VPI and SDN (Software Defined Networking) support announced in Oct ’12 – SwitchIB-2 from Mellanox EDR 100Gb/s, offloads MPI communications, announced Nov'15 – Quantum from Mellanox HDR 200Gb/s, offloads MPI communications, announced Nov'16

  • Switch Routers with Gateways

– IB-to-FC; IB-to-IP

IB Hardware Products (contd.)

slide-100
SLIDE 100

IT4 Innovations’18 100 Network Based Computing Laboratory

  • 10GE adapters

– Intel, Intilop, Myricom, Emulex, Mellanox (ConnectX, ConnectX-4 Lx EN), Solarflare (Flareon)

  • 10GE/iWARP adapters

– Chelsio, NetEffect (now owned by Intel)

  • 25GE adapters

– Mellanox ConnectX-4 Lx EN

  • 40GE adapters

– Mellanox ConnectX3-EN 40G, Mellanox ConnectX-4 Lx EN
– Chelsio (T5 2x40 GigE), Solarflare (Flareon)

  • 50GE adapters

– Mellanox ConnectX-4 Lx EN

  • 100GE adapters

– FPGA-based 100GE adapter from inveaTECH
– FPGA-based dual-port 100GE adapter from Accolade Technology (ANIC-200K)
– ConnectX-4 EN single/dual-port 100GE adapter from Mellanox

10G, 25G, 40G, 50G, 56G and 100G Ethernet Products

Prices for different adapters and switches are available from: http://colfaxdirect.com

  • 10GE switches

– Fulcrum Microsystems (acquired by Intel recently)

  • Low latency switch based on 24-port silicon
  • FM4000 switch with IP routing and TCP/UDP support

– Arista, Brocade, Cisco, Extreme, Force10, Fujitsu, Juniper, Gnodal and Myricom

  • 25GE, 40GE, 50GE, 56GE and 100GE switches

– Mellanox SN2410, SN2100, and SN2700 support 10/25/40/50/56/100 GE
– Gnodal, Arista, Brocade, Cisco, Juniper, Huawei, and Mellanox 40GE (SX series)
– Arista 7504R, 7508R, and 7512R support 10/25/40/100 GE
– Broadcom has switch architectures for 10/40/100GE: Trident, Trident2, Tomahawk, and Tomahawk2
– Nortel Networks: 10GE downlinks with 40GE and 100GE uplinks
– Mellanox Spectrum 25/100 Gigabit Open Ethernet-based switch
– Atrica A-8800 provides 100 GE optical Ethernet

slide-101
SLIDE 101

IT4 Innovations’18 101 Network Based Computing Laboratory

  • Intel Omni-Path Edge Switches 100 Series

– https://www.intel.com/content/www/us/en/products/network-io/high-performance-fabrics/omni-path-edge-switch-100-series.html

  • Intel Omni-Path Director Class Switches 100 Series

– https://www.intel.com/content/www/us/en/products/network-io/high-performance-fabrics/omni-path-director-class-switch-100-series.html

  • Intel Omni-Path Host Fabric Interface

– https://www.intel.com/content/www/us/en/products/network-io/high-performance-fabrics/omni-path-host-fabric-interface-adapters.html

Omni-Path Products

slide-102
SLIDE 102

IT4 Innovations’18 102 Network Based Computing Laboratory

  • Mellanox ConnectX Adapter
  • Supports IB and HSE convergence
  • Ports can be configured to support IB or HSE
  • Support for VPI and RoCE

– 8 Gbps (SDR), 16 Gbps (DDR), 32 Gbps (QDR), 54.6 Gbps (FDR), and 100 Gbps (EDR) rates available for IB
– 10GE, 40GE, 56GE (only with Mellanox switches), and 100GE rates available for RoCE

Products Providing IB and HSE Convergence

slide-103
SLIDE 103

IT4 Innovations’18 103 Network Based Computing Laboratory

  • Open source organization (formerly OpenIB)

– www.openfabrics.org

  • Incorporates IB, RoCE, and iWARP in a unified manner

– Support for Linux and Windows

  • Users can download the entire stack and run

– Latest stable release is OFED 4.8.1

  • New naming convention to align with Linux kernel development
  • OFED 4.8.2 is under development

Software Convergence with OpenFabrics

slide-104
SLIDE 104

IT4 Innovations’18 104 Network Based Computing Laboratory

OpenFabrics Stack with Unified Verbs Interface

Verbs Interface (libibverbs)

User-level provider libraries (under libibverbs): Mellanox (libmthca, libmlx*), Intel/QLogic (libipathverbs), IBM (libehca), Chelsio (libcxgb*), Emulex (libocrdma), Intel/NetEffect (libnes)

Kernel-level hardware-specific drivers: Mellanox (ib_mthca, ib_mlx*), Intel/QLogic (ib_ipath), IBM (ib_ehca), Chelsio (ib_cxgb*), Emulex, Intel/NetEffect, each driving the corresponding vendor's adapters
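Because every provider library above plugs in beneath the same libibverbs verbs interface, the same application code runs over IB, RoCE, and iWARP hardware. As a hedged illustration (not from the slides), the sketch below opens the first verbs device and reports whether its first port uses an InfiniBand or an Ethernet link layer; error handling is abbreviated.

```c
/* A minimal sketch, assuming libibverbs is installed. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **list = ibv_get_device_list(NULL);
    if (!list || !list[0]) {
        fprintf(stderr, "no verbs-capable devices\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(list[0]);
    struct ibv_port_attr port;

    /* Port numbering starts at 1 in the verbs API. */
    if (ctx && !ibv_query_port(ctx, 1, &port)) {
        printf("%s port 1: link layer = %s, state = %d\n",
               ibv_get_device_name(list[0]),
               port.link_layer == IBV_LINK_LAYER_ETHERNET ?
                   "Ethernet (RoCE/iWARP)" : "InfiniBand",
               port.state);
    }

    if (ctx)
        ibv_close_device(ctx);
    ibv_free_device_list(list);
    return 0;
}
```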

slide-105
SLIDE 105

IT4 Innovations’18 105 Network Based Computing Laboratory

OpenFabrics Software Stack

SA – Subnet Administrator
MAD – Management Datagram
SMA – Subnet Manager Agent
PMA – Performance Manager Agent
IPoIB – IP over InfiniBand
SDP – Sockets Direct Protocol
SRP – SCSI RDMA Protocol (Initiator)
iSER – iSCSI RDMA Protocol (Initiator)
RDS – Reliable Datagram Service
UDAPL – User Direct Access Programming Lib
HCA – Host Channel Adapter
R-NIC – RDMA NIC

Diagram: OpenFabrics software stack spanning user space and kernel space. Hardware-specific drivers sit below kernel-level verbs/APIs for InfiniBand HCAs and iWARP R-NICs; a common mid-layer provides the connection managers, connection manager abstraction (CMA), SA client, MAD, and SMA; user-level verbs/APIs (with kernel bypass) and uDAPL sit above, alongside the user-level MAD API, OpenSM, and diagnostic tools. Upper-layer protocols include IPoIB, SDP (with its user-level SDP library), SRP, iSER, RDS, NFS-RDMA RPC, and cluster file systems. Applications reach the stack through IP-based access, sockets-based access, various MPIs, access to file systems, block storage access, and clustered DB access.

slide-106
SLIDE 106

IT4 Innovations’18 106 Network Based Computing Laboratory

Libfabrics Software Stack

Courtesy: http://www.slideshare.net/seanhefty/ofi-overview?ref=http://ofiwg.github.io/libfabric/

Open Fabrics Interface (OFI)

Control Services: Discovery
Communication Services: Connection Management, Address Vectors
Completion Services: Event Queues, Counters
Data Transfer Services: Message Queues, Tag Matching, RMA, Atomics, Triggered Operations

OFI-enabled applications: MPI, SHMEM, PGAS

OFI provider: implements the same discovery, connection management, address vector, event queue, counter, message queue, tag matching, RMA, atomic, and triggered-operation services, mapped onto the NIC and its TX/RX command queues.
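As a hedged sketch of how middleware discovers OFI providers (not part of the original slides), the snippet below uses libfabric's fi_getinfo() to list the providers available on a node, for example psm2 for Omni-Path, verbs, or sockets, for reliable-datagram endpoints; the capability bits and build line are illustrative choices.

```c
/* Minimal OFI provider-discovery sketch, assuming libfabric is installed.
 * Build (illustrative):  cc list_providers.c -lfabric */
#include <stdio.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL, *cur;

    /* Ask for reliable, connectionless endpoints with messaging and RMA,
     * which is roughly what MPI-style middleware requests; everything
     * else is left open so all matching providers are returned. */
    hints->ep_attr->type = FI_EP_RDM;
    hints->caps = FI_MSG | FI_RMA;

    int ret = fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, &info);
    if (ret) {
        fprintf(stderr, "fi_getinfo failed: %d\n", ret);
        fi_freeinfo(hints);
        return 1;
    }

    for (cur = info; cur; cur = cur->next)
        printf("provider: %-10s fabric: %s\n",
               cur->fabric_attr->prov_name, cur->fabric_attr->name);

    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}
```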

slide-107
SLIDE 107

IT4 Innovations’18 107 Network Based Computing Laboratory

Trends of Networking Technologies in TOP500 Systems

Interconnect Family – Systems Share

slide-108
SLIDE 108

IT4 Innovations’18 108 Network Based Computing Laboratory

InfiniBand in the Top500 (November 2017)

Pie charts of interconnect family share, categories listed in the order 10G, InfiniBand, Custom Interconnect, Omni-Path, Gigabit Ethernet, Proprietary Network, Ethernet:

Count share:        41%, 33%, 13%, 7%, 5%, 1%, 0%
Performance share:  19%, 26%, 41%, 9%, 3%, 2%, 0%

slide-109
SLIDE 109

IT4 Innovations’18 109 Network Based Computing Laboratory

  • 163 IB Clusters (32.6%) in the Nov’17 Top500 list

– (http://www.top500.org)

  • Installations in the Top 50 (17 systems):

Large-scale InfiniBand Installations

19,860,000 core (Gyoukou) in Japan (4th)
241,108 core (Pleiades) at NASA/Ames (17th)
220,800 core (Pangea) in France (21st)
144,900 core (Cheyenne) at NCAR/USA (24th)
155,150 core (Jureca) in Germany (29th)
72,800 core Cray CS-Storm in US (30th)
72,800 core Cray CS-Storm in US (31st)
78,336 core (Electra) at NASA/USA (33rd)
124,200 core (Topaz) SGI ICE at ERDC DSRC in US (34th)
60,512 core (NVIDIA DGX-1/Relion) at Facebook in USA (35th)
60,512 core (DGX SATURN V) at NVIDIA/USA (36th)
72,000 core (HPC2) in Italy (37th)
152,692 core (Thunder) at AFRL/USA (40th)
99,072 core (Mistral) at DKRZ/Germany (42nd)
147,456 core (SuperMUC) in Germany (44th)
86,016 core (SuperMUC Phase 2) in Germany (45th)
74,520 core (Tsubame 2.5) at Japan/GSIC (48th)
66,000 core (HPC3) in Italy (51st)
194,616 core (Cascade) at PNNL (53rd)
and many more!

slide-110
SLIDE 110

IT4 Innovations’18 110 Network Based Computing Laboratory

  • 35 Omni-Path Clusters (7%) in the Nov’17 Top500 list

– (http://www.top500.org)

Large-scale Omni-Path Installations

556,104 core (Oakforest-PACS) at JCAHPC in Japan (9th)
368,928 core (Stampede2) at TACC in USA (12th)
135,828 core (Tsubame 3.0) at TiTech in Japan (13th)
314,384 core (Marconi XeonPhi) at CINECA in Italy (14th)
153,216 core (MareNostrum) at BSC in Spain (16th)
95,472 core (Quartz) at LLNL in USA (49th)
95,472 core (Jade) at LLNL in USA (50th)
49,432 core (Mogon II) at Universitaet Mainz in Germany (65th)
38,552 core (Molecular Simulator) in Japan (70th)
35,280 core (Quriosity) at BASF in Germany (71st)
54,432 core (Marconi Xeon) at CINECA in Italy (72nd)
46,464 core (Peta4) at University of Cambridge in UK (75th)
53,352 core (Grizzly) at LANL in USA (85th)
45,680 core (Endeavor) at Intel in USA (86th)
59,776 core (Cedar) at SFU in Canada (94th)
27,200 core (Peta HPC) in Taiwan (95th)
39,774 core (Nel) at LLNL in USA (101st)
40,392 core (Serrano) at SNL in USA (112th)
40,392 core (Cayenne) at SNL in USA (113th)
and many more!

slide-111
SLIDE 111

IT4 Innovations’18 111 Network Based Computing Laboratory

HSE Scientific Computing Installations

  • 204 HSE compute systems with ranking in the Nov’17 Top500 list

– 39,680-core installation in China (#73)
– 66,560-core installation in China (#101) – new
– 66,280-core installation in China (#103) – new
– 64,000-core installation in China (#104) – new
– 64,000-core installation in China (#105) – new
– 72,000-core installation in China (#108) – new
– 78,000-core installation in China (#125)
– 59,520-core installation in China (#128) – new
– 59,520-core installation in China (#129) – new
– 64,800-core installation in China (#130) – new
– 67,200-core installation in China (#134) – new
– 57,600-core installation in China (#135) – new
– 57,600-core installation in China (#136) – new
– 64,000-core installation in China (#138) – new
– 84,000-core installation in China (#139)
– 84,000-core installation in China (#140)
– 51,840-core installation in China (#151) – new
– 51,200-core installation in China (#156) – new
– and many more!

slide-112
SLIDE 112

IT4 Innovations’18 112 Network Based Computing Laboratory

  • HSE has most of its popularity in enterprise computing

and other non-scientific markets, including wide-area networking

  • Example Enterprise Computing Domains

– Enterprise Datacenters (HP, Intel)
– Animation firms (e.g., Universal Studios (“The Hulk”), 20th Century Fox (“Avatar”), and many new movies using 10GE)
– Amazon’s HPC cloud offering uses 10GE internally
– Heavily used in financial markets (users are typically undisclosed)

  • Many Network-attached Storage devices come

integrated with 10GE network adapters

  • ESnet has installed a 100GE infrastructure for US DOE

and has recently expanded it across the Atlantic to Europe

  • They also have a 100G SDN overlay network from LBL

to StarLight

Other HSE Installations

Courtesy ESnet

https://www.es.net/network-r-and-d/experimental-network-testbeds/100g-sdn-testbed/ https://www.es.net/news-and-publications/esnet-news/2014/esnet-extends-100g-connectivity-across-atlantic/

slide-113
SLIDE 113

IT4 Innovations’18 113 Network Based Computing Laboratory

  • Introduction
  • Why InfiniBand and High-speed Ethernet?
  • Overview of IB, HSE, their Convergence and Features
  • Overview of Omni-Path Architecture
  • IB, Omni-Path, and HSE HW/SW Products and Installations
  • Sample Case Studies and Performance Numbers
  • Conclusions and Final Q&A

Presentation Overview

slide-114
SLIDE 114

IT4 Innovations’18 114 Network Based Computing Laboratory

  • Low-level Performance
  • Message Passing Interface (MPI)

Case Studies

slide-115
SLIDE 115

IT4 Innovations’18 115 Network Based Computing Laboratory

Low-level Latency Measurements

Charts: small-message and large-message latency (us) vs. message size (bytes) for IB-EDR (100 Gbps) and RoCE (100 Gbps); small-message latencies shown are 0.73 and 0.86 us.

Platforms: ConnectX-4 EDR (100 Gbps) and ConnectX-4 EN (100 Gbps), 3.1 GHz Deca-core (Haswell) Intel, back-to-back.

slide-116
SLIDE 116

IT4 Innovations’18 116 Network Based Computing Laboratory

Low-level Uni-directional Bandwidth Measurements

Chart: uni-directional bandwidth (MBytes/sec) vs. message size (bytes) for IB-EDR (100 Gbps) and RoCE (100 Gbps); peak values shown are 11,459 and 12,404 MBytes/sec.

Platforms: ConnectX-4 EDR (100 Gbps) and ConnectX-4 EN (100 Gbps), 3.1 GHz Deca-core (Haswell) Intel, back-to-back.

slide-117
SLIDE 117

IT4 Innovations’18 117 Network Based Computing Laboratory

Low-level Latency Measurements

Charts: small-message and large-message latency (us) vs. message size (bytes) for Sockets, Rsockets, and IB-Verbs; highlighted latencies are 5.94, 0.73, and 0.90 us.

Platform: ConnectX-4 EDR (100 Gbps), 3.1 GHz Deca-core (Haswell) Intel, back-to-back.

slide-118
SLIDE 118

IT4 Innovations’18 118 Network Based Computing Laboratory

Low-level Uni-directional Bandwidth Measurements

Chart: uni-directional bandwidth (MBytes/sec) vs. message size (bytes) for Sockets, Rsockets, and IB-Verbs; peak values shown are 12,583, 12,405, and 3,065 MBytes/sec.

Platform: ConnectX-4 EDR (100 Gbps), 3.1 GHz Deca-core (Haswell) Intel, back-to-back.
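RSockets, as measured above, refers to the rsockets interface provided by librdmacm, which mirrors the BSD sockets API while moving data over RDMA. A minimal client-side sketch is shown below, assuming librdmacm with rsockets support; the host name, port, and message are illustrative and this is not the benchmark used for these measurements.

```c
/* Hypothetical rsockets ping sketch, assuming librdmacm with rsockets support.
 * Build (illustrative):  cc rs_ping.c -lrdmacm */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <rdma/rsocket.h>

int main(void)
{
    struct addrinfo hints = { .ai_socktype = SOCK_STREAM }, *res;
    char buf[64] = "ping";

    /* Resolve a hypothetical server; rsockets reuses normal IP addressing. */
    if (getaddrinfo("server.example.com", "7471", &hints, &res))
        return 1;

    /* Same calls as sockets, with an r- prefix: the data path is RDMA. */
    int fd = rsocket(res->ai_family, res->ai_socktype, 0);
    if (fd < 0 || rconnect(fd, res->ai_addr, res->ai_addrlen)) {
        perror("rsocket/rconnect");
        return 1;
    }

    rsend(fd, buf, strlen(buf) + 1, 0);
    ssize_t n = rrecv(fd, buf, sizeof(buf), 0);
    if (n > 0)
        printf("echoed %zd bytes: %s\n", n, buf);

    rclose(fd);
    freeaddrinfo(res);
    return 0;
}
```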

slide-119
SLIDE 119

IT4 Innovations’18 119 Network Based Computing Laboratory

  • Low-level Performance
  • Message Passing Interface (MPI)

Case Studies

slide-120
SLIDE 120

IT4 Innovations’18 120 Network Based Computing Laboratory

Overview of the MVAPICH2 Project

  • High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Started in 2001, First version available in 2002
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015

– Used by more than 2,850 organizations in 85 countries
– More than 440,000 (> 0.44 million) downloads from the OSU site directly

– Empowering many TOP500 clusters (Nov ‘17 ranking)

  • 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
  • 12th, 368,928-core (Stampede2) at TACC
  • 17th, 241,108-core (Pleiades) at NASA
  • 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology

– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)

– http://mvapich.cse.ohio-state.edu

  • Empowering Top500 systems for over a decade

– System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
– Sunway TaihuLight (1st in Jun’17, 10M cores, 100 PFlops)
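The MPI latency and bandwidth results on the following slides are point-to-point ping-pong style measurements, similar in spirit to the OSU micro-benchmarks distributed with MVAPICH2. A minimal ping-pong latency sketch is shown below (not the actual OSU benchmark code; the iteration count and message size are illustrative).

```c
/* Minimal MPI ping-pong latency sketch between ranks 0 and 1.
 * Build/run (illustrative):  mpicc pingpong.c -o pingpong && mpirun -np 2 ./pingpong */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const int iters = 1000, size = 8;   /* illustrative values */
    char buf[8] = {0};
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)   /* one-way latency is half the average round-trip time */
        printf("avg one-way latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * iters));

    MPI_Finalize();
    return 0;
}
```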

slide-121
SLIDE 121

IT4 Innovations’18 121 Network Based Computing Laboratory

One-way Latency: MPI over IB with MVAPICH2

Charts: small-message and large-message one-way latency (us) vs. message size (bytes) for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR, ConnectX-4-EDR, and Omni-Path; highlighted small-message latencies are 1.11, 1.19, 1.01, 1.15, and 1.04 us.

Test platforms:
TrueScale-QDR – 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
ConnectX-3-FDR – 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR – 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
ConnectX-4-EDR – 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
Omni-Path – 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch

slide-122
SLIDE 122

IT4 Innovations’18 122 Network Based Computing Laboratory

Bandwidth: MPI over IB with MVAPICH2

Charts: uni-directional and bi-directional bandwidth (MBytes/sec) vs. message size (bytes) for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR, ConnectX-4-EDR, and Omni-Path; highlighted uni-directional peaks are 12,590, 3,373, 6,356, 12,083, and 12,366 MBytes/sec, and bi-directional peaks are 21,227, 12,161, 21,983, 6,228, and 24,136 MBytes/sec.

Test platforms:
TrueScale-QDR – 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
ConnectX-3-FDR – 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR – 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
ConnectX-4-EDR – 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
Omni-Path – 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch

slide-123
SLIDE 123

IT4 Innovations’18 123 Network Based Computing Laboratory

One-way Latency: MPI over iWARP

Chart: one-way latency (us) vs. message size (bytes) for Chelsio T4 (TCP/IP), Chelsio T4 (iWARP), Intel-NetEffect NE20 (TCP/IP), and Intel-NetEffect NE20 (iWARP); highlighted latencies are 13.44, 4.64, 11.32, and 5.59 us.

Platform: 2.6 GHz Dual Eight-core (SandyBridge) Intel; Chelsio T4 cards connected through a Fujitsu xg2600 10GigE switch; Intel NetEffect cards connected through a Fulcrum 10GigE switch.

slide-124
SLIDE 124

IT4 Innovations’18 124 Network Based Computing Laboratory

Bandwidth: MPI over iWARP

Chart: uni-directional bandwidth (MBytes/sec) vs. message size (bytes) for Chelsio T4 (TCP/IP), Chelsio T4 (iWARP), Intel-NetEffect NE20 (TCP/IP), and Intel-NetEffect NE20 (iWARP); peak values shown are 1168, 1181, 1169, and 1176 MBytes/sec.

Platform: 2.6 GHz Dual Eight-core (SandyBridge) Intel; Chelsio T4 cards connected through a Fujitsu xg2600 10GigE switch; Intel NetEffect cards connected through a Fulcrum 10GigE switch.

slide-125
SLIDE 125

IT4 Innovations’18 125 Network Based Computing Laboratory

Convergent Technologies: MPI Latency

Chart: one-way MPI latency (us) vs. message size (bytes) for IB-EDR (100 Gbps) and RoCE (100 Gbps); small-message latencies shown are 1.04 and 0.91 us.

Platforms: ConnectX-4 EDR (100 Gbps) and ConnectX-4 EN (100 Gbps), 3.1 GHz Deca-core (Haswell) Intel, back-to-back.

slide-126
SLIDE 126

IT4 Innovations’18 126 Network Based Computing Laboratory

Charts: uni-directional and bi-directional bandwidth (MBytes/sec) vs. message size (bytes) for IB-EDR (100 Gbps) and RoCE (100 Gbps).

Convergent Technologies: MPI Uni- and Bi-directional Bandwidth

Platforms: ConnectX-4 EDR (100 Gbps) and ConnectX-4 EN (100 Gbps), 3.1 GHz Deca-core (Haswell) Intel, back-to-back.

Uni-directional peaks shown are 11,436 and 12,097 MBytes/sec; bi-directional peaks shown are 21,072 and 21,169 MBytes/sec.

slide-127
SLIDE 127

IT4 Innovations’18 127 Network Based Computing Laboratory

  • Introduction
  • Why InfiniBand and High-speed Ethernet?
  • Overview of IB, HSE, their Convergence and Features
  • Overview of Omni-Path Architecture
  • IB, Omni-Path, and HSE HW/SW Products and Installations
  • Sample Case Studies and Performance Numbers
  • Conclusions and Final Q&A

Presentation Overview

slide-128
SLIDE 128

IT4 Innovations’18 128 Network Based Computing Laboratory

  • Presented network architectures & trends in Clusters
  • Presented background and details of IB, Omni-Path and HSE

– Highlighted the main features of IB and HSE and their convergence
– Gave an overview of IB, Omni-Path, and HSE hardware/software products
– Discussed sample performance numbers in designing various high-end systems with IB, Omni-Path, and HSE

  • IB, Omni-Path, and HSE are emerging as new architectures leading to a new

generation of networked computing systems, opening many research issues that need novel solutions

Concluding Remarks

slide-129
SLIDE 129

IT4 Innovations’18 129 Network Based Computing Laboratory

Funding Acknowledgments

Funding Support by Equipment Support by

slide-130
SLIDE 130

IT4 Innovations’18 130 Network Based Computing Laboratory

Personnel Acknowledgments

Current Students (Graduate)

  • A. Awan (Ph.D.)

  • R. Biswas (M.S.)

  • M. Bayatpour (Ph.D.)

  • S. Chakraborthy (Ph.D.)

  • C.-H. Chu (Ph.D.)

  • S. Guganani (Ph.D.)

Past Students

  • A. Augustine (M.S.)

  • P. Balaji (Ph.D.)

  • S. Bhagvat (M.S.)

  • A. Bhat (M.S.)

  • D. Buntinas (Ph.D.)

  • L. Chai (Ph.D.)

  • B. Chandrasekharan (M.S.)

  • N. Dandapanthula (M.S.)

  • V. Dhanraj (M.S.)

  • T. Gangadharappa (M.S.)

  • K. Gopalakrishnan (M.S.)

  • R. Rajachandrasekar (Ph.D.)

  • G. Santhanaraman (Ph.D.)

  • A. Singh (Ph.D.)

  • J. Sridhar (M.S.)

  • S. Sur (Ph.D.)

  • H. Subramoni (Ph.D.)

  • K. Vaidyanathan (Ph.D.)

  • A. Vishnu (Ph.D.)

  • J. Wu (Ph.D.)

  • W. Yu (Ph.D.)

Past Research Scientist

  • K. Hamidouche

  • S. Sur

Past Post-Docs

  • D. Banerjee

  • X. Besseron

  • H.-W. Jin

  • W. Huang (Ph.D.)

  • W. Jiang (M.S.)

  • J. Jose (Ph.D.)

  • S. Kini (M.S.)

  • M. Koop (Ph.D.)

  • K. Kulkarni (M.S.)

  • R. Kumar (M.S.)

  • S. Krishnamoorthy (M.S.)

  • K. Kandalla (Ph.D.)

  • M. Li (Ph.D.)

  • P. Lai (M.S.)

  • J. Liu (Ph.D.)

  • M. Luo (Ph.D.)

  • A. Mamidala (Ph.D.)

  • G. Marsh (M.S.)

  • V. Meshram (M.S.)

  • A. Moody (M.S.)

  • S. Naravula (Ph.D.)

  • R. Noronha (Ph.D.)

  • X. Ouyang (Ph.D.)

  • S. Pai (M.S.)

  • S. Potluri (Ph.D.)

  • J. Hashmi (Ph.D.)

  • H. Javed (Ph.D.)

  • P. Kousha (Ph.D.)

  • D. Shankar (Ph.D.)

  • H. Shi (Ph.D.)

  • J. Zhang (Ph.D.)

  • J. Lin

  • M. Luo

  • E. Mancini

Current Research Scientists

  • X. Lu

  • H. Subramoni

Past Programmers

  • D. Bureddy

  • J. Perkins

Current Research Specialist

  • J. Smith

  • M. Arnold

  • S. Marcarelli

  • J. Vienne

  • H. Wang

Current Post-doc

  • A. Ruhela

Current Students (Undergraduate)

  • N. Sarkauskas (B.S.)
slide-131
SLIDE 131

IT4 Innovations’18 131 Network Based Computing Laboratory

Thank You!

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

panda@cse.ohio-state.edu; subramon@cse.ohio-state.edu

The High-Performance MPI/PGAS Project http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project http://hidl.cse.ohio-state.edu/