InfiniBand, Omni-Path, and High-Speed Ethernet: Advanced Features, Challenges in Designing HEC Systems, and Usage — A Tutorial at IT4
Latest version of the slides can be obtained from http://www.cse.ohio-state.edu/~panda/it4-advanced.pdf
Network Based Computing Laboratory 2 IT4 Innovations’18
High-End Computing (HEC): ExaFlop & ExaByte
- ExaFlop: 100 PFlops in 2016; 1 EFlops in 2019-2020? Expected to have an ExaFlop system in 2019-2020!
- ExaByte & BigData: 10K-20K EBytes in 2016-2018; 40K EBytes in 2020?
- Compute Clusters
- Storage Clusters
- Multi-tier Data Centers
- Cloud Computing Environments
- Big Data Processing (Hadoop and Spark)
- Web 2.0 with Memcached
Various High-End Computing (HEC) Systems
Various Clusters (Compute, Storage and Datacenters)
(Figure: a compute cluster — frontend and compute nodes on a LAN — connected over LAN/WAN to a storage cluster with a meta-data manager and multiple I/O server nodes holding meta data and data)
Enterprise Multi-tier Datacenter for Visualization and Mining
(Figure: three-tier datacenter — Tier 1 routers/servers, Tier 2 application servers, and Tier 3 database servers, connected by switches)
Cloud Computing Environments
(Figure: physical machines hosting virtual machines, connected over LAN/WAN to a virtual network file system backed by a physical meta-data manager and physical I/O server nodes holding meta data and data)
- Open-source implementation of Google MapReduce, GFS, and BigTable for Big Data Analytics
- Hadoop Common Utilities (RPC, etc.), HDFS, MapReduce, YARN
- http://hadoop.apache.org
Overview of Apache Hadoop Architecture
- Hadoop 1.x: MapReduce (cluster resource management & data processing) over the Hadoop Distributed File System (HDFS) and Hadoop Common/Core (RPC, ..)
- Hadoop 2.x: MapReduce and other models (data processing) over YARN (cluster resource management & job scheduling), HDFS, and Hadoop Common/Core (RPC, ..)
Big Data Processing with Hadoop Components
- Major components included in this tutorial:
  – MapReduce (Batch)
  – HBase (Query)
  – HDFS (Storage)
  – RPC
- Underlying Hadoop Distributed File System (HDFS) used by both MapReduce and HBase
- Model scales, but the high amount of communication during intermediate phases can be further optimized
(Figure: user applications on top of the Hadoop framework — MapReduce and HBase over HDFS and Hadoop Common (RPC))
Spark Architecture Overview
- An in-memory data-processing framework
  – Iterative machine learning jobs
  – Interactive data analytics
  – Scala-based implementation
  – Standalone, YARN, Mesos
- Scalable and communication intensive
  – Wide dependencies between Resilient Distributed Datasets (RDDs)
  – MapReduce-like shuffle operations to repartition RDDs
  – Sockets-based communication
http://spark.apache.org
Memcached Architecture
- Distributed Caching Layer
  – Allows aggregating spare memory from multiple nodes
  – General purpose
- Typically used to cache database queries, results of API calls
- Scalable model, but typical usage very network intensive
Big Data
(Hadoop, Spark, HBase, Memcached, etc.)
Deep Learning
(Caffe, TensorFlow, BigDL, etc.)
HPC
(MPI, RDMA, Lustre, etc.)
Increasing Usage of HPC, Big Data and Deep Learning
Convergence of HPC, Big Data, and Deep Learning! Increasing Need to Run these applications on the Cloud!!
Drivers of Modern HPC Cluster Architectures
- Multi-core/many-core technologies
- Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
- Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
- Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
- Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.
(Figure: modern HPC node components — multi-core processors; SSD, NVMe-SSD, NVRAM; accelerators/coprocessors with high compute density and high performance/watt, >1 TFlop DP on a chip; high-performance interconnects such as InfiniBand with <1 usec latency and 100 Gbps bandwidth. Example systems: Tianhe-2, Titan, K Computer, Sunway TaihuLight)
Modern Interconnects and Protocols with IB, HSE, and Omni-Path
(Figure: protocol stacks beneath a common application/middleware interface — kernel-space sockets over TCP/IP on Ethernet adapters; hardware-offloaded TCP/IP on 10/40 GigE TOE adapters; IPoIB over InfiniBand; user-space RSockets and SDP over InfiniBand; user-space iWARP over Ethernet; user-space RDMA with RoCE over 1/10/25/40/50/100 GigE; native IB verbs with RDMA over InfiniBand; user-space OFI with RDMA over 100 Gb/s Omni-Path)
- 163 IB Clusters (32.6%) in the Nov’17 Top500 list
– (http://www.top500.org)
- Installations in the Top 50 (17 systems):
Large-scale InfiniBand Installations
– 19,860,000 core (Gyoukou) in Japan (4th)
– 241,108 core (Pleiades) at NASA/Ames (17th)
– 220,800 core (Pangea) in France (21st)
– 144,900 core (Cheyenne) at NCAR/USA (24th)
– 155,150 core (Jureca) in Germany (29th)
– 72,800 core Cray CS-Storm in US (30th)
– 72,800 core Cray CS-Storm in US (31st)
– 78,336 core (Electra) at NASA/USA (33rd)
– 124,200 core (Topaz) SGI ICE at ERDC DSRC in US (34th)
– 60,512 core (NVIDIA DGX-1/Relion) at Facebook in USA (35th)
– 60,512 core (DGX SATURN V) at NVIDIA/USA (36th)
– 72,000 core (HPC2) in Italy (37th)
– 152,692 core (Thunder) at AFRL/USA (40th)
– 99,072 core (Mistral) at DKRZ/Germany (42nd)
– 147,456 core (SuperMUC) in Germany (44th)
– 86,016 core (SuperMUC Phase 2) in Germany (45th)
– 74,520 core (Tsubame 2.5) at Japan/GSIC (48th)
– 66,000 core (HPC3) in Italy (51st)
– 194,616 core (Cascade) at PNNL (53rd)
– and many more!
- 35 Omni-Path Clusters (7%) in the Nov’17 Top500 list
– (http://www.top500.org)
Large-scale Omni-Path Installations
– 556,104 core (Oakforest-PACS) at JCAHPC in Japan (9th)
– 368,928 core (Stampede2) at TACC in USA (12th)
– 135,828 core (Tsubame 3.0) at TiTech in Japan (13th)
– 314,384 core (Marconi XeonPhi) at CINECA in Italy (14th)
– 153,216 core (MareNostrum) at BSC in Spain (16th)
– 95,472 core (Quartz) at LLNL in USA (49th)
– 95,472 core (Jade) at LLNL in USA (50th)
– 49,432 core (Mogon II) at Universitaet Mainz in Germany (65th)
– 38,552 core (Molecular Simulator) in Japan (70th)
– 35,280 core (Quriosity) at BASF in Germany (71st)
– 54,432 core (Marconi Xeon) at CINECA in Italy (72nd)
– 46,464 core (Peta4) at University of Cambridge in UK (75th)
– 53,352 core (Grizzly) at LANL in USA (85th)
– 45,680 core (Endeavor) at Intel in USA (86th)
– 59,776 core (Cedar) at SFU in Canada (94th)
– 27,200 core (Peta HPC) in Taiwan (95th)
– 39,774 core (Nel) at LLNL in USA (101st)
– 40,392 core (Serrano) at SNL in USA (112th)
– 40,392 core (Cayenne) at SNL in USA (113th)
– and many more!
- Advanced Features for InfiniBand
- Advanced Features for High Speed Ethernet
- RDMA over Converged Ethernet
- Open Fabrics Software Stack and RDMA Programming
- Libfabrics Software Stack and Programming
- Network Management Infrastructure and Tools
- Common Challenges in Building HEC Systems with IB and HSE
  – Network Adapters and NUMA Interactions
  – Network Switches, Topology and Routing
  – Network Bridges
- System Specific Challenges and Case Studies
  – HPC (MPI, PGAS and GPU/Xeon Phi Computing)
  – Deep Learning
  – Cloud Computing
- Conclusions and Final Q&A
Presentation Overview
Advanced Features of InfiniBand
- Shared Receive Queue (SRQ) and eXtended Reliable Connection (XRC)
- Dynamically Connected Transport (DCT)
- User-Mode Memory Registration (UMR)
- On-demand Paging
- Core-Direct Offload
- SHArP
- Different transport protocols with IB
– Reliable Connection (RC) is the most common – Unreliable Datagram (UD) is used in some cases
- Buffers need to be posted at each receiver to receive messages from any sender
– Buffer requirement can increase with system size
- Connections need to be established across processes under RC
  – Each connection requires a certain amount of memory for handling related data structures
  – Memory required for all connections can increase with system size
- Both issues have become critical as large-scale IB deployments have taken place
– Being addressed by both IB specification and upper-level middleware
Memory overheads in large-scale systems
- SRQ is a hardware mechanism for a process to share receive resources (memory) across multiple connections
  – Introduced in specification v1.2
- With P receive buffers per connection, M processes per node, and N nodes, a single SRQ needs only Q buffers, where 0 < Q << P*((M*N)-1)
Shared Receive Queue (SRQ)
(Figure: one RQ per connection — P buffers on each of the (M*N)-1 connections of a process — versus one SRQ of depth Q shared across all connections)
- Each QP takes at least one page of memory
  – Establishing connections between all processes is very costly for RC
- New IB Transport added: eXtended Reliable Connection
– Allows connections between nodes instead of processes
eXtended Reliable Connection (XRC)
(Figure: RC requires M^2 x (N - 1) connections/node; XRC requires M x (N - 1) connections/node, where M = # of processes/node and N = # of nodes)
- XRC uses SRQ Numbers (SRQN) to direct where an operation should complete
- Hardware does all routing of data, so p2 is not actually involved in the data transfer
- Connections are not bi-directional, so p3 cannot send to p0
XRC Addressing
(Figure: processes 0-3 with SRQ #1 and SRQ #2; sends are addressed to a specific SRQ number on the target node, e.g., "send to #2" and "send to #1")
DC Connection Model, Communication Objects and Addressing Scheme
- Constant connection cost
– One QP for any peer
- Full Feature Set
  – RDMA, atomics, etc.
(Figure: four nodes, each with two processes, connected through the IB network)
- Communication Objects & Addressing Scheme
– DCINI
- Analogous to the send QPs
- Can transmit data to any peer
– DCTGT
- Receive objects
- Must be backed by SRQ
- Identified on a node by “DCT Number”
  – Messages routed with a combination of DCT Number + LID
  – Requires a “DC Key” to enable communication
- Must be same across all processes
- Supports direct local and remote non-contiguous memory access
- Avoids packing at the sender and unpacking at the receiver
User-Mode Memory Registration
Steps to create memory regions with UMR:
1. UMR creation request — send number of blocks
2. HCA issues uninitialized memory keys for future UMR use
3. Kernel maps virtual to physical and pins the region into physical memory
4. HCA caches the virtual-to-physical mapping
(Figure: process, kernel, and HCA/RNIC exchanging these four steps)
- Applications no longer need to pin down the underlying physical pages
- Memory Regions (MRs) are never pinned by the OS
  - Paged in by the HCA when needed
  - Paged out by the OS when reclaimed
- ODP can be divided into two classes
  – Explicit ODP
    - Applications still register memory buffers for communication, but this operation is used to define access control for IO rather than to pin down the pages
  – Implicit ODP
    - Applications are provided with a special memory key that represents their complete address space; no need to register any virtual address range
- Advantages
  - Simplifies programming
  - Unlimited MR sizes
  - Physical memory optimization
On-Demand Paging (ODP)
- Introduced by Mellanox to avoid pinning the pages of registered memory regions
- ODP-aware runtime could reduce the size of pin-down buffers while maintaining performance
Implicit On-Demand Paging (ODP)
- M. Li, X. Lu, H. Subramoni, and D. K. Panda, “Designing Registration Caching Free High-Performance MPI Library with Implicit On-Demand Paging (ODP) of InfiniBand”, HiPC ’17
(Figure: execution time in seconds, log scale 0.1-100, for CG, EP, FT, IS, MG, LU, SP, AWP-ODC, and Graph500 at 256 processes, comparing Pin-down, Explicit-ODP, and Implicit-ODP)
Collective Offload Support on the Adapters
- Performance of collective operations (broadcast, barrier, reduction, all-reduce, etc.) is very critical to the overall performance of MPI applications
- Currently being done with basic pt-to-pt operations (send/recv and RDMA) using host-based operations
- Mellanox ConnectX-2, ConnectX-3, ConnectX-4, and ConnectX-5 adapters support offloading some of these operations to the adapters (CORE-Direct)
  – Provides overlap of computation and collective communication
  – Reduces OS jitter (since everything is done in hardware)
One-to-many Multi-Send
- Sender creates a task-list consisting of only send and wait WQEs
  – One send WQE is created for each registered receiver and is appended to the rear of a singly linked task-list
  – A wait WQE is added to make the HCA wait for an ACK packet from the receiver
(Figure: application task-list — send, wait, send, send, send, wait — handed to the InfiniBand HCA’s send/receive queues, completion queues, MQ, and MCQ over the physical link)
- Management and execution of MPI operations in the network by using SHArP
- Manipulation of data while it is being transferred in the switch network
- SHArP provides an abstraction to realize the reduction operation
- Defines Aggregation Nodes (AN), Aggregation Tree, and Aggregation Groups
- AN logic is implemented as an InfiniBand Target Channel Adapter (TCA) integrated into the switch ASIC*
- Uses RC for communication between ANs and between AN and hosts in the Aggregation Tree*
Physical Network Topology*
* Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction. R. L. Graham, D. Bureddy, P. Lui, G. Shainer, H. Rosenstock, G. Bloch, D. Goldenberg, M. Dubman, S. Kotchubievsky, V. Koushnir, L. Levi, A. Margolin, T. Ronen, A. Shpiner, O. Wertheim, E. Zahavi, Mellanox Technologies, Inc. First Workshop on Optimization of Communication in HPC Runtime Systems (COM-HPC 2016)
Logical SHArP Tree*
Scalable Hierarchical Aggregation Protocol (SHArP)
The Ethernet Ecosystem
Courtesy: Scott Kipp @ Ethernet Alliance - http://www.ethernetalliance.org/roadmap/
Courtesy: http://www.eetimes.com/document.asp?doc_id=1323184 and http://www.networkcomputing.com/data-centers/25-gbe-big-deal-will-arrive/1714647938
Emergence of 25 GigE and Benefits
Slash top-of-rack switches (Source: IEEE 802.3)
Courtesy http://www.plexxi.com/2014/07/whats-25-gigabit-ethernet-want/ http://www.qlogic.com/Products/adapters/Pages/25Gb-Ethernet.aspx
- Requires half the number of lanes compared to 40G (4 instead of 8 PCIe Gen3 lanes)
- Better PCIe bandwidth utilization (25/32 = 78% vs. 40/64 = 62.5%) with lower power impact
Matching PCIe and Ethernet Speeds
Ethernet Rate (Gb/s) | PCIe Gen3 Lanes (Single Port) | PCIe Gen3 Lanes (Dual Port)
100                  | 16                            | 32 (uncommon)
40                   | 8                             | 16
25                   | 4                             | 8
10                   | 2                             | 4
Courtesy: http://www.ieee802.org/3/cfi/0314_3/CFI_03_0314.pdf
- 25G & 50G Ethernet specification extends IEEE 802.3 to work at increased data rates
- Features in Draft 1.4 of the specification
  – PCS/PMA operation at 25 Gb/s over a single lane
  – PCS/PMA operation at 25 Gb/s over two lanes
  – Optional Forward Error Correction modes
  – Optional auto-negotiation using an OUI next page
  – Optional link training
- Standards for 50 Gb/s, 200 Gb/s and 400 Gb/s under development
  – Expected around 2017 – 2018?
Detailed Specifications for 25 and 50 GigE and Looking Forward
Courtesy: Scott Kipp @ Ethernet Alliance - http://www.ethernetalliance.org/roadmap/ Next standards by 2017 – 2018?
Ethernet Roadmap – To Terabit Speeds?
- 50G, 100G, 200G and 400G by 2018-2019
- Terabit speeds by 2025?!?!
Courtesy: Scott Kipp @ Ethernet Alliance - http://www.ethernetalliance.org/roadmap/
RDMA over Converged Enhanced Ethernet
(Figure: network stack comparison — IB verbs applications over native InfiniBand (IB transport and network over the InfiniBand link layer), RoCE (IB transport and network over the Ethernet link layer), and RoCE v2 (IB transport over UDP/IP over the Ethernet link layer))
- Takes advantage of IB and Ethernet
  – Software written with IB-Verbs
  – Link layer is Converged (Enhanced) Ethernet (CE)
- Pros of RoCE:
– Works natively in Ethernet environments
- Entire Ethernet management ecosystem is available
  – Has all the benefits of IB verbs
  – Link layer is very similar to the link layer of native IB, so there are no missing features
- RoCE v2: Additional Benefits over RoCE
  – Traditional network management tools apply
  – ACLs (metering, accounting, firewalling)
  – IGMP snooping for optimized multicast
  – Network monitoring tools
- Cons:
  – Network bandwidth might be limited to what Ethernet switches offer
    - 10/40GE switches available; 56 Gbps IB is available
Courtesy: OFED, Mellanox
Packet header comparison:
- RoCE: ETH L2 Hdr (Ethertype) | IB GRH (L3 Hdr) | IB BTH+ (L4 Hdr)
- RoCE v2: ETH L2 Hdr (Ethertype) | IP Hdr (L3 Hdr, Proto #) | UDP Hdr (Port #) | IB BTH+ (L4 Hdr)
- Open source organization (formerly OpenIB)
– www.openfabrics.org
- Incorporates IB, RoCE, and iWARP in a unified manner
– Support for Linux and Windows
- Users can download the entire stack and run
– Latest stable release is OFED 4.8.1
- New naming convention aligned with Linux kernel development
- OFED 4.8.2 is under development
Software Convergence with OpenFabrics
OpenFabrics Software Stack
Acronyms: SA = Subnet Administrator; MAD = Management Datagram; SMA = Subnet Manager Agent; PMA = Performance Manager Agent; IPoIB = IP over InfiniBand; SDP = Sockets Direct Protocol; SRP = SCSI RDMA Protocol (Initiator); iSER = iSCSI RDMA Protocol (Initiator); RDS = Reliable Datagram Service; UDAPL = User Direct Access Programming Lib; HCA = Host Channel Adapter; R-NIC = RDMA NIC
(Figure: OpenFabrics software stack — hardware-specific drivers for InfiniBand HCAs and iWARP R-NICs; kernel-level verbs/API with connection managers, MAD, SA client, and connection manager abstraction (CMA); mid-layer and upper-layer protocols including SDP, IPoIB, SRP, iSER, RDS, and NFS-RDMA RPC / cluster file systems; user-space components including user-level verbs/API, SDP lib, user-level MAD API, OpenSM, and diag tools; applications reach the stack through clustered DB access, sockets-based access, various MPIs, file system and block storage access, and IP-based apps, with kernel-bypass paths in user space)
Sample steps:
1. Create QPs (endpoints)
2. Register memory for sending and receiving
3. Send
   – Channel semantics: post receive, then post send
   – RDMA semantics
Programming with OpenFabrics
(Figure: sender and receiver processes, kernel, and HCA carrying out these steps)
- Prepare and post send descriptor (channel semantics)
Verbs: Post Send
struct ibv_send_wr *bad_wr;
struct ibv_send_wr sr;
struct ibv_sge sg_entry;

sr.next = NULL;
sr.opcode = IBV_WR_SEND;
sr.wr_id = 0;
sr.num_sge = 1;
if (len < max_inline_size) {
    sr.send_flags = IBV_SEND_SIGNALED | IBV_SEND_INLINE;
} else {
    sr.send_flags = IBV_SEND_SIGNALED;
}
sr.sg_list = &sg_entry;

sg_entry.addr = (uintptr_t) buf;
sg_entry.length = len;
sg_entry.lkey = mr_handle->lkey;

ret = ibv_post_send(qp, &sr, &bad_wr);
- Prepare and post RDMA write (memory semantics)
Verbs: Post RDMA Write
struct ibv_send_wr *bad_wr;
struct ibv_send_wr sr;
struct ibv_sge sg_entry;

sr.next = NULL;
sr.opcode = IBV_WR_RDMA_WRITE;         /* set type to RDMA Write */
sr.wr_id = 0;
sr.num_sge = 1;
sr.send_flags = IBV_SEND_SIGNALED;
sr.wr.rdma.remote_addr = remote_addr;  /* remote virtual addr. */
sr.wr.rdma.rkey = rkey;                /* from remote node */
sr.sg_list = &sg_entry;

sg_entry.addr = (uintptr_t) buf;       /* local buffer */
sg_entry.length = len;
sg_entry.lkey = mr_handle->lkey;

ret = ibv_post_send(qp, &sr, &bad_wr);
Libfabrics Connection Model
Client: fi_fabric (open fabric) → fi_domain (open domain) → fi_mr_reg (register memory) → fi_endpoint (open endpoint) → fi_cq_open (open completion queue) → fi_ep_bind (bind EP to CQ) → fi_eq_open (open event queue) → fi_connect (connect to remote EP)
Server: fi_fabric (open fabric) → fi_passive_ep (open passive EP) → fi_eq_open (open event queue) → fi_pep_bind (bind passive EP) → fi_listen (listen for incoming connections) → fi_eq_sread (new event detected on EQ; validate event == FI_CONNREQ) → fi_domain (open domain) → fi_mr_reg (register memory)
(Each process talks to the fabric through an OFI provider — sockets, verbs, or PSM — over GigE/IB/TrueScale hardware)
Libfabrics Connection Model (Cont.)
Server: fi_endpoint (open endpoint) → fi_cq_open (open completion queue) → fi_ep_bind (bind EP to CQ) → fi_accept (accept connection) → fi_eq_sread (new event detected on EQ; validate event == FI_CONNECTED)
Client: fi_eq_sread (validate event == FI_CONNECTED)
Both sides: fi_recv (post recv) → fi_send (post send) → fi_cq_read / fi_cq_sread (poll / wait for completions) → fi_shutdown (shutdown channel) → fi_close (close all open resources)
Scalable EndPoints Vs Shared TX/RX Context
- Normal EndPoint: similar to a socket / QP; simple and easy to use; each EP has its own transmit, receive, and completion resources
- Shared TX/RX Context: multiple EPs share hardware resources; # EPs >> HW resources
- Scalable EndPoints: use more hardware resources; higher performance per EP
(Figure: endpoints with transmit, receive, and completion resources under each of the three models)
Courtesy: http://www.slideshare.net/seanhefty/ofa-workshop2015ofiwg?ref=http://ofiwg.github.io/libfabric/
- Open Fabric, Domain and EP
Libfabrics: Fabric, Domain and Endpoint creation
struct fi_info *info, *hints;
struct fid_fabric *fabric;
struct fid_domain *dom;
struct fid_ep *ep;

hints = fi_allocinfo();

/* Obtain fabric information */
rc = fi_getinfo(VERSION, node, service, flags, hints, &info);

/* Free the hints */
fi_freeinfo(hints);

/* Open fabric */
rc = fi_fabric(info->fabric_attr, &fabric, NULL);

/* Open domain */
rc = fi_domain(fabric, info, &dom, NULL);

/* Open endpoint */
rc = fi_endpoint(dom, info, &ep, NULL);
- Open fabric / domain and create EQ, EP to end nodes
  – Connection establishment is abstracted out using connection management APIs (fi_cm): fi_listen, fi_connect, fi_accept
  – Fabric provider can implement them with connection managers (rdma_cm or ibcm) or directly through verbs with out-of-band communication
- Register memory
Libfabrics: Memory Registration
int fi_mr_reg(struct fid_domain *domain, const void *buf, size_t len,
              uint64_t access, uint64_t offset, uint64_t requested_key,
              uint64_t flags, struct fid_mr **mr, void *context);

rc = fi_mr_reg(domain, buffer, size, FI_SEND | FI_RECV,
               0, 0, 0, &mr, NULL);
rc = fi_mr_reg(domain, buffer, size, FI_REMOTE_READ | FI_REMOTE_WRITE,
               0, user_key, 0, &mr, NULL);
Permissions can be set as needed
- Prepare and post receive request
Libfabrics: Post Receive (Channel Semantics)
ssize_t fi_recv(struct fid_ep *ep, void * buf, size_t len, void *desc, fi_addr_t src_addr, void *context);
- For connected EPs
ssize_t fi_recvmsg(struct fid_ep *ep, const struct fi_msg *msg, uint64_t flags);
- For connected and un-connected EPs
struct fid_ep *ep;
struct fid_mr *mr;

/* Post recv request */
rc = fi_recv(ep, buf, size, fi_mr_desc(mr), 0,
             (void *)(uintptr_t) RECV_WCID);
- Prepare and post send descriptor
Libfabrics: Post Send (Channel Semantics)
ssize_t fi_send(struct fid_ep *ep, void *buf, size_t len, void *desc, fi_addr_t dest_addr, void *context);
- For connected EPs
ssize_t fi_sendmsg(struct fid_ep *ep, const struct fi_msg *msg, uint64_t flags);
- For connected and un-connected EPs
ssize_t fi_inject(struct fid_ep *ep, void *buf, size_t len, fi_addr_t dest_addr);
- Buffer available for re-use as soon as function returns
- No completion event generated for send
struct fid_ep *ep;
struct fid_mr *mr;
static fi_addr_t remote_fi_addr;

rc = fi_send(ep, buf, size, fi_mr_desc(mr), 0,
             (void *)(uintptr_t) SEND_WCID);
rc = fi_inject(ep, buf, size, remote_fi_addr);
- Prepare and post remote read request
Libfabrics: Post Remote Read (Memory Semantics)
ssize_t fi_read(struct fid_ep *ep, void *buf, size_t len, void *desc, fi_addr_t src_addr, uint64_t addr, uint64_t key, void *context);
- For connected EPs
ssize_t fi_readmsg(struct fid_ep *ep, const struct fi_msg_rma *msg, uint64_t flags);
- For connected and un-connected EPs
struct fid_ep *ep;
struct fid_mr *mr;
struct fi_context fi_ctx_read;

/* Post remote read request */
ret = fi_read(ep, buf, size, fi_mr_desc(mr), local_addr,
              remote_addr, remote_key, &fi_ctx_read);
- Prepare and post remote write request
Libfabrics: Post Remote Write (Memory Semantics)
ssize_t fi_write(struct fid_ep *ep, const void *buf, size_t len, void *desc, fi_addr_t dest_addr, uint64_t addr, uint64_t key, void *context);
- For connected EPs
ssize_t fi_writemsg(struct fid_ep *ep, const struct fi_msg_rma *msg, uint64_t flags);
- For connected and un-connected EPs
ssize_t fi_inject_write(struct fid_ep *ep, const void *buf, size_t len, fi_addr_t dest_addr, uint64_t addr, uint64_t key);
- Buffer available for re-use as soon as function returns
- No completion event generated for send
ssize_t fi_writedata(struct fid_ep *ep, const void *buf, size_t len, void *desc, uint64_t data, fi_addr_t dest_addr, uint64_t addr, uint64_t key, void *context);
- Similar to fi_write
- Allows for the sending of remote CQ data
- Management Infrastructure
  – Subnet Manager
  – Diagnostic tools
    - System discovery tools
    - System health monitoring tools
    - System performance monitoring tools
  – Fabric management tools
Network Management Infrastructure and Tools
- Agents
  – Processes or hardware units running on each adapter, switch, and router (everything on the network)
  – Provide the capability to query and set parameters
- Managers
  – Make high-level decisions and implement them on the network fabric using the agents
- Messaging schemes
– Used for interactions between the manager and agents (or between agents)
- Messages
Concepts in IB Management
- All IB management happens using packets called Management Datagrams
  – Popularly referred to as “MAD packets”
- Four major classes of management mechanisms
  – Subnet Management
  – Subnet Administration
  – Communication Management
  – General Services
InfiniBand Management
- Consists of at least one subnet manager (SM) and several subnet
management agents (SMAs)
– Each adapter, switch, router has an agent running – Communication between the SM and agents or between agents happens using MAD packets called as Subnet Management Packets (SMPs)
- SM’s responsibilities include:
– Discovering the physical topology of the subnet
– Assigning LIDs to the end nodes, switches and routers
– Populating switches and routers with routing paths
– Subnet sweeps to discover topology changes
Subnet Management & Administration
Subnet Manager
[Figure: subnet with compute nodes, switches, and the subnet manager; active and inactive links; multicast join requests flowing to the SM and multicast setup of switch ports]
- SM can be configured to sweep once or continuously
- On the first sweep:
– All ports are assigned LIDs
– All routes are set up on the switches
- On subsequent sweeps:
– If there has been any change to the topology, appropriate routes are updated
– If DLID X is down, the packet is not sent all the way
- The first hop will not have a forwarding entry for LID X
- Sweep time is configured by the system administrator
– Cannot be too high or too low
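The sweep behavior above can be sketched as a toy model: a hypothetical in-memory view of LIDs and linear forwarding tables, showing LID assignment on the first sweep and pruning of routes to a downed LID on a later sweep. Names and structures are illustrative, not OpenSM's actual data structures.

```python
# Hypothetical model of SM sweep behavior (illustrative only, not OpenSM code).

def first_sweep(ports):
    """Assign LIDs to every discovered port, starting at 1."""
    return {port: lid for lid, port in enumerate(ports, start=1)}

def build_routes(lids, links):
    """Populate a per-switch linear forwarding table: DLID -> out port.
    links: {switch: {neighbor_port_name: out_port_number}}"""
    return {switch: {lids[n]: out for n, out in neighbors.items()}
            for switch, neighbors in links.items()}

def resweep(tables, lids, down_ports):
    """On a later sweep, drop forwarding entries for ports that went down,
    so packets to a dead DLID are discarded at the first hop."""
    dead = {lids[p] for p in down_ports}
    return {sw: {dlid: out for dlid, out in tbl.items() if dlid not in dead}
            for sw, tbl in tables.items()}

ports = ["hca0", "hca1", "hca2"]
lids = first_sweep(ports)                      # {'hca0': 1, 'hca1': 2, 'hca2': 3}
tables = build_routes(lids, {"sw0": {"hca0": 1, "hca1": 2, "hca2": 3}})
tables = resweep(tables, lids, ["hca2"])       # hca2's link went down
print(tables["sw0"])                           # {1: 1, 2: 2} -- no entry for LID 3
```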
Subnet Manager Sweep Behavior
- Single subnet manager has issues on large systems
– Performance and overhead of scanning
- Hardware implementations on switches are faster, but will work only for small
systems (memory usage)
- Software implementations are more popular (OpenSM)
– Multi-SM models
- Two benefits: fault tolerance (if one SM dies) and scalability (different SMs can
handle different portions of the network)
- Current SMs only provide the fault-tolerance model
- Network subsetting is still being investigated
- Asynchronous events are specified to improve scalability
– E.g., TRAPs are events sent by an agent to the SM when a link goes down
Subnet Manager Scalability Issues
- Creation, joining/leaving, deleting multicast groups occur
as SA requests
– The requesting node sends a request to the SA
– The SA sends MAD packets to SMAs on the switches to set up routes for the multicast packets
- Each switch contains information on which ports to forward the
multicast packet to
- Multicast itself does not go through the subnet manager
– Only the setup and teardown goes through the SM
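A toy model of this split between setup (through the SA) and data forwarding (switch-local) might look like the following; the MLID value, port numbers, and class are illustrative only.

```python
# Hypothetical model of multicast group management (illustrative only).
# The SA programs each switch with the set of ports a multicast packet
# (identified by its MLID) must be forwarded to; data packets themselves
# never touch the SM.

class Switch:
    def __init__(self):
        self.mcast = {}                      # MLID -> set of member ports

    def join(self, mlid, port):              # SA setup on a join request
        self.mcast.setdefault(mlid, set()).add(port)

    def leave(self, mlid, port):             # SA teardown on a leave request
        self.mcast.get(mlid, set()).discard(port)

    def forward(self, mlid, in_port):
        """Replicate to every member port except the one it arrived on."""
        return sorted(self.mcast.get(mlid, set()) - {in_port})

sw = Switch()
for port in (1, 4, 7):
    sw.join(0xC001, port)
sw.leave(0xC001, 4)
print(sw.forward(0xC001, in_port=1))   # [7]
```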
Multicast Group Management
- Management Infrastructure
– Subnet Manager
– Diagnostic tools
- System Discovery Tools
- System Health Monitoring Tools
- System Performance Monitoring Tools
– Fabric management tools
Network Management Infrastructure and Tools
- Different types of tools exist:
– High-level tools that internally talk to the subnet manager using management datagrams
– Each hardware device exposes a few mandatory counters and a number of optional (sometimes vendor-specific) counters
- Possible to write your own tools based on the
management datagram interface
– Several vendors provide such IB management tools
Tools to Analyze InfiniBand Networks
- Starting with almost no knowledge about the system, we can identify several
details of the network configuration
– Example tools include:
- ibstatus: shows adapter status
- smpquery: SMP query tool
- perfquery: reports performance/error counters of a port
- ibportstate: shows status of IB port, enable/disable port
- ibhosts: finds all the network adapters in the system
- ibswitches: finds all the network switches in the system
- ibnetdiscover: finds the connectivity between the ports
- … and many others exist
– Possible to write your own tools based on the management datagram interface
- Several vendors provide such IB management tools
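As an illustration of what such discovery tools recover, the sketch below builds ibhosts-, ibswitches-, and ibnetdiscover-style views from a simplified, made-up description of the fabric; the real ibnetdiscover output format differs.

```python
# Sketch: building a connectivity picture from discovery data, in the
# spirit of ibnetdiscover/ibhosts/ibswitches. The input lines paraphrase
# the kind of information reported; the real output format differs.

sample = """\
switch sw0 port 1 -> hca node0 port 1
switch sw0 port 2 -> hca node1 port 1
switch sw0 port 24 -> switch sw1 port 24
"""

def parse(text):
    edges = []
    for line in text.strip().splitlines():
        t1, n1, _, p1, _, t2, n2, _, p2 = line.split()
        edges.append((t1, n1, int(p1), t2, n2, int(p2)))
    return edges

edges = parse(sample)
# ibhosts-like view: all adapters; ibswitches-like view: all switches
hcas = sorted({n for t1, n1, _, t2, n2, _ in edges
               for t, n in ((t1, n1), (t2, n2)) if t == "hca"})
switches = sorted({n for t1, n1, _, t2, n2, _ in edges
                   for t, n in ((t1, n1), (t2, n2)) if t == "switch"})
print(hcas)        # ['node0', 'node1']
print(switches)    # ['sw0', 'sw1']
```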
Network Discovery Tools
- Several tools exist to monitor the health and performance of the InfiniBand network
– Example health monitoring tools include
- ibdiagnet: queries for overall fabric health
- ibportstate: identify state and link speed of an InfiniBand port
- ibdatacounts: get InfiniBand port data counters
– Example performance monitoring tools include
- ibv_send_lat, ibv_write_lat: IB verbs level performance tests
- perfquery: queries performance counters in IB HCA
Health and Performance Monitoring Tools
Tools for Network Switching and Routing
% ibroute -G 0x66a000700067c
Lid    Out  Destination
       Port Info
0x0001 001 : (Channel Adapter portguid 0x0002c9030001e3f3: ' HCA-1')
0x0002 013 : (Channel Adapter portguid 0x0002c9020023c301: ' HCA-1')
0x0003 014 : (Channel Adapter portguid 0x0002c9030001e603: ' HCA-1')
0x0004 015 : (Channel Adapter portguid 0x0002c9020023c305: ' HCA-2')
0x0005 016 : (Channel Adapter portguid 0x0011750000ffe005: ' HCA-1')
0x0014 017 : (Switch portguid 0x00066a0007000728: 'SilverStorm 9120 GUID=0x00066a00020001aa Leaf 8, Chip A')
0x0015 020 : (Channel Adapter portguid 0x0002c9020023c131: ' HCA-2')
0x0016 019 : (Switch portguid 0x00066a0007000732: 'SilverStorm 9120 GUID=0x00066a00020001aa Leaf 10, Chip A')
0x0017 019 : (Channel Adapter portguid 0x0002c9030001c937: ' HCA-1')
0x0018 019 : (Channel Adapter portguid 0x0002c9020023c039: ' HCA-2')
...
Packets to LID 0x0001 will be sent out on Port 001
- Based on destination LIDs and switching/routing
information, the exact path of the packets can be identified
– If application communication pattern is known, we can statically identify possible network contention
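This static analysis can be sketched directly: with per-switch forwarding tables and a known communication pattern, counting flows per link exposes shared links. All LIDs, tables, and port numbers below are made up for illustration.

```python
# Sketch: static contention analysis. Given per-switch linear forwarding
# tables (DLID -> out port) and an application's communication pattern,
# count how many flows land on each (switch, out-port) link.

from collections import Counter

fdb = {                                    # linear forwarding tables
    "leafA": {1: 1, 2: 2, 3: 24, 4: 24},   # port 24 is the uplink
    "spine": {1: 1, 2: 1, 3: 2, 4: 2},
    "leafB": {3: 1, 4: 2, 1: 24, 2: 24},
}
attached = {1: "leafA", 2: "leafA", 3: "leafB", 4: "leafB"}
links = {("leafA", 24): "spine", ("leafB", 24): "spine",
         ("spine", 1): "leafA", ("spine", 2): "leafB"}

def path(src_lid, dst_lid):
    """Follow the forwarding tables hop by hop, recording each link."""
    sw, hops = attached[src_lid], []
    while sw is not None:
        out = fdb[sw][dst_lid]
        hops.append((sw, out))
        sw = links.get((sw, out))          # None once we exit to an end node
    return hops

pattern = [(1, 3), (2, 4)]                 # communicating LID pairs
load = Counter(link for s, d in pattern for link in path(s, d))
print(load[("leafA", 24)])                 # 2 -> both flows share the uplink
```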
Static Analysis of Network Contention
[Figure: fat-tree with leaf and spine blocks and node LIDs, highlighting statically identified contended routes]
- IB provides many optional performance counters that can be queried
– PortXmitWait: number of ticks in which there was data to send, but no flow-control credits
– RNR NAKs: number of times a message was sent, but the receiver had not yet posted a receive buffer
- This can time out, so it can be an error in some cases
– PortXmitFlowPkts: number of (link-level) flow-control packets transmitted on the port
– SWPortVLCongestion: number of packets dropped due to congestion
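For example, sampling PortXmitWait twice and dividing the delta by the interval gives a rough congestion ratio for a port. The tick interval here is an assumed input; real tools such as perfquery only read the raw counter, they do not compute this ratio.

```python
# Sketch: estimating congestion from two samples of the optional
# PortXmitWait counter (ticks with data queued but no flow-control
# credits available).

def congestion_ratio(wait_t0, wait_t1, interval_ticks):
    """Fraction of the sampling interval the port spent credit-starved."""
    return (wait_t1 - wait_t0) / interval_ticks

ratio = congestion_ratio(5_000, 125_000, 1_000_000)
print(round(ratio, 2))    # 0.12 -> port waited on credits 12% of the interval
```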
Dynamic Analysis of Network Contention
- Management Infrastructure
– Subnet Manager
– Diagnostic tools
- System Discovery Tools
- System Health Monitoring Tools
- System Performance Monitoring Tools
– Fabric management tools
Network Management Infrastructure and Tools
- InfiniBand provides two forms of management
– Out-of-band management (similar to other networks)
– In-band management (used by the subnet manager)
- Out-of-band management requires a separate Ethernet port on the switch, where an administrator can plug in a laptop
- In-band management allows the switch to receive management commands directly over
the regular communication network
In-band Management vs. Out-of-band Management
InfiniBand connectivity (In-band management) Ethernet connectivity (Out-of-band management)
Overview of OSU INAM
- A network monitoring and analysis tool that is capable of analyzing traffic on the InfiniBand network with inputs from the
MPI runtime
– http://mvapich.cse.ohio-state.edu/tools/osu-inam/
- Monitors IB clusters in real time by querying various subnet management entities and gathering input from the MPI runtimes
- OSU INAM v0.9.2 released on 10/31/2017
- Significant enhancements to user interface to enable scaling to clusters with thousands of nodes
- Improve database insert times by using 'bulk inserts'
- Capability to look up list of nodes communicating through a network link
- Capability to classify data flowing over a network link at job level and process level granularity in conjunction with
MVAPICH2-X 2.3b
- “Best practices” guidelines for deploying OSU INAM on different clusters
- Capability to analyze and profile node-level, job-level and process-level activities for MPI communication
– Point-to-Point, Collectives and RMA
- Ability to filter data based on type of counters using “drop down” list
- Remotely monitor various metrics of MPI processes at user specified granularity
- "Job Page" to display jobs in ascending/descending order of various performance metrics in conjunction with MVAPICH2-X
- Visualize the data transfer happening in a “live” or “historical” fashion for entire network, job or set of nodes
OSU INAM Features
- Show network topology of large clusters
- Visualize traffic pattern on different links
- Quickly identify congested links/links in error state
- See the history unfold – play back historical state of the network
Comet@SDSC --- Clustered View (1,879 nodes, 212 switches, 4,377 network links) Finding Routes Between Nodes
OSU INAM Features (Cont.)
Visualizing a Job (5 Nodes)
- Job level view
- Show different network metrics (load, error, etc.) for any live job
- Play back historical data for completed jobs to identify bottlenecks
- Node level view - details per process or per node
- CPU utilization for each rank/node
- Bytes sent/received for MPI operations (pt-to-pt, collective, RMA)
- Network metrics (e.g. XmitDiscard, RcvError) per rank/node
Estimated Process Level Link Utilization
- Estimated Link Utilization view
- Classify data flowing over a network link at
different granularity in conjunction with MVAPICH2-X 2.2rc1
- Job level and
- Process level
- Advanced Features for InfiniBand
- Advanced Features for High Speed Ethernet
- RDMA over Converged Ethernet
- OpenFabrics Software Stack and RDMA Programming
- Libfabric Software Stack and Programming
- Network Management Infrastructure and Tools
- Common Challenges in Building HEC Systems with IB and HSE
– Network Adapters and NUMA Interactions
– Network Switches, Topology and Routing
– Network Bridges
- System Specific Challenges and Case Studies
– HPC (MPI, PGAS and GPU/Xeon Phi Computing)
– Deep Learning
– Cloud Computing
- Conclusions and Final Q&A
Presentation Overview
Common Challenges for Large-Scale Installations
Common Challenges
– Adapters and Interactions: I/O bus, Multi-port adapters, NUMA
– Switches: Topologies, Switching/Routing
– Bridges: IB interoperability
- Network adapters and interactions with other components
– I/O bus interactions and limitations
– Multi-port adapters and bottlenecks
– NUMA interactions
- Network switches
- Network bridges
Common Challenges in Building HEC Systems with IB and HSE
- Data communication traverses three buses (or
links) before it reaches the network switch
– Memory bus (memory to I/O hub)
– I/O link (I/O hub to the network adapter)
– Network link (network adapter to switch)
- For optimal communication, all these need to
be balanced
- Network bandwidth:
– 4X SDR (8 Gbps), 4X DDR (16 Gbps), 4X QDR (32 Gbps), 4X FDR (56 Gbps), 4X EDR (100 Gbps) and 4X HDR (200 Gbps)
– 40 GigE (40 Gbps)
- Memory bandwidth:
– Shared bandwidth (incoming and outgoing)
– For IB FDR (56 Gbps), memory bandwidth greater than 112 Gbps is required to fully utilize the network
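The balance rule can be written down directly: since memory bandwidth is shared by incoming and outgoing traffic, it must be at least twice the network rate to keep the link busy in both directions.

```python
# Sketch of the balance rule: memory bandwidth is shared by incoming
# and outgoing traffic, so it must cover at least 2x the network rate.

def memory_bw_needed_gbps(network_gbps):
    return 2 * network_gbps

for name, gbps in [("4X FDR", 56), ("4X EDR", 100), ("4X HDR", 200)]:
    print(f"{name}: > {memory_bw_needed_gbps(gbps)} Gbps of memory bandwidth")
# 4X FDR needs > 112 Gbps, matching the FDR example above
```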
I/O bus limitations
[Diagram: two processors (P0, P1) with four cores each and local memory, connected through an I/O link to the network adapter and on to the network switch]
- I/O link bandwidth:
– Tricky because several aspects need to be considered
– Connector capacity vs. link capacity
– I/O link communication headers, etc.
I/O Bus
- Common I/O interconnect used on most current platforms
– Can be configured as multiple lanes (1X, 4X, 8X, 16X, 32X)
- Generation 1 provided 2 Gbps of bandwidth per lane, Gen 2 provides 4 Gbps, and Gen 3 provides 8 Gbps per lane
– Compatible with adapters using fewer lanes
- If a PCIe connector is 16X, it will still support an 8X adapter by using only 8 lanes
– Provides multiplexing across a single lane
- A 1X PCIe bus can be connected to an 8X PCIe connector (allowing an 8X adapter to be
plugged in)
– I/O interconnects are like networks with packetized communication
- Communication headers for each packet
- Reliability acknowledgments
- Flow control acknowledgments
- Typical efficiency is around 75-80% with 256 byte PCIe packets
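A rough model of where that 75-80% comes from: per-packet header/framing overhead plus a share of the link spent on ACK and flow-control DLLPs. The overhead constants below are ballpark assumptions chosen to land in that range, not exact PCIe spec values.

```python
# Sketch: rough PCIe data efficiency for a given payload size.
# overhead_bytes (TLP header + framing) and dllp_share (bandwidth spent
# on ACK/flow-control DLLPs) are illustrative assumptions.

def pcie_efficiency(payload_bytes, overhead_bytes=28, dllp_share=0.12):
    tlp = payload_bytes / (payload_bytes + overhead_bytes)
    return tlp * (1 - dllp_share)

print(round(pcie_efficiency(256), 2))   # 0.79 with 256-byte payloads
print(round(pcie_efficiency(64), 2))    # 0.61 -> small payloads fare worse
```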
PCI Express
- Several multi-port adapters are available in the market
– A single adapter can drive multiple network ports at full bandwidth
– Important to measure other overheads (memory bandwidth and I/O link bandwidth) before assuming a performance benefit
- Case Study: IB dual-port 4X QDR adapter
– Each network link is 32 Gbps (dual-port adapters can drive 64 Gbps)
– A PCIe Gen2 8X link gives a 32 Gbps data rate, around 24 Gbps effective (20% encoding overheads!)
- Dual-port IB QDR is not expected to give any benefit in this case
– A PCIe Gen3 8X link gives a 64 Gbps data rate, close to 64 Gbps effective (minimal encoding overheads)
- Delivers close to peak performance with dual-port IB adapters
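The case study reduces to simple arithmetic. The sketch below models only the encoding overheads quoted above (20% for Gen2's 8b/10b, ~1.5% for Gen3's 128b/130b); packet headers shave off more, which is where the ~24 Gbps effective figure comes from.

```python
# Sketch: can the I/O link feed a dual-port adapter? Encoding overhead
# only; PCIe packet headers reduce the effective rate further.

def io_link_gbps(lanes, gt_per_s, encoding_overhead):
    return lanes * gt_per_s * (1 - encoding_overhead)

demand = 2 * 32                           # dual-port 4X QDR: 64 Gbps
gen2 = io_link_gbps(8, 5, 0.20)           # PCIe Gen2 x8, 8b/10b
gen3 = io_link_gbps(8, 8, 0.015)          # PCIe Gen3 x8, 128b/130b
print(round(gen2), round(gen2 / demand, 2))  # 32 0.5 -> Gen2 feeds half the demand
print(round(gen3))                           # 63 -> Gen3 comes close to 64 Gbps
```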
Multi-port adapters
- Network adapters and interactions with other components
– I/O bus interactions and limitations
– Multi-port adapters and bottlenecks
– NUMA interactions
- Network switches
- Network bridges
Common Challenges in Building HEC Systems with IB and HSE
NUMA Interactions
[Diagram: four-socket NUMA platform (Sockets 0-3, Cores 0-15), each socket with local memory, sockets connected via QPI or HT, and the network card attached to one socket over PCIe]
- Different cores in a NUMA platform have different communication costs
[Plot: inter-node send latency (us) vs. message size (2 B-2 KB) for Core 0->0 and Core 7->7 (Socket 0) vs. Core 14->14 and Core 27->27 (Socket 1)]
Impact of NUMA on Inter-node Latency
- Cores in Socket 0 (closest to network card) have lowest latency
- Cores in Socket 1 (one hop from network card) have highest latency
ConnectX-4 EDR (100 Gbps): 2.4 GHz Fourteen-core (Broadwell) Intel with IB (EDR) switches
Impact of NUMA on Inter-node Bandwidth
[Plots: inter-node send bandwidth (MBps) vs. message size. Left: AMD MagnyCours (ConnectX-2 QDR, 2.5 GHz hex-core), Cores 0/6/12/18. Right: Intel Broadwell (ConnectX-4 EDR (100 Gbps), 2.4 GHz fourteen-core with IB (EDR) switches), Cores 0/7/14/27]
- NUMA interactions have significant impact on bandwidth
- Network adapters and interactions with other components
– I/O bus interactions and limitations
– Multi-port adapters and bottlenecks
– NUMA interactions
- Network switches
- Network bridges
Common Challenges in Building HEC Systems with IB and HSE
- Network adapters and interactions with other components
- Network switches
– Switch topologies
– Switching and Routing
- Network bridges
Common Challenges in Building HEC Systems with IB and HSE
- InfiniBand installations come in multiple topologies
– Single crossbar switches (up to 36-ports for QDR or FDR)
- Applicable only to very small systems (hard to scale to large clusters)
– Fat-tree topologies (medium scale topologies)
- Provides full bisection bandwidth: Given independent communication between processes,
you can find a switch configuration that provides fully non-blocking paths (though the same configuration might have contention if the communication pattern changes)
- Issue: Number of switch components increases super-linearly with the number of nodes
(Not scalable for large-scale systems)
- Large scale installations can use more conservative topologies
– Partial fat-tree topologies (over-provisioning)
– 3D Torus (Sandia Red Sky and SDSC Gordon), Hypercube (SGI Altix), and 10D Hypercube (NASA Pleiades) topologies
Switch Topologies
Switch Topology: Absolute Performance vs. Scalability
[Diagrams: Crossbar ASIC (all-to-all connectivity); Full fat-tree topology (leaf and spine blocks, full bisection bandwidth); Partial fat-tree topology (reduced inter-switch connectivity for more out-ports: fewer switch components, but slower than a full fat-tree topology); Torus/Hypercube topology (only a few links are connected; linear scaling of switch components)]
- IB standard only supports static routing
– Not scalable for large systems where traffic might be non-deterministic causing hot-spots
- Next generation IB switches are supporting adaptive routing (in addition to static routing):
Outside the IB standard
- QLogic (Intel) support for adaptive routing
– Continually monitors application messaging patterns and selects the optimum path for each traffic flow, eliminating slowdowns caused by pathway bottlenecks
– Dispersive routing load-balances traffic among multiple pathways
– http://ir.qlogic.com/phoenix.zhtml?c=85695&p=irol-newsarticle&id=1428788
- Mellanox support for adaptive routing
– Supports moving traffic via multiple parallel paths
– Dynamically and automatically re-routes traffic to alleviate congested ports
– http://www.mellanox.com/related-docs/prod_silicon/PB_InfiniScale_IV.pdf
Static Routing in IB + Adaptive Routing models from Qlogic (Intel) and Mellanox
- Network adapters and interactions with other components
- Network switches
- Network bridges
– IB interoperability with Ethernet and FC
Common Challenges in Building HEC Systems with IB and HSE
Virtual Ethernet/FC Adapter
- Mainly developed for backward compatibility with existing
infrastructure
– Ethernet over IB (EoIB)
– Fibre Channel over IB (FCoIB)
IB-Ethernet and IB-FC Bridging Solutions
[Diagram: host with an IB adapter exposing a virtual Ethernet/FC adapter; Ethernet packets pass through a convertor switch (e.g., Mellanox BridgeX) to hosts with native Ethernet/FC adapters]
- Can be used in an infrastructure where a part of the nodes are connected over
Ethernet or FC
– All of the IB-connected nodes can communicate over IB
– The same nodes can communicate with nodes in the older infrastructure using Ethernet-over-IB or FC-over-IB
- Do not have the performance benefits of IB
– The host thinks it is using an Ethernet or FC adapter
– For example, with Ethernet, communication will use TCP/IP
- There is some hardware support for segmentation offload, but the rest of the IB features
are unutilized
- Note that this is different from VPI, as there is only one network connectivity
from the adapter
Ethernet/FC over IB
- Advanced Features for InfiniBand
- Advanced Features for High Speed Ethernet
- RDMA over Converged Ethernet
- OpenFabrics Software Stack and RDMA Programming
- Libfabric Software Stack and Programming
- Network Management Infrastructure and Tools
- Common Challenges in Building HEC Systems with IB and HSE
– Network Adapters and NUMA Interactions
– Network Switches, Topology and Routing
– Network Bridges
- System Specific Challenges and Case Studies
– HPC (MPI, PGAS and GPU/Xeon Phi Computing)
– Deep Learning
– Cloud Computing
- Conclusions and Final Q&A
Presentation Overview
System Specific Challenges for HPC Systems
Common Challenges
– Adapters and Interactions: I/O bus, Multi-port adapters, NUMA
– Switches: Topologies, Switching/Routing
– Bridges: IB interoperability
HPC
– MPI: Multi-rail, Collectives, Scalability, Application Scalability, Energy Awareness
– PGAS: Programmability w/ Performance, Optimized Resource Utilization
– GPU / Xeon Phi: Programmability w/ Performance, Hide data movement costs, Heterogeneity-aware design, Streaming, Deep Learning
- Message Passing Interface (MPI)
- Partitioned Global Address Space (PGAS) models
- GPU Computing
- Xeon Phi Computing
HPC System Challenges and Case Studies
Overview of the MVAPICH2 Project
- High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2, MPI-3.0, and MPI-3.1); started in 2001, first version available in 2002
– MVAPICH2-X (MPI + PGAS), available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
– Support for Virtualization (MVAPICH2-Virt), available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 2,850 organizations in 85 countries
– More than 440,000 (> 0.44 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov ‘17 ranking)
- 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
- 12th, 368,928-core (Stampede2) at TACC
- 17th, 241,108-core (Pleiades) at NASA
- 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology
– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
- Empowering Top500 systems for over a decade
– System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) -> Sunway TaihuLight (1st in Jun ’17, 10M cores, 100 PFlops)
- Interaction with Multi-Rail Environments
- Collective Communication
- Scalability for Large-scale Systems
- Energy Awareness
Design Challenges and Sample Results
[Plot: single-rail inter-node MPI bandwidth (MBytes/sec) vs. message size (1 B-1 MB)]
Impact of Multiple Rails on Inter-node MPI Bandwidth
Designs based on: S. Sur, M. J. Koop, L. Chai and D. K. Panda, “Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms”, IEEE Hot Interconnects, 2007
[Plot: dual-rail inter-node MPI bandwidth for 1, 2, 4, 8, and 16 communicating pairs. ConnectX-4 EDR (100 Gbps): 2.4 GHz Deca-core (Haswell) Intel with IB (EDR) switches]
Hardware Multicast-aware MPI_Bcast on Stampede
[Plots, Default vs. Multicast: small-message (2 B-512 B) and large-message (2 KB-128 KB) MPI_Bcast latency at 102,400 cores; 16-byte and 32-KByte message latency vs. number of nodes. ConnectX-3-FDR (54 Gbps): 2.7 GHz Dual Octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR switch]
Hardware Multicast-aware MPI_Bcast on Broadwell + EDR
[Plots, Default vs. Multicast: small-message (2 B-512 B) and large-message (2 KB-128 KB) MPI_Bcast latency at 1,120 cores; 16-byte and 32-KByte message latency vs. number of nodes. ConnectX-4 EDR (100 Gbps): 2.4 GHz Fourteen-core (Broadwell) Intel with Mellanox IB (EDR) switches]
Advanced Allreduce Collective Designs Using SHArP and Multi-Leaders
- Socket-based design can reduce the communication latency by 23% and 40% on
Xeon + IB nodes
- Support is available in MVAPICH2 2.3a and MVAPICH2-X 2.3b
[Plots, lower is better: HPCG (28 PPN) communication latency (seconds) at 56/224/448 processes, and OSU micro-benchmark latency (us) for 4 B-4 KB messages (16 nodes, 28 PPN), comparing MVAPICH2, Proposed-Socket-Based (23% lower), and MVAPICH2+SHArP (40% lower)]
- M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, Supercomputing '17
Performance of MPI_Allreduce On Stampede2 (10,240 Processes)
[Plots: MPI_Allreduce latency (us) for MVAPICH2, MVAPICH2-OPT, and IMPI across 4 B-4 KB and 8 KB-256 KB messages; OSU micro-benchmark, 64 PPN]
- MPI_Allreduce latency with 32K bytes reduced by 2.4X
Network-Topology-Aware Placement of Processes
– Can we design a highly scalable network topology detection service for IB?
– How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
– What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?
Overall performance and Split up of physical communication for MILC on Ranger
[Plots: performance for varying system sizes; default vs. topology-aware placement for the 2048-core run, showing a 15% improvement]
- H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC'12. Best Paper and Best Student Paper Finalist
- Reduce network topology discovery time from O(Nhosts²) to O(Nhosts)
- 15% improvement in MILC execution time @ 2048 cores
- 15% improvement in Hypre execution time @ 1024 cores
Dynamic and Adaptive Tag Matching
Normalized Total Tag Matching Time at 512 Processes Normalized to Default (Lower is Better) Normalized Memory Overhead per Process at 512 Processes Compared to Default (Lower is Better) Adaptive and Dynamic Design for MPI Tag Matching; M. Bayatpour, H. Subramoni, S. Chakraborty, and D. K. Panda; IEEE Cluster 2016. [Best Paper Nominee]
Challenge
– Tag matching is a significant overhead for receivers
Existing solutions
– Are static and do not adapt dynamically to the communication pattern
– Do not consider memory overhead
Solution
A new tag-matching design
– Dynamically adapts to communication patterns
– Uses different strategies for different ranks
– Decisions are based on the number of request objects that must be traversed before hitting the required one
Results
– Better performance than other state-of-the-art tag-matching schemes
– Minimum memory consumption
– Will be available in future MVAPICH2 releases
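One way to picture the adaptive idea: keep the posted-receive queue as a plain list, and switch a rank to a tag-binned table once matching walks past too many request objects. The threshold and class below are an illustrative stand-in, not MVAPICH2's actual design.

```python
# Sketch of adaptive tag matching (illustrative stand-in only).

class Matcher:
    def __init__(self, threshold=8):
        self.queue = []          # (tag, req) posted receives, in order
        self.bins = None         # tag -> [req], used after adaptation
        self.threshold = threshold

    def post(self, tag, req):
        if self.bins is None:
            self.queue.append((tag, req))
        else:
            self.bins.setdefault(tag, []).append(req)

    def match(self, tag):
        if self.bins is not None:
            reqs = self.bins.get(tag, [])
            return reqs.pop(0) if reqs else None
        for i, (t, req) in enumerate(self.queue):   # linear search
            if t == tag:
                self.queue.pop(i)
                if i > self.threshold:              # long traversal: adapt
                    self._rebuild_bins()
                return req
        return None

    def _rebuild_bins(self):
        self.bins = {}
        for t, req in self.queue:
            self.bins.setdefault(t, []).append(req)
        self.queue = []

m = Matcher(threshold=2)
for t in range(10):
    m.post(t, f"r{t}")
m.match(9)                      # walks past 9 entries -> switches to bins
print(m.bins is not None)       # True
print(m.match(0))               # r0, now found via a hash lookup
```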
- Enhance existing support for MPI_T in MVAPICH2 to expose a richer
set of performance and control variables
- Get and display MPI Performance Variables (PVARs) made available by
the runtime in TAU
- Control the runtime’s behavior via MPI Control Variables (CVARs)
- Introduced support for new MPI_T based CVARs to MVAPICH2
○ MPIR_CVAR_MAX_INLINE_MSG_SZ, MPIR_CVAR_VBUF_POOL_SIZE, MPIR_CVAR_VBUF_SECONDARY_POOL_SIZE
- TAU enhanced with support for setting MPI_T CVARs in a non-interactive mode for uninstrumented applications
Performance Engineering Applications using MVAPICH2 and TAU
VBUF usage without CVAR based tuning as displayed by ParaProf VBUF usage with CVAR based tuning as displayed by ParaProf
Dynamic and Adaptive MPI Point-to-point Communication Protocols
[Diagram: eager threshold for an example communication pattern (process pairs 0-4, 1-5, 2-6, 3-7 across two nodes) under different designs. Default: 16 KB for every pair; Manually Tuned: 128 KB for every pair; Dynamic + Adaptive: 32 KB / 64 KB / 128 KB / 32 KB per pair]
- H. Subramoni, S. Chakraborty, D. K. Panda, Designing Dynamic & Adaptive MPI Point-to-Point Communication Protocols for Efficient Overlap of Computation & Communication, ISC'17 - Best Paper
[Plots: execution time (seconds) and relative memory consumption of Amber at 128-1K processes for Default, Threshold=17K/64K/128K, and Dynamic Threshold]
Default: poor overlap, low memory requirement (low performance, high productivity)
Manually Tuned: good overlap, high memory requirement (high performance, low productivity)
Dynamic + Adaptive: good overlap, optimal memory requirement (high performance, high productivity)
Desired Eager Threshold
Process Pair    Eager Threshold (KB)
0 – 4           32
1 – 5           64
2 – 6           128
3 – 7           32
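Per-pair protocol selection can be sketched as a lookup: each communicating pair carries its own eager threshold (the values mirror the example table above), and a message goes eager at or below the threshold and rendezvous above it. The adaptation that sets those values is omitted; names are illustrative.

```python
# Sketch: per-pair eager/rendezvous selection (illustrative only).

DEFAULT_KB = 16
pair_threshold = {(0, 4): 32, (1, 5): 64, (2, 6): 128, (3, 7): 32}

def protocol(src, dst, msg_kb):
    """Eager copies through pre-allocated buffers (fast, more memory);
    rendezvous does a zero-copy transfer after a handshake (less memory)."""
    limit = pair_threshold.get((src, dst), DEFAULT_KB)
    return "eager" if msg_kb <= limit else "rendezvous"

print(protocol(1, 5, 48))    # eager: this pair's threshold was raised to 64 KB
print(protocol(0, 1, 48))    # rendezvous: unknown pair keeps the 16 KB default
```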
Enhanced MPI_Bcast with Optimized CMA-based Design
[Plots: MPI_Bcast latency (us) vs. message size (1 KB-4 MB) for MVAPICH2-2.3a, Intel MPI 2017, OpenMPI 2.1.0, and the proposed design, on KNL (64 processes), Broadwell (28 processes), and Power8 (160 processes)]
- Up to 2x - 4x improvement over existing implementation for 1MB messages
- Up to 1.5x – 2x faster than Intel MPI and Open MPI for 1MB messages
- Improvements obtained for large messages only
- p-1 copies with CMA, p copies with Shared memory
- Fallback to SHMEM for small messages
- S. Chakraborty, H. Subramoni, and D. K. Panda, Contention-Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems, IEEE Cluster ’17, Best Paper Finalist
Support is available in MVAPICH2-X 2.3b
Designing Energy-Aware (EA) MPI Runtime
[Diagram: overall application energy expenditure split into energy spent in communication routines (point-to-point, collective, RMA) and computation routines; MVAPICH2-EA designs target MPI two-sided and collectives (e.g., MVAPICH2), other PGAS implementations (e.g., OSHMPI), one-sided runtimes (e.g., ComEx), and MPI-3 RMA implementations (e.g., MVAPICH2)]
- An energy efficient runtime that
provides energy savings without application knowledge
- Automatically and transparently uses the best energy lever
- Provides guarantees on maximum
degradation with 5-41% savings at <= 5% degradation
- Pessimistic MPI applies energy
reduction lever to each MPI call
- Available for download from
MVAPICH project site since Aug’15
MVAPICH2-EA: Application Oblivious Energy-Aware-MPI (EAM)
A Case for Application-Oblivious Energy-Efficient MPI Runtime, A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, Supercomputing ’15, Nov 2015 [Best Student Paper Finalist]
- Message Passing Interface (MPI)
- Partitioned Global Address Space (PGAS) models
- GPU Computing
- Xeon Phi Computing
HPC System Challenges and Case Studies
- Global view improves programmer productivity
- Idea is to decouple data movement from process synchronization
- Processes should have asynchronous access to globally distributed data
- Well suited for irregular applications and kernels that require dynamic access to different data
- Different Approaches
– Library-based: Global Arrays, OpenSHMEM
– Compiler-based: Unified Parallel C (UPC), Co-Array Fortran (CAF)
– HPCS language-based: X10, Chapel, Fortress
Partitioned Global Address Space (PGAS) Models
(Figure: three models side by side: the shared memory model (SHMEM, DSM) with processes P1-P3 over one shared memory, the distributed memory model (MPI) with a private memory per process, and PGAS with per-process memories unified under a logical shared memory)
Network Based Computing Laboratory 110 IT4 Innovations’18
Hybrid (MPI+PGAS) Programming
- Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics
- Benefits:
– Best of Distributed Computing Model
– Best of Shared Memory Computing Model
(Figure: an HPC application with kernels 1..N written in MPI, where selected sub-kernels, e.g., kernels 2 and N, are re-written in PGAS)
Network Based Computing Laboratory 111 IT4 Innovations’18
MVAPICH2-X for Hybrid MPI + PGAS Applications
- Current Model – Separate Runtimes for OpenSHMEM/UPC/UPC++/CAF and MPI
– Possible deadlock if both runtimes are not progressed
– Consumes more network resources
- Unified communication runtime for MPI, UPC, UPC++, OpenSHMEM, CAF
– Available since 2012 (starting with MVAPICH2-X 1.9)
– http://mvapich.cse.ohio-state.edu
Network Based Computing Laboratory 112 IT4 Innovations’18
UPC++ Collectives Performance
(Figure: an MPI + UPC++ application running either over the UPC++ runtime with GASNet interfaces and the MPI network conduit, or over the UPC++ runtime with MPI interfaces on the MVAPICH2-X Unified Communication Runtime (UCR))
- Full and native support for hybrid MPI + UPC++ applications
- Better performance compared to IBV and MPI conduits
- OSU Micro-benchmarks (OMB) support for UPC++
- Available since MVAPICH2-X 2.2RC1
(Figure: inter-node broadcast latency (us) vs. message size on 64 nodes (1 ppn) for GASNet_MPI, GASNET_IBV, and MV2-X, with MV2-X up to 14x faster)
- J. M. Hashmi, K. Hamidouche, and D. K. Panda, Enabling Performance Efficient Runtime Support for Hybrid MPI+UPC++ Programming Models, IEEE International Conference on High Performance Computing and Communications (HPCC 2016)
Network Based Computing Laboratory 113 IT4 Innovations’18
Application Level Performance with Graph500 and Sort
(Figure: Graph500 execution time (s) vs. number of processes (4K, 8K, 16K) for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM), showing 13X and 7.6X improvements)
- J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC ’13), June 2013
- Performance of the Hybrid (MPI+OpenSHMEM) Graph500 design
– 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X over MPI-Simple
– 16,384 processes: 1.5X improvement over MPI-CSR, 13X over MPI-Simple
(Figure: Sort execution time (s) vs. input data and number of processes (500GB-512 to 4TB-4K), MPI vs. Hybrid, 51% improvement)
- Performance of the Hybrid (MPI+OpenSHMEM) Sort application
– 4,096 processes, 4 TB input size
– MPI: 2408 sec (0.16 TB/min); Hybrid: 1172 sec (0.36 TB/min)
– 51% improvement over the MPI design
- J. Jose, S. Potluri, H. Subramoni, X. Lu, K. Hamidouche, K. Schulz, H. Sundar and D. Panda, Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models, PGAS ’14
Network Based Computing Laboratory 114 IT4 Innovations’18
Performance of PGAS Models on KNL using MVAPICH2-X
(Figures: intra-node PUT and GET latency (us) vs. message size (1 byte-1M) for shmem_put/get, upc_putmem/getmem, and upcxx_async_put/get)
- Intra-node performance of one-sided put/get operations of PGAS libraries/languages using the MVAPICH2-X communication conduit
- Near-native communication performance is observed on KNL
Network Based Computing Laboratory 115 IT4 Innovations’18
Optimized OpenSHMEM with AVX and MCDRAM: Application Kernels Evaluation
Heat Image Kernel
- On heat-diffusion-based kernels, AVX-512 vectorization showed better performance
- MCDRAM showed significant benefits on the Heat-Image kernel for all process counts; combined with AVX-512 vectorization, it showed up to 4X improved performance
(Figures: Heat-Image kernel and Heat-2D kernel (Jacobi method) execution time (s) vs. number of processes (16-128) for KNL (Default), KNL (AVX-512), KNL (AVX-512+MCDRAM), and Broadwell)
Network Based Computing Laboratory 116 IT4 Innovations’18
- Message Passing Interface (MPI)
- Partitioned Global Address Space (PGAS) models
- GPU Computing
- Xeon Phi Computing
HPC System Challenges and Case Studies
Network Based Computing Laboratory 117 IT4 Innovations’18
At Sender: MPI_Send(s_devbuf, size, …);
At Receiver: MPI_Recv(r_devbuf, size, …);
(s_devbuf and r_devbuf are GPU device buffers; the data movement is handled inside MVAPICH2)
- Standard MPI interfaces used for unified data movement
- Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
- Overlaps data movement from GPU with RDMA transfers
High Performance and High Productivity
GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
Network Based Computing Laboratory 118 IT4 Innovations’18
(Figures: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size for MV2-(NO-GDR) vs. MV2-GDR-2.3a: 1.88 us small-message latency and up to 11X lower latency, with roughly 9x and 10x higher bandwidth and bi-bandwidth)
Optimized MVAPICH2-GDR Design
Platform: MVAPICH2-GDR-2.3a on an Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, an NVIDIA Volta V100 GPU, and a Mellanox Connect-X4 EDR HCA, with CUDA 9.0 and Mellanox OFED 4.0 with GPU-Direct RDMA
Network Based Computing Laboratory 119 IT4 Innovations’18
- Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
- HoomdBlue Version 1.0.5
- GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
Application-Level Evaluation (HOOMD-blue)
(Figures: average time steps per second (TPS) vs. number of processes (4-32) for MV2 vs. MV2+GDR, with 64K and 256K particles, showing 2X improvement)
Network Based Computing Laboratory 120 IT4 Innovations’18
Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland
(Figures: normalized execution time vs. number of GPUs for Default, Callback-based, and Event-based designs on the CSCS GPU cluster (16-96 GPUs) and the Wilkes GPU cluster (4-32 GPUs))
- 2X improvement on 32 GPU nodes
- 30% improvement on 96 GPU nodes (8 GPUs/node)
- C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS ’16
On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and the Cosmo application
Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
Network Based Computing Laboratory 121 IT4 Innovations’18
Enhanced Support for GPU Managed Memory
- CUDA managed memory => no memory pin-down
- No IPC support for intra-node communication
- No GDR support for inter-node communication
- Significant productivity benefits due to abstraction of explicit allocation and cudaMemcpy()
- Initial and basic support in MVAPICH2-GDR
– For both intra- and inter-node transfers, data is pipelined through host memory
- Enhance intra-node managed memory to use IPC
– Double-buffering, pair-wise IPC-based scheme
– Brings IPC performance to managed memory
- High performance and high productivity
- 2.5X improvement in bandwidth
- OMB extended to evaluate the performance of point-to-point and collective communications using managed buffers
(Figure: bandwidth (MB/s) vs. message size (32K-2M) for the enhanced design vs. MV2-GDR 2.2b, 2.5X improvement)
- D. S. Banerjee, K. Hamidouche, and D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, GPGPU-9 Workshop, held in conjunction with PPoPP ’16
(Figure: 2D stencil halo exchange time (ms) for halo width 1 vs. total dimension size, Device vs. Managed buffers)
Network Based Computing Laboratory 122 IT4 Innovations’18
- Streaming applications on GPU clusters
– Using a pipeline of broadcast operations to move host-resident data from a single source, typically live, to multiple GPU-based computing sites
– Existing schemes require explicit data movements between host and GPU memories, yielding poor performance and breaking the pipeline
- IB hardware multicast + Scatter-List
– Efficient heterogeneous-buffer broadcast operation
- CUDA Inter-Process Communication (IPC)
– Efficient intra-node topology-aware broadcast operations for multi-GPU systems
- Available in MVAPICH2-GDR 2.3a!
High-Performance Heterogeneous Broadcast for Streaming Applications
(Figures: IB hardware multicast with scatter-list steps from a source node to nodes 1..N, and intra-node distribution among GPUs 0..N using IPC-based cudaMemcpy (device-to-device))
C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters, SBAC-PAD ’16, Oct 2016
Network Based Computing Laboratory 123 IT4 Innovations’18
Control Flow Decoupling through GPUDirect Async
- Latency oriented: able to hide the kernel launch overhead
– 25% improvement at 256 bytes
- Throughput oriented: asynchronously offloads queued communication and computation tasks
– 14% improvement at 1KB message size
- Platform: Intel Sandy Bridge, NVIDIA K20, and Mellanox FDR HCA
- Will be available in a public release soon
(Figure: GPU, CPU, and HCA timeline with the kernel launch overhead hidden)
- CPU offloads the compute, communication, and synchronization tasks to the GPU
- All operations asynchronous from CPU
- Hide the overhead of kernel launch
- Needs stream-based extensions to MPI semantics
(Figure: latency oriented (Kernel+Send and Recv+Kernel): latency (us) vs. message size (1 byte-4K), Default MPI vs. Enhanced MPI+GDS)
(Figure: overlap (%) with host computation/communication vs. message size (1 byte-4K), Default MPI vs. Enhanced MPI+GDS)
Network Based Computing Laboratory 124 IT4 Innovations’18
- Message Passing Interface (MPI)
- Partitioned Global Address Space (PGAS) models
- GPU Computing
- Xeon Phi Computing
HPC System Challenges and Case Studies
Network Based Computing Laboratory 125 IT4 Innovations’18
- On-load approach
– Takes advantage of the idle cores
– Dynamically configurable
– Takes advantage of highly multithreaded cores
– Takes advantage of the MCDRAM of KNL processors
- Applicable to other programming models such as PGAS, Task-based, etc.
- Provides portability, performance, and applicability to runtimes as well as applications in a transparent manner
Enhanced Designs for KNL: MVAPICH2 Approach
Network Based Computing Laboratory 126 IT4 Innovations’18
Performance Benefits of the Enhanced Designs
- New designs to exploit high concurrency and MCDRAM of KNL
- Significant improvements for large message sizes
- Benefits seen in varying message size as well as varying MPI processes
Very Large Message Bi-directional Bandwidth 16-process Intra-node All-to-All Intra-node Broadcast with 64MB Message
(Figures: MVAPICH2 vs. MVAPICH2-Optimized for the three benchmarks above, with improvements of 27%, 17.2%, and 52%)
Network Based Computing Laboratory 127 IT4 Innovations’18
Performance Benefits of the Enhanced Designs
(Figures: multi-pair bandwidth using 32 MPI processes (1M-64M messages) for default vs. optimized designs with DRAM and MCDRAM, up to 30% improvement; CNTK MLP training time using MNIST (batch size 64) for varying MPI process : OpenMP thread combinations, up to 15% improvement)
- Benefits observed on the training time of a Multi-Level Perceptron (MLP) model on the MNIST dataset using the CNTK deep learning framework
Enhanced Designs will be available in upcoming MVAPICH2 releases
Network Based Computing Laboratory 128 IT4 Innovations’18
- Advanced Features for InfiniBand
- Advanced Features for High Speed Ethernet
- RDMA over Converged Ethernet
- Open Fabrics Software Stack and RDMA Programming
- Libfabrics Software Stack and Programming
- Network Management Infrastructure and Tool
- Common Challenges in Building HEC Systems with IB and HSE
– Network Adapters and NUMA Interactions
– Network Switches, Topology and Routing
– Network Bridges
- System Specific Challenges and Case Studies
– HPC (MPI, PGAS and GPU/Xeon Phi Computing)
– Big Data
– Cloud Computing
- Conclusions and Final Q&A
Presentation Overview
Network Based Computing Laboratory 129 IT4 Innovations’18
System Specific Challenges for Big Data Processing
Common Challenges
- Adapters and interactions: I/O bus, multi-port adapters, NUMA
- Switches: topologies, switching/routing
- Bridges: IB interoperability
Big Data
- Taking advantage of RDMA; performance; scalability; backward compatibility
HPC
- MPI: multi-rail, collectives, scalability, application scalability, energy awareness
- PGAS: programmability with performance, optimized resource utilization
- GPU/MIC: programmability with performance, hiding data movement costs, heterogeneity-aware design
Network Based Computing Laboratory 130 IT4 Innovations’18
How can HPC clusters with high-performance interconnect and storage architectures benefit Big Data applications? Bring HPC and Big Data processing into a “convergent trajectory”!
- What are the major bottlenecks in current Big Data processing middleware (e.g., Hadoop, Spark, and Memcached)?
- Can the bottlenecks be alleviated with new designs by taking advantage of HPC technologies?
- Can RDMA-enabled high-performance interconnects benefit Big Data processing?
- Can HPC clusters with high-performance storage systems (e.g., SSD, parallel file systems) benefit Big Data applications?
- How much performance benefit can be achieved through enhanced designs?
- How do we design benchmarks for evaluating the performance of Big Data middleware on HPC clusters?
Network Based Computing Laboratory 131 IT4 Innovations’18
Can We Run Big Data Jobs on Existing HPC Infrastructure?
Network Based Computing Laboratory 132 IT4 Innovations’18
Designing Communication and I/O Libraries for Big Data Systems: Challenges
Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached)
(Figure: layered view of the design challenges: applications over the Big Data middleware, over programming models (sockets today; RDMA?), over a communication and I/O library covering point-to-point communication, threaded models and synchronization, QoS and fault tolerance, performance tuning, I/O and file systems, virtualization (SR-IOV), and benchmarks; running on networking technologies (InfiniBand, 1/10/40/100 GigE and intelligent NICs), storage technologies (HDD, SSD, NVM, and NVMe-SSD), and commodity multi- and many-core systems with accelerators; upper-level changes?)
Network Based Computing Laboratory 135 IT4 Innovations’18
- RDMA for Apache Spark
- RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
- RDMA for Apache HBase
- RDMA for Memcached (RDMA-Memcached)
- RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
- OSU HiBD-Benchmarks (OHB)
– HDFS, Memcached, HBase, and Spark Micro-benchmarks
- http://hibd.cse.ohio-state.edu
- User base: 275 organizations from 34 countries
- More than 24,700 downloads from the project site
The High-Performance Big Data (HiBD) Project
Available for InfiniBand and RoCE; also runs on Ethernet; support for OpenPOWER is available
Network Based Computing Laboratory 136 IT4 Innovations’18
- HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation for better fault tolerance as well as performance. This mode is enabled by default in the package.
- HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in memory and obtain as much performance benefit as possible.
- HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
- HHH-L-BB: This mode deploys a Memcached-based burst buffer system to reduce the bandwidth bottleneck of shared file system access. The burst buffer design is hosted by Memcached servers, each of which has a local SSD.
- MapReduce over Lustre, with/without local disks: Besides HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.
- Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).
Different Modes of RDMA for Apache Hadoop 2.x
Network Based Computing Laboratory 137 IT4 Innovations’18
(Figures: execution time (s) vs. data size (GB) for IPoIB (EDR) vs. OSU-IB (EDR))
Performance Numbers of RDMA for Apache Hadoop 2.x – RandomWriter & TeraGen in OSU-RI2 (EDR)
Cluster with 8 Nodes with a total of 64 maps
- RandomWriter
– 3x improvement over IPoIB for 80-160 GB file size
- TeraGen
– 4x improvement over IPoIB for 80-240 GB file size
(RandomWriter reduced by 3x; TeraGen reduced by 4x)
Network Based Computing Laboratory 138 IT4 Innovations’18
(Figures: execution time (s) vs. data size (GB) for IPoIB (EDR) vs. OSU-IB (EDR))
Performance Numbers of RDMA for Apache Hadoop 2.x – Sort & TeraSort in OSU-RI2 (EDR)
Cluster with 8 Nodes with a total of 64 maps and 32 reduces
- Sort
– 61% improvement over IPoIB for 80-160 GB data
- TeraSort
– 18% improvement over IPoIB for 80-240 GB data
(Sort reduced by 61%; TeraSort reduced by 18%; the second configuration used 8 nodes with a total of 64 maps and 14 reduces)
Network Based Computing Laboratory 139 IT4 Innovations’18
- Design Features
– RDMA-based shuffle plugin
– SEDA-based architecture
– Dynamic connection management and sharing
– Non-blocking data transfer
– Off-JVM-heap buffer management
– InfiniBand/RoCE support
Design Overview of Spark with RDMA
- Enables high-performance RDMA communication while supporting the traditional socket interface
- JNI Layer bridges Scala based Spark with communication library written in native code
- X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014
- X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData ‘16, Dec. 2016.
(Figure: Apache Spark benchmarks/applications/libraries/frameworks over Spark Core, with the Shuffle Manager (Sort, Hash, Tungsten-Sort) and Block Transfer Service (Netty, NIO, RDMA-Plugin); the Netty/NIO servers and clients use the Java socket interface over 1/10/40/100 GigE or IPoIB, while the RDMA server and client use a native RDMA-based communication engine via JNI over RDMA-capable networks (IB, iWARP, RoCE, ...))
Network Based Computing Laboratory 140 IT4 Innovations’18
- InfiniBand FDR, SSD, 64 Worker Nodes, 1536 Cores, (1536M 1536R)
- RDMA vs. IPoIB with 1536 concurrent tasks, single SSD per node.
– SortBy: total time reduced by up to 80% over IPoIB (56Gbps)
– GroupBy: total time reduced by up to 74% over IPoIB (56Gbps)
Performance Evaluation on SDSC Comet – SortBy/GroupBy
(Figures: SortByTest and GroupByTest total time (sec) vs. data size (64-256 GB) on 64 worker nodes (1536 cores), IPoIB vs. RDMA, showing up to 80% and 74% reductions)
Network Based Computing Laboratory 141 IT4 Innovations’18
Application Evaluation on SDSC Comet
- Kira Toolkit: distributed astronomy image processing toolkit implemented using Apache Spark
– https://github.com/BIDS/Kira
- Source extractor application, using a 65GB dataset from the SDSS DR2 survey that comprises 11,150 image files
(Figure: execution times (sec) for the Kira SE benchmark using the 65 GB dataset on 48 cores, RDMA Spark vs. Apache Spark (IPoIB), 21% improvement)
- M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC
Comet, XSEDE’16, July 2016
- BigDL: distributed deep learning tool using Apache Spark
– https://github.com/intel-analytics/BigDL
- VGG training model on the CIFAR-10 dataset
(Figure: one-epoch time (sec) vs. number of cores (24-384), IPoIB vs. RDMA, up to 4.58x speedup)
Network Based Computing Laboratory 142 IT4 Innovations’18
Using HiBD Packages on Existing HPC Infrastructure
Network Based Computing Laboratory 143 IT4 Innovations’18
- RDMA for Apache Hadoop 2.x and RDMA for Apache Spark are
installed and available on SDSC Comet.
– Examples for various modes of usage are available in:
- RDMA for Apache Hadoop 2.x: /share/apps/examples/HADOOP
- RDMA for Apache Spark: /share/apps/examples/SPARK/
– Please email help@xsede.org (reference Comet as the machine, and SDSC as the site) if you have any further questions about usage and configuration.
- RDMA for Apache Hadoop is also available on Chameleon Cloud as an
appliance
– https://www.chameleoncloud.org/appliances/17/
HiBD Packages on SDSC Comet and Chameleon Cloud
- M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC
Comet, XSEDE’16, July 2016
Network Based Computing Laboratory 145 IT4 Innovations’18
(Figures: Memcached GET latency (us) vs. message size (1 byte-4K) and throughput (thousands of transactions per second) vs. number of clients (16-4080), OSU-IB (FDR) vs. IPoIB (FDR))
- Memcached GET latency
– 4 bytes: OSU-IB 2.84 us; IPoIB 75.53 us
– 2K bytes: OSU-IB 4.49 us; IPoIB 123.42 us
- Memcached throughput (4 bytes)
– 4080 clients: OSU-IB 556 Kops/sec; IPoIB 233 Kops/sec
– Nearly 2X improvement in throughput
Memcached GET Latency Memcached Throughput
Memcached Performance (FDR Interconnect)
Experiments on TACC Stampede (Intel SandyBridge Cluster, IB: FDR)
(Latency reduced by nearly 20X; throughput improved by 2X)
Network Based Computing Laboratory 146 IT4 Innovations’18
- Advanced Features for InfiniBand
- Advanced Features for High Speed Ethernet
- RDMA over Converged Ethernet
- Open Fabrics Software Stack and RDMA Programming
- Libfabrics Software Stack and Programming
- Network Management Infrastructure and Tool
- Common Challenges in Building HEC Systems with IB and HSE
– Network Adapters and NUMA Interactions – Network Switches, Topology and Routing – Network Bridges
- System Specific Challenges and Case Studies
– HPC (MPI, PGAS and GPU/Xeon Phi Computing) – Big Data – Cloud Computing
- Conclusions and Final Q&A
Presentation Overview
Network Based Computing Laboratory 147 IT4 Innovations’18
System Specific Challenges for Cloud Computing
Common Challenges
- Adapters and interactions: I/O bus, multi-port adapters, NUMA
- Switches: topologies, switching/routing
- Bridges: IB interoperability
Cloud Computing
- SR-IOV support, virtualization, containers
HPC
- MPI: multi-rail, collectives, scalability, application scalability, energy awareness
- PGAS: programmability with performance, optimized resource utilization
- GPU/Xeon Phi: programmability with performance, hiding data movement costs, heterogeneity-aware design
Big Data
- Taking advantage of RDMA; performance; scalability; backward compatibility
Network Based Computing Laboratory 148 IT4 Innovations’18
- Cloud Computing is widely adopted in industry computing environments
- Cloud Computing provides high resource utilization and flexibility
- Virtualization is the key technology to enable Cloud Computing
- Intersect360 study shows cloud is the fastest growing class of HPC
- HPC Meets Cloud: The convergence of Cloud Computing and HPC
HPC Meets Cloud Computing
Network Based Computing Laboratory 149 IT4 Innovations’18
- Virtualization has many benefits
– Fault tolerance
– Job migration
– Compaction
- Virtualization has not been very popular in HPC due to the overhead associated with it
- New SR-IOV (Single Root I/O Virtualization) support available with Mellanox InfiniBand adapters changes the field
- Enhanced MVAPICH2 support for SR-IOV
- MVAPICH2-Virt 2.2 supports:
– OpenStack, Docker, and Singularity
Can HPC and Virtualization be Combined?
- J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters?, EuroPar ’14
- J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters, HiPC ’14
- J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds, CCGrid ’15
Network Based Computing Laboratory 150 IT4 Innovations’18
(Figures: SPEC MPI2007 execution time (s) for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu, and Graph500 execution time (ms) for problem sizes (22,20)-(26,16), comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native)
- 32 VMs, 6 Core/VM
- Compared to Native, 2-5% overhead for Graph500 with 128 Procs
- Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 Procs
Application-Level Performance on Chameleon
SPEC MPI2007 Graph500
Network Based Computing Laboratory 151 IT4 Innovations’18
(Figure: NAS execution time (s) for MG.D, FT.D, EP.D, LU.D, and CG.D with Container-Def, Container-Opt, and Native)
- 64 containers across 16 nodes, pinning 4 cores per container
- Compared to Container-Def, up to 11% and 73% execution time reduction for NAS and Graph500
- Compared to Native, less than 9% and 5% overhead for NAS and Graph500
Application-Level Performance on Docker with MVAPICH2
Graph 500 NAS
(Figure: Graph500 BFS execution time (ms) at scale 20, edgefactor 16, for 1Cont*16P, 2Conts*8P, and 4Conts*4P with Container-Def, Container-Opt, and Native, showing up to 73% reduction)
Network Based Computing Laboratory 152 IT4 Innovations’18
(Figures: Graph500 BFS execution time (ms) for problem sizes (22,16)-(26,20) and NPB Class D execution time (s) for CG, EP, FT, IS, LU, and MG, Singularity vs. Native)
- 512 Processes across 32 nodes
- Less than 7% and 6% overhead for NPB and Graph500, respectively
Application-Level Performance on Singularity with MVAPICH2
- J. Zhang, X. Lu and D. K. Panda, Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds?, UCC ’17, Best Student Paper Award
Network Based Computing Laboratory 153 IT4 Innovations’18
- Challenges
– Existing designs in Hadoop not virtualization-aware – No support for automatic topology detection
- Design
– Automatic Topology Detection using MapReduce-based utility
- Requires no user input
- Can detect topology changes during runtime without affecting running jobs
– Virtualization and topology-aware communication through map task scheduling and YARN container allocation policy extensions
Virtualization-aware and Automatic Topology Detection Schemes in Hadoop on InfiniBand
- S. Gugnani, X. Lu, and D. K. Panda, Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating
Hadoop on SR-IOV-enabled Clouds, CloudCom’16, December 2016
(Figures: execution time for Hadoop benchmarks (Sort, WordCount, and PageRank at 40 and 60 GB) and Hadoop applications (CloudBurst and Self-join in default and distributed modes), RDMA-Hadoop vs. Hadoop-Virt, reduced by up to 55% and 34%)
Network Based Computing Laboratory 154 IT4 Innovations’18
- Presented advanced features of InfiniBand, HSE, Omni-Path, and RoCE
- Provided an overview of OpenFabrics verbs-level and Libfabrics-level programming and InfiniBand network management
- Discussed a common set of challenges in designing HEC systems
- Presented challenges and solutions in designing various high-end computing systems with IB, Omni-Path, and HSE
- IB, Omni-Path, and HSE are emerging as new architectures leading to a new generation of networked computing systems, opening many research issues that need novel solutions
Concluding Remarks
Network Based Computing Laboratory 155 IT4 Innovations’18
Funding Acknowledgments
Funding Support by Equipment Support by
Network Based Computing Laboratory 156 IT4 Innovations’18
Personnel Acknowledgments
Current Students (Graduate)
- A. Awan (Ph.D.)
- M. Bayatpour (Ph.D.)
- R. Biswas (M.S.)
- S. Chakraborthy (Ph.D.)
- C.-H. Chu (Ph.D.)
- S. Guganani (Ph.D.)
- J. Hashmi (Ph.D.)
- H. Javed (Ph.D.)
- P. Kousha (Ph.D.)
- D. Shankar (Ph.D.)
- H. Shi (Ph.D.)
- J. Zhang (Ph.D.)
Current Students (Undergraduate)
- N. Sarkauskas (B.S.)
Current Research Scientists
- X. Lu
- H. Subramoni
Current Research Specialist
- J. Smith
Current Post-doc
- A. Ruhela
Past Students
- A. Augustine (M.S.)
- P. Balaji (Ph.D.)
- S. Bhagvat (M.S.)
- A. Bhat (M.S.)
- D. Buntinas (Ph.D.)
- L. Chai (Ph.D.)
- B. Chandrasekharan (M.S.)
- N. Dandapanthula (M.S.)
- V. Dhanraj (M.S.)
- T. Gangadharappa (M.S.)
- K. Gopalakrishnan (M.S.)
- W. Huang (Ph.D.)
- W. Jiang (M.S.)
- J. Jose (Ph.D.)
- K. Kandalla (Ph.D.)
- S. Kini (M.S.)
- M. Koop (Ph.D.)
- S. Krishnamoorthy (M.S.)
- K. Kulkarni (M.S.)
- R. Kumar (M.S.)
- P. Lai (M.S.)
- M. Li (Ph.D.)
- J. Liu (Ph.D.)
- M. Luo (Ph.D.)
- A. Mamidala (Ph.D.)
- G. Marsh (M.S.)
- V. Meshram (M.S.)
- A. Moody (M.S.)
- S. Naravula (Ph.D.)
- R. Noronha (Ph.D.)
- X. Ouyang (Ph.D.)
- S. Pai (M.S.)
- S. Potluri (Ph.D.)
- R. Rajachandrasekar (Ph.D.)
- G. Santhanaraman (Ph.D.)
- A. Singh (Ph.D.)
- J. Sridhar (M.S.)
- H. Subramoni (Ph.D.)
- S. Sur (Ph.D.)
- K. Vaidyanathan (Ph.D.)
- A. Vishnu (Ph.D.)
- J. Wu (Ph.D.)
- W. Yu (Ph.D.)
Past Research Scientists
- K. Hamidouche
- S. Sur
Past Post-Docs
- D. Banerjee
- X. Besseron
- H.-W. Jin
- J. Lin
- M. Luo
- E. Mancini
- S. Marcarelli
- J. Vienne
- H. Wang
Past Programmers
- M. Arnold
- D. Bureddy
- J. Perkins
Network Based Computing Laboratory 157 IT4 Innovations’18
Thank You!
Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/
panda@cse.ohio-state.edu
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/