Performance of HPC Middleware over Infiniband WAN



SLIDE 1

Performance of HPC Middleware over Infiniband WAN

Presented by: Ashish Kumar Singh

Designing Efficient FTP Mechanisms for High Performance Data-Transfer over Infiniband

High Performance Data Transfer in Grid Environment Using GridFTP over Infiniband

SLIDE 2

Performance of HPC Middleware over Infiniband WAN

  • S. Narravula, H. Subramoni, P. Lai, R. Noronha and D.K. Panda

SLIDE 3

Motivation

  • Multi-cluster needs of organizations
  • Advent of long-haul Infiniband (IB WAN)

– Infiniband range extenders such as Intel Connects and Obsidian Longbows

  • IB applications and libraries such as MPI and NFS over RDMA were developed for intra-cluster environments

SLIDE 4

Contributions

  • Analyzes the general communication performance of HPC middleware
  • Proposes basic design optimizations for enhancing communication performance over WAN
  • Demonstrates the potential benefits obtained by enhancing internal protocols of middleware

SLIDE 5

IB Range Extension

  • Obsidian Longbows provide range extension for Infiniband fabrics over a 10 Gigabit/s WAN

SLIDE 6

Verbs-level Performance (UD)

  • UD does not involve any acknowledgements from the remote side
  • UD is scalable with higher delays
  • Higher level protocols need to take care of reliability and flow control mechanisms
SLIDE 7

Verbs-level Performance (RC)

  • RC guarantees in-order delivery using ACKs and NACKs, which limits the number of messages that can be in flight to the maximum supported window size
  • A few large messages are enough to fill the pipeline, so large messages are less affected
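
As a rough illustration of the window limit above, the sketch below bounds RC throughput by the amount of data allowed in flight per round trip. The window of 16 messages and the 2 ms round-trip time are assumed values chosen for illustration, not figures from the slides; the 10 Gbit/s default matches the WAN rate quoted on slide 5, and `rc_throughput_gbps` is a hypothetical helper name.

```python
def rc_throughput_gbps(msg_bytes, window_msgs, rtt_s, link_gbps=10.0):
    """Upper bound on throughput when at most `window_msgs` messages
    may be unacknowledged at any time (simplified model)."""
    if rtt_s <= 0:
        return link_gbps
    bits_in_flight = msg_bytes * window_msgs * 8
    # One full window can be delivered per round trip, capped by the link rate.
    return min(link_gbps, bits_in_flight / rtt_s / 1e9)


if __name__ == "__main__":
    # Assumed window (16 messages) and round-trip time (2 ms).
    for size in (2048, 65536, 1 << 20):
        print(size, round(rc_throughput_gbps(size, window_msgs=16, rtt_s=0.002), 3))
```

In this model small messages are capped well below the link rate, while messages large enough to cover the bandwidth-delay product within one window reach it.
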
SLIDE 8

IPoIB Performance (UD)

  • TCP needs larger window sizes to achieve good bandwidth
  • More streams mean more UD packets with independent flow control, so more outstanding packets can be pushed out from the source at any given time
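
The two points above follow from the bandwidth-delay product: a single TCP stream can keep at most one window of data in flight per round trip. The sketch below estimates how many parallel streams would be needed to fill a link under that simplified model. The 1 MiB window and the round-trip times are assumptions for illustration, and `bdp_bytes` and `streams_to_fill` are hypothetical helper names.

```python
import math


def bdp_bytes(link_gbps, rtt_s):
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    return link_gbps * 1e9 * rtt_s / 8


def streams_to_fill(link_gbps, rtt_s, window_bytes):
    """Parallel streams needed if each stream keeps one window in flight."""
    return max(1, math.ceil(bdp_bytes(link_gbps, rtt_s) / window_bytes))


if __name__ == "__main__":
    window = 1 << 20  # assumed 1 MiB TCP window per stream
    for rtt in (0.0001, 0.001, 0.01):
        print(rtt, int(bdp_bytes(10, rtt)), streams_to_fill(10, rtt, window))
```

Either a larger window or more streams raises the amount of data in flight; the model ignores congestion control and per-packet overheads.
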

SLIDE 9

IPoIB Performance (RC)

  • The advantage of RC transport mode for IPoIB is that RC can handle larger packet sizes. Larger packets achieve better bandwidth, and the per-byte TCP processing cost decreases

SLIDE 10

MPI-level Performance (Delay)

  • Trends similar to basic verbs-level evaluation
SLIDE 11

MPI-level Performance (Tuning)

  • The protocol choice changes for medium-sized messages in high-delay scenarios
  • Rendezvous protocol involves an additional message exchange
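
A simple cost model shows why the protocol switch point moves with delay. It assumes an eager send costs one one-way delay plus a receiver-side copy, and a rendezvous send costs three one-way delays (request, reply, data) with no copy. These costs, the bandwidth figures and the function names are illustrative assumptions, not measurements or formulas from the slides.

```python
def eager_time(msg_bytes, one_way_s, bw_bytes_per_s, copy_bytes_per_s):
    """Eager: data leaves immediately, but is copied out of a bounce buffer."""
    return one_way_s + msg_bytes / bw_bytes_per_s + msg_bytes / copy_bytes_per_s


def rendezvous_time(msg_bytes, one_way_s, bw_bytes_per_s):
    """Rendezvous: handshake adds a round trip before the zero-copy transfer."""
    return 3 * one_way_s + msg_bytes / bw_bytes_per_s


def crossover_bytes(one_way_s, copy_bytes_per_s):
    """Message size above which rendezvous is faster in this model."""
    return 2 * one_way_s * copy_bytes_per_s


if __name__ == "__main__":
    # Assumed 2 GB/s memory-copy rate.
    for delay in (1e-6, 1e-4, 1e-3):
        print(delay, int(crossover_bytes(delay, 2e9)))
```

Under these assumptions the crossover size grows linearly with the one-way delay, so a threshold tuned for intra-cluster latency sends medium-sized messages down the slower path over a WAN.
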
SLIDE 12

MPI-level Performance (Streams)

  • For small messages, the messaging rate increases proportionally with the number of communicating streams
  • For higher-delay networks, additional parallel streams improve overall network bandwidth utilization

Figure panels: a) 100 us delay b) 1 ms delay c) 10 ms delay

SLIDE 13

MPI-level Performance (Collective)

  • A simple optimized broadcast performs the bcast operation hierarchically over the two connected clusters, minimizing the traffic on the WAN
  • For small messages the WAN link can handle all the traffic, so congestion is very minor

Figure panels: a) 10 us delay b) 100 us delay c) 1000 us delay
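
To make the WAN-traffic argument concrete, the sketch below counts how many point-to-point sends cross the WAN for a flat binomial-tree broadcast versus a hierarchical one. It assumes a binomial tree rooted at rank 0 and block placement of ranks (the first half in one cluster, the second half in the other); the slides do not state which flat algorithm or placement was used, and all function names are hypothetical.

```python
def binomial_sends(n):
    """(src, dst) pairs of a binomial-tree broadcast rooted at rank 0."""
    sends, step = [], 1
    while step < n:
        for src in range(step):
            dst = src + step
            if dst < n:
                sends.append((src, dst))
        step *= 2
    return sends


def hierarchical_sends(n, split):
    """Root sends once over the WAN to rank `split` (the remote leader),
    then each cluster runs its own local binomial broadcast."""
    local_a = binomial_sends(split)
    local_b = [(s + split, d + split) for s, d in binomial_sends(n - split)]
    return [(0, split)] + local_a + local_b


def wan_crossings(sends, split):
    """Count sends whose endpoints sit in different clusters."""
    return sum(1 for s, d in sends if (s < split) != (d < split))


if __name__ == "__main__":
    n, split = 16, 8
    print(wan_crossings(binomial_sends(n), split))
    print(wan_crossings(hierarchical_sends(n, split), split))
```

Both schedules use the same total number of sends; the hierarchical one only changes which of them traverse the long-haul link.
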

SLIDE 14

Conclusions

  • Applications usually absorb smaller network delays fairly well
  • Many protocols are severely impacted in high-delay scenarios
  • Protocols can be optimized for high-delay scenarios to improve performance
  • With long-haul IB WAN technology, a cluster-of-clusters architecture for HPC systems is feasible

SLIDE 15

Designing Efficient FTP Mechanisms for High Performance Data-Transfer over Infiniband

  • P. Lai, H. Subramoni, S. Narravula, A. Mamidala and D.K. Panda

SLIDE 16

Motivation

  • FTP is the most popular method for transferring bulk data
  • Typically used in applications like data staging, content replication and remote site backup
  • The advent of long-haul Infiniband (IB WAN) made the cluster-of-clusters architecture possible
  • IPoIB and SDP lose a significant share of native IB performance

SLIDE 17

Possible Approaches

  • #1, #2 and #3: existing sockets-based FTP running through intermediate drivers. IPoIB and SDP are the popular schemes for this choice.
  • #4: a new FTP mechanism using native IB features.
SLIDE 18

Performance of Communication Protocols

  • Native IB verbs achieve much higher bandwidth than the other protocols.
  • Performance of FTP implementations such as GridFTP over IPoIB and SDP is even worse.
SLIDE 19

Contributions

  • Design an Advanced Data Transfer Service (ADTS) that leverages zero-copy capabilities
  • Leverage ADTS to design a high-performance zero-copy FTP library
  • Provide a robust and interoperable mechanism to support both zero-copy capable clients and traditional TCP/UDP clients
  • Performance study
SLIDE 20

FTP-ADTS Architecture

  • Clients may be capable of zero-copy data transfer or may support only TCP/UDP-based communication.
  • Once the transport protocol is negotiated, the Data Connection Management component initiates a connection.

SLIDE 21

Design of Zero-Copy Channel

  • Memory semantics using RDMA vs. channel semantics using Send-Recv
  • Drawbacks of memory semantics:

– Pre-allocation, registration and communication of target RDMA buffers
– Explicit flow control
– Notification of completion
– Latency benefits for small messages are marred by high network delay

SLIDE 22

Design of Zero-Copy Channel

  • Advantages of Send-Recv semantics:

– Identical zero-copy benefits
– Simpler flow control, with use of SRQ
– Sender is not throttled down due to lack of buffers on the remote node
– Both RC and UD transports available

SLIDE 23

Design Enhancements

  • Buffer/File Management component keeps a small set of pre-allocated and registered buffers
  • Memory Registration Cache and Persistent Sessions

  • Pipelined Data Transfers
  • Prefork Server to handle bursts of requests
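
The idea behind a memory registration cache can be sketched as a small LRU map from buffer ranges to registration handles, so that repeated transfers from the same buffer skip the expensive registration step. This is a generic illustration, not the ADTS implementation, which the slides do not describe. The `register` and `deregister` callables are hypothetical stand-ins for the real verbs-level registration calls, and the capacity of 4 is an arbitrary assumption.

```python
from collections import OrderedDict


class RegistrationCache:
    """LRU cache of memory-registration handles (illustrative sketch)."""

    def __init__(self, register, deregister, capacity=4):
        self._register = register      # hypothetical: (addr, length) -> handle
        self._deregister = deregister  # hypothetical: handle -> None
        self._capacity = capacity
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, addr, length):
        key = (addr, length)
        if key in self._entries:
            # Reuse the existing registration and mark it most recently used.
            self._entries.move_to_end(key)
            self.hits += 1
            return self._entries[key]
        handle = self._register(addr, length)
        self.misses += 1
        self._entries[key] = handle
        if len(self._entries) > self._capacity:
            # Evict the least recently used registration.
            _, old = self._entries.popitem(last=False)
            self._deregister(old)
        return handle
```

A persistent session would keep such a cache alive across transfers, so that later requests reuse registrations made by earlier ones.
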
SLIDE 24

Performance

  • Site replication over IB WAN using FTP.
  • FTP-ADTS speeds up data transfer by up to 65%.
  • Much lower CPU utilization.
SLIDE 25

Conclusions

  • Existing TCP, UDP or SCTP based FTP implementations are not suitable for WAN-capable interconnects like IB WAN
  • FTP-ADTS transfers data efficiently by leveraging the zero-copy operations of modern interconnects

SLIDE 26

High Performance Data Transfer in Grid Environment Using GridFTP over Infiniband

  • H. Subramoni, P. Lai, R. Kettimuthu and D.K. Panda
SLIDE 27

Overview

  • GridFTP is a high-performance, secure, reliable extension of standard FTP, optimized for WAN
  • The Globus XIO framework, used to design GridFTP, offers an easy-to-use interface
  • The framework hides the complications of the communication semantics of underlying devices (network or disk)

SLIDE 28

Contribution

  • Combining the ease of use of the Globus XIO framework with the high performance achieved through IB
  • Enhancing the disk I/O performance of the existing ADTS library

– By decoupling the network processing from disk I/O operations

  • Evaluation of the design

– Micro-benchmark level
– Applications like Community Climate System Model and ultra-scale visualization

SLIDE 29

Design Issues

  • Most HPC applications require movement of huge amounts of data

– Requires storage on comparatively slow hard disks and RAIDs
– With the low bandwidth provided by TCP/UDP-based FTP, this was not an issue
– Will be an issue for Globus ADTS XIO

  • Solution

– Decoupling of network from disk I/O

SLIDE 30

Design Changes in ADTS

  • Introduction of:

– Multiple threads (read, write and network threads)
– A set of buffers to stage the data

  • The read thread prefetches a set of locations from the disk and keeps them ready for the network thread to send over the physical link
  • How to avoid frequent context switches?

– Low and high water marks; the high water mark is set to the maximum size of the circular buffer
– The read thread reads only when the available buffers fall below the low water mark
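
The water-mark scheme can be sketched with two threads sharing a staging ring. This is a minimal illustration under assumed parameters (ring capacity 8, low water mark 2), not the ADTS code; `StagingRing`, `fill` and `drain` are hypothetical names, and the disk and network are replaced by an in-memory iterable and a callback.

```python
import threading
from collections import deque

_END = object()  # sentinel marking the end of the input


class StagingRing:
    """Staging buffers shared by a read thread and a network thread."""

    def __init__(self, capacity=8, low_water=2):
        self.capacity = capacity    # high water mark: the ring is full
        self.low_water = low_water  # refill only when fewer chunks remain
        self._buf = deque()
        self._cond = threading.Condition()
        self._done = False
        self.refills = 0            # how often the read thread woke up

    def fill(self, chunks):
        """Read thread: refill in bursts, then sleep until the ring runs low."""
        it = iter(chunks)
        exhausted = False
        while not exhausted:
            with self._cond:
                self._cond.wait_for(lambda: len(self._buf) < self.low_water)
                self.refills += 1
                while len(self._buf) < self.capacity:
                    chunk = next(it, _END)
                    if chunk is _END:
                        exhausted = True
                        break
                    self._buf.append(chunk)
                self._cond.notify_all()
        with self._cond:
            self._done = True
            self._cond.notify_all()

    def drain(self, send):
        """Network thread: send staged chunks in order until input ends."""
        while True:
            with self._cond:
                self._cond.wait_for(lambda: self._buf or self._done)
                if not self._buf:
                    return
                chunk = self._buf.popleft()
                self._cond.notify_all()
            send(chunk)  # send outside the lock so the reader can refill
```

Because the read thread wakes only when the ring drops below the low water mark and then fills it to capacity, it runs once per burst rather than once per chunk, which is the context-switch saving the slide refers to.
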
SLIDE 31

Application Level Improvements
