Performance of RDMA-Capable Storage Protocols on Wide-Area Network



SLIDE 1

Managed by UT-Battelle for the Department of Energy

Performance of RDMA-Capable Storage Protocols on Wide-Area Network

Weikuan Yu, Nageswara S.V. Rao, Pete Wyckoff*, Jeffrey S. Vetter
* Ohio Supercomputer Center

SLIDE 2

PDSW'08, Austin, TX

InfiniBand Clusters around the World

  • Ranger (US)
  • SGI (US)
  • CEA (France)
  • Tsubame (Japan)
  • Dawning (China)
  • EKA (India)

SLIDE 3

The Problem of Computing Islands

  • Islands of InfiniBand (IB) clusters
    – More IB clusters are deployed
    – Some already connected, e.g. through TeraGrid
        • But only via TCP/IP protocols
  • Data transfer across these islands
    – Need ever-greater data movement capabilities
    – GridFTP, BBCP or other special storage configuration
    – TCP performance on long distance can be low
  • With 10GigE on UltraScience Net (no tuning)
    – 9.2 Gbps at 0.2 mile
    – 8.2 Gbps at 1400 miles
    – 2.3-2.5 Gbps at 6600+ miles
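
One common explanation for untuned TCP throughput falling with distance is that a fixed window caps how much data can be in flight per round trip. The sketch below estimates the round-trip time and the window needed to sustain 10 Gbps at the three distances listed. The propagation speed of about 200,000 km/s in fiber and the treatment of each distance as a one-way path length are assumptions, not figures from the slides, and the function names are illustrative.

```python
MILES_TO_KM = 1.609344
# Assumption: light travels at roughly 2/3 c in optical fiber.
FIBER_KM_PER_SEC = 200_000.0


def round_trip_time_s(distance_miles):
    """Propagation-only round-trip time, treating the distance as one-way."""
    return 2.0 * distance_miles * MILES_TO_KM / FIBER_KM_PER_SEC


def window_bytes_for_rate(rate_gbps, distance_miles):
    """Bandwidth-delay product: bytes that must be in flight to sustain rate_gbps."""
    bytes_per_sec = rate_gbps * 1e9 / 8.0
    return bytes_per_sec * round_trip_time_s(distance_miles)


if __name__ == "__main__":
    for miles in (0.2, 1400, 6600):
        rtt_ms = round_trip_time_s(miles) * 1e3
        window_mib = window_bytes_for_rate(10.0, miles) / 2**20
        print(f"{miles:>7} miles: RTT ~ {rtt_ms:8.3f} ms, "
              f"window for 10 Gbps ~ {window_mib:8.2f} MiB")
```

Under these assumptions the required window grows linearly with distance, so a default-sized window that is ample at 0.2 mile becomes the limiting factor at thousands of miles.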

SLIDE 4

RDMA (IB) in Clusters and Local Area Networks

  • Sub-microsecond latency
  • Superb bandwidth (32 Gbps with IB QDR)
  • Heavily used for clustering
  • Getting popular in storage environment
    – NFS over RDMA (NFSoRDMA)
    – SCSI RDMA Protocol (SRP)
    – iSCSI over RDMA (iSER)

[Figure: two hosts, each with a stack of Applications, MPI and NFS/iSER/SRP, Verbs, and InfiniBand HCA; latency between them labeled 1 µsec]

SLIDE 5

Sample Performance of RDMA-based Storage

  • RDMA enables good iSCSI bandwidth within LAN
  • Nearly doubled the performance for iSCSI

SLIDE 6

Feasibility of RDMA (IB) on WAN

  • Long-range Extensions for InfiniBand available
    – Network Equipment Technologies (NET): NX5010
    – Obsidian Research: Longbow
  • Long latency (10^4 to 10^5 µsec)
  • High bandwidth yet feasible
    – Good distance scalability and tolerance to interfering traffic
    – Good network throughput and MPI-level performance
  • Can RDMA provide a good transport protocol for storage on WAN?

[Figure: two hosts, each with a stack of Applications, MPI and NFS/iSER/SRP, Verbs, and InfiniBand HCA; latency between them labeled 10^4 to 10^5 µsec]

SLIDE 7

Experimental Environment

  • Hardware
    – Long-range IB extension devices from NET (Network Equipment Technologies, Inc)
    – Mellanox PCI-Express 4x DDR HCAs (InfiniHost-III and Connect-X)
  • Software Packages
    – OFED-1.3 from openfabrics.org
    – Linux-2.6.25 with NFSoRDMA and iSER support
  • Performance of RDMA-based Storage Protocols on WAN
    – NFS over RDMA
    – iSCSI over RDMA

SLIDE 8

UltraScience Net at ORNL

  • Experimental WAN Network
    – Oak Ridge, Atlanta, Chicago, Seattle, and Sunnyvale
    – OC192 backbone connections
    – 4300 miles one way, 8600 miles loop-back

SLIDE 9

RDMA-based Transport

  • Request and reply become pure control messages, and have to travel long distance on WAN
  • Use of RDMA read (round-trip operations) for clients to write data
  • Possible additional control messages for NFSoRDMA for long arguments
  • Further fragmentation due to the use of page-based operations
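
The cost of these round trips can be illustrated with a rough model of a single client write. This is a simplified sketch, not the actual NFSoRDMA or iSER implementation: it assumes one round trip for the request/reply pair, one round trip per batch of RDMA reads issued by the server, and a hypothetical cap `max_outstanding` on reads in flight. All names and the example numbers (20 ms round trip, 8 Gbps link) are illustrative assumptions.

```python
import math


def rdma_write_time_s(block_bytes, chunk_bytes, rtt_s, link_gbps, max_outstanding=1):
    """Estimate the time for one client write served by server-side RDMA reads.

    The block is split into chunks of chunk_bytes (4096 models page-based
    operations). Up to max_outstanding reads are assumed to overlap, so each
    batch of reads costs one round trip.
    """
    chunks = math.ceil(block_bytes / chunk_bytes)
    read_rounds = math.ceil(chunks / max_outstanding)
    wire_time = block_bytes * 8 / (link_gbps * 1e9)
    # One round trip for request + reply, plus one per batch of RDMA reads.
    return (1 + read_rounds) * rtt_s + wire_time


if __name__ == "__main__":
    block = 2**20  # 1 MiB write
    for chunk in (4096, 65536, block):
        t = rdma_write_time_s(block, chunk, rtt_s=0.02, link_gbps=8.0, max_outstanding=4)
        print(f"chunk {chunk:>8} bytes: ~{t * 1e3:8.1f} ms per 1 MiB write")
```

In this model the round-trip term dominates at WAN latencies, so fewer, larger RDMA reads or more reads in flight reduce the write time far more than a faster link would.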

SLIDE 10

RDMA on WAN

  • RDMA has good network-level performance within short distance WAN
  • High bandwidth at long distance is only possible for large messages
  • Low RDMA-read performance for page-based messages (4KB), even at 0.2 mile when using InfiniHost-III HCAs
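
Why only large messages reach high bandwidth at long distance can be seen from a simple latency model. The sketch below assumes a single message in flight at a time, so each message pays one full round trip plus its serialization time. The 8 Gbps link rate and the round-trip values are illustrative assumptions, not measurements from the slides.

```python
def effective_gbps(msg_bytes, rtt_s, link_gbps):
    """Throughput when each message waits one round trip before the next is sent."""
    bits = msg_bytes * 8
    transfer_time = rtt_s + bits / (link_gbps * 1e9)
    return bits / transfer_time / 1e9


if __name__ == "__main__":
    sizes = (4 * 2**10, 64 * 2**10, 2**20, 64 * 2**20)
    for rtt_ms in (0.01, 20.0):
        print(f"round trip {rtt_ms} ms")
        for size in sizes:
            rate = effective_gbps(size, rtt_ms / 1e3, link_gbps=8.0)
            print(f"  {size:>10} bytes: ~{rate:.4f} Gbps")
```

With a 20 ms round trip, a 4 KB message in this model uses a tiny fraction of the link, while a message of tens of megabytes approaches the link rate.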

SLIDE 11

NFS over RDMA

  • NFS over RDMA achieves good bandwidth within short distance
  • But significant optimizations are needed for long distance

SLIDE 12

NFS - Large block size

  • NFS over IPoIB-CM benefits from large block size
  • NFS over RDMA needs to support large block size for better fit on long-distance WAN

SLIDE 13


NFS over RDMA - using Connect-X

  • Better RDMA read in Connect-X improves the performance of file write for NFS over RDMA
  • Performance at long distance is yet to be determined

SLIDE 14

iSCSI over RDMA (iSER)

  • RDMA enables high-performance iSCSI within short distance
  • RDMA has good promise over long distance as shown with large messages

SLIDE 15

Perspectives

  • Long-range InfiniBand
    – InfiniBand over SONET is promising
    – Storage protocols are not yet exploiting the bandwidth potential of RDMA at long distance
  • RDMA-based Storage on WAN
    – Need to enable large block sizes
    – Need to avoid page-based RDMA operations in NFS
        • Utilize IB FRMR support to avoid small RDMA operations
    – Need to allow more concurrent RDMA read operations

SLIDE 16

Acknowledgment

  • Network Equipment Technologies, Inc
    – Andrew DiSilvestre
    – Rich Erikson
    – Brad Chalker
  • ORNL
    – Susan Hicks
    – Philip Roth
  • Mellanox