Managed by UT-Battelle for the Department of Energy
Performance of RDMA-Capable Storage Protocols on Wide-Area Network
Weikuan Yu, Nageswara S.V. Rao, Pete Wyckoff*, Jeffrey S. Vetter
PDSW'08, Austin, TX
InfiniBand Clusters around the World
Ranger (US), SGI (US), CEA (France), Tsubame (Japan), Dawning (China), EKA (India)
The Problem of Computing Islands
- Islands of InfiniBand (IB) clusters
  – More IB clusters are deployed
  – Some already connected, e.g. through TeraGrid
- But only via TCP/IP protocols
- Data transfer across these islands
  – Need ever-greater data movement capabilities
  – GridFTP, BBCP or other special storage configuration
  – TCP performance over long distance can be low
- With 10GigE on UltraScience Net (no tuning)
  – 9.2 Gbps at 0.2 mile
  – 8.2 Gbps at 1400 miles
  – 2.3-2.5 Gbps at 6600+ miles
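One way to see why untuned TCP falls off with distance is the bandwidth-delay product: the sender must keep that many bytes in flight to fill the link. The sketch below is only a rough illustration. The fiber propagation speed, the reading of each quoted distance as a one-way path length, and the 4 MiB window are all assumptions that do not come from the slides, and the model is not meant to reproduce the measured figures above.

```python
# Assumption (not from the slides): signals in optical fiber travel at
# roughly 200,000 km/s, and each quoted distance is the one-way path length.
FIBER_KM_PER_S = 200_000.0
KM_PER_MILE = 1.609344


def round_trip_seconds(path_miles: float) -> float:
    """Propagation-only round-trip time for a one-way path of path_miles."""
    return 2.0 * path_miles * KM_PER_MILE / FIBER_KM_PER_S


def bdp_bytes(link_gbps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    return link_gbps * 1e9 / 8.0 * rtt_s


def window_limited_gbps(window_bytes: float, rtt_s: float) -> float:
    """Upper bound on throughput when at most window_bytes are unacknowledged."""
    return window_bytes * 8.0 / rtt_s / 1e9


if __name__ == "__main__":
    window = 4 * 1024 * 1024  # hypothetical untuned 4 MiB socket buffer
    for miles in (0.2, 1400, 6600):
        rtt = round_trip_seconds(miles)
        capped = min(10.0, window_limited_gbps(window, rtt))
        print(f"{miles:>7} miles: RTT {rtt * 1e3:8.3f} ms, "
              f"BDP at 10 Gbps {bdp_bytes(10, rtt) / 2**20:8.2f} MiB, "
              f"4 MiB window caps at {capped:5.2f} Gbps")
```

The point of the sketch is the trend: the required window grows linearly with round-trip time, so a fixed window caps throughput harder as the path gets longer.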
RDMA (IB) in Clusters and Local Area Networks
- Sub-microsecond latency
- Superb bandwidth (32Gbps with IB QDR)
- Heavily used for clustering
- Getting popular in storage environment
  – NFS over RDMA (NFSoRDMA)
  – SCSI RDMA Protocol (SRP)
  – iSCSI over RDMA (iSER)
[Figure: two end hosts, each labeled Applications, MPI, NFS/iSER/SRP, Verbs, and InfiniBand HCA; link latency 1 µsec]
Sample Performance of RDMA-based Storage
- RDMA enables good iSCSI bandwidth within LAN
- Nearly doubled the performance for iSCSI
Feasibility of RDMA (IB) on WAN
- Long-range Extensions for InfiniBand available
  – Network Equipment Technologies (NET): NX5010
  – Obsidian Research: Longbow
- Long latency (10^4 to 10^5 µsec)
- High bandwidth yet feasible
  – Good distance scalability and tolerance to interfering traffic
  – Good network throughput and MPI-level performance
- Can RDMA provide a good transport protocol for storage on WAN?
[Figure: two end hosts, each labeled Applications, MPI, NFS/iSER/SRP, Verbs, and InfiniBand HCA; link latency 10^4 to 10^5 µsec]
Experimental Environment
- Hardware
  – Long-range IB extension devices from NET (Network Equipment Technologies, Inc)
  – Mellanox PCI-Express 4x DDR HCAs (InfiniHost-III and Connect-X)
- Software Packages
  – OFED-1.3 from openfabrics.org
  – Linux-2.6.25 with NFSoRDMA and iSER support
- Performance of RDMA-based Storage Protocols on WAN
  – NFS over RDMA
  – iSCSI over RDMA
UltraScience Net at ORNL
- Experimental WAN Network
  – Oak Ridge, Atlanta, Chicago, Seattle, and Sunnyvale
  – OC192 backbone connections
  – 4300 miles one way, 8600 miles loop-back
RDMA-based Transport
- Request and reply become pure control messages, and have to travel long distance on WAN
- Use of RDMA read (round-trip operations) for clients to write data
- Possible additional control messages for NFSoRDMA for long arguments
- Further fragmentation due to the use of page-based operations
RDMA on WAN
- RDMA has good network-level performance within short distance WAN
- High bandwidth at long distance is only possible for large messages
- Low RDMA-read performance for page-based messages (4KB), even at 0.2 mile when using InfiniHost-III HCAs
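The dependence on message size can be sketched with a simple model: if a requester issues one RDMA read at a time, each read costs a full round trip plus the time to put the payload on the wire. The function name, the 20 ms round-trip time, and the 8 Gbps usable link rate below are hypothetical values chosen for illustration, not measurements from this study.

```python
def single_read_gbps(msg_bytes: int, rtt_s: float, link_gbps: float) -> float:
    """Throughput with exactly one RDMA read outstanding at any time.

    Each read is modeled as one round trip plus the serialization time of
    its payload; HCA and host processing overheads are ignored.
    """
    wire_s = msg_bytes * 8 / (link_gbps * 1e9)
    return msg_bytes * 8 / (rtt_s + wire_s) / 1e9


if __name__ == "__main__":
    RTT_S = 0.02      # assumed round-trip time (hypothetical)
    LINK_GBPS = 8.0   # assumed usable link rate (hypothetical)
    for size in (4096, 64 * 1024, 1 << 20, 64 << 20):
        rate = single_read_gbps(size, RTT_S, LINK_GBPS)
        print(f"{size:>10} bytes per read: {rate:8.4f} Gbps")
```

Under these assumptions a 4KB read is dominated almost entirely by the round trip, while a read of tens of megabytes spends most of its time moving data, which matches the qualitative observation on this slide.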
NFS over RDMA
- NFS over RDMA achieves good bandwidth within short distance
- But significant optimizations are needed for long distance
NFS - Large block size
- NFS over IPoIB-CM benefits from large block size
- NFS over RDMA needs to support large block size for better fit on long-distance WAN
NFS over RDMA - using Connect-X
- Better RDMA read in Connect-X improves the performance of file write for NFS over RDMA
- Performance at long distance is yet to be determined
iSCSI over RDMA (iSER)
- RDMA enables high-performance iSCSI within short distance
- RDMA has good promise over long distance as shown with large messages
Perspectives
- Long-range InfiniBand
  – InfiniBand over SONET is promising
  – Storage protocols are not yet exploiting the bandwidth potential of RDMA at long distance
- RDMA-based Storage on WAN
  – Need to enable large block sizes
  – Need to avoid page-based RDMA operations in NFS
    - Utilize IB FRMR support to avoid small RDMA operations
  – Need to allow more concurrent RDMA read operations
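To give a feel for why both larger operations and more concurrency matter, the following sketch estimates how many RDMA reads of a given size would have to be outstanding at once to keep a long link full. The helper name and the 8 Gbps and 20 ms figures are illustrative assumptions, not values taken from the experiments.

```python
import math


def reads_needed(link_gbps: float, rtt_s: float, read_bytes: int) -> int:
    """Outstanding reads of read_bytes each needed to cover one round trip.

    This is the bandwidth-delay product divided by the read size, rounded up.
    """
    in_flight_bytes = link_gbps * 1e9 / 8 * rtt_s
    return max(1, math.ceil(in_flight_bytes / read_bytes))


if __name__ == "__main__":
    LINK_GBPS = 8.0  # assumed usable link rate (hypothetical)
    RTT_S = 0.02     # assumed round-trip time (hypothetical)
    for size in (4096, 64 * 1024, 1 << 20):
        n = reads_needed(LINK_GBPS, RTT_S, size)
        print(f"{size:>8}-byte reads: about {n} must be in flight")
```

With page-sized reads the required concurrency is in the thousands under these assumptions, whereas megabyte-sized reads need only a few tens, which is why enlarging operations and raising the concurrency limit are complementary.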
Acknowledgment
- Network Equipment Technologies, Inc
  – Andrew DiSilvestre
  – Rich Erikson
  – Brad Chalker
- ORNL
  – Susan Hicks
  – Philip Roth
- Mellanox