

SLIDE 1

NFS over RDMA 1 of 17 SIGCOMM 2003, NICELI Workshop

NFS over RDMA

Brent Callaghan, Theresa Lingutla-Raj, Alex Chiu, Peter Staubach, Omer Asad
Sun Microsystems, Inc.

SLIDE 2

Why RDMA as a Transport?

  • Nice to have at 1 Gb/sec, but a must-have for 10 Gb/sec
  • Offload protocol processing from the general-purpose CPU to dedicated protocol hardware
  • Offload the host memory/IO bus with direct data placement (DDP)

SLIDE 3

NFS is an RDMA Sweet Spot

  • Clients and servers are close
      – Most commonly on a LAN
      – Often in the same server room or rack
      – Bandwidth high, latency low
  • NFS moves big chunks of data
      – 8 KB for NFS version 2
      – No limit for NFS version 3
  • Most clients read & write 32 KB chunks
  • Solaris servers accept up to 1 MB reads/writes

SLIDE 4

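RDMA as a new RPC Transport

[Diagram: protocol stack with NFS, NLM and ACL over RPC/XDR, and TCP, UDP (over IP) and RDMA as the transports beneath it. The label "CHANGES" marks what is new.]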

SLIDE 5

Small RPC Messages

[Diagram: the RPC call and the RPC reply are each carried by a SEND, and each lands in a small pre-posted receive buffer at the receiver.]

  • Most NFS messages are quite small
      – Less than 1 KB
  • No RDMA needed: just use SENDs

SLIDE 6

Moving NFS data with RDMA

An NFS read reply or write request is a large chunk of data with a variable-length RPC & NFS header.

That large chunk of data could be moved more efficiently if we could move it with DDP instead.

[Diagram: a message made up of an RPC header followed by data, and a DDP header with the fields STag, Address, Length.]
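
The idea behind direct data placement can be sketched with a toy model. This is not a real RDMA API: `ToyNic`, `register`, `place` and `contents` are hypothetical names, STags are plain integers, and "addresses" are positions in a Python bytearray. The only point shown is that a header of (STag, address, length) is enough to put the payload straight into its final buffer.

```python
class PlacementError(Exception):
    """Raised when a DDP header does not match a registered region."""


class ToyNic:
    """Toy model of a NIC placing tagged data directly into registered buffers."""

    def __init__(self):
        self._regions = {}      # STag -> (base address, bytearray)
        self._next_stag = 1

    def register(self, base, length):
        """Register a buffer and return the STag naming it (hypothetical API)."""
        stag = self._next_stag
        self._next_stag += 1
        self._regions[stag] = (base, bytearray(length))
        return stag

    def place(self, stag, address, payload):
        """Copy payload to [address, address + len(payload)) in the tagged region."""
        if stag not in self._regions:
            raise PlacementError("unknown STag")
        base, buf = self._regions[stag]
        start = address - base
        if start < 0 or start + len(payload) > len(buf):
            raise PlacementError("placement outside registered region")
        buf[start:start + len(payload)] = payload

    def contents(self, stag):
        """Return a copy of the region's bytes, for inspection."""
        return bytes(self._regions[stag][1])


nic = ToyNic()
stag = nic.register(base=0x1000, length=16)
nic.place(stag, 0x1004, b"NFSDATA")   # header: STag, address; length is implied by the payload
```

In this model the receiver never parses the RPC header to find out where the data belongs; the tag and address carried with the data decide that.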

SLIDE 7

XDR De-Chunking the Message

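[Diagram: an XDR-encoded RPC message containing chunks, shown encoded for two transports.]

  • Encoded message for TCP transport: the whole message goes over the TCP connection
  • Encoded message for RDMA transport: the non-chunk parts go by RDMA Send; each chunk is moved by RDMA Read or Write and is described by a chunk list entry (XDR offset, chunk address)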
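
A minimal sketch of de-chunking, under stated assumptions: the function names `dechunk` and `rechunk` are hypothetical, the caller already knows which byte ranges hold bulk data, and each chunk list entry carries the chunk's bytes directly where the real transport would carry a chunk address for RDMA Read or Write.

```python
def dechunk(message, spans):
    """Split an encoded message into inline bytes plus a chunk list.

    spans is a sorted list of non-overlapping (offset, length) pairs marking
    the bulk data. Each entry records the XDR offset where the chunk was cut
    out, plus the chunk bytes (a stand-in for a chunk address).
    """
    inline = bytearray()
    chunk_list = []
    pos = 0
    for offset, length in spans:
        inline += message[pos:offset]            # non-chunk bytes before the chunk
        chunk_list.append({"xdr_offset": offset,
                           "data": message[offset:offset + length]})
        pos = offset + length
    inline += message[pos:]                      # trailing non-chunk bytes
    return bytes(inline), chunk_list


def rechunk(inline, chunk_list):
    """Rebuild the original message by re-inserting each chunk at its XDR offset."""
    out = bytearray()
    pos = 0                                      # read position in the inline bytes
    for entry in chunk_list:
        need = entry["xdr_offset"] - len(out)    # inline bytes that precede this chunk
        out += inline[pos:pos + need]
        pos += need
        out += entry["data"]
    out += inline[pos:]
    return bytes(out)


message = b"HDR:" + b"D" * 8 + b":END"
inline, chunks = dechunk(message, [(4, 8)])
restored = rechunk(inline, chunks)
```

Recording the offset in the original XDR stream is what lets the receiver put each chunk back in the right place without any framing inside the inline part.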

SLIDE 8

RDMA Transport Header

  • Header fields: XID, Version, Message Type, Chunk List
  • Followed by: RPC message sans chunks
  • Chunk list entry fields: XDR Stream Offset, Chunk Length, Source STag, Source Address, Next Chunk
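
The header layout above can be sketched with Python's `struct` module. The field order follows the slide, but everything else is an assumption made for illustration: the field widths (32-bit big-endian words, 64-bit address), the use of a 32-bit flag to represent "Chunk List" and "Next Chunk" as an XDR-style linked list, and the names `encode` and `decode`.

```python
import struct

HEADER = struct.Struct(">III")    # XID, version, message type (widths assumed)
FLAG = struct.Struct(">I")        # 1 = a chunk entry follows, 0 = end of list (assumed)
ENTRY = struct.Struct(">IIIQ")    # XDR stream offset, chunk length, source STag, source address


def encode(xid, version, msg_type, chunks, rpc_body):
    """Pack the transport header, the chunk list, then the RPC message sans chunks."""
    parts = [HEADER.pack(xid, version, msg_type)]
    for chunk in chunks:                      # chunk = (offset, length, stag, address)
        parts.append(FLAG.pack(1))
        parts.append(ENTRY.pack(*chunk))
    parts.append(FLAG.pack(0))                # no next chunk
    parts.append(rpc_body)
    return b"".join(parts)


def decode(buf):
    """Inverse of encode: return (xid, version, msg_type, chunks, rpc_body)."""
    xid, version, msg_type = HEADER.unpack_from(buf, 0)
    pos = HEADER.size
    chunks = []
    while True:
        (present,) = FLAG.unpack_from(buf, pos)
        pos += FLAG.size
        if not present:
            break
        chunks.append(ENTRY.unpack_from(buf, pos))
        pos += ENTRY.size
    return xid, version, msg_type, chunks, buf[pos:]


wire = encode(0x1234, 1, 0, [(128, 32768, 7, 0x7F0000001000)], b"rpc-call")
decoded = decode(wire)
```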

SLIDE 9

Read-Read Protocol

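[Diagram: message exchange between client and server.]

  • Client to server: SEND of the RPC call (message + chunk list)
  • Server: RDMA READ of the arg chunks from the client
  • Server to client: SEND of the RPC reply (message + chunk list)
  • Client: RDMA READ of the result chunks from the server
  • Client to server: SEND of RPC done (server can free its chunks)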
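
The read-read exchange can be written out as a trace generator. This is only a sketch: `read_read_exchange` is a hypothetical name, nothing is actually transferred, and it assumes that the READ steps are skipped when there are no chunks and that RPC done is sent only when the server exposed result chunks.

```python
def read_read_exchange(n_arg_chunks, n_result_chunks):
    """Return the (initiator, operation, description) steps of one RPC."""
    trace = [("client", "SEND", "RPC call: message + chunk list")]
    for i in range(n_arg_chunks):
        # The server pulls each argument chunk out of client memory.
        trace.append(("server", "READ", f"arg chunk {i} from client"))
    trace.append(("server", "SEND", "RPC reply: message + chunk list"))
    for i in range(n_result_chunks):
        # The client pulls each result chunk out of server memory.
        trace.append(("client", "READ", f"result chunk {i} from server"))
    if n_result_chunks:
        # Assumption: only needed when the server is holding chunks for the client.
        trace.append(("client", "SEND", "RPC done: server may free chunks"))
    return trace


nfs_read = read_read_exchange(n_arg_chunks=0, n_result_chunks=1)
nfs_write = read_read_exchange(n_arg_chunks=1, n_result_chunks=0)
for step in nfs_read:
    print(*step)
```

In this model an NFS read costs three SENDs and one READ, because the server has to keep the result chunk registered until the client says it is done.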

SLIDE 10

NFS/TCP Throughput

Peak throughput 60 MB/sec @ 256 KB reads & 4 reads-ahead

SLIDE 11

NFS/RDMA Throughput

Peak throughput 102 MB/sec @ 256 KB reads & 8 reads-ahead
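
The two peak figures (60 MB/sec on slide 10, 102 MB/sec here) can be compared with a little arithmetic. The helper names are hypothetical. The link-utilization line rests on two assumptions that the slides do not state for these measurements: that they were taken on a 1 Gb/sec link, and that 1 MB means 10^6 bytes.

```python
def speedup(new_mb_s, old_mb_s):
    """Ratio of two throughputs measured in the same units."""
    return new_mb_s / old_mb_s


def link_utilization(throughput_mb_s, link_gbit_s):
    """Fraction of raw link bandwidth used, taking 1 MB as 10**6 bytes."""
    link_mb_s = link_gbit_s * 1000 / 8
    return throughput_mb_s / link_mb_s


tcp_peak = 60.0     # MB/sec, NFS/TCP peak from slide 10
rdma_peak = 102.0   # MB/sec, NFS/RDMA peak from slide 11
ratio = speedup(rdma_peak, tcp_peak)
print(f"NFS/RDMA peak is {ratio:.2f}x the NFS/TCP peak")
print(f"Share of an assumed 1 Gb/sec link: {link_utilization(rdma_peak, 1.0):.0%}")
```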

SLIDE 12

CPU Utilization

(with no async read-ahead)

SLIDE 13

Further Work

  • NFS/RDMA protocol Internet Drafts submitted to IETF
  • Extends basic “read-read” protocol to use RDMA write with ULP hooks: “read-write”

  • Includes receive buffer request/grant credit control
  • Support for alignment padding in RDMA SENDs
  • Receive buffer size negotiation protocol
  • Support in NFS version 4.1
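
Receive buffer credit control exists because every SEND must land in a buffer the receiver has already posted. A minimal sketch of one way a sender could honor grants follows. The class `CreditedSender`, its methods, and the rule that each reply carries the new grant are assumptions for illustration, not the protocol defined in the drafts.

```python
from collections import deque


class CreditedSender:
    """Holds back calls so that no more than `grant` are outstanding at once."""

    def __init__(self, initial_grant=1):
        self.grant = initial_grant    # receive buffers the peer has promised
        self.in_flight = 0            # calls sent but not yet answered
        self.queued = deque()         # calls waiting for a credit
        self.sent = []                # calls handed to the transport, in order

    def submit(self, call):
        """Queue a call and send it at once if a credit is available."""
        self.queued.append(call)
        self._drain()

    def on_reply(self, new_grant):
        """A reply arrived; it frees one credit and carries the peer's new grant."""
        self.in_flight -= 1
        self.grant = new_grant
        self._drain()

    def _drain(self):
        while self.queued and self.in_flight < self.grant:
            self.sent.append(self.queued.popleft())
            self.in_flight += 1


sender = CreditedSender(initial_grant=1)
for call in ("a", "b", "c"):
    sender.submit(call)
sent_before_reply = list(sender.sent)   # only one call fits under the initial grant
sender.on_reply(new_grant=2)            # the peer now offers two receive buffers
```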

SLIDE 14

Extended RDMA Transport Header

  • Old header: XID, Version, Message Type, Chunk List
  • Extended header: XID, Version, Message Type, Credits, Read List, Write List, Reply, Threshold, Alignment
  • What the new fields are for:
      – Credits: receive buffer credit control
      – Write List: direct write from server
      – Reply: long replies
      – Threshold, Alignment: padding control

SLIDE 15

Read-Write Protocol

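[Diagram: message exchange between client and server.]

  • Client to server: SEND of the RPC call (message + write list)
  • Server: RDMA READ of the arg chunks from the client
  • Server: RDMA WRITE of the result chunks into the client
  • Server to client: SEND of the RPC reply (message + write list)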
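
The read-write exchange can be traced the same way as a list of steps. This is a sketch only: `read_write_exchange` is a hypothetical name, no data is moved, and the READ and WRITE steps are assumed to be skipped when there are no chunks of that kind.

```python
def read_write_exchange(n_arg_chunks, n_result_chunks):
    """Return the (initiator, operation, description) steps of one RPC."""
    trace = [("client", "SEND", "RPC call: message + write list")]
    for i in range(n_arg_chunks):
        # The server pulls each argument chunk out of client memory.
        trace.append(("server", "READ", f"arg chunk {i} from client"))
    for i in range(n_result_chunks):
        # The server pushes each result chunk into a buffer the client advertised.
        trace.append(("server", "WRITE", f"result chunk {i} into client"))
    trace.append(("server", "SEND", "RPC reply: message + write list"))
    return trace


nfs_read = read_write_exchange(n_arg_chunks=0, n_result_chunks=1)
for step in nfs_read:
    print(*step)
```

Compared with the read-read protocol on slide 9, the client issues no RDMA operations and there is no RPC done message, since the server holds no chunks on the client's behalf once the reply has been sent.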

SLIDE 16

Project Status

  • Solaris prototype
      – kVIPL with Emulex GN9000/VI, 1 Gb link
      – Like a normal NFS mount
      – Demonstrated good performance
  • InfiniBand
      – Implementing extended “read-write” protocol
      – Mellanox Tavor, 10 Gb (4x) link
      – Evaluating performance
