
CompSci 514: Computer Networks Lecture 17: Network Support for Remote Direct Memory Access



  1. CompSci 514: Computer Networks Lecture 17: Network Support for Remote Direct Memory Access Xiaowei Yang Some slides adapted from http://www.cs.unh.edu/~rdr/rdma-intro-module.ppt

  2. Overview • Introduction to RDMA • DCQCN: congestion control for large-scale RDMA deployments • Experience of deploying RDMA in a large-scale datacenter network

  3. What is RDMA? • A (relatively) new method for high-speed inter-machine communication – new standards – new protocols – new hardware interface cards and switches – new software

  4. Remote Direct Memory Access • Read, write, send, receive, etc. do not go through the CPU

  5. • Two machines (Intel Xeon E5-2660 2.2 GHz, 16 cores, 128 GB RAM, 40 Gbps NICs, Windows Server 2012 R2) connected via a 40 Gbps switch.

  6. Remote Direct Memory Access • Remote – data transfers between nodes in a network • Direct – no Operating System Kernel involvement in transfers – everything about a transfer offloaded onto Interface Card • Memory – transfers between user space application virtual memory – no extra copying or buffering • Access – send, receive, read, write, atomic operations

  7. RDMA Benefits • High throughput • Low latency • High messaging rate • Low CPU utilization • Low memory bus contention • Message boundaries preserved • Asynchronous operation

  8. RDMA Technologies • InfiniBand – (41.8% of top 500 supercomputers) – SDR 4x – 8 Gbps – DDR 4x – 16 Gbps – QDR 4x – 32 Gbps – FDR 4x – 54 Gbps • iWARP – Internet Wide Area RDMA Protocol – 10 Gbps • RoCE – RDMA over Converged Ethernet – 10 Gbps – 40 Gbps
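The InfiniBand figures on this slide are data rates after line encoding, and they can be reproduced with a little arithmetic. The sketch below assumes the commonly quoted per-lane signaling rates and encodings (8b/10b for SDR, DDR, and QDR; 64b/66b for FDR); those inputs are not on the slide, and the names `GENERATIONS` and `data_rate_gbps` are made up for illustration.

```python
# Assumed inputs (not from the slide): per-lane signaling rate in Gbit/s
# and the fraction of signaled bits that carry data after line encoding.
LANES = 4  # the "4x" in "SDR 4x", "DDR 4x", ...

GENERATIONS = {
    "SDR": (2.5, 8 / 10),        # 8b/10b encoding
    "DDR": (5.0, 8 / 10),
    "QDR": (10.0, 8 / 10),
    "FDR": (14.0625, 64 / 66),   # 64b/66b encoding
}


def data_rate_gbps(lane_rate, efficiency, lanes=LANES):
    """Usable data rate of a link: lanes x signaling rate x encoding efficiency."""
    return lane_rate * lanes * efficiency


for name, (lane_rate, efficiency) in GENERATIONS.items():
    print(f"{name} {LANES}x: {data_rate_gbps(lane_rate, efficiency):.1f} Gbps")
```

Under these assumptions the FDR result lands slightly above 54 Gbps, which the slide rounds down.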

  9. RDMA architecture layers

  10. Software RDMA Drivers • SoftiWARP – www.zurich.ibm.com/sys/rdma – open source kernel module that implements iWARP protocols on top of ordinary kernel TCP sockets – interoperates with hardware iWARP at other end of wire • Soft RoCE – www.systemfabricworks.com/downloads/roce – open source IB transport and network layers in software over ordinary Ethernet – interoperates with hardware RoCE at other end of wire

  11. Similarities between TCP and RDMA • Both utilize the client-server model • Both require a connection for reliable transport • Both provide a reliable transport mode – TCP provides a reliable in-order sequence of bytes – RDMA provides a reliable in-order sequence of messages

  12. How RDMA differs from TCP/IP • “zero copy” – data transferred directly from virtual memory on one node to virtual memory on another node • “kernel bypass” – no operating system involvement during data transfers • asynchronous operation – threads not blocked during I/O transfers

  13. TCP/IP setup [Diagram: client and server stacks, each layered as User App, Kernel Stack, CA, Wire. Client calls: setup, connect. Server calls: setup, bind, listen, accept. Legend: blue lines = control information; red lines = user data; green lines = control and data.]

  14. RDMA setup [Diagram: client and server stacks, each layered as User App, Kernel Stack, CA, Wire. Client calls: setup, rdma_connect. Server calls: setup, rdma_bind, rdma_listen, rdma_accept. Legend: blue lines = control information; red lines = user data; green lines = control and data.]

  15. TCP/IP setup [Diagram repeated from slide 13: client calls setup, connect; server calls setup, bind, listen, accept. Legend: blue lines = control information; red lines = user data; green lines = control and data.]

  16. TCP/IP transfer [Diagram: client and server stacks, each layered as User App, Kernel Stack, CA, Wire. Client calls: setup, connect, then send to transfer data. Server calls: setup, bind, listen, accept, then recv to transfer data. On both sides the data is copied between the User App and the Kernel Stack. Legend: blue lines = control information; red lines = user data; green lines = control and data.]

  17. RDMA transfer [Diagram: client and server stacks, each layered as User App, Kernel Stack, CA, Wire. Client calls: setup, rdma_connect, then rdma_post_send to transfer data. Server calls: setup, rdma_bind, rdma_listen, rdma_accept, then rdma_post_recv to transfer data. Legend: blue lines = control information; red lines = user data; green lines = control and data.]

  18. “Normal” TCP/IP socket access model • Byte streams – requires application to delimit / recover message boundaries • Synchronous – blocks until data is sent/received – O_NONBLOCK, MSG_DONTWAIT are not asynchronous, are “try” and “try again” • send() and recv() are paired – both sides must participate in the transfer • Requires data copy into system buffers – order and timing of send() and recv() are irrelevant – user memory accessible immediately before and immediately after each send() and recv() call
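Because TCP delivers a byte stream, the application has to mark where each message ends. One common way to do that is a length prefix, sketched below over a local `socket.socketpair()`. The 4-byte big-endian header and the helper names `send_msg`, `recv_exact`, and `recv_msg` are choices made for this illustration, not something the slides prescribe.

```python
import socket
import struct


def send_msg(sock, payload: bytes) -> None:
    """Send one message as a 4-byte big-endian length followed by the payload."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n: int) -> bytes:
    """Read exactly n bytes; a single recv() may return fewer than requested."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed before the full message arrived")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_msg(sock) -> bytes:
    """Recover one message boundary from the byte stream."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


def demo():
    a, b = socket.socketpair()  # connected stream sockets in one process
    try:
        send_msg(a, b"hello")
        send_msg(a, b"rdma")
        # Without the length prefix, both payloads could arrive as one blob.
        return [recv_msg(b), recv_msg(b)]
    finally:
        a.close()
        b.close()


print(demo())
```

With RDMA send/receive this framing layer is unnecessary, since each posted send arrives as one message.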

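  19. TCP RECV() [Timeline diagram with columns USER, OPERATING SYSTEM, NIC, WIRE. The user allocates virtual memory and calls recv(), which passes metadata (control) to the operating system; the OS adds it to its tables and the user sleeps (blocked). Data packets arrive from the wire into TCP buffers and ACKs are returned. The OS copies the data into the user's virtual memory, then wakes the user with a status, after which the user can access the data.]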

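  20. RDMA RECV() [Timeline diagram with columns USER, CHANNEL ADAPTER, WIRE. The user allocates and registers virtual memory and calls recv(), which places metadata (control) on the recv queue. The user continues with parallel activity while data packets arrive from the wire and are placed directly into the registered virtual memory; an ACK is returned and a status entry is added to the completion queue. The user calls poll_cq() to obtain the status and can then access the data.]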

  21. RDMA access model • Messages – preserves user's message boundaries • Asynchronous – no blocking during a transfer, which – starts when metadata added to work queue – finishes when status available in completion queue • 1-sided (unpaired) and 2-sided (paired) transfers • No data copying into system buffers – order and timing of send() and recv() are relevant: recv() must be waiting before send() is issued – memory involved in transfer is untouchable between start and completion of transfer
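The work-queue and completion-queue pattern described above can be illustrated with a toy in-process model. This is not the verbs API: `ToyQueuePair`, `connect`, and the tuple layout of completion entries are all invented for this sketch, and the "transfer" happens inside `post_send` rather than in parallel on an adapter. It only shows the two rules on the slide: a receive must be posted before the matching send, and the outcome is learned by polling a completion queue.

```python
from collections import deque


class ToyQueuePair:
    """Toy stand-in for one end of an RDMA connection (hypothetical, not verbs)."""

    def __init__(self):
        self.recv_queue = deque()   # receive buffers posted by the application
        self.completions = deque()  # (operation, status, byte_count) entries
        self.peer = None

    def post_recv(self, buf: bytearray) -> None:
        """Queue a buffer for a future incoming message; returns immediately."""
        self.recv_queue.append(buf)

    def post_send(self, data: bytes) -> None:
        """Queue a send; the result is reported only via the completion queues."""
        if not self.peer.recv_queue:
            # No receive was waiting, so the send cannot be delivered.
            self.completions.append(("send", "error: receiver not ready", 0))
            return
        buf = self.peer.recv_queue.popleft()
        if len(data) > len(buf):
            self.completions.append(("send", "error: receive buffer too small", 0))
            return
        buf[:len(data)] = data      # data lands directly in the posted buffer
        self.peer.completions.append(("recv", "ok", len(data)))
        self.completions.append(("send", "ok", len(data)))

    def poll_cq(self):
        """Return the next completion entry, or None if nothing has finished."""
        return self.completions.popleft() if self.completions else None


def connect(a: ToyQueuePair, b: ToyQueuePair) -> None:
    a.peer, b.peer = b, a


client, server = ToyQueuePair(), ToyQueuePair()
connect(client, server)

client.post_send(b"early")          # no receive posted yet
first = client.poll_cq()

buf = bytearray(16)
server.post_recv(buf)               # receive is now waiting
client.post_send(b"hello")
second = client.poll_cq()
recv_done = server.poll_cq()

print(first, second, recv_done, bytes(buf[:recv_done[2]]))
```

Note that the message boundary is preserved without any framing: the receiver learns the message length from the completion entry.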

  22. Congestion Control for Large-Scale RDMA Deployments By Yibo Zhu et al.

  23. Problem • RDMA requires a lossless data link layer • Ethernet is not lossless • Solution → RDMA over Converged Ethernet (RoCE)

  24. RoCE details • Priority-based Flow Control (PFC) – When busy, send Pause – When not busy, send Resume
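The pause/resume behavior can be sketched as a queue with two thresholds. This is a deliberately simplified model: the threshold names `xoff` and `xon`, their values, the arrival and drain rates, and the assumption that a pause takes effect instantly are all illustrative, and real PFC operates per priority class with pause frames on the wire.

```python
def simulate_pfc(ticks, arrival=2, drain=1, xoff=8, xon=4):
    """Toy model of one ingress queue under pause/resume flow control.

    Each tick the upstream adds `arrival` packets unless it is paused, and
    the switch drains `drain` packets. All parameter values are hypothetical.
    """
    queue = 0
    paused = False
    events = []      # (tick, "PAUSE" or "RESUME")
    max_queue = 0
    for t in range(ticks):
        if not paused:
            queue += arrival
        queue = max(0, queue - drain)
        max_queue = max(max_queue, queue)
        if not paused and queue >= xoff:
            paused = True                   # busy: tell upstream to stop
            events.append((t, "PAUSE"))
        elif paused and queue <= xon:
            paused = False                  # drained: let upstream continue
            events.append((t, "RESUME"))
    return events, max_queue


events, max_queue = simulate_pfc(20)
print(events, max_queue)
```

Because the sender stops before the buffer overflows, nothing is dropped; the cost is that the upstream port is stalled for every flow that uses it, which is the issue the next slides describe.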

  25. Problems with PFC • Per-port, not per flow • Unfairness: port-fair, not flow-fair • Collateral damage: head-of-line blocking for some flows

  26. Experimental topology

  27. Unfairness • H1-H4 write to R • H4 has no contention at port P2 • H1, H2, and H3 have contention on P3 and P4
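A small calculation shows why port-fair sharing favors H4. The sketch assumes that the congested switch gives each ingress port an equal share of the egress link and that all senders always have data to send. The assignment of H1 to P3 and of H2 and H3 to P4 is a hypothetical example (the slide only says the three hosts contend on those two ports), and the function names are invented for illustration.

```python
from fractions import Fraction


def port_fair_shares(flows_by_port):
    """Split the link equally among ingress ports, then among flows on a port."""
    per_port = Fraction(1, len(flows_by_port))
    shares = {}
    for flows in flows_by_port.values():
        for flow in flows:
            shares[flow] = per_port / len(flows)
    return shares


def flow_fair_shares(flows_by_port):
    """Split the link equally among all flows, ignoring which port they use."""
    flows = [f for fs in flows_by_port.values() for f in fs]
    return {f: Fraction(1, len(flows)) for f in flows}


# Hypothetical port assignment for the senders on the slide.
flows_by_port = {"P2": ["H4"], "P3": ["H1"], "P4": ["H2", "H3"]}

port_fair = port_fair_shares(flows_by_port)
flow_fair = flow_fair_shares(flows_by_port)
for host in sorted(port_fair):
    print(host, "port-fair:", port_fair[host], "flow-fair:", flow_fair[host])
```

Under this assumed mapping H4 keeps a full port's share to itself while H2 and H3 split one, whereas per-flow sharing would give every sender a quarter of the link.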

  28. Head-of-line blocking • VS → VR • H11, H14, H31-H32 → R • T4 congested, sends PAUSE messages • T1 pauses all its incoming links regardless of their destinations

  29. Solution • Per-flow congestion control • Existing work: – QCN (Quantized Congestion Notification) • Using Ethernet SRC/DST and a flow ID to define a flow • Switch sends congestion notification to sender based on source MAC address • Only works at L2 • This work: DCQCN – Works for IP-routed networks

  30. Why does QCN not work for IP networks? • The same packet has different SRC/DST MAC addresses on each hop, because every router rewrites them.

  31. DCQCN • DCQCN is a rate-based, end-to-end congestion control protocol • Most of the DCQCN functionality is implemented in the NICs

  32. High-level ideas • ECN-mark packets at an egress queue • Receiver sends Congestion Notification to sender • Sender reduces sending rates
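The first and last steps above can be sketched in a few lines. The marking rule follows the RED-style probability curve and the rate cut follows the decrease rule given in the DCQCN paper (target rate takes the current rate, the current rate is scaled by 1 - alpha/2, and alpha moves toward 1 with gain g); neither formula appears on the slides. The names `ecn_mark_probability` and `ToySender`, and the default parameter values, are placeholders for illustration, and the rate-increase and alpha-decay timers are left out.

```python
def ecn_mark_probability(queue_kb, k_min=5.0, k_max=200.0, p_max=0.01):
    """RED-style marking at the egress queue (parameter values are placeholders).

    No marking below k_min, linear ramp up to p_max at k_max, always mark above.
    """
    if queue_kb <= k_min:
        return 0.0
    if queue_kb > k_max:
        return 1.0
    return p_max * (queue_kb - k_min) / (k_max - k_min)


class ToySender:
    """Sender-side rate cut only; rate recovery is intentionally omitted."""

    def __init__(self, line_rate_gbps, g=1 / 256):
        self.current_rate = line_rate_gbps
        self.target_rate = line_rate_gbps
        self.alpha = 1.0    # estimate of how congested the path is
        self.g = g          # gain used when updating alpha

    def on_congestion_notification(self):
        """Called when a congestion notification arrives from the receiver."""
        self.target_rate = self.current_rate           # remember rate to recover to
        self.current_rate *= 1 - self.alpha / 2        # multiplicative decrease
        self.alpha = (1 - self.g) * self.alpha + self.g


sender = ToySender(line_rate_gbps=40.0)
sender.on_congestion_notification()
print(sender.current_rate, sender.target_rate, ecn_mark_probability(102.5))
```

With alpha starting at 1, the first notification halves the sending rate; later notifications cut less aggressively only once alpha has decayed, which is the part this sketch leaves out.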

  33. Challenges • How to set buffer sizes at the egress queue • How often to send congestion notifications • How a sender should reduce its sending rate to ensure both convergence and fairness

  34. Solutions provided by the paper • ECN must be set before PFC is triggered – Use PFC queue sizes to set ECN buffer • Use a fluid model to tune congestion parameters

  35. RDMA over Commodity Ethernet at Scale Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitendra Padhye, Marina Lipshteyn Microsoft

  36. What this paper is about • Extending PFC to IP-routed networks • Safety issues of RDMA – Livelock – Deadlock – Pause frame storm – Slow-receiver symptom • Performance observed in production networks

  37. • 4 MB message, 1 KB packets • Drop packets whose IP ID's last byte is 0xff (1/256)
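This experiment is the livelock case from the paper: if the sender restarts the whole message after any loss (go-back-0), a deterministic 1-in-256 drop means a 4096-packet message never finishes, while restarting from the lost packet (go-back-N) completes with little overhead. The two retransmission schemes come from the paper rather than from the slide. The simulation below is a rough sketch that assumes the IP ID increments by one per transmitted packet and that the sender learns of a loss immediately; `transfer` and its parameters are invented for illustration.

```python
def transfer(num_packets, go_back_zero, max_transmissions):
    """Return (completed, packets_transmitted) for one message.

    A packet is dropped when the low byte of its IP ID is 0xff. The IP ID is
    assumed to increase by one for every packet the sender transmits.
    """
    next_seq = 0    # next packet of the message that must be delivered
    ip_id = 0
    sent = 0
    while next_seq < num_packets and sent < max_transmissions:
        dropped = (ip_id & 0xFF) == 0xFF
        ip_id = (ip_id + 1) & 0xFFFF
        sent += 1
        if dropped:
            if go_back_zero:
                next_seq = 0    # restart the whole message
            # go-back-N: next_seq stays put, so only the lost packet is resent
        else:
            next_seq += 1
    return next_seq >= num_packets, sent


message_packets = 4 * 1024 * 1024 // 1024   # 4 MB message in 1 KB packets

go_back_n = transfer(message_packets, go_back_zero=False, max_transmissions=100_000)
go_back_0 = transfer(message_packets, go_back_zero=True, max_transmissions=100_000)
print("go-back-N:", go_back_n)
print("go-back-0:", go_back_0)
```

In the go-back-0 run the link stays fully busy for the whole transmission budget, yet the message makes no lasting progress, which is what distinguishes livelock from an idle or broken link.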

  38. • S3 is dead • T1.p2 is congested • Pause is sent to T1.p3, La.p1, T0.p2, S1

  39. • S4 → S2; S2 is dead • Blue packet flooded to T0.p2 • T0.p2 is paused • Ingress T0.p3 pauses Lb.p0 • Lb.p1 pauses T1.p4 • T1.p1 pauses S4
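Read together, the pause chains on slides 38 and 39 form a loop among the switches, which is what makes this a deadlock: every switch in the loop is waiting for the next one to resume. The sketch below checks for such a loop with a depth-first search. The edge list is one reading of the two slides at switch granularity (ports collapsed into their switch) and should be treated as an assumption; `find_pause_cycle` is a name invented here.

```python
def find_pause_cycle(pauses):
    """Return one cycle in a 'who pauses whom' graph as a list, or None.

    `pauses` maps a device to the upstream devices it has paused.
    """
    visiting, done, path = set(), set(), []

    def visit(node):
        visiting.add(node)
        path.append(node)
        for upstream in pauses.get(node, ()):
            if upstream in visiting:
                # Found a device already on the current path: that is a loop.
                return path[path.index(upstream):] + [upstream]
            if upstream not in done:
                cycle = visit(upstream)
                if cycle:
                    return cycle
        visiting.discard(node)
        done.add(node)
        path.pop()
        return None

    for node in list(pauses):
        if node not in done:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Assumed switch-level reading of slides 38-39 (ports omitted).
pauses = {
    "T1": ["La", "S4"],
    "La": ["T0"],
    "T0": ["Lb", "S1"],
    "Lb": ["T1"],
}

cycle = find_pause_cycle(pauses)
print(" pauses ".join(cycle) if cycle else "no cycle")
```

Breaking any single edge of the loop, for example by dropping rather than flooding the packets addressed to the dead server, leaves the graph acyclic and the pauses can drain.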

  40. Summary • What is RDMA • DCQCN: congestion control for RDMA • Deployment issues for RDMA
