 
Low-Latency TCP/IP Stack for Data Center Applications
David Sidler, Zsolt István, Gustavo Alonso
Systems Group, Dept. of Computer Science, ETH Zürich

Original Architecture [1]
- 10 Gbps bandwidth TCP/IP stack
- Supporting thousands of concurrent connections
- Generic implementation as close to specification as possible
- Enables seamless integration of FPGA-based applications into existing networks

[1] Sidler et al., Scalable 10 Gbps TCP/IP Stack Architecture for Reconfigurable Hardware, FCCM’15, http://github.com/dsidler/fpga-network-stack

FPL’16, Lausanne, August 30, 2016

Application Integration
[Figure: FPGA containing a TCP/IP module and an App module, attached to DDR memory and a 10G network interface]
- TCP/IP module: requires DDR to buffer packet payloads
- Applications require DDR memory
- Memory bandwidth is shared among multiple modules → potential bottleneck
- Distributed systems rely on very low latency → to guarantee latency bounds to clients

Assumptions
Application
- Client requests fit into an MTU (maximum transmission unit)
- Synchronous clients
- Application logic consumes data at line rate
Data center network
- High reliability and structured topology
- Data loss is less common → fewer retransmissions
- Packets are rarely reordered
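
The first assumption can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the stack's source: it assumes a standard 1500-byte Ethernet MTU and IPv4 and TCP headers without options (20 bytes each), and the function names are hypothetical. Under those assumptions the largest request that fits into one segment is 1460 bytes, which matches the largest payload size on the results axes.

```python
# Assumed sizes (not stated on the slides): standard Ethernet MTU,
# IPv4 and TCP headers without options.
ETHERNET_MTU = 1500
IPV4_HEADER = 20
TCP_HEADER = 20


def max_segment_payload(mtu: int = ETHERNET_MTU) -> int:
    """Largest TCP payload that fits into a single packet of size `mtu`."""
    return mtu - IPV4_HEADER - TCP_HEADER


def fits_in_one_segment(request_len: int, mtu: int = ETHERNET_MTU) -> bool:
    """True if a non-empty client request fits into one TCP segment."""
    return 0 < request_len <= max_segment_payload(mtu)
```

If TCP options such as timestamps are enabled, the usable payload shrinks accordingly, so the bound should be recomputed for the deployment at hand.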

Optimizations for Data Center Applications
[Figure: stack architecture between Network and Application, consisting of RX Engine, RX Buffer, Event Engine, State Tables, Timers, App If, TX Engine, and TX Buffer]
- Replace RX buffer with BRAM
- Tuning timers
- Reducing ACK delay
- Disabling Nagle’s algorithm
- TX buffer: only read for retransmission
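
For readers more familiar with software stacks, disabling Nagle's algorithm has a direct counterpart in the sockets API. The sketch below shows that host-side analogue using Python's standard `socket` module; it is not how the FPGA stack implements the optimization, and the helper name `make_low_latency_socket` is hypothetical.

```python
import socket


def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled.

    With TCP_NODELAY set, small writes are sent immediately instead of
    being held back to be coalesced with later data.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

A synchronous client that sends one small request and then waits for the reply is the case where this setting tends to matter most, since there is no follow-up data for the request to be coalesced with.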

Results
[Figure: Latency (cycles @ 156.25 MHz) vs. payload size [B], 1 to 1460 B; series: RX org., RX opt., TX org., TX opt.; annotation: 2-3x lower latency]
[Figure: Goodput [Gb/s] vs. payload size [B], 64 to 1460 B; series: TCP org. (SM), TCP org. (DM), TCP opt. (SM), max. goodput; annotation: high throughput]

            Mem. allocated   Mem. bandwidth
TCP org.    1,300 MB         40 Gbps
TCP opt.    650 MB           10 Gbps
Diff        -50%             -75%

These results enabled a consistent distributed key-value store [2].

Visit our poster for more results and details!
Find the source at: http://github.com/dsidler/fpga-network-stack

[2] István et al., Consensus in a Box: Inexpensive Coordination in Hardware, NSDI’16
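
The units in the results can be related to wall-clock time and line rate with some simple arithmetic. The sketch below is a hedged illustration, not the authors' evaluation code. It converts cycle counts at 156.25 MHz into nanoseconds, computes a theoretical goodput bound for 10 Gbps Ethernet, and reproduces the "Diff" row of the memory table. The per-frame overheads are assumptions based on standard Ethernet framing and option-free IPv4/TCP headers; the slides do not state how the "max. goodput" curve was derived. All function names are hypothetical.

```python
CLOCK_HZ = 156.25e6      # clock frequency given on the latency axis
LINE_RATE_GBPS = 10.0    # 10G Ethernet line rate

# Assumed per-frame overheads in bytes (standard Ethernet, no VLAN tag,
# IPv4 and TCP headers without options).
PREAMBLE_AND_IFG = 8 + 12   # preamble + start delimiter, inter-frame gap
ETH_HEADER_AND_FCS = 14 + 4
IP_AND_TCP_HEADERS = 20 + 20
MIN_FRAME = 64              # minimum Ethernet frame size, incl. FCS


def cycles_to_ns(cycles: float) -> float:
    """Convert a cycle count at CLOCK_HZ into nanoseconds (6.4 ns/cycle)."""
    return cycles / CLOCK_HZ * 1e9


def max_goodput_gbps(payload: int) -> float:
    """Upper bound on TCP goodput for back-to-back frames of one payload size."""
    frame = max(ETH_HEADER_AND_FCS + IP_AND_TCP_HEADERS + payload, MIN_FRAME)
    return LINE_RATE_GBPS * payload / (frame + PREAMBLE_AND_IFG)


def relative_change_percent(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0
```

Under these assumptions, 800 cycles correspond to roughly 5.1 µs, a 1460-byte payload can reach at most about 9.5 Gb/s, and the memory figures in the table (1,300 MB to 650 MB, 40 Gbps to 10 Gbps) give the stated changes of -50% and -75%.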