SLIDE 1

Low-Latency TCP/IP Stack for Data Center Applications

David Sidler, Zsolt István, Gustavo Alonso

Systems Group, Dept. of Computer Science, ETH Zürich
FPL'16, Lausanne, August 30, 2016

SLIDE 2

Original Architecture [1]

• 10 Gbps bandwidth TCP/IP stack
• Supports thousands of concurrent connections
• Generic implementation, as close to the specification as possible
• Enables seamless integration of FPGA-based applications into existing networks

[1] Sidler et al., Scalable 10 Gbps TCP/IP Stack Architecture for Reconfigurable Hardware, FCCM'15. Source: http://github.com/dsidler/fpga-network-stack

SLIDE 3

Application Integration

[Block diagram: TCP/IP stack on the FPGA, connected to the 10G network interface, DDR memory, and two application modules]

SLIDE 4

Application Integration

[Same block diagram as Slide 3]

• The stack requires DDR to buffer packet payloads
• Applications require DDR memory as well
• Memory bandwidth is shared among multiple modules → potential bottleneck

SLIDE 5

Application Integration

[Same block diagram as Slide 3]

• DDR buffering and shared memory bandwidth concerns, as on Slide 4
• Distributed systems rely on very low latency → needed to guarantee latency bounds to clients (integration sketch below)
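To make the integration path concrete, here is a minimal sketch of how an application module could sit next to the TCP/IP stack and exchange payloads over HLS streams. The struct layout, names, and the single-stream handshake are assumptions for illustration; the released stack exposes a richer, session-based interface.

```cpp
// Hypothetical sketch of an application module next to the TCP/IP stack.
// Names and widths are illustrative assumptions, not the stack's real interface.
#include <ap_int.h>
#include <hls_stream.h>

// 64-bit data path: 64 bit x 156.25 MHz = 10 Gbps line rate.
struct netWord {
    ap_uint<64> data;
    ap_uint<8>  keep;
    ap_uint<1>  last;   // marks the last word of a payload
};

// Echo-style application: consume each request word at line rate and hand the
// response payload straight back to the stack's TX path, without touching DDR.
void appModule(hls::stream<netWord>& rxFromTcp,
               hls::stream<netWord>& txToTcp) {
#pragma HLS PIPELINE II=1
    if (!rxFromTcp.empty()) {
        netWord w = rxFromTcp.read();
        // ... per-request application logic on w.data would go here ...
        txToTcp.write(w);
    }
}
```

Because the module drains its RX stream every cycle, it matches the line-rate-consumption assumption listed on the next slide; an application that stalls would need its own buffering instead, which is exactly where the shared DDR bandwidth becomes a concern.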

SLIDE 6

Assumptions

Application

• Client requests fit into an MTU (maximum transmission unit); with a standard 1500 B Ethernet MTU and 20 B IP + 20 B TCP headers, that means payloads of at most 1460 B
• Clients are synchronous
• Application logic consumes data at line rate

SLIDE 7

Assumptions

Application

(assumptions as on Slide 6)

Data center network

• High reliability and structured topology
• Data loss is less common → fewer retransmissions
• Packets are rarely reordered

SLIDE 8

Optimizations for Data Center Applications

[Block diagram of the stack between the network and the application: state tables, timers, event engine, RX engine, TX engine, RX buffer, TX buffer, and application interface]

SLIDE 9

Optimizations for Data Center Applications

[Same block diagram as Slide 8]

• Replace the RX buffer with BRAM (sketch below)
• Only read for retransmission
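A minimal sketch of that idea, under assumed sizes (per-session windows of one MTU, a few hundred sessions with buffered data); the real buffer organization is not taken from the stack. The point is only that a BRAM array can absorb the in-flight payloads once requests are MTU-sized and drained at line rate, so the DDR controller disappears from the receive path.

```cpp
// Hypothetical sketch: small per-session RX windows in on-chip BRAM instead of DDR.
// Sizes and names are illustrative assumptions only.
#include <ap_int.h>
#include <hls_stream.h>

const int SESSIONS     = 256;           // assumption: sessions with buffered RX data
const int WINDOW_WORDS = 1460 / 8 + 1;  // one MTU-sized payload per session, 64-bit words

struct rxWord {
    ap_uint<64> data;
    ap_uint<1>  last;
};

// Synthesizes to BRAM; the DDR memory controller is no longer on the RX path.
static ap_uint<64> rxWindow[SESSIONS][WINDOW_WORDS];

void rxBufferStore(hls::stream<rxWord>& payloadIn, ap_uint<8> sessionID) {
    // Store one segment's payload into the session's on-chip window.
    // Assumes the payload fits into a single MTU (see the Assumptions slide).
    int offset = 0;
    rxWord w;
    do {
#pragma HLS PIPELINE II=1
        w = payloadIn.read();
        rxWindow[sessionID][offset++] = w.data;
    } while (!w.last && offset < WINDOW_WORDS);
}
```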

SLIDE 10

Optimizations for Data Center Applications

[Same block diagram as Slide 8]

• Replace the RX buffer with BRAM; only read for retransmission (as on Slide 9)

• Tuning timers
• Reducing the ACK delay
• Disabling Nagle's algorithm (software equivalents sketched below)
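Two of these knobs have well-known software counterparts, which may help place them: the sketch below shows the equivalent settings on a Linux TCP socket. The hardware stack fixes these choices at design time instead, so this code is purely for orientation.

```cpp
// Illustrative only: software-socket equivalents of two of the optimizations.
// The FPGA stack hard-wires these decisions; it does not use setsockopt().
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

int tuneSocket(int fd) {
    int one = 1;

    // Disable Nagle's algorithm: send small segments immediately instead of
    // coalescing them while earlier data is still unacknowledged.
    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
        return -1;

    // Reduce ACK delay: acknowledge incoming segments right away rather than
    // waiting for the delayed-ACK timer (Linux-specific, may need re-arming).
    if (setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one)) < 0)
        return -1;

    return 0;
}
```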

SLIDE 11

Results

SLIDE 12

Results

[Plot: latency in clock cycles @ 156.25 MHz vs. payload size [B], comparing RX org., RX opt., TX org., and TX opt.]

2-3x lower latency
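For scale, the y-axis converts to wall-clock time directly: at 156.25 MHz one cycle is 6.4 ns, so the plotted range of roughly 200 to 800 cycles corresponds to about 1.3 to 5.1 microseconds.

```cpp
// Convert the plotted cycle counts at 156.25 MHz into microseconds.
#include <cstdio>

int main() {
    const double nsPerCycle = 1000.0 / 156.25;   // 6.4 ns per cycle
    const int samples[] = {200, 400, 600, 800};
    for (int cycles : samples)
        std::printf("%d cycles = %.2f us\n", cycles, cycles * nsPerCycle / 1000.0);
    return 0;   // 200 cycles = 1.28 us ... 800 cycles = 5.12 us
}
```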

SLIDE 13

Results

[Latency plot as on Slide 12: 2-3x lower latency]

[Plot: goodput [Gb/s] vs. payload size [B], comparing TCP org. (SM), TCP org. (DM), and TCP opt. (SM), with the maximum achievable goodput marked]

High throughput
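The "max. goodput" marker reflects protocol overhead on a 10 Gbps link: besides the payload, every segment carries 20 B TCP and 20 B IP headers plus 38 B of Ethernet framing (header, FCS, preamble, inter-frame gap). The sketch below works this out for the plotted payload sizes; the framing constants are standard Ethernet values rather than numbers from the talk, and assume no TCP options or VLAN tags.

```cpp
// Theoretical maximum goodput over 10 Gigabit Ethernet for a given TCP payload size.
// Assumed per-segment overhead: 20 B TCP + 20 B IP + 14 B Ethernet header + 4 B FCS
// + 8 B preamble/SFD + 12 B inter-frame gap = 78 B.
#include <cstdio>

int main() {
    const double lineRateGbps = 10.0;
    const int overheadBytes = 20 + 20 + 14 + 4 + 8 + 12;
    const int payloads[] = {64, 256, 512, 1024, 1460};

    for (int p : payloads) {
        double goodput = lineRateGbps * p / (p + overheadBytes);
        std::printf("payload %4d B -> max goodput %.2f Gb/s\n", p, goodput);
    }
    return 0;   // e.g. 1460 B -> ~9.49 Gb/s, 64 B -> ~4.51 Gb/s
}
```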

SLIDE 14

Results

[Latency and goodput plots as on Slides 12-13: 2-3x lower latency, high throughput]

            Mem. allocated   Mem. bandwidth
TCP org.    1,300 MB         40 Gbps
TCP opt.    650 MB           10 Gbps
Diff.       -50%             -75%

SLIDE 15

Results

[Latency and goodput plots as on Slides 12-13]

These results enabled a consistent distributed key-value store [2]

[Memory table as on Slide 14]

[2] István et al., Consensus in a Box: Inexpensive Coordination in Hardware, NSDI'16

SLIDE 16

Results

[Latency and goodput plots as on Slides 12-13]

Visit our poster for more results and details! Find the source at: http://github.com/dsidler/fpga-network-stack

[Memory table as on Slide 14]

