Adapting TCP for Reconfigurable Datacenter Networks (PowerPoint presentation)


SLIDE 1

Adapting TCP for Reconfigurable Datacenter Networks

Matthew K. Mukerjee*†, Christopher Canel*, Weiyang Wang○, Daehyeok Kim*‡, Srinivasan Seshan*, Alex C. Snoeren○ *Carnegie Mellon University, ○UC San Diego, †Nefeli Networks, ‡Microsoft Research February 26, 2020

SLIDE 2

Reconfigurable Datacenter Network (RDCN)

Figure: racks 1..N (servers 1..M behind a ToR switch) connected by both a packet switch and a circuit switch; available bandwidth varies over time

  • Packet network: all-to-all connectivity
  • Circuit switch: higher bandwidth, between certain racks (60 GHz wireless, free-space optics, optical circuits)

RDCN is a black box: do not segregate flows between networks

[Liu, NSDI ’14]

SLIDE 3

2010: RDCNs speed up DC workloads

Figure: datacenter workload performance for a packet network, a hybrid network (c-Through), and a full bisection bandwidth network [Wang, SIGCOMM ’10]

Hybrid networks achieve higher performance on datacenter workloads

SLIDE 4

Today’s RDCNs reconfigure 10x as often

Advances in circuit switch technology have led to a 10x reduction in reconfiguration delay ⇒ today, circuits can reconfigure much more frequently

  • Better for datacenters: more flexibility to support dynamic workloads
  • Better for hosts: less data must be available to saturate the higher-bandwidth network

Figure: available bandwidth over time, with circuit durations of ~10 ms in 2010 vs. ~180 μs today

[Porter, SIGCOMM ’13]

SLIDE 5

Short-lived circuits pose a problem for TCP

16 flows from rack 1 to rack 2; packet network: 10 Gb/s; circuit network: 80 Gb/s

Figure: average circuit utilization (%) by TCP variant; in the order bbr, bic, cdg, cubic, dctcp, highspeed, htcp, hybla, illinois, lp, nv, reno, scalable, vegas, veno, westwood, yeah the values are 26, 49, 54, 49, 46, 50, 49, 51, 55, 53, 42, 53, 55, 52, 52, 53, 51

No TCP variant makes use of the high-bandwidth circuits

SLIDE 6

TCP cannot ramp up during short circuits

Figure: data sent over time (slope = achieved bandwidth) across a no-circuit / circuit (180 μs, 8x BW) / no-circuit (1x BW) sequence, comparing what we expect against reality
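To make the ramp-up problem concrete, here is a small back-of-the-envelope calculation in Python using the talk’s numbers (80 Gb/s circuit, 180 μs circuit duration, 9000-byte packets) and the 40 μs RTT from the later BDP slide; it is an illustrative sketch, not part of the talk’s artifacts.

```python
# Rough numbers behind this slide: an 80 Gb/s circuit that is up for only
# 180 us leaves TCP almost no feedback time to grow cwnd into it.
CIRCUIT_BW = 80e9      # bits/s (circuit network)
CIRCUIT_UP = 180e-6    # s (circuit duration)
RTT        = 40e-6     # s (assumed RTT, taken from the later BDP calculation)
PKT        = 9000 * 8  # bits per 9000-byte packet

capacity_pkts = CIRCUIT_BW * CIRCUIT_UP / PKT   # ~200 packets could cross the circuit
rtts          = CIRCUIT_UP / RTT                # ...but only ~4.5 RTTs of ACK feedback

print(f"{capacity_pkts:.0f} packets of circuit capacity, "
      f"but only {rtts:.1f} RTTs to grow cwnd into it")
```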

SLIDE 7

What is the problem?

All TCP variants are designed to adapt to changing network conditions

  • E.g., congestion, bottleneck links, RTT

But bandwidth fluctuations in modern RDCNs are an order of magnitude more frequent (10x shorter circuit duration) and more substantial (10x higher bandwidth) than TCP is designed to handle

  • RDCNs break the implicit assumption of relatively stable network conditions

This requires an order-of-magnitude shift in how fast TCP reacts

SLIDE 8

This talk: Our 2-part solution

In-network: Use information about upcoming circuits to transparently “trick” TCP into ramping up more aggressively

  • High utilization, at the cost of tail latency

At endhosts: New TCP variant, reTCP, that explicitly reacts to circuit state changes

  • Mitigates tail latency penalty

The two techniques can be deployed separately, but work best together

SLIDE 9

Naïve idea: Enlarge switch buffers

What we want: TCP’s congestion window (cwnd) to parallel the bandwidth fluctuations

First attempt: Make cwnd large all the time. How? Use large ToR buffers

Figure: available bandwidth over time, with the desired cwnd (tracking the fluctuations) and the cwnd produced by large, static buffers (always large)

SLIDE 10

Naïve idea: Enlarge switch buffers

Figure: sender and receiver racks with ToR buffers, connected by the packet switch (low BDP) and the circuit switch (high BDP)

SLIDE 11

Naïve idea: Enlarge switch buffers

Figure: same diagram with enlarged ToR buffers and the resulting bandwidth

Larger ToR buffers increase utilization of the high-BDP circuit network

SLIDE 12

Naïve idea: Enlarge switch buffers

Figure: same diagram; the enlarged ToR buffers raise bandwidth but also raise latency

SLIDE 13

Large queues increase utilization…

Figure: average circuit utilization (%) vs. static buffer size (4, 8, 16, 32, 64, 128 packets): 21, 31, 49, 77, 100, 100

16 flows from rack 1 to rack 2; packet network: 10 Gb/s; circuit network: 80 Gb/s

SLIDE 14

…but result in high latency

Figure: median and 99th-percentile latency (μs) vs. average circuit utilization (%) for static buffers of varying size

16 flows from rack 1 to rack 2; packet network: 10 Gb/s; circuit network: 80 Gb/s

How can we improve this latency?

SLIDE 15

Use large buffers only when circuit is up

Dynamic buffer resizing: Before a circuit begins, transparently enlarge ToR buffers

Full circuit utilization, with a latency degradation only during the ramp-up period

Figure: available bandwidth and cwnd over time for the desired behavior, large static buffers, and dynamic buffers; the buffers enlarge (“resize!”) just before each circuit
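A minimal sketch of the resizing policy described on this slide, assuming the switch controller knows the circuit schedule in advance; the 16/50-packet sizes and the 1800 μs prebuffer time τ are the values used later in the talk, while switch.at() and switch.set_queue_capacity() are hypothetical controller hooks.

```python
# Dynamic ToR buffer resizing (sketch): enlarge the queue tau before a circuit
# comes up, shrink it back as soon as the circuit goes down.
SMALL_BUF = 16        # packets: sized for the 10 Gb/s packet network
LARGE_BUF = 50        # packets: above the ~45-packet circuit BDP
TAU       = 1800e-6   # s: prebuffer time before the circuit starts

def schedule_resizes(circuit_schedule, switch):
    """circuit_schedule: list of (start_time, end_time) for circuits to this rack pair."""
    for start, end in circuit_schedule:
        # Enlarge early so TCP has time to grow cwnd into the extra capacity.
        switch.at(start - TAU, lambda: switch.set_queue_capacity(LARGE_BUF))
        # Shrink immediately afterwards to keep packet-network latency low.
        switch.at(end, lambda: switch.set_queue_capacity(SMALL_BUF))
```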

SLIDE 16

Resize ToR buffers before circuit begins

Figure: sender and receiver racks with ToR buffers, connected by the packet switch and circuit switch; the ToR learns “Circuit coming!” and a plot tracks cwnd over time

SLIDE 17

Resize ToR buffers before circuit begins

Figure: (buffer-resizing animation continues)

SLIDE 18

Resize ToR buffers before circuit begins

Figure: (buffer-resizing animation continues)

SLIDE 19

Resize ToR buffers before circuit begins

Figure: (buffer-resizing animation continues)

SLIDE 20

Configuring dynamic buffer resizing

How long in advance should ToR buffers resize (τ)?

  • Long enough for TCP to grow cwnd to the circuit BDP

How large should ToR buffers grow?

  • Circuit BDP = 80 Gb/s ⨉ 40 µs = 45 9000-byte packets

For our configuration, the ToR buffers must hold ~40 packets to achieve 90% utilization, which requires 1800 µs of prebuffering. We resize ToR buffers between sizes of 16 and 50 packets.
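The slide’s arithmetic, reproduced as a short Python check (the 40 µs RTT, 80 Gb/s circuit rate, 9000-byte packets, and 16/50-packet buffer sizes are all from the talk’s configuration):

```python
import math

# Circuit BDP = bandwidth x RTT, expressed in 9000-byte packets.
CIRCUIT_BW = 80e9   # bits/s
RTT        = 40e-6  # s
PKT_BYTES  = 9000

circuit_bdp_pkts = math.ceil(CIRCUIT_BW * RTT / 8 / PKT_BYTES)   # = 45 packets

SMALL_BUF = 16   # packets, sized for the 10 Gb/s packet network
LARGE_BUF = 50   # packets, comfortably above the circuit BDP

print(f"circuit BDP ≈ {circuit_bdp_pkts} packets; "
      f"resize ToR buffers between {SMALL_BUF} and {LARGE_BUF} packets")
```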

SLIDE 21

How long in advance to resize, τ?

Figure: achieved bandwidth (slope of data sent) and ToR buffer size (packets) over the 180 μs circuit (8x BW) for different resize times, reaching 49%, 65%, 91%, and 98% circuit utilization

  • Too late: low utilization
  • Too early: extra queuing
  • ⇒ a utilization/latency trade-off

16 flows from rack 1 to rack 2; packet network: 10 Gb/s; circuit network: 80 Gb/s; small buffers: 16 packets; large buffers: 50 packets

SLIDE 22

1800 μs of prebuffering yields 91% utilization

Figure: average circuit utilization (%) vs. resize time τ (μs); utilization rises with resize time, reaching 91% at 1800 μs and flattening around 96–98% for longer times (values: 49, 60, 65, 72, 79, 86, 91, 96, 98, 97, 97)

16 flows from rack 1 to rack 2; packet network: 10 Gb/s; circuit network: 80 Gb/s; small buffers: 16 packets; large buffers: 50 packets

SLIDE 23

Latency degradation during ramp-up

Figure: median and 99th-percentile latency (μs) vs. average circuit utilization (%) for static buffers (varying size) and dynamic buffers (varying τ), with a 2.3x latency increase called out

We cannot use large queues for so long. Can we get the same high utilization with shorter prebuffering?

16 flows from rack 1 to rack 2; packet network: 10 Gb/s; circuit network: 80 Gb/s; small buffers: 16 packets; large buffers: 50 packets

SLIDE 24

This talk: Our 2-part solution

In-network: Use information about upcoming circuits to transparently “trick” TCP into ramping up more aggressively

  • High utilization, at the cost of tail latency

At endhosts: New TCP variant, reTCP, that explicitly reacts to circuit state changes

  • Mitigates tail latency penalty

The two techniques can be deployed separately, but work best together

SLIDE 25

This talk: Our 2-part solution

In-network: Use information about upcoming circuits to transparently “trick” TCP into ramping up more aggressively

  • High utilization, at the cost of tail latency

At endhosts: New TCP variant, reTCP, that explicitly reacts to circuit state changes

  • Mitigates tail latency penalty

The two techniques can be deployed separately, but work best together

SLIDE 26

reTCP: Rapidly grow cwnd before a circuit

1) Communicate circuit state to sender TCP
2) Sender TCP reacts by multiplicatively increasing/decreasing cwnd

Figure: available bandwidth and cwnd over time for the desired behavior, large static buffers, dynamic buffers, and dynamic buffers + reTCP (resizing marked “resize!”)

SLIDE 27

reTCP: Explicit circuit state feedback

Figure: sender, ToR buffers, packet switch, and circuit switch; knowing a circuit is coming, the switch marks ACKs returning to the sender

reTCP marks: reuse the existing ECN-Echo (ECE) bit
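A minimal sketch of the marking idea on this slide: the switch encodes circuit state in the ECN-Echo (ECE) bit of ACKs flowing back to the sender. The ECE bit position in the TCP flags byte is standard; the marking function itself is an illustrative stand-in for the talk’s switch implementation.

```python
# ECE is bit 0x40 of the TCP flags byte (CWR=0x80, ECE=0x40, URG=0x20, ...).
TCP_FLAG_ECE = 0x40

def mark_ack_flags(tcp_flags: int, circuit_up: bool) -> int:
    """Set or clear ECE in an ACK's flags to signal circuit state to the sender."""
    if circuit_up:
        return tcp_flags | TCP_FLAG_ECE     # circuit up / imminent -> ECE = 1
    return tcp_flags & ~TCP_FLAG_ECE        # packet network only   -> ECE = 0
```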

SLIDE 28

reTCP: Explicit circuit state feedback

Figure: (animation) “Circuit coming!”: the switch starts setting ECE = 1 in returning ACKs; on the 0 → 1 transition the sender increases cwnd

SLIDE 29

reTCP: Explicit circuit state feedback

Figure: (animation) while the circuit is up, returning ACKs continue to carry ECE = 1

SLIDE 30

reTCP: Explicit circuit state feedback

Figure: (animation) when the circuit ends, the switch clears ECE; on the 1 → 0 transition the sender decreases cwnd

SLIDE 31

Single multiplicative increase/decrease

On 0 → 1 transitions: cwnd = cwnd ⨉ 𝛽
On 1 → 0 transitions: cwnd = cwnd / 𝛽

𝛽 depends on the ratio of circuit BDP to ToR queue capacity:

  • Circuit network BDP: 45 packets
  • Small ToR queue capacity: 16 packets

We use 𝛽 = 2. More advanced forms of feedback are possible.
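A sketch of the sender-side reaction, directly following the rule above: one multiplicative adjustment per ECE transition. 𝛽 = 2 is the value the talk uses; the sock object is a hypothetical stand-in for real congestion-control state (in practice this would live in a kernel TCP module).

```python
BETA = 2  # cwnd scaling factor on circuit up/down transitions

class ReTCPState:
    """Track the last ECE value seen and scale cwnd on transitions."""
    def __init__(self):
        self.last_ece = 0

    def on_ack(self, sock, ece: int):
        if ece == 1 and self.last_ece == 0:
            sock.cwnd *= BETA                       # 0 -> 1: circuit coming, grow cwnd
        elif ece == 0 and self.last_ece == 1:
            sock.cwnd = max(1, sock.cwnd // BETA)   # 1 -> 0: circuit gone, shrink cwnd
        self.last_ece = ece
```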

SLIDE 32

Dynamic buffers + reTCP achieve high utilization

Figure: average circuit utilization (%) vs. resize time τ (up to 300 μs); values: 80, 87, 91, 93, 94, 93, 93

16 flows from rack 1 to rack 2; packet network: 10 Gb/s; circuit network: 80 Gb/s; small buffers: 16 packets; large buffers: 50 packets

SLIDE 33

Short prebuffer time means low latency

Figure: median and 99th-percentile latency (μs) vs. average circuit utilization (%) for static buffers (varying size), dynamic buffers (varying τ), and dynamic buffers + reTCP (varying τ)

With 150 μs of prebuffering, dynamic buffers + reTCP achieve 93% circuit utilization with only a 1.20x increase in tail latency

16 flows from rack 1 to rack 2; packet network: 10 Gb/s; circuit network: 80 Gb/s; small buffers: 16 packets; large buffers: 50 packets

SLIDE 34

Limitations and future work

Dynamic buffer resizing and reTCP are designed to be minimally invasive

  • Higher performance may be possible by involving the end-host further

Our evaluation used a simple traffic pattern to isolate TCP’s behavior

  • Important to consider complex workloads as well

Is TCP the right protocol for hybrid networks?

SLIDE 35

Summary: Adapting TCP for RDCNs

Bandwidth fluctuations in reconfigurable datacenter networks break TCP’s implicit assumption of relative network stability

Two techniques to ramp up TCP during short-lived circuits:

  • Dynamic buffer resizing: Adapt ToR queues to packet or circuit network
  • reTCP: Ramp up aggressively to fill new queue capacity

Etalon emulator open source at: github.com/mukerjee/etalon
Christopher Canel ~ ccanel@cmu.edu

Thank you!

SLIDE 36

One more thing: Etalon emulator

Figure: the RDCN topology (racks 1..N of servers 1..M behind ToR switches, connected by a packet switch and a circuit switch) mapped onto the emulator; each rack becomes a physical host running containers 1..M, and the hybrid switch is a Click-based switch on its own physical host

SLIDE 37

One more thing: Etalon emulator

Use time dilation to emulate high-bandwidth links

  • “Slows down” the rest of the machine
  • libVT: catches common syscalls

Flowgrind to generate traffic

Strobe schedule: Each rack pair gets a circuit for an equal share

Figure: Click hybrid switch (physical host) connected to emulated racks 1..N, each a physical host running containers 1..M
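A small illustrative sketch of a strobe schedule (each rack pair connected for an equal share of time); the per-slot permutation format is an assumption for illustration and may not match Etalon’s actual configuration format.

```python
def strobe_schedule(num_racks: int):
    """Yield one circuit assignment per slot: in slot k, rack i connects to rack (i + k) % N."""
    for k in range(1, num_racks):  # one slot per non-local destination
        yield {src: (src + k) % num_racks for src in range(num_racks)}

# Example: 4 racks -> 3 slots; over a full cycle every rack pair gets one slot.
for slot, assignment in enumerate(strobe_schedule(4)):
    print(f"slot {slot}: {assignment}")
```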

SLIDE 38

Summary: Adapting TCP for RDCNs

Bandwidth fluctuations in reconfigurable datacenter networks break TCP’s implicit assumption of relative network stability

Two techniques to ramp up TCP during short-lived circuits:

  • Dynamic buffer resizing: Adapt ToR queues to packet or circuit network
  • reTCP: Ramp up aggressively to fill new queue capacity

Etalon emulator open source at: github.com/mukerjee/etalon
Christopher Canel ~ ccanel@cmu.edu

Thank you!

SLIDE 39

Backup Slides

SLIDE 40

Circuit uptime impacts FCT

Figure: flow completion time (s) vs. circuit uptime (μs, 10^1 to 10^6, log scale) for static buffer sizes of 8, 16, 32, 64, and 128 packets

Simulation; packet network: 10 Gb/s; circuit network: 80 Gb/s

SLIDE 41

Buffer resizing benefits many TCP variants

Figure: average circuit utilization (%) by TCP variant with dynamic buffer resizing; in the order bbr, bic, cdg, cubic, dctcp, highspeed, htcp, hybla, illinois, lp, nv, reno, scalable, vegas, veno, westwood, yeah the values are 43, 87, 93, 79, 88, 89, 89, 89, 91, 96, 54, 95, 95, 92, 89, 97, 88

16 flows from rack 1 to rack 2; packet network: 10 Gb/s; circuit network: 80 Gb/s; small queues: 16 packets; large queues: 50 packets

SLIDE 42

Higher latency percentiles perform similarly

Figure: 99th-percentile and 99.999th-percentile latency (μs) vs. average circuit utilization (%) for static buffers (varying size), dynamic buffers (varying τ), and dynamic buffers + reTCP (varying τ)

16 flows from rack 1 to rack 2; packet network: 10 Gb/s; circuit network: 80 Gb/s; small queues: 16 packets; large queues: 50 packets