Reliable Communication for Datacenters, Mahesh Balakrishnan, Cornell University



slide-1
SLIDE 1

Maelstrom Ricochet Conclusion

Reliable Communication for Datacenters

Mahesh Balakrishnan Cornell University

Mahesh Balakrishnan Reliable Communication for Datacenters

slide-2
SLIDE 2

Datacenters

◮ Internet Services (90s): Websites, Search, Online Stores
◮ Since then:

[Figure: number of low-end volume servers (millions), 2000 to 2005]

Installed Server Base 00-05:

◮ Commodity: up by 100%
◮ High/Mid: down by 40%
◮ Today: Datacenters are ubiquitous
◮ How have they evolved?

Data partially sourced from IDC press releases (www.idc.com)

slide-3
SLIDE 3

Networks of Datacenters

Why? Business Continuity, Client Locality, Distributed Datasets or Operations ...

Any modern enterprise!

[Figure: datacenter sites laid out by compass direction (N, S, E, W), with link RTT labels of 100, 110, 200, 210, and 220 ms]

slide-5
SLIDE 5

Networks of Real-Time Datacenters

◮ Finance, Aerospace, Military, Search and Rescue...
◮ ... documents, chat, email, games, videos, photos, blogs, social networks
◮ The Datacenter is the Computer!
◮ Not hard real-time: real fast, highly responsive, time-critical

Gartner Survey:

◮ Real-Time Infrastructure (RTI): reaction time in secs/mins
◮ 73%: RTI is important or very important
◮ 85%: Have no RTI capability

slide-8
SLIDE 8

The Real-Time Datacenter: Systems Challenges

How do we recover from failures within seconds?

Real-World, Real-Time: Disk Failure, Disasters, Crashes, Overloads, Bugs, Exploits, Packet Loss

Software Stack: Compilers, Databases, Distributed Systems, Machine Learning, Filesystems, Operating Systems, Networking ...

Systems: SMFS, KyotoFS, Tempest, Maelstrom, Ricochet, Plato

slide-9
SLIDE 9

Reliable Communication

Goal: Recover lost packets fast!

◮ Existing protocols react to loss: too much, too late
◮ We want proactive recovery: stable overhead, low latencies
◮ Maelstrom: Reliability between datacenters [NSDI 2008]
◮ Ricochet: Reliability within datacenters [NSDI 2007]

slide-11
SLIDE 11

Reliable Communication between Datacenters

TCP fails in three ways:

  1. Throughput Collapse: at 100 ms RTT and 0.1% loss on a 40 Gbps link, throughput falls below 10 Mbps!
  2. Massive Buffers required for High-Rate Traffic
  3. Recovery Delays for Time-Critical Traffic
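The throughput-collapse figure can be sanity-checked with a standard back-of-the-envelope model. The sketch below uses the Mathis et al. steady-state approximation, throughput ≈ (MSS / RTT) · C / √p, which the slide itself does not name. The 1460-byte MSS and the constant C = √(3/2) are assumptions chosen for illustration.

```python
import math


def mathis_throughput_bps(mss_bytes: float, rtt_s: float, loss_rate: float,
                          c: float = math.sqrt(3 / 2)) -> float:
    """Estimate steady-state TCP throughput in bits per second.

    Uses the approximation (MSS / RTT) * C / sqrt(p). The default C is an
    assumption; other variants of the model use slightly different constants.
    """
    if rtt_s <= 0 or not 0 < loss_rate <= 1:
        raise ValueError("need rtt_s > 0 and 0 < loss_rate <= 1")
    return (mss_bytes * 8 / rtt_s) * c / math.sqrt(loss_rate)


# Slide scenario: 100 ms RTT, 0.1% loss. MSS of 1460 bytes is assumed.
tput = mathis_throughput_bps(mss_bytes=1460, rtt_s=0.100, loss_rate=0.001)
print(f"{tput / 1e6:.1f} Mbps")
```

Under these assumptions the estimate is a few Mbps, below the 10 Mbps bound quoted above. Link capacity does not appear in the formula, which is why a 40 Gbps link does not help.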

Current Solutions:

◮ Rewrite Apps: One Flow → Multiple Split Flows
◮ Resize Buffers
◮ Spend (infinite) money!

slide-14
SLIDE 14

TeraGrid: Supercomputer Network

◮ End-to-End UDP Probes: Zero Congestion, Non-Zero Loss!
◮ Possible Reasons:
  ◮ transient congestion
  ◮ degraded fiber
  ◮ malfunctioning HW
  ◮ misconfigured HW
  ◮ switching contention
  ◮ low receiver power
  ◮ end-host overflow
  ◮ ...

[Figure: histogram of % of Measurements versus % of Lost Packets (0.01 to 1), with bars labeled 24 and 14]

Electronics: Cluttered Pathways
Optics: Lossy Fiber

slide-15
SLIDE 15

Problem Statement

Run unmodified TCP/IP over lossy high-speed long-distance networks

slide-18
SLIDE 18

The Maelstrom Network Appliance

[Figure: Sending End-hosts (Commodity TCP), Maelstrom Send-Side Appliance (FEC Encode), Router, link with Packet Loss, Router, Maelstrom Receive-Side Appliance (FEC Decode), Receiving End-hosts (Commodity TCP)]

Transparent: No modification to end-host or network
FEC = Forward Error Correction

slide-21
SLIDE 21

What is FEC?

[Figure: five data packets A B C D E followed by three repair packets X X X]

3 repair packets from every 5 data packets
Receiver can recover from any 3 lost packets

Rate (r, c): c repair packets for every r data packets.

◮ Pro: Recovery Latency independent of RTT
◮ Constant Data Overhead: $\frac{c}{r+c}$
◮ Packet-level FEC at End-hosts: Inexpensive, No extra HW
◮ Con: Recovery Latency dependent on channel data rate
◮ FEC in the Network:
  ◮ Where and What?
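As a quick check on the overhead formula, the sketch below evaluates c/(r+c) for two rates that appear in these slides. The function name `fec_overhead` is made up for illustration.

```python
def fec_overhead(r: int, c: int) -> float:
    """Fraction of transmitted packets that are repair packets at rate (r, c)."""
    if r <= 0 or c < 0:
        raise ValueError("need r > 0 and c >= 0")
    return c / (r + c)


# Rate (5, 3): 3 repair packets for every 5 data packets, as in the example.
print(fec_overhead(5, 3))            # 0.375
# Rate (7, 2): the rate used in the recovery-probability comparison.
print(round(fec_overhead(7, 2), 3))  # 0.222
```

The overhead depends only on the chosen rate, not on the observed loss, which is the sense in which it is constant.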

slide-22
SLIDE 22

The Maelstrom Network Appliance

[Figure: Sending End-hosts (Commodity TCP), Maelstrom Send-Side Appliance (FEC Encode), Router, link with Packet Loss, Router, Maelstrom Receive-Side Appliance (FEC Decode), Receiving End-hosts (Commodity TCP)]

Transparent: No modification to end-host or network
FEC = Forward Error Correction
Where: at the appliance. What: aggregated data

slide-24
SLIDE 24

Maelstrom Mechanism

Send-Side Appliance:

◮ Snoop IP packets
◮ Create repair packet = XOR + 'recipe' of data packet IDs

Receive-Side Appliance:

◮ Lost packet recovered using XOR and other data packets
◮ At receiver end-host: out of order, no loss

[Figure: packets 25 to 29 travel between the appliances over the lambda link (jumbo MTU); packet 27 is lost and then recovered from the XOR repair packet, whose recipe list is 25, 26, 27, 28, 29, before delivery on the LAN (LAN MTU)]
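The XOR-plus-recipe idea can be sketched in a few lines. This is a minimal illustration, not Maelstrom's kernel code: the function names are hypothetical, packets are modeled as a dict from ID to payload, and shorter payloads are zero-padded, so a real implementation would also need to carry original lengths.

```python
from functools import reduce
from typing import Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two payloads, zero-padding the shorter one (illustrative choice)."""
    n = max(len(a), len(b))
    a, b = a.ljust(n, b"\0"), b.ljust(n, b"\0")
    return bytes(x ^ y for x, y in zip(a, b))


def make_repair(packets: dict[int, bytes]) -> tuple[list[int], bytes]:
    """Send side: return the recipe (packet IDs) and the XOR of the payloads."""
    recipe = sorted(packets)
    return recipe, reduce(xor_bytes, (packets[i] for i in recipe))


def recover(recipe: list[int], repair: bytes,
            received: dict[int, bytes]) -> Optional[tuple[int, bytes]]:
    """Receive side: rebuild the one missing packet named in the recipe.

    Returns None unless exactly one packet from the recipe is missing,
    because a single XOR can only repair a single loss.
    """
    missing = [i for i in recipe if i not in received]
    if len(missing) != 1:
        return None
    present = (received[i] for i in recipe if i in received)
    return missing[0], reduce(xor_bytes, present, repair)


# Mirror the figure: packets 25..29 are sent and packet 27 is lost.
sent = {i: f"pkt-{i}".encode() for i in range(25, 30)}
recipe, repair = make_repair(sent)
received = {i: p for i, p in sent.items() if i != 27}
print(recover(recipe, repair, received))  # (27, b'pkt-27')
```

The recovered packet arrives after its successors, which matches the slide's point that the end-host sees reordering rather than loss.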

slide-26
SLIDE 26

Layered Interleaving for Bursty Loss

Recovery Latency ∝ Actual Burst Size, not Max Burst Size

[Figure: data stream with XOR X1 over packets 1, 2, 3; X2 over packets 1, 11, 21; X3 over packets 1, 101, 201]

◮ XORs at different interleaves
◮ Recovery latency degrades gracefully with loss burstiness:
  ◮ X1 catches random singleton losses
  ◮ X2 catches loss bursts of 10 or less
  ◮ X3 catches bursts of 100 or less

[Figure: Comparison of Recovery Probability, r=7, c=2. Recovery probability versus Loss probability for Reed-Solomon and Maelstrom]
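The burst limits quoted for X1, X2, and X3 follow from how each layer groups packets. The sketch below is a simplified model rather than Maelstrom's actual construction: it assumes 0-based sequence numbers, r = 3 packets per XOR as drawn in the figure, and that each layer partitions the stream into disjoint groups. All names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional

LAYERS = (1, 10, 100)  # interleaves of X1, X2, X3 on the slide


def group_of(seq: int, interleave: int, r: int) -> tuple[int, int]:
    """Identify the XOR group of packet `seq` within one layer.

    Each block of interleave * r packets holds `interleave` groups, and a
    group takes every `interleave`-th packet of its block.
    """
    return (seq // (interleave * r), seq % interleave)


def layer_recovers(lost: Iterable[int], interleave: int, r: int) -> bool:
    """True if no XOR group in this layer has more than one lost packet."""
    counts = Counter(group_of(s, interleave, r) for s in lost)
    return max(counts.values(), default=0) <= 1


def first_recovering_layer(lost: Iterable[int], r: int = 3) -> Optional[int]:
    """Return the smallest interleave that repairs every lost packet."""
    lost = list(lost)
    for interleave in LAYERS:
        if layer_recovers(lost, interleave, r):
            return interleave
    return None


print(first_recovering_layer([7]))             # 1
print(first_recovering_layer(range(45, 55)))   # 10
print(first_recovering_layer(range(45, 105)))  # 100
```

A group in a higher layer spans roughly interleave × r packets, so its XOR completes later. In this model that is why recovery latency tracks the actual burst length rather than the largest burst the scheme can tolerate.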

slide-27
SLIDE 27

Maelstrom Modes

◮ TCP Traffic: Two Flow Control Modes

A) End-to-End Flow Control

[Figure: End-Host, Appliance, Appliance, End-Host]

B) Split Flow Control

[Figure: End-Host, Appliance, Appliance, End-Host]

◮ Split Mode avoids client buffer resizing (PEP)

slide-28
SLIDE 28

Implementation Details

◮ In Kernel: Linux 2.6.20 Module
◮ Commodity Box: 3 GHz, 1 Gbps NIC (≈ $800)
◮ Max speed: 1 Gbps, Memory Footprint: 10 MB
◮ 50-60% CPU → NIC is the bottleneck (for c = 3)
◮ How do we efficiently store/access/clean a gigabit of data every second?
◮ Scaling to Multi-Gigabit: Partition IP space across proxies

slide-32
SLIDE 32

Evaluation: FEC Mode and Loss

Claim: Maelstrom effectively hides loss from TCP/IP

[Figure: Throughput (Mbps) versus One Way Link Latency (ms) for TCP (No loss), TCP (0.1% loss), TCP (1% loss), Maelstrom (No loss), Maelstrom (0.1%), and Maelstrom (1%)]

Annotations: TCP/IP with loss; TCP/IP without loss; $\text{Tput} \propto \frac{1}{\text{RTT}}$; Data + FEC ≈ 1 Gbps

slide-33
SLIDE 33

Evaluation: Split Mode and Buffering

Claim: Maelstrom eliminates the need for large end-host buffers

[Figure: Throughput as a function of latency, loss rate = 0.001. Throughput (Mbps) versus One-way link latency (ms)]

Tput independent of RTT!

slide-34
SLIDE 34

Evaluation: Delivery Latency

Claim: Maelstrom eliminates TCP/IP's loss-related jitter

[Figure: two panels, "TCP/IP: 0.1% Loss" and "Maelstrom: 0.1% Loss", each plotting Delivery Latency (ms) versus Packet #]

Sources of Jitter:

◮ Receive-side buffering due to sequencing
◮ Send-side buffering due to congestion control

slide-35
SLIDE 35

Evaluation: Layered Interleaving

Claim: Recovery Latency depends on Actual Burst Length

[Figure: three panels for Burst Length = 1, 20, and 40, each plotting % Recovered versus Recovery Latency (ms)]

◮ Longer Burst Lengths → Longer Recovery Latency

slide-36
SLIDE 36

Next Step: SMFS - The Smoke and Mirrors Filesystem

◮ Classic Mirroring Trade-off:
  ◮ Fast: return to user after sending to mirror
  ◮ Safe: return to user after ACK from mirror
◮ SMFS: return to user after sending enough FEC
◮ Maelstrom: Lossy Network → Lossless Network → Disk!
◮ Result: Fast, Safe Mirroring independent of link length!
◮ General Principle: Gray-box Exposure of Protocol State

slide-37
SLIDE 37

The Big Picture

Real-World, Real-Time: Disk Failure, Disasters, Crashes, Overloads, Bugs, Exploits, Packet Loss

Software Stack: Compilers, Databases, Distributed Systems, Machine Learning, Filesystems, Operating Systems, Networking ...

Systems: SMFS, KyotoFS, Tempest, Maelstrom, Ricochet, Plato

slide-39
SLIDE 39

Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation

From Long-Haul to Multicast

[Figure: Between Datacenters, a Sender and Receiver exchange Data and acks over a High RTT link. Within Datacenter, Data is multicast to a Multicast Group of Many Receivers, each returning acks]

◮ Feedback Loop Infeasible:
  ◮ Inter-Datacenter Long-Haul: RTT too high
  ◮ Intra-Cluster Multicast: Too many receivers

slide-42
SLIDE 42

How is Multicast Used?

service replication/partitioning, publish-subscribe, data caching...

Financial Pub-Sub Example:

◮ Each equity is mapped to a multicast group
◮ Each node is interested in a different set of equities ...

[Figure: one node Tracking S&P 500, another Tracking Portfolio]

Each node in many groups ⇒ Low per-group data rate
High per-node data rate ⇒ Overload

slide-43
SLIDE 43

Where does loss occur in a Datacenter?

Packet Loss occurs at end-hosts: independent and bursty

[Figure: burst length (packets) versus time in seconds. One panel shows receiver r1's loss bursts and data bursts in groups A and B; the other shows receiver r2's loss bursts and data bursts in group A. Annotations: Loss Bursts, Data Rate, Less Loaded Node, Overloaded Node]

slide-47
SLIDE 47

Problem Statement

◮ Recover lost packets rapidly!
◮ Scalability:
  ◮ Number of Receivers
  ◮ Number of Senders
  ◮ Number of Groups

[Figure: Multicast Data delivered to a group, with a Lost Packet at one receiver]

slide-53
SLIDE 53

Design Space for Reliable Multicast

How does latency scale?

[Figure: a Sender multicasts data; receivers experience Loss and exchange xors]

◮ 1. acks: implosion
◮ 2. naks
◮ 3. Sender-based FEC: $\text{recovery latency} \propto \frac{1}{\text{datarate}}$
  data rate: one sender, one group
◮ FEC in the network: Where and What?
  Receiver-based FEC: at receivers, from incoming data

slide-55
SLIDE 55

Receiver-Based Forward Error Correction

◮ Receiver generates an XOR of r incoming multicast packets and exchanges with other receivers
◮ Each XOR sent to c other random receivers
◮ Rate: (r, c)
◮ $\text{latency} \propto \frac{1}{\sum_{s} \text{datarate}}$
  data rate: across all senders, in a single group

[Figure: SENDERS deliver Multicast Data to the GROUP]
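The receiver-side loop described above can be sketched as follows. This is an illustration under stated assumptions, not Ricochet's implementation: the class name `RfecReceiver`, its method, and the string node names are hypothetical, payloads are assumed to have equal length, and the repair is returned to the caller instead of being sent on a socket.

```python
import random
from functools import reduce
from typing import Optional


class RfecReceiver:
    """Buffers r incoming multicast packets, then emits one XOR repair."""

    def __init__(self, name: str, peers: list[str], r: int, c: int,
                 rng: Optional[random.Random] = None) -> None:
        self.name = name
        self.peers = [p for p in peers if p != name]  # never target ourselves
        self.r, self.c = r, c
        self.rng = rng or random.Random()
        self.window: list[tuple[str, bytes]] = []

    def on_multicast(self, packet_id: str, payload: bytes):
        """Record one packet; every r-th call returns (ids, xor, targets)."""
        self.window.append((packet_id, payload))
        if len(self.window) < self.r:
            return None
        ids = [pid for pid, _ in self.window]
        xor = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                     (data for _, data in self.window))
        self.window = []
        # Pick c distinct random peers (fewer if the group is small).
        targets = self.rng.sample(self.peers, min(self.c, len(self.peers)))
        return ids, xor, targets


rx = RfecReceiver("n1", ["n1", "n2", "n3", "n4"], r=3, c=2,
                  rng=random.Random(0))
rx.on_multicast("A1", b"aa")
rx.on_multicast("A2", b"bb")
ids, xor, targets = rx.on_multicast("A3", b"cc")
print(ids, len(targets))  # ['A1', 'A2', 'A3'] 2
```

Because the window fills with packets from every sender in the group, a repair is produced after r packets arrive in total, which is the intuition behind latency scaling with the data rate summed across senders.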

slide-56
SLIDE 56

Lateral Error Correction: Principle

[Figure: Receivers R1 and R2 both belong to Group A and Group B. One receiver's incoming data packets are A1 A2 A3 B2 B1 A4 A5 B4 B3 A6 B5; the other misses B2 (Loss). Repair Packet I: (A1, A2, A3, A4, A5). Repair Packet II: (B1, B2, B3, B4, B5). B2 is recovered]

◮ Single-Group RFEC

slide-57
SLIDE 57

Lateral Error Correction: Principle

[Figure: Receivers R1 and R2 both belong to Group A and Group B. One receiver's incoming data packets are A1 A2 A3 B2 B1 A4 A5 B4 B3 A6 B5; the other misses B2 (Loss). Repair Packet I: (A1, A2, A3, B1, B2). Repair Packet II: (A4, B3, B4, A5, A6). B2 is recovered]

◮ Single-Group RFEC
◮ Lateral Error Correction
  ◮ Create XORs from multiple groups → faster recovery!
  ◮ What about complex overlap?

slide-60
SLIDE 60

Nodes and Disjoint Regions

[Figure: overlapping groups with nodes 1, 2, 3, and 4]

◮ Receiver n1 belongs to groups A, B, and C
◮ Divides groups into disjoint regions
◮ Is unaware of groups it does not belong to (D)
◮ Works with any conventional Group Membership Service

slide-64
SLIDE 64

Regional Selection

[Figure: node 1 in group A, which is divided into regions a, ab, ac, and abc; labels $c_A^{a}$, $c_A^{ab}$, $c_A^{ac}$, $c_A^{abc}$, and 1.0 $c_A$]

◮ Select targets for XORs from regions, not groups
◮ From each region, select proportional fraction of $c_A$: $c_A^x = \frac{|x|}{|A|} \cdot c_A$

$\text{latency} \propto \frac{1}{\sum_{s} \sum_{g} \text{datarate}}$

data rate: across all senders, in intersections of groups
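The proportional split of c_A across regions is simple arithmetic. The sketch below applies the formula from the slide to made-up region sizes; the function name and the example numbers are hypothetical, and |A| is assumed to equal the sum of the disjoint region sizes.

```python
def regional_shares(region_sizes: dict[str, int],
                    c_group: float) -> dict[str, float]:
    """Split a group's c value across its disjoint regions.

    Implements c_A^x = (|x| / |A|) * c_A, taking |A| as the sum of the
    region sizes (an assumption of this sketch).
    """
    group_size = sum(region_sizes.values())
    if group_size <= 0:
        raise ValueError("group must contain at least one node")
    return {x: size / group_size * c_group
            for x, size in region_sizes.items()}


# Hypothetical group A with 100 nodes spread over four regions, c_A = 5.
shares = regional_shares({"a": 50, "ab": 25, "ac": 15, "abc": 10}, c_group=5.0)
print(shares["a"])  # 2.5
```

The shares sum to c_A, so the total repair traffic for the group is unchanged. The shares are generally fractional; turning them into whole numbers of targets is left out of this sketch.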

slide-65
SLIDE 65

Scalability in Groups

Claim: Ricochet scales to hundreds of groups.

[Figure: two panels versus Groups per Node (Log Scale, 2 to 1024): LEC Recovery Percentage, and Average Recovery Latency (Microseconds)]

Comparison: At 128 groups, NAK/SFEC latency is 8 seconds. Ricochet is 400 times faster!

slide-66
SLIDE 66

Distribution of Recovery Latency

Claim: Ricochet is reliable and time-critical

[Figure: Recovery Percentage versus Milliseconds in three panels: (a) 10% Loss Rate, 96.8% LEC + 3.2% NAK; (b) 15% Loss Rate, 92% LEC + 8% NAK; (c) 20% Loss Rate, 84% LEC + 16% NAK]

Most lost packets are recovered in under 50 ms by LEC. The remainder are recovered via reactive NAKs.
Bursty Loss: 100 packet burst → 90% recovered at 50 ms avg

slide-68
SLIDE 68

Next Step: Dr. Multicast

◮ IP Multicast has a bad reputation!
  ◮ Unscalable filtering at routers/switches/NICs
  ◮ Insecure
◮ Insight: IP Multicast is a shared, controlled resource
  ◮ Transparent interception of socket system calls
  ◮ Logical address → Set of network (uni/multi)cast addresses
  ◮ Enforcement of IP Multicast policies
  ◮ Gossip-based tracking of membership/mappings

slide-70
SLIDE 70

The Big Picture

Real-World, Real-Time: Disk Failure, Disasters, Crashes, Overloads, Bugs, Exploits, Packet Loss

Software Stack: Compilers, Databases, Distributed Systems, Machine Learning, Filesystems, Operating Systems, Networking ...

Systems: SMFS, KyotoFS, Tempest, Maelstrom, Ricochet, Plato, Mistral, Sequoia

Venue labels: NSDI 2008, NSDI 2007, DSN 2008, SRDS 2006, HotOS XI, Submitted, MobiHoc 2006, PODC 2007

What about new abstractions?

slide-74
SLIDE 74

Future Work

[Figure: requests labeled (ID0 …), (ID1 …), (ID2 …) are routed by Partition to separate Groups; each Group is a Replicate set]

◮ Partition for Scalability
◮ Replicate for Availability / Fault-Tolerance
◮ The DatacenterOS

Concerns: Performance, Parallelism, Privacy, Power...

The Datacenter is the Computer

◮ Rethink Old Abstractions: Processes, Threads, Address Space, Protection, Locks, IPC/RPC, Sockets, Files...
◮ Invent New Abstractions!

slide-76
SLIDE 76

Conclusion

◮ The Real-Time Datacenter
  ◮ Recover from failures within seconds
◮ Reliable Communication: FEC in the Network
  ◮ Recover lost packets in milliseconds
  ◮ Maelstrom: between Datacenters
  ◮ Ricochet: within Datacenters

Thank You!

slide-78
SLIDE 78

Extra Slide: FEC and Bursty Loss

◮ Existing solution: interleaving
◮ Interleave i and rate (r, c) tolerates (c ∗ i) burst...
◮ ...with i times the latency

[Figure: Interleave of 2. Even and Odd packets encoded separately. Without interleaving: A B C D E X X X F G H I J. With interleave 2: A C E G I X X X B D F H J]

Wanted: Graceful degradation of recovery latency with actual burst size for constant overhead
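The c ∗ i burst tolerance can be checked with a small sketch. It assumes 0-based sequence numbers and that packet n belongs to sub-stream n mod i, as in the even/odd figure; the function names are hypothetical.

```python
def interleave_streams(packets: list, i: int) -> list[list]:
    """Split a packet sequence into i sub-streams that are encoded separately."""
    return [packets[k::i] for k in range(i)]


def worst_losses_per_stream(burst_start: int, burst_len: int, i: int) -> int:
    """Largest number of packets a contiguous burst removes from one sub-stream."""
    hits = [0] * i
    for seq in range(burst_start, burst_start + burst_len):
        hits[seq % i] += 1
    return max(hits)


print(interleave_streams(list("ABCDEFGHIJ"), 2))
# [['A', 'C', 'E', 'G', 'I'], ['B', 'D', 'F', 'H', 'J']]

# With c = 3 and i = 2, a burst of c * i = 6 costs each sub-stream 3 packets.
print(worst_losses_per_stream(burst_start=2, burst_len=6, i=2))  # 3
# One packet more and some sub-stream loses 4, exceeding c = 3.
print(worst_losses_per_stream(burst_start=2, burst_len=7, i=2))  # 4
```

The price is that each sub-stream fills its r data packets i times more slowly, which is the "i times the latency" cost noted on the slide.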

slide-79
SLIDE 79

Extra Slide: Maelstrom Evaluation

Maelstrom goodput is near theoretical maximum

[Figure: Throughput (Mbits/sec) versus FEC Layers (c), comparing theoretical goodput with M-FEC goodput]

slide-80
SLIDE 80

Maelstrom Ricochet Conclusion

Extra Slide: Layered Interleaving

Figure: Recovery Latency (Milliseconds) vs. Recovered Packet #, comparing Reed Solomon and Layered Interleaving


slide-81
SLIDE 81


Evaluation: Layered Interleaving

Claim: Recovery Latency depends on Actual Burst Length

Figure: % Recovered vs. Recovery Latency (ms), one panel each for Burst Length = 1, 20, and 40

◮ Longer Burst Lengths → Longer Recovery Latency

slide-84
SLIDE 84


TeraGrid: Supercomputer Network

◮ End-to-End UDP Probes: Zero Congestion, Non-Zero Loss!

◮ Possible Reasons: transient congestion, degraded fiber, malfunctioning HW, misconfigured HW, switching contention, low receiver power, end-host overflow, ...

Figure: % of Measurements vs. % of Lost Packets (0.01 to 1), comparing All Measurements and Without Indiana U. Site (bar value labels in the original: 14, 24, 3, 14)

Problem Statement: Run unmodified TCP/IP over lossy high-speed long-distance networks


slide-88
SLIDE 88


Evaluation: Split Mode and Buffering

Claim: Maelstrom eliminates the need for large end-host buffers

Figure: Throughput (Mbits/sec) at RTT 100ms and 0.1% loss for client-buf, M-ALL, M-FEC, M-BUF, and tcp

Figure annotations: Maelstrom FEC; Maelstrom FEC + Split Mode; Maelstrom FEC + Hand-Tuned Buffers
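For a sense of what "large end-host buffers" means at this RTT, the standard bandwidth-delay product gives the amount of data that must be in flight to fill a link. The sketch below is plain arithmetic, not part of Maelstrom; the 1 Gbit/sec link speed and the 65535-byte window (the TCP maximum without window scaling) are assumptions chosen for illustration.

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: bytes in flight needed to keep the link full."""
    return bandwidth_bps * rtt_s / 8


def window_limited_throughput_bps(window_bytes, rtt_s):
    """Best-case throughput when at most one window is sent per round trip."""
    return window_bytes * 8 / rtt_s


if __name__ == "__main__":
    RTT_S = 0.100          # 100 ms RTT, as in the evaluation slide
    LINK_BPS = 1e9         # assumed 1 Gbit/sec link
    SMALL_WINDOW = 65535   # assumed unscaled TCP window, in bytes

    print(f"buffer needed to fill the link: {bdp_bytes(LINK_BPS, RTT_S) / 1e6:.1f} MB")
    print(
        "throughput with a 65535-byte window: "
        f"{window_limited_throughput_bps(SMALL_WINDOW, RTT_S) / 1e6:.2f} Mbits/sec"
    )
```

With these assumed numbers the sender needs on the order of 12.5 MB of buffer to fill the link, while a default-sized window caps throughput at a few Mbits/sec.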

slide-89
SLIDE 89


Evaluation: FEC mode and loss

Claim: Maelstrom works at high loss rates

Figure: Throughput (Mbps) vs. Packet Loss Rate % (0.1 to 10), comparing Maelstrom and TCP; Link RTT = 100ms
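As rough context for the TCP curve, the well-known Mathis et al. approximation estimates steady-state TCP throughput as (MSS / RTT) × C / √p for loss rate p. The sketch below applies it at the slide's 100 ms RTT. It is not taken from the slides and will not reproduce the measured curve exactly; the 1460-byte MSS and C = √(3/2) are assumptions.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate):
    """Approximate steady-state TCP throughput (Mathis et al. model)."""
    c = math.sqrt(3 / 2)  # constant for periodic loss, no delayed ACKs
    return (mss_bytes * 8 / rtt_s) * c / math.sqrt(loss_rate)


if __name__ == "__main__":
    MSS_BYTES = 1460  # assumed Ethernet-sized segment
    RTT_S = 0.100     # Link RTT = 100ms, from the slide
    for loss_pct in (0.1, 1, 10):
        bps = mathis_throughput_bps(MSS_BYTES, RTT_S, loss_pct / 100)
        print(f"loss {loss_pct}%: about {bps / 1e6:.2f} Mbits/sec")
```

The model predicts that throughput falls with the square root of the loss rate and inversely with RTT, which is why even small loss rates hurt unmodified TCP on long-distance links.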
