Maelstrom Ricochet Conclusion
Reliable Communication for Datacenters
Mahesh Balakrishnan Cornell University
Datacenters
◮ Internet Services (90s): Websites, Search, Online Stores
◮ Since then:
[Figure: number of low-end volume servers, in millions, 2000 to 2005]
Installed Server Base 00-05:
◮ Commodity: up by 100%
◮ High/Mid: down by 40%
◮ Today: Datacenters are ubiquitous
◮ How have they evolved?
Data partially sourced from IDC press releases (www.idc.com)

Networks of Datacenters
Why? Business Continuity, Client Locality, Distributed Datasets
Any modern enterprise!
[Figure: map of interconnected datacenters; labeled link RTTs range from 100 ms to 220 ms]

Networks of Real-Time Datacenters
◮ Finance, Aerospace, Military, Search and Rescue...
◮ ... documents, chat, email, games, videos, photos, blogs, social networks
◮ The Datacenter is the Computer!
◮ Not hard real-time: real fast, highly responsive, time-critical
Gartner Survey:
◮ Real-Time Infrastructure (RTI): reaction time in secs/mins
◮ 73%: RTI is important or very important
◮ 85%: Have no RTI capability

The Real-Time Datacenter — Systems Challenges
How do we recover from failures within seconds?
Real-World, Real-Time
Failures: Disk Failure, Disasters, Crashes, Overloads, Bugs, Exploits, Packet Loss
Systems: SMFS, KyotoFS, Tempest, Maelstrom, Ricochet, Plato
Software Stack: Compilers, Databases, Distributed Systems, Machine Learning, Filesystems, Operating Systems, Networking ...

Reliable Communication
Goal: Recover lost packets fast!
◮ Existing protocols react to loss: too much, too late
◮ We want proactive recovery: stable overhead, low latencies
◮ Maelstrom: Reliability between datacenters [NSDI 2008]
◮ Ricochet: Reliability within datacenters [NSDI 2007]

Reliable Communication between Datacenters
TCP fails in three ways:
100ms RTT, 0.1% Loss, 40 Gbps → Tput < 10 Mbps!
Current Solutions:
◮ Rewrite Apps: One Flow → Multiple Split Flows
◮ Resize Buffers
◮ Spend (infinite) money!

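The "Tput < 10 Mbps" figure above is in line with the common square-root approximation for steady-state TCP throughput. The slide does not say which model it used, so the formula, the 1460-byte MSS, and the constant 1.22 in this sketch are assumptions chosen for illustration.

```python
import math

def tcp_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float, c: float = 1.22) -> float:
    """Square-root approximation: throughput ~ c * MSS / (RTT * sqrt(p)).

    Assumption: this model and the constant c = 1.22 are not taken from the slides.
    """
    return c * mss_bytes * 8 / (rtt_s * math.sqrt(loss_rate))

# Slide parameters: 100 ms RTT, 0.1% loss. The 40 Gbps link capacity does not
# appear in the formula: loss, not capacity, bounds the throughput.
estimate = tcp_throughput_bps(mss_bytes=1460, rtt_s=0.100, loss_rate=0.001)
print(f"estimated throughput: {estimate / 1e6:.1f} Mbps")
```

Under these assumptions a single flow stays in the single-digit Mbps range regardless of how fast the link is.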
TeraGrid: Supercomputer Network
◮ End-to-End UDP Probes: Zero Congestion, Non-Zero Loss!
◮ Possible Reasons:
  ◮ transient congestion
  ◮ degraded fiber
  ◮ malfunctioning HW
  ◮ misconfigured HW
  ◮ switching contention
  ◮ low receiver power
  ◮ end-host overflow
  ◮ ...
[Figure: histogram of % of Measurements against % of Lost Packets, 0.01 to 1]
Electronics: Cluttered Pathways
Optics: Lossy Fiber

Problem Statement
Run unmodified TCP/IP over lossy high-speed long-distance networks

The Maelstrom Network Appliance
[Diagram: Sending End-hosts (Commodity TCP) → Maelstrom Send-Side Appliance (FEC Encode) → Router → lossy link (Packet Loss) → Router → Maelstrom Receive-Side Appliance (Decode) → Receiving End-hosts (Commodity TCP)]
Transparent: No modification to end-host or network
FEC = Forward Error Correction

What is FEC?
[Figure: data packets A B C D E followed by repair packets X X X; lost data packets are recovered at the receiver]
3 repair packets from every 5 data packets
Receiver can recover from any 3 lost packets
Rate: (r, c) means c repair packets for every r data packets.
◮ Pro: Recovery Latency independent of RTT
◮ Constant Data Overhead: $\frac{c}{r+c}$
◮ Packet-level FEC at End-hosts: Inexpensive, No extra HW
◮ Con: Recovery Latency dependent on channel data rate
◮ FEC in the Network:
  ◮ Where and What?

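The overhead formula and the "Con" bullet above can be made concrete with a little arithmetic. This is a minimal sketch: the function names, the 1500-byte packet size, and the simplification that a repair packet can only be emitted once r data packets have been sent are assumptions for illustration.

```python
from fractions import Fraction

def fec_overhead(r: int, c: int) -> Fraction:
    """Fraction of transmitted packets that are repair packets: c / (r + c)."""
    return Fraction(c, r + c)

def group_fill_time_s(r: int, packet_bytes: int, data_rate_bps: float) -> float:
    """Time to send r data packets, i.e. the wait before repairs for them can exist."""
    return r * packet_bytes * 8 / data_rate_bps

# The slide's example rate (r, c) = (5, 3).
print(fec_overhead(5, 3))                # 3/8 of the traffic is repair data
print(group_fill_time_s(5, 1500, 1e9))   # fast channel: the group fills quickly
print(group_fill_time_s(5, 1500, 1e6))   # slow channel: the same group takes 1000x longer
```

The overhead stays constant, but the wait for a full group grows as the data rate drops, which is why recovery latency depends on the channel data rate rather than on RTT.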
The Maelstrom Network Appliance
[Diagram: Sending End-hosts (Commodity TCP) → Maelstrom Send-Side Appliance (FEC Encode) → Router → lossy link (Packet Loss) → Router → Maelstrom Receive-Side Appliance (Decode) → Receiving End-hosts (Commodity TCP)]
Transparent: No modification to end-host or network
FEC = Forward Error Correction
Where: at the appliance, What: aggregated data

Maelstrom Mechanism
Send-Side Appliance:
◮ Snoop IP packets
◮ Create repair packet = XOR + 'recipe' of data packet IDs
Receive-Side Appliance:
◮ Lost packet recovered using XOR and other data packets
◮ At receiver end-host: out
[Figure: packets 25 to 29 travel between the two appliances (LAN MTU at the edges, Lambda Jumbo MTU on the long-haul link) together with an XOR whose 'Recipe List' is 25,26,27,28,29; packet 27 is lost and the receive-side appliance emits it as the Recovered Packet]

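The mechanism above can be sketched in a few lines. This is an illustration only, not the appliance's kernel code: the function names are hypothetical and all payloads are assumed to have equal length.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length payloads (equal length is an assumption of this sketch)."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_repair(packets: dict[int, bytes]) -> tuple[list[int], bytes]:
    """Send side: XOR the payloads and attach the 'recipe' of packet IDs."""
    recipe = sorted(packets)
    return recipe, reduce(xor_bytes, (packets[i] for i in recipe))

def recover(recipe: list[int], xor_payload: bytes, received: dict[int, bytes]):
    """Receive side: rebuild the single missing packet, or return None if 0 or 2+ are missing."""
    missing = [i for i in recipe if i not in received]
    if len(missing) != 1:
        return None
    for i in recipe:
        if i in received:
            xor_payload = xor_bytes(xor_payload, received[i])
    return missing[0], xor_payload

# Mirror the slide: packets 25..29 are sent, 27 is lost on the link.
sent = {i: f"pkt{i}".encode() for i in range(25, 30)}
recipe, repair = make_repair(sent)
arrived = {i: p for i, p in sent.items() if i != 27}
print(recover(recipe, repair, arrived))
```

A single XOR repairs exactly one loss per recipe, which is why bursty loss needs the interleaving scheme described later in the talk.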
Layered Interleaving for Bursty Loss
Recovery Latency ∝ Actual Burst Size, not Max Burst Size
[Figure: Data Stream and XORs at three interleaves: X1 over packets 1, 2, 3, ...; X2 over packets 1, 11, 21, ...; X3 over packets 1, 101, 201, ...]
◮ XORs at different interleaves
◮ Recovery latency degrades gracefully with loss burstiness:
  X1 catches random singleton losses
  X2 catches loss bursts of 10 or less
  X3 catches bursts of 100 or less
[Figure: Comparison of Recovery Probability, r=7, c=2: Recovery probability against Loss probability for Reed-Solomon and Maelstrom]

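A small simulation shows why XORs at interleaves 1, 10, and 100 give graceful degradation. It is a simplified sketch: the grouping rule, the value r = 5, and the single-pass recovery check are assumptions, and the real scheme may recover more by combining layers.

```python
def xor_groups(num_packets: int, r: int, interleave: int) -> list[list[int]]:
    """Partition packet IDs into XOR groups of r packets spaced `interleave` apart."""
    groups = []
    block = r * interleave
    for base in range(0, num_packets, block):
        for offset in range(interleave):
            group = [base + offset + k * interleave for k in range(r)]
            group = [p for p in group if p < num_packets]
            if group:
                groups.append(group)
    return groups

def recoverable(groups: list[list[int]], lost: set[int]) -> bool:
    """True if every lost packet is the only loss inside its XOR group."""
    return all(
        any(p in g and len(lost & set(g)) == 1 for g in groups)
        for p in lost
    )

def smallest_layer(burst_start: int, burst_len: int, r: int = 5,
                   layers=(1, 10, 100), num_packets: int = 1000):
    """Return the smallest interleave whose XORs alone repair the burst, else None."""
    lost = set(range(burst_start, burst_start + burst_len))
    for interleave in layers:
        if recoverable(xor_groups(num_packets, r, interleave), lost):
            return interleave
    return None

for start, length in [(42, 1), (40, 7), (100, 50), (100, 150)]:
    print(length, "->", smallest_layer(start, length))
```

Short bursts are repaired by the low-interleave XORs, which complete soonest, so recovery latency tracks the actual burst size rather than the largest burst the scheme is provisioned for.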
Maelstrom Modes
◮ TCP Traffic: Two Flow Control Modes
A) End-to-End Flow Control
[Diagram: End-Host, Appliance, Appliance, End-Host]
B) Split Flow Control
[Diagram: End-Host, Appliance, Appliance, End-Host]
◮ Split Mode avoids client buffer resizing (PEP)

Implementation Details
◮ In Kernel: Linux 2.6.20 Module
◮ Commodity Box: 3 GHz, 1 Gbps NIC (≈ $800)
◮ Max speed: 1 Gbps, Memory Footprint: 10 MB
◮ 50-60% CPU → NIC is the bottleneck (for c = 3)
◮ How do we efficiently store/access/clean a gigabit of data every second?
◮ Scaling to Multi-Gigabit: Partition IP space across proxies

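The slides do not say how the IP space is partitioned across proxies. One simple possibility, shown here purely as an assumption, is to map each destination address to a proxy index by taking the address modulo the number of proxies; `proxy_for` is a hypothetical helper.

```python
import ipaddress

def proxy_for(dst_ip: str, num_proxies: int) -> int:
    """Map a destination address to a proxy index (illustrative scheme, not from the slides)."""
    return int(ipaddress.ip_address(dst_ip)) % num_proxies

# All packets for the same destination go to the same proxy, so each proxy can
# build its XORs without coordinating with the others.
for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.4"]:
    print(ip, "-> proxy", proxy_for(ip, 4))
```

Any deterministic mapping would do, as long as the send-side and receive-side appliances agree on it.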
Evaluation: FEC Mode and Loss
Claim: Maelstrom effectively hides loss from TCP/IP
[Figure: Throughput (Mbps) against One Way Link Latency (ms) for TCP (No loss), TCP (0.1% loss), TCP (1% loss), Maelstrom (No loss), Maelstrom (0.1%), Maelstrom (1%). Annotations: TCP/IP without loss; TCP/IP with loss; Data + FEC ≈ 1 Gbps; 1 RTT]

Evaluation: Split Mode and Buffering
Claim: Maelstrom eliminates the need for large end-host buffers
[Figure: Throughput as a function of latency, loss-rate=0.001: Throughput (Mbps) against One-way link latency (ms). Annotation: Tput independent]

Evaluation: Delivery Latency
Claim: Maelstrom eliminates TCP/IP’s loss-related jitter
[Figure: Delivery Latency (ms) against Packet #, two panels: TCP/IP: 0.1% Loss; Maelstrom: 0.1% Loss]
Sources of Jitter:
◮ Receive-side buffering due to sequencing
◮ Send-side buffering due to congestion control

Evaluation: Layered Interleaving
Claim: Recovery Latency depends on Actual Burst Length
[Figure: % Recovered against Recovery Latency (ms), three panels for Burst Length = 1, 20, and 40]
◮ Longer Burst Lengths → Longer Recovery Latency

Next Step: SMFS - The Smoke and Mirrors Filesystem
◮ Classic Mirroring Trade-off:
  ◮ Fast: return to user after sending to mirror
  ◮ Safe: return to user after ACK from mirror
◮ SMFS: return to user after sending enough FEC
◮ Maelstrom: Lossy Network → Lossless Network → Disk!
◮ Result: Fast, Safe Mirroring independent of link length!
◮ General Principle: Gray-box Exposure of Protocol State

The Big Picture
Real-World, Real-Time
Failures: Disk Failure, Disasters, Crashes, Overloads, Bugs, Exploits, Packet Loss
Systems: SMFS, KyotoFS, Tempest, Maelstrom, Ricochet, Plato
Software Stack: Compilers, Databases, Distributed Systems, Machine Learning, Filesystems, Operating Systems, Networking ...

Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation
From Long-Haul to Multicast
[Diagram: Between Datacenters, a Sender and a Receiver exchange Data and acks over a High RTT link. Within Datacenter, Data goes to a Multicast Group with Many Receivers, each returning acks]
◮ Feedback Loop Infeasible:
  ◮ Inter-Datacenter Long-Haul: RTT too high
  ◮ Intra-Cluster Multicast: Too many receivers

How is Multicast Used?
service replication/partitioning, publish-subscribe, data caching...
Financial Pub-Sub Example:
◮ Each equity is mapped to a multicast group
◮ Each node is interested in a different set of equities ...
Each node in many groups ⇒ Low per-group data rate
High per-node data rate ⇒ Overload
[Figure: example nodes, one Tracking S&P 500 and one Tracking Portfolio]

Where does loss occur in a Datacenter?
Packet Loss occurs at end-hosts: independent and bursty
[Figure: burst length (packets) against time in seconds, two panels. Receiver r1: loss bursts and data bursts in A and B. Receiver r2: loss bursts and data bursts in A. Annotations: Loss Bursts; Data Rate; Less Loaded Node; Overloaded Node]

Problem Statement
◮ Recover lost packets rapidly!
◮ Scalability:
  ◮ Number of Receivers
  ◮ Number of Senders
  ◮ Number of Groups
[Figure: Multicast Data with a Lost Packet]

Design Space for Reliable Multicast
How does latency scale?
[Diagram: a Sender multicasts data to receivers that experience Loss; the successive builds show acks, naks, and xors as the recovery traffic]
◮ 1. acks: implosion
◮ 2. naks
◮ 3. Sender-based FEC
  $\text{recovery latency} \propto \frac{1}{\text{datarate}}$
  data rate:
◮ FEC in the network: Where and What?
  Receiver-based FEC: at receivers, from incoming data

Receiver-Based Forward Error Correction
◮ Receiver generates an XOR of r incoming multicast packets and exchanges with other receivers
◮ Each XOR sent to c other random receivers
◮ Rate: (r, c)
◮ $\text{latency} \propto \frac{1}{\sum_s \text{datarate}}$
  data rate: across all senders, in a single group
[Diagram: SENDERS send Multicast Data to the GROUP]

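The receiver-side behavior described above can be sketched as a small class. The class and method names are hypothetical, payloads are assumed to be of equal length, and the returned list stands in for actual network sends.

```python
import random
from functools import reduce

def xor_payloads(payloads: list[bytes]) -> bytes:
    """XOR equal-length payloads together (equal length is an assumption here)."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), payloads)

class RfecReceiver:
    """Buffers r incoming multicast packets, then sends one XOR to c random peers."""

    def __init__(self, name: str, members: list[str], r: int, c: int, seed=None):
        self.r, self.c = r, c
        self.peers = [m for m in members if m != name]  # never target ourselves
        self.pending: list[tuple[int, bytes]] = []
        self.rng = random.Random(seed)

    def on_data(self, packet_id: int, payload: bytes):
        """Return a list of (target, packet_ids, xor) messages; empty until r packets arrive."""
        self.pending.append((packet_id, payload))
        if len(self.pending) < self.r:
            return []
        ids = [pid for pid, _ in self.pending]
        repair = xor_payloads([p for _, p in self.pending])
        self.pending = []
        targets = self.rng.sample(self.peers, min(self.c, len(self.peers)))
        return [(t, ids, repair) for t in targets]

node = RfecReceiver("n1", ["n1", "n2", "n3", "n4", "n5"], r=3, c=2, seed=7)
out = []
for pid, data in enumerate([b"aaaa", b"bbbb", b"cccc"]):
    out = node.on_data(pid, data)
print(out)
```

Because every receiver contributes XORs built from whatever it hears, the rate at which repairs are generated follows the aggregate data rate across all senders in the group.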
Lateral Error Correction: Principle
[Figure: Receiver R1 and Receiver R2 both belong to Group A and Group B. INCOMING DATA PACKETS at one receiver: A1 A2 A3 B2 B1 A4 A5 B4 B3 A6 B5. At the other receiver B2 is lost (Loss) and later rebuilt from a repair packet (Recovery)]
◮ Single-Group RFEC
  Repair Packet I: (A1, A2, A3, A4, A5)
  Repair Packet II: (B1, B2, B3, B4, B5)
◮ Lateral Error Correction
  Repair Packet I: (A1, A2, A3, B1, B2)
  Repair Packet II: (A4, B3, B4, A5, A6)
  ◮ Create XORs from multiple groups → faster recovery!
  ◮ What about complex

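The benefit of building XORs across groups can be checked against the arrival order shown in the figure. This sketch, with a hypothetical helper and r = 5 as in the repair packets above, reports the stream position at which a repair packet covering B2 becomes available.

```python
def repair_ready_index(stream: list[str], r: int, lateral: bool, target: str):
    """Index in the arrival stream at which an XOR covering `target` is complete.

    lateral=False: one XOR buffer per group (single-group RFEC).
    lateral=True:  one shared buffer across groups (lateral error correction).
    The group is taken from the first character of the packet name, an
    assumption that fits the A1/B2 naming used on the slide.
    """
    buffers: dict[str, list[str]] = {}
    for idx, pkt in enumerate(stream):
        key = "all" if lateral else pkt[0]
        buf = buffers.setdefault(key, [])
        buf.append(pkt)
        if len(buf) == r:
            if target in buf:
                return idx
            buf.clear()
    return None

stream = ["A1", "A2", "A3", "B2", "B1", "A4", "A5", "B4", "B3", "A6", "B5"]
print("single-group:", repair_ready_index(stream, 5, lateral=False, target="B2"))
print("lateral:     ", repair_ready_index(stream, 5, lateral=True, target="B2"))
```

With per-group buffers the repair for B2 has to wait for five Group B packets; with a shared buffer it is complete after the first five packets of any group.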
Nodes and Disjoint Regions
[Figure: overlapping groups with nodes labeled 1, 2, 3, and 4]
◮ Receiver n1 belongs to groups A, B, and C
◮ Divides groups into disjoint regions
◮ Is unaware of groups it does not belong to (D)
◮ Works with any conventional Group Membership Service

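One way to picture the region computation is to bucket every other node by the subset of n1's groups it shares. The function and the membership table below are hypothetical, and membership is assumed to come from an external group membership service.

```python
from collections import defaultdict

def disjoint_regions(my_groups: set[str], membership: dict[str, set[str]]):
    """Bucket nodes by which of my groups they share; nodes sharing none are ignored."""
    regions: dict[frozenset, set[str]] = defaultdict(set)
    for node, groups in membership.items():
        shared = frozenset(groups & my_groups)
        if shared:
            regions[shared].add(node)
    return dict(regions)

# n1 is in A, B, and C. Group D is invisible to it: n4 is dropped, and n5 is
# placed by its A/B/C membership only.
membership = {
    "n2": {"A", "B"},
    "n3": {"A"},
    "n4": {"D"},
    "n5": {"A", "B", "C", "D"},
}
regions = disjoint_regions({"A", "B", "C"}, membership)
for shared, nodes in sorted(regions.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(shared), sorted(nodes))
```

Each node lands in exactly one region, so the regions partition the nodes that n1 can usefully exchange repair packets with.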
Regional Selection
[Figure: group A, as seen by node 1, split into regions a, ab, ac, and abc, with c_A divided among them]
◮ Select targets for XORs from regions, not groups
◮ From each region, select proportional fraction of c_A: $c_A^x = \frac{|x|}{|A|} \cdot c_A$
$\text{latency} \propto \frac{1}{\sum_s \sum_g \text{datarate}}$
data rate: across all senders, in intersections of groups

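The proportional split can be worked through numerically. The region sizes and c_A = 5 below are made-up example values, and `per_region_targets` is a hypothetical helper; exact fractions are used so that the shares visibly sum back to c_A.

```python
from fractions import Fraction

def per_region_targets(region_sizes: dict[str, int], c_group: int) -> dict[str, Fraction]:
    """Apply c_A^x = |x| / |A| * c_A to every region x of group A."""
    group_size = sum(region_sizes.values())  # |A| is the sum of its disjoint regions
    return {x: Fraction(n, group_size) * c_group for x, n in region_sizes.items()}

# Example: group A has 10 members spread over four regions, and c_A = 5.
sizes = {"a": 5, "ab": 2, "ac": 2, "abc": 1}
shares = per_region_targets(sizes, c_group=5)
for region, share in shares.items():
    print(region, share)
print("total:", sum(shares.values()))
```

Shares can be fractional; how a fractional share is turned into actual sends is not covered on the slide and is left out of this sketch.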
Scalability in Groups
Claim: Ricochet scales to hundreds of groups.
[Figure: two panels against Groups per Node (Log Scale, 2 to 1024): LEC Recovery Percentage; Average Recovery Latency (Microseconds)]
Comparison: At 128 groups, NAK/SFEC latency is 8 seconds. Ricochet is 400 times faster!

Distribution of Recovery Latency
Claim: Ricochet is reliable and time-critical
[Figure: Recovery Percentage against Milliseconds, three panels, with LEC and NAK recoveries marked]
(a) 10% Loss Rate: 96.8% LEC + 3.2% NAK
(b) 15% Loss Rate: 92% LEC + 8% NAK
(c) 20% Loss Rate: 84% LEC + 16% NAK
Most lost packets recovered < 50ms by LEC. Remainder via reactive NAKs.
Bursty Loss: 100 packet burst → 90% recovered at 50 ms avg

Next Step: Dr. Multicast
◮ IP Multicast has a bad reputation!
  ◮ Unscalable filtering at routers/switches/NICs
  ◮ Insecure
◮ Insight: IP Multicast is a shared, controlled resource
  ◮ Transparent interception of socket system calls
  ◮ Logical address → Set of network (uni/multi)cast addresses
  ◮ Enforcement of IP Multicast policies
  ◮ Gossip-based tracking of membership/mappings

The Big Picture
Real-World, Real-Time
Failures: Disk Failure, Disasters, Crashes, Overloads, Bugs, Exploits, Packet Loss
Systems: SMFS, KyotoFS, Tempest, Maelstrom, Ricochet, Plato
Software Stack: Compilers, Databases, Distributed Systems, Machine Learning, Filesystems, Operating Systems, Networking ...
Venue labels on the diagram: NSDI 2008, NSDI 2007, DSN 2008, SRDS 2006, HotOS XI, Submitted
Mistral: MobiHoc 2006
Sequoia: PODC 2007
What about new abstractions?

Future Work
[Diagram: a Service receives an Invocation (ID …); the Service is partitioned into Instances by ID, (ID0 …), (ID1 …), (ID2 …); each partition is then replicated as a Group]
◮ Partition for Scalability
◮ Replicate for Availability / Fault-Tolerance
◮ The DatacenterOS
Concerns: Performance, Parallelism, Privacy, Power...
The Datacenter is the Computer
Rethink Old Abstractions: Processes, Threads, Address Space, Protection, Locks, IPC/RPC, Sockets, Files...
Invent New Abstractions!

Conclusion
◮ The Real-Time Datacenter
  ◮ Recover from failures within seconds
◮ Reliable Communication: FEC in the Network
  ◮ Recover lost packets in milliseconds
  ◮ Maelstrom: between Datacenters
  ◮ Ricochet: within Datacenters
Thank You!

Extra Slide: FEC and Bursty Loss
◮ Existing solution: interleaving
◮ Interleave i and rate (r, c) tolerates (c * i) burst...
◮ ...with i times the latency
[Figure: without interleaving, A B C D E X X X F G H I J; with interleaving, A C E G I X X X B D F H J]
Figure: Interleave of 2, with Even and Odd packets encoded separately
Wanted: Graceful degradation of recovery latency with actual burst size for constant overhead

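The c * i burst tolerance of plain interleaving can be verified by counting how many packets of a burst fall into each interleaved substream. This sketch assumes round-robin assignment of packets to substreams and a code that repairs up to c losses per substream; the function names are hypothetical.

```python
def max_losses_per_substream(burst_start: int, burst_len: int, interleave: int) -> int:
    """Largest number of burst packets that land in any one of the i substreams."""
    counts = [0] * interleave
    for packet in range(burst_start, burst_start + burst_len):
        counts[packet % interleave] += 1
    return max(counts)

def tolerates(burst_len: int, interleave: int, c: int, burst_start: int = 0) -> bool:
    """A burst is tolerated if no substream loses more than c packets."""
    return max_losses_per_substream(burst_start, burst_len, interleave) <= c

# Interleave of 2 with c = 3, as in the figure: bursts of up to c * i = 6 packets survive.
for length in (3, 6, 7):
    print(length, tolerates(length, interleave=2, c=3))
```

The price is that each substream fills i times more slowly, which is the "i times the latency" bullet above and the motivation for layering several interleaves.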
Extra Slide: Maelstrom Evaluation
Maelstrom goodput is near theoretical maximum
[Figure: Throughput (Mbits/sec) against FEC Layers (c), 1 to 6: theoretical goodput and M-FEC goodput]

Extra Slide: Layered Interleaving
[Figure: Recovery Latency (Milliseconds) against Recovered Packet #: Reed Solomon and Layered Interleaving]

TeraGrid: Supercomputer Network
◮ End-to-End UDP Probes: Zero Congestion, Non-Zero Loss!
◮ Possible Reasons:
  ◮ transient congestion
  ◮ degraded fiber
  ◮ malfunctioning HW
  ◮ misconfigured HW
  ◮ switching contention
  ◮ low receiver power
  ◮ end-host overflow
  ◮ ...
[Figure: histogram of % of Measurements against % of Lost Packets, 0.01 to 1, two series: All Measurements; Without Indiana U. Site]
Problem Statement: Run unmodified TCP/IP over lossy high-speed long-distance networks

Evaluation: Split Mode and Buffering
Claim: Maelstrom eliminates the need for large end-host buffers
[Figure: Throughput (Mbits/sec) at RTT 100ms and 0.1% loss for the configurations client-buf, M-ALL, M-FEC, M-BUF, and tcp. Annotations: Maelstrom FEC + Split Mode; Maelstrom FEC + Hand-Tuned Buffers]

Evaluation: FEC mode and loss
Claim: Maelstrom works at high loss rates
[Figure: Throughput (Mbps) against Packet Loss Rate % (0.1 to 10) for Maelstrom and TCP; Link RTT = 100ms]