Surviving congestion in geo-distributed storage systems
Brian Cho, University of Illinois at Urbana-Champaign
Marcos K. Aguilera, Microsoft Research Silicon Valley
Geo-distributed data centers
- Web applications increasingly deployed across geo-distributed data centers
– e.g., social networks, online stores, messaging
- App data replicated across data centers
– Disaster tolerance
– Access locality
Congestion between geo-distributed data centers
- Limited bandwidth between data centers
– e.g., leased lines, MPLS VPN
– Bandwidth is expensive: ~$1K/Mbps [SprintMPLS]
– Provision for typical (not peak) usage
- Many machines in each data center
Congestion → Delay between geo-distributed data centers
- Congestion can cause significant delays
– TCP message delays increase to order-of-seconds (Figure)
– Observed across Amazon EC2 data centers [Kraska et al]
- Users do not tolerate delays above ~1s [Nielsen]
[Figure: RPC round-trip delay under congestion (10-30 s)]
Replication techniques applied to geo-distributed data centers
- Weak consistency
– e.g., Amazon Dynamo, Yahoo PNUTS, COPS
– Good performance: updates can be propagated asynchronously
– Semantics undesirable in some cases (e.g., writes get re-ordered across replicas)
- Strong consistency
– e.g., ABD, Paxos; available in Google Megastore, Amazon SimpleDB
– Avoids the many problems of weak consistency
– Must wait for updates to propagate across data centers
– App delay requirements difficult to meet under congestion
Contributions
- Vivace: a strongly consistent key-value store that is resilient to congestion across geo-distributed data centers
- Approach
– New algorithms send small amount of critical information across data centers in separate prioritized messages
- Challenges
– Still provide strong consistency
– Keep prioritized messages small
– Avoid delay overhead in absence of congestion
Vivace algorithms
- Enhance previous strongly consistent algorithms
- Prioritize small amount of critical information across sites
Two algorithms:
- 1. Read/write algorithm
– Very simple
– Based on traditional quorum algorithm [ABD]
– Linearizable read() and write()
– read() contains a write-back phase
- 2. State machine replication algorithm
– More complex, details in paper
Traditional quorum algorithm: write
Client, Replica 1, Replica 2, Replica 3
Phase 1: Client sends <WRITE,key,val,ts> to all replicas; val is large (compared with key & ts)
Replicas reply <ACK-WRITE>; the write is done once a quorum has acknowledged
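To make the message flow concrete, here is a minimal Python sketch of this quorum write, with replicas simulated as in-memory objects rather than networked processes; the Replica class and the timestamp format are assumptions made for the illustration, not the paper's implementation.

```python
class Replica:
    def __init__(self):
        self.store = {}  # key -> (ts, val)

    def write(self, key, val, ts):
        # Keep only the value with the highest timestamp.
        if key not in self.store or ts > self.store[key][0]:
            self.store[key] = (ts, val)
        return "ACK-WRITE"


def quorum_write(replicas, key, val, ts):
    # <WRITE,key,val,ts> to every replica: the large val crosses the
    # network to each one, on the critical path.
    acks = [r.write(key, val, ts) for r in replicas]
    # The write is done once a majority quorum has acknowledged.
    assert len(acks) > len(replicas) // 2
    return "write done"


replicas = [Replica() for _ in range(3)]
quorum_write(replicas, "user:42", "large value ...", ts=(1, "client-A"))
```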
Traditional quorum algorithm: read
Client, Replica 1, Replica 2, Replica 3
Phase 1: Client sends <READ,key> to all replicas; each replies <ACK-READ,val,ts> (large val)
Phase 2: Client writes back the value with the largest ts: <WRITE,key,val,ts> to all replicas (large val, again!); the write-back ensures strong consistency (linearizability)
Replicas reply <ACK-WRITE>; the read is done once a quorum has acknowledged
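Continuing the same sketch, the read with its write-back phase might look as follows (again illustrative, not the paper's code): phase 1 collects (ts, val) pairs from the replicas, and phase 2 writes the freshest value back through quorum_write from the sketch above.

```python
def quorum_read(replicas, key):
    # Phase 1: <READ,key> to every replica; each <ACK-READ,val,ts> reply
    # carries the large val across the network.
    replies = [r.store.get(key, ((0, ""), None)) for r in replicas]
    ts, val = max(replies, key=lambda reply: reply[0])

    # Phase 2 (write-back): resend the freshest value (large val, again!)
    # so that later reads cannot return an older value; this is what makes
    # the read linearizable.
    quorum_write(replicas, key, val, ts)
    return val


print(quorum_read(replicas, "user:42"))   # -> "large value ..."
```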
Vivace: write
Client, Local Replicas 1-3 (a new quorum of local replicas), remote Replicas 1-3
Phase 1: Client sends <W-LOCAL,key,val,ts> to the local replicas; val is sent only locally; they reply <ACK-W-LOCAL>
Phase 2: Client sends <W-TS,key,ts> to the remote replicas; no val, so the message is small and can be prioritized; they reply <ACK-W-TS>
The write is done: Replicas 1-3 have a consistent view of key & ts, but no val (yet)
Off the critical path (*): Client sends <W-REMOTE,key,val,ts>; val is still large, but not in the critical path; Replicas 1-3 add val to their consistent view of key & ts
Write comparison
Traditional algorithm: 1 remote RTT
Vivace algorithm: 1 prioritized remote RTT + 1 local RTT
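The structure of the Vivace write can be sketched the same way. The VivaceReplica class, its method names, and the split between a meta map (key & ts) and a data map (val) are assumptions made for this illustration; in the real system these are prioritized and background network messages rather than local method calls.

```python
class VivaceReplica:
    def __init__(self):
        self.meta = {}  # key -> ts        (consistent view of key & ts)
        self.data = {}  # key -> (ts, val)

    def w_local(self, key, val, ts):       # <W-LOCAL,key,val,ts>
        self.meta[key] = max(ts, self.meta.get(key, ts))
        if key not in self.data or ts > self.data[key][0]:
            self.data[key] = (ts, val)
        return "ACK-W-LOCAL"

    def w_ts(self, key, ts):               # <W-TS,key,ts>, small and prioritized
        self.meta[key] = max(ts, self.meta.get(key, ts))
        return "ACK-W-TS"

    def w_remote(self, key, val, ts):      # <W-REMOTE,key,val,ts>, background
        if key not in self.data or ts > self.data[key][0]:
            self.data[key] = (ts, val)


def vivace_write(local_replicas, remote_replicas, key, val, ts):
    # Phase 1 (1 local RTT): the large val only travels inside the local DC.
    for r in local_replicas:
        r.w_local(key, val, ts)
    # Phase 2 (1 prioritized remote RTT): only key & ts cross the congested
    # link, so the messages stay small enough to be prioritized.
    for r in remote_replicas:
        r.w_ts(key, ts)
    # The write is done here: all replicas agree on (key, ts).
    # Off the critical path: ship the large val to the remote replicas.
    for r in remote_replicas:
        r.w_remote(key, val, ts)
    return "write done"
```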
Vivace: read
Client, Local Replicas 1-3, remote Replicas 1-3
Phase 1 (prioritized): Client sends <R-TS,key> to the replicas; only asks for ts; they reply <ACK-R-TS,ts> (small messages)
Phase 2: Client sends <R-DATA,key,ts> asking for the data with the largest ts; waits for only one <ACK-R-DATA,val> reply (large val, but in the common case it comes from a local replica)
Phase 3 (prioritized): Client writes back only the small ts with <W-TS,key,ts>; replicas reply <ACK-W-TS>; the read is done
Read comparison
Traditional algorithm: 2 remote RTTs
Vivace algorithm: 2 prioritized remote RTTs + 1 local RTT
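A matching sketch of the Vivace read, continuing the VivaceReplica example above (illustrative only), followed by a small usage example:

```python
def vivace_read(local_replicas, remote_replicas, key):
    replicas = local_replicas + remote_replicas

    # Phase 1 (prioritized remote RTT): <R-TS,key> only asks for ts, so the
    # <ACK-R-TS,ts> replies stay small.
    ts = max(r.meta.get(key, (0, "")) for r in replicas)

    # Phase 2: <R-DATA,key,ts> asks for the data with the largest ts and
    # waits for a single <ACK-R-DATA,val> reply; in the common case a local
    # replica already holds it, so this costs one local RTT.
    val = None
    for r in replicas:
        entry = r.data.get(key)
        if entry is not None and entry[0] >= ts:
            val = entry[1]
            break

    # Phase 3 (prioritized remote RTT): write back only the small ts with
    # <W-TS,key,ts>; this preserves linearizability without resending val.
    for r in replicas:
        r.w_ts(key, ts)
    return val


locals_ = [VivaceReplica() for _ in range(3)]
remotes = [VivaceReplica() for _ in range(3)]
vivace_write(locals_, remotes, "user:42", "large value ...", ts=(1, "client-A"))
print(vivace_read(locals_, remotes, "user:42"))   # -> "large value ..."
```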
Evaluation topics
- Practical prioritization setup
- Delay with congestion
– KV-store operations
– Twitter-clone web app operations
- Delay without congestion
– Overhead of Vivace algorithms compared to traditional algorithms
Evaluation setup
- Local cluster <-> Amazon EC2 Ireland
- DSCP bit prioritization on local router’s egress port
- Congestion generated with iperf
[Diagram: Local cluster (Illinois) <-> Amazon EC2 (Ireland); prioritization applied only at the local router]
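The slides only state that DSCP-bit prioritization is applied on the local router's egress port. As a minimal sketch, a sender could mark its small critical messages with a DSCP codepoint so a DSCP-aware router can queue them ahead of bulk traffic; the choice of the EF codepoint and the router-side queueing policy here are assumptions, not details from the talk.

```python
import socket

# Mark a TCP connection's packets with a DSCP codepoint (assumed: EF) so a
# DSCP-aware router can prioritize them over bulk cross-data-center traffic.
DSCP_EF = 46                 # Expedited Forwarding codepoint (assumption)
TOS_VALUE = DSCP_EF << 2     # DSCP occupies the top 6 bits of the TOS byte


def prioritized_connection(host, port):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # IP_TOS is available on Linux (and most Unix-like platforms).
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_VALUE)
    sock.connect((host, port))
    return sock
```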
Evaluation
Does prioritization work in practice?
- Simple ping experiment
- Prioritized messages bypass congestion
- Local router-based prioritization is effective
Evaluation
How well does Vivace perform under congestion?
[Figures: delay of KV-store operations under congestion: (a) Read algorithms, (b) Write algorithms, (c) State machine algorithms]
[Figures: delay of Twitter-clone operations under congestion: (a) Post tweet, (b) Read user timeline, (c) Read friends timeline]
- Traditional algorithms (2 remote RTTs) suffer buffering delay and TCP resends on packet loss
- Vivace (2 prioritized remote RTTs + 1 local RTT) avoids congestion delays
Evaluation
What is the overhead of Vivace without congestion?
- (Results in paper)
- No measurable overhead compared to traditional algorithms
- Extra message phases are not harmful
Conclusion
- Proposed two new algorithms
– Read/write (simple, in talk)
– State machine (more complex, in paper)
- Both algorithms avoid delay due to congestion by prioritizing a small amount of critical information, while
– Still providing strong consistency
– Keeping prioritized messages small
– Avoiding delay overhead in absence of congestion
– Using a practical prioritization infrastructure
- Careful use of prioritized messages can be an effective strategy in geo-distributed data centers