SLIDE 1

Surviving congestion in geo-distributed storage systems

Brian Cho, University of Illinois at Urbana-Champaign
Marcos K. Aguilera, Microsoft Research Silicon Valley

SLIDE 2

Geo-distributed data centers

  • Web applications increasingly deployed across geo-distributed data centers
– e.g., social networks, online stores, messaging
  • App data replicated across data centers
– Disaster tolerance
– Access locality

SLIDE 3

Congestion between geo-distributed data centers

  • Limited bandwidth between data centers
– e.g., leased lines, MPLS VPN
– Bandwidth is expensive: ~$1K/Mbps [SprintMPLS]
– Provisioned for typical (not peak) usage
  • Many machines in each data center

SLIDE 4

Congestion → Delay between geo-distributed data centers

  • Congestion can cause significant delays
– TCP messaging delay increases to the order of seconds (see figure)
– Observed across Amazon EC2 data centers [Kraska et al]
  • Users do not tolerate delays; responses should take <1 s [Nielsen]

FIGURE: RPC round trip delay under congestion (10-30s)

SLIDE 5

Replication techniques applied to geo-distributed data centers

  • Weak consistency
– e.g., Amazon Dynamo, Yahoo PNUTS, COPS
– Good performance: updates can be propagated asynchronously
– Semantics undesirable in some cases (e.g., writes get re-ordered across replicas)
  • Strong consistency
– e.g., ABD, Paxos; available in Google Megastore, Amazon SimpleDB
– Avoids the many problems of weak consistency
– Must wait for updates to propagate across data centers
– App delay requirements difficult to meet under congestion

SLIDE 6

Contributions

  • Vivace: a strongly consistent key-value store that is resilient to congestion across geo-distributed data centers
  • Approach
– New algorithms send a small amount of critical information across data centers in separate prioritized messages (see the sketch below)
  • Challenges
– Still provide strong consistency
– Keep prioritized messages small
– Avoid delay overhead in absence of congestion
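
To picture the split between critical information and bulk data, here is a small, hedged illustration of the two kinds of messages the approach implies. The class and field names are assumptions made for illustration (the message kinds such as W-TS and W-REMOTE appear later in the talk); this is not Vivace's actual wire format.

```python
# Illustration only: separating small, prioritized control messages (key and
# timestamp) from large data messages (the value). Names are assumptions,
# not Vivace's actual wire format.
from dataclasses import dataclass

@dataclass
class ControlMsg:
    kind: str                 # e.g. "W-TS" or "R-TS" (introduced later in the talk)
    key: str
    ts: int
    prioritized: bool = True  # small enough to send with high priority across sites

@dataclass
class DataMsg:
    kind: str                 # e.g. "W-REMOTE"
    key: str
    ts: int
    val: bytes                # the large payload, kept out of prioritized traffic
    prioritized: bool = False
```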

SLIDE 7

Vivace algorithms

  • Enhance previous strongly consistent algorithms
  • Prioritize a small amount of critical information across sites

SLIDE 8

Vivace algorithms

  • Enhance previous strongly consistent algorithms
  • Prioritize a small amount of critical information across sites

Two algorithms:

  • 1. Read/write algorithm
– Very simple
– Based on traditional quorum algorithm [ABD]
– Linearizable read() and write()
– read() contains a write-back phase
  • 2. State machine replication algorithm
– More complex, details in paper

SLIDE 9

Traditional quorum algorithm: write

DIAGRAM (Client; Replicas 1-3): phase 1, the client sends <WRITE,key,val,ts> to all replicas. Note: val is large (compared with key & ts).

SLIDE 10

Traditional quorum algorithm: write

DIAGRAM: phase 1 continued, the replicas reply <ACK-WRITE>.

SLIDE 11

Traditional quorum algorithm: write

DIAGRAM: once a quorum of <ACK-WRITE> replies arrives, the write is done.

SLIDE 12

Traditional quorum algorithm: read

DIAGRAM: phase 1, the client sends <READ,key> to all replicas.

SLIDE 13

Traditional quorum algorithm: read

DIAGRAM: phase 1 continued, the replicas reply <ACK-READ,val,ts>. Note: large val.

SLIDE 14

Traditional quorum algorithm: read

DIAGRAM: phase 2, the client writes back <WRITE,key,val,ts>. The write-back ensures strong consistency (linearizability). Note: large val, again!

SLIDE 15

Traditional quorum algorithm: read

DIAGRAM: phase 2 continued, the replicas reply <ACK-WRITE>.

SLIDE 16

Traditional quorum algorithm: read

DIAGRAM: after a quorum of <ACK-WRITE> replies in phase 2, the read is done.
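
Below is a minimal, self-contained sketch of the traditional quorum read/write walked through in slides 9-16, with replicas simulated as in-memory dicts so it runs as-is. The function and helper names are mine, not the paper's, and choosing the write timestamp (normally done by first querying replicas) is omitted.

```python
# Sketch of the traditional quorum (ABD-style) read/write from slides 9-16.
# Replicas are simulated as in-memory dicts; the loops stand in for sending a
# message to each replica and waiting for a quorum of acks.

def majority(n):
    return n // 2 + 1

def quorum_write(replicas, key, val, ts):
    acks = 0
    for rep in replicas:                       # <WRITE,key,val,ts> (val is large)
        _, cur_ts = rep.get(key, (None, -1))
        if ts > cur_ts:
            rep[key] = (val, ts)
        acks += 1                              # <ACK-WRITE>
        if acks >= majority(len(replicas)):    # quorum reached: write done
            return

def quorum_read(replicas, key):
    # Phase 1: <READ,key>; gather (val, ts) from a quorum and keep the highest ts.
    replies = [rep.get(key, (None, -1)) for rep in replicas[:majority(len(replicas))]]
    val, ts = max(replies, key=lambda vt: vt[1])
    # Phase 2: write back the chosen value (large val, again!); the write-back
    # is what makes the read linearizable.
    quorum_write(replicas, key, val, ts)
    return val

replicas = [dict(), dict(), dict()]
quorum_write(replicas, "x", b"large value", ts=1)
print(quorum_read(replicas, "x"))              # -> b'large value'
```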

SLIDE 17

Vivace: write

DIAGRAM (Client; Replicas 1-3; Local Replicas 1-3): Vivace adds a new quorum of local replicas.

SLIDE 18

Vivace: write

DIAGRAM: phase 1, the client sends <W-LOCAL,key,val,ts> to the local replicas; val is sent locally.

SLIDE 19

Vivace: write

DIAGRAM: phase 1 continued, the local replicas reply <ACK-W-LOCAL>.

SLIDE 20

Vivace: write

DIAGRAM: phase 2, the client sends <W-TS,key,ts> to the remote replicas. No val: small message! Prioritize it.

SLIDE 21

Vivace: write

DIAGRAM: phase 2 continued, the remote replicas reply <ACK-W-TS>.

SLIDE 22

Vivace: write

DIAGRAM: the write is done. Replicas 1-3 have a consistent view of key & ts, but no val (yet).

SLIDE 23

Vivace: write

DIAGRAM: off the critical path (*), the client sends <W-REMOTE,key,val,ts>. val is still large, but not in the critical path; Replicas 1-3 add val to their consistent view of key & ts.
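
Below is a minimal, runnable sketch of the Vivace write from slides 17-23, in the same simulated style as the earlier quorum sketch. Replicas are in-memory dicts, the prioritized flag is only a marker for the small cross-data-center messages, and the helper names are mine, not the paper's.

```python
# Sketch of the Vivace write (slides 17-23). Replicas are simulated as dicts;
# "prioritized" marks the small messages that would be sent with high priority
# across data centers in the real system.

def majority(n):
    return n // 2 + 1

def store(replica, key, val, ts, prioritized=False):
    # A replica keeps (val, ts) per key. A message without val (W-TS) can
    # advance the timestamp; W-REMOTE later fills in the value for that ts.
    cur_val, cur_ts = replica.get(key, (None, -1))
    if ts > cur_ts:
        replica[key] = (val, ts)
    elif ts == cur_ts and cur_val is None and val is not None:
        replica[key] = (val, ts)
    return "ACK"

def vivace_write(local_replicas, remote_replicas, key, val, ts):
    # Phase 1: <W-LOCAL,key,val,ts>; the large val crosses only the local network.
    for r in local_replicas[:majority(len(local_replicas))]:
        store(r, key, val, ts)                        # <ACK-W-LOCAL> from a quorum
    # Phase 2: <W-TS,key,ts>; small, prioritized message across data centers.
    for r in remote_replicas[:majority(len(remote_replicas))]:
        store(r, key, None, ts, prioritized=True)     # <ACK-W-TS> from a quorum
    # The write is done here: remote replicas agree on (key, ts) without val.
    # Off the critical path, <W-REMOTE,key,val,ts> ships the large val later.
    for r in remote_replicas:
        store(r, key, val, ts)
```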

SLIDE 24

Write comparison

DIAGRAM: traditional write and Vivace write, side by side.

Traditional algorithm: 1 remote RTT. Vivace algorithm: 1 prioritized remote RTT + 1 local RTT.

SLIDE 25

Vivace: read

DIAGRAM: phase 1, the client sends prioritized <R-TS,key> to the remote replicas; it only asks for ts.
SLIDE 26

Vivace: read

DIAGRAM: phase 1 continued, the replicas reply <ACK-R-TS,ts>. Small message.

SLIDE 27

Vivace: read

DIAGRAM: phase 2, the client sends <R-DATA,key,ts>, asking for the data with the largest ts.

SLIDE 28

Vivace: read

DIAGRAM: phase 2 continued, a replica replies <ACK-R-DATA,val>. Large val, but the client waits for only one reply (common case: local).

SLIDE 29

Vivace: read

DIAGRAM: phase 3, the client sends the prioritized write-back <W-TS,key,ts>; it carries only the small ts.

SLIDE 30

Vivace: read

DIAGRAM: phase 3 continued, the replicas reply <ACK-W-TS>.

SLIDE 31

Vivace: read

DIAGRAM: after a quorum of <ACK-W-TS> replies in phase 3, the read is done.
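
And a matching sketch of the Vivace read from slides 25-31, continuing the simulation above (it reuses majority(), store(), and vivace_write() from the write sketch). The helper names are mine, not the paper's.

```python
# Sketch of the Vivace read (slides 25-31), reusing majority() and store()
# from the write sketch above.

def get_ts(replica, key):
    # Phase 1: <R-TS,key> / <ACK-R-TS,ts>; only the small ts crosses the WAN.
    return replica.get(key, (None, -1))[1]

def get_val(replica, key, ts):
    # Phase 2: <R-DATA,key,ts> / <ACK-R-DATA,val>; in this simulation a replica
    # answers only if it holds a value at least as new as ts.
    val, cur_ts = replica.get(key, (None, -1))
    return val if cur_ts >= ts and val is not None else None

def vivace_read(local_replicas, remote_replicas, key):
    n = len(remote_replicas)
    # Phase 1 (prioritized): learn the largest ts from a quorum of replicas.
    ts = max(get_ts(r, key) for r in remote_replicas[:majority(n)])
    # Phase 2: fetch the large val for that ts; one reply is enough, and in the
    # common case it comes from a local replica.
    val = None
    for r in local_replicas + remote_replicas:
        val = get_val(r, key, ts)
        if val is not None:
            break
    # Phase 3 (prioritized): write back only the small ts to a quorum, which is
    # enough to keep the read linearizable.
    for r in remote_replicas[:majority(n)]:
        store(r, key, None, ts, prioritized=True)
    return val

local = [dict(), dict(), dict()]
remote = [dict(), dict(), dict()]
vivace_write(local, remote, "x", b"large value", ts=1)
print(vivace_read(local, remote, "x"))         # -> b'large value'
```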

SLIDE 32

Read comparison

DIAGRAM: traditional read and Vivace read, side by side.

Traditional algorithm: 2 remote RTTs. Vivace algorithm: 2 prioritized remote RTTs + 1 local RTT.

SLIDE 33

Evaluation topics

  • Practical prioritization setup
  • Delay with congestion
– KV-store operations
– Twitter-clone web app operations
  • Delay without congestion
– Overhead of Vivace algorithms compared to traditional algorithms

SLIDE 34

Evaluation setup

  • Local cluster <-> Amazon EC2 Ireland
  • DSCP bit prioritization on local router’s egress port
  • Congestion generated with iperf

DIAGRAM: local cluster (Illinois) <-> Amazon EC2 (Ireland). Prioritization is applied only here, on the local side.
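
As one concrete way to picture the setup, the sketch below marks a socket's traffic with a DSCP value so that a router configured for DSCP-based priority queuing (as on the local router's egress port here) can favor it. The DSCP class (EF, 46) and the commented-out endpoint are assumptions for illustration, not the exact values used in this evaluation.

```python
# Sketch: tagging a socket with a DSCP value so a DSCP-aware router can
# prioritize its packets on the egress port. Assumes Linux and IPv4; the DSCP
# class EF (46) is an example, not necessarily the class used in the paper.
import socket

DSCP_EF = 46                        # "expedited forwarding" traffic class
TOS = DSCP_EF << 2                  # DSCP sits in the top 6 bits of the TOS byte

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS)
# sock.connect(("replica.example.org", 9090))   # hypothetical remote replica
# Small critical messages (e.g., W-TS, R-TS) would use this marked socket,
# while bulk value transfers (W-REMOTE) would use an unmarked one.
```
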
SLIDE 35

Evaluation

Does prioritization work in practice?

  • Simple ping experiment
  • Prioritized messages bypass congestion
  • Local router-based prioritization is effective

SLIDE 36

Evaluation

How well does Vivace perform under congestion?

FIGURE: KV-store operations under congestion: (a) read algorithms, (b) write algorithms, (c) state machine algorithms.
FIGURE: Twitter-clone operations under congestion: (a) post tweet, (b) read user timeline, (c) read friends timeline.

SLIDE 37

Evaluation

How well does Vivace perform under congestion?

FIGURE: (a) Read algorithms. The traditional algorithm (2 remote RTTs) suffers buffering delay and TCP resends on packet loss; the Vivace algorithm (2 prioritized remote RTTs + 1 local RTT) avoids congestion delays.

SLIDE 38

Evaluation

What is the overhead of Vivace without congestion?

  • (Results in paper)
  • No measurable overhead compared to traditional algorithms
  • Extra message phases are not harmful

SLIDE 39

Conclusion

  • Proposed two new algorithms
– Read/write (simple, in talk)
– State machine (more complex, in paper)
  • Both algorithms avoid delay due to congestion by prioritizing a small amount of critical information, while
– Still providing strong consistency
– Keeping prioritized messages small
– Avoiding delay overhead in absence of congestion
– Using a practical prioritization infrastructure
  • Careful use of prioritized messages can be an effective strategy in geo-distributed data centers
