SLIDE 1

Scalable Consistency in Scatter

A Distributed Key-Value Storage System

Lisa Glendenning, Ivan Beschastnikh, Arvind Krishnamurthy, Thomas Anderson
University of Washington

October 2011. Supported by NSF CNS-0963754.

SLIDE 2

Internet services depend on distributed key-value stores

[Figure: the consistency vs. scalability trade-off. Dynamo sits at the scalability end; Scatter targets both.]

SLIDE 3

Scatter: Goals

✓ linearizable consistency semantics
✓ scalable in a wide area network
✓ high availability
✓ performance close to existing systems

SLIDE 4

Scatter: Approach

Combine ideas from scalable peer-to-peer systems and consistent datacenter systems:

  • from peer-to-peer systems: distributed hash tables, self-organization, decentralization
  • from datacenter systems: consensus, replication, transactions

SLIDE 5

Distributed Hash Tables: Background

Core functionality: partition the key-space and assign keys to nodes.

System structure: links between nodes form an overlay; knowledge of system state is distributed among all nodes.

System management: nodes coordinate locally to respond to churn, e.g.,

  • give keys to new nodes
  • take over the keys of failed nodes
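A minimal sketch of this core functionality, assuming a Chord-style ring (the `ident` hashing scheme and `Ring` class are illustrative, not from the talk): each key is assigned to the first node clockwise of its hash on a circular identifier space.

```python
import hashlib
from bisect import bisect_right

def ident(name: str) -> int:
    """Hash a key or node name onto the circular identifier space."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16)

class Ring:
    """Each node owns the key-range between its predecessor and itself."""

    def __init__(self, node_names):
        # sort nodes by their position on the ring
        self.nodes = sorted((ident(n), n) for n in node_names)

    def owner(self, key: str) -> str:
        """First node clockwise from the key's hash, wrapping around."""
        i = bisect_right(self.nodes, (ident(key), ""))
        return self.nodes[i % len(self.nodes)][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.owner("some-key"))  # the same key always maps to the same node
```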

SLIDE 6

Distributed Hash Tables: Faults Cause Inconsistencies

Example: c JOINs between a and b, taking over the key-range (ka,kc]. The join updates:

  c.pred = a   c.succ = b
  a.succ = c   b.pred = c
  b.keys = (kc,kb]   c.keys = (ka,kc]

SLIDE 7

Distributed Hash Tables: Faults Cause Inconsistencies

what could go wrong?

FAULT → OUTCOME

  • communication fault between b and c → both b and c claim ownership of (ka,kc]
  • c fails during the operation → no node claims ownership of (ka,kc]
  • communication fault between a and c → routes through a skip over c
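These outcomes follow from the join being a sequence of separate messages. A hedged sketch (the `Node` class and step ordering are illustrative assumptions, not the exact protocol): a fault between any two steps leaves overlapping or orphaned key-ranges.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Node:
    ident: int
    keys: Tuple[int, int]          # half-open range (lo, hi] this node claims
    pred: Optional["Node"] = None
    succ: Optional["Node"] = None

def join(a: Node, c: Node, b: Node) -> None:
    """Ad-hoc, non-atomic join of c between a and b."""
    c.pred, c.succ = a, b          # c learns its neighbors
    c.keys = (a.ident, c.ident)    # c claims (ka, kc]
    a.succ = c                     # a now routes through c
    # If c fails here, a routes to a dead node: nobody serves (ka, kc].
    # If the next message to b is lost, b still claims (ka, kb], so
    # both b and c claim (ka, kc].
    b.pred = c
    b.keys = (c.ident, b.ident)    # b shrinks its claim to (kc, kb]

# a correct, fault-free run:
a = Node(10, (90, 10)); b = Node(50, (10, 50)); c = Node(30, (30, 30))
a.succ, b.pred = b, a
join(a, c, b)                      # c now claims (10, 30], b claims (30, 50]
```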

SLIDE 8

Distributed Hash Tables: Weak Atomicity Causes Anomalies

DHTs use ad hoc protocols to add and remove nodes. What happens if...

  • two nodes join at the same place at the same time
  • two adjacent nodes leave at the same time
  • during a node join the predecessor leaves
  • one node mistakenly thinks another node has failed

...


SLIDE 9

Scatter: Design Overview

How is Scatter different? It uses groups as building blocks instead of individual nodes.

What is a group? A set of nodes that cooperatively manage a key-range.

What does this give us?

  • nodes within a group act as a single entity
  • a group is much less likely to fail than an individual node
  • distributed transactions for operations involving multiple groups

SLIDE 10

Scatter: Group Anatomy

Example group: nodes = {a,b,c}, keys = (kz,kc], values = {...}

  • the group replicates all state among its members with Paxos
  • changes to group membership are Paxos reconfigurations:
      • include new nodes
      • exclude failed nodes
  • the key-range is further partitioned among the nodes of the group for performance:
      a.keys = (kz,ka]   b.keys = (ka,kb]   c.keys = (kb,kc]
  • each node orders client operations on its keys
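A minimal sketch of this anatomy (the `GroupState` type and its field names are assumptions for illustration, not Scatter's actual data structures): the whole record is what Paxos replicates on every member, while the per-node sub-ranges only determine which member orders a given client operation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class GroupState:
    """State machine replicated by Paxos on every group member."""
    members: List[str]                      # e.g. ["a", "b", "c"]
    key_range: Tuple[int, int]              # half-open range (kz, kc]
    values: Dict[str, str] = field(default_factory=dict)
    # performance optimization: each member is the primary for a sub-range
    assignments: Dict[str, Tuple[int, int]] = field(default_factory=dict)

    def primary_for(self, key_hash: int) -> str:
        """The member that orders client operations on this key."""
        for node, (lo, hi) in self.assignments.items():
            if lo < key_hash <= hi:
                return node
        raise KeyError("key outside this group's range")

g = GroupState(members=["a", "b", "c"], key_range=(0, 300),
               assignments={"a": (0, 100), "b": (100, 200), "c": (200, 300)})
print(g.primary_for(150))  # -> "b"
```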

SLIDE 11

Scatter: Self-Reorganization

Some problems can't be handled within a single group:

  • small groups are at risk of failing
  • large groups are slow
  • load imbalance across groups

Multi-group operations, implemented as distributed transactions coordinated locally by groups:

  • MERGE: merge two small groups into one
  • SPLIT: split one large group into two
  • rebalance keys and nodes between groups

SLIDE 12

Example: Group Split

The split of group b runs as a two-phase commit (2PC), with every step agreed upon by consensus inside the participating groups:

  1. "split?" — a split of group b is proposed; the members of b vote "ok!" by consensus.
  2. "split b?" — neighboring groups a and c are asked to participate; each votes "ok!" within its own group.
  3. "split!" — with all votes in, the transaction is committed.
  4. RECONFIGURE! — b reconfigures into two new groups, b1 and b2, which divide b's key-range and nodes between them.
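A hedged sketch of this flow (the `Group` class, the `paxos_agree` stand-in, and the even member split are illustrative assumptions, not Scatter's implementation): each 2PC vote and commit is itself a consensus decision inside a group.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Group:
    name: str
    members: List[str]

    def paxos_agree(self, proposal: str) -> bool:
        """Stand-in for intra-group consensus: in a real system this is
        a Paxos round among self.members; here every proposal succeeds."""
        print(f"group {self.name}: {self.members} agree on {proposal!r}")
        return True

def split(coordinator: Group, left: Group, right: Group) -> bool:
    """2PC across groups: the splitting group coordinates; its neighbors
    participate so that adjacency pointers stay consistent."""
    participants = [coordinator, left, right]
    # phase 1: prepare -- every group votes via its own consensus round
    if not all(g.paxos_agree(f"split {coordinator.name}?") for g in participants):
        return False  # any "no" vote aborts the transaction
    # phase 2: commit -- the decision is likewise replicated in each group
    for g in participants:
        g.paxos_agree(f"split {coordinator.name}!")
    # the coordinator reconfigures into two new groups
    half = len(coordinator.members) // 2
    b1 = Group(coordinator.name + "1", coordinator.members[:half])
    b2 = Group(coordinator.name + "2", coordinator.members[half:])
    print(f"RECONFIGURE! {coordinator.name} -> {b1.name}, {b2.name}")
    return True

split(Group("b", ["b.1", "b.2", "b.3"]), Group("a", ["a.1"]), Group("c", ["c.1"]))
```

Because each vote is itself replicated, a participating group never forgets its decision when individual nodes fail, which is exactly what the ad hoc DHT join protocols earlier could not guarantee.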

SLIDE 20

Scatter

✓ linearizable consistency semantics
✓ scalable in a wide area network
✓ high availability
✓ performance close to existing systems

...local operations
...replication, reconfiguration
...group consensus, transactions
...key partitioning, optimizations

SLIDE 21

Evaluation: Overview

Questions:

  1. How robust is Scatter in a high-churn peer-to-peer environment?
  2. How does Scatter adapt to a dynamic workload in a datacenter environment?

Comparisons:

  Environment  | Comparison System
  P2P          | OpenDHT
  Datacenter   | ZooKeeper

SLIDE 22

Comparison: OpenDHT

Layered OpenDHT's recursive routing on top of Scatter groups. Implemented a Twitter-like application, Chirp.

Experimental Setup:

  • 840 PlanetLab nodes
  • injected node churn at varying rates
  • Twitter traces as a workload
  • tweets and social network stored in DHT


SLIDE 23

Comparison: OpenDHT

Scatter has zero inconsistencies and high availability even under churn.

[Two plots, Consistency and Availability: consistent fetches (%) and completed fetches (%) vs. node lifetime (seconds), comparing Scatter and OpenDHT.]

SLIDE 24

Comparison: OpenDHT

Scalable consistency is cheap: Scatter's fetch latency stays within 10-12% of OpenDHT's.

[Plot, Latency: fetch latency (ms) vs. node lifetime (seconds), Scatter vs. OpenDHT.]

SLIDE 25

Comparison: Replicated ZooKeeper

ZooKeeper: a small-scale, centralized coordination service.

Replicated ZooKeeper: the global key-space statically partitioned across multiple, isolated ZooKeeper instantiations (Z1, Z2, Z3).

Experimental Setup:

  • testbed: Emulab
  • varied total number of nodes
  • no churn
  • same Chirp workload
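A minimal sketch of this baseline's routing (the split points, ensemble names, and first-character scheme are illustrative assumptions): keys map to a fixed ZooKeeper ensemble chosen at deployment time, so unlike Scatter's groups, the partitions can never split, merge, or rebalance under load.

```python
from bisect import bisect_left

RANGE_UPPER_BOUNDS = ["h", "p", "~"]   # static split points, fixed at deployment
ENSEMBLES = ["Z1", "Z2", "Z3"]         # one isolated ZooKeeper ensemble per range

def ensemble_for(key: str) -> str:
    """Route a key to the ensemble owning its static range."""
    return ENSEMBLES[bisect_left(RANGE_UPPER_BOUNDS, key[:1])]

assert ensemble_for("alice") == "Z1"
assert ensemble_for("mallory") == "Z2"
assert ensemble_for("zed") == "Z3"
```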

SLIDE 26

Comparison: Replicated ZooKeeper

Dynamic partitioning adapts to changes in workload.

[Plot, Scalability: throughput (1,000 ops/sec) vs. total number of nodes, Scatter vs. replicated ZooKeeper.]

SLIDE 27

Scatter: Summary

✓ consensus groups of nodes as fault-tolerant building blocks
✓ distributed transactions across groups to repartition the global key-space
✓ evaluation against OpenDHT and ZooKeeper shows strict consistency, linear scalability, and high availability