slide-1
SLIDE 1

Dynamo

slide-2
SLIDE 2

Dynamo motivation

Fast, available writes

  • Shopping cart: always enable purchases

FLP: consistency and progress at odds

  • Paxos: must communicate with a quorum

Performance: strict consistency = “single” copy

  • Updates serialized to single copy
  • Or, single copy moves
slide-3
SLIDE 3

Why Fast Available Writes?

  • Amazon study: 100 ms increase in response time => 5% reduction in revenue
  • Similar results at other e-commerce sites
  • 99.99% availability => less than an hour of outage per year (total)
  • Amazon revenue > $300K/minute

slide-4
SLIDE 4

Dynamo motivation

Dynamo goals

  • Expose “as much consistency as possible”
  • Good latency, 99.9% of the time
  • Easy scalability
slide-5
SLIDE 5

Dynamo consistency

Eventual consistency

  • Can have stale reads
  • Can have multiple “latest” versions
  • Reads can return multiple values

Not sequentially consistent

  • Can’t “defriend and dis”
slide-6
SLIDE 6

External interface

get : key -> ([value], context)

  • Exposes inconsistency: can return multiple values
  • context is opaque to user (set of vector clocks)

put : (key, value, context) -> void

  • Caller passes context from previous get

Example: add to cart

  (carts, context) = get("cart-" + uid)
  cart = merge(carts)
  cart = add(cart, item)
  put("cart-" + uid, cart, context)

slide-7
SLIDE 7

Resolving conflicts in application

Applications can choose how to handle inconsistency (merge functions sketched below):

  • Shopping cart: take union of cart versions
  • User sessions: take most recent session
  • High score list: take maximum score

Default: highest timestamp wins

Context used to record causal relationships between gets and puts

  • Once inconsistency resolved, should stay resolved
  • Implemented using vector clocks
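A minimal sketch of what such application-level merge functions might look like, in Python; the function names and record fields are illustrative assumptions, not Dynamo's API:

  # Illustrative merge policies for the conflicting versions returned by get().

  def merge_carts(carts):
      # Shopping cart: take the union of all cart versions.
      merged = set()
      for cart in carts:
          merged |= set(cart)
      return merged

  def merge_sessions(sessions):
      # User sessions: keep the most recently active session
      # (assumes a "last_active" field on each version).
      return max(sessions, key=lambda s: s["last_active"])

  def merge_high_scores(scores):
      # High score list: take the maximum score.
      return max(scores)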
slide-8
SLIDE 8

Dynamo’s vector clocks

Each object associated with a vector clock

  • e.g., [(node1, 0), (node2, 1)]

Each write has a coordinator, and is replicated to multiple other nodes

  • In an eventually consistent manner

Nodes in vector clock are coordinators

slide-9
SLIDE 9

Dynamo’s vector clocks

Client sends its clock with put (as context)

Coordinator increments its own index in the clock, then replicates across nodes

Nodes keep objects with conflicting vector clocks

  • These are then returned on subsequent gets

If clock(v1) < clock(v2), node deletes v1

slide-10
SLIDE 10

Dynamo Vector Clocks

Vector clock returned as context with get

  • Merge of all returned objects’ clocks

Used to detect inconsistencies on write
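A rough sketch of this vector-clock bookkeeping, with clocks represented as Python dicts mapping coordinator name to counter (illustrative only, not Dynamo's implementation):

  # Vector clocks as {coordinator: counter}; counters start at 0 as in the slides.

  def descends(a, b):
      # True if clock a dominates (or equals) clock b, i.e., a is at least as new.
      return all(a.get(node, -1) >= c for node, c in b.items())

  def concurrent(a, b):
      # Neither clock dominates the other: the versions conflict and both are kept.
      return not descends(a, b) and not descends(b, a)

  def merge(clocks):
      # The context returned by get(): element-wise max of all versions' clocks.
      merged = {}
      for clock in clocks:
          for node, c in clock.items():
              merged[node] = max(merged.get(node, -1), c)
      return merged

  def coordinate_write(context, coordinator):
      # The coordinator increments its own entry in the client-supplied context.
      clock = dict(context)
      clock[coordinator] = clock.get(coordinator, -1) + 1
      return clock

For example, in the trace that follows, coordinate_write({"node1": 1, "node3": 0}, "node1") yields {"node1": 2, "node3": 0}, which descends both conflicting versions.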

SLIDES 11-38

Vector clocks in action (example trace)

Three replicas (node1, node2, node3) and one client; initially every node holds “1” @ [(node1, 0)].

  • get(): the client reads and receives [“1”] with context [(node1, 0)]
  • put(“2”, [(node1, 0)]): node1 coordinates, increments its own entry, and stores “2” @ [(node1, 1)]; the new version replicates to one other node, while the third still holds “1” @ [(node1, 0)]
  • put(“3”, [(node1, 0)]): issued with the stale context; node3 (still holding “1”) coordinates and stores “3” @ [(node1, 0), (node3, 0)], which is concurrent with “2” @ [(node1, 1)]; as it replicates, a node that already holds “2” keeps both conflicting versions
  • get(): returns both values [“2”, “3”] with merged context [(node1, 1), (node3, 0)]; the client must now run merge!
  • put(“3”, [(node1, 1), (node3, 0)]): node1 coordinates and stores “3” @ [(node1, 2), (node3, 0)], which dominates both earlier versions
  • The merged version replicates until all three nodes hold “3” @ [(node1, 2), (node3, 0)]

slide-39
SLIDE 39

Where does each key live?

Goals:

  • Balance load, even as servers join and leave
  • Encourage put/get to see each other
  • Avoid conflicting versions

Solution: consistent hashing

slide-40
SLIDE 40

Detour: Consistent hashing

Node ids are hashed to many pseudorandom points on a circle

Keys are hashed onto the circle and assigned to the “next” node (a sketch follows the list below)

Idea used widely:

  • Developed for Akamai CDN
  • Used in Chord distributed hash table
  • Used in Dynamo distributed DB
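A minimal sketch of that lookup in Python (illustrative; it places one point per node for simplicity, while real systems and Proposal 4 below use many points per node):

  import bisect
  import hashlib

  def ring_hash(s):
      # Map a string to a point on the 2^32 circle.
      return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

  class Ring:
      def __init__(self, nodes):
          # Hash each node id to a point; keep points sorted for binary search.
          self.points = sorted((ring_hash(n), n) for n in nodes)

      def lookup(self, key):
          # A key belongs to the first node at or after its hash, wrapping around.
          i = bisect.bisect_left(self.points, (ring_hash(key), ""))
          return self.points[i % len(self.points)][1]

  ring = Ring(["cache1", "cache2", "cache3"])
  print(ring.lookup("a"), ring.lookup("b"))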
slide-41
SLIDE 41

Scaling Systems: Shards

Distribute portions of your dataset to various groups of nodes

Question: how do we allocate a data item to a shard?

slide-42
SLIDE 42

Replicated, Sharded Database

(Figure: several shard groups, each a state machine replicated with Paxos)

Which keys are where?

slide-43
SLIDE 43

Lab 4 (and other systems)

(Figure: several Paxos-replicated shard groups, plus a Paxos-replicated shard master)

slide-44
SLIDE 44

Replicated, Sharded Database

Shard master decides

  • which group has which keys

Shards operate independently

How do clients know who has what keys?

  • Ask shard master? Becomes the bottleneck!

Avoid shard master communication if possible

  • Can clients predict which group has which keys?
slide-45
SLIDE 45

Recurring Problem

Client needs to access some resource that is sharded for scalability

How does the client find the specific server to use?

Central redirection won't scale!

SLIDES 46-51

Another scenario

A client issues GET index.html; the returned page links to logo.jpg, jquery.js, … Those objects are then fetched (GET logo.jpg, GET jquery.js, …) from a pool of cache servers (Cache 1, Cache 2, Cache 3). A second client requesting the same objects should reach the same caches.

slide-52
SLIDE 52

Other Examples

  • Scalable shopping cart service
  • Scalable email service
  • Scalable cache layer (Memcache)
  • Scalable network path allocation
  • Scalable network function virtualization (NFV)
  • …

slide-53
SLIDE 53

What’s in common?

Want to assign keys to servers w/o communication

Requirement 1: clients all have same assignment

SLIDES 54-56

Proposal 1

For n nodes, a key k goes to node k mod n

Problems with this approach?

  • Likely to have distribution issues

(Figure: Cache 1 holds “a”, “d”, “ab”; Cache 2 holds “b”; Cache 3 holds “c”)

slide-57
SLIDE 57

Requirements, revisited

Requirement 1: clients all have same assignment

Requirement 2: keys uniformly distributed

SLIDES 58-62

Proposal 2: Hashing

For n nodes, a key k goes to node hash(k) mod n

  • Hash distributes keys uniformly
  • But, new problem: what if we add a node?
  • Redistribute a lot of keys! (on average, all but K/n; see the small demo below)

(Figure: with three caches, h(“a”)=1, h(“abc”)=2, h(“b”)=3; after Cache 4 is added, h(“a”)=3 and h(“b”)=4, so “a” and “b” both move)
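A quick illustration of how much data this reshuffles; the key set and hash are assumed, but the effect matches the "all but K/n" point above:

  import hashlib

  def h(key):
      return int(hashlib.md5(str(key).encode()).hexdigest(), 16)

  K = 10000
  moved = sum(1 for k in range(K) if h(k) % 3 != h(k) % 4)
  print(moved, "of", K, "keys move")   # roughly 3/4 of all keys when n goes 3 -> 4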

slide-63
SLIDE 63

Requirements, revisited

Requirement 1: clients all have same assignment

Requirement 2: keys uniformly distributed

Requirement 3: can add/remove nodes w/o redistributing too many keys

SLIDES 64-76

Proposal 3: Consistent Hashing

First, hash the node ids

Keys are hashed, and each key goes to the “next” node

(Figure: a circle of size 2^32; Cache 1, Cache 2, and Cache 3 sit at hash(1), hash(2), and hash(3); keys “a” and “b” are hashed onto the circle and each is assigned to the next cache after its hash point)

SLIDES 77-84

Proposal 3: Consistent Hashing, adding a node

What if we add a node? Cache 4 joins the ring.

  • Only “b” has to move!
  • On average, K/n keys move, but all of them move between two nodes (the new node and its neighbor)

slide-85
SLIDE 85

Requirements, revisited

Requirement 1: clients all have same assignment

Requirement 2: keys evenly distributed

Requirement 3: can add/remove nodes w/o redistributing too many keys

Requirement 4: parcel out work of redistributing keys

SLIDES 86-93

Proposal 4: Virtual Nodes

First, hash the node ids to multiple locations

(Figure: the 2^32 circle with several points labeled 1 for Cache 1, several labeled 2 for Cache 2, and so on)

As it turns out, hash functions come in families s.t. their members are independent. So this is easy!
Keys are more evenly distributed, and migration is evenly spread out (sketch below).
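A sketch of the virtual-node variant (illustrative only): each node id is hashed to several positions by appending a replica index.

  import bisect
  import hashlib

  def ring_hash(s):
      return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

  class VirtualRing:
      def __init__(self, nodes, vnodes=100):
          # Hash each node id to `vnodes` pseudorandom points on the circle.
          self.points = sorted((ring_hash(f"{n}#{i}"), n)
                               for n in nodes for i in range(vnodes))

      def lookup(self, key):
          # A key belongs to the first virtual node at or after its hash.
          i = bisect.bisect_left(self.points, (ring_hash(key), ""))
          return self.points[i % len(self.points)][1]

  vring = VirtualRing(["cache1", "cache2", "cache3"])
  print(vring.lookup("a"))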

slide-94
SLIDE 94

Requirements, revisited

Requirement 1: clients all have same assignment

Requirement 2: keys evenly distributed

Requirement 3: can add/remove nodes w/o redistributing too many keys

Requirement 4: parcel out work of redistributing keys

slide-95
SLIDE 95

Load Balancing At Scale

Suppose you have N servers

Using consistent hashing with virtual nodes:

  • heaviest server has x% more load than the average
  • lightest server has x% less load than the average

What is peak load of the system?

  • N * load of average machine? No!

Need to minimize x
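A back-of-the-envelope example with made-up numbers: the system saturates when its most loaded server does, so the imbalance x directly cuts usable peak throughput.

  N = 100            # servers
  per_server = 1000  # requests/sec each server can handle (assumed)
  x = 0.10           # heaviest server carries 10% more than the average

  # The heaviest server hits capacity when the average load is per_server / (1 + x),
  # so the system's peak is below the naive N * per_server.
  peak = N * per_server / (1 + x)
  print(peak)        # ~90,909 req/s instead of 100,000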

slide-96
SLIDE 96

Key Popularity

  • What if some keys are more popular than others?
  • Consistent hashing is no longer load balanced!
  • One model for popularity is the Zipf distribution
  • Popularity of the kth most popular item ∝ 1/k^c, with 1 < c < 2 (sketched below)
  • Ex: 1, 1/2, 1/3, … 1/100 … 1/1000 … 1/10000
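A small sketch of the model (the exponent c is a parameter; the example row above corresponds to c = 1):

  def zipf_popularity(k, c=1.0):
      # Relative popularity of the k-th most popular item.
      return 1.0 / (k ** c)

  print([round(zipf_popularity(k), 3) for k in (1, 2, 3, 100, 1000)])
  # [1.0, 0.5, 0.333, 0.01, 0.001]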
slide-97
SLIDE 97

Zipf “Heavy Tail” Distribution

slide-98
SLIDE 98

Zipf Examples

  • Web pages
  • Movies
  • Library books
  • Words in text
  • Salaries
  • City population
  • Twitter followers

Whenever popularity is self-reinforcing

slide-99
SLIDE 99

Proposal 5: Table Indirection

Consistent hashing is (mostly) stateless

  • Given the list of servers and # of virtual nodes, a client can locate any key
  • Worst case unbalanced, especially with Zipf popularity

Add a small table on each client (sketched below)

  • Table maps: virtual node -> server
  • Shard master reassigns table entries to balance load
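A minimal sketch of the indirection (names assumed, not Dynamo's API): clients still hash keys to virtual nodes, but a small shard-master-managed table maps virtual nodes to servers, so hot virtual nodes can be reassigned without changing the hash.

  import hashlib

  NUM_VNODES = 64

  # Small per-client table, assigned and occasionally rebalanced by the shard
  # master; here it is just a static, assumed assignment.
  vnode_to_server = {v: f"cache{v % 3 + 1}" for v in range(NUM_VNODES)}

  def lookup(key):
      vnode = int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_VNODES
      return vnode_to_server[vnode]

  print(lookup("logo.jpg"))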
slide-100
SLIDE 100

Consistent hashing in Dynamo

Each key has a “preference list”: the next nodes around the circle (sketched below)

  • Skip duplicate virtual nodes
  • Ensure list spans data centers

Slightly more complex:

  • Dynamo ensures keys evenly distributed
  • Nodes choose “tokens” (positions in the ring) when joining the system

  • Tokens used to route requests
  • Each token = equal fraction of the keyspace
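A sketch of building a preference list under these rules (illustrative; a real deployment would also make the list span data centers): walk the ring from the key's position and collect the next N distinct physical nodes, skipping duplicate virtual nodes.

  import bisect
  import hashlib

  def ring_hash(s):
      return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

  def preference_list(key, points, n):
      # points: sorted list of (position, physical_node), one entry per virtual node.
      start = bisect.bisect_left(points, (ring_hash(key), ""))
      prefs = []
      for i in range(len(points)):
          node = points[(start + i) % len(points)][1]
          if node not in prefs:          # skip duplicate virtual nodes
              prefs.append(node)
          if len(prefs) == n:
              break
      return prefs

  points = sorted((ring_hash(f"cache{j}#{i}"), f"cache{j}")
                  for j in (1, 2, 3) for i in range(8))
  print(preference_list("cart-42", points, n=3))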
slide-101
SLIDE 101

Replication in Dynamo

Three parameters: N, R, W

  • N: number of nodes each key replicated on
  • R: number of nodes participating in each read
  • W: number of nodes participating in each write

Data replicated onto first N live nodes in pref list

  • But respond to the client after contacting W

Reads see values from R nodes

Common config: (N, R, W) = (3, 2, 2)

slide-102
SLIDE 102

Sloppy quorum

Never block waiting for unreachable nodes

  • Try next node in list!

Want get to see most recent put (as often as possible)

Quorum: R + W > N (illustrated below)

  • Don’t wait for all N
  • R and W will (usually) overlap

Nodes ping each other

  • Each has independent opinion of up/down

“Sloppy” quorum—nodes can disagree about which nodes are running
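A tiny check (not Dynamo code) of why R + W > N forces every read quorum to overlap every write quorum:

  from itertools import combinations

  def always_overlap(n, r, w):
      # Does every write set of size w intersect every read set of size r?
      nodes = range(n)
      return all(set(ws) & set(rs)
                 for ws in combinations(nodes, w)
                 for rs in combinations(nodes, r))

  for n, r, w in [(3, 2, 2), (3, 1, 3), (3, 1, 1)]:
      print((n, r, w), always_overlap(n, r, w))   # True, True, False

With a sloppy quorum, the N nodes actually used can drift under failures, which is why the overlap is only "usually" guaranteed.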

slide-103
SLIDE 103

Replication in Dynamo

Coordinator (or client) sends each request (put or get) to first N reachable nodes in pref list

  • Wait for R replies (for read) or W replies (for write)

Normal operation: gets see all recent versions

Failures/delays:

  • Writes still complete quickly
  • Reads eventually see the latest version(s)
slide-104
SLIDE 104

Ensuring eventual consistency

What if puts end up far away from first N?

  • Could happen if some nodes are temporarily unreachable
  • Server remembers a “hint” about the proper location
  • Once reachability is restored, it forwards the data

Nodes periodically sync whole DB

  • Fast comparisons using Merkle trees (toy sketch below)
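A toy sketch of the Merkle-tree idea (illustrative only): hash the values into a tree so two replicas can detect differences by comparing a few hashes instead of the whole DB.

  import hashlib

  def h(data):
      return hashlib.sha256(data).hexdigest()

  def merkle_root(values):
      # Leaves hash individual values; each parent hashes its two children.
      level = [h(v.encode()) for v in values]
      while len(level) > 1:
          if len(level) % 2:                    # duplicate the last node if odd
              level.append(level[-1])
          level = [h((level[i] + level[i + 1]).encode())
                   for i in range(0, len(level), 2)]
      return level[0]

  replica_a = ["v1", "v2", "v3", "v4"]
  replica_b = ["v1", "v2", "v3", "v5"]
  print(merkle_root(replica_a) == merkle_root(replica_b))   # False: replicas differ, sync needed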
slide-105
SLIDE 105

Dynamo deployments

~100 nodes each

One instance per service (parameters are global)

How to extend to multiple apps? Different apps use different (N, R, W):

  • Pretty fast, pretty durable: (3, 2, 2)
  • Many reads, few writes: (3, 1, 3) or (N, 1, N)
  • (3, 3, 3)?
  • (3, 1, 1)?
slide-106
SLIDE 106

Dynamo results

Average latency is much faster than the 99.9th percentile

  • But the 99.9th percentile is acceptable

Inconsistencies rare in practice

  • Allow inconsistency, but minimize it
slide-107
SLIDE 107

Dynamo Revisited

Implemented as a library, not as a service

  • Each service (e.g., shopping cart) instantiated its own Dynamo instance

When an inconsistency happens:

  • Is it a problem in Dynamo?
  • Is it an intended side effect of Dynamo’s design?

Every service runs its own ops => every service needs to be an expert at sloppy quorum

slide-108
SLIDE 108

Dynamo DB

Replaced Dynamo the library with DynamoDB the service

DynamoDB: a strictly consistent key-value store

  • validated with TLA+ and model checking
  • eventually consistent as an option
  • (afaik) no multikey transactions?

Dynamo is eventually consistent; Amazon is eventually strictly consistent!

slide-109
SLIDE 109

Discussion

Why is symmetry valuable? Do seeds break it?

Dynamo and SOA (service-oriented architecture)

  • What about malicious/buggy clients?

Issues with hot keys?

Transactions and strict consistency

  • Why were transactions implemented at Google and not at Amazon?
  • Do Amazon’s programmers not want strict consistency?

slide-110
SLIDE 110