SLIDE 1

Dynamo, Five Years Later

Andy Gross, Chief Architect, Basho Technologies
QCon London 2013

SLIDE 2

Dynamo

  • Published October 2007 at SOSP
  • Describes a collection of distributed systems techniques applied to low-latency key-value storage
  • Spawned (along with BigTable) many imitators and an industry (LinkedIn -> Voldemort, Facebook -> Cassandra)
  • The authors nearly got fired from Amazon for publishing it

SLIDE 3

Riak - A Dynamo Clone

  • First lines of the first prototype written in fall 2007, on a plane on the way to my Basho interview
  • “Technical debt” is another term we use at Basho for this code
  • Mostly Erlang, with some C/C++
  • Apache 2 licensed
  • First release in 2009; 1.3 released 2/21/13

SLIDE 4

Basho

SLIDE 5

Basho

  • Founded in late 2007 by ex-Akamai people
  • Currently ~120 employees, distributed, with offices in Cambridge, San Francisco, London, and Tokyo
  • We sponsor Riak open source
  • We sell Riak Enterprise (Riak + multi-DC replication)
  • We sell Riak CS (an S3 clone backed by Riak Enterprise)

SLIDE 6

Principles

  • Always writable
  • Incrementally scalable
  • Symmetrical
  • Decentralized
  • Heterogeneous
  • Focus on SLAs and tail latency

SLIDE 7

Techniques

  • Consistent Hashing
  • Vector Clocks
  • Read Repair
  • Anti-Entropy
  • Hinted Handoff
  • Gossip Protocol

SLIDE 8

Consistent Hashing

  • Invented by Danny Lewin and others at MIT/Akamai
  • Minimizes remapping of keys when the number of hash slots changes
  • Originally applied to CDNs; used in Dynamo for replica placement
  • Enables incremental scalability and even spread
  • Minimizes hot spots
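A minimal sketch of the placement idea in Erlang (module and function names here are hypothetical; Riak's real ring logic lives in riak_core and differs in detail): hash the key onto a fixed number of partitions and take the next N partitions clockwise as the preference list.

    %% Hypothetical sketch of consistent hashing for replica placement.
    -module(ring_sketch).
    -export([preflist/3]).

    %% Map a {Bucket, Key} pair onto one of RingSize equal-width partitions.
    %% In Riak the partition count is fixed up front; scaling happens by
    %% reassigning whole partitions to new nodes, so most keys never move.
    partition(BKey, RingSize) ->
        <<HashInt:160/integer>> = crypto:hash(sha, term_to_binary(BKey)),
        HashInt rem RingSize.

    %% The preference list: the owning partition plus the next N-1 partitions
    %% clockwise around the ring, each ideally owned by a distinct node.
    preflist(BKey, RingSize, N) ->
        First = partition(BKey, RingSize),
        [(First + I) rem RingSize || I <- lists:seq(0, N - 1)].

ring_sketch:preflist({<<"blocks">>, Key}, 64, 3), for example, picks three adjacent partitions out of 64 for a given key.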

SLIDE 10

Vector Clocks

  • Introduced by Mattern et al. in 1988
  • Extends Lamport’s timestamps (1978)
  • Each value in Dynamo is tagged with a vector clock
  • Allows detection of stale values and logical siblings
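A minimal vector clock sketch, assuming a clock is just a list of {ActorId, Counter} pairs (the real vclock module in riak_core also carries per-entry timestamps for pruning):

    %% Hypothetical vector clock sketch.
    -module(vclock_sketch).
    -export([increment/2, descends/2, concurrent/2]).

    %% Bump this actor's counter, adding the actor on first use.
    increment(Actor, Clock) ->
        Count = proplists:get_value(Actor, Clock, 0),
        lists:keystore(Actor, 1, Clock, {Actor, Count + 1}).

    %% A descends B when A has seen at least everything B has seen.
    descends(A, B) ->
        lists:all(fun({Actor, CountB}) ->
                      proplists:get_value(Actor, A, 0) >= CountB
                  end, B).

    %% Neither descends the other: the two values are logical siblings.
    concurrent(A, B) ->
        (not descends(A, B)) andalso (not descends(B, A)).

A replica returning a clock that the coordinator's clock descends is stale; two clocks for which concurrent/2 is true are siblings that need resolution.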

SLIDE 11

Read Repair

  • Update stale versions opportunistically on reads (instead of writes)
  • Pushes the system toward consistency after returning the value to the client
  • Reflects the focus on a cheap, always-available write path
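A sketch of the repair decision, reusing the hypothetical vclock_sketch module above; Replies stands in for whatever {Vnode, Clock, Value} triples the coordinator collected during the get (sibling handling omitted):

    %% Hypothetical read-repair pass over the replies collected during a get.
    -module(read_repair_sketch).
    -export([maybe_repair/1]).

    %% Replies :: [{VnodeId, Clock, Value}]
    maybe_repair([First | Rest] = Replies) ->
        %% Choose a reply that descends the others.
        Winner = lists:foldl(fun({_, C, _} = R, {_, WC, _} = W) ->
                                 case vclock_sketch:descends(C, WC) of
                                     true  -> R;
                                     false -> W
                                 end
                             end, First, Rest),
        {_, WinClock, WinValue} = Winner,
        %% Any replica whose clock the winner strictly descends is stale.
        Stale = [V || {V, C, _} <- Replies,
                      C =/= WinClock,
                      vclock_sketch:descends(WinClock, C)],
        {WinValue, Stale}.

The winning value goes back to the client immediately; the vnodes in Stale get the newer value written back asynchronously.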

SLIDE 12

Hinted Handoff

  • Any node can accept writes for other nodes if they’re down
  • All messages include a destination
  • Data accepted by a node other than its destination is handed off when the destination node recovers
  • As long as a single node is alive, the cluster can accept a write
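A hypothetical routing sketch: if a primary in the preference list is down, send the write to a live node and record which node it was meant for, so the data can be handed back later. The fallback choice here is deliberately naive; real Riak walks the ring for the next live partition owner.

    %% Hypothetical hinted-handoff routing sketch.
    -module(handoff_sketch).
    -export([route/2]).

    %% Preflist :: [{Partition, PrimaryNode}], UpNodes :: [Node]
    route(Preflist, UpNodes) ->
        lists:map(
          fun({Partition, Primary}) ->
                  case lists:member(Primary, UpNodes) of
                      true ->
                          {Partition, Primary, no_hint};
                      false ->
                          %% Naive fallback choice; the hint records the node
                          %% the data really belongs to, so handoff can return
                          %% it when that node comes back.
                          {Partition, hd(UpNodes), {hint, Primary}}
                  end
          end, Preflist).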

SLIDE 13

Anti-Entropy

  • Replicas maintain a Merkle tree of keys and their versions/hashes
  • Trees are periodically exchanged with peer vnodes
  • The Merkle tree enables cheap comparison: only values with different hashes are exchanged
  • Pushes the system toward consistency
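A minimal sketch of the exchange, assuming a one-level "Merkle" structure: keys are grouped into buckets, each bucket gets a rolled-up hash, and only buckets whose hashes differ are compared key by key (Riak's real hashtree is deeper and persisted on disk):

    %% Hypothetical anti-entropy comparison sketch.
    -module(aae_sketch).
    -export([diff/2]).

    %% Tree :: #{BucketIx => #{Key => HashOfValue}}
    bucket_hash(Keys) ->
        crypto:hash(sha, term_to_binary(lists:sort(maps:to_list(Keys)))).

    diff(TreeA, TreeB) ->
        Buckets = lists:usort(maps:keys(TreeA) ++ maps:keys(TreeB)),
        lists:flatmap(
          fun(B) ->
                  KA = maps:get(B, TreeA, #{}),
                  KB = maps:get(B, TreeB, #{}),
                  case bucket_hash(KA) =:= bucket_hash(KB) of
                      true  -> [];                       %% cheap: skip whole bucket
                      false -> keys_that_differ(KA, KB)  %% only now look at keys
                  end
          end, Buckets).

    keys_that_differ(KA, KB) ->
        AllKeys = lists:usort(maps:keys(KA) ++ maps:keys(KB)),
        [K || K <- AllKeys, maps:get(K, KA, none) =/= maps:get(K, KB, none)].

The keys returned by diff/2 are the only ones whose values need to be read and exchanged between the two replicas.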

SLIDE 14

Gossip Protocol

  • A decentralized approach to managing global state
  • Trades off atomicity of state changes for decentralization
  • The volume of gossip can overwhelm networks without care
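A minimal gossip loop, assuming state is a map of Key => {Value, Version} and that a version-wins merge is good enough; Riak's ring gossip follows the same broad pattern:

    %% Hypothetical gossip sketch: periodically push local state to a random
    %% peer and merge whatever arrives from others.
    -module(gossip_sketch).
    -export([loop/2]).

    loop(State, Peers) ->
        receive
            {gossip, RemoteState} ->
                loop(merge(State, RemoteState), Peers)
        after 1000 ->
                Peer = lists:nth(rand:uniform(length(Peers)), Peers),
                Peer ! {gossip, State},
                loop(State, Peers)
        end.

    %% Keep the higher-versioned entry per key; the merge must be commutative
    %% and idempotent so every node converges regardless of message order.
    merge(A, B) ->
        maps:fold(fun(Key, {_, VerB} = EntryB, Acc) ->
                      case maps:find(Key, Acc) of
                          {ok, {_, VerA}} when VerA >= VerB -> Acc;
                          _ -> maps:put(Key, EntryB, Acc)
                      end
                  end, A, B).

Without damping (fixed fan-out, rate limits), this kind of periodic push is exactly what can overwhelm the network as the cluster grows.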


SLIDES 15-20

Hinted Handoff (diagram sequence)

  • Node fails
  • Requests go to fallback
  • Node comes back
  • “Handoff” - data returns to recovered node
  • Normal operations resume

[Diagrams: hash(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”) maps the key onto the ring; while the owning node is down, its partitions are served by a fallback node, and the data is handed back once the original node returns.]

SLIDES 21-31

Anatomy of a Request (diagram sequence)

get(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)

[Diagrams: the client sends the get to Riak, where a Get Handler (FSM) on the coordinating node computes hash(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”) == partitions 10, 11, 12 on the ring and forwards the get to those vnodes. With R=2, the FSM waits for two replies - here a stale v1 and a newer v2 - resolves them, and returns v2 to the client.]
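A hypothetical coordinator sketch of the R=2 behaviour in the diagrams: fan the get out to the whole preference list, answer the client once R replies are in, and let the stragglers feed read repair. The real get FSM in riak_kv has many more states, plus timeouts and sibling resolution.

    %% Hypothetical get coordinator sketch (fake_vnode_get/2 stands in for a
    %% real vnode read).
    -module(get_coordinator_sketch).
    -export([get/3]).

    get(BKey, Preflist, R) ->
        Ref = make_ref(),
        Caller = self(),
        [spawn(fun() -> Caller ! {Ref, Vnode, fake_vnode_get(Vnode, BKey)} end)
         || Vnode <- Preflist],
        wait_for_r(Ref, R, []).

    %% Return as soon as R replies have arrived; later replies are still
    %% useful, because comparing them is what triggers read repair.
    wait_for_r(_Ref, 0, Acc) ->
        {ok, Acc};
    wait_for_r(Ref, R, Acc) ->
        receive
            {Ref, Vnode, Reply} ->
                wait_for_r(Ref, R - 1, [{Vnode, Reply} | Acc])
        after 5000 ->
                {error, timeout}
        end.

    fake_vnode_get(Vnode, BKey) ->
        {value, {Vnode, BKey}}.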

SLIDES 32-35

Read Repair (diagram sequence)

[Diagrams: after the Get Handler (FSM) returns v2 to the client, one replica in the preference list still holds the stale v1. The coordinating node writes v2 back to that replica, after which all replicas hold v2.]

SLIDES 36-54

Riak Architecture (diagram build-up)

  • Erlang/OTP Runtime
  • Riak KV
  • Client APIs: HTTP, Protocol Buffers, Erlang local client
  • Request Coordination: get, put, delete, map-reduce
  • Riak Core: membership, consistent hashing, handoff, node-liveness, gossip, buckets
  • vnode master, vnodes, storage backend, JS Runtime

SLIDE 55

Problems with Dynamo

  • Eventual consistency is a harsh mistress
  • Pushes conflict resolution to clients
  • Key/value data types are limited in use
  • Random replica placement destroys locality
  • The gossip protocol can limit cluster size
  • R+W > N is NOT more consistent
  • TCP incast

SLIDE 56

Key-Value Conflict Resolution

  • Forcing clients to resolve consistency issues on read is a pain for developers
  • Most end up choosing the server-enforced last-write-wins policy
  • With many language clients, the logic must be implemented many times
  • One solution: https://github.com/bumptech/montage
  • Another: make everything immutable
  • Another: CRDTs

SLIDE 57

Optimize for Immutability

  • “Accountants don’t use erasers” - Pat Helland
  • Eventual consistency is *great* for immutable data
  • Conflicts become a non-issue if data never changes: no need for full quorums or vector clocks, and backend optimizations become possible
  • The problem space shifts to distributed GC ... which is very hard, but not the user’s problem anymore

SLIDE 58

CRDTs

  • Conflict-free (or Commutative) Replicated Data Types
  • A server-side structure and conflict-resolution policy for richer datatypes, like counters and sets, that are amenable to eventual consistency
  • Letia et al. (2009), “CRDTs: Consistency without concurrency control”: http://hal.inria.fr/inria-00397981/en
  • Prototype here: http://github.com/basho/riak_dt
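As a concrete example, a state-based G-counter, the "hello world" of CRDTs; a minimal sketch in the spirit of (but not copied from) the riak_dt prototype linked above:

    %% Hypothetical grow-only counter: one entry per actor, value is the sum.
    -module(gcounter_sketch).
    -export([new/0, increment/2, value/1, merge/2]).

    new() -> #{}.

    increment(Actor, Counter) ->
        maps:update_with(Actor, fun(N) -> N + 1 end, 1, Counter).

    value(Counter) ->
        lists:sum(maps:values(Counter)).

    %% Merge takes the per-actor maximum. Because merge is commutative,
    %% associative, and idempotent, replicas converge no matter how often or
    %% in what order their states are exchanged - no client-side conflict
    %% resolution needed.
    merge(A, B) ->
        maps:fold(fun(Actor, N, Acc) ->
                      maps:update_with(Actor, fun(M) -> max(M, N) end, N, Acc)
                  end, A, B).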

SLIDE 59

Random Placement and Locality

  • By default, keys are randomly placed on different replicas
  • But we have buckets! Containers imply cheap iteration/enumeration, but with random placement a bucket listing becomes an expensive full scan
  • Partial solution: a hash function defined per bucket can increase locality
  • Lots of work done to minimize the impact of bucket listings
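One way a per-bucket hash function could buy locality, as a heavily hedged sketch (an illustration of the trade-off, not Riak's default placement): derive the partition from the bucket name alone, so a bucket's keys share one preference list and a listing touches N vnodes instead of all of them, at the cost of concentrating that bucket on a few nodes.

    %% Hypothetical per-bucket placement sketch.
    -module(bucket_locality_sketch).
    -export([partition/3]).

    %% per_bucket: all keys in a bucket land on the same partition (good for
    %% listing, bad for spreading load). random_default: Dynamo-style even
    %% spread, at the cost of bucket listings becoming full scans.
    partition({Bucket, _Key}, RingSize, per_bucket) ->
        index(Bucket, RingSize);
    partition({Bucket, Key}, RingSize, random_default) ->
        index({Bucket, Key}, RingSize).

    index(Term, RingSize) ->
        <<Int:160/integer>> = crypto:hash(sha, term_to_binary(Term)),
        Int rem RingSize.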

SLIDE 60

(R+W>N) != Consistency

  • R and W are described in the Dynamo paper as “consistency knobs”
  • Some Basho/Riak docs still say this too! :(
  • Even if R=W=N, sloppy quorums and partial writes make reading old values possible
  • “Read-your-own-writes-if-your-writes-succeed-but-otherwise-you-have-no-idea-what-you’re-going-to-read consistency (RYOWIWSBOYHNIWYGTRC)” - Joe Blomstedt
  • Solution: actual “strong” consistency

SLIDE 61

Strong Consistency in Riak

  • CAP says you must choose C vs. A, but only during failures
  • There’s no reason we can’t implement both models, with different tradeoffs
  • Enable strong consistency on a per-bucket basis
  • See Joe Blomstedt’s talk at RICON 2012 (http://ricon2012.com) and earlier work at http://github.com/jtuple/riak_zab

SLIDE 62

An Aside: Probabilistically Bounded Staleness

  • Bailis et al.: http://pbs.cs.berkeley.edu
  • R=W=1, 0.1 ms latency at all hops

SLIDE 63

TCP Incast

  • “You can’t pour two buckets of manure into one bucket” - Scott Fritchie’s grandfather
  • “Microbursts” of traffic sent to one cluster member
  • The coordinator sends a request to three replicas
  • All respond with a large-ish result at roughly the same time
  • The switch has to either buffer or drop packets
  • Cassandra tries to mitigate this: one replica sends data, the others send hashes. We should do this in Riak.

SLIDE 64

What Riak Did Differently (or wrong)

  • Screwed up the vector clock implementation
  • Actor IDs in vector clocks were client IDs, and therefore potentially unbounded
  • The resulting size explosion produced huge objects and caused OOM crashes
  • Vector clock pruning resulted in false siblings
  • Fixed by forwarding to a node in the preflist, circa 1.0

SLIDE 65

What Riak Did Differently

  • No active anti-entropy until v1.3
  • Early versions had slow, unstable AAE
  • Node loss required reading all objects and repopulating replicas via read repair
  • OK for objects that are read often
  • Rarely-read objects’ N value decreases over time

SLIDE 66

What Riak Did Differently

  • Initial versions had an unavailability window during topology changes
  • Nodes would claim partitions immediately, before data had been handed off
  • New versions don’t change the request preflist until all data has been handed off
  • Implemented as a 2PC-ish commit over gossip

SLIDE 67

Riak, Beyond Dynamo

  • MapReduce
  • Search
  • Secondary Indexes
  • Pre/post-commit hooks
  • Multi-DC replication
  • Riak Pipe distributed computation
  • Riak CS

SLIDE 68

Riak CS

  • An Amazon S3 clone implemented as a proxy in front of Riak
  • Handles eventual consistency issues, object chunking, multitenancy, and the API for a much narrower use case
  • Forced us to eat our own dogfood and get serious about fixing long-standing warts
  • Drives feature development

SLIDE 69

Riak the Product vs. Dynamo the Service

  • Dynamo had the luxury of being a service, while Riak is a product
  • Screwing things up with Riak cannot be fixed with an emergency deploy
  • Multiple platforms and packaging are challenges
  • Testing distributed systems is another talk entirely (QuickCheck FTW): http://www.erlang-factory.com/upload/presentations/514/TestFirstConstructionDistributedSystems.pdf

SLIDE 70

Riak Core

  • Some of our best work!
  • Dynamo, abstracted: implements all of the Dynamo techniques without prescribing a use case
  • Examples of Riak Core apps: Riak KV! Riak Search, Riak Pipe

SLIDE 71

Riak Core

  • Production deployments: OpenX runs several 100+-node clusters of custom Riak Core systems; StackMob built a proxy for mobile services with Riak Core
  • Needs to be much easier to use and better documented

SLIDE 72

Multi-Datacenter Replication

  • Intra-cluster replication in Riak is optimized for consistently low latency and high throughput
  • WAN replication needs to deal with lossy links, long fat networks, and TCP oddities
  • MDC replication has two phases: full-sync (per-partition Merkle comparisons) and real-time (asynchronous, driven by a post-commit hook)
  • Separate policies settable per bucket

SLIDE 73

Erlang

  • Still the best language for this stuff, but:
  • We mix data and control messages over Erlang message passing; switch to TCP (or uTP/UDT) for data
  • NIFs are problematic
  • VM tuning can be a dark art
  • ~90 public repos of mostly-Erlang, mostly-awesome open source: https://github.com/basho

SLIDE 74

Other Future Directions

  • Security was not a factor in Dynamo’s or Riak’s design
  • Isolating Riak increases operational complexity and cost
  • The statically sized ring is a pain
  • Explore possibilities with smarter clients
  • Support larger clusters
  • Multitenancy and tenant isolation
  • More vertical products like Riak CS

SLIDE 75

Questions?

  • @argv0
  • http://www.basho.com
  • http://github.com/basho
  • http://docs.basho.com
