Amazon Dynamo A Highly Available Key-value Store Present by Jian - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Amazon Dynamo

A Highly Available Key-value Store
Presented by Jian Fang (jianf@cmu.edu)

slide-2
SLIDE 2

What is Dynamo

➢ Eventually consistent key-value store
➢ Supports scalable, highly available data access
➢ Optimized for availability to maximize customer satisfaction

slide-3
SLIDE 3

Why not RDBMS?

➢ Only primary-key access is needed
➢ RDBMSs have limited scalability
➢ RDBMSs require expensive hardware and skilled administrators

slide-4
SLIDE 4

Amazon’s Requirements

➢ Objects are smaller than 1 MB
➢ No operation spans multiple data items
➢ Response time under 300 ms for 99.9% of requests
➢ Heterogeneous commodity hardware infrastructure
➢ Decentralized, loosely coupled services
➢ Highly available (always writable)

slide-5
SLIDE 5

Techniques used in Dynamo

➢ Consistent hashing
➢ Vector clocks
➢ Sloppy quorum and hinted handoff
➢ Merkle trees
➢ Gossip-based membership protocol

slide-6
SLIDE 6

Interfaces

➢ Key-value storage system with two operators:
    ▪ Get(key): returns a single object, or a list of objects with conflicting versions
    ▪ Put(key, context, object): the context carries the version information
➢ MD5 hashing is applied to the key to generate a 128-bit identifier
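
The key-to-identifier step can be sketched in a few lines. This is a minimal illustration using Python's standard `hashlib`; the function name `key_to_id` and the sample key are assumptions for the example, not part of Dynamo's interface.

```python
import hashlib


def key_to_id(key: bytes) -> int:
    """Map a key to a 128-bit integer identifier using MD5 (hypothetical helper)."""
    digest = hashlib.md5(key).digest()   # MD5 always yields 16 bytes = 128 bits
    return int.from_bytes(digest, "big")


ident = key_to_id(b"cart:12345")         # sample key, chosen only for illustration
print(f"{ident:032x}")                   # identifier as 32 hex digits
```

The integer form makes it easy to place the key on a ring of size 2^128.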

slide-7
SLIDE 7

Partitioning

➢ Scale incrementally
➢ Consistent hashing
➢ Variant of consistent hashing

slide-8
SLIDE 8

Consistent Hashing

➢ Simple non-consistent hashing
    ▪ hash(key) mod N
    ▪ What if N becomes N + 1?
    ▪ 6 keys (half) are remapped
➢ Consistent hashing
    ▪ Only about K/N keys need to be remapped (K keys, N nodes)

[Figure: 12 keys on N = 3 servers (S1, S2, S3) versus 12 keys on N = 4 servers (S1, S2, S3, S4)]
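
The difference between the two placement schemes can be tried out with a small sketch. It assumes MD5 as the hash function and one ring position per server; the helper names (`h`, `mod_owner`, `ring_owner`, `moved`) and the 1000 sample keys are illustrative choices, so the counts it prints will not match the 12-key figure.

```python
import bisect
import hashlib


def h(value):
    """Hash a string to a 128-bit integer (MD5, as an assumption)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest(), "big")


def mod_owner(key, nodes):
    """Non-consistent placement: hash(key) mod N."""
    return nodes[h(key) % len(nodes)]


def ring_owner(key, nodes):
    """Consistent placement: first node at or clockwise after hash(key)."""
    ring = sorted((h(n), n) for n in nodes)
    positions = [p for p, _ in ring]
    i = bisect.bisect_left(positions, h(key)) % len(ring)  # wrap past the last node
    return ring[i][1]


def moved(owner_fn, keys, before, after):
    """Keys whose owner differs between the two node sets."""
    return [k for k in keys if owner_fn(k, before) != owner_fn(k, after)]


keys = [f"key{i}" for i in range(1000)]
before = ["S1", "S2", "S3"]
after = ["S1", "S2", "S3", "S4"]
print("mod N remapped:", len(moved(mod_owner, keys, before, after)))
print("ring remapped: ", len(moved(ring_owner, keys, before, after)))
```

On the ring, every key that moves ends up on the new server S4; no key moves between the existing servers.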

slide-9
SLIDE 9

Consistent Hashing

[Figure: hash ring with nodes A, B, C, D and keys X, Y, Z]

slide-10
SLIDE 10

Consistent Hashing

➢ Basic consistent hashing is not good enough
    ▪ Non-uniform load distribution
    ▪ Ignores heterogeneity in node performance
➢ Variant of consistent hashing
    ▪ Virtual nodes

slide-11
SLIDE 11

Variant of Consistent Hashing

Q = 12 (Virtual Nodes)
S = 3 (Physical Nodes)
T = Q/S = 4 (Tokens)

Ring: S1 S1 S1 S1 S2 S2 S2 S2 S3 S3 S3 S3

slide-12
SLIDE 12

Variant of Consistent Hashing

Q = 12 (Virtual Nodes)
S = 4 (Physical Nodes)
T = Q/S = 3 (Tokens)

Ring: S1 S4 S1 S1 S2 S2 S4 S2 S3 S3 S3 S4
S4 takes one token each from S1, S2, and S3.
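
A minimal sketch of this token scheme follows, assuming Q is divisible by the number of physical nodes. The function names and the rule for choosing which tokens the new node takes (always from a currently most-loaded node) are assumptions for illustration, so the resulting ring layout may differ from the slide's.

```python
from collections import Counter


def initial_assignment(q, nodes):
    """Give each node Q/S consecutive tokens (assumes Q divisible by S)."""
    assert q % len(nodes) == 0
    per_node = q // len(nodes)
    return [nodes[i // per_node] for i in range(q)]


def add_node(assignment, new_node):
    """Let a new node take tokens, one at a time, from a most-loaded node."""
    result = list(assignment)
    node_count = len(set(result) | {new_node})
    target = len(result) // node_count           # T = Q/S for the new S
    for _ in range(target):
        counts = Counter(n for n in result if n != new_node)
        donor = counts.most_common(1)[0][0]      # a node with the most tokens
        result[result.index(donor)] = new_node   # hand over one of its tokens
    return result


ring3 = initial_assignment(12, ["S1", "S2", "S3"])
ring4 = add_node(ring3, "S4")
print(ring4)
print(Counter(ring4))
```

Only the tokens handed to S4 change owner; all other tokens stay where they were.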

slide-13
SLIDE 13

Replication

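➢ A coordinator, Node(i), is in charge of each key
➢ The (N-1) clockwise successor nodes act as replicas
➢ Node(i) updates the other (N-1) replicas
➢ Each key has a preference list of nodes
    ▪ List size > N

[Figure: ring with nodes A, B, C, D; Key Z is handled by coordinator Node(i)]
Preference List = [A,B,C,D]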
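
Building a preference list can be sketched as a clockwise walk around the ring. The ring positions below are made-up numbers, `preference_list` is a hypothetical helper, and skipping physical nodes that were already chosen is an assumption that only matters when one physical node owns several ring positions.

```python
def preference_list(ring, key_pos, size):
    """Collect `size` distinct physical nodes clockwise from key_pos.

    ring: list of (position, physical_node) pairs sorted by position.
    """
    # Index of the first ring entry at or after the key; wrap to 0 if none.
    start = next((i for i, (pos, _) in enumerate(ring) if pos >= key_pos), 0)
    chosen = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in chosen:          # skip repeated physical nodes
            chosen.append(node)
        if len(chosen) == size:
            break
    return chosen


ring = [(10, "A"), (20, "B"), (30, "C"), (40, "D")]   # illustrative positions
prefs = preference_list(ring, key_pos=5, size=4)
print("coordinator:", prefs[0])
print("preference list:", prefs)
```

With N = 3, asking for four entries yields a list longer than N, matching the slide's [A,B,C,D].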

slide-14
SLIDE 14

Data Versioning

➢ Eventual consistency
➢ Put() may return before all replicas are updated
➢ Get() can return multiple versions for the same key
➢ Each data mutation is treated as a new version
➢ Vector clocks track the relationship between versions

slide-15
SLIDE 15

Vector Clock (Example)

Supplier A writes 500$ (replicas Sx, Sy, Sz)

Sx: 500$(1,0,0)
Sy: 500$(1,0,0)
Sz: 500$(1,0,0)

slide-16
SLIDE 16

Vector Clock (Example)

Supplier A writes 550$ (replicas Sx, Sy, Sz)

Sx: 500$(1,0,0) → 550$(2,0,0)
Sy: 500$(1,0,0) → 550$(2,0,0)
Sz: 500$(1,0,0) → 550$(2,0,0)

slide-17
SLIDE 17

Vector Clock (Example)

Supplier B writes 600$ (replicas Sx, Sy, Sz)

Sx: 500$(1,0,0) → 550$(2,0,0)
Sy: 500$(1,0,0) → 550$(2,0,0)
Sz: 500$(1,0,0) → 550$(2,0,0)

Supplier B's version: 600$(2,1,0)

slide-18
SLIDE 18

Vector Clock (Example)

Supplier C writes 650$ (replicas Sx, Sy, Sz)

Sx: 500$(1,0,0) → 550$(2,0,0)
Sy: 500$(1,0,0) → 550$(2,0,0)
Sz: 500$(1,0,0) → 550$(2,0,0)

Supplier B's version: 600$(2,1,0)
Supplier C's version: 650$(2,0,1) ×3

Conflict!

slide-19
SLIDE 19

Vector Clock (Example)

Replicas Sx, Sy, Sz

Sx: 500$(1,0,0) → 550$(2,0,0)
Sy: 500$(1,0,0) → 550$(2,0,0)
Sz: 500$(1,0,0) → 550$(2,0,0)

Stored versions: 600$(2,1,0); 650$(2,0,1) ×3
Supplier B sees 600$(2,1,0)/650$(2,0,1)
Resolve conflict: choose 650$

slide-20
SLIDE 20

Vector Clock (Example)

Replicas Sx, Sy, Sz

Sx: 500$(1,0,0) → 550$(2,0,0)
Sy: 500$(1,0,0) → 550$(2,0,0)
Sz: 500$(1,0,0) → 550$(2,0,0)

Stored versions: 600$(2,1,0)/650$(2,0,1); 650$(2,0,1); 650$(2,0,1)
Supplier B's resolved version: 650$(2,1,1)

Sx: 650$(2,1,1)
Sy: 650$(2,1,1)
Sz: 650$(2,1,1)
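
The comparisons behind this example can be sketched with plain tuples, one counter per server in the order (Sx, Sy, Sz). The helper names `descends`, `concurrent`, and `merge` are assumptions for illustration; `merge` takes the element-wise maximum, which is how the example arrives at (2,1,1).

```python
def descends(a, b):
    """True if clock a has seen everything clock b has (a >= b element-wise)."""
    return all(x >= y for x, y in zip(a, b))


def concurrent(a, b):
    """Neither clock descends from the other: the versions conflict."""
    return not descends(a, b) and not descends(b, a)


def merge(a, b):
    """Element-wise maximum, used for the clock of the resolved version."""
    return tuple(max(x, y) for x, y in zip(a, b))


v_600 = (2, 1, 0)   # Supplier B's version
v_650 = (2, 0, 1)   # Supplier C's version

print(descends((2, 0, 0), (1, 0, 0)))   # True: 550$ supersedes 500$
print(concurrent(v_600, v_650))         # True: conflict
print(merge(v_600, v_650))              # (2, 1, 1)
```

A version whose clock descends from another can silently replace it; concurrent versions must both be returned to the client for resolution.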

slide-21
SLIDE 21

Processing get() and put()

➢ How to select a coordinator node
    ▪ Load balancer (server-driven)
    ▪ Partition-aware client library (client-driven)
➢ Quorum-like system for consistency
    ▪ W + R > N
    ▪ Typical values: N = 3, W = 2, R = 2

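
Why W + R > N matters can be checked by brute force: under a strict quorum, every possible write set must share at least one replica with every possible read set. The sketch below assumes replicas are simply numbered 0..N-1; `quorums_always_overlap` is a hypothetical helper.

```python
from itertools import combinations


def quorums_always_overlap(n, w, r):
    """True if every write set of size w intersects every read set of size r."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


print(quorums_always_overlap(3, 2, 2))   # True:  W + R = 4 > N = 3
print(quorums_always_overlap(3, 1, 1))   # False: W + R = 2 <= N = 3
```

With the typical values N = 3, W = 2, R = 2, any read contacts at least one replica that took part in the latest successful write.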

slide-22
SLIDE 22

Hinted Handoff

[Figure: ring with nodes A, B, C, D; a Put() meant for A is stored on D with a hint naming A]

slide-23
SLIDE 23

Hinted Handoff

[Figure: ring with nodes A, B, C, D; D holds the replica hinted for A]
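
Hinted handoff can be sketched as follows, under the assumption that a write meant for an unreachable node is parked on a stand-in node together with a hint naming the intended owner, and handed over once the owner is reachable again. The `Node` class, `put`, and `hand_off` are hypothetical names for this illustration, not Dynamo's API.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.store = {}    # key -> value, replicas this node owns
        self.hints = []    # (intended_owner_name, key, value) held for others


def put(key, value, preference, fallbacks):
    """Write to each node in `preference`; park writes for down nodes on a fallback."""
    spare = iter(fallbacks)
    for node in preference:
        if node.up:
            node.store[key] = value
        else:
            stand_in = next(n for n in spare if n.up)   # sketch: assumes one exists
            stand_in.hints.append((node.name, key, value))


def hand_off(holder, nodes_by_name):
    """Deliver hinted replicas to owners that are reachable again."""
    remaining = []
    for owner, key, value in holder.hints:
        target = nodes_by_name[owner]
        if target.up:
            target.store[key] = value
        else:
            remaining.append((owner, key, value))
    holder.hints = remaining


a, b, c, d = (Node(n) for n in "ABCD")
a.up = False
put("k", "v", preference=[a, b, c], fallbacks=[d])
print(d.hints)                      # [('A', 'k', 'v')]
a.up = True
hand_off(d, {"A": a, "B": b, "C": c, "D": d})
print(a.store, d.hints)             # {'k': 'v'} []
```

The write stays available while A is down, and D drops its temporary copy after the handoff.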

slide-24
SLIDE 24

Replica Synchronization (Merkle Tree)

Range: (0,256]    Depth: 3    Tokens: 8 * 32

Row key    Token    Hash
key1       5        0x1001
key2       135      0x1100
key3       170      0x0101
key4       185      0x0010

Leaves (token range: hash)
(0,32]:    0x1001
(32,64]:   (empty)
(64,96]:   (empty)
(96,128]:  (empty)
(128,160]: 0x1100
(160,192]: 0x0111
(192,224]: (empty)
(224,256]: (empty)

Inner nodes (labeled by split point; each hash is the XOR of its children)
32: 0x1001    96: (empty)    160: 0x1011    224: (empty)
64: 0x1001    192: 0x1011
128 (root): 0x0010

Example from: http://bit.ly/1fUa0CS
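
The numbers in this example can be reproduced with a short sketch. It follows the example's XOR combination rather than a cryptographic hash of the children, and it treats an empty range as hash 0, which is an assumption; `build_xor_tree` is a hypothetical helper.

```python
def build_xor_tree(rows, lo=0, hi=256, leaf_width=32):
    """Return tree levels from leaves to root; each parent is the XOR of its two children.

    rows: iterable of (token, row_hash); token ranges are (start, end].
    """
    leaves = [0] * ((hi - lo) // leaf_width)         # empty range -> 0 (assumption)
    for token, row_hash in rows:
        index = (token - lo - 1) // leaf_width       # (start, end] bucketing
        leaves[index] ^= row_hash
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] ^ prev[i + 1] for i in range(0, len(prev), 2)])
    return levels


rows = [(5, 0x1001), (135, 0x1100), (170, 0x0101), (185, 0x0010)]
levels = build_xor_tree(rows)
for level in levels:
    print([f"0x{value:04x}" for value in level])
```

Two replicas can compare roots first and descend only into subtrees whose hashes differ, so matching ranges are never transferred.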

slide-25
SLIDE 25

Performance

slide-26
SLIDE 26

Q&A Thank you!