Data Storage Revolution: Relational Databases and Object Storage — PowerPoint PPT Presentation

SLIDE 1
Data Storage Revolution

  • Relational Databases
  • Object Storage (put/get)

– Dynamo
– PNUTS
– CouchDB
– MemcacheDB
– Cassandra

Desired properties: Speed, Scalability, Availability, Throughput, No Complexity

SLIDE 3

Eventual Consistency

[Diagram: a Manager fronts four Replicas. A Write Request updates one replica while Read Requests hit others, so two concurrent reads can return different values (B vs. A) until the write propagates.]

SLIDE 4

Eventual Consistency

  • Writes ordered after commit
  • Reads can be out-of-order or stale
  • Easy to scale, high throughput
  • Difficult application programming model
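The stale-read behavior above can be sketched with a toy replica set (a minimal model, not any of the systems from slide 2): a write is applied at one replica and propagates asynchronously, so reads at other replicas may return the old value.

```python
# Toy model of eventual consistency: replicas apply writes asynchronously,
# so reads can be stale until propagation completes.

class Replica:
    def __init__(self):
        self.value = "V1"

replicas = [Replica() for _ in range(3)]

# A write lands on replica 0 but has not yet propagated.
replicas[0].value = "V2"

fresh = replicas[0].value   # "V2" — this replica has applied the write
stale = replicas[1].value   # "V1" — this replica has not yet seen it

# After asynchronous propagation, the replicas converge.
for r in replicas:
    r.value = "V2"
assert all(r.value == "V2" for r in replicas)
```

The application-visible consequence is exactly the "difficult programming model" bullet: two reads issued back-to-back can disagree.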
SLIDE 5

Traditional Solution to Consistency

[Diagram: a Manager coordinates a Write Request across four Replicas using two-phase commit.]

Two-Phase Commit:

  • 1. Prepare
  • 2. Vote: Yes
  • 3. Commit
  • 4. Ack
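The four messages above can be sketched as a minimal coordinator loop (the `Participant` class and message names are illustrative; a real 2PC implementation also needs write-ahead logging, timeouts, and abort/rollback handling):

```python
# Minimal two-phase commit sketch: the manager (coordinator) commits a
# write only if every replica votes yes in the prepare phase.

class Participant:
    def __init__(self, name):
        self.name = name
        self.value = None

    def prepare(self, value):          # 1. Prepare
        self.pending = value
        return True                    # 2. Vote: Yes

    def commit(self):                  # 3. Commit
        self.value = self.pending
        return "ack"                   # 4. Ack

def two_phase_commit(participants, value):
    votes = [p.prepare(value) for p in participants]
    if all(votes):
        acks = [p.commit() for p in participants]
        return all(a == "ack" for a in acks)
    return False  # abort (rollback messages not shown)

replicas = [Participant(f"replica-{i}") for i in range(4)]
assert two_phase_commit(replicas, "V2")
assert all(p.value == "V2" for p in replicas)
```

Every write pays two round trips to every replica, which is the expense the next slide calls out.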
SLIDE 6

Strong Consistency

  • Reads and Writes strictly ordered
  • Easy programming
  • Expensive implementation
  • Doesn’t scale well
SLIDE 7

Our Goal

  • Easy programming
  • Easy to scale, high throughput
SLIDE 8

Chain Replication

[Diagram: under a Manager, the Replicas form a chain from HEAD to TAIL. Write Requests enter at the HEAD and propagate down the chain; Read Requests are served only by the TAIL, so all operations appear in a single total order (e.g., W1 R1 W2 R2 R3).]

van Renesse & Schneider (OSDI 2004)
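The ordering shown in the diagram can be sketched as follows (a simplified model of the van Renesse & Schneider design; failure handling by the manager is omitted). Writes enter at the head and are forwarded node-by-node toward the tail; reads are answered only by the tail, so every read sees the latest committed write.

```python
# Chain replication sketch: write at HEAD, propagate to TAIL, read at TAIL.

class Chain:
    def __init__(self, length):
        self.nodes = [{} for _ in range(length)]  # each node: key -> value

    def write(self, key, value):
        # The write enters at the HEAD and is forwarded down the chain;
        # it is committed once the TAIL has applied it.
        for node in self.nodes:
            node[key] = value

    def read(self, key):
        # Only the TAIL serves reads, so reads only ever see committed writes.
        return self.nodes[-1].get(key)

chain = Chain(length=4)
chain.write("x", "V1")
chain.write("x", "V2")
assert chain.read("x") == "V2"  # strong consistency: latest committed value
```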

SLIDE 9

Chain Replication

  • Strong consistency
  • Simple replication
  • Increases write throughput
  • Low read throughput
  • Can we increase throughput?
  • Insight:

– Most applications are read-heavy (100:1)
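The insight can be made concrete with a back-of-the-envelope calculation (the per-node rate is an illustrative number, not a measured one): if only the tail of a chain serves reads, read capacity is that of a single node; if every node can serve reads, capacity scales with chain length, matching the 1x / 3x / 7x curves on the throughput slide later in the deck.

```python
# Back-of-the-envelope read capacity: chain replication (only the TAIL
# serves reads) vs. CRAQ (any node can serve a clean read).

PER_NODE_READS = 10_000  # hypothetical reads/s one node can sustain

def cr_read_capacity(chain_length):
    return PER_NODE_READS                  # chain_length is irrelevant: TAIL only

def craq_read_capacity(chain_length):
    return PER_NODE_READS * chain_length   # all nodes answer clean reads

assert craq_read_capacity(3) == 3 * cr_read_capacity(3)
assert craq_read_capacity(7) == 7 * cr_read_capacity(7)
```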

SLIDE 10

CRAQ

  • Two states per object – clean and dirty

[Diagram: a chain of HEAD, two middle Replicas, and TAIL, all holding the clean version V1; Read Requests can be served by any node in the chain.]

SLIDE 11

CRAQ

  • Two states per object – clean and dirty
  • If latest version is clean, return value
  • If dirty, contact tail for latest version number
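The two rules above can be sketched as a per-node read path (a deliberately simplified single-object model with explicit version numbers; the real system tracks this per object across the whole chain):

```python
# CRAQ read sketch: a node returns its value directly when clean, and asks
# the TAIL for the latest committed version number when dirty.

class Tail:
    def __init__(self):
        self.committed_version = 1  # highest version the TAIL has applied

class CraqNode:
    def __init__(self, tail):
        self.tail = tail
        self.versions = {1: "V1"}   # version number -> value
        self.dirty = False          # True while a newer write is uncommitted

    def read(self):
        if not self.dirty:
            # Clean: the newest local version is committed; serve it locally.
            return self.versions[max(self.versions)]
        # Dirty: ask the TAIL which version is committed, return that one.
        return self.versions[self.tail.committed_version]

tail = Tail()
node = CraqNode(tail)
assert node.read() == "V1"      # clean read, served locally

node.versions[2] = "V2"         # a write is propagating through this node
node.dirty = True
assert node.read() == "V1"      # dirty read: TAIL says version 1 is committed

tail.committed_version = 2      # the write reaches the TAIL and commits
node.dirty = False              # the TAIL's ack marks this node clean again
assert node.read() == "V2"
```

Note the dirty case costs only a version query to the tail, not a full value transfer, which is why dirty reads stay cheap.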

[Diagram: a Write Request for V2 enters at the HEAD and propagates down the chain. Nodes holding both V1 and V2 are dirty; a Read Request at a dirty node triggers a version query to the TAIL, which answers with the latest committed version. Once the write reaches the TAIL, every node becomes clean at V2.]

SLIDE 12

Multicast Optimizations

  • Each chain forms group
  • Tail multicasts ACKs

[Diagram: once the write commits at the TAIL, the TAIL multicasts the ACK to the whole chain group, letting every node mark V2 clean at once.]

SLIDE 13

Multicast Optimizations

  • Each chain forms group
  • Tail multicasts ACKs
  • Head multicasts write data

[Diagram: the HEAD multicasts the V3 write data to the whole chain group, so only small version metadata needs to propagate down the chain before the TAIL commits V3.]

SLIDE 14

CRAQ Benefits

  • From Chain Replication

– Strong consistency
– Simple replication
– Increases write throughput

  • Additional Contributions

– Read throughput scales:

  • Chain Replication with Apportioned Queries

– Supports Eventual Consistency

SLIDE 15

High Diversity

  • Many data storage systems assume locality

– Well connected, low latency

  • Real large applications are geo-replicated

– To provide low latency
– To provide fault tolerance

(source: Data Center Knowledge)

SLIDE 16

Multi-Datacenter CRAQ

[Diagram: a single chain spans datacenters DC1, DC2, and DC3, running from the HEAD through replicas in each datacenter to the TAIL across the wide area.]

SLIDE 17

Multi-Datacenter CRAQ

[Diagram: the same cross-datacenter chain, now with Clients shown; each Client can read from a replica inside its own datacenter (DC1, DC2, or DC3) instead of crossing the wide area.]

SLIDE 18

Chain Configuration

Motivation

  • 1. Popular vs. scarce objects
  • 2. Subset relevance
  • 3. Datacenter diversity
  • 4. Write locality

Solution

  • 1. Specify chain size
  • 2. List datacenters

– dc1, dc2, … dcN

  • 3. Separate sizes

– dc1, chain_size1, …

  • 4. Specify master
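A chain configuration addressing all four points might look like the following (illustrative syntax only; the actual metadata format stored by the system may differ):

```yaml
# Hypothetical chain-configuration metadata for one object.
chain_size: 7                   # 1. more replicas for popular objects
datacenters: [dc1, dc2, dc3]    # 2. restrict placement to relevant DCs
dc_chain_sizes:                 # 3. per-datacenter sizes (overrides chain_size)
  dc1: 3
  dc2: 2
  dc3: 2
master: dc1                     # 4. write locality: the HEAD lives in dc1
```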

SLIDE 19

Master Datacenter

[Diagram: DC1 is the master datacenter. The Writer sends writes to the HEAD in DC1; the chain replicates into DC2 and DC3, each with a local TAIL serving reads from nearby replicas.]

SLIDE 20

Implementation

  • Approximately 3,000 lines of C++
  • Uses Tame extensions to SFS asynchronous I/O and RPC libraries

  • Network operations use Sun RPC interfaces
  • Uses Yahoo’s ZooKeeper for coordination
SLIDE 21

Coordination Using ZooKeeper

  • Stores chain metadata
  • Monitors/notifies about node membership

[Diagram: CRAQ nodes in DC1, DC2, and DC3, with each datacenter coordinating through its own ZooKeeper instance.]

SLIDE 22

Evaluation

  • Does CRAQ scale vs. CR?
  • How does write rate impact performance?
  • Can CRAQ recover from failures?
  • How does the WAN affect CRAQ?
  • Tests use Emulab network emulation testbed
SLIDE 23

Read Throughput as Writes Increase

[Chart: Reads/s vs. Writes/s for CR-3, CRAQ-3, and CRAQ-7; relative read throughput is roughly 1x, 3x, and 7x respectively.]

SLIDE 24

Failure Recovery (Read Throughput)

[Chart: Reads/s vs. Time (s) during failure recovery, for chain lengths 3, 5, and 7.]

SLIDE 25

Failure Recovery (Latency)

[Charts: Read Latency (ms) vs. Time (s) and Write Latency (ms) vs. Time (s) during failure recovery.]

SLIDE 26

Geo-replicated Read Latency

[Chart: Mean Latency (ms) vs. Writes/s for CR and CRAQ in a geo-replicated setting.]

SLIDE 27

If Single-Object Put/Get Is Insufficient

  • Test-and-Set, Append, Increment

– Trivial to implement
– Head alone can evaluate

  • Multiple-object transactions in the same chain

– Can still be performed easily
– Head alone can evaluate

  • Multiple chains

– An agreement protocol (2PC) can be used
– Only the heads of the chains need to participate
– Degrades performance, so use carefully!
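The single-chain atomic operations can be sketched at the head (a minimal model: because every write for an object is serialized through the head, it can evaluate read-modify-write operations locally and then propagate only the result down the chain; the propagation itself is stubbed out here):

```python
# Atomic operations evaluated at the chain HEAD.

class Head:
    def __init__(self):
        self.store = {"counter": 0, "log": ""}

    def propagate(self, key):
        pass  # forward self.store[key] down the chain (omitted in this sketch)

    def increment(self, key, delta=1):
        self.store[key] += delta
        self.propagate(key)
        return self.store[key]

    def append(self, key, suffix):
        self.store[key] += suffix
        self.propagate(key)
        return self.store[key]

    def test_and_set(self, key, expected, new):
        # Atomic because the HEAD serializes all writes for this object.
        if self.store[key] == expected:
            self.store[key] = new
            self.propagate(key)
            return True
        return False

head = Head()
assert head.increment("counter") == 1
assert head.append("log", "a") == "a"
assert head.test_and_set("counter", 1, 10)      # succeeds: value matched
assert not head.test_and_set("counter", 1, 99)  # fails: value is now 10
```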

SLIDE 28

Summary

  • CRAQ contributions:

– Challenges the trade-off between consistency and throughput

  • Provides strong consistency
  • Read throughput scales linearly for read-mostly workloads
  • Supports wide-area deployments of chains
  • Provides atomic operations and transactions

Thank you! Questions?