SLIDE 1

Paxos Replicated State Machines as the Basis of a High-Performance Data Store

William J. Bolosky, Dexter Bradshaw, Randolph B. Haagens, Norbert P. Kusters and Peng Li

March 30, 2011

SLIDE 2

Q: How to build a fault-tolerant, high-performance data store from commodity parts? A: Paxos replicated state machines

SLIDE 3
  • Paxos Replicated State Machines

– Sequentially consistent
– Persistent
– Fault tolerant
– Don’t rely on clock sync for correctness
– Thought to be too slow

  • Conventional systems compromise on

– Semantics (e.g. data consistency after failures)
– Assumptions (e.g. clock sync for correctness)
– API (e.g. append only)
– Special hardware (e.g. FAB’s write timestamps)

  • Paxos equaling the speed of a conventional system is a win

– That we sometimes do better is a bonus

SLIDE 4

Take Away Point

  • For datacenter-like systems that:

– Value Consistency and Availability over Partition tolerance
– Have operation latencies ≥ network latencies

  • Paxos replicated state machines

– Perform very well
– While not compromising

SLIDE 5

Outline

  • Background: Replicated State Machines and Paxos

  • SMARTER and Gaios
  • A new protocol for read-only operations
  • Performance evaluation and comparison to primary-backup replication

SLIDE 6

Replicated State Machines

  • For fault tolerance

– Of any deterministic computation
– Via replication
– Replicas see the same sequence of inputs

  • Paxos is a protocol for guaranteeing input ordering, even with:

– Multiple clients
– Unreliable networks
– No synchronized clocks
– Unlimited machine reboots
– Some permanent stopping faults (i.e., disk losses)
– But not Byzantine faults
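The slide's core idea can be sketched in a few lines: as long as every replica applies the same ordered inputs to a deterministic state machine, the replicas stay identical. This is a toy model, not the paper's code; the key-value machine and operation names are made up for illustration.

```python
# Minimal sketch: a deterministic state machine stays consistent across
# replicas when every replica applies the same inputs in the same order.

class KVStateMachine:
    """A toy deterministic key-value state machine."""
    def __init__(self):
        self.state = {}

    def apply(self, op):
        kind, key, value = op
        if kind == "put":
            self.state[key] = value
        return self.state.get(key)

# The agreed-on input order (what Paxos provides in the real system).
ordered_log = [("put", "x", 1), ("put", "y", 2), ("put", "x", 3)]

replica_a, replica_b = KVStateMachine(), KVStateMachine()
for op in ordered_log:          # same sequence of inputs at each replica...
    replica_a.apply(op)
    replica_b.apply(op)

# ...yields the same state, so either replica can stand in for the other.
assert replica_a.state == replica_b.state == {"x": 3, "y": 2}
```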

SLIDE 7

Non Trade-Off

  • RSMs’ one-at-a-time execution model seems to be at odds with disks’ need to reorder IO for efficiency. It’s not.
  • Analogous to an out-of-order processor.

SLIDE 8

Paxos Basics

  • Paxos binds client requests to sequentially numbered slots.
  • In normal operation, requires a write to persistent store to survive power loss.
  • Has a dynamically selected and changeable leader that drives the protocol.
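The slot-binding idea on this slide can be sketched as follows. This is a hedged toy model (the class and method names are invented, not SMARTER's API): the leader assigns each request the next sequence number, and a request commits only once a majority of replicas have durably logged it.

```python
# Toy sketch of slot binding and quorum logging (illustrative only).

class ToyLeader:
    def __init__(self, num_replicas):
        self.num_replicas = num_replicas
        self.next_slot = 0
        self.log_acks = {}        # slot -> replicas that have logged it

    def propose(self, request):
        """Bind a client request to the next sequentially numbered slot."""
        slot = self.next_slot
        self.next_slot += 1
        self.log_acks[slot] = set()
        return slot               # (slot, request) would go to all replicas

    def on_logged(self, slot, replica_id):
        """A replica reports its persistent write completed; commit on quorum."""
        self.log_acks[slot].add(replica_id)
        quorum = self.num_replicas // 2 + 1
        return len(self.log_acks[slot]) >= quorum

leader = ToyLeader(num_replicas=3)
slot = leader.propose("write A")
assert slot == 0
assert not leader.on_logged(slot, replica_id=0)   # 1 of 3: no quorum yet
assert leader.on_logged(slot, replica_id=1)       # 2 of 3: committed
```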

SLIDE 9

[Message-sequence diagram (Client, Leader, two Members): the Client Request goes to the Leader, which sends a Proposal to the Members; each Member logs it (Log Complete, Log Complete + ACK); the Leader then Commits and sends the Reply, followed by an Extra Reply.]

SLIDE 10

4K Write Latency Timeline

(One-at-a-Time Operations)

[Timeline over roughly 10 ms: Request Send, Proposal Send, Logging (first), Logging (second), ACK Send, Execute, Reply Send.]

SLIDE 11

Outline

  • Background: Replicated State Machines and Paxos

  • SMARTER and Gaios
  • A new protocol for read-only operations
  • Performance evaluation and comparison to primary-backup replication

SLIDE 12

Gaios Architecture

[Architecture diagram. Client machine: a standard application runs over NTFS and the Gaios disk driver in the kernel, with the SMARTER client in user mode, connected to the servers over the network. Server machine: the SMARTER server (user mode) hosts the Gaios RSM over a Log and Stream Store, with NTFS in the kernel.]

SLIDE 13

Getting Efficiency

  • Mostly just lots of good engineering
  • 1. Pipelining
  • 2. Batched write behind
  • 3. Overlap fetching with logging
  • 4. Batching client requests
  • 5. Zero-copy data path
  • Novel read-only operation protocol that allows consistent reads from any node
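Item 4 above, batching client requests, can be sketched as a small queue-draining loop. This is a hedged illustration under one assumption: that while a proposal is in flight, the leader may fold everything queued since into a single new proposal.

```python
# Sketch of request batching: many client requests, one Paxos proposal.

from collections import deque

class Batcher:
    def __init__(self):
        self.pending = deque()

    def submit(self, request):
        """Requests arriving while a proposal is in flight just queue up."""
        self.pending.append(request)

    def next_batch(self):
        """Drain everything queued so far into one proposal's payload."""
        batch = list(self.pending)
        self.pending.clear()
        return batch

b = Batcher()
for r in ["w1", "w2", "w3"]:
    b.submit(r)                   # three requests arrive during one round trip
assert b.next_batch() == ["w1", "w2", "w3"]   # one proposal carries all three
assert b.next_batch() == []
```

The payoff is that the per-proposal protocol cost (messages, log writes) is amortized over the whole batch.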

SLIDE 14

Outline

  • Background: Replicated State Machines and Paxos

  • SMARTER and Gaios
  • A new protocol for read-only operations
  • Performance evaluation and comparison to primary-backup replication

SLIDE 15

Read Consistency Property

Not-Before Constraint: When a read-only request R completes, it reflects any data known by any client to be written at the time R was sent.

SLIDE 16

Read-Only Operations

  • Read-only operations only need to run in one place
  • Using all disks is crucial
  • Dynamically selecting location helps

– Avoid nodes that are writing

SLIDE 17

Read/Write Contention

Disk Queue

[Diagram: the disk queue holds writes (pages 10, 42, 66, 97) ahead of a read (page 600); the Stream Store page cleaner and the Stream Store reader contend over a dirty page pool spanning many pages.]

Randomize checkpoint timing across nodes
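The mitigation on this slide can be sketched as simple jitter: each node offsets its checkpoint start randomly so that, at any instant, most replicas are quiet and a read can be steered away from the ones that are writing. The interval and jitter values below are made up for illustration.

```python
# Sketch: jittered checkpoint scheduling desynchronizes nodes.

import random

def next_checkpoint_time(now, interval, rng):
    # Uniform jitter in [0.5, 1.5) x interval: nodes checkpoint at the
    # same average rate but rarely at the same moment.
    return now + interval * (0.5 + rng.random())

rng = random.Random(42)
times = [next_checkpoint_time(0.0, interval=60.0, rng=rng) for _ in range(3)]
assert all(30.0 <= t <= 90.0 for t in times)          # bounded schedule
assert len(set(round(t, 3) for t in times)) == 3      # nodes desynchronized
```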

SLIDE 18

[Message-sequence diagram (Client, Leader, two Members): the Client's Read Request goes to the Leader, which performs a Leadership Check with the Members; once Leadership Replies arrive, the read completes and the Client Reply is sent.]
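The quorum step in the diagram above can be sketched as follows. This is a hedged toy model of the idea, not the paper's protocol code: before replying, the serving node confirms with a majority that the current leader is still the leader; if so, no write unknown to that leader can have committed, so answering from local state satisfies the not-before constraint.

```python
# Toy sketch of the leadership check behind a consistent read.

def leadership_confirmed(replies, num_members):
    """replies: member answers to 'is this leader still the leader?'"""
    quorum = num_members // 2 + 1
    return sum(1 for r in replies if r) >= quorum

# Three members, two still recognize the leader: quorum holds, so the
# read may be executed locally and returned.
assert leadership_confirmed([True, True, False], num_members=3)

# Only one ack: the check fails and the read must be retried elsewhere.
assert not leadership_confirmed([True, False, False], num_members=3)
```

Unlike a lease, this check costs a round of messages but needs no clock-synchronization assumption.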

SLIDE 19

4K Read Latency Timeline

(One-at-a-Time Operations)

[Timeline over roughly 10 ms: Client Send, Leader Check, Execute, Reply.]

SLIDE 20

Outline

  • Background: Replicated State Machines and Paxos

  • SMARTER and Gaios
  • A new protocol for read-only operations
  • Performance evaluation and comparison to primary-backup replication

SLIDE 21

Primary-Backup Replication

  • (Usually) Sends both read and write replies from the primary in order to achieve the read consistency property
  • Uses leasing protocol for primary

– No need for a quorum check on reads
– Relies on clock sync for correctness, which in practice means it trades failover time for correctness
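The leasing trade-off above can be made concrete with a small sketch (the numbers and function are illustrative, not any particular system's): a primary may answer reads locally only while its lease is unexpired, which is correct only if clocks drift by less than an assumed bound. A longer lease means fewer renewals but a longer wait before failover.

```python
# Sketch of a lease check: local reads are safe only inside the lease,
# conservatively shrunk by the assumed worst-case clock skew.

def can_serve_read(now, lease_expiry, max_clock_skew):
    # If the primary's clock may run slow by max_clock_skew, it must
    # treat the lease as expiring that much earlier.
    return now + max_clock_skew < lease_expiry

assert can_serve_read(now=10.0, lease_expiry=15.0, max_clock_skew=0.5)
assert not can_serve_read(now=14.8, lease_expiry=15.0, max_clock_skew=0.5)
```

If real skew ever exceeds the assumed bound, a stale primary can serve reads after a new primary has taken over, which is why the slide says correctness here rests on clock sync.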

SLIDE 22

Read Distribution

  • Primary-Backup forces reads to one node, while SMARTER spreads them across all, which can matter for random reads
  • P-B can achieve spreading by striping data across many groups and locating the primaries on different nodes; this spreading is static
  • Implemented two versions of P-B:

– Worst-case PB1, where all reads come from one node
– Best-case PBN, which uses round-robin reads
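The two read policies compared here can be sketched side by side. PB1 and PBN are the slide's names; the dispatch functions are illustrative, not the evaluated implementation.

```python
# Sketch of the two primary-backup read policies from this slide.

import itertools

def pb1_dispatch(reads, replicas):
    """Worst case: every read goes to the primary (one node's disk)."""
    return [replicas[0] for _ in reads]

def pbn_dispatch(reads, replicas):
    """Best case: static round-robin spread across all replicas."""
    rr = itertools.cycle(replicas)
    return [next(rr) for _ in reads]

replicas = ["n0", "n1", "n2"]
reads = ["r%d" % i for i in range(6)]
assert pb1_dispatch(reads, replicas) == ["n0"] * 6
assert pbn_dispatch(reads, replicas) == ["n0", "n1", "n2", "n0", "n1", "n2"]
```

SMARTER goes further than either: because any node can serve a consistent read, it can pick the target dynamically, e.g. avoiding nodes that are busy writing.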

SLIDE 23

8K Random Read Throughput

(Lots of outstanding operations)

[Chart: IO/s (roughly 50–500) vs. number of replicas (1–5), with series for Gaios, PBN, PB1, and a local disk.]

SLIDE 24

Transaction Processing

  • Ran an industry-standard OLTP load over Microsoft SQL Server 2008.
  • Critical factors: SQL log write latency, random read bandwidth.
  • Even read/write ratio, mostly ~8K.

SLIDE 25

OLTP Performance

(3 nodes, 50% read workload)

[Chart: normalized transactions/s (0–120%) for Gaios, PBN, and PB1.]

SLIDE 26

Conclusion

  • Paxos RSMs are fine for high-performance disk-based applications; it just takes careful engineering.
  • In some cases, they outperform best-case P-B due to flexibility in directing reads.
  • There is no need to compromise on semantics, buy special hardware, depend on clocks, etc.

SLIDE 27

Thank You!

Photo of Gaios, Paxos, Greece

Submit to FAST