Paxos Replicated State Machines as the Basis of a High-Performance Data Store
William J. Bolosky, Dexter Bradshaw, Randolph B. Haagens, Norbert P. Kusters and Peng Li
March 30, 2011
Q: How to build a fault-tolerant, high-performance data
– Sequentially consistent
– Persistent
– Fault tolerant
– Don’t rely on clock sync for correctness
– Thought to be too slow
– Semantics (e.g. data consistency after failures)
– Assumptions (e.g. clock sync for correctness)
– API (e.g. append only)
– Special hardware (e.g. FAB’s write timestamps)
system is a win
– That we sometimes do better is a bonus
– Value Consistency and Availability over Partition tolerance
– Have operation latencies ≥ network latencies
– Perform very well
– While not compromising
– Of any deterministic computation
– Via replication
– Replicas see the same sequence of inputs
even with:
– Multiple clients
– Unreliable networks
– No synchronized clocks
– Unlimited machine reboots
– Some permanent stopping faults (i.e., disk losses)
– But not Byzantine faults
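The replication idea in the bullets above can be illustrated with a toy example. This is a minimal sketch, not Gaios code: the `KeyValueReplica` class, the `replay` helper, and the operation format are hypothetical, and the agreed log is simply given as a list rather than chosen by Paxos.

```python
from dataclasses import dataclass, field


@dataclass
class KeyValueReplica:
    """A deterministic state machine: its state depends only on the inputs applied, in order."""
    state: dict = field(default_factory=dict)
    applied: int = 0  # number of log entries applied so far

    def apply(self, op):
        kind, key, value = op
        if kind == "put":
            self.state[key] = value
        elif kind == "delete":
            self.state.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {kind}")
        self.applied += 1


def replay(log, replica):
    """Feed the replica every agreed log entry it has not applied yet."""
    for op in log[replica.applied:]:
        replica.apply(op)
    return replica


# The agreed sequence of inputs (in a real system, Paxos would choose this order).
log = [("put", "a", 1), ("put", "b", 2), ("delete", "a", None), ("put", "b", 3)]

replicas = [KeyValueReplica() for _ in range(3)]
# One replica stops part-way through the log and catches up later.
replay(log[:2], replicas[0])
for r in replicas:
    replay(log, r)
```

Because every replica applies the same inputs in the same order, all three end in the same state, including the one that fell behind and caught up.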
[Figure: timeline of one-at-a-time operations across Client, Leader, Member, Member. X-axis: Time (ms), 1 to 10. Phases: Request Send, Proposal Send, Logging (first), Logging (second), ACK Send, Execute, Reply Send.]
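The phases named in the timeline above (Proposal Send, Logging, ACK Send, Execute, Reply Send) follow the usual leader-driven Paxos pattern. A minimal sketch, assuming the leader commits a request once a majority of replicas, itself included, have logged the proposal; the class and method names are hypothetical and not taken from Gaios, and real replicas would log in parallel over a network rather than through direct calls.

```python
def majority(n):
    """Smallest number of replicas that forms a quorum out of n."""
    return n // 2 + 1


class Member:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.log = []  # stands in for the on-disk log

    def on_proposal(self, slot, request):
        """Log the proposal, then acknowledge; a down member never answers."""
        if not self.up:
            return False
        self.log.append((slot, request))
        return True


class Leader(Member):
    def __init__(self, name, members):
        super().__init__(name)
        self.members = members
        self.next_slot = 0
        self.executed = []

    def handle_request(self, request):
        slot = self.next_slot
        self.next_slot += 1
        # Proposal send + logging: the leader logs too and counts as one ACK.
        acks = 1 if self.on_proposal(slot, request) else 0
        acks += sum(m.on_proposal(slot, request) for m in self.members)
        if acks < majority(len(self.members) + 1):
            return None  # no quorum: the request cannot be committed
        # Execute, then reply to the client.
        self.executed.append(request)
        return f"ok:{slot}"


# Three replicas with one member down: two of three still form a quorum.
leader = Leader("leader", [Member("m1"), Member("m2", up=False)])
reply = leader.handle_request("write x")
```

With one of three replicas down the request still commits; with both members down, `handle_request` returns `None` because the leader alone is not a majority.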
[Figure: architecture. Client machine (user and kernel mode): Standard Application, NTFS, Gaios Disk Driver, SMARTER Client. Net. Server machine (user and kernel mode): NTFS, SMARTER Server, Gaios RSM, Log, Stream Store.]
[Figure: Disk Queue. Queued operations: Write 10, Write 42, Write 66, Write 97, Read 600. Components: Stream Store Page Cleaner, Stream Store Reader. Dirty Page Pool: 10, 42, 66, 97, 212, 235, 270, 331, 344, 389, 401, 416, 444, 469, 511, 580, 616, 629, 689, 704, 765, 830, 845, 866, 914, 919, 952, 953.]
[Figure: timeline of one-at-a-time operations across Client, Leader, Member, Member. X-axis: Time (ms), 2 to 10. Phases: Client Send, Leader Check, Execute, Reply.]
– No need for a quorum check on reads
– Relies on clock sync for correctness, which in practice means it trades failover time for correctness
– Worst-case PB1 where all reads come from one node
– Best-case PBN which uses round-robin reads
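The difference between the two comparison configurations comes down to how reads are routed. A minimal sketch of that routing, with hypothetical node names and helper functions that are not part of the original system:

```python
from collections import Counter
from itertools import cycle


def pb1_router(nodes):
    """Worst case: every read is sent to the same node."""
    target = nodes[0]
    while True:
        yield target


def pbn_router(nodes):
    """Best case: reads rotate over all nodes in round-robin order."""
    return cycle(nodes)


def read_load(router, n_reads):
    """Count how many of n_reads each node would serve."""
    return Counter(next(router) for _ in range(n_reads))


nodes = ["n0", "n1", "n2"]
pb1_load = read_load(pb1_router(nodes), 300)
pbn_load = read_load(pbn_router(nodes), 300)
```

Under PB1 a single node serves all 300 reads, while under PBN each of the three nodes serves 100, which is why the two bracket the read performance of a primary-backup design.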
[Figure: chart, lots of outstanding operations. Axis ticks 50 to 500 and 1 to 5; axis label: IO/s. Series: Gaios, PBN, PB1, Local.]
[Figure: chart, 3 nodes, 50% read workload. Axis: Normalized Transactions/s, 0% to 120%. Series: Gaios, PBN, PB1.]
Photo of Gaios, Paxos, Greece
Submit to FAST