
Paxos Replicated State Machines as the Basis of a High-Performance Data Store
William J. Bolosky, Dexter Bradshaw, Randolph B. Haagens, Norbert P. Kusters and Peng Li
March 30, 2011

Q: How to build a fault-tolerant, high-performance data store from commodity parts?
A: Paxos replicated state machines

• Paxos Replicated State Machines
  – Sequentially consistent
  – Persistent
  – Fault tolerant
  – Don’t rely on clock sync for correctness
  – Thought to be too slow
• Conventional systems compromise on
  – Semantics (e.g. data consistency after failures)
  – Assumptions (e.g. clock sync for correctness)
  – API (e.g. append only)
  – Special hardware (e.g. FAB’s write timestamps)
• Paxos equaling the speed of a conventional system is a win
  – That we sometimes do better is a bonus

Take Away Point
• For datacenter-like systems that:
  – Value Consistency and Availability over Partition tolerance
  – Have operation latencies ≥ network latencies
• Paxos replicated state machines
  – Perform very well
  – While not compromising

Outline
• Background: Replicated State Machines and Paxos
• SMARTER and Gaios
• A new protocol for read-only operations
• Performance evaluation and comparison to primary-backup replication

Replicated State Machines
• For fault tolerance
  – Of any deterministic computation
  – Via replication
  – Replicas see the same sequence of inputs
• Paxos is a protocol for guaranteeing input ordering, even with:
  – Multiple clients
  – Unreliable networks
  – No synchronized clocks
  – Unlimited machine reboots
  – Some permanent stopping faults (i.e., disk losses)
  – But not Byzantine faults
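To make the "same sequence of inputs" point concrete, here is a minimal sketch, not taken from the slides: a toy deterministic key-value state machine (the names `KVReplica` and `replay` are hypothetical) replayed on three replicas. It only illustrates why input ordering matters; it does not implement Paxos itself.

```python
from dataclasses import dataclass, field


@dataclass
class KVReplica:
    """A deterministic state machine: a tiny key-value store (illustrative only)."""
    state: dict = field(default_factory=dict)

    def apply(self, op):
        # Each operation is a (kind, key, value) tuple.
        kind, key, value = op
        if kind == "put":
            self.state[key] = value
        elif kind == "delete":
            self.state.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {kind}")


def replay(log):
    """Apply every operation in log order to a fresh replica and return its state."""
    replica = KVReplica()
    for op in log:
        replica.apply(op)
    return replica.state


ordered_log = [
    ("put", "a", 1),
    ("put", "b", 2),
    ("put", "a", 3),
    ("delete", "b", None),
]

# Three replicas that see the same ordered inputs end in the same state.
replicas = [replay(ordered_log) for _ in range(3)]

# A replica that saw the same operations in a different order diverges.
diverged = replay(list(reversed(ordered_log)))
```

The replicas agree only because the operations are deterministic and arrive in one agreed order, which is the guarantee the ordering protocol has to provide.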

Non Trade-Off
• RSMs’ one-at-a-time execution model seems to be at odds with disks’ need to reorder IO for efficiency. It’s not.
• Analogous to an out-of-order processor.

Paxos Basics
• Paxos binds client requests to sequentially numbered slots.
• In normal operation requires a write to persistent store to survive power loss.
• Has a dynamically selected and changeable leader that drives the protocol.
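The slot binding described above can be sketched as leader-side bookkeeping. This is a simplified illustration under stated assumptions, not the SMARTER implementation: the class `SlotLog` and its methods are hypothetical, an "ack" stands for a member reporting that it has logged the proposal to persistent store, and ballots, leader election and recovery are left out entirely.

```python
class SlotLog:
    """Leader-side bookkeeping: one client request per numbered slot (sketch)."""

    def __init__(self, members):
        self.members = set(members)
        self.quorum = len(self.members) // 2 + 1  # simple majority
        self.requests = {}   # slot -> request
        self.acks = {}       # slot -> members that have logged the proposal
        self.next_slot = 0

    def propose(self, request):
        """Bind a request to the next free slot and return the slot number."""
        slot = self.next_slot
        self.next_slot += 1
        self.requests[slot] = request
        self.acks[slot] = set()
        return slot

    def record_ack(self, slot, member):
        """Record that a member has durably logged the proposal for a slot."""
        if member not in self.members:
            raise ValueError(f"unknown member: {member}")
        self.acks[slot].add(member)

    def is_committed(self, slot):
        """A slot is committed once a majority has logged it."""
        return len(self.acks[slot]) >= self.quorum

    def executable_prefix(self):
        """Requests that may execute: committed slots with no gap before them."""
        ready = []
        slot = 0
        while slot in self.requests and self.is_committed(slot):
            ready.append(self.requests[slot])
            slot += 1
        return ready


log = SlotLog(["A", "B", "C"])
s0 = log.propose("write-1")
s1 = log.propose("write-2")

# Slot 1 reaches a majority first, but cannot execute ahead of slot 0.
log.record_ack(s1, "A")
log.record_ack(s1, "B")
before = log.executable_prefix()

log.record_ack(s0, "A")
log.record_ack(s0, "C")
after = log.executable_prefix()
```

Proposals for several slots can be outstanding at once, while execution still follows slot order.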

Figure: Paxos message flow between the client, the leader and two members. Labels as extracted: Client Request, Proposal, Logging, Log Complete, ACK, Commit, Extra, Reply.

Figure: 4K Write Latency Timeline (One-at-a-Time Operations). Stages: Request Send, Proposal Send, Logging (first), Logging (second), ACK Send, Execute, Reply Send. Horizontal axis: Time (ms), 0 to 10.

Figure: Gaios Architecture. Client machine (user and kernel layers): Standard Application, NTFS, Gaios Disk Driver, SMARTER Client. Network. Server machine (user and kernel layers): SMARTER Server, Gaios RSM, Stream Store, Log, NTFS.

Getting Efficiency
• Mostly just lots of good engineering
  1. Pipelining
  2. Batched write behind
  3. Overlap fetching with logging
  4. Batching client requests
  5. Zero-copy data path
• Novel read-only operation protocol that allows consistent reads from any node
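As an illustration of item 4 (batching client requests), the sketch below groups pending requests so that each group can share a single log write. The slides do not say how Gaios forms batches; the function name `batch_requests` and the byte threshold are assumptions made for the example.

```python
def batch_requests(pending, max_batch_bytes):
    """Group (request_id, size_bytes) pairs into batches.

    Each batch is meant to be covered by one log write. A request larger
    than max_batch_bytes still gets a batch of its own.
    """
    batches = []
    current = []
    current_bytes = 0
    for request_id, size in pending:
        # Close the current batch if this request would overflow it.
        if current and current_bytes + size > max_batch_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(request_id)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


pending = [("r1", 4096), ("r2", 4096), ("r3", 8192), ("r4", 4096)]
batches = batch_requests(pending, max_batch_bytes=8192)
```

Here four requests need three log writes instead of four; with many small concurrent requests the saving grows, at the cost of some added latency for requests that wait for a batch to fill.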

Read Consistency Property
Not-Before Constraint: When a read-only request R completes, it reflects any data known by any client to be written at the time R was sent.

Read-Only Operations
• Read-only operations only need to run in one place
• Using all disks is crucial
• Dynamically selecting location helps
  – Avoid nodes that are writing

Figure: Read/Write Contention. A Stream Store Reader and a Stream Store Page Cleaner feed the same Disk Queue, where one read competes with writes drawn from the Dirty Page Pool. Note on the slide: randomize checkpoint timing across nodes.

Figure: read-only operation message flow between the client, the leader and two members. Labels: Read Request, Leadership Check, Leadership Reply, Read Complete, Client Reply.
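One plausible reading of this message flow, which the slides do not spell out, is sketched below: the leader notes the highest slot committed when the read arrives, confirms with a majority that it is still the leader, and then lets any replica that has executed at least up to that slot serve the read, preferring one that is not busy writing. All names (`Replica`, `leadership_confirmed`, `choose_reader`, `NotCaughtUp`) are hypothetical, and a real replica would more likely wait to catch up than be skipped.

```python
class NotCaughtUp(Exception):
    """Raised when a replica has not executed far enough to serve a read."""


class Replica:
    def __init__(self, name):
        self.name = name
        self.executed_slot = -1    # highest slot applied so far
        self.busy_writing = False  # e.g. in the middle of a checkpoint
        self.data = {}

    def apply(self, slot, key, value):
        # Writes are applied strictly in slot order.
        if slot != self.executed_slot + 1:
            raise ValueError("slots must be applied in order")
        self.data[key] = value
        self.executed_slot = slot

    def read(self, key, min_slot):
        # Serving a read before reaching min_slot could miss a completed write.
        if self.executed_slot < min_slot:
            raise NotCaughtUp(self.name)
        return self.data.get(key)


def leadership_confirmed(confirmations, group_size):
    """True if the leader plus the confirming members form a majority."""
    return confirmations + 1 > group_size // 2


def choose_reader(replicas, min_slot):
    """Prefer a caught-up replica that is not busy writing."""
    caught_up = [r for r in replicas if r.executed_slot >= min_slot]
    idle = [r for r in caught_up if not r.busy_writing]
    candidates = idle or caught_up
    return candidates[0] if candidates else None


a, b, c = Replica("a"), Replica("b"), Replica("c")
for r in (a, b, c):
    r.apply(0, "x", "old")
a.apply(1, "x", "new")
c.apply(1, "x", "new")
c.busy_writing = True

min_slot = 1  # highest slot committed when the read request arrived
reader = choose_reader([a, b, c], min_slot)
value = reader.read("x", min_slot)
```

Under these assumptions the Not-Before Constraint holds because any write a client already knows to be complete has a slot no higher than `min_slot`, and the chosen replica has executed at least that far.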

Figure: 4K Read Latency Timeline (One-at-a-Time Operations). Stages: Client Send, Leader Check, Execute, Reply. Horizontal axis: Time (ms), 0 to 10.

Primary-Backup Replication
• (Usually) Sends both read and write replies from the primary in order to achieve the read consistency property
• Uses leasing protocol for primary
  – No need for a quorum check on reads
  – Relies on clock sync for correctness, which in practice means it trades failover time for correctness
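The trade-off in the last bullet can be made concrete with a small sketch. It assumes a generic lease scheme, not any particular system: the primary serves while its own clock says the lease is valid, and a backup waits for the lease expiry plus an assumed bound on clock skew before taking over. The function names and the numbers are hypothetical.

```python
def primary_may_serve(now_on_primary, lease_expiry):
    """The primary keeps serving while its own clock says the lease is valid."""
    return now_on_primary < lease_expiry


def backup_may_take_over(now_on_backup, lease_expiry, max_clock_skew):
    """A backup waits out the lease plus the assumed worst-case clock skew."""
    return now_on_backup >= lease_expiry + max_clock_skew


lease_expiry = 10.0

# Suppose the primary's clock runs 0.5 s behind the backup's clock.
now_on_backup = 10.2
now_on_primary = now_on_backup - 0.5

# With no guard margin, both nodes believe they may act as primary.
split_brain = (primary_may_serve(now_on_primary, lease_expiry)
               and backup_may_take_over(now_on_backup, lease_expiry, 0.0))

# A 1 s guard margin avoids the overlap but delays failover by 1 s.
safe = not backup_may_take_over(now_on_backup, lease_expiry, 1.0)
```

A larger margin tolerates more skew but lengthens every failover, and if the real skew ever exceeds the assumed bound, two nodes can act as primary at once.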

Read Distribution
• Primary-Backup forces reads to one node, while SMARTER spreads them across all, which can matter for random reads
• P-B can achieve spreading by striping data across many groups and locating the primaries on different nodes; this spreading is static
• Implemented two versions of P-B:
  – Worst-case PB1 where all reads come from one node
  – Best-case PBN which uses round-robin reads

Figure: 8K Random Read Throughput (Lots of outstanding operations). Vertical axis: IO/s, 0 to 500. Horizontal axis: Replicas, 1 to 5. Series: Gaios, PBN, PB1, Local.

Transaction Processing
• Ran an industry-standard OLTP load over Microsoft SQL Server 2008.
• Critical factors: SQL log write latency, random read bandwidth.
• Even read/write ratio, mostly ~8K.

Figure: OLTP Performance (3 nodes, 50% read workload). Vertical axis: Normalized Transactions/s, 0% to 120%. Bars: Gaios, PBN, PB1.

Conclusion
• Paxos RSMs are fine for high-performance disk-based applications; it just takes careful engineering.
• In some cases, they outperform best-case P-B due to flexibility in directing reads.
• There is no need to compromise on semantics, buy special hardware, depend on clocks, etc.

Thank You!
Submit to FAST
Photo of Gaios, Paxos, Greece
