

SLIDE 1

Scalability and Replication

Marco Serafini

COMPSCI 532 Lecture 13

SLIDE 2

Scalability


SLIDE 3

Scalability

  • Ideal world
  • Linear scalability
  • Reality
  • Bottlenecks
  • For example: central coordinator
  • When do we stop scaling?

[Plot: speedup vs. parallelism, showing the ideal linear curve and the reality curve]

SLIDE 4

Scalability

  • Capacity of a system to improve performance by increasing the amount of resources available
  • Typically, resources = processors
  • Strong scaling
  • Fixed total problem size, more processors
  • Weak scaling
  • Fixed per-processor problem size, more processors (see the measurement sketch below)
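
A minimal sketch of how the two regimes are typically measured, assuming a hypothetical parallel_solve(problem_size, num_processors) function that stands in for the real parallel computation (the function and numbers are illustrative, not from the slides):

import time

def parallel_solve(problem_size, num_processors):
    # Hypothetical stand-in for the real parallel computation:
    # each processor handles problem_size / num_processors work.
    work_per_processor = problem_size // num_processors
    return sum(i * i for i in range(work_per_processor))

def run_time(problem_size, num_processors):
    start = time.perf_counter()
    parallel_solve(problem_size, num_processors)
    return time.perf_counter() - start

base_time = run_time(1_000_000, 1)

# Strong scaling: fixed total problem size, more processors.
for p in (1, 2, 4, 8):
    print("strong scaling", p, "speedup:", base_time / run_time(1_000_000, p))

# Weak scaling: fixed per-processor problem size, more processors.
for p in (1, 2, 4, 8):
    print("weak scaling", p, "time:", run_time(1_000_000 * p, p))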
SLIDE 5

Scaling Up and Out

  • Scaling Up
  • More powerful server (more cores, memory, disk)
  • Single server (or fixed number of servers)
  • Scaling Out
  • Larger number of servers
  • Constant resources per server
SLIDE 6

Scalability! But at what COST?

Frank McSherry (Unaffiliated), Michael Isard (Microsoft Research), Derek G. Murray (Unaffiliated)

Abstract

We offer a new metric for big data platforms, COST, or the Configuration that Outperforms a Single Thread. The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation. COST weighs a system’s scalability against the overheads introduced by the system, and indicates the actual performance gains of the system, without rewarding systems that bring substantial but parallelizable overheads. We survey measurements of data-parallel systems recently reported in SOSP and OSDI, and find that many systems have either a surprisingly large COST, often hundreds of cores, or simply underperform one thread for all of their reported configurations.

[Figure 1 plots: left, speed-up vs. cores; right, running time in seconds vs. cores; system A and system B]

Figure 1: Scaling and performance measurements for a data-parallel algorithm, before (system A) and after (system B) a simple performance optimization. The unoptimized implementation “scales” far better, despite (or rather, because of) its poor performance. The authors argue that many published big data systems more closely resemble system A than they resemble system B.

SLIDE 7

What Does This Plot Tell You?

[Plot: speed-up vs. cores for system A and system B]

SLIDE 8

How About Now?

[Plot: running time in seconds vs. cores for system A and system B]

SLIDE 9

COST

  • Configuration that Outperforms a Single Thread (COST)
  • The number of cores after which the system achieves a speedup over a single core (see the sketch below)

[Plots: running time in seconds vs. cores. Left (single iteration): GraphLab and Naiad vs. the single-threaded Vertex SSD and Hilbert RAM baselines. Right (10 iterations): GraphX vs. Vertex SSD and Hilbert RAM]
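
A minimal sketch of how COST could be computed from (cores, seconds) measurements; the numbers are made up for illustration and are not taken from the paper:

def cost(system_measurements, single_thread_seconds):
    # COST = smallest core count at which the system beats one thread,
    # or None if it never does (unbounded COST).
    beating = [cores for cores, seconds in system_measurements
               if seconds < single_thread_seconds]
    return min(beating) if beating else None

# Illustrative numbers only.
measurements = [(1, 900.0), (16, 300.0), (64, 120.0), (128, 80.0)]
print(cost(measurements, single_thread_seconds=100.0))  # -> 128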

SLIDE 10

Possible Reasons for High COST

  • Restricted API
  • Limits algorithmic choice
  • Makes assumptions
  • MapReduce: No memory-resident state
  • Pregel: program can be specified as “think-like-a-vertex”
  • BUT also simplifies programming
  • Cluster nodes are often lower-end than a laptop
  • Implementation adds overhead
  • Coordination
  • Cannot use application-specific optimizations
SLIDE 11

Why not Just a Laptop?

  • Capacity
  • Large datasets, complex computations don’t fit in a laptop
  • Simplicity, convenience
  • Nobody ever got fired for using Hadoop on a cluster
  • Integration with toolchain
  • Example: ETL → SQL → Graph computation on Spark
SLIDE 12

Disclaimers

  • Graph computation is peculiar
  • Some algorithms are computationally complex…
  • Even for small datasets
  • Good use case for single-server implementations
  • Similar observations for Machine Learning
SLIDE 13

Replication


SLIDE 14

Replication

  • Pros
  • Good for reads: can read any replica (if consistent)
  • Fault tolerance
  • Cons
  • Bad for writes: must update multiple replicas
  • Coordination for consistency
SLIDE 15

Replication protocol

  • Mediates client-server communication
  • Ideally, clients cannot “see” replication

[Diagram: each client talks to a replication agent; replication agents run the replication protocol among the replicas]

SLIDE 16

Consistency Properties

  • Strong consistency
  • All operations take effect in some total order in every possible execution of the system
  • Linearizability: the total order respects real-time ordering
  • Sequential consistency: a total order is sufficient (it need not respect real-time ordering)
  • Weak consistency
  • We will talk about that in another lecture
  • Many other semantics
SLIDE 17

What to Replicate?

  • Read-only objects: trivial
  • Read-write objects: harder
  • Need to deal with concurrent writes
  • Only the last write matters: previous writes are overwritten
  • Read-modify-write objects: very hard
  • Current state is function of history of previous requests
  • We consider deterministic objects
SLIDE 18

Fault Assumptions

  • Every fault-tolerant system is based on a fault assumption
  • We assume that up to f replicas can fail (crash)
  • The total number of replicas is determined based on f
  • If the system has more than f failures, there is no guarantee
SLIDE 19

Synchrony Assumptions

  • Consider the following scenario
  • Process s sends a message to process r and waits for a reply
  • The reply from r does not arrive at s before a timeout
  • Can s assume that r has crashed?
  • We call a system asynchronous if we do not make this assumption
  • Otherwise we call it (partially) synchronous, because we are making additional assumptions on the speed of round trips
SLIDE 20

Distributed Shared Memory (R/W)

  • Simple case
  • 1 writer client, m reader clients
  • n replicas, up to f faulty ones
  • Asynchronous system
  • Clients send messages to all n replicas and wait for n-f replies (otherwise they may hang forever waiting for crashed replicas)
  • Q: How many replicas do we need to tolerate 1 fault?
  • A: 2 are not enough
  • The writer and readers can only wait for 1 reply (otherwise they block forever if a replica crashes)
  • The writer and readers may contact disjoint sets of replicas
SLIDE 21

Quorum Intersection

  • To tolerate f faults, use n = 2f+1 replicas
  • Writes and reads wait for replies from a set of n-f = f+1 replicas (i.e., a majority), called a majority quorum

  • Two majority quorums always intersect!

[Diagram: the writer sends w(v) to all replicas and waits for n-f acks; the reader sends r to all replicas, waits for n-f replies, and returns v]
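
A small self-contained check (mine, not from the slides) of the intersection claim: with n = 2f+1 replicas, any two quorums of size n-f = f+1 share at least one replica:

from itertools import combinations

def majority_quorums_intersect(f):
    n = 2 * f + 1
    quorum_size = n - f  # f + 1, a majority
    quorums = list(combinations(range(n), quorum_size))
    return all(set(q1) & set(q2) for q1 in quorums for q2 in quorums)

for f in range(1, 4):
    print(f, majority_quorums_intersect(f))  # True for every f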

SLIDE 22

Consistency is Expensive

  • Q: How to get linearizability?
  • A: Reader needs to write back to a quorum

[Diagram: (1) the writer sends w(v,t) to all replicas and (2) waits for n-f acks; a replica sets vi = v only if t > ti. (1) The reader sends r, (2) waits for n-f replies (vi,ti), (3) writes back the (vi,ti) with the maximum ti, and (4) waits for n-f acks]

Reference: Attiya, Bar-Noy, Dolev. “Sharing memory robustly in message-passing systems”
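
A minimal single-process simulation of this read with write-back (the protocol of Attiya, Bar-Noy, and Dolev referenced above); replicas are plain dictionaries and all names are my own, so this is a sketch of the idea rather than the paper's implementation:

F = 1
N = 2 * F + 1
QUORUM = N - F  # f + 1

# Each replica stores a (value, timestamp) pair.
replicas = [{"value": None, "timestamp": 0} for _ in range(N)]

def replica_write(r, value, timestamp):
    # A replica overwrites its state only for a newer timestamp.
    if timestamp > r["timestamp"]:
        r["value"], r["timestamp"] = value, timestamp
    return "ack"

def write(value, timestamp, reachable):
    acks = [replica_write(replicas[i], value, timestamp) for i in reachable]
    assert len(acks) >= QUORUM  # wait for n-f acks

def read(reachable):
    # Phase 1: collect (value, timestamp) pairs from a quorum, pick the newest.
    replies = [(replicas[i]["value"], replicas[i]["timestamp"]) for i in reachable]
    assert len(replies) >= QUORUM
    value, timestamp = max(replies, key=lambda vt: vt[1])
    # Phase 2: write back so that later reads cannot return an older value.
    write(value, timestamp, reachable)
    return value

write("v1", 1, reachable=[0, 1])   # the writer reaches quorum {0, 1}
print(read(reachable=[1, 2]))      # the reader's quorum {1, 2} intersects it -> "v1"

The write-back in phase 2 is what prevents a later read from returning an older value, which is exactly the scenario ruled out on the next slide.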

SLIDE 23

Why Write Back?

  • We want to avoid this scenario
  • Assume the initial value is v = 4
  • No valid total order that respects real-time order exists in this execution

[Diagram: the writer writes v = 5; afterwards Reader 1 reads v → 5; later still, Reader 2 reads v → 4]

SLIDE 24

State Machine Replication (SMR)

  • Read-modify-write objects
  • Assume deterministic state machine
  • Consistent sequence of inputs (consensus)

[Diagram: concurrent client requests R1, R2, R3 go through consensus, which produces a consistent decision on a sequential execution order (e.g. R2, R1, R3); each replica's state machine (SM) applies that order, producing consistent outputs]
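
A tiny sketch (not from the slides) of why determinism matters here: replicas that apply the same decided sequence of requests to the same deterministic state machine end up in the same state and produce the same outputs.

class Counter:
    """A deterministic state machine whose state is a single integer."""
    def __init__(self):
        self.state = 0

    def apply(self, request):
        op, arg = request
        if op == "add":
            self.state += arg
        elif op == "set":
            self.state = arg
        return self.state

# Requests arrive concurrently, but consensus fixes one order, e.g. R2, R1, R3.
decided_order = [("set", 10), ("add", 5), ("add", -3)]

replicas = [Counter() for _ in range(3)]
for request in decided_order:
    outputs = [sm.apply(request) for sm in replicas]
    assert len(set(outputs)) == 1  # all replicas produce the same output

print([sm.state for sm in replicas])  # [12, 12, 12]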

SLIDE 25

Impossibility Result

  • Fischer, Lynch, Paterson (FLP) result
  • “It is impossible to reach distributed consensus in an asynchronous system with one faulty process” (because fault detection is not accurate)
  • Implication: practical consensus protocols are
  • Always safe: they never allow an inconsistent decision
  • Live (terminating) only in periods when additional synchrony assumptions hold; in periods when these assumptions do not hold, the protocol may stall and make no progress

SLIDE 26

Leader Election

  • Consider the following scenario
  • There are n replicas, of which up to f can fail
  • Each replica has a pre-defined unique ID
  • Simple leader election protocol (see the sketch below)
  • Periodically, every T seconds, each replica sends a heartbeat to all other replicas
  • If a replica p does not receive a heartbeat from a replica r within T + D seconds from r's last heartbeat, then p considers r faulty (D = maximum assumed message delay)
  • Each replica considers as leader the non-faulty replica with the lowest ID
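
A minimal sketch of the failure-detection and leader rules stated above; the helper names and sample timestamps are mine:

T = 1.0   # heartbeat period (seconds)
D = 0.5   # maximum assumed message delay (seconds)

def suspected_faulty(last_heartbeat, now):
    """Replica r is considered faulty if its last heartbeat is older than T + D."""
    return now - last_heartbeat > T + D

def current_leader(replica_ids, last_heartbeats, now):
    """The leader is the non-faulty replica with the lowest ID."""
    alive = [r for r in replica_ids
             if not suspected_faulty(last_heartbeats[r], now)]
    return min(alive) if alive else None

# Example: replica 1's heartbeat is late, so replica 2 is seen as leader.
now = 10.0
last_heartbeats = {1: 8.0, 2: 9.5, 3: 9.6}
print(current_leader([1, 2, 3], last_heartbeats, now))  # -> 2

During synchronous periods all replicas see the same set of non-faulty replicas and therefore agree on the leader; during asynchronous glitches they may briefly disagree, which is the situation described on the next slide.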

SLIDE 27

Eventual Single Leader Assumption

  • Typically, a system respects its synchrony assumptions
  • All heartbeats take at most D to arrive
  • All replicas elect the same leader
  • In the remaining asynchronous periods
  • Some heartbeat might take more than D to arrive
  • Replicas might disagree over who is faulty and who is not
  • Different replicas might see different leaders
  • Eventually, all replicas see a single leader
  • Asynchronous periods are glitches that are limited in time
SLIDE 28

The Paxos Protocol

  • Paxos is a consensus protocol
  • All replicas start with their own proposal
  • In SMR, a proposal is a batch of requests the replica has received from clients, ordered according to the order in which the replica received them
  • Eventually, all replicas decide the same proposal
  • In SMR, this is the batch of requests to be executed next
  • Paxos terminates when there is a single leader
  • The assumption is that eventually there will be a single leader
  • Paxos potentially stalls when there are multiple leaders
  • But it prevents divergent decisions during these asynchronous periods

SLIDE 29

Paxos (Simplified)

Newly elected leader:
  • Picks a unique ballot number b; it has its own proposed value v
  • Sends read(b) to all replicas and waits for n-f replies
  • If some reply is (vi, bi), sets v to the vi with the highest bi
  • Sends the proposal (v, b) to all replicas
  • Waits for n-f acks, then decides on v and broadcasts the decision

Replica, on receiving read(b):
  • If it has previously accepted a proposal (vi, bi) and b > bi: replies with (vi, bi) and promises not to accept messages with ballot < b
  • If it has no prior accepted proposal: replies with an ack (making the same promise)

Replica, on receiving proposal (v, b):
  • Accepts (v, b) unless this breaks a promise; if it accepts, replies with an ack

Reference: L. Lamport. “Paxos made simple”

If progress gets stuck (not enough replies), the leader picks a larger ballot number and restarts the protocol. Eventually, there will be a single leader with a large enough ballot number which completes all the steps
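
A minimal single-decree sketch along the lines of this slide, with acceptors as objects in one process; the class and function names are mine, and messaging, retries with larger ballots, and broadcasting the decision are omitted:

class Acceptor:
    def __init__(self):
        self.promised = 0       # highest ballot promised so far
        self.accepted = None    # (value, ballot) of the last accepted proposal

    def on_read(self, ballot):
        # Phase 1: promise not to accept smaller ballots and report any
        # previously accepted proposal.
        if ballot > self.promised:
            self.promised = ballot
            return self.accepted or "ack"
        return None  # ignored: would break an earlier promise

    def on_propose(self, value, ballot):
        # Phase 2: accept unless this breaks the promise made in phase 1.
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (value, ballot)
            return "ack"
        return None

def run_leader(acceptors, own_value, ballot, quorum):
    # Phase 1: read from n-f acceptors.
    replies = [r for r in (a.on_read(ballot) for a in acceptors) if r is not None]
    if len(replies) < quorum:
        return None  # stuck: retry later with a larger ballot
    accepted = [r for r in replies if r != "ack"]
    # Adopt the value with the highest ballot among prior acceptances, if any.
    value = max(accepted, key=lambda vb: vb[1])[0] if accepted else own_value
    # Phase 2: send the proposal (value, ballot) and wait for n-f acks.
    acks = [a for a in (acc.on_propose(value, ballot) for acc in acceptors) if a]
    return value if len(acks) >= quorum else None

F = 1
acceptors = [Acceptor() for _ in range(2 * F + 1)]
print(run_leader(acceptors, own_value="x", ballot=1, quorum=F + 1))  # -> "x"

Because an acceptor ignores ballots smaller than the one it has promised, a stale leader with a smaller ballot cannot overwrite a proposal chosen under a higher ballot, which is the invariant stated on the next slide.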

SLIDE 30

Properties

  • Definition of chosen proposal (v,b):
  • Accepted by a majority of replicas at a given point in time
  • Proposal (v,b) decided by one replica → (v,b) chosen at some point in time
  • Invariant
  • Once (v,b) is chosen, future proposals (v’, b’) from different leaders such that b’ > b have v = v’
  • Note that proposals from old leaders cannot overwrite the ones from newer leaders
SLIDE 31

Typical Applications of Paxos

  • State machine replication is hard
  • Hard to implement: consensus is only one of the problems
  • Writing deterministic applications on top of SMR is hard
  • Typical approach: use a system that uses consensus
  • Storage systems
  • Coordination services to keep system metadata
  • Google Chubby lock server uses Paxos
  • Apache ZooKeeper uses a variant of Paxos
  • ZooKeeper is used by Apache HBase, Kafka, …
SLIDE 32

Transactions


SLIDE 33

How About Multiple Objects?

  • Transaction: read and modify multiple objects

begin txn
  write z = 2
  read x
  read y
  if x > y
    write y = x
    commit
  else
    abort
end txn

SLIDE 34

ACID Properties

  • Guarantees of a storage system / DBMS
  • Atomicity: All or nothing
  • Consistency: Respect application invariants (e.g. balance > 0)
  • Isolation: Transactions run as if no concurrency
  • Durability: Committed transactions are persisted
  • Consistency here has a different meaning!
  • Consistency with single objects relates to Isolation with transactions

SLIDE 35

Isolation Levels

  • Serializability: total order of transactions
  • Strict serializability: total + real-time order
  • Snapshot isolation
  • Read from consistent snapshots
  • Writes only visible inside transaction until commit
  • Abort if writes conflict
  • Many others
SLIDE 36

Distributed Transactions

  • Transactions on objects on different nodes
  • Typically expensive
  • Two-phase commit protocol (see the coordinator sketch below)
  • Voting (prepare) phase
  • Coordinator sends query
  • Participants execute and send back their vote (commit or abort) to the coordinator
  • Commit phase
  • Coordinator waits for replies from all participants
  • If all participants vote commit, the coordinator sends a commit request to the participants; otherwise it sends an abort request
  • Participants send an acknowledgement; the coordinator terminates the transaction
  • Comments
  • Simplified description: logging to disk is abstracted away
  • Q: fault tolerant?
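
A minimal sketch of the coordinator side of the two phases described above; participants are stubbed out and, as the slide notes, logging to disk is omitted:

def two_phase_commit(participants, transaction):
    # Voting (prepare) phase: every participant executes and sends back a vote.
    votes = [p.prepare(transaction) for p in participants]

    # Commit phase: commit only if all participants voted to commit.
    decision = "commit" if all(v == "commit" for v in votes) else "abort"
    for p in participants:
        if decision == "commit":
            p.commit(transaction)
        else:
            p.abort(transaction)
    return decision

class Participant:
    def __init__(self, vote):
        self.vote = vote
    def prepare(self, txn):
        return self.vote          # "commit" or "abort"
    def commit(self, txn):
        pass                      # apply the transaction's writes
    def abort(self, txn):
        pass                      # roll back

print(two_phase_commit([Participant("commit"), Participant("commit")], "txn1"))  # commit
print(two_phase_commit([Participant("commit"), Participant("abort")], "txn2"))   # abort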