Approche Algorithmique des Systèmes Répartis (AASR)
Guillaume Pierre (guillaume.pierre@irisa.fr)
Based on a set of slides by Maarten van Steen, VU Amsterdam, Dept. Computer Science
07b: Consistency & Replication (2/2)
Contents
Chapter
01: Introduction
02: Architectures
03: Processes
04: Communication (1/2)
04: Communication (2/2)
05: Naming (1/2)
05: Naming (2/2)
06: Synchronization (1/2)
06: Synchronization (2/2)
07: Consistency & Replication (1/2)
07: Consistency & Replication (2/2)
Web applications

[Figure-only slides]
Scaling relational databases
Relational databases have many benefits:
A very powerful query language (SQL)
Strong consistency
Mature implementations
Well understood by developers
Etc.
But also a few drawbacks:
Poor elasticity (ability to change the processing capacity easily)
Poor scalability (ability to process arbitrary levels of load)
Problematic behavior in the presence of network partitions
Elasticity of relational databases
Relational databases were designed in the 1970s
Designed for mainframes (a single expensive machine)
Not for clouds (many weak machines being created/stopped at any time)
Master-slave replication:
1 master database processes and serializes all updates
N slaves receive updates from the master and process all reads
Designed mostly for fault-tolerance, not performance
How can we add a replica at runtime?
Take a snapshot of the database (very well supported by relational databases)
Copy the snapshot into the new replica
Apply all updates received since the snapshot
Add the new replica to the load-balancing group
This may take hours depending on the size of the database
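A toy, in-memory sketch of these four steps (the Master class, its update log, and the load-balancer list are illustrative stand-ins, not a real database API):

```python
class Master:
    """Toy master database: keeps its state plus a log of all updates."""
    def __init__(self):
        self.data, self.log = {}, []

    def update(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def snapshot(self):
        return dict(self.data), len(self.log)   # state copy + log position

def add_replica(master, balancer):
    snapshot, pos = master.snapshot()            # 1. take a snapshot
    replica = dict(snapshot)                     # 2. copy it to the new replica
    for key, value in master.log[pos:]:          # 3. apply updates received since
        replica[key] = value
    balancer.append(replica)                     # 4. join the load-balancing group
    return replica

master, balancer = Master(), []
master.update("x", 1)
add_replica(master, balancer)
print(balancer[0])   # {'x': 1}
```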
Scalability of relational databases
Assuming an unlimited number of machines, can we process arbitrary levels of load?
[Figure: throughput (transactions/second) vs. number of server machines, for PostgreSQL+DAS3 and CloudTPS+HBase+DAS3]
Problem: full replication
Each replica must process every update
Solution: partial replication
Each server contains a fraction of the total data
Updates can be confined to a small number of machines
8 / 39
Sharding
Sharding = shared-nothing architecture
The programmer splits the database into independent partitions
Customers A-M → Database server 1
Customers N-Z → Database server 2
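A minimal sketch of what such a partition criterion and request routing could look like (the shard map and server names are hypothetical):

```python
# Hypothetical shard map: the partition criterion is the customer's initial.
SHARDS = {
    "server1": ("A", "M"),   # Customers A-M -> Database server 1
    "server2": ("N", "Z"),   # Customers N-Z -> Database server 2
}

def route(customer_name: str) -> str:
    """Return the database server responsible for this customer."""
    first = customer_name[0].upper()
    for server, (lo, hi) in SHARDS.items():
        if lo <= first <= hi:
            return server
    raise ValueError(f"no shard covers customer {customer_name!r}")

print(route("Alice"))   # server1
print(route("Oscar"))   # server2
```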
Advantage: scalability
Each partition can work independently without processing the updates of other partitions
Drawback: all the work is left for the developer
Defining the partition criterion
Routing requests to the correct servers
Implementing queries which span multiple partitions
Implementing elasticity
Etc.
Implementing sharding correctly is very difficult!
The CAP Theorem
In a distributed system we want three important properties:
1. Consistency: readers always see the result of previous updates
2. Availability: the system always answers client requests
3. Partition tolerance: the system doesn't break down if the network gets partitioned
Brewer’s theorem: you cannot get all three simultaneously
You can get at most two of the three
Relational databases usually choose CA (Consistency and Availability)
NoSQL takes the problem upside down
NoSQL is designed with scalability in mind:
The database must be elastic
The database must be fully scalable
The database must tolerate machine failures
The database must tolerate network partitions
What’s the catch?
NoSQL must choose between AP and CP
Most NoSQL systems choose AP: they do not guarantee strong consistency
NoSQL systems do not support complicated queries
They do not support the SQL language
Only very simple operations!
Different NoSQL systems apply these principles differently
NoSQL data stores rely on DHT techniques
NoSQL data stores split data across nodes...
Excellent elasticity and scalability
...and replicate each data item on m nodes
For fault-tolerance
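Splitting and replicating data this way is typically done with consistent hashing; below is a minimal, generic sketch (not the code of any particular system) where each item lands on the m nodes that follow its hash on the ring:

```python
import bisect
import hashlib

def h(key: str) -> int:
    """Hash a key to a position on the ring."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, m=3):
        self.m = m                                   # replication degree
        self.ring = sorted((h(n), n) for n in nodes)

    def replicas(self, item_key):
        """The m nodes that follow the item's hash on the ring."""
        positions = [pos for pos, _ in self.ring]
        start = bisect.bisect(positions, h(item_key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(self.m)]

ring = Ring(["node1", "node2", "node3", "node4", "node5"], m=3)
print(ring.replicas("customer:42"))   # the 3 nodes storing this item
```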
If the network gets partitioned: serve requests within each partition
The system remains available
But clients will miss updates issued in the other partitions (bad consistency)
When the partition is resolved, updates from different partitions get merged
The two meanings of “Consistency”
1. For database experts: Consistency == referential integrity in a single database
To make things simple: unique keys are really unique, foreign keys map onto something, etc.
This is the "C" from ACID
2. For distributed systems experts: Consistency = a property of replicated data
To make things simple: all copies of the same data seem to have the same value at any time
Flexible consistency models
Some NoSQL data stores allow users to define the level of consistency they want
Replicate each data item over N servers
Associate each data item with a timestamp
Issue writes on all servers; consider a write to be successful when m servers have acknowledged
Read data from at least n servers (and return the freshest version to the client)
If m + n > N then we have strong consistency
For example: m = N, n = 1
But other possibilities exist: m = 1, n = N
Or anything in between: m = N/2 + 1, n = N/2 + 1
If m + n ≤ N then we have weak consistency
Faster
Example: Amazon Dynamo
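A hedged sketch of the quorum idea above, with in-memory dictionaries standing in for the N servers (real systems issue parallel RPCs and may contact any m or n servers, not always the same ones):

```python
import time

N, M, N_READ = 5, 3, 3                 # m + n > N  =>  strong consistency
servers = [dict() for _ in range(N)]   # server state: key -> (timestamp, value)

def write(key, value):
    """Consider the write successful once m servers have acknowledged."""
    ts = time.time()
    for server in servers[:M]:         # toy version: first m servers
        server[key] = (ts, value)

def read(key):
    """Read from n servers and return the freshest version."""
    versions = [s[key] for s in servers[:N_READ] if key in s]
    return max(versions)[1] if versions else None   # highest timestamp wins

write("x", "hello")
print(read("x"))   # "hello": with m + n > N the read and write sets overlap
```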
Why do people use NoSQL?
Flexible data schemas
In NoSQL data stores there is no need to impose a strict data schema
The data store treats each row as a (key, value) pair anyway
No requirement is imposed on the value ⇒ no fixed data schema
Not the same as empty values!
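A tiny illustration of this (key, value) model, assuming a plain dictionary as a stand-in for the data store:

```python
# A plain dict standing in for the data store: each row is a (key, value) pair.
store = {}

# Two rows with different fields: no fixed schema is imposed on the values.
store["user:1"] = {"name": "Alice", "email": "alice@example.org"}
store["user:2"] = {"name": "Bob", "phone": "+1-555-0100", "vip": True}

# A missing field is simply absent, which is not the same as an empty value:
print(store["user:1"].get("phone"))   # None: the field does not exist
print(store["user:2"]["vip"])         # True
```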
Scaling the database tier
                     Repl. SQL        Sharding     NoSQL
                     (e.g., MySQL)                 (e.g., Bigtable)
Scalability          poor             good         good
Complex queries      yes              hard         no
Fault tolerance      good             manual       good
Consistency          strong           per-shard    weak
Consistency issues in NoSQL databases
NoSQL databases scale because of heavy data partitioning
Minimum coordination between partitions
Consistency (worst case): eventual consistency
Updates will become visible at some point in the future
Multiple updates are propagated independently from each other
E.g., Amazon's SimpleDB
Consistency (best case): single-row transactions
Transactional updates to a single database row
No support for multiple-row transactions
E.g., Google's Bigtable, Cassandra, etc.
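A minimal sketch of what a single-row transaction buys you, using a per-row lock as a stand-in for the store's internal mechanism (illustrative only, not Bigtable's or Cassandra's actual implementation):

```python
import threading

class SingleRowStore:
    """Atomic read-modify-write on one row; nothing across rows."""
    def __init__(self):
        self.rows = {}
        self.locks = {}

    def update_row(self, row, fn):
        lock = self.locks.setdefault(row, threading.Lock())
        with lock:                          # per-row atomicity only
            self.rows[row] = fn(self.rows.get(row))

store = SingleRowStore()
store.update_row("counter", lambda v: (v or 0) + 1)
store.update_row("counter", lambda v: (v or 0) + 1)
print(store.rows["counter"])   # 2, but two rows cannot be updated atomically
```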
Position
We can guarantee multiple-row transactions in NoSQL databases without compromising their scalability or fault-tolerance properties.
The secret: exploit the properties of Web applications
Transactions are short-lived
Transactions span a limited number of well-identified data items
Question: In fact this statement cannot be true. Why?
Availability vs. Consistency
Strictly speaking, it is impossible to fulfill my promises entirely
The CAP theorem states that one cannot support strong Consistency and high Availability in the presence of network Partitions
A scalable system necessarily faces occasional partitions
NoSQL databases favor high availability
And deliver best-effort consistency
CloudTPS focuses on consistency first
At the cost of unavailability in extreme failure/partition cases
Note: a machine failure is not an extreme case...
System Model

[Figure-only slides]
Atomicity

Atomicity: All operations succeed or none of them does
No partially executed transactions!
Solution: 2-phase commit across the LTMs which contain relevant data items
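A compact sketch of 2-phase commit across LTMs; the LTM class and its prepare/commit/abort methods are illustrative stand-ins, not the CloudTPS code:

```python
class LTM:
    """In-memory stand-in for a local transaction manager."""
    def __init__(self):
        self.staged = {}   # tx_id -> pending writes
        self.store = {}    # committed data items

    def prepare(self, tx_id, writes):
        self.staged[tx_id] = writes   # stage the updates, vote COMMIT
        return True                   # returning False would vote ABORT

    def commit(self, tx_id):
        self.store.update(self.staged.pop(tx_id))

    def abort(self, tx_id):
        self.staged.pop(tx_id, None)

def two_phase_commit(tx_id, participants):
    """participants: {ltm: writes}. All operations succeed or none does."""
    if all(ltm.prepare(tx_id, writes) for ltm, writes in participants.items()):
        for ltm in participants:      # phase 2: everyone voted COMMIT
            ltm.commit(tx_id)
        return "committed"
    for ltm in participants:          # some participant voted ABORT
        ltm.abort(tx_id)
    return "aborted"

a, b = LTM(), LTM()
print(two_phase_commit("tx1", {a: {"x": 1}, b: {"y": 2}}))   # committed
```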
Consistency
Consistency: Each transaction leaves the database in an internally consistent state
"Consistency" in this context means: logical consistency of different data items
Very different from the consistency of a single replicated data item
Solution: we assume that transactions are semantically correct
Isolation

Isolation: The system behaves as if transactions were processed sequentially
If the system allows concurrent transactions, then conflicting transactions must be serialized
Solution: Timestamp ordering
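A bare-bones sketch of timestamp ordering for write-write conflicts (real schedulers also track read timestamps per item; everything here is illustrative):

```python
import itertools

ts_counter = itertools.count(1)   # transactions get unique, increasing timestamps
last_write_ts = {}                # data item -> timestamp of its latest writer

def begin():
    return next(ts_counter)

def write(tx_ts, item, value, store):
    """Reject writes arriving out of timestamp order."""
    if last_write_ts.get(item, 0) > tx_ts:
        raise RuntimeError(f"abort tx {tx_ts}: {item} was written by a younger tx")
    last_write_ts[item] = tx_ts
    store[item] = value

store = {}
t1, t2 = begin(), begin()
write(t2, "x", 42, store)     # the younger transaction writes x first
try:
    write(t1, "x", 7, store)  # the older one arrives late: it must abort
except RuntimeError as e:
    print(e)
```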
Durability
Durability: Once a transaction has been committed it cannot be undone
Even in the case of server failures or network partitions...
Long-term solution: Transaction managers checkpoint updates into the cloud storage service
NoSQL databases guarantee durability
But updates may not be visible immediately
Short-term solution: Each data item is hosted by N transaction managers
We can support the simultaneous failure of N − 1 servers
Evaluation setup
CloudTPS runs on top of:
HBase in our local cluster
SimpleDB in the Amazon EC2 cloud
Workload derived from TPC-W
Web application benchmark which models an online bookstore (ported to NoSQL databases)
Standardized workloads
Workload
How many LTMs participate in each transaction?
When deploying the system across 40 LTMs:
[Figure: histogram of the number of transactions (log scale) per number of accessed LTMs]
Scalability evaluation
When using N transaction managers, how many transactions/second can we sustain?
HBase + DAS3: we want 99% of transactions < 100 ms
[Figure: maximum throughput (TPS) vs. number of LTMs, HBase + DAS3]
SimpleDB + EC2: we want 90% of transactions < 100 ms
[Figure: maximum throughput (TPS) vs. number of LTMs, SimpleDB + EC2, for Standard Small and HighCPU Medium instances]
We have linear scalability
Tolerance to failures and partitions
[Figure: throughput (TPS) over time, showing an LTM failure and a network partition]
CloudTPS recovers from a node failure in ∼ 18.6 sec
0.5 sec to rebuild a new system membership
12.2 sec to recover transactions from the failed LTM (timeouts)
5.9 sec to reorganize data placement among surviving LTMs
CloudTPS aborts all transactions during a partition
Recovers ∼ 135 ms after the end of the partition
Optimization: memory management
Loading all data items in memory may create problems
Better: keep only the most frequently accessed items
We can load the others from the NoSQL data store when necessary
This creates a tradeoff: memory consumption vs. performance
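One plausible way to realize such a buffer is an LRU policy; the sketch below is illustrative (the real system may well use a different replacement policy), with a plain dict standing in for the NoSQL store:

```python
from collections import OrderedDict

class LTMBuffer:
    """Keep the most recently used items in memory; fall back to the store."""
    def __init__(self, capacity, backing_store):
        self.capacity = capacity
        self.store = backing_store          # stand-in for the NoSQL data store
        self.buf = OrderedDict()            # item key -> value, in LRU order

    def get(self, key):
        if key in self.buf:
            self.buf.move_to_end(key)       # mark as recently used
            return self.buf[key]
        value = self.store[key]             # slow path: load from the store
        self.buf[key] = value
        if len(self.buf) > self.capacity:
            self.buf.popitem(last=False)    # evict the least recently used item
        return value

backing = {f"item{i}": i for i in range(100)}
cache = LTMBuffer(capacity=10, backing_store=backing)
print(cache.get("item7"))   # 7, loaded from the store and now buffered
```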
[Figure: 99th percentile response time (ms) vs. LTM buffer size (number of data items), for 10k-item and 1M-item datasets]
Keeping only 10% of the data in the LTMs barely impacts performance