SLIDE 1

Approche Algorithmique des Systèmes Répartis (AASR)

Guillaume Pierre

guillaume.pierre@irisa.fr

Based on a slide set by Maarten van Steen, VU Amsterdam, Dept. of Computer Science

07b: Consistency & Replication (2/2)

SLIDE 2

Contents

Chapter
01: Introduction
02: Architectures
03: Processes
04: Communication (1/2)
04: Communication (2/2)
05: Naming (1/2)
05: Naming (2/2)
06: Synchronization (1/2)
06: Synchronization (2/2)
07: Consistency & Replication (1/2)
07: Consistency & Replication (2/2)

SLIDES 3-5

Web applications

SLIDE 6

Scaling relational databases

Relational databases have many benefits:

- A very powerful query language (SQL)
- Strong consistency
- Mature implementations
- Well-understood by developers
- Etc.

But also a few drawbacks:

- Poor elasticity (the ability to change processing capacity easily)
- Poor scalability (the ability to process arbitrary levels of load)
- Problematic behavior in the presence of network partitions

SLIDE 7

Elasticity of relational databases

Relational databases were designed in the 1970s

- Designed for mainframes (a single expensive machine)
- Not for clouds (many weak machines being created/stopped at any time)

Master-slave replication:

- 1 master database processes and serializes all updates
- N slaves receive updates from the master and process all reads
- Designed mostly for fault-tolerance, not performance

How can we add a replica at runtime?

- Take a snapshot of the database (very well supported by relational databases)
- Copy the snapshot into the new replica
- Apply all updates received since the snapshot
- Add the new replica to the load-balancing group

This may take hours depending on the size of the database.
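The procedure reads naturally as pseudocode. A minimal Python sketch, assuming hypothetical helpers (take_snapshot, restore_snapshot, read_log_since, replay, and a load_balancer object) rather than any real database's API:

```python
def add_replica(master, new_replica, load_balancer):
    """Bring a new read replica online while the master keeps running.

    All methods used here are hypothetical placeholders for the
    corresponding database-specific operations.
    """
    # 1. Snapshot the master and remember the matching log position.
    snapshot, log_position = master.take_snapshot()

    # 2. Copy the snapshot into the new replica (potentially hours of work).
    new_replica.restore_snapshot(snapshot)

    # 3. Catch up: apply all updates issued since the snapshot. We loop
    #    because new updates keep arriving while we replay the old ones.
    while True:
        updates = master.read_log_since(log_position)
        if not updates:
            break
        log_position = new_replica.replay(updates)

    # 4. Only now may the replica join the load-balancing group.
    load_balancer.add(new_replica)
```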

SLIDE 8

Scalability of relational databases

Assuming an unlimited number of machines, can we process arbitrary levels of load?

[Graph: throughput (transactions/second, up to 20,000) vs. number of server machines (up to 60), comparing PostgreSQL+DAS3 with CloudTPS+HBase+DAS3]

Problem: full replication

Each replica must process every update

Solution: partial replication

- Each server contains a fraction of the total data
- Updates can be confined to a small number of machines

SLIDE 9

Sharding

Sharding = shared-nothing architecture
The programmer splits the database into independent partitions

- Customers A-M → Database server 1
- Customers N-Z → Database server 2
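As an illustration, a minimal sketch of the routing step for this exact partitioning; the server names and the route helper are invented for the example:

```python
# Map each customer to a shard based on the first letter of their name.
SHARDS = {
    "server1": ("A", "M"),  # Customers A-M
    "server2": ("N", "Z"),  # Customers N-Z
}

def route(customer_name: str) -> str:
    """Return the database server responsible for this customer."""
    first = customer_name[0].upper()
    for server, (lo, hi) in SHARDS.items():
        if lo <= first <= hi:
            return server
    raise ValueError(f"No shard covers customer {customer_name!r}")

assert route("Alice") == "server1"
assert route("Zoe") == "server2"
```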

Advantage: scalability

Each partition can work independently without processing the updates of other partitions

Drawback: all the work is left for the developer

- Defining the partition criterion
- Routing requests to the correct servers
- Implementing queries which span multiple partitions
- Implementing elasticity
- Etc.

Implementing sharding correctly is very difficult!

SLIDE 10

The CAP Theorem

In a distributed system we want three important properties:

1. Consistency: readers always see the result of previous updates
2. Availability: the system always answers client requests
3. Partition tolerance: the system doesn’t break down if the network gets partitioned

Brewer’s theorem: you cannot get all three simultaneously

You can get at most two of the three
Relational databases usually implement AC

SLIDE 11

NoSQL takes the problem upside down

NoSQL is designed with scalability in mind:

- The database must be elastic
- The database must be fully scalable
- The database must tolerate machine failures
- The database must tolerate network partitions

What’s the catch?

NoSQL must choose between AP and CP

Most NoSQL systems choose AP: they do not guarantee strong consistency

NoSQL systems do not support complicated queries

- They do not support the SQL language
- Only very simple operations!

Different NoSQL systems apply these principles differently

SLIDE 12

NoSQL data stores rely on DHT techniques

NoSQL data stores split data across nodes...

Excellent elasticity and scalability

...and replicate each data item on m nodes

For fault-tolerance

If the network gets partitioned: serve requests within each partition

- The system remains available
- But clients will miss updates issued in the other partitions (bad consistency)
- When the partition is resolved, updates from different partitions get merged
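As an illustration of this placement scheme, a toy hash ring where each item is stored on the m nodes that follow its position clockwise; this is a generic DHT sketch under our own assumptions, not the code of any particular data store:

```python
import hashlib
from bisect import bisect

def h(key: str) -> int:
    """Hash a key (or node name) to a position on the ring."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, m=3):
        self.m = m  # replication degree: each item lives on m nodes
        self.ring = sorted((h(n), n) for n in nodes)

    def replicas(self, key: str):
        """The m nodes responsible for this key: the first node
        clockwise from the key's position, plus its successors."""
        i = bisect(self.ring, (h(key),)) % len(self.ring)
        return [self.ring[(i + k) % len(self.ring)][1] for k in range(self.m)]

ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"], m=3)
print(ring.replicas("customer:42"))  # three distinct nodes, e.g. ['node-d', 'node-a', 'node-b']
```

Adding or removing a node only moves the items between neighboring ring positions, which is what gives these stores their elasticity.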

SLIDE 13

The two meanings of “Consistency”

1. For database experts: Consistency == referential integrity in a single database
- To make things simple: unique keys are really unique, foreign keys map onto something, etc.
- This is the “C” from ACID

2. For distributed systems experts: Consistency = a property of replicated data
- To make things simple: all copies of the same data seem to have the same value at any time

SLIDE 14

Flexible consistency models

Some NoSQL data stores allow users to define the level of consistency they want

- Replicate each data item over N servers
- Associate each data item with a timestamp
- Issue writes on all servers; consider a write to be successful when m servers have acknowledged
- Read data from at least n servers (and return the freshest version to the client)

If m + n > N then we have strong consistency

- For example: m = N, n = 1
- But other possibilities exist: m = 1, n = N
- Or anything in between: m = N/2 + 1, n = N/2 + 1

If m + n ≤ N then we have weak consistency

- Faster

Example: Amazon Dynamo
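A toy sketch of this quorum scheme, with in-process objects standing in for real replica servers; it only illustrates why quorums satisfying m + n > N must intersect (Dynamo's actual protocol is far more involved):

```python
import time

class Replica:
    def __init__(self):
        self.value, self.ts = None, 0.0  # current value and its timestamp

N = 5
replicas = [Replica() for _ in range(N)]

def write(value, m):
    """Issue the write on all servers, but consider it successful as soon
    as m of them have acknowledged (here: the first m in the list)."""
    ts = time.time()
    for r in replicas[:m]:      # the m replicas that acknowledged in time
        r.value, r.ts = value, ts
    # the remaining N - m replicas would be updated asynchronously

def read(n):
    """Contact at least n servers and return the freshest version seen."""
    contacted = replicas[-n:]   # deliberately the worst-case choice of n servers
    return max(contacted, key=lambda r: r.ts).value

write("v1", m=3)
print(read(n=3))  # m + n = 6 > N = 5: the quorums intersect, prints 'v1'
print(read(n=2))  # m + n = 5 <= N: may miss the write, prints None
```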

SLIDE 15

Why do people use NoSQL?

SLIDE 16

Flexible data schemas

In NoSQL data stores there is no need to impose a strict data schema

- In any case, the data store treats each row as a (key, value) pair
- No requirement on the value ⇒ no fixed data schema
- Not the same as empty values!
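To make the point concrete, a small sketch of two rows in the same schema-less "table"; the keys and field names are invented:

```python
# Two rows of the same key-value "table": no schema forces them to
# share fields, and a missing field is not the same as an empty value.
store = {
    "customer:1": {"name": "Alice", "email": "alice@example.org"},
    "customer:2": {"name": "Bob", "phone": "+33 1 23 45 67 89",
                   "loyalty_points": 120},
}

row = store["customer:2"]
print(row.get("email"))  # None: the field simply does not exist
```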

SLIDE 17

Scaling the database tier

[Comparison table: Replicated SQL (e.g., MySQL) vs. Sharding vs. NoSQL (e.g., Bigtable), rated on Scalability, Complex queries, Fault Tolerance, and Consistency]

SLIDE 18

Consistency issues in NoSQL databases

NoSQL databases scale because of heavy data partitioning

Minimum coordination between partitions

Consistency (worst case): eventual consistency

- Updates will become visible at some point in the future
- Multiple updates are propagated independently from each other
- E.g., Amazon’s SimpleDB

Consistency (best case): single-row transactions

- Transactional updates to a single database row
- No support for multiple-row transactions
- E.g., Google’s Bigtable, Cassandra, etc.
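A sketch of the flavor of single-row atomic update such systems expose (reminiscent of HBase's check-and-put), using a hypothetical in-memory store; real systems provide this through their own APIs:

```python
import threading

class SingleRowStore:
    """Toy store offering atomic updates to one row at a time."""
    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def check_and_put(self, key, expected, new_row):
        """Atomically replace the row only if it still equals `expected`.
        Returns False if a concurrent writer got there first."""
        with self._lock:
            if self._rows.get(key) != expected:
                return False
            self._rows[key] = new_row
            return True

store = SingleRowStore()
store.check_and_put("user:1", None, {"credit": 10})
ok = store.check_and_put("user:1", {"credit": 10}, {"credit": 5})
print(ok)  # True; a transfer between two rows would need two such
           # updates, and nothing makes the pair atomic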

SLIDE 19

Position

We can guarantee multiple-row transactions in NoSQL databases without compromising their scalability or fault-tolerance properties.

The secret: exploit the properties of Web applications
- Transactions are short-lived
- Transactions span a limited number of well-identified data items

Question: In fact this statement cannot be true. Why?

SLIDE 20

Availability vs. Consistency

Strictly speaking, it is impossible to fulfill my promises entirely

- The CAP theorem states that one cannot support strong Consistency and high Availability in the presence of network Partitions
- A scalable system necessarily faces occasional partitions

NoSQL databases favor high availability

And deliver best-effort consistency

CloudTPS focuses on consistency first

- At the cost of unavailability in extreme failure/partition cases
- Note: a machine failure is not an extreme case...

SLIDES 21-23

System Model

SLIDES 24-29

Atomicity

Atomicity: All operations succeed or none of them does

No partially executed transactions!

Solution: 2-phase commit across the LTMs (Local Transaction Managers) that contain the relevant data items
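A minimal sketch of the two-phase commit decision logic across the participating LTMs; the prepare/commit/abort methods are hypothetical stand-ins, and failure handling (timeouts, coordinator crashes) is omitted:

```python
def two_phase_commit(ltms, transaction):
    """Run 2PC over the LTMs holding the transaction's data items."""
    # Phase 1 (voting): every participant must promise it can commit.
    votes = [ltm.prepare(transaction) for ltm in ltms]

    # Phase 2 (decision): commit only if all voted yes, abort otherwise.
    if all(votes):
        for ltm in ltms:
            ltm.commit(transaction)   # all operations succeed...
        return "committed"
    for ltm in ltms:
        ltm.abort(transaction)        # ...or none of them does
    return "aborted"
```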

SLIDE 30

Consistency

Consistency: Each transaction leaves the database in an internally consistent state

- “Consistency” in this context means: logical consistency of different data items
- Very different from the consistency of a single replicated data item

Solution: we assume that transactions are semantically correct

SLIDES 31-33

Isolation

Isolation: The system behaves as if transactions were processed sequentially

If the system allows concurrent transactions, then conflicting transactions must be serialized

Solution: Timestamp ordering
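A sketch of basic timestamp ordering, under the usual assumptions: every transaction carries a unique timestamp and every data item remembers the largest read and write timestamps that touched it; an aborted transaction would retry with a fresh (larger) timestamp:

```python
class Abort(Exception):
    """The transaction must restart with a fresh (larger) timestamp."""

class Item:
    def __init__(self):
        self.value = None
        self.rts = 0  # largest transaction timestamp that read this item
        self.wts = 0  # largest transaction timestamp that wrote this item

def to_read(item, ts):
    """A transaction with timestamp ts reads the item."""
    if ts < item.wts:             # a younger transaction already overwrote it
        raise Abort
    item.rts = max(item.rts, ts)
    return item.value

def to_write(item, ts, value):
    """A transaction with timestamp ts writes the item."""
    if ts < item.rts or ts < item.wts:  # would invalidate a younger reader/writer
        raise Abort
    item.wts, item.value = ts, value

# Conflicting transactions are thus serialized in timestamp order:
x = Item()
to_write(x, ts=1, value="a")
to_read(x, ts=2)
try:
    to_write(x, ts=1, value="b")  # too old: a ts=2 read already happened
except Abort:
    print("transaction 1 aborted, must retry with a new timestamp")
```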

SLIDE 34

Durability

Durability: Once a transaction has been committed, its effects cannot be lost

Even in the case of server failures or network partitions...

Long-term solution: Transaction managers checkpoint updates into the cloud storage service

- NoSQL databases guarantee durability
- But updates may not be visible immediately

Short-term solution: Each data item is hosted by N transaction managers

We can support the simultaneous failure of N − 1 servers
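A sketch combining both mechanisms; ltm_replicas and cloud_store are hypothetical stand-ins, not the real CloudTPS interfaces:

```python
def commit_update(key, value, ltm_replicas, cloud_store):
    """Make an update durable under the scheme sketched above."""
    # Short term: keep the update in the memory of all N LTM replicas
    # before acknowledging, so that up to N - 1 simultaneous server
    # failures lose nothing.
    for ltm in ltm_replicas:
        ltm.store_in_memory(key, value)

    # Long term: checkpoint to the NoSQL cloud storage service in the
    # background; the data store guarantees durability from there on,
    # even though the update may not be immediately visible.
    cloud_store.write_async(key, value)

    return "acknowledged"
```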

SLIDE 35

Evaluation setup

CloudTPS runs on top of:

- HBase in our local cluster
- SimpleDB in the Amazon EC2 cloud

Workload derived from TPC-W

- Web application benchmark which models an online bookstore (ported to NoSQL databases)
- Standardized workloads

SLIDE 36

Workload

How many LTMs participate in each transaction?
When deploying the system across 40 LTMs:

[Histogram: number of transactions (log scale, 1 to 10^7) vs. number of accessed LTMs (up to 25)]

SLIDE 37

Scalability evaluation

When using N transaction managers, how many transactions/second can we sustain?

HBase + DAS3: we want 99% of transactions < 100 ms

[Graph: maximum throughput (TPS, up to 8,000) vs. number of LTMs (up to 35)]

SimpleDB + EC2: we want 90% of transactions < 100 ms

[Graph: maximum throughput (TPS, up to 3,500) vs. number of LTMs (up to 70), for HighCPU Medium and Standard Small EC2 instances]

We have linear scalability

SLIDE 38

Tolerance to failures and partitions

[Graph: throughput (TPS) over time (200-1200 seconds), showing an LTM failure and a network partition]

CloudTPS recovers from a node failure in ∼ 18.6 sec

- 0.5 sec to rebuild a new system membership
- 12.2 sec to recover transactions from the failed LTM (timeouts)
- 5.9 sec to reorganize data placement among surviving LTMs

CloudTPS aborts all transactions during a partition

Recovers ∼ 135 ms after the end of the partition

SLIDE 39

Optimization: memory management

Loading all data items in memory may create problems
Better: let’s keep only the most frequently accessed items

We can load the others from the NoSQL data store when necessary

This creates a tradeoff: memory consumption vs. performance
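A sketch of such a bounded buffer; the slide says "most frequently accessed", and we assume here that a least-recently-used policy is an acceptable approximation. fetch_from_store stands in for a read from the NoSQL data store:

```python
from collections import OrderedDict

class LTMBuffer:
    """Keep at most `capacity` data items in memory; evict the least
    recently used item and reload evicted items on demand."""
    def __init__(self, capacity, fetch_from_store):
        self.capacity = capacity
        self.fetch = fetch_from_store      # callback into the NoSQL store
        self.items = OrderedDict()

    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)    # mark as recently used
            return self.items[key]
        value = self.fetch(key)            # slow path: hit the data store
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False) # evict the LRU item
        return value

buf = LTMBuffer(capacity=2, fetch_from_store=lambda k: f"value-of-{k}")
buf.get("a"); buf.get("b"); buf.get("a"); buf.get("c")  # evicts "b"
print(list(buf.items))  # ['a', 'c']
```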

[Graph: 99th percentile response time (milliseconds) vs. LTM buffer size (number of data items), for 10k-item and 1M-item datasets]

Keeping only 10% of the data in the LTMs barely impacts performance
