SLIDE 1

Transactional storage for geo-replicated systems

Yair Sovran∗ Russell Power∗ Marcos K. Aguilera† Jinyang Li∗

∗New York University †Microsoft Research Silicon Valley

Presentation by Wojciech Żółtak

SLIDE 2

Geo-replication

  • Network latencies between distant locations can be very high.
  • A natural disaster can destroy an entire data center.

So, we replicate our service across many sites around the world and redirect each user to the closest one.

SLIDE 3

Geo-replication of storage systems

  • Application logic changes rarely and is easy to replicate.
  • Data in the store changes frequently and is hard to replicate, especially when supporting transactions.

We will focus on a key-value store with transaction support.

SLIDE 4

Why transactions?

Transactions remove the burden of dealing with problems such as:

  • race conditions,
  • partial writes,
  • overwrites,

and therefore make development much easier.

SLIDE 5

Write-write conflicts

Conflicting writes to replicated sites:

[Diagram: Site A and Site B each hold a replica of object X and update it concurrently.]

How do we merge the updates?

SLIDE 6

Write-write conflicts

Master-slave architecture:

[Diagram: updates go to the master's copy of object X and are then propagated to the slave.]

  • Read-write master
  • Replicated, read-only slaves

The master quickly becomes a bottleneck; a better solution is needed.

SLIDE 7

Goals

  • Asynchronous replication across sites.
  • Efficient update-anywhere for certain objects.
  • Freedom from conflict-resolution logic.
  • Strong isolation within each site.

Current systems provide only a subset of the above properties.

SLIDE 8

1. Parallel Snapshot Isolation (PSI)

Problems with SI in replicated systems:

  • SI requires a total ordering of commit times for all transactions in the whole system, even when they do not conflict.
  • A transaction becomes visible only after its writes have been propagated to all sites.

PSI is a new isolation property that adapts SI to replicated systems.
SLIDE 9

SI vs PSI, properties

SI:

  • (Snapshot Read) All operations read the most recent committed version as of the time when the transaction began.
  • (No Write-Write Conflicts) The write sets of each pair of committed concurrent[2] transactions must be disjoint.

PSI:

  • (Site Snapshot Read) All operations read the most recent committed version at the transaction's site as of the time when the transaction began.
  • (No Write-Write Conflicts) The write sets of each pair of committed somewhere-concurrent[1] transactions must be disjoint.
  • (Commit Causality Across Sites) If transaction T1 commits at site A before transaction T2 starts at site A, then T1 cannot commit after T2 at any site.

Note that PSI guarantees SI within a single site.

[1] T1 and T2 are somewhere-concurrent when they are concurrent[2] at two (not necessarily different) sites.
[2] T1 and T2 are concurrent if one of them has a commit timestamp between the start and commit timestamps of the other.
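Footnote [2]'s definition of concurrency can be written as a small predicate over (start, commit) timestamp pairs. A minimal sketch, assuming plain integer timestamps (not part of the original presentation):

```python
def concurrent(t1, t2):
    """T1 and T2 are concurrent if one of them has a commit timestamp
    between the start and commit timestamps of the other (footnote [2])."""
    (s1, c1), (s2, c2) = t1, t2
    return s1 < c2 < c1 or s2 < c1 < c2

# Overlapping execution intervals are concurrent; disjoint ones are not.
assert concurrent((1, 5), (3, 7))
assert not concurrent((1, 2), (3, 4))
```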
SLIDE 10

SI vs PSI, properties

Example of SI

SLIDE 11

SI vs PSI, properties

Example of PSI (commit timestamps may differ at different sites)

SLIDE 12

SI vs PSI, anomalies

SLIDE 13

2. Preferred sites

We could use sharding for writes, i.e. associate each object with a concrete site and redirect its writes there. This is called a primary-site mechanism. But a transaction may write objects associated with different sites, which is problematic. Instead, a slightly less restrictive property is introduced: preferred sites.
SLIDE 14

2. Preferred sites

  • Each object is associated with a concrete site (e.g. a user's data is associated with the site closest to their usual location).
  • Writing an object at its preferred site is guaranteed to be conflict-free with other sites.
  • Writing an object at a site other than its preferred site is still permitted.

We will see later how these properties can be achieved and what benefits they provide.
SLIDE 15

3. CSet, a commutative data type

  • A data type is commutative when all operations on it are commutative, i.e. the operations can be reordered and the result will be the same.
  • CSet is a commutative data type which implements a set.

SLIDE 16

3. CSet, a commutative data type

Implementation: CSet : Key -> Int

  • an empty CSet maps every key to 0
  • CSet.contains(X) = true iff X is mapped to a positive integer
  • CSet.add(X) increases the count associated with X
  • CSet.del(X) decreases the count associated with X
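The Key -> Int mapping above can be sketched in Python. A minimal sketch; class and method names are illustrative, not Walter's actual API:

```python
# A CSet: a set implemented as a map Key -> Int, whose add/del operations
# commute, so replicas can apply them in any order.
from collections import defaultdict

class CSet:
    def __init__(self):
        self.counts = defaultdict(int)  # empty CSet maps every key to 0

    def contains(self, x):
        return self.counts[x] > 0       # positive count means "in the set"

    def add(self, x):
        self.counts[x] += 1

    def delete(self, x):
        self.counts[x] -= 1

# Commutativity: applying the same operations in any order yields the
# same final state, which makes merging replica updates trivial.
a, b = CSet(), CSet()
a.add("x"); a.delete("x"); a.add("x")
b.add("x"); b.add("x"); b.delete("x")
assert a.counts["x"] == b.counts["x"] == 1
```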
SLIDE 17

3. CSet, a commutative data type

  • Because concurrent operations may add/remove the same object to/from the same CSet, the counter may end up at a value other than 0 or 1.
  • To delete an element, the application should decrease the counter until it is not positive; to add one, increase it until it is positive.
  • That behaviour may be encapsulated by the interface and made transparent to the user.
SLIDE 18

3. CSet, benefits

  • Since CSet is a commutative data type, it can be modified at any site without introducing write-write conflicts: merging different updates is trivial.
  • A set is a very useful structure for aggregating data (a user's posts, friends, basket contents, etc.).
  • CSets can eliminate some situations that would otherwise require updating objects with different preferred sites within a single transaction (e.g. modifying a symmetrical relation like friendship).
SLIDE 19

Putting it all together - Walter

[Diagram: users A and B talk to Site 1 and Site 2; the sites are coordinated by a configuration service and backed by cluster storage.]

  • Data is divided into containers, which are simply logical organization units.
  • An object's ID contains its container ID, so objects cannot be moved between containers.
  • Every container is associated with a preferred site and with the set of sites to which it should be replicated (its replica set).

SLIDE 20

Putting it all together - Walter

  • The configuration service is a black box which tracks the active sites, and the preferred site and replica set of each container.
  • Sites cache the mapping between containers and sites.
  • Cluster storage keeps the logs of all sites for safety: a site's state can be restored from its stored log.

SLIDE 21

Walter, server variables at each site

SLIDE 22

Walter, executing transactions

  • The version number of an object is a pair <site, seqno>.
  • When transaction x starts, it obtains the timestamp vector startVTS = <CommittedVTS[1], ..., CommittedVTS[n]>.
  • A version <site, seqno> is visible to startVTS if seqno <= startVTS[site].
  • The transaction sees a snapshot with the newest visible version of each object.
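The visibility rule above can be sketched as a one-line predicate. A toy, assuming startVTS is a per-site map of committed sequence numbers; names are illustrative:

```python
# A version <site, seqno> is visible to a transaction iff
# seqno <= startVTS[site].
def visible(version, start_vts):
    site, seqno = version
    return seqno <= start_vts.get(site, 0)

start_vts = {"A": 5, "B": 2}   # highest committed seqno per site at start
assert visible(("A", 5), start_vts)       # committed before the snapshot
assert not visible(("B", 3), start_vts)   # committed after the snapshot
```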

SLIDE 23

Walter, executing transactions

  • Writes in transaction x are stored in a temporary buffer, x.updates.
  • When reading an object, information from x.updates is merged with information from the snapshot.
  • If an object is not replicated locally, it is fetched from its preferred site.
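Merging the write buffer with the snapshot on reads can be sketched as follows. A toy, with the snapshot reduced to a plain dict; names are illustrative:

```python
# Reads inside a transaction: the write buffer (x.updates) shadows
# the snapshot, so a transaction sees its own uncommitted writes.
class Tx:
    def __init__(self, snapshot):
        self.snapshot = snapshot   # immutable view chosen via startVTS
        self.updates = {}          # x.updates: buffered writes

    def write(self, key, value):
        self.updates[key] = value

    def read(self, key):
        if key in self.updates:    # own buffered writes take precedence
            return self.updates[key]
        return self.snapshot.get(key)

tx = Tx(snapshot={"k": "old"})
tx.write("k", "new")
assert tx.read("k") == "new"   # buffered write shadows the snapshot
```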

SLIDE 24

Walter, fast commit

Used when the transaction writes only objects whose preferred site is local:

  1. Has any written object been modified since the transaction began? If yes, ABORT.
  2. Assign a new seqno to x.
  3. Wait until transaction seqno-1 is committed.
  4. Commit the transaction.
  5. Propagate it to the other sites.
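The fast-commit flow can be rendered as a toy single-site routine. Here startVTS is collapsed to a single integer and all names are illustrative, not Walter's real implementation:

```python
# Toy, single-site rendering of fast commit.
class Site:
    def __init__(self):
        self.seqno = 0       # last sequence number assigned at this site
        self.history = {}    # key -> seqno of the last local commit to it

    def modified_since(self, key, start_vts):
        return self.history.get(key, 0) > start_vts

    def fast_commit(self, updates, start_vts):
        # 1. Abort if any written object changed since the transaction began.
        if any(self.modified_since(k, start_vts) for k in updates):
            return "ABORT"
        # 2. Assign a new seqno (in this toy, seqno-1 is trivially committed).
        self.seqno += 1
        # 3-4. Commit locally; 5. propagation to other sites is asynchronous.
        for k in updates:
            self.history[k] = self.seqno
        return "COMMIT"

site = Site()
assert site.fast_commit({"x": 1}, start_vts=0) == "COMMIT"
# A transaction that started before that commit now has a conflict on x:
assert site.fast_commit({"x": 2}, start_vts=0) == "ABORT"
```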

SLIDE 25

Walter, fast commit

SLIDE 26

Walter, slow commit

Used when the transaction writes at least one object whose preferred site is not local:

  1. Ask the involved sites to lock the corresponding objects.
  2. Were all locks acquired?
     - No: unlock the already-locked objects, then ABORT.
     - Yes: commit x as in fast commit.
  3. When x is propagated to a site holding a related lock, the lock is released.
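The lock-all-or-abort step can be sketched with in-memory lock sets. A toy; names are illustrative, not Walter's API:

```python
# Slow commit's locking step: lock every written object at its (remote)
# preferred site; if any lock fails, roll back and abort.
class RemoteSite:
    def __init__(self):
        self.locks = set()

    def try_lock(self, key):
        if key in self.locks:
            return False
        self.locks.add(key)
        return True

    def unlock(self, key):
        self.locks.discard(key)

def slow_commit(writes):
    """writes: key -> preferred RemoteSite for that key."""
    locked = []
    for key, site in writes.items():
        if site.try_lock(key):
            locked.append((key, site))
        else:
            for k, s in locked:          # a lock failed: undo and abort
                s.unlock(k)
            return "ABORT"
    # All locks acquired: commit as in fast commit; each lock is released
    # later, when the transaction propagates to the site holding it.
    return "COMMIT"

a, b = RemoteSite(), RemoteSite()
assert slow_commit({"x": a, "y": b}) == "COMMIT"
assert slow_commit({"x": a}) == "ABORT"   # x stays locked until propagation
```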

SLIDE 27

Walter, slow commit

SLIDE 28

Walter, slow commit

SLIDE 29

Walter, propagation

  • After commit, transactions are propagated to the other sites.
  • A site that receives transaction x sends an ACK.
  • When the transaction has been received by at least f+1 sites (for some f), it is marked as disaster-safe and all sites are notified.
  • A site merges transaction x once all transactions from x.startVTS have been merged and x is disaster-safe.
  • When x has been committed at all sites, it is marked as globally visible.
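The disaster-safe rule is a simple count over acknowledgements. A toy sketch; f is the number of site failures to tolerate, and the names are illustrative:

```python
# A committed transaction is marked disaster-safe once at least f+1
# sites have acknowledged receiving it.
def disaster_safe(acked_sites, f):
    return len(acked_sites) >= f + 1

acks = {"Virginia"}
assert not disaster_safe(acks, f=1)   # only 1 site holds the transaction
acks.add("Ireland")
assert disaster_safe(acks, f=1)       # f+1 = 2 sites hold it
```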

SLIDE 30

Walter, failures

  • A site can be restored from the data stored in the cluster storage system.
  • The system can either wait for the site to come back online, or find the best replacement among the other nodes and reassign preferred sites.
  • Transactions for which not all preceding transactions can be found are discarded.
  • A reactivated site can be re-integrated back into the system.

SLIDE 31

Walter, partial replication

  • One data center can hold several Walter servers, each replicating a different data partition.
  • A transaction may operate on objects which are not replicated at a given site.
  • Partial replication can be used to scale up the system.
SLIDE 32

Evaluation

  • 4 sites in distant locations (Virginia, California, Ireland, Singapore).
  • Virtual servers equivalent to a 2.5 GHz Intel Xeon with 7 GB of RAM.
  • Replication to all sites in the tests.
  • 600 Mbps network between hosts within a site.
  • 22 Mbps network between sites.
  • Transactions read/write a few randomly chosen 100-byte objects.

SLIDE 33

Evaluation, round trip latencies (ms)

SLIDE 34

Evaluation, base performance

  • Compared against Berkeley DB 11gR2.
  • Both systems in a master-slave configuration.
  • Store populated with 50k regular objects.
  • Workload: reads/writes of all objects in the DB.
SLIDE 35

Evaluation, fast commit

SLIDE 36

Evaluation, fast commit

SLIDE 37

Evaluation, fast commit

SLIDE 38

Evaluation, fast commit

SLIDE 39

Evaluation, fast commit

SLIDE 40

Evaluation, slow commit

SLIDE 41

Evaluation, WaltSocial

A simple Facebook clone.

  • 400k users, each with 10 status updates and 10 wall posts from other users.
  • 4 sites.
  • Many clients at all sites.
  • Mix1 test: 90% read, 10% update operations.
  • Mix2 test: 80% read, 20% update operations.
SLIDE 42

Evaluation, WaltSocial

SLIDE 43

Evaluation, WaltSocial

SLIDE 44

Evaluation, ReTwis port

A Twitter-like application.

  • Uses transactions instead of Redis atomic operations.
  • Ported in one day.
  • 500k users.
  • Mixed workload: 85% timeline reads, 7.5% posts, and 7.5% follow operations.

SLIDE 45

Evaluation, ReTwis port

SLIDE 46

Summary

  • PSI is a strong property for replicated systems.
  • Preferred sites and CSets can be used to avoid write-write conflicts and increase the locality of writes.
  • Walter is an example of an efficient system with PSI.
  • Transactions make application development much easier.