SLIDE 1 Transactional storage for geo-replicated systems
Yair Sovran∗ Russell Power∗ Marcos K. Aguilera† Jinyang Li∗
∗New York University †Microsoft Research Silicon Valley
Presentation by Wojciech Żółtak
SLIDE 2 Geo-replication
- Network latencies between distant places may be very high.
- In case of a natural disaster, a whole data center may be destroyed.
So we replicate our service across many sites around the world and redirect users to the closest one.
SLIDE 3 Geo-replication of storage systems
- Application logic changes rarely and is easy to replicate.
- Data in the store changes frequently and is hard to replicate, especially when supporting transactions.
We are going to focus on a key-value store with transaction support.
SLIDE 4 Why transactions?
Transactions remove the burden of dealing with problems like:
- race conditions,
- partial writes,
- overwrites,
and therefore make development much easier.
SLIDE 5 Write-write conflicts
Conflicting writes to replicated sites:
[Diagram: Site A and Site B each update their own replica of object X concurrently.]
How do we merge the updates?
SLIDE 6 Write-write conflicts
Master-slave architecture:
[Diagram: updates go to the master, which propagates object X to a slave.]
- Read-write master.
- Replicated, read-only slaves.
The master quickly becomes a bottleneck; a better solution is needed.
SLIDE 7 Goals
- Asynchronous replication across sites.
- Efficient update-anywhere for certain objects.
- Freedom from conflict-resolution logic.
- Strong isolation within each site.
Current systems provide only a subset of the above properties.
SLIDE 8 Problems with SI in replicated systems:
- SI requires a total ordering of commit times for all transactions in the whole system (even if they do not conflict).
- A transaction is visible only after its writes have been propagated to all sites.
PSI is a new isolation property which adapts SI to replicated systems.
1. Parallel Snapshot Isolation (PSI)
SLIDE 9 SI vs PSI, properties
SI:
- (Snapshot Read) All operations read the most recent committed version as of the time when the transaction began.
- (No Write-Write Conflicts) The write sets of each pair of committed concurrent transactions must be disjoint.

PSI:
- (Site Snapshot Read) All operations read the most recent committed version at the transaction's site as of the time when the transaction began.
- (No Write-Write Conflicts) The write sets of each pair of committed somewhere-concurrent[1] transactions must be disjoint.
- (Commit Causality Across Sites) If a transaction T1 commits at a site A before a transaction T2 starts at site A, then T1 cannot commit after T2 at any site.

Note that PSI guarantees SI within a single site.

[1] T1 and T2 are somewhere-concurrent when they are concurrent[2] at two (not necessarily different) sites.
[2] T1 and T2 are concurrent if one of them has a commit timestamp between the other's start and commit timestamps.
SLIDE 10
SI vs PSI, properties
Example of SI
SLIDE 11
SI vs PSI, properties
Example of PSI (commit timestamps may differ across sites)
SLIDE 12
SI vs PSI, anomalies
SLIDE 13 We can use sharding for writes, i.e. associate objects with specific sites and redirect writes to them. This is called the primary-sites mechanism. But a transaction may write objects associated with different sites, which is problematic. Instead, a slightly less restrictive property is introduced, called preferred sites.
SLIDE 14
- Each object is associated with a specific site (e.g. a user's data is associated with the site closest to their usual location).
- Writing an object at its preferred site is guaranteed to be conflict-free with the other sites.
- Writing an object at a site other than its preferred site is still permitted.
We will see later how these properties can be achieved and what benefits they provide.
SLIDE 15
3. CSet, a commutative data type
- A data type is commutative when all operations on it are commutative, i.e. we can change the order of operations on it and the result will be the same.
- CSet is a commutative data type which implements a set.
SLIDE 16 Implementation: CSet : Key -> Int
- The empty CSet maps every key to 0.
- CSet.contains(X) = true if X is mapped to a positive integer.
- CSet.add(X) increases the number associated with X.
- CSet.remove(X) decreases the number associated with X.
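The CSet above can be sketched in a few lines of Python (a minimal illustration of the Key -> Int representation; not the paper's actual implementation):

```python
from collections import defaultdict

# A CSet maps each element to an integer count; an element is a
# member iff its counter is positive.
class CSet:
    def __init__(self):
        self.counts = defaultdict(int)   # empty CSet: every key -> 0

    def add(self, x):
        self.counts[x] += 1              # add increases the counter

    def remove(self, x):
        self.counts[x] -= 1              # remove decreases the counter

    def contains(self, x):
        return self.counts[x] > 0        # member iff counter > 0
```

Because add and remove only increment and decrement a counter, they commute: two replicas that apply the same updates in different orders converge to the same counts, which is why merging is trivial.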
SLIDE 17
- Because concurrent operations may add/remove the same object in the same CSet, the counter may end up different from 0 or 1.
- To add (remove) an element, the application should increase (decrease) the counter until it is (is not) positive.
- That behaviour may be encapsulated behind an interface and remain transparent to the user.
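That encapsulation might look like the following (a hypothetical wrapper over the Key -> Int counters, not code from the paper):

```python
from collections import defaultdict

# counts is the CSet representation from slide 16 (Key -> Int),
# where an element is present iff its counter is positive.
counts = defaultdict(int)

def ensure_present(x):
    while counts[x] <= 0:    # increase until the counter is positive
        counts[x] += 1

def ensure_absent(x):
    while counts[x] > 0:     # decrease until the counter is not positive
        counts[x] -= 1
```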
SLIDE 18
- Since CSet is a commutative data type, it can be modified at any site without introducing write-write conflicts: merging different updates is trivial.
- A set is a very useful structure for aggregating data (a user's posts, friends, basket contents, etc.).
- CSets may eliminate some situations that would otherwise involve updating objects with different preferred sites within a single transaction (e.g. modifying a symmetric relation like friendship).
SLIDE 19 Putting all things together - Walter
[Diagram: users A and B access Site 1 and Site 2; each site talks to the configuration service and cluster storage.]
- Data is divided into containers, which are simply logical organization units.
- An object's ID contains its container ID, so objects cannot be moved between containers.
- Every container is associated with a preferred site and with a set of sites to which it should be replicated (its replica set).
SLIDE 20 Putting all things together - Walter
- The configuration service is a black box which tracks the active sites and, for each container, its preferred site and replica set.
- Sites cache the mapping between containers and sites.
- Cluster storage keeps the logs of all sites for safety reasons (a site's state can be restored from its stored log).
SLIDE 21
Walter, server variables at each site
SLIDE 22
- An object's version number is a pair <site, seqno>.
- When transaction x starts, it obtains the startVTS timestamp vector <CommittedVTS1, ..., CommittedVTSn>.
- A version v = <site, seqno> is visible to startVTS if seqno <= startVTS[site].
- The transaction sees a snapshot with the newest visible versions of objects.
Walter, executing transactions
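The visibility rule above can be sketched as follows (the data structures are illustrative, not Walter's actual code):

```python
# version = (site, seqno); start_vts maps site -> highest seqno
# committed at that site when the transaction started.
def visible(version, start_vts):
    site, seqno = version
    return seqno <= start_vts.get(site, 0)

def snapshot_read(history, obj, start_vts):
    # history[obj]: (version, value) pairs in local commit order;
    # return the value of the newest visible version, if any.
    value = None
    for version, val in history.get(obj, []):
        if visible(version, start_vts):
            value = val
    return value
```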
SLIDE 23
- Writes in transaction x are stored in a temporary buffer, x.updates.
- When reading an object, information from x.updates is merged with information from the snapshot.
- If an object is not replicated locally, it is fetched from its preferred site.
Walter, executing transactions
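The read-merging step above amounts to read-your-own-writes, which is a one-liner (illustrative names, not the paper's code):

```python
# Buffered writes in x.updates shadow the committed snapshot.
def tx_read(tx_updates, snapshot, obj):
    if obj in tx_updates:        # the transaction's own pending write
        return tx_updates[obj]
    return snapshot.get(obj)     # otherwise, the snapshot version
```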
SLIDE 24
Walter, fast commit
A fast commit writes only objects whose preferred site is local.
1. Has any written object been modified since the transaction began?
   - Yes: ABORT.
   - No: assign a new seqno to x, wait until transaction seqno-1 has committed, commit x, then propagate it to the other sites.
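The fast-commit decision can be sketched like this (in-memory dicts stand in for the per-site server state, the wait for seqno-1 is elided, and all names are illustrative rather than the paper's actual code):

```python
# A sketch of Walter's fast commit for transactions that write only
# objects whose preferred site is local.
def fast_commit(tx, store):
    site = store["site_id"]
    # Conflict check: abort if any written object already has a
    # version newer than the transaction's starting snapshot.
    for obj in tx["updates"]:
        vsite, seqno = store["latest_version"].get(obj, (site, 0))
        if seqno > tx["start_vts"].get(vsite, 0):
            return "ABORT"
    # Assign the next local sequence number and install the writes.
    store["next_seqno"] += 1
    seqno = store["next_seqno"]
    for obj, value in tx["updates"].items():
        store["data"][obj] = value
        store["latest_version"][obj] = (site, seqno)
    store["committed_vts"][site] = seqno   # now visible to new txs
    return "COMMIT"                        # then propagate asynchronously
```

Note that the check only involves local state, which is why commits of objects at their preferred site need no cross-site coordination.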
SLIDE 25
Walter, fast commit
SLIDE 26
Walter, slow commit
A slow commit writes at least one object with a non-local preferred site.
1. Ask the involved sites to lock the corresponding objects.
2. All locks acquired?
   - No: unlock the already-locked objects, then ABORT.
   - Yes: commit x as in fast commit. When x is propagated to a site holding a related lock, that lock is released.
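The lock phase might look like this sketch (shared dicts stand in for cross-site messages; `preferred_site` is a caller-supplied mapping, and the names are illustrative):

```python
# A minimal sketch of slow commit's lock protocol.
def slow_commit(tx, sites, preferred_site):
    acquired = []
    # Phase 1: lock each written object at its preferred site.
    for obj in sorted(tx["updates"]):
        site = sites[preferred_site(obj)]
        if site["locks"].get(obj):         # lock held by another tx
            for s, o in acquired:          # release what we got, abort
                s["locks"][o] = False
            return "ABORT"
        site["locks"][obj] = True
        acquired.append((site, obj))
    # Phase 2: commit as in fast commit (elided). In Walter, each
    # remote lock is released when the committed transaction reaches
    # the lock holder; modeled here by releasing after the commit.
    for s, o in acquired:
        s["locks"][o] = False
    return "COMMIT"
```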
SLIDE 27
Walter, slow commit
SLIDE 28
Walter, slow commit
SLIDE 29 Walter, propagation
- After commit, transactions are propagated to the other sites.
- A site that receives transaction x sends back an ACK.
- When the transaction has been received by at least f+1 sites (for some configured f), it is marked as disaster-safe and all sites are notified.
- A site merges transaction x once all transactions from x.startVTS have been merged and x is disaster-safe.
- When x has been committed at all sites, it is marked as globally-visible.
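The merge condition above can be sketched as two small predicates (illustrative names; f is the number of tolerated site failures):

```python
# A transaction is disaster-safe once at least f+1 sites have it.
def disaster_safe(acks, f):
    return acks >= f + 1

# A site merges x only when x is disaster-safe and every transaction
# visible in x.startVTS has already been merged locally.
def can_merge(x, merged_vts, acks, f):
    return disaster_safe(acks, f) and all(
        merged_vts.get(site, 0) >= seqno
        for site, seqno in x["start_vts"].items()
    )
```

Waiting for the startVTS prefix is what enforces Commit Causality Across Sites: x never becomes visible anywhere before the transactions it observed.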
SLIDE 30 Walter, failures
- A site can be restored from the data kept in the cluster storage system.
- The system can either wait for the site to come back online or find the best replacement among the other sites and reassign preferred sites.
- Transactions for which not all preceding transactions can be found are discarded.
- A reactivated site can be re-integrated back into the system.
SLIDE 31 Walter, partial replication
- One data center can host several Walter servers, each replicating a different data partition.
- A transaction may operate on objects which are not replicated at the given site.
- Partial replication can be used to scale up the system.
SLIDE 32 Evaluation
- 4 sites in distant locations (Virginia, California, Ireland, Singapore).
- Virtual servers equivalent to a 2.5 GHz Intel Xeon with 7 GB of RAM.
- Replication to all sites in the tests.
- 600 Mbps network between hosts within a site.
- 22 Mbps network between sites.
- Transactions read/write a few randomly chosen 100-byte objects.
SLIDE 33
Evaluation, round trip latencies (ms)
SLIDE 34 Evaluation, base performance
- Compared against Berkeley DB 11gR2.
- Both running in a master-slave architecture.
- Populated with 50k regular objects.
- Reads/writes of all objects in the DB.
SLIDE 35
Evaluation, fast commit
SLIDE 36
Evaluation, fast commit
SLIDE 37
Evaluation, fast commit
SLIDE 38
Evaluation, fast commit
SLIDE 39
Evaluation, fast commit
SLIDE 40
Evaluation, slow commit
SLIDE 41 Evaluation, WaltSocial
Simple Facebook clone.
- 400k users, each with 10 status updates and 10 wall posts from other users.
- 4 sites.
- Many clients at all sites.
- Mix1 test - 90% read, 10% update operations.
- Mix2 test - 80% read, 20% update operations.
SLIDE 42
Evaluation, WaltSocial
SLIDE 43
Evaluation, WaltSocial
SLIDE 44 Evaluation, ReTwis port
Twitter-like application.
- Transactions instead of Redis atomic operations.
- The port was done in one day.
- 500k users.
- Mixed workload with 85% timeline reads, 7.5% posts
and 7.5% follow operations.
SLIDE 45
Evaluation, ReTwis port
SLIDE 46 Summary
- PSI is a strong isolation property for replicated systems.
- Preferred sites and CSets can be used to avoid write-write conflicts and increase the locality of writes.
- Walter is an example of an efficient system that provides PSI.
- Transactions make app development much easier.