 
STI-BT: A Scalable Transactional Index
Nuno Diegues and Paolo Romano
34th International Conference on Distributed Computing Systems (ICDCS)

Distributed Key-Value (DKV) stores rise in popularity:
• scalability
• fault-tolerance
• elasticity

Recent trend:
• cloud adoption, large/elastic scaling
• move towards strong consistency
• easy and transparent APIs

We focus on two main disadvantages of DKV stores:
• they typically embrace weak consistency
• key-value access is too simplistic
  - mainly an index for the primary key

Providing a secondary index is non-trivial. But it is desirable!

State-of-the-art solutions either:
• did not provide strongly consistent transactions (harder to program against)
• were not fully decentralised (not scalable)
• required several hops to access the index (more latency)

In this work we present STI-BT: a Scalable Transactional Index
• Serializable distributed transactions
• Secondary indexes via a distributed B+Tree implementation
• Index accesses/changes obey transactions' semantics

Provides strong consistency + scalable indexing

Outline
• Background on DKV stores
• STI-BT
• Evaluation
• Related Work

Background on DKV store
Infinispan DKV store by Red Hat
Distributed vector-clock based protocol: GMU [ICDCS'12]
• Genuine
• Multi-versioned: read-only transactions do not abort
• Update Serializability for update transactions

GMU data set:
• replication degree: 2
• consistent hash function

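As a rough illustration of how a consistent hash function with replication degree 2 can assign each key to two servers, here is a minimal sketch. It is not Infinispan's or GMU's implementation: the `ConsistentHashRing` class, the `stable_hash` helper, the single ring point per server, and the server names S1 to S4 are all assumptions made for the example.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class ConsistentHashRing:
    """Hypothetical ring: one point per server, no virtual nodes, for brevity."""

    def __init__(self, servers, replication_degree=2):
        # Assumes replication_degree <= number of servers.
        self.replication_degree = replication_degree
        self.ring = sorted((stable_hash(s), s) for s in servers)
        self.points = [point for point, _ in self.ring]

    def owners(self, key: str):
        """Servers storing `key`: the first `replication_degree` servers
        found clockwise from the key's position on the ring."""
        start = bisect.bisect_right(self.points, stable_hash(key))
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(self.replication_degree)]


ring = ConsistentHashRing(["S1", "S2", "S3", "S4"], replication_degree=2)
for key in ["A", "P", "Z"]:
    print(key, ring.owners(key))
```

Any server can compute `owners(key)` locally, which is what keeps the lookup decentralized.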
GMU: genuine partial replication
• No central component
• Transactions require only machines holding data used
(Diagram: read/write; commit tx; consensus for commit.)

Outline
• Background on DKV stores
• STI-BT
  - maximizing data locality
  - hybrid replication
  - elastic scaling
  - concurrency enhancements
• Evaluation
• Related Work

The need for data locality of the index
Starting point:
• consider a distributed B+Tree built on the DKV
• tree nodes placed with random consistent hashing
(Diagram: each B+Tree node is mapped by the consistent hash function to one of the servers S1 to S4.)

Current problems with data locality
Problems with consistent hashing data placement:
- One index access entails several hops
- Some servers receive more load than others
- Range scan operations are also inefficient
(Diagram: tree nodes scattered across servers S1 to S4; example operations "delete Z" and "scan P to Z"; per-server load.)

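The "several hops" problem can be made concrete with a small sketch that walks one root-to-leaf path and counts how often the traversal has to move to a different server. Everything here is assumed for illustration: the node identifiers, the `server_of` and `count_hops` helpers, and the use of a plain hash modulo the number of servers as a stand-in for consistent hashing.

```python
import hashlib

SERVERS = ["S1", "S2", "S3", "S4"]


def server_of(node_id: str) -> str:
    """Stand-in for hashed placement: hash the node id, pick a server."""
    digest = hashlib.sha256(node_id.encode()).digest()
    return SERVERS[int.from_bytes(digest[:8], "big") % len(SERVERS)]


def count_hops(path, placement, start_server):
    """Number of times the traversal must contact a different server."""
    hops, current = 0, start_server
    for node_id in path:
        target = placement(node_id)
        if target != current:
            hops += 1
            current = target
    return hops


# Hypothetical root-to-leaf path for an operation such as "delete Z".
path = ["root", "inner-3", "inner-3-1", "leaf-Z"]

hashed = count_hops(path, server_of, start_server="S1")
colocated = count_hops(path, lambda node_id: "S2", start_server="S1")
print("hashed placement hops:", hashed)
print("co-located placement hops:", colocated)
```

With hashed placement the hop count can grow up to the length of the path; when the whole path lives on one server, a single hop suffices.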
Where typical solutions fall short
Partial replication of the index:
• poor locality
• poor load balancing
Full replication of the index:
• consensus on updates is too expensive
• prevents scaling out storage

STI-BT: Maximizing data locality of the index
Hybrid replication
• top nodes are more accessed but less modified
• better load balancing, rare cost for expensive consensus
Co-located data placement
• groups of sub-trees, reduce network hops
• migrate transaction to exploit co-location
(Diagram: full replication above the cut-off level C; partial replication below it, with sub-trees placed on S1, S3, S4 and S2.)

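A minimal sketch of the hybrid replication rule, under assumptions that are not stated on the slide: the root is at depth 0, nodes strictly above the cut-off level are fully replicated, and the value chosen for C, the helper names `replicas` and `commit_participants`, and the sub-tree owners are all hypothetical.

```python
SERVERS = ["S1", "S2", "S3", "S4"]
CUTOFF_LEVEL = 2  # hypothetical value of C; the root is at depth 0


def replicas(depth, subtree_owners):
    """Servers holding a tree node.

    Assumption: nodes above the cut-off level are replicated everywhere,
    nodes at or below it only on the servers owning their sub-tree.
    """
    if depth < CUTOFF_LEVEL:
        return list(SERVERS)
    return list(subtree_owners)


def commit_participants(modified_nodes):
    """Servers that must take part in committing a transaction that
    modified the given (depth, subtree_owners) nodes."""
    servers = set()
    for depth, owners in modified_nodes:
        servers.update(replicas(depth, owners))
    return sorted(servers)


# Common case: an insert that only touches a leaf of one sub-tree.
print(commit_participants([(3, ["S2", "S3"])]))
# Rare case: a split that propagates above the cut-off level.
print(commit_participants([(3, ["S2", "S3"]), (1, ["S2", "S3"])]))
```

The first call involves only the sub-tree owners, while the second involves every server, which is the expensive consensus that the slide expects to be rare.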
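Transaction migration driven by data co-location
1. Lookup K
2. local search (in the fully replicated top of the tree, down to the cut-off level C)
3. migrate tx (to the server holding the sub-tree, S2 in the diagram)
4. local search (in the partially replicated sub-tree that contains K)
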
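The four steps can be sketched as a single lookup routine that searches locally while nodes are fully replicated and moves the transaction once it reaches a sub-tree owned by another server. This is an illustrative toy, not the STI-BT code: the `Node` class, the `owner` field, the two-level tree, and the `lookup` signature are assumptions.

```python
import bisect
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    keys: List[str]                       # separators (inner node) or stored keys (leaf)
    children: List["Node"] = field(default_factory=list)  # empty for a leaf
    owner: Optional[str] = None           # None = fully replicated; else the owning server


def lookup(root: Node, key: str, start_server: str):
    """Return (found, server where the search ended, trace of steps)."""
    server, node, trace = start_server, root, []
    while True:
        if node.owner is not None and node.owner != server:
            # The next node lives in a partially replicated sub-tree elsewhere:
            # move the transaction there instead of fetching nodes remotely.
            trace.append(f"migrate tx {server} -> {node.owner}")
            server = node.owner
        trace.append(f"local search on {server}")
        if not node.children:
            return key in node.keys, server, trace
        # Keys equal to a separator go to the right child.
        node = node.children[bisect.bisect_right(node.keys, key)]


left = Node(keys=["C", "K"], owner="S2")
right = Node(keys=["P", "Z"], owner="S4")
root = Node(keys=["M"], children=[left, right])  # fully replicated top

found, server, trace = lookup(root, "K", start_server="S1")
print(found, server, trace)
```

In this toy tree the lookup migrates exactly once, after which every remaining step is local to the server that owns the sub-tree.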
Grouping index in sub-trees
Still rely on consistent hashing:
• preserve fully decentralized design and quick lookup of data
Exploit knowledge over structure of the indexed data
• general purpose data placement is agnostic of the data
• but we know how it will be structured
(Diagram: a local map lookup yields k_u, a unique key; the consistent hash function maps k_u to a server.)

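One plausible reading of the diagram is that placement hashes a per-group unique key k_u instead of each node's own identifier, so that all nodes of a sub-tree land on the same server. The sketch below illustrates that idea only; the `GROUP_OF` map contents, the helper names, and the plain hash modulo used in place of consistent hashing are assumptions.

```python
import hashlib

SERVERS = ["S1", "S2", "S3", "S4"]

# Hypothetical local map: tree node id -> unique key k_u of its sub-tree group.
GROUP_OF = {
    "inner-7": "ku-42",
    "leaf-P": "ku-42",
    "leaf-Z": "ku-42",
    "leaf-A": "ku-17",
}


def hash_to_server(value: str) -> str:
    """Stand-in for the consistent hash function."""
    digest = hashlib.sha256(value.encode()).digest()
    return SERVERS[int.from_bytes(digest[:8], "big") % len(SERVERS)]


def place(node_id: str) -> str:
    """Place a node by hashing its group's unique key, not its own id."""
    k_u = GROUP_OF[node_id]       # local map lookup
    return hash_to_server(k_u)    # same k_u -> same server


for node_id in GROUP_OF:
    print(node_id, "->", place(node_id))
```

Because placement is still a pure function of k_u, any server can compute it without a central directory, which matches the slide's goal of keeping the design fully decentralized.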