SLIDE 1 STI-BT: A Scalable Transactional Index
Nuno Diegues and Paolo Romano
34th International Conference on Distributed Computing Systems (ICDCS)
SLIDE 2
Distributed Key-Value (DKV) stores are rising in popularity:
SLIDE 3 Distributed Key-Value (DKV) stores are rising in popularity:
- scalability
- fault-tolerance
- elasticity
SLIDE 4 Distributed Key-Value (DKV) stores are rising in popularity:
- scalability
- fault-tolerance
- elasticity
Recent trend:
- cloud adoption, large/elastic scaling
- move towards strong consistency
- easy and transparent APIs
SLIDE 5 We focus on two main disadvantages:
- typically embrace weak consistency
- key-value access is too simplistic
- mainly an index for primary key
SLIDE 6 Providing a secondary index is non-trivial
We focus on two main disadvantages:
- typically embrace weak consistency
- key-value access is too simplistic
- mainly an index for primary key
SLIDE 7 Providing a secondary index is non-trivial. But it is desirable!
We focus on two main disadvantages:
- typically embrace weak consistency
- key-value access is too simplistic
- mainly an index for primary key
SLIDE 8 State of the art solutions either…:
- did not provide strongly consistent transactions — more difficult
- were not fully decentralised — not scalable
- required several hops to access the index — more latency
Providing a secondary index is non-trivial. But it is desirable!
We focus on two main disadvantages:
- typically embrace weak consistency
- key-value access is too simplistic
- mainly an index for primary key
SLIDE 9
In this work we present STI-BT: a Scalable Transactional Index
SLIDE 10
In this work we present STI-BT: a Scalable Transactional Index
- Serializable distributed transactions
- Secondary indexes via a distributed B+Tree implementation
- Index accesses/changes obey transactions’ semantics
SLIDE 11
Provide strong consistency + scalable indexing
In this work we present STI-BT: a Scalable Transactional Index
- Serializable distributed transactions
- Secondary indexes via a distributed B+Tree implementation
- Index accesses/changes obey transactions’ semantics
SLIDE 12
- Background on DKV stores
- STI-BT
- Evaluation
- Related Work
Outline
SLIDE 13
Infinispan DKV store by Red Hat
Background on DKV store
SLIDE 14
Distributed vector-clock based protocol:
GMU [ICDCS’12]
Infinispan DKV store by Red Hat
Background on DKV store
SLIDE 15
Distributed vector-clock based protocol:
GMU [ICDCS’12]
Infinispan DKV store by Red Hat
Read-only transactions do not abort
Background on DKV store
Multi-versioned
SLIDE 16
Distributed vector-clock based protocol:
GMU [ICDCS’12]: Update Serializability
Infinispan DKV store by Red Hat
Read-only transactions do not abort
Background on DKV store
Update transactions
Multi-versioned
SLIDE 17
Distributed vector-clock based protocol:
GMU [ICDCS’12]: Update Serializability
Infinispan DKV store by Red Hat
Read-only transactions do not abort
Background on DKV store
Update transactions
Multi-versioned
Genuine
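To make the multi-versioning point concrete, here is a minimal, illustrative sketch of version selection (not GMU's actual protocol, which orders versions with per-node vector clocks): a read-only transaction fixes a snapshot once and always finds a consistent version to read, which is why it never aborts.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative multi-versioned cell: a reader picks the newest committed version
// that is not newer than its snapshot, so read-only transactions never block or
// abort. GMU orders versions with per-node vector clocks rather than the single
// scalar timestamp used here; this only sketches the multi-versioning idea.
class VersionedCell<V> {
    private static final class Version<T> {
        final long commitTs; final T value;
        Version(long commitTs, T value) { this.commitTs = commitTs; this.value = value; }
    }

    private final List<Version<V>> versions = new ArrayList<>();

    synchronized void commit(long commitTs, V value) {
        versions.add(new Version<>(commitTs, value));          // appended in commit order
    }

    synchronized V readAt(long snapshotTs) {
        V visible = null;
        for (Version<V> v : versions) {
            if (v.commitTs <= snapshotTs) visible = v.value;   // newest version inside the snapshot
            else break;                                        // too recent for this snapshot
        }
        return visible;
    }
}
```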
SLIDE 19 [Diagram: data set spread across servers by a consistent hash function, replication degree 2]
GMU
SLIDE 20 [Diagram: data set spread across servers by a consistent hash function, replication degree 2]
GMU
SLIDE 21 [Diagram: data set spread across servers by a consistent hash function, replication degree 2]
GMU
SLIDE 22
- No central component
- Transactions require only the machines holding the data they use
GMU: genuine partial replication
SLIDE 23
- No central component
- Transactions require only the machines holding the data they use
read/write
GMU: genuine partial replication
SLIDE 24
- No central component
- Transactions require only the machines holding the data they use
commit tx
GMU: genuine partial replication
SLIDE 25
- No central component
- Transactions require only the machines holding the data they use
commit tx: consensus for commit
GMU: genuine partial replication
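A hedged sketch of the placement rule behind these slides (ring positions and names are assumptions for illustration, not Infinispan's actual hashing): every key is owned by the next two distinct servers found clockwise on a consistent-hash ring, and a genuine transaction only ever contacts the servers this function returns for the keys it touches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of consistent hashing with replication degree 2: a key is owned by the
// first `degree` servers found clockwise from its position on the ring. GMU is
// genuine: a transaction contacts only the servers ownersOf() returns for the
// data it reads or writes, and there is no central component.
class ConsistentHashRing {
    private final SortedMap<Integer, String> ring = new TreeMap<>();
    private final int degree;

    ConsistentHashRing(List<String> servers, int degree) {
        this.degree = degree;
        for (String s : servers) ring.put(hash(s), s);
    }

    List<String> ownersOf(String key) {
        List<String> owners = new ArrayList<>();
        for (String server : ring.tailMap(hash(key)).values()) {   // clockwise from the key
            if (owners.size() == degree) break;
            owners.add(server);
        }
        for (String server : ring.values()) {                      // wrap around if needed
            if (owners.size() == degree) break;
            if (!owners.contains(server)) owners.add(server);
        }
        return owners;
    }

    private static int hash(String s) { return s.hashCode() & 0x7fffffff; }
}
```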
SLIDE 26
- Background on DKV stores
- STI-BT
- Evaluation
- Related Work
Outline
- maximizing data locality
- hybrid replication
- elastic scaling
- concurrency enhancements
SLIDE 27 Starting point:
- consider a distributed B+Tree built on the DKV
The need for data locality of the index
SLIDE 28 Starting point:
- consider a distributed B+Tree built on the DKV
consistent hash function
S1 S2 S3 S4
The need for data locality of the index
SLIDE 29 Starting point:
- consider a distributed B+Tree built on the DKV
consistent hash function
S1 S2 S3 S4
The need for data locality of the index
- tree nodes placed with random consistent hash
SLIDE 30 [Diagram: B+Tree nodes scattered over servers S1 to S4; leaves holding keys P and Z]
Current problems with data locality
Problems with consistent hashing data placement:
SLIDE 31
- One index access entails several hops
Current problems with data locality
Problems with consistent hashing data placement:
SLIDE 32 [Diagram: deleting key Z starts a traversal from the root]
- One index access entails several hops
Current problems with data locality
Problems with consistent hashing data placement:
SLIDE 33 [Diagram: deleting key Z traverses nodes on servers S1, S3 and S2, one hop per level]
- One index access entails several hops
Current problems with data locality
Problems with consistent hashing data placement:
SLIDE 34 [Diagram: deleting key Z traverses nodes on servers S1, S3 and S2]
- One index access entails several hops
- Some servers receive more load than others
Current problems with data locality
Problems with consistent hashing data placement:
SLIDE 35 [Diagram: per-server load is uneven under random placement]
- One index access entails several hops
- Some servers receive more load than others
Current problems with data locality
Problems with consistent hashing data placement:
SLIDE 36
- One index access entails several hops
- Some servers receive more load than others
- Range scan operations are also inefficient
Current problems with data locality
Problems with consistent hashing data placement:
SLIDE 37 [Diagram: a range scan from P to Z crosses leaves spread over several servers]
- One index access entails several hops
- Some servers receive more load than others
- Range scan operations are also inefficient
Current problems with data locality
Problems with consistent hashing data placement:
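To make the several-hops problem concrete, here is a hedged sketch of what a lookup costs under plain consistent-hash placement (the Dkv and Node interfaces are assumptions for illustration): every level of the descent is a remote get(), so an operation such as the delete of Z above pays roughly one network hop per tree level.

```java
// Naive traversal of a B+Tree stored in a DKV: every child pointer is just a
// key in the store, so each tree level typically costs one remote get() to
// whichever server the consistent hash happened to assign that node to.
interface Dkv { Node get(String nodeKey); }          // one remote round trip per call

interface Node {
    boolean isLeaf();
    String childKeyFor(long searchKey);              // inner node: store key of the next child
    String valueFor(long searchKey);                 // leaf node: indexed value, or null
}

final class NaiveTraversal {
    static String lookup(Dkv store, String rootKey, long searchKey) {
        String nodeKey = rootKey;
        while (true) {
            Node node = store.get(nodeKey);           // remote hop, level by level
            if (node.isLeaf()) {
                return node.valueFor(searchKey);
            }
            nodeKey = node.childKeyFor(searchKey);    // next node likely lives on another server
        }
    }
}
```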
SLIDE 38
Partial replication of the index:
- poor locality
- poor load balancing
Where typical solutions fall short
SLIDE 39
Partial replication of the index:
- poor locality
- poor load balancing
Where typical solutions fall short
Full replication of the index:
- consensus on updates is too expensive
- prevents scaling out storage
SLIDE 40
STI-BT: Maximizing data locality of the index
SLIDE 41
C (cut-off level)
full replication / partial replication
Hybrid replication:
- top nodes are more accessed but less modified
- better load balancing, rare cost for expensive consensus
STI-BT: Maximizing data locality of the index
SLIDE 42 C (cut-off level)
full replication / partial replication
S1 S2 S3 S4
Co-located data placement:
- groups of sub-trees, reduce network hops
- migrate transaction to exploit co-location
Hybrid replication:
- top nodes are more accessed but less modified
- better load balancing, rare cost for expensive consensus
STI-BT: Maximizing data locality of the index
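A minimal sketch of the rule the cut-off level C implies, assuming depth is counted from the root (class and method names are illustrative, not the paper's API): nodes above the cut-off are fully replicated and read locally on every server, while nodes at or below it remain under the ordinary partial replication of the DKV.

```java
// Sketch of the hybrid replication rule implied by the cut-off level C: nodes
// above the cut-off (depth counted from the root) are fully replicated, so they
// are read locally everywhere but need cluster-wide agreement to update; nodes
// at or below the cut-off stay under ordinary partial replication.
enum ReplicationMode { FULL, PARTIAL }

final class HybridPlacement {
    private volatile int cutOffLevel;                 // C, adjusted on elastic scaling

    HybridPlacement(int cutOffLevel) { this.cutOffLevel = cutOffLevel; }

    ReplicationMode modeFor(int nodeDepth) {          // depth 0 = root
        return nodeDepth < cutOffLevel ? ReplicationMode.FULL : ReplicationMode.PARTIAL;
    }

    void lowerCutOff() { cutOffLevel++; }             // "lowering" the cut-off pushes it one level deeper
}
```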
SLIDE 43 K
C
S1 S2 S3 S4
full partial
Transaction migration driven by data co-location
SLIDE 44 K
C
S1 S2 S3 S4 S1 S2 S3 S4
full partial
Transaction migration driven by data co-location
SLIDE 45 K
C
S1 S2 S3 S4 S1 S2 S3 S4
full partial
Lookup K
Transaction migration driven by data co-location
SLIDE 46 K
C
S1 S2 S3 S4 S1 S2 S3 S4
full partial
Lookup K (step 1)
Transaction migration driven by data co-location
SLIDE 47 K
C
S1 S2 S3 S4 S1 S2 S3 S4
full partial
Lookup K (step 2: local search)
Transaction migration driven by data co-location
SLIDE 48 K
C
S1 S2 S3 S4 S1 S2 S3 S4
full partial
Lookup K (step 2: local search; step 3: migrate tx)
Transaction migration driven by data co-location
SLIDE 49 K
C
S1 S2 S3 S4 S1 S2 S3 S4
full partial
Lookup K (step 2: local search; step 3: migrate tx; step 4: local search)
Transaction migration driven by data co-location
SLIDE 50 K
C
S1 S2 S3 S4 S1 S2 S3 S4
full partial
Lookup K (step 2: local search; step 3: migrate tx; step 4: local search)
Transaction migration driven by data co-location
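A hedged sketch of the four steps above (the Tx interface, migrateTo and the helper methods are assumptions for illustration, not Infinispan or STI-BT API): the fully replicated top is searched locally, then the transaction, rather than the data, is shipped to the server that co-locates the target sub-tree, where the remaining descent is again local.

```java
import java.util.function.Supplier;

// Sketch of the migration-driven lookup: (1) the request starts on any server,
// (2) the fully replicated top is searched locally down to the cut-off,
// (3) the transaction is migrated to the server co-locating the target sub-tree,
// (4) the remaining descent is local there.
interface Tx {
    <T> T migrateTo(String server, Supplier<T> continuation);  // run continuation on `server`
}

final class MigratingLookup {
    String lookup(Tx tx, long key) {
        String subTreeRoot = descendReplicatedTop(key);         // (2) local: top nodes are on every server
        String owner = ownerOfSubTree(subTreeRoot);             // consistent hash of the sub-tree root
        return tx.migrateTo(owner,                              // (3) ship the transaction, not the data
                () -> descendSubTree(subTreeRoot, key));        // (4) local search at the owner
    }

    // Placeholder bodies so the sketch is self-contained; real logic lives in the tree.
    private String descendReplicatedTop(long key) { return "subtree-" + (key % 4); }
    private String ownerOfSubTree(String subTreeRoot) { return "S" + (1 + (subTreeRoot.hashCode() & 0x7fffffff) % 4); }
    private String descendSubTree(String subTreeRoot, long key) { return "value-of-" + key; }
}
```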
SLIDE 51 Still rely on consistent hashing:
- preserve fully decentralized design and quick lookup of data
Exploit knowledge over structure of the indexed data
- general purpose data placement is agnostic of the data
- but we know how it will be structured
Grouping index in sub-trees
SLIDE 52 Still rely on consistent hashing:
- preserve fully decentralized design and quick lookup of data
Exploit knowledge over structure of the indexed data
- general purpose data placement is agnostic of the data
- but we know how it will be structured
[Diagram: consistent hash function maps a key to a server; ku: unique key]
Grouping index in sub-trees
SLIDE 53 Still rely on consistent hashing:
- preserve fully decentralized design and quick lookup of data
Exploit knowledge over structure of the indexed data
- general purpose data placement is agnostic of the data
- but we know how it will be structured
[Diagram: consistent hash function maps the key to a server; the entry is then found by a local map lookup on ku, the unique key]
Grouping index in sub-trees
SLIDE 54 Still rely on consistent hashing:
- preserve fully decentralized design and quick lookup of data
Exploit knowledge over structure of the indexed data
- general purpose data placement is agnostic of the data
- but we know how it will be structured
[Diagram: consistent hash function maps the key to a server; the entry is then found by a local map lookup on ku, the unique key]
key = { ku , kcl }
kcl: co-location identifier
Grouping index in sub-trees
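A minimal sketch of the composite key above (field and method names are assumptions for illustration): only kcl feeds the placement hash, so every node of a sub-tree shares the same owners, while ku keeps entries distinct inside the owning server's local map.

```java
import java.util.Objects;

// Sketch of the two-part key from the slides: kcl (shared by a whole sub-tree)
// drives the consistent hash, so all nodes of the sub-tree land on the same
// servers; ku then distinguishes entries within that group in the local map.
final class IndexKey {
    final String ku;    // unique key of this tree node / entry
    final String kcl;   // co-location identifier: the sub-tree this entry belongs to

    IndexKey(String ku, String kcl) { this.ku = ku; this.kcl = kcl; }

    // Routing hash: only kcl matters, so the whole sub-tree is co-located.
    int routingHash() { return kcl.hashCode(); }

    // Equality still uses both parts, so the local map can tell entries apart.
    @Override public boolean equals(Object o) {
        return o instanceof IndexKey
                && ku.equals(((IndexKey) o).ku)
                && kcl.equals(((IndexKey) o).kcl);
    }
    @Override public int hashCode() { return Objects.hash(ku, kcl); }
}
```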
SLIDE 55
Algorithms in the paper
SLIDE 56 Algorithms in the paper
including load balancing of sub-trees between different servers
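The load-balancing algorithm itself is in the paper; purely to illustrate the idea, here is a hypothetical sketch that hands one sub-tree from the busiest to the idlest server when the imbalance crosses a threshold (all names and the policy are assumptions, not the paper's algorithm).

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration of sub-tree load balancing: if one server is much
// busier than another, delegate one of its sub-trees to the idler server.
// moveSubTree() stands in for transferring ownership of the co-located sub-tree.
final class SubTreeBalancer {
    static void rebalance(Map<String, Integer> loadPerServer,
                          Map<String, List<String>> subTreesPerServer,
                          double imbalanceThreshold) {
        String busiest = null, idlest = null;
        for (String server : loadPerServer.keySet()) {
            if (busiest == null || loadPerServer.get(server) > loadPerServer.get(busiest)) busiest = server;
            if (idlest == null || loadPerServer.get(server) < loadPerServer.get(idlest)) idlest = server;
        }
        if (busiest == null || busiest.equals(idlest)) return;

        boolean imbalanced = loadPerServer.get(busiest) > imbalanceThreshold * loadPerServer.get(idlest);
        if (imbalanced && !subTreesPerServer.get(busiest).isEmpty()) {
            String victim = subTreesPerServer.get(busiest).remove(0);   // pick one sub-tree to move
            moveSubTree(victim, busiest, idlest);
            subTreesPerServer.get(idlest).add(victim);
        }
    }

    private static void moveSubTree(String subTreeRoot, String from, String to) {
        // placeholder: in STI-BT this would reassign ownership of the whole co-located sub-tree
    }
}
```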
SLIDE 57
Management of the cut-off
There is a trade-off in the cut-off:
SLIDE 58
- as high as possible, to keep the fully replicated part as small as possible and avoid costly consensus upon updates
Management of the cut-off
There is a trade-off in the cut-off:
SLIDE 59
- as high as possible, to keep the fully replicated part as small as possible and avoid costly consensus upon updates
Management of the cut-off
- but deep enough to create enough sub-trees to load balance across all machines
There is a trade-off in the cut-off:
SLIDE 60
- as high as possible, to keep the fully replicated part as small as possible and avoid costly consensus upon updates
Management of the cut-off
- but deep enough to create enough sub-trees to load balance across all machines
Dynamic problem: changes with elastic scaling of the machines
There is a trade-off in the cut-off:
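One way to read this trade-off, stated as a hedged heuristic (the paper's actual policy may differ): pick the smallest cut-off level whose number of sub-trees, roughly arity^C, is enough to spread across all machines, and re-evaluate it whenever machines join or leave.

```java
// Hedged sketch of a cut-off heuristic consistent with the trade-off above:
// choose the smallest C whose ~arity^C sub-trees can be spread over all
// machines (times a small factor for balance). The paper's actual policy may
// differ; this only illustrates the direction of the trade-off.
final class CutOffHeuristic {
    static int chooseCutOff(int arity, int machines, int subTreesPerMachine) {
        long target = (long) machines * subTreesPerMachine;
        int level = 0;
        long subTrees = 1;                       // one sub-tree if we cut at the root
        while (subTrees < target) {
            subTrees *= arity;                   // one level deeper => arity times more sub-trees
            level++;
        }
        return level;                            // smallest C with enough sub-trees
    }
}
// e.g. chooseCutOff(8, 60, 2) == 3, since 8^3 = 512 >= 120 sub-trees
```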
SLIDE 61 C
S1 S2 S3 S4
full partial
S1 S2 S3 S4
Elastic scaling of the index
SLIDE 62 C
S1 S2 S3 S4
full partial
S1 S2 S3 S4 S5
?
Elastic scaling of the index
SLIDE 63 C
S1 S2 S3 S4
full partial
S1 S2 S3 S4 S5
?
Elastic scaling of the index
lower cut-off
SLIDE 64 C
S1 S2 S3 S4
full partial
S1 S2 S3 S4 S5 S5
Elastic scaling of the index
SLIDE 65 C
S1 S2 S3 S4
full partial
S1 S2 S3 S4 S5 S5
More sub-trees than needed!
Elastic scaling of the index
SLIDE 66 C
S1 S2 S3 S4
full partial
S1 S2 S3 S4 S5 S5
More sub-trees than needed!
Elastic scaling of the index
fine-grained change
SLIDE 67 C
S1 S2 S3 S4
full partial
S1 S2 S3 S4 S5 S5
Elastic scaling of the index
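A heavily hedged sketch of the fine-grained alternative shown above (helper names and the policy are hypothetical; the precise algorithm is in the paper): instead of lowering the cut-off everywhere and creating more sub-trees than needed, split only as many existing sub-trees as the joining server actually needs to take over.

```java
import java.util.List;

// Hypothetical illustration of the "fine-grained change" from the slides: when a
// machine joins, push the cut-off down only for the sub-trees that will be
// delegated to it, instead of lowering it globally.
final class FineGrainedScaleOut {
    static void onMachineJoin(List<String> subTreeRoots, String newServer, int subTreesNeeded) {
        int taken = 0;
        for (String root : subTreeRoots) {
            if (taken == subTreesNeeded) break;            // the newcomer already has enough load
            for (String child : splitOneLevel(root)) {     // lower the cut-off for this sub-tree only
                if (taken == subTreesNeeded) break;
                reassign(child, newServer);                // delegate just these children
                taken++;
            }
        }
    }

    private static List<String> splitOneLevel(String subTreeRoot) {
        return List.of(subTreeRoot + "/0", subTreeRoot + "/1");   // placeholder children
    }
    private static void reassign(String subTreeRoot, String server) {
        // placeholder: transfer ownership of the sub-tree to `server`
    }
}
```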
SLIDE 68
We prove that this scheme is memory efficient and does not hamper scalability.
Elastic scaling of the index
Fully replicated part adds memory overhead as the cluster scales.
SLIDE 69
Avoid aborts of transactions mutating the index. More details in the paper.
Concurrency enhancements
SLIDE 70 Evaluation
Built on top of Infinispan
- open-source DKV from Red Hat
- YCSB
- inserted data is used as a secondary index
- operations made transactional
- Up to 100 VMs in a cloud cluster (FutureGrid)
- Partial replication degree: 2
SLIDE 71 [Bar chart: throughput (1000 txs/sec) per workload: Balanced, Read-Dominated, Read-Only, Read-Latest, Scan-Heavy, RMW Balanced; series: STI-BT]
STI-BT:
- Each transaction performs an average of 2 remote requests
Evaluating each contribution ( 1 / 5 )
60 machines
SLIDE 72 [Bar chart: throughput (1000 txs/sec) per workload; series: Baseline, STI-BT]
Baseline:
- Simple B+Tree on top of Infinispan/GMU
- None of the improvements of STI-BT
Evaluating each contribution ( 2 / 5 )
SLIDE 73 [Bar chart: throughput (1000 txs/sec) per workload; series: Baseline, STI-BT]
Baseline:
- Simple B+Tree on top of Infinispan/GMU
- None of the improvements of STI-BT
Evaluating each contribution ( 2 / 5 )
- 10 to 32 average remote requests per transaction
SLIDE 74 [Bar chart: throughput (1000 txs/sec) per workload; series: Baseline, Sub-trees, STI-BT]
Sub-Trees:
- Sub-trees co-located and transaction migration
- But no hybrid replication
- Machines replicating top tree nodes are over-loaded
Evaluating each contribution ( 3 / 5 )
SLIDE 75 [Bar chart: throughput (1000 txs/sec) per workload; series: Baseline, Sub-trees, STI-BT]
Sub-Trees:
- Sub-trees co-located and transaction migration
- But no hybrid replication
- Machines replicating top tree nodes are over-loaded
Evaluating each contribution ( 3 / 5 )
- reduced average remote requests to 2.5 per transaction
- 6.6x speedup over Baseline
SLIDE 76 [Bar chart: throughput (1000 txs/sec) per workload; series: Baseline, Sub-trees, Dirty, STI-BT]
Dirty:
- Concurrency enhancements to reduce tx aborts
- But no smart co-location of data
Evaluating each contribution ( 4 / 5 )
SLIDE 77 [Bar chart: throughput (1000 txs/sec) per workload; series: Baseline, Sub-trees, Dirty, STI-BT]
Dirty:
- Concurrency enhancements to reduce tx aborts
- But no smart co-location of data
Evaluating each contribution ( 4 / 5 )
- 2.5x speedup over Baseline
SLIDE 78 [Bar chart: throughput (1000 txs/sec) per workload; series: Baseline, Sub-trees, Dirty, TopFull, STI-BT]
TopFull:
- Hybrid replication with the cut-off level
- But none of the other improvements
Evaluating each contribution ( 5 / 5 )
SLIDE 79 [Bar chart: throughput (1000 txs/sec) per workload; series: Baseline, Sub-trees, Dirty, TopFull, STI-BT]
TopFull:
- Hybrid replication with the cut-off level
- But none of the other improvements
Evaluating each contribution ( 5 / 5 )
- Reduced average remote requests from 16 to 9
- 1.9x speedup over Baseline
SLIDE 80 [Bar chart: throughput (1000 txs/sec) per workload; series: Baseline, Sub-trees, Dirty, TopFull, STI-BT]
TopFull:
- Hybrid replication with the cut-off level
- But none of the other improvements
Evaluating each contribution ( 5 / 5 )
- Reduced average remote requests from 16 to 9
- 1.9x speedup over Baseline
SLIDE 81 Assessing Scalability
[Line chart: throughput (1000 txs/sec) vs #machines, from 2 to 100 machines]
SLIDE 82 90% of ops in 6ms or less
Assessing Scalability
[Line chart: throughput (1000 txs/sec) vs #machines, from 2 to 100 machines]
SLIDE 83 90% of ops in 6ms or less
Performance is unlocked by the combination of the mechanisms. Similar outcome in other workloads.
Assessing Scalability
[Line chart: throughput (1000 txs/sec) vs #machines, from 2 to 100 machines]
SLIDE 84 90% of ops in 6ms or less
Performance is unlocked by the combination of the mechanisms. Similar outcome in other workloads.
Assessing Scalability
[Line chart: throughput (1000 txs/sec) vs #machines, from 2 to 100 machines]
60 machines
SLIDE 85 Adapting the cut-off level vs Static heuristics
- AllInner: fully replicate all inner nodes
- B+Tree rebalancing causes costly updates
- FixedAt2: cut-off fixed at level 2
- poor load balancing as more machines join
SLIDE 86 Adapting the cut-off level vs Static heuristics
- AllInner: fully replicate all inner nodes
- B+Tree rebalancing causes costly updates
- FixedAt2: cut-off fixed at level 2
- poor load balancing as more machines join
[Chart: slowdown relative to STI-BT vs #machines, from 10 to 100]
YCSB workload A: 50% lookups, 50% modifications
SLIDE 87 Adapting the cut-off level vs Static heuristics
- AllInner: fully replicate all inner nodes
- B+Tree rebalancing causes costly updates
- FixedAt2: cut-off fixed at level 2
- poor load balancing as more machines join
[Chart: slowdown relative to STI-BT vs #machines, from 10 to 100; series: AllInner, FixedAt2, STI-BT]
YCSB workload A: 50% lookups, 50% modifications
SLIDE 88
- low arity => deeper trees, more accesses to DKV
- stable performance for non-minimal arity
Varying the arity of the B+Tree
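A back-of-the-envelope check of the first bullet (numbers are illustrative, not measurements from the paper): with tree depth roughly ceil(log_arity(N)), a small arity means a noticeably deeper tree and therefore more DKV accesses per lookup, while beyond a moderate arity the depth barely changes, matching the stable-performance observation.

```java
// Back-of-the-envelope: tree depth ~ ceil(log_arity(N)). With N = 10 million
// indexed keys, arity 8 gives depth 8 while arity 64 gives depth 4; growing the
// arity much further wins little, which matches the "stable performance" bullet.
final class ArityDepth {
    static int depth(long keys, int arity) {
        return (int) Math.ceil(Math.log(keys) / Math.log(arity));
    }

    public static void main(String[] args) {
        System.out.println(depth(10_000_000L, 8));   // 8
        System.out.println(depth(10_000_000L, 64));  // 4
    }
}
```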
SLIDE 89
- Read dominated workload in YCSB
- With up to 60 machines
- Latency remains stable
Scaling the Data Indexed
SLIDE 90 Sinfonia tree [VLDB’08]
- all inner nodes are fully replicated
Related Work
[Chart: throughput (1000 txs/sec) vs #machines, from 20 to 100; series: Baseline, Minuet Emulation, STI-BT]
SLIDE 91 Sinfonia tree [VLDB’08]
- all inner nodes are fully replicated
Global index [VLDB’10]
- could be integrated with STI-BT
Related Work
[Chart: throughput (1000 txs/sec) vs #machines, from 20 to 100; series: Baseline, Minuet Emulation, STI-BT]
SLIDE 92 Sinfonia tree [VLDB’08]
- all inner nodes are fully replicated
Global index [VLDB’10]
- could be integrated with STI-BT
YCSB Read-Latest workload
Related Work
Minuet [VLDB’12]
- lack of data placement
- snapshot creation for read-only transactions is expensive
[Chart: throughput (1000 txs/sec) vs #machines, from 20 to 100; series: Baseline, Minuet Emulation, STI-BT]
SLIDE 93 STI-BT: A Scalable Transactional Index
Nuno Diegues and Paolo Romano
Thank you!
Questions?