SLIDE 1

STI-BT: A Scalable Transactional Index

Nuno Diegues and Paolo Romano

34th International Conference on Distributed Computing Systems (ICDCS)

SLIDE 2

Distributed Key-Value (DKV) stores rise in popularity:

SLIDE 3

Distributed Key-Value (DKV) stores rise in popularity:

  • scalability
  • fault-tolerance
  • elasticity
SLIDE 4

Distributed Key-Value (DKV) stores rise in popularity:

  • scalability
  • fault-tolerance
  • elasticity

Recent trend:

  • cloud adoption, large/elastic scaling
  • move towards strong consistency
  • easy and transparent APIs
SLIDE 5

We focus on two main disadvantages:

  • typically embrace weak consistency
  • key-value access is too simplistic

  • mainly an index for primary key
SLIDE 6

Providing a secondary index is non-trivial

We focus on two main disadvantages:

  • typically embrace weak consistency
  • key-value access is too simplistic

  • mainly an index for primary key
SLIDE 7

Providing a secondary index is non-trivial. But it is desirable!

We focus on two main disadvantages:

  • typically embrace weak consistency
  • key-value access is too simplistic

  • mainly an index for primary key
SLIDE 8

State-of-the-art solutions either:

  • did not provide strongly consistent transactions — more difficult
  • were not fully decentralised — not scalable
  • required several hops to access the index — more latency

Providing a secondary index is non-trivial. But it is desirable!

We focus on two main disadvantages:

  • typically embrace weak consistency
  • key-value access is too simplistic

  • mainly an index for primary key
SLIDE 9

In this work we present STI-BT: a Scalable Transactional Index

SLIDE 10

In this work we present STI-BT: a Scalable Transactional Index

  • Serializable distributed transactions
  • Secondary indexes via a distributed B+Tree implementation
  • Index accesses/changes obey transactions’ semantics

SLIDE 11

Provide strong consistency + scalable indexing

In this work we present STI-BT: a Scalable Transactional Index

  • Serializable distributed transactions
  • Secondary indexes via a distributed B+Tree implementation
  • Index accesses/changes obey transactions’ semantics
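
To make the guarantee concrete, here is a minimal sketch of what "index accesses obey the transactions' semantics" means for an application; the interfaces below are hypothetical stand-ins, not Infinispan's or STI-BT's actual API. The primary-key write and the secondary-index update run inside one serializable transaction, so they commit or abort together.

```java
// Hypothetical interfaces used only for illustration.
interface TxStore {
    void begin();
    void commit();
    void put(String primaryKey, String value);            // primary key-value write
}

interface SecondaryIndex {
    void insert(String indexedValue, String primaryKey);  // secondary-index write
}

class UpdateWithIndex {
    // Both writes belong to the same transaction: a concurrent reader never
    // sees the key-value entry without its index entry, or vice versa.
    static void addUser(TxStore store, SecondaryIndex byCity,
                        String userId, String city) {
        store.begin();
        store.put(userId, city);
        byCity.insert(city, userId);
        store.commit();
    }
}
```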

SLIDE 12
  • Background on DKV stores
  • STI-BT
  • Evaluation
  • Related Work

Outline

SLIDE 13

Infinispan DKV store by Red Hat

Background on DKV store

SLIDE 14

Distributed vector-clock based protocol:

GMU [ICDCS’12]

Infinispan DKV store by Red Hat

Background on DKV store

SLIDE 15

Distributed vector-clock based protocol:

GMU [ICDCS’12]

Infinispan DKV store by Red Hat

Read-only transactions do not abort

Background on DKV store

Multi-versioned

SLIDE 16

Distributed vector-clock based protocol:

GMU [ICDCS’12]
Update Serializability

Infinispan DKV store by Red Hat

Read-only transactions do not abort

Background on DKV store

Update transactions

Multi-versioned

SLIDE 17

Distributed vector-clock based protocol:

GMU [ICDCS’12]
Update Serializability

Infinispan DKV store by Red Hat

Read-only transactions do not abort

Background on DKV store

Update transactions

Multi-versioned
Genuine

SLIDE 18

data set:

GMU

SLIDE 19

replication degree: 2
consistent hash function
data set:

GMU

SLIDE 20

replication degree: 2
consistent hash function
data set:

GMU

SLIDE 21

replication degree: 2
consistent hash function
data set:

GMU

SLIDE 22

  • No central component
  • Transactions require only the machines holding the data they use

GMU: genuine partial replication

SLIDE 23

  • No central component
  • Transactions require only the machines holding the data they use

read/write

GMU: genuine partial replication

SLIDE 24

  • No central component
  • Transactions require only the machines holding the data they use

commit tx

GMU: genuine partial replication

SLIDE 25

  • No central component
  • Transactions require only the machines holding the data they use

commit tx → consensus for commit

GMU: genuine partial replication
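
A minimal sketch of the placement scheme behind genuine partial replication, assuming a simplified ring and Java's hashCode in place of a real hash function; this is not Infinispan's actual ConsistentHash API. Every key has replicationDegree = 2 owners, so a transaction only contacts the owners of the keys it reads or writes, and commit involves consensus among just those machines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Simplified consistent-hash ring with replication degree 2.
class ConsistentHashRing {
    private final SortedMap<Integer, String> ring = new TreeMap<>();
    private static final int REPLICATION_DEGREE = 2;

    void addServer(String server) {
        ring.put(server.hashCode(), server);   // one point per server, for brevity
    }

    // Owners of a key: the first REPLICATION_DEGREE servers clockwise
    // from the key's position on the ring.
    List<String> ownersOf(Object key) {
        List<String> owners = new ArrayList<>();
        for (String s : ring.tailMap(key.hashCode()).values()) {
            if (owners.size() == REPLICATION_DEGREE) break;
            owners.add(s);
        }
        for (String s : ring.values()) {        // wrap around the ring
            if (owners.size() == REPLICATION_DEGREE) break;
            if (!owners.contains(s)) owners.add(s);
        }
        return owners;
    }
}
```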

SLIDE 26
  • Background on DKV stores
  • STI-BT
  • Evaluation
  • Related Work

Outline

  • maximizing data locality
  • hybrid replication
  • elastic scaling
  • concurrency enhancements
SLIDE 27

Starting point:

  • consider a distributed B+Tree built on the DKV

The need for data locality of the index

SLIDE 28

Starting point:

  • consider a distributed B+Tree built on the DKV

[Diagram: a consistent hash function maps keys onto servers S1–S4]

The need for data locality of the index

SLIDE 29

Starting point:

  • consider a distributed B+Tree built on the DKV

[Diagram: B+Tree nodes spread over servers S1–S4 by the consistent hash function]

The need for data locality of the index

  • tree nodes placed with random consistent hash
SLIDE 30

[Diagram: index nodes for keys P…Z scattered across servers S1–S4]

Current problems with data locality

Problems with consistent hashing data placement:

SLIDE 31

[Diagram: looking up key P traverses nodes on several servers]

  • One index access entails several hops

Current problems with data locality

Problems with consistent hashing data placement:

SLIDE 32

[Diagram: a delete of key Z starts traversing nodes on several servers]

  • One index access entails several hops

Current problems with data locality

Problems with consistent hashing data placement:

SLIDE 33

[Diagram: the delete of key Z touches nodes on servers S1, S3 and S2]

  • One index access entails several hops

Current problems with data locality

Problems with consistent hashing data placement:

SLIDE 34

[Diagram: the delete of key Z touches nodes on servers S1, S3 and S2]

  • One index access entails several hops
  • Some servers receive more load than others

Current problems with data locality

Problems with consistent hashing data placement:

SLIDE 35

[Diagram: uneven server load under consistent-hash placement of the index]

  • One index access entails several hops
  • Some servers receive more load than others

Current problems with data locality

Problems with consistent hashing data placement:

SLIDE 36

[Diagram: uneven server load under consistent-hash placement of the index]

  • One index access entails several hops
  • Some servers receive more load than others
  • Range scan operations are also inefficient

Current problems with data locality

Problems with consistent hashing data placement:

SLIDE 37

[Diagram: a range scan from P to Z visits nodes spread over many servers]

  • One index access entails several hops
  • Some servers receive more load than others
  • Range scan operations are also inefficient

Current problems with data locality

Problems with consistent hashing data placement:

SLIDE 38

Partial replication of the index:

  • poor locality
  • poor load balancing

Where typical solutions fall short

SLIDE 39

Partial replication of the index:

  • poor locality
  • poor load balancing

Where typical solutions fall short

Full replication of the index:

  • consensus on updates is too expensive
  • prevents scaling out storage

SLIDE 40

STI-BT: Maximizing data locality of the index

SLIDE 41

C (cut-off level): full replication above, partial replication below

Hybrid replication:
  • top nodes are more accessed but less modified
  • better load balancing, rare cost for expensive consensus

STI-BT: Maximizing data locality of the index

SLIDE 42

C (cut-off level): full replication above, partial replication below, over servers S1–S4

Co-located data placement:
  • groups of sub-trees, reduce network hops
  • migrate transaction to exploit co-location

Hybrid replication:
  • top nodes are more accessed but less modified
  • better load balancing, rare cost for expensive consensus

STI-BT: Maximizing data locality of the index
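
A minimal sketch of the hybrid placement rule described on this slide, with hypothetical names (ClusterView, ownersOf) rather than STI-BT's real code: nodes above the cut-off level are kept on every server, while nodes below it are placed by the consistent hash of their sub-tree identifier, so a whole sub-tree stays on the same machines.

```java
import java.util.List;

class HybridPlacement {
    // Hypothetical view of the cluster: full membership plus a
    // consistent-hash lookup that returns a few owners per routing key.
    interface ClusterView {
        List<String> allServers();
        List<String> consistentHashOwners(Object routingKey);
    }

    static List<String> ownersOf(int nodeDepth, int cutOffLevel,
                                 Object subTreeId, ClusterView cluster) {
        if (nodeDepth < cutOffLevel) {
            // Above the cut-off: read-mostly top of the tree, fully
            // replicated so any server can start a lookup locally.
            return cluster.allServers();
        }
        // Below the cut-off: partially replicated, routed by the sub-tree
        // identifier so all nodes of the sub-tree are co-located.
        return cluster.consistentHashOwners(subTreeId);
    }
}
```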

SLIDE 43

[Diagram: B+Tree with cut-off level C: full replication above, partial replication below, over servers S1–S4; key K sits in one of the sub-trees]

Transaction migration driven by data co-location

SLIDE 44

[Diagram: the sub-trees below the cut-off are each co-located on one of the servers S1–S4]

Transaction migration driven by data co-location

SLIDE 45

Lookup K

Transaction migration driven by data co-location

SLIDE 46

Lookup K (step 1)

Transaction migration driven by data co-location

SLIDE 47

Lookup K (step 2: local search)

Transaction migration driven by data co-location

SLIDE 48

Lookup K (step 3: migrate tx)

Transaction migration driven by data co-location

SLIDE 49

Lookup K (step 4: local search)

Transaction migration driven by data co-location

SLIDE 50

Lookup K: local search in the fully replicated top (2), migrate the transaction to the server co-locating the target sub-tree (3), then finish with a local search there (4)

Transaction migration driven by data co-location
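
The sketch below mirrors the numbered steps above: a local search through the fully replicated top, a transaction migration to the server that co-locates the target sub-tree, and a purely local search there. Node and TxContext are hypothetical interfaces, not STI-BT's or Infinispan's classes.

```java
import java.util.function.Supplier;

interface Node {
    boolean isLeaf();
    Node childFor(long key);     // local read when the node is replicated here
    String owningServer();       // server that co-locates this node's sub-tree
    Long valueFor(long key);     // only meaningful on leaves
}

interface TxContext {
    // Ships the remainder of the transaction to `server` and runs it there.
    <T> T migrateAndRun(String server, Supplier<T> rest);
}

class StiBtLookup {
    Long lookup(Node root, long key, int cutOffLevel, TxContext tx) {
        Node n = root;
        // (1)+(2): local search through the fully replicated top levels
        for (int depth = 0; depth < cutOffLevel && !n.isLeaf(); depth++) {
            n = n.childFor(key);
        }
        if (n.isLeaf()) return n.valueFor(key);
        // (3): migrate the transaction to the server owning the sub-tree
        final Node subTreeRoot = n;
        return tx.migrateAndRun(subTreeRoot.owningServer(), () -> {
            // (4): local search inside the co-located sub-tree
            Node cur = subTreeRoot;
            while (!cur.isLeaf()) cur = cur.childFor(key);
            return cur.valueFor(key);
        });
    }
}
```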

SLIDE 51

Still rely on consistent hashing:

  • preserve fully decentralized design and quick lookup of data

Exploit knowledge of the structure of the indexed data

  • general purpose data placement is agnostic of the data
  • but we know how it will be structured

Grouping index in sub-trees

SLIDE 52

Still rely on consistent hashing:

  • preserve fully decentralized design and quick lookup of data

Exploit knowledge of the structure of the indexed data

  • general purpose data placement is agnostic of the data
  • but we know how it will be structured

consistent hash function: ku (unique key) → server

Grouping index in sub-trees

SLIDE 53

Still rely on consistent hashing:

  • preserve fully decentralized design and quick lookup of data

Exploit knowledge of the structure of the indexed data

  • general purpose data placement is agnostic of the data
  • but we know how it will be structured

consistent hash function: ku (unique key) → server, then a local map lookup by ku

Grouping index in sub-trees

SLIDE 54

Still rely on consistent hashing:

  • preserve fully decentralized design and quick lookup of data

Exploit knowledge of the structure of the indexed data

  • general purpose data placement is agnostic of the data
  • but we know how it will be structured

consistent hash function: ku (unique key) → server, then a local map lookup by ku

key = { ku , kcl }

kcl: co-location identifier

Grouping index in sub-trees
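
A sketch of the composite-key idea with assumed class and field names (not the paper's code): the DKV store routes an entry by the hash of kcl, so every node of a sub-tree shares one kcl and lands on the same servers, while ku keeps the entry unique inside the server's local map.

```java
// Illustrative composite key: routed by kcl, identified by {ku, kcl}.
final class IndexKey {
    final String ku;   // unique key: identifies one tree node
    final String kcl;  // co-location identifier: shared by a whole sub-tree

    IndexKey(String ku, String kcl) { this.ku = ku; this.kcl = kcl; }

    // Routing (consistent hashing) looks only at the co-location identifier...
    int routingHash() { return kcl.hashCode(); }

    // ...while lookups inside a server's local map still use the full key.
    @Override public boolean equals(Object o) {
        return o instanceof IndexKey
                && ((IndexKey) o).ku.equals(ku)
                && ((IndexKey) o).kcl.equals(kcl);
    }
    @Override public int hashCode() { return 31 * ku.hashCode() + kcl.hashCode(); }
}
```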

SLIDE 55

Algorithms in the paper

SLIDE 56

Algorithms in the paper

including load balancing of sub-trees between different servers

SLIDE 57

Management of the cut-off

There is a trade-off in the cut-off:

SLIDE 58

  • as high as possible: keeps the fully replicated part small and avoids costly consensus upon updates

Management of the cut-off

There is a trade-off in the cut-off:

SLIDE 59

  • as high as possible: keeps the fully replicated part small and avoids costly consensus upon updates

Management of the cut-off

  • but deep enough to create sub-trees: need enough of them to load balance across all machines

There is a trade-off in the cut-off:

SLIDE 60

  • as high as possible: keeps the fully replicated part small and avoids costly consensus upon updates

Management of the cut-off

  • but deep enough to create sub-trees: need enough of them to load balance across all machines

Dynamic problem: it changes with elastic scaling of the machines

There is a trade-off in the cut-off:
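
The trade-off suggests choosing the shallowest cut-off that still yields enough sub-trees to spread load over every machine, and recomputing it when machines join or leave. The heuristic below only illustrates that reasoning and is not the paper's actual policy; the arity, machine count and slack factor are assumed inputs.

```java
class CutOffPolicy {
    // arity: max children per inner node; subTreesPerMachine: desired slack.
    static int cutOffLevel(int arity, int machines, int subTreesPerMachine) {
        long needed = (long) machines * subTreesPerMachine;
        int level = 0;
        long subTrees = 1;            // cutting at the root gives one sub-tree
        while (subTrees < needed) {   // each extra level multiplies the number
            subTrees *= arity;        // of sub-trees by the arity
            level++;
        }
        return level;
    }
}
// Example: arity 16, 100 machines, 4 sub-trees per machine => needed = 400;
// levels give 1, 16, 256, 4096 sub-trees, so the chosen cut-off level is 3.
```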

SLIDE 61

[Diagram: index with cut-off level C over servers S1–S4: full replication above, partial replication below]

Elastic scaling of the index

SLIDE 62

[Diagram: a new server S5 joins the cluster; where does it fit in the index?]

Elastic scaling of the index

SLIDE 63

[Diagram: a new server S5 joins the cluster]

Elastic scaling of the index

lower cut-off

SLIDE 64

[Diagram: the cut-off is lowered and S5 receives sub-trees of the index]

Elastic scaling of the index

SLIDE 65

[Diagram: lowering the cut-off across the whole tree]

More sub-trees than needed!

Elastic scaling of the index

SLIDE 66

[Diagram: lowering the cut-off across the whole tree]

More sub-trees than needed!

Elastic scaling of the index

fine-grained change

SLIDE 67

[Diagram: the cut-off is lowered only where needed, with S5 taking over those sub-trees]

Elastic scaling of the index

SLIDE 68

We prove that this scheme is memory efficient and does not hamper scalability.

Elastic scaling of the index

Fully replicated part adds memory overhead as the cluster scales.

SLIDE 69

Avoid aborts of transactions mutating the index. More details in the paper.

Concurrency enhancements

SLIDE 70

Evaluation

Built on top of Infinispan

  • open-source DKV from Red Hat
  • YCSB
  • inserted data used as a secondary index
  • operations made transactional
  • Up to 100 VMs in a cloud cluster (FutureGrid)
  • Partial replication degree: 2
SLIDE 71

[Chart: throughput (1000 txs/sec) of STI-BT across YCSB workloads: Balanced, Read-Dominated, Read-Only, Read-Latest, Scan-Heavy, RMW Balanced]

STI-BT:

  • Each transaction performs an average of 2 remote requests

Evaluating each contribution ( 1 / 5 )

60 machines

SLIDE 72

[Chart: throughput (1000 txs/sec) of Baseline vs STI-BT across the YCSB workloads]

Baseline:

  • Simple B+Tree on top of Infinispan/GMU
  • None of the improvements of STI-BT

Evaluating each contribution ( 2 / 5 )

SLIDE 73

[Chart: throughput (1000 txs/sec) of Baseline vs STI-BT across the YCSB workloads]

Baseline:

  • Simple B+Tree on top of Infinispan/GMU
  • None of the improvements of STI-BT

Evaluating each contribution ( 2 / 5 )

  • 10 to 32 average remote requests per transaction
SLIDE 74

[Chart: throughput (1000 txs/sec) of Baseline, Sub-trees and STI-BT across the YCSB workloads]

Sub-Trees:

  • Sub-trees co-located and transaction migration
  • But no hybrid replication
  • Machines replicating top tree nodes are over-loaded

Evaluating each contribution ( 3 / 5 )

SLIDE 75

[Chart: throughput (1000 txs/sec) of Baseline, Sub-trees and STI-BT across the YCSB workloads]

Sub-Trees:

  • Sub-trees co-located and transaction migration
  • But no hybrid replication
  • Machines replicating top tree nodes are over-loaded

Evaluating each contribution ( 3 / 5 )

  • reduced average remote requests to 2.5 per transaction
  • 6.6x speedup over Baseline
SLIDE 76

[Chart: throughput (1000 txs/sec) of Baseline, Sub-trees, Dirty and STI-BT across the YCSB workloads]

Dirty:

  • Concurrency enhancements to reduce tx aborts
  • But no smart co-location of data

Evaluating each contribution ( 4 / 5 )

SLIDE 77

[Chart: throughput (1000 txs/sec) of Baseline, Sub-trees, Dirty and STI-BT across the YCSB workloads]

Dirty:

  • Concurrency enhancements to reduce tx aborts
  • But no smart co-location of data

Evaluating each contribution ( 4 / 5 )

  • 2.5x speedup over Baseline
SLIDE 78

[Chart: throughput (1000 txs/sec) of Baseline, Sub-trees, Dirty, TopFull and STI-BT across the YCSB workloads]

TopFull:

  • Hybrid replication with the cut-off level
  • But none of the other improvements

Evaluating each contribution ( 5 / 5 )

SLIDE 79

[Chart: throughput (1000 txs/sec) of Baseline, Sub-trees, Dirty, TopFull and STI-BT across the YCSB workloads]

TopFull:

  • Hybrid replication with the cut-off level
  • But none of the other improvements

Evaluating each contribution ( 5 / 5 )

  • Reduce average remote requests from 16 to 9
  • 1.9x speedup over Baseline
SLIDE 80

[Chart: throughput (1000 txs/sec) of Baseline, Sub-trees, Dirty, TopFull and STI-BT across the YCSB workloads]

TopFull:

  • Hybrid replication with the cut-off level
  • But none of the other improvements

Evaluating each contribution ( 5 / 5 )

  • Reduce average remote requests from 16 to 9
  • 1.9x speedup over Baseline
SLIDE 81

Assessing Scalability

[Chart: throughput (1000 txs/sec) as the cluster grows from 2 to 100 machines]

SLIDE 82

90% of ops in 6ms or less

Assessing Scalability

[Chart: throughput (1000 txs/sec) as the cluster grows from 2 to 100 machines]

SLIDE 83

90% of ops in 6ms or less

Performance is unlocked by the combination of the mechanisms. Similar outcome in other workloads.

Assessing Scalability

[Chart: throughput (1000 txs/sec) as the cluster grows from 2 to 100 machines]

SLIDE 84

90% of ops in 6ms or less

Performance is unlocked by the combination of the mechanisms. Similar outcome in other workloads.

Assessing Scalability

[Chart: throughput (1000 txs/sec) as the cluster grows from 2 to 100 machines; annotation at 60 machines]

SLIDE 85

Adapting the cut-off level vs Static heuristics

  • AllInner: fully replicate all inner nodes
  • B+Tree rebalancing causes costly updates
  • FixedAt2: cut-off level at level 2
  • poor load balancing as more machines join
SLIDE 86

Adapting the cut-off level vs Static heuristics

  • AllInner: fully replicate all inner nodes
  • B+Tree rebalancing causes costly updates
  • FixedAt2: cut-off level at level 2
  • poor load balancing as more machines join

[Chart: slowdown relative to STI-BT vs #machines (10 to 100); YCSB workload A, 50% lookups / 50% modifications]

SLIDE 87

Adapting the cut-off level vs Static heuristics

  • AllInner: fully replicate all inner nodes
  • B+Tree rebalancing causes costly updates
  • FixedAt2: cut-off level at level 2
  • poor load balancing as more machines join

[Chart: slowdown of AllInner and FixedAt2 relative to STI-BT vs #machines (10 to 100); YCSB workload A, 50% lookups / 50% modifications]

SLIDE 88
  • low arity => deeper trees, more accesses to DKV
  • stable performance for non minimal arity

Varying the arity of the B+Tree
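
As a rough, illustrative sanity check (not a measurement from the paper): a B+Tree of arity a over N keys is about ⌈log_a N⌉ levels deep, so one million indexed entries need roughly 10 levels at arity 4 but only 3 levels at arity 100; below the fully replicated top, each extra level is another DKV access.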

SLIDE 89
  • Read dominated workload in YCSB
  • With up to 60 machines
  • Latency remains stable

Scaling the Data Indexed

SLIDE 90

Sinfonia tree [VLDB’08]

  • all inner nodes are fully replicated

Related Work

[Chart: throughput (1000 txs/sec) vs #machines for Baseline, Minuet Emulation and STI-BT]

SLIDE 91

Sinfonia tree [VLDB’08]

  • all inner nodes are fully replicated

Global index [VLDB’10]

  • could be integrated with STI-BT

Related Work

[Chart: throughput (1000 txs/sec) vs #machines for Baseline, Minuet Emulation and STI-BT]

SLIDE 92

Sinfonia tree [VLDB’08]

  • all inner nodes are fully replicated

Global index [VLDB’10]

  • could be integrated with STI-BT

YCSB Read-Latest workload

Related Work

Minuet [VLDB’12]

  • lack of data placement
  • snapshot creation for read-only transactions is expensive

[Chart: throughput (1000 txs/sec) vs #machines for Baseline, Minuet Emulation and STI-BT]

SLIDE 93

STI-BT: A Scalable Transactional Index

Nuno Diegues and Paolo Romano

Thank you!

Questions?