SLIDE 1

ADVANCED DATABASE SYSTEMS
Lecture #07: OLTP Indexes (Trie Data Structures)
@Andy_Pavlo // 15-721 // Spring 2020

SLIDE 2

Latches
B+Trees
Judy Arrays
ART
Masstree

SLIDE 3

LATCH IMPLEMENTATION GOALS

Small memory footprint.
Fast execution path when there is no contention.
Deschedule the thread when it has been waiting for too long to avoid burning cycles.
Each latch should not have to implement its own queue to track waiting threads.

Source: Filip Pizlo


SLIDE 5

LATCH IMPLEMENTATIONS

Test-and-Set Spinlock
Blocking OS Mutex
Adaptive Spinlock
Queue-based Spinlock
Reader-Writer Locks

SLIDE 6

LATCH IMPLEMENTATIONS

Choice #1: Test-and-Set Spinlock (TaS)
→ Very efficient (single instruction to lock/unlock).
→ Non-scalable, not cache friendly, not OS friendly.
→ Example: std::atomic<T>

    #include <atomic>

    std::atomic_flag latch = ATOMIC_FLAG_INIT;
    // ...
    while (latch.test_and_set(std::memory_order_acquire)) {
      // Yield? Abort? Retry?
    }


SLIDES 8-11

LATCH IMPLEMENTATIONS

Choice #2: Blocking OS Mutex
→ Simple to use.
→ Non-scalable (about 25ns per lock/unlock invocation).
→ Example: std::mutex, a userspace latch built on pthread_mutex_t, which in turn relies on the OS futex latch.

    #include <mutex>

    std::mutex m;
    // ...
    m.lock();
    // Do something special...
    m.unlock();


SLIDE 12

LATCH IMPLEMENTATIONS

Choice #3: Adaptive Spinlock
→ A thread spins on a userspace latch for a brief time.
→ If it cannot acquire the latch, it gets descheduled and stored in a global "parking lot".
→ Threads check whether other threads are "parked" before spinning, and then park themselves.
→ Example: Apple's WTF::ParkingLot
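
A minimal sketch of the spin-then-park idea, assuming a fixed spin budget and falling back to yielding to the scheduler (a real implementation such as WTF::ParkingLot instead enqueues the descheduled thread in a global table of wait queues):

    #include <atomic>
    #include <thread>

    class AdaptiveSpinLock {
      std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
     public:
      void lock() {
        // Phase 1: spin briefly in userspace, hoping the holder finishes soon.
        for (int i = 0; i < 1000; i++) {
          if (!flag_.test_and_set(std::memory_order_acquire)) return;
        }
        // Phase 2: stop burning cycles and give up the CPU.
        while (flag_.test_and_set(std::memory_order_acquire)) {
          std::this_thread::yield();  // a parking lot would deschedule + queue here
        }
      }
      void unlock() { flag_.clear(std::memory_order_release); }
    };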

SLIDES 13-22

LATCH IMPLEMENTATIONS

Choice #4: Queue-based Spinlock (MCS)
→ More efficient than mutex, better cache locality.
→ Non-trivial memory management.
→ Example: std::atomic<Latch*>

(Figure: the base latch's next pointer tracks the tail of a queue of per-CPU latches; as CPU1, CPU2, and CPU3 arrive, each appends its own latch node to the queue and spins on it locally.)
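
A minimal MCS sketch in C++ (the standard algorithm, not the slide's exact code): each thread brings its own queue node, appends it to the tail with an atomic exchange, and spins only on its own node's flag, which is what gives the cache locality:

    #include <atomic>

    struct QNode {
      std::atomic<QNode*> next{nullptr};
      std::atomic<bool>   locked{false};
    };

    struct MCSLock {
      std::atomic<QNode*> tail{nullptr};

      void lock(QNode* me) {
        me->next.store(nullptr, std::memory_order_relaxed);
        // Atomically append ourselves to the queue.
        QNode* prev = tail.exchange(me, std::memory_order_acq_rel);
        if (prev != nullptr) {
          me->locked.store(true, std::memory_order_relaxed);
          prev->next.store(me, std::memory_order_release);
          // Spin on our own cache line, not on a shared word.
          while (me->locked.load(std::memory_order_acquire)) { }
        }
      }

      void unlock(QNode* me) {
        QNode* succ = me->next.load(std::memory_order_acquire);
        if (succ == nullptr) {
          QNode* expected = me;
          // No known successor: try to swing the tail back to empty.
          if (tail.compare_exchange_strong(expected, nullptr,
                                           std::memory_order_acq_rel))
            return;
          // A successor is mid-enqueue; wait for it to link itself.
          while ((succ = me->next.load(std::memory_order_acquire)) == nullptr) { }
        }
        succ->locked.store(false, std::memory_order_release);
      }
    };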

SLIDES 23-31

LATCH IMPLEMENTATIONS

Choice #5: Reader-Writer Locks
→ Allows for concurrent readers.
→ Must manage read/write queues to avoid starvation.
→ Can be implemented on top of spinlocks.

(Figure: a latch with separate read and write sides, each tracking counts of active holders and waiting threads; the build shows two readers acquiring the read side, after which a writer and a later reader must queue up and wait.)
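
A minimal reader-writer latch sketch on top of a single atomic word (one possible encoding, not the slide's exact design). Note that it can starve writers, which is exactly why real implementations must track waiting readers/writers in queues:

    #include <atomic>

    class RWLatch {
      // state_ == -1: held by a writer; state_ >= 0: number of active readers.
      std::atomic<int> state_{0};
     public:
      void lock_read() {
        for (;;) {
          int s = state_.load(std::memory_order_relaxed);
          if (s >= 0 &&
              state_.compare_exchange_weak(s, s + 1, std::memory_order_acquire))
            return;
        }
      }
      void unlock_read() { state_.fetch_sub(1, std::memory_order_release); }

      void lock_write() {
        int expected = 0;
        while (!state_.compare_exchange_weak(expected, -1,
                                             std::memory_order_acquire))
          expected = 0;  // reset after a failed CAS
      }
      void unlock_write() { state_.store(0, std::memory_order_release); }
    };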

SLIDE 32

B+TREE

A B+Tree is a self-balancing tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in O(log n).
→ Generalization of a binary search tree in that a node can have more than two children.
→ Optimized for systems that read and write large blocks of data.

SLIDE 33

LATCH CRABBING / COUPLING

Acquire and release latches on B+Tree nodes when traversing the data structure. A thread can release the latch on a parent node if its child node is considered safe.
→ A safe node is any node that won't split or merge when updated:
→ Not full (on insertion).
→ More than half-full (on deletion).

SLIDE 34

LATCH CRABBING

Search: Start at the root and go down; repeatedly:
→ Acquire a read (R) latch on the child.
→ Then unlatch the parent node.

Insert/Delete: Start at the root and go down, obtaining write (W) latches as needed. Once the child is latched, check whether it is safe:
→ If the child is safe, release all latches on its ancestors (see the sketch below).
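
A minimal sketch of the insert/delete descent, using a toy node layout (plain std::mutex latches, vector-based nodes) rather than the lecture's actual code:

    #include <mutex>
    #include <vector>

    // Toy B+Tree node: just enough structure to show the crabbing protocol.
    struct Node {
      static constexpr size_t kMax = 4;     // toy fan-out
      std::mutex latch;
      std::vector<int>   keys;
      std::vector<Node*> children;          // empty => leaf

      bool is_leaf() const { return children.empty(); }
      // Safe = won't split on insert / won't merge on delete.
      bool is_safe(bool is_insert) const {
        return is_insert ? keys.size() < kMax : keys.size() > kMax / 2;
      }
      Node* child_for(int key) const {
        size_t i = 0;
        while (i < keys.size() && key >= keys[i]) i++;
        return children[i];
      }
    };

    // Descend for an insert/delete, holding W latches on all unsafe ancestors.
    Node* descend_for_update(Node* root, int key, bool is_insert) {
      std::vector<Node*> held;               // latched ancestors, root first
      root->latch.lock();
      held.push_back(root);
      Node* cur = root;
      while (!cur->is_leaf()) {
        Node* next = cur->child_for(key);
        next->latch.lock();                  // latch child before releasing parent
        if (next->is_safe(is_insert)) {
          for (Node* n : held) n->latch.unlock();  // child absorbs update: drop ancestors
          held.clear();
        }
        held.push_back(next);
        cur = next;
      }
      return cur;  // leaf (plus any unsafe ancestors) returned latched
    }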

SLIDES 35-39

EXAMPLE #1: SEARCH 23

(Figure: B+Tree with root A [20], inner nodes B [10] and C [35], and leaf nodes D [6], E [12], F [23], G [38|44]. The search R-latches A, then C, then F.)

We can release the latch on A as soon as we acquire the latch for C.

SLIDES 40-43

EXAMPLE #2: DELETE 44

(Figure: same tree. The delete W-latches A, then C, then G.)

We may need to coalesce C, so we can't release the latch on A. G will not merge with F, so we can release the latches on A and C.

SLIDES 44-49

EXAMPLE #3: INSERT 40

(Figure: same tree. The insert W-latches A, then C, then G. G splits into G [38|40] and a new leaf H [44], and the separator key 44 is added to C.)

C has room if its child has to split, so we can release the latch on A. G must split, so we can't release the latch on C.

SLIDE 50

BETTER LATCH CRABBING

The basic latch crabbing algorithm always takes a write latch on the root for any update.
→ This makes the index essentially single-threaded.

A better approach is to optimistically assume that the target leaf node is safe.
→ Take R latches as you traverse the tree to reach it, then verify.
→ If the leaf is not safe, fall back to the previous algorithm.

CONCURRENCY OF OPERATIONS ON B-TREES
ACTA INFORMATICA 1977

SLIDES 51-54

EXAMPLE #4: DELETE 44

(Figure: same tree. The delete R-latches A and C, then W-latches only the leaf G.)

We assume that C is safe, so we can release the latch on A. Then acquire an exclusive latch on G.

SLIDE 55

VERSIONED LATCH COUPLING

Optimistic crabbing scheme where writers are not blocked by readers. Every node now has a version number (counter).
→ Writers increment the counter when they acquire the latch.
→ A reader proceeds if a node's latch is available, but does not acquire it.
→ The reader instead rechecks whether the latch's counter has changed since it first examined the latch.

Relies on epoch GC to ensure pointers are valid.

THE ART OF PRACTICAL SYNCHRONIZATION
DAMON 2016
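
A minimal sketch of the reader-side protocol, assuming a node layout where the low bit of a version word marks "write-latched" (the DAMON 2016 paper's scheme has more machinery):

    #include <atomic>
    #include <cstdint>

    struct Node {
      std::atomic<uint64_t> version{0};  // even = unlatched; odd = write-latched
    };

    // Get a stable version snapshot, spinning while a writer holds the node.
    uint64_t read_lock_or_restart(const Node& n) {
      uint64_t v;
      while ((v = n.version.load(std::memory_order_acquire)) & 1) { /* spin */ }
      return v;
    }

    // True if the node is unchanged since the snapshot; false => restart traversal.
    bool read_unlock_or_restart(const Node& n, uint64_t v) {
      return n.version.load(std::memory_order_acquire) == v;
    }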

SLIDES 56-69

VERSIONED LATCHES: SEARCH 44

(Figure: the same tree with a version counter on every node, e.g. A at v3, B at v5, C at v9. The traversal interleaves reads and rechecks: read A's version (v3), examine A; read B's version (v5), recheck that A is still v3, examine B; read C's version (v9), recheck that B is still v5, examine C. In the build, a concurrent writer bumps a node's version (v5 → v6) between the read and the recheck, so the recheck fails and the reader restarts its traversal.)

SLIDE 70

OBSERVATION

The inner node keys in a B+Tree cannot tell you whether a key exists in the index. You must always traverse to the leaf node. This means that you could incur (at least) one cache miss per level in the tree.

SLIDES 71-72

TRIE INDEX

Use a digital representation of keys to examine prefixes one-by-one instead of comparing the entire key.
→ Also known as a Digital Search Tree or Prefix Tree.

Keys: HELLO, HAT, HAVE

(Figure: a trie sharing the H prefix, branching to A and E; the A branch leads to T and V-E, the E branch to L-L-O; ¤ marks an end-of-key/value pointer.)

SLIDE 73

TRIE INDEX PROPERTIES

The shape only depends on the key space and key lengths.
→ Does not depend on the existing keys or their insertion order.
→ Does not require rebalancing operations.

All operations have O(k) complexity, where k is the length of the key.
→ The path to a leaf node represents the key of the leaf.
→ Keys are stored implicitly and can be reconstructed from paths.

SLIDE 74

TRIE KEY SPAN

The span of a trie level is the number of bits that each partial key / digit represents.
→ If the digit exists in the corpus, then store a pointer to the next level in the trie branch. Otherwise, store null.

This determines the fan-out of each node and the physical height of the tree.
→ n-way Trie = Fan-Out of n.

SLIDES 75-83

TRIE KEY SPAN

Keys: K10, K25, K31
K10 → 00000000 00001010
K25 → 00000000 00011001
K31 → 00000000 00011111

(Figure: a 1-bit span trie over these 16-bit keys. The levels along the shared all-zero prefix, each with a single child, repeat 10×; the lower levels branch on the remaining bits. ¤ marks a tuple/node pointer, Ø marks a null slot.)

SLIDE 84

RADIX TREE

Omit all nodes with only a single child.
→ Also known as a Patricia Tree.

Can produce false positives, so the DBMS always checks the original tuple to see whether a key matches.

(Figure: the 1-bit span radix tree for the same keys, with the single-child chains collapsed; ¤ marks a tuple/node pointer, Ø marks a null slot.)
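
A toy lookup for an 8-bit span (256-way) radix tree, as a minimal sketch with an assumed node layout: one byte of the key is consumed per level, and because single-child chains may be collapsed, the caller must re-compare the full key against the returned tuple (the false-positive check above):

    #include <cstddef>
    #include <cstdint>

    struct RadixNode {
      void* child[256];  // next-level node, or the tuple pointer at the last byte
    };

    // Returns a candidate tuple pointer, or nullptr if the path hits a null slot.
    void* radix_lookup(RadixNode* root, const uint8_t* key, size_t len) {
      void* cur = root;
      for (size_t i = 0; i < len && cur != nullptr; i++) {
        cur = static_cast<RadixNode*>(cur)->child[key[i]];
      }
      return cur;  // caller re-checks the full key against this tuple
    }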

SLIDE 85

TRIE VARIANTS

Judy Arrays (HP)
ART Index (HyPer)
Masstree (Silo)

SLIDE 86

JUDY ARRAYS

Variant of a 256-way radix tree. First known radix tree that supports adaptive node representations.

Three array types:
→ Judy1: Bit array that maps integer keys to true/false.
→ JudyL: Maps integer keys to integer values.
→ JudySL: Maps variable-length keys to integer values.

Open-source implementation (LGPL). Patented by HP in 2000; expires in 2022.
→ Not an issue according to the authors.
→ http://judy.sourceforge.net/

SLIDE 87

JUDY ARRAYS

Judy arrays do not store meta-data about a node in its header.
→ Reading a header first could incur additional cache misses.

Instead, the meta-data about a node is packed into 128-bit "Judy Pointers" stored in its parent node:
→ Node Type
→ Population Count
→ Child Key Prefix / Value (if only one child below)
→ 64-bit Child Pointer

A COMPARISON OF ADAPTIVE RADIX TREES AND HASH TABLES
ICDE 2015

SLIDE 88

JUDY ARRAYS: NODE TYPES

Every node can store up to 256 digits, but not all nodes will be 100% full. Judy therefore adapts a node's organization based on its keys:
→ Linear Node: sparse populations.
→ Bitmap Node: typical populations.
→ Uncompressed Node: dense populations.

A COMPARISON OF ADAPTIVE RADIX TREES AND HASH TABLES
ICDE 2015

SLIDES 89-92

JUDY ARRAYS: LINEAR NODES

Store a sorted list of partial prefixes, up to two cache lines.
→ The original spec was one cache line.

Store a separate array of pointers to children, ordered according to the sorted prefixes.

(Figure: a linear node with a sorted digit array and a parallel child-pointer array. Sizing: 6 × 1-byte digits = 6 bytes, plus 6 × 16-byte Judy Pointers = 96 bytes, for 102 bytes total, padded to 128 bytes.)
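
A minimal lookup sketch under the layout assumed above (six sorted digits with a parallel pointer array; the real Judy structures differ in detail):

    #include <cstdint>

    struct JudyPtr;            // 128-bit Judy Pointer, opaque in this sketch

    struct LinearNode {
      uint8_t  ndigits;        // slots in use (<= 6)
      uint8_t  digit[6];       // sorted partial prefixes
      JudyPtr* child[6];       // child pointers, in the same order as digit[]
    };

    // Scan the tiny, cache-resident sorted array for the next key byte.
    JudyPtr* linear_lookup(const LinearNode& n, uint8_t d) {
      for (int i = 0; i < n.ndigits && n.digit[i] <= d; i++) {
        if (n.digit[i] == d) return n.child[i];
      }
      return nullptr;          // digit not present in this node
    }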

SLIDES 93-96

JUDY ARRAYS: BITMAP NODES

A 256-bit map marks whether each prefix is present in the node. The bitmap is divided into eight segments, each with a pointer to a sub-array of pointers to child nodes.

(Figure: the bitmap split into segments (digit ranges 0-7, 8-15, ..., 248-255 in the drawing), each paired with a pointer to a sub-array of child pointers; set bits map, in order, to slots in that sub-array. An inset shows digits 0-7 and their bit offsets.)

SLIDE 97

ADAPTIVE RADIX TREE (ART)

Developed for the TUM HyPer DBMS in 2013. A 256-way radix tree that supports different node types based on a node's population.
→ Stores meta-data about each node in its header.

Concurrency support was added in 2015.

THE ADAPTIVE RADIX TREE: ARTFUL INDEXING FOR MAIN-MEMORY DATABASES
ICDE 2013

SLIDE 98

ART vs. JUDY

Difference #1: Node Types
→ Judy has three node types with different organizations.
→ ART has four node types that (mostly) vary in the maximum number of children.

Difference #2: Purpose
→ Judy is a general-purpose associative array. It "owns" the keys and values.
→ ART is a table index and does not need to cover full keys. Values are pointers to tuples.

SLIDES 99-101

ART: INNER NODE TYPES (1)

Store only the 8-bit digits that exist at a given node in a sorted array. The offset in the sorted digit array corresponds to the offset in the value array.

(Figure: Node4 holds up to 4 sorted digits with 4 parallel child pointers; Node16 holds up to 16 sorted digits with 16 parallel child pointers.)
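
A minimal Node16 lookup sketch (layout assumed from the description above; the ART paper additionally searches all 16 digits at once with SIMD compare instructions, which a plain scan stands in for here):

    #include <cstdint>

    struct ARTNode16 {
      uint8_t count;       // children in use (<= 16)
      uint8_t key[16];     // sorted 8-bit digits
      void*   child[16];   // parallel child pointers
    };

    void* node16_lookup(const ARTNode16& n, uint8_t digit) {
      for (int i = 0; i < n.count; i++) {
        if (n.key[i] == digit) return n.child[i];
        if (n.key[i] > digit) break;   // sorted: can stop early
      }
      return nullptr;
    }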

SLIDES 102-105

ART: INNER NODE TYPES (2)

Instead of storing 1-byte digits, maintain an array of 1-byte offsets into a child pointer array, indexed on the digit bits.

(Figure: Node48 with a 256-entry offset array indexed by digit and a 48-entry child-pointer array. Sizing: 256 × 1 byte = 256 bytes, plus 48 × 8-byte pointers = 384 bytes, for 640 bytes total.)
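
The corresponding lookup is a single array indirection rather than a scan. A minimal sketch, with the layout assumed from the sizing above (real implementations reserve one offset value as an "empty" sentinel, assumed to be 48 here):

    #include <cstdint>

    struct ARTNode48 {
      static constexpr uint8_t kEmpty = 48;  // assumed "no child" sentinel
      uint8_t index[256];                    // digit -> offset into child[], or kEmpty
      void*   child[48];
    };

    void* node48_lookup(const ARTNode48& n, uint8_t digit) {
      uint8_t slot = n.index[digit];         // one load instead of scanning digits
      return (slot == ARTNode48::kEmpty) ? nullptr : n.child[slot];
    }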

SLIDES 106-107

ART: INNER NODE TYPES (3)

Store an array of 256 pointers to child nodes. This covers all possible values in 8-bit digits. This is the same as the Judy Array's Uncompressed Node.

(Figure: Node256 with one pointer slot per digit, null where no child exists. Sizing: 256 × 8 bytes = 2048 bytes.)

SLIDE 108

ART: BINARY COMPARABLE KEYS

Not all attribute types can be decomposed into binary comparable digits for a radix tree.
→ Unsigned Integers: Byte order must be flipped for little-endian machines.
→ Signed Integers: Flip the two's-complement sign bit so that negative numbers sort below positive ones.
→ Floats: Classify into a group (neg vs. pos, normalized vs. denormalized), then store as an unsigned integer.
→ Compound: Transform each attribute separately.
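
A minimal sketch of the integer transformations (the float case is more involved and omitted). The goal is a byte string whose byte-wise comparison order matches the numeric order:

    #include <cstdint>

    // Encode v as 8 bytes, most significant byte first (big-endian),
    // so that memcmp() order matches numeric order.
    void encode_unsigned(uint64_t v, uint8_t out[8]) {
      for (int i = 7; i >= 0; i--) { out[i] = v & 0xFF; v >>= 8; }
    }

    // Signed integers: flip the two's-complement sign bit first so that
    // negative numbers encode below positive ones, then encode big-endian.
    void encode_signed(int64_t v, uint8_t out[8]) {
      encode_unsigned(static_cast<uint64_t>(v) ^ (1ULL << 63), out);
    }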

SLIDES 109-111

ART: BINARY COMPARABLE KEYS

Int Key: 168496141 → Hex Key: 0A 0B 0C 0D
A big-endian machine stores the bytes as 0A 0B 0C 0D; a little-endian machine stores them as 0D 0C 0B 0A, so the bytes must be flipped before they are used as digits.

(Figure: an 8-bit span radix tree over byte-wise keys; finding 658205 (hex 0A 0B 1D) descends one byte per level: 0A, then 0B, then 1D.)

SLIDE 112

MASSTREE

Instead of using different layouts for each trie node based on its size, use an entire B+Tree.
→ Each B+Tree represents an 8-byte span of the key.
→ Optimized for long keys.
→ Uses a latching protocol that is similar to versioned latches.

Part of the Harvard Silo project.

(Figure: the top B+Tree indexes key bytes [0-7]; its leaves point to lower B+Trees that index bytes [8-15], and so on.)

CACHE CRAFTINESS FOR FAST MULTICORE KEY-VALUE STORAGE
EUROSYS 2012
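
A minimal sketch of the key slicing this design implies: each 8-byte span of the key is packed into a uint64 (most significant byte first, so integer comparison matches byte order) and used as the search key within that layer's B+Tree. The helper below is an assumption for illustration, not Masstree's actual API:

    #include <algorithm>
    #include <cstdint>
    #include <cstring>

    // Pack bytes [8*layer, 8*layer + 8) of the key into a comparable uint64,
    // zero-padding short tails.
    uint64_t slice_for_layer(const uint8_t* key, size_t len, size_t layer) {
      uint8_t buf[8] = {0};
      size_t off = 8 * layer;
      if (off < len) std::memcpy(buf, key + off, std::min<size_t>(8, len - off));
      uint64_t v = 0;
      for (int i = 0; i < 8; i++) v = (v << 8) | buf[i];  // big-endian pack
      return v;
    }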

SLIDE 113

IN-MEMORY INDEXES

(Figure: throughput in millions of operations/sec for Open Bw-Tree, Skip List, B+Tree, Masstree, and ART on Insert-Only, Read-Only, Read/Update, and Scan/Insert workloads. Processor: 1 socket, 10 cores w/ 2×HT. Workload: 50m random integer keys (64-bit). Source: Ziqi Wang)

SLIDE 114

IN-MEMORY INDEXES

(Figure: memory footprint in GB for Open Bw-Tree, Skip List, B+Tree, Masstree, and ART on monotonic integer, random integer, and email keys. Processor: 1 socket, 10 cores w/ 2×HT. Workload: 50m keys. Source: Ziqi Wang)

SLIDE 115

PARTING THOUGHTS

Andy was wrong about the Bw-Tree and latch-free indexes. Radix trees have interesting properties, but a well-written B+Tree is still a solid design choice.

SLIDE 116

NEXT CLASS

System Catalogs
Data Layout
Storage Models