

slide-1
SLIDE 1

DATABASE SYSTEM IMPLEMENTATION

GT 4420/6422 // SPRING 2019 // @JOY_ARULRAJ
LECTURE #12: OLTP INDEXES (PART II)

slide-2
SLIDE 2

TODAY’S AGENDA

B+Tree Overview
Index Implementation Issues
ART Index

2

slide-3
SLIDE 3

LOGISTICS

Reminder: Problem set due on Feb 21st.
Reminder: Mid-term Exam on Feb 26th.
Reminder: Project Proposals due on Feb 28th.

3

slide-4
SLIDE 4

B+TREE

A B+Tree is a self-balancing tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in O(log n).

→ Generalization of a binary search tree in that a node can have more than two children.
→ Optimized for systems that read and write large blocks of data.

4

slide-5
SLIDE 5

B+TREE PROPERTIES

A B+tree is an M-way search tree with the following properties:

→ It is perfectly balanced (i.e., every leaf node is at the same depth).
→ Every inner node other than the root is at least half-full: M/2-1 ≤ #keys ≤ M-1.
→ Every inner node with k keys has k+1 non-null children.

5
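To make these invariants concrete, here is a minimal C++ sketch of an M-way inner node; the layout and names are illustrative, not from any particular system:

#include <cstdint>

constexpr int M = 134;  // fanout: up to M children and M-1 keys per node

struct InnerNode {
    uint16_t num_keys;     // number of keys currently stored
    int64_t  keys[M - 1];  // sorted separator keys
    void*    children[M];  // num_keys + 1 non-null child pointers
    bool     is_root;
};

// The occupancy invariant from this slide: every inner node
// other than the root is at least half-full.
bool SatisfiesOccupancy(const InnerNode& n) {
    if (n.is_root) return n.num_keys >= 1;
    return n.num_keys >= M / 2 - 1 && n.num_keys <= M - 1;
}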

slide-6
SLIDE 6

B+TREE EXAMPLE

6

[Figure (built up over slides 6-11): an inner node with separator keys 5 and 9 above three leaf nodes (1 3), (6 7), (9 13). The inner node's pointers route keys <5, <9, and ≥9; the leaves hold key/value entries and are connected by sibling pointers.]

slide-12
SLIDE 12

NODES

Every node in the B+Tree contains an array of key/value pairs.

→ The keys will always be the column or columns that you built your index on.
→ The values will differ based on whether the node is an inner node or a leaf node.

The arrays are (usually) kept in sorted key order.

12

slide-13
SLIDE 13

LEAF NODE VALUES

Approach #1: Record Ids

→ A pointer to the location of the tuple that the index entry corresponds to.

Approach #2: Tuple Data

→ The actual contents of the tuple are stored in the leaf node.
→ Secondary indexes have to store the record id as their values.

13
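A minimal C++ sketch contrasting the two leaf-value choices; the RecordId and Tuple types are hypothetical stand-ins:

#include <cstddef>
#include <cstdint>
#include <variant>
#include <vector>

struct RecordId { uint32_t page_id; uint16_t slot; };  // Approach #1: pointer to the tuple's location
struct Tuple    { std::vector<std::byte> data; };      // Approach #2: the tuple's actual contents

// A leaf value is one or the other; a secondary index must use
// RecordId values when the table itself lives in the primary index.
using LeafValue = std::variant<RecordId, Tuple>;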


slide-15
SLIDE 15

B+TREE LEAF NODES

15

[Figure (built up over slides 15-21): the layout of a B+Tree leaf node — a header recording the node's level (# Level) and entry count (# Slots), Prev and Next sibling pointers stored as PageIDs, an array of sorted keys K1 … Kn, and a parallel array of values.]
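A minimal C++ sketch of the leaf layout in the figure, assuming fixed-size 64-bit keys and values (capacity and field names are assumptions):

#include <cstdint>

constexpr int LEAF_CAPACITY = 256;  // assumed entries per leaf

struct LeafNode {
    // Header
    uint16_t level;          // # Level (0 for leaf nodes)
    uint16_t num_slots;      // # Slots currently occupied
    uint64_t prev_page_id;   // Prev sibling pointer, stored as a PageID
    uint64_t next_page_id;   // Next sibling pointer, stored as a PageID
    // Sorted keys with a parallel array of values
    int64_t  keys[LEAF_CAPACITY];
    uint64_t values[LEAF_CAPACITY];  // e.g., record ids
};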

slide-22
SLIDE 22

B+TREE INSERT

Find the correct leaf L. Put the data entry into L in sorted order. If L has enough space, done! Else, must split L into L and a new node L2:

→ Redistribute entries evenly, copy up the middle key.
→ Insert an index entry pointing to L2 into the parent of L.

To split an inner node, redistribute entries evenly, but push up the middle key.

22

Source: Chris Re
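A minimal sketch of the leaf-split step, reusing the LeafNode layout from earlier; the parent update is a hypothetical helper:

#include <algorithm>

// Split a full leaf `left` into `left` and a fresh node `right`,
// redistributing the entries evenly. The middle key is *copied* up to
// the parent (inner-node splits *push* it up instead).
void SplitLeaf(LeafNode& left, LeafNode& right) {
    int mid = left.num_slots / 2;
    std::copy(left.keys + mid, left.keys + left.num_slots, right.keys);
    std::copy(left.values + mid, left.values + left.num_slots, right.values);
    right.num_slots = left.num_slots - mid;
    left.num_slots  = mid;
    right.next_page_id = left.next_page_id;       // fix sibling pointers;
    // left.next_page_id = <right's PageID>;      // assigned by the caller
    // InsertIntoParent(left, right.keys[0], right);  // hypothetical helper
}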

slide-23
SLIDE 23

B+TREE VISUALIZATION

https://www.cs.usfca.edu/~galles/visualization/BPlusTree.html

Source: David Galles (Univ. of San Francisco)

23

slide-24
SLIDE 24

B+TREE DELETE

Start at the root, find the leaf L where the entry belongs. Remove the entry. If L is at least half-full, done! If L has only M/2-1 entries:

→ Try to re-distribute, borrowing from a sibling (an adjacent node with the same parent as L).
→ If re-distribution fails, merge L and the sibling.

If a merge occurred, must delete the entry (pointing to L or its sibling) from the parent of L.

24

Source: Chris Re
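A minimal sketch of the re-distribution case, borrowing the smallest entry from the right sibling; updating the parent's separator key to the sibling's new first key is elided:

#include <algorithm>

void BorrowFromRight(LeafNode& node, LeafNode& right) {
    // Move right's smallest entry into node...
    node.keys[node.num_slots]   = right.keys[0];
    node.values[node.num_slots] = right.values[0];
    node.num_slots++;
    // ...and shift right's remaining entries down one slot.
    std::copy(right.keys + 1, right.keys + right.num_slots, right.keys);
    std::copy(right.values + 1, right.values + right.num_slots, right.values);
    right.num_slots--;
    // Parent separator must be updated to right.keys[0] (elided).
}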

slide-25
SLIDE 25

B+TREES IN PRACTICE

Typical Fill-Factor: 67%.

→ Average Fanout = 2*100*0.67 = 134

Typical Capacities:

→ Height 4: 133⁴ = 312,900,721 entries
→ Height 3: 134³ = 2,406,104 entries

Pages per level:

→ Level 1 = 1 page = 8 KB
→ Level 2 = 134 pages = 1 MB
→ Level 3 = 17,956 pages = 140 MB

25
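To check the pages-per-level arithmetic with the stated 8 KB pages and fanout of 134: level 2 holds 134 pages × 8 KB ≈ 1 MB, and level 3 holds 134² = 17,956 pages × 8 KB ≈ 140 MB — small enough that everything above the leaf level can usually stay cached in memory.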

slide-26
SLIDE 26

CLUSTERED INDEXES

The table is stored in the sort order specified by the primary key.

→ Can be either heap- or index-organized storage.

Some DBMSs always use a clustered index.

→ If a table doesn’t include a pkey, the DBMS will automatically make a hidden row id pkey.

Other DBMSs cannot use them at all.

26

slide-27
SLIDE 27

SELECTION CONDITIONS

The DBMS can use a B+Tree index if the query provides any of the attributes of the search key. Example: Index on <a,b,c>

→ Supported: (a=5 AND b=3)
→ Supported: (b=3)

Not all DBMSs support this. For a hash index, we must have all attributes of the search key.

27

slide-28
SLIDE 28

B+TREE DESIGN CHOICES

Node Size
Merge Threshold
Intra-Node Search
Variable-Length Keys
Non-Unique Indexes

28

slide-29
SLIDE 29

NODE SIZE

The slower the disk, the larger the optimal node size for a B+Tree.

→ HDD: ~1 MB
→ SSD: ~10 KB
→ In-Memory: ~512 B

Optimal sizes can vary depending on the workload.

→ Leaf Node Scans vs. Root-to-Leaf Traversals

29

slide-30
SLIDE 30

MERGE THRESHOLD

Some DBMSs don't always merge a node when it becomes half full. Delaying a merge operation may reduce the amount of reorganization. It may be better to just let underflows exist and then periodically rebuild the entire tree.

30

slide-31
SLIDE 31

INTRA-NODE SEARCH

Approach #1: Linear

→ Scan node keys from beginning to end.

Approach #2: Binary

→ Jump to middle key, pivot left/right depending on comparison.

Approach #3: Interpolation

→ Approximate location of desired key based on known distribution of keys.

31

Find Key=8

[Figure (built up over slides 31-41): searching for key 8 in a node holding the sorted keys 4 5 6 7 8 9 10. Linear search scans the slots left to right; binary search starts at the middle key and pivots; interpolation jumps straight to the estimated slot, Offset: 7-(10-8)=5.]
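A minimal C++ sketch of the interpolation approach over a sorted key array, assuming roughly uniformly distributed integer keys (0-indexed, so the slide's offset 5 corresponds to index 4 here):

#include <cstdint>

// Returns the index of `target` in keys[0..n-1], or -1 if absent.
// Guesses a slot from the key distribution, then narrows the range.
int InterpolationSearch(const int64_t* keys, int n, int64_t target) {
    int lo = 0, hi = n - 1;
    while (lo <= hi && target >= keys[lo] && target <= keys[hi]) {
        if (keys[hi] == keys[lo]) break;
        // e.g., keys 4..10, target 8: pos = 0 + (8-4)*6/(10-4) = 4
        int pos = lo + static_cast<int>((target - keys[lo]) * (hi - lo) /
                                        (keys[hi] - keys[lo]));
        if (keys[pos] == target) return pos;
        if (keys[pos] < target) lo = pos + 1; else hi = pos - 1;
    }
    return (lo <= hi && keys[lo] == target) ? lo : -1;
}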

slide-42
SLIDE 42

INDEX IMPLEMENTATION ISSUES

Bulk Insert
Pointer Swizzling
Memory Pools
Garbage Collection
Non-Unique Keys
Variable-Length Keys
Prefix Compression

42

slide-43
SLIDE 43

BULK INSERT

The fastest/best way to build a B+Tree is to first sort the keys and then build the index from the bottom up.

43

Keys: 3, 7, 9, 13, 6, 1
Sorted Keys: 1, 3, 6, 7, 9, 13

[Figure (built up over slides 43-47): the sorted keys are packed into the leaves (1 3), (6 7), (9 13), and then the inner node (6 9) is built on top of them.]
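A minimal sketch of the bottom-up bulk load, reusing the LeafNode from earlier; building the inner levels from each leaf's first key is elided:

#include <algorithm>
#include <vector>

// entries_per_leaf must be <= LEAF_CAPACITY.
std::vector<LeafNode> BulkLoadLeaves(std::vector<int64_t> keys,
                                     int entries_per_leaf) {
    std::sort(keys.begin(), keys.end());        // 1. sort the keys
    std::vector<LeafNode> leaves;
    for (size_t i = 0; i < keys.size(); i += entries_per_leaf) {
        LeafNode leaf{};                        // 2. pack leaves left to right
        size_t end = std::min(keys.size(), i + entries_per_leaf);
        for (size_t j = i; j < end; j++)
            leaf.keys[leaf.num_slots++] = keys[j];
        leaves.push_back(leaf);
    }
    // 3. Build parent levels from each leaf's first key (elided).
    return leaves;
}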

slide-48
SLIDE 48

POINTER SWIZZLING

Nodes use page ids to reference other nodes in the index. The DBMS has to get the memory location from the page table during traversal. If a page is pinned in the buffer pool, then we can store raw pointers instead of page ids, thereby removing the need to get the address from the page table.

48

[Figure (built up over slides 48-56): evaluating Find Key>3 over an inner node (6 9) and leaves (1 3), (6 7) held in buffer-pool frames. Each child reference Page #2 / Page #3 triggers a page-table lookup (Page #2 → <Page*>, Page #3 → <Page*>); after swizzling, the inner node stores the raw <Page*> values directly.]
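A minimal sketch of a swizzlable child reference, using one tag bit to distinguish a page id from a raw pointer — a common trick, not any specific system's encoding:

#include <cstdint>

struct Page;  // buffer-pool page

// Low bit set => unswizzled page id; low bit clear => raw Page*.
// (Assumes Page* values are at least 2-byte aligned.)
struct ChildRef {
    uint64_t bits;

    static ChildRef FromPageId(uint64_t pid) { return {(pid << 1) | 1}; }
    static ChildRef FromPointer(Page* p) {
        return {reinterpret_cast<uint64_t>(p)};
    }
    bool     IsSwizzled() const { return (bits & 1) == 0; }
    uint64_t PageId()     const { return bits >> 1; }
    Page*    Pointer()    const { return reinterpret_cast<Page*>(bits); }
};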

slide-57
SLIDE 57

MEMORY POOLS

We don’t want to call malloc and free every time we need to add or delete a node, since this could lead to a system call.

→ If you call malloc to request 10 bytes of memory, the allocator may invoke the sbrk (or mmap) system call to request 4 KB from the OS.
→ The next time you call malloc to request another 10 bytes, it may not have to issue a system call; instead, it may return a pointer into the already-allocated memory.

57

slide-58
SLIDE 58

MEMORY POOLS

If all the nodes are the same size, then the index can maintain a pool of available nodes.

→ Insert: Grab a free node, otherwise create a new one.
→ Delete: Add the node back to the free pool.

Need some policy to decide when to shrink the pool (garbage collection & de-fragmentation).

58
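A minimal free-list node pool sketch, assuming fixed-size nodes (names hypothetical):

#include <vector>

template <typename Node>
class NodePool {
    std::vector<Node*> free_;              // nodes returned by deletes
public:
    Node* Allocate() {
        if (!free_.empty()) {              // Insert: grab a free node...
            Node* n = free_.back();
            free_.pop_back();
            return n;
        }
        return new Node{};                 // ...otherwise create a new one.
    }
    void Free(Node* n) { free_.push_back(n); }  // Delete: back to the pool.
    // A real pool also needs a policy for shrinking free_ over time.
};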

slide-59
SLIDE 59

GARBAGE COLLECTION

We need to know when it is safe to reclaim memory for deleted nodes in a latch-free index.

→ Reference Counting
→ Epoch-based Reclamation
→ Hazard Pointers
→ Many others…

59

[Figure (built up over slides 59-64): a sibling chain of nodes (K2 V2) → (K3 V3) → (K4 V4); the middle node is unlinked while a reader may still be traversing it, so its memory cannot be freed immediately.]

slide-65
SLIDE 65

REFERENCE COUNTING

Maintain a counter for each node to keep track of the number of threads that are accessing it.

→ Increment the counter before accessing.
→ Decrement it when finished.
→ A node is only safe to delete when the count is zero.

This has bad performance on multi-core CPUs:

→ Incrementing/decrementing counters causes a lot of cache coherence traffic.

65

slide-66
SLIDE 66

OBSERVATION

We don’t care about the actual value of the reference counter; we only need to know when it reaches zero. And we don’t have to perform garbage collection immediately when the counter reaches zero.

66

Source: Stephen Tu

slide-67
SLIDE 67

EPOCH GARBAGE COLLECTION

Maintain a global epoch counter that is periodically updated (e.g., every 10 ms).

→ Keep track of what threads enter the index during an epoch and when they leave.

Mark the current epoch of a node when it is marked for deletion.

→ The node can be reclaimed once all threads have left that epoch (and all preceding epochs).

Also known as Read-Copy-Update (RCU) in Linux.

67
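A minimal sketch of epoch-based reclamation, assuming a single global epoch bumped elsewhere (e.g., every 10 ms) and external synchronization on the retired list; all names are hypothetical:

#include <atomic>
#include <cstdint>
#include <vector>

std::atomic<uint64_t> g_epoch{0};   // advanced periodically by a background thread

struct Retired { void* node; uint64_t epoch; };
std::vector<Retired> g_retired;     // nodes awaiting reclamation (sync elided)

// Each thread records the epoch it entered the index in; the minimum
// over all in-flight threads bounds which retired nodes are still visible.
thread_local uint64_t t_enter_epoch;

void EnterIndex()        { t_enter_epoch = g_epoch.load(); }
void RetireNode(void* n) { g_retired.push_back({n, g_epoch.load()}); }

// Reclaim nodes retired in epochs that every thread has since left.
void Reclaim(uint64_t min_active_epoch) {
    std::erase_if(g_retired, [&](const Retired& r) {   // C++20
        if (r.epoch >= min_active_epoch) return false;
        // free r.node here, or return it to the node pool
        return true;
    });
}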

slide-68
SLIDE 68

NON-UNIQUE INDEXES

Approach #1: Duplicate Keys

→ Use the same node layout but store duplicate keys multiple times.

Approach #2: Value Lists

→ Store each key only once and maintain a linked list of unique values.

68

MODERN B-TREE TECHNIQUES, Now Publishers 2010

slide-69
SLIDE 69

B+Tree Leaf Node

NON-UNIQUE: DUPLICATE KEYS

69

[Figure (repeated on slides 69-71): a B+Tree leaf node whose sorted key array holds K1 K1 K1 K2 K2 … Kn — each key appears once per duplicate value — with the usual header (# Level, # Slots), Prev/Next pointers, and values array.]

slide-72
SLIDE 72

B+Tree Leaf Node

NON-UNIQUE: VALUE LISTS

72

[Figure (repeated on slides 72-73): a B+Tree leaf node where each key K1 … Kn is stored once in the sorted key array; each value slot points to a list of values for that key.]

slide-74
SLIDE 74

VARIABLE LENGTH KEYS

Approach #1: Pointers

→ Store the keys as pointers to the tuple’s attribute.

Approach #2: Variable Length Nodes

→ The size of each node in the index can vary. → Requires careful memory management.

Approach #3: Padding

→ Always pad the key to be the maximum length of the key type.

Approach #4: Key Map / Indirection

→ Embed an array of pointers that map to the key + value list within the node.

74

slide-75
SLIDE 75

B+Tree Leaf Node

KEY MAP / INDIRECTION

75

[Figure (built up over slides 75-79): after the header (# Level, # Slots, Prev/Next), the node stores a key map — a sorted array of pointers, shown as A·¤ O·¤ P·¤ — that indirects into the variable-length key+value entries Andy·V1, Obama·V2, Prashanth·V3 packed elsewhere in the node.]
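A minimal sketch of the key-map idea, assuming variable-length <length, key bytes, value> entries packed into the node's byte heap (the layout is hypothetical):

#include <cstdint>
#include <string_view>

constexpr int NODE_BYTES = 8192;

struct VarKeyNode {
    uint16_t num_slots;
    // Key map: sorted array of offsets into `heap`, one per entry.
    // Entries may also cache the first key byte (the A/O/P above)
    // so most comparisons never chase the offset.
    uint16_t key_map[256];
    // Variable-length <key-len, key bytes, value> entries live here.
    uint8_t  heap[NODE_BYTES];

    std::string_view KeyAt(int slot) const {
        const uint8_t* p = heap + key_map[slot];
        uint16_t len = static_cast<uint16_t>(p[0] | (p[1] << 8));
        return {reinterpret_cast<const char*>(p + 2), len};
    }
};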

slide-80
SLIDE 80

PREFIX COMPRESSION

Store the minimum prefix that is needed to correctly route probes into the index. Since keys are sorted in lexicographical order, there will be a lot of duplicated prefixes.

80

[Figure (built up over slides 80-84): inner-node separator keys abcdefghijk and lmnopqrstuv are truncated to abc and lmn; within a leaf, the keys Andre, Andy, Annie share the prefix An and are stored as An + dre / dy / nie.]
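A minimal sketch of factoring out a node's shared prefix; because the keys are sorted, comparing only the first and last key suffices:

#include <string>
#include <vector>

// Longest prefix shared by all keys in a sorted array.
std::string CommonPrefix(const std::vector<std::string>& sorted_keys) {
    if (sorted_keys.empty()) return "";
    const std::string& first = sorted_keys.front();
    const std::string& last  = sorted_keys.back();
    size_t i = 0;
    while (i < first.size() && i < last.size() && first[i] == last[i]) i++;
    return first.substr(0, i);
}

// CommonPrefix({"Andre", "Andy", "Annie"}) == "An"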

slide-85
SLIDE 85

ADAPTIVE RADIX TREE (ART)

Uses the digital representation of keys to examine prefixes 1-by-1 instead of comparing the entire key. Radix tree properties:

→ The height of the tree depends on the length of keys (unlike a B+Tree, where height depends on the number of keys).
→ Does not require rebalancing.
→ The path to a leaf node represents the key of the leaf.
→ Keys are stored implicitly and can be reconstructed from paths.
→ Structure does not depend on the order of key insertion.

85

THE ADAPTIVE RADIX TREE: ARTFUL INDEXING FOR MAIN-MEMORY DATABASES, ICDE 2013

slide-86
SLIDE 86

TRIE VS. RADIX TREE

86

Keys: HELLO, HAT, HAVE

[Figure (built up over slides 86-92): a trie (from "re-trie-val", 1959) stores one character per node — H, then E→L→L→O for HELLO, and A→T / A→V→E for HAT and HAVE. The radix tree collapses single-child chains into edges: H leads to ELLO, and to A, which branches to T and VE.]

slide-93
SLIDE 93

ART: ADAPTIVELY SIZED NODES

The index supports four different internal node types with different capacities. Pack multiple digits into a single node to improve cache locality.

93
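A sketch of the four inner node types, following the ICDE 2013 paper cited above (layouts abbreviated; prefix and bookkeeping fields omitted):

#include <cstdint>

struct ArtNode { uint8_t type; };  // shared header

struct Node4   : ArtNode { uint8_t key[4];  ArtNode* child[4];  };  // up to 4 children
struct Node16  : ArtNode { uint8_t key[16]; ArtNode* child[16]; };  // searchable with SIMD
struct Node48  : ArtNode { uint8_t child_index[256]; ArtNode* child[48]; };  // byte -> slot
struct Node256 : ArtNode { ArtNode* child[256]; };                  // direct array

// A full node is grown to the next larger type; a sparse node is
// shrunk to the next smaller one.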

slide-94
SLIDE 94

ART: MODIFICATIONS

94

[Figure (built up over slides 94-101): starting from the radix tree with edges H → ELLO and H → A → {T, VE}. Operation: Insert HAIR adds the child IR under A. Operation: Delete HAT, HAVE removes those leaves, leaving A with the single child IR, so the two edges collapse into one edge AIR.]

slide-102
SLIDE 102

ART: BINARY COMPARABLE KEYS

Not all attribute types can be decomposed into binary comparable digits for a radix tree.

→ Unsigned Integers: Byte order must be flipped to big-endian representation on little-endian machines (x86).
→ Signed Integers: Flip the two’s-complement sign bit so that negative numbers sort before positive ones.
→ Floats: Classify into groups (neg vs. pos, normalized vs. denormalized), then store as an unsigned integer.
→ Compound: Transform each attribute separately.

102
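A minimal sketch of the integer transforms, assuming a little-endian (x86) machine and the GCC/Clang byte-swap builtin:

#include <cstdint>
#include <cstring>

// Unsigned: byte-swap to big endian so a byte-wise compare matches
// numeric order.
void EncodeU32(uint32_t v, uint8_t out[4]) {
    v = __builtin_bswap32(v);   // little endian -> big endian
    std::memcpy(out, &v, 4);
}

// Signed: additionally flip the sign bit so negative values sort
// before positive ones.
void EncodeI32(int32_t v, uint8_t out[4]) {
    EncodeU32(static_cast<uint32_t>(v) ^ 0x80000000u, out);
}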

slide-103
SLIDE 103

ART: BINARY COMPARABLE KEYS

103

Int Key: 168496141
Hex Key: 0A 0B 0C 0D

Big Endian: 0A 0B 0C 0D
Little Endian: 0D 0C 0B 0A

[Figure (built up over slides 103-109): a radix tree over the big-endian byte string, descending one byte at a time through the edges 0A → 0B → 0C → 0D alongside sibling digits such as 0F. Lookup: 658205 (Hex: 0A 0B 1D) follows the 0A and 0B edges and then diverges at the third byte.]

slide-110
SLIDE 110

BINARY COMPARABLE KEYS

110

[Figure: bar chart of Execution Time (ms) for Insert, Lookup, and Delete with three key representations — CompactIntsKey, GenericKey + FastCompare, and GenericKey + GenericCompare. Peloton w/ Bw-Tree Index. Data Set: 10m keys (three 64-bit ints).]

slide-111
SLIDE 111

CONCURRENT ART INDEX

HyPer’s ART is not latch-free.

→ The authors argue that it would be a significant amount of work to make it latch-free.

Approach #1: Optimistic Lock Coupling
Approach #2: Read-Optimized Write Exclusion

111

THE ART OF PRACTICAL SYNCHRONIZATION, DaMoN 2016

slide-112
SLIDE 112

OPTIMISTIC LOCK COUPLING

Optimistic crabbing scheme where writers are not blocked by readers.

→ Writers increment a node’s version counter when they acquire its latch.
→ A reader proceeds if a node’s latch is available, and then checks whether the latch’s counter has changed from when it first read it.

112
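A minimal sketch of the per-node versioned latch, using the low bit to mark "write-latched" — a common encoding, not the exact HyPer implementation:

#include <atomic>
#include <cstdint>

struct OptLatch {
    std::atomic<uint64_t> version{0};   // low bit set = writer active

    uint64_t ReadLock() const {         // spin while a writer holds the latch
        uint64_t v;
        do { v = version.load(std::memory_order_acquire); } while (v & 1);
        return v;
    }
    bool Validate(uint64_t v) const {   // true if node unchanged since ReadLock
        return version.load(std::memory_order_acquire) == v;
    }
    void WriteLock() {
        uint64_t v = version.load();
        for (;;) {
            if (v & 1) { v = version.load(); continue; }
            if (version.compare_exchange_weak(v, v + 1)) break;  // set low bit
        }
    }
    void WriteUnlock() {                // clear low bit, new even version
        version.fetch_add(1, std::memory_order_release);
    }
};

// Reader protocol for one parent->child step:
//   v_parent = parent.ReadLock();  read the child pointer;
//   v_child = child.ReadLock();  if (!parent.Validate(v_parent)) restart;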

slide-113
SLIDE 113

OPTIMISTIC LOCK COUPLING

113

[Figure (built up over slides 113-133): SEARCH 44 on a tree whose nodes A through G hold the keys 20, 10, 35, 6, 12, 23, 38, 44 and carry version counters (v3, v5, v6, v9, v4, v4, v5). The reader proceeds top-down: A: Read v3, A: Search Node; then B: Read v5, A: Recheck v3, B: Search Node; then C: Read v9, B: Recheck v5, C: Search Node. If a concurrent writer has bumped a node's version in the meantime (e.g., v5 → v6), the recheck fails and the reader restarts its traversal.]

slide-134
SLIDE 134

READ-OPTIMIZED WRITE EXCLUSION

Each node includes an exclusive lock that blocks only other writers, not readers.

→ Readers proceed without checking versions or locks.
→ Every writer must ensure that reads are always consistent.

Requires fundamental changes to how threads make modifications to the data structure.

134

slide-135
SLIDE 135

IN-MEMORY INDEXES

135

[Figure: bar chart of Operations/sec (M) for Insert-Only, Read-Only, and Read/Update workloads across Open Bw-Tree, B+Tree, Skip List, Masstree, and ART. Processor: 1 socket, 10 cores w/ 2×HT. Workload: 50m Random Integer Keys (64-bit).]

Source: Ziqi Wang

slide-136
SLIDE 136

PARTING THOUGHTS

Andy was wrong about the Bw-Tree and latch-free indexes.

136

slide-137
SLIDE 137

NEXT CLASS

Query Compilation

Reminder: Mid-term Exam on Feb 26th.
Reminder: Project Proposals due on Feb 28th.

137