

SLIDE 1

Indexing

CS6320 1/29/2018 Shachi Deshpande, Yunhe Liu

SLIDE 2

Content

  • Motivation for Indexing
  • B-tree

○ B-tree basics
○ The cost of B-tree operations
○ B-tree variants
○ B-tree in multi-user environments

  • Learned Index
SLIDE 3

Motivation for Indexing

Activity Question: Why do we need indexing?

SLIDE 4

Motivation for Indexing

Activity Question: Why do we need indexing?

  • Items are retrieved from secondary storage to memory before being processed.
  • Organizing files intelligently makes the retrieval process efficient.
  • A large, randomly accessed file in a computer system is associated with an index

○ which, like the labels on drawers,
○ directs the searcher to the small part of the file containing the desired item.

SLIDE 5

Content

  • Motivation for Indexing
  • B-tree

○ B-tree basics
○ The cost of B-tree operations
○ B-tree variants
○ B-tree in multi-user environments

  • Learned Index
SLIDE 6

Operations on a file

  • Files: sets of records
  • Each record: ri = (ki, αi), where ki is the key and αi is the associated information
  • Operations

○ Insert: add a new record (ki, αi), checking that ki is unique.
○ Delete: remove the record (ki, αi), given ki.
○ Find: retrieve αi, given ki.
○ Next: retrieve αi+1, given that αi was just retrieved.

[Diagram: records laid out as k0 α0 | k1 α1 | k2 α2 | …]
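These four operations can be sketched over an in-memory sorted array of (key, info) pairs; a minimal hypothetical illustration (not from the slides), using Python's `bisect` module:

```python
import bisect

class SortedFile:
    """Records (k_i, a_i) kept sorted by key; supports insert/delete/find/next."""
    def __init__(self):
        self.keys = []   # sorted keys k_i
        self.info = []   # associated information a_i, parallel to keys

    def insert(self, k, a):
        i = bisect.bisect_left(self.keys, k)
        if i < len(self.keys) and self.keys[i] == k:
            raise KeyError(f"duplicate key {k}")  # keys must be unique
        self.keys.insert(i, k)
        self.info.insert(i, a)

    def delete(self, k):
        i = bisect.bisect_left(self.keys, k)
        if i == len(self.keys) or self.keys[i] != k:
            raise KeyError(k)
        self.keys.pop(i)
        self.info.pop(i)

    def find(self, k):
        i = bisect.bisect_left(self.keys, k)
        if i == len(self.keys) or self.keys[i] != k:
            raise KeyError(k)
        return self.info[i]

    def next(self, k):
        """Retrieve a_{i+1}, given that key k_i was just retrieved."""
        i = bisect.bisect_right(self.keys, k)
        if i == len(self.keys):
            raise KeyError("no successor")
        return self.info[i]

f = SortedFile()
f.insert(2, "b"); f.insert(1, "a"); f.insert(3, "c")
print(f.find(2))   # b
print(f.next(2))   # c
```

A real file keeps the records on secondary storage; the index structures in the rest of the deck exist to make these same operations cheap there.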

SLIDE 7

B-tree: Generalization of Binary Search Tree

  • More than 2 paths can leave a given node.
  • Compare the query key with the keys stored at a node to decide which path to take.
  • Exact match: success. No exact match and a leaf is reached: failure.
SLIDE 8

B-tree of Order d

  • Each node contains at most 2d keys and 2d + 1 pointers.
  • Each node (except the root) contains at least d keys and d + 1 pointers (at least ½ full).
SLIDE 9

Balancing

B-Tree:

  • Never visits more than 1 + logd(n) nodes.
  • Accessing each node is a separate access to secondary storage.

SLIDE 10

Insertion

1. Find: proceed from the root to locate the proper leaf for insertion.
2. Insert: balance is restored by a procedure that moves from the leaf back toward the root.

SLIDE 11
SLIDE 12

Insertion: Split

Of the 2d + 1 keys, the smallest d are placed in one node, the largest d are placed in another node, and the remaining middle key is promoted to the parent node as a separator.

  • The splitting can propagate to the root, and the tree then increases in height by 1.
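The split step above can be sketched as a small function; a hypothetical illustration of the key partitioning only (pointer bookkeeping omitted):

```python
def split_node(keys, d):
    """Split an overfull node of 2d + 1 keys: the smallest d stay in one node,
    the largest d move to a new node, and the middle key is promoted to the
    parent as a separator."""
    assert len(keys) == 2 * d + 1
    left = keys[:d]          # smallest d keys
    separator = keys[d]      # promoted to the parent
    right = keys[d + 1:]     # largest d keys
    return left, separator, right

left, sep, right = split_node([1, 2, 3, 4, 5], d=2)
print(left, sep, right)   # [1, 2] 3 [4, 5]
```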
SLIDE 13

Deletion

Find the proper node. There are two possibilities:

1. The key to be deleted resides in a leaf.
2. The key resides in a nonleaf node. Then either:

a. an adjacent key can be found and swapped into the vacated position, or
b. the leftmost leaf of the right subtree is used.

SLIDE 14

Deletion: Underflow

After the removal, check that at least d keys remain in each node. If a node has fewer than d keys, underflow is said to occur and a redistribution of the keys becomes necessary.

SLIDE 15

Deletion: Concatenation

  • Keys can be redistributed among two neighbors only if at least 2d keys remain between them.
  • When fewer than 2d keys remain, a concatenation must occur.

○ The keys are simply combined into one of the nodes and the other is discarded.
○ Since only one node remains, the key separating the two nodes in the ancestor is no longer necessary and is added to the single remaining node.
○ If the descendants of the root are concatenated, they form a new root, decreasing the B-tree height by 1.

SLIDE 16

Content

  • Motivation for Indexing
  • B-tree

○ B-tree basics
○ The cost of B-tree operations
○ B-tree variants
○ B-tree in multi-user environments

  • Learned Index
SLIDE 17

The cost of operations

  • Retrieval costs
  • Insertion and Deletion costs
  • Sequential Processing
SLIDE 18

Retrieval costs

  • The cost of a find operation grows as the logarithm of the file size.
  • With d being the order of the B-tree, n the number of keys in the file, and h the height of the tree, h is proportional to logd(n).
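A quick numeric check of this logarithmic growth (the figures here are illustrative, not from the slides):

```python
import math

def btree_height_estimate(n, d):
    """Rough height h ~ log_d(n) for a B-tree of order d holding n keys."""
    return math.ceil(math.log(n, d))

# With nodes holding on the order of 100 keys, a billion-key file
# needs only about 5 levels -- i.e. ~5 secondary-storage accesses per find:
print(btree_height_estimate(10**9, 100))  # 5
```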

SLIDE 19

Insertion and Deletion costs - Tree Height

  • May require additional secondary-storage accesses beyond the cost of a find operation as it progresses back up the tree.
  • Overall, the costs are at most doubled, so the height of the tree still dominates the cost.
  • In a B-tree of order d for a file of n records, insertion and deletion take time proportional to logd(n) in the worst case.

SLIDE 20

Insertion and Deletion costs - Tree Order

  • As the branch factor d increases, the logarithmic base increases, and the cost of find, insert and delete operations decreases.
  • There are practical limits on the size of a node.

○ Most hardware systems bound the amount of data that can be transferred with one access to secondary storage.
○ The cost estimate hides a constant factor that grows as the size of the data transferred increases.

SLIDE 21

Sequential Processing

  • Use the next operation to process all records in key-sequence order.
  • A B-tree may not do well in sequential processing:

○ A preorder tree walk requires space for at least h = logd(n+1) nodes in main memory, since it stacks the nodes along a path from the root to avoid reading them twice.
○ Processing a next operation may require tracing a path through several nodes before reaching the desired key.

  • A B+-tree improves sequential processing performance.

SLIDE 22

Content

  • Motivation for Indexing
  • B-tree

○ B-tree basics
○ The cost of B-tree operations
○ B-tree variants
○ B-tree in multi-user environments

  • Learned Index
SLIDE 23

B-Tree variants

  • Different variations

○ Splitting vs. redistributing to a neighbor
○ Processing a node once it has been retrieved from secondary storage using different search methods (e.g. linear search, binary search)
○ Varying the "order" at each depth

  • B*-Trees
  • B+-Trees
SLIDE 24

B*-Trees

  • Each node is at least ⅔ full (instead of just ½ full).
  • Delay splitting until 2 sibling nodes are full, then divide them into 3 nodes, each ⅔ full.
  • Increases storage utilization.
  • Speeds up search, as the height of the tree is reduced.
SLIDE 25

B+-Trees structure

  • All keys reside in the leaves.
  • Nonleaf levels are organized as a B-tree and consist only of the index.
  • Leaf nodes are usually linked together left-to-right.
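The left-to-right leaf links are what make sequential processing cheap; a minimal hypothetical sketch:

```python
class Leaf:
    """B+-tree leaf: holds keys and a pointer to the next leaf."""
    def __init__(self, keys):
        self.keys = keys
        self.next = None

def scan(first_leaf):
    """Sequential processing: follow the leaf chain left to right,
    with no need to re-descend the tree between next operations."""
    leaf = first_leaf
    while leaf is not None:
        yield from leaf.keys
        leaf = leaf.next

a, b, c = Leaf([1, 2]), Leaf([3, 4]), Leaf([5])
a.next, b.next = b, c
print(list(scan(a)))  # [1, 2, 3, 4, 5]
```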
SLIDE 26

B+-Tree Operations

  • Insertion:

○ Almost identical to B-tree insertion.
○ During a split, instead of promoting the middle key, promote a copy of it; the key itself stays in the leaf.

  • Deletion:

○ The key to be deleted always resides in a leaf node, which makes deletion simple.
○ As long as the leaf remains at least half full, the upper index levels do not need to change.

  • Find:

○ Search does not stop on an exact match in an index node; instead the right pointer is followed.
○ Search always proceeds all the way to a leaf.

SLIDE 27

Content

  • Motivation for Indexing
  • B-tree

○ B-tree basics
○ The cost of B-tree operations
○ B-tree variants
○ B-tree in multi-user environments

  • Learned Index
SLIDE 28

B-tree in Multiuser Environment

  • Should permit several user requests to be processed simultaneously.
  • One process may read a node and follow one of its links while another process is changing that node.
  • Find operations go top-down, while insertion and deletion require bottom-up access.

SLIDE 29

B-tree in Multiuser Environment: Locking

  • Find operation

○ Locks a node once it has been read.
○ Releases the lock when the search proceeds to the next level.
○ Readers lock at most two nodes at any time.

  • Update operation

○ Places a reservation on access.
○ The reservation is converted to an absolute lock if the update's changes will propagate to the reserved node; otherwise the reservation is cancelled.
○ A reserved node may be read but may not be reserved a second time.

SLIDE 30

B-tree in Multiuser Environment: Security

  • Protection of information in a multiuser environment.
  • Memory protection mechanisms such as paging.
  • Encryption techniques can be used to protect the contents of a file outside of the underlying system.

SLIDE 31

Summary of B-tree

  • Efficient, simple, and easily maintained.
  • Logarithmic-cost find, insert and delete operations.
  • Guarantees 50% storage utilization.
  • B+-trees allow efficient sequential processing.
  • There are many variants of the B-tree.
  • Can be used in multiuser environments.
SLIDE 32

Content

  • Motivation for Indexing
  • B-tree

○ B-tree basics
○ The cost of B-tree operations
○ B-tree variants
○ B-tree in multi-user environments

  • Learned Index
SLIDE 33

Indexes as Models

  • B-Tree Index : Maps key to position of record in sorted array
  • Hash Index : Maps key to position of record in unsorted array
  • BitMap Index : Checks if a data-record exists
SLIDE 34

Indexes as Models

  • B-Tree Index : Maps key to position of record in sorted array
  • Hash Index : Maps key to position of record in unsorted array
  • BitMap Index : Checks if a data-record exists

Can we replace these traditional models with other kinds of models?

SLIDE 35

Activity

If we have fixed-length records with continuous integer keys from 1 to 1 million, can we find a better way to access the record corresponding to any given key? What if the length of each record were one unit greater than that of its immediate predecessor?

SLIDE 36

Knowing data distribution helps !

  • ML models, especially neural nets, can learn a variety of data distributions, mixtures and other patterns.
  • Balancing model complexity against accuracy is important.
SLIDE 37

What should the model learn?

SLIDE 38

What should the model learn?

A model that predicts the position of a key within a sorted array effectively approximates the Cumulative Distribution Function (CDF) of the keys:

p = F(Key) * N

where p is the position estimate, F is the estimated CDF, and N is the total number of records.
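A hypothetical sketch of this idea on synthetic data: fitting a line to (key, position) pairs of a sorted array is equivalent to learning F scaled by N, since position i is just F(key) * N for the empirical distribution.

```python
# Synthetic sorted keys; position i plays the role of F(key) * N.
keys = [2 * i for i in range(1000)]
N = len(keys)

# Least-squares line through (key, position) pairs: regressing the
# position directly is equivalent to learning the CDF scaled by N.
mean_k = sum(keys) / N
mean_p = (N - 1) / 2
cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
var = sum((k - mean_k) ** 2 for k in keys)
slope = cov / var
intercept = mean_p - slope * mean_k

def predict_pos(key):
    """Predicted position of `key` in the sorted array."""
    return round(slope * key + intercept)

print(predict_pos(500))  # 250 -- the true position of key 500
```

On real, non-linear key distributions a single line is a poor fit, which is what motivates the hierarchy of models later in the deck.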

SLIDE 39

Naive Learned Index

  • Performance is worse in comparison to traditional B-trees!
  • It might be more CPU- and space-efficient to narrow the position of an item from the entire dataset down to a region of thousands of records.
  • It is significantly more difficult to run the "last mile" down to the exact record.
  • The cache-efficiency and memory-efficiency of B-trees are difficult to replicate in our model.

SLIDE 40

Learning Index Framework

  • Learns simple models on the fly and relies on TensorFlow for complex models
  • Generates efficient index structures in C++ for inference
  • Runs simple models on the order of 30 ns
SLIDE 41

Recursive Model Index

  • Learns a hierarchy of models instead of a single unified model for indexing.
  • Each stage takes the key as input and selects a better model in the next hierarchical layer.
  • The final stage predicts the position.
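A toy two-stage RMI can be sketched as follows (hypothetical, on synthetic linear keys; the real framework uses much larger models and neural nets at the top): the root model routes each key to one of M second-stage linear models, and the chosen leaf model predicts the position.

```python
def fit_linear(pairs):
    """Least-squares line through (key, position) pairs."""
    n = len(pairs)
    if n == 1:
        return 0.0, float(pairs[0][1])
    mk = sum(k for k, _ in pairs) / n
    mp = sum(p for _, p in pairs) / n
    var = sum((k - mk) ** 2 for k, _ in pairs)
    slope = sum((k - mk) * (p - mp) for k, p in pairs) / var if var else 0.0
    return slope, mp - slope * mk

def route(root, key, M, N):
    """Stage 1: scale the root model's position estimate to a model index."""
    slope, intercept = root
    return min(M - 1, max(0, int((slope * key + intercept) * M / N)))

def build_rmi(keys, M):
    """Stage 1 = one linear model over all keys; stage 2 = M linear models,
    each trained only on the keys routed to it."""
    root = fit_linear([(k, i) for i, k in enumerate(keys)])
    buckets = [[] for _ in range(M)]
    for i, k in enumerate(keys):
        buckets[route(root, k, M, len(keys))].append((k, i))
    leaves = [fit_linear(b) if b else (0.0, 0.0) for b in buckets]
    return root, leaves

def lookup(rmi, key, M, N):
    root, leaves = rmi
    slope, intercept = leaves[route(root, key, M, N)]
    return round(slope * key + intercept)

keys = [3 * i for i in range(900)]
rmi = build_rmi(keys, M=10)
print(lookup(rmi, keys[444], 10, len(keys)))  # 444
```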
SLIDE 42

Hybrid Indexes

  • Different layers have different types of learning models
  • Ideas?
SLIDE 43

Hybrid Indexes

  • Different layers have different types of learning models
  • Ideas?

○ A small ReLU neural net at the top can learn a wide range of complex data distributions.
○ The bottom layers can use simple linear regression models, which are inexpensive in space and execution time.
○ Traditional B-Trees (i.e. decision trees) can be used at the bottom if the data is particularly difficult to learn.

SLIDE 44

Hybrid Indexes

  • Different layers have different types of learning models
  • Ideas?

○ A small ReLU neural net at the top can learn a wide range of complex data distributions.
○ The bottom layers can use simple linear regression models, which are inexpensive in space and execution time.
○ Traditional B-Trees (i.e. decision trees) can be used at the bottom if the data is particularly difficult to learn.

  • The worst-case performance of learned indexes is bounded by that of B-Trees!
SLIDE 45

Searching record with learned index

  • Find the first key higher/lower than the look-up key, starting from the prediction.
  • Model-biased search

○ The middle point of the binary search is set to the value predicted by our model.

  • Biased quaternary search

○ Three middle points: pos − 𝞽, pos, pos + 𝞽

  • The min- and max-errors of the model are used to define the search area.
  • Can this work for non-existent keys?
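Model-biased search with error bounds can be sketched as follows (hypothetical; `predict` and `max_err` stand in for the learned model and its worst-case prediction error):

```python
import bisect

def model_biased_find(keys, key, predict, max_err):
    """Search only [pred - max_err, pred + max_err] around the model's
    predicted position instead of the whole sorted array."""
    pred = predict(key)
    lo = max(0, pred - max_err)
    hi = min(len(keys), pred + max_err + 1)
    i = bisect.bisect_left(keys, key, lo, hi)
    if i < hi and keys[i] == key:
        return i
    raise KeyError(key)  # key absent from the predicted region

keys = [2 * i for i in range(100)]
predict = lambda k: k // 2 + 1          # toy model, deliberately off by one
print(model_biased_find(keys, 40, predict, max_err=2))  # 20
```

The guarantee rests entirely on `max_err` genuinely bounding the model's error; for a key the model has never seen (including non-existent keys), the search still lands in the right region only if the error bound holds.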
SLIDE 46

Indexing Strings / Training

SLIDE 47

Results

  • Maps - longitudes of locations are relatively linear, with few irregularities
  • Weblogs - worst-case scenario, complex time patterns
  • Log-normal (synthetic) - highly non-linear, a CDF that is difficult to learn with NNs
SLIDE 48

Results

Comparison with other models

  • Lookup tables with fixed-size records allow use of AVX instructions
  • FAST - a highly SIMD-optimized data structure, but its memory requirement is higher
  • Fixed-size B-tree with interpolation search (a variation of binary search for uniformly distributed data) - the height of the B-tree is fixed to reduce memory consumption
  • Multivariate learned index - multivariate linear regression used at the top layer of the hierarchy, with variables like key, log(key), key²

SLIDE 49

Results

String Datasets

  • The speed-up for the learned index is less prominent, due to the high cost of model execution and search over strings.
  • Higher precision in hybrid indexes helps, since string search is more expensive.
  • Different search strategies make a difference (biased binary search vs. biased quaternary search).
  • A non-hybrid RMI with quaternary search performed best.

SLIDE 50

Point Index

  • Hash-maps have been used for point look-ups.
  • Efficient implementations aim to reduce conflicts.
  • Previous learning models for hash functions didn't consider the underlying data distribution, and hence the size of the data structure grew with the data size.

SLIDE 51

Hash-Model Index

  • Learns the CDF of the key distribution.
  • We don't aim to store keys compactly or in strictly sorted order.
  • Inserts, look-ups and conflict handling depend on the hashmap architecture.
  • The benefits of a learned hashmap function depend on the accuracy of the model in representing the CDF, the hashmap architecture, etc.
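A hypothetical sketch of the idea: use the model's CDF estimate, scaled by the table size, as the hash function, so even a skewed key distribution spreads evenly over the buckets (the keys and the CDF model here are synthetic).

```python
from collections import Counter

def cdf_hash(key, cdf, num_buckets):
    """Learned hash function: scale the model's CDF estimate by the table size."""
    return min(num_buckets - 1, int(cdf(key) * num_buckets))

# Heavily skewed synthetic keys: quadratic growth clusters the small keys.
keys = [i * i for i in range(100)]

# Toy CDF model of this distribution (midpoint empirical CDF: rank i -> (i + 0.5)/100).
cdf = lambda k: (k ** 0.5 + 0.5) / 100

buckets = [cdf_hash(k, cdf, 50) for k in keys]
print(max(Counter(buckets).values()))  # 2 -- keys spread evenly, 2 per bucket
```

A conventional hash indifferent to the distribution gives the same expected load, but a good CDF model gives a near-perfect spread deterministically, which is where the conflict reduction comes from.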

SLIDE 52

Results

  • Learned models reduced conflicts by up to 77% over these datasets, learning the empirical CDF at reasonable cost.
  • For distributed settings that use RDMA for lookups, the benefits of learned models can be high.
  • Depending on the hashmap architecture, the complexity of learned models may or may not pay off.

SLIDE 53

Existence Index

Bloom Filters

  • Space-efficient probabilistic data structure to test whether an element is a member of a set
  • Guarantees no false negatives, but false positives are possible
  • In spite of being space-efficient, can still occupy a significant amount of memory
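For reference, a classic Bloom filter can be sketched in a few lines (hypothetical illustration; the parameters m and k are arbitrary here):

```python
import hashlib

class BloomFilter:
    """Classic Bloom filter: k hash positions per item over an m-bit array.
    Membership tests have no false negatives; false positives are possible."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive k positions from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter(m=1024, k=3)
for word in ["apple", "banana", "cherry"]:
    bf.add(word)
print("apple" in bf)   # True -- members are always found
print("durian" in bf)  # almost certainly False at this low load factor
```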
SLIDE 54

Learned Bloom Filters

  • Given the high latencies of access to cold storage, we can afford more complex models that reduce false positives and space requirements.
  • What properties should a good hash function for Bloom filters have?
SLIDE 55

Learned Bloom Filters

  • Given high latencies to access cold-storage, we can afford to have more

complex models reducing false positives and space requirements

  • What properties should a good hash function for bloom filters have?

○ lots of collisions among keys ○ lots of collisions among non-keys(keys that don’t exist) ○ few collisions between keys and non-keys

SLIDE 56

Learned Bloom Filters

  • Given high latencies to access cold-storage, we can afford to have more

complex models reducing false positives and space requirements

  • What properties should a good hash function for bloom filters have?

○ lots of collisions among keys ○ lots of collisions among non-keys (keys that don’t exist) ○ few collisions between keys and non-keys

  • Maintain specific FPR for realistic queries while maintaining FNR of zero
  • Existence indices have traditionally not used distribution of keys to advantage,

but learned bloom filters can !

  • Any ideas?
SLIDE 57

Learned Bloom Filters

Bloom filters as a classification problem:

  • Use a neural network with sigmoid activation to produce a binary probabilistic classifier.
  • Choose a threshold 𝜐 such that outputs above 𝜐 are assumed to exist in the database.
  • Such a model will have a positive FNR along with a positive FPR! Solutions?
SLIDE 58

Learned Bloom Filters

Bloom filters as a classification problem:

  • Use a neural network with sigmoid activation to produce a binary probabilistic classifier.
  • Choose a threshold 𝜐 such that outputs above 𝜐 are assumed to exist in the database.
  • Such a model will have a positive FNR along with a positive FPR! Solutions?
SLIDE 59

Learned Bloom Filters

How do we maintain a specific FPR p*? Writing FPR_O for the overall FPR, FPR_𝜐 for the classifier's FPR at threshold 𝜐, and FPR_B for the overflow Bloom filter's FPR:

FPR_O = FPR_𝜐 + (1 − FPR_𝜐) * FPR_B

For simplicity, keep FPR_𝜐 = FPR_B = p*/2 to ensure FPR_O ≤ p*. Such a 𝜐 can be tuned over a held-out dataset of non-keys.

The learned model is small compared to the dataset, and the overflow Bloom filter scales with the FNR, so the overall memory footprint is lower.

Bloom filters with model hashes: learn a hash function such that most keys are mapped to higher bit positions and non-keys to lower bit positions, so the same probabilistic model can be used!
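A hypothetical sketch of the two-part construction: a score model with threshold 𝜐, plus an overflow structure holding exactly the keys the model misses, so no false negative is possible (a Python set stands in for the small overflow Bloom filter, and the "model" is a toy scoring rule).

```python
class LearnedBloomFilter:
    """Learned existence index: a score model with threshold tau plus an
    overflow structure holding exactly the keys the model misses, so the
    false-negative rate is zero by construction."""

    def __init__(self, model, tau, keys):
        self.model, self.tau = model, tau
        # A real implementation would use a small Bloom filter here;
        # a plain set keeps the sketch self-contained.
        self.overflow = {k for k in keys if model(k) < tau}

    def might_contain(self, key):
        return self.model(key) >= self.tau or key in self.overflow

# Toy score model: pretends real keys are mostly even numbers.
model = lambda k: 0.9 if k % 2 == 0 else 0.1
keys = [2, 4, 6, 7]              # 7 is a real key the model scores low
f = LearnedBloomFilter(model, tau=0.5, keys=keys)
print(all(f.might_contain(k) for k in keys))  # True -- FNR is zero
print(f.might_contain(9))                     # False
```

False positives still occur for non-keys the model scores above 𝜐 (here, any even non-key); that rate is what the threshold tuning and the overflow filter's own FPR jointly control.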

SLIDE 60

Results

Example: a normal Bloom filter with an FPR of 1% needs 2.04 MB. A 16-dimensional GRU-type RNN requires 0.0259 MB. Setting 𝜐 for an FPR of 0.5% makes the FNR 55%, and the spillover Bloom filter requires 1.39 MB (a 36% reduction in total size).

Additional work: covariate shifts in the query distribution, using additional features for the ML models, etc.

SLIDE 61

Conclusions and Future Directions

  • Exploring other ML models
  • Multi-dimensional indexes using any combination of attributes as key
  • Learned Algorithms
  • GPU/TPU and other hardware improvements