Indexing CS6320 1/29/2018 Shachi Deshpande, Yunhe Liu
Content
- Motivation for Indexing
- B-tree
○ B-tree basics
○ The cost of B-tree operations
○ B-tree variants
○ B-trees in multi-user environments
- Learned Index
Motivation for Indexing
Activity Question: Why do we need indexing?
- Items are retrieved from secondary storage into memory before being processed.
- Organizing files intelligently makes the retrieval process efficient.
- A large, randomly accessed file in a computer system is associated with an index, which, like the labels on drawers, directs the searcher to the small part of the file containing the desired item.
Content
- Motivation for Indexing
- B-tree
○ B-tree basics
○ The cost of B-tree operations
○ B-tree variants
○ B-trees in multi-user environments
- Learned Index
Operations on a file
- File: a set of records
- Each record: ri = (ki, αi), where ki is the key and αi is the associated information
- Operations
○ Insert: add a new record (ki, αi), checking that ki is unique.
○ Delete: remove the record (ki, αi), given ki.
○ Find: retrieve αi, given ki.
○ Next: retrieve αi+1, given that αi was just retrieved.

(Diagram: the file as a sequence of pairs k0 α0 | k1 α1 | k2 α2 | …)
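As a concrete illustration of these four operations, here is a minimal in-memory sketch (the `RecordFile` class and its method names are illustrative, not from the slides), keeping records sorted by key:

```python
import bisect

class RecordFile:
    """A file of records (k_i, a_i), kept sorted by key."""

    def __init__(self):
        self.keys = []    # sorted keys k_i
        self.values = []  # associated information a_i, aligned with keys

    def insert(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            raise KeyError("duplicate key")  # keys must be unique
        self.keys.insert(i, key)
        self.values.insert(i, value)

    def delete(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i == len(self.keys) or self.keys[i] != key:
            raise KeyError(key)
        self.keys.pop(i)
        self.values.pop(i)

    def find(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i == len(self.keys) or self.keys[i] != key:
            raise KeyError(key)
        return self.values[i]

    def next(self, key):
        # retrieve a_{i+1}, given that the record with key k_i was just read
        i = bisect.bisect_right(self.keys, key)
        return self.values[i] if i < len(self.keys) else None
```

A sorted list keeps the sketch short; the rest of the deck is about structures (B-trees) that support the same operations efficiently on secondary storage.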
B-tree: Generalization of Binary Search Tree
- More than 2 paths leave a given node.
- Compare the query key with the keys stored at the node to decide which path to take.
- Exact match: success. No exact match by the time a leaf is reached: failure.
B-tree of Order d
- Each node contains at most 2d keys and 2d + 1 pointers.
- Each node (except the root) contains at least d keys and d + 1 pointers (at least ½ full).
Balancing
B-Tree:
- Never visits more than 1 + logd(n) nodes.
- Accessing each node is a separate access to secondary storage.
Insertion
1. Find: proceed from the root to locate the proper leaf for insertion.
2. Insert: balance is restored by a procedure that moves from the leaf back toward the root.
Insertion: Split
Of the 2d + 1 keys, the smallest d are placed in one node, the largest d are placed in another node, and the remaining middle key is promoted to the parent node as a separator. The split can propagate up to the root, increasing the height of the tree by 1.
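The split rule above can be sketched as follows (the function name is illustrative; a real B-tree split would also divide the child pointers and associated records):

```python
def split_node(keys, d):
    """Split an overfull order-d B-tree node holding 2d + 1 sorted keys.

    Returns (left_keys, separator, right_keys): the smallest d keys stay
    in one node, the largest d go to a new node, and the middle key is
    promoted to the parent as the separator between the two.
    """
    assert len(keys) == 2 * d + 1
    left = keys[:d]
    separator = keys[d]
    right = keys[d + 1:]
    return left, separator, right
```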
Deletion
Find the proper node. There are two possibilities:
1. The key to be deleted resides in a leaf.
2. The key resides in a nonleaf node.
a. An adjacent key can be found and swapped into the vacated position.
b. For example, use the smallest key in the right subtree (in its leftmost leaf).
Deletion: Underflow
After the removal, check that at least d keys remain in each node. If a node has fewer than d keys, underflow is said to occur and a redistribution of keys becomes necessary.
Deletion: Concatenation
- Redistribution of keys between two neighbors works only if together they hold at least 2d keys.
- When fewer than 2d keys remain, a concatenation must occur.
○ The keys are simply combined into one of the nodes and the other is discarded.
○ Since only one node remains, the key separating the two nodes in the ancestor is no longer necessary and is added to the single remaining node.
○ If the only descendants of the root are concatenated, they form a new root, decreasing the B-tree height by 1.
Content
- Motivation for Indexing
- B-tree
○ B-tree basics
○ The cost of B-tree operations
○ B-tree variants
○ B-trees in multi-user environments
- Learned Index
The cost of operations
- Retrieval costs
- Insertion and Deletion costs
- Sequential Processing
Retrieval costs
- The cost of a find operation grows as the logarithm of the file size.
- With d the order of the B-tree, n the number of keys in the file, and h the height of the tree: h ≤ logd((n + 1) / 2).
Insertion and Deletion costs - Tree Height
- Insertion and deletion may require additional secondary-storage accesses beyond the cost of a find operation as they progress back up the tree.
- Overall, the costs are at most doubled, so the height of the tree still dominates the cost.
- In a B-tree of order d for a file of n records, insertion and deletion take time
proportional to logd(n) in the worst case.
Insertion and Deletion costs - Tree Order
- As the branch factor d increases, the logarithmic base increases, so the cost of find, insert, and delete operations decreases.
- There are practical limits on the size of a node.
○ Most hardware systems bound the amount of data that can be transferred with one access to secondary storage.
○ The cost estimate now hides a constant factor that grows with the size of the data transferred.
Sequential Processing
- Use the next operation to process all records in key-sequence order.
- A B-tree may not do well in sequential processing:
○ A preorder tree walk requires space for at least h = logd(n+1) nodes in main memory, since it stacks the nodes along a path from the root to avoid reading them twice.
○ Processing a next operation may require tracing a path through several nodes before reaching the desired key.
- The B+-tree improves sequential-processing performance.
Content
- Motivation for Indexing
- B-tree
○ B-tree basics
○ The cost of B-tree operations
○ B-tree variants
○ B-trees in multi-user environments
- Learned Index
B-Tree variants
- Different variations
○ Splitting vs. redistributing keys to a neighbor
○ Processing a node once it has been retrieved from secondary storage using different search methods (e.g., linear search, binary search)
○ Varying the "order" at each depth
- B*-trees
- B+-trees
B*-Trees
- Each node is at least ⅔ full (instead of just ½ full).
- Splitting is delayed until 2 sibling nodes are full; they are then divided into 3 nodes, each ⅔ full.
- Increases storage utilization.
- Speeds up search, as the height of the tree is reduced.
B+-Trees structure
- All keys reside in the leaves.
- Nonleaf levels are organized as a B-tree and contain only index entries directing the search to the correct leaf.
- Leaf nodes are usually linked together left-to-right.
B+-Tree Operations
- Insertion:
○ Almost identical to B-tree insertion.
○ During a split, instead of promoting the middle key, promote a copy of it (the key itself stays in a leaf).
- Deletion:
○ The key to be deleted always resides in a leaf node, which makes deletion simple.
○ As long as the leaf remains at least half full, the upper index levels need not change.
- Find:
○ Search does not stop on an exact match in an index node; instead, the right pointer is followed.
○ A find always proceeds all the way to a leaf.
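The left-to-right leaf links are what make B+-tree sequential processing cheap: a range scan follows sibling pointers instead of re-walking the tree for each next operation. A sketch, with illustrative names:

```python
class Leaf:
    """A B+-tree leaf node: sorted keys, their records, and a sibling link."""

    def __init__(self, keys, values):
        self.keys = keys
        self.values = values
        self.next = None  # pointer to the right sibling leaf

def scan(leaf):
    """Yield all (key, value) pairs in key order by chasing sibling links."""
    while leaf is not None:
        yield from zip(leaf.keys, leaf.values)
        leaf = leaf.next
```

After one initial descent to the leftmost relevant leaf, every subsequent next operation costs at most one leaf access, with no stack of ancestor nodes in memory.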
Content
- Motivation for Indexing
- B-tree
○ B-tree basics
○ The cost of B-tree operations
○ B-tree variants
○ B-trees in multi-user environments
- Learned Index
B-tree in Multiuser Environment
- The index should permit several user requests to be processed simultaneously.
- One process may read a node and follow one of its links while another process is changing that node.
- Find operations go top-down, while insertion and deletion require bottom-up access.
B-tree in Multiuser Environment: Locking
- Find operation
○ Locks a node once it has been read.
○ Releases the lock when the search proceeds to the next level.
○ A reader holds locks on at most two nodes at any time.
- Update operation
○ Places a reservation on access.
○ The reservation is converted to an absolute lock if the update's changes will propagate to the reserved node; otherwise, the reservation is cancelled.
○ A reserved node may be read but may not be reserved a second time.
B-tree in Multiuser Environment: Security
- Protection of information in a multiuser environment.
- Memory protection mechanism of paging.
- Encryption techniques can be used to protect contents of a file outside of the
underlying system.
Summary of B-tree
- Efficient, simple, and easily maintained.
- Logarithmic-cost find, insert, and delete operations.
- Guarantees at least 50% storage utilization.
- The B+-tree allows efficient sequential processing.
- There are many variants of the B-tree.
- Can be used in multiuser environments.
Content
- Motivation for Indexing
- B-tree
○ B-tree basics
○ The cost of B-tree operations
○ B-tree variants
○ B-trees in multi-user environments
- Learned Index
Indexes as Models
- B-Tree Index : Maps key to position of record in sorted array
- Hash Index : Maps key to position of record in unsorted array
- BitMap Index : Checks if a data-record exists
Can we replace these traditional models with other kinds of models?
Activity
If we have fixed-length records with continuous integer keys from 1 to 1 million, can we find a better way to access the record corresponding to any given key? What if the length of each record was one unit greater than that of its immediate predecessor?
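One possible answer to the activity, sketched in code (function names are illustrative): with dense integer keys, the record's byte offset is a closed-form function of the key, so a formula can replace the index entirely. The second case, where record i is one unit longer than record i − 1, is an arithmetic series and still has a closed form:

```python
def offset_fixed(key, record_len):
    """Offset of record `key` (keys 1..N) when all records have the same length."""
    return (key - 1) * record_len

def offset_growing(key, first_len):
    """Offset when record i is one unit longer than record i - 1.

    The preceding records have lengths first_len, first_len + 1, ...,
    so the offset is the sum of an arithmetic series.
    """
    n = key - 1  # number of records preceding this key
    return n * first_len + n * (n - 1) // 2
```

Both functions compute a position in O(1) with no stored index, which is exactly the observation motivating learned indexes: knowing the key distribution lets a model (here, an exact formula) replace the structure.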
Knowing the data distribution helps!
- ML models, especially neural nets, can learn a variety of data distributions, mixtures, and other patterns.
- Balancing model complexity against accuracy is important.
What should the model learn?
A model that predicts the position of a key within a sorted array effectively approximates the cumulative distribution function (CDF) of the keys:

p = F(Key) * N

where p is the position estimate, N is the total number of records, and F is the estimated CDF.
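To make the CDF view concrete, here is a minimal learned-index sketch (all names are illustrative, and a plain least-squares line stands in for the richer models in the paper): fit a model mapping keys to positions, record its worst under- and over-prediction on the data, and then search only within those error bounds.

```python
def fit(keys):
    """Fit pos ~ a*key + b over a sorted key array by least squares.

    Returns (a, b, min_err, max_err), where the error bounds are the
    worst under- and over-prediction observed on the training keys.
    """
    n = len(keys)
    xs, ys = keys, range(n)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    errs = [y - (a * x + b) for x, y in zip(xs, ys)]
    return a, b, min(errs), max(errs)

def lookup(keys, model, key):
    """Predict a position, then search only the guaranteed error window."""
    a, b, lo_err, hi_err = model
    pred = a * key + b
    lo = max(0, int(pred + lo_err))
    hi = min(len(keys) - 1, int(pred + hi_err) + 1)
    for i in range(lo, hi + 1):
        if keys[i] == key:
            return i
    return None
```

Because min_err and max_err are measured over all training keys, every stored key is guaranteed to lie inside the predicted window, so the final search is over a small range rather than the whole array.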
Naive Learned Index
- Performance is worse than that of traditional B-trees!
- It might be more CPU- and space-efficient to narrow the position of an item from the entire dataset down to a region of thousands of records.
- It is significantly more difficult to run the "last mile" down to the exact record.
- The cache- and memory-efficiency of B-trees is difficult to replicate in such a model.
Learning Index Framework
- Learns simple models on the fly and relies on TensorFlow for complex models.
- Generates efficient index structures in C++ for inference.
- Runs simple models on the order of 30 ns.
Recursive Model Index
- Learns a hierarchy of models instead of a single unified model for indexing.
- Each stage takes the key as input and selects a better model in the next layer of the hierarchy.
- The final stage predicts the position.
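A two-stage sketch of the recursive model index, using simple linear models at both stages (names are illustrative; the paper allows different model types per stage, including small neural nets):

```python
def fit_linear(pairs):
    """Least-squares line y ~ a*x + b over (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    var = sum((x - mx) ** 2 for x, _ in pairs) or 1.0
    a = sum((x - mx) * (y - my) for x, y in pairs) / var
    return a, my - a * mx

def build_rmi(keys, fanout):
    """Stage 1: one root model. Stage 2: `fanout` models, each trained
    only on the keys the root model routes to it."""
    n = len(keys)
    root = fit_linear([(k, i) for i, k in enumerate(keys)])
    buckets = [[] for _ in range(fanout)]
    for i, k in enumerate(keys):
        j = min(fanout - 1, max(0, int((root[0] * k + root[1]) * fanout / n)))
        buckets[j].append((k, i))
    stage2 = [fit_linear(b) if b else root for b in buckets]
    return root, stage2, n

def predict(rmi, key):
    """Route through the root model, then predict with the chosen model."""
    root, stage2, n = rmi
    j = min(len(stage2) - 1,
            max(0, int((root[0] * key + root[1]) * len(stage2) / n)))
    a, b = stage2[j]
    return int(round(a * key + b))
```

Each second-stage model only has to fit a small slice of the CDF, so simple, cheap models can reach high precision even when a single model over the whole dataset could not.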
Hybrid Indexes
- Different layers have different types of learning models
- Ideas?
○ A small ReLU neural net at the top can learn a wide range of complex data distributions.
○ The bottom layers can use simple linear-regression models, which are inexpensive in space and execution time.
○ Traditional B-trees (i.e., decision trees) can be used at the bottom if the data is particularly difficult to learn.
- This bounds the worst-case performance of learned indexes to that of B-trees!
Searching record with learned index
- Find the first key higher/lower than the look-up key, based on the prediction.
- Model-biased search
○ The middle point of a binary search is set to the value predicted by the model.
- Biased quaternary search
○ Three middle points: pos − 𝞽, pos, pos + 𝞽
- The min- and max-errors of the model define the search area.
- Can this work for non-existent keys?
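Model-biased binary search can be sketched as an ordinary binary search whose first probe is placed at the model's prediction (illustrative names; the quaternary variant instead probes at pos − 𝞽, pos, and pos + 𝞽):

```python
def biased_search(keys, key, predicted_pos):
    """Binary search over sorted `keys`, first probing at the prediction.

    A good prediction finds the key in one or two probes; a bad one
    degrades gracefully to ordinary binary search.
    """
    lo, hi = 0, len(keys) - 1
    mid = min(max(predicted_pos, lo), hi)  # first probe at the prediction
    while lo <= hi:
        if keys[mid] == key:
            return mid
        if keys[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
        mid = (lo + hi) // 2               # fall back to ordinary halving
    return None
```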
Indexing Strings / Training
Results
- Maps: the longitude of locations is relatively linear and has fewer irregularities.
- Weblogs: the worst-case scenario, with complex time patterns.
- Log-Normal (synthetic): highly non-linear; the CDF is difficult to learn using NNs.
Results
Comparison with other models
- Lookup tables with fixed-size records allow the use of AVX instructions.
- FAST: a highly SIMD-optimized data structure, but with higher memory requirements.
- Fixed-height B-tree with interpolation search (a variation of binary search for uniformly distributed data): the height of the B-tree is fixed to reduce memory consumption.
- Multivariate learned index: multivariate linear regression at the top layer of the hierarchy, with variables like key, log(key), and key².
Results
String Datasets
- The speed-up for the learned index is less prominent, due to the high cost of model execution and search over strings.
- Higher precision in hybrid indexes helps, since string search is more expensive.
- Different search strategies make a difference (biased binary search vs. biased quaternary search).
- A non-hybrid RMI with quaternary search performed best.
Point Index
- Hash maps have been used for point look-ups.
- Efficient implementations aim to reduce conflicts.
- Previous learned hash functions did not consider the underlying data distribution, so the size of the data structure grew with the data size.
Hash-Model Index
- Learns the CDF of the key distribution.
- We do not aim to store keys compactly or in strictly sorted order.
- Inserts, look-ups, and conflict handling depend on the hash-map architecture.
- The benefits of a learned hash function depend on the accuracy of the model in representing the CDF, the hash-map architecture, etc.
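A sketch of the idea, using the empirical CDF of a key sample as the "model" (a learned model would replace the `bisect` rank computation; names are illustrative): scaling the CDF by the table size m spreads the keys evenly regardless of their distribution.

```python
import bisect

def make_cdf_hash(sample_keys, m):
    """Build h(key) = floor(F(key) * m) from a key sample, where F is
    the empirical CDF of the sample."""
    sample = sorted(sample_keys)
    n = len(sample)

    def h(key):
        rank = bisect.bisect_left(sample, key)  # rank/n approximates F(key)
        return min(m - 1, rank * m // n)

    return h
```

For skewed data (e.g., the quadratically spaced keys below), a distribution-aware hash like this can place every key in its own slot, where a distribution-oblivious modulo hash would cluster them.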
Results
- Learned models reduced conflicts by up to 77% over these datasets, learning the empirical CDF at reasonable cost.
- For distributed settings with RDMA look-ups, the benefits of learned models can be high.
- Depending on the hash-map architecture, the complexity of learned models may or may not pay off.
Existence Index
Bloom Filters
- A space-efficient probabilistic data structure to test whether an element is a member of a set.
- Guarantees no false negatives, but false positives are possible.
- In spite of being space-efficient, it can still occupy a significant amount of memory.
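For reference, a minimal classic Bloom filter sketch (an illustrative implementation, not from the slides; the k independent hash functions are simulated by salting SHA-256 with the probe index):

```python
import hashlib

class BloomFilter:
    """m-bit Bloom filter with k hash probes: false positives possible,
    false negatives impossible."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _probes(self, item):
        # derive k bit positions from salted SHA-256 digests
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._probes(item):
            self.bits[p] = True

    def __contains__(self, item):
        # an item is reported present only if all k of its bits are set
        return all(self.bits[p] for p in self._probes(item))
```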
Learned Bloom Filters
- Given the high latency of access to cold storage, we can afford more complex models that reduce false positives and space requirements.
- What properties should a good hash function for Bloom filters have?
○ Many collisions among keys
○ Many collisions among non-keys (keys that don't exist)
○ Few collisions between keys and non-keys
- Maintain a specific FPR for realistic queries while keeping the FNR at zero.
- Existence indexes have traditionally not used the distribution of keys to their advantage, but learned Bloom filters can!
- Any ideas?
Learned Bloom Filters
Bloom filters as a classification problem:
- Use a neural network with sigmoid activation to produce a binary probabilistic classifier.
- Choose a threshold 𝜐 such that outputs above 𝜐 are assumed to exist in the database.
- Such a model will have a positive FNR along with a positive FPR! Solutions?
Learned Bloom Filters
How do we maintain a specific FPR p*?

FPRO = FPR𝜐 + (1 − FPR𝜐) · FPRB

For simplicity, keep FPR𝜐 = FPRB = p*/2 to ensure FPRO ≤ p*. Such a 𝜐 can be tuned over a held-out data set of non-keys.

The learned model is small compared to the dataset, and the overflow Bloom filter scales with the FNR, giving a lower memory footprint.

Bloom filters with model hashes: learn a hash function such that most keys are mapped to the higher bit positions and non-keys to the lower bit positions, so the same probabilistic model can be used!
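The model-plus-overflow construction can be sketched as follows (illustrative names; a plain Python `set` stands in for the overflow Bloom filter to keep the sketch exact, and a real implementation would use an actual Bloom filter there):

```python
class LearnedBloomFilter:
    """A classifier f handles keys scoring above threshold tau; the keys
    the model misses go into an overflow filter, so the FNR stays zero."""

    def __init__(self, model, tau, keys, overflow_filter):
        self.model = model          # f: item -> probability of being a key
        self.tau = tau
        self.overflow = overflow_filter
        for k in keys:
            if model(k) <= tau:         # a false negative of the model...
                self.overflow.add(k)    # ...is covered by the overflow filter

    def __contains__(self, item):
        if self.model(item) > self.tau:
            return True                 # may be a false positive (FPR_tau)
        return item in self.overflow    # contributes the overflow's FPR_B
```

Every inserted key is either accepted by the model or stored in the overflow filter, so no query on an inserted key can return false; only the false-positive rates of the two components combine, exactly as in the FPRO formula above.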
Results
Example: a normal Bloom filter with an FPR of 1% needs 2.04 MB. A 16-dimensional GRU-type RNN requires 0.0259 MB. Setting 𝜐 = 0.5% makes the FNR 55%, and the spillover Bloom filter requires 1.39 MB (a 36% reduction in size).

Additional work: covariate shifts in the query distribution, using additional features for the ML models, etc.
Conclusions and Future Directions
- Exploring other ML models
- Multi-dimensional indexes using any combination of attributes as key
- Learned Algorithms
- GPU/TPU and other hardware improvements