Indexing CS6320 1/29/2018 Shachi Deshpande, Yunhe Liu
Content
- Motivation for Indexing
- B-tree
○ B-tree basics
○ The cost of B-tree operations
○ B-tree variants
○ B-trees in multi-user environments
- Learned Index
Motivation for Indexing
Activity Question: Why do we need indexing?
- Items are retrieved from secondary storage into memory before being processed.
- Organizing files intelligently makes the retrieval process efficient.
- A large, randomly accessed file in a computer system is associated with an index, which, like the labels on drawers, directs the searcher to the small part of the file containing the desired item.
Content
- Motivation for Indexing
- B-tree
○ B-tree basics
○ The cost of B-tree operations
○ B-tree variants
○ B-trees in multi-user environments
- Learned Index
Operations on a file
- File: a set of records
- Each record: ri = (ki, αi), where ki is the key and αi is the associated information
- Operations
○ Insert: add a new record (ki, αi), checking that ki is unique.
○ Delete: remove the record (ki, αi), given ki.
○ Find: retrieve αi, given ki.
○ Next: retrieve αi+1, given that αi was just retrieved.

(Diagram: the file as a sequence of pairs k0 α0 | k1 α1 | k2 α2 | …)
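As a concrete illustration of these four operations, here is a minimal in-memory sketch (the `RecordFile` class and its method names are illustrative, not from the slides), keeping records sorted by key:

```python
import bisect

class RecordFile:
    """A file of records (k_i, a_i), kept sorted by key."""

    def __init__(self):
        self.keys = []    # sorted keys k_i
        self.values = []  # associated information a_i, aligned with keys

    def insert(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            raise KeyError("duplicate key")  # keys must be unique
        self.keys.insert(i, key)
        self.values.insert(i, value)

    def delete(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i == len(self.keys) or self.keys[i] != key:
            raise KeyError(key)
        self.keys.pop(i)
        self.values.pop(i)

    def find(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i == len(self.keys) or self.keys[i] != key:
            raise KeyError(key)
        return self.values[i]

    def next(self, key):
        # retrieve a_{i+1}, given that the record with key k_i was just read
        i = bisect.bisect_right(self.keys, key)
        return self.values[i] if i < len(self.keys) else None
```

A sorted list keeps the sketch short; the rest of the deck is about structures (B-trees) that support the same operations efficiently on secondary storage.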
B-tree: Generalization of Binary Search Tree
- More than 2 paths leave a given node.
- Compare the query key with the keys stored at the node to decide which path to take.
- Exact match: success. No exact match by the time a leaf is reached: failure.
B-tree of Order d
- Each node contains at most 2d keys and 2d + 1 pointers.
- Each node (except the root) contains at least d keys and d + 1 pointers (at least ½ full).
Balancing
B-Tree:
- Never visits more than 1 + logd(n) nodes.
- Accessing each node is a separate access to secondary storage.
Insertion
1. Find: proceed from the root to locate the proper leaf for insertion.
2. Insert: balance is restored by a procedure that moves from the leaf back toward the root.
Insertion: Split
Of the 2d + 1 keys, the smallest d are placed in one node, the largest d are placed in another node, and the remaining middle key is promoted to the parent node as a separator. The split can propagate up to the root, increasing the height of the tree by 1.
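The split rule above can be sketched as follows (the function name is illustrative; a real B-tree split would also divide the child pointers and associated records):

```python
def split_node(keys, d):
    """Split an overfull order-d B-tree node holding 2d + 1 sorted keys.

    Returns (left_keys, separator, right_keys): the smallest d keys stay
    in one node, the largest d go to a new node, and the middle key is
    promoted to the parent as the separator between the two.
    """
    assert len(keys) == 2 * d + 1
    left = keys[:d]
    separator = keys[d]
    right = keys[d + 1:]
    return left, separator, right
```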
Deletion
Find the proper node. There are two possibilities:
1. The key to be deleted resides in a leaf.
2. The key resides in a nonleaf node.
a. An adjacent key can be found and swapped into the vacated position.
b. For example, use the smallest key in the right subtree (in its leftmost leaf).
Deletion: Underflow
After the removal, check that at least d keys remain in each node. If a node has fewer than d keys, underflow is said to occur and a redistribution of keys becomes necessary.
Deletion: Concatenation
- Redistribution of keys between two neighbors works only if together they hold at least 2d keys.
- When fewer than 2d keys remain, a concatenation must occur.
○ The keys are simply combined into one of the nodes and the other is discarded.
○ Since only one node remains, the key separating the two nodes in the ancestor is no longer necessary and is added to the single remaining node.
○ If the only descendants of the root are concatenated, they form a new root, decreasing the B-tree height by 1.
Content
- Motivation for Indexing
- B-tree
○ B-tree basics
○ The cost of B-tree operations
○ B-tree variants
○ B-trees in multi-user environments
- Learned Index
The cost of operations
- Retrieval costs
- Insertion and Deletion costs
- Sequential Processing
Retrieval costs
- The cost of a find operation grows as the logarithm of the file size.
- With d the order of the B-tree, n the number of keys in the file, and h the height of the tree: h ≤ logd((n + 1) / 2).
Insertion and Deletion costs - Tree Height
- Insertion and deletion may require additional secondary-storage accesses beyond the cost of a find operation as they progress back up the tree.
- Overall, the costs are at most doubled, so the height of the tree still dominates the cost.
- In a B-tree of order d for a file of n records, insertion and deletion take time
proportional to logd(n) in the worst case.
Insertion and Deletion costs - Tree Order
- As the branch factor d increases, the logarithmic base increases, so the cost of find, insert, and delete operations decreases.
- There are practical limits on the size of a node.
○ Most hardware systems bound the amount of data that can be transferred with one access to secondary storage.
○ The cost estimate now hides a constant factor that grows with the size of the data transferred.
Sequential Processing
- Use the next operation to process all records in key-sequence order.
- A B-tree may not do well in sequential processing:
○ A preorder tree walk requires space for at least h = logd(n+1) nodes in main memory, since it stacks the nodes along a path from the root to avoid reading them twice.
○ Processing a next operation may require tracing a path through several nodes before reaching the desired key.
- The B+-tree improves sequential-processing performance.
Content
- Motivation for Indexing
- B-tree
○ B-tree basics
○ The cost of B-tree operations
○ B-tree variants
○ B-trees in multi-user environments
- Learned Index
B-Tree variants
- Different variations
○ Splitting vs. redistributing keys to a neighbor
○ Processing a node once it has been retrieved from secondary storage using different search methods (e.g., linear search, binary search)
○ Varying the "order" at each depth
- B*-trees
- B+-trees
B*-Trees
- Each node is at least ⅔ full (instead of just ½ full).
- Splitting is delayed until 2 sibling nodes are full; they are then divided into 3 nodes, each ⅔ full.
- Increases storage utilization.
- Speeds up search, as the height of the tree is reduced.
B+-Trees structure
- All keys reside in the leaves.
- Nonleaf levels are organized as a B-tree and contain only index entries directing the search to the correct leaf.
- Leaf nodes are usually linked together left-to-right.
B+-Tree Operations
- Insertion:
○ Almost identical to B-tree insertion.
○ During a split, instead of promoting the middle key, promote a copy of it (the key itself stays in a leaf).
- Deletion:
○ The key to be deleted always resides in a leaf node, which makes deletion simple.
○ As long as the leaf remains at least half full, the upper index levels need not change.
- Find:
○ Search does not stop on an exact match in an index node; instead, the right pointer is followed.
○ A find always proceeds all the way to a leaf.
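The left-to-right leaf links are what make B+-tree sequential processing cheap: a range scan follows sibling pointers instead of re-walking the tree for each next operation. A sketch, with illustrative names:

```python
class Leaf:
    """A B+-tree leaf node: sorted keys, their records, and a sibling link."""

    def __init__(self, keys, values):
        self.keys = keys
        self.values = values
        self.next = None  # pointer to the right sibling leaf

def scan(leaf):
    """Yield all (key, value) pairs in key order by chasing sibling links."""
    while leaf is not None:
        yield from zip(leaf.keys, leaf.values)
        leaf = leaf.next
```

After one initial descent to the leftmost relevant leaf, every subsequent next operation costs at most one leaf access, with no stack of ancestor nodes in memory.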
Content
- Motivation for Indexing
- B-tree
○ B-tree basics
○ The cost of B-tree operations
○ B-tree variants
○ B-trees in multi-user environments
- Learned Index
B-tree in Multiuser Environment
- The index should permit several user requests to be processed simultaneously.
- One process may read a node and follow one of its links while another process is changing that node.
- Find operations go top-down, while insertion and deletion require bottom-up access.
B-tree in Multiuser Environment: Locking
- Find operation
○ Locks a node once it has been read.
○ Releases the lock when the search proceeds to the next level.
○ A reader holds locks on at most two nodes at any time.
- Update operation
○ Places a reservation on access.
○ The reservation is converted to an absolute lock if the update's changes will propagate to the reserved node; otherwise, the reservation is cancelled.
○ A reserved node may be read but may not be reserved a second time.
B-tree in Multiuser Environment: Security
- Protection of information in a multiuser environment.
- Memory protection mechanism of paging.
- Encryption techniques can be used to protect contents of a file outside of the
underlying system.
Summary of B-tree
- Efficient, simple, and easily maintained.
- Logarithmic-cost find, insert, and delete operations.
- Guarantees at least 50% storage utilization.
- The B+-tree allows efficient sequential processing.
- There are many variants of the B-tree.
- Can be used in multiuser environments.
Content
- Motivation for Indexing
- B-tree
○ B-tree basics
○ The cost of B-tree operations
○ B-tree variants
○ B-trees in multi-user environments
- Learned Index
Indexes as Models
- B-Tree Index : Maps key to position of record in sorted array
- Hash Index : Maps key to position of record in unsorted array
- BitMap Index : Checks if a data-record exists
Can we replace these traditional models with other kinds of models?
Activity
If we have fixed-length records with continuous integer keys from 1 to 1 million, can we find a better way to access the record corresponding to any given key? What if the length of each record was one unit greater than that of its immediate predecessor?
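One possible answer to the activity, sketched in code (function names are illustrative): with dense integer keys, the record's byte offset is a closed-form function of the key, so a formula can replace the index entirely. The second case, where record i is one unit longer than record i − 1, is an arithmetic series and still has a closed form:

```python
def offset_fixed(key, record_len):
    """Offset of record `key` (keys 1..N) when all records have the same length."""
    return (key - 1) * record_len

def offset_growing(key, first_len):
    """Offset when record i is one unit longer than record i - 1.

    The preceding records have lengths first_len, first_len + 1, ...,
    so the offset is the sum of an arithmetic series.
    """
    n = key - 1  # number of records preceding this key
    return n * first_len + n * (n - 1) // 2
```

Both functions compute a position in O(1) with no stored index, which is exactly the observation motivating learned indexes: knowing the key distribution lets a model (here, an exact formula) replace the structure.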
Knowing the data distribution helps!
- ML models, especially neural nets, can learn a variety of data distributions, mixtures, and other patterns.
- Balancing model complexity against accuracy is important.
What should the model learn?
A model that predicts the position of a key within a sorted array effectively approximates the cumulative distribution function (CDF) of the keys:

p = F(Key) * N

where p is the position estimate, N is the total number of records, and F is the estimated CDF.
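To make the CDF view concrete, here is a minimal learned-index sketch (all names are illustrative, and a plain least-squares line stands in for the richer models in the paper): fit a model mapping keys to positions, record its worst under- and over-prediction on the data, and then search only within those error bounds.

```python
def fit(keys):
    """Fit pos ~ a*key + b over a sorted key array by least squares.

    Returns (a, b, min_err, max_err), where the error bounds are the
    worst under- and over-prediction observed on the training keys.
    """
    n = len(keys)
    xs, ys = keys, range(n)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    errs = [y - (a * x + b) for x, y in zip(xs, ys)]
    return a, b, min(errs), max(errs)

def lookup(keys, model, key):
    """Predict a position, then search only the guaranteed error window."""
    a, b, lo_err, hi_err = model
    pred = a * key + b
    lo = max(0, int(pred + lo_err))
    hi = min(len(keys) - 1, int(pred + hi_err) + 1)
    for i in range(lo, hi + 1):
        if keys[i] == key:
            return i
    return None
```

Because min_err and max_err are measured over all training keys, every stored key is guaranteed to lie inside the predicted window, so the final search is over a small range rather than the whole array.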
Naive Learned Index
- Performance is worse than that of traditional B-trees!
- It might be more CPU- and space-efficient to narrow the position of an item from the entire dataset down to a region of thousands of records.
- It is significantly more difficult to run the "last mile" down to the exact record.
- The cache- and memory-efficiency of B-trees is difficult to replicate in such a model.
Learning Index Framework
- Learns simple models on the fly and relies on TensorFlow for complex models.
- Generates efficient index structures in C++ for inference.
- Runs simple models on the order of 30 ns.
Recursive Model Index
- Learns a hierarchy of models instead of a single unified model for indexing.
- Each stage takes the key as input and selects a better model in the next layer of the hierarchy.
- The final stage predicts the position.
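A two-stage sketch of the recursive model index, using simple linear models at both stages (names are illustrative; the paper allows different model types per stage, including small neural nets):

```python
def fit_linear(pairs):
    """Least-squares line y ~ a*x + b over (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    var = sum((x - mx) ** 2 for x, _ in pairs) or 1.0
    a = sum((x - mx) * (y - my) for x, y in pairs) / var
    return a, my - a * mx

def build_rmi(keys, fanout):
    """Stage 1: one root model. Stage 2: `fanout` models, each trained
    only on the keys the root model routes to it."""
    n = len(keys)
    root = fit_linear([(k, i) for i, k in enumerate(keys)])
    buckets = [[] for _ in range(fanout)]
    for i, k in enumerate(keys):
        j = min(fanout - 1, max(0, int((root[0] * k + root[1]) * fanout / n)))
        buckets[j].append((k, i))
    stage2 = [fit_linear(b) if b else root for b in buckets]
    return root, stage2, n

def predict(rmi, key):
    """Route through the root model, then predict with the chosen model."""
    root, stage2, n = rmi
    j = min(len(stage2) - 1,
            max(0, int((root[0] * key + root[1]) * len(stage2) / n)))
    a, b = stage2[j]
    return int(round(a * key + b))
```

Each second-stage model only has to fit a small slice of the CDF, so simple, cheap models can reach high precision even when a single model over the whole dataset could not.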
Hybrid Indexes
- Different layers have different types of learning models
- Ideas?
○ A small ReLU neural net at the top can learn a wide range of complex data distributions.
○ The bottom layers can use simple linear-regression models, which are inexpensive in space and execution time.
○ Traditional B-trees (i.e., decision trees) can be used at the bottom if the data is particularly difficult to learn.
- This bounds the worst-case performance of learned indexes to that of B-trees!
Searching record with learned index
- Find the first key higher/lower than the look-up key, based on the prediction.
- Model-biased search
○ The middle point of a binary search is set to the value predicted by the model.
- Biased quaternary search
○ Three middle points: pos − 𝞽, pos, pos + 𝞽
- The min- and max-errors of the model define the search area.
- Can this work for non-existent keys?
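Model-biased binary search can be sketched as an ordinary binary search whose first probe is placed at the model's prediction (illustrative names; the quaternary variant instead probes at pos − 𝞽, pos, and pos + 𝞽):

```python
def biased_search(keys, key, predicted_pos):
    """Binary search over sorted `keys`, first probing at the prediction.

    A good prediction finds the key in one or two probes; a bad one
    degrades gracefully to ordinary binary search.
    """
    lo, hi = 0, len(keys) - 1
    mid = min(max(predicted_pos, lo), hi)  # first probe at the prediction
    while lo <= hi:
        if keys[mid] == key:
            return mid
        if keys[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
        mid = (lo + hi) // 2               # fall back to ordinary halving
    return None
```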
Indexing Strings / Training
Results
- Maps: the longitude of locations is relatively linear and has fewer irregularities.
- Weblogs: the worst-case scenario, with complex time patterns.
- Log-Normal (synthetic): highly non-linear; the CDF is difficult to learn using NNs.
Results
Comparison with other models
- Lookup tables with fixed-size records allow the use of AVX instructions.
- FAST: a highly SIMD-optimized data structure, but with higher memory requirements.
- Fixed-height B-tree with interpolation search (a variation of binary search for uniformly distributed data): the height of the B-tree is fixed to reduce memory consumption.
- Multivariate learned index: multivariate linear regression at the top layer of the hierarchy, with variables like key, log(key), and key².
Results
String Datasets
- The speed-up for the learned index is less prominent, due to the high cost of model execution and search over strings.
- Higher precision in hybrid indexes helps, since string search is more expensive.
- Different search strategies make a difference (biased binary search vs. biased quaternary search).
- A non-hybrid RMI with quaternary search performed best.
Point Index
- Hash maps have been used for point look-ups.
- Efficient implementations aim to reduce conflicts.
- Previous learned hash functions did not consider the underlying data distribution, so the size of the data structure grew with the data size.
Hash-Model Index
- Learns the CDF of the key distribution.
- We do not aim to store keys compactly or in strictly sorted order.
- Inserts, look-ups, and conflict handling depend on the hash-map architecture.
- The benefits of a learned hash function depend on the accuracy of the model in representing the CDF, the hash-map architecture, etc.
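A sketch of the idea, using the empirical CDF of a key sample as the "model" (a learned model would replace the `bisect` rank computation; names are illustrative): scaling the CDF by the table size m spreads the keys evenly regardless of their distribution.

```python
import bisect

def make_cdf_hash(sample_keys, m):
    """Build h(key) = floor(F(key) * m) from a key sample, where F is
    the empirical CDF of the sample."""
    sample = sorted(sample_keys)
    n = len(sample)

    def h(key):
        rank = bisect.bisect_left(sample, key)  # rank/n approximates F(key)
        return min(m - 1, rank * m // n)

    return h
```

For skewed data (e.g., the quadratically spaced keys below), a distribution-aware hash like this can place every key in its own slot, where a distribution-oblivious modulo hash would cluster them.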
Results
- Learned models reduced conflicts by up to 77% over these datasets, learning the empirical CDF at reasonable cost.
- For distributed settings with RDMA look-ups, the benefits of learned models can be high.
- Depending on the hash-map architecture, the complexity of learned models may or may not pay off.
Existence Index
Bloom Filters
- A space-efficient probabilistic data structure to test whether an element is a member of a set.
- Guarantees no false negatives, but false positives are possible.
- In spite of being space-efficient, it can still occupy a significant amount of memory.
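For reference, a minimal classic Bloom filter sketch (an illustrative implementation, not from the slides; the k independent hash functions are simulated by salting SHA-256 with the probe index):

```python
import hashlib

class BloomFilter:
    """m-bit Bloom filter with k hash probes: false positives possible,
    false negatives impossible."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _probes(self, item):
        # derive k bit positions from salted SHA-256 digests
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._probes(item):
            self.bits[p] = True

    def __contains__(self, item):
        # an item is reported present only if all k of its bits are set
        return all(self.bits[p] for p in self._probes(item))
```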
Learned Bloom Filters
- Given the high latency of access to cold storage, we can afford more complex models that reduce false positives and space requirements.
- What properties should a good hash function for Bloom filters have?
○ Many collisions among keys
○ Many collisions among non-keys (keys that don't exist)
○ Few collisions between keys and non-keys
- Maintain a specific FPR for realistic queries while keeping the FNR at zero.
- Existence indexes have traditionally not used the distribution of keys to their advantage, but learned Bloom filters can!
- Any ideas?
Learned Bloom Filters
Bloom filters as a classification problem:
- Use a neural network with sigmoid activation to produce a binary probabilistic classifier.
- Choose a threshold 𝜐 such that outputs above 𝜐 are assumed to exist in the database.
- Such a model will have a positive FNR along with a positive FPR! Solutions?
Learned Bloom Filters
How do we maintain a specific FPR p*?

FPRO = FPR𝜐 + (1 − FPR𝜐) · FPRB

For simplicity, keep FPR𝜐 = FPRB = p*/2 to ensure FPRO ≤ p*. Such a 𝜐 can be tuned over a held-out data set of non-keys.

The learned model is small compared to the dataset, and the overflow Bloom filter scales with the FNR, giving a lower memory footprint.

Bloom filters with model hashes: learn a hash function such that most keys are mapped to the higher bit positions and non-keys to the lower bit positions, so the same probabilistic model can be used!
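The model-plus-overflow construction can be sketched as follows (illustrative names; a plain Python `set` stands in for the overflow Bloom filter to keep the sketch exact, and a real implementation would use an actual Bloom filter there):

```python
class LearnedBloomFilter:
    """A classifier f handles keys scoring above threshold tau; the keys
    the model misses go into an overflow filter, so the FNR stays zero."""

    def __init__(self, model, tau, keys, overflow_filter):
        self.model = model          # f: item -> probability of being a key
        self.tau = tau
        self.overflow = overflow_filter
        for k in keys:
            if model(k) <= tau:         # a false negative of the model...
                self.overflow.add(k)    # ...is covered by the overflow filter

    def __contains__(self, item):
        if self.model(item) > self.tau:
            return True                 # may be a false positive (FPR_tau)
        return item in self.overflow    # contributes the overflow's FPR_B
```

Every inserted key is either accepted by the model or stored in the overflow filter, so no query on an inserted key can return false; only the false-positive rates of the two components combine, exactly as in the FPRO formula above.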
Results
Example: a normal Bloom filter with an FPR of 1% needs 2.04 MB. A 16-dimensional GRU-type RNN requires 0.0259 MB. Setting 𝜐 = 0.5% makes the FNR 55%, and the spillover Bloom filter requires 1.39 MB (a 36% reduction in size).

Additional work: covariate shifts in the query distribution, using additional features for the ML models, etc.
Conclusions and Future Directions
- Exploring other ML models
- Multi-dimensional indexes using any combination of attributes as key
- Learned Algorithms
- GPU/TPU and other hardware improvements