SLIDE 2 4 The Search Engine Data Structure
If we did not care about efficiency, we could implement a search for a few keywords using a simple script which calls grep and find. But this would be far too slow if we want to answer searches quickly. The core of our prototype search engine will be a dictionary called an inverted index or inverted file. For each index term that appears in the collection of our stored web pages, an inverted file lists each document where it appears. In other words, it stores pairs (w, Lw), where w is an index term and Lw is the list of pages containing the index term w (its occurrence list). This data structure is especially good at boolean queries. What kinds of queries do we want our data structure to answer? If we look for “What are Suffix Trees”,
[Figure: a set of sample web pages, each shown with a title (e.g. "Project", "Trees", "Suffix Trees") and placeholder body text.]
we should output all the pages that contain all of the index terms “What”, “are”, “Suffix” and “Trees”. Note that this is a boolean query. Our data structure should retrieve (“What”, LWhat), (“are”, Lare), (“Suffix”, LSuffix), (“Trees”, LTrees) and then compute and output LWhat ∩ Lare ∩ LSuffix ∩ LTrees. One thing this operation suggests is that we should keep all the occurrence lists in sorted order so that we can do boolean operations fast.

How do we implement this data structure? We will implement it as a trie for the set of index terms. A trie is a tree-based data structure for storing strings in order to support fast pattern matching. The trie will have pointers to the corresponding occurrence lists, so that as soon as there is a match for an index term w, we can get hold of its list Lw. After looking up all the search keywords in the query, we have all the occurrence lists. We then just need to do an intersection (AND) or union (OR) computation and output the resulting list.

Hence, the main job left in the design of our prototype search engine is to do fast pattern matching using a data structure. What do we want from this data structure? Our goal is to process the text so that the occurrence of any search keyword can be found quickly in our list of words or terms. We will preprocess our set of words to facilitate fast queries of search keywords.
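The design described above can be sketched as follows. This is only a minimal illustration, not the course's reference implementation: the names `TrieNode`, `InvertedIndex`, `add_page`, `lookup`, `search_all` and `intersect_sorted` are hypothetical, terms are obtained by lowercasing and splitting on whitespace (no punctuation handling), and pages are assumed to be added in increasing page-id order so that every occurrence list stays sorted.

```python
from typing import Dict, List, Optional


class TrieNode:
    """One trie node; `occurrences` is set only where an index term ends."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.occurrences: Optional[List[int]] = None


def intersect_sorted(a: List[int], b: List[int]) -> List[int]:
    """Intersect two sorted lists with a linear two-pointer merge."""
    i = j = 0
    out: List[int] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


class InvertedIndex:
    """Trie over index terms; each term points to its occurrence list Lw."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add_page(self, page_id: int, text: str) -> None:
        # Assumption: page ids arrive in increasing order, so appending
        # keeps each occurrence list sorted. set() avoids duplicate ids.
        for term in set(text.lower().split()):
            node = self.root
            for ch in term:
                node = node.children.setdefault(ch, TrieNode())
            if node.occurrences is None:
                node.occurrences = []
            node.occurrences.append(page_id)

    def lookup(self, term: str) -> List[int]:
        """Walk the trie one character at a time and return Lw (or [])."""
        node = self.root
        for ch in term.lower():
            nxt = node.children.get(ch)
            if nxt is None:
                return []
            node = nxt
        return node.occurrences or []

    def search_all(self, query: str) -> List[int]:
        """Boolean AND query: intersect the occurrence lists of all terms."""
        lists = [self.lookup(t) for t in query.split()]
        if not lists:
            return []
        lists.sort(key=len)  # start from the shortest list
        result = lists[0]
        for other in lists[1:]:
            result = intersect_sorted(result, other)
        return result


index = InvertedIndex()
index.add_page(1, "trees play an important role")
index.add_page(2, "what are suffix trees")
index.add_page(3, "prefixes and suffixes")
print(index.search_all("What are Suffix Trees"))  # [2]
print(index.lookup("trees"))                      # [1, 2]
```

Because the lists are sorted, each intersection takes time linear in the lengths of the two lists. Starting from the shortest list is a common heuristic that keeps the intermediate results small.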
How long does searching [a 2-3-4 tree, a treap, a balanced BST, a sorted array] take? The answer we have come to know is O(log s), where s is the number of elements in the structure. This, however, is not strictly true. If the elements we are dealing with are ints, it is close enough, but what if they are Strings? In order to find out whether one string is >, <, or = to another, we may, in the worst case, need to go through every character and compare them one by one. So the real answer is that it takes O(M log s), where M is the number of bytes in 2