
CSE 373: Analysis of Algorithms
Topic: Reinventing search engines using Tries
Nov 03, 2003
Lecturer: Piyush Kumar
Scribed by: Piyush Kumar

Please help us improve this draft. If you read these notes and find any errors or have an idea to improve them, please send feedback.

Search others for their virtues, thyself for thy vices.
Benjamin Franklin (1706 - 1790)

1 Search Engines

The World Wide Web today contains billions of pages for you to explore. Google, AltaVista, Infoseek, Yahoo and thousands of other search engines exist today to help you find what you are looking for on the web. Below is a list of approximately how many pages you can search in some of the popular search engines (Dec 2002).

Search Engine    Approximate number of pages (in billions)
Google           3.1
AlltheWeb        2.1
AltaVista        1.7
WiseNut          1.5

Have you ever wondered how they work? In this set of lectures we peek into a data structure that would help us design a prototype of a search engine. Designing and implementing a small search engine is both easy and fun.

2 The Prototype

There will be two major parts of our prototype. A crawler is a program that gathers the web pages that our search engine will search. Crawlers are also known as robots, bots or spiders. Real crawlers have to deal with many issues that we will not consider in our prototype design. Our prototype design will be simple enough to implement in a hundred lines of Perl code. Our prototype will also not implement page ranking (although we encourage you to think about how we could incorporate page ranking in our prototype).

Our prototype crawler will use a simple breadth-first search of the web graph (possibly with a bound on the depth reached). Many scientists claim that breadth-first search crawling tends to find high-quality pages early in the crawl (Why?).

Once the crawler has gathered the pages that we want our search engine to search, how do we implement the search data structure?
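The depth-bounded breadth-first crawl described above can be sketched as follows. This is only an illustration under stated assumptions: the "web" is a small in-memory dictionary (the hypothetical TOY_WEB) mapping each page to the pages it links to, so no real fetching or HTML parsing takes place, and the names bfs_crawl and max_depth are ours rather than part of the course project.

```python
from collections import deque


def bfs_crawl(links, seed, max_depth):
    """Return the pages reachable from seed within max_depth hops, in BFS order.

    links is an assumed stand-in for the web: a dict mapping a page
    to the list of pages it links to.
    """
    visited = {seed}
    order = []
    queue = deque([(seed, 0)])
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:
            continue  # depth bound reached: do not follow this page's links
        for nxt in links.get(page, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, depth + 1))
    return order


# Hypothetical toy web graph, used only for illustration.
TOY_WEB = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["d.html"],
    "c.html": ["d.html", "a.html"],
    "d.html": ["e.html"],
}

crawled = bfs_crawl(TOY_WEB, "a.html", 2)
```

With a depth bound of 2, the crawl starting at a.html visits a.html, b.html, c.html and d.html, and never reaches e.html, which is three links away from the seed.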
3 Occurrence Lists

Before we go to the design of a search data structure, we will create a set of elements for the data structure called the occurrence lists. An occurrence list is a list of web pages that contain a particular word w. If we assign a number to each of our web pages, then the occurrence lists can be created by just parsing the web pages and storing pairs of (w, UrlID) as we go through each word in a web page. Once this big list is stored on disk, we can use our external memory sort engine (the one we designed in the programming project) to sort this data, so that for each distinct word we can collect the pages that contain it. We now have a way to create, for each distinct word or term occurring in our web pages, its occurrence list. We will call these distinct words or terms index terms.
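The construction above (emit (w, UrlID) pairs, sort them, then group by word) can be sketched in a few lines. This is a minimal in-memory sketch, not the external-memory version: Python's built-in sort stands in for the sort engine from the programming project, words are split on whitespace and lower-cased (a simplifying assumption), and build_occurrence_lists and TOY_PAGES are hypothetical names.

```python
from itertools import groupby


def build_occurrence_lists(pages):
    """Map each index term to the sorted list of page ids that contain it.

    pages is an assumed input format: a dict from UrlID (an int) to page text.
    """
    pairs = []
    for url_id, text in pages.items():
        for word in text.lower().split():
            pairs.append((word, url_id))

    # In-memory sort; the notes use an external-memory sort for large data.
    pairs.sort()

    lists = {}
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        ids = []
        for _, url_id in group:
            # Pairs are sorted, so repeats of the same page are adjacent.
            if not ids or ids[-1] != url_id:
                ids.append(url_id)
        lists[word] = ids
    return lists


# Hypothetical toy collection of numbered pages.
TOY_PAGES = {
    1: "What are suffix trees",
    2: "Trees are fun",
    3: "Suffix arrays",
    4: "Trees and more trees",
}

occurrence_lists = build_occurrence_lists(TOY_PAGES)
```

Because the pairs are sorted before grouping, each occurrence list comes out already sorted by page id, which is the property the boolean queries in the next section rely on.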

4 The Search Engine Data Structure

If we did not care about efficiency, we could implement the search of a few keywords using a simple script which calls grep and find. But this would be horribly inefficient if we want to answer searches fast.

The core of our prototype search engine will be a dictionary called an inverted index or inverted file. For each index term that appears in the collection of our stored web pages, an inverted file lists each document where it appears. In other words, it stores pairs of (w, L_w) where w is an index term and L_w is the list of pages containing the index term w (its occurrence list). This data structure is especially good at boolean queries.

What kinds of queries do we want our data structure to answer? If we look for “What are Suffix Trees”, we should output all the pages that have all the index terms “What”, “are”, “Suffix” and “Trees”.

[Figure: sample web pages, some of which contain the query terms]

Note that this is a boolean query. Our data structure should retrieve (“What”, L_What), (“are”, L_are), (“Suffix”, L_Suffix), (“Trees”, L_Trees) and then compute and output

    L_What ∩ L_are ∩ L_Suffix ∩ L_Trees

One thing this operation suggests is that we should keep all the occurrence lists in sorted order so that we can do boolean operations fast.

How do we implement this data structure? We will implement it as a trie for the set of index terms. A trie is a tree-based data structure for storing strings in order to support fast pattern matching.
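To make the AND query concrete, here is a small sketch of a trie over index terms combined with a merge-style intersection of sorted occurrence lists. It rests on several assumptions of ours: each trie node keeps its children in a dictionary, the occurrence list of a term is stored at the node where the term ends, and the page ids in the example are made up. The names TrieNode, Trie, intersect_sorted and and_query are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # character -> TrieNode
        self.occurrences = None   # sorted page ids if an index term ends here


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term, occurrences):
        """Store term, one character per edge, with its occurrence list."""
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.occurrences = occurrences

    def lookup(self, term):
        """Return the occurrence list of term, or [] if it is not indexed."""
        node = self.root
        for ch in term:
            node = node.children.get(ch)
            if node is None:
                return []
        return node.occurrences or []


def intersect_sorted(a, b):
    """Intersect two sorted lists in one pass, advancing the smaller head."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def and_query(trie, query):
    """Return the pages containing every word of the query (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = trie.lookup(terms[0])
    for term in terms[1:]:
        result = intersect_sorted(result, trie.lookup(term))
    return result


# Hypothetical occurrence lists (page ids are invented for the example).
index = Trie()
index.insert("what", [1, 4])
index.insert("are", [1, 2, 4])
index.insert("suffix", [1, 3, 4])
index.insert("trees", [1, 2, 4, 5])

hits = and_query(index, "What are Suffix Trees")
```

In this toy index only pages 1 and 4 appear in all four lists, so they are the answer to the query. Each lookup walks one trie edge per character of the term, and each intersection reads both sorted lists once from left to right.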
The trie will have pointers to the corresponding occurrence lists, so that as soon as there is a match for an index term w, we can get hold of its list L_w. After looking up all the search keywords in the query, we have all their occurrence lists; we just need to do an intersection (AND) or union (OR) computation and output the intersection or union of these lists. Hence, the main job left in the design of our prototype search engine is to do fast pattern matching using a data structure.

What do we want from this data structure? Our goal is to process the text so that the occurrence of any search keyword can be found quickly in our list of words or terms. We will preprocess our set of words to facilitate fast queries of search keywords.

How long does searching a 2-3-4 tree, a treap, a balanced BST, or a sorted array take? The answer we have come to know is O(log s), where s is the number of elements stored. This, however, is not strictly true. If the elements we are dealing with are ints, it is close enough, but what if they are strings? In order to find out whether one string is >, <, or = to another, we need to go through the characters and compare them one by one. So the real answer is that it takes O(M log s), where M is the number of bytes in
