Database Management Systems, CMPUT 391: Information Retrieval and the Web

Database Management Systems
Winter 2003
CMPUT 391: Information Retrieval and the Web
Dr. Osmar R. Zaïane
University of Alberta
Chapter 27 of Textbook

© Dr. Osmar R. Zaïane, 2001-2003. Database Management Systems, University of Alberta

Course Content
• Introduction
• Database Design Theory
• Query Processing and Optimisation
• Concurrency Control
• Data Base Recovery and Security
• Object-Oriented Databases
• Inverted Index for IR
• XML
• Data Warehousing
• Data Mining
• Parallel and Distributed Databases
• Other Advanced Database Topics

Objectives of Lecture 7: Inverted Indexes and Information Retrieval
• Get a general idea about the technologies behind search engines
• Get acquainted with inverted indexes
• Discuss ranking issues

Inverted Indexes and IR
• Inverted Indexes and Information Retrieval
• Signature Files
• Anatomy of a Search Engine
• Web Crawler
• Ranking Results
• Authorities, Hubs and PageRank

Everyday Activity
• We use search engines whenever we look for resources on the Internet
• How do these search engines work?
• How come they give different results while the results come from the same Web?
• The results are often very disappointing. Why aren’t we satisfied?

Information Retrieval
• Find resources (documents) that contain a certain list of keywords
  Find the pages where the phrase “alpha beta” occurs.
• Searching sequentially is too expensive.
• You would need an index to directly find the pages.

Creating an Index
For each document D_i, index its words (D_i: w_a, w_b, w_c …).

Documents
D_1: w_a, w_b, w_c …
D_2: w_a, w_d, w_e …
D_3: w_a, w_b, w_d …
…
D_n: w_x, w_y, w_z …

Inverted Index
w_a: D_1, D_2, D_3 …
w_b: D_1, D_3 …
w_c: D_1, …
w_d: D_2, D_3, …
…

Querying
• Which document contains w_a and w_b?
  (D_1, D_2, D_3 …) ∩ (D_1, D_3 …) = D_1, D_3 …
• Which document contains w_a or w_b?
  (D_1, D_2, D_3 …) ∪ (D_1, D_3 …) = D_1, D_2, D_3 …
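The index creation and querying steps above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: documents are already split into words, posting lists are held in memory as sets, and the function names (`build_inverted_index`, `query_and`, `query_or`) are made up for this sketch rather than taken from the lecture.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, words in documents.items():
        for word in words:
            index[word].add(doc_id)
    return index


def query_and(index, *words):
    """Documents containing every word: intersect the posting sets."""
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings) if postings else set()


def query_or(index, *words):
    """Documents containing at least one word: union the posting sets."""
    return set().union(*(index.get(word, set()) for word in words))


# The three sample documents from the slide (w_a written as "wa", and so on).
documents = {
    "D1": ["wa", "wb", "wc"],
    "D2": ["wa", "wd", "we"],
    "D3": ["wa", "wb", "wd"],
}
index = build_inverted_index(documents)

print(sorted(query_and(index, "wa", "wb")))  # ['D1', 'D3']
print(sorted(query_or(index, "wa", "wb")))   # ['D1', 'D2', 'D3']
```

Sets keep the sketch short. A real index would more likely store sorted posting lists on disk and merge them, but the intersection and union logic is the same.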

Indexing for Text Search
• Text database: Collection of text documents
• Important class of queries: Keyword searches
  – Boolean queries: Query terms connected with AND, OR and NOT. Result is the list of documents that satisfy the boolean expression.
  – Ranked queries: Result is a list of documents ranked by their “relevance”.
  – IR: Precision (percentage of retrieved documents that are relevant) and recall (percentage of relevant objects that are retrieved)
• Inverted indexes are not the only approach in IR. Signature files are also used for document retrieval.

Signature Files
• Index structure (the signature file) with one data entry for each document
• Hash function hashes words to bit-vector.
• Data entry for a document (the signature of the document) is the OR of all hashed words.
• Signature S1 matches signature S2 if S2 & S1 = S2

Signature Files: Query Evaluation
• Boolean query consisting of conjunction of words:
  – Generate query signature Sq
  – Scan signatures of all documents.
  – If signature S matches Sq, then retrieve document and check for false positives.
• Boolean query consisting of disjunction of k words:
  – Generate k query signatures S1, …, Sk
  – Scan signature file to find documents whose signature matches any of S1, …, Sk
  – Check for false positives
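The signature file scheme can be tried out with a small Python sketch. It uses the three-bit word hashes and the two documents from the lecture's signature file example (Agent 010, James 100, Mobile 001). The fixed lookup table `WORD_HASH` stands in for a real hash function, and all function names are assumptions made for this illustration.

```python
# Word hashes taken from the lecture example; a real system would compute them.
WORD_HASH = {"agent": 0b010, "james": 0b100, "mobile": 0b001}

documents = {1: "Agent James", 2: "Mobile agent"}


def signature(words):
    """OR together the hashed bit-vectors of all words."""
    sig = 0
    for word in words:
        sig |= WORD_HASH[word.lower()]
    return sig


def matches(doc_sig, query_sig):
    """S matches Sq if every bit set in Sq is also set in S."""
    return (doc_sig & query_sig) == query_sig


def contains(rid, word):
    """False-positive check: look at the actual document text."""
    return word.lower() in documents[rid].lower().split()


# The signature file: one data entry (signature) per document.
signature_file = {rid: signature(text.split()) for rid, text in documents.items()}


def conjunctive_query(words):
    """One query signature Sq; scan all entries, then verify each candidate."""
    sq = signature(words)
    return [rid for rid, s in signature_file.items()
            if matches(s, sq) and all(contains(rid, w) for w in words)]


def disjunctive_query(words):
    """One signature per word; a document qualifies if any of them matches."""
    return [rid for rid, s in signature_file.items()
            if any(matches(s, signature([w])) and contains(rid, w) for w in words)]


print({rid: format(s, "03b") for rid, s in signature_file.items()})
print(conjunctive_query(["agent", "james"]))
print(disjunctive_query(["james", "mobile"]))
```

With only three words and three distinct bits, no false positives can occur here. The `contains` check matters once many words share bit positions, which is the normal case with a real hash function.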

Signature Files: Example

Word     Hash
Agent    010
James    100
Mobile   001

RID   Document       Signature
1     Agent James    110
2     Mobile agent   011

Search Engine Blocks
[Figure: block diagram with User, Interface (Query/Results), Ranking, and Inverted Index (built off-line).]

Search Engine Components
• A search engine has an interface to enter queries
• A search engine has access to an inverted index already built
• A search engine ranks the results found in the index

Search Engine General Architecture
[Figure: architecture diagram with Crawler, Page, Parser and indexer, Index, Search Engine, and the lists LV, LTV and LNV, with numbered steps 1 to 6.]

Search Engines are not Enough
• Most of the knowledge in the World-Wide Web is buried inside documents.
• Search engines (and crawlers) barely scratch the surface of this knowledge by extracting keywords from web pages.
• There is text mining, text summarization, natural language statistical analysis, etc., but these are outside the scope of this course.

Relevancy Ranking
• Some search engines claim to have indexed about one billion documents
• Each search can yield a very large list of “supposedly relevant” documents
• Sifting through thousands of results is tedious and not necessary
• It is extremely important to rank the results since most users will look mainly at the first 10 to 20 documents.

How do we Rank?
• Each search engine uses a different ranking function (similarity measure). Usually these ranking functions are not disclosed.
• Parameters used in ranking:
  - Frequency of words
  - Location of words
  - Metadata
  - Domain
  - And $$$$
  - Existence in directory
  - Inward and outward links
  - Entirety of query
  - Size of document
  - Age of document

Ontology for Search Results
• There are still too many results in typical search engine responses.
• Reorganize results using a semantic hierarchy (Zaïane et al. 2001).
[Figure: Search result, WordNet, Semantic network.]
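To make the “frequency of words” parameter from the ranking slide concrete, here is a minimal Python sketch that orders documents by how often the query terms occur in them. It is not the ranking function of any actual search engine (those are, as noted, usually undisclosed). The scoring rule, the function name `rank_by_term_frequency` and the sample documents are all assumptions made for illustration.

```python
from collections import Counter


def rank_by_term_frequency(documents, query_terms):
    """Return ids of matching documents, highest total term count first."""
    scores = {}
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        score = sum(counts[term.lower()] for term in query_terms)
        if score > 0:  # keep only documents containing at least one term
            scores[doc_id] = score
    # Sort by descending score; break ties by document id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical sample documents, not from the lecture.
documents = {
    "D1": "alpha beta alpha",
    "D2": "beta gamma",
    "D3": "gamma delta",
}

print(rank_by_term_frequency(documents, ["alpha", "beta"]))  # ['D1', 'D2']
```

A fuller ranking function would weigh in the other listed parameters, such as word location, links and document age. This sketch covers a single signal only.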
