Text Indexing: Arun Chauhan, COMP 314, Lecture 15, 16 (Mar 4, Mar 6, 2003)

  1. Text Indexing Arun Chauhan COMP 314 Lecture 15, 16 Mar 4, Mar 6, 2003

  4. Searching Text
     • grep utility on Unix
       - specify a regular expression
       - search all specified files
     • what happens if
       - the files are very big, and
       - many repeated searches need to be carried out
     • can we do better?

  5. Indexing
     • split the search process
       - create an index of frequently used terms (also called a concordance)
       - handle the search as a query that looks up the index
     amortize indexing time over a large number of queries

  7. Full-text Retrieval
     • full-text retrieval ≡ searching large text databases using automatically constructed concordances
     • Questions
       - How is this different from a library catalog?
       - Can we rely on high-speed modern processors to do exhaustive searches?
       - What kind of indexing would be required for full-text retrieval?

  8. Indexing: A General Technique
     • no large database can be searched efficiently without indexes
     • there may be primary and secondary indexes
     • elaborate data structures hold the index to support rapid queries
       - e.g., B+ trees
     • other issues
       - separate structures for separate indexes?
       - rapid reindexing for addition, deletion, update
       - size of the index

  9. Applications
     • databases
       - every database has elaborate index generation schemes
     • web search
       - search engines, e.g., Google, Yahoo!, Lycos
       - also the issue of ranking and displaying the results
     • disk search
       - Apple’s Sherlock creates index files for filesystem search

  11. Inverted File Index
      • term ≡ keywords of interest
      • lexicon ≡ list of all terms occurring in the text
      index[term] = document1, document2, ...
      How do you index non-text data (e.g., PDF files, images)?

  12. An Example
      Document  Text
      1         Pease porridge hot, pease porridge cold
      2         Pease porridge in the pot
      3         Nine days old
      4         Some like it hot, some like it cold
      5         Some like it in the pot
      6         Nine days old
      find the lexicon and build the inverted index

  13. Example (contd.)
      Number  Term      Documents (count; list)
      1       cold      2; 1, 4
      2       days      2; 3, 6
      3       hot       2; 1, 4
      4       in        2; 2, 5
      5       it        2; 4, 5
      6       like      2; 4, 5
      7       nine      2; 3, 6
      8       old       2; 3, 6
      9       pease     2; 1, 2
      10      porridge  2; 1, 2
      11      pot       2; 2, 5
      12      some      2; 4, 5
      13      the       2; 2, 5
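A minimal sketch of how the table above could be produced. It assumes document 5 reads "Some like it in the pot", folds case, and treats any run of letters as a term; the names `DOCUMENTS` and `build_inverted_index` are illustrative and not from the lecture.

```python
import re
from collections import defaultdict

# The six example documents, keyed by document number.
# Assumption: document 5 reads "Some like it in the pot".
DOCUMENTS = {
    1: "Pease porridge hot, pease porridge cold",
    2: "Pease porridge in the pot",
    3: "Nine days old",
    4: "Some like it hot, some like it cold",
    5: "Some like it in the pot",
    6: "Nine days old",
}


def build_inverted_index(documents):
    """Map each case-folded term to the sorted list of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            postings[term].add(doc_id)
    # Sorting the terms gives the lexicon in alphabetical order.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


index = build_inverted_index(DOCUMENTS)
for number, (term, doc_ids) in enumerate(index.items(), start=1):
    print(f"{number:2d}  {term:<9} {len(doc_ids)}; {', '.join(map(str, doc_ids))}")
```

The lexicon is simply the set of keys of `index`; each value is the document list for that term, and its length is the count shown before the semicolon.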

  17. Using the Inverted Index
      • simple lookup is trivial
        - for large documents, may have to maintain the location within each document
      • compound queries?
        - conjunctive and disjunctive queries, e.g., “term1 AND term2”, “term1 OR term2”
        - complement, “NOT term1”
      • near queries?
      • potentially huge index files
        - should we worry about the index size?
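One way to see how these query types map onto the index is to treat each term's document list as a set: AND is intersection, OR is union, and NOT is the complement against all documents. Near queries need word positions as well, which is why the location within each document may have to be stored. The sketch below is an illustration under those assumptions; the function names and the `max_gap` parameter are hypothetical, and the documents are the porridge example with document 5 read as "Some like it in the pot".

```python
import re
from collections import defaultdict

DOCUMENTS = {
    1: "Pease porridge hot, pease porridge cold",
    2: "Pease porridge in the pot",
    3: "Nine days old",
    4: "Some like it hot, some like it cold",
    5: "Some like it in the pot",
    6: "Nine days old",
}


def build_positional_index(documents):
    """Map term -> {document number -> list of word positions}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(re.findall(r"[a-z]+", text.lower())):
            index[term].setdefault(doc_id, []).append(position)
    return index


INDEX = build_positional_index(DOCUMENTS)
ALL_DOCS = set(DOCUMENTS)


def docs(term):
    """Set of documents containing the term (empty if the term is unknown)."""
    return set(INDEX.get(term, {}))


def query_and(term1, term2):
    return docs(term1) & docs(term2)


def query_or(term1, term2):
    return docs(term1) | docs(term2)


def query_not(term):
    return ALL_DOCS - docs(term)


def query_near(term1, term2, max_gap):
    """Documents where the two terms occur at most max_gap words apart."""
    hits = set()
    for doc_id in query_and(term1, term2):
        first, second = INDEX[term1][doc_id], INDEX[term2][doc_id]
        if any(abs(i - j) <= max_gap for i in first for j in second):
            hits.add(doc_id)
    return hits


print(sorted(query_and("hot", "cold")))      # [1, 4]
print(sorted(query_or("pot", "days")))       # [2, 3, 5, 6]
print(sorted(query_not("it")))               # [1, 2, 3, 6]
print(sorted(query_near("hot", "cold", 3)))  # [1]
```

Storing positions makes the index noticeably larger than a plain document list, which ties back to the question of index size.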

  18. Trimming the Index
      • case folding
        - mostly, case is immaterial
      • stemming
        - are “search”, “searching”, “searches” different?
        - strategy: maintain only the neutral form of the term
      • eliminate stop words
        - frequently occurring terms ≡ stop list
        - e.g., “a”, “the”, “in”, “to”, etc.
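The three trimming steps can be sketched as a single normalization pass applied to text before indexing. The suffix stripping here is a deliberately crude stand-in for a real stemmer, and `STOP_WORDS`, `stem`, and `normalize` are illustrative names, not part of the lecture.

```python
import re

# A tiny stop list built from the slide's examples; real stop lists are longer.
STOP_WORDS = {"a", "the", "in", "to"}


def stem(word):
    """Crude suffix stripping: an illustration only, not a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three letters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Case-fold, drop stop words, and stem the remaining terms."""
    terms = re.findall(r"[a-z]+", text.lower())
    return [stem(term) for term in terms if term not in STOP_WORDS]


print(normalize("Searching the searches in a SEARCH"))
# ['search', 'search', 'search']
```

All three surface forms collapse to one index entry, so the lexicon shrinks; the cost is that queries must be normalized the same way before lookup.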

  19. Effectiveness: Precision
      Precision = r / t
      r: number of relevant documents retrieved
      t: total number of documents retrieved
      if 50 documents are retrieved and 35 of them are relevant, then the precision is 35/50 = 70%

  20. Effectiveness: Recall
      Recall = r / n
      r: number of relevant documents retrieved
      n: total number of relevant documents in the collection
      if 50 documents are retrieved and 35 of them are relevant, then the precision is 35/50 = 70%
      if there are 140 relevant documents in the collection, then the recall is 35/140 = 25%
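The two measures differ only in the denominator, which a short sketch makes explicit. The function and parameter names are illustrative; the numbers are the ones used in the slides.

```python
def precision(relevant_retrieved, total_retrieved):
    """Fraction of the retrieved documents that are relevant (r / t)."""
    return relevant_retrieved / total_retrieved


def recall(relevant_retrieved, total_relevant):
    """Fraction of the relevant documents that were retrieved (r / n)."""
    return relevant_retrieved / total_relevant


# 50 documents retrieved, 35 of them relevant, 140 relevant in the collection.
print(f"precision = {precision(35, 50):.0%}")  # precision = 70%
print(f"recall    = {recall(35, 140):.0%}")    # recall    = 25%
```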

  21. Search Engines

  22. Indexing the Web
      • more than 2 billion documents on the web
      • Google claims to index 1.5 billion documents
      • two indexing approaches
        - search engines (e.g., Google)
        - hierarchical directories (e.g., Yahoo!)

  23. Web Search Characteristics
      • bulk
      • rapidly changing content
        - about one-third changes every year
      • heterogeneous content
      • duplication, as much as 30%
      • high linkage
      • wide variety of users
      • varying user behavior
        - 85% only look at the first screen
        - 78% never modify their first query

  24. Query Characteristics
      Terms in query  Share of queries
      0               21%
      1               26%
      2               26%
      3               15%
      > 3             12%

  25. Goals of a Search Engine
      • speed
      • recall
      • precision
      • precision in the top result page

  26. Search Engine Architecture
      • crawler
        - collects pages from the Web
      • indexer
        - indexes the collected pages
      • query server
        - accepts and processes queries and returns the results

  27. The Crawler
      base ← set of known working hyperlinks
      queue ← base
      visited ← base
      while (! queue.empty()) {
        p = remove first element of queue
        process p
        for each page, q, referenced from p
          if q is not in visited
            add q to visited and to queue;
      }
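A runnable sketch of the crawler loop, with the dequeue step and a set of already-seen pages spelled out so that cyclic links cannot cause endless revisits. To stay self-contained it walks a small in-memory link graph instead of fetching real pages; `LINKS`, `crawl`, and the `process` callback are hypothetical names.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: page -> pages it references.
LINKS = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],  # a cycle back to the start page
    "d": [],
}


def crawl(base, links, process):
    """Breadth-first traversal from the base pages, visiting each page once."""
    queue = deque(base)
    seen = set(base)
    while queue:
        page = queue.popleft()  # take the first element off the queue
        process(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)


visited = []
crawl(["a"], LINKS, visited.append)
print(visited)  # ['a', 'b', 'c', 'd']
```

In a real crawler, `process` would hand the page to the indexer and the link list would come from parsing the fetched page.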

  28. Indexing
      • inverted index
        - most common, used by Google
        - superimposed coding is another technique
      • term extraction
        - title or the whole document
        - document analysis to identify keywords

  29. Query Processing
      • keyword vs concept-based searching
        - concept-based searching uses “clustering”
        - Excite used concept-based searching
      • searching “similar” results
      • ranking the hits

  30. Rankings
      • Google’s page-popularity based rankings
        - combined with proximity of search keywords to those in the document
      let page $P$ be pointed to by pages $T_1$, $T_2$, $T_3$, etc.
      let $L(x)$ be the number of links going out of page $x$
      let $R(x)$ be the page rank of page $x$
      $$R(P) = (1 - d) + d \times \left( \frac{R(T_1)}{L(T_1)} + \frac{R(T_2)}{L(T_2)} + \dots + \frac{R(T_k)}{L(T_k)} \right)$$
      where $d$ is a damping factor
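Because every page's rank depends on the ranks of the pages pointing to it, one common way to evaluate the formula is to start from an arbitrary guess and reapply it until the values settle. The sketch below does this on a three-page graph; the graph, the value d = 0.85, and the fixed iteration count are assumptions made for illustration, not values given in the lecture.

```python
def page_rank(links, d=0.85, iterations=50):
    """Repeatedly apply R(P) = (1 - d) + d * sum(R(T) / L(T)) over pages T linking to P.

    links maps each page to the list of pages it links to.
    """
    pages = list(links)
    rank = {page: 1.0 for page in pages}  # arbitrary starting guess
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            incoming = sum(
                rank[t] / len(links[t]) for t in pages if page in links[t]
            )
            new_rank[page] = (1 - d) + d * incoming
        rank = new_rank
    return rank


# Hypothetical graph: A links to B and C, B links to C, C links to A.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = page_rank(GRAPH)
for page, value in ranks.items():
    print(page, round(value, 3))
```

In this graph C ends up ranked above B: both receive half of A's rank, but C also receives all of B's.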

  32. Solving the Rankings
      for each page $P_i$, let $T^{p_i}_1, T^{p_i}_2, \dots, T^{p_i}_{k_i}$ be the pages pointing to it
      $$R(P_1) = (1 - d) + d \times \left( \frac{R(T^{p_1}_1)}{L(T^{p_1}_1)} + \frac{R(T^{p_1}_2)}{L(T^{p_1}_2)} + \dots + \frac{R(T^{p_1}_{k_1})}{L(T^{p_1}_{k_1})} \right)$$
      $$R(P_2) = (1 - d) + d \times \left( \frac{R(T^{p_2}_1)}{L(T^{p_2}_1)} + \frac{R(T^{p_2}_2)}{L(T^{p_2}_2)} + \dots + \frac{R(T^{p_2}_{k_2})}{L(T^{p_2}_{k_2})} \right)$$
      $$\vdots$$
      $$R(P_n) = (1 - d) + d \times \left( \frac{R(T^{p_n}_1)}{L(T^{p_n}_1)} + \frac{R(T^{p_n}_2)}{L(T^{p_n}_2)} + \dots + \frac{R(T^{p_n}_{k_n})}{L(T^{p_n}_{k_n})} \right)$$
      $L \times R = C$
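One reading of $L \times R = C$, offered as an interpretation since the slide does not define the matrix: moving every rank term to the left-hand side gives n linear equations in the n unknown ranks, with coefficient 1 on the diagonal, $-d / L(T)$ wherever page T links to the page of that row, and the constant $1 - d$ on the right. The sketch below builds that system for a hypothetical three-page graph and solves it by Gaussian elimination; the graph, d = 0.85, and all function names are assumptions.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    rows = [row[:] + [b] for row, b in zip(matrix, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n + 1):
                rows[r][c] -= factor * rows[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(rows[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (rows[r][n] - tail) / rows[r][r]
    return x


def solve_ranks(links, d=0.85):
    """Build and solve the linear system implied by the rank equations."""
    pages = list(links)
    n = len(pages)
    coeff = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for j, source in enumerate(pages):
        for target in links[source]:
            # source links to target: subtract d / L(source) in target's row
            coeff[pages.index(target)][j] -= d / len(links[source])
    constants = [1 - d] * n
    return dict(zip(pages, solve(coeff, constants)))


# Hypothetical graph: A links to B and C, B links to C, C links to A.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
for page, value in solve_ranks(GRAPH).items():
    print(page, round(value, 4))
```

A direct solve like this costs on the order of n³ operations, so for a collection the size of the Web an iterative method is the more practical choice.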
