3 text and document databases
play

3. Text and document databases Normal databases: formatted records; - PowerPoint PPT Presentation

3. Text and document databases Normal databases: formatted records; document databases: free-form or semi-structured data (e.g. XML). Application areas: - Office automation, document archives - Digital libraries - Electronic dictionaries


  1. 3. Text and document databases � Normal databases: formatted records; document databases: free-form or semi-structured data (e.g. XML). � Application areas: - Office automation, document archives - Digital libraries - Electronic dictionaries /encyclopedias - Electronic newspapers - Source program libraries - Automated law and patent offices � What is a ‘document’? E.g. a book, chapter, paragraph, article, letter, web page, source program, etc. � General problem setting: Searching documents by contents ; often called also associative search. � Usually based on keywords or terms occurring in documents. � Search terms may be combined with Boolean connectives (AND, OR, NOT) MMDB-3 J. Teuhola 2012 39

  2. Non-indexed methods for string matching � Sequential full-text scanning; slow but some advantages: - No extra disk space required - Updates are fast (no index maintenance) - Partial-match retrieval is rather simple (using wildcard characters) - Approximate matching is also possible (using a threshold for edit distance ) � Popular efficient algorithm: Boyer-Moore technique - Peculiar feature: searching is faster for longer search strings. - Based on preprocessing of the search string - Performance is sublinear in practice - Disk accesses cannot be reduced, except for extremely long search strings. � Several other string matching algorithms exist (skipped). MMDB-3 J. Teuhola 2012 40

  3. Inverted indexing � Traditional way of improving search speed. � What means ‘inverted’? A document is a list of words, but the index gives for each word the list of documents where the word appears. Example documents: Inverted index: BE � {D 1 , D 2 , D 3 } D 1 : ”TO BE OR NOT TO BE” DO � D 2 : ”TO BE IS TO DO” {D 2 , D 3 } IS � D 3 : ”DO BE DO BE DO” {D 2 } NOT � {D 1 } OR � {D 1 } TO � {D 1 , D 2 } MMDB-3 J. Teuhola 2012 41

  4. Inverted indexing (cont.) � The set of words is called a lexicon . Some principles for it: - Case folding : Convert uppercase letters to lowercase - Stemming : Remove suffixes; index only the root forms of terms. - Do not include stopwords , like “the”, “is”, “as”, “that”, etc. which occur very often but do not bear semantic relevance. � The pointers to term occurrences may appear in different granularities : - Coarse-grained index identifies document groups where the term appears. - Moderate-grained index identifies the relevant documents - Fine-grained index contains sentence, word, or even byte numbers for term occurrences. MMDB-3 J. Teuhola 2012 42

  5. Inverted indexing (cont.) � Coarse-grained index: - Small index size - Small maintenance penalty - Lot of plain text scanning - False drops for multi-term queries (terms do not co-occur). � Fine-grained index: - Large index size - High maintenance penalty - Supports proximity queries (terms occurring together) MMDB-3 J. Teuhola 2012 43

  6. Inverted indexing (cont.) Ways to save storage space: � Front compression : Index in alphabetic order; the prefix common with the previous term is expressed compactly. � Tail ( suffix ) compression : Store terms to the point where they can be uniquely distinguished from other terms. � In fine-grained index: Instead of full pointers, store intervals of successive occurrences of a term. Compound queries: � AND: Retrieve pointer lists and compute their intersection . � OR: Retrieve pointer lists and form their union . � NOT: This is usually combined with AND, so that we can apply set difference to the pointer lists of terms. MMDB-3 J. Teuhola 2012 44

  7. Data structures for inverted indexes (a) B + -tree , with index terms as keys and pointer lists as leaves. (b) Hash organization , preferably a dynamic version, e.g. - Linear hashing - Extendible hashing (c) Trie -structure : each node represents one character, and the term is found by following a path from the root to a leaf. Problem: must be mapped to external storage In each case, the variable-length pointer lists can be either locally in the structure, or (preferably) detached. MMDB-3 J. Teuhola 2012 45

  8. Algorithm for building an inverted index 1. For each document, gather the index terms, combined with pointers to the actual locations. The result is a big sequential file (called S). 2. Sort S by using e.g. external mergesort . 3. Combine the successive entries representing the same index term. 4. Build the index (B + -tree, hash table, ...) on the index terms and let the leaf entries refer to the detached pointer lists, stored as variable-length records. Assessment of inverted indexes: � A very effective access method for retrieval. � A very popular technique in practice for static document sets (in spite of the storage penalty). � Presumably the main retrieval tool in web search engines . MMDB-3 J. Teuhola 2012 46

  9. Bitmap indexing � Another representation for the inverted index � Suitable for coarse and moderate-grained indexing � A bitmap is a matrix with a row for each index term , and a column for each document. Element < i , j > is 1, if term i occurs in document j , otherwise 0. d1 d2 d3 BE 1 1 1 Example documents: DO 0 1 1 d1. ”TO BE OR NOT TO BE” IS 0 1 0 d2. ”TO BE IS TO DO” NOT 1 0 0 d3. ”DO BE DO BE DO” OR 1 0 0 TO 1 1 0 [Note: Normally these index terms would be stopwords .] MMDB-3 J. Teuhola 2012 47

  10. Bitmap indexing (cont.) � Especially efficient for Boolean queries: AND, OR, NOT can be implemented directly in hardware (e.g. 64 bits in parallel). � Problem: High storage consumption (#terms × #documents); the matrix is usually sparse . � Possible combined structure: Use an inverted index for the less frequent terms, and a bitmap for the more frequent terms. � One option: compression . E.g. run-length coding : Replace sequences of zeroes by their count (which gets close to the normal inverted index). Also the encoding of integers has to be decided. Example: Bitmap row = 001010000010001100000100... Run-length code = <2, 1, 5, 3, 0, 5, ...> MMDB-3 J. Teuhola 2012 48

  11. Hierarchical compression of bit strings � Divide the string into equal-sized blocks, � Apply disjunction (OR) to the bits within each block, creating a higher-level bit: A block of zeroes generates a 0-bit, others a 1-bit. � The process is repeated on higher levels, recursively. � Advantage: Single bits are easily accessible, by studying one path in the tree. 1 1 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0000 0010 0000 0011 1000 0000 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000 � Compression: zero blocks need not be stored, at all. Most of the leaf blocks are usually zero blocks (sparse bitmap). MMDB-3 J. Teuhola 2012 49

  12. Signature indexing � A probabilistic technique � Can be generalized to any objects characterized by a variable-size set of index terms or descriptors (also called features ). � Signature is a bit-array, generated by hashing the index terms to indexes of the array. It is usually at least some hundreds of bits long. � Signatures are collected to a separate signature file , which is smaller than the whole space required by the documents. � The signature file acts as a filtering mechanism , to reduce the amount of actual data to be searched. � The structure enables partial-match retrieval (subset of terms match) � Queries can also be considered a kind of documents (collection of keywords), and be converted into signature form. The query signature Q is compared against document signatures D i . If 1-bits of Q are included in 1-bits of D i , then D i is a candidate result. � Signatures are approximate descriptions of documents. False drops must be eliminated by checking the actual match of all candidates. MMDB-3 J. Teuhola 2012 50

  13. Signature indexing (cont.) Advantages of signature files: � Low storage consumption, compared to (fine-grained) inverted indices. Typical value is 10-20% of the primary database size � More convenient than multidimensional indexes (to be studied later) Signature generation methods: � Word signature method: Each index term is hashed into a sparse bit pattern, and the patterns are concatenated to form the document signature. This method usually results in higher false drop probability, but preserves sequencing information of terms in documents. � Superimposed coding : Each index term produces a bit pattern of full signature size. The patterns are OR’ed to form the document signature. This is here the default method. How to minimize false drops? � The proportions of 0’s and 1’s should be equal and uniformly distributed. ( Weight = number of 1-bits in the signature) MMDB-3 J. Teuhola 2012 51

  14. Signature indexing: Example Document index terms: D 1 : database, object, programming, schema D 2 : algorithm, computer, programming D 3 : algorithm, data structure, programming Hashing: hash(“algorithm”) = 3 hash(“computer”) = 1 hash(“database”) = 7 hash(“data structure”) = 5 hash(“object”) = 6 hash(“programming”) = 4 hash(“schema”) = 1 Signatures: D 1 = 10010110, D 2 = 10110000, D 3 = 00111000 Query: Documents about “algorithm” and “programming”? Query signature: Q = 00110000, matches with D 2 and D 3 . MMDB-3 J. Teuhola 2012 52

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend