3. Text and document databases Normal databases: formatted records; - PowerPoint PPT Presentation

3. Text and document databases � Normal databases: formatted records; document databases: free-form or semi-structured data (e.g. XML). � Application areas: - Office automation, document archives - Digital libraries - Electronic dictionaries /encyclopedias - Electronic newspapers - Source program libraries - Automated law and patent offices � What is a ‘document’? E.g. a book, chapter, paragraph, article, letter, web page, source program, etc. � General problem setting: Searching documents by contents ; often called also associative search. � Usually based on keywords or terms occurring in documents. � Search terms may be combined with Boolean connectives (AND, OR, NOT) MMDB-3 J. Teuhola 2012 39

Non-indexed methods for string matching � Sequential full-text scanning; slow but some advantages: - No extra disk space required - Updates are fast (no index maintenance) - Partial-match retrieval is rather simple (using wildcard characters) - Approximate matching is also possible (using a threshold for edit distance ) � Popular efficient algorithm: Boyer-Moore technique - Peculiar feature: searching is faster for longer search strings. - Based on preprocessing of the search string - Performance is sublinear in practice - Disk accesses cannot be reduced, except for extremely long search strings. � Several other string matching algorithms exist (skipped). MMDB-3 J. Teuhola 2012 40

Inverted indexing � Traditional way of improving search speed. � What means ‘inverted’? A document is a list of words, but the index gives for each word the list of documents where the word appears. Example documents: Inverted index: BE � {D 1 , D 2 , D 3 } D 1 : ”TO BE OR NOT TO BE” DO � D 2 : ”TO BE IS TO DO” {D 2 , D 3 } IS � D 3 : ”DO BE DO BE DO” {D 2 } NOT � {D 1 } OR � {D 1 } TO � {D 1 , D 2 } MMDB-3 J. Teuhola 2012 41

Inverted indexing (cont.) � The set of words is called a lexicon . Some principles for it: - Case folding : Convert uppercase letters to lowercase - Stemming : Remove suffixes; index only the root forms of terms. - Do not include stopwords , like “the”, “is”, “as”, “that”, etc. which occur very often but do not bear semantic relevance. � The pointers to term occurrences may appear in different granularities : - Coarse-grained index identifies document groups where the term appears. - Moderate-grained index identifies the relevant documents - Fine-grained index contains sentence, word, or even byte numbers for term occurrences. MMDB-3 J. Teuhola 2012 42

Inverted indexing (cont.) � Coarse-grained index: - Small index size - Small maintenance penalty - Lot of plain text scanning - False drops for multi-term queries (terms do not co-occur). � Fine-grained index: - Large index size - High maintenance penalty - Supports proximity queries (terms occurring together) MMDB-3 J. Teuhola 2012 43

Inverted indexing (cont.) Ways to save storage space: � Front compression : Index in alphabetic order; the prefix common with the previous term is expressed compactly. � Tail ( suffix ) compression : Store terms to the point where they can be uniquely distinguished from other terms. � In fine-grained index: Instead of full pointers, store intervals of successive occurrences of a term. Compound queries: � AND: Retrieve pointer lists and compute their intersection . � OR: Retrieve pointer lists and form their union . � NOT: This is usually combined with AND, so that we can apply set difference to the pointer lists of terms. MMDB-3 J. Teuhola 2012 44

Data structures for inverted indexes (a) B + -tree , with index terms as keys and pointer lists as leaves. (b) Hash organization , preferably a dynamic version, e.g. - Linear hashing - Extendible hashing (c) Trie -structure : each node represents one character, and the term is found by following a path from the root to a leaf. Problem: must be mapped to external storage In each case, the variable-length pointer lists can be either locally in the structure, or (preferably) detached. MMDB-3 J. Teuhola 2012 45

Algorithm for building an inverted index 1. For each document, gather the index terms, combined with pointers to the actual locations. The result is a big sequential file (called S). 2. Sort S by using e.g. external mergesort . 3. Combine the successive entries representing the same index term. 4. Build the index (B + -tree, hash table, ...) on the index terms and let the leaf entries refer to the detached pointer lists, stored as variable-length records. Assessment of inverted indexes: � A very effective access method for retrieval. � A very popular technique in practice for static document sets (in spite of the storage penalty). � Presumably the main retrieval tool in web search engines . MMDB-3 J. Teuhola 2012 46

Bitmap indexing � Another representation for the inverted index � Suitable for coarse and moderate-grained indexing � A bitmap is a matrix with a row for each index term , and a column for each document. Element < i , j > is 1, if term i occurs in document j , otherwise 0. d1 d2 d3 BE 1 1 1 Example documents: DO 0 1 1 d1. ”TO BE OR NOT TO BE” IS 0 1 0 d2. ”TO BE IS TO DO” NOT 1 0 0 d3. ”DO BE DO BE DO” OR 1 0 0 TO 1 1 0 [Note: Normally these index terms would be stopwords .] MMDB-3 J. Teuhola 2012 47

Bitmap indexing (cont.) � Especially efficient for Boolean queries: AND, OR, NOT can be implemented directly in hardware (e.g. 64 bits in parallel). � Problem: High storage consumption (#terms × #documents); the matrix is usually sparse . � Possible combined structure: Use an inverted index for the less frequent terms, and a bitmap for the more frequent terms. � One option: compression . E.g. run-length coding : Replace sequences of zeroes by their count (which gets close to the normal inverted index). Also the encoding of integers has to be decided. Example: Bitmap row = 001010000010001100000100... Run-length code = <2, 1, 5, 3, 0, 5, ...> MMDB-3 J. Teuhola 2012 48

Hierarchical compression of bit strings � Divide the string into equal-sized blocks, � Apply disjunction (OR) to the bits within each block, creating a higher-level bit: A block of zeroes generates a 0-bit, others a 1-bit. � The process is repeated on higher levels, recursively. � Advantage: Single bits are easily accessible, by studying one path in the tree. 1 1 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0000 0010 0000 0011 1000 0000 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000 � Compression: zero blocks need not be stored, at all. Most of the leaf blocks are usually zero blocks (sparse bitmap). MMDB-3 J. Teuhola 2012 49

Signature indexing � A probabilistic technique � Can be generalized to any objects characterized by a variable-size set of index terms or descriptors (also called features ). � Signature is a bit-array, generated by hashing the index terms to indexes of the array. It is usually at least some hundreds of bits long. � Signatures are collected to a separate signature file , which is smaller than the whole space required by the documents. � The signature file acts as a filtering mechanism , to reduce the amount of actual data to be searched. � The structure enables partial-match retrieval (subset of terms match) � Queries can also be considered a kind of documents (collection of keywords), and be converted into signature form. The query signature Q is compared against document signatures D i . If 1-bits of Q are included in 1-bits of D i , then D i is a candidate result. � Signatures are approximate descriptions of documents. False drops must be eliminated by checking the actual match of all candidates. MMDB-3 J. Teuhola 2012 50

Signature indexing (cont.) Advantages of signature files: � Low storage consumption, compared to (fine-grained) inverted indices. Typical value is 10-20% of the primary database size � More convenient than multidimensional indexes (to be studied later) Signature generation methods: � Word signature method: Each index term is hashed into a sparse bit pattern, and the patterns are concatenated to form the document signature. This method usually results in higher false drop probability, but preserves sequencing information of terms in documents. � Superimposed coding : Each index term produces a bit pattern of full signature size. The patterns are OR’ed to form the document signature. This is here the default method. How to minimize false drops? � The proportions of 0’s and 1’s should be equal and uniformly distributed. ( Weight = number of 1-bits in the signature) MMDB-3 J. Teuhola 2012 51

Signature indexing: Example Document index terms: D 1 : database, object, programming, schema D 2 : algorithm, computer, programming D 3 : algorithm, data structure, programming Hashing: hash(“algorithm”) = 3 hash(“computer”) = 1 hash(“database”) = 7 hash(“data structure”) = 5 hash(“object”) = 6 hash(“programming”) = 4 hash(“schema”) = 1 Signatures: D 1 = 10010110, D 2 = 10110000, D 3 = 00111000 Query: Documents about “algorithm” and “programming”? Query signature: Q = 00110000, matches with D 2 and D 3 . MMDB-3 J. Teuhola 2012 52

3. Text and document databases Normal databases: formatted records; - PowerPoint PPT Presentation

3. Text and document databases Normal databases: formatted records; document databases: free-form or semi-structured data (e.g. XML). Application areas: - Office automation, document archives - Digital libraries - Electronic dictionaries

10 slides that always work Simple text boxes (I) Sample text Sample text Sample text

CONTENT TITLE Insert Subtitle Here Enter Text Here Enter Text Here Enter Text Here

Inductive Inductive Inductive Inductive Databases Databases Databases Databases and

Creating Databases and Tables Introduction to Databases in Python Creating Databases

Lecture 11: Persistent Memory Databases 1 / 71 Persistent Memory Databases Recap

Post-Conference Presentation Sunday Oladayo Oladejo Table of Content A Introduction B

Module 3: Creating and Managing Databases Overview Creating Databases Creating

Text Text #ICANN51 15 October 2014 Text Text IDN Root Zone LGR Sarmad Hussain IDN Program

Enhancing ICANN Text Accountability 26 June 2014 Text #ICANN50 Text #ICANN50 Text #ICANN50

Add Your Title Here Replace your text here! Replace your text here! Insert your title here 1

Text Text #ICANN51 Contractual Compliance Text Text Contractual Compliance Update

Text Text #ICANN50 Contractual Compliance Text Text GNSO Council Meeting Wednesday, Jun 25

God Rescues Daniel from the Lions Daniel 6 Here is some test text Here is some test text Here

5. Text CHAPTER HIGHLIGHTS Text tradition. Codes for computer text. C d f t t t

Stack Stack Heap Heap Data Data Text Text Program A Program B Stack Stack Text Heap

Business Proposal Infographic Style Your Text Here Your Text Here Your Text Here Your Text

Retrieval by Content Srihari: CSE 626 Database Retrieval In a Database Context Query

for Efficient Quantum Sorting Naveed Mahmud, Bailey K. Srimoungchanh, Bennett Haase-Divine, Nolan

Classification Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester

Exploiting Community Structure for Floating-Point Precision Tuning Hui Guo Cindy Rubio-Gonzlez

L OGO TO SVG Vladimir Batagelj Department of mathematics, FMF, University of Ljubljana Jadranska

Dialogue management, system design & evaluation Pierre Lison IN4080 : Natural Language

T alk Outline Platform Overview Introducing Metro style games App Model Object model for

Application Accelerators: Application Accelerators: Application Accelerators: Application

3. Text and document databases Normal databases: formatted records; - PowerPoint PPT Presentation

3. Text and document databases Normal databases: formatted records; document databases: free-form or semi-structured data (e.g. XML). Application areas: - Office automation, document archives - Digital libraries - Electronic dictionaries

10 slides that always work Simple text boxes (I) Sample text Sample text Sample text

CONTENT TITLE Insert Subtitle Here Enter Text Here Enter Text Here Enter Text Here

Inductive Inductive Inductive Inductive Databases Databases Databases Databases and

Creating Databases and Tables Introduction to Databases in Python Creating Databases

Lecture 11: Persistent Memory Databases 1 / 71 Persistent Memory Databases Recap

Post-Conference Presentation Sunday Oladayo Oladejo Table of Content A Introduction B

Module 3: Creating and Managing Databases Overview Creating Databases Creating

Text Text #ICANN51 15 October 2014 Text Text IDN Root Zone LGR Sarmad Hussain IDN Program

Enhancing ICANN Text Accountability 26 June 2014 Text #ICANN50 Text #ICANN50 Text #ICANN50

Add Your Title Here Replace your text here! Replace your text here! Insert your title here 1

Text Text #ICANN51 Contractual Compliance Text Text Contractual Compliance Update

Text Text #ICANN50 Contractual Compliance Text Text GNSO Council Meeting Wednesday, Jun 25

God Rescues Daniel from the Lions Daniel 6 Here is some test text Here is some test text Here

5. Text CHAPTER HIGHLIGHTS Text tradition. Codes for computer text. C d f t t t

Stack Stack Heap Heap Data Data Text Text Program A Program B Stack Stack Text Heap

Business Proposal Infographic Style Your Text Here Your Text Here Your Text Here Your Text

Retrieval by Content Srihari: CSE 626 Database Retrieval In a Database Context Query

for Efficient Quantum Sorting Naveed Mahmud, Bailey K. Srimoungchanh, Bennett Haase-Divine, Nolan

Classification Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester

Exploiting Community Structure for Floating-Point Precision Tuning Hui Guo Cindy Rubio-Gonzlez

L OGO TO SVG Vladimir Batagelj Department of mathematics, FMF, University of Ljubljana Jadranska

Dialogue management, system design &amp; evaluation Pierre Lison IN4080 : Natural Language

T alk Outline Platform Overview Introducing Metro style games App Model Object model for

Application Accelerators: Application Accelerators: Application Accelerators: Application

Dialogue management, system design & evaluation Pierre Lison IN4080 : Natural Language