Indexing and Searching (TDT4215)



SLIDE 1

Indexing Approaches

  • R. Baeza-Yates and B. Ribeiro-Neto: Modern Information Retrieval, Chapter 8. Addison Wesley, 1999.

Jon Atle Gulla

TDT4215 – Indexing & Searching

Indexing and Searching

SLIDE 2

Outline

  • Indexing approaches:
    – Inverted files
    – Suffix Trees & Suffix Arrays
    – Signature Files
  • The search part of the chapter is optional


Inverted Files - Definition

  • The inverted file structure is composed of two elements:
    – Vocabulary: the set of all different words in the text.
    – Occurrences: for each word, a list of all the text positions where the word appears is stored. The set of all those lists is called the occurrences.
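The two elements can be illustrated with a minimal Python sketch. It assumes that words are runs of alphanumeric characters, that case is ignored, and that positions are 0-based character offsets. The name `build_inverted_index` and the sample text are illustrative choices, not part of the course material.

```python
import re
from collections import defaultdict

def build_inverted_index(text):
    """Map each distinct word to the sorted list of its character positions."""
    occurrences = defaultdict(list)
    for match in re.finditer(r"\w+", text.lower()):
        occurrences[match.group()].append(match.start())
    return dict(occurrences)

text = "This is a text. A text has many words. Words are made from letters."
index = build_inverted_index(text)   # the occurrences, keyed by word
vocabulary = sorted(index)           # the set of all different words
```

Here `index["text"]` holds the two offsets at which "text" starts, and `vocabulary` lists every distinct word once.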

SLIDE 3

Inverted Files - Block Addressing (1)

  • The space required for the vocabulary is rather small.
  • Occurrences demand much more space, since each word appearing in the text is referenced once in that structure.
  • To reduce space requirements, a technique called block addressing is used:
    – the text is divided into blocks;
    – the occurrences point to the blocks where the word appears.
  • By using block addressing
    – the pointers are smaller, because there are fewer blocks than positions;
    – all the occurrences of a word inside a single block are collapsed to one reference.

Inverted Files - Block Addressing (2)

Note that we no longer know that there are 2 occurrences of “words” in block 3.

  • Block size:
    – Fixed: imposing a logical block structure over the text database.
    – Natural division: splitting the text collection into files, documents, Web pages and others.
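A rough Python sketch of block addressing with fixed-size blocks follows. The block size of 16 characters, the 0-based block numbers and the name `build_block_index` are assumptions made only for this illustration.

```python
import re
from collections import defaultdict

def build_block_index(text, block_size):
    """Map each word to the sorted list of fixed-size blocks it occurs in."""
    blocks = defaultdict(set)
    for match in re.finditer(r"\w+", text.lower()):
        # A set collapses repeated occurrences inside one block to one reference.
        blocks[match.group()].add(match.start() // block_size)
    return {word: sorted(ids) for word, ids in blocks.items()}

text = "This is a text. A text has many words. Words are made from letters."
block_index = build_block_index(text, block_size=16)
```

With these settings both occurrences of "words" fall into the third block (number 2 when counting from 0), so the index keeps a single reference and the count of occurrences is lost.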

SLIDE 4

Index Size (1)

  • Vocabulary: O(n^β), where β is typically between 0.4 and 0.6 (Heaps' law)
  • Occurrences: O(n)
  • Example:
    – TREC-2 collection: 1 Gb
    – Vocabulary: 5 Mb
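To get a feel for how slowly the vocabulary grows, the sketch below reads the vocabulary bound as proportional to n^β, with β between 0.4 and 0.6 (Heaps' law), and computes the factor by which the vocabulary grows when the text grows by a given factor. The function name `vocabulary_growth` is hypothetical, and the proportionality constant is deliberately left out because it cancels in the ratio.

```python
def vocabulary_growth(text_growth, beta):
    """Growth factor of the vocabulary when the text grows by `text_growth`,
    assuming the vocabulary size is proportional to n**beta."""
    return text_growth ** beta

for beta in (0.4, 0.5, 0.6):
    # How much larger the vocabulary gets when the text becomes 10 times larger.
    print(beta, round(vocabulary_growth(10, beta), 2))
```

The occurrences, in contrast, grow linearly: a text that is 10 times larger needs about 10 times the space for occurrences.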

Index Size (2)

Index size as a percentage of the text size. Each collection has two columns (left, right) that differ in the treatment of stopwords.

Index                          Small collection (1 Mb)   Medium collection (200 Mb)   Large collection (2 Gb)
Addressing words               45%    73%                36%    64%                   35%    63%
Addressing documents (10 Kb)   19%    26%                18%    32%                   26%    47%
Addressing 256 blocks          18%    25%                1.7%   2.4%                  0.5%   0.7%

E.g., for document addressing indexes on the large collection, the index size is 26% of the total document collection, provided that stopwords are deleted. If stopwords are not deleted from the collection, the index is 47%.

IMPORTANT NOTE: The left and right columns for each collection type should be switched. This error also appears in the book (Table 8.1, page 195).

SLIDE 5

Inverted Files. Searching

  • What to search on inverted files:
    – Single-word queries: the process ends by delivering the list of occurrences.
    – Context queries are more difficult to solve with inverted indices:
      • each element must be searched separately and a list generated for each one;
      • then the lists of all elements are traversed in synchronization to find places where all the words appear in sequence or appear close enough.
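The synchronized traversal for a phrase query can be sketched as follows. The sketch assumes that the index stores word positions (0, 1, 2, ...) rather than character offsets, so that consecutive words differ by exactly one. The name `phrase_positions` and the tiny hand-built index are illustrative only.

```python
def phrase_positions(index, words):
    """Return start positions where `words` occur consecutively.

    `index` maps each word to a sorted list of word positions (0, 1, 2, ...).
    """
    lists = [index.get(w, []) for w in words]
    if not lists or any(not lst for lst in lists):
        return []
    cursors = [0] * len(lists)       # one cursor per occurrence list
    hits = []
    while cursors[0] < len(lists[0]):
        start = lists[0][cursors[0]]
        matched = True
        for k in range(1, len(lists)):
            # Advance list k until it reaches the position expected for word k.
            while cursors[k] < len(lists[k]) and lists[k][cursors[k]] < start + k:
                cursors[k] += 1
            if cursors[k] == len(lists[k]):
                return hits          # list k is exhausted: no further matches
            if lists[k][cursors[k]] != start + k:
                matched = False
                break
        if matched:
            hits.append(start)
        cursors[0] += 1
    return hits

# Word positions for the text "this is a text a text has ..."
index = {"a": [2, 4], "text": [3, 5], "has": [6]}
```

Every cursor only moves forward, so each list is read once. A proximity query would use the same traversal with a distance test in place of the exact `start + k` comparison.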

Inverted Files. Search Algorithm

The search algorithm on an inverted index follows three general steps:

  • 1. Vocabulary search.
    – The words and patterns present in the query are isolated and searched in the vocabulary.
    – Notice that phrases and proximity queries are split into single words.
  • 2. Retrieval of occurrences.
    – The lists of the occurrences of the words found are retrieved.
  • 3. Manipulation of occurrences.
    – The occurrences are processed to solve phrases, proximity or Boolean operations.
    – If block addressing is used, it may be necessary to directly search the text to find the information missing from the occurrences.
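The three steps can be sketched for a phrase query over a block-addressing index. This is a simplified illustration under several assumptions: blocks are kept as plain strings, phrases that span a block boundary are ignored, and the final check is a plain substring test. The name `search_phrase` and the sample blocks are hypothetical.

```python
def search_phrase(phrase, block_index, blocks):
    """Three-step search over a block-addressing index.

    block_index: word -> sorted list of block numbers; blocks: list of block texts.
    """
    # 1. Vocabulary search: split the phrase into single words and look them up.
    words = phrase.lower().split()
    if not words or any(w not in block_index for w in words):
        return []
    # 2. Retrieval of occurrences: one list of block numbers per word.
    lists = [block_index[w] for w in words]
    # 3. Manipulation of occurrences: intersect the lists, then scan the
    #    candidate blocks, since the index does not store exact positions.
    candidates = set(lists[0]).intersection(*lists[1:])
    return [b for b in sorted(candidates) if phrase.lower() in blocks[b].lower()]

blocks = ["this is a text", "a text has many", "words and many has"]
block_index = {}
for number, block in enumerate(blocks):
    for word in set(block.split()):
        block_index.setdefault(word, []).append(number)
```

For the query "has many", both words occur in blocks 1 and 2, but only the scan of the text shows that the phrase itself appears in block 1 alone.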

SLIDE 6

Inverted Files. Construction

  • All the vocabulary known up to a moment is kept in a trie data structure, storing for each word a list of its occurrences.
  • Each word of the text is read and searched in the trie.
  • If it is not found, it is added to the trie with an empty list of occurrences.
  • Once it is in the trie, the new position is added to the end of its list of occurrences.
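A small Python sketch of this construction is given below. It assumes one trie node per character and word positions as occurrences. `TrieNode`, `add_occurrence` and `lookup` are names invented for the illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # next character -> TrieNode
        self.occurrences = None   # list of positions if a word ends here

def add_occurrence(root, word, position):
    """Search `word` in the trie, create it if missing, then append `position`."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    if node.occurrences is None:  # word not found: add it with an empty list
        node.occurrences = []
    node.occurrences.append(position)

def lookup(root, word):
    """Return the list of occurrences of `word`, or [] if it is not in the trie."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return []
    return node.occurrences or []

root = TrieNode()
for position, word in enumerate("a text a text has many words".split()):
    add_occurrence(root, word, position)
```

Because the text is read left to right, every list of occurrences is already sorted when construction ends.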


Inverted Files. Construction

  • It is good practice to split the index into two files.
    – In the first file, the lists of occurrences are stored contiguously (posting file).
    – In the second file, the vocabulary is stored in lexicographical order and, for each word, a pointer to its list in the first file is also included.
    – This allows the vocabulary to be kept in memory at search time in many cases.
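The two-file layout can be imitated in memory as a sketch: a flat Python list stands in for the posting file, and the vocabulary is a sorted list of `(word, offset, count)` entries. The entry format and the names `split_index` and `occurrences` are assumptions for this example, not a prescribed file format.

```python
import bisect

def split_index(index):
    """Lay out `index` (word -> position list) as a vocabulary plus a posting list."""
    postings = []      # stands in for the posting file
    vocabulary = []    # sorted (word, offset, count) entries
    for word in sorted(index):
        vocabulary.append((word, len(postings), len(index[word])))
        postings.extend(index[word])
    return vocabulary, postings

def occurrences(vocabulary, postings, word):
    """Binary-search the in-memory vocabulary, then read one posting slice."""
    i = bisect.bisect_left(vocabulary, (word,))
    if i == len(vocabulary) or vocabulary[i][0] != word:
        return []
    _, offset, count = vocabulary[i]
    return postings[offset:offset + count]

index = {"text": [10, 18], "words": [32, 39], "letters": [59]}
vocabulary, postings = split_index(index)
```

Only the small vocabulary needs to stay in memory. A lookup then costs one binary search plus a single contiguous read from the posting file.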

SLIDE 7

Inverted Files. Construction of large text

  • The algorithm is not practical for large texts where the index does not fit in main memory:
    – The algorithm is used until the main memory is exhausted.
    – The partial index Ii obtained up to now is written to disk and erased from main memory before continuing with the rest of the text.
    – Finally, a number of partial indices Ii exist on disk. These indices are then merged in a hierarchical fashion:
      • I1 and I2, I3 and I4, and so on;
      • I1..2 and I3..4, I5..6 and I7..8, and so on;
      • this continues until there is just one index comprising the whole text.
    – Merging two indices consists of:
      • merging the sorted vocabularies, and
      • whenever the same word appears in both indices, merging both lists of occurrences.
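The hierarchical merge can be sketched in Python. The sketch assumes that each partial index is a list of `(word, positions)` pairs sorted by word, that the partial indices are given in text order, and that at least one partial index exists. The function names are illustrative, and everything is kept in memory instead of on disk.

```python
def merge_indices(left, right):
    """Merge two partial indices given as sorted lists of (word, positions).

    Assumes `left` covers an earlier part of the text than `right`, so
    concatenating the two occurrence lists keeps them sorted.
    """
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            merged.append(left[i])
            i += 1
        elif left[i][0] > right[j][0]:
            merged.append(right[j])
            j += 1
        else:  # same word in both indices: merge both lists of occurrences
            merged.append((left[i][0], left[i][1] + right[j][1]))
            i += 1
            j += 1
    return merged + left[i:] + right[j:]

def merge_hierarchically(partials):
    """Merge I1 with I2, I3 with I4, ... level by level until one index is left."""
    while len(partials) > 1:
        partials = [
            merge_indices(partials[k], partials[k + 1])
            if k + 1 < len(partials) else partials[k]
            for k in range(0, len(partials), 2)
        ]
    return partials[0]

i1 = [("a", [0]), ("text", [1])]
i2 = [("a", [2]), ("has", [4]), ("text", [3])]
i3 = [("many", [5]), ("words", [6])]
full_index = merge_hierarchically([i1, i2, i3])
```

With an odd number of partial indices, the last one is carried unchanged to the next level, as happens to `i3` in the first round here.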

Inverted Files. Construction of large text

SLIDE 8

Suffix Trees and Suffix Arrays. Definition

  • Suffix tree and suffix array indexes see the text as one long string. Each position in the text is considered as a text suffix. Each suffix is thus uniquely identified by its position.
  • Index points are selected from the text, which point to the beginning of the text positions which will be retrievable.
  • This structure can be used to index words or characters.

Suffix Trees and Suffix Arrays. Structure

  • A suffix tree is a trie data structure built over all the suffixes of the text.
  • The pointers to the suffixes are stored at the leaf nodes.
  • The problem with this structure is its space requirements.
  • The trie structure can be compacted by compressing unary paths.

SLIDE 9

Suffix Trees and Suffix Arrays. Structure

  • Suffix arrays provide essentially the same functionality as suffix trees with much lower space requirements.
  • A suffix array is simply an array containing all the pointers to the text suffixes, listed in lexicographical order.
  • Suffix arrays are designed to allow binary searches, done by comparing the contents of each pointer.
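A minimal Python sketch of a suffix array with a binary search follows. It indexes every character position of a short string and builds the array by naively sorting the suffixes, which is acceptable only for tiny inputs. The names `build_suffix_array` and `contains` are illustrative.

```python
def build_suffix_array(text):
    """Pointers to all suffixes of `text`, listed in lexicographical order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def contains(text, suffix_array, pattern):
    """Binary search: is `pattern` a prefix of some suffix of `text`?"""
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        # Compare the pattern against the contents that the pointer refers to.
        if text[start:start + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(suffix_array) and text.startswith(pattern, suffix_array[lo])

text = "banana"
suffix_array = build_suffix_array(text)
```

The array stores only integers. Each comparison follows a pointer into the text, which is why the text itself must remain available at search time.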

Suffix Trees and Suffix Arrays. Structure

  • Supra-indices over the suffix arrays.
  • The simplest supra-index is no more than a sampling of one out of b suffix array entries, where for each sample the first l suffix characters are stored in the supra-index.
  • This supra-index is then used as a first step of the search to reduce external accesses.
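The sampling idea can be sketched as below. The parameters `sample_every` and `prefix_len` play the roles of b and l, and the function names are hypothetical. The returned range is deliberately conservative: it may include a few entries that do not match, but it never excludes one that does.

```python
import bisect

def build_supra_index(text, suffix_array, sample_every, prefix_len):
    """Keep the first `prefix_len` characters of one out of `sample_every` entries."""
    return [text[p:p + prefix_len] for p in suffix_array[::sample_every]]

def candidate_range(supra_index, pattern, sample_every, prefix_len):
    """Return the slice [start, end) of the suffix array that must be inspected."""
    key = pattern[:prefix_len]
    # Truncate the samples so that short patterns compare as prefixes.
    prefixes = [sample[:len(key)] for sample in supra_index]
    first = bisect.bisect_left(prefixes, key)
    last = bisect.bisect_right(prefixes, key)
    # `end` may exceed the array length; Python slicing clamps it.
    return max(first - 1, 0) * sample_every, last * sample_every

text = "banana"
suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
supra_index = build_supra_index(text, suffix_array, sample_every=2, prefix_len=2)
start, end = candidate_range(supra_index, "na", sample_every=2, prefix_len=2)
```

Only `suffix_array[start:end]` then has to be fetched and searched, which is the part that would reside on disk in a large index.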

SLIDE 10

Suffix Trees and Suffix Arrays. Searching

  • With suffix trees and suffix arrays we can search for
    – Words;
    – Prefixes & suffixes;
    – Phrases.
  • The search pattern originates two ‘limiting’ patterns P1 and P2, so that we want any suffix S such that P1 ≤ S < P2.
  • Both limiting patterns are binary searched in the suffix array. Then all the elements lying between both positions point to exactly those suffixes that start like the original pattern.
  • All these query types are a good case for these indices.
    – A simple phrase of words can be searched as if it were a simple pattern.
    – This is because the suffix tree/array sorts with respect to the complete suffixes and not only their first word.
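A sketch of the search with two limiting patterns follows. It assumes a non-empty pattern and takes P2 to be the pattern with its last character replaced by the next character code, which is one common way to obtain an upper limit. The suffixes are materialised as strings purely for readability, and `search_range` is a hypothetical name.

```python
import bisect

def search_range(text, suffix_array, pattern):
    """Positions of all suffixes S with P1 <= S < P2.

    P1 is the pattern itself; P2 is the pattern with its last character
    replaced by the next character, so P2 follows every string starting with P1.
    """
    p1 = pattern
    p2 = pattern[:-1] + chr(ord(pattern[-1]) + 1)
    suffixes = [text[i:] for i in suffix_array]   # materialised only for clarity
    lo = bisect.bisect_left(suffixes, p1)
    hi = bisect.bisect_left(suffixes, p2)
    return sorted(suffix_array[lo:hi])

text = "to be or not to be"
suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
```

The phrase "to be" is handled exactly like a single word: `search_range(text, suffix_array, "to be")` returns both positions where the phrase starts.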

Suffix Trees and Suffix Arrays. Construction of Suffix Arrays for the Large Text

  • The problem: large databases do not fit in main memory.
    – Split the text into blocks that can be sorted in main memory.
    – Then, for each block, build its suffix array in main memory and merge it with the rest of the array already built for the previous text.
    – That is:
      • build the suffix array for the first block;
      • build the suffix array for the second block;
      • merge both suffix arrays;
      • build the suffix array for the third block;
      • merge the new suffix array with the previous one;
      • … and so on.
    – The difficult part is how to merge a large suffix array with the small suffix array.

SLIDE 11

Suffix Trees and Suffix Arrays. Construction of Suffix Arrays for the Large Text

  • The solution is
    – To determine how many elements of the large array are to be placed between each pair of elements in the small array.
    – Use that information to merge the arrays without accessing the text.
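The counter-based merge can be sketched in memory as follows. This is a simplified illustration, not the external-memory procedure itself: the whole text is held in a Python string, suffixes of the small array are materialised for the binary search, and `merge_suffix_arrays` is an invented name.

```python
import bisect

def merge_suffix_arrays(text, large, small):
    """Merge two sorted suffix arrays over the same `text`.

    counters[j] is the number of entries of `large` that sort immediately
    before small[j]; counters[len(small)] counts those after every small entry.
    """
    small_suffixes = [text[p:] for p in small]
    counters = [0] * (len(small) + 1)
    # Counting phase: the only phase that reads the text. The order in which
    # the suffixes of the large part are visited does not matter here.
    for p in large:
        counters[bisect.bisect_left(small_suffixes, text[p:])] += 1
    # Merging phase: driven by the counters alone, without comparing suffixes.
    merged, next_large = [], 0
    for j, count in enumerate(counters):
        merged.extend(large[next_large:next_large + count])
        next_large += count
        if j < len(small):
            merged.append(small[j])
    return merged

text = "banana"
large = sorted(range(0, 3), key=lambda i: text[i:])   # suffixes starting in block 1
small = sorted(range(3, 6), key=lambda i: text[i:])   # suffixes starting in block 2
merged = merge_suffix_arrays(text, large, small)
```

Since `large` is already sorted, knowing how many of its entries fall into each gap of `small` is enough to interleave the two arrays in one sequential pass.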


Signature Files. Definition

  • A signature file uses a hash function (or ‘signature’) that maps words to bit masks of B bits.
  • It divides the text into blocks of b words each. To each text block of size b, a bit mask of size B will be assigned.
  • The idea is that if a word is present in a text block, then all the bits set in its signature are also set in the bit mask of the text block.
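A sketch of word signatures and block masks in Python follows. The mask width `B = 16`, the two bits set per word and the use of SHA-256 as the hash function are arbitrary assumptions for the illustration. Any hash that spreads words over the B bits would serve.

```python
import hashlib

B = 16               # bits per mask (assumed value)
BITS_PER_WORD = 2    # bits set by each word signature (assumed value)

def signature(word):
    """Hash `word` to a B-bit mask with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    mask = 0
    for byte in digest[:BITS_PER_WORD]:
        mask |= 1 << (byte % B)
    return mask

def block_mask(words):
    """Bit mask of a text block: the OR of the signatures of its words."""
    mask = 0
    for word in words:
        mask |= signature(word)
    return mask

mask = block_mask(["this", "is", "a", "text"])
```

By construction, `signature(w) & mask == signature(w)` holds for every word `w` of the block, which is the property stated above.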

SLIDE 12

Signature Files. Searching

  • It is possible that all the corresponding bits are set even though the word is not there. This is called a false drop.
  • The most delicate part of the design of a signature file is to ensure that the probability of a false drop is low enough while keeping the signature file as short as possible.
  • Searching a single word is carried out by:
    – hashing it to a bit mask W;
    – comparing it with the bit masks Bi of all the text blocks;
    – if (W & Bi = W), the text block may contain the word.
  • The scheme is also efficient for searching phrases and reasonable proximity queries.
  • No other types of patterns can be searched in this scheme.
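The single-word search and a false drop can be shown with a small sketch. The 6-bit signatures below are hypothetical values chosen by hand so that a false drop occurs. The names `candidate_blocks` and `search` are also illustrative.

```python
SIGNATURES = {   # hypothetical 6-bit signatures, chosen by hand
    "text": 0b000101, "many": 0b110000, "words": 0b100100,
    "made": 0b001100, "letters": 0b100001,
}
blocks = [["text"], ["text", "many", "words"], ["words", "made", "letters"]]

def block_mask(words):
    """Bitwise OR of the signatures of all the words in one block."""
    mask = 0
    for word in words:
        mask |= SIGNATURES[word]
    return mask

signature_file = [block_mask(block) for block in blocks]

def candidate_blocks(word):
    """Blocks whose mask Bi satisfies W & Bi == W; they may contain the word."""
    w = SIGNATURES[word]
    return [i for i, b in enumerate(signature_file) if w & b == w]

def search(word):
    """Check each candidate block against the text to discard false drops."""
    return [i for i in candidate_blocks(word) if word in blocks[i]]
```

For "text", all three blocks pass the bit test, yet the last block does not contain the word: its bits were set by "words", "made" and "letters" together. The final check against the text removes this false drop.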


Signature Files. Construction

  • The construction of a signature file is easy:
    – the text is simply cut into blocks;
    – for each block, an entry of the signature file is generated;
    – this entry is the bitwise OR of the signatures of all the words in the block.
  • Adding text is also easy, since it is only necessary to keep adding records to the signature file.

SLIDE 13

Conclusions

  • Implementation of information retrieval models
    – Inverted files
      • Vocabulary & occurrences
      • Position index?
    – Suffix Trees & Suffix Arrays
      • Phrase search and keyword search collapse
    – Signature Files
      • Efficient
      • Not false-proof
