Index Construction
Dictionary, postings, scalable indexing, dynamic indexing
Web Search
Overview
[Figure: search engine architecture. Labels: Documents, Indexing, Indexes, User, Information, Query, Query analysis, Query processing, Ranking, Results, Application]
[Figure: search engine architecture with a crawler. Labels: Crawler, Multimedia documents, Documents, Indexing, Indexes, User, Information, Query, Query analysis, Query processing, Ranking, Results, Application]
Terms dictionary: multimedia, search engines, index, crawler, ranking, inverted-file, ...

Posting lists (one per dictionary term):

docId    10         40          33       ...
weight   0.837      0.634       0.447    ...
pos      2,56,890   1,89,456    4,5,6

docId    3          2           99       40       ...
weight   0.901      0.438       0.420    0.265    ...
pos      64,75      4,543,234   23,545

...

docId    ...
weight   ...
pos
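
The dictionary and posting lists above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Posting` and `build_index` are hypothetical, whitespace tokenisation stands in for real parsing, and the weight is a normalised term frequency chosen for the example (the slides do not say how the weights are computed). Postings are ordered by descending weight, as in the lists shown.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Posting:
    doc_id: int
    weight: float
    positions: List[int] = field(default_factory=list)


def build_index(docs: Dict[int, str]) -> Dict[str, List[Posting]]:
    """Map each term to a postings list of (docId, weight, positions)."""
    # term -> doc_id -> list of token positions
    occurrences = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            occurrences[term][doc_id].append(pos)

    index: Dict[str, List[Posting]] = {}
    for term, per_doc in occurrences.items():
        postings = []
        for doc_id, positions in per_doc.items():
            doc_len = len(docs[doc_id].split())
            # Assumed weight: term frequency divided by document length.
            postings.append(Posting(doc_id, len(positions) / doc_len, positions))
        # Highest weight first; ties broken by docId.
        postings.sort(key=lambda p: (-p.weight, p.doc_id))
        index[term] = postings
    return index
```

Looking up a term is then a dictionary access, for example `build_index(docs)["crawler"]`, which returns that term's postings list.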
New York Times provides an index of >150 years of newswire
Notes:
4: Parse and accumulate all termID-docID pairs.
5: Collect all termID-docID pairs with the same termID into the same postings list.
7: Open all blocks and keep a small read buffer for each block. Merge into the final file. (Avoid seeks; read and write sequentially.)
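
The notes above read like the steps of blocked sort-based indexing; that reading is an assumption, since the algorithm itself is not in the text. Under that assumption, a minimal Python sketch could look as follows. The names `write_run`, `read_run` and `bsbi_index`, the text run format, and the `block_size` parameter are all hypothetical. For brevity the merged result is returned as a dictionary rather than written to a final file.

```python
import heapq
import os
import tempfile
from itertools import groupby, islice


def write_run(pairs, directory):
    """Sort one block of (termID, docID) pairs in memory and write it as a run."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run")
    with os.fdopen(fd, "w") as f:
        for term_id, doc_id in sorted(set(pairs)):
            f.write(f"{term_id} {doc_id}\n")
    return path


def read_run(path):
    """Stream a run sequentially; file buffering acts as the small read buffer."""
    with open(path) as f:
        for line in f:
            term_id, doc_id = line.split()
            yield int(term_id), int(doc_id)


def bsbi_index(pair_stream, block_size, directory):
    pair_stream = iter(pair_stream)
    runs = []
    while True:
        # Step 4: accumulate the next block of termID-docID pairs.
        block = list(islice(pair_stream, block_size))
        if not block:
            break
        # Step 5: sorting groups pairs with the same termID together.
        runs.append(write_run(block, directory))

    # Step 7: open all runs and merge them in a single sequential pass.
    merged = heapq.merge(*(read_run(path) for path in runs))
    index = {}
    for term_id, group in groupby(merged, key=lambda pair: pair[0]):
        postings = []
        for _, doc_id in group:
            if not postings or postings[-1] != doc_id:  # drop duplicates
                postings.append(doc_id)
        index[term_id] = postings
    return index
```

`heapq.merge` only ever holds one pending pair per run, so memory use depends on the number of runs rather than on the collection size.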
[Figure: runs on disk being merged into a single merged run]
To hash or not to hash? What about wildcard queries? The look-up table of the Shakespeare collection is small enough to fit in the CPU cache.
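
One way to see the trade-off behind the hashing question: a hash table answers exact look-ups directly, but a trailing-wildcard query such as `ind*` forces a scan over every key, whereas a sorted dictionary answers it with two binary searches. The sketch below is illustrative only; `prefix_lookup_sorted` and `prefix_lookup_hashed` are hypothetical names, and only trailing wildcards are handled.

```python
from bisect import bisect_left


def prefix_lookup_sorted(sorted_terms, prefix):
    """Trailing-wildcard query on a sorted term list: two binary searches."""
    lo = bisect_left(sorted_terms, prefix)
    # The highest code point serves as an upper bound for any continuation.
    hi = bisect_left(sorted_terms, prefix + "\U0010ffff")
    return sorted_terms[lo:hi]


def prefix_lookup_hashed(term_table, prefix):
    """Same query on a hash table: every key has to be inspected."""
    return sorted(t for t in term_table if t.startswith(prefix))


terms = ["crawler", "index", "indexing", "inverted-file", "multimedia", "ranking"]
sorted_terms = sorted(terms)
term_table = {t: i for i, t in enumerate(terms)}  # term -> termID
```

Both functions return the same terms for a given prefix; the difference is that the sorted version does logarithmic work in the dictionary size while the hashed version does linear work.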
[Figure: distributed indexing. A master assigns splits to parsers (map phase); each parser writes segment files partitioned by term range (a-f, g-p, q-z); the master assigns each range to an inverter (reduce phase), which produces the postings for that range.]
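
The map and reduce phases in the figure can be imitated in a single process. The following is a rough sketch, not a distributed implementation: the functions `parse`, `invert` and `build` are hypothetical, the parsers and inverters run one after another instead of in parallel, and in-memory lists stand in for segment files. Only the term ranges a-f, g-p and q-z come from the slide.

```python
from collections import defaultdict

# Term ranges from the figure: a-f, g-p, q-z.
PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]


def partition_of(term):
    """Return the index of the term range that the term's first letter falls in."""
    for i, (lo, hi) in enumerate(PARTITIONS):
        if lo <= term[0] <= hi:
            return i
    raise ValueError(f"term outside a-z: {term!r}")


def parse(split):
    """Map phase: one parser turns a split into per-range segment 'files'."""
    segments = defaultdict(list)
    for doc_id, text in split:
        for term in text.lower().split():
            segments[partition_of(term)].append((term, doc_id))
    return segments


def invert(pairs):
    """Reduce phase: one inverter sorts its pairs and builds postings lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    return dict(postings)


def build(splits):
    segment_files = [parse(split) for split in splits]  # one parser per split
    index = {}
    for p in range(len(PARTITIONS)):  # one inverter per term range
        pairs = [pair for seg in segment_files for pair in seg.get(p, [])]
        index.update(invert(pairs))
    return index
```

Because each inverter sees every pair for its term range and nothing else, the per-range results can simply be combined without further merging.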
https://www.youtube.com/watch?v=zRwPSFpLX8I
each postings list.
postings lists of length 1 in one file etc.)
Chapter 4 Chapter 4 (dictionary data structures)