


  1. Part 4: Index Construction. Francesco Ricci. Most of these slides come from the course: Information Retrieval and Web Search, Christopher Manning and Prabhakar Raghavan.

  2. Ch. 4 Index construction
     - How do we construct an index?
     - What strategies can we use with limited main memory?

  3. Sec. 4.1 Hardware basics
     - Many design decisions in information retrieval are based on the characteristics of hardware
     - We begin by reviewing hardware basics

  4. Sec. 4.1 Hardware basics
     - Access to data in memory is much faster than access to data on disk
     - Disk seeks: no data is transferred from disk while the disk head is being positioned
     - Therefore, transferring one large chunk of data from disk to memory is faster than transferring many small chunks
     - Disk I/O is block-based: reading and writing of entire blocks (as opposed to smaller chunks)
     - Block sizes: 8 KB to 256 KB
     (video: Inside of a Hard Drive)

  5. Sec. 4.1 Hardware basics
     - Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB
     - Available disk space is several (2-3) orders of magnitude larger
     - Fault tolerance is very expensive: it's much cheaper to use many regular machines than one fault-tolerant machine

  6. Google Web Farm
     - The best guess is that Google now has more than 2 million servers (8 petabytes of RAM, i.e. 8 x 10^6 gigabytes)
     - Spread over at least 12 locations around the world
     - Connecting these centers is a high-capacity fiber optic network that the company has assembled over the last few years (video)
     (pictured: data centers in The Dalles, Oregon and Dublin, Ireland)

  7. Sec. 4.1 Hardware assumptions
     symbol   statistic                        value
     s        average seek time                5 ms = 5 x 10^-3 s
     b        transfer time per byte           0.02 µs = 2 x 10^-8 s/B
              processor's clock rate           10^9 s^-1
     p        low-level operation              0.01 µs = 10^-8 s
              (e.g., compare & swap a word)
              size of main memory              several GB
              size of disk space               1 TB or more
     - Example: reading 1 GB from disk
       - If stored in contiguous blocks: 2 x 10^-8 s/B x 10^9 B = 20 s
       - If stored in 1 M chunks of 1 KB: 20 s + 10^6 x 5 x 10^-3 s = 5020 s ~ 1.4 h
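
As a quick check of the example above, the same arithmetic in a few lines of Python, using the constants from the table:

```python
# Reading 1 GB from disk, with seek time s = 5 ms and
# transfer time b = 0.02 microseconds per byte.
seek = 5e-3          # s, average seek time
transfer = 2e-8      # s per byte

one_gb = 10**9       # bytes

contiguous = transfer * one_gb          # one long sequential read
chunked = contiguous + 10**6 * seek     # 1 M chunks of 1 KB: one extra seek per chunk

print(f"contiguous read: {contiguous:.0f} s")                     # 20 s
print(f"1 KB chunks: {chunked:.0f} s (~{chunked / 3600:.1f} h)")  # 5020 s (~1.4 h)
```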

  8. Sec. 4.2 A Reuters RCV1 document (figure: a sample RCV1 newswire article)

  9. Sec. 4.2 Reuters RCV1 statistics
     symbol   statistic                                      value
     N        documents                                      800,000
     L        avg. # tokens per doc                          200
     M        terms (= word types)                           400,000
              avg. # bytes per token (incl. spaces/punct.)   6
              avg. # bytes per token (without spaces/punct.) 4.5
              avg. # bytes per term                          7.5
     T        non-positional postings                        100,000,000
     - 4.5 bytes per word token vs. 7.5 bytes per word type: why?
     - Why is T < N*L?

  10. Sec. 4.2 Recall IIR 1 index construction
     - Documents are parsed to extract words, and these are saved with the document ID
     - Doc 1: "I did enact Julius Caesar. I was killed i' the Capitol; Brutus killed me."
     - Doc 2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious"
     (table: the resulting (term, docID) pairs, listed in document order)
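
A minimal sketch of this parsing step, using the two example documents from the slide. The tokenizer here is just lowercase word splitting, not the course's actual normalization pipeline:

```python
# Parse each document into (term, docID) pairs in document order.
import re

docs = {
    1: "I did enact Julius Caesar. I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

pairs = []
for doc_id, text in docs.items():
    for token in re.findall(r"[a-z']+", text.lower()):
        pairs.append((token, doc_id))

print(pairs[:5])   # [('i', 1), ('did', 1), ('enact', 1), ('julius', 1), ('caesar', 1)]
```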

  11. Sec. 4.2 Key step
     - After all documents have been parsed, the inverted file is sorted by terms
     - We focus on this sort step: we have 100 M items to sort for Reuters RCV1 (after having removed duplicate docIDs for each term)
     (table: the (term, docID) pairs before and after sorting by term)
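
A minimal sketch of this key step: sort the (term, docID) pairs, drop duplicates, and group docIDs into postings lists. The pairs list below is the output of the parsing sketch above, written out so the snippet runs on its own:

```python
from itertools import groupby

# (term, docID) pairs in document order, as produced by the parsing sketch above
pairs = [("i", 1), ("did", 1), ("enact", 1), ("julius", 1), ("caesar", 1),
         ("i", 1), ("was", 1), ("killed", 1), ("i'", 1), ("the", 1),
         ("capitol", 1), ("brutus", 1), ("killed", 1), ("me", 1),
         ("so", 2), ("let", 2), ("it", 2), ("be", 2), ("with", 2),
         ("caesar", 2), ("the", 2), ("noble", 2), ("brutus", 2),
         ("hath", 2), ("told", 2), ("you", 2), ("caesar", 2),
         ("was", 2), ("ambitious", 2)]

sorted_pairs = sorted(set(pairs))      # sort by term, then docID, dropping duplicates
inverted_index = {
    term: [doc_id for _, doc_id in group]
    for term, group in groupby(sorted_pairs, key=lambda p: p[0])
}

print(inverted_index["caesar"])        # [1, 2]
print(inverted_index["was"])           # [1, 2]
```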

  12. Sec. 4.2 Scaling index construction
     - In-memory index construction does not scale
     - How can we construct an index for very large collections?
     - Taking into account the hardware constraints we just learned about: memory, disk, speed, etc.

  13. Sec. 4.2 Sort-based index construction
     - As we build the index, we parse docs one at a time
       - While building the index, we cannot easily exploit compression tricks (you can, but it is much more complex)
       - The final postings list for any term is incomplete until the end
     - At 12 bytes per non-positional postings entry (term, doc, freq), this demands a lot of space for large collections
     - T = 100,000,000 in the case of RCV1, so 1.2 GB
       - So we can do this in memory in 2015, but typical collections are much larger - e.g., the New York Times provides an index of >150 years of newswire
     - Thus: we need to store intermediate results on disk

  14. Sec. 4.2 Use the same algorithm for disk?
     - Can we use the same index construction algorithm for larger collections, but using disk instead of memory?
       - I.e., scan the documents and, for each term, write the corresponding posting (term, doc, freq) to a file
       - Finally, sort the postings and build the postings lists for all the terms
     - No: sorting T = 100,000,000 records (term, doc, freq) on disk is too slow - too many disk seeks (see next slide)
     - We need an external sorting algorithm

  15. Sec. 4.2 Bottleneck
     - Parse and build postings entries one doc at a time
     - Then sort postings entries by term (then by doc within each term)
     - Doing this with random disk seeks would be too slow - we must sort T = 100 M records
     - If every comparison took 2 disk seeks, and N items could be sorted with N log2 N comparisons, how long would this take?
     (recall: average seek time s = 5 ms = 5 x 10^-3 s, transfer time per byte b = 0.02 µs = 2 x 10^-8 s, low-level operation p = 0.01 µs = 10^-8 s, e.g. compare & swap a word)

  16. Solution
     (2 * seek-time + comparison-time) * N log2 N seconds
       = (2 * 5 x 10^-3 + 10^-8) * 10^8 * log2(10^8)
       ~ (2 * 5 x 10^-3) * 10^8 * log2(10^8), since the comparison time (like the time for transferring data into main memory) is negligible
       = 10^6 * log2(10^8) = 10^6 * 26.5 = 2.65 x 10^7 s = 307 days!
     - What can we do?
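
The same estimate, spelled out in Python:

```python
# Sorting T = 100 M postings on disk with 2 seeks per comparison.
from math import log2

T = 10**8
seek = 5e-3            # s, average seek time
compare = 1e-8         # s, negligible next to the seeks

total = (2 * seek + compare) * T * log2(T)
# ~2.66e+07 s, about 307 days (the slide rounds log2(10^8) to 26.5)
print(f"{total:.2e} s ~ {total / 86400:.0f} days")
```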

  17. Gaius Julius Caesar: Divide et Impera (divide and conquer)

  18. Sec. 4.2 BSBI: Blocked sort-based indexing (sorting with fewer disk seeks)
     - 12-byte (4+4+4) records (term-id, doc-id, freq)
     - These are generated as we parse docs
     - Must now sort 100 M such 12-byte records by term
     - Define a block as ~10 M such records
       - Can easily fit a couple into memory
       - Will have 10 such blocks to start with (RCV1)
     - Basic idea of the algorithm (a minimal sketch in code follows this slide):
       - Accumulate postings for each block (write to a file), (read and) sort, write to disk
       - Then merge the sorted blocks into one long sorted order
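
A minimal sketch of the BSBI driver under some assumptions: token_stream is a hypothetical generator yielding (term_id, doc_id) postings (the freq field is omitted for brevity), block file names are illustrative, and merge_blocks is sketched after slide 22 below:

```python
import pickle

BLOCK_SIZE = 10_000_000          # ~10 M postings per block, as on the slide

def write_sorted_block(postings, block_no):
    postings.sort()                       # in-memory sort by (term_id, doc_id)
    name = f"block_{block_no}.bin"        # illustrative file name
    with open(name, "wb") as f:
        pickle.dump(postings, f)
    return name

def bsbi_index(token_stream):
    """token_stream yields (term_id, doc_id) postings; freq is omitted here."""
    block_files, block = [], []
    for posting in token_stream:
        block.append(posting)
        if len(block) == BLOCK_SIZE:      # block full: sort it and write it out
            block_files.append(write_sorted_block(block, len(block_files)))
            block = []
    if block:                             # flush the last, partially filled block
        block_files.append(write_sorted_block(block, len(block_files)))
    return merge_blocks(block_files)      # multiway merge, sketched after slide 22
```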

  19. Sec. 4.2 (figure: the postings blocks obtained by parsing different documents; the blocks contain term-ids instead of terms)

  20. Sec. 4.2 Sorting 10 blocks of 10 M records
     - First, read each block and sort it in memory:
       - Quicksort takes 2 N log2 N expected steps
       - In our case, 2 x (10M log2 10M) steps
     - Exercise: estimate the total time to read each block from disk and quicksort it - approximately 7 s (checked below)
     - 10 times this estimate gives us the time to produce 10 sorted runs of 10 M records each
     - Done straightforwardly, this needs 2 copies of the data on disk, but we can optimize this
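
Checking that per-block estimate with the hardware constants from slide 7:

```python
# Read 10 M 12-byte postings from disk, then quicksort them in memory.
from math import log2

N = 10**7                              # postings per block
read_time = N * 12 * 2e-8              # 120 MB at 0.02 us/byte -> 2.4 s
sort_time = 2 * N * log2(N) * 1e-8     # 2 N log2 N compares at 10^-8 s -> ~4.6 s

print(f"~{read_time + sort_time:.0f} s per block, "
      f"~{10 * (read_time + sort_time):.0f} s for all 10 blocks")
```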

  21. Sec. 4.2 Blocked sort-based indexing (figure: BSBI pseudocode, keeping the dictionary in memory; n = number of generated blocks)

  22. Sec. 4.2 How to merge the sorted runs?
     - Open all block files and maintain small read buffers, plus a write buffer for the final merged index
     - In each iteration, select the lowest termID that has not been processed yet
     - All postings lists for this termID are read and merged, and the merged list is written back to disk
     - Each read buffer is refilled from its file when necessary
     - Provided you read decent-sized chunks of each block into memory and then write out a decent-sized output chunk, you are not killed by disk seeks (a minimal sketch follows below)
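
A minimal sketch of the multiway merge, assuming the pickled block files written by the BSBI sketch above. heapq.merge plays the role of the read buffers here, streaming the already-sorted blocks so that all postings for the lowest unprocessed termID arrive together; a real system would write the merged postings lists back to disk instead of building a dictionary in memory:

```python
import heapq
import pickle
from itertools import groupby

def merge_blocks(block_files):
    def read_block(name):
        with open(name, "rb") as f:
            yield from pickle.load(f)   # a real system would refill a small buffer instead

    # Stream the sorted blocks in global (term_id, doc_id) order.
    merged = heapq.merge(*(read_block(name) for name in block_files))

    index = {}
    for term_id, group in groupby(merged, key=lambda p: p[0]):
        index[term_id] = [doc_id for _, doc_id in group]   # merged postings list
    return index
```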

  23. Sec. 4.3 Remaining problem with the sort-based algorithm
     - Our assumption was: we can keep the dictionary in memory
     - We need the dictionary (which grows dynamically) in order to implement a term-to-termID mapping
     - Actually, we could work with (term, docID) postings instead of (termID, docID) postings...
     - ...but then the intermediate files become larger - we would end up with a scalable, but slower, index construction method. Why?

  24. Sec. 4.3 SPIMI: Single-pass in-memory indexing
     - Key idea 1: generate separate dictionaries for each block - no need to maintain a term-termID mapping across blocks
     - Key idea 2: don't sort the postings - accumulate postings in postings lists as they occur
       - But at the end, before writing to disk, sort the terms
     - With these two ideas we can generate a complete inverted index for each block
     - These separate indexes can then be merged into one big index (because the terms are sorted)

  25. Sec. 4.3 SPIMI-Invert (figure: SPIMI-Invert pseudocode)
     - When memory has been exhausted, write the index of the block (dictionary, postings lists) to disk
     - Merging of blocks is then analogous to BSBI (plus dictionary merging) - a sketch follows below
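
A minimal sketch of one SPIMI-Invert call under some assumptions: token_stream yields (term, doc_id) pairs, and a simple postings counter stands in for the "memory exhausted" test. The caller would write the returned block to disk and later merge the blocks as in BSBI:

```python
def spimi_invert(token_stream, max_postings=10_000_000):
    """Build one block: a per-block dictionary of terms and their postings lists."""
    dictionary = {}                      # term -> postings list, local to this block
    n_postings = 0
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])   # new terms are added on the fly
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)                  # postings accumulate as they occur
            n_postings += 1
        if n_postings >= max_postings:               # stand-in for "memory exhausted"
            break
    sorted_terms = sorted(dictionary)                # sort the terms only once, at the end
    return sorted_terms, dictionary                  # caller writes these to disk as one block
```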

  26. Sec. 4.3 SPIMI: Compression
     - Compression makes SPIMI even more efficient:
       - Compression of terms
       - Compression of postings
