
NPFL103: Information Retrieval (3)

Index construction, Distributed and dynamic indexing, Index compression

Pavel Pecina

pecina@ufal.mff.cuni.cz
Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics
Charles University

Original slides are courtesy of Hinrich Schütze, University of Stuttgart.


Contents

Index construction
  BSBI algorithm
  SPIMI algorithm
Distributed indexing
  MapReduce
Dynamic indexing
  Logarithmic merge
Index compression
  Term statistics
  Dictionary compression
  Postings compression


Index construction


Hardware basics

▶ Data access is much faster in memory than on a hard disk (approx. 10×).
▶ Disk seeks are “idle” time: no data is transferred from disk while the disk head is being positioned.
▶ To optimize transfer time from disk to memory: one large chunk is faster than many small chunks.
▶ Disk I/O is block-based: reading and writing of entire blocks (as opposed to smaller chunks). Block sizes: 8 KB to 256 KB.
▶ Servers used in IR systems typically have tens or hundreds of GBs of RAM, and TBs of disk space.
▶ Fault tolerance is expensive: it’s cheaper to use many regular machines than one fault-tolerant machine.


Some HW statistics

symbol   statistic                                          value
s        average seek time                                  5 ms = 5 × 10⁻³ s
b        transfer time per byte                             0.02 µs = 2 × 10⁻⁸ s
         processor’s clock rate                             10⁹ s⁻¹
p        low-level operation (e.g., compare + swap a word)  0.01 µs = 10⁻⁸ s
         size of main memory                                several GBs
         size of disk space                                 1 TB or more

▶ SSDs (Solid State Drives) are faster but smaller, more expensive, and have limited write cycles.


RCV1 collection

▶ Shakespeare’s collected works are not large enough for demonstrating many of the points in this course.
▶ As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection.
▶ English newswire articles published in 1995–1996 (one year).
▶ https://trec.nist.gov/data/reuters/reuters.html


A Reuters RCV1 document


Reuters RCV1 statistics

N   documents                                 800,000
L   tokens per document                       200
M   terms (= word types)                      400,000
    bytes per token (incl. spaces/punct.)     6
    bytes per token (without spaces/punct.)   4.5
    bytes per term (= word type)              7.5
T   non-positional postings                   100,000,000

Exercise:

  1. What is the average doc. frequency of a term (how many tokens)?
  2. 4.5 bytes per token vs. 7.5 bytes per type: why the difference?
  3. How many positional postings are there?


Goal: construct the inverted index

Brutus    → 1  2  4  11  31  45  173  174
Caesar    → 1  2  4  5  6  16  57  132  …
Calpurnia → 2  31  54  101  …

(dictionary on the left, postings lists on the right)


Index construction: Sort postings in memory

term–docID pairs in document order:
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i’ 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

⟹ sorted by term (then docID):

ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, i’ 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, you 2, was 1, was 2, with 2


Sort-based index construction

▶ As we build the index, we parse documents one at a time.
▶ The final postings for any term are incomplete until the end.
▶ Can we keep all postings in memory and then do the sort in memory at the end?
▶ No, not for large collections.
▶ At 10–12 bytes per postings entry, we need a lot of space for large collections.
▶ T = 100,000,000 in the case of RCV1: we can do this in memory on a typical current machine.
▶ In-memory index construction does not scale for large collections.
▶ Thus: We need to store intermediate results on disk.


Same algorithm for disk?

▶ Can we use the same index construction algorithm for larger collections, but by using disk instead of memory?
▶ No: sorting T = 100,000,000 records on disk is too slow – too many disk seeks.
▶ We need an external sorting algorithm.


“External” sorting algorithm (using few disk seeks)

▶ We must sort T = 100,000,000 non-positional postings.
▶ Each posting has size 12 bytes (4+4+4: termID, docID, doc. frequency).
▶ Define a block to consist of 10,000,000 such postings.
  ▶ We can easily fit that many postings into memory.
  ▶ We will have 10 such blocks for RCV1.
▶ Basic idea of the algorithm:
  ▶ For each block: (i) accumulate postings, (ii) sort in memory, (iii) write to disk.
  ▶ Then merge the blocks into one long sorted order.


Merging two blocks

Block 1 (sorted run on disk): brutus→d3, caesar→d4, noble→d3, with→d4
Block 2 (sorted run on disk): brutus→d2, caesar→d1, julius→d1, killed→d2

Merged postings (written back to disk):
brutus→d2, brutus→d3, caesar→d1, caesar→d4, julius→d1, killed→d2, noble→d3, with→d4
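To make the merge concrete, here is a minimal Python sketch (not from the original slides): plain lists stand in for the sorted runs on disk, and heapq.merge performs the n-way merge while reading each run strictly sequentially, so no disk seeks are needed.

import heapq

# Two sorted runs, standing in for blocks already written to disk.
block1 = [("brutus", 3), ("caesar", 4), ("noble", 3), ("with", 4)]
block2 = [("brutus", 2), ("caesar", 1), ("julius", 1), ("killed", 2)]

# n-way merge of already-sorted inputs; each run is consumed sequentially.
merged = list(heapq.merge(block1, block2))
# [('brutus', 2), ('brutus', 3), ('caesar', 1), ('caesar', 4),
#  ('julius', 1), ('killed', 2), ('noble', 3), ('with', 4)]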


Blocked Sort-Based Indexing (BSBI)

BSBIndexConstruction()
 1  n ← 0
 2  while (all documents have not been processed)
 3    do n ← n + 1
 4       block ← ParseNextBlock()
 5       BSBI-Invert(block)
 6       WriteBlockToDisk(block, fn)
 7  MergeBlocks(f1, …, fn; fmerged)

▶ BSBI-Invert:

  1. sort [termID, docID] pairs
  2. collect pairs with the same termID into a postings list

▶ Key decision: What is the size of one block?
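A compact Python sketch of the same control flow, under the assumption that the collection has already been turned into an iterator of (termID, docID) pairs; the block size, file names, and pickle format are illustrative only, and the final MergeBlocks step is omitted.

import pickle
from itertools import islice

def bsbi_invert(block):
    # (i) sort [termID, docID] pairs, (ii) collect pairs with the same termID
    block.sort()
    index = {}
    for term_id, doc_id in block:
        index.setdefault(term_id, []).append(doc_id)
    return sorted(index.items())          # [(termID, [docID, ...]), ...]

def bsbi_index_construction(pair_stream, block_size=10_000_000):
    pair_stream = iter(pair_stream)
    files, n = [], 0
    while True:
        block = list(islice(pair_stream, block_size))   # ParseNextBlock()
        if not block:
            break
        n += 1
        with open(f"block{n}.pkl", "wb") as f:           # WriteBlockToDisk()
            pickle.dump(bsbi_invert(block), f)
        files.append(f"block{n}.pkl")
    return files   # MergeBlocks(f1, ..., fn) would follow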


Problem with sort-based algorithm

▶ Our assumption was: we can keep the dictionary in memory.
▶ We need the dictionary (which grows dynamically) in order to implement a term-to-termID mapping.
▶ Actually, we could work with [term, docID] postings instead of [termID, docID] postings …
▶ … but then intermediate files become very large. (We would end up with a scalable, but very slow index construction method.)


Single-pass in-memory indexing (SPIMI)

▶ Key idea 1: Generate separate dictionaries for each block – no need to maintain a term–termID mapping across blocks.
▶ Key idea 2: Don’t sort. Accumulate postings in postings lists as they occur.
▶ With these two ideas we can generate a complete inverted index for each block.
▶ These separate indexes can then be merged into one big index.


SPIMI-Invert

SPIMI-Invert(token_stream)
 1  output_file ← NewFile()
 2  dictionary ← NewHash()
 3  while (free memory available)
 4    do token ← next(token_stream)
 5       if term(token) ∉ dictionary
 6          then postings_list ← AddToDictionary(dictionary, term(token))
 7          else postings_list ← GetPostingsList(dictionary, term(token))
 8       if full(postings_list)
 9          then postings_list ← DoublePostingsList(dictionary, term(token))
10       AddToPostingsList(postings_list, docID(token))
11  sorted_terms ← SortTerms(dictionary)
12  WriteBlockToDisk(sorted_terms, dictionary, output_file)
13  return output_file

▶ Merging of blocks is analogous to BSBI.
▶ Compression of terms/postings makes SPIMI even more efficient.
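A minimal Python sketch of one SPIMI pass, assuming the token stream yields (term, docID) pairs; the free-memory test is approximated by a posting count, and since Python lists grow automatically, the explicit DoublePostingsList step disappears.

import pickle

def spimi_invert(token_stream, output_file, max_postings=10_000_000):
    dictionary = {}                  # term -> postings list (no termID mapping needed)
    for n, (term, doc_id) in enumerate(token_stream, start=1):
        dictionary.setdefault(term, []).append(doc_id)   # accumulate, don't sort postings
        if n >= max_postings:        # stand-in for "while free memory available"
            break
    with open(output_file, "wb") as f:
        pickle.dump(sorted(dictionary.items()), f)       # sort terms, write block to disk
    return output_file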


Distributed indexing


Distributed indexing

▶ For web-scale indexing we must use a distributed computer cluster.
▶ Individual machines are fault-prone: they can unpredictably slow down or fail.
▶ How do we exploit such a pool of machines?


Google data centers (estimates from 2016, Gartner)

▶ Google data centers mainly contain commodity machines.
▶ 2.5 million servers in 15 data centers are distributed all over the world.
▶ This is about 10% of the computing capacity of the world!
▶ If each node of a non-fault-tolerant system with 1000 nodes has 99.9% uptime, what is the uptime of the whole system? Answer: 37% (0.999^1000 ≈ 0.3677)
▶ Suppose a server fails after 3 years. For an installation of 1 million servers, what is the interval between machine failures? Answer: < 2 minutes ((3 × 365 × 24 × 60) / 1,000,000 = 1.5768 minutes)


Distributed indexing

▶ Maintain a master machine directing the job – considered “safe”.
▶ Break up indexing into sets of parallel tasks.
▶ The master machine assigns each task to an idle machine from a pool.


Parallel tasks

▶ We will define two sets of parallel tasks and deploy two types of machines to solve them: parsers and inverters.
▶ Break the input document collection into splits (corresponding to blocks in BSBI/SPIMI).
▶ Each split is a subset of documents.


Process

Master:
  1. Assigns a split to an idle parser machine.

Parser:
  1. Reads one document at a time and emits [term, docID] pairs.
  2. Writes the pairs into j partitions, each covering a range of terms’ first letters (e.g., a–f, g–p, q–z; here j = 3).

Inverter:
  1. Collects all [term, docID] pairs (= postings) for one term partition (e.g., for a–f).
  2. Sorts them and writes the postings lists.
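A toy Python sketch of this process (illustrative only, not actual MapReduce code), using the j = 3 term partitions from the slide; parse plays the map role of a parser and invert the reduce role of an inverter.

from collections import defaultdict

def partition(term):                           # j = 3 ranges of first letters
    return "a-f" if term[0] <= "f" else "g-p" if term[0] <= "p" else "q-z"

def parse(split):
    """Map: emit [term, docID] pairs into per-partition segment buckets."""
    segments = defaultdict(list)
    for doc_id, text in split:
        for term in text.lower().split():
            segments[partition(term)].append((term, doc_id))
    return segments

def invert(pairs):
    """Reduce: sort the pairs of one partition and build postings lists."""
    index = defaultdict(list)
    for term, doc_id in sorted(pairs):
        index[term].append(doc_id)
    return dict(index)

split = [(1, "caesar was killed"), (2, "brutus killed caesar")]
print(invert(parse(split)["a-f"]))             # {'brutus': [2], 'caesar': [1, 2]}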


Data flow

[Data-flow figure: the master assigns splits to parsers (map phase) and partitions to inverters (reduce phase); the parsers write segment files partitioned into a–f, g–p, q–z, which the inverters read to produce the postings.]


MapReduce

▶ The index construction algorithm we just described is an instance of MapReduce.
▶ MapReduce is a robust and conceptually simple framework for distributed computing …
▶ … without having to write code for the distribution part.
▶ The original Google indexing system consisted of a number of phases, each implemented in MapReduce.


Dynamic indexing


Dynamic indexing

▶ Up to now, we have assumed that collections are static.
▶ They rarely are: documents are inserted, deleted and modified.
▶ The dictionary and postings lists have to be dynamically modified.


Dynamic indexing: Simplest approach

▶ Maintain a big main index on disk.
▶ New docs go into a small auxiliary index in memory.
▶ Search across both, merge the results.
▶ Periodically, merge the auxiliary index into the big index.
▶ Deletions:
  ▶ Keep an invalidation bit-vector for deleted docs.
  ▶ Filter docs returned by the index using this bit-vector.
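A small Python sketch of this scheme, assuming an index maps a term to a sorted list of docIDs; in reality the main index lives on disk and the deletion filter is a bit vector rather than a Python set.

class DynamicIndex:
    def __init__(self, main_index):
        self.main = main_index       # big main index (on disk in practice)
        self.aux = {}                # small in-memory auxiliary index
        self.deleted = set()         # invalidation "bit-vector" for deleted docs

    def add(self, doc_id, terms):
        for t in terms:
            self.aux.setdefault(t, []).append(doc_id)

    def delete(self, doc_id):
        self.deleted.add(doc_id)

    def postings(self, term):
        # search across both indexes, merge results, filter deleted docs
        hits = self.main.get(term, []) + self.aux.get(term, [])
        return sorted(d for d in hits if d not in self.deleted)

idx = DynamicIndex({"caesar": [1, 2]})
idx.add(3, ["caesar", "brutus"])
idx.delete(2)
print(idx.postings("caesar"))        # [1, 3]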


Issue with multiple indexes

▶ Corpus-wide statistics are hard to maintain.
▶ E.g., for hit-based spelling correction: how do we determine which correction has the most hits in the collection?
▶ We will see that other such statistics are important in ranking.
▶ There is no easy way around this if we want to do dynamic indexing efficiently.


Issue with auxiliary and main index

▶ Frequent merges
▶ Poor search performance during index merge
▶ Actually:
  ▶ Merging the auxiliary index into the main index is not that costly if we keep a separate file for each postings list.
  ▶ But then we would need a lot of files – inefficient.
▶ Assumption for the rest of the lecture: The index is one big file.
▶ In reality: use a scheme somewhere in between (e.g., split very large postings lists into several files, collect small postings lists in one file).


Logarithmic merge

▶ Logarithmic merging amortizes the cost of merging indexes over time.
  → Users see a smaller effect on response times.
▶ Maintain a series of indexes, each twice as large as the previous one.
▶ Keep the smallest (Z0) in memory.
▶ Keep the larger ones (I0, I1, …) on disk.
▶ If Z0 gets too big (> n), write it to disk as I0 …
  … or merge it with I0 (if I0 already exists) and write the merger to I1, etc.


LMergeAddToken(indexes, Z0, token)
 1  Z0 ← Merge(Z0, {token})
 2  if |Z0| = n
 3    then for i ← 0 to ∞
 4      do if Ii ∈ indexes
 5           then Zi+1 ← Merge(Ii, Zi)
 6                (Zi+1 is a temporary index on disk.)
 7                indexes ← indexes − {Ii}
 8           else Ii ← Zi (Zi becomes the permanent index Ii.)
 9                indexes ← indexes ∪ {Ii}
10                Break
11  Z0 ← ∅

LogarithmicMerge()
 1  Z0 ← ∅ (Z0 is the in-memory index.)
 2  indexes ← ∅
 3  while true
 4    do LMergeAddToken(indexes, Z0, getNextToken())
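A Python sketch of the same cascading merge, with indexes represented as dictionaries {term: postings list} and a tiny threshold n = 4 so the behaviour is visible; on a real system Z0 is in memory and each Ii is a file on disk.

def merge(a, b):
    """Merge two indexes of the form {term: sorted list of docIDs}."""
    out = dict(a)
    for term, postings in b.items():
        out[term] = sorted(set(out.get(term, []) + postings))
    return out

def l_merge_add_token(indexes, z0, term, doc_id, n=4):
    z0.setdefault(term, []).append(doc_id)
    if sum(len(p) for p in z0.values()) < n:   # |Z0| < n: nothing to do yet
        return z0
    z, i = z0, 0
    while i in indexes:                        # Ii exists: merge and keep cascading
        z = merge(indexes.pop(i), z)
        i += 1
    indexes[i] = z                             # Zi becomes the permanent index Ii
    return {}                                  # fresh, empty Z0

indexes, z0 = {}, {}
for term, doc in [("a", 1), ("b", 1), ("a", 2), ("c", 2), ("b", 3), ("c", 3)]:
    z0 = l_merge_add_token(indexes, z0, term, doc)
print(sorted(indexes))                         # [0] -- I0 holds the first n postings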


Logarithmic merge

▶ The number of indexes is bounded by O(log T) (T is the total number of postings read so far).
▶ So query processing requires merging O(log T) indexes.
▶ Time complexity of index construction is O(T log T) … because each of the T postings is merged O(log T) times.
▶ Auxiliary index: index construction time is O(T²), as each posting is touched in each merge.
▶ Suppose the auxiliary index has size a:
  a + 2a + 3a + 4a + … + na = a·n(n+1)/2 = O(n²)
▶ So logarithmic merging is an order of magnitude more efficient.


Dynamic indexing at large search engines

Often a combination of:
  1. Frequent incremental changes
  2. Rotation of large parts of the index that can then be swapped in
  3. Occasional complete rebuild (becomes harder with increasing size)


Building positional indexes

▶ Basically the same problem, except that the intermediate data structures are large.


Index compression


Inverted index

For each term t, we store a list of all documents that contain t.

Brutus    → 1  2  4  11  31  45  173  174
Caesar    → 1  2  4  5  6  16  57  132  …
Calpurnia → 2  31  54  101  …

(dictionary on the left, postings file on the right)

▶ How much space do we need for the dictionary?
▶ How much space do we need for the postings file?
▶ How can we compress them?


Why compression? (in general)

▶ Use less disk space (saves money).
▶ Keep more stuff in memory (increases speed).
▶ Speed up transferring data from disk to memory (increases speed):
  [read compressed data and decompress in memory] is faster than [read uncompressed data].
▶ Premise: decompression algorithms are fast.
  … this is true of the decompression algorithms we will use.


Why compression in information retrieval?

▶ First, we will consider space for the dictionary.
  ▶ Main motivation for dictionary compression: make it small enough to keep in main memory.
▶ Then for the postings file.
  ▶ Motivation: reduce the disk space needed, decrease the time needed to read from disk.
  ▶ Note: large search engines keep a significant part of the postings in memory.
▶ We will use various compression schemes for the dictionary and the postings.


Lossy vs. lossless compression

▶ Lossy compression: discard some information.
  ▶ Several of the preprocessing steps we frequently use can be viewed as lossy compression: lowercasing, stop word removal, stemming, number elimination.
▶ Lossless compression: all information is preserved.
  ▶ This is what we mostly do in index compression.


Model collection: The Reuters collection

symbol   statistic                                            value
N        documents                                            800,000
L        avg. # word tokens per document                      200
M        word types                                           400,000
         avg. # bytes per word token (incl. spaces/punct.)    6
         avg. # bytes per word token (without spaces/punct.)  4.5
         avg. # bytes per word type                           7.5
T        non-positional postings                              100,000,000


Effect of preprocessing for Reuters

                  word types (dictionary)   non-positional postings      positional postings
                  size      ∆%    Σ∆%       size          ∆%    Σ∆%      size          ∆%    Σ∆%
unfiltered        484,494                   109,971,179                  197,879,290
no numbers        473,723   -2%   -2%       100,680,242   -8%   -8%      179,158,204   -9%   -9%
case folding      391,523  -17%  -19%        96,969,056   -3%  -12%      179,158,204   -0%   -9%
30 stop words     391,493   -0%  -19%        83,390,443  -14%  -24%      121,857,825  -31%  -38%
150 stop words    391,373   -0%  -19%        67,001,847  -30%  -39%       94,516,599  -47%  -52%
stemming          322,383  -17%  -33%        63,812,300   -4%  -42%       94,516,599   -0%  -52%


How big is the term vocabulary?

▶ That is, how many distinct words are there?
▶ Can we assume there is an upper bound?
▶ Not really: there are at least 70^20 ≈ 10^37 different words of length 20.
▶ The vocabulary will keep growing with collection size.
▶ Heaps’ law: M = kT^b
  ▶ An empirical law.
  ▶ M – size of the vocabulary, T – number of tokens in the collection.
  ▶ Linear in log-log space.
  ▶ Typical values for the parameters: 30 ≤ k ≤ 100 and b ≈ 0.5.


Heaps’ law for Reuters

[Plot: log10 M against log10 T for Reuters RCV1; the points fall close to a straight line in log-log space.]

Vocabulary size M as a function of collection size T: M = kT^b.

The best least-squares fit for Reuters RCV1:
  log10 M = 0.49 · log10 T + 1.64
  M = 10^1.64 · T^0.49, i.e., k = 10^1.64 ≈ 44 and b = 0.49.


Empirical fit for Reuters

▶ Good, as we just saw in the graph.
▶ For the first 1,000,020 tokens, Heaps’ law predicts 38,323 terms:
  44 × 1,000,020^0.49 ≈ 38,323
▶ The actual number is 38,365 terms, very close to the prediction.
▶ Empirical observation: the fit is good in general.
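The prediction is easy to reproduce with the rounded parameters from the slide:

# Heaps' law check with the fitted parameters k = 44, b = 0.49
k, b = 44, 0.49
T = 1_000_020                  # tokens processed so far
print(round(k * T ** b))       # ≈ 38,323; the observed vocabulary is 38,365 terms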


Zipf’s law

▶ We have characterized the growth of the vocabulary in collections.
▶ We also want to know how many frequent vs. infrequent terms we should expect in a collection.
▶ In natural language, there are a few very frequent terms and very many very rare terms.
▶ Zipf’s law: cf_i ∝ 1/i
▶ The i-th most frequent term has frequency cf_i proportional to 1/i.
▶ Collection frequency cf_i: the number of occurrences of term t_i in the collection.


Zipf’s law: example

▶ Zipf’s law: cf_i ∝ 1/i
▶ The i-th most frequent term has frequency cf_i proportional to 1/i.
▶ So if the most frequent term (the) occurs cf_1 times, then the second most frequent term (of) has half as many occurrences: cf_2 = (1/2) cf_1 …
▶ … and the third most frequent term (and) has a third as many occurrences: cf_3 = (1/3) cf_1, etc.
▶ Equivalently: cf_i = c · i^k and log cf_i = log c + k log i (for k = −1)
▶ An example of a power law.


Zipf’s law for Reuters

[Plot: log10 cf against log10 rank for Reuters RCV1.]

Fit is not great. What is important is the key insight: Few frequent terms, many rare terms.


Dictionary compression

▶ The dictionary is small compared to the postings file.
▶ But we want to keep it in memory.
▶ Also: competition with other applications, cell phones, onboard computers, fast startup time.
▶ So compressing the dictionary is important.


Recall: Dictionary as array of fixed-width entries

Dictionary:

term      document frequency   pointer to postings list
a         656,265              →
aachen    65                   →
…         …                    …
zulu      221                  →

space needed: 20 bytes (term) + 4 bytes (frequency) + 4 bytes (pointer)
Space for Reuters: (20 + 4 + 4) × 400,000 = 11.2 MB


Fixed-width entries are bad.

▶ Most of the bytes in the term column are wasted.
  ▶ We allot 20 bytes even for terms of length 1.
  ▶ We can’t handle hydrochlorofluorocarbons and supercalifragilisticexpialidocious.
▶ Average length of a term in English: 8 characters.
▶ How can we use on average 8 characters per term?


Dictionary as a string

The terms are concatenated into one long string:
  …systilesyzygeticsyzygialsyzygyszaibelyiteszecinszono…

For each term the dictionary stores:
  freq.           9    92    5    71    12   …   (4 bytes each)
  postings ptr.   →    →     →    →     →    …   (4 bytes each)
  term ptr.       a pointer into the string      (3 bytes each)


Space for dictionary as a string

▶ 4 bytes per term for the frequency
▶ 4 bytes per term for the pointer to the postings list
▶ 8 bytes (on average) for the term in the string
▶ 3 bytes per pointer into the string
  (we need log2(8 · 400,000) < 24 bits to resolve 8 · 400,000 positions)
▶ Space: 400,000 × (4 + 4 + 3 + 8) = 7.6 MB (compared to 11.2 MB for the fixed-width array)


Dictionary as a string with blocking

Term lengths are stored inside the string, and only one term pointer is kept per block of k terms:
  …7systile9syzygetic8syzygial6syzygy11szaibelyite6szecin…

  freq.           9    92    5    71    12   …
  postings ptr.   →    →     →    →     →    …
  term ptr.       one pointer per block


Space for dictionary as a string with blocking

▶ Example block size k = 4.
▶ Where we used 4 × 3 bytes for term pointers without blocking …
▶ … we now use 3 bytes for one pointer plus 4 bytes for indicating the length of each term.
▶ We save 12 − (3 + 4) = 5 bytes per block.
▶ Total savings: 400,000/4 × 5 bytes = 0.5 MB.
▶ This reduces the size of the dictionary from 7.6 MB to 7.1 MB.
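An illustrative Python sketch of building such a blocked string for the terms shown on the previous slide (assumptions: the terms are already sorted, and the term length is written as a decimal string rather than a single byte):

def blocked_string(terms, k=4):
    blob, block_ptrs, pos = [], [], 0
    for i, term in enumerate(terms):
        if i % k == 0:
            block_ptrs.append(pos)        # only one term pointer per block of k terms
        entry = f"{len(term)}{term}"      # stored term length replaces per-term pointers
        blob.append(entry)
        pos += len(entry)
    return "".join(blob), block_ptrs

s, ptrs = blocked_string(["systile", "syzygetic", "syzygial", "syzygy",
                          "szaibelyite", "szecin"])
print(s)      # 7systile9syzygetic8syzygial6syzygy11szaibelyite6szecin
print(ptrs)   # [0, 34] -- pointers to the start of each block in the string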


Lookup of a term without blocking

[Binary tree with the terms aid, box, den, ex, job, ox, pit, win at the leaves; each term is located by binary search alone.]


Lookup of a term with blocking: (slightly) slower

[The same tree over aid, box, den, ex, job, ox, pit, win, but the leaves now point to blocks of k = 4 terms; after the binary search we must scan linearly within a block, so lookup is slightly slower.]


Front coding

One block in blocked compression (k = 4):
  8automata 8automate 9automatic 10automation
⇓ … further compressed with front coding:
  8automat∗a 1⋄e 2⋄ic 3⋄ion
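A small Python sketch of front coding one such block; os.path.commonprefix finds the shared prefix, and the ∗ / ⋄ markers follow the notation above.

import os.path

def front_code(block):
    """Front-code one dictionary block: write the common prefix once,
    then only the extra suffix (and its length) for each later term."""
    prefix = os.path.commonprefix(block)
    coded = f"{len(block[0])}{prefix}*{block[0][len(prefix):]}"
    for term in block[1:]:
        coded += f"{len(term) - len(prefix)}\u22c4{term[len(prefix):]}"
    return coded

print(front_code(["automata", "automate", "automatic", "automation"]))
# 8automat*a1⋄e2⋄ic3⋄ion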


Dictionary compression for Reuters: Summary

data structure                            size in MB
dictionary, fixed-width                   11.2
dictionary, term pointers into string      7.6
∼, with blocking, k = 4                    7.1
∼, with blocking & front coding            5.9


Postings compression

▶ The postings file is much larger than the dictionary (by a factor of at least 10).
▶ Key desideratum: store each posting compactly.
▶ A posting for our purposes is a docID.
▶ For Reuters (800,000 documents), we would use 32 bits per docID when using 4-byte integers.
▶ Alternatively, we can use log2 800,000 ≈ 19.6 < 20 bits per docID.
▶ Our goal: use a lot less than 20 bits per docID.


Key idea: Store gaps instead of docIDs

▶ Each postings list is ordered in increasing order of docID.
▶ Example postings list: computer: 283154, 283159, 283202, …
▶ It suffices to store gaps: 283159 − 283154 = 5, 283202 − 283159 = 43
▶ Example postings list using gaps: computer: 283154, 5, 43, …
▶ Gaps for frequent terms are small.
▶ Thus: We can encode small gaps with fewer than 20 bits.


Gap encoding

                 encoding   postings list
the              docIDs     …  283042  283043  283044  283045  …
                 gaps                  1       1       1       …
computer         docIDs     …  283047  283154  283159  283202  …
                 gaps                  107     5       43      …
arachnocentric   docIDs     252000  500100
                 gaps               248100
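A short Python sketch of gap encoding followed by variable byte (VB) encoding, the scheme that appears in the summary table on the next slide; the layout (7 payload bits per byte, high bit set on the last byte of a number) is the standard VB scheme.

def gaps(doc_ids):
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def vb_encode(number):
    """Variable byte encoding: 7 payload bits per byte, high bit marks the last byte."""
    out = []
    while True:
        out.insert(0, number % 128)
        if number < 128:
            break
        number //= 128
    out[-1] += 128
    return bytes(out)

postings = [283154, 283159, 283202]
print(gaps(postings))                                  # [283154, 5, 43]
print(b"".join(vb_encode(g) for g in gaps(postings)))  # 5 bytes instead of 3 x 4 bytes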


Compression of Reuters

data structure                             size in MB
dictionary, fixed-width                        11.2
dictionary, term pointers into string           7.6
∼, with blocking, k = 4                         7.1
∼, with blocking & front coding                 5.9
collection (text, xml markup etc)            3600.0
collection (text)                             960.0
T/D incidence matrix                       40,000.0
postings, uncompressed (32-bit words)         400.0
postings, uncompressed (20 bits)              250.0
postings, variable byte encoded               116.0


Term-document incidence matrix

             Anthony &   Julius   The       Hamlet   Othello   Macbeth   …
             Cleopatra   Caesar   Tempest
Anthony         1          1        0          0        0         1
Brutus          1          1        0          1        0         0
Caesar          1          1        0          1        1         1
Calpurnia       0          1        0          0        0         0
Cleopatra       1          0        0          0        0         0
mercy           1          0        1          1        1         1
worser          1          0        1          1        1         0
…

Entry is 1 if the term occurs, e.g., Calpurnia occurs in Julius Caesar.
Entry is 0 if the term doesn’t occur, e.g., Calpurnia doesn’t occur in The Tempest.


Summary

▶ We can now create an index for highly efficient Boolean retrieval that is very space efficient.
▶ It takes up only 10–15% of the total size of the text in the collection.
▶ However, we’ve ignored positional and frequency information.
▶ For this reason, the space savings are smaller in reality.
