Information Retrieval: Index Construction

Hamid Beigy | Sharif University of Technology | October 6, 2018

Table of contents
1. Introduction
2. Sort-based index construction
3. Single-pass in-memory indexing (SPIMI)
4. Distributed indexing
5. Dynamic indexing


Inverted index
1. The goal is to construct an inverted index. For each term t, we store a list of all documents that contain t.

Brutus    → 1 2 4 11 31 45 173 174
Caesar    → 1 2 4 5 6 16 57 132 ...
Calpurnia → 2 31 54 101
...
(left column: dictionary; right column: postings)
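The dictionary-plus-postings structure above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the construction method the deck develops later; the three toy documents and the whitespace tokenizer are assumptions made only for this example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of docIDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumed tokenizer: lowercase and split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sort the dictionary by term and each postings list by docID.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

# Hypothetical toy collection (not from the slides).
docs = {
    1: "Brutus killed Caesar",
    2: "Caesar and Calpurnia",
    4: "Brutus and Caesar",
}
index = build_inverted_index(docs)
for term, doc_ids in index.items():
    print(term, "->", doc_ids)
```

Keeping every postings list sorted by docID is what later allows lists from different blocks to be merged in a single linear pass.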

RCV1 collection
1. Shakespeare's collected works are not large enough for demonstrating many of the points in this course.
2. As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection.
3. English newswire articles sent over the wire in 1995 and 1996 (one year).
4. RCV1 statistics:
   Number of documents (N): 800,000
   Average number of tokens per document (L): 200
   Number of distinct terms (M): 400,000
   Bytes per token (including spaces): 6
   Bytes per token (without spaces): 4.5
   Bytes per term: 7.5
5. Why does the algorithm given in previous sections not scale to very large collections?
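As a rough back-of-the-envelope check, the statistics above give the size of the raw collection. The sketch below only multiplies the listed figures; it counts token occurrences (not postings) and assumes the per-document and per-token figures are averages.

```python
# RCV1 figures as listed on the slide.
N = 800_000          # documents
L = 200              # tokens per document (treated as an average)
M = 400_000          # distinct terms
BYTES_PER_TOKEN = 6  # including spaces
BYTES_PER_TERM = 7.5

total_tokens = N * L                           # token occurrences in the collection
raw_text_bytes = total_tokens * BYTES_PER_TOKEN
dictionary_bytes = M * BYTES_PER_TERM          # term strings only

print(f"tokens:          {total_tokens:,}")
print(f"raw text:        {raw_text_bytes / 1e6:,.0f} MB")
print(f"dictionary text: {dictionary_bytes / 1e6:,.0f} MB")
```

Under these assumptions the text alone is close to 1 GB, which motivates the disk-based algorithms in the following sections.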

Hardware basics
1. Access to data is much faster in memory than on disk (roughly a factor of 10).
2. Disk seeks are "idle" time: no data is transferred from disk while the disk head is being positioned.
3. To optimize transfer time from disk to memory, read one large chunk rather than many small chunks.
4. Disk I/O is block-based: entire blocks are read and written (as opposed to smaller chunks). Block sizes: 8 KB to 256 KB.
5. Servers used in IR systems typically have many GBs of main memory and TBs of disk space.
6. Fault tolerance is expensive: it's cheaper to use many regular machines than one fault-tolerant machine.
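The point about large versus small chunks can be made concrete with a simple cost model: every chunk pays one seek, and every byte pays a transfer cost. The seek time (5 ms) and per-byte transfer time (0.02 microseconds) below are illustrative assumptions, not figures from the slide.

```python
# Illustrative assumptions, not taken from the slide.
SEEK_SECONDS = 5e-3            # one disk seek
TRANSFER_SECONDS_PER_BYTE = 2e-8

def read_time(total_bytes, n_chunks):
    """Estimated time to read total_bytes split into n_chunks non-contiguous chunks."""
    return n_chunks * SEEK_SECONDS + total_bytes * TRANSFER_SECONDS_PER_BYTE

ten_mb = 10_000_000
one_chunk = read_time(ten_mb, n_chunks=1)
hundred_chunks = read_time(ten_mb, n_chunks=100)
print(f"1 chunk:    {one_chunk:.3f} s")
print(f"100 chunks: {hundred_chunks:.3f} s")
```

With these assumed numbers the transfer cost is the same in both cases; the extra time comes entirely from seeks.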

Hard disk
[Figure: hard disk]


Sort-based index construction
1. As we build the index, we parse documents one at a time.
2. The final postings for any term are incomplete until the end.
3. Can we keep all postings in memory and then sort them in memory at the end?
4. No, not for large collections.
5. Thus, we need to store intermediate results on disk.
6. Can we use the same index construction algorithm for larger collections, but using disk instead of memory?
7. No: sorting very large sets of records on disk is too slow because it requires too many disk seeks.
8. We need an external sorting algorithm.

External sorting algorithm
1. We must sort T = 100,000,000 non-positional postings.
2. Each posting has size 12 bytes (4+4+4: termID, docID, term frequency).
3. Define a block to consist of 10,000,000 such postings.
4. We can easily fit that many postings into memory. We will have 10 such blocks for RCV1.
5. Basic idea of the algorithm:
6. For each block: accumulate postings, sort in memory, write to disk.
7. Then merge the blocks into one long sorted order.
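The block arithmetic behind these numbers is short enough to check directly. The sketch below uses only the figures stated on the slide; the variable names are chosen for this example.

```python
T = 100_000_000                  # non-positional postings to sort
BYTES_PER_POSTING = 4 + 4 + 4    # termID, docID, term frequency
POSTINGS_PER_BLOCK = 10_000_000

total_bytes = T * BYTES_PER_POSTING
block_bytes = POSTINGS_PER_BLOCK * BYTES_PER_POSTING
n_blocks = T // POSTINGS_PER_BLOCK

print(f"all postings: {total_bytes / 1e9:.1f} GB")
print(f"one block:    {block_bytes / 1e6:.0f} MB")
print(f"blocks:       {n_blocks}")
```

A 120 MB block fits comfortably in memory, while the full 1.2 GB of postings is what the external sort has to handle on disk.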

Merging two blocks

Postings to be merged (read from disk):
Block 1: brutus d3, caesar d4, noble d3, with d4
Block 2: brutus d2, caesar d1, julius d1, killed d2

Merged postings (written back to disk):
brutus d2, brutus d3, caesar d1, caesar d4, julius d1, killed d2, noble d3, with d4
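The merge in this example can be reproduced with Python's standard `heapq.merge`, which lazily merges already-sorted inputs. This is a sketch using in-memory lists in place of block files; docIDs are written as integers (d3 becomes 3) so that tuples compare correctly.

```python
import heapq

# Each block is already sorted by (term, docID), as it would be on disk.
block1 = [("brutus", 3), ("caesar", 4), ("noble", 3), ("with", 4)]
block2 = [("brutus", 2), ("caesar", 1), ("julius", 1), ("killed", 2)]

# heapq.merge reads from both inputs in step, so neither block
# has to be held in memory in full when the inputs are file iterators.
merged = list(heapq.merge(block1, block2))
for term, doc_id in merged:
    print(term, f"d{doc_id}")
```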

Blocked Sort-Based Indexing

BSBIndexConstruction()
  n ← 0
  while (all documents have not been processed)
  do n ← n + 1
     block ← ParseNextBlock()
     BSBI-Invert(block)
     WriteBlockToDisk(block, f_n)
  MergeBlocks(f_1, ..., f_n; f_merged)
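A minimal Python sketch of the pseudocode above, under some simplifying assumptions: the input is already a stream of (termID, docID) pairs, block files are plain text with one pair per line, and the merged result is returned as an in-memory dictionary rather than written to a merged file. The helper names `write_block`, `read_block`, and `bsbi` are hypothetical.

```python
import heapq
import os
import tempfile

def write_block(pairs, directory, n):
    """Sort one block of (termID, docID) pairs in memory and write it to disk."""
    path = os.path.join(directory, f"block_{n}.txt")
    with open(path, "w") as f:
        for term_id, doc_id in sorted(set(pairs)):
            f.write(f"{term_id} {doc_id}\n")
    return path

def read_block(path):
    """Stream (termID, docID) pairs back from a sorted block file."""
    with open(path) as f:
        for line in f:
            term_id, doc_id = line.split()
            yield int(term_id), int(doc_id)

def bsbi(pair_stream, block_size, directory):
    """Blocked sort-based indexing over a stream of (termID, docID) pairs."""
    paths, block = [], []
    for pair in pair_stream:
        block.append(pair)
        if len(block) == block_size:          # block is "full": invert and flush
            paths.append(write_block(block, directory, len(paths) + 1))
            block = []
    if block:                                 # flush the last, partial block
        paths.append(write_block(block, directory, len(paths) + 1))
    index = {}
    # Merge all sorted block files in one pass, skipping duplicate pairs.
    for term_id, doc_id in heapq.merge(*(read_block(p) for p in paths)):
        postings = index.setdefault(term_id, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return index

# Hypothetical toy input: 7 pairs, blocks of 3 pairs each.
pairs = [(2, 1), (1, 1), (2, 2), (1, 3), (3, 2), (1, 2), (2, 1)]
with tempfile.TemporaryDirectory() as tmp:
    bsbi_index = bsbi(pairs, block_size=3, directory=tmp)
print(bsbi_index)
```

Only one block is held in memory at a time during the first phase, and the merge phase reads each block file sequentially, which keeps disk seeks low.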

Problem with sort-based algorithm
1. The assumption was that we can keep the dictionary in memory.
2. We need the dictionary (which grows dynamically) in order to implement a term-to-termID mapping.
3. Actually, we could work with (term, docID) postings instead of (termID, docID) postings ...
4. ... but then the intermediate files become very large. (We would end up with a scalable, but very slow, index construction method.)


Single-pass in-memory indexing (SPIMI)
1. Key idea 1: Generate separate dictionaries for each block, so there is no need to maintain a term-termID mapping across blocks.
2. Key idea 2: Don't sort. Accumulate postings in postings lists as they occur.
3. With these two ideas we can generate a complete inverted index for each block.
4. These separate indexes can then be merged into one big index.

Single-pass in-memory indexing algorithm

SPIMI-Invert(token_stream)
  output_file ← NewFile()
  dictionary ← NewHash()
  while (free memory available)
  do token ← next(token_stream)
     if term(token) ∉ dictionary
       then postings_list ← AddToDictionary(dictionary, term(token))
       else postings_list ← GetPostingsList(dictionary, term(token))
     if full(postings_list)
       then postings_list ← DoublePostingsList(dictionary, term(token))
     AddToPostingsList(postings_list, docID(token))
  sorted_terms ← SortTerms(dictionary)
  WriteBlockToDisk(sorted_terms, dictionary, output_file)
  return output_file

Merging of blocks is analogous to BSBI.
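SPIMI-Invert can be sketched in Python as follows. Several simplifications are assumptions of this example: "free memory available" is modeled as a fixed token budget (`max_postings`), the block is returned as a sorted list instead of being written to a file, tokens are assumed to arrive in docID order, and Python lists grow automatically, so no explicit DoublePostingsList step is needed.

```python
import itertools

def spimi_invert(token_stream, max_postings):
    """Build one block from at most max_postings tokens of (term, docID)."""
    dictionary = {}
    # islice stands in for "while free memory available".
    for term, doc_id in itertools.islice(token_stream, max_postings):
        # Terms are used directly as keys: no term-to-termID mapping is needed.
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    # Terms are sorted only once, when the block is finished.
    return [(term, dictionary[term]) for term in sorted(dictionary)]

# Hypothetical toy token stream; each call consumes the next part of it.
tokens = iter([("caesar", 1), ("brutus", 1), ("caesar", 2),
               ("caesar", 2), ("noble", 3)])
block1 = spimi_invert(tokens, max_postings=3)
block2 = spimi_invert(tokens, max_postings=3)
print(block1)
print(block2)
```

Because each block is written with its terms in sorted order, the blocks can then be merged with a single sequential pass, as in BSBI.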

Single-pass in-memory indexing: compression
1. Compression makes SPIMI even more efficient:
   Compression of terms
   Compression of postings


Distributed indexing
1. For web-scale indexing, we must use a distributed computer cluster.
2. Individual machines are fault-prone: they can unpredictably slow down or fail.
3. How do we exploit such a pool of machines?
4. A distributed index is partitioned across several machines, either according to term or according to document.
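The two partitioning schemes in the last point can be contrasted with a small sketch. The slide does not say how terms or documents are assigned to machines; hashing the term with CRC32 and taking the docID modulo the number of machines are assumptions made here for illustration, as are the function names and the toy index.

```python
import zlib

def partition_by_term(index, n_machines):
    """Each machine holds the complete postings lists for a subset of terms."""
    shards = [dict() for _ in range(n_machines)]
    for term, postings in index.items():
        machine = zlib.crc32(term.encode("utf-8")) % n_machines  # assumed rule
        shards[machine][term] = list(postings)
    return shards

def partition_by_document(index, n_machines):
    """Each machine holds a full index over a subset of the documents."""
    shards = [dict() for _ in range(n_machines)]
    for term, postings in index.items():
        for doc_id in postings:
            machine = doc_id % n_machines                        # assumed rule
            shards[machine].setdefault(term, []).append(doc_id)
    return shards

# Hypothetical toy index.
toy_index = {"brutus": [1, 2, 4], "caesar": [1, 2, 4, 5], "calpurnia": [2]}
term_shards = partition_by_term(toy_index, n_machines=2)
doc_shards = partition_by_document(toy_index, n_machines=2)
print(term_shards)
print(doc_shards)
```

With term partitioning a query term is answered by one machine; with document partitioning every machine contributes part of each postings list.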

Google data centers (Gartner estimates)
1. Google data centers mainly contain commodity machines. Data centers are distributed all over the world.
2. 1 million servers, 3 million processors/cores.
3. Google installs 100,000 servers each quarter.
4. Based on expenditures of 200-250 million dollars per year. This would be 10% of the computing capacity of the world!
5. If, in a non-fault-tolerant system with 1000 nodes, each node has 99.9% uptime, what is the uptime of the system (assuming it does not tolerate failures)?
6. Answer: 37%
7. Suppose a server will fail after 3 years. For an installation of 1 million servers, what is the interval between machine failures?
8. Answer: Less than two minutes.
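Both answers follow from short calculations. The sketch below assumes that node failures are independent and that failures are spread evenly over the three-year lifetime; neither assumption is stated on the slide.

```python
# Uptime of a system that is up only when all of its nodes are up.
nodes = 1000
node_uptime = 0.999
system_uptime = node_uptime ** nodes

# Average interval between failures when each server fails once in 3 years.
servers = 1_000_000
minutes_in_three_years = 3 * 365 * 24 * 60
interval_minutes = minutes_in_three_years / servers

print(f"system uptime:             {system_uptime:.1%}")
print(f"interval between failures: {interval_minutes:.2f} minutes")
```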

Cluster architecture
[Figure: cluster architecture. Each rack contains 16-64 nodes, each with its own CPU, memory, and disk, connected by a rack switch; a 2-10 Gbps backbone switch connects the racks.]
