CS6200: Information Retrieval
Slides by: Jesse Anderton
Index Construction
Indexing, session 7
Basic Indexing

Given a collection of documents, how can we efficiently create an inverted index of its contents? The basic steps are:

1. Convert each document to a sequence of terms.
2. Add a posting for each term occurrence to that term’s inverted list.

This is simple at small scale and in memory, but grows much more complex to do efficiently as the document collection and vocabulary grow.
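The two steps above can be sketched as a minimal in-memory indexer. This is an illustrative sketch, not the course’s reference implementation; it assumes whitespace tokenization and stores positional postings as (docid, position) pairs:

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index mapping term -> list of (docid, position)."""
    index = defaultdict(list)
    for docid, text in enumerate(docs):
        # Step 1: convert the document to a sequence of terms.
        terms = text.lower().split()
        # Step 2: append a posting for each term occurrence.
        for pos, term in enumerate(terms):
            index[term].append((docid, pos))
    return dict(index)

docs = ["the cat sat", "the dog barked at the cat"]
index = build_index(docs)
```

Because documents are processed in docid order, each inverted list comes out sorted by docid with no extra sorting step.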
Basic In-Memory Indexer
The basic indexing algorithm will fail as soon as you run out of memory. To address this, we store a partial inverted list to disk when it grows too large to handle, reset the in-memory index, and start over. When we’re finished, we merge all the partial indexes. The partial indexes should be written in a manner that facilitates later merging of the inverted lists, for instance with their terms in sorted order.
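The spill-and-merge scheme can be sketched as follows. This is a hypothetical `SpillingIndexer` under simplifying assumptions: postings are bare docids, runs are JSON lines in temporary files, and the posting budget stands in for real memory accounting. Each run is written with its terms sorted, which is what makes the final merge a simple sequential pass:

```python
import heapq
import json
import os
import tempfile
from collections import defaultdict

class SpillingIndexer:
    """Flush the in-memory index to a sorted run file whenever it exceeds
    a posting budget, then merge all runs into one index at the end."""

    def __init__(self, max_postings=1000):
        self.max_postings = max_postings
        self.index = defaultdict(list)
        self.n_postings = 0
        self.run_files = []

    def add(self, docid, terms):
        for term in terms:
            self.index[term].append(docid)
            self.n_postings += 1
        if self.n_postings >= self.max_postings:
            self.flush()

    def flush(self):
        if not self.index:
            return
        f = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
        # Write terms in sorted order so runs can later be merged sequentially.
        for term in sorted(self.index):
            f.write(json.dumps([term, self.index[term]]) + "\n")
        f.close()
        self.run_files.append(f.name)
        self.index = defaultdict(list)
        self.n_postings = 0

    def merge(self):
        self.flush()
        runs = []
        for path in self.run_files:
            with open(path) as f:
                runs.append([json.loads(line) for line in f])
        merged = {}
        # heapq.merge keeps terms globally sorted across all runs; postings
        # from earlier runs (lower docids) are appended first.
        for term, postings in heapq.merge(*runs, key=lambda entry: entry[0]):
            merged.setdefault(term, []).extend(postings)
        for path in self.run_files:
            os.remove(path)
        return merged

idx = SpillingIndexer(max_postings=3)
idx.add(0, ["a", "b"])
idx.add(1, ["a", "c"])   # budget exceeded: spills the first run to disk
idx.add(2, ["b"])
merged = idx.merge()
```

A real indexer would stream the runs during the merge instead of loading them, and would compress the posting lists, but the structure is the same.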
An index can be updated from a new batch of documents by merging the posting lists from the new documents. However, this is inefficient for small updates. Instead, we can run a search against both old and new indexes and merge the result lists at search time. Once enough changes have accumulated, we can merge the old and new indexes in a large batch. In order to handle deleted documents, we also need to maintain a delete list of docids to ignore from the old index. At search time, we simply ignore postings from the old index for any docid in the delete list. If a document is modified, we place its docid into the delete list and place the new version in the new index.
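Search-time merging with a delete list can be sketched like this (a hypothetical `search` helper; postings are bare docids and the indexes are plain dicts for illustration):

```python
def search(term, old_index, new_index, deleted):
    """Query both indexes, dropping old postings whose docid is on the
    delete list. Modified documents are covered too: their old docid is
    on the delete list and their new version lives in the new index."""
    old_hits = [d for d in old_index.get(term, []) if d not in deleted]
    new_hits = new_index.get(term, [])
    return sorted(set(old_hits) | set(new_hits))

old_index = {"cat": [1, 2, 3]}
new_index = {"cat": [4]}   # doc 2 was modified; its new version is doc 4
deleted = {2}
hits = search("cat", old_index, new_index, deleted)
```

Note that the old index itself is never rewritten for a delete; the delete list is the only state that changes, which is what makes small updates cheap.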
If each term’s inverted list is stored in a separate file, updating the index is straightforward: we simply merge the postings from the old and new index. However, most filesystems can’t handle very large numbers of files, so several inverted lists are generally stored together in larger files. This complicates merging, especially if the index is still being used for query processing. There are ways to update live indexes efficiently, but it’s often simpler to write a new index, then redirect queries to the new index and delete the old one.
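The “write a new index, then redirect queries” pattern is often implemented as an atomic symlink swap. A minimal sketch, assuming a POSIX filesystem (it relies on `os.symlink` and the atomicity of `os.replace`; the `publish_index` name and `index_current` link are hypothetical):

```python
import os
import tempfile

def publish_index(new_dir, live_link):
    """Atomically point the live symlink at a freshly built index directory.

    Queries always resolve the symlink, so they see either the old index
    or the new one, never a partially written state.
    """
    tmp = live_link + ".tmp"
    os.symlink(new_dir, tmp)
    os.replace(tmp, live_link)  # rename is atomic on POSIX

# Demonstration: build two index "versions" and swap between them.
workdir = tempfile.mkdtemp()
os.chdir(workdir)
os.mkdir("v1")
os.mkdir("v2")
publish_index("v1", "index_current")
publish_index("v2", "index_current")
target = os.readlink("index_current")
```

Once no in-flight queries reference the old directory, it can be deleted.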
We have just scratched the surface of the complexities of constructing and updating large-scale indexes. The most complex indexes are massive engineering projects that are constantly being improved. An indexing algorithm needs to address hardware limitations (e.g., memory usage), OS limitations (the maximum number of files the filesystem can efficiently handle), and algorithmic concerns. When considering whether your algorithm is sufficient, consider how it would perform on a document collection a few orders of magnitude larger than it was designed for. Next, we’ll look at how to distribute indexing across a network.