Index Construction
Introduction to Information Retrieval INF 141 Donald J. Patterson
Content adapted from Hinrich Schütze http://www.informationretrieval.org
BSBI
Reuters collection example (approximate #s)
800,000 documents from the
disk seek time.
need to be sorted with N log2(N) comparisons?
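To make the N log2(N) question concrete, here is a rough back-of-the-envelope sketch. The record count, the two disk seeks per comparison, and the 5 ms seek time are all hypothetical values chosen for illustration; none of them comes from these slides.

```python
import math


def comparisons_for_sort(n: int) -> float:
    """Approximate number of comparisons to sort n records: n * log2(n)."""
    return n * math.log2(n)


def disk_sort_seconds(n: int,
                      seeks_per_comparison: int = 2,
                      seek_seconds: float = 0.005) -> float:
    """Estimated time if every comparison has to pay for disk seeks.

    seeks_per_comparison and seek_seconds are assumed values,
    not measurements.
    """
    return comparisons_for_sort(n) * seeks_per_comparison * seek_seconds


if __name__ == "__main__":
    n = 100_000_000  # hypothetical number of records
    days = disk_sort_seconds(n) / 86_400
    print(f"{comparisons_for_sort(n):.3g} comparisons, about {days:.0f} days")
```

Under these assumptions, seek time dominates the cost of sorting on disk, which is the motivation for sorting in memory-sized blocks.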
BSBI - Block sort-based indexing
Block:
(1998,www.cnn.com) (Every,www.cnn.com) (Her,news.google.com) (I'm,news.bbc.co.uk)
Block:
(1998,news.google.com) (Her,news.bbc.co.uk) (I,www.cnn.com) (Jensen's,www.cnn.com)
Merged Postings (Disk):
(1998,www.cnn.com) (1998,news.google.com) (Every,www.cnn.com) (Her,news.bbc.co.uk) (Her,news.google.com) (I,www.cnn.com) (I'm,news.bbc.co.uk) (Jensen's,www.cnn.com)
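The merge of the two sorted blocks above can be sketched in Python as follows. This is a minimal illustration, not the implementation from the course: the blocks are held in lists rather than read from disk, and the helper name merge_blocks is hypothetical. The sketch orders ties on a term by the source string, so the order within a term may differ from the slide.

```python
import heapq

# The two sorted blocks from the example, as (term, source) pairs.
block_1 = [
    ("1998", "www.cnn.com"),
    ("Every", "www.cnn.com"),
    ("Her", "news.google.com"),
    ("I'm", "news.bbc.co.uk"),
]
block_2 = [
    ("1998", "news.google.com"),
    ("Her", "news.bbc.co.uk"),
    ("I", "www.cnn.com"),
    ("Jensen's", "www.cnn.com"),
]


def merge_blocks(*blocks):
    """Merge already-sorted blocks into one sorted postings list.

    heapq.merge consumes its inputs lazily, so each block could equally
    be an iterator that reads records sequentially from a file.
    """
    return list(heapq.merge(*blocks))


merged = merge_blocks(block_1, block_2)
for term, source in merged:
    print(term, source)
```

Because each block is read front to back exactly once, the merge needs only sequential disk reads rather than random seeks.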
full
Index Construction
Single-Pass In-Memory Indexing
SPIMI-Invert(tokenStream)
    dictionary ← NewHash()
    while (free memory available)
    do  token ← next(tokenStream)
        if term(token) ∉ dictionary
            then postingsList ← AddToDictionary(dictionary, term(token))
            else postingsList ← GetPostingsList(dictionary, term(token))
        if full(postingsList)
            then postingsList ← DoublePostingsList(dictionary, term(token))
        AddToPostingsList(postingsList, docID(token))
    sortedTerms ← SortTerms(dictionary)
    WriteBlockToDisk(sortedTerms, dictionary, outputFile)
    return outputFile
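The pseudocode above might be rendered in Python roughly as follows. This is a sketch under stated assumptions: "free memory available" is approximated by a hypothetical max_postings budget, tokens are assumed to arrive as (term, docID) pairs, and blocks are written as JSON lines. Python lists grow on their own, so the DoublePostingsList step has no counterpart here. The driver build_blocks is also hypothetical.

```python
import json
import tempfile


def spimi_invert(token_stream, max_postings, out_dir):
    """Build one in-memory block from the stream and write it to disk.

    Returns the path of the block file, or None if the stream is exhausted.
    max_postings stands in for the "free memory available" check.
    """
    dictionary = {}
    count = 0
    while count < max_postings:
        token = next(token_stream, None)
        if token is None:
            break
        term, doc_id = token
        # New terms get an empty postings list; known terms reuse theirs.
        dictionary.setdefault(term, []).append(doc_id)
        count += 1
    if not dictionary:
        return None
    # Sort terms only once, when the block is written out.
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".block", dir=out_dir, delete=False
    ) as out:
        for term in sorted(dictionary):
            out.write(json.dumps([term, dictionary[term]]) + "\n")
        return out.name


def build_blocks(tokens, max_postings, out_dir):
    """Call spimi_invert repeatedly until the token stream runs out."""
    stream = iter(tokens)
    paths = []
    while True:
        path = spimi_invert(stream, max_postings, out_dir)
        if path is None:
            return paths
        paths.append(path)
```

For example, build_blocks([("b", 1), ("a", 1), ("b", 2)], 2, some_dir) would write two block files, each with its terms in sorted order; the blocks still need a final merge step.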
each block to hold more data
Index Construction
Distributed Indexing
Distributed Indexing - Architecture
apache.org
[Figure: distributed indexing architecture. Labels: Corpus, Master, Parsers (each writing segments A-F, G-P, Q-Z), Inverters (each reading segments A-F, G-P, Q-Z), Postings (A-F, G-P, Q-Z).]
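The roles named in the architecture diagram can be sketched in a single Python process. This is only an illustration of the data flow, assuming parsers split (term, docID) pairs into the three term ranges and one inverter handles each range. The function names, the whitespace tokenizer, and the choice to send terms that do not start with a letter to the first range are all assumptions; a real system would run parsers and inverters on separate machines.

```python
from collections import defaultdict

# Term ranges taken from the diagram labels.
PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]


def partition_for(term):
    """Return the name of the term range that a term falls into."""
    first = term[:1].lower()
    for lo, hi in PARTITIONS:
        if lo <= first <= hi:
            return f"{lo}-{hi}"
    return "a-f"  # assumption: non-letter terms go to the first range


def parse(split):
    """Parser: turn (doc_id, text) pairs into per-range segment lists."""
    segments = defaultdict(list)
    for doc_id, text in split:
        for term in text.split():
            segments[partition_for(term)].append((term.lower(), doc_id))
    return segments


def invert(pairs):
    """Inverter: collect (term, doc_id) pairs into sorted postings lists."""
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


def build_index(splits):
    """Master: hand each split to a parser, then each range to an inverter."""
    segment_files = [parse(split) for split in splits]
    index = {}
    for lo, hi in PARTITIONS:
        name = f"{lo}-{hi}"
        pairs = [p for seg in segment_files for p in seg.get(name, [])]
        index[name] = invert(pairs)
    return index
```

Partitioning by term range means each inverter can build its postings without coordinating with the others.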