CS6200: Information Retrieval
Slides by: Jesse Anderton
Inverted Indexes
IR, session 5
Scaling up
A term incidence matrix with V terms and D documents has O(V x D) entries.
Shakespeare’s plays use about 31,000 distinct words across 37 plays, for about 1.1M entries.
Corpus               Terms         Docs          Entries
Shakespeare’s Plays  ~31,000       37            ~1.1 million
English Wikipedia    ~1.7 million  ~4.5 million  ~7.65 trillion
English Web          >2 million    >1.7 billion  >3.4 x 10^15
English Wikipedia comprises about 4.5M pages and roughly 1.7M distinct words. Assuming just one bit per matrix entry, the matrix would hold about 7.65 trillion bits, or roughly 890 GiB (about 956 GB) of memory.
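The arithmetic behind these figures can be checked with a short sketch. The corpus sizes are the approximate values quoted above, and the helper name `matrix_size` is made up for illustration.

```python
def matrix_size(terms, docs):
    """Return (entries, bytes) for a dense term incidence matrix at one bit per entry."""
    entries = terms * docs
    return entries, entries / 8  # 8 bits per byte


# Approximate corpus sizes taken from the figures quoted in the slides.
corpora = [
    ("Shakespeare's Plays", 31_000, 37),
    ("English Wikipedia", 1_700_000, 4_500_000),
]

for name, terms, docs in corpora:
    entries, size_bytes = matrix_size(terms, docs)
    # 2**30 bytes = 1 GiB; 10**9 bytes = 1 GB
    print(f"{name}: {entries:.3g} entries, "
          f"{size_bytes / 2**30:.3g} GiB ({size_bytes / 10**9:.3g} GB)")
```

The Wikipedia figure only matches "about 890" when gigabytes are counted in binary units of 2^30 bytes; in decimal units the same matrix is closer to 956 GB.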
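Two observations bring this down to a manageable size:
The matrix is sparse: any one document uses a tiny fraction of the vocabulary.
A query uses only a handful of words, so we don’t need the rest.
An inverted index exploits both observations, instead of using a term incidence matrix directly.
It maps each term to a posting list of documents which use that term.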
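A minimal sketch of the term-to-posting-list mapping, assuming documents arrive as a dict of ID to text and that lowercasing plus whitespace splitting is an adequate tokenizer. The function name `build_inverted_index` and the sample documents are illustrative and do not come from the slides.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenizer (assumption): lowercase, split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    # Store each posting list sorted by document ID.
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


docs = {
    1: "to be or not to be",
    2: "to thine own self be true",
    3: "the lady doth protest too much",
}
index = build_inverted_index(docs)
print(index["be"])    # documents containing "be"
print(index["lady"])  # documents containing "lady"
```

Only the (term, document) pairs that actually occur are stored, which is how the structure sidesteps the mostly empty cells of the full matrix.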
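For a Boolean query of the form
t1 AND t2 AND … AND tn
take the intersection of the term posting lists.
The intersection relies on the posting lists being sorted by length: the terms are processed in order from least common to most common, so intermediate results stay as small as possible.
Each individual list is assumed to be sorted by document ID, so two lists can be intersected in a single linear pass.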
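The conjunctive query procedure can be sketched as follows. This is one common way to do it, not necessarily the exact algorithm from the slides: it assumes each posting list is a Python list sorted by document ID, and the names `intersect_two` and `intersect_all` and the toy index are hypothetical.

```python
def intersect_two(a, b):
    """Merge-style intersection of two posting lists sorted by document ID."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def intersect_all(index, terms):
    """Answer t1 AND t2 AND ... AND tn over an index of term -> posting list."""
    postings = [index.get(term, []) for term in terms]
    if not postings:
        return []
    # Process the shortest lists (least common terms) first.
    postings.sort(key=len)
    result = list(postings[0])
    for plist in postings[1:]:
        if not result:  # nothing left to match, stop early
            break
        result = intersect_two(result, plist)
    return result


# Toy index with made-up document IDs.
toy_index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}
print(intersect_all(toy_index, ["brutus", "caesar", "calpurnia"]))
```

Starting from the rarest term means no intermediate result can be longer than the shortest posting list, which bounds the work done in every later merge.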
Many other data structures have been considered, but none has matched the efficiency of the inverted index.
Each posting list entry holds a document ID and, optionally, one or more fields providing extra information about the documents.
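As an illustration of postings that carry extra fields, the sketch below stores a term frequency and a list of word positions for each document. The slides do not say which fields are meant, so these two are assumptions, as are the names `Posting` and `build_index_with_fields`.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Posting:
    doc_id: int
    term_frequency: int = 0                        # assumed extra field
    positions: list = field(default_factory=list)  # assumed extra field


def build_index_with_fields(docs):
    """Map each term to a list of Posting records sorted by document ID."""
    by_term = defaultdict(dict)  # term -> {doc_id: Posting}
    for doc_id, text in docs.items():
        # Naive tokenizer (assumption): lowercase, split on whitespace.
        for position, term in enumerate(text.lower().split()):
            posting = by_term[term].setdefault(doc_id, Posting(doc_id))
            posting.term_frequency += 1
            posting.positions.append(position)
    return {
        term: [by_doc[doc_id] for doc_id in sorted(by_doc)]
        for term, by_doc in by_term.items()
    }


docs = {1: "to be or not to be", 2: "to thine own self be true"}
fielded_index = build_index_with_fields(docs)
for posting in fielded_index["be"]:
    print(posting)
```

Fields like these let a system rank documents or match phrases without rereading the documents themselves, at the cost of larger posting lists.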