Index Construction - Introduction to Information Retrieval, INF 141 (PowerPoint presentation)



SLIDE 1

Index Construction

Introduction to Information Retrieval INF 141 Donald J. Patterson

Content adapted from Hinrich Schütze http://www.informationretrieval.org

SLIDE 2
  • Introduction
  • Hardware
  • BSBI - Block sort-based indexing
  • SPIMI - Single Pass in-memory indexing
  • Distributed indexing
  • Dynamic indexing
  • Miscellaneous topics

Overview

Index Construction

SLIDE 3

Indices

The index stores a vector space model (a vector of term counts) for each document

1998:1, Every:1, Her:1, I:1, I'm:1, Jensen's:1, Julie:2, Letter:1, Most:1, all:1, allegedly:1, back:1, before:1, brings:1, brothers:2, could:1, days:1, dead:1, death:1, everything:1, for:1, from:1, full:1, happens:1, haunts:1, have:1, hear:1, her:3, husband:1, if:1, it:1, killing:1, letter:1, nothing:1, now:1, of:1, pray:1, read:1, saved:1, sister:1, stands:1, story:1, the:1, they:2, time:1, trial:1, wonder:1, wrong:1, wrote:1
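Counts like these are easy to produce in Python. A minimal sketch; the text fragment below is invented for illustration, since the full document is not reproduced here:

```python
from collections import Counter

# Hypothetical fragment standing in for the document text
doc = "her sister wrote her a letter before the trial haunts her"

# Split on whitespace; a real tokenizer would also handle case and punctuation
counts = Counter(doc.split())
```

For this fragment, `counts["her"]` comes out as 3, matching the kind of per-term tallies shown above.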

SLIDE 4

Indices

A “Term-Document Matrix” Captures Keywords

A Column for Each Web Page (or “Document”)

A Row For Each Word (or “Term”)

[matrix figure: a grid of per-term counts across documents, mostly 0s and 1s]

  • This picture is deceptive - it is really very sparse
  • Our queries are terms - not documents
  • We need to “invert” the vector space model
  • To make “postings”
SLIDE 12

Terms

  • Inverted index
  • (Term, Document) pairs
  • building blocks for working with Term-Document Matrices
  • Index construction (or indexing)
  • The process of building an inverted index from a corpus
  • Indexer
  • The system architecture and algorithm that constructs the index

Introduction

SLIDE 15

Indices

The index is built from term-document pairs

(TERM,DOCUMENT) (1998,www.cnn.com) (Every,www.cnn.com) (Her,www.cnn.com) (I,www.cnn.com) (I'm,www.cnn.com) (Jensen's,www.cnn.com) (Julie,www.cnn.com) (Letter,www.cnn.com) (Most,www.cnn.com) (all,www.cnn.com) (allegedly,www.cnn.com) (back,www.cnn.com) (before,www.cnn.com) (brings,www.cnn.com) (brothers,www.cnn.com) (could,www.cnn.com) (days,www.cnn.com) (dead,www.cnn.com) (death,www.cnn.com) (everything,www.cnn.com) (for,www.cnn.com) (from,www.cnn.com) (full,www.cnn.com) (happens,www.cnn.com) (haunts,www.cnn.com) (have,www.cnn.com) (hear,www.cnn.com) (her,www.cnn.com) (husband,www.cnn.com) (if,www.cnn.com) (it,www.cnn.com) (killing,www.cnn.com) (letter,www.cnn.com) (nothing,www.cnn.com) (now,www.cnn.com) (of,www.cnn.com) (pray,www.cnn.com) (read,,www.cnn.com) (saved,www.cnn.com) (sister,www.cnn.com) (stands,www.cnn.com) (story,www.cnn.com) (the,www.cnn.com) (they,www.cnn.com) (time,www.cnn.com) (trial,www.cnn.com) (wonder,www.cnn.com) (wrong,www.cnn.com) (wrote,www.cnn.com)

  • Core indexing step is to sort by terms

SLIDE 18

Indices

Term-document pairs make lists of postings

  • A postings list is a list of all the documents in which a term occurs.
  • This is “inverted” from how documents naturally occur

(TERM,DOCUMENT, DOCUMENT, DOCUMENT, ....) (1998,www.cnn.com,news.google.com,news.bbc.co.uk) (Every,www.cnn.com,news.bbc.co.uk) (Her,www.cnn.com,news.google.com) (I,www.cnn.com,www.weather.com) (I'm,www.cnn.com,www.wallstreetjournal.com) (Jensen's,www.cnn.com) (Julie,www.cnn.com) (Letter,www.cnn.com) (Most,www.cnn.com) (all,www.cnn.com) (allegedly,www.cnn.com)
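The inversion itself is just sorting pairs and grouping documents under each term. A minimal sketch, using a few of the (term, document) pairs from the slide:

```python
from collections import defaultdict

# A few (term, document) pairs from the slide
pairs = [
    ("1998", "www.cnn.com"), ("1998", "news.google.com"),
    ("Her", "www.cnn.com"), ("Her", "news.google.com"),
    ("Every", "www.cnn.com"),
]

# Core indexing step: sort by term, then group documents into postings lists
index = defaultdict(list)
for term, doc in sorted(pairs):
    index[term].append(doc)
```

Each key of `index` is a term and each value is its postings list of documents.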

SLIDE 19

Terms

  • How do we construct an index?

Introduction

SLIDE 20

Interactions

  • An indexer needs raw text
  • We need crawlers to get the documents
  • We need APIs to get the documents from data stores
  • We need parsers (HTML, PDF, PowerPoint, etc.) to convert the documents

  • Indexing the web means this has to be done web-scale

Introduction

SLIDE 21

Construction

  • Index construction in main memory is simple and fast.
  • But:
  • As we build the index we parse docs one at a time
  • Final postings for a term are incomplete until the end.
  • At 10-12 postings per term, large collections demand a lot of space
  • Intermediate results must be stored on disk

Introduction

SLIDE 22
  • Introduction
  • Hardware
  • BSBI - Block sort-based indexing
  • SPIMI - Single Pass in-memory indexing
  • Distributed indexing
  • Dynamic indexing
  • Miscellaneous topics

Overview

Index Construction

SLIDE 23

System Parameters

  • Disk seek time = 0.005 sec
  • Transfer time per byte = 0.00000002 sec
  • Time per low-level processor operation = 0.00000001 sec
  • Size of main memory = several GB
  • Size of disk space = several TB

Hardware in 2007

SLIDE 24

System Parameters

  • Disk Seek Time
  • The amount of time to get the disk head to the data
  • Orders of magnitude slower than memory access
  • We must utilize caching
  • No data is transferred during seek
  • Data is transferred from disk in blocks
  • There is no additional overhead to read in an entire block
  • 0.2 seconds to get 10 MB if it is one block
  • 0.7 seconds to get 10 MB if it is stored in 100 blocks

Hardware in 2007
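The 0.2 s and 0.7 s figures follow directly from the seek and transfer parameters above; a quick check (the slide's 0.2 s figure rounds away the single seek):

```python
SEEK_TIME = 0.005               # seconds per disk seek
TRANSFER_PER_BYTE = 0.00000002  # seconds per byte transferred

ten_mb = 10 * 1_000_000         # treating 10 MB as 10^7 bytes

# One contiguous block: one seek, then one transfer
one_block = SEEK_TIME + ten_mb * TRANSFER_PER_BYTE

# 100 scattered blocks: 100 seeks, same total transfer time
hundred_blocks = 100 * SEEK_TIME + ten_mb * TRANSFER_PER_BYTE
```

`one_block` is about 0.205 s and `hundred_blocks` is 0.7 s: the extra 0.5 s is pure seeking, during which no data moves.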

SLIDE 25

System Parameters

  • Data is transferred from disk in blocks
  • Operating Systems read data in blocks, so
  • Reading one byte and reading one block take the same amount of time

Hardware in 2007
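In code, this is why I/O is done with buffered, block-sized reads rather than byte-at-a-time loops. A sketch (the 4096-byte block size is an assumption, not a value from the slides):

```python
BLOCK_SIZE = 4096  # assumed block size; tune to the filesystem

def read_in_blocks(path):
    """Yield a file's contents one block at a time."""
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            yield block
```

Since the OS fetches whole blocks from disk either way, each `read` here amortizes one disk access over thousands of bytes.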

SLIDE 26

System Parameters

  • Data transfers are done on the system bus, not by the processor
  • The processor is not used during disk I/O
  • Assuming an efficient decompression algorithm
  • The total time of reading and then decompressing compressed data is usually less than reading uncompressed data.

Hardware in 2007

SLIDE 27
  • Introduction
  • Hardware
  • BSBI - Block sort-based indexing
  • SPIMI - Single Pass in-memory indexing
  • Distributed indexing
  • Dynamic indexing
  • Miscellaneous topics

Overview

Index Construction

SLIDE 28

Reuters collection example (approximate #’s)

  • 800,000 documents from the Reuters news feed
  • 200 terms per document
  • 400,000 unique terms
  • number of postings 100,000,000

BSBI

SLIDE 36

Reuters collection example (approximate #’s)

  • Sorting 100,000,000 records on disk is too slow because of

disk seek time.

  • Parse and build posting entries one at a time
  • Sort posting entries by term
  • Then by document in each term
  • Doing this with random disk seeks is too slow
  • e.g. what if every comparison takes 2 disk seeks, and sorting N items needs N log2(N) comparisons?
  • 307ish days?

BSBI

SLIDE 46

Reuters collection example (approximate #’s)

  • 100,000,000 records
  • N log2(N) = 2,657,542,475.91 comparisons
  • 2 disk seeks per comparison, at 0.005 s per seek = 2 × 13,287,712.38 seconds
  • = 26,575,424.76 seconds
  • = 442,923.75 minutes
  • = 7,382.06 hours
  • = 307.59 days
  • = 84% of a year
  • = 1% of your life

BSBI
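The arithmetic above is easy to reproduce:

```python
import math

N = 100_000_000   # (term, document) records to sort
SEEK = 0.005      # seconds per disk seek

comparisons = N * math.log2(N)     # ~2.66 billion comparisons
seconds = comparisons * 2 * SEEK   # 2 seeks per comparison
days = seconds / (60 * 60 * 24)    # about 307.59 days
```

The point of the exercise: with 2 random seeks per comparison, the seek time alone dominates everything, which is why BSBI avoids comparison-by-comparison disk access entirely.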

SLIDE 47

Different way to sort index

  • 12-byte records (term, doc, meta-data)
  • Need to sort T = 100,000,000 such 12-byte records by term
  • Define a block to have 1,600,000 such records
  • can easily fit a couple blocks in memory
  • we will be working with 64 such blocks
  • Accumulate postings for each block (real blocks are bigger)
  • Sort each block
  • Write to disk
  • Then merge

BSBI - Block sort-based indexing

SLIDE 48

Different way to sort index

BSBI - Block sort-based indexing

(1998,www.cnn.com) (Every,www.cnn.com) (Her,news.google.com) (I'm,news.bbc.co.uk)

Block

(1998,news.google.com) (Her,news.bbc.co.uk) (I,www.cnn.com) (Jensen's,www.cnn.com)

Block

(1998,www.cnn.com) (1998,news.google.com) (Every,www.cnn.com) (Her,news.bbc.co.uk) (Her,news.google.com) (I,www.cnn.com) (I'm,news.bbc.co.uk) (Jensen's,www.cnn.com)

Merged Postings (Disk)
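Merging sorted blocks like the two above is a streaming k-way merge; in Python, `heapq.merge` does exactly this (its output is fully sorted by term, then by document):

```python
import heapq

# The two sorted blocks from the slide
block1 = [("1998", "www.cnn.com"), ("Every", "www.cnn.com"),
          ("Her", "news.google.com"), ("I'm", "news.bbc.co.uk")]
block2 = [("1998", "news.google.com"), ("Her", "news.bbc.co.uk"),
          ("I", "www.cnn.com"), ("Jensen's", "www.cnn.com")]

# heapq.merge streams the merged order without holding everything
# in memory -- the same idea applies when blocks are files on disk
merged = list(heapq.merge(block1, block2))
```

Because each block is already sorted, the merge only ever looks at the head of each block, so on disk it needs one sequential pass per block rather than random seeks.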

SLIDE 49

BlockSortBasedIndexConstruction

BSBI - Block sort-based indexing

BlockSortBasedIndexConstruction()
1  n ← 0
2  while (all documents not processed)
3  do n ← n + 1
4     block ← ParseNextBlock()
5     BSBI-Invert(block)
6     WriteBlockToDisk(block, fn)
7  MergeBlocks(f1, f2, ..., fn; fmerged)

SLIDE 50

Block merge indexing

  • Parse documents into (TermID, DocID) pairs until “block” is full

  • Invert the block
  • Sort the (TermID,DocID) pairs
  • Compile into TermID posting lists
  • Write the block to disk
  • Then merge all blocks into one large postings file
  • Need 2 copies of the data on disk (input then output)

BSBI - Block sort-based indexing
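The steps above can be sketched end to end in a few lines. This is a toy version under the assumption that an in-memory list of runs stands in for on-disk block files; the corpus, terms, and block size are invented for illustration:

```python
import heapq
import itertools
from collections import defaultdict

# Invented toy corpus; a real indexer parses crawled documents
corpus = [
    ("www.cnn.com", ["her", "letter", "trial", "her"]),
    ("news.google.com", ["letter", "verdict"]),
    ("news.bbc.co.uk", ["trial", "verdict"]),
]

BLOCK_SIZE = 4  # pairs per block; real blocks hold millions of records

def parse_pairs(corpus):
    """Stream (term, doc) pairs, one document at a time."""
    for doc, terms in corpus:
        for term in terms:
            yield (term, doc)

def bsbi(corpus):
    # 1) fill, invert (sort), and "write" each block
    blocks, pairs = [], parse_pairs(corpus)
    while True:
        block = list(itertools.islice(pairs, BLOCK_SIZE))
        if not block:
            break
        block.sort()          # invert the block in memory
        blocks.append(block)  # a real indexer writes the run to disk here
    # 2) merge all sorted runs into one postings file
    index = defaultdict(list)
    for term, doc in heapq.merge(*blocks):
        if not index[term] or index[term][-1] != doc:  # skip duplicates
            index[term].append(doc)
    return dict(index)

index = bsbi(corpus)
```

The structure mirrors the slide: blocks are sorted independently while they fit in memory, and the final merge reads every run sequentially, which is what keeps disk seeks out of the inner loop.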

SLIDE 51

Analysis of BSBI

  • The dominant term is O(T log T)
  • T is the number of (TermID, DocID) pairs
  • But in practice ParseNextBlock takes the most time
  • Then MergeBlocks
  • Again, disk seek times versus memory access times

BSBI - Block sort-based indexing
