Text Indexing
Arun Chauhan COMP 314
Lecture 15, 16 Mar 4, Mar 6, 2003
Searching Text
grep utility on Unix
- specify a regular expression
- search all specified files
Lecture 15, 16: Text Indexing Mar 4, Mar 6, 2003
Indexing
build an index of the text (a concordance)
amortize indexing time over a large number of queries
Full-text Retrieval
using automatically constructed concordances
exhaustive searches?
retrieval?
Indexing: A General Technique
support rapid queries
Applications
Inverted File Index
index[term] = document1, document2, . . .
How do you index non-text data (e.g., PDF files, images)?
An Example
Document  Text
1         Pease porridge hot, pease porridge cold
2         Pease porridge in the pot
3         Nine days old
4         Some like it hot, some like it cold
5         Some like it in the pot
6         Nine days old

Find the lexicon and build the inverted index.
Example (contd.)
Number  Term      Documents (count; document numbers)
1       cold      2; 1, 4
2       days      2; 3, 6
3       hot       2; 1, 4
4       in        2; 2, 5
5       it        2; 4, 5
6       like      2; 4, 5
7       nine      2; 3, 6
8       old       2; 3, 6
9       pease     2; 1, 2
10      porridge  2; 1, 2
11      pot       2; 2, 5
12      some      2; 4, 5
13      the       2; 2, 5
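The construction of the lexicon and inverted index for this example can be sketched in a few lines of Python. This is a minimal sketch, not the course's implementation: the names `DOCUMENTS`, `tokenize`, and `build_inverted_index` are made up for illustration, terms are assumed to be lowercased runs of letters, and document 5 is read as "Some like it in the pot".

```python
import re
from collections import defaultdict

# The six example documents (document number -> text).
# Assumption: document 5 reads "Some like it in the pot".
DOCUMENTS = {
    1: "Pease porridge hot, pease porridge cold",
    2: "Pease porridge in the pot",
    3: "Nine days old",
    4: "Some like it hot, some like it cold",
    5: "Some like it in the pot",
    6: "Nine days old",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms (an assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(documents):
    """Map each term to the sorted list of document numbers containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    # Sort the terms so the lexicon comes out in alphabetical order.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


index = build_inverted_index(DOCUMENTS)

# Print one row per term: number, term, "count; document numbers".
for number, (term, doc_ids) in enumerate(index.items(), start=1):
    print(number, term, f"{len(doc_ids)};", ", ".join(map(str, doc_ids)))
```

Keeping each postings list sorted by document number is a design choice that makes later merging of lists (for multi-term queries) a single linear pass.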
Using the Inverted Index
within each document
Boolean queries: “term1 AND term2”, “term1 OR term2”
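Boolean queries of this kind reduce to set operations on postings lists. The following is a minimal sketch under the assumption that each postings list is a sorted list of document numbers; the function names and the small `index` dictionary are hypothetical, with postings borrowed from the porridge example.

```python
def intersect(a, b):
    """AND: walk two sorted postings lists in step, keeping common documents."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def union(a, b):
    """OR: documents that appear in either postings list, in sorted order."""
    return sorted(set(a) | set(b))


# Hypothetical index fragment (term -> sorted document numbers).
index = {
    "hot": [1, 4],
    "cold": [1, 4],
    "pease": [1, 2],
    "pot": [2, 5],
}


def query_and(index, term1, term2):
    """Documents containing both terms; unknown terms match nothing."""
    return intersect(index.get(term1, []), index.get(term2, []))


def query_or(index, term1, term2):
    """Documents containing at least one of the terms."""
    return union(index.get(term1, []), index.get(term2, []))


print(query_and(index, "pease", "pot"))  # [2]
print(query_or(index, "hot", "pot"))     # [1, 2, 4, 5]
```

The AND merge touches each list entry at most once, so its cost is linear in the combined length of the two lists.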
Trimming the Index
Effectiveness: Precision
$$
\text{Precision} = \frac{r}{t}
$$
r: number of relevant documents retrieved
t: total number of documents retrieved
Example: if 50 documents are retrieved and 35 of them are relevant, the precision is 35/50 = 70%.
Effectiveness: Recall
$$
\text{Recall} = \frac{r}{n}
$$
r: number of relevant documents retrieved
n: total number of relevant documents in the collection
Example: if 50 documents are retrieved and 35 of them are relevant, the precision is 35/50 = 70%; if the collection contains 140 relevant documents, the recall is 35/140 = 25%.
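The two effectiveness measures are simple ratios, and the slide's numbers can be checked directly. A minimal sketch; the function and parameter names are chosen here for illustration.

```python
def precision(relevant_retrieved, total_retrieved):
    """Fraction of the retrieved documents that are relevant (r / t)."""
    return relevant_retrieved / total_retrieved


def recall(relevant_retrieved, total_relevant):
    """Fraction of all relevant documents that were retrieved (r / n)."""
    return relevant_retrieved / total_relevant


# Numbers from the example: 50 retrieved, 35 of them relevant,
# 140 relevant documents in the whole collection.
print(f"precision = {precision(35, 50):.0%}")  # precision = 70%
print(f"recall    = {recall(35, 140):.0%}")    # recall    = 25%
```

Retrieving more documents can only keep recall the same or raise it, but it usually lowers precision, which is why the two are reported together.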
Search Engines
Indexing the Web
Web Search Characteristics
Query Characteristics
Terms in query  Share of queries
0               21%
1               26%
2               26%
3               15%
> 3             12%
Goals of a Search Engine
Search Engine Architecture
The Crawler
base ← set of known working hyperlinks
queue ← base
while (! queue.empty()) {
    p = first element of queue
    process p
    for each page, q, referenced from p
        add q to queue;
}
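The crawler loop can be made runnable against a toy link graph. This is a sketch under stated assumptions, not a real web crawler: `LINKS` is a hypothetical in-memory stand-in for fetching a page and extracting its hyperlinks, `p` is removed from the queue when it is taken, and a `seen` set (not in the pseudocode) is added so that pages reachable by several paths, or by cycles, are processed only once.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> pages it references.
LINKS = {
    "a": ["b", "c"],
    "b": ["c", "a"],  # links back to "a", forming a cycle
    "c": ["d"],
    "d": [],
}


def crawl(base, links, process):
    """Breadth-first crawl starting from the known working hyperlinks in base."""
    queue = deque(base)
    seen = set(base)  # assumption: skip pages that were already queued
    while queue:
        p = queue.popleft()  # take and remove the first element of the queue
        process(p)
        for q in links.get(p, []):
            if q not in seen:
                seen.add(q)
                queue.append(q)


visited = []
crawl(["a"], LINKS, visited.append)
print(visited)  # ['a', 'b', 'c', 'd']
```

Using a first-in, first-out queue visits pages in breadth-first order from the base set; without the `seen` check the loop would never terminate on this graph because of the cycle between "a" and "b".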
Indexing
Query Processing
Rankings
document
Let page P be pointed to by pages T1, T2, ..., Tk.
Let L(x) be the number of links going out of page x.
Let R(x) be the page rank of page x.
$$
R(P) = (1 - d) + d \times \left( \frac{R(T_1)}{L(T_1)} + \frac{R(T_2)}{L(T_2)} + \cdots + \frac{R(T_k)}{L(T_k)} \right)
$$
where d is a damping factor.
Solving the Rankings
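$$
\begin{aligned}
R(P_1) &= (1 - d) + d \times \left( \frac{1}{L(T^{p_1}_1)} R(T^{p_1}_1) + \frac{1}{L(T^{p_1}_2)} R(T^{p_1}_2) + \cdots + \frac{1}{L(T^{p_1}_{k_1})} R(T^{p_1}_{k_1}) \right) \\
R(P_2) &= (1 - d) + d \times \left( \frac{1}{L(T^{p_2}_1)} R(T^{p_2}_1) + \frac{1}{L(T^{p_2}_2)} R(T^{p_2}_2) + \cdots + \frac{1}{L(T^{p_2}_{k_2})} R(T^{p_2}_{k_2}) \right) \\
&\;\;\vdots \\
R(P_n) &= (1 - d) + d \times \left( \frac{1}{L(T^{p_n}_1)} R(T^{p_n}_1) + \frac{1}{L(T^{p_n}_2)} R(T^{p_n}_2) + \cdots + \frac{1}{L(T^{p_n}_{k_n})} R(T^{p_n}_{k_n}) \right)
\end{aligned}
$$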
L × R = C
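Because each equation is linear in the unknown ranks, the system can be solved with any linear solver; for large link graphs it is common to iterate the rank formula until the values stop changing. The following is a minimal sketch of that iteration, not necessarily the method intended in the lecture. The three-page `GRAPH`, the damping factor d = 0.85, the starting rank of 1.0, and the fixed iteration count are all assumptions made for illustration.

```python
def page_rank(out_links, d=0.85, iterations=100):
    """Repeatedly apply R(P) = (1 - d) + d * sum(R(T) / L(T)) over pages T linking to P."""
    pages = list(out_links)
    rank = {p: 1.0 for p in pages}  # assumed starting guess
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Sum R(T) / L(T) over every page T that links to p.
            incoming = sum(
                rank[t] / len(out_links[t])
                for t in pages
                if p in out_links[t]
            )
            new_rank[p] = (1 - d) + d * incoming
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it links to.
GRAPH = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

ranks = page_rank(GRAPH)
for page, value in sorted(ranks.items()):
    print(page, round(value, 3))
```

In this toy graph every page has at least one outgoing link, so the ranks always sum to the number of pages; page C, which is linked from both A and B, ends up with the highest rank.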