1
Boolean and Vector Space Retrieval Models
- CS 290N
- Some slides adapted from R. Mooney (UTexas), J. Ghosh (UT ECE), D. Lee (USTHK).
Table of Content
Which results satisfy the query constraint?
- Boolean model
- Statistical vector space model
3
4
5
- Ad hoc retrieval: a fixed document collection searched with varied user queries.
- Filtering/routing: standing user profiles matched against an incoming document stream.
- Routing returns ranked lists rather than binary filtering (e.g., routing a news stream to a user).
6
- Strip unwanted characters/markup (e.g. HTML tags, punctuation, numbers, etc.).
- Remove common stopwords (e.g. a, the, it, etc.).
- Detect common phrases (possibly using a domain-specific dictionary).
- Build an inverted index mapping each keyword to the list of documents that contain it.
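A minimal sketch of these preprocessing steps in Python, assuming letters-only lowercase tokenization and a tiny illustrative stopword list (stemming and phrase detection omitted):

    import re
    from collections import defaultdict

    STOPWORDS = {"a", "the", "it", "and", "or", "of", "to", "in"}   # illustrative list only

    def preprocess(text):
        # Lowercase, strip punctuation/numbers, split into tokens, drop stopwords.
        tokens = re.findall(r"[a-z]+", text.lower())
        return [t for t in tokens if t not in STOPWORDS]

    def build_inverted_index(docs):
        # Map each keyword to the sorted list of docIDs that contain it.
        index = defaultdict(set)
        for doc_id, text in enumerate(docs):
            for term in preprocess(text):
                index[term].add(doc_id)
        return {term: sorted(ids) for term, ids in index.items()}

    docs = ["Friends, Romans, countrymen.", "Brutus and Caesar.", "Caesar praised Calpurnia."]
    print(build_inverted_index(docs))
    # {'friends': [0], 'romans': [0], 'countrymen': [0], 'brutus': [1], 'caesar': [1, 2], ...}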
7
- A query is a Boolean expression of keywords connected by AND, OR, and NOT, including the use of parentheses to indicate scope.
- A document either satisfies the expression or it does not: exact matching, with no partial matches or ranking.
8
- Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
- Could we grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia?
- Too slow for large corpora, and other operations (e.g., find Romans NEAR countrymen) are not feasible.
9
Term-document incidence matrix: 1 if the play contains the word, 0 otherwise.

            Antony and  Julius  The
            Cleopatra   Caesar  Tempest  Hamlet  Othello  Macbeth
Antony          1          1       0        0       0        1
Brutus          1          1       0        1       0        0
Caesar          1          1       0        1       1        1
Calpurnia       0          1       0        0       0        0
Cleopatra       1          0       0        0       0        0
mercy           1          0       1        1       1        1
worser          1          0       1        1       1        0
10
To answer the query, take the bit vectors for Brutus, Caesar and Calpurnia (complemented) and do a bitwise AND:
110100 AND 110111 AND 101111 = 100100
i.e. Antony and Cleopatra, and Hamlet.
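The same bitwise-AND trick as a sketch, using Python integers to stand in for the incidence-matrix rows above (leftmost bit = Antony and Cleopatra, rightmost = Macbeth):

    brutus    = 0b110100            # row for Brutus in the incidence matrix
    caesar    = 0b110111            # row for Caesar
    calpurnia = 0b010000            # row for Calpurnia

    # Brutus AND Caesar AND NOT Calpurnia: bitwise AND with the complemented row.
    answer = brutus & caesar & (~calpurnia & 0b111111)
    print(format(answer, "06b"))    # 100100 -> Antony and Cleopatra, Hamlet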
11
- For each term T, we must store a list of all documents that contain T.
12
The inverted index has two parts: the dictionary of terms and, for each term, a postings list of the docIDs of documents that contain it.
13
Example: the text "Friends, Romans, countrymen." is tokenized into Friends, Romans, Countrymen and normalized by the linguistic modules into the index terms friend, roman, countryman.
14
15
Consider processing the query: Brutus AND Caesar
- Locate Brutus in the dictionary and retrieve its postings.
- Locate Caesar in the dictionary and retrieve its postings.
- Merge (intersect) the two postings lists.
16
The merge: walk through the two postings lists simultaneously, in time linear in the total number of postings entries. Crucial: the postings are sorted by docID.
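A sketch of this linear-time merge, assuming both postings lists are already sorted by docID (the docIDs are illustrative):

    def intersect(p1, p2):
        # Walk the two sorted postings lists in lockstep: O(m + n) comparisons.
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i, j = i + 1, j + 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    brutus = [1, 2, 4, 11, 31, 45, 173, 174]
    caesar = [1, 2, 4, 5, 6, 16, 57, 132]
    print(intersect(brutus, caesar))   # [1, 2, 4]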
17
Exact-match Boolean retrieval in practice: Westlaw, the largest commercial legal search service (started 1975; ranking added 1992).
- Example information need: What is the statute of limitations in cases involving the federal tort claims act?
- Westlaw query: LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM (/3 = within 3 words, /S = in the same sentence).
- Queries are long and precise, use proximity operators, and are incrementally developed; not like web search.
- Many professional searchers still like Boolean queries: you know exactly what you are getting.
18
Exercise: adapt the merge for queries such as Brutus AND NOT Caesar. Can we still run through the merge in time O(m+n), where m and n are the lengths of the two postings lists?
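For AND NOT the answer is yes: the same lockstep walk still takes O(m+n) time, as in this sketch (OR NOT is different, since its result must mention every docID outside the second postings list):

    def and_not(p1, p2):
        # docIDs in p1 but not in p2, again in O(m + n) time.
        answer, i, j = [], 0, 0
        while i < len(p1):
            if j == len(p2) or p1[i] < p2[j]:
                answer.append(p1[i])
                i += 1
            elif p1[i] == p2[j]:
                i, j = i + 1, j + 1
            else:
                j += 1
        return answer

    print(and_not([1, 2, 4, 11, 31], [2, 31, 54]))   # [1, 4, 11]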
19
- Difficult to control the number of documents retrieved: all matched documents are retrieved.
- Difficult to perform relevance feedback: if a document is identified by the user as relevant or irrelevant, how should the query be modified?
20
- A document is typically represented by a bag of words (unordered words with frequencies).
- Bag = set that allows multiple occurrences of the same element.
- The user specifies a set of desired terms with optional weights:
  - Weighted query terms: Q = < database 0.5; text 0.8; information 0.2 >
  - Unweighted query terms: Q = < database; text; information >
21
22
- Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary.
- These terms form a vector space with dimension = t = |vocabulary|.
- Each term i in a document or query j is given a real-valued weight wij.
- Both documents and queries are expressed as t-dimensional vectors: dj = (w1j, w2j, …, wtj).
23
A collection of n documents can be represented in the vector space model by a term-document matrix.
An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or simply doesn’t exist in the document.

        T1    T2    …    Tt
D1     w11   w21   …    wt1
D2     w12   w22   …    wt2
 :       :     :          :
Dn     w1n   w2n   …    wtn
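A minimal sketch of building such a term-document matrix, with raw term counts standing in for the weights wij (NumPy assumed; tf-idf weights, introduced below, would normally be used instead):

    import numpy as np

    docs = [["database", "text", "text"],
            ["text", "information", "database", "database"]]   # already-preprocessed token lists
    vocab = sorted({t for d in docs for t in d})                # the t index terms
    tdm = np.zeros((len(docs), len(vocab)))                     # n documents x t terms
    for j, doc in enumerate(docs):
        for token in doc:
            tdm[j, vocab.index(token)] += 1                     # wij = raw count in this sketch
    print(vocab)   # ['database', 'information', 'text']
    print(tdm)     # [[1. 0. 2.]
                   #  [2. 1. 1.]]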
24
Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q  = 0T1 + 0T2 + 2T3

(Figure: D1, D2 and Q plotted as vectors in the 3-dimensional term space spanned by T1, T2, T3.)

Is D1 or D2 more similar to Q? How to measure the degree of similarity: distance, angle, projection?
25
- How to determine the degree of importance of a term within a document and within the entire collection?
- How to determine the degree of similarity between a document and the query?
- In the case of the web, what is a collection, and what are the effects of links, formatting information, etc.?
26
- More frequent terms in a document are more important, i.e. more indicative of the topic. fij = frequency of term i in document j.
- May want to normalize term frequency (tf) by dividing by the frequency of the most frequent term in the document: tfij = fij / max{fij}.
27
- Terms that appear in many different documents are less indicative of the overall topic.
- dfi = document frequency of term i = number of documents containing term i.
- idfi = inverse document frequency of term i = log2(N / dfi), where N is the total number of documents.
28
- A typical combined term-importance indicator is tf-idf weighting: wij = tfij · idfi = tfij · log2(N / dfi).
- A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
- Many other ways of determining term weights have been proposed.
- Experimentally, tf-idf has been found to work well.
29
Given a document with term frequencies: A(3), B(2), C(1).
Assume the collection contains 10,000 documents and the document frequencies of these terms are: A(50), B(1300), C(250).
Then:
A: tf = 3/3 = 1.0;  idf = log2(10000/50) = 7.6;   tf-idf = 7.6
B: tf = 2/3 = 0.67; idf = log2(10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3 = 0.33; idf = log2(10000/250) = 5.3;  tf-idf = 1.8
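The same computation as a short Python sketch (base-2 logarithm, matching the idf definition above):

    import math

    N = 10000                                 # documents in the collection
    f = {"A": 3, "B": 2, "C": 1}              # raw term frequencies in the document
    df = {"A": 50, "B": 1300, "C": 250}       # document frequencies in the collection

    max_f = max(f.values())
    for term in f:
        tf = f[term] / max_f                  # tf normalized by the most frequent term
        idf = math.log2(N / df[term])         # inverse document frequency
        print(term, round(tf, 2), round(idf, 1), round(tf * idf, 1))
    # A 1.0 7.6 7.6
    # B 0.67 2.9 2.0
    # C 0.33 5.3 1.8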
30
- A similarity measure is a function that computes the degree of similarity between two vectors.
- Using a similarity measure between the query and each document:
  - It is possible to rank the retrieved documents in order of presumed relevance.
  - It is possible to enforce a threshold so that the size of the retrieved set can be controlled.
31
Similarity between document dj and query q can be computed as the vector inner product:
  sim(dj, q) = dj • q = Σ(i=1..t) wij · wiq
where wij is the weight of term i in document j and wiq is the weight of term i in the query.
- For binary vectors, the inner product is the number of matched query terms in the document (the size of the intersection).
- For weighted term vectors, it is the sum of the products of the weights of the matched terms.
32
- The inner product is unbounded and favors long documents with a large number of unique terms.
- It measures how many terms are matched, but not how many terms are not matched.
33
Binary vectors: sim(D, Q) = the number of terms shared by D and Q, e.g. sim(D, Q) = 3 when three query terms appear in the document.

Weighted vectors:
D1 = 2T1 + 3T2 + 5T3    D2 = 3T1 + 7T2 + 1T3
Q  = 0T1 + 0T2 + 2T3
sim(D1, Q) = 2·0 + 3·0 + 5·2 = 10
sim(D2, Q) = 3·0 + 7·0 + 1·2 = 2
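A sketch of the weighted inner product for this example:

    def inner_product(d, q):
        # Sum of the products of the weights of matched terms.
        return sum(wd * wq for wd, wq in zip(d, q))

    D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
    print(inner_product(D1, Q))   # 10
    print(inner_product(D2, Q))   # 2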
34
- Cosine similarity measures the cosine of the angle between two vectors.
- It is the inner product normalized by the vector lengths.
CosSim(dj, q) = (dj • q) / (|dj| · |q|) = Σ(i=1..t) wij · wiq / sqrt( Σ(i=1..t) wij² · Σ(i=1..t) wiq² )

D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / sqrt((4+9+25)·(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / sqrt((9+49+1)·(0+0+4)) = 0.13
Q  = 0T1 + 0T2 + 2T3

D1 is 6 times better than D2 using cosine similarity, but only 5 times better using the inner product.
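And the cosine computation for the same example, as a sketch:

    import math

    def cos_sim(d, q):
        # Inner product normalized by the lengths of the two vectors.
        dot = sum(wd * wq for wd, wq in zip(d, q))
        return dot / (math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q)))

    D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
    print(round(cos_sim(D1, Q), 2))   # 0.81
    print(round(cos_sim(D2, Q), 2))   # 0.13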
35
document collections.
36
- Missing syntactic information (e.g. phrase structure, word order, proximity information).
- Assumption of term independence (e.g. ignores synonymy).
- Lacks the control of a Boolean model (e.g., requiring a term to appear in a document).
- Given a two-term query "A B", it may prefer a document containing A frequently but not B over a document that contains both A and B, but both less frequently.