Query Operations
Berlin Chen 2004
Reference:
1. Modern Information Retrieval, Chapter 5

Introduction
– Users typically have no detailed knowledge of
  – The collection makeup
  – The retrieval environment
– This makes it difficult to formulate good initial queries
– Idea: use the docs retrieved by the initial query to refine it, so that an improved query formulation is constructed and posed again
– Two complementary techniques
  – Expand the original query with new terms (query expansion)
  – Reweight the terms in the expanded query (term reweighting)
Approaches to query expansion and term reweighting
– Feedback information from the user
  – With vector, probabilistic models, et al.
– Information derived from the set of documents initially retrieved (called the local set of documents)
  – Local clustering, local context analysis
– Global information derived from the whole document collection
  – Similarity thesaurus or statistical thesaurus
User relevance feedback
– The most popular query reformulation strategy
– A list of retrieved docs is presented
– The user (or the system) examines them (e.g., the top 10 or 20 docs) and marks the relevant ones
– Important terms are selected from the docs marked as relevant, and their importance is enhanced in the new query formulation

(Figure: the query vector is moved toward the relevant docs and away from the irrelevant docs)
Advantages of user relevance feedback
– Shields users from the details of query reformulation
– Breaks the whole search task down into a sequence of small steps
– Provides a controlled process designed to emphasize some terms (relevant ones) and de-emphasize others (non-relevant ones)
– For automatic relevance feedback, the whole process is done in an implicit manner
Query expansion and term reweighting for the vector model
– Basic assumptions
  – Relevant docs have term-weight vectors that resemble each other
  – Non-relevant docs have term-weight vectors which are dissimilar from those of the relevant docs
  – The reformulated query is moved closer to the term-weight vector space of the relevant docs

Notation
– Cr: the set of relevant docs in the answer set, for a doc collection of size N
– Dr: the set of relevant docs identified by the user
– Dn: the set of non-relevant docs identified by the user
The optimal query
– If the complete set of relevant docs Cr for a given query q were known in advance, the best query vector would be

  \vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\forall \vec{d}_j \in C_r} \vec{d}_j - \frac{1}{N - |C_r|} \sum_{\forall \vec{d}_j \notin C_r} \vec{d}_j

– Problem: the complete set of relevant docs Cr is not known a priori
– Solution: incrementally change the initial query vector based on user or automatic judgments of the retrieved docs
Three classic formulations (Rocchio 1965)
– Standard Rocchio:

  \vec{q}_m = \alpha \vec{q} + \frac{\beta}{|D_r|} \sum_{\forall \vec{d}_i \in D_r} \vec{d}_i - \frac{\gamma}{|D_n|} \sum_{\forall \vec{d}_j \in D_n} \vec{d}_j

– Ide Regular:

  \vec{q}_m = \alpha \vec{q} + \beta \sum_{\forall \vec{d}_i \in D_r} \vec{d}_i - \gamma \sum_{\forall \vec{d}_j \in D_n} \vec{d}_j

– Ide Dec-Hi:

  \vec{q}_m = \alpha \vec{q} + \beta \sum_{\forall \vec{d}_i \in D_r} \vec{d}_i - \gamma \max_{non\text{-}relevant}(\vec{d}_j)

where \vec{q} is the initial/original query, \vec{q}_m the modified query, and \max_{non\text{-}relevant}(\vec{d}_j) the highest-ranked non-relevant doc
– Similar results were achieved with the above three approaches (Dec-Hi was slightly better in early experiments)
– Usually the constant β is larger than γ: the relevant docs identified by the user are more reliable evidence than the non-relevant ones
– More about the constants later
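As a concrete sketch, the standard Rocchio update can be written in a few lines of NumPy. The toy vectors and the constants α = 1.0, β = 0.75, γ = 0.15 are illustrative choices, not values prescribed by these slides:

```python
import numpy as np

def rocchio(q, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    """Standard Rocchio: q_m = alpha*q + beta*mean(rel) - gamma*mean(nonrel)."""
    q_m = alpha * q
    if len(rel):
        q_m = q_m + beta * np.mean(rel, axis=0)
    if len(nonrel):
        q_m = q_m - gamma * np.mean(nonrel, axis=0)
    return np.maximum(q_m, 0.0)  # negative term weights are commonly clipped to 0

# toy 4-term vocabulary: query uses term 0 only
q = np.array([1.0, 0.0, 0.0, 0.0])
rel = np.array([[1.0, 1.0, 0.0, 0.0],
                [1.0, 1.0, 1.0, 0.0]])   # Dr: two docs marked relevant
nonrel = np.array([[0.0, 0.0, 0.0, 1.0]])  # Dn: one doc marked non-relevant
q_m = rocchio(q, rel, nonrel)
```

Term 1, which occurs in both relevant docs, gains a large weight, while the term unique to the non-relevant doc is suppressed.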
Advantages and disadvantages
– Advantages: simple and gives good results
  – The modified term weights are computed directly from the retrieved docs
– Disadvantages
  – No optimality criterion is adopted for the modified query
Term reweighting for the probabilistic model (Robertson & Sparck Jones 1976)
– Binary weights (0 or 1) are used, and the ranking formula is

  sim(d_j, q) \approx \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left[ \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right]

– For the initial search (no relevance information), assume
  – P(k_i|R) = 0.5: constant for all indexing terms
  – P(k_i|\bar{R}) = n_i/N: approximated by the doc frequency of the index term
– The formula then reduces to

  sim(d_j, q) \approx \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \log \frac{N - n_i}{n_i}
– After relevance feedback, with Dr the set of relevant docs identified by the user and D_{r,i} the subset of Dr containing term k_i, estimate

  P(k_i|R) = \frac{|D_{r,i}|}{|D_r|}, \qquad P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}|}{N - |D_r|}

– Substituting these estimates gives

  sim(d_j, q) \approx \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left[ \log \frac{|D_{r,i}|}{|D_r| - |D_{r,i}|} + \log \frac{N - |D_r| - n_i + |D_{r,i}|}{n_i - |D_{r,i}|} \right]

– Problems arise for small values of |D_r| and |D_{r,i}|; two common adjustments:
  – Approach 1 (add 0.5):

    P(k_i|R) = \frac{|D_{r,i}| + 0.5}{|D_r| + 1}, \qquad P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}| + 0.5}{N - |D_r| + 1}

  – Approach 2 (add n_i/N):

    P(k_i|R) = \frac{|D_{r,i}| + n_i/N}{|D_r| + 1}, \qquad P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}| + n_i/N}{N - |D_r| + 1}
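The per-term relevance weight with the 0.5 adjustment (Approach 1) can be sketched as follows; the function name and the example counts are hypothetical:

```python
import math

def prob_term_weight(n_i, D_r_i, D_r, N):
    """Robertson/Sparck-Jones style term weight with the 0.5 adjustment:
    log[p/(1-p)] + log[(1-q)/q], with smoothed estimates p and q."""
    p = (D_r_i + 0.5) / (D_r + 1)            # P(k_i | R)
    q = (n_i - D_r_i + 0.5) / (N - D_r + 1)  # P(k_i | not R)
    return math.log(p / (1 - p)) + math.log((1 - q) / q)

# a term occurring in 8 of 10 known-relevant docs, and in 50 of 1000 docs overall
w = prob_term_weight(n_i=50, D_r_i=8, D_r=10, N=1000)
```

A term that is frequent among the known relevant docs but rare in the collection gets a large positive weight; raising its overall doc frequency lowers the weight.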
Characteristics
– Advantages
  – The feedback process is directly related to the derivation of the term weights
  – The term reweighting is optimal under the assumptions of term independence and binary doc indexing
– Disadvantages
  – Document term weights are not taken into consideration
  – Weights of terms in previous query formulations are disregarded
  – No query expansion is used: the same query terms are reweighted over and over again
A variant of probabilistic term reweighting (Croft 1983)
– Distinct initial search assumptions
– Within-document frequency weights are included

  sim(d_j, q) \propto \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times F_{i,j,q}

– For the initial search:

  F_{i,j,q} = C + idf_i \times f'_{i,j}, \qquad f'_{i,j} = K + (1 - K) \frac{f_{i,j}}{\max(f_{i,j})}

  where f'_{i,j} is the term frequency normalized by the maximum within-document frequency, and idf_i is the inverse document frequency

http://ciir.cs.umass.edu/
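The normalized within-document frequency and the initial-search factor F can be sketched directly; the function names and the default values K = 0.5, C = 0 are illustrative assumptions, since the slides leave the constants open:

```python
def norm_tf(f_ij, max_f_j, K=0.5):
    """Croft's within-doc frequency f'_{i,j} = K + (1-K) * f_{i,j}/max(f_{i,j}),
    i.e. the raw count normalized by the doc's maximum frequency and
    dampened by K (0 <= K < 1)."""
    return K + (1 - K) * f_ij / max_f_j

def F_initial(idf_i, f_ij, max_f_j, C=0.0, K=0.5):
    """Initial-search factor: F_{i,j,q} = C + idf_i * f'_{i,j}."""
    return C + idf_i * norm_tf(f_ij, max_f_j, K)
```

With K = 0.5, an absent term still contributes 0.5 and the doc's most frequent term contributes 1.0, which keeps the factor in a narrow, stable range.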
– For the feedback sessions:

  F_{i,j,q} = \left[ \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} + C \right] \times f'_{i,j}

  with the adjusted estimates

  P(k_i|R) = \frac{|D_{r,i}| + 0.5}{|D_r| + 1}, \qquad P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}| + 0.5}{N - |D_r| + 1}

Characteristics
– Advantages
  – The within-doc frequencies are considered
  – A normalized version of these frequencies is adopted
  – Constants C and K are introduced for greater flexibility
– Disadvantages
  – More complex formulation
  – No query expansion (just reweighting of the index terms)
Evaluation of relevance feedback strategies
– Evaluating the modified query over the whole collection right after the feedback is unrealistic
  – Since the user has already seen the relevant docs during relevance feedback, part of the apparent improvement simply comes from the higher ranks assigned to that set of docs
  – The real gains in retrieval performance should be measured based on the docs not yet seen by the user
– Residual collection: the whole collection minus the set of feedback docs provided by the user
– Evaluate the retrieval performance of the modified query qm considering only the residual collection
– The recall-precision figures for qm tend to be lower than the figures for the original query q, so this evaluation is mainly useful for comparing the relative performance of different relevance feedback strategies
Automatic local analysis
– In relevance feedback, the top-ranked docs are separated into two classes (relevant and non-relevant)
– Terms in the known relevant docs help describe a larger cluster of relevant docs
– This description of a larger cluster of relevant docs is built iteratively, with assistance from the user
– Local analysis strategies instead build such a description automatically (Attar and Fraenkel 1977)
– Key step: identify terms which are related to the query terms, e.g. (Chinese examples from the original slides):
  – 陳水扁 (Chen Shui-bian) → 總統 (president), 李登輝 (Lee Teng-hui), 總統府 (Presidential Office), 秘書長 (secretary-general), 陳師孟 (Chen Shih-meng), 一邊一國 ("one country on each side") …
  – 連戰 (Lien Chan) → 宋楚瑜 (James Soong), 國民黨 (Kuomintang), 一個中國 ("one China") …
Global vs. local analysis
– Global analysis
  – All docs in the collection are used to build a global thesaurus-like structure for QE
  – Builds global structures such as association matrices to quantify term correlations, and uses the correlated terms for QE
  – Not always effective in general collections
– Local analysis
  – The docs retrieved for the query are examined at query time, without user interference, to determine the terms for QE
  – Operates solely on the docs retrieved for the query
  – Not suitable for Web search (time consuming), but suitable for intranets and specialized doc collections (e.g., medical doc collections)
– Example of retrieved-doc content for local analysis (Chinese examples from the original slides): 陳水扁 視察 阿里山 小火車 (Chen Shui-bian inspects the Alishan forest railway); 陳水扁 總統 呂秀蓮 綠色矽島 勇哥 吳淑珍 (President Chen Shui-bian, Annette Lu, "Green Silicon Island", Yong-ge, Wu Shu-chen) …
Local clustering: definitions
– Stem
  – Words that are grammatical variants of each other are reduced to a common stem
  – E.g., {polish, polishing, polished} → s = polish
– For a given query q, the local document set Dl is the set of docs retrieved for q, and Sl is the set of all distinct stems (terms) in the local document set
Three types of local clusters
– Association clusters: consider the co-occurrence of stems (terms) inside docs
– Metric clusters: consider the distance between two terms in a doc
– Scalar clusters: consider the neighborhoods of two terms
Association clusters
– Based on the co-occurrence of stems (terms) inside docs: stems that co-occur frequently inside docs have a synonymity association
– Build a stem-doc matrix m with |Sl| rows and |Dl| columns, where entry m_{u,j} = f_{s_u,j} is the frequency of stem s_u in doc d_j
– The |Sl| x |Sl| stem-stem association matrix is then s = m m^t
– Each entry in the stem-stem association matrix stands for the correlation factor between two stems
  – The unnormalized form:

    c_{u,v} = \sum_{d_j \in D_l} f_{s_u,j} \times f_{s_v,j}

  – The normalized form (Tanimoto coefficient, ranged from 0 to 1):

    s_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}
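Both forms of the association matrix can be sketched in NumPy; the toy stem-doc counts below are made up for illustration:

```python
import numpy as np

# toy stem-doc frequency matrix m: 3 stems (rows) x 3 local docs (columns)
m = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 3, 1]], dtype=float)

c = m @ m.T                                   # unnormalized correlations c_{u,v}
diag = np.diag(c)                             # c_{u,u} on the diagonal
s = c / (diag[:, None] + diag[None, :] - c)   # Tanimoto: c_uv/(c_uu + c_vv - c_uv)
```

The normalization maps every correlation into [0, 1] with 1.0 on the diagonal, so stems are comparable regardless of their raw frequencies.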
– The u-th row of the association matrix contains all the associations for the stem s_u
– A local association cluster S_u(m) is the set of m stems s_v whose values s_{u,v} are the top m ones in the u-th row of the association matrix
– Given a query, only the association clusters of the query terms need to be computed; terms in these clusters are selected and added to the query formulation
– Other measures for term association can be used instead, e.g., mutual information:

  MI(k_u, k_v) = \log \frac{P(k_u, k_v)}{P(k_u) P(k_v)} = \log \frac{n_{u,v}/N}{(n_u/N)(n_v/N)}

  which can likewise be used in a normalized form
Metric clusters
– Take into consideration the distance between two terms in a doc while computing their correlation factor
– The entry of the local stem-stem metric correlation matrix can be expressed as

  c_{u,v} = \sum_{k_i \in V(s_u)} \sum_{k_j \in V(s_v)} \frac{1}{r(k_i, k_j)}

  where V(s_u) is the set of keywords having s_u as their stem, r(k_i, k_j) is the distance between k_i and k_j in the same doc, and r(k_i, k_j) = \infty if k_i and k_j are in distinct docs
– The unnormalized form: s_{u,v} = c_{u,v}
– The normalized form (ranged from 0 to 1):

  s_{u,v} = \frac{c_{u,v}}{|V(s_u)| \times |V(s_v)|}

– The local association clusters S_u(m) are then defined as before
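The metric correlation above can be sketched over word-position lists; the representation (a dict mapping doc id to positions) and the toy positions are assumptions for illustration:

```python
from itertools import product

def metric_correlation(positions_u, positions_v):
    """c_{u,v} = sum over keyword occurrence pairs of 1/r(ki,kj),
    where r is the word distance inside one doc. Pairs in distinct docs
    have r = infinity and so contribute 0.
    positions_* : dict mapping doc id -> list of word positions."""
    c = 0.0
    for doc, pos_u in positions_u.items():
        for i, j in product(pos_u, positions_v.get(doc, [])):
            c += 1.0 / abs(i - j)
    return c

# stem u occurs at words 3 and 10 of doc 1; stem v at word 5 of doc 1
c_uv = metric_correlation({1: [3, 10]}, {1: [5]})   # 1/2 + 1/5
```

Occurrences close together contribute large terms (1/2 here), distant ones small terms (1/5), and occurrences in different docs contribute nothing.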
Scalar clusters
– Idea: two stems (terms) with similar neighborhoods have some synonymity relationship
– Derive the synonymity relationship between two stems by comparing their rows \vec{s}_u and \vec{s}_v in the stem-stem association matrix obtained before:

  s_{u,v} = \frac{\vec{s}_u \cdot \vec{s}_v}{|\vec{s}_u| \times |\vec{s}_v|}

  which yields a new, scalar |Sl| x |Sl| association matrix
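This row-cosine step can be sketched in NumPy, starting from an illustrative (already normalized) association matrix:

```python
import numpy as np

# toy stem-stem association matrix s: stems 0 and 1 have similar neighborhoods
s = np.array([[1.0, 0.4, 0.1],
              [0.4, 1.0, 0.1],
              [0.1, 0.1, 1.0]])

norms = np.linalg.norm(s, axis=1)
scalar = (s @ s.T) / np.outer(norms, norms)   # s_{u,v} = cos(row_u, row_v)
```

Stems 0 and 1 never need to co-occur strongly themselves: their high scalar association comes from having similar association profiles against all other stems.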
Neighbors
– A stem s_u that belongs to a cluster associated to another stem s_v is said to be a neighbor of s_v
– Neighbors need not be synonyms in the grammatical sense
– Stems belonging to clusters associated to the query stems (terms) can be used to expand the original query

(Figure: the stem s_u as a neighbor of the stem s_v)
Query expansion with local clustering
– For each stem s_v in the query q, select m neighbor stems from the cluster S_v(m) and add them to the query
– Hopefully the modified query will retrieve additional relevant docs
– The impact of using normalized or unnormalized clusters on the stem (term) correlations should also be considered; a normalized form is, e.g.,

  s_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}
Local vs. global analysis
– Local analysis
  – Based on the set of docs retrieved for the query
  – Based on term (stem) correlations inside docs
  – Terms that are neighbors of each query term are used to expand the query
– Global analysis (thesaurus-based)
  – Based on the whole doc collection
  – The thesaurus of term relationships is built by considering small contexts (e.g., passages) and phrase structures instead of the context of the whole doc
  – Terms closest to the whole query are selected for query expansion
– Local context analysis combines features from both: term correlations are calculated at query time (as in local analysis) rather than pre-calculated (as in global analysis)
Local context analysis (Xu and Croft 1996)
– Concepts: noun groups from the retrieved docs are used as the units for QE instead of single keywords
– Concepts are selected from the top-ranked passages (instead of docs) based on their co-occurrence with the whole set of query terms (no stemming)
– Three steps
  1. Retrieve the top n ranked passages using the original query (a doc is segmented into several passages)
  2. For each concept c in the top-ranked passages, compute the similarity sim(q,c) between the whole query q and the concept c using a variant of tf-idf ranking
  3. Add the top m ranked concepts to the original query q, weighting the i-th ranked concept by 1 − 0.9 × i/m (i: the position in the rank)
– The similarity sim(q,c) is computed as

  sim(q, c) = \prod_{k_i \in q} \left( \delta + \frac{\log(f(c, k_i) \times idf_c)}{\log n} \right)^{idf_i}

  f(c, k_i) = \sum_{j=1}^{n} pf_{i,j} \times pf_{c,j}

  idf_c = \max\left(1, \frac{\log_{10}(N/np_c)}{5}\right), \qquad idf_i = \max\left(1, \frac{\log_{10}(N/np_i)}{5}\right)

  where n is the no. of top-ranked passages considered, N the no. of passages in the collection, np_c (np_i) the no. of passages containing concept c (term k_i), the exponent idf_i emphasizes the infrequent query terms, and δ is set to 0.1 to avoid zero factors
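A minimal sketch of this scoring function, assuming precomputed co-occurrence counts f(c,k_i); how f = 0 is handled is not spelled out in the slides, so treating the log term as 0 in that case is an assumption of this sketch:

```python
import math

def idf_lca(N, np_x):
    """idf over passages, as defined above: max(1, log10(N/np_x)/5)."""
    return max(1.0, math.log10(N / np_x) / 5.0)

def sim_lca(query_terms, co_occurrence, n, N, np_c, np_term, delta=0.1):
    """sim(q,c) = prod over ki in q of (delta + log(f(c,ki)*idf_c)/log n)^idf_i.
    co_occurrence maps ki -> f(c,ki); a zero co-occurrence contributes
    only delta (an assumption -- the slides leave the f = 0 case open)."""
    idf_c = idf_lca(N, np_c)
    sim = 1.0
    for ki in query_terms:
        f = co_occurrence.get(ki, 0.0)
        inner = math.log(f * idf_c) / math.log(n) if f > 0 else 0.0
        sim *= (delta + inner) ** idf_lca(N, np_term[ki])
    return sim

# toy numbers: n=10 top passages out of N=100000 passages in the collection
np_term = {"a": 1000, "b": 1000}
hi = sim_lca(["a", "b"], {"a": 5.0, "b": 5.0}, n=10, N=100000, np_c=50, np_term=np_term)
lo = sim_lca(["a", "b"], {}, n=10, N=100000, np_c=50, np_term=np_term)
```

A concept co-occurring with every query term (hi) dominates one co-occurring with none (lo), since the latter collapses to δ raised per query term.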
Global analysis: similarity thesaurus (Qiu and Frei 1993)
– Term-to-term relationships rather than simple term co-occurrences are considered
– Terms for query expansion are selected based on their similarity to the whole query rather than their similarities to individual query terms
– Built from a t x N term-doc matrix: each term k_u is represented as a vector in doc space,

  \vec{k}_u = (w_{u,1}, w_{u,2}, ..., w_{u,N})

  i.e., the docs are interpreted as the indexing elements here
– The weights of the term-doc matrix are defined by
  – f_{u,j}: the frequency of term k_u in document d_j
  – t_j: the number of distinct index terms in document d_j
  – Inverse term frequency: itf_j = \log(t/t_j) (a doc containing more distinct terms is less important to any one term)

  w_{u,j} = \frac{\left(0.5 + 0.5 \frac{f_{u,j}}{\max_j(f_{u,j})}\right) itf_j}{\sqrt{\sum_{l=1}^{N} \left(0.5 + 0.5 \frac{f_{u,l}}{\max_l(f_{u,l})}\right)^2 itf_l^2}}

  so that each term vector has unit norm; w_{u,j} expresses the importance of the doc d_j to the term k_u
– The correlation between two terms is then

  c_{u,v} = \vec{k}_u \cdot \vec{k}_v = \sum_{\forall d_j} w_{u,j} \times w_{v,j}

– Since the vector representations are normalized, c_{u,v} is just a cosine measure, ranged from 0 to 1
– The computation, over all docs, is computationally expensive
Query expansion with the similarity thesaurus ("concept-based QE")
1. Represent the query in the same term-concept space: \vec{q} = \sum_{k_u \in q} w_{u,q} \vec{k}_u
2. Based on the global thesaurus, compute the similarity between each term k_v and the whole query q:

   sim(q, k_v) = \vec{q} \cdot \vec{k}_v = \left( \sum_{k_u \in q} w_{u,q} \vec{k}_u \right) \cdot \vec{k}_v = \sum_{k_u \in q} w_{u,q} \times c_{u,v}

3. Expand the query with the top r ranked terms according to sim(q,k_v), assigning each added term k_v the weight

   w'_{v,q} = \frac{sim(q, k_v)}{\sum_{k_u \in q} w_{u,q}}
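The three steps can be sketched in NumPy for a tiny vocabulary; the term vectors, the query weights, and r = 1 are all illustrative assumptions:

```python
import numpy as np

# three term vectors k_u in "doc space" (rows), normalized to unit length
K = np.array([[0.8, 0.6, 0.0],
              [0.6, 0.8, 0.0],
              [0.0, 0.0, 1.0]])
K = K / np.linalg.norm(K, axis=1, keepdims=True)
C = K @ K.T                                   # c_{u,v} = k_u . k_v

w_q = {0: 1.0}                                # query uses term 0 with weight 1
sim = sum(w * C[u] for u, w in w_q.items())   # sim(q, k_v) for every term v

r = 1                                         # expand with the top-r new terms
ranked = [v for v in np.argsort(-sim) if v not in w_q]
expansion = {int(v): sim[v] / sum(w_q.values()) for v in ranked[:r]}
```

Term 1 is selected because its doc-space vector is close to the whole query vector, and its expansion weight is sim(q,k_1) divided by the sum of the original query weights.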
– Note that an expansion term selected this way may be quite close to the whole query even while its distances to the individual query terms are larger
Doc ranking in the term-concept space
– The doc is first represented in the term-concept space:

  \vec{d}_j = \sum_{k_v \in d_j} w_{v,j} \vec{k}_v

– Similarity measure:

  sim(q, d_j) \propto \sum_{k_v \in d_j} \sum_{k_u \in q} w_{v,j} \times w_{u,q} \times c_{u,v}

– This resembles the ranking in the generalized vector space model; the differences are
  – The weight computation
  – Only the top r ranked terms are used here
Global analysis: statistical thesaurus
– Idea: group correlated terms in the context of the whole collection, and use these term classes to expand the original user query
– The terms selected must be low-frequency terms (high discrimination value)
– However, it is difficult to cluster low-frequency terms directly, because there is little co-occurrence evidence about them
– To circumvent this problem, we cluster docs into classes instead and use the low-frequency terms in these docs to define our thesaurus classes
– This requires a clustering algorithm that produces small and tight clusters
Complete-link clustering algorithm
1. Place each doc in a distinct cluster
2. Compute the similarity between all pairs of clusters
3. Determine the pair of clusters [Cu,Cv] with the highest inter-cluster similarity
4. Merge the clusters Cu and Cv
5. Verify a stop criterion; if this criterion is not met, go back to step 2
6. Return a hierarchy of clusters
– The similarity between two clusters is defined as the minimum of the similarities between all pairs of inter-cluster docs (one in Cu, the other in Cv), where the cosine formula of the vector model is used to compare two docs
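The steps above can be sketched as a naive O(n^3)-per-merge implementation over a precomputed doc-doc similarity matrix; the toy matrix and the stop criterion (stop when the best similarity falls to 0) are illustrative choices:

```python
def complete_link(sim, stop_sim=0.0):
    """Agglomerative clustering over a doc-doc similarity matrix.
    Inter-cluster similarity = minimum over all cross pairs (complete link);
    merging stops when the best pair's similarity is <= stop_sim."""
    clusters = [[i] for i in range(len(sim))]
    history = []                       # (merged cluster, similarity at merge)
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = min(sim[i][j] for i in clusters[a] for j in clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        if best <= stop_sim:
            break
        a, b = pair
        merged = clusters[a] + clusters[b]
        history.append((merged, best))
        clusters[a] = merged
        del clusters[b]
    return clusters, history

# toy cosine similarities for 4 docs (0-indexed)
S = [[1.00, 0.40, 0.99, 0.00],
     [0.40, 1.00, 0.29, 0.00],
     [0.99, 0.29, 1.00, 0.00],
     [0.00, 0.00, 0.00, 1.00]]
clusters, history = complete_link(S)
```

Docs 0 and 2 merge first (0.99); doc 1 then joins at min(0.40, 0.29) = 0.29; the isolated doc 3 never merges, illustrating the "small and tight clusters" behavior of complete link.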
– In the resulting hierarchy, higher-level clusters represent a looser grouping: each successive merge happens at a lower similarity
– E.g., sim(Cu,Cv) = 0.15 at one level, then sim(Cu+v,Cz) = 0.11 at the next level up

(Figure: dendrogram over the clusters Cu, Cv, Cz with merge similarities 0.15 and 0.11)
Selecting the terms of each thesaurus class
– Given the cluster hierarchy for the whole collection, the terms that compose each class of the global thesaurus are selected using three parameters obtained from the user: TC, NDC, and MIDF
– The parameter TC is a threshold value for determining which doc clusters will be used to generate thesaurus classes: only if sim(Cu,Cv) > TC are the clusters Cu and Cv selected as sources of terms for a thesaurus class
– The parameter NDC is a limit on the size of clusters (number of docs) to be considered: among the pre-selected clusters, only the smaller ones are retained
– Consider the set of docs in each doc cluster pre-selected above; only the lower-frequency terms are used as sources of terms for the thesaurus classes
– The parameter MIDF specifies the minimum inverse doc frequency for any term which is selected to participate in a thesaurus class: the lower the idf of the terms, the more frequent they are, and the less useful they can be to query expansion
A worked example
– Doc collection:
  Doc1 = D, D, A, B, C, A, B, C
  Doc2 = E, C, E, A, A, D
  Doc3 = D, C, B, B, D, A, B, C, A
  Doc4 = A
– idf values: idf_A = 0.0, idf_B = 0.3, idf_C = 0.12, idf_D = 0.12, idf_E = 0.60
– Doc similarities (cosine formula with tf-idf weighting):
  sim(1,3) = 0.99, sim(1,2) = 0.40, sim(2,3) = 0.29, sim(4,1) = sim(4,2) = sim(4,3) = 0.00
– Complete-link hierarchy: C1,3 merged at 0.99, then C1,3,2 at 0.29, then C1,3,2,4 at 0.00
– For the query q = A E E, the thesaurus class built from the tight cluster contributes the low-frequency term B, giving the expanded query q' = A B E E
Initializing the parameters
– The parameters TC, NDC, and MIDF must be initialized; TC depends on the collection
– Inspection of the cluster hierarchy is almost always necessary for assisting with the setting of TC
– A high value of TC might yield classes with too few terms
Trends and research issues
– Graphical interfaces (2D or 3D) for relevance feedback
– Applying local analysis techniques to the Web environment
– Alleviating the computational burden imposed on the search engine

Adapted from Prof. Lin-shan Lee