Models for Retrieval and Browsing
- Fuzzy Set, Extended Boolean,
Generalized Vector Space Models
Berlin Chen 2004
Reference:
- 1. Modern Information Retrieval. Chapter 2
IR 2004 – Berlin Chen 2
Taxonomy of Classic IR Models
– User task
  – Retrieval: ad hoc, filtering
  – Browsing
– Classic models: Boolean, Vector, Probabilistic
– Set theoretic: Fuzzy, Extended Boolean
– Algebraic: Generalized Vector, Latent Semantic Indexing (LSI), Neural Networks
– Probabilistic (probability-based): Inference Network, Belief Network, Hidden Markov Model, Probabilistic LSI, Language Model
– Structured models: Non-Overlapping Lists, Proximal Nodes
– Browsing: Flat, Structure Guided, Hypertext
– Fuzzy Set Model (Fuzzy Information Retrieval)
– Extended Boolean Model
– Generalized Vector Space Model
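– Docs and queries are represented through sets of keywords, therefore the matching between them is vague
  – Keywords cannot completely describe the user's information need and the doc's main theme
– For each query term (keyword)
  – A fuzzy set is defined, and each doc has a degree of membership in it (its "aboutness" with respect to the term)
[Figure: query keywords wi, wj, wk, … and doc keywords ws, wp, wq, … are compared by the retrieval model]
Example of vague matching: the short forms "President Chen" (陳總統) and "N. 2nd Freeway" (北二高) versus the full forms "Chen Shui-bian" (陳水扁) and "Northern Second Freeway" (北部第二高速公路)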
– Framework for representing classes (sets) whose boundaries are not well defined
– Key idea is to introduce the notion of a degree of membership associated with the elements of a set
– This degree of membership varies from 0 to 1 and allows modeling the notion of marginal membership
– Thus, membership is now a gradual notion instead of an abrupt one (as in conventional Boolean logic)
Here we will define a fuzzy set for each query (or index) term; each doc then has a degree of membership in this set.
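– A fuzzy subset A of a universe of discourse U is characterized by a membership function $\mu_A: U \to [0,1]$
  – It associates with each element u of U a number $\mu_A(u)$ in the interval [0,1]
– Let A and B be two fuzzy subsets of U. Also, let $\bar{A}$ be the complement of A. Then,
$\mu_{\bar{A}}(u) = 1 - \mu_A(u)$
$\mu_{A \cup B}(u) = \max(\mu_A(u), \mu_B(u))$
$\mu_{A \cap B}(u) = \min(\mu_A(u), \mu_B(u))$
[Figure: Venn diagram of the fuzzy subsets A and B inside U, with an element u]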
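The three operations above can be sketched in a few lines of Python. This is only an illustration: representing a fuzzy set as a dict from elements to degrees, and the names `complement`, `union`, `intersection`, `U`, `A` and `B`, are assumptions made for the example, not part of the original slides.

```python
# A fuzzy subset is modeled as a dict: element -> membership degree in [0, 1].
# Elements missing from the dict are treated as having degree 0.

def complement(a, universe):
    """mu_notA(u) = 1 - mu_A(u)"""
    return {u: 1.0 - a.get(u, 0.0) for u in universe}

def union(a, b, universe):
    """mu_{A u B}(u) = max(mu_A(u), mu_B(u))"""
    return {u: max(a.get(u, 0.0), b.get(u, 0.0)) for u in universe}

def intersection(a, b, universe):
    """mu_{A n B}(u) = min(mu_A(u), mu_B(u))"""
    return {u: min(a.get(u, 0.0), b.get(u, 0.0)) for u in universe}

# Hypothetical universe of three docs and two fuzzy subsets of it.
U = ["d1", "d2", "d3"]
A = {"d1": 0.9, "d2": 0.4}
B = {"d2": 0.7, "d3": 0.2}

print(union(A, B, U))
print(intersection(A, B, U))
print(complement(A, U))
```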
– Fuzzy sets are modeled based on a thesaurus
– This thesaurus can be constructed from a term-term correlation matrix (also called a keyword connection matrix)
– Defining the term relationship between $k_i$ and $k_l$:
$c_{i,l} = \frac{n_{i,l}}{n_i + n_l - n_{i,l}}$
  – $n_i$: number of docs that contain $k_i$
  – $n_{i,l}$: number of docs that contain both $k_i$ and $k_l$
  – The counting unit can be docs, paragraphs, sentences, ..
  – $c_{i,l}$ ranges from 0 to 1
– The relationship is symmetric: $c_{i,l} = c_{l,i}$
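As a rough sketch of how the keyword connection matrix could be computed, the snippet below counts co-occurrences over a toy collection. The function name `term_correlations`, the dict-of-pairs layout and the sample docs are illustrative assumptions; docs are used as the counting unit.

```python
from itertools import combinations

def term_correlations(docs):
    """docs: list of sets of terms. Returns a dict (ki, kl) -> c_{i,l}.

    Pairs that never co-occur are simply absent (their correlation is 0).
    """
    n = {}       # n_i: number of docs containing k_i
    n_both = {}  # n_{i,l}: number of docs containing both k_i and k_l
    for terms in docs:
        for k in terms:
            n[k] = n.get(k, 0) + 1
        for ki, kl in combinations(sorted(terms), 2):
            n_both[(ki, kl)] = n_both.get((ki, kl), 0) + 1

    c = {(k, k): 1.0 for k in n}  # a term is fully correlated with itself
    for (ki, kl), both in n_both.items():
        value = both / (n[ki] + n[kl] - both)
        c[(ki, kl)] = value
        c[(kl, ki)] = value       # the relationship is symmetric
    return c

# Hypothetical toy collection of four docs over the terms a, b, c.
docs = [{"a", "b"}, {"a"}, {"a", "b", "c"}, {"c"}]
c = term_correlations(docs)
print(c[("a", "b")], c[("a", "c")])
```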
– Union: algebraic sum (instead of max)
$\mu_{A_1 \cup A_2}(u) = \mu_{A_1}(u) + \mu_{A_2}(u) - \mu_{A_1}(u)\,\mu_{A_2}(u) = 1 - \prod_{j=1}^{2} \left(1 - \mu_{A_j}(u)\right)$
$\mu_{A_1 \cup A_2 \cup \dots \cup A_n}(u) = 1 - \prod_{j=1}^{n} \left(1 - \mu_{A_j}(u)\right)$
  – The product $\prod_j (1 - \mu_{A_j}(u))$ is a negative algebraic product
  – Check for two sets: $a + b - ab = 1 - (1 - a - b + ab) = 1 - (1-a)(1-b)$
– Intersection: algebraic product (instead of min)
$\mu_{A_1 \cap A_2}(u) = \mu_{A_1}(u)\,\mu_{A_2}(u)$
$\mu_{A_1 \cap A_2 \cap \dots \cap A_n}(u) = \prod_{j=1}^{n} \mu_{A_j}(u)$
[Figure: Venn diagram of A1 and A2 inside U, with an element u]
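A minimal sketch of the two alternative operations, written for any number of sets. The function names are illustrative; `math.prod` requires Python 3.8 or later.

```python
from math import prod

def algebraic_product(degrees):
    """Intersection: product of the membership degrees."""
    return prod(degrees)

def algebraic_sum(degrees):
    """Union: complement of the negative algebraic product."""
    return 1.0 - prod(1.0 - m for m in degrees)

a, b = 0.5, 0.4
# For two sets the algebraic sum reduces to a + b - a*b.
print(algebraic_sum([a, b]), a + b - a * b)
print(algebraic_product([a, b]))
```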
– The degree of membership between a doc dj and an index term ki
$\mu_{k_i,d_j} = 1 - \prod_{k_l \in d_j} \left(1 - c_{i,l}\right)$
  – Implemented as the complement of a negative algebraic product, i.e. an algebraic sum (a doc is a union of index terms)
– A doc dj belongs to the fuzzy set associated to the term ki if its own terms are related to ki
– If the doc dj contains a term kl which is strongly related to the index ki ($c_{i,l} \sim 1$) then $\mu_{k_i,d_j} \sim 1$
  – ki is a good fuzzy index for doc dj
  – And vice versa
[Figure: the index term ki linked to doc terms ka and kb with correlations $c_{i,a}$ and $c_{i,b}$; the factors $1 - c_{i,a}$ and $1 - c_{i,b}$ enter the product]
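The membership formula can be sketched as follows. The correlation table `c`, the term names and the helper `membership` are hypothetical values chosen for illustration; a term is assumed to have correlation 1 with itself and 0 with any term not listed.

```python
from math import prod

def membership(ki, doc_terms, c):
    """mu_{ki,dj} = 1 - prod over kl in dj of (1 - c_{i,l})."""
    def corr(kl):
        if kl == ki:
            return 1.0                    # assumption: c_{i,i} = 1
        return c.get((ki, kl), 0.0)       # assumption: unlisted pairs are unrelated
    return 1.0 - prod(1.0 - corr(kl) for kl in doc_terms)

# Hypothetical correlations between the index term "ki" and two doc terms.
c = {("ki", "ka"): 0.8, ("ki", "kb"): 0.5}

print(membership("ki", {"ka", "kb"}, c))   # 1 - (1 - 0.8)(1 - 0.5)
print(membership("ki", {"kz"}, c))         # no related term in the doc
```

Note that a single strongly related term is enough to push the degree close to 1, which is the behavior described in the bullets above.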
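– Query $q = k_a \wedge (k_b \vee \neg k_c)$
– In disjunctive normal form, with one conjunctive component (cc) per disjunct:
$q_{dnf} = (k_a \wedge k_b \wedge k_c) \vee (k_a \wedge k_b \wedge \neg k_c) \vee (k_a \wedge \neg k_b \wedge \neg k_c) = cc_1 + cc_2 + cc_3$
– Da is the fuzzy set of docs associated to the term ka
– Degree of membership?
[Figure: Venn diagram of Da, Db and Dc with the regions cc1, cc2 and cc3 marked]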
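– Degree of membership for a doc dj in the fuzzy answer set Dq
$\mu_{q,d_j} = \mu_{cc_1 + cc_2 + cc_3,\, d_j} = 1 - \prod_{i=1}^{3} \left(1 - \mu_{cc_i,d_j}\right)$
$= 1 - \left(1 - \mu_{a,d_j}\,\mu_{b,d_j}\,\mu_{c,d_j}\right) \left(1 - \mu_{a,d_j}\,\mu_{b,d_j}\,(1 - \mu_{c,d_j})\right) \left(1 - \mu_{a,d_j}\,(1 - \mu_{b,d_j})\,(1 - \mu_{c,d_j})\right)$
  – The three factors correspond to cc1, cc2 and cc3
  – Inside each conjunctive component: algebraic product
  – Across the components: algebraic sum, computed through the negative algebraic product
[Figure: Venn diagram of Da, Db and Dc with the regions cc1, cc2 and cc3 marked]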
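The example query can be evaluated with a short function. This is a sketch for this particular query only; the argument names `mu_a`, `mu_b` and `mu_c` stand for the membership degrees of the doc in Da, Db and Dc and are naming assumptions.

```python
def query_membership(mu_a, mu_b, mu_c):
    """Membership of a doc in the answer set of q = ka AND (kb OR NOT kc)."""
    # Algebraic product inside each conjunctive component.
    cc1 = mu_a * mu_b * mu_c
    cc2 = mu_a * mu_b * (1 - mu_c)
    cc3 = mu_a * (1 - mu_b) * (1 - mu_c)
    # Algebraic sum across the components.
    return 1 - (1 - cc1) * (1 - cc2) * (1 - cc3)

# With crisp (0/1) degrees the result agrees with ordinary Boolean logic.
print(query_membership(1, 1, 1))   # ka, kb, kc all present
print(query_membership(1, 0, 1))   # kb absent and kc present: query fails
print(query_membership(0.8, 0.6, 0.3))
```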
– The correlations among index terms are considered
– Degree of relevance between queries and docs can be achieved
– Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory
– Experiments with standard test collections are not available
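– Extend the Boolean model with the functionality of partial matching and term weighting
  – E.g., for the Boolean query $k_x \wedge k_y$, a doc that contains either kx or ky is as irrelevant as another doc which contains neither of them
– Combine Boolean query formulations with characteristics of the vector model (Salton et al., 1983)
  – Term weighting
  – Algebraic distances for similarity measures, so that a ranking can be obtained
Example queries: "Chen Shui-bian AND Annette Lu" (陳水扁 及 呂秀蓮), "Chen Shui-bian OR Annette Lu" (陳水扁 或 呂秀蓮)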
– The weight for the term kx in a doc dj is
$w_{x,j} = f_{x,j} \times \frac{idf_x}{\max_i idf_i}$
  – $f_{x,j}$: normalized frequency
  – $idf_x / \max_i idf_i$: normalized idf
  – $w_{x,j}$ ranges from 0 to 1
– Let x denote the weight $w_{x,j}$ of term kx on doc dj
– Let y denote the weight $w_{y,j}$ of term ky on doc dj
– The doc vector $\vec{d}_j = (w_{x,j}, w_{y,j})$ is represented as $d_j = (x, y)$
– Queries and docs can be plotted in a two-dimensional map
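– AND query $q_{and} = k_x \wedge k_y$: 2-norm model (Euclidean distance)
$sim(q_{and}, d) = 1 - \sqrt{\frac{(1-x)^2 + (1-y)^2}{2}}$
  – The doc is ranked by how close it lies to the point (1,1), the most desirable point for an AND query
[Figure: docs dj and dj+1 plotted in the unit square with axes $x = w_{x,j}$ (kx) and $y = w_{y,j}$ (ky), between the corners (0,0) and (1,1), for the AND case]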
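– OR query $q_{or} = k_x \vee k_y$: 2-norm model (Euclidean distance)
$sim(q_{or}, d) = \sqrt{\frac{x^2 + y^2}{2}}$
  – The doc is ranked by how far it lies from the point (0,0), the least desirable point for an OR query
[Figure: docs dj and dj+1 plotted in the unit square with axes $x = w_{x,j}$ (kx) and $y = w_{y,j}$ (ky), between the corners (0,0) and (1,1), for the OR case]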
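The two 2-norm similarities can be sketched directly from their definitions. The function names `sim_and` and `sim_or` are illustrative; `x` and `y` are the weights of kx and ky in the doc, each in [0, 1].

```python
from math import sqrt

def sim_or(x, y):
    """Normalized distance from (0, 0), the worst point for kx OR ky."""
    return sqrt((x ** 2 + y ** 2) / 2)

def sim_and(x, y):
    """One minus the normalized distance from (1, 1), the best point for kx AND ky."""
    return 1 - sqrt(((1 - x) ** 2 + (1 - y) ** 2) / 2)

# A doc that contains only kx (x = 1, y = 0) is no longer treated like a doc
# that contains neither term: it gets partial credit under both queries.
print(sim_and(1, 0), sim_or(1, 0))
print(sim_and(0, 0), sim_or(0, 0))
```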
– t index terms are used → t-dimensional space
– p-norm model, $1 \le p \le \infty$
$q_{or} = k_1 \vee^p k_2 \vee^p \dots \vee^p k_m$
$q_{and} = k_1 \wedge^p k_2 \wedge^p \dots \wedge^p k_m$
$sim(q_{or}, d) = \left( \frac{x_1^p + x_2^p + \dots + x_m^p}{m} \right)^{1/p}$
$sim(q_{and}, d) = 1 - \left( \frac{(1-x_1)^p + (1-x_2)^p + \dots + (1-x_m)^p}{m} \right)^{1/p}$
– Some interesting properties
  – $p = 1$: $sim(q_{or}, d) = sim(q_{and}, d) = \frac{x_1 + x_2 + \dots + x_m}{m}$
  – $p = \infty$: $sim(q_{or}, d) \approx \max(x_i)$ and $sim(q_{and}, d) \approx \min(x_i)$, just like the formula of fuzzy logic
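The p-norm formulas and their two limiting cases can be checked numerically with a small sketch. The function names and the sample weights are illustrative; a large finite p stands in for $p = \infty$.

```python
def sim_or_p(xs, p):
    """p-norm similarity for a disjunctive query over term weights xs."""
    return (sum(x ** p for x in xs) / len(xs)) ** (1 / p)

def sim_and_p(xs, p):
    """p-norm similarity for a conjunctive query over term weights xs."""
    return 1 - (sum((1 - x) ** p for x in xs) / len(xs)) ** (1 / p)

xs = [0.2, 0.9, 0.5]   # hypothetical weights of three query terms in a doc

# p = 1: both operators collapse to the plain average of the weights.
print(sim_or_p(xs, 1), sim_and_p(xs, 1), sum(xs) / len(xs))

# Large p: OR approaches max(xs), AND approaches min(xs).
print(sim_or_p(xs, 100), max(xs))
print(sim_and_p(xs, 100), min(xs))
```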
– More general queries are processed by grouping the operators in a predefined order
  – E.g., $q = (k_1 \wedge^p k_2) \vee^p k_3$
$sim(q, d) = \left( \frac{ \left( 1 - \left( \frac{(1-x_1)^p + (1-x_2)^p}{2} \right)^{1/p} \right)^p + x_3^p }{2} \right)^{1/p}$
– Combination of different algebraic distances
  – E.g., $q = (k_1 \vee^2 k_2) \wedge^\infty k_3$
$sim(q, d) = \min\left( \left( \frac{x_1^2 + x_2^2}{2} \right)^{1/2},\; x_3 \right)$
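One possible way to evaluate such grouped queries is a small recursive function over a nested query structure. The tuple encoding `(operator, p, operands)`, the function name `evaluate` and the sample weights are assumptions made for this sketch; `math.inf` selects the min/max limit.

```python
import math

def evaluate(query, weights):
    """Evaluate a nested extended Boolean query against one doc.

    query: a term name, or a tuple (op, p, operands) with op in {"and", "or"}.
    weights: dict term -> weight in [0, 1] for the doc.
    """
    if isinstance(query, str):
        return weights.get(query, 0.0)
    op, p, operands = query
    xs = [evaluate(sub, weights) for sub in operands]
    if op == "or":
        if math.isinf(p):
            return max(xs)
        return (sum(x ** p for x in xs) / len(xs)) ** (1 / p)
    if op == "and":
        if math.isinf(p):
            return min(xs)
        return 1 - (sum((1 - x) ** p for x in xs) / len(xs)) ** (1 / p)
    raise ValueError(f"unknown operator: {op}")

# q = (k1 OR^2 k2) AND^inf k3, with hypothetical doc weights.
q = ("and", math.inf, [("or", 2, ["k1", "k2"]), "k3"])
d = {"k1": 0.6, "k2": 0.8, "k3": 0.9}
print(evaluate(q, d))   # min(sqrt((0.36 + 0.64) / 2), 0.9)
```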
– A hybrid model including properties of both the set theoretic models and the algebraic models
– Distributive operation does not hold for ranking computation
  – E.g., $q_1 = (k_1 \wedge^2 k_2) \vee^2 k_3$ and $q_2 = (k_1 \vee^2 k_3) \wedge^2 (k_2 \vee^2 k_3)$ give $sim(q_1, d) \ne sim(q_2, d)$
$sim(q_1, d) = \left( \frac{ \left( 1 - \left( \frac{(1-x_1)^2 + (1-x_2)^2}{2} \right)^{1/2} \right)^2 + x_3^2 }{2} \right)^{1/2}$
$sim(q_2, d) = 1 - \left( \frac{ \left( 1 - \left( \frac{x_1^2 + x_3^2}{2} \right)^{1/2} \right)^2 + \left( 1 - \left( \frac{x_2^2 + x_3^2}{2} \right)^{1/2} \right)^2 }{2} \right)^{1/2}$
– Assumes mutual independence of index terms
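The failure of distributivity can be observed numerically. The sketch below evaluates both query forms with the 2-norm operators; the helper names and the sample weights are chosen only for illustration.

```python
from math import sqrt

def or2(x, y):
    """2-norm OR of two similarity values."""
    return sqrt((x * x + y * y) / 2)

def and2(x, y):
    """2-norm AND of two similarity values."""
    return 1 - sqrt(((1 - x) ** 2 + (1 - y) ** 2) / 2)

def sim_q1(x1, x2, x3):
    """q1 = (k1 AND k2) OR k3"""
    return or2(and2(x1, x2), x3)

def sim_q2(x1, x2, x3):
    """q2 = (k1 OR k3) AND (k2 OR k3), the distributed form of q1"""
    return and2(or2(x1, x3), or2(x2, x3))

# The two forms are equivalent in Boolean logic but rank this doc differently.
x1, x2, x3 = 0.2, 0.9, 0.5
print(sim_q1(x1, x2, x3), sim_q2(x1, x2, x3))
```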
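– Classic models enforce independence of index terms
– For the Vector model
  – The set of index term vectors $\{\vec{k}_1, \vec{k}_2, \dots, \vec{k}_t\}$ is assumed linearly independent and forms a basis for the subspace of interest
  – Frequently, this is interpreted as pairwise orthogonality: $\forall i \ne j \Rightarrow \vec{k}_i \cdot \vec{k}_j = 0$ (in a more restrictive sense)
– Generalized vector space model (Wong et al., 1985)
  – The index term vectors are linearly independent, but not pairwise orthogonal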
– The index term vectors that form the basis of the space are not orthogonal; they are themselves composed of smaller components (minterms)
– {k1, k2, …, kt}: the set of all terms
– wi,j: the weight associated with [ki, dj]
– Minterms: binary indicators (0 or 1) of all patterns of term occurrence within a specific document
– $2^t$ minterms
  m1 = (0,0,…,0)
  m2 = (1,0,…,0)
  m3 = (0,1,…,0)
  m4 = (1,1,…,0): points to the docs where only index terms k1 and k2 co-occur and the other index terms are absent
  m5 = (0,0,1,…,0)
  …
  m_{2^t} = (1,1,1,…,1): points to the docs containing all the index terms
– $2^t$ minterm vectors
  $\vec{m}_1$ = (1,0,0,0,0,…,0)
  $\vec{m}_2$ = (0,1,0,0,0,…,0)
  $\vec{m}_3$ = (0,0,1,0,0,…,0)
  $\vec{m}_4$ = (0,0,0,1,0,…,0)
  $\vec{m}_5$ = (0,0,0,0,1,…,0)
  …
  $\vec{m}_{2^t}$ = (0,0,0,0,0,…,1)
– The pairwise orthogonal vectors $\vec{m}_i$ associated with the minterms $m_i$ serve as the basis for the generalized vector space
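The minterm numbering used above (k1 varies fastest) can be generated programmatically. This is a sketch; the helper names `minterms`, `minterm_vector` and `doc_minterm` are hypothetical, and minterm indices are 1-based to match the slides.

```python
from itertools import product

def minterms(t):
    """All 2^t occurrence patterns, ordered so that m2=(1,0,..), m3=(0,1,..)."""
    return [tuple(reversed(bits)) for bits in product((0, 1), repeat=t)]

def minterm_vector(r, t):
    """One-hot vector of length 2^t for minterm m_r (r is 1-based)."""
    v = [0] * (2 ** t)
    v[r - 1] = 1
    return v

def doc_minterm(weights):
    """1-based index of the minterm matching a doc's term occurrence pattern."""
    pattern = tuple(1 if w > 0 else 0 for w in weights)
    return minterms(len(weights)).index(pattern) + 1

print(minterms(3))
print(minterm_vector(4, 3))
print(doc_minterm((2, 0, 1)))   # k1 and k3 occur, k2 does not
```

Because every one-hot vector has its single 1 in a different position, any two distinct minterm vectors have a zero dot product, which is the pairwise orthogonality the model relies on.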
– Each minterm specifies a kind of dependence among index terms
– That is, the co-occurrence of index terms inside docs in the collection induces dependencies among these index terms
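– The vector associated with the term ki is computed as
$\vec{k}_i = \frac{ \sum_{\forall r,\; g_i(m_r) = 1} c_{i,r}\, \vec{m}_r }{ \sqrt{ \sum_{\forall r,\; g_i(m_r) = 1} c_{i,r}^2 } }$
$c_{i,r} = \sum_{d_j \,\mid\, g_l(\vec{d}_j) = g_l(m_r) \text{ for all } l} w_{i,j}$
  – $g_i(m_r) = 1$ indicates that the index term ki is in the minterm mr
  – The sum for $c_{i,r}$ runs over all the docs whose term co-occurrence relation (pattern) can be represented as (exactly coincides with that of) minterm mr
  – $c_{i,r}$ sums up the weights of the term ki in all the docs which have a term occurrence pattern given by mr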
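– Docs and queries are translated from the term space into the minterm space
$\vec{d}_j = \sum_i w_{i,j}\, \vec{k}_i \;\Rightarrow\; \vec{d}_j = \sum_r s_{d_j,r}\, \vec{m}_r$
$\vec{q} = \sum_i w_{i,q}\, \vec{k}_i \;\Rightarrow\; \vec{q} = \sum_r s_{q,r}\, \vec{m}_r$
– Similarity in the t-dimensional term space (classic vector model)
$sim(q, d_j) = \frac{ \sum_i w_{i,q}\, w_{i,j} }{ \sqrt{\sum_i w_{i,q}^2}\; \sqrt{\sum_i w_{i,j}^2} }$
– Similarity in the $2^t$-dimensional minterm space (generalized vector space model)
$sim(q, d_j) = \frac{ \sum_r s_{q,r}\, s_{d_j,r} }{ \sqrt{\sum_r s_{q,r}^2}\; \sqrt{\sum_r s_{d_j,r}^2} }$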
Example: seven docs d1 … d7 over three index terms k1, k2, k3

Term weights and the minterm matching each doc:

      k1  k2  k3  minterm
d1     2   0   1  m6
d2     1   0   0  m2
d3     0   1   3  m7
d4     2   0   0  m2
d5     1   2   4  m8
d6     1   2   0  m4
d7     0   5   0  m3
q      1   2   3

Minterms:

minterm  k1  k2  k3
m1        0   0   0
m2        1   0   0
m3        0   1   0
m4        1   1   0
m5        0   0   1
m6        1   0   1
m7        0   1   1
m8        1   1   1

Term vector for k1:
$\vec{k}_1 = \frac{ c_{1,2}\vec{m}_2 + c_{1,4}\vec{m}_4 + c_{1,6}\vec{m}_6 + c_{1,8}\vec{m}_8 }{ \sqrt{ c_{1,2}^2 + c_{1,4}^2 + c_{1,6}^2 + c_{1,8}^2 } }$
$c_{1,2} = w_{1,2} + w_{1,4} = 1 + 2 = 3, \quad c_{1,4} = w_{1,6} = 1, \quad c_{1,6} = w_{1,1} = 2, \quad c_{1,8} = w_{1,5} = 1$
$\vec{k}_1 = \frac{ 3\vec{m}_2 + 1\vec{m}_4 + 2\vec{m}_6 + 1\vec{m}_8 }{ \sqrt{ 3^2 + 1^2 + 2^2 + 1^2 } }$

Term vector for k2:
$\vec{k}_2 = \frac{ c_{2,3}\vec{m}_3 + c_{2,4}\vec{m}_4 + c_{2,7}\vec{m}_7 + c_{2,8}\vec{m}_8 }{ \sqrt{ c_{2,3}^2 + c_{2,4}^2 + c_{2,7}^2 + c_{2,8}^2 } }$
$c_{2,3} = w_{2,7} = 5, \quad c_{2,4} = w_{2,6} = 2, \quad c_{2,7} = w_{2,3} = 1, \quad c_{2,8} = w_{2,5} = 2$
$\vec{k}_2 = \frac{ 5\vec{m}_3 + 2\vec{m}_4 + 1\vec{m}_7 + 2\vec{m}_8 }{ \sqrt{ 5^2 + 2^2 + 1^2 + 2^2 } }$

Term vector for k3:
$\vec{k}_3 = \frac{ c_{3,5}\vec{m}_5 + c_{3,6}\vec{m}_6 + c_{3,7}\vec{m}_7 + c_{3,8}\vec{m}_8 }{ \sqrt{ c_{3,5}^2 + c_{3,6}^2 + c_{3,7}^2 + c_{3,8}^2 } }$
$c_{3,5} = 0, \quad c_{3,6} = w_{3,1} = 1, \quad c_{3,7} = w_{3,3} = 3, \quad c_{3,8} = w_{3,5} = 4$
$\vec{k}_3 = \frac{ 0\vec{m}_5 + 1\vec{m}_6 + 3\vec{m}_7 + 4\vec{m}_8 }{ \sqrt{ 0^2 + 1^2 + 3^2 + 4^2 } }$
Simplified term vectors:
$\vec{k}_1 = \frac{ 3\vec{m}_2 + 1\vec{m}_4 + 2\vec{m}_6 + 1\vec{m}_8 }{ \sqrt{15} }$
$\vec{k}_2 = \frac{ 5\vec{m}_3 + 2\vec{m}_4 + 1\vec{m}_7 + 2\vec{m}_8 }{ \sqrt{34} }$
$\vec{k}_3 = \frac{ 1\vec{m}_6 + 3\vec{m}_7 + 4\vec{m}_8 }{ \sqrt{26} }$

Doc d1 in the minterm space (the coefficient of $\vec{m}_r$ is $s_{d_1,r}$):
$\vec{d}_1 = 2\vec{k}_1 + 1\vec{k}_3 = \frac{2 \cdot 3}{\sqrt{15}}\vec{m}_2 + \frac{2 \cdot 1}{\sqrt{15}}\vec{m}_4 + \left( \frac{2 \cdot 2}{\sqrt{15}} + \frac{1 \cdot 1}{\sqrt{26}} \right)\vec{m}_6 + \frac{1 \cdot 3}{\sqrt{26}}\vec{m}_7 + \left( \frac{2 \cdot 1}{\sqrt{15}} + \frac{1 \cdot 4}{\sqrt{26}} \right)\vec{m}_8$

Query q in the minterm space (the coefficient of $\vec{m}_r$ is $s_{q,r}$):
$\vec{q} = 1\vec{k}_1 + 2\vec{k}_2 + 3\vec{k}_3 = \frac{1 \cdot 3}{\sqrt{15}}\vec{m}_2 + \frac{2 \cdot 5}{\sqrt{34}}\vec{m}_3 + \left( \frac{1 \cdot 1}{\sqrt{15}} + \frac{2 \cdot 2}{\sqrt{34}} \right)\vec{m}_4 + \left( \frac{1 \cdot 2}{\sqrt{15}} + \frac{3 \cdot 1}{\sqrt{26}} \right)\vec{m}_6 + \left( \frac{2 \cdot 1}{\sqrt{34}} + \frac{3 \cdot 3}{\sqrt{26}} \right)\vec{m}_7 + \left( \frac{1 \cdot 1}{\sqrt{15}} + \frac{2 \cdot 2}{\sqrt{34}} + \frac{3 \cdot 4}{\sqrt{26}} \right)\vec{m}_8$

Cosine similarity, with the sums restricted to the nonzero coefficients:
$sim(q, d_1) = cosine(\vec{q}, \vec{d}_1) = \frac{ \sum_{r \mid s_{q,r} \ne 0 \wedge s_{d_1,r} \ne 0} s_{q,r}\, s_{d_1,r} }{ \sqrt{ \sum_{r \mid s_{q,r} \ne 0} s_{q,r}^2 }\; \sqrt{ \sum_{r \mid s_{d_1,r} \ne 0} s_{d_1,r}^2 } }$
$= \frac{ s_{q,2} s_{d_1,2} + s_{q,4} s_{d_1,4} + s_{q,6} s_{d_1,6} + s_{q,7} s_{d_1,7} + s_{q,8} s_{d_1,8} }{ \sqrt{ s_{q,2}^2 + s_{q,3}^2 + s_{q,4}^2 + s_{q,6}^2 + s_{q,7}^2 + s_{q,8}^2 }\; \sqrt{ s_{d_1,2}^2 + s_{d_1,4}^2 + s_{d_1,6}^2 + s_{d_1,7}^2 + s_{d_1,8}^2 } }$

The similarity between the query and doc is calculated in the space of minterm vectors
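The worked example can be reproduced with a short NumPy sketch. The weights come from the example table; the variable names (`W`, `C`, `K`) and the helper functions are assumptions made for the illustration, and a term that occurs in no doc would need special handling because its row of `C` would have zero norm.

```python
from itertools import product
import numpy as np

# Term weights from the example (rows: d1..d7, columns: k1, k2, k3).
W = np.array([[2, 0, 1], [1, 0, 0], [0, 1, 3], [2, 0, 0],
              [1, 2, 4], [1, 2, 0], [0, 5, 0]], dtype=float)
q = np.array([1, 2, 3], dtype=float)
t = W.shape[1]

# Occurrence patterns in the order m1..m8 (k1 varies fastest).
patterns = [tuple(reversed(bits)) for bits in product((0, 1), repeat=t)]
index_of = {pattern: r for r, pattern in enumerate(patterns)}

# C[i, r] = c_{i,r}: total weight of term k_i over the docs matching minterm m_r.
C = np.zeros((t, 2 ** t))
for w in W:
    r = index_of[tuple(int(x > 0) for x in w)]
    C[:, r] += w

# Row i of K is the normalized term vector k_i expressed in the minterm basis.
K = C / np.linalg.norm(C, axis=1, keepdims=True)

def to_minterm_space(weights):
    """Map term weights (w_1..w_t) to minterm coefficients (s_1..s_{2^t})."""
    return weights @ K

def gvsm_sim(q_weights, d_weights):
    """Cosine similarity computed in the space of minterm vectors."""
    sq, sd = to_minterm_space(q_weights), to_minterm_space(d_weights)
    return float(sq @ sd / (np.linalg.norm(sq) * np.linalg.norm(sd)))

print((C ** 2).sum(axis=1))        # squared norms of the unnormalized term vectors
print(to_minterm_space(W[0]))      # coefficients s_{d1,r}
print(gvsm_sim(q, W[0]))           # sim(q, d1)
```

Ranking all seven docs amounts to calling `gvsm_sim(q, w)` for each row `w` of `W` and sorting by the result.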
– The degree of correlation between the terms ki and kj can now be computed as
done it before!)
$\vec{k}_i \cdot \vec{k}_j = \sum_{\forall r \,\mid\, g_i(m_r) = 1 \,\wedge\, g_j(m_r) = 1} c_{i,r} \times c_{j,r}$
– Model considers correlations among index terms
– Model does introduce interesting new ideas
– Not clear in which situations it is superior to the standard vector model
– Computation costs are higher