Goharian, Grossman, Frieder 2002, 2010
Retrieval Strategies: Vector Space Model and Boolean
(COSC 416)
Nazli Goharian
nazli@cs.georgetown.edu
Retrieval Strategy
- An IR strategy is a technique by which a relevance measure is obtained
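- Term frequency (tf_ij): the number of times term j appears in document i
- Document frequency (df_j): the number of documents that term j appears in
- Inverse document frequency (idf_j): a measure of how discriminating term j is in the collection
- idf_j = log10(n / df_j), where n is the number of documents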
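As a minimal sketch of the definitions above, the following computes df and idf over a tiny collection. The three documents and the function name `inverse_document_frequencies` are assumptions made for illustration only, and tokenization is a plain whitespace split.

```python
import math


def inverse_document_frequencies(documents):
    """Return {term: idf} with idf = log10(n / df) for each term."""
    n = len(documents)
    df = {}
    for text in documents:
        # A term counts once per document, however often it occurs there.
        for term in set(text.lower().split()):
            df[term] = df.get(term, 0) + 1
    return {term: math.log10(n / count) for term, count in df.items()}


# Hypothetical three-document collection (an assumption, not source data).
docs = [
    "shipment of gold damaged in a fire",
    "delivery of silver arrived in a silver truck",
    "shipment of gold arrived in a truck",
]
idf = inverse_document_frequencies(docs)
print(round(idf["gold"], 3))    # df = 2, so log10(3 / 2)
print(round(idf["silver"], 3))  # df = 1, so log10(3 / 1)
print(idf["in"])                # df = 3, so log10(1)
```

A term that occurs in every document gets an idf of 0, so it contributes nothing to any weight built on tf × idf.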
$$d_{ij} = \frac{tf_{ij} \times idf_j}{\sqrt{\sum_{j=1}^{t} \left(tf_{ij} \times idf_j\right)^2}}$$
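A length-normalized tf-idf weight, in which each tf × idf value is divided by the Euclidean length of the document's weight vector, could be sketched as follows. The function name, the dictionary representation, and the sample counts are assumptions for illustration.

```python
import math


def normalized_weights(tf, idf):
    """Length-normalize one document's tf-idf weights.

    tf:  {term: occurrences in this document}
    idf: {term: inverse document frequency}
    """
    raw = {term: count * idf.get(term, 0.0) for term, count in tf.items()}
    length = math.sqrt(sum(w * w for w in raw.values()))
    if length == 0.0:
        return raw  # every weight is zero, nothing to normalize
    return {term: w / length for term, w in raw.items()}


# Hypothetical counts and idf values.
weights = normalized_weights({"silver": 2, "truck": 1},
                             {"silver": 0.477, "truck": 0.176})
print(weights)
```

After normalization the squared weights of a document sum to 1, so long and short documents are put on the same scale.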
$$SC(Q, D_i) = \sum_{j=1}^{t} w_{qj} \times d_{ij}$$
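The inner-product similarity coefficient, the sum over all terms of query weight times document weight, might be implemented over sparse vectors like this. Representing each vector as a `{term: weight}` dictionary is an assumption; absent terms are treated as weight 0.

```python
def inner_product(query, doc):
    """Sum of w_qj * d_ij over the terms present in both vectors."""
    return sum(w * doc[term] for term, w in query.items() if term in doc)


# Hypothetical weights.
query = {"gold": 1.0, "truck": 2.0}
doc = {"gold": 0.5, "fire": 3.0}
score = inner_product(query, doc)
print(score)  # only "gold" is shared: 1.0 * 0.5
```

Only terms shared by the query and the document contribute, which is why a sparse representation is sufficient.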
$$SC(Q, D_i) = \frac{\sum_{j=1}^{t} w_{qj} \, d_{ij}}{\sqrt{\sum_{j=1}^{t} d_{ij}^2 \; \sum_{j=1}^{t} w_{qj}^2}}$$
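The cosine measure divides the inner product by the lengths of the query and document vectors. A minimal sketch, assuming the same `{term: weight}` dictionaries as sparse vectors (the function name and sample weights are illustrative):

```python
import math


def cosine(query, doc):
    """Inner product divided by the product of the two vector lengths."""
    dot = sum(w * doc[term] for term, w in query.items() if term in doc)
    query_len = math.sqrt(sum(w * w for w in query.values()))
    doc_len = math.sqrt(sum(w * w for w in doc.values()))
    if query_len == 0.0 or doc_len == 0.0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (query_len * doc_len)


# Scaling a document's weights does not change its cosine score.
short_doc = {"gold": 3.0, "truck": 4.0}
long_doc = {"gold": 30.0, "truck": 40.0}
query = {"gold": 1.0}
print(cosine(query, short_doc), cosine(query, long_doc))
```

Because the document length is divided out, a document that repeats the same content ten times scores the same as the original.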
$$SC(Q, D_i) = \frac{\sum_{j=1}^{t} w_{qj} \, d_{ij}}{(1 - s) \cdot avgn + s \cdot \sqrt{\sum_{j=1}^{t} d_{ij}^2}}$$

- avgn: average document normalization factor over the entire collection
- s: can be obtained empirically
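A pivoted normalization of this kind replaces the document's own length in the denominator with a blend of that length and the collection average avgn, controlled by s. The sketch below assumes sparse `{term: weight}` vectors; the function name and the values of `avgn` and `s` are placeholders, not values from the source.

```python
import math


def pivoted_score(query, doc, avgn, s):
    """Inner product over (1 - s) * avgn + s * (document vector length)."""
    dot = sum(w * doc[term] for term, w in query.items() if term in doc)
    doc_len = math.sqrt(sum(w * w for w in doc.values()))
    denominator = (1.0 - s) * avgn + s * doc_len
    if denominator == 0.0:
        return 0.0
    return dot / denominator


query = {"gold": 1.0}
doc = {"gold": 3.0, "truck": 4.0}  # vector length 5.0
# Placeholder parameters: avgn = 10.0 and s = 0.5 are assumptions.
print(pivoted_score(query, doc, avgn=10.0, s=0.5))  # 3.0 / (5.0 + 2.5)
```

With s = 1 the denominator is the document's own length; with s = 0 every document is divided by the same constant avgn. Values in between interpolate, since the denominator is a weighted average of the two.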
[Equation not recoverable; surviving indices: i, ij, qj, and a sum over j = 1..t]
#    Term       df   idf
1    a          3    0
2    arrived    2    0.176
3    damaged    1    0.477
4    delivery   1    0.477
5    fire       1    0.477
6    gold       2    0.176
7    in         3    0
8    [missing]  3    0
9    silver     1    0.477
10   shipment   2    0.176
11   truck      2    0.176
doc  t1  t2    t3    t4    t5    t6    t7  t8  t9    t10   t11
D1   0   0     .477  0     .477  .176  0   0   0     .176  0
D2   0   .176  0     .477  0     0     0   0   .954  0     .176
D3   0   .176  0     0     0     .176  0   0   0     .176  .176
Q    0   0     0     0     0     .176  0   0   .477  0     .176
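To see how such a weight table turns into a ranking, the sketch below scores each document against Q with the inner product. The weights are transcribed from the table above; the column positions of the zero entries are an assumption, and the variable names are illustrative.

```python
# Rows are weights for t1..t11; zero placement is an assumption.
documents = {
    "D1": [0, 0, .477, 0, .477, .176, 0, 0, 0, .176, 0],
    "D2": [0, .176, 0, .477, 0, 0, 0, 0, .954, 0, .176],
    "D3": [0, .176, 0, 0, 0, .176, 0, 0, 0, .176, .176],
}
query = [0, 0, 0, 0, 0, .176, 0, 0, .477, 0, .176]

# Inner product of the query vector with each document vector.
scores = {
    name: sum(q * d for q, d in zip(query, row))
    for name, row in documents.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 3))
```

Under these assumptions D2 ranks first because it is the only document with a nonzero weight in t9, the column where the query weight is largest.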
doc t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11
D1 1 1 1 0 0 1
D2 1 1 1 0 0 1 1
D3 1 1 1 0 0 1 1
D4 1 0 0 1 1
x AND y: tfx × tfy
x OR y: tfx + tfy
NOT x: 0 if tfx > 0, 1 if tfx = 0
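These three rules can be read as a way to score a document for a Boolean query from term frequencies. The sketch below is one possible reading: the function names are hypothetical, and feeding the score of a sub-expression into an outer operator as if it were a term frequency is an assumption, not something the rules state.

```python
def score_and(tf_x, tf_y):
    """x AND y: product, so the score is 0 if either operand is absent."""
    return tf_x * tf_y


def score_or(tf_x, tf_y):
    """x OR y: sum, so either operand alone yields a positive score."""
    return tf_x + tf_y


def score_not(tf_x):
    """NOT x: 0 if x occurs in the document, 1 if it does not."""
    return 0 if tf_x > 0 else 1


# Hypothetical term frequencies for one document.
tf = {"gold": 2, "silver": 0, "truck": 1}

# (gold AND truck) OR silver
print(score_or(score_and(tf["gold"], tf["truck"]), tf["silver"]))
# gold AND NOT silver
print(score_and(tf["gold"], score_not(tf["silver"])))
# gold AND silver
print(score_and(tf["gold"], tf["silver"]))
```

A score of 0 corresponds to a document that an unranked Boolean query would reject, while larger scores give an ordering among the documents that match.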