Simulation of Within-Session Query Variations using a Text Segmentation Approach
Debasis Ganguly Johannes Leveling Gareth J.F. Jones
CNGL, School of Computing, Dublin City University, Ireland
Outline: Introduction to query reformulation
Example (more specific): “Mahatma Gandhi” → “Mahatma Gandhi non-violence movement”
Example (more general): “Mahatma Gandhi assassination” → “Mahatma Gandhi life and works”
Example (drift): “Mahatma Gandhi assassination” → “Gandhi film”
– Session IR tasks: the goal is to improve IR effectiveness over an entire query session for a user
– Collaborative IR tasks: the goal is to improve IR effectiveness for a new user by utilizing user responses to related queries
wildlife poaching Indian tigers African lions
Osteoporosis
Document about bone diseases in general with a dedicated section on osteoporosis
– Introduction: the search for life in space
– How the moon helped life evolve on earth
– The moon's chemical composition
Example from M. Hearst. CL. 1997.
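The slides do not say which segmenter is used, so the following is only a minimal sketch of lexical-cohesion segmentation in the spirit of the Hearst example above. It is not Hearst's full algorithm. The function names, the sentence-level comparison and the `threshold` value are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment(sentences, threshold=0.1):
    """Group tokenized sentences into segments.

    A new segment starts wherever two adjacent sentences share so little
    vocabulary that their cosine similarity falls below `threshold`
    (a hypothetical cut-off, chosen only for this sketch).
    """
    if not sentences:
        return []
    segments = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(Counter(prev), Counter(cur)) < threshold:
            segments.append([cur])      # vocabulary shift: open a new segment
        else:
            segments[-1].append(cur)    # same sub-topic: extend current segment
    return segments


sentences = [
    ["moon", "life", "space"],
    ["life", "space", "search"],
    ["moon", "chemical", "composition"],
    ["chemical", "composition", "rock"],
]
segments = segment(sentences)
print(len(segments))
```

On this toy input the second and third sentences share no terms, so the sketch places a single boundary between them.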
– Term 1: dominant in topic 1
– Term 2: dominant in topic 2
– Term 3: general term
– Add the most specific terms from the most/least similar segments of documents to the original query to get a more specific/drifting query
– Substitute original query terms with more general terms as
– term frequency in segment,
– inverse segment frequency, and
– idf

– term frequency in document,
– segment frequency, and
– idf
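The two factor lists above can be sketched as scoring functions. The slides name the factors but not how they are combined, so the plain product used here is an assumption. The smoothed logarithm for inverse segment frequency, the function names and the `idf` dictionary are also assumptions.

```python
import math
from collections import Counter


def segment_frequencies(segments):
    """Number of segments of the document in which each term occurs."""
    seg_freq = Counter()
    for seg in segments:
        seg_freq.update(set(seg))
    return seg_freq


def specific_term_scores(segments, target, idf):
    """Score terms of segment `target`: tf in segment * inverse segment frequency * idf.

    The product and the log(1 + N / sf) form are assumptions of this sketch.
    """
    seg_freq = segment_frequencies(segments)
    n_seg = len(segments)
    tf = Counter(segments[target])
    return {
        t: tf[t] * math.log(1 + n_seg / seg_freq[t]) * idf.get(t, 0.0)
        for t in tf
    }


def general_term_scores(segments, idf):
    """Score terms of a document: tf in document * segment frequency * idf.

    Segment frequency is normalized by the number of segments (an assumption).
    """
    seg_freq = segment_frequencies(segments)
    n_seg = len(segments)
    tf_doc = Counter(t for seg in segments for t in seg)
    return {
        t: tf_doc[t] * (seg_freq[t] / n_seg) * idf.get(t, 0.0)
        for t in tf_doc
    }


# Toy document with three segments; idf values are made up for illustration.
segments = [
    ["bone", "disease", "bone"],
    ["bone", "osteoporosis", "calcium", "osteoporosis"],
    ["bone", "fracture"],
]
idf = {t: 1.0 for seg in segments for t in seg}

specific = specific_term_scores(segments, target=1, idf=idf)
general = general_term_scores(segments, idf=idf)
print(max(specific, key=specific.get), max(general, key=general.get))
```

With uniform idf, the term concentrated in one segment ("osteoporosis") ranks highest as a specific term, while the term spread across all segments ("bone") ranks highest as a general term.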
– Smaller set of relevant documents (queries are typically longer)
– Top ranked documents for the original query become more general with respect to the specific reformulated query but are still relevant (overlap in top ranked documents)
– Larger set of relevant documents (queries are typically shorter)
– Low overlap and high shift of top ranked documents retrieved in response to the original query
– Overlap of retrieved documents at cut-off 10, 20, 50 and 500: O(N)
– Net perturbation of top m documents: $p(m) = \frac{1}{m}\sum_{k=1}^{m}\left(\mathrm{new\_rank}(d_k) - k\right)$
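A minimal sketch of the two result-set measures above, assuming each ranking is a list of document identifiers ordered from rank 1. The function names are illustrative. The slides do not say how a document missing from the new ranking is handled, so the fallback rank used here (one past the end of the new list) is an assumption.

```python
def overlap_at(original, reformulated, n):
    """O(N): percentage of documents shared by the two top-n result lists."""
    shared = set(original[:n]) & set(reformulated[:n])
    return 100.0 * len(shared) / n


def net_perturbation(original, reformulated, m, missing_rank=None):
    """p(m): mean of new_rank(d_k) - k over the top-m original documents.

    `missing_rank` is the rank assigned to documents absent from the
    reformulated list (assumption: one past the end of that list).
    """
    new_rank = {doc: r for r, doc in enumerate(reformulated, start=1)}
    if missing_rank is None:
        missing_rank = len(reformulated) + 1
    total = 0
    for k, doc in enumerate(original[:m], start=1):
        total += new_rank.get(doc, missing_rank) - k
    return total / m


original = ["d1", "d2", "d3", "d4"]
reformulated = ["d3", "d1", "d5", "d2"]

print(overlap_at(original, reformulated, 2))        # shared top-2 documents: d1
print(net_perturbation(original, reformulated, 2))  # d1: 2-1, d2: 4-2
```

In practice the rankings would be the retrieval runs for the original and the reformulated query, truncated at the same depth.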
– High overlap and low perturbation for specialized queries
– Low overlap and high perturbation for general queries
TREC disk 4+5 documents
TREC-8 topics:
– Topic titles as initial queries for specific and drift reformulations
– Topic descriptions as initial queries for general reformulations
Type      Manual Assessment         Result Set Measures
          Assessor-1  Assessor-2    O(10)  O(20)  O(50)  O(500)  p(5)
Specific  39 (78%)    26 (52%)      39.0   38.1   42.7   44.7    367.9
General   39 (78%)    43 (86%)      22.4   22.5   24.5   32.2    2208.6
Drift     34 (68%)    35 (70%)      12.0   10.2   8.6    5.9     3853.3
– Highest inter-assessor agreement for drift, since a drift in information need is not subject to personal judgement
– Lowest inter-assessor agreement for specific reformulations, since the semantic specificity of added words can depend on personal judgement
– For specific and general reformulations, the overlap percentage increases with the cut-off rank, indicating that more “seen” documents appear further down the ranked list
Specific reformulations
                      Assessor 1 agrees                            Assessor 1 disagrees
Assessor 2 agrees     behavioural genetics chromosomes DNA genome  cosmic events magnitude proton ion
Assessor 2 disagrees  N/A                                          salvaging, shipwreck, treasure found aircraft Rotterdam
semantically related to the original keywords, and the degree of semantic closeness is often subject to personal judgments
“proton” and “ion” make the initial query “cosmic events” more specific
Images from Flickr