CS6200: Information Retrieval
Slides by: Jesse Anderton
Term Co-Occurrence
VSM, session 11
Query Expansion
We can add words with similar meanings to query terms, e.g. from stem classes or a thesaurus. We can also add words which commonly co-occur with query terms, since such words are likely related to the same topic.
Medical Subject Headings Thesaurus (NIH)
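As a minimal sketch of thesaurus-based expansion: the tiny THESAURUS dictionary and the expand_query helper below are hypothetical, made up for illustration only. A real system would draw related terms from stem classes or from a curated resource such as a subject-heading thesaurus.

```python
# Hypothetical toy thesaurus; the entries are illustrative, not taken from MeSH.
THESAURUS = {
    "fish": ["goldfish", "aquarium"],
    "tropical": ["exotic"],
}


def expand_query(terms, thesaurus):
    """Return the query terms followed by related terms, without duplicates."""
    expanded = list(terms)
    seen = set(terms)
    for term in terms:
        for related in thesaurus.get(term, []):
            if related not in seen:
                seen.add(related)
                expanded.append(related)
    return expanded


print(expand_query(["tropical", "fish"], THESAURUS))
```

Terms with no thesaurus entry pass through unchanged, so the expanded query always contains the original query.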
There are many measures of term co-occurrence. We’ll summarize them here, and then examine what each means and how they differ.
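Measures of co-occurrence. Here $n_a$ and $n_b$ are the numbers of documents containing terms a and b, $n_{ab}$ is the number of documents containing both, and $N$ is the number of documents in the collection.

Measure                                      Formula*
Mutual information (MIM)                     $\frac{n_{ab}}{n_a \cdot n_b}$
Expected mutual information (EMIM)           $n_{ab} \cdot \log\left(N \cdot \frac{n_{ab}}{n_a \cdot n_b}\right)$
Chi-square ($\chi^2$)                        $\frac{\left(n_{ab} - \frac{1}{N} \cdot n_a \cdot n_b\right)^2}{n_a \cdot n_b}$
Dice’s coefficient (Dice)                    $\frac{n_{ab}}{n_a + n_b}$

* These formulas are partial, but rank-equivalent to the full formulas.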
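The four rank-equivalent formulas can be sketched directly from document counts. The function names, the parameter name n_docs (standing in for N), the handling of a zero co-occurrence count, and the counts in the usage lines are all assumptions made for illustration.

```python
import math


def mim(n_a, n_b, n_ab):
    """Rank-equivalent mutual information: n_ab / (n_a * n_b)."""
    return n_ab / (n_a * n_b)


def emim(n_a, n_b, n_ab, n_docs):
    """Rank-equivalent expected mutual information."""
    if n_ab == 0:
        return 0.0  # assumed convention: 0 * log(0) is treated as 0
    return n_ab * math.log(n_docs * n_ab / (n_a * n_b))


def chi_square(n_a, n_b, n_ab, n_docs):
    """Partial chi-square: squared gap between observed and expected co-occurrences."""
    return (n_ab - n_a * n_b / n_docs) ** 2 / (n_a * n_b)


def dice(n_a, n_b, n_ab):
    """Rank-equivalent Dice coefficient: n_ab / (n_a + n_b)."""
    return n_ab / (n_a + n_b)


# Made-up counts: term a occurs in 100 of 1000 documents and is compared
# with a rare candidate term and a common candidate term.
N_DOCS = 1000
rare = dict(n_a=100, n_b=2, n_ab=2)
common = dict(n_a=100, n_b=200, n_ab=80)
for name, c in (("rare", rare), ("common", common)):
    print(name, mim(**c), emim(n_docs=N_DOCS, **c),
          chi_square(n_docs=N_DOCS, **c), dice(**c))
```

With these made-up counts, MIM scores the rare term higher, while the other three measures score the common term higher.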
Dice’s coefficient, aka the Sørensen index, is used to compare two random samples. In this case, we compare the population of documents containing terms a and b to the populations containing a and containing b.
$$\frac{2 \cdot n_{ab}}{n_a + n_b} \stackrel{\mathrm{rank}}{=} \frac{n_{ab}}{n_a + n_b}$$
Pointwise mutual information is a measure of correlation from information theory.
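$$\log \frac{P(a,b)}{P(a)\,P(b)} = \log \frac{n_{ab}/N}{(n_a/N) \cdot (n_b/N)} \stackrel{\mathrm{rank}}{=} \frac{n_{ab}}{n_a \cdot n_b}$$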
Expected mutual information corrects a bias of pointwise mutual information toward low frequency terms.
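A sketch of where the rank-equivalent form comes from, assuming the maximum-likelihood estimates $P(a) = n_a/N$, $P(b) = n_b/N$ and $P(a,b) = n_{ab}/N$. Only the term in which both words occur is shown, which is one reason the formula is partial:

```latex
\begin{align*}
P(a,b) \cdot \log \frac{P(a,b)}{P(a)\,P(b)}
  &= \frac{n_{ab}}{N} \cdot \log\left( N \cdot \frac{n_{ab}}{n_a \cdot n_b} \right) \\
  &\stackrel{\mathrm{rank}}{=} n_{ab} \cdot \log\left( N \cdot \frac{n_{ab}}{n_a \cdot n_b} \right)
\end{align*}
```

The leading factor $n_{ab}$ grows with the number of co-occurrences, so a pair that co-occurs only a handful of times can no longer win on the ratio alone.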
Pearson’s Chi-squared test is a test of statistical significance which compares the number of term co-occurrences to the number we’d expect if the terms were independent. (This is also not the full form of this measure.)
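$$\frac{\left(n_{ab} - N \cdot \frac{n_a}{N} \cdot \frac{n_b}{N}\right)^2}{N \cdot \frac{n_a}{N} \cdot \frac{n_b}{N}} \stackrel{\mathrm{rank}}{=} \frac{\left(n_{ab} - \frac{1}{N} \cdot n_a \cdot n_b\right)^2}{n_a \cdot n_b}$$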
Most associated terms for “tropical” in a collection of TREC news stories. Most associated terms for “fish” in the same collection.
Instead of counting co-occurrences in the entire document, count those that occur within a smaller window of text.
Look for new terms associated with multiple query terms instead of just one.
Using Dice with “tropical fish” gives the following list: goldfish, reptile, aquarium, coral, frog, exotic, stripe, regent, pet, wet.
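One possible way to score candidates against a multi-term query is sketched below. It assumes the "a" population is the set of documents containing every query term, which is an interpretation rather than something the slides specify. The dice_expansion helper and the toy documents are hypothetical.

```python
def dice_expansion(docs, query_terms, k=2):
    """Rank candidate terms by Dice association with the whole query."""
    query = set(query_terms)
    doc_sets = [set(d) for d in docs]
    # Assumption: "a" is the set of documents containing every query term.
    n_a = sum(1 for d in doc_sets if query <= d)
    scores = {}
    vocabulary = set().union(*doc_sets) - query
    for term in vocabulary:
        n_b = sum(1 for d in doc_sets if term in d)
        n_ab = sum(1 for d in doc_sets if term in d and query <= d)
        scores[term] = n_ab / (n_a + n_b)  # n_b >= 1 for every vocabulary term
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Made-up toy collection, each document given as a list of terms.
docs = [
    ["tropical", "fish", "aquarium", "pet"],
    ["tropical", "fish", "aquarium"],
    ["tropical", "storm", "weather"],
    ["fish", "river", "salmon"],
    ["pet", "dog"],
]
print(dice_expansion(docs, ["tropical", "fish"]))
```

In this toy collection, terms that appear only alongside one of the two query terms ("storm", "salmon") score zero, which is the point of requiring association with the full query.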
Most associated terms for “fish” with co-occurrences measured in a window of 5 terms.
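Window-based counting can be sketched as follows. The window_cooccurrences helper is hypothetical, and it assumes "a window of 5 terms" means up to 5 positions on either side of the target term. Other definitions (for example, fixed non-overlapping windows) are equally plausible.

```python
from collections import Counter


def window_cooccurrences(tokens, target, window=5):
    """Count terms appearing within `window` positions of each `target` occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target occurrence itself
                counts[tokens[j]] += 1
    return counts


# Made-up example sentence.
tokens = "tropical fish live in warm water near the coral reef".split()
print(window_cooccurrences(tokens, "fish", window=5))
```

These raw counts would still need to be turned into one of the association measures. The sketch only shows how restricting the context excludes distant terms such as "coral" and "reef" here.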
Using term association measures to select words for query expansion can help improve retrieval performance. However, it can also worsen performance if care is not taken to provide meaningful context. This approach can suffer from “topic drift.” In our next session we’ll look at relevance feedback, which finds terms for expansion based on information about which documents are relevant to the query.