Term Co-Occurrence (VSM, session 11), CS6200: Information Retrieval - PowerPoint PPT Presentation



SLIDE 1

Term Co-Occurrence

VSM, session 11

CS6200: Information Retrieval

Slides by: Jesse Anderton

SLIDE 2

Query Expansion

We can add words with similar meanings to query terms, e.g. from stem classes or a thesaurus. We can also add words which commonly co-occur with query terms, on the assumption that they must be related to the same topic.

[Image: Medical Subject Headings Thesaurus (NIH)]

SLIDE 3

Term Association Measures

There are many measures of term co-occurrence. We'll summarize them here, and then examine what each means and how they differ. Throughout, n_a and n_b are the numbers of documents containing terms a and b, n_ab is the number containing both, and N is the number of documents in the collection.

Measures of co-occurrence:*

  Mutual Information (MIM):            n_ab / (n_a · n_b)
  Expected Mutual Information (EMIM):  n_ab · log [ N · n_ab / (n_a · n_b) ]
  Chi-square (χ²):                     (n_ab − (1/N) · n_a · n_b)² / (n_a · n_b)
  Dice's coefficient (Dice):           n_ab / (n_a + n_b)

* These formulas are partial, but rank-equivalent to the full formulas.
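The four rank-equivalent formulas above can be computed directly from the document counts. A minimal sketch, using hypothetical counts (the function name and the example numbers are illustrative, not from the slides):

```python
from math import log

def association_measures(n_a, n_b, n_ab, N):
    """Rank-equivalent term association measures for terms a and b.

    n_a, n_b : number of documents containing a, resp. b
    n_ab     : number of documents containing both
    N        : total number of documents in the collection
    """
    mim = n_ab / (n_a * n_b)
    emim = n_ab * log(N * n_ab / (n_a * n_b)) if n_ab > 0 else 0.0
    chi2 = (n_ab - n_a * n_b / N) ** 2 / (n_a * n_b)
    dice = n_ab / (n_a + n_b)
    return {"MIM": mim, "EMIM": emim, "chi2": chi2, "Dice": dice}

# Hypothetical counts: term a in 1,000 docs, term b in 2,000 docs,
# co-occurring in 300, out of N = 100,000 documents.
scores = association_measures(1000, 2000, 300, 100000)
```

Because each formula is only rank-equivalent to its full form, the absolute values are not comparable across measures; only the ordering they induce over term pairs matters.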

SLIDE 4

Dice's Coefficient

Dice's coefficient, aka the Sørensen index, is used to compare two random samples. In this case, we compare the population of documents containing terms a and b to the populations containing a and containing b.

  dice(a, b) = 2 · n_ab / (n_a + n_b)

which is rank-equivalent to n_ab / (n_a + n_b).

SLIDE 5

Pointwise Mutual Information

Pointwise mutual information is a measure of correlation from information theory.

  pmi(a, b) := log [ p(a, b) / (p(a) · p(b)) ]
             = log [ (n_ab / N) / ((n_a / N) · (n_b / N)) ]
             = log N + log [ n_ab / (n_a · n_b) ]

which is rank-equivalent to n_ab / (n_a · n_b).

SLIDE 6

Expected Mutual Information

Expected mutual information corrects a bias of pointwise mutual information toward low-frequency terms.

  emim(a, b) ∝ P(a, b) · log [ P(a, b) / (P(a) · P(b)) ]
             = (n_ab / N) · log [ N · n_ab / (n_a · n_b) ]

which is rank-equivalent to n_ab · log [ N · n_ab / (n_a · n_b) ].

SLIDE 7

Pearson's Chi-squared Measure

Pearson's Chi-squared test is a test of statistical significance which compares the number of term co-occurrences to the number we'd expect if the terms were independent. (This is also not the full form of this measure.)

  chi2(a, b) = (n_ab − N · (n_a / N) · (n_b / N))² / (N · (n_a / N) · (n_b / N))

which is rank-equivalent to (n_ab − (1/N) · n_a · n_b)² / (n_a · n_b).
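The rank-equivalence can be checked numerically: the full form is exactly N times the partial form, so both induce the same ordering over term pairs. A sketch with made-up counts (all numbers below are hypothetical):

```python
def chi2_full(n_a, n_b, n_ab, N):
    # Full form: compare observed co-occurrences to the expected
    # count N * (n_a/N) * (n_b/N) under independence.
    expected = N * (n_a / N) * (n_b / N)
    return (n_ab - expected) ** 2 / expected

def chi2_partial(n_a, n_b, n_ab, N):
    # Partial form from the slide: drops the constant factor N,
    # which does not change the ranking.
    return (n_ab - n_a * n_b / N) ** 2 / (n_a * n_b)

# Made-up (n_a, n_b, n_ab) counts for three term pairs, N = 10,000 docs.
pairs = [(50, 80, 10), (500, 400, 30), (120, 60, 25)]
N = 10000
rank_full = sorted(range(3), key=lambda i: chi2_full(*pairs[i], N), reverse=True)
rank_partial = sorted(range(3), key=lambda i: chi2_partial(*pairs[i], N), reverse=True)
# The two rankings agree because chi2_full == N * chi2_partial.
```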

SLIDE 8

Association Measure Example

[Table: Most associated terms for "tropical" in a collection of TREC news stories.]

[Table: Most associated terms for "fish" in the same collection.]

SLIDE 9

Improving the Results

Instead of counting co-occurrences in the entire document, count those that occur within a smaller window.

Look for new terms associated with multiple query terms instead of just one.

Using Dice with "tropical fish" gives the following list: goldfish, reptile, aquarium, coral, frog, exotic, stripe, regent, pet, wet.

[Table: Most associated terms for "fish" with co-occurrences measured in a window of 5 terms.]
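Window-based counting can be sketched as follows: slide a fixed-size window over the token stream, count the windows in which each term and each term pair appears, and feed those counts into Dice. This is a toy single-document version under assumed helper names; the toy text is invented, and a real system would aggregate counts over the whole collection:

```python
from collections import Counter
from itertools import combinations

def windowed_counts(tokens, window=5):
    """Count, per term and per term pair, the number of
    length-`window` windows in which they appear together."""
    term_windows = Counter()
    pair_windows = Counter()
    n_windows = max(1, len(tokens) - window + 1)
    for i in range(n_windows):
        seen = set(tokens[i:i + window])
        term_windows.update(seen)
        for a, b in combinations(sorted(seen), 2):
            pair_windows[(a, b)] += 1
    return term_windows, pair_windows

def dice(a, b, term_windows, pair_windows):
    # Dice over window counts: n_ab / (n_a + n_b).
    n_ab = pair_windows[tuple(sorted((a, b)))]
    return n_ab / (term_windows[a] + term_windows[b])

# Hypothetical toy text; not from the TREC collection on the slide.
tokens = "tropical fish live in a tropical aquarium with exotic fish".split()
tw, pw = windowed_counts(tokens, window=5)
score = dice("tropical", "fish", tw, pw)
```

The window size trades precision for recall: a small window favors terms in tight syntactic relationships with the query terms, while a whole-document "window" admits looser topical associations.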

SLIDE 10

Wrapping Up

Using term association measures to select words for query expansion can help improve retrieval performance. However, it can also worsen performance if care is not taken to provide meaningful context, and the approach can suffer from "topic drift." In our next session we'll look at relevance feedback, which finds terms for expansion based on information about which documents are relevant to the query.