SLIDE 1

INFO 4300 / CS4300 Information Retrieval, slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/

IR 13: Query Expansion and Probabilistic Retrieval

Paul Ginsparg

Cornell University, Ithaca, NY

8 Oct 2009

SLIDE 2

Administrativa

• No office hours tomorrow, Fri 9 Oct; e-mail questions, doubts, concerns, and problems to cs4300-l@lists.cs.cornell.edu
• Remember: the mid-term is one week from today, Thu 15 Oct. For more info, see http://www.infosci.cornell.edu/Courses/info4300/2009fa/exams.html

SLIDE 3

Overview

1. Recap
2. Pseudo Relevance Feedback
3. Query expansion
4. Probabilistic Retrieval
5. Discussion

SLIDE 4

Outline

1. Recap
2. Pseudo Relevance Feedback
3. Query expansion
4. Probabilistic Retrieval
5. Discussion

SLIDE 5

Selection of singular values

The full SVD factors the term-document matrix as

$$\underset{t \times d}{C} \;=\; \underset{t \times m}{U}\;\underset{m \times m}{\Sigma}\;\underset{m \times d}{V^T}$$

Keeping only the $k$ largest singular values gives

$$\underset{t \times d}{C_k} \;=\; \underset{t \times k}{U_k}\;\underset{k \times k}{\Sigma_k}\;\underset{k \times d}{V_k^T}$$

$m$ is the original rank of $C$. $k$ is the number of singular values chosen to represent the concepts in the set of documents. Usually, $k \ll m$. $\Sigma_k^{-1}$ is defined only on the $k$-dimensional subspace.
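As a concrete illustration, here is a minimal NumPy sketch of this truncation (not the lecture's code); the matrix C and the sizes t, d, k are hypothetical stand-ins:

    # Minimal sketch of the rank-k SVD truncation, on a random stand-in matrix.
    import numpy as np

    t, d, k = 200, 100, 10                 # hypothetical sizes, with k << rank(C)
    C = np.random.rand(t, d)               # stand-in term-document matrix

    U, s, Vt = np.linalg.svd(C, full_matrices=False)  # C = U @ np.diag(s) @ Vt
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]          # keep k largest singular values
    Ck = Uk @ np.diag(sk) @ Vtk                       # rank-k approximation C_k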

SLIDE 6

Now approximate $C \to C_k$

In the LSI approximation, use $C_k$ (the rank-$k$ approximation to $C$), so the similarity measure between query and document becomes

$$\frac{\vec q \cdot \vec d^{(j)}}{|\vec q|\,|\vec d^{(j)}|} = \frac{\vec q \cdot C e^{(j)}}{|\vec q|\,|C e^{(j)}|} \;\Longrightarrow\; \frac{\vec q \cdot C_k e^{(j)}}{|\vec q|\,|C_k e^{(j)}|} = \frac{\vec q \cdot \vec d^{\,*(j)}}{|\vec q|\,|\vec d^{\,*(j)}|}\,, \qquad (2)$$

where $\vec d^{\,*(j)} = C_k e^{(j)} = U_k \Sigma_k V_k^T e^{(j)}$ is the LSI representation of the $j$th document vector in the original term-document space. Finding the closest documents to a query in the LSI approximation thus amounts to computing (2) for each of the $j = 1, \dots, N$ documents and returning the best matches.
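Continuing the sketch from the previous slide, computing (2) for all documents at once might look as follows; q is a hypothetical query vector in the original t-dimensional term space:

    # Rank documents by the cosine in (2); column j of Ck is d*_(j) = Ck e_(j).
    q = np.random.rand(t)                              # stand-in query vector

    num = q @ Ck                                       # q . d*_(j) for every j
    den = np.linalg.norm(q) * np.linalg.norm(Ck, axis=0)
    ranked = np.argsort(-(num / den))                  # best matches first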

SLIDE 7

Compare documents in concept space

Recall that the $i,j$ entry of $C^T C$ is the dot product between the $i$th and $j$th columns of $C$ (the term vectors for documents $i$ and $j$). In the truncated space,

$$C_k^T C_k = (U_k \Sigma_k V_k^T)^T (U_k \Sigma_k V_k^T) = V_k \Sigma_k U_k^T U_k \Sigma_k V_k^T = (V_k \Sigma_k)(V_k \Sigma_k)^T$$

Thus the $i,j$ entry is the dot product between the $i$th and $j$th columns of $(V_k \Sigma_k)^T = \Sigma_k V_k^T$.

In concept space, the comparison between pseudo-document $\hat q$ and document $\hat d^{(j)}$ is thus given by the cosine between $\Sigma_k \hat q$ and $\Sigma_k \hat d^{(j)}$:

$$\frac{(\Sigma_k \hat q) \cdot (\Sigma_k \hat d^{(j)})}{|\Sigma_k \hat q|\,|\Sigma_k \hat d^{(j)}|} = \frac{(\vec q^{\,T} U_k \Sigma_k^{-1} \Sigma_k)(\Sigma_k \Sigma_k^{-1} U_k^T \vec d^{\,*(j)})}{|U_k^T \vec q|\,|U_k^T \vec d^{\,*(j)}|} = \frac{\vec q \cdot \vec d^{\,*(j)}}{|U_k^T \vec q|\,|\vec d^{\,*(j)}|}\,, \qquad (3)$$

in agreement with (2), up to an overall $\vec q$-dependent normalization which doesn't affect similarity rankings.
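A sketch of the same comparison carried out in the k-dimensional concept space, reusing the hypothetical Uk, sk, Ck, q from the previous sketches; by (3) it produces the same ranking as (2):

    # Pseudo-query q^ = Sigma_k^{-1} U_k^T q; documents likewise, one per column.
    q_hat = (Uk.T @ q) / sk
    D_hat = (Uk.T @ Ck) / sk[:, None]

    sq = sk * q_hat                        # Sigma_k q^  (equals U_k^T q)
    sD = sk[:, None] * D_hat               # Sigma_k d^_(j), one column per document
    cos = (sq @ sD) / (np.linalg.norm(sq) * np.linalg.norm(sD, axis=0))
    # np.argsort(-cos) reproduces the ranking computed via (2) above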

SLIDE 8

SLIDE 9

Rocchio illustrated

[Figure: relevant and nonrelevant documents plotted in term space, with the centroids $\vec\mu_R$, $\vec\mu_{NR}$ and the optimal query $\vec q_{opt}$.]

• $\vec\mu_R$: centroid of relevant documents
• $\vec\mu_{NR}$: centroid of nonrelevant documents
• $\vec\mu_R - \vec\mu_{NR}$: difference vector
• Add the difference vector to $\vec\mu_R$ to get $\vec q_{opt}$ (sketched below)
• $\vec q_{opt}$ separates relevant/nonrelevant perfectly.
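A minimal sketch of the construction in the figure, assuming rel and nonrel are hypothetical arrays of document vectors (one row per document); this is the ideal q_opt, not the practical weighted Rocchio update:

    import numpy as np

    def rocchio_optimal_query(rel, nonrel):
        """q_opt: relevant centroid pushed away from the nonrelevant centroid."""
        mu_R = rel.mean(axis=0)            # centroid of relevant documents
        mu_NR = nonrel.mean(axis=0)        # centroid of nonrelevant documents
        return mu_R + (mu_R - mu_NR)       # q_opt = mu_R + (mu_R - mu_NR)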

SLIDE 10

Outline

1. Recap
2. Pseudo Relevance Feedback
3. Query expansion
4. Probabilistic Retrieval
5. Discussion

SLIDE 11

Relevance feedback: Problems

• Relevance feedback is expensive.
• Relevance feedback creates long modified queries, and long queries are expensive to process.
• Users are reluctant to provide explicit feedback.
• It's often hard to understand why a particular document was retrieved after applying relevance feedback.
• Excite had full relevance feedback at one point, but abandoned it later.

SLIDE 12

Pseudo-relevance feedback

Pseudo-relevance feedback automates the "manual" part of true relevance feedback.

Pseudo-relevance algorithm:

• Retrieve a ranked list of hits for the user's query.
• Assume that the top k documents are relevant.
• Do relevance feedback (e.g., Rocchio); see the sketch below.

Works very well on average, but can go horribly wrong for some queries; several iterations can cause query drift.
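A hedged sketch of one round of the loop just described; the search function and the weights alpha, beta are hypothetical stand-ins, not the SMART system's actual implementation:

    import numpy as np

    def pseudo_relevance_feedback(q, search, k=10, alpha=1.0, beta=0.75):
        """Retrieve, assume the top k hits are relevant, and do a
        Rocchio-style positive update (no negative term)."""
        docs = search(q)                       # ranked list of document vectors
        centroid = np.array(docs[:k]).mean(axis=0)
        return alpha * q + beta * centroid     # expanded query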

SLIDE 13

Pseudo-relevance feedback at TREC4

Cornell SMART system. Results show the number of relevant documents out of the top 100 for 50 queries (so the total number of documents is 5000):

    method         number of relevant documents
    lnc.ltc                 3210
    lnc.ltc-PsRF            3634
    Lnu.ltu                 3709
    Lnu.ltu-PsRF            4350

The results contrast two length-normalization schemes (L vs. l) and pseudo-relevance feedback (PsRF). The pseudo-relevance feedback method used here added only 20 terms to the query (Rocchio will add many more). This demonstrates that pseudo-relevance feedback is effective on average.

SLIDE 14

Outline

1. Recap
2. Pseudo Relevance Feedback
3. Query expansion
4. Probabilistic Retrieval
5. Discussion

SLIDE 15

Query expansion

Query expansion is another method for increasing recall. We use "global query expansion" to refer to "global methods for query reformulation": the query is modified based on some global resource, i.e., a resource that is not query-dependent.

• Main information we use: (near-)synonymy.
• A publication or database that collects (near-)synonyms is called a thesaurus.
• We will look at two types of thesauri: manually created and automatically created.

SLIDE 16

Query expansion: Example

SLIDE 17

Types of user feedback

User gives feedback on documents.

More common in relevance feedback

User gives feedback on words or phrases.

More common in query expansion

SLIDE 18

Types of query expansion

• Manual thesaurus (maintained by editors, e.g., PubMed)
• Automatically derived thesaurus (e.g., based on co-occurrence statistics)
• Query-equivalence based on query log mining (common on the web, as in the "palm" example)

SLIDE 19

Thesaurus-based query expansion

• For each term t in the query, expand the query with words the thesaurus lists as semantically related to t (see the toy sketch below).
• Example from earlier: hospital → medical
• Generally increases recall
• May significantly decrease precision, particularly with ambiguous terms: interest rate → interest rate fascinate
• Widely used in specialized search engines for science and engineering
• It's very expensive to create a manual thesaurus and to maintain it over time.
• A manual thesaurus is roughly equivalent to annotation with a controlled vocabulary.
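The toy sketch below illustrates the expansion step with a hypothetical two-entry thesaurus, including how the ambiguous term "interest" drags in "fascinate":

    # Hypothetical thesaurus mapping terms to (near-)synonyms.
    thesaurus = {"hospital": ["medical"], "interest": ["fascinate"]}

    def expand(query_terms, thesaurus):
        expanded = list(query_terms)
        for t in query_terms:
            expanded.extend(thesaurus.get(t, []))   # add words related to t
        return expanded

    print(expand(["interest", "rate"], thesaurus))  # ['interest', 'rate', 'fascinate']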

SLIDE 20

Example for manual thesaurus: PubMed

SLIDE 21

Automatic thesaurus generation

Attempt to generate a thesaurus automatically by analyzing the distribution of words in documents. The fundamental notion is the similarity between two words.

• Definition 1: Two words are similar if they co-occur with similar words. ("Car" and "motorcycle" co-occur with "road", "gas", and "license", so they must be similar.)
• Definition 2: Two words are similar if they occur in a given grammatical relation with the same words. (You can harvest, peel, eat, prepare, etc. both apples and pears, so apples and pears must be similar.)

Co-occurrence is more robust; grammatical relations are more accurate. A sketch of the co-occurrence approach follows below.
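A minimal sketch of Definition 1, assuming a hypothetical word-by-context co-occurrence count matrix X; words whose rows have similar co-occurrence profiles come out as nearest neighbors (random data here, so the neighbors are meaningless, but the computation is the point):

    import numpy as np

    X = np.random.rand(1000, 5000)             # hypothetical word x context counts
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T                            # cosine similarity between word pairs
    nearest = np.argsort(-sim[42])[1:6]        # 5 nearest neighbors of word 42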

SLIDE 22

Co-occurrence-based thesaurus: Examples

    Word          Nearest neighbors
    absolutely    absurd, whatsoever, totally, exactly, nothing
    bottomed      dip, copper, drops, topped, slide, trimmed
    captivating   shimmer, stunningly, superbly, plucky, witty
    doghouse      dog, porch, crawling, beside, downstairs
    makeup        repellent, lotion, glossy, sunscreen, skin, gel
    mediating     reconciliation, negotiate, case, conciliation
    keeping       hoping, bring, wiping, could, some, would
    lithographs   drawings, Picasso, Dali, sculptures, Gauguin
    pathogens     toxins, bacteria, organisms, bacterial, parasite
    senses        grasp, psyche, truly, clumsy, naive, innate

SLIDE 23

Summary

• Relevance feedback and query expansion increase recall.
• In many cases, precision is decreased, often significantly.
• Log-based query modification (which is more complex than simple query expansion) is more common on the web than relevance feedback.

SLIDE 24

Outline

1. Recap
2. Pseudo Relevance Feedback
3. Query expansion
4. Probabilistic Retrieval
5. Discussion

SLIDE 25

Basics of probability theory

• $A$ = event; $0 \le p(A) \le 1$
• Joint probability: $p(A, B) = p(A \cap B)$
• Conditional probability: $p(A|B) = p(A, B)/p(B)$
• Note $p(A, B) = p(A|B)\,p(B) = p(B|A)\,p(A)$, which gives the posterior probability of $A$ after seeing the evidence $B$. Bayes' theorem:

$$p(A|B) = \frac{p(B|A)\,p(A)}{p(B)}$$

• In the denominator, use $p(B) = p(B, A) + p(B, \bar A) = p(B|A)\,p(A) + p(B|\bar A)\,p(\bar A)$
• Odds: $O(A) = \dfrac{p(A)}{p(\bar A)} = \dfrac{p(A)}{1 - p(A)}$

A numeric check follows below.
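A quick numeric check of these definitions with made-up numbers, p(A) = 0.3, p(B|A) = 0.8, p(B|Ā) = 0.1:

    pA, pB_A, pB_notA = 0.3, 0.8, 0.1
    pB = pB_A * pA + pB_notA * (1 - pA)   # total probability: 0.24 + 0.07 = 0.31
    pA_B = pB_A * pA / pB                 # posterior p(A|B) = 0.24/0.31 ≈ 0.774
    odds_A = pA / (1 - pA)                # prior odds O(A) = 0.3/0.7 ≈ 0.429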

SLIDE 26

Probability Ranking Principle (PRP)

For query $q$ and document $d$, let $R_{d,q}$ be a binary indicator of whether $d$ is relevant to $q$: $R = 1$ if relevant, else $R = 0$.

Order documents according to the estimated probability of relevance with respect to the information need, $p(R = 1|d, q)$.

(Bayes optimal decision rule) $d$ is relevant iff $p(R = 1|d, q) > p(R = 0|d, q)$.

SLIDE 27

Binary independence model (BIM)

Represent documents and queries as binary term incidence vectors:

$$\vec x = (x_1, \dots, x_M)\,, \qquad \vec q = (q_1, \dots, q_M)$$

($x_t = 1$ if term $t$ is present in document $d$, else $x_t = 0$.)

Assume independence: no association between terms in the binary bag of words (not true of course, but it works OK to a first approximation).

We want to determine

$$p(R = 1|\vec x, \vec q) = \frac{p(\vec x|R = 1, \vec q)\, p(R = 1|\vec q)}{p(\vec x|\vec q)}\,, \qquad p(R = 0|\vec x, \vec q) = \frac{p(\vec x|R = 0, \vec q)\, p(R = 0|\vec q)}{p(\vec x|\vec q)}$$

(On the r.h.s. are the probabilities that if a document is retrieved, then its representation is $\vec x$, and the prior probabilities of retrieving a relevant or nonrelevant document; also $p(R = 1|\vec x, \vec q) + p(R = 0|\vec x, \vec q) = 1$.)

SLIDE 28

Ranking function

Order by descending $p(R = 1|d, q)$, using the BIM estimate $p(R = 1|\vec x, \vec q)$. Use the odds of relevance:

$$O(R|\vec x, \vec q) = \frac{p(R = 1|\vec x, \vec q)}{p(R = 0|\vec x, \vec q)} = \frac{p(\vec x|R = 1, \vec q)\, p(R = 1|\vec q)\,/\,p(\vec x|\vec q)}{p(\vec x|R = 0, \vec q)\, p(R = 0|\vec q)\,/\,p(\vec x|\vec q)} = \frac{p(R = 1|\vec q)}{p(R = 0|\vec q)} \cdot \frac{p(\vec x|R = 1, \vec q)}{p(\vec x|R = 0, \vec q)}$$

How do we estimate the last term?

SLIDE 29

Naive Bayes conditional independence assumption

$$\frac{p(\vec x|R = 1, \vec q)}{p(\vec x|R = 0, \vec q)} = \prod_{t=1}^{M} \frac{p(x_t|R = 1, \vec q)}{p(x_t|R = 0, \vec q)}\,, \qquad\text{so}\qquad O(R|\vec x, \vec q) = O(R|\vec q) \cdot \prod_{t=1}^{M} \frac{p(x_t|R = 1, \vec q)}{p(x_t|R = 0, \vec q)}$$

Let the probabilities of a term appearing in relevant/nonrelevant documents w.r.t. the query be $p_t = p(x_t = 1|R = 1, \vec q)$ and $u_t = p(x_t = 1|R = 0, \vec q)$:

                              relevant (R = 1)    nonrelevant (R = 0)
    term present (x_t = 1)          p_t                  u_t
    term absent  (x_t = 0)        1 − p_t              1 − u_t

SLIDE 30

One more approximation

Also assume that terms not occurring in the query are equally likely to occur in relevant and nonrelevant documents: $p_t = u_t$ if $q_t = 0$. Then we only need to retain the query terms ($q_t = 1$):

$$O(R|\vec x, \vec q) = O(R|\vec q) \cdot \prod_{t=1}^{M} \frac{p(x_t|R = 1, \vec q)}{p(x_t|R = 0, \vec q)} = O(R|\vec q) \cdot \prod_{t:\, x_t = 1,\, q_t = 1} \frac{p_t}{u_t} \cdot \prod_{t:\, x_t = 0,\, q_t = 1} \frac{1 - p_t}{1 - u_t}$$
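A hedged sketch of scoring with this formula in log space, with hypothetical per-term estimates p_t, u_t supplied as dicts; the prior odds O(R|q) are the same for every document, so they can be dropped when ranking:

    import math

    def bim_log_odds(doc_terms, query_terms, p, u):
        """Log of the product over query terms; higher = more likely relevant."""
        score = 0.0
        for t in query_terms:
            if t in doc_terms:                                 # x_t = 1, q_t = 1
                score += math.log(p[t] / u[t])
            else:                                              # x_t = 0, q_t = 1
                score += math.log((1 - p[t]) / (1 - u[t]))
        return score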

SLIDE 31

Estimate

$df_t$ is the number of documents that contain term $t$:

                           relevant      nonrelevant                 total
    present (x_t = 1)         s           df_t − s                   df_t
    absent  (x_t = 0)       S − s      (N − df_t) − (S − s)        N − df_t
    total                     S             N − S                      N

Hence $p_t = s/S$ and $u_t = (df_t - s)/(N - S)$. In practice, if the relevant documents are a small percentage of the collection, then $u_t \approx df_t/N$ and

$$\log\frac{1 - u_t}{u_t} = \log\frac{N - df_t}{df_t} \approx \log\frac{N}{df_t}\,,$$

reproducing the idf weight (a numeric check follows below).
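The numeric check promised above, with made-up numbers N = 10^6 and df_t = 100:

    import math

    N, df_t = 1_000_000, 100
    u_t = df_t / N                       # probability of t in nonrelevant docs
    print(math.log((1 - u_t) / u_t))     # 9.2102...
    print(math.log(N / df_t))            # 9.2103..., essentially the idf weight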

SLIDE 32

Outline

1. Recap
2. Pseudo Relevance Feedback
3. Query expansion
4. Probabilistic Retrieval
5. Discussion

SLIDE 33

Discussion 4

Original LSA article: Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman, "Indexing by latent semantic analysis", Journal of the American Society for Information Science, Volume 41, Issue 6, 1990.

Some questions:

• Explain the name "latent semantic analysis".
• What problems is LSA attempting to solve? Does it succeed?
• What criteria were used in selecting the SVD of the term-document matrix?
• Explain the meaning of the matrices in the SVD $C = U\Sigma V^T$.
• What does the rank reduction $C_k = U_k \Sigma_k V_k^T \approx C$ (keeping only the first $k$ elements of $\Sigma$) have to do with latent semantics?
• Fig. 1: What aspect of LSA does this illustrate? (Which docs are closer to the query vector in concept space despite not containing words in common with the query?)

SLIDE 34
• Fig. 4: (a) LSI-100 does better at the right of this graph than at the left. What does this have to do with synonymy and polysemy?
• Describe the methodology of the MED experiment. Why were the authors surprised that TERM and SMART gave similar results? The results on CISI were not as strong; what are possible explanations?
• Fig. 5: What data does the graph plot? What conclusions can you draw?
• The article states "the only way documents can be retrieved is by an exhaustive comparison of a query vector against all stored document vectors". Explain the statement. Is it a serious problem?