

slide-1
SLIDE 1

INFO 4300 / CS4300 Information Retrieval, slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/

IR 12: LSA wrap-up, Relevance Feedback and Query Expansion

Paul Ginsparg

Cornell University, Ithaca, NY

6 Oct 2011

1 / 60

slide-2
SLIDE 2

Administrativa

Assignment 2 due Sat 8 Oct, 1pm (late submission permitted until Sun 9 Oct at 11 p.m.)
Office hour: Saeed, F 3:30-4:30 (or cs4300-l, or Piazza)
No class Tue 11 Oct (midterm break)
Remember the midterm is one week from today, Thu Oct 13, 11:40-12:55, in Kimball B11. It will be open book. Email me by tomorrow if you will be out of town.
Topics examined include assignments, lectures, and discussion-class readings before the midterm break: term-doc matrix, tf.idf, precision-recall graph, LSA and recommender systems, word statistics (Heaps and Zipf). For a sample, see

http://www.infosci.cornell.edu/Courses/info4300/2011fa/exams.html

2 / 60

slide-3
SLIDE 3

Assignment related issues

Include instructions on how to compile and run your code.
Comment all files to include name and netID.
Follow common-sense programming practices (this refers to things like static paths, no visible prompts, and "some generally bizarre interfaces encountered").
Make sure that code runs in the CSUG Lab environment (e.g., at least remotely on the Linux machines: everyone should have a CSUG Lab account and should be able to access the lab machines remotely).
Use of external libraries (i.e., other than those already specified in the assignment description) is discouraged; it should be justified and cleared with the graders beforehand.

3 / 60

slide-4
SLIDE 4

Overview

1. Recap
2. Reduction in number of parameters
3. Motivation for query expansion
4. Relevance feedback: Details
5. Pseudo Relevance Feedback
6. Query expansion

4 / 60

slide-5
SLIDE 5

Outline

1. Recap
2. Reduction in number of parameters
3. Motivation for query expansion
4. Relevance feedback: Details
5. Pseudo Relevance Feedback
6. Query expansion

5 / 60

slide-6
SLIDE 6

Documents in concept space

Consider the original term–document matrix C, and let e^(j) = jth basis vector (single 1 in the jth position, 0 elsewhere). Then d^(j) = C e^(j) gives the components of the jth document, considered as a column vector. Since C = UΣV^T, we can consider ΣV^T e^(j) as the components of the document vector in concept space, before U maps it into word space. Note: we can also consider the original d^(j) to be a vector in word space, and since left multiplication by U maps from concept space to word space, we can apply U^{-1} = U^T to map d^(j) into concept space, giving U^T d^(j) = U^T C e^(j) = U^T U Σ V^T e^(j) = Σ V^T e^(j), as above.
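
A minimal numpy sketch of this identity; the small term–document matrix below is made up for illustration:

```python
import numpy as np

# Toy term-document matrix C: rows = terms, columns = documents (made-up counts).
C = np.array([[1., 0., 2.],
              [0., 1., 1.],
              [3., 1., 0.],
              [0., 2., 1.]])

# Thin SVD: C = U @ diag(s) @ Vt, with orthonormal columns in U and rows in Vt.
U, s, Vt = np.linalg.svd(C, full_matrices=False)

j = 1                                  # pick the j-th document
e_j = np.eye(C.shape[1])[:, j]         # j-th basis vector
d_j = C @ e_j                          # document vector in term (word) space

# Map the document into concept space two ways:
via_U = U.T @ d_j                      # U^T d^(j)
via_SV = np.diag(s) @ Vt @ e_j         # Sigma V^T e^(j)

print(np.allclose(via_U, via_SV))      # True: both give the concept-space coordinates
```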

6 / 60

slide-7
SLIDE 7

Term–term Comparison

To compare two terms, take the dot product between two rows of C, which measures the extent to which they have a similar pattern of occurrence across the full set of documents. The (i, j) entry of CC^T is equal to the dot product between rows i and j of C. Since CC^T = UΣV^T VΣU^T = UΣ²U^T = (UΣ)(UΣ)^T, the (i, j) entry is the dot product between rows i and j of UΣ. Hence the rows of UΣ can be considered as coordinates for terms, whose dot products give comparisons between terms. (Σ just rescales the coordinates.)

7 / 60

slide-8
SLIDE 8

Document–document Comparison

To compare two documents, take the dot product between two columns of C, which measures the extent to which the two documents have a similar profile of terms. The (i, j) entry of C^T C is equal to the dot product between columns i and j of C. Since C^T C = VΣU^T UΣV^T = VΣ²V^T = (VΣ)(VΣ)^T, the (i, j) entry is the dot product between rows i and j of VΣ. Hence the rows of VΣ can be considered as coordinates for documents, whose dot products give comparisons between documents. (Σ again just rescales the coordinates.)
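
A similar sketch (again with a made-up matrix) checking that the rows of UΣ and VΣ reproduce the term–term and document–document dot products:

```python
import numpy as np

C = np.array([[1., 0., 2.],
              [0., 1., 1.],
              [3., 1., 0.],
              [0., 2., 1.]])                  # toy term-document matrix

U, s, Vt = np.linalg.svd(C, full_matrices=False)
term_coords = U @ np.diag(s)                  # rows: term coordinates (U Sigma)
doc_coords = Vt.T @ np.diag(s)                # rows: document coordinates (V Sigma)

print(np.allclose(term_coords @ term_coords.T, C @ C.T))   # term-term comparisons
print(np.allclose(doc_coords @ doc_coords.T, C.T @ C))     # document-document comparisons
```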

8 / 60

slide-9
SLIDE 9

Term–document Comparison

To compare a term and a document, use directly the value of the (i, j) entry of C = UΣV^T. This is the dot product between the ith row of UΣ^{1/2} and the jth row of VΣ^{1/2}, so use UΣ^{1/2} and VΣ^{1/2} as coordinates. Recall UΣ for term–term and VΣ for document–document comparisons — we can't use a single set of coordinates to make both between-term-and-document and within-term-or-document comparisons, but the difference is only a Σ^{1/2} stretch.

9 / 60

slide-10
SLIDE 10

Recall query document comparison

query = vector q in term space; components q_i = 1 if term i is in the query, and 0 otherwise; any query terms not in the original term vector space are ignored. In the VSM, the similarity between query q and the jth document d^(j) is given by the "cosine measure" q · d^(j) / (|q| |d^(j)|). Using the term–document matrix C_ij, this dot product is given by the jth component of q · C: d^(j) = C e^(j) (e^(j) = jth basis vector, single 1 in the jth position, 0 elsewhere). Hence

$$\mathrm{Similarity}(\vec q, \vec d^{(j)}) = \cos\theta = \frac{\vec q \cdot \vec d^{(j)}}{|\vec q|\,|\vec d^{(j)}|} = \frac{\vec q \cdot C\,\vec e^{(j)}}{|\vec q|\,|C\vec e^{(j)}|}. \qquad (1)$$
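
A minimal sketch of equation (1) in numpy, with an invented term–document matrix and a binary query vector:

```python
import numpy as np

C = np.array([[1., 0., 2.],      # term 0 counts in docs 0..2 (made-up)
              [0., 1., 1.],      # term 1
              [3., 1., 0.],      # term 2
              [0., 2., 1.]])     # term 3

q = np.array([1., 0., 1., 0.])   # query contains terms 0 and 2

# Cosine similarity between q and every document column d^(j) = C e^(j)
sims = (q @ C) / (np.linalg.norm(q) * np.linalg.norm(C, axis=0))
print(sims)                       # one score per document
print(np.argsort(-sims))          # documents ranked by similarity
```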

10 / 60

slide-11
SLIDE 11

Now approximate C → Ck

In the LSI approximation we use C_k (the rank-k approximation to C), so the similarity measure between query and document becomes

$$\frac{\vec q \cdot \vec d^{(j)}}{|\vec q|\,|\vec d^{(j)}|} = \frac{\vec q \cdot C\,\vec e^{(j)}}{|\vec q|\,|C\vec e^{(j)}|} \;\Longrightarrow\; \frac{\vec q \cdot C_k\,\vec e^{(j)}}{|\vec q|\,|C_k\vec e^{(j)}|} = \frac{\vec q \cdot \vec d^{*(j)}}{|\vec q|\,|\vec d^{*(j)}|}, \qquad (2)$$

where d^{*(j)} = C_k e^(j) = U_k Σ_k V_k^T e^(j) is the LSI-reduced representation of the jth document vector in the original term space. Finding the closest documents to a query in the LSI approximation thus amounts to computing (2) for each of the j = 1, ..., N documents and returning the best matches.
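
A sketch of (2) under the same kind of toy data: truncate the SVD to rank k and rank documents against the columns of C_k (the value k = 2 is arbitrary here):

```python
import numpy as np

C = np.array([[1., 0., 2.],
              [0., 1., 1.],
              [3., 1., 0.],
              [0., 2., 1.]])
q = np.array([1., 0., 1., 0.])

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2                                          # keep the k largest singular values
Ck = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # rank-k approximation of C

# d*_(j) = Ck e^(j): the LSI-reduced document vectors are just the columns of Ck
sims = (q @ Ck) / (np.linalg.norm(q) * np.linalg.norm(Ck, axis=0))
print(np.argsort(-sims))                       # ranking under the LSI approximation
```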

11 / 60

slide-12
SLIDE 12

Pseudo-document

To see that this agrees with the prescription given in the course text (and the original LSI article), recall: the jth column of V_k^T represents document j in "concept space": d_c^(j) = V_k^T e^(j); the query q is considered a "pseudo-document" in this space. The LSI document vector in term space was given above as d^{*(j)} = C_k e^(j) = U_k Σ_k V_k^T e^(j) = U_k Σ_k d_c^(j), so it follows that d_c^(j) = Σ_k^{-1} U_k^T d^{*(j)}. The "pseudo-document" query vector q is translated into the concept space using the same transformation: q_c = Σ_k^{-1} U_k^T q.
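
A sketch of this fold-in with numpy (toy data; k chosen arbitrarily): documents live in concept space as the columns of V_k^T, and the query is mapped there by Σ_k^{-1} U_k^T:

```python
import numpy as np

C = np.array([[1., 0., 2.],
              [0., 1., 1.],
              [3., 1., 0.],
              [0., 2., 1.]])
q = np.array([1., 0., 1., 0.])

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

docs_concept = Vtk                 # column j is d_c^(j) = V_k^T e^(j)
q_concept = (Uk.T @ q) / sk        # q_c = Sigma_k^{-1} U_k^T q  ("fold-in")

print(q_concept)                   # k-dimensional representation of the query
print(docs_concept)                # k-dimensional representations of the documents
```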

12 / 60

slide-13
SLIDE 13

More document–document comparison in concept space

Recall the (i, j) entry of C^T C is the dot product between columns i and j of C (the term vectors for documents i and j). In the truncated space,

$$C_k^T C_k = (U_k \Sigma_k V_k^T)^T (U_k \Sigma_k V_k^T) = V_k \Sigma_k U_k^T U_k \Sigma_k V_k^T = (V_k \Sigma_k)(V_k \Sigma_k)^T.$$

Thus the (i, j) entry is the dot product between columns i and j of (V_k Σ_k)^T = Σ_k V_k^T.

In concept space, the comparison between pseudo-document q_c and document d_c^(j) is thus given by the cosine between Σ_k q_c and Σ_k d_c^(j):

$$\frac{(\Sigma_k \vec q_c)\cdot(\Sigma_k \vec d_c^{(j)})}{|\Sigma_k \vec q_c|\,|\Sigma_k \vec d_c^{(j)}|} = \frac{(\vec q^{\,T} U_k \Sigma_k^{-1}\Sigma_k)\,(\Sigma_k \Sigma_k^{-1} U_k^T \vec d^{*(j)})}{|U_k^T \vec q|\,|U_k^T \vec d^{*(j)}|} = \frac{\vec q \cdot \vec d^{*(j)}}{|U_k^T \vec q|\,|\vec d^{*(j)}|}, \qquad (3)$$

in agreement with (2), up to an overall q-dependent normalization which doesn't affect similarity rankings.

13 / 60

slide-14
SLIDE 14

Pseudo-document – document comparison summary

So given a novel query, find its location in concept space, and find its cosine w.r.t. existing documents, or other documents not in the original analysis (SVD). We've just learned how to represent "pseudo-documents", and how to compute comparisons:

A query q is a vector of terms, like the columns of C, hence considered a pseudo-document.
Derive a representation for any term vector q to be used in the document comparison formulas (like a row of V, as earlier).
Constraint: for a real document q = d^(j) (= jth column of C_ij), and before truncation (i.e., for C_k = C), it should give the corresponding row of V.
Use q_c = q U Σ^{-1} for comparing pseudo-docs to docs.

14 / 60

slide-15
SLIDE 15

Pseudo-document – document Comparison: q_c = q U Σ^{-1}

q_c = q U Σ^{-1} sums the corresponding rows of UΣ, hence corresponds to placing the pseudo-document at the centroid of the corresponding term points (up to a rescaling of the rows by Σ). (Just as a row of V scaled by Σ^{1/2} or Σ can be used in semantic space for making term–doc or doc–doc comparisons.) Note: all of the above applies after any preprocessing used to construct C.

15 / 60

slide-16
SLIDE 16

Recommendation to new user

Let 1_k be the diagonal matrix with the first k entries equal to 1 (i.e., the projection or truncation onto the first k dimensions). Then since X = UΣV^T and Σ_k = Σ1_k, the usual rank reduction can be written X_k = (UΣ)1_k V^T = (XV)1_k V^T = X(V 1_k V^T), where the rows of X_k contain the recommendations for existing users. (V_k^T = 1_k V^T and V_k = V 1_k, so X_k = X(V V_k^T).)

We are looking for a transformation of a new user vector n which would have the same effect. From the above, right-multiplying any row of X by V 1_k V^T turns it into the corresponding row of the "improved" X_k, so we use n V 1_k V^T to make recommendations for a new user who is not already contained in X.
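
A small numpy sketch of this fold-in for a new user; the ratings matrix X and the new-user row n are invented for illustration:

```python
import numpy as np

# Invented user x item ratings matrix (rows = existing users, columns = items)
X = np.array([[5., 4., 0., 1.],
              [4., 5., 1., 0.],
              [1., 0., 5., 4.],
              [0., 1., 4., 5.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
k = 2

# Rank-k smoothing for existing users: Xk = X (V 1_k V^T)
P = V[:, :k] @ V[:, :k].T          # V 1_k V^T, a projection in item space
Xk = X @ P                         # rows: smoothed recommendations for existing users

# New user not in X: apply the same right-multiplication to their rating row n
n = np.array([4., 5., 0., 1.])     # invented partial ratings of the new user
recs = n @ P                       # predicted/smoothed scores for the new user
print(np.round(recs, 2))
```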

16 / 60

slide-17
SLIDE 17

17 / 60

slide-18
SLIDE 18

Discussion 3

Original LSA article: Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman, "Indexing by latent semantic analysis". Journal of the American Society for Information Science, Volume 41, Issue 6, 1990.

Some questions:
Explain the name "latent semantic analysis".
What problems is LSA attempting to solve? Does it succeed?
What criteria were used in selecting the SVD of the term–doc matrix?
Explain the meaning of the matrices in the SVD C = UΣV^T.
What does the rank reduction C ≈ C_k = U_k Σ_k V_k^T (keeping only the first k elements of Σ) have to do with latent semantics?
Fig. 1: what aspect of LSA does this illustrate? (Which docs are closer to the query vector in concept space despite not containing words in common with the query?)

18 / 60

slide-19
SLIDE 19
Fig. 4: (a) LSI-100 does better at the right of this graph than the left — what does this have to do with synonymy and polysemy?
Describe the methodology of the MED experiment. Why were the authors surprised that TERM and SMART gave similar results? The results on CISI were not as strong; possible explanations?
Fig. 5: what data does the graph plot? What conclusions can you draw?
The article states "the only way documents can be retrieved is by an exhaustive comparison of a query vector against all stored document vectors". Explain the statement. Is it a serious problem?

19 / 60

slide-20
SLIDE 20

Outline

1. Recap
2. Reduction in number of parameters
3. Motivation for query expansion
4. Relevance feedback: Details
5. Pseudo Relevance Feedback
6. Query expansion

20 / 60

slide-21
SLIDE 21

Independent Parameters

Examples:
real 2×2 matrix (a b; c d): 4 parameters
real symmetric 2×2 matrix (a b; b c): 3 parameters
real symmetric 2×2 matrix with equal diagonal elements (a b; b a): 2 parameters
real orthogonal 2×2 matrix (cos θ  sin θ; −sin θ  cos θ): 1 parameter
(constraints a² + b² = 1, c² + d² = 1, ac + bd = 0, i.e. 3 constraints)

But notice that the result of multiplying a general real 2×2 matrix (a b; c d) by any of the above still has a total of only four independent parameters.

  • 21 / 60
slide-22
SLIDE 22

Example: Just one Feature (from lecture 8, slide 9)

Suppose there is only 1 feature, overall quality, and 1 corresponding user tendency to rate high/low.
Three users: U_u = (1, 2, 3)
Five movies: V_m = (1, 1, 3, 2, 1)
Predicted rating matrix:

P_um = U_u V_m =
( 1 1 3 2 1
  2 2 6 4 2
  3 3 9 6 3 )

'Explain' 15 data points with only 7 parameters (two unit vectors plus one overall scale = 2 + 4 + 1).
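
This prediction matrix is just an outer product; a quick numpy check:

```python
import numpy as np

U = np.array([1, 2, 3])            # per-user tendency to rate high/low
V = np.array([1, 1, 3, 2, 1])      # per-movie overall quality
P = np.outer(U, V)                 # predicted rating P_um = U_u * V_m
print(P)                           # 15 entries explained by 3 + 5 - 1 = 7 free parameters
```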

22 / 60

slide-23
SLIDE 23

In general

           1 . . . 1 2 . . . 1 1 . . . 2 . . . 2 . . . 4 5 . . . 1 1 . . . 2           

  • n×r

   1 1 . . . 2 1 . . . . . . . . . . . . . . . 2 5 . . . 1 2   

  • r×m

=            3 6 . . . 3 3 4 7 . . . 5 4 5 11 . . . 4 5 . . . 10 22 . . . 8 10 7 10 . . . 11 7 5 11 . . . 4 4           

  • n×m

=            1 . . . 1 2 . . . 1 1 . . . 2 . . . 2 . . . 4 5 . . . 1 1 . . . 2           

  • n×r

  R−1  

  • r×r

  R  

  • r×r

   1 1 . . . 2 1 . . . . . . . . . . . . . . . 2 5 . . . 1 2   

  • r×m

nr + mr − r 2 parameters
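
A quick numpy check of this invariance, with random made-up factors A and B and an arbitrary invertible R (the names A, B, R are mine, not from the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m = 6, 2, 5
A = rng.normal(size=(n, r))        # n x r factor
B = rng.normal(size=(r, m))        # r x m factor
R = rng.normal(size=(r, r))        # an (almost surely) invertible r x r matrix

X = A @ B
X2 = (A @ np.linalg.inv(R)) @ (R @ B)    # same product with rescaled factors
print(np.allclose(X, X2))                # True: only n*r + m*r - r^2 parameters matter
```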

23 / 60

slide-24
SLIDE 24

Alternatively

Number of free parameters in an M × M orthogonal matrix: (1/2)M(M − 1).
(OO^T = 1 is automatically symmetric, so there are M [normalization] constraints on the diagonal and (1/2)M(M − 1) off-diagonal [orthogonality] constraints, giving a total of (1/2)M(M + 1) constraints and leaving M² − (1/2)M(M + 1) = (1/2)M(M − 1) free.)

If only the first r columns of U contribute, then there are rM entries subject to r normalization conditions (columns are length-1 vectors) and (1/2)r(r − 1) orthogonality conditions (different columns have vanishing dot product), leaving rM − (1/2)r(r + 1) free.

Similarly, the r relevant columns of V (rows of V^T) give rN − (1/2)r(r + 1), plus the r singular values in Σ; these sum to

rM − (1/2)r(r + 1) + rN − (1/2)r(r + 1) + r = r(M + N) − r²,

as expected for an M × r matrix times an r × N matrix, invariant under insertion of an arbitrary R⁻¹R in between (previous slide). Note that for r = 1 this gives M + N − 1 (two slides back).

24 / 60

slide-25
SLIDE 25

Outline

1. Recap
2. Reduction in number of parameters
3. Motivation for query expansion
4. Relevance feedback: Details
5. Pseudo Relevance Feedback
6. Query expansion

25 / 60

slide-26
SLIDE 26

How can we improve recall in search?

Main topic today: two ways of improving recall: relevance feedback and query expansion Example

Query q: [aircraft] Document d contains “plane”, but doesn’t contain “aircraft”. A simple IR system will not return d for q. Even if d is the most relevant document for q!

Options for improving recall

Local: Do a “local”, on-demand analysis for a user query

Main local method: relevance feedback

Global: Do a global analysis once (e.g., of collection) to produce thesaurus

Use thesaurus for query expansion

26 / 60

slide-27
SLIDE 27

Outline

1. Recap
2. Reduction in number of parameters
3. Motivation for query expansion
4. Relevance feedback: Details
5. Pseudo Relevance Feedback
6. Query expansion

27 / 60

slide-28
SLIDE 28

Relevance feedback: Basic idea

The user issues a (short, simple) query. The search engine returns a set of documents. User marks some docs as relevant, some as nonrelevant. Search engine computes a new representation of the information need – should be better than the initial query. Search engine runs new query and returns new results. New results have (hopefully) better recall.

28 / 60

slide-29
SLIDE 29

Relevance feedback

We can iterate this: several rounds of relevance feedback. We will use the term ad hoc retrieval to refer to regular retrieval without relevance feedback. We will now look at three different examples of relevance feedback that highlight different aspects of the process.

29 / 60

slide-30
SLIDE 30

Example 3: A real (non-image) example

Initial query: new space satellite applications
Results for the initial query (r = rank):
+ 1  0.539  NASA Hasn't Scrapped Imaging Spectrometer
+ 2  0.533  NASA Scratches Environment Gear From Satellite Plan
  3  0.528  Science Panel Backs NASA Satellite Plan, But Urges Launches of Smaller Probes
  4  0.526  A NASA Satellite Project Accomplishes Incredible Feat: Staying Within Budget
  5  0.525  Scientist Who Exposed Global Warming Proposes Satellites for Climate Research
  6  0.524  Report Provides Support for the Critics Of Using Big Satellites to Study Climate
  7  0.516  Arianespace Receives Satellite Launch Pact From Telesat Canada
+ 8  0.509  Telecommunications Tale of Two Companies
The user then marks relevant documents with "+".

30 / 60

slide-31
SLIDE 31

Expanded query after relevance feedback

2.074 new, 15.106 space, 30.816 satellite, 5.660 application, 5.991 nasa, 5.196 eos, 4.196 launch, 3.972 aster, 3.516 instrument, 3.446 arianespace, 3.004 bundespost, 2.806 ss, 2.790 rocket, 2.053 scientist, 2.003 broadcast, 1.172 earth, 0.836 oil, 0.646 measure

31 / 60

slide-32
SLIDE 32

Results for expanded query

* 1  0.513  NASA Scratches Environment Gear From Satellite Plan
* 2  0.500  NASA Hasn't Scrapped Imaging Spectrometer
  3  0.493  When the Pentagon Launches a Secret Satellite, Space Sleuths Do Some Spy Work of Their Own
  4  0.493  NASA Uses 'Warm' Superconductors For Fast Circuit
* 5  0.492  Telecommunications Tale of Two Companies
  6  0.491  Soviets May Adapt Parts of SS-20 Missile For Commercial Use
  7  0.490  Gaping Gap: Pentagon Lags in Race To Match the Soviets In Rocket Launchers
  8  0.490  Rescue of Satellite By Space Agency To Cost $90 Million

32 / 60

slide-33
SLIDE 33

Key concept for relevance feedback: Centroid

The centroid is the center of mass of a set of points. Recall that we represent documents as points in a high-dimensional space. Thus: we can compute centroids of documents. Definition:

$$\vec\mu(D) = \frac{1}{|D|} \sum_{d \in D} \vec v(d)$$

where D is a set of documents and v(d) = d is the vector we use to represent document d.
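
The definition in numpy, assuming documents have already been turned into vectors (the vectors below are made up):

```python
import numpy as np

# Made-up document vectors v(d), one per row
D = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [2.0, 1.0, 0.0]])

centroid = D.mean(axis=0)          # mu(D) = (1/|D|) * sum of document vectors
print(centroid)                    # [1.0, 0.6667, 1.0]
```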

33 / 60

slide-34
SLIDE 34

Centroid: Examples

[Figure: two example sets of points, x's and ⋄'s.]

34 / 60

slide-35
SLIDE 35

Rocchio algorithm

The Rocchio algorithm implements relevance feedback in the vector space model. Rocchio chooses the query q_opt that maximizes

$$\vec q_{opt} = \arg\max_{\vec q}\,[\,\mathrm{sim}(\vec q, \mu(D_r)) - \mathrm{sim}(\vec q, \mu(D_{nr}))\,]$$

This is closely related to maximum separation between relevant and nonrelevant docs. Making some additional assumptions, we can rewrite q_opt as:

$$\vec q_{opt} = \mu(D_r) + [\mu(D_r) - \mu(D_{nr})]$$

D_r: set of relevant docs; D_nr: set of nonrelevant docs.

35 / 60

slide-36
SLIDE 36

Rocchio algorithm

The optimal query vector is:

$$\vec q_{opt} = \mu(D_r) + [\mu(D_r) - \mu(D_{nr})] = \frac{1}{|D_r|}\sum_{\vec d_j \in D_r} \vec d_j + \left[\frac{1}{|D_r|}\sum_{\vec d_j \in D_r} \vec d_j - \frac{1}{|D_{nr}|}\sum_{\vec d_j \in D_{nr}} \vec d_j\right]$$

We move the centroid of the relevant documents by the difference between the two centroids.

36 / 60

slide-37
SLIDE 37

Exercise: Compute Rocchio vector

[Figure: circles (relevant documents) and x's (nonrelevant documents); compute the Rocchio vector.]

37 / 60

slide-38
SLIDE 38

Rocchio illustrated

[Figure: relevant documents (circles), nonrelevant documents (x's), the two centroids, and the resulting Rocchio query.]

µ_R: centroid of relevant documents
µ_NR: centroid of nonrelevant documents
µ_R − µ_NR: difference vector
Add the difference vector to µ_R to get q_opt.
q_opt separates relevant/nonrelevant perfectly.

38 / 60

slide-39
SLIDE 39

Rocchio 1971 algorithm (SMART)

Used in practice:

$$\vec q_m = \alpha \vec q_0 + \beta\,\mu(D_r) - \gamma\,\mu(D_{nr}) = \alpha \vec q_0 + \beta \frac{1}{|D_r|}\sum_{\vec d_j \in D_r} \vec d_j - \gamma \frac{1}{|D_{nr}|}\sum_{\vec d_j \in D_{nr}} \vec d_j$$

q_m: modified query vector; q_0: original query vector; D_r and D_nr: sets of known relevant and nonrelevant documents respectively; α, β, and γ: weights attached to each term. The new query moves towards relevant documents and away from nonrelevant documents. Tradeoff α vs. β/γ: if we have a lot of judged documents, we want a higher β/γ. Set negative term weights to 0 (a "negative weight" for a term doesn't make sense in the vector space model).
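
A minimal sketch of this update in numpy; the weights and judged documents below are illustrative, not SMART's actual settings:

```python
import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio (1971) relevance-feedback update in the vector space model."""
    qm = alpha * q0
    if len(relevant):                        # move towards the relevant centroid
        qm = qm + beta * np.mean(relevant, axis=0)
    if len(nonrelevant):                     # move away from the nonrelevant centroid
        qm = qm - gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(qm, 0.0)               # clip negative term weights to 0

# Made-up query and judged document vectors
q0 = np.array([1.0, 0.0, 1.0, 0.0])
rel = np.array([[0.9, 0.1, 0.8, 0.4]])
nonrel = np.array([[0.1, 0.9, 0.0, 0.7]])

print(rocchio(q0, rel, nonrel))
```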

39 / 60

slide-40
SLIDE 40

Positive vs. negative relevance feedback

Positive feedback is more valuable than negative feedback. For example, set β = 0.75, γ = 0.25 to give higher weight to positive feedback. Many systems only allow positive feedback.

40 / 60

slide-41
SLIDE 41

Relevance feedback: Assumptions

When can relevance feedback enhance recall? Assumption A1: The user knows the terms in the collection well enough for an initial query. Assumption A2: Relevant documents contain similar terms (so I can “hop” from one relevant document to a different one when giving relevance feedback).

41 / 60

slide-42
SLIDE 42

Violation of A1

Violation of assumption A1: The user knows the terms in the collection well enough for an initial query. Mismatch of searcher’s vocabulary and collection vocabulary Example: cosmonaut / astronaut

42 / 60

slide-43
SLIDE 43

Violation of A2

Violation of A2: Relevant documents are not similar. Example query: contradictory government policies. Several unrelated "prototypes":

Subsidies for tobacco farmers vs. anti-smoking campaigns
Aid for developing countries vs. high tariffs on imports from developing countries

Relevance feedback on tobacco docs will not help with finding docs on developing countries.

43 / 60

slide-44
SLIDE 44

Relevance feedback: Evaluation

Pick one of the evaluation measures from an earlier lecture, e.g., precision in the top 10: P@10.
Compute P@10 for the original query q0.
Compute P@10 for the modified relevance-feedback query q1.
In most cases: q1 is spectacularly better than q0! Is this a fair evaluation?

44 / 60

slide-45
SLIDE 45

Relevance feedback: Evaluation

Fair evaluation must be on the "residual" collection: docs not yet judged by the user. Studies have shown that relevance feedback is successful when evaluated this way. Empirically, one round of relevance feedback is often very useful. Two rounds are marginally useful.

45 / 60

slide-46
SLIDE 46

Evaluation: Caveat

True evaluation of usefulness must compare to other methods taking the same amount of time. Alternative to relevance feedback: User revises and resubmits query. Users may prefer revision/resubmission to having to judge relevance of documents. There is no clear evidence that relevance feedback is the “best use” of the user’s time.

46 / 60

slide-47
SLIDE 47

Outline

1. Recap
2. Reduction in number of parameters
3. Motivation for query expansion
4. Relevance feedback: Details
5. Pseudo Relevance Feedback
6. Query expansion

47 / 60

slide-48
SLIDE 48

Relevance feedback: Problems

Relevance feedback is expensive.

Relevance feedback creates long modified queries. Long queries are expensive to process.

Users are reluctant to provide explicit feedback. It’s often hard to understand why a particular document was retrieved after applying relevance feedback. Excite had full relevance feedback at one point, but abandoned it later.

48 / 60

slide-49
SLIDE 49

Pseudo-relevance feedback

Pseudo-relevance feedback automates the "manual" part of true relevance feedback. Pseudo-relevance algorithm:

Retrieve a ranked list of hits for the user's query.
Assume that the top k documents are relevant.
Do relevance feedback (e.g., Rocchio).

This works very well on average, but it can go horribly wrong for some queries. Several iterations can cause query drift.
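
A sketch of one round of pseudo-relevance feedback, built from the cosine ranking and a positive-only Rocchio update sketched earlier; all data and the choice k = 2 are made up:

```python
import numpy as np

def rank(q, C):
    """Rank documents (columns of C) by cosine similarity to query q."""
    sims = (q @ C) / (np.linalg.norm(q) * np.linalg.norm(C, axis=0) + 1e-12)
    return np.argsort(-sims)

def rocchio(q0, relevant, alpha=1.0, beta=0.75):
    """Positive-only Rocchio update (no judged nonrelevant docs in pseudo-RF)."""
    return np.maximum(alpha * q0 + beta * relevant.mean(axis=0), 0.0)

C = np.array([[1., 0., 2., 0.],    # toy term-document matrix
              [0., 1., 1., 1.],
              [3., 1., 0., 2.],
              [0., 2., 1., 0.]])
q = np.array([1., 0., 1., 0.])

k = 2
top_k = rank(q, C)[:k]                     # retrieve, then assume the top k are relevant
q_expanded = rocchio(q, C[:, top_k].T)     # feedback using those pseudo-relevant docs
print(rank(q_expanded, C))                 # re-ranked results for the expanded query
```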

49 / 60

slide-50
SLIDE 50

Pseudo-relevance feedback at TREC4

Cornell SMART system. Results show the number of relevant documents in the top 100 for 50 queries (so the total number of documents considered is 5000):

method          relevant documents
lnc.ltc         3210
lnc.ltc-PsRF    3634
Lnu.ltu         3709
Lnu.ltu-PsRF    4350

The results contrast two length-normalization schemes (L vs. l) and pseudo-relevance feedback (PsRF). The pseudo-relevance feedback method used added only 20 terms to the query (Rocchio will add many more). This demonstrates that pseudo-relevance feedback is effective on average.

50 / 60

slide-51
SLIDE 51

Outline

1. Recap
2. Reduction in number of parameters
3. Motivation for query expansion
4. Relevance feedback: Details
5. Pseudo Relevance Feedback
6. Query expansion

51 / 60

slide-52
SLIDE 52

Query expansion

Query expansion is another method for increasing recall. We use “global query expansion” to refer to “global methods for query reformulation”. In global query expansion, the query is modified based on some global resource, i.e. a resource that is not query-dependent. Main information we use: (near-)synonymy A publication or database that collects (near-)synonyms is called a thesaurus. We will look at two types of thesauri: manually created and automatically created.

52 / 60

slide-53
SLIDE 53

Query expansion: Example

53 / 60

slide-54
SLIDE 54

Types of user feedback

User gives feedback on documents.

More common in relevance feedback

User gives feedback on words or phrases.

More common in query expansion

54 / 60

slide-55
SLIDE 55

Types of query expansion

Manual thesaurus (maintained by editors, e.g., PubMed)
Automatically derived thesaurus (e.g., based on co-occurrence statistics)
Query equivalence based on query-log mining (common on the web, as in the "palm" example)

55 / 60

slide-56
SLIDE 56

Thesaurus-based query expansion

For each term t in the query, expand the query with words the thesaurus lists as semantically related with t. Example from earlier: hospital → medical.
Generally increases recall.
May significantly decrease precision, particularly with ambiguous terms: interest rate → interest rate fascinate.
Widely used in specialized search engines for science and engineering.
It's very expensive to create a manual thesaurus and to maintain it over time.
A manual thesaurus is roughly equivalent to annotation with a controlled vocabulary.
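
A minimal sketch of thesaurus-based expansion; the tiny thesaurus dictionary here is invented for illustration:

```python
# Invented toy thesaurus mapping a term to (near-)synonyms
THESAURUS = {
    "hospital": ["medical", "clinic"],
    "aircraft": ["plane", "airplane"],
}

def expand_query(terms):
    """Expand each query term with the words the thesaurus lists as related."""
    expanded = []
    for t in terms:
        expanded.append(t)
        expanded.extend(THESAURUS.get(t, []))
    return expanded

print(expand_query(["aircraft", "engine"]))
# ['aircraft', 'plane', 'airplane', 'engine']
```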

56 / 60

slide-57
SLIDE 57

Example for manual thesaurus: PubMed

57 / 60

slide-58
SLIDE 58

Automatic thesaurus generation

Attempt to generate a thesaurus automatically by analyzing the distribution of words in documents. The fundamental notion is similarity between two words.

Definition 1: Two words are similar if they co-occur with similar words. ("car" and "motorcycle" co-occur with "road", "gas" and "license", so they must be similar.)

Definition 2: Two words are similar if they occur in a given grammatical relation with the same words. (You can harvest, peel, eat, prepare, etc. apples and pears, so apples and pears must be similar.)

Co-occurrence is more robust; grammatical relations are more accurate.
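
A sketch of Definition 1: compare the co-occurrence profiles of two words with a cosine. The counts below are made up, and a real system would typically reweight them (e.g., with PMI) before comparing:

```python
import numpy as np

terms = ["car", "motorcycle", "apple", "road", "gas", "license", "pear"]
# Made-up co-occurrence counts: entry (i, j) = how often terms[i] and terms[j] co-occur
X = np.array([
    [0, 2, 0, 9, 7, 5, 0],   # car
    [2, 0, 0, 8, 6, 4, 0],   # motorcycle
    [0, 0, 0, 0, 0, 0, 6],   # apple
    [9, 8, 0, 0, 3, 2, 0],   # road
    [7, 6, 0, 3, 0, 1, 0],   # gas
    [5, 4, 0, 2, 1, 0, 0],   # license
    [0, 0, 6, 0, 0, 0, 0],   # pear
], dtype=float)

def similarity(i, j):
    """Cosine between co-occurrence profiles: similar if they co-occur with similar words."""
    vi, vj = X[i], X[j]
    return vi @ vj / (np.linalg.norm(vi) * np.linalg.norm(vj) + 1e-12)

print(round(similarity(0, 1), 2))   # car vs. motorcycle: high (shared road/gas/license)
print(round(similarity(0, 2), 2))   # car vs. apple: low (no shared contexts)
```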

58 / 60

slide-59
SLIDE 59

Co-occurrence-based thesaurus: Examples

Word: nearest neighbors
absolutely: absurd, whatsoever, totally, exactly, nothing
bottomed: dip, copper, drops, topped, slide, trimmed
captivating: shimmer, stunningly, superbly, plucky, witty
doghouse: dog, porch, crawling, beside, downstairs
makeup: repellent, lotion, glossy, sunscreen, skin, gel
mediating: reconciliation, negotiate, case, conciliation
keeping: hoping, bring, wiping, could, some, would
lithographs: drawings, Picasso, Dali, sculptures, Gauguin
pathogens: toxins, bacteria, organisms, bacterial, parasite
senses: grasp, psyche, truly, clumsy, naive, innate

59 / 60

slide-60
SLIDE 60

Summary

Relevance feedback and query expansion increase recall. In many cases, precision is decreased, often significantly. Log-based query modification (which is more complex than simple query expansion) is more common on the web than relevance feedback.

60 / 60