!"#$ %&'!( - - PDF document

v 0 o 2
SMART_READER_LITE
LIVE PREVIEW

!"#$ %&'!( - - PDF document

CS-463: Information Retrieval Systems !"#$ %&'!( )*v,*-./0o2 Yannis Tzitzikas University of Crete CS-463,Spring 05


slide-1
SLIDE 1

1

Yannis Tzitzikas

  • CS-463: Information Retrieval Systems

!"#$ %&'!( )*v,*-./0o2

University of Crete

CS-463,Spring 05

CS-463, Information Retrieval, Yannis Tzitzikas, U. of Crete, Spring 2005

Outline

  • Relevance Feedback
  • Query Reformulation
    – Term Reweighting
    – Query Expansion
  • Query reformulation methods for the Vector Space Model
    – Optimal Query, Rocchio Method, Ide Method, Ide "Dec Hi" Method
  • Evaluation
  • Pseudo relevance feedback
  • Genetic algorithms
  • Query Expansion
    – Automatic Local Analysis
    – Thesaurus-based Query Expansion
    – Automatic Global Analysis
    – Statistical Thesaurus


Why relevance feedback?

  • It is hard for a user to formulate a good query without knowing the
    contents and the vocabulary of the collection
    – the initial query should be seen as a first attempt
    – after inspecting some of the retrieved documents, the user can refine
      both his information need and his query
  • Retrieval is therefore an inherently iterative process
  • The system can assist the user in reformulating the query

The basic idea (the feedback cycle)

(1) The user submits an initial query
(2) The system returns a ranked answer
(3) The user marks some of the returned documents as relevant or non-relevant
(4) The system reformulates the query and returns a new (re-ranked) answer

  • Remarks
    – Steps (2), (3), (4) can be repeated until the user is satisfied
    – The user only has to formulate the initial query in step (1)

Relevance Feedback

The central idea: let the user judge (some of) the returned documents as
(1) relevant or (2) non-relevant, and let the system exploit these judgements
to improve the query. Advantages:

  • It shields the user from the details of the query reformulation process:
    the user only has to judge documents
  • It breaks down the searching task into a sequence of small steps
  • It provides a controlled process designed to emphasize some terms and
    de-emphasize others
  • etc.

The relevance feedback loop

[Figure: the IR system returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, …)
for the query string; the user provides feedback on them; query reformulation
produces a revised query; and the re-ranked answer (1. Doc2, 2. Doc4,
3. Doc5, …) is returned.]

The IR system architecture

[Figure: User Interface, Text Operations, Query Operations, Indexing,
Searching, Ranking, Index (inverted file), DB Manager Module, Text Database;
user feedback enters at the Query Operations module, which revises the
logical view of the query.]


Example: image retrieval with relevance feedback

[Screenshots: http://nayana.ece.ucsb.edu/imsearch/imsearch.html with the query
q=bike — the initial results, the user's relevance marks, and the improved
results after feedback.]


Relevance Feedback: Query Reformulation

Two basic techniques for query reformulation:

  • Term Reweighting:
    – increase the weights of the terms that appear in the marked relevant
      documents and decrease the weights of the terms that appear in the
      marked non-relevant documents
  • Query Expansion:
    – add new terms to the query (e.g. terms that appear in the marked
      relevant documents)
  • Below: methods for the Vector Space Model

Query reformulation methods for the Vector Space Model

  • The Optimal Query
    – Suppose the set Cr of all relevant documents were known.
    – Then the query that best separates the relevant from the non-relevant
      documents would be:

        q_opt = (1/|Cr|) · Σ_{dj ∈ Cr} dj − (1/(N − |Cr|)) · Σ_{dj ∉ Cr} dj

      where N is the total number of documents. Since the full set Cr is not
      known in advance, the optimal query cannot be computed directly; the
      methods below approximate it using the user's feedback.


Approaches for approximating the optimal query

  • (I) Rocchio Method
  • (II) Ide Method
  • (III) Ide "Dec Hi" Method

(I) Standard Rocchio Method

Instead of the (unknown) set of all relevant documents, use the sets of
relevant (Dr) and non-relevant (Dn) documents that the user has marked in the
answer of the original query q:

    q_m = α·q + (β/|Dr|) · Σ_{dj ∈ Dr} dj − (γ/|Dn|) · Σ_{dj ∈ Dn} dj

  α: Tunable weight for initial query.
  β: Tunable weight for relevant documents.
  γ: Tunable weight for irrelevant documents.
  Usually γ < β (the relevant docs are more important).
  If γ = 0 then we have positive feedback only.
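To make the update concrete, here is a minimal Python sketch of the Rocchio formula (not from the slides): vectors are plain dicts of term weights, and the documents, vocabulary, and default parameter values are invented for the example.

```python
def rocchio(q, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update: q_m = alpha*q + beta*mean(Dr) - gamma*mean(Dn).
    Vectors are dicts mapping term -> weight; negative result weights are
    clipped to 0 (a common practical choice)."""
    terms = set(q)
    for d in relevant + non_relevant:
        terms |= set(d)
    q_m = {}
    for t in terms:
        w = alpha * q.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if non_relevant:
            w -= gamma * sum(d.get(t, 0.0) for d in non_relevant) / len(non_relevant)
        q_m[t] = max(w, 0.0)
    return q_m

# Toy usage: one marked-relevant and one marked-non-relevant document.
q  = {"apple": 1.0, "computer": 1.0}
dr = [{"apple": 1.0, "computer": 1.0, "laptop": 1.0}]
dn = [{"apple": 1.0, "fruit": 1.0}]
q_m = rocchio(q, dr, dn)   # "laptop" gains weight, "fruit" is suppressed
```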


(II) Ide Regular Method

Like Rocchio, but without normalization by the sizes of Dr and Dn, so the
feedback documents act with their full weight:

    q_m = α·q + β · Σ_{dj ∈ Dr} dj − γ · Σ_{dj ∈ Dn} dj

  α: Tunable weight for initial query.
  β: Tunable weight for relevant documents.
  γ: Tunable weight for irrelevant documents.

(III) Ide “Dec Hi” Method

(Bias towards rejecting just the highest ranked of the irrelevant documents:)

    q_m = α·q + β · Σ_{dj ∈ Dr} dj − γ · max_{non-relevant}(dj)

where max_{non-relevant}(dj) is the highest-ranked non-relevant document.

  α: Tunable weight for initial query.
  β: Tunable weight for relevant documents.
  γ: Tunable weight for irrelevant document.
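A matching sketch of the Ide "Dec Hi" variant, under the same toy dict-of-weights representation (the document vectors below are invented); only the highest-ranked non-relevant document is subtracted.

```python
def ide_dec_hi(q, relevant, ranked_non_relevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Ide 'Dec Hi': add all marked relevant documents at full weight,
    subtract only the highest-ranked non-relevant one
    (ranked_non_relevant[0])."""
    top_nr = ranked_non_relevant[0] if ranked_non_relevant else {}
    terms = set(q) | set(top_nr)
    for d in relevant:
        terms |= set(d)
    return {t: alpha * q.get(t, 0.0)
               + beta * sum(d.get(t, 0.0) for d in relevant)
               - gamma * top_nr.get(t, 0.0)
            for t in terms}

# Toy usage: the second non-relevant document is ignored by construction.
q_m = ide_dec_hi({"a": 1.0},
                 [{"a": 1.0, "b": 1.0}],
                 [{"a": 1.0, "c": 2.0}, {"c": 5.0}])
```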


Comparing (I), (II), (III)

  • No method has been proven to consistently outperform the others on all
    collections.
  • All of them improve retrieval effectiveness (recall & precision).
  • Usually α = β = γ = 1.
  • Recap:
    – Rocchio:    q_m = α·q + (β/|Dr|)·Σ_{dj ∈ Dr} dj − (γ/|Dn|)·Σ_{dj ∈ Dn} dj
    – Ide:        q_m = α·q + β·Σ_{dj ∈ Dr} dj − γ·Σ_{dj ∈ Dn} dj
    – Ide Dec Hi: q_m = α·q + β·Σ_{dj ∈ Dr} dj − γ·max_{non-relevant}(dj)

Evaluating relevance feedback

Remarks

  • By construction, the reformulated query will rank explicitly-marked
    relevant documents higher and explicitly-marked irrelevant documents lower.
  • The method should not get credit for improvement on these documents, since
    it was told their relevance.
  • In machine learning, this error is called “testing on the training data.”
  • Evaluation should focus on generalizing to other un-rated documents.

Fair Evaluation of Relevance Feedback

  • Remove from the corpus any documents for which feedback was provided.
  • Measure recall/precision performance on the remaining residual collection.
  • Compared to the complete corpus, the specific recall/precision numbers may
    decrease, since relevant documents were removed.
  • However, relative performance on the residual collection provides fair
    data on the effectiveness of relevance feedback.
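The residual-collection protocol can be sketched as follows (a hypothetical example; the document IDs and relevance sets are invented):

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def residual_evaluation(ranking, relevant, feedback_docs, k):
    """Evaluate on the residual collection: drop every document the user gave
    feedback on, then measure precision@k on what remains."""
    residual = [d for d in ranking if d not in feedback_docs]
    residual_relevant = relevant - feedback_docs
    return precision_at_k(residual, residual_relevant, k)

# Toy usage: feedback was given on d1 and d2, so both are excluded.
p = residual_evaluation(ranking=["d1", "d2", "d3", "d4", "d5"],
                        relevant={"d1", "d3", "d5"},
                        feedback_docs={"d1", "d2"}, k=3)
```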



Relevance Feedback Evaluation

Simulated interactive retrieval consistently outperforms non-interactive retrieval (70% here).


Relevance Feedback Evaluation: Case Study

Example of evaluation of interactive information retrieval [Koenemann & Belkin 1996]

Goal of study: show that relevance feedback improves retrieval effectiveness

Details

  • 64 novice searchers (43 female, 21 male, native English)
  • TREC test bed (Wall Street Journal subset)
  • Two search topics

– Automobile Recalls
– Tobacco Advertising and the Young

  • Relevance judgements from TREC and experimenter
  • System was INQUERY (vector space with some bells and whistles)
  • Subjects had a tutorial session to learn the system
  • Their goal was to keep modifying the query until they had developed one
    that gets high precision

  • Reweighting of terms similar to but different from Rocchio


Credit: Marti Hearst


Evaluation: Precision vs. RF condition (from Koenemann & Belkin 96)

Credit: Marti Hearst

Criterion: p@30 (precision at 30 documents)

Compare:
  p@30 for users with relevance feedback
  p@30 for users without relevance feedback

Goal: show that users with relevance feedback do better

Results:

  • Subjects with relevance feedback had 17–34% better performance
  • But: the difference in precision numbers was not statistically
    significant, and search times were approximately equal


Relevance feedback: assumptions

  • A1: User has sufficient knowledge for the initial query.
    – However:
      • The user does not always have sufficient initial knowledge.
      • Examples: misspellings, mismatch of the searcher’s vocabulary vs the
        collection vocabulary.
  • A2: Relevance prototypes are “well-behaved”.
    – Either: all relevant documents are similar to a single prototype.
    – Or: there are different prototypes, but they have significant
      vocabulary overlap.
    – However:
      • There are often several relevance prototypes.

Why is relevance feedback rarely used in practice?

  • Users are often reluctant to provide explicit feedback.
  • The reformulated (long) queries are expensive to evaluate (search engines
    process lots of queries and allow little time for each one).
  • It is often hard for the user to understand why a particular document was
    retrieved after relevance feedback has been applied.


Relevance feedback in Web search engines

  • Some search engines offer a similar/related pages feature
    (the simplest form of relevance feedback)
    – Google (link-based)
  • But some don’t, because it’s hard to explain to the average user.
    – Excite initially had true relevance feedback, but abandoned it due to
      lack of use.

Pseudo Relevance Feedback

  • Relevance feedback without user interaction.
  • Apply the relevance feedback methods, assuming that the top m ranked
    documents of the answer are relevant (without asking the user)
    – we can additionally assume that the last-ranked documents are
      non-relevant
  • The query is then reformulated and evaluated again, automatically.
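A minimal sketch of pseudo relevance feedback (not from the slides): the top-m documents of the ranked answer are assumed relevant and their averaged weights are added to the query, Rocchio-style with positive feedback only; the vectors and weights are invented.

```python
def pseudo_feedback(q, ranked_docs, m=2, beta=0.5):
    """Pseudo relevance feedback: treat the top-m ranked documents of the
    answer as relevant (no user input) and add their averaged term weights
    to the original query."""
    top = ranked_docs[:m]
    q_m = dict(q)  # start from the original query weights
    for t in {t for d in top for t in d}:
        q_m[t] = q.get(t, 0.0) + beta * sum(d.get(t, 0.0) for d in top) / len(top)
    return q_m

# Toy usage: the third document is below the m=2 cutoff and contributes nothing.
q2 = pseudo_feedback({"bike": 1.0},
                     [{"bike": 1.0, "mountain": 1.0},
                      {"bike": 1.0, "race": 1.0},
                      {"car": 1.0}], m=2)
```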


The pseudo feedback loop

[Figure: as in the relevance feedback loop, but the user's feedback step is
replaced by Pseudo Feedback — the top-ranked documents (1. Doc1, 2. Doc2,
3. Doc3, …) are assumed relevant, query reformulation produces the revised
query, and the re-ranked answer (1. Doc2, 2. Doc4, 3. Doc5, …) is returned.]

Evaluation of pseudo feedback

  • It has been found to yield improvements on average in the TREC ad-hoc
    retrieval task.
  • It works best when the top-ranked documents of the initial answer are
    indeed relevant; otherwise the reformulated query can drift away from
    the information need.


Genetic algorithms and relevance feedback

  • The reformulation methods we saw perform a local search in the space of
    queries (they may get trapped in a local optimum — remember Hill Climbing)
  • Genetic algorithms [John Holland 95] are an alternative way to search
    this space

  • Genetic algorithms in brief:
    – Basic notions:
      • a population of individuals (candidate solutions)
      • a fitness function that scores each individual (survival of the
        fittest)
    – In each generation the fittest individuals are selected, and new ones
      are produced from them by crossover and mutation


A Simple Genetic Algorithm

Simple_Genetic_Algorithm() {
    initialize population;
    calculate fitness function;
    while (fitness != OptimalValue) {
        selection;
        crossover;
        mutation;
        calculate fitness function;
    }
}


Genetic algorithms for query reformulation: the steps

(1) Start from the user query q
(2) Create an initial population of 10–40 variations of q with random term
    weights
(3) Evaluate each variation qi with a fitness function, e.g.
      Fitness(qi) = α·RetRel − β·RetNRel − γ·NotRetRel
    where
      RetRel    = number of retrieved relevant documents
      RetNRel   = number of retrieved non-relevant documents
      NotRetRel = number of relevant documents that were not retrieved
    The last count presupposes knowledge of all relevant documents (Nrel),
    which is unknown, hence usually γ = 0
(4) Select the fittest variations and produce new ones by crossover and
    mutation
(5) Replace the old population with the new generation
(6) Go to (3); stop when the fitness no longer improves

Example (1/2)

  • Original query q={k1,….,k15}
  • 10 variations {q1,…,q10} with random weights
  • Compute Score(answer(qi)) for each i=1..10
  • Let A, B, C, D be the queries with the highest score
  • Make an extra copy and 2 random copies
  • Result: A A A A B B B C C D
  • Order them randomly and form pairs
  • (A,C) (A,B) (C,A) (A,D) (B,B)
  • Crossover rate

– a rate of 60% means that 3 of the 5 pairs will be selected for breeding

  • Breed them using one-point crossover operator

– e.g. If |V|=15 then a random point could be 8
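The crossover and mutation operators of the example can be sketched as follows (a toy illustration; the vector length and crossover point follow the |V|=15, point-8 example, while the weights themselves are invented):

```python
import random

def one_point_crossover(a, b, point=None):
    """One-point crossover on two equal-length weight vectors: the children
    swap their tails after the chosen (or random) position."""
    assert len(a) == len(b)
    if point is None:
        point = random.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(weights, rate=0.05):
    """With probability `rate`, replace one randomly chosen weight with a
    new random weight in [0, 1]."""
    w = list(weights)
    if random.random() < rate:
        w[random.randrange(len(w))] = random.random()
    return w

# Toy usage: |V| = 15 term weights per query variation, crossover point 8.
A = [1.0] * 15
C = [0.0] * 15
child1, child2 = one_point_crossover(A, C, point=8)
```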


Example (2/2)

[Figure: queries A and C recombined with one-point crossover at position 8.]

Crossover variants: single-point crossover, two-point crossover, uniform
(random) crossover, arithmetic crossover.

)α!"α (3/3)

  • Mutations
    – Mutation rate: determines the number of variants to be chosen for
      mutation (usually small, e.g. 0.05)
    – Mutation method: choose one weight at random and replace it with
      another random weight

Mutation with random bit inversion

  • )"αα α #

– 2 ! α!#" α"α#α

  • (! Rochio ?)

Outline

  • Relevance Feedback
  • Query Reformulation
    – Term Reweighting
    – Query Expansion
  • Query reformulation methods for the Vector Space Model
    – Optimal Query, Rocchio Method, Ide Method, Ide "Dec Hi" Method
  • Evaluation
  • Pseudo relevance feedback
  • Genetic algorithms
  • Query Expansion (next)
    – Automatic Local Analysis
    – Thesaurus-based Query Expansion
    – Automatic Global Analysis
    – Statistical Thesaurus

Query Expansion

  • In relevance feedback, users give additional input (relevant/non-relevant)
    on documents.
  • In query expansion, users give additional input (good/bad search term) on
    words or phrases.


Query expansion approaches

  • Refinements based on query log mining

– Common on the web

  • Local Analysis:

– Analysis of documents in result set

  • Global Analysis: Thesaurus-based

– Controlled vocabulary

  • Maintained by editors (e.g., medline)

– Automatically derived thesaurus

  • (co-occurrence statistics)


Example of Query Expansion

Also: see AltaVista, Teoma


Automatic Local Analysis

  • Instead of asking the user, the system analyzes the documents in the
    answer of the query (the local document set)
  • Terms that frequently co-occur with the query terms in these documents
    are added to the query
    – co-occurrence in the same documents is taken as evidence of semantic
      association
    – the analysis is performed at query time, on the retrieved documents
      only
  • Example:
      “Apple computer”  →  “Apple computer Powerbook laptop”

Automatic Local Analysis: approaches

  – Association Matrix
  – Metric Correlation Matrix
  – (also: Scalar Clusters, Local Context Analysis)


(a) Association Matrix and Normalized Association Matrix

The association matrix is a term-term matrix:

         w1    w2    w3   …    wn
   w1   c11   c12   c13   …   c1n
   w2   c21    .
   w3   c31
    .     .
   wn   cn1

cij: correlation factor between term i and term j:

    cij = Σ_{dk ∈ D} fik · fjk

    fik: frequency of term i in document k

  • Normalized Association Matrix
    – The frequency-based correlation factor favors more frequent terms.
    – Normalize the association scores:

        sij = cij / (cii + cjj − cij)

    – The normalized score is 1 if two terms have the same frequency in all
      documents.
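A small Python sketch of both matrices (not from the slides); the toy documents are invented:

```python
from collections import Counter

def association_matrices(docs):
    """Build the term-term association matrix c[i][j] = sum_k f_ik * f_jk
    and its normalized version s[i][j] = c[i][j] / (c[i][i] + c[j][j] - c[i][j])."""
    freqs = [Counter(d) for d in docs]          # f_ik per document
    vocab = sorted({t for f in freqs for t in f})
    c = {i: {j: sum(f[i] * f[j] for f in freqs) for j in vocab} for i in vocab}
    s = {i: {j: c[i][j] / (c[i][i] + c[j][j] - c[i][j]) for j in vocab}
         for i in vocab}
    return c, s

# Toy usage on three tiny "documents" (lists of term occurrences).
docs = [["apple", "computer", "apple"],
        ["apple", "fruit"],
        ["computer", "laptop"]]
c, s = association_matrices(docs)
```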

(b) Metric Correlation Matrix

  • Association correlation does not account for the proximity of terms in
    documents, just co-occurrence frequencies within documents.
  • Metric correlations account for term proximity:

      cij = Σ_{ku ∈ Vi} Σ_{kv ∈ Vj} 1 / r(ku, kv)

    Vi: set of all occurrences of term i in any document.
    r(ku, kv): distance in words between word occurrences ku and kv
    (∞ if ku and kv are occurrences in different documents).

  • Normalized Metric Correlation Matrix
    – to account for term frequencies:

        sij = cij / (|Vi| · |Vj|)


Query Expansion with Correlation Matrix

  • For each term i in the query, expand the query with n terms: those with
    the highest values of cij (or sij).
  • This adds semantically related terms from the “neighborhood” of the
    query terms.
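A sketch of this expansion step, assuming a precomputed normalized correlation matrix `s` (here hand-built with invented scores):

```python
def expand_query(query_terms, s, n=1):
    """Expand each query term with its n strongest neighbors according to a
    (normalized) term-term correlation matrix s[i][j]."""
    expanded = list(query_terms)
    for t in query_terms:
        neighbors = sorted((j for j in s.get(t, {}) if j != t),
                           key=lambda j: s[t][j], reverse=True)
        expanded.extend(j for j in neighbors[:n] if j not in expanded)
    return expanded

# Toy usage with a hand-built matrix (scores are invented for the example).
s = {"apple":    {"apple": 1.0, "fruit": 0.6, "computer": 0.4},
     "computer": {"computer": 1.0, "laptop": 0.7, "apple": 0.4}}
q = expand_query(["apple", "computer"], s, n=1)
```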


Thesauri

  • A thesaurus provides, for each term of a vocabulary, a set of related
    words and phrases [see also Sec 7.2.5]
  • Example:

      physician
        syn: || croaker, doc, doctor, MD, medical, mediciner, medico, || sawbones
        rel: medic, general practitioner, surgeon

  • Online thesauri:
    – Roget’s thesaurus
    – INSPEC thesaurus
    – WordNet (http://wordnet.princeton.edu/)
    – The Free Dictionary (http://www.thefreedictionary.com/)

Using a thesaurus

  • Provides a standard vocabulary for indexing and searching
  • Assists users with locating terms for proper query formulation
    – (important when the user is not familiar with the domain)
  • Allows reformulation:
    – if a query returns too few documents, it can be broadened with
      synonyms or broader terms
    – if a query returns too many documents, it can be narrowed with more
      specific terms


Kinds of thesauri

  • General-purpose thesauri
    – e.g. Roget’s thesaurus: designed to assist the writer in creatively
      selecting vocabulary
  • Thesauri for Information Retrieval
    – for coordinating the basic processes of indexing and retrieval
    – designed for specific subject areas and are therefore domain dependent
    – Examples
      • INSPEC


INSPEC thesaurus (for IR)

  • Domain: physics, electrical engineering, electronics, computers
  • Example:

– computer-aided instruction

  • see also education
  • UF teaching machines

(UF: Used For, converse: USE)

  • BT educational computing (BT: Broader Term)
  • TT computer applications (TT: Top Node, i.e. root of the hierarchy)
  • RT education, teaching (RT: Related Term)


WordNet (http://wordnet.princeton.edu/)

  • A more detailed database of semantic relationships between English words.
  • Developed by famous cognitive psychologist George Miller and a team at Princeton

University.

  • About 144,000 English words. Nouns, adjectives, verbs, and adverbs grouped
    into about 109,000 synonym sets called synsets.

Synset Relationships

  • Antonym: front → back
  • Attribute: benevolence → good (noun to adjective)
  • Pertainym: alphabetical → alphabet (adjective to noun)
  • Similar: unquestioning → absolute
  • Cause: kill → die
  • Entailment: breathe → inhale
  • Holonym: chapter → text (part-of)
  • Meronym: computer → cpu (whole-of)
  • Hyponym: tree → plant (specialization)
  • Hypernym: fruit → apple (generalization)


AAT (Art and Architecture Thesaurus)

  • Controlled vocabulary for describing and retrieving information: fine art,
    architecture, decorative art, and material culture.
  • Almost 120,000 terms for objects, textual materials, images, architecture
    and culture from all periods and all cultures.
  • Used by archives, museums, and libraries to describe items in their
    collections.
  • Used to search for materials.
  • Used by computer programs, for information retrieval, and natural language
    processing.


Thesaurus design issues

  • Coordination Level
    – refers to the construction of phrases from individual terms
    – precoordination: the thesaurus contains phrases
      • + the vocabulary is very precise
      • − the user has to be aware of the phrase construction rules; large size
    – postcoordination: the thesaurus does not contain phrases; they are
      constructed while indexing/searching
      • + the user does not worry about the order of the words
      • − precision may fall
  • Term Relationships
    – equivalence relations (e.g. synonymy)
    – hierarchical relations (e.g. dogs BT animals)
    – nonhierarchical relations (e.g. RT)


Thesaurus design issues (2)

  • Number of Entries per Term
    – preferably a single entry for each thesaurus term
    – however, homonyms do not make this possible
      • parenthetical qualifiers: bonds (chemical), bonds (adhesive)
  • Specificity of Vocabulary
    – high specificity -> large vocabulary size
  • Control of Term Frequency of Class Members (for statistical thesauri)
    – the terms of a thesaurus should have roughly equal frequencies
    – the total frequency in each class (of terms) should be equal
  • Normalization of Vocabulary
    – terms should be in noun form
    – other rules related to singularity of terms, spelling, capitalization,
      abbreviations, initials, acronyms, punctuation


Thesaurus construction

[A] Manual construction
[B] Automatic construction
  [B.1] from the documents of a collection
        Remark: the resulting thesaurus then reflects the vocabulary actually
        used in the collection
  [B.2] by merging existing thesauri
        Remark: applicable when two or more thesauri for the same domain
        already exist


[A] Manual thesaurus construction

  • Define subject boundaries
  • Partition into divisions and subject areas
  • Collection of terms
    – sources: encyclopedias, handbooks, textbooks, journal titles,
      catalogues, other thesauri, subject experts, potential users
  • Analysis of terms (synonyms, hierarchical structure, definitions,
    scope notes)
  • Reviewing phase
  • Delivery in both hierarchical and alphabetical arrangement
  • Maintenance (new terms, etc)

A very long, laborious and costly process


Faceted Taxonomies + CTCA

  • A new flexible and fast approach (to be discussed in another lecture)


Thesaurus-based Query Expansion

  • Approach:
    – For each query term t, expand the query with related terms of t taken
      from the thesaurus
    – The added terms usually get lower weights than the original ones
    – e.g. WordNet-based query expansion:
      • Add synonyms from the same synset.
      • Add hyponyms to add specialized terms.
      • Add hypernyms to generalize a query.
      • Add other related terms to expand the query.
  • Generally increases recall.
  • May decrease precision, particularly with ambiguous terms:
      “interest rate”  →  “interest rate fascinate evaluate”
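A minimal sketch of thesaurus-based expansion over a hand-made toy thesaurus (the entries are invented for the example; a real system would use a resource such as WordNet):

```python
# A hand-made toy thesaurus; the entries are invented for the example.
THESAURUS = {
    "physician": {"syn": ["doctor", "medico"], "rel": ["surgeon"]},
    "rate":      {"syn": ["charge"],           "rel": ["fee"]},
}

def thesaurus_expand(query_terms, thesaurus, relations=("syn",)):
    """Expand each query term with its thesaurus entries for the chosen
    relations (e.g. only synonyms, to limit the loss of precision)."""
    expanded = list(query_terms)
    for t in query_terms:
        for rel in relations:
            for term in thesaurus.get(t, {}).get(rel, []):
                if term not in expanded:
                    expanded.append(term)
    return expanded

# Toy usage: terms absent from the thesaurus pass through unchanged.
q = thesaurus_expand(["physician", "visit"], THESAURUS)
```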

[B.1] Automatic thesaurus construction from a document collection

  • The idea: use the documents of the collection themselves to identify the
    terms of the thesaurus and the relationships between them
  • Co-occurrence of terms in the documents is taken as evidence of a
    semantic relationship between them
  • Steps
    – 1/ Construction of the vocabulary
    – 2/ Computation of term similarities
    – 3/ Organization of the vocabulary into a hierarchy (clustering)

Automatic thesaurus construction from a collection (II)

  • 1/ Construction of the vocabulary
    – Decision: desired specificity
      • if high, then emphasis will be given to identifying precise phrases
    – Terms can be selected from titles, abstracts, or even the full text
    – Normalization: stemming, stoplists
    – Criteria for selecting a term:
      • frequency of occurrence (divide words into 3 categories: low, medium,
        high; select terms with medium frequency)
      • discrimination value ~ idf
    – Phrase construction (if desired; recall coordination level)
  • 2/ Computation of term similarities
    – Example similarity measures: Cosine, Dice


Automatic thesaurus construction from a collection (III)

3/ Organization of the vocabulary into a hierarchy (clustering)

  • A clustering algorithm based on term frequencies can be used, e.g.:
    1/ Identify a set of frequency ranges
    2/ Group the vocabulary terms into different classes based on their
       frequencies and the ranges selected in Step 1. There will be one term
       class for each frequency range
    3/ The highest frequency class is assigned level 0, the next level 1,
       and so on
    4/ Parent-child links: the parent(s) of a term at level i is the most
       similar term at level i−1 (a term is allowed to have multiple parents)
    5/ Continue until reaching level 1

Example with 3 levels

[Figure: a three-level example hierarchy over the terms Computer, Databases,
IRS, SQL, TF-IDF, with parent-child links assigned as above.]


Query Expansion Based on a Statistical Thesaurus

  • The thesaurus constructed automatically from the collection is used to
    expand the query with terms of the classes that contain the query terms
  • Remarks
    – the weights of the added terms are derived from the weights of the
      query terms and from the frequencies of the classes they come from


Constructing the classes

  • Complete Link Algorithm
    – 1. Place each document in a distinct cluster
    – 2. Compute the similarity between all pairs of clusters
      • the similarity of two clusters is the minimum similarity between any
        two of their members
      • (hence the name “complete link”)
    – 3. Find the pair [Cu, Cv] with the highest inter-cluster similarity
    – 4. Merge the clusters Cu and Cv
    – 5. If more than one cluster remains, go to step 2
    – 6. The sequence of merges defines a hierarchy of clusters
  • This is a hierarchical agglomerative clustering algorithm
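The complete-link procedure can be sketched as follows (a naive O(n³) toy version, not from the slides; the similarity table in the usage example is invented):

```python
def complete_link(items, sim, threshold=0.0):
    """Naive hierarchical agglomerative clustering with the complete-link
    criterion: the similarity of two clusters is the MINIMUM pairwise
    similarity of their members. Repeatedly merges the most similar pair."""
    clusters = [[x] for x in items]
    merges = []
    while len(clusters) > 1:
        best, pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = min(sim(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or s > best:
                    best, pair = s, (i, j)
        if best < threshold:  # stop when even the best merge is too dissimilar
            break
        i, j = pair
        merges.append((clusters[i], clusters[j], best))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges

# Toy usage with an invented pairwise similarity table.
S = {("a", "b"): 0.9, ("a", "c"): 0.1, ("b", "c"): 0.2}
sim = lambda x, y: S.get((x, y), S.get((y, x), 1.0))
clusters, merges = complete_link(["a", "b", "c"], sim, threshold=0.5)
```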

Automatic Global Analysis

  • Determine term similarity through a pre-computed statistical analysis of
    the complete corpus (not just the answer of the query)
  • Compute association matrices that quantify term correlations in terms of
    how frequently terms co-occur
  • Expand the query with the most similar (associated) terms
  • Approach
    – Query Expansion Based on a Similarity Thesaurus
Problems with global analysis

  • Term ambiguity may introduce irrelevant statistically correlated terms.
    – “Apple computer”  →  “Apple red fruit computer”
  • Since terms are highly correlated anyway, expansion may not retrieve many
    additional documents.


Query Expansion Based on a Similarity Thesaurus

  • Key point
    – Terms are selected for expansion based on their similarity to the whole
      query, not to individual query terms
  • So it can avoid cases such as:
    – “Apple computer”  →  “Apple red fruit computer”
  • Examples
    – “fruit” is not added to “Apple computer” since it is far from “computer.”
    – “fruit” is added to “apple pie” since “fruit” is close to both “apple”
      and “pie.”

Query Expansion Based on a Similarity Thesaurus

  • Approach
    – Let N be the number of documents and t the number of terms, i.e. the
      vocabulary is {k1, …, kt}
    – Represent each term as a vector over the N documents (the dual of the
      usual representation of documents as vectors over terms):

        ki = (wi1, …, wiN)

      where

        wij = ((0.5 + 0.5 · fij / maxj(fij)) · itfj)
              / sqrt( Σ_{l=1..N} ((0.5 + 0.5 · fil / maxl(fil)) · itfl)² )

      with fij the frequency of term ki in document dj, and itfj the inverse
      term frequency of document dj (analogous to idf).


Query Expansion Based on a Similarity Thesaurus (II)

  • The similarity between two terms ku and kv is then defined as

      cu,v = ku · kv

  • Query Expansion
    – (1) Represent the query in the term-concept space:

            q = Σ_{ki ∈ q} wiq · ki

    – (2) Compute sim(q, ku) for each term ku:

            sim(q, ku) = q · ku

    – (3) Expand q with the top r ranked terms; the weight of an added
          term ku is

            wuq = sim(q, ku) / Σ_{ki ∈ q} wiq

  • Results
    – about 20% improved retrieval performance reported

Local vs. Global analysis

  • Global analysis requires intensive term correlation computation over the
    whole corpus, but only once, at system development time
  • Local analysis requires term correlation computation for every query, at
    run time
    – (although the number of terms and documents involved is much smaller)
  • Local analysis tends to give better results

Query expansion: summary

  • Query expansion is usually effective in increasing recall.
  • It may, however, decrease precision, particularly when the query terms
    are ambiguous.