
Information Retrieval Models

Chapter 2. In R. Baeza-Yates and B. Ribeiro-Neto: Modern Information Retrieval, Addison Wesley, 1999.

Jon Atle Gulla / Terje Brasethvik / Geir Solskinnsbakk


Outline

  • Introduction to information retrieval

  • Logical view of documents

– Document representations
– The “bag-of-words” approach

  • The Classic IR Models

– Boolean
– Vector
– Probabilistic


Information retrieval

  • Information retrieval = information access

  • (Document retrieval / Text retrieval / Search)
  • Retrieve documents that satisfy the user’s information need from the document collection

– Query interpretation
– Document representation and indexing
– Ranking of retrieved documents
– Linguistics, arithmetics and statistics


AllTheWeb

  • AllTheWeb: FAST’s showcase (www.alltheweb.com) - 2002

Query → Retrieved documents (www.alltheweb.com is part of Yahoo today)


IR vs. IE vs. TDM

  • Information retrieval

– Finding documents that are similar to the query

  • Fulfilling an information need

– Document retrieval / text retrieval

  • Give me information about Trondheim?
  • Information Extraction

– Extracting data

  • Extract today’s car-sales advertisements from adressa.no?
  • Text Mining

– Discovering new knowledge from text

  • Ex: Pubgene (http://www.pubgene.org): discovery of genome relations through retrieval of MEDLINE articles


Document retrieval

  • Give me information about Apple Computer?

– Article? Web site? Web store? Prices?

  • Is this flower poisonous?

– Image? Fact sheet? Medical/biological encyclopedia?

  • How much does a ticket cost from Trondheim to Paris?

– Airline Price table? Web travel agency?


Text Retrieval vs. Database Queries?

  • Well defined schema vs. no schema

  • Structured data vs. plain unformatted data
  • Identity of records vs. “fuzzy” similarity measures
  • Well-defined query languages and operations vs. “natural language” queries and lexical and mathematical query transformations


Document Retrieval problems

  • What is the definition of “CSCW” ?

– Finds no documents about “CSCW”
– Finds 1M+ documents about “CSCW”
– Finds no documents that actually define CSCW
– Finds 50 different definitions of CSCW


Retrieval Models

  • A retrieval model is an idealization or abstraction of an actual retrieval process
  • An approximation of the retrieval situation
  • A retrieval model is not the same as a retrieval implementation


Components of a retrieval model

  • User

– Search expert (e.g. librarian) vs. non-expert
– Background (knowledge of topic)
– In-depth searching vs. ”just-wanna-get-an-idea” searching

  • Documents:

– Different languages
– Semi-structured (e.g. HTML or XML) vs. plain text


Retrieving vs. Browsing ?

  • Open web directories

– Yahoo, …

  • Domain specific

– Medline, Lexis-Nexis, Jussnett, Dialog, …

  • Libraries

– Bibsys, ACM/IEEE - Diglib

  • Company Intranets

– Project workspaces
– General information

  • WWW

– Google, alltheweb, askJeeves, …


Taxonomy of retrieval models

  • Retrieval: Ad Hoc, Filtering
  • Classic models: Boolean, Vector, Probabilistic
  • Set theoretic: Fuzzy sets, Extended Boolean
  • Algebraic: Generalized vector, Latent semantic indexing, Neural networks
  • Probabilistic: Inference networks, Belief networks
  • Structured models: Non-overlapping lists, Proximal Nodes
  • Browsing models: Flat, Structure guided, Hypertext

Information Retrieval Model

  • An information retrieval model is a quadruple [D, Q, F, R(qi, dj)] where

– D is a set composed of logical views for the documents in the collection
– Q is a set composed of logical views for the user information needs (queries)
– F is a framework for modeling document representations, queries, and their relationships
– R(qi, dj) is a ranking function which associates a real number with a query qi ∈ Q and a document representation dj ∈ D. Such a ranking defines an ordering among the documents with regard to the query qi.
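The quadruple can be sketched in code. A minimal sketch, assuming set-of-terms logical views and a term-overlap ranker standing in for any concrete model (all names are illustrative, not from the slides):

```python
# Hypothetical sketch of the [D, Q, F, R] quadruple. The framework F is
# implicit in how documents/queries are represented and how rank() scores them.
from typing import Callable, List, Set

DocView = Set[str]    # logical view of a document: here, a set of index terms
QueryView = Set[str]  # logical view of an information need (query)

def make_ranker() -> Callable[[QueryView, DocView], float]:
    # R(qi, dj): associates a real number with a query/document pair.
    # Term overlap here; Boolean, Vector and Probabilistic models
    # differ exactly in how this function is defined.
    def rank(q: QueryView, d: DocView) -> float:
        return float(len(q & d))
    return rank

D: List[DocView] = [{"mars", "life"}, {"mars", "rover"}, {"music"}]
Q: QueryView = {"mars", "life"}
R = make_ranker()
scores = [R(Q, d) for d in D]  # defines an ordering of D with regard to Q
```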


The retrieval cycle

  • Query Transformation:

– Normalization
– Query Expansion
– Phrasing / Anti-Phrasing

  • Result Presentation:

– Ranking
– Clustering
– Classification

About Document representations

  • Document meta-information

– (author, title, date, URI, …)

  • Index term selection ?

– Automated indexing: bag of words
– User-selected words: key-words
– Controlled vocabularies

  • Document structure
  • Document type

Index term selection

[Figure: document-analysis pipeline leading to index term selection — language detection, encoding, transliteration, word analysis (phrasing, stemming), document meta-data extraction, structure recognition, document type recognition, document categorization.]


Bag-of-words approach

  • A document is an unordered list of words/tokens

– Grammatical information is lost

  • Tokenization: What is a word?

– Is ”White House” one word or two?

  • Case folding

– ”President Bush” becomes ”president”, ”bush”

  • Stemming or lemmatization

– Morphological information is thrown away: ”agreements” becomes ”agreement” (lemmatization) or even ”agree” (stemming)


Some repetition

  • IR = retrieval of documents that seem to be similar to the user’s information need

  • Information retrieval models

– Users → query
– Documents → document representation
– Similarity function → sim(q, di)

  • Document representations (logical views of documents)

– Index term selection


Example ”bag of words”

Scientists have found compelling new evidence of possible ancient microscopic life on Mars, derived from magnetic crystals in a meteorite that fell to Earth from the red planet, NASA announced on Monday.

a, ancient, announced, compelling, crystals, derived, earth, evidence, fell, found, from (2x), have, in, magnetic, mars, meteorite, microscopic, monday, nasa, new, of, on (2x), planet, possible, red, scientists, that, the, to
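The bag above can be reproduced with a few lines of Python. A minimal sketch: lowercase (case folding), keep only letter runs (a crude tokenizer), and count; note that, unlike the slide’s listing, this also keeps the word “life”:

```python
# Minimal bag-of-words sketch: tokenization + case folding + counting.
# Word order and grammatical information are lost, as the slide says.
import re
from collections import Counter

text = ("Scientists have found compelling new evidence of possible ancient "
        "microscopic life on Mars, derived from magnetic crystals in a meteorite "
        "that fell to Earth from the red planet, NASA announced on Monday.")

tokens = re.findall(r"[a-z]+", text.lower())  # crude tokenizer, assumed here
bag = Counter(tokens)                          # unordered multiset of tokens
```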


What is this about?

allmennvitenskapelige, at (2x), av, bredt, datateknikk (2x), de (2x), doktorgradsstudier, dr.ing., dr.scient., emner, en, et, etter-, fagtilbud, fleste, grunn-, har, hoveddel, hovedfagsstudier, i (3x), Instituttet (2x), informasjonsvitenskap, informatikk, innen, innenfor, kurs, leverer (2x), mellom-, NTNU, NTNUs (2x), og (5x), også, områder, samt (2x), selvsagt, sivilingeniørstudium, Som, studiene, til, tilbyr (3x), undervisning, undervisningen, universitetsinstitutt, ved (2x), vi (2x), videre, videreutdanningstilbud


What is this about?

Instituttet har et bredt fagtilbud og tilbyr undervisning i emner innenfor de fleste områder innen datateknikk og informasjonsvitenskap. Instituttet leverer en hoveddel av undervisningen ved NTNUs sivilingeniørstudium i datateknikk, samt at vi tilbyr grunn-, mellom- og hovedfagsstudier i informatikk ved de allmennvitenskapelige studiene. Som universitetsinstitutt tilbyr vi selvsagt også doktorgradsstudier (dr.ing. og dr.scient.), samt at vi leverer kurs til NTNUs etter- og videreutdanningstilbud - NTNU videre.

(English: The department has a broad course portfolio and offers teaching in subjects within most areas of computer science and information science. It delivers a major part of the teaching in NTNU’s sivilingeniør programme in computer science, and also offers undergraduate, intermediate and graduate studies in informatics in the general science programmes. As a university department it naturally also offers doctoral studies (dr.ing. and dr.scient.), and delivers courses for NTNU’s continuing education programme, NTNU videre.)


“The language problem”

[Figure: a query Q is matched against document representations (Drep), not the documents D themselves — the ”language problem” of bridging the two vocabularies.]


Similarity in terms of clustering

  • Document collection C
  • Query Q and documents Dn
  • Relevant documents A (D2 and D8)
  • The task: given the query Q, a (vague) description of A, which documents are in the set A?

– 1. Find the features that best describe the objects in A (intra-cluster similarity)
– 2. Find the features that best distinguish the objects in A from the rest of the objects in C (inter-cluster dissimilarity)


Index terms and weights

  • Index terms have varying relevance when used to

describe document contents

  • Weights:

– t is the number of index terms
– K = {k1, ..., kt} is the set of all index terms
– a weight wij > 0 is associated with each index term ki of document dj
– document vector: dj = (w1j, w2j, ..., wtj)
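A small sketch of this representation, assuming binary weights and a hypothetical four-term vocabulary (tf or tf*idf weights slot into the same vector shape):

```python
# From the index-term set K to a document vector dj = (w1j, ..., wtj).
K = ["bush", "george", "president", "the"]   # all index terms, t = 4

def doc_vector(terms: set) -> list:
    # wij > 0 iff index term ki occurs in document j (binary weighting here)
    return [1.0 if k in terms else 0.0 for k in K]

d1 = doc_vector({"george", "bush"})          # -> (1, 1, 0, 0)
```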


Taxonomy of retrieval models

  • Retrieval: Ad Hoc, Filtering
  • Classic models: Boolean, Vector, Probabilistic
  • Set theoretic: Fuzzy sets, Extended Boolean
  • Algebraic: Generalized vector, Latent semantic indexing, Neural networks
  • Probabilistic: Inference networks, Belief networks
  • Structured models: Non-overlapping lists, Proximal Nodes
  • Browsing models: Flat, Structure guided, Hypertext

Boolean retrieval model

  • Simple formalism
  • Document term weights are binary {0,1}

  • Queries are conventional boolean expressions
  • All documents are either relevant or non-relevant

– sim(q, dj) ∈ {0, 1}


Boolean retrieval

  • Boolean operators: AND, OR, NOT (NEAR)
  • The semantics of boolean operators:

– t1 AND t2 = {d | t1 ∈ r(d)} ∩ {d | t2 ∈ r(d)}: documents whose representation contains both t1 and t2
– t1 OR t2 = {d | t1 ∈ r(d)} ∪ {d | t2 ∈ r(d)}: documents whose representation contains t1 or t2
– NOT t1 = {d | t1 ∉ r(d)}: documents whose representation does not contain t1
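This semantics maps directly onto set operations. A minimal sketch over a toy in-memory inverted index (the documents and helper names are hypothetical):

```python
# Boolean retrieval over an inverted index: each term maps to the set of
# documents whose representation contains it; AND/OR/NOT are set operations.
docs = {
    1: {"george", "bush", "president"},
    2: {"george", "bush"},
    3: {"bush", "plant"},
    4: {"president", "lincoln"},
}
all_docs = set(docs)
index = {t: {d for d, ts in docs.items() if t in ts}
         for ts in docs.values() for t in ts}

def AND(a, b): return a & b        # intersection
def OR(a, b):  return a | b        # union
def NOT(a):    return all_docs - a # complement w.r.t. the collection

# bush AND (george OR president)
result = AND(index["bush"], OR(index["george"], index["president"]))
```

Note that the result is an unordered set: every document either matches or does not, with no ranking.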


Boolean retrieval

  • Information need: President George Bush

  • Boolean query: bush AND (george OR president)

[Figure: Venn diagram of the document sets containing ”george”, ”bush”, and ”president”.]


Pros and cons of boolean retrieval

  • Pros of Boolean retrieval

– Clean and simple formalism
– Firm grip on query formulation

  • Cons of Boolean retrieval

– Most non-experts cannot handle boolean expressions
– No ranking of retrieved documents
– Exact matching may lead to too few or too many retrieved documents


Vector space retrieval

  • Most common modern retrieval system

  • Features:

– Users can enter free text
– Documents are ranked
– Relaxation of the matching criterion

  • Key idea: Everything (documents, queries, terms) is

a vector in a high-dimensional space

  • Example systems: SMART (Salton, 1960s), FAST (www.alltheweb.com)


Not the same as Boolean…

  • Documents ranked according to relevance
  • 11 documents in the first result set did not contain “gothic”
  • The contribution of “gothic” to the document ranking was small


vector space representations

  • Documents are vectors of terms

  • Terms are vectors of documents
  • Queries are vectors of terms

       t1 t2 t3 t4 ...
  d1    1  0  0  1 ...
  d2    0  1  0  1 ...
  d3    0  0  1  1 ...
  d4    1  1  1  0 ...


Vector space

[Figure: two-dimensional vector space with axes ”Norge” and ”FJELL”; documents D1–D4 and the query Q plotted as vectors, e.g. (1,3) and (4,1).]

  • Distance?
  • Angle?


Similarity in the vector space

  • Given a query vector q and a document vector d,

both of length n both of length n.

  • Similarity between q and d is defined as:

– The inner product of q and d: q  d where The inner product of q and d: q  d, where

) * (

1 i n i i

d q d q

 

– and qi (di) is the value of the i-th position of q (d)

1 i 


Vector similarity: example

  • Given a query over terms t1..t5:

  q = (1, 1, 0, 0, 1)

  • and a document collection:

  d1 = (1, 0, 0, 0, 1)
  d2 = (0, 1, 1, 1, 1)
  d3 = (0, 0, 1, 1, 0)
  d4 = (1, 1, 1, 0, 1)

  • The similarities are: q · d1 = 2, q · d2 = 2, q · d3 = 0, q · d4 = 3
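The example can be checked in a few lines; a minimal sketch of the inner-product similarity over the vectors from the slide:

```python
# Inner-product similarity q . d = sum(q_i * d_i) for the slide's example.
q  = [1, 1, 0, 0, 1]
ds = {"d1": [1, 0, 0, 0, 1],
      "d2": [0, 1, 1, 1, 1],
      "d3": [0, 0, 1, 1, 0],
      "d4": [1, 1, 1, 0, 1]}

def inner(q, d):
    return sum(qi * di for qi, di in zip(q, d))

sims = {name: inner(q, d) for name, d in ds.items()}
# sims == {"d1": 2, "d2": 2, "d3": 0, "d4": 3}
```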


Document length

  • Only measuring the inner product has some disadvantages:

– Longer documents are more likely to be relevant, as they are more likely to contain matching terms
– If two documents have the same score, we would like to prefer the shorter one, as it is more focused on the information need

  • Conclusion: the length of a document has to be integrated in computing the similarity score


Document length normalization

  • The length dw of a document is the sum of its term weights:

  dw = \sum_{i=1}^{n} w_i

  • Normalize the document vector:

  d_{norm} = d / dw = (d_1/dw, \, d_2/dw, \, \ldots, \, d_n/dw)

  • The normalized inner product is then defined as:

  sim(q, d) = q \cdot d_{norm} = (q \cdot d) / dw


Example: normalized similarities

  • Given a query q = (1, 1, 0, 0, 1) over terms t1..t5
  • and a document collection:

  d1 = (1, 0, 0, 0, 1)
  d2 = (0, 1, 1, 1, 1)
  d3 = (0, 0, 1, 1, 0)
  d4 = (1, 1, 1, 0, 1)

  • The similarities are:

        q · d    q · (d/dw)
  d1    2        2/2 = 1
  d2    2        2/4 = 0.5
  d3    0        0/2 = 0
  d4    3        3/4 = 0.75
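A minimal sketch of the normalized inner product for the same vectors, dividing q · d by the document length dw (sum of the document's term weights):

```python
# Normalized inner product: (q . d) / dw, where dw = sum of d's weights.
q  = [1, 1, 0, 0, 1]
ds = {"d1": [1, 0, 0, 0, 1],
      "d2": [0, 1, 1, 1, 1],
      "d3": [0, 0, 1, 1, 0],
      "d4": [1, 1, 1, 0, 1]}

def norm_sim(q, d):
    dw = sum(d)                                      # document length
    return sum(qi * di for qi, di in zip(q, d)) / dw

sims = {name: norm_sim(q, d) for name, d in ds.items()}
# sims == {"d1": 1.0, "d2": 0.5, "d3": 0.0, "d4": 0.75}
```

The ordering changes compared to the raw inner product: d1 now outranks d4, because it is shorter and entirely on-topic.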


normalization: so far so good

  • One of the problems is solved:

– Shorter and more focused documents receive a higher normalized score than longer documents with the same matching terms

  • On the other hand, we got a new problem:

– Now shorter documents are generally preferred over longer ones


Cosine similarity

  • Measuring the angle θ between vectors:

– angle 90° → cos(θ) = 0: no similarity
– angle 0° → cos(θ) = 1: maximum similarity (“equal”)


vector normalization

  • The vector <1,1> is longer than the vector <1,0>
  • How do we normalize documents in the vector space?

– In an n-dimensional space, the length of a vector v is defined as:

  |v| = \sqrt{\sum_{i=1}^{n} v_i^2}

  • Given a vector v, the vector |v|^{-1} v is normalized, i.e. of length 1 (a unit vector)


Cosine similarity

  • Let a query vector q and a document vector d, both of length n, be given.
  • The cosine similarity is defined as:

  sim(q, d) = \frac{q \cdot d}{|q| \, |d|} = \frac{\sum_{i=1}^{n} q_i \, d_i}{\sqrt{\sum_{i=1}^{n} q_i^2} \; \sqrt{\sum_{i=1}^{n} d_i^2}}


Term weights

  • Up to now, we only considered binary term weights:

– 1: term occurs in document
– 0: term does not occur in document

  • Two shortcomings:

– Does not reflect the frequency of terms
– All terms are equally important (e.g. ‘president’ vs. ‘the’)

  • Improvements:

– Store the frequency in the vector (e.g. 4) instead of 1 → tf-score
– Meaningful terms occur only in a few documents; they are good discriminators that distinguish those documents from the rest → idf-score

tf*idf-score

  • tf-score: tf_{i,j} = frequency of term i in document j

– normalised: f_{i,j} / \max_l(f_{l,j})

  • idf-score: idf_i = \log(N / n_i), where

– N is the size of the collection (number of documents)
– n_i is the number of documents in which term i occurs
– the logarithm is used for dampening

  • The term weight of term i in document j is then computed as:

  w_{i,j} = tf_{i,j} \cdot idf_i
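A minimal sketch of these definitions over a tiny toy collection (the three documents are illustrative):

```python
# tf*idf sketch: max-normalized tf times log-dampened idf.
import math

def tf_idf_weights(docs):
    # docs: list of token lists; returns w[j][term] = tf_{i,j} * idf_i
    N = len(docs)
    n = {}                                   # n_i: documents containing term i
    for d in docs:
        for t in set(d):
            n[t] = n.get(t, 0) + 1
    weights = []
    for d in docs:
        f = {t: d.count(t) for t in set(d)}  # raw frequencies f_{i,j}
        max_f = max(f.values())              # max_l f_{l,j}
        weights.append({t: (f[t] / max_f) * math.log(N / n[t]) for t in f})
    return weights

w = tf_idf_weights([["mars", "mars", "life"], ["mars", "rover"], ["music"]])
# "mars" occurs in 2 of 3 docs, so its idf = log(3/2); in doc 0 its tf = 1.0
```

A term occurring in every document gets idf = log(N/N) = 0, i.e. it is useless as a discriminator, which is exactly the ‘president’ vs. ‘the’ point above.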

Weighting the query?

  • Queries are shorter than documents…

  w_{i,q} = \left( 0.5 + 0.5 \, \frac{freq_{i,q}}{\max_l freq_{l,q}} \right) \cdot \log(N / n_i)

where:
– freq_{i,q} is the frequency of term i in q
– \max_l freq_{l,q} is the maximum term frequency in q

  • (Salton, Buckley 1988)
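A minimal sketch of this query weighting; the 0.5 pivot keeps every query term's tf component in [0.5, 1], since queries are too short for raw frequencies to be informative:

```python
# Salton/Buckley (1988) query term weighting.
import math

def query_weight(freq_iq, max_freq_q, N, n_i):
    # (0.5 + 0.5 * freq_{i,q} / max_l freq_{l,q}) * log(N / n_i)
    return (0.5 + 0.5 * freq_iq / max_freq_q) * math.log(N / n_i)

# A term appearing once in a one-term query, in 100 of 10,000 documents:
w = query_weight(1, 1, 10_000, 100)  # tf component = 1.0, times log(100)
```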

Query frequencies count…


Pros and cons of the Vector model

  • Term weighting

  • Ranking

– Allows partial matching
– Good (at least difficult to beat) ranking scheme

  • Fast and simple implementation

Probabilistic model

  • Basic idea:

– Calculate the probability that a document d is relevant to query q
– Use all possible “evidence”
– Binary index term weights

  • Basic task:

– Describe the properties of the set of relevant documents, i.e. “collect the evidence”

  • Ranking:

– All documents are ranked by their probability of relevance
– Two assumptions:

  • Relevance is a binary property: a document is either relevant or non-relevant
  • The usefulness of a document does not depend on other documents

Probabilistic similarity calculations

  • q - the query: a subset of index terms
  • R - the set of relevant documents
  • \bar{R} - the set of non-relevant documents

  sim(d, q) = \frac{P(R \mid d)}{P(\bar{R} \mid d)}


Bayes theorem

  • Swap the order of dependence in order to facilitate calculation:

  P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}


Computational transformations and simplifications

  • Applying Bayes’ theorem:

  sim(d, q) = \frac{P(R \mid d)}{P(\bar{R} \mid d)} = \frac{P(R) \, P(d \mid R)}{P(\bar{R}) \, P(d \mid \bar{R})} \sim \frac{P(d \mid R)}{P(d \mid \bar{R})}

  • Where:

– P(R) = the a priori probability of selecting a relevant document
– P(\bar{R}) = the a priori probability of selecting a non-relevant document
– P(d \mid R) = probability of randomly selecting document d from the set of relevant documents

  • And: assume the a priori probabilities are the same for all documents


Simplifications (2)

  • Assume independence of index terms:

  sim(d_j, q) \sim \frac{\prod_{g_i(d_j)=1} P(k_i \mid R) \; \prod_{g_i(d_j)=0} P(\bar{k}_i \mid R)}{\prod_{g_i(d_j)=1} P(k_i \mid \bar{R}) \; \prod_{g_i(d_j)=0} P(\bar{k}_i \mid \bar{R})}

  • Where:

– P(k_i \mid R) = probability that a relevant document contains the index term k_i
– g_i(d) = v: element i in document d has weight v
– index term k_i is either present (g_i(d) = 1) or not present (g_i(d) = 0) in document d


Simplifications (3)

  • Assume constant a priori term probabilities for all documents (i.e. they do not change the ranking for a document wrt. a given query)
  • Logarithmic calculations
  • P(A) + P(\bar{A}) = 1, so P(\bar{k}_i \mid R) = 1 - P(k_i \mid R)

  sim(d_j, q) \sim \sum_{i=1}^{n} w_{i,q} \, w_{i,j} \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)

  • w_{i,q}: weight of term i in query q
  • w_{i,j}: weight of term i in document j
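A minimal sketch of this log-odds sum, assuming binary weights (w = 1 for query terms present in the document, 0 otherwise); the probability tables are made up for illustration:

```python
# Probabilistic ranking sketch: sum of per-term log-odds for matching terms.
import math

def prob_score(query_terms, doc_terms, p_rel, p_nonrel):
    # p_rel[t] = P(k_t | R), p_nonrel[t] = P(k_t | R-bar); assumed given.
    score = 0.0
    for t in query_terms:          # w_{i,q} = 1 for terms in the query
        if t in doc_terms:         # w_{i,j} = 1 if the document contains t
            score += (math.log(p_rel[t] / (1 - p_rel[t])) +
                      math.log((1 - p_nonrel[t]) / p_nonrel[t]))
    return score

p_rel = {"mars": 0.5, "life": 0.5}       # hypothetical estimates
p_nonrel = {"mars": 0.1, "life": 0.01}
s = prob_score({"mars", "life"}, {"mars", "life", "rock"}, p_rel, p_nonrel)
```

With P(k|R) = 0.5 the first log term vanishes, so a term's contribution is driven by how rare it is among non-relevant documents, exactly the a priori guess on the next slide.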

Retrieval cycle

  • Guess a priori probabilities
  • Retrieve the n most relevant documents

– Automatic: ranked above some threshold
– Manual: ask the user (→ relevance feedback)
– Collect properties of the relevant set

  • Given the retrieved relevant set of documents and their properties:

– Recalculate probabilities

  • Iterate

A priori guessing and recursive refinement

  • A priori guess:

  P(k_i \mid R) = 0.5
  P(k_i \mid \bar{R}) = n_i / N

– n_i: number of documents that contain term k_i
– N: total number of documents

  • Recursive refinement (based on the retrieved set):

  P(k_i \mid R) = V_i / V
  P(k_i \mid \bar{R}) = (n_i - V_i) / (N - V)

– V: subset of documents initially retrieved and ranked (or the number of documents in V)
– V_i: subset of V which contains term k_i (or the number of documents in V_i)

  • + zero-value and small-value adjustment (smoothing)
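A minimal sketch of the two estimation stages, with made-up collection counts and no smoothing (so V_i must stay strictly between 0 and V for the log-odds to remain finite):

```python
# Two-stage probability estimation for the probabilistic model.
def initial_estimates(term_doc_counts, N):
    # A priori guess: P(k_i|R) = 0.5, P(k_i|R-bar) = n_i / N
    return ({t: 0.5 for t in term_doc_counts},
            {t: n_i / N for t, n_i in term_doc_counts.items()})

def refined_estimates(term_doc_counts, N, V, V_i):
    # Refinement: P(k_i|R) = V_i / V, P(k_i|R-bar) = (n_i - V_i) / (N - V)
    return ({t: V_i[t] / V for t in term_doc_counts},
            {t: (n_i - V_i[t]) / (N - V)
             for t, n_i in term_doc_counts.items()})

counts = {"mars": 100, "life": 10}       # hypothetical n_i over N = 1000 docs
p_rel, p_nonrel = initial_estimates(counts, 1000)
p_rel2, p_nonrel2 = refined_estimates(counts, 1000, V=20,
                                      V_i={"mars": 15, "life": 5})
```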

Pros and cons of the Probabilistic model

  • Theoretically “sound”

– Probabilistic calculations
– Uses independent “evidence” properties
– Allows for iterative retrieval

  • Must make an initial guess
  • Independence assumptions
  • Generally costly implementations (vs. Vector models)

Brief (!) evaluation of classic IR models

                        Boolean                Vector                 Probabilistic
  Query formulation     Logical expressions    Term lists             Term lists; iterative
                                               (Boolean relations     (Boolean relations lost)
                                               lost)
  Similarity function   Crisp (0,1)            Arithmetic             Probability estimation
                                                                      (term independence assumption)
  Result presentation   No ranking             Ranking                Ranking (relevance feedback)
  Implementation        Small and easy         Large but easy         ? (research prototypes)
                        (library systems)      (web engines)


Other models …

  • Retrieval: Ad Hoc, Filtering
  • Classic models: Boolean, Vector, Probabilistic
  • Set theoretic: Fuzzy sets, Extended Boolean
  • Algebraic: Generalized vector, Latent semantic indexing, Neural networks
  • Probabilistic: Inference networks, Belief networks
  • Structured models: Non-overlapping lists, Proximal Nodes
  • Browsing models: Flat, Structure guided, Hypertext

Summary

  • Components of an IR model

– User vs. query
– Document vs. document representation
– Ranking

  • The Boolean retrieval model
  • The Vector model
  • The Probabilistic model

Additional material

  • IR Tutorial

– Christof Monz & Maarten de Rijke

  • http://staff.science.uva.nl/~mdr/Teaching/IR/ESSLLI01/
  • Books on the web

– Van Rijsbergen: “Information Retrieval”

  • http://www.dcs.gla.ac.uk/Keith/Preface.html

– Anselm Spoerri: InfoCrystal - a visual tool for IR

  • http://www.scils.rutgers.edu/~aspoerri/InfoCrystal/InfoCrystal.htm

– Manning, Raghavan, Schütze: Introduction to Information Retrieval

  • http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html