Fundamentals in Information Retrieval, Jean-Cédric Chappelier & Emmanuel Eckard (PowerPoint PPT presentation)


SLIDE 1

Introduction Toolchain Evaluation Beyond the vector model Conclusion

© EPFL 2008–2014

Jean-Cédric Chappelier & Emmanuel Eckard

Fundamentals in Information Retrieval

Jean-Cédric CHAPPELIER Emmanuel ECKARD

LIA

Computational Linguistics Course (EPFL-MsCS) – Information Retrieval – 1 / 74

SLIDE 2


Information Retrieval

Definition

Selection of documents relevant to a query in an unstructured collection of documents.
◮ unstructured: not produced with IR in mind; not a database
◮ document: here, natural language text (but could also be video, audio or images)
◮ query: utterance in natural language (possibly augmented with commands, see later)
◮ relevant:

  • 1. user-wise: answering the IR requirements
  • 2. mathematically: maximising a defined “proximity measure”

SLIDE 3


Example of Information retrieval: issuing a query on an unstructured collection

◮ query: “Alan Turing”
◮ search among an unstructured collection (Wikipedia articles)

SLIDE 4


Example of Information retrieval: results returned by the system

◮ list of results with a percentage match
◮ highest matches first

SLIDE 5


Ambiguity

Sometimes unintended results occur.

Example

query: “Chicago school”. What is wanted?
◮ schools in Chicago (IL)?
◮ a body of works in sociology?
◮ an architectural style?
◮ where to learn how to play Chicago (the game):
  ◮ bridge?
  ◮ or poker?

SLIDE 6


Relevance? Content versus topic

“Relevant” documents: What does “relevant” mean?
◮ useful?
◮ new?
◮ topically related?
◮ content related?
  ◮ at word level?
  ◮ at semantic/pragmatic level?

[Diagram: surface form (raw text) vs. semantic representation, distinguishing topics from semantic content]

SLIDE 7


Relevance? Content versus topic

Semantic content: what the document talks about (topic) vs what it says (content).

Example

Document 1: Note how misty the river banks are.
Document 2: She got misty by the river of bank notes falling on the table.
Document 3: Money had never interested her.

  • Doc. 1 & 2 have similar word content but are not topically related.
  • Doc. 2 & 3 have similar topics but opposite semantic content.

SLIDE 8


How is IR done?

Tasks

◮ have the computer represent documents (at the adequate level): preprocessing, indexing, ...
◮ represent the query, not necessarily the same way as documents (short queries, operators, ...)
◮ define satisfying relevance measures between representations

Similarities with other NLP tasks

◮ Classification (no query)
◮ Data mining (formatted data)
◮ Information extraction (retrieves short parts of documents)

SLIDE 9


IR Before computers

◮ Colophons on clay tablets of Mesopotamia (3500 BCE)
◮ Tags on scrolls of the Edfu temple (from 237 BCE)
◮ Middle Ages: indexes of key terms of the Bible
◮ Indexes for important texts: the Bible, Shakespeare’s works, ...

Index of Thiers’ Histoire de la Révolution française, 1854

SLIDE 10


Simple example: Boolean model

Boolean model

◮ Documents are sets of terms (presence/absence)
◮ Queries are Boolean expressions on terms

Steps

◮ V: a finite vocabulary of indexing terms
◮ R: representation space
◮ RD : V∗ → R: representation function
◮ matching between query and documents

Example

◮ feeling; ease; pain; feet; pain; ship
◮ {0;1}^|V|
◮ presence/absence
◮ Boolean operators

SLIDE 11


Simple example: Boolean model

[Illustration: documents and the query represented as binary term-presence vectors, e.g. 010...0, 100...1, 000...1]

SLIDE 12


Example: Boolean representation of documents

Example

Document 1: Come on, now, I hear you’re feeling down. Well I can ease your pain, get you on your feet again.
Document 2: There is no pain, you are receding. A distant ship, smoke on the horizon.
→ Doc1: feeling; ease; pain; feet
→ Doc2: pain; ship; smoke; horizon

SLIDE 13


Example: Boolean representation of queries; retrieval

Example

Query: pain AND feeling
Doc1: feeling; ease; pain; feet
Doc2: pain; ship; smoke; horizon

Results

◮ Doc1 matches
◮ Doc2 does not match
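This matching can be sketched in a few lines; the document sets and helper names below are illustrative, not part of the course material:

```python
# Minimal sketch of the Boolean model: a document is a set of indexing
# terms; an AND query matches a document iff all query terms are present.
doc1 = {"feeling", "ease", "pain", "feet"}
doc2 = {"pain", "ship", "smoke", "horizon"}

def matches_and(query_terms, doc):
    """True iff every query term occurs in the document (Boolean AND)."""
    return all(t in doc for t in query_terms)

def matches_or(query_terms, doc):
    """True iff at least one query term occurs in the document (Boolean OR)."""
    return any(t in doc for t in query_terms)

query = ["pain", "feeling"]
print(matches_and(query, doc1))  # Doc1 matches
print(matches_and(query, doc2))  # Doc2 does not match
```

Note that the answer is strictly binary: a document either satisfies the expression or it does not, which is exactly why the model cannot rank Doc2 as a second-best choice.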

SLIDE 14


Limitations of the Boolean model

Example

Query: pain AND feeling
Doc1: feeling; ease; pain; feet
Doc2: pain; ship; smoke; horizon
→ Doc1 matches; Doc2 does not.

Limitations

◮ We might want to return Doc2 as a second-best choice. The Boolean model does not allow this.
◮ What happens with “pain OR feeling”?
☞ does not match common layman wisdom

SLIDE 15


Indexing and representation of documents

Definition

Representation: translating a document (words) into computable data (numbers).
Indexing: selecting relevant elements (features) to support the representation.

Themes related to indexing:

◮ Tokenisation
◮ Stop words
◮ Zipf and Luhn
◮ Stemming and lemmatisation
◮ Bag-of-words model

SLIDE 16


Tokenisation

Definition

Tokenisation: splitting the text into words (a prerequisite to choosing indexing terms)

Example

◮ easy: whitespace: “Now is the winter of our discontent / Made glorious summer by this son of York”
◮ less easy: a space is not always indicative of a term segmentation (compounds): “Distributional Semantics”, “Information Retrieval and Latent Semantics Indexing performance comparison”
◮ agglutinative languages are a problem: Rinderkennzeichnungs- und Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
◮ technical terms
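The “easy” whitespace case can be contrasted with a small regex tokeniser; this is a minimal sketch (the pattern is an assumption for illustration; as the next slides note, serious tokenisation needs NLP techniques):

```python
import re

def tokenise(text):
    """Lowercase the text and extract word tokens; hyphens inside
    words (e.g. 'rendez-vous') are kept as part of one token."""
    return re.findall(r"[a-zäöüß]+(?:-[a-zäöüß]+)*", text.lower())

# Whitespace splitting works on clean text...
print(tokenise("Now is the winter of our discontent"))
# ...but naive splitting keeps punctuation attached, which the regex avoids:
print("rendez-vous, again.".split())   # ['rendez-vous,', 'again.']
print(tokenise("rendez-vous, again.")) # ['rendez-vous', 'again']
```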

SLIDE 17


Tokenisation of technical terms

e.g. in Chemistry

Methionyl-glutaminyl-arginyl-tyrosyl-glutamyl-seryl-leucyl-phenyl-alanyl- alanyl-glutaminyl-leucyl-lysyl-glutamyl-arginyl-lysy-glutamyl-gycyl-alanyl- phenyl-alanyl-valyl-prolyl-phenyl-alanyl-valyl-threonyl-leucyl-glycyl- aspartyl-prolyl-glycyl-isoleucyl-glutamyl-glutaminyl-seryl-leucyl-lysyl- isoleucyl-aspartyl-threonyl-leucyl-isoleucyl-glutamyl-alanyl-glycyl-alanyl- aspartyl-alanyl-leucyl-glutamyl-leucyl-glycyl-isoleucyl-prolyl-phenyl-alanyl- seryl-aspartyl-prolyl-leucyl-alanyl-aspartyl-glycyl-prolyl-threonyl-isoleucyl- glutaminyl-asparaginyl-alanyl-threonyl-leucyl-arginyl-alanyl-phenyl-alanyl- alanyl-alanyl-glycyl-valyl-threonyl-prolyl-alanyl-glutaminyl-cysteinyl- phenyl-alanyl-glutamyl-methionyl-leucyl-alanyl-leucyl-isoleucyl-arginyl- glutaminyl-lysyl-histidyl-prolyl-threonyl-isoleucyl-prolyl-isoleucyl-glycyl- leucyl-leucyl-methionyl-tyrosyl-alanyl-asparaginyl-leucyl-valyl-phenyl-...

SLIDE 18


Word Entities

Definition

Semantic entity: compound word (group of words) bearing a semantic meaning

Example

◮ “Information retrieval”
◮ “rendez-vous”
◮ “radio antenna”
◮ “Singing Lily” (a type of pastry)
◮ “Dolphin striker” (a spar [part of a boat])

SLIDE 19


Conclusion on Tokenisation

Tokenisation is actually an NLP issue (use NLP techniques).

SLIDE 20


Choice of indexing terms

Filtering

Automated choice of indexing terms using filters:
◮ on morpho-syntactic categories (e.g. prepositions have no semantic content; nouns do)
◮ on stop words
◮ on frequencies

SLIDE 21


Stop words

Definition

Stop word: term explicitly excluded from indexing.

Example

stoplist: the; a; ’s; in; but; I; we; my; your; their; then
“Young men’s love then lies / Not truly in their hearts, but in their eyes.”
Document: Young men love lies truly hearts eyes
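Filtering against the slide’s stoplist is a one-liner; a sketch, with the line pre-tokenised by hand:

```python
# Stoplist from the slide (lowercased for case-insensitive matching).
STOPLIST = {"the", "a", "'s", "in", "but", "i", "we", "my", "your", "their", "then"}

def remove_stop_words(tokens):
    """Drop every token that appears in the stoplist."""
    return [t for t in tokens if t.lower() not in STOPLIST]

tokens = ["Young", "men", "'s", "love", "then", "lies", "Not", "truly",
          "in", "their", "hearts", "but", "in", "their", "eyes"]
print(remove_stop_words(tokens))
# ['Young', 'men', 'love', 'lies', 'Not', 'truly', 'hearts', 'eyes']
```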

SLIDE 22


Stop words

◮ Benefits:

  ◮ more informative indexes
  ◮ cheap way to remove classes of words without semantic content
  ◮ smaller indexes (tractability)

◮ Problems: “To be or not to be” → this sentence would be entirely stopped.

SLIDE 23


Choice of indexing terms: frequencies

Zipf and Luhn

If r is the rank of a term and n its number of occurrences (frequency) in the collection:
◮ Zipf (1949): n ∼ 1/r
◮ Luhn (1958): mid-rank terms are the best indicators of topics
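Rank-frequency tables of this kind are cheap to compute; a toy sketch with a counter (on a real collection the product r·n would stay roughly constant, per Zipf; the toy corpus below is far too small to show that):

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (term, rank, frequency) triples, most frequent first."""
    ranked = Counter(tokens).most_common()
    return [(term, rank, n) for rank, (term, n) in enumerate(ranked, start=1)]

tokens = "the cat and the dog and the bird".split()
table = rank_frequency(tokens)
print(table)  # 'the' has rank 1 (3 occurrences), 'and' rank 2 (2), ...
```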

SLIDE 24


Choice of indexing terms: frequencies

[Figure: word frequency vs. word rank (Luhn curve); words above the upper cut-off are too common, words below the lower cut-off are too rare; significant words lie between the two cut-offs]

SLIDE 25


Stemming and lemmatisation

Definition

Stem: morphological root of a word.
Stemming: process of reducing words to their stem.

Example

◮ prepaid, paid → paid
◮ interesting, uninteresting → interest
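A deliberately naive suffix-stripping sketch; the suffix list and minimum stem length are assumptions for illustration only (real stemmers such as Porter’s use far more careful rules, precisely to avoid over-stemming like “equal → eq” on the next slide):

```python
# Illustrative suffixes; NOT Porter's rules.
SUFFIXES = ("ing", "ed", "es", "s", "al", "ful")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, refusing stems shorter than
    min_stem_len characters."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= min_stem_len:
            return word[: -len(suf)]
    return word

print(naive_stem("interesting"))  # 'interest'
print(naive_stem("equal"))        # 'equ': the over-stemming pitfall
```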

SLIDE 26


Stemming and lemmatisation

Benefits

Reduces lexical variability ⇒ reduces index size and increases the information value of each indexing term.

Non-trivial process

factual → fact (OK)
equal → eq (wrong: “eq” is too short)

SLIDE 27


Desequentialisation: bag of words model

Assumption

Positions of the terms are ignored. Term distribution is indicative enough of the meaning.

Model

d1 = {(t1, n(d1,t1)); (t2, n(d1,t2)); ...}
d2 = {(t1, n(d2,t1)); (t2, n(d2,t2)); ...}
A document is a multiset of terms.

Example

Now so long, Marianne; it’s time that we began to laugh and cry and cry and laugh about it all again.
→ ([begin,1] [cry,2] [laugh,2] [long,1] [Marianne,1] [time,1])
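The multiset above maps directly onto a counter; a sketch, with tokenisation, stop-word removal and lemmatisation done by hand:

```python
from collections import Counter

# Bag-of-words sketch: after preprocessing, a document is a multiset of
# terms; positions are discarded, only per-term counts remain.
lemmas = ["long", "marianne", "time", "begin", "laugh", "cry", "cry", "laugh"]
bag = Counter(lemmas)
print(sorted(bag.items()))
# [('begin', 1), ('cry', 2), ('laugh', 2), ('long', 1), ('marianne', 1), ('time', 1)]
```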

SLIDE 28


Phrases, neighbourhoods: beyond the words

Position could be kept to allow
◮ literal search (quotations): "more things in heaven and earth"
◮ search by proximity: dreamt WITHIN 5 philosophy

SLIDE 29


Conclusions on indexing

◮ Bad indexing can ruin the performance of an otherwise sophisticated IR system
◮ Good indexing is anything but trivial

SLIDE 30


Vector Space model

Objective

Overcome the limitations of the Boolean model by representing documents with vectors describing term distributions.

Principle

◮ V: a finite vocabulary of indexing terms
◮ R: representation space
◮ RD : V∗ → R: representation function
◮ similarity: Mprox : R × R → R+
Note: choose a similarity measure that is well behaved for the representation (depends on the representation)
☞ more in the “Textual Data Analysis” lecture

SLIDE 31


Vocabulary of indexing terms

Example

◮ Now so long, Marianne, it’s time that we began to laugh and cry and cry and laugh about it all again.
◮ V, a finite vocabulary: aardvark, begin, cry, information, laugh, long, Marianne, retrieval, time, ...
→ Now so long Marianne it’s time that we began to laugh and cry and cry and laugh about it all again.

In practice

the vocabulary is several thousand terms large

SLIDE 32


Characterisation

Definition

characterisation: projection of the document into the representation space

Example

◮ Now so long, Marianne, it’s time that we began to laugh and cry and cry and laugh about it all again.
◮ R, representation space: R^|V|
→ ([aardvark,?] [begin,?] [cry,?] [information,?] [laugh,?] [long,?] [Marianne,?] [retrieval,?] [time,?])

SLIDE 33


Weightings

Term Frequency

tf(wi,dj) = number of occurrences of term wi in document dj
Sometimes 1 + log(tf(wi,dj)) is used in place of tf(wi,dj).

Term Frequency - Inverse Document Frequency

tf-idf(wi,dj) = tf(wi,dj) · idf(wi), with idf(wi) = log( |D| / nb(dk ⊃ wi) )
◮ |D|: number of documents
◮ nb(dk ⊃ wi): number of documents which contain term wi

SLIDE 34


Weighting

Example

◮ Now so long, Marianne, it’s time that we began to laugh and cry and cry and laugh about it all again.
◮ RD : V∗ → R, representation function; here: Term Frequency
→ ([aardvark,0] [begin,1] [cry,2] [information,0] [laugh,2] [long,1] [Marianne,1] [retrieval,0] [time,1])
→ (0 1 2 0 2 1 1 0 1 ...)

In practice

the vector is very sparse
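Both weightings can be sketched directly from the definitions (natural log here; the base of the logarithm is a free choice, and the toy two-document collection is illustrative):

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Raw term frequency of `term` in a tokenised document."""
    return Counter(doc_tokens)[term]

def idf(term, collection):
    """log(|D| / number of documents containing `term`); 0 if absent."""
    df = sum(1 for doc in collection if term in doc)
    return math.log(len(collection) / df) if df else 0.0

def tf_idf(term, doc_tokens, collection):
    return tf(term, doc_tokens) * idf(term, collection)

d1 = ["long", "marianne", "time", "begin", "laugh", "laugh", "cry", "cry"]
d2 = ["pain", "ship", "smoke", "horizon"]
collection = [d1, d2]
print(tf_idf("cry", d1, collection))   # 2 * log(2/1): frequent and rare across docs
```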

SLIDE 35


Vector space model

[Figure: three indexing terms t1, t2, t3 as axes; documents d1, d2, d3 as points in the space]
◮ indexing terms define the axes
◮ documents are points in the vector space (representing directions)

SLIDE 36


Proximity measure between documents

Cosine similarity

cos(d1,d2) = (d1/||d1||) · (d2/||d2||) = Σ_{j=1}^{N} d1j·d2j / ( √(Σ_j d1j²) · √(Σ_j d2j²) )

◮ bounded: 0 ≤ cos(d1,d2) ≤ 1 for all d1, d2 (with non-negative weights)
◮ it is a similarity: the greater, the more similar the documents (as opposed to a metric)
◮ independent of the length of the documents

SLIDE 37


Proximity measure between documents

Document 1

◮ Now so long, Marianne, it’s time that we began to laugh and cry and cry and laugh about it all again.
◮ ..., [long,1] [Marianne,1] [time,1] [begin,1] [laugh,2] [cry,2], ...
◮ d1 = (..., 1, 1, 1, 1, 2, 2, ...)

Document 2

◮ I haven’t seen Marianne laughing for some time, is she crying all day long?
◮ ..., [long,1] [Marianne,1] [time,1] [begin,0] [laugh,1] [cry,1], ...
◮ d2 = (..., 1, 1, 1, 0, 1, 1, ...)

Example

cos(d1,d2) = 7 / (√12 · √5) ≈ 0.904
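The worked example can be checked directly from the definition:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

d1 = [1, 1, 1, 1, 2, 2]   # long, Marianne, time, begin, laugh, cry
d2 = [1, 1, 1, 0, 1, 1]
print(round(cosine(d1, d2), 3))  # 0.904, as on the slide
```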

SLIDE 38


Summary

Choices depending on the application

◮ Weighting: translates semantic notions into computable models
◮ Proximity measure: fixes the topology of the representation space

Constants

◮ |V|-dimensional vector space
◮ very sparse vectors

SLIDE 39


Queries: definition

Definition

Queries (or “topics”) are “questions” asked to the system.
Typically keywords, possibly augmented with operators: dreamt WITHIN 5 philosophy
Supposed unknown at indexing time (a difference between IR and classification or clustering).
Visit http://www.google.com/trends for real-life examples.

SLIDE 40


Query representation

Example

◮ easy: as for documents: more things in heaven and earth
◮ less easy (verbatim sentence): "more things in heaven and earth"
◮ quite different from the document (positional information): dreamt WITHIN 5 philosophy

Conclusion:

Query representation is not necessarily trivial (not always the same as the representation of documents).

SLIDE 41


Problem of short queries

Web queries

On the web,
◮ the average query length is under three words
◮ very few users use operators
Language being ambiguous, three-word queries are difficult to satisfy.

Solutions

◮ query expansion: use knowledge about the query terms to associate them with other terms and improve the query
◮ query term reweighting: weight the terms of the query so as to obtain maximum retrieval performance
◮ relevance feedback: the user provides the system with an evaluation of the relevance of its answers

SLIDE 42


Evaluation campaigns

Evaluation set

  • 1. Document collection
  • 2. Query set
  • 3. Referential

Definition

Referential: list of the documents of a collection to be retrieved for one given query (handmade).

Examples of evaluation campaigns

◮ SMART (1970s)
◮ TREC (since the 1990s; large collections)
◮ AMARYLLIS (French)

SLIDE 43


Performances of IR systems

Reminder: given an IR system, a document collection, queries, a referential and an answer by the system:

Definition

Precision is the proportion of the documents retrieved by the system that are relevant (according to the referential)

Definition

Recall is the proportion of the relevant documents which were retrieved by the system.
◮ Precision can be cheated by returning no documents
◮ Recall can be cheated by returning all documents

SLIDE 44


Performances of IR systems

Given an IR system, a document collection and a referential, for a query q the result returned by the system is evaluated with:
◮ Precision: Pr(q) = |R(q) ∩ S(q)| / |S(q)|
◮ Recall: Rec(q) = |R(q) ∩ S(q)| / |R(q)|
[Venn diagram: collection C, relevant set R, retrieved set S, and their intersection]
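Both measures are simple set operations; a sketch (returning 0 precision for an empty answer is a convention chosen here, not something the slides fix):

```python
def precision(relevant, retrieved):
    """|R ∩ S| / |S|; defined as 0.0 when nothing is retrieved."""
    if not retrieved:
        return 0.0
    return len(relevant & retrieved) / len(retrieved)

def recall(relevant, retrieved):
    """|R ∩ S| / |R|."""
    return len(relevant & retrieved) / len(relevant)

R = {"d1", "d3", "d7"}          # referential for the query (made up)
S = {"d1", "d2", "d3", "d4"}    # documents returned by the system
print(precision(R, S))  # 2/4 = 0.5
print(recall(R, S))     # 2/3
```

The two “cheats” on the slide fall out immediately: retrieving everything drives recall to 1 at the cost of precision, and retrieving (almost) nothing does the reverse.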

SLIDE 45


Performance measures: R-Precision

Definition

Precision at n documents: Prn(q) = |R(q) ∩ Sn(q)| / |Sn(q)|, with Sn(q) = the n first retrieved documents

R-Precision

precision obtained after retrieving as many documents as there are relevant documents, averaged over the queries:
R-Precision = (1/N) · Σ_{i=1}^{N} Pr_{|R(qi)|}(qi)

SLIDE 46


Performance measures: Mean Average Precision

Average Precision

Average of the precisions at the ranks rk(d,q) at which the relevant documents are retrieved:
AvgP(q) = (1/|R(q)|) · Σ_{d∈R(q)} Pr_{rk(d,q)}(q)

Mean Average Precision

Mean over the queries of the Average Precisions:
MAP = (1/N) · Σ_i AvgP(qi)
MAP measures the tendency of the system to retrieve relevant documents first.
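Average Precision for a single query follows directly from the definition; a sketch (relevant documents that are never retrieved contribute a precision of 0, the usual convention; the ranked list is made up):

```python
def average_precision(ranked, relevant):
    """AvgP(q): mean of the precision values at the ranks of the
    relevant documents; unretrieved relevant documents contribute 0."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank   # precision at this rank
    return total / len(relevant)

ranked = ["d1", "d5", "d3", "d9"]   # system output, best match first
relevant = {"d1", "d3"}
print(average_precision(ranked, relevant))  # (1/1 + 2/3) / 2
```

MAP is then just the mean of this value over all queries of the evaluation set.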

SLIDE 47


Plotting average Precision and Recall

[Plot: average Precision vs. Recall curves for three systems: DSIR (alpha=0), Hybrid (alpha=0.5), VS (alpha=1)]

Aim of the game: push the curve towards the upper right corner

SLIDE 48


Probabilistic models

Idea

The best possible ranking returns documents sorted by their probability of being relevant given a query.

for instance: Sparck-Jones’ model

◮ Estimate the probability that a given document di is relevant (di ∈ R(q)) to a given query q: P(di ∈ R(q) | di, q)
◮ Invert the probability (here R is a Boolean variable, standing for di ∈ R(q)): P(di | R, q)
◮ Write P(di | R, q) as a function of the probabilities of occurrence of the terms (assuming that terms are conditionally independent): P(ti ∈ d | R, q)

SLIDE 49


Sparck-Jones’ model

Document d contains term ti (of the query):

w(ti,d) = log [ p(ti ∈ d | d ∈ R) / p(ti ∈ d | d ∉ R) ]

Document d does not contain term ti (of the query):

w(ti,d) = log [ p(ti ∉ d | d ∈ R) / p(ti ∉ d | d ∉ R) ] = log [ (1 − p(ti ∈ d | d ∈ R)) / (1 − p(ti ∈ d | d ∉ R)) ]

Combining the two:

w(ti,d) = log [ p(ti ∈ d | d ∈ R) / p(ti ∈ d | d ∉ R) ] − log [ p(ti ∉ d | d ∈ R) / p(ti ∉ d | d ∉ R) ]
        = log [ p(ti ∈ d | d ∈ R) · (1 − p(ti ∈ d | d ∉ R)) ] / [ p(ti ∈ d | d ∉ R) · (1 − p(ti ∈ d | d ∈ R)) ]

SLIDE 50


Okapi BM25

Idea

Refine Sparck-Jones’ model by including term frequencies:
w = log [ p(freq(t,d)=tf | d ∈ R) · p(t ∉ d | d ∉ R) ] / [ p(freq(t,d)=tf | d ∉ R) · p(t ∉ d | d ∈ R) ]

BM25 weight for term i

wi^BM25 = [ tfi · (k1 + 1) ] / [ k1 · ((1 − b) + b · dl/avdl) + tfi ] · idfi
with dl = document length, avdl = average document length.
BM25 is a very good model and is used as a reference for comparison with new models.
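The BM25 weight of a single term can be sketched directly from the formula; k1 = 1.2 and b = 0.75 are common defaults rather than values fixed by the slide, and the idf value below is a made-up illustration:

```python
import math

def bm25_weight(tf, idf, dl, avdl, k1=1.2, b=0.75):
    """BM25 weight of one term in one document.

    tf: term frequency in the document; idf: inverse document frequency;
    dl: document length; avdl: average document length in the collection."""
    return (tf * (k1 + 1)) / (k1 * ((1 - b) + b * dl / avdl) + tf) * idf

idf = math.log(1000 / 10)              # e.g. term in 10 of 1000 documents
print(bm25_weight(0, idf, 120, 100))   # absent term -> 0.0
print(bm25_weight(3, idf, 120, 100))   # grows with tf...
print(bm25_weight(30, idf, 120, 100))  # ...but saturates: far less than 10x
```

The tf factor saturates as tf grows (it tends to k1 + 1), which is the key difference from raw tf·idf weighting.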

SLIDE 51


Introduction to topic-based models

Problem

Information retrieval has problems notably with
◮ Polysemy
◮ Synonymy

SLIDE 52


Polysemy

Example

Query includes term Bank → Bank of England? Bank of fishes? Grand bank? Airplane bank?

Consequences

Negative impact on precision

SLIDE 53


Synonymy

Example

Query includes term freedom → liberty will not be seen as relevant

Consequences

Negative impact on recall

SLIDE 54


Topic-based models

Idea

Apply a transformation to the representation space so as to emphasise the most relevant features: index senses rather than mere words.

Note

Stemming is already a step in this direction (less dependent on mere words)

Reminder

Occurrence matrix: term × document matrix containing the weights {wij} associated with document di and term tj

SLIDE 55


Latent Semantic Indexing

Idea

Reduce the dimensionality of the original representation space: create a matrix close to the occurrence matrix but of smaller rank

[Figure: a term × document matrix over terms t1, t2, t3 and documents d1, d2, d3, shown before and after the rank reduction]


Latent Semantic Indexing

Reduction of dimensionality

◮ approximation of the occurrence matrix ◮ filtering of the occurrence matrix

Example

Singular Value Decomposition (SVD), keeping the k largest singular values (k typically between 100 and 300).
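A minimal sketch of this rank reduction with NumPy. The occurrence matrix values and k = 2 below are illustrative only (real collections use k in the hundreds):

```python
import numpy as np

# Toy occurrence matrix (documents x terms); values are illustrative only.
W = np.array([
    [2., 1., 0., 0.],
    [1., 1., 0., 0.],
    [0., 0., 3., 1.],
])

# Full SVD: W = U @ diag(s) @ Vt, singular values s in decreasing order.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Keep only the k largest singular values (k = 2 here; 100-300 in practice).
k = 2
W_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# W_k has rank <= k and is the best rank-k approximation of W
# in the Frobenius-norm (least-squares) sense.
assert np.linalg.matrix_rank(W_k) <= k
print(np.round(W_k, 2))
```

By the Eckart–Young theorem, the approximation error ‖W − W_k‖ equals the largest discarded singular value (here s[2]), which makes the trade-off between k and fidelity explicit.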


Latent Semantic Indexing

Illustration

[wij] (D × T) → [w′ij] (D × k)
e.g. the axes [flower], [car], [truck] become [flower] and √12·[car] + π·[truck] (≅ [vehicle]?)


Latent Semantic Indexing

Advantages

◮ More significant representation

Drawbacks

◮ Outperformed by other models ◮ Too expensive to compute on large collections (requires iterative methods) ◮ Unclear meaning of the resulting axes ◮ Query projection is problematic


Distributional Semantics Information Retrieval

Idea

There is a high degree of correlation between the observable distributional characteristics of a term and its meaning: "a word is characterized by the company it keeps"

(Z. Harris, 1954; J. R. Firth, 1957)

Example

◮ Some X, for instance, naturally attack rats. ◮ The X on the roof was exposing its back to the sunshine. ◮ He heard the mewing of X in the forest. ◮ X is a: . . .


X is a . . .

[Image: Bertil Videt, GFDL & CC-BY-2.5]


Co-occurrence profile

Definition

Co-occurrence profile: characterisation of a word by its co-occurrences with the indexing terms

Example

Document 1: "Now so long, Marianne, it's time that we began to laugh and cry and cry and laugh about it all again."
Document 2: "It seems so long ago, Nancy was alone, looking at the Late Late show through a semi-precious stone."
→ Co-occurrence profile of long = ([cry, 2] [begin, 1] [Marianne, 1] [Nancy, 1] [time, 1] [late, 2] [laugh, 2])
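The profile above can be reproduced with a short sketch. The lemmatised token lists below are an assumed preprocessing of the two lyric fragments (stop words removed, verbs lemmatised); the slides do not specify one:

```python
from collections import Counter

# Assumed lemmatised versions of the two documents from the slide.
docs = [
    ["long", "Marianne", "time", "begin", "laugh", "cry", "cry", "laugh"],
    ["long", "Nancy", "late", "late", "show", "stone"],
]

def cooccurrence_profile(word, docs, indexing_terms):
    """Count how often each indexing term appears in documents
    that also contain `word` (document-level co-occurrence)."""
    profile = Counter()
    for doc in docs:
        if word in doc:
            for t in doc:
                if t != word and t in indexing_terms:
                    profile[t] += 1
    return profile

terms = {"cry", "begin", "Marianne", "Nancy", "time", "late", "laugh"}
print(cooccurrence_profile("long", docs, terms))
# counts: cry=2, laugh=2, late=2, begin=1, Marianne=1, Nancy=1, time=1
```

Note that tokens outside the indexing vocabulary ("show", "stone") do not contribute to the profile.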


Co-occurrence matrix

Definition

Co-occurrence matrix: the words × terms matrix of the co-occurrence profiles, with fij the number of times that word wi and indexing term tj occur together.

DSIR Document representation

FD = Foccurrence · Fco-occurrence → weighting of the words in documents by their co-occurrences

Note

When indexing a collection C, the co-occurrence matrix would typically be estimated on a control collection representative of the language/domain (which could be C itself, but not necessarily)
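The DSIR representation FD = Foccurrence · Fco-occurrence is simply a matrix product mapping documents into the indexing-term space. A sketch with hypothetical matrices (all shapes and values below are made up for illustration):

```python
import numpy as np

# Hypothetical dimensions: D documents, W words, T indexing terms.
rng = np.random.default_rng(0)
D, W, T = 4, 6, 3

# F_occ[i, j]  = count of word w_j in document d_i.
F_occ = rng.integers(0, 3, size=(D, W)).astype(float)
# F_cooc[j, k] = co-occurrence count of word w_j with indexing term t_k
#                (estimated on a control collection).
F_cooc = rng.integers(0, 2, size=(W, T)).astype(float)

# DSIR document representation: each document becomes a vector in the
# indexing-term space, its words weighted by their co-occurrence profiles.
F_D = F_occ @ F_cooc
print(F_D.shape)  # (4, 3)
```

Queries are mapped the same way, so document-query similarity can then be computed in the reduced term space as in the vector-space model.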


Computing co-occurrences

The actor was wearing a grimacing mask of ancient theatre
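One common way to count such co-occurrences (an assumption here; the slides do not fix one scheme before the syntactic variants that follow) is a sliding window over the content words of the sentence:

```python
from collections import Counter

# Content words of the slide's example sentence (stop words removed,
# verbs lemmatised -- an assumed preprocessing).
tokens = ["actor", "wear", "grimacing", "mask", "ancient", "theatre"]

def window_cooccurrences(tokens, window=2):
    """Count unordered co-occurrences of tokens within a sliding window."""
    counts = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            counts[tuple(sorted((w, v)))] += 1
    return counts

print(window_cooccurrences(tokens))
# e.g. {('actor', 'wear'): 1, ('grimacing', 'mask'): 1, ...}
```

The next two slides refine this purely positional counting by using phrase heads and syntactic relations instead of a fixed window.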


Co-occurrences and syntactic features

Use heads of phrases

(The actor) (was wearing) (a grimacing mask) (of ancient theatre)
Heads (marked *): *actor, *wear, grimacing *mask, ancient *theatre
→ indexing entries: Actor, Wear, Mask, Theatre, Grimacing, Ancient


Co-occurrences augmented with part-of-speech information

Using syntactic rules and semantic roles

The actor was wearing a grimacing mask of ancient theatre
SUBJ(actor, wear), OBJ(wear, mask), ADJ(mask, grimacing), ADJ(theatre, ancient), CNOUN(mask, theatre)
→ indexing entries: Actor, Wear, Mask, Theatre, Grimacing, Ancient


DSIR results

[Figure: Precision vs. Recall curves comparing DSIR (α = 0), a hybrid model (α = 0.5), and the vector-space model VS (α = 1)]


Other more advanced Topic Models

LDA: Latent Dirichlet Allocation (Blei, Ng, Jordan 2003)

(not to be confused with linear discriminant analysis!)

☞ probabilistic model with hidden states ("topics")
References:
◮ D. Blei, "Probabilistic topic models", Communications of the ACM, 55(4):77–84, 2012.
◮ J.-C. Chappelier, "Topic-based Generative Models for Text Information Access", in Textual Information Access – Statistical Models, E. Gaussier and F. Yvon (eds), ch. 5, pp. 129–178, Wiley-ISTE, April 2012.


Summary / Keypoints

◮ Vector-space model;
◮ Indexing (and its important role);
◮ Weighting schemes, tf-idf;
◮ Evaluation: Precision and Recall.


References

[1] C. D. Manning, P. Raghavan and H. Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008.
[2] R. Baeza-Yates and B. Ribeiro-Neto, "Modern Information Retrieval", Addison Wesley, 1999.
[3] "Topics in Information Retrieval", ch. 15 in "Foundations of Statistical Natural Language Processing", C. D. Manning and H. Schütze, MIT Press, 1999.
