SLIDE 1

Introduction Boolean retrieval Text processing Ranked retrieval Term weighting Vector space model Evaluation

Introduction to Natural Language Processing

a course taught as B4M36NLP at Open Informatics by members of the Institute of Formal and Applied Linguistics

Today: Week 7, lecture
Today’s topic: Boolean and Vector Space Models for Information Retrieval
Today’s teacher: Pavel Pecina

E-mail: pecina@ufal.mff.cuni.cz
WWW: http://ufal.mff.cuni.cz/~pecina/

SLIDE 2

Contents

▶ Introduction
▶ Boolean retrieval
▶ Text processing
▶ Ranked retrieval
▶ Term weighting
▶ Vector space model
▶ Evaluation

SLIDE 3

Introduction

SLIDE 4

Definition of Information Retrieval

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

SLIDE 5

Boolean retrieval

SLIDE 6

Boolean retrieval

▶ The Boolean model is arguably the simplest model to base an information retrieval system on.
▶ Queries are Boolean expressions, e.g., Caesar and Brutus
▶ The search engine returns all documents that satisfy the Boolean expression.

SLIDE 7

Term-document incidence matrix

            Anthony and   Julius   The       Hamlet   Othello   Macbeth   …
            Cleopatra     Caesar   Tempest
Anthony     1             1        0         0        0         1
Brutus      1             1        0         1        0         0
Caesar      1             1        0         1        1         1
Calpurnia   0             1        0         0        0         0
Cleopatra   1             0        0         0        0         0
mercy       1             0        1         1        1         1
worser      1             0        1         1        1         0
…

Entry is 1 if term occurs. Example: Calpurnia occurs in Julius Caesar.
Entry is 0 if term doesn’t occur. Example: Calpurnia doesn’t occur in The Tempest.

SLIDE 8

Incidence vectors

▶ So we have a 0/1 vector for each term.
▶ To answer the query Brutus and Caesar and not Calpurnia:
  1. Take the vectors for Brutus, Caesar, and Calpurnia
  2. Complement the vector of Calpurnia
  3. Do a (bitwise) and on the three vectors:
     110100 and 110111 and 101111 = 100100
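The three steps above can be sketched with Python integers used as bit vectors. This is only an illustration under simple assumptions: the names `boolean_and_not` and `NUM_DOCS` are hypothetical, and the leftmost bit is taken to stand for the first play in the matrix.

```python
# Incidence vectors from the slide, one bit per play (leftmost bit = first play).
NUM_DOCS = 6
brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000


def boolean_and_not(a: int, b: int, c: int, num_docs: int) -> int:
    """Evaluate `a AND b AND NOT c` on bit vectors that are num_docs bits wide."""
    mask = (1 << num_docs) - 1        # keeps only the num_docs low-order bits
    not_c = ~c & mask                 # step 2: complement the vector of c
    return a & b & not_c              # step 3: bitwise AND of the three vectors


result = boolean_and_not(brutus, caesar, calpurnia, NUM_DOCS)
print(format(result, f"0{NUM_DOCS}b"))  # 100100
```

The mask is needed because Python integers are unbounded, so a plain `~c` would produce a negative number rather than a 6-bit complement.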

SLIDE 9

0/1 vector for Brutus

            Anthony and   Julius   The       Hamlet   Othello   Macbeth   …
            Cleopatra     Caesar   Tempest
Anthony     1             1        0         0        0         1
Brutus      1             1        0         1        0         0
Caesar      1             1        0         1        1         1
Calpurnia   0             1        0         0        0         0
Cleopatra   1             0        0         0        0         0
mercy       1             0        1         1        1         1
worser      1             0        1         1        1         0
…
result:     1             0        0         1        0         0

SLIDE 10

Answers to query

Anthony and Cleopatra, Act III, Scene ii:
Agrippa [Aside to Domitius Enobarbus]:
Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii:
Lord Polonius: I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.

SLIDE 11

Bigger collections

▶ Consider $N = 10^6$ documents, each with about 1000 tokens
  ⇒ total of $10^9$ tokens
▶ On average 6 bytes per token, including spaces and punctuation
  ⇒ size of document collection is about $6 \cdot 10^9$ bytes = 6 GB
▶ Assume there are M = 500,000 distinct terms in the collection
  ⇒ the term-document matrix has $M \times N = 500{,}000 \times 10^6$ = half a trillion 0s and 1s.
▶ But the matrix has no more than one billion 1s.
  ⇒ Matrix is extremely sparse.
▶ What is a better representation?
  ⇒ We only record the 1s.

SLIDE 12

Inverted Index

For each term t, we store a list of all documents that contain t.

Brutus    → 1 2 4 11 31 45 173 174
Caesar    → 1 2 4 5 6 16 57 132 …
Calpurnia → 2 31 54 101
…

(terms on the left: dictionary; lists on the right: postings)

SLIDE 13

Inverted index construction

1. Collect the documents to be indexed:
   Friends, Romans, countrymen. So let it be with Caesar …
2. Tokenize the text, turning each document into a list of tokens:
   Friends Romans countrymen So …
3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms:
   friend roman countryman so …
4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
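The four construction steps can be sketched as follows. This is a minimal illustration, not a production indexer: the function names are hypothetical, tokenization is a simple regular expression, and "linguistic preprocessing" is reduced to case folding (no lemmatization or stemming).

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Step 2: split text into tokens (runs of letters and apostrophes; a simplification)."""
    return re.findall(r"[A-Za-z’']+", text)


def normalize(token: str) -> str:
    """Step 3: stand-in for linguistic preprocessing; here only case folding."""
    return token.lower()


def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Step 4: map each term to the sorted list of IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    # The dictionary is kept in term order; each postings list is sorted by docID.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


# Step 1: the collection (two short example documents).
docs = {
    1: "I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:",
}
index = build_inverted_index(docs)
print(index["brutus"])       # [1, 2]
print(index["capitol"])      # [1]
print(len(index["caesar"]))  # document frequency of caesar: 2
```

Using a set per term removes duplicate (term, docID) pairs, so the length of each postings list equals the document frequency of its term.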

SLIDE 14

Tokenization and preprocessing

Original text:
Doc 1. I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.
Doc 2. So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:

After tokenization and preprocessing:
Doc 1. i did enact julius caesar i was killed i’ the capitol brutus killed me
Doc 2. so let it be with caesar the noble brutus hath told you caesar was ambitious

SLIDE 15

Generate postings, sort, create lists, determine document frequency

Doc 1. i did enact julius caesar i was killed i’ the capitol brutus killed me
Doc 2. so let it be with caesar the noble brutus hath told you caesar was ambitious

Step 1: generate postings

term        docID
i           1
did         1
enact       1
julius      1
caesar      1
i           1
was         1
killed      1
i’          1
the         1
capitol     1
brutus      1
killed      1
me          1
so          2
let         2
it          2
be          2
with        2
caesar      2
the         2
noble       2
brutus      2
hath        2
told        2
you         2
caesar      2
was         2
ambitious   2

Step 2: sort by term, then by docID

term        docID
ambitious   2
be          2
brutus      1
brutus      2
caesar      1
caesar      2
caesar      2
capitol     1
did         1
enact       1
hath        2
i           1
i           1
i’          1
it          2
julius      1
killed      1
killed      1
let         2
me          1
noble       2
so          2
the         1
the         2
told        2
was         1
was         2
with        2
you         2

Step 3: create postings lists, determine document frequency

term        doc. freq.   postings list
ambitious   1            → 2
be          1            → 2
brutus      2            → 1 → 2
caesar      2            → 1 → 2
capitol     1            → 1
did         1            → 1
enact       1            → 1
hath        1            → 2
i           1            → 1
i’          1            → 1
it          1            → 2
julius      1            → 1
killed      1            → 1
let         1            → 2
me          1            → 1
noble       1            → 2
so          1            → 2
the         2            → 1 → 2
told        1            → 2
was         2            → 1 → 2
with        1            → 2
you         1            → 2

SLIDE 16

Split the result into dictionary and postings file

For each term t, we store a list of all documents that contain t.

Brutus    → 1 2 4 11 31 45 173 174
Caesar    → 1 2 4 5 6 16 57 132 …
Calpurnia → 2 31 54 101
…

(terms on the left: dictionary; lists on the right: postings)

The dictionary is the data structure for storing the term vocabulary.

SLIDE 17

Dictionary as array of fixed-width entries

▶ For each term, we need to store a couple of items:
  ▶ document frequency
  ▶ pointer to postings list
  ▶ …
▶ Assume for the time being that we can store this information in a fixed-length entry.
▶ Assume that we store these entries in an array.

SLIDE 18

Dictionary as array of fixed-width entries

Dictionary:

term            document frequency   pointer to postings list
a               656,265              →
aachen          65                   →
…               …                    …
zulu            221                  →
Space needed:
20 bytes        4 bytes              4 bytes

1. How do we look up a query term $q_i$ in this array at query time?
2. Which data structure do we use to locate the entry (row) in the array where $q_i$ is stored?

SLIDE 19

Data structures for looking up term

▶ Two main classes of data structures: hashes and trees.
▶ Some IR systems use hashes, some use trees.
▶ Criteria for when to use hashes vs. trees:
  1. Is there a fixed number of terms or will it keep growing?
  2. What are the frequencies with which various keys will be accessed?
  3. How many terms are we likely to have?

SLIDE 20

Hashes

▶ Each vocabulary term is hashed into an integer.
▶ Try to avoid collisions
▶ At query time, do the following: hash query term, resolve collisions, locate entry in fixed-width array
▶ Pros:
  1. Lookup in a hash is faster than lookup in a tree.
  2. Lookup time is constant.
▶ Cons:
  1. no way to find minor variants (resume vs. résumé)
  2. no prefix search (all terms starting with automat)
  3. need to rehash everything periodically if vocabulary keeps growing

SLIDE 21

Trees

▶ Trees solve the prefix problem (e.g. find all terms starting with auto).
▶ Search is slightly slower than in hashes: O(log M), where M is the size of the vocabulary
▶ O(log M) only holds for balanced trees. Rebalancing is expensive.
▶ B-trees mitigate the rebalancing problem.
▶ B-tree definition: every internal node has a number of children in the interval [a, b] where a, b are appropriate positive integers, e.g., [2, 4].
▶ Simplest tree: binary tree
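The prefix problem can be illustrated without building a full tree. The sketch below uses a sorted list with binary search as a stand-in for an ordered structure such as a balanced tree or B-tree; this substitution is an assumption made for brevity, and the vocabulary and the name `prefix_range` are hypothetical. The point is that an ordered dictionary keeps all terms sharing a prefix next to each other, which a hash table does not.

```python
from bisect import bisect_left


def prefix_range(sorted_terms: list[str], prefix: str) -> list[str]:
    """Return all terms starting with `prefix`, using two O(log M) binary searches."""
    lo = bisect_left(sorted_terms, prefix)
    # Appending the largest code point gives a key that sorts after every
    # term that starts with the prefix.
    hi = bisect_left(sorted_terms, prefix + "\U0010ffff")
    return sorted_terms[lo:hi]


vocabulary = sorted(
    ["zulu", "automation", "aachen", "autumn", "automate", "autobahn", "automatic"]
)
print(prefix_range(vocabulary, "automat"))  # ['automate', 'automatic', 'automation']

# A hash-based dictionary answers exact lookups quickly but has no term order,
# so a prefix query would have to scan every key.
hashed = set(vocabulary)
print("automate" in hashed)  # True
```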

SLIDE 22

Binary tree example

SLIDE 23

B-tree example

SLIDE 24

Simple conjunctive query (two terms)

▶ Consider the query: Brutus AND Calpurnia
▶ To find all matching documents using inverted index:
  1. Locate Brutus in the dictionary
  2. Retrieve its postings list from the postings file
  3. Locate Calpurnia in the dictionary
  4. Retrieve its postings list from the postings file
  5. Intersect the two postings lists
  6. Return intersection to user

SLIDE 25

Intersecting two postings lists

Brutus       → 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
Calpurnia    → 2 → 31 → 54 → 101
Intersection ⇒ 2 → 31

▶ This is linear in the length of the postings lists.
▶ Note: This only works if postings lists are sorted.
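A minimal sketch of the linear-time merge that intersects two sorted postings lists, using the Brutus and Calpurnia lists above. The function name `intersect` is chosen for illustration.

```python
def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """Intersect two postings lists that are sorted by docID.

    Each step advances at least one pointer, so the running time is
    linear in len(p1) + len(p2).
    """
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer at the smaller docID
        else:
            j += 1
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```

If the lists were not sorted, skipping past the smaller docID would no longer be safe, which is why the note about sorted postings lists matters.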

SLIDE 26

Boolean queries

▶ Boolean model can answer any query that is a Boolean expression.
  ▶ Boolean queries use and, or and not to join query terms.
  ▶ Views each document as a set of terms.
  ▶ Is precise: Document matches condition or not.
▶ Primary commercial retrieval tool for 3 decades
▶ Many professional searchers (e.g., lawyers) still like Boolean queries.
  ▶ You know exactly what you are getting.

SLIDE 27

Text processing

SLIDE 28

Parsing a document

▶ We need to deal with format and language of each document.
▶ What format is it in? pdf, word, excel, html etc.
▶ What language is it in?
▶ What character set is in use?

SLIDE 29

Definitions

▶ Word – A delimited string of characters as it appears in the text.
▶ Term – A “normalized” word (morphology, spelling etc.); an equivalence class of words.
▶ Token – An instance of a word or term occurring in a document.
▶ Type – The same as a term in most cases: an equivalence class of tokens.

SLIDE 30

Normalization

▶ Need to “normalize” terms in indexed text as well as query terms into the same form.
  ▶ We want to match U.S.A. and USA
▶ We most commonly implicitly define equivalence classes of terms.
▶ Normalization and language detection interact.
  ▶ PETER WILL NICHT MIT. → MIT = mit
  ▶ He got his PhD from MIT. → MIT ≠ mit
▶ Numbers
  ▶ 3/20/91 vs. 20/3/91
  ▶ (800) 234-2333 vs. 800.234.2333

SLIDE 31

Tokenization

▶ What are the delimiters? Space? Apostrophe? Hyphen?
▶ For each of these: sometimes they delimit, sometimes they don’t.
  ▶ Hewlett-Packard
  ▶ State-of-the-art
  ▶ co-education
  ▶ the hold-him-back-and-drag-him-away maneuver
  ▶ data base
  ▶ San Francisco
  ▶ Los Angeles-based company
  ▶ cheap San Francisco-Los Angeles fares
  ▶ York University vs. New York University
▶ No whitespace in many languages! (e.g., Chinese)
▶ No whitespace in Dutch, German, Swedish compounds (Lebensversicherungsgesellschaftsangestellter)

SLIDE 32

Accents and diacritics

▶ Accents: résumé vs. resume (simple omission of accent)
▶ Umlauts: Universität vs. Universitaet (substitution of “ä” by “ae”)
▶ Most important criterion: How are users likely to write their queries for these words?
▶ Even in languages that standardly have accents, users often do not type them (e.g. Czech)

SLIDE 33

Case folding

▶ Reduce all letters to lower case
▶ Possible exceptions: capitalized words in mid-sentence
  Example: MIT vs. mit, Fed vs. fed
▶ It’s often best to lowercase everything since users will use lowercase regardless of correct capitalization.

SLIDE 34

Stop words

▶ stop words = extremely common words which would appear to be of little value in helping select documents matching a user need
▶ Examples: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with
▶ Stop word elimination used to be standard in older IR systems.
▶ But you need stop words for phrase queries, e.g. “King of Denmark”
▶ Most web search engines index stop words.

SLIDE 35

More equivalence classing

▶ Soundex: phonetic equivalence, e.g. Muller = Mueller
▶ Thesauri: semantic equivalence, e.g. car = automobile

SLIDE 36

Lemmatization

▶ Reduce inflectional/variant forms to base form
▶ Examples:
  ▶ am, are, is → be
  ▶ car, cars, car’s, cars’ → car
  ▶ the boy’s cars are different colors → the boy car be different color
▶ Lemmatization implies doing “proper” reduction to dictionary headword form (the lemma).
▶ Two types:
  ▶ inflectional (cutting → cut)
  ▶ derivational (destruction → destroy)

SLIDE 37

Stemming

▶ Crude heuristic process that chops off the ends of words in the hope of achieving what “principled” lemmatization attempts to do with a lot of linguistic knowledge.
▶ Language dependent
▶ Often inflectional and derivational
▶ Example (derivational): automate, automatic, automation all reduce to automat

SLIDE 38

Porter algorithm (1980)

▶ Most common algorithm for stemming English
▶ Results suggest that it is at least as good as other stemming options (1980!)
▶ Conventions + 5 phases of reductions applied sequentially
▶ Each phase consists of a set of commands.
▶ Sample command: Delete final ement if what remains is longer than 1 character (replacement → replac, cement → cement)
▶ Sample convention: Of the rules in a compound command, select the one that applies to the longest suffix.

SLIDE 39

Porter stemmer: A few rules

Rule          Example
SSES → SS     caresses → caress
IES  → I      ponies   → poni
SS   → SS     caress   → caress
S    →        cats     → cat
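The four rules in the table, combined with the longest-suffix convention, can be sketched as below. This covers only these four rules and is not the complete Porter stemmer; the names `RULES` and `apply_plural_rules` are hypothetical.

```python
# (suffix, replacement) pairs from the table, ordered from longest to shortest
# suffix so that the first match is also the longest matching suffix.
RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def apply_plural_rules(word: str) -> str:
    """Apply the single rule whose suffix is the longest one matching `word`."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word  # no rule applies


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "→", apply_plural_rules(w))
# caresses → caress
# ponies → poni
# caress → caress
# cats → cat
```

The seemingly redundant rule SS → SS matters: it matches before S → (empty), so caress is left intact instead of being cut to cares.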

SLIDE 40

Three stemmers: A comparison

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter stemmer: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins stemmer: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice stemmer: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

SLIDE 41

Does stemming improve effectiveness?

▶ In general, stemming increases effectiveness for some queries, and decreases effectiveness for others.
▶ Queries where stemming is likely to help:
  ▶ [tartan sweaters], [sightseeing tour san francisco]
  ▶ equivalence classes: {sweater,sweaters}, {tour,tours}
▶ Queries where stemming hurts:
  ▶ [operational research], [operating system], [operative dentistry]
  ▶ Porter Stemmer equivalence class oper contains all of operate, operating, operates, operation, operative, operatives, operational.

SLIDE 42

Ranked retrieval

SLIDE 43

Ranked retrieval

▶ So far, our queries have been Boolean: a document is either a match or not.
▶ Good for experts: precise understanding of the needs and collection.
▶ Good for applications: can easily consume thousands of results.
▶ Not good for the majority of users.
▶ Most users are unable or unwilling to write Boolean queries.
▶ Most users don’t want to wade through 1000s of results.
▶ This is particularly true of web search.

SLIDE 44

Problem with Boolean search

▶ Boolean queries often result in either too few or too many results.
  ▶ Query 1: [standard user dlink 650] → 200,000 hits
  ▶ Query 2: [standard user dlink 650 no card found] → 0 hits
▶ In Boolean retrieval, it takes a lot of skill to come up with a query that produces a manageable number of hits.
▶ With ranking, large result sets are not an issue.
  ▶ Just show the top 10 results.
  ▶ This doesn’t overwhelm the user.
  ▶ Premise: the ranking algorithm works, i.e., more relevant results are ranked higher than less relevant results.

SLIDE 48

Scoring as the basis of ranked retrieval

▶ We wish to rank documents that are more relevant higher than documents that are less relevant.
▶ How can we accomplish such a ranking of the documents in the collection with respect to a query?
▶ Assign a score to each query-document pair, say in [0, 1].
▶ This score measures how well document and query “match”.

SLIDE 49

Query-document matching scores

▶ How do we compute the score of a query-document pair?
▶ Let’s start with a one-term query.
▶ If the query term does not occur in the document: score should be 0.
▶ The more frequent the query term in the document, the higher the score
▶ We will look at a number of alternatives for doing this.

SLIDE 50

Take 1: Jaccard coefficient

▶ A commonly used measure of overlap of two sets
▶ Let A and B be two sets
▶ Jaccard coefficient:

\[ \mathrm{jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}, \quad \text{where } A \neq \emptyset \text{ or } B \neq \emptyset \]

▶ jaccard(A, A) = 1
▶ jaccard(A, B) = 0 if A ∩ B = ∅
▶ A and B don’t have to be the same size.
▶ Always assigns a number between 0 and 1.

SLIDE 51

Jaccard coefficient: Example

What is the query-document score the Jaccard coefficient computes for:

▶ Query: “ides of March”
▶ Document: “Caesar died in March”
▶ jaccard(q, d) = 1/6
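The value 1/6 can be checked with a few lines of code: the query and the document share one term (march), and their union has six distinct terms. The sketch assumes whitespace tokenization and case folding; the function name `jaccard` simply mirrors the notation of the slides.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; undefined if both sets are empty."""
    if not a and not b:
        raise ValueError("Jaccard coefficient is undefined for two empty sets")
    return len(a & b) / len(a | b)


query = set("ides of March".lower().split())
document = set("Caesar died in March".lower().split())

print(sorted(query & document))  # ['march']
print(len(query | document))     # 6
print(jaccard(query, document))  # 0.16666666666666666, i.e. 1/6
```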

SLIDE 52

What’s wrong with Jaccard?

▶ It ignores term frequency (how many occurrences a term has).
▶ Rare terms are more informative than frequent terms. Jaccard does not consider this information.
→ We need a more sophisticated way of normalizing for the length of a document.

SLIDE 53

Term weighting

SLIDE 54

Binary incidence matrix

            Anthony and   Julius   The       Hamlet   Othello   Macbeth   …
            Cleopatra     Caesar   Tempest
Anthony     1             1        0         0        0         1
Brutus      1             1        0         1        0         0
Caesar      1             1        0         1        1         1
Calpurnia   0             1        0         0        0         0
Cleopatra   1             0        0         0        0         0
mercy       1             0        1         1        1         1
worser      1             0        1         1        1         0
…

▶ Each document is represented as a binary vector $\in \{0, 1\}^{|V|}$.

SLIDE 55

Count matrix

            Anthony and   Julius   The       Hamlet   Othello   Macbeth   …
            Cleopatra     Caesar   Tempest
Anthony     157           73       0         0        0         1
Brutus      4             157      0         2        0         0
Caesar      232           227      0         2        1         0
Calpurnia   0             10       0         0        0         0
Cleopatra   57            0        0         0        0         0
mercy       2             0        3         8        5         8
worser      2             0        1         1        1         5
…

▶ Each document is represented as a count vector $\in \mathbb{N}^{|V|}$.

SLIDE 56

Bag of words model

▶ We do not consider the order of words in a document.
▶ “John is quicker than Mary” and “Mary is quicker than John” are represented the same way.
▶ This is called a bag of words model.
▶ In a sense, this is a step back: The positional index was able to distinguish these two documents.
▶ We will look at “recovering” positional information later in this course.
▶ For now: bag of words model

SLIDE 57

Term frequency tf

▶ The term frequency $\mathrm{tf}_{t,d}$ of term t in document d is defined as the number of times that t occurs in d.
▶ We want to use tf when computing query-document match scores.
▶ But how?
▶ Raw term frequency is not what we want because:
  ▶ A document with tf = 10 occurrences of the term is more relevant than a document with tf = 1 occurrence of the term.
  ▶ But not 10 times more relevant.

SLIDE 58

Instead of raw frequency: Log frequency weighting

▶ The log frequency weight of term t in d is defined as follows:

\[ w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d} & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases} \]

▶ $\mathrm{tf}_{t,d} \to w_{t,d}$: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
▶ Score for a document-query pair: sum over terms t in both q and d:

\[ \text{tf-matching-score}(q, d) = \sum_{t \in q \cap d} (1 + \log \mathrm{tf}_{t,d}) \]

▶ The score is 0 if none of the query terms is present in the document.
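A small sketch of the log frequency weight and the resulting matching score. The function names are hypothetical, the logarithm is taken to base 10 as in the definition of the weight, and documents are assumed to be given as already tokenized lists of terms.

```python
import math
from collections import Counter


def log_tf_weight(tf: int) -> float:
    """Log frequency weight: 1 + log10(tf) if tf > 0, otherwise 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


def tf_matching_score(query_terms: list[str], doc_tokens: list[str]) -> float:
    """Sum the log frequency weights over terms that occur in both query and document."""
    counts = Counter(doc_tokens)
    return sum(log_tf_weight(counts[t]) for t in set(query_terms) if counts[t] > 0)


# Reproduce the mapping tf → w listed above (values rounded to one decimal).
for tf in (0, 1, 2, 10, 1000):
    print(tf, "→", round(log_tf_weight(tf), 1))

# Toy document: "caesar" occurs 10 times, "brutus" once; "calpurnia" is absent.
doc = ["caesar"] * 10 + ["brutus"]
print(round(tf_matching_score(["caesar", "calpurnia"], doc), 2))  # 2.0
```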

SLIDE 59

Frequency in document vs. frequency in collection

▶ In addition to the frequency of the term in the document, we also want to use the frequency of the term in the collection for weighting and ranking.

SLIDE 60

Desired weight for terms

▶ Rare terms are more informative than frequent terms.
  ▶ Consider a term in the query that is rare in the collection (e.g., arachnocentric).
  ▶ A document containing this term is very likely to be relevant.
  → We want high weights for rare terms like arachnocentric.
▶ Frequent terms are less informative than rare terms.
  ▶ Consider a term in the query that is frequent in the collection (e.g., good, increase, line).
  ▶ A document containing this term is more likely to be relevant than a document that doesn’t, but words like good, increase and line are not sure indicators of relevance.
  → For frequent terms like good, increase, and line, we want positive weights but lower weights than for rare terms.

SLIDE 61

Document frequency

▶ The document frequency ($\mathrm{df}_t$) is the number of documents in the collection that the term occurs in.
▶ $\mathrm{df}_t$ is an inverse measure of the informativeness of term t.
▶ We define the idf weight of term t in a collection of N documents as:

\[ \mathrm{idf}_t = \log_{10} \frac{N}{\mathrm{df}_t} \]

▶ $\mathrm{idf}_t$ is a measure of the informativeness of the term.
▶ $[\log N/\mathrm{df}_t]$ instead of $[N/\mathrm{df}_t]$ to “dampen” the effect of idf
▶ Note that we use the log transformation for both term frequency and document frequency.
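To make the dampening concrete, the following sketch evaluates the idf formula for a few document frequencies. The collection size N = 1,000,000 and the chosen df values are assumptions picked only for illustration.

```python
import math


def idf(n_docs: int, df: int) -> float:
    """Inverse document frequency: log10(N / df_t)."""
    return math.log10(n_docs / df)


N = 1_000_000  # assumed collection size, for illustration only
for df in (1, 100, 10_000, 1_000_000):
    print(f"df = {df:>9,}  N/df = {N // df:>9,}  idf = {round(idf(N, df), 2)}")
```

The raw ratio N/df spans six orders of magnitude across these rows, while the idf values only range from 6 down to 0; a term that occurs in every document gets weight 0.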

SLIDE 62

Collection frequency vs. Document frequency

word        collection frequency   document frequency
insurance   10440                  3997
try         10422                  8760

▶ Collection frequency of t: number of tokens of t in the collection
▶ Document frequency of t: number of documents t occurs in
▶ Why these numbers?
▶ Which word is a better search term (should get a higher weight)?
▶ This example suggests that df/idf is better for weighting than cf.

SLIDE 63

tf-idf weighting

▶ The tf-idf weight of a term is the product of its tf weight and its idf weight.

\[ w_{t,d} = (1 + \log \mathrm{tf}_{t,d}) \cdot \log \frac{N}{\mathrm{df}_t} \]

▶ tf-weight: $1 + \log \mathrm{tf}_{t,d}$
▶ idf-weight: $\log \frac{N}{\mathrm{df}_t}$
▶ Best known weighting scheme in information retrieval.
▶ Increases with the number of occurrences within a document (tf).
▶ Increases with the rarity of the term in the collection (idf).
▶ Note: the “-” in tf-idf is a hyphen, not a minus (also written tf.idf, tf x idf).
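The product of the two weights can be sketched as a single function. Base-10 logarithms are assumed, matching the earlier definitions of the tf and idf weights; the function name `tf_idf` and the numbers in the example are hypothetical.

```python
import math


def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """tf-idf weight: (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)


N = 1_000_000  # assumed collection size, for illustration only

# Same term frequency, different rarity: the rarer term gets the higher weight.
print(round(tf_idf(tf=10, df=1_000, n_docs=N), 2))    # 6.0
print(round(tf_idf(tf=10, df=100_000, n_docs=N), 2))  # 2.0

# Same rarity, more occurrences in the document: the weight increases.
print(tf_idf(tf=100, df=1_000, n_docs=N) > tf_idf(tf=10, df=1_000, n_docs=N))  # True
```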

SLIDE 64

Vector space model

SLIDE 65

Binary incidence matrix

            Anthony and   Julius   The       Hamlet   Othello   Macbeth   …
            Cleopatra     Caesar   Tempest
Anthony     1             1        0         0        0         1
Brutus      1             1        0         1        0         0
Caesar      1             1        0         1        1         1
Calpurnia   0             1        0         0        0         0
Cleopatra   1             0        0         0        0         0
mercy       1             0        1         1        1         1
worser      1             0        1         1        1         0
…

▶ Each document is represented as a binary vector $\in \{0, 1\}^{|V|}$.

SLIDE 66

Count matrix

            Anthony and   Julius   The       Hamlet   Othello   Macbeth   …
            Cleopatra     Caesar   Tempest
Anthony     157           73       0         0        0         1
Brutus      4             157      0         2        0         0
Caesar      232           227      0         2        1         0
Calpurnia   0             10       0         0        0         0
Cleopatra   57            0        0         0        0         0
mercy       2             0        3         8        5         8
worser      2             0        1         1        1         5
…

▶ Each document is represented as a count vector $\in \mathbb{N}^{|V|}$.

SLIDE 67

Weight matrix

            Anthony and   Julius   The       Hamlet   Othello   Macbeth   …
            Cleopatra     Caesar   Tempest
Anthony     5.25          3.18     0.00      0.00     0.00      0.35
Brutus      1.21          6.10     0.00      1.00     0.00      0.00
Caesar      8.59          2.54     0.00      1.51     0.25      0.00
Calpurnia   0.00          1.54     0.00      0.00     0.00      0.00
Cleopatra   2.85          0.00     0.00      0.00     0.00      0.00
mercy       1.51          0.00     1.90      0.12     5.25      0.88
worser      1.37          0.00     0.11      4.15     0.25      1.95
…

▶ Each document is represented as a real-valued vector $\in \mathbb{R}^{|V|}$.

SLIDE 68

Documents as vectors

▶ Each document is now represented as a real-valued vector of tf-idf weights $\in \mathbb{R}^{|V|}$.
▶ So we have a |V|-dimensional real-valued vector space.
▶ Terms are axes of the space.
▶ Documents are points or vectors in this space.
▶ Very high-dimensional: tens/hundreds of millions of dimensions when you apply this to web search engines
▶ Each vector is very sparse - most entries are zero.

SLIDE 69

Queries as vectors

▶ Key idea 1: Do the same for queries: represent them as vectors
▶ Key idea 2: Rank documents according to their proximity to query
  ▶ proximity = similarity
  ▶ proximity ≈ negative distance
▶ Negative distance between two points/end points of the two vectors?
▶ Euclidean distance?
▶ Bad idea – Euclidean distance is large for vectors of different lengths.

SLIDE 70

Why distance is a bad idea

[Figure: two-dimensional term space with axes “rich” and “poor”, showing the query q: [rich poor] and the documents d1: Ranks of starving poets swell, d2: Rich poor gap grows, d3: Record baseball salaries in 2010]

The Euclidean distance of $\vec{q}$ and $\vec{d}_2$ is large although the distribution of terms in query q and the distribution of terms in document d2 are similar.

SLIDE 71

Use angle instead of distance

▶ Rank documents according to their angle with the query.
▶ Ranking documents by increasing angle between query and document (smallest angle first) is equivalent to ranking them by decreasing cosine(query, document) (largest cosine first).
▶ Cosine is a monotonically decreasing function of the angle on the interval [0°, 180°].

SLIDE 72

Length normalization

▶ How do we compute the cosine?
▶ A vector can be (length-)normalized by dividing each of its components by its length, e.g. by the L2 norm: $\|x\|_2 = \sqrt{\sum_i x_i^2}$
▶ This maps vectors onto the unit sphere, since after normalization: $\|x\|_2 = \sqrt{\sum_i x_i^2} = 1.0$
▶ As a result, longer documents and shorter documents have weights of the same order of magnitude.
▶ Effect on the two documents d and d′ (d appended to itself) from earlier slide: they have identical vectors after length-normalization.
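A minimal sketch of L2 length normalization. It assumes raw term counts as weights, so that appending d to itself exactly doubles every component; `l2_normalize` is a hypothetical helper name and the counts are invented.

```python
import math

def l2_normalize(vec):
    """Divide each component by the vector's L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

d = [3.0, 0.0, 4.0]             # assumed raw term counts for document d
d_prime = [2 * x for x in d]    # d appended to itself: every count doubles

print(l2_normalize(d))
print(l2_normalize(d_prime))    # same direction, so the same unit vector
```

With other weighting schemes (for example log tf) the doubled document would not be an exact multiple of the original, so the two normalized vectors would only be close rather than identical.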

SLIDE 73

Cosine similarity between query and document

$$\cos(\vec{q}, \vec{d}) = \mathrm{sim}(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}$$

▶ $q_i$ is the tf-idf weight of term i in the query.
▶ $d_i$ is the tf-idf weight of term i in the document.
▶ $|\vec{q}|$ and $|\vec{d}|$ are the lengths of $\vec{q}$ and $\vec{d}$.
▶ This is the cosine similarity of $\vec{q}$ and $\vec{d}$ or, equivalently, the cosine of the angle between $\vec{q}$ and $\vec{d}$.
▶ For length-normalized vectors, the cosine is equivalent to the dot product:

$$\cos(\vec{q}, \vec{d}) = \vec{q} \cdot \vec{d} = \sum_i q_i \cdot d_i$$

SLIDE 74

Cosine similarity illustrated

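[Figure: unit-length vectors $\vec{v}(q)$, $\vec{v}(d_1)$, $\vec{v}(d_2)$, $\vec{v}(d_3)$ in the plane with axes "rich" and "poor", with an angle θ marked]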

SLIDE 75

Cosine: Example

How similar are these novels?

SaS: Sense and Sensibility
PaP: Pride and Prejudice
WH: Wuthering Heights

Term frequencies (counts)

term        SaS   PaP   WH
affection   115   58    20
jealous     10    7     11
gossip      2     0     6
wuthering   0     0     38

SLIDE 76

Cosine: Example

Term frequencies (counts)

term        SaS   PaP   WH
affection   115   58    20
jealous     10    7     11
gossip      2     0     6
wuthering   0     0     38

Log frequency weighting

term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.0    1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

(To simplify this example, we don’t do idf weighting.)

SLIDE 77

Cosine: Example

Log frequency weighting

term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.0    1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

Log frequency weighting & cosine normalization

term        SaS    PaP    WH
affection   0.79   0.83   0.52
jealous     0.52   0.56   0.47
gossip      0.34   0.0    0.41
wuthering   0.0    0.0    0.59

▶ cos(SaS,PaP) ≈ 0.79 ∗ 0.83 + 0.52 ∗ 0.56 + 0.34 ∗ 0.0 + 0.0 ∗ 0.0 ≈ 0.94
▶ cos(SaS,WH) ≈ 0.79
▶ cos(PaP,WH) ≈ 0.69
▶ Why do we have cos(SaS,PaP) > cos(SaS,WH)?
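The numbers in this example can be reproduced with a short sketch. It assumes base-10 logarithms (which matches the weights shown, e.g. 1 + log10(115) ≈ 3.06) and a weight of 0 for terms that do not occur; the function names are hypothetical.

```python
import math

TERMS = ["affection", "jealous", "gossip", "wuthering"]
COUNTS = {
    "SaS": [115, 10, 2, 0],
    "PaP": [58, 7, 0, 0],
    "WH": [20, 11, 6, 38],
}

def log_tf(tf):
    """Log frequency weight: 1 + log10(tf) for tf > 0, else 0 (assumed convention)."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0

def unit_vector(counts):
    """Log-weight the counts, then length-normalize with the L2 norm."""
    weights = [log_tf(tf) for tf in counts]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

def cosine(u, v):
    """For unit vectors the cosine is simply the dot product."""
    return sum(a * b for a, b in zip(u, v))

vectors = {name: unit_vector(counts) for name, counts in COUNTS.items()}
for a, b in [("SaS", "PaP"), ("SaS", "WH"), ("PaP", "WH")]:
    print(a, b, round(cosine(vectors[a], vectors[b]), 2))
```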

SLIDE 78

Computing the cosine score

CosineScore(q)
  float Scores[N] = 0
  float Length[N]
  for each query term t
  do calculate w_{t,q} and fetch postings list for t
     for each pair (d, tf_{t,d}) in postings list
     do Scores[d] += w_{t,d} × w_{t,q}
  Read the array Length
  for each d
  do Scores[d] = Scores[d] / Length[d]
  return Top K components of Scores[]
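One way the pseudocode above could look in Python, as a term-at-a-time sketch. Several details are assumptions that the pseudocode leaves open: the query weight is taken to be idf (base-10), the document weight is log tf, and the toy postings and document lengths are invented for illustration.

```python
import heapq
import math
from collections import defaultdict

def cosine_score(query_terms, postings, doc_lengths, num_docs, k=10):
    """Accumulate w_td * w_tq per document, normalize by document length, return the top k."""
    scores = defaultdict(float)
    for term in set(query_terms):
        plist = postings.get(term, [])
        if not plist:
            continue
        w_tq = math.log10(num_docs / len(plist))   # assumed query weight: idf
        for doc_id, tf in plist:
            w_td = 1.0 + math.log10(tf)            # assumed document weight: log tf
            scores[doc_id] += w_td * w_tq
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]      # length normalization
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Toy index: term -> list of (doc_id, term frequency). Hypothetical data.
postings = {"rich": [(1, 1), (2, 3)], "poor": [(2, 1)], "gap": [(2, 1), (3, 2)]}
doc_lengths = {1: 1.0, 2: 2.0, 3: 1.5}             # assumed precomputed vector lengths

print(cosine_score(["rich", "poor"], postings, doc_lengths, num_docs=3, k=2))
```

Only documents that share at least one term with the query ever receive a score, which is why a dictionary of accumulators is used here instead of an array of size N.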

SLIDE 79

Components of tf-idf weighting

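Term frequency
  n (natural)     $\mathrm{tf}_{t,d}$
  l (logarithm)   $1 + \log(\mathrm{tf}_{t,d})$
  a (augmented)   $0.5 + \frac{0.5 \times \mathrm{tf}_{t,d}}{\max_t(\mathrm{tf}_{t,d})}$
  b (boolean)     1 if $\mathrm{tf}_{t,d} > 0$, 0 otherwise
  L (log ave)     $\frac{1 + \log(\mathrm{tf}_{t,d})}{1 + \log(\mathrm{ave}_{t \in d}(\mathrm{tf}_{t,d}))}$

Document frequency
  n (no)          1
  t (idf)         $\log \frac{N}{\mathrm{df}_t}$
  p (prob idf)    $\max\{0, \log \frac{N - \mathrm{df}_t}{\mathrm{df}_t}\}$

Normalization
  n (none)             1
  c (cosine)           $\frac{1}{\sqrt{w_1^2 + w_2^2 + \dots + w_M^2}}$
  u (pivoted unique)   1/u
  b (byte size)        $1/\mathit{CharLength}^{\alpha}$, $\alpha < 1$

Legend: best known combination of weighting options; default: no weighting.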
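A few of the components in the table, written out as a sketch. The table does not fix a logarithm base; base 10 is assumed here for consistency with the earlier cosine example, and the weight 0 for tf = 0 is likewise an assumed convention. All function names are hypothetical.

```python
import math

def tf_log(tf):
    """l (logarithm): 1 + log(tf); 0 for absent terms (assumed convention)."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0

def tf_augmented(tf, max_tf):
    """a (augmented): 0.5 + 0.5 * tf / (largest tf in the document)."""
    return 0.5 + 0.5 * tf / max_tf

def tf_boolean(tf):
    """b (boolean): 1 if the term occurs, 0 otherwise."""
    return 1.0 if tf > 0 else 0.0

def idf(num_docs, df):
    """t (idf): log(N / df)."""
    return math.log10(num_docs / df)

def prob_idf(num_docs, df):
    """p (prob idf): max(0, log((N - df) / df))."""
    return max(0.0, math.log10((num_docs - df) / df))

def cosine_normalize(weights):
    """c (cosine): scale the weight vector by 1 / sqrt(w_1^2 + ... + w_M^2)."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else list(weights)

# Example: weight one term with logarithmic tf and idf in a collection of 1000 documents.
print(tf_log(10) * idf(1000, 10))
```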

SLIDE 80

Summary: Ranked retrieval in the vector space model

▶ Represent the query as a weighted tf-idf vector.
▶ Represent each document as a weighted tf-idf vector.
▶ Compute the cosine similarity between the query vector and each document vector.
▶ Rank documents with respect to the query.
▶ Return the top K (e.g., K = 10) to the user.

SLIDE 81

Evaluation

SLIDE 82

Main measure: user happiness and relevance

▶ User happiness is equated with the relevance of search results to the query.
▶ But how do you measure relevance?
▶ Standard methodology in IR consists of three elements:
  1. A benchmark document collection.
  2. A benchmark suite of queries.
  3. An assessment of the relevance of each query-document pair.

SLIDE 83

Precision and recall (unranked retrieval)

▶ Precision (P) is the fraction of retrieved documents that are relevant:

$$\text{Precision} = \frac{\#(\text{relevant items retrieved})}{\#(\text{retrieved items})} = P(\text{relevant} \mid \text{retrieved})$$

▶ Recall (R) is the fraction of relevant documents that are retrieved:

$$\text{Recall} = \frac{\#(\text{relevant items retrieved})}{\#(\text{relevant items})} = P(\text{retrieved} \mid \text{relevant})$$

SLIDE 84

Precision and recall: confusion matrix

                Relevant               Nonrelevant
Retrieved       true positives (TP)    false positives (FP)
Not retrieved   false negatives (FN)   true negatives (TN)

$$P = \frac{TP}{TP + FP} \qquad R = \frac{TP}{TP + FN}$$
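A minimal sketch of both measures for an unranked result set. The document IDs are invented, and returning 0.0 for an empty denominator is an assumed convention rather than part of the definitions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved IDs and a set of relevant IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)                            # relevant items retrieved
    precision = tp / len(retrieved) if retrieved else 0.0     # TP / (TP + FP)
    recall = tp / len(relevant) if relevant else 0.0          # TP / (TP + FN)
    return precision, recall

# Hypothetical query: 4 documents returned, 3 documents judged relevant.
p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 4, 5})
print(p, r)
```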

SLIDE 85

Precision/recall tradeoff

▶ You can increase recall by returning more docs.
▶ Recall is a non-decreasing function of the number of docs retrieved.
▶ A system that returns all docs has 100% recall!
▶ The converse is also true (usually): It’s easy to get high precision for very low recall.
▶ Suppose the document with the largest score is relevant. How can we maximize precision?

SLIDE 86

A combined measure: F

▶ The F measure allows us to trade off precision against recall.

$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R} \quad \text{where} \quad \beta^2 = \frac{1 - \alpha}{\alpha}$$

▶ α ∈ [0, 1] and thus β² ∈ [0, ∞]
▶ Most frequently used: balanced F with β = 1 or α = 0.5
▶ This is the harmonic mean of P and R: $\frac{1}{F} = \frac{1}{2}\left(\frac{1}{P} + \frac{1}{R}\right)$
▶ What value range of β weights recall higher than precision?

SLIDE 87

F measure: Example

                relevant   not relevant   total
retrieved       20         40             60
not retrieved   60         1,000,000      1,000,060
total           80         1,000,040      1,000,120

▶ P = 20/(20 + 40) = 1/3
▶ R = 20/(20 + 60) = 1/4
▶ $F_1 = 2 \cdot \frac{1}{\frac{1}{1/3} + \frac{1}{1/4}} = 2/7$
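The worked example can be checked with exact fractions. This sketch uses the β form of the F measure; `f_measure` is a hypothetical helper name, and returning 0 when the denominator is 0 is an assumed convention.

```python
from fractions import Fraction

def f_measure(precision, recall, beta=1):
    """F = (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0
    return (b2 + 1) * precision * recall / denominator

P = Fraction(20, 20 + 40)   # 1/3
R = Fraction(20, 20 + 60)   # 1/4

print(f_measure(P, R))           # balanced F1
print(f_measure(P, R, beta=2))   # beta > 1 pulls the result toward recall
```

With β = 2 the value moves from 2/7 toward R = 1/4, which illustrates that values of β above 1 weight recall more heavily than precision.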

SLIDE 88

Accuracy

▶ Why do we use complex measures like precision, recall, and F?
▶ Why not something simple like accuracy?
▶ Accuracy is the fraction of correct decisions (relevant/nonrelevant).
▶ In terms of the contingency table above:

$$A = \frac{TP + TN}{TP + FP + FN + TN}$$

▶ Why is accuracy not a useful measure for web information retrieval?

SLIDE 89

Why accuracy is a useless measure in IR

▶ The numbers of relevant and non-relevant documents are unbalanced.
▶ A trick to maximize accuracy in IR: always say no and return nothing.
▶ You then get 99.99% accuracy on most queries (0.01% docs relevant).
▶ Searchers on the web (and in IR in general) want to find something and have a certain tolerance for junk.
▶ It’s better to return some bad hits as long as you return something.

→ We use precision, recall, and F for evaluation, not accuracy.
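To make the point concrete, this sketch reuses the counts from the F measure example earlier in the deck (20 relevant retrieved, 40 nonrelevant retrieved, 60 relevant missed, 1,000,000 true negatives) and compares them with a system that returns nothing. The function names are hypothetical.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of correct relevant/nonrelevant decisions."""
    return (tp + tn) / (tp + fp + fn + tn)

def recall(tp, fn):
    """Fraction of relevant documents that were retrieved."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# System from the F measure example.
real_accuracy = accuracy(tp=20, fp=40, fn=60, tn=1_000_000)
# "Always say no": nothing retrieved, so all 80 relevant documents are missed.
say_no_accuracy = accuracy(tp=0, fp=0, fn=80, tn=1_000_040)

print(real_accuracy, recall(tp=20, fn=60))
print(say_no_accuracy, recall(tp=0, fn=80))
```

The empty result list scores slightly higher on accuracy while its recall is zero, which is the reason accuracy is set aside here.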

SLIDE 90

Difficulties in using precision, recall and F measure

▶ We should always average over a large set of queries.
▶ We need relevance judgments for information-need-document pairs, but they are expensive to produce.
▶ Alternatives to using precision/recall and having to produce relevance judgments exist (A/B testing).

SLIDE 91

Precision-recall curve

▶ Precision/recall/F are measures for unranked sets.
▶ We can easily turn set measures into measures of ranked lists.
▶ Just compute the set measure for each “prefix” of the ranked list: the top 1, top 2, top 3, top 4 etc. results.
▶ Doing this for precision and recall gives you a precision-recall curve.
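The prefix idea can be sketched as follows. The ranked list is represented as a list of booleans (relevant or not, best hit first), which is an assumed encoding, and the example ranking is invented.

```python
def precision_recall_points(ranked_relevance, num_relevant):
    """Return one (recall, precision) point for each prefix top 1, top 2, ... of the ranking."""
    points = []
    hits = 0
    for k, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / num_relevant, hits / k))
    return points

# Hypothetical ranking of 4 results; the collection holds 2 relevant documents in total.
ranking = [True, False, True, False]
for recall, precision in precision_recall_points(ranking, num_relevant=2):
    print(recall, precision)
```

Recall never decreases along the list, while precision drops at every nonrelevant hit, which produces the sawtooth shape typical of these curves.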

SLIDE 92

A precision-recall curve

[Figure: precision-recall curve; x-axis: Recall, y-axis: Precision; interpolated curve in red]

▶ Each point corresponds to a result for the top k ranked hits (k = 1, 2, 3, …).
▶ Interpolation (in red): Take maximum of all future points.
▶ Rationale for interpolation: The user is willing to look at more stuff if both precision and recall get better.

SLIDE 93

11-point interpolated average precision

Recall   Interpolated precision
0.0      1.00
0.1      0.67
0.2      0.63
0.3      0.55
0.4      0.45
0.5      0.41
0.6      0.36
0.7      0.29
0.8      0.13
0.9      0.10
1.0      0.08

11-point average: ≈ 0.425

How can precision at 0.0 be > 0?
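A sketch of interpolation and the 11-point average. Interpolated precision at a recall level is taken as the maximum precision over all points with recall at or above that level (the "maximum of all future points" rule); the (recall, precision) points below are invented, while the second computation simply averages the eleven values from the table.

```python
def interpolated_precision(points, recall_level):
    """Maximum precision among points whose recall is >= recall_level (0.0 if there are none)."""
    return max((p for r, p in points if r >= recall_level), default=0.0)

def eleven_point_average(points):
    """Average interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    levels = [i / 10 for i in range(11)]
    return sum(interpolated_precision(points, level) for level in levels) / 11

# Hypothetical (recall, precision) points for one ranked list.
points = [(0.5, 1.0), (0.5, 0.5), (1.0, 2 / 3), (1.0, 0.5)]
print(interpolated_precision(points, 0.0))
print(eleven_point_average(points))

# Averaging the eleven interpolated values listed in the table.
table = [1.00, 0.67, 0.63, 0.55, 0.45, 0.41, 0.36, 0.29, 0.13, 0.10, 0.08]
print(sum(table) / len(table))
```

Under this rule the value at recall 0.0 is the best precision reached anywhere in the ranking, so it can be greater than 0 even though no document has been retrieved at that point.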

SLIDE 94

Averaged 11-point precision/recall graph

[Figure: averaged 11-point precision/recall graph; x-axis: Recall, y-axis: Precision]

▶ Compute interpolated precision at recall levels 0.0, 0.1, 0.2, …
▶ Do this for each of the queries in the evaluation benchmark.
▶ Average over queries.
▶ This measure measures performance at all recall levels.

SLIDE 95

What we need for a benchmark

1. A collection of documents
  ▶ Must be representative of the documents we expect to see in reality.
2. A collection of information needs
  ▶ (which we will often incorrectly refer to as queries)
  ▶ Must be representative of the information needs we expect to see in reality.
3. Human relevance assessments
  ▶ We need to hire/pay “judges” or assessors to do this.
  ▶ Expensive, time-consuming.
  ▶ Judges must be representative of the users we expect to see in reality.

SLIDE 96

Standard relevance benchmark: Cranfield

▶ Pioneering: first testbed allowing precise quantitative measures of information retrieval effectiveness.
▶ Late 1950s, UK.
▶ 1398 abstracts of aerodynamics journal articles, a set of 225 queries, exhaustive relevance judgments of all query-document pairs.
▶ Too small, too untypical for serious IR evaluation today.

SLIDE 97

Standard relevance benchmark: TREC

▶ TREC = Text Retrieval Conference
▶ Organized by the National Institute of Standards and Technology (NIST)
▶ TREC is actually a set of several different relevance benchmarks.
▶ Best known: TREC Ad Hoc, used for TREC evaluations in 1992–1999
▶ 1.89 M documents, mainly newswire articles, 450 information needs
▶ No exhaustive relevance judgments: too expensive
▶ Rather, NIST assessors’ relevance judgments are available only for the documents that were among the top k returned by some system entered in the TREC evaluation for which the information need was developed.
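The last point describes what is usually called pooling. A minimal sketch, assuming each participating system submits a ranked list of document IDs for one information need; the system names, document IDs and the `build_pool` helper are hypothetical.

```python
def build_pool(runs, k):
    """Union of the top-k documents from every submitted ranking; only these get judged."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool

# Hypothetical runs for a single information need.
runs = {
    "system_a": ["d3", "d1", "d7", "d2"],
    "system_b": ["d1", "d9", "d3", "d5"],
}
print(sorted(build_pool(runs, k=2)))
```

Documents outside the pool receive no judgment at all, which is what makes the relevance judgments non-exhaustive.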

SLIDE 98

Standard relevance benchmarks: Others

▶ GOV2
  ▶ Another TREC/NIST collection
  ▶ 25 million web pages
  ▶ Used to be the largest collection that is easily available
  ▶ But still 3 orders of magnitude smaller than what Google/Bing index
▶ NTCIR
  ▶ East Asian language and cross-language information retrieval
▶ Cross Language Evaluation Forum (CLEF)
  ▶ This evaluation series has concentrated on European languages and cross-language information retrieval.
▶ Many others
