

slide-1
SLIDE 1

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/

IR 4: Scoring, Term Weighting, The Vector Space Model I

Paul Ginsparg

Cornell University, Ithaca, NY

7 Sep 2010

1 / 47

slide-2
SLIDE 2

Administrativa

Course Webpage: http://www.infosci.cornell.edu/Courses/info4300/2010fa/
Assignment 1. Posted: 3 Sep, Due: Sun, 19 Sep
Lectures: Tuesday and Thursday 11:40-12:55, Olin Hall 165
Instructor: Paul Ginsparg, ginsparg@..., 255-7371, Cornell Information Science, 301 College Avenue
Instructor's Assistant: Corinne Russell, crussell@cs..., 255-5925, Cornell Information Science, 301 College Avenue
Instructor's Office Hours: Wed 1-2pm, Fri 2-3pm, or e-mail the instructor to schedule an appointment
Teaching Assistant: Niranjan Sivakumar, ns253@... The Teaching Assistant does not have scheduled office hours but is available to help you by email.
Course text at: http://informationretrieval.org/

2 / 47

slide-3
SLIDE 3

Overview

1. Recap
2. Why ranked retrieval?
3. Term frequency
4. tf.idf weighting
5. The vector space model

3 / 47

slide-4
SLIDE 4

Outline

1. Recap
2. Why ranked retrieval?
3. Term frequency
4. tf.idf weighting
5. The vector space model

4 / 47

slide-5
SLIDE 5

Heaps’ law for Reuters

[Plot: vocabulary size log10 M vs. collection size log10 T for Reuters-RCV1, with the least-squares fit shown as a dashed line.]

Vocabulary size M as a function of collection size T (number of tokens) for Reuters-RCV1. For these data, the dashed line log10 M = 0.49 · log10 T + 1.64 is the best least-squares fit. Thus M = 10^1.64 · T^0.49, i.e., k = 10^1.64 ≈ 44 and b = 0.49, so M = k·T^b = 44·T^0.49.
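The fit itself is an ordinary least-squares regression in log-log space. As a sketch (not from the slides, and using made-up (T, M) sample points rather than the RCV1 measurements), the slope and intercept can be computed in a few lines of Perl, in the same spirit as the collection_freq.pl script later in the deck:

#!/usr/bin/perl
# Sketch: least-squares fit of Heaps' law M = k*T^b in log-log space.
# The (T, M) pairs below are illustrative, not the Reuters-RCV1 data.
use strict; use warnings;
my @points = ([1e4, 4_000], [1e5, 13_000], [1e6, 38_000], [1e7, 110_000]);
my ($n, $sx, $sy, $sxx, $sxy) = (scalar(@points), 0, 0, 0, 0);
for my $p (@points) {
    my $x = log($p->[0]) / log(10);    # log10 T
    my $y = log($p->[1]) / log(10);    # log10 M
    $sx += $x; $sy += $y; $sxx += $x * $x; $sxy += $x * $y;
}
my $b    = ($n * $sxy - $sx * $sy) / ($n * $sxx - $sx * $sx);   # Heaps exponent b
my $logk = ($sy - $b * $sx) / $n;                               # intercept log10 k
printf "b = %.2f, k = %.1f\n", $b, 10 ** $logk;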

5 / 47

slide-6
SLIDE 6

http://en.wikipedia.org/wiki/Zipf's_law

Zipf's law: the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word, etc.

Brown Corpus: "the": 7% of all word occurrences (69,971 of >1M). "of": ∼3.5% of words (36,411). "and": 2.9% (28,852). Only 135 vocabulary items account for half the Brown Corpus.

The Brown University Standard Corpus of Present-Day American English is a carefully compiled selection of current American English, totaling about a million words drawn from a wide variety of sources . . . for many years among the most-cited resources in the field.

6 / 47

slide-7
SLIDE 7

Zipf’s law for Reuters

[Log-log plot for Reuters-RCV1: collection frequency (log10 cf) vs. rank (log10 rank), with a power-law fit line.]

Fit far from perfect, but nonetheless key insight: Few frequent terms, many rare terms.

7 / 47

slide-8
SLIDE 8

more from http://en.wikipedia.org/wiki/Zipf's_law

"A plot of word frequency in Wikipedia (27 Nov 2006). The plot is in log-log coordinates. x is the rank of a word in the frequency table; y is the total number of the word's occurrences. The most popular words are "the", "of" and "and", as expected. Zipf's law corresponds to the upper linear portion of the curve, roughly following the green (1/x) line."

8 / 47

slide-9
SLIDE 9

Another wikipedia count, 15 May 2010

http://imonad.com/seo/wikipedia-word-frequency-list/ “Word frequency distribution follows Zipf’s law”

9 / 47

slide-10
SLIDE 10

Collection Freq in perl

#!/usr/bin/perl
# Usage: cat test/file*.txt | collection_freq.pl
while (<>) {
    chomp;
    tr/[A-Z]/[a-z]/;                        # lowercase each line
    foreach my $word (split) {
        next if $word !~ /^[a-z]/;          # keep only tokens starting with a letter
        ++$cf{$word};                       # collection frequency count
    }
}
# print terms by decreasing collection frequency, with rank
foreach my $word (sort { $cf{$b} <=> $cf{$a} } keys %cf) {
    print ++$m, " $word $cf{$word}\n";
}

10 / 47

slide-11
SLIDE 11

Collection Frequencies 1 the 918 2 of 551 3 to 476 4 and 444 5 in 364 6 a 363 7 that 212 8 for 170 9 is 152 10 on 135 11 are 106 12 as 100 13 with 98 14 at 94 15 from 91 16 will 84 17 by 79 18 have 78 19 said 77 20 it 74 21 be 71 22 this 67 23 an 62 24 was 57 25 new 53 26 has 53 27 not 53 28 its 52 29 they 50 30 were 50 31 their 48 32 we 46 33 other 45 34 or 42 35 one 41 36 can 38 37 more 38 38 which 38 39 about 37 40 but 36 41 been 35 42 first 35 43 had 33 44 brain 32 45 these 31 46 cells 29 47 who 26 48 space 26 49 up 26 50 what 25 51 development 25 52 genes 25 53 also 25 54 all 24 55 than 24 56 data 24 57 into 24 58 some 24 59 now 24 60 million 23 61 you 23 62 over 23 63 most 22 64 between 22 65 found 22 66 like 22 67 time 22 68 cell 22 69 way 21 70 when 21 71 may 21 72 how 21 · · · 129 off 12 130 understanding 12 131 several 12 132 interactions 12 133 if 12 134 neurons 12 135 carbon 12 136 spacecraft 12 137 science 12 138 must 12 139 launch 12 140 them 12 141 gene 12 142 human 12 143 mammalian 12 144 he 11 145 known 11 146 both 11 147 get 11 148 appear 11 149 many 11 150 life 11 151 say 11 152 together 11 153 observations 11 154 radio 11 155 feature 11 156 her 11 157 where 11 158 percent 11 159 then 11 160 changes 11 161 aphids 11 162 make 11 163 news 11 164 do 11 · · · 190 before 10 191 going 9 192 video 9 193 working 9 194 business 9 195 mobile 9 196 sequences 9 197 map 9 198 increase 9 199 own 9 200 possible 9 201 later 9 202 lead 9 203 stories 9 204 professor 9 205 each 9 206 your 9 207 court 9 208 report 9 209 robot 9 210 analysis 9 211 next 9 212 similar 9 213 same 9 214 since 9 215 done 9 216 early 9 217 used 9 218 event 9 219 his 9 220 methylation 9 221 while 9 222 buckyballs 9 223 browser 9 224 health 9 225 temperature 9

· · ·

3307 circular 1 3308 division 1 3309 range 1 3310 quarter 1 3311 informed 1 3312 transmitter 1 3313 jury 1 3314 medium 1 3315 root 1 3316 severe 1 3317 beta 1 3318 transmission 1 3319 repair 1 3320 implies 1 3321 remained 1 3322 declined 1 3323 doldrums 1 3324 sudden 1 3325 perspective 1 3326 community 1 3327 catalyze 1 3328 answers 1 3329 represents 1 3330 primary 1 3331 statistically 1 3332 absent 1 3333 availability 1 3334 modifications 1 3335 picture 1 3336 competition 1 3337 requests 1 3338 thin 1 3339 seriously 1 3340 analyze 1 3341 candidate 1 3342 clearer 1 11 / 47

slide-12
SLIDE 12

Collection Frequency vs Rank

30 document test data for Assignment 1 (Aug ’10 from nasa/cnet/bio-news, T=17428 total tokens, M=3342 distinct, 1856 appear once [1487–3342])

[Two plots of collection frequency cf vs. rank n for the 30-document test set, one with linear axes and one with log-log axes; the most frequent terms (the, and, in, to, for, that, is, are, a) are labeled, and "brain" appears at rank 44 with cf 32.]

Note: "brain" occurs in 4 documents (doc 13: 11 times, doc 16: 10, doc 17: 9, doc 20: 2); capitalized occurrences were accidentally discarded.

12 / 47

slide-13
SLIDE 13

Document Frequencies 1 a 30 2 and 30 3 in 30 4 is 30 5 of 30 6 the 30 7 to 30 8 by 29 9 for 29 10 on 29 11 that 29 12 with 29 13 are 28 14 as 28 15 from 28 16 this 28 17 at 26 18 have 26 19 an 25 20 be 25 21 it 25 22 will 24 23 has 23 24 its 23 25 but 22 26 one 22 27 other 22 28 said 22 29 their 22 30 was 22 31 we 22 32 new 21 33 or 21 34 were 21 35 august 20 36 more 20 37 not 20 38 they 20 39 about 19 40 been 19 41 first 19 42 when 19 43 which 18 44 also 17 45 can 17 46 these 17 47 may 16 48 now 16 49 who 16 50 between 15 51 most 15 52 well 15 53 had 14 54 like 14 55 over 14 56 some 14 57 those 14 58 up 14 59 what 14 60 would 14 61 all 13 62 including 13 63 only 13 64 our 13 65 so 13 66 such 13 67 than 13 68 time 13 69 world 13 70 according 12 71 both 12 72 could 12 · · · 110 you 10 111 already 9 112 am 9 113 another 9 114 based 9 115 during 9 116 get 9 117 go 9 118 he 9 119 however 9 120 information 9 121 institute 9 122 international 9 123 journal 9 124 just 9 125 large 9 126 life 9 127 nasas 9 128 science 9 129 scientists 9 130 similar 9 131 since 9 132 space 9 133 system 9 134 understanding 9 135 use 9 136 used 9 137 way 9 138 where 9 139 aug 8 140 basic 8 141 before 8 142 different 8 143 down 8 144 even 8 145 future 8 146 his 8 · · · 164 addition 7 165 cell 7 166 changes 7 167 company 7 168 didnt 7 169 director 7 170 discovered 7 171 discovery 7 172 do 7 173 done 7 174 dont 7 175 each 7 176 few 7 177 focus 7 178 going 7 179 having 7 180 help 7 181 internet 7 182 later 7 183 launch 7 184 likely 7 185 make 7 186 many 7 187 member 7 188 might 7 189 national 7 190 news 7 191 patterns 7 192 percent 7 193 possible 7 194 power 7 195 program 7 196 published 7 197 researchers 7 198 say 7 199 several 7

· · ·

1351 aaron 1 1352 abandoning 1 1353 absent 1 1354 absolutely 1 1355 abundant 1 1356 academy 1 1357 accelerate 1 . . . 3987 wolfson 1 3988 woman 1 3989 wondering 1 3990 words 1 3991 worker 1 3992 workers 1 3993 worst 1 3994 worth 1 3995 wreak 1 3996 writing 1 3997 wrong 1 3998 year-old 1 3999 yearlong 1 4000 yield 1 4001 york 1 4002 youd 1 4003 youll 1 4004 younger 1 4005 youngest 1 4006 youre 1 4007 youtube 1 4008 zamponi 1 4009 zarya 1 4010 zdnet 1 4011 zhao 1 4012 zuckerberg 1 4013 zune 1 13 / 47
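The document frequencies above can be produced by a small variant of the collection_freq.pl script on slide 10 that counts each term at most once per input file. A hypothetical sketch (the script name and per-file command-line interface are assumptions, not part of the assignment handout):

#!/usr/bin/perl
# Usage (assumed): document_freq.pl test/file*.txt
# df_t = number of files in which term t occurs at least once.
use strict; use warnings;
my %df;
foreach my $file (@ARGV) {
    open(my $fh, '<', $file) or die "cannot open $file: $!";
    my %seen;                                  # terms already counted for this file
    while (<$fh>) {
        chomp;
        tr/A-Z/a-z/;
        foreach my $word (split) {
            next if $word !~ /^[a-z]/;         # keep only tokens starting with a letter
            ++$df{$word} unless $seen{$word}++; # count each term once per document
        }
    }
    close $fh;
}
my $rank = 0;
foreach my $word (sort { $df{$b} <=> $df{$a} } keys %df) {
    print ++$rank, " $word $df{$word}\n";
}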

slide-14
SLIDE 14

Discussion 1

Objective: explore three information retrieval systems (Bing, LOC, PubMed), and use each for the discovery task: "What is the medical evidence that cell phone usage can cause cancer?"

Some general questions and observations:
How to authenticate the information?
Is the information up to date? (How to find updated info?)
In what order are items returned? (By "relevance", but how is relevance determined: link analysis? tf.idf?)
Use results of a Bing search to refine vocabulary.

Assignment: everyone upload, as a test of CMS, the best reference found, and an outline of the strategy used to find it.

14 / 47

slide-15
SLIDE 15

Summary Notes

Sample queries:

cell phone cancer
“cell phones” cancer study
What is the medical evidence that cell phone usage can cause cancer (full query returned too much, e.g., country music from loc.gov)
cell phone canc* (also returned too much)

How to authenticate? By source, by date.
Is fielded search useful? (Gave too little for some queries, or still too much irrelevant material for others.)
NCBI found synonyms, e.g., neoplasm.
Exercise care with site search: site:loc.gov (not site:catalog.loc.gov).
What about news? (e.g., Lady Gaga doesn't use a cell phone for fear of developing cancer)
Do controversial items come up preferentially? (How do they determine the order of presentation of results? How much is determined by passive collaborative filtering, i.e., clickthrough analysis?)
More generally, how are results ranked? (And why are Google and Bing so similar?)

15 / 47

slide-16
SLIDE 16

Summary Notes: some URLs retrieved I

Cell Phones and Cancer Risk, National Cancer Institute

http://www.cancer.gov/cancertopics/factsheet/Risk/cellphones

Largest cellphone-cancer study to date clarifies little

http://arstechnica.com/gadgets/news/2010/05/largest-cellphone-cancer-study-to-date-clarifies-little.ars

via Google news:

Lady Gaga Refuses To Use A Cell Phone In Fear Of Developing Cancer

http://poponthepop.com/2010/08/lady-gaga-doesnt-understand-cancer/

Via Library of Congress search:

http://www.loc.gov/fedsearch/metasearch/?cclquery=cell+phone+cancer

Cell phones : invisible hazards in the wireless age : an insider’s alarming discoveries about cancer and genetic damage Carlo, George Louis ( 2001 )

16 / 47

slide-17
SLIDE 17

Summary Notes: some URLs retrieved II

cell phone cancer - PMC Results

http://www.ncbi.nlm.nih.gov/pmc?term=cell%20phone%20cancer

Cancer: Strong Signal for Cell Phone Effects. M. Nathaniel Mead. Environ Health Perspect. 2008 October; 116(10): A422.

Long-term use of cellular phones and brain tumours: increased risk associated with use for ≥10 years. Lennart Hardell, Michael Carlberg, Fredrik Söderqvist, and Kjell Hansson Mild. Occup Environ Med. 2007 September; 64(9): 626-632.

17 / 47

slide-18
SLIDE 18

Summary Notes: some URLs retrieved III

cell phone cancer - PubMed results

http://www.ncbi.nlm.nih.gov/pubmed?term=cell%20phone%20cancer

Risks of carcinogenesis from electromagnetic radiation of mobile telephony devices. Yakymenko I, Sidorik E. Exp Oncol. 2010 Jul;32(2):54-60.

Brain cancer incidence trends in relation to cellular telephone use in the United States. Inskip PD, Hoover RN, Devesa SS. Neuro Oncol. 2010 Jul 16.

One conclusion emerges from Interphone study: controversy will continue. Peres J. J Natl Cancer Inst. 2010 Jul 7;102(13):928-31.

http://jnci.oxfordjournals.org/cgi/content/full/102/13/928 (“Journal of the National Cancer Institute . . . published twice monthly by Oxford University Press and is not affiliated with the United States National Cancer Institute”)

18 / 47

slide-19
SLIDE 19

Outline

1. Recap
2. Why ranked retrieval?
3. Term frequency
4. tf.idf weighting
5. The vector space model

19 / 47

slide-20
SLIDE 20

Ranked retrieval

Thus far, our queries have all been Boolean.

Documents either match or don’t.

Good for expert users with a precise understanding of their needs and of the collection. Also good for applications: applications can easily consume 1000s of results. Not good for the majority of users: most users are not capable of writing Boolean queries . . .

. . . or they are, but they think it’s too much work.

Most users don’t want to wade through 1000s of results. This is particularly true of web search.

20 / 47

slide-21
SLIDE 21

Problem with Boolean search: Feast or famine

Boolean queries often result in either too few (=0) or too many (1000s) results. Query 1: “standard user dlink 650”

→ 200,000 hits – feast!

Query 2: “standard user dlink 650 no card found”

→ 0 hits – famine!

It takes a lot of skill to come up with a query that produces a manageable number of hits.

21 / 47

slide-22
SLIDE 22

Feast or famine: No problem in ranked retrieval

With ranking, large result sets are not an issue. Just show the top 10 results; this doesn't overwhelm the user. Premise: the ranking algorithm works.

22 / 47

slide-23
SLIDE 23

Scoring as the basis of ranked retrieval

We wish to return in order the documents most likely to be useful to the searcher. How can we rank-order the documents in the collection with respect to a query? Assign a score to each query-document pair, say in [0, 1]. This score measures how well document and query “match”.

23 / 47

slide-24
SLIDE 24

Query-document matching scores

How do we compute the score of a query-document pair? Let's start with a one-term query. If the query term does not occur in the document, the score should be 0. The more frequent the query term in the document, the higher the score. We will look at a number of alternatives for doing this.

24 / 47

slide-25
SLIDE 25

Take 1: Jaccard coefficient

A commonly used measure of overlap of two sets. Let A and B be two sets. The Jaccard coefficient is

  jaccard(A, B) = |A ∩ B| / |A ∪ B|   (for A ≠ ∅ or B ≠ ∅)

jaccard(A, A) = 1, and jaccard(A, B) = 0 if A ∩ B = ∅. A and B don't have to be the same size. It always assigns a number between 0 and 1.

25 / 47

slide-26
SLIDE 26

Jaccard coefficient: Example

What is the query-document match score that the Jaccard coefficient computes for:

Query: “ides of March” Document “Caesar died in March” ?
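A minimal Perl sketch (illustrative, not from the slides) that computes the Jaccard coefficient of the two word sets; here the sets share only "march" out of six distinct words in the union, so the score is 1/6 ≈ 0.17:

#!/usr/bin/perl
# Jaccard coefficient of the word sets of a query and a document.
use strict; use warnings;

sub jaccard {
    my ($q_ref, $d_ref) = @_;                    # references to arrays of tokens
    my (%in_q, %in_d);
    $in_q{lc $_} = 1 for @$q_ref;
    $in_d{lc $_} = 1 for @$d_ref;
    my $inter = grep { $in_d{$_} } keys %in_q;   # |A intersect B|
    my %union = (%in_q, %in_d);
    my $uni   = scalar keys %union;              # |A union B|
    return $uni ? $inter / $uni : 0;
}

my @q = qw(ides of march);                       # query: "ides of March"
my @d = qw(caesar died in march);                # document: "Caesar died in March"
printf "%.3f\n", jaccard(\@q, \@d);              # prints 0.167 (= 1/6)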

26 / 47

slide-27
SLIDE 27

What’s wrong with Jaccard?

It doesn't consider term frequency (how many occurrences a term has). Rare terms are more informative than frequent terms; Jaccard doesn't consider this information. We need a more sophisticated way of normalizing for the length of a document. Later in this lecture, we'll use |A ∩ B| / √|A ∪ B| (cosine) . . .

. . . instead of |A ∩ B| / |A ∪ B| (Jaccard) for length normalization.

27 / 47

slide-28
SLIDE 28

Outline

1. Recap
2. Why ranked retrieval?
3. Term frequency
4. tf.idf weighting
5. The vector space model

28 / 47

slide-29
SLIDE 29

Recall: Binary incidence matrix

              Anthony and  Julius   The      Hamlet  Othello  Macbeth  . . .
              Cleopatra    Caesar   Tempest
Anthony       1            1        0        0       0        1
Brutus        1            1        0        1       0        0
Caesar        1            1        0        1       1        1
Calpurnia     0            1        0        0       0        0
Cleopatra     1            0        0        0       0        0
mercy         1            0        1        1       1        1
worser        1            0        1        1       1        0
. . .

Each document is represented by a binary vector ∈ {0, 1}^|V|.

29 / 47

slide-30
SLIDE 30

From now on, we will use the frequencies of terms

              Anthony and  Julius   The      Hamlet  Othello  Macbeth  . . .
              Cleopatra    Caesar   Tempest
Anthony       157          73       0        0       0        1
Brutus        4            157      0        2       0        0
Caesar        232          227      0        2       1        0
Calpurnia     0            10       0        0       0        0
Cleopatra     57           0        0        0       0        0
mercy         2            0        3        8       5        8
worser        2            0        1        1       1        5
. . .

Each document is represented by a count vector ∈ N^|V|.

30 / 47

slide-31
SLIDE 31

Bag of words model

We do not consider the order of words in a document. "John is quicker than Mary" and "Mary is quicker than John" are represented the same way. This is called a bag of words model. In a sense, this is a step back: the positional index was able to distinguish these two documents. We will look at "recovering" positional information later in this course. For now: bag of words model.
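A tiny Perl sketch (the bag() helper is illustrative, not from the slides) confirming that the two sentences above yield the same bag of word counts:

#!/usr/bin/perl
# Both sentences map to the same term counts under the bag of words model.
use strict; use warnings;

sub bag {
    my %count;
    $count{lc $_}++ for split ' ', $_[0];
    return join ' ', map { "$_:$count{$_}" } sort keys %count;
}

my $d1 = bag("John is quicker than Mary");
my $d2 = bag("Mary is quicker than John");
print $d1 eq $d2 ? "identical bags\n" : "different bags\n";   # prints "identical bags"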

31 / 47

slide-32
SLIDE 32

Term frequency tf

The term frequency tft,d of term t in document d is defined as the number of times that t occurs in d. We want to use tf when computing query-document match scores. But how? Raw term frequency is not what we want. A document with 10 occurrences of the term is more relevant than a document with one occurrence of the term. But not 10 times more relevant. Relevance does not increase proportionally with term frequency.

32 / 47

slide-33
SLIDE 33

Log frequency weighting

The log frequency weight of term t in d is defined as follows:

  w_{t,d} = 1 + log10 tf_{t,d}   if tf_{t,d} > 0
  w_{t,d} = 0                    otherwise

tf_{t,d} → w_{t,d}: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.

Score for a document-query pair: sum over terms t appearing in both q and d:

  matching-score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_{t,d})

The score is 0 if none of the query terms is present in the document.
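A sketch of this matching score in Perl (the document's term-frequency hash and the query below are assumed for illustration; Perl's built-in log is natural, so log10 x is computed as log(x)/log(10)):

#!/usr/bin/perl
# matching-score(q,d) = sum over shared terms of (1 + log10 tf).
use strict; use warnings;

sub log_tf_score {
    my ($query, $tf) = @_;        # $query: array ref of terms; $tf: hash ref term => count
    my $score = 0;
    for my $t (@$query) {
        my $f = $tf->{$t} or next;             # skip query terms absent from the document
        $score += 1 + log($f) / log(10);       # 1 + log10(tf)
    }
    return $score;
}

my %tf = (caesar => 2, died => 1, in => 1, march => 1);    # assumed document counts
printf "%.2f\n", log_tf_score([qw(ides of march)], \%tf);  # only 'march' matches: 1.00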

33 / 47

slide-34
SLIDE 34

Outline

1. Recap
2. Why ranked retrieval?
3. Term frequency
4. tf.idf weighting
5. The vector space model

34 / 47

slide-35
SLIDE 35

Desired weight for rare terms

Rare terms are more informative than frequent terms. Consider a term in the query that is rare in the collection (e.g., arachnocentric). A document containing this term is very likely to be relevant. → We want high weights for rare terms like arachnocentric.

35 / 47

slide-36
SLIDE 36

Desired weight for frequent terms

Frequent terms are less informative than rare terms. Consider a term in the query that is frequent in the collection (e.g., good, increase, line). A document containing this term is more likely to be relevant than a document that doesn’t . . . . . . but it’s not a sure indicator of relevance. → For frequent terms, we want positive weights for words like good, increase, and line . . . . . . but lower weights than for rare terms.

36 / 47

slide-37
SLIDE 37

Document frequency

We want high weights for rare terms like arachnocentric. We want low (positive) weights for frequent words like good, increase, and line. We will use document frequency to factor this into computing the matching score. The document frequency is the number of documents in the collection in which the term occurs.

37 / 47

slide-38
SLIDE 38

idf weight

df_t is the document frequency, the number of documents that t occurs in. df_t is an inverse measure of the informativeness of term t. We define the idf weight of term t as follows:

  idf_t = log10(N / df_t)

idf_t is a measure of the informativeness of the term. We use log10(N/df_t) instead of N/df_t to "dampen" the effect of idf. So we use the log transformation for both term frequency and document frequency.

38 / 47

slide-39
SLIDE 39

Examples for idf

Compute idf_t using the formula idf_t = log10(1,000,000 / df_t):

  term        df_t        idf_t
  calpurnia   1           6
  animal      100         4
  sunday      1000        3
  fly         10,000      2
  under       100,000     1
  the         1,000,000   0
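The table follows directly from the formula; a short Perl check (using the same N and df values as the slide) reproduces the idf column:

#!/usr/bin/perl
# idf_t = log10(N / df_t) for N = 1,000,000.
use strict; use warnings;
my $N = 1_000_000;
for my $df (1, 100, 1_000, 10_000, 100_000, 1_000_000) {
    printf "df = %9d   idf = %.0f\n", $df, log($N / $df) / log(10);
}
# prints idf = 6, 4, 3, 2, 1, 0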

39 / 47

slide-40
SLIDE 40

Effect of idf on ranking

idf affects the ranking of documents for queries with at least two terms. For example, in the query “arachnocentric line”, idf weighting increases the relative weight of arachnocentric and decreases the relative weight of line. idf has no effect on ranking for one-term queries.

40 / 47

slide-41
SLIDE 41

Collection frequency vs. Document frequency

  Word        Collection frequency   Document frequency
  insurance   10440                  3997
  try         10422                  8760

The collection frequency of t is the number of tokens of t in the collection, counting multiple occurrences. Why these numbers? Which word is a better search term (and should get a higher weight)? This example suggests that df (and idf) is better for weighting than cf (and "icf").

41 / 47

slide-42
SLIDE 42

tf.idf weighting

The tf.idf weight of a term is the product of its tf weight and its idf weight:

  w_{t,d} = (1 + log10 tf_{t,d}) · log10(N / df_t)

(the first factor is the tf weight, the second the idf weight)

Best known weighting scheme in information retrieval. Alternative names: tf-idf, tf x idf, tf/idf. Note: the "-" in tf-idf is a hyphen, not a minus sign!
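A minimal Perl helper for this weight (the subroutine name and the example numbers are illustrative assumptions):

#!/usr/bin/perl
# w(t,d) = (1 + log10 tf) * log10(N / df), with weight 0 when tf = 0.
use strict; use warnings;

sub tfidf_weight {
    my ($tf, $df, $N) = @_;
    return 0 if $tf == 0 || $df == 0;
    return (1 + log($tf) / log(10)) * (log($N / $df) / log(10));
}

# a term with tf = 10 in the document and df = 1,000 out of N = 1,000,000 documents:
printf "%.2f\n", tfidf_weight(10, 1_000, 1_000_000);   # (1 + 1) * 3 = 6.00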

42 / 47

slide-43
SLIDE 43

Summary: tf.idf

Assign a tf.idf weight to each term t in each document d:

  w_{t,d} = (1 + log10 tf_{t,d}) · log10(N / df_t)

N: total number of documents

The tf.idf weight . . .
. . . increases with the number of occurrences within a document (term frequency);
. . . increases with the rarity of the term in the collection (inverse document frequency).

43 / 47

slide-44
SLIDE 44

Term, collection and document frequency

  Quantity               Symbol     Definition
  term frequency         tf_{t,d}   number of occurrences of t in d
  document frequency     df_t       number of documents in the collection in which t occurs
  collection frequency   cf_t       total number of occurrences of t in the collection

Relationship between df and cf? Relationship between tf and cf? Relationship between tf and df?

44 / 47

slide-45
SLIDE 45

Outline

1. Recap
2. Why ranked retrieval?
3. Term frequency
4. tf.idf weighting
5. The vector space model

45 / 47

slide-46
SLIDE 46

Binary → count → weight matrix

              Anthony and  Julius   The      Hamlet  Othello  Macbeth  . . .
              Cleopatra    Caesar   Tempest
Anthony       5.25         3.18     0.0      0.0     0.0      0.35
Brutus        1.21         6.10     0.0      1.0     0.0      0.0
Caesar        8.59         2.54     0.0      1.51    0.25     0.0
Calpurnia     0.0          1.54     0.0      0.0     0.0      0.0
Cleopatra     2.85         0.0      0.0      0.0     0.0      0.0
mercy         1.51         0.0      1.90     0.12    5.25     0.88
worser        1.37         0.0      0.11     4.15    0.25     1.95
. . .

Each document is now represented by a real-valued vector of tf.idf weights ∈ R^|V|.

46 / 47

slide-47
SLIDE 47

Documents as vectors

Each document is now represented by a real-valued vector of tf.idf weights ∈ R^|V|. So we have a |V|-dimensional real-valued vector space. Terms are axes of the space. Documents are points or vectors in this space. Very high-dimensional: tens of millions of dimensions when you apply this to web search engines. This is a very sparse vector: most entries are zero. Not only useful for scoring documents on a query, but also essential for document classification and document clustering.
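As a sketch of the representation (reusing the hypothetical tfidf_weight helper from the slide 42 example, with made-up toy counts), a document's sparse vector can be stored as a Perl hash from terms to their nonzero tf.idf weights:

#!/usr/bin/perl
# Build a sparse tf.idf vector (term => weight) for one document.
use strict; use warnings;

sub tfidf_weight {
    my ($tf, $df, $N) = @_;
    return 0 if $tf == 0 || $df == 0;
    return (1 + log($tf) / log(10)) * (log($N / $df) / log(10));
}

my $N  = 30;                                   # documents in the (toy) collection
my %tf = (brain => 3, cells => 2, the => 12);  # term counts in this document (assumed)
my %df = (brain => 4, cells => 6, the => 30);  # document frequencies (assumed)

my %vec;
for my $t (keys %tf) {
    my $w = tfidf_weight($tf{$t}, $df{$t}, $N);
    $vec{$t} = $w if $w > 0;                   # keep only nonzero entries (sparse)
}
printf "%-6s %.2f\n", $_, $vec{$_} for sort keys %vec;
# 'the' has idf = 0 (df = N) and is dropped from the sparse vector.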

47 / 47