

slide-1
SLIDE 1

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze’s, linked from http://informationretrieval.org/

IR 4: Scoring, Term Weighting, The Vector Space Model I

Paul Ginsparg

Cornell University, Ithaca, NY

6 Sep 2011

1 / 52

slide-2
SLIDE 2

Administrativa

Course Webpage: http://www.infosci.cornell.edu/Courses/info4300/2011fa/
Assignment 1. Posted: 2 Sep, Due: Sun, 18 Sep
Lectures: Tuesday and Thursday 11:40-12:55, Kimball B11
Instructor: Paul Ginsparg, ginsparg@..., 255-7371, Physical Sciences Building 452
Instructor’s Office Hours: Wed 1-2pm, Fri 2-3pm, or e-mail instructor to schedule an appointment
Teaching Assistant: Saeed Abdullah, office hour Fri 3:30pm-4:30pm in the small conference room (133) at 301 College Ave, and by email, use cs4300-l@lists.cs.cornell.edu
Course text at: http://informationretrieval.org/

Introduction to Information Retrieval, C. Manning, P. Raghavan, H. Schütze

see also

Information Retrieval, S. Büttcher, C. Clarke, G. Cormack

http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=12307

2 / 52

slide-3
SLIDE 3

Overview

1. Recap

2. Why ranked retrieval?

3. Term frequency

4. tf.idf weighting

5. The vector space model

3 / 52

slide-4
SLIDE 4

Outline

1. Recap

2. Why ranked retrieval?

3. Term frequency

4. tf.idf weighting

5. The vector space model

4 / 52

slide-5
SLIDE 5

Heaps’ law for Reuters

[Figure: log10 M vs. log10 T for Reuters-RCV1, with the least-squares fit shown as a dashed line]

Vocabulary size M as a function of collection size T (number of tokens) for Reuters-RCV1. For these data, the dashed line log10 M = 0.49 · log10 T + 1.64 is the best least squares fit. Thus M = 10^1.64 · T^0.49, i.e. k = 10^1.64 ≈ 44 and b = 0.49, giving M = kT^b = 44 T^0.49.
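The fitted constants can be turned into a quick numerical sanity check: with k = 44 and b = 0.49 as above, Heaps’ law predicts the vocabulary size for a given number of tokens. A minimal sketch (the function name is illustrative):

```python
def heaps_vocab(T, k=44.0, b=0.49):
    """Heaps' law: predicted vocabulary size M = k * T**b for T tokens."""
    return k * T ** b

# For roughly the first million tokens of RCV1, the fit predicts a
# vocabulary of about 38,000 distinct terms.
print(round(heaps_vocab(1_000_000)))
```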

5 / 52

slide-6
SLIDE 6

http://en.wikipedia.org/wiki/Zipf%27s_law

Zipf’s law: the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word, etc.

Brown Corpus: “the”: 7% of all word occurrences (69,971 of >1M). “of”: ∼3.5% of words (36,411). “and”: 2.9% (28,852). Only 135 vocabulary items account for half the Brown Corpus.

The Brown University Standard Corpus of Present-Day American English is a carefully compiled selection of current American English, totaling about a million words drawn from a wide variety of sources . . . for many years among the most-cited resources in the field.
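The Brown Corpus counts quoted above can be checked against the law directly: taking “the” (rank 1, 69,971 occurrences) as the scale, Zipf predicts rank r to occur about 69,971/r times. A minimal sketch (the fit is only approximate):

```python
def zipf_predicted(rank, top_count=69_971):
    """Zipf's law: frequency at rank r is about frequency at rank 1, divided by r."""
    return top_count / rank

# Brown Corpus counts from the slide: "of" (rank 2), "and" (rank 3)
for rank, actual in [(2, 36_411), (3, 28_852)]:
    print(rank, actual, round(zipf_predicted(rank)))
```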

6 / 52

slide-7
SLIDE 7

Zipf’s law for Reuters

[Figure: log10 cf vs. log10 rank for Reuters-RCV1]

Fit far from perfect, but nonetheless key insight: Few frequent terms, many rare terms.

7 / 52

slide-8
SLIDE 8

more from http://en.wikipedia.org/wiki/Zipf%27s_law

“A plot of word frequency in Wikipedia (27 Nov 2006). The plot is in log-log coordinates. x is rank of a word in the frequency table; y is the total number of the word’s occurrences. Most popular words are “the”, “of” and “and”, as expected. Zipf’s law corresponds to the upper linear portion of the curve, roughly following the green (1/x) line.”

8 / 52

slide-9
SLIDE 9

Another wikipedia count, 15 May 2010

http://imonad.com/seo/wikipedia-word-frequency-list/ “Word frequency distribution follows Zipf’s law”

9 / 52

slide-10
SLIDE 10

Assignment 1 posted

See

http://www.infosci.cornell.edu/Courses/info4300/2011fa/assignment1.html

40 text files from four sources: nasa.gov, biologynews.net, news.cnet.com, sciencedaily.com
file00.txt . . . file39.txt
Due 18 Sep 2011

10 / 52

slide-11
SLIDE 11

Collection Freq in perl

#!/usr/bin/perl
# Usage: cat test/file*.txt | collection_freq.pl
while (<>) {
    chomp;
    tr/A-Z/a-z/;
    foreach my $word (split) {
        next if $word !~ /^[a-z]/;
        ++$cf{$word};
    }
}
foreach my $word (sort { $cf{$b} <=> $cf{$a} } keys %cf) {
    print ++$m, " $word $cf{$word}\n";
}
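For comparison, a rough Python equivalent of the Perl script above (same idea: lowercase the input, keep only tokens that start with a letter, print words by decreasing collection frequency):

```python
import re
import sys
from collections import Counter

def collection_freq(lines):
    """Collection frequency of each word across all input lines."""
    cf = Counter()
    for line in lines:
        for word in line.lower().split():
            if re.match(r"[a-z]", word):  # skip tokens not starting with a letter
                cf[word] += 1
    return cf

if __name__ == "__main__":
    # Usage: cat test/file*.txt | python collection_freq.py
    for rank, (word, count) in enumerate(collection_freq(sys.stdin).most_common(), 1):
        print(rank, word, count)
```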

11 / 52

slide-12
SLIDE 12

Collection Frequencies (from 30 file dataset 2010) 1 the 918 2 of 551 3 to 476 4 and 444 5 in 364 6 a 363 7 that 212 8 for 170 9 is 152 10 on 135 11 are 106 12 as 100 13 with 98 14 at 94 15 from 91 16 will 84 17 by 79 18 have 78 19 said 77 20 it 74 21 be 71 22 this 67 23 an 62 24 was 57 25 new 53 26 has 53 27 not 53 28 its 52 29 they 50 30 were 50 31 their 48 32 we 46 33 other 45 34 or 42 35 one 41 36 can 38 37 more 38 38 which 38 39 about 37 40 but 36 41 been 35 42 first 35 43 had 33 44 brain 32 45 these 31 46 cells 29 47 who 26 48 space 26 49 up 26 50 what 25 51 development 25 52 genes 25 53 also 25 54 all 24 55 than 24 56 data 24 57 into 24 58 some 24 59 now 24 60 million 23 61 you 23 62 over 23 63 most 22 64 between 22 65 found 22 66 like 22 67 time 22 68 cell 22 69 way 21 70 when 21 71 may 21 72 how 21 · · · 129 off 12 130 understanding 12 131 several 12 132 interactions 12 133 if 12 134 neurons 12 135 carbon 12 136 spacecraft 12 137 science 12 138 must 12 139 launch 12 140 them 12 141 gene 12 142 human 12 143 mammalian 12 144 he 11 145 known 11 146 both 11 147 get 11 148 appear 11 149 many 11 150 life 11 151 say 11 152 together 11 153 observations 11 154 radio 11 155 feature 11 156 her 11 157 where 11 158 percent 11 159 then 11 160 changes 11 161 aphids 11 162 make 11 163 news 11 164 do 11 · · · 190 before 10 191 going 9 192 video 9 193 working 9 194 business 9 195 mobile 9 196 sequences 9 197 map 9 198 increase 9 199 own 9 200 possible 9 201 later 9 202 lead 9 203 stories 9 204 professor 9 205 each 9 206 your 9 207 court 9 208 report 9 209 robot 9 210 analysis 9 211 next 9 212 similar 9 213 same 9 214 since 9 215 done 9 216 early 9 217 used 9 218 event 9 219 his 9 220 methylation 9 221 while 9 222 buckyballs 9 223 browser 9 224 health 9 225 temperature 9

· · ·

3307 circular 1 3308 division 1 3309 range 1 3310 quarter 1 3311 informed 1 3312 transmitter 1 3313 jury 1 3314 medium 1 3315 root 1 3316 severe 1 3317 beta 1 3318 transmission 1 3319 repair 1 3320 implies 1 3321 remained 1 3322 declined 1 3323 doldrums 1 3324 sudden 1 3325 perspective 1 3326 community 1 3327 catalyze 1 3328 answers 1 3329 represents 1 3330 primary 1 3331 statistically 1 3332 absent 1 3333 availability 1 3334 modifications 1 3335 picture 1 3336 competition 1 3337 requests 1 3338 thin 1 3339 seriously 1 3340 analyze 1 3341 candidate 1 3342 clearer 1 12 / 52

slide-13
SLIDE 13

Collection Frequency vs Rank

30 document test data for Assignment 1 (Aug ’10 from nasa/cnet/bio-news, T=17428 total tokens, M=3342 distinct, 1856 appear once [1487–3342])

[Figures: collection frequency cf vs. rank n, on linear and log-log axes; the top-ranked terms (the, and, of, to, in, a, that, for, is, are) are labeled, down to “brain” at (rank 44, cf 32)]

13 / 52

slide-14
SLIDE 14

Document Frequencies (from 30 file dataset 2010) 1 a 30 2 and 30 3 in 30 4 is 30 5 of 30 6 the 30 7 to 30 8 by 29 9 for 29 10 on 29 11 that 29 12 with 29 13 are 28 14 as 28 15 from 28 16 this 28 17 at 26 18 have 26 19 an 25 20 be 25 21 it 25 22 will 24 23 has 23 24 its 23 25 but 22 26 one 22 27 other 22 28 said 22 29 their 22 30 was 22 31 we 22 32 new 21 33 or 21 34 were 21 35 august 20 36 more 20 37 not 20 38 they 20 39 about 19 40 been 19 41 first 19 42 when 19 43 which 18 44 also 17 45 can 17 46 these 17 47 may 16 48 now 16 49 who 16 50 between 15 51 most 15 52 well 15 53 had 14 54 like 14 55 over 14 56 some 14 57 those 14 58 up 14 59 what 14 60 would 14 61 all 13 62 including 13 63 only 13 64 our 13 65 so 13 66 such 13 67 than 13 68 time 13 69 world 13 70 according 12 71 both 12 72 could 12 · · · 110 you 10 111 already 9 112 am 9 113 another 9 114 based 9 115 during 9 116 get 9 117 go 9 118 he 9 119 however 9 120 information 9 121 institute 9 122 international 9 123 journal 9 124 just 9 125 large 9 126 life 9 127 nasas 9 128 science 9 129 scientists 9 130 similar 9 131 since 9 132 space 9 133 system 9 134 understanding 9 135 use 9 136 used 9 137 way 9 138 where 9 139 aug 8 140 basic 8 141 before 8 142 different 8 143 down 8 144 even 8 145 future 8 146 his 8 · · · 164 addition 7 165 cell 7 166 changes 7 167 company 7 168 didnt 7 169 director 7 170 discovered 7 171 discovery 7 172 do 7 173 done 7 174 dont 7 175 each 7 176 few 7 177 focus 7 178 going 7 179 having 7 180 help 7 181 internet 7 182 later 7 183 launch 7 184 likely 7 185 make 7 186 many 7 187 member 7 188 might 7 189 national 7 190 news 7 191 patterns 7 192 percent 7 193 possible 7 194 power 7 195 program 7 196 published 7 197 researchers 7 198 say 7 199 several 7

· · ·

1351 aaron 1 1352 abandoning 1 1353 absent 1 1354 absolutely 1 1355 abundant 1 1356 academy 1 1357 accelerate 1 . . . 3987 wolfson 1 3988 woman 1 3989 wondering 1 3990 words 1 3991 worker 1 3992 workers 1 3993 worst 1 3994 worth 1 3995 wreak 1 3996 writing 1 3997 wrong 1 3998 year-old 1 3999 yearlong 1 4000 yield 1 4001 york 1 4002 youd 1 4003 youll 1 4004 younger 1 4005 youngest 1 4006 youre 1 4007 youtube 1 4008 zamponi 1 4009 zarya 1 4010 zdnet 1 4011 zhao 1 4012 zuckerberg 1 4013 zune 1 14 / 52

slide-15
SLIDE 15

Discussion 1

Objective: explore three information retrieval systems (Bing, LOC, PubMed), and use each for the discovery task: “What is the medical evidence that vaccines can cause autism?”
Some general questions and observations:
How to authenticate the information?
Is the information up to date? (how to find updated info?)
In what order are items returned? (by “relevance”, but how is relevance determined: link analysis? tf.idf?)
Use query completion to refine vocabulary: e.g., “autism vaccine” → “. . . fraud, link” (google) → “. . . link controversy lawsuit study hoax debate” (bing) → “. . . controversy link wakefield” (pubmed)

15 / 52

slide-16
SLIDE 16

Summary Notes

Sample queries:

What is the medical evidence that vaccines can cause autism? (full query returned too much, e.g., weapons, war criminals, anarchists on trial from loc.gov)
in quotes: only this course’s webpages on google
vaccine autism (note the variants: vaccine/vaccines, cause/causes)

How to authenticate? By source, by date.
Is fielded search useful? (gave too little for some queries, or still too much irrelevant for others)
Exercise care with site search: site:loc.gov (not site:catalog.loc.gov)
What about news? Do controversial items come up preferentially? (how do they determine order of presentation of results? How much is determined by passive collaborative filtering, i.e. clickthrough analysis?)
More generally, how are results ranked? (and why are Google and Bing so similar?)

16 / 52

slide-17
SLIDE 17

Summary Notes: some URLs retrieved Ia (Google/Bing)

Recent news at top (cf. WTC ’01)

http://www.cnn.com/2011/HEALTH/01/05/autism.vaccines/index.html

Retracted autism study an ’elaborate fraud,’ British journal finds

http://www.sciencefriday.com/blog/2011/01/the-great-autism-vaccine-fraud/

(5 Jan 2011) The scandal that is the anti-vaccination movement may have finally reached its nadir with news that the seminal paper linking childhood vaccinations to autism is a complete fake.

http://www.webmd.com/brain/autism/searching-for-answers/vaccines-autism

Autism-Vaccine Link: Evidence Doesn’t Dispel Doubts (31 Mar 2008, with difficulty)

17 / 52

slide-18
SLIDE 18

Summary Notes: some URLs retrieved Ib (Google/Bing)

http://www.nytimes.com/2011/08/26/health/26vaccine.html

Vaccine Cleared Again as Autism Culprit

(25 Aug, 2011) Yet another panel of scientists has found no evidence that a popular vaccine causes autism. But despite the scientists’ best efforts, their report is unlikely to have any impact on the frustrating debate about the safety of these crucial medicines. “The M.M.R. vaccine doesn’t cause autism, and the evidence is overwhelming that it doesn’t,”

Dr. Ellen Wright Clayton, the chairwoman of the panel, assembled by the

Institute of Medicine, said in an interview. http://www.nbcsandiego.com/news/health/ (6 Jan 2011)

Dr-Predicts-Autism-Vaccine-Hoax-News-Will-Have-Little-Effect-113039689.html

Dr. Predicts Autism-Vaccine Hoax News Has Little Effect

5th result for ‘autism vaccine link’, 2nd for ‘autism vaccine’: MMR vaccine controversy – Wikipedia, the free encyclopedia http://en.wikipedia.org/wiki/MMR_vaccine_controversy

18 / 52

slide-19
SLIDE 19

Wikipedia: “MMR vaccine controversy”

http://en.wikipedia.org/wiki/MMR_vaccine_controversy

The MMR vaccine controversy refers to the claims that autism spectrum disorders can be caused by the MMR vaccine, an immunization against measles, mumps and rubella. Claims of a connection between the vaccine and autism were raised in a 1998 paper in The Lancet, a respected British medical journal.[1] Investigation by Sunday Times journalist Brian Deer discovered that the lead author of the article, Andrew Wakefield, had multiple undeclared conflicts of interest,[2][3] had manipulated evidence,[4] and had broken other ethical codes. The Lancet paper was retracted, and Wakefield was found guilty by the General Medical Council of serious professional misconduct in May 2010 and was struck off the Medical Register, meaning he could no longer practice as a doctor.[5] The research was declared fraudulent in 2011 by the BMJ.[6] The scientific consensus is that no evidence links the vaccine to the development of autism, and that the vaccine’s benefits greatly outweigh its risks. Following the initial claims in 1998, multiple large epidemiological studies were undertaken. Reviews of the evidence by the Centers for Disease Control and Prevention,[7] the American Academy

of Pediatrics, the Institute of Medicine of the US National Academy of Sciences,[8]

the UK National Health Service,[9] and the Cochrane Library[10] all found no link between the vaccine and autism. While the Cochrane review expressed a need for improved design and reporting of safety outcomes in MMR vaccine studies, it concluded that the evidence of the safety and effectiveness of MMR in the prevention of diseases that still carry a heavy burden of morbidity and mortality justifies its global use, and that the lack of confidence in the vaccine has damaged public health.[10] A special court convened in the United States to review claims under the National Vaccine Injury Compensation Program rejected compensation claims from parents of autistic children.[11][12]

The claims in Wakefield’s 1998 The Lancet article were widely reported;[13] vaccination rates in the UK and Ireland dropped sharply,[14] which in turn led to greatly increased incidence of measles and mumps, resulting in a few deaths and some severe and permanent injuries.[15] Physicians, medical journals, and editors[16][17][18][19][20] have made statements tying Wakefield’s fraudulent actions to various epidemics and deaths.[21][22] 19 / 52

slide-20
SLIDE 20

Summary Notes: some URLs retrieved II (.gov)

http://www.cdc.gov/vaccinesafety/Concerns/Autism/Index.html

(Page last modified: January 15, 2010, but dated 14 May 2004)

http://www.nichd.nih.gov/publications/pubs/upload/autismMMR.pdf

(June 2001)
loc.gov: When Your Doctor Is Wrong: Hepatitis B Vaccine & Autism, Judy Converse. 2002

20 / 52

slide-21
SLIDE 21

Summary Notes: some URLs retrieved IIIa (PubMed)

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3136032/

The MMR vaccine and autism: Sensation, refutation, retraction, and fraud [From the struggle for freedom to the denial of evidence: history of the anti-vaccination movements in Europe]. Indian J Psychiatry. 2011 Apr-Jun; 53(2): 95–96.

http://www.ncbi.nlm.nih.gov/pubmed/21623535 A

positive association found between autism prevalence and childhood vaccination uptake across the U.S. population (2011).

G. Delong. J Toxicol Environ Health A. 2011 Jan;74(14):903-16.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2954080/ The autism-vaccine story: fiction and deception? Can Fam Physician. 2010 October; 56(10): 1013

Convincing evidence from multiple countries shows no association between MMR vaccine (or thimerosal) and autistic disorders. The origins

of this controversy incorporate unethical conduct and misleading research.

21 / 52

slide-22
SLIDE 22

Summary Notes: some URLs retrieved IIIb (PubMed)

http://www.ncbi.nlm.nih.gov/pubmed/20628442 Does thimerosal or other mercury exposure increase the risk for autism? A review of current literature. Acta Neurobiol Exp (Wars). 2010; 70(2): 187-95.

This report reviews current literature regarding the association of the pharmaceutical preservative thimerosal and other mercury exposures with the risk for autism. The evidence presented here does not support a causal association between autism and mercury exposure from the preservative thimerosal.

see also “related articles” in right margin Note also:

("autistic disorder"[MeSH Terms] OR ("autistic"[All Fields] AND "disorder"[All Fields]) OR "autistic disorder"[All Fields] OR "autism"[All Fields]) AND ("vaccines"[MeSH Terms] OR "vaccines"[All Fields] OR "vaccine"[All Fields])

22 / 52

slide-23
SLIDE 23

Summary Notes: some URLs retrieved V (blogs)

1. http://www.ageofautism.com/

(31 Aug 2011) Brian Deer is the reporter spearheading the current smear campaign against Dr. Wakefield, the lead author of the 1998 case series about children who developed autism and bowel problems after their MMR vaccinations. Deer falsely accused Wakefield of altering data.

2.

http://www.autismkey.com/vaccines-autism-and-the-squelching-of-the-non-believers/

23 / 52

slide-24
SLIDE 24

Outline

1

Recap

2

Why ranked retrieval?

3

Term frequency

4

tf.idf weighting

5

The vector space model

24 / 52

slide-25
SLIDE 25

Ranked retrieval

Thus far, our queries have all been Boolean.

Documents either match or don’t.

Good for expert users with precise understanding of their needs and the collection.
Also good for applications: applications can easily consume 1000s of results.
Not good for the majority of users.
Most users are not capable of writing Boolean queries . . .

. . . or they are, but they think it’s too much work.

Most users don’t want to wade through 1000s of results. This is particularly true of web search.

25 / 52

slide-26
SLIDE 26

Problem with Boolean search: Feast or famine

Boolean queries often result in either too few (=0) or too many (1000s) results. Query 1: “standard user dlink 650”

→ 200,000 hits – feast!

Query 2: “standard user dlink 650 no card found”

→ 0 hits – famine!

It takes a lot of skill to come up with a query that produces a manageable number of hits.

26 / 52

slide-27
SLIDE 27

Feast or famine: No problem in ranked retrieval

With ranking, large result sets are not an issue: just show the top 10 results; this doesn’t overwhelm the user. Premise: the ranking algorithm works.

27 / 52

slide-28
SLIDE 28

Scoring as the basis of ranked retrieval

We wish to return in order the documents most likely to be useful to the searcher. How can we rank-order the documents in the collection with respect to a query? Assign a score to each query-document pair, say in [0, 1]. This score measures how well document and query “match”.

28 / 52

slide-29
SLIDE 29

Query-document matching scores

How do we compute the score of a query-document pair? Let’s start with a one-term query. If the query term does not occur in the document: score should be 0. The more frequent the query term in the document, the higher the score We will look at a number of alternatives for doing this.

29 / 52

slide-30
SLIDE 30

Take 1: Jaccard coefficient

A commonly used measure of overlap of two sets. Let A and B be two sets. Jaccard coefficient:

jaccard(A, B) = |A ∩ B| / |A ∪ B|   (A ≠ ∅ or B ≠ ∅)

jaccard(A, A) = 1
jaccard(A, B) = 0 if A ∩ B = ∅
A and B don’t have to be the same size. Always assigns a number between 0 and 1.

30 / 52

slide-31
SLIDE 31

Jaccard coefficient: Example

What is the query-document match score that the Jaccard coefficient computes for:

Query: “ides of March”, Document: “Caesar died in March”?
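Working the example as sets of lowercased words (a minimal sketch): the two share only “march”, out of six distinct words overall, so the score is 1/6 ≈ 0.17.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 0.0  # avoid 0/0 when both sets are empty
    return len(a & b) / len(a | b)

query = set("ides of march".split())
doc = set("caesar died in march".split())
print(jaccard(query, doc))  # 1/6
```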

31 / 52

slide-32
SLIDE 32

What’s wrong with Jaccard?

It doesn’t consider term frequency (how many occurrences a term has). Rare terms are more informative than frequent terms; Jaccard doesn’t consider this information. We need a more sophisticated way of normalizing for the length of a document. Later in this lecture, we’ll use |A ∩ B| / √(|A ∪ B|) (cosine) . . .

. . . instead of |A ∩ B| / |A ∪ B| (Jaccard) for length normalization.

32 / 52

slide-33
SLIDE 33

Outline

1. Recap

2. Why ranked retrieval?

3. Term frequency

4. tf.idf weighting

5. The vector space model

33 / 52

slide-34
SLIDE 34

Recall: Binary incidence matrix

           Anthony    Julius   The       Hamlet   Othello   Macbeth   . . .
           and        Caesar   Tempest
           Cleopatra
Anthony    1          1        0         0        0         1
Brutus     1          1        0         1        0         0
Caesar     1          1        0         1        1         1
Calpurnia  0          1        0         0        0         0
Cleopatra  1          0        0         0        0         0
mercy      1          0        1         1        1         1
worser     1          0        1         1        1         0
. . .

Each document is represented by a binary vector ∈ {0, 1}^|V|.

34 / 52

slide-35
SLIDE 35

From now on, we will use the frequencies of terms

           Anthony    Julius   The       Hamlet   Othello   Macbeth   . . .
           and        Caesar   Tempest
           Cleopatra
Anthony    157        73       0         0        0         1
Brutus     4          157      0         2        0         0
Caesar     232        227      0         2        1         1
Calpurnia  0          10       0         0        0         0
Cleopatra  57         0        0         0        0         0
mercy      2          0        3         8        5         8
worser     2          0        1         1        1         5
. . .

Each document is represented by a count vector ∈ N^|V|.

35 / 52

slide-36
SLIDE 36

Bag of words model

We do not consider the order of words in a document. “John is quicker than Mary” and “Mary is quicker than John” are represented the same way. This is called a bag of words model. In a sense, this is a step back: the positional index was able to distinguish these two documents. We will look at “recovering” positional information later in this course. For now: bag of words model.
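The point is easy to see in code: with order discarded, both sentences map to the same multiset of terms. A minimal sketch:

```python
from collections import Counter

def bag_of_words(text):
    """A document as an unordered multiset (bag) of its lowercased words."""
    return Counter(text.lower().split())

print(bag_of_words("John is quicker than Mary") ==
      bag_of_words("Mary is quicker than John"))  # True: word order is lost
```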

36 / 52

slide-37
SLIDE 37

Term frequency tf

The term frequency tft,d of term t in document d is defined as the number of times that t occurs in d. We want to use tf when computing query-document match scores. But how? Raw term frequency is not what we want. A document with 10 occurrences of the term is more relevant than a document with one occurrence of the term. But not 10 times more relevant. Relevance does not increase proportionally with term frequency.

37 / 52

slide-38
SLIDE 38

Log frequency weighting

The log frequency weight of term t in d is defined as follows:

wt,d = 1 + log10 tft,d   if tft,d > 0
wt,d = 0                 otherwise

tft,d → wt,d: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.

Score for a document-query pair, summed over terms t in both q and d:

matching-score(q, d) = Σt∈q∩d (1 + log10 tft,d)

The score is 0 if none of the query terms is present in the document.
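The weighting and the matching score can be sketched directly (term frequencies given as a dict; the names and example query are illustrative):

```python
import math

def log_tf_weight(tf):
    """w = 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def matching_score(query_terms, doc_tf):
    """Sum the log-tf weights of the query terms that occur in the document."""
    return sum(log_tf_weight(doc_tf.get(t, 0)) for t in query_terms)

# tf -> w: 0 -> 0, 1 -> 1, 2 -> 1.3, 10 -> 2, 1000 -> 4
score = matching_score(["best", "car", "insurance"], {"car": 10, "insurance": 1})
```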

38 / 52

slide-39
SLIDE 39

Outline

1

Recap

2

Why ranked retrieval?

3

Term frequency

4

tf.idf weighting

5

The vector space model

39 / 52

slide-40
SLIDE 40

Desired weight for rare terms

Rare terms are more informative than frequent terms. Consider a term in the query that is rare in the collection (e.g., arachnocentric). A document containing this term is very likely to be relevant. → We want high weights for rare terms like arachnocentric.

40 / 52

slide-41
SLIDE 41

Desired weight for frequent terms

Frequent terms are less informative than rare terms. Consider a term in the query that is frequent in the collection (e.g., good, increase, line). A document containing this term is more likely to be relevant than a document that doesn’t . . . . . . but it’s not a sure indicator of relevance. → For frequent terms, we want positive weights for words like good, increase, and line . . . . . . but lower weights than for rare terms.

41 / 52

slide-42
SLIDE 42

Document frequency

We want high weights for rare terms like arachnocentric. We want low (positive) weights for frequent words like good, increase, and line. We will use document frequency to factor this into computing the matching score. The document frequency is the number of documents in the collection in which the term occurs.

42 / 52

slide-43
SLIDE 43

idf weight

dft is the document frequency, the number of documents that t occurs in. dft is an inverse measure of the informativeness of term t. We define the idf weight of term t as follows:

idft = log10(N/dft)

idft is a measure of the informativeness of the term. We use [log(N/dft)] instead of [N/dft] to “dampen” the effect of idf. So we use the log transformation for both term frequency and document frequency.

43 / 52

slide-44
SLIDE 44

Examples for idf

Compute idft using the formula: idft = log10(1,000,000 / dft)

term        dft        idft
calpurnia   1          6
animal      100        4
sunday      1000       3
fly         10,000     2
under       100,000    1
the         1,000,000  0
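The table can be reproduced with a few lines (N = 1,000,000 as in the formula above):

```python
import math

N = 1_000_000  # total number of documents, as in the slide

def idf(df, n_docs=N):
    """Inverse document frequency idf_t = log10(N / df_t)."""
    return math.log10(n_docs / df)

for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(f"{term:10s} {df:>9,d} {idf(df):g}")
```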

44 / 52

slide-45
SLIDE 45

Effect of idf on ranking

idf affects the ranking of documents for queries with at least two terms. For example, in the query “arachnocentric line”, idf weighting increases the relative weight of arachnocentric and decreases the relative weight of line. idf has no effect on ranking for one-term queries.

45 / 52

slide-46
SLIDE 46

Collection frequency vs. Document frequency

Word        Collection frequency   Document frequency
insurance   10440                  3997
try         10422                  8760

The collection frequency of t is the number of tokens of t in the collection, where we count multiple occurrences. Why these numbers? Which word is a better search term (and should get a higher weight)? This example suggests that df (and idf) is better for weighting than cf (and “icf”).

46 / 52

slide-47
SLIDE 47

tf.idf weighting

The tf.idf weight of a term is the product of its tf weight and its idf weight:

wt,d = (1 + log10 tft,d) · log10(N/dft)

(tf-weight times idf-weight.) Best known weighting scheme in information retrieval. Alternative names: tf-idf, tf x idf, tf/idf. Note: the “-” in tf-idf is a hyphen, not a minus sign!

47 / 52

slide-48
SLIDE 48

Summary: tf.idf

Assign a tf.idf weight for each term t in each document d:

wt,d = (1 + log10 tft,d) · log10(N/dft)

N: total number of documents.
The tf.idf weight . . .
. . . increases with the number of occurrences within a document (term frequency).
. . . increases with the rarity of the term in the collection (inverse document frequency).
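As a sketch, the two factors combine directly (log base 10 for both, matching the slides; the example numbers are illustrative):

```python
import math

def tf_idf(tf, df, n_docs):
    """(1 + log10 tf) * log10(N / df); zero if the term is absent from the doc."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

# A term with tf = 10 in the document and df = 100 in a collection of
# 1,000,000 documents gets weight (1 + 1) * 4 = 8.
print(tf_idf(10, 100, 1_000_000))
```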

48 / 52

slide-49
SLIDE 49

Term, collection and document frequency

Quantity               Symbol   Definition
term frequency         tft,d    number of occurrences of t in d
document frequency     dft      number of documents in the collection in which t occurs
collection frequency   cft      total number of occurrences of t in the collection

Relationship between df and cf? Relationship between tf and cf? Relationship between tf and df?

49 / 52

slide-50
SLIDE 50

Outline

1. Recap

2. Why ranked retrieval?

3. Term frequency

4. tf.idf weighting

5. The vector space model

50 / 52

slide-51
SLIDE 51

Binary → count → weight matrix

           Anthony    Julius   The       Hamlet   Othello   Macbeth   . . .
           and        Caesar   Tempest
           Cleopatra
Anthony    5.25       3.18     0.0       0.0      0.0       0.35
Brutus     1.21       6.10     0.0       1.0      0.0       0.0
Caesar     8.59       2.54     0.0       1.51     0.25      0.0
Calpurnia  0.0        1.54     0.0       0.0      0.0       0.0
Cleopatra  2.85       0.0      0.0       0.0      0.0       0.0
mercy      1.51       0.0      1.90      0.12     5.25      0.88
worser     1.37       0.0      0.11      4.15     0.25      1.95
. . .

Each document is now represented by a real-valued vector of tf.idf weights ∈ R^|V|.

51 / 52

slide-52
SLIDE 52

Documents as vectors

Each document is now represented by a real-valued vector of tf.idf weights ∈ R^|V|. So we have a |V|-dimensional real-valued vector space. Terms are axes of the space. Documents are points or vectors in this space. Very high-dimensional: tens of millions of dimensions when you apply this to web search engines. This is a very sparse vector: most entries are zero. Not only useful for scoring documents on a query, but also essential for document classification and document clustering.
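Because most entries are zero, such a vector is naturally stored sparsely, e.g. as a dict from term to nonzero weight. A minimal sketch using the log-tf times idf weighting from the previous slides (the df table here is made up for illustration):

```python
import math
from collections import Counter

def doc_vector(doc_words, df, n_docs):
    """Sparse tf.idf vector of a document: {term: weight}, zero entries omitted."""
    vec = {}
    for term, tf in Counter(doc_words).items():
        if term in df:
            w = (1 + math.log10(tf)) * math.log10(n_docs / df[term])
            if w > 0:
                vec[term] = w
    return vec

# hypothetical document frequencies over a 10-document collection
df = {"brutus": 2, "caesar": 5, "the": 10}
v = doc_vector("the caesar brutus brutus".split(), df, 10)
# "the" occurs in every document (idf = 0), so it drops out of the vector
```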

52 / 52