INFO 4300 / CS4300 Information Retrieval



slide-1
SLIDE 1

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze’s, linked from http://informationretrieval.org/

IR 3: Term Statistics and Discussion 1

Paul Ginsparg

Cornell University, Ithaca, NY

2 Sep 2010

1 / 28

slide-2
SLIDE 2

Administrativa

Course Webpage: http://www.infosci.cornell.edu/Courses/info4300/2010fa/
Assignment 1. Posted: 3 Sep, Due: Sun, 19 Sep
Lectures: Tuesday and Thursday 11:40–12:55, Olin Hall 165
Instructor: Paul Ginsparg, ginsparg@..., 255-7371, Cornell Information Science, 301 College Avenue
Instructor’s Assistant: Corinne Russell, crussell@cs..., 255-5925, Cornell Information Science, 301 College Avenue
Instructor’s Office Hours: Wed 1–2pm, Fri 2–3pm, or e-mail the instructor to schedule an appointment
Teaching Assistant: Niranjan Sivakumar, ns253@... The Teaching Assistant does not have scheduled office hours but is available to help you by email.
Course text at: http://informationretrieval.org/


slide-3
SLIDE 3

Overview

1. Recap
2. Term Statistics
3. Discussion


slide-4
SLIDE 4

Outline

1. Recap
2. Term Statistics
3. Discussion


slide-5
SLIDE 5

Type/token distinction

Token – an instance of a word or term occurring in a document.
Type – an equivalence class of tokens.
Example: “In June, the dog likes to chase the cat in the barn.” contains 12 word tokens and 9 word types.
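The token/type count above can be checked with a toy tokenizer (a sketch: lowercasing and splitting on runs of letters is an illustrative assumption, not the tokenizer used in the course):

```python
import re

# Toy tokenizer: lowercase and keep maximal runs of letters.
# (An assumption for illustration; real tokenizers must also decide
# about apostrophes, hyphens, numbers, etc.)
def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

tokens = tokenize("In June, the dog likes to chase the cat in the barn.")
types = set(tokens)
print(len(tokens), len(types))  # 12 tokens, 9 types
```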


slide-6
SLIDE 6

Problems in tokenization

What are the delimiters? Space? Apostrophe? Hyphen?
For each of these: sometimes they delimit, sometimes they don’t.
No whitespace in many languages! (e.g., Chinese)
No whitespace in Dutch, German, Swedish compounds (Lebensversicherungsgesellschaftsangestellter)


slide-7
SLIDE 7

Problems in “equivalence classing”

A term is an equivalence class of tokens. How do we define equivalence classes?
Numbers (3/20/91 vs. 20/3/91)
Case folding
Stemming (e.g., the Porter stemmer)
Morphological analysis: inflectional vs. derivational
Equivalence classing problems in other languages:
more complex morphology than in English (Finnish: a single verb may have 12,000 different forms); accents, umlauts
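A minimal sketch of one equivalence-classing step (the choice of case folding plus accent stripping via Unicode decomposition is an assumption for illustration; real systems add stemming and morphological analysis on top):

```python
import unicodedata

def normalize(token):
    # Case-fold, then drop combining accent marks (é -> e, ü -> u).
    folded = token.casefold()
    decomposed = unicodedata.normalize("NFD", folded)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Tokens differing only in case or accents land in the same class.
print(normalize("Université"))                          # universite
print(normalize("Tübingen") == normalize("tubingen"))   # True
```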


slide-8
SLIDE 8

Outline

1. Recap
2. Term Statistics
3. Discussion


slide-9
SLIDE 9

How big is the term vocabulary?

That is, how many distinct words are there?
Can we assume there is an upper bound? Not really: there are at least 70^20 ≈ 10^37 different words of length 20, and the vocabulary will keep growing with collection size.
Heaps’ law: M = kT^b, where M is the size of the vocabulary and T is the number of tokens in the collection. Typical values for the parameters: 30 ≤ k ≤ 100 and b ≈ 0.5.
Heaps’ law is linear in log-log space: it is the simplest possible relationship between collection size and vocabulary size in log-log space. It is an empirical law.


slide-10
SLIDE 10

Power Laws in log-log space

y = c x^k (k = 1/2, 1, 2), i.e., log10 y = k ∗ log10 x + log10 c

[Plots of sqrt(x), x, and x^2 on linear axes and on log-log axes; in log-log space each power law is a straight line with slope k.]


slide-11
SLIDE 11

Model collection: The Reuters collection

symbol  statistic                                            value
N       documents                                            800,000
L       avg. # word tokens per document                      200
M       word types                                           400,000
        avg. # bytes per word token (incl. spaces/punct.)    6
        avg. # bytes per word token (without spaces/punct.)  4.5
        avg. # bytes per word type                           7.5
T       non-positional postings                              100,000,000

About 1 GB of text sent over the Reuters newswire 20 Aug ’96 – 19 Aug ’97.


slide-12
SLIDE 12

Heaps’ law for Reuters

[Plot: vocabulary size M as a function of collection size T (number of tokens) for Reuters-RCV1, with log10 M against log10 T.]

For these data, the dashed line log10 M = 0.49 ∗ log10 T + 1.64 is the best least-squares fit. Thus M = 10^1.64 ∗ T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49, giving M = kT^b = 44 T^0.49.


slide-13
SLIDE 13

Empirical fit for Reuters

Good, as we just saw in the graph. Example: for the first 1,000,020 tokens, Heaps’ law predicts 44 × 1,000,020^0.49 ≈ 38,323 terms. The actual number is 38,365 terms, very close to the prediction. Empirical observation: the fit is good in general.
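The quoted prediction is easy to check numerically (using the rounded fit k ≈ 44, b = 0.49 from the Reuters slide, so the result can differ from 38,323 by a few terms):

```python
# Heaps’ law prediction M = k * T**b with the fitted Reuters parameters.
k, b = 44, 0.49
T = 1_000_020
M_pred = k * T**b
print(round(M_pred))  # close to the slide's 38,323 (actual count: 38,365)
```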


slide-14
SLIDE 14

Exercise

1. What is the effect of including spelling errors vs. automatically correcting spelling errors on Heaps’ law?
2. Compute vocabulary size M: looking at a collection of web pages, you find that there are 3,000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens. Assume a search engine indexes a total of 20,000,000,000 (2 × 10^10) pages, containing 200 tokens on average. What is the size of the vocabulary of the indexed collection as predicted by Heaps’ law?
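One way to carry out the computation in part 2 (a sketch; the numbers come straight from the exercise statement):

```python
import math

# Fit k and b of M = k * T**b from the two observed (T, M) points:
# log M = log k + b * log T, so two points determine the line.
T1, M1 = 10_000, 3_000
T2, M2 = 1_000_000, 30_000

b = math.log(M2 / M1) / math.log(T2 / T1)  # log 10 / log 100 = 0.5
k = M1 / T1**b                             # 3000 / 10000**0.5 = 30

# Extrapolate to the full collection: 2e10 pages * 200 tokens/page.
T = 20_000_000_000 * 200
M = k * T**b
print(b, k, M)  # b = 0.5, k = 30, M = 60,000,000
```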


slide-15
SLIDE 15

Zipf’s law

Now we have characterized the growth of the vocabulary in collections. We also want to know how many frequent vs. infrequent terms we should expect in a collection. In natural language, there are a few very frequent terms and very many very rare terms.
Zipf’s law (linguist/philologist George Zipf, 1935): the ith most frequent term has frequency proportional to 1/i, i.e., cf_i ∝ 1/i, where cf_i is the collection frequency: the number of occurrences of the term t_i in the collection.


slide-16
SLIDE 16

http://en.wikipedia.org/wiki/Zipf%27s_law

Zipf’s law: the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word, etc.
Brown Corpus: “the” accounts for 7% of all word occurrences (69,971 of >1M); “of” for ∼3.5% of words (36,411); “and” for ∼2.9% (28,852). Only 135 vocabulary items account for half the Brown Corpus.

The Brown University Standard Corpus of Present-Day American English is a carefully compiled selection of current American English, totaling about a million words drawn from a wide variety of sources . . . for many years among the most-cited resources in the field.


slide-17
SLIDE 17

Zipf’s law

Zipf’s law: the ith most frequent term has frequency proportional to 1/i, i.e., cf_i ∝ 1/i, where cf_i is the collection frequency: the number of occurrences of the term in the collection.
So if the most frequent term (“the”) occurs cf_1 times, then the second most frequent term (“of”) has half as many occurrences, cf_2 = (1/2) cf_1 . . .
. . . and the third most frequent term (“and”) has a third as many occurrences, cf_3 = (1/3) cf_1, etc.
Equivalently: cf_i = c i^k and log cf_i = log c + k log i (for k = −1). This is an example of a power law.
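The equivalent power-law form can be illustrated with synthetic Zipfian frequencies (the value of cf_1 below is an arbitrary constant chosen for illustration):

```python
import math

# Synthetic Zipf frequencies: cf_i = cf_1 / i for ranks 1..1000.
cf1 = 1_000_000
cf = {i: cf1 / i for i in range(1, 1001)}

# In log-log space the slope between any two ranks is k = -1,
# matching log cf_i = log c + k log i with k = -1.
slope = (math.log(cf[100]) - math.log(cf[10])) / (math.log(100) - math.log(10))
print(slope)  # -1.0 (up to floating point)
```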


slide-18
SLIDE 18

Power Laws in log-log space

y = c x^−k (k = 1/2, 1, 2), i.e., log10 y = −k ∗ log10 x + log10 c

[Plots of 100/sqrt(x), 100/x, and 100/x^2 on linear axes and on log-log axes; in log-log space each is a straight line with slope −k.]


slide-19
SLIDE 19

Zipf’s law for Reuters

[Plot: log10 cf versus log10 rank for Reuters-RCV1.]

Fit far from perfect, but nonetheless the key insight holds: few frequent terms, many rare terms.


slide-20
SLIDE 20

more from http://en.wikipedia.org/wiki/Zipf%27s_law

“A plot of word frequency in Wikipedia (27 Nov 2006). The plot is in log-log coordinates; x is the rank of a word in the frequency table, y is the total number of the word’s occurrences. The most popular words are “the”, “of” and “and”, as expected. Zipf’s law corresponds to the upper linear portion of the curve, roughly following the green (1/x) line.”


slide-21
SLIDE 21

Power laws more generally

E.g., consider power-law distributions of the form c r^−k, describing:
the number of book sales versus the sales rank r of a book
the number of Wikipedia edits made by the rth most frequent contributor to Wikipedia
Amazon book sales: c r^−k, k ≈ 0.87
Number of Wikipedia edits: c r^−k, k ≈ 1.7
(More on power laws and the long tail in Networks, Crowds, and Markets: Reasoning About a Highly Connected World by David Easley and Jon Kleinberg, Chapter 18:
http://www.cs.cornell.edu/home/kleinber/networks-book/networks-book-ch18.pdf)

slide-22
SLIDE 22

[Plots of Wikipedia edits/month and Amazon sales/week versus user/book rank r, on linear and log-log axes, with the fitted curves 40916 / r^0.87 and 1258925 / r^1.7; in log-log space both are straight lines.]

Normalization is fixed by the roughly 1 sale/week for the 200,000th-ranked Amazon title, giving 40916 r^−0.87, and by the 10 edits/month for the 1000th-ranked Wikipedia editor, giving 1258925 r^−1.7.

Long tail: about a quarter of Amazon book sales are estimated to come from the long tail, i.e., titles outside the top 100,000 bestselling titles.
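A rough numerical illustration of the long-tail idea (not the cited estimate: the catalog size N below is an assumed cutoff, and the resulting tail share depends strongly on it):

```python
# Model weekly sales as 40916 / r**0.87 (the normalization above) and
# ask what fraction falls outside the top 100,000 ranks, assuming a
# hypothetical catalog of 3 million ranked titles.
c, k = 40916, 0.87
N, head = 3_000_000, 100_000

sales = [c / r**k for r in range(1, N + 1)]
total = sum(sales)
tail_share = sum(sales[head:]) / total
print(f"tail share: {tail_share:.2f}")
```

Because k < 1 here, the rank sum grows without bound as N grows, so the tail share keeps rising with the assumed catalog size; the "quarter of sales" figure is an empirical estimate, not something this model pins down.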


slide-23
SLIDE 23

Another Wikipedia count (15 May 2010)

http://imonad.com/seo/wikipedia-word-frequency-list/
All articles in the English version of Wikipedia, 21 GB in XML format (five hours to parse the entire file, extract data from the markup language, filter numbers and special characters, and extract statistics):
Total tokens (words, no numbers): T = 1,570,455,731
Unique tokens (words, no numbers): M = 5,800,280


slide-24
SLIDE 24

“Word frequency distribution follows Zipf’s law”


slide-25
SLIDE 25

rank 1–50 (86M–3M): stop words (the, of, and, in, to, a, is, . . .)
rank 51–3K (2.4M–56K): frequent words (university, January, tea, sharp, . . .)
rank 3K–200K (56K–118): words from large comprehensive dictionaries (officiates, polytonality, neologism, . . .)
above rank 50K: mostly Long Tail words
rank 200K–5.8M (117–1): terms from obscure niches, misspelled words, transliterated words from other languages, new words and non-words (euprosthenops, eurotrochilus, lokottaravada, . . .)


slide-26
SLIDE 26

Some selected words and associated counts

Google     197920    Twitter      894
domain     111850    domainer      22
Wikipedia  3226237   Wiki      176827
Obama      22941     Oprah       3885
Moniker    4974      GoDaddy      228


slide-27
SLIDE 27

Outline

1. Recap
2. Term Statistics
3. Discussion


slide-28
SLIDE 28

Discussion 1

Objective: explore three information retrieval systems (Bing, LOC, PubMed), and use each for the discovery task: “What is the medical evidence that cell phone usage can cause cancer?”
Some general questions and observations:
How to authenticate the information?
Is the information up to date? (How to find updated info?)
In what order are items returned? (By “relevance”, but how is relevance determined: link analysis? tf.idf?)
Use results of Bing search to refine vocabulary.
Assignment: everyone upload, as a test of CMS, the best reference found, along with an outline of the strategy used to find it.
