

  1. INFO 4300 / CS4300 Information Retrieval, slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/. IR 3: Term Statistics and Discussion. Paul Ginsparg, Cornell University, Ithaca, NY, 2 Sep 2010

  2. Administrativa. Course webpage: http://www.infosci.cornell.edu/Courses/info4300/2010fa/ Assignment 1: posted 3 Sep, due Sun 19 Sep. Lectures: Tuesday and Thursday 11:40-12:55, Olin Hall 165. Instructor: Paul Ginsparg, ginsparg@..., 255-7371, Cornell Information Science, 301 College Avenue. Instructor's Assistant: Corinne Russell, crussell@cs..., 255-5925, Cornell Information Science, 301 College Avenue. Instructor's Office Hours: Wed 1-2pm, Fri 2-3pm, or e-mail the instructor to schedule an appointment. Teaching Assistant: Niranjan Sivakumar, ns253@... The Teaching Assistant does not have scheduled office hours but is available to help you by email. Course text at: http://informationretrieval.org/

  3. Overview: 1 Recap, 2 Term Statistics, 3 Discussion

  4. Outline: 1 Recap, 2 Term Statistics, 3 Discussion

  5. Type/token distinction. Token: an instance of a word or term occurring in a document. Type: an equivalence class of tokens. "In June, the dog likes to chase the cat in the barn." has 12 word tokens and 9 word types.
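As an illustration of the type/token distinction (my own sketch, not part of the original slides), a few lines of Python count tokens and types for the example sentence; case-folding and splitting on non-letters is just one of many possible tokenization choices.

```python
import re

sentence = "In June, the dog likes to chase the cat in the barn."

# One possible tokenization: case-fold, then split on anything that is not a letter.
tokens = [t for t in re.split(r"[^a-zA-Z]+", sentence.lower()) if t]
types = set(tokens)  # each distinct token string is one type

print(len(tokens), "tokens:", tokens)        # 12 tokens
print(len(types), "types:", sorted(types))   # 9 types
```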

  6. Problems in tokenization. What are the delimiters? Space? Apostrophe? Hyphen? For each of these: sometimes they delimit, sometimes they don't. No whitespace in many languages (e.g., Chinese). No whitespace in Dutch, German, Swedish compounds (Lebensversicherungsgesellschaftsangestellter).
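A small sketch (not from the slides) of how the choice of delimiters changes the token stream: splitting on whitespace keeps apostrophes, hyphens, and punctuation attached, while splitting on non-letters treats them all as delimiters. The example sentence is made up for illustration.

```python
import re

text = "O'Neill's state-of-the-art co-op isn't in Hewlett-Packard."

# Whitespace only: apostrophes, hyphens, and trailing punctuation stay attached.
whitespace_tokens = text.split()

# Split on any run of non-letters: apostrophes and hyphens now delimit too.
letter_tokens = [t for t in re.split(r"[^A-Za-z]+", text) if t]

print(whitespace_tokens)
print(letter_tokens)
```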

  7. Problems in "equivalence classing". A term is an equivalence class of tokens. How do we define equivalence classes? Numbers (3/20/91 vs. 20/3/91). Case folding. Stemming, Porter stemmer. Morphological analysis: inflectional vs. derivational. Equivalence classing problems in other languages: more complex morphology than in English (Finnish: a single verb may have 12,000 different forms), accents, umlauts.
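One way to see case folding and Porter stemming in action, assuming the NLTK library is available (an illustration only, not the preprocessing behind these slides):

```python
from nltk.stem.porter import PorterStemmer  # assumes nltk is installed

stemmer = PorterStemmer()
words = ["Organizes", "organizing", "organization", "ponies", "caresses"]

# Case folding plus Porter stemming maps related surface forms onto a shared
# equivalence class (a stem, which need not be a dictionary word).
for w in words:
    print(w, "->", stemmer.stem(w.lower()))
```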

  8. Outline: 1 Recap, 2 Term Statistics, 3 Discussion

  9. How big is the term vocabulary? That is, how many distinct words are there? Can we assume there is an upper bound? Not really: at least 70^20 ≈ 10^37 different words of length 20. The vocabulary will keep growing with collection size. Heaps' law: M = k T^b, where M is the size of the vocabulary and T is the number of tokens in the collection. Typical values for the parameters are 30 ≤ k ≤ 100 and b ≈ 0.5. Heaps' law is linear in log-log space; it is the simplest possible relationship between collection size and vocabulary size in log-log space. Empirical law.
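To make the formula concrete, here is a minimal sketch evaluating M = k T^b for a few collection sizes, using the Reuters fit (k = 44, b = 0.49) that appears later in the deck; the chosen T values are arbitrary.

```python
def heaps(T, k=44, b=0.49):
    """Heaps' law estimate of vocabulary size M for a collection of T tokens."""
    return k * T ** b

# Vocabulary keeps growing with collection size, but sub-linearly (b ~ 0.5).
for T in (10**4, 10**6, 10**8, 10**10):
    print(f"T = {T:>14,}  ->  M ~ {heaps(T):>12,.0f}")
```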

  10. Power laws in log-log space: y = c x^k (k = 1/2, 1, 2), so log10 y = k * log10 x + log10 c. [Plots of sqrt(x), x, and x^2 on linear axes and on log-log axes, where each power law appears as a straight line.]
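To reproduce plots like these, a matplotlib sketch (my choice of plotting library, not necessarily how the original figures were made):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(1, 100, 400)
fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(9, 4))

for k, label in [(0.5, "sqrt(x)"), (1, "x"), (2, "x**2")]:
    ax_lin.plot(x, x**k, label=label)
    ax_log.loglog(x, x**k, label=label)  # straight lines on log-log axes

ax_lin.set_title("linear axes")
ax_log.set_title("log-log axes")
ax_lin.legend()
ax_log.legend()
plt.show()
```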

  11. Model collection: the Reuters collection (Reuters-RCV1), 1 GB of text sent over the Reuters newswire, 20 Aug 1996 to 19 Aug 1997.
      symbol  statistic                                            value
      N       documents                                            800,000
      L       avg. # word tokens per document                      200
      M       word types                                           400,000
              avg. # bytes per word token (incl. spaces/punct.)    6
              avg. # bytes per word token (without spaces/punct.)  4.5
              avg. # bytes per word type                           7.5
      T       non-positional postings                              100,000,000

  12. Heaps' law for Reuters. [Plot: vocabulary size M as a function of collection size T (number of tokens) for Reuters-RCV1, on log-log axes.] For these data, the dashed line log10 M = 0.49 * log10 T + 1.64 is the best least-squares fit. Thus M = 10^1.64 * T^0.49, with k = 10^1.64 ≈ 44 and b = 0.49: M = k T^b = 44 T^0.49.

  13. Empirical fit for Reuters: good, as we just saw in the graph. Example: for the first 1,000,020 tokens, Heaps' law predicts 38,323 terms: 44 × 1,000,020^0.49 ≈ 38,323. The actual number is 38,365 terms, very close to the prediction. Empirical observation: the fit is good in general.
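A quick check of the arithmetic on this slide, using the fitted parameters k = 44 and b = 0.49:

```python
# Heaps' law with the Reuters-RCV1 fit: M = 44 * T**0.49
T = 1_000_020
M_predicted = 44 * T ** 0.49
print(round(M_predicted))   # ~38,323 predicted; 38,365 terms were actually observed
```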

  14. Exercise. 1. What is the effect of including spelling errors vs. automatically correcting spelling errors on Heaps' law? 2. Compute vocabulary size M: looking at a collection of web pages, you find that there are 3,000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens. Assume a search engine indexes a total of 20,000,000,000 (2 × 10^10) pages, containing 200 tokens on average. What is the size of the vocabulary of the indexed collection as predicted by Heaps' law?
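For the second part, one way to set the computation up (a sketch of the method under the stated assumptions, not an official solution): fit k and b from the two observations, then apply Heaps' law to the total token count.

```python
import math

# Two observations of (tokens seen, distinct terms seen) from the exercise.
T1, M1 = 10_000, 3_000
T2, M2 = 1_000_000, 30_000

# Heaps' law M = k * T**b gives two equations in the two unknowns k and b.
b = math.log(M2 / M1) / math.log(T2 / T1)
k = M1 / T1 ** b

total_tokens = 20_000_000_000 * 200   # 2e10 pages, 200 tokens per page
M_total = k * total_tokens ** b
print(f"b = {b:.2f}, k = {k:.0f}, predicted vocabulary size = {M_total:,.0f}")
```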

  15. Zipf's law. Now we have characterized the growth of the vocabulary in collections. We also want to know how many frequent vs. infrequent terms we should expect in a collection. In natural language, there are a few very frequent terms and very many very rare terms. Zipf's law (linguist/philologist George Zipf, 1935): the i-th most frequent term has frequency proportional to 1/i, i.e. cf_i ∝ 1/i, where cf_i is collection frequency: the number of occurrences of the term t_i in the collection.

  16. From http://en.wikipedia.org/wiki/Zipf's law. Zipf's law: the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word, etc. Brown Corpus: "the": 7% of all word occurrences (69,971 of >1M); "of": ~3.5% of words (36,411); "and": 2.9% (28,852). Only 135 vocabulary items account for half the Brown Corpus. The Brown University Standard Corpus of Present-Day American English is a carefully compiled selection of current American English, totaling about a million words drawn from a wide variety of sources ... for many years among the most-cited resources in the field.

  17. Zipf's law. The i-th most frequent term has frequency proportional to 1/i: cf_i ∝ 1/i, where cf is collection frequency: the number of occurrences of the term in the collection. So if the most frequent term (the) occurs cf_1 times, then the second most frequent term (of) has half as many occurrences, cf_2 = (1/2) cf_1, and the third most frequent term (and) has a third as many occurrences, cf_3 = (1/3) cf_1, etc. Equivalently: cf_i = c i^k and log cf_i = log c + k log i (for k = -1). An example of a power law.
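A minimal sketch of the relation cf_i = c * i^k with k = -1: given a hypothetical collection frequency for the top-ranked term, the predicted frequencies at the next ranks follow directly (the value of c below is made up for illustration).

```python
# Zipf's law: cf_i proportional to 1/i, i.e. cf_i = c * i**k with k = -1.
c = 1_000_000   # hypothetical collection frequency of the most frequent term
for i in range(1, 6):
    cf_i = c * i ** -1
    print(f"rank {i}: predicted cf = {cf_i:,.0f}")
```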

  18. Power laws in log-log space: y = c x^-k (k = 1/2, 1, 2), so log10 y = -k * log10 x + log10 c. [Plots of 100/sqrt(x), 100/x, and 100/x^2 on linear axes and on log-log axes.]

  19. Zipf's law for Reuters. [Plot: log10 cf vs. log10 rank for Reuters-RCV1.] Fit far from perfect, but nonetheless a key insight: few frequent terms, many rare terms.

  20. More from http://en.wikipedia.org/wiki/Zipf's law: "A plot of word frequency in Wikipedia (27 Nov 2006). The plot is in log-log coordinates. x is the rank of a word in the frequency table; y is the total number of the word's occurrences. The most popular words are "the", "of" and "and", as expected. Zipf's law corresponds to the upper linear portion of the curve, roughly following the green (1/x) line."

  21. Power laws more generally. E.g., consider power law distributions of the form c r^-k, describing the number of book sales versus the sales rank r of a book, or the number of Wikipedia edits made by the r-th most frequent contributor to Wikipedia. Amazon book sales: c r^-k, k ≈ 0.87. Number of Wikipedia edits: c r^-k, k ≈ 1.7. (More on power laws and the long tail in Networks, Crowds, and Markets: Reasoning About a Highly Connected World by David Easley and Jon Kleinberg, Chapter 18: http://www.cs.cornell.edu/home/kleinber/networks-book/networks-book-ch18.pdf)

  22. [Plots: Wikipedia edits/month and Amazon sales/week vs. user/book rank r, on linear and on log-log axes.] Normalization is given by the roughly 1 sale/week for the 200,000th-ranked Amazon title, 40916 r^-0.87, and by the 10 edits/month for the 1000th-ranked Wikipedia editor, 1258925 r^-1.7. Long tail: about a quarter of Amazon book sales are estimated to come from the long tail, i.e., those outside the top 100,000 bestselling titles.
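The normalization constants on this slide can be verified directly; a quick sanity check (not part of the original slides):

```python
def amazon_sales_per_week(r):
    # ~1 sale/week at rank 200,000 fixes the constant in c * r**-0.87
    return 40916 * r ** -0.87

def wikipedia_edits_per_month(r):
    # ~10 edits/month at rank 1,000 fixes the constant in c * r**-1.7
    return 1258925 * r ** -1.7

print(amazon_sales_per_week(200_000))    # close to 1
print(wikipedia_edits_per_month(1_000))  # close to 10
```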

  23. Another Wikipedia count (15 May 2010), http://imonad.com/seo/wikipedia-word-frequency-list/: all articles in the English version of Wikipedia, 21 GB in XML format (five hours to parse the entire file, extract data from the markup language, filter numbers and special characters, and extract statistics). Total tokens (words, no numbers): T = 1,570,455,731. Unique tokens (words, no numbers): M = 5,800,280.
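A sketch of the kind of counting described here, run over a plain-text file rather than the actual 21 GB XML dump; the file name and the words-only tokenization rule are stand-ins of my own.

```python
import re
from collections import Counter

counts = Counter()
with open("corpus.txt", encoding="utf-8") as f:   # stand-in for the parsed dump
    for line in f:
        # words only (no numbers or special characters), case-folded
        counts.update(re.findall(r"[a-z]+", line.lower()))

total_tokens = sum(counts.values())   # T
unique_tokens = len(counts)           # M
print(f"T = {total_tokens:,}   M = {unique_tokens:,}")
```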

  24. "Word frequency distribution follows Zipf's law"

  25. Rank 1–50 (86M–3M occurrences): stop words (the, of, and, in, to, a, is, ...). Rank 51–3K (2.4M–56K): frequent words (university, January, tea, sharp, ...). Rank 3K–200K (56K–118): words from large comprehensive dictionaries (officiates, polytonality, neologism, ...). Above rank 50K: mostly long-tail words. Rank 200K–5.8M (117–1): terms from obscure niches, misspelled words, transliterated words from other languages, new words and non-words (euprosthenops, eurotrochilus, lokottaravada, ...).
