INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/
IR 3: Term Statistics and Discussion 1
Paul Ginsparg
Cornell University, Ithaca, NY
1 Sep 2010
[Figure: sqrt(x), x, and x**2 plotted on linear axes and on log-log axes.]
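The three curves in this figure (sqrt(x), x, x**2) are all power functions x^a, and on log-log axes a power function appears as a straight line whose slope is the exponent a. The following is a minimal sketch of that relationship using only the Python standard library; the helper name `loglog_slope` and the endpoints 1 and 100 are assumptions chosen for illustration, not part of the slides.

```python
import math

def loglog_slope(f, x1=1.0, x2=100.0):
    """Slope of f between x1 and x2, measured in log10-log10 coordinates.

    Hypothetical helper for illustration; endpoints match the 1..100
    range shown on the log-log axes of the figure.
    """
    rise = math.log10(f(x2)) - math.log10(f(x1))
    run = math.log10(x2) - math.log10(x1)
    return rise / run

# The three curves shown in the figure.
curves = {
    "sqrt(x)": math.sqrt,
    "x": lambda x: x,
    "x**2": lambda x: x ** 2,
}

# For a pure power function x**a the log-log slope equals the exponent a.
slopes = {name: loglog_slope(f) for name, f in curves.items()}

for name, slope in slopes.items():
    print(f"{name}: log-log slope = {slope:.2f}")
```

Because the slope is constant for any pure power function, the choice of endpoints does not change the result.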
[Figure: log10 M (vertical axis) plotted against log10 T (horizontal axis).]
[Figure: 100/sqrt(x), 100/x, and 100/x**2 plotted on linear axes and on log-log axes.]
[Figure: log10 cf (vertical axis) plotted against log10 rank (horizontal axis).]
“A plot of word frequency in Wikipedia (27 Nov 2006). The plot is in log-log coordinates. x is rank of a word in the frequency table; y is the total number of the word's occurrences. Most popular words are “the”, “of” and “and”, as
http://www.cs.cornell.edu/home/kleinber/networks-book/networks-book-ch18.pdf
[Figure: Wikipedia edits/month and Amazon sales/week versus user/book rank r, with power-law fits 40916 / r^{.87} and 1258925 / r^{1.7}, shown on linear axes and on log-log axes.]
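Both fitted curves in this figure, 40916 / r^{.87} and 1258925 / r^{1.7}, have the form c / r^a. As a rough sketch (not the method used to produce the fits), the Python below evaluates the two curves at a few ranks and checks that the exponent a can be read back as the negative slope in log-log coordinates. The names `power_law` and `recovered_exponent` and the sample ranks are assumptions made for illustration.

```python
import math

def power_law(c, a):
    """Return the function r -> c / r**a (hypothetical helper)."""
    return lambda r: c / r ** a

# The two fits quoted in the figure legend, keyed by their formula.
fits = {
    "40916 / r^0.87": power_law(40916, 0.87),
    "1258925 / r^1.7": power_law(1258925, 1.7),
}

def recovered_exponent(f, r1=10, r2=1000):
    """Negative log-log slope of f between ranks r1 and r2."""
    rise = math.log10(f(r2)) - math.log10(f(r1))
    run = math.log10(r2) - math.log10(r1)
    return -rise / run

exponents = {name: recovered_exponent(f) for name, f in fits.items()}

for name, f in fits.items():
    # Sample ranks are illustrative; values are printed, not asserted.
    samples = ", ".join(f"r={r}: {f(r):.1f}" for r in (1, 10, 100, 1000))
    print(f"{name}: {samples}; exponent = {exponents[name]:.2f}")
```

The larger exponent (1.7) gives a curve that starts higher at rank 1 but falls off much faster with rank than the curve with exponent .87.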