Text Statistics


SLIDE 1

Text Statistics

many slides courtesy James Allan@umass

SLIDE 2

SLIDE 3

Word      Occurrences    Percentage
the         8,543,794       6.8
of          3,893,790       3.1
to          3,364,653       2.7
and         3,320,687       2.6
in          2,311,785       1.8
is          1,559,147       1.2
for         1,313,561       1.0
that        1,066,503       0.8
said        1,027,713       0.8

Frequencies from 336,310 documents in the 1 GB TREC Volume 3 corpus:
125,720,891 total word occurrences; 508,209 unique words.
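For concreteness, here is a minimal sketch of how such a rank-frequency table can be computed. The corpus.txt file name and the crude lowercase tokenizer are illustrative assumptions, not the preprocessing actually used for the TREC figures:

```python
import re
from collections import Counter

def word_frequencies(text):
    """Return (word, occurrences, percentage) triples, most frequent first."""
    words = re.findall(r"[a-z]+", text.lower())  # crude tokenizer (assumption)
    counts = Counter(words)
    total = sum(counts.values())
    return [(w, n, 100.0 * n / total) for w, n in counts.most_common()]

if __name__ == "__main__":
    text = open("corpus.txt", encoding="utf-8").read()  # hypothetical corpus file
    for word, n, pct in word_frequencies(text)[:10]:
        print(f"{word:<8}{n:>12,}{pct:>8.1f}")
```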

SLIDE 4

  • A few words occur very often
    – the 2 most frequent words can account for 10% of occurrences
    – the top 6 words account for 20%; the top 50 words for 50%
  • Many words are infrequent
  • “Principle of Least Effort”
    – it is easier to repeat words than to coin new ones
  • Zipf’s law: rank · frequency ≈ constant
    – p_r = (number of occurrences of the word of rank r) / N
    – N is the total number of word occurrences
    – p_r is the probability that a word chosen randomly from the text is the word of rank r
    – for D unique words, Σ p_r = 1
    – r · p_r = A, with A ≈ 0.1 (see the check after this list)
  • George Kingsley Zipf, 1902-1950, linguistics professor at Harvard
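As a quick sanity check, this sketch computes r · p_r for the top-ranked words using the TREC counts from the table on slide 3; the product stays roughly constant, within about a factor of two of A ≈ 0.1:

```python
N = 125_720_891  # total word occurrences in the TREC corpus above
counts = {       # rank: (word, occurrences), copied from the slide 3 table
    1: ("the", 8_543_794), 2: ("of", 3_893_790), 3: ("to", 3_364_653),
    4: ("and", 3_320_687), 5: ("in", 2_311_785), 6: ("is", 1_559_147),
    7: ("for", 1_313_561), 8: ("that", 1_066_503), 9: ("said", 1_027_713),
}
for r, (word, n) in counts.items():
    print(f"{word:<6} r * p_r = {r * n / N:.3f}")  # roughly constant, near 0.1
```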

SLIDE 5

Top 50 words from 423 short TIME magazine articles

SLIDE 6

SLIDE 7

Zipf’s Law and H. P. Luhn

SLIDE 8

  • A word that occurs n times has rank r_n = AN/n
  • Several words may occur n times
  • Assume the rank given by r_n applies to the last of the words that occur n times
  • Then r_n words occur n times or more (ranks 1..r_n)
  • and r_{n+1} words occur n+1 times or more
    – Note: r_n > r_{n+1}, since words that occur frequently are at the start of the list (lower rank)

SLIDE 9

  • The number of words that occur exactly n times is
    I_n = r_n - r_{n+1} = AN/n - AN/(n+1) = AN / (n(n+1))
  • The word with the highest rank (the rarest) occurs once, and that rank is
    D = AN/1 = AN, the number of unique words
  • The proportion of words with frequency n is therefore
    I_n / D = 1 / (n(n+1))
  • The proportion of words occurring exactly once is 1/2 (see the check below)
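A small numeric check of the two claims above, using exact fractions; the 10,000-term cutoff is an arbitrary choice:

```python
from fractions import Fraction

# proportion of words with frequency n is 1/(n(n+1)); the series telescopes to 1
props = [Fraction(1, n * (n + 1)) for n in range(1, 10_001)]
print(props[0])    # 1/2  -> half the vocabulary occurs exactly once
print(sum(props))  # 10000/10001 -> approaches 1 as the cutoff grows
```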
SLIDE 10

[rank-frequency plot; axis label: Rank]
SLIDE 11

  • A law of the form y = k·x^c is called a power law.
  • Zipf’s law is a power law with c = -1
    – n = A·r^(-1) (equivalently r = A·n^(-1)), where A is a constant for a fixed collection
  • On a log-log plot, a power law gives a straight line with slope c:
    – log(y) = log(k·x^c) = log(k) + c·log(x)
    – log(n) = log(A·r^(-1)) = log(A) - 1·log(r)
  • Zipf’s law is quite accurate except at very high and very low ranks (see the fit below).
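A minimal sketch of the log-log fit; the synthetic Zipfian counts (A = 0.1 and an illustrative N) are assumptions, but the recovered slope is c = -1 by construction:

```python
import numpy as np

ranks = np.arange(1, 1001)
freqs = 0.1 * 125_000_000 / ranks  # n = A*N/r with A = 0.1 (Zipf, c = -1)
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
print(round(slope, 3))             # -> -1.0: the log-log slope is c
```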

SLIDE 12

Zipf’s law and real data

[rank-frequency plot: real data deviate from the Zipf line at very high and very low ranks]

SLIDE 13

  • The following more general form gives a somewhat better fit
    – it adds a constant t to the denominator: y = k·(x + t)^c
  • Here, n = A·(r + t)^(-1) (see the fitting sketch below)
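A sketch of fitting this shifted form with scipy's curve_fit; the synthetic data, the shift t = 2.7, and the starting guesses are all illustrative assumptions, and real rank/frequency pairs would replace them:

```python
import numpy as np
from scipy.optimize import curve_fit

def zipf_mandelbrot(r, a, t):
    return a / (r + t)             # n = A*N / (r + t)

ranks = np.arange(1.0, 501.0)
freqs = 2.0e7 / (ranks + 2.7)      # synthetic data generated with t = 2.7
(a_hat, t_hat), _ = curve_fit(zipf_mandelbrot, ranks, freqs, p0=(1e7, 1.0))
print(round(t_hat, 2))             # recovers the shift, ~2.7
```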

SLIDE 15

Explanations for Zipf’s Law

  • Zipf’s explanation was his “principle of least effort”: a balance between the speaker’s desire for a small vocabulary and the hearer’s desire for a large one.
  • Debate (1955-61) between Mandelbrot and H. Simon over the explanation.
  • Li (1992) shows that merely random typing of letters, including a space, generates “words” with a Zipfian distribution (see the simulation below).
    – http://linkage.rockefeller.edu/wli/zipf/
    – short words are more likely to be generated
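A minimal simulation of Li's random-typing argument; the alphabet size and text length are arbitrary choices:

```python
import random
from collections import Counter

random.seed(0)
alphabet = "abcde "                # five letters plus space as word separator
text = "".join(random.choice(alphabet) for _ in range(500_000))
counts = Counter(text.split())     # "words" are the runs between spaces
for rank, (word, n) in enumerate(counts.most_common(8), start=1):
    print(f"{rank:>2} {word!r:<8} {n:>7}  rank*count = {rank * n}")
```

Short words dominate the top ranks because each extra letter multiplies a word's probability by 1/6, and rank · count stays roughly constant, as Zipf's law predicts.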
SLIDE 16

  • How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?
    – the vocabulary has no upper bound, due to proper names, typos, etc.
    – new words occur less frequently as the vocabulary grows
  • If V is the size of the vocabulary and N is the length of the corpus in words:
    – V = K·N^β (0 < β < 1)
  • Typical constants:
    – K ≈ 10 to 100
    – β ≈ 0.4 to 0.6 (so V grows roughly as the square root of N)
  • Can be derived from Zipf’s law by assuming documents are generated by randomly sampling words from a Zipfian distribution (see the estimation sketch below)
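A sketch of estimating β from data by regressing log V on log N over growing prefixes of a corpus; tokens stands in for any tokenized word list, and the sampling step is arbitrary:

```python
import numpy as np

def heaps_fit(tokens, step=1000):
    """Fit log(V) ~ beta * log(N) + log(K) over growing prefixes."""
    sizes, vocab, seen = [], [], set()
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:          # record (corpus size, vocabulary size)
            sizes.append(i)
            vocab.append(len(seen))
    beta, log_k = np.polyfit(np.log(sizes), np.log(vocab), 1)
    return beta, np.exp(log_k)     # expect beta ~ 0.4-0.6 for real text
```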

SLIDE 17

[plot of vocabulary growth: V = K·N^β]