
What Every Computational Linguist Should Know About Type-Token Distributions and Zipf's Law
Tutorial 1, 7 May 2018
Stefan Evert, FAU Erlangen-Nürnberg
http://zipfr.r-forge.r-project.org/lrec2018.html
Licensed under CC-by-sa version 3.0


  3. Part 1 Descriptive statistics & notation: Vocabulary growth curve
  Our sample: recently, very, not, otherwise, much, very, very, merely, not, now, very, much, merely, not, very
  ◮ N = 1: V(N) = 1, V1(N) = 1
  ◮ N = 3: V(N) = 3, V1(N) = 3
  ◮ N = 7: V(N) = 5, V1(N) = 4
  ◮ N = 12: V(N) = 7, V1(N) = 4
  ◮ N = 15: V(N) = 7, V1(N) = 3
  [Figure: vocabulary growth curve for the adverb sample, plotting V(N) and V1(N) against N]
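The bookkeeping behind these numbers is easy to verify. As a quick cross-check (a Python sketch; the tutorial itself uses R and zipfR), compute N, V(N) and V1(N) directly from the token sequence:

```python
from collections import Counter

def vgc_point(tokens):
    """Return (N, V(N), V1(N)): sample size, type count, hapax count."""
    counts = Counter(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return len(tokens), len(counts), hapaxes

adverbs = ("recently very not otherwise much very very merely "
           "not now very much merely not very").split()

for n in (1, 3, 7, 12, 15):
    print(vgc_point(adverbs[:n]))
# → (1, 1, 1), (3, 3, 3), (7, 5, 4), (12, 7, 4), (15, 7, 3)
```

Running this for N = 1, 3, 7, 12, 15 reproduces exactly the values listed on the slide.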

  4. Part 1 Descriptive statistics & notation: A realistic vocabulary growth curve: the Brown corpus
  [Figure: vocabulary growth curve for the Brown corpus, plotting V(N) and V1(N) for N up to 1,000,000 tokens; V(N) grows to about 50,000 types]

  5. Part 1 Descriptive statistics & notation: Vocabulary growth in authorship attribution
  ◮ Authorship attribution by n-gram tracing, applied to the case of the Bixby letter (Grieve et al., submitted)
  ◮ Word or character n-grams in the disputed text are compared against large "training" corpora from candidate authors

  6. Part 1 Descriptive statistics & notation: Observing Zipf's law across languages and different linguistic units

  7. Part 1 Descriptive statistics & notation: Observing Zipf's law
  The Italian prefix ri- in the la Repubblica corpus

  8. Part 1 Descriptive statistics & notation: Observing Zipf's law
  [Figures: Zipf ranking plots (two slides)]

  13. Part 1 Descriptive statistics & notation: Observing Zipf's law
  ◮ A straight line in double-logarithmic space corresponds to a power law in the original variables
  ◮ This leads to Zipf's (1949; 1965) famous law:
      f_r = C / r^a
  ◮ Taking the logarithm on both sides, we obtain a linear relationship between y = log f_r and x = log r:
      log f_r = log C − a · log r
  ◮ Intuitive interpretation of a and C:
  ◮ a is the slope, determining how fast the log frequency decreases
  ◮ log C is the intercept, i.e. the log frequency of the most frequent word (r = 1 ➜ log r = 0)
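The slope/intercept reading can be checked numerically (an illustrative Python sketch, not part of the tutorial code): frequencies generated exactly by Zipf's law lie on a straight line in log-log space, so linear regression recovers −a and log C.

```python
import numpy as np

C, a = 60000.0, 1.2
r = np.arange(1, 1001)
f = C / r**a                          # Zipf's law: f_r = C / r^a

# linear regression in double-logarithmic space:
# log f_r = log C - a * log r, so slope = -a and intercept = log C
slope, intercept = np.polyfit(np.log(r), np.log(f), 1)
print(-slope, np.exp(intercept))      # recovers a and C
```

With real corpus data the points only approximately follow the line, and the regression yields estimates of a and C instead of the exact values.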

  14. Part 1 Descriptive statistics & notation: Observing Zipf's law
  Least-squares fit = linear regression in log-space (Brown corpus)
  [Figure: Zipf ranking of the Brown corpus with fitted regression line]

  17. Part 1 Descriptive statistics & notation: Zipf-Mandelbrot law (Mandelbrot 1953, 1962)
  ◮ Mandelbrot's extra parameter b:
      f_r = C / (r + b)^a
  ◮ Zipf's law is the special case b = 0
  ◮ Assuming a = 1, C = 60,000, b = 1:
  ◮ for the word with rank 1, Zipf's law predicts a frequency of 60,000; Mandelbrot's variant predicts 30,000
  ◮ for the word with rank 1,000, Zipf's law predicts a frequency of 60; Mandelbrot's variant predicts 59.94
  ◮ The Zipf-Mandelbrot law forms the basis of statistical LNRE models
  ◮ The ZM law can be derived mathematically as the limiting distribution of the vocabulary generated by a character-level Markov process
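These predictions follow directly from the two formulas; a minimal sketch with the slide's parameters (a = 1, C = 60,000, b = 1):

```python
def zipf(r, C=60_000, a=1):
    """Zipf's law: f_r = C / r^a"""
    return C / r**a

def zipf_mandelbrot(r, C=60_000, a=1, b=1):
    """Zipf-Mandelbrot law: f_r = C / (r + b)^a"""
    return C / (r + b)**a

print(zipf(1), zipf_mandelbrot(1))                  # → 60000.0 30000.0
print(zipf(1000), round(zipf_mandelbrot(1000), 2))  # → 60.0 59.94
```

This makes the effect of b concrete: it flattens the top of the ranking (rank 1 is halved) while leaving the tail virtually unchanged.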

  18. Part 1 Descriptive statistics & notation: Zipf-Mandelbrot law
  Non-linear least-squares fit (Brown corpus)
  [Figure: Zipf ranking of the Brown corpus with fitted Zipf-Mandelbrot curve]

  19. Part 1 Some examples (zipfR): Outline
  Part 1: Motivation; Descriptive statistics & notation; Some examples (zipfR); LNRE models: intuition; LNRE models: mathematics
  Part 2: Applications & examples (zipfR); Limitations; Non-randomness; Conclusion & outlook

  20. Part 1 Some examples (zipfR): zipfR (Evert and Baroni 2007)
  ◮ http://zipfR.R-Forge.R-Project.org/
  ◮ Conveniently available from the CRAN repository
  ◮ Package vignette = gentle tutorial introduction

  21. Part 1 Some examples (zipfR): First steps with zipfR
  ◮ Set up a folder for this course, and make sure it is your working directory in R (preferably as an RStudio project)
  ◮ Install the most recent version of the zipfR package
  ◮ Package, handouts, code samples & data sets available from http://zipfr.r-forge.r-project.org/lrec2018.html

  > library(zipfR)
  > ?zipfR                      # documentation entry point
  > vignette("zipfr-tutorial")  # read the zipfR tutorial

  24. Part 1 Some examples (zipfR): Loading type-token data
  ◮ Most convenient input: sequence of tokens as a text file in vertical format ("one token per line")
  ☞ tokens mapped to appropriate types: normalized word forms, word pairs, lemmas, semantic classes, n-grams of POS tags, . . .
  ☞ language data should always be in UTF-8 encoding!
  ☞ large files can be compressed (.gz, .bz2, .xz)
  ◮ Sample data: brown_adverbs.txt on the tutorial homepage
  ◮ lowercased adverb tokens from the Brown corpus (in original order)
  ☞ download and save to your working directory

  > adv <- readLines("brown_adverbs.txt", encoding="UTF-8")
  > head(adv, 30)  # mathematically, a "vector" of tokens
  > length(adv)    # sample size = 52,037 tokens

  25. Part 1 Some examples (zipfR): Descriptive statistics: type-frequency list

  > adv.tfl <- vec2tfl(adv)
  > adv.tfl
       k    f type
  1    1 4859 not
  2    2 2084 n't
  3    3 1464 so
  4    4 1381 only
  5    5 1374 then
  6    6 1309 now
  7    7 1134 even
  8    8 1089 as
  ...
  N = 52037, V = 1907
  > N(adv.tfl)  # sample size
  > V(adv.tfl)  # type count

  26. Part 1 Some examples (zipfR): Descriptive statistics: frequency spectrum

  > adv.spc <- tfl2spc(adv.tfl)  # or directly with vec2spc
  > adv.spc
      m  Vm
  1   1 762
  2   2 260
  3   3 144
  4   4  99
  5   5  69
  6   6  50
  7   7  40
  8   8  34
  ...
  N = 52037, V = 1907
  > N(adv.spc)  # sample size
  > V(adv.spc)  # type count

  27. Part 1 Some examples (zipfR): Descriptive statistics: vocabulary growth
  ◮ A VGC lists the vocabulary size V(N) at different sample sizes N
  ◮ Optionally also the spectrum elements V_m(N) up to m.max

  > adv.vgc <- vec2vgc(adv, m.max=2)

  ◮ Visualize descriptive statistics with the plot method

  > plot(adv.tfl)               # Zipf ranking
  > plot(adv.tfl, log="xy")     # logarithmic scale recommended
  > plot(adv.spc)               # barplot of frequency spectrum
  > plot(adv.vgc, add.m=1:2)    # vocabulary growth curve

  28. Part 1 Some examples (zipfR): Further example data sets
  ?Brown: words from the Brown corpus
  ?BrownSubsets: various subsets
  ?Dickens: words from novels by Charles Dickens
  ?ItaPref: Italian word-formation prefixes
  ?TigerNP: NP and PP patterns from the German Tiger treebank
  ?Baayen2001: frequency spectra from Baayen (2001)
  ?EvertLuedeling2001: German word-formation affixes (manually corrected data from Evert and Lüdeling 2001)
  Practice:
  ◮ Explore these data sets with descriptive statistics
  ◮ Try different plot options (see the help pages ?plot.tfl, ?plot.spc, ?plot.vgc)

  29. Part 1 LNRE models: intuition: Outline
  Part 1: Motivation; Descriptive statistics & notation; Some examples (zipfR); LNRE models: intuition; LNRE models: mathematics
  Part 2: Applications & examples (zipfR); Limitations; Non-randomness; Conclusion & outlook

  30. Part 1 LNRE models: intuition: Motivation
  ◮ We are interested in the productivity of an affix, the vocabulary of an author, etc., not in a particular text or sample
  ☞ statistical inference from sample to population
  ◮ Discrete frequency counts are difficult to capture with generalizations such as Zipf's law
  ◮ Zipf's law predicts many impossible types with 1 < f_r < 2
  ☞ the population does not suffer from such quantization effects

  31. Part 1 LNRE models: intuition: LNRE models
  ◮ This tutorial introduces the state-of-the-art LNRE approach proposed by Baayen (2001)
  ◮ LNRE = Large Number of Rare Events
  ◮ LNRE uses various approximations and simplifications to obtain a tractable and elegant model
  ◮ Of course, we could also estimate the precise discrete distributions using MCMC simulations, but:
  1. the LNRE model is usually a minor component of a complex procedure
  2. it is often applied to very large samples (N > 1 M tokens)

  32. Part 1 LNRE models: intuition: The LNRE population
  ◮ Population: a set of S types w_i with occurrence probabilities π_i
  ◮ S = population diversity; can be finite or infinite (S = ∞)
  ◮ We are not interested in specific types ➜ arrange them by decreasing probability: π_1 ≥ π_2 ≥ π_3 ≥ · · ·
  ☞ impossible to determine the probabilities of all individual types
  ◮ Normalization: π_1 + π_2 + . . . + π_S = 1
  ◮ We need a parametric statistical model to describe the full population (esp. for S = ∞), i.e. a function i ↦ π_i
  ◮ type probabilities π_i cannot be estimated reliably from a sample, but the parameters of this function can
  ◮ NB: population index i ≠ Zipf rank r

  33. Part 1 LNRE models: intuition: Examples of population models
  [Figure: four example populations, each plotting type probabilities π_k against the index k = 1, ..., 50]

  35. Part 1 LNRE models: intuition: The Zipf-Mandelbrot law as a population model
  What is the right family of models for lexical frequency distributions?
  ◮ We have already seen that the Zipf-Mandelbrot law captures the distribution of observed frequencies very well
  ◮ Re-phrase the law for type probabilities:
      π_i := C / (i + b)^a
  ◮ Two free parameters: a > 1 and b ≥ 0
  ◮ C is not a parameter but a normalization constant, needed to ensure that Σ_i π_i = 1
  ◮ This is the Zipf-Mandelbrot population model
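For a finite population, the normalization constant C can simply be computed numerically; an illustrative Python sketch (for the infinite model with a > 1, the sum Σ_i (i + b)^−a is the Hurwitz zeta function ζ(a, b + 1)):

```python
import numpy as np

def zm_population(a, b, S):
    """Type probabilities pi_i = C / (i + b)^a for i = 1..S,
    with C chosen so that the probabilities sum to 1."""
    weights = (np.arange(1, S + 1) + b) ** (-a)
    return weights / weights.sum()   # division by the sum implements C

pi = zm_population(a=2.0, b=10.0, S=100_000)
print(pi.sum())        # normalized to 1
print(pi[0], pi[999])  # probabilities decrease with the population index
```

Note that C is determined entirely by a, b (and S), which is why it does not count as a free parameter.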

  36. Part 1 LNRE models: intuition: The parameters of the Zipf-Mandelbrot model
  [Figure: four Zipf-Mandelbrot populations on a linear scale, plotting π_k against k for (a = 1.2, b = 1.5), (a = 2, b = 10), (a = 2, b = 15), (a = 5, b = 40)]

  37. Part 1 LNRE models: intuition: The parameters of the Zipf-Mandelbrot model
  [Figure: the same four populations (a = 1.2, b = 1.5; a = 2, b = 10; a = 2, b = 15; a = 5, b = 40) on a double-logarithmic scale]

  40. Part 1 LNRE models: intuition: The finite Zipf-Mandelbrot model (Evert 2004)
  ◮ The Zipf-Mandelbrot population model characterizes an infinite type population: there is no upper bound on i, and the type probabilities π_i can become arbitrarily small
  ◮ π = 10^-6 (once every million words), π = 10^-9 (once every billion words), π = 10^-15 (once on the entire Internet), π = 10^-100 (once in the universe?)
  ◮ The finite Zipf-Mandelbrot model stops after the first S types
  ◮ The population diversity S becomes a parameter of the model → the finite Zipf-Mandelbrot model has 3 parameters
  Abbreviations:
  ◮ ZM for the Zipf-Mandelbrot model
  ◮ fZM for the finite Zipf-Mandelbrot model

  41. Part 1 LNRE models: intuition: Sampling from a population model
  Assume we believe that the population we are interested in can be described by a Zipf-Mandelbrot model:
  [Figure: π_k for a = 3, b = 50, on linear and logarithmic scales]
  Use computer simulation to generate random samples:
  ◮ Draw N tokens from the population such that in each step, type w_i has probability π_i of being picked
  ◮ This allows us to make predictions for samples (= corpora) of arbitrary size N
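The sampling scheme can be sketched in a few lines (illustrative Python, not the zipfR implementation; the population is truncated to a finite S here so it can be sampled directly):

```python
import numpy as np

rng = np.random.default_rng(42)

def zm_population(a, b, S):
    """Finite ZM population: pi_i proportional to (i + b)^(-a)."""
    weights = (np.arange(1, S + 1) + b) ** (-a)
    return weights / weights.sum()

# population roughly like the slide's example (a = 3, b = 50),
# truncated to S = 5000 types
pi = zm_population(a=3.0, b=50.0, S=5000)

N = 1000
sample = rng.choice(len(pi), size=N, p=pi)   # each draw picks type i with probability pi_i
freqs = np.bincount(sample, minlength=len(pi))

V = int((freqs > 0).sum())    # vocabulary size of this one sample
V1 = int((freqs == 1).sum())  # hapax count
print(V, V1)
```

Re-running with a different seed yields a different V and V1 each time, which is exactly the random variation the following slides illustrate.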

  46. Part 1 LNRE models: intuition: Sampling from a population model
  #1: 1 42 34 23 108 18 48 18 1 . . .
      (time, order, room, school, town, course, area, course, time, . . . )
  #2: 286 28 23 36 3 4 7 4 8 . . .
  #3: 2 11 105 21 11 17 17 1 16 . . .
  #4: 44 3 110 34 223 2 25 20 28 . . .
  #5: 24 81 54 11 8 61 1 31 35 . . .
  #6: 3 65 9 165 5 42 16 20 7 . . .
  #7: 10 21 11 60 164 54 18 16 203 . . .
  #8: 11 7 147 5 24 19 15 85 37 . . .

  47. Part 1 LNRE models: intuition: Samples: type-frequency list & spectrum (sample #1)

  Zipf ranking:                Frequency spectrum:
  rank r   f_r   type i        m    V_m
  1        37    6             1    83
  2        36    1             2    22
  3        33    3             3    20
  4        31    7             4    12
  5        31    10            5    10
  6        30    5             6    5
  7        28    12            7    5
  8        27    2             8    3
  9        24    4             9    3
  10       24    16            10   3
  11       23    8             ...  ...
  12       22    14
  ...      ...   ...

  48. Part 1 LNRE models: intuition: Samples: type-frequency list & spectrum (sample #2)

  Zipf ranking:                Frequency spectrum:
  rank r   f_r   type i        m    V_m
  1        39    2             1    76
  2        34    3             2    27
  3        30    5             3    17
  4        29    10            4    10
  5        28    8             5    6
  6        26    1             6    5
  7        25    13            7    7
  8        24    7             8    3
  9        23    6             10   4
  10       23    11            11   2
  11       20    4             ...  ...
  12       19    17
  ...      ...   ...

  49. Part 1 LNRE models: intuition: Random variation in type-frequency lists
  [Figure: for samples #1 and #2, plotting frequencies by Zipf rank (r ↔ f_r) gives a smooth decreasing curve, while plotting them by population index (i ↔ f_i) gives scattered points]

  50. Part 1 LNRE models: intuition: Random variation: frequency spectrum
  [Figures (slides 50-53): barplots of the frequency spectrum V_m for samples #1, #2, #3 and #4]

  54. Part 1 LNRE models: intuition: Random variation: vocabulary growth curve
  [Figures (slides 54-57): vocabulary growth curves V(N) and V1(N) for samples #1, #2, #3 and #4, with N up to 1000]

  58. Part 1 LNRE models: intuition: Expected values
  ◮ There is no reason to choose one particular sample to compare to the real data or to make a prediction: each one is equally likely or unlikely
  ◮ Take the average over a large number of samples, called the expected value or expectation in statistics
  ◮ Notation: E[V(N)] and E[V_m(N)]
  ◮ this indicates that we are referring to expected values for a sample of size N
  ◮ rather than to the specific values V and V_m observed in a particular sample or a real-world data set
  ◮ Expected values can be calculated efficiently without generating thousands of random samples
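The averaging idea can be made concrete with a small Monte Carlo sketch (illustrative Python with an assumed finite ZM population; the point of LNRE theory is precisely that such simulation is unnecessary):

```python
import numpy as np

rng = np.random.default_rng(7)

def zm_population(a, b, S):
    weights = (np.arange(1, S + 1) + b) ** (-a)
    return weights / weights.sum()

pi = zm_population(a=3.0, b=50.0, S=5000)
N, runs = 1000, 100

# draw many samples of size N and record the vocabulary size of each
V_values = []
for _ in range(runs):
    freqs = np.bincount(rng.choice(len(pi), size=N, p=pi), minlength=len(pi))
    V_values.append(int((freqs > 0).sum()))

E_V = float(np.mean(V_values))   # Monte Carlo estimate of E[V(N)]
print(E_V)
```

Increasing `runs` makes the estimate more stable; the analytic shortcut mentioned on the slide computes the same quantity exactly, without any sampling.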

  59. Part 1 LNRE models: intuition: The expected frequency spectrum
  [Figures (slides 59-62): observed frequency spectra V_m of samples #1-#4 with the expected spectrum E[V_m] overlaid]

  63. Part 1 LNRE models: intuition: The expected vocabulary growth curve
  [Figure: VGC of sample #1, V(N) and V1(N), together with the expected curves E[V(N)] and E[V1(N)]]

  64. Part 1 LNRE models: intuition: Prediction intervals for the expected VGC
  [Figure: expected VGCs E[V(N)] and E[V1(N)] for sample #1 with prediction intervals]
  "Confidence intervals" indicate the predicted sampling distribution:
  ☞ for 95% of samples generated by the LNRE model, the VGC will fall within the range delimited by the thin red lines

  65. Part 1 LNRE models: intuition: Parameter estimation by trial & error
  [Figures (slides 65-71): observed frequency spectrum V_m and vocabulary growth curve V(N), each compared with the expected values of a ZM model for successive parameter settings: a = 1.5, b = 7.5; a = 1.3, b = 7.5; a = 1.3, b = 0.2; a = 1.5, b = 7.5; a = 1.7, b = 7.5; a = 1.7, b = 80; a = 2, b = 550]

  72. Part 1 LNRE models: intuition: Automatic parameter estimation
  [Figure: observed vs. expected frequency spectrum and VGC for a = 2.39, b = 1968.49]
  ◮ By trial & error we found a = 2.0 and b = 550
  ◮ The automatic estimation procedure yields a = 2.39 and b = 1968

  73. Part 1 LNRE models: mathematics: Outline
  Part 1: Motivation; Descriptive statistics & notation; Some examples (zipfR); LNRE models: intuition; LNRE models: mathematics
  Part 2: Applications & examples (zipfR); Limitations; Non-randomness; Conclusion & outlook

  75. Part 1 LNRE models: mathematics: The sampling model
  ◮ Draw a random sample of N tokens from the LNRE population
  ◮ Sufficient statistic: the set of type frequencies {f_i}
  ◮ because the tokens of a random sample have no ordering
  ◮ Joint multinomial distribution of the {f_i}:
      Pr({f_i = k_i} | N) = N! / (k_1! · · · k_S!) · π_1^k_1 · · · π_S^k_S
  ◮ Approximation: do not condition on a fixed sample size N
  ◮ N is now the average (expected) sample size
  ◮ The random variables f_i then have independent Poisson distributions:
      Pr(f_i = k_i) = e^(−N π_i) · (N π_i)^k_i / k_i!
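To see why the Poisson step is harmless for rare types: under the multinomial model, the marginal distribution of a single f_i is Binomial(N, π_i), which the approximation replaces by Poisson(N π_i). A quick numerical comparison (a sketch, standard library only):

```python
from math import comb, exp, factorial

N, p = 100_000, 2e-5        # large sample, rare type: N * pi = 2 expected occurrences
lam = N * p

def binom_pmf(k):
    """Exact marginal under the multinomial model: Binomial(N, pi)."""
    return comb(N, k) * p**k * (1 - p)**(N - k)

def poisson_pmf(k):
    """Poisson approximation with rate N * pi."""
    return exp(-lam) * lam**k / factorial(k)

for k in range(4):
    print(k, binom_pmf(k), poisson_pmf(k))   # nearly identical values
```

The discrepancy shrinks as N grows and π_i shrinks, which is exactly the LNRE regime of large samples and rare events.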

  79. Part 1 LNRE models: mathematics: Frequency spectrum
  ◮ Key problem: we cannot determine f_i in the observed sample
  ◮ because we don't know which type w_i is
  ◮ recall that the population ranking f_i ≠ the Zipf ranking f_r
  ◮ Use the spectrum {V_m} and the vocabulary size V as statistics
  ◮ they contain all the information we have about the observed sample
  ◮ These can be expressed in terms of indicator variables:
      I[f_i = m] = 1 if f_i = m, and 0 otherwise
      V_m = Σ_{i=1..S} I[f_i = m]
      V = Σ_{i=1..S} I[f_i > 0] = Σ_{i=1..S} (1 − I[f_i = 0])
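Taking expectations of these indicator sums under the Poisson model yields E[V_m] = Σ_i Pr(f_i = m) and E[V] = Σ_i (1 − Pr(f_i = 0)), which can be evaluated directly for an assumed population. An illustrative Python sketch with a finite ZM population (parameters chosen for illustration only):

```python
import numpy as np
from math import factorial

def zm_population(a, b, S):
    weights = (np.arange(1, S + 1) + b) ** (-a)
    return weights / weights.sum()

pi = zm_population(a=3.0, b=50.0, S=5000)
N = 1000
lam = N * pi                          # Poisson rates N * pi_i, one per type

# E[V]   = sum_i (1 - Pr(f_i = 0)) = sum_i (1 - exp(-lam_i))
# E[V_m] = sum_i Pr(f_i = m)       = sum_i exp(-lam_i) * lam_i^m / m!
E_V = float(np.sum(1 - np.exp(-lam)))
E_Vm = [float(np.sum(np.exp(-lam) * lam**m / factorial(m))) for m in (1, 2, 3)]
print(E_V, E_Vm)
```

This is the analytic shortcut promised earlier: no random samples are generated, yet the result is the exact average over all possible samples under the model.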
