

slide-1
SLIDE 1

What Every Computational Linguist Should Know About Type-Token Distributions and Zipf’s Law

Tutorial 1, 7 May 2018 Stefan Evert FAU Erlangen-Nürnberg http://zipfr.r-forge.r-project.org/lrec2018.html

Licensed under CC-by-sa version 3.0

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 1 / 99

slide-2
SLIDE 2

Outline

Outline

Part 1
◮ Motivation
◮ Descriptive statistics & notation
◮ Some examples (zipfR)
◮ LNRE models: intuition
◮ LNRE models: mathematics

Part 2
◮ Applications & examples (zipfR)
◮ Limitations
◮ Non-randomness
◮ Conclusion & outlook


slide-3
SLIDE 3

Part 1 Motivation

Outline

Part 1
◮ Motivation
◮ Descriptive statistics & notation
◮ Some examples (zipfR)
◮ LNRE models: intuition
◮ LNRE models: mathematics

Part 2
◮ Applications & examples (zipfR)
◮ Limitations
◮ Non-randomness
◮ Conclusion & outlook


slide-4
SLIDE 4

Part 1 Motivation

Type-token statistics

◮ Type-token statistics different from most statistical inference

◮ not about probability of a specific event ◮ but about diversity of events and their probability distribution

◮ Relatively little work in statistical science ◮ nor a major research topic in computational linguistics

◮ very specialized, usually plays ancillary role in NLP

◮ But type-token statistics appear in wide range of applications

◮ often crucial for sound analysis

➥ NLP community needs better awareness of statistical techniques, their limitations, and available software


slide-5
SLIDE 5

Part 1 Motivation

Some research questions

◮ How many words did Shakespeare know? ◮ What is the coverage of my treebank grammar on big data? ◮ How many typos are there on the Internet? ◮ Is -ness more productive than -ity in English? ◮ Are there differences in the productivity of nominal compounds between academic writing and novels? ◮ Does Dickens use a more complex vocabulary than Rowling? ◮ Can a decline in lexical complexity predict Alzheimer’s disease? ◮ How frequent is a hapax legomenon from the Brown corpus? ◮ What is appropriate smoothing for my n-gram model? ◮ Who wrote the Bixby letter, Lincoln or Hay? ◮ How many different species of . . . are there? (Brainerd 1982)


slide-6
SLIDE 6

Part 1 Motivation

Some research questions

The research questions above fall into broad categories:
◮ coverage estimates
◮ productivity
◮ lexical complexity & stylometry
◮ prior & posterior distribution
◮ unexpected applications


slide-7
SLIDE 7

Part 1 Motivation

Zipf’s law (Zipf 1949)

A) Frequency distributions in natural language are highly skewed B) Curious relationship between rank & frequency

word  r   f        r · f
the   1   142,776  142,776
and   2   100,637  201,274
be    3    94,181  282,543
of    4    74,054  296,216

(Dickens)

C) Various explanations of Zipf’s law

◮ principle of least effort (Zipf 1949) ◮ optimal coding system, MDL (Mandelbrot 1953, 1962) ◮ random sequences (Miller 1957; Li 1992; Cao et al. 2017) ◮ Markov processes ➜ n-gram models (Rouault 1978)

D) Language evolution: birth-death-process (Simon 1955) ☞ not the main topic today!


slide-8
SLIDE 8

Part 1 Descriptive statistics & notation

Outline

Part 1
◮ Motivation
◮ Descriptive statistics & notation
◮ Some examples (zipfR)
◮ LNRE models: intuition
◮ LNRE models: mathematics

Part 2
◮ Applications & examples (zipfR)
◮ Limitations
◮ Non-randomness
◮ Conclusion & outlook


slide-9
SLIDE 9

Part 1 Descriptive statistics & notation

Tokens & types

◮ Our sample: recently, very, not, otherwise, much, very, very, merely, not, now, very, much, merely, not, very
◮ N = 15: number of tokens = sample size
◮ V = 7: number of distinct types = vocabulary size (recently, very, not, otherwise, much, merely, now)


slide-10
SLIDE 10

Part 1 Descriptive statistics & notation

Tokens & types

◮ Our sample: recently, very, not, otherwise, much, very, very, merely, not, now, very, much, merely, not, very
◮ N = 15: number of tokens = sample size
◮ V = 7: number of distinct types = vocabulary size (recently, very, not, otherwise, much, merely, now)

type-frequency list:

w          fw
recently   1
very       5
not        3
otherwise  1
much       2
merely     2
now        1

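The descriptive statistics above are easy to reproduce outside of R. A minimal plain-Python sketch of the type-frequency list (the same structure that zipfR builds from a token vector; all variable names here are illustrative):

```python
from collections import Counter

# the 15-token adverb sample from the slides
tokens = ["recently", "very", "not", "otherwise", "much", "very", "very",
          "merely", "not", "now", "very", "much", "merely", "not", "very"]

N = len(tokens)        # number of tokens = sample size
tfl = Counter(tokens)  # type-frequency list: maps each type w to its frequency fw
V = len(tfl)           # number of distinct types = vocabulary size

print(N, V)            # 15 7
print(tfl["very"])     # 5
```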

slide-11
SLIDE 11

Part 1 Descriptive statistics & notation

Zipf ranking

◮ Our sample: recently, very, not, otherwise, much, very, very, merely, not, now, very, much, merely, not, very
◮ N = 15: number of tokens = sample size
◮ V = 7: number of distinct types = vocabulary size (recently, very, not, otherwise, much, merely, now)

Zipf ranking:

w          r  fr
very       1  5
not        2  3
merely     3  2
much       4  2
now        5  1
otherwise  6  1
recently   7  1


slide-12
SLIDE 12

Part 1 Descriptive statistics & notation

Zipf ranking

◮ Our sample: recently, very, not, otherwise, much, very, very, merely, not, now, very, much, merely, not, very
◮ N = 15: number of tokens = sample size
◮ V = 7: number of distinct types = vocabulary size (recently, very, not, otherwise, much, merely, now)

Zipf ranking:

w          r  fr
very       1  5
not        2  3
merely     3  2
much       4  2
now        5  1
otherwise  6  1
recently   7  1

[Plot: Zipf ranking of the adverbs (rank vs. frequency)]

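The Zipf ranking can be sketched the same way in plain Python (a hedged illustration; types with the same frequency are ordered arbitrarily, so the ranks of merely and much may swap relative to the table):

```python
from collections import Counter

tokens = ["recently", "very", "not", "otherwise", "much", "very", "very",
          "merely", "not", "now", "very", "much", "merely", "not", "very"]

# sort types by decreasing frequency; rank r runs from 1 to V
zipf_ranking = Counter(tokens).most_common()
for r, (w, f) in enumerate(zipf_ranking, 1):
    print(r, w, f)

freqs = [f for _, f in zipf_ranking]  # the non-increasing sequence f_1, f_2, ...
```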

slide-13
SLIDE 13

Part 1 Descriptive statistics & notation

A realistic Zipf ranking: the Brown corpus

top frequencies:

r   f      word
1   69836  the
2   36365  of
3   28826  and
4   26126  to
5   23157  a
6   21314  in
7   10777  that
8   10182  is
9    9968  was
10   9801  he

bottom frequencies:

rank range     f   randomly selected examples
7731 – 8271    10  schedules, polynomials, bleak
8272 – 8922    9   tolerance, shaved, hymn
8923 – 9703    8   decreased, abolish, irresistible
9704 – 10783   7   immunity, cruising, titan
10784 – 11985  6   geographic, lauro, portrayed
11986 – 13690  5   grigori, slashing, developer
13691 – 15991  4   sheath, gaulle, ellipsoids
15992 – 19627  3   mc, initials, abstracted
19628 – 26085  2   thar, slackening, deluxe
26086 – 45215  1   beck, encompasses, second-place


slide-14
SLIDE 14

Part 1 Descriptive statistics & notation

A realistic Zipf ranking: the Brown corpus


slide-15
SLIDE 15

Part 1 Descriptive statistics & notation

A realistic Zipf ranking: the Brown corpus


slide-16
SLIDE 16

Part 1 Descriptive statistics & notation

Frequency spectrum

◮ pool types with f = 1 (hapax legomena), types with f = 2 (dis legomena), . . . , f = m, . . .
◮ V1 = 3: number of hapax legomena (now, otherwise, recently)
◮ V2 = 2: number of dis legomena (merely, much)
◮ general definition: Vm = |{w | fw = m}|

Zipf ranking:

w          r  fr
very       1  5
not        2  3
merely     3  2
much       4  2
now        5  1
otherwise  6  1
recently   7  1

frequency spectrum:

m  Vm
1  3
2  2
3  1
5  1


slide-17
SLIDE 17

Part 1 Descriptive statistics & notation

Frequency spectrum

◮ pool types with f = 1 (hapax legomena), types with f = 2 (dis legomena), . . . , f = m, . . .
◮ V1 = 3: number of hapax legomena (now, otherwise, recently)
◮ V2 = 2: number of dis legomena (merely, much)
◮ general definition: Vm = |{w | fw = m}|

Zipf ranking:

w          r  fr
very       1  5
not        2  3
merely     3  2
much       4  2
now        5  1
otherwise  6  1
recently   7  1

frequency spectrum:

m  Vm
1  3
2  2
3  1
5  1

[Plot: frequency spectrum of the adverbs (m vs. Vm)]

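Computing the frequency spectrum is just a second counting pass: count how many types occur with each frequency. A plain-Python sketch of the structure above (variable names illustrative):

```python
from collections import Counter

tokens = ["recently", "very", "not", "otherwise", "much", "very", "very",
          "merely", "not", "now", "very", "much", "merely", "not", "very"]

tfl = Counter(tokens)             # type -> frequency fw
spectrum = Counter(tfl.values())  # m -> Vm = |{w | fw = m}|

V = sum(spectrum.values())                     # total number of types
N = sum(m * Vm for m, Vm in spectrum.items())  # total number of tokens
print(V, N)                                    # 7 15
```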

slide-18
SLIDE 18

Part 1 Descriptive statistics & notation

A realistic frequency spectrum: the Brown corpus

[Plot: frequency spectrum of the Brown corpus (m vs. Vm)]


slide-19
SLIDE 19

Part 1 Descriptive statistics & notation

Vocabulary growth curve

◮ Our sample: recently, very, not, otherwise, much, very, very, merely, not, now, very, much, merely, not, very
◮ N = 1, V (N) = 1, V1(N) = 1


slide-20
SLIDE 20

Part 1 Descriptive statistics & notation

Vocabulary growth curve

◮ Our sample: recently, very, not, otherwise, much, very, very, merely, not, now, very, much, merely, not, very
◮ N = 1, V (N) = 1, V1(N) = 1
◮ N = 3, V (N) = 3, V1(N) = 3


slide-21
SLIDE 21

Part 1 Descriptive statistics & notation

Vocabulary growth curve

◮ Our sample: recently, very, not, otherwise, much, very, very, merely, not, now, very, much, merely, not, very
◮ N = 1, V (N) = 1, V1(N) = 1
◮ N = 3, V (N) = 3, V1(N) = 3
◮ N = 7, V (N) = 5, V1(N) = 4


slide-22
SLIDE 22

Part 1 Descriptive statistics & notation

Vocabulary growth curve

◮ Our sample: recently, very, not, otherwise, much, very, very, merely, not, now, very, much, merely, not, very
◮ N = 1, V (N) = 1, V1(N) = 1
◮ N = 3, V (N) = 3, V1(N) = 3
◮ N = 7, V (N) = 5, V1(N) = 4
◮ N = 12, V (N) = 7, V1(N) = 4


slide-23
SLIDE 23

Part 1 Descriptive statistics & notation

Vocabulary growth curve

◮ Our sample: recently, very, not, otherwise, much, very, very, merely, not, now, very, much, merely, not, very
◮ N = 1, V (N) = 1, V1(N) = 1
◮ N = 3, V (N) = 3, V1(N) = 3
◮ N = 7, V (N) = 5, V1(N) = 4
◮ N = 12, V (N) = 7, V1(N) = 4
◮ N = 15, V (N) = 7, V1(N) = 3


slide-24
SLIDE 24

Part 1 Descriptive statistics & notation

Vocabulary growth curve

◮ Our sample: recently, very, not, otherwise, much, very, very, merely, not, now, very, much, merely, not, very
◮ N = 1, V (N) = 1, V1(N) = 1
◮ N = 3, V (N) = 3, V1(N) = 3
◮ N = 7, V (N) = 5, V1(N) = 4
◮ N = 12, V (N) = 7, V1(N) = 4
◮ N = 15, V (N) = 7, V1(N) = 3

[Plot: vocabulary growth curves V(N) and V1(N) for the adverbs]

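The vocabulary growth curve records V(N) and V1(N) after each token of the running sample. A plain-Python sketch that reproduces the checkpoints listed above (the helper name vgc is illustrative):

```python
from collections import Counter

tokens = ["recently", "very", "not", "otherwise", "much", "very", "very",
          "merely", "not", "now", "very", "much", "merely", "not", "very"]

def vgc(tokens):
    """Return a dict N -> (V(N), V1(N)) for the running sample."""
    counts = Counter()
    curve = {}
    for n, w in enumerate(tokens, 1):
        counts[w] += 1
        V = len(counts)                                 # vocabulary size so far
        V1 = sum(1 for f in counts.values() if f == 1)  # hapax count so far
        curve[n] = (V, V1)
    return curve

curve = vgc(tokens)
print(curve[15])   # (7, 3)
```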

slide-25
SLIDE 25

Part 1 Descriptive statistics & notation

A realistic vocabulary growth curve: the Brown corpus

[Plot: vocabulary growth curves V(N) and V1(N) for the Brown corpus]

slide-26
SLIDE 26

Part 1 Descriptive statistics & notation

Vocabulary growth in authorship attribution

◮ Authorship attribution by n-gram tracing applied to the case of the Bixby letter (Grieve et al. submitted)
◮ Word or character n-grams in the disputed text are compared against large “training” corpora from the candidate authors


slide-27
SLIDE 27

Part 1 Descriptive statistics & notation

Observing Zipf’s law

across languages and different linguistic units


slide-28
SLIDE 28

Part 1 Descriptive statistics & notation

Observing Zipf’s law

The Italian prefix ri- in the la Repubblica corpus


slide-29
SLIDE 29

Part 1 Descriptive statistics & notation

Observing Zipf’s law


slide-30
SLIDE 30

Part 1 Descriptive statistics & notation

Observing Zipf’s law


slide-31
SLIDE 31

Part 1 Descriptive statistics & notation

Observing Zipf’s law

◮ Straight line in double-logarithmic space corresponds to a power law for the original variables
◮ This leads to Zipf’s (1949; 1965) famous law: fr = C / r^a


slide-32
SLIDE 32

Part 1 Descriptive statistics & notation

Observing Zipf’s law

◮ Straight line in double-logarithmic space corresponds to a power law for the original variables
◮ This leads to Zipf’s (1949; 1965) famous law: fr = C / r^a
◮ Taking the logarithm on both sides, we obtain: log fr = log C − a · log r


slide-33
SLIDE 33

Part 1 Descriptive statistics & notation

Observing Zipf’s law

◮ Straight line in double-logarithmic space corresponds to a power law for the original variables
◮ This leads to Zipf’s (1949; 1965) famous law: fr = C / r^a
◮ Taking the logarithm on both sides, we obtain: log fr = log C − a · log r, i.e. a linear relationship y = log C − a · x with y = log fr and x = log r


slide-34
SLIDE 34

Part 1 Descriptive statistics & notation

Observing Zipf’s law

◮ Straight line in double-logarithmic space corresponds to a power law for the original variables
◮ This leads to Zipf’s (1949; 1965) famous law: fr = C / r^a
◮ Taking the logarithm on both sides, we obtain: log fr = log C − a · log r, i.e. a linear relationship y = log C − a · x with y = log fr and x = log r
◮ Intuitive interpretation of a and C:
  ◮ a is the slope, determining how fast the log frequency decreases
  ◮ log C is the intercept, i.e. the log frequency of the most frequent word (r = 1 ➜ log r = 0)


slide-35
SLIDE 35

Part 1 Descriptive statistics & notation

Observing Zipf’s law

Least-squares fit = linear regression in log-space (Brown corpus)

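The least-squares fit is ordinary linear regression of y = log fr on x = log r. A self-contained sketch on synthetic Zipf data (the values C = 60,000 and a = 1.1 are made up for illustration); on exact power-law input the regression recovers the parameters, while on real rank-frequency data the estimates will only be approximate:

```python
import math

C, a = 60000.0, 1.1                # hypothetical Zipf parameters
ranks = range(1, 1001)
freqs = [C / r**a for r in ranks]  # synthetic data following fr = C / r^a

xs = [math.log(r) for r in ranks]  # x = log r
ys = [math.log(f) for f in freqs]  # y = log fr
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

a_hat = -slope               # estimated exponent a
C_hat = math.exp(intercept)  # estimated constant C
```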

slide-36
SLIDE 36

Part 1 Descriptive statistics & notation

Zipf-Mandelbrot law

Mandelbrot (1953, 1962)

◮ Mandelbrot’s extra parameter: fr = C / (r + b)^a
◮ Zipf’s law is the special case with b = 0


slide-37
SLIDE 37

Part 1 Descriptive statistics & notation

Zipf-Mandelbrot law

Mandelbrot (1953, 1962)

◮ Mandelbrot’s extra parameter: fr = C / (r + b)^a
◮ Zipf’s law is the special case with b = 0
◮ Assuming a = 1, C = 60,000, b = 1:
  ◮ For the word with rank 1, Zipf’s law predicts a frequency of 60,000; Mandelbrot’s variation predicts a frequency of 30,000
  ◮ For the word with rank 1,000, Zipf’s law predicts a frequency of 60; Mandelbrot’s variation predicts a frequency of 59.94

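The numbers on this slide are easy to verify numerically (a minimal sketch; the function names are illustrative):

```python
def zipf(r, C=60000.0, a=1.0):
    """Zipf's law: fr = C / r^a."""
    return C / r ** a

def zipf_mandelbrot(r, C=60000.0, a=1.0, b=1.0):
    """Zipf-Mandelbrot law: fr = C / (r + b)^a."""
    return C / (r + b) ** a

print(zipf(1))                          # 60000.0
print(zipf_mandelbrot(1))               # 30000.0
print(zipf(1000))                       # 60.0
print(round(zipf_mandelbrot(1000), 2))  # 59.94
```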

slide-38
SLIDE 38

Part 1 Descriptive statistics & notation

Zipf-Mandelbrot law

Mandelbrot (1953, 1962)

◮ Mandelbrot’s extra parameter: fr = C / (r + b)^a
◮ Zipf’s law is the special case with b = 0
◮ Assuming a = 1, C = 60,000, b = 1:
  ◮ For the word with rank 1, Zipf’s law predicts a frequency of 60,000; Mandelbrot’s variation predicts a frequency of 30,000
  ◮ For the word with rank 1,000, Zipf’s law predicts a frequency of 60; Mandelbrot’s variation predicts a frequency of 59.94
◮ The Zipf-Mandelbrot law forms the basis of statistical LNRE models
◮ The ZM law can be derived mathematically as the limiting distribution of the vocabulary generated by a character-level Markov process


slide-39
SLIDE 39

Part 1 Descriptive statistics & notation

Zipf-Mandelbrot law

Non-linear least-squares fit (Brown corpus)


slide-40
SLIDE 40

Part 1 Some examples (zipfR)

Outline

Part 1
◮ Motivation
◮ Descriptive statistics & notation
◮ Some examples (zipfR)
◮ LNRE models: intuition
◮ LNRE models: mathematics

Part 2
◮ Applications & examples (zipfR)
◮ Limitations
◮ Non-randomness
◮ Conclusion & outlook


slide-41
SLIDE 41

Part 1 Some examples (zipfR)

zipfR

Evert and Baroni (2007)

◮ http://zipfR.R-Forge.R-Project.org/ ◮ Conveniently available from CRAN repository ◮ Package vignette = gentle tutorial introduction


slide-42
SLIDE 42

Part 1 Some examples (zipfR)

First steps with zipfR

◮ Set up a folder for this course, and make sure it is your working directory in R (preferably as an RStudio project)
◮ Install the most recent version of the zipfR package
◮ Package, handouts, code samples & data sets available from http://zipfr.r-forge.r-project.org/lrec2018.html

> library(zipfR)
> ?zipfR                      # documentation entry point
> vignette("zipfr-tutorial")  # read the zipfR tutorial


slide-43
SLIDE 43

Part 1 Some examples (zipfR)

Loading type-token data

◮ Most convenient input: sequence of tokens as text file in vertical format (“one token per line”)

☞ mapped to appropriate types: normalized word forms, word pairs, lemmatized, semantic class, n-gram of POS tags, . . . ☞ language data should always be in UTF-8 encoding! ☞ large files can be compressed (.gz, .bz2, .xz)


slide-44
SLIDE 44

Part 1 Some examples (zipfR)

Loading type-token data

◮ Most convenient input: sequence of tokens as text file in vertical format (“one token per line”)

☞ mapped to appropriate types: normalized word forms, word pairs, lemmatized, semantic class, n-gram of POS tags, . . . ☞ language data should always be in UTF-8 encoding! ☞ large files can be compressed (.gz, .bz2, .xz)

◮ Sample data: brown_adverbs.txt on tutorial homepage

◮ lowercased adverb tokens from Brown corpus (original order)

☞ download and save to your working directory


slide-45
SLIDE 45

Part 1 Some examples (zipfR)

Loading type-token data

◮ Most convenient input: sequence of tokens as text file in vertical format (“one token per line”)

☞ mapped to appropriate types: normalized word forms, word pairs, lemmatized, semantic class, n-gram of POS tags, . . . ☞ language data should always be in UTF-8 encoding! ☞ large files can be compressed (.gz, .bz2, .xz)

◮ Sample data: brown_adverbs.txt on tutorial homepage

◮ lowercased adverb tokens from Brown corpus (original order)

☞ download and save to your working directory

> adv <- readLines("brown_adverbs.txt", encoding="UTF-8")
> head(adv, 30)  # mathematically, a “vector” of tokens
> length(adv)    # sample size = 52,037 tokens


slide-46
SLIDE 46

Part 1 Some examples (zipfR)

Descriptive statistics: type-frequency list

> adv.tfl <- vec2tfl(adv)
> adv.tfl

   k  f     type
1  1  4859  not
2  2  2084  n’t
3  3  1464  so
4  4  1381  only
5  5  1374  then
6  6  1309  now
7  7  1134  even
8  8  1089  as
. . .
N = 52037, V = 1907

> N(adv.tfl)  # sample size
> V(adv.tfl)  # type count


slide-47
SLIDE 47

Part 1 Some examples (zipfR)

Descriptive statistics: frequency spectrum

> adv.spc <- tfl2spc(adv.tfl)  # or directly with vec2spc
> adv.spc

   m  Vm
1  1  762
2  2  260
3  3  144
4  4   99
5  5   69
6  6   50
7  7   40
8  8   34
. . .
N = 52037, V = 1907

> N(adv.spc)  # sample size
> V(adv.spc)  # type count


slide-48
SLIDE 48

Part 1 Some examples (zipfR)

Descriptive statistics: vocabulary growth

◮ VGC lists vocabulary size V (N) at different sample sizes N
◮ Optionally also spectrum elements Vm(N) up to m.max

> adv.vgc <- vec2vgc(adv, m.max=2)

◮ Visualize descriptive statistics with the plot method

> plot(adv.tfl)               # Zipf ranking
> plot(adv.tfl, log="xy")     # logarithmic scale recommended
> plot(adv.spc)               # barplot of frequency spectrum
> plot(adv.vgc, add.m = 1:2)  # vocabulary growth curve


slide-49
SLIDE 49

Part 1 Some examples (zipfR)

Further example data sets

?Brown               words from Brown corpus
?BrownSubsets        various subsets
?Dickens             words from novels by Charles Dickens
?ItaPref             Italian word-formation prefixes
?TigerNP             NP and PP patterns from German Tiger treebank
?Baayen2001          frequency spectra from Baayen (2001)
?EvertLuedeling2001  German word-formation affixes (manually corrected data from Evert and Lüdeling 2001)

Practice:
◮ Explore these data sets with descriptive statistics
◮ Try different plot options (from help pages ?plot.tfl, ?plot.spc, ?plot.vgc)


slide-50
SLIDE 50

Part 1 LNRE models: intuition

Outline

Part 1
◮ Motivation
◮ Descriptive statistics & notation
◮ Some examples (zipfR)
◮ LNRE models: intuition
◮ LNRE models: mathematics

Part 2
◮ Applications & examples (zipfR)
◮ Limitations
◮ Non-randomness
◮ Conclusion & outlook


slide-51
SLIDE 51

Part 1 LNRE models: intuition

Motivation

◮ Interested in productivity of affix, vocabulary of author, . . . ; not in a particular text or sample

☞ statistical inference from sample to population

◮ Discrete frequency counts are difficult to capture with generalizations such as Zipf’s law

◮ Zipf’s law predicts many impossible types with 1 < fr < 2

☞ population does not suffer from such quantization effects


slide-52
SLIDE 52

Part 1 LNRE models: intuition

LNRE models

◮ This tutorial introduces the state-of-the-art LNRE approach proposed by Baayen (2001)

◮ LNRE = Large Number of Rare Events

◮ LNRE uses various approximations and simplifications to obtain a tractable and elegant model
◮ Of course, we could also estimate the precise discrete distributions using MCMC simulations, but . . .
  1. the LNRE model is usually a minor component of a complex procedure
  2. it is often applied to very large samples (N > 1 M tokens)


slide-53
SLIDE 53

Part 1 LNRE models: intuition

The LNRE population

◮ Population: set of S types wi with occurrence probabilities πi ◮ S = population diversity can be finite or infinite (S = ∞) ◮ Not interested in specific types ➜ arrange by decreasing probability: π1 ≥ π2 ≥ π3 ≥ · · ·

☞ impossible to determine probabilities of all individual types

◮ Normalization: π1 + π2 + . . . + πS = 1 ◮ Need parametric statistical model to describe full population (esp. for S = ∞), i.e. a function i → πi

◮ type probabilities πi cannot be estimated reliably from a

sample, but parameters of this function can

◮ NB: population index i = Zipf rank r

slide-54
SLIDE 54

Part 1 LNRE models: intuition

Examples of population models

[Plots: four example population models, type probability πk against index k]


slide-55
SLIDE 55

Part 1 LNRE models: intuition

The Zipf-Mandelbrot law as a population model

What is the right family of models for lexical frequency distributions? ◮ We have already seen that the Zipf-Mandelbrot law captures the distribution of observed frequencies very well


slide-56
SLIDE 56

Part 1 LNRE models: intuition

The Zipf-Mandelbrot law as a population model

What is the right family of models for lexical frequency distributions?
◮ We have already seen that the Zipf-Mandelbrot law captures the distribution of observed frequencies very well
◮ Re-phrase the law for type probabilities: πi := C / (i + b)^a
◮ Two free parameters: a > 1 and b ≥ 0
◮ C is not a parameter but a normalization constant, needed to ensure that Σi πi = 1

◮ This is the Zipf-Mandelbrot population model

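The normalization constant C is obtained by summing the unnormalized weights 1 / (i + b)^a. A plain-Python sketch for a population truncated to S types (the parameter values a, b, S are made up for illustration; the infinite ZM model replaces the finite sum by an infinite series):

```python
a, b, S = 2.0, 10.0, 1000     # hypothetical parameters, finite population

weights = [1.0 / (i + b) ** a for i in range(1, S + 1)]
C = 1.0 / sum(weights)        # normalization constant
pi = [C * w for w in weights] # type probabilities pi_1 >= pi_2 >= ...

print(abs(sum(pi) - 1.0) < 1e-12)  # True: probabilities sum to 1
```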

slide-57
SLIDE 57

Part 1 LNRE models: intuition

The parameters of the Zipf-Mandelbrot model

[Plots: Zipf-Mandelbrot populations with (a = 1.2, b = 1.5), (a = 2, b = 10), (a = 2, b = 15), (a = 5, b = 40); type probability πk against index k]

slide-58
SLIDE 58

Part 1 LNRE models: intuition

The parameters of the Zipf-Mandelbrot model

[Plots: the same Zipf-Mandelbrot populations with (a = 1.2, b = 1.5), (a = 2, b = 10), (a = 2, b = 15), (a = 5, b = 40), on double-logarithmic scale]

slide-59
SLIDE 59

Part 1 LNRE models: intuition

The finite Zipf-Mandelbrot model

Evert (2004)

◮ Zipf-Mandelbrot population model characterizes an infinite type population: there is no upper bound on i, and the type probabilities πi can become arbitrarily small
◮ π = 10^−6 (once every million words), π = 10^−9 (once every billion words), π = 10^−15 (once on the entire Internet), π = 10^−100 (once in the universe?)


slide-60
SLIDE 60

Part 1 LNRE models: intuition

The finite Zipf-Mandelbrot model

Evert (2004)

◮ Zipf-Mandelbrot population model characterizes an infinite type population: there is no upper bound on i, and the type probabilities πi can become arbitrarily small
◮ π = 10^−6 (once every million words), π = 10^−9 (once every billion words), π = 10^−15 (once on the entire Internet), π = 10^−100 (once in the universe?)
◮ The finite Zipf-Mandelbrot model stops after the first S types
◮ Population diversity S becomes a parameter of the model → the finite Zipf-Mandelbrot model has 3 parameters


slide-61
SLIDE 61

Part 1 LNRE models: intuition

The finite Zipf-Mandelbrot model

Evert (2004)

◮ Zipf-Mandelbrot population model characterizes an infinite type population: there is no upper bound on i, and the type probabilities πi can become arbitrarily small
◮ π = 10^−6 (once every million words), π = 10^−9 (once every billion words), π = 10^−15 (once on the entire Internet), π = 10^−100 (once in the universe?)
◮ The finite Zipf-Mandelbrot model stops after the first S types
◮ Population diversity S becomes a parameter of the model → the finite Zipf-Mandelbrot model has 3 parameters

Abbreviations:
◮ ZM for the Zipf-Mandelbrot model
◮ fZM for the finite Zipf-Mandelbrot model


slide-62
SLIDE 62

Part 1 LNRE models: intuition

Sampling from a population model

Assume we believe that the population we are interested in can be described by a Zipf-Mandelbrot model:

[Plots: Zipf-Mandelbrot population with a = 3, b = 50, on linear and double-logarithmic scale]

Use computer simulation to generate random samples:
◮ Draw N tokens from the population such that in each step, type wi has probability πi of being picked
◮ This allows us to make predictions for samples (= corpora) of arbitrary size N

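Such random samples can be drawn with the standard library alone. A sketch using a finite ZM population (the parameter values are made up), with tokens represented by their type index:

```python
import random
from collections import Counter

a, b, S = 3.0, 50.0, 5000  # hypothetical fZM population
weights = [1.0 / (i + b) ** a for i in range(1, S + 1)]

random.seed(42)            # reproducible demo
N = 1000
sample = random.choices(range(1, S + 1), weights=weights, k=N)

counts = Counter(sample)
V = len(counts)                                 # observed vocabulary size
V1 = sum(1 for f in counts.values() if f == 1)  # observed hapax legomena
print(N, V, V1)                                 # V and V1 vary from sample to sample
```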

slide-63
SLIDE 63

Part 1 LNRE models: intuition

Sampling from a population model

#1: 1 42 34 23 108 18 48 18 1 . . .


slide-64
SLIDE 64

Part 1 LNRE models: intuition

Sampling from a population model

#1: 1 42 34 23 108 18 48 18 1 . . . time order room school town course area course time . . .


slide-65
SLIDE 65

Part 1 LNRE models: intuition

Sampling from a population model

#1: 1 42 34 23 108 18 48 18 1 . . . time order room school town course area course time . . . #2: 286 28 23 36 3 4 7 4 8 . . .


slide-66
SLIDE 66

Part 1 LNRE models: intuition

Sampling from a population model

#1: 1 42 34 23 108 18 48 18 1 . . . time order room school town course area course time . . . #2: 286 28 23 36 3 4 7 4 8 . . . #3: 2 11 105 21 11 17 17 1 16 . . .


slide-67
SLIDE 67

Part 1 LNRE models: intuition

Sampling from a population model

#1: 1 42 34 23 108 18 48 18 1 . . . time order room school town course area course time . . . #2: 286 28 23 36 3 4 7 4 8 . . . #3: 2 11 105 21 11 17 17 1 16 . . . #4: 44 3 110 34 223 2 25 20 28 . . . #5: 24 81 54 11 8 61 1 31 35 . . . #6: 3 65 9 165 5 42 16 20 7 . . . #7: 10 21 11 60 164 54 18 16 203 . . . #8: 11 7 147 5 24 19 15 85 37 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


slide-68
SLIDE 68

Part 1 LNRE models: intuition

Samples: type frequency list & spectrum

sample #1 — Zipf ranking:

rank r  fr  type i
1       37  6
2       36  1
3       33  3
4       31  7
5       31  10
6       30  5
7       28  12
8       27  2
9       24  4
10      24  16
11      23  8
12      22  14
. . .

sample #1 — frequency spectrum:

m   Vm
1   83
2   22
3   20
4   12
5   10
6   5
7   5
8   3
9   3
10  3
. . .


slide-69
SLIDE 69

Part 1 LNRE models: intuition

Samples: type frequency list & spectrum

sample #2 — Zipf ranking:

rank r  fr  type i
1       39  2
2       34  3
3       30  5
4       29  10
5       28  8
6       26  1
7       25  13
8       24  7
9       23  6
10      23  11
11      20  4
12      19  17
. . .

sample #2 — frequency spectrum:

m   Vm
1   76
2   27
3   17
4   10
5   6
6   5
7   7
8   3
10  4
11  2
. . .


slide-70
SLIDE 70

Part 1 LNRE models: intuition

Random variation in type-frequency lists

[Plots: type-frequency lists of samples #1 and #2, once by Zipf rank (r ↔ fr) and once by population index (i ↔ fi)]


slide-71
SLIDE 71

Part 1 LNRE models: intuition

Random variation: frequency spectrum

Sample #1
[Plot: frequency spectrum (m vs. Vm)]


slide-72
SLIDE 72

Part 1 LNRE models: intuition

Random variation: frequency spectrum

Sample #2
[Plot: frequency spectrum (m vs. Vm)]


slide-73
SLIDE 73

Part 1 LNRE models: intuition

Random variation: frequency spectrum

Sample #3
[Plot: frequency spectrum (m vs. Vm)]


slide-74
SLIDE 74

Part 1 LNRE models: intuition

Random variation: frequency spectrum

Sample #4
[Plot: frequency spectrum (m vs. Vm)]


slide-75
SLIDE 75

Part 1 LNRE models: intuition

Random variation: vocabulary growth curve

Sample #1
[Plot: vocabulary growth curves V(N) and V1(N)]


slide-76
SLIDE 76

Part 1 LNRE models: intuition

Random variation: vocabulary growth curve

Sample #2
[Plot: vocabulary growth curves V(N) and V1(N)]


slide-77
SLIDE 77

Part 1 LNRE models: intuition

Random variation: vocabulary growth curve

Sample #3
[Plot: vocabulary growth curves V(N) and V1(N)]


slide-78
SLIDE 78

Part 1 LNRE models: intuition

Random variation: vocabulary growth curve

Sample #4
[Plot: vocabulary growth curves V(N) and V1(N)]


slide-79
SLIDE 79

Part 1 LNRE models: intuition

Expected values

◮ There is no reason to choose one particular sample for comparison with the real data or for making a prediction – each one is equally likely or unlikely
◮ Take the average over a large number of samples, called the expected value or expectation in statistics
◮ Notation: E[V(N)] and E[V_m(N)]
◮ E[·] indicates that we are referring to expected values for a sample of size N
◮ rather than to the specific values V and V_m observed in a particular sample or a real-world data set

◮ Expected values can be calculated efficiently without generating thousands of random samples
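The last point can be checked directly: for a finite population with known type probabilities π_i, the expectation E[V(N)] = Σ_i (1 − (1 − π_i)^N) is available in closed form under multinomial sampling, and a small Monte Carlo run reproduces it. A minimal sketch in Python (the toy 1/i population and all names are illustrative, not part of the tutorial):

```python
import random

# Toy Zipfian population: pi_i proportional to 1/i (illustrative only)
S = 50
weights = [1.0 / i for i in range(1, S + 1)]
total = sum(weights)
pi = [w / total for w in weights]

N = 100  # sample size

# Direct computation: E[V(N)] = sum_i (1 - (1 - pi_i)^N)
ev_direct = sum(1.0 - (1.0 - p) ** N for p in pi)

# Monte Carlo check: average the observed V over many random samples
random.seed(42)
runs = 2000
ev_mc = sum(len(set(random.choices(range(S), weights=pi, k=N)))
            for _ in range(runs)) / runs

print(round(ev_direct, 2), round(ev_mc, 2))  # the two values agree closely
```

The direct formula is exact, while the Monte Carlo average only approaches it as the number of replicates grows – which is the slide's point about efficiency.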

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 49 / 99

slide-80
SLIDE 80

Part 1 LNRE models: intuition

The expected frequency spectrum

Sample #1: [bar plot of observed V_m vs. expected E[V_m] against m]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 50 / 99

slide-81
SLIDE 81

Part 1 LNRE models: intuition

The expected frequency spectrum

Sample #2: [bar plot of observed V_m vs. expected E[V_m] against m]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 50 / 99

slide-82
SLIDE 82

Part 1 LNRE models: intuition

The expected frequency spectrum

Sample #3: [bar plot of observed V_m vs. expected E[V_m] against m]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 50 / 99

slide-83
SLIDE 83

Part 1 LNRE models: intuition

The expected frequency spectrum

Sample #4: [bar plot of observed V_m vs. expected E[V_m] against m]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 50 / 99

slide-84
SLIDE 84

Part 1 LNRE models: intuition

The expected vocabulary growth curve

Sample #1: [two panels plotting observed V(N) and V1(N) against the expected curves E[V(N)] and E[V1(N)]]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 51 / 99

slide-85
SLIDE 85

Part 1 LNRE models: intuition

Prediction intervals for the expected VGC

Sample #1: [VGC panels for V(N) and V1(N) with expected curves E[V(N)], E[V1(N)] and prediction intervals]

“Confidence intervals” indicate predicted sampling distribution: ☞ for 95% of samples generated by the LNRE model, VGC will fall within the range delimited by the thin red lines

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 52 / 99

slide-86
SLIDE 86

Part 1 LNRE models: intuition

Parameter estimation by trial & error

ZM model (a = 1.5, b = 7.5) vs. observed data: [two panels comparing frequency spectrum V_m / E[V_m] and vocabulary growth V(N) / E[V(N)]]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 53 / 99

slide-87
SLIDE 87

Part 1 LNRE models: intuition

Parameter estimation by trial & error

ZM model (a = 1.3, b = 7.5) vs. observed data: [two panels comparing frequency spectrum V_m / E[V_m] and vocabulary growth V(N) / E[V(N)]]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 53 / 99

slide-88
SLIDE 88

Part 1 LNRE models: intuition

Parameter estimation by trial & error

ZM model (a = 1.3, b = 0.2) vs. observed data: [two panels comparing frequency spectrum V_m / E[V_m] and vocabulary growth V(N) / E[V(N)]]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 53 / 99

slide-89
SLIDE 89

Part 1 LNRE models: intuition

Parameter estimation by trial & error

ZM model (a = 1.5, b = 7.5) vs. observed data: [two panels comparing frequency spectrum V_m / E[V_m] and vocabulary growth V(N) / E[V(N)]]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 53 / 99

slide-90
SLIDE 90

Part 1 LNRE models: intuition

Parameter estimation by trial & error

ZM model (a = 1.7, b = 7.5) vs. observed data: [two panels comparing frequency spectrum V_m / E[V_m] and vocabulary growth V(N) / E[V(N)]]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 53 / 99

slide-91
SLIDE 91

Part 1 LNRE models: intuition

Parameter estimation by trial & error

ZM model (a = 1.7, b = 80) vs. observed data: [two panels comparing frequency spectrum V_m / E[V_m] and vocabulary growth V(N) / E[V(N)]]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 53 / 99

slide-92
SLIDE 92

Part 1 LNRE models: intuition

Parameter estimation by trial & error

ZM model (a = 2, b = 550) vs. observed data: [two panels comparing frequency spectrum V_m / E[V_m] and vocabulary growth V(N) / E[V(N)]]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 53 / 99

slide-93
SLIDE 93

Part 1 LNRE models: intuition

Automatic parameter estimation

Automatically estimated model (a = 2.39, b = 1968.49) vs. observed data: [two panels comparing frequency spectrum V_m / E[V_m] and vocabulary growth V(N) / E[V(N)]]

◮ By trial & error we found a = 2.0 and b = 550
◮ Automatic estimation procedure: a = 2.39 and b = 1968

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 54 / 99

slide-94
SLIDE 94

Part 1 LNRE models: mathematics

Outline

Part 1 Motivation Descriptive statistics & notation Some examples (zipfR) LNRE models: intuition LNRE models: mathematics Part 2 Applications & examples (zipfR) Limitations Non-randomness Conclusion & outlook

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 55 / 99

slide-96
SLIDE 96

Part 1 LNRE models: mathematics

The sampling model

◮ Draw a random sample of N tokens from the LNRE population
◮ Sufficient statistic: set of type frequencies {f_i}
◮ because the tokens of a random sample have no ordering

◮ Joint multinomial distribution of {f_i}:

Pr({f_i = k_i} | N) = N! / (k_1! · · · k_S!) · π_1^{k_1} · · · π_S^{k_S}

◮ Approximation: do not condition on fixed sample size N
◮ N is now the average (expected) sample size
◮ Random variables f_i then have independent Poisson distributions:

Pr(f_i = k_i) = e^{−Nπ_i} (Nπ_i)^{k_i} / k_i!
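The Poisson approximation can be illustrated on the marginal distribution of a single type frequency: under multinomial sampling f_i is exactly Binomial(N, π_i), which is close to Poisson(Nπ_i) when π_i is small. A hedged sketch (helper names are my own, not from zipfR):

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    """Exact marginal Pr(f_i = k) under multinomial sampling."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson approximation: Pr(f_i = k) = exp(-N pi_i) (N pi_i)^k / k!"""
    return exp(-lam) * lam ** k / factorial(k)

N, p = 10_000, 0.0003  # a rare type: N * pi_i = 3
for k in range(5):
    print(k, round(binom_pmf(k, N, p), 6), round(poisson_pmf(k, N * p), 6))
```

For the rare types that dominate LNRE populations, the two columns printed above are nearly indistinguishable; the additional approximation on the slide is treating the f_i as independent across types.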

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 56 / 99

slide-100
SLIDE 100

Part 1 LNRE models: mathematics

Frequency spectrum

◮ Key problem: we cannot determine f_i in the observed sample
◮ because we don't know which type w_i is
◮ recall that population ranking f_i ≠ Zipf ranking f_r

◮ Use the spectrum {V_m} and vocabulary size V as statistics
◮ they contain all the information we have about the observed sample

◮ Can be expressed in terms of indicator variables

I_[f_i=m] = 1 if f_i = m, 0 otherwise

V_m = Σ_{i=1}^{S} I_[f_i=m]        V = Σ_{i=1}^{S} I_[f_i>0] = Σ_{i=1}^{S} (1 − I_[f_i=0])

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 57 / 99

slide-104
SLIDE 104

Part 1 LNRE models: mathematics

The expected spectrum

◮ It is easy to compute expected values for the frequency spectrum (and variances, because the f_i are independent):

E[I_[f_i=m]] = Pr(f_i = m) = e^{−Nπ_i} (Nπ_i)^m / m!

E[V_m] = Σ_{i=1}^{S} E[I_[f_i=m]] = Σ_{i=1}^{S} e^{−Nπ_i} (Nπ_i)^m / m!

E[V] = Σ_{i=1}^{S} E[1 − I_[f_i=0]] = Σ_{i=1}^{S} (1 − e^{−Nπ_i})

◮ NB: Vm and V are not independent because they are derived from the same random variables fi
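A minimal sketch of these sums, assuming a toy Zipfian population (the population and all parameter choices are illustrative only):

```python
from math import exp, factorial

def expected_spectrum(pi, N, max_m=3):
    """E[V] and E[V_1..V_max_m] under the independent-Poisson model."""
    ev = sum(1.0 - exp(-N * p) for p in pi)
    evm = [sum(exp(-N * p) * (N * p) ** m / factorial(m) for p in pi)
           for m in range(1, max_m + 1)]
    return ev, evm

# Toy Zipfian population: pi_i proportional to 1/i (sizes illustrative)
S = 1000
w = [1.0 / i for i in range(1, S + 1)]
Z = sum(w)
pi = [x / Z for x in w]

ev, (ev1, ev2, ev3) = expected_spectrum(pi, N=5000)
print(round(ev, 1), round(ev1, 1), round(ev2, 1), round(ev3, 1))
```

Since E[V] = Σ_{m≥1} E[V_m], the partial sum E[V_1] + E[V_2] + E[V_3] always stays below E[V], which in turn stays below the population size S.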

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 58 / 99

slide-106
SLIDE 106

Part 1 LNRE models: mathematics

Sampling distribution of Vm and V

◮ Joint sampling distribution of {V_m} and V is complicated
◮ Approximation: V and {V_m} asymptotically follow a multivariate normal distribution
◮ motivated by the multivariate central limit theorem: sum of many independent variables I_[f_i=m]

◮ Usually limited to the first spectrum elements, e.g. V_1, . . . , V_15
◮ approximation of the discrete V_m by a continuous distribution is suitable only if E[V_m] is sufficiently large

◮ Parameters of the multivariate normal: µ = (E[V], E[V_1], E[V_2], . . .) and Σ = covariance matrix

Pr((V, V_1, . . . , V_k) = v) ∼ exp(−(1/2) (v − µ)^T Σ^{−1} (v − µ)) / √((2π)^{k+1} det Σ)

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 59 / 99

slide-108
SLIDE 108

Part 1 LNRE models: mathematics

Type density function

◮ Discrete sums of probabilities in E[V], E[V_m], . . . are inconvenient and computationally expensive
◮ Approximation: continuous type density function g(π)

|{w_i | a ≤ π_i ≤ b}| = ∫_a^b g(π) dπ

Σ {π_i | a ≤ π_i ≤ b} = ∫_a^b π g(π) dπ

◮ Normalization constraint: ∫_0^∞ π g(π) dπ = 1
◮ Good approximation for low-probability types, but the probability mass of w_1, w_2, . . . is “smeared out” over a range

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 60 / 99

slide-110
SLIDE 110

Part 1 LNRE models: mathematics

Type density function

[Left panel: type density g(π) as a continuous approximation of the type probabilities of w_1 . . . w_4, against occurrence probability π; right panel: probability density f(π) vs. type probabilities π_1 = .376, π_2 = .215, π_3 = .130, π_4 = .082]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 61 / 99

slide-113
SLIDE 113

Part 1 LNRE models: mathematics

ZM and fZM as LNRE models

◮ Discrete Zipf-Mandelbrot population: π_i := C / (i + b)^a for i = 1, . . . , S
◮ Corresponding type density function (Evert 2004):

g(π) = C · π^{−α−1} for A ≤ π ≤ B, 0 otherwise

with parameters
◮ α = 1/a (0 < α < 1)
◮ B = (1 − α)/(b · α)
◮ 0 ≤ A < B determines S (ZM with S = ∞ for A = 0)

☞ C is a normalization factor, not a parameter
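A small numerical check of this parametrization: with C derived from the normalization constraint, ∫ π g(π) dπ integrates to 1, and the population size S = ∫ g(π) dπ is finite for A > 0 (fZM). A sketch with invented parameter values:

```python
# fZM type density with invented parameter values; C follows from the
# normalization constraint and is not a free parameter
alpha, A, B = 0.5, 1e-9, 0.01

C = (1 - alpha) / (B ** (1 - alpha) - A ** (1 - alpha))

def g(p):
    return C * p ** (-alpha - 1) if A <= p <= B else 0.0

# Population size S = integral of g(pi) over [A, B] (finite because A > 0)
S = C * (A ** -alpha - B ** -alpha) / alpha

# Midpoint-rule check that the total probability mass integrates to 1
steps = 200_000
h = (B - A) / steps
mass = sum(p * g(p) for p in (A + (j + 0.5) * h for j in range(steps))) * h
print(round(mass, 3), round(S))
```

Letting A → 0 sends S to infinity while the probability mass stays normalized – the defining LNRE situation.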

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 62 / 99

slide-114
SLIDE 114

Part 1 LNRE models: mathematics

ZM and fZM as LNRE models

[Log-log plot: type density g(π) of the ZM model against occurrence probability π]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 63 / 99

slide-115
SLIDE 115

Part 1 LNRE models: mathematics

ZM and fZM as LNRE models

[Log-log plot: type density g(π) of the fZM model against occurrence probability π]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 63 / 99

slide-117
SLIDE 117

Part 1 LNRE models: mathematics

Expectations as integrals

◮ Expected values can now be expressed as integrals over g(π):

E[V_m] = ∫_0^∞ (Nπ)^m / m! · e^{−Nπ} g(π) dπ

E[V] = ∫_0^∞ (1 − e^{−Nπ}) g(π) dπ

◮ These reduce to a simple closed form for ZM (approximation):

E[V_m] = (C / m!) · N^α · Γ(m − α)

E[V] = C · N^α · Γ(1 − α) / α

◮ fZM and the exact solution for ZM involve the incomplete Gamma function
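The closed form can be compared against a direct numerical evaluation of the integral (ZM case, m = 1; all parameter values invented for illustration):

```python
from math import gamma, exp, factorial

alpha, B = 0.5, 0.01
C = (1 - alpha) / B ** (1 - alpha)  # ZM normalization (A = 0)
N, m = 100_000, 1

# Closed-form approximation: E[V_m] = C / m! * N^alpha * Gamma(m - alpha)
ev_closed = C / factorial(m) * N ** alpha * gamma(m - alpha)

# Numerical integral of (N pi)^m / m! * exp(-N pi) * g(pi) over (0, B],
# after substituting t = N * pi
steps = 400_000
upper = N * B  # = 1000, far into the exponential tail
h = upper / steps
integral = sum(t ** (m - alpha - 1) * exp(-t)
               for t in ((j + 0.5) * h for j in range(steps))) * h
ev_num = C * N ** alpha / factorial(m) * integral
print(round(ev_closed, 1), round(ev_num, 1))
```

The two values agree to within a few percent; the residual gap is midpoint-rule error near π = 0, where the exact ZM solution would use the incomplete Gamma function.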

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 64 / 99

slide-119
SLIDE 119

Part 1 LNRE models: mathematics

Parameter estimation from training corpus

◮ For ZM, α = E[V_1] / E[V] ≈ V_1 / V can be estimated directly, but is prone to overfitting
◮ General parameter fitting by MLE: maximize the likelihood of the observed spectrum v

max_{α,A,B} Pr((V, V_1, . . . , V_k) = v | α, A, B)

◮ Multivariate normal approximation:

min_{α,A,B} (v − µ)^T Σ^{−1} (v − µ)

◮ Minimization by gradient descent (BFGS, CG) or simplex search (Nelder-Mead)
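The trial-and-error procedure from the previous slides can be caricatured as a grid search over (α, B) using the ZM closed forms; here the "observed" summary is generated from a known model so that recovery can be verified. A sketch, not the zipfR estimation code:

```python
from math import gamma, factorial

def ezm_spectrum(alpha, B, N, k=5):
    """E[V] and E[V_1..V_k] for a ZM model (closed-form approximation)."""
    C = (1 - alpha) / B ** (1 - alpha)
    evm = [C / factorial(m) * N ** alpha * gamma(m - alpha)
           for m in range(1, k + 1)]
    ev = C * N ** alpha * gamma(1 - alpha) / alpha
    return ev, evm

# "Observed" summary generated from a known model, so recovery is checkable
true_alpha, true_B, N = 0.6, 0.01, 100_000
v_obs, vm_obs = ezm_spectrum(true_alpha, true_B, N)

# Trial & error as a crude grid search over (alpha, B)
best = None
for ia in range(1, 99):
    for ib in range(1, 50):
        alpha, B = ia / 100, ib / 1000
        ev, evm = ezm_spectrum(alpha, B, N)
        cost = (ev - v_obs) ** 2 + sum((a - b) ** 2 for a, b in zip(evm, vm_obs))
        if best is None or cost < best[0]:
            best = (cost, alpha, B)

print(best[1], best[2])
```

Real estimation replaces the brute-force grid with BFGS/CG or Nelder-Mead and a chi-squared-style cost, but the structure – minimize a discrepancy between observed and expected spectrum – is the same.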

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 65 / 99

slide-120
SLIDE 120

Part 1 LNRE models: mathematics

Parameter estimation from training corpus

BNC (bare singular PPs): [contour plot of goodness-of-fit X² (m = 10) over α and log10(B), minimum at (0.65, −2.11)]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 66 / 99

slide-121
SLIDE 121

Part 1 LNRE models: mathematics

Parameter estimation from training corpus

Brown Corpus (word forms): [contour plot of goodness-of-fit X² (m = 5) over α and log10(B), minimum at (0.45, −3.01)]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 66 / 99

slide-123
SLIDE 123

Part 1 LNRE models: mathematics

Goodness-of-fit

(Baayen 2001, Sec. 3.3)

◮ How well does the fitted model explain the observed data?
◮ For the multivariate normal distribution:

X² = (V − µ)^T Σ^{−1} (V − µ) ∼ χ²_{k+1}

where V = (V, V_1, . . . , V_k)
➥ Multivariate chi-squared test of goodness-of-fit
◮ replace V by the observed v ➜ test statistic x²
◮ must reduce df = k + 1 by the number of estimated parameters
◮ NB: significant rejection of the LNRE model for p < .05
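For a low-dimensional summary the statistic is easy to compute by hand; a sketch with k = 1, i.e. v = (V, V_1), where the observed vector, µ, and Σ are all invented for illustration:

```python
# Invented 2-dimensional summary: v = (V, V1), with mu and Sigma
# standing in for the fitted model's expectations and covariance matrix
v = [2050.0, 1310.0]
mu = [2000.0, 1280.0]
Sigma = [[900.0, 300.0],
         [300.0, 400.0]]

# Inverse of the 2x2 covariance matrix
det = Sigma[0][0] * Sigma[1][1] - Sigma[0][1] * Sigma[1][0]
inv = [[Sigma[1][1] / det, -Sigma[0][1] / det],
       [-Sigma[1][0] / det, Sigma[0][0] / det]]

# X^2 = (v - mu)^T Sigma^{-1} (v - mu)
d = [v[0] - mu[0], v[1] - mu[1]]
x2 = sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))
print(round(x2, 2))
```

The resulting x² is then compared against a chi-squared distribution whose k + 1 degrees of freedom are reduced by the number of estimated parameters.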

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 67 / 99

slide-124
SLIDE 124

Part 1 LNRE models: mathematics

Coffee break!

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 68 / 99

slide-125
SLIDE 125

Part 2 Applications & examples (zipfR)

Outline

Part 1 Motivation Descriptive statistics & notation Some examples (zipfR) LNRE models: intuition LNRE models: mathematics Part 2 Applications & examples (zipfR) Limitations Non-randomness Conclusion & outlook

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 69 / 99

slide-126
SLIDE 126

Part 2 Applications & examples (zipfR)

Measuring morphological productivity

example from Evert and Lüdeling (2001)

[Plot: vocabulary growth curves V(N) for the affixes -bar, -sam, and -ös]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 70 / 99

slide-127
SLIDE 127

Part 2 Applications & examples (zipfR)

Measuring morphological productivity

example from Evert and Lüdeling (2001)

Fitted model (a = 1.45, b = 34.59, S = 20587): [plot comparing observed and expected V(N) and V1(N)]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 70 / 99

slide-130
SLIDE 130

Part 2 Applications & examples (zipfR)

Quantitative measures of productivity

(Tweedie and Baayen 1998; Baayen 2001)
◮ Baayen’s (1991) productivity index P (slope of the vocabulary growth curve): P = V_1 / N
◮ TTR = type-token ratio: TTR = V / N
◮ Zipf-Mandelbrot slope a
◮ Herdan’s law (1964): C = log V / log N
◮ Yule (1944) / Simpson (1949): K = 10 000 · (Σ_m m² V_m − N) / N²
◮ Guiraud (1954): R = V / √N
◮ Sichel (1975): S = V_2 / V
◮ Honoré (1979): H = log N / (1 − V_1 / V)

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 71 / 99
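The measures listed above are straightforward to compute from a frequency spectrum {V_m} and sample size N; a sketch with an invented toy spectrum (formulas exactly as on the slide):

```python
from math import log, sqrt

def productivity_measures(spectrum, N):
    """Productivity measures from a frequency spectrum {m: V_m} and
    sample size N (formulas as on the slide)."""
    V = sum(spectrum.values())
    V1 = spectrum.get(1, 0)
    V2 = spectrum.get(2, 0)
    return {
        "TTR": V / N,
        "P": V1 / N,                # Baayen's productivity index
        "C": log(V) / log(N),       # Herdan's law
        "K": 10_000 * (sum(m * m * vm for m, vm in spectrum.items()) - N) / N ** 2,
        "R": V / sqrt(N),           # Guiraud
        "S": V2 / V,                # Sichel
        "H": log(N) / (1 - V1 / V), # Honoré
    }

# Invented toy spectrum: 8 hapaxes, 3 dis legomena, 1 type of frequency 4
spec = {1: 8, 2: 3, 4: 1}
N = 1 * 8 + 2 * 3 + 4 * 1  # 18 tokens
measures = productivity_measures(spec, N)
print({k: round(v, 3) for k, v in measures.items()})
```

All of these are functions of (N, V, V_1, V_2, . . .) only, which is exactly why their dependence on sample size N (next slides) is the central problem.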

slide-132
SLIDE 132

Part 2 Applications & examples (zipfR)

Productivity measures for bare singulars in the BNC

         spoken    written
V         2,039     12,876
N         6,766     85,750
K         86.84      28.57
R         24.79      43.97
S          0.13       0.15
C          0.86       0.83
P          0.21       0.08
TTR       0.301      0.150
a          1.18       1.27
pop. S   15,958     36,874

[Plot: vocabulary growth curves V(N) for BNC written and spoken]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 72 / 99

slide-133
SLIDE 133

Part 2 Applications & examples (zipfR)

Are these “lexical constants” really constant?

[Eight panels plotted against sample size N: Yule's K, Guiraud's R, Sichel's S, Herdan's law C, Baayen's P, TTR, Zipf slope (a), and population size]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 73 / 99

slide-134
SLIDE 134

Part 2 Applications & examples (zipfR)

Simulation experiments based on LNRE models

◮ Systematic study of size dependence and other aspects of productivity measures, based on samples from an LNRE model
◮ LNRE model ➜ well-defined population
◮ Random sampling helps to assess the variability of the measures
◮ Expected values E[P] etc. can often be computed directly (or approximated) ➜ computationally efficient
➥ LNRE models as tools for understanding productivity measures

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 74 / 99

slide-135
SLIDE 135

Part 2 Applications & examples (zipfR)

Simulation: sample size

[Eight panels plotted against sample size: Yule-Simpson K, Guiraud R, Herdan's law C, TTR, Baayen P, Zipf slope (a), Sichel S, and Honoré H]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 75 / 99

slide-136
SLIDE 136

Part 2 Applications & examples (zipfR)

Simulation: frequent lexicalized types

[Eight panels plotted against the proportion of frequent lexicalized types: Yule-Simpson K, Guiraud R, Herdan law C, TTR, Baayen P, Zipf slope (a), Sichel S, and Honoré H]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 76 / 99

slide-137
SLIDE 137

Part 2 Applications & examples (zipfR)

interactive demo

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 77 / 99

slide-138
SLIDE 138

Part 2 Applications & examples (zipfR)

Posterior distribution

[Plot: posterior distribution Pr(π|m) for m = 1 under a ZM model with α = 0.4, against expected frequency Nπ; marks MLE, 95% and 99.9% confidence ranges, and the Good-Turing estimate]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 78 / 99

slide-139
SLIDE 139

Part 2 Applications & examples (zipfR)

Posterior distribution

[Plot: posterior distribution Pr(π|m) for m = 1 under a ZM model with α = 0.9, against expected frequency Nπ; marks MLE, 95% and 99.9% confidence ranges, and the Good-Turing estimate]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 78 / 99

slide-140
SLIDE 140

Part 2 Applications & examples (zipfR)

Posterior distribution

[Plot: posterior distribution Pr(π|m) for m = 2 under a ZM model with α = 0.9, against expected frequency Nπ; marks MLE, MAP, 95% and 99.9% confidence ranges, and the Good-Turing estimate]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 78 / 99

slide-141
SLIDE 141

Part 2 Limitations

Outline

Part 1 Motivation Descriptive statistics & notation Some examples (zipfR) LNRE models: intuition LNRE models: mathematics Part 2 Applications & examples (zipfR) Limitations Non-randomness Conclusion & outlook

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 79 / 99

slide-145
SLIDE 145

Part 2 Limitations

How reliable are the fitted models?

Three potential issues:

1. Model assumptions ≠ population
(e.g. distribution does not follow a Zipf-Mandelbrot law)
☞ model cannot be adequate, regardless of parameter settings

2. Parameter estimation unsuccessful
(i.e. suboptimal goodness-of-fit to training data)
☞ optimization algorithm trapped in local minimum
☞ can result in a highly inaccurate model

3. Uncertainty due to sampling variation
(i.e. training data differ from population distribution)
☞ model fitted to training data may not reflect the true population
☞ another training sample would have led to different parameters
☞ especially critical for small samples (N < 10,000)

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 80 / 99

slide-149
SLIDE 149

Part 2 Limitations

Bootstrapping

◮ An empirical approach to sampling variation:
◮ take many random samples from the same population
◮ estimate an LNRE model from each sample
◮ analyse the distribution of model parameters, goodness-of-fit, etc. (mean, median, s.d., boxplot, histogram, . . . )
◮ problem: how to obtain the additional samples?

◮ Bootstrapping (Efron 1979)
◮ resample from the observed data with replacement
◮ this approach is not suitable for type-token distributions (resamples underestimate vocabulary size V!)

◮ Parametric bootstrapping
◮ use the fitted model to generate samples, i.e. sample from the population described by the model
◮ advantage: “correct” parameter values are known

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 81 / 99
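A parametric bootstrap can be sketched in a few lines: sample repeatedly from a "fitted" population (here a toy finite Zipfian distribution standing in for an LNRE model), re-estimate on each replicate with the direct estimator α ≈ V1/V, and inspect the spread. All numbers are illustrative:

```python
import random
from math import sqrt

random.seed(1)

# Toy "fitted" population standing in for an LNRE model: slope a = 1.25
S_pop = 2000
w = [1.0 / i ** 1.25 for i in range(1, S_pop + 1)]
Z = sum(w)
pi = [x / Z for x in w]

def estimate_alpha(sample):
    """Direct ZM estimate alpha ≈ V1 / V from one bootstrap replicate."""
    counts = {}
    for t in sample:
        counts[t] = counts.get(t, 0) + 1
    V1 = sum(1 for c in counts.values() if c == 1)
    return V1 / len(counts)

N, replicates = 5000, 50
alphas = [estimate_alpha(random.choices(range(S_pop), weights=pi, k=N))
          for _ in range(replicates)]

mean = sum(alphas) / replicates
sd = sqrt(sum((a - mean) ** 2 for a in alphas) / (replicates - 1))
print(round(mean, 3), round(sd, 3))
```

The standard deviation across replicates is exactly the sampling variation that the histograms on the following slides visualize for the zipfR parameter estimates.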

slide-150
SLIDE 150

Part 2 Limitations

Bootstrapping

parametric bootstrapping with 100 replicates

Zipfian slope a = 1/α: [histogram of the estimated α across replicates]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 82 / 99

slide-151
SLIDE 151

Part 2 Limitations

Bootstrapping

parametric bootstrapping with 100 replicates

Offset b = (1 − α)/(B · α): [histogram of the estimated B across replicates]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 82 / 99

slide-152
SLIDE 152

Part 2 Limitations

Bootstrapping

parametric bootstrapping with 100 replicates

fZM probability cutoff A = π_S: [histogram of the estimated A across replicates]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 82 / 99

slide-153
SLIDE 153

Part 2 Limitations

Bootstrapping

parametric bootstrapping with 100 replicates

Goodness-of-fit statistic X² (model not plausible for X² > 11): [histogram of X² across replicates, with the p < 0.05 region marked]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 82 / 99

slide-154
SLIDE 154

Part 2 Limitations

Bootstrapping

parametric bootstrapping with 100 replicates

Population diversity S: [histogram across replicates; a few extreme outliers stretch the axis to ~6 × 10²¹]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 82 / 99

slide-155
SLIDE 155

Part 2 Limitations

Bootstrapping

parametric bootstrapping with 100 replicates

Population diversity S: [histogram across replicates, restricted to S ≤ 100,000]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 82 / 99

slide-156
SLIDE 156

Part 2 Limitations

Sample size matters!

Brown corpus is too small for reliable LNRE parameter estimation (bare singulars)

[Two panels: bootstrap distributions of the Zipf slope (a) and of the population size S, for BNC spoken (N=6766) vs. Brown (N=1005)]

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 83 / 99

slide-157
SLIDE 157

Part 2 Limitations

How reliable are the fitted models?

Three potential issues:

1. Model assumptions ≠ population
(e.g. distribution does not follow a Zipf-Mandelbrot law)
☞ model cannot be adequate, regardless of parameter settings

2. Parameter estimation unsuccessful
(i.e. suboptimal goodness-of-fit to training data)
☞ optimization algorithm trapped in local minimum
☞ can result in a highly inaccurate model

3. Uncertainty due to sampling variation
(i.e. training data differ from population distribution)
☞ model fitted to training data may not reflect the true population
☞ another training sample would have led to different parameters
☞ especially critical for small samples (N < 10,000)

Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 84 / 99

slide-161
SLIDE 161

Part 2 Limitations

How well does Zipf’s law hold?

◮ Z-M law seems to fit the first few thousand ranks very well, but then the slope of the empirical ranking becomes much steeper

◮ similar patterns have been found in many different data sets

◮ Various modifications and extensions have been suggested (Sichel 1971; Kornai 1999; Montemurro 2001)

◮ mathematics of the corresponding LNRE models is often much more complex and numerically challenging

◮ may not have a closed form for E[V], E[Vm], or for the cumulative type distribution G(ρ) = ∫_ρ^∞ g(π) dπ

◮ E.g. the Generalized Inverse Gauss-Poisson (GIGP; Sichel 1971):

g(π) = (2/bc)^(γ+1) / (2 K_(γ+1)(b)) · π^(γ−1) · e^(−π/c − b²c/(4π))
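As a concrete check of the GIGP density above, the following Python sketch evaluates g(π) with SciPy's modified Bessel function K_ν and verifies numerically that the probability mass ∫ π·g(π) dπ equals 1, as it must for any LNRE type density. The parameter values are invented for illustration, not fitted to any corpus.

```python
import math
from scipy.special import kv          # modified Bessel function of the second kind, K_nu
from scipy.integrate import quad

def gigp_density(pi, gamma, b, c):
    """GIGP type density (Sichel 1971):
    g(pi) = (2/(b*c))^(gamma+1) / (2*K_(gamma+1)(b))
            * pi^(gamma-1) * exp(-pi/c - b^2*c/(4*pi))"""
    norm = (2.0 / (b * c)) ** (gamma + 1) / (2.0 * kv(gamma + 1, b))
    return norm * pi ** (gamma - 1) * math.exp(-pi / c - b * b * c / (4.0 * pi))

# Invented illustrative parameters (gamma in (-1, 0), as is typical for GIGP fits)
gamma, b, c = -0.5, 0.01, 0.001

# Normalization check: the integral of pi * g(pi) over (0, 1) should be ~1
# (with these parameters the density carries no visible mass above pi = 1).
mass, _ = quad(lambda p: p * gigp_density(p, gamma, b, c), 0.0, 1.0, limit=200)
print(f"total probability mass ~ {mass:.3f}")
```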

slide-163
SLIDE 163

Part 2 Limitations

The GIGP model (Sichel 1971)

[Figure: type density g(π) of the fZM and GIGP LNRE models as a function of the occurrence probability π]

slide-164
SLIDE 164

Part 2 Non-randomness

Outline

Part 1 Motivation Descriptive statistics & notation Some examples (zipfR) LNRE models: intuition LNRE models: mathematics Part 2 Applications & examples (zipfR) Limitations Non-randomness Conclusion & outlook


slide-165
SLIDE 165

Part 2 Non-randomness

How accurate is LNRE-based extrapolation?

(Baroni and Evert 2005)

[Figure: observed vs. expected vocabulary growth V(N) for the suffix -bar, extrapolated by the ZM, fZM and GIGP models from 25% and 50% samples]

slide-166
SLIDE 166

Part 2 Non-randomness

How accurate is LNRE-based extrapolation?

(Baroni and Evert 2005)

[Figure: observed vs. expected vocabulary growth V(N) for the suffix -lich, extrapolated by the ZM, fZM, GIGP and logN models from 25% and 50% samples]

slide-167
SLIDE 167

Part 2 Non-randomness

How accurate is LNRE-based extrapolation?

(Baroni and Evert 2005)

[Figure: observed vs. expected vocabulary growth V(N) for the LOB corpus, extrapolated by the ZM, fZM, GIGP and logN models from 25% and 50% samples]

slide-168
SLIDE 168

Part 2 Non-randomness

How accurate is LNRE-based extrapolation?

(Baroni and Evert 2005)

[Figure: observed vs. expected vocabulary growth V(N) for the BNC, extrapolated by the ZM, fZM, GIGP and logN models from 1% and 10% samples]
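Under the random-sampling assumption, all such extrapolation curves derive from the binomial expectation E[V](N) = Σᵢ (1 − (1 − πᵢ)^N). The Python sketch below evaluates this sum for a hypothetical finite Zipfian population; LNRE models instead integrate over a continuous type density, so this only illustrates the shape of a vocabulary growth curve.

```python
def zipf_probs(S, a):
    """Hypothetical finite Zipfian population with S types and slope a."""
    w = [k ** -a for k in range(1, S + 1)]
    Z = sum(w)
    return [x / Z for x in w]

def expected_V(probs, N):
    """E[V](N) = sum_i (1 - (1 - pi_i)^N): the expected number of
    distinct types observed in a random sample of N tokens."""
    return sum(1.0 - (1.0 - p) ** N for p in probs)

probs = zipf_probs(S=10000, a=1.3)
growth = {N: expected_V(probs, N) for N in (1000, 4000, 16000, 64000)}
for N, EV in growth.items():
    print(f"N = {N:6d}  E[V] = {EV:8.1f}")
```

The curve is monotone but sublinear: quadrupling N yields far less than four times the vocabulary, and E[V] can never exceed the population size S.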

slide-171
SLIDE 171

Part 2 Non-randomness

Reasons for poor extrapolation quality

◮ Major problem: non-randomness of corpus data

◮ LNRE modelling assumes that the corpus is a random sample

◮ Cause 1: repetition within texts

◮ most corpora use an entire text as the unit of sampling
◮ also referred to as "term clustering" or "burstiness"
◮ well-known in computational linguistics (Church 2000)

◮ Cause 2: non-homogeneous corpus

◮ cannot extrapolate from the spoken BNC to the written BNC
◮ similarly for different genres and domains
◮ also within a single text, e.g. beginning vs. end of a novel

slide-172
SLIDE 172

Part 2 Non-randomness

The ECHO correction

(Baroni and Evert 2007)

◮ Empirical study: quality of extrapolation N0 → 4N0 starting from random samples of corpus texts

[Figure: relative error of E[V] vs. observed V on DEWAC at 2N0 and 3N0 for the ZM, fZM and GIGP models, and goodness-of-fit (X²) vs. extrapolation accuracy (rMSE for V at 3N0)]

slide-173
SLIDE 173

Part 2 Non-randomness

The ECHO correction

(Baroni and Evert 2007)

◮ Empirical study: quality of extrapolation N0 → 4N0 starting from random samples of corpus texts

[Figure: relative error of E[V] vs. observed V on the BNC at 2N0 and 3N0 for the ZM, fZM and GIGP models, and goodness-of-fit (X²) vs. extrapolation accuracy (rMSE for V at 3N0)]

slide-174
SLIDE 174

Part 2 Non-randomness

The ECHO correction

(Baroni and Evert 2007)

◮ ECHO correction: replace every repetition within same text by special type echo (= document frequencies)

[Figure: relative error of E[V] vs. observed V and goodness-of-fit vs. accuracy on DEWAC for standard, echo-corrected and partition-adjusted variants of the ZM, fZM and GIGP models]

slide-175
SLIDE 175

Part 2 Non-randomness

The ECHO correction

(Baroni and Evert 2007)

◮ ECHO correction: replace every repetition within same text by special type echo (= document frequencies)

[Figure: relative error of E[V] vs. observed V and goodness-of-fit vs. accuracy on the BNC for standard, echo-corrected and partition-adjusted variants of the ZM, fZM and GIGP models]
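Operationally, the ECHO correction is a simple preprocessing step over tokenized documents: within each text, only the first occurrence of a type is kept and every repetition is replaced by the special type echo, so that the corrected type frequencies equal document frequencies. A minimal Python sketch (the toy documents are invented):

```python
def echo_transform(documents):
    """Replace every within-document repetition of a type by the special
    type 'echo'; type frequencies then equal document frequencies."""
    out = []
    for doc in documents:
        seen = set()
        new_doc = []
        for token in doc:
            if token in seen:
                new_doc.append("echo")
            else:
                seen.add(token)
                new_doc.append(token)
        out.append(new_doc)
    return out

docs = [["the", "cat", "saw", "the", "dog"],
        ["the", "dog", "barked", "dog"]]
echoed = echo_transform(docs)
print(echoed)
```

After the transformation, "the" occurs exactly twice (its document frequency), and all within-text repetitions are absorbed by the echo type, whose frequency measures the overall amount of repetition.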

slide-176
SLIDE 176

Part 2 Conclusion & outlook

Outline

Part 1 Motivation Descriptive statistics & notation Some examples (zipfR) LNRE models: intuition LNRE models: mathematics Part 2 Applications & examples (zipfR) Limitations Non-randomness Conclusion & outlook


slide-177
SLIDE 177

Part 2 Conclusion & outlook

Future plans for zipfR

◮ More efficient LNRE sampling & parametric bootstrapping
◮ Improve parameter estimation (minimization algorithm)
◮ Better computation accuracy by numerical integration
◮ Extended Zipf-Mandelbrot LNRE model: piecewise power law
◮ Development of robust and interpretable productivity measures, using LNRE simulations
◮ Computationally expensive modelling (MCMC) for accurate inference from small samples


slide-178
SLIDE 178

Part 2 Conclusion & outlook

Thank you!


slide-179
SLIDE 179

Part 2 Conclusion & outlook

References I

Baayen, Harald (1991). A stochastic process for word frequency distributions. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pages 271–278.

Baayen, R. Harald (2001). Word Frequency Distributions. Kluwer Academic Publishers, Dordrecht.

Baroni, Marco and Evert, Stefan (2005). Testing the extrapolation quality of word frequency models. In P. Danielsson and M. Wagenmakers (eds.), Proceedings of Corpus Linguistics 2005, volume 1, no. 1 of Proceedings from the Corpus Linguistics Conference Series, Birmingham, UK. ISSN 1747-9398.

Baroni, Marco and Evert, Stefan (2007). Words and echoes: Assessing and mitigating the non-randomness problem in word frequency distribution modeling. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 904–911, Prague, Czech Republic.

Brainerd, Barron (1982). On the relation between the type-token and species-area problems. Journal of Applied Probability, 19(4), 785–793.

Cao, Yong; Xiong, Fei; Zhao, Youjie; Sun, Yongke; Yue, Xiaoguang; He, Xin; Wang, Lichao (2017). Power law in random symbolic sequences. Digital Scholarship in the Humanities, 32(4), 733–738.

slide-180
SLIDE 180

Part 2 Conclusion & outlook

References II

Church, Kenneth W. (2000). Empirical estimates of adaptation: The chance of two Noriegas is closer to p/2 than p². In Proceedings of COLING 2000, pages 173–179, Saarbrücken, Germany.

Efron, Bradley (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26.

Evert, Stefan (2004). A simple LNRE model for random character sequences. In Proceedings of the 7èmes Journées Internationales d'Analyse Statistique des Données Textuelles (JADT 2004), pages 411–422, Louvain-la-Neuve, Belgium.

Evert, Stefan and Baroni, Marco (2007). zipfR: Word frequency distributions in R. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Posters and Demonstrations Sessions, pages 29–32, Prague, Czech Republic.

Evert, Stefan and Lüdeling, Anke (2001). Measuring morphological productivity: Is automatic preprocessing sufficient? In P. Rayson, A. Wilson, T. McEnery, A. Hardie, and S. Khoja (eds.), Proceedings of the Corpus Linguistics 2001 Conference, pages 167–175, Lancaster. UCREL.

Grieve, Jack; Carmody, Emily; Clarke, Isobelle; Gideon, Hannah; Heini, Annina; Nini, Andrea; Waibel, Emily (submitted). Attributing the Bixby Letter using n-gram tracing. Digital Scholarship in the Humanities. Submitted on May 26, 2017.

slide-181
SLIDE 181

Part 2 Conclusion & outlook

References III

Herdan, Gustav (1964). Quantitative Linguistics. Butterworths, London.

Kornai, András (1999). Zipf's law outside the middle range. In Proceedings of the Sixth Meeting on Mathematics of Language, pages 347–356, University of Central Florida.

Li, Wentian (1992). Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Transactions on Information Theory, 38(6), 1842–1845.

Mandelbrot, Benoît (1953). An informational theory of the statistical structure of languages. In W. Jackson (ed.), Communication Theory, pages 486–502. Butterworth, London.

Mandelbrot, Benoît (1962). On the theory of word frequencies and on related Markovian models of discourse. In R. Jakobson (ed.), Structure of Language and its Mathematical Aspects, pages 190–219. American Mathematical Society, Providence, RI.

Miller, George A. (1957). Some effects of intermittent silence. The American Journal of Psychology, 70, 311–314.

Montemurro, Marcelo A. (2001). Beyond the Zipf-Mandelbrot law in quantitative linguistics. Physica A, 300, 567–578.

Rouault, Alain (1978). Lois de Zipf et sources markoviennes. Annales de l'Institut H. Poincaré (B), 14, 169–188.

slide-182
SLIDE 182

Part 2 Conclusion & outlook

References IV

Sichel, H. S. (1971). On a family of discrete distributions particularly suited to represent long-tailed frequency data. In N. F. Laubscher (ed.), Proceedings of the Third Symposium on Mathematical Statistics, pages 51–97, Pretoria, South Africa. C.S.I.R.

Sichel, H. S. (1975). On a distribution law for word frequencies. Journal of the American Statistical Association, 70, 542–547.

Simon, Herbert A. (1955). On a class of skew distribution functions. Biometrika, 42(3/4), 425–440.

Tweedie, Fiona J. and Baayen, R. Harald (1998). How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32, 323–352.

Yule, G. Udny (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press, Cambridge.

Zipf, George Kingsley (1949). Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge, MA.

Zipf, George Kingsley (1965). The Psycho-biology of Language. MIT Press, Cambridge, MA.