Counting Words: Introduction


slide-1
SLIDE 1

Introduction Baroni & Evert Roadmap Lexical statistics: the basics Zipf’s law

Typical frequency patterns Zipf’s law Consequences

Applications

Productivity in morphology Productivity beyond morphology Lexical richness Conclusion and outlook

Counting Words: Introduction

Marco Baroni & Stefan Evert
Málaga, 7 August 2006

slide-2
SLIDE 2

Roadmap

◮ Introduction and motivation
◮ LNRE modeling: soft
◮ LNRE modeling: hard
◮ Playtime!
◮ The bad news and outlook

slide-7
SLIDE 7

Outline

Roadmap
Lexical statistics: the basics
Zipf’s law
Applications

slide-8
SLIDE 8

Lexical statistics

Zipf 1949/1961, Baayen 2001, Evert 2005

◮ Statistical study of the distribution of types (words and other units) in texts
◮ Different from other categorical data because of the extreme richness of types

slide-9
SLIDE 9

Basic terminology

◮ N: sample/corpus size, the number of tokens in the sample
◮ V: vocabulary size, the number of distinct types in the sample
◮ V_m: type count of spectrum element m, the number of types in the sample with token frequency m
◮ V_1: hapax legomena count, the number of types that occur only once in the sample (for hapaxes, Count(types) = Count(tokens))
◮ A sample: a b b c a a b a
◮ N: 8; V: 3; V_1: 1
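The three quantities can be checked mechanically. The following is a minimal sketch, not part of the original slides; the function name `basic_counts` is a hypothetical helper, and tokens are assumed to be whitespace-separated.

```python
from collections import Counter

def basic_counts(tokens):
    """Return (N, V, V_1) for a list of tokens (hypothetical helper)."""
    freqs = Counter(tokens)                                # type -> token frequency
    n_tokens = len(tokens)                                 # N: number of tokens
    n_types = len(freqs)                                   # V: number of distinct types
    n_hapaxes = sum(1 for f in freqs.values() if f == 1)   # V_1: types seen exactly once
    return n_tokens, n_types, n_hapaxes

sample = "a b b c a a b a".split()
print(basic_counts(sample))  # (8, 3, 1)
```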

slide-15
SLIDE 15

Rank/frequency profile

◮ The sample: a b b c a a b a d
◮ Frequency list ordered by decreasing frequency:

t  f
a  4
b  3
c  1
d  1

◮ Replace type labels with ranks to obtain the rank/frequency profile:

r  f
1  4
2  3
3  1
4  1

◮ This allows frequency to be expressed as a function of the rank of a type
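As an illustration of the steps above, here is a short sketch that builds the rank/frequency profile for the toy sample. The function name `rank_frequency_profile` is an assumption made for this example; ties (such as c and d) simply receive consecutive ranks.

```python
from collections import Counter

def rank_frequency_profile(tokens):
    """Return a list of (rank, frequency) pairs, most frequent type first."""
    # Sort type frequencies in decreasing order, discarding the type labels.
    freqs = sorted(Counter(tokens).values(), reverse=True)
    # Rank 1 is the most frequent type; tied types get consecutive ranks.
    return list(enumerate(freqs, start=1))

profile = rank_frequency_profile("a b b c a a b a d".split())
print(profile)  # [(1, 4), (2, 3), (3, 1), (4, 1)]
```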

slide-19
SLIDE 19

Rank/frequency profile of Brown corpus

slide-20
SLIDE 20

Frequency spectrum

◮ The sample: a b b c a a b a d
◮ Frequency classes: 1 (c, d), 3 (b), 4 (a)
◮ Frequency spectrum:

m  V_m
1  2
3  1
4  1
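The spectrum is a count of counts, which the sketch below makes explicit. The name `frequency_spectrum` is hypothetical; the result maps each frequency class m to V_m.

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Return {m: V_m}: how many types occur exactly m times."""
    type_freqs = Counter(tokens)             # type -> token frequency
    spectrum = Counter(type_freqs.values())  # frequency m -> number of types V_m
    return dict(sorted(spectrum.items()))

print(frequency_spectrum("a b b c a a b a d".split()))  # {1: 2, 3: 1, 4: 1}
```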

slide-23
SLIDE 23

Rank/frequency profiles and frequency spectra

◮ From rank/frequency profile to spectrum: count the occurrences of each f in the profile to obtain the V_f values of the corresponding spectrum elements
◮ From spectrum to rank/frequency profile: given the highest f (i.e., m) in a spectrum, ranks 1 to V_f in the corresponding rank/frequency profile have frequency f, ranks V_f + 1 to V_f + V_g (where g is the second-highest frequency in the spectrum) have frequency g, etc.
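Both conversions can be sketched in a few lines. The function names are assumptions made for this illustration; the profile is represented as a plain list of frequencies in rank order, and the spectrum as a dictionary from m to V_m.

```python
from collections import Counter

def profile_to_spectrum(profile_freqs):
    """Count how often each frequency f occurs in the profile to get V_f."""
    return dict(sorted(Counter(profile_freqs).items()))

def spectrum_to_profile(spectrum):
    """Expand {m: V_m} into frequencies in rank order (rank 1 first)."""
    freqs = []
    for m in sorted(spectrum, reverse=True):  # highest frequency first
        freqs.extend([m] * spectrum[m])       # the next V_m ranks have frequency m
    return freqs

profile = [4, 3, 1, 1]                   # toy sample: a b b c a a b a d
spectrum = profile_to_spectrum(profile)  # {1: 2, 3: 1, 4: 1}
print(spectrum_to_profile(spectrum))     # [4, 3, 1, 1]
```

The round trip recovers the profile exactly because a rank/frequency profile carries no information beyond the multiset of frequencies.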

slide-24
SLIDE 24

Frequency spectrum of Brown corpus

[Figure: frequency spectrum of the Brown corpus; x-axis: m, y-axis: V_m]

slide-25
SLIDE 25

Vocabulary growth curve

◮ The sample: a b b c a a b a
◮ N: 1, V: 1, V_1: 1
◮ N: 3, V: 2, V_1: 1
◮ N: 5, V: 3, V_1: 1
◮ N: 8, V: 3, V_1: 1
◮ (Most VGCs on our slides smoothed with binomial interpolation)
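A raw (unsmoothed) vocabulary growth curve can be computed in one pass over the tokens. This sketch does not implement the binomial interpolation mentioned above; the function name `vocabulary_growth` is a hypothetical helper.

```python
from collections import Counter

def vocabulary_growth(tokens):
    """Return a list of (N, V, V_1) after each token, without smoothing."""
    freqs = Counter()
    n_hapaxes = 0
    curve = []
    for n, tok in enumerate(tokens, start=1):
        freqs[tok] += 1
        if freqs[tok] == 1:      # new type: it starts out as a hapax
            n_hapaxes += 1
        elif freqs[tok] == 2:    # second occurrence: no longer a hapax
            n_hapaxes -= 1
        curve.append((n, len(freqs), n_hapaxes))
    return curve

curve = vocabulary_growth("a b b c a a b a".split())
for n in (1, 3, 5, 8):
    print(curve[n - 1])  # (1, 1, 1), (3, 2, 1), (5, 3, 1), (8, 3, 1)
```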

slide-31
SLIDE 31

Vocabulary growth curve of Brown corpus

With V_1 growth in red

[Figure: vocabulary growth curve; x-axis: N (0 to 1e+06), y-axis: V and V_1]

slide-32
SLIDE 32

Outline

Roadmap
Lexical statistics: the basics
Zipf’s law
Applications

slide-33
SLIDE 33

Typical frequency patterns

Top and bottom ranks in the Brown corpus

top frequencies        bottom frequencies
rank  fq     word      rank range   fq  randomly selected examples
1     62642  the       7967-8522    10  recordings undergone privileges
2     35971  of        8523-9236     9  Leonard indulge creativity
3     27831  and       9237-10042    8  unnatural Lolotte authenticity
4     25608  to        10043-11185   7  diffraction Augusta postpone
5     21883  a         11186-12510   6  uniformly throttle agglutinin
6     19474  in        12511-14369   5  Bud Councilman immoral
7     10292  that      14370-16938   4  verification gleamed groin
8     10026  is        16939-21076   3  Princes nonspecifically Arger
9      9887  was       21077-28701   2  blitz pertinence arson
10     8811  for       28702-53076   1  Salaries Evensen parentheses

slide-34
SLIDE 34

Typical frequency patterns

BNC

slide-35
SLIDE 35

Typical frequency patterns

Other corpora

slide-36
SLIDE 36

Typical frequency patterns

Brown bigrams and trigrams

slide-37
SLIDE 37

Typical frequency patterns

The Italian prefix ri- in the la Repubblica corpus

slide-38
SLIDE 38

Zipf’s law

◮ Language after language, corpus after corpus, linguistic type after linguistic type...
◮ the same “few giants, many dwarves” pattern is encountered
◮ Similarity of plots suggests that the relation between rank and frequency could be captured by a law
◮ The nature of the relation becomes clearer if we plot log f as a function of log r

slide-43
SLIDE 43

Zipf’s law

◮ A straight line in double-logarithmic space corresponds to a power law for the original variables
◮ This leads to Zipf’s (1949, 1965) famous law:

$$f(w) = \frac{C}{r(w)^{a}}$$

◮ With a = 1 and C = 60,000, Zipf’s law predicts that the most frequent word has frequency 60,000; the second most frequent word has frequency 30,000; the third word has frequency 20,000...
◮ and a long tail of 80,000 words with frequency between 1.5 and 0.5
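The arithmetic behind these predictions can be reproduced with a small sketch. The function names are hypothetical; the defaults a = 1 and C = 60,000 are the values used on the slide. The tail size is obtained by inverting the law: frequency 1.5 is predicted at rank 40,000 and frequency 0.5 at rank 120,000.

```python
def zipf_frequency(rank, C=60_000, a=1.0):
    """Frequency predicted by Zipf's law, f = C / r^a."""
    return C / rank ** a

def zipf_rank(freq, C=60_000, a=1.0):
    """Invert the law: the rank at which frequency `freq` is predicted."""
    return (C / freq) ** (1.0 / a)

print([zipf_frequency(r) for r in (1, 2, 3)])  # [60000.0, 30000.0, 20000.0]
print(zipf_rank(0.5) - zipf_rank(1.5))         # 80000.0
```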

slide-47
SLIDE 47

Zipf’s law

Logarithmic version

◮ Zipf’s power law:

$$f(w) = \frac{C}{r(w)^{a}}$$

◮ If we take the logarithm of both sides, we obtain:

$$\log f(w) = \log C - a \log r(w)$$

◮ I.e., Zipf’s law predicts that rank/frequency profiles are straight lines in double-logarithmic space, which, as we saw, is a reasonable approximation
◮ Best-fit a and C can be found with the least squares method
◮ Provides an intuitive interpretation of a and C:
  ◮ −a is the slope: a determines how fast log frequency decreases with log rank
  ◮ log C is the intercept, i.e., the predicted log frequency of the word with rank 1 (log rank 0), i.e., the most frequent word
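A least squares fit in log-log space can be sketched with the standard library alone. This is an illustrative ordinary least squares fit under the assumption that every rank counts equally; it is not necessarily the procedure used to produce the fits shown on the slides, and `fit_zipf` is a hypothetical name.

```python
import math

def fit_zipf(freqs):
    """Fit log f = log C - a log r by ordinary least squares.

    `freqs` lists frequencies in rank order (rank 1 first, at least two
    entries, all positive). Returns the estimates (a, C).
    """
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]  # log rank
    ys = [math.log(f) for f in freqs]                     # log frequency
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx                    # slope of the fitted line, equal to -a
    intercept = mean_y - slope * mean_x  # log C
    return -slope, math.exp(intercept)

# Synthetic data that follows Zipf's law exactly with a = 1, C = 60,000.
a, C = fit_zipf([60_000 / r for r in range(1, 101)])
print(round(a, 6), round(C, 3))
```

On real profiles the many low-frequency ranks dominate such an unweighted fit, so the estimates depend on how much of the tail is included.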

slide-52
SLIDE 52

Zipf’s law

Fitting the Brown rank/frequency profile

slide-53
SLIDE 53

Fit of Zipf’s law

◮ At right edge (low frequencies):
  ◮ “Bell-bottom” pattern expected, as we are fitting a continuous model to discrete frequencies
  ◮ More worryingly, in large corpora frequency drops more rapidly than predicted by Zipf’s law
◮ At left edge (high frequencies):
  ◮ Highest frequencies lower than predicted → Mandelbrot’s correction

slide-59
SLIDE 59

Zipf-Mandelbrot’s law

Mandelbrot 1953

◮ Mandelbrot’s extra parameter:

$$f(w) = \frac{C}{(r(w) + b)^{a}}$$

◮ Zipf’s law is the special case with b = 0
◮ Assuming a = 1, C = 60,000, b = 1:
  ◮ For the word with rank 1, Zipf’s law predicts a frequency of 60,000; Mandelbrot’s variation predicts a frequency of 30,000
  ◮ For the word with rank 1,000, Zipf’s law predicts a frequency of 60; Mandelbrot’s variation predicts a frequency of 59.94
◮ No longer a straight line in double-logarithmic space; finding the best fit is harder than least squares
◮ Zipf-Mandelbrot’s law is the basis of the LNRE statistical models we will introduce
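The comparison above can be reproduced with a one-line function. The name `zipf_mandelbrot` and its defaults are assumptions for this sketch; setting b = 0 gives plain Zipf's law.

```python
def zipf_mandelbrot(rank, C=60_000, a=1.0, b=0.0):
    """Frequency predicted by f = C / (r + b)^a; b = 0 is plain Zipf's law."""
    return C / (rank + b) ** a

for rank in (1, 1000):
    zipf = zipf_mandelbrot(rank)          # b = 0
    mandel = zipf_mandelbrot(rank, b=1)   # b = 1
    print(rank, zipf, round(mandel, 2))
# 1 60000.0 30000.0
# 1000 60.0 59.94
```

The extra parameter b lowers the predicted frequencies of the top ranks substantially while leaving high ranks almost unchanged.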

slide-64
SLIDE 64

Mandelbrot’s adjustment

Fitting the Brown rank/frequency profile

slide-65
SLIDE 65

More fits

slide-69
SLIDE 69

A few mildly interesting things about Zipf(-Mandelbrot)’s law

◮ a is often close to 1 for word frequency distributions (hence simplified version: f = C/r, and -1 slope in log-log space)
◮ Zipf’s law also provides good fit to frequency spectra
◮ Monkey languages display Zipf’s law (intuition: few short words have very high chances to be generated; long tail of highly unlikely long words)
◮ Zipf’s law is everywhere (Li 2002)
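
A minimal sketch of the first point, assuming the usual Zipf-Mandelbrot form f = C / (r + b)^a. The function names and the parameter values (C, a, b) are illustrative assumptions, not values fitted in these slides. With a = 1 and b = 0 the curve is a straight line of slope -1 in log-log space; a positive b flattens the head of the curve.

```python
import math

def zipf_mandelbrot(rank, C=100000.0, a=1.0, b=0.0):
    """Frequency predicted for a rank under f = C / (rank + b)**a."""
    return C / (rank + b) ** a

def loglog_slope(r1, r2, **params):
    """Slope of the line through two points of the curve in log-log space."""
    f1 = zipf_mandelbrot(r1, **params)
    f2 = zipf_mandelbrot(r2, **params)
    return (math.log(f2) - math.log(f1)) / (math.log(r2) - math.log(r1))

# Plain Zipf (a = 1, b = 0): slope is -1 everywhere.
slope_zipf = loglog_slope(10, 1000)

# Hypothetical Mandelbrot shift b = 2.7: the top ranks decay more slowly.
slope_mandelbrot_head = loglog_slope(1, 10, b=2.7)
```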

slide-73
SLIDE 73

Consequences

◮ Data sparseness
◮ Standard statistics, normal approximation not appropriate for lexical type distributions
◮ V is not stable, will grow with sample size; we need special methods to estimate V and related quantities at arbitrary sizes (including V of whole type population)

slide-74
SLIDE 74

V , sample size and the Zipfian distribution

◮ Significant tail of hapax legomena indicates that chances of encountering new type if we keep sampling are high
◮ Zipfian distribution implies vocabulary curve that is still growing at largest sample size
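
A minimal sketch of how V and V1 can be tracked while reading a sample token by token, which is what a vocabulary growth curve plots. The function name and the toy token list are hypothetical; a real corpus would replace `toy_tokens`.

```python
from collections import Counter

def vocabulary_growth(tokens, step=1):
    """Return (N, V, V1) triples measured every `step` tokens."""
    counts = Counter()
    hapaxes = 0
    curve = []
    for n, tok in enumerate(tokens, start=1):
        counts[tok] += 1
        if counts[tok] == 1:
            hapaxes += 1      # a new type starts out as a hapax
        elif counts[tok] == 2:
            hapaxes -= 1      # seen a second time: no longer a hapax
        if n % step == 0:
            curve.append((n, len(counts), hapaxes))
    return curve

# Hypothetical toy sample, for illustration only.
toy_tokens = "a b a c b d a e".split()
curve = vocabulary_growth(toy_tokens)
```

As long as V1 stays well above zero at the end of the sample, new types keep arriving and V keeps growing.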

slide-75
SLIDE 75

Pronouns in Italian (la Repubblica)

Rank/frequency profile

[Figure: rank/frequency plot; x-axis: rank, y-axis: fq]

slide-76
SLIDE 76

Pronouns in Italian

Frequency spectrum

[Figure: frequency spectrum plot; x-axis: m, y-axis: V_m]

slide-77
SLIDE 77

Pronouns in Italian

Vocabulary growth curve

[Figure: vocabulary growth plot; x-axis: N, y-axis: V and V_1]

slide-78
SLIDE 78

Pronouns in Italian

Vocabulary growth curve (zooming in)

[Figure: vocabulary growth plot; x-axis: N, y-axis: V and V_1]

slide-79
SLIDE 79

ri- in Italian (la Repubblica)

Rank/frequency profile

slide-80
SLIDE 80

ri- in Italian

Frequency spectrum

[Figure: frequency spectrum plot; x-axis: m, y-axis: V_m]

slide-81
SLIDE 81

ri- in Italian

Vocabulary growth curve

[Figure: vocabulary growth plot; x-axis: N, y-axis: V and V_1]

slide-82
SLIDE 82

Outline

Roadmap
Lexical statistics: the basics
Zipf’s law
Applications

slide-83
SLIDE 83

Applications

◮ Productivity (in morphology and elsewhere)
◮ Lexical richness (in stylometry, language acquisition/pathology and elsewhere)
◮ Extrapolation of type counts and type frequency distribution for practical NLP purposes (e.g., estimating proportion of OOV words, typos, etc.)
◮ ... (e.g., Good-Turing smoothing, prior distribution for Bayesian language modeling)

slide-84
SLIDE 84

Productivity

◮ In many linguistic problems, rate of growth of VGC is interesting issue in itself
◮ Baayen (1989 and later) makes link between linguistic notion of productivity and vocabulary growth rate

slide-85
SLIDE 85

Productivity in morphology: the classic definition

Schultink (1961), translated by Booij

Productivity as morphological phenomenon is the possibility which language users have to form an in principle uncountable number of new words unintentionally, by means of a morphological process which is the basis of the form-meaning correspondence of some words they know.

slide-89
SLIDE 89

V as a measure of productivity

◮ Comparable for same N only!
◮ Good first approximation, but it is measuring attestedness, not potential:
◮ (According to rough BNC counts) de- verbs have V of 141, un- verbs have V of 119, contra our intuition
◮ We want productivity index of pronouns to be 0, not 72!

slide-90
SLIDE 90

Baayen’s P

◮ Operationalize productivity of a process as probability that the next token created by the process that we sample is a new word
◮ This is same as probability that next token in sample is hapax legomenon
◮ Thus, we can estimate probability of sampling a new word as relative frequency of hapax legomena in our sample: P = V1/N
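
A minimal sketch of the estimate P = V1/N computed from the tokens of one process. The function name and the toy token list are hypothetical illustrations, not data from the slides.

```python
from collections import Counter

def baayen_p(tokens):
    """Estimate P as the number of hapax legomena divided by sample size."""
    counts = Counter(tokens)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("P is undefined for an empty sample")
    v1 = sum(1 for c in counts.values() if c == 1)
    return v1 / n

# Hypothetical toy sample of tokens formed with one prefix.
toy = ["rifare", "rivedere", "rifare", "rileggere", "rifare", "riscrivere"]
p = baayen_p(toy)   # 3 hapaxes among 6 tokens
```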

slide-91
SLIDE 91

Baayen’s P

P = V1/N

◮ Probability to sample token representing type we will never encounter again (token labeled “hapax”) at first stage of sampling (when we are at the beginning of N-token-sample) is given by the number of hapaxes in the whole N-token-sample divided by the total number of tokens in the sample
◮ Thus, since the order of tokens in a random sample is arbitrary, this must also be probability that last token sampled represents new type
◮ P as productivity measure matches intuition that productivity should measure potential of process to generate new forms

slide-92
SLIDE 92

P as vocabulary growth rate

◮ P measures the potentiality of growth of V in a very literal way, i.e., it is the growth rate of V, the rate at which vocabulary size increases
◮ P is (approximation to) the derivative of V at N, i.e., the slope of the tangent to the vocabulary growth curve at N (Baayen 2001, pp. 49-50)
◮ Again, “rate of growth” of vocabulary generated by word formation process seems good match for intuition about productivity of word formation process
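
A short worked derivation of why V1/N approximates the slope of the growth curve. It is a sketch under one stated assumption that is not spelled out on the slide: tokens are drawn independently, with type i having probability \pi_i.

```latex
% Assumption: independent draws, type i has probability \pi_i.
\begin{align*}
E[V(N)] &= \sum_i \bigl(1 - (1-\pi_i)^N\bigr) \\
E[V(N+1)] - E[V(N)] &= \sum_i \pi_i\,(1-\pi_i)^N \\
E[V_1(N+1)] &= \sum_i (N+1)\,\pi_i\,(1-\pi_i)^N \\
\Rightarrow\quad E[V(N+1)] - E[V(N)] &= \frac{E[V_1(N+1)]}{N+1} \;\approx\; \frac{V_1}{N} = P
\end{align*}
```

The expected gain in V from one more token equals the expected hapax count divided by sample size, which is the quantity P estimates.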

slide-93
SLIDE 93

ri- in Italian la Repubblica corpus

[Figure: V plotted against N]

slide-94
SLIDE 94

Pronouns in Italian la Repubblica corpus

[Figure: V plotted against N]

slide-95
SLIDE 95

Baayen’s P and intuition

class          V      V1    N           P
it. ri-        1098   346   1,399,898   0.00025
it. pronouns   72           4,313,123
en. un-        119    25    7,618       0.00328
en. de-        141    16    86,130      0.000185
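
The P column can be recomputed from the V1 and N columns as a sanity check. The sketch below uses only the three rows for which both values are listed above; the dictionary layout and variable names are illustrative assumptions.

```python
# (V1, N) pairs copied from the table; the pronoun row is left out
# because its V1 value is not listed.
rows = {
    "it. ri-": (346, 1_399_898),
    "en. un-": (25, 7_618),
    "en. de-": (16, 86_130),
}

# Baayen's P = V1 / N for each process.
p_values = {name: v1 / n for name, (v1, n) in rows.items()}

# Processes ordered from most to least productive according to P.
ranking = sorted(p_values, key=p_values.get, reverse=True)
```

On these counts un- ranks above de-, the reverse of the ordering given by V alone.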

slide-97
SLIDE 97

P and sample size

◮ We saw that as N increases, V also increases (for at-least-mildly-productive processes)
◮ Thus, V cannot be compared at different Ns

slide-98
SLIDE 98

V and N

English re- and mis-

[Figure: V plotted against N]

slide-102
SLIDE 102

P and sample size

◮ We saw that as N increases, V also increases (for at-least-mildly-productive processes)
◮ Thus, V cannot be compared at different Ns
◮ However, growth rate is also systematically decreasing as N becomes larger
◮ At the beginning, any word will be a hapax legomenon; as sample increases, hapaxes will be increasingly lower proportion of sample
◮ A specific instance of the more general problem of “variable constants” (Tweedie and Baayen 1998) in lexical statistics (cf. type/token ratio)
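
A small simulation sketch of the "variable constants" problem. It assumes a hypothetical population of 5,000 types with probabilities proportional to 1/rank and a fixed random seed; none of these settings come from the slides. Both P and the type/token ratio are measured on a short prefix and on the full sample.

```python
import random
from collections import Counter

def p_at(tokens, n):
    """Baayen's P = V1 / n on the first n tokens."""
    counts = Counter(tokens[:n])
    v1 = sum(1 for c in counts.values() if c == 1)
    return v1 / n

def ttr_at(tokens, n):
    """Type/token ratio V / n on the first n tokens."""
    return len(set(tokens[:n])) / n

# Hypothetical Zipfian population: type r has weight 1/r.
rng = random.Random(0)
ranks = list(range(1, 5001))
weights = [1 / r for r in ranks]
sample = rng.choices(ranks, weights=weights, k=20000)

p_small, p_large = p_at(sample, 200), p_at(sample, 20000)
ttr_small, ttr_large = ttr_at(sample, 200), ttr_at(sample, 20000)
```

Both measures shrink as the sample grows even though the population is unchanged, so values taken at different N are not comparable.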

slide-103
SLIDE 103

Growth rate of re- at different sample sizes

[Figure: V plotted against N]

slide-104
SLIDE 104

P as a function of N (re-)

[Figure: P plotted against N]

slide-108
SLIDE 108

V and P at arbitrary Ns

◮ In order to compare V and P of processes (and predict how process will develop in larger samples)...
◮ we need to be able to estimate V and V1 at arbitrary Ns
◮ Once we compare P at same N, we might as well compare V1 directly (since P = V1/N and N will be constant across compared processes)
◮ Most intuitive: VGC plot comparison
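
For sample sizes smaller than the observed N, expected V and V1 can be obtained from the frequency spectrum alone by treating a smaller sample as a random subsample of the observed tokens (drawn without replacement). The sketch below illustrates that interpolation only; the function names and the toy spectrum are hypothetical, and estimates for sizes larger than N need a population model instead.

```python
from math import comb

def sample_size(spectrum):
    """N implied by a spectrum given as {m: V_m}."""
    return sum(m * vm for m, vm in spectrum.items())

def expected_v(spectrum, n):
    """Expected number of types in a random subsample of n tokens."""
    N = sample_size(spectrum)
    if not 0 <= n <= N:
        raise ValueError("interpolation only works for 0 <= n <= N")
    total = comb(N, n)
    # A type with frequency m is absent with probability C(N-m, n) / C(N, n).
    absent = sum(vm * comb(N - m, n) / total for m, vm in spectrum.items())
    return sum(spectrum.values()) - absent

def expected_v1(spectrum, n):
    """Expected number of hapaxes in a random subsample of n tokens."""
    N = sample_size(spectrum)
    if not 1 <= n <= N:
        raise ValueError("interpolation only works for 1 <= n <= N")
    total = comb(N, n)
    # Exactly one of the m tokens of a type is drawn: m * C(N-m, n-1) / C(N, n).
    return sum(vm * m * comb(N - m, n - 1) / total for m, vm in spectrum.items())

# Hypothetical toy spectrum: 3 hapaxes, 1 type seen twice, 1 seen three times.
toy_spectrum = {1: 3, 2: 1, 3: 1}
```

Evaluating both functions at the same n for two processes gives the like-for-like comparison the bullets ask for.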

slide-112
SLIDE 112

Productivity beyond morphology

◮ Measuring generative potential of process/category not limited to morphology
◮ Applications in lexicology, collocation and idiom studies, morphosyntax, syntax, language technology
◮ E.g., measure growth of nouns, adjectives, loanwords, relative productivity of two constructions, growth of UNKNOWN lemmas as dataset increases...
◮ An example: measuring productivity of NP and PP expansions in German TIGER treebank

slide-113
SLIDE 113

TIGER expansions

◮ Types are non-terminal rewrite rules for NP and PP, e.g.:
◮ NP → ART ADJA NN
◮ PP → APPR ART NN
◮ Frequency of occurrence of expansions collected from about 900,000 tokens (50,000 sentences) of German newspaper text from Frankfurter Rundschau
◮ http://www.ims.uni-stuttgart.de/projekte/TIGER
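
A minimal sketch of how expansions can be treated as types: each distinct right-hand side counts as one type, and the frequency spectrum records how many types occur m times. The observations below are hypothetical toy data using the tag names from the examples above, not counts from TIGER.

```python
from collections import Counter

# Hypothetical (category, expansion) observations, for illustration only.
observed = [
    ("NP", ("ART", "ADJA", "NN")),
    ("NP", ("ART", "NN")),
    ("NP", ("ART", "ADJA", "NN")),
    ("PP", ("APPR", "ART", "NN")),
    ("PP", ("APPR", "NN")),
]

def spectrum_for(category, rules):
    """Frequency spectrum {m: V_m} of the expansions of one category."""
    type_freqs = Counter(rhs for lhs, rhs in rules if lhs == category)
    return Counter(type_freqs.values())

np_spectrum = spectrum_for("NP", observed)
pp_spectrum = spectrum_for("PP", observed)
```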

slide-114
SLIDE 114

NP spectrum

[Figure: frequency spectrum plot; x-axis: m, y-axis: V(m)]

slide-115
SLIDE 115

PP spectrum

[Figure: frequency spectrum plot; x-axis: m, y-axis: V(m)]

slide-116
SLIDE 116

Growth curves of NP and PP

[Figure: V and V_1 plotted against N; legend: np, pp]

slide-119
SLIDE 119

Lexical richness

◮ How many words did Shakespeare know? Are the later Harry Potters more lexically diverse than the early ones?
◮ Are advanced learners distinguishable from native speakers in terms of vocabulary richness? How many words do 5-year-old children know?
◮ Can changes in V detect the onset of Alzheimer’s disease? (Garrard et al. 2005)

slide-120
SLIDE 120

The Dickens’ datasets

◮ Dickens corpus: collection of 14 works by Dickens, about 2.8 million tokens
◮ Oliver Twist: early work (1837-1839), about 160k tokens
◮ Great Expectations: later work (1860-1861), considered one of Dickens’ masterpieces, about 190k tokens
◮ Our Mutual Friend: last completed novel (1864-1865), about 330k tokens

slide-121
SLIDE 121

Dickens’ V

[Figure: V and V_1 plotted against N; legend: dickens, omf, ge, ot]
slide-122
SLIDE 122

The novels compared

[Figure: V and V_1 plotted against N; legend: omf, ge, ot]
slide-123
SLIDE 123

Oliver vs. Great Expectations

[Figure: V and V_1 plotted against N; legend: ge, ot]
slide-128
SLIDE 128

Conclusion and outlook

◮ Productivity, lexical richness, extrapolation of type counts for language engineering purposes...
◮ all applications require a model of the larger population of types that our sample comes from
◮ Two reasons to construct model of type population distribution:
◮ Population distribution interesting by itself, for theoretical reasons or in NLP applications
◮ We know how to simulate sampling from population; thus once we have population model we can obtain estimates of type-related quantities (e.g., V and V1) at arbitrary Ns
slide-130
SLIDE 130

Modeling the population

Productivity

◮ Distribution of types of category of interest necessary to estimate V and V1 at arbitrary Ns, in order to compare VGCs and P of different processes
◮ However, type population distribution of word formation process (or other category) might be of interest by itself, as model of a part of the mental lexicon of speaker

slide-133
SLIDE 133

Modeling the population

Lexical richness

◮ Lexical richness = V of whole population (how many words did Shakespeare know? Was the lexical repertoire of young Dickens smaller than that of old Dickens? How many words do 5-year-old children know?)
◮ Accurate estimate of population V would solve “variable constant” problem
◮ Sampling from population, in particular to compute VGC, also of interest

slide-139
SLIDE 139

Modeling the population

Some NLP applications

◮ Estimate the number (and growth rate) of typos, UNKNOWNs (or other target tokens) in larger samples → estimate V and V1 at arbitrary Ns

◮ Estimate the proportion of OOV words under the assumption that the lexicon contains the top n most frequent types (see zipfR tutorial) → requires estimation of V and the frequency spectrum at arbitrary Ns (to find out how many tokens the top n types account for)

◮ Good-Turing estimation, Bayesian priors → require a full type population model
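
As an illustration of the second and third points, here is a minimal sketch that works on the observed sample only. It assumes the sample is a plain list of tokens. The function names are hypothetical and are not taken from zipfR. It does not extrapolate to larger N, which is the step that needs an estimated frequency spectrum and hence a population model.

```python
from collections import Counter


def oov_proportion(tokens, n):
    """Share of tokens whose type is NOT among the n most frequent types.

    Assumes the lexicon consists of exactly the top-n types of this sample.
    """
    tokens = list(tokens)
    if not tokens:
        raise ValueError("empty sample")
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(n))
    return 1 - covered / len(tokens)


def unseen_mass(tokens):
    """Good-Turing estimate of the total probability of unseen types: V1 / N."""
    tokens = list(tokens)
    if not tokens:
        raise ValueError("empty sample")
    counts = Counter(tokens)
    v1 = sum(1 for freq in counts.values() if freq == 1)
    return v1 / len(tokens)
```

For the toy sample `a b a c b a d`, a lexicon holding the two most frequent types covers 5 of the 7 tokens, so the OOV proportion is 2/7.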

slide-142
SLIDE 142

Outlook

◮ We need a model of the type population distribution

◮ We will use the Zipf(-Mandelbrot) law as a starting point to model what the population looks like
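
To make the idea of a type population concrete, the sketch below builds a finite population whose type probabilities follow the Zipf-Mandelbrot form, proportional to 1 / (rank + b)^a, and draws a random sample from it. The population size S, the parameter values a and b, and the function names are illustrative assumptions, not values from these slides. The actual models may also differ, for example by allowing an infinite population.

```python
import random


def zipf_mandelbrot_probs(S, a=1.0, b=2.7):
    """Type probabilities for ranks 1..S, proportional to 1 / (rank + b) ** a.

    S, a and b are illustrative assumptions, not fitted values.
    """
    weights = [1.0 / (rank + b) ** a for rank in range(1, S + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_vocabulary_size(N, probs, seed=0):
    """Draw N tokens from the population and return V, the number of distinct types."""
    rng = random.Random(seed)
    ranks = rng.choices(range(1, len(probs) + 1), weights=probs, k=N)
    return len(set(ranks))
```

Calling `sample_vocabulary_size` for increasing N on the same population traces out a simulated vocabulary growth curve.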

TO BE CONTINUED