Counting Words: Pre-Processing and Non-Randomness (PowerPoint presentation)



SLIDE 1


Counting Words: Pre-Processing and Non-Randomness

Marco Baroni & Stefan Evert, Málaga, 11 August 2006

SLIDE 2


Outline

Pre-Processing
Non-Randomness
The End

SLIDE 6


Pre-processing

◮ IT IS IMPORTANT!!! (Evert and Lüdeling 2001)

◮ Automated pre-processing often necessary (13,850 types begin with re- in the BNC, 103,941 types begin with ri- in itWaC)

◮ We can rely on:

  ◮ POS tagging
  ◮ Lemmatization
  ◮ Pattern matching heuristics (e.g., a candidate prefixed form must be analyzable as PRE+VERB, with VERB independently attested in the corpus; see the sketch after this list)

◮ However...
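A minimal Python sketch of such a PRE+VERB pattern-matching heuristic. The prefix, the attested-verb set, and the example words are illustrative placeholders, not the authors' actual resources:

```python
# Hedged sketch: accept a candidate as a prefixed verb only if the
# remainder after stripping the prefix is independently attested as a
# verb elsewhere in the corpus. The lexicon below is made up.
ATTESTED_VERBS = {"cadere", "fare", "vedere"}  # verbs seen on their own in the corpus

def is_prefixed_verb(word: str, prefix: str = "ri") -> bool:
    """True if `word` is analyzable as PREFIX+VERB (dash optional)."""
    if not word.startswith(prefix):
        return False
    base = word[len(prefix):].lstrip("-")  # "ri-cadere" and "ricadere" both yield "cadere"
    return base in ATTESTED_VERBS

for w in ["ricadere", "ri-cadere", "rischio"]:
    print(w, is_prefixed_verb(w))  # True, True, False ("schio" is not an attested verb)
```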

SLIDE 10


The problem with low frequency words

◮ Correct analysis of low frequency words is fundamental for measuring productivity and estimating LNRE models

◮ Automated tools will tend to have the lowest performance on low frequency forms:

  ◮ Statistical tools will suffer from lack of relevant training data

  ◮ Manually-crafted tools will probably lack the relevant resources

◮ Problems in both directions (under- and overestimation of hapax counts)

◮ Part of the more general “95% performance” problem

SLIDE 11


Underestimation of hapaxes

◮ The Italian TreeTagger lemmatizer is lexicon-based; out-of-lexicon words (e.g., productively formed words containing a prefix) are lemmatized as UNKNOWN

◮ No prefixed word with a dash (ri-cadere) is in the lexicon
◮ Writers are more likely to use a dash to mark transparent morphological structure

SLIDE 12


Productivity of ri- with and without an extended lexicon

[Figure: vocabulary growth curves E[V(N)] vs. N (up to 1M tokens), pre-cleaning vs. post-cleaning]

SLIDE 13


Overestimation of hapaxes

◮ “Noise” generates hapax legomena
◮ The Italian TreeTagger thinks that dashed expressions containing pronoun-like strings are pronouns
◮ Dashed strings can be anything, including full sentences
◮ This creates a lot of pseudo-pronoun hapaxes: tu-tu, parapaponzi-ponzi-pò, altri-da-lui-simili-a-lui
SLIDE 14


Productivity of the pronoun class before and after cleaning

[Figure: vocabulary growth curves E[V(N)] vs. N (up to 4M tokens), pre-cleaning vs. post-cleaning]

SLIDE 15


P (and V) with/without correct post-processing

◮ With:

  class     V     V1   N          P
  ri-       1098  346  1,399,898  0.00025
  pronouns  72         4,313,123

◮ Without:

  class     V    V1   N          P
  ri-       318  8    1,268,244  0.000006
  pronouns  348  206  4,314,381  0.000048
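The P column is consistent with Baayen's hapax-based productivity measure, i.e., the number of hapax legomena in the category divided by the category's token count:

$$\mathcal{P} = \frac{V_1(N)}{N}$$

For example, with correct post-processing, ri- gives 346 / 1,399,898 ≈ 0.00025; without it, the spurious pronoun class gives 206 / 4,314,381 ≈ 0.000048.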

SLIDE 17


A final word on pre-processing

◮ IT IS IMPORTANT
◮ Often the major roadblock in lexical statistics investigations

SLIDE 18


Outline

Pre-Processing
Non-Randomness
The End

SLIDE 22


Non-randomness

◮ LNRE modeling is based on the assumption that our corpora/datasets are random samples from the population
◮ This is obviously not the case
◮ Can we pretend that a corpus is random?
◮ What are the consequences of non-randomness?

SLIDE 23


A Brown-sized random sample from a ZM population estimated with Brown

[Figure: observed vocabulary growth V(N) vs. N, up to 1M tokens]
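A hedged Python sketch of how such a simulation can be set up: draw a Brown-sized sample from a truncated Zipf-Mandelbrot population and track observed vocabulary growth. The parameters below are illustrative placeholders, not the ones actually estimated from the Brown:

```python
import numpy as np

def zipf_mandelbrot_sample(n_tokens, n_types=1_000_000, a=1.2, b=10.0, seed=0):
    """Sample tokens from a truncated Zipf-Mandelbrot population with
    p(rank r) proportional to 1 / (r + b)**a (illustrative parameters)."""
    rng = np.random.default_rng(seed)
    ranks = np.arange(1, n_types + 1)
    probs = 1.0 / (ranks + b) ** a
    probs /= probs.sum()
    return rng.choice(ranks, size=n_tokens, p=probs)

def vocab_growth(tokens, step=100_000):
    """Observed vocabulary size V(N) at increasing sample sizes N."""
    seen, curve = set(), []
    for i, t in enumerate(tokens, 1):
        seen.add(t)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

for N, V in vocab_growth(zipf_mandelbrot_sample(1_000_000)):
    print(f"N = {N:>9,}   V(N) = {V:,}")
```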

SLIDE 24


The real Brown

[Figure: observed vocabulary growth V(N) vs. N for the real Brown corpus, up to 1M tokens]

SLIDE 27


Where does non-randomness come from?

◮ Syntax?
◮ If words occurred at random, the the should be the most frequent English bigram
◮ If the problem is due to syntax, randomizing by sentence will not get rid of it (Baayen 2001, ch. 5)

SLIDE 28


The Brown randomized by sentence

[Figure: V(N) vs. N for the Brown corpus randomized by sentence, up to 1M tokens]
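A minimal sketch of sentence-level randomization (representing the corpus as a list of tokenized sentences is an assumption of this sketch):

```python
import random

def randomize_by_sentence(sentences, seed=42):
    """Shuffle whole sentences, preserving word order (and hence the
    local syntactic dependencies) inside each sentence."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    return [word for sentence in shuffled for word in sentence]
```

Because this destroys document-level clustering while keeping within-sentence syntax intact, a shuffled corpus whose V(N) curve looks random again points away from syntax as the source of non-randomness.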

SLIDE 31


Where does non-randomness come from?

◮ Not syntax (syntax has a short-span effect; the counts for 10k intervals are OK)

◮ Underdispersion of content-rich words
◮ The chance of two Noriegas is closer to p/2 than p² (Church 2000); a sketch of this check follows below
◮ diethylstilbestrol occurs 3 times in the Brown, all in the same document (recommendations on feed additives)

◮ Underdispersion will lead to serious underestimation of the rare type count
◮ An fZM model estimated on the Brown predicts S = 115,539 types for English
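A hedged sketch of how one could check Church-style adaptation empirically; representing the collection as a list of token lists (`docs`) is an assumption of this sketch:

```python
def adaptation_check(docs, word):
    """Compare the share of documents containing `word` at least twice
    with p/2 (Church's rule of thumb) and p**2 (what independence would
    predict), where p is the share of documents containing it at least once."""
    n = len(docs)
    p = sum(1 for d in docs if d.count(word) >= 1) / n
    p_two = sum(1 for d in docs if d.count(word) >= 2) / n
    return {"p": p, "P(>=2)": p_two, "p/2": p / 2, "p^2": p * p}
```

For a strongly underdispersed word like diethylstilbestrol, whose occurrences cluster in a single document, the observed rate of second occurrences sits far above the independence prediction p².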

SLIDE 32


Underestimating types

Extrapolating the Brown vocabulary growth curve (VGC) with fZM

[Figure: fZM extrapolation of the Brown vocabulary growth curve, E[V(N)] for N up to 40M tokens]

SLIDE 33


Assessing extrapolation quality

◮ We have no way to assess the goodness of fit of an extrapolation from our corpus to a larger sample from the same population
◮ However, we can estimate models on a subset of the available data and extrapolate to the full corpus size (Evert and Baroni 2006)
◮ I.e., use the corpus as our population and sample from it (a sketch of this protocol follows below)
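A skeletal Python version of this validation protocol. The fitting step is the part this sketch fakes: it plugs in the subsample's relative frequencies where the real procedure fits a parametric LNRE model (ZM/fZM/GIGP):

```python
import numpy as np

def expected_V(probs, N):
    """E[V(N)] under random sampling: a type with probability p_i
    appears in an N-token sample with probability 1 - (1 - p_i)**N."""
    return float(np.sum(1.0 - (1.0 - np.asarray(probs)) ** N))

def validate_by_subsampling(tokens, sub_size=250_000, seed=0):
    """Estimate on a random subsample, extrapolate to the full corpus
    size, and compare with the observed type count there."""
    tokens = np.asarray(tokens)
    rng = np.random.default_rng(seed)
    sub = rng.choice(tokens, size=sub_size, replace=False)
    _, counts = np.unique(sub, return_counts=True)
    probs = counts / counts.sum()  # stand-in for a fitted LNRE model
    return expected_V(probs, len(tokens)), len(np.unique(tokens))
```

Note that the stand-in model can never predict more types than it saw in the subsample, which is precisely why the real protocol fits a parametric model that puts probability mass on unseen types.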

SLIDE 34


Extrapolation from a random sample of 250k Brown tokens

[Figure: interpolated V(N) vs. ZM, fZM, and GIGP extrapolations E[V(N)], up to 1M tokens]

SLIDE 35


Goodness of fit to spectrum elements

Based on a multivariate chi-squared statistic

           estimation size        max extrapolation size
  model    X²      df   p         X²      df   p
  ZM       7,856   14   ≪ 0.001   35,346  16   ≪ 0.001
  fZM      539     13   ≪ 0.001   4,525   16   ≪ 0.001
  GIGP     597     13   ≪ 0.001   3,449   16   ≪ 0.001

Compare to V fit:

[Figure: interpolated V(N) vs. ZM, fZM, and GIGP E[V(N)] curves, up to 1M tokens]
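A simplified Pearson version of such a goodness-of-fit computation. The statistic on the slide is multivariate, also accounting for covariances between spectrum elements (Baayen 2001); this sketch ignores that, and the spectrum counts in the example are made up:

```python
import numpy as np
from scipy import stats

def spectrum_gof(observed, expected, n_params=0):
    """Pearson chi-squared over the first spectrum elements V_1..V_k."""
    o, e = np.asarray(observed, float), np.asarray(expected, float)
    x2 = float(np.sum((o - e) ** 2 / e))
    df = len(o) - 1 - n_params  # subtract the number of fitted parameters
    return x2, df, stats.chi2.sf(x2, df)

# Illustrative spectrum counts V_1..V_5 (not real data):
x2, df, p = spectrum_gof([19000, 5200, 2400, 1400, 900],
                         [18500, 5500, 2500, 1350, 950], n_params=2)
print(f"X2 = {x2:.1f}, df = {df}, p = {p:.4f}")
```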

SLIDE 36


Extrapolation from the first 250k tokens in the corpus

[Figure: observed V(N) vs. ZM, fZM, and GIGP extrapolations E[V(N)] from the first 250k tokens, up to 1M tokens]

SLIDE 37


Goodness of fit to spectrum elements

Based on a multivariate chi-squared statistic

           estimation size        max extrapolation size
  model    X²      df   p         X²       df   p
  ZM       8,066   14   ≪ 0.001   336,766  16   ≪ 0.001
  fZM      1,011   13   ≪ 0.001   17,559   16   ≪ 0.001
  GIGP     587     13   ≪ 0.001   7,815    16   ≪ 0.001

Compare to V fit:

[Figure: observed V(N) vs. ZM, fZM, and GIGP E[V(N)] curves, up to 1M tokens]

SLIDE 40


The corpus as a (non-)random sample

◮ In our experiment, we had access to the full population (the Brown) and could take a random sample from it
◮ In real life, the full corpus is our sample from the population (e.g., “English”, an author's mental lexicon, all words generated by a word formation process)
◮ If it is not random, there is nothing we can do about it (randomizing the sample will not help!)

SLIDE 44


What can we do?

◮ Abandon lexical statistics
◮ Live with it
◮ Re-define the population
◮ Try to account for underdispersion when computing the models (it will get mathematically very complicated, but see Baayen 2001, ch. 5)

SLIDE 45


Not always that bad

Our Mutual Friend

[Figure: observed V(N) and ZM, fZM, GIGP fits E[V(N)] for Our Mutual Friend, up to ~300k tokens]

SLIDE 46


Outline

Pre-Processing
Non-Randomness
The End

SLIDE 50


What we have done

◮ Motivation: studying the distribution and V growth rate of type-rich populations (the sample captures only a small proportion of the types in the population)
◮ LNRE modeling:
  ◮ A population model with a limited number of parameters (e.g., ZM), expressed in terms of a type density function
  ◮ Equations to calculate expected V and frequency spectrum in random samples of arbitrary size using the population model (see the formulas after this list)
  ◮ Estimation of population parameters via fit of expected elements to the observed frequency spectrum
◮ zipfR package to apply LNRE modeling
◮ Problems
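For reference, the expected vocabulary size and spectrum elements under the random-sampling assumption (Baayen 2001) are, for a population with type probabilities $\pi_i$:

$$E[V(N)] = \sum_i \left(1 - (1 - \pi_i)^N\right), \qquad E[V_m(N)] = \sum_i \binom{N}{m}\,\pi_i^m\,(1 - \pi_i)^{N-m}$$

With a parametric type density $g(\pi)$, as in ZM, fZM, or GIGP, the sums become integrals over $\pi$, which is what lets both quantities be computed from a handful of parameters.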

SLIDE 54


What we (and perhaps some of you?) would like to do next

◮ Study (and deal with) non-randomness
◮ Better parameter estimation
◮ Improve zipfR (any feature requests?)
◮ Use LNRE modeling in applications, e.g.:
  ◮ Good-Turing-style estimation
  ◮ Productivity beyond morphology
  ◮ Better features for machine learning
  ◮ Mixture models

SLIDE 55


That’s All, Folks!

THE END