Counting Words: Pre-Processing and Non-Randomness (PowerPoint presentation)



SLIDE 1


Counting Words: Pre-Processing and Non-Randomness

Marco Baroni & Stefan Evert, Málaga, 11 August 2006

SLIDE 2


Outline

Pre-Processing
Non-Randomness
The End

SLIDE 6


Pre-processing

◮ IT IS IMPORTANT!!! (Evert and Lüdeling 2001)

◮ Automated pre-processing often necessary (13,850 types begin with re- in the BNC, 103,941 types begin with ri- in itWaC)

◮ We can rely on:

  ◮ POS tagging
  ◮ Lemmatization
  ◮ Pattern matching heuristics (e.g., a candidate prefixed form must be analyzable as PRE+VERB, with VERB independently attested in the corpus; see the sketch after this list)

◮ However...
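A minimal Python sketch of such a PRE+VERB pattern-matching heuristic. The prefix, the attested-verb set, and the example words are illustrative placeholders, not the authors' actual resources:

```python
# Hedged sketch: accept a candidate as a prefixed verb only if the
# remainder after stripping the prefix is independently attested as a
# verb elsewhere in the corpus. The lexicon below is made up.
ATTESTED_VERBS = {"cadere", "fare", "vedere"}  # verbs seen on their own in the corpus

def is_prefixed_verb(word: str, prefix: str = "ri") -> bool:
    """True if `word` is analyzable as PREFIX+VERB (dash optional)."""
    if not word.startswith(prefix):
        return False
    base = word[len(prefix):].lstrip("-")  # "ri-cadere" and "ricadere" both yield "cadere"
    return base in ATTESTED_VERBS

for w in ["ricadere", "ri-cadere", "rischio"]:
    print(w, is_prefixed_verb(w))  # True, True, False ("schio" is not an attested verb)
```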

SLIDE 10


The problem with low frequency words

◮ Correct analysis of low frequency words is fundamental for measuring productivity and estimating LNRE models

◮ Automated tools will tend to have the lowest performance on low frequency forms:

  ◮ Statistical tools will suffer from lack of relevant training data

  ◮ Manually-crafted tools will probably lack the relevant resources

◮ Problems in both directions (under- and overestimation of hapax counts)

◮ Part of the more general “95% performance” problem

SLIDE 11


Underestimation of hapaxes

◮ The Italian TreeTagger lemmatizer is lexicon-based; out-of-lexicon words (e.g., productively formed words containing a prefix) are lemmatized as UNKNOWN

◮ No prefixed word with a dash (ri-cadere) is in the lexicon
◮ Writers are more likely to use a dash to mark transparent morphological structure

SLIDE 12


Productivity of ri- with and without an extended lexicon

[Figure: vocabulary growth curves E[V(N)] vs. N (up to 1M tokens), pre-cleaning vs. post-cleaning]

SLIDE 13


Overestimation of hapaxes

◮ “Noise” generates hapax legomena
◮ The Italian TreeTagger thinks that dashed expressions containing pronoun-like strings are pronouns
◮ Dashed strings can be anything, including full sentences
◮ This creates a lot of pseudo-pronoun hapaxes: tu-tu, parapaponzi-ponzi-pò, altri-da-lui-simili-a-lui
SLIDE 14


Productivity of the pronoun class before and after cleaning

[Figure: vocabulary growth curves E[V(N)] vs. N (up to 4M tokens), pre-cleaning vs. post-cleaning]

SLIDE 15


P (and V) with/without correct post-processing

◮ With:

  class     V     V1   N          P
  ri-       1098  346  1,399,898  0.00025
  pronouns  72         4,313,123

◮ Without:

  class     V    V1   N          P
  ri-       318  8    1,268,244  0.000006
  pronouns  348  206  4,314,381  0.000048
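The P column is consistent with Baayen's hapax-based productivity measure, i.e., the number of hapax legomena in the category divided by the category's token count:

$$\mathcal{P} = \frac{V_1(N)}{N}$$

For example, with correct post-processing, ri- gives 346 / 1,399,898 ≈ 0.00025; without it, the spurious pronoun class gives 206 / 4,314,381 ≈ 0.000048.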

SLIDE 17


A final word on pre-processing

◮ IT IS IMPORTANT
◮ Often the major roadblock in lexical statistics investigations

SLIDE 18


Outline

Pre-Processing
Non-Randomness
The End

SLIDE 22


Non-randomness

◮ LNRE modeling is based on the assumption that our corpora/datasets are random samples from the population
◮ This is obviously not the case
◮ Can we pretend that a corpus is random?
◮ What are the consequences of non-randomness?

SLIDE 23


A Brown-sized random sample from a ZM population estimated with Brown

[Figure: observed vocabulary growth V(N) vs. N, up to 1M tokens]
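A hedged Python sketch of how such a simulation can be set up: draw a Brown-sized sample from a truncated Zipf-Mandelbrot population and track observed vocabulary growth. The parameters below are illustrative placeholders, not the ones actually estimated from the Brown:

```python
import numpy as np

def zipf_mandelbrot_sample(n_tokens, n_types=1_000_000, a=1.2, b=10.0, seed=0):
    """Sample tokens from a truncated Zipf-Mandelbrot population with
    p(rank r) proportional to 1 / (r + b)**a (illustrative parameters)."""
    rng = np.random.default_rng(seed)
    ranks = np.arange(1, n_types + 1)
    probs = 1.0 / (ranks + b) ** a
    probs /= probs.sum()
    return rng.choice(ranks, size=n_tokens, p=probs)

def vocab_growth(tokens, step=100_000):
    """Observed vocabulary size V(N) at increasing sample sizes N."""
    seen, curve = set(), []
    for i, t in enumerate(tokens, 1):
        seen.add(t)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

for N, V in vocab_growth(zipf_mandelbrot_sample(1_000_000)):
    print(f"N = {N:>9,}   V(N) = {V:,}")
```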

SLIDE 24


The real Brown

[Figure: observed vocabulary growth V(N) vs. N for the real Brown corpus, up to 1M tokens]

SLIDE 27


Where does non-randomness come from?

◮ Syntax?
◮ If words occurred at random, the the should be the most frequent English bigram
◮ If the problem is due to syntax, randomizing by sentence will not get rid of it (Baayen 2001, ch. 5)

SLIDE 28


The Brown randomized by sentence

[Figure: V(N) vs. N for the Brown corpus randomized by sentence, up to 1M tokens]
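A minimal sketch of sentence-level randomization (representing the corpus as a list of tokenized sentences is an assumption of this sketch):

```python
import random

def randomize_by_sentence(sentences, seed=42):
    """Shuffle whole sentences, preserving word order (and hence the
    local syntactic dependencies) inside each sentence."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    return [word for sentence in shuffled for word in sentence]
```

Because this destroys document-level clustering while keeping within-sentence syntax intact, a shuffled corpus whose V(N) curve looks random again points away from syntax as the source of non-randomness.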

SLIDE 31


Where does non-randomness come from?

◮ Not syntax (syntax has a short-span effect; the counts for 10k intervals are OK)

◮ Underdispersion of content-rich words
◮ The chance of two Noriegas is closer to p/2 than p² (Church 2000); a sketch of this check follows below
◮ diethylstilbestrol occurs 3 times in the Brown, all in the same document (recommendations on feed additives)

◮ Underdispersion will lead to serious underestimation of the rare type count
◮ An fZM model estimated on the Brown predicts S = 115,539 types for English
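A hedged sketch of how one could check Church-style adaptation empirically; representing the collection as a list of token lists (`docs`) is an assumption of this sketch:

```python
def adaptation_check(docs, word):
    """Compare the share of documents containing `word` at least twice
    with p/2 (Church's rule of thumb) and p**2 (what independence would
    predict), where p is the share of documents containing it at least once."""
    n = len(docs)
    p = sum(1 for d in docs if d.count(word) >= 1) / n
    p_two = sum(1 for d in docs if d.count(word) >= 2) / n
    return {"p": p, "P(>=2)": p_two, "p/2": p / 2, "p^2": p * p}
```

For a strongly underdispersed word like diethylstilbestrol, whose occurrences cluster in a single document, the observed rate of second occurrences sits far above the independence prediction p².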

SLIDE 32


Underestimating types

Extrapolating the Brown vocabulary growth curve (VGC) with fZM

[Figure: fZM extrapolation of the Brown vocabulary growth curve, E[V(N)] for N up to 40M tokens]

SLIDE 33


Assessing extrapolation quality

◮ We have no way to assess the goodness of fit of an extrapolation from our corpus to a larger sample from the same population
◮ However, we can estimate models on a subset of the available data and extrapolate to the full corpus size (Evert and Baroni 2006)
◮ I.e., use the corpus as our population and sample from it (a sketch of this protocol follows below)
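A skeletal Python version of this validation protocol. The fitting step is the part this sketch fakes: it plugs in the subsample's relative frequencies where the real procedure fits a parametric LNRE model (ZM/fZM/GIGP):

```python
import numpy as np

def expected_V(probs, N):
    """E[V(N)] under random sampling: a type with probability p_i
    appears in an N-token sample with probability 1 - (1 - p_i)**N."""
    return float(np.sum(1.0 - (1.0 - np.asarray(probs)) ** N))

def validate_by_subsampling(tokens, sub_size=250_000, seed=0):
    """Estimate on a random subsample, extrapolate to the full corpus
    size, and compare with the observed type count there."""
    tokens = np.asarray(tokens)
    rng = np.random.default_rng(seed)
    sub = rng.choice(tokens, size=sub_size, replace=False)
    _, counts = np.unique(sub, return_counts=True)
    probs = counts / counts.sum()  # stand-in for a fitted LNRE model
    return expected_V(probs, len(tokens)), len(np.unique(tokens))
```

Note that the stand-in model can never predict more types than it saw in the subsample, which is precisely why the real protocol fits a parametric model that puts probability mass on unseen types.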

SLIDE 34


Extrapolation from a random sample of 250k Brown tokens

[Figure: interpolated V(N) vs. ZM, fZM, and GIGP extrapolations E[V(N)], up to 1M tokens]

SLIDE 35


Goodness of fit to spectrum elements

Based on a multivariate chi-squared statistic

           estimation size        max extrapolation size
  model    X²      df   p         X²      df   p
  ZM       7,856   14   ≪ 0.001   35,346  16   ≪ 0.001
  fZM      539     13   ≪ 0.001   4,525   16   ≪ 0.001
  GIGP     597     13   ≪ 0.001   3,449   16   ≪ 0.001

Compare to V fit:

[Figure: interpolated V(N) vs. ZM, fZM, and GIGP E[V(N)] curves, up to 1M tokens]
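A simplified Pearson version of such a goodness-of-fit computation. The statistic on the slide is multivariate, also accounting for covariances between spectrum elements (Baayen 2001); this sketch ignores that, and the spectrum counts in the example are made up:

```python
import numpy as np
from scipy import stats

def spectrum_gof(observed, expected, n_params=0):
    """Pearson chi-squared over the first spectrum elements V_1..V_k."""
    o, e = np.asarray(observed, float), np.asarray(expected, float)
    x2 = float(np.sum((o - e) ** 2 / e))
    df = len(o) - 1 - n_params  # subtract the number of fitted parameters
    return x2, df, stats.chi2.sf(x2, df)

# Illustrative spectrum counts V_1..V_5 (not real data):
x2, df, p = spectrum_gof([19000, 5200, 2400, 1400, 900],
                         [18500, 5500, 2500, 1350, 950], n_params=2)
print(f"X2 = {x2:.1f}, df = {df}, p = {p:.4f}")
```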

SLIDE 36


Extrapolation from the first 250k tokens in the corpus

[Figure: observed V(N) vs. ZM, fZM, and GIGP extrapolations E[V(N)] from the first 250k tokens, up to 1M tokens]

SLIDE 37


Goodness of fit to spectrum elements

Based on a multivariate chi-squared statistic

           estimation size        max extrapolation size
  model    X²      df   p         X²       df   p
  ZM       8,066   14   ≪ 0.001   336,766  16   ≪ 0.001
  fZM      1,011   13   ≪ 0.001   17,559   16   ≪ 0.001
  GIGP     587     13   ≪ 0.001   7,815    16   ≪ 0.001

Compare to V fit:

[Figure: observed V(N) vs. ZM, fZM, and GIGP E[V(N)] curves, up to 1M tokens]

SLIDE 40


The corpus as a (non-)random sample

◮ In our experiment, we had access to the full population (the Brown) and could take a random sample from it
◮ In real life, the full corpus is our sample from the population (e.g., “English”, an author's mental lexicon, all words generated by a word formation process)
◮ If it is not random, there is nothing we can do about it (randomizing the sample will not help!)

SLIDE 44


What can we do?

◮ Abandon lexical statistics
◮ Live with it
◮ Re-define the population
◮ Try to account for underdispersion when computing the models (it will get mathematically very complicated, but see Baayen 2001, ch. 5)

SLIDE 45


Not always that bad

Our Mutual Friend

[Figure: observed V(N) and ZM, fZM, GIGP fits E[V(N)] for Our Mutual Friend, up to ~300k tokens]

SLIDE 46


Outline

Pre-Processing
Non-Randomness
The End

SLIDE 50


What we have done

◮ Motivation: studying the distribution and V growth rate of type-rich populations (the sample captures only a small proportion of the types in the population)
◮ LNRE modeling:
  ◮ A population model with a limited number of parameters (e.g., ZM), expressed in terms of a type density function
  ◮ Equations to calculate expected V and frequency spectrum in random samples of arbitrary size using the population model (see the formulas after this list)
  ◮ Estimation of population parameters via fit of expected elements to the observed frequency spectrum
◮ zipfR package to apply LNRE modeling
◮ Problems
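For reference, the expected vocabulary size and spectrum elements under the random-sampling assumption (Baayen 2001) are, for a population with type probabilities $\pi_i$:

$$E[V(N)] = \sum_i \left(1 - (1 - \pi_i)^N\right), \qquad E[V_m(N)] = \sum_i \binom{N}{m}\,\pi_i^m\,(1 - \pi_i)^{N-m}$$

With a parametric type density $g(\pi)$, as in ZM, fZM, or GIGP, the sums become integrals over $\pi$, which is what lets both quantities be computed from a handful of parameters.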

SLIDE 54


What we (and perhaps some of you?) would like to do next

◮ Study (and deal with) non-randomness
◮ Better parameter estimation
◮ Improve zipfR (any feature requests?)
◮ Use LNRE modeling in applications, e.g.:
  ◮ Good-Turing-style estimation
  ◮ Productivity beyond morphology
  ◮ Better features for machine learning
  ◮ Mixture models

SLIDE 55


That’s All, Folks!

THE END