Counting Words: Pre-Processing and Non-Randomness



  1. Pre-processing and non-randomness
     Marco Baroni & Stefan Evert
     Málaga, 11 August 2006

  2. Outline
     ◮ Pre-Processing
     ◮ Non-Randomness
     ◮ The End

  3. Pre-processing
     ◮ IT IS IMPORTANT!!! (Evert and Lüdeling 2001)

  4. Pre-processing
     ◮ IT IS IMPORTANT!!! (Evert and Lüdeling 2001)
     ◮ Automated pre-processing often necessary (13,850 types begin with re- in the BNC, 103,941 types begin with ri- in itWaC)

  5. Pre-processing
     ◮ IT IS IMPORTANT!!! (Evert and Lüdeling 2001)
     ◮ Automated pre-processing often necessary (13,850 types begin with re- in the BNC, 103,941 types begin with ri- in itWaC)
     ◮ We can rely on:
       ◮ POS tagging
       ◮ Lemmatization
       ◮ Pattern matching heuristics (e.g., a candidate prefixed form must be analyzable as PRE+VERB, with the VERB independently attested in the corpus)

  6. Pre-processing
     ◮ IT IS IMPORTANT!!! (Evert and Lüdeling 2001)
     ◮ Automated pre-processing often necessary (13,850 types begin with re- in the BNC, 103,941 types begin with ri- in itWaC)
     ◮ We can rely on:
       ◮ POS tagging
       ◮ Lemmatization
       ◮ Pattern matching heuristics (e.g., a candidate prefixed form must be analyzable as PRE+VERB, with the VERB independently attested in the corpus)
     ◮ However...
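
The PRE+VERB attestation heuristic above can be made concrete. Below is a minimal sketch, assuming per-(wordform, POS) frequency counts from a tagged corpus are already available; the function name, the "VER" tag, and the toy data are illustrative assumptions, not part of the original slides.

```python
# Sketch of the PRE+VERB attestation heuristic: accept a candidate
# ri-prefixed form only if its base is independently attested as a verb
# in the same corpus. ("VER" as the verb tag is an assumption.)

def candidate_prefixed_verbs(type_pos_freq, prefix="ri"):
    """type_pos_freq: dict mapping (wordform, POS) -> frequency."""
    verbs = {w for (w, pos) in type_pos_freq if pos == "VER"}
    accepted = {}
    for (w, pos), f in type_pos_freq.items():
        if pos != "VER" or not w.startswith(prefix):
            continue
        base = w[len(prefix):].lstrip("-")   # handle both riX and ri-X
        if base in verbs:                    # base independently attested
            accepted[w] = f
    return accepted

# Toy usage: 'ricadere' is kept because 'cadere' is attested as a verb;
# 'risultare' is rejected because 'sultare' is not.
counts = {("ricadere", "VER"): 2, ("cadere", "VER"): 120,
          ("risultare", "VER"): 40}
print(candidate_prefixed_verbs(counts))      # {'ricadere': 2}
```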

  7. The problem with low frequency words
     ◮ Correct analysis of low frequency words is fundamental for measuring productivity and estimating LNRE models

  8. The problem with low frequency words
     ◮ Correct analysis of low frequency words is fundamental for measuring productivity and estimating LNRE models
     ◮ Automated tools will tend to have their lowest performance on low frequency forms:
       ◮ Statistical tools will suffer from lack of relevant training data
       ◮ Manually-crafted tools will probably lack the relevant resources

  9. The problem with low frequency words
     ◮ Correct analysis of low frequency words is fundamental for measuring productivity and estimating LNRE models
     ◮ Automated tools will tend to have their lowest performance on low frequency forms:
       ◮ Statistical tools will suffer from lack of relevant training data
       ◮ Manually-crafted tools will probably lack the relevant resources
     ◮ Problems in both directions (under- and overestimation of hapax counts)

  10. The problem with low frequency words
     ◮ Correct analysis of low frequency words is fundamental for measuring productivity and estimating LNRE models
     ◮ Automated tools will tend to have their lowest performance on low frequency forms:
       ◮ Statistical tools will suffer from lack of relevant training data
       ◮ Manually-crafted tools will probably lack the relevant resources
     ◮ Problems in both directions (under- and overestimation of hapax counts)
     ◮ Part of the more general “95% performance” problem

  11. Underestimation of hapaxes
     ◮ The Italian TreeTagger lemmatizer is lexicon-based; out-of-lexicon words (e.g., productively formed words containing a prefix) are lemmatized as UNKNOWN
     ◮ No prefixed word with a dash (ri-cadere) is in the lexicon
     ◮ Writers are more likely to use a dash to mark transparent morphological structure
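
One plausible reading of the "extended lexicon" in the next slide, sketched below under the assumption of a simple lemma-to-POS lexicon: generate riX and ri-X entries for every verb lemma already listed, so that productive formations no longer come out as UNKNOWN. The data format and tag names are hypothetical, not the actual TreeTagger resources.

```python
# Hedged sketch of lexicon extension: add generated ri- entries for every
# verb lemma, covering both the solid and the dashed spelling.
# (The lemma -> POS format and the "VER" tag are assumptions.)

def extend_lexicon(lexicon, prefix="ri"):
    extended = dict(lexicon)                           # keep original entries
    for lemma, pos in lexicon.items():
        if pos == "VER":
            extended.setdefault(prefix + lemma, "VER")         # ricadere
            extended.setdefault(prefix + "-" + lemma, "VER")   # ri-cadere
    return extended

lex = {"cadere": "VER", "casa": "NOUN"}
print(sorted(extend_lexicon(lex)))
# ['cadere', 'casa', 'ri-cadere', 'ricadere']
```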

  12. Productivity of ri- with and without an extended lexicon
     [Figure: expected vocabulary growth E[V(N)] vs. sample size N, pre-cleaning and post-cleaning curves]

  13. Overestimation of hapaxes
     ◮ “Noise” generates hapax legomena
     ◮ The Italian TreeTagger thinks that dashed expressions containing pronoun-like strings are pronouns
     ◮ Dashed strings can be anything, including full sentences
     ◮ This creates a lot of pseudo-pronoun hapaxes: tu-tu, parapaponzi-ponzi-pò, altri-da-lui-simili-a-lui
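
One possible cleaning step for this overestimation problem is a closed-class filter, sketched below: since pronouns are a closed class, a dashed form tagged as a pronoun that is not on the closed list is almost certainly noise. The pronoun list and the "PRO" tag are stand-ins, not the actual procedure used for the slides.

```python
# Sketch of a pseudo-pronoun filter over (wordform, POS) pairs.
# Toy closed list; a real one would enumerate all Italian pronoun forms.

REAL_PRONOUNS = {"io", "tu", "lui", "lei", "noi", "voi", "loro"}

def clean_pronouns(tagged_tokens):
    for form, pos in tagged_tokens:
        if pos == "PRO" and "-" in form and form.lower() not in REAL_PRONOUNS:
            continue   # drop noise like 'tu-tu' or 'altri-da-lui-simili-a-lui'
        yield form, pos
```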

  14. Productivity of the pronoun class before and after cleaning
     [Figure: expected vocabulary growth E[V(N)] vs. sample size N, pre-cleaning and post-cleaning curves]

  15. P (and V) with/without correct post-processing
     ◮ With:

       class      V     V1    N          P
       ri-        1098  346   1,399,898  0.00025
       pronouns   72    0     4,313,123  0

     ◮ Without:

       class      V     V1    N          P
       ri-        318   8     1,268,244  0.000006
       pronouns   348   206   4,314,381  0.000048
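
The quantities in the table follow directly from the token stream of a morphological class: N is the token count, V the type count, V1 the number of hapax legomena, and P = V1/N (for the cleaned ri- data, 346/1,399,898 ≈ 0.00025). A minimal sketch:

```python
# Compute N, V, V1 and the productivity index P = V1 / N for the tokens
# of one morphological class (e.g. all ri- forms in a corpus).

from collections import Counter

def productivity(tokens):
    freq = Counter(tokens)
    N = sum(freq.values())                          # tokens
    V = len(freq)                                   # types
    V1 = sum(1 for f in freq.values() if f == 1)    # hapax legomena
    return V, V1, N, (V1 / N if N else 0.0)

# Toy check: two types, one hapax, three tokens -> P = 1/3
print(productivity(["ricadere", "ricadere", "rileggere"]))
# (2, 1, 3, 0.3333333333333333)
```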

  16. A final word on pre-processing
     ◮ IT IS IMPORTANT

  17. A final word on pre-processing
     ◮ IT IS IMPORTANT
     ◮ Often a major roadblock in lexical statistics investigations

  18. Outline
     ◮ Pre-Processing
     ◮ Non-Randomness
     ◮ The End

  19. Non-randomness
     ◮ LNRE modeling is based on the assumption that our corpora/datasets are random samples from the population

  20. Non-randomness
     ◮ LNRE modeling is based on the assumption that our corpora/datasets are random samples from the population
     ◮ This is obviously not the case

  21. Non-randomness
     ◮ LNRE modeling is based on the assumption that our corpora/datasets are random samples from the population
     ◮ This is obviously not the case
     ◮ Can we pretend that a corpus is random?

  22. Non-randomness
     ◮ LNRE modeling is based on the assumption that our corpora/datasets are random samples from the population
     ◮ This is obviously not the case
     ◮ Can we pretend that a corpus is random?
     ◮ What are the consequences of non-randomness?

  23. A Brown-sized random sample from a ZM population estimated from the Brown
     [Figure: vocabulary growth curve V(N) vs. N]

  24. The real Brown
     [Figure: vocabulary growth curve V(N) vs. N]
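
The contrast between slides 23 and 24 can be reproduced with a small simulation, assuming a finite Zipf-Mandelbrot-style population with type probabilities proportional to 1/(rank + b)^a; the parameter values below are illustrative, not the ones estimated from the Brown.

```python
# Sketch: draw a Brown-sized random sample from a finite Zipf-Mandelbrot
# style population and trace the vocabulary growth curve V(N).

import numpy as np

def zm_sample(n_tokens, W=100_000, a=1.2, b=10.0, seed=0):
    ranks = np.arange(1, W + 1)
    p = (ranks + b) ** -a
    p /= p.sum()                                    # normalize to probabilities
    return np.random.default_rng(seed).choice(ranks, size=n_tokens, p=p)

def growth_curve(tokens, step=100_000):
    """Return (N, V(N)) pairs at multiples of `step`."""
    seen, curve = set(), []
    for i, t in enumerate(tokens, 1):
        seen.add(int(t))
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

for N, V in growth_curve(zm_sample(1_000_000)):
    print(N, V)   # a truly random sample grows smoothly, as in slide 23
```

Feeding the real Brown's token stream through the same growth_curve is what makes the deviation from the random sample visible.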

  25. Where does non-randomness come from?
     ◮ Syntax?

  26. Where does non-randomness come from?
     ◮ Syntax?
       ◮ Under the random-sample assumption, the the should be the most frequent English bigram (the is the most frequent word)

  27. Where does non-randomness come from?
     ◮ Syntax?
       ◮ Under the random-sample assumption, the the should be the most frequent English bigram (the is the most frequent word)
     ◮ If the problem is due to syntax, randomizing by sentence will not get rid of it (Baayen 2001, ch. 5)
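
The the the point can be checked with back-of-the-envelope arithmetic: if the corpus were a random word sequence, the expected frequency of a bigram w1 w2 would be about (N-1)·p(w1)·p(w2). The figures below are rough Brown-like numbers, used purely for illustration.

```python
# If a 1M-token corpus were a random word sequence, "the the" would be
# expected thousands of times; in real text it is vanishingly rare.

N = 1_000_000               # corpus size in tokens (roughly the Brown)
f_the = 70_000              # rough frequency of "the"
p_the = f_the / N
print(round((N - 1) * p_the ** 2))   # about 4900 expected "the the" bigrams
```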

  28. The Brown randomized by sentence
     [Figure: vocabulary growth curve V(N) vs. N]
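
The randomization behind slide 28 is simple to sketch, assuming the corpus is available as a list of sentences (token lists): permute the sentence order, keep word order (and hence local syntax) intact within each sentence, then recompute V(N) with the growth_curve helper above.

```python
# Sketch of sentence-level randomization: shuffle whole sentences only.

import random

def randomize_by_sentence(sentences, seed=42):
    """sentences: list of token lists; returns one shuffled token stream."""
    order = list(sentences)               # don't mutate the caller's list
    random.Random(seed).shuffle(order)    # permute sentence order only
    return [tok for sent in order for tok in sent]
```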

  29. Where does non-randomness come from?
     ◮ Not syntax (syntax has a short-span effect; the counts for 10k intervals are OK)

  30. Where does non-randomness come from?
     ◮ Not syntax (syntax has a short-span effect; the counts for 10k intervals are OK)
     ◮ Underdispersion of content-rich words
       ◮ The chance of two Noriegas is closer to p/2 than p² (Church 2000)
       ◮ diethylstilbestrol occurs 3 times in the Brown, all in the same document (recommendations on feed additives)
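
Underdispersion of this kind is easy to diagnose by comparing a word's total frequency with the number of documents it occurs in, as in the sketch below; a word like diethylstilbestrol comes out as (3, 1): three tokens, one document.

```python
# Dispersion diagnostic: total frequency vs. document frequency.
# High frequency with low document frequency marks underdispersed
# ("bursty") words, violating the random-sample assumption.

from collections import Counter

def dispersion(docs):
    """docs: list of token lists, one per document.
    Returns {word: (total_freq, doc_freq)}."""
    tf, df = Counter(), Counter()
    for doc in docs:
        counts = Counter(doc)
        tf.update(counts)         # add token counts
        df.update(counts.keys())  # count each document once per word
    return {w: (tf[w], df[w]) for w in tf}

docs = [["diethylstilbestrol"] * 3 + ["feed", "additives"], ["the", "cow"]]
print(dispersion(docs)["diethylstilbestrol"])   # (3, 1)
```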
