

The zipfR library: Words and other rare events in R

Stefan Evert & Marco Baroni

University of Osnabrück, Germany (stefan.evert@uos.de)
University of Bologna, Forlì, Italy (baroni@sslmit.unibo.it)

useR! 2006, Vienna, 15 June 2006


Outline

◮ (Computational) linguistics
◮ Statistical inference in (computational) linguistics
◮ Zipf’s law and the LNRE problem
◮ LNRE models for linguistic populations
◮ Model estimation: the frequency spectrum
◮ The zipfR library
◮ Extrapolation of VGCs
◮ Further work
◮ Availability


What is (computational) linguistics?

The science of linguistics is concerned with . . .

◮ natural language as a formal system (phonology, morphology, syntax, semantics, etc.)

◮ human language production and understanding, including the acquisition of language competence

Computational linguistics . . .

◮ applies computers and electronic resources to linguistic research questions

◮ makes use of linguistic insights to build automatic natural language processing (NLP) systems


Corpora in (computational) linguistics

◮ increasing focus on language use and empirical evidence in recent years

◮ based on corpora = (usually large) machine-readable samples of naturally occurring language

◮ some applications of corpus data:
  ◮ test hypotheses about the formal system of language
  ◮ validation of linguists’ introspective judgements
  ◮ observable result of human language production
  ◮ model for the linguistic experience of a human speaker
  ◮ training data for statistical NLP applications

◮ corpus = sample ➜ need for statistical analysis
  ◮ standard methodologies are being established
  ◮ the random sample assumption is controversial for most corpora ➜ statistical inference may be unreliable

☞ ongoing research into appropriate statistical models


Statistical inference from corpus data

◮ only observable data are corpus frequencies

◮ commonly used terminology: types vs. tokens
  ◮ tokens can be running words, sentences in a text, instances of syntactic constructions, documents, etc.
  ◮ categorization into a fixed or open-ended set of types: distinct word forms or lemmas, parts of speech, etc.

◮ of central interest are type frequencies f(ω)

◮ the corpus is interpreted as a random sample of tokens ➜ inferences about type probabilities πω from f(ω)

◮ linguistic populations are characterized by . . .
  1. a finite or countably infinite set of types ω
  2. type probabilities πω
➥ multinomial distribution of observed frequencies

◮ confidence intervals or Bayesian estimates
◮ comparison of type probabilities (H0 : π1 = π2)
◮ statistical associations
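As a minimal sketch of this inference step (in Python rather than R, with invented toy counts; the Wald interval used here is one standard choice, not a method prescribed by the slides): the maximum-likelihood estimate of a type probability is its relative frequency, and a normal-approximation confidence interval quantifies the sampling uncertainty.

```python
import math

# hypothetical toy counts: type omega occurs f times in a sample of N tokens
f, N = 60, 100_000

# maximum-likelihood estimate of the type probability pi_omega
pi_hat = f / N

# 95% normal-approximation (Wald) confidence interval for pi_omega;
# adequate when f is not too small, but unreliable for rare types --
# which is exactly the LNRE problem discussed on the next slides
se = math.sqrt(pi_hat * (1 - pi_hat) / N)
lo, hi = pi_hat - 1.96 * se, pi_hat + 1.96 * se
print(f"pi_hat = {pi_hat:.6f}, 95% CI = [{lo:.6f}, {hi:.6f}]")
```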

A characteristic problem: Zipf’s law

◮ a linguistic population is usually characterized by a very large or even infinite number of type probabilities

◮ in addition, a substantial portion of the probability mass is distributed over very infrequent types (≠ normal dist.)

◮ referred to as the LNRE property: large number of rare events (Khmaladze 1987)

◮ popularly known as Zipf’s law, based on the Zipf-Mandelbrot law for type probabilities πk = πωk:

  πk ≈ C / (k + b)^a

  where b > 0 and a > 1 is usually close to 1

◮ Zipf ranking: π1 ≥ π2 ≥ π3 ≥ . . .
◮ see e.g. Baayen (2001, 101) for the Zipf-Mandelbrot law
◮ can be derived from a Markov process (Rouault 1978)
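The law above is easy to instantiate for a finite population. A quick sketch in Python (the values of S, a and b are assumed purely for illustration):

```python
# Zipf-Mandelbrot probabilities pi_k = C / (k + b)^a for a population of S types,
# with a > 1 close to 1 and b > 0 as on the slide (parameter values assumed here)
S, a, b = 10_000, 1.2, 0.7

weights = [(k + b) ** -a for k in range(1, S + 1)]
C = 1 / sum(weights)            # normalizing constant so the probabilities sum to 1
pi = [C * w for w in weights]

print(abs(sum(pi) - 1.0) < 1e-9)                      # True: proper distribution
print(all(pi[i] >= pi[i + 1] for i in range(S - 1)))  # True: Zipf ranking holds
```

Because the exponent a is close to 1, the probabilities decay slowly: even with C fixed by normalization, a large share of the total mass sits in the long tail of low-probability types.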

Consequences of Zipf’s law

◮ most types occur just once in a sample (hapax legomena) or not at all (out-of-vocabulary, OOV)

◮ hypothesis tests, confidence intervals and Bayesian estimates (for uniform or beta priors) will be inaccurate

  Imagine a population with 500 highly frequent types (π = 10⁻³) and 500,000 rare types (π = 10⁻⁶). In a sample of size N = 1000 there will be approx. 500 of the rare types among the hapax legomena, but the p-value for each individual occurrence is p < .001 (binomial test).

◮ estimators can also be highly biased if unseen types (OOV) are not taken into account
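The boxed example can be checked directly with the binomial distribution (a sketch in Python; the population figures are those given in the example):

```python
# population from the example: 500,000 rare types with pi = 1e-6; sample size 1000
n, pi_rare, n_rare = 1000, 1e-6, 500_000

# expected number of rare types occurring exactly once (hapax legomena):
# n_rare * P(X = 1) with X ~ Binomial(n, pi_rare)
p_once = n * pi_rare * (1 - pi_rare) ** (n - 1)
exp_hapax_rare = n_rare * p_once
print(round(exp_hapax_rare, 1))   # 499.5, i.e. "approx. 500" as stated

# one-sided binomial p-value for a single occurrence under H0: pi = 1e-6
p_value = 1 - (1 - pi_rare) ** n  # P(X >= 1)
print(p_value < 0.001)            # True: each occurrence looks "significant"
```

So roughly 500 rare hapaxes are fully expected in aggregate, even though each one, tested in isolation, appears highly significant; this is why naive per-type tests mislead under LNRE conditions.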


LNRE models

◮ we need a population model for the distribution of type probabilities ➜ LNRE model (Baayen 2001)

◮ such LNRE models have a wide range of applications:
  ◮ analyze the accuracy of hypothesis tests and confidence interval estimates (Evert 2004b, Ch. 4)
  ◮ better prior distributions for Bayesian estimates
  ◮ estimate the population vocabulary size (number of types), e.g. in authorship attribution (Thisted and Efron 1987), stylometry, or early diagnosis of Alzheimer’s disease (Garrard et al. 2005)
  ◮ extrapolate vocabulary growth, e.g. to estimate the proportion of OOV types in large amounts of text, or the proportion of typos on the Web
  ◮ extrapolate the proportion of hapaxes for measuring morphological productivity in word formation (Baayen 2003; Lüdeling and Evert 2003)


LNRE models based on the Zipf-Mandelbrot law

◮ the most widely-used LNRE models are based on the Zipf-Mandelbrot law

◮ rewrite the Zipf-Mandelbrot equation as a distribution function for the type probabilities (as r.v.):

  F(ρ) ≔ Σ_{k : πk ≤ ρ} πk

◮ F is an increasing step function with range [0, 1]

◮ the type distribution function G is more useful:

  G(ρ) ≔ |{ωk | πk ≥ ρ}|

◮ G is a decreasing step function
◮ for ρ → 0, we have G(ρ) → S (S = population vocabulary size, which may be infinite)
◮ can easily be specified for ρ = πk

The Zipf-Mandelbrot LNRE model

Some simplifications . . .

◮ use Poisson sampling instead of the multinomial distribution (not conditioned on the sample size N)

◮ approximate the step function G(ρ) by a continuous function with type density g(π):

  G(ρ) ≈ ∫_ρ^∞ g(π) dπ

➥ the Zipf-Mandelbrot (ZM) model (Evert 2004a):

  g(π) ≔ C · π^(−α−1) for 0 ≤ π ≤ B, and g(π) ≔ 0 otherwise

◮ free parameters are 0 < α < 1 and 0 < B ≤ 1
◮ relation to the Zipf-Mandelbrot law: α = 1/a
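The constant C is not a free parameter: it follows from requiring the total probability mass ∫ π g(π) dπ over (0, B] to equal 1, which for this density gives C = (1 − α) / B^(1−α). This derivation is ours, not spelled out on the slide, so here is a numerical sanity check (a Python sketch with assumed parameter values):

```python
import math

# assumed illustrative parameter values for the ZM model
alpha, B = 0.5, 0.01
# C from the normalization  integral of pi * g(pi) over (0, B]  = 1
C = (1 - alpha) / B ** (1 - alpha)

def g(pi):
    """ZM type density: C * pi^(-alpha-1) on (0, B], zero otherwise."""
    return C * pi ** (-alpha - 1) if 0.0 < pi <= B else 0.0

# midpoint-rule check of the normalization; the integrand pi * g(pi)
# = C * pi^(-alpha) has an integrable singularity at 0, hence the fine grid
n = 1_000_000
h = B / n
mass = sum(((i + 0.5) * h) * g((i + 0.5) * h) * h for i in range(n))
print(abs(mass - 1.0) < 1e-3)  # True
```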

The Zipf-Mandelbrot LNRE model

[Figures: type distribution G(ρ) and type density g(π) of the ZM model, plotted against probabilities from 10⁻¹⁰ to 10⁻² on a log10-transformed axis, in millions of types]

◮ type density function of the Zipf-Mandelbrot LNRE model:

  g(π) = C · π^(−α−1) (0 ≤ π ≤ B)

(densities in the images are log10-transformed)


The Zipf-Mandelbrot LNRE model

[Figures: distribution function F(ρ) and probability density f(π) of the ZM model, plotted against probabilities from 10⁻¹⁰ to 10⁻² on a log10-transformed axis]

◮ corresponding p.d.f. for the type probabilities:

  f(π) = C · π^(−α) (0 ≤ π ≤ B)

(densities in the images are log10-transformed)


Extensions of the Zipf-Mandelbrot model

◮ the finite ZM model adds a lower threshold A for the type probabilities, i.e. g(π) = 0 for π < A (Evert 2004a)

◮ the GIGP model (Sichel 1971, 1975) uses exponential attenuation instead of abrupt cutoff points, originally suggested by Good (1953, 249)

➥ both allow a better approximation of the true population distribution, but are mathematically less elegant and numerically more complex


Estimation of the model parameters

◮ estimate the parameters of the model from an observed sample
  ◮ type probabilities cannot be observed directly
  ◮ many low-frequency types ➜ estimates unreliable
  ◮ the Zipf ranking of observed frequencies fr may be different from the Zipf ranking of type probabilities πk

◮ individual hapaxes (fk = 1) provide no useful information, but the number V1 of such types does

➥ observed frequency spectrum:

  Vm ≔ |{ωk | fk = m}|

  with vocabulary size

  V ≔ |{ωk | fk > 0}| = Σ_{m≥1} Vm
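Computing an observed frequency spectrum is a two-step counting exercise: count token frequencies per type, then count types per frequency. A toy sketch in Python (zipfR does this in R via its spectrum objects; the sentence is invented):

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()

freq = Counter(tokens)             # type frequencies f_k
spectrum = Counter(freq.values())  # frequency spectrum: V_m = |{types with f_k = m}|
V = len(freq)                      # vocabulary size
N = len(tokens)                    # sample size

print(dict(spectrum))   # one type occurs 3x ("the"), one 2x ("cat"), four are hapaxes
print(V == sum(spectrum.values()))  # True: V is the sum over m of V_m
```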


Observed and expected frequency spectrum

◮ the expected spectrum can be calculated from g(π):

  E[Vm] = ∫_0^∞ ((Nπ)^m / m!) · e^(−Nπ) · g(π) dπ

➥ leads to (incomplete) Gamma functions for the ZM model
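To see how the Gamma functions arise for the ZM model: substituting t = Nπ in the integral above yields E[Vm] = (C N^α / m!) · γ(m − α, NB), a lower incomplete Gamma function, which approaches the complete Γ(m − α) when NB ≫ m. A numerical cross-check in Python (parameter values and the normalization C = (1−α)/B^(1−α) are our assumptions):

```python
import math

# assumed ZM parameters; C from normalizing the total probability mass to 1
alpha, B = 0.5, 0.01
C = (1 - alpha) / B ** (1 - alpha)
N = 100_000  # sample size

def E_Vm_numeric(m, steps=200_000):
    # direct midpoint-rule evaluation of
    # E[V_m] = integral over (0, B] of (N pi)^m / m! * exp(-N pi) * g(pi) d pi
    h = B / steps
    total = 0.0
    for i in range(steps):
        pi = (i + 0.5) * h
        total += (N * pi) ** m / math.factorial(m) * math.exp(-N * pi) \
                 * C * pi ** (-alpha - 1) * h
    return total

def E_Vm_gamma(m):
    # closed form via the Gamma function, a close stand-in for the
    # incomplete Gamma function because N*B = 1000 >> m here
    return C * N ** alpha * math.gamma(m - alpha) / math.factorial(m)

# m = 2 keeps the integrand free of a singularity at 0, so the two agree closely
print(abs(E_Vm_numeric(2) - E_Vm_gamma(2)) / E_Vm_gamma(2) < 1e-3)  # True
```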

[Figure: observed frequency spectrum Vm vs. expected spectrum E[Vm] under the ZM model, for m = 1, . . . , 10 at N = 200k — observed spectrum for word form types among the first 200,000 tokens of the Brown corpus (written American English published in 1961)]


The zipfR library

The zipfR library for R implements:

◮ LNRE models: ZM, finite ZM, GIGP
◮ parameter estimation from an observed spectrum
◮ goodness-of-fit testing (Baayen 2001, 118–122)
◮ plots (spectrum, type & probability density)
◮ many utility functions for type frequency data
◮ fast subsampling & interpolation of an observed spectrum


A zipfR example

◮ spc <- read.spc("brown.200k.spc")

☞ load observed frequency spectrum from file

◮ model <- lnre("zm", spc)

☞ estimate parameters of ZM model from spectrum

◮ summary(model)

☞ displays model parameters & goodness-of-fit

◮ spc.exp <- lnre.spc(model, N(spc))

☞ expected spectrum at this sample size

◮ plot.spc(spc, spc.exp, m.max=10)

☞ plot expected vs. observed spectrum (as seen before)


Extrapolation of vocabulary growth

◮ LNRE models are often used for extrapolation of vocabulary growth beyond the observed sample size
☞ fully supported by the zipfR library

[Figure: vocabulary growth curves, observed vs. expected, V (k types) against N (k tokens) — extrapolation of vocabulary growth in the Brown corpus from the first 200,000 tokens to the full size of 1 million word tokens, using the ZM model]
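Under Poisson sampling, the quantity being extrapolated is the expected vocabulary size E[V(N)] = ∫ (1 − e^(−Nπ)) g(π) dπ (cf. Baayen 2001); each type contributes its probability of being seen at least once. A Python sketch with assumed ZM parameters, showing the characteristic shape of the growth curve:

```python
import math

# assumed ZM parameters; C from normalizing the total probability mass to 1
alpha, B = 0.5, 0.01
C = (1 - alpha) / B ** (1 - alpha)

def E_V(N, steps=400_000):
    # expected vocabulary size under Poisson sampling:
    # E[V(N)] = integral over (0, B] of (1 - exp(-N pi)) * g(pi) d pi
    h = B / steps
    return sum((1 - math.exp(-N * (i + 0.5) * h))
               * C * ((i + 0.5) * h) ** (-alpha - 1) * h
               for i in range(steps))

growth = [E_V(N) for N in (50_000, 100_000, 150_000)]

# the vocabulary keeps growing with sample size, but at a decreasing rate --
# the hallmark of LNRE populations (for the ZM model, E[V(N)] grows like N^alpha)
print(growth[0] < growth[1] < growth[2])              # True
print(growth[1] - growth[0] > growth[2] - growth[1])  # True
```

Because α < 1 the curve never flattens completely, which is why the population vocabulary size S can be infinite while every finite sample still has a finite V.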


Further work on the zipfR library

◮ more accurate and robust implementation of the models
◮ better parameter estimation (plain nlm() for now)
◮ extended functionality for the automation of experiments, e.g. extrapolation experiments with multiple randomizations (Baroni and Evert 2005)
◮ more advanced LNRE models for better goodness-of-fit
◮ corrections for non-randomness ➜ better extrapolation

☞ what do you want?


Availability

Availability of the zipfR library . . . *ahem*

◮ but we can promise that it will be up on CRAN by the end of July (in time for our ESSLLI course on Counting Words)

◮ some functionality (e.g. the ZM and fZM models) is already available in the UCS toolkit (www.collocations.de)

◮ we’re also working on the corpora library for R, with basic statistical inference from corpus frequency data


Thank you!

Questions? Fragen?


References I

Baayen, R. Harald (2001). Word Frequency Distributions. Kluwer Academic Publishers, Dordrecht.

Baayen, R. Harald (2003). Probabilistic approaches to morphology. In R. Bod, J. Hay, and S. Jannedy (eds.), Probabilistic Linguistics, chapter 7, pages 229–287. MIT Press, Cambridge.

Baroni, Marco and Evert, Stefan (2005). Testing the extrapolation quality of word frequency models. In P. Danielsson and M. Wagenmakers (eds.), Proceedings of Corpus Linguistics 2005, volume 1 of The Corpus Linguistics Conference Series. ISSN 1747-9398.

Evert, Stefan (2004a). A simple LNRE model for random character sequences. In Proceedings of the 7èmes Journées Internationales d’Analyse Statistique des Données Textuelles, pages 411–422, Louvain-la-Neuve, Belgium.

Evert, Stefan (2004b). The Statistics of Word Cooccurrences: Word Pairs and Collocations. Ph.D. thesis, Institut für maschinelle Sprachverarbeitung, University of Stuttgart. Published in 2005, URN urn:nbn:de:bsz:93-opus-23714. Available from http://www.collocations.de/phd.html.

Garrard, Peter; Maloney, Lisa M.; Hodges, John R.; Patterson, Karalyn (2005). The effects of very early Alzheimer’s disease on the characteristics of writing by a renowned author. Brain, 128(2), 250–260.


References II

Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40(3/4), 237–264.

Khmaladze, E. V. (1987). The statistical analysis of large number of rare events. Technical Report MS-R8804, Department of Mathematical Statistics, CWI, Amsterdam, Netherlands.

Lüdeling, Anke and Evert, Stefan (2003). Linguistic experience and productivity: corpus evidence for fine-grained distinctions. In D. Archer, P. Rayson, A. Wilson, and T. McEnery (eds.), Proceedings of the Corpus Linguistics 2003 Conference, pages 475–483. UCREL.

Rouault, Alain (1978). Lois de Zipf et sources markoviennes. Annales de l’Institut H. Poincaré (B), 14, 169–188.

Sichel, H. S. (1971). On a family of discrete distributions particularly suited to represent long-tailed frequency data. In N. F. Laubscher (ed.), Proceedings of the Third Symposium on Mathematical Statistics, pages 51–97, Pretoria, South Africa. C.S.I.R.

Sichel, H. S. (1975). On a distribution law for word frequencies. Journal of the American Statistical Association, 70, 542–547.

Thisted, Ronald and Efron, Bradley (1987). Did Shakespeare write a newly-discovered poem? Biometrika, 74(3), 445–455.