An Extensive Empirical Study of Collocation Extraction Methods

Introduction | Collocation Extraction | Combining Measures | Summary
An Extensive Empirical Study of Collocation Extraction Methods
Pavel Pecina, pecina@ufal.mff.cuni.cz
Institute of Formal and Applied Linguistics, Charles University, Prague
June 27, 2005
Pavel Pecina: Collocation Extraction

Outline
1. Introduction
   - Notion of Collocation
   - Motivation
   - The Task
2. Collocation Extraction
   - Methodology
   - Association Measures
   - Evaluation
3. Combining Association Measures
   - Classification and Ranking
   - Attribute Selection
4. Summary


Definitions I
Firth (1951): “Collocations of a given word are statements of the habitual or customary places of that word.”
Choueka (1988): “A collocation is a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components.”
Čermák (1982): “Individual words cannot be combined freely or randomly only by syntactic rules. The ability of a word to combine with other words (collocability) can be expressed: a) intensionally → valency, b) extensionally → collocations.”

Characteristic Properties
Non-compositionality (kick the bucket, carriage return, white man)
- The meaning of a collocation is not a straightforward composition of the meanings of its parts.
Non-substitutability (yellow wine, hit the bucket, make homework)
- Components of a collocation cannot be substituted with a related word or a synonym.
Non-modifiability (give a big hand, poor as church mice)
- Collocations cannot be modified or syntactically transformed.
Other properties
- Collocations are not necessarily adjacent. (knock the door)
- Collocations cannot be directly translated. (ice cream)
- Collocations are domain-specific. (carriage return)
- Judging collocations is subjective. (new company)

Types of Collocations
Collocations have both a linguistic and a lexicographic character and cover a wide range of lexical phenomena:
- light verb compounds: verbs with little semantic content (take, make, do)
- verb particle constructions, phrasal verbs (look up, take off, tell off)
- idioms: fixed phrases (kick the bucket)
- stock phrases (good morning)
- technological expressions: concepts or objects in technical domains (hard disk)
- proper names (Ann Arbor)

Motivation
Collocations can be used in a wide range of fields:
- Lexicography
- Machine translation
- Information retrieval, information extraction
- Word sense disambiguation
- Spell/grammar/style-checking
- Text classification and summarization
- Keyword extraction
- Language modeling
- Language generation

The Tasks
Goal: to build a collocation lexicon.
1. Create manually annotated reference data
   - of reasonable size.
2. Evaluate collocation extraction methods
   - interval-wise, by means of precision-recall.
3. Combine association measures for collocation extraction
   - to achieve “better” results.
4. Reduce the number of combined measures
   - and select the “best subset” of the available association measures.
Focus on bigram collocations
1. Processing of longer expressions requires larger amounts of data.
2. Scalability of some methods to higher-order n-grams is limited.


Collocation Extraction
Most methods are based on verification of typical collocation properties.
These properties are formally described by mathematical formulas that determine the degree of association between words.
Such formulas are called association measures; they compute an association score for each collocation candidate from a corpus.
The scores indicate how likely a candidate is to be a collocation.
The scores can be used for ranking or for classification:
  Candidate               Ranking   Classification
  red cross               15.66     1
  decimal point           14.01     1
  arithmetic operation    10.52     1
  paper feeder            10.17     1
  system type              3.54     0
  and others               0.54     0
  program in               0.35     0
  level is                 0.25     0
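
The difference between the two uses of the scores can be sketched as follows. The scores are the ones from the slide's example; the function names and the threshold of 10.0 are assumptions chosen only so that the labels match the slide, not values given by the author.

```python
# Association scores copied from the example on the slide.
scores = {
    "red cross": 15.66,
    "decimal point": 14.01,
    "arithmetic operation": 10.52,
    "paper feeder": 10.17,
    "system type": 3.54,
    "and others": 0.54,
    "program in": 0.35,
    "level is": 0.25,
}


def rank(scores):
    """Return the candidates sorted by descending association score."""
    return sorted(scores, key=scores.get, reverse=True)


def classify(scores, threshold):
    """Label a candidate 1 if its score reaches the threshold, else 0."""
    return {cand: int(score >= threshold) for cand, score in scores.items()}


ranking = rank(scores)
# Hypothetical threshold; the slide does not state how the 0/1 labels were obtained.
labels = classify(scores, threshold=10.0)
```

Ranking keeps the full ordering and leaves the cut-off to the user, while classification commits to one decision boundary.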

The Methodology
1. Identifying word base forms:
   - Surface forms
   - Stems or lemmas
   - Lemmas with additional morphosyntactic features
2. Extracting all possible collocation candidates:
   - Consecutive word n-grams (multi-word expressions)
   - Sliding window
   - Syntactic structures (dependency n-grams)
3. Collecting cooccurrence statistics:
   - Frequency of word and n-gram occurrences
   - Immediate contexts
   - Global contexts
4. Computing association measures
5. Ranking or classification
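
Steps 2 and 3 can be illustrated with a minimal sketch that extracts candidates from an already normalized token sequence and counts their frequencies. The function names, the window size, and the toy sentence are assumptions for illustration; dependency n-grams would additionally need a parser and are not covered here.

```python
from collections import Counter


def adjacent_bigrams(tokens):
    """Candidates from consecutive word pairs."""
    return list(zip(tokens, tokens[1:]))


def window_pairs(tokens, window=3):
    """Candidates (x, y) where y follows x within `window` positions."""
    pairs = []
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + 1 + window]:
            pairs.append((x, y))
    return pairs


# Toy input; a real run would use lemmatized corpus text.
tokens = "the black market and the black horse".split()

unigram_freq = Counter(tokens)                    # f(x)
bigram_freq = Counter(adjacent_bigrams(tokens))   # f(xy)
```

The two counters hold the word and bigram frequencies that the association measures in step 4 take as input.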

Word Base Forms
Problem:
- Surface word forms are too specific (rich morphology; we work with Czech).
- Lemmas are too general (loss of syntactic and semantic information).
Solution: lemmas with a subset of morphological tags
  <f>nenahraditelná<l>nahraditelný_(*4)<t>AAFS1----1N----<r>8<g>7
  ↓ selected parts: nahraditelný_(*4)  A  F  1N
  ⇓
  <f>nahraditelný_(*4)<t>A*F1N</f>
  ⇓
  nenahraditelná
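
Selecting a subset of a positional tag can be sketched as below. The positions kept here (0, 2, 9, 10) are an assumption read off the arrows in the example; the slide does not state them, and the sketch does not reproduce the "*" placeholder that appears in the slide's reduced tag.

```python
def select_tag_positions(tag, positions):
    """Keep only the characters of a positional tag at the given indices."""
    return "".join(tag[i] for i in positions)


full_tag = "AAFS1----1N----"   # tag from the slide's example
# Hypothetical choice of positions, not confirmed by the source.
reduced = select_tag_positions(full_tag, (0, 2, 9, 10))
```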

Dependency Bigrams
(figure not preserved)

Cooccurrence Statistics
a) Contingency tables
  f(xy)    f(xȳ)    f(x∗)
  f(x̄y)    f(x̄ȳ)    f(x̄∗)
  f(∗y)    f(∗ȳ)    N
Example
                X = black       X ≠ black     X
  Y = market    black market    new market    market
  Y ≠ market    black horse     new horse     horse
  Y             black           new           (all)
b) Contexts
  C_w       global context of word w
  C_xy      global context of bigram xy
  C^l_xy    left immediate context of xy
  C^r_xy    right immediate context of xy
Example (Czech corpus lines containing "trh", i.e. "market")
  dobrá situace . Kapitálový trh je však stále nelikvidní
  že to není samostatný trh a že je součástá širšího
  bariérách v přístupu na trh , cenových rozdílech ,
  banky . Americký akciový trh byl za silného obchodování
  jít se svou kuží na trh . Pro vydán i mluvila
Context word probability distribution: P(w_i | x)
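
The four inner cells of the contingency table follow from the bigram frequency, the two marginal frequencies, and N. A minimal sketch, with hypothetical function and parameter names and made-up counts:

```python
def contingency_table(f_xy, f_x, f_y, n):
    """Return (f(xy), f(x ȳ), f(x̄ y), f(x̄ ȳ)) from joint and marginal counts.

    f_xy: frequency of the bigram xy
    f_x:  frequency of bigrams with x first, f(x*)
    f_y:  frequency of bigrams with y second, f(*y)
    n:    total number of bigrams, N
    """
    a = f_xy
    b = f_x - f_xy             # x followed by something other than y
    c = f_y - f_xy             # y preceded by something other than x
    d = n - f_x - f_y + f_xy   # neither x first nor y second
    if min(a, b, c, d) < 0:
        raise ValueError("inconsistent counts")
    return a, b, c, d


# Made-up counts for illustration only.
table = contingency_table(f_xy=2, f_x=5, f_y=4, n=20)
```

The four cells always sum to N, which is a quick consistency check on collected counts.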

Types of Association Measures
1. “Collocations are very frequent word combinations.”
   - ML estimations of joint and conditional probabilities
2. “Collocation components occur together more often than by chance.”
   - Mutual information and derived measures
   - Statistical tests of independence
   - Likelihood measures
   - Other heuristic association measures and coefficients
3. “Collocations occur as units in an (information-theoretically) noisy environment.”
   - Immediate context measures
4. “Collocations occur in different contexts than their components.”
   - Information-theory measures
   - Information-retrieval similarity measures
Total: 84 association measures + 3 morphosyntactic features
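
As an illustration of the second group, here is a sketch of two widely used measures computed from the contingency cells a = f(xy), b = f(xȳ), c = f(x̄y), d = f(x̄ȳ): pointwise mutual information and the log-likelihood ratio. The slide does not list the 84 formulas, so treating these two as representatives is an assumption, as are the function names.

```python
import math


def pmi(a, b, c, d):
    """Pointwise mutual information, log2 of P(xy) / (P(x*) P(*y)); needs a > 0."""
    n = a + b + c + d
    return math.log2(a * n / ((a + b) * (a + c)))


def log_likelihood(a, b, c, d):
    """Log-likelihood ratio G^2 = 2 * sum(O * ln(O / E)) over the four cells."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / n
                g2 += o * math.log(o / expected)
    return 2.0 * g2
```

Both measures are zero when the observed cells equal those expected under independence and grow as x and y cooccur more often than chance would predict.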
