A contrastive Approach to Multi-word Term Extraction from - PowerPoint PPT Presentation

A Contrastive Approach to Multi-word Term Extraction from Domain-specific Corpora
Francesca Bonin, Felice Dell'Orletta, Giulia Venturi and Simonetta Montemagni
LREC Malta 2010

Outline
• Introduction and aims
• Multi-word extraction process
  – Multi-word candidates extraction
  – Contrastive re-ranking of extracted terms
• Case studies
• Evaluation
• Conclusions

The aim
• Extraction of domain-specific terminology
  – Focus on multi-words.
  – Filtering terms from noise:
    • Open-domain terms, e.g. anno successivo “following year”
  – For multi-domain terminology: singling out terms which belong to different domains.
    • This is the case in the legal domain
    • e.g. environmental terms from legal terms: rifiuto pericoloso “dangerous waste” singled out from diritto nazionale “national law”

Approaches to terminology extraction
The state of the art in terminology extraction (TE) offers a wide variety of approaches:
• Linguistic
• Statistical
• Mixed (linguistic and statistical)
• Contrastive
Statistical approaches are based on e.g. term frequency/inverse document frequency, log-likelihood and mutual information, up to more sophisticated approaches such as C-NC Value.
Contrastive approaches are usually applied to single-term extraction.
• They handle multi-word extraction by expanding the heads of single terms
• e.g. [Basili et al. 2001]: contrastive selection via heads (CsvH)
  – Single candidate terms are selected using a contrastive function based on their distribution in the target and contrastive corpora
  – Multi-words are then ranked

Our approach: main features
• Multi-word based: we consider multi-words as unique elements, independent of single terms
• Combines different approaches: linguistic + statistical + contrastive
• Multi-layered approach
  – We split the multi-word extraction process into two steps:
    • Extraction of a shortlist of well-formed multi-word candidates
    • Multi-word re-ranking
• Benefits of the two-step approach:
  – Overcomes the multi-word term sparseness problem
  – Multi-word contrastive ranking is independent of single-term ranking

General workflow
[Workflow diagram] Input text → Multi-word candidates extraction → Multi-words contrastive ranking
• Multi-word candidates extraction
  – Sequences of NLP tools: tokenization, morphological analysis (PoS-tagging), lemmatization
  – Linguistic filters: PoS patterns, filtering of multi-word prepositions
  – Statistical filter (C-NC Value)
• Multi-words contrastive ranking
  – Contrastive functions

Step 1: Multi-word term (MWT) candidate extraction
MWT candidate extraction process:
• Linguistic filters
  – Based on automatically POS-tagged and lemmatized text
  – We identify sequences of allowed POS patterns in order to cover most Italian morphosyntactic multi-word structures: Noun+(Prep+(Noun|ADJ)+|Noun|ADJ)+
    • Diritto nazionale – “national law”
    • Presidente della Repubblica – “President of the Republic”
  – Filtering of domain-specific multi-word prepositions, automatically extracted with a first run of the same process using the pattern Noun-Prep-Noun
    • ai sensi di – “by law”
• Statistical filters: C-NC Value (Frantzi & Ananiadou 1999).
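The linguistic filter above can be illustrated with a minimal sketch. It assumes one reading of the slide's pattern (a noun followed by one or more of: a preposition plus one or more nouns/adjectives, a noun, or an adjective) and a hypothetical one-letter tagset (N, A, P, O) that is not the tagset used by the authors. The function name `extract_candidates` and the sample sentence are illustrative only.

```python
import re

# Assumed reading of Noun+(Prep+(Noun|ADJ)+|Noun|ADJ)+ over one-letter tags.
# Hypothetical tagset: N = noun, A = adjective, P = preposition, O = other.
PATTERN = re.compile(r"N(?:P[NA]+|N|A)+")


def extract_candidates(tagged):
    """Return lemma sequences whose tag string matches PATTERN.

    `tagged` is a list of (lemma, tag) pairs; every tag must be a single
    character so that string offsets line up with token positions.
    """
    tags = "".join(tag for _, tag in tagged)
    return [
        " ".join(lemma for lemma, _ in tagged[m.start():m.end()])
        for m in PATTERN.finditer(tags)
    ]


# Invented, already lemmatized example sentence.
sentence = [
    ("il", "O"), ("presidente", "N"), ("di", "P"), ("repubblica", "N"),
    ("firmare", "O"), ("il", "O"), ("diritto", "N"), ("nazionale", "A"),
]
candidates = extract_candidates(sentence)
print(candidates)
```

Mapping each tag to one character keeps the pattern a plain regular expression; a real tagger's multi-character tags would first need to be collapsed into such a coarse alphabet.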

Step 2: MWT contrastive ranking
Candidate multi-word terms are re-ranked using a contrastive method against a reference corpus.
i. Single domain – contrast against an open-domain corpus [for filtering noisy general terms]
ii. Double domain – contrast against a corpus sharing only one of the domains [for singling out different term types in multi-domain corpora]
– In case i): TF-ITF contrastive function
  • Basili et al. 2001 approach: contrastive selection via heads (CsvH).
  • Our approach: the Basili et al. 2001 function applied directly to multi-words.
– In case ii): CSmw contrastive function
  • Based on arctan.
  • Particularly suitable for dealing with low-frequency events

Step 2: MWT contrastive ranking - TF-ITF contrastive function
TF-ITF: Term Frequency Inverse Term Frequency
• Variant of Basili et al. 2001, applied to multi-word terms without passing through single head terms
• A list L of candidate multi-words is extracted with C-NC Value
• The top list of L is ranked on TF-ITF score
$$\mathrm{TFITF}(t) = \log(f_i(t)) \cdot \mathrm{IWF}(t)$$
where $f_i(t)$ is the frequency of the candidate (multi-word) term $t$ and IWF is the inverse word frequency:
$$\mathrm{IWF}(t) = \log\left(\frac{N}{F(t)}\right)$$
• $N$: sum of $F(t)$ over all $t$ in L
• $F(t)$: frequency of $t$ in the contrastive corpus
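As a rough illustration of the TF-ITF ranking, the sketch below scores each candidate as log(f_i(t)) multiplied by log(N / F(t)). The slides do not say how terms absent from the contrastive corpus are handled, so the `floor` parameter is an assumption introduced here to avoid division by zero; the function name and the toy counts are likewise invented for illustration and are not results from the paper.

```python
import math


def tf_itf(target_freq, contrast_freq, floor=0.5):
    """Score candidates with log(f_i(t)) * log(N / F(t)).

    target_freq:   f_i(t), frequency of each candidate in the target corpus
    contrast_freq: F(t), frequency of each candidate in the contrastive corpus
    floor:         assumed stand-in for F(t) == 0 (not specified on the slides)
    """
    contrast = {t: max(contrast_freq.get(t, 0), floor) for t in target_freq}
    # N is the sum of F(t) over the candidate list L.
    n = sum(contrast.values())
    return {
        t: math.log(f) * math.log(n / contrast[t])
        for t, f in target_freq.items()
    }


# Toy counts, invented for illustration only.
target = {"rifiuto pericoloso": 40, "anno successivo": 40}
contrast = {"rifiuto pericoloso": 1, "anno successivo": 99}

scores = tf_itf(target, contrast)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

With equal target frequencies, the term that is rare in the contrastive corpus receives the higher score, which is the behaviour the contrast is meant to produce.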

Step 2: MWT contrastive ranking - CSmw contrastive function
CSmw: Contrastive Selection of multi-word terms
• Specifically designed for dealing with low-frequency events
• The mathematical properties of the arctan function suit the extraction of low-frequency events
• The statistical weight is calculated directly on multi-word terms
$$\mathrm{CSmw}(t) = \arctan\left(\log(f_i(t)) \cdot K(t)\right)$$
where
$$K(t) = \frac{1}{F_c(t) / N_c}$$
C is the set of contrastive domains; $F_c(t)$ is the frequency of $t$ in all contrastive domains of C, normalized by $N_c$, the sum over all $t$ in L of the frequencies in C.
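The CSmw weighting can be sketched in the same way, reading the slide's formula as arctan(log(f_i(t)) · K(t)) with K(t) = 1 / (F_c(t) / N_c). As with any such sketch, the handling of terms unseen in the contrastive domains is not given on the slides; the `floor` parameter below is an assumption, and the function name and counts are invented for illustration.

```python
import math


def cs_mw(target_freq, contrast_freq, floor=0.5):
    """Score candidates with arctan(log(f_i(t)) * K(t)), K(t) = N_c / F_c(t).

    target_freq:   f_i(t), frequency of each candidate in the target corpus
    contrast_freq: F_c(t), frequency of each candidate in the contrastive domains
    floor:         assumed stand-in for F_c(t) == 0 (not specified on the slides)
    """
    contrast = {t: max(contrast_freq.get(t, 0), floor) for t in target_freq}
    # N_c is the sum of contrastive frequencies over the candidate list L.
    n_c = sum(contrast.values())
    scores = {}
    for t, f in target_freq.items():
        k = 1.0 / (contrast[t] / n_c)
        scores[t] = math.atan(math.log(f) * k)
    return scores


# Toy counts, invented for illustration only.
target = {"rifiuto pericoloso": 12, "diritto nazionale": 12}
contrast = {"rifiuto pericoloso": 1, "diritto nazionale": 29}

cs_scores = cs_mw(target, contrast)
cs_ranked = sorted(cs_scores, key=cs_scores.get, reverse=True)
print(cs_ranked)
```

Because arctan is bounded by π/2, very large products are compressed rather than allowed to dominate the ranking, which is one way to see why the function is described as suited to low-frequency events.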

Case studies
Art History case study:
– Aim: domain-specific term extraction
• Corpus of Art History websites, 326,066 tokens.
  – Manually collected by a domain expert
• Open-domain contrastive corpus: PAROLE.
  – Italian texts of different types, 3 million tokens.
Legislative-environmental case study:
– Aim: “double” domain terminology extraction and classification.
• Collection of Italian European legal texts concerning the environmental domain, 394,088 tokens
• Contrastive corpora used:
  – PAROLE corpus (open-domain)
  – Collection of European legal texts concerning the consumer protection domain, 72,210 tokens (domain-specific)

Art History case study
• Extraction of MWT candidates [C-NC Value]
  – Selection of a top list of C-NC Value ranked candidates (threshold empirically set at 600 terms).
• Contrast: against the open-domain corpus PAROLE.
  – Final list L of 300 domain-specific terms.
[Workflow diagram] Art History Corpus → NLP tools → Linguistic filters → Statistical filter → First 600 terms extracted → Contrastive ranking against PAROLE → Final list of 300 artistic terms

Legislative case study
• Extraction of MWT candidates [C-NC Value]
  – Selection of a top list of C-NC Value ranked candidates (threshold empirically set at 600 terms).
• 1st contrast: against the open-domain corpus PAROLE.
  – List L of 300 legal and environmental terms.
• 2nd contrast: against the legal corpus on consumer protection
  – Final list L, new ranking:
    • Top list: environmental terms [rifiuto pericoloso – “dangerous waste”]
    • Bottom list: legal terms [diritto interno – “national law”]
[Workflow diagram] Corpus of legal texts (environmental domain) → NLP tools → Linguistic filters → Statistical filter → First 600 terms extracted → Contrastive ranking against PAROLE → 300 candidate terms (env & leg mixed) → Contrastive ranking against Legal Corpus (consumer protection domain) → Env. terms / Legal terms
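The two successive contrasts can be sketched as a small pipeline. This is only an outline under stated assumptions: the contrastive scores are taken as already computed and passed in as dictionaries, the function name `two_stage_ranking` is hypothetical, and the scores in the usage example are invented. The defaults of 600 and 300 mirror the thresholds given on the slide.

```python
def two_stage_ranking(cvalue_ranked, open_scores, domain_scores,
                      shortlist=600, keep=300):
    """Re-rank C-NC Value candidates with two successive contrasts.

    cvalue_ranked: candidates sorted best-first by C-NC Value
    open_scores:   contrastive score of each term against the open-domain corpus
    domain_scores: contrastive score against the second, domain-specific corpus
    """
    # Keep the top of the C-NC Value ranking.
    top = cvalue_ranked[:shortlist]
    # First contrast: drop general-language noise, keep the best `keep` terms.
    first = sorted(top, key=lambda t: open_scores[t], reverse=True)[:keep]
    # Second contrast: terms of the shared domain sink to the bottom.
    return sorted(first, key=lambda t: domain_scores[t], reverse=True)


# Invented scores, for illustration only.
cvalue_ranked = ["rifiuto pericoloso", "diritto interno", "anno successivo"]
open_scores = {"rifiuto pericoloso": 0.9, "diritto interno": 0.8,
               "anno successivo": 0.1}
domain_scores = {"rifiuto pericoloso": 0.9, "diritto interno": 0.2,
                 "anno successivo": 0.5}

final = two_stage_ranking(cvalue_ranked, open_scores, domain_scores,
                          shortlist=3, keep=2)
print(final)
```

In this toy run the open-domain term is removed by the first contrast, and the second contrast places the environmental term above the legal one.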

Evaluation methodology
• Automatic evaluation using gold standard resources
  – Term list provided by domain experts.
  – EartH, Environmental Applications Reference Thesaurus
• Manual evaluation of unmatched terms, carried out by a domain expert
  – Gold standard resources do not have proper coverage of complex terms.
  – Art domain: Art History Department, University of Pisa.
  – Environmental: Institute of Atmospheric Pollution (CNR).
  – Legal: Scuola Superiore Sant'Anna, Pisa (Osservatorio sul danno alla persona)
• Evaluation has been carried out with respect to the results obtained with:
  – NC-Value
  – CSmw
  – CsvH
  – TF-ITF

Evaluation – Art history domain
• List of 300 extracted artistic terms
• Extracted MWT distributed into 10 groups of 30 terms each.
• Out of the first 300 terms, the CsvH method retrieved the largest number of artistic terms.
• TF-ITF and CSmw have more domain-specific terms in the top list.

Group     NC-Value   CsvH   TF-ITF   CSmw
0-30      24         28     25       25
30-60     20         21     25       24
60-90     20         23     26       25
90-120    18         20     21       24
120-150   20         24     22       26
Tot       102        116    119      124

Evaluation – Legal domain
• List of 300 extracted legal and environmental terms
• Extracted MWT distributed into 10 groups of 30 terms each.
• Top list: mainly environmental terms
• Bottom list: mainly legal terms
[Chart: number of environmental terms (y-axis, 0 to 25) in each group of 30 ranked terms (x-axis, 30 to 300)]
