

SLIDE 1

Self-tuning ongoing terminology extraction retrained on terminology validation decisions

Alfredo Maldonado and David Lewis

ADAPT Centre, School of Computer Science and Statistics, Trinity College Dublin

The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

TKE 2016 Copenhagen

SLIDE 2

Agenda

  • Motivation – Why do we need to do terminology extraction on an ongoing basis?
  • Methodology – Ongoing terminology extraction with and without learning
  • Experimental Setup and Results – Description of simulation experiments and results
  • Conclusions and next steps – The feedback loop in machine learning-based ongoing terminology extraction can help in identifying the majority of terms in a batch of new content

SLIDE 3

MOTIVATION

SLIDE 4

A frequent assumption in terminology extraction

  • Surely if I do terminology extraction at some point towards the beginning of a content creation project, I will capture the majority of the terms of interest that are ever likely to appear, right?
  • I’m basically taking a representative sample of the terms in the project

SLIDE 5

Let’s test that assumption

  • Here’s an actual example using the term-annotated ACL RD-TEC (QasemiZadeh and Handschuh, 2014)
  • ACL RD-TEC: a corpus of ACL academic papers written between 1965 and 2006 in which domain-specific terms have been manually annotated

SLIDE 6

Motivation – new content introduces new terms

  • The proportion of new terms in a subsequent year never reaches 0
  • Between 12% and 20% of all valid terms in any given year will be new
  • If you don’t do term extraction periodically (e.g. annually), you will start missing a lot of new terms within a few years

SLIDE 7

The reality is …

  • As content gets updated, new, previously unseen terms will start appearing
  • These terms will not have been captured during our initial term extraction and will have to be researched by our users or our terminologists downstream, causing bottlenecks in translation / usage of terminology, perhaps incurring additional costs

Clipart from https://openclipart.org

SLIDE 8

THE SOLUTION?

(METHODOLOGY)

SLIDE 9

Ongoing terminology extraction

First proposed by Warburton (2013): automatically filtering previously identified terms and non-terms in subsequent extraction exercises.

[Diagram: Content Batches 1–3 each flow through Extraction and Ranking and then Validation, producing selected and rejected terms that feed the terminology pipeline; from Batch 2 onward, an Automatic Filtering step removes candidates already validated in earlier batches.]
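To make the filtering idea concrete, here is a minimal Python sketch of exclusion-list filtering in the spirit of Warburton (2013); the function and variable names are illustrative, not from the original system:

```python
def filter_batch(candidates, already_validated):
    """Drop term candidates that were already selected or rejected
    (i.e. validated) in earlier batches."""
    return [c for c in candidates if c not in already_validated]

validated = set()  # union of selected and rejected terms so far
batches = [["hidden markov model", "in this paper", "parse tree"],
           ["parse tree", "beam search", "in this paper"]]
for batch in batches:
    fresh = filter_batch(batch, validated)  # only unseen candidates
    print(fresh)                            # ...reach the validation step
    validated.update(fresh)                 # record them for later batches
```

In batch 2, only "beam search" survives the filter, since the other candidates were already validated in batch 1.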

SLIDE 10

Proposed Solution: Machine Learning ongoing Terminology Extraction (MLTE)

Instead of compiling term lists for filtering, we introduce a machine learning classification model that learns from terminologists’ validation decisions.

[Diagram: as before, Content Batches 1–3 each flow through Extraction and Validation into the terminology pipeline, but from Batch 2 onward a Candidate Classification step applies a model trained on the validation decisions of earlier batches; each batch’s decisions retrain the model for the next batch.]

SLIDE 11

Proposed System Architecture

Parameter:

  • History size k (number of past batches to use as training data)

[Diagram: the text and validation decisions from the previous k batches are used to train the model for the current batch; that model classifies the current batch’s term candidates as Valid or Not Valid, and the current batch’s text and validation decisions in turn feed the training data for future batches.]
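A small sketch of how the history-size parameter might bound the training data, assuming validated examples are stored per batch (the data layout here is hypothetical):

```python
from collections import deque

K = 16  # history size: number of past batches retained as training data
history = deque(maxlen=K)  # one entry per batch: list of (features, decision)

def training_data(history):
    """Concatenate validated examples from the K most recent batches."""
    X, y = [], []
    for batch in history:
        for features, decision in batch:
            X.append(features)
            y.append(decision)
    return X, y

# After validating each batch, push its decisions; older batches fall out
# of the window automatically, and the model is retrained on (X, y).
history.append([({"pos": "NN NN"}, True), ({"pos": "DT NN"}, False)])
X, y = training_data(history)
```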

SLIDE 12

EXPERIMENTAL SETUP AND RESULTS

SLIDE 13

Dataset

  • Uses the ACL RD-TEC corpus
  • Has a terminology gold standard
  • Has term index info (which terms appear in which docs)
  • Documents are time-stamped (date of conference), e.g. C04-1001_cln.txt, J05-1003_cln.txt
  • Sample: ACL RD-TEC papers from 2004 to 2006
  • 2,781 articles
  • 9,114,767 words
  • 3,300 words per article on average
  • Sample divided into chronological batches of approx. 40 articles each (a batching sketch follows below)
  • 69 batches
  • Simulates ongoing term extraction AND validation using an annotated, time-stamped corpus
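A sketch of the chronological batching just described; the field names are assumptions, with real dates supplied by the corpus metadata:

```python
def chronological_batches(articles, batch_size=40):
    """Sort time-stamped articles and split them into chronological
    batches of roughly batch_size articles each."""
    ordered = sorted(articles, key=lambda a: a["date"])
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]

papers = [{"id": "J05-1003_cln.txt", "date": 2005},
          {"id": "C04-1001_cln.txt", "date": 2004}]
batches = chronological_batches(papers)
# With the full 2,781-article sample this yields the 69 batches used here.
```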

SLIDE 14

Simulation

Given current batch b_t (a sketch of this loop follows after the list):

1. Extract term candidate n-grams from the articles in the batch (n = 1 .. 7)
2. Automatically remove any term candidates that appeared in any previous batch – as in Warburton (2013)
3. Automatically remove any term candidates with POS patterns not associated with any valid terms in previous batches
  • This reduces the number of non-valid term candidates in the training data, counteracting the skew towards non-valid candidates
  • Notice there is no need to supply manual POS pattern filters!
4. Using the previously trained model (if available), predict whether each term candidate is a valid term or not
5. Evaluate predictions by comparing them with the gold standard in the ACL RD-TEC annotation – this simulates the manual validation step
6. Create new training data by concatenating these gold-standard data points with those of the previous k−1 batches (history of size k). In our experiments, best results were obtained with k = 16.
7. Train a new model using the newly created training data
8. Go to the next batch b_t+1 and repeat from step 1 until all batches are completed
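The loop below is a compact, self-contained sketch of steps 1–8. The helpers are toy stand-ins, not the authors' implementation: step 3 (POS-pattern pruning) is omitted, and a term-memorising classifier substitutes for the SVM described on the next slide.

```python
def extract_ngrams(text, max_n=7):
    """Step 1: extract word n-gram term candidates, n = 1..7."""
    toks = text.split()
    return {" ".join(toks[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(toks) - n + 1)}

def train_classifier(examples):
    """Step 7 stand-in: 'train' by memorising validated terms."""
    valid = {cand for cand, is_term in examples if is_term}
    return lambda cand: cand in valid

def simulate(batches, gold, k=16):
    seen, history, model = set(), [], None
    for text in batches:
        cands = extract_ngrams(text) - seen       # step 2: drop seen candidates
        seen |= cands                             # (step 3, POS pruning, omitted)
        if model is not None:                     # step 4: predict validity
            hits = sum(1 for c in cands if model(c) and c in gold)
            print(f"correctly predicted {hits} valid terms")   # step 5: vs. gold
        history.append([(c, c in gold) for c in cands])        # step 6: decisions
        train = [ex for b in history[-k:] for ex in b]         # last k batches only
        model = train_classifier(train)                        # step 7: retrain
        # step 8: continue with the next batch

simulate(["hidden markov model training", "statistical parsing of text"],
         gold={"hidden markov model", "statistical parsing"})
```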

SLIDE 15

Model and Features

  • Model
  • Support Vector Machine (SVM) classifier
  • Linear kernel
  • Features (a feature-extraction sketch follows below)
  • Term candidate’s POS pattern
  • Term candidate’s character 3-grams
  • Two domain contrastive features:
  • Domain Relevance (DR) (Navigli and Velardi, 2002)
  • Term Cohesion (TC) (Park et al., 2002)
  • Contrastive corpus 1 – a 500-way clustering of 2009 Wikipedia documents (Baroni et al., 2009)
  • Contrastive corpus 2 – a dynamic clustering of the batch history (each cluster has roughly 40 articles)
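A hedged scikit-learn sketch of this feature set and classifier, not the authors' code. The POS patterns and the DR and TC values are hard-coded placeholders here (computing DR and TC requires the contrastive corpora), so this only shows how the features could be assembled for a linear SVM:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def candidate_features(candidate, pos_pattern, dr, tc):
    """POS pattern (categorical), character 3-grams (counts) and the two
    contrastive features DR and TC (numeric placeholders here)."""
    feats = {"pos=" + pos_pattern: 1.0, "DR": dr, "TC": tc}
    padded = "#" + candidate + "#"              # mark word boundaries
    for i in range(len(padded) - 2):
        key = "c3=" + padded[i:i + 3]
        feats[key] = feats.get(key, 0.0) + 1.0
    return feats

X_dicts = [candidate_features("parse tree", "NN NN", dr=0.9, tc=0.8),
           candidate_features("the results", "DT NNS", dr=0.2, tc=0.1)]
y = [1, 0]                                      # terminologist decisions

vec = DictVectorizer()
clf = LinearSVC()                               # SVM with a linear kernel
clf.fit(vec.fit_transform(X_dicts), y)
new = candidate_features("parse trees", "NN NNS", dr=0.7, tc=0.6)
print(clf.predict(vec.transform([new])))        # 1 = predicted valid term
```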

SLIDE 16

Experiments

  • Our simulated approach, as described
  • Two baselines:
  • Baseline 1: An approximation to Warburton’s (2013) method using standard, off-the-shelf filter-rankers provided by JATE (Zhang et al., 2008)
  • Automatic filtering across batches takes place
  • No learning model is trained
  • Baseline 2: Train an SVM classifier using our features on the first batch and use that classifier to predict terms in all subsequent batches
  • Same as our approach, but no retraining at each batch takes place

SLIDE 17

Evaluation

  • Recall (coverage): the % of valid terms in a batch that were predicted as valid
  • Low recall indicates we’re missing many valid terms
  • Precision (true positives): the % of valid terms within the set of term candidates predicted as valid
  • Low precision indicates we’re producing many false positives
  • Usually, we want to identify as many true valid terms as possible, potentially at the risk of returning a relatively high number of false positives
  • We’re interested in achieving high recall (coverage) at the expense of moderate precision (a sketch of both metrics follows below)
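Both metrics reduce to simple set arithmetic over the predicted and gold term sets; a minimal sketch:

```python
def precision_recall(predicted_valid, gold_valid):
    """Precision: share of predictions that are gold valid terms.
    Recall: share of gold valid terms that were predicted."""
    tp = len(predicted_valid & gold_valid)      # true positives
    precision = tp / len(predicted_valid) if predicted_valid else 0.0
    recall = tp / len(gold_valid) if gold_valid else 0.0
    return precision, recall

p, r = precision_recall({"parse tree", "the results"},
                        {"parse tree", "beam search"})
print(p, r)  # 0.5 0.5: one false positive, one missed valid term
```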

SLIDE 18

Results

SLIDE 19

CONCLUSIONS AND NEXT STEPS

SLIDE 20

Conclusions

  • Obtained good recall (coverage) scores using our method (ONGOING), much better than the two baselines
  • Average recall of 74.16% across all batches
  • Precision scores are quite disappointing, meaning that we can expect many false positives in each batch
  • Ongoing retraining does help in keeping recall high
  • Manual terminology validation already takes place in virtually all terminology extraction tasks. Let’s just use those decisions to train an ongoing machine-learning classifier automatically!
  • The lack of a feedback loop mechanism in the statistical filter-rankers does hinder their performance when they are used on an ongoing basis with automatic exclusion lists

SLIDE 21

Future work

  • Conduct human-based benchmarks
  • Address low precision scores
  • Post-processing strategies like re-ranking predicted candidates (e.g. by using statistical rankers)
  • Exploring new features based on topic models
  • Exploring reinforcement learning techniques
  • Experiment on other datasets from several other domains
  • Further investigate the role of the contrastive corpus
  • E.g. not all specialised terms will feature in Wikipedia
  • Fall-back strategies like relying on sub-terms
  • Distributional vector composition techniques to estimate feature values of terms missing from the contrastive corpus

SLIDE 22

QUESTIONS?

Alfredo Maldonado
Research Fellow, ADAPT Centre at Trinity College Dublin
alfredo.maldonado@adaptcentre.ie | maldonaa@tcd.ie | @alfredomg on Twitter

The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

Clipart from https://openclipart.org

SLIDE 23

APPENDIX

SLIDE 24

Two assumptions

  • Assumption 1: We have a terminology pipeline in which extracted terms will be further processed by terminologists and other linguists (e.g. research, translation, etc.) and will end up in a terminology database (termbase) to be used by other professionals (e.g. translators, specialists), organisations and systems
  • Assumption 2: We have a non-static, ongoing source of new content
  • Examples:
  • Academic/scientific papers from journals, conference proceedings, etc.
  • Technical manuals for industrial, technological or medical equipment
  • Web-based/online content
  • Strings for software, mobile apps, web apps
  • Content that starts getting translated before the source text is completed (“sim ship” or “simultaneous ship”)
  • If your content is static and finite, you perhaps won’t benefit from ongoing terminology extraction – but is your content really static?

Clipart from https://openclipart.org and http://howiedi2.deviantart.com

[Diagram: terminology pipeline connecting Content, Terms, Research, Translation, Termbase and Users]

SLIDE 25

Evaluation of filter-rankers

  • Consider the top N ranked candidates as valid term predictions and all other candidates as non-valid term predictions (Pecina, 2010)
  • If a batch has v valid terms, we could consider the N = v top candidates as valid terms and the rest as non-valid terms
  • However, N = v is too inflexible and will tend to penalise the recall of rankers
  • In our experiments we use N = 2v (see the sketch below)
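A small sketch of this cutoff, assuming the ranker returns candidates ordered best-first (names are illustrative):

```python
def top_n_as_valid(ranked_candidates, v, multiplier=2):
    """Pecina-style cutoff: treat the top N = multiplier * v ranked
    candidates as predicted valid terms, where v is the number of
    gold valid terms in the batch."""
    return set(ranked_candidates[:multiplier * v])

ranked = ["parse tree", "in this paper", "beam search", "we show", "of the"]
preds = top_n_as_valid(ranked, v=2)   # N = 2v = 4 predictions
print(preds)
```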
SLIDE 26

Evaluation of filter-rankers

[Chart: filter-ranker evaluation results with N = 2v]

SLIDE 27

Evaluation of filter-rankers

[Chart: filter-ranker evaluation results with N = 7v]