

SLIDE 1

Self-tuning ongoing terminology extraction retrained on terminology validation decisions

Alfredo Maldonado and David Lewis

ADAPT Centre, School of Computer Science and Statistics, Trinity College Dublin

The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

TKE 2016 Copenhagen

SLIDE 2

Agenda

  • Motivation – Why do we need to do terminology extraction on an ongoing basis?
  • Methodology – Ongoing terminology extraction with and without learning
  • Experimental Setup and Results – Description of simulation experiments and results
  • Conclusions and next steps – The feedback loop in machine learning-based ongoing terminology extraction can help in identifying the majority of terms in a batch of new content

SLIDE 3

MOTIVATION

SLIDE 4

A frequent assumption in terminology extraction

  • Surely if I do terminology extraction at some point towards the beginning of a content creation project, I will capture the majority of the terms of interest that are ever likely to appear, right?
  • I’m basically taking a representative sample of the terms in the project

SLIDE 5

Let’s test that assumption

  • Here’s an actual example using the term-annotated ACL RD-TEC (QasemiZadeh and Handschuh, 2014)
  • ACL RD-TEC: a corpus of ACL academic papers written between 1965 and 2006 in which domain-specific terms have been manually annotated

SLIDE 6

Motivation – new content introduces new terms

  • The proportion of new terms in a subsequent year never reaches 0
  • Between 12% and 20% of all valid terms in any given year will be new
  • If you don’t do term extraction periodically (e.g. annually), you will start missing a lot of new terms within a few years

SLIDE 7

The reality is …

  • As content gets updated, new, previously unseen terms will start appearing
  • These terms will not have been captured during our initial term extraction and will have to be researched by our users or our terminologists downstream, causing bottlenecks in translation / usage of terminology, perhaps incurring additional costs

Clipart from https://openclipart.org

SLIDE 8

THE SOLUTION?

(METHODOLOGY)

SLIDE 9

Ongoing terminology extraction

First proposed by Warburton (2013): automatically filtering previously identified terms and non-terms in subsequent extraction exercises.

[Diagram: Content Batches 1–3 each flow through Extraction and Ranking and then Validation, producing selected and rejected terms that feed the terminology pipeline; from Batch 2 onward, an Automatic Filtering step removes candidates already validated in earlier batches.]
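To make the filtering idea concrete, here is a minimal Python sketch of exclusion-list filtering in the spirit of Warburton (2013); the function and variable names are illustrative, not from the original system:

```python
def filter_batch(candidates, already_validated):
    """Drop term candidates that were already selected or rejected
    (i.e. validated) in earlier batches."""
    return [c for c in candidates if c not in already_validated]

validated = set()  # union of selected and rejected terms so far
batches = [["hidden markov model", "in this paper", "parse tree"],
           ["parse tree", "beam search", "in this paper"]]
for batch in batches:
    fresh = filter_batch(batch, validated)  # only unseen candidates
    print(fresh)                            # ...reach the validation step
    validated.update(fresh)                 # record them for later batches
```

In batch 2, only "beam search" survives the filter, since the other candidates were already validated in batch 1.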

SLIDE 10

Proposed Solution: Machine Learning ongoing Terminology Extraction (MLTE)

Instead of compiling term lists for filtering, we introduce a machine learning classification model that learns from terminologists’ validation decisions.

[Diagram: as before, Content Batches 1–3 each flow through Extraction and Validation into the terminology pipeline, but from Batch 2 onward a Candidate Classification step applies a model trained on the validation decisions of earlier batches; each batch’s decisions retrain the model for the next batch.]

SLIDE 11

Proposed System Architecture

Parameter:

  • History size k (number of past batches to use as training data)

[Diagram: the text and validation decisions from the previous k batches are used to train the model for the current batch; that model classifies the current batch’s term candidates as Valid or Not Valid, and the current batch’s text and validation decisions in turn feed the training data for future batches.]
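A small sketch of how the history-size parameter might bound the training data, assuming validated examples are stored per batch (the data layout here is hypothetical):

```python
from collections import deque

K = 16  # history size: number of past batches retained as training data
history = deque(maxlen=K)  # one entry per batch: list of (features, decision)

def training_data(history):
    """Concatenate validated examples from the K most recent batches."""
    X, y = [], []
    for batch in history:
        for features, decision in batch:
            X.append(features)
            y.append(decision)
    return X, y

# After validating each batch, push its decisions; older batches fall out
# of the window automatically, and the model is retrained on (X, y).
history.append([({"pos": "NN NN"}, True), ({"pos": "DT NN"}, False)])
X, y = training_data(history)
```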

SLIDE 12

EXPERIMENTAL SETUP AND RESULTS

SLIDE 13

Dataset

  • Uses the ACL RD-TEC corpus
  • Has a terminology gold standard
  • Has term index info (which terms appear in which docs)
  • Documents are time-stamped (date of conference), e.g. C04-1001_cln.txt, J05-1003_cln.txt
  • Sample: ACL RD-TEC papers from 2004 to 2006
  • 2,781 articles
  • 9,114,767 words
  • 3,300 words per article on average
  • Sample divided into chronological batches of approx. 40 articles each (a batching sketch follows below)
  • 69 batches
  • Simulates ongoing term extraction AND validation using an annotated, time-stamped corpus
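A sketch of the chronological batching just described; the field names are assumptions, with real dates supplied by the corpus metadata:

```python
def chronological_batches(articles, batch_size=40):
    """Sort time-stamped articles and split them into chronological
    batches of roughly batch_size articles each."""
    ordered = sorted(articles, key=lambda a: a["date"])
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]

papers = [{"id": "J05-1003_cln.txt", "date": 2005},
          {"id": "C04-1001_cln.txt", "date": 2004}]
batches = chronological_batches(papers)
# With the full 2,781-article sample this yields the 69 batches used here.
```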

SLIDE 14

Simulation

Given current batch b_t (a sketch of this loop follows after the list):

1. Extract term candidate n-grams from the articles in the batch (n = 1 .. 7)
2. Automatically remove any term candidates that appeared in any previous batch – as in Warburton (2013)
3. Automatically remove any term candidates with POS patterns not associated with any valid terms in previous batches
  • This reduces the number of non-valid term candidates in the training data, counteracting the skew towards non-valid candidates
  • Notice there is no need to supply manual POS pattern filters!
4. Using the previously trained model (if available), predict whether each term candidate is a valid term or not
5. Evaluate predictions by comparing them with the gold standard in the ACL RD-TEC annotation – this simulates the manual validation step
6. Create new training data by concatenating these gold-standard data points with those of the previous k−1 batches (history of size k). In our experiments, best results were obtained with k = 16.
7. Train a new model using the newly created training data
8. Go to the next batch b_t+1 and repeat from step 1 until all batches are completed
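The loop below is a compact, self-contained sketch of steps 1–8. The helpers are toy stand-ins, not the authors' implementation: step 3 (POS-pattern pruning) is omitted, and a term-memorising classifier substitutes for the SVM described on the next slide.

```python
def extract_ngrams(text, max_n=7):
    """Step 1: extract word n-gram term candidates, n = 1..7."""
    toks = text.split()
    return {" ".join(toks[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(toks) - n + 1)}

def train_classifier(examples):
    """Step 7 stand-in: 'train' by memorising validated terms."""
    valid = {cand for cand, is_term in examples if is_term}
    return lambda cand: cand in valid

def simulate(batches, gold, k=16):
    seen, history, model = set(), [], None
    for text in batches:
        cands = extract_ngrams(text) - seen       # step 2: drop seen candidates
        seen |= cands                             # (step 3, POS pruning, omitted)
        if model is not None:                     # step 4: predict validity
            hits = sum(1 for c in cands if model(c) and c in gold)
            print(f"correctly predicted {hits} valid terms")   # step 5: vs. gold
        history.append([(c, c in gold) for c in cands])        # step 6: decisions
        train = [ex for b in history[-k:] for ex in b]         # last k batches only
        model = train_classifier(train)                        # step 7: retrain
        # step 8: continue with the next batch

simulate(["hidden markov model training", "statistical parsing of text"],
         gold={"hidden markov model", "statistical parsing"})
```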

SLIDE 15

Model and Features

  • Model
  • Support Vector Machine (SVM) classifier
  • Linear kernel
  • Features (a feature-extraction sketch follows below)
  • Term candidate’s POS pattern
  • Term candidate’s character 3-grams
  • Two domain contrastive features:
  • Domain Relevance (DR) (Navigli and Velardi, 2002)
  • Term Cohesion (TC) (Park et al., 2002)
  • Contrastive corpus 1 – a 500-way clustering of 2009 Wikipedia documents (Baroni et al., 2009)
  • Contrastive corpus 2 – a dynamic clustering of the batch history (each cluster has roughly 40 articles)
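A hedged scikit-learn sketch of this feature set and classifier, not the authors' code. The POS patterns and the DR and TC values are hard-coded placeholders here (computing DR and TC requires the contrastive corpora), so this only shows how the features could be assembled for a linear SVM:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def candidate_features(candidate, pos_pattern, dr, tc):
    """POS pattern (categorical), character 3-grams (counts) and the two
    contrastive features DR and TC (numeric placeholders here)."""
    feats = {"pos=" + pos_pattern: 1.0, "DR": dr, "TC": tc}
    padded = "#" + candidate + "#"              # mark word boundaries
    for i in range(len(padded) - 2):
        key = "c3=" + padded[i:i + 3]
        feats[key] = feats.get(key, 0.0) + 1.0
    return feats

X_dicts = [candidate_features("parse tree", "NN NN", dr=0.9, tc=0.8),
           candidate_features("the results", "DT NNS", dr=0.2, tc=0.1)]
y = [1, 0]                                      # terminologist decisions

vec = DictVectorizer()
clf = LinearSVC()                               # SVM with a linear kernel
clf.fit(vec.fit_transform(X_dicts), y)
new = candidate_features("parse trees", "NN NNS", dr=0.7, tc=0.6)
print(clf.predict(vec.transform([new])))        # 1 = predicted valid term
```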

SLIDE 16

Experiments

  • Our simulated approach, as described
  • Two baselines:
  • Baseline 1: An approximation to Warburton’s (2013) method using standard, off-the-shelf filter-rankers provided by JATE (Zhang et al., 2008)
  • Automatic filtering across batches takes place
  • No learning model is trained
  • Baseline 2: Train an SVM classifier using our features on the first batch and use that classifier to predict terms in all subsequent batches
  • Same as our approach, but no retraining at each batch takes place

SLIDE 17

Evaluation

  • Recall (coverage): the % of valid terms in a batch that were predicted as valid
  • Low recall indicates we’re missing many valid terms
  • Precision (true positives): the % of valid terms within the set of term candidates predicted as valid
  • Low precision indicates we’re producing many false positives
  • Usually, we want to identify as many true valid terms as possible, potentially at the risk of returning a relatively high number of false positives
  • We’re interested in achieving high recall (coverage) at the expense of moderate precision (a sketch of both metrics follows below)
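Both metrics reduce to simple set arithmetic over the predicted and gold term sets; a minimal sketch:

```python
def precision_recall(predicted_valid, gold_valid):
    """Precision: share of predictions that are gold valid terms.
    Recall: share of gold valid terms that were predicted."""
    tp = len(predicted_valid & gold_valid)      # true positives
    precision = tp / len(predicted_valid) if predicted_valid else 0.0
    recall = tp / len(gold_valid) if gold_valid else 0.0
    return precision, recall

p, r = precision_recall({"parse tree", "the results"},
                        {"parse tree", "beam search"})
print(p, r)  # 0.5 0.5: one false positive, one missed valid term
```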

SLIDE 18

Results

SLIDE 19

CONCLUSIONS AND NEXT STEPS

SLIDE 20

Conclusions

  • Obtained good recall (coverage) scores using our method (ONGOING), much better than the two baselines
  • Average recall of 74.16% across all batches
  • Precision scores are quite disappointing, meaning that we can expect many false positives in each batch
  • Ongoing retraining does help in keeping recall high
  • Manual terminology validation already takes place in virtually all terminology extraction tasks. Let’s just use those decisions to train an ongoing machine-learning classifier automatically!
  • The lack of a feedback loop mechanism in the statistical filter-rankers does hinder their performance when they are used on an ongoing basis with automatic exclusion lists

SLIDE 21

Future work

  • Conduct human-based benchmarks
  • Address low precision scores
  • Post-processing strategies like re-ranking predicted candidates (e.g. by using statistical rankers)
  • Exploring new features based on topic models
  • Exploring reinforcement learning techniques
  • Experiment on other datasets from several other domains
  • Further investigate the role of the contrastive corpus
  • E.g. not all specialised terms will feature in Wikipedia
  • Fall-back strategies like relying on sub-terms
  • Distributional vector composition techniques to estimate feature values of terms missing from the contrastive corpus

SLIDE 22

QUESTIONS?

Alfredo Maldonado
Research Fellow, ADAPT Centre at Trinity College Dublin
alfredo.maldonado@adaptcentre.ie | maldonaa@tcd.ie | @alfredomg on Twitter

The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

Clipart from https://openclipart.org

SLIDE 23

APPENDIX

SLIDE 24

Two assumptions

  • Assumption 1: We have a terminology pipeline in which extracted terms will be further processed by terminologists and other linguists (e.g. research, translation, etc.) and will end up in a terminology database (termbase) to be used by other professionals (e.g. translators, specialists), organisations and systems
  • Assumption 2: We have a non-static, ongoing source of new content
  • Examples:
  • Academic/scientific papers from journals, conference proceedings, etc.
  • Technical manuals for industrial, technological or medical equipment
  • Web-based/online content
  • Strings for software, mobile apps, web apps
  • Content that starts getting translated before the source text is completed (“sim ship” or “simultaneous ship”)
  • If your content is static and finite, you perhaps won’t benefit from ongoing terminology extraction – but is your content really static?

Clipart from https://openclipart.org and http://howiedi2.deviantart.com

[Diagram: terminology pipeline connecting Content, Terms, Research, Translation, Termbase and Users]

SLIDE 25

Evaluation of filter-rankers

  • Consider the top N ranked candidates as valid term predictions and all other candidates as non-valid term predictions (Pecina, 2010)
  • If a batch has v valid terms, we could consider the N = v top candidates as valid terms and the rest as non-valid terms
  • However, N = v is too inflexible and will tend to penalise the recall of rankers
  • In our experiments we use N = 2v (see the sketch below)
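A small sketch of this cutoff, assuming the ranker returns candidates ordered best-first (names are illustrative):

```python
def top_n_as_valid(ranked_candidates, v, multiplier=2):
    """Pecina-style cutoff: treat the top N = multiplier * v ranked
    candidates as predicted valid terms, where v is the number of
    gold valid terms in the batch."""
    return set(ranked_candidates[:multiplier * v])

ranked = ["parse tree", "in this paper", "beam search", "we show", "of the"]
preds = top_n_as_valid(ranked, v=2)   # N = 2v = 4 predictions
print(preds)
```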
SLIDE 26

Evaluation of filter-rankers

[Chart: filter-ranker evaluation results with N = 2v]

SLIDE 27

Evaluation of filter-rankers

[Chart: filter-ranker evaluation results with N = 7v]