Using Monolingual Source-Side In-Domain Data
Jen Drexler, Pamela Shapiro, Xuan Zhang
SCALE Readout, August 9, 2018
Continued Training
[Diagram: general-domain NMT model + in-domain data → continued-trained NMT model]
Monolingual Source-Side In-Domain Data
[Diagram: general-domain NMT model + monolingual source-side in-domain data → ??? → domain-adapted NMT model]
Monolingual → Parallel

- Forward Translation
  - Use the general-domain MT model to translate the monolingual in-domain data (see the sketch after this list)
  - Continued training with the "synthetic" parallel in-domain data
- Data Selection
  - Large corpora of parallel data from a wide range of domains (web crawl)
  - Use the monolingual in-domain data to find parallel sentences that are closest to the desired domain
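To make the forward-translation recipe concrete, here is a minimal sketch. It assumes a trained general-domain Sockeye model (the toolkit mentioned later in this deck) in a directory called general_model/ and monolingual in-domain source text in mono.src; all file and directory names are illustrative.

```python
# Minimal sketch: build "synthetic" in-domain parallel data by forward
# translation. Assumes a trained general-domain Sockeye model in
# general_model/ and monolingual in-domain source text in mono.src
# (all paths illustrative).
import subprocess

# 1) Translate the monolingual in-domain source with the general-domain model.
subprocess.run(
    ["python", "-m", "sockeye.translate",
     "--models", "general_model",
     "--input", "mono.src",
     "--output", "mono.synth.trg"],
    check=True,
)

# 2) Pair each source sentence with its machine translation to form the
#    synthetic in-domain bitext used for continued training.
with open("mono.src") as src, open("mono.synth.trg") as trg, \
     open("synth.src", "w") as out_src, open("synth.trg", "w") as out_trg:
    for s, t in zip(src, trg):
        if s.strip() and t.strip():  # skip empty lines/translations
            out_src.write(s)
            out_trg.write(t)
```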
Forward Translation
Using “synthetic” in-domain data for machine translation
Setup
- Data
  - Same general-domain data and same in-domain validation and test sets as the continued training experiments
  - Use only the source side of the in-domain training data
- Forward Translation Models
  - Baseline models trained on general-domain data
  - Neural machine translation (NMT) or statistical machine translation (SMT)
- Continued Training (CT)
  - "Synthetic" in-domain data only
  - General-domain + synthetic in-domain data
TED Results
[Bar chart: in-domain BLEU for Arabic, German, Farsi, Korean, Russian, and Chinese. Conditions: general-domain NMT model; CT with NMT in-domain data; CT with SMT in-domain data; CT with general-domain + NMT in-domain data; CT with general-domain + SMT in-domain data. Annotated gains over the baseline range from -0.3 to +1.7 BLEU, positive in all but one case.]
Patent Results
[Bar chart: in-domain BLEU for German, Korean, Russian, and Chinese patent data, same conditions as above. Annotated gains over the baseline range from -1.6 to +4.6 BLEU.]
Remaining Questions
- How much synthetic in-domain vs. general-domain data should be used for continued training?
  - The amount of available general-domain data varies widely by language and domain
- Should general-domain and synthetic in-domain data be treated differently during continued training?
  - "Exploiting Source-side Monolingual Data in Neural Machine Translation" (Zhang and Zong, EMNLP 2016)
  - The synthetic target side of the in-domain data is low quality; the decoder should not be trained to produce it
Continued Training Updates
- Alternate training on general-domain and in-domain data
  - Future work: experiment with the ratio of general-domain to in-domain data
- Optionally treat the synthetic data differently
  - Freeze decoder parameters (see the sketch after this list)
    - Future work: experiment with the choice of frozen parameters
  - Multi-task training
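A minimal PyTorch-style sketch of the decoder-freezing and alternation ideas; the .encoder/.decoder attribute names are assumptions about a generic encoder-decoder model, not Sockeye's API.

```python
# Sketch: continued training that updates only the encoder, so the decoder
# is never trained to reproduce the low-quality synthetic target text.
# Assumes a generic PyTorch encoder-decoder with .encoder/.decoder
# submodules; attribute names are illustrative.
import torch

def freeze_decoder(model: torch.nn.Module) -> None:
    for p in model.decoder.parameters():
        p.requires_grad = False

def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # Only hand the still-trainable (encoder) parameters to the optimizer.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=1e-4)

def alternate_batches(general_batches, in_domain_batches):
    # The "alternate training" scheme above: interleave one general-domain
    # batch with one synthetic in-domain batch.
    for g, d in zip(general_batches, in_domain_batches):
        yield g
        yield d
```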
Multi-Task Training
[Architecture diagram, after Zhang and Zong (EMNLP 2016): a shared encoder feeds two decoders. In the original paper, one decoder performs translation on sentence-aligned bilingual data and the other performs reordering on reordered source-side monolingual data; here, the baseline model handles the general-domain translation task and a new auxiliary decoder handles the in-domain translation task.]
(Zhang and Zong, EMNLP 2016)
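As a rough illustration of the shared-encoder, two-decoder setup, here is a sketch; the GRUs, dimensions, and names are placeholders, not the paper's exact architecture.

```python
# Rough sketch of the multi-task idea: one shared encoder, two
# task-specific decoders. The shared encoder receives gradients from
# both tasks; each decoder only from its own. During training,
# minibatches alternate between task="general" and task="in_domain".
import torch
import torch.nn as nn

class SharedEncoderMultiTask(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)  # shared
        self.decoders = nn.ModuleDict({
            "general": nn.GRU(dim, dim, batch_first=True),    # baseline task
            "in_domain": nn.GRU(dim, dim, batch_first=True),  # auxiliary task
        })
        self.output = nn.ModuleDict({
            "general": nn.Linear(dim, vocab_size),
            "in_domain": nn.Linear(dim, vocab_size),
        })

    def forward(self, src, trg_in, task: str):
        _, h = self.encoder(self.embed(src))                  # shared encoding
        dec_out, _ = self.decoders[task](self.embed(trg_in), h)
        return self.output[task](dec_out)                     # vocabulary logits
```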
Chinese TED Results
[Bar chart: Chinese TED BLEU for NMT vs. SMT synthetic in-domain data used concatenated, alternating, alternating + freezing, and alternating + multi-task. Annotated scores: 16.6, 16.7, 16.8, and 17.1 BLEU.]
Chinese WIPO Results
[Bar chart: Chinese WIPO BLEU for the same conditions. Annotated scores: 11.0, 14.7, 16.6, 17.2, 18.1, and 18.6 BLEU.]
Summary
- Continued training with synthetic in-domain data produces consistent in-domain BLEU improvements
  - NMT forward translation is more consistent than SMT
  - SMT is better than NMT for some languages and domains
- Many possible future research directions
  - Modified Sockeye recipes enable alternating domains and multi-task training
Mining Web-Crawled Data
Moore-Lewis Selection on Paracrawl Data
[Diagram: (1) general-domain data → MT training → general-domain NMT model; (2) ParaCrawl data, selected with in-domain source data, plus the general-domain NMT model → continued training → continued-trained NMT model]
ParaCrawl Data
A pipeline for crawling parallel data from the web and cleaning it:

- Documents are aligned based on language ID and URLs, then sentence-aligned and given a cleanliness score

[Pipeline: Crawling → Document Alignment → Sentence Alignment → Cleaning → Domain Selection. Cleaning returns cleanliness scores; a threshold sets the cleanliness/size tradeoff.]
In-Domain Data Selection
Classic approach (Moore and Lewis 2010):

- Train source-side language models from IN (the in-domain data) and from a random sample of CRAWL
- Score each source-side CRAWL sentence by Probability_IN(sent) / Probability_CRAWL(sent)
- A strong method for SMT; we are investigating it for NMT (see the sketch below)
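A minimal sketch of the scoring step, assuming two KenLM language models trained offline (one on the in-domain text, one on a random CRAWL sample); file names are illustrative. Moore and Lewis actually use the length-normalized cross-entropy difference, which is the per-word version of this log-probability ratio.

```python
# Sketch of Moore-Lewis scoring with two KenLM language models, one
# trained on in-domain text (IN), one on a random sample of the crawl
# (CRAWL). File names are illustrative.
import kenlm

lm_in = kenlm.Model("in_domain.arpa")
lm_crawl = kenlm.Model("crawl_sample.arpa")

def moore_lewis(sentence: str) -> float:
    # kenlm's score() returns log10 P(sentence), so the difference is
    # log10 of the ratio P_IN(sent) / P_CRAWL(sent); higher = more
    # in-domain. Dividing by length gives the per-word (cross-entropy
    # difference) variant from the original paper.
    n_words = max(1, len(sentence.split()))
    return (lm_in.score(sentence) - lm_crawl.score(sentence)) / n_words

# Rank the crawl from most to least in-domain (TED-like first).
with open("paracrawl.src") as f:
    ranked = sorted((line.strip() for line in f), key=moore_lewis, reverse=True)
```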
Most TED-Like Sentence
it changes the way we think ; it changes the way we walk in the world ; it changes our responses ; it changes our attitudes towards our current situations ; it changes the way we dress ; it changes the way we do things ; it changes the way we interact with people .
Least TED-Like Sentence
sunday , july 10 , 2016 in riverton the weather forecast would be : during the night the air temperature drops to + 21 ... + 24ºc , dew point : -1,6ºc ; precipitation is not expected , light breeze wind blowing from the south at a speed of 7-11 km / h , clear sky ; in the morning the air temperature drops to + 20 ... + 22ºc , dew point : + 1,24ºc ; precipitation is not expected , gentle breeze wind blowing from the west at a speed of 11-14 km / h , in the sky , sometimes there are small clouds ; in the afternoon the air temperature will be + 20 ... + 22ºc , dew point : + 4,06ºc ; ratio of temperature , wind speed and humidity : a bit dry for some ; precipitation is not expected , moderate breeze wind blowing from the north at a speed of 14-29 km / h , in the sky , sometimes there are small clouds ; in the evening the air temperature drops to + 15 ... + 19ºc , dew point : -0,12ºc ; precipitation is not expected , moderate breeze wind blowing from the north at a speed of 14-32 km / h , clear sky ; day length is 14:52
Perplexity-Based Selection
Rank the sentences, then choose how much data to select based on perplexity on the in-domain data (see the sketch below).

[Plot: language model perplexity on in-domain data vs. number of selected sentences (500K to 4.5M)]
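A toy sketch of picking the cutoff: for increasing selection sizes, build a language model on the selected data and keep the size with the lowest perplexity on held-out in-domain text. An add-one-smoothed unigram model stands in for a real LM here; `ranked` is the sorted list from the previous sketch and `dev` is a list of held-out in-domain sentences.

```python
# Toy sketch: choose the selection size whose LM best fits held-out
# in-domain text. An add-one-smoothed unigram model stands in for a
# real language model; `ranked` is the Moore-Lewis-sorted crawl from
# the previous sketch, `dev` a list of held-out in-domain sentences.
import math
from collections import Counter

def unigram_perplexity(train_sents, dev_sents):
    counts = Counter(w for s in train_sents for w in s.split())
    total, vocab = sum(counts.values()), len(counts) + 1  # +1 for unseen words
    log_prob, n_words = 0.0, 0
    for s in dev_sents:
        for w in s.split():
            log_prob += math.log((counts[w] + 1) / (total + vocab))
            n_words += 1
    return math.exp(-log_prob / max(1, n_words))

# Sweep cutoffs (500K to 4.5M sentences, as in the plot above) and keep
# the one with the lowest in-domain perplexity.
best_ppl, best_n = min(
    (unigram_perplexity(ranked[:n], dev), n)
    for n in range(500_000, 4_500_001, 500_000)
)
print(f"selected {best_n} sentences (dev perplexity {best_ppl:.1f})")
```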
German TED
[Bar chart: German TED BLEU for the general-domain model (28M sentences), continued training on a 1M-sentence random sample (28+1M), continued training on 1M Moore-Lewis-selected ParaCrawl sentences (28+1M), and continued training on the TED bitext. The ParaCrawl selection is annotated with a +1.19 BLEU gain.]
TED Results
[Bar chart: BLEU for German, Korean, and Russian under the same four conditions (general domain, continued on random sample, continued on ParaCrawl selection, continued on TED bitext). Annotated gains: +1.19 and +0.34 BLEU.]
Settings where Moore-Lewis performs poorly
Too much noise:
- The Korean web-crawl data is very noisy
- In preliminary experiments, the threshold for data cleaning mattered significantly

Domain specificity:
- For German patent data, there may not be enough relevant web-crawled data
Analysis
Long sentences:
- A quirk of the Moore-Lewis method
- Smaller selections of Moore-Lewis-ranked data have a high average sentence length
- NMT has a sentence-length limit, and sentences near that limit may cause problems
- However, at the thresholds we tried for TED (600K-1M sentences), results were positive overall
Conclusions
- Domain-based data selection from web-crawled data can help domain adaptation for NMT when only source-side in-domain data is available
- The cleanliness of the web-crawled data matters
- Whether good data for a domain can be found on the web depends on the domain
Curriculum Learning
(for Continued-Training)
Curriculum Learning
- How can we take advantage of the Moore-Lewis scores?
  - Filter the ParaCrawl data by a threshold
  - Arrange the data processing order by Moore-Lewis score ranking
- Curriculum Learning (CL) [Curriculum Learning, Bengio et al. 2009]:
  - Process samples in a certain order (easy ➠ difficult)
  - Train better machine learning models, faster
CL for Continued-Training
- Curriculum Learning (CL): process samples in a certain order (easy ➠ difficult)
- CL for Continued Training (CT): reorganize the TED (bitext) + ParaCrawl data
  - Ranking criterion: Moore-Lewis score (easy: TED-like ➠ difficult: TED-unlike)
- Comparison to Pamela's work:
  - ParaCrawl threshold + random sampling (Pamela)
  - vs. ParaCrawl threshold + TED bitext + ordering (Xuan)
Methods
Jenks Natural Breaks Classification Algorithm
(maximize variance between classes and reduce variance within classes)

- We put the TED data in shard 0
- The ParaCrawl data goes in shards 1-4, ordered from high Moore-Lewis score (TED-like, clean) to low (TED-unlike, noisy); see the sketch after this slide

[Plot: density of Moore-Lewis scores for the TED and ParaCrawl data, with the boundaries of shards 1-4 marked]

Shard sizes (#samples, % of total):

  Shard 0 (TED):        15,163 (12.26%)
  Shard 1 (ParaCrawl):   3,066  (2.48%)
  Shard 2 (ParaCrawl):  13,519 (10.93%)
  Shard 3 (ParaCrawl):  32,179 (26.03%)
  Shard 4 (ParaCrawl):  59,708 (48.29%)
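A sketch of the sharding step, assuming the third-party jenkspy package for the Jenks natural breaks (its API may differ across versions); the Moore-Lewis scores come from the earlier scoring sketch.

```python
# Sketch: split ParaCrawl into four shards with Jenks natural breaks on
# the Moore-Lewis scores. Assumes the third-party `jenkspy` package;
# shard 1 holds the highest (most TED-like) scores, shard 4 the lowest.
import bisect
import jenkspy

def assign_shards(scores, n_shards=4):
    # Jenks returns n_shards + 1 ascending class boundaries.
    breaks = jenkspy.jenks_breaks(scores, n_shards)
    inner = breaks[1:-1]  # interior cut points
    # bisect gives 0 for the lowest-score class; flip so shard 1 = TED-like.
    return [n_shards - bisect.bisect_right(inner, s) for s in scores]

# TED itself goes in shard 0; ParaCrawl sentences get shards 1-4, e.g.:
# shards = [0] * len(ted) + assign_shards(paracrawl_scores)
```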
Training Strategy
After training the NMT model on general-domain data, continued training proceeds on TED + ParaCrawl as follows (see the sketch below):

[Timeline diagram: the curriculum is divided into phases of 1000 minibatches each, separated by curriculum update points; at each update the data is re-shuffled (among and within shards), and training continues until convergence]
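One plausible reading of that timeline, as a sketch: each phase unlocks the next shard (easiest, most TED-like first), reshuffles the unlocked data, and trains for 1000 minibatches. `train_steps` is a hypothetical training hook, and shard 0 is the TED bitext.

```python
# Sketch of the curriculum schedule: each phase unlocks one more shard
# (easiest first), shuffles among and within the unlocked shards, and
# trains for 1000 minibatches, repeating until convergence.
# `train_steps` is a hypothetical training hook.
import random

def curriculum_phases(shards, steps_per_phase=1000):
    # shards: list of datasets; index 0 = TED, 1-4 = ParaCrawl shards.
    unlocked = []
    for shard in shards:                  # each loop = a curriculum update point
        unlocked.append(shard)
        pool = [ex for s in unlocked for ex in s]
        random.shuffle(pool)              # shuffle among and within shards
        yield pool, steps_per_phase

# for pool, steps in curriculum_phases([ted] + paracrawl_shards):
#     train_steps(model, pool, steps)    # 1000 minibatches per phase
```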
Results
German TED results (BLEU):

  CT on 10% TED:                         36.2
  CT on 10% TED + 15% ParaCrawl:         37.5
  CT on 10% TED + 15% ParaCrawl w/ CL:   38.7

(CT = continued training; CL = curriculum learning)

- 10% TED bitext (15,163 samples): low-resource in-domain data
- 15% ParaCrawl (108,472 samples)
Conclusions: Using Monolingual Source-Side In-Domain Data
- 1. Proven adaptation methods for SMT can be used for NMT, with some modification
- 2. Synthetic data generation: include general-domain + synthetic data in continued training
- 3. Web-crawled bitext + in-domain filtering can be effective (but the noise should be filtered)
- 4. (3) can be improved with curriculum learning