Using Monolingual Source-Side In-Domain Data (Jen Drexler, Pamela Shapiro, Xuan Zhang) - PowerPoint PPT Presentation



SLIDE 1

Using Monolingual Source-Side In-Domain Data

Jen Drexler, Pamela Shapiro, Xuan Zhang SCALE Readout August 9, 2018

SLIDE 2

Continued Training

[Diagram: General Domain NMT Model + In-Domain Data → Continued-Trained NMT Model]

SLIDE 3

Monolingual Source-Side In-Domain Data

[Diagram: General Domain NMT Model + In-Domain Data → Domain-Adapted NMT Model, with a question mark over the adaptation step]

SLIDE 4

Monolingual → Parallel

  • Forward Translation
  • Use general-domain MT model to translate monolingual in-domain data
  • Continued training with “synthetic” parallel in-domain data
  • Data Selection
  • Large corpora of parallel data from a wide range of domains (web crawl)
  • Use monolingual in-domain data to find parallel sentences that are closest to the desired domain
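The forward-translation idea above can be sketched in a few lines. The `translate_general` function below is a hypothetical stand-in for a general-domain MT system (not the talk's actual model); here it just uppercases tokens so the example is runnable.

```python
# Sketch of forward translation: build "synthetic" parallel in-domain data
# by machine-translating monolingual in-domain source sentences.

def translate_general(src_sentence):
    # Hypothetical placeholder for a general-domain NMT/SMT system;
    # uppercasing stands in for real translation output.
    return " ".join(tok.upper() for tok in src_sentence.split())

def make_synthetic_parallel(monolingual_src):
    """Pair each in-domain source sentence with its machine translation."""
    return [(src, translate_general(src)) for src in monolingual_src]

in_domain_src = ["das ist ein test", "noch ein satz"]
synthetic = make_synthetic_parallel(in_domain_src)
# The (source, synthetic target) pairs can then be concatenated with
# general-domain bitext for continued training.
```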

SLIDE 5

Forward Translation

Using “synthetic” in-domain data for machine translation

SLIDE 6

Setup

  • Data
  • Same general domain data, same in-domain validation and test sets as continued training experiments
  • Use only source side of in-domain training data
  • Forward Translation Models
  • Baseline models trained on general domain data
  • Neural machine translation (NMT) or statistical machine translation (SMT)
  • Continued Training (CT)
  • “Synthetic” in-domain only
  • General domain + synthetic in-domain data
SLIDE 7

TED Results

[Bar chart: BLEU on TED (axis 5-40) for Arabic, German, Farsi, Korean, Russian, Chinese. Systems: general domain NMT model; CT with NMT in-domain; CT with SMT in-domain; CT with general domain + NMT in-domain; CT with general domain + SMT in-domain. Per-language gains over the baseline range from roughly -0.3 to +1.7 BLEU.]

SLIDE 8

Patent Results

[Bar chart: BLEU on patent data (axis 5-40) for German, Korean, Russian, Chinese; same systems as the TED results. Gains over the baseline range from roughly -1.6 to +4.6 BLEU.]

SLIDE 9

Remaining Questions

  • How much synthetic in-domain vs. general domain data to use for continued training?
  • Amount of available general domain data varies widely based on language, domain
  • Should we treat general domain and synthetic in-domain data differently during continued training?
  • Exploiting Source-side Monolingual Data in Neural Machine Translation (Zhang and Zong, EMNLP 2016)
  • Synthetic target side of in-domain data is low-quality – decoder should not be trained to produce it

SLIDE 10

Continued Training Updates

  • Alternate training on general domain, in-domain data
  • Future work: experiment with ratio of general domain to in-domain data
  • Optionally treat synthetic data differently
  • Freeze decoder parameters
  • Future work: experiment with choice of frozen parameters
  • Multi-task training
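One way to realize the "freeze decoder parameters" idea is to simply skip decoder parameters during gradient updates, so only the encoder adapts to the (low-quality) synthetic target side. A minimal framework-free sketch with toy parameters and a plain SGD rule (not the actual Sockeye implementation):

```python
# Toy illustration of freezing decoder parameters during continued training:
# parameters whose names match the frozen prefix are excluded from updates.

params = {"encoder.w": 1.0, "encoder.b": 0.5, "decoder.w": 2.0, "decoder.b": 0.1}
grads  = {"encoder.w": 0.2, "encoder.b": 0.1, "decoder.w": 0.3, "decoder.b": 0.4}

def sgd_step(params, grads, lr=0.1, frozen_prefix="decoder."):
    for name in params:
        if name.startswith(frozen_prefix):
            continue  # decoder stays fixed; only the encoder adapts
        params[name] -= lr * grads[name]
    return params

sgd_step(params, grads)
```

In a real toolkit the same effect is typically achieved by marking parameter groups as fixed before the optimizer is built.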
SLIDE 11

Multi-Task Training

[Figure: multi-task encoder-decoder architecture. The baseline model performs the general-domain translation task; a new decoder forms an auxiliary model for the in-domain translation task. Training uses sentence-aligned bilingual data for translation and reordered source-side monolingual data for a reordering task.]

(Zhang and Zong, EMNLP 2016)

SLIDE 12

Chinese TED Results

[Bar chart: Chinese TED BLEU (axis 13-18) comparing NMT vs. SMT synthetic in-domain data used concatenated, alternating, alternating + freezing, and alternating + multi-task; reported scores include 16.6, 16.7, 16.8, and 17.1 BLEU.]

SLIDE 13

Chinese WIPO Results

[Bar chart: Chinese WIPO BLEU (axis 5-20) for the same configurations; reported scores include 11.0, 14.7, 16.6, 17.2, 18.1, and 18.6 BLEU.]

SLIDE 14

Summary

  • Continued training with synthetic in-domain data produces consistent in-domain BLEU improvements
  • NMT more consistent than SMT
  • SMT better than NMT for some languages, domains
  • Many possible future research directions
  • Modified Sockeye recipes enable alternating domains, multi-task training

SLIDE 15

Mining Web-Crawled Data

Moore-Lewis Selection on ParaCrawl Data

SLIDE 16

MT training

[Diagram: General Domain Data → General Domain NMT Model]

SLIDE 17

MT training

[Diagram: ParaCrawl Data added alongside the General Domain NMT Model]

SLIDE 18

MT training

[Diagram: General Domain NMT Model + ParaCrawl data selected using in-domain source data → Continued-Trained NMT Model]

SLIDE 19

ParaCrawl Data

Pipeline for crawling parallel data from the web and cleaning it

  • Documents aligned based on language ID & URLs, then sentence aligned, given a score of cleanliness

[Pipeline: Crawling → Document Alignment → Sentence Alignment → Cleaning → Domain Selection. Cleaning returns cleanliness scores; a threshold sets the clean/size tradeoff.]

SLIDE 20

In-Domain Data Selection

Classic approach (Moore and Lewis 2010): train source-side language models from IN and from a random sample of CRAWL, then score each source-side CRAWL sentence by:

  • ProbabilityIN(sent) / ProbabilityCRAWL(sent)
  • Strong method for SMT; we're investigating it for NMT
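The scoring rule above can be sketched with toy unigram language models (real systems use stronger n-gram or neural LMs). In log space the probability ratio becomes log P_IN(sent) - log P_CRAWL(sent); higher means more in-domain-like.

```python
import math
from collections import Counter

def unigram_lm(corpus):
    """Toy add-one-smoothed unigram LM over a list of sentences."""
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 for unseen tokens
    def logprob(sentence):
        return sum(math.log((counts[tok] + 1) / (total + vocab))
                   for tok in sentence.split())
    return logprob

def moore_lewis_score(sentence, lm_in, lm_crawl):
    # log P_IN(s) - log P_CRAWL(s); higher = more in-domain-like
    return lm_in(sentence) - lm_crawl(sentence)

# Toy in-domain (TED-like) and crawl corpora, purely illustrative.
lm_in = unigram_lm(["we think about ideas", "ideas change the world"])
lm_crawl = unigram_lm(["buy cheap flights now", "weather forecast for sunday"])

crawl = ["ideas change everything", "cheap weather flights"]
ranked = sorted(crawl, key=lambda s: moore_lewis_score(s, lm_in, lm_crawl),
                reverse=True)
```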

SLIDE 21

Most TED-Like Sentence

it changes the way we think ; it changes the way we walk in the world ; it changes our responses ; it changes our attitudes towards our current situations ; it changes the way we dress ; it changes the way we do things ; it changes the way we interact with people .

SLIDE 22

Least TED-Like Sentence

sunday , july 10 , 2016 in riverton the weather forecast would be : during the night the air temperature drops to + 21 ... + 24ºc , dew point : -1,6ºc ; precipitation is not expected , light breeze wind blowing from the south at a speed of 7-11 km / h , clear sky ; in the morning the air temperature drops to + 20 ... + 22ºc , dew point : + 1,24ºc ; precipitation is not expected , gentle breeze wind blowing from the west at a speed of 11-14 km / h , in the sky , sometimes there are small clouds ; in the afternoon the air temperature will be + 20 ... + 22ºc , dew point : + 4,06ºc ; ratio of temperature , wind speed and humidity : a bit dry for some ; precipitation is not expected , moderate breeze wind blowing from the north at a speed of 14-29 km / h , in the sky , sometimes there are small clouds ; in the evening the air temperature drops to + 15 ... + 19ºc , dew point : -0,12ºc ; precipitation is not expected , moderate breeze wind blowing from the north at a speed of 14-32 km / h , clear sky ; day length is 14:52

SLIDE 23

Perplexity-Based Selection

Rank sentences, then select amount of data based on perplexity on in-domain data

[Line plot: perplexity (roughly 50-400) vs. number of selected sentences (roughly 500K-4.5M); perplexity rises as more lower-ranked data is included.]
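Once sentences are ranked, the cutoff step is simple: keep the top N by score. A minimal sketch with toy sentences and hypothetical scores (the talk's actual cutoff is guided by the perplexity curve):

```python
# Threshold-based selection over Moore-Lewis-ranked data:
# keep the n highest-scoring (most in-domain-like) sentences.

def select_top(scored_sentences, n):
    ranked = sorted(scored_sentences, key=lambda pair: pair[1], reverse=True)
    return [sent for sent, score in ranked[:n]]

# Hypothetical (sentence, Moore-Lewis score) pairs.
scored = [("it changes the way we think", 3.2),
          ("weather forecast for sunday", -5.1),
          ("ideas worth spreading", 2.7)]
selected = select_top(scored, 2)
```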

SLIDE 24

German TED

[Bar chart: German TED BLEU (axis 30-45) for GenDomain (28M), Continued on Random (28+1M), Continued on ParaCrawl (28+1M), and Continued on TED Bitext; ParaCrawl selection gains +1.19 BLEU over the baseline.]

SLIDE 25

TED Results

[Bar chart: TED BLEU (axis 10-50) for German, Korean, Russian. Systems: General Domain, Continued on Random, Continued on ParaCrawl, Continued on TED Bitext. Gains from ParaCrawl selection include +1.19 and +0.34 BLEU.]

SLIDE 26

Settings where Moore-Lewis performs poorly

Too much noise:

  • Korean web-crawl data very noisy
  • In preliminary experiments, the threshold for data cleaning mattered significantly

Domain specificity:

  • For German patent data, there may not be enough relevant web-crawled data

SLIDE 27

Analysis

Long sentences:

  • Quirk of Moore-Lewis method
  • Smaller selections of Moore-Lewis ranked data → high average sentence length
  • NMT has a length limit, but having sentences near that border may cause problems
  • However, at thresholds we tried for TED (600K-1M), overall positive results

SLIDE 28

Conclusions

  • Domain-based data selection from web-crawled data can help domain adaptation for NMT when you only have source data
  • Cleanliness of web-crawled data matters
  • Whether you can expect to find good data on the web depends on the domain

SLIDE 29

Curriculum Learning

(for Continued-Training)

SLIDE 30

Curriculum Learning

  • How to take advantage of Moore-Lewis scores?
  • Filter ParaCrawl data by a threshold
  • Arrange data processing order by Moore-Lewis score ranking
  • Curriculum Learning (CL):

[Curriculum Learning, Bengio et al. 2009]

Process samples in a certain order (easy ➠ difficult) to train better machine learning models faster.

SLIDE 31

CL for Continued-Training

  • Curriculum Learning (CL):

Process samples in a certain order (easy ➠ difficult)

  • CL for Continued-Training (CT):

Reorganize TED (bitext) + ParaCrawl data.

Ranking criterion: Moore-Lewis score (easy: TED-like ➠ difficult: TED-unlike)

  • Compare to Pamela’s work:

ParaCrawl threshold + random sampling (Pamela)

  • vs. ParaCrawl threshold + TED bitext + ordering (Xuan)
SLIDE 32

Methods

Jenks Natural Breaks Classification Algorithm

(Maximize variance between classes and reduce variance within classes)

  • We put TED data in shard 0
  • ParaCrawl data in shards 1-4

[Figure: density of Moore-Lewis scores over ParaCrawl, from high (TED-like, clean) to low (TED-unlike, noisy), split into shards 1-4.]

Shard sizes (#samples, samples%):
Shard 0 (TED): 15,163 (12.26%)
Shard 1 (ParaCrawl): 3,066 (2.48%)
Shard 2 (ParaCrawl): 13,519 (10.93%)
Shard 3 (ParaCrawl): 32,179 (26.03%)
Shard 4 (ParaCrawl): 59,708 (48.29%)
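The shard construction can be approximated by clustering the 1-D Moore-Lewis scores. This sketch uses a small k-means-style loop as a stand-in for the Jenks natural breaks algorithm (same goal, low within-class variance, but not the exact method from the talk); scores and shard contents are toy values.

```python
# Approximate natural-breaks sharding of sentences by Moore-Lewis score:
# 1-D k-means as a simple stand-in for Jenks natural breaks.

def natural_break_shards(scores, k=4, iters=50):
    # Initialize centers evenly across the score range.
    centers = [min(scores) + (max(scores) - min(scores)) * i / (k - 1)
               for i in range(k)]
    for _ in range(iters):
        # Assign each score to its nearest center.
        groups = [[] for _ in range(k)]
        for s in scores:
            idx = min(range(k), key=lambda i: abs(s - centers[i]))
            groups[idx].append(s)
        # Move each center to its group's mean.
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    # Higher score = more TED-like = earlier shard (1..k; shard 0 is TED).
    order = sorted(range(k), key=lambda i: -centers[i])
    rank = {c: r + 1 for r, c in enumerate(order)}
    return [rank[min(range(k), key=lambda i: abs(s - centers[i]))]
            for s in scores]

scores = [2.1, 1.9, 0.3, 0.2, -1.0, -1.2, -3.5, -3.8]
shards = natural_break_shards(scores)
```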

SLIDE 33

Training Strategy

After training the NMT model on general domain data, continued-training on TED + ParaCrawl proceeds as follows:

[Timeline: curriculum phases separated by curriculum update points, repeated until converged; shuffling among and within shards every 1000 minibatches.]
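The phase structure can be sketched as a schedule in which each curriculum phase exposes one more shard (easy to difficult) and reshuffles the visible pool. This is a simplified sketch of the strategy with hypothetical shard contents, not the modified Sockeye recipe itself.

```python
import random

def curriculum_phases(shards, seed=0):
    """Yield the training pool for each curriculum phase.

    shards: list of lists ordered easy -> difficult
    (shard 0 = TED bitext, shards 1-4 = ParaCrawl by Moore-Lewis score).
    Each phase exposes one more shard, shuffled among and within shards.
    """
    rng = random.Random(seed)
    for phase in range(1, len(shards) + 1):
        pool = [sent for shard in shards[:phase] for sent in shard]
        rng.shuffle(pool)  # shuffle among and within the visible shards
        yield pool

# Hypothetical shard contents, purely illustrative.
shards = [["ted1", "ted2"], ["pc_a"], ["pc_b"], ["pc_c"], ["pc_d"]]
phases = list(curriculum_phases(shards))
```

In the real setup each phase would run for some number of minibatches (with reshuffling every 1000 minibatches) before the next curriculum update point adds the following shard.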

SLIDE 34

Results

[Bar chart: German BLEU (axis 35-39). CT on 10% TED: 36.2; CT on 10% TED + 15% ParaCrawl: 37.5; CT on 10% TED + 15% ParaCrawl w/ CL: 38.7.]

* CT = continued training; CL = curriculum learning

10% TED bitext (15,163 samples): low-resource in-domain data
15% ParaCrawl (108,472 samples)

SLIDE 35

Conclusions: Using Monolingual Source-Side In-Domain Data

  • 1. Proven adaptation methods for SMT can be used for NMT, with some modification
  • 2. Synthetic data generation: include general + synthetic data in continued training
  • 3. Webcrawl bitext + in-domain filtering can be effective (but should filter noise)
  • 4. (3) can be improved with curriculum learning