  1. Using Monolingual Source-Side In-Domain Data. Jen Drexler, Pamela Shapiro, Xuan Zhang. SCALE Readout, August 9, 2018

  2. Continued Training: [Diagram: General Domain NMT Model + in-domain parallel data → Continued-Trained NMT Model]

  3. Monolingual Source-Side In-Domain Data: [Diagram: General Domain NMT Model + monolingual in-domain data → ??? → Domain-Adapted NMT Model]

  4. Monolingual → Parallel
     • Forward Translation (see the sketch after this list)
       • Use a general-domain MT model to translate monolingual in-domain data
       • Continued training with the resulting "synthetic" parallel in-domain data
     • Data Selection
       • Start from large corpora of parallel data covering a wide range of domains (web crawl)
       • Use the monolingual in-domain data to find the parallel sentences closest to the desired domain
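A minimal Python sketch of the forward-translation step. The translate callable is a stand-in for the general-domain baseline (NMT or SMT); the function name and file handling are illustrative, not from the slides:

    def build_synthetic_bitext(mono_src_path, translate, out_src, out_tgt):
        """Forward translation: pass each monolingual in-domain source
        sentence through a general-domain MT system to create a "synthetic"
        parallel corpus for continued training."""
        with open(mono_src_path, encoding="utf-8") as src, \
             open(out_src, "w", encoding="utf-8") as fs, \
             open(out_tgt, "w", encoding="utf-8") as ft:
            for line in src:
                sentence = line.strip()
                fs.write(sentence + "\n")
                ft.write(translate(sentence) + "\n")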

  5. Forward Translation: Using "synthetic" in-domain data for machine translation

  6. Setup
     • Data
       • Same general-domain data and the same in-domain validation and test sets as the continued training experiments
       • Use only the source side of the in-domain training data
     • Forward Translation Models
       • Baseline models trained on general-domain data
       • Neural machine translation (NMT) or statistical machine translation (SMT)
     • Continued Training (CT)
       • "Synthetic" in-domain data only
       • General-domain + synthetic in-domain data

  7. TED Results: [Bar chart: in-domain BLEU for Arabic, German, Farsi, Korean, Russian, and Chinese under five conditions: general-domain NMT model; CT with NMT in-domain; CT with SMT in-domain; CT with general domain + SMT in-domain; CT with general domain + NMT in-domain. Annotated gains over the baseline range from -0.3 to +1.7 BLEU.]

  8. Patent Results: [Bar chart: in-domain BLEU for German, Korean, Russian, and Chinese under the same five conditions. Annotated gains range from -1.6 to +4.6 BLEU.]

  9. Remaining Questions
     • How much synthetic in-domain vs. general-domain data should be used for continued training?
       • The amount of available general-domain data varies widely by language and domain
     • Should we treat general-domain and synthetic in-domain data differently during continued training?
       • Exploiting Source-side Monolingual Data in Neural Machine Translation (Zhang and Zong, EMNLP 2016)
       • The synthetic target side of the in-domain data is low-quality, so the decoder should not be trained to produce it

  10. Continued Training Updates
     • Alternate training on general-domain and in-domain data
       • Future work: experiment with the ratio of general-domain to in-domain data
     • Optionally treat synthetic data differently (see the sketch after this list)
       • Freeze decoder parameters
       • Future work: experiment with the choice of frozen parameters
     • Multi-task training
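One way to realize the decoder freeze, sketched in PyTorch; the actual experiments used modified Sockeye recipes, and the "decoder." parameter prefix here is an assumption about the model layout:

    import torch.nn as nn

    def freeze_decoder(model: nn.Module) -> None:
        """Stop gradient updates to decoder parameters so that continued
        training on synthetic bitext only adapts the encoder; the synthetic
        target side is machine output, and we do not want the decoder to
        learn to produce it."""
        for name, param in model.named_parameters():
            if name.startswith("decoder."):  # assumed submodule name
                param.requires_grad = False

    # Give the optimizer only the parameters that remain trainable:
    # optimizer = torch.optim.Adam(
    #     (p for p in model.parameters() if p.requires_grad), lr=1e-4)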

  11. Multi-Task Training (Zhang and Zong, EMNLP 2016): [Architecture diagram: a shared encoder over the source sentence (hidden states h_1 ... h_U) feeds two decoders. In the paper, one decoder is trained on bilingual data for translation and a new decoder is trained on monolingual data to produce a reordered source-side sentence. In our setting, the two tasks are the baseline model (general-domain translation task) and an auxiliary model (in-domain translation task), sharing the encoder.]
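A minimal PyTorch sketch of the shared-encoder, two-decoder idea (no attention, illustrative dimensions; both the paper's reordering variant and the SCALE recipes differ in detail):

    import torch.nn as nn

    class MultiTaskNMT(nn.Module):
        """One shared encoder, two task-specific decoders: gradients from
        both tasks update the encoder, so monolingual source-side data can
        improve the translation model."""

        def __init__(self, src_vocab, tgt_vocab, dim=512):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, dim)
            self.tgt_emb = nn.Embedding(tgt_vocab, dim)
            self.encoder = nn.GRU(dim, dim, batch_first=True)
            self.decoders = nn.ModuleDict({
                "translate": nn.GRU(dim, dim, batch_first=True),
                "reorder": nn.GRU(dim, dim, batch_first=True),
            })
            self.outputs = nn.ModuleDict({
                "translate": nn.Linear(dim, tgt_vocab),
                "reorder": nn.Linear(dim, src_vocab),
            })

        def forward(self, src, dec_in, task):
            _, h = self.encoder(self.src_emb(src))
            emb = self.tgt_emb if task == "translate" else self.src_emb
            out, _ = self.decoders[task](emb(dec_in), h)
            return self.outputs[task](out)

    # Training alternates minibatches between the two tasks, e.g.:
    #   logits = model(src, tgt[:, :-1], "translate")   # bilingual batch
    #   logits = model(src, reord[:, :-1], "reorder")   # monolingual batch
    # with a cross-entropy loss against the shifted targets in both cases.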

  12. Chinese TED Results: [Bar chart comparing NMT in-domain vs. SMT in-domain synthetic data across continued-training strategies: synthetic in-domain concatenated, alternating, alternating + freezing, and alternating + multi-task. Scores cluster between 16.6 and 17.1 BLEU.]

  13. Chinese WIPO Results: [Bar chart with the same conditions as the Chinese TED comparison; scores range from 11.0 to 18.6 BLEU.]

  14. Summary
     • Continued training with synthetic in-domain data produces consistent in-domain BLEU improvements
       • NMT forward translation is more consistent than SMT
       • SMT is better than NMT for some languages and domains
     • Many possible future research directions
     • Modified Sockeye recipes enable alternating domains and multi-task training

  15. Mining Web-Crawled Data: Moore-Lewis Selection on ParaCrawl Data

  16. [Diagram: General Domain Data → MT training → General Domain NMT Model]

  17. [Diagram: as slide 16, with ParaCrawl Data shown alongside the general-domain MT training pipeline]

  18. [Diagram: in-domain source data selects sentences from the ParaCrawl Data; continued training turns the General Domain NMT Model into a Continued-Trained NMT Model]

  19. ParaCrawl Data
     • Pipeline for crawling parallel data from the web and cleaning it: Crawling → Document Alignment → Sentence Alignment → Cleaning → Domain Selection
     • Documents are aligned based on language ID and URLs, then sentence-aligned and given a cleanliness score
     • Returns cleanliness scores; a threshold sets the tradeoff between cleanliness and size

  20. In-Domain Data Selection
     • Classic approach (Moore and Lewis 2010); a scoring sketch follows this list
     • Train source-side language models on IN (the in-domain data) and on a random sample of CRAWL
     • Score each source-side CRAWL sentence by: P_IN(sentence) / P_CRAWL(sentence)
     • A strong method for SMT; we're investigating it for NMT
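A sketch of the scoring step using KenLM language models; the model paths and the length normalization are assumptions (Moore and Lewis score by per-word cross-entropy difference, which is the log of the probability ratio above):

    import kenlm  # pip install kenlm (scores are log10 probabilities)

    # Hypothetical paths: LMs trained on in-domain text and on a random
    # sample of the web crawl, respectively.
    lm_in = kenlm.Model("in_domain.arpa")
    lm_crawl = kenlm.Model("crawl_sample.arpa")

    def moore_lewis_score(sentence: str) -> float:
        """log P_IN(s) - log P_CRAWL(s), normalized by length.
        Higher means more in-domain-like."""
        n = len(sentence.split()) + 1  # +1 for the end-of-sentence token
        return (lm_in.score(sentence) - lm_crawl.score(sentence)) / n

    # Rank the crawl from most to least in-domain-like.
    with open("paracrawl.src", encoding="utf-8") as f:
        ranked = sorted((line.strip() for line in f),
                        key=moore_lewis_score, reverse=True)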

  21. Most TED-Like Sentence: it changes the way we think ; it changes the way we walk in the world ; it changes our responses ; it changes our attitudes towards our current situations ; it changes the way we dress ; it changes the way we do things ; it changes the way we interact with people .

  22. Least TED-Like Sentence: sunday , july 10 , 2016 in riverton the weather forecast would be : during the night the air temperature drops to + 21 ... + 24ºc , dew point : -1,6ºc ; precipitation is not expected , light breeze wind blowing from the south at a speed of 7-11 km / h , clear sky ; in the morning the air temperature drops to + 20 ... + 22ºc , dew point : + 1,24ºc ; precipitation is not expected , gentle breeze wind blowing from the west at a speed of 11-14 km / h , in the sky , sometimes there are small clouds ; in the afternoon the air temperature will be + 20 ... + 22ºc , dew point : + 4,06ºc ; ratio of temperature , wind speed and humidity : a bit dry for some ; precipitation is not expected , moderate breeze wind blowing from the north at a speed of 14-29 km / h , in the sky , sometimes there are small clouds ; in the evening the air temperature drops to + 15 ... + 19ºc , dew point : -0,12ºc ; precipitation is not expected , moderate breeze wind blowing from the north at a speed of 14-32 km / h , clear sky ; day length is 14:52

  23. Perplexity-Based Selection
     • Rank sentences, then choose how much data to select based on perplexity against the in-domain data (a cutoff sketch follows this list)
     • [Plot: perplexity (0-400) vs. number of selected sentences (0-4.5M)]
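One plausible reading of the cutoff procedure, sketched under the assumption that sentences are walked in ranked order and selection stops once they stop looking in-domain; the threshold value is purely illustrative:

    import kenlm

    lm_in = kenlm.Model("in_domain.arpa")  # same in-domain LM as above

    def perplexity(sentence: str) -> float:
        """Per-word perplexity under the in-domain LM (kenlm is log10)."""
        n = len(sentence.split()) + 1
        return 10 ** (-lm_in.score(sentence) / n)

    def choose_cutoff(ranked, max_ppl=200.0):
        """Keep ranked[:i] for the first i whose perplexity exceeds the
        threshold; 200 is an arbitrary example value."""
        for i, sent in enumerate(ranked):
            if perplexity(sent) > max_ppl:
                return i
        return len(ranked)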

  24. German TED: [Bar chart: BLEU for GenDomain (28M), Continued on Random (28M + 1M), Continued on ParaCrawl (28M + 1M), and Continued on TED Bitext; the ParaCrawl selection is annotated +1.19 BLEU.]

  25. TED Results: [Bar chart: BLEU for German, Korean, and Russian under four conditions: General Domain, Continued on Random, Continued on ParaCrawl, Continued on TED Bitext. Annotated gains for the ParaCrawl selection include +1.19 (German) and +0.34.]

  26. Settings Where Moore-Lewis Performs Poorly
     • Too much noise:
       • Korean web-crawl data is very noisy
       • In preliminary experiments, the threshold for data cleaning mattered significantly
     • Domain specificity:
       • For German patent data, there may simply not be enough relevant web-crawled data

  27. Analysis
     • Long sentences:
       • A quirk of the Moore-Lewis method
       • Smaller selections of Moore-Lewis-ranked data have a high average sentence length
       • NMT has a length limit; sentences near that limit may cause problems
       • However, at the thresholds we tried for TED (600K-1M sentences), overall results were positive

  28. Conclusions
     • Domain-based data selection from web-crawled data can help domain adaptation for NMT when you have only source-side data
     • Cleanliness of the web-crawled data matters
     • Whether you can expect to find good data on the web depends on the domain

  29. Curriculum Learning (for Continued Training)

  30. Curriculum Learning
     • How to take advantage of Moore-Lewis scores?
       • Filter ParaCrawl data by a threshold
       • Arrange the data processing order by Moore-Lewis score ranking
     • Curriculum Learning (CL) [Bengio et al. 2009]: process samples in a certain order (easy ➠ difficult) to train better machine learning models faster

  31. CL for Continued Training
     • Curriculum Learning (CL): process samples in a certain order (easy ➠ difficult)
     • CL for Continued Training (CT): reorganize the TED bitext + ParaCrawl data
       • Ranking criterion: Moore-Lewis score (easy: TED-like ➠ difficult: TED-unlike)
     • Comparison to Pamela's work: ParaCrawl threshold + random sampling (Pamela) vs. ParaCrawl threshold + TED bitext + ordering (Xuan)

  32. Methods
     • TED data goes in shard 0; ParaCrawl data goes in shards 1-4
     • Shards run from high Moore-Lewis score (TED-like, clean) to low (TED-unlike, noisy)
     • Shard boundaries come from the Jenks Natural Breaks classification algorithm (maximize variance between classes and reduce variance within classes) applied to the ParaCrawl score distribution; see the sketch after this list
     • ParaCrawl shard sizes: 15163, 3066, 13519, 32179, 59708 samples (12.26%, 2.48%, 10.93%, 26.03%, 48.29%)
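A sketch of the sharding step; it assumes the third-party jenkspy package for the natural-breaks computation, and the shard-numbering convention (1 = most TED-like) is taken from the slide:

    import bisect
    import jenkspy  # pip install jenkspy

    def assign_shards(scores, n_shards=4):
        """Bin ParaCrawl sentences into shards 1..n_shards by Moore-Lewis
        score using Jenks natural breaks; shard 1 holds the highest
        (most TED-like) scores. The TED bitext itself is shard 0."""
        breaks = jenkspy.jenks_breaks(scores, n_classes=n_shards)
        shard_of = []
        for s in scores:
            # breaks is ascending [min, b1, ..., max]; find the class of s
            k = bisect.bisect_left(breaks, s, 1, n_shards)
            shard_of.append(n_shards - k + 1)  # high score -> low shard id
        return shard_of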

  33. Training Strategy
     • After training the NMT model on general-domain data, continue training on TED + ParaCrawl as follows (see the sketch after this list):
       • Training proceeds along a timeline of curriculum phases, each 1000 minibatches long
       • At each curriculum update point, the next shard joins the training pool
       • Data is shuffled among and within the shards in the pool
       • Continue until converged
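A minimal sketch of that schedule; the batching helper and phase mechanics are assumptions about the modified Sockeye recipe, not its actual code:

    import random

    def simple_batcher(samples, batch_size=64):
        """Cycle over samples in minibatches forever (assumes the pool is
        larger than one batch)."""
        i = 0
        while True:
            if i + batch_size > len(samples):
                i = 0
            yield samples[i:i + batch_size]
            i += batch_size

    def curriculum_batches(shards, phase_len=1000):
        """shards[0] is the TED bitext, shards[1:] the ParaCrawl shards in
        easy -> difficult order. Every phase_len minibatches, one more
        shard joins the pool, which is reshuffled among and within shards."""
        pool = []
        for shard in shards:              # curriculum update point
            pool.extend(shard)
            random.shuffle(pool)          # shuffle among and within shards
            batches = simple_batcher(pool)
            for _ in range(phase_len):    # one curriculum phase
                yield next(batches)
        while True:                       # full pool until convergence
            yield next(batches)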
