

  1. Low Resource Machine Translation Marc’Aurelio Ranzato Facebook AI Research - NYC ranzato@fb.com Stanford - CS224N, 10 March 2020

  2. Machine Translation: English → French
     Ingredients to train NMT:
     • seq2seq with attention
     • SGD
     Ingredient to test the NMT system:
     • beam search
     Example: "life is beautiful" → "la vie est belle"
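The test-time ingredient above, beam search, can be sketched in a few lines. This is a minimal pure-Python illustration, not the lecture's actual decoder: the toy English→French "model" (`TABLE`, `toy_model`) and the beam scoring are invented for the example.

```python
import math

def beam_search(step_logprobs, beam_size=3, max_len=4, eos="</s>"):
    """Toy beam search: step_logprobs(prefix) -> {token: log-probability}.
    Keeps the beam_size highest-scoring partial translations at each step."""
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_logprobs(prefix).items():
                hyp = (prefix + (tok,), score + lp)
                (finished if tok == eos else candidates).append(hyp)
        if not candidates:
            break
        beams = sorted(candidates, key=lambda h: h[1], reverse=True)[:beam_size]
    finished.extend(beams)
    return max(finished, key=lambda h: h[1])

# A hand-crafted "model" for "life is beautiful" -> "la vie est belle".
TABLE = {
    (): {"la": math.log(0.9), "le": math.log(0.1)},
    ("la",): {"vie": math.log(0.8), "vue": math.log(0.2)},
    ("la", "vie"): {"est": math.log(0.95), "fut": math.log(0.05)},
    ("la", "vie", "est"): {"belle": math.log(0.9), "</s>": math.log(0.1)},
}
def toy_model(prefix):
    # Unknown prefixes immediately emit end-of-sentence.
    return TABLE.get(prefix, {"</s>": 0.0})

best, score = beam_search(toy_model, beam_size=2)
print(" ".join(t for t in best if t != "</s>"))  # la vie est belle
```

With `beam_size=1` this degenerates to greedy decoding; a larger beam trades compute for a better approximation of the highest-probability translation.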

  3. Some Stats
     • 6000+ languages in the world.
     • 80% of the world population does not speak English.
     • Less than 5% of the people in the world are native English speakers.

  4. The Long Tail of Languages
     • The top 10 languages are spoken by less than 50% of the people.
     • The remaining ~6500 are spoken by the rest!
     • More than 2000 languages are spoken by less than 1000 people.
     source: https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/

  5. (X to English) source: https://ai.googleblog.com/2019/10/exploring-massively-multilingual.html

  6. Machine Translation in Practice: English → Nepali (25M people). [figure: training data]

  7. Machine Translation in Practice: English → Nepali (25M people). Parallel training data (a collection of sentences with corresponding translations) is small!

  8. Machine Translation in Practice: English → Nepali. Let's represent data with rectangles; the color indicates the language.

  9. Machine Translation in Practice: English → Nepali
     [figure: sentences originating in English with corresponding Nepali translations, and sentences originating in Nepali with corresponding English translations; domains: Bible, Parliamentary]
     Let's represent (human) translations with empty rectangles.
     • Some parallel data originates in the source language, some in the target language.
     • Source and target domains may not match.

  10. Machine Translation in Practice: English → Nepali
      [figure: TEST set (News domain); monolingual data; parallel domains: Bible, Parliamentary]
      • Test data might be in another domain.
      • There might exist source-side in-domain monolingual data.

  11. Machine Translation in Practice: English → Nepali, with Hindi as a related language
      [figure: TEST set (News domain); monolingual data; domains: Bible, Parliamentary, Books]
      • There might be parallel and monolingual data with a high resource language close to the low resource language of interest. This data may belong to a different domain.

  12. English, Nepali, Hindi, Sinhala, Bengali, Spanish, Tamil, Gujarati, …
      [figure: TEST sets and domains] … the Mondrian-like learning setting!

  13. Low Resource Machine Translation
      Loose definition: a language pair can be considered low resource when the number of parallel sentences is on the order of 10,000 or less.
      Note: modern NMT systems nowadays have several hundred million parameters!
      Challenges:
      • data: sourcing data to train on; evaluation datasets
      • modeling: unclear learning paradigm; domain adaptation; generalization
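A back-of-the-envelope calculation makes the mismatch above concrete. The model dimensions below are illustrative assumptions (roughly a Transformer-big-sized model), not figures from the lecture, and the count ignores biases, layer norms, and decoder cross-attention:

```python
# Rough parameter count for a large Transformer NMT model (illustrative numbers).
d_model, d_ff, layers, vocab = 1024, 4096, 12, 32000  # e.g. 6 encoder + 6 decoder layers

attn = 4 * d_model * d_model      # Q, K, V, and output projection matrices
ffn = 2 * d_model * d_ff          # two feed-forward matrices per layer
per_layer = attn + ffn
params = layers * per_layer + vocab * d_model  # plus (tied) embeddings

# A "low resource" corpus: ~10,000 sentence pairs, ~20 tokens per sentence.
tokens = 10_000 * 20

print(f"{params/1e6:.0f}M parameters vs {tokens/1e3:.0f}K training tokens")
```

Even this crude count gives hundreds of millions of parameters against only a few hundred thousand training tokens, which is why the learning paradigm for low resource MT is unclear.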

  14. Why Is Low Resource MT Interesting?
      • It is about learning with less labeled data.
      • It is about modeling structured outputs and compositional learning.
      • It is a real problem to solve.

  15. Outline ("life of a researcher")
      DATA: "The FLoRes evaluation datasets for low resource MT…" Guzmán, Chen et al., EMNLP 2019
      MODEL: "Phrase-Based & Neural Unsupervised MT" Lample et al., EMNLP 2018; "FBAI WAT'19 My-En translation task submission" Chen et al., WAT@EMNLP 2019; "Investigating Multilingual NMT Representations at Scale" Kudugunta et al., EMNLP 2019; "Multilingual Denoising Pre-training for NMT" Liu et al., arXiv 2001.08210, 2020
      ANALYSIS: "Analyzing Uncertainty in NMT" Ott et al., ICML 2018; "On the evaluation of MT systems trained with back-translation" Edunov et al., ACL 2020; "The source-target domain mismatch problem in MT" Shen et al., arXiv 1909.13151, 2019

  16. A Big "Small-Data" Challenge: http://opus.nlpl.eu/

  17. Case Study: En-Ne
      [figure: TEST set (Wikipedia domain); monolingual data; parallel domains: Bible, JW300, GNOME, Ubuntu, etc.; Common Crawl monolingual]
      • In-domain data: no parallel, little monolingual.
      • Out-of-domain: little parallel, quite a bit of monolingual.
      • No translations originating from Nepali.

  18. A Case Study: En-Ne
      • Parallel training data: versions of the Bible and the Ubuntu handbook (<1M sentences).
      • Nepali monolingual data: Wikipedia (90K sentences), Common Crawl (a few million).
      • English monolingual data: almost unlimited.
      • Test data: ???

  19. FLoRes Evaluation Benchmark
      • Validation, test, and hidden test sets, each with 3000 sentences, in English-Nepali and English-Sinhala.
      • Sentences taken from Wikipedia documents.
      Data collection process:
      • Very expensive and slow.
      • Very hard to produce high-quality translations: automatic checks (language model filtering, transliteration filtering, length filtering, language id filtering, etc.) plus human assessment.
      Guzmán, Chen et al. "The FLoRes evaluation datasets for low resource MT…" EMNLP 2019
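The automatic checks mentioned above can be sketched as a simple sentence-pair filter. This is a hedged illustration, not the actual FLoRes pipeline: the thresholds and the length-ratio heuristic are assumptions for the example, and real pipelines add language-id, language-model, and transliteration filters on top.

```python
def keep_pair(src, tgt, min_len=1, max_len=250, max_ratio=3.0):
    """Crude parallel-sentence filter: length bounds plus a length-ratio check.
    Illustrative thresholds only; production filters are far more elaborate."""
    s, t = src.split(), tgt.split()
    # Drop empty, too-short, or too-long sides.
    if not (min_len <= len(s) <= max_len and min_len <= len(t) <= max_len):
        return False
    # Drop pairs whose lengths differ wildly (likely a misalignment).
    ratio = max(len(s), len(t)) / min(len(s), len(t))
    return ratio <= max_ratio

pairs = [
    ("life is beautiful", "la vie est belle"),   # kept
    ("hello", ""),                               # dropped: empty target
    ("a b c d e f g h i j k l", "oui"),          # dropped: length ratio 12
]
kept = [p for p in pairs if keep_pair(*p)]
print(len(kept))  # 1
```

Cheap filters like these run first because they prune most of the noise before the expensive steps (LM scoring, human assessment) ever see the data.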

  20. Examples: Si-En and En-Si (original vs. translation)
      Wikipedia content originating in Si has different topics than Wikipedia content originating in En.
      Guzmán, Chen et al. "The FLoRes evaluation datasets for low resource MT…" EMNLP 2019

  21. Examples: Ne-En and En-Ne
      Guzmán, Chen et al. "The FLoRes evaluation datasets for low resource MT…" EMNLP 2019

  22. • Useful to evaluate truly low resource language pairs.
      • WMT 2019 and WMT 2020 shared filtering tasks.
      • Several publications.
      • Sustained effort, more to come…
      Data & baseline models: https://github.com/facebookresearch/flores

  23. What Did We Learn?
      • Data is often as important as, or more important than, the model design.
      • Collecting data is not trivial.
      • Look at the data!!

  24. Outline ("life of a researcher")
      DATA: "The FLoRes evaluation datasets for low resource MT…" Guzmán, Chen et al., EMNLP 2019
      MODEL: "Phrase-Based & Neural Unsupervised MT" Lample et al., EMNLP 2018; "FBAI WAT'19 My-En translation task submission" Chen et al., WAT@EMNLP 2019; "Massively Multilingual NMT" Aharoni et al., ACL 2019; "Multilingual Denoising Pre-training for NMT" Liu et al., arXiv 2001.08210, 2020
      ANALYSIS: "Analyzing Uncertainty in NMT" Ott et al., ICML 2018; "On the evaluation of MT systems trained with back-translation" Edunov et al., ACL 2020; "The source-target domain mismatch problem in MT" Shen et al., arXiv 1909.13151, 2019

  25. English, Nepali, Hindi, Sinhala, Bengali, Spanish, Tamil, Gujarati, …
      [figure: TEST sets and domains]
