SLIDE 1

CS11-737 Multilingual NLP

Data-based Strategies to Low-resource MT

Graham Neubig

Site http://demo.clab.cs.cmu.edu/11737fa20/

Many slides from: Xia, Mengzhou, et al. "Generalized data augmentation for low-resource translation." ACL 2019.

SLIDE 2

Data Challenges in Low-resource MT

  • MT of high-resource languages (HRLs) with large parallel corpora → relatively good translations
  • MT of low-resource languages (LRLs) with small parallel corpora → nonsense!

[Diagram: HRL-TRG and LRL-TRG parallel corpora feeding an MT system]

SLIDE 3

A Concrete Example

Source: Atam balaca boz radiosunda BBC Xəbərlərinə qulaq asırdı.
System translation: So I'm going to became a lot of people.
Reference: My father was listening to BBC News on his small, gray radio.

A system trained with 5,000 Azerbaijani-English sentence pairs does not convey the correct meaning at all.

SLIDE 4

Multilingual Training Approaches

  • Transfer: train an MT system on HRL-TRG parallel data, then adapt it to the LRL-TRG pair (Zoph et al., 2016; Nguyen and Chiang, 2017)
  • Joint training: concatenate the LRL and HRL parallel data and train a single MT system (Johnson et al., 2017; Neubig and Hu, 2018)
  • Problem: Suboptimal lexical/syntactic sharing.
  • Problem: Can't leverage monolingual data.
SLIDE 5

Data Augmentation

[Diagram: available resources (LRL-TRG and HRL-TRG parallel data, TRG monolingual data) combined into augmented LRL-TRG training data]

SLIDE 6

Data Augmentation 101: Back Translation

[Diagram: TRG monolingual data translated by a TRG→LRL system into pseudo-parallel LRL-TRG pairs, added to the LRL-TRG and HRL-TRG training data]

SLIDE 7

Back Translation Idea

Sennrich, Rico, Barry Haddow, and Alexandra Birch. "Improving neural machine translation models with monolingual data." ACL 2016.

  • 1. Train a TRG→LRL system on the available parallel data
  • 2. Back-translate TRG monolingual data into LRL
  • 3. Train the LRL→TRG system on the real plus back-translated pairs
  • Some degree of error in the source data is permissible!
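The three steps above can be sketched end-to-end. This is only a toy illustration: the "models" are word-for-word substitution tables standing in for real NMT systems, and every function name and data item is invented for the example.

```python
def train(pairs):
    """Stand-in 'training': learn a word-for-word substitution table."""
    table = {}
    for src, trg in pairs:
        for s, t in zip(src.split(), trg.split()):
            table.setdefault(s, t)
    return table

def translate(table, sent):
    """Stand-in 'translation': substitute each known word, copy the rest."""
    return " ".join(table.get(w, w) for w in sent.split())

# Tiny real LRL-TRG parallel corpus (invented toy language pair)
parallel = [("bonan matenon", "good morning"), ("dankon amiko", "thank friend")]

# 1. Train TRG->LRL on the reversed parallel data
trg2lrl = train([(t, l) for l, t in parallel])

# 2. Back-translate TRG monolingual data into (possibly noisy) LRL
trg_mono = ["good friend", "thank morning"]
synthetic = [(translate(trg2lrl, t), t) for t in trg_mono]

# 3. Train the final LRL->TRG system on real + synthetic pairs
lrl2trg = train(parallel + synthetic)
print(translate(lrl2trg, "dankon amiko"))  # -> "thank friend"
```

Note that only the source side of the synthetic pairs is machine-generated, which is why some noise there is tolerable: the target side the model learns to produce is always clean.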
SLIDE 8

How to Generate Translations

  • Beam search (Sennrich et al. 2016)
    • Select the highest-scoring output
    • Higher quality, but lower diversity and potential for data bias
  • Sampling (Edunov et al. 2018)
    • Randomly sample from the back-translation model
    • Lower overall quality, but higher diversity
  • Sampling has been shown to be more effective overall

Understanding Back-Translation at Scale. Sergey Edunov, Myle Ott, Michael Auli, David Grangier. EMNLP 2018.
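The quality/diversity trade-off can be illustrated on a toy per-step distribution. Greedy decoding (beam size 1) stands in here for full beam search, and the logits are invented numbers, not real model outputs:

```python
import math
import random

def decode(step_logits, mode="greedy", rng=None):
    """Pick one token per step: greedy (argmax, like beam size 1)
    vs. sampling from the model's softmax distribution."""
    rng = rng or random.Random(0)
    out = []
    for logits in step_logits:
        exps = [math.exp(l) for l in logits]
        probs = [e / sum(exps) for e in exps]
        if mode == "greedy":
            out.append(max(range(len(probs)), key=probs.__getitem__))
        else:  # "sample": lower average quality, but higher diversity
            out.append(rng.choices(range(len(probs)), weights=probs)[0])
    return out

logits = [[2.0, 1.5, 0.1], [0.2, 2.2, 2.1]]
print(decode(logits, "greedy"))  # always [0, 1]: same output every time
print(decode(logits, "sample"))  # varies run to run -> more diverse data
```

Greedy decoding always emits the same back-translation for a given source, which is the data-bias concern above; sampling spreads the synthetic data over the model's whole distribution.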

SLIDE 9

Iterative Back-translation

Vu Cong Duy Hoang, Philipp Koehn, Gholamreza Haffari, Trevor Cohn. "Iterative Back-Translation for Neural Machine Translation" WNGT 2018.

  • 1. Train an initial LRL→TRG system
  • 2. Forward-translate LRL monolingual data into TRG
  • 3. Train a TRG→LRL system (using the real and forward-translated data)
  • 4. Back-translate TRG monolingual data into LRL
  • 5. Train the final LRL→TRG system (optionally repeating the cycle)

SLIDE 10

Back Translation Issues

  • Back-translation fails in low-resource languages or domains
  • Remedies:
    • Use other high-resource languages
    • Combine with monolingual data (maybe with denoising objectives, covered in a following class)
    • Perform other varieties of rule-based augmentation
SLIDE 11

Using HRLs in Augmentation

Xia, Mengzhou, et al. "Generalized data augmentation for low-resource translation." ACL 2019.

SLIDE 12

English -> HRL Augmentation

  • Problem: TRG→LRL back-translation might be low quality
  • Idea: also back-translate into the HRL
    ○ more sentence pairs
    ○ vocabulary sharing on the source side
    ○ syntactic similarity on the source side
    ○ improves the target-side LM

Example back-translations of TRG: Thank you very much. → TUR (HRL): Çok teşekkür ederim. / AZE (LRL): Hə Hə Hə.

SLIDE 13

Available Resources + TRG→LRL and TRG→HRL Back-translation

[Diagram: LRL-TRG and HRL-TRG parallel data, plus TRG monolingual data back-translated by TRG→LRL and TRG→HRL systems]

SLIDE 14

Augmentation via Pivoting

  • Problem: HRL-TRG data might suffer from lack of lexical/syntactic overlap with the LRL
  • Idea: Translate existing HRL-TRG data
    ○ Translate the HRL side into the LRL, keeping the TRG side

Example: (TUR: Çok teşekkür ederim. / TRG: Thank you so much.) → (AZE: Çox sağ olun. / TRG: Thank you so much.)

SLIDE 15

Available Resources + TRG→LRL and TRG→HRL Back-translation + Pivoting

[Diagram: the back-translated data above, plus LRL-TRG pairs created by translating the HRL side of HRL-TRG data into the LRL]

SLIDE 16

Back-Translation by Pivoting

  • Problem: TRG-HRL back-translated data also suffers from lexical or syntactic mismatch
  • Idea: pivot TRG→HRL→LRL
    ○ Large amounts of English monolingual data can be utilized

Example: TRG: Thank you so much. → TUR: Çok teşekkür ederim. → AZE: Çox sağ olun. (paired with the original TRG sentence)
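The TRG→HRL→LRL chain amounts to composing two translation systems. A minimal sketch, where word tables stand in for the two NMT models and the romanized tokens are toy data:

```python
def translate(table, sent):
    """Stand-in word-for-word 'system' (really this would be an NMT model)."""
    return " ".join(table.get(w, w) for w in sent.split())

# Toy stand-ins for the two systems (invented, romanized entries)
trg2hrl = {"thank": "tesekkur", "you": "ederim"}   # e.g. English -> Turkish
hrl2lrl = {"tesekkur": "sag", "ederim": "olun"}    # e.g. Turkish -> Azerbaijani

def pivot(trg_sentences):
    """Build pseudo-parallel LRL-TRG pairs via the HRL pivot."""
    pairs = []
    for t in trg_sentences:
        hrl = translate(trg2hrl, t)    # TRG -> HRL
        lrl = translate(hrl2lrl, hrl)  # HRL -> LRL
        pairs.append((lrl, t))         # keep the original clean TRG side
    return pairs

print(pivot(["thank you"]))  # [('sag olun', 'thank you')]
```

Keeping the original TRG sentence as the target means the large English monolingual corpus supplies unlimited clean target-side text, with only the synthetic LRL source carrying pivot noise.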

SLIDE 17

Data w/ Various Types of Pivoting

[Diagram: all augmented data sources: TRG→LRL and TRG→HRL back-translation, HRL→LRL pivoting of parallel data, and HRL→LRL pivoting of back-translated data]

SLIDE 18

Monolingual Data Copying

SLIDE 19

Monolingual Data Copying

  • Problem: Back-translation may help with structure, but fail at terminology
  • Idea: Use target (TRG) monolingual data as-is, copying it to the source side
    ○ Helps encourage the model to not drop words
    ○ Helps translation of terms that are identical across languages

Example copied pair: SRC: Thank you so much. / TRG: Thank you so much.

Anna Currey, Antonio Valerio Miceli Barone, Kenneth Heafield. Copied Monolingual Data Improves Low-Resource Neural Machine Translation. WMT 2018.

SLIDE 20

Heuristic Augmentation Strategies

SLIDE 21

Dictionary-based Augmentation

  • 1. Find rare words in the source sentences
  • 2. Use a language model to predict another word that could appear in that context
  • 3. Replace the rare word with the prediction, and replace the aligned target word with its translation from the dictionary

Marzieh Fadaee, Arianna Bisazza, Christof Monz. Data Augmentation for Low-Resource Neural Machine Translation. ACL 2017.
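The three steps can be sketched as follows. `lm_suggest` is a stand-in for a real language model, and the alignment, dictionary, and sentences are all invented for illustration:

```python
def augment(src, trg, align, rare_words, lm_suggest, bilingual_dict):
    """Sketch of steps 1-3 above. `align` maps source index -> target index;
    `lm_suggest(sentence, position)` stands in for an LM's substitution."""
    src, trg = src[:], trg[:]
    for i, w in enumerate(src):
        if w in rare_words and i in align:            # 1. find a rare word
            cand = lm_suggest(src, i)                 # 2. LM proposes a substitute
            if cand in bilingual_dict:
                src[i] = cand                         # 3a. replace the source word
                trg[align[i]] = bilingual_dict[cand]  # 3b. fix the aligned target word
    return src, trg

# Toy example: every entry here is invented
src = "I bought a car".split()
trg = "j'ai acheté une voiture".split()
align = {3: 3}                           # "car" aligns to "voiture"
suggest = lambda sent, i: "bicycle"      # stand-in LM suggestion
d = {"bicycle": "vélo"}
print(augment(src, trg, align, {"car"}, suggest, d))
```

The result is a new pseudo-parallel pair ("I bought a bicycle" / "j'ai acheté une vélo"-style) that gives the rare dictionary word a training example in a plausible context.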

SLIDE 22

An Aside: Word Alignment

  • Automatically find alignments between source and target words, for dictionary learning, analysis, supervised attention, etc.
  • Traditional symbolic methods: word-based translation models trained using the EM algorithm
    • GIZA++: https://github.com/moses-smt/giza-pp
    • FastAlign: https://github.com/clab/fast_align
  • Neural methods: use a model like multilingual BERT or a translation model, and find words with similar embeddings
    • SimAlign: https://github.com/cisnlp/simalign
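The neural idea can be sketched with a greedy similarity argmax; this is a much-simplified version of what SimAlign does, and the 2-d "embeddings" are invented (in practice they would come from mBERT or an NMT encoder):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def align_words(src_vecs, trg_vecs):
    """Greedy embedding-based alignment: map each source word to its
    most similar target word by cosine similarity."""
    return {i: max(range(len(trg_vecs)), key=lambda j: cosine(u, trg_vecs[j]))
            for i, u in enumerate(src_vecs)}

# Toy 2-d contextual "embeddings" for a 2-word source and 2-word target
src_emb = [[1.0, 0.0], [0.0, 1.0]]
trg_emb = [[0.0, 1.0], [1.0, 0.1]]
print(align_words(src_emb, trg_emb))  # {0: 1, 1: 0}
```

Real aligners refine this with symmetrization (intersecting source→target and target→source argmaxes), but the core signal is exactly this similarity matrix.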
SLIDE 23

Word-by-word Data Augmentation

  • Even simpler: translate sentences word-by-word into the target language using a dictionary
  • Problem: what about word ordering and syntactic divergence?

FR: J'ai acheté une nouvelle voiture → I bought a new car
JA: 私 は 新しい 車 を 買った → I the new car a bought

Lample, Guillaume, et al. "Unsupervised machine translation using monolingual corpora only." arXiv preprint arXiv:1711.00043 (2017).

SLIDE 24

Word-by-word Augmentation w/ Reordering

Zhou, Chunting, et al. "Handling Syntactic Divergence in Low-resource Machine Translation." arXiv preprint arXiv:1909.00040 (2019).

  • Problem: Source-target word order can differ significantly in methods that use monolingual pre-training
  • Solution: Do re-ordering according to grammatical rules, followed by word-by-word translation to create pseudo-parallel data
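The reorder-then-translate recipe can be sketched with a single toy rule. The SVO→SOV rule and the tiny dictionary below are invented illustrations, not the paper's actual learned reordering grammar:

```python
def svo_to_sov(tokens):
    """Toy reordering rule: move the verb (assumed to be the 2nd token of a
    subject-verb-object sentence) to the end, mimicking SOV word order."""
    return [tokens[0]] + tokens[2:] + [tokens[1]] if len(tokens) >= 3 else tokens

def word_by_word(tokens, dictionary):
    """Dictionary lookup per token; unknown words pass through unchanged."""
    return [dictionary.get(w, w) for w in tokens]

# Invented English->Japanese toy dictionary (particles folded into the words)
d = {"I": "私は", "bought": "買った", "car": "車を"}

src = ["I", "bought", "car"]
pseudo = word_by_word(svo_to_sov(src), d)
print(pseudo)  # ['私は', '車を', '買った'] -- target-like SOV order
```

Reordering first means the resulting pseudo-parallel pair already exhibits the target language's word order, so the MT model trained on it does not have to learn ordering and lexical choice from the same noisy signal.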

SLIDE 25

In-class Assignment

SLIDE 26

In-class Assignment

  • Read one of the cited papers on heuristic data augmentation:
    • Marzieh Fadaee, Arianna Bisazza, Christof Monz. Data Augmentation for Low-Resource Neural Machine Translation. ACL 2017.
    • Zhou, Chunting, et al. "Handling Syntactic Divergence in Low-resource Machine Translation." EMNLP 2019.
  • Try to think of how it would work for one of the languages you're familiar with
  • Are there any potential hurdles to applying such a method? Are there any improvements you can think of?