
Before we start...

◮ Start downloading Moses:

wget http://ufal.mff.cuni.cz/~tamchyna/mosesgiza.64bit.tar.gz

◮ Start downloading our “playground” for SMT:

wget http://ufal.mff.cuni.cz/eman/download/playground.tar

◮ Slides can be downloaded here:

http://ufal.mff.cuni.cz/~tamchyna/mtm14.slides.pdf


Experimenting in MT: Moses Toolkit and Eman

Ondřej Bojar, Aleš Tamchyna
Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics
Charles University, Prague
Mon Sept 8, 2014


Outline

◮ Quick overview of Moses.
◮ Bird’s eye view of (phrase-based) MT.
  ◮ With pointers to the Moses repository.
◮ Experiment management.
  ◮ Motivation.
  ◮ Overview of Eman.
◮ Run your own experiments.
  ◮ Introduce Eman’s features through building a baseline Czech→English MT system.
  ◮ Inspect the pipeline and created models.
  ◮ Try some techniques to improve over the baseline.

Moses Toolkit

◮ Comprehensive open-source toolkit for SMT.
◮ Core: phrase-based and syntactic decoder.
◮ Includes many related tools:
  ◮ data pre-processing: cleaning, sentence splitting, tokenization, ...
  ◮ building models for translation: create phrase/rule tables from word-aligned data, train language models with KenLM,
  ◮ tuning translation systems (MERT and others).
◮ You still need a tool for word alignment: GIZA++, fast_align, ...
◮ Bundled with its own experiment manager EMS (we will use a different one).

Bird’s Eye View of Phrase-Based MT

[Diagram, built up across several slides: four inputs (parallel data, monolingual data, devset, input) first pass through preprocessing (tokenization, tagging, ...). From the parallel data, word alignment and phrase extraction produce the Translation Model (TM) and Reordering Model (RM); the monolingual data yields the Language Model (LM). Together these form the basic model, described by a moses.ini. Parameter optimization (MERT) on the devset turns it into the optimized model (again a moses.ini), which is then used to translate the input. The corresponding Moses scripts: train-model.perl, mert-moses.pl, moses-parallel.pl.]

Now, This Complex World... ...Has to Be Ruled by Someone

[The same pipeline diagram, now with an experiment manager in charge: tools such as Ducttape, EMS, or M4M orchestrate all of these stages.]

Why Use an Experiment Manager?

◮ Automatic tracking of system configurations, versions of software, ... ⇒ reproducibility of results.
◮ Efficiency/convenience:
  ◮ (MT) experiments are pipelines of complex components ⇒ hide implementation details, provide a unified abstraction,
  ◮ easily run many experiments in parallel.
◮ Re-use of intermediate files:
  ◮ different experiments may share e.g. the same language model.

Features of Eman

◮ Console-based ⇒ easily scriptable (e.g. in bash).
◮ Versatile: “seeds” are up to the user, any language.
◮ Support for the manual search through the space of experiment configurations.
◮ Support for finding and marking (“tagging”) steps or experiments of interest.
◮ Support for organizing the results in 2D tables.
◮ Integrated with SGE ⇒ easy to run on common academic clusters.

eman --man will tell you some details; http://ufal.mff.cuni.cz/eman/ has more.
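Because everything is console-driven, eman composes naturally with shell pipelines. A small illustrative sketch (hedged: it assumes eman stat prints one “step-name STATUS” pair per line; stat and redo are introduced on later slides):

  # Restart every failed mert step in one go.
  eman stat mert | grep FAILED | awk '{print $1}' | while read step; do
    eman redo "$step" --start
  done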


Eman’s View

◮ Experiments consist of processing steps.
◮ Steps are:
  ◮ of a given type, e.g. align, tm, lm, mert,
  ◮ defined by immutable variables, e.g. ALISYM=gdfa,
  ◮ all located in one directory, the “playground”,
  ◮ timestamped unique directories, e.g. s.mert.a123.20120215-1632,
  ◮ self-contained in the dir as much as reasonable,
  ◮ dependent on other steps, e.g. first align, then build tm, then mert.

Lifetime of a step:

  seed → INITED → PREPARED → RUNNING → DONE
  (preparation can fail: PREPFAILED; running can fail: FAILED)


Why INITED→PREPARED→RUNNING?

The call to eman init seed:

◮ Should be quick, it is used interactively.
◮ Should only check and set vars, “turn a blank directory into a valid eman step”.

The call to eman prepare s.step.123.20120215:

◮ May check for various input files.
  ◮ Less useful with heavy experiments where even corpus preparation needs the cluster.
◮ Has to produce eman.command.
  ⇒ A chance to check it: are all file paths correct etc.?

The call to eman start s.step.123.20120215:

◮ Sends the job to the cluster.
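In practice the three phases look like this (a minimal sketch; the step directory name is illustrative):

  eman init corpus                               # quick: checks and registers vars
  eman prepare s.corpus.1a2b3c4d.20140908-1000   # writes eman.command; inspect it now
  eman start s.corpus.1a2b3c4d.20140908-1000     # submits the job to the cluster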

Our Eman Seeds for MT

[The pipeline diagram once more, now labeled with eman step types: corpus steps import the parallel, monolingual, devset, and input data, with corpman managing them; align, tm, rm, and lm build the individual models; model combines them, mert tunes the system, and translate decodes.]

Eman’s Bells and Whistles

Experiment management:

◮ ls, vars, stat for simple listing,
◮ select for finding steps,
◮ traceback for full info on experiments,
◮ redo failed experiments,
◮ clone individual steps as well as whole experiments.

Meta-information on steps:

◮ status,
◮ tags, autotags,
◮ collecting results,
◮ tabulate for putting results into 2D tables.


Whole Experiment = eman traceback

eman traceback s.evaluator.8102edfc.20120207-1611

+- s.evaluator.8102edfc.20120207-1611
| +- s.mosesgiza.b6073a00.20120202-0037
| +- s.translate.b17f203d.20120207-1604
| | +- s.mert.272f2f67.20120207-0013
| | | +- s.model.3e28def7.20120207-0013
| | | | +- s.lm.608df574.20120207-0004
| | | | | +- s.srilm.117f0cfe.20120202-0037
| | | | +- s.mosesgiza.b6073a00.20120202-0037
| | | | +- s.tm.527c9342.20120207-0012
| | | | | +- s.align.dec45f74.20120206-0111
| | | | | | +- s.mosesgiza.b6073a00.20120202-0037
| | | | | +- s.mosesgiza.b6073a00.20120202-0037
| | +- s.mosesgiza.b6073a00.20120202-0037

Options: --vars --stat --log ... --ignore=steptype


Finding Steps: eman select

◮ Step dirs don’t have nice names.
◮ You need to locate steps of given properties.

What language models do I have?

◮ eman ls lm
◮ eman select t lm

If we need just the finished ones:

◮ eman stat lm | grep DONE
◮ eman select t lm d

And just 5-gram ones for English:

◮ eman select t lm d vre ORDER=5 vre CORPAUG=en


Deriving Experiments using clone

The text form of the traceback allows tweaking the experiment:

◮ eman tb step | sed 's/cs/de/' | eman clone
  replicates our experiment on German instead of Czech.

The regex substitution is available in eman itself:

◮ eman tb step -s '/cs/de/' -s '/form/lc/'
  shows the traceback with the substitutions highlighted.
  ◮ A good chance to check whether the derivation does what is intended.
◮ eman tb step -s '/cs/de/' -s '/form/lc/' | eman clone --dry-run
  ◮ Last chance to check whether existing steps get reused and what vars the new steps will be based on.
  ◮ Drop --dry-run to actually init the new steps.
  ◮ Add --start if you’re feeling lucky.


Hacking Welcome

Eman is designed to be hacking-friendly:

◮ Self-contained steps are easy to inspect:
  ◮ all logs are there,
  ◮ all (or most of) the input files are there,
  ◮ the main code (eman.command) is there,
  ◮ often even the binaries are there, or at least clearly identifiable.
◮ Step halfway failed?
  ⇒ Hack its eman.command and use eman continue.
◮ Seed not quite fit for your current needs?
  ⇒ Just init the step and hack eman.seed.
  ⇒ Or also prepare and hack eman.command.

Always mark manually tweaked steps, e.g. using eman’s tags.


Fit for Cell-Phone SSH

◮ Experiments run long but fail often.
◮ You don’t want to be chained to a computer.

Most eman commands have a short nickname.

◮ How are my last 10 merts?
  eman sel t mert l 10 --stat

Specify steps using any part of their name/hash or result:

◮ s.foobar.a0f3b123.20120215-1011 failed, retry it:
  eman redo a0f3 --start
◮ How did I achieve this great BLEU score of 25.10?
  eman tb 25.10 --vars | less


Fit for Team Work

Playgrounds can be effectively merged:

◮ eman add-remote /home/fred/playground freds-exps
◮ You can re-interpret Fred’s results.
◮ You can clone Fred’s experiments.
◮ You can make your steps depend on Fred’s steps.
◮ Only a shared file system is needed.

Caveat: we don’t bother checking for conflicts yet.


Getting Started

“Install” eman in your home directory:

  git clone https://redmine.ms.mff.cuni.cz/eman.git

Make sure eman is in your PATH; bad things happen if not:

  export PATH=$HOME/eman/bin/:$PATH
  echo "export PATH=$HOME/eman/bin/:\$PATH" >> ~/.bashrc

Get our SMT Playground (with all the seeds):

  git clone https://redmine.ms.mff.cuni.cz/ufal-smt-playground.git


Fix Perl Dependencies

Set up a local Perl repository (you can copy the answer from http://stackoverflow.com/a/2980715; just replace .profile with .bashrc):

  wget -O- http://cpanmin.us | perl - -l ~/perl5 App::cpanminus local::lib
  eval `perl -I ~/perl5/lib/perl5 -Mlocal::lib`
  echo 'eval `perl -I ~/perl5/lib/perl5 -Mlocal::lib`' >> ~/.bashrc

Install the required packages:

  cpanm YAML::XS

Confirm that eman runs:

  eman --man


Setup Corpora

◮ Czech→English translation.
◮ Training data: roughly 0.1% of CzEng 1.0 (15k sentence pairs).
◮ Dev set: 10% of WMT 2012 (300 sentence pairs).
◮ Test set: 10% of WMT 2013 (300 sentence pairs).

Download the data: http://bit.ly/mtm13corpora

Extract it into a subdirectory of your playground, e.g.:

  mkdir ~/ufal-smt-playground/playground/corpora
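A possible end-to-end shell session for this step (hedged: the archive name and format behind the bit.ly link are assumptions):

  mkdir -p ~/ufal-smt-playground/playground/corpora
  cd ~/ufal-smt-playground/playground/corpora
  wget http://bit.ly/mtm13corpora -O corpora.tar.gz   # assumed file name
  tar xzf corpora.tar.gz                              # assumed gzipped tar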


Importing the Corpora

◮ Every corpus has to “enter the world of eman”.
◮ This can be done using the seed corpus.

“eman init corpus” requires the following variables:

◮ TAKE_FROM_COMMAND — command which produces the corpus,
◮ OUTCORP — corpus name,
◮ OUTLANG — corpus language,
◮ OUTFACTS — description of factors,
◮ OUTLINECOUNT — number of lines that we are expecting to get, used as a sanity check.


Importing the Corpora

E.g. for the training data, Czech side:

  TAKE_FROM_COMMAND="cat ../corpora/train.cs" \
  OUTLINECOUNT=15000 \
  OUTCORP=train OUTLANG=cs \
  OUTFACTS=lc+lemma+tag \
  eman init --start corpus

P Inspect the step directory. Where is the corpus stored?
P Create a bash script/“one-liner” to import all corpora: train/dev/test, cs/en (loop over sections and languages); see the sketch below. Did it work? Find out: eman ls --stat

Frequent mistake: wrong OUTLINECOUNT for dev and test.
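One possible solution, sketched under the assumptions that both language sides carry the same lc+lemma+tag factors and that the line counts match the “Setup Corpora” slide:

  #!/bin/bash
  # Import all six corpora: 3 sections x 2 languages.
  for section in train dev test; do
    case $section in
      train) lines=15000 ;;
      *)     lines=300 ;;
    esac
    for lang in cs en; do
      TAKE_FROM_COMMAND="cat ../corpora/$section.$lang" \
      OUTLINECOUNT=$lines \
      OUTCORP=$section OUTLANG=$lang \
      OUTFACTS=lc+lemma+tag \
      eman init --start corpus
    done
  done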


Listing and Printing Corpora

Corpman links symbolic names with corpus steps:

  ./corpman ls                      # show all registered corpora

Corpman ensures uniform pre-processing:

  ./corpman train/cs+lemma --dump   # (construct and) print the corpus as lemmas

P Bonus: Calculate the OOV (out-of-vocabulary) rate of the test data given the training data for:

◮ English vs. Czech,
◮ lowercase forms vs. lemmas.

Use ufal-smt-playground/scripts/count-oov.pl or oov.pl from Moses. (Or write your own.)
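A minimal sketch of the bonus with plain Unix tools (it assumes corpman’s --dump prints one sentence per line with space-separated tokens, and uses the factor names from this slide):

  # Type-level vocabulary of the training data:
  ./corpman train/cs+lc --dump | tr ' ' '\n' | sort -u > train.vocab
  # Test tokens, one per line:
  ./corpman test/cs+lc --dump | tr ' ' '\n' > test.tokens
  # OOV rate = test tokens unseen in training / all test tokens:
  oov=$(grep -vxFf train.vocab test.tokens | wc -l)
  total=$(wc -l < test.tokens)
  echo "OOV rate: $oov / $total"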


Compiling Moses

In eman’s philosophy, software is just data.

◮ Binaries should be compiled in timestamped step dirs...
◮ ...so we know the exact code that was used.

Compile Moses and GIZA++:

  eman init --start mosesgiza

P Examine the step dir. Where is the compilation log?
P Bonus (hard): Make another mosesgiza step where Moses prints “OOV” every time it encounters an out-of-vocabulary word.


Getting Moses binaries

◮ In your playground, download the binary:

wget http://ufal.mff.cuni.cz/~tamchyna/mosesgiza.64bit.tar.gz

◮ Extract it:

tar xzf mosesgiza.64bit.tar.gz

◮ Some hacking:

./fix-symlinks.sh

◮ Let eman know what we did:

eman reindex


Baseline Experiment

In your playground:

  wget http://ufal.mff.cuni.cz/~tamchyna/baseline.traceback
  eman clone --start < baseline.traceback

P While the experiment runs:

◮ Make a copy of the traceback.
◮ Modify it to train word alignment on lemmas instead of lc. (But preserve the translation lc→lc!)
◮ Note that ALILABEL is somewhat arbitrary but has to match between align and tm.

P Bonus: do the required edits using substitution in eman; see the sketch below. Hint: eman --man, look for the “traceback” command.
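A hedged sketch of the bonus, using the -s substitution from the “Deriving Experiments using clone” slide (the substituted variable values, ALIAUG and ALILABEL, are assumptions; check eman tb --vars for the real ones):

  eman tb s.evaluator.xyz \
      -s '/ALIAUG=lc/ALIAUG=lemma/' \
      -s '/ALILABEL=lc-lc/ALILABEL=lemma-lemma/' \
    | eman clone --dry-run    # inspect, then drop --dry-run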

Looking Inside the Models

◮ Go to one of your baseline model steps and look at the files.
◮ Language model: lm.1.corpus.lm.gz

P What is more probable: “united kingdom” or “united states”?
P Why are longer n-grams more probable than short ones?

◮ Phrase table: tm.1/model/phrase-table.0-0.gz

P How do you say “hi” in Czech?
P Phrase scores are P(f|e), lex(f|e), P(e|f), lex(e|f). Given that, what do the counts in the last column mean? (Let’s look e.g. at the phrase “ahoj ||| hi”.)
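Both files are gzipped plain text, so the questions can be explored with ordinary shell tools (a sketch; it assumes the LM is in the standard ARPA format, with log10 probabilities before each n-gram):

  # Compare the two bigrams in the language model:
  zcat lm.1.corpus.lm.gz | grep -i 'united kingdom'
  zcat lm.1.corpus.lm.gz | grep -i 'united states'
  # List phrase-table entries for "ahoj":
  zcat tm.1/model/phrase-table.0-0.gz | grep '^ahoj |||'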

Tuning

P How many iterations did MERT take?
P How did the BLEU score on the devset change?
P How much disk space did your MERTs need?

◮ Standard Unix tool:
  du -sh s.mert.*
◮ Eman status:
  eman ls mert --dus --stat

Results

Let’s compare the MT quality (BLEU) of 2 systems:

◮ alignment on lowercase forms,
◮ alignment on lemmas.

P Look at the evaluator steps. Which one is the baseline?

◮ Trace back + grep:
  eman tb --vars s.evaluator.xyz | grep ALIAUG
◮ Trace forward from the alignment step:
  eman tf $(eman sel t align vre 'SRC.*lc')
◮ Or just one select query:
  eman sel t evaluator br t align vre 'SRC.*lc'

BLEU is in the “s.evaluator.../scores” file.

Wild Experimenting

P Run word alignment on lcstem4, lcstem5.
P Try different orders of the language model (3, 4, 6); see the sketch below.
P Translate from Czech lemmas into English forms (lc).
P Try the opposite translation direction: English→Czech.
P Set up a factored system:
  ◮ lc→lc (baseline path), and
  ◮ lemma→lc (alternative path).
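For the LM-order task, one hedged approach is traceback substitution in a loop (the evaluator step name is a placeholder and the baseline ORDER=5 is an assumption; check eman tb --vars first):

  for order in 3 4 6; do
    eman tb s.evaluator.xyz -s "/ORDER=5/ORDER=$order/" | eman clone --start
  done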


Summary

Hopefully, you now understand:

◮ within (PB)MT:
  ◮ the structure of a (PB)MT experiment,
  ◮ what the language model and the translation model are,
◮ at the meta-level:
  ◮ eman’s organization of the experimentation playground,
  ◮ the idea of cloning experiments.


Extra Slides


Eman is Versatile

What types of steps should I have?

◮ Any, depending on your application.

What language do I write steps in?

◮ Any, e.g. bash.

What are the input and output files of the steps?

◮ Any, just make depending steps understand each other.
◮ Steps can have many output files and serve as prerequisites to different types of other steps.

What are the measured values of my experiments?

◮ Anything from any of the files any step produces.


What the User Implements: Just Seeds

Technically, a seed is any program that:

◮ responds to arbitrary environment variables,
◮ runs eman defvar to register step variables with eman,
◮ produces another program, ./eman.command, that does the real job.

The seed is actually run twice:

◮ At “init”: to check the validity of input variables and register them with eman.
◮ At “prepare”: to produce eman.command.

The user puts all seeds in playground/eman.seeds.

◮ Eman runs a local copy of the seed in a fresh step dir.
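A minimal hypothetical seed in bash, following the contract above (hedged: the exact defvar options are not shown on these slides, and the idea that a trivial seed may simply perform both duties on both runs is an assumption; all names are illustrative):

  #!/bin/bash
  # "init" duty: validate the input variable and register it with eman.
  [ -n "$GREETING" ] || { echo "GREETING must be set" >&2; exit 1; }
  eman defvar GREETING
  # "prepare" duty: produce eman.command, the program doing the real job.
  # (The value of GREETING is baked in when the seed runs.)
  printf '#!/bin/bash\necho "%s" > output.txt\n' "$GREETING" > eman.command
  chmod +x eman.command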


eman redo

On a cluster, jobs can fail nondeterministically:

◮ bad luck when scheduled to a swamped machine,
◮ bad estimate of hard resource limits (RAM exceeds the limit ⇒ job killed).

Eman to the rescue:

◮ eman redo step creates a new instance of each failed step, preserving the experiment structure.
◮ eman redo step --start starts the steps right away.

To make sure eman will do what you expect, first try:

◮ eman redo step --dry-run


eman clone

Cloning is initing a new step using the vars of an existing one.

Cloning of individual steps is useful:

◮ when a step failed (used in eman redo),
◮ when the seed has changed,
◮ when we want to redefine some vars:
  ORDER=4 eman clone s.lm.1d6f791c...

Cloning of whole tracebacks:

◮ The text of a traceback gets instantiated as steps.
◮ Existing steps are reused if OK and with identical vars.
◮ eman traceback step | eman clone
◮ eman traceback step | mail bojar@ufal
  followed by eman clone < the-received-mail.


eman tag or eman ls --tag shows tags

Tags and autotags are:

◮ arbitrary keywords assigned to individual steps,
◮ inherited from dependencies.

Tags are:

◮ added using eman add-tag the-tag steps,
◮ stored in s.stepdir.123/eman.tag.
⇒ Use them to manually mark exceptions.

Autotags are:

◮ specified in playground/eman.autotags as regexes over step vars, e.g. /ORDER=(.*)/$1gr/ for the LM,
◮ (re-)observed at eman retag.
⇒ Use them to systematically mark experiment branches.


eman collect

Based on rules in eman.results.conf, e.g.:

  BLEU  */BLEU.opt                  BLEU\s*=\s*([^\s,]+)
  Snts  s.eval*/corpus.translation  CMD: wc -l

eman collects results from all steps into eman.results:

  # Step Name                          Status    Score  Value  Tags and Autotags
  s.evaluator.11ccf590.20120208-1554   DONE      TER    31.04  5gr DEVwmt10 LMc-news towards-CDER
  s.evaluator.11ccf590.20120208-1554   DONE      PER    44.61  5gr DEVwmt10 LMc-news towards-CDER
  s.evaluator.11ccf590.20120208-1554   DONE      CDER   33.97  5gr DEVwmt10 LMc-news towards-CDER
  s.evaluator.11ccf590.20120208-1554   DONE      BLEU   12.28  5gr DEVwmt10 LMc-news towards-CDER
  s.evaluator.11ccf590.20120208-1554   DONE      Snts   3003   5gr DEVwmt10 LMc-news towards-CDER
  s.evaluator.29fa5679.20120207-1357   OUTDATED  TER    17.66  5gr DEVwmt10 LMc-news
  ...
  s.evaluator.473687bb.20120214-1509   FAILED    Snts   3003

◮ Perhaps hard to read.
◮ Easy to grep, sort, whatever, or tabulate.
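For example, to see the finished BLEU scores, best first (a sketch; it assumes the column layout shown above, with the value in column 4):

  grep DONE eman.results | grep BLEU | sort -k4 -rn | head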


eman tabulate to Organize Results

The user specifies in the file eman.tabulate:

◮ which results to ignore, which to select,
◮ which tags contribute to column labels, e.g. TER, BLEU,
◮ which tags contribute to row labels, e.g. [0-9]gr, towards-[A-Z]+, PRO.

Eman tabulates the results, output in eman.niceresults:

                    PER    CDER   TER    BLEU
  5gr towards-CDER  44.61  33.97  31.04  12.28
  5gr               44.19  33.76  31.02  12.18
  5gr PRO           43.91  33.87  31.49  12.09
  5gr towards-PER   44.44  33.52  30.74  11.95


Related Experiment Mgmt Systems

Eman is just one of many; consider also:

◮ LoonyBin (Clark et al., 2010)
  ⊖ Clickable Java tool.
  ⊕ Support for multiple clusters and scheduler types.
◮ Moses EMS (Koehn, 2010)
  ◮ Experiment Management System primarily for Moses.
  ◮ Centered around a single experiment which consists of steps.
◮ Pure Makefiles
  ◮ Yes, you can easily live with fancy Makefiles.
  ◮ You will use commands like make init.mert or cp -r exp.mert.1 exp.mert.1b.
  ◮ You need to learn to use $*, $@, etc.
  ◮ You are likely to implement your own eman soon.

There are also the following workflow management systems: DAGMan, Pegasus, Dryad.


References

Jonathan H. Clark, Jonathan Weese, Byung Gyu Ahn, Andreas Zollmann, Qin Gao, Kenneth Heafield, and Alon Lavie. 2010. The Machine Translation Toolpack for LoonyBin: Automated Management of Experimental Machine Translation HyperWorkflows. Prague Bulletin of Mathematical Linguistics, 93:117–126.

Philipp Koehn. 2010. An Experimental Management System. Prague Bulletin of Mathematical Linguistics, 94:87–96, September.