Machine Translation 2: Statistical MT: Phrase-Based and Neural
Ondřej Bojar (bojar@ufal.mff.cuni.cz), Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague
December 2017 MT2: PBMT, NMT
Source and target sentences:

f^J_1 = f_1 . . . f_j . . . f_J,
e^I_1 = e_1 . . . e_i . . . e_I.

The search for the best translation:

ê^Î_1 = argmax_{I, e^I_1} p(e^I_1 | f^J_1)

We stick to the e^I_1, f^J_1 notation despite translating from English to Czech.
Estimating p(e^I_1 | f^J_1) directly by relative frequency:

p(e^I_1 | f^J_1) = count(e^I_1, f^J_1) / count(f^J_1)

- Defined only for source sentences f^J_1 seen in the training data (TM).
- A proper probability: there is no e^I_1 s.t. p(e^I_1 | f^J_1) > 1.
- Hopelessly sparse in practice: whole sentences almost never repeat.
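The sentence-level relative-frequency estimate can be sketched directly from counts. The toy parallel corpus below is invented for illustration; it also shows why the estimate fails for any unseen source sentence:

```python
from collections import Counter

# Hypothetical toy parallel corpus: (source, target) sentence pairs.
corpus = [
    ("the cat", "kocka"),
    ("the cat", "ta kocka"),
    ("the dog", "pes"),
]

pair_counts = Counter(corpus)
src_counts = Counter(src for src, _ in corpus)

def p_direct(tgt, src):
    """Relative frequency p(e|f) = count(e, f) / count(f).
    Undefined (here: 0) for source sentences never seen in training."""
    if src_counts[src] == 0:
        return 0.0
    return pair_counts[(src, tgt)] / src_counts[src]

print(p_direct("kocka", "the cat"))   # 0.5: seen once out of two occurrences
print(p_direct("pes", "the dogs"))    # 0.0: unseen source sentence
```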
Applying Bayes' rule:

ê^Î_1 = argmax_{I, e^I_1} p(e^I_1 | f^J_1)
      = argmax_{I, e^I_1} p(f^J_1 | e^I_1) p(e^I_1) / p(f^J_1)
      = argmax_{I, e^I_1} p(f^J_1 | e^I_1) p(e^I_1)        . . . p(f^J_1) constant
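The noisy-channel argmax can be sketched by rescoring a candidate list. The candidates and their model probabilities below are invented for illustration; the point is that a fluent hypothesis can win despite a slightly worse translation-model score:

```python
# Hypothetical candidate translations with toy model probabilities:
# "tm" stands for p(f|e), "lm" for p(e).
candidates = {
    "the cat sat": {"tm": 0.4, "lm": 0.010},
    "cat the sat": {"tm": 0.5, "lm": 0.0001},  # better TM fit, terrible LM
}

def noisy_channel_best(cands):
    # argmax_e p(f|e) * p(e); p(f) is constant in e and can be dropped.
    return max(cands, key=lambda e: cands[e]["tm"] * cands[e]["lm"])

print(noisy_channel_best(candidates))  # "the cat sat"
```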
The noisy-channel model:

ê^Î_1 = argmax_{I, e^I_1} p(f^J_1 | e^I_1) p(e^I_1)

- p(f^J_1 | e^I_1) is the translation model (models e^I_1 → f^J_1),
- p(e^I_1) is the language model.
[Diagram: Input → Global Search for the sentence with the highest probability → Output; Parallel Texts train the Translation Model, Monolingual Texts train the Language Model.]
p(e^I_1) should report how "good" the sentence e^I_1 is.

Trigram example (with sentence-boundary markers):

p(The cat was black .) = p(The | <s> <s>) · p(cat | <s> The) · p(was | The cat) · p(black | cat was) · p(. | was black) · p(</s> | black .)

n-gram approximation:

p(e^I_1) = ∏_{i=1}^I p(e_i | e^{i-1}_{i-n+1})

Maximum-likelihood estimates:

p(w_1) = count(w_1) / total words observed
p(w_2 | w_1) = count(w_1 w_2) / count(w_1)

Problem: a single unseen n-gram zeroes the whole product:

p(e^I_1) = · · · · 0 · · · · = 0
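A minimal bigram language model makes both the MLE estimates and the zero-probability problem concrete. The two training sentences are invented for illustration:

```python
from collections import Counter

# Toy training data with sentence-boundary markers (hypothetical example).
sents = [["<s>", "the", "cat", "was", "black", ".", "</s>"],
         ["<s>", "the", "dog", "was", "black", ".", "</s>"]]

unigrams = Counter(w for s in sents for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in sents for i in range(len(s) - 1))

def p_bigram(w2, w1):
    # MLE: p(w2|w1) = count(w1 w2) / count(w1); zero for unseen bigrams.
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

def p_sentence(sent):
    p = 1.0
    for w1, w2 in zip(sent, sent[1:]):
        p *= p_bigram(w2, w1)   # one unseen bigram zeroes the whole product
    return p

print(p_sentence(["<s>", "the", "cat", "was", "black", ".", "</s>"]))  # 0.5
print(p_sentence(["<s>", "the", "cat", "barked", ".", "</s>"]))        # 0.0
```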
Ad-hoc variants of the search weight the language model differently:

ê^Î_1 = argmax_{I, e^I_1} p(f^J_1 | e^I_1) (p(e^I_1))^2

ê^Î_1 = argmax_{I, e^I_1} p(e^I_1 | f^J_1) p(e^I_1)
p(e^I_1 | f^J_1) is modelled as a weighted combination of models ("features"):

p(e^I_1 | f^J_1) = exp(∑_{m=1}^M λ_m h_m(e^I_1, f^J_1)) / ∑_{e'^{I'}_1} exp(∑_{m=1}^M λ_m h_m(e'^{I'}_1, f^J_1))

Example feature, the n-gram language model:

h_LM(f^J_1, e^I_1) = log ∏_{i=1}^I p(e_i | e^{i-1}_{i-n+1})

The weights λ^M_1 specify the relative importance of the features.
The normalization term does not depend on e^I_1, so it can be dropped from the search:

ê^Î_1 = argmax_{I, e^I_1} exp(∑_{m=1}^M λ_m h_m(e^I_1, f^J_1)) / ∑_{e'^{I'}_1} exp(∑_{m=1}^M λ_m h_m(e'^{I'}_1, f^J_1))
      = argmax_{I, e^I_1} exp(∑_{m=1}^M λ_m h_m(e^I_1, f^J_1))
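The unnormalized log-linear search can be sketched in a few lines. The two candidates and their feature values are invented; the second call shows how changing the weights λ changes the winner:

```python
import math

def best(candidates, lambdas):
    # argmax over the unnormalized log-linear score sum_m lambda_m * h_m;
    # the normalizer is constant in e and can be dropped.
    def score(h):
        return sum(l * hm for l, hm in zip(lambdas, h))
    return max(candidates, key=lambda e: score(candidates[e]))

# Hypothetical candidates with features h = (log p_TM, log p_LM).
candidates = {
    "the cat sat": [math.log(0.2), math.log(0.010)],
    "cat the sat": [math.log(0.5), math.log(0.0001)],
}

print(best(candidates, [1.0, 1.0]))  # both features: "the cat sat"
print(best(candidates, [1.0, 0.0]))  # TM only: "cat the sat"
```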
With just two features,

h_TM(e^I_1, f^J_1) = log p(f^J_1 | e^I_1) for the translation model,
h_LM(e^I_1, f^J_1) = log p(e^I_1) for the language model,

and unit weights, the log-linear model reduces to the noisy channel:

ê^Î_1 = argmax_{I, e^I_1} exp(∑_{m=1}^M λ_m h_m(e^I_1, f^J_1))
      = argmax_{I, e^I_1} exp(h_TM(e^I_1, f^J_1) + h_LM(e^I_1, f^J_1))
      = argmax_{I, e^I_1} exp(log p(f^J_1 | e^I_1) + log p(e^I_1))
      = argmax_{I, e^I_1} p(f^J_1 | e^I_1) p(e^I_1)
This time around = Nyní
they 're moving = zareagovaly
even = dokonce ještě
. . . = . . .
This time around, they 're moving = Nyní zareagovaly
even faster = dokonce ještě rychleji
. . . = . . .
Segment f^J_1 into K phrases f̃^K_1. The segmentation s^K_1 is a hidden variable in the maximization, so we should be summing over all segmentations (note the three arguments in h_m(·, ·, ·) now):

ê^Î_1 = argmax_{I, e^I_1} ∑_{s^K_1} exp(∑_{m=1}^M λ_m h_m(e^I_1, f^J_1, s^K_1))    (13)

In practice, the sum is approximated by the maximum:

ê^Î_1 = argmax_{I, e^I_1} max_{s^K_1} exp(∑_{m=1}^M λ_m h_m(e^I_1, f^J_1, s^K_1))    (14)
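The hidden segmentation s^K_1 ranges over all ways to cut the source into contiguous phrases; a short enumeration sketch (a J-word sentence has 2^(J-1) segmentations):

```python
def segmentations(words):
    # Enumerate all ways to split a word sequence into contiguous phrases.
    if not words:
        yield []
        return
    for k in range(1, len(words) + 1):
        head = tuple(words[:k])
        for rest in segmentations(words[k:]):
            yield [head] + rest

segs = list(segmentations(["this", "time", "around"]))
print(len(segs))  # 4 segmentations of a 3-word sentence (2^(3-1))
```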
The phrase translation feature:

h_TM(f^J_1, e^I_1, s^K_1) = log ∏_{k=1}^K p(f̃_k | ẽ_k)

where p(f̃ | ẽ) = count(f̃, ẽ) / count(ẽ):

- count(f̃, ẽ) is the number of co-occurrences of a phrase pair (f̃, ẽ) that are consistent with the word alignment,
- count(ẽ) is the number of occurrences of the target phrase ẽ in the training corpus.

Both directions, p(f̃_k | ẽ_k) and p(ẽ_k | f̃_k), are used as features.
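Both phrase probabilities follow directly from relative frequencies over extracted phrase pairs. The pairs below are a hypothetical toy extraction, not real counts:

```python
from collections import Counter

# Hypothetical extracted phrase pairs (f~, e~), assumed alignment-consistent.
pairs = [("in europa", "in europe"), ("in europa", "in europe"),
         ("in europa", "across europe"), ("europas", "in europe")]

pair_counts = Counter(pairs)
e_counts = Counter(e for _, e in pairs)
f_counts = Counter(f for f, _ in pairs)

def p_f_given_e(f, e):
    # p(f~|e~) = count(f~, e~) / count(e~)
    return pair_counts[(f, e)] / e_counts[e]

def p_e_given_f(e, f):
    # p(e~|f~) = count(f~, e~) / count(f~)
    return pair_counts[(f, e)] / f_counts[f]

print(p_f_given_e("in europa", "in europe"))  # 2/3
print(p_e_given_f("in europe", "in europa"))  # 2/3
```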
Phrase table excerpt (German→English):

in europa ||| in europe ||| 0.829007 0.207955 0.801493 0.492402
europas ||| in europe ||| 0.0251019 0.066211 0.0342506 0.0079563
in der europaeischen union ||| in europe ||| 0.018451 0.00100126 0.0319584 0.

(The additional scores are lexical weights, computed from the word alignments a ∈ alignments and the phrase lengths |f|.)
Further common features:

- word penalty: h_WP(e^I_1, ·, ·) = I (the target length),
- phrase penalty: h_PP(·, ·, s^K_1) = K (the number of phrases),
- language model: h_LM(e^I_1, ·, ·) = log ∏_{i=1}^I p(e_i | e^{i-1}_{i-n+1}).
PBMT in summary:

ê^Î_1 = argmax_{I, e^I_1} p(f^J_1 | e^I_1) p(e^I_1)
      = argmax_{I, e^I_1} ∏_{(f̃, ẽ) ∈ phrase pairs of f^J_1, e^I_1} p(f̃ | ẽ) · p(e^I_1)

p(e^I_1) models the target sentence independently of f^J_1.
NMT models p(e^I_1 | f^J_1) directly, word by word:

p(e^I_1 | f^J_1) = p(e_1, e_2, . . . , e_I | f^J_1)
                 = p(e_1 | f^J_1) · p(e_2 | e_1, f^J_1) · p(e_3 | e_2, e_1, f^J_1) · . . .
                 = ∏_{i=1}^I p(e_i | e^{i-1}_1, f^J_1)

Compare with the language model: p(e^I_1) = ∏_{i=1}^I p(e_i | e_1, . . . , e_{i-1}).
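The chain-rule factorization means the sentence probability is just the product of the per-step conditionals a decoder emits. The step probabilities below are invented numbers for illustration; summing logs avoids underflow on long sentences:

```python
import math

# Hypothetical per-step conditionals p(e_i | e_1..e_{i-1}, f^J_1)
# as a decoder would emit them; fixed numbers for illustration.
step_probs = [0.9, 0.8, 0.95]   # p(e1|f), p(e2|e1,f), p(e3|e2,e1,f)

# Sentence probability = product of the conditionals (via log-space sum).
log_p = sum(math.log(p) for p in step_probs)
print(round(math.exp(log_p), 4))  # 0.684 = 0.9 * 0.8 * 0.95
```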
https://www.quora.com/How-can-a-deep-neural-network-with-ReLU-activations-in-its-hidden-layers-approximate-an
Animation by http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
One-hot representation of words:

cat → (0, 0, . . . , 0, 1, 0, . . . , 0)

[Figure: the words of "the cat is . . . the mat" encoded as a matrix of one-hot columns over the vocabulary (a, about, . . . , cat, . . . , is, . . . , the, . . . , zebra).]

Vocabulary size: 1.3M English, 2.2M Czech.
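A one-hot vector is trivial to construct; the tiny vocabulary below is hypothetical (real MT vocabularies run to millions of types, which is exactly why dense embeddings are needed):

```python
# Hypothetical small vocabulary for illustration.
vocab = ["a", "about", "cat", "is", "mat", "the", "zebra"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # A vector of zeros with a single 1 at the word's vocabulary index.
    v = [0] * len(vocab)
    v[index[word]] = 1
    return v

print(one_hot("cat"))  # [0, 0, 1, 0, 0, 0, 0]
```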
word2vec training objectives:
- CBOW: predict the word from its four neighbours.
- Skip-gram: predict likely neighbours given the word.

[Figure: a one-hidden-layer network with input layer x_1 . . . x_V, hidden layer h_1 . . . h_N, and output layer y_1 . . . y_V, connected by weight matrices W_{V×N} = {w_{ki}} and W'_{N×V} = {w'_{ij}}.]

Right: CBOW with just a single-word context (http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf)
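Skip-gram training data is just (center word, context word) pairs harvested from a window; a minimal sketch (window size 1, toy sentence):

```python
def skipgram_pairs(tokens, window=1):
    # Training pairs for skip-gram: predict neighbours from the center word.
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                pairs.append((w, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "cat", "sat"]))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```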
Illustrations from https://www.tensorflow.org/tutorials/word2vec
[Figure: sequence-to-sequence model. The Encoder reads x_1, x_2, . . . , x_T into a fixed-size vector c; the Decoder generates y_1, y_2, . . . , y_{T'} from c.]
nejneobhodpodařovávatelnějšími, Donaudampfschifffahrtsgesellschaftskapitän

BPE (Byte-Pair Encoding) uses the n most common substrings (incl. frequent words).
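The core of BPE is repeatedly merging the most frequent adjacent symbol pair. A simplified sketch over three toy word forms (the real algorithm works on frequency-weighted word types with end-of-word markers):

```python
from collections import Counter

def bpe_merges(words, n_merges):
    """Toy byte-pair encoding: repeatedly merge the most frequent
    adjacent symbol pair. Simplified sketch of the real algorithm."""
    seqs = [list(w) for w in words]
    merges = []
    for _ in range(n_merges):
        pairs = Counter((s[i], s[i + 1]) for s in seqs for i in range(len(s) - 1))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        for s in seqs:
            i = 0
            while i < len(s) - 1:
                if s[i] == a and s[i + 1] == b:
                    s[i:i + 2] = [a + b]   # replace the pair in place
                else:
                    i += 1
    return merges, seqs

merges, seqs = bpe_merges(["lower", "lowest", "low"], 3)
print(merges)  # learned subword merges, e.g. growing "lo" -> "low" -> ...
```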
p(e^I_1 | f^J_1) = p(e_1 | f^J_1) · p(e_2 | e_1, f^J_1) · p(e_3 | e_2, e_1, f^J_1) · . . .
https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/
[Figure: 2-D PCA projection of the 8000-D space representing sentences (Sutskever et al., 2014). Sentences with the same meaning cluster together regardless of word order:]

I gave her a card in the garden
In the garden , I gave her a card
She was given a card by me in the garden
She gave me a card in the garden
In the garden , she gave me a card
I was given a card by her in the garden
SRC Das Spektakel ähnelt dem Eurovision Song Contest. ("The spectacle resembles the Eurovision Song Contest.")
REF Je to jako pěvecká soutěž Eurovision.
SMT Podívanou připomíná hudební soutěž Eurovize.
NMT Divadlo se podobá Eurovizi Conview.

SRC Erderwärmung oder Zusammenstoß mit Killerasteroid. ("Global warming or collision with a killer asteroid.")
REF Globální oteplení nebo kolize se zabijáckým asteroidem.
SMT Globální oteplování, nebo srážka s Killerasteroid.
NMT Globální oteplování, nebo střet s zabijákem.

SRC Zu viele verletzte Gefühle. ("Too many hurt feelings.")
REF Příliš mnoho nepřátelských pocitů.
SMT Příliš mnoho zraněných pocity.
NMT Příliš mnoho zraněných.
References

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP, pages 1724-1734, Doha, Qatar, October. Association for Computational Linguistics.
Junyoung Chung, Çağlar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780, November.
Philipp Koehn. 2003. Noun Phrase Translation. Ph.D. thesis, University of Southern California.
Adam Lopez. 2009. Translation as weighted deduction. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 532-540, Athens, Greece, March. Association for Computational Linguistics.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.
Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. Ph.D. thesis, RWTH Aachen University.
Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of ACL, Sapporo, Japan, July.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems, pages 3104-3112.