

slide-1
SLIDE 1

Machine Translation 2: Statistical MT: Phrase-Based and Neural

Ondřej Bojar bojar@ufal.mff.cuni.cz Institute of Formal and Applied Linguistics Faculty of Mathematics and Physics Charles University, Prague

December 2018 MT2: PBMT, NMT

slide-2
SLIDE 2

Outline of Lectures on MT

  • 1. Introduction.
  • Why is MT difficult.
  • MT evaluation.
  • Approaches to MT.
  • First peek into phrase-based MT
  • Document, sentence and word alignment.
  • 2. Statistical Machine Translation.
  • Phrase-based: Assumptions, beam search, key issues.
  • Neural MT: Sequence-to-sequence, attention, self-attentive.
  • 3. Advanced Topics.
  • Linguistic Features in SMT and NMT.
  • Multilinguality, Multi-Task, Learned Representations.

December 2018 MT2: PBMT, NMT 1

slide-3
SLIDE 3

Outline of MT Lecture 2

  • 1. What makes MT statistical.
  • Brute-force statistical MT.
  • Noisy channel model.
  • Log-linear model.
  • 2. Phrase-based translation model.
  • Phrase extraction.
  • Decoding (gradual construction of hypotheses).
  • Minimum error-rate training (weight optimization).
  • 3. Neural machine translation (NMT).
  • Sequence-to-sequence, with attention.

December 2018 MT2: PBMT, NMT 2

slide-4
SLIDE 4

Quotes

Warren Weaver (1949):

I have a text in front of me which is written in Russian but I am going to pretend that it is really written in English and that it has been coded in some strange symbols. All I need to do is strip off the code in order to retrieve the information contained in the text.

Noam Chomsky (1969):

. . . the notion “probability of a sentence” is an entirely useless one, under any known interpretation of this term.

Frederick Jelinek (1980s; IBM; later JHU and sometimes ÚFAL):

Every time I fire a linguist, the accuracy goes up.

Hermann Ney (RWTH Aachen University):

MT = Linguistic Modelling + Statistical Decision Theory

December 2018 MT2: PBMT, NMT 3

slide-5
SLIDE 5

The Statistical Approach

(Statistical = Information-theoretic.)

  • Specify a probabilistic model.
    = How is the probability mass distributed among possible outputs given observed inputs.
  • Specify the training criterion and procedure.
    = How to learn free parameters from training data.

Notice:

  • Linguistics helpful when designing the models:
    – How to divide the input into smaller units.
    – Which bits of observations are more informative.

December 2018 MT2: PBMT, NMT 4

slide-6
SLIDE 6

Statistical MT

Given a source (foreign) language sentence f_1^J = f_1 ... f_j ... f_J,
produce a target language (English) sentence e_1^I = e_1 ... e_i ... e_I.

Among all possible target language sentences, choose the sentence with the highest probability:

  \hat{e}_1^{\hat{I}} = argmax_{I, e_1^I} p(e_1^I | f_1^J)    (1)

We stick to the e_1^I, f_1^J notation despite translating from English to Czech.

December 2018 MT2: PBMT, NMT 5

slide-7
SLIDE 7

Brute-Force MT (1/2)

Translate only sentences listed in a “translation memory” (TM):

  Good morning. = Dobré ráno.
  How are you? = Jak se máš?
  How are you? = Jak se máte?

  p(e_1^I | f_1^J) = 1 if the pair (f_1^J, e_1^I) is seen in the TM, 0 otherwise    (2)

Any problems with the definition?

  • Not a probability. There may be f_1^J s.t. \sum_{e_1^I} p(e_1^I | f_1^J) > 1.
    ⇒ Have to normalize: use count(e_1^I, f_1^J) / count(f_1^J) instead of 1.
  • Not “smooth”, no generalization:
    Good morning. ⇒ Dobré ráno.

December 2018 MT2: PBMT, NMT 6

slide-8
SLIDE 8

Brute-Force MT (2/2)

Translate only sentences listed in a “translation memory” (TM):

  Good morning. = Dobré ráno.
  How are you? = Jak se máš?
  How are you? = Jak se máte?

  p(e_1^I | f_1^J) = 1 if the pair (f_1^J, e_1^I) is seen in the TM, 0 otherwise    (3)

  • Not a probability. There may be f_1^J s.t. \sum_{e_1^I} p(e_1^I | f_1^J) > 1.
    ⇒ Have to normalize: use count(e_1^I, f_1^J) / count(f_1^J) instead of 1.
  • Not “smooth”, no generalization:
    Good morning. ⇒ Dobré ráno.    Good evening. ⇒ ∅

December 2018 MT2: PBMT, NMT 7

slide-9
SLIDE 9

Bayes’ Law

Bayes’ law for conditional probabilities: p(a|b) = p(b|a) p(a) / p(b)

So in our case:

  \hat{e}_1^{\hat{I}} = argmax_{I, e_1^I} p(e_1^I | f_1^J)
                                                                 (apply Bayes’ law)
                      = argmax_{I, e_1^I} p(f_1^J | e_1^I) p(e_1^I) / p(f_1^J)
                                                                 (p(f_1^J) constant ⇒ irrelevant in maximization)
                      = argmax_{I, e_1^I} p(f_1^J | e_1^I) p(e_1^I)

Also called the “Noisy Channel” model.

December 2018 MT2: PBMT, NMT 8

slide-10
SLIDE 10

Motivation for Noisy Channel

  \hat{e}_1^{\hat{I}} = argmax_{I, e_1^I} p(f_1^J | e_1^I) p(e_1^I)    (4)

Bayes’ law divided the model into components:

  p(f_1^J | e_1^I)   ... the translation model (“reversed”, e_1^I → f_1^J):
                         is it a likely translation?
  p(e_1^I)           ... the language model (LM):
                         is the output a likely sentence of the target language?

  • The components can be trained on different sources.
    There are far more monolingual data ⇒ the language model is more reliable.

December 2018 MT2: PBMT, NMT 9

slide-11
SLIDE 11

Without Equations

[Diagram: Input → Global Search for the sentence with the highest probability → Output; Parallel Texts → Translation Model; Monolingual Texts → Language Model.]

December 2018 MT2: PBMT, NMT 10

slide-12
SLIDE 12

Summary of Language Models

  • p(e_1^I) should report how “good” sentence e_1^I is.
  • We surely want p(The the the.) < p(Hello.)
  • How about p(The cat was black.) < p(Hello.)?
    ... We don’t really care in MT. We hope to compare synonymous sentences.

The LM is usually a 3-gram language model:

  p(<s> The cat was black . </s>) = p(The|<s>) p(cat|<s> The) p(was|The cat)
                                    p(black|cat was) p(.|was black) p(</s>|black .)

Formally, with n = 3:

  p_LM(e_1^I) = \prod_{i=1}^{I} p(e_i | e_{i-n+1}^{i-1})    (5)

December 2018 MT2: PBMT, NMT 11

slide-13
SLIDE 13

Estimating and Smoothing LM

Unigram probabilities:  p(w1) = count(w1) / (total words observed)
Bigram probabilities:   p(w2|w1) = count(w1 w2) / count(w1)
Trigram probabilities:  p(w3|w1, w2) = count(w1 w2 w3) / count(w1 w2)

Unseen n-grams (p(n-gram) = 0) are a big problem; they invalidate the whole sentence:

  p_LM(e_1^I) = ... · 0 · ... = 0

⇒ Back off to shorter n-grams:

  p_LM(e_1^I) = \prod_{i=1}^{I} [ 0.8 · p(e_i | e_{i-1}, e_{i-2})
                                + 0.15 · p(e_i | e_{i-1})
                                + 0.049 · p(e_i)
                                + 0.001 ]  ≠ 0    (6)

December 2018 MT2: PBMT, NMT 12
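The interpolation in Equation (6) is easy to make concrete. Below is a minimal Python sketch under assumed toy data: the 0.8/0.15/0.049/0.001 weights follow the slide, while the tiny corpus and the helper names are purely illustrative, not part of any real LM toolkit.

import math
from collections import Counter

corpus = "<s> the cat was black . </s> <s> the cat sat . </s>".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
total = len(corpus)

def p_interp(w3, w1, w2):
    # 0.8 * trigram + 0.15 * bigram + 0.049 * unigram + 0.001 floor, as in Eq. (6)
    p3 = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0
    p2 = bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0.0
    p1 = unigrams[w3] / total
    return 0.8 * p3 + 0.15 * p2 + 0.049 * p1 + 0.001

def sentence_logprob(words):
    # score every word given its two predecessors; the result is never exactly zero
    return sum(math.log(p_interp(words[i], words[i - 2], words[i - 1]))
               for i in range(2, len(words)))

print(sentence_logprob("<s> the cat was black . </s>".split()))

Because of the constant 0.001 term, even a sentence full of unseen trigrams gets a small but non-zero probability, which is exactly the point of the back-off.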

slide-14
SLIDE 14

From Bayes to Log-Linear Model

Och (2002) discusses some problems of the noisy-channel model (Equation 4):

  • Models estimated unreliably ⇒ maybe the LM is more important:

      \hat{e}_1^{\hat{I}} = argmax_{I, e_1^I} p(f_1^J | e_1^I) (p(e_1^I))^2    (7)

  • In practice, the “direct” translation model is equally good:

      \hat{e}_1^{\hat{I}} = argmax_{I, e_1^I} p(e_1^I | f_1^J) p(e_1^I)    (8)

  • Complicated to correctly introduce other dependencies.

⇒ Use a log-linear model instead.

December 2018 MT2: PBMT, NMT 13

slide-15
SLIDE 15

Log-Linear Model (1)

  • p(e_1^I | f_1^J) is modelled as a weighted combination of models,
    called “feature functions” h_1(·,·) ... h_M(·,·):

      p(e_1^I | f_1^J) = exp(\sum_{m=1}^{M} λ_m h_m(e_1^I, f_1^J))
                         / \sum_{e'_1^{I'}} exp(\sum_{m=1}^{M} λ_m h_m(e'_1^{I'}, f_1^J))    (9)

  • Each feature function h_m(e, f) relates source f to target e.
    E.g. the feature for an n-gram language model:

      h_LM(f_1^J, e_1^I) = log \prod_{i=1}^{I} p(e_i | e_{i-n+1}^{i-1})    (10)

  • Model weights λ_1^M specify the relative importance of features.

December 2018 MT2: PBMT, NMT 14

slide-16
SLIDE 16

Log-Linear Model (2)

As before, the constant denominator is not needed in maximization:

  \hat{e}_1^{\hat{I}} = argmax_{I, e_1^I} exp(\sum_{m=1}^{M} λ_m h_m(e_1^I, f_1^J))
                        / \sum_{e'_1^{I'}} exp(\sum_{m=1}^{M} λ_m h_m(e'_1^{I'}, f_1^J))
                      = argmax_{I, e_1^I} exp(\sum_{m=1}^{M} λ_m h_m(e_1^I, f_1^J))    (11)

December 2018 MT2: PBMT, NMT 15

slide-17
SLIDE 17

Relation to Noisy Channel

With equal weights and only two features:

  • h_TM(e_1^I, f_1^J) = log p(f_1^J | e_1^I) for the translation model,
  • h_LM(e_1^I, f_1^J) = log p(e_1^I) for the language model,

the log-linear model reduces to the Noisy Channel:

  \hat{e}_1^{\hat{I}} = argmax_{I, e_1^I} exp(\sum_{m=1}^{M} λ_m h_m(e_1^I, f_1^J))
                      = argmax_{I, e_1^I} exp(h_TM(e_1^I, f_1^J) + h_LM(e_1^I, f_1^J))
                      = argmax_{I, e_1^I} exp(log p(f_1^J | e_1^I) + log p(e_1^I))
                      = argmax_{I, e_1^I} p(f_1^J | e_1^I) p(e_1^I)    (12)

December 2018 MT2: PBMT, NMT 16
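A minimal Python sketch of log-linear rescoring, under invented feature values (the two candidates, their scores and the weight names are illustrative only). With just h_TM and h_LM and unit weights, the ranking it produces is exactly the noisy-channel ranking by p(f|e) · p(e), as in Equation (12).

import math

def loglinear_score(features, weights):
    # sum_m lambda_m * h_m(e, f); the constant denominator of Eq. (9) is ignored
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical feature values (log probabilities) for two candidate translations of one source.
candidates = {
    "candidate 1": {"h_TM": math.log(0.02), "h_LM": math.log(0.001)},
    "candidate 2": {"h_TM": math.log(0.01), "h_LM": math.log(0.010)},
}
weights = {"h_TM": 1.0, "h_LM": 1.0}   # equal weights = Noisy Channel (Eq. 12)

best = max(candidates, key=lambda e: loglinear_score(candidates[e], weights))
print(best)   # candidate 2: 0.01 * 0.010 > 0.02 * 0.001

Changing the weights (e.g. doubling the LM weight, as in Equation 7) changes the ranking without touching the component models; that flexibility is the practical argument for the log-linear formulation.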

slide-18
SLIDE 18

Phrase-Based MT Overview

  This time around , they ’re moving even faster .  =  Nyní zareagovaly dokonce ještě rychleji .

  This time around = Nyní                      This time around , they ’re moving = Nyní zareagovaly
  they ’re moving = zareagovaly                even faster = dokonce ještě rychleji
  even = dokonce ještě                         ... = ...
  ... = ...

Phrase-based MT: choose such a segmentation of the input string and such phrase
“replacements” that make the output sequence “coherent” (3-grams most probable).

December 2018 MT2: PBMT, NMT 17

slide-19
SLIDE 19

Phrase-Based Translation Model

  • Captures the basic assumption of phrase-based MT:
    1. Segment the source sentence f_1^J into K phrases \tilde{f}_1 ... \tilde{f}_K.
    2. Translate each phrase independently: \tilde{f}_k → \tilde{e}_k.
    3. Concatenate the translated phrases (with possible reordering R):
       \tilde{e}_{R(1)} ... \tilde{e}_{R(K)}

  • In theory, the segmentation s_1^K is a hidden variable in the maximization;
    we should be summing over all segmentations
    (note the three arguments in h_m(·,·,·) now):

      \hat{e}_1^{\hat{I}} = argmax_{I, e_1^I} \sum_{s_1^K} exp(\sum_{m=1}^{M} λ_m h_m(e_1^I, f_1^J, s_1^K))    (13)

  • In practice, the sum is approximated with a max (the biggest element only):

      \hat{e}_1^{\hat{I}} = argmax_{I, e_1^I} max_{s_1^K} exp(\sum_{m=1}^{M} λ_m h_m(e_1^I, f_1^J, s_1^K))    (14)

December 2018 MT2: PBMT, NMT 18

slide-20
SLIDE 20

Core Feature: Phrase Trans. Prob.

The most important feature: phrase-to-phrase translation:

  h_Phr(f_1^J, e_1^I, s_1^K) = log \prod_{k=1}^{K} p(\tilde{f}_k | \tilde{e}_k)    (15)

The conditional probability of phrase \tilde{f}_k given phrase \tilde{e}_k is estimated from relative frequencies:

  p(\tilde{f}_k | \tilde{e}_k) = count(\tilde{f}, \tilde{e}) / count(\tilde{e})    (16)

  • count(\tilde{f}, \tilde{e}) is the number of co-occurrences of the phrase pair (\tilde{f}, \tilde{e})
    that are consistent with the word alignment.
  • count(\tilde{e}) is the number of occurrences of the target phrase \tilde{e} in the training corpus.
  • h_Phr is usually used twice, in both directions: p(\tilde{f}_k | \tilde{e}_k) and p(\tilde{e}_k | \tilde{f}_k).

December 2018 MT2: PBMT, NMT 19
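A minimal sketch of the relative-frequency estimate in Equation (16). The extracted phrase pairs below are invented (in a real system they come from the word-aligned parallel corpus, as on the following slides), and the function name is ours, not Moses'.

from collections import Counter

# Hypothetical phrase pairs (f_phrase, e_phrase) extracted from a word-aligned corpus.
extracted_pairs = [
    ("dobré ráno", "good morning"),
    ("dobré ráno", "good morning"),
    ("ráno", "morning"),
    ("dobré ráno", "a good morning"),
]

pair_counts = Counter(extracted_pairs)
e_counts = Counter(e for _, e in extracted_pairs)

def p_f_given_e(f_phrase, e_phrase):
    # count(f, e) / count(e), cf. Equation (16)
    return pair_counts[(f_phrase, e_phrase)] / e_counts[e_phrase]

print(p_f_given_e("dobré ráno", "good morning"))   # 2 / 2 = 1.0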

slide-21
SLIDE 21

Phrase-Based Features in Moses

Given parallel training corpus, phrases are extracted and scored:

in europa ||| in europe ||| 0.829007 0.207955 0.801493 0.492402 europas ||| in europe ||| 0.0251019 0.066211 0.0342506 0.0079563 in der europaeischen union ||| in europe ||| 0.018451 0.00100126 0.0319584 0.

The scores are (φ(·) = log p(·)):

  • phrase translation probabilities: φ_phr(f|e) and φ_phr(e|f)
  • lexical weighting: φ_lex(f|e) and φ_lex(e|f) (Koehn, 2003)

      φ_lex(f|e) = log max_{a ∈ alignments(f, e)} \prod_{i=1}^{|f|} ( 1/|{j | (i,j) ∈ a}| · \sum_{j: (i,j) ∈ a} p(f_i | e_j) )    (17)

December 2018 MT2: PBMT, NMT 20

slide-22
SLIDE 22

Other Features Used in PBMT

  • Word count/penalty: h_wp(e_1^I, ·, ·) = I
    ⇒ Do we prefer longer or shorter output?
  • Phrase count/penalty: h_pp(·, ·, s_1^K) = K
    ⇒ Do we prefer translation in more or fewer less-dependent bits?
  • Reordering model: different basic strategies (Lopez, 2009)
    ⇒ Which source spans can provide a continuation at a given moment?
  • n-gram LM:

      h_LM(·, e_1^I, ·) = log \prod_{i=1}^{I} p(e_i | e_{i-n+1}^{i-1})    (18)

    ⇒ Is the output n-gram-wise coherent?

December 2018 MT2: PBMT, NMT 21

slide-23
SLIDE 23

Decoding in Phrase-Based MT

[Figure: Koehn’s decoding example for the source “Maria no dio una bofetada a la bruja verde”: translation options per source span (Mary, not, did not, give, a slap, to the, green, witch, ...) and a graph of partial hypotheses, each recording the output so far (e:), the covered source words (f:) and the probability (p:), e.g. e: Mary, f: *--------, p: .534.]

  • 1. Collect translation options (all possible translations per span).
  • 2. Gradually expand partial hypotheses until all input covered.
  • 3. Prune less promising hypotheses.
  • 4. When all input covered, trace back the best path.

December 2018 MT2: PBMT, NMT 22
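The four steps above can be sketched as a tiny stack decoder. This is a deliberately simplified illustration under assumptions that go well beyond the slide: monotone order only, no LM or other non-local features, no hypothesis recombination, and a made-up phrase table; it only shows how hypotheses gradually extend their source coverage and how stacks are pruned.

import math

# Hypothetical phrase table: source phrase -> [(target phrase, log probability)]
phrase_table = {
    ("maria",): [("mary", 0.0)],
    ("no",): [("not", math.log(0.6)), ("did not", math.log(0.4))],
    ("dio", "una", "bofetada"): [("slapped", math.log(0.5)), ("gave a slap", math.log(0.5))],
    ("a", "la"): [("to the", 0.0)],
    ("bruja", "verde"): [("green witch", 0.0)],
}

def decode(source, beam_size=3):
    # stacks[k] holds partial hypotheses covering the first k source words
    stacks = [[] for _ in range(len(source) + 1)]
    stacks[0].append((0.0, []))                                  # (score, output phrases)
    for k in range(len(source)):
        stacks[k] = sorted(stacks[k], reverse=True)[:beam_size]  # prune less promising hypotheses
        for score, output in stacks[k]:
            for length in range(1, len(source) - k + 1):
                span = tuple(source[k:k + length])
                for target, logp in phrase_table.get(span, []):
                    stacks[k + length].append((score + logp, output + [target]))
    return " ".join(max(stacks[len(source)])[1])                 # trace back the best path

print(decode("maria no dio una bofetada a la bruja verde".split()))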

slide-24
SLIDE 24

Local and Non-Local Features

[Table: a worked scoring example for the input “Peter left for home .” and the hypothesis “Petr odešel domů .”: per-phrase values of the word penalty, bigram log probability, phrase penalty and phrase log probability, their per-feature totals, the feature weights, and the resulting weighted total score.]
  • Local features decompose along hypothesis construction.

– Phrase- and word-based features.

  • Non-local features span the boundaries (e.g. LM).

December 2018 MT2: PBMT, NMT 23

slide-25
SLIDE 25

Weight Optimization: MERT Loop

  • Minimum Error Rate Training (Och, 2003)

December 2018 MT2: PBMT, NMT 24

slide-26
SLIDE 26

Effects of Weights

  • Higher phrase penalty chops sentence into more segments.
  • Too strong LM weight leads to words dropped.
  • Negative LM weight leads to obscure wordings.

December 2018 MT2: PBMT, NMT 25

slide-27
SLIDE 27

Summary of PBMT

Phrase-based MT:

  • is a log-linear model
  • assumes phrases relatively independent of each other
  • decomposes sentence into contiguous phrases
  • search has two parts:

    – lookup of all relevant translation options
    – stack-based beam search, gradually expanding hypotheses

To train a PBMT system:

  • 1. Align words.
  • 2. Extract (and score) phrases consistent with word alignment.
  • 3. Optimize weights (MERT).

December 2018 MT2: PBMT, NMT 26

slide-28
SLIDE 28

1: Align Training Sentences

Nemám žádného psa. = I have no dog.    Viděl kočku. = He saw a cat.

December 2018 MT2: PBMT, NMT 27

slide-29
SLIDE 29

2: Align Words

Nemám žádného psa. = I have no dog.    Viděl kočku. = He saw a cat.

December 2018 MT2: PBMT, NMT 28

slide-30
SLIDE 30

3: Extract Phrase Pairs (MTUs)

Nemám žádného psa. = I have no dog.    Viděl kočku. = He saw a cat.

December 2018 MT2: PBMT, NMT 29

slide-31
SLIDE 31

4: New Input

Nemám žádného psa. = I have no dog.    Viděl kočku. = He saw a cat.    New input: Nemám kočku.

December 2018 MT2: PBMT, NMT 30

slide-32
SLIDE 32

4: New Input

Nemám žádného psa. = I have no dog.    Viděl kočku. = He saw a cat.    New input: Nemám kočku.

... I don't have a cat.

December 2018 MT2: PBMT, NMT 31

slide-33
SLIDE 33

5: Pick Probable Phrase Pairs (TM)

Nemám žádného psa. = I have no dog.    Viděl kočku. = He saw a cat.    New input: Nemám kočku.

... I don't have a cat.

New input: Nemám → I have

December 2018 MT2: PBMT, NMT 32

slide-34
SLIDE 34

6: So That n-Grams Probable (LM)

Nemám žádného psa. = I have no dog.    Viděl kočku. = He saw a cat.    New input: Nemám kočku.

... I don't have a cat.

New input: Nemám → I have    kočku. → a cat.

December 2018 MT2: PBMT, NMT 33

slide-35
SLIDE 35

Meaning Got Reversed!

Nemám žádného psa. = I have no dog.    Viděl kočku. = He saw a cat.    New input: Nemám kočku.

... I don't have a cat.

New input: Nemám → I have    kočku. → a cat.

December 2018 MT2: PBMT, NMT 34

slide-36
SLIDE 36

What Went Wrong?

  \hat{e}_1^{\hat{I}} = argmax_{I, e_1^I} p(f_1^J | e_1^I) p(e_1^I)
                      = argmax_{I, e_1^I} \prod_{(\tilde{f}, \tilde{e}) ∈ phrase pairs of (f_1^J, e_1^I)} p(\tilde{f} | \tilde{e}) · p(e_1^I)    (19)

  • Too strong a phrase-independence assumption.
    – Phrases do depend on each other. Here “nemám” and “žádného” jointly express one negation.
    – Word alignments ignored that dependence. But adding it would increase data sparseness.
  • The language model is a separate unit.
    – p(e_1^I) models the target sentence independently of f_1^J.

December 2018 MT2: PBMT, NMT 35

slide-37
SLIDE 37

Redefining p(e_1^I | f_1^J)

What if we modelled p(e_1^I | f_1^J) directly, word by word:

  p(e_1^I | f_1^J) = p(e_1, e_2, ..., e_I | f_1^J)
                   = p(e_1 | f_1^J) · p(e_2 | e_1, f_1^J) · p(e_3 | e_2, e_1, f_1^J) ...
                   = \prod_{i=1}^{I} p(e_i | e_1, ..., e_{i-1}, f_1^J)    (20)

... this is “just a cleverer language model”: p(e_1^I) = \prod_{i=1}^{I} p(e_i | e_1, ..., e_{i-1})

Main benefit: all dependencies are available. But what technical device can learn this?

December 2018 MT2: PBMT, NMT 36

slide-38
SLIDE 38

NNs: Universal Approximators

  • A neural network with a single hidden layer (possibly huge) can

approximate any continuous function to any precision.

  • (Nothing claimed about learnability.)

https://www.quora.com/How-can-a-deep-neural-network-with-ReLU-activations-in-its-hidden-layers-approximate-an

December 2018 MT2: PBMT, NMT 37

slide-39
SLIDE 39

playground.tensorflow.org

December 2018 MT2: PBMT, NMT 38

slide-40
SLIDE 40

Perfect Features

December 2018 MT2: PBMT, NMT 39

slide-41
SLIDE 41

Bad Features & Low Depth

December 2018 MT2: PBMT, NMT 40

slide-42
SLIDE 42

Too Complex NN Fails to Learn

December 2018 MT2: PBMT, NMT 41

slide-43
SLIDE 43

Deep NNs for Image Classification

December 2018 MT2: PBMT, NMT 42

slide-44
SLIDE 44

Representation Learning

  • Based on training data (sample inputs and expected outputs),
    the neural network learns by itself
    what is important in the inputs
    to predict the outputs best.

A “representation” is a new set of axes.

  • Instead of 3 dimensions (x, y, color), we get
  • 2000 dimensions: (elephantity, number of storks, blueness, . . . )
  • designed automatically to help in best prediction of the output

December 2018 MT2: PBMT, NMT 43

slide-45
SLIDE 45

One Layer tanh(Wx + b), 2D→2D

  Skew: W     Translate: b     Non-linearity: tanh

Animation by http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

December 2018 MT2: PBMT, NMT 44

slide-46
SLIDE 46

Four Layers, Disentangling Spirals

Animation by http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

December 2018 MT2: PBMT, NMT 45

slide-47
SLIDE 47

Processing Text with NNs

  • Map each word to a vector of 0s and 1s (“1-hot repr.”):

cat → (0, 0, . . . , 0, 1, 0, . . . , 0)

  • Sentence is then a matrix:

    [Figure: a vocabulary × sentence-length matrix for “the cat is on the mat”; each column is the 1-hot vector of one token, with a single 1 in the row of that word. Vocabulary size: 1.3M for English, 2.2M for Czech.]

Main drawback: No relations, all words equally close/far.

December 2018 MT2: PBMT, NMT 46
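A small sketch of the 1-hot representation above; the eight-word vocabulary is of course a toy stand-in for the millions of word forms mentioned on the slide.

import numpy as np

vocab = ["a", "about", "cat", "is", "mat", "on", "the", "zebra"]
word2id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word2id[word]] = 1.0
    return v

# A sentence becomes a |vocabulary| x length matrix, one 1-hot column per token.
sentence = "the cat is on the mat".split()
matrix = np.stack([one_hot(w) for w in sentence], axis=1)
print(matrix.shape)   # (8, 6)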


slide-50
SLIDE 50

Solution: Word Embeddings

  • Map each word to a dense vector.
  • In practice 300–2000 dimensions are used, not 1–2M.

– The dimensions have no clear interpretation.

  • Embeddings are trained for each particular task.

– NNs: The matrix that maps 1-hot input to the first layer.

  • The famous word2vec (Mikolov et al., 2013):

    – CBOW: Predict the word from its four neighbours.
    – Skip-gram: Predict likely neighbours given the word.

[Figure: CBOW with a single-word context: input layer x_1 ... x_V, hidden layer h_1 ... h_N, output layer y_1 ... y_V, with weight matrices W_{V×N} = {w_ki} and W'_{N×V} = {w'_ij}.]

Right: CBOW with just a single-word context (http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf)

December 2018 MT2: PBMT, NMT 49

slide-51
SLIDE 51

Continuous Space of Words

Word2vec embeddings show interesting properties: v(king) − v(man) + v(woman) ≈ v(queen) (21)

Illustrations from https://www.tensorflow.org/tutorials/word2vec

December 2018 MT2: PBMT, NMT 50
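The analogy in Equation (21) amounts to a nearest-neighbour search in the embedding space. The 3-dimensional vectors below are invented purely to make the arithmetic visible; real word2vec embeddings have hundreds of dimensions and are learned from large corpora.

import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "cat":   np.array([0.1, 0.9, 0.5]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)   # "queen" for these toy vectors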

slide-52
SLIDE 52

Further Compression: Sub-Words

  • SMT struggled with productive morphology (>1M wordforms).

nejneobhospodařovávatelnějšími, Donaudampfschifffahrtsgesellschaftskapitän

  • NMT can handle only 30–80k dictionaries.

⇒ Resort to sub-word units.

  Orig:        český politik svezl migranty
  Syllables:   čes ký ⊔ po li tik ⊔ sve zl ⊔ mig ran ty
  Morphemes:   česk ý ⊔ politik ⊔ s vez l ⊔ migrant y
  Char Pairs:  če sk ý ⊔ po li ti k ⊔ sv ez l ⊔ mi gr an ty
  Chars:       č e s k ý ⊔ p o l i t i k ⊔ s v e z l ⊔ m i g r a n t y
  BPE 30k:     český politik s@@ vez@@ l mi@@ granty

BPE (Byte-Pair Encoding) uses the n most common substrings (incl. frequent words).

December 2018 MT2: PBMT, NMT 51
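A minimal sketch of how BPE merges are learned, in the spirit of the byte-pair encoding procedure referenced above; the four-word toy vocabulary and the number of merges are made up, and real implementations learn tens of thousands of merges from the full training corpus.

import re
from collections import Counter

def pair_stats(vocab):
    # count adjacent symbol pairs over a {space-separated word: frequency} vocabulary
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # replace every occurrence of the symbol pair by its concatenation
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words are sequences of characters plus an end-of-word marker </w>.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(8):                       # learn 8 merges
    stats = pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)
    vocab = merge_pair(best, vocab)
    print("merged:", best)

Frequent character sequences (and eventually whole frequent words) become single symbols, so a fixed sub-word vocabulary can cover an open set of word forms.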

slide-53
SLIDE 53

Variable-Length Inputs

Variable-length input can be handled by recurrent NNs:

  • Reading one input symbol at a time.

– The same (trained) transformation A used every time.

  • Unroll in time (up to a fixed length limit).

Vanilla RNN:  h_t = tanh(W [h_{t−1}; x_t] + b)    (22)

December 2018 MT2: PBMT, NMT 52
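A numpy sketch of the vanilla RNN step in Equation (22), unrolled over a short random input sequence; the dimensions and random parameters are arbitrary illustrations, not trained values.

import numpy as np

hidden_dim, emb_dim = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(hidden_dim, hidden_dim + emb_dim))
b = np.zeros(hidden_dim)

def rnn_step(h_prev, x_t):
    # h_t = tanh(W [h_{t-1}; x_t] + b), cf. Equation (22)
    return np.tanh(W @ np.concatenate([h_prev, x_t]) + b)

h = np.zeros(hidden_dim)                      # initial state
for x_t in rng.normal(size=(5, emb_dim)):     # unroll over 5 input symbols
    h = rnn_step(h, x_t)                      # the same transformation applied every time
print(h)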

slide-54
SLIDE 54

Neural Language Model

  • estimate probability of a sentence using the chain rule
  • output distributions can be used for sampling

Thanks to Jindřich Libovický for the slides.

December 2018 MT2: PBMT, NMT 53

slide-55
SLIDE 55

Sampling from a LM

  • “Autoregressive decoder” = conditioned on its preceding output.

December 2018 MT2: PBMT, NMT 54

slide-56
SLIDE 56

Autoregressive Decoding

# Greedy autoregressive decoding (pseudocode); dec_cell, target_embeddings,
# output_projection and the initial decoder state are assumed to be defined elsewhere.
last_w = "<s>"
while last_w != "</s>":
    last_w_embedding = target_embeddings[last_w]
    state, dec_output = dec_cell(state, last_w_embedding)
    logits = output_projection(dec_output)
    last_w = np.argmax(logits)   # index of the most probable next word
    yield last_w

December 2018 MT2: PBMT, NMT 55

slide-57
SLIDE 57

RNN Training vs. Runtime

runtime: ŷ_j (decoded)   ×   training: y_j (ground truth)

[Figure: an unrolled encoder-decoder; during training the decoder is fed the ground-truth previous word y_j and a loss is computed against the reference, while at runtime it is fed its own previous prediction ŷ_j.]

December 2018 MT2: PBMT, NMT 56

slide-58
SLIDE 58

NNs as Translation Model in SMT

Cho et al. (2014) proposed:

  • encoder-decoder architecture and
  • GRU unit (name given later by Chung et al. (2014))
  • to score variable-length phrase pairs in PBMT.

[Figure: an RNN encoder reads x_1 ... x_T into a fixed-size vector c; the decoder generates y_1 ... y_T' conditioned on c.]

December 2018 MT2: PBMT, NMT 57

slide-59
SLIDE 59

⇒ Embeddings of Phrases

December 2018 MT2: PBMT, NMT 58

slide-60
SLIDE 60

⇒ Syntactic Similarity (“of the”)

December 2018 MT2: PBMT, NMT 59

slide-61
SLIDE 61

⇒ Semantic Similarity (Countries)

December 2018 MT2: PBMT, NMT 60

slide-62
SLIDE 62

NMT: Sequence to Sequence

Sutskever et al. (2014) use:

  • LSTM RNN encoder-decoder
  • to consume and produce variable-length sentences.

First the Encoder:

December 2018 MT2: PBMT, NMT 61

slide-63
SLIDE 63

Then the Decoder

Remember:  p(e_1^I | f_1^J) = p(e_1 | f_1^J) · p(e_2 | e_1, f_1^J) · p(e_3 | e_2, e_1, f_1^J) ...

  • Again RNN, producing one word at a time.
  • The produced word fed back into the network.

– (Word embeddings in the target language used here.)

December 2018 MT2: PBMT, NMT 62

slide-64
SLIDE 64

Encoder-Decoder Architecture

https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/

December 2018 MT2: PBMT, NMT 63

slide-65
SLIDE 65

Continuous Space of Sentences

[Figure: 2-D PCA projection with two clusters of sentence vectors: “I gave her a card in the garden” / “In the garden , I gave her a card” / “She was given a card by me in the garden” vs. “She gave me a card in the garden” / “In the garden , she gave me a card” / “I was given a card by her in the garden”.]

2-D PCA projection of 8000-D space representing sentences (Sutskever et al., 2014). December 2018 MT2: PBMT, NMT 64

slide-66
SLIDE 66

Architectures in the Decoder

  • RNN – original sequence-to-sequence learning (2015)

    – principle known since 2014 (University of Montreal)
    – made usable in 2016 (University of Edinburgh)

  • CNN – convolution sequence-to-sequence by Facebook (2017)
  • Self-attention (so called Transformer) by Google (2017)

December 2018 MT2: PBMT, NMT 65

slide-67
SLIDE 67

Attention (1/3)

  • Arbitrary-length sentences fit badly into a fixed vector.
  • Reading input backward works better.

    . . . because early words will be more salient.

⇒ Use a bi-directional RNN and “attend” to all states h_i.

December 2018 MT2: PBMT, NMT 66

slide-68
SLIDE 68

Attention (2/3)

  • Add a sub-network predicting the importance of source states at each step.

December 2018 MT2: PBMT, NMT 67

slide-69
SLIDE 69

Attention (3/3)

[Figure: the decoder state s_{i−1} attends over bidirectional encoder states h_0 ... h_4 (for inputs <s> x_1 ... x_4); the states are weighted by attention weights α_1 ... α_4 and summed into a context vector used to produce ~y_i and the next state s_i.]

December 2018 MT2: PBMT, NMT 68

slide-70
SLIDE 70

Attention Model in Equations (1)

Inputs:
  decoder state s_i, encoder states h_j = [→h_j; ←h_j]   ∀j = 1 ... T_x

Attention energies:      e_ij = v_a^T tanh(W_a s_{i−1} + U_a h_j + b_a)

Attention distribution:  α_ij = exp(e_ij) / \sum_{k=1}^{T_x} exp(e_ik)

Context vector:          c_i = \sum_{j=1}^{T_x} α_ij h_j

December 2018 MT2: PBMT, NMT 69

slide-71
SLIDE 71

Attention Model in Equations (2)

Output projection:    t_i = MLP(U_o s_{i−1} + V_o E y_{i−1} + C_o c_i + b_o)
                      ... the attention is mixed with the hidden state

Output distribution:  p(y_i = k | s_i, y_{i−1}, c_i) ∝ exp((W_o t_i)_k + b_k)

December 2018 MT2: PBMT, NMT 70
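A numpy sketch of the attention equations on the two slides above (energies, softmax distribution and context vector); all dimensions and parameter values are random placeholders rather than trained model weights.

import numpy as np

rng = np.random.default_rng(1)
Tx, enc_dim, dec_dim, att_dim = 5, 6, 4, 3

h = rng.normal(size=(Tx, enc_dim))        # encoder states h_1 .. h_Tx
s_prev = rng.normal(size=dec_dim)         # previous decoder state s_{i-1}

W_a = rng.normal(size=(att_dim, dec_dim))
U_a = rng.normal(size=(att_dim, enc_dim))
b_a = np.zeros(att_dim)
v_a = rng.normal(size=att_dim)

# e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j + b_a)
energies = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j + b_a) for h_j in h])

# alpha_ij = softmax of the energies over the source positions
alphas = np.exp(energies) / np.exp(energies).sum()

# c_i = sum_j alpha_ij h_j
context = alphas @ h
print(alphas.round(3), context.round(3))

Collecting the alphas for every decoder step gives the attention matrix shown on the next slide, which often resembles a word alignment.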

slide-72
SLIDE 72

Attention ≈ Alignment

  • We can collect the attention across time.
  • Each column corresponds to one decoder time step.
  • Source tokens correspond to rows.

December 2018 MT2: PBMT, NMT 71

slide-73
SLIDE 73

Ultimate Goal of SMT vs. NMT

Goal of “classical” SMT: Find minimum translation units ∼ graph partitions:

  • such that they are frequent across many sentence pairs.
  • without imposing (too hard) constraints on reordering.
  • in an unsupervised fashion.

Goal of neural MT: Avoid minimum translation units. Find NN architecture that

  • Reads input in as original form as possible.
  • Produces output in as final form as possible.
  • Can be optimized end-to-end in practice.

December 2018 MT2: PBMT, NMT 72

slide-74
SLIDE 74

Is NMT That Much Better?

The outputs of this year’s best system: http://matrix.statmt.org/

SRC  A 28-year-old chef who had recently moved to San Francisco was found dead in the stairwell of a local mall this week.
MT   Osmadvacetiletý kuchař, který se nedávno přestěhoval do San Francisca, byl tento týden nalezen mrtvý na schodišti místního obchodního centra.
REF  Osmadvacetiletý šéfkuchař, který se nedávno přistěhoval do San Franciska, byl tento týden ∅ schodech místního obchodu.

SRC  There were creative differences on the set and a disagreement.
REF  Došlo ke vzniku kreativních rozdílů na scéně a k neshodám.
MT   Na place byly tvůrčí rozdíly a neshody.

December 2018 MT2: PBMT, NMT 73


slide-76
SLIDE 76

Luckily ;-) Bad Errors Happen

SRC ... said Frank initially stayed in hostels...
MT  ... řekl, že Frank původně zůstal v Budějovicích...
SRC Most of the Clintons’ income...
MT  Většinu příjmů Kliniky...
SRC The 63-year-old has now been made a special representative
MT  63letý mladík se nyní stal zvláštním zástupcem...
SRC He listened to the moving stories of the women.
MT  Naslouchal pohyblivým příběhům žen.

December 2018 MT2: PBMT, NMT 75

slide-77
SLIDE 77

Catastrophic Errors

SRC Criminal Minds star Thomas Gibson sacked after hitting producer
REF Thomas Gibson, hvězda seriálu Myšlenky zločince, byl propuštěn poté, co uhodil režiséra
MT  Kriminalisté Minsku hvězdu Thomase Gibsona vyhostili po zásahu producenta
SRC ...add to that its long-standing grudge...
REF ...přidejte k tomu svou dlouholetou nenávist...
MT  ...přidejte k tomu svou dlouholetou záštitu...   (grudge → zášť → záštita)

December 2018 MT2: PBMT, NMT 76

slide-78
SLIDE 78

German→Czech SMT vs. NMT

  • A smaller dataset, very first (but comparable) results.
  • NMT performs better on average, but occasionally:

SRC Das Spektakel ähnelt dem Eurovision Song Contest.
REF Je to jako pěvecká soutěž Eurovision.
SMT Podívanou připomíná hudební soutěž Eurovize.
NMT Divadlo se podobá Eurovizi Conview.

SRC Erderwärmung oder Zusammenstoß mit Killerasteroid.
REF Globální oteplení nebo kolize se zabijáckým asteroidem.
SMT Globální oteplování, nebo srážka s Killerasteroid.
NMT Globální oteplování, nebo střet s zabijákem.

SRC Zu viele verletzte Gefühle.
REF Příliš mnoho nepřátelských pocitů.
SMT Příliš mnoho zraněných pocity.
NMT Příliš mnoho zraněných ∅.

December 2018 MT2: PBMT, NMT 77

slide-79
SLIDE 79

Summary

  • What makes MT statistical.

Two crucially different models covered:

  • Phrase-based: contiguous but independent phrases.

    – Bayes’ Law as a special case of the Log-Linear Model.
    – Hand-crafted features (scoring functions); local vs. non-local.
    – Decoding as search, expanding partial hypotheses.

  • Neural: unit-less, continuous space.

    – NMT as a fancy Language Model.
    – Word embeddings, subwords.
    – RNNs for variable-length input and output.
    – Attention model.

December 2018 MT2: PBMT, NMT 78

slide-80
SLIDE 80

References

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October. Association for Computational Linguistics.

Junyoung Chung, Çağlar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555.

Philipp Koehn. 2003. Noun Phrase Translation. Ph.D. thesis, University of Southern California.

Adam Lopez. 2009. Translation as weighted deduction. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 532–540, Athens, Greece, March. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. Ph.D. thesis, RWTH Aachen University.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proc. of the Association for Computational Linguistics, Sapporo, Japan, July 6-7.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

December 2018 MT2: PBMT, NMT 79