

  1. Machine Translation 2: Statistical MT: Phrase-Based and Neural
     Ondřej Bojar (bojar@ufal.mff.cuni.cz)
     Institute of Formal and Applied Linguistics
     Faculty of Mathematics and Physics, Charles University, Prague
     December 2018

  2. Outline of Lectures on MT
     1. Introduction.
        • Why is MT difficult.
        • MT evaluation.
        • Approaches to MT.
        • First peek into phrase-based MT.
        • Document, sentence and word alignment.
     2. Statistical Machine Translation.
        • Phrase-based: Assumptions, beam search, key issues.
        • Neural MT: Sequence-to-sequence, attention, self-attentive.
     3. Advanced Topics.
        • Linguistic Features in SMT and NMT.
        • Multilinguality, Multi-Task, Learned Representations.

  3. Outline of MT Lecture 2
     1. What makes MT statistical.
        • Brute-force statistical MT.
        • Noisy channel model.
        • Log-linear model.
     2. Phrase-based translation model.
        • Phrase extraction.
        • Decoding (gradual construction of hypotheses).
        • Minimum error-rate training (weight optimization).
     3. Neural machine translation (NMT).
        • Sequence-to-sequence, with attention.

  4. Quotes
     Warren Weaver (1949): I have a text in front of me which is written in Russian but I am going to pretend that it is really written in English and that it has been coded in some strange symbols. All I need to do is strip off the code in order to retrieve the information contained in the text.
     Noam Chomsky (1969): . . . the notion “probability of a sentence” is an entirely useless one, under any known interpretation of this term.
     Frederick Jelinek (1980s; IBM; later JHU and sometimes ÚFAL): Every time I fire a linguist, the accuracy goes up.
     Hermann Ney (RWTH Aachen University): MT = Linguistic Modelling + Statistical Decision Theory

  5. The Statistical Approach
     (Statistical = Information-theoretic.)
     • Specify a probabilistic model.
       = How the probability mass is distributed among possible outputs given observed inputs.
     • Specify the training criterion and procedure.
       = How to learn free parameters from training data.
     Notice:
     • Linguistics is helpful when designing the models:
       – How to divide the input into smaller units.
       – Which bits of the observations are more informative.

  6. Statistical MT
     Given a source (foreign) language sentence $f_1^J = f_1 \ldots f_j \ldots f_J$,
     produce a target language (English) sentence $e_1^I = e_1 \ldots e_i \ldots e_I$.
     Among all possible target language sentences, choose the sentence with the highest probability:

       $\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I,\, e_1^I} \; p(e_1^I \mid f_1^J)$   (1)

     We stick to the $e_1^I$, $f_1^J$ notation despite translating from English to Czech.

  7. Brute-Force MT (1/2)
     Translate only sentences listed in a “translation memory” (TM):
       Good morning. = Dobré ráno.
       How are you? = Jak se máš?
       How are you? = Jak se máte?

       $p(e_1^I \mid f_1^J) = \begin{cases} 1 & \text{if the pair } (e_1^I, f_1^J) \text{ is seen in the TM} \\ 0 & \text{otherwise} \end{cases}$   (2)

     Any problems with the definition?
     • Not a probability. There may be $f_1^J$ such that $\sum_{e_1^I} p(e_1^I \mid f_1^J) > 1$.
       ⇒ Have to normalize: use $\frac{\mathrm{count}(e_1^I, f_1^J)}{\mathrm{count}(f_1^J)}$ instead of 1.
     • Not “smooth”, no generalization:
       Good morning. ⇒ Dobré ráno.

  8. Brute-Force MT (2/2)
     Translate only sentences listed in a “translation memory” (TM):
       Good morning. = Dobré ráno.
       How are you? = Jak se máš?
       How are you? = Jak se máte?

       $p(e_1^I \mid f_1^J) = \begin{cases} 1 & \text{if the pair } (e_1^I, f_1^J) \text{ is seen in the TM} \\ 0 & \text{otherwise} \end{cases}$   (3)

     • Not a probability. There may be $f_1^J$ such that $\sum_{e_1^I} p(e_1^I \mid f_1^J) > 1$.
       ⇒ Have to normalize: use $\frac{\mathrm{count}(e_1^I, f_1^J)}{\mathrm{count}(f_1^J)}$ instead of 1.
     • Not “smooth”, no generalization:
       Good morning. ⇒ Dobré ráno.
       Good evening. ⇒ ∅
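
To make the normalization concrete, here is a minimal sketch (not from the slides) of the count-based estimate $\mathrm{count}(e_1^I, f_1^J)/\mathrm{count}(f_1^J)$ over the toy translation memory above. The `tm_probability` helper and the Czech rendering of “Good evening.” are illustrative assumptions.

```python
from collections import Counter

# Toy translation memory from the slide: (source, target) sentence pairs.
TM = [
    ("Good morning.", "Dobré ráno."),
    ("How are you?", "Jak se máš?"),
    ("How are you?", "Jak se máte?"),
]

pair_counts = Counter(TM)                  # count(e, f), keyed as (f, e)
source_counts = Counter(f for f, _ in TM)  # count(f)

def tm_probability(target, source):
    """Relative-frequency estimate count(e, f) / count(f); 0.0 for anything unseen."""
    if source_counts[source] == 0:
        return 0.0                         # no generalization: unseen input gets no output
    return pair_counts[(source, target)] / source_counts[source]

print(tm_probability("Jak se máš?", "How are you?"))    # 0.5
print(tm_probability("Dobré ráno.", "Good morning."))   # 1.0
print(tm_probability("Dobrý večer.", "Good evening."))  # 0.0 -- the "not smooth" problem
```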

  9. Bayes’ Law
     Bayes’ law for conditional probabilities:

       $p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)}$

     So in our case:

       $\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I,\, e_1^I} \; p(e_1^I \mid f_1^J)$                                (apply Bayes’ law)
              $= \operatorname*{argmax}_{I,\, e_1^I} \; \frac{p(f_1^J \mid e_1^I)\, p(e_1^I)}{p(f_1^J)}$                 ($p(f_1^J)$ is constant ⇒ irrelevant in the maximization)
              $= \operatorname*{argmax}_{I,\, e_1^I} \; p(f_1^J \mid e_1^I)\, p(e_1^I)$

     Also called the “Noisy Channel” model.

  10. Motivation for Noisy Channel

        $\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I,\, e_1^I} \; p(f_1^J \mid e_1^I)\, p(e_1^I)$   (4)

      Bayes’ law divided the model into components:
      • $p(f_1^J \mid e_1^I)$ . . . translation model (“reversed”, $e_1^I \to f_1^J$): is it a likely translation?
      • $p(e_1^I)$ . . . language model (LM): is the output a likely sentence of the target language?
      • The components can be trained on different sources. There are far more monolingual data ⇒ the language model is more reliable.
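
A hedged sketch of the noisy-channel search in Equation (4): score each candidate target sentence by $p(f_1^J \mid e_1^I)\, p(e_1^I)$ and keep the best. All probability values here are invented, solely to show how the two components combine.

```python
# Invented model tables (not real estimates), just to illustrate the combination.
p_f_given_e = {  # reversed translation model p(f | e)
    ("Dobré ráno.", "Good morning."): 0.7,
    ("Dobré ráno.", "Good Morning!"): 0.6,
}
p_e = {          # language model p(e)
    "Good morning.": 0.010,
    "Good Morning!": 0.002,
}

def noisy_channel_best(f, candidates):
    """argmax over candidate targets e of p(f | e) * p(e), as in Equation (4)."""
    return max(candidates,
               key=lambda e: p_f_given_e.get((f, e), 0.0) * p_e.get(e, 0.0))

print(noisy_channel_best("Dobré ráno.", ["Good morning.", "Good Morning!"]))
# -> Good morning.   (0.7 * 0.010 = 0.007 beats 0.6 * 0.002 = 0.0012)
```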

  11. Without Equations
      [Diagram: Parallel Texts feed the Translation Model and Monolingual Texts feed the Language Model; both enter the Global Search, which maps the Input to the Output, i.e. the sentence with the highest probability.]

  12. Summary of Language Models
      • $p(e_1^I)$ should report how “good” sentence $e_1^I$ is.
      • We surely want p(The the the.) < p(Hello.)
      • How about p(The cat was black.) < p(Hello.)?
        . . . We don’t really care in MT. We hope to compare synonymous sentences.
      The LM is usually a 3-gram language model:

        p(⟨s⟩ ⟨s⟩ The cat was black . ⟨/s⟩ ⟨/s⟩) = p(The | ⟨s⟩ ⟨s⟩) · p(cat | ⟨s⟩ The) · p(was | The cat) · p(black | cat was) · p(. | was black) · p(⟨/s⟩ | black .) · p(⟨/s⟩ | . ⟨/s⟩)

      Formally, with $n = 3$:

        $p_{\mathrm{LM}}(e_1^I) = \prod_{i=1}^{I} p(e_i \mid e_{i-n+1}^{i-1})$   (5)
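
A tiny sketch that spells out the 3-gram factorization above: pad the sentence with boundary markers and list the conditional factors $p(e_i \mid e_{i-2}, e_{i-1})$. The boundary symbols `<s>`/`</s>` and the helper name are assumptions used only for illustration.

```python
def trigram_factors(sentence):
    """List the (word, history) pairs whose conditionals multiply to p_LM(sentence)."""
    words = ["<s>", "<s>"] + sentence.split() + ["</s>", "</s>"]
    return [(words[i], (words[i - 2], words[i - 1])) for i in range(2, len(words))]

for word, history in trigram_factors("The cat was black ."):
    print(f"p({word} | {' '.join(history)})")
# p(The | <s> <s>), p(cat | <s> The), p(was | The cat), p(black | cat was),
# p(. | was black), p(</s> | black .), p(</s> | . </s>)
```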

  13. Estimating and Smoothing LM

        $p(w_1) = \frac{\mathrm{count}(w_1)}{\text{total words observed}}$                      unigram probabilities
        $p(w_2 \mid w_1) = \frac{\mathrm{count}(w_1 w_2)}{\mathrm{count}(w_1)}$                 bigram probabilities
        $p(w_3 \mid w_2, w_1) = \frac{\mathrm{count}(w_1 w_2 w_3)}{\mathrm{count}(w_1 w_2)}$    trigram probabilities

      Unseen n-grams ($p(\mathrm{ngram}) = 0$) are a big problem; a single one invalidates the whole sentence:

        $p_{\mathrm{LM}}(e_1^I) = \cdots \cdot 0 \cdot \cdots = 0$

      ⇒ Back off to shorter n-grams:

        $p_{\mathrm{LM}}(e_1^I) = \prod_{i=1}^{I} \big( 0.8 \cdot p(e_i \mid e_{i-1}, e_{i-2}) + 0.15 \cdot p(e_i \mid e_{i-1}) + 0.049 \cdot p(e_i) + 0.001 \big) \neq 0$   (6)
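
A minimal sketch of the interpolation in Equation (6): relative-frequency unigram/bigram/trigram estimates combined with the weights 0.8 / 0.15 / 0.049 / 0.001 from the slide. The two-sentence training corpus is a toy assumption; a real LM would be estimated from large monolingual data.

```python
from collections import Counter

corpus = ("<s> <s> the cat was black . </s> </s> "
          "<s> <s> the cat sat . </s> </s>").split()

uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
total = len(corpus)

def p_interp(w, h1, h2):
    """0.8*p(w|h1,h2) + 0.15*p(w|h2) + 0.049*p(w) + 0.001, as in Equation (6)."""
    p3 = tri[(h1, h2, w)] / bi[(h1, h2)] if bi[(h1, h2)] else 0.0
    p2 = bi[(h2, w)] / uni[h2] if uni[h2] else 0.0
    p1 = uni[w] / total
    return 0.8 * p3 + 0.15 * p2 + 0.049 * p1 + 0.001

def p_lm(sentence):
    """Product of interpolated trigram probabilities over the padded sentence."""
    words = ["<s>", "<s>"] + sentence.split() + ["</s>", "</s>"]
    prob = 1.0
    for i in range(2, len(words)):
        prob *= p_interp(words[i], words[i - 2], words[i - 1])
    return prob

print(p_lm("the cat was black ."))  # high: all its trigrams occur in the toy corpus
print(p_lm("the dog was black ."))  # lower, but non-zero thanks to the back-off terms
```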

  14. From Bayes to Log-Linear Model
      Och (2002) discusses some problems of the noisy-channel model above:
      • Models are estimated unreliably ⇒ maybe the LM is more important:

          $\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I,\, e_1^I} \; p(f_1^J \mid e_1^I)\, \big(p(e_1^I)\big)^2$   (7)

      • In practice, the “direct” translation model is equally good:

          $\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I,\, e_1^I} \; p(e_1^I \mid f_1^J)\, p(e_1^I)$   (8)

      • It is complicated to correctly introduce other dependencies.
      ⇒ Use a log-linear model instead.

  15. Log-Linear Model (1)
      • $p(e_1^I \mid f_1^J)$ is modelled as a weighted combination of models, called “feature functions” $h_1(\cdot,\cdot) \ldots h_M(\cdot,\cdot)$:

          $p(e_1^I \mid f_1^J) = \frac{\exp\left(\sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J)\right)}{\sum_{I',\, e_1'^{I'}} \exp\left(\sum_{m=1}^{M} \lambda_m h_m(e_1'^{I'}, f_1^J)\right)}$   (9)

      • Each feature function $h_m(e, f)$ relates the source $f$ to the target $e$. E.g. the feature for the n-gram language model:

          $h_{\mathrm{LM}}(f_1^J, e_1^I) = \log \prod_{i=1}^{I} p(e_i \mid e_{i-n+1}^{i-1})$   (10)

      • The model weights $\lambda_1^M$ specify the relative importance of the features.
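
A short sketch of the unnormalized log-linear score from Equation (9): the weighted feature sum $\sum_m \lambda_m h_m(e_1^I, f_1^J)$, exponentiated. Feature values and weights are invented; in a real system $h_{\mathrm{TM}}$ and $h_{\mathrm{LM}}$ would be log-probabilities from trained models and the weights would be tuned, e.g. by minimum error-rate training.

```python
import math

def loglinear_score(features, weights):
    """exp(sum_m lambda_m * h_m(e, f)); the denominator in Eq. (9) is constant in e."""
    return math.exp(sum(weights[name] * value for name, value in features.items()))

# Invented feature values h_m(e, f) for two candidate translations of one source.
candidates = {
    "Good morning.": {"h_TM": math.log(0.7), "h_LM": math.log(0.010)},
    "Good Morning!": {"h_TM": math.log(0.6), "h_LM": math.log(0.002)},
}
weights = {"h_TM": 1.0, "h_LM": 1.0}  # the lambdas

best = max(candidates, key=lambda e: loglinear_score(candidates[e], weights))
print(best)  # with equal unit weights this coincides with the noisy-channel choice
```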

  16. Log-Linear Model (2)
      As before, the constant denominator is not needed in the maximization:

          $\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I,\, e_1^I} \; \frac{\exp\left(\sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J)\right)}{\sum_{I',\, e_1'^{I'}} \exp\left(\sum_{m=1}^{M} \lambda_m h_m(e_1'^{I'}, f_1^J)\right)}$
                 $= \operatorname*{argmax}_{I,\, e_1^I} \; \exp\left(\sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J)\right)$   (11)

  17. Relation to Noisy Channel
      With equal weights and only two features:
      • $h_{\mathrm{TM}}(e_1^I, f_1^J) = \log p(f_1^J \mid e_1^I)$ for the translation model,
      • $h_{\mathrm{LM}}(e_1^I, f_1^J) = \log p(e_1^I)$ for the language model,
      the log-linear model reduces to the Noisy Channel:

          $\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I,\, e_1^I} \; \exp\left(\sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J)\right)$
                 $= \operatorname*{argmax}_{I,\, e_1^I} \; \exp\left(h_{\mathrm{TM}}(e_1^I, f_1^J) + h_{\mathrm{LM}}(e_1^I, f_1^J)\right)$   (12)
                 $= \operatorname*{argmax}_{I,\, e_1^I} \; \exp\left(\log p(f_1^J \mid e_1^I) + \log p(e_1^I)\right)$
                 $= \operatorname*{argmax}_{I,\, e_1^I} \; p(f_1^J \mid e_1^I)\, p(e_1^I)$
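
A two-line numerical check of the reduction in Equation (12), with made-up probabilities: exponentiating the sum of the two log-features recovers the plain product $p(f_1^J \mid e_1^I)\, p(e_1^I)$.

```python
import math

p_f_given_e, p_e = 0.7, 0.010  # invented values for a single candidate pair
h_TM, h_LM = math.log(p_f_given_e), math.log(p_e)

# exp(h_TM + h_LM) == p(f|e) * p(e), so the argmax over candidates is unchanged.
print(math.exp(h_TM + h_LM), p_f_given_e * p_e)  # both print ~0.007
```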

  18. Phrase-Based MT Overview
      Example phrase pairs:
        This time around, = Nyní
        they ’re moving = zareagovaly
        even faster = dokonce ještě rychleji
        . . . = . . .
      [Alignment grid: the source “This time around, they ’re moving even faster.” against the output “Nyní zareagovaly dokonce ještě rychleji.”]
      Phrase-based MT: choose such a segmentation of the input string and such phrase “replacements” as to make the output sequence “coherent” (its 3-grams most probable).
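
A heavily simplified, monotone sketch of the idea on this slide: segment the input into known phrases (longest match first) and emit their “replacements” from a toy phrase table built from the example pairs above. A real phrase-based decoder also reorders phrases and scores competing segmentations with a language model inside a beam search; none of that is shown here.

```python
# Toy phrase table (English -> Czech), taken from the example pairs on the slide.
PHRASE_TABLE = {
    ("This", "time", "around", ","): ["Nyní"],
    ("they", "'re", "moving"): ["zareagovaly"],
    ("even", "faster"): ["dokonce", "ještě", "rychleji"],
    (".",): ["."],
}

def greedy_monotone_decode(tokens):
    """Greedily cover the input left to right with the longest known phrase."""
    output, i = [], 0
    while i < len(tokens):
        for length in range(len(tokens) - i, 0, -1):  # prefer longer source phrases
            phrase = tuple(tokens[i:i + length])
            if phrase in PHRASE_TABLE:
                output.extend(PHRASE_TABLE[phrase])
                i += length
                break
        else:
            output.append(tokens[i])  # unknown word: copy it through unchanged
            i += 1
    return " ".join(output)

src = "This time around , they 're moving even faster .".split()
print(greedy_monotone_decode(src))
# -> Nyní zareagovaly dokonce ještě rychleji .
```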
