
Machine Translation 12: (Non-neural) Statistical Machine Translation



  1. Machine Translation 12: (Non-neural) Statistical Machine Translation
     Rico Sennrich, University of Edinburgh

  2. Today's Lecture
     So far, the main focus of the lecture has been on neural machine translation, and research since ≈ 2013. Today, we look at (non-neural) statistical machine translation, and research since ≈ 1990.

  3. Statistical Machine Translation (outline)
     - Basics
     - Phrase-based SMT
     - Hierarchical SMT
     - Syntax-based SMT

  4. Refresher: A probabilistic model of translation
     Suppose that we have:
     - a source sentence S of length m: (x_1, ..., x_m)
     - a target sentence T of length n: (y_1, ..., y_n)
     We can express translation as a probabilistic model:
         T* = argmax_T P(T | S)
            = argmax_T P(S | T) P(T)    (Bayes' theorem)
     We can model translation via two models:
     - a language model to estimate P(T)
     - a translation model to estimate P(S | T)
     Without continuous space representations, how do we estimate P(S | T)?
     → break it up into smaller units
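As a minimal sketch of how this decision rule is used, assuming we can enumerate a candidate set; `tm_logprob` and `lm_logprob` are hypothetical stand-ins for a translation model log P(S|T) and a language model log P(T), not part of the slides:

```python
# Noisy-channel decision rule: pick the candidate maximising P(S|T) P(T).
# In log space the product becomes a sum.
def translate(source, candidates, tm_logprob, lm_logprob):
    return max(candidates,
               key=lambda t: tm_logprob(source, t) + lm_logprob(t))
```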

  5. Word Alignment
     A chicken-and-egg problem. Let's break up P(S | T) into small units (words):
     - we can estimate an alignment given a translation model (expectation step)
     - we can estimate a translation model given an alignment, using relative frequencies (maximization step)
     - what can we do if we have neither?
     Solution: the Expectation Maximization (EM) algorithm
     - initialize the model
     - iterate between estimating the alignment and the translation model
     The simplest model is based on lexical translation; more complex models also consider position and fertility. A sketch of this recipe follows below.
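A minimal EM sketch for the simplest lexical-translation model (IBM Model 1), using the toy corpus from the following slides and ignoring the NULL word:

```python
from collections import defaultdict

corpus = [("la maison".split(), "the house".split()),
          ("la maison bleu".split(), "the blue house".split()),
          ("la fleur".split(), "the flower".split())]

# initialize model: all word translations equally likely
src_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(src_vocab))   # t[(e, f)] ~ p(e | f)

for _ in range(10):
    count = defaultdict(float)                   # expected counts c(e | f)
    total = defaultdict(float)
    for fs, es in corpus:
        for e in es:
            # expectation step: posterior over which source word generated e
            z = sum(t[(e, f)] for f in fs)
            for f in fs:
                count[(e, f)] += t[(e, f)] / z
                total[f] += t[(e, f)] / z
    # maximization step: re-estimate by relative frequency
    for e, f in count:
        t[(e, f)] = count[(e, f)] / total[f]

print(round(t[("the", "la")], 3))   # increases across iterations
```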

  6. Word Alignment: IBM Models [Brown et al., 1993]
     Toy corpus:
         la maison ↔ the house
         la maison bleu ↔ the blue house
         la fleur ↔ the flower
     - Initial step: all alignments equally likely
     - The model learns that, e.g., la is often aligned with the

  7. Word Alignment: IBM Models [Brown et al., 1993]
     (same corpus)
     - After one iteration
     - Alignments, e.g., between la and the, are more likely

  8. Word Alignment: IBM Models [Brown et al., 1993]
     (same corpus)
     - After another iteration
     - It becomes apparent that alignments, e.g., between fleur and flower, are more likely (pigeonhole principle)

  9. Word Alignment: IBM Models [Brown et al., 1993]
     (same corpus)
     - Convergence
     - Inherent hidden structure revealed by EM

  10. Word Alignment: IBM Models [Brown et al., 1993]
      Worked example for the sentence pair la maison ↔ the house.
      Probabilities:
          p(the | la) = 0.7        p(house | la) = 0.05
          p(the | maison) = 0.1    p(house | maison) = 0.8
      Alignments (each English word aligned to one French word):
          the–la, house–maison:      p(e, a | f) = 0.56     p(a | e, f) = 0.824
          the–la, house–la:          p(e, a | f) = 0.035    p(a | e, f) = 0.052
          the–maison, house–maison:  p(e, a | f) = 0.08     p(a | e, f) = 0.118
          the–maison, house–la:      p(e, a | f) = 0.005    p(a | e, f) = 0.007
      Counts (summing the posteriors of the alignments that contain each link):
          c(the | la) = 0.824 + 0.052        c(house | la) = 0.052 + 0.007
          c(the | maison) = 0.118 + 0.007    c(house | maison) = 0.824 + 0.118
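The numbers on this slide can be reproduced by brute-force enumeration of the four alignments; a quick check (values match the slide up to rounding):

```python
from itertools import product

p = {("the", "la"): 0.7,     ("house", "la"): 0.05,
     ("the", "maison"): 0.1, ("house", "maison"): 0.8}

f, e = ["la", "maison"], ["the", "house"]
# each English word aligns to one French word: 2 * 2 = 4 alignments
joint = {a: p[(e[0], f[a[0]])] * p[(e[1], f[a[1]])]
         for a in product(range(2), repeat=2)}
z = sum(joint.values())                 # normalization constant
for a, pj in joint.items():
    # e.g. a = (0, 1): the-la, house-maison -> joint 0.56, posterior 0.824
    print(a, pj, round(pj / z, 3))
```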

  11. Linear Models
      Bayes' theorem:
          T* = argmax_T P(S | T) P(T)
      Generalisation to a linear combination of arbitrary features [Och, 2003]:
          T* ≈ argmax_T Σ_{m=1}^{M} λ_m h_m(S, T)
      - Minimum Error Rate Training (MERT) to optimize the feature weights
      - big trend in SMT research: engineering new/better features
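A sketch of the linear model's scoring rule; the concrete feature functions and weights below are hypothetical placeholders for what MERT would tune:

```python
def linear_score(src, tgt, feature_fns, weights):
    # score(S, T) = sum_m lambda_m * h_m(S, T); the argmax over T of this
    # score replaces the pure noisy-channel objective
    return sum(w * h(src, tgt) for h, w in zip(feature_fns, weights))

# hypothetical features: a length-mismatch penalty plus dummy stand-ins
# for the translation-model and language-model log-probabilities
features = [lambda s, t: -abs(len(s) - len(t)),
            lambda s, t: 0.0,     # stand-in for log P(S | T)
            lambda s, t: 0.0]     # stand-in for log P(T)
weights = [0.5, 1.0, 1.0]         # the lambdas, tuned e.g. by MERT
```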

  12. Word-based SMT
      Core idea: combine a word-based translation model and an n-gram language model to compute the score of a translation.
      Consequences:
      + models are easy to compute
      - word translations are assumed to be independent of each other: only the LM takes context into account
      - poor at modelling long-distance phenomena: n-gram context is limited

  13. Statistical Machine Translation (outline)
      - Basics
      - Phrase-based SMT
      - Hierarchical SMT
      - Syntax-based SMT

  14. Phrase-based SMT
      Core idea: the basic translation unit in the translation model is not the word, but a word sequence (phrase).
      Consequences:
      + much better memorization of frequent phrase translations
      - large (and noisy) phrase table
      - large search space; requires sophisticated pruning
      - still poor at modelling long-distance phenomena
      Example:
          leider ist Herr Steiger nach Köln gefahren
          unfortunately, Mr Steiger has gone to Cologne

  15. Phrase Extraction
      Extraction rules, based on a word-aligned sentence pair:
      - a phrase pair must be compatible with the alignment...
      - ...but unaligned words are ok
      - phrases are contiguous sequences
      [alignment grid: Ich werde Ihnen die entsprechenden Anmerkungen aushändigen ↔ I shall be passing on to you some comments]
      Extracted phrase pair: werde = shall be

  16. Phrase Extraction
      (same extraction rules and alignment grid as above)
      Extracted phrase pair: die entsprechenden Anmerkungen = some comments

  17. Phrase Extraction
      (same extraction rules and alignment grid as above)
      Extracted phrase pair: werde Ihnen die entsprechenden Anmerkungen aushändigen = shall be passing on to you some comments
      A sketch of the extraction algorithm follows below.
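A sketch of the consistency check behind these three slides. This is a simplified version: it does not expand spans over unaligned boundary words, which the full extraction algorithm also does:

```python
def extract_phrases(n_src, n_tgt, links, max_len=7):
    """links: set of (i, j) alignment points, source position i, target j."""
    pairs = []
    for i1 in range(n_src):
        for i2 in range(i1, min(i1 + max_len, n_src)):
            # target span covered by links from the source span [i1, i2]
            js = [j for i, j in links if i1 <= i <= i2]
            if not js or max(js) - min(js) >= max_len:
                continue
            j1, j2 = min(js), max(js)
            # consistency: no link may connect the target span to a
            # source word outside [i1, i2]
            if any(j1 <= j <= j2 and not i1 <= i <= i2 for i, j in links):
                continue
            pairs.append(((i1, i2), (j1, j2)))
    return pairs
```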

  18. Common Features in Phrase-based SMT
      - phrase translation probabilities (in both directions)
      - word translation probabilities (in both directions)
      - language model
      - reordering model
      - constant penalty for each phrase used
      - sparse features with learned cost for some (classes of) phrase pairs
      Multiple models of each type are possible.

  19. Decoding
      [table of translation options for the source sentence 'er geht ja nicht nach hause': each word or span has many candidate translations, e.g. er → he / it, geht → is / goes, nicht → not / does not, nach hause → home]
      The machine translation decoder does not know the right answer:
      - picking the right translation options
      - arranging them in the right order
      → a search problem, solved by heuristic beam search

  20. Decoding
      [hypothesis expansion diagram for 'er geht ja nicht nach hause']
      Pick any translation option, create a new hypothesis.

  21. Decoding
      [hypothesis expansion diagram, continued]
      Create hypotheses for all other translation options.

  22. Decoding
      [hypothesis expansion diagram, continued]
      Also create hypotheses from the created partial hypotheses.

  23. Decoding
      [hypothesis expansion diagram, continued]
      Backtrack from the highest-scoring complete hypothesis.

  24. Decoding
      - large search space (exponential number of hypotheses)
      - reduction of the search space:
          recombination of identical hypotheses
          pruning of hypotheses
      - efficient decoding is a lot more complex in SMT than in neural MT
      A simplified decoder sketch follows below.
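A highly simplified stack-decoding sketch in the spirit of slides 19–24: hypotheses record which source words are covered (as a bitmask) and grow left to right on the target side. Recombination, reordering limits, and future-cost estimates are all omitted, and `lm_score` is a hypothetical language-model interface:

```python
def decode(n_src, options, lm_score, beam=10):
    """options: {(i1, i2): [(target_phrase, model_cost), ...]} per span."""
    # stacks[k] holds hypotheses covering k source words:
    # (score, covered_bitmask, output_words)
    stacks = [[] for _ in range(n_src + 1)]
    stacks[0].append((0.0, 0, ()))
    for k in range(n_src):
        stacks[k].sort(reverse=True)          # histogram pruning
        for score, covered, out in stacks[k][:beam]:
            for (i1, i2), cands in options.items():
                span = ((1 << (i2 - i1 + 1)) - 1) << i1
                if covered & span:            # span already translated
                    continue
                for phrase, cost in cands:
                    stacks[k + i2 - i1 + 1].append(
                        (score + cost + lm_score(out, phrase),
                         covered | span, out + tuple(phrase)))
    # backtracking is implicit: each hypothesis carries its full output
    return max(stacks[n_src])[2] if stacks[n_src] else None
```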

  25. Statistical Machine Translation (outline)
      - Basics
      - Phrase-based SMT
      - Hierarchical SMT
      - Syntax-based SMT

  26. Hierarchical SMT
      Core idea: use context-free grammar (CFG) rules as basic translation units → allows gaps.
      Consequences:
      + better modeling of some reordering patterns:
          leider ist Herr Steiger nach Köln gefahren
          unfortunately, Mr Steiger has gone to Cologne
      - overgeneralisation is still possible, e.g. the erroneous output:
          leider ist Herr Steiger nicht nach Köln gefahren
          unfortunately, Herr Steiger does not has gone to Cologne

  27. Hierarchical Phrase Extraction
      [alignment grid: Ich werde Ihnen die entsprechenden Anmerkungen aushändigen ↔ I shall be passing on to you some comments]
      Subtracting the subphrase Ihnen die entsprechenden Anmerkungen = to you some comments from the larger phrase pair yields the rule:
          werde X aushändigen = shall be passing on X
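The "subtract a subphrase" step can be sketched directly: given a phrase pair and a smaller phrase pair nested inside it, the inner pair is replaced by a linked nonterminal X on both sides. The spans below are inclusive indices into the slide's sentence pair:

```python
def subtract(outer, inner, src, tgt):
    """outer, inner: phrase pairs ((i1, i2), (j1, j2)) with inclusive spans;
    inner must be nested inside outer on both sides."""
    (oi1, oi2), (oj1, oj2) = outer
    (ii1, ii2), (ij1, ij2) = inner
    lhs = src[oi1:ii1] + ["X1"] + src[ii2 + 1:oi2 + 1]
    rhs = tgt[oj1:ij1] + ["X1"] + tgt[ij2 + 1:oj2 + 1]
    return lhs, rhs

src = "Ich werde Ihnen die entsprechenden Anmerkungen aushändigen".split()
tgt = "I shall be passing on to you some comments".split()
outer = ((1, 6), (1, 8))   # werde ... aushändigen = shall be ... comments
inner = ((2, 5), (5, 8))   # Ihnen ... Anmerkungen = to you some comments
print(subtract(outer, inner, src, tgt))
# (['werde', 'X1', 'aushändigen'], ['shall', 'be', 'passing', 'on', 'X1'])
```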

  28. Decoding
      Decoding via (S)CFG derivation:
          ⟨S₁ | S₁⟩
      The derivation starts with a pair of linked S symbols.

  29. Decoding
      Decoding via (S)CFG derivation:
          ⟨S₁ | S₁⟩ ⇒ ⟨S₂ X₃ | S₂ X₃⟩
      Applied rule: S → ⟨S₁ X₂ | S₁ X₂⟩ (glue rule)

  30. Decoding
      Decoding via (S)CFG derivation:
          ⟨S₁ | S₁⟩ ⇒ ⟨S₂ X₃ | S₂ X₃⟩ ⇒ ⟨S₂ X₄ und X₅ | S₂ X₄ and X₅⟩
      Applied rule: X → ⟨X₁ und X₂ | X₁ and X₂⟩
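A sketch of the synchronous derivation itself: each rule application rewrites one linked nonterminal on both sides at once. Nonterminals are written as strings like "S1" or "X3", where the digit is the link index, a representation chosen here purely for brevity:

```python
def apply_rule(pair, link, src_rhs, tgt_rhs):
    # rewrite the nonterminal carrying `link` simultaneously in both strings
    rewrite = lambda side, rhs: [s for sym in side
                                 for s in (rhs if sym == link else [sym])]
    return rewrite(pair[0], src_rhs), rewrite(pair[1], tgt_rhs)

# the derivation from the slides, starting with a pair of linked S symbols
pair = (["S1"], ["S1"])
pair = apply_rule(pair, "S1", ["S2", "X3"], ["S2", "X3"])     # glue rule
pair = apply_rule(pair, "X3",
                  ["X4", "und", "X5"], ["X4", "and", "X5"])   # lexical rule
print(pair)  # (['S2', 'X4', 'und', 'X5'], ['S2', 'X4', 'and', 'X5'])
```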
