

  1. Machine Translation 2: Statistical MT: Neural MT and Representations. Ondřej Bojar (bojar@ufal.mff.cuni.cz), Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague. May 2020.

  2. Outline of Lectures on MT
     1. Introduction.
        • Why is MT difficult.
        • MT evaluation.
        • Approaches to MT.
        – Phrase-Based MT.
        • Document, sentence and esp. word alignment.
        • Classical Statistical Machine Translation.
     2. Neural Machine Translation.
        • Neural MT: sequence-to-sequence, attention, self-attentive.
        • Sentence representations.
        • Role of linguistic features in MT.

  3. Outline of MT Lecture 2
     1. Fundamental problems of PBMT.
     2. Neural machine translation (NMT).
        • Brief summary of NNs.
        • Sequence-to-sequence, with attention.
        • Transformer, self-attention.
        • Linguistic features in NMT.

  4. Summary of PBMT
     Phrase-based MT:
     • is a log-linear model,
     • decomposes the sentence into contiguous phrases,
     • assumes phrases are relatively independent of each other,
     • search has two parts:
       – lookup of all relevant translation options,
       – stack-based beam search, gradually expanding hypotheses.
     To train a PBMT system:
     1. Align words.
     2. Extract (and score) phrases consistent with the word alignment (a sketch follows below).
     3. Optimize weights (MERT).
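To make step 2 concrete, here is a minimal sketch of extracting phrase pairs consistent with a word alignment. The toy sentences and alignment links mirror the example on the following slides; the function and its consistency check are an illustrative simplification, not the Moses implementation.

```python
def extract_phrases(src, tgt, alignment, max_len=4):
    """Yield (src_phrase, tgt_phrase) pairs consistent with the alignment.

    A pair is consistent if no alignment link crosses its boundary.
    """
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            # Consistency: no word in [j1, j2] links outside [i1, i2].
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                pairs.append((" ".join(src[i1:i2 + 1]),
                              " ".join(tgt[j1:j2 + 1])))
    return pairs

# Toy example from the slides: "Nemám žádného psa." / "I have no dog."
src = ["Nemám", "žádného", "psa", "."]
tgt = ["I", "have", "no", "dog", "."]
alignment = [(0, 0), (0, 1), (1, 2), (2, 3), (3, 4)]
for pair in extract_phrases(src, tgt, alignment):
    print(pair)
```

Note that the dangerous pair ("Nemám", "I have") is among the extracted phrases, which is exactly what goes wrong on slides 10–12.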

  5. 1: Align Training Sentences. Training data: “Nemám žádného psa.” ↔ “I have no dog.”, “Viděl kočku.” ↔ “He saw a cat.”

  6. 2: Align Words. [Figure: word-level alignment links within “Nemám žádného psa.” ↔ “I have no dog.” and “Viděl kočku.” ↔ “He saw a cat.”]

  7. 3: Extract Phrase Pairs (MTUs). [Figure: phrase pairs extracted from the word-aligned training sentences.]

  8. 4: New Input. New input: “Nemám kočku.” [Figure: the training pairs with the new input shown below them.]

  9. 4: New Input (cont.). New input: “Nemám kočku.”, which should come out as “… I don't have cat.” [Figure: as before, with the desired translation added.]

  10. 5: Pick Probable Phrase Pairs (TM). For the new input “Nemám kočku.”, the translation model picks probable phrase pairs, starting with “Nemám” → “I have”.

  11. 6: …So That n-Grams Are Probable (LM). The language model then completes the output so that its n-grams are probable: “I have a cat.”

  12. Meaning Got Reversed! The resulting output “I have a cat.” ✘ says the opposite of the input “Nemám kočku.” (a toy numeric illustration of this failure follows).
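A toy numeric sketch of this failure, with probabilities invented purely for illustration: the translation model scores each phrase pair independently and the language model judges only target-side fluency, so their product can prefer the meaning-reversing hypothesis.

```python
import math

# Toy scores (all numbers invented for illustration) for translating
# "Nemám kočku." as a product of phrase-pair probabilities (TM) and a
# target-side language-model probability (LM).
hypotheses = {
    # "Nemám" -> "I have" is frequent in the phrase table (wrong here),
    # and "I have a cat ." is a very fluent English sentence.
    "I have a cat .": {"tm": 0.4 * 0.5, "lm": 0.010},
    # The meaning-preserving phrases are rarer, the sentence less frequent.
    "I don't have a cat .": {"tm": 0.1 * 0.5, "lm": 0.008},
}

for hyp, s in hypotheses.items():
    print(f"{hyp!r}: log score = {math.log(s['tm']) + math.log(s['lm']):.2f}")

# The wrong hypothesis wins: the TM scores phrases independently of each
# other and the LM judges only target fluency, never the source negation.
```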

  13. What Went Wrong?

     $$\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I,\, e_1^I} p(f_1^J \mid e_1^I)\, p(e_1^I)
        = \operatorname*{argmax}_{I,\, e_1^I} p(e_1^I) \prod_{(\hat{f}, \hat{e}) \,\in\, \text{phrase pairs of}\, (f_1^J,\, e_1^I)} p(\hat{f} \mid \hat{e}) \qquad (1)$$

     • Too strong phrase-independence assumption.
       – Phrases do depend on each other; here “nemám” and “žádného” jointly express one negation.
       – Word alignments ignored that dependence.
       – But adding the dependence would increase data sparseness.
     • The language model is a separate unit: p(e_1^I) models the target sentence independently of f_1^J.

  14. Redefining p(e_1^I | f_1^J)
     What if we modelled p(e_1^I | f_1^J) directly, word by word:

     $$p(e_1^I \mid f_1^J) = p(e_1, e_2, \dots, e_I \mid f_1^J)
        = p(e_1 \mid f_1^J) \cdot p(e_2 \mid e_1, f_1^J) \cdot p(e_3 \mid e_2, e_1, f_1^J) \cdots
        = \prod_{i=1}^{I} p(e_i \mid e_1, \dots, e_{i-1}, f_1^J) \qquad (2)$$

     …this is “just a cleverer language model”: $p(e_1^I) = \prod_{i=1}^{I} p(e_i \mid e_1, \dots, e_{i-1})$.
     Main benefit: all dependencies are available. But what technical device can learn this?
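A minimal sketch of equation (2) as code: the sentence probability accumulated word by word via the chain rule. The next-word distribution here is an untrained random stand-in; in a real NMT system it is computed by the network from the prefix and the source.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 1000  # toy target vocabulary size

def next_word_distribution(prefix, src):
    """Stand-in for a trained decoder: p(e_i | e_1..e_{i-1}, f_1^J).

    Here just a random distribution; a real NMT decoder computes it
    from learned representations of the prefix and the source.
    """
    logits = rng.normal(size=V)
    return np.exp(logits) / np.exp(logits).sum()

def sentence_log_prob(target_ids, src):
    """log p(e_1^I | f_1^J) via the chain rule of equation (2)."""
    logp = 0.0
    for i, word_id in enumerate(target_ids):
        p = next_word_distribution(target_ids[:i], src)
        logp += np.log(p[word_id])
    return logp

print(sentence_log_prob([5, 42, 7], src=[3, 99]))
```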

  15. NNs: Universal Approximators
     • A neural network with a single hidden layer (possibly huge) can approximate any continuous function to any precision.
     • (Nothing is claimed about learnability.)
     https://www.quora.com/How-can-a-deep-neural-network-with-ReLU-activations-in-its-hidden-layers-approximate-any-function
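To make the claim concrete, here is a sketch in which a single hidden layer of ReLU units represents a piecewise-linear interpolant of sin(x). The weights are constructed by hand rather than learned, echoing the slide's caveat about learnability; the knot count and target function are arbitrary choices for illustration.

```python
import numpy as np

# Interpolate sin(x) at fixed knots with one hidden ReLU layer.
knots = np.linspace(0, 2 * np.pi, 20)
target = np.sin(knots)

# One hidden ReLU per segment; output weights encode the slope changes.
slopes = np.diff(target) / np.diff(knots)
w_out = np.concatenate([[slopes[0]], np.diff(slopes)])

def net(x):
    hidden = np.maximum(0.0, x[:, None] - knots[:-1][None, :])  # ReLU(x - b_k)
    return target[0] + hidden @ w_out

x = np.linspace(0, 2 * np.pi, 200)
print(np.max(np.abs(net(x) - np.sin(x))))  # small approximation error
```

More knots shrink the error arbitrarily, which is the one-hidden-layer approximation result in miniature.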

  16. playground.tensorflow.org

  17. Perfect Features

  18. Bad Features & Low Depth

  19. Too Complex NN Fails to Learn

  20. Deep NNs for Image Classification

  21. Representation Learning
     • Based on training data (sample inputs and expected outputs), the neural network learns by itself what is important in the inputs to predict the outputs best.
     • A “representation” is a new set of axes:
       – instead of 3 dimensions (x, y, color),
       – we get e.g. 2000 dimensions (elephantity, number of storks, blueness, …),
       – designed automatically to help in the best prediction of the output.

  22. One Layer: tanh(Wx + b), 2D → 2D
     Skew: W. Translate: b. Non-linearity: tanh. (A sketch of the three steps follows.)
     Animation by http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
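A numpy sketch of the three steps, with arbitrary example values for W, b and the input points:

```python
import numpy as np

# One layer tanh(Wx + b), 2D -> 2D, decomposed into the slide's steps.
W = np.array([[1.0, 0.5],
              [0.0, 1.0]])   # skew (linear map)
b = np.array([0.3, -0.2])    # translate
points = np.array([[1.0, 1.0], [-1.0, 0.5], [0.0, -1.0]])  # some 2D inputs

skewed = points @ W.T        # apply W
shifted = skewed + b         # add b
out = np.tanh(shifted)       # pointwise non-linearity
print(out)
```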

  23. Four Layers, Disentangling Spirals. Animation by http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

  24. Processing Text with NNs
     • Map each word to a vector of 0s and 1s (“1-hot representation”): cat ↦ (0, 0, …, 0, 1, 0, …, 0).
     • A sentence is then a matrix with one 1-hot row per word.
     • Vocabulary sizes are huge: ≈1.3M English, ≈2.2M Czech word forms.
     Main drawback: no relations between words, all words are equally close/far.
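A minimal sketch with a toy eight-word vocabulary drawn from the slide's example (real vocabularies run to millions of entries):

```python
import numpy as np

vocab = ["the", "cat", "is", "on", "mat", "a", "about", "zebra"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

sentence = "the cat is on the mat".split()
matrix = np.stack([one_hot(w) for w in sentence])  # one row per word
print(matrix)
# The drawback is visible in the numbers: every pair of distinct words
# has the same distance, so "cat" is as far from "zebra" as from "the".
```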


  27. Solution: Word Embeddings
     • Map each word to a dense vector.
     • In practice 300–2000 dimensions are used, not 1–2M.
     • In NNs: the matrix that maps the 1-hot input to the first layer.
     • The dimensions have no clear interpretation.
     • Embeddings are trained for each particular task.
     • The famous word2vec (Mikolov et al., 2013):
       – CBOW: predict the word from its four neighbours.
       – Skip-gram: predict likely neighbours given the word.
     [Figure: CBOW with just a single-word context: 1-hot input layer x (V units), hidden layer h (N units), output layer y (V units), weight matrices W (V×N) and W′ (N×V).]
     (http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf)
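A sketch of the single-word-context CBOW forward pass from the figure; the sizes and random weights are illustrative, untrained values.

```python
import numpy as np

V, N = 8, 4                      # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, N))      # input embeddings (V x N)
W_out = rng.normal(scale=0.1, size=(N, V))  # output weights (N x V)

context_id = 3                   # index of the single context word
h = W[context_id]                # embedding lookup = x^T W for 1-hot x
scores = h @ W_out
p = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary
print(p)                         # p(center word | context word)
# Training nudges W and W_out so that observed (context, word) pairs get
# high probability; the rows of W become the word embeddings.
```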

  28. Continuous Space of Words
     Word2vec embeddings show interesting properties:

     $$v(\text{king}) - v(\text{man}) + v(\text{woman}) \approx v(\text{queen}) \qquad (3)$$

     Illustrations from https://www.tensorflow.org/tutorials/word2vec
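A sketch of testing equation (3) by nearest-neighbour search; the 3-dimensional vectors are made up for illustration (real embeddings have hundreds of dimensions).

```python
import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.8, 0.1, 0.1]),
    "woman": np.array([0.8, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "apple": np.array([0.1, 0.2, 0.1]),
}

query = emb["king"] - emb["man"] + emb["woman"]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Nearest neighbour of the query vector, excluding the input words.
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(emb[w], query))
print(best)  # "queen"
```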

  29. Further Compression: Sub-Words
     • SMT struggled with productive morphology (>1M word forms), e.g. nejneobhospodařovávatelnějšími, Donaudampfschifffahrtsgesellschaftskapitän.
     • NMT can handle only 30–80k dictionaries. ⇒ Resort to sub-word units.

     Orig:       český politik svezl migranty
     Morphemes:  česk ý ⊔ politik ⊔ s vez l ⊔ migrant y
     Syllables:  čes ký ⊔ po li tik ⊔ sve zl ⊔ mig ran ty
     BPE 30k:    český politik s@@ vez@@ l mi@@ granty
     Char pairs: če sk ý ⊔ po li ti k ⊔ sv ez l ⊔ mi gr an ty
     Chars:      č e s k ý ⊔ p o l i t i k ⊔ s v e z l ⊔ m i g r a n t y

     BPE (Byte-Pair Encoding) uses the n most common substrings (incl. frequent words).
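A minimal sketch of learning BPE merges on a toy corpus, simplified from the actual algorithm (Sennrich et al.); the corpus and the merge count are illustrative.

```python
from collections import Counter

corpus = ["svezl", "svezla", "migranty", "migrant"]
words = [list(w) + ["</w>"] for w in corpus]  # start from characters

def best_pair(words):
    """Most frequent adjacent symbol pair across all words."""
    pairs = Counter()
    for w in words:
        pairs.update(zip(w, w[1:]))
    return max(pairs, key=pairs.get)

for _ in range(5):  # learn 5 merges
    a, b = best_pair(words)
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                out.append(a + b)   # apply the merge
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    words = merged
    print(f"merged {a!r}+{b!r}:", [" ".join(w) for w in words])
```

Each learned merge becomes a vocabulary entry, so frequent substrings (and eventually frequent whole words) end up as single units.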
