  1. Advances and Challenges in Neural Machine Translation Gongbo Tang 26 September 2019

  2. Outline: 1 Model Architectures, 2 Noisy Data, 3 Monolingual Data, 4 Domain Adaptation, 5 Coverage, 6 Understanding NMT

  3. The Best of Both Worlds: encoder-decoders with residual feed-forward layers, a cascaded encoder, and a multi-column encoder. [Figure: (a) Cascaded Encoder, (b) Multi-Column Encoder.] Source: The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation

  4. Star-Transformer. [Figure 1: Left: connections of one layer in Transformer; circle nodes are the hidden states of input tokens. Right: connections of one layer in Star-Transformer; the square node is the virtual relay node. Red edges are ring connections and blue edges are radial connections.] Source: Star-Transformer
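      As a rough illustration of the connectivity described above, the sketch below builds the boolean attention mask for one Star-Transformer layer: token nodes attend to themselves, their sequence neighbours (ring connections) and the virtual relay node (radial connections), while the relay node attends to everything. This is only a NumPy sketch of the connection pattern, not the paper's full layer (it omits the multi-head attention updates and the previous-layer states each node also uses).

```python
import numpy as np

def star_attention_mask(n_tokens: int) -> np.ndarray:
    """Sketch of the connectivity in one Star-Transformer layer.

    Index 0 is the virtual relay node, indices 1..n_tokens are token nodes.
    mask[i, j] == True means node i may attend to node j. Token nodes see
    themselves, their sequence neighbours ("ring" connections) and the relay
    node ("radial" connections); the relay node sees all nodes.
    """
    size = n_tokens + 1
    mask = np.zeros((size, size), dtype=bool)
    mask[0, :] = True                      # relay node attends everywhere
    for i in range(1, size):
        mask[i, [i, 0]] = True             # self + radial connection
        if i > 1:
            mask[i, i - 1] = True          # left neighbour (ring)
        if i < n_tokens:
            mask[i, i + 1] = True          # right neighbour (ring)
    return mask

print(star_attention_mask(8).astype(int))
```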

  5. Modeling Recurrence for Transformer. [Figure 2: the architecture of Transformer augmented with an additional recurrence encoder, whose output is fed directly to the top decoder layer. Figure 3: two implementations of recurrence modeling: (a) a standard recurrent neural network (RNN), and (b) the proposed attentive recurrent network (ARN).] Source: Modeling Recurrence for Transformer

  6. Convolutional Self-Attention Networks. [Figure 1: illustration of (a) vanilla SANs; (b) 1-dimensional convolution with window size 3; and (c) 2-dimensional convolution over a 3 × 3 area. Different colors represent different subspaces modeled by multi-head attention; transparent colors denote masked tokens that are invisible to SANs. Example sentence: "Bush held a talk with Sharon".] Source: Convolutional Self-Attention Networks
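      The 1D-convolutional variant can be imitated by masking standard self-attention to a fixed window around each position. The following single-head NumPy sketch is a hypothetical simplification (query/key/value projections and multi-head subspaces omitted) showing the idea with a window of 3 tokens.

```python
import numpy as np

def local_self_attention(x: np.ndarray, window: int = 3) -> np.ndarray:
    """1D-convolutional self-attention, as a minimal single-head sketch.

    x has shape (seq_len, d_model). Each position attends only to a window
    of `window` tokens centred on itself; tokens outside the window are
    masked out before the softmax, as in the 1D-convolutional SAN variant.
    """
    seq_len, d_model = x.shape
    scores = x @ x.T / np.sqrt(d_model)
    half = window // 2
    positions = np.arange(seq_len)
    mask = np.abs(positions[:, None] - positions[None, :]) <= half
    scores = np.where(mask, scores, -1e9)          # hide out-of-window tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

out = local_self_attention(np.random.randn(6, 16), window=3)
print(out.shape)   # (6, 16)
```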

  7. Lattice-Based Transformer Encoder. [Figure 1: incorporating three different segmentations of the sentence "mao-yi-fa-zhan-ju-fu-zong-cai" (贸易发展局副总裁, "the vice president of the Trade Development Council") into one lattice graph. Figure 2: the architecture of the lattice-based Transformer encoder; lattice positional encoding is added to the embeddings of the lattice sequence inputs, and different colors in the lattice-aware self-attention indicate different relation embeddings.] Source: Lattice-Based Transformer Encoder for Neural Machine Translation

  8. Incorporating Sentential Context. [Figure 1: illustration of the proposed approaches on a 3-layer encoder: (a) vanilla model without sentential context; (b) shallow sentential context, built from the top encoder layer only; and (c) deep sentential context, built from all encoder layers. Circles denote hidden states of individual tokens; squares denote sentential context representations, which are fed to the decoder.] Source: Exploiting Sentential Context for Neural Machine Translation
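      A toy sketch of the two variants, under the simplifying assumption that a sentence vector is obtained by mean-pooling token states: the shallow variant pools only the top encoder layer, while the deep variant combines pooled vectors from every layer. The softmax layer weighting here is a hypothetical stand-in for the deeper aggregation network the paper actually uses.

```python
import numpy as np

def shallow_sentential_context(top_layer: np.ndarray) -> np.ndarray:
    """Mean-pool the top encoder layer into one sentence vector (shallow)."""
    return top_layer.mean(axis=0)

def deep_sentential_context(layers, layer_weights: np.ndarray) -> np.ndarray:
    """Combine per-layer sentence vectors with softmax weights (deep).

    `layers` holds the hidden states of every encoder layer, each of shape
    (seq_len, d_model); `layer_weights` is a hypothetical stand-in for a
    learned combination.
    """
    pooled = np.stack([layer.mean(axis=0) for layer in layers])   # (L, d)
    weights = np.exp(layer_weights) / np.exp(layer_weights).sum()
    return weights @ pooled

encoder_layers = [np.random.randn(7, 32) for _ in range(3)]
print(shallow_sentential_context(encoder_layers[-1]).shape)          # (32,)
print(deep_sentential_context(encoder_layers, np.zeros(3)).shape)    # (32,)
```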

  9. Tree Transformer. [Figure 1: (A) a 3-layer Tree Transformer on the sentence "the cute dog is wagging its tail", where the blocks are constituents induced from the input; neighboring constituents may merge in the next layer, so constituent sizes grow from layer to layer, and red arrows indicate self-attention. (B) The building blocks of Tree Transformer: multi-head attention combined with constituent attention guided by constituent priors. (C) The constituent prior C for layer 1.] Source: Tree Transformer: Integrating Tree Structures into Self-Attention

  10. Noise in Training Data
      • Crawled parallel data from the web is very noisy (German-English, 90M words each of WMT17 and Paracrawl data):

                       SMT           NMT
        WMT17          24.0          27.2
        + Paracrawl    25.2 (+1.2)   17.3 (-9.9)

      • Corpus cleaning methods [Xu and Koehn, EMNLP 2017] give improvements
      Source: Philipp Koehn's slides

  11. Noisy Data: types of noise
      • Misaligned sentences
      • Disfluent language (from MT, bad translations)
      • Wrong language data (e.g., French in a German–English corpus)
      • Untranslated sentences
      • Short segments (e.g., dictionaries)
      • Mismatched domain
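      For illustration, the sketch below implements a few heuristic filters matched to some of the noise types above (too-short segments, untranslated copies, length-mismatched pairs). It is a minimal example, not the corpus-cleaning method of Xu and Koehn (2017).

```python
def looks_noisy(src: str, tgt: str,
                max_ratio: float = 3.0, min_words: int = 1) -> bool:
    """Flag sentence pairs that match simple noise heuristics (a sketch)."""
    src_words, tgt_words = src.split(), tgt.split()
    if len(src_words) < min_words or len(tgt_words) < min_words:
        return True                       # empty / too-short segment
    if src.strip().lower() == tgt.strip().lower():
        return True                       # untranslated (copied) sentence
    ratio = max(len(src_words), len(tgt_words)) / min(len(src_words), len(tgt_words))
    if ratio > max_ratio:
        return True                       # length mismatch: likely misaligned
    return False

pairs = [("Hello world .", "Hallo Welt ."),
         ("Hello world .", "Hello world ."),   # untranslated copy
         ("Too short .", "Ein sehr sehr sehr langer Satz mit vielen vielen Woertern hier .")]
print([looks_noisy(s, t) for s, t in pairs])   # [False, True, True]
```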

  12. Mismatched Sentences
      • Artificially created by randomly shuffling sentence order, then added to the existing parallel corpus in different amounts (5%, 10%, 20%, 50%, 100%)
      • [Bar chart: BLEU after adding mismatched pairs. SMT barely moves (24.0 → 23.4 at 100% noise, -0.6), while NMT drops to 26.1 at 50% (-1.1) and 25.3 at 100% (-1.9).]
      • Bigger impact on NMT than on SMT
      Source: Philipp Koehn's slides
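      The kind of artificial noise used in this experiment can be generated by pairing source sentences with randomly shuffled target sentences and appending the result to the corpus. The sketch below assumes a list of (source, target) string pairs; the exact sampling in Koehn's experiments may differ.

```python
import random

def add_misaligned_pairs(parallel, fraction, seed=0):
    """Append artificially misaligned pairs to a parallel corpus (a sketch).

    A `fraction` of the corpus is sampled, its target sentences are shuffled
    so they no longer match their sources, and the resulting noisy pairs are
    added on top of the original data.
    """
    rng = random.Random(seed)
    n_noise = int(len(parallel) * fraction)
    sampled = rng.sample(parallel, n_noise)
    shuffled_targets = [tgt for _, tgt in sampled]
    rng.shuffle(shuffled_targets)
    noise = [(src, tgt) for (src, _), tgt in zip(sampled, shuffled_targets)]
    return parallel + noise

corpus = [("src %d" % i, "tgt %d" % i) for i in range(10)]
print(len(add_misaligned_pairs(corpus, fraction=0.5)))   # 15
```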

  13. Misordered Words
      • Artificially created by randomly shuffling the words in each sentence, on either the source or the target side, for 5%-100% of the corpus
      • [Bar chart: at 100% noise, NMT drops by 1.7 BLEU (to 25.5) when the source side is shuffled and by 1.1 BLEU (to 26.1) when the target side is shuffled; SMT degrades by at most about 1.1 BLEU.]
      • Similar impact on NMT and SMT; reshuffling the source side is worse
      Source: Philipp Koehn's slides

  14. Untranslated Sentences
      • [Bar chart: BLEU after adding 5%-100% untranslated sentence pairs]

        Amount added               5%     10%    20%    50%    100%
        Untranslated source, SMT   23.8   23.9   23.8   23.4   21.1
        Untranslated source, NMT   17.6   11.2    5.6    3.2    3.2

      • With untranslated target sentences, NMT stays between 26.7 and 27.2 BLEU (a drop of at most 0.5)
      • Untranslated source sentences (source text copied to the target side) are catastrophic for NMT but only mildly harmful for SMT
      Source: Philipp Koehn's slides

  15. Short Sentences
      • Adding 5%-50% segments of 1-2 words or 1-5 words
      • [Bar chart: BLEU changes stay within roughly ±0.8 for both SMT and NMT; segments of 1-5 words even help slightly.]
      • No harm done
      Source: Philipp Koehn's slides

  16. Amount of Training Data
      [Line chart: BLEU scores with varying amounts of training data (10^6 to 10^8 English words) for phrase-based SMT, phrase-based SMT with a big language model, and NMT. With around 10^6 words NMT is far behind (as low as 1.6 BLEU, versus roughly 16-18 for the phrase-based systems), but its curve rises more steeply and it overtakes both phrase-based systems once tens of millions of words are available, reaching about 31 BLEU at 10^8 words.]

  17. Using Monolingual Data in NMT
      • Dummy source: use no real source sentence; randomly sample from the monolingual data each epoch and freeze the encoder/attention layers for monolingual training instances (see the sketch below)
      • Synthetic source: produce a synthetic source-side sentence via back-translation, i.e. use a model trained in the opposite direction to generate the source-side sentence
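      For the dummy-source approach, a minimal PyTorch sketch of the freezing step is shown below: encoder and attention parameters stop receiving gradients while a monolingual batch is processed, so only the decoder is updated. The name-based matching is a hypothetical convention and depends on how the model labels its submodules.

```python
import torch.nn as nn

def set_encoder_trainable(model: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze encoder/attention parameters (a sketch).

    Called with trainable=False before a dummy-source monolingual batch and
    trainable=True before a genuine parallel batch.
    """
    for name, param in model.named_parameters():
        if "encoder" in name or "attention" in name:
            param.requires_grad = trainable

# Usage inside a training loop (sketch):
#   set_encoder_trainable(model, trainable=not batch_is_monolingual)
#   loss = compute_loss(model, batch)
#   loss.backward(); optimizer.step()
```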

  18. Back-Translation Steps
      1. Train a system in the reverse language direction
      2. Use that system to translate target-side monolingual data, giving synthetic source sentences
      3. Combine the real parallel data and the synthetic parallel data to train the final system
      [Figure: reverse system → final system.] Figure from: Philipp Koehn's slides
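      The three steps translate directly into a small pipeline sketch. `train` and `translate` below are hypothetical stand-ins for whatever NMT toolkit is used; the function only shows the data flow.

```python
def back_translate(parallel, target_monolingual, train, translate):
    """Back-translation pipeline in three steps (a sketch).

    1. Train a reverse system on the swapped parallel data.
    2. Use it to translate target-side monolingual text back into the
       source language, giving synthetic source sentences.
    3. Train the final system on real plus synthetic parallel data.
    """
    reverse_data = [(tgt, src) for src, tgt in parallel]
    reverse_system = train(reverse_data)                      # step 1
    synthetic = [(translate(reverse_system, tgt), tgt)        # step 2
                 for tgt in target_monolingual]
    return train(parallel + synthetic)                        # step 3
```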

  19. Iterative Back-Translation
      [Figure: back system 1 → back system 2 → final system; each back-translation system is retrained on the synthetic data produced by the previous one.] Figure from: Philipp Koehn's slides
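      Iterating the procedure alternates between the two directions: the models from one round generate the synthetic data used to train the models of the next round. Again a sketch with hypothetical `train`/`translate` helpers.

```python
def iterative_back_translation(parallel, mono_src, mono_tgt,
                               train, translate, rounds=2):
    """Iterative back-translation (a sketch).

    Each round, the current backward model produces synthetic sources for the
    forward model and the current forward model produces synthetic targets
    for the backward model, so both systems improve together.
    """
    fwd_data = list(parallel)
    bwd_data = [(tgt, src) for src, tgt in parallel]
    fwd, bwd = train(fwd_data), train(bwd_data)
    for _ in range(rounds):
        synth_fwd = [(translate(bwd, t), t) for t in mono_tgt]  # synthetic source
        synth_bwd = [(translate(fwd, s), s) for s in mono_src]  # synthetic target
        fwd = train(fwd_data + synth_fwd)
        bwd = train(bwd_data + synth_bwd)
    return fwd
```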

  20. Dual Learning
      • We could iterate through steps of: train system, create synthetic corpus
      • Dual learning: train models in both directions together
        - translation models F → E and E → F
        - take a sentence f
        - translate it into a sentence e'
        - translate that back into a sentence f'
        - training objective: f should match f'
      • This setup could be fooled by simply copying (e' = f), so e' is scored with a language model for language E, and the language model score is added as a cost to the training objective
      Source: Philipp Koehn's slides
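      One way to read the training signal described above: translate f into e', score e' with a language model for E (to penalise the trivial copy e' = f), translate back to f', and reward agreement between f and f'. The sketch below combines the two signals into a single scalar; every helper it takes is a hypothetical stand-in, and in practice both models are updated with reinforcement-learning-style gradients on this reward.

```python
def dual_learning_reward(f, translate_fe, translate_ef,
                         lm_score_e, reconstruction_score, alpha=0.5):
    """Reward for one dual-learning round trip (a sketch).

    translate_fe / translate_ef are the F→E and E→F models, lm_score_e is a
    language model for language E, and reconstruction_score measures how well
    the round-trip translation f' matches the original sentence f.
    """
    e_prime = translate_fe(f)                 # F → E
    f_prime = translate_ef(e_prime)           # E → F, back to the source language
    return (alpha * lm_score_e(e_prime)
            + (1 - alpha) * reconstruction_score(f, f_prime))
```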
