

SLIDE 1

Advanced Neural Machine Translation

Gongbo Tang

23 September 2019

SLIDE 2

Outline

1. NMT with Attention Mechanisms
   - Attention Mechanisms
   - Understanding Attention Mechanisms
   - Attention Variants

2. NMT at Different Granularities
   - Hybrid Models
   - Character-level Models
   - Subword-level Models

SLIDE 3

Encoder-Decoder Architecture

Figure 1.6: Encoder-decoder architecture – example of the general approach for NMT. An encoder converts a source sentence into a meaning vector which is passed through a decoder to produce a translation.


SLIDE 4

Encoder-Decoder with Attention

Diagram: input words → encoder states → attention → context vector → decoder hidden state → output words.

SLIDE 5

Attentional NMT

Example from Rico Sennrich's EACL 2017 NMT talk.


SLIDE 11

Attention Mechanisms

Diagram: the encoder hidden states h and the previous decoder state s_{i-1} feed the attention module, which produces weights \alpha_i and a context vector c_i; the decoder network and a softmax then produce the predictions.

Computation

e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)
\alpha_{ij} = softmax(e_{ij})
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
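A minimal NumPy sketch of the additive attention computation above; the matrix names follow the equations, but the exact shapes and the lack of batching are illustrative assumptions.

import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """s_prev: (d_dec,) previous decoder state s_{i-1}; H: (T_x, d_enc) encoder states h_j."""
    # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), computed for all j at once
    e = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a    # (T_x,)
    alpha = softmax(e)                               # attention weights alpha_ij
    c = alpha @ H                                    # context vector c_i = sum_j alpha_ij h_j
    return alpha, c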


SLIDE 12

Attention Mechanisms

Methods to compute attention:

score(h_t, \bar{h}_s) = h_t^\top \bar{h}_s                     (dot)
score(h_t, \bar{h}_s) = h_t^\top W_a \bar{h}_s                 (general)
score(h_t, \bar{h}_s) = v_a^\top \tanh(W_a [h_t; \bar{h}_s])   (concat)

Notation
Query (Q): the decoder hidden state
Keys (K): encoder hidden states
Values (V): encoder hidden states
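The three scoring functions can be sketched in NumPy as follows; the shapes of W_a and v_a (which differ between variants) are assumptions.

import numpy as np

def score(h_t, H_s, method, W_a=None, v_a=None):
    """h_t: (d,) decoder state (query); H_s: (T_x, d) encoder states (keys/values)."""
    if method == "dot":                                  # h_t^T h_s
        return H_s @ h_t
    if method == "general":                              # h_t^T W_a h_s, W_a: (d, d)
        return H_s @ (W_a @ h_t)
    if method == "concat":                               # v_a^T tanh(W_a [h_t; h_s]), W_a: (d_a, 2d)
        pairs = np.concatenate([np.tile(h_t, (len(H_s), 1)), H_s], axis=1)
        return np.tanh(pairs @ W_a.T) @ v_a
    raise ValueError(f"unknown method: {method}")

def attend(h_t, H_s, **kwargs):
    e = score(h_t, H_s, **kwargs)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                 # softmax over source positions
    return alpha, alpha @ H_s                            # weights and context vector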



SLIDE 14

Attention Mechanisms

Diagram (b), advanced attention mechanism: the encoder hidden states h and the previous state s_{t-1} feed multi-head attention, which provides weights \alpha_t and a context vector c_t to a stack of n decoder blocks; a softmax over the output of the final block produces the predictions.

SLIDE 15

Attention Mechanisms

Multi-Head Attention

Attention(Q, K, V) = softmax(Q K^\top / \sqrt{d_k}) V
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
    where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
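A minimal NumPy sketch of scaled dot-product and multi-head attention as defined above; the per-head projection matrices and their shapes are illustrative assumptions.

import numpy as np

def scaled_dot_attention(Q, K, V):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])              # (n, m)
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)                   # softmax over the keys
    return A @ V                                          # (n, d_v)

def multi_head(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v: lists of per-head projection matrices; W_o: output projection."""
    heads = [scaled_dot_attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o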


SLIDE 16

CNN-based NMT models

Figure from Convolutional Sequence to Sequence Learning.

SLIDE 17

Transformer

Figure from Attention Is All You Need.

SLIDE 18

Understanding Attention Mechanisms

Motivation of attention mechanisms: the plain encoder-decoder NMT model has no explicit alignment model. Attention mechanisms try to mimic an alignment model and enable the model to learn to translate and align jointly.

Question: do attention mechanisms act as word alignment models in NMT?


SLIDE 20

Attention Visualization

Attention heatmaps for the sentence pair "wer garantiert den Leuten eine Stelle ?" / "who guarantees people a job ? </s>": panels (a)-(h) show attention at layers 1-8, panel (i) an RNN model.

SLIDE 21

Attention as Alignment

Table 6: AER scores of various models on the RWTH English-German alignment data.

Method              AER
global (location)   0.39
local-m (general)   0.34
local-p (general)   0.36
ensemble            0.34
Berkeley Aligner    0.32

Heatmaps: per-layer AER of Transformer attention (layers L1-L6) for two language pairs.

Table 1: AER of the proposed methods (results measured on TRANSFORMER-L6).

Method             ZH⇒EN   DE⇒EN
FAST ALIGN         36.57   26.58
Attention (mean)   56.44   74.59
Attention (best)   45.22   53.98
EAM                38.88   39.25
PD                 41.77   42.81

Findings: RNN attention is better than Transformer attention, and attention is much worse than a traditional alignment model.

Figures from Effective Approaches to Attention-based Neural Machine Translation and On the Word Alignment from Neural Machine Translation.

SLIDE 22

Attention is not Alignment

Figure from What does Attention in Neural Machine Translation Pay Attention to?

SLIDE 23

Attention is not Alignment

Example: the English sentence "relations between Obama and Netanyahu have been strained for years ." with (a) the desired alignment to "die Beziehungen zwischen Obama und Netanjahu sind seit Jahren angespannt ." and (b) a mismatched alignment for "das Verhältnis zwischen Obama und Netanyahu ist seit Jahren gespannt ." ("the relationship between Obama and Netanyahu has been stretched for years .").

SLIDE 24

Attention Distribution

Table 8: Distribution of attention probability mass (in %) over alignment points and the rest of the words, for each POS tag.

POS tag    To alignment points (%)   To other words (%)
NUM        73                        27
NOUN       68                        32
ADJ        66                        34
PUNC       55                        45
ADV        50                        50
CONJ       50                        50
VERB       49                        51
ADP        47                        53
DET        45                        55
PRON       45                        55
PRT        36                        64
Overall    54                        46

Figure from What does Attention in Neural Machine Translation Pay Attention to?

SLIDE 25

Attention and Word Sense Disambiguation

Figure 5: WSD accuracy over attention ranges (x-axis: attention range from 0 to 1.0; y-axis: accuracy; curves for RNN and Transformer).

Figure from An Analysis of Attention Mechanisms: The Case of Word Sense Disambiguation in NMT.

SLIDE 26

Attention and Anaphora

Figure 5: An example of an attention map between source and context. On the y-axis are the source tokens, on the x-axis the context tokens. Note the high attention between “it” and its antecedent “heart”.

Figure from Context-Aware Neural Machine Translation Learns Anaphora Resolution.

SLIDE 27

Attention and Unknown Words

1. Find the corresponding source word (via the attention weights).
2. Look up its translation in a dictionary.

Example from Zhaopeng Tu's tutorial; method from On Using Very Large Target Vocabulary for Neural Machine Translation.
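A hypothetical post-processing sketch of this idea: for every <unk> in the output, pick the source word with the highest attention weight and replace the <unk> with its dictionary translation (or copy the source word if it is not in the dictionary). Token and variable names are assumptions.

def replace_unks(target_tokens, source_tokens, attention, dictionary):
    """attention[i][j]: weight on source position j when target position i was produced."""
    output = []
    for i, token in enumerate(target_tokens):
        if token == "<unk>":
            j = max(range(len(source_tokens)), key=lambda k: attention[i][k])
            source_word = source_tokens[j]                            # 1. corresponding source word
            output.append(dictionary.get(source_word, source_word))   # 2. dictionary lookup
        else:
            output.append(token)
    return output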

SLIDE 28

Global Attention

Diagram: global alignment weights a_t are computed over all encoder states \bar{h}_s and combined into a context vector c_t in the attention layer; c_t and the decoder state h_t form \tilde{h}_t, from which y_t is predicted.

Figure from Effective Approaches to Attention-based Neural Machine Translation.

SLIDE 29

Local Attention

Diagram: the model first predicts an aligned position p_t, computes local alignment weights a_t over a window of encoder states \bar{h}_s around p_t, combines them into a context vector c_t, and forms \tilde{h}_t from c_t and h_t to predict y_t.

Figure from Effective Approaches to Attention-based Neural Machine Translation.
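A rough sketch of the local-p variant from the cited paper: predict an aligned position p_t from the decoder state, then favour source positions around p_t by weighting the usual attention with a Gaussian window (sigma = D/2). Parameter names, shapes, and the window size are assumptions.

import numpy as np

def local_p_attention(h_t, H_s, W_p, v_p, score_fn, D=10):
    """h_t: (d,) decoder state; H_s: (S, d) encoder states; score_fn as on slide 12."""
    S = len(H_s)
    p_t = S / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t))))   # predicted aligned position
    positions = np.arange(S)
    window = np.exp(-((positions - p_t) ** 2) / (2 * (D / 2.0) ** 2))
    e = score_fn(h_t, H_s)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                     # global alignment weights
    alpha = alpha * window                                   # focus them on the local window
    return alpha, alpha @ H_s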

SLIDE 30

Sparse Attention

Attention heatmaps for two German-English examples ("this is moore 's law last hundred years ." and "now i am going to choose the government ."), illustrating the sparse attention distributions.

Figure from Sparse and Constrained Attention for Neural Machine Translation.
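The cited paper obtains sparse attention by replacing the softmax with sparsemax (and a constrained variant); the following NumPy sketch of plain sparsemax is an illustration, not the paper's exact formulation.

import numpy as np

def sparsemax(z):
    """Euclidean projection of the scores z onto the simplex; many entries become exactly 0."""
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = k * z_sorted > cumsum - 1          # prefix of positions kept in the support
    k_z = k[support][-1]
    tau = (cumsum[support][-1] - 1.0) / k_z      # threshold subtracted from every score
    return np.maximum(z - tau, 0.0)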

SLIDE 31

Doubly Attention

Figure – The conventional attention distribution

Constraint: the attention weights over each source token should sum to 1.
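One common way to encourage this constraint is a doubly stochastic penalty added to the training loss, pushing the total attention received by every source token towards 1; this is an illustrative sketch, not necessarily the formulation used on the slide.

import numpy as np

def doubly_stochastic_penalty(alpha, lam=1.0):
    """alpha: (target_len, source_len) attention matrix whose rows already sum to 1."""
    per_source_mass = alpha.sum(axis=0)                # total attention on each source token
    return lam * np.sum((1.0 - per_source_mass) ** 2)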


SLIDE 32

Fine-grained Attention Mechanism

Figure 1: (a) The conventional attention mechanism and (b) the proposed fine-grained attention mechanism. Note that \sum_t \alpha_{t',t} = 1 in the conventional method, and \sum_t \alpha^d_{t',t} = 1 for every dimension d in the proposed method.

Figure from Fine-Grained Attention Mechanism for Neural Machine Translation.
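A minimal sketch contrasting conventional attention (one weight per source position) with the fine-grained variant (one weight per source position and per dimension); how the per-dimension scores are computed is left abstract and is an assumption.

import numpy as np

def conventional_context(scores, H):
    """scores: (T_x,); H: (T_x, d). One softmax over source positions."""
    a = np.exp(scores - scores.max())
    a /= a.sum()
    return a @ H                                       # (d,)

def fine_grained_context(scores, H):
    """scores: (T_x, d): a separate score for every dimension of every source state."""
    a = np.exp(scores - scores.max(axis=0, keepdims=True))
    a /= a.sum(axis=0, keepdims=True)                  # softmax over positions, per dimension
    return (a * H).sum(axis=0)                         # (d,)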

SLIDE 33

Attention via Attention

Figure from Attention-via-Attention Neural Machine Translation.

SLIDE 34

Limited Vocabulary

Expensive computation during prediction: the probabilities have to be computed over the entire target-side vocabulary at every step.

Diagram of a decoder step: context vectors c_{i-1}, c_i and states s_{i-1}, s_i feed the word predictions t_{i-1}, t_i; the selected word y_i (one of vocabulary items such as "the", "cat", "this", "of", "fish", "there", "dog", "these") is embedded as E y_i and fed into the next step.

Figure from Philipp Koehn's slides.

SLIDE 35

Limited Vocabulary

MT is an open-vocabulary problem:
- compounding and other productive morphological processes: "they charge a carry-on bag fee." → "sie erheben eine Hand|gepäck|gebühr."
- names: Obama (English; German), Обама (Russian), オバマ (o-ba-ma) (Japanese)
- technical terms, numbers, etc.

Problem: a limited vocabulary causes out-of-vocabulary words.


SLIDE 37

Hybrid-level (word+character) Models

Figure from Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models.

SLIDE 38

Character-level Models

C2W compositional model: a BLSTM reads the characters W, h, e, r, e and composes the word vector for "Where".

Figure – Character to word compositional model

Figure from Character-based Neural Machine Translation.

SLIDE 39

Character-level Models

V2C generation model: a forward LSTM generates the characters of the target word one by one (SOW, e, s, t, a, EOW).

Figure – Vector to character generation model

Figure from Character-based Neural Machine Translation.

SLIDE 40

Character-level Models

Pros:
- small vocabulary
- no segmentation/tokenization required
- can model different, rare morphological variants of a word

Cons:
- sentences are longer (harder for training)
- training time is longer


SLIDE 42

Subword-level Models

Byte pair encoding algorithm: frequent character n-grams (or whole words) are merged into a single symbol.
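A compact sketch of the BPE learning loop, in the spirit of the published algorithm: count adjacent symbol pairs over a (word, frequency) vocabulary and repeatedly merge the most frequent pair. The toy word frequencies are illustrative.

import re
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    # represent each word as a sequence of characters plus an end-of-word marker
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                       # most frequent adjacent pair
        merges.append(best)
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = {pattern.sub("".join(best), word): freq for word, freq in vocab.items()}
    return merges

print(learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=10))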



SLIDE 48

Subword-level Models

Byte pair encoding algorithm:
- symbols can be applied to unknown words
- trade-off between text length and vocabulary size
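Applying the learned merges to a new, possibly unknown word could look like the sketch below: a simplified greedy application in merge order, shown only for illustration.

def apply_bpe(word, merges):
    symbols = list(word) + ["</w>"]
    for a, b in merges:                                   # apply merges in the learned order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]                # merge the pair into one symbol
            else:
                i += 1
    return symbols

# e.g. apply_bpe("lowest", merges) segments an unseen word into known subword symbols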



SLIDE 54

Subwords with Morphology

Linguistically Motivated Vocabulary Reduction: segmentation not only based on frequency, but on morphological knowledge.

D. Ataman et al., Linguistically Motivated Vocabulary Reduction for NMT (pp. 331-342).

Experiment 1: TED corpus, no-OOV case, vocabulary size 40K.

Method            BLEU    TER     CHRF3
No Segmentation   17.77   68.07   38.94
BPE               19.52   66.23   42.33
Supervised        21.61   61.76   44.01
LMVR              21.71   61.41   43.90

Input (Reference)                   Method       Segmentation                  Output
ağlarını (the nets)                 BPE          ağ@@ larını                   the cry
                                    LMVR         ağ +larını                    the nets
                                    Supervised   ağ +Noun +A3pl <EOW>          networks
ağlamayacak (would not be crying)   BPE          ağ@@ lamayacak                will not survive
                                    LMVR         ağlama +yacak                 will not cry
                                    Supervised   ağla +Neg +Fut +A3sg <EOW>    will not cry

Table 5. Results of Experiment 1 - TED corpus and no-OOV case. Top: output accuracies, where marked values indicate statistically significant improvement over the BPE baseline (p-value 0.05). Bottom: translation examples.

6. Results and Discussion

Table 5 shows the performance of different segmentation methods in Experiment 1. Our linguistically motivated vocabulary reduction (LMVR) method achieves the best performance on average, proving our hypothesis that a correct morphological representation generates more accurate translations. Our method outperforms the strong baseline of BPE-based segmentation by 2.2 BLEU, 4.8 TER and 1.6 CHRF3 points. The performance is slightly higher than the supervised method, which is related to the ambiguity caused by loss of information during the morphological analysis. The predicted vocabularies also indicate the significant difference between LMVR and BPE, where 73% of the sub-word units in the vocabulary are completely different. In order to better illustrate the properties of the generated sub-word units, we present example translations of two words from the test set. The two words have different roots; the first one is ağ (translation: net), and the second one is ağla (translation: (to) cry). BPE segments both words to the same root ağ, a character sequence frequently observed in root words in Turkish. In the first case, both unsupervised methods segment the word into the same sub-word units, while the embedding of the sub-word unit segmented with BPE is semantically ambiguous and generates unreliable translations. On the other hand, our method can preserve the correct meaning in both cases.

In Experiment 2, we evaluate our method at different rates of vocabulary reduction according to the vocabulary sizes given in Table 4. All metrics confirm that our method achieves better performance than the baseline in both experiments. In Experiment 2.a, at a vocabulary reduction rate of 4.25 (140K -> 40K), we obtain an improvement of 2.3 BLEU points over the baseline. In the most challenging case, Experiment


Figure from Linguistically Motivated Vocabulary Reduction for Neural Machine Translation from Turkish to English.

SLIDE 55

Morphological Segmentation Tool

Morfessor

Figure 1: Screenshot from the Morfessor 2.0 demo.

Figure from Morfessor 2.0: Toolkit for statistical morphological segmentation.

SLIDE 56

Supplementary: Weight Tying/Sharing

t_i = softmax(W (U s_{i-1} + V E y_{i-1} + C c_i))    (13.76)

Diagram: the decoder state h_t (computed from c_t and y_{t-1}) is multiplied by the output matrix W (of size d_h x |V|) and passed through a softmax to produce y_t; the input embedding matrix E has size |V| x d, with d = d_h.

(a) Typical output layer, which is a softmax linear unit, without or with weight tying (W = E^T).

Figure from Beyond Weight Tying: Learning Joint Input-Output Embeddings for Neural Machine Translation.
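A minimal sketch of output-embedding weight tying as in the caption above: the output projection reuses the transposed input embedding matrix, W = E^T. The sizes and the initialisation are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
V, d = 30000, 512                                  # vocabulary size and embedding/hidden size (d = d_h)
E = rng.normal(scale=0.01, size=(V, d))            # input embedding matrix E, shape |V| x d
W_untied = rng.normal(scale=0.01, size=(d, V))     # separate output matrix, shape d_h x |V|

def output_distribution(h, tie_weights=True):
    """h: (d,) decoder state; returns a softmax distribution over the vocabulary."""
    W = E.T if tie_weights else W_untied           # weight tying: W = E^T
    logits = h @ W
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()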

SLIDE 57

Attention

Abel GPU cluster account! The next lecture will be in: Sal XI, Universitetshuset!