 
              Advanced Neural Machine Translation Gongbo Tang 21 September 2020
Outline
1. NMT with Attention Mechanisms: Attention Mechanisms; Understanding Attention Mechanisms; Attention Variants
2. NMT at Different Granularities: Hybrid Models; Character-level Models; Subword-level Models
Encoder-Decoder Architecture
Figure 1.6: Encoder-decoder architecture – example of the general approach for NMT. An encoder converts a source sentence into a meaning vector which is passed through a decoder to produce a translation.
Encoder-Decoder with Attention
[Figure: encoder-decoder with attention; layers labeled Encoder States, Attention, Input Context, Hidden State, Output Words.]
Attentional NMT
Example from Rico Sennrich’s EACL 2017 NMT talk.
Attention Mechanisms
[Figure: attention mechanism; labels: Encoder hidden states h, Attention, s_{i-1}, α_i, c_i, Decoder Networks, Softmax, Predictions.]
Computation
\[ e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j) \]
\[ \alpha_{ij} = \mathrm{softmax}(e_{ij}) \]
\[ c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j \]
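The computation on this slide (energy, softmax weights, weighted sum of encoder states) can be sketched in plain Python. This is a minimal illustration, not the lecturer's code: the helper names (`matvec`, `softmax`, `additive_attention`) and all toy numbers are assumptions, and vectors are plain lists rather than tensors.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """Return (alpha_i, c_i) for the previous decoder state s_prev and encoder states H."""
    Ws = matvec(W_a, s_prev)                      # W_a s_{i-1}, shared by all source positions
    energies = []
    for h_j in H:
        Uh = matvec(U_a, h_j)                     # U_a h_j
        hidden = [math.tanh(a + b) for a, b in zip(Ws, Uh)]
        energies.append(sum(v * x for v, x in zip(v_a, hidden)))  # e_ij
    alpha = softmax(energies)                     # alpha_ij, normalized over j
    dim = len(H[0])
    context = [sum(a * h[k] for a, h in zip(alpha, H)) for k in range(dim)]  # c_i
    return alpha, context

# Toy values (hypothetical): 3 source positions, 2-dimensional states.
s_prev = [0.5, -0.2]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_a = [[0.1, 0.2], [0.3, 0.4]]
U_a = [[0.5, -0.5], [0.2, 0.1]]
v_a = [1.0, -1.0]
alpha, context = additive_attention(s_prev, H, W_a, U_a, v_a)
print(alpha, context)
```

The weights in `alpha` sum to 1, so `context` is a convex combination of the encoder states.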
Attention Mechanisms
Methods to compute attention
\[ \mathrm{score}(h_t, \bar{h}_s) = \begin{cases} h_t^\top \bar{h}_s & \text{dot} \\ h_t^\top W_a \bar{h}_s & \text{general} \\ v_a^\top \tanh(W_a [h_t; \bar{h}_s]) & \text{concat} \end{cases} \]
Notions
Query (Q): the decoder hidden state
Keys (K): encoder hidden states
Values (V): encoder hidden states
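The three scoring methods named on this slide (dot, general, concat) differ only in how the query h_t and a key h̄_s are combined. A rough sketch, assuming list-based vectors and hypothetical function names and toy values:

```python
import math

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, v) for row in M]

def score_dot(h_t, h_s):
    """dot: h_t^T h_s (requires equal dimensions, no parameters)."""
    return dot(h_t, h_s)

def score_general(h_t, h_s, W_a):
    """general: h_t^T W_a h_s (W_a maps the key into the query space)."""
    return dot(h_t, matvec(W_a, h_s))

def score_concat(h_t, h_s, W_a, v_a):
    """concat: v_a^T tanh(W_a [h_t; h_s]) applied to the concatenated vector."""
    hidden = [math.tanh(x) for x in matvec(W_a, h_t + h_s)]  # list '+' concatenates
    return dot(v_a, hidden)

# Toy values (hypothetical).
h_t = [1.0, 2.0]          # query: decoder hidden state
h_s = [0.5, -1.0]         # key: one encoder hidden state
identity = [[1.0, 0.0], [0.0, 1.0]]
W_concat = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
print(score_dot(h_t, h_s))
print(score_general(h_t, h_s, identity))
print(score_concat(h_t, h_s, W_concat, [1.0, 1.0]))
```

With an identity `W_a`, the general score reduces to the dot score, which is a quick sanity check on any implementation.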
Attention Mechanisms
[Figure (b): Advanced attention mechanism; labels: Encoder hidden states h, n-layer decoder blocks with multi-head attention, s_{t-1}, α_t, c_t, Softmax, Predictions.]
Attention Mechanisms
Multi-Head Attention
\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V \]
\[ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O \]
where \( \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) \)
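The two formulas on this slide can be traced with small list-based matrices. The sketch below is illustrative only: the helper names, the `heads` argument (a list of per-head projection triples) and the toy matrices are assumptions, and it omits masking, dropout and batching that real implementations add.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(S):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in S:
        top = max(row)
        exps = [math.exp(x - top) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    scaled = [[x / math.sqrt(d_k) for x in row] for row in scores]
    return matmul(softmax_rows(scaled), V)

def multi_head(Q, K, V, heads, W_o):
    """heads: list of (W_q, W_k, W_v) triples, one per head; W_o: output projection."""
    outputs = [attention(matmul(Q, W_q), matmul(K, W_k), matmul(V, W_v))
               for W_q, W_k, W_v in heads]
    # Concatenate the per-head outputs position by position.
    concat = [sum((head[i] for head in outputs), []) for i in range(len(Q))]
    return matmul(concat, W_o)

# Toy setup (hypothetical): model size 2, two heads of size 1 each.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
first = [[1.0], [0.0]]    # projection that keeps the first dimension
second = [[0.0], [1.0]]   # projection that keeps the second dimension
heads = [(first, first, first), (second, second, second)]
W_o = [[1.0, 0.0], [0.0, 1.0]]
out = multi_head(Q, K, V, heads, W_o)
print(out)
```

Each head here attends over a different one-dimensional projection, so the two halves of the concatenated output come from different attention distributions.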
CNN-based NMT models
Figure from Convolutional Sequence to Sequence Learning.
Transformer
Figure from Attention Is All You Need.
Understanding Attention Mechanisms
Motivation of Attention Mechanisms
There is no alignment model in NMT. Attention mechanisms try to mimic an alignment model, enabling the model to learn to translate and align jointly.
Question
Do attention mechanisms act as word alignment models in NMT?
Attention Visualization
[Figure: attention weight heatmaps involving the English sentence "who guarantees people a job ? </s>"; panels (a) Layer 1, (b) Layer 2, (c) Layer 3, (d) Layer 4, (e) Layer 5, (f) Layer 6, (g) Layer 7, (h) Layer 8, (i) RNN.]
Attention as Alignment
[Figure: two plots over Transformer layers L1 to L6 (x-axis: layer, 1 to 6; scale 0 to 100); the plotted values are not recoverable from the extracted text.]

Table 6: AER scores – results of various models on the RWTH English-German alignment data.
Method               AER
global (location)    0.39
local-m (general)    0.34
local-p (general)    0.36
ensemble             0.34
Berkeley Aligner     0.32

Table 1: AER of the proposed methods.
Methods           ZH⇒EN    DE⇒EN
FastAlign         36.57    26.58
Attention mean    56.44    74.59
Attention best    45.22    53.98
EAM               38.88    39.25
PD                41.77    42.81
* Results are measured on Transformer-L6.

Findings
RNN attention gives better alignments than Transformer attention.
Attention is much worse than traditional alignment models.
Figures from Effective Approaches to Attention-based Neural Machine Translation and On the Word Alignment from Neural Machine Translation.
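To make the AER numbers on this slide concrete, the sketch below turns an attention matrix into hard alignments and scores them with the standard alignment error rate, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where S are sure gold links and P possible gold links. Taking the argmax source position per target word is one simple extraction rule assumed here for illustration; it is not necessarily the procedure used in the cited papers, and the toy matrix and gold links are made up.

```python
def attention_to_alignment(attn):
    """attn[i][j] is the weight target word i puts on source word j.
    Returns a set of (target_index, source_index) links, one per target word."""
    links = set()
    for i, row in enumerate(attn):
        j = max(range(len(row)), key=lambda k: row[k])  # most attended source word
        links.add((i, j))
    return links

def alignment_error_rate(predicted, sure, possible):
    """AER of predicted links against sure and possible gold links (lower is better)."""
    possible = possible | sure  # sure links count as possible links too
    hits = len(predicted & sure) + len(predicted & possible)
    return 1.0 - hits / (len(predicted) + len(sure))

# Toy attention matrix and gold links (hypothetical).
attn = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]
predicted = attention_to_alignment(attn)
sure = {(0, 0), (1, 1)}
possible = {(2, 1)}
aer = alignment_error_rate(predicted, sure, possible)
print(sorted(predicted), aer)
```

Note that the two tables on the slide report AER on different scales (fractions in one, percentages in the other); the function above returns a fraction.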
Attention is not Alignment
Figure from What does Attention in Neural Machine Translation Pay Attention to?
Attention is not Alignment
[Figure: two attention matrices with weights shown in %.
(a) Desired Alignment: "relations between Obama and Netanyahu have been strained for years ." and "das Verhältnis zwischen Obama und Netanyahu ist seit Jahren gespannt ."
(b) Mismatched Alignment: "the relationship between Obama and Netanyahu has been stretched for years ." and "die Beziehungen zwischen Obama und Netanjahu sind seit Jahren angespannt ."]
Attention Distribution
POS tag    attention to alignment points %    attention to other words %
NUM        73                                 27
NOUN       68                                 32
ADJ        66                                 34
PUNC       55                                 45
ADV        50                                 50
CONJ       50                                 50
VERB       49                                 51
ADP        47                                 53
DET        45                                 55
PRON       45                                 55
PRT        36                                 64
Overall    54                                 46
Table 8: Distribution of attention probability mass (in %) over alignment points and the rest of the words for each POS tag.
Figure from What does Attention in Neural Machine Translation Pay Attention to?
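The quantity in this table, the share of attention probability mass that falls on alignment points, can be computed per POS tag roughly as follows. This is a sketch under stated assumptions rather than the paper's procedure: `tags[i]` is a hypothetical POS tag attached to the word whose attention row is `attn[i]`, and the matrix, links and tags are toy values.

```python
from collections import defaultdict

def attention_mass_by_tag(attn, alignment, tags):
    """Return {tag: percentage of attention mass on alignment points}.
    attn[i][j]: attention weight of word i on source word j.
    alignment: set of (i, j) gold alignment points.
    tags[i]: POS tag associated with row i (assumed convention)."""
    aligned = defaultdict(float)
    total = defaultdict(float)
    for i, row in enumerate(attn):
        total[tags[i]] += sum(row)
        aligned[tags[i]] += sum(w for j, w in enumerate(row) if (i, j) in alignment)
    return {tag: 100.0 * aligned[tag] / total[tag] for tag in total}

# Toy inputs (hypothetical).
attn = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]
alignment = {(0, 0), (1, 1), (2, 1)}
tags = ["NOUN", "VERB", "NOUN"]
shares = attention_mass_by_tag(attn, alignment, tags)
print(shares)
```

The remaining mass for each tag (100 minus the returned value) corresponds to the "attention to other words" column.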
Attention and Word Sense Disambiguation
[Figure 5: WSD accuracy over attention ranges. x-axis: attention range (0 to 1.0); y-axis: accuracy (0 to 1); series: RNN, Transformer.]
Figure from An Analysis of Attention Mechanisms: The Case of Word Sense Disambiguation in NMT.
Attention and Anaphora
Figure 5: An example of an attention map between source and context. On the y-axis are the source tokens, on the x-axis the context tokens. Note the high attention between “it” and its antecedent “heart”.
Figure from Context-Aware Neural Machine Translation Learns Anaphora Resolution.
Attention and Unknown Words
1. Find the corresponding source word
2. Look up the dictionary for its translation
Example from Zhaopeng Tu’s Tutorial, method from On Using Very Large Target Vocabulary for Neural Machine Translation.
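The two steps on this slide can be sketched as a post-processing pass over the decoder output. The function name, the `<unk>` symbol, the toy sentence, attention weights and dictionary are all hypothetical; copying the source word when the dictionary has no entry is an added assumption, not something stated on the slide.

```python
def replace_unknown(target_tokens, source_tokens, attn, dictionary, unk="<unk>"):
    """Replace each unknown target token using attention and a bilingual dictionary.
    attn[i][j]: attention weight of target position i on source position j."""
    output = []
    for i, token in enumerate(target_tokens):
        if token != unk:
            output.append(token)
            continue
        # Step 1: find the corresponding source word (highest attention weight).
        j = max(range(len(source_tokens)), key=lambda k: attn[i][k])
        source_word = source_tokens[j]
        # Step 2: look up its translation; fall back to copying the source word (assumption).
        output.append(dictionary.get(source_word, source_word))
    return output

# Toy example (hypothetical sentence, weights and dictionary).
source = ["Portico", "ist", "schön"]
target = ["<unk>", "is", "<unk>"]
attn = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
]
dictionary = {"schön": "beautiful"}
fixed = replace_unknown(target, source, attn, dictionary)
print(fixed)
```

This only works as well as the attention weights align, which is the limitation the earlier slides on attention versus alignment point to.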