Advanced Neural Machine Translation
Gongbo Tang
23 September 2019
Outline
1. NMT with Attention Mechanisms
   - Attention Mechanisms
   - Understanding Attention Mechanisms
   - Attention Variants
2. NMT at Different Granularities
   - Hybrid Models
   - Character-level Models
   - Subword-level Models
Figure 1.6: Encoder-decoder architecture – example of the general approach for NMT. An encoder converts a source sentence into a meaning vector which is passed through a decoder to produce a translation.
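A minimal numpy sketch of this encoder-decoder data flow, under assumed toy vocabulary sizes, plain RNN cells, greedy decoding, and no attention; all names and dimensions are illustrative, not the models discussed later in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
V_src, V_tgt, d = 100, 100, 32                 # toy vocabulary sizes and hidden size
E_src = rng.standard_normal((V_src, d)) * 0.1  # source embeddings
E_tgt = rng.standard_normal((V_tgt, d)) * 0.1  # target embeddings
W_enc = rng.standard_normal((d, d)) * 0.1
U_enc = rng.standard_normal((d, d)) * 0.1
W_dec = rng.standard_normal((d, d)) * 0.1
U_dec = rng.standard_normal((d, d)) * 0.1
W_out = rng.standard_normal((V_tgt, d)) * 0.1  # projection to the target vocabulary

def encode(src_ids):
    """Run a simple RNN over the source; the final state is the 'meaning vector'."""
    h = np.zeros(d)
    for i in src_ids:
        h = np.tanh(W_enc @ E_src[i] + U_enc @ h)
    return h

def decode(meaning, max_len=10, bos=0, eos=1):
    """Greedily generate target ids conditioned only on the meaning vector."""
    s, y, out = meaning, bos, []
    for _ in range(max_len):
        s = np.tanh(W_dec @ E_tgt[y] + U_dec @ s)
        y = int(np.argmax(W_out @ s))          # most probable next word
        if y == eos:
            break
        out.append(y)
    return out

print(decode(encode([5, 7, 42])))              # translation of a toy source sentence
```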
Figure: attention-based decoder – encoder states feed the attention mechanism, which produces the input context for the decoder hidden state that generates the output words.
Example from Rico Sennrich’s EACL 2017 NMT talk
Figure: RNN encoder-decoder with attention – encoder hidden states h_j, attention weights α_i, previous decoder state s_{i−1}, context vector c_i, softmax predictions.

Computation
e_{ij} = v_a^T tanh(W_a s_{i−1} + U_a h_j)
α_{ij} = exp(e_{ij}) / Σ_k exp(e_{ik})
c_i = Σ_{j=1}^{T_x} α_{ij} h_j
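A minimal numpy sketch of one additive attention step, mirroring the three equations above; shapes and random parameters are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d, Tx = 16, 6                                # hidden size, source length
h = rng.standard_normal((Tx, d))             # encoder hidden states h_1 .. h_Tx
s_prev = rng.standard_normal(d)              # previous decoder state s_{i-1}
W_a = rng.standard_normal((d, d))
U_a = rng.standard_normal((d, d))
v_a = rng.standard_normal(d)

# e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)
e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h[j]) for j in range(Tx)])

# alpha_ij: softmax over the Tx source positions
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# c_i = sum_j alpha_ij h_j
c = alpha @ h

print(alpha.round(3), c.shape)               # weights sum to 1, context has size d
```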
Methods to compute attention
score(h_t, h̄_s) = h_t^T h̄_s (dot)
score(h_t, h̄_s) = h_t^T W_a h̄_s (general)
score(h_t, h̄_s) = v_a^T tanh(W_a [h_t; h̄_s]) (concat)

Notions
Query (Q): the decoder hidden state
Keys (K): encoder hidden states
Values (V): encoder hidden states
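A small sketch of the three scoring functions (dot, general, concat) with assumed toy shapes; only the score definitions come from the slide, the softmax glue is the same as before. W_c is the concat variant's weight matrix (the slide writes W_a for both):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16
h_t = rng.standard_normal(d)                  # decoder state h_t (the query)
h_bar = rng.standard_normal((5, d))           # encoder states (keys and values)
W_a = rng.standard_normal((d, d))             # for the "general" score
W_c = rng.standard_normal((d, 2 * d))         # for the "concat" score
v_a = rng.standard_normal(d)

def score_dot(h_t, h_s):
    return h_t @ h_s

def score_general(h_t, h_s):
    return h_t @ W_a @ h_s

def score_concat(h_t, h_s):
    return v_a @ np.tanh(W_c @ np.concatenate([h_t, h_s]))

for score in (score_dot, score_general, score_concat):
    e = np.array([score(h_t, h_s) for h_s in h_bar])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                      # attention weights over the source states
    print(score.__name__, alpha.round(3))
```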
Figure: decoder with multi-head attention over n-layer blocks – encoder hidden states h, attention weights α_t, previous state s_{t−1} and context c_t in each decoder block, followed by softmax predictions.
Multi-Head Attention
Attention(Q, K, V) = softmax(QK^T / √d_k) V
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
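A compact numpy sketch of these two formulas; dimensions and weights are illustrative, and masking, dropout, and residual connections are omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(Q, K, V, heads, W_Q, W_K, W_V, W_O):
    """MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O"""
    outs = [attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i]) for i in range(heads)]
    return np.concatenate(outs, axis=-1) @ W_O

rng = np.random.default_rng(3)
n, m, d_model, h = 4, 6, 32, 4               # target len, source len, model dim, heads
d_k = d_model // h
Q = rng.standard_normal((n, d_model))        # queries (e.g. decoder states)
K = V = rng.standard_normal((m, d_model))    # keys and values (e.g. encoder states)
W_Q = rng.standard_normal((h, d_model, d_k))
W_K = rng.standard_normal((h, d_model, d_k))
W_V = rng.standard_normal((h, d_model, d_k))
W_O = rng.standard_normal((h * d_k, d_model))

print(multi_head(Q, K, V, h, W_Q, W_K, W_V, W_O).shape)   # (4, 32)
```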
Figure from Convolutional Sequence to Sequence Learning.
Figure from Attention Is All You Need.
Motivation of Attention Mechanisms
There is no alignment model in NMT. Attention mechanisms try to mimic the alignment model and enable the model to learn to translate and align jointly.
Question: Do attention mechanisms act as word alignment models in NMT?
Figure: attention matrices for the sentence pair "wer garantiert den Leuten eine Stelle ?" and "who guarantees people a job ? </s>" – panels (a)–(h) show Transformer layers 1–8, panel (i) shows the RNN attention.
Table 6: AER scores – results of various models.
Method            | AER
global (location) | 0.39
local-m (general) | 0.34
local-p (general) | 0.36
ensemble          | 0.34
Berkeley Aligner  | 0.32
Figure: layer-wise AER heatmaps of Transformer attention (left: ZH⇒EN, right: DE⇒EN).
Table 1: AER of the proposed methods (measured on TRANSFORMER-L6).
Methods        | ZH⇒EN | DE⇒EN
FAST ALIGN     | 36.57 | 26.58
Attention mean | 56.44 | 74.59
Attention best | 45.22 | 53.98
EAM            | 38.88 | 39.25
PD             | 41.77 | 42.81
Findings
RNN attention aligns better than Transformer attention.
Attention is much worse than a traditional alignment model.
Figures from Effective Approaches to Attention-based Neural Machine Translation and On the Word Alignment from Neural Machine Translation.
Figure from What does Attention in Neural Machine Translation Pay Attention to?
Figure: attention weights for the sentence pairs "relations between Obama and Netanyahu have been strained for years ." / "die Beziehungen zwischen Obama und Netanjahu sind seit Jahren angespannt ." and "das Verhältnis zwischen Obama und Netanyahu ist seit Jahren gespannt ." / "the relationship between Obama and Netanyahu has been stretched for years ." – (a) Desired Alignment, (b) Mismatched Alignment.
Table 8: Distribution of attention probability mass (in %) over alignment points and the rest of the words for each POS tag.
POS tag | % attention to alignment points | % attention to other words
NUM     | 73 | 27
NOUN    | 68 | 32
ADJ     | 66 | 34
PUNC    | 55 | 45
ADV     | 50 | 50
CONJ    | 50 | 50
VERB    | 49 | 51
ADP     | 47 | 53
DET     | 45 | 55
PRON    | 45 | 55
PRT     | 36 | 64
Overall | 54 | 46

Figure from What does Attention in Neural Machine Translation Pay Attention to?
Figure: word sense disambiguation accuracy against attention range (0–1.0) for RNN and Transformer models.
Figure from An Analysis of Attention Mechanisms: The Case of Word Sense Disambiguation in NMT.
Figure from Context-Aware Neural Machine Translation Learns Anaphora Resolution.
Figure: source word and its translation – method from On Using Very Large Target Vocabulary for Neural Machine Translation.
Figure: global attention – the attention layer computes global alignment weights and a context vector.
Figure from Effective Approaches to Attention-based Neural Machine Translation.
Figure: local attention – the attention layer predicts an aligned position and computes local weights and a context vector.
Figure from Effective Approaches to Attention-based Neural Machine Translation.
Figure: attention maps for two examples with word repetition and omission in the output (e.g. "this is the last hundred years law" and "i am going to give you the government government .").
Figure from Sparse and Constrained Attention for Neural Machine Translation.
Figure – The conventional attention distribution
Constraint: the attention weights over each source token should sum to 1.
Figure 1: (a) the conventional attention mechanism and (b) the proposed fine-grained attention.
Σ_{t′} α_{t′,t} = 1 in the conventional method, and Σ_{t′} α^d_{t′,t} = 1 for every dimension d in the proposed method.
From Fine-Grained Attention Mechanism for Neural Machine Translation.
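A rough sketch of the difference in normalisation, with assumed toy shapes and ad-hoc scores; it only illustrates the two constraints above, not the paper's actual score function:

```python
import numpy as np

rng = np.random.default_rng(4)
T_src, d = 5, 8
scores = rng.standard_normal((T_src, d))   # a score per source position and dimension
h = rng.standard_normal((T_src, d))        # encoder hidden states

# Conventional: one distribution over source positions (sum over t' of alpha_{t',t} = 1)
alpha = np.exp(scores.sum(axis=1))
alpha /= alpha.sum()
c_conv = alpha @ h

# Fine-grained: one distribution per dimension d (sum over t' of alpha^d_{t',t} = 1)
alpha_d = np.exp(scores)
alpha_d /= alpha_d.sum(axis=0, keepdims=True)
c_fine = (alpha_d * h).sum(axis=0)

print(alpha.sum().round(3), alpha_d.sum(axis=0).round(3))   # 1.0 and a vector of ones
print(c_conv.shape, c_fine.shape)                           # (8,) (8,)
```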
Figure from Attention-via-Attention Neural Machine Translation.
Expensive computation during prediction: the probabilities are computed over the entire target-side vocabulary.
Figure: word prediction in the decoder – context vectors c_{i−1}, c_i, hidden states s_{i−1}, s_i, embeddings E y_{i−1}, E y_i of the previous and selected words y_{i−1}, y_i, and candidate target words (the, cat, this, fish, there, dog, these).
Figure from Philipp Koehn’s slides.
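A tiny sketch of why this step is costly, with made-up sizes: every decoding step multiplies the hidden state by a |V| × d output matrix and normalises over the whole target vocabulary:

```python
import numpy as np

rng = np.random.default_rng(5)
V, d = 50000, 256                       # assumed target vocabulary and state sizes
W_out = rng.standard_normal((V, d)) * 0.01
s_i = rng.standard_normal(d)            # decoder hidden state at step i

logits = W_out @ s_i                    # O(|V| * d) multiply-adds per output word
logits -= logits.max()                  # numerical stability
p = np.exp(logits)
p /= p.sum()                            # probabilities over the entire vocabulary

print(p.shape, int(p.argmax()))         # (50000,) index of the predicted word y_i
```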
MT is an open-vocabulary problem
compounding and other productive morphological processes
they charge a carry-on bag fee. sie erheben eine Hand|gepäck|gebühr.
names
Obama (English; German), Обама (Russian), オバマ (o-ba-ma) (Japanese)
technical terms, numbers, etc.
Problem: a limited vocabulary causes out-of-vocabulary words.
Figure from Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models.
Figure – character-to-word (C2W) compositional model: a BLSTM reads the characters "W h e r e" and composes the word vector for "Where".
Figure from Character-based Neural Machine Translation.
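A simplified sketch of the C2W idea, with plain RNNs standing in for the BLSTM and made-up dimensions; the final forward and backward states are combined into a single word vector:

```python
import numpy as np

rng = np.random.default_rng(6)
d_char, d_word = 8, 16
char_emb = {c: rng.standard_normal(d_char) for c in "abcdefghijklmnopqrstuvwxyzW"}

def make_rnn():
    return (rng.standard_normal((d_word, d_char)) * 0.1,   # input weights
            rng.standard_normal((d_word, d_word)) * 0.1,   # recurrent weights
            np.zeros(d_word))                              # bias

fwd, bwd = make_rnn(), make_rnn()
P_f = rng.standard_normal((d_word, d_word))
P_b = rng.standard_normal((d_word, d_word))
b_out = np.zeros(d_word)

def rnn_run(params, xs):
    W, U, b = params
    h = np.zeros(d_word)
    for x in xs:
        h = np.tanh(W @ x + U @ h + b)
    return h

def c2w(word):
    """Compose a word vector from character embeddings (plain RNNs in place of a BLSTM)."""
    xs = [char_emb[c] for c in word]
    h_f = rnn_run(fwd, xs)            # read characters left to right
    h_b = rnn_run(bwd, xs[::-1])      # and right to left
    return P_f @ h_f + P_b @ h_b + b_out

print(c2w("Where").shape)             # a single vector for the word "Where"
```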
Figure – vector-to-character (V2C) generation model: a forward LSTM generates the characters "e s t a" between the SOW and EOW markers.
Figure from Character-based Neural Machine Translation.
Pros
- small vocabulary
- no segmentation/tokenization required
- can model different, rare morphological variants of a word
Cons
- sequences are longer (harder for training)
- training time is longer
Byte pair encoding algorithm
Frequent character n-grams (or whole words) are merged into a single symbol.
Byte pair encoding algorithm
- symbols can be applied to unknown words
- trade-off between text length and vocabulary size
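A minimal sketch of the merge-learning loop described above, in the spirit of the toy example in the original BPE paper; the vocabulary and the number of merge operations are illustrative:

```python
import re
import collections

def get_stats(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge every occurrence of the chosen pair into a single new symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words as space-separated characters with an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

for step in range(10):                   # learn 10 merge operations
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
    print(step, best)
```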
Linguistically Motivated Vocabulary Reduction
not only based on frequency, but also on morphological knowledge
Linguistically Motivated Vocabulary Reduction for NMT (331–342)
Table 5. Results of Experiment 1 – TED corpus and no-OOV case. Top: output accuracies, where marked scores indicate statistically significant improvement over the BPE baseline (p-value 0.05). Bottom: translation examples.

Method          | BLEU  | TER   | CHRF3
No Segmentation | 17.77 | 68.07 | 38.94
BPE             | 19.52 | 66.23 | 42.33
Supervised      | 21.61 | 61.76 | 44.01
LMVR            | 21.71 | 61.41 | 43.90

Input (Reference)                 | Method     | Segmentation               | Output
ağlarını (the nets)               | BPE        | ağ@@ larını                | the cry
                                  | LMVR       | ağ +larını                 | the nets
                                  | Supervised | ağ +Noun +A3pl <EOW>       | networks
ağlamayacak (would not be crying) | BPE        | ağ@@ lamayacak             | will not survive
                                  | LMVR       | ağlama +yacak              | will not cry
                                  | Supervised | ağla +Neg +Fut +A3sg <EOW> | will not cry
Table 5 shows the performance of different segmentation methods in Experiment 1. Our linguistically motivated vocabulary reduction (LMVR) method achieves the best performance on average, proving our hypothesis that a correct morphological representation generates more accurate translations. Our method outperforms the strong baseline of BPE-based segmentation by 2.2 BLEU, 4.8 TER and 1.6 CHRF3 points. The performance is slightly higher than the supervised method, which is related to the ambiguity caused by loss of information during the morphological analysis. The predicted vocabularies also indicate the significant difference between LMVR and BPE, where 73% of the sub-word units in the vocabulary are completely different. In order to better illustrate the properties of the generated sub-word units, we present example translations of two words from the test set. The two words have different roots, the first one is ağ (translation: net), and the second one is ağla (translation: (to) cry). BPE segments both words to the same root ağ, a character sequence frequently observed in root words in Turkish. In the first case, both unsupervised methods segment the word into the same sub-word units, while the embedding of the sub-word unit segmented with BPE is semantically ambiguous and generates unreliable translations. On the
In Experiment 2, we evaluate our method at different rates of vocabulary reduction according to the vocabulary sizes given in Table 4. All metrics confirm that our method achieves better performance than the baseline in both experiments. In Experiment 2.a, at a vocabulary reduction rate of 4.25 (140K -> 40K), we obtain an improvement of 2.3 BLEU points over the baseline. In the most challenging case, Experiment
Figure from Linguistically Motivated Vocabulary Reduction for Neural Machine Translation from Turkish to English.
Morfessor
Figure 1: Screenshot from the Morfessor 2.0 demo.
Figure from Morfessor 2.0: Toolkit for statistical morphological segmentation.
Figure: softmax decoder output layer – an output projection of size d_h × |V|, a target embedding matrix of size |V| × d, with d = d_h.
Figure from the paper Beyond Weight Tying: Learning Joint Input-Output Embeddings for Neural Machine Translation.
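The paper goes beyond plain weight tying, but as a minimal reference point, here is a sketch of standard tying: one |V| × d matrix serves both as the input embedding and, since d = d_h, as the output projection (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
V, d = 10000, 512                          # assumed vocabulary and (tied) dimensions
E = rng.standard_normal((V, d)) * 0.01     # shared input/output embedding, |V| x d

def embed(token_id):
    """Input side: look up the embedding row of a token."""
    return E[token_id]

def output_distribution(s):
    """Output side: reuse E as the d_h -> |V| projection (requires d = d_h)."""
    logits = E @ s                         # one score per target word
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()

s = rng.standard_normal(d)                 # a decoder hidden state
p = output_distribution(s)
print(embed(3).shape, p.shape, round(float(p.sum()), 3))   # (512,) (10000,) 1.0
```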