Neural Machine Translation
Gongbo Tang
8 October 2018
Outline
1. Neural Machine Translation
2. Advances and Challenges
Figure – Recurrent neural network based NMT model
From Thang Luong's thesis on Neural Machine Translation
Figure – NMT model with attention mechanism
Example from Rico Sennrich's EACL 2017 NMT talk
- a source sentence $S$ of length $m$: $(x_1, \dots, x_m)$
- a target sentence $T$ of length $n$: $(y_1, \dots, y_n)$
Target-side language model:

$$p(T) = \prod_{i=1}^{n} p(y_i \mid y_1, \dots, y_{i-1})$$

Translation model:

$$p(T \mid S) = \prod_{i=1}^{n} p(y_i \mid y_1, \dots, y_{i-1}, x_1, \dots, x_m)$$

We could just treat the sentence pair as one long sequence, but:
- we do not care about $p(S)$
- we may want a different vocabulary and network architecture for the source text
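To make the factorization concrete, a minimal sketch that scores a target sequence by accumulating per-step log-probabilities (the probabilities are hypothetical stand-ins for the decoder's softmax outputs):

```python
import math

# Hypothetical per-step probabilities p(y_i | y_1..y_{i-1}, x_1..x_m) for a
# target like "hello world ! <eos>"; in a real NMT system these come from
# the decoder softmax, conditioned on the source sentence.
step_probs = [0.946, 0.957, 0.928, 0.999]

# p(T|S) is the product of the per-step probabilities; accumulate in log
# space to avoid underflow on long sentences.
neg_log_p = -sum(math.log(p) for p in step_probs)
print(f"p(T|S) = {math.exp(-neg_log_p):.3f}, -log p(T|S) = {neg_log_p:.3f}")
```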
Simplifications of the model by [Bahdanau et al., 2015] (for illustration):
- plain RNN instead of GRU
- simpler output layer
- we do not show bias terms
- decoder follows the Look, Update, Generate strategy [Sennrich et al., 2017]
- details in https://github.com/amunmt/amunmt/blob/master/contrib/notebooks/dl4mt.ipynb

Notation: $W$, $U$, $E$, $C$, $V$ are weight matrices (of different dimensionality):
- $E$: one-hot to embedding (e.g. 50000 × 512)
- $W$: embedding to hidden (e.g. 512 × 1024)
- $U$: hidden to hidden (e.g. 1024 × 1024)
- $C$: context (2× hidden) to hidden (e.g. 2048 × 1024)
- $V_o$: hidden to one-hot (e.g. 1024 × 50000)
- separate weight matrices for encoder and decoder (e.g. $E_x$ and $E_y$)
- input $X$ of length $T_x$; output $Y$ of length $T_y$

Example from Rico Sennrich's EACL 2017 NMT talk
Figure – NMT model with attention mechanism
Example from Rico Sennrich's EACL 2017 NMT talk
Encoder: input word embeddings, a left-to-right recurrent NN, and a right-to-left recurrent NN.
$$\overrightarrow{h}_j = \begin{cases} \mathbf{0} & \text{if } j = 0 \\ \tanh(\overrightarrow{W}_x E_x x_j + \overrightarrow{U}_x \overrightarrow{h}_{j-1}) & \text{if } j > 0 \end{cases}$$

$$\overleftarrow{h}_j = \begin{cases} \mathbf{0} & \text{if } j = T_x + 1 \\ \tanh(\overleftarrow{W}_x E_x x_j + \overleftarrow{U}_x \overleftarrow{h}_{j+1}) & \text{if } j \leq T_x \end{cases}$$

$$h_j = (\overrightarrow{h}_j, \overleftarrow{h}_j)$$
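A minimal numpy sketch of this bidirectional encoder (toy dimensions instead of 50000/512/1024; random weights stand in for trained $\overrightarrow{W}_x$, $\overrightarrow{U}_x$, $\overleftarrow{W}_x$, $\overleftarrow{U}_x$):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, emb, hid, Tx = 20, 8, 16, 5           # toy sizes for readability

Ex = rng.normal(0, 0.1, (emb, vocab))        # one-hot -> embedding
Wf, Uf = rng.normal(0, 0.1, (hid, emb)), rng.normal(0, 0.1, (hid, hid))  # forward RNN
Wb, Ub = rng.normal(0, 0.1, (hid, emb)), rng.normal(0, 0.1, (hid, hid))  # backward RNN

x = rng.integers(0, vocab, Tx)               # source sentence as word ids

# forward pass: h_0 = 0, then h_j = tanh(W Ex x_j + U h_{j-1})
hf = [np.zeros(hid)]
for j in range(Tx):
    hf.append(np.tanh(Wf @ Ex[:, x[j]] + Uf @ hf[-1]))

# backward pass: h_{Tx+1} = 0, then h_j = tanh(W Ex x_j + U h_{j+1})
hb = [np.zeros(hid)]
for j in reversed(range(Tx)):
    hb.append(np.tanh(Wb @ Ex[:, x[j]] + Ub @ hb[-1]))
hb = hb[1:][::-1]                            # reorder to positions 1..Tx

# annotation h_j concatenates both directions
h = np.stack([np.concatenate([f, b]) for f, b in zip(hf[1:], hb)])
print(h.shape)                               # (Tx, 2*hid)
```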
Figure – NMT model with attention mechanism
Example from Rico Sennrich's EACL 2017 NMT talk
Figure – decoder: previous state $s_{i-1}$ and context $c_{i-1}$, embedding $Ey_{i-1}$ of the previously selected word $y_{i-1}$, new context $c_i$ and state $s_i$, word prediction $t_i$, selected word $y_i$.
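A hedged sketch of one decoder step under the Look, Update, Generate strategy mentioned earlier (the wiring is a plausible reconstruction following the notation slide, not the exact lecture code; the `attend` helper is sketched after the attention slides below):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(s_prev, y_prev, h, Ey, W, U, C, Vo, attend):
    """One decoder step: Look, Update, Generate (a sketch)."""
    c_i = attend(s_prev, h)                   # Look: context from encoder states
    s_i = np.tanh(W @ Ey[:, y_prev] + U @ s_prev + C @ c_i)  # Update: new state
    t_i = softmax(Vo @ s_i)                   # Generate: distribution over vocab
    return s_i, c_i, t_i                      # pick y_i from t_i (greedy or beam)
```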
Figure – beam search tree: each hypothesis shows the word's probability and the cumulative negative log-probability, e.g. hello (0.946 / 0.056) → world (0.957 / 0.100) → ! (0.928 / 0.175) → <eos> (0.999 / 0.175); weaker alternatives such as World (0.010 / 4.632), . (0.030 / 3.609), ... (0.014 / 4.384), HI (0.007 / 4.920), and Hey (0.006 / 5.107) fall off the beam.
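Such a tree is produced by beam search; a minimal sketch of the idea, with a toy `step` function standing in for the real decoder (hypothetical, not the lecture's implementation):

```python
import math

def beam_search(step, beam_size=2, max_len=10, eos="<eos>"):
    """step(prefix) -> {word: probability} for the next position."""
    beams = [([], 0.0)]                       # (prefix, cumulative -log prob)
    for _ in range(max_len):
        candidates = []
        for prefix, cost in beams:
            if prefix and prefix[-1] == eos:  # finished hypothesis: keep as-is
                candidates.append((prefix, cost))
                continue
            for word, p in step(prefix).items():
                candidates.append((prefix + [word], cost - math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1])[:beam_size]
    return beams

# toy model: top next-word candidates and their probabilities
# (remaining probability mass on other words omitted)
table = {(): {"hello": 0.946, "HI": 0.007, "Hey": 0.006},
         ("hello",): {"world": 0.957, "World": 0.010, ".": 0.030},
         ("hello", "world"): {"!": 0.928, "...": 0.014},
         ("hello", "world", "!"): {"<eos>": 0.999}}

def step(prefix):
    return table.get(tuple(prefix), {"<eos>": 1.0})

print(beam_search(step))   # best: hello world ! <eos>, cost ~0.175
```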
Figure – NMT model with attention mechanism
Example from Rico Sennrich's EACL 2017 NMT talk
Figure – attention mechanism: encoder hidden states $h$, attention weights $\alpha_i$, decoder state $s_{i-1}$, context vector $c_i$, softmax predictions.

$$e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$
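The same computation in numpy; this is also the `attend` helper assumed in the decoder sketch above (random weights stand in for trained $W_a$, $U_a$, $v_a$):

```python
import numpy as np

rng = np.random.default_rng(1)
hid, att, Tx = 16, 12, 5
Wa = rng.normal(0, 0.1, (att, hid))       # projects decoder state s_{i-1}
Ua = rng.normal(0, 0.1, (att, 2 * hid))   # projects encoder annotations h_j
va = rng.normal(0, 0.1, att)

def attend(s_prev, h):
    """c_i = sum_j alpha_ij h_j, with alpha a softmax over scores e_ij."""
    e = np.array([va @ np.tanh(Wa @ s_prev + Ua @ hj) for hj in h])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    return alpha @ h                      # context vector c_i, shape (2*hid,)

h = rng.normal(size=(Tx, 2 * hid))        # stand-in encoder annotations
print(attend(rng.normal(size=hid), h).shape)
```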
From What does Attention in Neural Machine Translation Pay Attention to?
POS tag   % to alignment points   % to other words
NUM              73                     27
NOUN             68                     32
ADJ              66                     34
PUNC             55                     45
ADV              50                     50
CONJ             50                     50
VERB             49                     51
ADP              47                     53
DET              45                     55
PRON             45                     55
PRT              36                     64
Overall          54                     46

Table 8: Distribution of attention probability mass (in %) over alignment points and the rest of the words for each POS tag.

From What does Attention in Neural Machine Translation Pay Attention to?
Figure – (a) Desired alignment: "relations between Obama and Netanyahu have been strained for years." ↔ "die Beziehungen zwischen Obama und Netanjahu sind seit Jahren angespannt." (b) Mismatched alignment: "das Verhältnis zwischen Obama und Netanyahu ist seit Jahren gespannt." ↔ "the relationship between Obama and Netanyahu has been stretched for years."
Figure – looking up the translation of an attended source word; method from On Using Very Large Target Vocabulary for Neural Machine Translation
(a) Vanilla attention mechanism: encoder hidden states $h$, attention weights $\alpha_t$, decoder state $s_{t-1}$, context $c_t$, softmax predictions.

(b) Advanced attention mechanism: multi-head attention connecting the encoder hidden states to $n$-layer decoder blocks (each block with its own $s_{t-1}$, $c_t$), followed by softmax predictions.
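A minimal sketch of multi-head attention as used in such decoder blocks (after Vaswani et al.; the per-head projections are simplified here to slicing the model dimension, whereas the real model uses learned projection matrices per head):

```python
import numpy as np

def multi_head_attention(Q, K, V, n_heads):
    """Split d_model into n_heads subspaces, attend in each, concatenate."""
    d_model = Q.shape[-1]
    d_k = d_model // n_heads
    outputs = []
    for i in range(n_heads):        # one scaled dot-product attention per head
        sl = slice(i * d_k, (i + 1) * d_k)
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_k)
        alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
        alpha /= alpha.sum(axis=-1, keepdims=True)
        outputs.append(alpha @ V[:, sl])
    return np.concatenate(outputs, axis=-1)  # (T_q, d_model)

rng = np.random.default_rng(2)
q = rng.normal(size=(3, 16))                 # 3 decoder positions
kv = rng.normal(size=(5, 16))                # 5 encoder states
print(multi_head_attention(q, kv, kv, n_heads=4).shape)   # (3, 16)
```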
Figure – attention heat maps for "wer garantiert den Leuten eine Stelle ?" → "who guarantees people a job ? </s>": (a)–(h) Transformer layers 1–8, (i) RNN.
Figure – Average attention weights in different Transformer attention layers (L1–L8), for ambiguous nouns vs. all tokens.
Figure 1: (a) The conventional attention mechanism and (b) the proposed fine-grained attention: $\sum_{t} \alpha_{t',t} = 1$ in the conventional method, and $\sum_{t} \alpha^{d}_{t',t} = 1$ for every dimension $d$ in the proposed method.

From Fine-Grained Attention Mechanism for Neural Machine Translation
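The two normalizations side by side in numpy (random scores as stand-ins): conventional attention puts one weight on each source position, fine-grained attention one weight per position and per dimension:

```python
import numpy as np

rng = np.random.default_rng(3)
T, d = 4, 3                            # source positions, hidden dimensions
h = rng.normal(size=(T, d))            # encoder states

# conventional: scores e_t -> softmax over t, so sum_t alpha_t = 1
e = rng.normal(size=T)
alpha = np.exp(e) / np.exp(e).sum()
c_conv = alpha @ h                     # one scalar weight per position

# fine-grained: scores e_{t,d} -> softmax over t separately per dimension d,
# so sum_t alpha_{t,d} = 1 for every d
E = rng.normal(size=(T, d))
A = np.exp(E) / np.exp(E).sum(axis=0, keepdims=True)
c_fine = (A * h).sum(axis=0)           # per-dimension weighting

print(alpha.sum(), A.sum(axis=0))      # 1.0 and [1. 1. 1.]
```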
Figure – encoder architectures over $x_1, \dots, x_5$: (a) RNN, (b) CNN (with padding), (c) self-attention.
Figure – (a) Cascaded encoder, (b) multi-column encoder.
Open-vocabulary translation is hard for:
- compounding and other productive morphological processes:
  they charge a carry-on bag fee. → sie erheben eine Hand|gepäck|gebühr.
- names:
  Obama (English; German), Обама (Russian), オバマ (o-ba-ma) (Japanese)
- technical terms, numbers, etc.
Figure – C2W compositional model: a BLSTM reads the characters "W h e r e" and composes the word vector for "Where".

From Character-based Neural Machine Translation
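A sketch of the C2W idea, with plain RNNs standing in for the BLSTM and simple concatenation of the final states (the original model combines them with additional learned matrices):

```python
import numpy as np

rng = np.random.default_rng(4)
n_chars, ce, ch = 128, 8, 16
Ec = rng.normal(0, 0.1, (n_chars, ce))     # character embeddings
Wf, Uf = rng.normal(0, 0.1, (ch, ce)), rng.normal(0, 0.1, (ch, ch))
Wb, Ub = rng.normal(0, 0.1, (ch, ce)), rng.normal(0, 0.1, (ch, ch))

def c2w(word):
    """Compose a word vector from its characters (plain RNNs, not LSTMs)."""
    chars = [Ec[ord(c) % n_chars] for c in word]
    f = np.zeros(ch)
    for x in chars:                        # left-to-right pass
        f = np.tanh(Wf @ x + Uf @ f)
    b = np.zeros(ch)
    for x in reversed(chars):              # right-to-left pass
        b = np.tanh(Wb @ x + Ub @ b)
    return np.concatenate([f, b])          # final states form the word vector

print(c2w("Where").shape)                  # (2*ch,) for any spelling
```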
Figure – V2C generation model: a forward LSTM generates the target characters "e s t a" between SOW and EOW markers.

From Character-based Neural Machine Translation
From Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models
$$\overrightarrow{h}_j = \tanh\Big(\overrightarrow{W} \Big( \big\Vert_{k=1}^{|F|} E_k x_{jk} \Big) + \overrightarrow{U}\, \overrightarrow{h}_{j-1}\Big)$$

where $\Vert$ denotes concatenation over the $|F|$ input features.

Example: $E_1(\text{close}) = (0.4, 0.1, 0.2)$ and $E_2(\text{adj}) = (0.1)$, so $E_1(\text{close}) \,\Vert\, E_2(\text{adj}) = (0.4, 0.1, 0.2, 0.1)$.

From Linguistic Input Features Improve Neural Machine Translation
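The example as code (toy embedding tables; each input feature is looked up in its own table and the results are concatenated before the encoder RNN sees them):

```python
import numpy as np

# toy embedding tables, one per input feature F = {word, POS}
E1 = {"close": np.array([0.4, 0.1, 0.2])}   # word embedding (3 dims)
E2 = {"adj":   np.array([0.1])}             # POS embedding (1 dim)

# the encoder input for "close/adj" concatenates both lookups
x = np.concatenate([E1["close"], E2["adj"]])
print(x)    # [0.4 0.1 0.2 0.1], fed into tanh(W x + U h_{j-1}) as before
```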
SGD is sensitive to the order of training instances. Best practice (sketched in code below):
- first train on all available data
- continue training on in-domain data

Large BLEU improvements reported with minutes of training time [Sennrich et al., 2016, Luong and Manning, 2015, Crego et al., 2016].
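A sketch of that schedule (the `train` helper and epoch counts are hypothetical; only the two-stage order matters):

```python
def fine_tune(model, train, generic_data, in_domain_data):
    """Two-stage schedule: generic training, then continued in-domain training.

    Because SGD is sensitive to the order of training instances, ending with
    in-domain data pulls the model towards the in-domain distribution.
    """
    train(model, generic_data, epochs=5)      # e.g. ~8M generic sentence pairs
    train(model, in_domain_data, epochs=2)    # e.g. ~200k TED sentence pairs
    return model
```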
Fine-tuning in IWSLT (en-de), BLEU:

             tst2013   tst2014   tst2015
Baseline       26.5      23.5      25.5
Fine-tuned     30.4      25.9      28.4

Generic system (≈ 8M sentences), fine-tuned with TED (≈ 200k sentences).

Example from Rico Sennrich's EACL 2017 NMT talk
Figure 1: Illustration of our approach, after (Belinkov et al., 2017): (i) NMT system trained on ...

From Evaluating Layers of Representation in Neural Machine Translation on Part-of-Speech and Semantic Tagging Tasks
Figure – BLEU scores with varying amounts of training data (corpus size in English words, 10^6 to 10^8): phrase-based with big LM (21.8 up to 30.4), phrase-based (16.4 up to 28.6), and neural (1.6 up to 31.1). The neural system starts far behind on small corpora but overtakes both phrase-based systems at the largest sizes.