Introduction to Neural Machine Translation
Gongbo Tang
16 September 2019
Outline
1. Why Neural Machine Translation?
2. Introduction to Neural Networks
3. Neural Language Models
Figure: an overview of SMT (translation model, language model, reordering model, training; morphology, compounds, syntactic trees, pre-reordering, factored SMT).
Figure: a word represented as a dense vector of real numbers, e.g. (−0.2, −0.1, 0.1, 0.4, −0.3, 1.1). Figure from Luong et al. ACL 2016 NMT tutorial.
Is Neural Machine Translation Ready for Deployment? A Case Study on 30 Translation Directions (Junczys-Dowmunt et al., 2016)
Figure from Tie-Yan Liu's NMT slides
Activation functions:
- Hyperbolic tangent: tanh(x) = sinh(x)/cosh(x) = (e^x − e^(−x)) / (e^x + e^(−x)), output range −1 to +1
- Logistic function: sigmoid(x) = 1 / (1 + e^(−x)), output range 0 to +1
- Rectified linear unit: relu(x) = max(0, x), output range 0 to ∞
Figure from Philipp Koehn's NMT chapter
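A minimal numpy sketch of these three activations (not from the slides; just to make the formulas concrete):

```python
import numpy as np

def tanh(x):
    return np.tanh(x)                    # output in (-1, +1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # output in (0, +1)

def relu(x):
    return np.maximum(0.0, x)            # output in [0, +inf)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(tanh(x), sigmoid(x), relu(x))
```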
A neuron computes y = g(w · x + b), where:
- x = (x1, x2, x3, ..., xn) is a vector input, y is a scalar output
- w and b are the parameters (b is a bias term)
- g is a non-linear activation function
Example from Rico Sennrich's EACL 2017 NMT talk
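A minimal sketch of one such neuron, assuming the logistic function as g (any non-linearity would do); the input and weight values are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """Compute y = g(w . x + b) for a single neuron with g = sigmoid."""
    return sigmoid(np.dot(w, x) + b)

x = np.array([1.0, 0.0, 2.0])    # vector input
w = np.array([0.5, -1.0, 0.3])   # weight parameters
b = -0.2                         # bias term
print(neuron(x, w, b))           # scalar output y
```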
Figure 13.2: A neural network with a hidden layer.
Figure from https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
Figure: training and validation error curves, with the minimum of the validation error marked.
Figure: error surfaces error(λ), showing a local optimum versus the global optimum, and three problems for gradient descent: too high a learning rate, bad initialization, and getting stuck in a local optimum.
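A toy sketch (my own example, not from the slides) of how a too-high learning rate breaks gradient descent, here on the one-dimensional error surface error(λ) = λ²:

```python
def descend(learning_rate, steps=20, lam=3.0):
    """Gradient descent on error(lam) = lam**2, whose gradient is 2 * lam."""
    for _ in range(steps):
        lam = lam - learning_rate * 2.0 * lam
    return lam

print(descend(0.1))   # small steps: converges towards the optimum at 0
print(descend(1.1))   # too high a learning rate: overshoots and diverges
```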
(a) Standard Neural Net (b) After applying dropout.
Figure from "Dropout: A Simple Way to Prevent Neural Networks from Overfitting"
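A minimal sketch of (inverted) dropout on a vector of hidden activations; the rescaling keeps the expected activation unchanged, and nothing is dropped at test time:

```python
import numpy as np

def dropout(h, p=0.5, training=True):
    """Randomly zero each unit with probability p and rescale the rest."""
    if not training:
        return h
    mask = (np.random.rand(*h.shape) >= p).astype(h.dtype)
    return h * mask / (1.0 - p)

h = np.array([0.2, 1.5, -0.7, 0.9])
print(dropout(h, p=0.5))
```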
Online learning: update the model with each training example. Mini-batch training: update weights in batches (processed in parallel), which can speed up training.
Figure: mini-batching with padding. Sentences of different lengths are padded with 0's to a common length, and reordering the sentences in a batch (e.g. grouping those of similar length) reduces the amount of padding. Figure from Luong et al. ACL 2016 NMT tutorial.
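A minimal sketch of the padding idea (the word ids and the pad symbol 0 are assumptions): sentences are padded to the length of the longest one in the batch, and sorting by length first keeps the padding to a minimum:

```python
def pad_batch(sentences, pad_id=0):
    """Pad lists of word ids with pad_id up to the longest sentence."""
    max_len = max(len(s) for s in sentences)
    return [s + [pad_id] * (max_len - len(s)) for s in sentences]

batch = [[4, 8, 15], [16, 23], [42, 7, 9, 11], [5]]
batch.sort(key=len, reverse=True)       # group long sentences together
print(pad_batch(batch))
```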
Figure 13.2: A neural network with a hidden layer.
h = sigmoid(W1 x + b1)
y = sigmoid(W2 h + b2)
Figure: the computation graph for this network, with input value x, weight parameters W1, W2, b1, b2, and computation nodes (product, sum, sigmoid). To the right, the input (1, 0)^T is processed by the graph.
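A minimal sketch of this forward pass; the parameter values below are reassembled from the numbers still visible on the slide and should be treated as illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative parameters (reassembled from the slide, possibly not exact).
W1 = np.array([[3.0, 4.0],
               [2.0, 3.0]])
b1 = np.array([-2.0, -4.0])
W2 = np.array([[5.0, -5.0]])
b2 = np.array([-2.0])

x = np.array([1.0, 0.0])      # input (1, 0)^T
h = sigmoid(W1 @ x + b1)      # hidden layer; h[1] comes out near 0.119
y = sigmoid(W2 @ h + b2)      # network output
print(h, y)
```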
Figure: the same computation graph during back-propagation, with an L2 loss comparing the output against a target t, for the training example (0, 1)^T → 1.0. Gradients are computed with respect to the input of the nodes, so some nodes that have two inputs also have two gradients. See text for details on the computations of the values.
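A minimal sketch of the backward pass for the same graph, assuming the loss L = ½(t − y)² and the training example (0, 1)^T → 1.0; the gradient formulas are the standard chain rule through the sigmoid nodes, not the book's exact derivation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Same illustrative parameters as the forward-pass sketch above.
W1 = np.array([[3.0, 4.0], [2.0, 3.0]]); b1 = np.array([-2.0, -4.0])
W2 = np.array([[5.0, -5.0]]);            b2 = np.array([-2.0])

x, t = np.array([0.0, 1.0]), 1.0         # training example (0, 1)^T -> 1.0
h = sigmoid(W1 @ x + b1)                 # forward pass
y = sigmoid(W2 @ h + b2)

dL_dy  = -(t - y)                        # dL/dy for L = 0.5 * (t - y)^2
delta2 = dL_dy * y * (1.0 - y)           # gradient at the output sum node
dW2, db2 = np.outer(delta2, h), delta2   # parameter gradients, layer 2
delta1 = (W2.T @ delta2) * h * (1.0 - h) # back-propagate through layer 2
dW1, db1 = np.outer(delta1, x), delta1   # parameter gradients, layer 1
print(dW2, db2, dW1, db1)
```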
Figure 13.10: Sketch of a neural language model: we predict a word w_i based on its preceding words.
Figure 13.11: Full architecture of a feed-forward neural network language model. Context words (w_{i−4}, w_{i−3}, w_{i−2}, w_{i−1}) are represented as one-hot vectors, then projected into continuous space as word embeddings (using the same weight matrix C for all words). The predicted word is computed as a one-hot vector via a hidden layer.
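A minimal sketch of this architecture (the vocabulary size, dimensions, and all variable names are assumptions): the four context words share the embedding matrix C, their embeddings are concatenated, fed through a hidden layer, and a softmax over the vocabulary scores the next word:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, hdim = 1000, 32, 64                 # vocabulary, embedding, hidden sizes

C  = rng.normal(size=(V, d)) * 0.1        # shared embedding matrix (one row per word)
W1 = rng.normal(size=(hdim, 4 * d)) * 0.1
b1 = np.zeros(hdim)
W2 = rng.normal(size=(V, hdim)) * 0.1
b2 = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_next(context_ids):
    """Score the next word given (w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1})."""
    e = np.concatenate([C[i] for i in context_ids])   # embed and concatenate
    h = np.tanh(W1 @ e + b1)                          # hidden layer
    return softmax(W2 @ h + b2)                       # distribution over V words

p = predict_next([12, 7, 105, 3])
print(p.argmax(), round(float(p.max()), 4))
```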
Figure from https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Figure 13.13: Recurrent neural language models: after predicting Word 2 in the context of the preceding Word 1, we re-use this hidden layer (alongside the correct Word 2) to predict Word 3. Again, the hidden layer of this prediction is re-used (its values are copied over) for the prediction of Word 4.
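A minimal sketch of this recurrence (all names and sizes are mine): each step embeds the current word, combines it with the hidden state copied over from the previous step, and predicts the next word:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, hdim = 1000, 32, 64

E  = rng.normal(size=(V, d)) * 0.1        # word embeddings
Wx = rng.normal(size=(hdim, d)) * 0.1
Wh = rng.normal(size=(hdim, hdim)) * 0.1
Wo = rng.normal(size=(V, hdim)) * 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(word_id, h_prev):
    """One time step: new hidden state plus a distribution over the next word."""
    h = np.tanh(Wx @ E[word_id] + Wh @ h_prev)
    return softmax(Wo @ h), h

h = np.zeros(hdim)
for w in [12, 7, 105]:        # Word 1, Word 2, Word 3
    p, h = step(w, h)         # h is copied forward to the next prediction
print(p.argmax())
```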
Figure from https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
Figure: an LSTM layer unrolled over time steps t−1 and t, showing the input gate and forget gate; the input X comes from the preceding layer and the output Y feeds the next layer.
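A minimal sketch of a single LSTM step with the standard gate equations (the parameter layout and names are my choice; the figure's input and forget gates appear as i and f below, alongside the usual output gate o):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b stack the parameters of all four transforms."""
    z = W @ x + U @ h_prev + b                     # shape (4 * hdim,)
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates
    g = np.tanh(g)                                 # candidate memory content
    c = f * c_prev + i * g                         # keep part of old memory, add new
    h = o * np.tanh(c)                             # hidden state passed onward
    return h, c

rng = np.random.default_rng(0)
xdim, hdim = 3, 4
W = rng.normal(size=(4 * hdim, xdim)) * 0.1
U = rng.normal(size=(4 * hdim, hdim)) * 0.1
b = np.zeros(4 * hdim)
h, c = lstm_step(rng.normal(size=xdim), np.zeros(hdim), np.zeros(hdim), W, U, b)
print(h, c)
```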
Figure: a GRU layer unrolled over time steps t−1 and t, showing the update gate and reset gate; the input X comes from the preceding layer and the output Y feeds the next layer.
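A minimal sketch of one GRU step with the standard update and reset gates (the parameterisation and names are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU step; p holds one input and one recurrent matrix per gate."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev)              # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev)              # reset gate
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                  # blend old and new

rng = np.random.default_rng(0)
xdim, hdim = 3, 4
p = {k: rng.normal(size=(hdim, xdim if k.startswith("W") else hdim)) * 0.1
     for k in ["Wz", "Uz", "Wr", "Ur", "Wh", "Uh"]}
print(gru_step(rng.normal(size=xdim), np.zeros(hdim), p))
```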
Figure 13.17: Deep recurrent neural networks (shallow, deep stacked, and deep transition). The input is passed through a few hidden layers before an output prediction is made. In deep stacked models, the hidden layers are also connected horizontally, i.e., a layer's values at time step t depend on its value at time step t − 1 as well as on the previous layer at time step t. In deep transition models, the layers at any time step t are sequentially connected and the first hidden layer is also informed by the last layer at time step t − 1.
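A minimal sketch of the deep stacked case (a plain tanh recurrence per layer, which is a simplification of the figure): at time step t each layer reads its own state from t − 1 and the output of the layer below at time t:

```python
import numpy as np

def stacked_rnn_step(x, h_prev, params):
    """One time step of a deep stacked RNN with simple tanh layers."""
    h_new, inp = [], x
    for k, (Wx, Wh) in enumerate(params):          # one (Wx, Wh) pair per layer
        h = np.tanh(Wx @ inp + Wh @ h_prev[k])     # own past state + layer below
        h_new.append(h)
        inp = h                                    # feed upward to the next layer
    return h_new

rng = np.random.default_rng(0)
xdim, hdim, layers = 3, 4, 3
params = [(rng.normal(size=(hdim, xdim if k == 0 else hdim)) * 0.1,
           rng.normal(size=(hdim, hdim)) * 0.1) for k in range(layers)]
h = [np.zeros(hdim) for _ in range(layers)]
h = stacked_rnn_step(rng.normal(size=xdim), h, params)
print([v.round(3) for v in h])
```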
https://www.metacenter.no/user/application/form/notur/