CSEP 517 Natural Language Processing
Neural Machine Translation
Luke Zettlemoyer
(Slides adapted from Karthik Narasimhan, Greg Durrett, Chris Manning, Dan Jurafsky)
Last time: Statistical MT (word-based, phrase-based, syntactic)
Neural Machine Translation went from a fringe research activity in 2014 to the leading standard method in 2016
Systems built over many years were outperformed by NMT systems trained by a handful of engineers in a few months.
Encode the source sentence into a vector/matrix, then decode to the target language (output). RNN update:
h_t = g(W h_{t−1} + U x_t + b) ∈ ℝ^d
(Sutskever et al., 2014)
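A minimal numpy sketch of this update, not the lecture's code; all sizes and the choice g = tanh are illustrative assumptions:

```python
import numpy as np

d, e = 4, 3                      # hidden size, embedding size (assumed)
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))      # recurrent weights
U = rng.normal(size=(d, e))      # input weights
b = np.zeros(d)

def rnn_step(h_prev, x_t):
    """One step of h_t = tanh(W h_{t-1} + U x_t + b)."""
    return np.tanh(W @ h_prev + U @ x_t + b)

# Encode a toy "sentence" of 5 random embeddings; the final h is the
# encoding handed to the decoder as its initial hidden state.
h = np.zeros(d)
for x_t in rng.normal(size=(5, e)):
    h = rnn_step(h, x_t)
```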
[Figure: seq2seq. Encoder RNN reads the source sentence (input) "les pauvres sont démunis"; Decoder RNN starts from <START> and emits the target sentence (output).]
The Encoder RNN produces an encoding of the source sentence, which provides the initial hidden state for the Decoder RNN. The Decoder RNN is a language model that generates the target sentence conditioned on that encoding.
[Figure: greedy decoding. On each step the argmax of the decoder's output distribution is taken as the generated word: "the", "poor", "don't", "have", "any", "money", <END>.]
Note: this diagram shows test-time behavior: the decoder output is fed in as the next step's input.
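A sketch of that greedy loop; decoder_step(h, y), returning the next hidden state and a probability distribution over the vocabulary, is a hypothetical stand-in interface:

```python
def greedy_decode(h_enc, decoder_step, start_id, end_id, max_len=50):
    ys, y, h = [], start_id, h_enc
    for _ in range(max_len):
        h, probs = decoder_step(h, y)   # hypothetical decoder interface
        y = int(probs.argmax())         # argmax word on this step
        if y == end_id:                 # stop at <END>
            break
        ys.append(y)                    # fed back in as next step's input
    return ys
```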
NMT directly models the conditional probability of the target sentence:
P(y|x) = ∏_{t=1}^{T} P(y_t | y_1, …, y_{t−1}, x)
Training data: large parallel corpora, e.g. 36M sentence pairs.
English: Machine translation is cool!
Russian: Машинный перевод - это круто!
[Figure: training. Encoder RNN reads the source sentence (from corpus) "les pauvres sont démunis"; Decoder RNN is fed <START> the poor don't have any money, the target sentence (from corpus).]
Seq2seq is optimized as a single system. Backpropagation operates "end to end".
On each step the decoder produces a predicted distribution ŷ_t with per-step loss J_t, the negative log probability of the gold word at that step (e.g. J_1 = negative log prob of "the", J_4 = negative log prob of "have", J_7 = negative log prob of <END>). The total loss averages over the T steps:
J = (1/T) ∑_{t=1}^{T} J_t
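A minimal sketch of that loss, assuming step_probs is a list of predicted distributions, one per decoder timestep (both names are hypothetical):

```python
import numpy as np

def seq_loss(step_probs, gold_ids):
    """J = (1/T) * sum_t J_t, with J_t = -log P(gold word at step t)."""
    T = len(gold_ids)
    Js = [-np.log(p[y]) for p, y in zip(step_probs, gold_ids)]
    return sum(Js) / T
```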
Greedy decoding takes the most probable word on each step, but has no way to undo decisions. Ideally we want the translation y_1, …, y_T that maximizes P(y|x), but exhaustive search is far too expensive. Beam search: on each step of the decoder, keep track of the k most probable partial translations (hypotheses). The score of a hypothesis is its log probability:
score(y_1, …, y_j) = ∑_{t=1}^{j} log P(y_t | y_1, …, y_{t−1}, x)
(slide credit: Abigail See)
Different hypotheses may produce the ⟨e⟩ (end) token at different time steps. When a hypothesis produces ⟨e⟩, stop expanding it and place it aside. Continue until either all k hypotheses have produced ⟨e⟩ OR we reach a maximum timestep T.
Longer hypotheses accumulate more negative log-probability terms, so scores are normalized by length:
(1/T) ∑_{t=1}^{T} log P(y_t | y_1, …, y_{t−1}, x_1, …, x_n)
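A sketch of beam search with that length-normalized score; decoder_step(h, y) is the same hypothetical interface as above:

```python
import heapq
import numpy as np

def beam_search(h_enc, decoder_step, start_id, end_id, k=5, max_len=50):
    # each beam holds (sum of log probs, token sequence, decoder state)
    beams, finished = [(0.0, [start_id], h_enc)], []
    for _ in range(max_len):
        candidates = []
        for logp, ys, h in beams:
            h2, probs = decoder_step(h, ys[-1])
            for y in np.argsort(probs)[-k:]:   # top-k expansions of this hypothesis
                candidates.append((logp + float(np.log(probs[y])), ys + [int(y)], h2))
        # keep only the k most probable partial translations (hypotheses)
        beams = heapq.nlargest(k, candidates, key=lambda c: c[0])
        still_open = []
        for logp, ys, h in beams:
            if ys[-1] == end_id:
                # produced <e>: stop expanding, place aside, normalize by length
                finished.append((logp / len(ys), ys))
            else:
                still_open.append((logp, ys, h))
        beams = still_open
        if not beams:                          # all k hypotheses finished
            break
    pool = finished or [(lp / len(ys), ys) for lp, ys, _ in beams]
    return max(pool, key=lambda c: c[0])[1]
```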
Pros: a single neural network optimized end to end; much less human engineering effort; the same method works for many language pairs.
Cons: less interpretable and harder to debug; difficult to control; could lead to unwanted biases.
(source: Rico Sennrich)
Seq2seq is versatile: the sequences can be words or characters, and many tasks fit the same framework: summarization (long text → short summary), dialogue (previous utterances → reply), parsing (input text → parse tree in sequence form), question answering (question → answer).
Bottleneck: the single encoding vector needs to capture all the information about the source sentence. Attention: on each step of the decoder, focus on a particular part of the source sentence. Core idea: use the decoder hidden state (i.e. a notion of what you are trying to decode) to query the hidden states of the encoder (h_i).
[Figure: attention, step 1. Encoder RNN over the source sentence (input) "les pauvres sont démunis"; Decoder RNN starting from <START>. Attention scores are computed as dot products between the decoder hidden state and each encoder hidden state.]
On this decoder timestep, we're mostly focusing on the first encoder hidden state ("les"). Take a softmax to turn the scores into a probability distribution: the attention distribution.
Use the attention distribution to take a weighted sum of the encoder hidden states. The attention output mostly contains information from the hidden states that received high attention.
Concatenate the attention output with the decoder hidden state, then use it to compute ŷ_1 as before.
[Figure: the same process repeats on every decoder step, recomputing attention scores and the attention distribution each time, producing ŷ_2, …, ŷ_6 and the output "the poor don't have any money".]
Attention in equations:
Encoder hidden states: h_1^enc, …, h_n^enc ∈ ℝ^h
Decoder hidden state at step t: h_t^dec ∈ ℝ^h
Attention scores: e_t = [g(h_1^enc, h_t^dec), …, g(h_n^enc, h_t^dec)] ∈ ℝ^n
Attention distribution: α_t = softmax(e_t) ∈ ℝ^n
Attention output: a_t = ∑_{i=1}^{n} α_{t,i} h_i^enc ∈ ℝ^h
Concatenate with the decoder state: [a_t; h_t^dec] ∈ ℝ^{2h}
Ways to compute the score g between an encoder hidden state h_i (from h_1, h_2, …, h_n) and decoder hidden state z:
Dot product: e_i = g(h_i, z) = z^T h_i ∈ ℝ
Multiplicative: g(h_i, z) = z^T W h_i ∈ ℝ, where W is a weight matrix
Additive: g(h_i, z) = v^T tanh(W_1 h_i + W_2 z) ∈ ℝ, where W_1, W_2 are weight matrices and v is a weight vector
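A sketch of the dot-product variant end to end, with assumed shapes (H_enc is n × h, h_dec is length h):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attention(H_enc, h_dec):
    e = H_enc @ h_dec                   # scores: e[i] = h_i_enc . h_dec
    alpha = softmax(e)                  # attention distribution, in R^n
    a = alpha @ H_enc                   # weighted sum of encoder states, in R^h
    return np.concatenate([a, h_dec])   # [a_t; h_t_dec] in R^{2h}
```

The multiplicative and additive variants only change how e is computed; the softmax and weighted sum stay the same.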
How to handle OOV (out-of-vocabulary) words? Byte Pair Encoding, Sennrich et al. (2016): start from a vocabulary of characters, then repeatedly merge the most frequent pair of adjacent characters (or character sequences) into a new symbol. The resulting subword units are often full words but may also be parts of words: frequent words stay whole words, rare words get split.
Example:
Input: _the _eco tax _port i co _in _Po nt - de - Bu is …
Output: _le _port ique _éco taxe _de _Pont - de - Bui s
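A minimal sketch of learning the merges (simplified relative to the reference subword-nmt implementation):

```python
from collections import Counter

def learn_bpe(corpus_words, num_merges=10):
    # start from characters, with "_" as a word-boundary marker
    vocab = Counter(tuple("_" + w) for w in corpus_words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():   # merge that pair into one symbol
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1]); i += 2
                else:
                    out.append(word[i]); i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges
```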
Classical statistical MT used a large target-side monolingual corpus T' to train a language model. Can neural MT do the same? Sennrich et al. (2015):
Option 1: generate T' as targets from null inputs, mixing pairs like (s1, t1), ([null], t'1), ([null], t'2), (s2, t2), … into the training data.
Option 2: generate sources with a T->S machine translation system (backtranslation), giving pairs like (s1, t1), (MT(t'1), t'1), (MT(t'2), t'2), (s2, t2), …
Backtranslation yields (noisy) source sentences, which could be useful.
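A sketch of assembling the backtranslated training set; reverse_mt, a T->S translation function, is a hypothetical stand-in:

```python
def build_training_set(parallel_pairs, mono_targets, reverse_mt):
    # synthetic pairs (MT(t'), t') from monolingual target sentences
    synthetic = [(reverse_mt(t), t) for t in mono_targets]
    # train the S->T system on real + synthetic pairs together
    return parallel_pairs + synthetic
```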
(Wu et al., 2016)
[Figure: a stack of Transformer layers (layer 1, layer 2, layer 3) encoding sentences such as "the movie was terribly exciting!" and "the movie was great".]
Vaswani et al. (2017)
Self-attention: each word's new representation is a weighted sum of (transformed) value vectors for all words, e.g. x'_4 is a sum of scalar * vector terms:
α_{i,j} = softmax(x_i^T x_j)
x'_i = ∑_{j=1}^{n} α_{i,j} x_j
Multi-head self-attention, with learned matrices W_k and V_k for each head k:
α_{k,i,j} = softmax(x_i^T W_k x_j)
x'_{k,i} = ∑_{j=1}^{n} α_{k,i,j} V_k x_j
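A numpy sketch of both forms, vectorized over positions (X is n × d; the per-head W, V matrices are assumed inputs):

```python
import numpy as np

def row_softmax(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def self_attention(X):
    alpha = row_softmax(X @ X.T)   # alpha[i, j] = softmax_j(x_i . x_j)
    return alpha @ X               # each x'_i = sum_j alpha[i, j] * x_j

def multi_head(X, Ws, Vs):
    # one (W_k, V_k) pair per head; heads are concatenated at the end
    heads = []
    for W, V in zip(Ws, Vs):
        alpha = row_softmax(X @ W @ X.T)   # scores x_i^T W_k x_j
        heads.append(alpha @ X @ V.T)      # sum_j alpha * (V_k x_j)
    return np.concatenate(heads, axis=1)
```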
The Transformer has no recurrence, so it parallelizes well and trains much faster. It keeps the encoder-decoder framework (as does much follow-up work). Each sub-layer is wrapped in a residual connection and a layer normalization: LayerNorm(x + SubLayer(x))
(Ba et al., 2016): Layer Normalization
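A minimal sketch of that wrapper (learned gain/bias parameters of LayerNorm omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each vector to zero mean, unit variance over its last axis
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def residual_block(x, sublayer):
    """LayerNorm(x + SubLayer(x)): residual connection, then layer norm."""
    return layer_norm(x + sublayer(x))
```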
Positional encodings: add a position embedding vector emb(1), emb(2), emb(3), emb(4), … to the inputs. Each dimension is a sine/cosine wave of a different frequency. Closer points = higher dot products.
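A sketch of the sinusoidal construction from Vaswani et al. (2017), assuming d_model is even:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # positions 0..max_len-1
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000, i / d_model)    # a different frequency per pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe   # rows for nearby positions have higher dot products
```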
Model sizes in Vaswani et al. (2017): big = 1024 dims for each token, 16 heads; base = 6 layers + dimensions halved (512 dims, 8 heads).
PyTorch building blocks: nn.Transformer, nn.TransformerEncoder.
The Annotated Transformer: http://nlp.seas.harvard.edu/2018/04/03/attention.html
A Jupyter notebook which explains how the Transformer works line by line in PyTorch!
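A minimal usage example of the built-in encoder modules (hyperparameters are the "base" values above; batch_first is assumed available in your torch version):

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)

x = torch.randn(2, 10, 512)   # (batch, sequence length, d_model)
out = encoder(x)              # same shape: contextualized token representations
print(out.shape)              # torch.Size([2, 10, 512])
```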
Source: https://hackernoon.com/bias-sexist-or-this-is-the-way-it-should-be-ce1f7c8c683c
Translation systems can impose gender stereotypes even when the source didn't specify gender.
Source: http://languagelog.ldc.upenn.edu/nll/?p=35120#more-35120
(Arivazhagan et al., 2019): massively multilingual NMT trains a single model for many languages, enabling translation between pairs without pivoting through English (remember Interlingua?)