Neural Machine Translation
Spring 2020
2020-03-12
CMPT 825: Natural Language Processing
SFU NatLangLab
Adapted from slides from Danqi Chen, Karthik Narasimhan, and Jetic Gu. (with some content from slides from Abigail See, Graham Neubig)
Sequence-to-sequence learning: a neural network maps from a source language (input) to a target language (output) (Sutskever et al., 2014)
The encoder RNN reads the source sentence one word at a time. Sentence: This cat is cute. Each word is mapped to a word embedding x1, …, x4 and fed into the RNN, updating the hidden state h0 → h1 → … → h4. The final hidden state is henc, the encoded representation of the sentence.
[figure: unrolled encoder RNN over "This cat is cute"]
The decoder RNN is initialized with henc and generates the target sentence one word at a time: at step t it takes the embedding x′t of the previous target word (starting from <s>), updates its hidden state zt, and predicts the next word yt. Inputs: <s> ce chat est mignon; predictions: y1, …, y5 (ce chat est mignon, then end of sentence).
[figure: unrolled decoder RNN over "<s> ce chat est mignon", conditioned on henc]
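A minimal sketch of this encoder-decoder in PyTorch; the class name, GRU cells, and dimensions are illustrative assumptions, not an implementation given in the slides:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encode the source with one RNN; decode with another, initialized from henc."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)  # z_t -> logits over target vocab

    def forward(self, src, tgt_in):
        # src: (B, n) source token ids; tgt_in: (B, T) target ids shifted right (<s> y1 ... y_{T-1})
        _, h_enc = self.encoder(self.src_emb(src))        # h_enc: (1, B, hid), the encoded sentence
        z, _ = self.decoder(self.tgt_emb(tgt_in), h_enc)  # z: (B, T, hid), decoder states z_1..z_T
        return self.out(z)                                # (B, T, |V|) logits for y_1..y_T
```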
Training loss (negative log-likelihood of the target sentence):
$-\sum_{t=1}^{T} \log P(y_t \mid y_1, \ldots, y_{t-1}, x_1, \ldots, x_n)$
Trained on parallel sentence pairs (e.g., 36M sentence pairs):
English: Machine translation is cool!
Russian: Машинный перевод - это круто! ("Machine translation is cool!")
(slide credit: Abigail See)
Use masking (a 0/1 mask over real vs. padded positions) to compute the loss for batched sequences.
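A minimal sketch of such a masked loss, assuming (B, T, V) logits and a hypothetical pad_id:

```python
import torch.nn.functional as F

def masked_nll(logits, targets, pad_id=0):
    # logits: (B, T, V) decoder outputs; targets: (B, T), pad_id marks padding.
    loss = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")  # (B, T)
    mask = (targets != pad_id).float()   # the 0/1 mask from the slide
    return (loss * mask).sum() / mask.sum()
```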
Scheduled sampling: during training, sometimes feed the model's own prediction instead of the true previous word. (figure credit: Bengio et al., 2015)
Possible decay schedules: the probability of using the true y decays over time, e.g. linearly, exponentially, or with an inverse sigmoid. (source: Rico Sennrich)
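The three schedules from Bengio et al. (2015), sketched below for the teacher-forcing probability; the constants k are hypothetical tuning knobs:

```python
import math

def linear_decay(step, k=1e-4, eps_min=0.1):
    # Probability of feeding the true y_t decreases linearly, floored at eps_min.
    return max(eps_min, 1.0 - k * step)

def exponential_decay(step, k=0.9999):
    # k < 1: probability decays geometrically with the training step.
    return k ** step

def inverse_sigmoid_decay(step, k=1000.0):
    # k >= 1: stays near 1 early in training, then falls off smoothly.
    return k / (k + math.exp(step / k))
```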
(Wu et al., 2016)
Pros:
- a single model trained end to end; the same method works across language pairs
Cons:
- expensive to train (data and compute)
- could lead to unwanted biases
Task/Application: Input → Output
- Machine Translation: French → English
- Summarization: Document → Short Summary
- Dialogue: Utterance → Response
- Parsing: Sentence → Parse tree (as sequence)
- Question Answering: Context + Question → Answer
- Speech Recognition: Speech Signal → Transcript
- Image Captioning: Image → Text
- Video Captioning: Video → Text
Bottleneck: henc needs to capture all the information about the source sentence in a single vector.
Attention: at each decoding step, focus on a particular part of the source sentence, given a notion of what you are trying to decode, by looking at the hidden states of the encoder (henc_i). (slide credit: Abigail See)
Can also use ŷ1 as input for the next time step. (slide credit: Abigail See)
Attention computation, given encoder hidden states $h^{enc}_1, \ldots, h^{enc}_n$ and decoder hidden state $h^{dec}_t$:
$e_t = [\,g(h^{enc}_1, h^{dec}_t), \ldots, g(h^{enc}_n, h^{dec}_t)\,]$
$\alpha_t = \operatorname{softmax}(e_t) \in \mathbb{R}^n$
$a_t = \sum_{i=1}^{n} \alpha_{t,i}\, h^{enc}_i \in \mathbb{R}^h$
Concatenate attention output and decoder hidden state: $[a_t; h^{dec}_t] \in \mathbb{R}^{2h}$
Choices for the scoring function $g$, with encoder states $h_1, \ldots, h_n$ and query $z$:
- dot product: $e_i = g(h_i, z) = z^{\top} h_i \in \mathbb{R}$
- bilinear: $g(h_i, z) = z^{\top} W h_i \in \mathbb{R}$, where $W$ is a weight matrix
- MLP: $g(h_i, z) = v^{\top} \tanh(W_1 h_i + W_2 z) \in \mathbb{R}$, where $W_1, W_2$ are weight matrices and $v$ is a weight vector
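A sketch of one attention step using the dot-product scorer; function name and shapes are assumptions:

```python
import torch

def attention_step(h_enc, h_dec_t):
    # h_enc: (n, h) encoder states; h_dec_t: (h,) decoder state at step t.
    e_t = h_enc @ h_dec_t                # dot-product scores e_t, shape (n,)
    alpha_t = torch.softmax(e_t, dim=0)  # attention distribution over source positions
    a_t = alpha_t @ h_enc                # weighted sum of encoder states, shape (h,)
    return torch.cat([a_t, h_dec_t])     # [a_t; h_dec_t], shape (2h,)
```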
Attention relieves the bottleneck: henc no longer needs to capture all the information about the source sentence on its own.
Dropout: drop units stochastically with probability p during training (p = 0.5 usually works well), and scale the surviving activations by 1/(1 − p).
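A minimal sketch of this ("inverted") dropout:

```python
import torch

def dropout(x, p=0.5, training=True):
    # Zero each unit with probability p during training and scale the
    # survivors by 1/(1 - p), so expected activations match test time.
    if not training or p == 0.0:
        return x
    mask = (torch.rand_like(x) > p).float()
    return x * mask / (1.0 - p)
```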
$P(y_i) = \dfrac{\exp(w_i \cdot h + b_i)}{\sum_{j=1}^{|V|} \exp(w_j \cdot h + b_j)}$
Expensive to compute when the vocabulary $V$ is large.
(figure credit: Graham Neubig)
Other ways to sample:
- Importance Sampling (Bengio and Senecal, 2003)
- Noise Contrastive Estimation (Mnih & Teh, 2012)
Hierarchical softmax (Morin and Bengio, 2005) (figure credit: Quora)
Class-based softmax: first predict the word's class given the context, then predict the word within that class (Goodman 2001, Mikolov et al. 2011).
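A sketch of a class-factored softmax; equal-size classes and all names are simplifying assumptions:

```python
import torch
import torch.nn as nn

class ClassFactoredSoftmax(nn.Module):
    # log P(y|h) = log P(class(y)|h) + log P(y|class(y), h):
    # two small softmaxes instead of one |V|-way softmax.
    def __init__(self, hid_dim, n_classes, words_per_class):
        super().__init__()
        self.cls = nn.Linear(hid_dim, n_classes)
        # One output matrix per class, stacked: (C, words_per_class, hid)
        self.word = nn.Parameter(0.01 * torch.randn(n_classes, words_per_class, hid_dim))

    def log_prob(self, h, class_id, word_in_class):
        # h: (hid,) decoder state. Scores n_classes + words_per_class logits
        # instead of |V| = n_classes * words_per_class.
        log_pc = torch.log_softmax(self.cls(h), dim=-1)[class_id]
        word_logits = self.word[class_id] @ h            # (words_per_class,)
        log_pw = torch.log_softmax(word_logits, dim=-1)[word_in_class]
        return log_pc + log_pw
```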
(figure credit: Graham Neubig)
Words that appear in similar contexts end up with embeddings close together on the unit ball. (slide credit: Graham Neubig)
(figure credit: Luong, Cho, and Manning)
Decoding generates the target words y1, …, yT one at a time. Beam search: at each step, keep the k most probable partial translations (hypotheses); the score of a hypothesis $(y_1, \ldots, y_j)$ is $\sum_{t=1}^{j} \log P(y_t \mid y_1, \ldots, y_{t-1}, x)$.
(slide credit: Abigail See)
Different hypotheses may produce the ⟨eos⟩ (end) token at different time steps. When a hypothesis produces ⟨eos⟩, stop expanding it and place it aside. Continue until k hypotheses have produced ⟨eos⟩ OR a maximum length T is reached. Longer hypotheses accumulate lower log-probabilities, so normalize scores by length:
$\frac{1}{T} \sum_{t=1}^{T} \log P(y_t \mid y_1, \ldots, y_{t-1}, x_1, \ldots, x_n)$
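A sketch of beam search with these stopping and normalization rules; step_fn is a hypothetical wrapper around the decoder and softmax:

```python
def beam_search(step_fn, sos, eos, k=5, max_len=50):
    # step_fn(prefix) -> list of (token, log_prob) for the next token.
    beams = [([sos], 0.0)]   # the k most probable partial translations
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_fn(prefix):
                candidates.append((prefix + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:k]:
            if prefix[-1] == eos:
                # Hypothesis ended: place it aside, length-normalized.
                finished.append((prefix, score / len(prefix)))
            else:
                beams.append((prefix, score))
        if not beams or len(finished) >= k:
            break
    # Fall back to unfinished hypotheses if nothing produced <eos>.
    finished += [(p, s / len(p)) for p, s in beams]
    return max(finished, key=lambda c: c[1])[0]
```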
[figure: RNN decoding of "nobody expects the spanish inquisition !"; at one step the model assigns 'inquisition': 0.49 vs. <EOS>: 0.51, so greedy decoding emits <EOS> and truncates the translation]
Ensemble
Decode with several models at once, e.g. checkpoints from different epochs: seq2seq_E047.pt, seq2seq_E048.pt, seq2seq_E049.pt. At each step, each model produces its own next-word distribution:
cp1: 'kebab': 0.9, 'cheese': 0.05, 'pizza': 0.05, …
cp2: 'kebab': 0.3, 'cheese': 0.55, 'pizza': 0.05, …
cp3: 'kebab': 0.3, 'cheese': 0.05, 'pizza': 0.55, …
Average the distributions and pick the next word from the average: here 'kebab' wins (0.5 on average) even though two of the three models prefer another word.
[figure: three RNN decoders run in parallel on "prof loves …", predictions combined]
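A sketch of ensemble decoding by distribution averaging; next_word_probs is a hypothetical per-model interface:

```python
import torch

def ensemble_next_word_probs(models, prefix):
    # Average next-word distributions from several checkpoints
    # (e.g. seq2seq_E047.pt .. seq2seq_E049.pt loaded into `models`).
    probs = [m.next_word_probs(prefix) for m in models]  # each: (V,)
    return torch.stack(probs).mean(dim=0)  # e.g. kebab: (0.9+0.3+0.3)/3 = 0.5
```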
Zipf’s Law
[figure: word occurrence count (1 to 1,000,000, log scale) vs. frequency rank (1 to 50,000)]
Copy
Handle rare/unknown words by copying from the source. Example: source "prof liebt kebab" (German: "prof loves kebab"); the decoder produces "prof loves <UNK>".
At the step where the decoder outputs <UNK>, look at the attention weights, e.g. αt,1 = 0.1, αt,2 = 0.2, αt,3 = 0.7, and copy the source word with the highest attention weight: f(argmax_i αt,i) = "kebab".
A gate pg([hdec_t; contextt]) decides between generating and copying: if pg ≤ 0.5, copy the most-attended source word instead of generating. The copied word can also be looked up in a dictionary: dict(f(argmax_i αt,i)).
[figure: decoder producing <UNK>, replaced by the most-attended source word]
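A sketch of the copy decision at one decoding step; function name and arguments are assumptions:

```python
import torch

def copy_or_generate(p_gen, gen_dist, alpha_t, src_tokens, unk_id):
    # p_gen: the gate pg([h_dec_t; context_t]); gen_dist: (V,) generator
    # distribution; alpha_t: (n,) attention over the n source tokens.
    y = int(torch.argmax(gen_dist))
    if p_gen <= 0.5 or y == unk_id:
        # Copy the most-attended source word, e.g. 'kebab' when alpha_t,3 = 0.7.
        return src_tokens[int(torch.argmax(alpha_t))]
    return y
```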
Massively multilingual NMT (Arivazhagan et al., 2019): a single model translating between many languages; its shared intermediate representation echoes the idea of an interlingua, with English as the pivot (remember Interlingua?).