slide-1
SLIDE 1

Neural Machine Translation

Spring 2020

2020-03-12

CMPT 825: Natural Language Processing

SFU NatLangLab

Adapted from slides from Danqi Chen, Karthik Narasimhan, and Jetic Gu. (with some content from slides from Abigail See, Graham Neubig)

slide-2
SLIDE 2

Course Logistics

  • Project proposal is due today
  • What problem are you addressing? Why is it interesting?
  • What specific aspects will your project be on?
  • Re-implement paper? Compare different methods?
  • What data do you plan to use?
  • What is your method?
  • How do you plan to evaluate? What metrics?
slide-3
SLIDE 3
Last time

  • Statistical MT
  • Word-based
  • Phrase-based
  • Syntactic

slide-4
SLIDE 4

Neural Machine Translation

  • A single neural network is used to translate from source to target
  • Architecture: Encoder-Decoder
  • Two main components:
  • Encoder: Convert source sentence (input) into a vector/matrix
  • Decoder: Convert encoding into a sentence in target language (output)

slide-5
SLIDE 5

Sequence to Sequence learning (Seq2seq)

  • Encode entire input sequence into a single vector (using an RNN)
  • Decode one word at a time (again, using an RNN!)
  • Beam search for better inference
  • Learning is not trivial! (vanishing/exploding gradients)

(Sutskever et al., 2014)

slide-6
SLIDE 6

Encoder

[Figure: RNN encoder unrolled over the sentence "This cat is cute"; each word embedding x_t updates the hidden state h_t.]

slide-7
SLIDE 7

Encoder

[Figure: first encoder step; word embedding x_1 and initial state h_0 produce h_1. Sentence: "This cat is cute".]

slide-8
SLIDE 8

Encoder

[Figure: the encoder processes word embeddings x_1 … x_4 ("This cat is cute"), producing hidden states h_1 … h_4.]

slide-9
SLIDE 9

Encoder

[Figure: the final hidden state h_4 serves as the encoded representation h^enc of "This cat is cute".]
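A minimal PyTorch sketch of the RNN encoder pictured in slides 6-9. All names (Encoder, emb_dim, hid_dim) are illustrative, not the course's reference implementation; the GRU here stands in for whatever recurrent cell is used.

```python
# Minimal sketch of the RNN encoder (illustrative names, not the course code).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 128, hid_dim: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)      # word embeddings x_t
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)   # h_t = f(x_t, h_{t-1})

    def forward(self, src_ids: torch.Tensor):
        # src_ids: (batch, src_len) integer word ids, e.g. "This cat is cute"
        embedded = self.embedding(src_ids)     # (batch, src_len, emb_dim)
        outputs, h_last = self.rnn(embedded)   # outputs: all h_1..h_n; h_last: final state
        return outputs, h_last                 # h_last plays the role of h^enc

# Example: encode a toy batch of word ids.
enc = Encoder(vocab_size=1000)
h_all, h_enc = enc(torch.randint(0, 1000, (2, 4)))
print(h_all.shape, h_enc.shape)   # torch.Size([2, 4, 256]) torch.Size([1, 2, 256])
```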

slide-10
SLIDE 10

Decoder

[Figure: RNN decoder conditioned on h^enc; inputs <s>, ce, chat, est, mignon are embedded (x'_1 … x'_5), producing hidden states z_1 … z_5 and outputs ce, chat, est, mignon, <e>.]

slide-11
SLIDE 11

Decoder

[Figure: same decoder; the first prediction y_1 ("ce") is fed back as the input for the next step.]

slide-12
SLIDE 12

Decoder

[Figure: decoding continues; predictions y_1 and y_2 are fed back as inputs to later steps.]

slide-13
SLIDE 13

Decoder

  • A conditioned language model

[Figure: the fully unrolled decoder; each prediction y_t is fed back as the next input until <e> is produced.]
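A matching sketch of the decoder as a conditioned language model: the encoder representation initializes the recurrent state, and each step scores the whole target vocabulary. Names and shapes are illustrative.

```python
# Minimal sketch of the decoder as a conditioned language model (illustrative only).
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 128, hid_dim: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)   # project z_t to vocabulary scores

    def forward(self, prev_ids: torch.Tensor, h_enc: torch.Tensor):
        # prev_ids: (batch, tgt_len) previous target words (<s>, ce, chat, ...)
        # h_enc:    (1, batch, hid_dim) encoder representation, used as initial state
        embedded = self.embedding(prev_ids)
        z, _ = self.rnn(embedded, h_enc)   # hidden states z_1..z_T, conditioned on h^enc
        return self.out(z)                 # (batch, tgt_len, vocab) scores for y_1..y_T

dec = Decoder(vocab_size=1200)
logits = dec(torch.randint(0, 1200, (2, 5)), torch.zeros(1, 2, 256))
print(logits.shape)   # torch.Size([2, 5, 1200])
```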

slide-14
SLIDE 14

Seq2seq training

  • Similar to training a language model!
  • Minimize cross-entropy loss:
  • Back-propagate gradients through both decoder and encoder
  • Need a really big corpus

∑_{t=1}^{T} − log P(y_t | y_1, …, y_{t−1}, x_1, …, x_n)

English: Machine translation is cool!

36M sentence pairs

Russian: Машинный перевод - это круто!

slide-15
SLIDE 15

Seq2seq training

(slide credit: Abigail See)

slide-16
SLIDE 16

Remember masking

[Figure: binary mask matrix; 1 for real tokens, 0 for padded positions in a batch.]

Use masking to help compute loss for batched sequences
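A small sketch of the masked cross-entropy loss from slides 14 and 16: per-token −log P(y_t | y_<t, x), with padded positions zeroed out before averaging. The function name and padding id are assumptions for illustration.

```python
# Sketch of the masked per-token cross-entropy loss (illustrative, PyTorch).
import torch
import torch.nn.functional as F

def masked_nll(logits, targets, pad_id=0):
    # logits: (batch, tgt_len, vocab), targets: (batch, tgt_len)
    log_probs = F.log_softmax(logits, dim=-1)
    # -log P(y_t | y_<t, x) at every position
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    mask = (targets != pad_id).float()       # 1 for real tokens, 0 for padding
    return (nll * mask).sum() / mask.sum()   # average over non-pad tokens only

loss = masked_nll(torch.randn(2, 5, 1000, requires_grad=True),
                  torch.randint(1, 1000, (2, 5)))
loss.backward()   # gradients flow back through decoder and encoder
```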

slide-17
SLIDE 17

Scheduled Sampling

(figure credit: Bengio et al, 2015)

Possible decay schedules (probability using true y decays over time)
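A sketch of scheduled sampling (Bengio et al., 2015): during training, feed the gold token with probability eps and the model's own prediction otherwise, with eps decaying over training. The decay constants below are illustrative, not values from the paper or the course.

```python
# Sketch of scheduled sampling with a few illustrative decay schedules.
import math, random

def teacher_forcing_prob(step, schedule="inverse_sigmoid", k=1000.0):
    if schedule == "linear":
        return max(0.1, 1.0 - step / 20000.0)
    if schedule == "exponential":
        return 0.99 ** (step / 100.0)
    # inverse sigmoid decay (default)
    return k / (k + math.exp(step / k))

def next_input(gold_token, predicted_token, step):
    # With probability eps use the true y_t, otherwise the model's own prediction.
    eps = teacher_forcing_prob(step)
    return gold_token if random.random() < eps else predicted_token
```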

slide-18
SLIDE 18

How seq2seq changed the MT landscape

slide-19
SLIDE 19

MT Progress

(source: Rico Sennrich)

slide-20
SLIDE 20

(Wu et al., 2016)

slide-21
SLIDE 21

NMT vs SMT

Pros

  • Better performance
  • Fluency
  • Longer context
  • Single NN optimized end-to-end
  • Less engineering
  • Works out of the box for many language pairs

Cons

  • Requires more data and compute
  • Less interpretable
  • Hard to debug
  • Uncontrollable
  • Heavily dependent on data - could lead to unwanted biases
  • More parameters
slide-22
SLIDE 22

Seq2Seq for more than NMT

Task/Application     Input                Output
Machine Translation  French               English
Summarization        Document             Short Summary
Dialogue             Utterance            Response
Parsing              Sentence             Parse tree (as sequence)
Question Answering   Context + Question   Answer

slide-23
SLIDE 23

Cross-Modal Seq2Seq

Task/Application    Input          Output
Speech Recognition  Speech Signal  Transcript
Image Captioning    Image          Text
Video Captioning    Video          Text

slide-24
SLIDE 24

Issues with vanilla seq2seq

  • A single encoding vector, h^enc, needs to capture all the information about the source sentence (bottleneck)
  • Longer sequences can lead to vanishing gradients
  • Overfitting


slide-26
SLIDE 26

Remember alignments?

slide-27
SLIDE 27

Attention

  • The neural MT equivalent of alignment models
  • Key idea: At each time step during decoding, focus on a particular part of the source sentence
  • This depends on the decoder's current hidden state (i.e. notion of what you are trying to decode)
  • Usually implemented as a probability distribution over the hidden states of the encoder (h_i^enc)

slide-28
SLIDE 28

Seq2seq with attention

(slide credit: Abigail See)

slide-29
SLIDE 29

Seq2seq with attention

(slide credit: Abigail See)

slide-30
SLIDE 30

Seq2seq with attention

(slide credit: Abigail See)

slide-31
SLIDE 31

Seq2seq with attention

(slide credit: Abigail See)

Can also use ŷ_1 as input for the next time step

slide-32
SLIDE 32

Seq2seq with attention

(slide credit: Abigail See)

slide-33
SLIDE 33

Computing attention

  • Encoder hidden states: h_1^enc, …, h_n^enc
  • Decoder hidden state at time t: h_t^dec
  • First, get attention scores for this time step (we will see what g is soon!): e^t = [g(h_1^enc, h_t^dec), …, g(h_n^enc, h_t^dec)]
  • Obtain the attention distribution using softmax: α^t = softmax(e^t) ∈ ℝ^n
  • Compute the weighted sum of encoder hidden states: a_t = ∑_{i=1}^{n} α_i^t h_i^enc ∈ ℝ^h
  • Finally, concatenate with the decoder state and pass on to the output layer: [a_t; h_t^dec] ∈ ℝ^{2h} (see the sketch below)
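A numpy sketch of this attention recipe, using dot-product scores for g. Variable names and dimensions are illustrative.

```python
# Sketch of one attention step (scores -> softmax -> weighted sum -> concat).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(h_enc, h_dec_t):
    # h_enc: (n, h) encoder hidden states; h_dec_t: (h,) decoder state at time t
    e_t = h_enc @ h_dec_t      # scores e^t = [g(h_i^enc, h_t^dec)]_i (dot product here)
    alpha_t = softmax(e_t)     # attention distribution over source positions
    a_t = alpha_t @ h_enc      # weighted sum of encoder states, in R^h
    return np.concatenate([a_t, h_dec_t]), alpha_t   # [a_t; h_t^dec] in R^{2h}

combined, alpha = attend(np.random.randn(4, 8), np.random.randn(8))
print(combined.shape, alpha.sum())   # (16,) and attention weights summing to 1
```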

slide-34
SLIDE 34

Types of attention

  • Assume encoder hidden states h_1, h_2, …, h_n and decoder hidden state z
  • 1. Dot-product attention (assumes equal dimensions for h_i and z): e_i = g(h_i, z) = z^T h_i ∈ ℝ
  • 2. Multiplicative attention: g(h_i, z) = z^T W h_i ∈ ℝ, where W is a weight matrix
  • 3. Additive attention: g(h_i, z) = v^T tanh(W_1 h_i + W_2 z) ∈ ℝ, where W_1, W_2 are weight matrices and v is a weight vector
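The three scoring functions side by side as a small numpy sketch; the weight shapes are assumptions chosen only to make the example run.

```python
# Illustrative implementations of the three attention score functions g(h_i, z).
import numpy as np

d_h, d_z, d_a = 8, 8, 16
W  = np.random.randn(d_z, d_h)   # multiplicative attention weight matrix
W1 = np.random.randn(d_a, d_h)   # additive attention weight matrices
W2 = np.random.randn(d_a, d_z)
v  = np.random.randn(d_a)        # additive attention weight vector

def dot_product(h_i, z):      # requires dim(h_i) == dim(z)
    return z @ h_i

def multiplicative(h_i, z):   # z^T W h_i
    return z @ W @ h_i

def additive(h_i, z):         # v^T tanh(W1 h_i + W2 z)
    return v @ np.tanh(W1 @ h_i + W2 @ z)
```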

slide-35
SLIDE 35

Issues with vanilla seq2seq

  • A single encoding vector, h^enc, needs to capture all the information about the source sentence (bottleneck)
  • Longer sequences can lead to vanishing gradients
  • Overfitting

slide-36
SLIDE 36

Dropout

  • Form of regularization for RNNs (and any NN in general)
  • Idea: "Handicap" the NN by removing hidden units stochastically
  • set each hidden unit in a layer to 0 with probability p during training (p = 0.5 usually works well)
  • scale the surviving outputs by 1/(1 − p)
  • hidden units are forced to learn more general patterns
  • Test time: Use all activations (no need to rescale); see the sketch below
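A tiny sketch of inverted dropout as described above: drop units with probability p during training, rescale the survivors by 1/(1 − p), and do nothing at test time.

```python
# Sketch of inverted dropout (illustrative).
import numpy as np

def dropout(h, p=0.5, training=True):
    if not training:
        return h                             # test time: use all activations
    mask = (np.random.rand(*h.shape) >= p)   # drop each unit with probability p
    return h * mask / (1.0 - p)              # rescale survivors so expectations match

h = np.ones(10)
print(dropout(h, p=0.5))   # roughly half the units zeroed, survivors scaled to 2.0
```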

slide-37
SLIDE 37

Handling large vocabularies

  • Softmax can be expensive for large vocabularies
  • English vocabulary size: 10K to 100K

P(y_i) = exp(w_i ⋅ h + b_i) / ∑_{j=1}^{|V|} exp(w_j ⋅ h + b_j)

Expensive to compute

slide-38
SLIDE 38

Approximate Softmax

  • Negative Sampling
  • Structured softmax
  • Embedding prediction
slide-39
SLIDE 39

Negative Sampling

  • Softmax is expensive when vocabulary size is large

(figure credit: Graham Neubig)

slide-40
SLIDE 40

Negative Sampling

  • Sample just a subset of the vocabulary as negative examples
  • We saw simple negative sampling in word2vec (Mikolov 2013)

(figure credit: Graham Neubig)

Other ways to sample: Importance Sampling (Bengio and Senecal 2003), Noise Contrastive Estimation (Mnih & Teh 2012)
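A rough numpy sketch of negative-sampling-style training for the output layer: score the true word against a handful of sampled words instead of normalizing over the full vocabulary. This is an illustration of the general idea, not the exact objective of any paper cited above.

```python
# Illustrative negative-sampling-style loss for the output softmax layer.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(h, W, true_id, vocab_size, num_neg=5, rng=np.random):
    # h: (d,) decoder hidden state; W: (V, d) output word vectors
    neg_ids = rng.randint(0, vocab_size, size=num_neg)   # unigram sampling in practice
    pos = -np.log(sigmoid(W[true_id] @ h))               # push the true word's score up
    neg = -np.log(sigmoid(-(W[neg_ids] @ h))).sum()      # push sampled words' scores down
    return pos + neg

print(neg_sampling_loss(np.random.randn(16), np.random.randn(1000, 16), 42, 1000))
```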

slide-41
SLIDE 41

Hierarchical softmax

(Morin and Bengio 2005) (figure credit: Quora)

slide-42
SLIDE 42

Class based softmax

  • Two-layer: cluster words into classes, predict the class and then predict the word within it (Goodman 2001, Mikolov et al 2011)
  • Clusters can be based on frequency, random assignment, or word contexts.

(figure credit: Graham Neubig)

slide-43
SLIDE 43

Embedding prediction

  • Directly predict embeddings of the outputs themselves
  • What loss to use? (Kumar and Tsvetkov 2019)
  • L2? Cosine?
  • von Mises-Fisher distribution loss: make embeddings close on the unit ball

(slide credit: Graham Neubig)

slide-44
SLIDE 44

Generation

How can we use our model (decoder) to generate sentences?

  • Sampling: Try to generate a random sentence according to the probability distribution
  • Argmax: Try to generate the best sentence, the sentence with the highest probability

slide-45
SLIDE 45

Decoding Strategies

  • Ancestral sampling
  • Greedy decoding
  • Exhaustive search
  • Beam search
slide-46
SLIDE 46

Ancestral Sampling

  • Randomly sample words one by one
  • Provides diverse output (high variance)

(figure credit: Luong, Cho, and Manning)

slide-47
SLIDE 47

Greedy decoding

  • Compute the argmax at every step of the decoder to generate a word
  • What's wrong?
slide-48
SLIDE 48

Exhaustive search?

  • Find arg max_{y_1,…,y_T} P(y_1, …, y_T | x_1, …, x_n)
  • Requires computing all possible sequences
  • O(V^T) complexity!
  • Too expensive

slide-49
SLIDE 49

Recall: Beam search (a middle ground)

  • Key idea: At every step, keep track of the k most probable partial translations (hypotheses)
  • Score of each hypothesis = log probability: ∑_{t=1}^{j} log P(y_t | y_1, …, y_{t−1}, x_1, …, x_n)
  • Not guaranteed to be optimal
  • More efficient than exhaustive search

slide-50
SLIDE 50

Beam decoding

(slide credit: Abigail See)

slide-51
SLIDE 51

Beam decoding

(slide credit: Abigail See)

slide-52
SLIDE 52

Beam decoding

(slide credit: Abigail See)

slide-53
SLIDE 53

Backtrack

(slide credit: Abigail See)

slide-54
SLIDE 54

Beam decoding

  • Different hypotheses may produce the ⟨eos⟩ (end) token at different time steps
  • When a hypothesis produces ⟨eos⟩, stop expanding it and place it aside
  • Continue beam search until:
  • All k hypotheses produce ⟨eos⟩, OR
  • We hit the max decoding limit T
  • Select the top hypothesis using the normalized likelihood score: (1/T) ∑_{t=1}^{T} log P(y_t | y_1, …, y_{t−1}, x_1, …, x_n)
  • Otherwise shorter hypotheses have higher scores (see the sketch below)
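A compact beam-search sketch following the description above: keep the k best hypotheses, set finished ones aside at ⟨eos⟩, and rank by length-normalized log-probability. `step_log_probs(prefix)` is a hypothetical stand-in for the decoder's next-word log-probabilities.

```python
# Illustrative beam search with length-normalized final scoring.
def beam_search(step_log_probs, k=3, max_len=20, eos="<eos>"):
    beams = [([], 0.0)]          # (tokens so far, sum of log-probs)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for word, lp in step_log_probs(tokens).items():
                candidates.append((tokens + [word], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:k]:       # keep only the k best expansions
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:                              # every hypothesis produced <eos>
            break
    finished.extend(beams)                         # hypotheses cut off by the length limit
    # normalize by length so shorter hypotheses aren't unfairly favored
    return max(finished, key=lambda c: c[1] / len(c[0]))
```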

slide-55
SLIDE 55

Under-translation

  • Under-translation: crucial information is left untranslated; premature generation of <EOS>
  • Beam search
  • Ensemble
  • Coverage models

slide-56
SLIDE 56
  • Out-of-Vocabulary (OOV) problem; rare word problem
  • Frequent words are translated correctly, rare words are not
  • In any corpus, word frequencies are exponentially unbalanced
  • Under-translation
  • Crucial information is left untranslated; premature generation of <EOS>

[Figure: RNN decoding "nobody expects the spanish inquisition !"; at the final step 'inquisition' gets probability 0.49 vs. 0.51 for <EOS>, so generation stops early.]


slide-58
SLIDE 58

Ensemble

  • Similar to a voting mechanism, but with probabilities
  • Multiple models with different parameters (usually from different checkpoints of the same training instance)
  • Use the output with the highest probability across all models (see the sketch below)
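A small sketch of decoding-time ensembling as described above: average the next-word distributions from several checkpoints and pick the highest-probability word. The checkpoint variable names mirror the demo on the following slides; the function itself is illustrative.

```python
# Illustrative ensembling of next-word distributions from several checkpoints.
import numpy as np

def ensemble_next_word(distributions):
    # distributions: list of dicts word -> probability, one per model/checkpoint
    words = sorted(set().union(*[d.keys() for d in distributions]))
    avg = {w: np.mean([d.get(w, 0.0) for d in distributions]) for w in words}
    return max(avg, key=avg.get), avg

# e.g. three checkpoints disagree on the next word:
cp47 = {"kebab": 0.9, "cheese": 0.05, "pizza": 0.05}
cp48 = {"kebab": 0.3, "cheese": 0.55, "pizza": 0.05}
cp49 = {"kebab": 0.3, "cheese": 0.05, "pizza": 0.55}
print(ensemble_next_word([cp47, cp48, cp49]))   # 'kebab' has the highest average
```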

slide-59
SLIDE 59

Ensemble

[Figure: demo of a single RNN model decoding "prof loves cheese <EOS>".]

slide-60
SLIDE 60

Ensemble

[Figure: demo of checkpoint seq2seq_E049.pt decoding "prof loves cheese <EOS>".]

slide-61
SLIDE 61

Ensemble

[Figure: three checkpoints (seq2seq_E047.pt, seq2seq_E048.pt, seq2seq_E049.pt) decode "prof loves kebab / cheese / pizza".]

slide-62
SLIDE 62

Ensemble

[Figure: each checkpoint's next-word distribution, e.g. {'kebab': 0.9, 'cheese': 0.05, 'pizza': 0.05}, {'kebab': 0.3, 'cheese': 0.55, 'pizza': 0.05}, {'kebab': 0.3, 'cheese': 0.05, 'pizza': 0.55}.]

slide-63
SLIDE 63

Ensemble

[Figure: averaging the three checkpoints' distributions; 'kebab' (0.9, 0.3, 0.3) has the highest average probability, so the ensemble outputs 'kebab'.]

slide-64
SLIDE 64

Ensemble

[Figure: all three checkpoints continue decoding from the ensembled output 'kebab'.]

slide-65
SLIDE 65

Existing challenges with NMT

  • Out-of-vocabulary words
  • Low-resource languages
  • Long-term context
  • Common sense knowledge (e.g. hot dog, paper jam)
  • Fairness and bias
  • Uninterpretable
slide-67
SLIDE 67

Out-of-vocabulary words

  • Out-of-Vocabulary (OOV) problem; rare word problem
  • Frequent words are translated correctly, rare words are not
  • In any corpus, word frequencies are exponentially unbalanced
slide-68
SLIDE 68

OOV

  • 1. https://en.wikipedia.org/wiki/Zipf%27s_law

Zipf's Law

[Figure: log-log plot of occurrence count vs. frequency rank (Zipf's law).]

  • rare words are exponentially less frequent than frequent words
  • e.g. in HW4, 45%|65% (src|tgt) of the unique words occur only once
slide-69
SLIDE 69

Out-of-vocabulary words

  • Out-of-Vocabulary (OOV) problem; rare word problem
  • Copy Mechanisms
  • Subword / character models
slide-70
SLIDE 70

Copy Mechanism

[Figure: RNN decoder example decoding "prof loves cheese <EOS>".]

  • 1. CL2015015 [Luong et al.] Addressing the Rare Word Problem in Neural Machine Translation
slide-71
SLIDE 71

Copy Mechanism

[Figure: the decoder emits <UNK> for the source "prof liebt kebab"; attention weights α_t,1 = 0.1, α_t,2 = 0.2, α_t,3 = 0.7 point at "kebab".]

  • 1. CL2015015 [Luong et al.] Addressing the Rare Word Problem in Neural Machine Translation
slide-73
SLIDE 73

Copy Mechanism

[Figure: the decoder emits <UNK>; attention weights α_t,1 = 0.1, α_t,2 = 0.2, α_t,3 = 0.7 point at the source word "kebab".]

  • Sees <UNK> at step t
  • looks at the attention weights α_t
  • replaces <UNK> with the source word f_{argmax_i α_{t,i}} (see the sketch below)

  • 1. CL2015015 [Luong et al.] Addressing the Rare Word Problem in Neural Machine Translation
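A pure-Python sketch of the copy mechanism just described: whenever the decoder emits <UNK> at step t, replace it with the source word that receives the highest attention weight. Function and variable names are illustrative.

```python
# Illustrative post-hoc <UNK> replacement using attention weights.
def copy_unk(output_tokens, source_tokens, attention):
    # attention[t][i] = alpha_{t,i}, weight on source position i at output step t
    fixed = []
    for t, word in enumerate(output_tokens):
        if word == "<UNK>":
            best_i = max(range(len(source_tokens)), key=lambda i: attention[t][i])
            word = source_tokens[best_i]    # copy f_{argmax_i alpha_{t,i}}
        fixed.append(word)
    return fixed

src = ["prof", "liebt", "kebab"]
out = ["prof", "loves", "<UNK>"]
alpha = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.2, 0.7]]
print(copy_unk(out, src, alpha))   # ['prof', 'loves', 'kebab']
```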
slide-74
SLIDE 74

*Pointer-Generator Copy Mechanism

[Figure: a binary gate p_g decides at each step whether to use the decoder's own output or the copy mechanism.]

  • binary classifier switch p_g:
  • use decoder output
  • use copy mechanism

  • 1. CL2016032 [Gulcehre et al.] Pointing the Unknown Words
slide-75
SLIDE 75

*Pointer-Generator Copy Mechanism

  • Sees <UNK> at step t, or p_g([h_t^dec; context_t]) ≤ 0.5
  • looks at the attention weights α_t
  • replaces <UNK> with the source word f_{argmax_i α_{t,i}}

  • 1. CL2016032 [Gulcehre et al.] Pointing the Unknown Words

slide-76
SLIDE 76

*Pointer-Based Dictionary Fusion

  • Sees <UNK> at step t, or p_g([h_t^dec; context_t]) ≤ 0.5
  • looks at the attention weights α_t
  • replaces <UNK> with the dictionary translation of the source word, dict(f_{argmax_i α_{t,i}})

  • 1. CL2019331 [Gū et al.] Pointer-based Fusion of Bilingual Lexicons into Neural Machine Translation

slide-77
SLIDE 77

Out-of-vocabulary words

  • Out-of-Vocabulary (OOV) problem; rare word problem
  • Copy Mechanisms
  • Subword / character models
slide-78
SLIDE 78

OOV

slide-79
SLIDE 79

Subword / character models

  • Character based seq2seq models
slide-80
SLIDE 80

Subword / character models

  • Character based seq2seq models
  • Byte pair encoding (BPE)
slide-83
SLIDE 83

Subword / character models

  • Character based seq2seq models
  • Byte pair encoding (BPE)
  • Subword units help model morphology
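A tiny byte-pair-encoding sketch in the style of Sennrich et al.: repeatedly merge the most frequent adjacent symbol pair in the training vocabulary. The word frequencies and function name are illustrative; real implementations are far more efficient.

```python
# Illustrative BPE merge learning over a toy word-frequency dictionary.
from collections import Counter

def learn_bpe(word_freqs, num_merges=10):
    vocab = {tuple(word) + ("</w>",): f for word, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent symbol pair
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1]); i += 2   # apply the merge
                else:
                    out.append(symbols[i]); i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

print(learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=5))
```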
slide-84
SLIDE 84

Existing challenges with NMT

  • Out-of-vocabulary words
  • Low-resource languages
  • Long-term context
  • Common sense knowledge (e.g. hot dog, paper jam)
  • Fairness and bias
  • Uninterpretable
slide-85
SLIDE 85

Massively multilingual MT

(Arivazhagan et al., 2019)

  • Train a single neural network on 103 languages paired with English (remember Interlingua?)
  • Massive improvements on low-resource languages
slide-86
SLIDE 86

Existing challenges with NMT

  • Out-of-vocabulary words
  • Low-resource languages
  • Long-term context
  • Common sense knowledge (e.g. hot dog, paper jam)
  • Fairness and bias
  • Uninterpretable
slide-87
SLIDE 87

Next time

  • Contextualized embeddings
  • ELMO, BERT, and friends
  • Transformers
  • Multi-headed self-attention