

SLIDE 1

CSEP 517 Natural Language Processing

Neural Machine Translation

Luke Zettlemoyer

(Slides adapted from Karthik Narasimhan, Greg Durrett, Chris Manning, Dan Jurafsky)

SLIDE 2
Last time

  • Statistical MT
  • Word-based
  • Phrase-based
  • Syntactic

SLIDE 3

NMT: the biggest success story of NLP Deep Learning

Neural Machine Translation went from a fringe research activity in 2014 to the leading standard method in 2016

  • 2014: First seq2seq paper published
  • 2016: Google Translate switches from SMT to NMT
  • This is amazing!
  • SMT systems, built by hundreds of engineers over many years, were outperformed by NMT systems trained by a handful of engineers in a few months


SLIDE 4

Neural Machine Translation

  • A single neural network is used to translate from source to target
  • Architecture: Encoder-Decoder
  • Two main components:
  • Encoder: Convert source sentence (input) into a vector/matrix
  • Decoder: Convert encoding into a sentence in target language (output)

SLIDE 5

Recall: RNNs

$h_t = g(Wh_{t-1} + Ux_t + b) \in \mathbb{R}^d$
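To make the recurrence concrete, here is a minimal PyTorch sketch of one RNN step; the class name, the choice of tanh for g, and the sizes are illustrative assumptions rather than anything specified on the slide.

```python
import torch
import torch.nn as nn

class SimpleRNNCell(nn.Module):
    """One step of h_t = g(W h_{t-1} + U x_t + b), with g = tanh (illustrative)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W = nn.Linear(hidden_size, hidden_size, bias=False)  # W h_{t-1}
        self.U = nn.Linear(input_size, hidden_size, bias=True)    # U x_t + b

    def forward(self, x_t, h_prev):
        return torch.tanh(self.W(h_prev) + self.U(x_t))  # h_t in R^d
```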

SLIDE 6

Sequence to Sequence learning (Seq2seq)

  • Encode entire input sequence into a single vector (using an RNN)
  • Decode one word at a time (again, using an RNN!)
  • Beam search for better inference
  • Learning is not trivial! (vanishing/exploding gradients)

(Sutskever et al., 2014)
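As a rough illustration of the encode-then-decode idea, here is a minimal sketch (single-layer GRUs, teacher forcing, made-up sizes); Sutskever et al. (2014) actually used deep LSTMs, so treat this as a toy stand-in rather than their architecture.

```python
import torch.nn as nn

class ToySeq2Seq(nn.Module):
    # Illustrative sketch only: single-layer GRUs instead of the paper's deep LSTMs.
    def __init__(self, src_vocab, tgt_vocab, emb=256, hid=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the entire input sequence into a single vector (the final hidden state).
        _, h = self.encoder(self.src_emb(src_ids))
        # Decode one word at a time, conditioned on that vector (teacher forcing during training).
        dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), h)
        return self.out(dec_states)  # per-step logits over the target vocabulary
```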

SLIDE 7

Neural Machine Translation (NMT)

[Diagram: an Encoder RNN reads the source sentence (input) “les pauvres sont démunis”; a Decoder RNN then generates the target sentence (output) “the poor don’t have any money <END>”, taking an argmax over the vocabulary at each step.]

Encoder RNN produces an encoding of the source sentence. The encoding of the source sentence provides the initial hidden state for the Decoder RNN. The Decoder RNN is a Language Model that generates the target sentence conditioned on the encoding.

Note: This diagram shows test-time behavior: the decoder output is fed in as the next step’s input.

SLIDE 8

Seq2seq training

  • Similar to training a language model!
  • Minimize cross-entropy loss:
  • Back-propagate gradients through both decoder and encoder
  • Need a really big corpus

$\sum_{t=1}^{T} -\log P(y_t \mid y_1, \ldots, y_{t-1}, x_1, \ldots, x_n)$

English: Machine translation is cool!

36M sentence pairs

Russian: Машинный перевод - это круто!
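A minimal sketch of one training step for this objective, assuming a model like the hypothetical ToySeq2Seq above and padded batches; the helper name and padding convention are assumptions.

```python
import torch.nn.functional as F

def training_step(model, optimizer, src_ids, tgt_in, tgt_out, pad_id):
    # tgt_in = <START> y_1 ... y_{T-1};  tgt_out = y_1 ... y_T (shifted by one).
    logits = model(src_ids, tgt_in)                 # (batch, T, vocab)
    # Cross-entropy = sum_t -log P(y_t | y_1..y_{t-1}, x_1..x_n), averaged over tokens.
    loss = F.cross_entropy(logits.transpose(1, 2), tgt_out, ignore_index=pad_id)
    optimizer.zero_grad()
    loss.backward()    # gradients flow through both decoder and encoder ("end to end")
    optimizer.step()
    return loss.item()
```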

SLIDE 9

Training a Neural Machine Translation system

[Diagram: the Encoder RNN reads the source sentence “les pauvres sont démunis” (from corpus); the Decoder RNN is fed the target sentence “<START> the poor don’t have any money” (from corpus) and produces predictions $\hat{y}_1, \ldots, \hat{y}_7$, each with a per-step loss $J_1, \ldots, J_7$ (e.g. $J_1$ = negative log prob of “the”, $J_4$ = negative log prob of “have”, $J_7$ = negative log prob of <END>).]

$J = \frac{1}{T} \sum_{t=1}^{T} J_t$

Seq2seq is optimized as a single system. Backpropagation operates “end to end”.

SLIDE 10

Greedy decoding

  • Compute argmax at every step of the decoder to generate the next word

  • What’s wrong?
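A minimal greedy-decoding loop might look like the following sketch (the model interface and helper name are assumptions); the answer to “what’s wrong?” is visible in the code: each argmax is locally optimal and can never be undone.

```python
import torch

@torch.no_grad()
def greedy_decode(model, src_ids, start_id, end_id, max_len=50):
    ys = [start_id]
    for _ in range(max_len):
        logits = model(src_ids, torch.tensor([ys]))   # (1, t, vocab)
        next_id = logits[0, -1].argmax().item()       # argmax at every step
        ys.append(next_id)
        if next_id == end_id:
            break
    return ys
```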
SLIDE 11

Exhaustive search?

  • Find $\arg\max_{y_1, \ldots, y_T} P(y_1, \ldots, y_T \mid x_1, \ldots, x_n)$
  • Requires computing all possible sequences
  • $O(V^T)$ complexity!
  • Too expensive

SLIDE 12

A middle ground: Beam search

  • Key idea: At every step, keep track of the k most probable partial translations (hypotheses)
  • Score of each hypothesis = its log probability: $\sum_{t=1}^{j} \log P(y_t \mid y_1, \ldots, y_{t-1}, x_1, \ldots, x_n)$
  • Not guaranteed to be optimal
  • More efficient than exhaustive search

SLIDE 13

Beam decoding

(slide credit: Abigail See)

SLIDE 14

Beam decoding

(slide credit: Abigail See)

SLIDE 15

Beam decoding

(slide credit: Abigail See)

SLIDE 16

Beam decoding

  • Different hypotheses may produce the ⟨e⟩ (end) token at different time steps
  • When a hypothesis produces ⟨e⟩, stop expanding it and place it aside
  • Continue beam search until:
  • All k hypotheses produce ⟨e⟩, OR
  • We hit the max decoding limit T
  • Select the top hypotheses using the normalized likelihood score: $\frac{1}{T} \sum_{t=1}^{T} \log P(y_t \mid y_1, \ldots, y_{t-1}, x_1, \ldots, x_n)$
  • Otherwise shorter hypotheses would have higher scores
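Putting the last few slides together, here is a sketch of beam search with the normalized likelihood score (the model interface and helper name are assumptions; real implementations batch the hypotheses rather than looping):

```python
import torch

@torch.no_grad()
def beam_search(model, src_ids, start_id, end_id, k=5, max_len=50):
    beams = [([start_id], 0.0)]        # k most probable partial translations (hypotheses)
    finished = []
    for _ in range(max_len):
        candidates = []
        for ys, score in beams:
            logits = model(src_ids, torch.tensor([ys]))
            log_probs = torch.log_softmax(logits[0, -1], dim=-1)
            top_lp, top_id = log_probs.topk(k)
            for lp, wid in zip(top_lp.tolist(), top_id.tolist()):
                candidates.append((ys + [wid], score + lp))   # score = sum of log-probs
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for ys, score in candidates[:k]:
            # Hypotheses that produce the end token are set aside; the rest keep expanding.
            (finished if ys[-1] == end_id else beams).append((ys, score))
        if not beams:
            break
    finished.extend(beams)             # hit the max decoding limit
    # Normalize by length so shorter hypotheses aren't unfairly favored.
    return max(finished, key=lambda c: c[1] / (len(c[0]) - 1))[0]
```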

SLIDE 17

NMT vs SMT

Pros

  • Better performance
  • Fluency
  • Longer context
  • Single NN optimized end-to-end
  • Less engineering
  • Works out of the box for many language pairs

Cons

  • Requires more data and compute
  • Less interpretable
  • Hard to debug
  • Uncontrollable
  • Heavily dependent on data - could lead to unwanted biases
  • More parameters
SLIDE 18

How seq2seq changed the MT landscape

SLIDE 19

MT Progress

(source: Rico Sennrich)

SLIDE 20

Versatile seq2seq

  • Seq2seq finds applications in many other tasks!
  • Any task where inputs and outputs are sequences of words/characters
  • Summarization (input text → summary)
  • Dialogue (previous utterance → reply)
  • Parsing (sentence → parse tree in sequence form)
  • Question answering (context + question → answer)

SLIDE 21

Issues with vanilla seq2seq

  • A single encoding vector, $h^{enc}$, needs to capture all the information about the source sentence (this is a bottleneck)
  • Longer sequences can lead to vanishing gradients
  • Overfitting

SLIDE 22

Remember alignments?

SLIDE 23

Attention

  • The neural MT equivalent of alignment models
  • Key idea: At each time step during decoding, focus on a particular part of the source sentence
  • This depends on the decoder’s current hidden state (i.e. a notion of what you are trying to decode)
  • Usually implemented as a probability distribution over the hidden states of the encoder ($h^{enc}_i$)

SLIDE 24

Sequence-to-sequence with attention

[Diagram: attention scores are computed as dot products between the decoder hidden state (at <START>) and each encoder hidden state for the source sentence “les pauvres sont démunis”.]

SLIDE 25

Sequence-to-sequence with attention

[Diagram: same setup; dot-product attention scores between the decoder hidden state and each encoder hidden state.]

SLIDE 26

Sequence-to-sequence with attention

[Diagram: same setup; dot-product attention scores between the decoder hidden state and each encoder hidden state.]

SLIDE 27

Sequence-to-sequence with attention

[Diagram: same setup; dot-product attention scores between the decoder hidden state and each encoder hidden state.]

SLIDE 28

Sequence-to-sequence with attention

[Diagram: take a softmax to turn the attention scores into a probability distribution (the attention distribution). On this decoder timestep, we’re mostly focusing on the first encoder hidden state (“les”).]

SLIDE 29

Sequence-to-sequence with attention

[Diagram: use the attention distribution to take a weighted sum of the encoder hidden states. The attention output mostly contains information from the hidden states that received high attention.]

SLIDE 30

Sequence-to-sequence with attention

[Diagram: concatenate the attention output with the decoder hidden state, then use it to compute the first prediction $\hat{y}_1$ (“the”) as before.]

SLIDE 31

Sequence-to-sequence with attention

[Diagram: the same attention procedure at the next decoder step produces $\hat{y}_2$ (“poor”).]

SLIDE 32

Sequence-to-sequence with attention

[Diagram: the next decoder step produces $\hat{y}_3$ (“don’t”).]

SLIDE 33

Sequence-to-sequence with attention

[Diagram: the next decoder step produces $\hat{y}_4$ (“have”).]

SLIDE 34

Sequence-to-sequence with attention

[Diagram: the next decoder step produces $\hat{y}_5$ (“any”).]

SLIDE 35

Sequence-to-sequence with attention

[Diagram: the next decoder step produces $\hat{y}_6$ (“money”).]

SLIDE 36

Computing attention

  • Encoder hidden states: $h^{enc}_1, \ldots, h^{enc}_n$
  • Decoder hidden state at time $t$: $h^{dec}_t$
  • First, get attention scores for this time step (we will see what $g$ is soon!): $e^t = [\,g(h^{enc}_1, h^{dec}_t), \ldots, g(h^{enc}_n, h^{dec}_t)\,]$
  • Obtain the attention distribution using softmax: $\alpha^t = \mathrm{softmax}(e^t) \in \mathbb{R}^n$
  • Compute the weighted sum of encoder hidden states: $a_t = \sum_{i=1}^{n} \alpha^t_i h^{enc}_i \in \mathbb{R}^h$
  • Finally, concatenate with the decoder state and pass on to the output layer: $[a_t; h^{dec}_t] \in \mathbb{R}^{2h}$
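These steps map almost line-for-line onto code. A minimal sketch for a single decoder time step (the scoring function g is passed in; shapes are illustrative):

```python
import torch

def attention_step(h_enc, h_dec_t, g):
    """h_enc: (n, h) encoder hidden states; h_dec_t: (h,) decoder state at time t."""
    e_t = torch.stack([g(h_i, h_dec_t) for h_i in h_enc])   # attention scores e^t
    alpha_t = torch.softmax(e_t, dim=0)                      # attention distribution, in R^n
    a_t = (alpha_t.unsqueeze(1) * h_enc).sum(dim=0)          # weighted sum, in R^h
    return torch.cat([a_t, h_dec_t])                         # [a_t; h_dec_t], in R^{2h}
```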

SLIDE 37

Types of attention

  • Assume encoder hidden states $h_1, h_2, \ldots, h_n$ and decoder hidden state $z$
  • 1. Dot-product attention (assumes equal dimensions for $h_i$ and $z$): $e_i = g(h_i, z) = z^{\top} h_i \in \mathbb{R}$
  • 2. Multiplicative attention: $g(h_i, z) = z^{\top} W h_i \in \mathbb{R}$, where $W$ is a weight matrix
  • 3. Additive attention: $g(h_i, z) = v^{\top} \tanh(W_1 h_i + W_2 z) \in \mathbb{R}$, where $W_1, W_2$ are weight matrices and $v$ is a weight vector
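The three scoring functions above, written as small PyTorch functions (each returns a scalar score; parameter shapes are illustrative assumptions):

```python
import torch

def dot_attention(h_i, z):
    return z @ h_i                               # assumes h_i and z have equal dimension

def multiplicative_attention(h_i, z, W):
    return z @ (W @ h_i)                         # W is a weight matrix

def additive_attention(h_i, z, W1, W2, v):
    return v @ torch.tanh(W1 @ h_i + W2 @ z)     # W1, W2 weight matrices, v a weight vector
```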

SLIDE 38

Rare Words and Monolingual Text

SLIDE 39

Handling Rare Words

  • Words are a difficult unit to work with, e.g. vocabularies get very large; how to handle OOV?

Sennrich et al. (2016)

  • Character-level models are possible, but expensive

Input: _the _eco tax _port i co _in _Po nt - de - Bu is … Output: _le _port ique _éco taxe _de _Pont - de - Bui s

  • Compromise solution: use thousands of “word pieces” (which may be full words but may also be parts of words)
  • Can do transliteration, model sub-word regularities, etc.
SLIDE 40

Byte Pair Encoding (BPE)

  • Use a large corpus of text for counting
  • Start with every individual byte (basically character) as its own symbol
  • Count bigram character co-occurrences
  • Merge the most frequent pair of adjacent characters
  • 8k merges => vocabulary of around 8000 word pieces, including many whole words
  • Most SOTA NMT systems use this on both source + target

Sennrich et al. (2016)
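A toy sketch of the BPE merge loop in the spirit of Sennrich et al. (2016); it works on a small in-memory word list, skips the end-of-word marker used in the paper, and the example corpus is made up:

```python
from collections import Counter

def learn_bpe_merges(corpus_words, num_merges):
    vocab = Counter(tuple(word) for word in corpus_words)  # every character is its own symbol
    merges = []
    for _ in range(num_merges):
        pairs = Counter()                                  # bigram symbol co-occurrence counts
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                   # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():                # apply the merge everywhere
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    out.append(symbols[i]); i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# e.g. learn_bpe_merges(["low", "lower", "newest", "widest"] * 100, num_merges=10)
```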

SLIDE 41

Backtranslation

  • Classical MT methods used a bilingual corpus of sentences B = (S, T) and a large monolingual corpus T’ to train a language model. Can neural MT do the same? Sennrich et al. (2015)
  • Approach 1: force the system to generate T’ as targets from null inputs:
    s1, t1; s2, t2; …; [null], t’1; [null], t’2; …
  • Approach 2: generate synthetic sources with a T->S machine translation system (backtranslation):
    s1, t1; s2, t2; …; MT(t’1), t’1; MT(t’2), t’2; …
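A sketch of Approach 2 as a data-augmentation step; reverse_mt stands in for a trained target-to-source MT system, and all names here are hypothetical:

```python
def add_backtranslated_pairs(bilingual_pairs, monolingual_targets, reverse_mt):
    # reverse_mt: a trained target->source MT system (hypothetical callable).
    synthetic = [(reverse_mt(t_prime), t_prime) for t_prime in monolingual_targets]
    # Train the source->target NMT system on real pairs plus noisy synthetic-source pairs.
    return bilingual_pairs + synthetic
```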

SLIDE 42

Backtranslation

  • parallelsynth: backtranslate the training data; this makes additional noisy source sentences which could be useful
  • Gigaword: large monolingual English corpus

Sennrich et al. (2015)

SLIDE 43

Google’s NMT System

  • 8-layer LSTM encoder-decoder with attention, word-piece vocabulary of 8k-32k

Wu et al. (2016)

SLIDE 44

(Wu et al., 2016)

SLIDE 45

Google’s NMT System

Gender is correct in GNMT but not in PBMT (“sled”, “walker”)

Wu et al. (2016)

SLIDE 46

Transformers for MT

SLIDE 47

RNNs vs Transformers

[Diagram: Transformer layers 1-3 stacked over the input sentence “the movie was terribly exciting !”]

SLIDE 48

New Twist: Self-Attention

Vaswani et al. (2017)

the movie was great

  • Each word computes attention over every other word
  • Multiple “heads”: use parameters $W_k$ and $V_k$ to get different attention values + transform vectors

$\alpha_{i,j} = \mathrm{softmax}(x_i^{\top} x_j)$   (scalar)

$x'_i = \sum_{j=1}^{n} \alpha_{i,j} x_j$   (vector = sum of scalar * vector)

$\alpha_{k,i,j} = \mathrm{softmax}(x_i^{\top} W_k x_j)$

$x'_{k,i} = \sum_{j=1}^{n} \alpha_{k,i,j} V_k x_j$
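These equations translate directly into a few lines; here is a sketch of one head (no separate query/key/value projections or scaling, which the full Transformer adds):

```python
import torch

def self_attention_head(X, W_k, V_k):
    """X: (n, d) word vectors x_1..x_n; W_k, V_k: (d, d) parameters for head k."""
    scores = X @ W_k @ X.T                  # x_i^T W_k x_j for every pair (i, j)
    alpha = torch.softmax(scores, dim=-1)   # each word attends over every other word
    return alpha @ (X @ V_k.T)              # row i is sum_j alpha_{k,i,j} V_k x_j
```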

SLIDE 49

Transformers

SLIDE 50

Transformers

  • NIPS’17: Attention is All You Need
  • Key idea: Multi-head self-attention
  • No recurrence structure any more, so it trains much faster
  • Originally proposed for NMT (encoder-decoder framework)
  • Used as the base model for lots of follow-up work

SLIDE 51

Transformers

  • Each Transformer block has two sub-layers
  • Multi-head attention
  • 2-layer feedforward NN (with ReLU)
  • Each sub-layer has a residual connection and a layer normalization: LayerNorm(x + SubLayer(x))

(Ba et al, 2016): Layer Normalization

  • Input layer has a positional encoding
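A sketch of one such block using PyTorch's built-in pieces; the hyperparameters follow the common base configuration and are illustrative, and this is the post-norm layout shown on the slide rather than a faithful reimplementation of the full model:

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))      # 2-layer feedforward NN
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Each sub-layer is wrapped as LayerNorm(x + SubLayer(x)).
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        return self.norm2(x + self.ff(x))
```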
SLIDE 52

Transformers and Word Order

  • Augment the word embedding with position embeddings emb(1), emb(2), emb(3), …, where each dimension is a sine/cosine wave of a different frequency. Closer points = higher dot products
  • Works essentially as well as just encoding position as a one-hot vector

Vaswani et al. (2017)
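A short sketch of these sinusoidal position embeddings (assumes an even d_model; follows the Vaswani et al. (2017) formula):

```python
import torch

def positional_encoding(max_len, d_model):
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)              # positions 0..max_len-1
    freq = torch.pow(10000.0, -torch.arange(0, d_model, 2).float() / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)   # even dims: sine waves of different frequencies
    pe[:, 1::2] = torch.cos(pos * freq)   # odd dims: cosine waves
    return pe                             # added to the word embeddings emb(1), emb(2), ...
```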

SLIDE 53

Transformers for MT

  • Encoder and decoder are both transformers
  • Decoder consumes the previously generated token (and attends to the input), but has no recurrent state

Vaswani et al. (2017)

SLIDE 54

Transformers

  • Big = 6 layers, 1000 dim for each token, 16 heads; base = 6 layers + other params halved

Vaswani et al. (2017)

SLIDE 55

Visualization

Vaswani et al. (2017)

SLIDE 56

Visualization

Vaswani et al. (2017)

SLIDE 57

Useful Resources

nn.Transformer:

nn.TransformerEncoder:

The Annotated Transformer:

http://nlp.seas.harvard.edu/2018/04/03/attention.html

A Jupyter notebook which explains how Transformer works line by line in PyTorch!
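For orientation, a minimal usage sketch of nn.Transformer as an encoder-decoder MT model; vocabulary sizes and dimensions are made up, and positional encodings are omitted for brevity (see the Annotated Transformer for a complete version):

```python
import torch
import torch.nn as nn

src_vocab, tgt_vocab, d_model = 8000, 8000, 512     # illustrative sizes (e.g. BPE vocabularies)

src_embed = nn.Embedding(src_vocab, d_model)
tgt_embed = nn.Embedding(tgt_vocab, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=8,
                             num_encoder_layers=6, num_decoder_layers=6,
                             batch_first=True)
generator = nn.Linear(d_model, tgt_vocab)

src = torch.randint(0, src_vocab, (2, 10))          # (batch, src_len) source token ids
tgt = torch.randint(0, tgt_vocab, (2, 7))           # (batch, tgt_len) target token ids
# Causal mask so each target position only attends to earlier positions.
tgt_mask = transformer.generate_square_subsequent_mask(tgt.size(1))
out = transformer(src_embed(src), tgt_embed(tgt), tgt_mask=tgt_mask)
logits = generator(out)                              # (batch, tgt_len, tgt_vocab)
```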

SLIDE 58

So is Machine Translation solved?

  • Nope!
  • Many difficulties remain:
  • Out-of-vocabulary words
  • Domain mismatch between train and test data
  • Maintaining context over longer text
  • Low-resource language pairs


SLIDE 59

So is Machine Translation solved?

  • Nope!
  • Using common sense is still hard


SLIDE 60

So is Machine Translation solved?

  • Nope!
  • NMT picks up biases in training data


Source: https://hackernoon.com/bias-sexist-or-this-is-the-way-it-should-be-ce1f7c8c683c

Didn’t specify gender

SLIDE 61

So is Machine Translation solved?

  • Nope!
  • Uninterpretable systems do strange things


Source: http://languagelog.ldc.upenn.edu/nll/?p=35120#more-35120

SLIDE 62

Massively multilingual MT

(Arivazhagan et al., 2019)

  • Train a single neural network on 103 languages paired with English (remember Interlingua?)
  • Massive improvements on low-resource languages