Neural Machine Translation

Philipp Koehn 6 October 2020


Language Models

  • Modeling variants

    – feed-forward neural network
    – recurrent neural network
    – long short-term memory neural network

  • May include input context


Feed Forward Neural Language Model

[Figure: feed-forward neural language model. The history words w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1} are embedded (embedding matrix E_w), fed through a feed-forward hidden layer, and a softmax output layer predicts the word w_i.]
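
A minimal numpy sketch of the forward pass of such a model, assuming invented vocabulary, embedding, and hidden-layer sizes and a tanh hidden layer: the four history words are embedded, concatenated, passed through the hidden layer, and a softmax gives a distribution over the next word.

```python
import numpy as np

# Illustrative sizes only; real vocabularies and layers are much larger.
VOCAB, EMB, HIDDEN, HISTORY = 1000, 32, 64, 4

rng = np.random.default_rng(0)
E  = rng.normal(size=(VOCAB, EMB))             # word embedding matrix E_w
W1 = rng.normal(size=(HISTORY * EMB, HIDDEN))  # concatenated embeddings -> hidden layer
W2 = rng.normal(size=(HIDDEN, VOCAB))          # hidden layer -> output scores

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ff_lm_predict(history_ids):
    """Distribution over w_i given the four history words w_{i-4} ... w_{i-1}."""
    x = np.concatenate([E[w] for w in history_ids])  # embed and concatenate the history
    h = np.tanh(x @ W1)                              # hidden layer
    return softmax(h @ W2)                           # softmax over the vocabulary

p = ff_lm_predict([3, 17, 42, 7])  # arbitrary word ids
print(p.shape, round(float(p.sum()), 3))  # (1000,) 1.0
```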


Recurrent Neural Language Model

[Figure: recurrent neural language model, first step. The start symbol <s> is embedded, fed into the recurrent state, and a softmax predicts the first output word "the".]

  • Predict the first word of a sentence


Recurrent Neural Language Model

[Figure: recurrent neural language model, second step. The predicted word "the" is embedded and fed into the RNN together with the hidden state from the first step; a softmax predicts the second output word "house".]

  • Predict the second word of a sentence
  • Re-use hidden state from first word prediction


Recurrent Neural Language Model

[Figure: recurrent neural language model, third step. After "<s> the house", the softmax over the recurrent state predicts the third output word "is".]

  • Predict the third word of a sentence
  • ... and so on


Recurrent Neural Language Model

[Figure: recurrent neural language model unrolled over a full sentence. Input "<s> the house is big .", with the softmax at each step predicting the next word: "the house is big . </s>".]
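
A minimal numpy sketch of this recurrent language model, with a plain tanh recurrence standing in for the RNN cell and invented sizes; the hidden state from each step is re-used for the next prediction.

```python
import numpy as np

VOCAB, EMB, HIDDEN = 1000, 32, 64  # illustrative sizes
rng = np.random.default_rng(0)
E   = rng.normal(size=(VOCAB, EMB))      # input word embeddings
W_x = rng.normal(size=(EMB, HIDDEN))     # embedding -> recurrent state
W_h = rng.normal(size=(HIDDEN, HIDDEN))  # previous state -> recurrent state
W_o = rng.normal(size=(HIDDEN, VOCAB))   # recurrent state -> output scores

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_lm(input_ids):
    """Predict each next word of a sentence, starting from <s>."""
    h = np.zeros(HIDDEN)                      # initial recurrent state
    predictions = []
    for x in input_ids:                       # x = <s>, "the", "house", ...
        h = np.tanh(E[x] @ W_x + h @ W_h)     # re-use the hidden state from the previous step
        predictions.append(softmax(h @ W_o))  # t_i: distribution over the next word
    return predictions

dists = rnn_lm([0, 5, 9, 2, 7, 1])  # word ids for "<s> the house is big ."
```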


Recurrent Neural Translation Model

  • We predicted the words of a sentence
  • Why not also predict their translations?


Encoder-Decoder Model

[Figure: encoder-decoder model. A single recurrent network reads the input "the house is big . </s>" and then keeps going, predicting the translation "das Haus ist groß . </s>" word by word.]

  • Obviously madness
  • Proposed by Google (Sutskever et al. 2014)


What is Missing?

  • Alignment of input words to output words

⇒ Solution: attention mechanism


neural translation model with attention


Input Encoding

[Figure: the recurrent neural network language model from the previous slides, applied to the input sentence "<s> the house is big .".]

  • Inspiration: recurrent neural network language model on the input side


Hidden Language Model States

  • This gives us the hidden states

[Figure: chain of RNN hidden states, computed left to right over the input words.]

  • These encode left context for each word
  • Same process in reverse: right context for each word

[Figure: the same chain of RNN hidden states, computed right to left.]


Input Encoder

[Figure: input encoder. Each input word embedding feeds both a left-to-right and a right-to-left recurrent encoder; the two hidden states are concatenated for each word.]

  • Input encoder: concatenate bidirectional RNN states
  • Each word representation includes full left and right sentence context


Encoder: Math

[Figure: the bidirectional input encoder from the previous slide.]

  • Input is sequence of words $x_j$, mapped into embedding space $\bar{E} x_j$

  • Bidirectional recurrent neural networks

$$\overleftarrow{h}_j = f(\overleftarrow{h}_{j+1}, \bar{E} x_j) \qquad \overrightarrow{h}_j = f(\overrightarrow{h}_{j-1}, \bar{E} x_j)$$

  • Various choices for the function f(): feed-forward layer, GRU, LSTM, ...
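
A minimal numpy sketch of these two recurrences, assuming a plain tanh layer for f() and invented dimensions; the two directions are concatenated into one representation per input word, as on the previous slide.

```python
import numpy as np

EMB, HIDDEN = 32, 64  # illustrative sizes
rng = np.random.default_rng(0)
W_fx, W_fh = rng.normal(size=(EMB, HIDDEN)), rng.normal(size=(HIDDEN, HIDDEN))  # left-to-right
W_bx, W_bh = rng.normal(size=(EMB, HIDDEN)), rng.normal(size=(HIDDEN, HIDDEN))  # right-to-left

def encode(embedded):
    """embedded[j] is the word embedding E x_j; returns one vector h_j per input word."""
    n = len(embedded)
    fwd, bwd = [None] * n, [None] * n
    for j in range(n):                     # forward: h_j = f(h_{j-1}, E x_j)
        prev = fwd[j - 1] if j > 0 else np.zeros(HIDDEN)
        fwd[j] = np.tanh(embedded[j] @ W_fx + prev @ W_fh)
    for j in reversed(range(n)):           # backward: h_j = f(h_{j+1}, E x_j)
        nxt = bwd[j + 1] if j + 1 < n else np.zeros(HIDDEN)
        bwd[j] = np.tanh(embedded[j] @ W_bx + nxt @ W_bh)
    # concatenate both directions: h_j = (backward_j, forward_j)
    return [np.concatenate([b, f]) for f, b in zip(fwd, bwd)]

H = encode([rng.normal(size=EMB) for _ in range(6)])  # a 6-word input sentence
print(len(H), H[0].shape)                             # 6 (128,)
```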


Decoder

  • We want to have a recurrent neural network predicting output words

[Figure: decoder. A chain of recurrent decoder states s_i, each followed by a softmax that produces an output word prediction t_i.]


Decoder

  • We want to have a recurrent neural network predicting output words

[Figure: decoder with output word feedback. Each predicted output word is embedded (E y_i) and fed into the next decoder state s_i.]

  • We feed decisions on output words back into the decoder state


Decoder

  • We want to have a recurrent neural network predicting output words

[Figure: decoder with output word feedback and input context. Each decoder state s_i also receives an input context c_i.]

  • We feed decisions on output words back into the decoder state
  • Decoder state is also informed by the input context


More Detail

[Figure: one decoder step in detail. The previous output word (<s>, then "das") is embedded and combined with the previous decoder state and the input context c_i; a softmax over the prediction vector t_i yields the output word y_i.]

  • Decoder is also a recurrent neural network over a sequence of hidden states $s_i$

    $$s_i = f(s_{i-1}, E y_{i-1}, c_i)$$

  • Again, various choices for the function f(): feed-forward layer, GRU, LSTM, ...

  • Output word $y_i$ is selected by computing a vector $t_i$ (same size as vocabulary)

    $$t_i = W(U s_{i-1} + V E y_{i-1} + C c_i)$$

    then finding the highest value in vector $t_i$

  • If we normalize $t_i$, we can view it as a probability distribution over words

  • $E y_i$ is the embedding of the output word $y_i$
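
A minimal numpy sketch of one decoder step following these formulas, with tanh standing in for f() and invented weight shapes; it returns the new state s_i and the prediction vector t_i.

```python
import numpy as np

VOCAB, EMB, HIDDEN, CTX = 1000, 32, 64, 128  # illustrative sizes
rng = np.random.default_rng(0)
E_out = rng.normal(size=(VOCAB, EMB))                                        # output word embeddings E
W_s, W_y, W_c = (rng.normal(size=(d, HIDDEN)) for d in (HIDDEN, EMB, CTX))   # weights of the state update f
U, V, C = (rng.normal(size=(d, HIDDEN)) for d in (HIDDEN, EMB, CTX))
W = rng.normal(size=(HIDDEN, VOCAB))

def decoder_step(s_prev, y_prev, c_i):
    """One decoder step: new hidden state s_i and prediction vector t_i."""
    Ey_prev = E_out[y_prev]                                  # embedding of the previous output word
    s_i = np.tanh(s_prev @ W_s + Ey_prev @ W_y + c_i @ W_c)  # s_i = f(s_{i-1}, E y_{i-1}, c_i)
    t_i = (s_prev @ U + Ey_prev @ V + c_i @ C) @ W           # t_i = W(U s_{i-1} + V E y_{i-1} + C c_i)
    return s_i, t_i

s, t = decoder_step(np.zeros(HIDDEN), 0, np.zeros(CTX))  # start: <s> as previous word
y = int(t.argmax())  # highest value in t_i selects the output word
```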


Attention

[Figure: attention. The decoder state s_i is compared against the bidirectional encoder states h_j by an attention layer producing weights α_ij.]

  • Given what we have generated so far (decoder hidden state)
  • ... which words in the input should we pay attention to (encoder states)?


Attention

[Figure: attention, as on the previous slide.]

  • Given:
    – the previous hidden state of the decoder $s_{i-1}$
    – the representation of input words $h_j = (\overleftarrow{h}_j, \overrightarrow{h}_j)$

  • Predict an alignment probability $a(s_{i-1}, h_j)$ for each input word $j$
    (modeled with a feed-forward neural network layer)


Attention

[Figure: attention, as on the previous slide.]

  • Normalize attention (softmax)

$$\alpha_{ij} = \frac{\exp(a(s_{i-1}, h_j))}{\sum_k \exp(a(s_{i-1}, h_k))}$$


Attention

[Figure: attention with weighted sum. The normalized attention weights α_ij are used to form a weighted sum of the encoder states, giving the input context c_i.]

  • Relevant input context: weigh input words according to attention:

    $$c_i = \sum_j \alpha_{ij} h_j$$
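
A minimal numpy sketch of the attention computation from the last three slides, assuming a single feed-forward layer (with invented weights W_a, U_a, v_a) for the alignment score a(s_{i-1}, h_j); it returns the normalized weights α_ij and the input context c_i.

```python
import numpy as np

HIDDEN, CTX, ATT = 64, 128, 32  # illustrative sizes
rng = np.random.default_rng(0)
W_a = rng.normal(size=(HIDDEN, ATT))  # scores the decoder state
U_a = rng.normal(size=(CTX, ATT))     # scores an encoder state
v_a = rng.normal(size=ATT)

def attention(s_prev, H):
    """Attention weights alpha_ij and input context c_i for decoder state s_{i-1}."""
    # a(s_{i-1}, h_j): one feed-forward layer per encoder state h_j
    scores = np.array([v_a @ np.tanh(s_prev @ W_a + h @ U_a) for h in H])
    # normalize with a softmax: alpha_ij = exp(a_ij) / sum_k exp(a_ik)
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()
    # weighted sum of encoder states: c_i = sum_j alpha_ij h_j
    c_i = sum(a * h for a, h in zip(alpha, H))
    return alpha, c_i

H = [rng.normal(size=CTX) for _ in range(6)]  # encoder states h_j for a 6-word input
alpha, c = attention(np.zeros(HIDDEN), H)
print(round(float(alpha.sum()), 3), c.shape)  # 1.0 (128,)
```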


Attention

[Figure: attention with weighted sum, as on the previous slide; the resulting context feeds into the next decoder state.]

  • Use context to predict next hidden state and output word


training


Comparing Prediction to Correct Word

[Figure: for each output position, the predicted distribution t_i (after the softmax) is compared against the correct output word y_i ("das", "Haus", "ist"), yielding a cost of −log t_i[y_i].]

  • Current model gives some probability $t_i[y_i]$ to the correct word $y_i$
  • We turn this into an error by computing the cross-entropy: $-\log t_i[y_i]$
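
A tiny numeric check of this error computation, with a made-up three-word distribution:

```python
import numpy as np

def word_error(t_i, y_i):
    """Cross-entropy error of the predicted distribution t_i for the correct word y_i."""
    return -np.log(t_i[y_i])

t_i = np.array([0.1, 0.7, 0.2])  # toy distribution over a 3-word vocabulary
print(word_error(t_i, 1))        # -log 0.7, about 0.357
```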


Computation Graph

  • Math behind neural machine translation defines a computation graph
  • Forward and backward computation to compute gradients for model training

[Figure: worked example of a computation graph. An input x is multiplied with W1, summed with b1, and passed through a sigmoid; the result is multiplied with W2, summed with b2, and passed through another sigmoid. The forward-pass values are annotated at each node.]
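
A minimal numpy sketch of such a computation graph evaluated forward and then backward by hand (the chain rule applied node by node); the node structure follows the figure, but the numeric values here are invented.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy graph: y = sigmoid(w2 . sigmoid(W1 x + b1) + b2); all values invented.
x  = np.array([1.0, 0.0])
W1 = np.array([[3.0, 4.0], [2.0, 3.0]]); b1 = np.array([-2.0, -4.0])
w2 = np.array([5.0, -5.0]);              b2 = -2.0

# Forward computation: evaluate each node of the graph in order.
a = W1 @ x + b1            # product, then sum
h = sigmoid(a)             # sigmoid
y = sigmoid(w2 @ h + b2)   # product, sum, sigmoid

# Backward computation: chain rule, node by node, to obtain gradients.
dz2 = y * (1 - y)          # through the output sigmoid
dw2 = dz2 * h              # gradient for w2
db2 = dz2                  # gradient for b2
dh  = dz2 * w2
da  = dh * h * (1 - h)     # through the hidden sigmoid
dW1 = np.outer(da, x)      # gradient for W1
db1 = da                   # gradient for b1
print(y, dW1, db1, dw2, db2)
```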


Unrolled Computation Graph

[Figure: the full neural translation model with attention, unrolled as one computation graph for the sentence pair "the house is big ." → "das Haus ist groß .": input word embeddings, bidirectional encoder, attention and weighted sum, decoder states, softmax predictions t_i, and a per-word error −log t_i[y_i].]


Batching

  • Already large degree of parallelism

    – most computations on vectors, matrices
    – efficient implementations for CPU and GPU

  • Further parallelism by batching

    – processing several sentence pairs at once
    – scalar operation → vector operation
    – vector operation → matrix operation
    – matrix operation → 3d tensor operation

  • Typical batch sizes 50–100 sentence pairs


Batches

  • Sentences have different lengths
  • When batching, fill up unneeded cells in tensors

⇒ A lot of wasted computations


Mini-Batches

  • Sort sentences by length, break up into mini-batches
  • Example: Maxi-batch 1600 sentence pairs, mini-batch 80 sentence pairs
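
A minimal Python sketch of this batching scheme, using the 1600/80 sizes from the example; sentence pairs are grouped into maxi-batches, sorted by source length, split into mini-batches, and padded (with a hypothetical <pad> token) only up to the longest sentence in each mini-batch.

```python
def make_minibatches(corpus, maxi=1600, mini=80, pad="<pad>"):
    """Group sentence pairs into mini-batches of similar length
    (1600 and 80 are the sizes from the example on the slide)."""
    for start in range(0, len(corpus), maxi):
        # take a maxi-batch and sort it by source sentence length
        maxi_batch = sorted(corpus[start:start + maxi], key=lambda pair: len(pair[0]))
        for s in range(0, len(maxi_batch), mini):
            batch = maxi_batch[s:s + mini]
            # pad only up to the longest sentence within this mini-batch
            src_len = max(len(src) for src, _ in batch)
            tgt_len = max(len(tgt) for _, tgt in batch)
            yield [(src + [pad] * (src_len - len(src)),
                    tgt + [pad] * (tgt_len - len(tgt))) for src, tgt in batch]

corpus = [(["the", "house", "is", "big", "."], ["das", "Haus", "ist", "groß", "."]),
          (["hello", "."], ["hallo", "."])]
for batch in make_minibatches(corpus, maxi=2, mini=2):
    print(batch)
```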


Overall Organization of Training

  • Shuffle corpus
  • Break into maxi-batches
  • Break up each maxi-batch into mini-batches
  • Process mini-batch, update parameters
  • Once done, repeat
  • Typically 5-15 epochs needed (passes through entire training corpus)
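
A sketch of this overall loop; make_minibatches is the function from the batching sketch above, and forward_backward / update are stand-ins for the actual gradient computation and parameter update.

```python
import random

def train(corpus, model, epochs=10, maxi=1600, mini=80):
    """Overall training organization; relies on make_minibatches from the
    batching sketch above, with stand-ins for the gradient computation."""
    def forward_backward(model, batch):  # stand-in: unroll the computation graph,
        return {}                        # run forward and backward, return gradients
    def update(model, gradients):        # stand-in: apply the gradients (e.g. SGD, Adam)
        pass
    for epoch in range(epochs):                             # typically 5-15 epochs
        random.shuffle(corpus)                              # shuffle corpus
        for batch in make_minibatches(corpus, maxi, mini):  # maxi-batches, then mini-batches
            update(model, forward_backward(model, batch))   # process mini-batch, update parameters
```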


deeper models


Deeper Models

  • Encoder and decoder are recurrent neural networks
  • We can add additional layers for each step
  • Recall shallow and deep language models

[Figure: three language model architectures. Shallow (embedding, one RNN layer h_t, softmax output y_t), deep stacked (several stacked RNN layers h_{t,1}, h_{t,2}, h_{t,3} per time step), and deep transitional (several RNN transitions on the path from input to output at each time step).]

  • Adding residual connections (short-cuts through deep layers) helps
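
A minimal numpy sketch of the deep stacked variant with residual connections, assuming tanh recurrences and invented, equal layer sizes so the short-cuts can be added directly; each layer keeps its own recurrent state and consumes the output of the layer below.

```python
import numpy as np

EMB = HIDDEN = 64  # equal sizes (invented) so residual short-cuts add directly
LAYERS = 3
rng = np.random.default_rng(0)
W_in = [rng.normal(size=(HIDDEN, HIDDEN)) for _ in range(LAYERS)]
W_h  = [rng.normal(size=(HIDDEN, HIDDEN)) for _ in range(LAYERS)]

def stacked_step(x_t, states):
    """One time step of a deep stacked RNN with residual connections."""
    new_states, inp = [], x_t
    for l in range(LAYERS):
        h = np.tanh(inp @ W_in[l] + states[l] @ W_h[l])  # recurrence of layer l
        new_states.append(h)
        inp = h + inp                                    # residual short-cut to the next layer
    return new_states, inp                               # top of the stack feeds the softmax

states = [np.zeros(HIDDEN) for _ in range(LAYERS)]
states, top = stacked_step(rng.normal(size=EMB), states)
```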


Deep Decoder

  • Two ways of adding layers

    – deep transitions: several layers on path to output
    – deeply stacking recurrent neural networks

  • Why not both?

[Figure: deep decoder combining both. Each time step runs two stacked blocks; within each stack, one recurrent transition (fed by the input context c_t or by the stack below) is followed by feed-forward transitions, producing states v_{t,k,1}, v_{t,k,2} and the decoder state s_{t,k} = v_{t,k,3}.]
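
A minimal numpy sketch of one decoder step combining both ideas, loosely following the stack/transition organization in the figure: within each stack, one recurrent transition is followed by feed-forward transitions, and the result is passed up to the next stack. All names and sizes are invented.

```python
import numpy as np

HIDDEN, STACKS, TRANSITIONS = 64, 2, 3  # invented sizes: 2 stacks of 3 transitions each
rng = np.random.default_rng(0)
W_x = [rng.normal(size=(HIDDEN, HIDDEN)) for _ in range(STACKS)]
W_h = [rng.normal(size=(HIDDEN, HIDDEN)) for _ in range(STACKS)]
W_t = [[rng.normal(size=(HIDDEN, HIDDEN)) for _ in range(TRANSITIONS - 1)]
       for _ in range(STACKS)]

def deep_decoder_step(x_t, states):
    """One decoder step: each stack runs a recurrent transition, then FF transitions."""
    new_states, inp = [], x_t
    for k in range(STACKS):
        v = np.tanh(inp @ W_x[k] + states[k] @ W_h[k])  # v_{t,k,1}: recurrent transition
        for W in W_t[k]:
            v = np.tanh(v @ W)                          # v_{t,k,2}, v_{t,k,3}: feed-forward transitions
        new_states.append(v)                            # s_{t,k} = v_{t,k,3}
        inp = v                                         # next stack consumes the state below
    return new_states

states = deep_decoder_step(rng.normal(size=HIDDEN), [np.zeros(HIDDEN)] * STACKS)
```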


Deep Encoder

  • Previously proposed encoder already has 2 layers

    – left-to-right recurrent network, to encode left context
    – right-to-left recurrent network, to encode right context

⇒ Third way of adding layers

[Figure: deep encoder. Alternating left-to-right and right-to-left recurrent layers are stacked on top of the input word embeddings, producing encoder states h_{j,1}, h_{j,2}, h_{j,3}, h_{j,4}.]