SLIDE 1

Sequence-to-Sequence Learning using Recurrent Neural Networks

Jindřich Helcl, Jindřich Libovický

March 4, 2020

NPFL116 Compendium of Neural Machine Translation

Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

SLIDE 2

Outline

  • Symbol Embeddings
  • Recurrent Networks
  • Neural Network Language Models
  • Vanilla Sequence-to-Sequence Model
  • Attentive Sequence-to-Sequence Learning
  • Reading Assignment

SLIDE 3

Symbol Embeddings

SLIDE 4

Discrete symbol vs. continuous representation

Simple task: predict the next word given the three previous words:

Source: Bengio, Yoshua, et al. ”A neural probabilistic language model.” Journal of machine learning research 3.Feb (2003): 1137-1155. http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

SLIDE 5

Embeddings

  • Natural solution: one-hot vector (a vector of vocabulary length with exactly one 1)
  • It would mean multiplying by a huge matrix every time a symbol is on the input
  • Rather factorize this matrix and share the first part ⇒ embeddings
  • “Embeddings” because they embed discrete symbols into a continuous space

Think of training-related problems when using word embeddings... An embedding gets updated only rarely – only when its symbol appears in the data.
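A minimal numpy sketch (not from the slides) of the factorization: multiplying a one-hot vector by a weight matrix just selects one row, so the matrix can be stored once as a shared embedding table and indexed directly. The names and sizes below are illustrative assumptions.

import numpy as np

vocab_size, emb_dim = 10, 4                   # toy sizes
E = np.random.randn(vocab_size, emb_dim)      # shared embedding matrix (the factorized "first part")

word_id = 7
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# identical results: the one-hot product is just a row lookup
assert np.allclose(one_hot @ E, E[word_id])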

SLIDE 8

Properties of embeddings

Source: https://blogs.mathworks.com/loren/2017/09/21/math-with-words-word-embeddings-with-matlab-and-text-analytics-toolbox/

SLIDE 9

Recurrent Networks

SLIDE 10

Why RNNs

  • for loops over sequential data
  • the most frequently used type of network in NLP

SLIDE 11

General Formulation

  • inputs: 𝑦1, … , 𝑦𝑈
  • initial state ℎ0 is either:
      • the result of a previous computation, or
      • a trainable parameter
  • recurrent computation: ℎ𝑢 = 𝐵(ℎ𝑢−1, 𝑦𝑢)

SLIDE 12

RNN as Imperative Code

def rnn(initial_state, inputs):
    prev_state = initial_state
    for x in inputs:
        # rnn_cell is assumed to return the new state and this step's output
        new_state, output = rnn_cell(x, prev_state)
        prev_state = new_state
        yield output

SLIDE 13

RNN as a Fancy Image

SLIDE 14

Vanilla RNN

ℎ𝑢 = tanh (𝑋[ℎ𝑢−1; 𝑦𝑢] + 𝑐)

  • cannot propagate long-distance relations
  • vanishing gradient problem
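A minimal numpy sketch of such a cell, usable as the rnn_cell assumed in the code on slide 12; the concatenation of [state; input] under a single weight matrix and returning the state also as the output are illustrative assumptions.

import numpy as np

def make_vanilla_rnn_cell(X, c):
    # X has shape (n, n + m), c has shape (n,)
    def rnn_cell(y, h_prev):
        h = np.tanh(X @ np.concatenate([h_prev, y]) + c)
        return h, h          # (new_state, output)
    return rnn_cell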

SLIDE 15

Vanishing Gradient Problem (1)

(Figure: plots of tanh and of its derivative.)

tanh(𝑦) = (1 − e^(−2𝑦)) / (1 + e^(−2𝑦))

d tanh(𝑦) / d𝑦 = 1 − tanh²(𝑦) ∈ (0, 1]

Weights are initialized ∼ 𝒩(0, 1) to have gradients further from zero.

SLIDE 17

Vanishing Gradient Problem (2)

∂𝐹𝑢+1 / ∂𝑐 = (∂𝐹𝑢+1 / ∂ℎ𝑢+1) ⋅ (∂ℎ𝑢+1 / ∂𝑐)

(chain rule)

SLIDE 20

Vanishing Gradient Problem (3)

∂ℎ𝑢 / ∂𝑐 = ∂ tanh(𝑨𝑢) / ∂𝑐,   where 𝑨𝑢 = 𝑋ℎℎ𝑢−1 + 𝑋𝑦𝑦𝑢 + 𝑐 is the activation

(tanh′ is the derivative of tanh)

= tanh′(𝑨𝑢) ⋅ ( ∂(𝑋ℎℎ𝑢−1)/∂𝑐 + ∂(𝑋𝑦𝑦𝑢)/∂𝑐 + ∂𝑐/∂𝑐 ),   where ∂(𝑋𝑦𝑦𝑢)/∂𝑐 = 0 and ∂𝑐/∂𝑐 = 1

= 𝑋ℎ ⋅ tanh′(𝑨𝑢) ⋅ ∂ℎ𝑢−1/∂𝑐 + tanh′(𝑨𝑢),   with 𝑋ℎ ∼ 𝒩(0, 1) and tanh′(𝑨𝑢) ∈ (0, 1]
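A toy scalar sketch (not from the slides) of what this recursion does: the part of the gradient that comes from an early step gets multiplied by a factor 𝑋 ⋅ tanh′(𝑨𝑢) at every later step, and the running product of those factors quickly shrinks toward zero. All constants below are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal()          # scalar recurrent weight, drawn roughly as on the slide
h = 0.0
survives = 1.0            # running product of X * tanh'(A_u): how much of an early gradient survives
for u in range(1, 31):
    A = X * h + rng.normal() + 0.1       # activation with a random input and a small bias
    h = np.tanh(A)
    survives *= X * (1.0 - h ** 2)       # each step multiplies by a factor that is usually < 1
    if u % 10 == 0:
        print(u, survives)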

SLIDE 24

LSTMs

LSTM = Long short-term memory

Control the gradient flow by explicitly gating:

  • what to use from input,
  • what to use from hidden state,
  • what to put on output

SLIDE 27

Hidden State

  • two types of hidden states
  • ℎ𝑢 — “public” hidden state, used as the output
  • 𝐷𝑢 — “private” memory, no non-linearities on the way
  • direct flow of gradients (without multiplying by derivatives ≤ 1)
  • only vectors guaranteed to live in the same space are manipulated
  • information highway metaphor

SLIDE 28

Forget Gate

𝑔𝑢 = 𝜏 (𝑋𝑔[ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑔)

  • based on input and previous state, decide what to forget from the memory

SLIDE 29

Input Gate

𝑗𝑢 = 𝜏 (𝑋𝑗 ⋅ [ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑗)
𝐷̃𝑢 = tanh (𝑋𝑑 ⋅ [ℎ𝑢−1; 𝑦𝑢] + 𝑐𝐷)

  • 𝐷̃𝑢 — candidate for what we may want to add to the memory
  • 𝑗𝑢 — decides how much of that information we want to store

SLIDE 30

Cell State Update

𝐷𝑢 = 𝑔𝑢 ⊙ 𝐷𝑢−1 + 𝑗𝑢 ⊙ 𝐷̃𝑢

SLIDE 31

Output Gate

𝑝𝑢 = 𝜏 (𝑋𝑝 ⋅ [ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑝)
ℎ𝑢 = 𝑝𝑢 ⊙ tanh 𝐷𝑢

SLIDE 32

Here we are!

𝑔𝑢 = 𝜏 (𝑋𝑔 ⋅ [ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑔)
𝑗𝑢 = 𝜏 (𝑋𝑗 ⋅ [ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑗)
𝑝𝑢 = 𝜏 (𝑋𝑝 ⋅ [ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑝)
𝐷̃𝑢 = tanh (𝑋𝑑 ⋅ [ℎ𝑢−1; 𝑦𝑢] + 𝑐𝐷)
𝐷𝑢 = 𝑔𝑢 ⊙ 𝐷𝑢−1 + 𝑗𝑢 ⊙ 𝐷̃𝑢
ℎ𝑢 = 𝑝𝑢 ⊙ tanh 𝐷𝑢

How would you implement it efficiently? Compute all gates in a single matrix multiplication.
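A minimal numpy sketch of that trick (not the authors' code): the four gate pre-activations come from one fused matrix multiplication and are then split. The shapes, the gate ordering, and the use of standard letters (f, i, o, c for the slides' 𝑔, 𝑗, 𝑝, 𝐷) are assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, y, W, b):
    # h_prev, c_prev: (n,); y: (m,); W: (4n, n + m); b: (4n,)
    z = W @ np.concatenate([h_prev, y]) + b          # one matmul for all four gates
    n = h_prev.shape[0]
    f = sigmoid(z[:n])                               # forget gate
    i = sigmoid(z[n:2 * n])                          # input gate
    o = sigmoid(z[2 * n:3 * n])                      # output gate
    c_tilde = np.tanh(z[3 * n:])                     # candidate memory content
    c = f * c_prev + i * c_tilde                     # cell state update
    h = o * np.tanh(c)                               # public hidden state
    return h, c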

SLIDE 35

Gated Recurrent Units

𝑨𝑢 = 𝜏 (𝑋𝑨[ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑨)
𝑠𝑢 = 𝜏 (𝑋𝑠[ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑠)
ℎ̃𝑢 = tanh (𝑋[𝑠𝑢 ⊙ ℎ𝑢−1; 𝑦𝑢])
ℎ𝑢 = (1 − 𝑨𝑢) ⊙ ℎ𝑢−1 + 𝑨𝑢 ⊙ ℎ̃𝑢
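For comparison, a numpy sketch of one GRU step under the same assumptions as the LSTM sketch above; the standard names z (update) and r (reset) stand in for the slides' 𝑨 and 𝑠.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, y, W_z, W_r, W_h, b_z, b_r):
    hy = np.concatenate([h_prev, y])
    z = sigmoid(W_z @ hy + b_z)                                 # update gate
    r = sigmoid(W_r @ hy + b_r)                                 # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, y]))    # candidate state (no bias, as on the slide)
    return (1.0 - z) * h_prev + z * h_tilde                     # interpolation of old and candidate state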

SLIDE 36

GRU and LSTM

LSTM
𝑔𝑢 = 𝜏 (𝑋𝑔[ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑔)
𝑗𝑢 = 𝜏 (𝑋𝑗 ⋅ [ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑗)
𝑝𝑢 = 𝜏 (𝑋𝑝 ⋅ [ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑝)
𝐷̃𝑢 = tanh (𝑋𝑑 ⋅ [ℎ𝑢−1; 𝑦𝑢] + 𝑐𝐷)
𝐷𝑢 = 𝑔𝑢 ⊙ 𝐷𝑢−1 + 𝑗𝑢 ⊙ 𝐷̃𝑢
ℎ𝑢 = 𝑝𝑢 ⊙ tanh 𝐷𝑢

GRU
𝑨𝑢 = 𝜏 (𝑋𝑨[ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑨)
𝑠𝑢 = 𝜏 (𝑋𝑠[ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑠)
ℎ̃𝑢 = tanh (𝑋[𝑠𝑢 ⊙ ℎ𝑢−1; 𝑦𝑢])
ℎ𝑢 = (1 − 𝑨𝑢) ⊙ ℎ𝑢−1 + 𝑨𝑢 ⊙ ℎ̃𝑢

SLIDE 37

GRU or LSTM?

  • GRU preserves the information highway property
  • GRU has fewer parameters, so it should learn faster
  • LSTM is more general (although both are Turing complete)
  • empirical results: it’s task-specific

Chung, Junyoung, et al. ”Empirical evaluation of gated recurrent neural networks on sequence modeling.” arXiv preprint arXiv:1412.3555 (2014). Irie, Kazuki, et al. ”LSTM, GRU, highway and a bit of attention: an empirical overview for language modeling in speech recognition.” Interspeech, San Francisco, CA, USA (2016).

SLIDE 38

Recurrent Networks +

  • correspond to the intuition of sequential processing
  • theoretically strong
  • cannot be parallelized, we always need to wait for the previous state

SLIDE 39

Neural Network Language Models

SLIDE 40

RNN Language Model

  • Train an RNN as a classifier for the next word (unlimited history)

(Figure: RNN unrolled over the input <s>, w1, …, w4, predicting p(w1), …, p(w5).)

  • Can be used to estimate sentence probability / perplexity → defines a distribution over sentences
  • We can sample from the distribution

(Figure: a sequence ~w1, …, ~w5 sampled from the model.)
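A sketch of such sampling, reusing the assumed helpers from slide 12 (rnn_cell, an embedding table, an output projection); the word ids, the special-token ids, and the length limit are illustrative assumptions.

import numpy as np

def sample_sentence(rnn_cell, initial_state, embeddings, output_projection,
                    bos_id, eos_id, max_len=50):
    rng = np.random.default_rng()
    state, w = initial_state, bos_id
    sentence = []
    for _ in range(max_len):
        state, output = rnn_cell(embeddings[w], state)
        logits = output_projection(output)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                        # softmax over the vocabulary
        w = int(rng.choice(len(probs), p=probs))    # draw the next word instead of taking argmax
        if w == eos_id:
            break
        sentence.append(w)
    return sentence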

SLIDE 43

Two views on RNN LM

  • RNN is a for loop (functional map) over sequential data
  • All outputs are conditional distributions → a probability distribution over sequences of words:

𝑄 (𝑥1, … , 𝑥𝑜) = ∏𝑗=1…𝑜 𝑄 (𝑥𝑗 | 𝑥𝑗−1, … , 𝑥1)

SLIDE 44

Vanilla Sequence-to-Sequence Model

SLIDE 45

Encoder-Decoder NMT

  • Exploits the conditional LM scheme
  • Two networks:
      1. A network processing the input sentence into a single vector representation (encoder)
      2. A neural language model initialized with the output of the encoder (decoder)

Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. “Sequence to sequence learning with neural networks.” Advances in neural information processing systems. 2014.

SLIDE 46

Encoder-Decoder – Image

(Figure: the encoder reads <s>, x1, …, x4; the decoder then generates ~y1, …, ~y5 starting from <s>.)

Source language input + target language LM

SLIDE 47

Encoder-Decoder Model – Code

state = np.zeros(EMB_SIZE)

# encoder: read the whole input sentence into a single vector
for w in input_words:
    input_embedding = source_embeddings[w]
    state, _ = enc_cell(state, input_embedding)

# decoder: a language model initialized with the encoder's final state
prev_w = "<s>"
while prev_w != "</s>":
    prev_w_embedding = target_embeddings[prev_w]
    state, dec_output = dec_cell(state, prev_w_embedding)
    logits = output_projection(dec_output)
    prev_w = np.argmax(logits)   # (mapping the index back to a word is omitted here)
    yield prev_w

SLIDE 48

Encoder-Decoder Model – Formal Notation

Data
    input embeddings (source language): x = (𝑦1, … , 𝑦𝑈𝑦)
    output embeddings (target language): y = (𝑧1, … , 𝑧𝑈𝑧)

Encoder
    initial state: ℎ0 ≡ 0
    𝑘-th state: ℎ𝑘 = RNNenc(ℎ𝑘−1, 𝑦𝑘)
    final state: ℎ𝑈𝑦

Decoder
    initial state: 𝑡0 = ℎ𝑈𝑦
    𝑗-th decoder state: 𝑡𝑗 = RNNdec(𝑡𝑗−1, 𝑧𝑗)
    (𝑗+1)-th word score: 𝑢𝑗+1 = 𝑉𝑝𝑡𝑗+1 + 𝑊𝑝𝐹𝑧𝑗 + 𝑐𝑝  (or a multi-layer projection)
    output: 𝑧̂𝑗+1 = arg max 𝑢𝑗+1

SLIDE 51

Encoder-Decoder: Training Objective

For output word 𝑧𝑗 we have:

  • Estimated conditional distribution (softmax function):

    𝑞̂𝑗 = exp(𝑢𝑗) / ∑𝑘 exp(𝑢𝑘)

  • Unknown true distribution 𝑞𝑗; we set 𝑞𝑗 ≡ 1[𝑧𝑗]

Cross entropy ≈ distance of 𝑞̂ and 𝑞:

    ℒ = 𝐼(𝑞̂, 𝑞) = E𝑞 (− log 𝑞̂) = − log 𝑞̂(𝑧𝑗)

…computing ∂ℒ/∂𝑢𝑗 is super simple.
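A toy numpy illustration (not from the slides) of why that derivative is simple: for logits 𝑢 and the reference word index 𝑧, the gradient of − log softmax(𝑢)[𝑧] with respect to 𝑢 is just softmax(𝑢) minus the one-hot vector for 𝑧. The numbers below are arbitrary.

import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

u = np.array([1.0, 2.0, -0.5])   # word scores (logits)
z = 1                            # index of the reference word

q_hat = softmax(u)
loss = -np.log(q_hat[z])         # cross entropy against the one-hot "true" distribution

grad = q_hat.copy()
grad[z] -= 1.0                   # d loss / d u = softmax(u) - onehot(z)
print(loss, grad)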

SLIDE 56

Implementation: Runtime vs. training

runtime: 𝑧̂𝑘 (decoded)   ×   training: 𝑧𝑘 (ground truth)

(Figure: at runtime the decoder is fed its own predictions ~y1, …, ~y5; during training it is fed the reference words y1, …, y4 and its predictions enter the loss.)

SLIDE 57

Sutskever et al., “Sequence-to-Sequence Learning with Neural Networks”, 2014

  • Reverse the input sequence
  • Impressive empirical results – made researchers believe NMT is the way to go

Evaluation on the WMT14 EN → FR test set:

method                               BLEU score
vanilla SMT                          33.0
tuned SMT                            37.0
Sutskever et al.: reversed           30.6
–”–: ensemble + beam search          34.8
–”–: vanilla SMT rescoring           36.5
Bahdanau’s attention                 28.5

Why is Bahdanau’s (better) model worse here?

SLIDE 59

Sutskever et al. × Bahdanau et al.

                      Sutskever et al.          Bahdanau et al.
vocabulary            160k enc, 80k dec         30k both
encoder               4× LSTM, 1,000 units      bidi GRU, 2,000 units
decoder               4× LSTM, 1,000 units      GRU, 1,000 units
word embeddings       1,000 dimensions          620 dimensions
training time         7.5 epochs                5 epochs

With Bahdanau’s model size:

method                BLEU score
encoder-decoder       13.9
attention model       28.5

SLIDE 66

Attentive Sequence-to-Sequence Learning

SLIDE 67

Main Idea

  • Same as reversing the input: do not force the network to catch long-distance dependencies
  • Use the decoder state only for target-sentence dependencies and as a query into the source sentence
  • An RNN can serve as an LM — it can store the language context in its hidden states

SLIDE 68

Small Trick before We Start

Bidirectional network

(Figure: a bidirectional RNN over <s>, x1, …, x4 with hidden states h0, …, h4 in both directions.)

  • read the input sentence from both sides
  • every ℎ𝑗 in fact contains information from the whole sentence
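A short sketch of this trick, reusing the rnn() generator from slide 12 (assumed to be in scope); for brevity both directions share parameters here, whereas in practice each direction has its own.

import numpy as np

def bidirectional_states(initial_state, inputs):
    forward = list(rnn(initial_state, inputs))                # left-to-right outputs
    backward = list(rnn(initial_state, inputs[::-1]))[::-1]   # right-to-left, re-aligned
    # every position now carries information from the whole sentence
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]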

SLIDE 69

Attention Model

(Figure: encoder states h0, …, h4 over <s>, x1, …, x4 are weighted by α0, …, α4 and summed into a context vector, which is combined with the decoder states si−1, si, si+1 to produce ~yi, ~yi+1.)

SLIDE 70

Attention Model in Equations (1)

Inputs:
    decoder state 𝑡𝑗
    encoder states ℎ𝑘 = [→ℎ𝑘; ←ℎ𝑘] (concatenated forward and backward states), ∀𝑘 = 1 … 𝑈𝑦

Attention energies:
    𝑓𝑗𝑘 = 𝑤𝑏⊤ tanh (𝑋𝑏𝑡𝑗−1 + 𝑉𝑏ℎ𝑘 + 𝑐𝑏)

Attention distribution:
    𝛽𝑗𝑘 = exp(𝑓𝑗𝑘) / ∑𝑙=1…𝑈𝑦 exp(𝑓𝑗𝑙)

Context vector:
    𝑑𝑗 = ∑𝑘=1…𝑈𝑦 𝛽𝑗𝑘 ℎ𝑘
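A numpy sketch of these equations (shapes and parameter names are assumptions; the letters follow the slides): H holds one encoder state per row, t_prev is the previous decoder state, and the function returns the context vector together with the attention distribution.

import numpy as np

def attention(t_prev, H, w_b, X_b, V_b, c_b):
    # t_prev: (n,); H: (U_y, 2n); X_b: (a, n); V_b: (a, 2n); w_b, c_b: (a,)
    energies = np.tanh(H @ V_b.T + X_b @ t_prev + c_b) @ w_b   # f_jk for every source position k
    weights = np.exp(energies - energies.max())
    weights /= weights.sum()                                   # beta_jk, a distribution over source positions
    context = weights @ H                                      # d_j = sum_k beta_jk * h_k
    return context, weights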

SLIDE 74

Attention Model in Equations (2)

Output projection:
    𝑢𝑗 = MLP (𝑉𝑝𝑡𝑗−1 + 𝑊𝑝𝐹𝑧𝑗−1 + 𝐷𝑝𝑑𝑗 + 𝑐𝑝)

…the context vector is mixed with the hidden state.

Output distribution:
    𝑞 (𝑧𝑗 = 𝑥 | 𝑡𝑗, 𝑧𝑗−1, 𝑑𝑗) ∝ exp ((𝑋𝑝𝑢𝑗)𝑥 + 𝑐𝑥)

SLIDE 76

Attention Visualization

SLIDE 77

Image Captioning

Attention over CNN for image classification:

Source: Xu, Kelvin, et al. ”Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.” ICML. Vol. 14. 2015.

SLIDE 78

Reading Assignment

SLIDE 79

Reading for the Next Week

Vaswani, Ashish, et al. “Attention is all you need.” Advances in Neural Information Processing Systems. 2017. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

Question:

TBA
