SLIDE 1

Sequence-to-Sequence Learning using Recurrent Neural Networks

Jindřich Helcl, Jindřich Libovický

March 4, 2020

NPFL116 Compendium of Neural Machine Translation

Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

SLIDE 2

Outline

  • Symbol Embeddings
  • Recurrent Networks
  • Neural Network Language Models
  • Vanilla Sequence-to-Sequence Model
  • Attentive Sequence-to-Sequence Learning
  • Reading Assignment

SLIDE 3

Symbol Embeddings

SLIDE 4

Discrete symbol vs. continuous representation

Simple task: predict the next word given the three previous words:

Source: Bengio, Yoshua, et al. ”A neural probabilistic language model.” Journal of machine learning research 3.Feb (2003): 1137-1155. http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

SLIDE 5

Embeddings

  • Natural solution: one-hot vector (a vector of vocabulary length with exactly one 1)
  • It would mean multiplying by a huge matrix every time a symbol is on the input
  • Rather factorize this matrix and share the first part ⇒ embeddings
  • “Embeddings” because they embed discrete symbols into a continuous space

Think of training-related problems when using word embeddings... An embedding gets updated only rarely – only when its symbol appears in the data.
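A minimal numpy sketch (not from the slides) of the factorization: multiplying a one-hot vector by a weight matrix just selects one row, so the matrix can be stored once as a shared embedding table and indexed directly. The names and sizes below are illustrative assumptions.

import numpy as np

vocab_size, emb_dim = 10, 4                   # toy sizes
E = np.random.randn(vocab_size, emb_dim)      # shared embedding matrix (the factorized "first part")

word_id = 7
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# identical results: the one-hot product is just a row lookup
assert np.allclose(one_hot @ E, E[word_id])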

SLIDE 8

Properties of embeddings

Source: https://blogs.mathworks.com/loren/2017/09/21/math-with-words-word-embeddings-with-matlab-and-text-analytics-toolbox/

SLIDE 9

Recurrent Networks

SLIDE 10

Why RNNs

  • for loops over sequential data
  • the most frequently used type of network in NLP

SLIDE 11

General Formulation

  • inputs: 𝑦1, … , 𝑦𝑈
  • initial state ℎ0 is either:
      • the result of a previous computation, or
      • a trainable parameter
  • recurrent computation: ℎ𝑢 = 𝐵(ℎ𝑢−1, 𝑦𝑢)

SLIDE 12

RNN as Imperative Code

def rnn(initial_state, inputs):
    prev_state = initial_state
    for x in inputs:
        # rnn_cell is assumed to return the new state and this step's output
        new_state, output = rnn_cell(x, prev_state)
        prev_state = new_state
        yield output

SLIDE 13

RNN as a Fancy Image

SLIDE 14

Vanilla RNN

ℎ𝑢 = tanh (𝑋[ℎ𝑢−1; 𝑦𝑢] + 𝑐)

  • cannot propagate long-distance relations
  • vanishing gradient problem
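A minimal numpy sketch of such a cell, usable as the rnn_cell assumed in the code on slide 12; the concatenation of [state; input] under a single weight matrix and returning the state also as the output are illustrative assumptions.

import numpy as np

def make_vanilla_rnn_cell(X, c):
    # X has shape (n, n + m), c has shape (n,)
    def rnn_cell(y, h_prev):
        h = np.tanh(X @ np.concatenate([h_prev, y]) + c)
        return h, h          # (new_state, output)
    return rnn_cell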

SLIDE 15

Vanishing Gradient Problem (1)

(Figure: plots of tanh and of its derivative.)

tanh(𝑦) = (1 − e^(−2𝑦)) / (1 + e^(−2𝑦))

d tanh(𝑦) / d𝑦 = 1 − tanh²(𝑦) ∈ (0, 1]

Weights are initialized ∼ 𝒩(0, 1) to have gradients further from zero.

SLIDE 17

Vanishing Gradient Problem (2)

∂𝐹𝑢+1 / ∂𝑐 = (∂𝐹𝑢+1 / ∂ℎ𝑢+1) ⋅ (∂ℎ𝑢+1 / ∂𝑐)

(chain rule)

SLIDE 20

Vanishing Gradient Problem (3)

∂ℎ𝑢 / ∂𝑐 = ∂ tanh(𝑨𝑢) / ∂𝑐,   where 𝑨𝑢 = 𝑋ℎℎ𝑢−1 + 𝑋𝑦𝑦𝑢 + 𝑐 is the activation

(tanh′ is the derivative of tanh)

= tanh′(𝑨𝑢) ⋅ ( ∂(𝑋ℎℎ𝑢−1)/∂𝑐 + ∂(𝑋𝑦𝑦𝑢)/∂𝑐 + ∂𝑐/∂𝑐 ),   where ∂(𝑋𝑦𝑦𝑢)/∂𝑐 = 0 and ∂𝑐/∂𝑐 = 1

= 𝑋ℎ ⋅ tanh′(𝑨𝑢) ⋅ ∂ℎ𝑢−1/∂𝑐 + tanh′(𝑨𝑢),   with 𝑋ℎ ∼ 𝒩(0, 1) and tanh′(𝑨𝑢) ∈ (0, 1]
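A toy scalar sketch (not from the slides) of what this recursion does: the part of the gradient that comes from an early step gets multiplied by a factor 𝑋 ⋅ tanh′(𝑨𝑢) at every later step, and the running product of those factors quickly shrinks toward zero. All constants below are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal()          # scalar recurrent weight, drawn roughly as on the slide
h = 0.0
survives = 1.0            # running product of X * tanh'(A_u): how much of an early gradient survives
for u in range(1, 31):
    A = X * h + rng.normal() + 0.1       # activation with a random input and a small bias
    h = np.tanh(A)
    survives *= X * (1.0 - h ** 2)       # each step multiplies by a factor that is usually < 1
    if u % 10 == 0:
        print(u, survives)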

SLIDE 24

LSTMs

LSTM = Long short-term memory

Control the gradient flow by explicitly gating:

  • what to use from input,
  • what to use from hidden state,
  • what to put on output

SLIDE 27

Hidden State

  • two types of hidden states
  • ℎ𝑢 — “public” hidden state, used as the output
  • 𝐷𝑢 — “private” memory, no non-linearities on the way
  • direct flow of gradients (without multiplying by derivatives ≤ 1)
  • only vectors guaranteed to live in the same space are manipulated
  • information highway metaphor

SLIDE 28

Forget Gate

𝑔𝑢 = 𝜏 (𝑋𝑔[ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑔)

  • based on input and previous state, decide what to forget from the memory

SLIDE 29

Input Gate

𝑗𝑢 = 𝜏 (𝑋𝑗 ⋅ [ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑗)
𝐷̃𝑢 = tanh (𝑋𝑑 ⋅ [ℎ𝑢−1; 𝑦𝑢] + 𝑐𝐷)

  • 𝐷̃𝑢 — candidate for what we may want to add to the memory
  • 𝑗𝑢 — decides how much of that information we want to store

SLIDE 30

Cell State Update

𝐷𝑢 = 𝑔𝑢 ⊙ 𝐷𝑢−1 + 𝑗𝑢 ⊙ 𝐷̃𝑢

SLIDE 31

Output Gate

𝑝𝑢 = 𝜏 (𝑋𝑝 ⋅ [ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑝)
ℎ𝑢 = 𝑝𝑢 ⊙ tanh 𝐷𝑢

SLIDE 32

Here we are!

𝑔𝑢 = 𝜏 (𝑋𝑔 ⋅ [ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑔)
𝑗𝑢 = 𝜏 (𝑋𝑗 ⋅ [ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑗)
𝑝𝑢 = 𝜏 (𝑋𝑝 ⋅ [ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑝)
𝐷̃𝑢 = tanh (𝑋𝑑 ⋅ [ℎ𝑢−1; 𝑦𝑢] + 𝑐𝐷)
𝐷𝑢 = 𝑔𝑢 ⊙ 𝐷𝑢−1 + 𝑗𝑢 ⊙ 𝐷̃𝑢
ℎ𝑢 = 𝑝𝑢 ⊙ tanh 𝐷𝑢

How would you implement it efficiently? Compute all gates in a single matrix multiplication.
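A minimal numpy sketch of that trick (not the authors' code): the four gate pre-activations come from one fused matrix multiplication and are then split. The shapes, the gate ordering, and the use of standard letters (f, i, o, c for the slides' 𝑔, 𝑗, 𝑝, 𝐷) are assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, y, W, b):
    # h_prev, c_prev: (n,); y: (m,); W: (4n, n + m); b: (4n,)
    z = W @ np.concatenate([h_prev, y]) + b          # one matmul for all four gates
    n = h_prev.shape[0]
    f = sigmoid(z[:n])                               # forget gate
    i = sigmoid(z[n:2 * n])                          # input gate
    o = sigmoid(z[2 * n:3 * n])                      # output gate
    c_tilde = np.tanh(z[3 * n:])                     # candidate memory content
    c = f * c_prev + i * c_tilde                     # cell state update
    h = o * np.tanh(c)                               # public hidden state
    return h, c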

SLIDE 35

Gated Recurrent Units

𝑨𝑢 = 𝜏 (𝑋𝑨[ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑨)
𝑠𝑢 = 𝜏 (𝑋𝑠[ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑠)
ℎ̃𝑢 = tanh (𝑋[𝑠𝑢 ⊙ ℎ𝑢−1; 𝑦𝑢])
ℎ𝑢 = (1 − 𝑨𝑢) ⊙ ℎ𝑢−1 + 𝑨𝑢 ⊙ ℎ̃𝑢
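For comparison, a numpy sketch of one GRU step under the same assumptions as the LSTM sketch above; the standard names z (update) and r (reset) stand in for the slides' 𝑨 and 𝑠.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, y, W_z, W_r, W_h, b_z, b_r):
    hy = np.concatenate([h_prev, y])
    z = sigmoid(W_z @ hy + b_z)                                 # update gate
    r = sigmoid(W_r @ hy + b_r)                                 # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, y]))    # candidate state (no bias, as on the slide)
    return (1.0 - z) * h_prev + z * h_tilde                     # interpolation of old and candidate state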

SLIDE 36

GRU and LSTM

LSTM
𝑔𝑢 = 𝜏 (𝑋𝑔[ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑔)
𝑗𝑢 = 𝜏 (𝑋𝑗 ⋅ [ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑗)
𝑝𝑢 = 𝜏 (𝑋𝑝 ⋅ [ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑝)
𝐷̃𝑢 = tanh (𝑋𝑑 ⋅ [ℎ𝑢−1; 𝑦𝑢] + 𝑐𝐷)
𝐷𝑢 = 𝑔𝑢 ⊙ 𝐷𝑢−1 + 𝑗𝑢 ⊙ 𝐷̃𝑢
ℎ𝑢 = 𝑝𝑢 ⊙ tanh 𝐷𝑢

GRU
𝑨𝑢 = 𝜏 (𝑋𝑨[ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑨)
𝑠𝑢 = 𝜏 (𝑋𝑠[ℎ𝑢−1; 𝑦𝑢] + 𝑐𝑠)
ℎ̃𝑢 = tanh (𝑋[𝑠𝑢 ⊙ ℎ𝑢−1; 𝑦𝑢])
ℎ𝑢 = (1 − 𝑨𝑢) ⊙ ℎ𝑢−1 + 𝑨𝑢 ⊙ ℎ̃𝑢

SLIDE 37

GRU or LSTM?

  • GRU preserves the information highway property
  • GRU has fewer parameters, so it should learn faster
  • LSTM is more general (although both are Turing complete)
  • empirical results: it’s task-specific

Chung, Junyoung, et al. ”Empirical evaluation of gated recurrent neural networks on sequence modeling.” arXiv preprint arXiv:1412.3555 (2014). Irie, Kazuki, et al. ”LSTM, GRU, highway and a bit of attention: an empirical overview for language modeling in speech recognition.” Interspeech, San Francisco, CA, USA (2016).

SLIDE 38

Recurrent Networks +

  • correspond to the intuition of sequential processing
  • theoretically strong
  • cannot be parallelized, we always need to wait for the previous state

SLIDE 39

Neural Network Language Models

SLIDE 40

RNN Language Model

  • Train an RNN as a classifier for the next word (unlimited history)

(Figure: RNN unrolled over the input <s>, w1, …, w4, predicting p(w1), …, p(w5).)

  • Can be used to estimate sentence probability / perplexity → defines a distribution over sentences
  • We can sample from the distribution

(Figure: a sequence ~w1, …, ~w5 sampled from the model.)
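A sketch of such sampling, reusing the assumed helpers from slide 12 (rnn_cell, an embedding table, an output projection); the word ids, the special-token ids, and the length limit are illustrative assumptions.

import numpy as np

def sample_sentence(rnn_cell, initial_state, embeddings, output_projection,
                    bos_id, eos_id, max_len=50):
    rng = np.random.default_rng()
    state, w = initial_state, bos_id
    sentence = []
    for _ in range(max_len):
        state, output = rnn_cell(embeddings[w], state)
        logits = output_projection(output)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                        # softmax over the vocabulary
        w = int(rng.choice(len(probs), p=probs))    # draw the next word instead of taking argmax
        if w == eos_id:
            break
        sentence.append(w)
    return sentence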

SLIDE 43

Two views on RNN LM

  • RNN is a for loop (functional map) over sequential data
  • All outputs are conditional distributions → a probability distribution over sequences of words:

𝑄 (𝑥1, … , 𝑥𝑜) = ∏𝑗=1…𝑜 𝑄 (𝑥𝑗 | 𝑥𝑗−1, … , 𝑥1)

SLIDE 44

Vanilla Sequence-to-Sequence Model

SLIDE 45

Encoder-Decoder NMT

  • Exploits the conditional LM scheme
  • Two networks:
      1. A network processing the input sentence into a single vector representation (encoder)
      2. A neural language model initialized with the output of the encoder (decoder)

Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. “Sequence to sequence learning with neural networks.” Advances in neural information processing systems. 2014.

SLIDE 46

Encoder-Decoder – Image

(Figure: the encoder reads <s>, x1, …, x4; the decoder then generates ~y1, …, ~y5 starting from <s>.)

Source language input + target language LM

SLIDE 47

Encoder-Decoder Model – Code

state = np.zeros(EMB_SIZE)

# encoder: read the whole input sentence into a single vector
for w in input_words:
    input_embedding = source_embeddings[w]
    state, _ = enc_cell(state, input_embedding)

# decoder: a language model initialized with the encoder's final state
prev_w = "<s>"
while prev_w != "</s>":
    prev_w_embedding = target_embeddings[prev_w]
    state, dec_output = dec_cell(state, prev_w_embedding)
    logits = output_projection(dec_output)
    prev_w = np.argmax(logits)   # (mapping the index back to a word is omitted here)
    yield prev_w

SLIDE 48

Encoder-Decoder Model – Formal Notation

Data
    input embeddings (source language): x = (𝑦1, … , 𝑦𝑈𝑦)
    output embeddings (target language): y = (𝑧1, … , 𝑧𝑈𝑧)

Encoder
    initial state: ℎ0 ≡ 0
    𝑘-th state: ℎ𝑘 = RNNenc(ℎ𝑘−1, 𝑦𝑘)
    final state: ℎ𝑈𝑦

Decoder
    initial state: 𝑡0 = ℎ𝑈𝑦
    𝑗-th decoder state: 𝑡𝑗 = RNNdec(𝑡𝑗−1, 𝑧𝑗)
    (𝑗+1)-th word score: 𝑢𝑗+1 = 𝑉𝑝𝑡𝑗+1 + 𝑊𝑝𝐹𝑧𝑗 + 𝑐𝑝  (or a multi-layer projection)
    output: 𝑧̂𝑗+1 = arg max 𝑢𝑗+1

SLIDE 51

Encoder-Decoder: Training Objective

For output word 𝑧𝑗 we have:

  • Estimated conditional distribution (softmax function):

    𝑞̂𝑗 = exp(𝑢𝑗) / ∑𝑘 exp(𝑢𝑘)

  • Unknown true distribution 𝑞𝑗; we set 𝑞𝑗 ≡ 1[𝑧𝑗]

Cross entropy ≈ distance of 𝑞̂ and 𝑞:

    ℒ = 𝐼(𝑞̂, 𝑞) = E𝑞 (− log 𝑞̂) = − log 𝑞̂(𝑧𝑗)

…computing ∂ℒ/∂𝑢𝑗 is super simple.
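A toy numpy illustration (not from the slides) of why that derivative is simple: for logits 𝑢 and the reference word index 𝑧, the gradient of − log softmax(𝑢)[𝑧] with respect to 𝑢 is just softmax(𝑢) minus the one-hot vector for 𝑧. The numbers below are arbitrary.

import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

u = np.array([1.0, 2.0, -0.5])   # word scores (logits)
z = 1                            # index of the reference word

q_hat = softmax(u)
loss = -np.log(q_hat[z])         # cross entropy against the one-hot "true" distribution

grad = q_hat.copy()
grad[z] -= 1.0                   # d loss / d u = softmax(u) - onehot(z)
print(loss, grad)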

SLIDE 56

Implementation: Runtime vs. training

runtime: 𝑧̂𝑘 (decoded)   ×   training: 𝑧𝑘 (ground truth)

(Figure: at runtime the decoder is fed its own predictions ~y1, …, ~y5; during training it is fed the reference words y1, …, y4 and its predictions enter the loss.)

SLIDE 57

Sutskever et al., “Sequence-to-Sequence Learning with Neural Networks”, 2014

  • Reverse the input sequence
  • Impressive empirical results – made researchers believe NMT is the way to go

Evaluation on the WMT14 EN → FR test set:

method                               BLEU score
vanilla SMT                          33.0
tuned SMT                            37.0
Sutskever et al.: reversed           30.6
–”–: ensemble + beam search          34.8
–”–: vanilla SMT rescoring           36.5
Bahdanau’s attention                 28.5

Why is Bahdanau’s (better) model worse here?

SLIDE 59

Sutskever et al. × Bahdanau et al.

                      Sutskever et al.          Bahdanau et al.
vocabulary            160k enc, 80k dec         30k both
encoder               4× LSTM, 1,000 units      bidi GRU, 2,000 units
decoder               4× LSTM, 1,000 units      GRU, 1,000 units
word embeddings       1,000 dimensions          620 dimensions
training time         7.5 epochs                5 epochs

With Bahdanau’s model size:

method                BLEU score
encoder-decoder       13.9
attention model       28.5

SLIDE 66

Attentive Sequence-to-Sequence Learning

SLIDE 67

Main Idea

  • Same as reversing the input: do not force the network to catch long-distance dependencies
  • Use the decoder state only for target-sentence dependencies and as a query into the source sentence
  • An RNN can serve as an LM — it can store the language context in its hidden states

SLIDE 68

Small Trick before We Start

Bidirectional network

(Figure: a bidirectional RNN over <s>, x1, …, x4 with hidden states h0, …, h4 in both directions.)

  • read the input sentence from both sides
  • every ℎ𝑗 in fact contains information from the whole sentence
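A short sketch of this trick, reusing the rnn() generator from slide 12 (assumed to be in scope); for brevity both directions share parameters here, whereas in practice each direction has its own.

import numpy as np

def bidirectional_states(initial_state, inputs):
    forward = list(rnn(initial_state, inputs))                # left-to-right outputs
    backward = list(rnn(initial_state, inputs[::-1]))[::-1]   # right-to-left, re-aligned
    # every position now carries information from the whole sentence
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]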

SLIDE 69

Attention Model

(Figure: encoder states h0, …, h4 over <s>, x1, …, x4 are weighted by α0, …, α4 and summed into a context vector, which is combined with the decoder states si−1, si, si+1 to produce ~yi, ~yi+1.)

SLIDE 70

Attention Model in Equations (1)

Inputs:
    decoder state 𝑡𝑗
    encoder states ℎ𝑘 = [→ℎ𝑘; ←ℎ𝑘] (concatenated forward and backward states), ∀𝑘 = 1 … 𝑈𝑦

Attention energies:
    𝑓𝑗𝑘 = 𝑤𝑏⊤ tanh (𝑋𝑏𝑡𝑗−1 + 𝑉𝑏ℎ𝑘 + 𝑐𝑏)

Attention distribution:
    𝛽𝑗𝑘 = exp(𝑓𝑗𝑘) / ∑𝑙=1…𝑈𝑦 exp(𝑓𝑗𝑙)

Context vector:
    𝑑𝑗 = ∑𝑘=1…𝑈𝑦 𝛽𝑗𝑘 ℎ𝑘
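A numpy sketch of these equations (shapes and parameter names are assumptions; the letters follow the slides): H holds one encoder state per row, t_prev is the previous decoder state, and the function returns the context vector together with the attention distribution.

import numpy as np

def attention(t_prev, H, w_b, X_b, V_b, c_b):
    # t_prev: (n,); H: (U_y, 2n); X_b: (a, n); V_b: (a, 2n); w_b, c_b: (a,)
    energies = np.tanh(H @ V_b.T + X_b @ t_prev + c_b) @ w_b   # f_jk for every source position k
    weights = np.exp(energies - energies.max())
    weights /= weights.sum()                                   # beta_jk, a distribution over source positions
    context = weights @ H                                      # d_j = sum_k beta_jk * h_k
    return context, weights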

SLIDE 74

Attention Model in Equations (2)

Output projection:
    𝑢𝑗 = MLP (𝑉𝑝𝑡𝑗−1 + 𝑊𝑝𝐹𝑧𝑗−1 + 𝐷𝑝𝑑𝑗 + 𝑐𝑝)

…the context vector is mixed with the hidden state.

Output distribution:
    𝑞 (𝑧𝑗 = 𝑥 | 𝑡𝑗, 𝑧𝑗−1, 𝑑𝑗) ∝ exp ((𝑋𝑝𝑢𝑗)𝑥 + 𝑐𝑥)

SLIDE 76

Attention Visualization

SLIDE 77

Image Captioning

Attention over CNN for image classification:

Source: Xu, Kelvin, et al. ”Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.” ICML. Vol. 14. 2015.

SLIDE 78

Reading Assignment

SLIDE 79

Reading for the Next Week

Vaswani, Ashish, et al. “Attention is all you need.” Advances in Neural Information Processing Systems. 2017. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

Question:

TBA
