
slide-1
SLIDE 1

Neural Architectures for NLP

Jindřich Helcl, Jindřich Libovický

February 26, 2020

NPFL116 Compendium of Neural Machine Translation

Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

slide-2
SLIDE 2

Outline

  • Symbol Embeddings
  • Recurrent Networks
  • Convolutional Networks
  • Self-attentive Networks
  • Reading Assignment


slide-3
SLIDE 3

Symbol Embeddings

slide-4
SLIDE 4

Discrete symbol vs. continuous representation

Simple task: predict the next word given the three previous ones:

Source: Bengio, Yoshua, et al. ”A neural probabilistic language model.” Journal of machine learning research 3.Feb (2003): 1137-1155. http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf


slide-5
SLIDE 5

Embeddings

  • Natural solution: one-hot vector (a vector of vocabulary length with exactly one 1)
  • It would mean multiplying by a huge matrix every time a symbol is on the input
  • Rather, factorize this matrix and share the first part ⇒ embeddings
  • “Embeddings” because they embed discrete symbols into a continuous space

What is the biggest problem during training? Embeddings get updated only rarely – only when the corresponding symbol appears in the input.
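A minimal numpy sketch of the factorization idea (vocabulary size, embedding dimension, and names are illustrative, not from the slides): multiplying a one-hot vector by the shared matrix is the same as selecting one of its rows, so the embedding layer is just a table lookup.

import numpy as np

vocab_size, emb_dim = 10000, 300

# The shared first factor of the huge input matrix: the embedding table.
E = np.random.randn(vocab_size, emb_dim) * 0.01

def embed(symbol_id):
    one_hot = np.zeros(vocab_size)
    one_hot[symbol_id] = 1.0
    via_matmul = one_hot @ E      # one-hot vector times the shared matrix ...
    via_lookup = E[symbol_id]     # ... equals selecting the corresponding row
    assert np.allclose(via_matmul, via_lookup)
    return via_lookup

This also shows why the updates are sparse: the gradient with respect to E is non-zero only in the rows of the symbols that actually appeared in the input.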


slide-8
SLIDE 8

Properties of embeddings

Source: https://blogs.mathworks.com/loren/2017/09/21/math-with-words-word-embeddings-with-matlab-and-text-analytics-toolbox/


slide-9
SLIDE 9

Recurrent Networks

slide-10
SLIDE 10

Why RNNs

  • for loops over sequential data
  • the most frequently used type of network in NLP


slide-11
SLIDE 11

General Formulation

  • inputs: x_1, …, x_T
  • initial state h_0: either a zero vector, the result of a previous computation, or a trainable parameter
  • recurrent computation: h_t = A(h_{t−1}, x_t)


slide-12
SLIDE 12

RNN as Imperative Code

def rnn(initial_state, inputs):
    prev_state = initial_state
    for x in inputs:
        new_state, output = rnn_cell(x, prev_state)
        prev_state = new_state
        yield output


slide-13
SLIDE 13

RNN as a Fancy Image


slide-14
SLIDE 14

Vanilla RNN

h_t = tanh(W [h_{t−1}; x_t] + b)

  • cannot propagate long-distance relations
  • vanishing gradient problem
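A small numpy sketch of the vanilla cell behind the rnn_cell call from the previous slide (sizes and initialization are illustrative assumptions):

import numpy as np

state_size, input_size = 128, 300
W = np.random.randn(state_size, state_size + input_size) * 0.01  # trained in practice
b = np.zeros(state_size)

def rnn_cell(x, prev_state):
    # h_t = tanh(W [h_{t-1}; x_t] + b)
    new_state = np.tanh(W @ np.concatenate([prev_state, x]) + b)
    # For the vanilla RNN, the output is the hidden state itself.
    return new_state, new_state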


slide-15
SLIDE 15

Vanishing Gradient Problem (1)

tanh x = (1 − e^(−2x)) / (1 + e^(−2x))

d tanh(x) / dx = 1 − tanh²(x) ∈ (0, 1]

(plots of tanh and its derivative omitted)

Weights are initialized ∼ N(0, 1) to keep the gradients further from zero.
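An illustrative numpy sketch (the sizes and the random stand-in activations are assumptions): backpropagation through time repeatedly multiplies by diag(tanh′(a_t)) · W_h, and with weights at this scale the resulting Jacobian norm shrinks towards zero.

import numpy as np

np.random.seed(0)
T, dim = 50, 64
W_h = np.random.randn(dim, dim) / np.sqrt(dim)   # recurrent weights

jac = np.eye(dim)   # accumulates d h_t / d h_0
for t in range(T):
    a = np.random.randn(dim)                     # stand-in for the activation a_t
    jac = np.diag(1.0 - np.tanh(a) ** 2) @ W_h @ jac
    if (t + 1) % 10 == 0:
        print(t + 1, np.linalg.norm(jac))        # the norm keeps shrinking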


slide-17
SLIDE 17

Vanishing Gradient Problem (2)

∂E_{t+1} / ∂b = (∂E_{t+1} / ∂h_{t+1}) · (∂h_{t+1} / ∂b)

(chain rule; E_{t+1} is the loss at step t + 1)


slide-20
SLIDE 20

Vanishing Gradient Problem (3)

∂h_t / ∂b = ∂ tanh(W_h h_{t−1} + W_x x_t + b) / ∂b        (the argument of tanh is the activation a_t)

= tanh′(a_t) · ( ∂(W_h h_{t−1}) / ∂b + ∂(W_x x_t) / ∂b + ∂b / ∂b )        (tanh′ is the derivative of tanh)

The second term in the parentheses is 0 and the third is 1, so

= tanh′(a_t) · W_h · ∂h_{t−1} / ∂b + tanh′(a_t)

with W_h ∼ N(0, 1) and tanh′(a_t) ∈ (0, 1]. Unrolling the recursion multiplies many such factors, so the contribution of distant steps shrinks towards zero – the gradient vanishes.


slide-24
SLIDE 24

LSTMs

LSTM = Long short-term memory. Control the gradient flow by explicitly gating:

  • what to use from input,
  • what to use from hidden state,
  • what to put on output


slide-27
SLIDE 27

Hidden State

  • two types of hidden states
  • h_t — “public” hidden state, used as the output
  • C_t — “private” memory (cell state), with no non-linearities on the way
  • direct flow of gradients (without multiplying by derivatives ≤ 1)
  • only vectors guaranteed to live in the same space are manipulated
  • information highway metaphor


slide-28
SLIDE 28

Forget Gate

f_t = σ(W_f [h_{t−1}; x_t] + b_f)

  • based on input and previous state, decide what to forget from the memory


slide-29
SLIDE 29

Input Gate

i_t = σ(W_i [h_{t−1}; x_t] + b_i)
C̃_t = tanh(W_C [h_{t−1}; x_t] + b_C)

  • C̃_t — a candidate of what we may want to add to the memory
  • i_t — decides how much of that information we want to store


slide-30
SLIDE 30

Cell State Update

C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t


slide-31
SLIDE 31

Output Gate

o_t = σ(W_o [h_{t−1}; x_t] + b_o)
h_t = o_t ⊙ tanh C_t


slide-32
SLIDE 32

Here we are!

f_t = σ(W_f [h_{t−1}; x_t] + b_f)
i_t = σ(W_i [h_{t−1}; x_t] + b_i)
o_t = σ(W_o [h_{t−1}; x_t] + b_o)
C̃_t = tanh(W_C [h_{t−1}; x_t] + b_C)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t
h_t = o_t ⊙ tanh C_t

How would you implement it efficiently? Compute all gates in a single matrix multiplication.
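A possible fused implementation in numpy, assuming the four weight matrices are stacked into one W of shape (4 · state, state + input) and the biases into one b (names and layout are assumptions, not the lecture's code):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, prev_h, prev_c, W, b):
    # One matrix multiplication produces all three gates and the candidate.
    z = W @ np.concatenate([prev_h, x]) + b
    f, i, o, c_candidate = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    c_candidate = np.tanh(c_candidate)

    new_c = f * prev_c + i * c_candidate   # C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
    new_h = o * np.tanh(new_c)             # h_t = o_t ⊙ tanh C_t
    return new_h, new_c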


slide-35
SLIDE 35

Gated Recurrent Units

z_t = σ(W_z [h_{t−1}; x_t] + b_z)
r_t = σ(W_r [h_{t−1}; x_t] + b_r)
h̃_t = tanh(W [r_t ⊙ h_{t−1}; x_t])
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t
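A matching numpy sketch of one GRU step, following the equations above (parameter names and shapes are illustrative):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, prev_h, W_z, b_z, W_r, b_r, W):
    concat = np.concatenate([prev_h, x])
    z = sigmoid(W_z @ concat + b_z)          # update gate
    r = sigmoid(W_r @ concat + b_r)          # reset gate
    h_candidate = np.tanh(W @ np.concatenate([r * prev_h, x]))
    return (1.0 - z) * prev_h + z * h_candidate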


slide-36
SLIDE 36

GRU and LSTM

Are GRUs a special case of LSTMs?

LSTM:
f_t = σ(W_f [h_{t−1}; x_t] + b_f)
i_t = σ(W_i [h_{t−1}; x_t] + b_i)
o_t = σ(W_o [h_{t−1}; x_t] + b_o)
C̃_t = tanh(W_C [h_{t−1}; x_t] + b_C)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t
h_t = o_t ⊙ tanh C_t

GRU:
z_t = σ(W_z [h_{t−1}; x_t] + b_z)
r_t = σ(W_r [h_{t−1}; x_t] + b_r)
h̃_t = tanh(W [r_t ⊙ h_{t−1}; x_t])
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

No, you cannot simply set C_t ≡ h_t because of the additional non-linearity in the LSTM.


slide-38
SLIDE 38

GRU or LSTM?

  • GRU preserves the information highway property
  • fewer parameters, should learn faster
  • LSTM is more general (although both are Turing complete)
  • empirical results: it is task-specific

Chung, Junyoung, et al. ”Empirical evaluation of gated recurrent neural networks on sequence modeling.” arXiv preprint arXiv:1412.3555 (2014). Irie, Kazuki, et al. ”LSTM, GRU, highway and a bit of attention: an empirical overview for language modeling in speech recognition.” Interspeech, San Francisco, CA, USA (2016).


slide-39
SLIDE 39

Recurrent Networks + / −

  • correspond to the intuition of sequential processing
  • theoretically strong
  • cannot be parallelized, always need to wait for the previous state


slide-40
SLIDE 40

Convolutional Networks

slide-41
SLIDE 41

1-D Convolution

≈ a sliding window over the sequence

embeddings x = (x_1, …, x_N), with padding x_0 = 0⃗, x_{N+1} = 0⃗

h_1 = f(W [x_0; x_1; x_2] + b)
h_i = f(W [x_{i−1}; x_i; x_{i+1}] + b)

Pad with 0s if we want to keep the sequence length.


slide-45
SLIDE 45

1-D Convolution: Code

Pseudocode

xs = ...  # input sequence, shape: time x dimension
kernel_size = 3  # window size
filters = 300    # number of output dimensions
stride = 1       # step size

W = trained_parameter(xs.shape[1] * kernel_size, filters)
b = trained_parameter(filters)

window = kernel_size // 2
outputs = []
for i in range(window, xs.shape[0] - window, stride):
    h = xs[i - window:i + window + 1].flatten() @ W + b
    outputs.append(h)
return np.array(outputs)

TensorFlow

h = tf.layers.conv1d(x, filters=300, kernel_size=3,
                     strides=1, padding='same')


slide-46
SLIDE 46

Residual Connections

embeddings x = (x_1, …, x_N), with padding x_0 = 0⃗, x_{N+1} = 0⃗

(figure: the ⊕ nodes add each convolution output to its input)

h_i = f(W [x_{i−1}; x_i; x_{i+1}] + b) + x_i

Allows training deeper networks. Why do you think it helps? Better gradient flow – the same as in RNNs.
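A minimal TensorFlow sketch of one residual block around the conv1d layer from the previous slide (the input shape is an assumption):

import tensorflow as tf

x = ...  # input sequence, shape: (batch, length, 300)

# The block output is added back to its input, so gradients can flow
# directly through the addition and the layer only learns a correction.
h = tf.layers.conv1d(x, filters=300, kernel_size=3, padding='same',
                     activation=tf.nn.relu)
output = h + x   # residual connection; input and output dimensions must match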


slide-50
SLIDE 50

Residual Connections: Numerical Stability

Numerically unstable; we need the activations to be on a similar scale ⇒ layer normalization. The activation before the non-linearity is normalized:

a′_i = (g_i / σ) · (a_i − μ)

where g is a trainable gain parameter and μ, σ are estimated from the data:

μ = (1/H) Σ_{i=1}^{H} a_i
σ = sqrt( (1/H) Σ_{i=1}^{H} (a_i − μ)² )
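A direct numpy transcription of the formula above; the gain argument and the small epsilon for numerical safety are the only additions:

import numpy as np

def layer_norm(a, gain, eps=1e-6):
    # a: pre-activation vector of one layer; gain: trainable per-unit parameter
    mu = a.mean()
    sigma = a.std()
    return gain / (sigma + eps) * (a - mu)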


slide-51
SLIDE 51

Receptive Field

embeddings x = (x_1, …, x_N), with padding x_0 = 0⃗, x_{N+1} = 0⃗

(figure: receptive field of stacked 1-D convolutions)

The receptive field can be enlarged by dilated convolutions.
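tf.layers.conv1d also accepts a dilation_rate argument; a short sketch of stacking dilated convolutions (the rates and sizes are illustrative assumptions):

import tensorflow as tf

x = ...  # input sequence, shape: (batch, length, dimension)

h = x
# Doubling the dilation rate at every layer enlarges the receptive field
# exponentially with depth while the kernel size stays at 3.
for rate in [1, 2, 4, 8]:
    h = tf.layers.conv1d(h, filters=300, kernel_size=3,
                         padding='same', dilation_rate=rate)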


slide-53
SLIDE 53

Convolutional architectures + / −

  • extremely computationally efficient
  • limited context
  • by default not aware of n-gram order


slide-54
SLIDE 54

Self-attentive Networks

slide-55
SLIDE 55

Main idea of self-attention

  • matrix multiplication can be used to get dot-product similarities between all pairs of sequence vectors
  • while staying in the same vector space, information can be gathered by a weighted sum

Both regardless of the distance in the sequence!


slide-56
SLIDE 56

Naive code

xs = ...  # input sequence, shape: time x dimension
dimension = xs.shape[1]
hidden_size = 400  # size of the additional projection

for x_1 in xs:
    similarities = np.array([np.sum(x_1 * x_2) for x_2 in xs])
    distribution = softmax(similarities)
    context = np.sum(xs * distribution[:, None], axis=0)

    hidden_layer_input = layer_norm(context + x_1)
    hidden_layer_middle = relu(
        dense_layer(hidden_layer_input, hidden_size))
    hidden_layer_output = dense_layer(hidden_layer_middle, dimension)
    yield layer_norm(
        hidden_layer_input + hidden_layer_output)
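The loop above can also be written with matrix multiplications, which is exactly the main idea from the previous slide; a small numpy sketch of the similarity-and-sum part (the toy input shape is an assumption, the feed-forward part is omitted):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

xs = np.random.randn(20, 300)                  # input sequence, time x dimension

similarities = xs @ xs.T                       # dot products between all pairs of positions
distribution = softmax(similarities, axis=1)   # one distribution per position
contexts = distribution @ xs                   # weighted sums, regardless of distance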


slide-57
SLIDE 57

Self-attentive architectures + / −

  • computationally efficient
  • unlimited context
  • empower state-of-the-art models
  • memory requirements grow quadratically with sequence length
  • not aware of positions in the sequence (requires positional embeddings)
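One common choice of positional embeddings is the sinusoidal encoding of Vaswani et al. (2017); a numpy sketch (not taken from the slides) that produces one vector per position, to be added to the input embeddings:

import numpy as np

def positional_embeddings(length, dim):
    positions = np.arange(length)[:, None]                   # (length, 1)
    rates = 1.0 / np.power(10000.0, 2 * (np.arange(dim) // 2) / dim)
    angles = positions * rates                               # (length, dim)
    pe = np.zeros((length, dim))
    pe[:, 0::2] = np.sin(angles[:, 0::2])   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])   # odd dimensions: cosine
    return pe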


slide-58
SLIDE 58

Reading Assignment

slide-59
SLIDE 59

Reading for the Next Week

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. ”Neural machine translation by jointly learning to align and translate.” arXiv preprint arXiv:1409.0473 (2014). https://arxiv.org/pdf/1409.0473.pdf

Questions:

  • The authors report a score 5 BLEU points worse than the previous encoder-decoder architecture (Sutskever et al., 2014). Why is their model better, then?
  • If someone asked you to automatically create a dictionary, would you use the attention mechanism for it? Why, or why not?
