SLIDE 1

DEEP LEARNING FOR NATURAL LANGUAGE PROCESSING

Lecture 2: Recurrent Neural Networks (RNNs) Caio Corro

SLIDE 2

LECTURE 1 RECALL

Language modeling with a multi-layer perceptron

2nd order Markov chain:

$p(y_1, \ldots, y_n) = p(y_1)\, p(y_2 \mid y_1) \prod_{i=3}^{n} p(y_i \mid y_{i-1}, y_{i-2})$

$x = [\text{embedding of } y_{i-1};\ \text{embedding of } y_{i-2}]$ (concatenate the embeddings of the two previous words)

$z = \sigma(U^{(1)} x + b^{(1)})$ (hidden representation)

$w = U^{(2)} z + b^{(2)}$ (output projection)

$p(y_i \mid y_{i-1}, y_{i-2}) = \frac{\exp(w_{y_i})}{\sum_{y'} \exp(w_{y'})}$ (probability distribution)
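To make the recall concrete, here is a minimal PyTorch sketch of this trigram MLP language model. It is illustrative only, not the lecture's code; the vocabulary, embedding and hidden sizes are assumptions.

```python
# Minimal sketch (illustrative assumptions, not the lecture's code): trigram MLP language model.
import torch
import torch.nn as nn

class TrigramMLPLanguageModel(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(2 * emb_dim, hidden_dim)   # U^(1), b^(1)
        self.output = nn.Linear(hidden_dim, vocab_size)    # U^(2), b^(2)

    def forward(self, prev1, prev2):
        # x: concatenation of the embeddings of the two previous words y_{i-1}, y_{i-2}
        x = torch.cat([self.embedding(prev1), self.embedding(prev2)], dim=-1)
        z = torch.sigmoid(self.hidden(x))                  # hidden representation
        w = self.output(z)                                 # output projection (one weight per word)
        return torch.log_softmax(w, dim=-1)                # log p(y_i | y_{i-1}, y_{i-2})
```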

SLIDE 3

LECTURE 1 RECALL

Language modeling with a multi-layer perceptron

2nd order Markov chain:

$p(y_1, \ldots, y_n) = p(y_1)\, p(y_2 \mid y_1) \prod_{i=3}^{n} p(y_i \mid y_{i-1}, y_{i-2})$

$x = [\text{embedding of } y_{i-1};\ \text{embedding of } y_{i-2}]$, $z = \sigma(U^{(1)} x + b^{(1)})$, $w = U^{(2)} z + b^{(2)}$, $p(y_i \mid y_{i-1}, y_{i-2}) = \frac{\exp(w_{y_i})}{\sum_{y'} \exp(w_{y'})}$

Sentence classification with a Convolutional Neural Network

  • 1. Convolution: sliding window of fixed size over the input sentence
  • 2. Mean/max pooling over convolution outputs
  • 3. Multi-layer perceptron
SLIDE 4

LECTURE 1 RECALL

Language modeling with a multi-layer perceptron

2nd order Markov chain:

$p(y_1, \ldots, y_n) = p(y_1)\, p(y_2 \mid y_1) \prod_{i=3}^{n} p(y_i \mid y_{i-1}, y_{i-2})$

$x = [\text{embedding of } y_{i-1};\ \text{embedding of } y_{i-2}]$, $z = \sigma(U^{(1)} x + b^{(1)})$, $w = U^{(2)} z + b^{(2)}$, $p(y_i \mid y_{i-1}, y_{i-2}) = \frac{\exp(w_{y_i})}{\sum_{y'} \exp(w_{y'})}$

Sentence classification with a Convolutional Neural Network

  • 1. Convolution: sliding window of fixed size over the input sentence
  • 2. Mean/max pooling over convolution outputs
  • 3. Multi-layer perceptron (a minimal sketch follows below)
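A minimal PyTorch sketch of this convolutional sentence classifier; the layer sizes, window size and class count are illustrative assumptions, not the lecture's choices.

```python
# Minimal sketch (illustrative assumptions): CNN sentence classifier.
# 1. convolution with a fixed-size sliding window, 2. max pooling, 3. MLP classifier.
import torch
import torch.nn as nn

class CNNSentenceClassifier(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=64, n_filters=100, window=3, n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=window)  # sliding window over words
        self.mlp = nn.Sequential(nn.Linear(n_filters, 64), nn.ReLU(), nn.Linear(64, n_classes))

    def forward(self, tokens):                       # tokens: (batch, sentence_length)
        x = self.embedding(tokens).transpose(1, 2)   # (batch, emb_dim, length)
        h = torch.relu(self.conv(x))                 # (batch, n_filters, length - window + 1)
        pooled = h.max(dim=2).values                 # max pooling over positions
        return self.mlp(pooled)                      # class weights
```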

Main issue

➤ These 2 networks only use local word-order information ➤ No long range dependencies

SLIDE 5

LONG RANGE DEPENDENCIES

Recurrent neural networks

➤ Inputs are fed sequentially ➤ State representation updated at each input

The dog is eating

Today

SLIDE 6

LONG RANGE DEPENDENCIES

Recurrent neural networks

➤ Inputs are fed sequentially ➤ State representation updated at each input

Attention network

➤ Inputs contain position information ➤ At each position look at any input in the sentence

Next week!

The dog is eating   (with position information: The.1 dog.2 is.3 eating.4)

Today

SLIDE 7

RECURRENT NEURAL NETWORK

[Figure: a recurrent neural network cell with input x(n), output h(n), incoming recurrent connection r(n−1) and outgoing recurrent connection r(n)]

SLIDE 8

RECURRENT NEURAL NETWORK

The dog is eating

[Figure: the recurrent cell unrolled over the sentence, producing h(1), h(2), h(3), h(4): a dynamic neural network]

All cells share the same parameters

SLIDE 9

LANGUAGE MODEL

Why do we usually make independence assumptions?

➤ Fewer parameters to learn ➤ Less sparsity

➤ 1st order Markov chain: $p(y_1, \ldots, y_n) = p(y_1) \prod_{i=2}^{n} p(y_i \mid y_{i-1})$, with $|V| \times |V|$ parameters

➤ 2nd order Markov chain: $p(y_1, \ldots, y_n) = p(y_1)\, p(y_2 \mid y_1) \prod_{i=3}^{n} p(y_i \mid y_{i-1}, y_{i-2})$, with $|V| \times |V| \times |V|$ parameters

Multi-layer perceptron language model (compared to a non-neural language model):

➤ No sparsity issue thanks to word embeddings ➤ Independence assumption, so no long range dependencies

SLIDE 10

LANGUAGE MODEL WITH RECURRENT NEURAL NETWORKS

$p(y_1, \ldots, y_n) = p(y_1, \ldots, y_{n-1})\, p(y_n \mid y_1, \ldots, y_{n-1})$

No independence assumption!

SLIDE 11

LANGUAGE MODEL WITH RECURRENT NEURAL NETWORKS

$p(y_1, \ldots, y_n) = p(y_1, \ldots, y_{n-1})\, p(y_n \mid y_1, \ldots, y_{n-1})$

No independence assumption!

RNN input → predicted distribution: <BOS> → p(y1)

SLIDE 12

LANGUAGE MODEL WITH RECURRENT NEURAL NETWORKS

$p(y_1, \ldots, y_n) = p(y_1, \ldots, y_{n-1})\, p(y_n \mid y_1, \ldots, y_{n-1})$

No independence assumption!

RNN inputs → predicted distributions:
<BOS> → p(y1)
The → p(y2 | y1)

SLIDE 13

LANGUAGE MODEL WITH RECURRENT NEURAL NETWORKS

$p(y_1, \ldots, y_n) = p(y_1, \ldots, y_{n-1})\, p(y_n \mid y_1, \ldots, y_{n-1})$

No independence assumption!

RNN inputs → predicted distributions:
<BOS> → p(y1)
The → p(y2 | y1)
dog → p(y3 | y1, y2)

SLIDE 14

LANGUAGE MODEL WITH RECURRENT NEURAL NETWORKS

$p(y_1, \ldots, y_n) = p(y_1, \ldots, y_{n-1})\, p(y_n \mid y_1, \ldots, y_{n-1})$

No independence assumption!

RNN inputs → predicted distributions:
<BOS> → p(y1)
The → p(y2 | y1)
dog → p(y3 | y1, y2)
is → p(y4 | y1, y2, y3)
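A minimal sketch of how an RNN language model uses this factorization at generation time: the recurrent state summarizes all previous words, so no independence assumption is needed. Module names, sizes and the special token ids are assumptions, not the lecture's code.

```python
# Minimal sketch (illustrative assumptions): autoregressive generation with an RNN language model.
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 10000, 64, 128
BOS, EOS = 1, 2                                      # hypothetical special token ids
embedding = nn.Embedding(vocab_size, emb_dim)
rnn_cell = nn.RNNCell(emb_dim, hidden_dim)
output = nn.Linear(hidden_dim, vocab_size)

def generate(max_len=20):
    h = torch.zeros(1, hidden_dim)                   # initial state
    y = torch.tensor([BOS])                          # start with <BOS>
    words = []
    for _ in range(max_len):
        h = rnn_cell(embedding(y), h)                # state now summarizes y_1 ... y_{i-1}
        probs = torch.softmax(output(h), dim=-1)     # p(y_i | y_1, ..., y_{i-1})
        y = torch.multinomial(probs, 1).squeeze(1)   # sample the next word
        if y.item() == EOS:
            break
        words.append(y.item())
    return words
```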

SLIDE 15

SENTENCE CLASSIFICATION

Neural architecture

  • 1. A recurrent neural network (RNN) computes a context-sensitive representation of the sentence
  • 2. A multi-layer perceptron takes this representation as input and outputs class weights
SLIDE 16

SENTENCE CLASSIFICATION

Neural architecture

  • 1. A recurrent neural network (RNN) computes a context-sensitive representation of the sentence
  • 2. A multi-layer perceptron takes this representation as input and outputs class weights

The dog is eating

z(1): context-sensitive representation of the sentence

SLIDE 17

SENTENCE CLASSIFICATION

Neural architecture

  • 1. A recurrent neural network (RNN) computes a context-sensitive representation of the sentence
  • 2. A multi-layer perceptron takes this representation as input and outputs class weights

The dog is eating

z(1): context-sensitive representation of the sentence

$z^{(2)} = \sigma(U^{(1)} z^{(1)} + b^{(1)})$ (MLP hidden layer)

$w = U^{(2)} z^{(2)} + b^{(2)}$ (output weights)
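A minimal PyTorch sketch of this two-step architecture; the sizes, class count and the choice of the last RNN state as z(1) are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions): RNN + MLP sentence classifier.
import torch
import torch.nn as nn

class RNNSentenceClassifier(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=64, hidden_dim=128, n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)
        self.hidden = nn.Linear(hidden_dim, 64)    # U^(1), b^(1)
        self.output = nn.Linear(64, n_classes)     # U^(2), b^(2)

    def forward(self, tokens):                     # tokens: (batch, length)
        states, _ = self.rnn(self.embedding(tokens))
        z1 = states[:, -1, :]                      # context-sensitive sentence representation z^(1)
        z2 = torch.sigmoid(self.hidden(z1))        # MLP hidden layer z^(2)
        return self.output(z2)                     # output class weights w
```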

SLIDE 18

MACHINE TRANSLATION

Neural architecture: Encoder-Decoder

  • 1. Encoder: a recurrent neural network (RNN) computes a context-sensitive representation of the sentence
  • 2. Decoder: a different recurrent neural network (RNN) computes the translation, word after word (a conditional language model)

SLIDE 19

MACHINE TRANSLATION

The dog is running → z (sentence representation)

Neural architecture: Encoder-Decoder

  • 1. Encoder: a recurrent neural network (RNN) computes a context-sensitive representation of the sentence
  • 2. Decoder: a different recurrent neural network (RNN) computes the translation, word after word (a conditional language model)


SLIDE 20

MACHINE TRANSLATION

The dog is running

z

<BOS>

Neural architecture: Encoder-Decoder

  • 1. Encoder: a recurrent neural network (RNN) computes a context-sensitive representation of the sentence
  • 2. Decoder: a different recurrent neural network (RNN) computes the translation, word after word (a conditional language model)

le

Beginning-of-sentence token

SLIDE 21

MACHINE TRANSLATION

The dog is running

z

<BOS> le

Neural architecture: Encoder-Decoder

  • 1. Encoder: a recurrent neural network (RNN) computes a context-sensitive representation of the sentence
  • 2. Decoder: a different recurrent neural network (RNN) computes the translation, word after word (a conditional language model)

le chien

Beginning-of-sentence token

SLIDE 22

MACHINE TRANSLATION

The dog is running

z

<BOS> le chien

Neural architecture: Encoder-Decoder

  • 1. Encoder: a recurrent neural network (RNN) computes a context-sensitive representation of the sentence
  • 2. Decoder: a different recurrent neural network (RNN) computes the translation, word after word (a conditional language model)

le chien court

Beginning-of-sentence token

SLIDE 23

MACHINE TRANSLATION

The dog is running

z

<BOS> le chien court

Neural architecture: Encoder-Decoder

  • 1. Encoder: a recurrent neural network (RNN) computes a context-sensitive representation of the sentence
  • 2. Decoder: a different recurrent neural network (RNN) computes the translation, word after word (a conditional language model)

le chien court <EOS>

Beginning-of-sentence token; stop translation when the end-of-sentence token <EOS> is generated
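A minimal sketch of greedy decoding with such an encoder-decoder. The module names, sizes and special token ids are illustrative assumptions; a real system would use trained parameters and typically beam search.

```python
# Minimal sketch (illustrative assumptions): encoder-decoder RNN with greedy decoding.
import torch
import torch.nn as nn

emb_dim, hidden_dim, src_vocab, tgt_vocab = 64, 128, 10000, 10000
BOS, EOS = 1, 2                                          # hypothetical special token ids
src_embedding = nn.Embedding(src_vocab, emb_dim)
encoder = nn.RNN(emb_dim, hidden_dim, batch_first=True)
tgt_embedding = nn.Embedding(tgt_vocab, emb_dim)
decoder_cell = nn.RNNCell(emb_dim, hidden_dim)
output = nn.Linear(hidden_dim, tgt_vocab)

def translate(src_tokens, max_len=50):                   # src_tokens: (1, source_length) word ids
    # 1. Encoder: compute the sentence representation z (here, the last hidden state)
    _, z = encoder(src_embedding(src_tokens))
    h = z.squeeze(0)                                      # (1, hidden_dim)
    # 2. Decoder: a conditional language model, one word after the other
    y = torch.tensor([BOS])
    translation = []
    for _ in range(max_len):
        h = decoder_cell(tgt_embedding(y), h)
        y = output(h).argmax(dim=-1)                      # greedy choice of the next word
        if y.item() == EOS:                               # stop when <EOS> is generated
            break
        translation.append(y.item())
    return translation
```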

SLIDE 24

SIMPLE RECURRENT NEURAL NETWORK

SLIDE 25

MULTI-LAYER PERCEPTRON RECURRENT NETWORK

The dog is eating

$h^{(n)} = \tanh(U\, [x^{(n)}; h^{(n-1)}] + b)$

Multi-layer perceptron cell

➤ Input: the current word and the previous output ➤ Output: the hidden representation

The recurrent connection is just the output at each position
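A minimal sketch of this cell unrolled over a sentence; input and hidden sizes are illustrative assumptions. Note that the same U and b are applied at every position.

```python
# Minimal sketch (illustrative assumptions): simple RNN cell h(n) = tanh(U [x(n); h(n-1)] + b).
import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, input_dim=64, hidden_dim=128):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.linear = nn.Linear(input_dim + hidden_dim, hidden_dim)  # U and b, shared by every cell

    def forward(self, xs):                                  # xs: (length, input_dim), one embedding per word
        h = torch.zeros(self.hidden_dim)                    # initial recurrent state
        states = []
        for x in xs:                                        # inputs are fed sequentially
            h = torch.tanh(self.linear(torch.cat([x, h])))  # current word + previous output
            states.append(h)
        return torch.stack(states)                          # one hidden representation per position
```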

SLIDE 26

GRADIENT BASED LEARNING PROBLEM

Does it work?

➤ In theory: yes ➤ In practice: no, gradient-based learning of RNNs fails to learn long range dependencies!

[Figure: "The dog, I was told by my friend, is …" with hidden states h(1) … h(11): it is difficult to propagate the influence of early words across many recurrent steps]

SLIDE 27

GRADIENT BASED LEARNING PROBLEM

Does it work?

➤ In theory: yes ➤ In practice: no, gradient-based learning of RNNs fails to learn long range dependencies!

Deep learning is not a "single tool fits all problems" solution

➤ You need to understand your data and prediction task ➤ You need to understand why a given neural architecture may fail for a given task ➤ You need to be able to design tailored neural architectures for a given task

[Figure: "The dog, I was told by my friend, is …" with hidden states h(1) … h(11): it is difficult to propagate the influence of early words across many recurrent steps]

SLIDE 28

LONG SHORT-TERM MEMORY NETWORKS

SLIDE 29

LONG SHORT-TERM MEMORY NETWORKS (LSTM)

Intuition

➤ Memory vector which is passed along the sequence ➤ At each time step, the network selects which cell of the memory to modify

The network can learn to keep track of long distance relationships

c: the memory vector

LSTM cell

➤ The recurrent connection passes the memory vector to the next cell (cell input: x; cell outputs: h and the memory c)

SLIDE 30

ERASING/WRITING VALUES IN A VECTOR

Erasing values in the memory

(3.02, −4.11, 21.00, 4.44, −6.9) → (·, ·, 21.00, 4.44, −6.9)

"Forget" the first two cells

SLIDE 31

ERASING/WRITING VALUES IN A VECTOR

Erasing values in the memory

(3.02, −4.11, 21.00, 4.44, −6.9) → (·, ·, 21.00, 4.44, −6.9)

"Forget" the first two cells

Writing values in the memory

Memory before update: (·, ·, 21.00, 4.44, −6.9) + update: (10.0, 5.0, 1.0, ·, ·) = memory after update: (10.0, 5.0, 22.00, 4.44, −6.9)
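A tiny numeric sketch of these two operations, treating erased cells as zeros; the values are reconstructed from the figure and are only illustrative.

```python
# Tiny sketch (values reconstructed from the figure, erased cells treated as zeros).
import numpy as np

c = np.array([3.02, -4.11, 21.00, 4.44, -6.9])   # memory before
forget = np.array([0.0, 0.0, 1.0, 1.0, 1.0])     # "forget" the first two cells
update = np.array([10.0, 5.0, 1.0, 0.0, 0.0])    # values to write

c = forget * c                                   # erasing:  [ 0.    0.   21.    4.44 -6.9 ]
c = c + update                                   # writing:  [10.    5.   22.    4.44 -6.9 ]
print(c)
```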

SLIDE 32

GATE MECHANISM

Erasing values in a vector

Let's assume we want to remove some values from a vector c:

  • 1. A simple linear classifier computes the importance of each value in c: $w = Uc + b$
  • 2. We erase non-important values, i.e. values with a negative weight in w

SLIDE 33

GATE MECHANISM

Erasing values in a vector

Let's assume we want to remove some values from a vector c:

  • 1. A simple linear classifier computes the importance of each value in c: $w = Uc + b$ (importance of each cell in c)
  • 2. We erase non-important values, i.e. values with a negative weight in w

SLIDE 34

GATE MECHANISM

Erasing values in a vector

Let's assume we want to remove some values from a vector c:

  • 1. A simple linear classifier computes the importance of each value in c: $w = Uc + b$
  • 2. We erase non-important values, i.e. values with a negative weight in w:

$c'_i = \begin{cases} c_i & \text{if } w_i > 0 \\ 0 & \text{otherwise} \end{cases}$

SLIDE 35

GATE MECHANISM

Erasing values in a vector

Let's assume we want to remove some values from a vector c:

  • 1. A simple linear classifier computes the importance of each value in c: $w = Uc + b$
  • 2. We erase non-important values, i.e. values with a negative weight in w:

$c'_i = \begin{cases} c_i & \text{if } w_i > 0 \\ 0 & \text{otherwise} \end{cases}$   OR   $b_i = \begin{cases} 1 & \text{if } w_i > 0 \\ 0 & \text{otherwise} \end{cases}, \quad c' = c \times b$

where b is a vector of booleans indicating which cells we must keep.

SLIDE 36

CELL SELECTION AND BACKPROPAGATION?

Forward pass: $w = Uc + b$, then $b_i = \begin{cases} 1 & \text{if } w_i > 0 \\ 0 & \text{otherwise} \end{cases}$

[Plot: the step function $b_i$ as a function of $w_i$]

SLIDE 37

CELL SELECTION AND BACKPROPAGATION?

Forward pass: $w = Uc + b$, then $b_i = \begin{cases} 1 & \text{if } w_i > 0 \\ 0 & \text{otherwise} \end{cases}$

Backward pass, by the chain rule:

$\frac{\partial \mathcal{L}}{\partial w_i} = \frac{\partial \mathcal{L}}{\partial b_i} \cdot \frac{\partial b_i}{\partial w_i} + \ldots$

where $\frac{\partial \mathcal{L}}{\partial b_i}$ is the gradient wrt the loss. What does the term $\frac{\partial b_i}{\partial w_i}$ look like?

SLIDE 38

CELL SELECTION AND BACKPROPAGATION?

Forward pass: $w = Uc + b$, then $b_i = \begin{cases} 1 & \text{if } w_i > 0 \\ 0 & \text{otherwise} \end{cases}$

Backward pass, by the chain rule:

$\frac{\partial \mathcal{L}}{\partial w_i} = \frac{\partial \mathcal{L}}{\partial b_i} \cdot \frac{\partial b_i}{\partial w_i} + \ldots$

$\frac{\partial b_i}{\partial w_i} = 0$ almost everywhere (the step function is flat), so the gradient is blocked: no information is backpropagated!
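A small PyTorch check of this problem (illustrative, not from the lecture): implementing the hard selection with a step function gives a gradient of zero almost everywhere, so nothing is propagated back to w.

```python
# Minimal sketch (illustrative): the hard gate blocks gradients.
import torch

w = torch.tensor([-2.0, 0.5, 3.0], requires_grad=True)
b = (torch.sign(w) + 1.0) / 2.0          # step function: b_i = 1 if w_i > 0, else 0
loss = (b * torch.tensor([1.0, 2.0, 3.0])).sum()
loss.backward()
print(w.grad)                            # tensor([0., 0., 0.]): the step is flat, the gradient is blocked
```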

SLIDE 39

SMOOTH SELECTION 1/2

$b_i = \begin{cases} 1 & \text{if } w_i > 0 \\ 0 & \text{otherwise} \end{cases}$   OR   $b_i = \operatorname*{argmax}_{y_i}\ y_i \times w_i \quad \text{s.t. } 0 \le y_i \le 1$

Equivalent formulation as a small optimization problem

SLIDE 40

SMOOTH SELECTION 1/2

$b_i = \begin{cases} 1 & \text{if } w_i > 0 \\ 0 & \text{otherwise} \end{cases}$   OR   $b_i = \operatorname*{argmax}_{y_i}\ y_i \times w_i \quad \text{s.t. } 0 \le y_i \le 1$

Equivalent formulation as a small optimization problem

Intuition

➤ At the optimal solution, one of the constraints is tight => a small perturbation of $w_i$ will not change the solution
➤ We can introduce a penalty in the objective so that the constraints are never tight at the optimal solution

SLIDE 41

SMOOTH SELECTION 1/2

$b_i = \begin{cases} 1 & \text{if } w_i > 0 \\ 0 & \text{otherwise} \end{cases}$   OR   $b_i = \operatorname*{argmax}_{y_i}\ y_i \times w_i \quad \text{s.t. } 0 \le y_i \le 1$

Equivalent formulation as a small optimization problem

Intuition

➤ At the optimal solution, one of the constraints is tight => a small perturbation of $w_i$ will not change the solution
➤ We can introduce a penalty in the objective so that the constraints are never tight at the optimal solution

$b_i = \operatorname*{argmax}_{y_i}\ y_i \times w_i - \Omega(y_i) \quad \text{s.t. } 0 \le y_i \le 1$, where $\Omega$ is a strongly convex regularizer

SLIDE 42

SMOOTH SELECTION 1/2

$b_i = \operatorname*{argmax}_{y_i}\ y_i \times w_i - \Omega(y_i) \quad \text{s.t. } 0 \le y_i \le 1$

How to choose the convex regularizer?

➤ We need to solve the program quickly ➤ We need to be able to backpropagate easily ➤ Several solutions exist (i.e. similar to interior point methods)

SLIDE 43

SMOOTH SELECTION 1/2

$b_i = \operatorname*{argmax}_{y_i}\ y_i \times w_i - \Omega(y_i) \quad \text{s.t. } 0 \le y_i \le 1$

How to choose the convex regularizer?

➤ We need to solve the program quickly ➤ We need to be able to backpropagate easily ➤ Several solutions exist (i.e. similar to interior point methods)

With the negative Fermi-Dirac entropy as regularizer:

$b_i = \operatorname*{argmax}_{y_i}\ y_i w_i - y_i \log y_i - (1 - y_i)\log(1 - y_i) \quad \text{s.t. } 0 \le y_i \le 1$

SLIDE 44

SMOOTH SELECTION 1/2

$b_i = \operatorname*{argmax}_{y_i}\ y_i \times w_i - \Omega(y_i) \quad \text{s.t. } 0 \le y_i \le 1$

How to choose the convex regularizer?

➤ We need to solve the program quickly ➤ We need to be able to backpropagate easily ➤ Several solutions exist (i.e. similar to interior point methods)

With the negative Fermi-Dirac entropy as regularizer:

$b_i = \operatorname*{argmax}_{y_i}\ y_i w_i - y_i \log y_i - (1 - y_i)\log(1 - y_i) \quad \text{s.t. } 0 \le y_i \le 1$

$b_i = \frac{1}{1 + \exp(-w_i)} = \sigma(w_i)$

This is actually the sigmoid (solve the KKT conditions to see it): a smooth and differentiable approximation! :)

[Plot: the sigmoid curve $b_i = \sigma(w_i)$]
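A small numeric check (illustrative, using SciPy) that solving the entropy-regularized problem indeed recovers the sigmoid:

```python
# Minimal check (illustrative): max_y  y*w - y*log(y) - (1-y)*log(1-y)  over [0, 1]  gives  y = sigmoid(w).
import numpy as np
from scipy.optimize import minimize_scalar

def smooth_gate(w):
    objective = lambda y: -(y * w - y * np.log(y) - (1 - y) * np.log(1 - y))   # negated: we minimize
    return minimize_scalar(objective, bounds=(1e-9, 1 - 1e-9), method="bounded").x

for w in (-3.0, 0.0, 2.0):
    print(w, smooth_gate(w), 1.0 / (1.0 + np.exp(-w)))    # the last two columns match
```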

SLIDE 45

LSTM CELL 1/2

Cell inputs: x(n) (time step input), c(n−1) (incoming memory), h(n−1) (incoming representation)

SLIDE 46

LSTM CELL 1/2

Cell inputs: x(n) (time step input), c(n−1) (incoming memory), h(n−1) (incoming representation)

Forget gate: $\sigma(U^{(1)} [x^{(n)}; h^{(n-1)}] + b^{(1)})$, multiplied element-wise with $c^{(n-1)}$

SLIDE 47

LSTM CELL 1/2

Forget gate: $\sigma(U^{(1)} [x^{(n)}; h^{(n-1)}] + b^{(1)})$

What could we add to the memory? $\tanh(U^{(3)} [x^{(n)}; h^{(n-1)}] + b^{(3)})$

SLIDE 48

LSTM CELL 1/2

Forget gate: $\sigma(U^{(1)} [x^{(n)}; h^{(n-1)}] + b^{(1)})$

Input gate: $\sigma(U^{(2)} [x^{(n)}; h^{(n-1)}] + b^{(2)})$

What could we add to the memory? $\tanh(U^{(3)} [x^{(n)}; h^{(n-1)}] + b^{(3)})$

SLIDE 49

LSTM CELL 1/2

Forget gate: $\sigma(U^{(1)} [x^{(n)}; h^{(n-1)}] + b^{(1)})$

Input gate: $\sigma(U^{(2)} [x^{(n)}; h^{(n-1)}] + b^{(2)})$

What could we add to the memory? $\tanh(U^{(3)} [x^{(n)}; h^{(n-1)}] + b^{(3)})$

Outgoing memory: $c^{(n)} = \text{forget gate} \times c^{(n-1)} + \text{input gate} \times \tanh(U^{(3)} [x^{(n)}; h^{(n-1)}] + b^{(3)})$

SLIDE 50

LSTM CELL 1/2

Forget gate: $\sigma(U^{(1)} [x^{(n)}; h^{(n-1)}] + b^{(1)})$

Input gate: $\sigma(U^{(2)} [x^{(n)}; h^{(n-1)}] + b^{(2)})$

What could we add to the memory? $\tanh(U^{(3)} [x^{(n)}; h^{(n-1)}] + b^{(3)})$

Output gate: $\sigma(U^{(4)} [x^{(n)}; h^{(n-1)}] + b^{(4)})$

Outgoing memory: $c^{(n)} = \text{forget gate} \times c^{(n-1)} + \text{input gate} \times \tanh(U^{(3)} [x^{(n)}; h^{(n-1)}] + b^{(3)})$

Hidden representation: $h^{(n)} = \text{output gate} \times \tanh(c^{(n)})$

SLIDE 51

LSTM CELL 2/2

Gates:

$f^{(n)} = \sigma(U^{(1)} [x^{(n)}; h^{(n-1)}] + b^{(1)})$ (forget gate: erase memory)

$i^{(n)} = \sigma(U^{(2)} [x^{(n)}; h^{(n-1)}] + b^{(2)})$ (input gate: update memory)

$o^{(n)} = \sigma(U^{(4)} [x^{(n)}; h^{(n-1)}] + b^{(4)})$ (output gate: compute output wrt memory)

Outputs:

$c^{(n)} = f^{(n)} \times c^{(n-1)} + i^{(n)} \times \tanh(U^{(3)} [x^{(n)}; h^{(n-1)}] + b^{(3)})$

$h^{(n)} = o^{(n)} \times \tanh(c^{(n)})$

Number of parameters: 4 times more parameters than a simple recurrent neural network!
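A minimal PyTorch sketch of these equations; sizes are illustrative assumptions. PyTorch also ships an equivalent built-in, torch.nn.LSTMCell.

```python
# Minimal sketch (illustrative assumptions): an LSTM cell implementing the equations above.
import torch
import torch.nn as nn

class LSTMCell(nn.Module):
    def __init__(self, input_dim=64, hidden_dim=128):
        super().__init__()
        concat_dim = input_dim + hidden_dim                     # [x(n); h(n-1)]
        self.forget_gate = nn.Linear(concat_dim, hidden_dim)    # U^(1), b^(1)
        self.input_gate = nn.Linear(concat_dim, hidden_dim)     # U^(2), b^(2)
        self.candidate = nn.Linear(concat_dim, hidden_dim)      # U^(3), b^(3)
        self.output_gate = nn.Linear(concat_dim, hidden_dim)    # U^(4), b^(4)

    def forward(self, x, h_prev, c_prev):
        xh = torch.cat([x, h_prev], dim=-1)
        f = torch.sigmoid(self.forget_gate(xh))         # erase memory
        i = torch.sigmoid(self.input_gate(xh))          # update memory
        o = torch.sigmoid(self.output_gate(xh))         # compute output wrt memory
        c = f * c_prev + i * torch.tanh(self.candidate(xh))
        h = o * torch.tanh(c)
        return h, c
```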

SLIDE 52

LSTM VARIANT: COUPLED FORGET AND INPUT GATES

Gates:

$f^{(n)} = \sigma(U^{(1)} [x^{(n)}; h^{(n-1)}] + b^{(1)})$

$i^{(n)} = 1 - f^{(n)}$ (the input gate is tied to the forget gate)

$o^{(n)} = \sigma(U^{(4)} [x^{(n)}; h^{(n-1)}] + b^{(4)})$

Outputs:

$c^{(n)} = f^{(n)} \times c^{(n-1)} + i^{(n)} \times \tanh(U^{(3)} [x^{(n)}; h^{(n-1)}] + b^{(3)})$

$h^{(n)} = o^{(n)} \times \tanh(c^{(n)})$

Intuition

➤ Tie the forget and input gates ➤ Each memory cell is either kept as is or replaced by a new value
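Relative to the LSTM cell sketch above, only the memory update changes; a minimal illustrative helper:

```python
# Minimal sketch (illustrative): coupled forget and input gates.
def coupled_memory_update(f, c_prev, candidate):
    # f: forget gate values in (0, 1); each cell is either kept as is or replaced by the candidate
    i = 1.0 - f                      # input gate tied to the forget gate
    return f * c_prev + i * candidate
```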

SLIDE 53

LSTM VARIANT: PEEPHOLES

Intuition

➤ In standard LSTMs, gates are not dependent on the memory state ➤ In peephole LSTMs, gates depend on the memory

SLIDE 54

LSTM VARIANT: PEEPHOLES

Intuition

➤ In standard LSTMs, gates are not dependent on the memory state ➤ In peephole LSTMs, gates depend on the memory

Gates

$f^{(n)} = \sigma(U^{(1)} [x^{(n)}; h^{(n-1)}; c^{(n-1)}] + b^{(1)})$

$i^{(n)} = \sigma(U^{(2)} [x^{(n)}; h^{(n-1)}; c^{(n-1)}] + b^{(2)})$

Look at the memory content to choose which cells to change

SLIDE 55

LSTM VARIANT: PEEPHOLES

Intuition

➤ In standard LSTMs, gates are not dependent on the memory state ➤ In peephole LSTMs, gates depend on the memory

Gates

$f^{(n)} = \sigma(U^{(1)} [x^{(n)}; h^{(n-1)}; c^{(n-1)}] + b^{(1)})$

$i^{(n)} = \sigma(U^{(2)} [x^{(n)}; h^{(n-1)}; c^{(n-1)}] + b^{(2)})$

$o^{(n)} = \sigma(U^{(4)} [x^{(n)}; h^{(n-1)}; c^{(n)}] + b^{(4)})$

Look at the memory content to choose which cells to change; the output gate depends on the new memory state

SLIDE 56

LSTM VARIANT: PEEPHOLES

Intuition

➤ In standard LSTMs, gates are not dependent on the memory state ➤ In peephole LSTMs, gates depend on the memory

Gates:

$f^{(n)} = \sigma(U^{(1)} [x^{(n)}; h^{(n-1)}; c^{(n-1)}] + b^{(1)})$

$i^{(n)} = \sigma(U^{(2)} [x^{(n)}; h^{(n-1)}; c^{(n-1)}] + b^{(2)})$

$o^{(n)} = \sigma(U^{(4)} [x^{(n)}; h^{(n-1)}; c^{(n)}] + b^{(4)})$

Outputs (unchanged):

$c^{(n)} = f^{(n)} \times c^{(n-1)} + i^{(n)} \times \tanh(U^{(3)} [x^{(n)}; h^{(n-1)}] + b^{(3)})$

$h^{(n)} = o^{(n)} \times \tanh(c^{(n)})$

Look at the memory content to choose which cells to change; the output gate depends on the new memory state
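A minimal sketch of a peephole cell (sizes are illustrative assumptions); compared with the LSTM cell sketch earlier, the gates also read the memory vector, and the output gate reads the new memory.

```python
# Minimal sketch (illustrative assumptions): peephole LSTM cell.
import torch
import torch.nn as nn

class PeepholeLSTMCell(nn.Module):
    def __init__(self, input_dim=64, hidden_dim=128):
        super().__init__()
        gate_dim = input_dim + 2 * hidden_dim                           # [x; h; c]
        self.forget_gate = nn.Linear(gate_dim, hidden_dim)              # U^(1), b^(1)
        self.input_gate = nn.Linear(gate_dim, hidden_dim)               # U^(2), b^(2)
        self.candidate = nn.Linear(input_dim + hidden_dim, hidden_dim)  # U^(3), b^(3) (unchanged)
        self.output_gate = nn.Linear(gate_dim, hidden_dim)              # U^(4), b^(4)

    def forward(self, x, h_prev, c_prev):
        xhc = torch.cat([x, h_prev, c_prev], dim=-1)
        f = torch.sigmoid(self.forget_gate(xhc))                        # looks at the old memory
        i = torch.sigmoid(self.input_gate(xhc))
        c = f * c_prev + i * torch.tanh(self.candidate(torch.cat([x, h_prev], dim=-1)))
        o = torch.sigmoid(self.output_gate(torch.cat([x, h_prev, c], dim=-1)))  # looks at the new memory
        h = o * torch.tanh(c)
        return h, c
```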

SLIDE 57

RNN-BASED ARCHITECTURES

SLIDE 58

MULTI-LAYER RNN

[Figure: an RNN with one layer vs. an RNN with two layers, both unrolled over "The dog is eating"]

➤ Each layer has its own set of trainable parameters ➤ The recurrent connection is layer-dependent ➤ The input of layer n > 1 is the hidden representation produced by layer n − 1

[Figure: two-layer RNN over "The dog is eating". Layer 1 reads x(n) and produces h(1,n), c(1,n); layer 2 reads h(1,n) and produces h(2,n), c(2,n), giving h(2,1) … h(2,4)]
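A minimal sketch with PyTorch's built-in stacked LSTM (illustrative sizes): layer 2 reads the hidden representations produced by layer 1, and each layer keeps its own parameters and recurrent connection.

```python
# Minimal sketch (illustrative sizes): a two-layer LSTM.
import torch
import torch.nn as nn

emb_dim, hidden_dim = 64, 128
rnn = nn.LSTM(input_size=emb_dim, hidden_size=hidden_dim, num_layers=2, batch_first=True)

x = torch.randn(1, 4, emb_dim)        # embeddings for "The dog is eating"
outputs, (h_n, c_n) = rnn(x)
print(outputs.shape)                  # (1, 4, 128): hidden states of the top (second) layer
print(h_n.shape, c_n.shape)           # (2, 1, 128) each: final state of each of the two layers
```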

SLIDE 59

TAGGING WITH LSTMS

Part-of-speech tagging: They/PRP walk/VB the/DET dog/NN

Named entity recognition: Neil/B-Per Armstrong/I-Per visited/O the/O moon/B-Loc

SLIDE 60

TAGGING WITH LSTMS

Part-of-speech tagging: They/PRP walk/VB the/DET dog/NN

Named entity recognition: Neil/B-Per Armstrong/I-Per visited/O the/O moon/B-Loc

[Figure: "They walk the dog" fed to an RNN producing h(1) … h(4), each passed to an MLP]

Neural architecture

  • 1. An RNN computes a context-sensitive representation of each word
  • 2. At each time step, the output of the RNN is fed to an MLP for classification

MLPs share parameters

The classifiers receive no information about the context to the right of each word!
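A minimal PyTorch sketch of this tagger (sizes and tag-set size are illustrative assumptions): the MLP is applied at every position, so its parameters are shared.

```python
# Minimal sketch (illustrative assumptions): LSTM tagger with a shared per-position MLP.
import torch
import torch.nn as nn

class LSTMTagger(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=64, hidden_dim=128, n_tags=17):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden_dim, 64), nn.Tanh(), nn.Linear(64, n_tags))

    def forward(self, tokens):                  # tokens: (batch, length)
        states, _ = self.lstm(self.embedding(tokens))
        return self.mlp(states)                 # tag weights at every position: (batch, length, n_tags)
```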

SLIDE 61

BIRNN

Intuition

Use two RNNs with different trainable parameters:

➤ Forward RNN: visit the sentence from left to right ➤ Backward RNN: visit the sentence from right to left

SLIDE 62

BIRNN

Intuition

Use two RNNs with different trainable parameters:

➤ Forward RNN: visit the sentence from left to right ➤ Backward RNN: visit the sentence from right to left

The dog is eating

Forward RNN

SLIDE 63

BIRNN

Intuition

Use two RNNs with different trainable parameters:

➤ Forward RNN: visit the sentence from left to right ➤ Backward RNN: visit the sentence from right to left

The dog is eating

Forward RNN Backward RNN

SLIDE 64

BIRNN

Intuition

Use two RNNs with different trainable parameters:

➤ Forward RNN: visit the sentence from left to right ➤ Backward RNN: visit the sentence from right to left

The dog is eating

For the token representations, we concatenate the outputs of each RNN
SLIDE 65

BIRNN

Intuition

Use two RNNs with different trainable parameters:

➤ Forward RNN: visit the sentence from left to right ➤ Backward RNN: visit the sentence from right to left

The dog is eating

For the token representations, we concatenate the outputs of each RNN
SLIDE 66

BIRNN

Intuition

Use two RNNs with different trainable parameters:

➤ Forward RNN: visit the sentence from left to right ➤ Backward RNN: visit the sentence from right to left

The dog is eating

For the token representations, we concatenate the outputs of each RNN

For the sentence representation, we concatenate the outputs of the last cell of each RNN
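A minimal sketch of a BiRNN built from two LSTMs with separate parameters (illustrative sizes); nn.LSTM(bidirectional=True) packages the same idea in a single module.

```python
# Minimal sketch (illustrative sizes): forward + backward RNN, concatenated representations.
import torch
import torch.nn as nn

emb_dim, hidden_dim = 64, 128
forward_rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
backward_rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

x = torch.randn(1, 4, emb_dim)                          # embeddings for "The dog is eating"
fwd, _ = forward_rnn(x)                                 # left-to-right pass
bwd, _ = backward_rnn(torch.flip(x, dims=[1]))          # right-to-left pass
bwd = torch.flip(bwd, dims=[1])                         # realign with the original word order

tokens = torch.cat([fwd, bwd], dim=-1)                  # token representations: (1, 4, 2 * hidden_dim)
sentence = torch.cat([fwd[:, -1], bwd[:, 0]], dim=-1)   # last cell of each direction: (1, 2 * hidden_dim)
```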

SLIDE 67

MULTI-STACK BIRNN

Intuition

Multi-layer RNNs have information only about previous words

SLIDE 68

MULTI-STACK BIRNN

[Figure: "The dog is eating" processed by the first BiRNN stack]

Intuition

Multi-layer RNNs have information only about previous words

SLIDE 69

MULTI-STACK BIRNN

[Figure: "The dog is eating" processed by the first BiRNN stack, whose outputs feed a second BiRNN stack]

Intuition

Multi-layer RNNs have information only about previous words. Each cell in the second stack has information about the whole sentence!
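A minimal sketch of a multi-stack BiRNN using PyTorch's built-in stacked bidirectional LSTM (illustrative sizes).

```python
# Minimal sketch (illustrative sizes): two stacked BiRNNs in one call.
import torch
import torch.nn as nn

emb_dim, hidden_dim = 64, 128
stacked_birnn = nn.LSTM(emb_dim, hidden_dim, num_layers=2, bidirectional=True, batch_first=True)

x = torch.randn(1, 4, emb_dim)        # embeddings for "The dog is eating"
outputs, _ = stacked_birnn(x)
print(outputs.shape)                  # (1, 4, 256): second-stack forward and backward outputs, concatenated
```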