
SLIDE 1

Recurrent Neural Networks

LING572: Advanced Statistical Methods for NLP
March 5, 2020

SLIDE 2

Outline

  • Word representations and MLPs for NLP tasks
  • Recurrent neural networks for sequences
  • Fancier RNNs
    • Vanishing/exploding gradients
    • LSTMs (Long Short-Term Memory)
    • Variants
  • Seq2seq architecture
  • Attention

SLIDE 3

MLPs for text classification

SLIDE 4

Word Representations

  • Traditionally: words are discrete features
    • e.g. curWord=“class”
  • As vectors: one-hot encoding
    • Each vector is |V|-dimensional, where V is the vocabulary
    • Each dimension corresponds to one word of the vocabulary
    • A 1 for the current word; 0 everywhere else

w1 = [1 0 0 ⋯ 0]
w3 = [0 0 1 ⋯ 0]

SLIDE 5

Word Embeddings

  • Problem 1: every word is equally different from every other
    • All words are orthogonal to each other
  • Problem 2: very high dimensionality
  • Solution: move words into a dense, lower-dimensional space
    • Grouping similar words close to each other
    • These denser representations are called embeddings

SLIDE 6

Word Embeddings

  • Formally, a d-dimensional embedding is a matrix E with shape (|V|, d)
  • Each row is the vector for one word in the vocabulary
  • Matrix-multiplying by a one-hot vector returns the corresponding row, i.e. the right word vector
  • Trained on prediction tasks (see LING571 slides)
    • Continuous bag of words
    • Skip-gram
  • Can be trained on a specific task, or downloaded pre-trained (e.g. GloVe, fastText)
  • Fancier versions now deal with OOV: sub-word units (e.g. BPE), character CNN/LSTM
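A minimal numpy sketch of the point above (not from the slides; the sizes and names are illustrative): multiplying a one-hot vector by the embedding matrix E simply selects the corresponding row.

    import numpy as np

    V, d = 5, 3                        # toy vocabulary and embedding sizes
    rng = np.random.default_rng(0)
    E = rng.normal(size=(V, d))        # embedding matrix: one row per word

    w3 = np.zeros(V)
    w3[2] = 1.0                        # one-hot vector for the third word

    assert np.allclose(w3 @ E, E[2])   # the matmul picks out row 2 of E

In practice, embedding layers implement this as a direct row lookup (E[i]) rather than an explicit matrix multiplication.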

SLIDES 7-8

Relationships via Offsets

[Figure: vector offsets capture lexical relations, e.g. MAN → WOMAN, UNCLE → AUNT, KING → QUEEN, and singular/plural pairs KING → KINGS, QUEEN → QUEENS]

Mikolov et al 2013b

SLIDES 9-10

One More Example

Mikolov et al 2013c

SLIDE 11

Caveat Emptor

Linzen 2016, a.o.

SLIDES 12-16

Example MLP for Language Modeling

Bengio et al 2003

wt : one-hot vector for the word at position t; C : the embedding matrix

embeddings = concat(C wt−1, C wt−2, …, C wt−(n−1))
hidden = tanh(W1 embeddings + b1)
probabilities = softmax(W2 hidden + b2)
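A minimal numpy sketch of the forward pass in the equations above; the toy sizes and the example context indices are illustrative assumptions, not from the slides.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    # toy sizes (hypothetical): |V| = 10, embedding dim d = 4, n = 3 (two context words), hidden dim 8
    V, d, n, h = 10, 4, 3, 8
    rng = np.random.default_rng(0)
    C  = rng.normal(size=(V, d))                      # embedding matrix
    W1 = rng.normal(size=(h, (n - 1) * d)); b1 = np.zeros(h)
    W2 = rng.normal(size=(V, h));           b2 = np.zeros(V)

    context = [7, 2]                                  # indices of w_{t-1}, w_{t-2}
    embeddings = np.concatenate([C[w] for w in context])
    hidden = np.tanh(W1 @ embeddings + b1)
    probabilities = softmax(W2 @ hidden + b2)         # distribution over the next word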

SLIDE 17

Example MLP for sentiment classification

  • Issue: texts of different length
  • One solution: average (or sum, or …) all the embeddings, which are all of the same dimension

Iyyer et al 2015

Model                             IMDB accuracy
Deep averaging network            89.4
NB-SVM (Wang and Manning 2012)    91.2
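To make the averaging idea concrete, here is a stripped-down sketch with assumed names and sizes (the actual deep averaging network of Iyyer et al 2015 adds hidden layers and word dropout): average the embeddings of however many tokens the text has, then classify the fixed-size result.

    import numpy as np

    def classify(token_ids, E, W, b):
        """Average the word embeddings, then apply a linear layer + softmax."""
        avg = E[token_ids].mean(axis=0)           # fixed-size vector, regardless of text length
        logits = W @ avg + b
        e = np.exp(logits - logits.max())
        return e / e.sum()                        # class probabilities

    rng = np.random.default_rng(0)
    E = rng.normal(size=(1000, 50))               # hypothetical vocab of 1000, dim 50
    W = rng.normal(size=(2, 50)); b = np.zeros(2) # binary sentiment
    probs = classify([5, 42, 7, 300], E, W, b)    # works for any text length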

SLIDE 18

Recurrent Neural Networks

SLIDES 19-22

RNNs: high-level

  • Feed-forward networks: fixed-size input, fixed-size output
    • Previous classifier: average embeddings of words
    • Other solutions: n-gram assumption (i.e. fixed-size context of word embeddings)
  • RNNs process sequences of vectors
    • Maintaining “hidden” state
    • Applying the same operation at each step
  • Different RNNs:
    • Different operations at each step
    • The operation is also called a “recurrent cell”
    • Other architectural considerations (e.g. depth, bidirectionality)

SLIDES 23-27

RNNs

[Figure: an RNN unrolled over the input “This class … interesting”, with a Linear + softmax output layer at each step]

ht = f(xt, ht−1)

Simple/“Vanilla” RNN:

ht = tanh(Wx xt + Wh ht−1 + b)

Steinert-Threlkeld and Szymanik 2019; Olah 2015
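A minimal numpy sketch of the vanilla RNN recurrence above (sizes and names are illustrative assumptions): the same cell, with the same parameters, is applied at every position.

    import numpy as np

    def vanilla_rnn(xs, Wx, Wh, b, h0):
        """h_t = tanh(Wx x_t + Wh h_{t-1} + b), applied left to right."""
        h, states = h0, []
        for x in xs:                              # xs: a sequence of input vectors
            h = np.tanh(Wx @ x + Wh @ h + b)
            states.append(h)
        return states                             # one hidden state per position

    rng = np.random.default_rng(0)
    Wx, Wh, b = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3)
    states = vanilla_rnn([rng.normal(size=4) for _ in range(5)], Wx, Wh, b, np.zeros(3))

Each per-position hidden state can then be fed to a Linear + softmax layer, as in the figure.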

SLIDE 28

Using RNNs

[Figure: ways of wiring up RNN inputs and outputs, labeled: MLP; e.g. text classification; e.g. POS tagging; seq2seq (later)]

SLIDE 29

Training: BPTT

  • “Unroll” the network across time-steps
  • Apply backprop to the “wide” network
  • Each cell has the same parameters
  • When updating parameters using the gradients, take the average across the time steps
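A small PyTorch sketch of the idea (an assumed setup, not the course's code): unroll one shared cell over the sequence, accumulate a per-step loss, and let backprop flow through the whole unrolled computation, so the shared parameters receive gradient contributions from every time step.

    import torch

    torch.manual_seed(0)
    cell = torch.nn.RNNCell(input_size=4, hidden_size=3)   # the same parameters at every step
    xs = torch.randn(5, 1, 4)                               # 5 time steps, batch of 1
    targets = torch.randn(5, 1, 3)

    h = torch.zeros(1, 3)
    loss = 0.0
    for t in range(5):                                      # the "unrolled", wide network
        h = cell(xs[t], h)
        loss = loss + ((h - targets[t]) ** 2).mean()

    loss.backward()   # gradients from all steps accumulate in cell.weight_hh, cell.weight_ih, ...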

SLIDE 30

Fancier RNNs

SLIDE 31

Vanishing/Exploding Gradients Problem

  • BPTT with vanilla RNNs faces a major problem:
    • The gradients can vanish (approach 0) across time
    • This makes it hard/impossible to learn long-distance dependencies, which are rampant in natural language
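A back-of-the-envelope illustration (not from the slides): the gradient through T steps is roughly a product of T per-step factors, so factors consistently below 1 drive it toward 0 (vanishing), while factors above 1 blow it up (exploding).

    T = 50
    print(0.9 ** T)   # ≈ 0.005: the signal from 50 steps back has all but disappeared
    print(1.1 ** T)   # ≈ 117: or, with slightly larger factors, it explodes instead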

SLIDE 32

Vanishing Gradients

[Figure (source): the chain of per-step gradients; if these are small (depends on W), the effect from t=4 on t=1 will be very small]

SLIDE 33

Vanishing Gradient Problem

source

SLIDE 34

Vanishing Gradient Problem

Graves 2012

SLIDE 35

Vanishing Gradient Problem

  • Gradient measures the effect of the past on the future
  • If it vanishes between t and t+n, can’t tell if:
    • There’s no dependency in fact
    • The weights in our network just haven’t yet captured the dependency

SLIDE 36

The need for long-distance dependencies

  • Language modeling (fill-in-the-blank):
    • The keys ____
    • The keys on the table ____
    • The keys next to the book on top of the table ____
  • To get the number on the verb, need to look at the subject, which can be very far away
    • And number can disagree with linearly-close nouns
  • Need models that can capture long-range dependencies like this. Vanishing gradients mean vanilla RNNs will have difficulty.

SLIDE 37

Long Short-Term Memory (LSTM)

SLIDE 38

LSTMs

  • Long Short-Term Memory (Hochreiter and Schmidhuber 1997)
  • The gold standard / default RNN
    • If someone says “RNN” now, they almost always mean “LSTM”
  • Originally: to solve the vanishing/exploding gradient problem for RNNs
    • Vanilla RNN: re-writes the entire hidden state at every time-step
    • LSTM: separate hidden state and memory
      • Read from and write to memory; can preserve long-term information

SLIDES 39-41

LSTMs

ft = σ(Wf · [ht−1, xt] + bf)
it = σ(Wi · [ht−1, xt] + bi)
ĉt = tanh(Wc · [ht−1, xt] + bc)
ct = ft ⊙ ct−1 + it ⊙ ĉt
ot = σ(Wo · [ht−1, xt] + bo)
ht = ot ⊙ tanh(ct)

🤕🤕🤸🤯

  • Key innovation: ct, ht = f(xt, ct−1, ht−1)
    • ct : a memory cell
    • Reading/writing (smooth), controlled by gates:
      • ft : forget gate
      • it : input gate
      • ot : output gate
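A minimal numpy sketch of one step of the equations above (assumed sizes and names): the gates are vectors of values in (0, 1) that control what is kept in, written to, and read out of the memory cell.

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def lstm_cell(x, h_prev, c_prev, W, b):
        """One LSTM step; W and b are dicts holding the four gate parameters."""
        hx = np.concatenate([h_prev, x])            # [h_{t-1}, x_t]
        f = sigmoid(W["f"] @ hx + b["f"])           # forget gate
        i = sigmoid(W["i"] @ hx + b["i"])           # input gate
        c_hat = np.tanh(W["c"] @ hx + b["c"])       # candidate values
        c = f * c_prev + i * c_hat                  # update the memory cell
        o = sigmoid(W["o"] @ hx + b["o"])           # output gate
        h = o * np.tanh(c)                          # new hidden state
        return h, c

    rng = np.random.default_rng(0)
    W = {k: rng.normal(size=(3, 7)) for k in "fico"}   # hidden dim 3, input dim 4
    b = {k: np.zeros(3) for k in "fico"}
    h, c = lstm_cell(rng.normal(size=4), np.zeros(3), np.zeros(3), W, b)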
SLIDES 42-49

LSTMs

[Figure: the LSTM cell diagram, annotated step by step]

  • ft ∈ [0,1]^m : which cells to forget
    • Element-wise multiplication: 0 = erase, 1 = retain
  • it ∈ [0,1]^m : which cells to write to
  • ĉt : “candidate” / new values
  • Add the new values to memory: ct = ft ⊙ ct−1 + it ⊙ ĉt
  • ot ∈ [0,1]^m : which cells to output

Steinert-Threlkeld and Szymanik 2019; Olah 2015

SLIDE 50

LSTMs solve vanishing gradients

Graves 2012

SLIDE 51

Gated Recurrent Unit (GRU)

  • Cho et al 2014: gated like the LSTM, but with no separate memory cell
    • “Collapses” execution/control and memory
  • Fewer gates = fewer parameters, higher speed
    • Update gate ut
    • Reset gate rt

ut = σ(Wu ht−1 + Uu xt + bu)
rt = σ(Wr ht−1 + Ur xt + br)
h̃t = tanh(Wh (rt ⊙ ht−1) + Uh xt + bh)
ht = (1 − ut) ⊙ ht−1 + ut ⊙ h̃t
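The corresponding numpy sketch for one GRU step, following the equations above (sizes and names are illustrative assumptions):

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def gru_cell(x, h_prev, W, U, b):
        u = sigmoid(W["u"] @ h_prev + U["u"] @ x + b["u"])          # update gate
        r = sigmoid(W["r"] @ h_prev + U["r"] @ x + b["r"])          # reset gate
        h_tilde = np.tanh(W["h"] @ (r * h_prev) + U["h"] @ x + b["h"])
        return (1 - u) * h_prev + u * h_tilde                       # no separate memory cell

    rng = np.random.default_rng(1)
    W = {k: rng.normal(size=(3, 3)) for k in "urh"}                 # hidden dim 3
    U = {k: rng.normal(size=(3, 4)) for k in "urh"}                 # input dim 4
    b = {k: np.zeros(3) for k in "urh"}
    h = gru_cell(rng.normal(size=4), np.zeros(3), W, U, b)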

SLIDE 52

LSTM vs GRU

  • Generally: LSTM a good default choice
  • GRU can be used if speed and fewer parameters are important
  • Full differences between them not fully understood
  • Performance often comparable, but: LSTMs can store unboundedly large values in memory, and seem to e.g. count better

source

SLIDES 53-57

Two Extensions

  • Deep RNNs: stack several RNN layers, each layer reading the hidden states of the layer below
  • Bidirectional RNNs: a Forward RNN and a Backward RNN read the input in opposite directions; concatenate their states at each position

Source: RNN cheat sheet
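Both extensions are a one-line change in most toolkits. A small PyTorch sketch (assumed sizes, not the course's code): num_layers stacks RNN layers, and bidirectional=True runs a forward and a backward RNN and concatenates their states.

    import torch

    torch.manual_seed(0)
    rnn = torch.nn.LSTM(input_size=8, hidden_size=16, num_layers=2,
                        bidirectional=True, batch_first=True)
    x = torch.randn(1, 5, 8)                 # (batch, sequence length, input dim)
    outputs, (h_n, c_n) = rnn(x)
    print(outputs.shape)                     # torch.Size([1, 5, 32]): forward + backward states concatenated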

SLIDE 58

“The BiLSTM Hegemony”

  • Chris Manning, in 2017:

source

SLIDE 59

Seq2Seq + attention

SLIDE 60

Sequence to sequence problems

  • Many NLP tasks can be construed as sequence-to-sequence problems
    • Machine translation: sequence of source-language tokens to sequence of target-language tokens
    • Parsing: “Shane talks.” → “(S (NP (N Shane)) (VP V talks))”
      • Incl. semantic parsing
    • Summarization
  • NB: not the same as tagging, which assigns a label to each position in a given sequence

SLIDES 61-63

seq2seq architecture [e.g. NMT]

[Figure: an encoder RNN reads the source sequence; a decoder RNN then generates the target sequence]

Sutskever et al 2013

SLIDE 64

seq2seq results

SLIDES 65-69

seq2seq architecture: problem

[Figure: encoder and decoder, with the decoder conditioned only on the encoder’s final state]

  • The decoder can only see the info in this one vector: all info about the source must be “crammed” into it
  • Mooney 2014: “You can't cram the meaning of a whole %&!$# sentence into a single $&!#* vector!”

Sutskever et al 2013


SLIDES 72-80

Adding Attention

[Figure: encoder states h1 h2 h3 over source tokens w1 w2 w3; the decoder starts from ⟨s⟩ with state d1, attends to the encoder states, predicts w′1 through a Linear + softmax layer, then continues with d2 and predicts w′i at each step]

αij = a(hj, di)   (a is usually a dot product)
eij = softmax(α)j
ci = Σj eij hj

Bahdanau et al 2014
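A minimal numpy sketch of one attention step from the figure (assumed names; scoring by dot product, which requires the encoder and decoder states to have the same dimension):

    import numpy as np

    def attention_context(d_i, H):
        """Context vector for one decoder state d_i over encoder states H (one per row)."""
        alpha = H @ d_i                      # α_ij = h_j · d_i
        e = np.exp(alpha - alpha.max())
        e = e / e.sum()                      # e_ij = softmax(α)_j
        return e @ H                         # c_i = Σ_j e_ij h_j

    rng = np.random.default_rng(0)
    H = rng.normal(size=(3, 4))              # encoder states h1, h2, h3
    c1 = attention_context(rng.normal(size=4), H)   # fed, with d1, into the Linear + softmax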

SLIDES 81-84

Attention, Generally

  • A query q pays attention to some values {vj}, based on similarity with some keys {kj}
  • Dot-product attention:

αj = q ⋅ kj
ej = e^αj / Σj e^αj
c = Σj ej vj

  • In the previous example: the encoder hidden states played both the keys and the values roles
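The same computation, written as a generic function with separate keys and values (a sketch under the slide's notation; in the seq2seq example above, the keys and values are both the matrix of encoder states):

    import numpy as np

    def attention(q, K, V):
        """c = Σ_j e_j v_j, with e = softmax over scores α_j = q · k_j."""
        scores = K @ q
        e = np.exp(scores - scores.max())
        weights = e / e.sum()
        return weights @ V

    rng = np.random.default_rng(0)
    H = rng.normal(size=(3, 4))
    c = attention(rng.normal(size=4), H, H)   # keys = values = encoder states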

SLIDES 85-90

Why attention?

  • Incredibly useful (for performance)
    • By “solving” the bottleneck issue
  • Aids interpretability (maybe)
  • A general technique for combining representations, with applications in:
    • NMT, parsing, image/video captioning, …, everything

Bahdanau et al 2014; Vinyals et al 2015

SLIDE 91

Next Time

  • We will introduce a new type of large neural model: the Transformer
    • Hint: “Attention is All You Need” is the original paper
  • Introduce the idea of transfer learning and pre-training language models
  • Canvass recent developments and trends in that approach
    • What we might call “The Transformer Hegemony” or “The Muppet Hegemony”