Recurrent Neural Networks
LING572 Advanced Statistical Methods for NLP March 5 2020
Outline:
- Word representations and MLPs for NLP tasks
- Recurrent neural networks for sequences
- Fancier RNNs
- Vanishing/exploding gradients
[Figure: word-vector analogies, e.g. man : woman :: king : queen, and king : kings :: queen : queens (Mikolov et al 2013b)]
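These analogies fall out of simple vector arithmetic: king − man + woman lands nearest to queen. A toy sketch with hand-picked 3-d vectors (the vectors here are made up for illustration; real experiments use embeddings trained on large corpora, e.g. word2vec):

```python
import numpy as np

# Toy 3-d "embeddings" chosen by hand so the analogy works;
# real word vectors are learned from large corpora.
vecs = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# king - man + woman should land nearest to queen
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max(vecs, key=lambda w: cosine(vecs[w], target))
print(best)  # queen
```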
[Figure (Mikolov et al 2013c)]
Linzen 2016, a.o.
[Figure: neural probabilistic language model architecture (Bengio et al 2003)]
Deep averaging networks (Iyyer et al 2015):

Model                             IMDB accuracy
Deep averaging network            89.4
NB-SVM (Wang and Manning 2012)    91.2
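A deep averaging network is about the simplest neural sentence classifier: average the word embeddings, then pass the average through feed-forward layers and a softmax. A minimal sketch (the layer sizes, two-layer depth, and random parameters are illustrative, not the exact Iyyer et al configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def dan_forward(embeddings, W1, b1, W2, b2):
    """Deep averaging network: average word vectors, then feed-forward layers."""
    avg = embeddings.mean(axis=0)                  # average of the word embeddings
    h = relu(W1 @ avg + b1)                        # hidden layer
    logits = W2 @ h + b2                           # class scores (e.g. pos/neg for IMDB)
    return np.exp(logits) / np.exp(logits).sum()   # softmax over classes

d, hidden, classes = 50, 32, 2
sentence = rng.normal(size=(7, d))   # 7 random vectors, stand-ins for real word embeddings
W1, b1 = rng.normal(size=(hidden, d)), np.zeros(hidden)
W2, b2 = rng.normal(size=(classes, hidden)), np.zeros(classes)
print(dan_forward(sentence, W1, b1, W2, b2))
```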
Simple/“Vanilla” RNN (Steinert-Threlkeld and Szymanik 2019; Olah 2015)
[Figure: RNN unrolled across time steps, with a linear + softmax readout at each step]
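The picture above corresponds to one shared update applied at every time step, in the standard form h_t = tanh(W_x x_t + W_h h_{t−1} + b), followed by a linear + softmax readout. A minimal numpy sketch (shapes and the toy input are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_hid, vocab = 10, 16, 20

W_x = rng.normal(size=(d_hid, d_in)) * 0.1   # input-to-hidden weights
W_h = rng.normal(size=(d_hid, d_hid)) * 0.1  # hidden-to-hidden weights (shared across steps)
b   = np.zeros(d_hid)
W_o = rng.normal(size=(vocab, d_hid)) * 0.1  # linear + softmax readout

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.zeros(d_hid)                # initial hidden state
xs = rng.normal(size=(5, d_in))    # a length-5 input sequence
for x_t in xs:
    h = np.tanh(W_x @ x_t + W_h @ h + b)   # same weights at every time step
    y_t = softmax(W_o @ h)                 # per-step output distribution
```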
Long-distance dependencies are rampant in natural language.
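Such dependencies are exactly where plain RNNs struggle: the vanishing/exploding-gradient problem from the outline. Backpropagation through time multiplies one Jacobian, diag(1 − h_t²) · W_h, per step, so the gradient norm shrinks or grows geometrically with distance. A small numeric illustration (dimensions and weight scale are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16
# Small recurrent weights make gradients vanish; larger ones can make them explode.
W_h = rng.normal(size=(d, d)) * 0.1

h = np.zeros(d)
grad = np.eye(d)   # accumulated product of per-step Jacobians
for t in range(50):
    x = rng.normal(size=d)
    h = np.tanh(W_h @ h + x)
    jac = np.diag(1.0 - h**2) @ W_h   # Jacobian of h_t w.r.t. h_{t-1}
    grad = grad @ jac                 # one more step of backprop through time
    if t % 10 == 0:
        print(t, np.linalg.norm(grad))   # norm decays geometrically with distance
```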
Long Short-Term Memory (LSTM) (Steinert-Threlkeld and Szymanik 2019; Olah 2015):
- Forget gate f_t ∈ [0,1]^m: which cells to forget. Gates act by element-wise multiplication: 0 erases a cell, 1 retains it.
- Input gate i_t ∈ [0,1]^m: which cells to write to.
- Candidate ĉ_t: the “candidate” / new values to write.
- Memory update, adding the new values to memory: c_t = f_t ⊙ c_{t−1} + i_t ⊙ ĉ_t
- Output gate o_t ∈ [0,1]^m: which cells to output.
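A minimal sketch of one LSTM step following the equations above, with the standard sigmoid gates and the usual readout h_t = o_t ⊙ tanh(c_t) (Olah 2015); the stacked-weight layout and sizes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the parameters for all four gates,
    stacked in the order: forget, input, candidate, output."""
    m = h_prev.shape[0]
    z = W @ x + U @ h_prev + b        # (4m,) pre-activations, one block per gate
    f = sigmoid(z[0*m:1*m])           # forget gate: which cells to forget
    i = sigmoid(z[1*m:2*m])           # input gate: which cells to write to
    c_hat = np.tanh(z[2*m:3*m])       # candidate / new values
    o = sigmoid(z[3*m:4*m])           # output gate: which cells to output
    c = f * c_prev + i * c_hat        # c_t = f_t ⊙ c_{t-1} + i_t ⊙ ĉ_t
    h = o * np.tanh(c)                # hidden state read out from the memory
    return h, c

rng = np.random.default_rng(3)
d, m = 8, 12
W = rng.normal(size=(4*m, d)) * 0.1
U = rng.normal(size=(4*m, m)) * 0.1
b = np.zeros(4*m)
h, c = np.zeros(m), np.zeros(m)
h, c = lstm_step(rng.normal(size=d), h, c, W, U, b)
```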
Having fewer parameters is important: smaller models train faster and are less prone to overfitting.
Bidirectional RNN (Source: RNN cheat sheet):
- Forward RNN: reads the sequence left to right.
- Backward RNN: reads the sequence right to left.
- Concatenate the forward and backward states at each position.
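A sketch of the forward / backward / concatenate recipe, using the simple RNN update in both directions (a simplification; in practice the two directions are often LSTM or GRU cells, each with its own parameters):

```python
import numpy as np

def rnn_states(xs, W_x, W_h, b):
    """Run a simple RNN over xs and return the hidden state at every step."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(4)
d_in, d_hid, T = 10, 16, 5
xs = rng.normal(size=(T, d_in))   # stand-ins for word embeddings

# Separate parameters for each direction
params_f = (rng.normal(size=(d_hid, d_in)) * 0.1,
            rng.normal(size=(d_hid, d_hid)) * 0.1, np.zeros(d_hid))
params_b = (rng.normal(size=(d_hid, d_in)) * 0.1,
            rng.normal(size=(d_hid, d_hid)) * 0.1, np.zeros(d_hid))

fwd = rnn_states(xs, *params_f)              # forward RNN: left to right
bwd = rnn_states(xs[::-1], *params_b)[::-1]  # backward RNN: right to left, re-aligned
states = np.concatenate([fwd, bwd], axis=1)  # concatenate states: (T, 2*d_hid)
```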
Sequence-to-sequence (encoder-decoder) models (Sutskever et al 2014)
The decoder can only see the information in this one vector: all information about the source must be “crammed” into it (Sutskever et al 2014).
Mooney 2014: “You can't cram the meaning of a whole %&!$# sentence into a single $&!#* vector!”
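The bottleneck is easy to see in code: the encoder's final hidden state is the only thing handed to the decoder. A schematic sketch with simple RNN cells (Sutskever et al 2014 actually use deep LSTMs; the names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 16

def step(x, h, W_x, W_h, b):
    return np.tanh(W_x @ x + W_h @ h + b)

# Separate encoder and decoder parameters
enc_Wx, enc_Wh, enc_b = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1, np.zeros(d)
dec_Wx, dec_Wh, dec_b = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1, np.zeros(d)

# Encode: run over the source; only the FINAL state survives
h = np.zeros(d)
for x in rng.normal(size=(6, d)):      # stand-ins for source word embeddings
    h = step(x, h, enc_Wx, enc_Wh, enc_b)
context = h   # the one vector: all info about the source is crammed in here

# Decode: initialized from that single vector, never looking back at the source
s = context
for y in rng.normal(size=(4, d)):      # stand-ins for target-side embeddings
    s = step(y, s, dec_Wx, dec_Wh, dec_b)
```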
Attention (Bahdanau et al 2014):

[Figure: encoder states h1, h2, h3 over source words w1, w2, w3; the decoder starts from ⟨s⟩ with state d1]
- Attention scores: α_ij = a(h_j, d_i), where a is usually a dot product.
- Attention weights: e_ij = softmax(α_i)_j, normalizing over the source positions j.
- Context vector: c_i = Σ_j e_ij h_j.
- The context vector, together with the decoder state, feeds a linear + softmax layer that predicts the output word w′_i; decoding then continues with the next decoder state d_2.
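In code, the three equations above are only a few lines, here with the dot product as the scoring function a (shapes are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(H, d_i):
    """Dot-product attention over encoder states H (one row per source word)."""
    alpha = H @ d_i    # α_ij = a(h_j, d_i), here a dot product
    e = softmax(alpha) # e_ij = softmax(α_i)_j: weights over source positions
    c_i = e @ H        # c_i = Σ_j e_ij h_j: weighted sum of encoder states
    return c_i, e

rng = np.random.default_rng(6)
H = rng.normal(size=(3, 16))   # encoder states h_1, h_2, h_3
d_1 = rng.normal(size=16)      # first decoder state
c_1, weights = attend(H, d_1)
print(weights)                 # how much the decoder attends to each source word
```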
With attention, the decoder can attend to everything in the source rather than relying on a single vector (Bahdanau et al 2014; Vinyals et al 2015).