

slide-1
SLIDE 1

IN5550 Neural Methods in Natural Language Processing Recurrent Neural Networks

Stephan Oepen

University of Oslo

March 10, 2020

slide-2
SLIDE 2

Our Roadmap

Today
◮ Language structure: sequences, trees, graphs
◮ Recurrent Neural Networks
◮ Different types of sequence labeling
◮ Learning to forget: Gated RNNs

Next Week
◮ RNNs for structured prediction
◮ Recursive RNN variants
◮ A Selection of RNN applications

Later
◮ Contextualized embeddings and transfer learning
◮ Conditioned generation and attention
◮ A CNN & RNN marriage: transformer architectures

2

slide-3
SLIDE 3

Recap: CNN Pros and Cons

◮ Can learn to represent large n-grams efficiently,
◮ without blowing up the parameter space and without having to represent the whole vocabulary. Parameter sharing.
◮ Easily parallelizable: each ‘region’ that a convolutional filter operates on is independent of the others; the entire input can be processed concurrently. (Each filter is also independent.)
◮ The cost of this is that we have to stack convolutions into deep layers in order to ‘view’ the entire input, and each of those layers is in fact calculated independently.
◮ Not designed for modeling sequential language data: does not offer a very natural way of modeling long-range and structured dependencies.

3

slide-5
SLIDE 5

But Language is So Rich in Structure

A similar technique is almost impossible to apply to other crops.

http://mrp.nlpl.eu/index.php?page=2

4

slide-8
SLIDE 8

Okay, Maybe Start with Somewhat Simpler Structures

A similar technique is almost impossible to apply to other crops.

[Figure: the sentence with its dependency parse (edge labels root, nsubj, cop, det, amod, advmod, mark, ccomp, obl, case, punct) and its part-of-speech sequence DET ADJ NOUN AUX ADV ADJ PART VERB ADP ADJ NOUN PUNCT.]

http://epe.nlpl.eu/index.php?page=1

5

slide-11
SLIDE 11

Recurrent Neural Networks in the Abstract

◮ Recurrent Neural Networks (RNNs) take variable-length sequences as input
◮ are highly sensitive to linear order; need not make any Markov assumptions
◮ map input sequence $x_{1:n}$ to output $y_{1:n}$
◮ internal state sequence $s_{1:n}$ as ‘history’

$\mathrm{RNN}(x_{1:n}, s_0) = y_{1:n}$
$s_i = R(s_{i-1}, x_i)$
$y_i = O(s_i)$
$x_i \in \mathbb{R}^{d_x}; \quad y_i \in \mathbb{R}^{d_y}; \quad s_i \in \mathbb{R}^{f(d_y)}$

6

slide-14
SLIDE 14

Still High-Level: The RNN Abstraction Unrolled

◮ Each state $s_i$ and output $y_i$ depend on the full previous context, e.g.
$s_4 = R(R(R(R(s_0, x_1), x_2), x_3), x_4)$
◮ Functions R(·) and O(·) shared across time points; fewer parameters

7

slide-20
SLIDE 20

Implementing the RNN Abstraction

◮ We have yet to define the nature of the R(·) and O(·) functions
◮ RNNs actually a family of architectures; much variation for R(·)

Arguably the Most Basic RNN Implementation
$s_i = R(s_{i-1}, x_i) = s_{i-1} + x_i$
$y_i = O(s_i) = s_i$

◮ Does this maybe look familiar? Merely a continuous bag of words
◮ order-insensitive: Cisco acquired Tandberg ≡ Tandberg acquired Cisco
◮ actually has no parameters of its own: θ = {}; thus, no learning ability (see the sketch below)

8
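To make the order-insensitivity concrete, here is a minimal sketch of this parameter-free ‘RNN’; the choice of Python with PyTorch tensors is mine, not the slides’, and all values are toy data.

    import torch

    def sum_rnn(x_seq, s0):
        # s_i = s_{i-1} + x_i ; y_i = O(s_i) = s_i
        s = s0
        for x in x_seq:
            s = s + x
        return s

    x = torch.randn(3, 5)       # three toy input vectors, e.g. Cisco, acquired, Tandberg
    s0 = torch.zeros(5)
    # Reversing the sequence yields the same final state: a continuous bag of words.
    print(torch.allclose(sum_rnn(x, s0), sum_rnn(x.flip(0), s0)))   # True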

slide-23
SLIDE 23

The ‘Simple’ RNN (Elman, 1990)

◮ Want to learn the dependencies between elements of the sequence
◮ nature of the R(·) function needs to be determined during training

The Elman RNN
$s_i = R(s_{i-1}, x_i) = g(s_{i-1} W^{s} + x_i W^{x} + b)$
$y_i = O(s_i) = s_i$
$x_i \in \mathbb{R}^{d_x}; \quad s_i, y_i \in \mathbb{R}^{d_s}; \quad W^{x} \in \mathbb{R}^{d_x \times d_s}; \quad W^{s} \in \mathbb{R}^{d_s \times d_s}; \quad b \in \mathbb{R}^{d_s}$

◮ Linear transformations of states and inputs; non-linear activation
◮ alternative, equivalent definition of R(·): $s_i = g([s_{i-1}; x_i]\,W + b)$ (a from-scratch sketch follows below)

9
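A minimal from-scratch sketch of the Elman recurrence above, assuming PyTorch and made-up dimensions d_x = 4, d_s = 6 (neither is prescribed by the slides); g is taken to be tanh.

    import torch

    d_x, d_s = 4, 6
    W_x = torch.randn(d_x, d_s) * 0.1    # input-to-state weights  W^x
    W_s = torch.randn(d_s, d_s) * 0.1    # state-to-state weights  W^s
    b = torch.zeros(d_s)

    def elman_step(s_prev, x):
        # s_i = g(s_{i-1} W^s + x_i W^x + b), with g = tanh
        return torch.tanh(s_prev @ W_s + x @ W_x + b)

    x_seq = torch.randn(5, d_x)          # a toy input sequence x_{1:5}
    s = torch.zeros(d_s)                 # s_0
    outputs = []
    for x in x_seq:
        s = elman_step(s, x)
        outputs.append(s)                # y_i = O(s_i) = s_i for the Elman RNN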

slide-25
SLIDE 25

Training Recurrent Neural Networks

◮ Embed RNN in end-to-end task, e.g. classification from output states $y_i$
◮ standard loss functions, backpropagation, optimizers (so-called BPTT); see the training sketch below

10
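A hedged sketch of such an end-to-end set-up: an RNN whose output states feed a token-level classifier, trained with a standard cross-entropy loss and backpropagation through time. The framework (PyTorch), sizes, labels, and optimizer are all illustrative assumptions.

    import torch
    import torch.nn as nn

    rnn = nn.RNN(input_size=50, hidden_size=64, batch_first=True)
    out = nn.Linear(64, 5)                          # say, 5 token-level classes
    opt = torch.optim.Adam(list(rnn.parameters()) + list(out.parameters()), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    x = torch.randn(8, 12, 50)                      # 8 toy sequences of 12 input vectors
    gold = torch.randint(0, 5, (8, 12))             # one gold label per token

    y, _ = rnn(x)                                   # y holds the output states y_1 ... y_12
    logits = out(y)                                 # (8, 12, 5)
    loss = loss_fn(logits.reshape(-1, 5), gold.reshape(-1))
    loss.backward()                                 # gradients flow back through time (BPTT)
    opt.step()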

slide-29
SLIDE 29

An Alternate Training Regime

◮ Focus on final output state: $y_n$ as encoding of full sequence $x_{1:n}$
◮ looking familiar? map variable-length sequence to fixed-size vector (sketch below)
◮ sentence-level classification; or as input to conditioned generator
◮ aka sequence-to-sequence model; e.g. translation or summarization

11
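A small illustration (assuming PyTorch; the hidden size 64 is arbitrary) that the final state maps sequences of any length to one fixed-size vector:

    import torch
    import torch.nn as nn

    rnn = nn.RNN(input_size=50, hidden_size=64, batch_first=True)
    for length in (3, 17):
        _, h_n = rnn(torch.randn(1, length, 50))
        print(length, h_n.shape)         # both lengths yield torch.Size([1, 1, 64])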

slide-33
SLIDE 33

Unrolled RNNs, in a Sense, are very Deep MLPs

$s_i = R(s_{i-1}, x_i) = g(s_{i-1} W^{s} + x_i W^{x} + b)$
$\phantom{s_i} = g(g(s_{i-2} W^{s} + x_{i-1} W^{x} + b)\, W^{s} + x_i W^{x} + b)$
◮ $W^{s}$, $W^{x}$ shared across all layers → exploding or vanishing gradients (illustrated below)

12
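A toy illustration of why repeated multiplication with the same shared matrix makes signals (and hence gradients) vanish or explode; the scaling factors 0.5 and 1.5 and the depth of 30 steps are arbitrary choices, and PyTorch is an assumption.

    import torch

    v = torch.ones(8)
    for scale in (0.5, 1.5):
        W = torch.eye(8) * scale         # stands in for the shared W^s
        out = v.clone()
        for _ in range(30):              # 30 unrolled 'time steps'
            out = out @ W
        print(scale, out.norm().item())  # ≈ 2.6e-09 (vanishes) vs. ≈ 5.4e+05 (explodes)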

slide-35
SLIDE 35

Variants: Bi-Directional Recurrent Networks

◮ Capture full left and right context: ‘history’ and ‘future’ for each $x_i$
◮ moderate increase in parameters (double); still linear-time computation (sketch below)

13
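A hedged sketch of a bidirectional RNN: one RNN reads left-to-right, another right-to-left, and their states are concatenated per position. The built-in bidirectional flag below does exactly this; PyTorch and the sizes are my assumptions.

    import torch
    import torch.nn as nn

    birnn = nn.RNN(input_size=50, hidden_size=64, batch_first=True, bidirectional=True)
    y, _ = birnn(torch.randn(1, 9, 50))
    print(y.shape)    # (1, 9, 128): forward and backward states concatenated per position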

slide-36
SLIDE 36

Variants: ‘Deep’ (Stacked) Recurrent Networks

14

slide-37
SLIDE 37

A Note on Architecture Design

While it is not theoretically clear what is the additional power gained by the deeper architectures, it was observed empirically that deep RNNs work better than shallower ones on some tasks. [...] Many works report results using layered RNN architectures, but do not explicitly compare to one-layer RNNs. In the experiments of my research group, using two or more layers indeed often improves over using a single one. (Goldberg, 2017, p. 172)

15
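A hedged sketch of such a stacked (‘deep’) RNN: the output sequence of one layer is the input sequence of the next. The choice of three layers, PyTorch, and the sizes are illustrative assumptions.

    import torch
    import torch.nn as nn

    deep = nn.RNN(input_size=50, hidden_size=64, num_layers=3, batch_first=True)
    y, h_n = deep(torch.randn(1, 9, 50))
    print(y.shape, h_n.shape)    # (1, 9, 64) from the top layer; (3, 1, 64): one final state per layer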

slide-38
SLIDE 38

RNNs as Feature Extractors

16

slide-43
SLIDE 43

Common Applications of RNNs (in NLP)

◮ Acceptors e.g. (sentence-level) sentiment classification:
$P(c = k \mid w_{1:n}) = \hat{y}_{[k]}$
$\hat{y} = \mathrm{softmax}(\mathrm{MLP}([\mathrm{RNN}^{f}(x_{1:n})[n]; \mathrm{RNN}^{b}(x_{n:1})[1]]))$
$x_{1:n} = E_{[w_1]}, \ldots, E_{[w_n]}$

◮ transducers e.g. part-of-speech tagging:
$P(c_i = k \mid w_{1:n}) = \mathrm{softmax}(\mathrm{MLP}([\mathrm{RNN}^{f}(x_{1:n})[i]; \mathrm{RNN}^{b}(x_{n:1})[i]]))_{[k]}$
$x_i = [E_{[w_i]}; \mathrm{RNN}^{f}_{c}(c_{1:l_i}); \mathrm{RNN}^{b}_{c}(c_{l_i:1})]$

◮ character-level RNNs robust to unknown words; may capture affixation
◮ encoder–decoder (sequence-to-sequence) models coming before Easter
(a tagger sketch follows below)

17
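A hedged sketch of the transducer set-up for tagging: embed each word, run a bidirectional RNN, and predict a tag distribution from the concatenated states at each position. The character-level component of x_i is omitted here; PyTorch, the vocabulary size, dimensions, and tagset size are all assumptions.

    import torch
    import torch.nn as nn

    emb = nn.Embedding(num_embeddings=10_000, embedding_dim=100)            # E[w_i]
    birnn = nn.RNN(input_size=100, hidden_size=64, batch_first=True, bidirectional=True)
    mlp = nn.Sequential(nn.Linear(128, 64), nn.Tanh(), nn.Linear(64, 17))   # 17 tags, say

    words = torch.randint(0, 10_000, (1, 6))        # one toy sentence of 6 word ids
    states, _ = birnn(emb(words))                   # [RNN_f(x_1:n)[i]; RNN_b(x_n:1)[i]]
    tag_probs = torch.softmax(mlp(states), dim=-1)  # P(c_i = k | w_1:n) per token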

slide-45
SLIDE 45

Fun Outlook: ‘Conditioned Generation’

◮ Andrej Karpathy (2016): Connecting Images and Natural Language

18

slide-49
SLIDE 49

RNNs In a Nutshell

◮ State vectors $s_i$ reflect the complete history up to time point i;
◮ RNNs are sensitive to (basic) natural language structure: sequences;
◮ applicable to indeterminate and unlimited length inputs (in principle);
◮ few parameters: matrices $W^{s}$ and $W^{x}$ shared across all time points;
◮ analogous to (potentially) deep nesting: repeated multiplications;
◮ near-crippling practical limitation: ‘exploding’ or ‘vanishing’ gradients;
→ gated RNNs: Hochreiter & Schmidhuber (1997) and Cho et al. (2014).

19

slide-58
SLIDE 58

Sequence Labeling in Natural Language Processing

◮ Token-level class assignments in sequential context, aka tagging
◮ e.g. phoneme sequences, parts of speech; chunks, named entities, etc.
◮ some structure transcending individual tokens can be approximated

Michelle  Obama   visits  UiO    today  .
NNP       NNP     VBZ     NNP    RB     .
PERS                      ORG
PERS      PERS    —       ORG    —      —
B-PERS    I-PERS  O       B-ORG  O      O
B-PERS    E-PERS  O       S-ORG  O      O

◮ IOB (aka BIO) labeling scheme—and variants—encodes chunkings (see the sketch below).

20
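A small sketch (plain Python; the span format and the hyphenated B-/I- label spelling are my conventions, the slide writes BPERS etc.) of how entity spans map onto the IOB/BIO encoding, turning chunking into per-token sequence labeling:

    tokens = ["Michelle", "Obama", "visits", "UiO", "today", "."]
    spans = [(0, 2, "PERS"), (3, 4, "ORG")]      # (start, end, type), end exclusive

    bio = ["O"] * len(tokens)
    for start, end, label in spans:
        bio[start] = "B-" + label                # first token of the chunk
        for i in range(start + 1, end):
            bio[i] = "I-" + label                # tokens inside the chunk

    print(list(zip(tokens, bio)))
    # [('Michelle', 'B-PERS'), ('Obama', 'I-PERS'), ('visits', 'O'),
    #  ('UiO', 'B-ORG'), ('today', 'O'), ('.', 'O')]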

slide-60
SLIDE 60

Constituent Parsing as Sequence Labeling (1:2)

[Screenshot of the first page of “Constituent Parsing as Sequence Labeling” by Carlos Gómez-Rodríguez and David Vilares (FASTPARSE Lab, LyS Group, Universidade da Coruña). The abstract introduces a method to reduce constituent parsing to sequence labeling: for each word wt, a label encodes (1) the number of ancestors in the tree that wt and wt+1 have in common and (2) the non-terminal symbol at their lowest common ancestor. The encoding is proved injective for trees without unary branches, which are collapsed in practice, and the approach is evaluated on the PTB and CTB treebanks. The visible related-work discussion covers reductions of constituency parsing to other tasks, e.g. Fernández-González and Martins (2015), who reduce phrase-structure parsing to dependency parsing.]

21

slide-61
SLIDE 61

Constituent Parsing as Sequence Labeling (2:2)

22

slide-63
SLIDE 63

Vanishing (or Exploding) Gradients

◮ gradients for parameters ‘deep’ down diminish (or get out of hand)
→ ineffective backpropagation of error signals through the network
→ can make it difficult for the RNN to learn long-range dependencies
(one common counter-measure is sketched below)

23
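One standard remedy for the exploding side of the problem, not spelled out on the slide, is to clip the gradient norm before each optimizer step; the tiny model, loss, and threshold below are all illustrative assumptions (PyTorch).

    import torch
    import torch.nn as nn

    rnn, out = nn.RNN(10, 16, batch_first=True), nn.Linear(16, 2)
    opt = torch.optim.SGD(list(rnn.parameters()) + list(out.parameters()), lr=0.1)

    y, _ = rnn(torch.randn(4, 30, 10))           # longish sequences: deep unrolling
    loss = nn.CrossEntropyLoss()(out(y[:, -1]), torch.randint(0, 2, (4,)))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(rnn.parameters(), max_norm=5.0)   # cap the gradient norm
    opt.step()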

slide-65
SLIDE 65

Gating: Controlling Memory Access

◮ Hadamard product (⊙) simply performs element-wise multiplication
◮ vector g acts as a gate: mask out parts of the memory and input
◮ gating should depend on memory state and input; learn its behavior
◮ differentiable gates: assume $g \in \mathbb{R}^n$ but squash into (0, 1) with σ (toy example below)

24
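A toy example of such a differentiable gate (PyTorch, made-up numbers): a sigmoid squashes a vector into (0, 1), and the Hadamard product interpolates between keeping the old memory and writing the new input.

    import torch

    s_prev = torch.tensor([1.0, -2.0, 0.5, 3.0])                  # previous memory
    x      = torch.tensor([0.2,  0.9, -1.0, 0.0])                 # proposed new content
    g      = torch.sigmoid(torch.tensor([5.0, -5.0, 0.0, 5.0]))   # ≈ [1, 0, 0.5, 1]

    s_new = g * s_prev + (1 - g) * x    # element-wise (⊙) gated update, fully differentiable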

slide-70
SLIDE 70

Long Short-Term Memory RNNs (LSTMs)

◮ State vectors $s_i$ partitioned into context memory and hidden state;
◮ forget gate f: how much of the previous memory to keep;
◮ input gate i: how much of the proposed update to apply;
◮ output gate o: what parts of the updated memory to output.

$s_i = R(x_i, s_{i-1}) = [c_i; h_i]$
$f_i = \sigma(x_i W^{xf} + h_{i-1} W^{hf} + b^{f})$
$i_i = \sigma(x_i W^{xi} + h_{i-1} W^{hi} + b^{i})$
$o_i = \sigma(x_i W^{xo} + h_{i-1} W^{ho} + b^{o})$
$c_i = f_i \odot c_{i-1} + i_i \odot \tanh(x_i W^{x} + h_{i-1} W^{h} + b)$
$h_i = o_i \odot \tanh(c_i)$
$y_i = O(s_i) = h_i$

◮ More parameters: separate $W^{x\cdot}$ and $W^{h\cdot}$ matrices for each gate (a from-scratch sketch follows below).

25
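A hedged, from-scratch rendering of one LSTM step following the equations above (PyTorch tensors, random untrained weights, no batching; the dimensions are made up).

    import torch

    d_x, d = 4, 6
    W = {name: torch.randn(d_x if name.startswith("x") else d, d) * 0.1
         for name in ("xf", "hf", "xi", "hi", "xo", "ho", "x", "h")}
    b = {name: torch.zeros(d) for name in ("f", "i", "o", "c")}

    def lstm_step(x, c_prev, h_prev):
        f = torch.sigmoid(x @ W["xf"] + h_prev @ W["hf"] + b["f"])   # forget gate
        i = torch.sigmoid(x @ W["xi"] + h_prev @ W["hi"] + b["i"])   # input gate
        o = torch.sigmoid(x @ W["xo"] + h_prev @ W["ho"] + b["o"])   # output gate
        c = f * c_prev + i * torch.tanh(x @ W["x"] + h_prev @ W["h"] + b["c"])
        h = o * torch.tanh(c)
        return c, h                                                  # y_i = h_i

    c, h = lstm_step(torch.randn(d_x), torch.zeros(d), torch.zeros(d))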

slide-71
SLIDE 71

Learning to Forget: Schematically

[Schematic of the LSTM cell: the three ⊙ operations apply the forget, input, and output gates to the memory, the proposed update, and the output, respectively.]

26

slide-74
SLIDE 74

Streams of Fashion in NLP

To a first approximation, the de facto consensus in NLP in 2017 is that, no matter what the task, you throw a BiLSTM at it [...] Christopher Manning, March 2017

(https://simons.berkeley.edu/talks/christopher-manning-2017-3-27)

RNNs, particularly ones with gated architectures such as the LSTM and the GRU, [...] are arguably the strongest contribution of deep learning to the statistical natural language processing tool set. (Goldberg, 2017, p. 163) Looking around at EMNLP 2019 last Fall, that rule of thumb needs to be nuanced now.

27

slide-76
SLIDE 76

RNNs In a Nutshell

◮ State vectors $s_i$ reflect the complete history up to time point i;
◮ RNNs are sensitive to (basic) natural language structure: sequences;
◮ applicable to indeterminate and unlimited length inputs (in principle);
◮ few parameters: matrices $W^{s}$ and $W^{x}$ shared across all time points;
◮ analogous to (potentially) deep nesting: repeated multiplications;
◮ near-crippling practical limitation: ‘exploding’ or ‘vanishing’ gradients;
◮ gated RNNs: Hochreiter & Schmidhuber (1997) and Cho et al. (2014);
→ pooling over a transducer RNN: alternative design for acceptor usage (sketch below).

28
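A brief sketch of that closing point (PyTorch, arbitrary sizes): instead of classifying from the final state only, pool over all transducer outputs (here with an element-wise max) to obtain the fixed-size summary.

    import torch
    import torch.nn as nn

    rnn = nn.RNN(input_size=50, hidden_size=64, batch_first=True)
    y, _ = rnn(torch.randn(1, 9, 50))     # transducer outputs y_1 ... y_9
    pooled, _ = y.max(dim=1)              # (1, 64) summary of the whole sequence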