IN5550 Neural Methods in Natural Language Processing
Recurrent Neural Networks
Stephan Oepen
University of Oslo
March 10, 2019
Today
◮ Language structure: sequences, trees, graphs
◮ Recurrent Neural Networks
◮ Different types of sequence labeling
◮ Learning to forget: Gated RNNs

Next Week
◮ RNNs for structured prediction
◮ Recursive RNN variants
◮ A selection of RNN applications

Later
◮ Contextualized embeddings and transfer learning
◮ Conditioned generation and attention
◮ A CNN & RNN marriage: transformer architectures
◮ Can learn to represent large n-grams efficiently, without blowing up the parameter space and without having to represent the whole vocabulary. Parameter sharing.
◮ Easily parallelizable: each ‘region’ that a convolutional filter operates over can be computed independently of the others.
◮ The cost of this is that we have to stack convolutions into deep layers in order to ‘view’ the entire input, and each of those layers is in fact calculated independently.
◮ Not designed for modeling sequential language data: does not offer a very natural way of modeling long-range and structured dependencies.
A similar technique is almost impossible to apply to other crops.

[Figure: a graph-structured analysis of the sentence; see http://mrp.nlpl.eu/index.php?page=2]

[Figure: the sentence as a syntactic dependency tree (root, det, amod, nsubj, cop, advmod, mark, ccomp, case, punct) over the part-of-speech sequence DET ADJ NOUN AUX ADV ADJ PART VERB ADP ADJ NOUN PUNCT; see http://epe.nlpl.eu/index.php?page=1]
◮ Recurrent Neural Networks (RNNs) take variable-length sequences as input
◮ are highly sensitive to linear order; need not make any Markov assumptions
◮ map input sequence x_{1:n} to output y_{1:n}
◮ internal state sequence s_{1:n} as ‘history’

RNN(x_{1:n}, s_0) = y_{1:n}
s_i = R(s_{i-1}, x_i)
y_i = O(s_i)
x_i ∈ R^{d_x};  y_i ∈ R^{d_y};  s_i ∈ R^{f(d_y)}
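To make the abstraction concrete, here is a minimal Python sketch (the helper name run_rnn and the toy inputs are made up for illustration): the recurrence is simply a left-to-right fold that threads the state through the sequence.

import numpy as np

def run_rnn(R, O, xs, s0):
    """Roll an abstract RNN over an input sequence:
    s_i = R(s_{i-1}, x_i),  y_i = O(s_i)."""
    ys, s = [], s0
    for x in xs:
        s = R(s, x)        # new state folds x_i into the full history
        ys.append(O(s))
    return ys

# Example with the 'addition' RNN defined below:
ys = run_rnn(R=lambda s, x: s + x, O=lambda s: s,
             xs=np.eye(3), s0=np.zeros(3))
print(ys[-1])   # y_3 = x_1 + x_2 + x_3 = [1. 1. 1.]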
◮ Each state s_i and output y_i depend on the full previous context, e.g.
s_4 = R(R(R(R(s_0, x_1), x_2), x_3), x_4)
◮ Functions R(·) and O(·) shared across time points; fewer parameters
◮ We have yet to define the nature of the R(·) and O(·) functions
◮ RNNs actually a family of architectures; much variation for R(·)

Arguably the Most Basic RNN Implementation
s_i = R(s_{i-1}, x_i) = s_{i-1} + x_i
y_i = O(s_i) = s_i

◮ Does this maybe look familiar? Merely a continuous bag of words
◮ order-insensitive: Cisco acquired Tandberg ≡ Tandberg acquired Cisco (see the toy demonstration below)
◮ actually has no parameters of its own: θ = {}; thus, no learning ability
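A toy demonstration of the order-insensitivity just noted; the three-dimensional ‘embeddings’ are invented values, for illustration only.

import numpy as np

# Toy 3-d 'embeddings'; the values are invented for illustration.
E = {"Cisco":    np.array([1.0, 0.0, 2.0]),
     "acquired": np.array([0.0, 1.0, 0.0]),
     "Tandberg": np.array([0.5, 0.5, 0.5])}

def cbow_rnn(words):
    """The addition 'RNN': s_i = s_{i-1} + x_i, y_i = s_i."""
    s = np.zeros(3)
    for w in words:
        s = s + E[w]
    return s

# Addition commutes, so both word orders yield the same encoding:
print(cbow_rnn(["Cisco", "acquired", "Tandberg"]))
print(cbow_rnn(["Tandberg", "acquired", "Cisco"]))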
◮ Want to learn the dependencies between elements of the sequence
◮ nature of the R(·) function needs to be determined during training

The Elman RNN
s_i = R(s_{i-1}, x_i) = g(s_{i-1} W^s + x_i W^x + b)
y_i = O(s_i) = s_i
x_i ∈ R^{d_x};  s_i, y_i ∈ R^{d_s};  W^x ∈ R^{d_x×d_s};  W^s ∈ R^{d_s×d_s};  b ∈ R^{d_s}

◮ Linear transformations of states and inputs; non-linear activation g
◮ alternative, equivalent definition of R(·): s_i = g([s_{i-1}; x_i] W + b)
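A minimal numpy sketch of the Elman recurrence, assuming toy dimensions and randomly initialized (untrained) parameters:

import numpy as np

rng = np.random.default_rng(0)
d_x, d_s = 4, 3                     # toy dimensions

# Randomly initialized parameters; in practice learned by training.
W_x = rng.normal(scale=0.1, size=(d_x, d_s))
W_s = rng.normal(scale=0.1, size=(d_s, d_s))
b = np.zeros(d_s)

def elman_step(s_prev, x, g=np.tanh):
    """s_i = g(s_{i-1} W^s + x_i W^x + b); the output y_i is s_i itself."""
    return g(s_prev @ W_s + x @ W_x + b)

s = np.zeros(d_s)                   # s_0
for x in rng.normal(size=(5, d_x)): # a random length-5 input sequence
    s = elman_step(s, x)
print(s)                            # final state s_5 = y_5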
◮ Embed RNN in end-to-end task, e.g. classification from output states y_i
◮ standard loss functions, backpropagation, optimizers (so-called BPTT: backpropagation through time)
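A PyTorch sketch of such an end-to-end setup, assuming a hypothetical toy classification task with random stand-in data; calling backward() through the unrolled recurrence is exactly backpropagation through time.

import torch
import torch.nn as nn

# Hypothetical toy task: classify 16-d input sequences into 3 classes.
rnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True)  # an Elman RNN
clf = nn.Linear(32, 3)
optimizer = torch.optim.Adam(list(rnn.parameters()) + list(clf.parameters()))

x = torch.randn(8, 7, 16)            # random stand-in batch: (batch, time, d_x)
gold = torch.randint(0, 3, (8,))     # random stand-in labels

outputs, h_n = rnn(x)                # outputs holds all y_i; h_n the final state
logits = clf(h_n.squeeze(0))         # classify from the final output state y_n
loss = nn.functional.cross_entropy(logits, gold)
loss.backward()                      # autograd unrolls the recurrence: BPTT
optimizer.step()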
◮ Focus on final output state: y_n as encoding of full sequence x_{1:n}
◮ looking familiar? map variable-length sequence to fixed-size vector
◮ sentence-level classification; or as input to conditioned generator
◮ aka sequence-to-sequence model; e.g. translation or summarization
s_i = R(s_{i-1}, x_i)
    = g(s_{i-1} W^s + x_i W^x + b)
    = g(g(s_{i-2} W^s + x_{i-1} W^x + b) W^s + x_i W^x + b)

◮ W^s, W^x shared across all layers → exploding or vanishing gradients
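A small numpy illustration of why the repeated factor W^s is problematic (ignoring the non-linearity for simplicity): depending on the scale of the weights, the norm of v (W^s)^n collapses toward zero or explodes as n grows.

import numpy as np

# The factor (W^s)^n appears in the gradient of s_n w.r.t. s_0; its effect
# on a vector either decays or blows up with n.
rng = np.random.default_rng(0)
d = 10
v = np.ones(d)
for scale in (0.05, 0.5):                       # 'small' vs. 'large' weights
    W = rng.normal(scale=scale, size=(d, d))
    for n in (1, 10, 50):
        norm = np.linalg.norm(v @ np.linalg.matrix_power(W, n))
        print(f"scale={scale} n={n:2d} |v W^n| = {norm:.3g}")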
◮ Capture full left and right context: ‘history’ and ‘future’ for each x_i
◮ moderate increase in parameters (double); still linear-time computation
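In PyTorch, for instance, the bidirectional case is a one-flag change; the sketch below (with random stand-in inputs) shows the doubled output dimensionality from concatenating the two directions.

import torch
import torch.nn as nn

# bidirectional=True runs one RNN left-to-right and one right-to-left and
# concatenates their states, so each position sees history *and* future.
birnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True,
               bidirectional=True)
x = torch.randn(8, 7, 16)            # random stand-in batch
outputs, _ = birnn(x)
print(outputs.shape)                 # (8, 7, 64): [forward; backward] per token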
While it is not theoretically clear what is the additional power gained by the deeper architectures, it was observed empirically that deep RNNs work better than shallower ones on some tasks. [...] Many works report results using layered RNN architectures, but do not explicitly compare to one-layer RNNs. In the experiments of my research group, using two or more layers indeed often improves over using a single one. (Goldberg, 2017, p. 172)
◮ Acceptors e.g. (sentence-level) sentiment classification:
  P(c = k | w_{1:n}) = ŷ[k]
  ŷ = softmax(MLP([RNN^f(x_{1:n})[n]; RNN^b(x_{n:1})[1]]))
  x_{1:n} = E[w_1], …, E[w_n]
◮ transducers e.g. part-of-speech tagging:
  P(c_i = k | w_{1:n}) = softmax(MLP([RNN^f(x_{1:n})[i]; RNN^b(x_{n:1})[i]]))[k]
  x_i = [E[w_i]; RNN^f_c(c_{1:l_i}); RNN^b_c(c_{l_i:1})]
  (where c_{1:l_i} are the l_i characters of word w_i)
◮ character-level RNNs robust to unknown words; may capture affixation
◮ encoder-decoder (sequence-to-sequence) models coming before Easter
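A PyTorch sketch of the transducer case, word-level only and with invented hyper-parameters and class names; the character-level RNNs from the formula could be concatenated into x_i in the same way.

import torch
import torch.nn as nn

class BiRNNTagger(nn.Module):
    """Word-level transducer as in the formula above; a character-level
    RNN per word could additionally be concatenated into x_i."""
    def __init__(self, vocab_size, n_tags, d_emb=64, d_hid=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_emb)      # x_i = E[w_i]
        self.birnn = nn.LSTM(d_emb, d_hid, batch_first=True,
                             bidirectional=True)        # RNN^f and RNN^b
        self.mlp = nn.Linear(2 * d_hid, n_tags)         # MLP over [f_i; b_i]

    def forward(self, words):        # words: (batch, n) token ids
        states, _ = self.birnn(self.emb(words))
        return self.mlp(states)      # (batch, n, n_tags) tag logits

tagger = BiRNNTagger(vocab_size=10000, n_tags=17)
logits = tagger(torch.randint(0, 10000, (8, 12)))   # random stand-in batch
probs = logits.softmax(dim=-1)       # P(c_i = k | w_{1:n}) per token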
◮ Andrej Karpathy (2016): Connecting Images and Natural Language
◮ State vectors s_i reflect the complete history up to time point i;
◮ RNNs are sensitive to (basic) natural language structure: sequences;
◮ applicable to indeterminate and unlimited length inputs (in principle);
◮ few parameters: matrices W^s and W^x shared across all time points;
◮ analogous to (potentially) deep nesting: repeated multiplications;
◮ near-crippling practical limitation: ‘exploding’ or ‘vanishing’ gradients;
→ gated RNNs: Hochreiter & Schmidhuber (1997) and Cho et al. (2014).
◮ Token-level class assignments in sequential context, aka tagging
◮ e.g. phoneme sequences, parts of speech; chunks, named entities, etc.
◮ some structure transcending individual tokens can be approximated

Michelle   Obama    visits   UiO      today   .
NNP        NNP      VBZ      NNP      RB      .
PERS       PERS     —        ORG      —       —
B-PERS     I-PERS   O        B-ORG    O       O
B-PERS     E-PERS   O        S-ORG    O       O

◮ the IOB (aka BIO) labeling scheme and its variants encode chunkings.
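A small helper (hypothetical, mirroring the example above) that encodes labeled spans in the BIO scheme:

def spans_to_bio(tokens, spans):
    """Encode labeled (start, end) spans as per-token BIO tags."""
    tags = ["O"] * len(tokens)
    for (start, end), label in spans.items():
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

tokens = ["Michelle", "Obama", "visits", "UiO", "today", "."]
print(spans_to_bio(tokens, {(0, 2): "PERS", (3, 4): "ORG"}))
# ['B-PERS', 'I-PERS', 'O', 'B-ORG', 'O', 'O']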
Constituent Parsing as Sequence Labeling

[First page of the paper shown as an example: Carlos Gómez-Rodríguez and David Vilares, Universidade da Coruña, FASTPARSE Lab, LyS Group, Departamento de Computación, Campus de Elviña s/n, 15071 A Coruña, Spain; carlos.gomez@udc.es, david.vilares@udc.es.

From the abstract: “We introduce a method to reduce constituent parsing to sequence labeling. For each word w_t, it generates a label that encodes: (1) the number of ancestors in the tree that the words w_t and w_{t+1} have in common, and (2) the non-terminal symbol at the lowest common ancestor. [... The] function is injective for any tree without unary [branches ...] extensible to all constituency trees by collapsing unary branches. We then use the PTB and CTB treebanks as testbeds [...]”]
◮ gradients for parameters ‘deep’ down diminish (or get out of hand)
→ ineffective backpropagation of error signals through the network
→ can make it difficult for the RNN to learn long-range dependencies
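One standard counter-measure for the exploding case is gradient (norm) clipping; a minimal PyTorch sketch with a dummy loss (it does not help against vanishing gradients, which motivate the gated architectures below):

import torch
import torch.nn as nn

model = nn.RNN(input_size=16, hidden_size=32, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

outputs, _ = model(torch.randn(4, 50, 16))   # a longish random sequence
loss = outputs.pow(2).mean()                 # dummy loss, for illustration only
loss.backward()
# Rescale the gradient if its global norm exceeds the bound:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()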
◮ Hadamard product (⊙) simply performs element-wise multiplication
◮ vector g acts as a gate: mask out parts of the memory and input
◮ gating should depend on memory state and input; learn its behavior
◮ differentiable gates: assume g ∈ R^n but squash into (0, 1) with σ
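A three-component toy example of such a gate, with invented values:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

s_prev = np.array([0.9, -0.4, 0.2])     # previous memory (toy values)
x      = np.array([0.1,  0.8, -0.5])    # current input (toy values)

# A differentiable gate: real-valued, squashed into (0, 1) by sigma.
g = sigmoid(np.array([4.0, -4.0, 0.0])) # approx. [0.98, 0.02, 0.5]
s_new = g * s_prev + (1.0 - g) * x      # two Hadamard products
print(s_new)    # 1st component mostly memory, 2nd mostly input, 3rd mixed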
◮ State vectors s_i partitioned into context memory c_i and hidden state h_i;
◮ forget gate f: how much of the previous memory to keep;
◮ input gate i: how much of the proposed update to apply;
◮ output gate o: what parts of the updated memory to output.

s_i = R(x_i, s_{i-1}) = [c_i; h_i]
f_i = σ(x_i W^{xf} + h_{i-1} W^{hf} + b^f)
i_i = σ(x_i W^{xi} + h_{i-1} W^{hi} + b^i)
o_i = σ(x_i W^{xo} + h_{i-1} W^{ho} + b^o)
c_i = f_i ⊙ c_{i-1} + i_i ⊙ tanh(x_i W^x + h_{i-1} W^h + b)
h_i = o_i ⊙ tanh(c_i)
y_i = O(s_i) = h_i

◮ More parameters: separate W^x· and W^h· matrices (and biases) for each gate.
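A direct numpy transcription of these equations, a minimal sketch with toy dimensions and random untrained parameters:

import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 4, 3
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# One (W^x, W^h, b) triple per gate f, i, o, plus one for the memory update.
W = {k: (rng.normal(scale=0.1, size=(d_x, d_h)),
         rng.normal(scale=0.1, size=(d_h, d_h)),
         np.zeros(d_h))
     for k in "fioc"}

def lstm_step(c_prev, h_prev, x):
    def branch(k, g):
        W_x, W_h, b = W[k]
        return g(x @ W_x + h_prev @ W_h + b)
    f = branch("f", sigmoid)                  # forget gate
    i = branch("i", sigmoid)                  # input gate
    o = branch("o", sigmoid)                  # output gate
    c = f * c_prev + i * branch("c", np.tanh) # gated memory update
    h = o * np.tanh(c)                        # hidden state = output y_i
    return c, h

c, h = np.zeros(d_h), np.zeros(d_h)           # s_0 = [c_0; h_0]
for x in rng.normal(size=(5, d_x)):           # a random length-5 sequence
    c, h = lstm_step(c, h, x)
print(h)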
To a first approximation, the de facto consensus in NLP in 2017 is that, no matter what the task, you throw a BiLSTM at it [...]
Christopher Manning, March 2017
(https://simons.berkeley.edu/talks/christopher-manning-2017-3-27)

RNNs, particularly ones with gated architectures such as the LSTM and the GRU, [...] are arguably the strongest contribution of deep learning to the statistical natural language processing tool set. (Goldberg, 2017, p. 163)

Looking around at EMNLP 2019 last Fall, that rule of thumb needs to be nuanced now.
◮ State vectors s_i reflect the complete history up to time point i;
◮ RNNs are sensitive to (basic) natural language structure: sequences;
◮ applicable to indeterminate and unlimited length inputs (in principle);
◮ few parameters: matrices W^s and W^x shared across all time points;
◮ analogous to (potentially) deep nesting: repeated multiplications;
◮ near-crippling practical limitation: ‘exploding’ or ‘vanishing’ gradients;
◮ gated RNNs: Hochreiter & Schmidhuber (1997) and Cho et al. (2014);
→ pooling over a transducer RNN: alternative design for acceptor usage (see the sketch below).
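A sketch of that last point, with random stand-in data: pool the transducer's outputs y_1 … y_n over time into one fixed-size vector, instead of reading off only the final state.

import torch
import torch.nn as nn

# Element-wise max-pooling over time; mean-pooling is also common.
rnn = nn.LSTM(input_size=16, hidden_size=32, batch_first=True,
              bidirectional=True)
x = torch.randn(8, 7, 16)                # random stand-in batch
outputs, _ = rnn(x)                      # (8, 7, 64): one y_i per token
sentence = outputs.max(dim=1).values     # pooled over time: (8, 64)
print(sentence.shape)                    # fixed-size input for a classifier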