IN5550 Neural Methods in Natural Language Processing: Applications of Recurrent Neural Networks
Stephan Oepen
University of Oslo
March 24, 2019
Our Roadmap

Last Week
◮ Language structure: sequences, trees, graphs
◮ Recurrent Neural Networks
◮ Different types of sequence labeling
◮ Learning to forget: Gated RNNs

Today: First Half
◮ RNNs for structured prediction
◮ A selection of RNN applications

Today: Second Half
◮ Encoder–decoder (sequence-to-sequence) models
◮ Conditioned generation and attention
◮ Recurrent Neural Networks (RNNs)
◮ map input sequence x1:n to output y1:n
◮ internal state sequence s1:n as ‘memory’

RNN(x1:n, s0) = y1:n
si = R(si−1, xi)
yi = O(si)

xi ∈ Rdx; yi ∈ Rdy; si ∈ Rf(dy)
◮ Each state si and output yi depend on the full previous context, e.g.
s4 = R(R(R(R(x1, s0), x2), x3), x4)
◮ Functions R(·) and O(·) are shared across time points; fewer parameters
◮ Want to learn the dependencies between elements of the sequence
◮ the nature of the R(·) function needs to be determined during training

The Elman RNN
si = R(si−1, xi) = g(si−1W s + xiW x + b)
yi = O(si) = si
xi ∈ Rdx; si, yi ∈ Rds; W x ∈ Rdx×ds; W s ∈ Rds×ds; b ∈ Rds

◮ Linear transformations of states and inputs; non-linear activation g
◮ alternative, equivalent definition of R(·): si = g([si−1; xi]W + b)
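The Elman recursion can be sketched directly in NumPy; all dimensionalities, weights, and the initial state below are illustrative choices, not values from the slides:

```python
import numpy as np

def elman_step(s_prev, x, W_s, W_x, b, g=np.tanh):
    # s_i = g(s_{i-1} W^s + x_i W^x + b): one step of the Elman cell
    return g(s_prev @ W_s + x @ W_x + b)

def elman_rnn(xs, s0, W_s, W_x, b):
    # unroll R(.) over the sequence; O(.) is the identity, so y_i = s_i
    states, s = [], s0
    for x in xs:
        s = elman_step(s, x, W_s, W_x, b)
        states.append(s)
    return states

rng = np.random.default_rng(0)
d_x, d_s, n = 4, 3, 5                 # illustrative sizes
W_x = rng.normal(size=(d_x, d_s))
W_s = rng.normal(size=(d_s, d_s))
b = np.zeros(d_s)
xs = rng.normal(size=(n, d_x))
ys = elman_rnn(xs, np.zeros(d_s), W_s, W_x, b)
```

Note that the same two weight matrices are applied at every time point, which is what keeps the parameter count independent of sequence length.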
◮ State vectors si reflect the complete history up to time point i;
◮ RNNs are sensitive to (basic) natural language structure: sequences;
◮ applicable to inputs of indeterminate and unlimited length (in principle);
◮ few parameters: matrices W s and W x shared across all time points;
◮ analogous to (potentially) deep nesting: repeated multiplications;
◮ near-crippling practical limitation: ‘exploding’ or ‘vanishing’ gradients;
→ gated RNNs: Hochreiter & Schmidhuber (1997) and Cho et al. (2014).
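The repeated-multiplication point can be made concrete with a tiny numerical experiment; the ‘Jacobians’ here are stand-in diagonal matrices, not values from any trained network:

```python
import numpy as np

# Backpropagation through an unrolled RNN multiplies the gradient by
# (roughly) the same Jacobian at every time step.  With a factor whose
# spectral radius is below one the gradient vanishes; above one it explodes.
J_small = np.diag([0.5, 0.5, 0.5])   # spectral radius 0.5
J_large = np.diag([1.5, 1.5, 1.5])   # spectral radius 1.5

g_vanish = np.ones(3)
g_explode = np.ones(3)
for _ in range(50):                  # fifty time steps
    g_vanish = g_vanish @ J_small
    g_explode = g_explode @ J_large

# norm(g_vanish)  = sqrt(3) * 0.5**50, around 1.5e-15
# norm(g_explode) = sqrt(3) * 1.5**50, around 1.1e9
```

Gated architectures mitigate this by letting the memory pass through time points additively rather than through repeated multiplication.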
◮ State vectors si partitioned into context memory and hidden state;
◮ forget gate fi: how much of the previous memory to keep;
◮ input gate ii: how much of the proposed update to apply;
◮ output gate oi: what parts of the updated memory to output.

si = R(xi, si−1) = [ci; hi]
fi = σ(xiW xf + hi−1W hf + bf)
ii = σ(xiW xi + hi−1W hi + bi)
oi = σ(xiW xo + hi−1W ho + bo)
ci = fi ⊙ ci−1 + ii ⊙ tanh(xiW x + hi−1W h + b)
hi = oi ⊙ tanh(ci)
yi = O(si) = hi

◮ More parameters: separate W x· and W h· matrices for each gate.
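One LSTM update in NumPy, following the equations above; the parameter dictionary, its key names, and all sizes are our illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, c_prev, h_prev, P):
    # one LSTM update; P holds the parameter matrices (names are ours)
    f = sigmoid(x @ P["Wxf"] + h_prev @ P["Whf"] + P["bf"])   # forget gate
    i = sigmoid(x @ P["Wxi"] + h_prev @ P["Whi"] + P["bi"])   # input gate
    o = sigmoid(x @ P["Wxo"] + h_prev @ P["Who"] + P["bo"])   # output gate
    c = f * c_prev + i * np.tanh(x @ P["Wx"] + h_prev @ P["Wh"] + P["b"])
    h = o * np.tanh(c)                                        # y_i = h_i
    return c, h

rng = np.random.default_rng(0)
d_x, d = 4, 3
P = {k: rng.normal(scale=0.5, size=(d_x if k.startswith("Wx") else d, d))
     for k in ["Wxf", "Whf", "Wxi", "Whi", "Wxo", "Who", "Wx", "Wh"]}
P.update({k: np.zeros(d) for k in ["bf", "bi", "bo", "b"]})

c, h = np.zeros(d), np.zeros(d)
for x in rng.normal(size=(5, d_x)):
    c, h = lstm_step(x, c, h, P)
```

Because the memory ci is updated additively (scaled by the forget gate) rather than by a matrix multiplication, gradients can flow through many time steps without necessarily vanishing.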
◮ Same overall goals, but somewhat lower complexity than LSTMs
◮ “substantially fewer gates” (Goldberg, 2017, p. 181): two (one less)

si = R(xi, si−1) = (1 − zi) ⊙ si−1 + zi ⊙ s̃i
zi = σ(xiW xz + si−1W sz + bz)
ri = σ(xiW xr + si−1W sr + br)
s̃i = tanh(xiW x + (ri ⊙ si−1)W s + b)
yi = O(si) = si

◮ Can give results comparable to LSTMs, at reduced training costs:
“[...] the jury is still out between the GRU, the LSTM, and possible alternative RNN architectures.” (Goldberg, 2017, p. 182)
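A sketch of one GRU step in NumPy (parameter names and sizes are ours); driving the update gate towards zero shows how the gated interpolation can carry the state over unchanged:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, s_prev, P):
    # one GRU update following the slide's equations
    z = sigmoid(x @ P["Wxz"] + s_prev @ P["Wsz"] + P["bz"])  # update gate
    r = sigmoid(x @ P["Wxr"] + s_prev @ P["Wsr"] + P["br"])  # reset gate
    s_tilde = np.tanh(x @ P["Wx"] + (r * s_prev) @ P["Ws"] + P["b"])
    return (1 - z) * s_prev + z * s_tilde

rng = np.random.default_rng(0)
d_x, d = 4, 3
P = {k: rng.normal(scale=0.5, size=(d_x if k.startswith("Wx") else d, d))
     for k in ["Wxz", "Wsz", "Wxr", "Wsr", "Wx", "Ws"]}
P.update({k: np.zeros(d) for k in ["bz", "br", "b"]})

s = rng.normal(size=d)
s_next = gru_step(rng.normal(size=d_x), s, P)

# with the update gate forced towards zero, the state barely changes:
P["bz"] = np.full(d, -50.0)
s_frozen = gru_step(rng.normal(size=d_x), s, P)
```
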
◮ Acceptors, e.g. (sentence-level) sentiment classification:
P(c = k | w1:n) = ŷ[k]
ŷ = softmax(MLP([RNNf(x1:n)[n]; RNNb(xn:1)[1]]))
x1:n = E[w1], . . . , E[wn]
◮ Transducers, e.g. part-of-speech tagging:
P(ci = k | w1:n) = softmax(MLP([RNNf(x1:n)[i]; RNNb(xn:1)[i]]))[k]
x1:n = E[w1], . . . , E[wn]
◮ Encoder–decoder (sequence-to-sequence) models: coming later today
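The difference between the two readouts can be sketched with precomputed bidirectional state matrices; all sizes and values here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3                       # sentence length, RNN state size
fwd = rng.normal(size=(n, d))     # states of RNN_f, run left-to-right
bwd = rng.normal(size=(n, d))     # states of RNN_b, run right-to-left,
                                  # stored so that bwd[i] belongs to token i

# acceptor: a single feature vector for the whole sentence, the final
# forward state concatenated with the final backward state
acceptor_features = np.concatenate([fwd[-1], bwd[0]])

# transducer: one feature vector per token, the forward and backward
# states at the same position
transducer_features = np.concatenate([fwd, bwd], axis=1)
```

In both cases the features would then be fed to an MLP with a softmax output; only the readout of the recurrent states differs.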
◮ Token-level class assignments in sequential context, aka tagging
◮ e.g. phoneme sequences, parts of speech; chunks, named entities, etc.
◮ some structure transcending individual tokens can be approximated

Michelle   Obama    visits   UiO     today   .
NNP        NNP      VBZ      NNP     RB      .
B-PERS     I-PERS   O        B-ORG   O       O
B-PERS     E-PERS   O        S-ORG   O       O
2,NP       1,S      2,VP     2,VP    1,S

◮ IOB (aka BIO) labeling scheme—and variants—encodes chunkings.
◮ What is the constituent tree corresponding to the bottom row of labels?
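The IOB scheme above can be decoded back into entity spans with a few lines of Python; the span representation (end-exclusive indices) is our choice for this sketch:

```python
def iob_to_spans(tags):
    """Decode an IOB (BIO) label sequence into (start, end, type) spans,
    with end exclusive.  Labels are 'O' or 'B-X' / 'I-X' for entity type X."""
    spans, start, kind = [], None, None
    for i, tag in enumerate(tags):
        # close a running span on 'O', on a new 'B-', or on a type change
        if start is not None and (tag == "O" or tag.startswith("B-")
                                  or tag[2:] != kind):
            spans.append((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), kind))
    return spans

spans = iob_to_spans(["B-PERS", "I-PERS", "O", "B-ORG", "O", "O"])
# the slide's example yields a PERS span over tokens 0-1 and an ORG
# span over token 3
```
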
◮ Hmm, actually, what exactly does one mean by sequence labeling?
◮ gentle definition: class predictions for all elements, in context;
◮ pointwise classification; each individual decision is independent;
◮ no (direct) model of wellformedness conditions on the class sequence;
◮ strict definition: sequence labeling performs structured prediction;
◮ search for ‘globally’ optimal solution, e.g. most probable sequence;
◮ models (properties of) the output sequence explicitly, e.g. class bi-grams;
◮ later time points impact earlier choices, i.e. revision of path prefix;
◮ search techniques: dynamic programming, beam search, re-ranking.
[Figure: Viterbi decoding in Eisner’s ice-cream HMM; hidden states H(ot) and C(old), observation sequence 3 1 3, with start state S and end state /S.]

P(H|S) = 0.8   P(C|S) = 0.2
P(H|H) = 0.6   P(C|H) = 0.2   P(/S|H) = 0.2
P(H|C) = 0.3   P(C|C) = 0.5   P(/S|C) = 0.2
P(1|H) = 0.2   P(3|H) = 0.4
P(1|C) = 0.5   P(3|C) = 0.1

v1(H) = P(H|S)P(3|H) = 0.8 ∗ 0.4 = 0.32
v1(C) = P(C|S)P(3|C) = 0.2 ∗ 0.1 = 0.02
v2(H) = max(.32 ∗ .12, .02 ∗ .06) = .0384
v2(C) = max(.32 ∗ .1, .02 ∗ .25) = .032
v3(H) = max(.0384 ∗ .24, .032 ∗ .12) = .0092
v3(C) = max(.0384 ∗ .02, .032 ∗ .05) = .0016
vf(/S) = max(.0092 ∗ .2, .0016 ∗ .2) = .0018

The most probable state sequence is H H H.
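The ice-cream trellis can be reproduced with a short Viterbi implementation; the parameter tables follow the example above, while the code itself is our sketch:

```python
# Eisner's ice-cream HMM: states H(ot) and C(old), with start state S
# and end state /S; observations are numbers of ice creams eaten.
trans = {"S": {"H": 0.8, "C": 0.2},
         "H": {"H": 0.6, "C": 0.2, "/S": 0.2},
         "C": {"H": 0.3, "C": 0.5, "/S": 0.2}}
emit = {"H": {1: 0.2, 3: 0.4}, "C": {1: 0.5, 3: 0.1}}

def viterbi(obs, states=("H", "C")):
    # v[s]: probability of the best path for the prefix ending in state s
    v = {s: trans["S"][s] * emit[s][obs[0]] for s in states}
    back = []
    for o in obs[1:]:
        prev = v
        back.append({s: max(states, key=lambda p: prev[p] * trans[p][s])
                     for s in states})
        v = {s: prev[back[-1][s]] * trans[back[-1][s]][s] * emit[s][o]
             for s in states}
    best = max(states, key=lambda s: v[s] * trans[s]["/S"])
    prob = v[best] * trans[best]["/S"]
    path = [best]
    for ptr in reversed(back):       # follow backpointers to the start
        path.insert(0, ptr[path[0]])
    return path, prob

path, prob = viterbi([3, 1, 3])
# path is H H H, with probability about 0.0018, as in the trellis
```
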
Neural Architectures for Named Entity Recognition
Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, Chris Dyer
Carnegie Mellon University; NLP Group, Pompeu Fabra University

From the abstract: “State-of-the-art named entity recognition systems rely heavily on hand-crafted features and domain-specific knowledge in order to learn effectively from the small, supervised training corpora that are available. In this paper, we introduce two new neural architectures—one based on bidirectional LSTMs and conditional random fields, and the other that constructs and labels segments using a transition-based approach inspired by shift-reduce parsers. Our models rely on two sources of information about words: character-based word [...]”

From the introduction: “[...] from unannotated corpora offers an alternative strategy for obtaining better generalization from small amounts of supervision. However, even systems that have relied extensively on unsupervised features (Collobert et al., 2011; Turian et al., 2010; Lin and Wu, 2009; Ando and Zhang, 2005b, inter alia) have used these to augment, rather than replace, hand-engineered features (e.g., knowledge about capitalization patterns and character classes in a particular language) and specialized knowledge resources (e.g., gazetteers). In this paper, we present neural architectures [...]”
◮ Maybe just maximize sequence probability over softmax outputs?
◮ CRFs mark the pinnacle of evolution in probabilistic sequence labeling;
◮ discriminative (like MEMMs), but avoiding the label bias problem;
◮ for an input sequence W = w1:n and label sequence T = t1:n:

score(t1:n, w1:n) = Σi=1..n+1 A[ti−1, ti] + Σi=1..n Y[i, ti]

P(t1:n | w1:n) = e^score(W,T) / Σ T′ e^score(W,T′)

(with distinguished start and end tags t0 and tn+1)

◮ Y is the (bi-)RNN output; A holds transition scores for tag bi-grams;
◮ What are the dimensionalities of Y and A? For m tokens and n tags: Y ∈ Rm×n; A ∈ Rn×n;
◮ end-to-end training: maximize the log-probability of the correct t1:n.
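A toy sketch of the CRF score and its global normalization, with made-up emission and transition scores and a brute-force partition function; real implementations compute the normalizer with the forward algorithm rather than by enumeration:

```python
from itertools import product
from math import exp

# m = 3 tokens, n = 2 tags.  Y[i][t] is the (bi-)RNN's emission score for
# tag t at position i; A[t][t'] the transition score for the tag bi-gram.
# Tag id 0 doubles as the start/end tag here, a simplification of ours.
Y = [[1.0, 0.2], [0.1, 0.8], [0.5, 0.4]]   # made-up emission scores
A = [[0.3, -0.1], [0.2, 0.6]]              # made-up transition scores

def score(tags):
    # score(t_1:n, w_1:n) = sum_i A[t_{i-1}, t_i] + sum_i Y[i, t_i]
    s = sum(Y[i][t] for i, t in enumerate(tags))
    s += sum(A[p][t] for p, t in zip((0,) + tags, tags + (0,)))
    return s

# global normalization: sum over all n**m candidate tag sequences
candidates = list(product(range(2), repeat=3))
Z = sum(exp(score(tags)) for tags in candidates)
probs = {tags: exp(score(tags)) / Z for tags in candidates}
```

Because the normalizer ranges over whole sequences, probability mass is traded off globally between tag sequences, which is what distinguishes the CRF from pointwise softmax classification.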
Variable Length of Input Sequence
◮ Although RNNs are in principle well-defined for inputs of variable length,
◮ in practice, padding to fixed length is required for efficiency (batching);
◮ actually not too much ‘waste’; but it can be beneficial to bin by length.

Evaluation
◮ Accuracy is the common metric for tagging; fixed number of predictions;
◮ for most inputs, a very large proportion of padding tokens (and labels);
◮ trivial predictions will inflate accuracy scores; detrimental to learning?
◮ Can define a custom function: ‘prefix accuracy’; control early stopping?

Dropout in RNNs
◮ Dropout along memory updates can inhibit learning of effective gating;
◮ only apply dropout ‘vertically’; or fix the random mask (variational RNN).
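The ‘prefix accuracy’ idea can be sketched as follows; the padding convention (label id 0) and the data layout are assumptions of this sketch:

```python
def masked_accuracy(gold, pred, lengths):
    # only the first `lengths[j]` positions of each padded sequence count;
    # padding positions (label id 0 here) are excluded via the true lengths
    correct = total = 0
    for g, p, n in zip(gold, pred, lengths):
        correct += sum(1 for gi, pi in zip(g[:n], p[:n]) if gi == pi)
        total += n
    return correct / total

gold = [[3, 1, 2, 0, 0], [2, 2, 0, 0, 0]]
pred = [[3, 1, 1, 0, 0], [2, 2, 0, 0, 0]]
# naive accuracy over all 10 positions would be 9/10, inflated by the
# trivially correct padding; over the 5 real tokens it is 4/5
acc = masked_accuracy(gold, pred, [3, 2])
```

Used as an early-stopping criterion, such a masked metric avoids rewarding a model that merely learns to predict the padding label.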