IN5550 Neural Methods in Natural Language Processing: Applications of Recurrent Neural Networks


SLIDE 1

IN5550 Neural Methods in Natural Language Processing Applications of Recurrent Neural Networks

Stephan Oepen

University of Oslo

March 24, 2019

SLIDE 2

Our Roadmap

Last Week
◮ Language structure: sequences, trees, graphs
◮ Recurrent Neural Networks
◮ Different types of sequence labeling
◮ Learning to forget: gated RNNs

Today: First Half
◮ RNNs for structured prediction
◮ A selection of RNN applications

Today: Second Half
◮ Encoder–decoder (sequence-to-sequence) models
◮ Conditioned generation and attention

SLIDE 3

Recap: Recurrent Neural Networks

◮ Recurrent Neural Networks (RNNs)
◮ map an input sequence x_{1:n} to an output sequence y_{1:n}
◮ internal state sequence s_{1:n} serves as ‘memory’

RNN(x_{1:n}, s_0) = y_{1:n}
s_i = R(s_{i−1}, x_i)
y_i = O(s_i)
x_i ∈ R^{d_x};  y_i ∈ R^{d_y};  s_i ∈ R^{f(d_y)}
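To make the abstraction concrete, here is a minimal sketch (my own illustration, not from the slides) of unrolling an RNN as a plain Python loop; R and O are placeholders for the state-update and output functions supplied by a specific cell:

```python
import numpy as np

def run_rnn(R, O, xs, s0):
    """Unroll an abstract RNN: s_i = R(s_{i-1}, x_i), y_i = O(s_i)."""
    s = s0
    ys = []
    for x in xs:          # x_1 ... x_n, one input vector per time step
        s = R(s, x)       # update the internal state ('memory')
        ys.append(O(s))   # emit an output for this time step
    return np.stack(ys), s
```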

SLIDE 4

Recap: The RNN Abstraction Unrolled

◮ Each state s_i and output y_i depend on the full previous context, e.g. s_4 = R(R(R(R(s_0, x_1), x_2), x_3), x_4)
◮ Functions R(·) and O(·) are shared across time points; fewer parameters

SLIDE 5

Recap: Bidirectional, Stacked, and Character RNNs

SLIDE 6

Recap: The ‘Simple’ RNN (Elman, 1990)

◮ Want to learn the dependencies between elements of the sequence
◮ the nature of the R(·) function needs to be determined during training

The Elman RNN
s_i = R(s_{i−1}, x_i) = g(s_{i−1} W^s + x_i W^x + b)
y_i = O(s_i) = s_i
x_i ∈ R^{d_x};  s_i, y_i ∈ R^{d_s};  W^x ∈ R^{d_x×d_s};  W^s ∈ R^{d_s×d_s};  b ∈ R^{d_s}

◮ Linear transformations of states and inputs; non-linear activation g
◮ alternative, equivalent definition of R(·): s_i = g([s_{i−1}; x_i] W + b)
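A minimal NumPy sketch of this update (illustrative only; parameter shapes follow the dimensions above):

```python
import numpy as np

def elman_step(s_prev, x, W_s, W_x, b, g=np.tanh):
    """One Elman step: s_i = g(s_{i-1} W^s + x_i W^x + b); y_i = s_i."""
    s = g(s_prev @ W_s + x @ W_x + b)
    return s, s  # new state and output are the same vector

# tiny usage example with random parameters
d_x, d_s = 4, 3
rng = np.random.default_rng(0)
W_x, W_s, b = rng.normal(size=(d_x, d_s)), rng.normal(size=(d_s, d_s)), np.zeros(d_s)
s = np.zeros(d_s)
for x in rng.normal(size=(5, d_x)):   # a sequence of five input vectors
    s, y = elman_step(s, x, W_s, W_x, b)
```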

SLIDE 7

Recap: RNNs In a Nutshell

◮ State vectors s_i reflect the complete history up to time point i;
◮ RNNs are sensitive to (basic) natural language structure: sequences;
◮ applicable to inputs of indeterminate and, in principle, unlimited length;
◮ few parameters: matrices W^s and W^x are shared across all time points;
◮ analogous to (potentially) deep nesting: repeated multiplications;
◮ near-crippling practical limitation: ‘exploding’ or ‘vanishing’ gradients;
→ gated RNNs: Hochreiter & Schmidhuber (1997) and Cho et al. (2014).

SLIDE 8

Recap: Learning to Forget


SLIDE 9

Recap: Long Short-Term Memory RNNs (LSTMs)

◮ State vectors s_i are partitioned into context memory c_i and hidden state h_i;
◮ forget gate f: how much of the previous memory to keep;
◮ input gate i: how much of the proposed update to apply;
◮ output gate o: what parts of the updated memory to output.

s_i = R(x_i, s_{i−1}) = [c_i; h_i]
f_i = σ(x_i W^{xf} + h_{i−1} W^{hf} + b^f)
i_i = σ(x_i W^{xi} + h_{i−1} W^{hi} + b^i)
o_i = σ(x_i W^{xo} + h_{i−1} W^{ho} + b^o)
c_i = f_i ⊙ c_{i−1} + i_i ⊙ tanh(x_i W^x + h_{i−1} W^h + b)
h_i = o_i ⊙ tanh(c_i)
y_i = O(s_i) = h_i

◮ More parameters: separate W^{x·} and W^{h·} matrices for each gate.
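These gate equations transcribe almost directly into code; the following NumPy sketch is purely didactic (the weight names in the parameter dictionary P are my own, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(c_prev, h_prev, x, P):
    """One LSTM step following the slide's equations; P is a dict of weights."""
    f = sigmoid(x @ P["W_xf"] + h_prev @ P["W_hf"] + P["b_f"])   # forget gate
    i = sigmoid(x @ P["W_xi"] + h_prev @ P["W_hi"] + P["b_i"])   # input gate
    o = sigmoid(x @ P["W_xo"] + h_prev @ P["W_ho"] + P["b_o"])   # output gate
    z = np.tanh(x @ P["W_x"] + h_prev @ P["W_h"] + P["b"])       # proposed update
    c = f * c_prev + i * z          # new context memory
    h = o * np.tanh(c)              # new hidden state (also the output y_i)
    return c, h
```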

SLIDE 10

A Variant: Gated Recurrent Units (GRUs)

◮ Same overall goals, but somewhat lower complexity than LSTMs
◮ “substantially fewer gates” (Goldberg, 2017, p. 181): two (one less)

s_i = R(x_i, s_{i−1}) = (1 − z_i) ⊙ s_{i−1} + z_i ⊙ s̃_i
z_i = σ(x_i W^{xz} + s_{i−1} W^{sz} + b^z)
r_i = σ(x_i W^{xr} + s_{i−1} W^{sr} + b^r)
s̃_i = tanh(x_i W^x + (r_i ⊙ s_{i−1}) W^s + b)
y_i = O(s_i) = s_i

◮ Can give results comparable to LSTMs, at reduced training costs:
“[...] the jury is still out between the GRU, the LSTM, and possible alternative RNN architectures.” (Goldberg, 2017, p. 182)
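The corresponding sketch for the GRU update, again illustrative rather than an efficient implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(s_prev, x, P):
    """One GRU step: two gates (update z and reset r) instead of three."""
    z = sigmoid(x @ P["W_xz"] + s_prev @ P["W_sz"] + P["b_z"])        # update gate
    r = sigmoid(x @ P["W_xr"] + s_prev @ P["W_sr"] + P["b_r"])        # reset gate
    s_tilde = np.tanh(x @ P["W_x"] + (r * s_prev) @ P["W_s"] + P["b"])
    s = (1.0 - z) * s_prev + z * s_tilde
    return s, s   # the output y_i equals the state s_i
```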

SLIDE 11

Recap: Common Applications of RNNs (in NLP)

◮ Acceptors, e.g. (sentence-level) sentiment classification:
  P(c = k | w_{1:n}) = ŷ[k]
  ŷ = softmax(MLP([RNN^f(x_{1:n})[n]; RNN^b(x_{n:1})[1]]))
  x_{1:n} = E[w_1], . . . , E[w_n]
◮ Transducers, e.g. part-of-speech tagging:
  P(c_i = k | w_{1:n}) = softmax(MLP([RNN^f(x_{1:n})[i]; RNN^b(x_{n:1})[i]]))[k]
  x_{1:n} = E[w_1], . . . , E[w_n]
◮ Encoder–decoder (sequence-to-sequence) models: coming later today
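In PyTorch, which the sketch below assumes (layer sizes and names are arbitrary choices of mine), the two patterns differ only in how the bidirectional states are pooled: the transducer keeps one state per token, the acceptor uses the final states of both directions:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Transducer: one label distribution per token (e.g. PoS tagging)."""
    def __init__(self, vocab_size, n_tags, dim=100, hidden=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, words):                     # words: (batch, seq_len)
        states, _ = self.rnn(self.embed(words))   # (batch, seq_len, 2*hidden)
        return self.out(states)                   # per-token tag scores

class BiLSTMAcceptor(nn.Module):
    """Acceptor: one label for the whole sequence (e.g. sentiment)."""
    def __init__(self, vocab_size, n_classes, dim=100, hidden=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, words):
        _, (h_n, _) = self.rnn(self.embed(words))    # h_n: (2, batch, hidden)
        final = torch.cat([h_n[0], h_n[1]], dim=-1)  # forward and backward final states
        return self.out(final)                       # sentence-level class scores
```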

SLIDE 12

Recap: Sequence Labeling in NLP

◮ Token-level class assignments in sequential context, aka tagging
◮ e.g. phoneme sequences, parts of speech; chunks, named entities, etc.
◮ some structure transcending individual tokens can be approximated

Michelle   Obama    visits   UiO     today   .
NNP        NNP      VBZ      NNP     RB      .
BPERS      IPERS    O        BORG    O       O
BPERS      EPERS    O        SORG    O       O
     2,NP      1,S      2,VP     2,VP    1,S

◮ The IOB (aka BIO) labeling scheme and its variants encode chunkings (see the sketch after this list).
◮ What is the constituent tree corresponding to the bottom row of labels?
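To make the BIO encoding concrete, here is a small sketch (my own illustration, using the hyphenated B-/I- label variant) that recovers labelled spans from a tag sequence:

```python
def bio_to_spans(labels):
    """Turn a BIO label sequence into (type, start, end) spans over token indices."""
    spans, start, etype = [], None, None
    for i, label in enumerate(labels):
        if label.startswith("B-") or (label.startswith("I-") and etype is None):
            if etype is not None:
                spans.append((etype, start, i))   # close the previous span
            start, etype = i, label[2:]
        elif label == "O" and etype is not None:
            spans.append((etype, start, i))
            start, etype = None, None
    if etype is not None:
        spans.append((etype, start, len(labels)))
    return spans

print(bio_to_spans(["B-PERS", "I-PERS", "O", "B-ORG", "O", "O"]))
# [('PERS', 0, 2), ('ORG', 3, 4)]
```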

SLIDE 13

Two Definitions of ‘Sequence Labeling’

◮ Hmm, actually, what exactly does one mean by sequence labeling?
◮ gentle definition: class predictions for all elements, in context;
◮ pointwise classification; each individual decision is independent;
◮ no (direct) model of well-formedness conditions on the class sequence;
◮ strict definition: sequence labeling performs structured prediction;
◮ search for a ‘globally’ optimal solution, e.g. the most probable sequence;
◮ models (properties of) the output sequence explicitly, e.g. class bi-grams;
◮ later time points impact earlier choices, i.e. revision of the path prefix;
◮ search techniques: dynamic programming, beam search, re-ranking.

SLIDE 14

Wanted: Sequence-Level Output Constraints

SLIDE 15

Recap: Viterbi Decoding—Thanks, Bec!

Observations: 3 1 3 (ice creams eaten); hidden states H(ot) and C(old); start state S, end state /S.

Transition × emission scores along the trellis:
P(H|S) P(3|H) = 0.8 ∗ 0.4     P(C|S) P(3|C) = 0.2 ∗ 0.1
P(H|H) P(1|H) = 0.6 ∗ 0.2     P(C|H) P(1|C) = 0.2 ∗ 0.5
P(H|C) P(1|H) = 0.3 ∗ 0.2     P(C|C) P(1|C) = 0.5 ∗ 0.5
P(H|H) P(3|H) = 0.6 ∗ 0.4     P(C|H) P(3|C) = 0.2 ∗ 0.1
P(H|C) P(3|H) = 0.3 ∗ 0.4     P(C|C) P(3|C) = 0.5 ∗ 0.1
P(/S|H) = 0.2                 P(/S|C) = 0.2

Viterbi recursion:
v1(H) = 0.32                                   v1(C) = 0.02
v2(H) = max(.32 ∗ .12, .02 ∗ .06) = .0384      v2(C) = max(.32 ∗ .1, .02 ∗ .25) = .032
v3(H) = max(.0384 ∗ .24, .032 ∗ .12) = .0092   v3(C) = max(.0384 ∗ .02, .032 ∗ .05) = .0016
vf(/S) = max(.0092 ∗ .2, .0016 ∗ .2) = .0018
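The same computation, written as a generic Viterbi decoder; the probabilities below are the ones from the trellis above, everything else is an illustrative sketch:

```python
def viterbi(obs, states, start_p, trans_p, emit_p, end_p):
    """Most probable state sequence for an observation sequence (HMM)."""
    v = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]   # v_1
    back = []                                                   # backpointers
    for o in obs[1:]:
        scores, ptr = {}, {}
        for s in states:
            prev, p = max(((r, v[-1][r] * trans_p[r][s] * emit_p[s][o])
                           for r in states), key=lambda t: t[1])
            scores[s], ptr[s] = p, prev
        v.append(scores)
        back.append(ptr)
    # transition into the end state /S
    last, p_final = max(((s, v[-1][s] * end_p[s]) for s in states), key=lambda t: t[1])
    path = [last]
    for ptr in reversed(back):
        path.insert(0, ptr[path[0]])
    return path, p_final

# Eisner's ice-cream example: observations are ice creams eaten, states Hot/Cold
states = ["H", "C"]
start_p = {"H": 0.8, "C": 0.2}
trans_p = {"H": {"H": 0.6, "C": 0.2}, "C": {"H": 0.3, "C": 0.5}}
end_p   = {"H": 0.2, "C": 0.2}
emit_p  = {"H": {1: 0.2, 3: 0.4}, "C": {1: 0.5, 3: 0.1}}

print(viterbi([3, 1, 3], states, start_p, trans_p, emit_p, end_p))
# -> (['H', 'H', 'H'], 0.0018...), matching the values on the slide
```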

SLIDE 16

‘Vintage’ Machine Learning to the Rescue

Neural Architectures for Named Entity Recognition
Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, Chris Dyer
(Carnegie Mellon University; NLP Group, Pompeu Fabra University)

From the abstract: state-of-the-art named entity recognition systems rely heavily on hand-crafted features and domain-specific knowledge in order to learn effectively from the small, supervised training corpora that are available; the paper introduces two new neural architectures, one based on bidirectional LSTMs and conditional random fields, and one that constructs and labels segments using a transition-based approach inspired by shift-reduce parsers.

SLIDE 17

Abstractly: RNN Outputs as ‘Emission Scores’

SLIDE 18

Conditional Random Fields (CRF) on Top of an RNN

◮ Maybe just maximize sequence probability over softmax outputs?
◮ CRFs mark the pinnacle of evolution in probabilistic sequence labeling;
◮ discriminative (like MEMMs), but avoiding the label bias problem;
◮ for an input sequence W = w_{1:n} and label sequence T = t_{1:n}:

P(t_{1:n} | w_{1:n}) = e^{score(W,T)} / Σ_{T′} e^{score(W,T′)}

score(t_{1:n}, w_{1:n}) = Σ_{i=1}^{n+1} A[t_{i−1}, t_i] + Σ_{i=1}^{n} Y[i, t_i]

◮ Y is the (bi-)RNN output; A holds transition scores for tag bi-grams;
◮ What are the dimensionalities of Y and A? For n tokens and a tag set of size k: Y ∈ R^{n×k}; A ∈ R^{k×k} (or R^{(k+2)×(k+2)} with explicit start and stop tags);
◮ end-to-end training: maximize the log-probability of the correct t_{1:n}.
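A sketch of the CRF score and its normalizer in PyTorch (illustrative only; real implementations add explicit start/stop tags, batching, and masking):

```python
import torch

def sequence_score(Y, A, tags):
    """score(t_1:n, w_1:n) without start/stop tags; Y: (n, k), A: (k, k), tags: LongTensor (n,)."""
    emit = Y[torch.arange(len(tags)), tags].sum()   # sum_i Y[i, t_i]
    trans = A[tags[:-1], tags[1:]].sum()            # sum_i A[t_{i-1}, t_i]
    return emit + trans

def log_partition(Y, A):
    """log of the sum over all tag sequences, via the forward algorithm."""
    alpha = Y[0]                                    # (k,) scores at the first position
    for i in range(1, Y.size(0)):
        # alpha[t'] + A[t', t] + Y[i, t], log-summed over the previous tag t'
        alpha = torch.logsumexp(alpha.unsqueeze(1) + A, dim=0) + Y[i]
    return torch.logsumexp(alpha, dim=0)

# training loss for one gold sequence (negative log-probability):
# loss = log_partition(Y, A) - sequence_score(Y, A, gold_tags)
```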

SLIDE 19

Some Practical Considerations

Variable Length of Input Sequences
◮ Although RNNs are in principle well-defined for inputs of variable length,
◮ in practice, padding to a fixed length is required for efficiency (batching);
◮ actually not too much ‘waste’, but it can be beneficial to bin by length.

Evaluation
◮ Accuracy is a common metric for tagging; fixed number of predictions;
◮ for most inputs, a very large proportion of padding tokens (and labels);
◮ trivial predictions will inflate accuracy scores; detrimental to learning?
◮ Can define a custom function: ‘prefix accuracy’; control early stopping?

Dropout in RNNs
◮ Dropout along memory updates can inhibit learning of effective gating;
◮ only apply dropout ‘vertically’, or fix the random mask (variational RNN).
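As a small illustration of the evaluation point, a sketch of token accuracy that ignores padding positions (assuming label id 0 marks padding; this is not the course's actual evaluation code):

```python
import torch

def masked_accuracy(predictions, gold, pad_id=0):
    """Token accuracy over non-padding positions only."""
    mask = gold != pad_id                      # True for real tokens
    correct = (predictions == gold) & mask
    return correct.sum().float() / mask.sum().float()

# predictions and gold are (batch, seq_len) tensors of tag ids, padded to equal length
```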
