

SLIDE 1

INF5820: Language Technological Applications Applications of Recurrent Neural Networks

Stephan Oepen

University of Oslo

November 6, 2018

SLIDE 2

Most Recently: Variants of Recurrent Neural Networks

To a first approximation, the de facto consensus in NLP in 2017 is that, no matter what the task, you throw a BiLSTM at it [...]

Christopher Manning, March 2017

(https://simons.berkeley.edu/talks/christopher-manning-2017-3-27)

Looking around at EMNLP 2018 last week, that rule of thumb seems no less valid today.

SLIDE 3

Very High-Level: The RNN Abstraction—Unrolled

s_i = R(x_i, s_{i-1}) = g(s_{i-1} W^s + x_i W^x + b)
y_i = O(s_i) = s_i
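To make the recurrence concrete, here is a minimal NumPy sketch of one unrolled step; this is not course code, and the choice of tanh for g as well as all dimensions are illustrative assumptions.

```python
# Minimal sketch of the simple (Elman-style) RNN step above, NumPy only.
import numpy as np

d_x, d_s = 4, 3                        # input and state dimensionality (made up)
rng = np.random.default_rng(0)
W_x = rng.normal(size=(d_x, d_s))      # shared input-to-state weights
W_s = rng.normal(size=(d_s, d_s))      # shared state-to-state weights
b = np.zeros(d_s)

def rnn_step(x_i, s_prev):
    """One application of R: s_i = g(s_{i-1} W^s + x_i W^x + b)."""
    return np.tanh(s_prev @ W_s + x_i @ W_x + b)   # y_i = O(s_i) = s_i

s = np.zeros(d_s)                      # s_0
for x in rng.normal(size=(5, d_x)):    # a toy sequence of five input vectors
    s = rnn_step(x, s)                 # the state now reflects the full history
```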

SLIDE 4

RNNs: Take-Home Messages for the Casual User

◮ State vectors s_i reflect the complete history up to time point i;
◮ RNNs are sensitive to (basic) natural language structure: sequences;
◮ applicable to inputs of indeterminate and (in principle) unlimited length;
◮ few parameters: matrices W^s and W^x are shared across all time points;
◮ analogous to (potentially) deep nesting: repeated multiplications;
◮ near-crippling practical limitation: ‘exploding’ or ‘vanishing’ gradients (illustrated in the sketch after this list);

→ gated RNNs: Hochreiter & Schmidhuber (1997) and Cho et al. (2014).
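The gradient problem can be illustrated with a toy NumPy calculation (my sketch, not from the slides): backpropagation through time multiplies the error signal by the recurrent weights once per time step, so its norm tends either to shrink towards zero or to blow up, depending on the scale of those weights.

```python
# Toy illustration of vanishing/exploding gradients: repeatedly multiplying by
# the recurrent matrix, as backpropagation through time does. All numbers are
# made up; the Jacobian of the non-linearity is ignored for simplicity.
import numpy as np

rng = np.random.default_rng(1)
for scale in (0.5, 1.5):                        # 'small' vs 'large' recurrent weights
    W_s = scale * rng.normal(size=(3, 3)) / np.sqrt(3)
    signal = np.eye(3)
    for _ in range(50):                         # 50 time steps of backpropagation
        signal = signal @ W_s.T
    print(scale, np.linalg.norm(signal))        # shrinks towards 0 or explodes
```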

SLIDE 5

Essentials: Long Short-Term Memory RNNs (LSTMs)

◮ Three additional gates (vectors) modulate the flow of information;
◮ state vectors s_i are partitioned into a memory cell and a hidden state;
◮ forget gate f: how much of the previous memory to keep;
◮ input gate i: how much of the proposed update to apply;
◮ output gate o: what parts of the updated memory to output.

s_i = R(x_i, s_{i-1}) = [c_i; h_i]
c_i = f_i ⊙ c_{i-1} + i_i ⊙ z_i
z_i = tanh(x_i W^x + h_{i-1} W^h)
h_i = O(x_i, s_{i-1}) = o_i ⊙ tanh(c_i)

◮ More parameters: separate W^x and W^h matrices for each gate.
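A step-by-step NumPy sketch of these equations follows; again this is not course code, and the sigmoid parameterisation of the gates and all dimensions are conventional assumptions.

```python
# One LSTM step following the equations above.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d_x, d_h = 4, 3
rng = np.random.default_rng(2)
# separate W^x and W^h matrices for the input (i), forget (f) and output (o)
# gates and for the proposed update z
W_x = {g: rng.normal(size=(d_x, d_h)) for g in "ifoz"}
W_h = {g: rng.normal(size=(d_h, d_h)) for g in "ifoz"}

def lstm_step(x, c_prev, h_prev):
    i = sigmoid(x @ W_x["i"] + h_prev @ W_h["i"])   # input gate
    f = sigmoid(x @ W_x["f"] + h_prev @ W_h["f"])   # forget gate
    o = sigmoid(x @ W_x["o"] + h_prev @ W_h["o"])   # output gate
    z = np.tanh(x @ W_x["z"] + h_prev @ W_h["z"])   # proposed update
    c = f * c_prev + i * z                          # new memory cell c_i
    h = o * np.tanh(c)                              # new hidden state h_i
    return c, h                                     # together: s_i = [c_i; h_i]
```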

SLIDE 6

Variants: Bi-Directional Recurrent Networks

SLIDE 7

Variants: ‘Deep’ (Stacked) Recurrent Networks

SLIDE 8

A Side Note: Beyond Sequential Structures

Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks

Kai Sheng Tai, Richard Socher*, Christopher D. Manning
Computer Science Department, Stanford University; *MetaMind Inc.
kst@cs.stanford.edu, richard@metamind.io, manning@stanford.edu

Abstract

Because of their superior ability to preserve sequence information over time, Long Short-Term Memory (LSTM) networks, a type of recurrent neural network with a more complex computational unit, have obtained strong results on a variety of sequence modeling tasks. The only underlying LSTM structure that has been explored so far is a linear chain. However, natural language exhibits syntactic properties that would naturally combine words to phrases. We introduce the Tree-LSTM, a generalization of LSTMs to [...]

[figure: chain-structured vs. tree-structured composition of inputs x_i into outputs y_i]

SLIDE 9

Tree LSTMs Help Leverage Syntactic Structure

[text] A woman is slicing a tomato.
[hypothesis] A vegetable is being cut by a woman.

Method                                     Pearson's r       Spearman's ρ      MSE
Illinois-LH (Lai and Hockenmaier, 2014)    0.7993            0.7538            0.3692
UNAL-NLP (Jimenez et al., 2014)            0.8070            0.7489            0.3550
Meaning Factory (Bjerva et al., 2014)      0.8268            0.7721            0.3224
ECNU (Zhao et al., 2014)                   0.8414            –                 –
Mean vectors                               0.7577 (0.0013)   0.6738 (0.0027)   0.4557 (0.0090)
DT-RNN (Socher et al., 2014)               0.7923 (0.0070)   0.7319 (0.0071)   0.3822 (0.0137)
SDT-RNN (Socher et al., 2014)              0.7900 (0.0042)   0.7304 (0.0076)   0.3848 (0.0074)
LSTM                                       0.8528 (0.0031)   0.7911 (0.0059)   0.2831 (0.0092)
Bidirectional LSTM                         0.8567 (0.0028)   0.7966 (0.0053)   0.2736 (0.0063)
2-layer LSTM                               0.8515 (0.0066)   0.7896 (0.0088)   0.2838 (0.0150)
2-layer Bidirectional LSTM                 0.8558 (0.0014)   0.7965 (0.0018)   0.2762 (0.0020)
Constituency Tree-LSTM                     0.8582 (0.0038)   0.7966 (0.0053)   0.2734 (0.0108)
Dependency Tree-LSTM                       0.8676 (0.0030)   0.8083 (0.0042)   0.2532 (0.0052)

Table 3: Test set results on the SICK semantic relatedness subtask. For our experiments, we report mean scores over 5 runs (standard deviations in parentheses). Results are grouped as follows: (1) SemEval 2014 submissions; (2) Our own baselines; (3) Sequential LSTMs; (4) Tree-structured LSTMs.

SLIDE 10

Common Applications of RNNs (in NLP)

◮ Acceptors

e.g. (sentence-level) sentiment classification:

P(class = k | w_{1:n}) = ŷ[k]
ŷ = softmax(MLP([RNN^f(x_{1:n}); RNN^b(x_{n:1})]))
x_{1:n} = E[w_1], ..., E[w_n]
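Since the course works in Keras, here is one way such an acceptor could look there; this is a hedged sketch, not the course solution, and vocabulary size, dimensions and the number of classes are made-up example values.

```python
# BiLSTM acceptor: final forward and backward states feed an MLP and softmax.
from tensorflow.keras import layers, models

vocab_size, emb_dim, hidden, n_classes, max_len = 10000, 100, 128, 3, 50

inputs = layers.Input(shape=(max_len,))
x = layers.Embedding(vocab_size, emb_dim, mask_zero=True)(inputs)   # x_{1:n} = E[w_1..n]
h = layers.Bidirectional(layers.LSTM(hidden))(x)                    # [RNN^f; RNN^b]
h = layers.Dense(hidden, activation="relu")(h)                      # MLP
outputs = layers.Dense(n_classes, activation="softmax")(h)          # P(class | w_{1:n})
model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```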

◮ Transducers

e.g. part-of-speech tagging:

P(t_i = k | w_{1:n}) = softmax(MLP([RNN^f(x_{1:n}, i); RNN^b(x_{n:1}, i)]))[k]
x_i = [E[w_i]; RNN^f_c(c_{1:l_i}); RNN^b_c(c_{l_i:1})]

◮ character-level RNNs are robust to unknown words and may capture affixation;
◮ encoder–decoder (sequence-to-sequence) models are coming next week.
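For comparison, a minimal Keras sketch of the transducer variant: the bi-directional LSTM returns one output per time step, which feeds a per-token softmax over tags. The character-level RNN features in x_i are omitted for brevity, and all sizes are again hypothetical.

```python
# BiLSTM transducer (tagger): one prediction per input token.
from tensorflow.keras import layers, models

vocab_size, emb_dim, hidden, n_tags, max_len = 10000, 100, 128, 17, 50

inputs = layers.Input(shape=(max_len,))
x = layers.Embedding(vocab_size, emb_dim, mask_zero=True)(inputs)
h = layers.Bidirectional(layers.LSTM(hidden, return_sequences=True))(x)  # one state per token
outputs = layers.TimeDistributed(layers.Dense(n_tags, activation="softmax"))(h)
model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```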

SLIDE 11

RNNs as Feature Extractors

SLIDE 12

Sequence Labeling in Natural Language Processing

◮ Token-level class assignments in sequential context, aka tagging;
◮ e.g. phoneme sequences, parts of speech, chunks, named entities, etc.;
◮ some structure transcending individual tokens can be approximated.

Entities: Michelle Obama → PERS, UiO → ORG

Michelle   Obama    visits   UiO     today   .
NNP        NNP      VBZ      NNP     RB      .
PERS       PERS     —        ORG     —       —
B-PERS     I-PERS   O        B-ORG   O       O
B-PERS     E-PERS   O        S-ORG   O       O

◮ IOB (aka BIO) labeling scheme—and variants—encodes constraints.
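A tiny Python illustration of the BIO encoding on the example above; the helper function spans_to_bio is mine, not from the slides.

```python
# Turn labelled entity spans into per-token BIO tags (toy helper for illustration).
def spans_to_bio(tokens, spans):
    """spans: list of (start, end, label) with end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

tokens = ["Michelle", "Obama", "visits", "UiO", "today", "."]
print(spans_to_bio(tokens, [(0, 2, "PERS"), (3, 4, "ORG")]))
# ['B-PERS', 'I-PERS', 'O', 'B-ORG', 'O', 'O']
```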

SLIDE 13

Reflections on Negation as a Tagging Task

we have never gone out without keeping a sharp watch , and no one could have escaped our notice . "

[figure: dependency parse of the example sentence, with the negation cues and scopes marked by three annotators and the corresponding per-token cue/scope labels]

◮ Sherlock (Lapponi et al., 2012, 2017) is still state of the art today;
◮ ‘flattens out’ multiple, potentially overlapping negation instances;
◮ post-classification: heuristic reconstruction of separate structures.
◮ To what degree is cue classification a sequence labeling problem?

SLIDE 14

Constituent Parsing as Sequence Labeling (1:2)

Constituent Parsing as Sequence Labeling

Carlos Gómez-Rodríguez
Universidade da Coruña
FASTPARSE Lab, LyS Group, Departamento de Computación
Campus de Elviña s/n, 15071 A Coruña, Spain
carlos.gomez@udc.es

David Vilares
Universidade da Coruña
FASTPARSE Lab, LyS Group, Departamento de Computación
Campus de Elviña s/n, 15071 A Coruña, Spain
david.vilares@udc.es

Abstract

We introduce a method to reduce constituent parsing to sequence labeling. For each word w_t, it generates a label that encodes: (1) the number of ancestors in the tree that the words w_t and w_{t+1} have in common, and (2) the non-terminal symbol at the lowest common ancestor. We first prove that the proposed encoding function is injective for any tree without unary branches. In practice, the approach is made extensible to all constituency trees by collapsing unary branches. We then use the PTB and CTB treebanks as testbeds and propose a set of [...]

[...] Zhang, 2017; Fernández-González and Gómez-Rodríguez, 2018). With an aim more related to our work, other authors have reduced constituency parsing to tasks that can be solved faster or in a more generic way. Fernández-González and Martins (2015) reduce phrase structure parsing to dependency parsing. They propose an intermediate representation where dependency labels from a head to its dependents encode the nonterminal symbol and an attachment order that is used to arrange nodes into constituents. Their approach makes it possible to use off-the-shelf dependency parsers for [...]

SLIDE 15

Constituent Parsing as Sequence Labeling (2:2)

SLIDE 16

Two Definitions of ‘Sequence Labeling’

◮ Hmm, actually, what exactly does one mean by sequence labeling?
◮ Gentle definition: class predictions for all elements, in context;
◮ pointwise classification; each individual decision is independent;
◮ no (direct) model of well-formedness conditions on the class sequence.
◮ Strict definition: sequence labeling performs structured prediction;
◮ search for a ‘globally’ optimal solution, e.g. the most probable sequence;
◮ models (properties of) the output sequence explicitly, e.g. class bi-grams;
◮ later time points impact earlier choices, i.e. revision of the path prefix;
◮ search techniques: dynamic programming, beam search, re-ranking.

SLIDE 17

Wanted: Sequence-Level Output Constraints

SLIDE 18

‘Vintage’ Machine Learning to the Rescue

Neural Architectures for Named Entity Recognition

Guillaume Lample♠ Miguel Ballesteros♣♠ Sandeep Subramanian♠ Kazuya Kawakami♠ Chris Dyer♠

♠Carnegie Mellon University ♣NLP Group, Pompeu Fabra University

{glample,sandeeps,kkawakam,cdyer}@cs.cmu.edu, miguel.ballesteros@upf.edu

Abstract

State-of-the-art named entity recognition systems rely heavily on hand-crafted features and domain-specific knowledge in order to learn effectively from the small, supervised training corpora that are available. In this paper, we introduce two new neural architectures—one based on bidirectional LSTMs and conditional random fields, and the other that constructs and labels segments using a transition-based approach inspired by shift-reduce parsers. Our models rely on two sources of information about words: character-based word [...]

[...] from unannotated corpora offers an alternative strategy for obtaining better generalization from small amounts of supervision. However, even systems that have relied extensively on unsupervised features (Collobert et al., 2011; Turian et al., 2010; Lin and Wu, 2009; Ando and Zhang, 2005b, inter alia) have used these to augment, rather than replace, hand-engineered features (e.g., knowledge about capitalization patterns and character classes in a particular language) and specialized knowledge resources (e.g., gazetteers). In this paper, we present neural architectures [...]

SLIDE 19

Abstractly: RNN Outputs as Emission Scores

SLIDE 20

Conditional Random Fields (CRF) on Top of an RNN

◮ CRFs mark the pinnacle of evolution in probabilistic sequence labeling;
◮ discriminative (like MEMMs), but avoiding the label bias problem;
◮ for an input sequence W = w_{1:n} and label sequence T = t_{1:n}:

P(t_{1:n} | w_{1:n}) = e^{score(W,T)} / Σ_{T'} e^{score(W,T')}

score(t_{1:n}, w_{1:n}) = Σ_{i=0..n} A[t_i, t_{i+1}] + Σ_{i=1..n} Y[i, t_i]

◮ Y is the (bi-)RNN output; A holds transition scores for tag bi-grams;
◮ end-to-end training: maximize the log-probability of the correct t_{1:n}.
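In code, the score of one candidate tag sequence is just a sum of table look-ups. This is a sketch; the explicit START/END indices are my assumption about how the boundary terms t_0 and t_{n+1} in the sum are realised.

```python
# Score one candidate tag sequence under the linear-chain CRF above.
# Y: (n, n_tags) array of emission scores from the bi-RNN;
# A: (n_tags + 2, n_tags + 2) array of transition scores, with dedicated
# START and END indices (an assumption for this sketch).
def crf_score(tags, Y, A, start, end):
    """score(t_1..t_n, w_1..w_n) = sum_i A[t_i, t_{i+1}] + sum_i Y[i, t_i]."""
    padded = [start] + list(tags) + [end]
    transitions = sum(A[padded[i], padded[i + 1]] for i in range(len(padded) - 1))
    emissions = sum(Y[i, t] for i, t in enumerate(tags))
    return transitions + emissions
```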

SLIDE 21

Recap: Viterbi Decoding—Thanks, Bec!

States H and C, with start/end symbols S and /S; observation sequence: 3 1 3.

Transition × emission products along the trellis:

P(H|S)·P(3|H) = 0.8 × 0.4      P(C|S)·P(3|C) = 0.2 × 0.1
P(H|H)·P(1|H) = 0.6 × 0.2      P(C|H)·P(1|C) = 0.2 × 0.5
P(H|C)·P(1|H) = 0.3 × 0.2      P(C|C)·P(1|C) = 0.5 × 0.5
P(H|H)·P(3|H) = 0.6 × 0.4      P(C|H)·P(3|C) = 0.2 × 0.1
P(H|C)·P(3|H) = 0.3 × 0.4      P(C|C)·P(3|C) = 0.5 × 0.1
P(/S|H) = 0.2                  P(/S|C) = 0.2

Viterbi values:

v_1(H) = 0.32
v_1(C) = 0.02
v_2(H) = max(.32 × .12, .02 × .06) = .0384
v_2(C) = max(.32 × .1, .02 × .25) = .032
v_3(H) = max(.0384 × .24, .032 × .12) = .0092
v_3(C) = max(.0384 × .02, .032 × .05) = .0016
v_f(/S) = max(.0092 × .2, .0016 × .2) = .0018

SLIDE 22

Some Practical Considerations (in Keras)

Variable Length of Input Sequence

◮ Although RNNs are in principle well-defined for inputs of variable length,
◮ in practice, padding to a fixed length is required for efficiency (batching);
◮ utilities like pad_sequences make that easy; beware of wasteful copying.
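For instance (an illustrative snippet; the exact import path depends on whether standalone Keras or tf.keras is in use):

```python
# pad_sequences in a nutshell: variable-length index sequences are padded
# (or truncated) to a common length so that they can be batched.
from tensorflow.keras.preprocessing.sequence import pad_sequences

batch = [[5, 8, 2], [7, 1], [4, 9, 3, 6, 2]]
print(pad_sequences(batch, maxlen=4, padding="post", value=0))
# [[5 8 2 0]
#  [7 1 0 0]
#  [9 3 6 2]]   <- too-long sequences are truncated from the front by default
```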

Evaluation

◮ Accuracy is the common metric for tagging; a fixed number of predictions;
◮ for most inputs, a dominant proportion of padding tokens (and labels);
◮ trivial predictions will inflate accuracy scores; detrimental to learning?
◮ Can define a custom function: ‘prefix accuracy’; control early stopping?
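One possible shape of such a custom metric (my sketch, not the course code): token accuracy that ignores positions whose gold label is the padding index, so padded time steps neither inflate the score nor drive early stopping.

```python
# Accuracy over non-padding positions only; assumes integer gold labels of
# shape (batch, time) with 0 reserved for padding, and per-token softmax
# predictions of shape (batch, time, n_tags). Shapes may need adjusting to
# the exact model definition.
import tensorflow as tf

def masked_accuracy(y_true, y_pred):
    y_true = tf.cast(y_true, tf.int64)
    y_hat = tf.argmax(y_pred, axis=-1)                    # predicted tag per token
    mask = tf.cast(tf.not_equal(y_true, 0), tf.float32)   # 1 for real tokens, 0 for padding
    correct = tf.cast(tf.equal(y_hat, y_true), tf.float32) * mask
    return tf.reduce_sum(correct) / tf.maximum(tf.reduce_sum(mask), 1.0)

# e.g. model.compile(..., metrics=[masked_accuracy]) and early stopping on
# 'val_masked_accuracy' rather than on the padding-inflated accuracy.
```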

Remaining Limitations

◮ A CRF layer has been in TensorFlow Community Contributions for years;
◮ available through the keras-contrib module (but limits model saving).
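From memory, wiring in the keras-contrib CRF looks roughly like this; treat it as a hedged sketch against the old standalone-Keras API, which may differ between versions.

```python
# BiLSTM + CRF tagger via keras-contrib (API recalled from memory, so a sketch).
# The CRF layer replaces the per-token softmax: the Dense layer produces the
# emission scores Y, while the CRF layer holds the transition matrix A.
from keras.models import Model
from keras.layers import Input, Embedding, Bidirectional, LSTM, TimeDistributed, Dense
from keras_contrib.layers import CRF

vocab_size, emb_dim, hidden, n_tags, max_len = 10000, 100, 128, 17, 50

inputs = Input(shape=(max_len,))
x = Embedding(vocab_size, emb_dim, mask_zero=True)(inputs)
x = Bidirectional(LSTM(hidden, return_sequences=True))(x)
x = TimeDistributed(Dense(n_tags))(x)      # emission scores
crf = CRF(n_tags)                          # transition scores live here
outputs = crf(x)
model = Model(inputs, outputs)
model.compile(optimizer="adam", loss=crf.loss_function, metrics=[crf.accuracy])
```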

SLIDE 23

Looking Ahead: Laboratory and Next Lectures

Three More Laboratory Sessions

◮ November 8 and 15: work with the final assignment on negation resolution;
◮ this week: student presentations on top performers for sentiment;
◮ live programming: develop a PoS tagger using bi-directional LSTMs.

Two More Lectures

◮ November 13: all the cool stuff: seq2seq, attention, multi-task learning;
◮ November 20: the great wrap-up; how to prepare for the exam?

Finally, Exam

◮ Not our favorite form of evaluation this year: four hours, written exam;
◮ questions mostly probe understanding; but some formulae can be useful.
