

  1. INF5820: Language Technological Applications
     Applications of Recurrent Neural Networks
     Stephan Oepen, University of Oslo, November 6, 2018

  2. Most Recently: Variants of Recurrent Neural Networks To a first approximation, the de facto consensus in NLP in 2017 is that, no matter what the task, you throw a BiLSTM at it [...] Christopher Manning, March 2017 ( https://simons.berkeley.edu/talks/christopher-manning-2017-3-27 ) Looking around at EMNLP 2018 last week, that rule of thumb seems no less valid today.

  3. Very High-Level: The RNN Abstraction, Unrolled

     s_i = R(x_i, s_{i−1}) = g(s_{i−1} W_s + x_i W_x + b)
     y_i = O(s_i) = s_i
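As a concrete illustration of the recurrence, here is a minimal numpy sketch of one unrolled step; the toy dimensions, parameter names, and random initialization are assumptions for illustration, not the course's reference implementation:

```python
import numpy as np

def rnn_step(x_i, s_prev, W_s, W_x, b):
    """One unrolled time step: s_i = g(s_{i-1} W_s + x_i W_x + b), with y_i = s_i."""
    s_i = np.tanh(s_prev @ W_s + x_i @ W_x + b)   # g = tanh
    return s_i, s_i                               # state and output coincide

# toy dimensions: 4-dimensional inputs, 3-dimensional state
rng = np.random.default_rng(0)
W_s, W_x, b = rng.normal(size=(3, 3)), rng.normal(size=(4, 3)), np.zeros(3)

s = np.zeros(3)                                   # s_0
for x in rng.normal(size=(5, 4)):                 # a sequence of five inputs
    s, y = rnn_step(x, s, W_s, W_x, b)            # same W_s, W_x at every step
```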

  4. RNNs: Take-Home Messages for the Casual User
     ◮ State vectors s_i reflect the complete history up to time point i;
     ◮ RNNs are sensitive to (basic) natural language structure: sequences;
     ◮ applicable to inputs of indeterminate and (in principle) unlimited length;
     ◮ few parameters: matrices W_s and W_x shared across all time points;
     ◮ analogous to (potentially) deep nesting: repeated multiplications;
     ◮ near-crippling practical limitation: ‘exploding’ or ‘vanishing’ gradients (see the sketch below);
     → gated RNNs: Hochreiter & Schmidhuber (1997) and Cho et al. (2014).
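The gradient problem can be illustrated with the repeated multiplications themselves. A rough sketch, using made-up toy matrices: a signal pushed through many identical linear steps either dies out or blows up, which is essentially what happens to gradients in the backward pass of a plain RNN.

```python
import numpy as np

# Repeatedly multiplying by the same recurrent matrix is what the backward pass
# does along the unrolled sequence; the norm of the propagated signal then
# scales roughly with the largest singular value of W_s raised to the sequence length.
rng = np.random.default_rng(1)
for scale in (0.5, 1.5):                       # contractive vs. expansive W_s
    W_s = scale * np.eye(3)
    v = rng.normal(size=3)                     # stand-in for an error signal
    for _ in range(50):                        # 50 time steps
        v = v @ W_s
    print(scale, np.linalg.norm(v))            # vanishes (~1e-15) vs. explodes (~1e8)
```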

  5. Essentials: Long Short-Term Memory RNNs (LSTMs)
     ◮ Three additional gates (vectors) modulate flow of information;
     ◮ state vectors s_i are partitioned into memory cells and hidden state;
     ◮ forget gate f: how much of the previous memory to keep;
     ◮ input gate i: how much of the proposed update to apply;
     ◮ output gate o: what parts of the updated memory to output.

     s_i = R(x_i, s_{i−1}) = [c_i ; h_i]
     c_i = f_i ⊙ c_{i−1} + i_i ⊙ z_i
     z_i = tanh(x_i W_x + h_{i−1} W_h)
     h_i = O(x_i, s_{i−1}) = o_i ⊙ tanh(c_i)

     ◮ More parameters: separate W_x· and W_h· matrices for each gate.
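A minimal numpy sketch of the gated update above, with sigmoid gates and separate parameter matrices per gate as noted in the last bullet; shapes and names are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_i, c_prev, h_prev, P):
    """One LSTM step; P holds (W_x, W_h, b) for the gates f, i, o and the update z."""
    f = sigmoid(x_i @ P["Wxf"] + h_prev @ P["Whf"] + P["bf"])   # forget gate
    i = sigmoid(x_i @ P["Wxi"] + h_prev @ P["Whi"] + P["bi"])   # input gate
    o = sigmoid(x_i @ P["Wxo"] + h_prev @ P["Who"] + P["bo"])   # output gate
    z = np.tanh(x_i @ P["Wxz"] + h_prev @ P["Whz"] + P["bz"])   # proposed update
    c = f * c_prev + i * z                                      # new memory cell
    h = o * np.tanh(c)                                          # new hidden state
    return c, h

# toy dimensions: 4-dimensional input, 3-dimensional cell and hidden state
rng = np.random.default_rng(2)
P = {f"Wx{g}": rng.normal(size=(4, 3)) for g in "fioz"}
P.update({f"Wh{g}": rng.normal(size=(3, 3)) for g in "fioz"})
P.update({f"b{g}": np.zeros(3) for g in "fioz"})

c, h = np.zeros(3), np.zeros(3)
for x in rng.normal(size=(5, 4)):
    c, h = lstm_step(x, c, h, P)
```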

  6. Variants: Bi-Directional Recurrent Networks

  7. Variants: ‘Deep’ (Stacked) Recurrent Networks

  8. A Side Note: Beyond Sequential Structures
     Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks
     Kai Sheng Tai, Richard Socher*, Christopher D. Manning
     Computer Science Department, Stanford University; *MetaMind Inc.
     kst@cs.stanford.edu, richard@metamind.io, manning@stanford.edu
     [figure from the paper: a linear-chain LSTM contrasted with a tree-structured LSTM over the same inputs]
     Abstract (excerpt): Because of their superior ability to preserve sequence information over time, Long Short-Term Memory (LSTM) networks, a type of recurrent neural network with a more complex computational unit, have obtained strong results on a variety of sequence modeling tasks. The only underlying LSTM structure that has been explored so far is a linear chain. However, natural language exhibits syntactic properties that would naturally combine words to phrases. We introduce the Tree-LSTM, a generalization of LSTMs to [...]
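For a flavour of how the linear chain generalizes to trees, below is a sketch in the spirit of the child-sum Tree-LSTM variant described in the paper: child hidden states are summed, and each child gets its own forget gate. Shapes, parameter names, and the toy usage are assumptions for illustration, not the authors' code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def child_sum_tree_lstm(x_j, children, P):
    """Update one tree node; children is a list of (c_k, h_k) pairs of its children."""
    h_sum = sum((h for _, h in children), np.zeros_like(x_j @ P["Wi"]))
    i = sigmoid(x_j @ P["Wi"] + h_sum @ P["Ui"] + P["bi"])          # input gate
    o = sigmoid(x_j @ P["Wo"] + h_sum @ P["Uo"] + P["bo"])          # output gate
    u = np.tanh(x_j @ P["Wu"] + h_sum @ P["Uu"] + P["bu"])          # proposed update
    c = i * u
    for c_k, h_k in children:                                       # one forget gate per child
        f_k = sigmoid(x_j @ P["Wf"] + h_k @ P["Uf"] + P["bf"])
        c = c + f_k * c_k
    h = o * np.tanh(c)
    return c, h

# toy usage: 4-dimensional node inputs, 3-dimensional cell and hidden state
rng = np.random.default_rng(3)
P = {f"W{g}": rng.normal(size=(4, 3)) for g in "ioud" if g != "d"} | {"Wf": rng.normal(size=(4, 3))}
P.update({f"U{g}": rng.normal(size=(3, 3)) for g in "iouf"})
P.update({f"b{g}": np.zeros(3) for g in "iouf"})

leaf = child_sum_tree_lstm(rng.normal(size=4), [], P)               # leaves have no children
root = child_sum_tree_lstm(rng.normal(size=4), [leaf, leaf], P)
```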

  9. Tree LSTMs Help Leverage Syntactic Structure
     [text] A woman is slicing a tomato.
     [hypothesis] A vegetable is being cut by a woman.

     Method                                      Pearson's r      Spearman's ρ     MSE
     Illinois-LH (Lai and Hockenmaier, 2014)     0.7993           0.7538           0.3692
     UNAL-NLP (Jimenez et al., 2014)             0.8070           0.7489           0.3550
     Meaning Factory (Bjerva et al., 2014)       0.8268           0.7721           0.3224
     ECNU (Zhao et al., 2014)                    0.8414           –                –
     Mean vectors                                0.7577 (0.0013)  0.6738 (0.0027)  0.4557 (0.0090)
     DT-RNN (Socher et al., 2014)                0.7923 (0.0070)  0.7319 (0.0071)  0.3822 (0.0137)
     SDT-RNN (Socher et al., 2014)               0.7900 (0.0042)  0.7304 (0.0076)  0.3848 (0.0074)
     LSTM                                        0.8528 (0.0031)  0.7911 (0.0059)  0.2831 (0.0092)
     Bidirectional LSTM                          0.8567 (0.0028)  0.7966 (0.0053)  0.2736 (0.0063)
     2-layer LSTM                                0.8515 (0.0066)  0.7896 (0.0088)  0.2838 (0.0150)
     2-layer Bidirectional LSTM                  0.8558 (0.0014)  0.7965 (0.0018)  0.2762 (0.0020)
     Constituency Tree-LSTM                      0.8582 (0.0038)  0.7966 (0.0053)  0.2734 (0.0108)
     Dependency Tree-LSTM                        0.8676 (0.0030)  0.8083 (0.0042)  0.2532 (0.0052)

     Table 3: Test set results on the SICK semantic relatedness subtask. For our experiments, we report mean scores over 5 runs (standard deviations in parentheses). Results are grouped as follows: (1) SemEval 2014 submissions; (2) Our own baselines; (3) Sequential LSTMs; (4) Tree-structured LSTMs.

  10. Common Applications of RNNs (in NLP)
      ◮ Acceptors, e.g. (sentence-level) sentiment classification:

        P(class = k | w_{1:n}) = ŷ[k]
        ŷ = softmax(MLP([RNN^f(x_{1:n}) ; RNN^b(x_{n:1})]))
        x_{1:n} = E[w_1], ..., E[w_n]

      ◮ Transducers, e.g. part-of-speech tagging:

        P(t_i = k | w_{1:n}) = softmax(MLP([RNN^f(x_{1:n}, i) ; RNN^b(x_{n:1}, i)]))[k]
        x_i = [E[w_i] ; RNN^f_c(c_{1:l_i}) ; RNN^b_c(c_{l_i:1})]

      ◮ character-level RNNs robust to unknown words; may capture affixation;
      ◮ encoder–decoder (sequence-to-sequence) models coming next week.
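A compact PyTorch sketch of the acceptor pattern above: embed the words, run a bidirectional LSTM, concatenate the final forward and backward states, and classify with an MLP. The class name, dimensions, and hyperparameters are illustrative assumptions, not a prescribed architecture.

```python
import torch
import torch.nn as nn

class BiLSTMAcceptor(nn.Module):
    """Sentence classifier: the concatenated end states [RNN^f(x_1:n); RNN^b(x_n:1)]
    are fed through an MLP, and softmax over the result gives P(class = k | w_1:n)."""
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=128, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim),
                                 nn.Tanh(),
                                 nn.Linear(hidden_dim, n_classes))

    def forward(self, word_ids):                      # word_ids: (batch, n)
        x = self.embed(word_ids)                      # (batch, n, emb_dim)
        _, (h_n, _) = self.lstm(x)                    # h_n: (2, batch, hidden_dim)
        final = torch.cat([h_n[0], h_n[1]], dim=-1)   # forward and backward end states
        return self.mlp(final)                        # unnormalized class scores

model = BiLSTMAcceptor(vocab_size=10000)
logits = model(torch.randint(0, 10000, (8, 12)))      # batch of 8 toy sentences, length 12
probs = torch.softmax(logits, dim=-1)                 # P(class = k | w_1:n)
```

A transducer for tagging differs only in that the per-position BiLSTM outputs (optionally concatenated with character-level RNN states) are classified, rather than the sequence-final states.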

  11. RNNs as Feature Extractors

  12. Sequence Labeling in Natural Language Processing
      ◮ Token-level class assignments in sequential context, aka tagging;
      ◮ e.g. phoneme sequences, parts of speech, chunks, named entities, etc.
      ◮ some structure transcending individual tokens can be approximated.

        Michelle   Obama    visits   UiO     today   .
        NNP        NNP      VBZ      NNP     RB      .
        PERS       PERS     —        ORG     —       —
        B-PERS     I-PERS   O        B-ORG   O       O
        B-PERS     E-PERS   O        S-ORG   O       O

      ◮ the IOB (aka BIO) labeling scheme, and its variants, encodes span constraints.
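The IOB/BIOES rows in the example can be derived mechanically from entity spans; a small sketch where the helper and its span format are hypothetical, not taken from any particular toolkit:

```python
def spans_to_bioes(n_tokens, spans):
    """spans: list of (start, end, label) with end exclusive, e.g. (0, 2, 'PERS')."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"S-{label}"                      # single-token entity
        else:
            tags[start] = f"B-{label}"                      # begin
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"                      # inside
            tags[end - 1] = f"E-{label}"                    # end
    return tags

tokens = ["Michelle", "Obama", "visits", "UiO", "today", "."]
print(spans_to_bioes(len(tokens), [(0, 2, "PERS"), (3, 4, "ORG")]))
# ['B-PERS', 'E-PERS', 'O', 'S-ORG', 'O', 'O']
```

For plain IOB, the S and E distinctions collapse to B and I, respectively.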

  13. Reflections on Negation as a Tagging Task
      [figure: dependency tree over the example "we have never gone out without keeping a sharp watch, and no one could have escaped our notice.", with three (partially overlapping) negation annotations, cues and scopes, flattened into a single token-level label sequence over {CUE, N, E, S, O}]
      ◮ Sherlock (Lapponi et al., 2012, 2017) still state of the art today;
      ◮ ‘flattens out’ multiple, potentially overlapping negation instances;
      ◮ post-classification: heuristic reconstruction of separate structures.
      ◮ To what degree is cue classification a sequence labeling problem?

  14. Constituent Parsing as Sequence Labeling (1:2)
      Constituent Parsing as Sequence Labeling
      Carlos Gómez-Rodríguez and David Vilares
      FASTPARSE Lab, LyS Group, Departamento de Computación, Universidade da Coruña, Campus de Elviña s/n, 15071 A Coruña, Spain
      carlos.gomez@udc.es, david.vilares@udc.es

      Abstract (excerpt): We introduce a method to reduce constituent parsing to sequence labeling. For each word w_t, it generates a label that encodes: (1) the number of ancestors in the tree that the words w_t and w_{t+1} have in common, and (2) the nonterminal symbol at the lowest common ancestor. We first prove that the proposed encoding function is injective for any tree without unary branches. In practice, the approach is made extensible to all constituency trees by collapsing unary branches. We then use the PTB and CTB treebanks as testbeds and propose a set of [...]

      From the first page, second column (excerpt): [...] Zhang, 2017; Fernández-González and Gómez-Rodríguez, 2018). With an aim more related to our work, other authors have reduced constituency parsing to tasks that can be solved faster or in a more generic way. Fernández-González and Martins (2015) reduce phrase structure parsing to dependency parsing. They propose an intermediate representation where dependency labels from a head to its dependents encode the nonterminal symbol and an attachment order that is used to arrange nodes into constituents. Their approach makes it possible to use off-the-shelf dependency parsers for [...]
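To make the label encoding concrete, here is a sketch of one reading of it using nltk's Tree class: each word (except the last) is paired with the number of tree ancestors it shares with the next word and the nonterminal at their lowest common ancestor. This deliberately ignores the paper's relative variants and unary-collapsing; the function name, the root-counting convention, and the toy tree are assumptions for illustration.

```python
from nltk import Tree

def common_ancestor_encoding(tree):
    """Pair each word w_t (except the last) with (number of ancestors shared with
    w_{t+1}, nonterminal at their lowest common ancestor). Simplified sketch."""
    leaves = tree.treepositions("leaves")        # leaf positions, left to right
    labels = []
    for p, q in zip(leaves, leaves[1:]):
        k = 0
        while k < min(len(p), len(q)) and p[k] == q[k]:
            k += 1                               # length of the common path prefix
        lca = tree[p[:k]]                        # lowest common ancestor node
        labels.append((k + 1, lca.label()))      # +1 counts the root as a shared ancestor
    return labels

t = Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBZ barks) (ADVP (RB loudly))))")
print(common_ancestor_encoding(t))
# [(2, 'NP'), (1, 'S'), (2, 'VP')] under this reading of the encoding
```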

  15. Constituent Parsing as Sequence Labeling (2:2)

  16. Two Definitions of ‘Sequence Labeling’
      ◮ Hmm, actually, what exactly does one mean by sequence labeling?
      ◮ gentle definition: class predictions for all elements, in context;
        ◮ pointwise classification; each individual decision is independent;
        ◮ no (direct) model of wellformedness conditions on class sequence;
      ◮ strict definition: sequence labeling performs structured prediction;
        ◮ search for ‘globally’ optimal solution, e.g. most probable sequence;
        ◮ models (properties of) output sequence explicitly, e.g. class bi-grams;
        ◮ later time points impact earlier choices, i.e. revision of path prefix;
        ◮ search techniques: dynamic programming, beam search, re-ranking.
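To make the strict reading concrete, here is a small sketch of structured decoding: given per-position class scores (e.g. from a BiLSTM) and class bi-gram transition scores, Viterbi dynamic programming returns the globally best label sequence rather than each position's independent argmax. The toy scores are made up precisely so that the two readings disagree.

```python
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (n, K) per-position class scores; transitions: (K, K) class
    bi-gram scores. Returns the globally highest-scoring label sequence."""
    n, K = emissions.shape
    score = emissions[0].copy()                       # best score ending in each class
    backptr = np.zeros((n, K), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + transitions + emissions[t][None, :]   # (prev, curr)
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    best = [int(score.argmax())]
    for t in range(n - 1, 0, -1):                     # follow back-pointers
        best.append(int(backptr[t][best[-1]]))
    return best[::-1]

emissions = np.array([[2.0, 1.0], [0.5, 1.0], [2.0, 1.9]])   # pointwise argmax: [0, 1, 0]
transitions = np.array([[0.0, -3.0], [-3.0, 0.0]])           # strongly dislikes label changes
print(viterbi(emissions, transitions))                        # globally best: [0, 0, 0]
```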

  17. Wanted: Sequence-Level Output Constraints
