

SLIDE 1

INF5820: Language Technological Applications Applications of Recurrent Neural Networks

Stephan Oepen

University of Oslo

November 6, 2018

SLIDE 2

Most Recently: Variants of Recurrent Neural Networks

To a first approximation, the de facto consensus in NLP in 2017 is that, no matter what the task, you throw a BiLSTM at it [...]

Christopher Manning, March 2017

(https://simons.berkeley.edu/talks/christopher-manning-2017-3-27)

Looking around at EMNLP 2018 last week, that rule of thumb seems no less valid today.

SLIDE 3

Very High-Level: The RNN Abstraction—Unrolled

s_i = R(x_i, s_{i-1}) = g(s_{i-1} W^s + x_i W^x + b)
y_i = O(s_i) = s_i
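To make the recurrence concrete, here is a minimal NumPy sketch of one unrolled step; this is not course code, and the choice of tanh for g as well as all dimensions are illustrative assumptions.

```python
# Minimal sketch of the simple (Elman-style) RNN step above, NumPy only.
import numpy as np

d_x, d_s = 4, 3                        # input and state dimensionality (made up)
rng = np.random.default_rng(0)
W_x = rng.normal(size=(d_x, d_s))      # shared input-to-state weights
W_s = rng.normal(size=(d_s, d_s))      # shared state-to-state weights
b = np.zeros(d_s)

def rnn_step(x_i, s_prev):
    """One application of R: s_i = g(s_{i-1} W^s + x_i W^x + b)."""
    return np.tanh(s_prev @ W_s + x_i @ W_x + b)   # y_i = O(s_i) = s_i

s = np.zeros(d_s)                      # s_0
for x in rng.normal(size=(5, d_x)):    # a toy sequence of five input vectors
    s = rnn_step(x, s)                 # the state now reflects the full history
```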

SLIDE 4

RNNs: Take-Home Messages for the Casual User

◮ State vectors s_i reflect the complete history up to time point i;
◮ RNNs are sensitive to (basic) natural language structure: sequences;
◮ applicable to inputs of indeterminate and (in principle) unlimited length;
◮ few parameters: matrices W^s and W^x are shared across all time points;
◮ analogous to (potentially) deep nesting: repeated multiplications;
◮ near-crippling practical limitation: ‘exploding’ or ‘vanishing’ gradients (illustrated in the sketch after this list);

→ gated RNNs: Hochreiter & Schmidhuber (1997) and Cho et al. (2014).
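The gradient problem can be illustrated with a toy NumPy calculation (my sketch, not from the slides): backpropagation through time multiplies the error signal by the recurrent weights once per time step, so its norm tends either to shrink towards zero or to blow up, depending on the scale of those weights.

```python
# Toy illustration of vanishing/exploding gradients: repeatedly multiplying by
# the recurrent matrix, as backpropagation through time does. All numbers are
# made up; the Jacobian of the non-linearity is ignored for simplicity.
import numpy as np

rng = np.random.default_rng(1)
for scale in (0.5, 1.5):                        # 'small' vs 'large' recurrent weights
    W_s = scale * rng.normal(size=(3, 3)) / np.sqrt(3)
    signal = np.eye(3)
    for _ in range(50):                         # 50 time steps of backpropagation
        signal = signal @ W_s.T
    print(scale, np.linalg.norm(signal))        # shrinks towards 0 or explodes
```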

SLIDE 5

Essentials: Long Short-Term Memory RNNs (LSTMs)

◮ Three additional gates (vectors) modulate the flow of information;
◮ state vectors s_i are partitioned into a memory cell and a hidden state;
◮ forget gate f: how much of the previous memory to keep;
◮ input gate i: how much of the proposed update to apply;
◮ output gate o: what parts of the updated memory to output.

s_i = R(x_i, s_{i-1}) = [c_i; h_i]
c_i = f_i ⊙ c_{i-1} + i_i ⊙ z_i
z_i = tanh(x_i W^x + h_{i-1} W^h)
h_i = O(x_i, s_{i-1}) = o_i ⊙ tanh(c_i)

◮ More parameters: separate W^x and W^h matrices for each gate.
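A step-by-step NumPy sketch of these equations follows; again this is not course code, and the sigmoid parameterisation of the gates and all dimensions are conventional assumptions.

```python
# One LSTM step following the equations above.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d_x, d_h = 4, 3
rng = np.random.default_rng(2)
# separate W^x and W^h matrices for the input (i), forget (f) and output (o)
# gates and for the proposed update z
W_x = {g: rng.normal(size=(d_x, d_h)) for g in "ifoz"}
W_h = {g: rng.normal(size=(d_h, d_h)) for g in "ifoz"}

def lstm_step(x, c_prev, h_prev):
    i = sigmoid(x @ W_x["i"] + h_prev @ W_h["i"])   # input gate
    f = sigmoid(x @ W_x["f"] + h_prev @ W_h["f"])   # forget gate
    o = sigmoid(x @ W_x["o"] + h_prev @ W_h["o"])   # output gate
    z = np.tanh(x @ W_x["z"] + h_prev @ W_h["z"])   # proposed update
    c = f * c_prev + i * z                          # new memory cell c_i
    h = o * np.tanh(c)                              # new hidden state h_i
    return c, h                                     # together: s_i = [c_i; h_i]
```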

SLIDE 6

Variants: Bi-Directional Recurrent Networks

SLIDE 7

Variants: ‘Deep’ (Stacked) Recurrent Networks

SLIDE 8

A Side Note: Beyond Sequential Structures

Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks

Kai Sheng Tai, Richard Socher*, Christopher D. Manning
Computer Science Department, Stanford University; *MetaMind Inc.
kst@cs.stanford.edu, richard@metamind.io, manning@stanford.edu

Abstract

Because of their superior ability to preserve sequence information over time, Long Short-Term Memory (LSTM) networks, a type of recurrent neural network with a more complex computational unit, have obtained strong results on a variety of sequence modeling tasks. The only underlying LSTM structure that has been explored so far is a linear chain. However, natural language exhibits syntactic properties that would naturally combine words to phrases. We introduce the Tree-LSTM, a generalization of LSTMs to [...]

[figure: chain-structured vs. tree-structured composition of inputs x_i into outputs y_i]

SLIDE 9

Tree LSTMs Help Leverage Syntactic Structure

[text] A woman is slicing a tomato.
[hypothesis] A vegetable is being cut by a woman.

Method                                     Pearson's r       Spearman's ρ      MSE
Illinois-LH (Lai and Hockenmaier, 2014)    0.7993            0.7538            0.3692
UNAL-NLP (Jimenez et al., 2014)            0.8070            0.7489            0.3550
Meaning Factory (Bjerva et al., 2014)      0.8268            0.7721            0.3224
ECNU (Zhao et al., 2014)                   0.8414            –                 –
Mean vectors                               0.7577 (0.0013)   0.6738 (0.0027)   0.4557 (0.0090)
DT-RNN (Socher et al., 2014)               0.7923 (0.0070)   0.7319 (0.0071)   0.3822 (0.0137)
SDT-RNN (Socher et al., 2014)              0.7900 (0.0042)   0.7304 (0.0076)   0.3848 (0.0074)
LSTM                                       0.8528 (0.0031)   0.7911 (0.0059)   0.2831 (0.0092)
Bidirectional LSTM                         0.8567 (0.0028)   0.7966 (0.0053)   0.2736 (0.0063)
2-layer LSTM                               0.8515 (0.0066)   0.7896 (0.0088)   0.2838 (0.0150)
2-layer Bidirectional LSTM                 0.8558 (0.0014)   0.7965 (0.0018)   0.2762 (0.0020)
Constituency Tree-LSTM                     0.8582 (0.0038)   0.7966 (0.0053)   0.2734 (0.0108)
Dependency Tree-LSTM                       0.8676 (0.0030)   0.8083 (0.0042)   0.2532 (0.0052)

Table 3: Test set results on the SICK semantic relatedness subtask. For our experiments, we report mean scores over 5 runs (standard deviations in parentheses). Results are grouped as follows: (1) SemEval 2014 submissions; (2) Our own baselines; (3) Sequential LSTMs; (4) Tree-structured LSTMs.

SLIDE 10

Common Applications of RNNs (in NLP)

◮ Acceptors

e.g. (sentence-level) sentiment classification:

P(class = k | w_{1:n}) = ŷ[k]
ŷ = softmax(MLP([RNN^f(x_{1:n}); RNN^b(x_{n:1})]))
x_{1:n} = E[w_1], ..., E[w_n]
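Since the course works in Keras, here is one way such an acceptor could look there; this is a hedged sketch, not the course solution, and vocabulary size, dimensions and the number of classes are made-up example values.

```python
# BiLSTM acceptor: final forward and backward states feed an MLP and softmax.
from tensorflow.keras import layers, models

vocab_size, emb_dim, hidden, n_classes, max_len = 10000, 100, 128, 3, 50

inputs = layers.Input(shape=(max_len,))
x = layers.Embedding(vocab_size, emb_dim, mask_zero=True)(inputs)   # x_{1:n} = E[w_1..n]
h = layers.Bidirectional(layers.LSTM(hidden))(x)                    # [RNN^f; RNN^b]
h = layers.Dense(hidden, activation="relu")(h)                      # MLP
outputs = layers.Dense(n_classes, activation="softmax")(h)          # P(class | w_{1:n})
model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```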

◮ Transducers

e.g. part-of-speech tagging:

P(t_i = k | w_{1:n}) = softmax(MLP([RNN^f(x_{1:n}, i); RNN^b(x_{n:1}, i)]))[k]
x_i = [E[w_i]; RNN^f_c(c_{1:l_i}); RNN^b_c(c_{l_i:1})]

◮ character-level RNNs are robust to unknown words and may capture affixation;
◮ encoder–decoder (sequence-to-sequence) models are coming next week.
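For comparison, a minimal Keras sketch of the transducer variant: the bi-directional LSTM returns one output per time step, which feeds a per-token softmax over tags. The character-level RNN features in x_i are omitted for brevity, and all sizes are again hypothetical.

```python
# BiLSTM transducer (tagger): one prediction per input token.
from tensorflow.keras import layers, models

vocab_size, emb_dim, hidden, n_tags, max_len = 10000, 100, 128, 17, 50

inputs = layers.Input(shape=(max_len,))
x = layers.Embedding(vocab_size, emb_dim, mask_zero=True)(inputs)
h = layers.Bidirectional(layers.LSTM(hidden, return_sequences=True))(x)  # one state per token
outputs = layers.TimeDistributed(layers.Dense(n_tags, activation="softmax"))(h)
model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```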

SLIDE 11

RNNs as Feature Extractors

SLIDE 12

Sequence Labeling in Natural Language Processing

◮ Token-level class assignments in sequential context, aka tagging;
◮ e.g. phoneme sequences, parts of speech, chunks, named entities, etc.;
◮ some structure transcending individual tokens can be approximated.

Entities: Michelle Obama → PERS, UiO → ORG

Michelle   Obama    visits   UiO     today   .
NNP        NNP      VBZ      NNP     RB      .
PERS       PERS     —        ORG     —       —
B-PERS     I-PERS   O        B-ORG   O       O
B-PERS     E-PERS   O        S-ORG   O       O

◮ IOB (aka BIO) labeling scheme—and variants—encodes constraints.
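A tiny Python illustration of the BIO encoding on the example above; the helper function spans_to_bio is mine, not from the slides.

```python
# Turn labelled entity spans into per-token BIO tags (toy helper for illustration).
def spans_to_bio(tokens, spans):
    """spans: list of (start, end, label) with end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

tokens = ["Michelle", "Obama", "visits", "UiO", "today", "."]
print(spans_to_bio(tokens, [(0, 2, "PERS"), (3, 4, "ORG")]))
# ['B-PERS', 'I-PERS', 'O', 'B-ORG', 'O', 'O']
```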

SLIDE 13

Reflections on Negation as a Tagging Task

we have never gone out without keeping a sharp watch , and no one could have escaped our notice . "

[figure: dependency parse of the example sentence, with the negation cues and scopes marked by three annotators and the corresponding per-token cue/scope labels]

◮ Sherlock (Lapponi et al., 2012, 2017) is still state of the art today;
◮ ‘flattens out’ multiple, potentially overlapping negation instances;
◮ post-classification: heuristic reconstruction of separate structures.
◮ To what degree is cue classification a sequence labeling problem?

SLIDE 14

Constituent Parsing as Sequence Labeling (1:2)

Constituent Parsing as Sequence Labeling

Carlos Gómez-Rodríguez
Universidade da Coruña
FASTPARSE Lab, LyS Group, Departamento de Computación
Campus de Elviña s/n, 15071 A Coruña, Spain
carlos.gomez@udc.es

David Vilares
Universidade da Coruña
FASTPARSE Lab, LyS Group, Departamento de Computación
Campus de Elviña s/n, 15071 A Coruña, Spain
david.vilares@udc.es

Abstract

We introduce a method to reduce constituent parsing to sequence labeling. For each word w_t, it generates a label that encodes: (1) the number of ancestors in the tree that the words w_t and w_{t+1} have in common, and (2) the non-terminal symbol at the lowest common ancestor. We first prove that the proposed encoding function is injective for any tree without unary branches. In practice, the approach is made extensible to all constituency trees by collapsing unary branches. We then use the PTB and CTB treebanks as testbeds and propose a set of [...]

[...] Zhang, 2017; Fernández-González and Gómez-Rodríguez, 2018). With an aim more related to our work, other authors have reduced constituency parsing to tasks that can be solved faster or in a more generic way. Fernández-González and Martins (2015) reduce phrase structure parsing to dependency parsing. They propose an intermediate representation where dependency labels from a head to its dependents encode the nonterminal symbol and an attachment order that is used to arrange nodes into constituents. Their approach makes it possible to use off-the-shelf dependency parsers for [...]

SLIDE 15

Constituent Parsing as Sequence Labeling (2:2)

SLIDE 16

Two Definitions of ‘Sequence Labeling’

◮ Hmm, actually, what exactly does one mean by sequence labeling?
◮ Gentle definition: class predictions for all elements, in context;
◮ pointwise classification; each individual decision is independent;
◮ no (direct) model of well-formedness conditions on the class sequence.
◮ Strict definition: sequence labeling performs structured prediction;
◮ search for a ‘globally’ optimal solution, e.g. the most probable sequence;
◮ models (properties of) the output sequence explicitly, e.g. class bi-grams;
◮ later time points impact earlier choices, i.e. revision of the path prefix;
◮ search techniques: dynamic programming, beam search, re-ranking.

SLIDE 17

Wanted: Sequence-Level Output Constraints

SLIDE 18

‘Vintage’ Machine Learning to the Rescue

Neural Architectures for Named Entity Recognition

Guillaume Lample♠ Miguel Ballesteros♣♠ Sandeep Subramanian♠ Kazuya Kawakami♠ Chris Dyer♠

♠Carnegie Mellon University ♣NLP Group, Pompeu Fabra University

{glample,sandeeps,kkawakam,cdyer}@cs.cmu.edu, miguel.ballesteros@upf.edu

Abstract

State-of-the-art named entity recognition systems rely heavily on hand-crafted features and domain-specific knowledge in order to learn effectively from the small, supervised training corpora that are available. In this paper, we introduce two new neural architectures—one based on bidirectional LSTMs and conditional random fields, and the other that constructs and labels segments using a transition-based approach inspired by shift-reduce parsers. Our models rely on two sources of information about words: character-based word [...]

[...] from unannotated corpora offers an alternative strategy for obtaining better generalization from small amounts of supervision. However, even systems that have relied extensively on unsupervised features (Collobert et al., 2011; Turian et al., 2010; Lin and Wu, 2009; Ando and Zhang, 2005b, inter alia) have used these to augment, rather than replace, hand-engineered features (e.g., knowledge about capitalization patterns and character classes in a particular language) and specialized knowledge resources (e.g., gazetteers). In this paper, we present neural architectures [...]

SLIDE 19

Abstractly: RNN Outputs as Emission Scores

SLIDE 20

Conditional Random Fields (CRF) on Top of an RNN

◮ CRFs mark the pinnacle of evolution in probabilistic sequence labeling;
◮ discriminative (like MEMMs), but avoiding the label bias problem;
◮ for an input sequence W = w_{1:n} and label sequence T = t_{1:n}:

P(t_{1:n} | w_{1:n}) = e^{score(W,T)} / Σ_{T'} e^{score(W,T')}

score(t_{1:n}, w_{1:n}) = Σ_{i=0..n} A[t_i, t_{i+1}] + Σ_{i=1..n} Y[i, t_i]

◮ Y is the (bi-)RNN output; A holds transition scores for tag bi-grams;
◮ end-to-end training: maximize the log-probability of the correct t_{1:n}.
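In code, the score of one candidate tag sequence is just a sum of table look-ups. This is a sketch; the explicit START/END indices are my assumption about how the boundary terms t_0 and t_{n+1} in the sum are realised.

```python
# Score one candidate tag sequence under the linear-chain CRF above.
# Y: (n, n_tags) array of emission scores from the bi-RNN;
# A: (n_tags + 2, n_tags + 2) array of transition scores, with dedicated
# START and END indices (an assumption for this sketch).
def crf_score(tags, Y, A, start, end):
    """score(t_1..t_n, w_1..w_n) = sum_i A[t_i, t_{i+1}] + sum_i Y[i, t_i]."""
    padded = [start] + list(tags) + [end]
    transitions = sum(A[padded[i], padded[i + 1]] for i in range(len(padded) - 1))
    emissions = sum(Y[i, t] for i, t in enumerate(tags))
    return transitions + emissions
```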

SLIDE 21

Recap: Viterbi Decoding—Thanks, Bec!

States H and C, with start/end symbols S and /S; observation sequence: 3 1 3.

Transition × emission products along the trellis:

P(H|S)·P(3|H) = 0.8 × 0.4      P(C|S)·P(3|C) = 0.2 × 0.1
P(H|H)·P(1|H) = 0.6 × 0.2      P(C|H)·P(1|C) = 0.2 × 0.5
P(H|C)·P(1|H) = 0.3 × 0.2      P(C|C)·P(1|C) = 0.5 × 0.5
P(H|H)·P(3|H) = 0.6 × 0.4      P(C|H)·P(3|C) = 0.2 × 0.1
P(H|C)·P(3|H) = 0.3 × 0.4      P(C|C)·P(3|C) = 0.5 × 0.1
P(/S|H) = 0.2                  P(/S|C) = 0.2

Viterbi values:

v_1(H) = 0.32
v_1(C) = 0.02
v_2(H) = max(.32 × .12, .02 × .06) = .0384
v_2(C) = max(.32 × .1, .02 × .25) = .032
v_3(H) = max(.0384 × .24, .032 × .12) = .0092
v_3(C) = max(.0384 × .02, .032 × .05) = .0016
v_f(/S) = max(.0092 × .2, .0016 × .2) = .0018

SLIDE 22

Some Practical Considerations (in Keras)

Variable Length of Input Sequence

◮ Although RNNs are in principle well-defined for inputs of variable length,
◮ in practice, padding to a fixed length is required for efficiency (batching);
◮ utilities like pad_sequences make that easy; beware of wasteful copying.
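For instance (an illustrative snippet; the exact import path depends on whether standalone Keras or tf.keras is in use):

```python
# pad_sequences in a nutshell: variable-length index sequences are padded
# (or truncated) to a common length so that they can be batched.
from tensorflow.keras.preprocessing.sequence import pad_sequences

batch = [[5, 8, 2], [7, 1], [4, 9, 3, 6, 2]]
print(pad_sequences(batch, maxlen=4, padding="post", value=0))
# [[5 8 2 0]
#  [7 1 0 0]
#  [9 3 6 2]]   <- too-long sequences are truncated from the front by default
```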

Evaluation

◮ Accuracy is the common metric for tagging; a fixed number of predictions;
◮ for most inputs, a dominant proportion of padding tokens (and labels);
◮ trivial predictions will inflate accuracy scores; detrimental to learning?
◮ Can define a custom function: ‘prefix accuracy’; control early stopping?
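One possible shape of such a custom metric (my sketch, not the course code): token accuracy that ignores positions whose gold label is the padding index, so padded time steps neither inflate the score nor drive early stopping.

```python
# Accuracy over non-padding positions only; assumes integer gold labels of
# shape (batch, time) with 0 reserved for padding, and per-token softmax
# predictions of shape (batch, time, n_tags). Shapes may need adjusting to
# the exact model definition.
import tensorflow as tf

def masked_accuracy(y_true, y_pred):
    y_true = tf.cast(y_true, tf.int64)
    y_hat = tf.argmax(y_pred, axis=-1)                    # predicted tag per token
    mask = tf.cast(tf.not_equal(y_true, 0), tf.float32)   # 1 for real tokens, 0 for padding
    correct = tf.cast(tf.equal(y_hat, y_true), tf.float32) * mask
    return tf.reduce_sum(correct) / tf.maximum(tf.reduce_sum(mask), 1.0)

# e.g. model.compile(..., metrics=[masked_accuracy]) and early stopping on
# 'val_masked_accuracy' rather than on the padding-inflated accuracy.
```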

Remaining Limitations

◮ A CRF layer has been in TensorFlow Community Contributions for years;
◮ available through the keras-contrib module (but limits model saving).
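From memory, wiring in the keras-contrib CRF looks roughly like this; treat it as a hedged sketch against the old standalone-Keras API, which may differ between versions.

```python
# BiLSTM + CRF tagger via keras-contrib (API recalled from memory, so a sketch).
# The CRF layer replaces the per-token softmax: the Dense layer produces the
# emission scores Y, while the CRF layer holds the transition matrix A.
from keras.models import Model
from keras.layers import Input, Embedding, Bidirectional, LSTM, TimeDistributed, Dense
from keras_contrib.layers import CRF

vocab_size, emb_dim, hidden, n_tags, max_len = 10000, 100, 128, 17, 50

inputs = Input(shape=(max_len,))
x = Embedding(vocab_size, emb_dim, mask_zero=True)(inputs)
x = Bidirectional(LSTM(hidden, return_sequences=True))(x)
x = TimeDistributed(Dense(n_tags))(x)      # emission scores
crf = CRF(n_tags)                          # transition scores live here
outputs = crf(x)
model = Model(inputs, outputs)
model.compile(optimizer="adam", loss=crf.loss_function, metrics=[crf.accuracy])
```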

SLIDE 23

Looking Ahead: Laboratory and Next Lectures

Three More Laboratory Sessions

◮ November 8 and 15: work with the final assignment on negation resolution;
◮ this week: student presentations on top performers for sentiment;
◮ live programming: develop a PoS tagger using bi-directional LSTMs.

Two More Lectures

◮ November 13: all the cool stuff: seq2seq, attention, multi-task learning;
◮ November 20: the great wrap-up; how to prepare for the exam?

Finally, Exam

◮ Not our favorite form of evaluation this year: four hours, written exam;
◮ questions mostly probe understanding; but some formulae can be useful.
