SLIDE 1

Recurrent neural network grammars

Slide credits: Chris Dyer, Adhiguna Kuncoro

SLIDE 2

Widespread phenomenon: Polarity items can only appear in certain contexts

Example: “anybody” is a polarity item that tends to appear only in specific contexts:

  The lecture that I gave did not appeal to anybody

but not:

  * The lecture that I gave appealed to anybody
  * The lecture that I did not give appealed to anybody

We might infer that the licensing context is the word “not” appearing among the preceding words, and you could use an RNN to model this. However:

SLIDE 3

Language is hierarchical

The licensing context depends on recursive structure (syntax):

  The lecture that I gave did not appeal to anybody
  * The lecture that I did not give appealed to anybody

SLIDES 4-6

One theory of hierarchy

  • Generate symbols sequentially using an RNN
  • Add some “control symbols” to rewrite the history periodically
  • Periodically “compress” a sequence into a single “constituent”
  • Augment the RNN with an operation to compress recent history into a single vector (“reduce”)
  • The RNN predicts the next symbol based on the history of compressed elements and non-compressed terminals (“shift” or “generate”)
  • The RNN must also predict the “control symbols” that decide how big constituents are
  • We call such models recurrent neural network grammars (RNNGs). A minimal sketch of this action inventory follows.
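As a concrete, purely illustrative rendering of the action inventory, here is a minimal Python sketch; the type names mirror the slides’ terminology, but the representation itself is an assumption of this writeup:

    from typing import NamedTuple, Union

    class NT(NamedTuple):      # "control symbol": open a nonterminal, e.g. NT("NP")
        label: str

    class GEN(NamedTuple):     # generate ("shift") the next terminal word
        word: str

    class REDUCE(NamedTuple):  # "control symbol": compress the most recent
        pass                   # constituent into a single vector

    Action = Union[NT, GEN, REDUCE]

    # The start of a derivation for "The hungry cat meows .":
    print([NT("S"), NT("NP"), GEN("The"), GEN("hungry"), GEN("cat"), REDUCE()])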
SLIDE 7

(Ordered) tree traversals are sequences

[Figure: the parse tree (S (NP The hungry cat) (VP meows) .) for “The hungry cat meows .”]

SLIDES 8-19

(Ordered) tree traversals are sequences

A depth-first, left-to-right traversal of the tree (S (NP The hungry cat) (VP meows) .) yields the sequence, read out one symbol at a time:

  S( NP( The hungry cat ) VP( meows ) . )
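To make the linearization concrete, here is a hedged Python sketch (the Tree class and the action spelling are assumptions of this writeup, not from the slides) that converts a tree into the corresponding top-down action sequence:

    # Hypothetical minimal tree: a label plus children; leaves are plain strings.
    class Tree:
        def __init__(self, label, children):
            self.label, self.children = label, children

    def to_actions(node):
        """Depth-first traversal -> RNNG action sequence (an 'oracle')."""
        if isinstance(node, str):            # terminal (word)
            return [f"GEN({node})"]
        actions = [f"NT({node.label})"]      # open the constituent
        for child in node.children:
            actions += to_actions(child)
        actions.append("REDUCE")             # close the constituent
        return actions

    tree = Tree("S", [Tree("NP", ["The", "hungry", "cat"]),
                      Tree("VP", ["meows"]),
                      "."])
    print(to_actions(tree))
    # ['NT(S)', 'NT(NP)', 'GEN(The)', 'GEN(hungry)', 'GEN(cat)', 'REDUCE',
    #  'NT(VP)', 'GEN(meows)', 'REDUCE', 'GEN(.)', 'REDUCE']

This is exactly the action sequence walked through on the following slides.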

SLIDES 20-36

The derivation, step by step (each row shows the stack and generated terminals before the action in the last column is taken):

  Stack                               Terminals         Action
  (empty)                             (empty)           NT(S)
  (S                                  (empty)           NT(NP)
  (S (NP                              (empty)           GEN(The)
  (S (NP The                          The               GEN(hungry)
  (S (NP The hungry                   The hungry        GEN(cat)
  (S (NP The hungry cat               The hungry cat    REDUCE
  (S (NP The hungry cat)              The hungry cat    NT(VP)
  (S (NP The hungry cat) (VP          The hungry cat    ???

REDUCE compresses “The hungry cat” into a single composite symbol: the open constituent (NP The hungry cat becomes the single closed stack symbol (NP The hungry cat).

Q: What information can we use to predict the next action, and how can we encode it with an RNN?

SLIDE 37

A: We can use an RNN for each of:

  • Previous terminal symbols
  • Previous actions
  • Current stack contents
SLIDES 38-40

The derivation runs to completion:

  Stack                                   Terminals               Action
  (S (NP The hungry cat) (VP              The hungry cat          GEN(meows)
  (S (NP The hungry cat) (VP meows        The hungry cat meows    REDUCE
  (S (NP The hungry cat) (VP meows)       The hungry cat meows    GEN(.)
  (S (NP The hungry cat) (VP meows) .     The hungry cat meows .  REDUCE
  (S (NP The hungry cat) (VP meows) .)    The hungry cat meows .  (done)

Final stack symbol is (a vector representation of) the complete tree. A small simulator of this transition system is sketched below.
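As a check on the table above, here is a hedged Python sketch (the helper is mine, not the slides’) that executes the action sequence against an explicit stack and prints each intermediate state:

    def run(actions):
        """Apply RNNG actions to an explicit stack; print each state."""
        stack, terminals = [], []
        for act in actions:
            if act.startswith("NT("):          # open a constituent: push "(X"
                stack.append("(" + act[3:-1])
            elif act.startswith("GEN("):       # generate a terminal word
                word = act[4:-1]
                stack.append(word)
                terminals.append(word)
            elif act == "REDUCE":              # fold children into the open "(X"
                children = []
                # pop completed items (terminals or closed constituents)
                # until the most recent *open* nonterminal is on top
                while stack[-1].endswith(")") or not stack[-1].startswith("("):
                    children.append(stack.pop())
                open_nt = stack.pop()
                stack.append(open_nt + " " + " ".join(reversed(children)) + ")")
            print(f"{' '.join(stack):<42} {' '.join(terminals)}")

    run(['NT(S)', 'NT(NP)', 'GEN(The)', 'GEN(hungry)', 'GEN(cat)', 'REDUCE',
         'NT(VP)', 'GEN(meows)', 'REDUCE', 'GEN(.)', 'REDUCE'])
    # Final line: (S (NP The hungry cat) (VP meows) .)   The hungry cat meows .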

SLIDES 41-51

Syntactic Composition

Need a representation for: (NP The hungry cat)

[Figure: a composition network reads the nonterminal label NP together with the child vectors for “The”, “hungry”, “cat” (the residue “) NP (” in the original suggests the sequence is read in both directions, delimited by the label) and outputs a single vector for the whole constituent. The label tells the composer what head type it is building.]

One plausible instantiation is sketched after this slide.
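Here is a hedged sketch of such a composition function in Python/PyTorch, loosely following the bidirectional-LSTM composition of Dyer et al. (2016); the class name and dimensions are assumptions of this writeup, and torch is assumed available:

    import torch
    import torch.nn as nn

    class Composer(nn.Module):
        """Sketch: compose (NT label, child vectors) -> one constituent vector."""
        def __init__(self, dim):
            super().__init__()
            self.birnn = nn.LSTM(dim, dim, bidirectional=True)
            self.proj = nn.Linear(2 * dim, dim)

        def forward(self, nt_emb, child_embs):
            # Read [NT, child_1, ..., child_n]; the label tells the network
            # "what head type" it is building.
            seq = torch.stack([nt_emb] + child_embs).unsqueeze(1)  # (n+1, 1, d)
            _, (h_n, _) = self.birnn(seq)                          # h_n: (2, 1, d)
            both = torch.cat([h_n[0], h_n[1]], dim=-1)             # fwd ++ bwd
            return torch.tanh(self.proj(both)).squeeze(0)          # (d,)

    dim = 8
    comp = Composer(dim)
    np_vec = comp(torch.randn(dim),
                  [torch.randn(dim) for _ in ("The", "hungry", "cat")])
    print(np_vec.shape)  # torch.Size([8])

The output vector stands in for the closed constituent (NP The hungry cat) on the stack.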

SLIDES 52-54

Recursion

Need a representation for: (NP The (ADJP very hungry) cat)

Composition applies recursively: first compose (ADJP very hungry) into a single vector v, then feed v into the NP composition where a single word vector would otherwise go.

[Figure: an underbrace marks (ADJP very hungry) as the composed vector v, which appears as one input to the NP composition.]

SLIDES 55-58

Stack symbols composed recursively mirror the corresponding tree structure.

[Figure: the tree (S (NP The hungry cat) (VP meows) .) is assembled constituent by constituent: first the NP, then the VP, then the full S.]

Effect: the stack encodes top-down syntactic recency, rather than left-to-right string recency.

SLIDE 59

Implementing RNNGs: Stack RNNs

  • Augment a sequential RNN with a stack pointer
  • Two constant-time operations:
  • push: read input, add to top of stack, connect to the current location of the stack pointer
  • pop: move the stack pointer to its parent
  • A summary of stack contents is obtained by accessing the output of the RNN at the location of the stack pointer
  • Note: push and pop are discrete actions here (a sketch follows)

(cf. Grefenstette et al., 2015)
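A hedged Python/PyTorch sketch of the backpointer idea (class and method names are illustrative; torch is assumed available): each push adds a new RNN state whose parent is the current pointer, and pop just moves the pointer back, so both are O(1) and old states are never destroyed.

    import torch
    import torch.nn as nn

    class StackRNN:
        """Sketch of a stack RNN: the history is a tree of RNN states,
        navigated by a stack pointer (cf. the stack LSTM of Dyer et al., 2015)."""
        def __init__(self, dim):
            self.cell = nn.RNNCell(dim, dim)
            self.empty = torch.zeros(1, dim)     # state of the empty stack
            self.states = [(self.empty, None)]   # (hidden state, parent index)
            self.ptr = 0                         # stack pointer

        def push(self, x):                       # O(1)
            h, _ = self.states[self.ptr]
            self.states.append((self.cell(x.unsqueeze(0), h), self.ptr))
            self.ptr = len(self.states) - 1

        def pop(self):                           # O(1): follow the parent pointer
            self.ptr = self.states[self.ptr][1]

        def summary(self):                       # encodes current stack contents
            return self.states[self.ptr][0]

    s = StackRNN(4)
    s.push(torch.randn(4)); s.push(torch.randn(4))
    s.pop()                                      # back to a one-element stack
    print(s.summary().shape)                     # torch.Size([1, 4])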

SLIDES 60-67

Implementing RNNGs: Stack RNNs

[Figure: a sequence of PUSH and POP operations on the stack RNN, starting from the empty state ∅ (output y0). Each PUSH of an input x1, x2, x3 creates a new state (outputs y1, y2, y3) connected to the current pointer; each POP moves the pointer back to the parent. Old states persist, so the history forms a tree.]

SLIDES 68-79

The evolution of the stack LSTM over time mirrors the tree structure.

[Figure: while generating S( NP( The hungry cat ) VP( meows ) . ), the stack grows and shrinks: S is opened; NP is opened, filled, and reduced to a single composed symbol; VP likewise; finally everything reduces to a single S at the top of the stack.]

SLIDE 80

Each word is conditioned on the history, represented by a trio of RNNs.

[Figure: with (S (NP The hungry cat) (VP on the stack, the model computes p(meows | history).]

SLIDE 81

Train with backpropagation through structure

In training, backpropagate through these three RNNs, and recursively through the composed structure. This network is dynamic, so don’t derive gradients by hand (that’s error-prone); use automatic differentiation instead.

SLIDES 82-83

Complete model

The joint probability factors over the sequence of actions, which completely defines x (the sentence) and y (the tree):

  p(x, y) = ∏_t p(a_t | a_<t)

Each step is a softmax over the allowable actions at this step:

  p(a_t | a_<t) = exp(r_{a_t}ᵀ u_t + b_{a_t}) / Σ_{a′ ∈ A_t} exp(r_{a′}ᵀ u_t + b_{a′})

where r_a is the action embedding, u_t the history embedding (the actions up to time t), b_a a bias, and A_t the set of allowable actions at this step.

The model is dynamic: a variable number of context-dependent actions at each step. A masked-softmax sketch of one step follows.
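As a hedged sketch of that step (dimensions and names are illustrative; torch is assumed available), score every action and mask out the disallowed ones before the softmax:

    import torch

    def action_distribution(u_t, R, b, allowable):
        """p(a_t | a_<t): softmax over allowable actions only.
        u_t: (d,) history embedding; R: (A, d) action embeddings;
        b: (A,) biases; allowable: (A,) boolean mask."""
        scores = R @ u_t + b                   # r_a^T u_t + b_a for every action a
        scores = scores.masked_fill(~allowable, float("-inf"))
        return torch.softmax(scores, dim=0)    # disallowed actions get probability 0

    d, num_actions = 8, 5
    probs = action_distribution(torch.randn(d), torch.randn(num_actions, d),
                                torch.randn(num_actions),
                                torch.tensor([True, True, False, True, False]))
    print(probs)  # entries 2 and 4 are exactly 0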

SLIDES 84-85

Complete model

[Figure: the history embedding is built from three RNNs: the output (buffer) of terminals generated so far, the action history, and the stack.]

SLIDE 86

Implementing RNNGs: Parameter Estimation

  • RNNGs jointly model sequences of words together with a tree structure: pθ(x, y)
  • Any parse tree can be converted to a sequence of actions (depth-first traversal) and vice versa (subject to well-formedness constraints)
  • We use trees from the Penn Treebank
  • We could treat the non-generation actions as latent variables or learn them with RL, effectively making this a problem of grammar induction. Future work…

SLIDE 87

Implementing RNNGs: Inference

  • An RNNG is a joint distribution p(x, y) over strings (x) and parse trees (y)
  • We are interested in two inference questions:
  • What is p(x) for a given x? [language modeling]
  • What is max_y p(y | x) for a given x? [parsing]
  • Unfortunately, the dynamic programming algorithms we often rely on are of no help here
  • We can use importance sampling to do both, by sampling from a discriminatively trained model

SLIDE 88

English PTB (Parsing)

  Model                                Type   F1
  Petrov and Klein (2007)              G      90.1
  Shindo et al. (2012), single model   G      91.1
  Shindo et al. (2012), ensemble       ~G     92.4
  Vinyals et al. (2015), PTB only      D      90.5
  Vinyals et al. (2015), ensemble      S      92.8
  Discriminative                       D      89.8
  Generative (IS)                      G      92.4

SLIDES 89-94

Importance Sampling

Assume we’ve got a conditional distribution q(y | x) such that:

  (i)   p(x, y) > 0 ⟹ q(y | x) > 0
  (ii)  sampling y ∼ q(y | x) is tractable
  (iii) evaluating q(y | x) is tractable

Let the importance weights be

  w(x, y) = p(x, y) / q(y | x)

Then

  p(x) = Σ_{y ∈ Y(x)} p(x, y) = Σ_{y ∈ Y(x)} w(x, y) q(y | x) = E_{y∼q(y|x)} [w(x, y)]

Replace this expectation with its Monte Carlo estimate, using samples y⁽ⁱ⁾ ∼ q(y | x) for i ∈ {1, 2, …, N}:

  E_{q(y|x)} [w(x, y)] ≈ (1/N) Σ_{i=1}^{N} w(x, y⁽ⁱ⁾)

A small numeric sketch of this estimator follows.
slide-95
SLIDE 95

Perplexity 5-gram IKN 169.3 LSTM + Dropout 113.4 Generative (IS) 102.4

English PTB (LM)

Perplexity 5-gram IKN 255.2 LSTM + Dropout 207.3 Generative (IS) 171.9

Chinese CTB (LM)

SLIDE 96

Do we need a stack?

  • Both the stack and the action history encode the same information, but expose it to the classifier in different ways.
  • Leaving out the stack is harmful; using it on its own works slightly better than the complete model! (Kuncoro et al., Oct 2017)

SLIDE 97

RNNG as a mini-linguist

  • Replace composition with one that computes attention over objects in the composed sequence, using the embedding of the NT for similarity (a sketch follows).
  • What does this learn?
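A hedged, simplified sketch of such an attention-based composition (the actual model is more elaborate; this only shows NT-keyed attention, with torch assumed available):

    import torch

    def attn_compose(nt_emb, child_embs):
        """Compose children into one vector via NT-keyed attention (a sketch)."""
        children = torch.stack(child_embs)       # (n, d)
        scores = children @ nt_emb               # similarity of each child to the NT
        weights = torch.softmax(scores, dim=0)   # attention over children
        return weights @ children                # weighted sum -> (d,)

    d = 8
    vec = attn_compose(torch.randn(d), [torch.randn(d) for _ in range(3)])
    print(vec.shape)  # torch.Size([8])

Inspecting the learned attention weights is one way to see which child the model treats as the head of each constituent.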
SLIDE 102

Summary

  • Language is hierarchical, and this inductive bias can be encoded into an RNN-style model.
  • RNNGs work by simulating a tree traversal, like a pushdown automaton but with a continuous rather than finite history.
  • The history is modeled by RNNs encoding (1) previous tokens, (2) previous actions, and (3) stack contents.
  • A stack LSTM evolves with the stack contents.
  • The final representation computed by a stack LSTM has a top-down recency bias, rather than a left-to-right bias, which might be useful in modeling sentences.
  • RNNGs are effective for parsing and language modeling, and seem to capture linguistic intuitions about headedness.