Recurrent Neural Network Grammars


Slide credits: Chris Dyer, Adhiguna Kuncoro

Widespread phenomenon: polarity items can only appear in certain contexts. Example: "anybody" is a polarity item that tends to appear only in specific contexts, e.g. in the scope of negation ("I didn't see anybody" is grammatical, but "I saw anybody" is not).


  1–4. Syntactic Composition. We need a single vector representation for a completed constituent such as (NP The hungry cat), i.e. a function that composes the nonterminal NP with the embeddings of "The hungry cat" into one vector (see the sketch below).
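A minimal sketch of such a composition function (PyTorch; the bidirectional-LSTM reduction follows the general recipe of the RNNG paper, but the class and layer names here are illustrative, not the authors' implementation):

```python
import torch
import torch.nn as nn

class Composer(nn.Module):
    """Sketch: reduce a completed constituent (nonterminal embedding plus
    child vectors) to a single vector with a bidirectional LSTM."""
    def __init__(self, dim):
        super().__init__()
        self.bilstm = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, nt_embedding, children):
        # Sequence: the NT embedding first, then the child vectors in order.
        seq = torch.stack([nt_embedding] + children).unsqueeze(0)  # (1, n, dim)
        _, (h, _) = self.bilstm(seq)
        # Concatenate the final forward/backward states, project back to dim.
        fwd_bwd = torch.cat([h[0], h[1]], dim=-1)                  # (1, 2*dim)
        return torch.tanh(self.proj(fwd_bwd)).squeeze(0)           # (dim,)

# e.g. compose (NP The hungry cat) into one vector:
# np_vec = composer(embed("NP"), [embed("The"), embed("hungry"), embed("cat")])
```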

  5–7. Recursion. The representation must also handle nested structure, e.g. both (NP The hungry cat) and (NP The (ADJP very hungry) cat): the embedded ADJP is first composed into a vector v, and that vector is then used when composing the outer NP.

  8–11. Stack symbols composed recursively mirror the corresponding tree structure. For "The hungry cat meows .", the composed NP, VP, and S symbols on the stack correspond exactly to the constituents of the tree. Effect: the stack encodes top-down syntactic recency, rather than left-to-right string recency.

  12. Implementing RNNGs: Stack RNNs
  • Augment a sequential RNN with a stack pointer.
  • Two constant-time operations:
    • push: read an input, add it to the top of the stack, and connect it to the current location of the stack pointer.
    • pop: move the stack pointer to its parent.
  • A summary of the stack contents is obtained by reading the output of the RNN at the location of the stack pointer.
  • Note: push and pop are discrete actions here (cf. Grefenstette et al., 2015); see the sketch below.
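A minimal sketch of these two operations (PyTorch; class and method names are illustrative, not the authors' implementation):

```python
import torch
import torch.nn as nn

class StackRNN:
    """Sketch of a stack RNN: a sequential RNN cell whose 'previous state'
    is whatever the stack pointer currently points at, so push and pop are
    both constant-time."""
    def __init__(self, cell: nn.RNNCell, empty_state: torch.Tensor):
        self.cell = cell
        # Each entry: (hidden state, index of its parent entry).
        self.states = [(empty_state, None)]   # y_0 for the empty stack
        self.ptr = 0                          # stack pointer

    def push(self, x):
        # New state is computed from the input and the state at the pointer.
        h = self.cell(x.unsqueeze(0), self.states[self.ptr][0].unsqueeze(0)).squeeze(0)
        self.states.append((h, self.ptr))
        self.ptr = len(self.states) - 1

    def pop(self):
        # Move the pointer back to the parent; earlier states are kept, so
        # the history forms a tree of states rather than a flat sequence.
        self.ptr = self.states[self.ptr][1]

    def summary(self):
        return self.states[self.ptr][0]

# cell = nn.RNNCell(input_size=64, hidden_size=64)
# stack = StackRNN(cell, empty_state=torch.zeros(64))
# stack.push(x1); stack.pop(); stack.push(x2)   # mirrors the PUSH/POP frames below
```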

  13–20. Implementing RNNGs: Stack RNNs (diagram). Step-by-step evolution of a stack RNN: starting from the empty-stack state y0, each PUSH reads an input x_i and computes a new state y_i on top of the state at the stack pointer; each POP moves the pointer back to its parent state, leaving previously computed states intact, so a later PUSH branches off from an earlier point in the history.

  21–32. The evolution of the stack LSTM over time mirrors the tree structure. (Animation: as the action sequence S( NP( The hungry cat ) VP( meows ) . ) is processed for the tree (S (NP The hungry cat) (VP meows) .), the stack top descends into NP and then VP while they are open and returns to S after each REDUCE, so at every step the stack holds exactly the composed symbols of the partial tree.) A concrete action-by-action trace is sketched below.
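For concreteness, here is a plain-Python listing of the generative action sequence for "The hungry cat meows ." together with the stack contents after each action; the bracketed rendering of the stack is illustrative, not the authors' notation:

```python
# Generative action sequence and stack after each action
# (open constituents shown with an open bracket).
actions = [
    ("NT(S)",       "S("),
    ("NT(NP)",      "S( NP("),
    ("GEN(The)",    "S( NP( The"),
    ("GEN(hungry)", "S( NP( The hungry"),
    ("GEN(cat)",    "S( NP( The hungry cat"),
    ("REDUCE",      "S( NP"),        # (NP The hungry cat) composed into one vector
    ("NT(VP)",      "S( NP VP("),
    ("GEN(meows)",  "S( NP VP( meows"),
    ("REDUCE",      "S( NP VP"),     # (VP meows) composed
    ("GEN(.)",      "S( NP VP ."),
    ("REDUCE",      "S"),            # (S NP VP .) composed; one vector remains
]
```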

  33. Each word is conditioned on the history, which is represented by a trio of RNNs (the stack, the output buffer, and the action history), e.g. p(meows | history) at the point where "meows" is generated in S( NP( The hungry cat ) VP( meows ) . ).

  34. Train with backpropagation through structure. This network is dynamic: in training, backpropagate through these three RNNs, and recursively through the composed tree structure. Don't derive gradients by hand; that's error-prone. Use automatic differentiation instead.

  35–36. Complete model. The sequence of actions a completely defines x and y, and the joint probability decomposes over it:

  p(x, y) = \prod_{t=1}^{|a|} p(a_t \mid a_{<t}),
  p(a_t \mid a_{<t}) = \frac{\exp(r_{a_t}^{\top} u_t + b_{a_t})}{\sum_{a' \in \mathcal{A}_t} \exp(r_{a'}^{\top} u_t + b_{a'})},

  where a_{<t} are the actions up to time t, u_t is the embedding of the tree/derivation history, r_a is the action embedding, b_a is a bias, and \mathcal{A}_t is the set of allowable actions at this step. The model is dynamic: there is a variable number of context-dependent allowable actions at each step.

  37–38. Complete model (diagram): u_t is computed from three encodings, the stack, the output buffer, and the action history. (A sketch of this step follows.)
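A minimal sketch of the scoring step above (PyTorch; W, c, action_emb, action_bias, and allowed are illustrative names for parameters and for the set of actions valid in the current parser state, not the authors' implementation):

```python
import torch

def action_distribution(stack_enc, buffer_enc, history_enc,
                        W, c, action_emb, action_bias, allowed):
    """Sketch: p(a_t | a_<t) as a softmax over only the allowable actions,
    scored against u_t = tanh(W [o_t; s_t; h_t] + c)."""
    u_t = torch.tanh(W @ torch.cat([buffer_enc, stack_enc, history_enc]) + c)
    scores = action_emb[allowed] @ u_t + action_bias[allowed]   # r_a^T u_t + b_a
    return torch.softmax(scores, dim=-1)  # distribution over `allowed` only

# p = action_distribution(s_t, o_t, h_t, W, c, R, b, allowed=[NT_S, GEN_the, REDUCE])
```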

  39. Implementing RNNGs: Parameter Estimation
  • RNNGs jointly model sequences of words together with a tree structure, p_θ(x, y).
  • Any parse tree can be converted to a sequence of actions via a depth-first traversal, and vice versa (subject to wellformedness constraints); a small sketch of this conversion follows.
  • We use trees from the Penn Treebank.
  • We could instead treat the non-generation actions as latent variables or learn them with RL, effectively making this a problem of grammar induction. Future work…
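A small illustration of the tree-to-actions conversion (plain Python; the nested-tuple tree encoding is an assumption made for the example):

```python
def tree_to_actions(tree):
    """Depth-first conversion of a parse tree to a generative RNNG action
    sequence. `tree` is (label, [children]) for constituents, a plain
    string for terminals."""
    actions = []

    def visit(node):
        if isinstance(node, str):          # terminal word
            actions.append(f"GEN({node})")
        else:                              # (nonterminal, children)
            label, children = node
            actions.append(f"NT({label})")
            for child in children:
                visit(child)
            actions.append("REDUCE")

    visit(tree)
    return actions

# tree_to_actions(("S", [("NP", ["The", "hungry", "cat"]),
#                        ("VP", ["meows"]), "."]))
# -> ['NT(S)', 'NT(NP)', 'GEN(The)', 'GEN(hungry)', 'GEN(cat)', 'REDUCE',
#     'NT(VP)', 'GEN(meows)', 'REDUCE', 'GEN(.)', 'REDUCE']
```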

  40. Implementing RNNGs: Inference
  • An RNNG is a joint distribution p(x, y) over strings (x) and parse trees (y).
  • We are interested in two inference questions:
    • What is p(x) for a given x? [language modeling]
    • What is max_y p(y | x) for a given x? [parsing]
  • Unfortunately, the dynamic programming algorithms we often rely on are of no help here.
  • We can do both with importance sampling, drawing samples from a discriminatively trained model.

  41. English PTB (Parsing)

  Model                               Type   F1
  Petrov and Klein (2007)             G      90.1
  Shindo et al. (2012), single model  G      91.1
  Shindo et al. (2012), ensemble      ~G     92.4
  Vinyals et al. (2015), PTB only     D      90.5
  Vinyals et al. (2015), ensemble     S      92.8
  Discriminative                      D      89.8
  Generative (IS)                     G      92.4

  42–44. Importance Sampling. Assume we have a conditional proposal distribution q(y \mid x) such that (i) p(x, y) > 0 \Rightarrow q(y \mid x) > 0, (ii) sampling y \sim q(y \mid x) is tractable, and (iii) evaluating q(y \mid x) is tractable. Let the importance weights be

  w(x, y) = \frac{p(x, y)}{q(y \mid x)}.

  Then

  p(x) = \sum_{y \in \mathcal{Y}(x)} p(x, y) = \sum_{y \in \mathcal{Y}(x)} w(x, y)\, q(y \mid x) = \mathbb{E}_{y \sim q(y \mid x)}[w(x, y)].

  45–47. Replace this expectation with its Monte Carlo estimate, drawing samples y^{(i)} \sim q(y \mid x) for i \in \{1, 2, \ldots, N\}:

  \mathbb{E}_{q(y \mid x)}[w(x, y)] \overset{\text{MC}}{\approx} \frac{1}{N} \sum_{i=1}^{N} w(x, y^{(i)}).

  (A code sketch of this estimator follows.)
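A minimal sketch of the estimator in Python, computed in log space for numerical stability; joint_logprob (the generative model's log p(x, y)) and proposal (the discriminative q(y | x)) are assumed interfaces, not real library calls:

```python
import math

def estimate_log_px(x, joint_logprob, proposal, num_samples=100):
    """Importance-sampling estimate of log p(x):
    p(x) = E_{y ~ q(.|x)} [ p(x, y) / q(y | x) ], approximated with N samples."""
    log_weights = []
    for _ in range(num_samples):
        y, log_q = proposal.sample(x)          # y^(i) ~ q(y | x), with log q(y^(i) | x)
        log_weights.append(joint_logprob(x, y) - log_q)   # log w(x, y^(i))
    # log( (1/N) * sum_i exp(log_w_i) ), via log-sum-exp for stability
    m = max(log_weights)
    return m + math.log(sum(math.exp(lw - m) for lw in log_weights)) - math.log(num_samples)
```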

  48. Language modeling

  English PTB (LM)      Perplexity
  5-gram IKN            169.3
  LSTM + Dropout        113.4
  Generative (IS)       102.4

  Chinese CTB (LM)      Perplexity
  5-gram IKN            255.2
  LSTM + Dropout        207.3
  Generative (IS)       171.9

  49. Do we need a stack? (Kuncoro et al., Oct 2017)
  • Both the stack and the action history encode the same information, but expose it to the classifier in different ways.
  • Leaving out the stack is harmful; using the stack on its own works slightly better than the complete model!

  50–53. RNNG as a mini-linguist
  • Replace the composition function with one that computes attention over the objects in the composed sequence, using the embedding of the nonterminal for similarity.
  • What does this learn? (A sketch of this attention-based composition follows.)
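A minimal sketch of such an attention-based composition (PyTorch; the class and layer names are assumptions, and this is a simplified sketch rather than the exact composition function from the paper):

```python
import torch
import torch.nn as nn

class AttentionComposer(nn.Module):
    """Sketch: compose a constituent as an attention-weighted combination of
    its children, with the nonterminal embedding used for similarity."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)   # maps the NT embedding to a query

    def forward(self, nt_embedding, children):
        kids = torch.stack(children)              # (n, dim)
        scores = kids @ self.query(nt_embedding)  # (n,) similarity to the NT
        weights = torch.softmax(scores, dim=0)    # attention over the children
        return weights @ kids                     # (dim,) composed vector

# Inspecting `weights` shows which child the model treats as most important
# for a given nonterminal type.
```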
