SLIDE 1

Recurrent neural network grammars

Slide credits: Chris Dyer, Adhiguna Kuncoro

SLIDE 2

Widespread phenomenon: Polarity items can only appear in certain contexts

Example: “anybody” is a polarity item that tends to appear only in specific contexts:

  The lecture that I gave did not appeal to anybody

but not:

  * The lecture that I gave appealed to anybody
  * The lecture that I did not give appealed to anybody

We might infer that the licensing context is the word “not” appearing among the preceding words, and you could use an RNN to model this. However:

SLIDE 3

Language is hierarchical

The licensing context depends on recursive structure (syntax):

  The lecture that I gave did not appeal to anybody
  * The lecture that I did not give appealed to anybody

SLIDES 4-6

One theory of hierarchy

  • Generate symbols sequentially using an RNN
  • Add some “control symbols” to rewrite the history periodically
  • Periodically “compress” a sequence into a single “constituent”
  • Augment the RNN with an operation to compress recent history into a single vector (“reduce”)
  • The RNN predicts the next symbol based on the history of compressed elements and non-compressed terminals (“shift” or “generate”)
  • The RNN must also predict the “control symbols” that decide how big constituents are
  • We call such models recurrent neural network grammars (RNNGs). A minimal sketch of this action inventory follows.
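As a concrete, purely illustrative rendering of the action inventory, here is a minimal Python sketch; the type names mirror the slides’ terminology, but the representation itself is an assumption of this writeup:

    from typing import NamedTuple, Union

    class NT(NamedTuple):      # "control symbol": open a nonterminal, e.g. NT("NP")
        label: str

    class GEN(NamedTuple):     # generate ("shift") the next terminal word
        word: str

    class REDUCE(NamedTuple):  # "control symbol": compress the most recent
        pass                   # constituent into a single vector

    Action = Union[NT, GEN, REDUCE]

    # The start of a derivation for "The hungry cat meows .":
    print([NT("S"), NT("NP"), GEN("The"), GEN("hungry"), GEN("cat"), REDUCE()])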
SLIDE 7

(Ordered) tree traversals are sequences

[Figure: the parse tree (S (NP The hungry cat) (VP meows) .) for “The hungry cat meows .”]

SLIDES 8-19

(Ordered) tree traversals are sequences

A depth-first, left-to-right traversal of the tree (S (NP The hungry cat) (VP meows) .) yields the sequence, read out one symbol at a time:

  S( NP( The hungry cat ) VP( meows ) . )
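To make the linearization concrete, here is a hedged Python sketch (the Tree class and the action spelling are assumptions of this writeup, not from the slides) that converts a tree into the corresponding top-down action sequence:

    # Hypothetical minimal tree: a label plus children; leaves are plain strings.
    class Tree:
        def __init__(self, label, children):
            self.label, self.children = label, children

    def to_actions(node):
        """Depth-first traversal -> RNNG action sequence (an 'oracle')."""
        if isinstance(node, str):            # terminal (word)
            return [f"GEN({node})"]
        actions = [f"NT({node.label})"]      # open the constituent
        for child in node.children:
            actions += to_actions(child)
        actions.append("REDUCE")             # close the constituent
        return actions

    tree = Tree("S", [Tree("NP", ["The", "hungry", "cat"]),
                      Tree("VP", ["meows"]),
                      "."])
    print(to_actions(tree))
    # ['NT(S)', 'NT(NP)', 'GEN(The)', 'GEN(hungry)', 'GEN(cat)', 'REDUCE',
    #  'NT(VP)', 'GEN(meows)', 'REDUCE', 'GEN(.)', 'REDUCE']

This is exactly the action sequence walked through on the following slides.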

SLIDES 20-36

The derivation, step by step (each row shows the stack and generated terminals before the action in the last column is taken):

  Stack                               Terminals         Action
  (empty)                             (empty)           NT(S)
  (S                                  (empty)           NT(NP)
  (S (NP                              (empty)           GEN(The)
  (S (NP The                          The               GEN(hungry)
  (S (NP The hungry                   The hungry        GEN(cat)
  (S (NP The hungry cat               The hungry cat    REDUCE
  (S (NP The hungry cat)              The hungry cat    NT(VP)
  (S (NP The hungry cat) (VP          The hungry cat    ???

REDUCE compresses “The hungry cat” into a single composite symbol: the open constituent (NP The hungry cat becomes the single closed stack symbol (NP The hungry cat).

Q: What information can we use to predict the next action, and how can we encode it with an RNN?

SLIDE 37

A: We can use an RNN for each of:

  • Previous terminal symbols
  • Previous actions
  • Current stack contents
SLIDES 38-40

The derivation runs to completion:

  Stack                                   Terminals               Action
  (S (NP The hungry cat) (VP              The hungry cat          GEN(meows)
  (S (NP The hungry cat) (VP meows        The hungry cat meows    REDUCE
  (S (NP The hungry cat) (VP meows)       The hungry cat meows    GEN(.)
  (S (NP The hungry cat) (VP meows) .     The hungry cat meows .  REDUCE
  (S (NP The hungry cat) (VP meows) .)    The hungry cat meows .  (done)

Final stack symbol is (a vector representation of) the complete tree. A small simulator of this transition system is sketched below.
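As a check on the table above, here is a hedged Python sketch (the helper is mine, not the slides’) that executes the action sequence against an explicit stack and prints each intermediate state:

    def run(actions):
        """Apply RNNG actions to an explicit stack; print each state."""
        stack, terminals = [], []
        for act in actions:
            if act.startswith("NT("):          # open a constituent: push "(X"
                stack.append("(" + act[3:-1])
            elif act.startswith("GEN("):       # generate a terminal word
                word = act[4:-1]
                stack.append(word)
                terminals.append(word)
            elif act == "REDUCE":              # fold children into the open "(X"
                children = []
                # pop completed items (terminals or closed constituents)
                # until the most recent *open* nonterminal is on top
                while stack[-1].endswith(")") or not stack[-1].startswith("("):
                    children.append(stack.pop())
                open_nt = stack.pop()
                stack.append(open_nt + " " + " ".join(reversed(children)) + ")")
            print(f"{' '.join(stack):<42} {' '.join(terminals)}")

    run(['NT(S)', 'NT(NP)', 'GEN(The)', 'GEN(hungry)', 'GEN(cat)', 'REDUCE',
         'NT(VP)', 'GEN(meows)', 'REDUCE', 'GEN(.)', 'REDUCE'])
    # Final line: (S (NP The hungry cat) (VP meows) .)   The hungry cat meows .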

SLIDES 41-51

Syntactic Composition

Need a representation for: (NP The hungry cat)

[Figure: a composition network reads the nonterminal label NP together with the child vectors for “The”, “hungry”, “cat” (the residue “) NP (” in the original suggests the sequence is read in both directions, delimited by the label) and outputs a single vector for the whole constituent. The label tells the composer what head type it is building.]

One plausible instantiation is sketched after this slide.
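Here is a hedged sketch of such a composition function in Python/PyTorch, loosely following the bidirectional-LSTM composition of Dyer et al. (2016); the class name and dimensions are assumptions of this writeup, and torch is assumed available:

    import torch
    import torch.nn as nn

    class Composer(nn.Module):
        """Sketch: compose (NT label, child vectors) -> one constituent vector."""
        def __init__(self, dim):
            super().__init__()
            self.birnn = nn.LSTM(dim, dim, bidirectional=True)
            self.proj = nn.Linear(2 * dim, dim)

        def forward(self, nt_emb, child_embs):
            # Read [NT, child_1, ..., child_n]; the label tells the network
            # "what head type" it is building.
            seq = torch.stack([nt_emb] + child_embs).unsqueeze(1)  # (n+1, 1, d)
            _, (h_n, _) = self.birnn(seq)                          # h_n: (2, 1, d)
            both = torch.cat([h_n[0], h_n[1]], dim=-1)             # fwd ++ bwd
            return torch.tanh(self.proj(both)).squeeze(0)          # (d,)

    dim = 8
    comp = Composer(dim)
    np_vec = comp(torch.randn(dim),
                  [torch.randn(dim) for _ in ("The", "hungry", "cat")])
    print(np_vec.shape)  # torch.Size([8])

The output vector stands in for the closed constituent (NP The hungry cat) on the stack.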

SLIDES 52-54

Recursion

Need a representation for: (NP The (ADJP very hungry) cat)

Composition applies recursively: first compose (ADJP very hungry) into a single vector v, then feed v into the NP composition where a single word vector would otherwise go.

[Figure: an underbrace marks (ADJP very hungry) as the composed vector v, which appears as one input to the NP composition.]

SLIDES 55-58

Stack symbols composed recursively mirror the corresponding tree structure.

[Figure: the tree (S (NP The hungry cat) (VP meows) .) is assembled constituent by constituent: first the NP, then the VP, then the full S.]

Effect: the stack encodes top-down syntactic recency, rather than left-to-right string recency.

SLIDE 59

Implementing RNNGs: Stack RNNs

  • Augment a sequential RNN with a stack pointer
  • Two constant-time operations:
  • push: read input, add to top of stack, connect to the current location of the stack pointer
  • pop: move the stack pointer to its parent
  • A summary of stack contents is obtained by accessing the output of the RNN at the location of the stack pointer
  • Note: push and pop are discrete actions here (a sketch follows)

(cf. Grefenstette et al., 2015)
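A hedged Python/PyTorch sketch of the backpointer idea (class and method names are illustrative; torch is assumed available): each push adds a new RNN state whose parent is the current pointer, and pop just moves the pointer back, so both are O(1) and old states are never destroyed.

    import torch
    import torch.nn as nn

    class StackRNN:
        """Sketch of a stack RNN: the history is a tree of RNN states,
        navigated by a stack pointer (cf. the stack LSTM of Dyer et al., 2015)."""
        def __init__(self, dim):
            self.cell = nn.RNNCell(dim, dim)
            self.empty = torch.zeros(1, dim)     # state of the empty stack
            self.states = [(self.empty, None)]   # (hidden state, parent index)
            self.ptr = 0                         # stack pointer

        def push(self, x):                       # O(1)
            h, _ = self.states[self.ptr]
            self.states.append((self.cell(x.unsqueeze(0), h), self.ptr))
            self.ptr = len(self.states) - 1

        def pop(self):                           # O(1): follow the parent pointer
            self.ptr = self.states[self.ptr][1]

        def summary(self):                       # encodes current stack contents
            return self.states[self.ptr][0]

    s = StackRNN(4)
    s.push(torch.randn(4)); s.push(torch.randn(4))
    s.pop()                                      # back to a one-element stack
    print(s.summary().shape)                     # torch.Size([1, 4])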

SLIDES 60-67

Implementing RNNGs: Stack RNNs

[Figure: a sequence of PUSH and POP operations on the stack RNN, starting from the empty state ∅ (output y0). Each PUSH of an input x1, x2, x3 creates a new state (outputs y1, y2, y3) connected to the current pointer; each POP moves the pointer back to the parent. Old states persist, so the history forms a tree.]

SLIDES 68-79

The evolution of the stack LSTM over time mirrors the tree structure.

[Figure: while generating S( NP( The hungry cat ) VP( meows ) . ), the stack grows and shrinks: S is opened; NP is opened, filled, and reduced to a single composed symbol; VP likewise; finally everything reduces to a single S at the top of the stack.]

SLIDE 80

Each word is conditioned on the history, represented by a trio of RNNs.

[Figure: with (S (NP The hungry cat) (VP on the stack, the model computes p(meows | history).]

SLIDE 81

Train with backpropagation through structure

In training, backpropagate through these three RNNs, and recursively through the composed structure. This network is dynamic, so don’t derive gradients by hand (that’s error-prone); use automatic differentiation instead.

SLIDES 82-83

Complete model

The joint probability factors over the sequence of actions, which completely defines x (the sentence) and y (the tree):

  p(x, y) = ∏_t p(a_t | a_<t)

Each step is a softmax over the allowable actions at this step:

  p(a_t | a_<t) = exp(r_{a_t}ᵀ u_t + b_{a_t}) / Σ_{a′ ∈ A_t} exp(r_{a′}ᵀ u_t + b_{a′})

where r_a is the action embedding, u_t the history embedding (the actions up to time t), b_a a bias, and A_t the set of allowable actions at this step.

The model is dynamic: a variable number of context-dependent actions at each step. A masked-softmax sketch of one step follows.
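As a hedged sketch of that step (dimensions and names are illustrative; torch is assumed available), score every action and mask out the disallowed ones before the softmax:

    import torch

    def action_distribution(u_t, R, b, allowable):
        """p(a_t | a_<t): softmax over allowable actions only.
        u_t: (d,) history embedding; R: (A, d) action embeddings;
        b: (A,) biases; allowable: (A,) boolean mask."""
        scores = R @ u_t + b                   # r_a^T u_t + b_a for every action a
        scores = scores.masked_fill(~allowable, float("-inf"))
        return torch.softmax(scores, dim=0)    # disallowed actions get probability 0

    d, num_actions = 8, 5
    probs = action_distribution(torch.randn(d), torch.randn(num_actions, d),
                                torch.randn(num_actions),
                                torch.tensor([True, True, False, True, False]))
    print(probs)  # entries 2 and 4 are exactly 0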

SLIDES 84-85

Complete model

[Figure: the history embedding is built from three RNNs: the output (buffer) of terminals generated so far, the action history, and the stack.]

SLIDE 86

Implementing RNNGs: Parameter Estimation

  • RNNGs jointly model sequences of words together with a tree structure: pθ(x, y)
  • Any parse tree can be converted to a sequence of actions (depth-first traversal) and vice versa (subject to well-formedness constraints)
  • We use trees from the Penn Treebank
  • We could treat the non-generation actions as latent variables or learn them with RL, effectively making this a problem of grammar induction. Future work…

SLIDE 87

Implementing RNNGs: Inference

  • An RNNG is a joint distribution p(x, y) over strings (x) and parse trees (y)
  • We are interested in two inference questions:
  • What is p(x) for a given x? [language modeling]
  • What is max_y p(y | x) for a given x? [parsing]
  • Unfortunately, the dynamic programming algorithms we often rely on are of no help here
  • We can use importance sampling to do both, by sampling from a discriminatively trained model

SLIDE 88

English PTB (Parsing)

  Model                                Type   F1
  Petrov and Klein (2007)              G      90.1
  Shindo et al. (2012), single model   G      91.1
  Shindo et al. (2012), ensemble       ~G     92.4
  Vinyals et al. (2015), PTB only      D      90.5
  Vinyals et al. (2015), ensemble      S      92.8
  Discriminative                       D      89.8
  Generative (IS)                      G      92.4

SLIDES 89-94

Importance Sampling

Assume we’ve got a conditional distribution q(y | x) such that:

  (i)   p(x, y) > 0 ⟹ q(y | x) > 0
  (ii)  sampling y ∼ q(y | x) is tractable
  (iii) evaluating q(y | x) is tractable

Let the importance weights be

  w(x, y) = p(x, y) / q(y | x)

Then

  p(x) = Σ_{y ∈ Y(x)} p(x, y) = Σ_{y ∈ Y(x)} w(x, y) q(y | x) = E_{y∼q(y|x)} [w(x, y)]

Replace this expectation with its Monte Carlo estimate, using samples y⁽ⁱ⁾ ∼ q(y | x) for i ∈ {1, 2, …, N}:

  E_{q(y|x)} [w(x, y)] ≈ (1/N) Σ_{i=1}^{N} w(x, y⁽ⁱ⁾)

A small numeric sketch of this estimator follows.
slide-95
SLIDE 95

Perplexity 5-gram IKN 169.3 LSTM + Dropout 113.4 Generative (IS) 102.4

English PTB (LM)

Perplexity 5-gram IKN 255.2 LSTM + Dropout 207.3 Generative (IS) 171.9

Chinese CTB (LM)

SLIDE 96

Do we need a stack?

  • Both the stack and the action history encode the same information, but expose it to the classifier in different ways.
  • Leaving out the stack is harmful; using it on its own works slightly better than the complete model! (Kuncoro et al., Oct 2017)

SLIDE 97

RNNG as a mini-linguist

  • Replace composition with one that computes attention over objects in the composed sequence, using the embedding of the NT for similarity (a sketch follows).
  • What does this learn?
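A hedged, simplified sketch of such an attention-based composition (the actual model is more elaborate; this only shows NT-keyed attention, with torch assumed available):

    import torch

    def attn_compose(nt_emb, child_embs):
        """Compose children into one vector via NT-keyed attention (a sketch)."""
        children = torch.stack(child_embs)       # (n, d)
        scores = children @ nt_emb               # similarity of each child to the NT
        weights = torch.softmax(scores, dim=0)   # attention over children
        return weights @ children                # weighted sum -> (d,)

    d = 8
    vec = attn_compose(torch.randn(d), [torch.randn(d) for _ in range(3)])
    print(vec.shape)  # torch.Size([8])

Inspecting the learned attention weights is one way to see which child the model treats as the head of each constituent.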
SLIDE 102

Summary

  • Language is hierarchical, and this inductive bias can be encoded into an RNN-style model.
  • RNNGs work by simulating a tree traversal, like a pushdown automaton but with a continuous rather than finite history.
  • The history is modeled by RNNs encoding (1) previous tokens, (2) previous actions, and (3) stack contents.
  • A stack LSTM evolves with the stack contents.
  • The final representation computed by a stack LSTM has a top-down recency bias, rather than a left-to-right bias, which might be useful in modeling sentences.
  • RNNGs are effective for parsing and language modeling, and seem to capture linguistic intuitions about headedness.