Recurrent neural network grammars
Slide credits: Chris Dyer, Adhiguna Kuncoro
Widespread phenomenon: polarity items can only appear in certain contexts.

Example: "anybody" is a polarity item that tends to appear only in specific contexts:

  The lecture that I gave did not appeal to anybody

but not:

  * The lecture that I gave appealed to anybody
  * The lecture that I did not give appealed to anybody

We might infer that the licensing context is the word "not" appearing among the preceding words, and an RNN could model this. However, the licensing context depends on recursive structure (syntax): "not" precedes "anybody" in both "The lecture that I gave did not appeal to anybody" and "* The lecture that I did not give appealed to anybody", yet only the first is grammatical.
The stack holds open nonterminal elements and non-compressed terminals ("shift" or "generate"); completed constituents are compressed into a single vector (-> "reduce").
An example sentence with its phrase-structure tree (S, NP, VP):

  The hungry cat meows .
  S( NP( The hungry cat ) VP( meows ) . )
The generative RNNG derivation of "The hungry cat meows .":

Stack                                 Terminals               Action
                                                              NT(S)
(S                                                            NT(NP)
(S (NP                                                        GEN(The)
(S (NP The                            The                     GEN(hungry)
(S (NP The hungry                     The hungry              GEN(cat)
(S (NP The hungry cat                 The hungry cat          REDUCE
(S (NP The hungry cat)                The hungry cat          NT(VP)
(S (NP The hungry cat) (VP            The hungry cat          GEN(meows)
(S (NP The hungry cat) (VP meows      The hungry cat meows    REDUCE
(S (NP The hungry cat) (VP meows)     The hungry cat meows    GEN(.)
(S (NP The hungry cat) (VP meows) .   The hungry cat meows .  REDUCE
(S (NP The hungry cat) (VP meows) .)  The hungry cat meows .

Each REDUCE compresses the completed constituent into a single composite symbol: e.g. "The hungry cat" becomes one vector representing (NP The hungry cat).

Q: What information can we use to predict the next action, and how can we encode it with an RNN?
A: We can use an RNN for each of the stack, the generated terminals, and the action history.
Final stack symbol is (a vector representation of) the complete tree.
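The action semantics above can be sketched in code. This is a minimal illustration (not the authors' implementation) that executes NT(X), GEN(w), and REDUCE symbolically on a stack, producing the bracketed tree and the terminal string:

```python
def is_open_nt(item):
    # An open nonterminal looks like "(NP"; a closed constituent ends with ")".
    return item.startswith("(") and not item.endswith(")")

def execute(actions):
    stack, terminals = [], []
    for act in actions:
        if act.startswith("NT("):        # push an open nonterminal, e.g. "(S"
            stack.append("(" + act[3:-1])
        elif act.startswith("GEN("):     # generate a terminal word
            word = act[4:-1]
            stack.append(word)
            terminals.append(word)
        elif act == "REDUCE":            # pop children down to the open NT,
            children = []                # then push one closed constituent
            while not is_open_nt(stack[-1]):
                children.append(stack.pop())
            nt = stack.pop()
            stack.append(nt + " " + " ".join(reversed(children)) + ")")
        else:
            raise ValueError("unknown action: " + act)
    return stack, terminals

actions = ["NT(S)", "NT(NP)", "GEN(The)", "GEN(hungry)", "GEN(cat)", "REDUCE",
           "NT(VP)", "GEN(meows)", "REDUCE", "GEN(.)", "REDUCE"]
stack, terminals = execute(actions)
assert stack == ["(S (NP The hungry cat) (VP meows) .)"]
assert terminals == ["The", "hungry", "cat", "meows", "."]
```

In the real model each stack entry is a vector rather than a string, but the control flow is the same.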
Syntactic composition: after a REDUCE we need a vector representation for the completed constituent, e.g. (NP The hungry cat). The representation should depend on the nonterminal (NP; what head type?) and on the children (The, hungry, cat). A bidirectional RNN reads the NT embedding followed by the child vectors in both directions, and its final states are combined into a single composite vector v.

Composition is recursive: in (NP The (ADJP very hungry) cat), the already-composed vector v for (ADJP very hungry) serves as a child of the NP, just like a word vector.
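A toy sketch of such a composition function. The paper uses a bidirectional LSTM; the tanh recurrence with fixed scalar weights below is a stand-in for illustration only, and all dimensions and values are made up:

```python
import math

DIM = 4  # toy embedding size

def rnn(seq, w=0.5, u=0.9):
    # Elementwise tanh recurrence: h_t = tanh(w * x_t + u * h_{t-1}).
    h = [0.0] * DIM
    for x in seq:
        h = [math.tanh(w * xi + u * hi) for xi, hi in zip(x, h)]
    return h

def compose(nt_vec, children):
    # Read NT, c1, ..., ck forward and NT, ck, ..., c1 backward,
    # then combine the two final states into one composite vector.
    fwd = rnn([nt_vec] + children)
    bwd = rnn([nt_vec] + children[::-1])
    return [(f + b) / 2.0 for f, b in zip(fwd, bwd)]

nt_NP = [1.0] * DIM
the, hungry, cat = [[0.1 * i] * DIM for i in (1, 2, 3)]
v = compose(nt_NP, [the, hungry, cat])  # one vector for (NP The hungry cat)
assert len(v) == DIM
# Recursion: v could itself be passed as a child when composing a larger
# constituent that contains this NP.
```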
The stack is encoded with a stack RNN: each push adds a new RNN state on top, and a pop simply moves the location of the stack pointer back down (cf. Grefenstette et al., 2015). As the derivation of "S( NP( The hungry cat ) VP( meows ) . )" proceeds, the stack top passes through configurations such as [S], [S NP], and [S VP], and predictions like p(meows | history) are made from the state at the top of the stack.

Effect: the stack encodes top-down syntactic recency, rather than left-to-right string recency.
In training, we backpropagate through these three RNNs, and recursively through the composed structure. This network is highly dynamic, so we do not derive gradients by hand (that's error-prone); we use automatic differentiation instead.
The sequence of actions completely defines (x, y): the sentence and the tree. Given the actions up to time t, the next action is predicted with a softmax over the allowable actions at this step:

  p(a_t | a_<t) = exp(r_{a_t}^T u_t + b_{a_t}) / Σ_{a' ∈ A_t} exp(r_{a'}^T u_t + b_{a'})

where u_t is the history embedding, r_a is the action embedding, b_a is a bias, and A_t is the set of allowable actions at this step. The model is dynamic: there is a variable number of context-dependent actions at each step.
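A minimal sketch of this restricted softmax (the state vector, embeddings, and biases below are hypothetical values for illustration, not the model's parameters):

```python
import math

def action_distribution(u_t, embeddings, biases, allowable):
    # p(a | history) ∝ exp(r_a · u_t + b_a), normalized over ONLY the
    # currently allowable actions.
    scores = {a: sum(r * u for r, u in zip(embeddings[a], u_t)) + biases[a]
              for a in allowable}
    z = sum(math.exp(s) for s in scores.values())
    return {a: math.exp(s) / z for a, s in scores.items()}

# Toy state and parameters.
u_t = [0.2, -0.1, 0.4]
embeddings = {"NT(NP)": [1.0, 0.0, 0.5], "GEN(cat)": [0.0, 1.0, 0.2],
              "REDUCE": [-0.5, 0.3, 0.1]}
biases = {"NT(NP)": 0.0, "GEN(cat)": 0.1, "REDUCE": -0.2}

# The allowable set is context dependent: e.g. REDUCE is not allowed
# immediately after opening a nonterminal, so it is excluded here.
p = action_distribution(u_t, embeddings, biases, ["NT(NP)", "GEN(cat)"])
assert abs(sum(p.values()) - 1.0) < 1e-9 and "REDUCE" not in p
```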
The algorithm state has three components, each encoded with its own RNN: the output (buffer), the action history, and the stack.
Each valid tree structure corresponds to a sequence of actions (its depth-first traversal), and vice versa (subject to well-formedness constraints).
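The tree-to-actions direction can be sketched as a small oracle. The whitespace-tokenized bracketing format is a simplifying assumption:

```python
def tree_to_actions(tree):
    # Depth-first traversal of a bracketed tree, emitting RNNG actions:
    # "(" + label -> NT(label), word -> GEN(word), ")" -> REDUCE.
    actions, pending_nt = [], False
    for tok in tree.replace("(", " ( ").replace(")", " ) ").split():
        if tok == "(":
            pending_nt = True
        elif tok == ")":
            actions.append("REDUCE")
        elif pending_nt:
            actions.append("NT(%s)" % tok)
            pending_nt = False
        else:
            actions.append("GEN(%s)" % tok)
    return actions

assert tree_to_actions("(S (NP The hungry cat) (VP meows) .)") == [
    "NT(S)", "NT(NP)", "GEN(The)", "GEN(hungry)", "GEN(cat)", "REDUCE",
    "NT(VP)", "GEN(meows)", "REDUCE", "GEN(.)", "REDUCE"]
```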
Alternatively, we could treat the trees as latent variables or learn them with RL, effectively making this a problem of grammar induction. Future work…
The generative model defines a joint distribution pθ(x, y) over sentences (x) and trees (y). To parse, we want the most probable tree given the sentence, but the dynamic-programming algorithms parsers usually rely on are of no help here. Solution: sample candidate trees from a discriminatively trained model q(y | x) and rescore them under pθ(x, y).
English PTB (Parsing)

Model                               Type  F1
Petrov and Klein (2007)             G     90.1
Shindo et al. (2012), single model  G     91.1
Shindo et al. (2012), ensemble      ~G    92.4
Vinyals et al. (2015), PTB only     D     90.5
Vinyals et al. (2015), ensemble     S     92.8
Discriminative                      D     89.8
Generative (IS)                     G     92.4
Assume we've got a conditional distribution q(y | x) s.t. (i) p(x, y) > 0 ⟹ q(y | x) > 0, (ii) sampling y ∼ q(y | x) is tractable, and (iii) evaluating q(y | x) is tractable.

Let the importance weights be w(x, y) = p(x, y) / q(y | x). Then

  p(x) = Σ_{y ∈ Y(x)} p(x, y) = Σ_{y ∈ Y(x)} w(x, y) q(y | x) = E_{y∼q(y|x)} w(x, y)

Replace this expectation with its Monte Carlo estimate, using samples y^(i) ∼ q(y | x) for i ∈ {1, 2, …, N}:

  E_{q(y|x)} w(x, y) ≈ (1/N) Σ_{i=1}^{N} w(x, y^(i))
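The estimator is easy to check on a toy example. Here the joint p(x, y) and the proposal q(y | x) are made-up numbers over three candidate "trees" for one fixed sentence, so the exact marginal p(x) = 0.35 is known:

```python
import math
import random

random.seed(0)

# Toy joint p(x, y) over 3 candidate trees for a fixed sentence x.
p_joint = {"y1": 0.20, "y2": 0.10, "y3": 0.05}   # exact p(x) = 0.35
# Proposal q(y | x), e.g. from a discriminative parser.
q_cond = {"y1": 0.50, "y2": 0.30, "y3": 0.20}

def sample_q():
    # Inverse-CDF sampling from the discrete proposal.
    r, acc = random.random(), 0.0
    for y, qy in q_cond.items():
        acc += qy
        if r < acc:
            return y
    return y

# Monte Carlo importance-sampling estimate of p(x) = E_q[w(x, y)].
N = 100_000
total = 0.0
for _ in range(N):
    y = sample_q()
    total += p_joint[y] / q_cond[y]   # importance weight w(x, y)
estimate = total / N
assert abs(estimate - 0.35) < 0.01
```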
English PTB (LM)

Model            Perplexity
5-gram IKN       169.3
LSTM + Dropout   113.4
Generative (IS)  102.4

Chinese CTB (LM)

Model            Perplexity
5-gram IKN       255.2
LSTM + Dropout   207.3
Generative (IS)  171.9
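For reference, a perplexity figure like these is computed from a model's total log-probability over a held-out corpus (the numbers below are made up for illustration):

```python
# Perplexity from total corpus log2-probability:
# PPL = 2 ** (-(1/T) * log2 p(corpus)), where T is the token count.
def perplexity(total_log2_prob, num_tokens):
    return 2.0 ** (-total_log2_prob / num_tokens)

# E.g. a 1000-token corpus assigned total log2-probability -6680
# gives perplexity 2 ** 6.68, roughly 102.5.
ppl = perplexity(-6680.0, 1000)
assert 100.0 < ppl < 105.0
```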
Ablations: the model variants all have access to the same information, but expose it to the classifier in different ways. Leaving out the stack is harmful; using only the stack performs slightly better than the complete model! (Kuncoro et al., Oct 2017)
Attention over objects in the composed sequence, using the embedding of the NT for similarity.
Summary:
- Syntactic structure can be encoded into an RNN-style model.
- The result resembles a pushdown automaton, but with continuous rather than finite history.
- The next action is predicted from RNN encodings of (1) the output buffer, (2) the history of actions, and (3) stack contents.
- The stack gives a top-down recency bias, rather than a left-to-right bias, which might be useful in modeling sentences.
- Attention-based composition can capture linguistic intuitions about headedness.