SLIDE 1

Marrying Dynamic Programming with Recurrent Neural Networks

I eat sushi with tuna from Japan

Liang Huang

Oregon State University
Structured Prediction Workshop, EMNLP 2017, Copenhagen, Denmark


SLIDE 4

Structured Prediction is Hard!


SLIDE 5

Not Easy for Humans Either...


(structural ambiguity :-P)

SLIDE 6

Not Even Easy for Nature!


  • prion: “misfolded protein”
  • structural ambiguity for the same amino-acid sequence
  • similar to different interpretations under different contexts
  • causes mad-cow disease, etc.
SLIDE 7

Case Study: Parsing and Folding

  • both problems have exponentially large search space
  • both can be modeled by grammars (context-free & above)
  • question 1: how to search for the highest-scoring structure?
  • question 2: how to make gold structure score the highest?


I eat sushi with tuna from Japan

SLIDE 8

Solutions to Search and Learning

  • question 1: how to search for the highest-scoring structure?
  • answer: dynamic programming to factor search space
  • question 2: how to make gold structure score the highest?
  • answer: neural nets to automate feature engineering
  • But do DP and neural nets like each other??


I eat sushi with tuna from Japan


SLIDE 10

In this talk...

  • Background
  • Dynamic Programming for Incremental Parsing
  • Features: from sparse to neural to recurrent neural nets
  • Bidirectional RNNs: minimal features; no tree structures!
    • dependency parsing (Kiperwasser+Goldberg, 2016; Cross+Huang, 2016a)
    • span-based constituency parsing (Cross+Huang, 2016b)
  • Marrying DP & RNNs (mostly not my work!)
    • transition-based dependency parsing (Shi et al., EMNLP 2017)
    • minimal span-based constituency parsing (Stern et al., ACL 2017)


SLIDE 11

Spectrum: Neural Incremental Parsing

[spectrum diagram, arranging neural incremental parsers from “all tree info (summarize output y): DP impossible” to “minimal or no tree info (summarize input x): enables fast DP, fastest DP: O(n3)”; methods on the spectrum: Stack LSTM (Dyer+ 15); RNNG (Dyer+ 16); Feedforward NNs (Chen+Manning 14); DP incremental parsing (Huang+Sagae 10; Kuhlmann+ 11); biRNN dependency (Kiperwasser+Goldberg 16; Cross+Huang 16a); biRNN span-based constituency (Cross+Huang 16b); biRNN graph-based dependency (Kiperwasser+Goldberg 16; Wang+Chang 16); edge-factored (McDonald+ 05a); minimal span-based constituency (Stern+ ACL 17); minimal dependency (Shi+ EMNLP 17); the diagram also contrasts constituency vs. dependency and marks bottom-up parsing]


SLIDE 13

Incremental Parsing with Dynamic Programming

(Huang & Sagae, ACL 2010*; Kuhlmann et al., ACL 2011; Mi & Huang, ACL 2015)

* best paper nominee


SLIDE 15

Incremental Parsing (Shift-Reduce)

sentence: I eat sushi with tuna from Japan in a restaurant

gold derivation so far (head(children) marks reduced subtrees):

step  action     stack               queue
0     -          (empty)             I eat sushi ...
1     shift      I                   eat sushi with ...
2     shift      I eat               sushi with tuna ...
3     l-reduce   eat(I)              sushi with tuna ...
4     shift      eat(I) sushi        with tuna from ...
5a    r-reduce   eat(I, sushi)       with tuna from ...
5b    shift      eat(I) sushi with   tuna from Japan ...

steps 5a and 5b are both possible after step 4: a shift-reduce conflict.
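
To make the transition system concrete, here is a minimal sketch (not the talk's implementation) of greedy arc-standard shift-reduce parsing; score(action, stack, queue) is a hypothetical stand-in for any learned scoring function:

```python
def greedy_parse(words, score):
    """Greedy arc-standard shift-reduce; returns dependency arcs (head, child)."""
    stack, queue, arcs = [], list(words), []
    while queue or len(stack) > 1:
        actions = (["shift"] if queue else []) + \
                  (["l-reduce", "r-reduce"] if len(stack) >= 2 else [])
        act = max(actions, key=lambda a: score(a, stack, queue))
        if act == "shift":
            stack.append(queue.pop(0))       # move next word onto the stack
        elif act == "l-reduce":
            child = stack.pop(-2)            # s0 becomes the head of s1
            arcs.append((stack[-1], child))
        else:                                # r-reduce: s1 becomes the head of s0
            child = stack.pop()
            arcs.append((stack[-1], child))
    return arcs
```

Each state branches into at most three successors (shift, l-reduce, r-reduce); the next slides either prune that branching (beam search) or pack it (dynamic programming).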

SLIDE 24

Greedy Search

  • each state => three new states (shift, l-reduce, r-reduce)
  • greedy search: always pick the best next state
  • “best” is defined by a score learned from data
SLIDE 26

Beam Search

  • each state => three new states (shift, l-reduce, r-reduce)
  • beam search: always keep top-b states
  • still just a tiny fraction of the whole search space

psycholinguistic evidence: parallelism (Fodor et al., 1974; Gibson, 1991)
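
A sketch of the beam idea, assuming a hypothetical state API with .score, .is_final, and .successors() (at most three successors per state, as above):

```python
import heapq

def beam_search(init_state, b=8):
    """Keep only the top-b states at each step; return the best final state."""
    beam = [init_state]
    while not all(s.is_final for s in beam):
        candidates = []
        for s in beam:
            candidates += [s] if s.is_final else s.successors()
        beam = heapq.nlargest(b, candidates, key=lambda s: s.score)
    return max(beam, key=lambda s: s.score)
```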

SLIDE 28

Dynamic Programming

  • each state => three new states (shift, l-reduce, r-reduce)
  • key idea of DP: share common subproblems
  • merge equivalent states => polynomial space
  • each DP state corresponds to exponentially many non-DP states

graph-structured stack (Tomita, 1986)

[plot: number of states/trees explored (log scale, 10^0 to 10^10) vs. sentence length (0-70): DP explores exponentially many, while non-DP beam search covers only a tiny fraction]

(Huang and Sagae, 2010)

SLIDE 34

Merging (Ambiguity Packing)

  • two states are equivalent if they agree on features
  • because same features guarantee same cost
  • example: if we only care about the last 2 words on the stack, the states “I sushi”, “I eat sushi”, and “eat sushi” fall into two equivalence classes: “... eat sushi” and “... I sushi”

psycholinguistic evidence (eye-tracking experiments): delayed disambiguation, e.g., “John and Mary had 2 papers [each / together]”: the ambiguous prefix is kept packed until the final word disambiguates (Frazier and Rayner, 1990; Frazier, 1999)

(Huang and Sagae, 2010)
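
A sketch of the merging step, assuming the feature signature is the last two stack words plus the queue position (illustrative attribute names; real signatures follow the feature templates):

```python
def signature(state):
    # the state's "abstraction": everything the scoring model can see
    return (tuple(state.stack[-2:]), state.queue_pos)

def merge(states):
    """Pack states with equal signatures, keeping the highest-scoring one
    (a full implementation also keeps back-pointers to output a forest)."""
    packed = {}
    for s in states:
        sig = signature(s)
        if sig not in packed or s.score > packed[sig].score:
            packed[sig] = s
    return list(packed.values())
```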

SLIDE 38

Result: linear-time, DP, and accurate!

  • very fast linear-time dynamic programming parser
  • explores exponentially many trees (and outputs forest)
  • state-of-the-art parsing accuracy on English & Chinese

[plot: parsing time (secs, 0.2-1.4) vs. sentence length (0-70) for Charniak, Berkeley, MST, and this work; fitted empirical complexities are O(n2.4) and O(n2.5) for the two constituency parsers, O(n2) for MST, and O(n) for this work; an inset repeats the states-explored plot (DP: exponential vs. non-DP beam search)]
SLIDE 41

In this talk...

  • Background
  • Dynamic Programming for Incremental Parsing
  • Features: from sparse to neural to recurrent neural nets
  • Bidirectional RNNs: minimal features; no tree structures!
    • dependency parsing (Kiperwasser+Goldberg, 2016; Cross+Huang, 2016a)
    • span-based constituency parsing (Cross+Huang, 2016b)
  • Marrying DP & RNNs (mostly not my work!)
    • minimal span-based constituency parsing (Stern et al., ACL 2017)
    • transition-based dependency parsing (Shi et al., EMNLP 2017)
SLIDE 42

Sparse Features

  • score each action using features f and weights w
  • features are drawn from a local window: ... s2 s1 s0 | q0 q1 ... (← stack | queue →)
  • abstraction (or signature) of a state -- this inspires DP!
  • weights trained by structured perceptron (Collins 02)

example: stack ends in “... feed cats” (with “nearby” the rightmost child of “cats”); queue starts with “in the garden ...”
features: (s0.w, s0.rc, q0, ...) = (cats, nearby, in, ...)

(Huang+Sagae, 2010)
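
A sketch of such a feature signature (attribute names are illustrative); note that the same tuple that feeds the perceptron can double as the DP equivalence key:

```python
def feature_signature(state):
    s0 = state.stack[-1] if state.stack else None
    q0 = state.queue[0] if state.queue else None
    return (
        s0.word if s0 else None,             # s0.w,  e.g. "cats"
        s0.rightmost_child if s0 else None,  # s0.rc, e.g. "nearby"
        q0.word if q0 else None,             # q0,    e.g. "in"
        # ... more templates over s2, s1, q1, ...
    )
```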

SLIDE 45

From Sparse to Neural to RNN

  • neural nets can automate feature engineering :-)
  • but early neural work (e.g., Chen+Manning 14) still uses lots of manually designed atomic features on the stack
  • can we automate even more?
  • option 1: summarize the whole stack (part of y) using RNNs => stack LSTM / RNNG (Dyer+ 15, 16) -- rules out DP! :(
  • option 2: summarize the whole input (x) using RNNs => biLSTM dependency parsing (Kiperwasser+Goldberg 16; Cross+Huang 16a), biLSTM constituency parsing (Cross+Huang 16b) -- enables DP! :)

[diagram: feedforward architecture of Chen+Manning (2014)]


SLIDE 50

In this talk...

  • Background
  • Dynamic Programming for Incremental Parsing
  • Interlude: NN Features: from feedforward to recurrent
  • Bidirectional RNNs: minimal features; no tree structures!
    • dependency parsing (Kiperwasser+Goldberg, 2016; Cross+Huang, 2016a)
    • span-based constituency parsing (Cross+Huang, 2016b)
  • Marrying DP & RNNs (mostly not my work!)
    • minimal span-based constituency parsing (Stern et al., ACL 2017)
    • transition-based dependency parsing (Shi et al., EMNLP 2017)


SLIDE 51

biRNN for Dependency Parsing

  • several parallel efforts in 2016 used biLSTM features
  • Kiperwasser+Goldberg 2016: four positional feats; arc-eager
  • Cross+Huang ACL 2016: three positional feats; arc-standard
  • Wang+Chang 2016: two positional feats; graph-based
  • all inspired by the sparse edge-factored model (McDonald+ 05)
  • use positions to summarize the input x, not the output y!
  • => O(n3) DP, e.g., graph-based, but also incremental!

these developments led to the state of the art in dependency parsing

(Kiperwasser and Goldberg, 2016; Cross and Huang, ACL 2016)

SLIDE 52

Span-Based Constituency Parsing

  • previous work uses tree structures on the stack
  • our work simplifies this to operate directly on sentence spans
  • simple-to-implement linear-time parsing

[diagram contrasting the two: previous work keeps partial trees (e.g., VP’, NP) on the stack; our work keeps bare spans over “I do like eating fish”, delimited by fencepost positions such as 1, 3, 4, 5]

(Cross and Huang, EMNLP 2016)

SLIDE 53

Span-Based Parsing: Example Derivation

two alternating action types:
  • Structural (even steps): Shift or Combine
  • Label (odd steps): Label-X or No-Label

sentence: “I do like eating fish” (fencepost positions 0..5); gold tree:
(S (NP (PRP I)) (VP (MD do) (VBP like) (S (VP (VBG eating) (NP (NN fish))))))

gold action sequence (t = current brackets):
  Shift (I)                       ; Label-NP   (t = {0NP1})
  Shift (do)                      ; No-Label
  Shift (like)                    ; No-Label
  Combine (do like)               ; No-Label
  Shift (eating)                  ; No-Label
  Shift (fish)                    ; Label-NP   (t = {0NP1, 4NP5})
  Combine (eating fish)           ; Label-S-VP (t = {0NP1, 4NP5, 3S5, 3VP5})
  Combine (do like eating fish)   ; Label-VP   (adds 1VP5)
  Combine (I do like eating fish) ; Label-S    (adds 0S5; final t = {0NP1, 4NP5, 3S5, 3VP5, 1VP5, 0S5})

(Cross and Huang, EMNLP 2016)
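
A sketch of the alternating structural/label loop: the state is just a stack of fencepost spans, and choose_structural / choose_label are hypothetical stand-ins for the learned classifiers:

```python
def span_parse(n, choose_structural, choose_label):
    """n words; n shifts + (n-1) combines, each followed by a label action."""
    stack, brackets = [], []
    for _ in range(2 * n - 1):
        must_shift = len(stack) < 2
        must_combine = bool(stack) and stack[-1][1] == n
        if must_shift or (not must_combine
                          and choose_structural(stack) == "shift"):
            j = stack[-1][1] if stack else 0
            stack.append((j, j + 1))         # Shift: next word as a new span
        else:
            (_, j), (i, _) = stack.pop(), stack.pop()
            stack.append((i, j))             # Combine: merge top two spans
        label = choose_label(stack)          # Label-X, or None for No-Label
        if label is not None:
            i, j = stack[-1]
            brackets.append((i, label, j))   # e.g. (0, "NP", 1) for 0NP1
    return brackets
```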

SLIDE 74

Bi-LSTM Span Features

[diagram: biLSTM over “⟨s⟩ I do like eating fish ⟨/s⟩” with forward outputs f0 ... f5 and backward outputs b0 ... b5 at positions 0 ... 5]

  • sentence segment “eating fish” represented by two vectors:
  • forward component: f5 - f3 (Wang and Chang, ACL 2016)
  • backward component: b3 - b5

(Cross and Huang, EMNLP 2016)
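
A sketch in PyTorch (dimensions are assumptions; positions index the biLSTM outputs, including the boundary symbols):

```python
import torch
import torch.nn as nn

class SpanFeatures(nn.Module):
    """One O(n) biLSTM pass yields features for all O(n^2) spans."""
    def __init__(self, emb_dim=100, hidden=200):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True,
                            batch_first=True)

    def forward(self, emb):                  # emb: (1, n, emb_dim)
        out, _ = self.lstm(emb)              # out: (1, n, 2 * hidden)
        h = out.size(2) // 2
        return out[0, :, :h], out[0, :, h:]  # forward f, backward b

def span_repr(f, b, i, j):
    """Span (i, j) as forward difference f_j - f_i and backward b_i - b_j."""
    return torch.cat([f[j] - f[i], b[i] - b[j]])
```

For example, span_repr(f, b, 3, 5) is exactly the (f5 - f3, b3 - b5) pair above for “eating fish”.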

SLIDE 75

Structural & Label Actions

  • structural action: scored from 4 spans (pre-s1, s1, s0, queue)
  • label action: scored from 3 spans (pre-s0, s0, queue)

[diagram: “I do like eating fish .” divided into these spans across the stack and queue]

SLIDE 76

Results on Penn Treebank

Parser                   Search   Recall  Prec.  F1
Carreras et al. (2008)   cubic    90.7    91.4   91.1
Shindo et al. (2012)     cubic    -       -      91.1
Thang et al. (2015)      ~cubic   -       -      91.1
Watanabe et al. (2015)   beam     -       -      90.7
Static Oracle            greedy   90.7    91.4   91.0
Dynamic + Exploration    greedy   90.5    92.1   91.3

  • state of the art despite a simple system with greedy actions and small embeddings trained from scratch
  • first neural constituency parser to outperform sparse features

(Cross and Huang, EMNLP 2016)

SLIDE 77

Extension: Joint Syntax-Discourse Parsing

  • extend span-based parsing to discourse parsing
  • end-to-end, joint syntactic and discourse parsing

[diagram: an RST discourse tree over PTB syntax trees; discourse level above, syntax level below]

(Zhao and Huang, EMNLP 2017)
SLIDE 78

In this talk...

  • Background
  • Dynamic Programming for Incremental Parsing
  • Interlude: NN Features: from feedforward to recurrent
  • Bidirectional RNNs: minimal features; no tree structures!
    • dependency parsing (Kiperwasser+Goldberg, 2016; Cross+Huang, 2016a)
    • span-based constituency parsing (Cross+Huang, 2016b)
  • Marrying DP & RNNs (mostly not my work!)
    • minimal span-based constituency parsing (Stern et al., ACL 2017)
    • transition-based dependency parsing (Shi et al., EMNLP 2017)

SLIDE 79

Minimal Span-based Const. Parsing

  • chart-based bottom-up parsing instead of incremental
  • an even simpler score formulation
  • O(n3) exact DP (CKY) instead of greedy search
  • global loss-augmented training instead of local training

[diagrams: “I do like eating fish .” split into spans pre-s1, s1, s0, queue with boundaries i, k, j (structural action) and pre-s0, s0, queue with boundaries i, j (label action)]

scores (Cross+Huang, EMNLP 16): score_action(i, k, j) for a structural action; score_label(i, j) for a label action

CKY recurrence (Stern+, ACL 2017):

best(i, j) = max_label score_label(i, j, label) + max_k [ best(i, k) + best(k, j) ]

(for a single-word span there is no split point, so only the label term applies)
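
A sketch of this recurrence as memoized Python; score_label(i, j, l) is a hypothetical stand-in for the biLSTM label scorer, with label None meaning no bracket over (i, j):

```python
import functools

def cky(n, labels, score_label):
    """Return (score, brackets) of the best tree over span (0, n)."""
    @functools.lru_cache(maxsize=None)
    def best(i, j):
        label, lscore = max(((l, score_label(i, j, l)) for l in labels),
                            key=lambda x: x[1])
        brackets = [] if label is None else [(i, label, j)]
        if j == i + 1:                        # single word: no split point
            return lscore, brackets
        splits = []
        for k in range(i + 1, j):             # best split point k
            (ls, lb), (rs, rb) = best(i, k), best(k, j)
            splits.append((ls + rs, lb + rb))
        sscore, sbrackets = max(splits, key=lambda x: x[0])
        return lscore + sscore, brackets + sbrackets
    return best(0, n)
```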

SLIDE 84

Global Training & Loss-Augmented Decoding

  • want: the gold tree to score higher than every other tree, with a larger margin for worse trees
  • loss-augmented decoding in training: find the most-violated tree, i.e., a bad tree with a good score
  • loss-augmented decoding for Hamming loss (approximating F1): simply replace score_label(i, j, label) with score_label(i, j, label) + 1(label ≠ label*_ij), where label*_ij is the gold tree label for span (i, j) (could be “no-label”)

(Stern+, ACL 2017)
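
The augmentation is one line around the label scorer; a sketch, with gold_label(i, j) a hypothetical lookup of the gold label (or None) for span (i, j):

```python
def loss_augment(score_label, gold_label):
    """Wrap the label scorer so every wrong label earns a +1 Hamming bonus;
    running the same CKY with this scorer finds the most-violated tree."""
    def augmented(i, j, label):
        return score_label(i, j, label) + (1.0 if label != gold_label(i, j)
                                           else 0.0)
    return augmented
```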

SLIDE 85

Penn Treebank Results

Parser                          F1
Hall et al. (2014)              89.2
Vinyals et al. (2015)           88.3
Cross and Huang (2016b)         91.3
Dyer et al. (2016), corrected   91.7
Liu and Zhang (2017)            91.7
Chart Parser                    91.7
Chart Parser + refinement       91.8

(Stern+, ACL 2017)

SLIDE 86

Minimal Feats for Incremental Dep. Parsing

  • arc-standard (Cross and Huang, ACL 2016); arc-eager (Kiperwasser and Goldberg, 2016)
  • (Shi, Huang, Lee, EMNLP 2017) -- Saturday talk! -- arc-hybrid and arc-eager; works for both greedy search and O(n3) DP


SLIDE 91

Conclusions and Limitations

  • DP and RNNs can indeed be married, if done creatively
  • biRNN summarizing input x and not output structure y
  • this allows efficient DP with exact search
  • combine with global learning (loss-augmented decoding)
  • but exact DP is still too slow
  • future work: linear-time beam search DP with biRNNs
  • what if we want strictly incremental parsing? no biRNN...
  • DP search could compensate for loss of lookahead
  • what about translation? we do need to model y directly...

SLIDE 96

Thank you very much!

非常感谢! (fēi cháng gǎn xiè; “thank you very much”)

James Cross