SLIDE 1

Marrying Dynamic Programming with Recurrent Neural Networks

I eat sushi with tuna from Japan

Liang Huang

Oregon State University
Structured Prediction Workshop, EMNLP 2017, Copenhagen, Denmark


SLIDE 4

Structured Prediction is Hard!


SLIDE 5

Not Easy for Humans Either...


(structural ambiguity :-P)

SLIDE 6

Not Even Easy for Nature!


  • prion: “misfolded protein”
  • structural ambiguity for the same amino-acid sequence
  • similar to different interpretations under different contexts
  • causes mad-cow disease, etc.
SLIDE 7

Case Study: Parsing and Folding

  • both problems have exponentially large search space
  • both can be modeled by grammars (context-free & above)
  • question 1: how to search for the highest-scoring structure?
  • question 2: how to make gold structure score the highest?


I eat sushi with tuna from Japan

SLIDE 8

Solutions to Search and Learning

  • question 1: how to search for the highest-scoring structure?
  • answer: dynamic programming to factor search space
  • question 2: how to make gold structure score the highest?
  • answer: neural nets to automate feature engineering
  • But do DP and neural nets like each other??


I eat sushi with tuna from Japan


SLIDE 10

In this talk...

  • Background
  • Dynamic Programming for Incremental Parsing
  • Features: from sparse to neural to recurrent neural nets
  • Bidirectional RNNs: minimal features; no tree structures!
    • dependency parsing (Kiperwasser+Goldberg, 2016; Cross+Huang, 2016a)
    • span-based constituency parsing (Cross+Huang, 2016b)
  • Marrying DP & RNNs (mostly not my work!)
    • transition-based dependency parsing (Shi et al., EMNLP 2017)
    • minimal span-based constituency parsing (Stern et al., ACL 2017)


SLIDE 11

Spectrum: Neural Incremental Parsing

[spectrum diagram, arranging neural incremental parsers from “all tree info (summarize output y): DP impossible” to “minimal or no tree info (summarize input x): enables fast DP, fastest DP: O(n3)”; methods on the spectrum: Stack LSTM (Dyer+ 15); RNNG (Dyer+ 16); Feedforward NNs (Chen+Manning 14); DP incremental parsing (Huang+Sagae 10; Kuhlmann+ 11); biRNN dependency (Kiperwasser+Goldberg 16; Cross+Huang 16a); biRNN span-based constituency (Cross+Huang 16b); biRNN graph-based dependency (Kiperwasser+Goldberg 16; Wang+Chang 16); edge-factored (McDonald+ 05a); minimal span-based constituency (Stern+ ACL 17); minimal dependency (Shi+ EMNLP 17); the diagram also contrasts constituency vs. dependency and marks bottom-up parsing]


SLIDE 13

Incremental Parsing with Dynamic Programming

(Huang & Sagae, ACL 2010*; Kuhlmann et al., ACL 2011; Mi & Huang, ACL 2015)

* best paper nominee


SLIDE 15

Incremental Parsing (Shift-Reduce)

sentence: I eat sushi with tuna from Japan in a restaurant

gold derivation so far (head(children) marks reduced subtrees):

step  action     stack               queue
0     -          (empty)             I eat sushi ...
1     shift      I                   eat sushi with ...
2     shift      I eat               sushi with tuna ...
3     l-reduce   eat(I)              sushi with tuna ...
4     shift      eat(I) sushi        with tuna from ...
5a    r-reduce   eat(I, sushi)       with tuna from ...
5b    shift      eat(I) sushi with   tuna from Japan ...

steps 5a and 5b are both possible after step 4: a shift-reduce conflict.
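
To make the transition system concrete, here is a minimal sketch (not the talk's implementation) of greedy arc-standard shift-reduce parsing; score(action, stack, queue) is a hypothetical stand-in for any learned scoring function:

```python
def greedy_parse(words, score):
    """Greedy arc-standard shift-reduce; returns dependency arcs (head, child)."""
    stack, queue, arcs = [], list(words), []
    while queue or len(stack) > 1:
        actions = (["shift"] if queue else []) + \
                  (["l-reduce", "r-reduce"] if len(stack) >= 2 else [])
        act = max(actions, key=lambda a: score(a, stack, queue))
        if act == "shift":
            stack.append(queue.pop(0))       # move next word onto the stack
        elif act == "l-reduce":
            child = stack.pop(-2)            # s0 becomes the head of s1
            arcs.append((stack[-1], child))
        else:                                # r-reduce: s1 becomes the head of s0
            child = stack.pop()
            arcs.append((stack[-1], child))
    return arcs
```

Each state branches into at most three successors (shift, l-reduce, r-reduce); the next slides either prune that branching (beam search) or pack it (dynamic programming).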

SLIDE 24

Greedy Search

  • each state => three new states (shift, l-reduce, r-reduce)
  • greedy search: always pick the best next state
  • “best” is defined by a score learned from data
SLIDE 26

Beam Search

  • each state => three new states (shift, l-reduce, r-reduce)
  • beam search: always keep top-b states
  • still just a tiny fraction of the whole search space

psycholinguistic evidence: parallelism (Fodor et al., 1974; Gibson, 1991)
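
A sketch of the beam idea, assuming a hypothetical state API with .score, .is_final, and .successors() (at most three successors per state, as above):

```python
import heapq

def beam_search(init_state, b=8):
    """Keep only the top-b states at each step; return the best final state."""
    beam = [init_state]
    while not all(s.is_final for s in beam):
        candidates = []
        for s in beam:
            candidates += [s] if s.is_final else s.successors()
        beam = heapq.nlargest(b, candidates, key=lambda s: s.score)
    return max(beam, key=lambda s: s.score)
```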

SLIDE 28

Dynamic Programming

  • each state => three new states (shift, l-reduce, r-reduce)
  • key idea of DP: share common subproblems
  • merge equivalent states => polynomial space
  • each DP state corresponds to exponentially many non-DP states

graph-structured stack (Tomita, 1986)

[plot: number of states/trees explored (log scale, 10^0 to 10^10) vs. sentence length (0-70): DP explores exponentially many, while non-DP beam search covers only a tiny fraction]

(Huang and Sagae, 2010)

SLIDE 34

Merging (Ambiguity Packing)

  • two states are equivalent if they agree on features
  • because same features guarantee same cost
  • example: if we only care about the last 2 words on the stack, the states “I sushi”, “I eat sushi”, and “eat sushi” fall into two equivalence classes: “... eat sushi” and “... I sushi”

psycholinguistic evidence (eye-tracking experiments): delayed disambiguation, e.g., “John and Mary had 2 papers [each / together]”: the ambiguous prefix is kept packed until the final word disambiguates (Frazier and Rayner, 1990; Frazier, 1999)

(Huang and Sagae, 2010)
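
A sketch of the merging step, assuming the feature signature is the last two stack words plus the queue position (illustrative attribute names; real signatures follow the feature templates):

```python
def signature(state):
    # the state's "abstraction": everything the scoring model can see
    return (tuple(state.stack[-2:]), state.queue_pos)

def merge(states):
    """Pack states with equal signatures, keeping the highest-scoring one
    (a full implementation also keeps back-pointers to output a forest)."""
    packed = {}
    for s in states:
        sig = signature(s)
        if sig not in packed or s.score > packed[sig].score:
            packed[sig] = s
    return list(packed.values())
```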

SLIDE 38

Result: linear-time, DP, and accurate!

  • very fast linear-time dynamic programming parser
  • explores exponentially many trees (and outputs forest)
  • state-of-the-art parsing accuracy on English & Chinese

[plot: parsing time (secs, 0.2-1.4) vs. sentence length (0-70) for Charniak, Berkeley, MST, and this work; fitted empirical complexities are O(n2.4) and O(n2.5) for the two constituency parsers, O(n2) for MST, and O(n) for this work; an inset repeats the states-explored plot (DP: exponential vs. non-DP beam search)]
SLIDE 41

In this talk...

  • Background
  • Dynamic Programming for Incremental Parsing
  • Features: from sparse to neural to recurrent neural nets
  • Bidirectional RNNs: minimal features; no tree structures!
    • dependency parsing (Kiperwasser+Goldberg, 2016; Cross+Huang, 2016a)
    • span-based constituency parsing (Cross+Huang, 2016b)
  • Marrying DP & RNNs (mostly not my work!)
    • minimal span-based constituency parsing (Stern et al., ACL 2017)
    • transition-based dependency parsing (Shi et al., EMNLP 2017)
SLIDE 42

Sparse Features

  • score each action using features f and weights w
  • features are drawn from a local window: ... s2 s1 s0 | q0 q1 ... (← stack | queue →)
  • abstraction (or signature) of a state -- this inspires DP!
  • weights trained by structured perceptron (Collins 02)

example: stack ends in “... feed cats” (with “nearby” the rightmost child of “cats”); queue starts with “in the garden ...”
features: (s0.w, s0.rc, q0, ...) = (cats, nearby, in, ...)

(Huang+Sagae, 2010)
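
A sketch of such a feature signature (attribute names are illustrative); note that the same tuple that feeds the perceptron can double as the DP equivalence key:

```python
def feature_signature(state):
    s0 = state.stack[-1] if state.stack else None
    q0 = state.queue[0] if state.queue else None
    return (
        s0.word if s0 else None,             # s0.w,  e.g. "cats"
        s0.rightmost_child if s0 else None,  # s0.rc, e.g. "nearby"
        q0.word if q0 else None,             # q0,    e.g. "in"
        # ... more templates over s2, s1, q1, ...
    )
```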

SLIDE 45

From Sparse to Neural to RNN

  • neural nets can automate feature engineering :-)
  • but early neural work (e.g., Chen+Manning 14) still uses lots of manually designed atomic features on the stack
  • can we automate even more?
  • option 1: summarize the whole stack (part of y) using RNNs => stack LSTM / RNNG (Dyer+ 15, 16) -- rules out DP! :(
  • option 2: summarize the whole input (x) using RNNs => biLSTM dependency parsing (Kiperwasser+Goldberg 16; Cross+Huang 16a), biLSTM constituency parsing (Cross+Huang 16b) -- enables DP! :)

[diagram: feedforward architecture of Chen+Manning (2014)]


SLIDE 50

In this talk...

  • Background
  • Dynamic Programming for Incremental Parsing
  • Interlude: NN Features: from feedforward to recurrent
  • Bidirectional RNNs: minimal features; no tree structures!
    • dependency parsing (Kiperwasser+Goldberg, 2016; Cross+Huang, 2016a)
    • span-based constituency parsing (Cross+Huang, 2016b)
  • Marrying DP & RNNs (mostly not my work!)
    • minimal span-based constituency parsing (Stern et al., ACL 2017)
    • transition-based dependency parsing (Shi et al., EMNLP 2017)


SLIDE 51

biRNN for Dependency Parsing

  • several parallel efforts in 2016 used biLSTM features
  • Kiperwasser+Goldberg 2016: four positional feats; arc-eager
  • Cross+Huang ACL 2016: three positional feats; arc-standard
  • Wang+Chang 2016: two positional feats; graph-based
  • all inspired by the sparse edge-factored model (McDonald+ 05)
  • use positions to summarize the input x, not the output y!
  • => O(n3) DP, e.g., graph-based, but also incremental!

these developments led to the state of the art in dependency parsing

(Kiperwasser and Goldberg, 2016; Cross and Huang, ACL 2016)

SLIDE 52

Span-Based Constituency Parsing

  • previous work uses tree structures on the stack
  • our work simplifies this to operate directly on sentence spans
  • simple-to-implement linear-time parsing

[diagram contrasting the two: previous work keeps partial trees (e.g., VP’, NP) on the stack; our work keeps bare spans over “I do like eating fish”, delimited by fencepost positions such as 1, 3, 4, 5]

(Cross and Huang, EMNLP 2016)

SLIDE 53

Span-Based Parsing: Example Derivation

two alternating action types:
  • Structural (even steps): Shift or Combine
  • Label (odd steps): Label-X or No-Label

sentence: “I do like eating fish” (fencepost positions 0..5); gold tree:
(S (NP (PRP I)) (VP (MD do) (VBP like) (S (VP (VBG eating) (NP (NN fish))))))

gold action sequence (t = current brackets):
  Shift (I)                       ; Label-NP   (t = {0NP1})
  Shift (do)                      ; No-Label
  Shift (like)                    ; No-Label
  Combine (do like)               ; No-Label
  Shift (eating)                  ; No-Label
  Shift (fish)                    ; Label-NP   (t = {0NP1, 4NP5})
  Combine (eating fish)           ; Label-S-VP (t = {0NP1, 4NP5, 3S5, 3VP5})
  Combine (do like eating fish)   ; Label-VP   (adds 1VP5)
  Combine (I do like eating fish) ; Label-S    (adds 0S5; final t = {0NP1, 4NP5, 3S5, 3VP5, 1VP5, 0S5})

(Cross and Huang, EMNLP 2016)
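
A sketch of the alternating structural/label loop: the state is just a stack of fencepost spans, and choose_structural / choose_label are hypothetical stand-ins for the learned classifiers:

```python
def span_parse(n, choose_structural, choose_label):
    """n words; n shifts + (n-1) combines, each followed by a label action."""
    stack, brackets = [], []
    for _ in range(2 * n - 1):
        must_shift = len(stack) < 2
        must_combine = bool(stack) and stack[-1][1] == n
        if must_shift or (not must_combine
                          and choose_structural(stack) == "shift"):
            j = stack[-1][1] if stack else 0
            stack.append((j, j + 1))         # Shift: next word as a new span
        else:
            (_, j), (i, _) = stack.pop(), stack.pop()
            stack.append((i, j))             # Combine: merge top two spans
        label = choose_label(stack)          # Label-X, or None for No-Label
        if label is not None:
            i, j = stack[-1]
            brackets.append((i, label, j))   # e.g. (0, "NP", 1) for 0NP1
    return brackets
```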

SLIDE 74

Bi-LSTM Span Features

[diagram: biLSTM over “⟨s⟩ I do like eating fish ⟨/s⟩” with forward outputs f0 ... f5 and backward outputs b0 ... b5 at positions 0 ... 5]

  • sentence segment “eating fish” represented by two vectors:
  • forward component: f5 - f3 (Wang and Chang, ACL 2016)
  • backward component: b3 - b5

(Cross and Huang, EMNLP 2016)
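
A sketch in PyTorch (dimensions are assumptions; positions index the biLSTM outputs, including the boundary symbols):

```python
import torch
import torch.nn as nn

class SpanFeatures(nn.Module):
    """One O(n) biLSTM pass yields features for all O(n^2) spans."""
    def __init__(self, emb_dim=100, hidden=200):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True,
                            batch_first=True)

    def forward(self, emb):                  # emb: (1, n, emb_dim)
        out, _ = self.lstm(emb)              # out: (1, n, 2 * hidden)
        h = out.size(2) // 2
        return out[0, :, :h], out[0, :, h:]  # forward f, backward b

def span_repr(f, b, i, j):
    """Span (i, j) as forward difference f_j - f_i and backward b_i - b_j."""
    return torch.cat([f[j] - f[i], b[i] - b[j]])
```

For example, span_repr(f, b, 3, 5) is exactly the (f5 - f3, b3 - b5) pair above for “eating fish”.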

SLIDE 75

Structural & Label Actions

  • structural action: scored from 4 spans (pre-s1, s1, s0, queue)
  • label action: scored from 3 spans (pre-s0, s0, queue)

[diagram: “I do like eating fish .” divided into these spans across the stack and queue]

SLIDE 76

Results on Penn Treebank

Parser                   Search   Recall  Prec.  F1
Carreras et al. (2008)   cubic    90.7    91.4   91.1
Shindo et al. (2012)     cubic    -       -      91.1
Thang et al. (2015)      ~cubic   -       -      91.1
Watanabe et al. (2015)   beam     -       -      90.7
Static Oracle            greedy   90.7    91.4   91.0
Dynamic + Exploration    greedy   90.5    92.1   91.3

  • state of the art despite a simple system with greedy actions and small embeddings trained from scratch
  • first neural constituency parser to outperform sparse features

(Cross and Huang, EMNLP 2016)

SLIDE 77

Extension: Joint Syntax-Discourse Parsing

  • extend span-based parsing to discourse parsing
  • end-to-end, joint syntactic and discourse parsing

[diagram: an RST discourse tree over PTB syntax trees; discourse level above, syntax level below]

(Zhao and Huang, EMNLP 2017)
SLIDE 78

In this talk...

  • Background
  • Dynamic Programming for Incremental Parsing
  • Interlude: NN Features: from feedforward to recurrent
  • Bidirectional RNNs: minimal features; no tree structures!
    • dependency parsing (Kiperwasser+Goldberg, 2016; Cross+Huang, 2016a)
    • span-based constituency parsing (Cross+Huang, 2016b)
  • Marrying DP & RNNs (mostly not my work!)
    • minimal span-based constituency parsing (Stern et al., ACL 2017)
    • transition-based dependency parsing (Shi et al., EMNLP 2017)

SLIDE 79

Minimal Span-based Const. Parsing

  • chart-based bottom-up parsing instead of incremental
  • an even simpler score formulation
  • O(n3) exact DP (CKY) instead of greedy search
  • global loss-augmented training instead of local training

[diagrams: “I do like eating fish .” split into spans pre-s1, s1, s0, queue with boundaries i, k, j (structural action) and pre-s0, s0, queue with boundaries i, j (label action)]

scores (Cross+Huang, EMNLP 16): score_action(i, k, j) for a structural action; score_label(i, j) for a label action

CKY recurrence (Stern+, ACL 2017):

best(i, j) = max_label score_label(i, j, label) + max_k [ best(i, k) + best(k, j) ]

(for a single-word span there is no split point, so only the label term applies)
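
A sketch of this recurrence as memoized Python; score_label(i, j, l) is a hypothetical stand-in for the biLSTM label scorer, with label None meaning no bracket over (i, j):

```python
import functools

def cky(n, labels, score_label):
    """Return (score, brackets) of the best tree over span (0, n)."""
    @functools.lru_cache(maxsize=None)
    def best(i, j):
        label, lscore = max(((l, score_label(i, j, l)) for l in labels),
                            key=lambda x: x[1])
        brackets = [] if label is None else [(i, label, j)]
        if j == i + 1:                        # single word: no split point
            return lscore, brackets
        splits = []
        for k in range(i + 1, j):             # best split point k
            (ls, lb), (rs, rb) = best(i, k), best(k, j)
            splits.append((ls + rs, lb + rb))
        sscore, sbrackets = max(splits, key=lambda x: x[0])
        return lscore + sscore, brackets + sbrackets
    return best(0, n)
```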

SLIDE 84

Global Training & Loss-Augmented Decoding

  • want: the gold tree to score higher than every other tree, with a larger margin for worse trees
  • loss-augmented decoding in training: find the most-violated tree, i.e., a bad tree with a good score
  • loss-augmented decoding for Hamming loss (approximating F1): simply replace score_label(i, j, label) with score_label(i, j, label) + 1(label ≠ label*_ij), where label*_ij is the gold tree label for span (i, j) (could be “no-label”)

(Stern+, ACL 2017)
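
The augmentation is one line around the label scorer; a sketch, with gold_label(i, j) a hypothetical lookup of the gold label (or None) for span (i, j):

```python
def loss_augment(score_label, gold_label):
    """Wrap the label scorer so every wrong label earns a +1 Hamming bonus;
    running the same CKY with this scorer finds the most-violated tree."""
    def augmented(i, j, label):
        return score_label(i, j, label) + (1.0 if label != gold_label(i, j)
                                           else 0.0)
    return augmented
```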

SLIDE 85

Penn Treebank Results

Parser                          F1
Hall et al. (2014)              89.2
Vinyals et al. (2015)           88.3
Cross and Huang (2016b)         91.3
Dyer et al. (2016), corrected   91.7
Liu and Zhang (2017)            91.7
Chart Parser                    91.7
Chart Parser + refinement       91.8

(Stern+, ACL 2017)

SLIDE 86

Minimal Feats for Incremental Dep. Parsing

  • arc-standard (Cross and Huang, ACL 2016); arc-eager (Kiperwasser and Goldberg, 2016)
  • (Shi, Huang, Lee, EMNLP 2017) -- Saturday talk! -- arc-hybrid and arc-eager; works for both greedy search and O(n3) DP


SLIDE 91

Conclusions and Limitations

  • DP and RNNs can indeed be married, if done creatively
  • biRNN summarizing input x and not output structure y
  • this allows efficient DP with exact search
  • combine with global learning (loss-augmented decoding)
  • but exact DP is still too slow
  • future work: linear-time beam search DP with biRNNs
  • what if we want strictly incremental parsing? no biRNN...
  • DP search could compensate for loss of lookahead
  • what about translation? we do need to model y directly...

SLIDE 96

Thank you very much!

非常感谢! (fēi cháng gǎn xiè; “thank you very much”)

James Cross