Linear Time Constituency Parsing with RNNs and Dynamic Programming

SLIDE 1

Linear Time Constituency Parsing with RNNs and Dynamic Programming

Juneki Hong 1, Liang Huang 1,2

1 Oregon State University   2 Baidu Research Silicon Valley AI Lab

SLIDE 2

Span Parsing is SOTA in Constituency Parsing

  • Cross + Huang 2016 introduced Span Parsing
  • But with greedy decoding.
  • Stern et al. 2017 had Span Parsing with exact search and global training
  • But it was too slow: O(n³)
  • Can we get the best of both worlds?
  • Something that is both fast and accurate?

(figure: accuracy vs. speed plot. Cross + Huang 2016 is fast but less accurate; Stern et al. 2017 is accurate but slow; Our Work is both. Kitaev + Klein 2018 and Joshi et al. 2018, new at ACL 2018, are also span parsing.)

SLIDE 3

Both Fast and Accurate!

Model | F1 (PTB test)
Baseline Chart Parser (Stern et al. 2017a) | 91.79
Our Linear Time Parser | 91.97

(figure: parsing speed compared against chart parsing)

SLIDE 4

In this talk, we will discuss:

  • Linear time constituency parsing using dynamic programming
  • Going slower in order to go faster: O(n³) → O(n⁴) → O(n)
  • Cube pruning to speed up incremental parsing with dynamic programming
  • From O(nb²) to O(nb log b)
  • An improved loss function for loss-augmented decoding
  • 2nd highest accuracy among single systems trained on PTB only

O(2ⁿ) → O(n³) → O(n⁴) → O(nb²) → O(nb log b)
SLIDE 5

Span Parsing

  • Span differences are taken from an encoder (in our case: a bi-LSTM)
  • A span is scored and labeled by a feed-forward network.
  • The score of a tree is the sum of all the labeled span scores:

    s_tree(t) = Σ_(i,j,X)∈t s(i, j, X)

  • Span (i, j) is represented by the state differences (f_j − f_i, b_i − b_j) of the forward and backward LSTMs, and s(i, j, X) scores label X for that span.

(figure: bi-LSTM over “⟨s⟩ You should eat ice cream ⟨/s⟩”, forward states f0…f5, backward states b0…b5)

Cross + Huang 2016 Stern et al. 2017 Wang + Chang 2016
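Concretely, the scoring scheme above can be sketched in a few lines. This is a toy sketch, not the paper's actual network: the arrays f and b stand in for bi-LSTM outputs, the linear scorer W replaces the feed-forward labeler, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for bi-LSTM outputs over a 5-word sentence:
# f[i] and b[i] are the forward and backward states at fencepost i.
n, d = 5, 8
f = rng.standard_normal((n + 1, d))
b = rng.standard_normal((n + 1, d))

def span_feature(i, j):
    # Span (i, j) is represented by forward/backward state differences.
    return np.concatenate([f[j] - f[i], b[i] - b[j]])

# A linear scorer standing in for the feed-forward labeling network.
W = rng.standard_normal(2 * d)

def span_score(i, j):
    return float(W @ span_feature(i, j))

def tree_score(spans):
    # Score of a tree: sum of its labeled span scores.
    return sum(span_score(i, j) for (i, j, _label) in spans)

tree = [(1, 3, "NP"), (3, 5, "PP"), (0, 5, "S-VP")]
print(tree_score(tree))
```

In the real model the label also enters the scorer; here the label is carried along only to show the tree-level sum.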

SLIDE 6

Incremental Span Parsing Example

(figure: “Eat ice cream after lunch”, tagged VB NN NN IN NN, with word boundaries 0–5)

Action | Label | Stack
(no actions yet)

Cross + Huang 2016

SLIDE 7

Incremental Span Parsing Example

(figure: spans over “Eat ice cream after lunch”; partial tree after the actions below)

Action | Label | Stack
1 Shift | ø | (0,1)

Cross + Huang 2016

SLIDE 8

Incremental Span Parsing Example

(figure: spans over “Eat ice cream after lunch”; partial tree after the actions below)

Action | Label | Stack
1 Shift | ø | (0,1)
2 Shift | ø | (0,1) (1,2)

Cross + Huang 2016

SLIDE 9

Incremental Span Parsing Example

(figure: spans over “Eat ice cream after lunch”; partial tree after the actions below)

Action | Label | Stack
1 Shift | ø | (0,1)
2 Shift | ø | (0,1) (1,2)
3 Shift | ø | (0,1) (1,2) (2,3)

Cross + Huang 2016

SLIDE 10

Incremental Span Parsing Example

(figure: spans over “Eat ice cream after lunch”; partial tree after the actions below)

Action | Label | Stack
1 Shift | ø | (0,1)
2 Shift | ø | (0,1) (1,2)
3 Shift | ø | (0,1) (1,2) (2,3)
4 Reduce | NP | (0,1) (1,3)

Cross + Huang 2016

SLIDE 11

Incremental Span Parsing Example

(figure: spans over “Eat ice cream after lunch”; partial tree after the actions below)

Action | Label | Stack
1 Shift | ø | (0,1)
2 Shift | ø | (0,1) (1,2)
3 Shift | ø | (0,1) (1,2) (2,3)
4 Reduce | NP | (0,1) (1,3)
5 Reduce | ø | (0,3)

Cross + Huang 2016

SLIDE 12

Incremental Span Parsing Example

(figure: spans over “Eat ice cream after lunch”; partial tree after the actions below)

Action | Label | Stack
1 Shift | ø | (0,1)
2 Shift | ø | (0,1) (1,2)
3 Shift | ø | (0,1) (1,2) (2,3)
4 Reduce | NP | (0,1) (1,3)
5 Reduce | ø | (0,3)
6 Shift | ø | (0,3) (3,4)

Cross + Huang 2016

SLIDE 13

Incremental Span Parsing Example

(figure: spans over “Eat ice cream after lunch”; partial tree after the actions below)

Action | Label | Stack
1 Shift | ø | (0,1)
2 Shift | ø | (0,1) (1,2)
3 Shift | ø | (0,1) (1,2) (2,3)
4 Reduce | NP | (0,1) (1,3)
5 Reduce | ø | (0,3)
6 Shift | ø | (0,3) (3,4)
7 Shift | NP | (0,3) (3,4) (4,5)

Cross + Huang 2016

SLIDE 14

Incremental Span Parsing Example

(figure: spans over “Eat ice cream after lunch”; partial tree after the actions below)

Action | Label | Stack
1 Shift | ø | (0,1)
2 Shift | ø | (0,1) (1,2)
3 Shift | ø | (0,1) (1,2) (2,3)
4 Reduce | NP | (0,1) (1,3)
5 Reduce | ø | (0,3)
6 Shift | ø | (0,3) (3,4)
7 Shift | NP | (0,3) (3,4) (4,5)
8 Reduce | PP | (0,3) (3,5)

Cross + Huang 2016

SLIDE 15

Incremental Span Parsing Example

(figure: the completed gold parse: (S (VP (VB Eat) (NP (NN ice) (NN cream)) (PP (IN after) (NP (NN lunch))))))

Action | Label | Stack
1 Shift | ø | (0,1)
2 Shift | ø | (0,1) (1,2)
3 Shift | ø | (0,1) (1,2) (2,3)
4 Reduce | NP | (0,1) (1,3)
5 Reduce | ø | (0,3)
6 Shift | ø | (0,3) (3,4)
7 Shift | NP | (0,3) (3,4) (4,5)
8 Reduce | PP | (0,3) (3,5)
9 Reduce | S-VP | (0,5)

Cross + Huang 2016
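The action column above can be replayed mechanically. A minimal sketch (function names ours; labels omitted, since only the span arithmetic matters here):

```python
def run_actions(actions):
    # Replay shift/reduce actions, tracking the stack of spans.
    # Shift pushes the next single-word span (j, j+1); reduce pops the
    # top two spans (k, i), (i, j) and pushes the combined span (k, j).
    stack, j = [], 0
    for action in actions:
        if action == "shift":
            stack.append((j, j + 1))
            j += 1
        else:
            _, right = stack.pop()   # top span (i, j)
            k, _ = stack.pop()       # span below it (k, i)
            stack.append((k, right))
    return stack

# Gold action sequence from the table above
# (labels would be ø ø ø NP ø ø NP PP S-VP).
gold = ["shift", "shift", "shift", "reduce", "reduce",
        "shift", "shift", "reduce", "reduce"]
print(run_actions(gold))  # [(0, 5)]
```

Each intermediate stack matches the Stack column of the corresponding slide.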

SLIDE 16

How Many Possible Parsing Paths?

  • 2 actions per state.
  • O(2ⁿ) possible parsing paths.
SLIDE 17

Equivalent Stacks?

  • Observe that all stacks that end with (i, j) will be treated the same!
  • …Until (i, j) is popped off.
  • So we can treat these as “temporarily equivalent”, and merge them:

[(0, 2), (2, 7), (7, 9)] and [(0, 3), (3, 7), (7, 9)] become […, (7, 9)]

Graph-Structured Stack (Tomita 1988; Huang + Sagae 2010)

SLIDE 18

Equivalent Stacks?

  • Observe that all stacks that end with (i, j) will be treated the same!
  • …Until (i, j) is popped off.
  • This is our new stack representation: the merged state […, (7, 9)] keeps left pointers back to its possible predecessors […, (2, 7)] and […, (3, 7)], which in turn point back to […, (0, 2)] and […, (0, 3)].

Graph-Structured Stack (Tomita 1988; Huang + Sagae 2010)

SLIDE 19

Equivalent Stacks?

  • Observe that all stacks that end with (i, j) will be treated the same!
  • …Until (i, j) is popped off.
  • A reduce follows a left pointer: […, (k, i)] + […, (i, j)] → […, (k, j)]. For example, […, (7, 9)] reduces with […, (2, 7)] or […, (3, 7)] into […, (2, 9)] or […, (3, 9)]. Reduce actions: O(n³).

Graph-Structured Stack (Tomita 1988; Huang + Sagae 2010)

SLIDE 20

Dynamic Programming: Merging Stacks

  • Temporarily merging stacks makes our state space polynomial: from O(2ⁿ) down to O(n³).
  • And our parsing state is represented by its top span (i, j).

SLIDE 21

Becoming Action Synchronous

  • Shift-reduce parsers are traditionally action synchronous.
  • This makes beam search straightforward.
  • We will also do the same,
  • But we will show that this slows down our DP (before applying beam search): from O(n³) to O(n⁴).

SLIDE 22

Action Synchronous Parsing Example

(figure: DP lattice of span states, one column per action step)

Gold: Shift (0,1)

SLIDE 23

Action Synchronous Parsing Example

(figure: DP lattice with left pointers; the gold parse path is highlighted)

Gold: Shift (0,1) Shift (1,2)

SLIDE 24

Action Synchronous Parsing Example

(figure: DP lattice with left pointers; the gold parse path is highlighted)

Gold: Shift (0,1) Shift (1,2) Shift (2,3)

SLIDE 25

Action Synchronous Parsing Example

(figure: DP lattice with left pointers; the gold parse path is highlighted)

Gold: Shift (0,1) Shift (1,2) Shift (2,3) Reduce (1,3)

SLIDE 26

Action Synchronous Parsing Example

(figure: DP lattice with left pointers; the gold parse path is highlighted)

Gold: Shift (0,1) Shift (1,2) Shift (2,3) Reduce (1,3) Reduce (0,3)

SLIDE 27

Action Synchronous Parsing Example

(figure: DP lattice with left pointers; the gold parse path is highlighted)

Gold: Shift (0,1) Shift (1,2) Shift (2,3) Reduce (1,3) Reduce (0,3) Shift (3,4)

SLIDE 28

Action Synchronous Parsing Example

(figure: full DP lattice over all 9 action steps)

Gold: Shift (0,1) Shift (1,2) Shift (2,3) Reduce (1,3) Reduce (0,3) Shift (3,4) Shift (4,5) Reduce (3,5) Reduce (0,5)

SLIDE 29

Runtime Analysis: O(n⁴)

(figure: the full DP lattice over all action steps)

Huang + Sagae 2010

SLIDE 30

Runtime Analysis: O(n⁴)

#steps: 2n − 1 = O(n)

(figure: DP lattice)

Huang + Sagae 2010

SLIDE 31

Runtime Analysis: O(n⁴)

#steps: 2n − 1 = O(n)
#states per step, each a span (i, j): O(n²)

(figure: DP lattice)

Huang + Sagae 2010

SLIDE 32

Runtime Analysis: O(n⁴)

#steps: 2n − 1 = O(n)
#states per step, each a span (i, j): O(n²)
Total states: O(n³)

(figure: DP lattice)

Huang + Sagae 2010

SLIDE 33

Runtime Analysis: O(n⁴)

#steps: 2n − 1 = O(n)
#states per step, each a span (i, j): O(n²)
Total states: O(n³)
#left pointers per state: O(n), giving O(n⁴) overall

Check out the paper for our new theorem (thanks to Dezhong Deng!): a state […, (i, j)] at step l takes its left pointers […, (k, i)] from step l′ = l − 2(j − i) + 1, and reduces into […, (k, j)] at step l + 1.

Huang + Sagae 2010

SLIDE 34

Going slower to go faster

  • Our action-synchronous algorithm has a slower runtime than CKY!
  • However, it also becomes straightforward to prune using beam search.
  • So we can achieve a linear runtime in the end.

(figure: the full O(n⁴) DP lattice next to its beam-pruned version)
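A minimal sketch of the resulting action-synchronous beam search, with two simplifications that are ours, not the paper's: span scores come from a placeholder function, and reduce always rejoins with the left context at position 0 instead of following left pointers.

```python
import heapq

def beam_parse(n, b, score):
    # Action-synchronous beam search over merged span states.
    # `score(step, span)` stands in for the model's span scores.
    beam = {(0, 0): 0.0}  # initial state: empty span before word 0
    for step in range(1, 2 * n):
        cands = {}
        for (i, j), s in beam.items():
            nexts = []
            if j < n:
                nexts.append((j, j + 1))  # shift
            if i > 0:
                nexts.append((0, j))      # simplified reduce
            for sp in nexts:
                v = s + score(step, sp)
                if v > cands.get(sp, float("-inf")):
                    cands[sp] = v
        # Keep only the top b states per step: O(n) steps of bounded work.
        beam = dict(heapq.nlargest(b, cands.items(), key=lambda kv: kv[1]))
    return beam

final = beam_parse(3, b=4, score=lambda step, span: 0.0)
print(final)  # the full-sentence span (0, 3) survives in the beam
```

With the beam width b fixed, the total work no longer depends on n beyond the linear number of steps.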

SLIDE 35

Now our runtime is O(n).

(figure: the beam-pruned DP lattice)

SLIDE 36

But this O(n) is hiding a constant.

(figure: beam-pruned lattice)

SLIDE 37

But this O(n) is hiding a constant.

  • O(b) left pointers per state
  • b states per action step
  • O(nb²) runtime

SLIDE 38

Cube Pruning

  • We can apply cube pruning to bring this down to O(nb log b).

Chiang 2007; Huang + Chiang 2007

SLIDE 39

Cube Pruning

  • We can apply cube pruning to make it O(nb log b)
  • By pushing all states and their left pointers into a heap

Chiang 2007; Huang + Chiang 2007

SLIDE 40

Cube Pruning

  • By pushing all states and their left pointers into a heap
  • And popping the top b unique subsequent states
  • We can apply cube pruning to make it O(nb log b)

Chiang 2007; Huang + Chiang 2007

SLIDE 41

Cube Pruning

  • By pushing all states and their left pointers into a heap
  • And popping the top b unique subsequent states
  • This is the first time cube pruning has been applied to incremental parsing
  • We can apply cube pruning to make it O(nb log b)

Chiang 2007; Huang + Chiang 2007
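The heap-based enumeration can be sketched as classic cube pruning over one candidate grid (toy scores; with real, non-monotonic neural scores the result is approximate):

```python
import heapq

def cube_pruning(states, pointers, combine, b):
    # Lazily pop the b best (state, left-pointer) combinations from a heap
    # instead of scoring all |states| x |pointers| pairs (Chiang 2007).
    # Both lists are sorted by descending score; `combine` is additive
    # here, a toy stand-in for scoring the resulting span state.
    heap = [(-combine(states[0], pointers[0]), 0, 0)]
    seen = {(0, 0)}
    out = []
    while heap and len(out) < b:
        neg, i, j = heapq.heappop(heap)
        out.append((states[i], pointers[j], -neg))
        for i2, j2 in ((i + 1, j), (i, j + 1)):  # push grid neighbors
            if i2 < len(states) and j2 < len(pointers) and (i2, j2) not in seen:
                seen.add((i2, j2))
                heapq.heappush(heap, (-combine(states[i2], pointers[j2]), i2, j2))
    return out

best = cube_pruning([9, 7, 4], [5, 2, 1], lambda s, p: s + p, b=3)
print([score for _, _, score in best])  # [14, 12, 11]
```

Popping b items from a heap of at most O(b) live cells costs O(b log b) per step, versus O(b²) for scoring every pair.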

SLIDE 42

Runtime on PTB and Discourse Treebank

(figure: empirical parsing time vs. sentence length, compared against chart parsing)

SLIDE 43

Training

  • Structured SVM approach (Taskar et al. 2003; Stern et al. 2017):
  • Goal: score the gold tree t* higher than all other trees by a margin:

    ∀t: s(t*) − s(t) ≥ Δ(t, t*)

  • Loss-augmented decoding:
  • During training, return the most-violated tree (i.e., the one with the highest augmented score):

    t̂ = argmax_t [s(t) + Δ(t, t*)]

  • Minimize: s(t̂) + Δ(t̂, t*) − s(t*)
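Written as code over a toy, explicitly enumerated candidate set (a real parser searches the tree space instead of enumerating it), the objective looks like:

```python
def structured_hinge(score, delta, gold, candidates):
    # Loss-augmented decoding: find the most violated candidate, i.e. the
    # one maximizing score(t) + delta(t), then compute the margin
    # violation score(t_hat) + delta(t_hat) - score(gold) to minimize.
    t_hat = max(candidates, key=lambda t: score(t) + delta(t))
    return max(0.0, score(t_hat) + delta(t_hat) - score(gold))

# Toy trees identified by name, with made-up scores and losses.
scores = {"gold": 5.0, "near_miss": 4.5, "bad": 1.0}
deltas = {"gold": 0.0, "near_miss": 2.0, "bad": 6.0}
loss = structured_hinge(scores.get, deltas.get, "gold", list(scores))
print(loss)  # 2.0: "bad" is most violated (1.0 + 6.0 vs. gold's 5.0)
```

Note that the most violated tree need not be the highest-scoring one; the loss term Δ deliberately biases the argmax toward wrong trees that are scored too close to gold.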
SLIDE 44

Loss Function

  • Counts the incorrectly labeled spans in the tree (Stern et al. 2017):

    Δ(t, t*) = Σ_(i,j,X)∈t 1[X ≠ t*(i, j)]

  • Happens to be decomposable, so it can even be used to compare partial trees.

SLIDE 45

Novel Cross-Span Loss

  • We observe that the null label ø is used in two different ways:
  • To facilitate ternary and n-ary branching trees.
  • As a default label for incorrect spans that violate other gold spans.

(figure: two spans with gold label t*(i, j) = ø: one a legitimate intermediate span, one crossing a gold bracket)

SLIDE 46

Novel Cross-Span Loss

  • We modify the loss to account for incorrect spans in the tree. Starting point (Stern et al. 2017):

    Δ(t, t*) = Σ_(i,j,X)∈t 1[X ≠ t*(i, j)]

SLIDE 47

Novel Cross-Span Loss

  • We modify the loss to account for incorrect spans in the tree:

    Δ(t, t*) = Σ_(i,j,X)∈t 1[X ≠ t*(i, j) ∨ cross(i, j, t*)]

  • cross(i, j, t*) indicates whether (i, j) crosses a span in the gold tree.
  • Still decomposable over spans, so it can be used to compare partial trees.
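A small sketch of the modified loss (function and variable names are ours): a span crosses a gold span when the two overlap but neither contains the other.

```python
def cross(i, j, gold_spans):
    # (i, j) crosses a gold span (k, m) when they overlap but neither
    # contains the other.
    return any(k < i < m < j or i < k < j < m for (k, m) in gold_spans)

def cross_span_loss(tree, gold_label, gold_spans):
    # Modified decomposable loss: a span counts as an error if its label
    # is wrong OR it crosses a gold bracket.
    return sum(1 for (i, j, X) in tree
               if gold_label.get((i, j), "ø") != X or cross(i, j, gold_spans))

# Gold brackets from the running "Eat ice cream after lunch" example.
gold_spans = [(0, 5), (0, 1), (1, 3), (3, 5)]
gold_label = {(1, 3): "NP", (3, 5): "PP", (0, 5): "S-VP"}
# (2, 4) crosses the gold spans (1, 3) and (3, 5), so even with the
# "correct" default label ø it is now penalized:
print(cross(2, 4, gold_spans), cross_span_loss([(2, 4, "ø")], gold_label, gold_spans))
```

Under the original loss the span (2, 4) with label ø would have cost nothing; the cross term is what distinguishes it from a harmless intermediate ø span.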

SLIDE 48

Max-Violation Updates

  • Take the largest augmented-loss violation across all time steps.
  • This is the max-violation point that we use to train (Huang et al. 2012).

(figure: best-in-beam and worst-in-beam scores over time; once the correct sequence falls off the beam, a standard "full" update is invalid; candidate update points: early, max-violation (biggest violation), latest (last valid update))
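The max-violation choice can be sketched as follows, with toy per-step score lists standing in for the gold-prefix and best-in-beam augmented scores collected during beam search:

```python
def max_violation_step(gold_scores, best_scores):
    # Pick the time step with the largest violation: the augmented score
    # of the best hypothesis in the beam minus the gold prefix score
    # (Huang et al. 2012).
    violations = [best - gold for gold, best in zip(gold_scores, best_scores)]
    step = max(range(len(violations)), key=violations.__getitem__)
    return step, violations[step]

# Gold-prefix scores vs. best-in-beam augmented scores over 4 steps.
step, v = max_violation_step([1.0, 2.0, 3.0, 4.0], [1.5, 3.5, 3.2, 4.1])
print(step, v)  # update at the step with the biggest violation
```

Updating at this step, rather than at the end of the sequence, keeps the update valid even after the gold sequence has fallen off the beam.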

SLIDE 49

Comparison with Baseline Chart Parser

Model | Note | F1 (PTB test)
Stern et al. (2017a) | Baseline Chart Parser | 91.79
Stern et al. (2017a) | + our cross-span loss | 91.81
Our Work | Beam 15 | 91.84
Our Work | Beam 20 | 91.97

SLIDE 50

Comparison to Other Parsers

PTB only, Single Model, End-to-End:
Model | Note | F1
Durrett + Klein 2015 | | 91.1
Cross + Huang 2016 | Original Span Parser | 91.3
Liu + Zhang 2016 | | 91.7
Dyer et al. 2016 | Discriminative | 91.7
Stern et al. 2017a | Baseline Chart Parser | 91.79
Stern et al. 2017c | Separate Decoding | 92.56
Our Work | Beam 20 | 91.97

Reranking, Ensemble, Extra Data:
Model | Note | F1
Vinyals et al. 2015 | Ensemble | 90.5
Dyer et al. 2016 | Generative Reranking | 93.3
Choe + Charniak 2016 | Reranking | 93.8
Fried et al. 2017 | Ensemble Reranking | 94.25

SLIDE 51

Conclusions

  • Linear-time, span-based constituency parsing with dynamic programming
  • Cube pruning to speed up incremental parsing with dynamic programming
  • A cross-span loss extension for improving loss-augmented decoding
  • Result: faster and more accurate than cubic-time chart parsing
  • 2nd highest accuracy for single-model, end-to-end systems trained on PTB only
  • Stern et al. 2017c is more accurate, but uses separate decoding and is much slower
  • After this ACL, that ranking is definitely no longer true (e.g. Joshi et al. 2018, Kitaev + Klein 2018)
  • But both are span-based parsers and can be linearized in the same way!

O(2ⁿ) → O(n³) → O(n⁴) → O(nb²) → O(nb log b)
SLIDE 52

Thank you! Questions?

SLIDE 53

Acknowledgements

  • Dezhong Deng, for his theorem on predecessor states
  • And for his mathematical proofreading of the training sections
  • Mitchell Stern, for releasing his code and for his suggestions