slide-1
SLIDE 1

Fast(er) Exact Decoding and Global Training for Transition-Based Dependency Parsing via a Minimal Feature Set

Tianze Shi*, Liang Huang†, Lillian Lee*

* Cornell University

† Oregon State University

slide-2
SLIDE 2

Short Version

  • Transition-based dependency parsing has an exponentially large search space
  • $O(n^3)$ exact solutions exist
  • In practice, however, we needed rich features ⟹ $O(n^6)$
  • (This work) With bi-LSTMs, we can now do $O(n^3)$!
  • And we get state-of-the-art results

slide-3
SLIDE 3

Dependency Parsing

INPUT: She wanted to eat an apple

OUTPUT: [Figure: dependency tree over the sentence, with arcs labeled root, nsubj, xcomp, mark, dobj, det]

Background | $O(n^3)$ in theory | $O(n^6)$ in practice | Back to $O(n^3)$ | Results

slide-4
SLIDE 4

Transition-based Dependency Parsing

[Figure: search graph of transition sequences leading from the initial state to the terminal states]

slide-5
SLIDE 5

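Transition-based Dependency Parsing

[Figure: search graph of transition sequences leading from the initial state to the terminal states]

Goal:

$$\max_{y} \operatorname{score}(y) = \max_{y} \sum_{t \in y} \operatorname{score}(t)$$

where $y$ ranges over transition sequences from the initial state to a terminal state and $t$ over the transitions in $y$ (both are drawn as pictures on the slide).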

slide-6
SLIDE 6

Exact Decoding

  • Dynamic programming (Huang and Sagae, 2010; Kuhlmann, Gómez-Rodríguez and Satta, 2011) … since transition (sub-)sequences can be reused

slide-7
SLIDE 7

Exact Decoding

[Figure: search graph from the initial state to the terminal states]

Goal: $\max_{y} \operatorname{score}(y) = \max_{y} \sum_{t \in y} \operatorname{score}(t)$, over transition sequences $y$ and their transitions $t$

Exponential to polynomial
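To make the size of the search space concrete, here is a minimal sketch that counts complete arc-hybrid transition sequences for an $n$-word sentence by brute-force recursion. The names `count_sequences` and `search_space_size`, and the encoding of a state as (stack size, remaining buffer length), are illustrative assumptions and do not come from the slides. The preconditions follow the arc-hybrid transitions used in the rest of the talk.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_sequences(stack_size: int, buffer_len: int) -> int:
    """Count complete arc-hybrid transition sequences from a state.

    ROOT always sits at the bottom of the stack, so stack_size >= 1.
    """
    if stack_size == 1 and buffer_len == 0:
        return 1  # terminal state: only ROOT is left
    total = 0
    if buffer_len > 0:
        # shift: move the buffer front onto the stack
        total += count_sequences(stack_size + 1, buffer_len - 1)
    if stack_size >= 2:
        # reduce-right: attach the stack top to the item below it
        total += count_sequences(stack_size - 1, buffer_len)
        if buffer_len > 0:
            # reduce-left: attach the stack top to the buffer front
            total += count_sequences(stack_size - 1, buffer_len)
    return total


def search_space_size(n: int) -> int:
    """Number of complete transition sequences for an n-word sentence."""
    return count_sequences(1, n)


if __name__ == "__main__":
    for n in range(1, 11):
        print(n, search_space_size(n))
```

The count is at least $2^{n-1}$: the sequence that reduces every word immediately after shifting it already has two reduce choices for every word except the last. This is the exponential growth that dynamic programming avoids.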

slide-8
SLIDE 8

Transition Systems

System          DP Complexity   # Transitions
Arc-standard    $O(n^4)$        3
Arc-eager       $O(n^3)$        4
Arc-hybrid      $O(n^3)$        3

Annotations on the slide: “In our paper” and “Presentational convenience” (the rest of this talk uses arc-hybrid).

slide-9
SLIDE 9

Arc-hybrid Transition System

(Yamada and Matsumoto, 2003; Gómez-Rodríguez et al., 2008; Kuhlmann et al., 2011)

Search state:     Stack: … $s_2$ $s_1$ $s_0$     Buffer: $b_0$ $b_1$ …
Initial state:    Stack: ROOT                    Buffer: She wanted …
Terminal state:   Stack: ROOT                    Buffer: (empty)

slide-10
SLIDE 10

Arc-hybrid Transition System

Transitions:

  • shift: move $b_0$ from the buffer onto the stack
  • reduce↶: pop $s_0$ and attach it as a dependent of $b_0$
  • reduce↷: pop $s_0$ and attach it as a dependent of $s_1$ (same as arc-standard)

slide-13
SLIDE 13

Arc-hybrid Transition System

Each row shows the state after the transition is applied; arcs are written head → dependent.

Transition   Stack              Buffer                        Arc added
(initial)    ROOT               She wanted to eat an apple
shift        ROOT She           wanted to eat an apple
reduce↶      ROOT               wanted to eat an apple        wanted → She
shift        ROOT wanted        to eat an apple
shift        ROOT wanted to     eat an apple

slide-19
SLIDE 19

Arc-hybrid Transition System

Transition   Stack                    Buffer          Arc added
reduce↶      ROOT wanted              eat an apple    eat → to
shift        ROOT wanted eat          an apple
shift        ROOT wanted eat an       apple
reduce↶      ROOT wanted eat          apple           apple → an
shift        ROOT wanted eat apple    (empty)

slide-24
SLIDE 24

Arc-hybrid Transition System

Transition   Stack              Buffer     Arc added
reduce↷      ROOT wanted eat    (empty)    eat → apple
reduce↷      ROOT wanted        (empty)    wanted → eat
reduce↷      ROOT               (empty)    ROOT → wanted (terminal)
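The derivation above can be replayed with a small simulator. This is a minimal sketch, assuming positions 0 for ROOT and 1..n for the words; the function name `run_arc_hybrid` and the string constants are hypothetical and are not taken from the authors' code.

```python
from typing import Dict, List

SHIFT, REDUCE_LEFT, REDUCE_RIGHT = "shift", "reduce-left", "reduce-right"


def run_arc_hybrid(words: List[str], transitions: List[str]) -> Dict[int, int]:
    """Apply arc-hybrid transitions and return a map dependent -> head.

    Position 0 is ROOT; the words occupy positions 1..n.
    """
    stack: List[int] = [0]
    buffer: List[int] = list(range(1, len(words) + 1))
    heads: Dict[int, int] = {}
    for t in transitions:
        if t == SHIFT:
            if not buffer:
                raise ValueError("shift needs a non-empty buffer")
            stack.append(buffer.pop(0))
        elif t == REDUCE_LEFT:
            # pop s0 and attach it to b0; ROOT can never be a dependent
            if len(stack) < 2 or not buffer:
                raise ValueError("reduce-left needs s0 != ROOT and a non-empty buffer")
            heads[stack.pop()] = buffer[0]
        elif t == REDUCE_RIGHT:
            # pop s0 and attach it to s1
            if len(stack) < 2:
                raise ValueError("reduce-right needs at least two stack items")
            dependent = stack.pop()
            heads[dependent] = stack[-1]
        else:
            raise ValueError(f"unknown transition: {t}")
    if stack != [0] or buffer:
        raise ValueError("transition sequence did not reach the terminal state")
    return heads


words = ["She", "wanted", "to", "eat", "an", "apple"]
sequence = [
    SHIFT, REDUCE_LEFT,          # wanted -> She
    SHIFT, SHIFT, REDUCE_LEFT,   # eat -> to
    SHIFT, SHIFT, REDUCE_LEFT,   # apple -> an
    SHIFT, REDUCE_RIGHT,         # eat -> apple
    REDUCE_RIGHT,                # wanted -> eat
    REDUCE_RIGHT,                # ROOT -> wanted
]
heads = run_arc_hybrid(words, sequence)
print(heads)
```

Every word is shifted once and reduced once, so a complete sequence for $n$ words always has $2n$ transitions (12 here).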

slide-27
SLIDE 27

Deduction System for Arc-hybrid

Positions: ROOT = 0, She = 1, wanted = 2, to = 3, eat = 4, an = 5, apple = 6 ($n$)

  • Deduction Item: $[i, j]$ (Stack: … $i$; Buffer: $j$ …)
  • Goal: $[0, n+1]$ (Stack: ROOT at position 0; Buffer: empty, front at position $n+1$)

slide-28
SLIDE 28

Deduction System for Arc-hybrid

shift: $\dfrac{[i, j]}{[j, j+1]}$

(Stack: … $i$; Buffer: $j$ …) ⟶ (Stack: … $i$ $j$; Buffer: $j+1$ …)

slide-29
SLIDE 29

Deduction System for Arc-hybrid

reduce↶: $\dfrac{[i, k] \quad [k, j]}{[?, j]}$

(Stack: … $k$; Buffer: $j$ …) ⟶ (Stack: … ?; Buffer: $j$ …)

slide-30
SLIDE 30

Deduction System for Arc-hybrid

reduce↶: $\dfrac{[i, k] \quad [k, j]}{[i, j]}$

(Stack: … $i$ $k$; Buffer: $j$ …) ⟶ (Stack: … $i$; Buffer: $j$ …)

slide-31
SLIDE 31

Deduction System for Arc-hybrid

reduce↶: $\dfrac{[i, k] \quad [k, j]}{[i, j]}$

[Figure: from (Stack: … $i$; Buffer: $k$ …), a sequence of transitions (⟶*) leads to (Stack: … $i$ $k$; Buffer: $j$ …); reduce↶ then gives (Stack: … $i$; Buffer: $j$ …)]

slide-32
SLIDE 32

Deduction System for Arc-hybrid

reduce↶: $\dfrac{[i, k] \quad [k, j]}{[i, j]}$

[Figure: the same derivation, with the transition sub-sequences (⟶*) annotated by their items $[i, k]$, $[k, j]$ and $[i, j]$]

In Kuhlmann et al. (2011)’s notation

slide-33
SLIDE 33

Deduction System for Arc-hybrid

reduce↷: $\dfrac{[i, k] \quad [k, j]}{[i, j]}$

[Figure: the same derivation as for reduce↶, with the transition sub-sequences (⟶*) annotated by their items $[i, k]$, $[k, j]$ and $[i, j]$]

slide-34
SLIDE 34

Deduction System for Arc-hybrid

shift: $\dfrac{[i, j]}{[j, j+1]}$

reduce↶: $\dfrac{[i, k] \quad [k, j]}{[i, j]}$, adding the arc $k$ ↶ $j$

reduce↷: $\dfrac{[i, k] \quad [k, j]}{[i, j]}$, adding the arc $i$ ↷ $k$

Goal: $[0, n+1]$

Time complexity: $O(n^3)$
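The cubic bound can be read off the rules directly. The count below is a standard argument for deduction systems of this shape rather than something stated on the slide:

```latex
\begin{align*}
\#\{\text{items } [i, j] : 0 \le i < j \le n+1\} &= \binom{n+2}{2} = O(n^2), \\
\#\{\text{reduce instantiations } (i, k, j) : 0 \le i < k < j \le n+1\} &= \binom{n+2}{3} = O(n^3).
\end{align*}
```

Each instantiation takes constant time as long as its score depends only on the indices $i$, $k$, $j$, so decoding is $O(n^3)$ overall.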

slide-35
SLIDE 35

Time Complexity in Practice

  • Complexity depends on feature representation!
  • Typical feature representation:
    • Feature templates look at specific positions in the stack and in the buffer

slide-36
SLIDE 36

Time Complexity in Practice

  • Compare the following features for the shift from item $[i, j]$ to $[j, j+1]$:
    • $\{s_0, b_0\}$: the score $s_{sh}(i, j)$ can be computed from the item alone
    • $\{s_1, s_0, b_0\}$: the score $s_{sh}(?, i, j)$ also depends on $s_1$; information about $s_1$ is not available in the item and needs extra bookkeeping
  • Time complexities are different!!!

slide-37
SLIDE 37

Time Complexity in Practice

  • Complexity depends on feature representation!
  • Typical feature representation:
    • Feature templates look at specific positions in the stack and in the buffer
  • Best-known complexity in practice: $O(n^6)$ (Huang and Sagae, 2010)

[Figure: feature positions: stack $s_2$, $s_1$, $s_0$; buffer $b_0$, $b_1$; children $s_1.lc$, $s_1.rc$, $s_0.lc$, $s_0.rc$]

slide-38
SLIDE 38

Best-known Time Complexities (recap)

Theoretical: $O(n^3)$
Practical: $O(n^6)$

Gap: feature representation

slide-39
SLIDE 39

In Practice, Instead of Exact Decoding …

  • Greedy search (Nivre, 2003, 2004, 2008; Chen and Manning, 2014)
  • Beam search (Zhang and Clark, 2011; Weiss et al., 2015)
  • Best-first search (Sagae and Lavie, 2006; Sagae and Tsujii, 2007; Zhao et al., 2013)
  • Dynamic oracles (Goldberg and Nivre, 2012, 2013)
  • “Global” normalization on the beam (Zhou et al., 2015; Andor et al., 2016)
  • Reinforcement learning (Lê and Fokkens, 2017)
  • Learning to search (Daumé III and Marcu, 2005; Chang et al., 2016; Wiseman and Rush, 2016)
  • …

slide-40
SLIDE 40

How Many Positional Features Do We Need?

[Figure: spectrum of feature sets, so far showing: non-neural (manual engineering); Chen and Manning (2014)]

slide-41
SLIDE 41

How Many Positional Features Do We Need?

  • Chen and Manning (2014)

[Figure: feature positions: stack $s_2$, $s_1$, $s_0$; buffer $b_0$, $b_1$, $b_2$; indexed children $lc$ and $rc$ of $s_1$ and $s_0$; grandchildren $s_0.rc_0.rc_0$, $s_0.lc_0.lc_0$, $s_1.rc_0.rc_0$, $s_1.lc_0.lc_0$]

slide-42
SLIDE 42

How Many Positional Features Do We Need?

[Figure: spectrum of feature sets along an axis labeled “More tree-structure information”:
  • Non-neural (manual engineering); Chen and Manning (2014)
  • Stack LSTM (Dyer et al., 2016)
  • Bi-LSTM: Kiperwasser and Goldberg (2016); Cross and Huang (2016)
Decoding labels on the slide: Exponential DP, Slow DP, Fast DP]

slide-43
SLIDE 43

How Many Positional Features Do We Need?

  • LSTMs can be used to encode the entire stack and buffer (Dyer et al., 2016)

[Figure: one LSTM reads the whole stack (… $s_2$ $s_1$ $s_0$) and another reads the whole buffer ($b_0$ $b_1$ $b_2$ …)]

slide-44
SLIDE 44

How Many Positional Features Do We Need?

  • Bi-LSTMs give compact feature representations (Kiperwasser and Goldberg, 2016; Cross and Huang, 2016)
  • Features used in Kiperwasser and Goldberg (2016): stack $s_2$, $s_1$, $s_0$; buffer $b_0$
  • Features used in Cross and Huang (2016): stack $s_1$, $s_0$; buffer $b_0$

slide-45
SLIDE 45

Model Architecture

[Figure: She wanted to eat an apple → word embeddings + POS embeddings → bi-directional LSTM → vectors at $s_2$, $s_1$, $s_0$, $b_0$ → multi-layer perceptron → scores $s_{sh}$, $s_{re\curvearrowleft}$, $s_{re\curvearrowright}$]

slide-46
SLIDE 46

Model Architecture

[Figure: the same architecture, with the multi-layer perceptron reading only the vectors at $s_1$, $s_0$, $b_0$]

slide-47
SLIDE 47

Model Architecture

[Figure: the same architecture, with the multi-layer perceptron reading only the vectors at $s_0$, $b_0$]

slide-48
SLIDE 48

Model Architecture

[Figure: the same architecture, with the multi-layer perceptron reading only the vector at $b_0$]
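The scoring step of this architecture can be sketched in a few lines. The sketch below is a rough illustration under stated assumptions: the bi-LSTM is replaced by random placeholder vectors, the perceptron has a single tanh hidden layer with made-up sizes, and the names `make_mlp` and `score_transitions` are hypothetical. It only shows how the three transition scores depend on the vectors at $s_0$ and $b_0$ and nothing else.

```python
import math
import random
from typing import Dict, List

Vector = List[float]


def make_mlp(input_dim: int, hidden_dim: int, seed: int = 0) -> Dict[str, list]:
    """Randomly initialised one-hidden-layer perceptron with three outputs."""
    rng = random.Random(seed)

    def matrix(rows: int, cols: int) -> List[Vector]:
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W1": matrix(hidden_dim, input_dim),
        "b1": [0.0] * hidden_dim,
        "W2": matrix(3, hidden_dim),
        "b2": [0.0] * 3,
    }


def affine(W: List[Vector], b: Vector, x: Vector) -> Vector:
    """Compute W x + b with plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def score_transitions(mlp, token_vectors: List[Vector], s0: int, b0: int) -> Dict[str, float]:
    """Score the three transitions from the vectors at positions s0 and b0 only."""
    x = token_vectors[s0] + token_vectors[b0]  # list concatenation
    hidden = [math.tanh(h) for h in affine(mlp["W1"], mlp["b1"], x)]
    sh, re_left, re_right = affine(mlp["W2"], mlp["b2"], hidden)
    return {"shift": sh, "reduce-left": re_left, "reduce-right": re_right}


rng = random.Random(1)
dim = 4
# Placeholder vectors standing in for bi-LSTM outputs at positions 0..7:
# ROOT, the six words, and an end-of-buffer position.
token_vectors = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(8)]
mlp = make_mlp(input_dim=2 * dim, hidden_dim=8)
scores = score_transitions(mlp, token_vectors, s0=1, b0=2)
print(scores)
```

Because the scores are a function of two positions only, they can be indexed as $s(i, j)$, which is what lets them plug into the deduction items $[i, j]$.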

slide-49
SLIDE 49

How Many Positional Features Do We Need?

  • We answer the question empirically … experimented with greedy decoding
  • Two positional feature vectors are enough!

UAS on PTB (dev):

Features                       UAS
$\{s_2, s_1, s_0, b_0\}$       94.08 ±0.13   (considered in prior work)
$\{s_1, s_0, b_0\}$            94.08 ±0.05   (considered in prior work)
$\{s_0, b_0\}$                 94.03 ±0.12
$\{b_0\}$                      52.39 ±0.23

slide-50
SLIDE 50
How Many Positional Features Do We Need?

  • Our minimal feature set works: stack $s_0$, buffer $b_0$
  • Counter-intuitive, but works for greedy decoding

[Figure: reduce↷ turns (Stack: … $s_1$ $s_0$; Buffer: $b_0$ …) into (Stack: … $s_1$; Buffer: $b_0$ …); it attaches $s_0$ to $s_1$, which is not in the feature set]

slide-51
SLIDE 51

Implication of Minimal Feature Set

  • The bare deduction items already contain enough information to extract features
  • We don’t need extra bookkeeping
  • Leads to the first $O(n^3)$ implementation of global decoders!

slide-52
SLIDE 52

How Many Positional Features Do We Need?

[Figure: spectrum of feature sets along an axis labeled “More tree-structure information”:
  • Non-neural (manual engineering); Chen and Manning (2014)
  • Stack LSTM (Dyer et al., 2016)
  • Bi-LSTM: Kiperwasser and Goldberg (2016); Cross and Huang (2016)
  • Our work
Decoding labels on the slide: Exponential DP, Slow DP, Fast DP, Fast(er) DP]

slide-53
SLIDE 53

Best-known Time Complexities (recap)

Theoretical: $O(n^3)$
Practical: $O(n^6)$

Gap: feature representation

slide-54
SLIDE 54

Our contribution

Theoretical: $O(n^3)$
Practical: $O(n^6)$ ⟶ $O(n^3)$, with bi-directional LSTMs

slide-55
SLIDE 55

Decoding

shift: $\dfrac{[i, j] : v}{[j, j+1] : 0}$

Here $v$ is the score of the transition sub-sequence that the item represents.

slide-56
SLIDE 56

Decoding

reduce↶: $\dfrac{[i, k] : v_1 \quad [k, j] : v_2}{[i, j] : v_1 + v_2 + \Delta}$

$\Delta = s_{sh}(i, k) + s_{re\curvearrowleft}(k, j)$
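The scored rules translate into a CKY-style dynamic program. The following is a minimal sketch and not the authors' implementation. It assumes that the axiom $[0, 1]$ has score 0, that reduce↷ uses the analogous increment $s_{sh}(i, k) + s_{re\curvearrowright}(k, j)$ (the slide only shows reduce↶), and that reduce↶ is disallowed when the buffer is empty ($j = n+1$). All function names are hypothetical. A brute-force search over transition sequences is included purely as a cross-check.

```python
import random
from typing import Callable, Dict, Tuple

Score = Callable[[int, int], float]


def decode_arc_hybrid(n: int, s_sh: Score, s_re_left: Score, s_re_right: Score):
    """Exact decoding over items [i, j], 0 <= i < j <= n + 1, in O(n^3) time.

    Returns (best score, heads), where heads maps each word 1..n to its head.
    """
    best: Dict[Tuple[int, int], float] = {}
    back: Dict[Tuple[int, int], Tuple[int, bool]] = {}
    for j in range(n + 1):
        best[j, j + 1] = 0.0  # axiom [0, 1] and every shift conclusion score 0
    for width in range(2, n + 2):
        for i in range(0, n + 2 - width):
            j = i + width
            for k in range(i + 1, j):
                base = best[i, k] + best[k, j] + s_sh(i, k)
                options = [(base + s_re_right(k, j), False)]  # assumed by analogy
                if j <= n:  # reduce-left needs a real word at the buffer front
                    options.append((base + s_re_left(k, j), True))
                for value, is_left in options:
                    if (i, j) not in best or value > best[i, j]:
                        best[i, j] = value
                        back[i, j] = (k, is_left)
    heads: Dict[int, int] = {}

    def collect(i: int, j: int) -> None:
        """Follow back-pointers and record the head chosen for each word."""
        if j == i + 1:
            return
        k, is_left = back[i, j]
        heads[k] = j if is_left else i
        collect(i, k)
        collect(k, j)

    collect(0, n + 1)
    return best[0, n + 1], heads


def brute_force(n: int, s_sh: Score, s_re_left: Score, s_re_right: Score) -> float:
    """Best score over all transition sequences (exponential time)."""
    def go(stack, front):  # front == n + 1 means the buffer is empty
        if len(stack) == 1 and front == n + 1:
            return 0.0
        s0, best = stack[-1], float("-inf")
        if front <= n:
            best = max(best, s_sh(s0, front) + go(stack + [front], front + 1))
        if len(stack) >= 2:
            best = max(best, s_re_right(s0, front) + go(stack[:-1], front))
            if front <= n:
                best = max(best, s_re_left(s0, front) + go(stack[:-1], front))
        return best
    return go([0], 1)


def random_scores(n: int, seed: int):
    """Three random score tables over positions 0..n+1, wrapped as functions."""
    rng = random.Random(seed)
    tables = [[[rng.uniform(-1, 1) for _ in range(n + 2)] for _ in range(n + 2)]
              for _ in range(3)]
    return tuple((lambda a, b, t=t: t[a][b]) for t in tables)


def agrees(n: int, seed: int) -> bool:
    """Check that the DP matches brute force and assigns a head to every word."""
    scores = random_scores(n, seed)
    dp_score, heads = decode_arc_hybrid(n, *scores)
    return (abs(dp_score - brute_force(n, *scores)) < 1e-9
            and sorted(heads) == list(range(1, n + 1)))


print(decode_arc_hybrid(6, *random_scores(6, seed=0)))
```

The three nested loops over $i$, $k$, $j$ give the $O(n^3)$ running time; no index beyond those in the items is needed, because every score is a function of two positions.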

slide-57
SLIDE 57

Training

  • Cost-augmented decoding (Taskar et al., 2005)

$$\max_{y} \big[ \operatorname{score}(y) + \operatorname{cost}(y) \big] - \operatorname{score}(y^{*})$$

where $y$ ranges over transition sequences and $y^{*}$ is the gold sequence (both are drawn as paths on the slide).

reduce↶: $\dfrac{[i, k] : v_1 \quad [k, j] : v_2}{[i, j] : v_1 + v_2 + s_{sh}(i, k) + s_{re\curvearrowleft}(k, j) + \mathbf{1}(\mathrm{head}(k) \neq j)}$
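As a small illustration of the cost-augmented rule, the sketch below computes the increment added when two items are combined during training, plus the resulting loss. The function names are hypothetical, and the reduce↷ case (cost $\mathbf{1}(\mathrm{head}(k) \neq i)$) is an assumption made by analogy with the reduce↶ rule shown on the slide.

```python
from typing import Callable, Dict

Score = Callable[[int, int], float]


def cost_augmented_delta(i: int, k: int, j: int, is_left: bool,
                         s_sh: Score, s_re_left: Score, s_re_right: Score,
                         gold_heads: Dict[int, int]) -> float:
    """Score added when items [i, k] and [k, j] are combined by a reduce."""
    predicted_head = j if is_left else i  # reduce-left attaches k to j
    reduce_score = s_re_left(k, j) if is_left else s_re_right(k, j)
    # Indicator 1(head(k) != predicted head): one unit of cost per wrong arc.
    cost = 1.0 if gold_heads[k] != predicted_head else 0.0
    return s_sh(i, k) + reduce_score + cost


def structured_loss(cost_augmented_max: float, gold_score: float) -> float:
    """max over sequences of [score + cost], minus the score of the gold sequence."""
    return cost_augmented_max - gold_score


# Toy scores, chosen only so the arithmetic is easy to follow.
s_sh = lambda a, b: 0.5
s_re_left = lambda a, b: 1.0
s_re_right = lambda a, b: 0.25
gold_heads = {1: 2}  # word 1 should attach to word 2

correct = cost_augmented_delta(0, 1, 2, True, s_sh, s_re_left, s_re_right, gold_heads)
wrong = cost_augmented_delta(0, 1, 2, False, s_sh, s_re_left, s_re_right, gold_heads)
print(correct, wrong)
```

Because the cost decomposes over individual reduce steps, cost-augmented decoding can reuse the same $O(n^3)$ dynamic program as plain decoding.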

slide-58
SLIDE 58

Comparing with State-of-the-art

[Figure: UAS on Chinese CTB and English PTB for prior systems BGDS16, CH16, DBLMS15, KG16a, KG16b, CFHGD16, DM17, KG16a, KBKDS16, WC16; legend: Local, Global]

slide-59
SLIDE 59

Comparing with State-of-the-art

[Figure: the same chart of UAS on Chinese CTB and English PTB, adding Our arc-eager DP and Our arc-hybrid DP; legend: Local, Global, Our Global]

slide-60
SLIDE 60

Comparing with State-of-the-art

[Figure: the same chart of UAS on Chinese CTB and English PTB, adding Our best local alongside Our arc-eager DP and Our arc-hybrid DP; legend: Local, Global, Our Local, Our Global]

slide-61
SLIDE 61

Comparing with State-of-the-art

[Figure: the same chart of UAS on Chinese CTB and English PTB, adding ensemble entries labeled “15 Our all global”, “20 KBKDS16”, “5 Our arc-eager DP” and “5 Our arc-hybrid DP”; legend: Local, Global, Our Local, Our Global, Ensemble]

slide-62
SLIDE 62

Results – CoNLL’17 Shared Task

System              LAS
Ensemble            75.00
Exact arc-eager     74.32
Exact arc-hybrid    74.00
Graph-based         73.75

(Shi, Wu, Chen and Cheng, 2017; Zeman et al., 2017)

  • Macro-average of 81 treebanks in 49 languages
  • 2nd-highest overall performance

slide-63
SLIDE 63

Conclusion

  • Bi-LSTM feature set is minimal yet highly effective
  • First $O(n^3)$ implementation of exact decoders
  • Global training and decoding gave high performance

slide-64
SLIDE 64

More in Our Paper

  • Description and analysis of three transition systems (arc-standard, arc-hybrid, arc-eager)
  • CKY-style representations of the deduction systems
  • Theoretical analysis of the global methods
    • Arc-eager models can “simulate” arc-hybrid models
    • Arc-eager models can “simulate” edge-factored models

slide-65
SLIDE 65

Fast(er) Exact Decoding and Global Training for Transition-Based Dependency Parsing via a Minimal Feature Set

Tianze Shi*, Liang Huang†, Lillian Lee*

https://github.com/tzshi/dp-parser-emnlp17

* Cornell University

† Oregon State University

slide-66
SLIDE 66

CKY-style Visualization
