slide-1
SLIDE 1

Fast(er) Exact Decoding and Global Training for Transition-Based Dependency Parsing via a Minimal Feature Set

Tianze Shi* Liang Huang† Lillian Lee*

* Cornell University
† Oregon State University

Theoretical: $O(n^3)$
Practical: $O(n^6)$ → $O(n^3)$ (Minimal Feature Set)

slide-2
SLIDE 2

Short Version

  • Transition-based dependency parsing has an exponentially large search space
  • $O(n^3)$ exact solutions exist 😄
  • In practice, however, we needed rich features ⟹ $O(n^6)$ 😟
  • (This work) With bi-LSTMs, we can now do $O(n^3)$! 😄
  • And we get state-of-the-art results

Outline: Background · $O(n^3)$ in theory · $O(n^6)$ in practice · Back to $O(n^3)$ · Results

slide-7
SLIDE 7

Dependency Parsing

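INPUT: She wanted to eat an apple

OUTPUT: a dependency tree over the sentence, with arcs labeled root, nsubj, mark, xcomp, det and obj:

  root(ROOT → wanted), nsubj(wanted → She), xcomp(wanted → eat), mark(eat → to), obj(eat → apple), det(apple → an)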

slide-8
SLIDE 8

Transition-based Dependency Parsing

[Figure: search graph of parser states, branching from a single initial state to many terminal states]

slide-9
SLIDE 9

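Transition-based Dependency Parsing

[Figure: search graph of parser states, from the initial state to the terminal states]

Goal: find the highest-scoring transition sequence, where the score of a sequence is the sum of the scores of its transitions:

  max score(transition sequence) = max ∑ score(transition)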

slide-10
SLIDE 10

Exact Decoding with Dynamic Programming

[Figure: search graph of parser states, from the initial state to the terminal states]

Goal:

  max score(transition sequence) = max ∑ score(transition)

Dynamic programming takes the search from exponential to polynomial time (Huang and Sagae, 2010; Kuhlmann, Gómez-Rodríguez and Satta, 2011).
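To get a feel for how quickly the search space grows, the sketch below counts complete arc-hybrid transition sequences for a sentence of n words. It assumes the usual preconditions (the parser starts with an empty stack and ROOT at the front of the buffer, ROOT is never reduced, and the parser stops when only ROOT is left); the function names are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_from(stack_size: int, buffer_size: int) -> int:
    """Number of complete transition sequences from a state of the given sizes."""
    if stack_size == 1 and buffer_size == 0:
        return 1  # terminal state: only ROOT is left on the stack
    total = 0
    if buffer_size > 0:  # shift
        total += count_from(stack_size + 1, buffer_size - 1)
    if stack_size >= 2 and buffer_size > 0:  # reduce-left: s0 attaches to b0
        total += count_from(stack_size - 1, buffer_size)
    if stack_size >= 2:  # reduce-right: s0 attaches to s1
        total += count_from(stack_size - 1, buffer_size)
    return total


def count_sequences(n: int) -> int:
    """Count arc-hybrid transition sequences for a sentence of n words."""
    # Assumed initial state: empty stack, buffer holds ROOT plus the n words.
    return count_from(0, n + 1)


for n in (1, 2, 3, 5, 10, 20):
    print(n, count_sequences(n))
```

The counter itself only needs the sizes of the stack and buffer, so it runs in polynomial time even though the number it reports grows exponentially; the same kind of state merging is what makes exact decoding feasible.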

slide-11
SLIDE 11

Transition Systems

Transition system   DP complexity   # action types
Arc-standard        $O(n^4)$        3
Arc-eager           $O(n^3)$        4
Arc-hybrid          $O(n^3)$        3

Annotations on the slide: "In our paper" and "Presentational convenience" (the rest of this talk uses arc-hybrid).

slide-12
SLIDE 12

Arc-hybrid Transition System

State: a stack (… s2 s1 s0, with s0 on top) and a buffer (b0 b1 …, with b0 at the front)

Initial state: the stack is empty and the buffer holds ROOT She wanted …
Terminal state: the stack holds only ROOT and the buffer is empty

(Yamada and Matsumoto, 2003; Gómez-Rodríguez et al., 2008; Kuhlmann et al., 2011)

slide-13
SLIDE 13

Arc-hybrid Transition System

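Transitions:

  • shift: move b0 from the buffer onto the stack
      (… s0 | b0 …) → (… s0 b0 | …)
  • reduce↶: pop s0 and attach it to b0 (arc s0 ← b0)
      (… s1 s0 | b0 …) → (… s1 | b0 …)
  • reduce↷: pop s0 and attach it to s1 (arc s1 → s0)
      (… s1 s0 | b0 …) → (… s1 | b0 …)

shift and reduce↷ are the same as in arc-standard.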

slide-16
SLIDE 16

Arc-hybrid Transition System

Transition   Stack            Buffer
(initial)    (empty)          ROOT She wanted to eat an apple
shift        ROOT             She wanted to eat an apple
shift        ROOT She         wanted to eat an apple
reduce↶      ROOT             wanted to eat an apple    (She ← wanted)
shift        ROOT wanted      to eat an apple
shift        ROOT wanted to   eat an apple

slide-22
SLIDE 22

Arc-hybrid Transition System

Transition   Stack                   Buffer
(previous)   ROOT wanted to          eat an apple
reduce↶      ROOT wanted             eat an apple    (to ← eat)
shift        ROOT wanted eat         an apple
shift        ROOT wanted eat an      apple
reduce↶      ROOT wanted eat         apple           (an ← apple)
shift        ROOT wanted eat apple   (empty)

slide-28
SLIDE 28

Arc-hybrid Transition System

Transition   Stack                   Buffer
(previous)   ROOT wanted eat apple   (empty)
reduce↷      ROOT wanted eat         (empty)    (eat → apple)
reduce↷      ROOT wanted             (empty)    (wanted → eat)
reduce↷      ROOT                    (empty)    (ROOT → wanted); terminal
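The walk-through can be replayed in a few lines of code. This is a minimal sketch, assuming words are numbered from 1 with ROOT as 0 and an initially empty stack; the transition names `reduce_left` (reduce↶) and `reduce_right` (reduce↷) and the function `run_arc_hybrid` are made up for illustration.

```python
def run_arc_hybrid(n_words, transitions):
    """Replay a transition sequence; return heads[d] for each word d (0 is ROOT)."""
    stack = []
    buffer = list(range(n_words + 1))  # assumed initial state: ROOT and all words
    heads = [None] * (n_words + 1)
    for t in transitions:
        if t == "shift":
            stack.append(buffer.pop(0))
        elif t == "reduce_left":  # reduce↶: s0 becomes a dependent of b0
            dep = stack.pop()
            heads[dep] = buffer[0]
        elif t == "reduce_right":  # reduce↷: s0 becomes a dependent of s1
            dep = stack.pop()
            heads[dep] = stack[-1]
        else:
            raise ValueError(f"unknown transition: {t}")
    if stack != [0] or buffer:
        raise ValueError("sequence did not reach the terminal state")
    return heads


words = ["ROOT", "She", "wanted", "to", "eat", "an", "apple"]
sequence = [
    "shift", "shift", "reduce_left", "shift", "shift", "reduce_left",
    "shift", "shift", "reduce_left", "shift",
    "reduce_right", "reduce_right", "reduce_right",
]
heads = run_arc_hybrid(6, sequence)
for dep in range(1, 7):
    print(f"{words[heads[dep]]} -> {words[dep]}")
```

The sequence has 13 transitions for 6 words: one shift for ROOT and for each word, plus one reduce for each word.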

slide-32
SLIDE 32

Dynamic Programming for Arc-hybrid

  • Deduction item: $[i, j]$, a state whose stack top is word $i$ and whose buffer front is word $j$

      (… i | j …)

    Word indices: ROOT = 0, She = 1, wanted = 2, to = 3, eat = 4, an = 5, apple = 6 (= n)

  • Goal: $[0, n + 1]$

      (0 | n + 1)

slide-33
SLIDE 33

Dynamic Programming for Arc-hybrid

$$\frac{[i, j]}{[j, j+1]}\ \text{shift}$$

[Figure: shift takes the state (… i | j …) to (… i j | j + 1 …)]

slide-34
SLIDE 34

Dynamic Programming for Arc-hybrid

$$\frac{[i, k] \quad [k, j]}{[?, j]}\ \text{reduce}\curvearrowleft$$

[Figure: reduce↶ pops k from the state (… k | j …); which word is then on top of the stack?]

slide-35
SLIDE 35

Dynamic Programming for Arc-hybrid

$$\frac{[i, k] \quad [k, j]}{[i, j]}\ \text{reduce}\curvearrowleft$$

[Figure: reduce↶ takes the state (… i k | j …) to (… i | j …)]

slide-36
SLIDE 36

Dynamic Programming for Arc-hybrid

$$\frac{[i, k] \quad [k, j]}{[i, j]}\ \text{reduce}\curvearrowleft$$

[Figure: from (… i | k …), the parser reaches (… i k | j …) through a sequence of transitions (*); reduce↶ then gives (… i | j …)]

slide-37
SLIDE 37

Dynamic Programming for Arc-hybrid

$$\frac{[i, k] \quad [k, j]}{[i, j]}\ \text{reduce}\curvearrowleft$$

[Figure: from (… i | k …), the parser reaches (… i k | j …) through a sequence of transitions (*); reduce↶ then gives (… i | j …). The three sub-sequences are labeled $[i, k]$, $[k, j]$ and $[i, j]$.]

The items $[i, k]$, $[k, j]$ and $[i, j]$ are in the notation of Kuhlmann, Gómez-Rodríguez and Satta (2011).

slide-38
SLIDE 38

Dynamic Programming for Arc-hybrid

$$\frac{[i, k] \quad [k, j]}{[i, j]}\ \text{reduce}\curvearrowright$$

[Figure: from (… i | k …), the parser reaches (… i k | j …) through a sequence of transitions (*); reduce↷ then gives (… i | j …). The three sub-sequences are labeled $[i, k]$, $[k, j]$ and $[i, j]$.]

slide-39
SLIDE 39

Dynamic Programming for Arc-hybrid

$$\frac{[i, j]}{[j, j+1]}\ \text{shift}$$

$$\frac{[i, k] \quad [k, j]}{[i, j]}\ \text{reduce}\curvearrowleft \qquad (k \curvearrowleft j)$$

$$\frac{[i, k] \quad [k, j]}{[i, j]}\ \text{reduce}\curvearrowright \qquad (i \curvearrowright k)$$

Goal: $[0, n + 1]$

Time complexity: $O(n^3)$

slide-40
SLIDE 40

Time Complexity in Practice

  • Complexity depends on feature representation!
  • Typical feature representation:
  • Feature templates look at specific positions in the stack and in the buffer

slide-41
SLIDE 41

Time Complexity in Practice

  • Compare the following features:
      {s0, b0}: the shift score is $s_{\text{sh}}(i, j)$, with $s_0 = i$ and $b_0 = j$
      {s1, s0, b0}: the shift score is $s_{\text{sh}}(k, i, j)$, with $s_1 = k$, $s_0 = i$ and $b_0 = j$
  • Time complexities are different!!!

[Figure: the shift transition drawn for both feature sets, from (… i | j …) and from (… k i | j …)]
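A rough index count suggests why the feature set matters. The three-index items $[l, i, k]$ below are an illustrative assumption (they simply record $s_1$ as an extra index) and are not notation taken from the slides; the count is a sketch, not a full complexity proof.

```latex
% Features {s_0, b_0}: items [i, j]. The reduce rule ranges over i < k < j.
\frac{[i, k] \quad [k, j]}{[i, j]}
\quad\Longrightarrow\quad 3 \text{ free indices } (i, k, j)
\quad\Longrightarrow\quad O(n^3)

% Features {s_1, s_0, b_0}: each item must also record s_1.
\frac{[l, i, k] \quad [i, k, j]}{[l, i, j]}
\quad\Longrightarrow\quad 4 \text{ free indices } (l, i, k, j)
\quad\Longrightarrow\quad O(n^4)
```

Each additional stack or tree position that a feature template inspects adds indices to the items in the same way, so richer feature sets push the exponent up.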

slide-42
SLIDE 42

Time Complexity in Practice

  • Complexity depends on feature representation!
  • Typical feature representation:
  • Feature templates look at specific positions in the stack and in the buffer
  • Best-known complexity in practice: $O(n^6)$ (Huang and Sagae, 2010)

[Figure: feature positions s2, s1, s0 on the stack and b0, b1 in the buffer, plus the children s1.lc, s1.rc, s0.lc and s0.rc]

slide-43
SLIDE 43

Best-known Time Complexities (recap)

Theoretical: $O(n^3)$
Practical: $O(n^6)$

Gap: feature representation

slide-44
SLIDE 44

In Practice, Instead of Exact Decoding …

  • Greedy search (Nivre, 2003, 2004, 2008; Chen and Manning, 2014)
  • Beam search (Zhang and Clark, 2011; Weiss et al., 2015)
  • Best-first search (Sagae and Lavie, 2006; Sagae and Tsujii, 2007; Zhao et al., 2013)
  • Dynamic oracles (Goldberg and Nivre, 2012, 2013)
  • “Global” normalization on the beam (Zhou et al., 2015; Andor et al., 2016)
  • Reinforcement learning (Lê and Fokkens, 2017)
  • Learning to search (Daumé III and Marcu, 2005; Chang et al., 2016; Wiseman and Rush, 2016)
  • …

slide-45
SLIDE 45

How Many Positional Features Do We Need?

  • Non-neural (manual engineering) ☞ Chen and Manning (2014)

[Figure: feature positions s2, s1, s0 on the stack and b0, b1, b2 in the buffer, plus the children $s_1.lc_i$, $s_1.rc_i$, $s_0.lc_i$, $s_0.rc_i$ and the grandchildren $s_0.rc_0.rc_0$, $s_0.lc_0.lc_0$, $s_1.rc_0.rc_0$, $s_1.lc_0.lc_0$]

slide-46
SLIDE 46

How Many Positional Features Do We Need?

  • Non-neural (manual engineering) ☞ Chen and Manning (2014)
  • Stack LSTM ☞ Dyer et al. (2015)
  • Bi-LSTM ☞ Kiperwasser and Goldberg (2016) ☞ Cross and Huang (2016)

[Figure: the stack LSTM reads the whole stack, $s_0, s_1, s_2, \ldots, s_{|\sigma|-1}$]

slide-47
SLIDE 47

How Many Positional Features Do We Need?

  • Bi-LSTMs give compact feature representations (Kiperwasser and Goldberg, 2016; Cross and Huang, 2016)
  • Features used in Kiperwasser and Goldberg (2016): s2, s1, s0 on the stack and b0 in the buffer
  • Features used in Cross and Huang (2016): s1, s0 on the stack and b0 in the buffer

slide-48
SLIDE 48

How Many Positional Features Do We Need?

  • Non-neural (manual engineering) ☞ Chen and Manning (2014): enables slow DP
  • Stack LSTM ☞ Dyer et al. (2015): summarizes the trees on the stack; exponential DP
  • Bi-LSTM ☞ Kiperwasser and Goldberg (2016) ☞ Cross and Huang (2016): summarizes the input; enables fast DP

slide-49
SLIDE 49

Model Architecture

[Figure: the sentence "She wanted to eat an apple" is embedded (word embeddings + POS embeddings) and encoded by a bi-directional LSTM; the vectors at positions s2, s1, s0 and b0 feed a multi-layer perceptron that outputs the scores $s_{\text{sh}}$, $s_{\text{re}\curvearrowleft}$ and $s_{\text{re}\curvearrowright}$]

slide-50
SLIDE 50

Model Architecture

[Figure: the same architecture, with the multi-layer perceptron reading only the vectors at positions s1, s0 and b0 to output $s_{\text{sh}}$, $s_{\text{re}\curvearrowleft}$ and $s_{\text{re}\curvearrowright}$]

slide-51
SLIDE 51

Model Architecture

[Figure: the same architecture, with the multi-layer perceptron reading only the vectors at positions s0 and b0 to output $s_{\text{sh}}$, $s_{\text{re}\curvearrowleft}$ and $s_{\text{re}\curvearrowright}$]

slide-52
SLIDE 52

Model Architecture

[Figure: the same architecture, with the multi-layer perceptron reading only the vector at position b0 to output $s_{\text{sh}}$, $s_{\text{re}\curvearrowleft}$ and $s_{\text{re}\curvearrowright}$]
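The scoring step for the two-vector feature set can be sketched as a small multi-layer perceptron. Everything concrete here is an assumption for illustration: the vectors are random stand-ins for bi-LSTM outputs, the dimensions and the tanh activation are arbitrary, and `make_mlp`, `affine` and `transition_scores` are hypothetical names.

```python
import math
import random


def make_mlp(input_dim, hidden_dim, output_dim, seed=0):
    """Create randomly initialised weights for a one-hidden-layer MLP."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W1": matrix(hidden_dim, input_dim), "b1": [0.0] * hidden_dim,
        "W2": matrix(output_dim, hidden_dim), "b2": [0.0] * output_dim,
    }


def affine(W, b, x):
    """Compute W x + b for a list-of-lists matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def transition_scores(mlp, h_s0, h_b0):
    """Score (shift, reduce-left, reduce-right) from the s0 and b0 vectors only."""
    x = h_s0 + h_b0  # concatenate the two positional feature vectors
    hidden = [math.tanh(v) for v in affine(mlp["W1"], mlp["b1"], x)]
    return affine(mlp["W2"], mlp["b2"], hidden)


# Stand-ins for bi-LSTM outputs: one 4-dimensional vector per token (ROOT + 6 words).
rng = random.Random(1)
vectors = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(7)]
mlp = make_mlp(input_dim=8, hidden_dim=5, output_dim=3)
scores = transition_scores(mlp, vectors[2], vectors[3])  # s0 = wanted, b0 = to
print(scores)
```

Because the scorer takes only two token indices, all of its outputs can be precomputed for every pair of positions, which is what lets the scores plug directly into two-index deduction items.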

slide-53
SLIDE 53

How Many Positional Features Do We Need?

  • We answer the question empirically … experimented with greedy decoding
  • Two positional feature vectors are enough!

UAS on PTB (dev):

Feature set          UAS
{s2, s1, s0, b0}     94.08 ±0.13
{s1, s0, b0}         94.08 ±0.05
{s0, b0}             94.03 ±0.12
{b0}                 52.39 ±0.23

{s2, s1, s0, b0} and {s1, s0, b0} were considered in prior work.

slide-54
SLIDE 54

How Many Positional Features Do We Need?

  • Our minimal feature set: {s0, b0}
  • Counter-intuitive, but works for greedy decoding

[Figure: reduce↷ takes (… s1 s0 | b0 …) to (… s1 | b0 …), attaching s0 to s1]

slide-55
SLIDE 55

How Many Positional Features Do We Need?

  • Our minimal feature set: {s0, b0}
  • Counter-intuitive, but works for greedy decoding
  • The bare deduction items already contain enough information to extract features for DP
  • Leads to the first $O(n^3)$ implementation of global decoders!

slide-56
SLIDE 56

How Many Positional Features Do We Need?

  • Non-neural (manual engineering) ☞ Chen and Manning (2014): enables slow DP
  • Stack LSTM ☞ Dyer et al. (2015): summarizes the trees on the stack; exponential DP
  • Bi-LSTM ☞ Kiperwasser and Goldberg (2016) ☞ Cross and Huang (2016): summarizes the input; enables fast DP
  • Bi-LSTM ☞ Our work: summarizes the input; fast(er) DP

slide-57
SLIDE 57

Best-known Time Complexities (recap)

Theoretical: $O(n^3)$
Practical: $O(n^6)$

Gap: feature representation

slide-58
SLIDE 58

Our contribution

Theoretical: $O(n^3)$
Practical: $O(n^6)$ → $O(n^3)$ with the minimal feature set

slide-59
SLIDE 59

Decoding

$$\frac{[i, j] : v}{[j, j+1] : 0}\ \text{shift}$$

Each item carries the score of the sub-sequence of transitions it covers.

[Figure: shift takes the state (… i | j …) to (… i j | j + 1 …)]

slide-60
SLIDE 60

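Decoding

$$\frac{[i, k] : v_1 \quad [k, j] : v_2}{[i, j] : v_1 + v_2 + \Delta}\ \text{reduce}\curvearrowleft$$

$$\Delta = s_{\text{sh}}(i, k) + s_{\text{re}\curvearrowleft}(k, j)$$

[Figure: from (… i | k …), the parser reaches (… i k | j …) through a sequence of transitions (*); reduce↶ then gives (… i | j …). The three sub-sequences are labeled $[i, k]$, $[k, j]$ and $[i, j]$.]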
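The deduction rules translate into a CKY-style table over items [i, j]. The sketch below is one possible reading, not the authors' implementation: it assumes the three scorers are plain callables on two indices, that reduce↷ uses the analogous Δ = s_sh(i, k) + s_re↷(k, j), that the scorers accept j = n + 1 for an empty buffer, and that the first shift of ROOT is unscored. The toy scorers in the usage part are invented so that the example tree is the unique optimum.

```python
def decode(n, s_sh, s_re_left, s_re_right):
    """Exact arc-hybrid decoding over items [i, j]; returns (best score, heads)."""
    size = n + 2  # indices 0..n are ROOT and the words; n + 1 marks an empty buffer
    best = [[float("-inf")] * size for _ in range(size)]
    back = [[None] * size for _ in range(size)]
    for j in range(1, size):
        best[j - 1][j] = 0.0  # shift conclusions [j - 1, j] : 0
    for width in range(2, size):
        for i in range(0, size - width):
            j = i + width
            for k in range(i + 1, j):
                base = best[i][k] + best[k][j] + s_sh(i, k)
                # reduce-right attaches k to i; reduce-left attaches k to j.
                options = [(base + s_re_right(k, j), i)]
                if j <= n:  # reduce-left needs a real word at the buffer front
                    options.append((base + s_re_left(k, j), j))
                for value, head in options:
                    if value > best[i][j]:
                        best[i][j] = value
                        back[i][j] = (k, head)
    heads = [None] * (n + 1)
    todo = [(0, n + 1)]  # follow back-pointers from the goal item
    while todo:
        i, j = todo.pop()
        if back[i][j] is not None:
            k, head = back[i][j]
            heads[k] = head
            todo += [(i, k), (k, j)]
    return best[0][n + 1], heads


# Toy scorers (assumptions): reward shifting a word onto its gold head and
# reducing a word leftwards onto its gold head; penalise wrong left reductions.
gold = [None, 2, 0, 4, 2, 6, 4]
score, heads = decode(
    6,
    s_sh=lambda i, k: 1.0 if gold[k] == i else 0.0,
    s_re_left=lambda k, j: 2.0 if gold[k] == j else -1.0,
    s_re_right=lambda k, j: 0.0,
)
print(score, heads)
```

There are O(n^2) items and each is built from O(n) split points k, which gives the O(n^3) running time.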

slide-61
SLIDE 61

Training

  • Separate incorrect from correct by a margin
  • Cost-augmented decoding (Taskar et al., 2005)

      max over transition sequences of [ score(sequence) − score(correct sequence) + cost(sequence) ]

$$\frac{[i, k] : v_1 \quad [k, j] : v_2}{[i, j] : v_1 + v_2 + s_{\text{sh}}(i, k) + s_{\text{re}\curvearrowleft}(k, j) + \mathbf{1}(\text{head}(k) \neq j)}\ \text{reduce}\curvearrowleft$$
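A minimal sketch of the training objective follows. It assumes a Hamming cost (one unit per word with a wrong head) and, for brevity, takes the maximum over an explicit candidate list instead of running the dynamic program; `cost_augmented`, `hamming_cost`, `margin_loss` and the toy scores are hypothetical.

```python
def cost_augmented(s_re_left, gold_heads):
    """Wrap a reduce-left scorer so that giving k a wrong head j adds a cost of 1."""
    def scorer(k, j):
        return s_re_left(k, j) + (1.0 if gold_heads[k] != j else 0.0)
    return scorer


def hamming_cost(pred_heads, gold_heads):
    """Number of words whose predicted head differs from the gold head."""
    return sum(1 for p, g in zip(pred_heads[1:], gold_heads[1:]) if p != g)


def margin_loss(candidates, gold_heads, score):
    """Cost-augmented max over candidates minus the gold score.

    The gold tree should be among the candidates, so the loss is never negative.
    """
    augmented = max(score(y) + hamming_cost(y, gold_heads) for y in candidates)
    return augmented - score(gold_heads)


# Two-word toy example: the gold tree and one incorrect tree with made-up scores.
gold = (None, 2, 0)
wrong = (None, 0, 1)
toy_scores = {gold: 3.0, wrong: 2.5}
loss = margin_loss([gold, wrong], gold, toy_scores.get)
print(loss)
```

Because the cost decomposes over individual head decisions, it can be folded into the reduce scores as in `cost_augmented`, so the same exact decoder can find the cost-augmented maximum.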

slide-62
SLIDE 62

Comparing with State-of-the-art

[Figure: UAS on Chinese CTB (axis from 86.0 to 90.5) and English PTB (axis from 93.0 to 96.0). Local models: BGDS16, CH16, DBLMS15, KG16a, KG16b. Global models: CFHGD16, DM17, KG16a, KBKDS16, WC16.]

slide-63
SLIDE 63

Comparing with State-of-the-art

[Figure: UAS on Chinese CTB (axis from 86.0 to 90.5) and English PTB (axis from 93.0 to 96.0). Local models: BGDS16, CH16, DBLMS15, KG16a, KG16b. Global models: CFHGD16, DM17, KG16a, KBKDS16, WC16. Our global models: our arc-hybrid DP, our arc-eager DP.]

slide-64
SLIDE 64

Comparing with State-of-the-art

[Figure: UAS on Chinese CTB (axis from 86.0 to 90.5) and English PTB (axis from 93.0 to 96.0). Local models: BGDS16, CH16, DBLMS15, KG16a, KG16b. Global models: CFHGD16, DM17, KG16a, KBKDS16, WC16. Our local model: our best local. Our global models: our arc-hybrid DP, our arc-eager DP.]

slide-65
SLIDE 65

Comparing with State-of-the-art

[Figure: UAS on Chinese CTB (axis from 86.0 to 90.5) and English PTB (axis from 93.0 to 96.0). Local models: BGDS16, CH16, DBLMS15, KG16a, KG16b. Global models: CFHGD16, DM17, KG16a, KBKDS16, WC16. Our local model: our best local. Our global models: our arc-hybrid DP, our arc-eager DP. Ensembles: KBKDS16 (20 models), our arc-hybrid DP (5 models), our arc-eager DP (5 models), our all global (15 models).]

slide-66
SLIDE 66

Results – CoNLL’17 Shared Task

[Figure: bar chart of LAS (axis from 73 to 75) for the Ensemble, Exact Arc-eager, Exact Arc-hybrid and Graph-based systems; the values shown are 75.00, 74.32, 74.00 and 73.75]

(Shi, Wu, Chen and Cheng, 2017; Zeman et al., 2017)

  • Macro-average of 81 treebanks in 49 languages
  • 2nd-highest overall performance

slide-67
SLIDE 67

Conclusion

  • Bi-LSTM feature set is minimal yet highly effective
  • First $O(n^3)$ implementation of exact decoders
  • Global training and decoding gave high performance

slide-68
SLIDE 68

More in Our Paper

  • Description and analysis of three transition systems (arc-standard, arc-hybrid, arc-eager)
  • CKY-style representations of the deduction systems
  • Theoretical analysis of the global methods
  • Arc-eager models can “simulate” arc-hybrid models
  • Arc-eager models can “simulate” edge-factored models

slide-69
SLIDE 69

Fast(er) Exact Decoding and Global Training for Transition-Based Dependency Parsing via a Minimal Feature Set

Tianze Shi* Liang Huang† Lillian Lee*

https://github.com/tzshi/dp-parser-emnlp17

* Cornell University † Oregon State University

slide-70
SLIDE 70

CKY-style Visualization

slide-72
SLIDE 72

Results with Arc-eager and Arc-standard
