SLIDE 1

Natural Language Processing (CSEP 517): Dependency Syntax and Parsing

Noah Smith

© 2017 University of Washington, nasmith@cs.washington.edu

May 1, 2017

SLIDE 2

To-Do List

◮ Online quiz: due Sunday
◮ Read: Kübler et al. (2009, ch. 1, 2, 6)
◮ A3 due May 7 (Sunday)
◮ A4 due May 14 (Sunday)

SLIDE 3

Dependencies

Informally, you can think of dependency structures as a transformation of phrase-structures that

◮ maintains the word-to-word relationships induced by lexicalization,
◮ adds labels to them, and
◮ eliminates the phrase categories.

There are also linguistic theories built on dependencies (Tesnière, 1959; Mel’čuk, 1987), as well as treebanks corresponding to those.

◮ Dependency annotation is especially natural for free(r)-word-order languages (e.g., Czech)

SLIDE 4

Dependency Tree: Definition

Let x = x1, . . . , xn be a sentence. Add a special root symbol as “x0.” A dependency tree consists of a set of tuples ⟨p, c, ℓ⟩, where

◮ p ∈ {0, . . . , n} is the index of a parent
◮ c ∈ {1, . . . , n} is the index of a child
◮ ℓ ∈ L is a label

Different annotation schemes define different label sets L, and different constraints on the set of tuples. Most commonly:

◮ The tuple is represented as a directed edge from xp to xc with label ℓ.
◮ The directed edges form an arborescence (directed tree) with x0 as the root (sometimes denoted root).
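
To make the definition concrete, here is a minimal Python sketch of the tuple representation and the arborescence constraint (the names Dep and is_arborescence, and the example labels, are my choices, not from the lecture):

from collections import namedtuple

Dep = namedtuple("Dep", "p c l")  # parent index, child index, label

def is_arborescence(n, deps):
    # Check the common constraint: the edges form a directed tree
    # over x1..xn rooted at the special symbol x0.
    parents = {}
    for d in deps:
        if d.c in parents:          # each child has exactly one parent
            return False
        parents[d.c] = d.p
    if set(parents) != set(range(1, n + 1)):
        return False                # every word needs a parent
    for c in range(1, n + 1):       # every word must reach x0; no cycles
        seen, v = set(), c
        while v != 0:
            if v in seen or v not in parents:
                return False
            seen.add(v)
            v = parents[v]
    return True

# "we wash our cats": wash heads the sentence; our modifies cats.
deps = [Dep(0, 2, "root"), Dep(2, 1, "sbj"),
        Dep(2, 4, "dobj"), Dep(4, 3, "det")]
assert is_arborescence(4, deps)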

SLIDE 5

Example

(S (NP (Pronoun we)) (VP (Verb wash) (NP (Determiner our) (Noun cats))))

Phrase-structure tree.

SLIDE 6

Example

(S (NP (Pronoun we)) (VP (Verb wash) (NP (Determiner our) (Noun cats))))

Phrase-structure tree with heads (head children marked in the original figure).

SLIDE 7

Example

(Swash (NPwe (Pronounwe we)) (VPwash (Verbwash wash) (NPcats (Determinerour our) (Nouncats cats))))

Phrase-structure tree with heads, lexicalized.

SLIDE 8

Example

we wash our cats

“Bare bones” dependency tree (arcs shown in the original figure).

SLIDE 9

Example

we wash our cats who stink

(Dependency tree; arcs shown in the original figure.)

SLIDE 10

Example

we vigorously wash our cats who stink

(Dependency tree; arcs shown in the original figure.)

SLIDE 11

Content Heads vs. Function Heads

Credit: Nathan Schneider

little kids were always watching birds with fish

(The same sentence analyzed twice in the original figure: once with content heads, once with function heads.)

SLIDE 12

Labels

kids saw birds with fish

(arc labels: root, sbj, dobj, prep, pobj; arcs shown in the original figure)

Key dependency relations captured in the labels include: subject, direct object, preposition object, adjectival modifier, adverbial modifier. In this lecture, I will mostly not discuss labels, to keep the algorithms simpler.
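
In the tuple notation from the definition slide (writing words in place of indices), one analysis consistent with these labels is the set {⟨0, saw, root⟩, ⟨saw, kids, sbj⟩, ⟨saw, birds, dobj⟩, ⟨birds, with, prep⟩, ⟨with, fish, pobj⟩}; attaching with to birds is just one of the two readings from the previous slide, chosen here for illustration.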

SLIDE 13

Coordination Structures

we vigorously wash our cats and dogs who stink

The bugbear of dependency syntax.

SLIDE 14

Example

we vigorously wash our cats and dogs who stink

Make the first conjunct the head?

SLIDE 15

Example

we vigorously wash our cats and dogs who stink

Make the coordinating conjunction the head?

SLIDE 16

Example

we vigorously wash our cats and dogs who stink

Make the second conjunct the head?

SLIDE 17

Dependency Schemes

◮ Transform the treebank: define “head rules” that can select the head child of any node in a phrase-structure tree and label the dependencies. A sketch of this idea follows the list.
◮ More powerful, less local rule sets, possibly collapsing some words into arc labels.
  ◮ Stanford dependencies are a popular example (de Marneffe et al., 2006).
◮ Direct annotation.
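
As a sketch of the head-rules idea in the first bullet, here is a toy rule table in Python in the style of Collins-style head finding; the categories, priorities, and scan directions below are illustrative inventions, not the actual rules behind Stanford dependencies:

# Toy head rules: for each parent category, the preferred head-child
# categories in priority order, and the direction to scan children.
HEAD_RULES = {
    "S":  (["VP", "NP"], "left-to-right"),
    "VP": (["Verb", "VP"], "left-to-right"),
    "NP": (["Noun", "Pronoun", "NP"], "right-to-left"),
}

def head_child(parent_label, child_labels):
    # Pick the index of the head child of a phrase-structure node.
    categories, direction = HEAD_RULES[parent_label]
    order = list(range(len(child_labels)))
    if direction == "right-to-left":
        order.reverse()
    for cat in categories:          # earlier categories take priority
        for i in order:
            if child_labels[i] == cat:
                return i
    return order[0]                 # default: first child in scan order

head_child("VP", ["Verb", "NP"]) == 0   # the Verb heads the VP

Applying such rules bottom-up to the tree on slide 6 is what produces the lexicalized tree on slide 7.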

SLIDE 18

Three Approaches to Dependency Parsing

  • 1. Dynamic programming with the Eisner algorithm.
  • 2. Transition-based parsing with a stack.
  • 3. Chu-Liu-Edmonds algorithm for arborescences.

SLIDE 21

Dependencies and Grammar

Context-free grammars can be used to encode dependency structures. For every head word and constellation of dependent children:

Nhead → Nleftmost-sibling . . . Nhead . . . Nrightmost-sibling

And for every v ∈ V: Nv → v and S → Nv.

A bilexical dependency grammar binarizes the dependents, generating only one per rule. Such a grammar can produce only projective trees, which are (informally) trees in which the arcs don’t cross.
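
Instantiated on the running example “we wash our cats” (binarized so each rule generates one dependent, as in the bilexical version), the encoding gives a fragment like this sketch:

S → Nwash
Nwash → Nwe Nwash        (attach left dependent we)
Nwash → Nwash Ncats      (attach right dependent cats)
Ncats → Nour Ncats       (attach left dependent our)
Nwe → we      Nwash → wash      Nour → our      Ncats → cats

This fragment generates exactly the derivation drawn on the next slide.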

SLIDE 22

Bilexical Dependency Grammar: Derivation

(S (Nwash (Nwe we) (Nwash (Nwash wash) (Ncats (Nour our) (Ncats cats)))))

Naïvely, the CKY algorithm will require O(n^5) runtime. Why?

SLIDE 23

CKY for Bilexical Context-Free Grammars

Two deduction rules (drawn as chart items in the original figure):

◮ Attach a right child: combine Nxh over span [i, j] with Nxc over [j + 1, k] to form Nxh over [i, k], with weight p(Nxh Nxc | Nxh).
◮ Attach a left child: combine Nxc over span [i, j] with Nxh over [j + 1, k] to form Nxh over [i, k], with weight p(Nxc Nxh | Nxh).

Each rule instantiates three span positions (i, j, k) plus the two head positions (h and c), which is where the O(n^5) comes from.

SLIDE 24

CKY Example

we wash our cats

(CKY chart in the original figure, building Nwe, Nwash, Nour, Ncats over single words, then larger Ncats and Nwash spans, up to the goal item S.)

SLIDE 25

Dependency Parsing with the Eisner Algorithm

(Eisner, 1996)

Items:

(Item shapes shown in the original figure: left- and right-facing triangles spanning h to d, and left- and right-facing trapezoids spanning h to c.)

◮ Both triangles indicate that xd is a descendant of xh.
◮ Both trapezoids indicate that xc can be attached as the child of xh.
◮ In all cases, the words “in between” are descendants of xh.

SLIDE 26

Dependency Parsing with the Eisner Algorithm

(Eisner, 1996)

Initialization:

(For each i: left- and right-facing triangles of width one at position i, with weight p(xi | Nxi).)

Goal:

(Combining a triangle over [1, i] and a triangle over [i, n], both headed at the root word xi, with weight p(Nxi | S), yields the goal item.)

SLIDE 27

Dependency Parsing with the Eisner Algorithm

(Eisner, 1996)

Attaching a left dependent: Complete a left child:

(Attach a left dependent: a triangle over [i, j] headed at i combines with a triangle over [j + 1, k] headed at k into a left trapezoid over [i, k], with weight p(Nxi Nxk | Nxk). Complete a left child: a triangle over [i, j] headed at j combines with a left trapezoid over [j, k] into a triangle over [i, k] headed at k.)

SLIDE 28

Dependency Parsing with the Eisner Algorithm

(Eisner, 1996)

Attaching a right dependent: Complete a right child:

(Attach a right dependent: a triangle over [i, j] headed at i combines with a triangle over [j + 1, k] headed at k into a right trapezoid over [i, k], with weight p(Nxi Nxk | Nxi). Complete a right child: a right trapezoid over [i, j] combines with a triangle over [j, k] headed at j into a triangle over [i, k] headed at i.)
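
To make the dynamic program concrete, here is a runnable Python sketch of the Eisner algorithm. It uses generic additive arc scores s(h, c) in place of the lecture’s log-probabilities (the chart structure is the same), the function and variable names are my choices, and in this unconstrained version the root at position 0 may take more than one child.

NEG_INF = float("-inf")

def eisner(score):
    # score[h][c] is the score of an arc from head h to child c;
    # position 0 is the root. Returns each word's parent index.
    n = len(score)
    # complete[i][j][d]: best "triangle" over [i, j] with the head at
    # the j end (d = 0) or the i end (d = 1); incomplete: "trapezoids".
    complete = [[[0.0, 0.0] for _ in range(n)] for _ in range(n)]
    incomplete = [[[NEG_INF, NEG_INF] for _ in range(n)] for _ in range(n)]
    back = {}  # (i, j, d, is_complete) -> best split point

    for width in range(1, n):
        for i in range(n - width):
            j = i + width
            # Attach a dependent: two triangles form a trapezoid,
            # adding one arc (j -> i if d == 0, i -> j if d == 1).
            for d in (0, 1):
                arc = score[j][i] if d == 0 else score[i][j]
                best, split = max(
                    (complete[i][k][1] + complete[k + 1][j][0] + arc, k)
                    for k in range(i, j))
                incomplete[i][j][d] = best
                back[(i, j, d, False)] = split
            # Complete a child: a trapezoid absorbs a triangle.
            best, split = max(
                (complete[i][k][0] + incomplete[k][j][0], k)
                for k in range(i, j))
            complete[i][j][0] = best
            back[(i, j, 0, True)] = split
            best, split = max(
                (incomplete[i][k][1] + complete[k][j][1], k)
                for k in range(i + 1, j + 1))
            complete[i][j][1] = best
            back[(i, j, 1, True)] = split

    head = [None] * n
    def recover(i, j, d, is_complete):
        if i == j:
            return
        k = back[(i, j, d, is_complete)]
        if not is_complete:             # a trapezoid fixes one arc
            head[i if d == 0 else j] = j if d == 0 else i
            recover(i, k, 1, True); recover(k + 1, j, 0, True)
        elif d == 0:
            recover(i, k, 0, True); recover(k, j, 0, False)
        else:
            recover(i, k, 1, False); recover(k, j, 1, True)
    recover(0, n - 1, 1, True)
    return head  # head[0] stays None

The three nested loops over i, j, and the split point k give the O(n^3) runtime that motivates this algorithm over naïve bilexical CKY.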

SLIDE 29

Eisner Algorithm Example

we wash our cats

(Eisner chart for the example in the original figure, ending in the goal item.)

SLIDE 30

Three Approaches to Dependency Parsing

  • 1. Dynamic programming with the Eisner algorithm.
  • 2. Transition-based parsing with a stack.
  • 3. Chu-Liu-Edmonds algorithm for arborescences.

SLIDE 35

Transition-Based Parsing

◮ Process x once, from left to right, making a sequence of greedy parsing decisions.
◮ Formally, the parser is a state machine (not a finite-state machine) whose state is represented by a stack S and a buffer B.
◮ Initialize the buffer to contain x and the stack to contain the root symbol.
◮ The “arc standard” transition set (Nivre, 2004):
  ◮ shift: move the word at the front of the buffer B onto the stack S.
  ◮ right-arc: u = pop(S); v = pop(S); push(S, v → u).
  ◮ left-arc: u = pop(S); v = pop(S); push(S, v ← u).
  (For labeled parsing, add labels to the right-arc and left-arc transitions.)
◮ During parsing, apply a classifier to decide which transition to take next, greedily. No backtracking. A sketch of this loop follows.
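
A minimal Python sketch of the loop; choose stands in for the trained classifier of the later slides, and the names and index-based encoding are my choices:

def arc_standard_parse(words, choose):
    # words is x1..xn; choose(stack, buffer) must return "shift",
    # "left-arc", or "right-arc" (and is assumed to propose only
    # applicable actions). Stack entries are word indices; 0 is root.
    stack = [0]
    buffer = list(range(1, len(words) + 1))
    arcs = []  # (head, child) pairs
    while buffer or len(stack) > 1:
        action = choose(stack, buffer)
        if action == "shift":
            stack.append(buffer.pop(0))
        elif action == "right-arc":
            u = stack.pop(); v = stack.pop()
            arcs.append((v, u))         # v -> u
            stack.append(v)
        else:                           # left-arc
            u = stack.pop(); v = stack.pop()
            arcs.append((u, v))         # v <- u
            stack.append(u)
    return arcs

A complete parse of an n-word sentence takes exactly 2n transitions: each word is shifted once and becomes a child exactly once, as the worked example on the next slides shows.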

SLIDE 36

Transition-Based Parsing: Example

Stack S: root
Buffer B: we vigorously wash our cats who stink
Actions:

SLIDE 37

Transition-Based Parsing: Example

Stack S: we root
Buffer B: vigorously wash our cats who stink
Actions: shift

SLIDE 38

Transition-Based Parsing: Example

Stack S: vigorously we root
Buffer B: wash our cats who stink
Actions: shift shift

SLIDE 39

Transition-Based Parsing: Example

Stack S: wash vigorously we root
Buffer B: our cats who stink
Actions: shift shift shift

SLIDE 40

Transition-Based Parsing: Example

Stack S: [vigorously wash] we root   (brackets mark a subtree built on the stack)
Buffer B: our cats who stink
Actions: shift shift shift left-arc

SLIDE 41

Transition-Based Parsing: Example

Stack S: [we vigorously wash] root
Buffer B: our cats who stink
Actions: shift shift shift left-arc left-arc

SLIDE 42

Transition-Based Parsing: Example

Stack S: our [we vigorously wash] root
Buffer B: cats who stink
Actions: shift shift shift left-arc left-arc shift

SLIDE 43

Transition-Based Parsing: Example

Stack S: cats our [we vigorously wash] root
Buffer B: who stink
Actions: shift shift shift left-arc left-arc shift shift

SLIDE 44

Transition-Based Parsing: Example

Stack S: [our cats] [we vigorously wash] root
Buffer B: who stink
Actions: shift shift shift left-arc left-arc shift shift left-arc

SLIDE 45

Transition-Based Parsing: Example

Stack S: who [our cats] [we vigorously wash] root
Buffer B: stink
Actions: shift shift shift left-arc left-arc shift shift left-arc shift

SLIDE 46

Transition-Based Parsing: Example

Stack S: stink who [our cats] [we vigorously wash] root
Buffer B:
Actions: shift shift shift left-arc left-arc shift shift left-arc shift shift

SLIDE 47

Transition-Based Parsing: Example

Stack S: [who stink] [our cats] [we vigorously wash] root
Buffer B:
Actions: shift shift shift left-arc left-arc shift shift left-arc shift shift right-arc

SLIDE 48

Transition-Based Parsing: Example

Stack S: [our cats who stink] [we vigorously wash] root
Buffer B:
Actions: shift shift shift left-arc left-arc shift shift left-arc shift shift right-arc right-arc

SLIDE 49

Transition-Based Parsing: Example

Stack S: [we vigorously wash our cats who stink] root
Buffer B:
Actions: shift shift shift left-arc left-arc shift shift left-arc shift shift right-arc right-arc right-arc

SLIDE 50

Transition-Based Parsing: Example

Stack S: root, now heading the complete tree [we vigorously wash our cats who stink]
Buffer B:
Actions: shift shift shift left-arc left-arc shift shift left-arc shift shift right-arc right-arc right-arc right-arc

SLIDE 53

The Core of Transition-Based Parsing: Classification

◮ At each iteration, choose among {shift, right-arc, left-arc}. (Actually, among all L-labeled variants of right- and left-arc.)
◮ Features can look at S, B, and the history of past actions—usually there is no decomposition into local structures.
◮ Training data: the “oracle” transition sequence that gives the right tree converts into 2 · n pairs ⟨state, correct transition⟩; each word gets shifted once and participates as a child in one arc. A sketch of this conversion follows.
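
The conversion can be sketched as follows for a projective gold tree; heads is the gold parent array, and the pending bookkeeping (a word may be attached as a child only after collecting its own children) is my addition:

def oracle_actions(heads):
    # heads[c] is the gold parent of word c (1-indexed); heads[c] == 0
    # means c attaches to root. Returns 2n (state, action) pairs.
    n = len(heads) - 1
    pending = [0] * len(heads)      # children each word still needs
    for c in range(1, n + 1):
        pending[heads[c]] += 1
    stack, buffer, pairs = [0], list(range(1, n + 1)), []
    while buffer or len(stack) > 1:
        state = (tuple(stack), tuple(buffer))
        s1, s2 = (stack[-1], stack[-2]) if len(stack) > 1 else (None, None)
        if s2 is not None and heads[s1] == s2 and pending[s1] == 0:
            action = "right-arc"; stack.pop(); pending[s2] -= 1
        elif s2 not in (None, 0) and heads[s2] == s1 and pending[s2] == 0:
            action = "left-arc"; stack.pop(-2); pending[s1] -= 1
        elif buffer:
            action = "shift"; stack.append(buffer.pop(0))
        else:
            raise ValueError("gold tree is not projective")
        pairs.append((state, action))
    return pairs

Run on the gold tree for “we vigorously wash our cats who stink,” this reproduces the 14-transition sequence of the earlier example.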

SLIDE 54

Transition-Based Parsing: Remarks

◮ Can also be applied to phrase-structure parsing (e.g., Sagae and Lavie, 2006). Keyword: “shift-reduce” parsing.
◮ The algorithm for making decisions doesn’t need to be greedy; it can maintain multiple hypotheses.
  ◮ E.g., beam search, which we’ll discuss in the context of machine translation later.
◮ Potential flaw: the classifier is typically trained under the assumption that previous classification decisions were all correct.
  ◮ As yet, there is no principled solution to this problem, but see “dynamic oracles” (Goldberg and Nivre, 2012).

SLIDE 55

Three Approaches to Dependency Parsing

  • 1. Dynamic programming with the Eisner algorithm.
  • 2. Transition-based parsing with a stack.
  • 3. Chu-Liu-Edmonds algorithm for arborescences.

SLIDE 56

Acknowledgment

Slides are mostly adapted from those by Swabha Swayamdipta and Sam Thomson.

SLIDE 57

Features in Dependency Parsing

For the Eisner algorithm, the score of an unlabeled parse y was

sglobal(y) = Σ_{c=1..n} [ log p(xc | Nxc)
                        + log { p(Nxc Nxp | Nxp)  if ⟨p, c⟩ ∈ y and c < p and p > 0
                                p(Nxp Nxc | Nxp)  if ⟨p, c⟩ ∈ y and c > p and p > 0
                                p(Nxc | S)        if ⟨0, c⟩ ∈ y } ]

For transition-based parsing, we could use any past decisions to score the current decision:

sglobal(y) = s(a) = Σ_{i=1..|a|} s(ai | a0:i−1)

We gave up on any guarantee of finding the best possible y in favor of arbitrary features.

◮ For a neural network-based model that fully exploits this, see Dyer et al. (2015).

SLIDE 58

Graph-Based Dependency Parsing

(McDonald et al., 2005)

Every possible directed edge e between a parent p and a child c gets a local score, s(e). This set, E, contains O(n^2) edges. No incoming edges to x0, ensuring that it will be the root.

SLIDE 59

First-Order Graph-Based (FOG) Dependency Parsing

(McDonald et al., 2005)

y∗ = argmax_{y ⊆ E} sglobal(y) = argmax_{y ⊆ E} Σ_{e ∈ y} s(e)

subject to the constraint that y is an arborescence.

Classical algorithm to efficiently solve this problem: Chu and Liu (1965), Edmonds (1967)

SLIDE 62

Chu-Liu-Edmonds Intuitions

◮ Every non-root node needs exactly one incoming edge.
◮ In fact, every connected component that doesn’t contain x0 needs exactly one incoming edge.

High-level view of the algorithm:

1. For every c, pick an incoming edge (i.e., pick a parent)—greedily.
2. If this forms an arborescence, you are done!
3. Otherwise, it’s because there’s a cycle, C.
  ◮ Arborescences can’t have cycles, so some edge in C needs to be kicked out.
  ◮ We also need to find an incoming edge for C.
  ◮ Choosing the incoming edge for C determines which edge to kick out.

SLIDE 63

Chu-Liu-Edmonds: Recursive (Inefficient) Definition

def maxArborescence(V, E, root):
    # returns best arborescence as a map from each node to its parent
    for c in V \ root:
        bestInEdge[c] ← argmax_{e ∈ E : e = ⟨p, c⟩} e.s   # i.e., s(e)
    if bestInEdge contains a cycle C:
        # build a new graph where C is contracted into a single node
        vC ← new Node()
        V′ ← V ∪ {vC} \ C
        E′ ← {adjust(e, vC) for e ∈ E \ C}
        A ← maxArborescence(V′, E′, root)
        return {e.original for e ∈ A} ∪ C \ {A[vC].kicksOut}
    # each node got a parent without creating any cycles
    return bestInEdge

SLIDE 64

Understanding Chu-Liu-Edmonds

There are two stages:

◮ Contraction (the stuff before the recursive call)
◮ Expansion (the stuff after the recursive call)

SLIDE 65

Chu-Liu-Edmonds: Contraction

◮ For each non-root node v, set bestInEdge[v] to be its highest scoring incoming edge.
◮ If a cycle C is formed:
  ◮ Contract the nodes in C into a new node vC. (The adjust subroutine on the next slide performs the following.)
  ◮ Edges incoming to any node in C now get destination vC.
  ◮ For each node v in C, and for each edge e incoming to v from outside of C:
    ◮ set e.kicksOut to bestInEdge[v], and
    ◮ set e.s to be e.s − e.kicksOut.s.
  ◮ Edges outgoing from any node in C now get source vC.
◮ Repeat until every non-root node has an incoming edge and no cycles are formed.

SLIDE 66

Chu-Liu-Edmonds: Edge Adjustment Subroutine

def adjust(e, vC):
    e′ ← copy(e)
    e′.original ← e
    if e.dest ∈ C:
        e′.dest ← vC
        e′.kicksOut ← bestInEdge[e.dest]
        e′.s ← e.s − e′.kicksOut.s
    elif e.src ∈ C:
        e′.src ← vC
    return e′
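
The recursion on the previous slide leaves the bookkeeping implicit; here is a runnable Python sketch of the same design. The dict-based graph encoding, the find_cycle helper, and the assumption that every non-root node has at least one incoming edge are my choices, not the slides’:

def max_arborescence(nodes, edges, root):
    # edges maps an edge id to (src, dest, score). Returns the best
    # arborescence as {child: edge id}, like the slides' bestInEdge.
    best_in = {}
    for v in nodes:
        if v == root:
            continue
        incoming = [e for e, (_, d, _) in edges.items() if d == v]
        best_in[v] = max(incoming, key=lambda e: edges[e][2])
    cycle = find_cycle(best_in, edges)
    if cycle is None:
        return best_in                  # no cycles: done
    vc = ("contracted", len(nodes))     # fresh node standing for C
    new_nodes = [v for v in nodes if v not in cycle] + [vc]
    new_edges = {}
    for e, (src, dest, s) in edges.items():
        if src in cycle and dest in cycle:
            continue                    # internal to C: drop
        elif dest in cycle:             # adjust: redirect into vC and
            kicked = best_in[dest]      # discount by the kicked edge
            new_edges[e] = (src, vc, s - edges[kicked][2])
        elif src in cycle:
            new_edges[e] = (vc, dest, s)
        else:
            new_edges[e] = (src, dest, s)
    sub = max_arborescence(new_nodes, new_edges, root)
    result = {v: best_in[v] for v in cycle}     # keep cycle edges...
    for e in sub.values():                      # ...except the one that
        result[edges[e][1]] = e                 # the chosen edge kicks out
    return result

def find_cycle(best_in, edges):
    # Return the set of nodes on a cycle among the chosen edges, or None.
    for start in best_in:
        path, v = [], start
        while v in best_in:
            if v in path:
                return set(path[path.index(v):])
            path.append(v)
            v = edges[best_in[v]][0]    # step to the chosen parent
    return None

Edge ids are preserved through the recursion, so sub.values() can be mapped back to original endpoints via edges; this plays the role of e.original, and the score discounting plays the role of e.kicksOut.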

SLIDE 67

Contraction Example

[Graph in the original figure: nodes ROOT, V1, V2, V3, with nine candidate edges scored a: 5, b: 1, c: 1, d: 11, e: 4, f: 5, g: 10, h: 9, i: 8.]

bestInEdge: V1: –, V2: –, V3: –
kicksOut: (empty)

SLIDE 68

Contraction Example

[Graph as before; the highest-scoring edge into V1, g: 10, is selected.]

bestInEdge: V1: g, V2: –, V3: –
kicksOut: (empty)

SLIDE 69

Contraction Example

[Graph as before; d: 11 is selected into V2, and now g and d form a cycle over {V1, V2}.]

bestInEdge: V1: g, V2: d, V3: –
kicksOut: (empty)

SLIDE 70

Contraction Example

[The cycle {V1, V2} is contracted into a new node V4; incoming edges are rescored: a: 5 − 10 and h: 9 − 10 (into V1), b: 1 − 11 and i: 8 − 11 (into V2).]

bestInEdge: V1: g, V2: d, V3: –
kicksOut: a: {g}, b: {d}, h: {g}, i: {d}

SLIDE 71

Contraction Example

[New graph: ROOT, V3, V4, with edges b: −10, a: −5, i: −3, h: −1 into V4, and c: 1, f: 5, e: 4 into V3.]

bestInEdge: V1: g, V2: d, V3: –, V4: –
kicksOut: a: {g}, b: {d}, h: {g}, i: {d}

SLIDE 72

Contraction Example

[Graph as before; f: 5 is selected into V3.]

bestInEdge: V1: g, V2: d, V3: f, V4: –
kicksOut: a: {g}, b: {d}, h: {g}, i: {d}

SLIDE 73

Contraction Example

[Graph as before; h: −1 is selected into V4, and now f and h form a cycle over {V3, V4}.]

bestInEdge: V1: g, V2: d, V3: f, V4: h
kicksOut: a: {g}, b: {d}, h: {g}, i: {d}

SLIDE 74

Contraction Example

[The cycle {V3, V4} is contracted into a new node V5; incoming edges are rescored: c: 1 − 5 (into V3), a: −5 − (−1) and b: −10 − (−1) (into V4).]

bestInEdge: V1: g, V2: d, V3: f, V4: h
kicksOut: a: {g, h}, b: {d, h}, c: {f}, h: {g}, i: {d}

SLIDE 75

Contraction Example

[New graph: ROOT and V5, with edges b: −9, a: −4, c: −4.]

bestInEdge: V1: g, V2: d, V3: f, V4: h, V5: –
kicksOut: a: {g, h}, b: {d, h}, c: {f}, h: {g}, i: {d}

SLIDE 76

Contraction Example

[Graph as before; a: −4 is selected into V5. Every non-root node now has a parent and there are no cycles, so contraction ends.]

bestInEdge: V1: g, V2: d, V3: f, V4: h, V5: a
kicksOut: a: {g, h}, b: {d, h}, c: {f}, h: {g}, i: {d}

SLIDE 77

Chu-Liu-Edmonds: Expansion

After the contraction stage, every contracted node will have exactly one bestInEdge. This edge will kick out one edge inside the contracted node, breaking the cycle.

◮ Go through each bestInEdge e in the reverse order that we added them.
◮ Lock down e, and remove every edge in kicksOut(e) from bestInEdge.

SLIDE 78

Expansion Example

[Graph: ROOT and V5, with edges b: −9, a: −4, c: −4.]

bestInEdge: V1: g, V2: d, V3: f, V4: h, V5: a
kicksOut: a: {g, h}, b: {d, h}, c: {f}, h: {g}, i: {d}

SLIDE 79

Expansion Example

[Locking down V5’s edge a kicks g and h out of bestInEdge; a’s original destination is V1.]

bestInEdge: V1: a (g kicked out), V2: d, V3: f, V4: a (h kicked out), V5: a
kicksOut: a: {g, h}, b: {d, h}, c: {f}, h: {g}, i: {d}

SLIDE 80

Expansion Example

[Expanding V5 back into {V3, V4}: V3 keeps f, and V4’s entry h has been kicked out in favor of a.]

bestInEdge: V1: a (g kicked out), V2: d, V3: f, V4: a (h kicked out), V5: a
kicksOut: a: {g, h}, b: {d, h}, c: {f}, h: {g}, i: {d}

SLIDE 82

Expansion Example

[Expanding V4 back into {V1, V2}: V2 keeps d, and V1’s entry g has been kicked out in favor of a. The final arborescence uses a (ROOT → V1), d, and f.]

bestInEdge: V1: a (g kicked out), V2: d, V3: f, V4: a (h kicked out), V5: a
kicksOut: a: {g, h}, b: {d, h}, c: {f}, h: {g}, i: {d}

SLIDE 84

Observation

The set of arborescences strictly includes the set of projective dependency trees. Is this a good thing or a bad thing?

SLIDE 85

Nonprojective Example

A hearing is scheduled on the issue today .

(Arc labels in the original figure: ROOT, ATT, SBJ, VC, TMP, PC, PU; the arcs cross, so the tree is nonprojective.)

SLIDE 89

Chu-Liu-Edmonds: Notes

◮ This is a greedy algorithm with a clever form of delayed backtracking to recover from inconsistent decisions (cycles).
◮ CLE is exact: it always recovers an optimal arborescence.
◮ What about labeled dependencies?
  ◮ As a matter of preprocessing, for each ⟨p, c⟩, keep only the top-scoring labeled edge.
◮ Tarjan (1977) offered a more efficient, but unfortunately incorrect, implementation; Camerini et al. (1979) corrected it. That approach is not recursive, instead using a disjoint-set data structure to keep track of collapsed nodes. Even better, Gabow et al. (1986) used a Fibonacci heap to keep incoming edges sorted, found cycles in a more sensible way, and also constrained root to have only one outgoing edge. With these tricks, O(n^2) runtime.

SLIDE 91

More Details on Statistical Dependency Parsing

◮ What about the scores? McDonald et al. (2005) used carefully designed features and (something close to) the structured perceptron; Kiperwasser and Goldberg (2016) used bidirectional recurrent neural networks.
◮ What about higher-order parsing? It requires approximate inference, e.g., dual decomposition (Martins et al., 2013).

SLIDE 94

Important Tradeoffs (and Not Just in NLP)

1. Two extremes:
  ◮ Specialized algorithm that efficiently solves your problem, under your assumptions. E.g., Chu-Liu-Edmonds for FOG dependency parsing.
  ◮ General-purpose method that solves many problems, allowing you to test the effect of different assumptions. E.g., dynamic programming, transition-based methods, some forms of approximate inference.
2. Two extremes:
  ◮ Fast (linear-time) but greedy.
  ◮ Model-optimal but slow.
◮ Dirty secret: the best way to get (English) dependency trees is to run phrase-structure parsing, then convert.

SLIDE 95

References I

Paolo M. Camerini, Luigi Fratta, and Francesco Maffioli. A note on finding optimum branchings. Networks, 9 (4):309–312, 1979.

Y. J. Chu and T. H. Liu. On the shortest arborescence of a directed graph. Science Sinica, 14:1396–1400, 1965.

Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In Proc. of LREC, 2006.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. Transition-based dependency parsing with stack long short-term memory. In Proc. of ACL, 2015.

Jack Edmonds. Optimum branchings. Journal of Research of the National Bureau of Standards, 71B:233–240, 1967.

Jason M. Eisner. Three new probabilistic models for dependency parsing: An exploration. In Proc. of COLING, 1996.

Harold N. Gabow, Zvi Galil, Thomas Spencer, and Robert E. Tarjan. Efficient algorithms for finding minimum spanning trees in undirected and directed graphs. Combinatorica, 6(2):109–122, 1986.

Yoav Goldberg and Joakim Nivre. A dynamic oracle for arc-eager dependency parsing. In Proc. of COLING, 2012.

Eliyahu Kiperwasser and Yoav Goldberg. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics, 4:313–327, 2016.

SLIDE 96

References II

Sandra Kübler, Ryan McDonald, and Joakim Nivre. Dependency Parsing. Synthesis Lectures on Human Language Technologies. Morgan and Claypool, 2009. URL http://www.morganclaypool.com/doi/pdf/10.2200/S00169ED1V01Y200901HLT002.

André F. T. Martins, Miguel Almeida, and Noah A. Smith. Turning on the turbo: Fast third-order non-projective turbo parsers. In Proc. of ACL, 2013.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. Non-projective dependency parsing using spanning tree algorithms. In Proc. of HLT-EMNLP, 2005. URL http://www.aclweb.org/anthology/H/H05/H05-1066.pdf.

Igor A. Mel’čuk. Dependency Syntax: Theory and Practice. State University Press of New York, 1987.

Joakim Nivre. Incrementality in deterministic dependency parsing. In Proc. of ACL, 2004.

Kenji Sagae and Alon Lavie. A best-first probabilistic shift-reduce parser. In Proc. of COLING-ACL, 2006.

Robert E. Tarjan. Finding optimum branchings. Networks, 7:25–35, 1977.

L. Tesnière. Éléments de Syntaxe Structurale. Klincksieck, 1959.
