Algorithms for NLP, CS 11-711, Fall 2020. Lecture 14: Graph-based dependency parsing



slide-1
SLIDE 1

Emma Strubell

Algorithms for NLP

CS 11-711 · Fall 2020

Lecture 14: Graph-based dependency parsing

slide-2
SLIDE 2

Announcements

■ No recitation on Friday (Tartan Community Day).

slide-9
SLIDE 9

Dependency parsing

■ Transition-based (shift-reduce) parsing:
  ■ Greedy choice of local transitions guided by a good classifier.
  ■ Examples: MaltParser [Nivre et al. 2008], Stack LSTM [Dyer et al. 2015]

[Figure: transition-based parser. An input buffer (w1, w2, ..., wn) and a stack (s1, s2, ..., sn) feed a parser guided by an oracle, which outputs dependency relations.]

■ Graph-based dependency parsing:
  ■ Given scores for every pair of words, find the (globally) highest scoring set of edges.
  ■ Examples: MSTParser [McDonald et al. 2005], TurboParser [Martins et al. 2009], Deep Biaffine [Dozat and Manning 2017]

[Figure: fully connected directed graph over root, Book, that, flight with edge scores 12, 4, 4, 5, 6, 8, 7, 5, 7.]
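To make the contrast concrete, here is a minimal sketch of the greedy transition-based loop. It assumes the arc-standard transition system (the slide does not say which system is meant), and `choose` is a hypothetical stand-in for the trained classifier.

```python
from typing import Callable, List, Tuple

Arc = Tuple[int, int]  # (head index, dependent index); index 0 is the artificial root


def greedy_arc_standard(n_words: int,
                        choose: Callable[[List[int], List[int]], str]) -> List[Arc]:
    """Parse words 1..n_words with greedy arc-standard transitions.

    `choose` (hypothetical) sees the stack and buffer and returns
    "SHIFT", "LEFT_ARC", or "RIGHT_ARC". No search or backtracking is done.
    """
    stack: List[int] = [0]                        # root starts on the stack
    buffer: List[int] = list(range(1, n_words + 1))
    arcs: List[Arc] = []
    while buffer or len(stack) > 1:
        action = choose(stack, buffer)
        if action == "SHIFT" and buffer:
            stack.append(buffer.pop(0))
        elif action == "LEFT_ARC" and len(stack) > 2:
            dep = stack.pop(-2)                   # second-from-top attaches to the top
            arcs.append((stack[-1], dep))
        elif action == "RIGHT_ARC" and len(stack) > 1:
            dep = stack.pop()                     # top attaches to the item below it
            arcs.append((stack[-1], dep))
        else:
            raise ValueError(f"illegal action {action!r}")
    return arcs


# Usage: a scripted "classifier" for "Book that flight" (1=Book, 2=that, 3=flight).
script = iter(["SHIFT", "SHIFT", "SHIFT", "LEFT_ARC", "RIGHT_ARC", "RIGHT_ARC"])
arcs = greedy_arc_standard(3, lambda stack, buffer: next(script))
print(sorted(arcs))
```

Each decision is local and final, which is what distinguishes this family from the graph-based parsers that optimize over whole trees.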

slide-15
SLIDE 15

Graph-based dependency parsing

[Figure: fully connected directed graph over root, Book, that, flight with edge scores 12, 4, 4, 5, 6, 8, 7, 5, 7.]

■ Edge-factored (or arc-factored) approaches:
  ■ Score of a tree decomposes as sum of edge scores:

$$\Psi(y, w; \theta) = \sum_{i \xrightarrow{r} j \in y} \psi(i \xrightarrow{r} j, w, \theta)$$

  ■ Start with a fully-connected directed graph
  ■ How to infer the highest scoring tree?
  ■ Find a maximum directed spanning tree: Chu and Liu (1965) and Edmonds (1967) algorithm
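The edge-factored score can be sketched in a few lines. The nine edge scores are the numbers from the slide's figure, but which edge carries which number is an assumption here, and relation labels are ignored.

```python
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]  # (head, dependent)

# Scores psi for "Book that flight". The values come from the figure;
# their assignment to specific edges is assumed.
SCORES: Dict[Edge, float] = {
    ("root", "Book"): 12, ("root", "that"): 4, ("root", "flight"): 4,
    ("Book", "that"): 5, ("Book", "flight"): 7,
    ("that", "Book"): 6, ("that", "flight"): 8,
    ("flight", "Book"): 5, ("flight", "that"): 7,
}


def tree_score(tree: Iterable[Edge], scores: Dict[Edge, float]) -> float:
    """Psi(y, w; theta): the sum of psi over the edges of tree y."""
    return sum(scores[edge] for edge in tree)


tree_a = [("root", "Book"), ("Book", "flight"), ("flight", "that")]
tree_b = [("root", "Book"), ("Book", "that"), ("that", "flight")]
print(tree_score(tree_a, SCORES), tree_score(tree_b, SCORES))  # 26 25
```

Because the score is a plain sum over edges, inference reduces to a search over trees in a weighted directed graph.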
slide-22
SLIDE 22

Chu-Liu-Edmonds algorithm

function MAXSPANNINGTREE(G=(V,E), root, score) returns spanning tree
    F ← []
    T' ← []
    score' ← []
    for each v ∈ V do
        // select best incoming edge for each node
        bestInEdge ← argmax_{e=(u,v) ∈ E} score[e]
        F ← F ∪ bestInEdge
        // subtract its score from all incoming edges
        for each e=(u,v) ∈ E do
            score'[e] ← score[e] − score[bestInEdge]
    // stopping condition
    if T=(V,F) is a spanning tree then
        return it
    else
        // contract nodes if there are cycles
        C ← a cycle in F
        G' ← CONTRACT(G, C)
        // recursively compute MST
        T' ← MAXSPANNINGTREE(G', root, score')
        // expand contracted nodes
        T ← EXPAND(T', C)
        return T

function CONTRACT(G, C) returns contracted graph
function EXPAND(T, C) returns expanded graph
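The pseudocode can be turned into a short runnable sketch. This version is one possible realization, not the lecture's code: it rescores only the edges entering the contracted cycle (a common equivalent variant), assumes every non-root node has at least one incoming edge, and uses an assumed assignment of the figure's scores to edges (0 = root, 1 = Book, 2 = that, 3 = flight).

```python
from typing import Dict, List, Optional, Tuple

Edge = Tuple[int, int]  # (head, dependent)


def find_cycle(head: Dict[int, int]) -> Optional[List[int]]:
    """Return the nodes of one cycle in the {dependent: head} map, or None."""
    for start in head:
        path, v = [], start
        while v in head and v not in path:
            path.append(v)
            v = head[v]
        if v in path:
            return path[path.index(v):]
    return None


def max_spanning_tree(nodes: List[int], root: int,
                      score: Dict[Edge, float]) -> Dict[int, int]:
    """Return {dependent: head} for the maximum directed spanning tree."""
    # Select the best incoming edge for each non-root node.
    head = {v: max((u for u in nodes if (u, v) in score),
                   key=lambda u: score[(u, v)])
            for v in nodes if v != root}
    cycle = find_cycle(head)
    if cycle is None:                         # stopping condition
        return head
    # Contract the cycle into a fresh node c.
    c = max(nodes) + 1
    in_cycle = set(cycle)
    new_score: Dict[Edge, float] = {}
    origin: Dict[Edge, Edge] = {}             # contracted edge -> original edge
    for (u, v), s in score.items():
        if u in in_cycle and v in in_cycle:
            continue
        if v in in_cycle:                     # entering edge: subtract v's best incoming score
            e, s = (u, c), s - score[(head[v], v)]
        elif u in in_cycle:                   # leaving edge
            e = (c, v)
        else:
            e = (u, v)
        if e not in new_score or s > new_score[e]:
            new_score[e], origin[e] = s, (u, v)
    rest = [v for v in nodes if v not in in_cycle] + [c]
    sub = max_spanning_tree(rest, root, new_score)   # recursively compute MST
    # Expand: keep the cycle edges, except the one displaced by the entering edge.
    result = {v: head[v] for v in cycle}
    for v, u in sub.items():
        orig_head, orig_dep = origin[(u, v)]
        result[orig_dep] = orig_head
    return result


SCORES: Dict[Edge, float] = {
    (0, 1): 12, (0, 2): 4, (0, 3): 4,
    (1, 2): 5, (1, 3): 7,
    (2, 1): 6, (2, 3): 8,
    (3, 1): 5, (3, 2): 7,
}
print(max_spanning_tree([0, 1, 2, 3], 0, SCORES))
```

On this input the greedy choice creates the cycle that ↔ flight; after contraction and expansion the result is root → Book, Book → flight, flight → that.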

slide-23
SLIDE 23

Chu-Liu-Edmonds algorithm

■ Select best incoming edge for each node

[Figure: graph over root, Book, that, flight with edge scores 12, 4, 4, 5, 6, 8, 7, 5, 7. Best incoming edge scores: Book 12, that 7, flight 8.]

slide-24
SLIDE 24

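Chu-Liu-Edmonds algorithm

■ Subtract its score from all incoming edges

[Figure: the same graph with best incoming scores Book 12, that 7, flight 8. 12 is subtracted from each of the three edges entering Book, 7 from each edge entering that, and 8 from each edge entering flight.]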
slide-25
SLIDE 25

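Chu-Liu-Edmonds algorithm

■ Subtract its score from all incoming edges

[Figure: the graph after subtraction. The remaining edge scores shown are -4, -3, -2, -6, -1, -7; node labels still show the subtracted values Book 12, that 7, flight 8.]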
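The subtraction step can be checked numerically. The sketch below assumes a particular assignment of the figure's nine scores to edges; under that assumption it reproduces the six negative values shown on the slide, with each node's best incoming edge dropping to 0.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]  # (head, dependent)

# Figure scores; the edge each value belongs to is assumed.
SCORES: Dict[Edge, int] = {
    ("root", "Book"): 12, ("root", "that"): 4, ("root", "flight"): 4,
    ("Book", "that"): 5, ("Book", "flight"): 7,
    ("that", "Book"): 6, ("that", "flight"): 8,
    ("flight", "Book"): 5, ("flight", "that"): 7,
}


def subtract_best_incoming(scores: Dict[Edge, int]) -> Dict[Edge, int]:
    """Subtract, from every edge into v, the score of v's best incoming edge."""
    best_in: Dict[str, int] = {}
    for (_, v), s in scores.items():
        best_in[v] = max(best_in.get(v, s), s)
    return {(u, v): s - best_in[v] for (u, v), s in scores.items()}


reduced = subtract_best_incoming(SCORES)
nonzero = sorted(s for s in reduced.values() if s != 0)
print(nonzero)  # [-7, -6, -4, -3, -2, -1]
```

Subtracting a per-node constant does not change which incoming edge is best for that node, so the argmax tree is unaffected.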

slide-26
SLIDE 26

Chu-Liu-Edmonds algorithm

■ Contract nodes if there are cycles

[Figure: the rescored graph over root, Book, that, flight, with edge scores -4, -3, -2, -6, -1, -7, before contraction.]

slide-27
SLIDE 27

Chu-Liu-Edmonds algorithm

■ Contract nodes if there are cycles

[Figure: that and flight contracted into a single node tf. The graph over root, Book, tf has edge scores -3, -4, -7, -1, -6, -2.]
slide-28
SLIDE 28

Chu-Liu-Edmonds algorithm

■ Recursively compute MST

[Figure: the contracted graph over root, Book, tf with edge scores -3, -4, -7, -1, -6, -2; the edge scoring -1 is highlighted.]
slide-29
SLIDE 29

Chu-Liu-Edmonds algorithm

■ Expand contracted nodes

[Figure: final tree over root, Book, that, flight; one edge of the former cycle is marked "Deleted from cycle".]
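The expansion step can be illustrated on the contracted node tf. This is a partial sketch under stated assumptions: the recursive call is assumed to have chosen the original edge Book → flight to enter tf, and edges leaving the contracted node are not handled. The function name and arguments are hypothetical.

```python
from typing import Dict, Tuple


def expand_cycle(tree: Dict[str, str], cycle: Dict[str, str],
                 contracted: str, entering: Tuple[str, str]) -> Dict[str, str]:
    """Expand `contracted` in `tree` ({dependent: head}) back into `cycle`.

    `entering` is the original (head, dependent) edge chosen to enter the
    contracted node. The cycle edge into that dependent is the one deleted.
    Edges whose head is the contracted node are not handled in this sketch.
    """
    heads = {d: h for d, h in tree.items() if d != contracted}
    heads.update(cycle)                  # keep every cycle edge ...
    head, dependent = entering
    heads[dependent] = head              # ... except the one displaced here
    return heads


contracted_tree = {"Book": "root", "tf": "Book"}
cycle = {"that": "flight", "flight": "that"}
final = expand_cycle(contracted_tree, cycle, "tf", ("Book", "flight"))
print(final)
```

Here the cycle edge that → flight is the one deleted, leaving root → Book, Book → flight, flight → that.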

slide-33
SLIDE 33

Chu-Liu-Edmonds algorithm

■ Runtime?
  ■ Naive: O(n^3)
  ■ Fancy: O(n^2 + n log n)
■ What about labeled parsing?
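One possible answer to the labeled-parsing question, for arc-factored models: since the tree constraint does not depend on labels, the best label can be picked independently for each edge, and the unlabeled algorithm is then run on the resulting scores. The sketch below shows only that collapsing step; the labels and numbers are invented for illustration.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]  # (head, dependent)


def collapse_labels(labeled: Dict[Edge, Dict[str, float]]
                    ) -> Tuple[Dict[Edge, float], Dict[Edge, str]]:
    """Keep, for each edge, its highest-scoring label and that label's score."""
    scores: Dict[Edge, float] = {}
    labels: Dict[Edge, str] = {}
    for edge, by_label in labeled.items():
        best = max(by_label, key=by_label.get)
        labels[edge] = best
        scores[edge] = by_label[best]
    return scores, labels


# Hypothetical labeled scores for three edges.
labeled = {
    ("root", "Book"): {"root": 12.0, "dobj": 3.0},
    ("Book", "flight"): {"dobj": 7.0, "det": 1.0},
    ("flight", "that"): {"det": 7.0, "nsubj": 2.0},
}
scores, labels = collapse_labels(labeled)
print(scores, labels)
```

After decoding the unlabeled tree from `scores`, each selected edge takes its label from `labels`.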

slide-35
SLIDE 35

Graph-based dependency parsing

[Figure: fully connected directed graph over root, Book, that, flight with edge scores 12, 4, 4, 5, 6, 8, 7, 5, 7.]

■ Edge-factored (or arc-factored) approaches:
  ■ Score of a tree decomposes as sum of edge scores:

$$\Psi(y, w; \theta) = \sum_{i \xrightarrow{r} j \in y} \psi(i \xrightarrow{r} j, w, \theta)$$

  ■ Start with a fully-connected directed graph
  ■ How to infer the highest scoring tree?
  ■ Find a maximum directed spanning tree: Chu and Liu (1965) and Edmonds (1967) algorithm
  ■ For projective trees: Eisner's algorithm [Eisner 1996]
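For the projective case, a compact sketch of the first-order Eisner dynamic program is given below. It returns only the best score (no backpointers), allows the root to take several dependents, and uses an assumed assignment of the figure's scores to edges; `score[h][m]` is the score of the edge h → m with index 0 as root.

```python
from typing import List

NEG_INF = float("-inf")


def eisner_best_score(score: List[List[float]]) -> float:
    """Best projective tree score in O(n^3) time; score[h][m] scores edge h -> m."""
    n = len(score)
    # complete[i][j][d] and incomplete[i][j][d] hold the best score of span i..j.
    # d = 1: the head is the left end i; d = 0: the head is the right end j.
    complete = [[[0.0, 0.0] for _ in range(n)] for _ in range(n)]
    incomplete = [[[NEG_INF, NEG_INF] for _ in range(n)] for _ in range(n)]
    for width in range(1, n):
        for i in range(n - width):
            j = i + width
            # Join two complete half-spans, then add the edge between i and j.
            joined = max(complete[i][r][1] + complete[r + 1][j][0]
                         for r in range(i, j))
            # No edge may enter the root (index 0).
            incomplete[i][j][0] = joined + score[j][i] if i != 0 else NEG_INF
            incomplete[i][j][1] = joined + score[i][j]
            # Extend an incomplete span with a complete span below its dependent.
            complete[i][j][0] = max(complete[i][r][0] + incomplete[r][j][0]
                                    for r in range(i, j))
            complete[i][j][1] = max(incomplete[i][r][1] + complete[r][j][1]
                                    for r in range(i + 1, j + 1))
    return complete[0][n - 1][1]


# 0 = root, 1 = Book, 2 = that, 3 = flight (edge assignment assumed).
SCORE = [
    [0, 12, 4, 4],
    [0, 0, 5, 7],
    [0, 6, 0, 8],
    [0, 5, 7, 0],
]
print(eisner_best_score(SCORE))  # 26.0
```

The value 26 matches the tree root → Book, Book → flight, flight → that, which happens to be projective for this sentence.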

slide-39
SLIDE 39

Graph-based dependency parsing

■ Edge-factored (or arc-factored) approaches: score of a tree decomposes as sum of edge scores.

$$\Psi(y, w; \theta) = \sum_{i \xrightarrow{r} j \in y} \psi(i \xrightarrow{r} j, w, \theta)$$

■ Can also define higher-order models: score decomposes as a sum of scores of local subgraphs.
■ There are efficient (polynomial-time) algorithms for second-order [Eisner 1996] and third-order [Koo and Collins, 2010] projective dependency parsing.
■ Second-order parsing is NP-hard for non-projective dependency graphs [McDonald and Pereira, 2006]!

[Figure: parts scored at each order. First order: (h, m). Second order: (h, s, m) and (g, h, m). Third order: (g, h, s, m) and (h, t, s, m).]
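To illustrate what "local subgraphs" means for second-order models, the sketch below enumerates sibling parts (h, s, m) and grandparent parts (g, h, m) from a tree. It assumes the usual reading of these parts (s and m are adjacent dependents of h on the same side, with s nearer to h); the function names and the example tree are hypothetical.

```python
from typing import Dict, List, Tuple

Part = Tuple[int, int, int]


def sibling_parts(heads: Dict[int, int]) -> List[Part]:
    """(h, s, m): s and m are adjacent dependents of h on one side, s nearer to h."""
    children: Dict[int, List[int]] = {}
    for m, h in heads.items():
        children.setdefault(h, []).append(m)
    parts: List[Part] = []
    for h, mods in children.items():
        left = sorted((m for m in mods if m < h), reverse=True)  # nearest first
        right = sorted(m for m in mods if m > h)
        for side in (left, right):
            parts.extend((h, s, m) for s, m in zip(side, side[1:]))
    return parts


def grandparent_parts(heads: Dict[int, int]) -> List[Part]:
    """(g, h, m): g is the head of h, and h is the head of m."""
    return [(heads[h], h, m) for m, h in heads.items() if h in heads]


# Hypothetical tree as {dependent: head}: word 2 heads words 1, 3, 4; 0 is root.
heads = {2: 0, 1: 2, 3: 2, 4: 2}
print(sibling_parts(heads))      # [(2, 3, 4)]
print(grandparent_parts(heads))  # [(0, 2, 1), (0, 2, 3), (0, 2, 4)]
```

A second-order score would add one term per such part on top of the first-order edge scores.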

slide-41
SLIDE 41

Graph-based dependency parsing

■ Edge-factored (or arc-factored) approaches: score of a tree decomposes as sum of edge scores.

$$\Psi(y, w; \theta) = \sum_{i \xrightarrow{r} j \in y} \psi(i \xrightarrow{r} j, w, \theta)$$

[Figure: first-order part (h, m).]

■ How to parameterize ψ?

slide-42
SLIDE 42

Classic graph-based parsing features

■ Word forms, lemmas, parts of speech of the head word and its dependent
■ Corresponding features derived from the contexts before, after, between words
■ Word embeddings
■ Dependency relation
■ Direction of the relation (right or left)
■ Distance from the head to the dependent
■ Combinations of all of the above
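A few of these feature templates can be sketched as string-valued indicator features. The template names, the POS tags, and the choice of combinations are illustrative assumptions; lemmas, embeddings, and the dependency relation are left out to keep the example short.

```python
from typing import List


def edge_features(words: List[str], tags: List[str],
                  head: int, dep: int) -> List[str]:
    """Indicator features for a candidate edge head -> dep (token indices)."""
    direction = "R" if head < dep else "L"
    distance = abs(head - dep)
    hw, hp = words[head], tags[head]
    dw, dp = words[dep], tags[dep]
    lo, hi = sorted((head, dep))
    between = tags[lo + 1:hi]                    # POS tags strictly between the two words
    feats = [
        f"hw={hw}", f"hp={hp}", f"dw={dw}", f"dp={dp}",
        f"dir={direction}", f"dist={distance}",
        f"hp+dp={hp}+{dp}",                      # combinations of the above
        f"hw+dw={hw}+{dw}",
        f"hp+dp+dir+dist={hp}+{dp}+{direction}+{distance}",
    ]
    feats += [f"hp+bp+dp={hp}+{bp}+{dp}" for bp in between]
    return feats


words = ["<root>", "Book", "that", "flight"]
tags = ["ROOT", "VB", "DT", "NN"]                # assumed tags for the example
feats = edge_features(words, tags, head=1, dep=3)
print(feats)
```

In a linear model, the edge score ψ would be the dot product of a weight vector with the indicator vector of these features.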

slide-46
SLIDE 46

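Graph-based neural network parsers

[Figure: the tokens Janet/NNP will/MD back/VB the/DT bill/NN are mapped to word embeddings and POS embeddings, then passed through a neural network to produce per-token features. Per-token features are combined into edge features for candidate edges such as root → Janet, will → Janet, back → Janet, the → Janet, bill → Janet, Janet → bill.]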

slide-52
SLIDE 52

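Graph-based neural network parsers

■ Concat + feed-forward [Kiperwasser and Goldberg, 2016]

[Figure: BiLSTM encoder over the tokens the, brown, fox, jumped, *. Forward LSTM states (LSTM f) over inputs x_the, x_brown, x_fox, x_jumped, x_* and backward LSTM states (LSTM b, s0 to s4) are concatenated into token vectors V_the, V_brown, V_fox, V_jumped, V_*. MLPs score head-dependent pairs and the scores are summed (+). Annotations: stacked LSTM token encoder; concat head, dependent as input to MLP.]

■ Hinge loss:

$$\max\Big(0,\; 1 - \max_{y' \neq y} \sum_{(h,m) \in y'} \mathrm{MLP}(v_h \circ v_m) + \sum_{(h,m) \in y} \mathrm{MLP}(v_h \circ v_m)\Big)$$

■ w/ loss augmented inference:

$$\max\Big(0,\; 1 + \mathrm{score}(x, y) - \max_{y' \neq y} \sum_{\mathrm{part} \in y'} \big(\mathrm{score}_{\mathrm{local}}(x, \mathrm{part}) + \mathbb{1}_{\mathrm{part} \notin y}\big)\Big)$$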
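Two pieces of this slide can be sketched without a deep-learning library: scoring an edge by feeding the concatenated head and dependent vectors to an MLP, and the cost augmentation that adds 1 to every part outside the gold tree before the max over competing trees is taken. The one-hidden-layer form without an output bias, and all names and numbers, are simplifying assumptions.

```python
import math
from typing import Dict, List, Set, Tuple

Vector = List[float]
Edge = Tuple[int, int]  # (head, dependent)


def mlp_edge_score(v_head: Vector, v_dep: Vector,
                   W: List[Vector], b: Vector, u: Vector) -> float:
    """Score one edge as u . tanh(W [v_head ; v_dep] + b)."""
    x = v_head + v_dep                                    # concatenation
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
              for row, bi in zip(W, b)]
    return sum(ui * hi for ui, hi in zip(u, hidden))


def cost_augment(scores: Dict[Edge, float], gold: Set[Edge]) -> Dict[Edge, float]:
    """Add 1 to the local score of every edge that is not in the gold tree."""
    return {e: s + (0.0 if e in gold else 1.0) for e, s in scores.items()}


# Toy parameters: 2-dimensional token vectors, 2 hidden units.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
b = [0.0, 0.0]
u = [1.0, 1.0]
s = mlp_edge_score([1.0, 0.0], [0.0, 1.0], W, b, u)
augmented = cost_augment({(0, 1): 2.0, (1, 0): 0.5}, gold={(0, 1)})
print(s, augmented)
```

Decoding with the augmented scores makes wrong edges look more attractive during training, so the model has to separate the gold tree from its competitors by a margin.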

slide-58
SLIDE 58

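Graph-based neural network parsers

■ Biaffine classifier [Dozat and Manning, 2017]
■ Locally normalized log loss.

[Figure: biaffine parser architecture for input tokens such as root/ROOT, Kim/NNP, ... Embeddings x_i feed a stacked LSTM token encoder (BiLSTM) producing r_i; two MLPs (MLP_head and MLP_dep) produce h_i^{(arc-head)} and h_i^{(arc-dep)}; a biaffine classifier combines them into arc scores.]

$$S^{(arc)} = \big(H^{(arc\text{-}dep)} \oplus 1\big)\, U^{(arc)}\, H^{(arc\text{-}head)\top}$$

Fixed-class affine classifier:

$$s_i = W r_i + b$$

Variable-class biaffine classifier:

$$s_i^{(arc)} = \big(R\, U^{(1)}\big)\, r_i + \big(R\, u^{(2)}\big)$$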
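The biaffine scoring and the locally normalized loss can be sketched with plain lists. The function names, the convention that row i of S holds the scores of all candidate heads for token i, and the toy matrices are assumptions of this sketch.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    """Plain matrix product of A (n x k) and B (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def biaffine_scores(H_dep: Matrix, H_head: Matrix, U: Matrix) -> Matrix:
    """S = (H_dep ⊕ 1) U H_head^T; S[i][j] scores token j as the head of token i."""
    H_dep1 = [row + [1.0] for row in H_dep]           # append a column of ones
    H_head_T = [list(col) for col in zip(*H_head)]
    return matmul(matmul(H_dep1, U), H_head_T)


def head_log_loss(S: Matrix, gold_heads: List[int]) -> float:
    """Mean negative log-probability of each token's gold head (softmax per row)."""
    total = 0.0
    for row, g in zip(S, gold_heads):
        m = max(row)
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[g]
    return total / len(S)


# Toy example: 2 tokens, 2-dimensional MLP outputs, U of shape (2 + 1) x 2.
H_dep = [[1.0, 0.0], [0.0, 1.0]]
H_head = [[1.0, 0.0], [0.0, 1.0]]
U = [[0.0, 2.0], [2.0, 0.0], [0.0, 0.0]]
S = biaffine_scores(H_dep, H_head, U)
loss = head_log_loss(S, gold_heads=[1, 0])
print(S, loss)
```

The extra column of ones lets the last row of U act as a bias that depends only on the candidate head. The loss normalizes over heads separately for each token, so training does not require tree decoding.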

slide-63
SLIDE 63

Graph-based dependency parsing

■ How to parameterize ψ?
  ■ First order: each scored part is a single edge from a head h to a modifier m.

$$\Psi(y, w; \theta) = \sum_{i \xrightarrow{r} j \in y} \psi(i \xrightarrow{r} j, w, \theta)$$

  ■ Edge-factored (or arc-factored) approaches: the score of a tree decomposes as the sum of its edge scores.
■ How to learn θ?
  ■ Locally normalized: predict each head and its label, softmax, log loss.
  ■ Hinge loss: find the best-scoring tree, and penalize edges not in the gold tree (with a margin).
  ■ Globally normalized CRF: can compute marginals/partition function using a variant of Kirchhoff's Matrix-Tree Theorem [Tutte, 1984; Koo et al. 2007].
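The edge-factored score and the three training objectives listed above can be sketched in a few lines of pure Python. This is a minimal sketch under stated assumptions, not the lecture's implementation: arcs are unlabeled, `scores[h][m]` holds ψ for the edge h → m with index 0 as root, trees are dicts mapping each modifier to its head, the hinge loss takes an already-decoded predicted tree, and the Matrix-Tree construction allows several words to attach to root. All function names and the toy scores are hypothetical.

```python
import math


def tree_score(scores, heads):
    """Edge-factored score: sum of scores[h][m] over the edges h -> m of a tree."""
    return sum(scores[h][m] for m, h in heads.items())


def local_log_loss(scores, gold_heads):
    """Locally normalized: softmax over candidate heads of each word, log loss."""
    n = len(scores)
    loss = 0.0
    for m, gold_h in gold_heads.items():
        candidates = [scores[h][m] for h in range(n) if h != m]
        log_z = math.log(sum(math.exp(s) for s in candidates))
        loss += log_z - scores[gold_h][m]
    return loss


def hinge_loss(scores, gold_heads, pred_heads):
    """Structured hinge: gold must beat the predicted tree by a Hamming-cost margin."""
    cost = sum(1 for m in gold_heads if pred_heads[m] != gold_heads[m])
    margin = tree_score(scores, pred_heads) + cost - tree_score(scores, gold_heads)
    return max(0.0, margin)


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def log_partition(scores):
    """log Z over all dependency trees, via the directed Matrix-Tree Theorem."""
    n = len(scores)  # index 0 is root
    w = [[math.exp(scores[h][m]) if h != m else 0.0 for m in range(n)]
         for h in range(n)]
    # Laplacian over words only (root row and column removed).
    lap = [[0.0] * (n - 1) for _ in range(n - 1)]
    for m in range(1, n):
        lap[m - 1][m - 1] = sum(w[h][m] for h in range(n) if h != m)
        for h in range(1, n):
            if h != m:
                lap[h - 1][m - 1] = -w[h][m]
    return math.log(determinant(lap))


def crf_loss(scores, gold_heads):
    """Globally normalized: negative log-probability of the gold tree."""
    return log_partition(scores) - tree_score(scores, gold_heads)


# Toy sentence with two words; made-up scores.
SCORES = [[0.0, 2.0, 1.0], [0.0, 0.0, 3.0], [0.0, 1.0, 0.0]]
GOLD = {1: 0, 2: 1}
```

For the two-word toy sentence there are exactly three trees, so `log_partition(SCORES)` can be checked by hand against the log of the sum of their exponentiated scores. The determinant is computed directly here for clarity; a real implementation would likely work in a numerically safer way for long sentences.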

slide-66
SLIDE 66

Graph-based vs. transition-based parsing?

■ Transition-based
  ■ Fast
  ■ Greedy / local inference
  ■ Maybe closer to humans?
■ Graph-based
  ■ Slow
  ■ Exact inference
  ■ More accurate

[Figure: transition-based parser schematic (input buffer, stack, oracle, parser, dependency relations) next to a weighted directed graph over "root Book that flight".]
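To make the contrast between local and exact inference concrete, here is a small sketch on the "root Book that flight" example. The edge weights below are hypothetical (the figure's weight-to-edge assignment is not recoverable from the slide text), and the brute-force search over all head assignments is for illustration only; a practical graph-based parser would use a maximum spanning tree algorithm such as Chu-Liu-Edmonds instead.

```python
from itertools import product

ROOT = 0
WORDS = ["root", "Book", "that", "flight"]
NEG = float("-inf")

# SCORE[h][m] is the weight of edge h -> m (hypothetical values).
SCORE = [
    [NEG, 12, 4, 4],    # root   -> Book, that, flight
    [NEG, NEG, 5, 7],   # Book   -> that, flight
    [NEG, 6, NEG, 8],   # that   -> Book, flight
    [NEG, 5, 7, NEG],   # flight -> Book, that
]


def is_tree(heads):
    """heads[m] is the head of word m; True if every word reaches root."""
    for m in range(1, len(heads)):
        seen = set()
        node = m
        while node != ROOT:
            if node in seen:
                return False  # found a cycle
            seen.add(node)
            node = heads[node]
    return True


def tree_score(heads):
    """Edge-factored score of a head assignment."""
    return sum(SCORE[heads[m]][m] for m in range(1, len(heads)))


def greedy_heads():
    """Pick the best head for each word independently (no tree constraint)."""
    n = len(SCORE)
    return [ROOT] + [max(range(n), key=lambda h: SCORE[h][m]) for m in range(1, n)]


def exact_heads():
    """Exhaustively search all head assignments for the best valid tree."""
    n = len(SCORE)
    best, best_score = None, NEG
    for choice in product(range(n), repeat=n - 1):
        heads = [ROOT] + list(choice)
        if is_tree(heads) and tree_score(heads) > best_score:
            best, best_score = heads, tree_score(heads)
    return best, best_score
```

With these weights, the independent per-word choice attaches "that" to "flight" and "flight" to "that", which is a cycle rather than a tree, while the exhaustive search returns root → Book → flight → that. The search visits every head assignment, so its cost grows exponentially with sentence length.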

slide-72
SLIDE 72

Graph-based vs. transition-based parsing?

[Figure: Dependency parsing: accuracy vs. speed. Axes: Accuracy (UAS) and Speed (sentences/sec). Labeled systems:]

■ Graph-based: TurboParser [Martins et al. 2010]; [Zhang & McDonald 2014]
■ Transition-based: MaltParser [Nivre et al. 2009]
■ Structured prediction cascades: [Rush & Petrov 2012; Weiss & Taskar 2010]
■ FF neural nets: [Chen & Manning 2014]
■ Transition-based w/ dynamic feature selection: [Strubell et al. 2015]

slide-75
SLIDE 75

Graph-based vs. transition-based parsing?

[Figure: Dependency parsing: accuracy vs. speed. Axes: Accuracy (UAS), with ticks at 94, 95, 96, and Speed (sentences/sec). Groups shown: graph-based, transition-based, FF neural nets, structured prediction cascades, transition-based w/ dynamic feature selection, and Stacked LSTMs [Dozat and Manning 2017]*.]

*on GPU!!

slide-77
SLIDE 77

Announcements

■ No recitation on Friday (Tartan Community Day).