Transition-based Dependency Parsing with Selectional Branching
Jinho D. Choi, University of Massachusetts Amherst
Presented at the 4th Workshop on Statistical Parsing of Morphologically Rich Languages, October 18th, 2013
Greedy vs. Non-greedy Parsing
- Greedy parsing
- Considers only one head for each token.
- Generates one parse tree per sentence.
- e.g., transition-based parsing (2 ms / sentence).
- Non-greedy parsing
- Considers multiple heads for each token.
- Generates multiple parse trees per sentence.
- e.g., transition-based parsing with beam search, graph-based parsing, linear programming, dual decomposition (≥ 93%).
Motivation
- How often do we need non-greedy parsing?
- Our greedy parser performs as accurately as our non-greedy parser about 64% of the time.
- This gap narrows further when the parsers are evaluated on non-benchmark data (e.g., tweets, chats, blogs).
- Many applications are time sensitive.
- Some applications need at least one complete parse tree ready within a limited time period (e.g., search, dialog, Q/A).
- Hard sentences are hard for any parser!
- Considering more heads does not always guarantee more
accurate parse results.
Transition-based Parsing
- Transition-based dependency parsing (greedy)
- Considers one transition for each parsing state.
[Diagram: the greedy parser follows a single transition sequence S → t1 → … → tL, committing to one transition t′ at each state and producing one tree T.]
What if t′ is not the correct transition?
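The greedy loop above can be sketched as follows. This is a toy illustration: the arc-standard-style state, transition set, and fixed scorer are invented stand-ins for the trained classifier, not the authors' implementation.

```python
# A minimal sketch of greedy transition-based parsing: at each parsing
# state, apply only the single highest-scoring transition, so each
# token receives exactly one head and one tree comes out.

def greedy_parse(state, valid, score, apply_t, is_final):
    """Follow the single best transition at every step."""
    while not is_final(state):
        best = max(valid(state), key=lambda t: score(state, t))
        state = apply_t(state, best)
    return state

# Toy arc-standard-style state: (stack, buffer, arcs).
def valid(state):
    stack, buf, _ = state
    ts = []
    if buf:
        ts.append("SHIFT")
    if len(stack) >= 2:
        ts += ["LEFT-ARC", "RIGHT-ARC"]
    return ts

def apply_t(state, t):
    stack, buf, arcs = state
    if t == "SHIFT":
        return (stack + [buf[0]], buf[1:], arcs)
    if t == "LEFT-ARC":   # head = stack top, dependent = second item
        return (stack[:-2] + [stack[-1]], buf, arcs + [(stack[-1], stack[-2])])
    # RIGHT-ARC: head = second item, dependent = stack top
    return (stack[:-1], buf, arcs + [(stack[-2], stack[-1])])

def is_final(state):
    stack, buf, _ = state
    return not buf and len(stack) <= 1

# A fixed toy scorer standing in for the trained classifier.
def toy_score(state, t):
    return {"SHIFT": 1.0, "LEFT-ARC": 0.5, "RIGHT-ARC": 2.0}[t]

stack, buf, arcs = greedy_parse(([], [0, 1, 2], []), valid, toy_score, apply_t, is_final)
```

If any `best` choice here is wrong, the mistake propagates: every later decision is conditioned on it, which is exactly the failure mode beam search addresses.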
Transition-based Parsing
- Transition-based dependency parsing with beam search
- Considers the b highest-scoring transitions for each parsing state.
[Diagram: beam search expands the b best transitions t′1 … t′b from each state, maintaining b parallel sequences through states S1 … Sb and producing b trees T1 … Tb.]
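Under the same toy framing, beam search replaces the single running sequence with the b highest-scoring partial sequences. The state machine and score table below are invented for illustration, not the authors' parser:

```python
import heapq

# Beam-search decoding sketch: instead of committing to the single best
# transition, keep the b highest-scoring partial transition sequences
# at every step.

def beam_parse(init, valid, score, apply_t, is_final, b):
    beam = [(0.0, init)]                    # (cumulative score, state)
    while not all(is_final(s) for _, s in beam):
        candidates = []
        for total, s in beam:
            if is_final(s):
                candidates.append((total, s))       # finished sequences survive
            else:
                for t in valid(s):
                    candidates.append((total + score(s, t), apply_t(s, t)))
        beam = heapq.nlargest(b, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])[1]

# Toy machine: a state is the string of transitions taken so far;
# scores come from a lookup table keyed by the resulting string.
table = {"a": 1.0, "b": 1.1, "aa": 0.1, "ab": 2.0, "ba": 0.5, "bb": 0.6}
args = dict(valid=lambda s: ["a", "b"],
            score=lambda s, t: table[s + t],
            apply_t=lambda s, t: s + t,
            is_final=lambda s: len(s) == 2)

print(beam_parse("", b=1, **args))  # "bb": greedy commits to "b" early
print(beam_parse("", b=2, **args))  # "ab": the beam recovers the better sequence
```

With b = 1 this reduces to greedy decoding; the score table is built so that the locally best first step is globally wrong, which the b = 2 beam recovers.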
Selectional Branching
- Issues with beam search
- Generates a fixed number of parse trees no matter how easy or hard the input sentence is.
- Is it possible to dynamically adjust the beam size for each individual sentence?
- Selectional branching
- The one-best transition sequence is found by a greedy parser.
- The k-best state-transition pairs are collected for each low-confidence transition used to generate the one-best sequence.
- Transition sequences are generated from the b−1 highest-scoring state-transition pairs in the collection.
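The three steps above can be sketched on the same toy state machine (states are strings of transitions taken so far; the score table, margin test, and k = 2 choice are illustrative, not the authors' code):

```python
# Selectional branching sketch: a greedy pass collects (state, 2nd-best
# transition) pairs at low-confidence decisions (k = 2), then the b−1
# highest-scoring pairs are branched and completed greedily.

def greedy_from(state, first_t, valid, score, apply_t, is_final):
    """Apply first_t, then continue greedily to a final state."""
    total, s = score(state, first_t), apply_t(state, first_t)
    while not is_final(s):
        t = max(valid(s), key=lambda t: score(s, t))
        total, s = total + score(s, t), apply_t(s, t)
    return total, s

def selectional_branching(init, valid, score, apply_t, is_final, margin, b):
    lam, s, total = [], init, 0.0           # lam: the collection λ of branch points
    while not is_final(s):
        ranked = sorted(valid(s), key=lambda t: score(s, t), reverse=True)
        if len(ranked) > 1 and score(s, ranked[0]) - score(s, ranked[1]) <= margin:
            lam.append((score(s, ranked[1]), total, s, ranked[1]))  # k = 2: keep 2nd best
        total, s = total + score(s, ranked[0]), apply_t(s, ranked[0])
    candidates = [(total, s)]               # the one-best sequence
    for _, prefix, state, t in sorted(lam, reverse=True)[: b - 1]:
        branch_total, branch_s = greedy_from(state, t, valid, score, apply_t, is_final)
        candidates.append((prefix + branch_total, branch_s))
    return max(candidates)[1]

# Toy machine: scores are close at the first step, so it is a
# low-confidence decision and gets branched.
table = {"a": 1.0, "b": 1.05, "aa": 0.1, "ab": 2.0, "ba": 0.5, "bb": 0.6}
args = dict(valid=lambda s: ["a", "b"],
            score=lambda s, t: table[s + t],
            apply_t=lambda s, t: s + t,
            is_final=lambda s: len(s) == 2)
```

When no transition is low-confidence, λ stays empty and the parser remains purely greedy; the beam is spent only on sentences that need it.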
[Diagram: while the one-best sequence S1 → t11 … t1L → T is generated, each low-confidence decision contributes its alternative state-transition pairs (S1, t′12) … (S1, t′1k), (S2, t′22) … (S2, t′2k), … to the collection λ. The b−1 pairs with the highest scores are picked; for our experiments, k = 2 is used.]
[Diagram: λ = {(S1, t′12), (S2, t′22), (S3, t′32)}; each pair branches off the one-best sequence and is completed greedily, yielding additional trees Ta, Tb, Tc.]
Carries over parsing states from the one-best sequence. Guaranteed to generate fewer trees than beam search when |λ| ≤ b.
Low Confidence Transition
- Let C1 be a classifier that finds the highest-scoring transition given the parsing state x.
- Let Ck be a classifier that finds the k highest-scoring transitions given the parsing state x and the margin m.
- The highest-scoring transition C1(x) has low confidence if |Ck(x, m)| > 1.
C1(x) = argmax_{y ∈ Y} f(x, y)
f(x, y) = exp(w · Φ(x, y)) / Σ_{y′ ∈ Y} exp(w · Φ(x, y′))
Ck(x, m) = k-argmax_{y ∈ Y} f(x, y)  s.t.  f(x, C1(x)) − f(x, y) ≤ m
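These definitions can be illustrated directly. The raw scores below are made-up stand-ins for w · Φ(x, y); the function names mirror the notation above:

```python
import math

# f(x, ·) is a softmax over raw transition scores; C1 returns the best
# transition; Ck keeps every transition whose probability is within
# margin m of the best; C1(x) has low confidence when |Ck(x, m)| > 1.

def f(scores):
    """Softmax over raw scores, keyed by transition label y."""
    z = sum(math.exp(v) for v in scores.values())
    return {y: math.exp(v) / z for y, v in scores.items()}

def C1(scores):
    p = f(scores)
    return max(p, key=p.get)

def Ck(scores, m):
    p = f(scores)
    best = p[C1(scores)]
    return [y for y in p if best - p[y] <= m]

def low_confidence(scores, m):
    return len(Ck(scores, m)) > 1

# Two transitions score almost equally: a classic low-confidence state.
raw = {"SHIFT": 2.0, "LEFT-ARC": 1.9, "RIGHT-ARC": -1.0}
print(C1(raw))                      # SHIFT
print(low_confidence(raw, m=0.05))  # True: LEFT-ARC falls within the margin
print(low_confidence(raw, m=0.01))  # False: the margin excludes LEFT-ARC
```

The margin m thus controls how eagerly the parser branches: a larger m flags more decisions as low-confidence and spends more of the beam budget.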
Experiments
- Parsing algorithm (Choi & McCallum, 2013)
- Hybrid between Nivre's arc-eager and list-based algorithms.
- Projective parsing: O(n).
- Non-projective parsing: expected linear time.
- Features
- Rich non-local features from Zhang & Nivre, 2011.
- For languages with coarse-grained POS tags, feature templates using fine-grained POS tags are replicated.
- For languages with morphological features, the morphological features of σ[0] and β[0] are used as unigram features.
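The last bullet might look roughly like the following sketch; the token representation, feature-name scheme, and example morphology values are hypothetical, not ClearNLP's actual templates:

```python
# Emit the morphological features of the stack top σ[0] and the buffer
# front β[0] as string-valued unigram features (illustrative naming).

def unigram_morph_features(stack, buffer):
    feats = []
    for name, token in (("s0", stack[-1] if stack else None),
                        ("b0", buffer[0] if buffer else None)):
        if token is not None:
            for m in token.get("morph", ()):
                feats.append(f"{name}:morph={m}")
    return feats

# Made-up tokens with FEATS-style morphology values.
stack = [{"form": "vidí", "morph": ["Case=Acc", "Num=Sing"]}]
buffer = [{"form": "psa", "morph": ["Case=Acc"]}]
print(unigram_morph_features(stack, buffer))
```

Each resulting string becomes one sparse binary feature for the transition classifier.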
Number of Transitions
- Number of transitions performed with respect to beam size.
[Plot: total transitions performed vs. beam size (1, 2, 4, 8, 16, 32, 64, 80); y-axis roughly 200,000 to 1,200,000 transitions.]
Projective Parsing
- The benchmark setup using WSJ.
Approach            UAS     LAS     Time (sec)
bt = 80, bd = 80    92.96   91.93   0.009
bt = 80, bd = 64    92.96   91.93   0.009
bt = 80, bd = 32    92.96   91.94   0.009
bt = 80, bd = 16    92.96   91.94   0.008
bt = 80, bd = 8     92.89   91.87   0.006
bt = 80, bd = 4     92.76   91.76   0.004
bt = 80, bd = 2     92.56   91.54   0.003
bt = 80, bd = 1     92.26   91.25   0.002
bt = 1,  bd = 1     92.06   91.05   0.002
- Comparison with previous approaches on the same WSJ benchmark.

Approach                    UAS     LAS     Time (sec)
bt = 80, bd = 80            92.96   91.93   0.009
Zhang & Clark, 2008         92.1    -       -
Huang & Sagae, 2010         92.1    -       0.04
Zhang & Nivre, 2011         92.9    91.8    0.03
Bohnet & Nivre, 2012        93.38   92.44   0.4
McDonald et al., 2005       90.9    -       -
McDonald & Pereira, 2006    91.5    -       -
Sagae & Lavie, 2006         92.7    -       -
Koo & Collins, 2010         93.04   -       -
Zhang & McDonald, 2012      93.06   91.86   -
Martins et al., 2010        93.26   -       -
Rush et al., 2010           93.8    -       -
Non-projective Parsing
- CoNLL-X shared task data
Approach                     Danish          Dutch           Slovene         Swedish
                             LAS    UAS      LAS    UAS      LAS    UAS      LAS    UAS
bt = 80, bd = 80             87.27  91.36    82.45  85.33    77.46  84.65    86.80  91.36
bt = 80, bd = 1              86.75  91.04    80.75  83.59    75.66  83.29    86.32  91.12
Nivre et al., 2006           84.77  89.80    78.59  81.35    70.30  78.72    84.58  89.50
McDonald et al., 2006        84.79  90.58    79.19  83.57    73.44  83.17    82.55  88.93
Nivre, 2009                  84.20  -        -      -        75.20  -        -      -
F.-Gonz. & G.-Rodr., 2012    85.17  90.10    -      -        -      -        83.55  89.30
Nivre & McDonald, 2008       86.67  -        81.63  -        75.94  -        84.66  -
Martins et al., 2010         -      91.50    -      84.91    -      85.53    -      89.80
SPMRL 2013 Shared Task
- Baseline results provided by ClearNLP.

Language     LAS (5K)  UAS (5K)  LS (5K)   LAS (Full)  UAS (Full)  LS (Full)
Arabic       81.72     84.46     93.41     84.19       86.48       94.43
Basque       78.01     84.62     82.71     79.16       85.32       83.63
French       73.39     85.30     81.42     74.51       86.41       82.00
German       82.58     85.36     90.49     86.73       88.80       92.95
Hebrew       75.09     81.74     82.84     -           -           -
Hungarian    81.98     86.09     88.26     82.68       86.56       88.80
Korean       76.28     80.39     87.32     83.55       86.82       92.39
Polish       80.64     88.49     86.47     81.12       89.24       86.59
Swedish      80.96     86.48     85.10     -           -           -
Conclusion
- Selectional branching
- Uses confidence estimates to decide when to employ a beam.
- Shows accuracy comparable to traditional beam search.
- Runs faster than other non-greedy parsing approaches.
- ClearNLP
- Provides several NLP tools, including a morphological analyzer, a dependency parser, and a semantic role labeler.
- Webpage: clearnlp.com.