Optimizing Spectral Learning for Parsing
Shashi Narayan, Shay Cohen
School of Informatics, University of Edinburgh
ACL, August 2016
Probabilistic CFGs with Latent States (Matsuzaki et al., 2005; Prescher, 2005)

[Figure: the rule S → NP VP refined with latent states, S1 → NP3 VP2; an example tree for “the dog saw the cat” with latent-annotated leaves D1 the, N2 dog, V4 saw, D1 the, N4 cat]
◮ Latent states are analogous to syntactic heads as in lexicalization (Charniak, 1997)
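A latent-variable PCFG scores a skeletal tree by summing over all latent-state assignments. Below is a minimal sketch of that marginalization via the inside recursion; the parameters are made-up toy constants, not estimates from the paper:

```python
from collections import defaultdict

M = 2  # latent states per nonterminal (toy value)

# Skeletal tree for "the dog saw the cat":
tree = ("S",
        ("NP", ("D", "the"), ("N", "dog")),
        ("VP", ("V", "saw"), ("NP", ("D", "the"), ("N", "cat"))))

# Toy parameters (constants, for illustration only):
# q_lex[(tag, h, word)]       = p(word | tag[h])
# q_bin[(A, h, B, h1, C, h2)] = p(A[h] -> B[h1] C[h2])
q_lex = defaultdict(lambda: 0.5)
q_bin = defaultdict(lambda: 0.25)
pi = {h: 1.0 / M for h in range(M)}  # root latent-state distribution

def inside(node):
    """Return {latent state h: inside probability of this subtree}."""
    label = node[0]
    if isinstance(node[1], str):  # preterminal: (tag, word)
        return {h: q_lex[(label, h, node[1])] for h in range(M)}
    left, right = inside(node[1]), inside(node[2])
    return {h: sum(q_bin[(label, h, node[1][0], h1, node[2][0], h2)]
                   * left[h1] * right[h2]
                   for h1 in range(M) for h2 in range(M))
            for h in range(M)}

# Marginal probability of the skeletal tree: sum over root latent states.
p_tree = sum(pi[h] * ph for h, ph in inside(tree).items())
print(p_tree)  # 0.03125 with these toy constants
```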
◮ Petrov et al. (2006) demonstrated that coarse-to-fine split-merge training benefits from allocating a different number of latent states to each nonterminal
[Figure: example subtrees (S (NP (D the) (N dog)) VP) and (VP (V saw) (P him))]
= SVD Step

◮ Averaging with SVD parameters ⇒ Dense estimates
◮ Feature vectors are sparse binary indicators, e.g. (1, 1, 0, 1, . . .)

[Figure: the skeletal tree (S (NP (D w0) (N w1)) (VP (V w2) (N w3))) and its latent-annotated version (S[1] (NP[4] (D[7] w0) (N[4] w1)) (VP[3] (V[1] w2) (N[1] w3)))]
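The sparse indicator vectors like (1, 1, 0, 1, . . .) become dense once projected with SVD-derived parameters. A sketch of that step, where a random matrix stands in for the real SVD projection and the feature names are illustrative, not the paper's exact feature set:

```python
import numpy as np

# Illustrative feature index over tree-fragment/pair features.
feature_index = {("VP", "V"): 0, ("VP", "NP"): 1,
                 "VP -> V PP": 2, "VP -> V NP": 3}

def indicator(firing):
    """Sparse binary vector like (1, 1, 0, 1, ...) over the index."""
    v = np.zeros(len(feature_index))
    for f in firing:
        v[feature_index[f]] = 1.0
    return v

phi = indicator([("VP", "V"), ("VP", "NP"), "VP -> V NP"])
print(phi)  # [1. 1. 0. 1.]

# Projecting with SVD parameters yields a dense low-dimensional vector.
# A stand-in random matrix plays the role of the SVD projection here.
rng = np.random.default_rng(0)
U = rng.standard_normal((len(feature_index), 2))  # |features| x m
dense = phi @ U  # dense m-dimensional representation
```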
◮ An input treebank divided into training and development sets
◮ A basic spectral estimation algorithm S that maps an assignment f of latent-state counts to nonterminals to a grammar
◮ fdef : {S → 24, NNP → 24, VP → 24, DT → 24, . . .} (the default: the same count for every nonterminal)
◮ fopt : {S → 40, NNP → 81, VP → 35, DT → 4, . . .} (the optimized assignment)
◮ Iterate through the nonterminals, changing the number of latent states for one nonterminal at a time:
  ◮ estimate the grammar on the training set, and
  ◮ optimize the accuracy on the dev set
[Figure (animated): the search sweeps over nonterminals (DT, S, NP, . . .), trying different numbers of latent states for each]
[Bar chart (built up over the slide): F-measures for three estimation methods — Berkeley (Petrov et al. ’06): 75.50; clustering (Narayan and Cohen ’15): 71.40, rising to 75.20 after optimization; moments (Cohen et al. ’13): 73.40, rising to 75.50 after optimization]
[Bar chart: F-measures — Berkeley (Petrov et al. ’06): 80.60; clustering (Narayan and Cohen ’15): 76.40, rising to 79.40 after optimization; moments (Cohen et al. ’13): 78.40, rising to 80.90 after optimization]
[Bar chart: F-measures on eight SPMRL languages (legend: Berkeley, Spectral, Optimized) — Basque 74.7 / 81.4, French 80.4 / 79.1, German 78.3 / 78.2, Hebrew 87.0 / 89.0, Hungarian 85.2 / 89.2, Korean 78.6 / 80.0, Polish 86.8 / 91.8, Swedish 80.6 / 80.9]

◮ Berkeley results are taken from Björkelund et al. (2013)
(S (NP (D the) (N dog)) (VP (V saw) (NP (D the) (N cat))))

Inside features for the VP node:
◮ The pairs (VP, V) and (VP, NP)
◮ The rule VP → V NP
◮ The tree fragment (VP (V saw) NP)
◮ The tree fragment (VP V (NP D N))
◮ The pair of head part-of-speech tag with VP: (VP, V)
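The inside features listed above can be read off a node mechanically. A small sketch (the feature encodings are illustrative renderings of the bullets, not the paper's exact representation; the left child is taken as a stand-in for the head child):

```python
def inside_features(node):
    """Inside features of a binary node (label, left_child, right_child).

    Leaves are (tag, word) pairs. Returns a set of hashable features:
    parent-child pairs, the CFG rule, shallow tree fragments, and a
    (parent, head-tag) pair, with the left child as a stand-in head.
    """
    label, left, right = node
    feats = {
        (label, left[0]), (label, right[0]),  # pairs (VP, V), (VP, NP)
        f"{label} -> {left[0]} {right[0]}",   # the rule VP -> V NP
    }
    if isinstance(left[1], str):              # left child is a preterminal
        feats.add((label, (left[0], left[1]), right[0]))  # (VP (V saw) NP)
    if not isinstance(right[1], str):         # right child is internal
        feats.add((label, left[0],
                   (right[0], right[1][0], right[2][0])))  # (VP V (NP D N))
    feats.add(("head", label, left[0]))       # head-tag pair (VP, V)
    return feats

vp = ("VP", ("V", "saw"), ("NP", ("D", "the"), ("N", "cat")))
fs = inside_features(vp)
```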
(S (NP (D the) (N dog)) (VP (V saw) (NP (D the) (N cat))))

Outside features for the D node:
◮ The pairs (D, NP) and (D, NP, VP)
◮ The pair of head part-of-speech tag with D: (D, N)
◮ The tree fragments (NP D* N), (VP V (NP D* N)), and (S NP (VP V (NP D* N)))
◮ SVD variants: singular value decomposition of empirical cross-covariance matrices of inside and outside features (Cohen et al., 2013)
◮ Convex EM variant: an “anchor method” that identifies features indicative of a single latent state
◮ Clustering variant: a simplified version of the SVD variant based on clustering (Narayan and Cohen, 2015)
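The SVD variant can be sketched with stand-in feature matrices; in the real estimator, Φ and Ψ would hold the sparse inside and outside feature vectors collected from treebank nodes, and the random matrices below only imitate them:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_in, d_out, m = 500, 40, 30, 8  # samples, feature dims, latent states

# Stand-in sparse binary feature matrices: rows are node occurrences,
# columns are inside (Phi) / outside (Psi) features.
Phi = (rng.random((n, d_in)) < 0.1).astype(float)
Psi = (rng.random((n, d_out)) < 0.1).astype(float)

# Empirical cross-covariance of inside and outside features.
Omega = Phi.T @ Psi / n

# Truncated SVD; the top-m singular vectors give the projections.
U, S, Vt = np.linalg.svd(Omega, full_matrices=False)
U_m, V_m = U[:, :m], Vt[:m, :].T

# Dense m-dimensional representations of each node occurrence.
Y = Phi @ U_m  # projected inside representations
Z = Psi @ V_m  # projected outside representations
print(Y.shape, Z.shape)
```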
◮ Initialization: push (n0, fdef, Fdef) → Q
  ◮ n0 : the first nonterminal
  ◮ fdef : {S → 24, NNP → 24, VP → 24, DT → 24, . . .}
  ◮ Fdef : the F1 score on the development set
◮ Iteration: pop (ni, fi, Fi) ← Q
  ◮ For each number of latent states l ∈ {1, . . . , m}:
    ◮ define f′i with f′i(ni) = l and f′i(n) = fi(n) for all other n,
    ◮ estimate a new F′i score on the development set, and
    ◮ push (ni+1, f′i, F′i) → Q
◮ Termination: pop (nfin+1, fopt, Ffin) ← Q
  ◮ fopt : {S → 40, NNP → 81, VP → 35, DT → 4, . . .}
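The search over per-nonterminal state counts can be sketched end to end. Here `dev_f1` is a hypothetical stand-in for “estimate the grammar and score the dev set”, with a made-up optimum per nonterminal, and the beam is simplified to size 1 (the talk's search keeps a queue of partial assignments):

```python
def optimize_states(nonterminals, f_def, candidates, dev_f1):
    """Greedy per-nonterminal search over numbers of latent states.

    For each nonterminal in turn, try every candidate count, keep the
    best-scoring assignment on the dev set, and move on.
    """
    f, best = dict(f_def), dev_f1(f_def)
    for n in nonterminals:
        for l in candidates:
            trial = dict(f)
            trial[n] = l
            score = dev_f1(trial)
            if score > best:
                f, best = trial, score
    return f, best

# Stand-in scorer: pretends each nonterminal has a known best count.
target = {"S": 40, "NNP": 81, "VP": 35, "DT": 4}
def dev_f1(f):
    return -sum(abs(f[n] - t) for n, t in target.items())

f_def = {n: 24 for n in target}
f_opt, score = optimize_states(["S", "NNP", "VP", "DT"],
                               f_def, range(1, 101), dev_f1)
print(f_opt)  # recovers {'S': 40, 'NNP': 81, 'VP': 35, 'DT': 4}
```

Because the stand-in scorer is separable across nonterminals, one greedy sweep already recovers the optimum; with a real dev-set F1 the coordinate interactions are what motivate the queue.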