
Optimizing Spectral Learning for Parsing, Shashi Narayan and Shay Cohen (PowerPoint presentation transcript)



1. Optimizing Spectral Learning for Parsing
Shashi Narayan, Shay Cohen
School of Informatics, University of Edinburgh
ACL, August 2016

2. Probabilistic CFGs with Latent States (Matsuzaki et al., 2005; Prescher, 2005)
[Figure: the parse tree of "the dog saw the cat" (S → NP VP, NP → D N, VP → V NP), shown once as a plain tree and once with latent-state annotations such as S_1, NP_3, VP_2, D_1, N_2, V_4, NP_5.]
Latent states play the role of nonterminal subcategorization, e.g., NP → { NP_1, NP_2, ..., NP_24 }
◮ analogous to syntactic heads as in lexicalization (Charniak, 1997)
◮ they are not part of the observed data in the treebank
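To make the latent-annotation idea concrete, here is a minimal sketch of one way such a grammar can be represented; the rules and probabilities below are invented for illustration and are not taken from the paper.

```python
# A toy representation of latent-annotated PCFG rules. The observed
# rule NP -> D N is split into variants over latent states; all
# probabilities here are invented for illustration only.
from collections import defaultdict

rules = {
    ("NP_1", ("D_1", "N_2")): 0.7,
    ("NP_1", ("D_2", "N_1")): 0.3,
    ("NP_2", ("D_1", "N_1")): 1.0,
}

# Rule probabilities must sum to 1 over each annotated left-hand side.
totals = defaultdict(float)
for (lhs, _), p in rules.items():
    totals[lhs] += p
assert all(abs(t - 1.0) < 1e-9 for t in totals.values())
```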

3. Estimating PCFGs with Latent States (L-PCFGs)
EM algorithm (Matsuzaki et al., 2005; Petrov et al., 2006)
⇓ suffers from local maxima: because it is not guaranteed to find the global maximum of the log-likelihood, it fails to provide certain theoretical guarantees
Spectral algorithms (Cohen et al., 2012, 2014)
⇑ statistically consistent algorithms that make use of spectral decomposition
⇑ much faster training than the EM algorithm
⇓ have lagged behind EM in their empirical results

4. Overview
Builds on the work on spectral algorithms for latent-state PCFGs (L-PCFGs) for parsing (Cohen et al., 2012, 2014; Cohen and Collins, 2014; Narayan and Cohen, 2015)
Conventional approach: the number of latent states for each nonterminal in an L-PCFG is decided in isolation
Contributions:
A. Parsing results improve significantly if the number of latent states for each nonterminal is globally optimized
◮ Petrov et al. (2006) demonstrated that coarse-to-fine techniques that carefully select the number of latent states improve accuracy
B. The optimized spectral method beats coarse-to-fine expectation-maximization techniques on 6 (Basque, Hebrew, Hungarian, Korean, Polish and Swedish) out of 8 SPMRL datasets

5. Intuition behind the Spectral Algorithm
Inside and outside trees: for the sentence "the dog saw him", at the VP node:
◮ the outside tree o is everything outside the VP node (S → NP VP, with NP → D N spanning "the dog")
◮ the inside tree t is the subtree rooted at VP (VP → V P, spanning "saw him")
The two are conditionally independent given the label and the hidden state:
p(o, t | VP, h) = p(o | VP, h) × p(t | VP, h)
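This independence is what moment-based estimation exploits: the cross-covariance of inside and outside features factors through the latent state, so its rank is bounded by the number of latent states. A sketch of that step, where ψ and φ denote inside and outside feature functions; the notation loosely follows Cohen et al. (2012) and is a paraphrase, not the paper's exact statement:

```latex
% Conditional independence of the inside tree t and outside tree o
% given the nonterminal a and its latent state h:
%   p(o, t | a, h) = p(o | a, h) p(t | a, h).
% Hence the inside/outside cross-covariance at a factors through h:
\Omega^{a} \;=\; \mathbb{E}\!\left[\, \psi(t)\, \varphi(o)^{\top} \,\middle|\, a \,\right]
          \;=\; \sum_{h} p(h \mid a)\;
                \mathbb{E}[\psi(t) \mid a, h]\;
                \mathbb{E}[\varphi(o) \mid a, h]^{\top}
% The rank of Omega^a is at most the number of latent states of a,
% which is why an SVD of its empirical estimate drives the estimation.
```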

6. Recent Advances in Spectral Estimation
The SVD step: a singular value decomposition (SVD) of the cross-covariance matrix for each nonterminal
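A minimal numpy sketch of this SVD step; the feature dimensions, the random placeholder data and the choice of m are illustrative assumptions:

```python
# Sketch of the SVD step: build an empirical inside/outside
# cross-covariance matrix for one nonterminal and decompose it.
# Dimensions and random data are placeholders for real feature counts.
import numpy as np

rng = np.random.default_rng(0)
d_inside, d_outside, n_samples = 50, 60, 1000

# psi[i], phi[i]: inside/outside feature vectors at the i-th
# occurrence of this nonterminal in the treebank.
psi = rng.random((n_samples, d_inside))
phi = rng.random((n_samples, d_outside))

# Empirical cross-covariance matrix, Omega ~ E[psi(t) phi(o)^T | a].
omega = psi.T @ phi / n_samples

# Thin SVD; the top-m singular vectors give the projections used to
# reduce inside/outside features to an m-dimensional space.
U, s, Vt = np.linalg.svd(omega, full_matrices=False)
m = 8  # number of latent states chosen for this nonterminal
proj_inside, proj_outside = U[:, :m], Vt[:m, :].T
```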

7. Recent Advances in Spectral Estimation
Two families of estimators build on the SVD step:
Method of moments (Cohen et al., 2012, 2014)
◮ averaging with SVD parameters ⇒ dense estimates
Clustering variant (Narayan and Cohen, 2015)
[Figure: a parse tree over words w0 ... w3 whose nodes carry cluster identifiers, e.g., S[1], NP[4], VP[3], D[7], N[4], V[1], N[1]; each node's low-dimensional representation, e.g., (1, 1, 0, 1, ...), is mapped to a cluster ID that acts as its latent state.]
⇒ sparse estimates
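A rough sketch of the clustering idea using scikit-learn's KMeans; the projection, the dimensions and the hard-counting comment simplify the actual recipe of Narayan and Cohen (2015):

```python
# Sketch of the clustering variant: reduce each node's inside feature
# vector with the SVD projection, cluster, and use cluster IDs as
# latent states. Details are simplified for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
psi = rng.random((1000, 50))   # inside feature vectors for one nonterminal
proj = rng.random((50, 8))     # stand-in for the SVD projection U[:, :m]

low_dim = psi @ proj           # low-dimensional node representations
states = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(low_dim)

# Each treebank node now has a hard latent-state assignment; rule
# probabilities can then be estimated by relative-frequency counting
# over the annotated rules, which yields sparse parameter estimates.
print(np.bincount(states))
```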

8. Standard Spectral Estimation and the Number of Latent States
⇑ a natural way to choose the number of latent states: the number of non-zero singular values
⇑ the number of latent states for each nonterminal in an L-PCFG can be decided in isolation
⇓ this conventional approach fails to take into account interactions between different nonterminals
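In code, the per-nonterminal choice is a short function of the singular-value spectrum; the relative tolerance below is an assumed hyperparameter, not a value from the paper:

```python
# Choose the number of latent states for one nonterminal by counting
# the singular values of its cross-covariance matrix that are
# significantly non-zero (relative to the largest one).
import numpy as np

def num_latent_states(omega: np.ndarray, rel_tol: float = 1e-3) -> int:
    s = np.linalg.svd(omega, compute_uv=False)  # descending order
    return int(np.sum(s > rel_tol * s[0]))
```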

9. Optimizing Latent States for Various Nonterminals
Input:
◮ a treebank divided into a training set and a development set
◮ a basic spectral estimation algorithm S
◮ a default map from each nonterminal to a fixed number of latent states, f_def: { S → 24, NNP → 24, VP → 24, DT → 24, ... }
Output:
◮ an optimized map f_opt: { S → 40, NNP → 81, VP → 35, DT → 4, ... }

10. Optimizing Latent States for Various Nonterminals
The algorithm in a nutshell:
◮ iterate through the nonterminals, changing the number of latent states,
◮ estimate the grammar on the training set, and
◮ optimize parsing accuracy on the development set
It is a beam search over multidimensional vectors of latent-state counts, optimizing their global interaction (see the sketch below)
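A condensed sketch of this search; `score` is a hypothetical stand-in that trains a spectral grammar on the training set and returns development-set F1, and the move set and beam width are illustrative choices, not the paper's exact settings:

```python
# Sketch of the beam search over per-nonterminal latent-state counts.
from typing import Callable, Dict

def optimize_states(
    f_def: Dict[str, int],
    score: Callable[[Dict[str, int]], float],  # trains grammar, returns dev F1
    candidate_counts=(4, 8, 16, 24, 32, 40),
    beam_width: int = 8,
    iterations: int = 5,
) -> Dict[str, int]:
    beam = [(score(f_def), f_def)]
    for _ in range(iterations):
        candidates = list(beam)
        for _, f in beam:
            for nt in f:                      # vary one nonterminal at a time
                for m in candidate_counts:
                    if m != f[nt]:
                        g = {**f, nt: m}
                        candidates.append((score(g), g))
        # keep the best maps; interactions between nonterminals are
        # optimized globally because whole vectors compete on dev F1
        beam = sorted(candidates, key=lambda c: -c[0])[:beam_width]
    return max(beam, key=lambda c: c[0])[1]
```

In the full system every call to `score` retrains the grammar and reparses the development set, so the speed of the clustering variant (noted on the next slide) is what makes this search practical.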

11. Optimizing Latent States for Various Nonterminals
[Figure: the search in progress. At time 0 every nonterminal (DT, S, NP, ...) starts at 24 latent states (f_def, with development score F1_def). At time t the beam holds N candidate maps f_m1, ..., f_mN with scores F1_m1, ..., F1_mN, each a small perturbation of its parent, e.g., DT lowered to 4 and S raised to 37 while the other nonterminals stay at 24.]
The clustering variant of spectral estimation leads to compact models and is relatively fast

12. Experiments: the SPMRL Dataset
8 morphologically rich languages: Basque, French, German, Hebrew, Hungarian, Korean, Polish and Swedish
Treebanks of varying sizes, from 5,000 sentences (Hebrew and Swedish) to 40,472 sentences (German)

13. Results on the Swedish Dataset
[Bar chart: F-measures on the development set. Systems: berkeley (Petrov et al., 2006 / Björkelund et al., 2013), moments (Cohen et al., 2013) and cluster (Narayan and Cohen, 2015), before and after optimizing the latent states; the reported values are 71.40, 73.40, 75.20, 75.50 and 75.50.]

14. Results on the Swedish Dataset
[Bar chart: final F-measures on the test set for the same systems; the reported values are 76.40, 78.40, 79.40, 80.60 and 80.90.]

15. Final Results on the SPMRL Dataset
[Bar chart: test-set F-measures of the Berkeley parser vs. the optimized spectral model on all eight languages (Basque, French, German, Hebrew, Hungarian, Korean, Polish, Swedish); per-language scores range from 74.7 to 91.8.]
◮ Berkeley results are taken from Björkelund et al., 2013

16. Conclusion
Spectral parsing results improve significantly if the number of latent states for each nonterminal is globally optimized
The optimized spectral algorithm beats the coarse-to-fine EM algorithm on 6 (Basque, Hebrew, Hungarian, Korean, Polish and Swedish) out of 8 SPMRL datasets
The Rainbow parser and multilingual models: http://cohort.inf.ed.ac.uk/lpcfg/
Acknowledgments: thanks to David McClosky, Eugene Charniak, DK Choe, Geoff Gordon, Djamé Seddah, Thomas Müller, Anders Björkelund and the anonymous reviewers

17. Inside Features Used
Consider the VP node in the parse tree of "the cat saw the dog" (S → NP VP; NP → D N over "the cat"; VP → V NP over "saw the dog"). The inside features consist of:
◮ the pairs (VP, V) and (VP, NP)
◮ the rule VP → V NP
◮ the tree fragment (VP (V saw) NP)
◮ the tree fragment (VP V (NP D N))
◮ the pair of the head part-of-speech tag with VP: (VP, V)
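A small sketch of extracting the simpler inside features above with nltk's `Tree`; the feature encoding is an illustrative assumption, and the tree-fragment and head features are omitted for brevity:

```python
# Extract node/child pairs and the local rule at a node as inside
# features, in the spirit of the list above.
from nltk import Tree

def inside_features(node: Tree) -> list:
    kids = [k.label() if isinstance(k, Tree) else k for k in node]
    feats = [(node.label(), k) for k in kids]   # (VP, V), (VP, NP)
    feats.append((node.label(), tuple(kids)))   # the rule VP -> V NP
    return feats

vp = Tree.fromstring("(VP (V saw) (NP (D the) (N dog)))")
print(inside_features(vp))  # [('VP', 'V'), ('VP', 'NP'), ('VP', ('V', 'NP'))]
```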

18. Outside Features Used
Consider the D node over the second "the" in the parse tree of "the cat saw the dog". The outside features consist of:
◮ the pairs (D, NP) and (D, NP, VP)
◮ the pair of the head part-of-speech tag with D: (D, N)
◮ tree fragments of increasing size around the marked node D*: the local (NP D* N), the dominating (VP V (NP D* N)), and the spine up to S
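A companion sketch for the ancestor-chain outside features, using nltk's `ParentedTree` so each node knows its parent; the feature encoding is again an illustrative assumption:

```python
# Collect chains of ancestor labels above a node, mirroring the
# (D, NP) and (D, NP, VP) outside features above.
from nltk.tree import ParentedTree

def outside_label_chains(node: ParentedTree, max_depth: int = 2) -> list:
    feats, labels, parent = [], [node.label()], node.parent()
    while parent is not None and len(labels) <= max_depth:
        labels.append(parent.label())
        feats.append(tuple(labels))   # (D, NP), then (D, NP, VP), ...
        parent = parent.parent()
    return feats

t = ParentedTree.fromstring(
    "(S (NP (D the) (N cat)) (VP (V saw) (NP (D the) (N dog))))")
d_node = t[1][1][0]                   # the D node over the second "the"
print(outside_label_chains(d_node))   # [('D', 'NP'), ('D', 'NP', 'VP')]
```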

19. Variants of Spectral Estimation
◮ SVD variants: singular value decomposition of empirical count matrices (cross-covariance matrices) to estimate grammar parameters (Cohen et al., 2012, 2014)
◮ Convex EM variant: an "anchor method" that identifies features that uniquely identify latent states (Cohen and Collins, 2014)
◮ Clustering variant: a simplified version of the SVD variant that clusters low-dimensional representations into latent states (Narayan and Cohen, 2015); intuitive to understand and very efficient computationally
