Optimizing Spectral Learning for Parsing Shashi Narayan, Shay Cohen - PowerPoint PPT Presentation

Optimizing Spectral Learning for Parsing Shashi Narayan, Shay Cohen School of Informatics, University of Edinburgh ACL, August 2016 1 / 1

Probabilistic CFGs with Latent States (Matsuzaki et al., 2005; Prescher 2005) S S 1 NP 3 VP 2 NP VP D 1 N 2 V 4 NP 5 D N V NP ⇒ D 1 N 4 the dog saw D N the dog saw the cat the cat Latent states play the role of nonterminal subcategorization, e.g., NP → { NP 1 , NP 2 , . . . , NP 24 } ◮ analogous to syntactic heads as in lexicalization (Charniak 1997) ? They are not part of the observed data in the treebank 2 / 1

Estimating PCFGs with Latent States (L-PCFGs) EM Algorithm (Matsuzaki et al., 2005; Petrov et al., 2006) ⇓ Problems with local maxima ; it fails to provide certain type of theoretical guarantees as it doesn’t find global maximum of the log-likelihood 3 / 1

Estimating PCFGs with Latent States (L-PCFGs) EM Algorithm (Matsuzaki et al., 2005; Petrov et al., 2006) ⇓ Problems with local maxima ; it fails to provide certain type of theoretical guarantees as it doesn’t find global maximum of the log-likelihood Spectral Algorithm (Cohen et al., 2012, 2014) ⇑ Statistically consistent algorithms that make use of spectral decomposition ⇑ Much faster training than the EM algorithm 3 / 1

Estimating PCFGs with Latent States (L-PCFGs) EM Algorithm (Matsuzaki et al., 2005; Petrov et al., 2006) ⇓ Problems with local maxima ; it fails to provide certain type of theoretical guarantees as it doesn’t find global maximum of the log-likelihood Spectral Algorithm (Cohen et al., 2012, 2014) ⇑ Statistically consistent algorithms that make use of spectral decomposition ⇑ Much faster training than the EM algorithm ⇓ Lagged behind in their empirical results 3 / 1

Overview Builds on the work on the spectral algorithm for Latent-state PCFGs (L-PCFGs) for parsing (Cohen et al., 2012, 2014, Cohen and Collins, 2014, Narayan and Cohen 2015) Conventional approach: Number of latent states for each nonterminal in an L-PCFG can be decided in isolation 4 / 1

Overview Builds on the work on the spectral algorithm for Latent-state PCFGs (L-PCFGs) for parsing (Cohen et al., 2012, 2014, Cohen and Collins, 2014, Narayan and Cohen 2015) Conventional approach: Number of latent states for each nonterminal in an L-PCFG can be decided in isolation Contributions: A . Parsing results significantly improve if the number of latent states for each nonterminal is globally optimized ◮ Petrov et al. (2006) demonstrated that coarse-to-fine techniques that carefully select the number of latent states improve accuracy. 4 / 1

Overview Builds on the work on the spectral algorithm for Latent-state PCFGs (L-PCFGs) for parsing (Cohen et al., 2012, 2014, Cohen and Collins, 2014, Narayan and Cohen 2015) Conventional approach: Number of latent states for each nonterminal in an L-PCFG can be decided in isolation Contributions: B . Optimized spectral method beats coarse-to-fine expectation-maximization techniques on 6 ( Basque , Hebrew , Hungarian , Korean , Polish and Swedish ) out of 8 SPMRL datasets 4 / 1

Intuition behind the Spectral Algorithm Inside and outside trees At node VP: Outside tree o = S S NP VP D N NP VP the dog D N V P Inside tree t = VP V P the dog saw him saw him Conditionally independent given the label and the hidden state p ( o , t | VP , h ) = p ( o | VP , h ) × p ( t | VP , h ) 5 / 1

Recent Advances in Spectral Estimation = Singular value decomposition (SVD) of cross-covariance matrix for each nonterminal 6 / 1

Recent Advances in Spectral Estimation = SVD Step Method of moments (Cohen et al., 2012, 2014) ◮ Averaging with SVD parameters ⇒ Dense estimates 7 / 1

Recent Advances in Spectral Estimation = SVD Step Method of moments (Cohen et al., 2012, 2014) ◮ Averaging with SVD parameters ⇒ Dense estimates Clustering variants (Narayan and Cohen 2015) S [ 1 ] S ( 1 , 1 , 0 , 1 , . . . ) NP [ 4 ] VP [ 3 ] NP VP D [ 7 ] N [ 4 ] V [ 1 ] N [ 1 ] D N V N w 0 w 1 w 2 w 3 w 0 w 1 w 2 w 3 Sparse estimates 7 / 1

Standard Spectral Estimation and Number of Latent States = ⇑ A natural way to choose the number of latent states based on the number of non-zero singular values ⇑ Number of latent states for each nonterminal in an L-PCFG can be decided in isolation ⇓ Conventional approach fails to take into account interactions between different nonterminals 8 / 1

Optimizing Latent States for Various Nonterminals Input: ◮ An input treebank divided into training and development set ◮ A basic spectral estimation algorithm S mapping each nonterminal to a fixed number of latent states ◮ f def : { S → 24, NNP → 24, VP → 24, DT → 24, . . . } Output: ◮ f opt : { S → 40, NNP → 81, VP → 35, DT → 4, . . . } 9 / 1

Optimizing Latent States for Various Nonterminals Algorithm in a nutshell ◮ Iterate through the nonterminals, changing the number of latent states, ◮ estimate the grammar on the training set and ◮ optimize the accuracy on the dev set A beam search algorithm for the traversal of multidimensional vectors of latent states: Optimizing their global interaction 10 / 1

Optimizing Latent States for Various Nonterminals DT S NP f def : , F1 def 24 24 24 24 24 24 time:0 11 / 1

Optimizing Latent States for Various Nonterminals DT S NP , F1 def f def : 4 37 24 24 24 24 DT S NP m 1 f m 1 : m 1 , F1 m 1 4 37 24 24 24 DT S NP m 2 f m 2 : m 2 , F1 m 2 4 37 24 24 24 time:t DT S NP m N f m N : m N , F1 m N 4 37 24 24 24 11 / 1

Optimizing Latent States for Various Nonterminals DT S NP , F1 def f def : 4 37 24 24 24 24 DT S NP m 1 f m 1 : m 1 , F1 m 1 4 37 24 24 24 DT S NP m 2 f m 2 : m 2 , F1 m 2 4 37 24 24 24 time:t DT S NP m N f m N : m N , F1 m N 4 37 24 24 24 Clustering variant of spectral estimation leads to compact models and is relatively fast 11 / 1

Experiments The SPMRL Dataset 8 morphologically rich languages: Basque, French, German, Hebrew, Hungarian, Korean, Polish and Swedish Treebanks of varying sizes from 5,000 sentences (Hebrew and Swedish) to 40,472 sentences (German) 12 / 1

Results on the Swedish dataset Results on the dev set 85 80 75.50 73.40 71.40 75 F-Measures 70 65 60 55 50 berkeley cluster moments Petrov et al.’06 Narayan and Cohen’15 Cohen et al.’13 Bj¨ orkelund et al.’13 13 / 1

Results on the Swedish dataset Results on the dev set 85 80 75.50 75.20 73.40 71.40 75 F-Measures 70 65 60 55 50 berkeley cluster moments Petrov et al.’06 Narayan and Cohen’15 Cohen et al.’13 Bj¨ orkelund et al.’13 13 / 1

Results on the Swedish dataset Results on the dev set 85 80 75.50 75.50 75.20 73.40 71.40 75 F-Measures 70 65 60 55 50 berkeley cluster moments Petrov et al.’06 Narayan and Cohen’15 Cohen et al.’13 Bj¨ orkelund et al.’13 13 / 1

Results on the Swedish dataset Final results on the test set 90 80.90 80.60 85 79.40 78.40 76.40 80 75 F-Measures 70 65 60 55 50 berkeley cluster moments Petrov et al.’06 Narayan and Cohen’15 Cohen et al.’13 Bj¨ orkelund et al.’13 14 / 1

Final Results on the SPMRL Dataset 100 Berkeley 91.8 Spectral Optimized 89.2 89.0 87.0 86.8 90 85.2 81.4 80.9 80.6 80.4 80.0 79.1 78.6 78.3 78.2 80 74.7 F-Measures 70 60 50 Basque French German Hebrew Hungarian Korean Polish Swedish ◮ Berkeley results are taken from Bj¨ orkelund et al, 2013 . 15 / 1

Conclusion Spectral parsing results significantly improve if the number of latent states for each nonterminal is globally optimized Optimized spectral algorithm beats coarse-to-fine EM algorithm for 6 ( Basque , Hebrew , Hungarian , Korean , Polish and Swedish ) out of 8 SPMRL datasets The Rainbow parser and multilingual models : http://cohort.inf.ed.ac.uk/lpcfg/ Acknowledgments: Thanks to David McClosky, Eugene Charniak, DK Choe, Geoff Gordon, Djam´ e Seddah, Thomas M¨ uller, Anders Bj¨ orkelund and anonymous reviewers 16 / 1

Inside Features used Consider the VP node in the following tree: S NP VP D N V NP D N the cat saw the dog The inside features consist of: ◮ The pairs (VP, V) and (VP, NP) ◮ The rule VP → V NP ◮ The tree fragment (VP (V saw) NP) ◮ The tree fragment (VP V (NP D N)) ◮ The pair of head part-of-speech tag with VP: (VP, V) 17 / 1

Outside Features used Consider the D node in the following tree: S NP VP D N V NP the cat saw D N the dog The outside features consist of: ◮ The pairs (D, NP) and (D, NP, VP) ◮ The pair of head part-of-speech tag with D: (D, N) ◮ The tree fragments , and NP VP S D* N V NP NP VP D* N V NP D* N 18 / 1

Variants of Spectral Estimation ◮ SVD variants: singular value decomposition of empirical count matrices (cross-covariance matrices) to estimate grammar parameters (Cohen et. al. 2012, 2014) ◮ Convex EM variant: “anchor method” that identifies features that uniquely identify latent states (Cohen and Collins, 2014) ◮ Clustering variant: a simplified version of the SVD variant that clusters low-dimensional representations to latent states (Narayan and Cohen, 2015) Intuitive-to-understand and very (computationally) efficient 19 / 1

Optimizing Spectral Learning for Parsing Shashi Narayan, Shay Cohen - PowerPoint PPT Presentation

Optimizing Spectral Learning for Parsing Shashi Narayan, Shay Cohen School of Informatics, University of Edinburgh ACL, August 2016 1 / 1 Probabilistic CFGs with Latent States (Matsuzaki et al., 2005; Prescher 2005) S S 1 NP 3 VP 2 NP VP D 1

Introduction to Bottom-Up Parsing Shift-reduce parsing The LR parsing algorithm

CSC 4181 Compiler Construction Parsing 1 1 Outline Top-down v.s. Bottom-up Top-down parsing

Spectral Clustering Spectral Clustering? Spectral methods Methods using eigenvectors of

Robust Incremental Neural Semantic Graph Parsing Jan Buys and Phil Blunsom Dependency Parsing vs

Basic Parsing Algorithms Chart Parsing Seminar Recent Advances in Parsing Technology WS

An Introduction to Spectral Learning Hanxiao Liu November 8, 2013 An Introduction to Spectral

Optimizing monitoring networks for Optimizing monitoring networks for Optimizing monitoring

Dependency Parsing & Feature-based Parsing Ling571 Deep Processing Techniques for NLP

Models of Human Parsing Experimental Data 2 Informatics 2A: Lecture 22 Eye-tracking Reading

Outline LR Parsing Review of bottom-up parsing LALR Parser Generators Computing the

Graph-Based Parsing Joakim Nivre Uppsala University Department of Linguistics and Philology

Dependency Parsing II CMSC 470 Marine Carpuat Graph-based Dependency Parsing Slides credit:

Generalised Parsing and Combinator Parsing A Happy Marriage? L. Thomas van Binsbergen

Parsing as Deduction Joseph K uhner March 24, 2007 Joseph K uhner Parsing as Deduction

Bottom-up parsing LR parsing Construct parse tree for input from leaves up LR( k ) parsing

Compilers Shift-Reduce Parsing Alex Aiken Shift-Reduce Parsing Important Fact #1 about

Fast Frame-Based Scene Change Detection in the Compressed Domain for MPEG-4 Video Jens Brandt,

The superstring n-point 1-loop amplitude Carlos R. Mafra (With Oliver Schlotterer, arXiv:1812. {

The LambrechtsStanley Model of Configuration Spaces Najib Idrissi October 13th, 2016 Najib

srstr t rs

Rhetoric or Reality Wale Adediran, HR Director FMN Group Business Management Department,

Symmetry Reflection Symmetry Symmetric 2D Antisymmetric 2D vector field vector field u

Interprocedural Analysis: Sharir-Pnuelis Call-strings Approach Deepak DSouza Department of

Superstring amplitudes with the pure spinor formalism Carlos R. Mafra STAG Research Centre and

Optimizing Spectral Learning for Parsing Shashi Narayan, Shay Cohen - PowerPoint PPT Presentation

Optimizing Spectral Learning for Parsing Shashi Narayan, Shay Cohen School of Informatics, University of Edinburgh ACL, August 2016 1 / 1 Probabilistic CFGs with Latent States (Matsuzaki et al., 2005; Prescher 2005) S S 1 NP 3 VP 2 NP VP D 1

Introduction to Bottom-Up Parsing Shift-reduce parsing The LR parsing algorithm

CSC 4181 Compiler Construction Parsing 1 1 Outline Top-down v.s. Bottom-up Top-down parsing

Spectral Clustering Spectral Clustering? Spectral methods Methods using eigenvectors of

Robust Incremental Neural Semantic Graph Parsing Jan Buys and Phil Blunsom Dependency Parsing vs

Basic Parsing Algorithms Chart Parsing Seminar Recent Advances in Parsing Technology WS

An Introduction to Spectral Learning Hanxiao Liu November 8, 2013 An Introduction to Spectral

Optimizing monitoring networks for Optimizing monitoring networks for Optimizing monitoring

Dependency Parsing &amp; Feature-based Parsing Ling571 Deep Processing Techniques for NLP

Models of Human Parsing Experimental Data 2 Informatics 2A: Lecture 22 Eye-tracking Reading

Outline LR Parsing Review of bottom-up parsing LALR Parser Generators Computing the

Graph-Based Parsing Joakim Nivre Uppsala University Department of Linguistics and Philology

Dependency Parsing II CMSC 470 Marine Carpuat Graph-based Dependency Parsing Slides credit:

Generalised Parsing and Combinator Parsing A Happy Marriage? L. Thomas van Binsbergen

Parsing as Deduction Joseph K uhner March 24, 2007 Joseph K uhner Parsing as Deduction

Bottom-up parsing LR parsing Construct parse tree for input from leaves up LR( k ) parsing

Compilers Shift-Reduce Parsing Alex Aiken Shift-Reduce Parsing Important Fact #1 about

Fast Frame-Based Scene Change Detection in the Compressed Domain for MPEG-4 Video Jens Brandt,

The superstring n-point 1-loop amplitude Carlos R. Mafra (With Oliver Schlotterer, arXiv:1812. {

The LambrechtsStanley Model of Configuration Spaces Najib Idrissi October 13th, 2016 Najib

srstr t rs

Rhetoric or Reality Wale Adediran, HR Director FMN Group Business Management Department,

Symmetry Reflection Symmetry Symmetric 2D Antisymmetric 2D vector field vector field u

Interprocedural Analysis: Sharir-Pnuelis Call-strings Approach Deepak DSouza Department of

Superstring amplitudes with the pure spinor formalism Carlos R. Mafra STAG Research Centre and

Dependency Parsing & Feature-based Parsing Ling571 Deep Processing Techniques for NLP