Sparsity in Dependency Grammar Induction


  1. Sparsity in Dependency Grammar Induction. Jennifer Gillenwater (1), Kuzman Ganchev (1), João Graça (2), Ben Taskar (1), Fernando Pereira (3). (1) Computer & Information Science, University of Pennsylvania. (2) L2F INESC-ID, Lisboa, Portugal. (3) Google, Inc. July 12, 2010.

  2. Outline
  - A generative dependency parsing model
  - The ambiguity problem this model faces
  - Previous attempts to reduce ambiguity
  - How posteriors provide a good measure of ambiguity
  - Applying posterior regularization to the likelihood objective
  - Success with respect to EM and parameter prior baselines

  3. Dependency model with valence (Klein and Manning, ACL 2004)
  Example: x = "Regularization creates sparse grammars", tagged N V ADJ N, with dependency tree y rooted at the verb.
  p_θ(x, y) = θ_root(V) · θ_stop(nostop | V, right, false) · θ_child(N | V, right) · θ_stop(stop | V, right, true) · θ_stop(nostop | V, left, false) · θ_child(N | V, left) · ...
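To make the factorization concrete, here is a minimal Python sketch (our illustration, not the authors' code) that multiplies out the factors shown above. The dict layout and all probability values are invented for the example.

```python
# Hypothetical DMV parameters; keys mirror the factors on the slide,
# and the probability values are made up for illustration.
theta = {
    "root":  {"V": 0.5},
    "stop":  {("nostop", "V", "right", False): 0.7,
              ("stop",   "V", "right", True):  0.8,
              ("nostop", "V", "left",  False): 0.6},
    "child": {("N", "V", "right"): 0.4,
              ("N", "V", "left"):  0.4},
}

# First factors of p_theta(x, y) for "Regularization creates sparse grammars"
# (tags N V ADJ N) with the verb as root, as expanded on the slide.
p = (theta["root"]["V"]
     * theta["stop"][("nostop", "V", "right", False)]
     * theta["child"][("N", "V", "right")]
     * theta["stop"][("stop", "V", "right", True)]
     * theta["stop"][("nostop", "V", "left", False)]
     * theta["child"][("N", "V", "left")])
# ... the remaining stop/child factors for the N and ADJ subtrees are
# omitted here, matching the trailing "..." on the slide.
print(p)  # 0.02688 for these made-up values
```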

  4. Traditional objective optimization
  Traditional objective: marginal log-likelihood
    max_θ L(θ) = E_X[log p_θ(x)] = E_X[log Σ_y p_θ(x, y)]
  Optimization method: expectation maximization (EM)
  Problem: EM may learn a very ambiguous grammar, with too many non-zero probabilities. Ex: V → N should have non-zero probability, but V → DET, V → JJ, V → PRP$, etc. should be 0.
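For context (a standard derivation, not from the slides): EM is coordinate ascent on a lower bound F(q, θ) ≤ L(θ), and this is exactly the view that the posterior-regularization slides below modify.

```latex
\begin{align*}
F(q,\theta) &= L(\theta) - \mathrm{KL}\big(q(y \mid x) \,\|\, p_\theta(y \mid x)\big) \;\le\; L(\theta) \\
\text{E-step:}\quad q^{t} &= \arg\min_{q} \, \mathrm{KL}\big(q \,\|\, p_{\theta^{t}}\big) \;=\; p_{\theta^{t}}(y \mid x) \\
\text{M-step:}\quad \theta^{t+1} &= \arg\max_{\theta} \, \mathbb{E}_{X}\,\mathbb{E}_{q^{t}}\big[\log p_\theta(x,y)\big]
\end{align*}
```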

  5. Previous approaches to improving performance
  - Structural annealing [1]
  - L(θ′): model extension [2]
  - L(θ) + log p(θ): parameter regularization [3]
  These approaches tend to reduce the number of unique children per parent, rather than directly reducing the number of unique parent-child pairs: θ_child(Y | X, dir) ≠ posterior(X → Y).
  [1] Smith and Eisner, ACL 2006. [2] Headden et al., NAACL 2009. [3] Liang et al., EMNLP 2007; Johnson et al., NIPS 2007; Cohen et al., NIPS 2008, NAACL 2009.

  6. Ambiguity measure using posteriors: L1/∞
  Intuition: the true number of unique parent tags for each child tag is small.
  [Figure: a grid with one row per parent → child tag pair (ADJ → ADJ, ADJ → N, ADJ → V, N → ADJ, N → N, N → V, V → ADJ, V → N, V → V) and one 0/1 column per edge in the example sentences "Sparsity is working" and "Use good grammars". Taking the max across columns for each row gives 0, 1, 0, 0, 1, 0, 1, 0, 0; summing the maxes gives the L1/∞ value, sum = 3.]
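A minimal NumPy sketch of this measure (our illustration, not the authors' code): rows index parent → child tag pairs, columns index edge instances, and the value is the sum of per-row maxima. The grid entries below are hypothetical but chosen to reproduce the slide's row maxes.

```python
import numpy as np

def l1_inf(edge_usage):
    """L1/inf ambiguity: max over edge instances (columns) for each
    parent->child tag pair (row), summed over pairs. For 0/1 gold-tree
    indicators this counts the distinct tag pairs the trees use."""
    return np.asarray(edge_usage, dtype=float).max(axis=1).sum()

# Hypothetical 0/1 grid with one row per tag pair (ADJ->ADJ, ADJ->N, ...,
# V->V) and one column per edge in the two example sentences; the entries
# are invented, but the row maxes 0 1 0 0 1 0 1 0 0 match the slide.
gold = np.array([
    [0, 0, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
])
print(l1_inf(gold))  # 3.0
```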

  7. Measuring ambiguity on distributions over trees
  For a distribution p_θ(y | x) instead of gold trees, each grid entry is a posterior edge probability rather than a 0/1 indicator.
  [Figure: the same grid filled with fractional values from alternative parses of the example sentences, weighted 0.4/0.6 and 0.7/0.3. The row maxes are 0, 1, 0.3, 0.4, 0.6, 0, 0.4, 0.6, 0; summing gives L1/∞ = 3.3.]
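The l1_inf function from the sketch above applies unchanged when the entries are posterior edge marginals instead of 0/1 indicators; reading the row maxes directly off the slide's soft grid:

```python
import numpy as np

# Row maxes from the slide's soft grid (one per parent->child tag pair).
row_maxes = np.array([0, 1, .3, .4, .6, 0, .4, .6, 0])
print(row_maxes.sum())  # 3.3: posterior mass spreads over more tag pairs
                        # than the gold trees' value of 3, i.e. more ambiguity
```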

  8. Minimizing ambiguity through posterior regularization
  Standard E-step:
    q^t(y | x) = argmin_q KL(q ‖ p_{θ^t})
  Apply an E-step penalty, L1/∞ on the posteriors q(y | x), to induce sparsity (Graça et al., NIPS 2007 & 2009):
    q^t(y | x) = argmin_q KL(q ‖ p_{θ^t}) + σ · L1/∞(q(y | x))
  [Figure: tables of posterior edge probabilities q(root → x_i) and q(x_i → x_j) for a sentence tagged D N V N, before and after the penalty.]
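The slides do not show how the penalized E-step is solved (Graça et al. optimize a dual with gradient methods). Below is a toy primal sketch, under the simplifying assumption that q factorizes into an independent distribution over candidate tag pairs for each edge, so the convex objective can be minimized by exponentiated subgradient descent. All names and values are ours.

```python
import numpy as np

def pr_e_step(p, sigma, lr=0.1, steps=2000):
    """Toy penalized E-step:
        argmin_q  sum_e KL(q_e || p_e) + sigma * sum_r max_e q[e, r]
    q, p have shape (num_edges, num_tag_pairs); each row of p is a strictly
    positive distribution over candidate tag pairs for one edge."""
    q = p.copy()
    for _ in range(steps):
        kl_grad = np.log(q / p) + 1.0                 # gradient of the KL term
        pen = np.zeros_like(q)                        # subgradient of the
        pen[q.argmax(axis=0), np.arange(q.shape[1])] = sigma  # sum-of-maxes term
        q = q * np.exp(-lr * (kl_grad + pen))         # multiplicative update
        q /= q.sum(axis=1, keepdims=True)             # keep rows on the simplex
    return q

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(3), size=4)  # 4 edges, 3 candidate tag pairs (toy)
q = pr_e_step(p, sigma=2.0)
print(p.max(axis=0).sum())  # L1/inf of the unpenalized posteriors
print(q.max(axis=0).sum())  # smaller: edges now agree on fewer tag pairs
```

The multiplicative update keeps every iterate strictly positive and on the simplex, which is why it is preferred here over Euclidean projection for a KL-shaped objective.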

  9. Experimental results
  English from Penn Treebank: state-of-the-art accuracy.

    Learning Method            Accuracy ≤10    ≤20    all
    PR (σ = 140)               62.1            53.8   49.1
    LN families                59.3            45.1   39.0
    SLN TieV & N               61.3            47.4   41.4
    PR (σ = 140, λ = 1/3)      64.4            55.2   50.5
    DD (α = 1, λ learned)      65.0 (± 5.7)    —      —

  11 other languages from CoNLL-X:
  - Dirichlet prior baseline: 1.5% average gain over EM
  - Posterior regularization: 6.5% average gain over EM
  Come see the poster for more details.
