Neural Grammar Induction

  1. Neural Grammar Induction. Yoon Kim, Harvard University (with Chris Dyer and Alexander Rush).

  2. Language has Hierarchical Structure

  3. Evidence from Neuroscience (Ding et al. 2015)

  4. Goals. Grammar induction / unsupervised parsing: learn a parsing system from observed sentences alone. (from https://nlp.stanford.edu/projects/project-induction.shtml)

  5. Longstanding Problem in AI/NLP. Children can do it without explicit supervision on trees. Implications for the “poverty of the stimulus” argument. Many domains/languages lack annotated trees.

  6. Progress in Supervised Parsing (F1 on the WSJ Penn Treebank).
     Non-neural models: Collins (1997) 87.8; Charniak (1999) 89.6; Petrov and Klein (2007) 90.1; McClosky et al. (2006) 92.1.
     Neural models: Dyer et al. (2016) 93.3; Fried et al. (2017) 94.7; Kitaev and Klein (2019) 95.8.

  7. Progress in Unsupervised Parsing. Initial work: gold part-of-speech tags, short sentences of length up to 10. Recent work: directly on words, train/evaluate on the full corpus [Shen et al. 2018, 2019; Jin et al. 2018a,b; Drozdov et al. 2019; Shi et al. 2019]. Still a very hard problem...

  8. This Talk: Neural Grammar Induction. Grammar induction with a PCFG + neural parameterization works! Approximate more flexible grammars with vectors + VAEs. Learn structure-aware generative models with the induced trees.

  9. Context-Free Grammars (CFG). A set of recursive production rules used to generate the strings of a formal language. For example, S → aSb, S → ab generates a^n b^n. Given a string, we can efficiently obtain the underlying parse tree.
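
A minimal sketch (not from the slides; the function name and depth bound are invented for illustration) showing that the two rules S → aSb and S → ab generate exactly the strings a^n b^n:

```python
# Enumerate strings derivable from S with the toy CFG  S -> a S b | a b,
# using at most `depth` expansions of S.

def generate(depth):
    if depth == 0:
        return set()
    shorter = generate(depth - 1)
    results = {"ab"}                              # S -> a b
    results |= {"a" + s + "b" for s in shorter}   # S -> a S b
    return results

print(sorted(generate(3), key=len))  # ['ab', 'aabb', 'aaabbb'], i.e. a^n b^n for n = 1..3
```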

  10. Context-Free Grammars for Natural Language

  11. Context-Free Grammars (CFG): Formal Description. G = (S, N, P, Σ, R), where
      N: set of nonterminals (constituent labels)
      P: set of preterminals (part-of-speech tags)
      Σ: set of terminals (words)
      S: start symbol
      R: set of rules
      Each rule r ∈ R has one of the following forms:
      S → A        A ∈ N
      A → B C      A ∈ N, B, C ∈ N ∪ P
      T → w        T ∈ P, w ∈ Σ
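
As a concrete illustration of the formal description above, here is one possible Python encoding of such a grammar, using the symbol names from the example later in the deck; the container names and the validity check are hypothetical, not part of the slides:

```python
# A grammar G = (S, N, P, Sigma, R) in the restricted form above, encoded as plain sets.

N = {"A1", "A3"}                       # nonterminals (constituent labels)
P = {"T2", "T4", "T7"}                 # preterminals (part-of-speech tags)
SIGMA = {"Jon", "knows", "nothing"}    # terminals (words); the start symbol is "S"

START_RULES = {("S", "A1")}                              # S -> A,   A in N
BINARY_RULES = {("A1", "T4", "A3"), ("A3", "T2", "T7")}  # A -> B C, A in N, B, C in N u P
LEXICAL_RULES = {("T4", "Jon"), ("T2", "knows"), ("T7", "nothing")}  # T -> w, T in P, w in Sigma

def is_valid_rule(rule):
    """Check that a rule has one of the three forms allowed by the grammar."""
    if len(rule) == 2 and rule[0] == "S":
        return rule[1] in N
    if len(rule) == 3:
        return rule[0] in N and rule[1] in (N | P) and rule[2] in (N | P)
    if len(rule) == 2:
        return rule[0] in P and rule[1] in SIGMA
    return False

assert all(is_valid_rule(r) for r in START_RULES | BINARY_RULES | LEXICAL_RULES)
```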

  12. Probabilistic Context-Free Grammars (PCFG). Associate a probability π_r with each rule r ∈ R. This gives rise to a distribution over parse trees: the probability of a tree t is the product of the probabilities of the rules used in its derivation,
      p_π(t) = ∏_{r ∈ t_R} π_r,
      where t_R is the set of rules used to derive t.

  13. PCFG Example (A_i: nonterminals, T_j: preterminals). One parse tree t for the sentence “Jon knows nothing” uses the rules
      t_R = { S → A1, A1 → T4 A3, A3 → T2 T7, T4 → Jon, T2 → knows, T7 → nothing },
      so its probability is
      p_π(t) = π_{S → A1} × π_{A1 → T4 A3} × π_{A3 → T2 T7} × π_{T4 → Jon} × π_{T2 → knows} × π_{T7 → nothing}.
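
A small sketch of the computation in the example above; the numeric rule probabilities are made up purely for illustration:

```python
# p_pi(t) for the example tree of "Jon knows nothing": multiply the probabilities
# of the rules in t_R. Rules are written as (left-hand side, right-hand side).

t_R = [                              # rules used in the derivation of t
    ("S", ("A1",)),
    ("A1", ("T4", "A3")),
    ("A3", ("T2", "T7")),
    ("T4", ("Jon",)),
    ("T2", ("knows",)),
    ("T7", ("nothing",)),
]

pi = {                               # made-up rule probabilities pi_r
    ("S", ("A1",)): 0.5,
    ("A1", ("T4", "A3")): 0.3,
    ("A3", ("T2", "T7")): 0.4,
    ("T4", ("Jon",)): 0.1,
    ("T2", ("knows",)): 0.2,
    ("T7", ("nothing",)): 0.2,
}

p_t = 1.0
for r in t_R:                        # p_pi(t) = product of pi_r over r in t_R
    p_t *= pi[r]
print(p_t)                           # 0.5*0.3*0.4*0.1*0.2*0.2 = 2.4e-4 (up to float rounding)
```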

  14. Grammar Induction with PCFGs. Specify the broad structure of the grammar (number of nonterminals, preterminals, etc.), then maximize the log likelihood with respect to the learnable parameters: given a corpus of sentences x^(1), ..., x^(N),
      max_π Σ_{n=1}^{N} log p_π(x^(n)).
      This requires summing out the unobserved trees,
      p_π(x) = Σ_{t ∈ T(x)} p_π(t),
      where T(x) is the set of trees whose leaves are x.
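
A brute-force sketch of the marginal p_π(x) = Σ_{t ∈ T(x)} p_π(t) for the tiny example grammar; the rule probabilities are the same made-up values as above and the helper names are hypothetical. Memoizing the recursion over (symbol, span) is essentially the dynamic program (the inside algorithm) used in practice:

```python
# Sum the probabilities of all parse trees whose leaves are the sentence x.
from functools import lru_cache

start = {"A1": 0.5}                                            # S -> A
binary = {("A1", "T4", "A3"): 0.3, ("A3", "T2", "T7"): 0.4}    # A -> B C
lexical = {("T4", "Jon"): 0.1, ("T2", "knows"): 0.2, ("T7", "nothing"): 0.2}  # T -> w

@lru_cache(maxsize=None)
def total(symbol, words):
    """Total probability that `symbol` derives exactly the word sequence `words`."""
    if len(words) == 1:
        return lexical.get((symbol, words[0]), 0.0)
    p = 0.0
    for (a, b, c), prob in binary.items():
        if a != symbol:
            continue
        for split in range(1, len(words)):                     # all binary split points
            p += prob * total(b, words[:split]) * total(c, words[split:])
    return p

x = ("Jon", "knows", "nothing")
p_x = sum(start[a] * total(a, x) for a in start)               # sum over all trees in T(x)
print(p_x)   # ~2.4e-4: only one tree has nonzero probability under this toy grammar
```

For real sentences the number of trees grows exponentially with length, which is why the dynamic-programming version on the training slide below is used in practice.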

  15. Grammar Induction with PCFGs. F1 against gold trees annotated by linguists (English Penn Treebank):
      Model             F1
      Random Trees      19.5
      Right Branching   39.5
      PCFG              < 35.0
      Neural PCFG       52.6

  16. Grammar Induction with PCFGs. Long history of work showing that MLE with PCFGs fails to discover linguistically meaningful tree structures [Lari and Young 1990; Carroll and Charniak 1992]. Why? Potentially due to the hardness of the optimization problem (non-convex) and overly strict independence assumptions (first-order context-freeness).

  17. Prior Work on Grammar Induction with PCFGs. Driven by the conventional wisdom that “MLE with PCFGs doesn’t work”: modified objectives [Klein and Manning 2002, 2004; Smith and Eisner 2004]; priors/nonparametric models [Liang et al. 2007; Johnson et al. 2007]; handcrafted features [Huang et al. 2012; Golland et al. 2012]; other types of regularization, e.g. on recursion depth [Noji et al. 2016; Jin et al. 2018b].

  18. A Different Parameterization.
      Scalar parameterization: associate a probability π_r with each rule r so that the rules form valid probability distributions, e.g. π_{T → w} ≥ 0 and Σ_{w′ ∈ Σ} π_{T → w′} = 1.
      Neural parameterization: associate a symbol embedding w_N with each symbol N on the left-hand side of a rule; rule probabilities are given by a neural net over w_N, e.g.
      π_{T → w} = NeuralNet(w_T) = exp(u_w^⊤ f(w_T)) / Σ_{w′ ∈ Σ} exp(u_{w′}^⊤ f(w_T)).
      (Similar parameterizations for A → B C.)
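
A sketch of the neural parameterization of the terminal rules T → w, under assumed shapes (an 8-dimensional embedding, a one-layer tanh net standing in for f); none of this is the exact architecture from the talk:

```python
# pi_{T -> w} = softmax over words of u_w^T f(w_T), with a shared net f.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # embedding dimension (assumed)
vocab = ["Jon", "knows", "nothing"]    # terminals (Sigma)

w_T = rng.normal(size=d)               # input embedding of preterminal T
U = rng.normal(size=(len(vocab), d))   # output embeddings u_w, one row per word
W = rng.normal(size=(d, d))            # parameters of the shared net f

def f(x):
    return np.tanh(W @ x)              # a one-layer net as a stand-in for f

scores = U @ f(w_T)                    # u_w^T f(w_T) for every word w in Sigma
pi_T = np.exp(scores - scores.max())
pi_T /= pi_T.sum()                     # softmax: pi_{T -> w} sums to 1 over Sigma
print(dict(zip(vocab, pi_T.round(3))))
```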

  19. Neural PCFG. π_{T → w} ∝ exp(u_w^⊤ f(w_T)), where w_T is the input (symbol) embedding, u_w is the output embedding, and f is a shared neural net. Model parameters θ are given by the input embeddings, the output embeddings, and the parameters of f. Analogous to count-based vs. neural language models: parameter sharing through distributed representations (word embeddings vs. symbol embeddings). Same independence assumptions (i.e. context-freeness), just a different parameterization.

  20. Neural PCFG: Training. Same dynamic programming algorithm for marginalization. Just perform stochastic gradient ascent on the log marginal likelihood, computed with the inside algorithm + autodiff:
      θ_new = θ_old + λ ∇_θ log p_θ(x),
      where log p_θ(x) is computed by the inside algorithm.
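
A sketch of the “inside algorithm + autodiff” recipe in PyTorch. For brevity the rule scores are free parameters rather than the outputs of networks over symbol embeddings, and all shapes, names, and hyperparameters are illustrative assumptions:

```python
import torch

NT, PT, V = 4, 6, 100            # nonterminals, preterminals, vocabulary size (assumed)
S = NT + PT                      # binary-rule children range over N u P
x = torch.randint(V, (5,))       # a toy "sentence" of word ids

root = torch.randn(NT, requires_grad=True)          # scores for S -> A
binary = torch.randn(NT, S, S, requires_grad=True)  # scores for A -> B C
emit = torch.randn(PT, V, requires_grad=True)       # scores for T -> w
opt = torch.optim.SGD([root, binary, emit], lr=0.1)

def log_marginal(x):
    """log p(x) via the inside algorithm; rule scores are normalized with log-softmax."""
    log_root = root.log_softmax(-1)
    log_bin = binary.view(NT, -1).log_softmax(-1).view(NT, S, S)
    log_emit = emit.log_softmax(-1)
    n = x.shape[0]
    # beta[i][j][k] = log inside score of symbol k (indexing N u P) over the span x[i:j]
    beta = [[torch.full((S,), -1e9) for _ in range(n + 1)] for _ in range(n + 1)]
    for i in range(n):                                   # width-1 spans: preterminals
        beta[i][i + 1] = torch.cat([torch.full((NT,), -1e9), log_emit[:, x[i]]])
    for width in range(2, n + 1):                        # wider spans: binary rules
        for i in range(n - width + 1):
            j = i + width
            scores = [log_bin + beta[i][k][None, :, None] + beta[k][j][None, None, :]
                      for k in range(i + 1, j)]          # A -> B C with split point k
            inside_A = torch.logsumexp(torch.stack(scores).view(len(scores), NT, -1), dim=(0, 2))
            beta[i][j] = torch.cat([inside_A, torch.full((PT,), -1e9)])
    return torch.logsumexp(log_root + beta[0][n][:NT], dim=0)   # sum over S -> A

loss = -log_marginal(x)          # ascent on log p_theta(x) == descent on -log p_theta(x)
loss.backward()                  # autodiff back through the inside chart
opt.step()                       # theta_new = theta_old + lr * grad of log p_theta(x)
```

In the actual neural PCFG, root, binary, and emit would be produced by shared networks over symbol embeddings rather than stored as free tensors, but the inside recursion and the gradient step stay the same.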

  21. Neural PCFG: Results (English Penn Treebank).
      Model             F1
      Random Trees      19.5
      Right Branching   39.5
      Scalar PCFG       < 35.0
      Neural PCFG       52.6

  22. Neural PCFG: Results.
      Model             F1        Training/Test PPL
      Random Trees      19.5      −
      Right Branching   39.5      −
      Scalar PCFG       < 35.0    > 350
      Neural PCFG       52.6      ≈ 250

  23. Grammar Induction with PCFGs: Issues. Long history of work showing that MLE with PCFGs fails to discover linguistically meaningful tree structures [Lari and Young 1990; Carroll and Charniak 1992]. Why? Potentially due to (1) the hardness of the optimization problem (non-convex), where the neural parameterization + SGD makes optimization easier (for some reason), and (2) overly strict independence assumptions (first-order context-freeness).
