SLIDE 1

Neural Grammar Induction

Yoon Kim

Harvard University (with Chris Dyer, Alexander Rush)

SLIDE 2

Language has Hierarchical Structure

SLIDE 3

Evidence from Neuroscience (Ding et al. 2015)

SLIDE 4

Evidence from Neuroscience (Ding et al. 2015)

SLIDE 5

Evidence from Neuroscience (Ding et al. 2015)

SLIDE 6

Goals
Grammar induction / unsupervised parsing: learn a parsing system from observed sentences alone.

(from https://nlp.stanford.edu/projects/project-induction.shtml)

SLIDE 7

A Longstanding Problem in AI/NLP
- Children can do it without explicit supervision on trees.
- Implications for the "poverty of the stimulus" argument.
- Many domains/languages lack annotated trees.

SLIDE 8

Progress in Supervised Parsing (F1 on the WSJ Penn Treebank)

Model                      F1
Non-Neural Models
  Collins (1997)           87.8
  Charniak (1999)          89.6
  Petrov and Klein (2007)  90.1
  McClosky et al. (2006)   92.1
Neural Models
  Dyer et al. (2016)       93.3
  Fried et al. (2017)      94.7
  Kitaev and Klein (2019)  95.8

SLIDE 9

Progress in Unsupervised Parsing
- Initial work: gold part-of-speech tags, short sentences of length up to 10.
- Recent work: directly on words; train/evaluate on the full corpus [Shen et al. 2018, 2019; Jin et al. 2018a,b; Drozdov et al. 2019; Shi et al. 2019].
- Still a very hard problem...

SLIDE 10

This Talk: Neural Grammar Induction
- Grammar induction with PCFG + neural parameterization works!
- Approximate more flexible grammars with vectors + VAEs.
- Learn structure-aware generative models with the induced trees.

SLIDE 11

Context-Free Grammars (CFG)
- A set of recursive production rules used to generate the strings of a formal language.
- Example: S → aSb, S → ab generates aⁿbⁿ (see the sketch below).
- Given a string, we can efficiently obtain the underlying parse tree.
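
As a concrete illustration (my own toy code, not from the talk), the two-rule grammar above can be unwound directly:

```python
# Toy illustration: the CFG with rules S -> aSb and S -> ab generates a^n b^n.

def generate(n: int) -> str:
    """Apply S -> aSb (n-1) times, then S -> ab, yielding a^n b^n."""
    return "a" * n + "b" * n

def recognize(s: str) -> bool:
    """Membership in {a^n b^n : n >= 1}, unwinding the two rules recursively."""
    if s.startswith("a") and s.endswith("b"):
        inner = s[1:-1]
        return inner == "" or recognize(inner)   # S -> ab   or   S -> aSb
    return False

assert generate(3) == "aaabbb"
assert recognize("aaabbb") and not recognize("aabbb")
```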

SLIDE 12

Context-Free Grammars for Natural Language

SLIDE 13

Context-Free Grammars (CFG): Formal Description
G = (S, N, P, Σ, R) where
  N : set of nonterminals (constituent labels)
  P : set of preterminals (part-of-speech tags)
  Σ : set of terminals (words)
  S : start symbol
  R : set of rules
Each rule r ∈ R is one of the following:
  S → A        A ∈ N
  A → B C      A ∈ N;  B, C ∈ N ∪ P
  T → w        T ∈ P;  w ∈ Σ
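
A minimal sketch (types and names are my own, not the talk's) of this 5-tuple as a data structure, restricted to the three rule shapes above:

```python
# A minimal sketch of G = (S, N, P, Sigma, R); names and types are my own.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Rule:
    lhs: str          # S, a nonterminal in N, or a preterminal in P
    rhs: tuple        # ("A",) for S -> A; ("B", "C") for A -> B C; ("w",) for T -> w

@dataclass
class CFG:
    start: str                     # S
    nonterminals: frozenset        # N (constituent labels)
    preterminals: frozenset        # P (part-of-speech tags)
    terminals: frozenset           # Sigma (words)
    rules: set = field(default_factory=set)   # R

    def rule_is_valid(self, r: Rule) -> bool:
        """Check that r matches one of the three allowed rule shapes."""
        if r.lhs == self.start:                                    # S -> A
            return len(r.rhs) == 1 and r.rhs[0] in self.nonterminals
        if r.lhs in self.nonterminals:                             # A -> B C
            return len(r.rhs) == 2 and all(
                s in self.nonterminals | self.preterminals for s in r.rhs)
        if r.lhs in self.preterminals:                             # T -> w
            return len(r.rhs) == 1 and r.rhs[0] in self.terminals
        return False
```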


SLIDE 15

Probabilistic Context-Free Grammars (PCFG)
- Associate a probability πr with each rule r ∈ R; this gives rise to a distribution over parse trees.
- The probability of a tree t is the product of the probabilities of the rules used in its derivation:

    pπ(t) = ∏_{r ∈ t_R} πr,   where t_R is the set of rules used to derive t
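
A minimal sketch of this product, computed in log space to avoid underflow; the rule probabilities below are made up for illustration:

```python
import math

def tree_log_prob(t_rules, pi):
    """log p_pi(t): sum of log pi_r over the rules r in t_R."""
    return sum(math.log(pi[r]) for r in t_rules)

# Rules of the example tree on the next slide, with made-up probabilities:
pi = {("S", ("A1",)): 0.9, ("A1", ("T4", "A3")): 0.2, ("A3", ("T2", "T7")): 0.3,
      ("T4", ("Jon",)): 0.01, ("T2", ("knows",)): 0.02, ("T7", ("nothing",)): 0.01}
t_R = list(pi)                            # here every rule is used exactly once
print(math.exp(tree_log_prob(t_R, pi)))  # p_pi(t) ≈ 1.08e-07
```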

SLIDE 16

PCFG Example

[Tree: S → A1; A1 → T4 A3; A3 → T2 T7; leaves: Jon knows nothing]

Ai: nonterminals, Tj: preterminals
t_R = {S → A1, A1 → T4 A3, A3 → T2 T7, T4 → Jon, T2 → knows, T7 → nothing}
pπ(t) = πS→A1 × πA1→T4 A3 × πA3→T2 T7 × πT4→Jon × πT2→knows × πT7→nothing


SLIDE 18

Grammar Induction with PCFGs
- Specify the broad structure of the grammar (number of nonterminals, preterminals, etc.).
- Maximize the log likelihood with respect to the learnable parameters: given a corpus of sentences x(1), ..., x(N),

    max_π  Σ_{n=1}^{N} log pπ(x(n))

- This requires summing out the unobserved trees:

    pπ(x) = Σ_{t ∈ T(x)} pπ(t),   where T(x) is the set of trees whose leaves are x
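
The sum over T(x) looks exponential, but it factors over spans; below is a minimal sketch (my own code, not the talk's) of the O(n³) inside algorithm for the three rule shapes defined earlier. The function name and dict-based rule table are illustrative assumptions.

```python
# Inside algorithm sketch for rules S -> A, A -> B C, T -> w.
from collections import defaultdict

def inside(words, pi, nonterminals, preterminals, start="S"):
    """Return p_pi(x) = sum over all trees t in T(x) of p_pi(t), in O(n^3)."""
    n = len(words)
    beta = defaultdict(float)  # beta[i, j, A] = total prob of A spanning words[i..j]
    for i, w in enumerate(words):                        # leaves: T -> w
        for T in preterminals:
            beta[i, i, T] = pi.get((T, (w,)), 0.0)
    symbols = nonterminals | preterminals
    for width in range(2, n + 1):                        # binary rules: A -> B C
        for i in range(n - width + 1):
            j = i + width - 1
            for A in nonterminals:
                beta[i, j, A] = sum(
                    pi.get((A, (B, C)), 0.0) * beta[i, k, B] * beta[k + 1, j, C]
                    for k in range(i, j) for B in symbols for C in symbols)
    # root: S -> A
    return sum(pi.get((start, (A,)), 0.0) * beta[0, n - 1, A] for A in nonterminals)
```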

SLIDE 19

Grammar Induction with PCFGs
F1 against gold trees annotated by linguists (English Penn Treebank):

Model            F1
Random Trees     19.5
Right Branching  39.5
PCFG             < 35.0
Neural PCFG      52.6

SLIDE 20

Grammar Induction with PCFGs
Long history of work showing that MLE with PCFGs fails to discover linguistically meaningful tree structures [Lari and Young 1990; Carroll and Charniak 1992]. Why? Potentially due to:
- Hardness of the optimization problem (non-convex).
- Overly strict independence assumptions (first-order context-freeness).


SLIDE 22

Prior Work on Grammar Induction with PCFGs
Driven by the conventional wisdom that "MLE with PCFGs doesn't work":
- Modified objectives [Klein and Manning 2002, 2004; Smith and Eisner 2004].
- Priors / nonparametric models [Liang et al. 2007; Johnson et al. 2007].
- Handcrafted features [Huang et al. 2012; Golland et al. 2012].
- Other types of regularization (e.g. on recursion depth) [Noji et al. 2016; Jin et al. 2018b].


SLIDE 24

A Different Parameterization
Scalar parameterization: associate a probability πr with each rule, such that the rules form valid probability distributions:

    πT→w ≥ 0,   Σ_{w′∈Σ} πT→w′ = 1

Neural parameterization: associate a symbol embedding w_N with each symbol N on the left-hand side of a rule. Rule probabilities are given by a neural net over w_N, e.g.

    πT→w = NeuralNet(w_T) = exp(u_w⊤ f(w_T)) / Σ_{w′∈Σ} exp(u_{w′}⊤ f(w_T))

(Similar parameterizations for A → B C.)
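
As a rough illustration of this parameterization, here is a minimal PyTorch sketch (dimensions and module names are my own choices, not the talk's) that computes πT→w for all preterminals T and words w at once:

```python
import torch
import torch.nn as nn

emb_dim, vocab_size, num_preterms = 64, 10000, 60
w_T = nn.Embedding(num_preterms, emb_dim)         # input symbol embeddings w_N
u = nn.Linear(emb_dim, vocab_size, bias=False)    # output word embeddings u_w
f = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.ReLU())   # shared neural net f

def emission_log_probs():
    """log pi_{T->w} = log softmax_w(u_w^T f(w_T)), for all T and w at once."""
    h = f(w_T.weight)                             # (num_preterms, emb_dim)
    return torch.log_softmax(u(h), dim=-1)        # (num_preterms, vocab_size)
```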


SLIDE 26

Neural PCFG

    πT→w ∝ exp( u_w⊤ f(w_T) )

with output embedding u_w, shared neural net f, and input (symbol) embedding w_T.

- Model parameters θ are given by the input embeddings, the output embeddings, and the parameters of the neural net f.
- Analogous to count-based vs. neural language models: parameter sharing through distributed representations (word embeddings vs. symbol embeddings).
- Same independence assumptions (i.e. context-freeness), just a different parameterization.


SLIDE 28

Neural PCFG: Training
Same dynamic programming algorithm for marginalization: just perform stochastic gradient ascent on the log marginal likelihood (inside algorithm + autodiff),

    θ_new = θ_old + λ ∇θ log pθ(x)

where log pθ(x) is computed by the inside algorithm.
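
A minimal sketch of the resulting training loop, assuming a hypothetical `model` that returns all rule log-probabilities and a hypothetical `inside_log_prob` implementing the inside algorithm in PyTorch (so gradients flow through the dynamic program):

```python
import torch

opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # `model`: hypothetical module
for batch in corpus:                                 # batches of word-id sequences
    log_pi = model.rule_log_probs()                  # neural net -> rule log-probs
    loss = -inside_log_prob(batch, log_pi).mean()    # negative log marginal likelihood
    opt.zero_grad()
    loss.backward()                                  # backprop through the inside DP
    opt.step()                                       # gradient step on theta
```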
SLIDE 29

Neural PCFG: Results (English Penn Treebank)

Model            F1
Random Trees     19.5
Right Branching  39.5
Scalar PCFG      < 35.0
Neural PCFG      52.6


SLIDE 31

Neural PCFG: Results

Model            F1      Training/Test PPL
Random Trees     19.5    −
Right Branching  39.5    −
Scalar PCFG      < 35.0  > 350
Neural PCFG      52.6    ≈ 250

SLIDE 34

Grammar Induction with PCFGs: Issues
Long history of work showing that MLE with PCFGs fails to discover linguistically meaningful tree structures [Lari and Young 1990; Carroll and Charniak 1992]. Why? Potentially due to:
- Hardness of the optimization problem (non-convex). Neural parameterization + SGD makes optimization easier (for some reason).
- Overly strict independence assumptions (first-order context-freeness).

SLIDE 35

This Talk: Neural Grammar Induction
- Grammar induction with PCFG + neural parameterization works!
- Approximate more flexible grammars with vectors + VAEs.
- Learn structure-aware generative models with the induced trees.

SLIDE 36

Limitations of PCFGs
No sensitivity to lexical context.

(example from http://www.cs.columbia.edu/~mcollins/courses/nlp2011/notes/lexpcfgs.pdf)

SLIDE 37

Limitations of PCFGs
No sensitivity to lexical context.

SLIDE 38

Limitations of PCFGs
No sensitivity to structural context.

(example from http://www.cs.columbia.edu/~mcollins/courses/nlp2011/notes/lexpcfgs.pdf)

SLIDE 39

Limitations of PCFGs
Johnson et al. [2007]: a supervised PCFG + unsupervised fine-tuning decreases parsing accuracy while corpus likelihood improves!

"It is easy to demonstrate that the poor quality of the PCFG models is the cause of these problems rather than search or other algorithmic issues. If one initializes either the IO or Bayesian estimation procedures with treebank parses and then runs the procedure using the yields alone, the accuracy of the parses uniformly decreases while the (posterior) likelihood uniformly increases with each iteration, demonstrating that improving the (posterior) likelihood of such models does not improve parse accuracy."

SLIDE 40

Potential Solutions: Lexicalization
No sensitivity to lexical context ⇒ lexicalized PCFGs [Collins 1997].
Rules are lexicalized, e.g. A → B C becomes A(w) → B(w) C(h), with w, h ∈ Σ.
Integrates a notion of headedness.

SLIDE 41

Potential Solutions: Markovization
No sensitivity to structural context ⇒ horizontal/vertical Markovization [Klein and Manning 2003].
Richer dependencies through grandparents/siblings.

SLIDE 42

Potential Solutions: Enriching PCFGs
- Lexicalized PCFG [Collins 1997]
- Horizontal/vertical Markovization [Klein and Manning 2003]
- Latent-variable PCFG [Petrov et al. 2006]
All too expensive to apply in the unsupervised case due to the explosion in the number of rules.


SLIDE 44

Compound PCFG
Idea: keep the symbol embeddings, but associate a latent vector z with each sentence x.
Compound generative process (a Bayesian PCFG):

(1) z ∼ N(0, I)
(2) πz = NeuralNet([w_N ; z]); for example,

    π_{z,T→w} = exp(u_w⊤ f([w_T ; z])) / Σ_{w′∈Σ} exp(u_{w′}⊤ f([w_T ; z]))

(3) t ∼ PCFG(πz)
(4) x = yield(t)
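
A minimal sketch of steps (1)-(4), with `neural_net`, `sample_tree`, and `yield_of` as hypothetical stand-ins rather than the talk's actual code:

```python
import torch

def generate_sentence(symbol_embs, z_dim=64):
    z = torch.randn(z_dim)                  # (1) z ~ N(0, I)
    pi_z = neural_net(symbol_embs, z)       # (2) per-sentence rule probabilities
    t = sample_tree(pi_z)                   # (3) t ~ PCFG(pi_z)
    return yield_of(t)                      # (4) x = leaves of t
```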


SLIDE 46

Compound PCFG

    π_{z,T→w} ∝ exp( u_w⊤ f([w_T ; z]) )

where the symbol embedding w_T is fixed across sentences and z varies per sentence.

- Input/output embeddings and the neural net f are shared across sentences, but the rule probabilities for each sentence can vary through z.
- Intuition: z can encode lexical/structural information specific to the sentence. Some approximation to a lexicalized, higher-order grammar.

SLIDE 47

Recap
- Neural PCFG: same modeling assumptions, different parameterization.
- Compound PCFG: different modeling assumptions altogether. No longer context-free!

SLIDE 48

Neural PCFG vs. Compound PCFG

SLIDE 49

Compound PCFG: Training and Inference
For maximum likelihood, the log marginal likelihood is given by

    log pθ(x) = log ∫ pθ(x | z) p(z) dz,   where pθ(x | z) = Σ_{t ∈ T(x)} pθ(t | z)

Intractable due to the integral over z.
SLIDE 50

Compound PCFG: Training and Inference
Variational inference: introduce a variational posterior for z,

    log pθ(x) ≥ E_{qφ(z|x)}[ log pθ(x | z) ] − KL[ qφ(z | x) ‖ p(z) ]

- An inference network (an LSTM over x) produces the parameters of the Gaussian variational posterior qφ(z | x).
- Given a sample z, pθ(x | z) = Σ_{t ∈ T(x)} pθ(t | z) can be calculated with dynamic programming (the inside algorithm).


SLIDE 52

Compound PCFG: Training and Inference
Collapsed variational inference:

    log pθ(x) ≥ E_{qφ(z|x)}[ log pθ(x | z) ] − KL[ qφ(z | x) ‖ p(z) ]

where the expectation is estimated with a reparameterized sample, log pθ(x | z) is computed exactly with the inside algorithm, and the KL term is the analytic KL between two Gaussians.

“VAE with a PCFG decoder”
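
Putting the pieces together, a minimal sketch of this objective, assuming a hypothetical `encoder` (the LSTM inference network) and the same hypothetical `inside_log_prob` as before; the KL term is the closed-form Gaussian KL:

```python
import torch

def elbo(x, encoder, rule_log_probs_given_z):
    mu, log_var = encoder(x)                        # LSTM inference net q_phi(z | x)
    z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)   # reparameterized sample
    log_pi_z = rule_log_probs_given_z(z)            # per-sentence rule log-probs
    log_px_z = inside_log_prob(x, log_pi_z)         # inside algorithm: log p(x | z)
    kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(-1)  # analytic KL
    return log_px_z - kl                            # maximize this lower bound
```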


SLIDE 54

Compound PCFG: Results on PTB

Model            F1      Training/Test PPL
Random Trees     19.5    −
Right Branching  39.5    −
Scalar PCFG      < 35.0  > 350
Neural PCFG      52.6    ≈ 250
Compound PCFG    60.1    ≈ 190

SLIDE 55

Compound PCFG: Results against Prior Work

Model                                English (PTB)   Chinese (CTB)
PRPN [Shen et al. 2018]              38.1            −
Ordered Neurons [Shen et al. 2019]   49.4            −
Unsupervised RNNG [Kim et al. 2019]  45.4            −
DIORA [Drozdov et al. 2019]          58.9            −
Random Trees                         19.5            16.0
Right Branching                      39.5            20.0
Scalar PCFG                          < 35.0          < 15.0
Neural PCFG                          52.6            29.5
Compound PCFG                        60.1            39.8

SLIDE 56

Model Analysis: Nonterminal Alignment (|N| = 30)

SLIDE 57

Model Analysis: What does z learn?
Nearest neighbors based on the variational posterior mean vector:

unk corp. received an N million army contract for helicopter engines
boeing co. received a N million air force contract for developing cable systems for the unk missile
general dynamics corp. received a N million air force contract for unk training sets
grumman corp. received an N million navy contract to upgrade aircraft electronics
thomson missile products with about half british aerospace 's annual revenue include the unk unk missile family
already british aerospace and french unk unk unk on a british missile contract and on an air-traffic control

SLIDE 58

Model Analysis: What does z learn?

[Parse-tree template over w1 ... w6, with nonterminals NT-04, NT-12, NT-20, NT-07 and preterminals T-13, T-05, T-45, T-35, T-40, T-22]

PC −:
of the company 's capital structure
in the company 's divestiture program
by the company 's new board
in the company 's core businesses
in the company 's strategic plan

PC +:
above the treasury 's N-year note
above the treasury 's seven-year note
above the treasury 's comparable note
above the treasury 's five-year note
measured the earth 's ozone layer

SLIDE 59

This Talk: Neural Grammar Induction
- Grammar induction with PCFG + neural parameterization works!
- Approximate more flexible grammars with vectors + VAEs.
- Learn structure-aware generative models with the induced trees.


SLIDE 61

Compound PCFG as a Language Model

Model            F1      Test PPL
Scalar PCFG      < 35.0  > 350
Neural PCFG      52.6    ≈ 250
Compound PCFG    60.1    ≈ 190
RNN LM           −       86.2

SLIDE 62

Compound PCFG as a Language Model
- The compound PCFG has less strict independence assumptions than a PCFG, but is still more restricted than an RNN language model.
- Good parser, poor language model. Can we use the induced trees to learn a good generative model?

SLIDE 63

Background: Recurrent Neural Network Grammars (RNNG) [Dyer et al. 2016]
- A structured joint generative model pθ(x, z) of a sentence x and a tree z.
- Generates the next word conditioned on the partially completed syntax tree.
- A hierarchical generative process (cf. the flat generative process of an RNN).
- Like an RNN LM, makes no independence assumptions.

SLIDE 64

Recurrent Neural Network Language Models
Standard RNN LMs: flat left-to-right generation,

    x_t ∼ pθ(x | x_1, ..., x_{t−1}) = softmax(W h_{t−1} + b)

SLIDE 65

RNNG [Dyer et al. 2016]
- Introduce binary variables z = [z_1, ..., z_{2T−1}] (an unlabeled binary tree).
- Sample an action z_t ∈ {generate, reduce} at each time step:

    z_t ∼ Bernoulli(p_t),   p_t = σ(w⊤ h_prev + b)

SLIDE 66

RNNG [Dyer et al. 2016]
If z_t = generate: sample a word from the context representation.

SLIDE 67

RNNG [Dyer et al. 2016]
(Similar to standard RNN LMs:)  x ∼ softmax(W h_prev + b)

SLIDE 68

RNNG [Dyer et al. 2016]
Obtain a new context representation with e_hungry:  h_new = LSTM(e_hungry, h_prev)

SLIDE 69

RNNG [Dyer et al. 2016]
h_new = LSTM(e_cat, h_prev)

SLIDE 70

RNNG [Dyer et al. 2016]
If z_t = reduce:

SLIDE 71

RNNG [Dyer et al. 2016]
If z_t = reduce: pop the last two elements.

SLIDE 72

RNNG [Dyer et al. 2016]
Obtain a new representation of the constituent:  e_(hungry cat) = TreeLSTM(e_hungry, e_cat)

SLIDE 73

RNNG [Dyer et al. 2016]
Move the new representation onto the stack:  h_new = LSTM(e_(hungry cat), h_prev)
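
A minimal sketch tying the preceding slides together, with every network (`stack_lstm`, `tree_lstm`, `sample_word`, `word_emb`) and the parameters `h0`, `w`, `b` as hypothetical stand-ins; a real RNNG also recomputes the stack LSTM state after a reduce, which is elided here:

```python
import torch

def rnng_generate(max_steps):
    stack, h = [], h0                               # h0: initial state (stand-in)
    for _ in range(max_steps):
        p_t = torch.sigmoid(w @ h + b)              # p_t = sigmoid(w^T h_prev + b)
        if len(stack) >= 2 and torch.rand(()) < p_t:    # z_t = reduce
            right, left = stack.pop(), stack.pop()      # pop last two elements
            e = tree_lstm(left, right)                  # constituent representation
        else:                                           # z_t = generate
            x = sample_word(h)                          # x ~ softmax(W h_prev + b)
            e = word_emb(x)
        stack.append(e)                             # push the new element
        h = stack_lstm(e, h)                        # update context representation
    return stack
```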

SLIDE 74

Compound PCFG + RNNG
- Use the compound PCFG to parse the training set.
- Train an RNNG on the induced trees; fine-tune with an unsupervised RNNG.

Model                   Test PPL
Neural PCFG             252.6
Compound PCFG           196.3
RNN LM                  86.2
URNNG + Compound PCFG   83.7
URNNG + Gold Trees      78.3


SLIDE 76

Syntactic Evaluation [Marvin and Linzen 2018]
Two minimally different sentences:

  The senators near the assistant are old
  *The senators near the assistant is old

The model must assign higher probability to the correct one.
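
The evaluation protocol itself is simple to state in code; a minimal sketch, where `sentence_log_prob` is a stand-in for any of the models in the table on the next slide:

```python
pairs = [("The senators near the assistant are old",
          "The senators near the assistant is old")]   # (grammatical, ungrammatical)

def syntactic_eval_accuracy(pairs, sentence_log_prob):
    """Fraction of pairs where the model prefers the grammatical sentence."""
    hits = sum(sentence_log_prob(good) > sentence_log_prob(bad)
               for good, bad in pairs)
    return hits / len(pairs)
```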

SLIDE 77

Syntactic Evaluation [Marvin and Linzen 2018]

Model                   Test PPL   Syntactic Eval.
Neural PCFG             252.6      49.2%
Compound PCFG           196.3      50.7%
RNN LM                  86.2       60.9%
URNNG + Compound PCFG   83.7       76.1%
URNNG + Gold Trees      78.3       76.1%

Models can have similar PPL but perform very differently on such tasks.

SLIDE 78

Limitations
- Slow to train due to the cubic dynamic program.
- A latent vector to approximate lexicalized/higher-order grammars seems hacky.
- What does structure mean in the ELMo/BERT era?

SLIDE 79

Summary
- Neural PCFG: neural machinery + PCFG can induce linguistically meaningful grammars with MLE.
- Compound PCFG: a more flexible grammar through a sentence-level latent vector.
- Induced RNNG: use the induced trees to improve language models.

SLIDE 80

Future Work
- Analyses of learned models with psycholinguistics-like experiments [Wilcox et al. 2019; Futrell et al. 2019; An et al. 2019].
- Separation of "syntax" from "semantics".
- Some languages are provably not context-free ⇒ neural parameterizations of mildly context-sensitive formalisms (e.g. tree-adjoining grammars).
- Investigate why MLE with the scalar parameterization fails but the neural parameterization works.

SLIDES 81-84: REFERENCES

Aixiu An, Peng Qian, Ethan Wilcox, and Roger Levy. 2019. Representation of Constituents in Neural Language Models: Coordination Phrase as a Case Study. In Proceedings of EMNLP.
Glenn Carroll and Eugene Charniak. 1992. Two Experiments on Learning Probabilistic Dependency Grammars from Corpora. In AAAI Workshop on Statistically-Based NLP Techniques.
Michael Collins. 1997. Three Generative, Lexicalised Models for Statistical Parsing. In Proceedings of ACL.
Andrew Drozdov, Patrick Verga, Mohit Yadav, Mohit Iyyer, and Andrew McCallum. 2019. Unsupervised Latent Tree Induction with Deep Inside-Outside Recursive Auto-Encoders. In Proceedings of NAACL.
Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent Neural Network Grammars. In Proceedings of NAACL.
Richard Futrell, Ethan Wilcox, Takashi Morita, Peng Qian, Miguel Ballesteros, and Roger Levy. 2019. Neural Language Models as Psycholinguistic Subjects: Representations of Syntactic State. In Proceedings of NAACL.
Dave Golland, John DeNero, and Jakob Uszkoreit. 2012. A Feature-Rich Constituent Context Model for Grammar Induction. In Proceedings of ACL.
Yun Huang, Min Zhang, and Chew Lim Tan. 2012. Improved Constituent Context Model with Features. In Proceedings of PACLIC.
Lifeng Jin, Finale Doshi-Velez, Timothy Miller, William Schuler, and Lane Schwartz. 2018a. Depth-bounding is Effective: Improvements and Evaluation of Unsupervised PCFG Induction. In Proceedings of EMNLP.
Lifeng Jin, Finale Doshi-Velez, Timothy Miller, William Schuler, and Lane Schwartz. 2018b. Unsupervised Grammar Induction with Depth-bounded PCFG. In Transactions of the Association for Computational Linguistics (TACL).
Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2007. Bayesian Inference for PCFGs via Markov chain Monte Carlo. In Proceedings of NAACL.
Yoon Kim, Alexander M. Rush, Lei Yu, Adhiguna Kuncoro, Chris Dyer, and Gábor Melis. 2019. Unsupervised Recurrent Neural Network Grammars. In Proceedings of NAACL.
Dan Klein and Christopher Manning. 2002. A Generative Constituent-Context Model for Improved Grammar Induction. In Proceedings of ACL.
Dan Klein and Christopher Manning. 2004. Corpus-based Induction of Syntactic Structure: Models of Dependency and Constituency. In Proceedings of ACL.
Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. In Proceedings of ACL.
Karim Lari and Steve Young. 1990. The Estimation of Stochastic Context-Free Grammars Using the Inside-Outside Algorithm. Computer Speech and Language, 4:35–56.
Percy Liang, Slav Petrov, Michael I. Jordan, and Dan Klein. 2007. The Infinite PCFG using Hierarchical Dirichlet Processes. In Proceedings of EMNLP.
Rebecca Marvin and Tal Linzen. 2018. Targeted Syntactic Evaluation of Language Models. In Proceedings of EMNLP.
Hiroshi Noji, Yusuke Miyao, and Mark Johnson. 2016. Using Left-corner Parsing to Encode Universal Structural Constraints in Grammar Induction. In Proceedings of EMNLP.
Slav Petrov, Leon Barret, Romain Thibaux, and Dan Klein. 2006. Learning Accurate, Compact, and Interpretable Tree Annotation. In Proceedings of ACL.
Yikang Shen, Zhouhan Lin, Chin-Wei Huang, and Aaron Courville. 2018. Neural Language Modeling by Jointly Learning Syntax and Lexicon. In Proceedings of ICLR.
Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron Courville. 2019. Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks. In Proceedings of ICLR.
Haoyue Shi, Jiayuan Mao, Kevin Gimpel, and Karen Livescu. 2019. Visually Grounded Neural Syntax Acquisition. In Proceedings of ACL.
Noah A. Smith and Jason Eisner. 2004. Annealing Techniques for Unsupervised Statistical Language Learning. In Proceedings of ACL.
Ethan Wilcox, Peng Qian, Richard Futrell, Miguel Ballesteros, and Roger Levy. 2019. Structural Supervision Improves Learning of Non-Local Grammatical Dependencies. In Proceedings of NAACL.