SLIDE 1

Neural Grammar Induction

Yoon Kim

Harvard University (with Chris Dyer, Alexander Rush)

SLIDE 2

Language has Hierarchical Structure

SLIDE 3

Evidence from Neuroscience (Ding et al. 2015)

SLIDE 4

Evidence from Neuroscience (Ding et al. 2015)

SLIDE 5

Evidence from Neuroscience (Ding et al. 2015)

SLIDE 6

Goals
Grammar induction / unsupervised parsing: learn a parsing system from observed sentences alone.

(from https://nlp.stanford.edu/projects/project-induction.shtml)

SLIDE 7

A Longstanding Problem in AI/NLP
- Children can do it without explicit supervision on trees.
- Implications for the "poverty of the stimulus" argument.
- Many domains/languages lack annotated trees.

SLIDE 8

Progress in Supervised Parsing (F1 on the WSJ Penn Treebank)

Model                      F1
Non-Neural Models
  Collins (1997)           87.8
  Charniak (1999)          89.6
  Petrov and Klein (2007)  90.1
  McClosky et al. (2006)   92.1
Neural Models
  Dyer et al. (2016)       93.3
  Fried et al. (2017)      94.7
  Kitaev and Klein (2019)  95.8

SLIDE 9

Progress in Unsupervised Parsing
- Initial work: gold part-of-speech tags, short sentences of length up to 10.
- Recent work: directly on words; train/evaluate on the full corpus [Shen et al. 2018, 2019; Jin et al. 2018a,b; Drozdov et al. 2019; Shi et al. 2019].
- Still a very hard problem...

SLIDE 10

This Talk: Neural Grammar Induction
- Grammar induction with PCFG + neural parameterization works!
- Approximate more flexible grammars with vectors + VAEs.
- Learn structure-aware generative models with the induced trees.

SLIDE 11

Context-Free Grammars (CFG)
- A set of recursive production rules used to generate the strings of a formal language.
- Example: S → aSb, S → ab generates aⁿbⁿ (see the sketch below).
- Given a string, we can efficiently obtain the underlying parse tree.
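
As a concrete illustration (my own toy code, not from the talk), the two-rule grammar above can be unwound directly:

```python
# Toy illustration: the CFG with rules S -> aSb and S -> ab generates a^n b^n.

def generate(n: int) -> str:
    """Apply S -> aSb (n-1) times, then S -> ab, yielding a^n b^n."""
    return "a" * n + "b" * n

def recognize(s: str) -> bool:
    """Membership in {a^n b^n : n >= 1}, unwinding the two rules recursively."""
    if s.startswith("a") and s.endswith("b"):
        inner = s[1:-1]
        return inner == "" or recognize(inner)   # S -> ab   or   S -> aSb
    return False

assert generate(3) == "aaabbb"
assert recognize("aaabbb") and not recognize("aabbb")
```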

SLIDE 12

Context-Free Grammars for Natural Language

SLIDE 13

Context-Free Grammars (CFG): Formal Description
G = (S, N, P, Σ, R) where
  N : set of nonterminals (constituent labels)
  P : set of preterminals (part-of-speech tags)
  Σ : set of terminals (words)
  S : start symbol
  R : set of rules
Each rule r ∈ R is one of the following:
  S → A        A ∈ N
  A → B C      A ∈ N;  B, C ∈ N ∪ P
  T → w        T ∈ P;  w ∈ Σ
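
A minimal sketch (types and names are my own, not the talk's) of this 5-tuple as a data structure, restricted to the three rule shapes above:

```python
# A minimal sketch of G = (S, N, P, Sigma, R); names and types are my own.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Rule:
    lhs: str          # S, a nonterminal in N, or a preterminal in P
    rhs: tuple        # ("A",) for S -> A; ("B", "C") for A -> B C; ("w",) for T -> w

@dataclass
class CFG:
    start: str                     # S
    nonterminals: frozenset        # N (constituent labels)
    preterminals: frozenset        # P (part-of-speech tags)
    terminals: frozenset           # Sigma (words)
    rules: set = field(default_factory=set)   # R

    def rule_is_valid(self, r: Rule) -> bool:
        """Check that r matches one of the three allowed rule shapes."""
        if r.lhs == self.start:                                    # S -> A
            return len(r.rhs) == 1 and r.rhs[0] in self.nonterminals
        if r.lhs in self.nonterminals:                             # A -> B C
            return len(r.rhs) == 2 and all(
                s in self.nonterminals | self.preterminals for s in r.rhs)
        if r.lhs in self.preterminals:                             # T -> w
            return len(r.rhs) == 1 and r.rhs[0] in self.terminals
        return False
```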


SLIDE 15

Probabilistic Context-Free Grammars (PCFG)
- Associate a probability πr with each rule r ∈ R; this gives rise to a distribution over parse trees.
- The probability of a tree t is the product of the probabilities of the rules used in its derivation:

    pπ(t) = ∏_{r ∈ t_R} πr,   where t_R is the set of rules used to derive t
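
A minimal sketch of this product, computed in log space to avoid underflow; the rule probabilities below are made up for illustration:

```python
import math

def tree_log_prob(t_rules, pi):
    """log p_pi(t): sum of log pi_r over the rules r in t_R."""
    return sum(math.log(pi[r]) for r in t_rules)

# Rules of the example tree on the next slide, with made-up probabilities:
pi = {("S", ("A1",)): 0.9, ("A1", ("T4", "A3")): 0.2, ("A3", ("T2", "T7")): 0.3,
      ("T4", ("Jon",)): 0.01, ("T2", ("knows",)): 0.02, ("T7", ("nothing",)): 0.01}
t_R = list(pi)                            # here every rule is used exactly once
print(math.exp(tree_log_prob(t_R, pi)))  # p_pi(t) ≈ 1.08e-07
```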

SLIDE 16

PCFG Example

[Tree: S → A1; A1 → T4 A3; A3 → T2 T7; leaves: Jon knows nothing]

Ai: nonterminals, Tj: preterminals
t_R = {S → A1, A1 → T4 A3, A3 → T2 T7, T4 → Jon, T2 → knows, T7 → nothing}
pπ(t) = πS→A1 × πA1→T4 A3 × πA3→T2 T7 × πT4→Jon × πT2→knows × πT7→nothing


SLIDE 18

Grammar Induction with PCFGs
- Specify the broad structure of the grammar (number of nonterminals, preterminals, etc.).
- Maximize the log likelihood with respect to the learnable parameters: given a corpus of sentences x(1), ..., x(N),

    max_π  Σ_{n=1}^{N} log pπ(x(n))

- This requires summing out the unobserved trees:

    pπ(x) = Σ_{t ∈ T(x)} pπ(t),   where T(x) is the set of trees whose leaves are x
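
The sum over T(x) looks exponential, but it factors over spans; below is a minimal sketch (my own code, not the talk's) of the O(n³) inside algorithm for the three rule shapes defined earlier. The function name and dict-based rule table are illustrative assumptions.

```python
# Inside algorithm sketch for rules S -> A, A -> B C, T -> w.
from collections import defaultdict

def inside(words, pi, nonterminals, preterminals, start="S"):
    """Return p_pi(x) = sum over all trees t in T(x) of p_pi(t), in O(n^3)."""
    n = len(words)
    beta = defaultdict(float)  # beta[i, j, A] = total prob of A spanning words[i..j]
    for i, w in enumerate(words):                        # leaves: T -> w
        for T in preterminals:
            beta[i, i, T] = pi.get((T, (w,)), 0.0)
    symbols = nonterminals | preterminals
    for width in range(2, n + 1):                        # binary rules: A -> B C
        for i in range(n - width + 1):
            j = i + width - 1
            for A in nonterminals:
                beta[i, j, A] = sum(
                    pi.get((A, (B, C)), 0.0) * beta[i, k, B] * beta[k + 1, j, C]
                    for k in range(i, j) for B in symbols for C in symbols)
    # root: S -> A
    return sum(pi.get((start, (A,)), 0.0) * beta[0, n - 1, A] for A in nonterminals)
```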

SLIDE 19

Grammar Induction with PCFGs
F1 against gold trees annotated by linguists (English Penn Treebank):

Model            F1
Random Trees     19.5
Right Branching  39.5
PCFG             < 35.0
Neural PCFG      52.6

SLIDE 20

Grammar Induction with PCFGs
Long history of work showing that MLE with PCFGs fails to discover linguistically meaningful tree structures [Lari and Young 1990; Carroll and Charniak 1992]. Why? Potentially due to:
- Hardness of the optimization problem (non-convex).
- Overly strict independence assumptions (first-order context-freeness).


SLIDE 22

Prior Work on Grammar Induction with PCFGs
Driven by the conventional wisdom that "MLE with PCFGs doesn't work":
- Modified objectives [Klein and Manning 2002, 2004; Smith and Eisner 2004].
- Priors / nonparametric models [Liang et al. 2007; Johnson et al. 2007].
- Handcrafted features [Huang et al. 2012; Golland et al. 2012].
- Other types of regularization (e.g. on recursion depth) [Noji et al. 2016; Jin et al. 2018b].


SLIDE 24

A Different Parameterization
Scalar parameterization: associate a probability πr with each rule, such that the rules form valid probability distributions:

    πT→w ≥ 0,   Σ_{w′∈Σ} πT→w′ = 1

Neural parameterization: associate a symbol embedding w_N with each symbol N on the left-hand side of a rule. Rule probabilities are given by a neural net over w_N, e.g.

    πT→w = NeuralNet(w_T) = exp(u_w⊤ f(w_T)) / Σ_{w′∈Σ} exp(u_{w′}⊤ f(w_T))

(Similar parameterizations for A → B C.)
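
As a rough illustration of this parameterization, here is a minimal PyTorch sketch (dimensions and module names are my own choices, not the talk's) that computes πT→w for all preterminals T and words w at once:

```python
import torch
import torch.nn as nn

emb_dim, vocab_size, num_preterms = 64, 10000, 60
w_T = nn.Embedding(num_preterms, emb_dim)         # input symbol embeddings w_N
u = nn.Linear(emb_dim, vocab_size, bias=False)    # output word embeddings u_w
f = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.ReLU())   # shared neural net f

def emission_log_probs():
    """log pi_{T->w} = log softmax_w(u_w^T f(w_T)), for all T and w at once."""
    h = f(w_T.weight)                             # (num_preterms, emb_dim)
    return torch.log_softmax(u(h), dim=-1)        # (num_preterms, vocab_size)
```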


SLIDE 26

Neural PCFG

    πT→w ∝ exp( u_w⊤ f(w_T) )

with output embedding u_w, shared neural net f, and input (symbol) embedding w_T.

- Model parameters θ are given by the input embeddings, the output embeddings, and the parameters of the neural net f.
- Analogous to count-based vs. neural language models: parameter sharing through distributed representations (word embeddings vs. symbol embeddings).
- Same independence assumptions (i.e. context-freeness), just a different parameterization.


SLIDE 28

Neural PCFG: Training
Same dynamic programming algorithm for marginalization: just perform stochastic gradient ascent on the log marginal likelihood (inside algorithm + autodiff),

    θ_new = θ_old + λ ∇θ log pθ(x)

where log pθ(x) is computed by the inside algorithm.
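
A minimal sketch of the resulting training loop, assuming a hypothetical `model` that returns all rule log-probabilities and a hypothetical `inside_log_prob` implementing the inside algorithm in PyTorch (so gradients flow through the dynamic program):

```python
import torch

opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # `model`: hypothetical module
for batch in corpus:                                 # batches of word-id sequences
    log_pi = model.rule_log_probs()                  # neural net -> rule log-probs
    loss = -inside_log_prob(batch, log_pi).mean()    # negative log marginal likelihood
    opt.zero_grad()
    loss.backward()                                  # backprop through the inside DP
    opt.step()                                       # gradient step on theta
```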
SLIDE 29

Neural PCFG: Results (English Penn Treebank)

Model            F1
Random Trees     19.5
Right Branching  39.5
Scalar PCFG      < 35.0
Neural PCFG      52.6


SLIDE 31

Neural PCFG: Results

Model            F1      Training/Test PPL
Random Trees     19.5    −
Right Branching  39.5    −
Scalar PCFG      < 35.0  > 350
Neural PCFG      52.6    ≈ 250

SLIDE 34

Grammar Induction with PCFGs: Issues
Long history of work showing that MLE with PCFGs fails to discover linguistically meaningful tree structures [Lari and Young 1990; Carroll and Charniak 1992]. Why? Potentially due to:
- Hardness of the optimization problem (non-convex). Neural parameterization + SGD makes optimization easier (for some reason).
- Overly strict independence assumptions (first-order context-freeness).

SLIDE 35

This Talk: Neural Grammar Induction
- Grammar induction with PCFG + neural parameterization works!
- Approximate more flexible grammars with vectors + VAEs.
- Learn structure-aware generative models with the induced trees.

SLIDE 36

Limitations of PCFGs
No sensitivity to lexical context.

(example from http://www.cs.columbia.edu/~mcollins/courses/nlp2011/notes/lexpcfgs.pdf)

SLIDE 37

Limitations of PCFGs
No sensitivity to lexical context.

SLIDE 38

Limitations of PCFGs
No sensitivity to structural context.

(example from http://www.cs.columbia.edu/~mcollins/courses/nlp2011/notes/lexpcfgs.pdf)

SLIDE 39

Limitations of PCFGs
Johnson et al. [2007]: a supervised PCFG + unsupervised fine-tuning decreases parsing accuracy while corpus likelihood improves!

"It is easy to demonstrate that the poor quality of the PCFG models is the cause of these problems rather than search or other algorithmic issues. If one initializes either the IO or Bayesian estimation procedures with treebank parses and then runs the procedure using the yields alone, the accuracy of the parses uniformly decreases while the (posterior) likelihood uniformly increases with each iteration, demonstrating that improving the (posterior) likelihood of such models does not improve parse accuracy."

SLIDE 40

Potential Solutions: Lexicalization
No sensitivity to lexical context ⇒ lexicalized PCFGs [Collins 1997].
Rules are lexicalized, e.g. A → B C becomes A(w) → B(w) C(h), with w, h ∈ Σ.
Integrates a notion of headedness.

SLIDE 41

Potential Solutions: Markovization
No sensitivity to structural context ⇒ horizontal/vertical Markovization [Klein and Manning 2003].
Richer dependencies through grandparents/siblings.

SLIDE 42

Potential Solutions: Enriching PCFGs
- Lexicalized PCFG [Collins 1997]
- Horizontal/vertical Markovization [Klein and Manning 2003]
- Latent-variable PCFG [Petrov et al. 2006]
All too expensive to apply in the unsupervised case due to the explosion in the number of rules.


SLIDE 44

Compound PCFG
Idea: keep the symbol embeddings, but associate a latent vector z with each sentence x.
Compound generative process (a Bayesian PCFG):

(1) z ∼ N(0, I)
(2) πz = NeuralNet([w_N ; z]); for example,

    π_{z,T→w} = exp(u_w⊤ f([w_T ; z])) / Σ_{w′∈Σ} exp(u_{w′}⊤ f([w_T ; z]))

(3) t ∼ PCFG(πz)
(4) x = yield(t)
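
A minimal sketch of steps (1)-(4), with `neural_net`, `sample_tree`, and `yield_of` as hypothetical stand-ins rather than the talk's actual code:

```python
import torch

def generate_sentence(symbol_embs, z_dim=64):
    z = torch.randn(z_dim)                  # (1) z ~ N(0, I)
    pi_z = neural_net(symbol_embs, z)       # (2) per-sentence rule probabilities
    t = sample_tree(pi_z)                   # (3) t ~ PCFG(pi_z)
    return yield_of(t)                      # (4) x = leaves of t
```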


SLIDE 46

Compound PCFG

    π_{z,T→w} ∝ exp( u_w⊤ f([w_T ; z]) )

where the symbol embedding w_T is fixed across sentences and z varies per sentence.

- Input/output embeddings and the neural net f are shared across sentences, but the rule probabilities for each sentence can vary through z.
- Intuition: z can encode lexical/structural information specific to the sentence. Some approximation to a lexicalized, higher-order grammar.

SLIDE 47

Recap
- Neural PCFG: same modeling assumptions, different parameterization.
- Compound PCFG: different modeling assumptions altogether. No longer context-free!

SLIDE 48

Neural PCFG vs. Compound PCFG

SLIDE 49

Compound PCFG: Training and Inference
For maximum likelihood, the log marginal likelihood is given by

    log pθ(x) = log ∫ pθ(x | z) p(z) dz,   where pθ(x | z) = Σ_{t ∈ T(x)} pθ(t | z)

Intractable due to the integral over z.
SLIDE 50

Compound PCFG: Training and Inference
Variational inference: introduce a variational posterior for z,

    log pθ(x) ≥ E_{qφ(z|x)}[ log pθ(x | z) ] − KL[ qφ(z | x) ‖ p(z) ]

- An inference network (an LSTM over x) produces the parameters of the Gaussian variational posterior qφ(z | x).
- Given a sample z, pθ(x | z) = Σ_{t ∈ T(x)} pθ(t | z) can be calculated with dynamic programming (the inside algorithm).


SLIDE 52

Compound PCFG: Training and Inference
Collapsed variational inference:

    log pθ(x) ≥ E_{qφ(z|x)}[ log pθ(x | z) ] − KL[ qφ(z | x) ‖ p(z) ]

where the expectation is estimated with a reparameterized sample, log pθ(x | z) is computed exactly with the inside algorithm, and the KL term is the analytic KL between two Gaussians.

“VAE with a PCFG decoder”
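
Putting the pieces together, a minimal sketch of this objective, assuming a hypothetical `encoder` (the LSTM inference network) and the same hypothetical `inside_log_prob` as before; the KL term is the closed-form Gaussian KL:

```python
import torch

def elbo(x, encoder, rule_log_probs_given_z):
    mu, log_var = encoder(x)                        # LSTM inference net q_phi(z | x)
    z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)   # reparameterized sample
    log_pi_z = rule_log_probs_given_z(z)            # per-sentence rule log-probs
    log_px_z = inside_log_prob(x, log_pi_z)         # inside algorithm: log p(x | z)
    kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(-1)  # analytic KL
    return log_px_z - kl                            # maximize this lower bound
```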


SLIDE 54

Compound PCFG: Results on PTB

Model            F1      Training/Test PPL
Random Trees     19.5    −
Right Branching  39.5    −
Scalar PCFG      < 35.0  > 350
Neural PCFG      52.6    ≈ 250
Compound PCFG    60.1    ≈ 190

SLIDE 55

Compound PCFG: Results against Prior Work

Model                                English (PTB)   Chinese (CTB)
PRPN [Shen et al. 2018]              38.1            −
Ordered Neurons [Shen et al. 2019]   49.4            −
Unsupervised RNNG [Kim et al. 2019]  45.4            −
DIORA [Drozdov et al. 2019]          58.9            −
Random Trees                         19.5            16.0
Right Branching                      39.5            20.0
Scalar PCFG                          < 35.0          < 15.0
Neural PCFG                          52.6            29.5
Compound PCFG                        60.1            39.8

SLIDE 56

Model Analysis: Nonterminal Alignment (|N| = 30)

SLIDE 57

Model Analysis: What does z learn?
Nearest neighbors based on the variational posterior mean vector:

unk corp. received an N million army contract for helicopter engines
boeing co. received a N million air force contract for developing cable systems for the unk missile
general dynamics corp. received a N million air force contract for unk training sets
grumman corp. received an N million navy contract to upgrade aircraft electronics
thomson missile products with about half british aerospace 's annual revenue include the unk unk missile family
already british aerospace and french unk unk unk on a british missile contract and on an air-traffic control

SLIDE 58

Model Analysis: What does z learn?

[Parse-tree template over w1 ... w6, with nonterminals NT-04, NT-12, NT-20, NT-07 and preterminals T-13, T-05, T-45, T-35, T-40, T-22]

PC −:
of the company 's capital structure
in the company 's divestiture program
by the company 's new board
in the company 's core businesses
in the company 's strategic plan

PC +:
above the treasury 's N-year note
above the treasury 's seven-year note
above the treasury 's comparable note
above the treasury 's five-year note
measured the earth 's ozone layer

SLIDE 59

This Talk: Neural Grammar Induction
- Grammar induction with PCFG + neural parameterization works!
- Approximate more flexible grammars with vectors + VAEs.
- Learn structure-aware generative models with the induced trees.


SLIDE 61

Compound PCFG as a Language Model

Model            F1      Test PPL
Scalar PCFG      < 35.0  > 350
Neural PCFG      52.6    ≈ 250
Compound PCFG    60.1    ≈ 190
RNN LM           −       86.2

SLIDE 62

Compound PCFG as a Language Model
- The compound PCFG has less strict independence assumptions than a PCFG, but is still more restricted than an RNN language model.
- Good parser, poor language model. Can we use the induced trees to learn a good generative model?

SLIDE 63

Background: Recurrent Neural Network Grammars (RNNG) [Dyer et al. 2016]
- A structured joint generative model pθ(x, z) of a sentence x and a tree z.
- Generates the next word conditioned on the partially completed syntax tree.
- A hierarchical generative process (cf. the flat generative process of an RNN).
- Like an RNN LM, makes no independence assumptions.

SLIDE 64

Recurrent Neural Network Language Models
Standard RNN LMs: flat left-to-right generation,

    x_t ∼ pθ(x | x_1, ..., x_{t−1}) = softmax(W h_{t−1} + b)

SLIDE 65

RNNG [Dyer et al. 2016]
- Introduce binary variables z = [z_1, ..., z_{2T−1}] (an unlabeled binary tree).
- Sample an action z_t ∈ {generate, reduce} at each time step:

    z_t ∼ Bernoulli(p_t),   p_t = σ(w⊤ h_prev + b)

SLIDE 66

RNNG [Dyer et al. 2016]
If z_t = generate: sample a word from the context representation.

SLIDE 67

RNNG [Dyer et al. 2016]
(Similar to standard RNN LMs:)  x ∼ softmax(W h_prev + b)

SLIDE 68

RNNG [Dyer et al. 2016]
Obtain a new context representation with e_hungry:  h_new = LSTM(e_hungry, h_prev)

SLIDE 69

RNNG [Dyer et al. 2016]
h_new = LSTM(e_cat, h_prev)

SLIDE 70

RNNG [Dyer et al. 2016]
If z_t = reduce:

SLIDE 71

RNNG [Dyer et al. 2016]
If z_t = reduce: pop the last two elements.

SLIDE 72

RNNG [Dyer et al. 2016]
Obtain a new representation of the constituent:  e_(hungry cat) = TreeLSTM(e_hungry, e_cat)

SLIDE 73

RNNG [Dyer et al. 2016]
Move the new representation onto the stack:  h_new = LSTM(e_(hungry cat), h_prev)
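
A minimal sketch tying the preceding slides together, with every network (`stack_lstm`, `tree_lstm`, `sample_word`, `word_emb`) and the parameters `h0`, `w`, `b` as hypothetical stand-ins; a real RNNG also recomputes the stack LSTM state after a reduce, which is elided here:

```python
import torch

def rnng_generate(max_steps):
    stack, h = [], h0                               # h0: initial state (stand-in)
    for _ in range(max_steps):
        p_t = torch.sigmoid(w @ h + b)              # p_t = sigmoid(w^T h_prev + b)
        if len(stack) >= 2 and torch.rand(()) < p_t:    # z_t = reduce
            right, left = stack.pop(), stack.pop()      # pop last two elements
            e = tree_lstm(left, right)                  # constituent representation
        else:                                           # z_t = generate
            x = sample_word(h)                          # x ~ softmax(W h_prev + b)
            e = word_emb(x)
        stack.append(e)                             # push the new element
        h = stack_lstm(e, h)                        # update context representation
    return stack
```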

SLIDE 74

Compound PCFG + RNNG
- Use the compound PCFG to parse the training set.
- Train an RNNG on the induced trees; fine-tune with an unsupervised RNNG.

Model                   Test PPL
Neural PCFG             252.6
Compound PCFG           196.3
RNN LM                  86.2
URNNG + Compound PCFG   83.7
URNNG + Gold Trees      78.3


SLIDE 76

Syntactic Evaluation [Marvin and Linzen 2018]
Two minimally different sentences:

  The senators near the assistant are old
  *The senators near the assistant is old

The model must assign higher probability to the correct one.
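
The evaluation protocol itself is simple to state in code; a minimal sketch, where `sentence_log_prob` is a stand-in for any of the models in the table on the next slide:

```python
pairs = [("The senators near the assistant are old",
          "The senators near the assistant is old")]   # (grammatical, ungrammatical)

def syntactic_eval_accuracy(pairs, sentence_log_prob):
    """Fraction of pairs where the model prefers the grammatical sentence."""
    hits = sum(sentence_log_prob(good) > sentence_log_prob(bad)
               for good, bad in pairs)
    return hits / len(pairs)
```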

SLIDE 77

Syntactic Evaluation [Marvin and Linzen 2018]

Model                   Test PPL   Syntactic Eval.
Neural PCFG             252.6      49.2%
Compound PCFG           196.3      50.7%
RNN LM                  86.2       60.9%
URNNG + Compound PCFG   83.7       76.1%
URNNG + Gold Trees      78.3       76.1%

Models can have similar PPL but perform very differently on such tasks.

SLIDE 78

Limitations
- Slow to train due to the cubic dynamic program.
- A latent vector to approximate lexicalized/higher-order grammars seems hacky.
- What does structure mean in the ELMo/BERT era?

SLIDE 79

Summary
- Neural PCFG: neural machinery + PCFG can induce linguistically meaningful grammars with MLE.
- Compound PCFG: a more flexible grammar through a sentence-level latent vector.
- Induced RNNG: use the induced trees to improve language models.

SLIDE 80

Future Work
- Analyses of learned models with psycholinguistics-like experiments [Wilcox et al. 2019; Futrell et al. 2019; An et al. 2019].
- Separation of "syntax" from "semantics".
- Some languages are provably not context-free ⇒ neural parameterizations of mildly context-sensitive formalisms (e.g. tree-adjoining grammars).
- Investigate why MLE with the scalar parameterization fails but the neural parameterization works.

SLIDES 81-84: REFERENCES

Aixiu An, Peng Qian, Ethan Wilcox, and Roger Levy. 2019. Representation of Constituents in Neural Language Models: Coordination Phrase as a Case Study. In Proceedings of EMNLP.
Glenn Carroll and Eugene Charniak. 1992. Two Experiments on Learning Probabilistic Dependency Grammars from Corpora. In AAAI Workshop on Statistically-Based NLP Techniques.
Michael Collins. 1997. Three Generative, Lexicalised Models for Statistical Parsing. In Proceedings of ACL.
Andrew Drozdov, Patrick Verga, Mohit Yadav, Mohit Iyyer, and Andrew McCallum. 2019. Unsupervised Latent Tree Induction with Deep Inside-Outside Recursive Auto-Encoders. In Proceedings of NAACL.
Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent Neural Network Grammars. In Proceedings of NAACL.
Richard Futrell, Ethan Wilcox, Takashi Morita, Peng Qian, Miguel Ballesteros, and Roger Levy. 2019. Neural Language Models as Psycholinguistic Subjects: Representations of Syntactic State. In Proceedings of NAACL.
Dave Golland, John DeNero, and Jakob Uszkoreit. 2012. A Feature-Rich Constituent Context Model for Grammar Induction. In Proceedings of ACL.
Yun Huang, Min Zhang, and Chew Lim Tan. 2012. Improved Constituent Context Model with Features. In Proceedings of PACLIC.
Lifeng Jin, Finale Doshi-Velez, Timothy Miller, William Schuler, and Lane Schwartz. 2018a. Depth-bounding is Effective: Improvements and Evaluation of Unsupervised PCFG Induction. In Proceedings of EMNLP.
Lifeng Jin, Finale Doshi-Velez, Timothy Miller, William Schuler, and Lane Schwartz. 2018b. Unsupervised Grammar Induction with Depth-bounded PCFG. In Transactions of the Association for Computational Linguistics (TACL).
Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2007. Bayesian Inference for PCFGs via Markov chain Monte Carlo. In Proceedings of NAACL.
Yoon Kim, Alexander M. Rush, Lei Yu, Adhiguna Kuncoro, Chris Dyer, and Gábor Melis. 2019. Unsupervised Recurrent Neural Network Grammars. In Proceedings of NAACL.
Dan Klein and Christopher Manning. 2002. A Generative Constituent-Context Model for Improved Grammar Induction. In Proceedings of ACL.
Dan Klein and Christopher Manning. 2004. Corpus-based Induction of Syntactic Structure: Models of Dependency and Constituency. In Proceedings of ACL.
Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. In Proceedings of ACL.
Karim Lari and Steve Young. 1990. The Estimation of Stochastic Context-Free Grammars Using the Inside-Outside Algorithm. Computer Speech and Language, 4:35–56.
Percy Liang, Slav Petrov, Michael I. Jordan, and Dan Klein. 2007. The Infinite PCFG using Hierarchical Dirichlet Processes. In Proceedings of EMNLP.
Rebecca Marvin and Tal Linzen. 2018. Targeted Syntactic Evaluation of Language Models. In Proceedings of EMNLP.
Hiroshi Noji, Yusuke Miyao, and Mark Johnson. 2016. Using Left-corner Parsing to Encode Universal Structural Constraints in Grammar Induction. In Proceedings of EMNLP.
Slav Petrov, Leon Barret, Romain Thibaux, and Dan Klein. 2006. Learning Accurate, Compact, and Interpretable Tree Annotation. In Proceedings of ACL.
Yikang Shen, Zhouhan Lin, Chin-Wei Huang, and Aaron Courville. 2018. Neural Language Modeling by Jointly Learning Syntax and Lexicon. In Proceedings of ICLR.
Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron Courville. 2019. Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks. In Proceedings of ICLR.
Haoyue Shi, Jiayuan Mao, Kevin Gimpel, and Karen Livescu. 2019. Visually Grounded Neural Syntax Acquisition. In Proceedings of ACL.
Noah A. Smith and Jason Eisner. 2004. Annealing Techniques for Unsupervised Statistical Language Learning. In Proceedings of ACL.
Ethan Wilcox, Peng Qian, Richard Futrell, Miguel Ballesteros, and Roger Levy. 2019. Structural Supervision Improves Learning of Non-Local Grammatical Dependencies. In Proceedings of NAACL.