

SLIDE 1

The Effect of Non-tightness on Bayesian Estimation of PCFGs

Shay Cohen (Columbia University, University of Edinburgh) and Mark Johnson (Macquarie University) August, 2013

We thank the anonymous reviewers and Giorgio Satta for their valuable comments. Shay Cohen was supported by the National Science Foundation under Grant #1136996 to the Computing Research Association for the CIFellows Project, and Mark Johnson was supported by the Australian Research Council’s Discovery Projects funding scheme (project numbers DP110102506 and DP110102593)


SLIDE 2

Probabilistic context-free grammars (PCFGs)

Probability   Rule
1.0           S → NP VP
1.0           NP → Det N
1.0           VP → V NP
0.7           Det → the
0.3           Det → a
0.4           N → cat
0.6           N → dog
0.2           V → chased
0.8           V → liked

Parse tree: (S (NP (Det the) (N cat)) (VP (V chased) (NP (Det the) (N dog))))

Tree probability = 1.0 × 1.0 × 0.7 × 0.4 × 1.0 × 0.2 × 1.0 × 0.7 × 0.6 = 0.02352
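To make the rule-product concrete, here is a minimal Python sketch of the computation above (the data structures are our own illustration, not from the slides):

```python
# Probability of a parse tree under a PCFG: the product of the
# probabilities of all rules used in its derivation.
RULE_PROBS = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 1.0,
    ("VP", ("V", "NP")): 1.0,
    ("Det", ("the",)): 0.7,
    ("Det", ("a",)): 0.3,
    ("N", ("cat",)): 0.4,
    ("N", ("dog",)): 0.6,
    ("V", ("chased",)): 0.2,
    ("V", ("liked",)): 0.8,
}

def tree_prob(tree):
    """tree is (label, child, ...) where each child is a subtree or a
    terminal string; returns the product of the rule probabilities."""
    if isinstance(tree, str):       # terminal: contributes no rule
        return 1.0
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROBS[(label, rhs)]
    for child in children:
        p *= tree_prob(child)
    return p

# (S (NP (Det the) (N cat)) (VP (V chased) (NP (Det the) (N dog))))
t = ("S",
     ("NP", ("Det", "the"), ("N", "cat")),
     ("VP", ("V", "chased"), ("NP", ("Det", "the"), ("N", "dog"))))
print(tree_prob(t))   # 0.7 * 0.4 * 0.2 * 0.7 * 0.6 = 0.02352
```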


SLIDE 3

PCFGs and tightness

  • p ∈ [0, 1]^|R| is a vector of rule probabilities indexed by the rules R
  • A PCFG associates each tree t with a measure m_p(t):

    m_p(t) = ∏_{A→α ∈ R} p_{A→α}^{n_{A→α}(t)}

    where n_{A→α}(t) is the number of times rule A → α is used in the derivation of t
  • The partition function Z_p of a PCFG is:

    Z_p = ∑_{t ∈ T} m_p(t)

  • PCFGs require the rule probabilities expanding each non-terminal to be normalised, but this does not guarantee that Z_p = 1
  • When Z_p < 1, we say the PCFG is “non-tight”
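Z_p can be computed as the least fixed point of the system Z_A = ∑_{A→α} p_{A→α} ∏_{B ∈ α} Z_B, the product running over the non-terminals in α. A short Python sketch, assuming a grammar represented as a dict from non-terminals to (rhs, probability) pairs (our own encoding, not the paper's):

```python
def partition_function(grammar, iters=1000):
    """grammar: {A: [(rhs_tuple, prob), ...]}; symbols without an entry
    are terminals.  Iterating Z_A <- sum_p p * prod_B Z_B from Z = 0
    converges to the least fixed point, which is the true Z_p."""
    def rhs_weight(rhs, Z):
        w = 1.0
        for sym in rhs:
            if sym in grammar:      # non-terminal: multiply in its Z
                w *= Z[sym]
        return w

    Z = {A: 0.0 for A in grammar}
    for _ in range(iters):
        Z = {A: sum(p * rhs_weight(rhs, Z) for rhs, p in rules)
             for A, rules in grammar.items()}
    return Z

# The toy grammar from slide 2 generates finitely many trees, so Z = 1:
g = {
    "S":   [(("NP", "VP"), 1.0)],
    "NP":  [(("Det", "N"), 1.0)],
    "VP":  [(("V", "NP"), 1.0)],
    "Det": [(("the",), 0.7), (("a",), 0.3)],
    "N":   [(("cat",), 0.4), (("dog",), 0.6)],
    "V":   [(("chased",), 0.2), (("liked",), 0.8)],
}
print(partition_function(g)["S"])   # 1.0
```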


SLIDE 4

Catalan grammar: an example of a non-tight PCFG

  • PCFG has two rules: S → S S and S → x
  • It generates strings of x of arbitrary length
  • It generates all possible finite binary trees

    ◮ or equivalently, all possible well-formed bracketings
    ◮ called the Catalan grammar because the number of parses of xⁿ is the Catalan number C_{n−1}

  • The PCFG is non-tight when pS → S S > 0.5

[Figure: the five binary parse trees of x x x x, and a plot of Z_p as a function of p_{S→SS}: Z_p = 1 for p_{S→SS} ≤ 0.5 and drops below 1 as p_{S→SS} grows beyond 0.5]
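For the Catalan grammar the fixed-point equation for the partition function is scalar: Z = p Z² + (1 − p), where p = p_{S→SS}, and its smallest non-negative root is min(1, (1 − p)/p). A quick numerical check (our own illustration):

```python
def catalan_Z(p, iters=10000):
    """Least fixed point of Z = p*Z**2 + (1 - p), iterated from Z = 0."""
    z = 0.0
    for _ in range(iters):
        z = p * z * z + (1.0 - p)
    return z

for p in (0.25, 0.5, 0.75):
    print(p, catalan_Z(p), min(1.0, (1.0 - p) / p))
# p <= 0.5 gives Z = 1 (tight); p = 0.75 gives Z = 1/3 (non-tight).
# Convergence at the critical point p = 0.5 is slow, so the iterate
# only approaches 1 there.
```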

SLIDE 5

Why can the Catalan grammar be non-tight?

  • Every binary tree over n terminals has n − 1 binary-branching non-terminal nodes
    ⇒ the probability of a tree decreases exponentially with its length
  • The number of different binary trees with n terminals is C_{n−1}
    ⇒ the number of trees the grammar generates grows exponentially with length

  • When p_{S→SS} > 0.5, the PCFG puts non-zero mass on non-terminating derivations
    ◮ the grammar defines a branching process
    ◮ at each step, p_{S→SS} is the probability of reproducing and p_{S→x} is the probability of dying
    ◮ p_{S→SS} ≤ 0.5 ⇒ the population dies out with probability 1 (subcritical, or critical at exactly 0.5)
    ◮ p_{S→SS} > 0.5 ⇒ the population grows unboundedly with positive probability (supercritical)

  • Mini-theorem: every linear PCFG is tight (except on cases of measure zero under continuous priors)
    ◮ a CFG is linear ⇔ the RHS of every rule contains at most one non-terminal
    ◮ HMMs are linear PCFGs ⇒ always tight
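The branching-process picture can be checked by simulation. In this sketch (sample sizes and the survival threshold are our own choices), each live S has two children with probability p and none otherwise, and a population that outgrows a large threshold is declared surviving; for p > 0.5 the true extinction probability is (1 − p)/p.

```python
import random

def extinction_rate(p, runs=1000, pop_limit=1000, max_gens=100000, seed=0):
    """Fraction of simulated branching processes that die out, when each
    individual (an unexpanded S) reproduces into two S's w.p. p."""
    rng = random.Random(seed)
    died = 0
    for _ in range(runs):
        pop = 1
        for _ in range(max_gens):
            if pop == 0 or pop > pop_limit:
                break
            # Each of the `pop` individuals has 2 children w.p. p, else 0.
            pop = 2 * sum(1 for _ in range(pop) if rng.random() < p)
        if pop == 0:
            died += 1
    return died / runs

r_sub = extinction_rate(0.4)   # subcritical: dies out almost surely
r_sup = extinction_rate(0.8)   # supercritical: extinction prob. (1-0.8)/0.8
print(r_sub, r_sup)            # r_sub close to 1.0, r_sup close to 0.25
```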

SLIDE 6

Bayesian inference of PCFGs

  • Bayesian inference uses Bayes’ rule to compute a posterior over rule probability vectors p:

    P(p | D) ∝ P(D | p) × P(p)
    (posterior ∝ likelihood × prior)

    where D = (D₁, …, Dₙ) is the training data (trees or strings)
  • Bayesians prefer the full posterior distribution P(p | D) to a point estimate p̂
  • If the prior assigns non-zero mass to non-tight grammars, in general the posterior will too
  • As the number of independent observations n in the training data grows, the posterior concentrates around the MLE
    ◮ the MLE is always a tight PCFG (Chi and Geman 1998)
    ◮ so as n → ∞, the posterior concentrates on tight PCFGs
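For the Catalan grammar this concentration is easy to visualise: every tree uses S → x exactly one more time than S → S S, so with a Beta(1, 1) prior the posterior after observing m binary expansions across T trees is Beta(1 + m, 1 + m + T), and its mass on the tight region p_{S→SS} ≤ 0.5 grows with T. A Monte-Carlo sketch (the counts are invented for illustration):

```python
import random

def prob_tight(n_ss, n_x, samples=20000, seed=0):
    """Estimate P(p <= 0.5) under the Beta(1 + n_ss, 1 + n_x) posterior
    over p = p_{S -> S S} for the Catalan grammar."""
    rng = random.Random(seed)
    hits = sum(rng.betavariate(1 + n_ss, 1 + n_x) <= 0.5
               for _ in range(samples))
    return hits / samples

small = prob_tight(100, 110)     # counts from 10 trees
large = prob_tight(1000, 1100)   # counts from 100 trees, same MLE
print(small, large)              # mass on tight grammars grows with data
```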

SLIDE 7

3 approaches to non-tightness in the Bayesian setting

  • If the grammar is linear, then all continuous priors lead to tight PCFGs
  • Three different approaches to Bayesian inference with non-tight grammars:
    1. “Sink element”: assign the mass of “infinite trees” to a sink element; implicitly assumed by Johnson et al. (2007)
    2. “Only tight”: redefine the prior so it only places mass on tight grammars
    3. “Renormalisation”: divide by the partition function to ensure normalisation
  • Assume for now that trees and strings are observed in D (supervised learning)

SLIDE 8

“Only tight” approach

Let I(p) be 1 if p is tight and 0 otherwise. Given a “non-tight prior” P(p), define a new prior P′ as:

    P′(p) ∝ P(p) I(p)

If P(p) is a conjugate family of priors with respect to the PCFG likelihood, then P′(p) is also conjugate.

We can draw samples from P′(p | D) using rejection sampling:

  • Draw PCFG parameters p from P(p | D) until p is tight
    ◮ P(p | D) is a product of Dirichlets
      ⇒ we can use textbook algorithms for sampling from Dirichlets
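For the Catalan grammar tightness is simply p_{S→SS} ≤ 0.5, so the rejection sampler is a few lines (a sketch in our Beta parameterisation; with more non-terminals one would draw each Dirichlet block and test tightness via the partition function):

```python
import random

def sample_only_tight(n_ss, n_x, n_samples=1000, seed=0):
    """Rejection sampling from the 'only tight' posterior for the
    Catalan grammar: draw p from the Beta posterior and keep it only
    if the resulting PCFG is tight (here: p <= 0.5)."""
    rng = random.Random(seed)
    out = []
    while len(out) < n_samples:
        p = rng.betavariate(1 + n_ss, 1 + n_x)   # posterior draw
        if p <= 0.5:                             # tightness check
            out.append(p)
    return out

samples = sample_only_tight(100, 110)
print(len(samples), max(samples))   # all accepted draws are tight
```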

SLIDE 9

Renormalisation approach

Renormalise the measure m_p(t) over finite trees (Chi, 1999). If P(p | α) is a product of Dirichlets, the posterior is:

    P(p | D) ∝ (∏_{i=1}^{n} m_p(t_i) / Z_p) × P(p | α) ∝ (1 / Z_p^n) × P(p | α + n(D))

where n(D) is the count vector over all rules for the data D.

  • Use a Metropolis-Hastings sampler to sample from P(p | D)
    ◮ the proposal distribution is the product of Dirichlets P(p | α + n(D))

Samplers for each approach can be used within a component-wise Gibbs sampler for the unsupervised case, where only strings are observed.
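For the Catalan grammar the Metropolis-Hastings sampler is short: with the Dirichlet (here Beta) posterior P(p | α + n(D)) as an independence proposal, a move from p to p′ is accepted with probability min(1, (Z_p / Z_{p′})^n), since the target differs from the proposal only by the factor Z_p^{−n}. A sketch with invented counts:

```python
import random

def Z(p):
    """Partition function of the Catalan grammar: min(1, (1 - p)/p)."""
    return min(1.0, (1.0 - p) / p)

def mh_renormalised(n_ss, n_x, n_trees, n_samples=1000, seed=0):
    """Independence MH for the renormalised posterior, proportional to
    Z(p)**(-n_trees) * Beta(p; 1 + n_ss, 1 + n_x), proposing from the
    Beta posterior itself."""
    rng = random.Random(seed)
    p = rng.betavariate(1 + n_ss, 1 + n_x)
    chain = []
    for _ in range(n_samples):
        q = rng.betavariate(1 + n_ss, 1 + n_x)           # proposal
        if rng.random() < min(1.0, (Z(p) / Z(q)) ** n_trees):
            p = q                                        # accept
        chain.append(p)
    return chain

chain = mh_renormalised(100, 110, n_trees=10)
print(sum(chain) / len(chain))   # posterior mean of p_{S -> S S}
```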

SLIDE 10

Toy example

Consider the grammar S → S S S | S S | a, and let w = a a a. Its three parses are:

    t₁ = (S (S a) (S a) (S a))
    t₂ = (S (S a) (S (S a) (S a)))
    t₃ = (S (S (S a) (S a)) (S a))

  • Uniform prior (α = 1)
  • Sink-element approach: P(t₁ | w) = 7/11 ≈ 0.636364
  • Only-tight approach: P(t₁ | w) = 11179/17221 ≈ 0.649149
  • Renormalisation approach: P(t₁ | w) ≈ 0.619893

⇒ All three approaches induce different posteriors from the same uniform prior
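The sink-element number can be reproduced with the Dirichlet-multinomial marginal likelihood: under the standard treatment, P(t | w) is proportional to the marginal probability of t's rule counts under the Dirichlet(1, 1, 1) prior. A check in Python (the closed form below is standard Dirichlet-multinomial algebra; the other two figures involve the tightness indicator or Z_p and are not reproduced here):

```python
from math import factorial

def dirichlet_multinomial(counts):
    """Marginal probability of a rule-count vector under a symmetric
    Dirichlet(1, ..., 1) prior: (K-1)! * prod(n_i!) / (N + K - 1)!."""
    K, N = len(counts), sum(counts)
    num = factorial(K - 1)
    for n in counts:
        num *= factorial(n)
    return num / factorial(N + K - 1)

# Rule counts (S -> S S S, S -> S S, S -> a) for the three parses:
t1 = (1, 0, 3)   # one ternary expansion, three terminal expansions
t2 = (0, 2, 3)   # right-branching binary tree
t3 = (0, 2, 3)   # left-branching binary tree
m = [dirichlet_multinomial(t) for t in (t1, t2, t3)]
print(m[0] / sum(m))   # 7/11 ≈ 0.636364
```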


SLIDE 11

Experiments on WSJ10

  • Task: unsupervised estimation of Smith et al. (2006)’s PCFG version of the DMV (Klein et al., 2004) from WSJ10
  • 100 runs of each sampler, each for 1,000 MCMC sweeps
  • Computed the average F1 score over every 10th sweep of the last 100 sweeps
  • Kolmogorov-Smirnov tests did not show a statistically significant difference between the approaches

[Figure: density of the average F1 scores over the 100 runs for the three inference approaches (only-tight, sink-state, renormalise); the three distributions largely overlap]

SLIDE 12

Conclusion

  • Linear CFGs are tight regardless of the prior
  • For non-linear CFGs, three approaches are suggested for handling non-tightness
  • The three approaches are not mathematically equivalent, but experiments on the WSJ Penn treebank showed that they behave similarly empirically
  • Open problem: are the approaches reducible in the following sense? Given a prior P for one of the approaches, is there a prior P′ for another approach such that, for all data D, the posteriors under both approaches are the same?
