  1. TAG, Dynamic Programming, and the Perceptron for Efficient, Feature-Rich Parsing Xavier Carreras, Michael Collins and Terry Koo MIT CSAIL

  2. Discriminative Models for Parsing Structured Prediction methods like CRFs or the Perceptron train linear models defined on factored representations of structures: Parse(x) = argmax_{y ∈ Y(x)} Σ_{r ∈ y} f(x, r) · w Main Advantage: ◮ Flexibility of feature definitions in f(x, r) Critical Difficulty: ◮ Training algorithms repeatedly parse the training sentences. Efficient parsing algorithms are crucial.
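
To make the factored model concrete, here is a small hedged sketch in Python of scoring and argmax over a factored structure. The names score, parse, parts, feature_vector and candidates are illustrative, not from the paper, and a real parser replaces the explicit enumeration of Y(x) with dynamic programming.

def score(x, y, w, feature_vector, parts):
    """Sum over the parts r of structure y of f(x, r) · w (sparse dicts)."""
    total = 0.0
    for r in parts(y):
        for feat, value in feature_vector(x, r).items():
            total += w.get(feat, 0.0) * value
    return total

def parse(x, candidates, w, feature_vector, parts):
    # argmax over candidate structures Y(x); shown here as brute-force
    # enumeration only for illustration.
    return max(candidates(x), key=lambda y: score(x, y, w, feature_vector, parts))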

  3. A Feature-rich Constituent Parsing Model We present a TAG-style model to recover constituent trees. It defines feature vectors looking at: ◮ CFG-based structure ◮ Dependency relations between lexical heads ◮ Second-order dependency relations with sibling and grandparent dependencies These can be combined with surface features of the sentence.

  4. Efficient Coarse-to-fine Inference We use a coarse-to-fine parsing strategy on dependency graphs: ◮ We use general versions of the Eisner algorithm to parse with the full TAG parser ◮ Simple first-order dependency models restrict the space of the full model, making parsing feasible We train the parser with discriminative methods at full scale.

  5. TAG + Dynamic Programming + Perceptron We use the Averaged Perceptron to train the parameters of our TAG model: ◮ w = 0, w_a = 0 ◮ For t = 1 . . . T ◮ For each training example (x, y): 1. z = Parse(x; w) 2. if y ≠ z then w = w + f(x, y) − f(x, z) 3. w_a = w_a + w ◮ return w_a We obtain state-of-the-art results for English.
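
As a concrete illustration, here is a minimal runnable sketch of this training loop in Python (not the authors' implementation): weights and feature vectors are sparse dictionaries, and parse_fn and features are hypothetical stand-ins for the TAG parser and its feature extractor.

def averaged_perceptron(train, parse_fn, features, T=10):
    w, w_avg = {}, {}
    for _ in range(T):
        for x, y in train:
            z = parse_fn(x, w)                      # 1. parse with current weights
            if z != y:                              # 2. update on mistakes
                for feat, val in features(x, y).items():
                    w[feat] = w.get(feat, 0.0) + val
                for feat, val in features(x, z).items():
                    w[feat] = w.get(feat, 0.0) - val
            for feat, val in w.items():             # 3. accumulate for averaging
                w_avg[feat] = w_avg.get(feat, 0.0) + val
    return w_avg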

  6. Outline ◮ A TAG-style Linear Model for Constituent Parsing ◮ Representation: Spines and Adjunctions ◮ Model and Features ◮ Fast Inference with our TAG ◮ Parsing the WSJ Treebank

  7. Tree-Adjoining Grammar (TAG) ◮ In TAG formalisms [Joshi et al. 1975] : ◮ The basic elements are trees ◮ Trees can be combined to form bigger trees ◮ There are many variations of TAG ◮ Here we present a simple TAG-style grammar: ◮ Allows rich features ◮ Allows efficient inference

  8. Decomposing trees into spines and adjunctions [tree diagrams: the constituent tree for "Mary eats the cake with almonds" is decomposed into per-word spines linked by adjunctions] Syntactic constituents sit on top of their lexical heads. The underlying structure looks like a dependency structure.

  9. Spines Spines are lexical units with a chain of unary projections. They are the elementary trees in our TAG. (see also [Shen & Joshi 2005]) [diagrams: example spines for "the", "Mary", "eats", "loves", "cake", "door", "quickly", and "with"] We build a dictionary of spines appearing in the WSJ.

  10. Sister Adjunctions Sister adjunctions are used to combine spines to form trees. [tree diagram: the spines of "Mary" and "eats" combined] An adjunction operation attaches: ◮ A modifier spine ◮ To some position of a head spine

  11. Sister Adjunctions Sister adjunctions are used to combine spines to form trees. [tree diagram: the growing tree for "Mary eats the cake"] An adjunction operation attaches: ◮ A modifier spine ◮ To some position of a head spine

  12. Sister Adjunctions Sister adjunctions are used to combine spines to form trees. [tree diagram: the full tree for "Mary eats the cake with almonds"] An adjunction operation attaches: ◮ A modifier spine ◮ To some position of a head spine

  13. Regular Adjunctions We also consider a regular adjunction operation. It adds one level to the syntactic constituent it attaches to. [tree diagrams: regular adjunction of the relative clause "who play" onto the NP "the boys"] N.B.: This operation is simpler than adjunction in classic TAG, resulting in more efficient parsing costs.

  14. Derivations in our TAG A tree is a set with two types of elements: spines ⟨i, σ⟩ and adjunctions ⟨h, m, σ_h, σ_m, POS, A⟩ [tree diagram: an example spine for "eat" and an adjunction attaching the spine of "cake"] where i: word position; h, m: head and modifier positions; σ: a spine; σ_h, σ_m: spines of h and m; POS: the attachment position; A: sister or regular
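
One possible way to represent these two part types in code, as a sketch only: it assumes a spine is just a chain of nonterminal labels, and the class names are hypothetical rather than the paper's.

from dataclasses import dataclass
from typing import FrozenSet, Tuple, Union

@dataclass(frozen=True)
class Spine:
    i: int                  # word position
    sigma: Tuple[str, ...]  # chain of unary projections, e.g. ("v", "VP", "S")

@dataclass(frozen=True)
class Adjunction:
    h: int                  # head position
    m: int                  # modifier position
    sigma_h: Tuple[str, ...]
    sigma_m: Tuple[str, ...]
    pos: int                # attachment position on the head spine
    kind: str               # "sister" or "regular"

# A derivation is just a set of such parts.
Derivation = FrozenSet[Union[Spine, Adjunction]]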

  15. A TAG-style Linear Model [diagram: spine features f_s(x, ⟨i, σ⟩) and adjunction features f_a(x, ⟨h, m, σ_h, σ_m, POS, A⟩) for the phrase "the boys eat a cake with ..."] Parser(x) = argmax_{y ∈ Y(x)} [ Σ_{⟨i,σ⟩ ∈ S(y)} f_s(x, ⟨i, σ⟩) · w + Σ_{⟨h,m,...⟩ ∈ A(y)} f_a(x, ⟨h, m, ...⟩) · w ]
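
Reusing the hypothetical Spine/Adjunction classes from the sketch above, the score of a derivation is simply the sum of spine-feature and adjunction-feature dot products; spine_features and adjunction_features are illustrative stand-ins for f_s and f_a.

def derivation_score(x, derivation, w, spine_features, adjunction_features):
    total = 0.0
    for part in derivation:
        feats = (spine_features(x, part) if isinstance(part, Spine)
                 else adjunction_features(x, part))
        for name, value in feats.items():
            total += w.get(name, 0.0) * value
    return total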

  16. Outline ◮ A TAG-style Linear Model for Constituent Parsing ◮ Representation: Spines and Adjunctions ◮ Model and Features ◮ Fast Inference with our TAG ◮ Parsing the WSJ Treebank

  17. Parsing with the Eisner Algorithms ◮ Our TAG structures are a general form of dependency graph: ◮ Dependencies are adjunctions between spines ◮ Labels include the type and position of the adjunction ◮ Parsing can be done with the Eisner [1996, 2000] algorithms ◮ Applies to splittable dependency representations, i.e., left and right modifiers are adjoined independently ◮ Words in the dependency graph can have senses, like our spines ◮ Parsing time is O(n^3 G) ◮ Can be extended to include second-order features.
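
For readers unfamiliar with the Eisner algorithm, here is a deliberately simplified sketch of the first-order, unlabeled case (scores only; no spines, senses, labels, or backpointers), showing the O(n^3) span-based dynamic program that the full TAG parser generalizes.

def eisner_best_score(arc_score):
    """arc_score[h][m]: score of attaching word m to head h; index 0 is the root.
    Returns the best total score of a projective dependency tree."""
    n = len(arc_score)
    # complete[s][t][d] / incomplete[s][t][d]: best score of span (s, t)
    # with the head on the right (d = 0) or on the left (d = 1).
    complete = [[[0.0, 0.0] for _ in range(n)] for _ in range(n)]
    incomplete = [[[0.0, 0.0] for _ in range(n)] for _ in range(n)]
    for length in range(1, n):
        for s in range(n - length):
            t = s + length
            # incomplete spans: add an arc between the span endpoints
            best = max(complete[s][r][1] + complete[r + 1][t][0]
                       for r in range(s, t))
            incomplete[s][t][0] = best + arc_score[t][s]  # t is the head of s
            incomplete[s][t][1] = best + arc_score[s][t]  # s is the head of t
            # complete spans: combine an incomplete span with a complete one
            complete[s][t][0] = max(complete[s][r][0] + incomplete[r][t][0]
                                    for r in range(s, t))
            complete[s][t][1] = max(incomplete[s][r][1] + complete[r][t][1]
                                    for r in range(s + 1, t + 1))
    return complete[0][n - 1][1]

# Example with three words plus the artificial root at index 0.
arc_score = [[0.0, 2.0, 1.0, 1.0],
             [0.0, 0.0, 3.0, 0.0],
             [0.0, 1.0, 0.0, 2.0],
             [0.0, 0.0, 1.0, 0.0]]
print(eisner_best_score(arc_score))  # -> 7.0 (arcs 0->1, 1->2, 2->3)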

  18. Second-Order Features in our TAG We incorporate recent extensions to the Eisner algorithm: sibling dependencies, parsed in O(n^3 G) [Eisner 2000] [McDonald & Pereira, 2006], and grandchildren dependencies, parsed in O(n^4 G) [Carreras, 2007]. [tree diagrams: sibling and grandchild configurations in "boys eat a cake with a fork"]
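
A hedged sketch of what second-order feature templates over the dependency backbone might look like: sibling features look at (head, modifier, previous modifier) and grandchild features at (grandparent, head, modifier). The template names below are illustrative only and are not the paper's actual feature set.

def sibling_features(words, head, mod, prev_mod):
    # features over a head and two adjacent modifiers on the same side
    return {
        f"sib:{words[head]}|{words[mod]}|{words[prev_mod]}": 1.0,
        f"sib-pair:{words[mod]}|{words[prev_mod]}": 1.0,
    }

def grandchild_features(words, grandparent, head, mod):
    # features over a chain grandparent -> head -> modifier
    return {
        f"gc:{words[grandparent]}|{words[head]}|{words[mod]}": 1.0,
        f"gc-outer:{words[grandparent]}|{words[mod]}": 1.0,
    }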

  19. Exact Inference is Too Expensive ◮ Parsing time is at least O(n^3 G) (it is O(n^4 G) in our final model) ◮ The constant G is polynomial in the number of possible spines for any word, and the maximum height of any spine. This is prohibitive for real parsing tasks (G > 5000). ◮ Solution: Coarse-to-fine inference (e.g. [Charniak 97] [Charniak & Johnson 05] [Petrov & Klein 07]) ◮ Use simple dependency parsing models to restrict the space of possible structures of the full model.

  20. A Coarse-to-fine Strategy for Fast Parsing [diagram: the dependency "eat → cake" with its fine-grained label factored into simpler parts] µ(x, h, m, t) = µ_H(x, h, m, t_H) × µ_P(x, h, m, t_P) × µ_M(x, h, m, t_M) ◮ First-order dependency models estimate conditional distributions of simple dependencies ◮ We build a beam of the most likely dependencies: ◮ Inside-Outside inference, in O(n^3 H) with H ∼ 50 ◮ We can discard 99.6% of dependencies and retain 98.5% of correct constituents ◮ The full model is constrained to the pruned space both at training and testing
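
A hedged sketch of the pruning step, assuming we already have marginal probabilities for first-order dependencies (e.g. from inside-outside in the simple model): keep only the heads whose marginal clears a threshold relative to the best head for that modifier, and restrict the full model to parses built from the surviving arcs. The names and the exact thresholding rule are illustrative, not taken from the paper.

def prune_arcs(marginals, threshold=1e-4):
    """marginals[m][h]: marginal probability that word m attaches to head h.
    Returns, for each modifier m, the set of allowed heads."""
    allowed = {}
    for m, head_probs in marginals.items():
        best = max(head_probs.values())
        allowed[m] = {h for h, p in head_probs.items() if p >= threshold * best}
    return allowed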

  21. A TAG-style Linear Model: Summary A simple TAG-style model, based on spines and adjunctions: ◮ It allows a wide variety of features ◮ It's splittable, allowing efficient inference ◮ O(n^3 G) for CFG-style, head-modifier and sibling features ◮ O(n^4 G) for grandchildren dependency features ◮ The backbone dependency graph can be pruned with simple first-order dependency models Other TAG formalisms have more expensive parsing algorithms [Chiang 2003] [Shen & Joshi 2005].

  22. Outline ◮ A TAG-style Linear Model for Constituent Parsing ◮ Representation: Spines and Adjunctions ◮ Model and Features ◮ Fast Inference with our TAG ◮ Parsing the WSJ Treebank

  23. Parsing the WSJ Treebank ◮ Extraction of our TAG derivations from WSJ trees ◮ Straightforward process using the head rules of [Collins 1999] ◮ ∼ 300 spines, ∼ 20 spines/token ◮ Learning: ◮ Train first-order models using EG [Collins et al. 2008]: 5 training passes, 5 hours per pass ◮ Train the TAG-style full model using the Avg. Perceptron: 10 training passes, 12 hours per pass ◮ Parse test data and evaluate

  24. Test results on WSJ data
      Full Parsers                precision   recall   F1
      Charniak 2000               89.5        89.6     89.6
      Petrov & Klein 2007         90.2        89.9     90.1
      this work                   91.4        90.7     91.1
      Rerankers                   precision   recall   F1
      Collins 2000                89.9        89.6     89.8
      Charniak & Johnson 2005     ·           ·        91.4
      Huang 2008                  ·           ·        91.7

  25. Evaluating Dependencies ◮ We look at the accuracy of recovering unlabeled dependencies ◮ We compare to state-of-the-art dependency parsing models using the same features and learner:
      training structures           dependency accuracy
      unlabeled dependencies (*)    92.0
      labeled dependencies (*)      92.5
      adjoined spines               93.5
      (*) results from [Koo et al., ACL 2008]
      Constituent structure greatly helps parsing performance.

  26. Summary A new efficient and expressive discriminative model for full constituent parsing: ◮ Represents phrase structure with a TAG-style grammar ◮ Has rich features combining phrase structure and lexical heads, because spines are the basic elements ◮ Parsing is efficient with the Eisner methods, due to the splittable nature of our adjunctions A very effective method to prune dependency-based graphs: key to discriminative training at full scale

  27. Thanks!
