

SLIDE 1

Probabilistic parsing with a wide variety of features

Mark Johnson Brown University IJCNLP, March 2004

Joint work with Eugene Charniak (Brown) and Michael Collins (MIT) Supported by NSF grants LIS 9720368 and IIS0095940

SLIDE 2

Talk outline

  • Statistical parsing models
  • Discriminatively trained reranking models
    – features for selecting good parses
    – estimation methods
    – evaluation
  • Conclusion and future work

SLIDE 3

Approaches to statistical parsing

  • Kinds of models: “Rationalist” vs. “Empiricist”
    – based on linguistic theories (CCG, HPSG, LFG, TAG, etc.)
      • typically use specialized representations
    – models of trees in a training corpus (Charniak, Collins, etc.)
  • Grammars are typically hand-written or extracted from a corpus (or both?)
    – both methods require linguistic knowledge
    – each method is affected differently by
      • lack of linguistic knowledge (or resources needed to enter it)
      • errors and inconsistencies

SLIDE 4

Features in linear models

  • (Statistical) features are real-valued functions of parses (e.g., in a PCFG, the number of times a rule is used in a tree)
  • A model associates a real-valued weight with each feature (e.g., the log of the rule’s probability)
  • The score of a parse is the weighted sum of its feature values (the tree’s log probability)
  • Higher scoring parses are more likely to be correct
  • Computational complexity of estimation (training) depends on how these features interact
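To make the weighted-sum idea concrete, here is a minimal Python sketch (illustrative, not from the talk): a toy rule-counting feature function over trees represented as nested tuples, and a score that is the dot product of weights and feature values.

```python
# Toy sketch of a linear feature-based model (illustrative names).
# A parse is a nested tuple: ("S", ("NP", "they"), ("VP", "ran")).

def rule_count_features(parse, rules):
    """f(y): how often each tracked grammar rule occurs in the parse."""
    counts = dict.fromkeys(rules, 0.0)

    def visit(node):
        if isinstance(node, tuple):
            label = lambda c: c[0] if isinstance(c, tuple) else c
            rule = (node[0], tuple(label(c) for c in node[1:]))
            if rule in counts:
                counts[rule] += 1.0
            for child in node[1:]:
                visit(child)

    visit(parse)
    return [counts[r] for r in rules]

def score(w, fvec):
    """S_w(y) = w · f(y): the weighted sum of feature values."""
    return sum(wj * fj for wj, fj in zip(w, fvec))

# Example: with one rule feature whose weight is log(0.7), a tree using
# that rule twice scores 2 * log(0.7), i.e., a log probability.
```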

SLIDE 5

Feature dependencies and complexity

  • “Generative” models (features and constraints induce tree-structured dependencies, e.g., PCFGs, TAGs)
    – maximum likelihood estimation is computationally cheap (counting occurrences of features in training data)
    – crafting a model with a given set of features can be difficult
  • “Conditional” or “discriminative” models (features have arbitrary dependencies, e.g., SUBGs)
    – maximum likelihood estimation is computationally intractable (as far as we know)
    – conditional estimation is computationally feasible but expensive
    – features can be arbitrary functions of parses

SLIDE 6

Why coarse-to-fine discriminative reranking?

  • Question: What are the best features for statistical parsing?
  • Intuition: The choice of features matters more than the grammar formalism or parsing method
  • Are global features of the parse tree useful?
    ⇒ Choose a framework that makes experimenting with features as easy as possible
  • Coarse-to-fine discriminative reranking is such a framework
    – features can be arbitrary functions of parse trees
    – computational complexity is manageable
  • Why a Penn tree-bank parsing model?

SLIDE 7

The parsing problem

[Diagram: the set Y of all parses, containing Y(x), the set of parses of string x; y ∈ Y(x) is a parse for string x]

  • Y = set of all parses, Y(x) = set of parses of string x
  • f = (f_1, . . . , f_m) are real-valued feature functions (e.g., f_22(y) = number of times an S dominates a VP in y)
  • So f(y) = (f_1(y), . . . , f_m(y)) is a real-valued vector
  • w = (w_1, . . . , w_m) is a weight vector, which we learn from training data
  • S_w(y) = w · f(y) = Σ_{j=1}^{m} w_j f_j(y) is the score of parse y

SLIDE 8

Conditional training

[Diagram: the correct parse y_i inside Y(x_i), the set of parses of x_i]

  • Labelled training data D = ((x_1, y_1), . . . , (x_n, y_n)), where y_i is the correct parse for x_i
  • Parsing: return the parse y ∈ Y(x) with the highest score
  • Conditional training: find a weight vector w so that the correct parse y_i scores “better” than any other parse in Y(x_i)
  • There are many different algorithms for doing this (MaxEnt, Perceptron, SVMs, etc.)
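As an illustrative instance of one of these algorithms, here is a minimal perceptron-style sketch (hypothetical data layout, not the talk’s implementation): each update moves the weights toward the correct parse’s feature vector and away from the feature vector of the parse the current model wrongly prefers.

```python
# Perceptron-style conditional training (sketch). Each example pairs the
# correct parse's feature vector with the feature vectors of all parses
# in Y(x_i).

def perceptron_train(examples, m, epochs=10):
    """examples: list of (correct_fvec, all_fvecs); m = number of features."""
    w = [0.0] * m
    for _ in range(epochs):
        for correct_f, candidates in examples:
            # The parse the current weights like best.
            best = max(candidates,
                       key=lambda f: sum(wj * fj for wj, fj in zip(w, f)))
            if best != correct_f:
                for j in range(m):
                    w[j] += correct_f[j] - best[j]  # promote correct, demote best
    return w
```

The averaged perceptron mentioned in the results later in the talk additionally averages the weight vectors across updates, which usually generalizes better.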

SLIDE 9

Another view of conditional training

               Correct parse’s features   All other parses’ features
  sentence 1   [1, 3, 2]                  [2, 2, 3]  [3, 1, 5]  [2, 6, 3]
  sentence 2   [7, 2, 1]                  [2, 5, 5]
  sentence 3   [2, 4, 2]                  [1, 1, 7]  [7, 2, 1]
  . . .        . . .                      . . .

  • Training data is fully observed (i.e., parsed data)
  • Choose w to maximize the score of correct parses relative to other parses
  • The distribution of sentences is ignored
    – The models learnt by this kind of conditional training can’t be used as language models
  • Nothing is learnt from unambiguous examples

SLIDE 10

A coarse-to-fine approximation

  • The set of parses Y(x) can be huge!
  • The Collins Model 2 parser produces a set of candidate parses Y_c(x) for each sentence x
  • The score for each parse is S_w(y) = w · f(y)
  • The highest scoring parse y* = argmax_{y ∈ Y_c(x)} S_w(y) is predicted correct

[Diagram: string x → Collins Model 2 parses Y_c(x) = {y_1, . . . , y_k} → features f(y_1), . . . , f(y_k) → scores w · f(y_1), . . . , w · f(y_k)]

(Collins 1999 “Discriminative reranking”)
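A sketch of the reranking step in the diagram, with coarse_parser.parse and extract_features as hypothetical stand-ins for the Collins Model 2 n-best parser and the feature functions f:

```python
# Coarse-to-fine reranking (sketch): a coarse parser proposes Yc(x),
# the fine model rescores with w · f(y) and returns the argmax.

def rerank(x, coarse_parser, extract_features, w):
    candidates = coarse_parser.parse(x)          # Yc(x), a few dozen parses
    def s_w(y):                                  # S_w(y) = w · f(y)
        return sum(wj * fj for wj, fj in zip(w, extract_features(y)))
    return max(candidates, key=s_w)              # y* = argmax over Yc(x)
```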

SLIDE 11

Advantages of this approach

  • The Collins parser only uses features for which there is a fast dynamic programming algorithm
  • The set of parses Y_c(x) it produces is small enough that dynamic programming is not necessary
  • This gives us almost complete freedom to formulate and explore possible features
  • We’re already starting from a good baseline . . .
  • . . . but we only produce Penn treebank trees (instead of something deeper)
  • and parser evaluation with respect to the Penn treebank is standard in the field

SLIDE 12

A complication

  • Intuition: the discriminative learner should learn the common error modes of the Collins parser
  • Obvious approach: parse the training data with the Collins parser
  • When run on the training section of the PTB, the Collins parser does much better on the training section than it does on other text!
  • Train the discriminative model from parser output on text the parser was not trained on
  • Use a cross-validation paradigm to produce discriminative training data (divide the training data into 10 sections; see the sketch below)
  • Development data described here is from PTB sections 20 and 21
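A sketch of that cross-validation setup (the interfaces train_parser and parser.nbest_parse are hypothetical, and the real split used contiguous treebank sections rather than the strided folds used here for brevity):

```python
# Build discriminative training data by parsing each fold with a coarse
# parser trained on the remaining folds, so the reranker sees the kinds
# of errors the parser makes on text it was NOT trained on.

def make_reranker_data(treebank, train_parser, k=10):
    folds = [treebank[i::k] for i in range(k)]
    data = []
    for i, held_out in enumerate(folds):
        rest = [t for j, fold in enumerate(folds) if j != i for t in fold]
        parser = train_parser(rest)
        for sentence, gold_tree in held_out:
            data.append((gold_tree, parser.nbest_parse(sentence)))
    return data
```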

SLIDE 13

Another complication

  • Training data ((x_1, y_1), . . . , (x_n, y_n))
  • Each string x_i is parsed using the Collins parser, producing a set Y_c(x_i) of parse trees
  • The correct parse y_i might not be in the Collins parses Y_c(x_i)
  • Let ỹ_i = argmax_{y ∈ Y_c(x_i)} F_{y_i}(y) be the best Collins parse, where F_{y′}(y) measures parse accuracy
  • Choose w to discriminate ỹ_i from the other parses in Y_c(x_i)

[Diagram: the correct parse y_i lies outside the candidate set Y_c(x_i); ỹ_i is the closest candidate]
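A sketch of selecting ỹ_i, assuming a hypothetical brackets(tree) helper that returns the set of labelled spans of a tree (sets rather than multisets, a simplification) and bracket f-score as the accuracy measure F:

```python
# Pick the candidate parse closest to the gold tree under bracket f-score.

def fscore(gold_spans, test_spans):
    matched = len(gold_spans & test_spans)
    if matched == 0:
        return 0.0
    precision = matched / len(test_spans)
    recall = matched / len(gold_spans)
    return 2 * precision * recall / (precision + recall)

def best_candidate(gold_tree, candidates, brackets):
    gold_spans = brackets(gold_tree)             # {(label, start, end), ...}
    return max(candidates, key=lambda y: fscore(gold_spans, brackets(y)))
```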

SLIDE 14

Multiple best parses

[Diagram: several parses in Y_c(x_i) equally close to the correct parse y_i]

  • There can be several Collins parses equally close to the correct parse: which one(s) should we declare to be the best parse?
  • Weighting all close parses equally does not work as well (f-score 0.9025) as . . .
  • picking the parse with the highest Collins parse probability (0.9036), but . . .
  • letting the model pick its own winner from the close parses (EM-like scheme in Riezler ’02) works best of all (0.904)

SLIDE 15

Baseline and oracle results

  • Training corpus: 36,112 Penn treebank trees from sections 2–19; development corpus: 3,720 trees from sections 20–21
  • The Collins Model 2 parser failed to produce a parse on 115 sentences
  • Average |Y_c(x)| = 36.1
  • Model 2 f-score = 0.882 (picking the parse with the highest Model 2 probability)
  • Oracle (maximum possible) f-score = 0.953 (i.e., the f-score of the closest parses ỹ_i)
    ⇒ Oracle (maximum possible) error reduction = (0.953 − 0.882)/(1 − 0.882) ≈ 0.601

SLIDE 16

Expt 1: Only “old” features

  • Features: log Model 2 probability (1 feature) and local tree features (9,717 features)
  • Model 2 already conditions on local trees!
  • Feature selection: features must vary on 5 or more sentences
  • Results: f-score = 0.886; ≈ 4% error reduction
    ⇒ discriminative training alone can improve accuracy

[Parse tree of “That went over the permissible line for warm and fuzzy feelings.”]

SLIDE 17

Expt 2: Rightmost branch bias

  • The RightBranch feature’s value is the number of nodes on the right-most branch (ignoring punctuation)
  • Reflects the tendency toward right branching
  • LogProb + RightBranch: f-score = 0.884 (probably significant)
  • LogProb + RightBranch + Rule: f-score = 0.889

[Parse tree of “That went over the permissible line for warm and fuzzy feelings.”, drawn to highlight its right-most branch]
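A sketch of how the RightBranch value might be computed, assuming trees are nested tuples with the category label first and a guessed punctuation tag set:

```python
# Count the nodes on the right-most branch of a tree, skipping
# punctuation (PUNCT is an assumed tag set, not from the talk).

PUNCT = {".", ",", ":", "``", "''"}

def right_branch_size(node):
    count = 0
    while isinstance(node, tuple):
        count += 1
        children = [c for c in node[1:]
                    if not (isinstance(c, tuple) and c[0] in PUNCT)]
        if not children or not isinstance(children[-1], tuple):
            break
        node = children[-1]                      # descend to rightmost child
    return count
```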

SLIDE 18

Lexicalized and parent-annotated rules

  • Lexicalization associates each constituent with its head
  • Parent annotation provides a little “vertical context”
  • With all combinations, there are 158,890 rule features

[Parse tree of “That went over the permissible line for warm and fuzzy feelings.”, annotated with heads, a rule, and a grandparent node]

SLIDE 19

n-gram rule features generalize rules

  • Collects adjacent constituents in a local tree
  • Also includes relationship to head
  • Constituents can be ancestor-annotated and lexicalized
  • 5,143 unlexicalized rule bigram features, 43,480 lexicalized rule bigram features

[Parse tree of “The clash is a sign of a new toughness and divisiveness in Japan’s once-cozy financial circles.”, marking constituents left of the head and non-adjacent to the head]

SLIDE 20

Head to head dependencies

  • Head-to-head dependencies track the function-argument dependencies in a tree
  • Co-ordination leads to phrases with multiple heads and arguments
  • With all combinations, there are 121,885 head-to-head features

[Parse tree of “That went over the permissible line for warm and fuzzy feelings.”]
SLIDE 21

Head trees record all dependencies

  • Head trees consist of a (lexical) head, all of its projections, and (optionally) all of the siblings of these nodes
  • These correspond roughly to TAG elementary trees

[Parse tree of “They were consulted in advance.”]

SLIDE 22

Constituent Heavyness and location

  • Heavyness measures the constituent’s category, its (binned) size, and its (binned) closeness to the end of the sentence
  • There are 984 Heavyness features

[Parse tree of “That went over the permissible line for warm and fuzzy feelings.”, with a constituent marked “> 5 words”, “=1 punctuation”]

SLIDE 23

Tree n-gram

  • A tree n-gram is a tree fragment that connects a sequence of n adjacent words
  • There are 62,487 tree n-gram features

[Parse tree of “That went over the permissible line for warm and fuzzy feelings.”]

SLIDE 24

Subject-Verb Agreement

  • The SubjVerbAgr features are the POS of the subject NP’s lexical head and of the VP’s functional head
  • There are 200 SubjVerbAgr features

[Parse tree of “The rules force executives to report purchases.”]

SLIDE 25

Functional-lexical head dependencies

  • The SynSemHeads features collect pairs of functional and lexical heads of phrases (Grimshaw)
  • This captures number agreement in NPs and aspects of other head-to-head dependencies
  • There are 1,606 SynSemHeads features

[Parse tree of “The rules force executives to report purchases.”]

SLIDE 26

Coordination parallelism (1)

  • The CoPar feature indicates the depth to which adjacent conjuncts are parallel
  • There are 9 CoPar features

[Parse tree of “They were consulted in advance and were surprised at the action taken.”: the two conjoined VPs are isomorphic trees to depth 4]
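A sketch of one way to compute that depth (the talk does not spell out the exact definition, so this is a guess at it): two trees are parallel to depth d if their labels match and their children are pairwise parallel to depth d − 1.

```python
# Depth of structural parallelism between two conjuncts (sketch).
# Trees are nested tuples with the category label first; leaves are words.

def parallel_depth(a, b):
    if isinstance(a, str) and isinstance(b, str):
        return 1                                 # two words
    if isinstance(a, str) or isinstance(b, str):
        return 0                                 # word vs. phrase
    if a[0] != b[0] or len(a) != len(b):
        return 0                                 # label or arity mismatch
    if len(a) == 1:
        return 1
    return 1 + min(parallel_depth(x, y) for x, y in zip(a[1:], b[1:]))
```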

SLIDE 27

Coordination parallelism (2)

  • The CoLenPar feature indicates the difference in length of adjacent conjuncts and whether the pair contains the last conjunct
  • There are 22 CoLenPar features

[Parse tree of “They were consulted in advance and were surprised at the action taken.”: conjuncts of 4 and 6 words yield CoLenPar feature (2, true)]
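A sketch consistent with the example above, taking the word counts of the conjuncts in a coordination (hypothetical helper, shown before any binning):

```python
# CoLenPar (sketch): for each pair of adjacent conjuncts, emit the
# length difference and whether the pair includes the final conjunct.

def colenpar(conjunct_lengths):
    feats = []
    for i in range(len(conjunct_lengths) - 1):
        diff = abs(conjunct_lengths[i] - conjunct_lengths[i + 1])
        feats.append((diff, i + 1 == len(conjunct_lengths) - 1))
    return feats

# colenpar([4, 6]) == [(2, True)], matching the example in the figure.
```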

SLIDE 28

Experimental results with all features

  • Feature selection: features must vary on the parses of at least 5 sentences in the training data (a cutoff of 2 improves results)
  • In this experiment, 883,936 features
  • Log loss with Gaussian regularization term 11 Σ_j w_j²:
    – dev set results: f-score = 0.903–0.904
    – section 23 results: f-score = 0.9039 (≈ 20% error reduction); 47% of sentences have f-score = 1
  • Exp loss with Gaussian regularization term 50 Σ_j w_j²:
    – dev set results: f-score = 0.902
  • Averaged perceptron classifier (very fast!)
    – dev set results: f-score = 0.902 (with feature class tuning)

SLIDE 29

Which kinds of features are best?

  Features from                # of features   f-score
  Treebank trees               375,646         0.901
  Correct parses               271,267         0.902
  Incorrect parses             876,339         0.903
  Correct & incorrect parses   883,936         0.903

  • Features from incorrect parses characterize failure modes of the Collins parser
  • There are far more ways to be wrong than to be right!

SLIDE 30

Feature classes overview

  # of feat.   av. value      s.d.          feat. class
  1            0.416674       –             LogProb
  2            −0.376498      0.000265398   RightBranch
  9            0.117017       0.0371904     CoPar
  22           0.0133718      0.0196021     CoLenPar
  200          −0.000552325   0.00364032    SubjVerbAgr
  984          −0.00118015    0.00613362    Heavy
  1,606        0.00145433     0.00196207    SynSemHeads
  37,068       0.000505719    0.000953109   Word
  48,623       6.68076e-05    0.00145942    NGram
  122,189      0.000623527    0.000679083   WProj
  160,582      0.00063112     0.000969829   Heads
  203,979      0.000393769    0.000832161   NGramTree
  223,354      0.000344003    0.000813581   Rule
SLIDE 31

Evaluating feature classes

  ∆ f-score      ∆ −log CP   ∆ correct   ∆ best poss.   zeroed class
  −0.00909743    3042.76     −123        −132           LogProb
  −0.0034855     −107.341    17          −42            Rule
  −0.00316443    120.551     −31         −64            NGram
  −0.00292884    50.4752     −20         −44            Heads
  −0.00248576    73.3785     −18         −25            Heavy
  −0.00239372    251.753     −74         −27            RightBranch
  −0.00208603    157.478     −19         −31            NGramTree
  −0.00199449    130.832     −28         −36            WProj
  −0.000761952   11.0709     5           −4             Word
  −0.000422497   7.1691      6           −5             CoLenPar
  −0.000368866   −14.2518    1           2              SynSemHeads
  −0.000230322   11.3504     −9          −4             CoPar
  −0.000100725   −14.7814    −2          –              SubjVerbAgr

SLIDE 32

Informal error analysis

  • Manual examination of the first 100 sentences of development data
  • Preliminary classification of the “type” of each parser error
  • Multiple errors per sentence were found

  Error type             Reranker   Coarse parser
  PP attach              19         3
  Coordination           8          2
  Category misanalysis   7          1
  Other attachment       4          9
  Compounding            2          3
  Other errors           2          4

  Also found: 14 PTB errors, 7 PTB ambiguities (suggested by Yusuke Miyao)

SLIDE 33

Sample PP attachment error (1/2)

In composite trading on the New York Stock Exchange, GTE rose $1.25 to $64.125.

[Tree diagram]

Parse tree

SLIDE 34

Sample PP attachment error (2/2)

In composite trading on the New York Stock Exchange, GTE rose $1.25 to $64.125.

[Tree diagram]

Gold (treebank) tree

SLIDE 35

Coordination error (1/2)

Earlier rate reductions in Texas and California reduced the quarter’s revenue and operating profit $55 million; a year earlier, operating profit in telephone operations was reduced by a similar amount as a result of a provision for a reorganization.

[Tree diagram of the NP “the quarter’s revenue and operating profit”]

Parse tree

SLIDE 36

Coordination error (2/2)

Earlier rate reductions in Texas and California reduced the quarter’s revenue and operating profit $55 million; a year earlier, operating profit in telephone operations was reduced by a similar amount as a result of a provision for a reorganization.

[Tree diagram of the NP “the quarter’s revenue and operating profit”, with NX conjuncts]

Gold (treebank) tree

SLIDE 37

Category misanalysis error (1/2)

Electrical products’ sales fell to $496.7 million from $504.5 million with higher world-wide lighting volume offset by lower domestic prices and the impact of weaker currencies in Europe and South America.

[Tree diagram: “with higher world-wide lighting volume offset by lower domestic prices . . .” analysed as a PP]

Parse tree

SLIDE 38

Category misanalysis (2/2)

Electrical products’ sales fell to $496.7 million from $504.5 million with higher world-wide lighting volume offset by lower domestic prices and the impact of weaker currencies in Europe and South America.

[Tree diagram: the same string analysed as an SBAR]

Gold (treebank) tree

SLIDE 39

Multiple attachment errors (1/3)

The company wants its business mix to more closely match that of AT&T – a step it says will help prevent cross subsidization.

[Tree diagram]

Parse tree

SLIDE 40

Multiple attachment errors (2/3)

The company wants its business mix to more closely match that of AT&T – a step it says will help prevent cross subsidization.

[Tree diagram]

Gold (treebank) tree

SLIDE 41

Multiple attachment errors (3/3)

The company wants its business mix to more closely match that of AT&T – a step it says will help prevent cross subsidization.

[Tree diagram]

Best coarse tree

SLIDE 42

Technical summary

  • Generative and discriminative parsers both identify the likely parse y of a string x, e.g., by estimating P(y|x)
  • Generative parsers also define language models, i.e., estimate P(x)
  • Discriminative estimation doesn’t require feature independence
    – suitable for models without tree-structured feature dependencies
  • Parsing is equally complex for generative and discriminative parsers
    – depends on the features used
    – coarse-to-fine approaches use one parser to narrow the search space for another
  • Estimation is computationally inexpensive for generative parsers, but expensive for discriminative parsers
  • Because a discriminative parser can use the generative model’s probability estimate as a feature, discriminative parsers almost never do worse than the generative model, and often do substantially better.

SLIDE 43

Conclusions

  • Discriminatively trained parsing models can perform better than standard generative parsing models
  • Features can be arbitrary functions of parse trees
    – Non-local features can make a big difference!
    – Difficult to tell which features are most useful
    – Better evaluation (maybe requires real parsing applications?)
  • Coarse-to-fine results in (moderately) efficient algorithms
  • The parser’s errors are often recognizable as certain types of mistakes
    – PP attachment is still a serious issue!

SLIDE 44

Future directions

  • More features (fix those PP attachments!)
  • Additional languages (Chinese)
  • Richer linguistic representations (WH-dependencies)
  • More efficient computational procedures for search and estimation
    – Dynamic programming, approximation methods (variational methods, best-first or beam search)
  • Apply discriminative techniques to applications such as speech recognition and machine translation

SLIDE 45

Discriminative learning in other settings

  • Speech recognition
    – Take x to be the acoustic signal, Y(x) all strings in the recognizer lattice for x
    – Training data: D = ((y_1, x_1), . . . , (y_n, x_n)), where y_i is the correct transcript for x_i
    – Features could be n-grams, log parser probability, cache features
  • Machine translation
    – Take x to be the input-language string, Y(x) a set of target-language strings (e.g., generated by an IBM-style model)
    – Training data: D = ((y_1, x_1), . . . , (y_n, x_n)), where y_i is the correct translation of x_i
    – Features could be n-grams of target-language strings, word and phrase correspondences, . . .

SLIDE 46

Regularizer tuning in Max Ent models

  • Associate each feature f_j with a bin b(j)
  • Associate a regularizer constant β_k with feature bin k
  • Optimize the feature weights α = (α_1, . . . , α_m) on the main training data M
  • Optimize the regularizer constants β on held-out data H

  L_D(α) = ∏_{i=1}^{n} P_α(y_i|x_i), where D = ((y_1, x_1), . . . , (y_n, x_n))

  α̂(β) = argmax_α [ log L_M(α) − Σ_{j=1}^{m} β_{b(j)} α_j² ]

  β̂ = argmax_β log L_H(α̂(β))
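A sketch of this two-level optimization (the conditional log-likelihood and its gradient are hypothetical callables here, and a grid search over β stands in for whatever outer optimizer is actually used):

```python
# Two-level regularizer tuning (sketch): inner L-BFGS fit of the feature
# weights alpha on training data with per-bin penalties beta_{b(j)};
# outer grid search picks beta by held-out log-likelihood.

import numpy as np
from scipy.optimize import minimize

def fit_alpha(neg_log_lik, neg_grad, beta, bins, m):
    pen = np.asarray(beta)[bins]                 # beta_{b(j)} per feature j
    obj = lambda a: neg_log_lik(a) + pen @ (a ** 2)
    jac = lambda a: neg_grad(a) + 2 * pen * a
    return minimize(obj, np.zeros(m), jac=jac, method="L-BFGS-B").x

def tune_beta(train_nll, train_grad, heldout_nll, bins, m, beta_grid):
    best_beta, best_alpha, best_ll = None, None, -np.inf
    for beta in beta_grid:
        alpha = fit_alpha(train_nll, train_grad, beta, bins, m)
        ll = -heldout_nll(alpha)                 # log L_H(alpha_hat(beta))
        if ll > best_ll:
            best_beta, best_alpha, best_ll = beta, alpha, ll
    return best_beta, best_alpha
```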

SLIDE 47

Expectation maximization for PCFGs

  • Hidden training data: D = (x_1, . . . , x_n), where x_i is a string
  • The Inside-Outside algorithm is an Expectation-Maximization algorithm for PCFGs

  p̂ = argmax_p L_D(p), where L_D(p) = ∏_{i=1}^{n} P_p(x_i)
    = argmax_p ∏_{i=1}^{n} Σ_{y ∈ Y(x_i)} P_p(y)

SLIDE 48

Why there is no conditional ML EM

  • Conditional ML conditions on the string x
  • Hidden training data: D = (x_1, . . . , x_n), where x_i is a string
  • The likelihood is the probability of predicting the string x_i given the string x_i, a constant function

  p̂ = argmax_p L_D(p), where L_D(p) = ∏_{i=1}^{n} P_p(x_i|x_i) = 1 for every p, so there is nothing to optimize
