

SLIDE 1

Discriminative approaches to Statistical Parsing

Mark Johnson, Brown University. University of Tokyo, 2004.

Joint work with Eugene Charniak (Brown) and Michael Collins (MIT). Supported by NSF grants LIS 9720368 and IIS0095940.

1

SLIDE 2

Talk outline

  • A typology of approaches to parsing
  • Applications of parsers
  • Representations and features of statistical parsers
  • Estimation (training) of statistical parsers

– maximum likelihood (generative) estimation
– maximum conditional likelihood (discriminative) estimation

  • Experiments with a discriminatively trained reranking parser
  • Advantages and disadvantages of generative and discriminative training

  • Conclusions and future work

2

SLIDE 3

Grammars and parsing

  • A (formal) language is a set of strings

– For most practical purposes, human languages are infinite sets of strings
– In general we are interested in the mapping from surface form to meaning

  • A grammar is a finite description of a language

– Usually assigns each string in a language a description (e.g., parse tree, semantic representation)

  • Parsing is the process of characterizing (recovering) the descriptions of a string

  • Most grammars of human languages are either manually constructed or extracted automatically from an annotated corpus

– Linguistic expertise is necessary for both!

3

SLIDE 4

Manually constructed grammars

Examples: Lexical-functional grammar (LFG), Head-driven phrase-structure grammar (HPSG), Tree-adjoining grammar (TAG)

  • Linguistically inspired

– Deals with linguistically interesting phenomena
– Ignores boring (or difficult!) but frequent constructions
– Often explicitly models the form-meaning mapping

  • Each theory usually has its own kind of representation

⇒ Difficult to compare different approaches

  • Constructing broad-coverage grammars is hard and unrewarding
  • Probability distributions can be defined over their representations

  • Often involve long-distance constraints

⇒ Computationally expensive and difficult

4

SLIDE 5

[Figure: LFG analysis of "let us take Tuesday, the fifteenth" (sentence ID BAC002): the c-structure tree and the corresponding f-structure, with attributes such as PRED, SUBJ, OBJ, XCOMP, TNS-ASP and SPEC.]

5

SLIDE 6

Corpus-derived grammars

  • Grammar is extracted automatically from a large linguistically annotated corpus

– Focuses on frequently occurring constructions
– Only models phenomena that can be (easily) annotated
– Typically ignores semantics and most of the rich details of linguistic theories

  • Different models extracted from the same corpus can usually be compared

  • Constructing corpora is hard, unrewarding work
  • Generative models usually only involve local constraints

– Dynamic programming possible, but usually involves heuristic search

6

SLIDE 7

Sample Penn treebank tree

[Figure: Penn treebank tree for "BELL INDUSTRIES Inc. increased its quarterly to 10 cents from seven cents a share."]

7

SLIDE 8

Applications of (statistical) parsers

  • 1. Applications that use syntactic parse trees
  • information extraction
  • (short answer) question answering
  • summarization
  • machine translation
  • 2. Applications that use the probability distribution over strings or trees (parser-based language models)

  • speech recognition and related applications
  • machine translation

8

SLIDE 9

PCFG representations and features

[Figure: parse tree for "George eats pizza quickly"; the local tree VP → VB NP ADVP is one of its features.]

0.14: VP → VB NP ADVP

  • Probabilistic context-free grammars (PCFGs) associate a rule probability p(r) with each rule ⇒ features are local trees

  • Probability of a tree y is P(y) = ∏_{r∈y} p(r) = ∏_r p(r)^{fr(y)}, where fr(y) is the number of times r appears in y

  • Probability of a string x is P(x) = Σ_{y∈Y(x)} P(y)

9
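As a concrete illustration of these formulas (my own toy sketch, not code from the talk), the following reads the local-tree rules off a nested-tuple tree and multiplies their probabilities; the mini-grammar and tree are assumed example values.

```python
from collections import Counter

# Assumed toy rule probabilities p(r), not a real grammar.
rule_prob = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("George",)): 0.5,
    ("NP", ("pizza",)): 0.5,
    ("VP", ("VB", "NP", "ADVP")): 0.14,
    ("VB", ("eats",)): 1.0,
    ("ADVP", ("RB",)): 1.0,
    ("RB", ("quickly",)): 1.0,
}

def tree_rules(tree):
    """Yield the (parent, children-labels) local trees of a nested-tuple tree,
    where a preterminal looks like ("NP", "George")."""
    label, children = tree[0], tree[1:]
    yield (label, tuple(c[0] if isinstance(c, tuple) else c for c in children))
    for c in children:
        if isinstance(c, tuple):
            yield from tree_rules(c)

def tree_probability(tree):
    """P(y) = product over rules r of p(r) ** f_r(y)."""
    counts = Counter(tree_rules(tree))          # f_r(y)
    prob = 1.0
    for rule, f in counts.items():
        prob *= rule_prob[rule] ** f
    return prob

y = ("S", ("NP", "George"),
          ("VP", ("VB", "eats"), ("NP", "pizza"), ("ADVP", ("RB", "quickly"))))
print(tree_probability(y))   # 1.0 * 0.5 * 0.14 * 1.0 * 0.5 = 0.035
```

Summing tree_probability over all parses of a string would give P(x) as in the last bullet.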

SLIDE 10

Lexicalized PCFGs

[Figure: head-lexicalized parse tree for "the torpedo sank the boat", in which every node is annotated with its lexical head.]

0.02: VP_sank → VB_sank NP_boat

  • Head annotation captures subcategorization and head-to-head dependencies

  • Sparse data is a serious problem: smoothing is essential!

10

SLIDE 11

Modern (generative) statistical parsers

[Figure: head-annotated parse tree for "the torpedo sank the boat" (nodes such as NP, NN:boat, DT:the), illustrating how a generative parser produces the tree through many small conditioned steps.]

  • Generates a tree via a very large number of small steps (generates NP, then NN, then boat)
  • Each step in this branching process conditions on a large number of (already generated) variables
  • Sparse data is the major problem: smoothing is essential!

11

SLIDE 12

Estimating PCFGs from visible data

[Figure: visible training data: the trees [S [NP rice] [VP grows]] (twice) and [S [NP corn] [VP grows]] (once).]

Rule         Count   Rel Freq
S → NP VP    3       1
NP → rice    2       2/3
NP → corn    1       1/3
VP → grows   3       1

P([S [NP rice] [VP grows]]) = 2/3        P([S [NP corn] [VP grows]]) = 1/3
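The relative-frequency computation is easy to reproduce; in the sketch below (illustrative only) each training tree is given as a flat list of its rules, and each rule count is divided by the count of its left-hand side.

```python
from collections import Counter

# The three visible training trees from the slide, as flat rule lists.
trees = [
    [("S", ("NP", "VP")), ("NP", ("rice",)), ("VP", ("grows",))],
    [("S", ("NP", "VP")), ("NP", ("rice",)), ("VP", ("grows",))],
    [("S", ("NP", "VP")), ("NP", ("corn",)), ("VP", ("grows",))],
]

def relative_frequency_mle(trees):
    """PCFG MLE from fully observed trees: p(A -> beta) = count(A -> beta) / count(A)."""
    rule_counts = Counter(rule for tree in trees for rule in tree)
    lhs_counts = Counter()
    for (lhs, _rhs), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

probs = relative_frequency_mle(trees)
print(probs[("NP", ("rice",))])    # 2/3
print(probs[("VP", ("grows",))])   # 1.0
```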

12

SLIDE 13

Why is the PCFG MLE so easy to compute?


  • Visible training data D = (y1, . . . , yn), where yi is a parse tree
  • The MLE is p̂ = argmax_p ∏_{i=1}^n Pp(yi)

  • It is easy to compute because PCFGs are always normalized, i.e., Z = Σ_{y∈Y} ∏_r p(r)^{fr(y)} = 1, where Y is the set of all trees generated by the grammar

13

SLIDE 14

Non-local constraints and the PCFG MLE

[Training trees: [S [NP rice] [VP grows]] (twice) and [S [NP bananas] [VP grow]] (once); subject-verb agreement is a non-local constraint.]

Rule           Count   Rel Freq
S → NP VP      3       1
NP → rice      2       2/3
NP → bananas   1       1/3
VP → grows     2       2/3
VP → grow      1       1/3

P([S [NP rice] [VP grows]]) = 4/9        P([S [NP bananas] [VP grow]]) = 1/9        Z = 5/9

14

SLIDE 15

Renormalization

[Same training trees and relative-frequency estimates as the previous slide, with the tree probabilities renormalized by Z.]

Rule           Count   Rel Freq
S → NP VP      3       1
NP → rice      2       2/3
NP → bananas   1       1/3
VP → grows     2       2/3
VP → grow      1       1/3

P([S [NP rice] [VP grows]]) = (4/9)/Z = 4/5        P([S [NP bananas] [VP grow]]) = (1/9)/Z = 1/5        Z = 5/9

15

SLIDE 16

Other values do better!

[Same training trees; now the VP rule probabilities are set to 1/2 each rather than to their relative frequencies.]

Rule           Count   Rel Freq
S → NP VP      3       1
NP → rice      2       2/3
NP → bananas   1       1/3
VP → grows     2       1/2
VP → grow      1       1/2

(Abney 1997)

P([S [NP rice] [VP grows]]) = (2/6)/Z = 2/3        P([S [NP bananas] [VP grow]]) = (1/6)/Z = 1/3        Z = 3/6
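To check numerically that the relative-frequency estimate is not optimal here, one can compare the renormalized likelihood of the three training trees under the two parameter settings; this little verification is my own, using the numbers from the slides.

```python
def renormalized_likelihood(p_rice, p_bananas, p_grows, p_grow):
    """Likelihood of the training data (rice grows x2, bananas grow x1),
    renormalizing over the two agreement-respecting trees only."""
    p_rice_grows = p_rice * p_grows
    p_bananas_grow = p_bananas * p_grow
    z = p_rice_grows + p_bananas_grow    # trees violating agreement are not in Y
    return (p_rice_grows / z) ** 2 * (p_bananas_grow / z)

# Relative-frequency estimate: VP -> grows = 2/3, VP -> grow = 1/3
print(renormalized_likelihood(2/3, 1/3, 2/3, 1/3))   # (4/5)^2 * (1/5) = 0.128
# Alternative values: VP -> grows = VP -> grow = 1/2
print(renormalized_likelihood(2/3, 1/3, 1/2, 1/2))   # (2/3)^2 * (1/3) ≈ 0.148, higher
```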

16

SLIDE 17

Make dependencies local – GPSG-style

Rule                               Count   Rel Freq
S → NP[+singular] VP[+singular]    2       2/3
S → NP[+plural] VP[+plural]        1       1/3
NP[+singular] → rice               2       1
NP[+plural] → bananas              1       1
VP[+singular] → grows              2       1
VP[+plural] → grow                 1       1

P([S [NP[+singular] rice] [VP[+singular] grows]]) = 2/3        P([S [NP[+plural] bananas] [VP[+plural] grow]]) = 1/3

17

SLIDE 18

Maximum entropy or log linear models

  • Y = set of syntactic structures (not necessarily trees)
  • fj(y) = number of occurrences of the jth feature in y ∈ Y

(these features need not be conventional linguistic features)

  • wj are “feature weight” parameters

Sw(y) = Σ_{j=1}^m wj fj(y)

Vw(y) = exp Sw(y)

Zw = Σ_{y∈Y} Vw(y)

Pw(y) = Vw(y) / Zw = (1/Zw) exp Σ_{j=1}^m wj fj(y)

log Pw(y) = Σ_{j=1}^m wj fj(y) − log Zw
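These definitions translate directly into code for a toy case in which Y is small enough to enumerate explicitly; the feature vectors and weights below are made-up illustrative values, not anything from the talk.

```python
import math

def log_linear_distribution(feature_vectors, w):
    """P_w(y) = exp(sum_j w_j f_j(y)) / Z_w over an explicitly enumerated Y."""
    scores = [sum(wj * fj for wj, fj in zip(w, f)) for f in feature_vectors]   # S_w(y)
    values = [math.exp(s) for s in scores]                                     # V_w(y)
    z = sum(values)                                                            # Z_w
    return [v / z for v in values]

# Three structures described by m = 3 features, with assumed weights.
Y = [[1, 3, 2], [2, 2, 3], [3, 1, 5]]
w = [0.5, -0.2, 0.1]
print(log_linear_distribution(Y, w))   # probabilities summing to 1
```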

18

SLIDE 19

PCFGs are log-linear models

Y = set of all trees generated by grammar G
fj(y) = number of times the jth rule is used in y ∈ Y
p(rj) = probability of jth rule in G
Choose wj = log p(rj), so p(rj) = exp wj

f([S [NP rice] [VP grows]]) = [ 1_{S→NP VP}, 1_{NP→rice}, 0_{NP→bananas}, 1_{VP→grows}, 0_{VP→grow} ]

Pw(y) = ∏_{j=1}^m p(rj)^{fj(y)} = ∏_{j=1}^m (exp wj)^{fj(y)} = exp(Σ_{j=1}^m wj fj(y))

So a PCFG is just a log linear model with Z = 1.

19

SLIDE 20

Max likelihood estimation of log linear models

Visible training data D = (y1, . . . , yn), where yi ∈ Y is a tree


ŵ = argmax_w LD(w), where log LD(w) = Σ_{i=1}^n log Pw(yi) = Σ_{i=1}^n (Sw(yi) − log Zw)

  • In general there is no closed form solution ⇒ optimize log LD(w) numerically.

  • Calculating Zw involves summing over all parses of all strings ⇒ computationally intractable (Abney suggests Monte Carlo)

20

SLIDE 21

Summary so far

All dependencies are local or context-free:

  • if features have "context free" branching structure (i.e., rules) then the partition function Z can be calculated analytically ⇒ MLE has a simple analytic form (relative frequency)

Structures exhibit non-local constraints:

  • with non-local constraints, MLE is in general not relative frequency

  • Usually no analytic form for Z
⇒ no analytic solution for the MLE
⇒ no reason to only use local tree rule features (i.e., the fj(y) can be any easily computable function of y)

  • If it is necessary to enumerate Y to calculate Z then MLE is infeasible

21

SLIDE 22

Conditional Likelihood and Discriminative training

Given training data D = ((x1, y1), . . . , (xn, yn)) of strings xi and corresponding parses yi:

  • The PCFG MLE optimizes LD(w) = Pw(x1, y1) . . . Pw(xn, yn)
  • The PCFG MLE is a generative model that models the distribution of strings P(x) as well as trees given strings P(y|x)

  • The conditional distribution P(y|x) is important for parsing, but the marginal distribution P(x) is not

  • Generative models “waste” some of their parameters to model the marginal distribution P(x)

  • Optimize conditional likelihood L′D(w) = Pw(y1|x1) . . . Pw(yn|xn)

22

SLIDE 23

Generative vs discriminative training

[Figure: training data consisting of 95 copies of the tree [A x], two copies of [A [A a] b] (using A → A b and A → a), and one copy of [A a b]; under the relative-frequency PCFG these trees have probabilities 95/100, 2/100 × 2/100 and 1/100.]

Rule      count   rel freq   rel freq
A → x     95      95/100     69/100
A → A b   2       2/100      1/10
A → a     2       2/100      2/10
A → a b   1       1/100      1/100

  • When the PCFG independence assumptions are violated, the MLE may not accurately model P(y|x)

23

SLIDE 24

Linguistic example of discriminative training

[Figure: training data with 100 copies of [VP [V run]], plus a VP-attachment parse and an NP-attachment parse of "see people with telescopes"; the competing parses' probabilities include factors of 2/105, 1/7, 2/7 and 1/7 under the two estimates.]

Rule          count   rel freq   rel freq
VP → V        100     100/105    4/7
VP → V NP     3       3/105      1/7
VP → VP PP    2       2/105      2/7
NP → N        6       6/7        6/7
NP → NP PP    1       1/7        1/7

24

SLIDE 25

Conditional estimation for log linear models

The pseudo-likelihood of w is the conditional probability of the hidden part (syntactic structure) y given its visible part (yield or terminal string) x = X(y) (Besag 1974)

Y(xi) = {y : X(y) = X(yi)}

ŵ = argmax_w PLD(w)

PLD(w) = ∏_{i=1}^n Pw(yi|xi)

Pw(y|x) = Vw(y) / Zw(x)

Vw(y) = exp Σ_j wj fj(y)

Zw(x) = Σ_{y′∈Y(x)} Vw(y′)
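A minimal sketch of this conditional (pseudo-)likelihood objective follows; each training item pairs the correct parse's feature vector with the feature vectors of all candidate parses of that sentence, so the partition function Zw(x) only sums over Y(x). The data layout and numbers are invented for illustration, not the talk's implementation.

```python
import math

def log_conditional_likelihood(data, w):
    """sum_i log P_w(y_i | x_i), where each training item is
    (features of the correct parse, [features of all candidate parses])."""
    total = 0.0
    for correct_f, candidate_fs in data:
        score = lambda f: sum(wj * fj for wj, fj in zip(w, f))      # S_w(y)
        z_x = sum(math.exp(score(f)) for f in candidate_fs)         # Z_w(x): parses of x only
        total += score(correct_f) - math.log(z_x)
    return total

# Toy data: the correct parse must appear among the candidates.
data = [
    ([1, 3, 2], [[1, 3, 2], [2, 2, 3], [3, 1, 5], [2, 6, 3]]),
    ([7, 2, 1], [[7, 2, 1], [2, 5, 5]]),
]
print(log_conditional_likelihood(data, [0.1, 0.2, -0.3]))
```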

25

SLIDE 26

Conditional ML estimation

  • The pseudo-partition function Zw(x) is much easier to compute than the partition function Zw

– Zw requires a sum over Y
– Zw(x) requires a sum over Y(x) (parses of x)

  • Maximum likelihood estimates full joint distribution

– learns P(x) and P(y|x)

  • Conditional ML estimates a conditional distribution

– learns P(y|x) but not P(x)
– conditional distribution is what you need for parsing
– cognitively more plausible?

  • Conditional estimation requires labelled training data: no obvious EM extension

26

SLIDE 27

Conditional estimation

             Correct parse’s features   All other parses’ features
sentence 1   [1, 3, 2]                  [2, 2, 3] [3, 1, 5] [2, 6, 3]
sentence 2   [7, 2, 1]                  [2, 5, 5]
sentence 3   [2, 4, 2]                  [1, 1, 7] [7, 2, 1]
. . .        . . .                      . . .

  • Training data is fully observed (i.e., parsed data)
  • Choose w to maximize (log) likelihood of correct parses relative to other parses

  • Distribution of sentences is ignored
  • Nothing is learnt from unambiguous examples
  • Other kinds of discriminative learners can also train from this data

27

SLIDE 28

Pseudo-constant features are uninformative

             Correct parse’s features   All other parses’ features
sentence 1   [1, 3, 2]                  [2, 2, 2] [3, 1, 2] [2, 6, 2]
sentence 2   [7, 2, 5]                  [2, 5, 5]
sentence 3   [2, 4, 4]                  [1, 1, 4] [7, 2, 4]
. . .        . . .                      . . .

  • Pseudo-constant features are identical within every set of parses
  • They contribute the same constant factor to each parse’s likelihood

  • They do not distinguish parses of any sentence ⇒ irrelevant

28

SLIDE 29

Pseudo-maximal features ⇒ unbounded wj

             Correct parse’s features   All other parses’ features
sentence 1   [1, 3, 2]                  [2, 3, 4] [3, 1, 1] [2, 1, 1]
sentence 2   [2, 7, 4]                  [3, 7, 2]
sentence 3   [2, 4, 4]                  [1, 1, 1] [1, 2, 4]

  • A pseudo-maximal feature always reaches its maximum value (within each sentence’s set of parses) on the correct parse

  • If fj is pseudo-maximal, wj → ∞ (hard constraint)

  • If fj is pseudo-minimal, wj → −∞ (hard constraint)

29

SLIDE 30

Regularization

  • fj is pseudo-maximal over the training data ⇒ fj is pseudo-maximal over all Y (sparse data)

  • With many more features than data, log-linear models can over-fit

  • Regularization: add a bias term to ensure w is finite and small

  • In these experiments, the regularizer is a polynomial penalty term

ŵ = argmax_w log PLD(w) − c Σ_{j=1}^m |wj|^p    (p = 2 gives a Gaussian prior)
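In code, the regularizer is just a penalty subtracted from the objective that is then maximized numerically (a later slide mentions conjugate gradient); this sketch assumes the log_conditional_likelihood helper from the earlier illustrative sketch.

```python
def regularized_objective(data, w, c=1.0, p=2):
    """log PL_D(w) - c * sum_j |w_j|**p; p = 2 corresponds to a Gaussian prior.
    Reuses the toy log_conditional_likelihood sketch defined earlier."""
    penalty = c * sum(abs(wj) ** p for wj in w)
    return log_conditional_likelihood(data, w) - penalty
```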

30

SLIDE 31

Conditional estimation of PCFGs

  • MCLE involves maximizing a complex non-linear function

– conjugate gradient (iterative optimization)
– each iteration involves summing over all parses of each training sentence
⇒ Use the small ATIS treebank corpus
– Trained on 1088 sentences of ATIS1 corpus
– Tested on 294 sentences of ATIS2 corpus

  • MCLE estimator initialized with MLE probabilities
  • (Added in 2003: I think there may be better ways to do the conditional estimation)

31

SLIDE 32

Parser evaluation

  • A node’s edge is its label and beginning and ending string positions
  • E(y) is the set of edges associated with a tree y (same with forests)
  • If y is a parse tree and ȳ is the correct tree, then

precision Pȳ(y) = |E(y) ∩ E(ȳ)| / |E(y)|
recall Rȳ(y) = |E(y) ∩ E(ȳ)| / |E(ȳ)|
f-score Fȳ(y) = 2 / (Pȳ(y)^−1 + Rȳ(y)^−1)

Edges: (0 NP 2) (2 VP 3) (0 S 3)

[Figure: parse tree for "the dog barks" with string positions 0–3, whose labelled edges are listed above.]
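These edge-based measures are simple to compute from sets of (start, label, end) edges; the sketch below uses the gold edges from the figure and an invented candidate parse.

```python
def parseval(candidate_edges, gold_edges):
    """Labelled precision, recall and f-score over (start, label, end) edges."""
    matched = len(candidate_edges & gold_edges)
    if matched == 0:
        return 0.0, 0.0, 0.0
    precision = matched / len(candidate_edges)
    recall = matched / len(gold_edges)
    fscore = 2 / (1 / precision + 1 / recall)
    return precision, recall, fscore

gold = {(0, "NP", 2), (2, "VP", 3), (0, "S", 3)}
candidate = {(0, "NP", 2), (1, "VP", 3), (0, "S", 3)}   # one wrong edge, for illustration
print(parseval(candidate, gold))   # (0.666..., 0.666..., 0.666...)
```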

32

SLIDE 33

Conditional and Joint ML PCFG evaluation

                                                 MLE     MCLE
− log likelihood of training data                13857   13896
− log conditional likelihood of training data    1833    1769
− log marginal probability of training strings   12025   12127
Labelled precision of test data                  0.815   0.817
Labelled recall of test data                     0.789   0.794

  • Precision/recall difference not significant (p ≈ 0.1)

33

SLIDE 34

Experiments in Discriminative Parsing

  • Collins Model 3 parser produces a set of candidate parses Y(x) for each sentence x

  • The discriminative parser has a weight wj for each feature fj

  • The score for each parse is S(x, y) = w · f(x, y)

  • The highest scoring parse ŷ = argmax_{y∈Y(x)} S(x, y) is predicted correct

[Figure: the reranking pipeline: sentence x → Collins Model 3 parses Y(x) = {y1, . . . , yk} → feature vectors f(x, y1), . . . , f(x, yk) → scores w · f(x, y1), . . . , w · f(x, yk).]
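The reranking step itself reduces to a dot product and an argmax; the sketch below is schematic, with the n-best list and the feature extraction assumed to come from elsewhere (here stubbed as arguments).

```python
def rerank(candidates, feature_fn, w):
    """Return the candidate parse y maximizing S(x, y) = w . f(x, y).
    `candidates` is the n-best list Y(x) from the first-stage parser;
    `feature_fn` maps a parse to its feature vector f(x, y)."""
    def score(y):
        return sum(wj * fj for wj, fj in zip(w, feature_fn(y)))
    return max(candidates, key=score)
```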

34

SLIDE 35

Training the discriminative parser

  • Training data ((x1, y1), . . . , (xn, yn))
  • Each string xi is parsed using Collins parser, producing a set Yc(xi) of parse trees

  • Best parse ỹi = argmax_{y∈Yc(xi)} Fyi(y), where Fy′(y) measures parse accuracy

  • ŵ is chosen to maximize the regularized log pseudo-likelihood Σ_i log Pw(ỹi|xi) + R(w)

[Figure: the candidate set Yc(xi) inside the set of all trees Y, showing the correct parse yi and the best candidate ỹi.]

35

SLIDE 36

Baseline and oracle results

  • Training corpus: 36,112 Penn treebank trees, development corpus 3,720 trees from sections 2-21

  • Collins Model 2 parser failed to produce a parse on 115 sentences
  • Average |Y(x)| = 36.1
  • Model 2 f-score = 0.882 (picking parse with highest Model 2 probability)

  • Oracle (maximum possible) f-score = 0.953 (i.e., evaluate f-score of closest parses ỹi)
⇒ Oracle (maximum possible) error reduction = (0.953 − 0.882)/(1 − 0.882) ≈ 0.601

36

SLIDE 37

Expt 1: Only “old” features

  • Features: (1) log Model 2 probability, (9717) local tree features
  • Model 2 already conditions on local trees!
  • Feature selection: features must vary on 5 or more sentences
  • Results: f-score = 0.886; ≈ 4% error reduction

⇒ discriminative training alone can improve accuracy

[Figure: parse tree for "That went over the permissible line for warm and fuzzy feelings."]

37

SLIDE 38

Expt 2: Rightmost branch bias

  • The RightBranch feature’s value is the number of nodes on the right-most branch (ignoring punctuation)

  • Reflects the tendency toward right branching
  • LogProb + RightBranch: f-score = 0.884 (probably significant)
  • LogProb + RightBranch + Rule: f-score = 0.889

[Figure: parse tree for "That went over the permissible line for warm and fuzzy feelings", with its right-most branch emphasized.]

38

SLIDE 39

Lexicalized and parent-annotated rules

  • Lexicalization associates each constituent with its head
  • Parent annotation provides a little “vertical context”
  • With all combinations, there are 158,890 rule features

[Figure: parse tree for "That went over the permissible line for warm and fuzzy feelings", annotating the heads, the rule and the grandparent used in a lexicalized, parent-annotated rule feature.]

39

SLIDE 40

n-gram rule features generalize rules

  • Collects adjacent constituents in a local tree
  • Also includes relationship to head
  • Constituents can be ancestor-annotated and lexicalized
  • 5,143 unlexicalized rule bigram features, 43,480 lexicalized rule bigram features

[Figure: parse tree for "The clash is a sign of a new toughness and divisiveness in Japan's once-cozy financial circles", with a constituent n-gram to the left of the head and non-adjacent to the head marked.]

40

SLIDE 41

Head to head dependencies

  • Head-to-head dependencies track the function-argument dependencies in a tree

  • Co-ordination leads to phrases with multiple heads or functors
  • With all combinations, there are 121,885 head-to-head features

[Figure: parse tree for "That went over the permissible line for warm and fuzzy feelings", with head-to-head dependencies indicated.]

41

SLIDE 42

Head trees record all dependencies

  • Head trees consist of a (lexical) head, all of its projections and (optionally) all of the siblings of these nodes

  • These correspond roughly to TAG elementary trees

[Figure: parse tree for "They were consulted in advance."]

42

SLIDE 43

Constituent Heavyness and location

  • Heavyness measures the constituent’s category, its (binned) size and (binned) closeness to the end of the sentence

  • There are 984 Heavyness features

[Figure: parse tree for "That went over the permissible line for warm and fuzzy feelings", with one constituent marked as more than 5 words long and 1 punctuation symbol from the end of the sentence.]

43

SLIDE 44

Tree n-gram

  • A tree n-gram is a tree fragment that connects a sequence of n adjacent words

  • There are 62,487 tree n-gram features

[Figure: parse tree for "That went over the permissible line for warm and fuzzy feelings", with one tree n-gram fragment marked.]

44

SLIDE 45

Subject-Verb Agreement

  • The SubjVerbAgr features are the POS of the subject NP’s lexical head and the VP’s functional head

  • There are 200 SubjVerbAgr features

[Figure: parse tree for "The rules force executives to report purchases."]

45

SLIDE 46

Functional-lexical head dependencies

  • The SynSemHeads features collect pairs of functional and lexical heads of phrases (Grimshaw)

  • This captures number agreement in NPs and aspects of other head-to-head dependencies

  • There are 1,606 SynSemHeads features

[Figure: parse tree for "The rules force executives to report purchases."]

46

SLIDE 47

Coordination parallelism (1)

  • The CoPar feature indicates the depth to which adjacent conjuncts are parallel

  • There are 9 CoPar features

[Figure: parse tree for "They were consulted in advance and were surprised at the action taken", whose two VP conjuncts are isomorphic trees to depth 4.]

47

SLIDE 48

Coordination parallelism (2)

  • The CoLenPar feature indicates the difference in length in adjacent conjuncts and whether this pair contains the last conjunct.

  • There are 22 CoLenPar features

[Figure: the same coordination ("were consulted in advance" vs "were surprised at the action taken"): the first conjunct has 4 words and the second 6, giving the CoLenPar feature (2, true).]
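A hypothetical helper for this feature might look as follows; the exact pairing and representation of conjuncts in the real system are not specified in the slides, so this is only an illustration.

```python
def colenpar_features(conjunct_lengths):
    """For each pair of adjacent conjuncts, emit (length difference, is_last_pair).
    E.g. conjuncts of 4 and 6 words give the single feature (2, True)."""
    features = []
    for i in range(len(conjunct_lengths) - 1):
        diff = abs(conjunct_lengths[i + 1] - conjunct_lengths[i])
        is_last = (i == len(conjunct_lengths) - 2)
        features.append((diff, is_last))
    return features

print(colenpar_features([4, 6]))   # [(2, True)]
```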

48

SLIDE 49

Regularization

  • General form of regularizer: c Σ_j |wj|^p

  • p = 1 leads to sparse weight vectors. (Kazama and Tsujii, 2003)

– If |∂L/∂wj| < c then wj = 0

  • Experiment on small feature set:

– 164,273 features
– c = 2, p = 2, f-score = 0.898
– c = 4, p = 1, f-score = 0.896, only 5,441 non-zero features!
– Earlier experiments suggested that optimal performance is obtained with p ≈ 1.5

49

SLIDE 50

Experimental results with all features

  • Features must vary on parses of at least 5 sentences in training data

  • In this experiment, 692,708 features
  • regularization term: 4 Σ_j |wj|^2

  • dev set results: f-score = 0.904 (20% error reduction)

50

SLIDE 51

Which kinds of features are best?

Features from                # of features   f-score
Treebank trees               375,646         0.901
Correct parses               271,267         0.902
Incorrect parses             876,339         0.903
Correct & incorrect parses   883,936         0.903

  • Features from incorrect parses characterize failure modes of Collins parser

  • There are far more ways to be wrong than to be right!

51

SLIDE 52

Evaluating feature classes

∆ f-score      ∆ −logL    # w     av w[j]        sd w[j]       zeroed class
−0.0187508     1814.32    1       0.629557       inf           LogProb
−0.00185951    145.987    2       −0.477453      1.59274e-05   RightBranch
5.50245e-05    9.44562    9717    0.000637244    0.0024974     Rule:0:0:0:0:0:0:0:0
−0.00106989    0.896624   48723   0.000629753    0.00235112    Rule:1:0:0:0:0:0:0:0
−0.000611704   2.77256    68035   0.000639053    0.00255555    NGramTree:3:2:1:0
−0.000270621   1.66255    21543   0.000944576    0.0028058     Heads:2:0:1:1
−0.00031439    5.4608     10187   0.000908379    0.00225115    Word:2
−0.00241466    61.5452    984     −0.00115477    0.0119984     Heavy
−0.00153331    47.0448    7450    0.000453298    0.00513622    Neighbours:1:1
0.000127092    11.0722    9       0.145198       0.0562        CoPar
−0.00018458    5.28722    22      0.0155067      0.0313398     CoLenPar
−9.55417e-05   1.30432    200     −0.00147174    0.00723214    SubjVerbAgr

52

SLIDE 53

Summary

  • Generative and discriminative parsers both identify the likely parse y of a string x, i.e., estimate P(y|x)
  • Generative parsers also define language models, estimate P(x)
  • Discriminative estimation doesn’t require feature independence

– suitable for grammar formalisms without CF branching structure

  • Parsing is equally complex for generative and discriminative parsers

– depends on features used
– reranking uses one parser to narrow the search space for another

  • Estimation is computationally inexpensive for generative parsers, but expensive for discriminative parsers

  • Because a discriminative parser can use the generative model’s probability estimate as a feature, discriminative parsers almost never do worse than the generative model, and often do substantially better.

53

SLIDE 54

Discriminative learning in other settings

  • Speech recognition

– Take x to be the acoustic signal, Y(x) all strings in recognizer lattice for x
– Training data: D = ((y1, x1), . . . , (yn, xn)), where yi is correct transcript for xi
– Features could be n-grams, log parser prob, cache features

  • Machine translation

– Take x to be input language string, Y(x) a set of target language strings (e.g., generated by an IBM-style model)
– Training data: D = ((y1, x1), . . . , (yn, xn)), where yi is correct translation of xi
– Features could be n-grams of target language strings, word and phrase correspondences, . . .

54

SLIDE 55

Conclusion and directions for future work

  • Discriminatively trained parsing models can perform better than standard generative parsing models

  • Features can be arbitrary functions of parse trees

– Difficult to tell which features are most useful
– Are there techniques to systematically evaluate and explore possible features?

  • Generative parser language models can be applied to a variety of applications. Are there similar generic discriminative parsers?

  • Efficient computational procedures for search and estimation

– Dynamic programming
– Approximation methods (variational methods, best-first or beam search)

55

SLIDE 56

Regularizer tuning in Max Ent models

  • Associate each feature fj with bin b(j)
  • Associate regularizer constant βk with feature bin k
  • Optimize feature weights α = (α1, . . . , αm) on main training data M

  • Optimize regularizer constants β on held-out data H

LD(α) = ∏_{i=1}^n Pα(yi|xi), where D = ((y1, x1), . . . , (yn, xn))

α̂(β) = argmax_α log LM(α) − Σ_{j=1}^m β_{b(j)} αj^2

β̂ = argmax_β log LH(α̂(β))

56

SLIDE 57

Expectation maximization for PCFGs

  • Hidden training data: D = (x1, . . . , xn), where xi is a string
  • The Inside-Outside algorithm is an Expectation-Maximization algorithm for PCFGs

p̂ = argmax_p LD(p), where LD(p) = ∏_{i=1}^n Pp(xi) = ∏_{i=1}^n Σ_{y∈Y(xi)} Pp(y)

57

SLIDE 58

Why there is no conditional ML EM

  • Conditional ML conditions on the string x
  • Hidden training data: D = (x1, . . . , xn), where xi is a string
  • The likelihood is the probability of predicting the string xi given the string xi, a constant function

p̂ = argmax_p LD(p), where LD(p) = ∏_{i=1}^n Pp(xi|xi)

58