SLIDE 1

CS 533: Natural Language Processing

Constituency Parsing

Karl Stratos

Rutgers University

SLIDE 2

Project Logistics

1. Proposal (due 3/24)
2. Milestone (due 4/15)
3. Presentation (tentatively 4/29)
4. Final report (due 5/4)

SLIDE 3

Possible Project Types (More Details in Template)

◮ Extend and apply recent machine learning methods to previously unconsidered NLP tasks
  ◮ Search last/this year's AISTATS/ICLR/ICML/NeurIPS/UAI publications
◮ Extend a recent machine learning method in NLP
  ◮ Search last/this year's ACL/CoNLL/EMNLP/NAACL
◮ Reimplement and replicate an existing technically challenging NLP paper from scratch
  ◮ No available public code!
◮ There may be a limited number of predefined projects
  ◮ No promise
  ◮ Priority given to those with higher assignment grades

SLIDE 4

Course Status

Covered (all in the context of neural networks)

◮ Language models and conditional language models
◮ Pretrained representations from language modeling, evaluation tasks (GLUE, SuperGLUE)
◮ Tagging, parsing (today)

Will cover (before the proposal due date)

◮ Latent-variable models (EM, VAEs)
◮ Information extraction (entity linking)

You will have enough background to read state-of-the-art research papers for your proposal.

SLIDE 5

HMM Loose Ends

◮ Recall the HMM

p(x1 . . . xn, y1 . . . yn) = ∏_{i=1}^{n+1} t(yi | yi−1) × ∏_{i=1}^{n} o(xi | yi)

◮ The forward algorithm computes, in O(|Y|² n) time,

α(i, y) = ∑_{y1...yi ∈ Y^i : yi = y} p(x1 . . . xi, y1 . . . yi)

◮ The backward algorithm computes, in O(|Y|² n) time,

β(i, y) = ∑_{yi...yn ∈ Y^{n−i+1} : yi = y} p(xi+1 . . . xn, yi+1 . . . yn | yi)

SLIDE 6

Backward Algorithm

Base case. For y ∈ Y, β(n, y) = t(STOP | y)

Main. For i = n − 1 . . . 1, for y ∈ Y,

β(i, y) = ∑_{y′ ∈ Y} t(y′ | y) × o(xi+1 | y′) × β(i + 1, y′)
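A minimal sketch of this recursion in Python, assuming dict-based HMM parameters (the argument layout below is an illustration, not the course's code):

```python
def backward(x, tags, t, o):
    """Backward probabilities for an HMM.
    x: list of n words; tags: the tag set Y;
    t[(y, y2)] = t(y2 | y), with t[(y, "STOP")] for stopping;
    o[(y, w)] = o(w | y). These dict layouts are assumptions for illustration."""
    n = len(x)
    beta = {}
    for y in tags:                        # base case: beta(n, y) = t(STOP | y)
        beta[(n, y)] = t[(y, "STOP")]
    for i in range(n - 1, 0, -1):         # main: i = n-1, ..., 1
        for y in tags:
            beta[(i, y)] = sum(
                t[(y, y2)] * o[(y2, x[i])] * beta[(i + 1, y2)]  # x[i] is x_{i+1} (0-indexed list)
                for y2 in tags
            )
    return beta
```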

SLIDE 7

Marginals

◮ Once forward and backward probabilities are computed, useful marginal probabilities can be computed.

◮ Debugging tip: p(x1 . . . xn) = ∑_y α(i, y) β(i, y) for any i

◮ Marginal decoding: for i = 1 . . . n, predict the y ∈ Y with the highest

p(x1 . . . xn, yi = y) = α(i, y) × β(i, y)

◮ Tag pair conditional marginals (will be useful later)

p(yi = y, yi+1 = y′ | x1 . . . xn) = α(i, y) × t(y′ | y) × o(xi+1 | y′) × β(i + 1, y′) / p(x1 . . . xn)
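A sketch of how these marginals could be read off the tables, assuming alpha and beta are dicts keyed by (i, y) as in the backward sketch above:

```python
def sentence_prob(alpha, beta, tags, i):
    """Debugging identity: sum_y alpha(i, y) * beta(i, y) equals p(x_1 ... x_n) for any i."""
    return sum(alpha[(i, y)] * beta[(i, y)] for y in tags)

def marginal_decode(alpha, beta, tags, n):
    """For each position i, predict the tag maximizing the marginal alpha(i, y) * beta(i, y)."""
    return [max(tags, key=lambda y: alpha[(i, y)] * beta[(i, y)]) for i in range(1, n + 1)]
```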

SLIDE 8

Agenda

1. Parsing
2. PCFG: A model for constituency parsing
3. Transition-based dependency parsing

(Some slides adapted from Chris Manning, Mike Collins)

SLIDE 9

Constituency Parsing vs Dependency Parsing

Two formalisms for linguistic structure

SLIDE 10

Constituency Structure

Nested constituents/phrases

◮ Start by labeling words with POS tags

(D the) (N dog) (V saw) (D the) (N cat)

◮ Recursively combine constituents according to rules

(NP (D the) (N dog)) (V saw) (D the) (N cat)
(NP (D the) (N dog)) (V saw) (NP (D the) (N cat))
(NP (D the) (N dog)) (VP (V saw) (NP (D the) (N cat)))
(S (NP (D the) (N dog)) (VP (V saw) (NP (D the) (N cat))))

Used rules: NP → D N, VP → V NP, S → NP VP
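One way to illustrate this in code is to store the final tree as nested (label, children...) tuples and read the used rules back off; the representation is just for illustration:

```python
# The final tree from the derivation above, as nested (label, children...) tuples.
tree = ("S",
        ("NP", ("D", "the"), ("N", "dog")),
        ("VP", ("V", "saw"),
               ("NP", ("D", "the"), ("N", "cat"))))

def rules(t):
    """Read off the grammar rule applied at each internal node."""
    label, children = t[0], t[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return []                                    # POS tag over a word
    out = [(label, tuple(c[0] for c in children))]   # e.g. ("S", ("NP", "VP"))
    for c in children:
        out += rules(c)
    return out

print(rules(tree))  # S -> NP VP, NP -> D N, VP -> V NP, NP -> D N
```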

SLIDE 11

Dependency Structure

Labeled pairwise relations between words

SLIDE 12

Case for Parsing

◮ Most compelling example of latent structure in language

the man saw the moon with a telescope

◮ Hypothesis: parsing is intimately related to intelligence
◮ Also some useful applications, e.g., relation extraction

SLIDE 13

Context-Free Grammar (CFG)

A tuple G = (N, Σ, R, S)

◮ N: non-terminal symbols (constituents)
◮ Σ: terminal symbols (words)
◮ R: rules of the form X → Y1 . . . Ym where X ∈ N, Yi ∈ N ∪ Σ
◮ S ∈ N: start symbol

SLIDE 14

Example CFG

SLIDE 15

Left-Most Derivation

Given a CFG G = (N, Σ, R, S), a left-most derivation is a sequence of strings s1 . . . sn where

◮ s1 = S
◮ For i = 2 . . . n, si = ExpandLeftMostNonterminal(si−1)
◮ sn ∈ Σ∗ (aka the "yield" of the derivation)

SLIDE 16

Example

SLIDE 17

Ambiguity

Some strings can have multiple valid derivations (i.e., parse trees). The number of binary trees over n + 1 nodes is the n-th Catalan number

Cn = (1 / (n + 1)) · (2n choose n)

◮ Cn > 6.5 billion for n = 20

SLIDE 18

Rule-Based to Statistical

◮ Rule-based: manually construct some CFG that recognizes as many English strings as possible
  ◮ Never enough, no way to choose the right parse
◮ Statistical: annotate sentences with parse trees (aka a treebank) and learn a statistical model to disambiguate

SLIDE 19

Treebanks

◮ Standard setup: WSJ portion of the Penn Treebank
  ◮ 40,000 trees for training
  ◮ 1,700 trees for validation
  ◮ 2,400 trees for evaluation
◮ Building a treebank vs building a grammar?
  ◮ Broad coverage, more natural annotation
  ◮ Contains distributional information of English
  ◮ Can be used to evaluate parsers

SLIDE 20

Probabilistic Context-Free Grammar (PCFG)

A tuple G = (N, Σ, R, S, q)

◮ N: non-terminal symbols (constituents)
◮ Σ: terminal symbols (words)
◮ R: rules of the form X → Y1 . . . Ym where X ∈ N, Yi ∈ N ∪ Σ
◮ S ∈ N: start symbol
◮ q: rule probabilities, q(α → β) ≥ 0 for every rule α → β ∈ R, such that ∑_β q(X → β) = 1 for every X ∈ N
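A toy sketch of a PCFG as a Python dict, with a check of the sum-to-one constraint (the rule layout and the probabilities are made up for illustration):

```python
from collections import defaultdict

# Toy PCFG in CNF: q maps (lhs, rhs) to a probability.
q = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("D", "N")):   1.0,
    ("VP", ("V", "NP")):  1.0,
    ("D",  ("the",)):     1.0,
    ("N",  ("dog",)):     0.5,
    ("N",  ("cat",)):     0.5,
    ("V",  ("saw",)):     1.0,
}

# Constraint: probabilities of rules sharing a left-hand side sum to one.
totals = defaultdict(float)
for (lhs, rhs), p in q.items():
    totals[lhs] += p
assert all(abs(s - 1.0) < 1e-9 for s in totals.values())
```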

SLIDE 21

Probability of a Tree Under PCFG
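The slide's example is a figure; the rule it illustrates is that the probability of a tree under a PCFG is the product of the probabilities of the rules used in its derivation. A minimal sketch, reusing the nested-tuple tree format and the toy rule dict q from the earlier sketches (both hypothetical):

```python
def tree_prob(t, q):
    """p(t) under a PCFG: product of the probabilities of the rules used in t."""
    label, children = t[0], t[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return q[(label, (children[0],))]           # terminal rule X -> x
    p = q[(label, tuple(c[0] for c in children))]   # e.g. q[("S", ("NP", "VP"))]
    for c in children:
        p *= tree_prob(c, q)
    return p
```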

SLIDE 22

Estimating a PCFG from a Treebank

Given trees t(1) . . . t(N) in the training data

◮ N: all non-terminal symbols (constituents) seen in the data
◮ Σ: all terminal symbols (words) seen in the data
◮ R: all rules seen in the data
◮ S ∈ N: special start symbol (if the data does not already have it, add it to every tree)
◮ q: MLE estimate

q(α → β) = count(α → β) / ∑_{β′} count(α → β′)

If we see VP → Vt NP 10 times and VP 1000 times, then q(VP → Vt NP) = 0.01
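A sketch of this MLE as counting, assuming each training tree has already been flattened into its list of (lhs, rhs) rules (an assumed simplification, not the PTB format):

```python
from collections import Counter

def estimate_pcfg(trees):
    """MLE rule probabilities. trees: list of trees, each given as a list of
    (lhs, rhs) rules, e.g. ("VP", ("Vt", "NP"))."""
    rule_counts, lhs_counts = Counter(), Counter()
    for tree_rules in trees:
        for lhs, rhs in tree_rules:
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}
```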

SLIDE 23

Aside: Improper PCFG

A → A A with probability γ
A → a with probability 1 − γ

◮ Lemma. Define Sh = ∑_{t: height(t) ≤ h} p(t) and let S∗ = limh→∞ Sh. Then S∗ < 1 if γ > 0.5.

◮ The total probability of parses is less than one!

Fortunately, an MLE estimated from a finite treebank is never improper (aka "tight") (Chi and Geman, 1998): https://www.aclweb.org/anthology/J98-2005.pdf
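A quick numeric check of the lemma, assuming the height-bounded mass satisfies S1 = 1 − γ and Sh+1 = (1 − γ) + γ Sh² (the root either rewrites to a, or to A A with two independent subtrees of height at most h):

```python
# Iterate the recursion S_{h+1} = (1 - gamma) + gamma * S_h^2.
gamma = 0.6
S = 1 - gamma
for _ in range(1000):
    S = (1 - gamma) + gamma * S * S
print(S)  # converges to (1 - gamma) / gamma = 2/3 < 1 when gamma > 0.5
```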

SLIDE 24

Marginalization and Inference

GEN(x1 . . . xn) denotes the set of all valid derivations for x1 . . . xn under the considered PCFG.

1. What is the probability of x1 . . . xn under a PCFG?

∑_{t ∈ GEN(x1...xn)} p(t)

2. What is the most likely tree of x1 . . . xn under a PCFG?

arg max_{t ∈ GEN(x1...xn)} p(t)

SLIDE 25

Chomsky Normal Form (CNF)

From here on, always assume that a PCFG G = (N, Σ, R, S, q) is in CNF: meaning every α → β ∈ R is either

1. Binary non-terminal production: X → Y Z where X, Y, Z ∈ N
2. Unary terminal production: X → x where X ∈ N, x ∈ Σ

Not a big deal: can convert PCFG to equivalent CNF (and back)

SLIDE 26

Inside Algorithm: Bottom-Up Marginalization

◮ For 1 ≤ i ≤ j ≤ n, for all X ∈ N,

α(i, j, X) = ∑_{t ∈ GEN(xi...xj): root(t)=X} p(t)

We will see that computing each α(i, j, X) takes O(n |R|) time using dynamic programming.

◮ Return

α(1, n, S) = ∑_{t ∈ GEN(x1...xn)} p(t)

◮ Total runtime?

SLIDE 27

Base Case (i = j)

For i = 1 . . . n, for X ∈ N,

α(i, i, X) = q(X → xi)

SLIDE 28

Main Body (j = i + l)

For l = 1 . . . n − 1, for i = 1 . . . n − l (set j = i + l), for X ∈ N,

α(i, j, X) = ∑_{i ≤ k < j} ∑_{X → Y Z ∈ R} q(X → Y Z) × α(i, k, Y ) × α(k + 1, j, Z)
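Putting the base case and the main body together, a sketch of the inside algorithm (the rule and probability layout is an assumption for illustration):

```python
from collections import defaultdict

def inside(x, nonterminals, binary_rules, q):
    """Inside algorithm sketch for a PCFG in CNF.
    x: list of words; binary_rules: list of (X, Y, Z) triples;
    q: rule probabilities keyed as (X, (Y, Z)) and (X, (word,))."""
    n = len(x)
    alpha = defaultdict(float)
    for i in range(1, n + 1):                    # base case: alpha(i, i, X) = q(X -> x_i)
        for X in nonterminals:
            alpha[(i, i, X)] = q.get((X, (x[i - 1],)), 0.0)
    for l in range(1, n):                        # main body: spans with j = i + l
        for i in range(1, n - l + 1):
            j = i + l
            for (X, Y, Z) in binary_rules:
                alpha[(i, j, X)] += sum(
                    q[(X, (Y, Z))] * alpha[(i, k, Y)] * alpha[(k + 1, j, Z)]
                    for k in range(i, j)
                )
    return alpha                                 # alpha[(1, n, "S")] = p(x_1 ... x_n)
```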

SLIDE 29

CKY Parsing Algorithm: Bottom-Up Maximization

◮ For 1 ≤ i ≤ j ≤ n, for all X ∈ N,

π(i, j, X) = max_{t ∈ GEN(xi...xj): root(t)=X} p(t)

◮ Base: π(i, i, X) = q(X → xi)
◮ Main: π(i, j, X) = max_{i ≤ k < j, X → Y Z ∈ R} q(X → Y Z) × π(i, k, Y ) × π(k + 1, j, Z)
◮ We have

π(1, n, S) = max_{t ∈ GEN(x1...xn)} p(t)

The optimal tree can be retrieved by backtracking

SLIDE 30

CKY Backtracking

b(i, j, X) = arg max_{i ≤ k < j, X → Y Z ∈ R} q(X → Y Z) × π(i, k, Y ) × π(k + 1, j, Z)
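A sketch of CKY with backpointers, mirroring the inside sketch above but with max in place of sum (same assumed rule/probability layout):

```python
def cky(x, nonterminals, binary_rules, q, start="S"):
    """CKY sketch: max-product analogue of the inside algorithm, with backpointers."""
    n = len(x)
    pi, bp = {}, {}
    for i in range(1, n + 1):                    # base: pi(i, i, X) = q(X -> x_i)
        for X in nonterminals:
            pi[(i, i, X)] = q.get((X, (x[i - 1],)), 0.0)
    for l in range(1, n):
        for i in range(1, n - l + 1):
            j = i + l
            for X in nonterminals:
                pi[(i, j, X)], bp[(i, j, X)] = 0.0, None
            for (X, Y, Z) in binary_rules:
                for k in range(i, j):
                    score = q[(X, (Y, Z))] * pi[(i, k, Y)] * pi[(k + 1, j, Z)]
                    if score > pi[(i, j, X)]:
                        pi[(i, j, X)], bp[(i, j, X)] = score, (k, Y, Z)

    def build(i, j, X):                          # backtracking b(i, j, X); assumes a parse exists
        if i == j:
            return (X, x[i - 1])
        k, Y, Z = bp[(i, j, X)]
        return (X, build(i, k, Y), build(k + 1, j, Z))

    return pi[(1, n, start)], build(1, n, start)
```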

SLIDE 31

Computing Marginals Under PCFG

◮ Can we calculate something like

µ(i, j, X) = ∑_{t ∈ GEN(x1...xn): root(t,i,j)=X} p(t)

◮ Yes, by combining the inside algorithm with the outside algorithm

β(i, j, X) = ∑_{t ∈ OUT(xi...xj): foot(t)=X} p(t)

(Figure: an outside tree t whose foot non-terminal X spans xi . . . xj.)
SLIDE 32

Outside Algorithm: Top-Down Marginalization

◮ Base. β(1, n, S) = 1 and β(1, n, X) = 0 for all X ≠ S
◮ Main. For l = n − 2 . . . 0 (down to single-word spans), for i = 1 . . . n − l (set j = i + l), for X ∈ N,

β(i, j, X) = ∑_{j < k ≤ n} ∑_{Z → X Y ∈ R} β(i, k, Z) × α(j + 1, k, Y ) × q(Z → X Y)
           + ∑_{1 ≤ k < i} ∑_{Z → Y X ∈ R} β(k, j, Z) × α(k, i − 1, Y ) × q(Z → Y X)

(Figures: the two cases of the recursion. Left: X spans xi . . . xj and combines with a right sibling Y spanning xj+1 . . . xk under a parent Z. Right: X spans xi . . . xj and combines with a left sibling Y spanning xk . . . xi−1 under a parent Z.)
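A sketch of the outside recursion, reusing the inside table alpha from the earlier sketch (same assumed rule layout):

```python
from collections import defaultdict

def outside(x, nonterminals, binary_rules, q, alpha, start="S"):
    """Outside algorithm sketch; alpha is the table returned by the inside sketch above."""
    n = len(x)
    beta = defaultdict(float)
    beta[(1, n, start)] = 1.0                    # base case
    for l in range(n - 2, -1, -1):               # spans from large to small, down to single words
        for i in range(1, n - l + 1):
            j = i + l
            for X in nonterminals:
                total = 0.0
                for (Z, A, B) in binary_rules:
                    if A == X:                   # X is the left child: Z -> X B over span (i, k)
                        for k in range(j + 1, n + 1):
                            total += beta[(i, k, Z)] * alpha[(j + 1, k, B)] * q[(Z, (A, B))]
                    if B == X:                   # X is the right child: Z -> A X over span (k, j)
                        for k in range(1, i):
                            total += beta[(k, j, Z)] * alpha[(k, i - 1, A)] * q[(Z, (A, B))]
                beta[(i, j, X)] = total
    return beta
```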

SLIDE 33

Max Marginal Parsing

◮ Inside + outside: for 1 ≤ i ≤ j ≤ n, for all X ∈ N,

µ(i, j, X) = ∑_{t ∈ GEN(x1...xn): root(t,i,j)=X} p(t) = α(i, j, X) × β(i, j, X)

◮ New parsing objective: find the max marginal parse

t∗ = arg max_{t ∈ GEN(x1...xn)} ∑_{(i,j,X) ∈ t} µ(i, j, X)

◮ Labeled recall algorithm, O(n³ |N|) (Goodman, 1996):

γ(i, j) = max_X µ(i, j, X) + max_{i ≤ k < j} [γ(i, k) + γ(k + 1, j)]

◮ How many algorithms do we need for max marginal parsing??
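A sketch of the decoder implied by this recursion, assuming mu[(i, j, X)] has been computed from the inside-outside sketches above (the function and argument names are hypothetical):

```python
def max_marginal_parse(mu, n, nonterminals):
    """Labeled-recall style decoding: pick the binary tree maximizing the sum of
    span marginals mu[(i, j, X)], recording the best label and split per span."""
    gamma, best_label, best_split = {}, {}, {}
    for l in range(0, n):                        # spans from small to large
        for i in range(1, n - l + 1):
            j = i + l
            X = max(nonterminals, key=lambda Y: mu.get((i, j, Y), 0.0))
            best_label[(i, j)] = X
            gamma[(i, j)] = mu.get((i, j, X), 0.0)
            if i < j:
                k = max(range(i, j), key=lambda s: gamma[(i, s)] + gamma[(s + 1, j)])
                best_split[(i, j)] = k
                gamma[(i, j)] += gamma[(i, k)] + gamma[(k + 1, j)]
    return gamma[(1, n)], best_label, best_split
```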

SLIDE 34

Evaluating Parser Predictions

◮ Precision

p = (number of correctly predicted (i, j, X)) / (number of predicted (i, j, X))

◮ Recall

r = (number of correctly predicted (i, j, X)) / (number of ground-truth (i, j, X))

◮ Labeled F1

F1 = 2 × p × r / (p + r)

Can also consider unlabeled F1
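A small sketch of these metrics, assuming each tree has been reduced to its set of labeled spans (i, j, X):

```python
def labeled_f1(predicted, gold):
    """predicted and gold: sets of labeled spans (i, j, X) from the two trees."""
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r > 0 else 0.0
```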

SLIDE 35

Example

Precision 3/7 (42.9%), recall 3/8 (37.5%), labeled F1 40.0%

◮ What is the tagging accuracy?

SLIDE 36

Lexicalized PCFGs

◮ PCFG: extremely strong conditional independence assumption

(Figure: two different parse trees that receive the same probability.)

◮ Can consider lexicalizing the grammar (Collins, 2003)

SLIDE 37

Constituency Parsing Performance

Labeled precision/recall on PTB-WSJ

◮ Vanilla PCFG: 70.6% recall, 74.8% precision
◮ Lexicalized PCFG: 88.1% recall, 88.3% precision
◮ Neuralized constituency parsing (Kitaev and Klein, 2018): 94.9% recall, 95.4% precision

Neural encoding followed by max marginal decoding: no independence assumption (read the paper)
