 
              CS 533: Natural Language Processing Constituency Parsing Karl Stratos Rutgers University Karl Stratos CS 533: Natural Language Processing 1/37
Project Logistics
1. Proposal (due 3/24)
2. Milestone (due 4/15)
3. Presentation (tentatively 4/29)
4. Final report (due 5/4)

Possible Project Types (More Details in Template)
◮ Extend and apply recent machine learning methods to previously unconsidered NLP tasks
  ◮ Search last/this year’s AISTATS/ICLR/ICML/NeurIPS/UAI publications
◮ Extend a recent machine learning method in NLP
  ◮ Search last/this year’s ACL/CoNLL/EMNLP/NAACL
◮ Reimplement and replicate an existing technically challenging NLP paper from scratch
  ◮ No available public code!
◮ There may be a limited number of predefined projects
  ◮ No promise
  ◮ Priority given to those with higher assignment grades

Course Status
Covered (all in the context of neural networks)
◮ Language models and conditional language models
◮ Pretrained representations from language modeling, evaluation tasks (GLUE, SuperGLUE)
◮ Tagging, parsing (today)
Will cover before the proposal due date
◮ Latent-variable models (EM, VAEs)
◮ Information extraction (entity linking)
You will have enough background to read state-of-the-art research papers for your proposal.

HMM Loose Ends
◮ Recall the HMM
\[ p(x_1 \ldots x_n, y_1 \ldots y_n) = \prod_{i=1}^{n+1} t(y_i \mid y_{i-1}) \times \prod_{i=1}^{n} o(x_i \mid y_i) \]
◮ The forward algorithm computes, in $O(|\mathcal{Y}|^2 n)$ time,
\[ \alpha(i, y) = \sum_{y_1 \ldots y_i \in \mathcal{Y}^{i} :\; y_i = y} p(x_1 \ldots x_i, y_1 \ldots y_i) \]
◮ The backward algorithm computes, in $O(|\mathcal{Y}|^2 n)$ time,
\[ \beta(i, y) = \sum_{y_i \ldots y_n \in \mathcal{Y}^{n-i+1} :\; y_i = y} p(x_{i+1} \ldots x_n, y_{i+1} \ldots y_n \mid y_i) \]

Backward Algorithm
Base case. For $y \in \mathcal{Y}$,
\[ \beta(n, y) = t(\text{STOP} \mid y) \]
Main. For $i = n-1 \ldots 1$, for $y \in \mathcal{Y}$,
\[ \beta(i, y) = \sum_{y' \in \mathcal{Y}} t(y' \mid y) \times o(x_{i+1} \mid y') \times \beta(i+1, y') \]

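The recursion above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-tag HMM, its parameter values, and the names `TAGS`, `T`, `O`, and `backward` are hypothetical and do not come from the lecture.

```python
# Hypothetical toy HMM (values invented for illustration).
# T[(y_next, y_prev)] = t(y_next | y_prev), O[(word, y)] = o(word | y).
TAGS = ["D", "N"]
T = {("D", "D"): 0.1, ("N", "D"): 0.8, ("STOP", "D"): 0.1,
     ("D", "N"): 0.2, ("N", "N"): 0.3, ("STOP", "N"): 0.5}
O = {("the", "D"): 0.9, ("dog", "D"): 0.1,
     ("the", "N"): 0.2, ("dog", "N"): 0.8}


def backward(x):
    """Return beta with beta[i][y] = beta(i, y) for i = 1..n (1-indexed as on the slide)."""
    n = len(x)
    # Base case: beta(n, y) = t(STOP | y).
    beta = {n: {y: T[("STOP", y)] for y in TAGS}}
    # Main: i = n-1 ... 1. x[i] is x_{i+1} because Python lists are 0-indexed.
    for i in range(n - 1, 0, -1):
        beta[i] = {
            y: sum(T[(y2, y)] * O[(x[i], y2)] * beta[i + 1][y2] for y2 in TAGS)
            for y in TAGS
        }
    return beta


beta = backward(["the", "dog"])
print(beta[1])
```

Each `beta[i]` row costs $|\mathcal{Y}|^2$ multiplications and there are $n$ rows, which matches the $O(|\mathcal{Y}|^2 n)$ bound.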
Marginals
◮ Once forward and backward probabilities are computed, useful marginal probabilities can be computed.
◮ Debugging tip: $p(x_1 \ldots x_n) = \sum_{y} \alpha(i, y)\, \beta(i, y)$ for any $i$
◮ Marginal decoding: for $i = 1 \ldots n$ predict $y \in \mathcal{Y}$ with highest
\[ p(x_1 \ldots x_n, y_i = y) = \alpha(i, y) \times \beta(i, y) \]
◮ Tag pair conditional marginals (will be useful later)
\[ p(y_i = y, y_{i+1} = y' \mid x_1 \ldots x_n) = \frac{\alpha(i, y) \times t(y' \mid y) \times o(x_{i+1} \mid y') \times \beta(i+1, y')}{p(x_1 \ldots x_n)} \]

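A minimal sketch of these marginals, including the debugging check, might look as follows. The toy parameters, the start symbol `"*"`, and all function names are assumptions made for the example. The forward recursion used here is the standard one, $\alpha(1, y) = t(y \mid *)\, o(x_1 \mid y)$ and $\alpha(i, y) = o(x_i \mid y) \sum_{y'} \alpha(i-1, y')\, t(y \mid y')$, which the slides do not spell out.

```python
# Hypothetical toy HMM; "*" is an assumed start symbol.
# T[(y_next, y_prev)] = t(y_next | y_prev), O[(word, y)] = o(word | y).
TAGS = ["D", "N"]
T = {("D", "*"): 0.8, ("N", "*"): 0.2,
     ("D", "D"): 0.1, ("N", "D"): 0.8, ("STOP", "D"): 0.1,
     ("D", "N"): 0.2, ("N", "N"): 0.3, ("STOP", "N"): 0.5}
O = {("the", "D"): 0.9, ("dog", "D"): 0.1,
     ("the", "N"): 0.2, ("dog", "N"): 0.8}


def forward(x):
    alpha = {1: {y: T[(y, "*")] * O[(x[0], y)] for y in TAGS}}
    for i in range(2, len(x) + 1):
        alpha[i] = {y: O[(x[i - 1], y)] * sum(alpha[i - 1][yp] * T[(y, yp)] for yp in TAGS)
                    for y in TAGS}
    return alpha


def backward(x):
    n = len(x)
    beta = {n: {y: T[("STOP", y)] for y in TAGS}}
    for i in range(n - 1, 0, -1):
        beta[i] = {y: sum(T[(y2, y)] * O[(x[i], y2)] * beta[i + 1][y2] for y2 in TAGS)
                   for y in TAGS}
    return beta


def marginals(x):
    n = len(x)
    alpha, beta = forward(x), backward(x)
    # Debugging tip: the sum over y of alpha(i, y) * beta(i, y) is the same for every i.
    totals = [sum(alpha[i][y] * beta[i][y] for y in TAGS) for i in range(1, n + 1)]
    assert max(totals) - min(totals) < 1e-12
    px = totals[0]
    # Marginal decoding: pick the best tag at each position independently.
    decoded = [max(TAGS, key=lambda y: alpha[i][y] * beta[i][y]) for i in range(1, n + 1)]
    # Tag pair conditional marginals p(y_i = y, y_{i+1} = y2 | x).
    pair = {(i, y, y2): alpha[i][y] * T[(y2, y)] * O[(x[i], y2)] * beta[i + 1][y2] / px
            for i in range(1, n) for y in TAGS for y2 in TAGS}
    return px, decoded, pair


px, decoded, pair = marginals(["the", "dog"])
print(px, decoded)
```

For a fixed position $i$, the pair marginals sum to one over $(y, y')$, which is another cheap sanity check.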
Agenda
1. Parsing
2. PCFG: A model for constituency parsing
3. Transition-based dependency parsing
(Some slides adapted from Chris Manning, Mike Collins)

Constituency Parsing vs Dependency Parsing
Two formalisms for linguistic structure

Constituency Structure
Nested constituents/phrases
◮ Start by labeling words with POS tags
    (D the) (N dog) (V saw) (D the) (N cat)
◮ Recursively combine constituents according to rules
    (NP (D the) (N dog)) (V saw) (D the) (N cat)
    (NP (D the) (N dog)) (V saw) (NP (D the) (N cat))
    (NP (D the) (N dog)) (VP (V saw) (NP (D the) (N cat)))
    (S (NP (D the) (N dog)) (VP (V saw) (NP (D the) (N cat))))
Used rules: NP → D N, VP → V NP, S → NP VP

Dependency Structure
Labeled pairwise relations between words

Case for Parsing
◮ Most compelling example of latent structure in language
    the man saw the moon with a telescope
◮ Hypothesis: parsing is intimately related to intelligence
◮ Also some useful applications, e.g., relation extraction

Context-Free Grammar (CFG)
A tuple $G = (N, \Sigma, R, S)$
◮ $N$: non-terminal symbols (constituents)
◮ $\Sigma$: terminal symbols (words)
◮ $R$: rules of form $X \to Y_1 \ldots Y_m$ where $X \in N$, $Y_i \in N \cup \Sigma$
◮ $S \in N$: start symbol

Example CFG

Left-Most Derivation
Given a CFG $G = (N, \Sigma, R, S)$, a left-most derivation is a sequence of strings $s_1 \ldots s_n$ where
◮ $s_1 = S$
◮ For $i = 2 \ldots n$, $s_i = \text{ExpandLeftMostNonterminal}(s_{i-1})$
◮ $s_n \in \Sigma^*$ (aka “yield” of the derivation)

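As a rough illustration of this definition, the sketch below replays a left-most derivation for "the dog saw the cat". The grammar reuses the rules NP → D N, VP → V NP, S → NP VP from the constituency example; the lexical rules and the names `NONTERMINALS`, `expand_leftmost`, `derive`, and `RULES` are assumptions introduced for the example.

```python
# Toy grammar: phrase rules from the constituency example plus assumed lexical rules.
NONTERMINALS = {"S", "NP", "VP", "D", "N", "V"}


def expand_leftmost(string, rule):
    """Rewrite the left-most non-terminal in `string` using `rule` = (lhs, rhs)."""
    lhs, rhs = rule
    for pos, symbol in enumerate(string):
        if symbol in NONTERMINALS:
            if symbol != lhs:
                raise ValueError(f"left-most non-terminal is {symbol}, not {lhs}")
            return string[:pos] + list(rhs) + string[pos + 1:]
    raise ValueError("no non-terminal left to expand")


def derive(rules, start="S"):
    """Return the sequence s_1 ... s_n obtained by applying `rules` in order."""
    strings = [[start]]
    for rule in rules:
        strings.append(expand_leftmost(strings[-1], rule))
    return strings


RULES = [
    ("S", ["NP", "VP"]), ("NP", ["D", "N"]), ("D", ["the"]), ("N", ["dog"]),
    ("VP", ["V", "NP"]), ("V", ["saw"]),
    ("NP", ["D", "N"]), ("D", ["the"]), ("N", ["cat"]),
]
derivation = derive(RULES)
for s in derivation:
    print(" ".join(s))
```

Because only the left-most non-terminal may be expanded, the rule sequence determines the parse tree uniquely, which is why derivations and parse trees can be used interchangeably.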
Example

Ambiguity
Some strings have multiple valid derivations (i.e., parse trees).
Number of binary trees over $n+1$ leaves (the $n$-th Catalan number):
\[ C_n = \frac{1}{n+1} \binom{2n}{n} > 6.5 \text{ billion for } n = 20 \]

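The count is easy to check numerically. The helper name `catalan` is ours; `math.comb` is in the Python standard library from version 3.8.

```python
from math import comb


def catalan(n):
    """n-th Catalan number, C_n = binom(2n, n) / (n + 1), computed exactly with integers."""
    return comb(2 * n, n) // (n + 1)


for n in (1, 2, 3, 10, 20):
    print(n, catalan(n))
```

For $n = 20$ this gives 6,564,120,420 bracketings, before even choosing non-terminal labels, so enumerating parses is not an option.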
Rule-Based to Statistical
◮ Rule-based: manually construct some CFG that recognizes as many English strings as possible
  ◮ Never enough, no way to choose the right parse
◮ Statistical: annotate sentences with parse trees (aka treebank) and learn a statistical model to disambiguate

Treebanks
◮ Standard setup: WSJ portion of Penn Treebank
  ◮ 40,000 trees for training
  ◮ 1,700 trees for validation
  ◮ 2,400 trees for evaluation
◮ Building a treebank vs building a grammar?
  ◮ Broad coverage, more natural annotation
  ◮ Contains distributional information of English
  ◮ Can be used to evaluate parsers

Probabilistic Context-Free Grammar (PCFG)
A tuple $G = (N, \Sigma, R, S, q)$
◮ $N$: non-terminal symbols (constituents)
◮ $\Sigma$: terminal symbols (words)
◮ $R$: rules of form $X \to Y_1 \ldots Y_m$ where $X \in N$, $Y_i \in N \cup \Sigma$
◮ $S \in N$: start symbol
◮ $q$: rule probability $q(\alpha \to \beta) \ge 0$ for every rule $\alpha \to \beta \in R$ such that $\sum_{\beta} q(X \to \beta) = 1$ for any $X \in N$

Probability of a Tree Under PCFG

Estimating a PCFG from a Treebank
Given trees $t^{(1)} \ldots t^{(N)}$ in the training data
◮ $N$: all non-terminal symbols (constituents) seen in the data
◮ $\Sigma$: all terminal symbols (words) seen in the data
◮ $R$: all rules seen in the data
◮ $S \in N$: special start symbol (if the data does not already have it, add it to every tree)
◮ $q$: MLE estimate
\[ q(\alpha \to \beta) = \frac{\text{count}(\alpha \to \beta)}{\sum_{\beta'} \text{count}(\alpha \to \beta')} \]
If we see VP → Vt NP 10 times and VP 1000 times, then $q(\text{VP} \to \text{Vt NP}) = 0.01$

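The count-and-normalize estimate can be sketched as below. The two-tree "treebank", the nested-tuple tree encoding, and the names `rules_of` and `estimate_q` are all assumptions made for the example, not a real treebank format.

```python
from collections import Counter

# Hypothetical tree encoding: (label, child, child, ...) for internal nodes,
# (tag, word) with a string child for pre-terminals.
TREEBANK = [
    ("S", ("NP", ("D", "the"), ("N", "dog")),
          ("VP", ("V", "saw"), ("NP", ("D", "the"), ("N", "cat")))),
    ("S", ("NP", ("D", "the"), ("N", "cat")),
          ("VP", ("V", "slept"))),
]


def rules_of(tree):
    """Yield every rule (lhs, rhs) used in the tree, where rhs is a tuple of symbols."""
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        yield (label, (children[0],))  # pre-terminal rewriting to a word
        return
    yield (label, tuple(child[0] for child in children))
    for child in children:
        yield from rules_of(child)


def estimate_q(trees):
    """MLE: q(lhs -> rhs) = count(lhs -> rhs) / count of all rules with the same lhs."""
    rule_count = Counter(rule for tree in trees for rule in rules_of(tree))
    lhs_count = Counter()
    for (lhs, _), c in rule_count.items():
        lhs_count[lhs] += c
    return {rule: c / lhs_count[rule[0]] for rule, c in rule_count.items()}


q = estimate_q(TREEBANK)
for (lhs, rhs), p in sorted(q.items()):
    print(f"{lhs} -> {' '.join(rhs)}: {p:.3f}")
```

By construction the estimates for each left-hand side sum to one, so the result satisfies the normalization constraint on $q$.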
Aside: Improper PCFG
    A → A A with probability $\gamma$
    A → a with probability $1 - \gamma$
Lemma. Define $S_h = \sum_{t :\, \text{height}(t) \le h} p(t)$. Let $S^* = \lim_{h \to \infty} S_h$. Then $S^* < 1$ if $\gamma > 0.5$.
◮ Total probability of parses is less than one!
Fortunately, an MLE from a finite treebank is never improper, i.e., it is always “tight” (Chi and Geman, 1998)
https://www.aclweb.org/anthology/J98-2005.pdf

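The lemma can be checked numerically with a short sketch. It rests on a recursion that is not on the slide: a tree of height at most $h+1$ is either A → a, or A → A A with two subtrees of height at most $h$, so $S_{h+1} = (1 - \gamma) + \gamma S_h^2$ with $S_0 = 0$. We assume here that the tree A → a has height 1. The function name `total_mass` is hypothetical.

```python
def total_mass(gamma, max_height):
    """Total probability S_h of all trees with height <= max_height."""
    s = 0.0  # S_0: no tree has height <= 0
    for _ in range(max_height):
        # Either stop with A -> a, or branch with A -> A A into two smaller trees.
        s = (1.0 - gamma) + gamma * s * s
    return s


for gamma in (0.3, 0.6, 0.9):
    print(gamma, total_mass(gamma, 500))
```

The fixed points of this recursion are $1$ and $(1 - \gamma)/\gamma$, and the iteration from $0$ converges to the smaller one. With $\gamma = 0.6$ the limit is $2/3$, so a third of the probability mass is lost to infinite derivations.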
Marginalization and Inference
$\text{GEN}(x_1 \ldots x_n)$ denotes the set of all valid derivations for $x_1 \ldots x_n$ under the considered PCFG.
1. What is the probability of $x_1 \ldots x_n$ under a PCFG?
\[ \sum_{t \in \text{GEN}(x_1 \ldots x_n)} p(t) \]
2. What is the most likely tree of $x_1 \ldots x_n$ under a PCFG?
\[ \arg\max_{t \in \text{GEN}(x_1 \ldots x_n)} p(t) \]

Chomsky Normal Form (CNF)
From here on, always assume that a PCFG $G = (N, \Sigma, R, S, q)$ is in CNF, meaning every $\alpha \to \beta \in R$ is either
1. Binary non-terminal production: $X \to Y\, Z$ where $X, Y, Z \in N$
2. Unary terminal production: $X \to x$ where $X \in N$, $x \in \Sigma$
Not a big deal: can convert PCFG to equivalent CNF (and back)

Inside Algorithm: Bottom-Up Marginalization
◮ For $1 \le i \le j \le n$, for all $X \in N$,
\[ \alpha(i, j, X) = \sum_{t \in \text{GEN}(x_i \ldots x_j) :\; \text{root}(t) = X} p(t) \]
  We will see that computing each $\alpha(i, j, X)$ takes $O(n|R|)$ time using dynamic programming.
◮ Return
\[ \alpha(1, n, S) = \sum_{t \in \text{GEN}(x_1 \ldots x_n)} p(t) \]
◮ Total runtime?

Base Case ($i = j$)
For $i = 1 \ldots n$, for $X \in N$,
\[ \alpha(i, i, X) = q(X \to x_i) \]

Main Body ($j = i + l$)
For $l = 1 \ldots n-1$, for $i = 1 \ldots n-l$ (set $j = i + l$), for $X \in N$,
\[ \alpha(i, j, X) = \sum_{i \le k < j} \; \sum_{X \to Y Z \in R} q(X \to Y\, Z) \times \alpha(i, k, Y) \times \alpha(k+1, j, Z) \]

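The base case and main body can be put together as in the sketch below. It is a minimal illustration under stated assumptions: the dictionary encoding of rules (`binary[(X, Y, Z)]`, `lexical[(X, word)]`) and the function name `inside` are ours. The test grammar is the one-symbol grammar A → A A, A → a, here with $\gamma = 0.4$, so the answer can be verified by counting trees.

```python
from collections import defaultdict


def inside(words, binary, lexical, start="S"):
    """Sum of p(t) over all parse trees of `words` rooted at `start` (CNF grammar).

    binary[(X, Y, Z)] = q(X -> Y Z), lexical[(X, w)] = q(X -> w).
    """
    n = len(words)
    alpha = defaultdict(float)  # missing entries count as probability 0
    # Base case: alpha(i, i, X) = q(X -> x_i).
    for i in range(1, n + 1):
        for (X, w), q in lexical.items():
            if w == words[i - 1]:
                alpha[(i, i, X)] = q
    # Main body: spans of increasing length l, split point k, rule X -> Y Z.
    for l in range(1, n):
        for i in range(1, n - l + 1):
            j = i + l
            for (X, Y, Z), q in binary.items():
                for k in range(i, j):
                    alpha[(i, j, X)] += q * alpha[(i, k, Y)] * alpha[(k + 1, j, Z)]
    return alpha[(1, n, start)]


GAMMA = 0.4
BINARY = {("A", "A", "A"): GAMMA}
LEXICAL = {("A", "a"): 1 - GAMMA}
print(inside(["a"] * 3, BINARY, LEXICAL, start="A"))
```

A string of three a's has two binary trees, each using two binary rules and three lexical rules, so the result should be $2 \gamma^2 (1-\gamma)^3$. The loops over $l$, $i$, and $k$ together with the loop over rules give $O(n^3 |R|)$ time overall.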
CKY Parsing Algorithm: Bottom-Up Maximization
◮ For $1 \le i \le j \le n$, for all $X \in N$,
\[ \pi(i, j, X) = \max_{t \in \text{GEN}(x_i \ldots x_j) :\; \text{root}(t) = X} p(t) \]
◮ Base: $\pi(i, i, X) = q(X \to x_i)$
◮ Main:
\[ \pi(i, j, X) = \max_{i \le k < j,\; X \to Y Z \in R} q(X \to Y\, Z) \times \pi(i, k, Y) \times \pi(k+1, j, Z) \]
◮ We have
\[ \pi(1, n, S) = \max_{t \in \text{GEN}(x_1 \ldots x_n)} p(t) \]
The optimal tree can be retrieved by backtracking

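A sketch of CKY with backpointers follows, run on the PP-attachment sentence from the "Case for Parsing" slide. The grammar and its probabilities are invented for illustration, and the rule encoding and the name `cky` are assumptions. Trees are returned as nested tuples `(X, left, right)` or `(X, word)`.

```python
def cky(words, binary, lexical, start="S"):
    """Return (max probability, best tree) for `words` under a CNF PCFG."""
    n = len(words)
    pi, back = {}, {}
    # Base: pi(i, i, X) = q(X -> x_i).
    for i in range(1, n + 1):
        for (X, w), q in lexical.items():
            if w == words[i - 1]:
                pi[(i, i, X)] = q
    # Main: maximize over split point k and rule X -> Y Z, remembering the argmax.
    for l in range(1, n):
        for i in range(1, n - l + 1):
            j = i + l
            for (X, Y, Z), q in binary.items():
                for k in range(i, j):
                    score = q * pi.get((i, k, Y), 0.0) * pi.get((k + 1, j, Z), 0.0)
                    if score > pi.get((i, j, X), 0.0):
                        pi[(i, j, X)] = score
                        back[(i, j, X)] = (k, Y, Z)

    def build(i, j, X):
        # Follow backpointers from the root down to the words.
        if i == j:
            return (X, words[i - 1])
        k, Y, Z = back[(i, j, X)]
        return (X, build(i, k, Y), build(k + 1, j, Z))

    if (1, n, start) not in pi:
        return 0.0, None  # no parse
    return pi[(1, n, start)], build(1, n, start)


# Hypothetical grammar with two attachment sites for the PP.
BINARY = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 0.6, ("VP", "VP", "PP"): 0.4,
          ("NP", "D", "N"): 0.8, ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0}
LEXICAL = {("D", "the"): 0.5, ("D", "a"): 0.5, ("V", "saw"): 1.0, ("P", "with"): 1.0,
           ("N", "man"): 0.4, ("N", "moon"): 0.3, ("N", "telescope"): 0.3}

prob, tree = cky("the man saw the moon with a telescope".split(), BINARY, LEXICAL)
print(prob)
print(tree)
```

With these made-up probabilities the verb attachment (VP → VP PP) beats the noun attachment (NP → NP PP) because 0.4 > 0.2 while every other rule is shared by both parses. Replacing `max` with a sum and dropping the backpointers recovers the inside algorithm.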