SLIDE 1 Parsing II
Anjalie Field – CMU Slides adapted from: Dan Klein – UC Berkeley Taylor Berg-Kirkpatrick, Yulia Tsvetkov, Maria Ryskina – CMU
Algorithms for NLP
SLIDE 2
Overview: CKY in the Wild
▪ Recap of CKY
▪ Extension to PCFGs
▪ Learning PCFGs from a Treebank
▪ Tree annotations
▪ Speeding up
SLIDE 3
Syntactic Parsing
▪ INPUT:
▪ The move followed a round of similar increases by other lenders, reflecting a continuing decline in that market
▪ OUTPUT:
SLIDE 4 Context Free Grammar (CFG)
Grammar (CFG) Lexicon
ROOT → S
S → NP VP
NP → DT NN
NP → NN NNS
NP → NP PP
VP → VBP NP
VP → VBP NP PP
PP → IN NP
NN → interest
NNS → raises
VBP → interest
VBZ → raises
…
▪ If our CFG is in Chomsky Normal Form, we can use the CKY algorithm to find a parse tree for a sentence
▪ All rules must be of the form:
▪ [Non-terminal] → [Non-terminal] [Non-terminal]
▪ [Non-terminal] → [Terminal]
SLIDE 5 Parsing with CKY
Preterminal rules Inner rules
SLIDE 6 Preterminal rules Inner rules
Chart (aka parsing triangle)
SLIDES 7–22 [animation: the chart is filled span by span; preterminal rules score single-word spans, then inner rules combine adjacent spans]
mid=1
SLIDE 24 Preterminal rules Inner rules
mid=2
SLIDE 25 Preterminal rules Inner rules
Apparently the sentence is ambiguous under this grammar (the grammar overgenerates)
SLIDE 26 Preterminal rules Inner rules
How can we tell which parse is better?
SLIDE 27
PCFGs
[figure: the CFG and lexicon from before, now with a probability attached to each rule]
SLIDE 28 CKY with PCFGs
▪ Chart is represented by a 3d array of floats chart[min][max][label]
▪ It stores probabilities for the most probable subtree with a given signature
▪ chart[0][n][S] will store the probability of the most probable full parse tree
SLIDE 29
Intuition
For every C, choose C1, C2, and mid such that P(C → C1 C2) · P(T1) · P(T2) is maximal, where T1 and T2 are the left and right subtrees.
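The recurrence can be sketched in Python. The chart indexing follows the slides' chart[min][max][label]; the dict-based grammar encoding (lexicon and binary rules) is an assumption for illustration, not from the slides:

```python
def pcfg_cky(words, lexicon, binary_rules):
    """Viterbi CKY for a PCFG in CNF.

    chart[lo][hi][label] holds the probability of the most probable
    subtree with signature (lo, hi, label), as on the slides.
    Grammar encoding (an illustrative assumption):
      lexicon:      word -> list of (preterminal, prob)
      binary_rules: (C1, C2) -> list of (parent C, rule prob)
    """
    n = len(words)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]

    # Preterminal rules fill spans of length 1
    for i, w in enumerate(words):
        for tag, p in lexicon.get(w, []):
            if p > chart[i][i + 1].get(tag, 0.0):
                chart[i][i + 1][tag] = p

    # Inner rules: for every C -> C1 C2 and every midpoint, keep the
    # maximal product P(C -> C1 C2) * P(T1) * P(T2)
    for length in range(2, n + 1):
        for lo in range(n - length + 1):
            hi = lo + length
            for mid in range(lo + 1, hi):
                for c1, p1 in chart[lo][mid].items():
                    for c2, p2 in chart[mid][hi].items():
                        for parent, pr in binary_rules.get((c1, c2), []):
                            cand = pr * p1 * p2
                            if cand > chart[lo][hi].get(parent, 0.0):
                                chart[lo][hi][parent] = cand
    return chart
```

With a toy grammar where "sat" is tagged directly as VP, chart[0][n][S] then holds the probability of the best full parse.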
SLIDE 30
Implementation: preterminal rules
SLIDE 31 Implementation: binary rules
SLIDE 32
Recovery of the tree
▪ For each signature we store backpointers to the elements from which it was built
▪ start recovering from [0, n, S]
▪ What backpointers do we store?
SLIDE 33
Recovery of the tree
▪ For each signature we store backpointers to the elements from which it was built
▪ start recovering from [0, n, S]
▪ What backpointers do we store?
▪ rule
▪ for binary rules, midpoint
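These two backpointers are enough to rebuild the tree. A sketch of CKY with backpointers, storing for each signature the rule used and, for binary rules, the midpoint (grammar encoding is an illustrative assumption, as before):

```python
def cky_with_backpointers(words, lexicon, binary_rules):
    """Viterbi CKY that records, for each signature, which rule produced
    the best score and, for binary rules, at which midpoint."""
    n = len(words)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    back = [[{} for _ in range(n + 1)] for _ in range(n + 1)]

    for i, w in enumerate(words):
        for tag, p in lexicon.get(w, []):
            if p > chart[i][i + 1].get(tag, 0.0):
                chart[i][i + 1][tag] = p
                back[i][i + 1][tag] = ("lex", w)      # lexical rule tag -> word

    for length in range(2, n + 1):
        for lo in range(n - length + 1):
            hi = lo + length
            for mid in range(lo + 1, hi):
                for c1, p1 in chart[lo][mid].items():
                    for c2, p2 in chart[mid][hi].items():
                        for parent, pr in binary_rules.get((c1, c2), []):
                            cand = pr * p1 * p2
                            if cand > chart[lo][hi].get(parent, 0.0):
                                chart[lo][hi][parent] = cand
                                # backpointer: the rule and the midpoint
                                back[lo][hi][parent] = ("bin", c1, c2, mid)
    return chart, back

def recover_tree(back, lo, hi, label):
    """Follow backpointers from a signature down to a bracketed tree;
    start recovering from [0, n, S]."""
    entry = back[lo][hi][label]
    if entry[0] == "lex":
        return (label, entry[1])
    _, c1, c2, mid = entry
    return (label, recover_tree(back, lo, mid, c1),
                   recover_tree(back, mid, hi, c2))
```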
SLIDE 34
Recap
▪ Given a PCFG:
▪ For a new sentence, we can use CKY to compute all possible parse trees under our grammar
▪ We can trace back through our CKY chart to find the best parse of the sentence
▪ But where do we get a PCFG?
SLIDE 35
Learning PCFGs from A TreeBank
SLIDE 36 Treebank PCFGs
▪ Can we use CKY to parse sentences according to this grammar?
S → NP VP 1
NP → DT JJ NN NN 1
VP → VBD 1
…
(S (NP (DT The) (JJ fat) (NN house) (NN cat)) (VP (VBD sat)))
▪ We can take a grammar straight off a tree, using counts to estimate probabilities
SLIDE 37 Treebank PCFGs
▪ Vanilla CKY only allows binary rules
S → NP VP 1
NP → DT JJ NN NN 1
VP → VBD 1
…
(S (NP (DT The) (JJ fat) (NN house) (NN cat)) (VP (VBD sat)))
▪ We can take a grammar straight off a tree, using counts to estimate probabilities
SLIDE 38 Option 1: Binarize the Grammar
S → NP VP
NP → DT JJ NN NN
VP → VBD
(S (NP (DT The) (JJ fat) (NN house) (NN cat)) (VP (VBD sat)))
Binarized:
S → NP VP
NP → DT @NP[DT]
@NP[DT] → JJ @NP[DT,JJ]
@NP[DT,JJ] → NN NN
▪ Introduce cleverly-named intermediate symbols that we can undo later
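A minimal sketch of Option 1 for a single n-ary rule, using the slide's naming convention for the intermediate symbols so the transform can be undone later:

```python
def binarize_rule(parent, children):
    """Binarize one n-ary rule (len(children) >= 2) by introducing
    intermediate symbols named after the parent and the children seen so
    far, mirroring the @NP[DT,JJ] convention on the slide. Returns a
    list of binary rules (parent, (left_child, right_child))."""
    if len(children) == 2:
        return [(parent, tuple(children))]
    rules, left, seen = [], parent, []
    for child in children[:-2]:
        seen.append(child)
        inter = "@{}[{}]".format(parent, ",".join(seen))
        rules.append((left, (child, inter)))
        left = inter
    # The last two children form the final binary rule
    rules.append((left, (children[-2], children[-1])))
    return rules
```

Applying it to NP → DT JJ NN NN yields exactly the three binary rules shown above; undoing the transform means splicing out any symbol starting with "@".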
SLIDE 39 Option 2: Binarize the Tree
Original tree:
(S (NP (DT The) (JJ fat) (NN house) (NN cat)) (VP (VBD sat)))
Binarized tree:
(S (NP (DT The) (@NP[DT] (JJ fat) (@NP[DT,JJ] (NN house) (@NP[DT,JJ,NN] (NN cat))))) (VP (VBD sat)))
▪ Can we use CKY to parse sentences according to the grammar pulled from this tree?
SLIDE 40 CKY: Modifications for Unary Rules
Binary Rules:
S → NP VP
NP → DT @NP[DT]
@NP[DT] → JJ @NP[DT,JJ]
@NP[DT,JJ] → NN @NP[DT,JJ,NN]
Unary Rules:
VP → VBD
@NP[DT,JJ,NN] → NN
(S (NP (DT The) (@NP[DT] (JJ fat) (@NP[DT,JJ] (NN house) (@NP[DT,JJ,NN] (NN cat))))) (VP (VBD sat)))
SLIDE 41
CKY: Incorporate Unary Rules
▪ Binary chart: stores the scores of non-terminals after applying binary rules
▪ Fill by applying rules to elements of the unary chart
▪ Unary chart: stores the scores of non-terminals after applying unary rules
▪ Fill by applying rules to elements of the binary chart
[Also need Unary Closure to handle chains]
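A sketch of the unary update for one span, assuming unary rules are encoded as child → list of (parent, prob); this encoding is illustrative. A real implementation precomputes the unary closure once; here chains like A → B → C are handled by bounded iteration instead:

```python
def apply_unaries(cell, unary_rules, max_chain=3):
    """One unary-chart update for a single span: given the binary-chart
    scores in `cell` (label -> prob), repeatedly apply unary rules and
    keep the best score per symbol. Iterating handles unary chains; a
    precomputed unary closure would do this in one step."""
    scores = dict(cell)
    for _ in range(max_chain):            # bounded instead of a true closure
        updated = False
        for child, p_child in list(scores.items()):
            for parent, pr in unary_rules.get(child, []):
                cand = pr * p_child
                if cand > scores.get(parent, 0.0):
                    scores[parent] = cand
                    updated = True
        if not updated:
            break
    return scores
```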
SLIDE 42 CKY with TreeBank PCFG
▪ With these modifications, given a treebank we can:
▪ Binarize the trees
▪ Learn a PCFG from the binarized trees
▪ Use the unary-binary chart variant of CKY to obtain parse trees for new sentences
▪ Does this work?
[Charniak 96]
SLIDE 43
Parsing evaluation
▪ Intrinsic evaluation:
▪ Automatic: evaluate against annotation provided by human experts (gold standard) according to some predefined measure
▪ Manual: … according to human judgment
▪ Extrinsic evaluation: score syntactic representation by comparing how well a system using this representation performs on some task
▪ E.g., use syntactic representation as input for a semantic analyzer and compare results of the analyzer using syntax predicted by different parsers.
SLIDE 44
Standard evaluation setting in parsing
▪ Automatic intrinsic evaluation is used: parsers are evaluated against a gold standard provided by linguists
▪ There is a standard split into three parts:
▪ training set: used for estimating model parameters
▪ development set: used for tuning the model (initial experiments)
▪ test set: final experiments to compare against previous work
SLIDE 45 Automatic evaluation of constituent parsers
▪ Exact match: percentage of trees predicted correctly
▪ Bracket score: scores how well individual phrases (and their boundaries) are identified
The most standard measure; we will focus on it
SLIDE 46 Brackets scores
▪ The most standard score is the bracket score
▪ It regards a tree as a collection of brackets:
▪ The set of brackets predicted by a parser is compared against the set of brackets in the tree annotated by a linguist
▪ Precision, recall, and F1 are used as scores
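As a minimal sketch, the comparison can be implemented directly over collections of labeled spans (label, start, end). Extracting spans from trees, and evalb details such as punctuation handling and label equivalences, are omitted:

```python
def bracket_f1(gold, pred):
    """Labeled bracket precision, recall, and F1 over (label, start, end)
    spans. Simplification: spans are treated as sets; the standard evalb
    tool additionally handles duplicates and ignorable labels."""
    gold_set, pred_set = set(gold), set(pred)
    matched = len(gold_set & pred_set)
    precision = matched / len(pred_set) if pred_set else 0.0
    recall = matched / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```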
SLIDE 47 Typical Experimental Setup
▪ Corpus: Penn Treebank, WSJ
▪ Accuracy – F1: harmonic mean of per-node labeled precision and recall
▪ Here: also size – number of symbols in the grammar
Training: sections 02-21
Development: section 22 (here, first 20 files)
Test: section 23
SLIDE 48 CKY with TreeBank PCFG
▪ With these modifications, given a treebank we can:
▪ Binarize the trees
▪ Learn a PCFG from the binarized trees
▪ Use the unary-binary chart variant of CKY to obtain parse trees for new sentences
▪ Does this work?
Model     F1
Baseline  72.0
[Charniak 96]
SLIDE 49
Preview: F1 bracket score
SLIDE 50
Model Assumptions
▪ Place Invariance
▪ The probability of a subtree does not depend on where in the string the words it dominates are
▪ Context-free
▪ The probability of a subtree does not depend on words not dominated by the subtree
▪ Ancestor-free
▪ The probability of a subtree does not depend on nodes in the derivation outside the tree
SLIDE 51
Model Assumptions
▪ We can relax some of these assumptions by enriching our grammar
▪ We’re already doing this in binarization
▪ Structural Annotation [Johnson ’98, Klein&Manning ’03]
▪ Enrich with features about surrounding nodes
▪ Lexicalization [Collins ’99, Charniak ’00]
▪ Enrich with word features
▪ Latent Variable Grammars [Matsuzaki et al. ‘05, Petrov et al. ’06]
SLIDE 52
Grammar Refinement
▪ Structural Annotation [Johnson ’98, Klein&Manning ’03]
▪ Lexicalization [Collins ’99, Charniak ’00]
▪ Latent Variables [Matsuzaki et al. ’05, Petrov et al. ’06]
SLIDE 53
Structural Annotation
SLIDE 54
Ancestor-free assumption
▪ Not every NP expansion can fill every NP slot
SLIDE 55
Ancestor-free assumption
▪ Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects)
▪ Also: the subject and object expansions are correlated!
[chart: expansion distributions for all NPs vs. NPs under S vs. NPs under VP]
SLIDE 56
Parent Annotation
▪ Annotation refines base treebank symbols to improve statistical fit of the grammar
SLIDE 57 Parent Annotation
▪ Why stop at 1 parent?
SLIDE 58 Vertical Markovization
▪ Vertical Markov order: rewrites depend on the past k ancestor nodes (cf. parent annotation)
Order 1 Order 2
SLIDE 59 Back to our binarized tree
(S (NP (DT The) (@NP[DT] (JJ fat) (@NP[DT,JJ] (NN house) (@NP[DT,JJ,NN] (NN cat))))) (VP (VBD sat)))
▪ How much parent annotation are we doing?
SLIDE 60 Back to our binarized tree
(S (NP (DT The) (@NP[DT] (JJ fat) (@NP[DT,JJ] (NN house) (@NP[DT,JJ,NN] (NN cat))))) (VP (VBD sat)))
▪ Are we doing any other annotation?
SLIDE 61 Back to our binarized tree
(S (NP (DT The) (@NP[DT] (JJ fat) (@NP[DT,JJ] (NN house) (@NP[DT,JJ,NN] (NN cat))))) (VP (VBD sat)))
▪ We’re remembering nodes to the left
▪ If we call parent annotation “vertical”, then this is “horizontal”
SLIDE 62 Horizontal Markovization
Order 1 Order ∞
SLIDE 63 Binarization / Markovization
NP → DT JJ NN NN

v=1, h=∞:
(NP DT (@NP[DT] JJ (@NP[DT,JJ] NN (@NP[DT,JJ,NN] NN))))

v=1, h=0:
(NP DT (@NP JJ (@NP NN (@NP NN))))

v=1, h=1:
(NP DT (@NP[DT] JJ (@NP[…,JJ] NN (@NP[…,NN] NN))))
SLIDE 64 Binarization / Markovization
NP → DT JJ NN NN

v=2, h=∞:
(NP^VP DT^NP (@NP^VP[DT] JJ^NP (@NP^VP[DT,JJ] NN^NP (@NP^VP[DT,JJ,NN] NN^NP))))

v=2, h=0:
(NP^VP DT^NP (@NP^VP JJ^NP (@NP^VP NN^NP (@NP^VP NN^NP))))

v=2, h=1:
(NP^VP DT^NP (@NP^VP[DT] JJ^NP (@NP^VP[…,JJ] NN^NP (@NP^VP[…,NN] NN^NP))))
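These variants can be generated mechanically. The sketch below binarizes a single rule while annotating intermediate symbols with at most h previous siblings (h=None for order ∞) and an optional parent annotation for v=2. The symbol naming mirrors the slides' @NP^VP[…,JJ] convention; annotating the child categories themselves (DT^NP, etc.) is omitted for brevity:

```python
def markovize_rule(parent, children, h=None, grand=None):
    """Binarize one n-ary rule (len(children) >= 2) with horizontal
    markovization order h (None means order infinity) and an optional
    parent annotation `grand` (vertical order 2 when given)."""
    def sym(seen):
        # Name of an intermediate symbol remembering at most h siblings
        base = "@" + parent + ("^" + grand if grand else "")
        if h == 0 or not seen:
            return base
        if h is None or h >= len(seen):
            keep, dots = seen, ""
        else:
            keep, dots = seen[-h:], "…,"   # elide forgotten siblings
        return base + "[" + dots + ",".join(keep) + "]"

    head = parent + ("^" + grand if grand else "")
    rules, left, seen = [], head, []
    for child in children[:-2]:
        seen.append(child)
        rules.append((left, (child, sym(seen))))
        left = sym(seen)
    rules.append((left, (children[-2], children[-1])))
    return rules
```

With h=0 every intermediate symbol collapses to @NP, reproducing the v=1, h=0 tree above; h=1 reproduces the @NP[DT], @NP[…,JJ] pattern.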
SLIDE 65
A Fully Annotated (Unlex) Tree
SLIDE 66 Some Test Set Results
▪ Beats “first generation” lexicalized parsers.
▪ Lots of room to improve – more complex models next.
Parser         LP    LR    F1    CB    0 CB
Magerman 95    84.9  84.6  84.7  1.26  56.6
Collins 96     86.3  85.8  86.0  1.14  59.9
Unlexicalized  86.9  85.7  86.3  1.10  60.3
Charniak 97    87.4  87.5  87.4  1.00  62.1
Collins 99     88.7  88.6  88.6  0.90  67.1
SLIDE 67
Beyond Structured Annotation: Lexicalization and Latent Variable Grammars
SLIDE 68 ▪ Annotation refines base treebank symbols to improve statistical fit of the grammar
▪ Structural annotation [Johnson ’98, Klein and Manning ’03]
▪ Head lexicalization [Collins ’99, Charniak ’00]
The Game of Designing a Grammar
SLIDE 69 Problems with PCFGs
▪ If we do no annotation, these trees differ only in one rule:
▪ VP → VP PP
▪ NP → NP PP
▪ Parse will go one way or the other, regardless of words
▪ We addressed this in one way with unlexicalized grammars (how?)
▪ Lexicalization allows us to be sensitive to specific words
SLIDE 70
SLIDE 71
Grammar Refinement
▪ Example: PP attachment
SLIDE 72
Problems with PCFGs
▪ What’s different between basic PCFG scores here?
▪ What (lexical) correlations need to be scored?
SLIDE 73 Lexicalized Trees
▪ Add “head words” to each phrasal node
▪ Syntactic vs. semantic heads
▪ Headship not in (most) treebanks
▪ Usually use head rules, e.g.:
▪ NP:
▪ Take leftmost NP
▪ Take rightmost N*
▪ Take rightmost JJ
▪ Take right child
▪ VP:
▪ Take leftmost VB*
▪ Take leftmost VP
▪ Take left child
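Head rules like these are easy to sketch as a percolation table. The table below encodes only the two example rule lists from the slide, interpreting N* and VB* as label prefixes; real head-rule tables (e.g., in Collins ’99) cover every category and are much larger, so this is purely illustrative:

```python
def find_head(label, children):
    """Return the head child of a local tree, trying each (direction,
    label-prefix) rule in order; an empty prefix matches any child
    ("take left/right child"). Table covers only NP and VP here."""
    rules = {
        "NP": [("left", "NP"), ("right", "N"), ("right", "JJ"), ("right", "")],
        "VP": [("left", "VB"), ("left", "VP"), ("left", "")],
    }
    for direction, prefix in rules.get(label, [("left", "")]):
        order = children if direction == "left" else list(reversed(children))
        for child in order:
            if child.startswith(prefix):
                return child
    return children[0]
```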
SLIDE 74 Some Test Set Results
▪ Beats “first generation” lexicalized parsers.
▪ Lots of room to improve – more complex models next.
Parser         LP    LR    F1    CB    0 CB
Magerman 95    84.9  84.6  84.7  1.26  56.6
Collins 96     86.3  85.8  86.0  1.14  59.9
Unlexicalized  86.9  85.7  86.3  1.10  60.3
Charniak 97    87.4  87.5  87.4  1.00  62.1
Collins 99     88.7  88.6  88.6  0.90  67.1
SLIDE 75 ▪ Annotation refines base treebank symbols to improve statistical fit of the grammar
▪ Parent annotation [Johnson ’98]
▪ Head lexicalization [Collins ’99, Charniak ’00]
▪ Automatic clustering?
The Game of Designing a Grammar
SLIDE 76
Latent Variable Grammars
[figure: a latent variable grammar relates the observed sentence, its parse tree, the hidden derivations, and the grammar parameters]
SLIDE 77 Learned Splits
▪ Proper Nouns (NNP):
▪ Personal pronouns (PRP):
NNP-14: Oct. Nov. Sept.
NNP-12: John Robert James
NNP-2:  J. E. L.
NNP-1:  Bush Noriega Peters
NNP-15: New San Wall
NNP-3:  York Francisco Street
PRP-0:  It He I
PRP-1:  it he they
PRP-2:  it them him
SLIDE 78 ▪ Relative adverbs (RBR): ▪ Cardinal Numbers (CD):
RBR-0: further lower higher
RBR-1: more less More
RBR-2: earlier Earlier later
CD-7:  two Three
CD-4:  1989 1990 1988
CD-11: million billion trillion
CD-0:  1 50 100
CD-3:  1 30 31
CD-9:  78 58 34
Learned Splits
SLIDE 79 Final Results (Accuracy)
     Model                                F1 (≤ 40 words)  F1 (all)
ENG  Charniak&Johnson ’05 (generative)    90.1             89.6
ENG  Split / Merge                        90.6             90.1
GER  Dubey ’05                            76.3
GER  Split / Merge                        80.8             80.1
CHN  Chiang et al. ’02                    80.0             76.6
CHN  Split / Merge                        86.3             83.4
Still higher numbers from reranking / self-training methods
SLIDE 80
Efficient Parsing for Structural Annotation
SLIDE 81
Overview: Coarse-to-Fine
▪ We’ve introduced a lot of new symbols in our grammar: do we always need to consider all of them?
▪ Motivation:
▪ If any NP is unlikely to span these words, then NP^S[DT], NP^VB[DT], NP^S[JJ], etc. are all unlikely
▪ High level:
▪ First pass: compute the probability that a coarse symbol spans these words
▪ Second pass: parse as usual, but skip fine symbols that correspond to improbable coarse symbols
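A minimal sketch of the second-pass filter, assuming the coarse-pass posteriors for the span have already been computed; the function names and grammar encoding are hypothetical:

```python
def prune_with_coarse(fine_symbols, project, coarse_posterior, threshold=1e-4):
    """Keep a fine symbol for a span only if the posterior probability of
    its coarse projection clears the threshold.

    project:          fine symbol -> coarse symbol (e.g. "NP^S" -> "NP")
    coarse_posterior: coarse symbol -> P(symbol spans these words | sentence)
    """
    return [sym for sym in fine_symbols
            if coarse_posterior.get(project(sym), 0.0) >= threshold]
```

Here, if NP is unlikely over a span, every refinement NP^S, NP^VP, etc. is skipped in the fine pass.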
SLIDE 82
Defining Coarse/Fine Grammars
▪ [Charniak et al. 2006]
▪ level 0: ROOT vs. not-ROOT
▪ level 1: argument vs. modifier (i.e., two nontrivial nonterminals)
▪ level 2: four major phrasal categories (verbal, nominal, adjectival, and prepositional phrases)
▪ level 3: all standard Penn Treebank categories
▪ Our version: stop at 2 passes
SLIDE 83 Grammar Projections
NP → DT @NP
Coarse Grammar Fine Grammar
Coarse: (NP DT (@NP JJ (@NP NN (@NP NN))))
Fine: (NP^VP DT^NP (@NP^VP[DT] JJ^NP (@NP^VP[…,JJ] NN^NP (@NP^VP[…,NN] NN^NP))))
NP^VP → DT^NP @NP^VP[DT]
Note: X-Bar Grammars are projections with rules like XP → Y @X or XP → @X Y or @X → X
SLIDE 84 Grammar Projections
Coarse Symbols: NP, DT, @NP
Fine Symbols: NP^VP, NP^S; DT^NP; @NP^VP[DT], @NP^S[DT], @NP^VP[…,JJ], @NP^S[…,JJ]
SLIDE 85 Coarse-to-Fine Pruning
For each coarse chart item X[i,j], compute posterior probability P(X at [i,j] | sentence):
E.g., consider the span 5 to 12: coarse symbols (… QP NP VP …) whose posterior falls below a threshold are pruned, along with all of their fine refinements.
SLIDE 86
Notation
▪ Non-terminal symbols (latent variables): N^1, …, N^k
▪ Sentence (observed data): w_1 … w_m
▪ N^j_{p,q} denotes that non-terminal N^j spans words w_p … w_q in the sentence
SLIDE 87 Inside probability
Definition (compare with backward prob for HMMs):
β_j(p,q) = P(w_p … w_q | N^j_{p,q})
Computed recursively.
Base case: β_j(k,k) = P(N^j → w_k)
Induction: β_j(p,q) = Σ_{r,s} Σ_{d=p}^{q−1} P(N^j → N^r N^s) · β_r(p,d) · β_s(d+1,q)
(The grammar is binarized.)
SLIDE 88 Implementation: PCFG parsing
double total = 0.0
SLIDE 89 Implementation: inside
double total = 0.0 double total = 0.0 total = total + candidate
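The implementation differs from Viterbi CKY in one way: scores are accumulated with total = total + candidate instead of taking a max. A runnable sketch of the inside pass, with the same dict-based grammar encoding as the earlier CKY sketch (an illustrative assumption):

```python
def inside(words, lexicon, binary_rules):
    """Inside probabilities for a binarized PCFG:
    beta[lo][hi][A] = P(words[lo:hi] | A spans them), summing over all
    rules and midpoints rather than maximizing."""
    n = len(words)
    beta = [[{} for _ in range(n + 1)] for _ in range(n + 1)]

    # Base case: preterminal rules on single-word spans
    for i, w in enumerate(words):
        for tag, p in lexicon.get(w, []):
            beta[i][i + 1][tag] = beta[i][i + 1].get(tag, 0.0) + p

    # Induction: sum over rules A -> B C and midpoints
    for length in range(2, n + 1):
        for lo in range(n - length + 1):
            hi = lo + length
            cell = beta[lo][hi]
            for mid in range(lo + 1, hi):
                for b, pb in beta[lo][mid].items():
                    for c, pc in beta[mid][hi].items():
                        for a, pr in binary_rules.get((b, c), []):
                            cell[a] = cell.get(a, 0.0) + pr * pb * pc
    return beta
```

With the toy grammar A → A A (prob 0.5) over "a a a", the two bracketings of the string each contribute 0.25, so beta[0][3][A] sums to 0.5, illustrating the summation over derivations.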
SLIDE 92
Inside probability: example
SLIDES 93–96 continue the inside-probability example step by step.
SLIDE 97 Outside probability
Definition (compare with forward prob for HMMs):
α_j(p,q) = P(w_1 … w_{p−1}, N^j_{p,q}, w_{q+1} … w_m)
The joint probability of starting with S, generating words w_1 … w_{p−1}, the non-terminal N^j_{p,q}, and words w_{q+1} … w_m.
SLIDE 98 Calculating outside probability
Computed recursively; base case: α_j(1,m) = 1 if N^j = S, and 0 otherwise.
Induction? Intuition: N^j_{p,q} must be either the left or the right child of a parent node. We first consider the case when it is the left child.
SLIDE 99 Calculating outside probability
The yellow area is the probability we would like to calculate
How do we decompose it?
SLIDE 100 Calculating outside probability
Step 1: We assume that N^f_{p,e} is the parent of N^j_{p,q}. Its outside probability, α_f(p,e) (represented by the yellow shading), is available recursively. But how do we compute the green part?
SLIDE 101 Calculating outside probability
Step 2: The red shaded area is the inside probability of the sibling N^g_{q+1,e}, i.e. β_g(q+1,e)
SLIDE 102 Calculating outside probability
Step 3: The blue shaded area is just the production N^f → N^j N^g, with the corresponding probability P(N^f → N^j N^g)
SLIDE 103 Calculating outside probability
If we multiply the terms together, we have the joint probability corresponding to the yellow, red, and blue areas, assuming N^j_{p,q} was the left child of N^f, and given fixed non-terminals f and g, as well as a fixed partition e. What if we do not want to assume this?
SLIDE 104 Calculating outside probability
The joint probability corresponding to the yellow, red, and blue areas, assuming N^j_{p,q} was the left child of some non-terminal:
Σ_{f,g} Σ_{e=q+1}^{m} α_f(p,e) · P(N^f → N^j N^g) · β_g(q+1,e)
SLIDE 105 Calculating outside probability
The joint probability corresponding to the yellow, red, and blue areas, assuming N^j_{p,q} was the right child of some non-terminal:
Σ_{f,g} Σ_{e=1}^{p−1} α_f(e,q) · P(N^f → N^g N^j) · β_g(e,p−1)
SLIDE 106 Calculating outside probability
The final joint probability (the sum over the left- and right-child cases):
α_j(p,q) = Σ_{f,g} Σ_{e=q+1}^{m} α_f(p,e) · P(N^f → N^j N^g) · β_g(q+1,e) + Σ_{f,g} Σ_{e=1}^{p−1} α_f(e,q) · P(N^f → N^g N^j) · β_g(e,p−1)
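The same recursion can be sketched top-down: starting from the base case at the root, each parent span pushes a contribution down to its left child (weighted by the right sibling's inside probability) and to its right child (weighted by the left sibling's), which is equivalent to the sums over e, f, g above. The grammar encoding matches the earlier illustrative sketches:

```python
def outside(words, binary_rules, beta, root="S"):
    """Outside probabilities alpha[lo][hi][A], computed from the inside
    table `beta` (as produced by the inside sketch), largest spans first."""
    n = len(words)
    alpha = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    alpha[0][n][root] = 1.0                       # base case at the root
    for length in range(n, 1, -1):                # parents before children
        for lo in range(n - length + 1):
            hi = lo + length
            for a, a_out in alpha[lo][hi].items():
                for mid in range(lo + 1, hi):
                    for b, pb in beta[lo][mid].items():
                        for c, pc in beta[mid][hi].items():
                            for parent, pr in binary_rules.get((b, c), []):
                                if parent != a:
                                    continue
                                # B as left child: outside(parent) * rule * inside(right sibling)
                                alpha[lo][mid][b] = alpha[lo][mid].get(b, 0.0) + a_out * pr * pc
                                # C as right child: symmetric, with the left sibling's inside prob
                                alpha[mid][hi][c] = alpha[mid][hi].get(c, 0.0) + a_out * pr * pb
    return alpha
```

Combined with the inside table, this yields the span posteriors used by coarse-to-fine pruning, since P(A at [p,q] | sentence) ∝ α_A(p,q) · β_A(p,q).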
SLIDE 108
Is C2F an Improvement?
▪ Does coarse-to-fine pruning improve accuracy?
▪ If your threshold is too high, it might throw away correct parses
▪ Does coarse-to-fine pruning improve speed?
▪ Maybe: if your threshold is too low, pruning might not be very useful