Algorithms for NLP: Parsing II
Anjalie Field, CMU
Slides adapted from: Dan Klein (UC Berkeley); Taylor Berg-Kirkpatrick, Yulia Tsvetkov, Maria Ryskina (CMU)


slide-1
SLIDE 1

Parsing II

Anjalie Field – CMU Slides adapted from: Dan Klein – UC Berkeley Taylor Berg-Kirkpatrick, Yulia Tsvetkov, Maria Ryskina – CMU

Algorithms for NLP

slide-2
SLIDE 2

Overview: CKY in the Wild

▪ Recap of CKY

▪ Extension to PCFGs

▪ Learning PCFGs from a Treebank

▪ Tree annotations

▪ Speeding up

slide-3
SLIDE 3

Syntactic Parsing

▪ INPUT:

▪ The move followed a round of similar increases by other lenders, reflecting a continuing decline in that market

▪ OUTPUT:

slide-4
SLIDE 4

Context Free Grammar (CFG)

Grammar (CFG):

ROOT → S
S → NP VP
NP → DT NN
NP → NN NNS
NP → NP PP
VP → VBP NP
VP → VBP NP PP
PP → IN NP

Lexicon:

NN → interest
NNS → raises
VBP → interest
VBZ → raises
…

▪ If our CFG is in Chomsky Normal Form, we can use the CKY algorithm to find a parse tree for a sentence

▪ All rules must be of the form:

▪ [Non-terminal] → [Non-terminal] [Non-terminal]

▪ [Non-terminal] → [Terminal]
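These constraints make recognition simple: a cell's label set only ever combines two smaller cells, or a single word. A minimal CKY recognizer for a toy CNF grammar can be sketched as follows (the grammar, words, and function name are invented for illustration, not taken from the slides):

```python
from collections import defaultdict

def cky_recognize(words, lexicon, binary):
    """CKY recognition for a grammar in Chomsky Normal Form.
    lexicon: dict word -> set of preterminals  ([NT] -> [Terminal])
    binary:  dict (B, C) -> set of parents A   ([NT] -> [NT] [NT])
    Returns the set of labels that span the whole sentence."""
    n = len(words)
    chart = defaultdict(set)          # chart[i, j] = labels over words[i:j]
    for i, w in enumerate(words):     # preterminal rules
        chart[i, i + 1] = set(lexicon.get(w, ()))
    for span in range(2, n + 1):      # inner rules, smaller spans first
        for i in range(n - span + 1):
            j = i + span
            for mid in range(i + 1, j):
                for B in chart[i, mid]:
                    for C in chart[mid, j]:
                        chart[i, j] |= binary.get((B, C), set())
    return chart[0, n]

# Toy CNF grammar (hypothetical):
lexicon = {"the": {"DT"}, "cat": {"NN"}, "sat": {"VP"}}
binary = {("DT", "NN"): {"NP"}, ("NP", "VP"): {"S"}}
```

`cky_recognize(["the", "cat", "sat"], lexicon, binary)` returns `{"S"}`, i.e. the sentence is in the language of this toy grammar.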

slide-5
SLIDE 5

Parsing with CKY

Preterminal rules Inner rules

slide-6
SLIDE 6

Preterminal rules Inner rules

Chart (aka parsing triangle)

SLIDES 7–24

Preterminal rules Inner rules

[Animation: the CKY chart is filled bottom-up, cell by cell; for spans longer than one word, every possible midpoint is tried (mid=1, mid=2, …).]

slide-25
SLIDE 25

Preterminal rules Inner rules

The sentence turns out to be ambiguous under this grammar (the grammar overgenerates).

slide-26
SLIDE 26

Preterminal rules Inner rules

How can we tell which parse is better?

slide-27
SLIDE 27

PCFGs

[Figure: the grammar and lexicon from before, now with a probability attached to each rule; the probabilities of rules with the same left-hand side sum to 1.]

slide-28
SLIDE 28

CKY with PCFGs

▪ Chart is represented by a 3d array of floats chart[min][max][label]

▪ It stores probabilities for the most probable subtree with a given signature

▪ chart[0][n][S] will store the probability of the most probable full parse tree
slide-29
SLIDE 29

Intuition

For every C, choose C1, C2 and mid such that P(C → C1 C2) · P(T1) · P(T2) is maximal, where T1 and T2 are the best left and right subtrees.
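The maximization above can be sketched directly in code; the toy lexicon and rule probabilities below are invented for illustration:

```python
def viterbi_cky(words, lexicon, rules):
    """Probabilistic CKY: chart[i][j][A] = probability of the most
    probable subtree with signature (i, j, A).
    lexicon: dict (A, word) -> P(A -> word)
    rules:   dict (A, B, C) -> P(A -> B C)"""
    n = len(words)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):             # preterminal rules
        for (A, word), p in lexicon.items():
            if word == w and p > chart[i][i + 1].get(A, 0.0):
                chart[i][i + 1][A] = p
    for span in range(2, n + 1):              # inner rules
        for i in range(n - span + 1):
            j = i + span
            for mid in range(i + 1, j):
                for (A, B, C), p in rules.items():
                    # candidate = P(A -> B C) * P(T1) * P(T2)
                    cand = (p * chart[i][mid].get(B, 0.0)
                              * chart[mid][j].get(C, 0.0))
                    if cand > chart[i][j].get(A, 0.0):
                        chart[i][j][A] = cand
    return chart

# Illustrative probabilities (not from the slides):
lexicon = {("DT", "the"): 1.0, ("NN", "cat"): 0.5, ("VP", "sat"): 1.0}
rules = {("NP", "DT", "NN"): 0.4, ("S", "NP", "VP"): 1.0}
chart = viterbi_cky(["the", "cat", "sat"], lexicon, rules)
```

Here `chart[0][3]["S"]` comes out to 1.0 · (0.4 · 1.0 · 0.5) · 1.0 = 0.2, the score of the best full parse.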

slide-30
SLIDE 30

Implementation: preterminal rules

slide-31
SLIDE 31

Implementation: binary rules


slide-32
SLIDE 32

Recovery of the tree

▪ For each signature we store backpointers to the elements from which it was built

▪ start recovering from [0, n, S]

▪ What backpointers do we store?

slide-33
SLIDE 33

Recovery of the tree

▪ For each signature we store backpointers to the elements from which it was built

▪ start recovering from [0, n, S]

▪ What backpointers do we store?

▪ the rule used

▪ for binary rules, the midpoint
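A sketch of how recovery might look, assuming backpointers are stored as just described (the `back` table below is hypothetical):

```python
def build_tree(back, i, j, label):
    """Recover the best parse from backpointers.
    For preterminals, back[(i, j, A)] is the word itself; for binary
    rules it is (mid, B, C): the rule used and the midpoint."""
    entry = back[(i, j, label)]
    if isinstance(entry, str):        # preterminal: A -> word
        return (label, entry)
    mid, B, C = entry                 # binary: A -> B C, split at mid
    return (label,
            build_tree(back, i, mid, B),
            build_tree(back, mid, j, C))

# Hypothetical backpointers for "the cat sat":
back = {
    (0, 3, "S"): (2, "NP", "VP"),
    (0, 2, "NP"): (1, "DT", "NN"),
    (0, 1, "DT"): "the",
    (1, 2, "NN"): "cat",
    (2, 3, "VP"): "sat",
}
```

Starting from the whole-sentence signature, `build_tree(back, 0, 3, "S")` reconstructs `("S", ("NP", ("DT", "the"), ("NN", "cat")), ("VP", "sat"))`.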

slide-34
SLIDE 34

Recap

▪ Given a PCFG:

▪ For a new sentence, we can use CKY to compute all possible parse trees under our grammar

▪ We can trace back through our CKY chart to find the best parse of the sentence

▪ But where do we get a PCFG?

slide-35
SLIDE 35

Learning PCFGs from A TreeBank

slide-36
SLIDE 36

Treebank PCFGs

▪ Can we use CKY to parse sentences according to this grammar?

S → NP VP 1
NP → DT JJ NN NN 1
VP → VBD 1
…

[Tree: (S (NP (DT The) (JJ fat) (NN house) (NN cat)) (VP (VBD sat)))]

▪ We can take a grammar straight off a tree, using counts to estimate probabilities
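Reading a grammar straight off trees amounts to counting rules and normalizing per left-hand side: P(A → rhs) = count(A → rhs) / count(A). A minimal sketch, with trees encoded as nested tuples (an assumed encoding, not from the slides):

```python
from collections import Counter

def estimate_pcfg(trees):
    """MLE rule probabilities: P(A -> rhs) = count(A -> rhs) / count(A).
    Trees are nested tuples, e.g. ("NP", ("DT", "the"), ("NN", "cat"))."""
    rule_counts, lhs_counts = Counter(), Counter()
    def visit(node):
        label, children = node[0], node[1:]
        if len(children) == 1 and isinstance(children[0], str):
            rhs = (children[0],)              # preterminal rule A -> word
        else:
            rhs = tuple(child[0] for child in children)
            for child in children:
                visit(child)
        rule_counts[label, rhs] += 1
        lhs_counts[label] += 1
    for tree in trees:
        visit(tree)
    return {rule: count / lhs_counts[rule[0]]
            for rule, count in rule_counts.items()}

# Two toy NP subtrees: each NP expansion gets probability 1/2.
probs = estimate_pcfg([
    ("NP", ("DT", "the"), ("NN", "cat")),
    ("NP", ("NN", "interest")),
])
```

With these two trees, `probs["NP", ("DT", "NN")]` and `probs["NP", ("NN",)]` are each 0.5, while `probs["DT", ("the",)]` is 1.0.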

slide-37
SLIDE 37

Treebank PCFGs

▪ Vanilla CKY only allows binary rules

S → NP VP 1
NP → DT JJ NN NN 1
VP → VBD 1
…

[Tree: (S (NP (DT The) (JJ fat) (JJ orange) (NN cat)) (VP (VBD sat)))]

▪ We can take a grammar straight off a tree, using counts to estimate probabilities

slide-38
SLIDE 38

Option 1: Binarize the Grammar

S → NP VP
NP → DT JJ NN NN
VP → VBD

[Tree: (S (NP (DT The) (JJ fat) (NN house) (NN cat)) (VP (VBD sat)))]

S → NP VP
S → NP #[V VBD]
NP → DT @NP[DT]
@NP[DT] → JJ @NP[DT JJ]
@NP[DT JJ] → NN NN

▪ Introduce cleverly-named intermediate symbols that we can undo later
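The intermediate-symbol construction can be sketched as a small helper (the naming convention follows the slide; the function name and rule encoding are invented):

```python
def binarize_rule(lhs, rhs):
    """Binarize A -> X1 ... Xk (k > 2) with intermediate symbols
    like @A[X1] that record the children generated so far,
    so the transformation can be undone later."""
    if len(rhs) <= 2:
        return [(lhs, tuple(rhs))]
    rules, left, seen = [], lhs, []
    for i in range(len(rhs) - 2):
        seen.append(rhs[i])
        new = "@%s[%s]" % (lhs, ",".join(seen))
        rules.append((left, (rhs[i], new)))   # left -> X_i @A[...]
        left = new
    rules.append((left, (rhs[-2], rhs[-1])))  # last two children
    return rules
```

For example, `binarize_rule("NP", ["DT", "JJ", "NN", "NN"])` produces the chain NP → DT @NP[DT], @NP[DT] → JJ @NP[DT,JJ], @NP[DT,JJ] → NN NN.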

slide-39
SLIDE 39

Option 2: Binarize the Tree

[Figure: the tree for “The fat house cat sat”, with the NP binarized: NP → DT @NP[DT], @NP[DT] → JJ @NP[DT,JJ], @NP[DT,JJ] → NN @NP[DT,JJ,NN], @NP[DT,JJ,NN] → NN]

▪ Can we use CKY to parse sentences according to the grammar pulled from this tree?

slide-40
SLIDE 40

CKY: Modifications for Unary Rules

Binary Rules:
S → NP VP
NP → DT @NP[DT]
@NP[DT] → JJ @NP[DT JJ]
@NP[DT JJ] → NN @NP[DT,JJ,NN]

Unary Rules:
VP → VBD
@NP[DT,JJ,NN] → NN

[Figure: the binarized tree for “The fat house cat sat”]

slide-41
SLIDE 41

CKY: Incorporate Unary Rules

▪ Binary chart: stores the scores of non-terminals after applying binary rules

▪ Fill by applying rules to elements of the unary chart

▪ Unary chart: stores the scores of non-terminals after applying unary rules

▪ Fill by applying rules to elements of the binary chart

[Also need Unary Closure to handle chains]
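One simple way to apply unary rules to a chart cell is to iterate until no score improves, which handles chains like A → B → C (a naive alternative to precomputing the unary closure; the cell contents and rule probabilities below are illustrative):

```python
def apply_unaries(cell, unary_rules):
    """Apply unary rules to one chart cell (cell: label -> best prob).
    Iterates to a fixed point so that chains A -> B -> C are handled;
    with rule probabilities <= 1 the loop always terminates."""
    changed = True
    while changed:
        changed = False
        for (A, B), p in unary_rules.items():
            if B in cell and p * cell[B] > cell.get(A, 0.0):
                cell[A] = p * cell[B]
                changed = True
    return cell

cell = {"NN": 0.5}                                       # after binary pass
unary_rules = {("@NP", "NN"): 0.8, ("NP", "@NP"): 0.5}   # illustrative
apply_unaries(cell, unary_rules)
```

After the pass, the chain NN → @NP → NP gives scores 0.5 · 0.8 = 0.4 for @NP and 0.4 · 0.5 = 0.2 for NP.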

slide-42
SLIDE 42

CKY with TreeBank PCFG

▪ With these modifications, given a treebank we can:

▪ Binarize the trees

▪ Learn a PCFG from the binarized trees

▪ Use the unary-binary chart variant of CKY to obtain parse trees for new sentences

▪ Does this work?

[Charniak 96]

slide-43
SLIDE 43

Parsing evaluation

▪ Intrinsic evaluation:

▪ Automatic: evaluate against annotation provided by human experts (gold standard) according to some predefined measure

▪ Manual: … according to human judgment

▪ Extrinsic evaluation: score syntactic representation by comparing how well a system using this representation performs on some task

▪ E.g., use syntactic representation as input for a semantic analyzer and compare results of the analyzer using syntax predicted by different parsers.

slide-44
SLIDE 44

Standard evaluation setting in parsing

▪ Automatic intrinsic evaluation is used: parsers are evaluated against a gold standard provided by linguists

▪ There is a standard split into parts:

▪ training set: used for estimation of model parameters

▪ development set: used for tuning the model (initial experiments)

▪ test set: final experiments to compare against previous work

slide-45
SLIDE 45

Automatic evaluation of constituent parsers

▪ Exact match: percentage of trees predicted correctly

▪ Bracket score: scores how well individual phrases (and their boundaries) are identified

The most standard measure; we will focus on it

slide-46
SLIDE 46

Brackets scores

▪ The most standard score is bracket score

▪ It regards a tree as a collection of brackets:

▪ The set of brackets predicted by a parser is compared against the set of brackets in the tree annotated by a linguist

▪ Precision, recall and F1 are used as scores
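A minimal sketch of labeled bracket scoring, with each tree reduced to a set of (label, start, end) spans (an assumed encoding, not from the slides):

```python
def bracket_f1(gold, pred):
    """Labeled bracket score: trees reduced to sets of (label, i, j).
    Returns (precision, recall, F1)."""
    correct = len(gold & pred)                    # brackets in both trees
    p = correct / len(pred) if pred else 0.0      # precision
    r = correct / len(gold) if gold else 0.0      # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0    # harmonic mean
    return p, r, f1

gold = {("S", 0, 3), ("NP", 0, 2), ("VP", 2, 3)}  # linguist's brackets
pred = {("S", 0, 3), ("NP", 0, 2), ("VP", 1, 3)}  # parser's brackets
p, r, f1 = bracket_f1(gold, pred)
```

Here 2 of 3 predicted brackets match (the VP boundary is wrong), so precision, recall, and F1 are all 2/3.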

slide-47
SLIDE 47

Typical Experimental Setup

▪ Corpus: Penn Treebank, WSJ

▪ Accuracy – F1: harmonic mean of per-node labeled precision and recall

▪ Here: also size – number of symbols in grammar

Training: sections 02-21
Development: section 22 (here, first 20 files)
Test: section 23

slide-48
SLIDE 48

CKY with TreeBank PCFG

▪ With these modifications, given a treebank we can:

▪ Binarize the trees

▪ Learn a PCFG from the binarized trees

▪ Use the unary-binary chart variant of CKY to obtain parse trees for new sentences

▪ Does this work?

Model      F1
Baseline   72.0

[Charniak 96]

slide-49
SLIDE 49

Preview: F1 bracket score

slide-50
SLIDE 50

Model Assumptions

▪ Place Invariance

▪ The probability of a subtree does not depend on where in the string the words it dominates are

▪ Context-free

▪ The probability of a subtree does not depend on words not dominated by the subtree

▪ Ancestor-free

▪ The probability of a subtree does not depend on nodes in the derivation outside the tree

slide-51
SLIDE 51

Model Assumptions

▪ We can relax some of these assumptions by enriching our grammar

▪ We’re already doing this in binarization

▪ Structural Annotation [Johnson ’98, Klein&Manning ’03]

▪ Enrich with features about surrounding nodes

▪ Lexicalization [Collins ’99, Charniak ’00]

▪ Enrich with word features

▪ Latent Variable Grammars [Matsuzaki et al. ‘05, Petrov et al. ’06]

slide-52
SLIDE 52

Grammar Refinement

▪ Structural Annotation [Johnson ’98, Klein&Manning ’03]

▪ Lexicalization [Collins ’99, Charniak ’00]

▪ Latent Variables [Matsuzaki et al. ’05, Petrov et al. ’06]

slide-53
SLIDE 53

Structural Annotation

slide-54
SLIDE 54

Ancestor-free assumption

▪ Not every NP expansion can fill every NP slot

slide-55
SLIDE 55

Ancestor-free assumption

▪ Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects).

▪ Also: the subject and object expansions are correlated!

[Chart: NP expansion distributions for all NPs vs. NPs under S vs. NPs under VP]

slide-56
SLIDE 56

Parent Annotation

▪ Annotation refines base treebank symbols to improve statistical fit of the grammar

slide-57
SLIDE 57

Parent Annotation

▪ Why stop at 1 parent?


slide-58
SLIDE 58

Vertical Markovization

▪ Vertical Markov order: rewrites depend on past k ancestor nodes (cf. parent annotation)

Order 1 Order 2

slide-59
SLIDE 59

Back to our binarized tree

[Figure: the binarized tree for “The fat house cat sat”]

▪ How much parent annotation are we doing?

slide-60
SLIDE 60

Back to our binarized tree

[Figure: the binarized tree for “The fat house cat sat”]

▪ Are we doing any other structured annotation?

slide-61
SLIDE 61

Back to our binarized tree

[Figure: the binarized tree for “The fat house cat sat”]

▪ We’re remembering nodes to the left

▪ If we call parent annotation “vertical”, then this is “horizontal”

slide-62
SLIDE 62

Horizontal Markovization

Order 1 Order ∞

slide-63
SLIDE 63

Binarization / Markovization

NP → DT JJ NN NN

v=1, h=∞: NP → DT @NP[DT]; @NP[DT] → JJ @NP[DT,JJ]; @NP[DT,JJ] → NN @NP[DT,JJ,NN]; @NP[DT,JJ,NN] → NN

v=1, h=0: NP → DT @NP; @NP → JJ @NP; @NP → NN @NP; @NP → NN

v=1, h=1: NP → DT @NP[DT]; @NP[DT] → JJ @NP[…,JJ]; @NP[…,JJ] → NN @NP[…,NN]; @NP[…,NN] → NN
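The naming of intermediate symbols under different horizontal orders can be sketched as a small helper (the function name is invented; the ellipsis convention follows the slide):

```python
def intermediate_symbol(parent, seen, h=None):
    """Name of the @-symbol given the children generated so far.
    h = horizontal Markov order; h=None means order infinity."""
    if h == 0:
        return "@" + parent               # e.g. @NP: remember nothing
    kept = list(seen) if h is None else list(seen[-h:])
    if h is not None and len(seen) > h:
        kept = ["…"] + kept               # forgotten history, as on the slide
    return "@%s[%s]" % (parent, ",".join(kept))
```

For the third intermediate symbol of NP → DT JJ NN NN: order ∞ gives `@NP[DT,JJ,NN]`, order 0 gives `@NP`, and order 1 gives `@NP[…,NN]`.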

slide-64
SLIDE 64

Binarization / Markovization

NP → DT JJ NN NN, with parent annotation (v=2):

v=2, h=∞: NP^VP → DT^NP @NP^VP[DT]; @NP^VP[DT] → JJ^NP @NP^VP[DT,JJ]; @NP^VP[DT,JJ] → NN^NP @NP^VP[DT,JJ,NN]; @NP^VP[DT,JJ,NN] → NN^NP

v=2, h=0: as above, but every intermediate symbol is just @NP^VP

v=2, h=1: as above, with intermediate symbols @NP^VP[DT], @NP^VP[…,JJ], @NP^VP[…,NN]

slide-65
SLIDE 65

A Fully Annotated (Unlex) Tree

slide-66
SLIDE 66

Some Test Set Results

▪ Beats “first generation” lexicalized parsers.

▪ Lots of room to improve – more complex models next.

Parser         LP    LR    F1    CB    0 CB
Magerman 95    84.9  84.6  84.7  1.26  56.6
Collins 96     86.3  85.8  86.0  1.14  59.9
Unlexicalized  86.9  85.7  86.3  1.10  60.3
Charniak 97    87.4  87.5  87.4  1.00  62.1
Collins 99     88.7  88.6  88.6  0.90  67.1

slide-67
SLIDE 67

Beyond Structured Annotation: Lexicalization and Latent Variable Grammars

slide-68
SLIDE 68

▪ Annotation refines base treebank symbols to improve statistical fit of the grammar

▪ Structural annotation [Johnson ’98, Klein and Manning ’03]

▪ Head lexicalization [Collins ’99, Charniak ’00]

The Game of Designing a Grammar

slide-69
SLIDE 69

Problems with PCFGs

▪ If we do no annotation, these trees differ only in one rule:

▪ VP → VP PP

▪ NP → NP PP

▪ Parse will go one way or the other, regardless of words

▪ We addressed this in one way with unlexicalized grammars (how?)

▪ Lexicalization allows us to be sensitive to specific words

slide-70
SLIDE 70
slide-71
SLIDE 71

Grammar Refinement

▪ Example: PP attachment

slide-72
SLIDE 72

Problems with PCFGs

▪ What’s different between basic PCFG scores here?

▪ What (lexical) correlations need to be scored?

slide-73
SLIDE 73

Lexicalized Trees

▪ Add “head words” to each phrasal node

▪ Syntactic vs. semantic heads

▪ Headship not in (most) treebanks

▪ Usually use head rules, e.g.:

▪ NP:
▪ Take leftmost NP
▪ Take rightmost N*
▪ Take rightmost JJ
▪ Take right child

▪ VP:
▪ Take leftmost VB*
▪ Take leftmost VP
▪ Take left child
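A sketch of head-rule application following the NP and VP rules above, with “N*” and “VB*” approximated by tag prefixes (the encoding of children as a list of tag strings is assumed for illustration):

```python
def find_head(label, children):
    """Pick the head child of a node by ordered search rules:
    each pass is (search direction, tag prefix), tried in order."""
    rules = {
        "NP": [("left", "NP"), ("right", "NN"), ("right", "JJ")],
        "VP": [("left", "VB"), ("left", "VP")],
    }
    for direction, prefix in rules.get(label, []):
        seq = children if direction == "left" else list(reversed(children))
        for child in seq:
            if child.startswith(prefix):
                return child
    # fallback: left child for VP, right child for NP
    return children[0] if label == "VP" else children[-1]
```

For example, `find_head("NP", ["DT", "JJ", "NN", "NNS"])` returns `"NNS"` (the rightmost N*), and `find_head("VP", ["VBD", "NP"])` returns `"VBD"` (the leftmost VB*).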

slide-74
SLIDE 74

Some Test Set Results

▪ Beats “first generation” lexicalized parsers.

▪ Lots of room to improve – more complex models next.

Parser         LP    LR    F1    CB    0 CB
Magerman 95    84.9  84.6  84.7  1.26  56.6
Collins 96     86.3  85.8  86.0  1.14  59.9
Unlexicalized  86.9  85.7  86.3  1.10  60.3
Charniak 97    87.4  87.5  87.4  1.00  62.1
Collins 99     88.7  88.6  88.6  0.90  67.1

slide-75
SLIDE 75

▪ Annotation refines base treebank symbols to improve statistical fit of the grammar

▪ Parent annotation [Johnson ’98]

▪ Head lexicalization [Collins ’99, Charniak ’00]

▪ Automatic clustering?

The Game of Designing a Grammar

slide-76
SLIDE 76

Latent Variable Grammars

[Figure: latent-variable grammars relate the sentence, parse tree, derivations, and parameters]

slide-77
SLIDE 77

Learned Splits

▪ Proper Nouns (NNP):

NNP-14: Oct. Nov. Sept.
NNP-12: John Robert James
NNP-2: J. E. L.
NNP-1: Bush Noriega Peters
NNP-15: New San Wall
NNP-3: York Francisco Street

▪ Personal pronouns (PRP):

PRP-0: It He I
PRP-1: it he they
PRP-2: it them him

slide-78
SLIDE 78

Learned Splits

▪ Relative adverbs (RBR):

RBR-0: further lower higher
RBR-1: more less More
RBR-2: earlier Earlier later

▪ Cardinal Numbers (CD):

CD-7: one two Three
CD-4: 1989 1990 1988
CD-11: million billion trillion
CD-0: 1 50 100
CD-3: 1 30 31
CD-9: 78 58 34

slide-79
SLIDE 79

Final Results (Accuracy)

      Parser                             ≤ 40 words F1   all F1
ENG   Charniak&Johnson ‘05 (generative)  90.1            89.6
      Split / Merge                      90.6            90.1
GER   Dubey ‘05                          76.3
      Split / Merge                      80.8            80.1
CHN   Chiang et al. ‘02                  80.0            76.6
      Split / Merge                      86.3            83.4

Still higher numbers from reranking / self-training methods

slide-80
SLIDE 80

Efficient Parsing for Structural Annotation

slide-81
SLIDE 81

Overview: Coarse-to-Fine

▪ We’ve introduced a lot of new symbols in our grammar: do we always need to consider all these symbols?

▪ Motivation:

▪ If any NP is unlikely to span these words, then NP^S[DT], NP^VB[DT], NP^S[JJ], etc. are all unlikely

▪ High level:

▪ First pass: compute probability that a coarse symbol spans these words

▪ Second pass: parse as usual, but skip fine symbols that correspond with improbable coarse symbols
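The second-pass filter can be sketched as follows, assuming fine symbols are strings like `NP^S[DT]` whose coarse projection is obtained by stripping the annotations (the threshold value and helper names are illustrative):

```python
def coarse_project(fine_label):
    """Strip annotations to get the coarse symbol,
    e.g. 'NP^S[DT]' -> 'NP', '@NP^VP[DT]' -> '@NP'."""
    return fine_label.split("^")[0].split("[")[0]

def prune(fine_labels, coarse_posterior, threshold=1e-4):
    """Keep a fine label for this span only if its coarse projection
    has posterior probability >= threshold (from the first pass)."""
    return [f for f in fine_labels
            if coarse_posterior.get(coarse_project(f), 0.0) >= threshold]
```

If the coarse NP is likely over a span but VP is not, all fine NP variants survive and all fine VP variants are skipped: `prune(["NP^S[DT]", "NP^VB[DT]", "VP^S[VBD]"], {"NP": 0.3, "VP": 1e-6})` keeps only the two NP symbols.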

slide-82
SLIDE 82

Defining Coarse/Fine Grammars

▪ [Charniak et al. 2006]

▪ level 0: ROOT vs. not-ROOT

▪ level 1: argument vs. modifier (i.e. two nontrivial nonterminals)

▪ level 2: four major phrasal categories (verbal, nominal, adjectival and prepositional phrases)

▪ level 3: all standard Penn treebank categories

▪ Our version: stop at 2 passes

slide-83
SLIDE 83

Grammar Projections

Coarse grammar rule: NP → DT @NP

Fine grammar rule: NP^VP → DT^NP @NP^VP[DT]

[Figure: the coarse (v=1, h=0) tree beside the fine (v=2, h=1) tree for the same NP]

Note: X-Bar Grammars are projections with rules like XP → Y @X or XP → @X Y or @X → X

slide-84
SLIDE 84

Grammar Projections

Coarse symbols: NP, @NP, DT

Fine symbols: NP^VP, NP^S, @NP^VP[DT], @NP^S[DT], @NP^VP[…,JJ], @NP^S[…,JJ], DT^NP

slide-85
SLIDE 85

Coarse-to-Fine Pruning

For each coarse chart item X[i,j], compute posterior probability P(X at [i,j] | sentence):

[Figure: coarse chart cells (… QP, NP, VP, …) for e.g. the span 5 to 12; fine symbols whose coarse posterior is < threshold are skipped]

slide-86
SLIDE 86

Notation

▪ Non-terminal symbols (latent variables): N^1, …, N^k

▪ Sentence (observed data): w_1 … w_n

▪ N^j_pq denotes that N^j spans w_p … w_q in the sentence

slide-87
SLIDE 87

Inside probability

Definition (compare with backward prob for HMMs):

β_j(p, q) = P(w_p … w_q | N^j_pq)

Computed recursively.

Base case: β_j(k, k) = P(N^j → w_k)

Induction: β_j(p, q) = Σ_{r,s} Σ_{d=p}^{q−1} P(N^j → N^r N^s) · β_r(p, d) · β_s(d+1, q)

(The grammar is binarized.)
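The recursion above can be sketched in code, using sums rather than maxes (in contrast to Viterbi CKY); the toy grammar is invented for illustration:

```python
def inside(words, lexicon, rules, labels):
    """Inside probabilities for a binarized PCFG.
    beta[(p, q)][A] = P(words[p:q] | A spans p..q)."""
    n = len(words)
    beta = {}
    for k, w in enumerate(words):     # base case: preterminal rules
        beta[k, k + 1] = {A: lexicon.get((A, w), 0.0) for A in labels}
    for span in range(2, n + 1):      # induction over longer spans
        for p in range(n - span + 1):
            q = p + span
            cell = {A: 0.0 for A in labels}
            for d in range(p + 1, q):                 # split point
                for (A, B, C), prob in rules.items():
                    cell[A] += prob * beta[p, d][B] * beta[d, q][C]
            beta[p, q] = cell
    return beta

# Toy grammar (illustrative, not from the slides):
labels = ["S", "NP", "VP", "DT", "NN"]
lexicon = {("DT", "the"): 1.0, ("NN", "cat"): 0.5, ("VP", "sat"): 1.0}
rules = {("NP", "DT", "NN"): 0.4, ("S", "NP", "VP"): 1.0}
beta = inside(["the", "cat", "sat"], lexicon, rules, labels)
```

Since this toy sentence is unambiguous, `beta[0, 3]["S"]` equals the Viterbi score, 0.2; with an ambiguous grammar the inside score would sum over all derivations.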

slide-88
SLIDE 88

Implementation: PCFG parsing

double total = 0.0

slide-89
SLIDE 89

Implementation: inside

double total = 0.0
total = total + candidate


slide-92
SLIDE 92

Inside probability: example


slide-97
SLIDE 97

Outside probability

Definition (compare with forward prob for HMMs):

α_j(p, q) = P(w_1 … w_{p−1}, N^j_pq, w_{q+1} … w_n)

The joint probability of starting with S, generating words w_1 … w_{p−1}, the non-terminal N^j (spanning p…q), and words w_{q+1} … w_n.

slide-98
SLIDE 98

Calculating outside probability

Computed recursively. Base case: α_j(1, n) = 1 if N^j = S, else 0.

Induction? Intuition: N^j must be either the L or R child of a parent node. We first consider the case when it is the L child.

slide-99
SLIDE 99

Calculating outside probability

The yellow area is the probability we would like to calculate

How do we decompose it?

slide-100
SLIDE 100

Calculating outside probability

Step 1: We assume that N^f is the parent of N^j. Its outside probability (represented by the yellow shading) is available recursively. But how do we compute the green part?
slide-101
SLIDE 101

Calculating outside probability

Step 2: The red shaded area is the inside probability of the sibling N^g, i.e. β_g.

slide-102
SLIDE 102

Calculating outside probability

Step 3: The blue shaded area is just the production N^f → N^j N^g, with the corresponding probability P(N^f → N^j N^g).

slide-103
SLIDE 103

Calculating outside probability

If we multiply the terms together, we have the joint probability corresponding to the yellow, red and blue areas, assuming N^j was the L child of N^f, given fixed non-terminals f and g, as well as a fixed partition e. What if we do not want to assume this?

slide-104
SLIDE 104

Calculating outside probability

The joint probability corresponding to the yellow, red and blue areas, assuming N^j was the L child of some non-terminal (summing over f, g, and the partition):

slide-105
SLIDE 105

Calculating outside probability

The joint probability corresponding to the yellow, red and blue areas, assuming N^j was the R child of some non-terminal:

slide-106
SLIDE 106

Calculating outside probability

The final joint probability (the sum over the L and R cases):
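Putting the L and R cases together, the outside recursion can be sketched as follows, given precomputed inside scores (the toy grammar and inside values below are illustrative):

```python
def outside(n, beta, rules, start="S"):
    """Outside probabilities, computed top-down from inside scores.
    alpha[(p, q)][A] = P(words before p, A spans p..q, words after q)."""
    alpha = {(p, p + span): {} for span in range(1, n + 1)
             for p in range(n - span + 1)}
    alpha[0, n][start] = 1.0                   # base case: the start symbol
    for span in range(n, 1, -1):               # parents before children
        for p in range(n - span + 1):
            q = p + span
            for (A, B, C), prob in rules.items():
                pa = alpha[p, q].get(A, 0.0)
                if pa == 0.0:
                    continue
                for d in range(p + 1, q):
                    # A -> B C, with B over (p, d) and C over (d, q);
                    # each child collects the sibling's inside score.
                    alpha[p, d][B] = (alpha[p, d].get(B, 0.0)
                                      + prob * pa * beta[d, q].get(C, 0.0))
                    alpha[d, q][C] = (alpha[d, q].get(C, 0.0)
                                      + prob * pa * beta[p, d].get(B, 0.0))
    return alpha

# Inside scores for "the cat sat" under an assumed toy grammar:
rules = {("NP", "DT", "NN"): 0.4, ("S", "NP", "VP"): 1.0}
beta = {(0, 1): {"DT": 1.0}, (1, 2): {"NN": 0.5}, (2, 3): {"VP": 1.0},
        (0, 2): {"NP": 0.2}, (1, 3): {}, (0, 3): {"S": 0.2}}
alpha = outside(3, beta, rules)
```

As a sanity check, α · β at any filled cell recovers the total sentence probability (0.2 here), which is exactly what the coarse-to-fine posteriors divide by.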


slide-108
SLIDE 108

Is C2F an Improvement?

▪ Does coarse-to-fine pruning improve accuracy?

▪ If your threshold is too high, it might throw away correct parses

▪ Does coarse-to-fine pruning improve speed?

▪ Maybe; if your threshold is too low, pruning might not be very useful