Lecture 17: Statistical Parsing with PCFG


SLIDE 1

Lecture 17: Statistical Parsing with PCFG

Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net. Course webpage: http://kwchang.net/teaching/NLP16

SLIDE 2

Reading list

• Look at Mike Collins' notes on PCFGs and lexicalized PCFGs: http://www.cs.columbia.edu/~mcollins/

SLIDE 3

Phrase structure (constituency) trees

• Can be modeled by context-free grammars

SLIDE 4

CKY algorithm


• for J := 1 to n
  • Add to [J-1,J] all categories for the Jth word
• for width := 2 to n
  • for start := 0 to n-width        // this is I
    • Define end := start + width    // this is J
    • for mid := start+1 to end-1    // find all I-to-J phrases
      • for every rule X → Y Z in the grammar
        • if Y in [start,mid] and Z in [mid,end] then add X to [start,end]
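For concreteness, here is a minimal Python sketch of this recognizer (not from the slides); the lexicon/rules data structures and the start symbol are assumptions of the sketch, and the grammar is assumed to be in Chomsky normal form:

    from collections import defaultdict

    def cky_recognize(words, lexicon, rules, root="S"):
        # CKY recognizer: True iff `words` is derivable from `root`.
        # lexicon: dict word -> set of categories, e.g. {"time": {"N", "V"}}
        # rules: set of (X, Y, Z) triples meaning X -> Y Z (grammar in CNF)
        n = len(words)
        chart = defaultdict(set)                 # (start, end) -> categories
        for j in range(1, n + 1):                # width-1 spans: tag each word
            chart[j - 1, j] |= lexicon.get(words[j - 1], set())
        for width in range(2, n + 1):
            for start in range(n - width + 1):   # start = 0 .. n-width
                end = start + width
                for mid in range(start + 1, end):
                    for x, y, z in rules:
                        if y in chart[start, mid] and z in chart[mid, end]:
                            chart[start, end].add(x)
        return root in chart[0, n]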

SLIDE 5

Weighted CKY: Viterbi algorithm


• initialize all entries of chart to ∞
• for i := 1 to n
  • for each rule R of the form X → word[i]
    • chart[X,i-1,i] min= weight(R)
• for width := 2 to n
  • for start := 0 to n-width
    • Define end := start + width
    • for mid := start+1 to end-1
      • for each rule R of the form X → Y Z
        • chart[X,start,end] min= weight(R) + chart[Y,start,mid] + chart[Z,mid,end]
• return chart[ROOT,0,n]

Slides are modified from Jason Eisner’s NLP course

Assume the weights are negative log probabilities (so lower total weight means a more probable parse)
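A minimal Python sketch of this min-cost variant, under the same assumptions as the recognizer sketch above (CNF grammar, hypothetical lexicon/rules structures); a real parser would also store backpointers to recover the best tree:

    import math
    from collections import defaultdict

    def cky_viterbi(words, lexicon, rules, root="S"):
        # Min-cost (Viterbi) CKY. Weights are assumed to be NEGATIVE log
        # probabilities, so the minimum-weight parse is the most probable one.
        # lexicon: dict word -> {category: weight}
        # rules: dict (X, Y, Z) -> weight for binary rules X -> Y Z (CNF)
        n = len(words)
        chart = defaultdict(lambda: math.inf)    # (X, start, end) -> best weight
        for i in range(1, n + 1):
            for x, w in lexicon.get(words[i - 1], {}).items():
                chart[x, i - 1, i] = min(chart[x, i - 1, i], w)
        for width in range(2, n + 1):
            for start in range(n - width + 1):
                end = start + width
                for mid in range(start + 1, end):
                    for (x, y, z), w in rules.items():
                        total = w + chart[y, start, mid] + chart[z, mid, end]
                        chart[x, start, end] = min(chart[x, start, end], total)
        return chart[root, 0, n]                 # math.inf if no parse exists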

SLIDE 6

Likelihood of a parse tree


WHY??

SLIDE 7

Probabilistic Trees

• Just like language models or HMMs for POS tagging
• We make independence assumptions!

[parse tree: (S (NP time) (VP (VP flies) (PP (P like) (NP (Det an) (N arrow)))))]

SLIDE 8

Chain rule: One word at a time


p(time flies like an arrow) = p(time) * p(flies | time) * p(like | time flies) * p(an | time flies like) * p(arrow | time flies like an)

SLIDE 9

Chain rule + Indep. assumptions (to get trigram model)


p(time flies like an arrow) ≈ p(time) * p(flies | time) * p(like | time flies) * p(an | flies like) * p(arrow | like an)

(Each word now conditions only on the previous two words.)

SLIDE 10

Chain rule – written differently


p(time flies like an arrow) = p(time) * p(time flies | time) * p(time flies like | time flies) * p(time flies like an | time flies like) * p(time flies like an arrow | time flies like an)

Proof: p(x,y | x) = p(x | x) * p(y | x, x) = 1 * p(y | x)

SLIDE 11

Chain rule + Indep. assumptions


p(time flies like an arrow) ≈ p(time) * p(time flies | time) * p(time flies like | time flies) * p(flies like an | flies like) * p(like an arrow | like an)

(The independence assumption truncates each conditioning context to the last two words.)

Proof: p(x,y | x) = p(x | x) * p(y | x, x) = 1 * p(y | x)

SLIDE 12

Chain rule: One node at a time

[figure: the parse tree for "time flies like an arrow" is generated one node at a time]

p(tree | S) = p( [S NP VP] | S )
            * p( [S [NP time] VP] | [S NP VP] )
            * p( [S [NP time] [VP VP PP]] | [S [NP time] VP] )
            * p( [S [NP time] [VP [VP flies] PP]] | [S [NP time] [VP VP PP]] )
            * …

(Each factor expands one node of the partial tree, conditioned on everything generated so far; compare with p(time), p(flies | time), … in the word-level chain rule.)

SLIDE 13

Chain rule + Indep. assumptions

[figure: the same tree, generated one node at a time]

p(tree | S) = p( [S NP VP] | S )
            * p( [S [NP time] VP] | [S NP VP] )
            * p( [S [NP time] [VP VP PP]] | [S [NP time] VP] )
            * p( [S [NP time] [VP [VP flies] PP]] | [S [NP time] [VP VP PP]] )
            * …

(Now with the PCFG independence assumption: each factor depends only on the nonterminal being expanded, not on the rest of the partial tree.)

SLIDE 14

Simplified notation

[parse tree: (S (NP time) (VP (VP flies) (PP (P like) (NP (Det an) (N arrow)))))]

p(tree | S) = p(S → NP VP | S) * p(NP → time | NP) * p(VP → VP PP | VP) * p(VP → flies | VP) * …
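To make the product-of-rules computation concrete, here is a small Python sketch; the rule probabilities below are made-up numbers for the running example, not estimates from any treebank:

    # Hypothetical rule probabilities (numbers made up for illustration).
    rule_prob = {
        ("S", ("NP", "VP")): 1.0,
        ("NP", ("time",)): 0.1,
        ("VP", ("VP", "PP")): 0.2,
        ("VP", ("flies",)): 0.1,
        ("PP", ("P", "NP")): 1.0,
        ("P", ("like",)): 0.4,
        ("NP", ("Det", "N")): 0.5,
        ("Det", ("an",)): 0.3,
        ("N", ("arrow",)): 0.02,
    }

    def tree_prob(tree):
        # p(tree) = product over internal nodes of p(LHS -> RHS | LHS).
        # A tree is a tuple (label, child, ...); leaf children are strings.
        label, children = tree[0], tree[1:]
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        p = rule_prob[label, rhs]
        for child in children:
            if not isinstance(child, str):
                p *= tree_prob(child)
        return p

    tree = ("S", ("NP", "time"),
                 ("VP", ("VP", "flies"),
                        ("PP", ("P", "like"),
                               ("NP", ("Det", "an"), ("N", "arrow")))))
    print(tree_prob(tree))  # 1.0 * 0.1 * 0.2 * 0.1 * 1.0 * 0.4 * 0.5 * 0.3 * 0.02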

SLIDE 15

Three basic problems for HMMs

• Likelihood of the input:
  • Forward algorithm
• Decoding (tagging) the input:
  • Viterbi algorithm
• Estimation (learning): find the best model parameters
  • Case 1: supervised (tags are annotated): maximum likelihood estimation (MLE)
  • Case 2: unsupervised (only unannotated text): forward-backward algorithm

(Likelihood: how likely is it that the sentence "I love cat" occurs? Decoding: what are the POS tags of "I love cat"? Estimation: how do we learn the model?)

SLIDE 16

Three basic problems for PCFGs (the HMM slide, with sequences replaced by phrase structure trees)

• Likelihood of the input:
  • Inside algorithm
• Decoding (parsing) the input:
  • CKY algorithm
• Estimation (learning): find the best model parameters
  • Case 1: supervised (trees are annotated): maximum likelihood estimation (MLE)
  • Case 2: unsupervised (only unannotated text): inside-outside algorithm

(Likelihood: how likely is it that the sentence "I love cat" occurs? Decoding: what is the parse tree of "I love cat"? Estimation: how do we learn the model?)


SLIDE 19

Probabilistic CKY: Inside algorithm

• initialize all entries of chart to 0
• for i := 1 to n
  • for each rule R of the form X → word[i]
    • chart[X,i-1,i] += prob(R)
• for width := 2 to n
  • for start := 0 to n-width
    • Define end := start + width
    • for mid := start+1 to end-1
      • for each rule R of the form X → Y Z
        • chart[X,start,end] += prob(R) * chart[Y,start,mid] * chart[Z,mid,end]
• return chart[ROOT,0,n]
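A Python sketch of the inside algorithm, under the same assumed data structures as the earlier sketches; note the only changes from the Viterbi sketch are the initialization (0 instead of ∞) and summing products instead of minimizing sums:

    from collections import defaultdict

    def cky_inside(words, lexicon, rules, root="S"):
        # Inside algorithm: sums the probabilities of ALL parses of `words`,
        # i.e. p(sentence) under the PCFG (cf. the forward algorithm for HMMs).
        # lexicon: dict word -> {category: prob}
        # rules: dict (X, Y, Z) -> prob for binary rules X -> Y Z (CNF)
        n = len(words)
        chart = defaultdict(float)               # (X, start, end) -> inside prob
        for i in range(1, n + 1):
            for x, p in lexicon.get(words[i - 1], {}).items():
                chart[x, i - 1, i] += p
        for width in range(2, n + 1):
            for start in range(n - width + 1):
                end = start + width
                for mid in range(start + 1, end):
                    for (x, y, z), p in rules.items():
                        chart[x, start, end] += (p * chart[y, start, mid]
                                                   * chart[z, mid, end])
        return chart[root, 0, n]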
SLIDE 20

How to build a width-6 phrase

[1,7] = [1,2]+[2,7]  or  [1,3]+[3,7]  or  [1,4]+[4,7]  or  [1,5]+[5,7]  or  [1,6]+[6,7]

Grammar:
S → NP VP
NP → Det N
NP → NP PP
VP → V NP
VP → VP PP
PP → P NP

SLIDE 21

CKY: Recognition algorithm

• initialize all entries of chart to false
• for i := 1 to n
  • for each rule R of the form X → word[i]
    • chart[X,i-1,i] |= in_grammar(R)
• for width := 2 to n
  • for start := 0 to n-width
    • Define end := start + width
    • for mid := start+1 to end-1
      • for each rule R of the form X → Y Z
        • chart[X,start,end] |= in_grammar(R) & chart[Y,start,mid] & chart[Z,mid,end]
• return chart[ROOT,0,n]

Pay attention to the range code…
SLIDE 22

Weighted CKY: Viterbi algorithm (min-cost)

• initialize all entries of chart to ∞
• for i := 1 to n
  • for each rule R of the form X → word[i]
    • chart[X,i-1,i] min= weight(R)
• for width := 2 to n
  • for start := 0 to n-width
    • Define end := start + width
    • for mid := start+1 to end-1
      • for each rule R of the form X → Y Z
        • chart[X,start,end] min= weight(R) + chart[Y,start,mid] + chart[Z,mid,end]
• return chart[ROOT,0,n]

Pay attention to the range code…
SLIDE 23

Weighted CKY: Viterbi algorithm (max-prob)

• initialize all entries of chart to 0
• for i := 1 to n
  • for each rule R of the form X → word[i]
    • chart[X,i-1,i] max= weight(R)
• for width := 2 to n
  • for start := 0 to n-width
    • Define end := start + width
    • for mid := start+1 to end-1
      • for each rule R of the form X → Y Z
        • chart[X,start,end] max= weight(R) * chart[Y,start,mid] * chart[Z,mid,end]
• return chart[ROOT,0,n]

Pay attention to the range code…
SLIDE 24

Weighted CKY: Viterbi algorithm (max-logprob)

• initialize all entries of chart to -∞
• for i := 1 to n
  • for each rule R of the form X → word[i]
    • chart[X,i-1,i] max= weight(R)
• for width := 2 to n
  • for start := 0 to n-width
    • Define end := start + width
    • for mid := start+1 to end-1
      • for each rule R of the form X → Y Z
        • chart[X,start,end] max= weight(R) + chart[Y,start,mid] + chart[Z,mid,end]
• return chart[ROOT,0,n]

Pay attention to the range code…
SLIDE 25

Probabilistic CKY: Inside algorithm

• initialize all entries of chart to 0
• for i := 1 to n
  • for each rule R of the form X → word[i]
    • chart[X,i-1,i] += prob(R)
• for width := 2 to n
  • for start := 0 to n-width
    • Define end := start + width
    • for mid := start+1 to end-1
      • for each rule R of the form X → Y Z
        • chart[X,start,end] += prob(R) * chart[Y,start,mid] * chart[Z,mid,end]
• return chart[ROOT,0,n]
SLIDE 26


Semiring-weighted CKY: General algorithm!

• initialize all entries of chart to € (the identity element of ⊕)
• for i := 1 to n
  • for each rule R of the form X → word[i]
    • chart[X,i-1,i] ⊕= semiring_weight(R)
• for width := 2 to n
  • for start := 0 to n-width
    • Define end := start + width
    • for mid := start+1 to end-1
      • for each rule R of the form X → Y Z
        • chart[X,start,end] ⊕= semiring_weight(R) ⊗ chart[Y,start,mid] ⊗ chart[Z,mid,end]
• return chart[ROOT,0,n]

⊗ is like “and”/∀: combines all of several pieces into an X
⊕ is like “or”/∃: considers the alternative ways to build the X

SLIDE 27

Semiring-weighted CKY: General algorithm!

(same semiring-weighted pseudocode as on SLIDE 26)

? (what do €, ⊕, and ⊗ have to be to recover each of the earlier algorithms?)

SLIDE 28

Weighted CKY, general version

(same semiring-weighted pseudocode as on SLIDE 26)

                     weights         ⊕    ⊗    €
total prob (inside)  [0,1]           +    ×    0
min weight           [0,∞]           min  +    ∞
recognizer           {true, false}   or   and  false

The design of the dynamic programming algorithm is mostly determined by the problem structure; the semiring operations (€, ⊕, ⊗) can be plugged in, e.g., via a virtual function.

SLIDE 29

Can we write a general version of the Viterbi/forward algorithms?

Yes, we can! You can implement this using a virtual function!
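For instance, here is a Python sketch in which the "virtual function" is a small record of the three semiring ingredients (€, ⊕, ⊗); the data-structure conventions follow the earlier sketches and are assumptions, not the slides' own code:

    from collections import defaultdict
    from dataclasses import dataclass
    from typing import Any, Callable

    @dataclass
    class Semiring:
        zero: Any          # € in the slides: the chart's initial value
        plus: Callable     # ⊕: combines alternative ways to build an X
        times: Callable    # ⊗: combines the pieces of one way to build an X

    INSIDE = Semiring(0.0, lambda a, b: a + b, lambda a, b: a * b)
    MIN_WEIGHT = Semiring(float("inf"), min, lambda a, b: a + b)
    RECOGNIZER = Semiring(False, lambda a, b: a or b, lambda a, b: a and b)

    def cky_general(words, lexicon, rules, sr, root="S"):
        # One CKY implementation; the semiring decides what gets computed.
        # lexicon: dict word -> {category: weight}
        # rules: dict (X, Y, Z) -> weight for binary rules X -> Y Z (CNF)
        n = len(words)
        chart = defaultdict(lambda: sr.zero)
        for i in range(1, n + 1):
            for x, w in lexicon.get(words[i - 1], {}).items():
                chart[x, i - 1, i] = sr.plus(chart[x, i - 1, i], w)
        for width in range(2, n + 1):
            for start in range(n - width + 1):
                end = start + width
                for mid in range(start + 1, end):
                    for (x, y, z), w in rules.items():
                        piece = sr.times(w, sr.times(chart[y, start, mid],
                                                     chart[z, mid, end]))
                        chart[x, start, end] = sr.plus(chart[x, start, end], piece)
        return chart[root, 0, n]

Passing INSIDE, MIN_WEIGHT, or RECOGNIZER as `sr` reproduces the three rows of the table on SLIDE 28 with a single implementation.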

SLIDE 30

Three basic problems for PCFGs (recap)

• Likelihood of the input:
  • Inside algorithm
• Decoding (parsing) the input:
  • CKY algorithm
• Estimation (learning): find the best model parameters
  • Case 1: supervised (trees are annotated): maximum likelihood estimation (MLE)
  • Case 2: unsupervised (only unannotated text): inside-outside algorithm

(Likelihood: how likely is it that the sentence "I love cat" occurs? Decoding: what is the parse tree of "I love cat"? Estimation: how do we learn the model?)

SLIDE 31

Maximum Likelihood Estimation

• Has a closed-form solution
• Count how many times each rule is applied; then normalize.

EM can be applied in the unsupervised setting!
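A sketch of the supervised (MLE) case in Python, reusing the tuple tree format from the earlier sketch; counts are normalized per left-hand-side nonterminal:

    from collections import Counter

    def mle_rule_probs(treebank):
        # MLE for a PCFG: p(X -> rhs | X) = count(X -> rhs) / count(X).
        # treebank: iterable of trees in the (label, child, ...) tuple format.
        rule_counts, lhs_counts = Counter(), Counter()

        def visit(tree):
            label, children = tree[0], tree[1:]
            rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
            rule_counts[label, rhs] += 1
            lhs_counts[label] += 1
            for child in children:
                if not isinstance(child, str):
                    visit(child)

        for tree in treebank:
            visit(tree)
        return {(lhs, rhs): count / lhs_counts[lhs]
                for (lhs, rhs), count in rule_counts.items()}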

SLIDE 32

Evaluation

• Precision: how many of the predicted constituents are correct?
• Recall: how many of the correct constituents were predicted?
• Usually use the F1 score to combine the two
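A small sketch of this scoring for one sentence (PARSEVAL-style); representing constituents as (label, start, end) spans is an assumption of the sketch:

    def parseval_f1(gold_spans, pred_spans):
        # Labeled-constituent precision/recall/F1.
        # Each argument is a set of (label, start, end) triples for one sentence.
        correct = len(gold_spans & pred_spans)
        precision = correct / len(pred_spans) if pred_spans else 0.0
        recall = correct / len(gold_spans) if gold_spans else 0.0
        if precision + recall == 0.0:
            return 0.0
        return 2 * precision * recall / (precision + recall)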

SLIDE 33

[figure-only slide]
SLIDE 34

The Penn Treebank

• Wall Street Journal (50,000 sentences, 1M words); also contains the Brown corpus and others
• The annotation:
  • POS-tagged
  • Manually annotated with phrase-structure trees
• Standard data for English parsers

SLIDE 35

Treebank label set

• 48 preterminals (tags)
  • 36 POS tags + 12 symbols
• 14 nonterminals:
  • S, NP, PP, VP, ...
• Function tags
  • -SBJ: "subject"
  • -MNR: "manner"

SLIDE 36

[figure-only slide]
SLIDE 37

CFG in the Penn Treebank

• Many rules appear only once
• Many rules are similar

SLIDE 38

PCFG (Charniak 1996)

• Train (300K words), test (30K words)

SLIDE 39

Ideas to improve PCFG

• Change the grammar by adding the type of the parent node to each nonterminal [Johnson '98, Klein & Manning '03] (see the sketch below)
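A sketch of this parent-annotation transform on the tuple tree format used earlier; whether POS tags are also split is a design choice, and this sketch leaves them alone:

    def annotate_parents(tree, parent="ROOT"):
        # Parent annotation: each nonterminal is split by its parent's label,
        # e.g. an NP under S becomes "NP^S". Preterminal tag nodes are left
        # unchanged here (a design choice, not the only option).
        label, children = tree[0], tree[1:]
        if all(isinstance(c, str) for c in children):   # preterminal node
            return tree
        return (f"{label}^{parent}",) + tuple(
            c if isinstance(c, str) else annotate_parents(c, label)
            for c in children)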

SLIDE 40

Ideas to improve PCFG

• Lexicalization [Collins '99, Charniak '00]
• Assume each constituent has one head word. Take the head word into account!

SLIDE 41

Price of lexicalization

• The number of rules increases
• Difficult to get good estimates of the probabilities

SLIDE 42

Dealing with unknown words

• Training: replace some rare words in the training data with a token "UNKNOWN"
• Test: replace unseen words with "UNKNOWN"
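A sketch of the training-side replacement; the count threshold and the token name are arbitrary choices of this sketch:

    from collections import Counter

    def replace_rare(train_sents, min_count=2, unk="UNKNOWN"):
        # Words seen fewer than `min_count` times become the UNKNOWN token;
        # at test time, map any word outside the returned vocabulary to `unk`.
        counts = Counter(w for sent in train_sents for w in sent)
        vocab = {w for w, c in counts.items() if c >= min_count}
        train = [[w if w in vocab else unk for w in sent] for sent in train_sents]
        return train, vocab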

SLIDE 43

Using Latent Variables

• Assume there are different types of NP, VP, etc.

(Three refinement strategies: lexicalization, structure annotation, latent variables)

SLIDE 44

Ideas to improve PCFG: Reranking [Charniak & Johnson 2005]

• Find the top K solutions based on a generative model
• Use a discriminative model to find the best solution among the K solutions (see the sketch below)
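A toy sketch of the two-stage idea; in Charniak & Johnson's actual reranker the candidate's generative log probability is itself one of the discriminative features, which this sketch would include via the assumed `features` function:

    def rerank(kbest_trees, features, weights):
        # Two-stage reranking: a generative parser proposes K candidate trees;
        # a discriminative linear model then picks the highest-scoring one.
        # features: tree -> dict of feature values; weights: feature -> weight.
        def score(tree):
            return sum(weights.get(f, 0.0) * v for f, v in features(tree).items())
        return max(kbest_trees, key=score)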

SLIDE 45

Some results

[figure-only slide]

SLIDE 46

What we have learned so far

[figure-only slide]

SLIDE 47

Go beyond keyword matching

• Identify the structure and meaning of words, sentences, texts, and conversations
• Deep understanding of broad language
• NLP is all around us

SLIDE 48

What will you learn from this course

• The NLP pipeline
  • Key components for understanding text
• NLP applications
• Current techniques & limitations
• Build realistic NLP tools

SLIDE 49

NLP Pipeline

• Word/token level: morphology, word embeddings, word clustering, language models, POS tagging
• Sentence level: chunking, NER, parsing
• We will see how to use these tools for NLP applications