Lecture 18: PCFG Parsing


  1. CS447: Natural Language Processing http://courses.engr.illinois.edu/cs447 Lecture 18: PCFG Parsing Julia Hockenmaier juliahmr@illinois.edu 3324 Siebel Center

  2. Where we’re at Previous lecture:
 Standard CKY (for non-probabilistic CFGs): the standard CKY algorithm finds all possible parse trees τ for a sentence S = w(1)…w(n) under a CFG G in Chomsky Normal Form.
 Today’s lecture: Probabilistic Context-Free Grammars (PCFGs) – CFGs in which each rule is associated with a probability. CKY for PCFGs (Viterbi): CKY for PCFGs finds the most likely parse tree
 τ* = argmax_τ P(τ | S) for the sentence S under a PCFG. 2 CS447 Natural Language Processing

  3. Previous Lecture: CKY for CFGs CS447: Natural Language Processing (J. Hockenmaier) 3

  4. CKY: filling the chart [figure: the upper triangular chart over w_1 … w_i … w_n, filled cell by cell] 4 CS447 Natural Language Processing

  5. CKY: filling one cell [figure: computing chart[2][6] for the substring w_2 … w_6 of the sentence w_1 w_2 w_3 w_4 w_5 w_6 w_7 by combining pairs of smaller cells] 5 CS447 Natural Language Processing

  6. CKY for standard CFGs CKY is a bottom-up chart parsing algorithm that finds all possible parse trees τ for a sentence S = w(1)…w(n) under a CFG G in Chomsky Normal Form (CNF).
 – CNF: G has two types of rules: X ⟶ Y Z and X ⟶ w
 (X, Y, Z are nonterminals, w is a terminal)
 – CKY is a dynamic programming algorithm
 – The parse chart is an n × n upper triangular matrix:
 each cell chart[i][j] (i ≤ j) stores all subtrees for w(i)…w(j)
 – Each cell chart[i][j] has at most one entry for each nonterminal X (and pairs of backpointers to each pair of (Y, Z) entries in cells chart[i][k] and chart[k+1][j] from which an X can be formed)
 – Time complexity: O(n³ |G|) 6 CS447 Natural Language Processing
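A minimal recognizer-style sketch of this algorithm in Python (added for illustration; not the lecture's code). It assumes the CNF grammar is given as two dictionaries, lexical_rules and binary_rules (hypothetical names), and stores only the set of nonterminals per cell rather than full subtrees with backpointers:

```python
from collections import defaultdict

def cky_parse(words, lexical_rules, binary_rules):
    """Standard CKY for a CFG in Chomsky Normal Form.

    lexical_rules: dict mapping a word w to the set of nonterminals X with X -> w
    binary_rules:  dict mapping a pair (Y, Z) to the set of nonterminals X with X -> Y Z
    Returns the chart: chart[(i, j)] is the set of nonterminals covering words[i..j]
    (0-based, inclusive).
    """
    n = len(words)
    chart = defaultdict(set)

    # Diagonal: apply the lexical rules X -> w_i
    for i, w in enumerate(words):
        chart[(i, i)] |= lexical_rules.get(w, set())

    # Fill the chart by increasing span length
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):                       # split point
                for Y in chart[(i, k)]:
                    for Z in chart[(k + 1, j)]:
                        chart[(i, j)] |= binary_rules.get((Y, Z), set())
    return chart
```

The three nested loops over spans, start positions, and split points, combined with the rule lookup, give the O(n³ |G|) time bound stated above.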

  7. Dealing with ambiguity: Probabilistic 
 Context-Free Grammars (PCFGs) CS447: Natural Language Processing (J. Hockenmaier) 7

  8. Grammars are ambiguous A grammar might generate multiple trees for a sentence: [figure: two parse trees each for “eat sushi with tuna” and for “eat sushi with chopsticks”; in one tree the PP attaches to the NP, in the other to the VP, and for each sentence one of the two analyses is incorrect] What’s the most likely parse τ for sentence S?
 We need a model of P(τ | S) 8 CS447 Natural Language Processing

  9. Computing P(τ | S) Using Bayes’ Rule:
 argmax_τ P(τ | S) = argmax_τ P(τ, S) / P(S)
                   = argmax_τ P(τ, S)
                   = argmax_τ P(τ)   if S = yield(τ)
 The yield of a tree is the string of terminal symbols that can be read off the leaf nodes:
 [figure: a VP tree over “eat sushi with tuna” whose yield is the string “eat sushi with tuna”] 9 CS447 Natural Language Processing

  10. Computing P(τ) T is the (infinite) set of all trees in the language:
 L = { s ∈ Σ* | ∃ τ ∈ T : yield(τ) = s }
 We need to define P(τ) such that:
 ∀ τ ∈ T : 0 ≤ P(τ) ≤ 1 and ∑_{τ ∈ T} P(τ) = 1
 The set T is generated by a context-free grammar:
 S → NP VP      VP → Verb NP   NP → Det Noun
 S → S conj S   VP → VP PP     NP → NP PP
 S → …          VP → …         NP → …
 10 CS447 Natural Language Processing

  11. Probabilistic Context-Free Grammars For every nonterminal X, define a probability distribution P(X → α | X) over all rules with the same LHS symbol X: 
 S → NP VP 0.8   S → S conj S 0.2
 NP → Noun 0.2   NP → Det Noun 0.4   NP → NP PP 0.2   NP → NP conj NP 0.2
 VP → Verb 0.4   VP → Verb NP 0.3   VP → Verb NP NP 0.1   VP → VP PP 0.2
 PP → P NP 1.0
 11 CS447 Natural Language Processing
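This constraint can be checked mechanically: the probabilities of all rules with the same LHS must sum to 1. A small sketch (added for illustration), assuming the grammar above is stored as a dict from (LHS, RHS) pairs to probabilities:

```python
from collections import defaultdict

# The example PCFG above, keyed by (LHS, RHS)
pcfg = {
    ("S", ("NP", "VP")): 0.8,            ("S", ("S", "conj", "S")): 0.2,
    ("NP", ("Noun",)): 0.2,              ("NP", ("Det", "Noun")): 0.4,
    ("NP", ("NP", "PP")): 0.2,           ("NP", ("NP", "conj", "NP")): 0.2,
    ("VP", ("Verb",)): 0.4,              ("VP", ("Verb", "NP")): 0.3,
    ("VP", ("Verb", "NP", "NP")): 0.1,   ("VP", ("VP", "PP")): 0.2,
    ("PP", ("P", "NP")): 1.0,
}

# P(X -> alpha | X) must be a proper distribution for every nonterminal X
totals = defaultdict(float)
for (lhs, _), p in pcfg.items():
    totals[lhs] += p
assert all(abs(total - 1.0) < 1e-9 for total in totals.values())
```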

  12. Computing P(τ) with a PCFG The probability of a tree τ is the product of the probabilities of all its rules:
 [figure: parse tree for “John eats pie with cream”, shown next to the rule probabilities above; it uses S → NP VP, NP → Noun (John), VP → VP PP, VP → Verb NP, NP → Noun (pie), PP → P NP, NP → Noun (cream)]
 P(τ) = 0.8 × 0.3 × 0.2 × 1.0 × 0.2³ = 0.000384
 12 CS447 Natural Language Processing
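A small sketch of this computation (added for illustration; it assumes the pcfg dict from the earlier sketch and a hypothetical tree representation as nested (label, children…) tuples):

```python
def tree_prob(tree, pcfg):
    """P(tau) = product of P(X -> alpha | X) over all rules used in the tree.
    A tree is (label, child1, child2, ...); a word is a plain string."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return 1.0                      # preterminal over a word; POS-tag probabilities omitted
    rhs = tuple(child[0] for child in children)
    p = pcfg[(label, rhs)]
    for child in children:
        p *= tree_prob(child, pcfg)
    return p

# Parse tree for "John eats pie with cream" with the PP attached to the VP
tau = ("S",
       ("NP", ("Noun", "John")),
       ("VP",
        ("VP", ("Verb", "eats"), ("NP", ("Noun", "pie"))),
        ("PP", ("P", "with"), ("NP", ("Noun", "cream")))))

print(tree_prob(tau, pcfg))   # 0.8 * 0.3 * 0.2 * 1.0 * 0.2**3 ≈ 0.000384
```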

  13. Learning the parameters of a PCFG If we have a treebank (a corpus in which each sentence is associated with a parse tree), we can just count the number of times each rule appears, e.g.: S → NP VP . (count = 1000) S → S conj S . (count = 220) and then we divide the observed frequency of each rule X → Y Z by the sum of the frequencies of all rules with the same LHS X to turn these counts into probabilities: S → NP VP . (p = 1000/1220)
 S → S conj S . (p = 220/1220) 13 CS447 Natural Language Processing
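A minimal sketch of this relative-frequency (maximum likelihood) estimate, assuming the rule counts have already been collected from the treebank into a Counter (the variable names are illustrative, not from the lecture):

```python
from collections import Counter, defaultdict

# (LHS, RHS) -> number of times the rule occurs in the treebank
rule_counts = Counter({
    ("S", ("NP", "VP", ".")): 1000,
    ("S", ("S", "conj", "S", ".")): 220,
})

# MLE: P(X -> alpha | X) = count(X -> alpha) / sum over beta of count(X -> beta)
lhs_totals = defaultdict(int)
for (lhs, _), count in rule_counts.items():
    lhs_totals[lhs] += count

rule_probs = {(lhs, rhs): count / lhs_totals[lhs]
              for (lhs, rhs), count in rule_counts.items()}

print(rule_probs[("S", ("NP", "VP", "."))])   # 1000/1220 ≈ 0.82
```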

  14. More on probabilities: Computing P(s):
 If P(τ) is the probability of a tree τ, the probability of a sentence s is the sum of the probabilities of all its parse trees:
 P(s) = ∑_{τ : yield(τ) = s} P(τ)
 How do we know that P(L) = ∑_τ P(τ) = 1? If we have learned the PCFG from a corpus via MLE, this is guaranteed to be the case. If we just set the probabilities by hand, we could run into trouble, as in the following example:
 S → S S (0.9)
 S → w (0.1) 14 CS447 Natural Language Processing
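Why this hand-set grammar loses probability mass (an added explanation, not on the slide): the probability p that a derivation starting from S ever terminates must satisfy p = 0.1 + 0.9·p², since S either rewrites to w immediately or to S S, in which case both copies must terminate. The smallest solution is p = 1/9, so the finite trees carry only about 11% of the probability mass. A short fixed-point iteration confirms this:

```python
# Least fixed point of p = 0.1 + 0.9 * p**2:
# the probability that a derivation from S ever terminates.
p = 0.0
for _ in range(1000):
    p = 0.1 + 0.9 * p ** 2
print(p)   # converges to 1/9 ≈ 0.111, i.e. the probabilities of all finite trees sum to less than 1
```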

  15. PCFG parsing (decoding): Probabilistic CKY CS447: Natural Language Processing (J. Hockenmaier) 15

  16. Probabilistic CKY: Viterbi Like standard CKY, but with probabilities. Finding the most likely tree is similar to Viterbi for HMMs: Initialization:
 – [optional] Every chart entry that corresponds to a terminal (entry w in cell[i][i]) has a Viterbi probability P_VIT(w[i][i]) = 1 (*)
 – Every entry for a non-terminal X in cell[i][i] has Viterbi probability P_VIT(X[i][i]) = P(X → w | X) [and a single backpointer to w[i][i] (*)]
 Recurrence: For every entry that corresponds to a non-terminal X in cell[i][j], keep only the highest-scoring pair of backpointers to any pair of children (Y in cell[i][k] and Z in cell[k+1][j]):
 P_VIT(X[i][j]) = max_{Y,Z,k} P_VIT(Y[i][k]) × P_VIT(Z[k+1][j]) × P(X → Y Z | X)
 Final step: Return the Viterbi parse for the start symbol S in the top cell[1][n].
 *this is unnecessary for simple PCFGs, but can be helpful for more complex probability models 16 CS447 Natural Language Processing
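A compact sketch of this recurrence (illustrative, not the lecture's reference implementation), assuming the dictionary-style grammar representation used in the earlier sketches, with lexical rules X → w and binary rules X → Y Z kept in separate dicts:

```python
def viterbi_cky(words, lexical, binary, start="S"):
    """lexical: dict word -> {X: P(X -> w | X)}
       binary:  dict (Y, Z) -> {X: P(X -> Y Z | X)}
       Returns (Viterbi probability, best tree) for the start symbol, or None."""
    n = len(words)
    best = {}   # best[(i, j)]: dict X -> (Viterbi probability, best subtree)

    for i, w in enumerate(words):
        best[(i, i)] = {X: (p, (X, w)) for X, p in lexical.get(w, {}).items()}

    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            cell = {}
            for k in range(i, j):                           # split point
                for Y, (pY, tY) in best[(i, k)].items():
                    for Z, (pZ, tZ) in best[(k + 1, j)].items():
                        for X, p_rule in binary.get((Y, Z), {}).items():
                            p = pY * pZ * p_rule
                            if p > cell.get(X, (0.0, None))[0]:
                                cell[X] = (p, (X, tY, tZ))  # keep only the best split
            best[(i, j)] = cell

    return best[(0, n - 1)].get(start)
```

Compared to the standard CKY sketch above, each cell now stores only the single best (probability, subtree) pair per nonterminal, which is exactly the max in the recurrence.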

  17. Probabilistic CKY Input: POS-tagged sentence
 John_N eats_V pie_N with_P cream_N
 [figure: the Viterbi chart for this sentence, filled under the example PCFG extended with the tagging rules Noun ⟶ N 1.0, Verb ⟶ V 1.0, Prep ⟶ P 1.0; each cell records the best probability per nonterminal, e.g. 0.06 for the VP over “eats pie” (1.0 · 0.3 · 0.2) and 0.008 for the NP over “pie with cream” (0.2 · 0.2 · 0.2)]
 17 CS447 Natural Language Processing

  18. How do we handle flat rules? Binarize each flat rule by adding dummy nonterminals (ConjS) and setting the probability of the rule with the dummy nonterminal on the LHS to 1: the flat rule S → S conj S 0.2 becomes S ⟶ S ConjS 0.2 plus ConjS ⟶ conj S 1.0. [The example grammar from before is shown alongside.] 18 CS447 Natural Language Processing
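A sketch of this binarization step (added for illustration; it uses generic bracketed dummy names where the slide uses ConjS), which repeatedly splits off the first symbol of a flat rule and gives every rule headed by a dummy nonterminal probability 1:

```python
def binarize(lhs, rhs, prob):
    """Turn a flat rule LHS -> X1 X2 ... Xn (n > 2) into a chain of binary rules:
    LHS -> X1 <X2...Xn>, <X2...Xn> -> X2 <X3...Xn>, ...; dummy rules get probability 1."""
    rules = []
    while len(rhs) > 2:
        dummy = "<" + " ".join(rhs[1:]) + ">"     # dummy nonterminal, e.g. <conj S>
        rules.append((lhs, (rhs[0], dummy), prob))
        lhs, rhs, prob = dummy, rhs[1:], 1.0      # the remaining chain carries probability 1
    rules.append((lhs, tuple(rhs), prob))
    return rules

print(binarize("S", ("S", "conj", "S"), 0.2))
# [('S', ('S', '<conj S>'), 0.2), ('<conj S>', ('conj', 'S'), 1.0)]
```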

  19. Parser evaluation CS447: Natural Language Processing (J. Hockenmaier) 19

  20. Precision and recall Precision and recall were originally developed 
 as evaluation metrics for information retrieval: - Precision: What percentage of retrieved documents are relevant to the query? - Recall : What percentage of relevant documents were retrieved? In NLP, they are often used in addition to accuracy: - Precision: What percentage of items that were assigned label X do actually have label X in the test data? - Recall: What percentage of items that have label X in the test data were assigned label X by the system? Particularly useful when there are more than two labels. 20 CS447: Natural Language Processing (J. Hockenmaier)

  21. True vs. false positives, false negatives
 [figure: two overlapping sets — the items labeled X by the system (= TP + FP) and the items labeled X in the gold standard (‘truth’, = TP + FN); the overlap contains the true positives (TP), the system-only part the false positives (FP), and the gold-only part the false negatives (FN)]
 - True positives: Items that were labeled X by the system, and should be labeled X.
 - False positives: Items that were labeled X by the system, but should not be labeled X.
 - False negatives: Items that were not labeled X by the system, but should be labeled X.
 21 CS447: Natural Language Processing (J. Hockenmaier)
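In terms of these counts, precision = TP / (TP + FP) and recall = TP / (TP + FN). A small sketch of how this is computed for parser evaluation (added here; the span-set representation of constituents is an assumption, not from the slides):

```python
def precision_recall(system_items, gold_items):
    """Precision and recall of the items labeled X by the system
    against the items labeled X in the gold standard."""
    tp = len(system_items & gold_items)       # true positives
    fp = len(system_items - gold_items)       # false positives
    fn = len(gold_items - system_items)       # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# e.g. labeled constituents represented as (label, start, end) spans
system = {("NP", 0, 1), ("VP", 1, 4), ("PP", 3, 4)}
gold   = {("NP", 0, 1), ("VP", 1, 4), ("NP", 2, 4)}
print(precision_recall(system, gold))   # (0.666..., 0.666...)
```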
