
Probabilistic Context-Free Grammars (PCFGs) - PowerPoint PPT Presentation



  1. Probabilistic Context-Free Grammars (PCFGs). Berlin Chen, 2003. References: 1. Speech and Language Processing, chapter 12; 2. Foundations of Statistical Natural Language Processing, chapters 11 and 12

  2. Parsing for Disambiguation • At least three ways to use probabilities in a parser – Probabilities for choosing between parses • Choose the most likely parse(s) from among the many parses of the input sentence – Probabilities for speedier parsing (parsing as search) • Use probabilities to order or prune the search space of a parser, finding the best parse more quickly – Probabilities for determining the sentence • Use a parser as a language model over a word lattice, in order to determine the sequence of words that has the highest probability

  3. Parsing for Disambiguation • The integration of sophisticated structural and probabilistic models of syntax is at the very cutting edge of the field – For non-probabilistic syntax analysis • The context-free grammar (CFG) is the standard – For probabilistic syntax analysis • No single model has become a standard • There are a number of probabilistic augmentations of context-free grammars – Probabilistic CFGs with the CYK algorithm – Probabilistic lexicalized CFGs – Dependency grammars – …

  4. Definition of the PCFG (Booth, 1969) • A PCFG G has five parameters, G = (N, Σ, P, S, D): 1. A set of non-terminal symbols (or "variables") N — the syntactic and lexical categories 2. A set of terminal symbols Σ, disjoint from N — the words 3. A set of productions P, each of the form A → β, where A is a non-terminal symbol and β is a string of symbols drawn from the infinite set of strings (Σ ∪ N)* 4. A designated start symbol S (or N¹) 5. A function D that augments each rule A → β in P with a conditional probability P(A → β), also written P(A → β | A), such that for every non-terminal A: ∑_β P(A → β) = 1

  5. An Example Grammar (the grammar table on this slide was an image; a reconstruction follows below)
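The grammar itself did not survive extraction, but the probabilities used on slides 19–20 (0.126, 0.01296, 0.18, 0.015876) match the standard example grammar from Foundations of Statistical NLP, chapter 11, which the deck cites. Assuming that is the grammar intended, here it is as a Python structure (the dictionary names are introduced here and are reused by the later sketches):

```python
# Assumed reconstruction of the slide-5 grammar (Foundations of
# Statistical NLP, ch. 11). Binary rules map (lhs, (left, right))
# to a probability; lexical rules map (lhs, word) to a probability.
BINARY_RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("PP", ("P", "NP")): 1.0,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("VP", "PP")): 0.3,
    ("NP", ("NP", "PP")): 0.4,
}
LEXICAL_RULES = {
    ("P", "with"): 1.0,
    ("V", "saw"): 1.0,
    ("NP", "astronomers"): 0.1,
    ("NP", "ears"): 0.18,
    ("NP", "saw"): 0.04,
    ("NP", "stars"): 0.18,
    ("NP", "telescopes"): 0.1,
}
```

Note that the probabilities of all rules with the same left-hand side sum to 1 (e.g., the six NP rules sum to 0.4 + 0.1 + 0.18 + 0.04 + 0.18 + 0.1 = 1.0), as the definition on slide 4 requires.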

  6. Parse Trees • Input: astronomers saw stars with ears • The probability of a particular parse is defined as the product of the probabilities of all the rules used to expand each node in the parse tree – An instance of PP-attachment ambiguity: the PP "with ears" can attach either to the NP "stars" or to the VP (a worked comparison follows below)
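The tree figures for the two parses were images in the original deck. Assuming the slide-5 grammar reconstructed above, and labeling the NP-attachment parse t₁ and the VP-attachment parse t₂ (labels introduced here), the two parse probabilities work out as:

```latex
\begin{aligned}
P(t_1) &= P(\text{S} \to \text{NP VP}) \cdot P(\text{NP} \to \text{astronomers})
          \cdot P(\text{VP} \to \text{V NP}) \cdot P(\text{V} \to \text{saw}) \\
       &\quad \cdot P(\text{NP} \to \text{NP PP}) \cdot P(\text{NP} \to \text{stars})
          \cdot P(\text{PP} \to \text{P NP}) \cdot P(\text{P} \to \text{with})
          \cdot P(\text{NP} \to \text{ears}) \\
       &= 1.0 \cdot 0.1 \cdot 0.7 \cdot 1.0 \cdot 0.4 \cdot 0.18 \cdot 1.0 \cdot 1.0 \cdot 0.18
        = 0.0009072 \\
P(t_2) &= 1.0 \cdot 0.1 \cdot 0.3 \cdot 0.7 \cdot 1.0 \cdot 0.18 \cdot 1.0 \cdot 1.0 \cdot 0.18
        = 0.0006804
\end{aligned}
```

So the NP-attachment parse t₁ is preferred. As a consistency check, 0.0009072 + 0.0006804 = 0.0015876, which is exactly the total sentence probability computed by the inside algorithm on slide 20.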

  7. Parse Trees • Input: dogs in houses and cats – An instance of coordination ambiguity: the string can be grouped as "(dogs in houses) and cats" or as "dogs in (houses and cats)" • Which one is correct? • However, the PCFG will assign identical probabilities to the two parses, since both use the same rules the same number of times

  8. Basic Assumptions • Place invariance – The probability of a subtree does not depend on where in the string the words it dominates are: P(N^j_{k(k+c)} → ζ) is the same for all positions k (a subtree spanning c+1 words has the same probability anywhere in the input string) • Context freeness – The probability of a subtree does not depend on words not dominated by the subtree: P(N^j_{kl} → ζ | anything outside words k through l) = P(N^j_{kl} → ζ) • Ancestor freeness – The probability of a subtree does not depend on nodes in the derivation outside the subtree: P(N^j_{kl} → ζ | any ancestor nodes outside N^j_{kl}) = P(N^j_{kl} → ζ)

  9. Basic Assumptions • Example (the worked derivation on this slide was a figure): a parse-tree probability is first decomposed with the chain rule, then simplified using the context-free and ancestor-free assumptions, and finally rewritten in terms of position-independent rule probabilities using the place-invariance assumption (a generic version of this reduction follows below)
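The slide's specific example is not recoverable from the extraction; here is a generic version of the reduction it illustrates, using the notation of slides 17–19 and writing t_{pq} for a subtree spanning words p through q (a label introduced for this sketch). For a node N^j_{pq} expanded by a binary rule into N^r_{pd} and N^s_{(d+1)q}:

```latex
\begin{aligned}
P(t_{pq} \mid N^j_{pq})
 &= P(N^r_{pd}, N^s_{(d+1)q} \mid N^j_{pq})
    \cdot P(t_{pd} \mid N^j_{pq}, N^r_{pd}, N^s_{(d+1)q})
    \cdot P(t_{(d+1)q} \mid N^j_{pq}, N^r_{pd}, N^s_{(d+1)q}, t_{pd})
    && \text{chain rule} \\
 &= P(N^r_{pd}, N^s_{(d+1)q} \mid N^j_{pq})
    \cdot P(t_{pd} \mid N^r_{pd})
    \cdot P(t_{(d+1)q} \mid N^s_{(d+1)q})
    && \text{context-free \& ancestor-free} \\
 &= P(N^j \to N^r N^s)
    \cdot P(t_{pd} \mid N^r_{pd})
    \cdot P(t_{(d+1)q} \mid N^s_{(d+1)q})
    && \text{place invariance}
\end{aligned}
```

Applying this recursively down the tree leaves only rule probabilities, giving P(t) = ∏ P(A → β) over all rules used in t, which is exactly the definition of parse probability on slide 6.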

  10. Some Features of PCFGs • PCFGs give some idea (probabilities) of the plausibility of different parses – But the probability estimates are based purely on structural factors, not lexical factors • PCFGs are good for grammar induction – A PCFG can be learned from data, e.g., from bracketed (labeled) corpora • PCFGs are robust – They handle grammatical mistakes, disfluencies, and errors not by ruling anything out of the grammar, but by giving implausible sentences a lower probability

  11. Chomsky Normal Form • Chomsky Normal Form (CNF) grammars have only unary and binary rules, of the form – N^j → N^r N^s for syntactic categories – N^j → w^k for lexical categories • The parameters of a PCFG in CNF, with n non-terminals and V terminals: – P(N^j → N^r N^s | G): an n³ matrix of parameters – P(N^j → w^k | G): an n·V matrix of parameters – n³ + n·V parameters in total, subject to ∑_{r,s} P(N^j → N^r N^s) + ∑_k P(N^j → w^k) = 1 for each N^j • Any CFG can be represented by a weakly equivalent CFG in CNF – "Weakly equivalent" means "generating the same language", but not necessarily assigning the same phrase structure to each sentence

  12. CYK Algorithm (Ney, 1991; Collins, 1999) • The CYK (Cocke-Younger-Kasami) algorithm – A bottom-up parser using a dynamic programming table – Assumes the PCFG is in Chomsky Normal Form (CNF) • Definitions – w_1 … w_n: an input string composed of n words – w_ij: the string of words from word i to word j – π[i, j, a]: a table entry holding the maximum probability for a constituent with non-terminal index a spanning words w_i … w_j

  13. CYK Algorithm • Fill out the table entries by induction – Base case • Consider input strings of length one (i.e., each individual word w_i) • Since the grammar is in CNF, A ⇒* w_i iff there is a rule A → w_i (A must be a lexical category), so π[i, i, A] = P(A → w_i) – Recursive case • For strings of words of length > 1, A ⇒* w_ij iff there is at least one rule A → B C and some split point k (i ≤ k < j) where B derives the first k−i+1 words and C derives the last j−k words (A must be a syntactic category) • Compute each candidate probability by multiplying the rule probability with the probabilities of the two pieces (already calculated in the recursion), and choose the maximum among all possibilities: π[i, j, A] = max_{B,C,k} P(A → B C) · π[i, k, B] · π[k+1, j, A→B C's C constituent] — that is, π[i, j, A] = max_{B,C,k} P(A → B C) · π[i, k, B] · π[k+1, j, C]

  14. CYK Algorithm • Finding the most likely parse for a sentence (the pseudocode on this slide was a figure: the π table is set to zero, filled bottom-up over increasing word spans, and back-pointers are kept for bookkeeping so the best parse can be reconstructed; a runnable sketch follows below) • Complexity: O(m³ n³) for an m-word input string and n non-terminals
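Since the slide's pseudocode is lost, here is a minimal Python sketch of the table-filling loop it depicts, written against the rule dictionaries assumed after slide 5 (the function name and data layout are illustrative, not the deck's):

```python
from collections import defaultdict

def cyk_parse(words, binary_rules, lexical_rules):
    """Most-likely-parse CYK for a PCFG in CNF (a sketch).

    pi[(i, j, A)] holds the maximum probability of non-terminal A
    deriving words i..j (1-indexed, inclusive); back[(i, j, A)]
    records the (k, B, C) split that achieved it, so the best parse
    tree can be reconstructed. Runtime is O(m^3 n^3) for m words
    and n non-terminals, as on the slide.
    """
    m = len(words)
    pi = defaultdict(float)   # missing entries read as probability 0
    back = {}

    # Base case (span 1): pi[i, i, A] = P(A -> w_i).
    for i, w in enumerate(words, start=1):
        for (a, word), prob in lexical_rules.items():
            if word == w:
                pi[(i, i, a)] = prob

    # Recursive case: maximize over binary rules A -> B C and splits k.
    for span in range(2, m + 1):
        for i in range(1, m - span + 2):
            j = i + span - 1
            for (a, (b, c)), rule_prob in binary_rules.items():
                for k in range(i, j):
                    p = rule_prob * pi[(i, k, b)] * pi[(k + 1, j, c)]
                    if p > pi[(i, j, a)]:
                        pi[(i, j, a)] = p
                        back[(i, j, a)] = (k, b, c)
    return pi, back
```

On the reconstructed slide-5 grammar, cyk_parse("astronomers saw stars with ears".split(), BINARY_RULES, LEXICAL_RULES)[0][(1, 5, "S")] comes out to ≈ 0.0009072, i.e., the NP-attachment parse from slide 6.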

  15. Three Basic Problems for PCFGs • What is the probability of a sentence w_1m according to a grammar G: P(w_1m | G)? • What is the most likely parse for a sentence: argmax_t P(t | w_1m, G)? • How can we choose rule probabilities for the grammar G that maximize the probability of a sentence: argmax_G P(w_1m | G)? — i.e., training the PCFG

  16. The Inside-Outside Algorithm (Baker, 1979; Young, 1990) • A generalization of the forward-backward algorithm for HMMs • A dynamic programming technique used to efficiently compute PCFG probabilities – Based on inside and outside probabilities in a PCFG

  17. The Inside-Outside Algorithm • Definitions – Inside probability: β_j(p, q) = P(w_pq | N^j_pq, G) • The total probability of generating the words w_p … w_q given that one starts with the non-terminal N^j spanning them – Outside probability: α_j(p, q) = P(w_1(p−1), N^j_pq, w_(q+1)m | G) • The total probability of beginning with the start symbol N¹ and generating the non-terminal N^j_pq and all the words outside w_p … w_q

  18. Problem 1: The Probability of a Sentence • A PCFG in Chomsky Normal Form is used here • The total probability of a sentence is expressed via the inside algorithm: P(w_1m | G) = P(N¹ ⇒* w_1m | G) = P(w_1m | N¹_1m, G) = β_1(1, m) • The probability of the base case (word span = 1): β_j(k, k) = P(w_k | N^j_kk, G) = P(N^j → w_k | G) • Find the probabilities β_j(p, q) for word spans > 1 by induction (or by recursion)

  19. Problem 1: The Probability of a Sentence • Find the probabilities β_j(p, q) by induction — a bottom-up calculation, for all j and all 1 ≤ p < q ≤ m:
β_j(p, q) = P(w_pq | N^j_pq, G)
= ∑_{r,s} ∑_{d=p}^{q−1} P(w_pd, N^r_pd, w_(d+1)q, N^s_(d+1)q | N^j_pq, G)
= ∑_{r,s} ∑_{d=p}^{q−1} P(N^r_pd, N^s_(d+1)q | N^j_pq, G) × P(w_pd | N^j_pq, N^r_pd, N^s_(d+1)q, G) × P(w_(d+1)q | N^j_pq, N^r_pd, N^s_(d+1)q, w_pd, G)   (chain rule)
= ∑_{r,s} ∑_{d=p}^{q−1} P(N^r_pd, N^s_(d+1)q | N^j_pq, G) × P(w_pd | N^r_pd, G) × P(w_(d+1)q | N^s_(d+1)q, G)   (context-free & ancestor-free assumptions)
= ∑_{r,s} ∑_{d=p}^{q−1} P(N^j → N^r N^s) × β_r(p, d) × β_s(d+1, q)   (place-invariance assumption applied to the binary rule)
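This recursion has the same chart shape as CYK, but sums over rules and split points instead of maximizing. A minimal Python sketch, again using the rule dictionaries assumed after slide 5 (the function name is illustrative):

```python
from collections import defaultdict

def inside_probabilities(words, binary_rules, lexical_rules):
    """Compute beta_j(p, q) = P(w_pq | N^j_pq, G) bottom-up (a sketch).

    Identical chart to CYK, but summing rather than maximizing, so
    the result is the total probability over all parses rather than
    the probability of the single best one.
    """
    m = len(words)
    beta = defaultdict(float)  # missing entries read as 0

    # Base case (span 1): beta_j(k, k) = P(N^j -> w_k | G).
    for k, w in enumerate(words, start=1):
        for (j, word), prob in lexical_rules.items():
            if word == w:
                beta[(k, k, j)] = prob

    # Induction (span > 1):
    # beta_j(p, q) = sum over rules N^j -> N^r N^s and splits d of
    #                P(N^j -> N^r N^s) * beta_r(p, d) * beta_s(d+1, q)
    for span in range(2, m + 1):
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (j, (r, s)), rule_prob in binary_rules.items():
                for d in range(p, q):
                    beta[(p, q, j)] += (rule_prob
                                        * beta[(p, d, r)]
                                        * beta[(d + 1, q, s)])
    return beta
```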

  20. Problem 1: The Probability of a Sentence • Example (using the slide-5 grammar), filling the chart from short spans up to the full sentence:
β_VP(2, 5) = P(VP → V NP) · β_V(2, 2) · β_NP(3, 5) + P(VP → VP PP) · β_VP(2, 3) · β_PP(4, 5)
= 0.7 × 1.0 × 0.01296 + 0.3 × 0.126 × 0.18 = 0.015876
β_S(1, 5) = P(S → NP VP) · β_NP(1, 1) · β_VP(2, 5)
= 1.0 × 0.1 × 0.015876 = 0.0015876
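As a sanity check, the sketch after slide 19 reproduces these numbers on the reconstructed slide-5 grammar (inside_probabilities, BINARY_RULES, and LEXICAL_RULES are the names assumed in the earlier sketches; results are up to floating-point rounding):

```python
words = "astronomers saw stars with ears".split()
beta = inside_probabilities(words, BINARY_RULES, LEXICAL_RULES)

print(beta[(2, 5, "VP")])  # ~0.015876, matching beta_VP(2, 5)
print(beta[(1, 5, "S")])   # ~0.0015876 = beta_1(1, m) = P(w_1m | G)
```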
