 
0. Probabilistic Context-Free Grammars
Based on “Foundations of Statistical NLP” by C. Manning & H. Schütze, ch. 11, MIT Press, 2002
1. A Sample PCFG
Rule (probability):
S → NP VP (1.0)
PP → P NP (1.0)
VP → V NP (0.7)
VP → VP PP (0.3)
P → with (1.0)
V → saw (1.0)
NP → NP PP (0.4)
NP → astronomers (0.1)
NP → ears (0.18)
NP → saw (0.04)
NP → stars (0.18)
NP → telescopes (0.1)
2. The Chomsky Normal Form of CFGs
CNF CFG: every rule rewrites a non-terminal either as exactly two non-terminals (N → X Y) or as a single terminal (N → w).
Proposition: Any CFG can be converted into a “weakly equivalent” CNF CFG.
Definition: Two grammars are weakly equivalent if they generate the same language. They are strongly equivalent if they also assign the same structures to strings.
3. Cocke-Younger-Kasami (CYK) Parsing Algorithm
◦ Works on CNF CFGs
• First, add the lexical edges
• Then:
    for w = 2 to N                 % scan left to right,
                                   % combining edges to form edges of width w
        for i = 0 to N − w
            for k = 0 to w − 2     % split point i + k + 1 lies strictly inside the span
                if (A → B C and B → α ∈ chart[i, i + k + 1] and C → β ∈ chart[i + k + 1, i + w])
                    add A → B C to chart[i, i + w]
• Finally, if S ∈ chart[0, N], return the corresponding parse
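The loop structure above can be sketched as a small recognizer. This is only an illustration under stated assumptions: the names `BINARY`, `LEXICAL`, `cyk_chart` and `recognizes` are hypothetical, the grammar is the sample PCFG from slide 1 with its probabilities dropped, and the chart stores bare nonterminal labels rather than full edges.

```python
from collections import defaultdict

# Sample grammar from slide 1 without probabilities (hypothetical encoding):
# plain CYK only needs to know which rules exist.
BINARY = [("S", "NP", "VP"), ("NP", "NP", "PP"), ("PP", "P", "NP"),
          ("VP", "V", "NP"), ("VP", "VP", "PP")]
LEXICAL = {"astronomers": {"NP"}, "ears": {"NP"}, "saw": {"NP", "V"},
           "stars": {"NP"}, "telescopes": {"NP"}, "with": {"P"}}


def cyk_chart(words):
    """Return chart[(i, j)] = set of nonterminals spanning words[i:j]."""
    n = len(words)
    chart = defaultdict(set)
    for i, w in enumerate(words):                 # lexical edges (width 1)
        chart[(i, i + 1)] |= LEXICAL.get(w, set())
    for width in range(2, n + 1):                 # widths 2..N
        for i in range(n - width + 1):            # left end of the span
            j = i + width
            for split in range(i + 1, j):         # split strictly inside the span
                for parent, left, right in BINARY:
                    if left in chart[(i, split)] and right in chart[(split, j)]:
                        chart[(i, j)].add(parent)
    return chart


def recognizes(words, start="S"):
    """True if the start symbol covers the whole input."""
    return start in cyk_chart(words)[(0, len(words))]
```

Calling `recognizes("astronomers saw stars with ears".split())` returns `True`, while a string such as `"saw with"` is rejected. Recovering the parse itself would additionally require back-pointers, which this sketch leaves out.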
4. Example: CYK with Chart Representation
[Figure: chart for the sentence “astronomers saw stars with ears”, with word boundaries numbered 0 to 5 and edges labelled NP, V, P, PP, VP and S drawn above the words.]
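5. Chart Representation as a Matrix
Sentence: astronomers (1) saw (2) stars (3) with (4) ears (5). Cell [p, q] lists the categories spanning words p to q; empty cells are omitted.
Row 1: [1,1] NP; [1,3] S; [1,5] S, S
Row 2: [2,2] NP, V; [2,3] VP; [2,5] VP, VP
Row 3: [3,3] NP; [3,5] NP
Row 4: [4,4] P; [4,5] PP
Row 5: [5,5] NP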
6. Assumptions of the PCFG Model
• $\forall i:\ \sum_j P(N^i \to \nu^j \mid N^i) = 1$
• Place invariance: the probability of a subtree does not depend on where in the string the words it dominates are
• Context-free: the probability of a subtree does not depend on words not dominated by the subtree
• Ancestor-free: the probability of a subtree does not depend on nodes outside of the subtree
7. Calculating the Probability of a Sentence
So, the probability of a sentence is
$$P(w_{1m}) = \sum_t P(w_{1m}, t) = \sum_{t:\ \mathrm{yield}(t) = w_{1m}} P(t)$$
where t is a parse tree of the sentence. To calculate the probability of a tree, multiply the probabilities of all the rules it uses.
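As a worked illustration, the sketch below multiplies rule probabilities for the two parses of “astronomers saw stars with ears” under the sample grammar of slide 1 and sums them. The tree encoding (nested tuples) and the names `RULE_PROB`, `rules_used`, `tree_prob`, `T1` and `T2` are assumptions made for this example only.

```python
from math import prod

# Sample grammar from slide 1, keyed by (left-hand side, right-hand side).
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0, ("PP", ("P", "NP")): 1.0,
    ("VP", ("V", "NP")): 0.7, ("VP", ("VP", "PP")): 0.3,
    ("NP", ("NP", "PP")): 0.4, ("NP", ("astronomers",)): 0.1,
    ("NP", ("ears",)): 0.18, ("NP", ("saw",)): 0.04,
    ("NP", ("stars",)): 0.18, ("NP", ("telescopes",)): 0.1,
    ("P", ("with",)): 1.0, ("V", ("saw",)): 1.0,
}


def rules_used(tree):
    """Yield (lhs, rhs) for every local tree.

    A tree is a tuple (label, child, ...); a leaf is a plain string.
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from rules_used(child)


def tree_prob(tree):
    """Probability of a tree: the product of the probabilities of its rules."""
    return prod(RULE_PROB[rule] for rule in rules_used(tree))


PP = ("PP", ("P", "with"), ("NP", "ears"))
# T1 attaches the PP to the noun phrase, T2 attaches it to the verb phrase.
T1 = ("S", ("NP", "astronomers"),
      ("VP", ("V", "saw"), ("NP", ("NP", "stars"), PP)))
T2 = ("S", ("NP", "astronomers"),
      ("VP", ("VP", ("V", "saw"), ("NP", "stars")), PP))

sentence_prob = tree_prob(T1) + tree_prob(T2)
```

Here `tree_prob(T1)` is approximately 0.0009072 and `tree_prob(T2)` approximately 0.0006804, so the sentence probability is approximately 0.0015876 (up to floating-point rounding).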
8. Inside and Outside Probabilities
Outside (α): the total probability of beginning in $N^1$ and generating $N^j_{pq}$ and the words outside positions p to q:
$$\alpha_j(p,q) = P(w_{1(p-1)},\ N^j_{pq},\ w_{(q+1)m})$$
Inside (β): the total probability of generating the words from p to q given that we start at nonterminal $N^j$:
$$\beta_j(p,q) = P(w_{pq} \mid N^j_{pq})$$
[Figure: a parse tree rooted in $N^1$ over $w_1 \ldots w_m$. The node $N^j$ dominates $w_p \ldots w_q$ (inside region β); the words $w_1 \ldots w_{p-1}$ and $w_{q+1} \ldots w_m$ form the outside region α.]
9. Computing Inside Probabilities
Base case:
$$\beta_j(k,k) = P(N^j \to w_k \mid N^j)$$
Induction step:
$$\beta_j(p,q) = P(w_{pq} \mid N^j_{pq}) = \sum_{r,s} \sum_{d=p}^{q-1} P_G(N^j \to N^r N^s \mid N^j)\ \beta_r(p,d)\ \beta_s(d+1,q)$$
[Figure: $N^j$ expands to $N^r$ and $N^s$, where $N^r$ dominates $w_p \ldots w_d$ and $N^s$ dominates $w_{d+1} \ldots w_q$.]
10. Computing Inside Probabilities: Induction
$$\begin{aligned}
\beta_j(p,q) &= P(w_{pq} \mid N^j_{pq}) \\
&= \sum_{r,s} \sum_{d=p}^{q-1} P(w_{pd},\ N^r_{pd},\ w_{(d+1)q},\ N^s_{(d+1)q} \mid N^j_{pq}) \\
&= \sum_{r,s} \sum_{d=p}^{q-1} P(N^r_{pd},\ N^s_{(d+1)q} \mid N^j_{pq})\ P(w_{pd} \mid N^j_{pq},\ N^r_{pd},\ N^s_{(d+1)q}) \\
&\qquad\qquad \times\ P(w_{(d+1)q} \mid N^j_{pq},\ N^r_{pd},\ N^s_{(d+1)q},\ w_{pd}) \\
&= \sum_{r,s} \sum_{d=p}^{q-1} P(N^r_{pd},\ N^s_{(d+1)q} \mid N^j_{pq})\ P(w_{pd} \mid N^r_{pd})\ P(w_{(d+1)q} \mid N^s_{(d+1)q}) \\
&= \sum_{r,s} \sum_{d=p}^{q-1} P(N^j \to N^r N^s)\ \beta_r(p,d)\ \beta_s(d+1,q)
\end{aligned}$$
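11. Computing Inside Probabilities
Sentence: astronomers (1) saw (2) stars (3) with (4) ears (5). Cell [p, q] holds the nonzero inside probabilities β(p, q); empty cells are omitted.
Row 1: [1,1] β_NP = 0.1; [1,3] β_S = 0.0126; [1,5] β_S = 0.0015876
Row 2: [2,2] β_NP = 0.04, β_V = 1.0; [2,3] β_VP = 0.126; [2,5] β_VP = 0.015876
Row 3: [3,3] β_NP = 0.18; [3,5] β_NP = 0.01296
Row 4: [4,4] β_P = 1.0; [4,5] β_PP = 0.18
Row 5: [5,5] β_NP = 0.18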
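The inside probabilities for this sentence can be recomputed with a short dynamic program over the base case and induction step of slide 9. This is a minimal sketch: the dictionaries `BINARY` and `LEXICAL` and the function `inside` are hypothetical names, and spans are indexed by 1-based inclusive word positions as on the slides.

```python
from collections import defaultdict

# Sample grammar from slide 1, split into binary and lexical rules.
BINARY = {("S", "NP", "VP"): 1.0, ("NP", "NP", "PP"): 0.4,
          ("PP", "P", "NP"): 1.0, ("VP", "V", "NP"): 0.7,
          ("VP", "VP", "PP"): 0.3}
LEXICAL = {("NP", "astronomers"): 0.1, ("NP", "ears"): 0.18,
           ("NP", "saw"): 0.04, ("NP", "stars"): 0.18,
           ("NP", "telescopes"): 0.1, ("P", "with"): 1.0,
           ("V", "saw"): 1.0}


def inside(words):
    """Return beta[(A, p, q)] for 1-based inclusive spans p..q."""
    m = len(words)
    beta = defaultdict(float)                     # missing entries count as 0
    for k, w in enumerate(words, start=1):        # base case: beta_j(k, k)
        for (a, word), prob in LEXICAL.items():
            if word == w:
                beta[(a, k, k)] = prob
    for width in range(2, m + 1):                 # induction: shorter spans first
        for p in range(1, m - width + 2):
            q = p + width - 1
            for (a, b, c), prob in BINARY.items():
                for d in range(p, q):             # split: p..d and d+1..q
                    beta[(a, p, q)] += prob * beta[(b, p, d)] * beta[(c, d + 1, q)]
    return beta


beta = inside("astronomers saw stars with ears".split())
```

With this sketch, `beta[("S", 1, 5)]` comes out at approximately 0.0015876 and `beta[("VP", 2, 5)]` at approximately 0.015876, in agreement with the entries listed for slide 11.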
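12. Computing Outside Probabilities
Base case:
$$\alpha_1(1,m) = 1, \quad \text{and} \quad \alpha_j(1,m) = 0 \ \text{for } j \neq 1$$
Induction step:
$$\alpha_j(p,q) = \sum_{f,g} \sum_{e=q+1}^{m} \alpha_f(p,e)\ P_G(N^f \to N^j N^g \mid N^f)\ \beta_g(q+1,e)\ +\ \sum_{f,g} \sum_{e=1}^{p-1} \alpha_f(e,q)\ P_G(N^f \to N^g N^j \mid N^f)\ \beta_g(e,p-1)$$
[Figure: two trees rooted in $N^1$. Left: $N^f$ spans $w_p \ldots w_e$ with $N^j$ as its left child over $w_p \ldots w_q$ and $N^g$ as its right child over $w_{q+1} \ldots w_e$. Right: $N^f$ spans $w_e \ldots w_q$ with $N^g$ as its left child over $w_e \ldots w_{p-1}$ and $N^j$ as its right child over $w_p \ldots w_q$.]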
13. Computing Outside Probabilities: Induction
$$\begin{aligned}
\alpha_j(p,q) &= \sum_{f,g} \sum_{e=q+1}^{m} P(w_{1(p-1)},\ w_{(q+1)m},\ N^f_{pe},\ N^j_{pq},\ N^g_{(q+1)e}) \\
&\quad + \sum_{f,g} \sum_{e=1}^{p-1} P(w_{1(p-1)},\ w_{(q+1)m},\ N^f_{eq},\ N^g_{e(p-1)},\ N^j_{pq}) \\
&= \sum_{f,g} \sum_{e=q+1}^{m} P(w_{1(p-1)},\ w_{(e+1)m},\ N^f_{pe})\ P(N^j_{pq},\ N^g_{(q+1)e} \mid N^f_{pe})\ P(w_{(q+1)e} \mid N^g_{(q+1)e}) \\
&\quad + \sum_{f,g} \sum_{e=1}^{p-1} P(w_{1(e-1)},\ w_{(q+1)m},\ N^f_{eq})\ P(N^g_{e(p-1)},\ N^j_{pq} \mid N^f_{eq})\ P(w_{e(p-1)} \mid N^g_{e(p-1)}) \\
&= \sum_{f,g} \sum_{e=q+1}^{m} \alpha_f(p,e)\ P_G(N^f \to N^j N^g \mid N^f)\ \beta_g(q+1,e) \\
&\quad + \sum_{f,g} \sum_{e=1}^{p-1} \alpha_f(e,q)\ P_G(N^f \to N^g N^j \mid N^f)\ \beta_g(e,p-1)
\end{aligned}$$
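The outside recursion can be sketched in the same style, working from the widest span down to single words. The code below is an illustration under assumptions: `BINARY`, `LEXICAL`, `NONTERMS`, `inside` and `outside` are hypothetical names, the grammar is the sample PCFG of slide 1, and the two inner loops follow the two sums of the induction step exactly as written on slide 12.

```python
from collections import defaultdict

# Sample grammar from slide 1, split into binary and lexical rules.
BINARY = {("S", "NP", "VP"): 1.0, ("NP", "NP", "PP"): 0.4,
          ("PP", "P", "NP"): 1.0, ("VP", "V", "NP"): 0.7,
          ("VP", "VP", "PP"): 0.3}
LEXICAL = {("NP", "astronomers"): 0.1, ("NP", "ears"): 0.18,
           ("NP", "saw"): 0.04, ("NP", "stars"): 0.18,
           ("NP", "telescopes"): 0.1, ("P", "with"): 1.0,
           ("V", "saw"): 1.0}
NONTERMS = {"S", "NP", "VP", "PP", "P", "V"}


def inside(words):
    """beta[(A, p, q)] for 1-based inclusive spans (slide 9)."""
    m, beta = len(words), defaultdict(float)
    for k, w in enumerate(words, start=1):
        for (a, word), prob in LEXICAL.items():
            if word == w:
                beta[(a, k, k)] = prob
    for width in range(2, m + 1):
        for p in range(1, m - width + 2):
            q = p + width - 1
            for (a, b, c), prob in BINARY.items():
                for d in range(p, q):
                    beta[(a, p, q)] += prob * beta[(b, p, d)] * beta[(c, d + 1, q)]
    return beta


def outside(words, beta, start="S"):
    """alpha[(A, p, q)] for 1-based inclusive spans (slide 12)."""
    m, alpha = len(words), defaultdict(float)
    alpha[(start, 1, m)] = 1.0                    # base case; all others are 0
    for width in range(m - 1, 0, -1):             # parents are always wider
        for p in range(1, m - width + 2):
            q = p + width - 1
            for (f, left, right), prob in BINARY.items():
                # Span p..q is the left child; the sibling covers q+1..e.
                for e in range(q + 1, m + 1):
                    alpha[(left, p, q)] += alpha[(f, p, e)] * prob * beta[(right, q + 1, e)]
                # Span p..q is the right child; the sibling covers e..p-1.
                for e in range(1, p):
                    alpha[(right, p, q)] += alpha[(f, e, q)] * prob * beta[(left, e, p - 1)]
    return alpha


words = "astronomers saw stars with ears".split()
beta = inside(words)
alpha = outside(words, beta)
```

A useful sanity check on such a sketch is that, for every word position k, summing `alpha[(A, k, k)] * beta[(A, k, k)]` over all nonterminals A gives the sentence probability `beta[("S", 1, 5)]`, since every parse assigns exactly one preterminal to each word.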
14. Finding the Most Likely Parse Sequence: Viterbi Algorithm
Base case:
$$\delta_i(p,p) = P(N^i \to w_p \mid N^i)$$
Induction step:
$$\delta_i(p,q) = \max_{1 \le j,k \le n;\ p \le r < q} P_G(N^i \to N^j N^k \mid N^i)\ \delta_j(p,r)\ \delta_k(r+1,q)$$
$$\psi_i(p,q) = \operatorname*{argmax}_{(j,k,r)} P_G(N^i \to N^j N^k \mid N^i)\ \delta_j(p,r)\ \delta_k(r+1,q)$$
Termination:
$$P_G(\hat{t}) = \delta_1(1,m)$$
Path readout (by backtracing): if $\hat{X}_\chi = N^i_{pq}$ is in the Viterbi parse, and $\psi_i(p,q) = (j,k,r)$, then $\mathrm{left}(\hat{X}_\chi) = N^j_{pr}$ and $\mathrm{right}(\hat{X}_\chi) = N^k_{(r+1)q}$.
($N^1_{1m}$ is the root node of the Viterbi parse.)
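The same chart layout supports the Viterbi recursion if sums are replaced by maxima and back-pointers are stored. The sketch below is illustrative only: `viterbi_parse`, `delta`, `psi` and the tuple encoding of trees are assumptions for this example, and ties are broken by keeping the first maximum found.

```python
# Sample grammar from slide 1, split into binary and lexical rules.
BINARY = {("S", "NP", "VP"): 1.0, ("NP", "NP", "PP"): 0.4,
          ("PP", "P", "NP"): 1.0, ("VP", "V", "NP"): 0.7,
          ("VP", "VP", "PP"): 0.3}
LEXICAL = {("NP", "astronomers"): 0.1, ("NP", "ears"): 0.18,
           ("NP", "saw"): 0.04, ("NP", "stars"): 0.18,
           ("NP", "telescopes"): 0.1, ("P", "with"): 1.0,
           ("V", "saw"): 1.0}


def viterbi_parse(words, start="S"):
    """Return (probability, tree) of the most likely parse, or (0.0, None)."""
    m = len(words)
    delta, psi = {}, {}                           # best scores and back-pointers
    for p, w in enumerate(words, start=1):        # base case: delta_i(p, p)
        for (a, word), prob in LEXICAL.items():
            if word == w:
                delta[(a, p, p)] = prob
    for width in range(2, m + 1):                 # induction over span width
        for p in range(1, m - width + 2):
            q = p + width - 1
            for (a, b, c), prob in BINARY.items():
                for r in range(p, q):
                    score = (prob * delta.get((b, p, r), 0.0)
                             * delta.get((c, r + 1, q), 0.0))
                    if score > delta.get((a, p, q), 0.0):
                        delta[(a, p, q)] = score
                        psi[(a, p, q)] = (b, c, r)

    def build(a, p, q):
        """Read the tree out of the back-pointers (path readout)."""
        if p == q:
            return (a, words[p - 1])
        b, c, r = psi[(a, p, q)]
        return (a, build(b, p, r), build(c, r + 1, q))

    root = (start, 1, m)
    if root not in delta:
        return 0.0, None
    return delta[root], build(*root)


best_prob, best_tree = viterbi_parse("astronomers saw stars with ears".split())
```

For the example sentence, `best_prob` is approximately 0.0009072 and `best_tree` attaches the prepositional phrase “with ears” to the noun phrase “stars”, because 0.7 × 0.01296 exceeds 0.3 × 0.126 × 0.18 in the cell for the VP spanning words 2 to 5.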
15. Learning PCFGs: The Inside-Outside (EM) Algorithm
Combining inside and outside probabilities:
$$\alpha_j(p,q)\ \beta_j(p,q) = P_G(N^1 \Rightarrow^* w_{1m},\ N^j \Rightarrow^* w_{pq}) = P_G(N^1 \Rightarrow^* w_{1m})\ P_G(N^j \Rightarrow^* w_{pq} \mid N^1 \Rightarrow^* w_{1m})$$
Denoting $\pi = P_G(N^1 \Rightarrow^* w_{1m})$, it follows that
$$P_G(N^j \Rightarrow^* w_{pq} \mid N^1 \Rightarrow^* w_{1m}) = \frac{1}{\pi}\ \alpha_j(p,q)\ \beta_j(p,q)$$
$$P_G(N^j \to N^r N^s \Rightarrow^* w_{pq} \mid N^1 \Rightarrow^* w_{1m}) = \frac{1}{\pi} \sum_{d=p}^{q-1} \alpha_j(p,q)\ P_G(N^j \to N^r N^s \mid N^j)\ \beta_r(p,d)\ \beta_s(d+1,q)$$
$$P_G(N^j \to w_k \mid N^1 \Rightarrow^* w_{1m},\ w_k = w_h) = \frac{1}{\pi}\ \alpha_j(h,h)\ P(w_k = w_h)\ \beta_j(h,h)$$
16. The Inside-Outside Algorithm: E-step
Assume that we have a set of sentences $W = \{W_1, \ldots, W_\omega\}$, where sentence $W_i$ has length $m_i$, inside and outside probabilities $\beta^i$ and $\alpha^i$, and $\pi_i = P_G(N^1 \Rightarrow^* W_i)$.
$$f_i(p,q,j,r,s) = \frac{1}{\pi_i} \sum_{d=p}^{q-1} \alpha^i_j(p,q)\ P_G(N^j \to N^r N^s \mid N^j)\ \beta^i_r(p,d)\ \beta^i_s(d+1,q)$$
$$g_i(h,j,k) = \frac{1}{\pi_i}\ \alpha^i_j(h,h)\ P(w_k = w_h)\ \beta^i_j(h,h)$$
$$h_i(p,q,j) = \frac{1}{\pi_i}\ \alpha^i_j(p,q)\ \beta^i_j(p,q)$$
$$\hat{P}_G(N^j \to N^r N^s) = \sum_{i=1}^{\omega} \sum_{p=1}^{m_i - 1} \sum_{q=p+1}^{m_i} f_i(p,q,j,r,s)$$
$$\hat{P}_G(N^j \to w_k) = \sum_{i=1}^{\omega} \sum_{h=1}^{m_i} g_i(h,j,k)$$
$$\hat{P}_G(N^j) = \sum_{i=1}^{\omega} \sum_{p=1}^{m_i} \sum_{q=p}^{m_i} h_i(p,q,j)$$
17. The Inside-Outside Algorithm: M-step
$$P_{G'}(N^j \to N^r N^s \mid N^j) = \frac{\hat{P}_G(N^j \to N^r N^s)}{\hat{P}_G(N^j)} = \frac{\sum_{i=1}^{\omega} \sum_{p=1}^{m_i - 1} \sum_{q=p+1}^{m_i} f_i(p,q,j,r,s)}{\sum_{i=1}^{\omega} \sum_{p=1}^{m_i} \sum_{q=p}^{m_i} h_i(p,q,j)}$$
$$P_{G'}(N^j \to w_k \mid N^j) = \frac{\hat{P}_G(N^j \to w_k)}{\hat{P}_G(N^j)} = \frac{\sum_{i=1}^{\omega} \sum_{h=1}^{m_i} g_i(h,j,k)}{\sum_{i=1}^{\omega} \sum_{p=1}^{m_i} \sum_{q=p}^{m_i} h_i(p,q,j)}$$
$$P(W \mid G') \ge P(W \mid G) \quad \text{(Baum-Welch)}$$
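One full iteration (E-step counts followed by M-step normalization) can be sketched as below. This is a minimal illustration, not a tuned trainer: `inside`, `outside` and `em_step` are hypothetical names, the starting grammar is the sample PCFG of slide 1, and, as an assumption of this sketch, rules of a nonterminal that never occurs in any parse keep their old probability while unparsable sentences contribute no counts.

```python
from collections import defaultdict

# Sample grammar from slide 1, split into binary and lexical rules.
BINARY = {("S", "NP", "VP"): 1.0, ("NP", "NP", "PP"): 0.4,
          ("PP", "P", "NP"): 1.0, ("VP", "V", "NP"): 0.7,
          ("VP", "VP", "PP"): 0.3}
LEXICAL = {("NP", "astronomers"): 0.1, ("NP", "ears"): 0.18,
           ("NP", "saw"): 0.04, ("NP", "stars"): 0.18,
           ("NP", "telescopes"): 0.1, ("P", "with"): 1.0,
           ("V", "saw"): 1.0}


def inside(words, binary, lexical):
    m, beta = len(words), defaultdict(float)
    for k, w in enumerate(words, start=1):
        for (a, word), prob in lexical.items():
            if word == w:
                beta[(a, k, k)] = prob
    for width in range(2, m + 1):
        for p in range(1, m - width + 2):
            q = p + width - 1
            for (a, b, c), prob in binary.items():
                for d in range(p, q):
                    beta[(a, p, q)] += prob * beta[(b, p, d)] * beta[(c, d + 1, q)]
    return beta


def outside(words, beta, binary, start):
    m, alpha = len(words), defaultdict(float)
    alpha[(start, 1, m)] = 1.0
    for width in range(m - 1, 0, -1):
        for p in range(1, m - width + 2):
            q = p + width - 1
            for (f, left, right), prob in binary.items():
                for e in range(q + 1, m + 1):     # span is the left child
                    alpha[(left, p, q)] += alpha[(f, p, e)] * prob * beta[(right, q + 1, e)]
                for e in range(1, p):             # span is the right child
                    alpha[(right, p, q)] += alpha[(f, e, q)] * prob * beta[(left, e, p - 1)]
    return alpha


def em_step(corpus, binary, lexical, start="S"):
    """One Inside-Outside iteration: (new_binary, new_lexical, old likelihood)."""
    bin_count, lex_count = defaultdict(float), defaultdict(float)
    nt_count, likelihood = defaultdict(float), 1.0
    for words in corpus:
        m = len(words)
        beta = inside(words, binary, lexical)
        alpha = outside(words, beta, binary, start)
        pi = beta[(start, 1, m)]
        likelihood *= pi
        if pi == 0.0:
            continue                              # sketch choice: skip unparsable input
        for p in range(1, m + 1):                 # f_i: expected binary rule uses
            for q in range(p + 1, m + 1):
                for (a, b, c), prob in binary.items():
                    for d in range(p, q):
                        bin_count[(a, b, c)] += (alpha[(a, p, q)] * prob
                                                 * beta[(b, p, d)] * beta[(c, d + 1, q)] / pi)
        for h, w in enumerate(words, start=1):    # g_i: expected lexical rule uses
            for (a, word) in lexical:
                if word == w:
                    lex_count[(a, word)] += alpha[(a, h, h)] * beta[(a, h, h)] / pi
        for (a, p, q), out in alpha.items():      # h_i: expected nonterminal uses
            nt_count[a] += out * beta[(a, p, q)] / pi
    # M-step: relative expected frequencies; unused nonterminals keep old values.
    new_binary = {r: bin_count[r] / nt_count[r[0]] if nt_count[r[0]] > 0 else prob
                  for r, prob in binary.items()}
    new_lexical = {r: lex_count[r] / nt_count[r[0]] if nt_count[r[0]] > 0 else prob
                   for r, prob in lexical.items()}
    return new_binary, new_lexical, likelihood


corpus = ["astronomers saw stars with ears".split()]
new_binary, new_lexical, old_likelihood = em_step(corpus, BINARY, LEXICAL)
_, _, new_likelihood = em_step(corpus, new_binary, new_lexical)
```

On this one-sentence corpus the re-estimated probabilities of each nonterminal's rules still sum to 1, and `new_likelihood` is not smaller than `old_likelihood`, which is the behaviour the Baum-Welch inequality on the slide guarantees. A real trainer would iterate until the likelihood stops improving and would work in log space to avoid underflow on larger corpora.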
18. Problems with the Inside-Outside Algorithm
1. It is much slower than linear models like HMMs: for each sentence of length m, each training iteration is O(m³n³), where n is the number of nonterminals in G.
2. The algorithm is very sensitive to the initialization: [Charniak, 1993] reports finding different local maxima for each of 300 trials of a PCFG on artificial data! Proposed solutions: [Lari & Young, 1990]
3. Experiments suggest that satisfactory PCFG learning requires many more nonterminals (i.e., about 3 times) than are theoretically needed to describe the language.
19. “Problems” with the Learned PCFGs (Cont.)
4. There is no guarantee that the learned nonterminals will bear any resemblance to the linguistically motivated nonterminals we would use to write the grammar by hand.
5. Even if the grammar is initialized with such nonterminals, the training process may completely change the meaning of those nonterminals.
6. Thus, while grammar induction from unannotated corpora is possible with PCFGs, it is extremely difficult.