SLIDE 1

Probabilistic Context-Free Grammars

References:

1. Speech and Language Processing, chapter 12
2. Foundations of Statistical Natural Language Processing, chapters 11, 12

Berlin Chen

Graduate Institute of Computer Science & Information Engineering National Taiwan Normal University

SLIDE 2

Parsing for Disambiguation (1/2)

  • At least three ways to use probabilities in a parser
    – Probabilities for choosing between parses
      • Choose, from among the many parses of the input sentence, the ones that are most likely
    – Probabilities for speedier parsing
      • Use probabilities to order or prune the search space of a parser so that the best parse is found more quickly
    – Probabilities for determining the sentence
      • Use a parser as an augmented language model over a word lattice in order to determine the sequence of words that has the highest probability

Parsing as Search

SLIDE 3

Parsing for Disambiguation (2/2)

  • The integration of sophisticated structural and probabilistic models of syntax is at the very cutting edge of the field
    – For non-probabilistic syntax analysis
      • The context-free grammar (CFG) is the standard
    – For probabilistic syntax analysis
      • No single model has become a standard
      • There are a number of probabilistic augmentations to context-free grammars
        – Probabilistic CFG with the CYK algorithm
        – Probabilistic lexicalized CFG
        – Dependency grammars
        – ...

SLIDE 4

Definition of the PCFG

  • A PCFG G has five parameters
    1. A set of non-terminal symbols (or "variables") N
    2. A set of terminal symbols Σ (disjoint from N)
    3. A set of productions P, each of the form A→β, where A is a non-terminal symbol and β is a string of symbols drawn from the infinite set of strings (Σ∪N)*
    4. A designated start symbol S (or N1)
    5. A function D that augments each rule in P with a conditional probability: A→β [prob.]
  • A PCFG is thus G = (N, Σ, P, S, D)
  • The rule probability is written P(A→β) or P(A→β|A), and for every non-terminal the probabilities of its expansions sum to one (Booth, 1969):

    $$\forall A:\ \sum_{\beta} P(A \rightarrow \beta) = 1$$

  (The terminals Σ are words; the non-terminals N are syntactic categories and lexical categories.)
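To make the definition concrete, here is a minimal sketch (not from the slides) of how such a grammar could be stored and validated in Python. The toy rules and probabilities below are an illustrative assumption in the spirit of the textbook example, not necessarily the grammar on the "An Example Grammar" slide (whose table did not survive extraction); the same dictionary is reused by the later sketches.

```python
# A PCFG as a mapping: LHS non-terminal -> list of (RHS tuple, probability).
# Toy fragment used only for illustration in the sketches that follow.
pcfg = {
    "S":  [(("NP", "VP"), 1.0)],
    "VP": [(("V", "NP"), 0.7), (("VP", "PP"), 0.3)],
    "NP": [(("NP", "PP"), 0.4), (("astronomers",), 0.1), (("ears",), 0.18),
           (("saw",), 0.04), (("stars",), 0.18), (("telescopes",), 0.1)],
    "PP": [(("P", "NP"), 1.0)],
    "V":  [(("saw",), 1.0)],
    "P":  [(("with",), 1.0)],
}

def check_pcfg(grammar, tol=1e-9):
    """Verify the PCFG constraint: for every A, the probabilities of A's expansions sum to 1."""
    for lhs, expansions in grammar.items():
        total = sum(p for _, p in expansions)
        assert abs(total - 1.0) < tol, f"probabilities for {lhs} sum to {total}"

check_pcfg(pcfg)
```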

SLIDE 5

An Example Grammar

SLIDE 6

Parse Trees (1/2)

  • Input: astronomers saw stars with ears
    – An instance of PP-attachment ambiguity
  • The probability of a particular parse is defined as the product of the probabilities of all the rules used to expand each node in the parse tree
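As a quick worked illustration of this definition (a sketch that assumes the hypothetical toy grammar `pcfg` defined earlier), the probability of one parse of the input is simply the product of its rule probabilities:

```python
# One parse of "astronomers saw stars with ears" (the reading where the PP attaches to the NP),
# written as the list of rules it uses; the probabilities come from the toy grammar above.
parse_rules = [
    ("S",  ("NP", "VP")),      # 1.0
    ("NP", ("astronomers",)),  # 0.1
    ("VP", ("V", "NP")),       # 0.7
    ("V",  ("saw",)),          # 1.0
    ("NP", ("NP", "PP")),      # 0.4
    ("NP", ("stars",)),        # 0.18
    ("PP", ("P", "NP")),       # 1.0
    ("P",  ("with",)),         # 1.0
    ("NP", ("ears",)),         # 0.18
]

prob = 1.0
for lhs, rhs in parse_rules:
    prob *= dict(pcfg[lhs])[rhs]   # look up P(lhs -> rhs)
print(prob)  # ~0.0009072 for this reading
```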

SLIDE 7

Parse Trees (2/2)

  • Input: dogs in houses and cats
    – An instance of coordination ambiguity
  • Which one is correct?
  • However, the PCFG will assign identical probabilities to the two parses

SLIDE 8

Basic Assumptions (1/2)

  • Place Invariance

– The probability of a subtree does not depend on where in the string the words it dominates are

  • Context free

– The probability of a subtree does not depend on words not dominated by the subtree

  • Ancestor free

– The probability of a subtree does not depend on nodes in the derivation outside the subtree

  Formally (with subscripts denoting word positions in the input string, so that N^j_{k(k+c)} dominates the c+1 words w_k ... w_{k+c} of w_1 ... w_n):

    $$\text{Place invariance:}\quad \forall k\;\; P\big(N^j_{k(k+c)} \rightarrow \zeta\big) = P\big(N^j \rightarrow \zeta\big)$$

    $$\text{Context-free:}\quad P\big(N^j_{kl} \rightarrow \zeta \mid \text{anything outside } w_k \ldots w_l\big) = P\big(N^j_{kl} \rightarrow \zeta\big)$$

    $$\text{Ancestor-free:}\quad P\big(N^j_{kl} \rightarrow \zeta \mid \text{any ancestor nodes outside } N^j_{kl}\big) = P\big(N^j_{kl} \rightarrow \zeta\big)$$

SLIDE 9

Basic Assumptions (2/2)

  • Example: the probability of a parse tree is first factored with the chain rule, then simplified using the context-free & ancestor-free assumptions and finally the place-invariance assumption (the worked derivation appears on the slide).

SLIDE 10

Some Features of PCFGs

  • PCFGs give some idea (probabilities) of the plausibility of different parses
    – But the probability estimates are based purely on structural factors and not lexical factors
  • PCFGs are good for grammar induction
    – A PCFG can be learned from data, e.g. from bracketed (labeled) corpora
  • PCFGs are robust
    – They tackle grammatical mistakes, disfluencies and errors by ruling out nothing in the grammar, but by just giving implausible sentences a lower probability

SLIDE 11

Chomsky Normal Form

  • Chomsky Normal Form (CNF) grammars only have unary and binary rules of the form

    $$N^j \rightarrow N^r N^s \qquad\qquad N^j \rightarrow w^k$$

  • The parameters of a PCFG in CNF are

    $$P\big(N^i \rightarrow N^r N^s \mid G\big) \qquad \text{(for syntactic categories: an } n^3 \text{ matrix of parameters, with } n \text{ nonterminals)}$$

    $$P\big(N^i \rightarrow w^k \mid G\big) \qquad \text{(for lexical categories: an } nV \text{ matrix of parameters, with } n \text{ nonterminals and } V \text{ terminals)}$$

    – That is, n³+nV parameters in all, subject to

    $$\sum_{r,s} P\big(N^i \rightarrow N^r N^s\big) + \sum_{k} P\big(N^i \rightarrow w^k\big) = 1$$

  • Any CFG can be represented by a weakly equivalent CFG in CNF
    – "weakly equivalent": generating the same language
      • But it does not assign the same phrase structure to each sentence

SLIDE 12

CYK Algorithm (1/3)

  • CYK (Cocke-Younger-Kasami) algorithm (Collins, 1999; Ney, 1991)
    – A bottom-up parser using a dynamic programming table
    – Assumes the PCFG is in Chomsky normal form (CNF)
  • Definitions
    – w1…wn: an input string composed of n words
    – wij: the string of words from word i to word j
    – π[i, j, a]: a table entry that holds the maximum probability for a constituent with non-terminal index a spanning words wi…wj
  (Diagram: non-terminal N^a dominating the span wi … wj within w1 … wn)

SLIDE 13

CYK Algorithm (2/3)

  • Fill out the table entries by induction
    – Base case
      • Consider input strings of length one (i.e. each individual word wi)
      • Since the grammar is in CNF, A ⇒* wi iff there is a rule A → wi (A must be a lexical category), so

      $$\pi[i, i, A] = P\big(A \rightarrow w_i\big)$$

    – Recursive case
      • For strings of words of length > 1, A ⇒* wij iff there is at least one rule A → B C and some split point k (i ≤ k < j) such that B derives the first k−i+1 symbols (wi … wk) and C derives the last j−k symbols (wk+1 … wj); here A must be a syntactic category
      • Compute the probability by multiplying together the probabilities of these two pieces (B and C; notice that they have already been calculated in the recursion), and choose the maximum among all possibilities:

      $$\pi[i, j, A] = \max_{A \rightarrow B\,C,\;\, i \le k < j} P\big(A \rightarrow B\,C\big)\; \pi[i, k, B]\; \pi[k+1, j, C]$$

SLIDE 14

CYK Algorithm (3/3)

  (Pseudocode figure: finding the most likely parse of an m-word input string with n non-terminals. The π table is initialized to zero, filled in order of increasing word-span with backtrace bookkeeping for each entry, and the answer is read off at the start symbol spanning the whole string. Complexity: O(m³n³).)
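Since the pseudocode itself did not survive extraction, the following is a minimal sketch of the procedure just described, assuming the hypothetical toy grammar `pcfg` from the earlier sketch (binary syntactic rules plus unary lexical rules); the function and variable names are illustrative, not the slides'.

```python
def cyk_parse(words, grammar, start="S"):
    """Viterbi CYK: pi[(i, j, A)] = max probability of A spanning words i..j (1-based)."""
    n = len(words)
    pi, back = {}, {}

    # Base case: spans of length one, pi[i, i, A] = P(A -> w_i) for lexical rules.
    for i, w in enumerate(words, start=1):
        for lhs, expansions in grammar.items():
            for rhs, p in expansions:
                if rhs == (w,):
                    pi[i, i, lhs] = max(pi.get((i, i, lhs), 0.0), p)

    # Recursive case: longer spans, maximizing over rules A -> B C and split points k.
    for span in range(2, n + 1):
        for i in range(1, n - span + 2):
            j = i + span - 1
            for lhs, expansions in grammar.items():
                for rhs, p in expansions:
                    if len(rhs) != 2:
                        continue
                    B, C = rhs
                    for k in range(i, j):
                        prob = p * pi.get((i, k, B), 0.0) * pi.get((k + 1, j, C), 0.0)
                        if prob > pi.get((i, j, lhs), 0.0):
                            pi[i, j, lhs] = prob
                            back[i, j, lhs] = (k, B, C)   # bookkeeping for the backtrace

    return pi.get((1, n, start), 0.0), back

best, back = cyk_parse("astronomers saw stars with ears".split(), pcfg)
print(best)  # ~0.0009072: probability of the most likely parse under the toy grammar
```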

SLIDE 15

Three Basic Problems for PCFGs

  • What is the probability of a sentence w1m according to a grammar G: P(w1m | G)?
  • What is the most likely parse t* for a sentence: argmax_t P(t | w1m, G)?
  • How can we choose the rule probabilities for the grammar G that maximize the probability of a sentence: argmax_G P(w1m | G)?
  • These are similar to the three problems of Hidden Markov Models

Training the PCFG

SLIDE 16

The Inside-Outside Algorithm (1/2)

  • A generalization of the forward-backward algorithm of HMMs (Baker, 1979; Young, 1990)
  • A dynamic programming technique used to efficiently compute PCFG probabilities
    – Inside and outside probabilities in a PCFG

SLIDE 17

The Inside-Outside Algorithm (2/2)

  • Definitions
    – Inside probability
      • The total probability of generating the words wp…wq given that one starts off with the nonterminal Njpq:

      $$\beta_j(p,q) = P\big(w_{pq} \mid N^j_{pq},\, G\big)$$

    – Outside probability
      • The total probability of beginning with the start symbol N1 and generating the nonterminal Njpq and all the words outside wp…wq:

      $$\alpha_j(p,q) = P\big(w_{1(p-1)},\, N^j_{pq},\, w_{(q+1)m} \mid G\big)$$

SLIDE 18

Problem 1: The Probability of a Sentence (1/7)

  • A PCFG in Chomsky Normal Form is used here
  • The total probability of a sentence, expressed by the inside algorithm:

    $$P\big(w_{1m} \mid G\big) = P\big(N^1 \stackrel{*}{\Rightarrow} w_{1m} \mid G\big) = P\big(w_{1m} \mid N^1_{1m},\, G\big) = \beta_1(1, m)$$

  • The probability of the base case (word-span = 1):

    $$\beta_j(k,k) = P\big(w_k \mid N^j_{kk},\, G\big) = P\big(N^j \rightarrow w_k \mid G\big)$$

  • Find the probabilities β_j(p,q) for word-span > 1 by induction (or by recursion)

SLIDE 19

Problem 1: The Probability of a Sentence (2/7)

  • Find the probabilities β_j(p,q) by induction
    – A bottom-up version of the calculation: for all p, q with 1 ≤ p < q ≤ m,

    $$
    \begin{aligned}
    \beta_j(p,q) &= P\big(w_{pq} \mid N^j_{pq},\, G\big)\\
    &= \sum_{r,s}\sum_{d=p}^{q-1} P\big(w_{pd},\, N^r_{pd},\, w_{(d+1)q},\, N^s_{(d+1)q} \mid N^j_{pq},\, G\big) && \text{(chain rule, the binary rule)}\\
    &= \sum_{r,s}\sum_{d=p}^{q-1} P\big(N^r_{pd}, N^s_{(d+1)q} \mid N^j_{pq}, G\big)\, P\big(w_{pd} \mid N^r_{pd}, G\big)\, P\big(w_{(d+1)q} \mid N^s_{(d+1)q}, G\big) && \text{(context-free \& ancestor-free assumptions)}\\
    &= \sum_{r,s}\sum_{d=p}^{q-1} P\big(N^j \rightarrow N^r N^s\big)\, \beta_r(p,d)\, \beta_s(d+1,q) && \text{(place-invariance assumption)}
    \end{aligned}
    $$

SLIDE 20

Problem 1: The Probability of a Sentence (3/7)

  • Example (for the input "astronomers saw stars with ears"):

    $$
    \begin{aligned}
    \beta_{VP}(2,5) &= P\big(VP \rightarrow V\ NP\big)\, \beta_{V}(2,2)\, \beta_{NP}(3,5) + P\big(VP \rightarrow VP\ PP\big)\, \beta_{VP}(2,3)\, \beta_{PP}(4,5)\\
    &= 0.7 \times 1.0 \times 0.01296 + 0.3 \times 0.126 \times 0.18 = 0.015876\\[4pt]
    \beta_{S}(1,5) &= P\big(S \rightarrow NP\ VP\big)\, \beta_{NP}(1,1)\, \beta_{VP}(2,5) = 1.0 \times 0.1 \times 0.015876 = 0.0015876
    \end{aligned}
    $$
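Running the illustrative `inside_probs` sketch on the hypothetical toy grammar reproduces these numbers (up to floating-point rounding):

```python
beta = inside_probs("astronomers saw stars with ears".split(), pcfg)
print(beta[2, 5, "VP"])  # ~0.015876
print(beta[1, 5, "S"])   # ~0.0015876 = P(sentence | G)
```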

SLIDE 21

Problem 1: The Probability of a Sentence (4/7)

  • The total probability of a sentence, expressed by the outside algorithm (summing, at any word position k, over the lexical categories N^j that can dominate w_k):

    $$
    \begin{aligned}
    P\big(w_{1m} \mid G\big) &= \sum_j P\big(w_{1(k-1)},\, w_k,\, w_{(k+1)m},\, N^j_{kk} \mid G\big) && \text{(chain rule)}\\
    &= \sum_j P\big(w_{1(k-1)},\, N^j_{kk},\, w_{(k+1)m} \mid G\big)\, P\big(w_k \mid w_{1(k-1)},\, N^j_{kk},\, w_{(k+1)m},\, G\big)\\
    &= \sum_j \alpha_j(k,k)\, P\big(N^j \rightarrow w_k\big) && \text{(context-free \& place-invariance assumptions)}
    \end{aligned}
    $$

  • The probabilities of the base case:

    $$\alpha_1(1,m) = 1, \qquad \alpha_j(1,m) = 0 \ \text{ for } j \neq 1$$

  • Find the probabilities α_j(p,q) by induction

SLIDE 22

Problem 1: The Probability of a Sentence (5/7)

  • Find the probabilities α_j(p,q) by induction
    – A top-down version of the calculation: a node N^j_pq is either the left or the right child of its parent N^f, so (chain rule, context-free & ancestor-free assumptions, then place invariance)

    $$
    \begin{aligned}
    \alpha_j(p,q) &= P\big(w_{1(p-1)},\, N^j_{pq},\, w_{(q+1)m} \mid G\big)\\
    &= \sum_{f,g}\sum_{e=q+1}^{m} P\big(w_{1(p-1)},\, w_{(q+1)m},\, N^f_{pe},\, N^j_{pq},\, N^g_{(q+1)e}\big)
     + \sum_{f,g}\sum_{e=1}^{p-1} P\big(w_{1(p-1)},\, w_{(q+1)m},\, N^f_{eq},\, N^g_{e(p-1)},\, N^j_{pq}\big)\\
    &= \sum_{f,g}\sum_{e=q+1}^{m} \alpha_f(p,e)\, P\big(N^f \rightarrow N^j N^g\big)\, \beta_g(q+1,e)
     \;+\; \sum_{f,g}\sum_{e=1}^{p-1} \alpha_f(e,q)\, P\big(N^f \rightarrow N^g N^j\big)\, \beta_g(e,p-1)
    \end{aligned}
    $$

    (the first term covers the case where N^j_pq is the left child of its parent N^f, the second the case where it is the right child)

SLIDE 23

Problem 1: The Probability of a Sentence (6/7)

  • Explanation (of the first term; the second is symmetric)

    $$
    \begin{aligned}
    P\big(w_{1(p-1)},\, w_{(q+1)m},\, N^f_{pe},\, N^j_{pq},\, N^g_{(q+1)e}\big)
    &= P\big(w_{1(p-1)},\, w_{(e+1)m},\, N^f_{pe}\big)\, P\big(N^j_{pq}, N^g_{(q+1)e} \mid w_{1(p-1)}, w_{(e+1)m}, N^f_{pe}\big)\, P\big(w_{(q+1)e} \mid N^j_{pq}, N^g_{(q+1)e}, \ldots\big)\\
    &= P\big(w_{1(p-1)},\, w_{(e+1)m},\, N^f_{pe}\big)\, P\big(N^j_{pq}, N^g_{(q+1)e} \mid N^f_{pe}\big)\, P\big(w_{(q+1)e} \mid N^g_{(q+1)e}\big) && \text{(context-free \& ancestor-free)}\\
    &= \alpha_f(p,e)\, P\big(N^f \rightarrow N^j N^g\big)\, \beta_g(q+1,e) && \text{(place invariance)}
    \end{aligned}
    $$

SLIDE 24

Problem 1: The Probability of a Sentence (7/7)

  • The product of the inside and outside probabilities:

    $$\alpha_j(p,q)\, \beta_j(p,q) = P\big(w_{1(p-1)},\, N^j_{pq},\, w_{(q+1)m} \mid G\big)\, P\big(w_{pq} \mid N^j_{pq},\, G\big) = P\big(w_{1m},\, N^j_{pq} \mid G\big)$$

  • The probability of the sentence having some constituent spanning from word p to word q:

    $$P\big(w_{1m},\, N_{pq} \mid G\big) = \sum_j \alpha_j(p,q)\, \beta_j(p,q)$$

SLIDE 25

Problem 2: Find the Most Likely Parse (1/2)

  • A Viterbi-style algorithm adapted from the inside algorithm is used to find the most likely parse of a sentence
    – Similar to the CYK algorithm introduced previously
  • Definitions
    – δ_i(p,q): the highest inside probability of a parse of a subtree N^i_pq
    – ψ_i(p,q): stores the backtrace information (j, k, r) for a subtree N^i_pq
    – Different combinations of constituents spanning different word ranges are compared, e.g.

    $$N^i_{pq} \rightarrow N^{j_1}_{p r_1} N^{k_1}_{(r_1+1) q}, \qquad N^i_{pq} \rightarrow N^{j_2}_{p r_2} N^{k_2}_{(r_2+1) q}, \qquad N^i_{pq} \rightarrow N^{j_3}_{p r_3} N^{k_3}_{(r_3+1) q}, \ \ldots$$

    and the optimal setting is stored (j and k are indices for nonterminals, r is a word position)

SLIDE 26

Problem 2: Find the Most Likely Parse (2/2)

  • 1. Initialization

    $$\delta_i(p,p) = P\big(N^i \rightarrow w_p\big)$$

  • 2. Induction

    $$\delta_i(p,q) = \max_{1 \le j,k \le n,\; p \le r < q} P\big(N^i \rightarrow N^j N^k\big)\, \delta_j(p,r)\, \delta_k(r+1,q)$$

    – Store the backtrace (three elements stored):

    $$\psi_i(p,q) = \arg\max_{(j,k,r)} P\big(N^i \rightarrow N^j N^k\big)\, \delta_j(p,r)\, \delta_k(r+1,q)$$

  • 3. Termination

    $$P(\hat t\,) = \delta_1(1,m), \qquad \text{with root node } N^1_{1m}$$

  • Recursively construct the tree nodes of the corresponding tree, the Viterbi parse
    – If X_i = N^i_pq and ψ_i(p,q) = (j, k, r), then

    $$\mathrm{left}(X_i) = N^j_{pr}, \qquad \mathrm{right}(X_i) = N^k_{(r+1)q}$$

  (Diagram: N^i dominating N^j over w_p … w_r and N^k over w_{r+1} … w_q)
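A small sketch of this recursive reconstruction, reusing the backtrace table produced by the hypothetical `cyk_parse` helper shown earlier (its π and back tables play exactly the roles of δ and ψ here):

```python
def build_tree(words, back, i, j, label):
    """Recursively rebuild the Viterbi parse from the backtrace table."""
    if i == j:                       # a lexical node: label -> w_i
        return (label, words[i - 1])
    k, left, right = back[i, j, label]
    return (label,
            build_tree(words, back, i, k, left),
            build_tree(words, back, k + 1, j, right))

words = "astronomers saw stars with ears".split()
_, back = cyk_parse(words, pcfg)
print(build_tree(words, back, 1, len(words), "S"))
```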

SLIDE 27

Problem 3: Training a PCFG (1/7)

  • If a parsed training corpus is available
    – Directly calculate the probabilities of rules via Maximum Likelihood Estimation (MLE):

    $$\hat P\big(N^j \rightarrow \zeta\big) = \frac{C\big(N^j \rightarrow \zeta\big)}{\sum_{\gamma} C\big(N^j \rightarrow \gamma\big)}$$

    where C(N^j → ζ) is the count of the number of times a particular rule is used and P̂ is the new probability of the rule
    – But, more commonly, a parsed training corpus is not available (or a sentence may have many parses)
      • A hidden data problem!
      • We wish to determine a probability function on rules, but can only directly see the probabilities of sentences
SLIDE 28

Problem 3: Training a PCFG (2/7)

  • If a parsed training corpus is not available
    – An iterative algorithm is used to determine improving estimates of the probability of the corpus W (Maximum Likelihood Estimation):

    $$P\big(W \mid G_{i+1}\big) \;\ge\; P\big(W \mid G_i\big)\,?$$

    – The algorithm starts with a certain grammar topology
      • The number of terminals and nonterminals (determined in advance)
      • The initial probability estimates for the rules (randomly chosen)
    – According to this grammar
      • The probability of each parse of a training sentence is accumulated
      • The probability of each rule being used in each place is accumulated as an expectation of how often each rule is used

SLIDE 29

Problem 3: Training a PCFG (3/7)

  • If a parsed training corpus is not available
    – Refine the probability estimates of the rules with regard to the expectations obtained previously
      • The likelihood of the training corpus given the grammar is increased: P(W | G_{i+1}) ≥ P(W | G_i)
    – Consider

    $$\alpha_j(p,q)\, \beta_j(p,q) = P\big(N^1 \stackrel{*}{\Rightarrow} w_{1m},\; N^j \stackrel{*}{\Rightarrow} w_{pq} \mid G\big)$$

      • P(N^1 ⇒* w_1m | G) = β_1(1,m), the probability of all possible parses, is calculated previously and is set as π, so

    $$P\big(N^j \stackrel{*}{\Rightarrow} w_{pq} \mid N^1 \stackrel{*}{\Rightarrow} w_{1m},\, G\big) = \frac{\alpha_j(p,q)\, \beta_j(p,q)}{\pi}$$

    – The estimate for how many times N^j is used (summing over all regions of words that the node could dominate in a sentence):

    $$E\big(N^j \text{ is used in the derivation}\big) = \sum_{p=1}^{m}\sum_{q=p}^{m} \frac{\alpha_j(p,q)\, \beta_j(p,q)}{\pi}$$

SLIDE 30

Problem 3: Training a PCFG (4/7)

  • If a parsed training corpus is not available
    – The estimate for how many times N^j → N^r N^s is used:

    $$E\big(N^j \rightarrow N^r N^s \text{ used}\big) = \sum_{p=1}^{m-1}\sum_{q=p+1}^{m}\sum_{d=p}^{q-1} \frac{\alpha_j(p,q)\, P\big(N^j \rightarrow N^r N^s\big)\, \beta_r(p,d)\, \beta_s(d+1,q)}{\pi}$$

    – The new probability for N^j → N^r N^s will be:

    $$\hat P\big(N^j \rightarrow N^r N^s\big) = \frac{E\big(N^j \rightarrow N^r N^s \text{ used}\big)}{E\big(N^j \text{ used}\big)} = \frac{\sum_{p=1}^{m-1}\sum_{q=p+1}^{m}\sum_{d=p}^{q-1} \alpha_j(p,q)\, P\big(N^j \rightarrow N^r N^s\big)\, \beta_r(p,d)\, \beta_s(d+1,q)}{\sum_{p=1}^{m}\sum_{q=p}^{m} \alpha_j(p,q)\, \beta_j(p,q)}$$

  The training formulas for a single sentence
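A literal sketch of these two single-sentence formulas for one binary rule, assuming the illustrative `inside_probs` and `outside_probs` helpers and the toy grammar from the earlier sketches; one EM iteration would apply this to every binary rule (and the analogous lexical formula on the next slide to every unary rule):

```python
def reestimate_binary(words, grammar, j, r, s):
    """One-sentence re-estimate of P(N^j -> N^r N^s) from the inside and outside tables."""
    m = len(words)
    beta = inside_probs(words, grammar)
    alpha = outside_probs(words, grammar, beta)
    rule_prob = dict(grammar[j]).get((r, s), 0.0)   # current P(N^j -> N^r N^s)

    # Expected number of times N^j -> N^r N^s is used (the 1/pi factors cancel in the ratio).
    used = sum(alpha.get((p, q, j), 0.0) * rule_prob
               * beta.get((p, d, r), 0.0) * beta.get((d + 1, q, s), 0.0)
               for p in range(1, m)
               for q in range(p + 1, m + 1)
               for d in range(p, q))

    # Expected number of times N^j is used at all.
    total = sum(alpha.get((p, q, j), 0.0) * beta.get((p, q, j), 0.0)
                for p in range(1, m + 1)
                for q in range(p, m + 1))
    return used / total if total else 0.0

print(reestimate_binary("astronomers saw stars with ears".split(), pcfg, "VP", "V", "NP"))
```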

SLIDE 31

Problem 3: Training a PCFG (5/7)

  • If a parsed training corpus is not available
    – The estimate for how many times N^j → w^k is used:

    $$E\big(N^j \rightarrow w^k \text{ used}\big) = \sum_{h=1}^{m} \frac{\alpha_j(h,h)\, P\big(N^j \rightarrow w_h,\, w_h = w^k\big)}{\pi} = \sum_{h=1}^{m} \frac{\alpha_j(h,h)\, \beta_j(h,h)\, P\big(w_h = w^k\big)}{\pi}$$

    where P(w_h = w^k) acts like an indicator function
    – The new probability for N^j → w^k will be:

    $$\hat P\big(N^j \rightarrow w^k\big) = \frac{\sum_{h=1}^{m} \alpha_j(h,h)\, \beta_j(h,h)\, P\big(w_h = w^k\big)}{\sum_{p=1}^{m}\sum_{q=p}^{m} \alpha_j(p,q)\, \beta_j(p,q)}$$

  The training formulas for a single sentence

SLIDE 32

Problem 3: Training a PCFG (6/7)

  • If a parsed training corpus is not available
    – Assume the sentences in the corpus are independent
    – The likelihood of the corpus is then just the product of the probabilities of the sentences in it according to the grammar, where

    $$W = (W_1, \ldots, W_\omega), \qquad W_i = w_{i,1} \ldots w_{i,m_i}$$

    – Define common subterms for the training sentences (statistics for training using all sentences; the inside and outside probabilities below are computed on sentence W_i):

    $$f_i(p,q,j,r,s) = \frac{\sum_{d=p}^{q-1} \alpha_j(p,q)\, P\big(N^j \rightarrow N^r N^s\big)\, \beta_r(p,d)\, \beta_s(d+1,q)}{P\big(N^1 \stackrel{*}{\Rightarrow} W_i \mid G\big)} \qquad \text{(nonterminal at a branching node, using the rule } N^j \rightarrow N^r N^s\text{)}$$

    $$h_i(p,q,j) = \frac{\alpha_j(p,q)\, \beta_j(p,q)}{P\big(N^1 \stackrel{*}{\Rightarrow} W_i \mid G\big)} \qquad \text{(nonterminal anywhere)}$$

    $$g_i(h,j,k) = \frac{\alpha_j(h,h)\, P\big(w_h = w^k\big)\, \beta_j(h,h)}{P\big(N^1 \stackrel{*}{\Rightarrow} W_i \mid G\big)} \qquad \text{(nonterminal at a preterminal node, using the rule } N^j \rightarrow w^k\text{)}$$

SLIDE 33

Problem 3: Training a PCFG (7/7)

  • If a parsed training corpus is not available: the training formulas using all sentences
    – The new probability for N^j → N^r N^s will be:

    $$\hat P\big(N^j \rightarrow N^r N^s\big) = \frac{\sum_{i=1}^{\omega} \sum_{p=1}^{m_i - 1} \sum_{q=p+1}^{m_i} f_i(p,q,j,r,s)}{\sum_{i=1}^{\omega} \sum_{p=1}^{m_i} \sum_{q=p}^{m_i} h_i(p,q,j)}$$

    – The new probability for N^j → w^k will be:

    $$\hat P\big(N^j \rightarrow w^k\big) = \frac{\sum_{i=1}^{\omega} \sum_{h=1}^{m_i} g_i(h,j,k)}{\sum_{i=1}^{\omega} \sum_{p=1}^{m_i} \sum_{q=p}^{m_i} h_i(p,q,j)}$$

    (ω: total number of training sentences; m_i: word length of training sentence i)

SLIDE 34

Problems with the Inside-Outside Algorithm

  • The whole training procedure is slow: O(m³n³) for each iteration
    – m: the length of the sentence
    – n: the number of nonterminals
  • Local maxima are much more of a problem
  • Satisfactory learning requires many more nonterminals than are theoretically needed to describe the language at hand
  • There is no guarantee that the nonterminals learned will have any satisfactory resemblance to the kinds of nonterminals normally motivated in linguistic analysis

SLIDE 35

Problems with PCFGs (1/6)

  • The problems with PCFGs come from the fundamental independence assumptions
    – Structural independence: the expansion of any one non-terminal is independent of any other non-terminal
      • Each rule is independent of each other rule
      • But the choice of how a node expands does depend on the location of the node in the parse tree, e.g. NP → Pronoun vs. NP → Det Noun
        – Is the NP a subject in the sentence (talking about the topic or old information) or an object (introducing new referents)?
        – Switchboard (declarative sentences): 91% of subjects are pronouns (9% lexical nouns), while 66% of objects are lexical nouns (34% pronouns)
        – "She is able to take her baby to work with her." / "All the people signed confessions."

SLIDE 36

Problems with PCFGs (2/6)

  • The problems with PCFGs come from their fundamental independence assumptions (cont.)
    – Lexical independence: PCFGs' lack of sensitivity to words
      • Lexical information in PCFGs can only be represented via the probability of pre-terminal nodes (Verb, Noun, Det) being expanded lexically
      • But lexical information plays an important role in selecting the correct parse, e.g. the ambiguous prepositional phrase attachment in "Moscow sent more than 100,000 soldiers into Afghanistan": NP → NP PP (NP attachment) or VP → VP PP (VP attachment)
SLIDE 37

Problems with PCFGs (3/6)

  – Lexical independence (cont.)
    • Attachment ambiguities
      – Hindle and Rooth (13M words from the AP newswire, 1991): 67% NP-attachment vs. 33% VP-attachment
      – Collins (WSJ and IBM computer manuals, 1999): 52% NP-attachment
    • Coordination ambiguities
      – E.g. "dogs in houses and cats"
  A model keeping separate lexical dependency statistics for different verbs would be helpful for disambiguating these attachment problems!

SLIDE 38

Problems with PCFGs (4/6)

  – Lexical independence (cont.)
    • Attachment ambiguities: "Moscow sent more than 100,000 soldiers into Afghanistan"
  (Figure: two parse trees of the VP "sent more than 100,000 soldiers into Afghanistan". NP attachment: the PP "into Afghanistan" attaches to the NP "more than 100,000 soldiers". VP attachment: the PP attaches to the VP headed by "sent".)

SLIDE 39

Problems with PCFGs (5/6)

  – Lexical independence (cont.)
    • Coordination ambiguities
SLIDE 40

Structural Dependency: More Examples

Pronouns, proper names, and definite NPs appear more commonly in subject position; NPs containing post-head modifiers and bare nouns occur more commonly in object position.
SLIDE 41

Lexical Dependency : More Examples

  – We should include more information about what the actual words in the sentence are when making decisions about the structure of the parse tree
    • Lexical dependencies between words
SLIDE 42

Problems with PCFGs (6/6)

  • Upshot
    – We should build a much better probabilistic parser by taking into account lexical and structural context
      • Structural dependency
      • Lexical dependency
  • Challenge
    – How to find factors that give us a lot of extra discrimination while not defeating us with a multiplicity of parameters (or the sparse data problem)

SLIDE 43

Probabilistic Lexicalized CFGs (1/8)

  • The syntactic constituents are associated with a lexical head (Black et al., 1992)
    – Each non-terminal in a parse tree is annotated with a single word which is its lexical head (the head of that constituent)
    – Each rule is augmented to identify one right-hand-side constituent to be the head daughter
    – But how to choose the head is controversial!

SLIDE 44

Probabilistic Lexicalized CFGs (2/8)

  • How to select a head for a constituent?
    – E.g. finding the head of an NP
      • Return the very last word if it is tagged POS (possessive)
      • Else search from right to left for the first child that is an NN, NNP, etc.
      • Else search from left to right for the first child that is an NP
    – Example: NP → NP PP
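A minimal sketch of this kind of head-finding heuristic for NPs. The tag set, function name, and fallback are illustrative assumptions loosely following the rules above, not the exact head table of any particular parser:

```python
def np_head(children):
    """children: list of (tag, word) pairs for the NP's children, left to right."""
    # Rule 1: the very last child, if it is tagged POS (a possessive marker).
    if children[-1][0] == "POS":
        return children[-1]
    # Rule 2: search right to left for the first nominal child (NN, NNS, NNP, ...).
    for tag, word in reversed(children):
        if tag.startswith("NN"):
            return (tag, word)
    # Rule 3: search left to right for the first NP child.
    for tag, word in children:
        if tag == "NP":
            return (tag, word)
    return children[-1]    # fall back to the last child

print(np_head([("DT", "the"), ("NN", "sack"), ("PP", "into the bin")]))  # ('NN', 'sack')
```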

SLIDE 45

Probabilistic Lexicalized CFGs (3/8)

  • A simple way to think of a lexicalized grammar
    – E.g. creating many copies of each rule, one copy for each possible head word for each constituent:
      VP(dumped) → VBD(dumped) NP(sacks) PP(into)   [3×10⁻¹⁰]
      VP(dumped) → VBD(dumped) NP(cats) PP(into)    [8×10⁻¹¹]
      VP(dumped) → VBD(dumped) NP(hats) PP(into)    [4×10⁻¹⁰]
      VP(dumped) → VBD(dumped) NP(sacks) PP(above)  [1×10⁻¹²]
      …
    – Problem
      • No corpus is big enough to train such probabilities
      • We should make some simplifying independence assumptions in order to cluster some of the counts

SLIDE 46

Probabilistic Lexicalized CFGs (4/8)

  • Example
  (Figure: two lexicalized parse trees of the same sentence, one marked incorrect and one marked correct)

SLIDE 47

Probabilistic Lexicalized CFGs (5/8)

  • Take Charniak's parser (1997) as an example
    – It incorporates lexical dependency information by relating the heads of phrases to the heads of their constituents
    – Recall the vanilla PCFG rule probability:

    $$P\big(r(n) \mid n\big), \qquad \text{where } n \text{ is the syntactic category of a parse-tree node}$$

    – The head-rule probability of the probabilistic lexicalized CFG conditions on the headword as well:

    $$P\big(r(n) \mid n,\, h(n)\big), \qquad \text{where } h(n) \text{ is the headword of the parse-tree node}$$

    – E.g. for the rule VP → VBD NP PP, this distinguishes P(r | VP, dumped) from P(r | VP, slept)

SLIDE 48

Probabilistic Lexicalized CFGs (6/8)

  – Further, decide the probability of a head
    • Null assumption: all heads are equally likely (i.e. use only the prior probability of the head words)
      – The probability that the head of a node is "sacks" would be the same as the probability that the head is "racks"
      – This doesn't seem very useful
    • Instead, condition the probability of the head h of node n on two factors: the syntactic category of the node n, and the head of the node's mother, h(m(n))

    $$P\big(h(n) = \mathit{word}_i \mid n,\, h(m(n))\big)$$

    • E.g. P(head(n) = sacks | n = NP, h(m(n)) = dumped), for an NP daughter (headed by "sacks"?) of a node X headed by "dumped"
SLIDE 49

Probabilistic Lexicalized CFGs (7/8)

  – The probability of a parse T of a sentence S is a product, over the nodes n of T, of a head-rule probability and a head-head probability:

    $$P(T, S) = \prod_{n \in T} P\big(r(n) \mid n,\, h(n)\big) \times P\big(h(n) \mid n,\, h(m(n))\big)$$

  – Counting from the Brown corpus, for example:

    $$P\big(VP \rightarrow VBD\ NP\ PP \mid VP, dumped\big) = \frac{C\big(VP(dumped) \rightarrow VBD\ NP\ PP\big)}{\sum_{\beta} C\big(VP(dumped) \rightarrow \beta\big)} = \frac{6}{9} = 0.67$$

    $$P\big(VP \rightarrow VBD\ NP \mid VP, dumped\big) = \frac{C\big(VP(dumped) \rightarrow VBD\ NP\big)}{\sum_{\beta} C\big(VP(dumped) \rightarrow \beta\big)}$$

    $$P\big(into \mid PP, dumped\big) = \frac{C\big(X(dumped) \rightarrow \ldots PP(into) \ldots\big)}{\sum C\big(X(dumped) \rightarrow \ldots PP \ldots\big)} = \frac{2}{9} = 0.22$$

    $$P\big(into \mid PP, sacks\big) = \frac{C\big(X(sacks) \rightarrow \ldots PP(into) \ldots\big)}{\sum C\big(X(sacks) \rightarrow \ldots PP \ldots\big)} \;\Rightarrow\; \text{smoothing or backoff can be applied}$$

SLIDE 50

Probabilistic Lexicalized CFGs (8/8)

  – The original version of Charniak's parser adds additional conditioning factors
    • The rule-expansion probability also depends on the node's grandparent (a trigram, i.e. second-order Markov, dependency)
    • Various backoff and smoothing algorithms are used
SLIDE 51

Dependency Grammars (1/2)

  • The grammar formalism is based purely on lexical dependency information
    – The syntactic structure of a sentence is described purely in terms of words and binary semantic or syntactic relations between words
    – Constituents and phrase structures do not play any fundamental role

SLIDE 52

Dependency Grammars (2/2)

  • One of the main advantages of dependency grammars is their ability to handle languages with relatively free word order
    – They abstract away from word-order variation, representing only the information that is necessary for the parse
  • Examples
    – Link Grammar
    – Constraint Grammar

SLIDE 53

Categorial Grammars (1/2)

  • A combinatory categorial grammar has two components
    – The categorial lexicon
      • Associates each word with a syntactic and semantic category
      • Two kinds of categories
        – Arguments: e.g. Ns
        – Functors: e.g. verbs, determiners
    – The combination rules
      • Allow functors and arguments to be combined, e.g.
        – X/Y: something that combines with a Y on its right to produce an X
        – X\Y: something that combines with a Y on its left to produce an X

SLIDE 54

Categorial Grammars (2/2)

  • Examples
    – Determiners receive the category NP/N
    – Transitive verbs might have the category VP/NP
    – Ditransitive verbs might have the category (VP/NP)/NP
  (Derivation: "Harry eats apples", with Harry = NP, eats = a V with category VP/NP (where VP = S\NP), apples = NP)

SLIDE 55

Evaluating Parsers (1/2)

  • Labeled recall
    = (# of correct constituents in the candidate parse of a sentence s) / (# of correct constituents in the treebank parse of s)
  • Labeled precision
    = (# of correct constituents in the candidate parse of a sentence s) / (# of total constituents in the candidate parse of s)
    – A correct constituent must have the same starting point, ending point, and non-terminal symbol as the "gold standard" of the treebank
  • Cross-brackets
    – The number of crossing brackets, e.g. a cross-bracket between ((A B) C) and (A (B C))
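A small sketch of the precision/recall part of these metrics, with constituents represented as (start, end, label) triples (this representation is an assumption of the sketch, not the slides'):

```python
def labeled_prf(candidate, treebank):
    """candidate, treebank: sets of (start, end, label) constituent triples for one sentence."""
    correct = len(candidate & treebank)        # same span and same non-terminal symbol
    precision = correct / len(candidate)
    recall = correct / len(treebank)
    return precision, recall

cand = {(1, 5, "S"), (1, 1, "NP"), (2, 5, "VP"), (3, 5, "NP"), (4, 5, "PP")}
gold = {(1, 5, "S"), (1, 1, "NP"), (2, 5, "VP"), (2, 3, "VP"), (4, 5, "PP")}
print(labeled_prf(cand, gold))   # (0.8, 0.8)
```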

SLIDE 56

Evaluating Parsers (2/2)

  • Examples
    – Using a portion of the Wall Street Journal as the test set, parsers such as Charniak (1997) and Collins (1999) achieve
      • Just under 90% recall and just under 90% precision
      • About 1% cross-bracketed constituents per sentence