SLIDE 1

Probabilistic Context-Free Grammars

References:

1. Speech and Language Processing, chapter 12
2. Foundations of Statistical Natural Language Processing, chapters 11, 12

Berlin Chen

Graduate Institute of Computer Science & Information Engineering National Taiwan Normal University

SLIDE 2

Parsing for Disambiguation (1/2)

  • At least three ways to use probabilities in a parser
    – Probabilities for choosing between parses
      • Choose, from among the many parses of the input sentence, the ones that are most likely
    – Probabilities for speedier parsing
      • Use probabilities to order or prune the search space of a parser so that the best parse is found more quickly
    – Probabilities for determining the sentence
      • Use a parser as an augmented language model over a word lattice in order to determine the sequence of words that has the highest probability

Parsing as Search

SLIDE 3

Parsing for Disambiguation (2/2)

  • The integration of sophisticated structural and probabilistic models of syntax is at the very cutting edge of the field
    – For non-probabilistic syntax analysis
      • The context-free grammar (CFG) is the standard
    – For probabilistic syntax analysis
      • No single model has become a standard
      • There are a number of probabilistic augmentations to context-free grammars
        – Probabilistic CFG with the CYK algorithm
        – Probabilistic lexicalized CFG
        – Dependency grammars
        – ...

SLIDE 4

Definition of the PCFG

  • A PCFG G has five parameters
    1. A set of non-terminal symbols (or "variables") N
    2. A set of terminal symbols Σ (disjoint from N)
    3. A set of productions P, each of the form A→β, where A is a non-terminal symbol and β is a string of symbols drawn from the infinite set of strings (Σ∪N)*
    4. A designated start symbol S (or N1)
    5. A function D that augments each rule in P with a conditional probability: A→β [prob.]
  • A PCFG is thus G = (N, Σ, P, S, D)
  • The rule probability is written P(A→β) or P(A→β|A), and for every non-terminal the probabilities of its expansions sum to one (Booth, 1969):

    $$\forall A:\ \sum_{\beta} P(A \rightarrow \beta) = 1$$

  (The terminals Σ are words; the non-terminals N are syntactic categories and lexical categories.)
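To make the definition concrete, here is a minimal sketch (not from the slides) of how such a grammar could be stored and validated in Python. The toy rules and probabilities below are an illustrative assumption in the spirit of the textbook example, not necessarily the grammar on the "An Example Grammar" slide (whose table did not survive extraction); the same dictionary is reused by the later sketches.

```python
# A PCFG as a mapping: LHS non-terminal -> list of (RHS tuple, probability).
# Toy fragment used only for illustration in the sketches that follow.
pcfg = {
    "S":  [(("NP", "VP"), 1.0)],
    "VP": [(("V", "NP"), 0.7), (("VP", "PP"), 0.3)],
    "NP": [(("NP", "PP"), 0.4), (("astronomers",), 0.1), (("ears",), 0.18),
           (("saw",), 0.04), (("stars",), 0.18), (("telescopes",), 0.1)],
    "PP": [(("P", "NP"), 1.0)],
    "V":  [(("saw",), 1.0)],
    "P":  [(("with",), 1.0)],
}

def check_pcfg(grammar, tol=1e-9):
    """Verify the PCFG constraint: for every A, the probabilities of A's expansions sum to 1."""
    for lhs, expansions in grammar.items():
        total = sum(p for _, p in expansions)
        assert abs(total - 1.0) < tol, f"probabilities for {lhs} sum to {total}"

check_pcfg(pcfg)
```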

SLIDE 5

An Example Grammar

SLIDE 6

Parse Trees (1/2)

  • Input: astronomers saw stars with ears
    – An instance of PP-attachment ambiguity
  • The probability of a particular parse is defined as the product of the probabilities of all the rules used to expand each node in the parse tree
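As a quick worked illustration of this definition (a sketch that assumes the hypothetical toy grammar `pcfg` defined earlier), the probability of one parse of the input is simply the product of its rule probabilities:

```python
# One parse of "astronomers saw stars with ears" (the reading where the PP attaches to the NP),
# written as the list of rules it uses; the probabilities come from the toy grammar above.
parse_rules = [
    ("S",  ("NP", "VP")),      # 1.0
    ("NP", ("astronomers",)),  # 0.1
    ("VP", ("V", "NP")),       # 0.7
    ("V",  ("saw",)),          # 1.0
    ("NP", ("NP", "PP")),      # 0.4
    ("NP", ("stars",)),        # 0.18
    ("PP", ("P", "NP")),       # 1.0
    ("P",  ("with",)),         # 1.0
    ("NP", ("ears",)),         # 0.18
]

prob = 1.0
for lhs, rhs in parse_rules:
    prob *= dict(pcfg[lhs])[rhs]   # look up P(lhs -> rhs)
print(prob)  # ~0.0009072 for this reading
```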

SLIDE 7

Parse Trees (2/2)

  • Input: dogs in houses and cats
    – An instance of coordination ambiguity
  • Which one is correct?
  • However, the PCFG will assign identical probabilities to the two parses

SLIDE 8

Basic Assumptions (1/2)

  • Place Invariance

– The probability of a subtree does not depend on where in the string the words it dominates are

  • Context free

– The probability of a subtree does not depend on words not dominated by the subtree

  • Ancestor free

– The probability of a subtree does not depend on nodes in the derivation outside the subtree

  Formally (with subscripts denoting word positions in the input string, so that N^j_{k(k+c)} dominates the c+1 words w_k ... w_{k+c} of w_1 ... w_n):

    $$\text{Place invariance:}\quad \forall k\;\; P\big(N^j_{k(k+c)} \rightarrow \zeta\big) = P\big(N^j \rightarrow \zeta\big)$$

    $$\text{Context-free:}\quad P\big(N^j_{kl} \rightarrow \zeta \mid \text{anything outside } w_k \ldots w_l\big) = P\big(N^j_{kl} \rightarrow \zeta\big)$$

    $$\text{Ancestor-free:}\quad P\big(N^j_{kl} \rightarrow \zeta \mid \text{any ancestor nodes outside } N^j_{kl}\big) = P\big(N^j_{kl} \rightarrow \zeta\big)$$

SLIDE 9

Basic Assumptions (2/2)

  • Example: the probability of a parse tree is first factored with the chain rule, then simplified using the context-free & ancestor-free assumptions and finally the place-invariance assumption (the worked derivation appears on the slide).

SLIDE 10

Some Features of PCFGs

  • PCFGs give some idea (probabilities) of the plausibility of different parses
    – But the probability estimates are based purely on structural factors and not lexical factors
  • PCFGs are good for grammar induction
    – A PCFG can be learned from data, e.g. from bracketed (labeled) corpora
  • PCFGs are robust
    – They tackle grammatical mistakes, disfluencies and errors by ruling out nothing in the grammar, but by just giving implausible sentences a lower probability

SLIDE 11

Chomsky Normal Form

  • Chomsky Normal Form (CNF) grammars only have unary and binary rules of the form

    $$N^j \rightarrow N^r N^s \qquad\qquad N^j \rightarrow w^k$$

  • The parameters of a PCFG in CNF are

    $$P\big(N^i \rightarrow N^r N^s \mid G\big) \qquad \text{(for syntactic categories: an } n^3 \text{ matrix of parameters, with } n \text{ nonterminals)}$$

    $$P\big(N^i \rightarrow w^k \mid G\big) \qquad \text{(for lexical categories: an } nV \text{ matrix of parameters, with } n \text{ nonterminals and } V \text{ terminals)}$$

    – That is, n³+nV parameters in all, subject to

    $$\sum_{r,s} P\big(N^i \rightarrow N^r N^s\big) + \sum_{k} P\big(N^i \rightarrow w^k\big) = 1$$

  • Any CFG can be represented by a weakly equivalent CFG in CNF
    – "weakly equivalent": generating the same language
      • But it does not assign the same phrase structure to each sentence

SLIDE 12

CYK Algorithm (1/3)

  • CYK (Cocke-Younger-Kasami) algorithm (Collins, 1999; Ney, 1991)
    – A bottom-up parser using a dynamic programming table
    – Assumes the PCFG is in Chomsky normal form (CNF)
  • Definitions
    – w1…wn: an input string composed of n words
    – wij: the string of words from word i to word j
    – π[i, j, a]: a table entry that holds the maximum probability for a constituent with non-terminal index a spanning words wi…wj
  (Diagram: non-terminal N^a dominating the span wi … wj within w1 … wn)

SLIDE 13

CYK Algorithm (2/3)

  • Fill out the table entries by induction
    – Base case
      • Consider input strings of length one (i.e. each individual word wi)
      • Since the grammar is in CNF, A ⇒* wi iff there is a rule A → wi (A must be a lexical category), so

      $$\pi[i, i, A] = P\big(A \rightarrow w_i\big)$$

    – Recursive case
      • For strings of words of length > 1, A ⇒* wij iff there is at least one rule A → B C and some split point k (i ≤ k < j) such that B derives the first k−i+1 symbols (wi … wk) and C derives the last j−k symbols (wk+1 … wj); here A must be a syntactic category
      • Compute the probability by multiplying together the probabilities of these two pieces (B and C; notice that they have already been calculated in the recursion), and choose the maximum among all possibilities:

      $$\pi[i, j, A] = \max_{A \rightarrow B\,C,\;\, i \le k < j} P\big(A \rightarrow B\,C\big)\; \pi[i, k, B]\; \pi[k+1, j, C]$$

SLIDE 14

CYK Algorithm (3/3)

  (Pseudocode figure: finding the most likely parse of an m-word input string with n non-terminals. The π table is initialized to zero, filled in order of increasing word-span with backtrace bookkeeping for each entry, and the answer is read off at the start symbol spanning the whole string. Complexity: O(m³n³).)
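Since the pseudocode itself did not survive extraction, the following is a minimal sketch of the procedure just described, assuming the hypothetical toy grammar `pcfg` from the earlier sketch (binary syntactic rules plus unary lexical rules); the function and variable names are illustrative, not the slides'.

```python
def cyk_parse(words, grammar, start="S"):
    """Viterbi CYK: pi[(i, j, A)] = max probability of A spanning words i..j (1-based)."""
    n = len(words)
    pi, back = {}, {}

    # Base case: spans of length one, pi[i, i, A] = P(A -> w_i) for lexical rules.
    for i, w in enumerate(words, start=1):
        for lhs, expansions in grammar.items():
            for rhs, p in expansions:
                if rhs == (w,):
                    pi[i, i, lhs] = max(pi.get((i, i, lhs), 0.0), p)

    # Recursive case: longer spans, maximizing over rules A -> B C and split points k.
    for span in range(2, n + 1):
        for i in range(1, n - span + 2):
            j = i + span - 1
            for lhs, expansions in grammar.items():
                for rhs, p in expansions:
                    if len(rhs) != 2:
                        continue
                    B, C = rhs
                    for k in range(i, j):
                        prob = p * pi.get((i, k, B), 0.0) * pi.get((k + 1, j, C), 0.0)
                        if prob > pi.get((i, j, lhs), 0.0):
                            pi[i, j, lhs] = prob
                            back[i, j, lhs] = (k, B, C)   # bookkeeping for the backtrace

    return pi.get((1, n, start), 0.0), back

best, back = cyk_parse("astronomers saw stars with ears".split(), pcfg)
print(best)  # ~0.0009072: probability of the most likely parse under the toy grammar
```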

SLIDE 15

Three Basic Problems for PCFGs

  • What is the probability of a sentence w1m according to a grammar G: P(w1m | G)?
  • What is the most likely parse t* for a sentence: argmax_t P(t | w1m, G)?
  • How can we choose the rule probabilities for the grammar G that maximize the probability of a sentence: argmax_G P(w1m | G)?
  • These are similar to the three problems of Hidden Markov Models

Training the PCFG

SLIDE 16

The Inside-Outside Algorithm (1/2)

  • A generalization of the forward-backward algorithm of HMMs (Baker, 1979; Young, 1990)
  • A dynamic programming technique used to efficiently compute PCFG probabilities
    – Inside and outside probabilities in a PCFG

SLIDE 17

The Inside-Outside Algorithm (2/2)

  • Definitions
    – Inside probability
      • The total probability of generating the words wp…wq given that one starts off with the nonterminal Njpq:

      $$\beta_j(p,q) = P\big(w_{pq} \mid N^j_{pq},\, G\big)$$

    – Outside probability
      • The total probability of beginning with the start symbol N1 and generating the nonterminal Njpq and all the words outside wp…wq:

      $$\alpha_j(p,q) = P\big(w_{1(p-1)},\, N^j_{pq},\, w_{(q+1)m} \mid G\big)$$

SLIDE 18

Problem 1: The Probability of a Sentence (1/7)

  • A PCFG in Chomsky Normal Form is used here
  • The total probability of a sentence, expressed by the inside algorithm:

    $$P\big(w_{1m} \mid G\big) = P\big(N^1 \stackrel{*}{\Rightarrow} w_{1m} \mid G\big) = P\big(w_{1m} \mid N^1_{1m},\, G\big) = \beta_1(1, m)$$

  • The probability of the base case (word-span = 1):

    $$\beta_j(k,k) = P\big(w_k \mid N^j_{kk},\, G\big) = P\big(N^j \rightarrow w_k \mid G\big)$$

  • Find the probabilities β_j(p,q) for word-span > 1 by induction (or by recursion)

SLIDE 19

Problem 1: The Probability of a Sentence (2/7)

  • Find the probabilities β_j(p,q) by induction
    – A bottom-up version of the calculation: for all p, q with 1 ≤ p < q ≤ m,

    $$
    \begin{aligned}
    \beta_j(p,q) &= P\big(w_{pq} \mid N^j_{pq},\, G\big)\\
    &= \sum_{r,s}\sum_{d=p}^{q-1} P\big(w_{pd},\, N^r_{pd},\, w_{(d+1)q},\, N^s_{(d+1)q} \mid N^j_{pq},\, G\big) && \text{(chain rule, the binary rule)}\\
    &= \sum_{r,s}\sum_{d=p}^{q-1} P\big(N^r_{pd}, N^s_{(d+1)q} \mid N^j_{pq}, G\big)\, P\big(w_{pd} \mid N^r_{pd}, G\big)\, P\big(w_{(d+1)q} \mid N^s_{(d+1)q}, G\big) && \text{(context-free \& ancestor-free assumptions)}\\
    &= \sum_{r,s}\sum_{d=p}^{q-1} P\big(N^j \rightarrow N^r N^s\big)\, \beta_r(p,d)\, \beta_s(d+1,q) && \text{(place-invariance assumption)}
    \end{aligned}
    $$

SLIDE 20

Problem 1: The Probability of a Sentence (3/7)

  • Example (for the input "astronomers saw stars with ears"):

    $$
    \begin{aligned}
    \beta_{VP}(2,5) &= P\big(VP \rightarrow V\ NP\big)\, \beta_{V}(2,2)\, \beta_{NP}(3,5) + P\big(VP \rightarrow VP\ PP\big)\, \beta_{VP}(2,3)\, \beta_{PP}(4,5)\\
    &= 0.7 \times 1.0 \times 0.01296 + 0.3 \times 0.126 \times 0.18 = 0.015876\\[4pt]
    \beta_{S}(1,5) &= P\big(S \rightarrow NP\ VP\big)\, \beta_{NP}(1,1)\, \beta_{VP}(2,5) = 1.0 \times 0.1 \times 0.015876 = 0.0015876
    \end{aligned}
    $$
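Running the illustrative `inside_probs` sketch on the hypothetical toy grammar reproduces these numbers (up to floating-point rounding):

```python
beta = inside_probs("astronomers saw stars with ears".split(), pcfg)
print(beta[2, 5, "VP"])  # ~0.015876
print(beta[1, 5, "S"])   # ~0.0015876 = P(sentence | G)
```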

SLIDE 21

Problem 1: The Probability of a Sentence (4/7)

  • The total probability of a sentence, expressed by the outside algorithm (summing, at any word position k, over the lexical categories N^j that can dominate w_k):

    $$
    \begin{aligned}
    P\big(w_{1m} \mid G\big) &= \sum_j P\big(w_{1(k-1)},\, w_k,\, w_{(k+1)m},\, N^j_{kk} \mid G\big) && \text{(chain rule)}\\
    &= \sum_j P\big(w_{1(k-1)},\, N^j_{kk},\, w_{(k+1)m} \mid G\big)\, P\big(w_k \mid w_{1(k-1)},\, N^j_{kk},\, w_{(k+1)m},\, G\big)\\
    &= \sum_j \alpha_j(k,k)\, P\big(N^j \rightarrow w_k\big) && \text{(context-free \& place-invariance assumptions)}
    \end{aligned}
    $$

  • The probabilities of the base case:

    $$\alpha_1(1,m) = 1, \qquad \alpha_j(1,m) = 0 \ \text{ for } j \neq 1$$

  • Find the probabilities α_j(p,q) by induction

SLIDE 22

Problem 1: The Probability of a Sentence (5/7)

  • Find the probabilities α_j(p,q) by induction
    – A top-down version of the calculation: a node N^j_pq is either the left or the right child of its parent N^f, so (chain rule, context-free & ancestor-free assumptions, then place invariance)

    $$
    \begin{aligned}
    \alpha_j(p,q) &= P\big(w_{1(p-1)},\, N^j_{pq},\, w_{(q+1)m} \mid G\big)\\
    &= \sum_{f,g}\sum_{e=q+1}^{m} P\big(w_{1(p-1)},\, w_{(q+1)m},\, N^f_{pe},\, N^j_{pq},\, N^g_{(q+1)e}\big)
     + \sum_{f,g}\sum_{e=1}^{p-1} P\big(w_{1(p-1)},\, w_{(q+1)m},\, N^f_{eq},\, N^g_{e(p-1)},\, N^j_{pq}\big)\\
    &= \sum_{f,g}\sum_{e=q+1}^{m} \alpha_f(p,e)\, P\big(N^f \rightarrow N^j N^g\big)\, \beta_g(q+1,e)
     \;+\; \sum_{f,g}\sum_{e=1}^{p-1} \alpha_f(e,q)\, P\big(N^f \rightarrow N^g N^j\big)\, \beta_g(e,p-1)
    \end{aligned}
    $$

    (the first term covers the case where N^j_pq is the left child of its parent N^f, the second the case where it is the right child)

SLIDE 23

Problem 1: The Probability of a Sentence (6/7)

  • Explanation (of the first term; the second is symmetric)

    $$
    \begin{aligned}
    P\big(w_{1(p-1)},\, w_{(q+1)m},\, N^f_{pe},\, N^j_{pq},\, N^g_{(q+1)e}\big)
    &= P\big(w_{1(p-1)},\, w_{(e+1)m},\, N^f_{pe}\big)\, P\big(N^j_{pq}, N^g_{(q+1)e} \mid w_{1(p-1)}, w_{(e+1)m}, N^f_{pe}\big)\, P\big(w_{(q+1)e} \mid N^j_{pq}, N^g_{(q+1)e}, \ldots\big)\\
    &= P\big(w_{1(p-1)},\, w_{(e+1)m},\, N^f_{pe}\big)\, P\big(N^j_{pq}, N^g_{(q+1)e} \mid N^f_{pe}\big)\, P\big(w_{(q+1)e} \mid N^g_{(q+1)e}\big) && \text{(context-free \& ancestor-free)}\\
    &= \alpha_f(p,e)\, P\big(N^f \rightarrow N^j N^g\big)\, \beta_g(q+1,e) && \text{(place invariance)}
    \end{aligned}
    $$

SLIDE 24

Problem 1: The Probability of a Sentence (7/7)

  • The product of the inside and outside probabilities:

    $$\alpha_j(p,q)\, \beta_j(p,q) = P\big(w_{1(p-1)},\, N^j_{pq},\, w_{(q+1)m} \mid G\big)\, P\big(w_{pq} \mid N^j_{pq},\, G\big) = P\big(w_{1m},\, N^j_{pq} \mid G\big)$$

  • The probability of the sentence having some constituent spanning from word p to word q:

    $$P\big(w_{1m},\, N_{pq} \mid G\big) = \sum_j \alpha_j(p,q)\, \beta_j(p,q)$$

SLIDE 25

Problem 2: Find the Most Likely Parse (1/2)

  • A Viterbi-style algorithm adapted from the inside algorithm is used to find the most likely parse of a sentence
    – Similar to the CYK algorithm introduced previously
  • Definitions
    – δ_i(p,q): the highest inside probability of a parse of a subtree N^i_pq
    – ψ_i(p,q): stores the backtrace information (j, k, r) for a subtree N^i_pq
    – Different combinations of constituents spanning different word ranges are compared, e.g.

    $$N^i_{pq} \rightarrow N^{j_1}_{p r_1} N^{k_1}_{(r_1+1) q}, \qquad N^i_{pq} \rightarrow N^{j_2}_{p r_2} N^{k_2}_{(r_2+1) q}, \qquad N^i_{pq} \rightarrow N^{j_3}_{p r_3} N^{k_3}_{(r_3+1) q}, \ \ldots$$

    and the optimal setting is stored (j and k are indices for nonterminals, r is a word position)

SLIDE 26

Problem 2: Find the Most Likely Parse (2/2)

  • 1. Initialization

    $$\delta_i(p,p) = P\big(N^i \rightarrow w_p\big)$$

  • 2. Induction

    $$\delta_i(p,q) = \max_{1 \le j,k \le n,\; p \le r < q} P\big(N^i \rightarrow N^j N^k\big)\, \delta_j(p,r)\, \delta_k(r+1,q)$$

    – Store the backtrace (three elements stored):

    $$\psi_i(p,q) = \arg\max_{(j,k,r)} P\big(N^i \rightarrow N^j N^k\big)\, \delta_j(p,r)\, \delta_k(r+1,q)$$

  • 3. Termination

    $$P(\hat t\,) = \delta_1(1,m), \qquad \text{with root node } N^1_{1m}$$

  • Recursively construct the tree nodes of the corresponding tree, the Viterbi parse
    – If X_i = N^i_pq and ψ_i(p,q) = (j, k, r), then

    $$\mathrm{left}(X_i) = N^j_{pr}, \qquad \mathrm{right}(X_i) = N^k_{(r+1)q}$$

  (Diagram: N^i dominating N^j over w_p … w_r and N^k over w_{r+1} … w_q)
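A small sketch of this recursive reconstruction, reusing the backtrace table produced by the hypothetical `cyk_parse` helper shown earlier (its π and back tables play exactly the roles of δ and ψ here):

```python
def build_tree(words, back, i, j, label):
    """Recursively rebuild the Viterbi parse from the backtrace table."""
    if i == j:                       # a lexical node: label -> w_i
        return (label, words[i - 1])
    k, left, right = back[i, j, label]
    return (label,
            build_tree(words, back, i, k, left),
            build_tree(words, back, k + 1, j, right))

words = "astronomers saw stars with ears".split()
_, back = cyk_parse(words, pcfg)
print(build_tree(words, back, 1, len(words), "S"))
```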

SLIDE 27

Problem 3: Training a PCFG (1/7)

  • If a parsed training corpus is available
    – Directly calculate the probabilities of rules via Maximum Likelihood Estimation (MLE):

    $$\hat P\big(N^j \rightarrow \zeta\big) = \frac{C\big(N^j \rightarrow \zeta\big)}{\sum_{\gamma} C\big(N^j \rightarrow \gamma\big)}$$

    where C(N^j → ζ) is the count of the number of times a particular rule is used and P̂ is the new probability of the rule
    – But, more commonly, a parsed training corpus is not available (or a sentence may have many parses)
      • A hidden data problem!
      • We wish to determine a probability function on rules, but can only directly see the probabilities of sentences
SLIDE 28

Problem 3: Training a PCFG (2/7)

  • If a parsed training corpus is not available
    – An iterative algorithm is used to determine improving estimates of the probability of the corpus W (Maximum Likelihood Estimation):

    $$P\big(W \mid G_{i+1}\big) \;\ge\; P\big(W \mid G_i\big)\,?$$

    – The algorithm starts with a certain grammar topology
      • The number of terminals and nonterminals (determined in advance)
      • The initial probability estimates for the rules (randomly chosen)
    – According to this grammar
      • The probability of each parse of a training sentence is accumulated
      • The probability of each rule being used in each place is accumulated as an expectation of how often each rule is used

SLIDE 29

Problem 3: Training a PCFG (3/7)

  • If a parsed training corpus is not available
    – Refine the probability estimates of the rules with regard to the expectations obtained previously
      • The likelihood of the training corpus given the grammar is increased: P(W | G_{i+1}) ≥ P(W | G_i)
    – Consider

    $$\alpha_j(p,q)\, \beta_j(p,q) = P\big(N^1 \stackrel{*}{\Rightarrow} w_{1m},\; N^j \stackrel{*}{\Rightarrow} w_{pq} \mid G\big)$$

      • P(N^1 ⇒* w_1m | G) = β_1(1,m), the probability of all possible parses, is calculated previously and is set as π, so

    $$P\big(N^j \stackrel{*}{\Rightarrow} w_{pq} \mid N^1 \stackrel{*}{\Rightarrow} w_{1m},\, G\big) = \frac{\alpha_j(p,q)\, \beta_j(p,q)}{\pi}$$

    – The estimate for how many times N^j is used (summing over all regions of words that the node could dominate in a sentence):

    $$E\big(N^j \text{ is used in the derivation}\big) = \sum_{p=1}^{m}\sum_{q=p}^{m} \frac{\alpha_j(p,q)\, \beta_j(p,q)}{\pi}$$

SLIDE 30

Problem 3: Training a PCFG (4/7)

  • If a parsed training corpus is not available
    – The estimate for how many times N^j → N^r N^s is used:

    $$E\big(N^j \rightarrow N^r N^s \text{ used}\big) = \sum_{p=1}^{m-1}\sum_{q=p+1}^{m}\sum_{d=p}^{q-1} \frac{\alpha_j(p,q)\, P\big(N^j \rightarrow N^r N^s\big)\, \beta_r(p,d)\, \beta_s(d+1,q)}{\pi}$$

    – The new probability for N^j → N^r N^s will be:

    $$\hat P\big(N^j \rightarrow N^r N^s\big) = \frac{E\big(N^j \rightarrow N^r N^s \text{ used}\big)}{E\big(N^j \text{ used}\big)} = \frac{\sum_{p=1}^{m-1}\sum_{q=p+1}^{m}\sum_{d=p}^{q-1} \alpha_j(p,q)\, P\big(N^j \rightarrow N^r N^s\big)\, \beta_r(p,d)\, \beta_s(d+1,q)}{\sum_{p=1}^{m}\sum_{q=p}^{m} \alpha_j(p,q)\, \beta_j(p,q)}$$

  The training formulas for a single sentence
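A literal sketch of these two single-sentence formulas for one binary rule, assuming the illustrative `inside_probs` and `outside_probs` helpers and the toy grammar from the earlier sketches; one EM iteration would apply this to every binary rule (and the analogous lexical formula on the next slide to every unary rule):

```python
def reestimate_binary(words, grammar, j, r, s):
    """One-sentence re-estimate of P(N^j -> N^r N^s) from the inside and outside tables."""
    m = len(words)
    beta = inside_probs(words, grammar)
    alpha = outside_probs(words, grammar, beta)
    rule_prob = dict(grammar[j]).get((r, s), 0.0)   # current P(N^j -> N^r N^s)

    # Expected number of times N^j -> N^r N^s is used (the 1/pi factors cancel in the ratio).
    used = sum(alpha.get((p, q, j), 0.0) * rule_prob
               * beta.get((p, d, r), 0.0) * beta.get((d + 1, q, s), 0.0)
               for p in range(1, m)
               for q in range(p + 1, m + 1)
               for d in range(p, q))

    # Expected number of times N^j is used at all.
    total = sum(alpha.get((p, q, j), 0.0) * beta.get((p, q, j), 0.0)
                for p in range(1, m + 1)
                for q in range(p, m + 1))
    return used / total if total else 0.0

print(reestimate_binary("astronomers saw stars with ears".split(), pcfg, "VP", "V", "NP"))
```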

SLIDE 31

Problem 3: Training a PCFG (5/7)

  • If a parsed training corpus is not available
    – The estimate for how many times N^j → w^k is used:

    $$E\big(N^j \rightarrow w^k \text{ used}\big) = \sum_{h=1}^{m} \frac{\alpha_j(h,h)\, P\big(N^j \rightarrow w_h,\, w_h = w^k\big)}{\pi} = \sum_{h=1}^{m} \frac{\alpha_j(h,h)\, \beta_j(h,h)\, P\big(w_h = w^k\big)}{\pi}$$

    where P(w_h = w^k) acts like an indicator function
    – The new probability for N^j → w^k will be:

    $$\hat P\big(N^j \rightarrow w^k\big) = \frac{\sum_{h=1}^{m} \alpha_j(h,h)\, \beta_j(h,h)\, P\big(w_h = w^k\big)}{\sum_{p=1}^{m}\sum_{q=p}^{m} \alpha_j(p,q)\, \beta_j(p,q)}$$

  The training formulas for a single sentence

SLIDE 32

Problem 3: Training a PCFG (6/7)

  • If a parsed training corpus is not available
    – Assume the sentences in the corpus are independent
    – The likelihood of the corpus is then just the product of the probabilities of the sentences in it according to the grammar, where

    $$W = (W_1, \ldots, W_\omega), \qquad W_i = w_{i,1} \ldots w_{i,m_i}$$

    – Define common subterms for the training sentences (statistics for training using all sentences; the inside and outside probabilities below are computed on sentence W_i):

    $$f_i(p,q,j,r,s) = \frac{\sum_{d=p}^{q-1} \alpha_j(p,q)\, P\big(N^j \rightarrow N^r N^s\big)\, \beta_r(p,d)\, \beta_s(d+1,q)}{P\big(N^1 \stackrel{*}{\Rightarrow} W_i \mid G\big)} \qquad \text{(nonterminal at a branching node, using the rule } N^j \rightarrow N^r N^s\text{)}$$

    $$h_i(p,q,j) = \frac{\alpha_j(p,q)\, \beta_j(p,q)}{P\big(N^1 \stackrel{*}{\Rightarrow} W_i \mid G\big)} \qquad \text{(nonterminal anywhere)}$$

    $$g_i(h,j,k) = \frac{\alpha_j(h,h)\, P\big(w_h = w^k\big)\, \beta_j(h,h)}{P\big(N^1 \stackrel{*}{\Rightarrow} W_i \mid G\big)} \qquad \text{(nonterminal at a preterminal node, using the rule } N^j \rightarrow w^k\text{)}$$

SLIDE 33

Problem 3: Training a PCFG (7/7)

  • If a parsed training corpus is not available: the training formulas using all sentences
    – The new probability for N^j → N^r N^s will be:

    $$\hat P\big(N^j \rightarrow N^r N^s\big) = \frac{\sum_{i=1}^{\omega} \sum_{p=1}^{m_i - 1} \sum_{q=p+1}^{m_i} f_i(p,q,j,r,s)}{\sum_{i=1}^{\omega} \sum_{p=1}^{m_i} \sum_{q=p}^{m_i} h_i(p,q,j)}$$

    – The new probability for N^j → w^k will be:

    $$\hat P\big(N^j \rightarrow w^k\big) = \frac{\sum_{i=1}^{\omega} \sum_{h=1}^{m_i} g_i(h,j,k)}{\sum_{i=1}^{\omega} \sum_{p=1}^{m_i} \sum_{q=p}^{m_i} h_i(p,q,j)}$$

    (ω: total number of training sentences; m_i: word length of training sentence i)

SLIDE 34

Problems with the Inside-Outside Algorithm

  • The whole training procedure is slow: O(m³n³) for each iteration
    – m: the length of the sentence
    – n: the number of nonterminals
  • Local maxima are much more of a problem
  • Satisfactory learning requires many more nonterminals than are theoretically needed to describe the language at hand
  • There is no guarantee that the nonterminals learned will have any satisfactory resemblance to the kinds of nonterminals normally motivated in linguistic analysis

SLIDE 35

Problems with PCFGs (1/6)

  • The problems with PCFGs come from the fundamental independence assumptions
    – Structural independence: the expansion of any one non-terminal is independent of any other non-terminal
      • Each rule is independent of each other rule
      • But the choice of how a node expands does depend on the location of the node in the parse tree, e.g. NP → Pronoun vs. NP → Det Noun
        – Is the NP a subject in the sentence (talking about the topic or old information) or an object (introducing new referents)?
        – Switchboard (declarative sentences): 91% of subjects are pronouns (9% lexical nouns), while 66% of objects are lexical nouns (34% pronouns)
        – "She is able to take her baby to work with her." / "All the people signed confessions."

SLIDE 36

Problems with PCFGs (2/6)

  • The problems with PCFGs come from their fundamental independence assumptions (cont.)
    – Lexical independence: PCFGs' lack of sensitivity to words
      • Lexical information in PCFGs can only be represented via the probability of pre-terminal nodes (Verb, Noun, Det) being expanded lexically
      • But lexical information plays an important role in selecting the correct parse, e.g. the ambiguous prepositional phrase attachment in "Moscow sent more than 100,000 soldiers into Afghanistan": NP → NP PP (NP attachment) or VP → VP PP (VP attachment)
SLIDE 37

Problems with PCFGs (3/6)

  – Lexical independence (cont.)
    • Attachment ambiguities
      – Hindle and Rooth (13M words from the AP newswire, 1991): 67% NP-attachment vs. 33% VP-attachment
      – Collins (WSJ and IBM computer manuals, 1999): 52% NP-attachment
    • Coordination ambiguities
      – E.g. "dogs in houses and cats"
  A model keeping separate lexical dependency statistics for different verbs would be helpful for disambiguating these attachment problems!

SLIDE 38

Problems with PCFGs (4/6)

  – Lexical independence (cont.)
    • Attachment ambiguities: "Moscow sent more than 100,000 soldiers into Afghanistan"
  (Figure: two parse trees of the VP "sent more than 100,000 soldiers into Afghanistan". NP attachment: the PP "into Afghanistan" attaches to the NP "more than 100,000 soldiers". VP attachment: the PP attaches to the VP headed by "sent".)

SLIDE 39

Problems with PCFGs (5/6)

  – Lexical independence (cont.)
    • Coordination ambiguities
SLIDE 40

Structural Dependency: More Examples

Pronouns, proper names, and definite NPs appear more commonly in subject position; NPs containing post-head modifiers and bare nouns occur more commonly in object position.
SLIDE 41

Lexical Dependency : More Examples

  – We should include more information about what the actual words in the sentence are when making decisions about the structure of the parse tree
    • Lexical dependencies between words
SLIDE 42

Problems with PCFGs (6/6)

  • Upshot
    – We should build a much better probabilistic parser by taking into account lexical and structural context
      • Structural dependency
      • Lexical dependency
  • Challenge
    – How to find factors that give us a lot of extra discrimination while not defeating us with a multiplicity of parameters (or the sparse data problem)

SLIDE 43

Probabilistic Lexicalized CFGs (1/8)

  • The syntactic constituents are associated with a lexical head (Black et al., 1992)
    – Each non-terminal in a parse tree is annotated with a single word which is its lexical head (the head of that constituent)
    – Each rule is augmented to identify one right-hand-side constituent to be the head daughter
    – But how to choose the head is controversial!

SLIDE 44

Probabilistic Lexicalized CFGs (2/8)

  • How to select a head for a constituent?
    – E.g. finding the head of an NP
      • Return the very last word if it is tagged POS (possessive)
      • Else search from right to left for the first child that is an NN, NNP, etc.
      • Else search from left to right for the first child that is an NP
    – Example: NP → NP PP
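A minimal sketch of this kind of head-finding heuristic for NPs. The tag set, function name, and fallback are illustrative assumptions loosely following the rules above, not the exact head table of any particular parser:

```python
def np_head(children):
    """children: list of (tag, word) pairs for the NP's children, left to right."""
    # Rule 1: the very last child, if it is tagged POS (a possessive marker).
    if children[-1][0] == "POS":
        return children[-1]
    # Rule 2: search right to left for the first nominal child (NN, NNS, NNP, ...).
    for tag, word in reversed(children):
        if tag.startswith("NN"):
            return (tag, word)
    # Rule 3: search left to right for the first NP child.
    for tag, word in children:
        if tag == "NP":
            return (tag, word)
    return children[-1]    # fall back to the last child

print(np_head([("DT", "the"), ("NN", "sack"), ("PP", "into the bin")]))  # ('NN', 'sack')
```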

SLIDE 45

Probabilistic Lexicalized CFGs (3/8)

  • A simple way to think of a lexicalized grammar
    – E.g. creating many copies of each rule, one copy for each possible head word for each constituent:
      VP(dumped) → VBD(dumped) NP(sacks) PP(into)   [3×10⁻¹⁰]
      VP(dumped) → VBD(dumped) NP(cats) PP(into)    [8×10⁻¹¹]
      VP(dumped) → VBD(dumped) NP(hats) PP(into)    [4×10⁻¹⁰]
      VP(dumped) → VBD(dumped) NP(sacks) PP(above)  [1×10⁻¹²]
      …
    – Problem
      • No corpus is big enough to train such probabilities
      • We should make some simplifying independence assumptions in order to cluster some of the counts

SLIDE 46

Probabilistic Lexicalized CFGs (4/8)

  • Example
  (Figure: two lexicalized parse trees of the same sentence, one marked incorrect and one marked correct)

SLIDE 47

Probabilistic Lexicalized CFGs (5/8)

  • Take Charniak's parser (1997) as an example
    – It incorporates lexical dependency information by relating the heads of phrases to the heads of their constituents
    – Recall the vanilla PCFG rule probability:

    $$P\big(r(n) \mid n\big), \qquad \text{where } n \text{ is the syntactic category of a parse-tree node}$$

    – The head-rule probability of the probabilistic lexicalized CFG conditions on the headword as well:

    $$P\big(r(n) \mid n,\, h(n)\big), \qquad \text{where } h(n) \text{ is the headword of the parse-tree node}$$

    – E.g. for the rule VP → VBD NP PP, this distinguishes P(r | VP, dumped) from P(r | VP, slept)

SLIDE 48

Probabilistic Lexicalized CFGs (6/8)

  – Further, decide the probability of a head
    • Null assumption: all heads are equally likely (i.e. use only the prior probability of the head words)
      – The probability that the head of a node is "sacks" would be the same as the probability that the head is "racks"
      – This doesn't seem very useful
    • Instead, condition the probability of the head h of node n on two factors: the syntactic category of the node n, and the head of the node's mother, h(m(n))

    $$P\big(h(n) = \mathit{word}_i \mid n,\, h(m(n))\big)$$

    • E.g. P(head(n) = sacks | n = NP, h(m(n)) = dumped), for an NP daughter (headed by "sacks"?) of a node X headed by "dumped"
SLIDE 49

Probabilistic Lexicalized CFGs (7/8)

  – The probability of a parse T of a sentence S is a product, over the nodes n of T, of a head-rule probability and a head-head probability:

    $$P(T, S) = \prod_{n \in T} P\big(r(n) \mid n,\, h(n)\big) \times P\big(h(n) \mid n,\, h(m(n))\big)$$

  – Counting from the Brown corpus, for example:

    $$P\big(VP \rightarrow VBD\ NP\ PP \mid VP, dumped\big) = \frac{C\big(VP(dumped) \rightarrow VBD\ NP\ PP\big)}{\sum_{\beta} C\big(VP(dumped) \rightarrow \beta\big)} = \frac{6}{9} = 0.67$$

    $$P\big(VP \rightarrow VBD\ NP \mid VP, dumped\big) = \frac{C\big(VP(dumped) \rightarrow VBD\ NP\big)}{\sum_{\beta} C\big(VP(dumped) \rightarrow \beta\big)}$$

    $$P\big(into \mid PP, dumped\big) = \frac{C\big(X(dumped) \rightarrow \ldots PP(into) \ldots\big)}{\sum C\big(X(dumped) \rightarrow \ldots PP \ldots\big)} = \frac{2}{9} = 0.22$$

    $$P\big(into \mid PP, sacks\big) = \frac{C\big(X(sacks) \rightarrow \ldots PP(into) \ldots\big)}{\sum C\big(X(sacks) \rightarrow \ldots PP \ldots\big)} \;\Rightarrow\; \text{smoothing or backoff can be applied}$$

SLIDE 50

Probabilistic Lexicalized CFGs (8/8)

  – The original version of Charniak's parser adds additional conditioning factors
    • The rule-expansion probability also depends on the node's grandparent (a trigram, i.e. second-order Markov, dependency)
    • Various backoff and smoothing algorithms are used
SLIDE 51

Dependency Grammars (1/2)

  • The grammar formalism is based purely on lexical dependency information
    – The syntactic structure of a sentence is described purely in terms of words and binary semantic or syntactic relations between words
    – Constituents and phrase structures do not play any fundamental role

SLIDE 52

Dependency Grammars (2/2)

  • One of the main advantages of dependency grammars is their ability to handle languages with relatively free word order
    – They abstract away from word-order variation, representing only the information that is necessary for the parse
  • Examples
    – Link Grammar
    – Constraint Grammar

SLIDE 53

Categorial Grammars (1/2)

  • A combinatory categorial grammar has two components
    – The categorial lexicon
      • Associates each word with a syntactic and semantic category
      • Two kinds of categories
        – Arguments: e.g. Ns
        – Functors: e.g. verbs, determiners
    – The combination rules
      • Allow functors and arguments to be combined, e.g.
        – X/Y: something that combines with a Y on its right to produce an X
        – X\Y: something that combines with a Y on its left to produce an X

SLIDE 54

Categorial Grammars (2/2)

  • Examples
    – Determiners receive the category NP/N
    – Transitive verbs might have the category VP/NP
    – Ditransitive verbs might have the category (VP/NP)/NP
  (Derivation: "Harry eats apples", with Harry = NP, eats = a V with category VP/NP (where VP = S\NP), apples = NP)

SLIDE 55

Evaluating Parsers (1/2)

  • Labeled recall
    = (# of correct constituents in the candidate parse of a sentence s) / (# of correct constituents in the treebank parse of s)
  • Labeled precision
    = (# of correct constituents in the candidate parse of a sentence s) / (# of total constituents in the candidate parse of s)
    – A correct constituent must have the same starting point, ending point, and non-terminal symbol as the "gold standard" of the treebank
  • Cross-brackets
    – The number of crossing brackets, e.g. a cross-bracket between ((A B) C) and (A (B C))
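A small sketch of the precision/recall part of these metrics, with constituents represented as (start, end, label) triples (this representation is an assumption of the sketch, not the slides'):

```python
def labeled_prf(candidate, treebank):
    """candidate, treebank: sets of (start, end, label) constituent triples for one sentence."""
    correct = len(candidate & treebank)        # same span and same non-terminal symbol
    precision = correct / len(candidate)
    recall = correct / len(treebank)
    return precision, recall

cand = {(1, 5, "S"), (1, 1, "NP"), (2, 5, "VP"), (3, 5, "NP"), (4, 5, "PP")}
gold = {(1, 5, "S"), (1, 1, "NP"), (2, 5, "VP"), (2, 3, "VP"), (4, 5, "PP")}
print(labeled_prf(cand, gold))   # (0.8, 0.8)
```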

SLIDE 56

Evaluating Parsers (2/2)

  • Examples
    – Using a portion of the Wall Street Journal as the test set, parsers such as Charniak (1997) and Collins (1999) achieve
      • Just under 90% recall and just under 90% precision
      • About 1% cross-bracketed constituents per sentence