

SLIDE 1

Statistical Parsing

Paper presentation: Michael Collins (2003). “Head-driven statistical models for natural language parsing”. In: Computational linguistics 29.4, pp. 589–637. doi: 10.1162/089120103322753356

Çağrı Çöltekin, University of Tübingen, Seminar für Sprachwissenschaft, December 2016

SLIDE 2

What is the paper about?

  • A head-driven, lexicalized PCFG
  • PCFGs cannot capture many linguistic phenomena
  • Lexicalizing PCFGs allows capturing lexical dependencies, but parameter estimation becomes difficult (many rules, sparse data)
  • The main idea is factoring the rule probabilities into parts that are easy to estimate
  • The paper does that in a linguistically-motivated way
  • The resulting parser works better than PCFGs and some others in the literature


SLIDE 3

Three models

Model 1
  • Lexicalize the PCFG
  • Condition the probability of a rule on parts of its LHS
  • Condition probabilities of non-heads on their distance to the head

Model 2
  • Add the complement-adjunct distinction (use subcategorization frames)

Model 3
  • Add conditions for wh-movement


SLIDE 4

An overview of the paper

  • 2. Background: PCFGs, lexicalization, estimation (MLE)
  • 3. Model definitions
  • 4. Special cases: mainly related to the treebank format
  • 5. Practical issues: parameter estimation, unknown words, parsing algorithm
  • 6. Results
  • 7. Discussion
  • 8. Related work
  • 9. Conclusions


SLIDE 5

Probabilistic context-free grammars

  • A CFG augmented with probabilities for each rule
  • Assigns a proper probability distribution to parse trees
    – if all rule probabilities with the same LHS sum to 1
    – if all derivations terminate in a finite number of steps
  • The main problem is estimating the probability associated with each rule X → β
  • Maximum-likelihood estimate (see the sketch below):

    P(X → β) = count(X → β) / count(X)

  • With rule probabilities, parsing is finding the best tree:

    T_best = argmax_T P(T | S) = argmax_T P(T, S) / P(S) = argmax_T P(T, S)
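A minimal sketch of the MLE estimate above, assuming trees are encoded as nested (label, child1, child2, ...) tuples with plain strings as leaf words (the encoding and function names are illustrative, not from the paper):

    from collections import Counter

    def count_rules(tree, rule_counts, lhs_counts):
        """Recursively count each CFG rule X -> beta in a tree given
        as (label, child1, ...) tuples; leaves are plain strings."""
        label, children = tree[0], tree[1:]
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        rule_counts[(label, rhs)] += 1
        lhs_counts[label] += 1
        for c in children:
            if not isinstance(c, str):
                count_rules(c, rule_counts, lhs_counts)

    def mle_rule_probs(treebank):
        """MLE: P(X -> beta) = count(X -> beta) / count(X)."""
        rule_counts, lhs_counts = Counter(), Counter()
        for tree in treebank:
            count_rules(tree, rule_counts, lhs_counts)
        return {rule: n / lhs_counts[rule[0]]
                for rule, n in rule_counts.items()}

    # toy example: "IBM bought Lotus"
    toy = ("S", ("NP", "IBM"), ("VP", ("VBD", "bought"), ("NP", "Lotus")))
    probs = mle_rule_probs([toy])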


SLIDE 6

Probabilistic context-free grammars (2)

  • In PCFGs, derivations are assumed to be independent
  • The probability of a tree is the product of the probabilities of the rules used in the derivation (see the sketch below)
  • PCFGs cannot capture lexical or structural dependencies
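Continuing the previous sketch (same tree encoding, `probs` as returned by mle_rule_probs), the independence assumption makes the tree probability a plain product over the rules in the derivation:

    def tree_prob(tree, probs):
        """P(T) = product of rule probabilities, by the PCFG
        independence assumption; unseen rules get probability 0."""
        label, children = tree[0], tree[1:]
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        p = probs.get((label, rhs), 0.0)
        for c in children:
            if not isinstance(c, str):
                p *= tree_prob(c, probs)
        return p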


SLIDE 7

Lexicalizing PCFGs

  • Replace each non-terminal X with X(h), where h is a tuple of the head word and its POS tag
  • Now the grammar can capture (head-driven) lexical dependencies
  • But the number of nonterminals grows by a factor of |V| × |T|
  • Estimation becomes difficult (many rules, data sparsity)
  • Note: the Penn Treebank (PTB) does not annotate heads; they are annotated automatically (based on heuristics, sketched below)
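A minimal sketch of heuristic head finding, with a toy head-rule table (the real head-percolation tables, e.g. in Collins' Appendix A, are much larger and more careful):

    # For each parent label: scan direction and label priorities.
    HEAD_RULES = {
        "S":  ("right", ["VP", "S"]),
        "VP": ("left",  ["VBD", "VBZ", "VB", "VP"]),
        "NP": ("right", ["NN", "NNP", "NNS", "NP"]),
    }

    def find_head(parent, child_labels):
        """Index of the head child under the toy rules; falls back
        to the first child in scan order if nothing matches."""
        direction, priorities = HEAD_RULES.get(parent, ("left", []))
        scan = list(range(len(child_labels)))
        if direction == "right":
            scan.reverse()
        for label in priorities:          # try priority labels in order
            for i in scan:
                if child_labels[i] == label:
                    return i
        return scan[0]

    assert find_head("S", ["NP", "VP"]) == 1   # VP heads the sentence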


SLIDE 8

Example lexicalized derivation

TOP
  S(bought,VBD)
    NP(week,NN)
      JJ(last,JJ)      Last
      NN(week,NN)      week
    NP(IBM,NNP)
      NNP(IBM,NNP)     IBM
    VP(bought,VBD)
      VBD(bought,VBD)  bought
      NP(Lotus,NNP)
        NNP(Lotus,NNP) Lotus

Example rules:

TOP → S(bought,VBD)
S(bought,VBD) → NP(week,NN) NP(IBM,NNP) VP(bought,VBD)
VP(bought,VBD) → VBD(bought,VBD) NP(Lotus,NNP)
JJ(last,JJ) → Last


SLIDE 9

Model 1: the generative story

Each lexicalized CF rule is taken to have the form

X(h) → ⟨left-dependents⟩ H(h) ⟨right-dependents⟩

  1. Generate the head with probability Ph(H | X, h)
  2. Generate the left modifier(s) independently, each with probability Pl(Li(li) | X, h, H)
  3. Generate the right modifier(s) independently, each with probability Pr(Ri(ri) | X, h, H)

  • A special left/right dependent label ‘STOP’ terminates the generation (see the sketch below)
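A hedged sketch of how Model 1's factored rule probability is assembled, assuming the component distributions Ph, Pl, Pr are supplied as plain functions and modifiers are reduced to bare labels (in the paper each Li(li) also carries its own head word and tag; the distance conditioning of the next slide is omitted here):

    def rule_prob(X, h, H, left_mods, right_mods, Ph, Pl, Pr):
        """P(X(h) -> Ln..L1 H(h) R1..Rm) under the Model 1
        factorization: head first, then each modifier generated
        independently, with STOP closing each side."""
        p = Ph(H, X, h)                     # 1. generate the head label
        for L in left_mods + ["STOP"]:      # 2. left modifiers, then STOP
            p *= Pl(L, X, h, H)
        for R in right_mods + ["STOP"]:     # 3. right modifiers, then STOP
            p *= Pr(R, X, h, H)
        return p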


SLIDE 10

Model 1: distance

  • Model 1 also conditions the left and right dependents on their distance from the head. For example, Pl is estimated as Pl(Li(li) | X, h, H, distance(i − 1))
  • Two distance measures:
    – Is the intervening string of length 0? (adjacency)
    – Does the intervening string contain a verb? (clausal modifiers)
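The two measures reduce to binary features of the material between the head and the new modifier; a small sketch, assuming that material is given as (word, POS tag) pairs and using an illustrative verb-tag list:

    VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

    def distance_features(intervening):
        """Model 1's two distance features: adjacency to the head,
        and whether a verb intervenes."""
        adjacent = len(intervening) == 0
        has_verb = any(tag in VERB_TAGS for _, tag in intervening)
        return adjacent, has_verb

    print(distance_features([]))                   # (True, False)
    print(distance_features([("bought", "VBD")]))  # (False, True)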


SLIDE 11

Model 2: the generative story

Main idea: condition the left/right modifiers on subcategorization frames (LC and RC), the left and right complements of the head.

  1. Generate the head with probability Ph(H | X, h)
  2. Choose the left and right subcategorization frames, with probabilities Plc(LC | X, H, h) and Prc(RC | X, H, h)
  3. Generate the left/right modifier(s) independently, each with probability Pl(Li(li) | X, h, H, LC) and Pr(Ri(ri) | X, h, H, RC)
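A hedged illustration of the bookkeeping this adds to the Model 1 sketch: the frame is a multiset of required complements, each generated complement is removed from it, and STOP can only receive probability mass once the frame is empty (the modifier distribution Pmod is again an assumed stand-in, not Collins' actual API):

    from collections import Counter

    def side_prob(mods, X, h, H, subcat, Pmod):
        """Probability of one side's modifier sequence under Model 2."""
        frame = Counter(subcat)               # complements still required
        p = 1.0
        for M in mods:
            p *= Pmod(M, X, h, H, frozenset(frame))
            if frame.get(M, 0) > 0:           # complement: discharge it
                frame[M] -= 1
                if frame[M] == 0:
                    del frame[M]
        # STOP comes last; given a nonempty frame its probability is 0
        p *= Pmod("STOP", X, h, H, frozenset(frame))
        return p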


SLIDE 12

Model 3: traces and wh-movement

The idea: mark and propagate ‘gaps’ (here for “The store that IBM bought last week”):

NP(store)
  NP(store)           The store
  SBAR(that)(+gap)
    WHNP(that)
      WDT             that
    S(bought)(+gap)
      NP-C(IBM)       IBM
      VP(bought)(+gap)
        VBD           bought
        TRACE
        NP(week)      last week
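A simplified sketch of the gap bookkeeping: a parent carrying +gap must pass it to exactly one place, the head child, one of the modifiers, or discharge it as a TRACE (in the paper this choice carries its own probability term; the function below only shows the annotation, not the probabilities):

    def pass_gap(child_labels, head_index, choice, target=None):
        """Pass the parent's +gap to the head, to one modifier, or
        discharge it as a TRACE replacing the child at `target`."""
        out = list(child_labels)
        if choice == "head":
            out[head_index] += "(+gap)"
        elif choice == "modifier":
            out[target] += "(+gap)"
        elif choice == "trace":
            out[target] = "TRACE"
        return out

    # VP(bought)(+gap) -> VBD TRACE NP(week), as in the tree above
    print(pass_gap(["VBD", "NP-C", "NP"], 0, "trace", 1))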


SLIDE 13

Special cases

  • Non-recursive (base) NPs are marked as NPB
  • Coordination: allow only a single phrase after a CC
  • Punctuation: remove all punctuation except non-initial/non-final commas and colons, which are treated like coordination
  • Empty subjects: introduce a dummy empty subject during preprocessing


SLIDE 14

Parameter estimation

Parameters are estimated with three levels of backoff (see Table 1 in the paper for details), using a version of Witten-Bell smoothing:

e = λ1 e1 + (1 − λ1)(λ2 e2 + (1 − λ2) e3)

where

λ1 = f1 / (f1 + 5 u1)

f1 is the relevant number of tokens (the count in the denominator) and u1 is the relevant number of types; the other λs are calculated similarly.
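The interpolation is a direct computation once the three raw estimates and their token/type counts are known; a minimal sketch, assuming e1..e3, f1, f2, u1, u2 are already computed:

    def smoothed_estimate(e1, e2, e3, f1, u1, f2, u2, k=5.0):
        """e = l1*e1 + (1-l1)*(l2*e2 + (1-l2)*e3), with each
        lambda_i = f_i / (f_i + k*u_i) and k = 5 (Witten-Bell variant)."""
        l1 = f1 / (f1 + k * u1) if f1 + u1 > 0 else 0.0
        l2 = f2 / (f2 + k * u2) if f2 + u2 > 0 else 0.0
        return l1 * e1 + (1 - l1) * (l2 * e2 + (1 - l2) * e3)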


SLIDE 15

Unknown words and parsing algorithm

  • During training, all words with frequency less than 6 were replaced with UNKNOWN (see the sketch below)
  • During testing, the POS tags for unknown words were assigned using the tagger by Ratnaparkhi (1996)
  • The parsing algorithm is a version of the CKY parser with O(n^5) complexity
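A minimal sketch of the rare-word preprocessing, assuming training sentences are given as lists of token strings:

    from collections import Counter

    def replace_rare_words(sentences, min_freq=6, unk="UNKNOWN"):
        """Replace every word seen fewer than min_freq times in the
        training data with a single UNKNOWN token."""
        freq = Counter(w for sent in sentences for w in sent)
        return [[w if freq[w] >= min_freq else unk for w in sent]
                for sent in sentences]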


SLIDE 16

Results

  • Model 2 performs better than Model 1
  • Model 2 also performs better than, or similarly to, earlier and state-of-the-art models
  • Details: Table 2 on page 608 of the paper


SLIDE 17

More on results

  • Phrase-label precision/recall results do not reveal attachment problems
  • Extracted dependencies are more informative (Figure 12 on page 610)
  • The parser recovers ‘core’ dependencies successfully
  • The main problems are with adjuncts and coordination


SLIDE 18

More on distance measure

  • The distance measure seems to help with finding subcategorization in Model 1
  • As the distance from the head increases,
    – the probability of attaching a new modifier decreases
    – the probability of attaching ‘STOP’ increases
  • The distance measure is also useful for preferring right-branching structures
  • Structural (e.g., close-attachment) vs. lexical/semantic preferences: structural preferences seem to be necessary. For example:
    – John was believed to have been shot by Bill
    – Flip said that Squeaky will do the work yesterday


SLIDE 19

Choice of representation

  • The parser prefers PTB-style (flat) trees
  • For binary representations, pre-/post-processing is needed
  • This would have an effect on capturing structural (but not lexical) preferences
  • Preprocessing steps, e.g., NPB labeling, seem to be important
  • In general, the parser works best with
    – flat trees
    – different constituent labels at different levels


SLIDE 20

The need to break down rules

  • The main benefit is that the parser can use rules it has not seen in the training data
  • The parser can also learn some regularities in the rules
  • Compare with Charniak (1997), which only allows rules seen in the training data
  • This is more important for PTB-style flat rules:

    PTB:          VP → V NP    VP → V NP PP    VP → V NP PP PP    …
    alternative:  VP → V NP    VP → VP PP

  • In the PTB, 54.5% of the rules (of the form used by this parser) occur only once (see the sketch below)
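As an illustration of the sparsity claim, a small sketch counting the fraction of distinct rules that occur exactly once in a treebank (tree encoding as in the earlier sketches):

    from collections import Counter

    def singleton_rule_fraction(treebank):
        """Fraction of distinct rules occurring exactly once."""
        rules = Counter()
        def collect(tree):
            label, children = tree[0], tree[1:]
            rules[(label, tuple(c if isinstance(c, str) else c[0]
                                for c in children))] += 1
            for c in children:
                if not isinstance(c, str):
                    collect(c)
        for tree in treebank:
            collect(tree)
        return sum(1 for n in rules.values() if n == 1) / len(rules)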


SLIDE 21

Summary

  • An accurate generative parser that breaks down rules
  • Does well on ‘core’ dependencies; adjuncts and coordination are the main sources of error
  • Conditioning on either adjacency or subcategorization is needed for good accuracy
  • The models work well with flat dependencies
  • Breaking down the rules has good properties (the parser can use rules that were not seen in training)


SLIDE 22

Bibliography

Collins, Michael (2003). “Head-driven statistical models for natural language parsing”. In: Computational Linguistics 29.4, pp. 589–637. doi: 10.1162/089120103322753356.

Ratnaparkhi, Adwait (1996). “A maximum entropy model for part-of-speech tagging”. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Vol. 1, pp. 133–142.