Features of Statistical Parsers

SLIDE 1

Features of Statistical Parsers

Mark Johnson
Brown Laboratory for Linguistic Information Processing
CoNLL 2005

SLIDE 2

Features of Statistical Parsers

Confessions of a bottom-feeder: Dredging in the Statistical Muck

Mark Johnson
Brown Laboratory for Linguistic Information Processing
CoNLL 2005

SLIDE 3

Features of Statistical Parsers

Confessions of a bottom-feeder: Dredging in the Statistical Muck

Mark Johnson
Brown Laboratory for Linguistic Information Processing
CoNLL 2005

With much help from Eugene Charniak, Michael Collins and Matt Lease

SLIDE 4

Outline

  • Goal: find features for identifying good parses
  • Why is this difficult with generative statistical models?
  • Reranking framework
  • Conditional versus joint estimation
  • Features for parse ranking
  • Estimation procedures
  • Experimental set-up
  • Feature selection and evaluation

SLIDE 5

Features for accurate parsing

  • Accurate parsing requires good features
    ⇒ need a flexible method for evaluating a wide range of features
  • The parse ranking framework is currently the best method for doing this
    + works with virtually any kind of representation
    + features can encode virtually any kind of information (syntactic, lexical semantics, prosody, etc.)
    + can exploit the currently best-available parsers
    − efficient algorithms are hard(-er) to design and implement
    − fishing expedition

SLIDE 6

Why not a generative statistical parser?

  • Statistical parsers (Charniak, Collins) generate parses node by node
  • Each step is conditioned on the structure already generated

(S (NP (PRP He)) (VP (VBD raised) (NP (DT the) (NN price))) (. .))

  • Encoding dependencies is as difficult as designing a feature-passing grammar (GPSG)
  • Smoothing interacts in mysterious ways with these encodings
  • Conditional estimation should produce better parsers with our current lousy models

SLIDE 7

Linear ranking framework

  • Generate n candidate parses Tc(s) for each sentence s
  • Map each parse t ∈ Tc(s) to a real-valued feature vector f(t) = (f1(t), . . . , fm(t))
  • Each feature fj is associated with a weight wj
  • The highest scoring parse is predicted correct:

$$\hat{t} = \operatorname{argmax}_{t \in T_c(s)} w \cdot f(t)$$

[Figure: sentence s → n-best parser → parses Tc(s) = t1, . . . , tn → apply feature functions → feature vectors f(t1), . . . , f(tn) → linear combination → parse scores w · f(t1), . . . , w · f(tn) → argmax → “best” parse for s]

SLIDE 8

Linear ranking example

w = (−1, 2, 1)

Candidate parse tree t    features f(t)    parse score w · f(t)
t1                        (1, 3, 2)        7
t2                        (2, 2, 1)        3
. . .                     . . .            . . .

  • Parser designer specifies feature functions f = (f1, . . . , fm)
  • Feature weights w = (w1, . . . , wm) specify each feature’s “importance”
  • n-best parser produces trees Tc(s) for each sentence s
  • Feature functions f apply to each tree t ∈ Tc(s), producing feature values f(t) = (f1(t), . . . , fm(t))
  • Return highest scoring tree

$$\hat{t}(s) = \operatorname{argmax}_{t} w \cdot f(t) = \operatorname{argmax}_{t} \sum_{j=1}^{m} w_j f_j(t)$$
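The ranking step can be sketched in a few lines of Python. This is only an illustration using the weight vector and the two feature vectors from the example table; the function names `score` and `best_parse` are made up here and are not part of any parser.

```python
def score(w, f):
    """Dot product w · f(t) of a weight vector and a feature vector."""
    return sum(wj * fj for wj, fj in zip(w, f))

def best_parse(w, candidates):
    """Return the name of the candidate with the highest score w · f(t)."""
    return max(candidates, key=lambda name: score(w, candidates[name]))

w = (-1, 2, 1)
candidates = {"t1": (1, 3, 2), "t2": (2, 2, 1)}  # feature vectors from the table
scores = {name: score(w, f) for name, f in candidates.items()}
print(scores)                     # {'t1': 7, 't2': 3}
print(best_parse(w, candidates))  # t1
```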

SLIDE 9

Linear ranking, statistics and machine learning

  • Many models define the best candidate $\hat{t}$ in terms of a linear combination of feature values w · f(t)
    – Exponential, Log-linear, Gibbs models, MaxEnt

$$P(t) = \frac{1}{Z} \exp\bigl(w \cdot f(t)\bigr), \qquad Z = \sum_{t \in T} \exp\bigl(w \cdot f(t)\bigr) \quad \text{(partition function)}$$

$$\log P(t) = w \cdot f(t) - \log Z$$

    – Perceptron algorithm (including averaged version)
    – Support Vector Machines
    – Boosted decision stubs

SLIDE 10

PCFGs are exponential models

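fj(t) = number of times the jth rule is used in t
wj = log pj, where pj is the probability of the jth rule

For the tree t = (S (NP rice) (VP grows)):

$$f(t) = [\,\underbrace{1}_{\text{S} \to \text{NP VP}},\ \underbrace{1}_{\text{NP} \to \text{rice}},\ \underbrace{0}_{\text{NP} \to \text{bananas}},\ \underbrace{1}_{\text{VP} \to \text{grows}},\ \underbrace{0}_{\text{VP} \to \text{grow}}\,]$$

$$P_{\text{PCFG}}(t) = \prod_j p_j^{f_j(t)} = \prod_j \exp(w_j)^{f_j(t)} = \prod_j \exp\bigl(w_j f_j(t)\bigr) = \exp \sum_j w_j f_j(t) = \exp\bigl(w \cdot f(t)\bigr)$$

So a PCFG is just a special kind of exponential model with Z = 1.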
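The identity can be checked numerically with a small sketch. The rule probabilities below are hypothetical values chosen only for illustration (the slide gives none); the rule inventory and the tree for "rice grows" follow the example.

```python
import math

# Hypothetical rule probabilities (not from the slides), for illustration only.
rule_probs = {
    "S -> NP VP": 1.0,
    "NP -> rice": 0.6,
    "NP -> bananas": 0.4,
    "VP -> grows": 0.7,
    "VP -> grow": 0.3,
}
rules = list(rule_probs)
w = [math.log(rule_probs[r]) for r in rules]  # w_j = log p_j

# f(t) for the tree (S (NP rice) (VP grows)): rule counts in the order of `rules`
f = [1, 1, 0, 1, 0]

# Product of rule probabilities raised to their counts ...
p_product = math.prod(rule_probs[r] ** fj for r, fj in zip(rules, f))
# ... equals exp(w · f(t)), with no normalizing constant needed (Z = 1).
p_exponential = math.exp(sum(wj * fj for wj, fj in zip(w, f)))
print(p_product, p_exponential)  # both approximately 0.42
```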

SLIDE 11

Features in linear ranking models

  • Features can be any real-valued function of parse t and sentence s
    – counts of number of times a particular structure appears in t
    – log probabilities from other models
      ∗ log Pc(t) is our most useful feature!
      ∗ generalizes reference distributions of MaxEnt models
  • Subtracting a constant c(s) from a feature’s value doesn’t affect the difference between parse scores in a linear model

$$w \cdot (f(t_1) - c(s)) - w \cdot (f(t_2) - c(s)) = w \cdot f(t_1) - w \cdot f(t_2)$$

    – features that don’t vary on Tc(s) are useless
    – subtract the most frequently occurring value cj(s) for each feature fj in sentence s ⇒ sparser feature vectors
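A minimal sketch of the sparsification trick follows. The candidate feature vectors and the helper name `sparsify` are hypothetical; the point is only that subtracting each feature's most frequent value within a sentence leaves score differences unchanged while zeroing most entries.

```python
from collections import Counter

def sparsify(feature_vectors):
    """Subtract, per feature, its most frequent value among one sentence's candidates.

    Returns one {feature index: shifted value} dict per candidate,
    keeping only the entries that are nonzero after the shift.
    """
    m = len(feature_vectors[0])
    modes = [Counter(f[j] for f in feature_vectors).most_common(1)[0][0]
             for j in range(m)]
    return [{j: f[j] - modes[j] for j in range(m) if f[j] != modes[j]}
            for f in feature_vectors]

# Hypothetical feature vectors for three candidate parses of one sentence.
cands = [(4, 0, 2), (4, 1, 2), (4, 0, 3)]
sparse = sparsify(cands)
print(sparse)  # [{}, {1: 1}, {2: 1}]
```

Feature 0 never varies across the candidates, so it disappears entirely, matching the remark that such features are useless for ranking.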

SLIDE 12

Getting the feature weights

s             f(t⋆(s))     {f(t) : t ∈ Tc(s), t ≠ t⋆(s)}
sentence 1    (1, 3, 2)    (2, 2, 3) (3, 1, 5) (2, 6, 3)
sentence 2    (7, 2, 1)    (2, 5, 5)
sentence 3    (2, 4, 2)    (1, 1, 7) (7, 2, 1)
. . .         . . .        . . .

  • n-best parser produces trees Tc(s) for each sentence s
  • Treebank gives correct tree t⋆(s) ∈ Tc(s) for sentence s
  • Feature functions f apply to each tree t ∈ Tc(s), producing feature values f(t) = (f1(t), . . . , fm(t))
  • Machine learning algorithm selects feature weights w to prefer t⋆(s) (e.g., so w · f(t⋆(s)) is greater than w · f(t′) for other t′ ∈ Tc(s))

SLIDE 13

Conditional ML estimation of w

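  • Conditional ML estimation selects w to make t⋆(s) as likely as possible compared to the trees in Tc(s)
  • Same as conditional MaxEnt estimation

$$P_w(t \mid s) = \frac{1}{Z_w(s)} \exp\bigl(w \cdot f(t)\bigr) \quad \text{(exponential model)}$$

$$Z_w(s) = \sum_{t' \in T_c(s)} \exp\bigl(w \cdot f(t')\bigr)$$

$$D = ((s_1, t^\star_1), \ldots, (s_n, t^\star_n)) \quad \text{(treebank training data)}$$

$$L_D(w) = \prod_{i=1}^{n} P_w(t^\star_i \mid s_i) \quad \text{(conditional likelihood of } D\text{)}$$

$$\hat{w} = \operatorname{argmax}_{w} L_D(w)$$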
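As a rough sketch, the log of this conditional likelihood and its gradient can be computed directly, because each Tc(s) is a short list. The data below are the three example sentences from the "Getting the feature weights" table; the function names are illustrative, and the gradient formula used (observed minus expected feature values) is the standard one for exponential models rather than something stated on the slide.

```python
import math

# (f(t*), [f(t) for the other candidates]) per sentence, from the example table
data = [
    ((1, 3, 2), [(2, 2, 3), (3, 1, 5), (2, 6, 3)]),
    ((7, 2, 1), [(2, 5, 5)]),
    ((2, 4, 2), [(1, 1, 7), (7, 2, 1)]),
]

def dot(w, f):
    return sum(wj * fj for wj, fj in zip(w, f))

def candidate_probs(w, cands):
    """P_w(t | s) for every candidate, computed stably by shifting the scores."""
    scores = [dot(w, f) for f in cands]
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def log_cond_likelihood(w, data):
    """log L_D(w) = sum_i log P_w(t*_i | s_i); the gold parse is candidate 0."""
    return sum(math.log(candidate_probs(w, [gold] + others)[0])
               for gold, others in data)

def gradient(w, data):
    """d log L_D / d w_j = sum_i ( f_j(t*_i) - E_w[f_j | s_i] )."""
    grad = [0.0] * len(w)
    for gold, others in data:
        cands = [gold] + others
        probs = candidate_probs(w, cands)
        for j in range(len(w)):
            grad[j] += gold[j] - sum(p * f[j] for p, f in zip(probs, cands))
    return grad

w0 = [0.0, 0.0, 0.0]
print(log_cond_likelihood(w0, data))  # log(1/4) + log(1/2) + log(1/3)
print(gradient(w0, data))
```

With all weights at zero every candidate is equally likely, so the likelihood is simply the product of 1/|Tc(s)| over the sentences; a gradient step from there raises it.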

SLIDE 14

(Joint) MLE for exponential models is hard

[Figure: t⋆i in T]

$$D = (t^\star_1, \ldots, t^\star_n)$$

$$L_D(w) = \prod_{i=1}^{n} P_w(t^\star_i)$$

$$\hat{w} = \operatorname{argmax}_{w} L_D(w)$$

$$P_w(t) = \frac{1}{Z_w} \exp\bigl(w \cdot f(t)\bigr), \qquad Z_w = \sum_{t' \in T} \exp\bigl(w \cdot f(t')\bigr)$$

  • Joint MLE selects w to make t⋆i as likely as possible
  • T is set of all possible parses for all possible strings
  • T is infinite ⇒ cannot be enumerated ⇒ Zw cannot be calculated
  • For a PCFG, Zw and hence $\hat{w}$ are easy to calculate, but . . .
  • in general ∂LD/∂wj and Zw are intractable analytically and numerically
  • Abney (1997) suggests a Monte-Carlo calculation method

SLIDE 15

Conditional MLE is easier

  • The conditional likelihood of w is the conditional probability of the hidden part of the data (syntactic structure) t⋆ given its visible part (yield or terminal string) s
  • The conditional likelihood can be numerically optimized because Tc(s) can be enumerated (by a parser)

[Figure: t⋆i in T(si), within T]

$$D = ((t^\star_1, s_1), \ldots, (t^\star_n, s_n))$$

$$L_D(w) = \prod_{i=1}^{n} P_w(t^\star_i \mid s_i)$$

$$\hat{w} = \operatorname{argmax}_{w} L_D(w)$$

$$P_w(t \mid s) = \frac{1}{Z_w(s)} \exp\bigl(w \cdot f(t)\bigr), \qquad Z_w(s) = \sum_{t' \in T_c(s)} \exp\bigl(w \cdot f(t')\bigr)$$

SLIDE 16

Conditional vs joint estimation

  • Joint MLE maximizes probability of training trees and strings
    – Generative statistical parsers usually use joint MLE
    – Joint MLE is simple to compute (relative frequency)
  • Conditional MLE maximizes probability of trees given strings
    – Conditional estimation uses less information from the data
    – learns nothing from distribution of strings
    – ignores unambiguous sentences (!)

P(t, s) = P(t|s)P(s)

  • Joint MLE should be better (lower variance) if your model correctly predicts the distribution of parses and strings
    – Any good probabilistic models of semantics and discourse?

SLIDE 17

Conditional vs joint MLE for PCFGs

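[Figure: training trees
  100× (VP (V run))
  VP attachment: (VP (VP (V see) (NP (N people))) (PP (P with) (NP (N telescopes))))
  NP attachment: (VP (V see) (NP (NP (N people)) (PP (P with) (NP (N telescopes)))))
 Probability factors marked on the trees: . . . × 2/105 × . . . and . . . × 1/7 × . . . (relative frequency); . . . × 2/7 × . . . and . . . × 1/7 × . . . (better values)]

Rule          count    rel freq    better vals
VP → V        100      100/105     4/7
VP → V NP     3        3/105       1/7
VP → VP PP    2        2/105       2/7
NP → N        6        6/7         6/7
NP → NP PP    1        1/7         1/7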
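The effect of the two sets of rule probabilities can be worked through with exact fractions. This is a sketch under one assumption: the lexical rules and PP → P NP are shared by both parses of "see people with telescopes", so they cancel in the conditional probability and are left out. The rule counts in the table imply that VP → VP PP was used twice and NP → NP PP once in the training trees.

```python
from fractions import Fraction as F

rel_freq = {"VP->V": F(100, 105), "VP->V NP": F(3, 105), "VP->VP PP": F(2, 105),
            "NP->N": F(6, 7), "NP->NP PP": F(1, 7)}
better = {"VP->V": F(4, 7), "VP->V NP": F(1, 7), "VP->VP PP": F(2, 7),
          "NP->N": F(6, 7), "NP->NP PP": F(1, 7)}

# Phrasal rules used by the two parses (shared lexical rules omitted).
vp_attach = ["VP->VP PP", "VP->V NP", "NP->N", "NP->N"]
np_attach = ["VP->V NP", "NP->NP PP", "NP->N", "NP->N"]

def tree_prob(rules, probs):
    """Product of the probabilities of the rules used in a tree."""
    p = F(1)
    for r in rules:
        p *= probs[r]
    return p

def p_vp_attach_given_string(probs):
    """Conditional probability of the VP-attachment parse given the string."""
    a, b = tree_prob(vp_attach, probs), tree_prob(np_attach, probs)
    return a / (a + b)

print(p_vp_attach_given_string(rel_freq))  # 2/17
print(p_vp_attach_given_string(better))    # 2/3
```

Under the relative-frequency estimates the many `VP → V` trees drag down the probability of `VP → VP PP`, so the model prefers NP attachment; the "better" values give VP attachment a conditional probability of 2/3.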

SLIDE 18

Regularization

  • Overlearning ⇒ add regularization R that penalizes “complex” models
  • Useful with a wide range of objective functions

$$\hat{w} = \operatorname{argmin}_{w} Q(w) + R(w)$$

$$Q(w) = -\log L_D(w) \quad \text{(objective function)}$$

$$R(w) = c \sum_j |w_j|^p \quad \text{(regularizer)}$$

$$L_D(w) = \prod_i P_w(t^\star_i \mid s_i)$$

  • p = 2 known as the Gaussian prior
  • p = 1 known as the Laplacian or exponential prior
    – sparse solutions
    – requires special care in optimization (Kazama and Tsujii, 2003)
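The penalty term is simple enough to write out. The sketch below uses hypothetical weights and a hypothetical constant c; `regularized_objective` takes any function that plays the role of Q(w) = −log LD(w).

```python
def regularizer(w, c, p):
    """R(w) = c * sum_j |w_j|^p."""
    return c * sum(abs(wj) ** p for wj in w)

def regularized_objective(neg_log_likelihood, w, c, p=2):
    """Q(w) + R(w), where neg_log_likelihood(w) stands in for Q(w)."""
    return neg_log_likelihood(w) + regularizer(w, c, p)

w = [0.5, -2.0, 0.0]                 # hypothetical weights
print(regularizer(w, c=0.1, p=2))    # about 0.425  (0.1 * (0.25 + 4.0))
print(regularizer(w, c=0.1, p=1))    # about 0.25   (0.1 * (0.5 + 2.0))
```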

SLIDE 19

If candidate parses don’t include correct parse

  • If Tc(s) doesn’t include t⋆(s), choose parse t+(s) in Tc(s) closest to t⋆(s)
  • Maximize conditional likelihood of $(t^+_1, \ldots, t^+_n)$
  • Closest parse $t^+_i = \operatorname{argmax}_{t \in T(s_i)} F_{t^\star_i}(t)$
    – $F_{t^\star}(t)$ is f-score of t relative to t⋆
  • w chosen to maximize the regularized log conditional likelihood of $t^+_i$

$$L_D(w) = \prod_i P_w(t^+_i \mid s_i)$$

[Figure: t⋆i in T; t+i in Tc(si)]

SLIDE 20

Multiple closest parses

[Figure: t⋆i in T; Tc(si) and its subset $T_c^+(t^\star_i)$]

  • There can be more than one candidate parse in $T_c^+(t^\star_i)$ equally close to the correct parse t⋆i: which one(s) should we declare to be the best parse?
  • Picking a parse at random does not work as well as . . .
  • picking the parse with the highest Charniak parse probability, but . . .
  • maximizing probability of all close parses (EM-like scheme in Riezler ’02) works best of all

$$L_D(w) = \prod_i P_w\bigl(T_c^+(t^\star_i) \mid T_c(s_i)\bigr)$$

SLIDE 21

Likelihood of multiple best parses

  • Treebank $D = ((t^\star_1, s_1), \ldots, (t^\star_n, s_n))$
  • n-best candidates Tc(si) of sentence si
  • $T_c^+(t^\star_i)$ = trees in Tc(si) with max f-score
  • w chosen to maximize the regularized log conditional likelihood of $T_c^+(t^\star_i)$

[Figure: t⋆i in T; Tc(si) and its subset $T_c^+(t^\star_i)$]

$$L_D(w) = \prod_i P_w\bigl(T_c^+(t^\star_i) \mid T_c(s_i)\bigr) = \prod_i \frac{\sum_{t \in T_c^+(t^\star_i)} \exp\bigl(w \cdot f(t)\bigr)}{\sum_{t \in T_c(s_i)} \exp\bigl(w \cdot f(t)\bigr)}$$

  • ∂ log L/∂wj is a difference in expectations over $T_c^+(t^\star)$ and Tc(si)
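A small sketch of one sentence's contribution to this likelihood and of its gradient is given below. The candidate vectors and the choice of which candidates tie for the best f-score are hypothetical; the gradient is written as the difference between the expected feature vector over the best set and the expected feature vector over all candidates, both under the exponential model.

```python
import math

def dot(w, f):
    return sum(wj * fj for wj, fj in zip(w, f))

def _exp_scores(w, cands):
    """exp(w · f(t)) for each candidate, shifted by the max score for stability."""
    scores = [dot(w, f) for f in cands]
    mx = max(scores)
    return [math.exp(s - mx) for s in scores]

def log_p_best_set(w, cands, best_idx):
    """log P_w(T+ | Tc): log-mass of the best parses minus log-mass of all candidates."""
    exps = _exp_scores(w, cands)
    return math.log(sum(exps[i] for i in best_idx)) - math.log(sum(exps))

def grad_log_p_best_set(w, cands, best_idx):
    """E_w[f | T+] - E_w[f | Tc], one component per feature."""
    exps = _exp_scores(w, cands)
    z_all = sum(exps)
    z_best = sum(exps[i] for i in best_idx)
    grad = []
    for j in range(len(w)):
        e_best = sum(exps[i] * cands[i][j] for i in best_idx) / z_best
        e_all = sum(e * f[j] for e, f in zip(exps, cands)) / z_all
        grad.append(e_best - e_all)
    return grad

# Hypothetical candidates; the first two are assumed to tie for the best f-score.
cands = [(1, 3, 2), (2, 2, 3), (3, 1, 5)]
best_idx = [0, 1]
w = [0.0, 0.0, 0.0]
print(math.exp(log_p_best_set(w, cands, best_idx)))  # 2/3 when all weights are zero
print(grad_log_p_best_set(w, cands, best_idx))
```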

SLIDE 22

Features for ranking parses

  • Features can be any real-valued function of parse trees
  • In these experiments the features come in two kinds:
    – The logarithm of the tree’s probability estimated by the Charniak parser
    – The number of times a particular configuration appears in the parse

Which ones improve parsing accuracy the most?

SLIDE 23

Lexicalized and parent-annotated rules

  • Lexicalization associates each constituent with its head
  • Ancestor annotation provides a little “vertical context”
  • Context annotation indicates constructions that only occur in main clause (cf. Emonds)

(ROOT (S (NP (WDT That)) (VP (VBD went) (PP (IN over) (NP (NP (DT the) (JJ permissible) (NN line)) (PP (IN for) (NP (ADJP (JJ warm) (CC and) (JJ fuzzy)) (NNS feelings)))))) (. .)))

Annotations marked in the figure: Heads, Ancestor, Context, Rule

SLIDE 24

Functional and lexical heads

  • There are at least two sensible notions of head (cf. Grimshaw)
    – Functional heads: determiners of NPs, auxiliary verbs of VPs, etc.
    – Lexical heads: rightmost Ns of NPs, main verbs in VPs, etc.
  • In a MaxEnt model, it is easy to use both!

(S (NP (DT A) (NN record) (NN date)) (VP (VBZ has) (RB n’t) (VP (VBN been) (VP (VBN set)))) (. .))

Labels marked in the figure: functional, functional, lexical

SLIDE 25

Functional-lexical head dependencies

  • The SynSemHeads features collect pairs of functional and lexical heads of phrases
  • This captures number agreement in NPs and aspects of other head-to-head dependencies
  • Parameterized by lexicalization

(ROOT (S (NP (DT The) (NNS rules)) (VP (VBP force) (S (NP (NNS executives)) (VP (TO to) (VP (VB report) (NP (NNS purchases)))))) (. .)))

SLIDE 26

n-gram rule features generalize rules

  • Collects adjacent constituents in a local tree
  • Also includes relationship to head (e.g., adjacent? left or right?)
  • Parameterized by ancestor-annotation, lexicalization and head-type

(ROOT (S (NP (DT The) (NN clash)) (VP (AUX is) (NP (NP (DT a) (NN sign)) (PP (IN of) (NP (NP (DT a) (JJ new) (NN toughness) (CC and) (NN divisiveness)) (PP (IN in) (NP (NP (NNP Japan) (POS ’s)) (JJ once-cozy) (JJ financial) (NNS circles))))))) (. .)))

Label marked in the figure: Left of head, non-adjacent to head

SLIDE 27

Head to head dependencies

  • Head-to-head dependencies track the function-argument dependencies in a tree
  • Co-ordination leads to phrases with multiple heads or functors
  • Parameterized by head type, number of governors and lexicalization

(ROOT (S (NP (WDT That)) (VP (VBD went) (PP (IN over) (NP (NP (DT the) (JJ permissible) (NN line)) (PP (IN for) (NP (ADJP (JJ warm) (CC and) (JJ fuzzy)) (NNS feelings)))))) (. .)))

SLIDE 28

Head trees record all dependencies

  • Head trees consist of a (lexical) head, all of its projections and (optionally) all of the siblings of these nodes
  • correspond roughly to TAG elementary trees
  • parameterized by head type, number of sister nodes and lexicalization

(ROOT (S (NP (PRP They)) (VP (VBD were) (VP (VBN consulted) (PP (IN in) (NP (NN advance))))) (. .)))

SLIDE 29

Rightmost branch bias

  • The RightBranch feature’s value is the number of nodes on the right-most branch (ignoring punctuation) (cf. Charniak 00)
  • Reflects the tendency toward right branching in English
  • Only 2 different features, but very useful in final model!

(ROOT (S (NP (WDT That)) (VP (VBD went) (PP (IN over) (NP (NP (DT the) (JJ permissible) (NN line)) (PP (IN for) (NP (ADJP (JJ warm) (CC and) (JJ fuzzy)) (NNS feelings)))))) (. .)))
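One plausible reading of this feature can be sketched as follows. Several details are assumptions rather than statements from the slide: trees are encoded as nested tuples, preterminals are counted as nodes, the punctuation tag set is a guess, and the second of the "2 different features" is taken to be the count of nodes that are not on the right-most branch.

```python
PUNCT_TAGS = {".", ",", ":", "``", "''"}  # assumed punctuation tags

def is_preterminal(node):
    """A preterminal is a (tag, word) pair; other nodes are (label, child, ...)."""
    return len(node) == 2 and isinstance(node[1], str)

def is_punct(node):
    return is_preterminal(node) and node[0] in PUNCT_TAGS

def count_nodes(node):
    """Number of non-punctuation nodes in the tree."""
    if is_preterminal(node):
        return 0 if is_punct(node) else 1
    return 1 + sum(count_nodes(child) for child in node[1:])

def right_branch_length(node):
    """Nodes on the path from the root to the right-most non-punctuation preterminal."""
    if is_preterminal(node):
        return 1
    children = [c for c in node[1:] if not is_punct(c)]
    return 1 + right_branch_length(children[-1])

tree = (
    "ROOT",
    ("S",
     ("NP", ("WDT", "That")),
     ("VP",
      ("VBD", "went"),
      ("PP",
       ("IN", "over"),
       ("NP",
        ("NP", ("DT", "the"), ("JJ", "permissible"), ("NN", "line")),
        ("PP",
         ("IN", "for"),
         ("NP",
          ("ADJP", ("JJ", "warm"), ("CC", "and"), ("JJ", "fuzzy")),
          ("NNS", "feelings")))))),
     (".", ".")),
)
on_branch = right_branch_length(tree)
off_branch = count_nodes(tree) - on_branch
print(on_branch, off_branch)  # 8 13
```

For the example sentence the right-most branch runs ROOT, S, VP, PP, NP, PP, NP, NNS, giving 8 nodes on the branch and 13 off it under these counting assumptions.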

SLIDE 30

Constituent Heavyness and location

  • Heavyness measures the constituent’s category, its (binned) size and (binned) closeness to the end of the sentence

(ROOT (S (NP (WDT That)) (VP (VBD went) (PP (IN over) (NP (NP (DT the) (JJ permissible) (NN line)) (PP (IN for) (NP (ADJP (JJ warm) (CC and) (JJ fuzzy)) (NNS feelings)))))) (. .)))

Labels marked in the figure: > 5 words, =1 punctuation

SLIDE 31

Coordination parallelism (1)

  • A CoPar feature indicates the depth to which adjacent conjuncts are parallel

(ROOT (S (NP (PRP They)) (VP (VP (VBD were) (VP (VBN consulted) (PP (IN in) (NP (NN advance))))) (CC and) (VP (VBD were) (VP (VBN surprised) (PP (IN at) (NP (NP (DT the) (NN action)) (VP (VBN taken))))))) (. .)))

Label marked in the figure: Isomorphic trees to depth 4

SLIDE 32

Coordination parallelism (2)

  • The CoLenPar feature indicates the difference in length between adjacent conjuncts and whether this pair contains the last conjunct.

(ROOT (S (NP (PRP They)) (VP (VP (VBD were) (VP (VBN consulted) (PP (IN in) (NP (NN advance))))) (CC and) (VP (VBD were) (VP (VBN surprised) (PP (IN at) (NP (NP (DT the) (NN action)) (VP (VBN taken))))))) (. .)))

Labels marked in the figure: 4 words (first conjunct), 6 words (second conjunct); CoLenPar feature: (2,true)
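The feature value in this example can be reproduced with a tiny sketch. The function name, the signed (second minus first) difference and the absence of binning are all assumptions; the slide only shows that conjuncts of 4 and 6 words, where the pair includes the last conjunct, yield (2,true).

```python
def colenpar(conjunct_lengths, i):
    """Value for adjacent conjuncts i and i+1:
    (length difference, whether the pair contains the last conjunct)."""
    diff = conjunct_lengths[i + 1] - conjunct_lengths[i]
    contains_last = (i + 1 == len(conjunct_lengths) - 1)
    return (diff, contains_last)

# "were consulted in advance" (4 words) and "were surprised at the action taken" (6 words)
print(colenpar([4, 6], 0))  # (2, True)
```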

SLIDE 33

Word

  • A Word feature is a word plus n of its parents (cf. Klein and Manning’s non-lexicalized PCFG)
  • A WProj feature is a word plus all of its (maximal projection) parents, up to its governor’s maximal projection

(ROOT (S (NP (WDT That)) (VP (VBD went) (PP (IN over) (NP (NP (DT the) (JJ permissible) (NN line)) (PP (IN for) (NP (ADJP (JJ warm) (CC and) (JJ fuzzy)) (NNS feelings)))))) (. .)))

SLIDE 34

Neighbours

  • A Neighbours feature indicates the node’s category, its binned length and j left and k right POS tags for j, k ≤ 1

(ROOT (S (NP (WDT That)) (VP (VBD went) (PP (IN over) (NP (NP (DT the) (JJ permissible) (NN line)) (PP (IN for) (NP (ADJP (JJ warm) (CC and) (JJ fuzzy)) (NNS feelings)))))) (. .)))

Label marked in the figure: > 5 words

SLIDE 35

Tree n-gram

  • A tree n-gram feature is a tree fragment that connects sequences of n adjacent words, for n = 2, 3, 4 (cf. Bod’s DOP models)
  • lexicalized and non-lexicalized variants

(ROOT (S (NP (WDT That)) (VP (VBD went) (PP (IN over) (NP (NP (DT the) (JJ permissible) (NN line)) (PP (IN for) (NP (ADJP (JJ warm) (CC and) (JJ fuzzy)) (NNS feelings)))))) (. .)))

SLIDE 36

Experimental setup

  • Feature tuning experiments done using Collins’ split: sections 2-19 as train, 20-21 as dev and 24 as test
  • Tc(s) computed using Charniak 50-best parser
  • Features which vary on fewer than 5 sentences pruned
  • Optimization performed using LMVM optimizer from the PETSc/TAO optimization package
  • Regularizer constant c adjusted to maximize f-score on dev

SLIDE 37

f-score vs. n-best beam size

[Figure: oracle f-score (0.9 to 0.98) vs. beam size (10 to 50)]

  • F-score of Charniak’s most probable parse = 0.896
  • Oracle f-score of Charniak’s 50-best parses = 0.965 (66% error reduction)
  • oracle f-score continues to rise at wide beam widths
  • no guarantee that reranker performance improves with beam width!
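The 66% figure follows from treating 1 − f-score as the error. A short check, with an illustrative function name:

```python
def error_reduction(baseline_f, new_f):
    """Fraction of the baseline's f-score error (1 - f) that is eliminated."""
    return (new_f - baseline_f) / (1.0 - baseline_f)

# 1-best f-score 0.896 vs. 50-best oracle f-score 0.965
print(round(100 * error_reduction(0.896, 0.965)))  # 66
```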

SLIDE 38

Rank of best parse

[Figure: fraction of sentences (0.1 to 0.5) vs. rank of best parse in n-best list (10 to 50)]

  • Charniak parser’s most likely parse is the best parse 41% of the time
  • Reranker picks Charniak parser’s most likely parse 58% of the time

SLIDE 39

Evaluating features

  • The feature weights are not that indicative of how important a feature is
  • The MaxEnt ranker with regularizer tuning takes approx 1 day to train
  • The averaged perceptron algorithm takes approximately 2 minutes
    – used in experiments comparing different sets of features
    – all closest parses $T_c^+(t^\star)$ count as “correct”
    – used to compare models with the following features: NLogP, Rule, NGram, Word, WProj, RightBranch, Heavy, NGramTree, HeadTree, Heads, Neighbours, CoPar, CoLenPar
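A minimal sketch of an averaged perceptron for reranking is shown below. It is a generic textbook version, not the code used in these experiments: the update rule (move toward the highest-scoring correct candidate, away from the current guess), the epoch count and the function names are assumptions. The data reuse the feature vectors from the "Getting the feature weights" table, with the correct parse listed last so that ties at w = 0 are not resolved in its favour.

```python
def dot(w, f):
    return sum(wj * fj for wj, fj in zip(w, f))

def predict(w, cands):
    """Index of the highest-scoring candidate (the first one wins ties)."""
    scores = [dot(w, f) for f in cands]
    return max(range(len(cands)), key=scores.__getitem__)

def averaged_perceptron(data, n_features, epochs=10):
    """data: list of (candidates, correct), where correct is the set of
    candidate indices that count as correct for that sentence."""
    w = [0.0] * n_features
    total = [0.0] * n_features  # running sum of w after every example
    steps = 0
    for _ in range(epochs):
        for cands, correct in data:
            guess = predict(w, cands)
            if guess not in correct:
                target = max(correct, key=lambda i: dot(w, cands[i]))
                w = [wj + a - b for wj, a, b in zip(w, cands[target], cands[guess])]
            total = [tj + wj for tj, wj in zip(total, w)]
            steps += 1
    return [tj / steps for tj in total]

data = [
    ([(2, 2, 3), (3, 1, 5), (2, 6, 3), (1, 3, 2)], {3}),
    ([(2, 5, 5), (7, 2, 1)], {1}),
    ([(1, 1, 7), (7, 2, 1), (2, 4, 2)], {2}),
]
w_avg = averaged_perceptron(data, n_features=3)
print(w_avg)
print([predict(w_avg, cands) in correct for cands, correct in data])  # [True, True, True]
```

Passing a set of correct indices mirrors the point that all closest parses count as "correct".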

SLIDE 40

Adding one feature class

  • Averaged perceptron baseline with only base parser log prob feature
    – section 20–21 f-score = 0.894913
    – section 24 f-score = 0.889901

SLIDE 41

Subtracting one feature class

  • Averaged perceptron baseline with all features
    – section 20–21 f-score = 0.906806
    – section 24 f-score = 0.902782

SLIDE 42

Feature selection is hard

[Figure: averaged perceptron feature selection; axes: f-score on sections 20-21 and f-score on section 24]

  • Greedy feature selection using averaged perceptron optimizing f-score on sec 20–21
  • All models also evaluated on section 24

SLIDE 43

Comparing estimators

  • Training on sections 2–19, regularizer tuned on 20–21, evaluate on 24

Estimator                    # features    sec 20-21    sec 24
exponential model, p = 2     670,688       0.9085       0.9037
exponential model, p = 1     14,549        0.9078       0.9024 (p = 0.137)
averaged perceptron          523,374       0.9068       0.9028 (p = 0.528)
expected f-score             670,688       0.9084       0.9029 (p = 0.313)

  • Because the exponential model with p = 2 is usually the first model I test a new feature on, the features may be biased to work well with it.

SLIDE 44

Averaged perceptron vs Exponential model

[Figure: axes: f-score on sections 20-21 and f-score on section 24; series: Exponential model, tuning regularizer constant c; Averaged perceptron w/ randomized data]

  • Multiple runs of averaged perceptron on data in random order
  • Exponential model p = 2 adjusting regularizer weight c

SLIDE 45

Class scaling with averaged perceptron

[Figure: averaged perceptron with scaled feature values; axes: f-score on sections 20-21 and f-score on section 24]

  • Every feature class is associated with its own scaling factor
  • Scaling factors adjusted to maximize averaged perceptron f-score on sec 20-21
  • (Different features to other experiments)

SLIDE 46

Expected f-score

  • The expected f-score is computed by calculating the expected number of nodes and the expected number of correct nodes of the parse trees in the corpus under the exponential model
  • This should take the size of the sentence into account during training
  • The expected f-score can be calculated and differentiated with respect to w

[Figure: f-score (0.9 to 0.912) vs. regularizer constant c (2e-07 to 1.4e-06); curves for section 24 and sections 20-21]
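One way to read this description is sketched below. The exact formula is an assumption: the corpus-level expected f-score is taken to be 2 · E[correct nodes] / (E[nodes in chosen parses] + nodes in gold parses), with expectations under the conditional exponential model. The data layout and the toy corpus are hypothetical.

```python
import math

def expected_f_score(w, corpus):
    """corpus: list of (gold_nodes, candidates); each candidate is a tuple
    (feature_vector, n_nodes, n_correct_nodes)."""
    exp_correct = exp_proposed = gold_total = 0.0
    for gold_nodes, cands in corpus:
        scores = [sum(wj * fj for wj, fj in zip(w, f)) for f, _, _ in cands]
        mx = max(scores)
        exps = [math.exp(s - mx) for s in scores]
        z = sum(exps)
        for e, (_, n_nodes, n_correct) in zip(exps, cands):
            exp_proposed += e / z * n_nodes    # expected number of nodes
            exp_correct += e / z * n_correct   # expected number of correct nodes
        gold_total += gold_nodes
    return 2.0 * exp_correct / (exp_proposed + gold_total)

# Hypothetical one-sentence corpus: gold parse has 10 nodes, two candidates.
corpus = [(10, [((1, 0), 10, 10), ((0, 1), 10, 8)])]
print(expected_f_score([0.0, 0.0], corpus))  # about 0.9 with uniform weights
print(expected_f_score([5.0, 0.0], corpus))  # closer to 1.0
```

Because the value is a smooth function of w, it can in principle be handed to a gradient-based optimizer in place of the log likelihood.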

SLIDE 47

Results on all training data

  • Features must vary on parses of at least 5 sentences in training data
  • In this experiment, 730,134 features
  • Exponential model trained on sections 2-21
  • Gaussian regularization p = 2, constant selected to optimize f-score on section 24

  • On section 23: recall = 90.78, precision = 91.51, f-score = 91.15
  • Will be available on the web this week

SLIDE 48

Conclusion and future work

  • Good features and a good machine learning algorithm can produce a state-of-the-art parser
  • Good candidate trees are a big help!
  • The parse ranking framework lets us explore lots of different kinds of features – what a pity it’s not clear which ones are important
  • Future work
    – different kinds of information (prosody, morphology, word classes)
    – richer representations (empty nodes, predicate-argument structures)
    – build discriminatively-estimated features back into Charniak parser

SLIDE 49

Sample parser errors

S NP PRP He ‘‘ ‘‘ VP MD will RB not VP AUX be VP VBN shaken PRT RP out PP IN by NP JJ external NNS events , , ADVP RB however S ADJP JJ surprising , , JJ alarming CC or JJ vexing : ... . .

S NP PRP He ‘‘ ‘‘ VP MD will RB not VP AUX be VP VBN shaken PRT RP out PP IN by NP NP JJ external NNS events , , ADJP RB however JJ surprising , , JJ alarming CC or JJ vexing : ... . .

SLIDE 50

S NP JJ Soviet NNS leaders VP VBD said SBAR S NP PRP they VP MD would VP VB support NP PRP$ their NNP Kabul NNS clients PP IN by NP NP DT all NNS means ADJP JJ necessary : -- CC and AUX did . .

S NP JJ Soviet NNS leaders VP VP VBD said SBAR S NP PRP they VP MD would VP VB support NP PRP$ their NNP Kabul NNS clients PP IN by NP NP DT all NNS means ADJP JJ necessary : -- CC and VP AUX did . .

SLIDE 51

S NP NNP Kia VP AUX is NP NP DT the ADJP RBS most JJ aggressive PP IN of NP NP DT the NNP Korean NNP Big NNP Three PP IN in NP NN offering NN financing . .

S NP NNP Kia VP AUX is NP NP DT the RBS most JJ aggressive PP IN of NP DT the NNP Korean NNP Big NNP Three PP IN in S VP VBG offering NP NN financing . .

SLIDE 52

S ADVP NP CD Two NNS years RB ago , , NP DT the NN district VP VBD decided S VP TO to VP VB limit NP DT the NNS bikes S VP TO to VP VB fire NP NNS roads PP IN in NP PRP$ its CD 65,000 JJ hilly NNS acres . .

S ADVP NP CD Two NNS years IN ago , , NP DT the NN district VP VBD decided S VP TO to VP VB limit NP DT the NNS bikes PP TO to NP NP NN fire NNS roads PP IN in NP PRP$ its CD 65,000 JJ hilly NNS acres . .

SLIDE 53

S NP DT The NN company ADVP RB also VP VBD pleased NP NNS analysts PP IN by S VP VBG announcing NP NP CD four JJ new NN store NNS openings VP VBN planned PP IN for NP JJ fiscal CD 1990 , , S VP VBG ending NP JJ next NNP August . .

S NP DT The NN company ADVP RB also VP VBD pleased NP NNS analysts PP IN by S VP VBG announcing NP NP CD four JJ new NN store NNS openings VP VBN planned PP IN for NP NP JJ fiscal CD 1990 , , VP VBG ending NP JJ next NNP August . .

SLIDE 54

S CC But NP NNS funds ADVP RB generally VP AUX are VP ADVP RB better VBN prepared NP DT this NN time RP around . .

S CC But NP NNS funds ADVP RB generally VP AUX are ADJP RBR better JJ prepared ADVP NP DT this NN time RB around . .

SLIDE 55

S NP DT The NNP U.S. VP VBD said SBAR S NP PRP it VP MD would ADVP RB fully VP VP VB support NP DT the NN resistance : -- CC and VP AUX did RB n’t . .

S NP DT The NNP U.S. VP VP VBD said SBAR S NP PRP it VP MD would VP ADVP RB fully VB support NP DT the NN resistance : -- CC and VP AUX did RB n’t . .

SLIDE 56

Significance testing (av. perceptron)

comparing exponential model p = 2 with averaged perceptron

nsentences = 1345 in test corpus.
model 1 nfeatures = 670688, corpus f-score = 0.9037
model 2 nfeatures = 670688, corpus f-score = 0.902782
permutation test significance of corpus f-score difference = 0.58234
model 1 better on 214 = 15.9108% sentences
model 2 better on 170 = 12.6394% sentences
models 1 and 2 tied on 961 = 71% sentences
binomial 2-sided significance of sentence-by-sentence comparison = 0.0280806
bootstrap 95% confidence interval for model 1 f-scores = (0.897672 0.9096)
bootstrap 95% confidence interval for model 2 f-scores = (0.896832 0.908697)
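The sentence-by-sentence comparison looks like a two-sided sign test. The sketch below is an assumption about how such a figure could be computed (exact binomial with success probability 0.5, ties discarded), not a description of the tool that produced this output.

```python
from math import comb

def sign_test_two_sided(wins_1, wins_2):
    """Exact two-sided binomial (sign) test with p = 0.5; tied sentences are ignored."""
    n = wins_1 + wins_2
    k = min(wins_1, wins_2)
    tail = sum(comb(n, i) for i in range(k + 1))  # P(X <= k) * 2**n
    return min(1.0, 2 * tail / 2 ** n)

# model 1 better on 214 sentences, model 2 better on 170
print(sign_test_two_sided(214, 170))
```

The printed value can be compared with the reported significance of 0.0280806.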

SLIDE 57

Significance testing (p=1)

comparing exponential models p = 2 with p = 1

nsentences = 1345 in test corpus.
model 1 nfeatures = 670688, corpus f-score = 0.9037
model 2 nfeatures = 670688, corpus f-score = 0.902357
permutation test significance of corpus f-score difference = 0.22695
model 1 better on 121 = 8.99628% sentences
model 2 better on 98 = 7.28625% sentences
models 1 and 2 tied on 1126 = 83% sentences
binomial 2-sided significance of sentence-by-sentence comparison = 0.136934
bootstrap 95% confidence interval for model 1 f-scores = (0.897672 0.9096)
bootstrap 95% confidence interval for model 2 f-scores = (0.896315 0.908321)

SLIDE 58

Significance testing (expected f-score)

comparing exponential model p = 2 with expected f-score

nsentences = 1345 in test corpus.
model 1 nfeatures = 670688, corpus f-score = 0.9037
model 2 nfeatures = 670688, corpus f-score = 0.902865
permutation test significance of corpus f-score difference = 0.59533
model 1 better on 169 = 12.5651% sentences
model 2 better on 150 = 11.1524% sentences
models 1 and 2 tied on 1026 = 76% sentences
binomial 2-sided significance of sentence-by-sentence comparison = 0.313546
bootstrap 95% confidence interval for model 1 f-scores = (0.897672 0.9096)
bootstrap 95% confidence interval for model 2 f-scores = (0.89686 0.908797)

SLIDE 59

Features from correct/incorrect parses only

  • Features that varied on fewer than 5 sentences were pruned
  • Exponential model, p = 2

Source              # features    20-21 f-score    24 f-score
All parses          670,688       0.9085           0.9037
Correct parses      173,409       0.9087           0.9043
Incorrect parses    670,544       0.9085           0.9036
