Some Experiments on Indicators of Parsing Complexity for Lexicalized Grammars - PowerPoint PPT Presentation



SLIDE 1

Some Experiments on Indicators of Parsing Complexity for Lexicalized Grammars

Anoop Sarkar, Fei Xia and Aravind Joshi

  • Dept. of Computer and Information Sciences, University of Pennsylvania

{anoop,fxia,joshi}@linc.cis.upenn.edu

SLIDE 2

Lexicalized Tree Adjoining Grammars

[Figure: elementary trees anchored by Ms. Haag (NP), plays (S tree with NP substitution sites for subject and object), and Elianti (NP)]

These trees can be combined to parse the sentence Ms. Haag plays Elianti.
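The combination of elementary trees can be sketched in code. Below is a minimal illustration of substitution only; the tree encoding and the `substitute` helper are illustrative assumptions, not the paper's parser.

```python
# Illustrative sketch only: an elementary tree is (label, children); a label
# ending in "!" marks a substitution site; a bare string is a lexical anchor.

def substitute(tree, site, arg):
    """Replace the first substitution site `site` (e.g. 'NP!') in `tree`
    with the initial tree `arg`. Returns (new_tree, replaced_flag)."""
    label, children = tree
    if label == site:
        return arg, True
    out, replaced = [], False
    for child in children:
        if not replaced and isinstance(child, tuple):
            child, replaced = substitute(child, site, arg)
        out.append(child)
    return (label, out), replaced

# Schematic elementary trees for "Ms. Haag plays Elianti":
ms_haag = ("NP", [("NNP", ["Ms."]), ("NNP", ["Haag"])])
elianti = ("NP", [("NNP", ["Elianti"])])
plays = ("S", [("NP!", []), ("VP", [("VBZ", ["plays"]), ("NP!", [])])])

t, _ = substitute(plays, "NP!", ms_haag)  # substitute the subject NP
t, _ = substitute(t, "NP!", elianti)      # substitute the object NP
print(t)
```

The result is the full parse tree for the sentence, with both argument slots filled.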

SLIDE 3

Important Properties of LTAG wrt Parsing

  • Predicate-argument structure is represented in each elementary tree.
  • Adjunction instead of feature unification.
  • No recursive feature structures; feature structures (FSs) are bounded.

SLIDE 4

Important Properties of LTAG wrt Parsing

  • Transformational relations for the same predicate-argument structure are precomputed.
  • Each predicate selects a family of elementary trees.
  • Different sources of issues for parsing efficiency.

SLIDE 5

Parsing Efficiency

  • Parsing accuracy: evaluations done in previous work.
  • Parsing efficiency: observed time complexity for producing all parses.
  • The usual notion: compare different parsing algorithms wrt time, space, number of edges, …
  • This paper: explore parsing efficiency from a viewpoint that is independent of a particular parsing algorithm.

SLIDE 6

Parsing Efficiency

  • Not an alternative to comparison of parsing algorithms.
  • An exploration of parsing efficiency from the perspective of a fully lexicalized grammar.
  • Sources of parsing complexity that are part of the input to the parsing algorithm.

SLIDE 7

Parsing Efficiency

  • We explore two issues: syntactic lexical ambiguity and clausal complexity.
  • The contention: for LTAGs these issues are relevant across all parsing algorithms.

SLIDE 8

Experiment: The Parser

  • Implementation of a head-corner chart-based parser.
  • Bi-directional, in the style of van Noord.
  • Produces a derivation forest as output.
  • Written in ANSI C; version available at ftp://ftp.cis.upenn.edu/xtag/pub/lem

SLIDE 9

Experiment: Input Grammar

Treebank Grammar extracted from Sections 02–21 of the WSJ Penn Treebank:
  • 6789 tree templates, 123039 lexicalized trees
  • number of word types in the lexicon is 44215
  • average number of trees per word is 2.78
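A quick arithmetic check of the lexicon figures above (numbers from the slide):

```python
# Restating the slide's lexicon statistics:
lexicalized_trees = 123039
word_types = 44215
avg_trees_per_word = lexicalized_trees / word_types
print(round(avg_trees_per_word, 2))  # → 2.78, the average reported above
```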

SLIDE 10


Number of trees selected by the words grouped by word frequency

SLIDE 11

Treebank Grammar and XTAG English Grammar

  • Compared TG with the XTAG Grammar, which has 1004 tree templates, 53 tree families and 1.8 million lexicalized trees.
  • 82.1% of template tokens in the Treebank grammar match a corresponding template in the XTAG grammar.
  • 14.0% are covered by the XTAG grammar but the templates look different because of different linguistic analyses.

SLIDE 12

Treebank Grammar and XTAG English Grammar

  • 1.1% of template tokens in the Treebank grammar are due to annotation errors.
  • The remaining 2.8% are not currently covered by the XTAG grammar.
  • A total of 96.1% of the structures in the Treebank grammar match up with structures in the XTAG grammar.
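A small sanity check that the coverage figures on these two slides are consistent: exact matches plus analysis differences give the 96.1% total, and the four categories account for all template tokens.

```python
# Coverage categories for Treebank-grammar template tokens (from the slides):
exact_match = 82.1     # match an XTAG template directly
diff_analysis = 14.0   # covered by XTAG, but analyzed differently
annotation_err = 1.1   # due to annotation errors
uncovered = 2.8        # not currently covered by XTAG

print(round(exact_match + diff_analysis, 1))  # → 96.1, the reported total
print(round(exact_match + diff_analysis + annotation_err + uncovered, 1))  # → 100.0
```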

SLIDE 13

Experiment: Test Corpus

  • input was a set of 2250 sentences
  • each sentence was 21 words or less
  • avg. sentence length was 12.3; number of tokens = 27715
  • output: shared forest of parses

SLIDE 14


Number of derivations per sentence

SLIDE 15


Parsing times per sentence. Coefficient of determination R² = 0.65
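The coefficient of determination reported here comes from an ordinary least-squares fit of log(parse time) against sentence length. A sketch of that computation, using hypothetical (length, time) pairs since the raw measurements are not in the slides:

```python
import math

def r_squared(xs, ys):
    """R^2 of a least-squares line fit of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical (sentence_length, parse_time_in_seconds) pairs:
data = [(4, 0.5), (8, 2.0), (10, 1.0), (14, 40.0), (16, 15.0), (20, 600.0)]
xs = [length for length, _ in data]
ys = [math.log(secs) for _, secs in data]  # fit in log space, as in the plot
print(round(r_squared(xs, ys), 2))
```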

SLIDE 16


Median parsing times per sentence

SLIDE 17

Hypothesis

  • There is a large variability in parse times.
  • The typical increase in time depending on sentence length is not observed.
  • Can a sentence predict its own parsing time?
  • Hypothesis: check the number of lexicalized trees that are selected by each sentence.
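The proposed predictor is simple to compute: sum, over the words of a sentence, the number of lexicalized trees each word selects in the grammar. A sketch with a toy lexicon (the per-word counts below are illustrative, not taken from the Treebank grammar):

```python
# Hypothetical per-word tree counts; verbs typically select many more trees
# than names, which is what makes this a plausible complexity indicator.
trees_per_word = {"Ms.": 2, "Haag": 2, "plays": 11, "Elianti": 2}
DEFAULT = 3  # fallback for unknown words (avg. trees per word is ~2.78)

def trees_selected(sentence):
    """Total number of lexicalized trees selected by a sentence."""
    return sum(trees_per_word.get(word, DEFAULT) for word in sentence.split())

print(trees_selected("Ms. Haag plays Elianti"))  # 2 + 2 + 11 + 2 = 17
```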

SLIDE 18


The impact of syntactic lexical ambiguity on parsing times. R² = 0.82 (previous = 0.65)

SLIDE 19

Hypothesis

To test the hypothesis further we did the following tests:

  – Check time taken when an oracle gives us the single correct tree for each word.
  – Check time taken after parsing based on the output of an n-best SuperTagger.
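The n-best SuperTagger test amounts to pruning the parser's input: before parsing, keep only the n highest-scoring elementary trees (supertags) per word. A sketch with invented tree names and scores (not the paper's SuperTagger):

```python
def filter_supertags(scored_trees, n):
    """Keep the n best-scoring supertags per word.

    scored_trees: {word: [(tree_name, score), ...]} -> {word: [tree_name, ...]}
    """
    return {
        word: [tree for tree, _ in
               sorted(trees, key=lambda pair: pair[1], reverse=True)[:n]]
        for word, trees in scored_trees.items()
    }

# Hypothetical SuperTagger output for two words:
scored = {
    "plays": [("S-transitive", 0.7), ("NP-noun", 0.2), ("S-intransitive", 0.1)],
    "Elianti": [("NP-name", 0.9), ("S-verb", 0.1)],
}
print(filter_supertags(scored, 2))
```

With n = 60 the slide reports total parsing time dropping from 548K seconds to 21K seconds on the test corpus.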

SLIDE 20
[Figure: log(time taken in secs) plotted against sentence length]

Parse times when the parser gets the correct tree for each word in the sentence. Total time = 31.2 secs vs. 548K secs (orig)

SLIDE 21
[Figure: log(time in secs) plotted against sentence length]

Time taken by the parser after n-best SuperTagging (n = 60). Total time = 21K secs vs. 548K secs (orig)

SLIDE 22

Clausal Complexity

  • The complexity of syntactic and semantic processing is related to the number of predicate-argument structures being computed for a given sentence.
  • This notion of complexity can be measured using the number of clauses in the sentence.
  • Does the number of clauses grow proportionally with sentence length?
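One way to operationalize the clause count, assuming (as is common for Penn Treebank material) that clauses correspond to S-rooted constituents such as S, SBAR, SINV and SQ in the bracketing:

```python
import re

def count_clauses(bracketing):
    """Count S-rooted nodes ('(S', '(SBAR', '(SINV', ...) in a
    Penn-Treebank-style bracketed parse string."""
    return len(re.findall(r"\(S[A-Z]*", bracketing))

parse = "(S (NP (NNP Ms.) (NNP Haag)) (VP (VBZ plays) (NP (NNP Elianti))))"
print(count_clauses(parse))  # → 1: a single clause
```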

SLIDE 23


Average number of clauses plotted against sentence length. 99.1% of sentences in the Penn Treebank contain 6 or fewer clauses.

SLIDE 24


Standard deviation of clause number plotted against sentence length. Increase in deviation for sentences longer than 50 words.

SLIDE 25


Variation in parse time against sentence length while identifying the number of clauses

SLIDE 26


Variation in parse time against number of trees. The parser spends most of its time attaching modifiers.

SLIDE 27

Conclusions

We explored two issues that affect parsing efficiency for LTAGs: syntactic lexical ambiguity and clausal complexity.

  – Parsing time for LTAGs is determined by the number of trees selected by a sentence.
  – The number of clauses does not grow proportionally with sentence length.

Current work: incorporate these factors to improve parsing efficiency for LTAGs.
