LEXICALIZED PARSING FOR DIFFERENT DOMAINS (Laura Rimell and Stephen Clark)



SLIDE 1

LEXICALIZED PARSING FOR DIFFERENT DOMAINS

Laura Rimell and Stephen Clark

10.12.2009

Daniel C. Müller

SLIDE 2

Hypothesis


Parser adaptation in the context of lexicalized grammar, applied to two different domains

SLIDE 3

domains

Biomedical domain

Questions from question answering

SLIDE 4

Lexicalized parser

POS tagging based on the Penn Treebank

Combinatory Categorial Grammar

+ manual annotation

SLIDE 5

Lexicalized parser

POS tagging based on the Penn Treebank

POS tag:

50 grammatical labels indicating part of speech

Each word: one POS tag

SLIDE 6

Lexicalized parser

POS tagging based on the Penn Treebank

Combinatory Categorial Grammar

+ manual annotation

Lexical categorization (supertagger)

425 categories

Each word: at least one category

Categories contain subcategorization information

A complex category like (S\NP)/NP means: it returns S\NP when applied to an NP on its right

SLIDE 7

Lexicalized parser

Example

Biomedical domain

  Word      POS tag   Lexical category
  Talin     NN        NP
  perhaps   RB        (S\NP)/(S\NP)
  acts      VBZ       (S[dcl]\NP)/PP
  as        IN        PP/NP
  a         DT        NP[nb]/N
  linkage   NN        N/N
  protein   NN        N
  .         .         .

Question domain

  Word      POS tag   Lexical category
  What      WDT       (S[wq]/(S[dcl]\NP))/N
  king      NN        N
  signed    VBD       (S[dcl]\NP)/NP
  the       DT        NP[nb]/N
  Magna     NNP       N/N
  Carta     NNP       N
  ?         .         .
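The word|POS notation in this example can be split mechanically and paired with the lexical categories. The following is a minimal sketch; the function name parse_tagged is hypothetical and is not part of any parser's API.

```python
from typing import List, Tuple


def parse_tagged(tagged: str, categories: str) -> List[Tuple[str, str, str]]:
    """Pair each word|POS token with its lexical category (hypothetical helper)."""
    # rsplit on the last "|" so that tokens such as ".|." split correctly.
    tokens = [tok.rsplit("|", 1) for tok in tagged.split()]
    cats = categories.split()
    if len(tokens) != len(cats):
        raise ValueError("need exactly one lexical category per token")
    return [(word, pos, cat) for (word, pos), cat in zip(tokens, cats)]


# The question-domain example from the slide.
rows = parse_tagged(
    "What|WDT king|NN signed|VBD the|DT Magna|NNP Carta|NNP ?|.",
    r"(S[wq]/(S[dcl]\NP))/N N (S[dcl]\NP)/NP NP[nb]/N N/N N .",
)
for word, pos, cat in rows:
    print(f"{word:8}{pos:6}{cat}")
```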

SLIDE 8

Lexicalized parser

POS tagging based on the Penn Treebank

Combinatory Categorial Grammar

+ manual annotation

Lexical categorization (supertagger)

Derivation (hierarchy)

Lexical categories + combinatory rules

Packed chart representation

Best derivation: Viterbi
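How a chart and a Viterbi search fit together can be sketched on a toy scale. The code below is an illustration under strong assumptions, not the parser described in the talk: it uses only forward and backward application, adds up hypothetical log scores for lexical categories, and keeps the best-scoring entry per category in each chart cell. The names combine and viterbi_chart and all scores are invented for the example.

```python
from typing import Dict, List, Optional, Union

# A category is an atomic string ("NP") or a triple (result, slash, argument).
Category = Union[str, tuple]
TV = (("S", "\\", "NP"), "/", "NP")  # (S\NP)/NP, transitive verb
IV = ("S", "\\", "NP")               # S\NP, intransitive verb


def combine(left: Category, right: Category) -> Optional[Category]:
    """Forward (X/Y Y => X) and backward (Y X\\Y => X) application only."""
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None


def viterbi_chart(lexical: List[Dict[Category, float]]) -> dict:
    """Fill a CKY-style chart, keeping the best score per category and span."""
    n = len(lexical)
    chart = {(i, i + 1): {cat: (score, None) for cat, score in cell.items()}
             for i, cell in enumerate(lexical)}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            best: dict = {}
            for k in range(i + 1, j):
                for lc, (ls, _) in chart[(i, k)].items():
                    for rc, (rs, _) in chart[(k, j)].items():
                        result = combine(lc, rc)
                        if result is None:
                            continue
                        score = ls + rs
                        if result not in best or score > best[result][0]:
                            # Backpointer: split point and the two child categories.
                            best[result] = (score, (k, lc, rc))
            chart[(i, j)] = best
    return chart


# "I drink coffee" with an ambiguous verb; the log scores are hypothetical.
lexical = [{"NP": -0.1}, {TV: -0.2, IV: -1.6}, {"NP": -0.1}]
chart = viterbi_chart(lexical)
best_score, backpointer = chart[(0, 3)]["S"]
print(best_score, backpointer)
```

In this toy input the transitive reading wins, because the intransitive category cannot combine with the object NP.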

SLIDE 9

Lexicalized parser

Example

  I     drink        coffee
  NP    (S\NP)/NP    NP

(S\NP)/NP needs an NP to the right:   (S\NP)/NP  NP  =>  S\NP

S\NP needs an NP to the left:         NP  S\NP  =>  S
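The two application steps of this derivation can be reproduced on category strings. This is a minimal sketch: it covers only forward and backward application and compares categories by exact string match, so features such as [dcl] or [nb] are not unified. All function names are hypothetical.

```python
from typing import Optional, Tuple


def strip_outer(cat: str) -> str:
    """Drop one pair of parentheses if it encloses the whole category."""
    if not (cat.startswith("(") and cat.endswith(")")):
        return cat
    depth = 0
    for i, ch in enumerate(cat):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0 and i < len(cat) - 1:
                return cat  # the first "(" closes before the end
    return cat[1:-1]


def split_category(cat: str) -> Optional[Tuple[str, str, str]]:
    """Split at the rightmost top-level slash: (result, slash, argument)."""
    cat = strip_outer(cat)
    depth = 0
    for i in range(len(cat) - 1, -1, -1):
        ch = cat[i]
        if ch == ")":
            depth += 1
        elif ch == "(":
            depth -= 1
        elif ch in "/\\" and depth == 0:
            return strip_outer(cat[:i]), ch, strip_outer(cat[i + 1:])
    return None  # atomic category such as NP or S


def combine(left: str, right: str) -> Optional[str]:
    """Forward application (X/Y Y => X) or backward application (Y X\\Y => X)."""
    l = split_category(left)
    if l and l[1] == "/" and l[2] == strip_outer(right):
        return l[0]
    r = split_category(right)
    if r and r[1] == "\\" and r[2] == strip_outer(left):
        return r[0]
    return None


verb_phrase = combine(r"(S\NP)/NP", "NP")  # drink + coffee
sentence = combine("NP", verb_phrase)      # I + (drink coffee)
print(verb_phrase, sentence)
```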

SLIDE 10

Motivation

creating new training data

at the lower levels of representation

better POS tagging

better categorization

reduce annotation overhead

SLIDE 11

Experiments

Training resources

Baseline

  Wall Street Journal: Sections 02-21 of CCGbank

Biomedical domain

  POS tagger: gold-standard POS tags from GENIA
  Lexical categories: first 1,000 sentences of GENIA
  Parser evaluation: BioInfer
  Evaluation set: Pyysalo et al. (2007b)

Question domain

  Questions beginning with the word "What", from the TREC 9-12 competitions
  Manually POS tagged and annotated with lexical categories

SLIDE 12

Experiments

Results

POS-Tagger

  %            WSJ 02-21   Retrained
  Sec.00       96.7        -
  Biomedical   93.4        98.7
  Question     92.2        97.1

SLIDE 13

Experiments

Results

Supertagger

  %            Original pipeline   Retrained POS   Retrained POS & Super
  Sec.00       91.5                -               -
  Biomedical   89.0                91.2            93.0
  Question     71.6                74.0            92.1

SLIDE 14

Experiments

Results

Parser evaluation

  %            Original pipeline   New POS   New POS & Super
  Biomedical   76.0                80.4      81.5
  Question     64.4                69.4      86.6
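The contribution of each retraining step can be read off the parser table as simple differences. A small worked calculation, using only the numbers from the table above (the function name gains is hypothetical):

```python
# Parser scores (%) from the table: original pipeline, new POS, new POS & Super.
results = {
    "Biomedical": (76.0, 80.4, 81.5),
    "Question": (64.4, 69.4, 86.6),
}


def gains(scores):
    """Return (gain from the new POS tagger, additional gain from the new supertagger)."""
    original, new_pos, new_pos_super = scores
    return round(new_pos - original, 1), round(new_pos_super - new_pos, 1)


for domain, scores in results.items():
    pos_gain, super_gain = gains(scores)
    print(f"{domain}: +{pos_gain} from POS, +{super_gain} from supertagger")
```

For the biomedical data most of the gain comes from the retrained POS tagger (4.4 points versus 1.1), while for questions most of it comes from the retrained supertagger (17.2 points versus 5.0).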

SLIDE 15

Analysis

Comparing to WSJ:

Biomedical domain:

+ similar syntactic structure

- vocabulary & foreign words

- long noun phrases

Question domain:

+ vocabulary

- words whose POS distribution differs from the source domain

- different syntactic structure

SLIDE 16

Analysis

POS tagging

Biomedical domain:

nouns and adjectives (801 NN + 268 JJ errors)

very long noun phrases and unknown words like major histocompatibility complex class II molecules

Question domain:

wh-determiners (129 errors)

only one occurrence in WSJ 02-21

SLIDE 17

Analysis

POS tagging

Biomedical domain:

nouns and adjectives (801 NN + 268 JJ errors)

very long noun phrases and unknown words

Question domain:

wh-determiners (129 errors)

(S/(S/NP))/N: What Liverpool club spawned the Beatles?

S/(S\NP): What are the colors of the German flag?

many more errors, but related syntactic structure

SLIDE 18

Analysis

Syntactic differences

Unknown POS n-gram rate

  %            WSJ 02-21             New training sets
               3-grams   5-grams     3-grams   5-grams
  Sec.00       0.4       12.1        -         -
  Biomedical   0.7       10.9        0.5       9.2
  Question     3.6       22.0        0.7       7.4
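An unknown POS n-gram rate of this kind could be computed as follows. This sketch assumes the rate is the percentage of POS n-gram occurrences in the evaluated text whose n-gram never occurs in the training text; the slides do not spell out the exact definition, and the function names and toy tag sequences are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def pos_ngrams(sentences: Iterable[List[str]], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every POS n-gram inside each sentence (no sentence padding)."""
    for tags in sentences:
        for i in range(len(tags) - n + 1):
            yield tuple(tags[i:i + n])


def unknown_ngram_rate(train: List[List[str]], test: List[List[str]], n: int) -> float:
    """Percentage of n-gram occurrences in test that are unseen in train."""
    seen = set(pos_ngrams(train, n))
    test_ngrams = list(pos_ngrams(test, n))
    if not test_ngrams:
        return 0.0
    unknown = sum(1 for gram in test_ngrams if gram not in seen)
    return 100.0 * unknown / len(test_ngrams)


# Toy data: one declarative training sentence, one question.
train = [["DT", "NN", "VBZ", "DT", "NN", "."]]
test = [["WDT", "NN", "VBD", "DT", "NN", "."]]
rate = unknown_ngram_rate(train, test, 3)
print(rate)  # three of the four test trigrams are unseen
```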

SLIDE 19

Analysis

Syntactic differences

Unknown POS n-gram rate

Number of 20 most frequent POS n-grams

               3-grams   5-grams
  Sec.00       18        19
  Biomedical   16        13
  Question     8         5
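One plausible reading of this table is that each cell counts how many of the 20 most frequent POS n-grams of a data set are also among the 20 most frequent in WSJ 02-21. That interpretation is an assumption; the slide does not state it. Under that assumption, the count can be sketched as below, with hypothetical function names and toy data (k=2 instead of 20).

```python
from collections import Counter
from typing import List, Set, Tuple


def top_ngrams(sentences: List[List[str]], n: int, k: int) -> Set[Tuple[str, ...]]:
    """Return the k most frequent POS n-grams of a data set."""
    counts = Counter(
        tuple(tags[i:i + n])
        for tags in sentences
        for i in range(len(tags) - n + 1)
    )
    return {gram for gram, _ in counts.most_common(k)}


def top_k_overlap(source: List[List[str]], target: List[List[str]], n: int, k: int = 20) -> int:
    """Count the target's top-k n-grams that are also in the source's top k."""
    return len(top_ngrams(source, n, k) & top_ngrams(target, n, k))


# Toy data standing in for the source (newspaper) and target (question) sets.
source = [["DT", "NN", "IN", "DT", "NN", "IN", "DT", "NN"], ["NN", "IN"]]
target = [["WP", "VBZ", "DT", "NN"], ["DT", "NN", "VBZ"], ["WP", "VBZ"]]
overlap = top_k_overlap(source, target, n=2, k=2)
print(overlap)
```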

SLIDE 20

Analysis

Syntactic differences

Unknown POS n-gram rate

Number of 20 most frequent POS n-grams

POS trigrams

Biomedical domain:

Domination of NPs and PPs

Question domain:

Beginning with WP VBZ, like "What is"

Ending with VB .

SLIDE 21

Analysis

Syntactic differences

Unknown POS n-gram rate

Number of 20 most frequent POS n-grams

POS trigrams

Number of rare or unseen lexical categories

SLIDE 22

Conclusion

Biomedical domain

need for accurate parsing

long and difficult sentences

many POS tag errors

parser adaptation successful with POS tagging

Question domain

uniform sentences

less related syntax

parser adaptation successful with supertagging

SLIDE 23

References

Laura Rimell and Stephen Clark. 2008. Adapting a Lexicalized-Grammar Parser to Contrasting Domains. EMNLP 2008.

Julia Hockenmaier. 2007. Expressive Grammar Formalisms for Natural Language: Theory and Applications. Lecture 16: Extracting a CCG from the Penn Treebank.

Julia Hockenmaier. 2005. CCGbank User's Manual.
