Effective Self-Training for Parsing
David McClosky
dmcc@cs.brown.edu
Brown Laboratory for Linguistic Information Processing (BLLIP) Joint work with Eugene Charniak and Mark Johnson
David McClosky - dmcc@cs.brown.edu - NAACL 2006 - 6.5.2006 - 1
(S (NP (PRP I))
   (VP (VBP need)
       (NP (NP (DT a) (NN sentence))
           (PP (IN with) (NP (NN ambiguity)))))
   (. .))
π
(S (NP (PRP I)) (VP (VBP need) (NP (NP a sentence) (PP with ambiguity))) (. .))
p(π1) = 7.25 × 10^-20

(S (NP (PRP I)) (VP (VBP need) (NP a sentence) (PP with ambiguity)) (. .))
p(π2) = 7.05 × 10^-21
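The comparison above comes down to choosing the candidate parse with the higher model probability. A minimal sketch using the two probabilities from the slide; the dictionary and variable names are illustrative assumptions, not part of any parser's API.

```python
# Probabilities of the two candidate parses, copied from the slide.
parse_probs = {
    "pi1": 7.25e-20,
    "pi2": 7.05e-21,
}

# A generative parser outputs the candidate with the highest probability.
best_parse = max(parse_probs, key=parse_probs.get)

# How many times more probable pi1 is than pi2 under the model.
ratio = parse_probs["pi1"] / parse_probs["pi2"]

print(best_parse)
print(round(ratio, 1))
```

The absolute probabilities are tiny, so the ratio between candidates is usually the more informative quantity.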
train reranking parser on WSJ
use model to parse NANC
merge WSJ training data with parsed NANC data
train reranking parser on WSJ+NANC data
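The four steps above can be sketched as a single function. This is a minimal illustration, not the authors' implementation: `self_train`, `toy_train`, and the string-based parse representation are all hypothetical stand-ins, and the toy trainer only memorises its training pairs.

```python
from typing import Callable, List, Tuple

Sentence = str
Parse = str  # stand-in for a tree; a real system would use a tree structure
Model = Callable[[Sentence], Parse]
Trainer = Callable[[List[Tuple[Sentence, Parse]]], Model]


def self_train(
    train: Trainer,
    labeled: List[Tuple[Sentence, Parse]],
    unlabeled: List[Sentence],
) -> Model:
    """One round of self-training, following the four steps listed above."""
    # 1. Train on the hand-annotated data (WSJ in the talk).
    base_model = train(labeled)
    # 2. Parse the unlabeled data (NANC in the talk) with that model.
    auto_labeled = [(s, base_model(s)) for s in unlabeled]
    # 3. Merge the hand-annotated data with the automatically parsed data.
    combined = labeled + auto_labeled
    # 4. Retrain on the combined data.
    return train(combined)


def toy_train(data: List[Tuple[Sentence, Parse]]) -> Model:
    """Hypothetical trainer: memorises parses, flat bracket for unseen input."""
    memory = dict(data)
    return lambda s: memory.get(s, "(X " + s + ")")


model = self_train(
    toy_train,
    labeled=[("I need a sentence", "(S I need a sentence)")],
    unlabeled=["more news text"],
)
print(model("more news text"))
```

Passing the trainer in as a function keeps the loop independent of any particular parser, so the same sketch covers both a plain parser and a reranking parser.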
insignificant improvement
minor improvement/damage depending on amount of training data
helps when parsing WSJ with a parser trained on the Brown corpus and self-trained on news data
Parser (not reranking parser) f-scores

                   Annotator
Sentences added    Parser    Reranking parser
0 (baseline)       90.3 (same for both annotators)
50k                90.1      90.7
500k               90.0      90.9
1,000k             90.0      90.8
1,500k             90.0      90.8
2,000k             -         91.0
Reranking parser f-scores for all sentences

                   WSJ section
Sentences added    1       22      24
0 (baseline)       91.8    92.1    90.5
50k                91.8    92.4    90.8
500k               92.0    92.4    90.9
1,000k             92.1    92.2    91.3
2,000k             92.2    92.0    91.3
WSJ×5+1,750k sentences from NANC
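One way to read "WSJ×5" is that the hand-annotated WSJ trees are counted five times relative to the automatically parsed NANC trees. A minimal sketch under that assumption; `build_training_set` and its parameters are hypothetical names, and the tree strings are placeholders.

```python
def build_training_set(wsj, nanc_parsed, wsj_weight=5, nanc_limit=1_750_000):
    """Repeat the WSJ trees wsj_weight times, then append up to nanc_limit NANC trees."""
    return list(wsj) * wsj_weight + list(nanc_parsed[:nanc_limit])


# Placeholder data: two hand-annotated trees and three automatically parsed ones.
wsj = ["wsj_tree_1", "wsj_tree_2"]
nanc = ["nanc_tree_1", "nanc_tree_2", "nanc_tree_3"]

combined = build_training_set(wsj, nanc)
print(len(combined))
```

Replicating the smaller corpus is the simplest way to express a relative weight; a trainer that accepts per-example weights could achieve the same effect without copying data.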
f-scores from all sentences in WSJ section 23
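For reference, a parsing f-score is the harmonic mean of labeled bracket precision and recall. The sketch below is a simplified illustration, assuming constituents are given as (label, start, end) spans; standard evaluation tools apply additional conventions, and the choice of which analysis is "gold" here is an assumption made only for the example.

```python
from collections import Counter


def bracket_f_score(gold, predicted):
    """Harmonic mean of labeled bracket precision and recall."""
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Multiset intersection: brackets that appear in both analyses.
    matched = sum((gold_counts & pred_counts).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(pred_counts.values())
    recall = matched / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)


# "I need a sentence with ambiguity", word positions 0..6.
# Assumed gold: the PP attaches inside the object NP.
gold = [("S", 0, 6), ("NP", 0, 1), ("VP", 1, 6), ("NP", 2, 6),
        ("NP", 2, 4), ("PP", 4, 6), ("NP", 5, 6)]
# Assumed prediction: the PP attaches to the VP, so NP(2, 6) is missing.
predicted = [("S", 0, 6), ("NP", 0, 1), ("VP", 1, 6),
             ("NP", 2, 4), ("PP", 4, 6), ("NP", 5, 6)]

score = bracket_f_score(gold, predicted)
print(round(score, 3))
```

A single attachment error costs one bracket here: precision stays at 6/6 while recall drops to 6/7.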
WSJ×1 + 250k
WSJ×5 + 1,750k
Pr(1-best) / Pr(50th-best) increases from 12.0
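The ratio between the probability of the best parse and that of the 50th-best parse measures how peaked the parser's n-best distribution is. A minimal sketch of computing it from log-probabilities; the function name and the toy scores are assumptions for illustration and are not taken from the talk.

```python
import math


def best_to_nth_ratio(log_probs, n=50):
    """Return Pr(1-best) / Pr(n-th best) for a list of parse log-probabilities."""
    ranked = sorted(log_probs, reverse=True)
    # Subtract in log space to avoid underflow with very small probabilities.
    return math.exp(ranked[0] - ranked[n - 1])


# Toy n-best list: 50 log-probabilities spaced 0.05 apart (made-up values).
toy_log_probs = [-100.0 - 0.05 * i for i in range(50)]

nbest_ratio = best_to_nth_ratio(toy_log_probs)
print(round(nbest_ratio, 2))
```

A larger ratio means the model concentrates more probability mass on its top parse relative to the rest of the n-best list.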
[Figure, four panels: number of sentences in each category (Better, No change, Worse) plotted against sentence length (smoothed), number of unknown words, number of CCs, and number of INs]
[Figure: number of sentences (smoothed) in each category (Better, No change, Worse) plotted against sentence length]
[Figure: number of sentences in each category (Better, No change, Worse) plotted against number of CCs]
Self-trained reranking parser available from: ftp://ftp.cs.brown.edu/pub/nlparser
This work was supported by NSF grants LIS9720368 and IIS0095940 and by DARPA GALE contract HR0011-06-2-0001. Thanks to Michael Collins, Brian Roark, James Henderson, Miles Osborne, and the BLLIP team for their comments.