
Appears in: Carl Weir (ed.), Statistically-Based Natural Language Processing Techniques: Papers from the 1992 Workshop, pp. 20-27. Technical Report W-92-01, AAAI Press, Menlo Park, 1992.


A Probabilistic Parser and Its Application

Mark A. Jones
AT&T Bell Laboratories
600 Mountain Avenue, Rm. 2B-435
Murray Hill, NJ 07974-0636
jones@research.att.com

Jason M. Eisner
Emmanuel College, Cambridge
Cambridge CB2 3AP England
jme14@phoenix.cambridge.ac.uk

Abstract

We describe a general approach to the probabilistic parsing of context-free grammars. The method integrates context-sensitive statistical knowledge of various types (e.g., syntactic and semantic) and can be trained incrementally from a bracketed corpus. We introduce a variant of the GHR context-free recognition algorithm, and explain how to adapt it for efficient probabilistic parsing. In split-corpus testing on a real-world corpus of sentences from software testing documents, with 20 possible parses for a sentence of average length, the system finds and identifies the correct parse in 96% of the sentences for which it finds any parse, while producing only 1.03 parses per sentence for those sentences. Significantly, this success rate would be only 79% without the semantic statistics.

Introduction

In constrained domains, natural language processing can often provide leverage. At AT&T, for instance, NL technology can potentially help automate many aspects of software development. A typical example occurs in the software testing area. Here 250,000 English sentences specify the operational tests for a telephone switching system. The challenge is to extract at least the surface content of this highly referential, naturally occurring text, as a first step in automating the largely manual testing process. The sentences vary in length and complexity, ranging from short sentences such as "Station B3 goes onhook" to 50-word sentences containing parentheticals, subordinate clauses, and conjunction. Fortunately the discourse is reasonably well focused: a large but finite number of telephonic concepts enter into a limited set of logical relationships. Such focus is characteristic of many sublanguages with practical importance (e.g., medical records).

We desire to press forward to NL techniques that are robust, that do not need complete grammars in advance, and that can be trained from existing corpora of sample sentences. Our approach to this problem grew out of earlier work [Jones et al 1991] on correcting the output of optical character recognition (OCR) systems. We were amazed at how much correction was possible using only low-level statistical knowledge about English (e.g., the frequency of digrams like "pa") and about common OCR mistakes (e.g., reporting "c" for "e"). As many as 90% of incorrect words could be fixed within the telephony sublanguage domain, and 70-80% for broader samples of English. Naturally we wondered whether more sophisticated uses of statistical knowledge could aid in such tasks as the one described above. The recent literature also reflects an increasing interest in statistical training methods for many NL tasks, including parsing [Jelinek and Lafferty 1991, Magerman and Marcus 1991, Bobrow 1991, Magerman and Weir 1992, Black, Jelinek, et al 1992], part of speech tagging [Church 1988], and corpora alignment [Dagan et al 1991, Gale and Church 1991].

Simply stated, we seek to build a parser that can construct accurate syntactic and semantic analyses for the sentences of a given language. The parser should know little or nothing about the target language, save what it can discover statistically from a representative corpus of analyzed sentences. When only unanalyzed sentences are available, a practical approach is to parse a small set of sentences by hand, to get started, and then to use the parser itself as a tool to suggest analyses (or partial analyses) for further sentences. A similar "bootstrapping" approach is found in [Simmons 1990]. The precise grammatical theory we use to hand-analyze sentences should not be crucial, so long as it is applied consistently and is not unduly large.

Parsing Algorithms

Following [Graham et al 1980], we adopt the following notation. An arbitrary context-free grammar is given by $G = (V, \Sigma, P, S)$, where $V$ is the vocabulary of all symbols, $\Sigma$ is the set of terminal symbols, $P$ is the set of rewrite rules, and $S$ is the start symbol. For an input sentence $w = a_1 a_2 \ldots a_n$, let $w_{i,j}$ denote the substring $a_{i+1} \ldots a_j$ and $w_i = w_{0,i}$ denote the prefix of length $i$. We use Greek letters ($\alpha, \beta, \ldots$) to denote
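
The substring notation defined in this section can be illustrated with a small sketch. It is not taken from the paper: the function names `substring` and `prefix` and the whitespace tokenization are assumptions made for illustration only. The point it shows is that $w_{i,j} = a_{i+1} \ldots a_j$, with symbols numbered from 1, corresponds to a zero-based half-open slice `a[i:j]`.

```python
from typing import List


def substring(a: List[str], i: int, j: int) -> List[str]:
    """Return w_{i,j} = a_{i+1} ... a_j, where symbols are numbered from 1.

    Hypothetical helper: with a zero-based list, this is the slice a[i:j].
    """
    if not 0 <= i <= j <= len(a):
        raise ValueError("indices must satisfy 0 <= i <= j <= n")
    return a[i:j]


def prefix(a: List[str], i: int) -> List[str]:
    """Return w_i = w_{0,i}, the prefix of length i (hypothetical helper)."""
    return substring(a, 0, i)


# Example sentence quoted in the Introduction, split on whitespace
# (the tokenization is an assumption of this sketch).
sentence = "Station B3 goes onhook".split()

w_1_3 = substring(sentence, 1, 3)  # a_2 a_3
w_2 = prefix(sentence, 2)          # a_1 a_2
```

Under this reading, adjacent substrings concatenate cleanly: $w_{i,k}$ is $w_{i,j}$ followed by $w_{j,k}$ for any $i \le j \le k$, and $w_{i,i}$ is empty.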
