A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text. A.S. Schwartz, M.A. Hearst. Pacific Symposium on Biocomputing 8:451-462 (2003)



A SIMPLE ALGORITHM FOR IDENTIFYING ABBREVIATION DEFINITIONS IN BIOMEDICAL TEXT

ARIEL S. SCHWARTZ
Computer Science Division, University of California, Berkeley, Berkeley, CA 94720
sariel@cs.berkeley.edu

MARTI A. HEARST
SIMS, University of California, Berkeley, Berkeley, CA 94720
hearst@sims.berkeley.edu

Abstract

The volume of biomedical text is growing at a fast rate, creating challenges for humans and computer systems alike. One of these challenges arises from the frequent use of novel abbreviations in these texts, thus requiring that biomedical lexical ontologies be continually updated. In this paper we show that the problem of identifying abbreviations’ definitions can be solved with a much simpler algorithm than that proposed by other research efforts. The algorithm achieves 96% precision and 82% recall on a standard test collection, which is at least as good as existing approaches. It also achieves 95% precision and 82% recall on another, larger test set. A notable advantage of the algorithm is that, unlike other approaches, it does not require any training data.

1 Introduction

There has been an increased interest recently in techniques to automatically extract information from biomedical text, and particularly from MEDLINE abstracts [3, 4, 7, 15]. The size and growth rate of biomedical literature create new challenges for researchers who need to keep up to date. One specific issue is the high rate at which new abbreviations are introduced in biomedical texts. Existing databases, ontologies, and dictionaries must be continually updated with new abbreviations and their definitions. In an attempt to help resolve the problem, new techniques have been introduced to automatically extract abbreviations and their definitions from MEDLINE abstracts.

In this paper we propose a new, simple, fast algorithm for extraction of abbreviations from biomedical text. The scope of the task addressed here is the same as the one described in Pustejovsky et al. [14]: identify <“short form”, “long form”> pairs where there exists a mapping (of any kind) from characters in the short form to characters in the long form.[a]

[a] Throughout the paper we use the terms “short form” and “long form” interchangeably with “abbreviation” and “definition”. We also use the term “short form” to indicate both abbreviations and acronyms, conflating these as have previous authors.

Many abbreviations in biomedical text follow a predictable pattern, in which the first letter of each word in the long form corresponds to one letter in the short form, as in methyl methanesulfonate sulfate (MMS). However, there are many cases in which the correct match between the short form and long form requires words in the long form to be skipped, or matching of internal letters in long form words, as in Gcn5-related N-acetyltransferase (GNAT). In this paper, we describe a very simple, fast algorithm for this problem that achieves both high recall and high precision.

2 Related Work

Pustejovsky et al. [13, 14] present a solution for identifying abbreviations based on hand-built regular expressions and syntactic information to identify boundaries of noun phrases. When a noun phrase is found to precede a short form enclosed in parentheses, each of the characters within the short form is matched in the long form. A score is assigned that corresponds to the number of non-stop words in the long form divided by the number of characters in the short form. If the result is below a threshold of 1.5, then the match is accepted. This algorithm achieved 72% recall and 98% precision on “the gold standard,” a small, publicly available evaluation corpus that this group created, working better than a similar algorithm that does not take syntax into account.[b]

Pustejovsky et al. [13] also summarize some drawbacks of other earlier pattern-based approaches, noting that the results of Taghva et al. [17] look good (98% precision and 93% recall on a different test set), but do not account for abbreviations whose letters may correspond to a character internal to a definition word, a common occurrence in biomedical text. They also find that the Acrophile algorithm of Larkey et al. [8] does not perform well on the gold standard.
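The acceptance test attributed above to Pustejovsky et al. reduces to a single ratio. The following is only a minimal sketch of that ratio, not their implementation: the class name, the method names, and especially the small stop-word list are assumptions made for illustration, and the noun-phrase detection and character matching that precede the test are not modeled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class StopWordRatio {
    // Illustrative stop-word list (an assumption; the list actually used is not given in the text).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "the", "of", "in", "for", "to", "by", "on"));

    /** Number of non-stop words in the long form divided by the number of characters in the short form. */
    static double score(String shortForm, String longForm) {
        int nonStopWords = 0;
        for (String word : longForm.trim().toLowerCase().split("\\s+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                nonStopWords++;
            }
        }
        return (double) nonStopWords / shortForm.length();
    }

    /** A candidate pair is accepted when the score is below the 1.5 threshold. */
    static boolean accept(String shortForm, String longForm) {
        return score(shortForm, longForm) < 1.5;
    }
}
```

For the pair <MMS, methyl methanesulfonate sulfate> the score is 3 / 3 = 1.0, so the pair would be accepted under these assumptions.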
Chang et al. [5] present an algorithm that uses linear regression on a pre-selected set of features, achieving 80% precision at a recall level of 83%, and 95% precision at 75% recall on the same evaluation collection (this increases to 82% recall and 99% precision on a corrected version).[c] Their algorithm uses dynamic programming to find potential alignments between short and long form, and uses the results of this to compute feature vectors for correctly identified definitions. They then use binary logistic regression to train a classifier on 1000 candidate pairs.

Yeates et al. [19] examine acronyms in technical text. They address a more difficult problem than some other groups in that their test set includes instances that do not have distinct orthographic markers such as parentheses to indicate the proximity of a definition to an abbreviation (they report that only two thirds of the examples take this form). Their algorithm creates a code that indicates the distance of the definition words from the corresponding characters in the acronym, and uses compression to learn the associations. They compile a large test collection consisting of 1080 definitions, training on two thirds and testing on the remainder, and report the results on a precision/recall curve.

[b] There are some errors in the gold standard. The results reported by Pustejovsky et al. [13] are on a variation of the gold standard with some corrections, but the actual corrections made are not reported in the paper. Unfortunately, the corrections needed on the standard are not standardized.

[c] Personal communication, H. Schuetze.

Park and Byrd [12] present a rule-based algorithm for extraction of abbreviation definitions from general text. The algorithm creates rules on the fly that model how the short form can be translated into the long form. They create a set of five translation rules, a set of five rules for determining candidate long forms based on their length, and a set of six heuristics for determining which definition to choose if there are many potential candidates. These are: syntactic cues, rule priority, distance between definition and abbreviation, capitalization criteria, number of words in the definition, and number of stop words in the definition. Rule priority is based on how often the rule has been applied in the past. They evaluate their algorithm on 177 abbreviations taken from engineering texts, achieving 98% precision and 95% recall. No mention is made of the size and nature of the training set, or whether it was distinct from the test set.

Yu et al. [21] present another rule-based algorithm for mapping abbreviations to their full forms in biomedical text. Their algorithm is similar to that of Park and Byrd. For a given short form, the algorithm extracts all the candidate long forms that start with the same character as the short form. The algorithm then tries to match the candidate long forms to the short form starting from the shortest long form, by iteratively applying 5 pattern-matching rules. The rules include heuristics such as prioritizing matching the first character of a word, allowing the use of internal letters only if the first letter of a word was matched, and so on. The algorithm was evaluated on a small collection of biomedical text containing 62 matching pairs, achieving 95% precision and 70% recall on average.

Adar [2] presents an algorithm that generates a set of paths through the window of text adjacent to an abbreviation (starting from the leftmost character), and scores these paths to find the most likely definition. Scoring rules used include “for every abbreviation character that occurs at the start of a definition word, add 1”, and “A bonus point is awarded for definitions that are immediately adjacent to the parenthesis”. After processing a large set of abbreviation-definition pairs, the results are clustered in order to identify spelling variants among the definitions. N-gram clustering is coupled with lookup into the MeSH hierarchy to further improve the clusters. Performance on a smaller subset of the gold standard yielded 85% recall and 94% precision; the author notes that 2 definitions identified by his algorithm should have been marked correct in the standard, resulting in a precision of 95%.[d]

[d] Results verified through personal communication with the author.
The work described in this paper arose because the authors found difficulties making the Park and Byrd algorithm work well on biomedical text. The rules it produces are very specific to the format of candidate abbreviations, and so many abbreviations were being represented by patterns that had not yet been encountered by the algorithm, and thus rule priority was not often applicable.

The approach closest to the one we present here is the algorithm of Yoshida et al. [20]. Their algorithm assumes that the definition or the abbreviation occurs adjacent to parentheses, but their paper does not state how the length of candidate definitions is determined. Their algorithm scans words from the end of the abbreviation and candidate definition to the beginning, trying at each iteration to find a match for the substring of the abbreviation in the definition. The algorithm assumes that in order for a character from the abbreviation to be represented in the interior of a word in the definition, there must be a match of some other character from the abbreviation on the first letter of that word. In addition, characters that match in the interior of the word must either be adjacent to one another following that initial letter, or adjacent to one another following a syllable boundary. Each iteration of the algorithm requires a check to see if a subsequence can be properly formed according to these rules. They test this algorithm on a very large collection (they had an independent assessor evaluate more than 15,000 categorizations), achieving 97.5% precision and 95.5% recall.

Another important processing issue for abbreviations is disambiguation of multiple senses of the same short form. Pustejovsky et al. [13] describe an algorithm that yields abbreviation sense disambiguation accuracies of 98%, and Pakhomov [9] achieves accuracies of 89% on clinical records.

Yet another issue is normalization of different spellings of the same abbreviation. It is difficult to define what it means for two biomedical terms to refer to the same concept; Cohen et al. [6] provide one set of rules.

3 Methods and Implementation

3.1 Identifying Short Form and Long Form Candidates

The process of extracting abbreviations and their definitions from medical text is composed of two main tasks. The first is the extraction of <short-form, long-form> pair candidates from the text. The second task is identifying the correct long form from among the candidates in the sentence that surrounds the short form. Most approaches, including the one presented here, use a similar method for finding candidate pairs. Abbreviation candidates are determined by adjacency to parentheses.

The two cases are:

(i) long form ‘(’ short form ‘)’
(ii) short form ‘(’ long form ‘)’

In practice, most <short form, long form> pairs conform to pattern (i). Whenever the expression inside the parentheses includes more than two words, pattern (ii) is assumed, and a short form is searched for just before the left parenthesis (word boundaries are indicated by spaces). Short forms are considered valid candidates only if they consist of at most two words, their length is between two and ten characters, at least one of these characters is a letter, and the first character is alphanumeric. For simplicity, pattern (i) is assumed in the discussion below.

The next step is to identify candidates for the long form. The long form candidate must appear in the same sentence as the short form, and as in Park and Byrd [12], it should have no more than min(|A| + 5, |A| * 2) words, where |A| is the number of characters in the short form.

Although the algorithm of Park and Byrd allows for an offset between the short and long forms, we consider only long forms that are adjacent to the short form. For a given short form, a long form candidate is composed of contiguous words from the original text that include the word just before the short form.

3.2 Algorithm for Identifying Correct Long Forms

When the previous steps are completed there is a list of long form candidate words for the short form, and the task is to choose the right subset of words. Figure 1 presents the code that performs this task. The main idea is: starting from the end of both the short form and the long form, move right to left, trying to find the shortest long form that matches the short form. Every character in the short form must match a character in the long form, and the matched characters in the long form must be in the same order as the characters in the short form. Any character in the long form can match a character in the short form, with one exception: the character at the beginning of the short form must match a character in the initial position of the first (leftmost) word in the long form (this initial position can be the first letter of a word that is connected to other words by hyphens and other non-alphanumeric characters).

The implementation in Figure 1 uses two indices, lIndex for the long form, and sIndex for the short form. The two indices are initialized to point to the end of their respective strings. For each character sIndex points to, lIndex is decremented until a matching character is found. If lIndex reaches the beginning of the long form candidate list before sIndex does, the algorithm returns null (no match found).
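The candidate selection rules of Section 3.1 can be sketched in a few lines of Java. This is an illustration under stated assumptions rather than the authors' code: the class and method names are hypothetical, the short form length is taken to count every character of the trimmed string (including a space between two words), and sentence splitting and parenthesis detection are assumed to have happened already.

```java
import java.util.Arrays;

final class CandidateRules {
    /** Pattern (ii) is assumed when the parenthesized expression has more than two words. */
    static boolean isPatternTwo(String insideParentheses) {
        return insideParentheses.trim().split("\\s+").length > 2;
    }

    /** Short form validity test: at most two words, 2 to 10 characters, a letter, alphanumeric start. */
    static boolean isValidShortForm(String candidate) {
        String s = candidate.trim();
        if (s.length() < 2 || s.length() > 10) return false;
        if (s.split("\\s+").length > 2) return false;
        if (!Character.isLetterOrDigit(s.charAt(0))) return false;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) return true;
        }
        return false; // no letter anywhere in the candidate
    }

    /** Upper bound on the number of long form words: min(|A| + 5, |A| * 2). */
    static int maxLongFormWords(String shortForm) {
        int a = shortForm.length();
        return Math.min(a + 5, a * 2);
    }

    /** Keeps the words immediately preceding the parenthesis, up to the bound above. */
    static String longFormCandidate(String textBeforeParenthesis, String shortForm) {
        String[] words = textBeforeParenthesis.trim().split("\\s+");
        int keep = Math.min(words.length, maxLongFormWords(shortForm));
        return String.join(" ", Arrays.copyOfRange(words, words.length - keep, words.length));
    }
}
```

For the short form HSF the bound is min(8, 6) = 6 words, so at most the six words immediately to the left of the parenthesis are passed on to the matching step.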

Figure 1 – Java Code for Finding the Best Long Form for a Given Short Form

/**
 * Method findBestLongForm takes as input a short-form and a long-form
 * candidate (a list of words) and returns the best long-form
 * that matches the short-form, or null if no match is found.
 **/
public String findBestLongForm(String shortForm, String longForm) {
    int sIndex;     // The index on the short form
    int lIndex;     // The index on the long form
    char currChar;  // The current character to match

    sIndex = shortForm.length() - 1;  // Set sIndex at the end of the short form
    lIndex = longForm.length() - 1;   // Set lIndex at the end of the long form
    for ( ; sIndex >= 0; sIndex--) {  // Scan the short form starting from end to start
        // Store the next character to match. Ignore case
        currChar = Character.toLowerCase(shortForm.charAt(sIndex));
        // Ignore non alphanumeric characters
        if (!Character.isLetterOrDigit(currChar))
            continue;
        // Decrease lIndex while current character in the long form
        // does not match the current character in the short form.
        // If the current character is the first character in the
        // short form, decrement lIndex until a matching character
        // is found at the beginning of a word in the long form.
        while (
            ((lIndex >= 0) &&
             (Character.toLowerCase(longForm.charAt(lIndex)) != currChar)) ||
            ((sIndex == 0) && (lIndex > 0) &&
             (Character.isLetterOrDigit(longForm.charAt(lIndex - 1)))))
            lIndex--;
        // If no match was found in the long form for the current
        // character, return null (no match).
        if (lIndex < 0)
            return null;
        // A match was found for the current character. Move to the
        // next character in the long form.
        lIndex--;
    }
    // Find the beginning of the first word (in case the first
    // character matches the beginning of a hyphenated word).
    lIndex = longForm.lastIndexOf(" ", lIndex) + 1;
    // Return the best long form, the substring of the original
    // long form, starting from lIndex up to the end of the original
    // long form.
    return longForm.substring(lIndex);
}

Otherwise, each time a matching character is found, sIndex and lIndex are decremented. When sIndex is at the initial (leftmost) character of the short form, a match is considered only if it occurs at the beginning of a word in the long form. This is accomplished by decrementing lIndex until it reaches a non-alphanumerical character or reaches the beginning of the long form. Only then is the character it is pointing to checked for a match against the character sIndex points to (this allows for matches in the beginning of the long form just before hyphens). lIndex is then decremented until it reaches a space, or the beginning of the long form (whichever comes first), in order to include all the words that are connected, usually by hyphens, to the leftmost matched word in the long form. Finally, the algorithm returns the substring of the original long form, starting from lIndex up to the end of the original long form.

To increase precision, the algorithm discards long forms that are shorter than the short form, or that include the short form as one of the words in the long form.[e]

To illustrate the algorithm, consider the following pair <HSF, Heat shock transcription factor>. The algorithm starts by setting sIndex to point to the end of the short form (the F in HSF), and lIndex to point to the end of the long form (the r in factor). It then decrements lIndex until a match is found (the f in factor). sIndex is decremented by one (the S in HSF). lIndex is decremented until a match is found (the s in transcription). sIndex is decremented again (the H in HSF). Since sIndex now points to the beginning of the short form, the next match should be found at the beginning of a word in the long form. Therefore, lIndex is decremented until a valid match is found (the H in Heat). Note that another match was skipped (the h in shock) because it was not at the beginning of a word. Also note that although the algorithm did not match the second character correctly (the s in transcription instead of the s in shock) it still found the right long form.

To illustrate when the algorithm might fail, consider the following example: <TTF-1, Thyroid transcription factor 1>. In this case the algorithm finds a wrong match: both T's of the short form are matched within the word transcription, so Thyroid is left out of the extracted long form. Our experimental results show that this kind of error is very rare.

The algorithm is based on the observation that it is very rare for the first character of the short form to match an internal letter of the long form. By adding the constraint that the first character of the short form matches the beginning of a word in the long form, together with the limitation on the length of the long form, the precision is increased by removing most of the false positives, without significantly reducing the recall. By contrast, adding additional constraints, as is done by most other algorithms, does not seem to help much in terms of precision, but can severely reduce the recall. To illustrate this point consider the results of Yu et al. [21], which is a similar algorithm to ours, but has additional constraints. While the precision of both algorithms is very similar, the recall of our algorithm is higher.

[e] This part of the algorithm is omitted from Figure 1.
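As a runnable check of the walkthrough above, the following sketch wraps the Figure 1 method in a class together with the discard step that Figure 1 omits. The class name AbbreviationDemo and the helper passesFilter are hypothetical. In particular, measuring "shorter" in characters and comparing words case-sensitively are assumptions, since the text does not specify either.

```java
final class AbbreviationDemo {
    /** The matching loop of Figure 1, made static so it can be called directly. */
    static String findBestLongForm(String shortForm, String longForm) {
        int sIndex = shortForm.length() - 1;
        int lIndex = longForm.length() - 1;
        for (; sIndex >= 0; sIndex--) {
            char currChar = Character.toLowerCase(shortForm.charAt(sIndex));
            if (!Character.isLetterOrDigit(currChar)) {
                continue; // punctuation in the short form is never matched
            }
            // Move left until the character matches; for the first short form
            // character, also require the match to sit at the start of a word.
            while ((lIndex >= 0 && Character.toLowerCase(longForm.charAt(lIndex)) != currChar)
                    || (sIndex == 0 && lIndex > 0
                        && Character.isLetterOrDigit(longForm.charAt(lIndex - 1)))) {
                lIndex--;
            }
            if (lIndex < 0) {
                return null; // ran out of long form before matching every character
            }
            lIndex--;
        }
        // Extend left to the preceding space so hyphen-connected words are kept whole.
        lIndex = longForm.lastIndexOf(" ", lIndex) + 1;
        return longForm.substring(lIndex);
    }

    /** Hypothetical version of the discard step that Figure 1 omits. */
    static boolean passesFilter(String shortForm, String longForm) {
        if (longForm == null || longForm.length() < shortForm.length()) {
            return false;
        }
        for (String word : longForm.split("\\s+")) {
            if (word.equals(shortForm)) {
                return false; // the long form contains the short form itself
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // prints "Heat shock transcription factor"
        System.out.println(findBestLongForm("HSF", "Heat shock transcription factor"));
        // prints "transcription factor 1" (the failure case discussed above)
        System.out.println(findBestLongForm("TTF-1", "Thyroid transcription factor 1"));
    }
}
```

The second call reproduces the TTF-1 failure: because both T's are consumed by the word transcription, the returned long form starts one word too late.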
4 Evaluation and Results

To evaluate the algorithm, 1000 MEDLINE abstracts were randomly selected from the results of a query on the term “yeast”. These were then hand tagged, producing a list of 954 correct <short form, long form> pairs. The algorithm was also tested against a publicly available tagged corpus, the Medstract Gold Standard Evaluation Corpus [22], which includes 168 <short form, long form> pairs.

On a corrected version of the gold standard, the algorithm identified 143 pairs. Out of these, 137 pairs were correct, resulting in a recall of 82% at precision of 96%. For comparison, the algorithm described in Chang et al. [5] achieved 83% recall at 80% precision, and that of Pustejovsky et al. [14] achieved 72% recall at 98% precision.[f]

Analysis of the 6 incorrect pairs reveals that in actuality, 2 of them are correct, but were overlooked by the creators of the gold standard.[g] The other 4 pairs are counted as incorrect, since they only partially matched the correct long form. For example, the algorithm found the pair <Pol II, polymerase II> instead of <Pol II, RNA polymerase II>. Allowing for reasonable partial matches, as was done in Chang et al. [5], and considering the 2 missing pairs as correct, the precision is increased to 99%, and the recall to 84%.

The algorithm missed 31 pairs: 9 (29%) pairs have skipped characters in the short form (e.g. <CNS1, cyclophilin seven suppressor>), 7 (23%) do not have any pattern match between the short form and long form (e.g. <5-HT, serotonin>), 5 (16%) have an out of order match (e.g. <ATN, anterior thalamus>), for 3 (10%) pairs the long form includes additional words to the left of the match resulting in a partial match (e.g. <Pol I, RNA polymerase I>), 2 pairs have a short definition inside the parenthesis, 2 pairs have the long form inside parenthesis and the short form inside nested parenthesis, 1 pair has a comma in the long form, 1 pair has no parenthesis, and for 1 pair the algorithm found a wrong partial match (see the example at the end of section 3).

[f] It is important to note that because of the errors in the gold standard these results cannot be accurately compared. Each of these evaluations used its own interpretation of the standard, fixing different parts of it. In our evaluation, we followed the guidelines of Pustejovsky et al. [14] (received through personal communication), since they have developed and maintained the standard. We did not include here the results of Adar [2], since he used a subset of the original gold standard with only 144 pairs, which did not include most of the pairs missed by the other algorithms.

[g] The two missing pairs are <l'sc, lethal of scute>, and <cAMP, 3',5' cyclic adenosine monophosphate>. The second long form was only partially extracted by the algorithm (without the leading numbers).

On the larger test collection, the algorithm identified 827 pairs. Out of these, 785 pairs were correct, resulting in a recall of 82% at precision of 95%.[h] Analysis of the 42 incorrect pairs reveals that 17 are completely incorrect, 12 pairs have a matched long form that is a superset of the correct long form (this happens when the correct mapping includes unused characters, out of order mappings, first-character matches to internal letters, or when the first word of the correct long form is connected by hyphens to preceding words), 11 are partial matches where the extracted long form is a subset of the correct long form (this happens when the algorithm is able to match all the characters of the short form to a subset of the long form, and when the correct match does not start from the first word of the long form), and for 2 pairs the extracted long form includes left brackets that are not part of the correct long form.
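The reported percentages follow directly from the pair counts given in this section. The short sketch below recomputes them; the class and method names are hypothetical, and the only inputs are the counts stated in the text (137 correct of 143 extracted against 168 gold standard pairs, and 785 correct of 827 extracted against 954 hand-tagged pairs).

```java
final class EvalMetrics {
    /** Fraction of extracted pairs that are correct. */
    static double precision(int correct, int extracted) {
        return (double) correct / extracted;
    }

    /** Fraction of annotated pairs that were found. */
    static double recall(int correct, int annotated) {
        return (double) correct / annotated;
    }

    public static void main(String[] args) {
        // Gold standard: 137/143 and 137/168, which round to 96% and 82%.
        System.out.println(Math.round(100 * precision(137, 143)) + "% precision, "
                + Math.round(100 * recall(137, 168)) + "% recall");
        // Larger collection: 785/827 and 785/954, which round to 95% and 82%.
        System.out.println(Math.round(100 * precision(785, 827)) + "% precision, "
                + Math.round(100 * recall(785, 954)) + "% recall");
    }
}
```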

Out of the 169 missed pairs, 70 (41%) have unused characters in the short form, 23 (14%) have an out of order match, 12 (7%) have first-character matches to internal letters, 12 (7%) have nested parentheses, 11 (7%) have some kind of transformation involved in their mapping (like 2D -> two-dimensional), 11 (7%) have partial matches (see above), 5 (3%) have non-continuous long forms, 4 (2%) involve multiple concurrent definitions, 4 (2%) have a short form of only one character, and the rest of the pairs, 19 (11%), have miscellaneous issues.

5 An Alternative Algorithm

When we began to investigate the problem of abbreviation definition identification, we devised a much more complex algorithm than that presented here. This algorithm uses the representation of Park and Byrd [12] in combination with a variation on the decision lists algorithm, as applied by Yarowsky [18] to the lexical ambiguity resolution task. The algorithm makes use of training data to rank features that are combinations of matching rule transformations.

Space restrictions prevent detailed description of that algorithm (the interested reader should refer to Schwartz and Hearst [16] for a complete description of the algorithm). However, we found that it performed slightly better than our simple algorithm on both training sets, achieving for the gold standard 97% precision and 82% recall, which is a reduction in error of 17% over the simpler algorithm. For the larger test collection, it achieves 96% precision and 82% recall, which is an error reduction of 22% over the simpler algorithm.

[h] The data set was originally annotated by a graduate student in computational and biosciences. We further verified the data by comparing any questionable pairs against other occurrences of the same abbreviation in other abstracts, using the web site provided by Chang et al. [5]. A pair extracted by the algorithm is considered correct only if it exactly matches a pair labeled in the data set.

Because the simple algorithm is so much easier to implement and requires no training data, we recommend its use, in combination with checking the entire data set for redundancy in definitions in order to further reduce the error rates.

6 Conclusions

In this paper we introduced a new algorithm for extracting abbreviations and their definitions from biomedical text. Although the algorithm is extremely simple, it is highly effective, and is less specific, and therefore less potentially brittle, than other approaches that use carefully crafted rules. Although we are staunch advocates of machine learning approaches for problems in computational linguistics, it seems that in the case of this particular problem, simpler is better. One can argue that the problem may vary across collections or languages, and so machine learning can help in these cases, but our experience with a machine learning approach to sentence boundary determination [11] suggests that most practitioners do not want to bother with labeling training data for relatively simple tasks.

Another advantage of the simplicity of the algorithm is its fast running time. The task of extracting the definition of an abbreviation is a common pre-processing step of larger multi-layered text-mining tasks [1, 10]. Therefore, it is essential that this step be as efficient as possible. Since our algorithm needs to consider only one possible long form per short form, it is much faster than the alternative algorithms that first extract many possible long forms and then pick the best of them. To provide a rough comparison, using an IBM T21 laptop with a single CPU (800 MHz, 256 MB RAM) running MS-Windows 2000, it takes our algorithm about 1 second to process 1000 abstracts, while the algorithm in Chang et al. [5], using a 5 processor Sun Enterprise E3500 server, processed only 25.5 abstracts per second. While our algorithm is clearly I/O bound (running time depends almost entirely on the time it takes to read the files from disk, and write the results back to the disk), the algorithm of Chang et al. seems to be heavily CPU bound.

The algorithm performs better or the same as the best results of other work, with the possible exception of that of Yoshida et al. [20]. However, the main advantage of the proposed algorithm over the alternatives is its simplicity and transparency. It was implemented with 260 lines of Java code and requires no training data to run. The Yoshida et al. algorithm is more complex, in that it requires a module for recognizing syllable boundaries, and it performs a substring check at each iteration of the loop.

Analysis of the errors produced indicates that further improvement of the algorithm requires the use of syntactic information, as suggested in Pustejovsky et al. [13]. Shallow parsing of the text as a preprocessing step might help correct some of the errors inherent in the algorithm, by helping to identify the noun phrases near the abbreviations. In addition, combining evidence from more than one MEDLINE abstract at a time, as was done in Adar [2], might also prove to be beneficial for increasing both precision and recall. Finally, the algorithm currently only considers candidate definitions when the abbreviation is enclosed in parentheses (and vice versa); finding all possible pairs is a more difficult problem and requires additional study.

Acknowledgements

We thank Barbara Engelhardt for producing the marked-up data for the large test collection, and Jeff Chang and Hinrich Schuetze, Youngja Park and Roy Byrd, Jose Castaño and James Pustejovsky, and Eytan Adar for providing details about their data. This research was supported in part by a gift from Genentech, and in part by a grant from ARDA.

References

1. L.A. Adamic, D. Wilkinson, B.A. Huberman, and E. Adar, “A Literature Based Method for Identifying Gene-Disease Connections” Proceedings of the 2002 IEEE Computer Society Bioinformatics Conference, Stanford, CA, August 2002.
2. E. Adar, “S-RAD: A Simple and Robust Abbreviation Dictionary” HP Laboratories Technical Report, September 2002.
3. M.A. Andrade et al. “Automatic annotation for biological sequences by extraction of keywords from MEDLINE abstracts. Development of a prototype system.” ISMB 1997. 5:25-32.
4. C. Blaschke et al. “Automatic extraction of biological information from scientific text: protein-protein interactions” ISMB 1999. 7:60-67.
5. J.T. Chang, H. Schütze, and R.B. Altman, “Creating an Online Dictionary of Abbreviations from MEDLINE” JAMIA, to appear.
6. B. Cohen, A.E. Dolbey, G.K. Acquaah-Mensah, and L. Hunter, “Contrast and variability in gene names” Proceedings of the ACL Workshop on Natural Language Processing in the Biomedical Domain, Philadelphia, PA, July 2002.
7. M. Craven and J. Kumlien, “Constructing Biological Knowledge Bases by Extracting Information from Text Sources” Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology, 77-86, Heidelberg, Germany. AAAI Press.
8. L.S. Larkey, P. Ogilvie, M.A. Price, and B. Tamilio, “Acrophile: an automated acronym extractor and server” Proceedings of the ACM Fifth International Conference on Digital Libraries, DL ’00, Dallas TX, May 2000.
9. S.V. Pakhomov, “Semi-supervised Maximum Entropy-based Approach to Acronym and Abbreviation Normalization in Medical Texts” Proceedings of ACL 2002, Philadelphia, PA, July 2002.
10. M. Palakal, M. Stephens, S. Mukhopadhyay, R. Raje, and S. Rhodes, “A Multi-level Text Mining Method to Extract Biological Relationships” Proceedings of the 2002 IEEE Computer Society Bioinformatics Conference, Stanford, CA, August 2002.
11. D. Palmer and M. Hearst, “Adaptive Multilingual Sentence Boundary Disambiguation” Computational Linguistics, 23(2), 241-267, June 1997.
12. Y. Park and R.J. Byrd, “Hybrid Text Mining for Finding Abbreviations and Their Definitions” Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, Pittsburgh, PA, June 2001: 126-133.
13. J. Pustejovsky, J. Castaño, B. Cochran, M. Kotecki, M. Morrell, and A. Rumshisky, “Extraction and Disambiguation of Acronym-Meaning Pairs in Medline” unpublished manuscript, 2001.
14. J. Pustejovsky et al. “Automatic Extraction of Acronym-Meaning Pairs from Medline Databases” Medinfo 2001; 10(Pt 1): 371-375.
15. T.C. Rindflesch et al. “Extracting Molecular Binding Relationships from Biomedical Text” Proceedings of the ANLP-NAACL 2000, pages 188-195, Association for Computational Linguistics, Seattle, WA, 2000.
16. A. Schwartz and M. Hearst, “A Rule Based Algorithm for Identifying Abbreviation Definitions in Biomedical Text Using Decision Lists” University of California, Berkeley, Technical Report, to appear.
17. K. Taghva and J. Gilbreth, “Recognizing Acronyms and their Definitions” IJDAR 1(4), 191-198, 1999.
18. D. Yarowsky, “Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French” Proceedings of the ACL, 1994: 88-95.
19. S. Yeates, D. Bainbridge, and I.H. Witten, “Using compression to identify acronyms in text” in Data Compression Conference, 2000.
20. M. Yoshida, K. Fukuda, and T. Takagi, “PNAD-CSS: a workbench for constructing a protein name abbreviation dictionary” Bioinformatics, 16(2), 2000.
21. H. Yu, G. Hripcsak, and C. Friedman, “Mapping abbreviations to full forms in biomedical articles” J Am Med Inform Assoc 2002; 9(3): 262-272.
22. http://www.medstract.org/gold-standard.html