Structural Semantic Interconnections: A Knowledge-Based Approach to Word Sense Disambiguation
Roberto Navigli and Paola Velardi
Abstract—Word Sense Disambiguation (WSD) is traditionally considered an AI-hard problem. A breakthrough in this field would have a significant impact on many relevant Web-based applications, such as Web information retrieval, improved access to Web services, information extraction, etc. Early approaches to WSD, based on knowledge representation techniques, have been replaced in the past few years by more robust machine learning and statistical techniques. The results of recent comparative evaluations of WSD systems, however, show that these methods have inherent limitations. On the other hand, the increasing availability of large-scale, rich lexical knowledge resources seems to provide new challenges to knowledge-based approaches. In this paper, we present a method, called structural semantic interconnections (SSI), which creates structural specifications of the possible senses for each word in a context and selects the best hypothesis according to a grammar G describing relations between sense specifications. Sense specifications are created from several available lexical resources that we integrated partly manually and partly with the help of automatic procedures. The SSI algorithm has been applied to different semantic disambiguation problems, such as automatic ontology population, disambiguation of sentences in generic texts, and disambiguation of words in glossary definitions. Evaluation experiments have been performed on specific knowledge domains (e.g., tourism, computer networks, enterprise interoperability), as well as on standard disambiguation test sets.

Index Terms—Natural language processing, ontology learning, structural pattern matching, word sense disambiguation.
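The selection strategy summarized in the abstract, in which each candidate sense receives a structural specification and the sense best interconnected with the senses of the surrounding words is chosen iteratively, can be sketched as follows. This is a minimal illustration under simplifying assumptions of our own: sense specifications are reduced to flat sets of related concepts, and "interconnection" is reduced to concept overlap, rather than the labeled semantic paths matched against a grammar G that the paper actually describes.

```python
# Hypothetical sketch of an SSI-style iterative disambiguation loop.
# A sense "specification" is simplified here to a set of related
# concepts; the real algorithm builds labeled semantic graphs and
# scores interconnecting paths licensed by a grammar G.

def ssi_disambiguate(candidates):
    """candidates: dict mapping word -> {sense_id: set of related concepts}.
    Returns a dict mapping word -> chosen sense_id."""
    chosen = {}
    pending = dict(candidates)

    # Monosemous words are fixed first and seed the semantic context.
    for word, senses in list(pending.items()):
        if len(senses) == 1:
            chosen[word] = next(iter(senses))
            del pending[word]

    # The context accumulates the concepts of all senses chosen so far.
    context = set()
    for word, sense in chosen.items():
        context |= candidates[word][sense]

    # Iteratively pick the sense best connected to the current context.
    while pending:
        best = None  # (score, word, sense_id)
        for word, senses in pending.items():
            for sense_id, spec in senses.items():
                score = len(spec & context)  # proxy for interconnections
                if best is None or score > best[0]:
                    best = (score, word, sense_id)
        score, word, sense_id = best
        if score == 0:
            break  # no semantic evidence left: leave the rest ambiguous
        chosen[word] = sense_id
        context |= candidates[word][sense_id]
        del pending[word]
    return chosen
```

For example, given a monosemous "river" in the context, the financial sense of "bank" shares no concepts with the accumulated context, while the riverside sense does, so the latter is selected. The sense identifiers and concept sets above are invented for illustration, not drawn from WordNet or the other resources the paper integrates.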
1 INTRODUCTION

Word sense disambiguation (WSD) is perhaps the most
critical task in the area of computational linguistics (see [1] for a survey). Early approaches were based on semantic knowledge that was either manually encoded [2], [3] or automatically extracted from existing lexical resources, such as WordNet [4], [5], LDOCE [6], and Roget's thesaurus [7]. Like other artificial intelligence applications, knowledge-based WSD faced the knowledge acquisition bottleneck: manual acquisition is a heavy and endless task, while online dictionaries provide semantic information in a mostly unstructured way, making it difficult for a computer program to exploit the encoded lexical knowledge.

More recently, machine learning, statistical, and algebraic methods ([8], [9]) have prevailed over knowledge-based methods, a tendency that clearly emerges in the main Information Retrieval conferences and in comparative system evaluations, such as SIGIR,1 TREC,2 and SensEval.3 These methods are often based on training data (mainly, word cooccurrences) extracted from document archives and from the Web.

The SensEval workshop series is specifically dedicated to the evaluation of WSD algorithms. Systems compete on different tasks (e.g., full WSD on generic texts, disambiguation of dictionary sense definitions, automatic labeling of semantic roles) and in different languages. English All-Words (full WSD on annotated corpora, such as the Wall Street Journal and the Brown Corpus) is among the most attended competitions. At Senseval-3, held in March 2004, 17 supervised and 9 unsupervised systems participated in this task. The best systems were those using a combination of several machine learning methods, trained with data on word cooccurrences and, in a few cases, with syntactic features, but nearly no system used semantic information.4 The best systems reached about 65 percent precision and 65 percent recall,5 a performance considered well below the needs of many real-world applications [10]. Comparing performance and trends with previous SensEval events, the impression is that supervised machine learning methods offer little hope of a real breakthrough, the major problem being the need for high-quality training data for all the words to be disambiguated.

The lack of high-performing methods for sense disambiguation may be considered the major obstacle that has prevented extensive use of natural language processing techniques in many areas of information technology, such as information classification and retrieval, query processing, advanced Web search, document warehousing, etc. On the other hand, new emerging applications, like the so-called Semantic Web [11], foster "an extension of the current web in which information is given well-defined meaning,
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 27, NO. 7, JULY 2005
. The authors are with the Dipartimento di Informatica, Università of Roma "La Sapienza," via Salaria 113, 00198 Roma, Italy. E-mail: {navigli, velardi}@di.uniroma.it.
Manuscript received 2 Jan. 2004; revised 14 Apr. 2005; accepted 14 Apr. 2005; published online 12 May 2005. Recommended for acceptance by M. Basu.
For information on obtaining reprints of this article, please send e-mail to: tpami@computer.org, and reference IEEECS Log Number TPAMISI-0003-0104.
1. http://www.acm.org/sigir/.
2. http://trec.nist.gov/.
3. http://www.senseval.org/.
4. One of the systems reported the use of domain labels, e.g., medicine, tourism, etc.
5. A performance noticeably lower than that of Senseval-2.
0162-8828/05/$20.00 © 2005 IEEE. Published by the IEEE Computer Society.