Automatically Annotating Text with Linked Open Data
Delia Rusu, Blaž Fortuna, Dunja Mladenić
Jožef Stefan Institute
ailab.ijs.si
Motivation: Annotating Text with LOD
DBpedia, OpenCyc, WordNet
Overview
Related work
Algorithms for annotating with LOD
- PageRank
- ContextSimilarity
Evaluation Datasets
- WordNet
- OpenCyc
- DBpedia
Conclusions and Future Work
Related Work
Supervised approaches:
- Parallel corpora (Chan et al., SemEval 2007)
Knowledge-based:
- WordNet::Similarity package (Pedersen et al., 2004)
- Usage of context-free grammars to validate semantic interconnections (Navigli and Velardi, 2005)
- Formal document structure description, hypothesis building, and attempts to reason using Cyc (Curtis et al., 2006)
- Disambiguation of Wikipedia articles into Cyc concepts (Medelyan and Legg, 2008)
- Adapted versions of PageRank (Mihalcea et al., 2004; Agirre and Soroa, 2009)
Given a high-quality knowledge base, simple knowledge-based approaches compete with state-of-the-art supervised approaches (Ponzetto and Navigli, 2010).
LOD Dataset Representation
WordNet (VUA)
LOD Dataset Representation
Relations: rdf:type, rdfs:subClassOf, …
Resources (rdf:resource):
rdf:resource="http://purl.org/vocabularies/princeton/wn30/synset-belief-noun-1"
rdf:resource="http://purl.org/vocabularies/princeton/wn30/wordsense-values-noun-1"
Example: rdf:type
Algorithms: PageRank
Example for the word "values" (human-readable descriptions of the resources):
- beliefs of a person or social group in which they have an emotional investment (either for or against something); "he has very conservatives values"
- an ideal accepted by some individual or group; "he has old-fashioned values"
- (music) the relative duration of a musical note
Each description belongs to a candidate resource for a word in the text fragment.
Algorithms: PageRank
Algorithm steps:
set each graph vertex to 0 if the vertex does not represent a candidate resource, or to 1/R if it does, with R being the total number of candidate resources
the PageRank value for each vertex i (PR[Vi]) is:
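The update formula shown on the slide is not preserved in this transcript. As a rough sketch only, the Python code below combines the initialization described above with the standard damped PageRank update, PR[Vi] = (1 - d)/N + d * (sum of PR[Vj]/|Out(Vj)| over all vertices Vj linking to Vi). The damping factor d = 0.85, the iteration count, the function name and the toy vertex labels are assumptions made for illustration and are not taken from the presentation.

```python
def pagerank(edges, candidates, damping=0.85, iterations=30):
    """Damped PageRank over directed (source, target) edges.

    Assumption: standard update rule; only the initialization
    (1/R for candidate resources, 0 otherwise) follows the slide.
    """
    vertices = sorted({v for edge in edges for v in edge} | set(candidates))
    out_links = {v: [] for v in vertices}
    in_links = {v: [] for v in vertices}
    for source, target in edges:
        out_links[source].append(target)
        in_links[target].append(source)

    # Initialization from the slide: 1/R for candidates, 0 for the rest.
    r = len(candidates)
    scores = {v: (1.0 / r if v in candidates else 0.0) for v in vertices}

    n = len(vertices)
    for _ in range(iterations):
        new_scores = {}
        for v in vertices:
            # Every u in in_links[v] has at least one outgoing edge (to v).
            incoming = sum(scores[u] / len(out_links[u]) for u in in_links[v])
            new_scores[v] = (1.0 - damping) / n + damping * incoming
        scores = new_scores
    return scores


# Hypothetical toy graph: three candidate resources for "values" and a few
# neighboring resources. Relations are treated as undirected here, so each
# one is added in both directions (also an assumption).
relations = [
    ("values-1", "belief"),
    ("values-1", "culture"),
    ("values-2", "ideal"),
    ("values-3", "duration"),
]
edges = relations + [(target, source) for source, target in relations]
candidates = {"values-1", "values-2", "values-3"}

scores = pagerank(edges, candidates)
best = max(candidates, key=scores.get)
print(best, round(scores[best], 4))
```

In this toy graph the candidate linked to two neighboring resources ends up with the highest score; with real LOD data the graph would be built from the dataset's own relations.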
Algorithms: ContextSimilarity
Text fragment:
In a country as diverse and complex as India, it is not surprising to find that people here reflect the rich glories of the past, the culture, traditions and values relative to geographic locations and the numerous distinctive manners, habits and food that will always remain truly Indian.
Candidate resource descriptions:
- beliefs of a person or social group in which they have an emotional investment (either for or against something); "he has very conservatives values"
- an ideal accepted by some individual or group; "he has old-fashioned values"
- (music) the relative duration of a musical note
Candidate neighborhood resource descriptions:
- belief: any cognitive content held as true
- ideal: the idea of something that is perfect; something that one hopes to attain
- duration, continuance: the period of time during which something continues
Algorithms: ContextSimilarity
ContextSimilarity(resource, wa) returns Similarity
    Similarity = 0
    NR = GetNeighborhoodResources(resource)
    CW = GetContext(wa)
    for i = 1 to Size(NR) do
        CS = simcos(NR[i], CW)
        Similarity = Similarity + CS
    end for
    return Similarity
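The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes that simcos is a cosine similarity over plain word counts, that the neighborhood of a candidate is given as a list of description strings (here the candidate's own description plus one neighbor's), and that the context is a plain string. The helper names bag_of_words, cosine_similarity and context_similarity are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a deliberately simple text representation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def context_similarity(neighborhood_descriptions, context):
    """Sum of similarities between each neighborhood description and the context."""
    context_words = bag_of_words(context)
    similarity = 0.0
    for description in neighborhood_descriptions:
        similarity += cosine_similarity(bag_of_words(description), context_words)
    return similarity


# Part of the example text fragment, used as the context of the word "values".
context = "the culture, traditions and values relative to geographic locations"

# Descriptions taken from the example; the dictionary keys are hypothetical labels.
neighborhoods = {
    "belief": [
        "beliefs of a person or social group in which they have an emotional investment",
        "any cognitive content held as true",
    ],
    "ideal": [
        "an ideal accepted by some individual or group",
        "the idea of something that is perfect; something that one hopes to attain",
    ],
    "duration": [
        "the relative duration of a musical note",
        "the period of time during which something continues",
    ],
}

scores = {
    name: context_similarity(descriptions, context)
    for name, descriptions in neighborhoods.items()
}
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(name, round(score, 3))
```

Because this sketch keeps function words such as "the", they influence the scores; a practical setup would likely filter or reweight them.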
Evaluation Datasets
Expert annotators
- WordNet: SemEval 2007 word sense disambiguation Task 7, Coarse-Grained English All-Words; 2269 annotated words, 1591 of them polysemous (WordNet)
- OpenCyc: a subset of 50 words from the SemEval 2007 Task 7 corpus, each with more than one candidate resource, manually annotated by 2 annotators
- DBpedia: a subset of 56 words from the SemEval 2007 Task 7 corpus, each with more than one candidate resource, manually annotated by 1 annotator
Crowdsourcing annotators
- WordNet and OpenCyc: subsets of 325 words (WordNet) and 177 words (OpenCyc) from the SemEval 2007 Task 7 corpus
Evaluation Results
Expert Annotators
Algorithm   WordNet   OpenCyc   DBpedia
CS          75.24     28.00     17.86
PR          73.44     40.00     21.43
Random      52.43     24.00     14.28

Crowdsourcing Annotators
Algorithm   WordNet   OpenCyc
CS          63.24     37.55

(CS = ContextSimilarity, PR = PageRank)
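As a small reading aid for the expert-annotator results, the following Python snippet computes how far each algorithm lies above the Random baseline on each dataset. The numbers are copied from the results above; the dictionary layout and the function name gain_over_random are illustrative assumptions.

```python
# Expert-annotator scores as reported on the slide.
results = {
    "CS": {"WordNet": 75.24, "OpenCyc": 28.00, "DBpedia": 17.86},
    "PR": {"WordNet": 73.44, "OpenCyc": 40.00, "DBpedia": 21.43},
    "Random": {"WordNet": 52.43, "OpenCyc": 24.00, "DBpedia": 14.28},
}


def gain_over_random(results):
    """Difference between each algorithm's score and the Random baseline."""
    baseline = results["Random"]
    return {
        algorithm: {
            dataset: round(score - baseline[dataset], 2)
            for dataset, score in scores.items()
        }
        for algorithm, scores in results.items()
        if algorithm != "Random"
    }


gains = gain_over_random(results)
for algorithm, per_dataset in gains.items():
    print(algorithm, per_dataset)
```

On OpenCyc, for example, PR lies 16.00 points above the baseline while CS lies 4.00 points above it, which makes the gap between WordNet and the other two datasets easy to see.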
Conclusions
We investigated the applicability of two common approaches, taken from the word sense disambiguation community, for annotating text with LOD datasets:
- relying on the dataset relationship structure (PageRank)
- taking advantage of the human-readable description of a resource as well as the neighborhood relationships defined for that resource (ContextSimilarity)
Three datasets: WordNet, DBpedia and OpenCyc.
Experiments revealed the shortcomings of current state-of-the-art word sense disambiguation methods when applied to different LOD datasets.
Conclusions
Purpose for which the dataset was developed
- WordNet: dictionary-based taxonomy; highest ratio of covered words; candidate resources correspond directly with the possible word meanings
- OpenCyc: common-sense knowledge base primarily developed for modeling and reasoning; distinctions between resources depend on the reasoning task; contains concepts which are created to support specific tasks (reasoning, paraphrasing, etc.)
- DBpedia: an effort to extract structured information from Wikipedia; rich set of instances (named entities: places, people, and organizations); few common words covered; named entities which have common words (e.g. "Talk" is a song by the British alternative rock band Coldplay)
Conclusions
Human-readable descriptions
- WordNet: written similar to dictionary entries, also contain examples; easier to understand by the general public
- OpenCyc: documentation for the ontology engineer using it to model some world phenomena; hard to understand by the general public
- DBpedia: written like encyclopedia entries; very short for some resources
Relations between resources
- WordNet: most relation types are defined between the same parts of speech; useful relations between different parts of speech are missing
- DBpedia: contain infrastructure relationships (e.g. wikiPageUsesTemplate is the most common relation in DBpedia infobox triplets), which should be disregarded as they introduce noise
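The remark that infrastructure relationships introduce noise can be sketched as a filtering step applied before building the graph. The Python code below is an illustration under assumptions: triples are plain (subject, predicate, object) tuples, relations are matched by the last segment of the predicate URI, and the example subjects, objects and the performedBy relation are hypothetical placeholders rather than actual DBpedia data.

```python
# Relation names to disregard; wikiPageUsesTemplate is the example from the slide.
NOISY_RELATIONS = {"wikiPageUsesTemplate"}


def local_name(uri):
    """Last segment of a URI, after the final '/' or '#'."""
    return uri.rstrip("/").rsplit("/", 1)[-1].rsplit("#", 1)[-1]


def drop_noisy_relations(triples, noisy=NOISY_RELATIONS):
    """Keep only triples whose predicate is not an infrastructure relation."""
    return [
        (subject, predicate, obj)
        for subject, predicate, obj in triples
        if local_name(predicate) not in noisy
    ]


# Hypothetical triples for illustration only.
triples = [
    ("ex:Talk_song", "http://dbpedia.org/property/wikiPageUsesTemplate", "ex:Some_template"),
    ("ex:Talk_song", "http://example.org/relation/performedBy", "ex:Coldplay"),
]

kept = drop_noisy_relations(triples)
print(kept)
```

Only the remaining triples would then be turned into graph edges for an algorithm such as PageRank.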
Future Work
- Further develop text annotation methods which offer better performance on datasets such as OpenCyc and DBpedia, and which can be transferred to other datasets
- Investigate the potential for combining resources from different datasets in the same task
- Include elements of active learning, having users in the loop provide a few annotations, to:
  - enhance the discovery of hard-to-disambiguate text fragments
  - acquire labeled data for performing algorithm optimization