Automatically Annotating Text with Linked Open Data

Feb 07, 2024

SLIDE 1

ailab.ijs.si

Delia Rusu, Blaž Fortuna, Dunja Mladenić Jožef Stefan Institute

Automatically Annotating Text with Linked Open Data

SLIDE 2


Motivation: Annotating Text with LOD

DBpedia, OpenCyc, WordNet

SLIDE 3


Overview

Related work
Algorithms for annotating with LOD
  PageRank
  ContextSimilarity
Evaluation datasets
  WordNet
  OpenCyc
  DBpedia

Conclusions and Future Work

SLIDE 4


Related Work

Supervised approaches:

Parallel corpora: Chan et al. – SemEval 2007

Knowledge-based:

WordNet::Similarity package – Pedersen et al. 2004
Usage of context-free grammars to validate semantic interconnections – Navigli and Velardi, 2005
Formal document structure description, hypothesis building, trying to reason using Cyc – Curtis et al. 2006
Disambiguate Wikipedia articles into Cyc concepts – Medelyan and Legg, 2008
Adapted versions of PageRank – Mihalcea et al. 2004, Agirre and Soroa, 2009

Simple knowledge-based approaches compete with state-of-the-art supervised approaches when using a high-quality knowledge base – Ponzetto and Navigli, 2010.

SLIDE 5


LOD Dataset Representation

WordNet (VUA)

SLIDE 6


LOD Dataset Representation

Relation types: rdf:type, rdfs:subClassOf, …

Example: rdf:type

rdf:resource="http://purl.org/vocabularies/princeton/wn30/synset-belief-noun-1"
rdf:resource="http://purl.org/vocabularies/princeton/wn30/wordsense-values-noun-1"

SLIDE 7


Algorithms: PageRank

Example for the word values (human readable description of the resource)

Each description below is a candidate resource for a word belonging to the text fragment:

beliefs of a person or social group in which they have an emotional investment (either for or against something); "he has very conservatives values"
(an ideal accepted by some individual or group) "he has old-fashioned values"
((music) the relative duration of a musical note)

SLIDE 8


Algorithms: PageRank

Algorithm steps:

set the initial value of each graph vertex to 0 if the vertex does not represent a candidate resource, or to 1/R otherwise, with R being the total number of candidate resources

the PageRank value for each vertex i (PR[Vi]) is:
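The formula itself did not survive in this transcript. As a rough illustration only, the sketch below assumes the standard PageRank recurrence, PR[Vi] = (1 - d) + d * sum over vertices Vj linking to Vi of PR[Vj] / |Out(Vj)|, and uses the initialization described in the step above. The function name `pagerank`, the damping factor d = 0.85, the fixed iteration count, and the toy graph with its vertex names are all assumptions made for the example, not details taken from the slides.

```python
def pagerank(out_links, candidates, damping=0.85, iterations=50):
    """Iterate the standard PageRank recurrence over a directed graph.

    out_links maps each vertex to the list of vertices it links to.
    candidates is the set of vertices that are candidate resources.
    """
    vertices = set(out_links)
    for targets in out_links.values():
        vertices.update(targets)

    # Initialization from the slide: 1/R for candidate resources, 0 otherwise.
    r = len(candidates)
    score = {v: (1.0 / r if v in candidates else 0.0) for v in vertices}

    # Invert the adjacency lists so each vertex knows who links to it.
    in_links = {v: [] for v in vertices}
    for source, targets in out_links.items():
        for target in targets:
            in_links[target].append(source)

    for _ in range(iterations):
        score = {
            v: (1.0 - damping)
            + damping * sum(score[u] / len(out_links[u]) for u in in_links[v])
            for v in vertices
        }
    return score


# Hypothetical toy graph: two candidate resources for "values", plus
# candidates for other context words; every edge is listed in both directions.
out_links = {
    "values:belief": ["belief", "culture"],
    "belief": ["values:belief", "tradition"],
    "tradition": ["belief"],
    "culture": ["values:belief"],
    "values:music": ["duration"],
    "duration": ["values:music"],
}
candidates = {"values:belief", "values:music", "culture", "tradition"}

scores = pagerank(out_links, candidates)
best = max(["values:belief", "values:music"], key=scores.get)
print(best, round(scores[best], 3))
```

In this toy graph the belief sense is connected to the candidates of the surrounding words, so it ends up with the higher score and would be chosen as the annotation for "values".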

SLIDE 9


Algorithms: ContextSimilarity

In a country as diverse and complex as India, it is not surprising to find that people here reflect the rich glories of the past, the culture, traditions and values relative to geographic locations and the numerous distinctive manners, habits and food that will always remain truly Indian.

Candidate resource descriptions:

beliefs of a person or social group in which they have an emotional investment (either for or against something); "he has very conservatives values"
(an ideal accepted by some individual or group) "he has old-fashioned values"
((music) the relative duration of a musical note)

Candidate neighborhood resource descriptions:

belief: (any cognitive content held as true)
ideal: (the idea of something that is perfect; something that one hopes to attain)
duration, continuance: (the period of time during which something continues)

SLIDE 10


Algorithms: ContextSimilarity

ContextSimilarity(resource, wa) returns Similarity
    Similarity = 0
    NR = GetNeighborhoodResources(resource)
    CW = GetContext(wa)
    for i = 1 to Size(NR) do
        CS = simcos(NR[i], CW)
        Similarity = Similarity + CS
    end for
    return Similarity

Example word: values, with the three candidate resource descriptions listed on the previous slide.
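The pseudocode on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `simcos` is taken to be cosine similarity over raw bag-of-words counts, the neighborhood resources are passed in directly as a list of description strings, and the tokenizer, the function names, and the grouping of descriptions per candidate are all hypothetical. The slides do not say how the text was tokenized or weighted.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def context_similarity(neighborhood_descriptions, context):
    """Sum the cosine similarity of each neighborhood description with the context."""
    context_vector = bag_of_words(context)
    return sum(
        cosine_similarity(bag_of_words(description), context_vector)
        for description in neighborhood_descriptions
    )


# Context and descriptions are taken from the slides; which descriptions
# belong to which candidate's neighborhood is an assumption.
context = (
    "people here reflect the rich glories of the past, the culture, "
    "traditions and values relative to geographic locations"
)
neighborhoods = {
    "values (beliefs)": [
        "beliefs of a person or social group in which they have an emotional investment",
        "any cognitive content held as true",
    ],
    "values (music)": [
        "the relative duration of a musical note",
        "the period of time during which something continues",
    ],
}

ranking = sorted(
    neighborhoods,
    key=lambda name: context_similarity(neighborhoods[name], context),
    reverse=True,
)
for name in ranking:
    print(name, round(context_similarity(neighborhoods[name], context), 3))
```

With raw counts, function words such as "the" and "of" dominate the vectors, so a practical version would likely need stop-word removal or term weighting before the ranking becomes meaningful.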

SLIDE 11


Evaluation Datasets

Expert annotators

WordNet: SemEval 2007 word sense disambiguation Task 7, Coarse-Grained English All-Words: 2269 annotated words, 1591 polysemous (WordNet)
OpenCyc: a subset of 50 words from the SemEval 2007 Task 7 corpus, each with more than one candidate resource, manually annotated by 2 annotators
DBpedia: a subset of 56 words from the SemEval 2007 Task 7 corpus, each with more than one candidate resource, manually annotated by 1 annotator

Crowdsourcing annotators

WordNet and OpenCyc: A subset of 325 words for WordNet, 177 for OpenCyc, from the SemEval 2007 Task 7 corpus

SLIDE 12


Evaluation Results

Expert Annotators

Algorithm   WordNet   OpenCyc   DBpedia
CS          75.24     28.00     17.86
PR          73.44     40.00     21.43
Random      52.43     24.00     14.28

Crowdsourcing Annotators

Algorithm   WordNet   OpenCyc
CS          63.24     37.55

CS = ContextSimilarity, PR = PageRank.

SLIDE 13


Conclusions

We investigated the applicability of two common approaches, taken from the word sense disambiguation community, for annotating text with LOD datasets:

relying on the dataset relationship structure (PageRank)
taking advantage of the human-readable description of a resource as well as neighborhood relationships defined for that resource (ContextSimilarity)

Three datasets: WordNet, DBpedia and OpenCyc. Experiments revealed the shortcomings of the current state-of-the-art word sense disambiguation methods when applied to different LOD datasets

SLIDE 14


Conclusions

Purpose for which the dataset was developed

WordNet: dictionary-based taxonomy
  highest ratio of covered words
  candidate resources correspond directly with the possible word meanings

OpenCyc: common-sense knowledge base primarily developed for modeling and reasoning
  distinctions between resources depend on the reasoning task
  contains concepts which are created to support specific tasks (reasoning, paraphrasing, etc.)

DBpedia: an effort to extract structured information from Wikipedia
  rich set of instances (named entities: places, people, and organizations)
  few common words covered
  named entities which have common words (e.g. "Talk" is a song by the British alternative rock band Coldplay)

SLIDE 15


Conclusions

Human-readable descriptions

WordNet: written similar to dictionary entries; also contain examples
  easier to understand by the general public

OpenCyc: documentation to the ontology engineer using it to model some world phenomena
  hard to understand by the general public

DBpedia: written like encyclopedia entries
  very short for some resources

Relations between resources

WordNet: most relation types are defined between the same parts of speech
  useful relations between different parts of speech are missing

DBpedia: contain infrastructure relationships (e.g. wikiPageUsesTemplate is the most common relation in DBpedia infobox triplets)
  should be disregarded as they introduce noise

SLIDE 16


Future Work

Further develop text annotation methods which can offer better performance on datasets such as OpenCyc and DBpedia, and can be transferred to other datasets
Investigate the potential for combining resources from different datasets in the same task
Include elements of active learning:
  having users in the loop provide a few annotations, to enhance the discovery of hard-to-disambiguate text fragments
  acquire labeled data for performing algorithm optimization

SLIDE 17


Thank You for Your Attention!