Identifying Relevant Sources for Data Linking using a Semantic Web Index

Mar 23, 2023

SLIDE 1

Identifying Relevant Sources for Data Linking using a Semantic Web Index

Andriy Nikolov, Mathieu d’Aquin

Knowledge Media Institute, The Open University, UK

SLIDE 2

How to link a new dataset?

What other repositories contain relevant data which I should link to?

– Select the external repository

How to select the relevant data instances to link?

– Select the relevant classes within the chosen repository

[Diagram labels: bestbuy; TV programs, movies, pieces of music, actors, composers; LinkedMDB, DBPedia, Freebase, MusicBrainz; "?"]

SLIDE 3

Selection criteria

Additional information about local instances
Popularity
Degree of overlap

[Diagram labels: data.open.ac.uk; Publication data; DBPedia, DBLP, rae:RKBExplorer]

SLIDE 4

Available information

Additional information about resources

– Schema ontology
– Test examples

Popularity

– VoiD descriptors

Linking repositories

– Catalog of repositories (CKAN)

Degree of overlap

– VoiD descriptors (only topic relevance)
– Relevant info hard to obtain on the client side

SLIDE 5

Approach

Search for sources with potentially high degree of overlap

– Use a subset of entity labels from the original dataset as keywords for entity search
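
A minimal sketch of the keyword selection step. The slides do not say how the subset of labels is chosen; uniform random sampling, the function name `sample_keywords`, and the example labels are assumptions made for illustration. The actual call to a semantic index is not shown.

```python
import random

def sample_keywords(labels, k, seed=0):
    """Pick up to k distinct, non-empty labels to use as search keywords."""
    # Deduplicate and drop blank labels; sort so the sample is reproducible.
    unique = sorted({label.strip() for label in labels if label.strip()})
    rng = random.Random(seed)
    return rng.sample(unique, min(k, len(unique)))

# Hypothetical entity labels taken from the dataset to be linked.
labels = ["Blade Runner", "Alien", "Alien", " ", "Solaris", "Stalker"]
keywords = sample_keywords(labels, k=3)
```

Each keyword would then be sent to the entity search service as a separate query.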

SLIDE 6

Approach

Aggregate results

– Group instances occurring in returned result sets by their source repositories
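
The grouping step could look roughly like this. Identifying a source repository by the host part of the instance URI is an assumption, not something the slides specify; `group_by_source` and the example URIs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

def group_by_source(result_sets):
    """Map each source (here: URI host) to the set of instance URIs it returned."""
    grouped = defaultdict(set)
    for hits in result_sets:          # one list of hits per keyword query
        for uri in hits:
            grouped[urlparse(uri).netloc].add(uri)
    return dict(grouped)

# Hypothetical search results for two keyword queries.
result_sets = [
    ["http://dbpedia.org/resource/Alien_(film)",
     "http://data.linkedmdb.org/resource/film/1"],
    ["http://dbpedia.org/resource/Solaris_(1972_film)",
     "http://dbpedia.org/resource/Alien_(film)"],
]
grouped = group_by_source(result_sets)
```

Using sets means an instance returned by several queries is counted once per source.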

SLIDE 7

Approach

Rank sources

– Sort by number of individuals returned in search results
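
As a sketch, ranking reduces to sorting sources by the number of distinct individuals found for each. The alphabetical tie-break, the name `rank_sources`, and the example data are assumptions.

```python
def rank_sources(grouped):
    """Order sources by descending number of returned individuals."""
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(grouped, key=lambda source: (-len(grouped[source]), source))

# Hypothetical grouping: source -> set of instance identifiers.
grouped = {
    "source-b": {"b1"},
    "source-a": {"a1", "a2", "a3"},
    "source-c": {"c1"},
}
ranking = rank_sources(grouped)
```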

SLIDE 8

Approach

Select “most relevant” class

– In each source, select the class that covers the most instances
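
One way to sketch the class selection for a single source is to count how many returned instances carry each class. The function name `most_covering_class` and the instance identifiers are hypothetical; the class names are only used as examples.

```python
from collections import Counter

def most_covering_class(instance_classes):
    """Return the class attached to the largest number of instances, or None."""
    counts = Counter(
        cls
        for classes in instance_classes.values()
        for cls in set(classes)       # count each class once per instance
    )
    return counts.most_common(1)[0][0] if counts else None

# Hypothetical instances returned from one source, with their classes.
instance_classes = {
    "ex:i1": ["dbpedia:Person", "dbpedia:MusicArtist"],
    "ex:i2": ["dbpedia:Person"],
    "ex:i3": ["dbpedia:Person", "dbpedia:MusicArtist"],
}
best = most_covering_class(instance_classes)
```

With this data the more generic class wins, because it covers all three instances.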

SLIDE 9

Issues: imprecise results

Main cause: ambiguous instance labels
Inclusion of irrelevant sources

– E.g., DBLP for movie score composers

Selection of inappropriate classes within the selected source

– Too generic: e.g., dbpedia:Person vs dbpedia:MusicArtist
– Irrelevant: e.g., akt:Publication-Reference (journal volume) vs akt:Journal

SLIDE 10

Filtering results

Determine potentially irrelevant classes

– Use state-of-the-art schema matching to select relevant classes

SLIDE 11

Filtering results

Filter out irrelevant search results

– Only consider search result instances belonging to “approved” classes
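
A sketch of the filtering step, assuming the schema matcher has already produced a set of "approved" classes. How that set is computed is not shown here; `filter_hits` and the example data are hypothetical.

```python
def filter_hits(instance_classes, approved):
    """Keep only instances that belong to at least one approved class."""
    return {
        uri: classes
        for uri, classes in instance_classes.items()
        if approved.intersection(classes)
    }

# Hypothetical search hits with their classes.
instance_classes = {
    "ex:i1": ["dbpedia:Person", "dbpedia:MusicArtist"],
    "ex:i2": ["dbpedia:Person"],
    "ex:i3": ["akt:Publication-Reference"],
}
approved = {"dbpedia:MusicArtist"}
kept = filter_hits(instance_classes, approved)
```

Sources would then be grouped and ranked again using only the instances in `kept`.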

SLIDE 12

Preliminary experiments

Datasets

– ORO journals (data.open.ac.uk): 3110 instances
– LinkedMDB films: 400 instances
– LinkedMDB music contributors: 400 instances

External components

– Semantic index: Sig.ma
– Ontology matching techniques: CIDER, instance-based schema mappings retrieved from BTC2009 dataset

SLIDE 13

Preliminary experiments

[Table: sources returned before filtering and after filtering, each marked + / -. Sources listed: rae2001 (RKB), dotac (RKB), DBPedia, dblp.l3s.de, ai (RKB), Freebase, DBLP (RKB), wordnet (RKB), eprints (RKB), bibsonomy, www.examiner.com]

Performance measure:

– Proportion of relevant sources among the top-10 returned results
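
The measure can be written as precision at k with k = 10. In this sketch, dividing by the number of results actually returned when fewer than k are available is an assumption, and the source names are placeholders, not experimental data.

```python
def precision_at_k(ranked_sources, relevant, k=10):
    """Proportion of relevant sources among the top-k ranked sources."""
    top = ranked_sources[:k]
    if not top:
        return 0.0
    return sum(1 for source in top if source in relevant) / len(top)

# Placeholder ranking and relevance judgements.
ranked = ["s1", "s2", "s3", "s4"]
relevant = {"s1", "s3"}
score = precision_at_k(ranked, relevant)
```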

SLIDE 14

Preliminary experiments

Summary:

– Top-ranked returned repositories are largely relevant from the point of view of linking
– Filtering using schema matching techniques greatly improves precision (all remaining sources are relevant)
– … but at the expense of some recall

SLIDE 15

Future work

Improving the quality of results

– E.g., estimating the potential loss of precision/recall for different filtering decisions

Integrating with the data linking workflow

– Automatically pre-configuring the data linking algorithm

Repository search as a potentially useful semantic search use case (in addition to entity and document search)

SLIDE 16

Questions?

Thanks for your attention