Identifying Relevant Sources for Data Linking using a Semantic Web Index



SLIDE 1

Identifying Relevant Sources for Data Linking using a Semantic Web Index

Andriy Nikolov, Mathieu d’Aquin

Knowledge Media Institute, The Open University, UK

SLIDE 2

How to link a new dataset?

  • What other repositories contain relevant data which I should link to?

– Select the external repository

  • How to select the relevant data instances to link?

– Select the relevant classes within the chosen repository

[Figure: example dataset (bestbuy) describing TV programs, movies, pieces of music, actors and composers; candidate repositories: LinkedMDB, DBPedia, Freebase, MusicBrainz]

SLIDE 3

Selection criteria

  • Additional information about local instances
  • Popularity
  • Degree of overlap

[Figure: publication data from data.open.ac.uk; candidate repositories: DBPedia, DBLP, rae:RKBExplorer]

SLIDE 4

Available information

  • Additional information about resources

– Schema ontology
– Test examples

  • Popularity

– VoiD descriptors

  • Linking repositories

– Catalog of repositories (CKAN)

  • Degree of overlap

– VoiD descriptors (only topic relevance)
– Relevant info hard to obtain on the client side

SLIDE 5

Approach

Search for sources with a potentially high degree of overlap

– Use a subset of entity labels from the original dataset as keywords for entity search
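A minimal sketch of this step, assuming the subset is drawn by uniform random sampling over distinct labels (the slides do not say how the subset is chosen). The function name `sample_keyword_labels` and the example labels are hypothetical.

```python
import random


def sample_keyword_labels(labels, sample_size, seed=0):
    """Pick a subset of distinct, non-empty entity labels to use as search keywords.

    Random sampling is an assumption; any other subset selection would fit here.
    """
    distinct = sorted({label.strip() for label in labels if label.strip()})
    if sample_size >= len(distinct):
        return distinct
    return random.Random(seed).sample(distinct, sample_size)


# Hypothetical labels taken from a local dataset (note the duplicate and the empty label).
labels = ["Nature", "Science", "Nature ", "", "Cell"]
keywords = sample_keyword_labels(labels, sample_size=2)
```

Each selected label would then be sent as a keyword query to the entity search service.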

SLIDE 6

Approach

Aggregate results

– Group instances occurring in the returned result sets by their source repositories

SLIDE 7

Approach

Rank sources

– Sort by number of individuals returned in search results
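The grouping and ranking steps can be sketched as follows. This assumes the search results have already been collected as (instance URI, source) pairs; the function name `rank_sources` and the example data are hypothetical, and counting each distinct individual once per source is an assumption.

```python
from collections import defaultdict


def rank_sources(search_results):
    """Group returned instances by source and sort sources by instance count.

    search_results: iterable of (instance_uri, source) pairs.
    Returns a list of (source, count) pairs, highest count first.
    """
    instances_by_source = defaultdict(set)
    for instance_uri, source in search_results:
        # A set keeps each individual once, even if several queries return it.
        instances_by_source[source].add(instance_uri)
    counts = [(source, len(uris)) for source, uris in instances_by_source.items()]
    # Sort by descending count; break ties by source name for a stable order.
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical search results.
results = [
    ("http://example.org/a/1", "source-a"),
    ("http://example.org/a/2", "source-a"),
    ("http://example.org/b/1", "source-b"),
    ("http://example.org/a/1", "source-a"),  # duplicate hit from another query
]
ranking = rank_sources(results)
```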

SLIDE 8

Approach

Select “most relevant” class

– Select, in each source, the class that covers the most instances
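A minimal sketch of the class selection for one source, assuming the classes of each returned instance are known. The function name `most_covering_class` and the instance identifiers are hypothetical; the class names are the ones used as examples in these slides.

```python
from collections import Counter


def most_covering_class(instance_classes):
    """Return the class that the largest number of instances belong to.

    instance_classes: dict mapping instance URI -> iterable of class names.
    Ties are broken alphabetically; returns None if there are no classes.
    """
    counts = Counter()
    for classes in instance_classes.values():
        counts.update(set(classes))  # count each class once per instance
    if not counts:
        return None
    return min(counts, key=lambda name: (-counts[name], name))


# Hypothetical instances returned from a single source.
instance_classes = {
    "ex:i1": {"dbpedia:Person", "dbpedia:MusicArtist"},
    "ex:i2": {"dbpedia:Person"},
    "ex:i3": {"dbpedia:Person", "dbpedia:MusicArtist"},
}
selected = most_covering_class(instance_classes)
```

In this example the generic class wins because it covers every instance, which is the kind of over-general choice that pure coverage counting can produce.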
SLIDE 9

Issues: imprecise results

  • Main cause: ambiguous instance labels
  • Inclusion of irrelevant sources

– E.g., DBLP for movie score composers

  • Selection of inappropriate classes within the selected source

– Too generic: e.g., dbpedia:Person vs dbpedia:MusicArtist
– Irrelevant: e.g., akt:Publication-Reference (journal volume) vs akt:Journal

SLIDE 10

Filtering results

Determine potentially irrelevant classes

– Use state-of-the-art schema matching to select relevant classes

SLIDE 11

Filtering results

Filter out irrelevant search results

– Only consider search result instances belonging to “approved” classes
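The filtering step can be sketched as below, assuming the set of "approved" classes has already been produced by a schema matcher. The function name `filter_by_approved_classes`, the tuple layout, and the example data are hypothetical.

```python
def filter_by_approved_classes(results, approved_classes):
    """Keep only search results that belong to at least one approved class.

    results: list of (instance_uri, source, classes) tuples.
    approved_classes: class names accepted by schema matching (assumed given).
    """
    approved = set(approved_classes)
    return [result for result in results if approved & set(result[2])]


# Hypothetical search results for a journal dataset.
results = [
    ("ex:1", "source-a", {"akt:Journal"}),
    ("ex:2", "source-a", {"akt:Publication-Reference"}),
    ("ex:3", "source-b", set()),  # instance with no known class
]
kept = filter_by_approved_classes(results, {"akt:Journal"})
```

The filtered results can then be grouped and ranked by source exactly as before.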

SLIDE 12

Preliminary experiments

  • Datasets

– ORO journals (data.open.ac.uk): 3110 instances
– LinkedMDB films: 400 instances
– LinkedMDB music contributors: 400 instances

  • External components

– Semantic index: Sig.ma
– Ontology matching techniques: CIDER, instance-based schema mappings retrieved from BTC2009 dataset

SLIDE 13

Preliminary experiments

Before filtering     + / -     After filtering     + / -
rae2001 (RKB)        +         rae2001 (RKB)       +
dotac (RKB)          +         DBPedia             +
DBPedia              +         dblp.l3s.de         +
oai (RKB)            +         Freebase            +
dblp.l3s.de          +         DBLP (RKB)          +
wordnet (RKB)                  eprints (RKB)       +
bibsonomy
eprints (RKB)        +
Freebase             +
www.examiner.com

  • Performance measure:

– Proportion of relevant sources among the top-10 returned results
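A small sketch of this measure, assuming relevance judgments for the sources are available as a set. The function name `precision_at_k` and the example values are hypothetical; dividing by the number of sources actually returned (when fewer than k) is an assumption.

```python
def precision_at_k(ranked_sources, relevant_sources, k=10):
    """Proportion of relevant sources among the top-k returned sources."""
    top = ranked_sources[:k]
    if not top:
        return 0.0
    relevant = set(relevant_sources)
    return sum(1 for source in top if source in relevant) / len(top)


# Hypothetical ranking and relevance judgments.
ranked = ["source-a", "source-b", "source-c", "source-d"]
relevant = {"source-a", "source-c"}
score = precision_at_k(ranked, relevant, k=4)
```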

SLIDE 14
Preliminary experiments

  • Summary:

– Top-ranked returned repositories are largely relevant from the point of view of linking
– Filtering using schema matching techniques greatly improves precision (all remaining sources are relevant)
– … but at the expense of some recall

SLIDE 15

Future work

  • Improving the quality of results

– E.g., estimating the potential loss of precision/recall for different filtering decisions

  • Integrating with the data linking workflow

– Automatically pre-configuring the data linking algorithm

  • Repository search as a potentially useful semantic search use case (in addition to entity and document search)

SLIDE 16

Questions?

Thanks for your attention