
Identifying Relevant Sources for Data Linking using a Semantic Web Index



  1. Identifying Relevant Sources for Data Linking using a Semantic Web Index
     Andriy Nikolov, Mathieu d’Aquin
     Knowledge Media Institute, The Open University, UK

  2. How to link a new dataset?
     • What other repositories contain relevant data which I should link to?
       – Select the external repository
     • How to select the relevant data instances to link?
       – Select the relevant classes within the chosen repository
     [Diagram: candidate repositories (LinkedMDB, DBpedia, Freebase, MusicBrainz, bestbuy) and candidate classes (TV programs, movies, pieces of music, actors, composers)]

  3. Selection criteria
     • Additional information about local instances
     • Popularity
     • Degree of overlap
     [Diagram labels: DBLP, data.open.ac.uk, rae:RKBExplorer, Publication data, DBpedia]

  4. Available information
     • Additional information about resources
       – Schema ontology
       – Test examples
     • Popularity
       – VoID descriptors
     • Linking repositories
       – Catalog of repositories (CKAN)
     • Degree of overlap
       – VoID descriptors (only topic relevance)
       – Relevant info hard to obtain on the client side

  5. Approach: search for sources with a potentially high degree of overlap
     – Use a subset of entity labels from the original dataset as keywords for entity search
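
The search step could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `sample_labels` and `search_all` are hypothetical helper names, the random sampling strategy is an assumption (the slide only says "a subset"), and `search_fn` stands in for whatever entity-search service is used.

```python
import random
from typing import Callable, Iterable, List


def sample_labels(labels: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Pick up to k distinct, non-empty labels to use as search keywords."""
    # Deduplicate and sort first so the sample is reproducible for a given seed.
    unique = sorted({label.strip() for label in labels if label.strip()})
    rng = random.Random(seed)
    return rng.sample(unique, min(k, len(unique)))


def search_all(keywords: Iterable[str],
               search_fn: Callable[[str], List[str]]) -> List[str]:
    """Run one entity search per keyword and collect every returned URI."""
    hits: List[str] = []
    for keyword in keywords:
        hits.extend(search_fn(keyword))
    return hits
```

Passing the search function in as a parameter keeps the sketch independent of any particular index API.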

  6. Approach: aggregate results
     – Group instances occurring in returned result sets by their source repositories

  7. Approach: rank sources
     – Sort by number of individuals returned in search results
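
The aggregation and ranking steps might look like the sketch below. It assumes that the host part of an instance URI identifies its source repository, which is a simplification introduced here; the slides do not say how sources are determined. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple
from urllib.parse import urlparse


def group_by_source(instance_uris: Iterable[str]) -> Dict[str, Set[str]]:
    """Group instance URIs by source, taking the URI host as the source (assumption)."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for uri in instance_uris:
        host = urlparse(uri).netloc
        if host:  # skip strings that are not absolute URIs
            groups[host].add(uri)  # a set, so repeated hits count once
    return dict(groups)


def rank_sources(groups: Dict[str, Set[str]]) -> List[Tuple[str, int]]:
    """Sort sources by number of distinct instances, largest first."""
    counts = [(source, len(uris)) for source, uris in groups.items()]
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))
```

Counting distinct instances rather than raw hits avoids rewarding a source simply because the same individual matched several keywords.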

  8. Approach: select the “most relevant” class
     – In each source, select the class that covers the most instances
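
A minimal sketch of the class-selection step, applied to one source at a time. The input shape (instance URI mapped to its set of class names) and the alphabetical tie-break are assumptions made for illustration; `most_covering_class` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, Optional, Set


def most_covering_class(instances: Dict[str, Set[str]]) -> Optional[str]:
    """Return the class that the largest number of instances belong to."""
    counts: Counter = Counter()
    for classes in instances.values():
        counts.update(classes)  # each instance counts once per class
    if not counts:
        return None
    # Highest coverage wins; ties are broken alphabetically.
    return min(counts, key=lambda cls: (-counts[cls], cls))


hits = {
    "i1": {"dbpedia:Person", "dbpedia:MusicArtist"},
    "i2": {"dbpedia:Person"},
    "i3": {"dbpedia:Film"},
}
print(most_covering_class(hits))  # dbpedia:Person
```

In this toy input the broader class wins because it covers more instances, which is one way a plain coverage count can end up favouring an overly generic class.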

  9. Issues: imprecise results
     • Main cause: ambiguous instance labels
     • Inclusion of irrelevant sources
       – E.g., DBLP for movie score composers
     • Selection of inappropriate classes within the selected source
       – Too generic: e.g., dbpedia:Person vs dbpedia:MusicArtist
       – Irrelevant: e.g., akt:Publication-Reference (journal volume) vs akt:Journal

  10. Filtering results: determine potentially irrelevant classes
     – Use state-of-the-art schema matching to select relevant classes

  11. Filtering results: filter out irrelevant search results
     – Only consider search result instances belonging to “approved” classes
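
The filtering step can be illustrated as below. The set of approved classes is taken as given here; in the talk it comes from schema matching, which this sketch does not attempt to reproduce. The nested-dictionary input shape and the name `filter_hits` are assumptions.

```python
from typing import Dict, Set

# source -> {instance URI -> set of class names}; this shape is hypothetical.
HitsBySource = Dict[str, Dict[str, Set[str]]]


def filter_hits(hits_by_source: HitsBySource, approved: Set[str]) -> HitsBySource:
    """Keep only instances with at least one approved class; drop emptied sources."""
    filtered: HitsBySource = {}
    for source, instances in hits_by_source.items():
        kept = {uri: classes for uri, classes in instances.items()
                if classes & approved}  # non-empty intersection means approved
        if kept:
            filtered[source] = kept
    return filtered
```

After filtering, the sources would be re-ranked on the remaining instance counts, so a source that matched only through ambiguous labels drops out of the list entirely.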

  12. Preliminary experiments
     • Datasets
       – ORO journals (data.open.ac.uk): 3110 instances
       – LinkedMDB films: 400 instances
       – LinkedMDB music contributors: 400 instances
     • External components
       – Semantic index: Sig.ma
       – Ontology matching techniques: CIDER, instance-based schema mappings retrieved from BTC2009 dataset

  13. Preliminary experiments
     • Performance measure:
       – Proportion of relevant sources among the top-10 returned results

     Before filtering     + / -    After filtering    + / -
     rae2001 (RKB)        +        rae2001 (RKB)      +
     dotac (RKB)          +        DBpedia            +
     DBpedia              +        dblp.l3s.de        +
     oai (RKB)            +        Freebase           +
     dblp.l3s.de          +        DBLP (RKB)         +
     wordnet (RKB)        -        eprints (RKB)      +
     bibsonomy            -
     eprints (RKB)        +
     Freebase             +
     www.examiner.com     -
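
The performance measure is simple to compute. The sketch below applies it to the + / - marks on this slide, under the assumption that the flattened table lists ten sources before filtering (seven marked +) and six after filtering (all marked +); that row-to-column reading is a reconstruction, not something the slide states outright. `proportion_relevant` is a hypothetical name.

```python
from typing import Sequence


def proportion_relevant(judgements: Sequence[bool], k: int = 10) -> float:
    """Proportion of relevant sources among the top-k returned results."""
    top = list(judgements[:k])
    if not top:
        return 0.0
    return sum(top) / len(top)


# + / - marks in rank order, as read off the slide (assumed reconstruction).
before = [True, True, True, True, True, False, False, True, True, False]
after = [True] * 6

print(proportion_relevant(before))  # 0.7
print(proportion_relevant(after))   # 1.0
```

Note that when fewer than ten sources remain, the proportion is taken over the sources actually returned.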

  14. Preliminary experiments
     • Summary:
       – Top-ranked returned repositories are largely relevant from the point of view of linking
       – Filtering using schema matching techniques greatly improves precision (all remaining sources are relevant)
       – … but at the expense of some recall

  15. Future work
     • Improving the quality of results
       – E.g., estimating the potential loss of precision/recall for different filtering decisions
     • Integrating with the data linking workflow
       – Automatically pre-configuring the data linking algorithm
     • Repository search as a potentially useful semantic search use case (in addition to entity and document search)

  16. Thanks for your attention. Questions?
