Data Linking: Capturing and Utilising Implicit Schema-Level - - PowerPoint PPT Presentation



SLIDE 1

Data Linking: Capturing and Utilising Implicit Schema-Level Relations

Andriy Nikolov, Victoria Uren, Enrico Motta

SLIDE 2

Data linking: current state

  • Automatic instance matching algorithms

– SILK, ODDLinker, KnoFuss, …

  • Pairwise matching of datasets

– Requires significant configuration effort

  • Transitive closure of links

– Use of “reference” datasets

SLIDE 3

Problems

  • Transitive closures often incomplete

– Reference “hub” dataset is incomplete
– Missing intermediate links
– Direct comparison of relevant datasets is desirable

  • Schema heterogeneity

– Which instances to compare?
– Which properties are relevant?

SLIDE 4

Background

  • KnoFuss architecture

– Diagram labels: knowledge fusion; ontology integration; knowledge base integration; ontology matching; instance transformation; coreference resolution; inconsistency processing; source KB; target KB

SLIDE 5

Overview

  • Inferring schema mappings from pre-existing instance mappings

  • Utilizing schema mappings to produce new instance mappings

  • Background knowledge:

– Data-level (intermediate repositories)
– Schema-level (datasets with more fine-grained schemas)

SLIDE 6

Algorithm

  • Step 1: Obtaining transitive closure of existing mappings

– LinkedMDB: movie:music_contributor/2490
– DBPedia: dbpedia:Ennio_Morricone
– MusicBrainz: music:artist/a16…9fdf
– movie:music_contributor/2490 = dbpedia:Ennio_Morricone = music:artist/a16…9fdf
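
The Step 1 example can be sketched in a few lines of Python. This is only an illustration, not the KnoFuss implementation: the function name `identity_clusters` and the union-find approach are assumptions chosen here, and the links are the two equivalences from the slide (the elided MusicBrainz identifier is kept as written).

```python
from collections import defaultdict


def identity_clusters(links):
    """Group URIs into identity clusters (transitive closure of equivalence links).

    Hypothetical helper: uses a simple union-find structure.
    """
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    clusters = defaultdict(set)
    for uri in list(parent):
        clusters[find(uri)].add(uri)
    return [frozenset(members) for members in clusters.values()]


# Existing links from the slide: both point at the DBPedia "hub" resource.
links = [
    ("movie:music_contributor/2490", "dbpedia:Ennio_Morricone"),
    ("music:artist/a16…9fdf", "dbpedia:Ennio_Morricone"),
]
clusters = identity_clusters(links)
```

After the closure, the LinkedMDB and MusicBrainz resources end up in the same cluster even though no direct link between them existed.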

SLIDE 7

Algorithm

  • Step 2: Inferring class and property mappings

– ClassOverlap and PropertyOverlap mappings
– Confidence (classes A, B) = |c(A) ∩ c(B)| / min(|c(A)|, |c(B)|) (overlap coefficient)
– Confidence (properties r1, r2) = |c(X)| / |c(Y)|

  • X – identity clusters with equivalent values of r1 and r2
  • Y – all identity clusters which have values for both r1 and r2

– LinkedMDB: movie:music_contributor/2490 is_a movie:music_contributor
– DBPedia: dbpedia:Ennio_Morricone is_a dbpedia:Artist
– MusicBrainz: music:artist/a16…9fdf
– movie:music_contributor/2490 = dbpedia:Ennio_Morricone = music:artist/a16…9fdf
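
A minimal sketch of the two Step 2 confidence measures follows. It rests on assumptions that the slides do not spell out: c(A) is read as the set of identity clusters containing at least one instance of class A, the property confidence is computed as the number of clusters in X divided by the number in Y, and two property values count as "equivalent" when they share an exact value. The cluster ids, property names and data are hypothetical.

```python
def class_overlap_confidence(clusters_a, clusters_b):
    """Overlap coefficient |c(A) ∩ c(B)| / min(|c(A)|, |c(B)|) over sets of cluster ids."""
    if not clusters_a or not clusters_b:
        return 0.0
    return len(clusters_a & clusters_b) / min(len(clusters_a), len(clusters_b))


def property_overlap_confidence(cluster_values, r1, r2):
    """Share of clusters with values for both r1 and r2 (Y) whose values agree (X).

    Assumption: "equivalent" means the two value sets share at least one value.
    """
    both = [v for v in cluster_values.values() if v.get(r1) and v.get(r2)]
    if not both:
        return 0.0
    agreeing = [v for v in both if v[r1] & v[r2]]
    return len(agreeing) / len(both)


# Hypothetical cluster ids for two classes.
c_contributor = {1, 2, 3, 7}
c_artist = {1, 2, 3, 4, 5, 6}
class_conf = class_overlap_confidence(c_contributor, c_artist)

# Hypothetical property values per identity cluster.
cluster_values = {
    1: {"name_a": {"Ennio Morricone"}, "label_b": {"Ennio Morricone"}},
    2: {"name_a": {"J. Smith"}, "label_b": {"John Smith"}},
    3: {"name_a": {"Jane Doe"}},  # no value for label_b, so not part of Y
}
prop_conf = property_overlap_confidence(cluster_values, "name_a", "label_b")
```

Dividing by the smaller class means a specific class fully contained in a broader one still receives confidence 1, which suits datasets whose schemas differ in granularity.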

SLIDE 8

Algorithm

  • Step 3: Inferring data patterns

  • Functionality restrictions
  • IF 2 equivalent movies do not have overlapping actors AND have different release dates THEN break the equivalence link

  • Note:

– Only usable if not taken into account at the initial instance matching stage
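
The movie rule from Step 3 could be applied as a filter over existing links roughly as follows. This is a sketch under stated assumptions: the record layout (an `actors` set and a `release_date` string per movie), the `ex:` URIs and the function names are hypothetical, and both movies are assumed to have values for both properties.

```python
def is_potentially_spurious(movie_a, movie_b):
    """Apply the rule: no overlapping actors AND different release dates."""
    no_shared_actors = not (movie_a["actors"] & movie_b["actors"])
    different_dates = movie_a["release_date"] != movie_b["release_date"]
    return no_shared_actors and different_dates


def filter_links(links, descriptions):
    """Split equivalence links into kept links and links broken by the rule."""
    kept, broken = [], []
    for a, b in links:
        if is_potentially_spurious(descriptions[a], descriptions[b]):
            broken.append((a, b))
        else:
            kept.append((a, b))
    return kept, broken


# Hypothetical movie descriptions keyed by URI.
descriptions = {
    "ex:film/1": {"actors": {"A", "B"}, "release_date": "1966"},
    "ex:film/2": {"actors": {"B", "C"}, "release_date": "1966"},
    "ex:film/3": {"actors": {"D"}, "release_date": "1984"},
    "ex:film/4": {"actors": {"E"}, "release_date": "2001"},
}
links = [("ex:film/1", "ex:film/2"), ("ex:film/3", "ex:film/4")]
kept, broken = filter_links(links, descriptions)
```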

SLIDE 9

Algorithm

  • Step 4: utilizing mappings and patterns

– Run instance-level matching for individuals of strongly overlapping classes

– Use patterns to filter out existing mappings

  • LinkedMDB

SELECT ?uri WHERE { ?uri rdf:type movie:music_contributor . }

  • DBPedia

SELECT ?uri WHERE { ?uri rdf:type dbpedia:Artist . }
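
One way to read Step 4 is as candidate selection: only instances of strongly overlapping classes are passed to the instance matcher. The sketch below assumes the instance lists have already been retrieved (for example with queries like the two above) and uses a hypothetical threshold of 0.5; the function name, the `ex:Unrelated` class and the data are illustrative only, and the matcher itself is out of scope.

```python
from itertools import product


def candidate_pairs(class_mappings, instances_by_class, existing_links, threshold=0.5):
    """Yield not-yet-linked instance pairs from classes whose mapping confidence
    reaches the (hypothetical) threshold."""
    for (class_a, class_b), confidence in class_mappings.items():
        if confidence < threshold:
            continue  # classes do not overlap strongly enough
        for pair in product(instances_by_class[class_a], instances_by_class[class_b]):
            if pair not in existing_links:
                yield pair


# Hypothetical inputs: class mapping confidences and query results.
class_mappings = {
    ("movie:music_contributor", "dbpedia:Artist"): 0.9,
    ("movie:music_contributor", "ex:Unrelated"): 0.1,
}
instances_by_class = {
    "movie:music_contributor": ["movie:music_contributor/2490", "movie:music_contributor/1"],
    "dbpedia:Artist": ["dbpedia:Ennio_Morricone"],
    "ex:Unrelated": ["ex:thing/1"],
}
existing_links = {("movie:music_contributor/2490", "dbpedia:Ennio_Morricone")}

pairs = list(candidate_pairs(class_mappings, instances_by_class, existing_links))
```

Each remaining pair would then be scored by the instance-level matcher, and the Step 3 patterns applied to filter the result.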

SLIDE 10

Results

  • Class mappings:

– Improvement in recall

  • Previously omitted mappings were discovered after direct comparison of instances

  • Data patterns

– Improved precision

  • Filtered out spurious mappings
  • Identified 140 mappings between movies as “potentially spurious”

  • 132 identified correctly

– Charts (one per dataset pair: DBPedia/DBLP, DBPedia/LinkedMDB, DBPedia/BookMashup): Precision, Recall and F1-measure for Existing, KnoFuss (only) and Combined
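
As a quick check on the data-pattern figures, and assuming "132 identified correctly" means that 132 of the 140 flagged movie mappings were truly spurious, the share of correct flags works out as follows (variable names are illustrative):

```python
flagged = 140  # mappings between movies flagged as "potentially spurious"
correct = 132  # flagged mappings that were identified correctly

share_correct = correct / flagged
print(f"{share_correct:.3f}")  # prints 0.943
```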

SLIDE 11

Limitations & future work

  • Large-scale tests

– Billion Triple Challenge 2009, other repositories

  • Initial mappings

– What to do if a repository is not connected to any other one?
– Utilizing low-cost instance-matching techniques

SLIDE 12

Questions?

Thanks for your attention