LSWT2018 Link Discovery Presentation, September 2018

SLIDE 1

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/327417445

LSWT2018 Link Discovery Presentation

Presentation · September 2018

1 author: Mohamed Sherif, University of Leipzig

All content following this page was uploaded by Mohamed Sherif on 04 September 2018.

SLIDE 2

LSWT 2018 Linked Data Integration at Scale

Mohamed Ahmed Sherif and Axel-Cyrille Ngonga Ngomo

Paderborn University, Data Science Group, Pohlweg 51, D-33098 Paderborn, Germany {firstname.lastname}@upb.de

June 18, 2018

Sherif et al. LSWT 2018, Linked Data Integration at Scale June 18, 2018 1 / 27

SLIDE 3

Motivation

Linked Data Principles

SLIDE 4

Motivation

Why Link Discovery?

1. Linked Open Data Cloud
   • 130+ billion triples
   • ≈ 0.5 billion links
   • Mostly owl:sameAs
2. Decentralized dataset creation
3. Complex information needs ⇒ need to consume data across knowledge bases
4. Links are central for
   • Cross-ontology QA
   • Data integration
   • Reasoning
   • Federated queries
   • ...

SLIDE 5

Motivation

Cross-Ontology QA

Example: Give me the name and description of all drugs that cure their side-effect.

1. Need information from
   • Drugbank (drug description)
   • Sider (side-effects)
   • DBpedia (description)
2. Gathering information via a SPARQL query using links

SLIDE 6

Motivation

Cross-Ontology QA

Example: Give me the name and description of all drugs that cure their side-effect.

SELECT ?drug ?name ?desc WHERE {
  ?drug   a drugbank:Drug .
  ?drug   rdfs:label ?name .
  ?drug   drugbank:cures ?disease .
  ?drug   owl:sameAs ?drug2 .
  ?drug   owl:sameAs ?drug3 .
  ?drug2  sider:hasSideEffect ?effect .
  ?effect owl:sameAs ?disease .
  ?drug3  dbo:hasWikiPage ?desc .
}

SLIDE 7

Motivation

Cross-Ontology QA (Geo-spatial)

Example (DEQA): Give me flats near kindergartens in Kobe.

SELECT ?flat WHERE {
  ?flat   a deqa:Flat .
  ?flat   deqa:near ?school .
  ?school a lgdo:School .
  ?school lgdo:city lgdo:Kobe .
}

SLIDE 10

The Link Discovery Problem

Definition

Definition (Link Discovery, informal)
Given two sets of resources S and T, find links of type R between S and T. Here: declarative link discovery.

Definition (Declarative Link Discovery, formal, similarities)
Given sets S and T of resources and a relation R, find M = {(s, t) ∈ S × T : R(s, t)}.
Common approach: find M′ = {(s, t) ∈ S × T : σ(s, t) ≥ θ}, where σ is a similarity function and θ a similarity threshold.

Definition (Declarative Link Discovery, formal, distances)
Given sets S and T of resources and a relation R, find M = {(s, t) ∈ S × T : R(s, t)}.
Common approach: find M′ = {(s, t) ∈ S × T : δ(s, t) ≤ τ}, where δ is a distance function and τ a distance threshold.
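The similarity-based definition can be illustrated with a minimal sketch. It assumes a trigram similarity defined as the Jaccard overlap of padded character trigrams; the function names, the padding scheme and the sample labels are illustrative assumptions, not taken from Limes.

```python
from itertools import product


def trigrams(text):
    """Return the set of character trigrams of a padded, lower-cased string."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard overlap of the two trigram sets, a value in [0, 1] (assumed definition)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def discover_links(source, target, sigma, theta):
    """Naive computation of M' = {(s, t) in S x T : sigma(s, t) >= theta}.

    Every pair is compared, so |S| * |T| similarity computations are needed.
    """
    return {(s, t) for s, t in product(source, target) if sigma(s, t) >= theta}


# Hypothetical resource labels standing in for S and T.
S = ["Leipzig", "Paderborn"]
T = ["Leipzig", "Paderborn University", "Kobe"]
links = discover_links(S, T, trigram_similarity, theta=0.5)
```

The pairwise loop is the quadratic baseline that the efficiency-oriented algorithms in the rest of the talk aim to avoid.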

SLIDE 11

The Link Discovery Problem

Definition

Most common: R = owl:sameAs
Also known as deduplication

SLIDE 13

The Link Discovery Problem

Definition

Goal: Address all possible relations R

Declarative Link Discovery: similarity/distance commonly derived from property (and property chain) values

Example: R = :sameModel

:s770fm  rdfs:label "S770FM"@en
:s770fm  rdf:type   :SABER
:s770fm  :model     :770
:s770fm  :top       :FlamedMaple
:s770fm  :producer  :Ibanez

:s770bem rdfs:label "S770BEM"@en
:s770bem rdf:type   :SABER
:s770bem :model     :770
:s770bem :top       :BirdEyeMaple
:s770bem :producer  :Ibanez

SLIDE 15

The Link Discovery Problem

Why is it difficult?

1. Time complexity (efficiency)
   • Large number of triples (e.g., LinkedTCGA with 20.4 billion triples)
   • Quadratic a-priori runtime
   • 69 days for mapping cities from DBpedia to Geonames
   • Solutions usually in-memory (insufficient heap space)
2. Accuracy
   • Combination of several attributes required for high precision
   • Tedious discovery of most adequate mapping
   • Dataset-dependent similarity functions

Example of a complex link specification (tree figure): four atomic measures with thresholds, combined with the operators ⊔, \ and ⊓
   • (trigrams(x.name, y.name), 0.50)
   • (levenshtein(x.desc, y.desc), 0.50)
   • (euclidean(x.price, y.price), 0.90)
   • (cosine(x.name, y.name), 0.52)
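A sketch of how such a specification could be evaluated with plain set operations. The nesting of the operators is an assumption, since the tree layout did not survive extraction: the listing order is read here as ((trigrams ⊔ levenshtein) \ euclidean) ⊓ cosine. All similarity scores below are made-up illustrative values.

```python
def atomic(scores, threshold):
    """Mapping of an atomic measure: the pairs whose score reaches the threshold."""
    return {pair for pair, score in scores.items() if score >= threshold}


# Hypothetical precomputed similarity scores per (source, target) pair.
trigram_name = {("s1", "t1"): 0.80, ("s2", "t2"): 0.30, ("s3", "t3"): 0.60}
levenshtein_desc = {("s1", "t1"): 0.40, ("s2", "t2"): 0.70, ("s3", "t3"): 0.20}
euclidean_price = {("s1", "t1"): 0.50, ("s2", "t2"): 0.95, ("s3", "t3"): 0.10}
cosine_name = {("s1", "t1"): 0.90, ("s2", "t2"): 0.90, ("s3", "t3"): 0.40}

# Assumed nesting: ((trigrams OR levenshtein) MINUS euclidean) AND cosine.
union = atomic(trigram_name, 0.50) | atomic(levenshtein_desc, 0.50)
difference = union - atomic(euclidean_price, 0.90)
mapping = difference & atomic(cosine_name, 0.52)
```

Each operator maps directly onto a Python set operation, which is why the choice of measures and thresholds, rather than the combination step, is the tedious part.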

SLIDE 16

Limes

Link Discovery Framework for Metric Spaces

1. Time complexity
   • Limes algorithm
   • HR3
   • Aegle
   • Radon
   • ...
2. Accuracy
   • Raven
   • Eagle
   • Coala
   • Euclid
   • Wombat
   • ...

https://github.com/dice-group/limes

SLIDE 17

Radon

Rapid Discovery of Topological Relations (AAAI 2017)

Large number of datasets
   • http://stats.lod2.eu
   • 150+ billion triples
   • ≈ 0.5 billion links
   • Mostly owl:sameAs

Large geo-spatial datasets
   • LinkedGeoData contains 20+ billion triples
   • NUTS contains up to 1,500 points per resource

Only 7.1% of the links between resources connect geo-spatial entities (Ngonga Ngomo, 2013)

SLIDE 18

Radon

Why is linking geo-spatial resources difficult?

Link Discovery
   • Given two knowledge bases S and T, find links of type R between S and T
   • Formally, find M = {(s, t) ∈ S × T : R(s, t)}
   • Naïve computation of M requires quadratic time complexity

Geo-spatial resources available on the LOD
   • Described using polygons
   • Large in number
   • Demand the computation of topological relations

Naïve computation of M is impracticable for geo-spatial resources

SLIDE 19

Radon Algorithm

The Dimensionally Extended nine-Intersection Model (DE-9IM)

Standard to describe the topological relations in 2D space. DE-9IM is based on the intersection matrix of the interior I, boundary B and exterior E of two geometries g1 and g2:

\[
\mathrm{DE9IM}(g_1, g_2) =
\begin{bmatrix}
\dim(I(g_1) \cap I(g_2)) & \dim(I(g_1) \cap B(g_2)) & \dim(I(g_1) \cap E(g_2)) \\
\dim(B(g_1) \cap I(g_2)) & \dim(B(g_1) \cap B(g_2)) & \dim(B(g_1) \cap E(g_2)) \\
\dim(E(g_1) \cap I(g_2)) & \dim(E(g_1) \cap B(g_2)) & \dim(E(g_1) \cap E(g_2))
\end{bmatrix}
\]

  • There must be at least one shared point for a relation to hold
  • Except for the disjoint relation ⇒ inverse of the intersects relation
  • Accelerating the computation of whether two geometries share at least one point accelerates the computation of any topological relation
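A minimal sketch of how a DE-9IM matrix is checked against a relation pattern. It assumes the common string encoding of the matrix (nine characters in row-major order, 'F' for an empty intersection, otherwise the dimension) and the standard patterns for within and disjoint; the helper names are illustrative.

```python
def matches(matrix, pattern):
    """Check a 9-character DE-9IM matrix against a 9-character pattern.

    Matrix cells: 'F' (empty intersection) or the dimension '0', '1', '2'.
    Pattern cells: 'T' (any non-empty), 'F' (empty), '*' (don't care) or an exact dimension.
    """
    if len(matrix) != 9 or len(pattern) != 9:
        raise ValueError("matrix and pattern must both have 9 cells")
    for cell, wanted in zip(matrix, pattern):
        if wanted == "*":
            continue
        if wanted == "T":
            if cell == "F":
                return False
        elif cell != wanted:
            return False
    return True


DISJOINT = "FF*FF****"  # interiors and boundaries share no point
WITHIN = "T*F**F***"    # no point of g1 lies in the exterior of g2


def intersects(matrix):
    """intersects is the inverse of disjoint: at least one shared point."""
    return not matches(matrix, DISJOINT)


inner_polygon = "2FF1FF212"   # polygon strictly inside another polygon
overlapping = "212101212"     # two partially overlapping polygons
far_apart = "FF2FF1212"       # two polygons with no shared point
```

With these sample matrices, only `inner_polygon` matches `WITHIN`, and only `far_apart` fails `intersects`.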

SLIDE 20

Radon Algorithm

Basic Idea

Radon implements an improved indexing approach based on
  1. Minimum bounding boxes (MBB)
  2. Space tiling

SLIDE 21

Radon Algorithm

I. Swapping Strategy

Large geometries that span over a large number of hypercubes ⇒ large spatial index when used as S

Estimated Total Hypervolume (ETH) of a set of geometries X:

\[
\mathrm{ETH}(X) = |X| \prod_{i=1}^{d} \left( \frac{1}{|X|} \sum_{x \in X} \left( \max_{p \in x} \kappa_i(p) - \min_{p \in x} \kappa_i(p) \right) \right)
\]

Here κ_i(p) is read as the i-th coordinate of point p, and d as the number of dimensions.

If ETH(S) > ETH(T), Radon swaps S and T and computes the reverse relation r′ instead of r, e.g., if r is covers and ETH(S) > ETH(T), Radon swaps S and T and computes coveredBy.

Since ETH(NUTS) > ETH(CLC), S = CLC and T = NUTS.
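The ETH estimate and the swap decision can be sketched as follows, assuming each geometry is given as a non-empty list of d-dimensional points. The function names and the `reverse_of` lookup table are illustrative assumptions; only the covers/coveredBy pair comes from the slide.

```python
from math import prod


def eth(geometries):
    """Estimated Total Hypervolume: |X| times the product over all dimensions
    of the mean extent (max minus min coordinate) of the geometries."""
    n = len(geometries)
    dimensions = len(geometries[0][0])
    mean_extents = []
    for i in range(dimensions):
        total = sum(max(p[i] for p in g) - min(p[i] for p in g) for g in geometries)
        mean_extents.append(total / n)
    return n * prod(mean_extents)


def order_for_indexing(source, target, relation, reverse_of):
    """Swap S and T (and use the reverse relation) if S has the larger ETH."""
    if eth(source) > eth(target):
        return target, source, reverse_of[relation]
    return source, target, relation


# Hypothetical toy datasets: one unit square versus two large geometries.
small = [[(0, 0), (1, 0), (1, 1), (0, 1)]]
large = [[(0, 0), (10, 0), (10, 10), (0, 10)], [(0, 0), (20, 0), (20, 10)]]
reverse_of = {"covers": "coveredBy", "coveredBy": "covers"}

s, t, r = order_for_indexing(large, small, "covers", reverse_of)
```

Here the dataset with the smaller hypervolume ends up as S, so the index built from it stays small.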

SLIDE 22

Radon Algorithm

II. Optimized Sparse Space Tiling

Insert all geometries s ∈ S into the index I(S):
  1. Compute MBB(s)
  2. Map each s to all hypercubes over which MBB(s) spans

Same procedure for all t ∈ T, but only index geometries t that are potentially in hypercubes already contained in I(S)
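The two indexing passes can be sketched for the 2D case as below. The uniform square grid, the `tile_size` parameter and all function names are assumptions made for illustration; the slide does not state how Radon chooses the tile granularity.

```python
from collections import defaultdict
from math import floor


def mbb(geometry):
    """Minimum bounding box (min_x, min_y, max_x, max_y) of a list of 2D points."""
    xs = [x for x, _ in geometry]
    ys = [y for _, y in geometry]
    return min(xs), min(ys), max(xs), max(ys)


def tiles(box, tile_size):
    """Yield the grid cells (hypercubes) that a bounding box spans."""
    min_x, min_y, max_x, max_y = box
    for i in range(floor(min_x / tile_size), floor(max_x / tile_size) + 1):
        for j in range(floor(min_y / tile_size), floor(max_y / tile_size) + 1):
            yield i, j


def build_indexes(source, target, tile_size):
    """Index every source geometry, then index target geometries only in
    tiles that already hold at least one source geometry."""
    source_index = defaultdict(list)
    for name, geometry in source.items():
        for tile in tiles(mbb(geometry), tile_size):
            source_index[tile].append(name)

    target_index = defaultdict(list)
    for name, geometry in target.items():
        for tile in tiles(mbb(geometry), tile_size):
            if tile in source_index:  # sparse: empty source tiles are skipped
                target_index[tile].append(name)
    return source_index, target_index


# Hypothetical geometries given as point lists.
source = {"s1": [(0.5, 0.5), (1.5, 0.5), (1.5, 1.5)]}
target = {
    "t1": [(1.2, 1.2), (2.8, 1.2), (2.8, 1.8)],
    "t2": [(5.0, 5.0), (6.0, 6.0)],
}
source_index, target_index = build_indexes(source, target, tile_size=1.0)
```

Only `t1` is indexed, and only in the single tile it shares with `s1`; `t2` never enters the index because none of its tiles contains a source geometry.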

SLIDE 23

Radon Algorithm

III. Link Generation

  • Discards unnecessary computations using the TestMBB procedure
  • TestMBB optimizes the subset of DE-9IM relations in which one geometry must not have interior or boundary points in the exterior of the other geometry, e.g., equals, covers and within
  • For other relations, TestMBB returns true
  • If TestMBB returns false, there is no need to carry out the expensive computation of the topological relation

Example (colours refer to the geometries in the slide's figure):
  • TestMBB(within, blue) = false
  • TestMBB(within, green) = true
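A sketch of a bounding-box pre-check in the spirit of TestMBB. The concrete checks are an assumption based on a necessary condition (a geometry can only lie within another if its MBB lies within the other's MBB); the slide does not show Radon's actual procedure, and the function names are illustrative.

```python
def contains_box(outer, inner):
    """True if box `inner` lies inside box `outer`; boxes are (min_x, min_y, max_x, max_y)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def mbb_filter(relation, box_s, box_t):
    """Return False only if the relation certainly cannot hold between s and t."""
    if relation == "within":
        return contains_box(box_t, box_s)
    if relation == "covers":
        return contains_box(box_s, box_t)
    if relation == "equals":
        return box_s == box_t
    return True  # no cheap exclusion for the remaining relations


# Hypothetical bounding boxes.
region = (0, 0, 3, 3)
inside = (1, 1, 2, 2)
sticking_out = (1, 1, 4, 2)
```

A `False` result lets the caller skip the exact DE-9IM computation; a `True` result only means the expensive check still has to run.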

SLIDE 24

Radon Evaluation

Setup

Topological relations
  • Subset of 7 topological relations, i.e., within, touches, overlaps, intersects, equals, crosses and covers

Hardware
  • 64-core 2.3 GHz
  • OpenJDK 64-Bit Server
  • 20 GB RAM
  • Timeout limit of 2 hours

State of the art
  1. Silk
  2. Strabon

Datasets
  1. NUTS
  2. CORINE Land Cover (CLC)

SLIDE 25

Radon Evaluation

Linear Speedup

  • Radon vs. Silk
  • 44 subsets of the CLC vs. the full NUTS
  • 7 basic topological relations
  • 308 experiments
  • Single core
  • Radon achieves a linear speedup relative to the dataset sizes
  • Up to 450 times faster for the within relation

[Figure: speedup for the within relation; x-axis: dataset sizes (10^8 to 5×10^8), y-axis: speedup (100 to 500)]

SLIDE 26

Radon Evaluation

Topological Relations Computations

  • Same setting as in previous experiments
  • Radon runs significantly fewer computations of the relations
  • Radon carries out only 3 and 4 computations for the equals and within relations, respectively
  • On average, 449 times fewer computations per relation

[Figure: average number of computations of topological relations per relation (covers, crosses, equals, intersects, overlaps, touches, within), Radon vs. Silk]

SLIDE 27

Radon Evaluation

Runtime

  • Same setting as in previous experiments
  • On average, Radon is faster than
      • Silk by 65.62 times
      • Strabon by 11.99 times
  • Strabon outperforms Radon on the intersects relation
      • Strabon uses an R-tree-over-GiST spatial index over the stored geometries in the underlying PostGIS database
      • R-tree-over-GiST is highly optimized for the retrieval of spatially connected objects

[Figure: average runtimes in seconds per relation (covers, crosses, equals, intersects, overlaps, touches, within), Radon vs. Strabon vs. Silk]

SLIDE 28

Radon Evaluation

Radon Speedup Quantification (Parallel Implementation)

  • Merge all 44 sub-datasets of CLC (CLCM)
  • CLCM contains 2,209,538 resources
  • CLCM as both source and target datasets
  • Simple round robin load balancing policy
  • On average, within the 2-hour time limit
      • Radon finishes in 20.83 minutes
      • Silk finalizes 1.16% of the tasks
      • Silk would need 4.36 days with 8 threads (linear extrapolation)
      • Radon is 834.69 times faster than Silk

Relation    #Thr.  Radon   Silk             Speedup
equals      1      24.11   36,500 (0.33%)   1,513.58
            2      13.15   21,667 (0.55%)   1,647.58
            4       6.81   11,750 (1.02%)   1,725.77
            8       3.79    6,286 (1.91%)   1,658.78
intersects  1      93.17   37,500 (0.32%)     402.50
            2      49.03   20,667 (0.58%)     421.53
            4      25.11   12,000 (1.00%)     477.81
            8      13.04    6,300 (1.90%)     483.24
within      1      36.47   35,000 (0.34%)     959.74
            2      18.26   20,667 (0.58%)   1,131.86
            4       9.44   11,765 (1.02%)   1,246.34
            8       5.92    6,202 (1.93%)   1,048.34
covers      1      35.62   36,000 (0.33%)   1,010.75
            2      18.51   21,029 (0.57%)   1,136.10
            4      10.23   12,000 (1.00%)   1,172.50
            8       5.33    6,300 (1.90%)   1,182.13
touches     1      94.50   35,500 (0.34%)     375.68
            2      47.71   22,196 (0.54%)     465.18
            4      25.09   12,121 (0.99%)     483.08
            8      13.30    6,381 (1.88%)     479.75
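The linear extrapolation behind the Silk column can be sketched as follows. It assumes that the Silk figures are total runtimes in minutes, estimated from the fraction of tasks completed within the 120-minute limit; this reading of the table, like the function names, is an assumption. Because the percentages and speedups in the table are rounded, the sketch reproduces them only approximately.

```python
def extrapolate_total_runtime(elapsed, fraction_done):
    """Linear extrapolation: if `fraction_done` of the work took `elapsed`,
    the whole task is estimated to take elapsed / fraction_done."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed / fraction_done


def speedup(baseline_runtime, runtime):
    """How many times faster `runtime` is than `baseline_runtime`."""
    return baseline_runtime / runtime


time_limit = 120.0  # assumed unit: minutes (the 2-hour timeout)

# equals relation, 1 thread: Silk completed about 0.33% of the tasks.
silk_estimate = extrapolate_total_runtime(time_limit, 0.0033)

# Speedup from the tabulated values for the same row.
equals_speedup = speedup(36_500, 24.11)
```

The estimate lands close to the tabulated 36,500, and the ratio close to the tabulated speedup of 1,513.58.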

SLIDE 29

Radon Evaluation

Radon Speedup Quantification (Parallel Implementation)

  • Strabon does not finish any of the experiments within the 2-hour time limit
  • No progress feedback from Strabon
  • Strabon performance estimation
      • Strabon run on a CLC subset 10 times smaller than CLCM
      • Optimistic (Strabon scales linearly): average speedup of 24
      • Realistic (Strabon scales in O(n^2)): average speedup of 241
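The two estimates differ only in the assumed scaling exponent. A minimal sketch, with a made-up subset runtime, shows why the realistic speedup is roughly ten times the optimistic one when the full dataset is ten times larger than the measured subset.

```python
def estimate_full_runtime(subset_runtime, size_ratio, exponent):
    """Scale a runtime measured on a subset to the full dataset,
    assuming the runtime grows as O(n ** exponent)."""
    return subset_runtime * size_ratio ** exponent


subset_runtime = 5.0  # hypothetical measurement, arbitrary time unit
size_ratio = 10       # full dataset is 10 times larger than the subset

optimistic = estimate_full_runtime(subset_runtime, size_ratio, exponent=1)
realistic = estimate_full_runtime(subset_runtime, size_ratio, exponent=2)
```

The ratio `realistic / optimistic` equals the size ratio, which is consistent with the reported average speedups of 24 and 241.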

SLIDE 30

Conclusion & Future Work

Conclusion

  • Presented the Link Discovery problem
  • Presented Radon, an approach for rapid discovery of topological relations among geo-spatial resources
  • Radon is complete, correct and outperforms the SOTA by up to 3 orders of magnitude
  • Presented Limes

Future work

  • More sophisticated load balancing approaches
  • Other topology approximation methods
  • Topological relations in higher dimensions
  • Use modern industry-grade technology (Flink, SPARK, Docker Swarm, etc.)
  • Benefit from the granularity in 5D data to reduce the linking runtime

SLIDE 31

Limes 1.3.0

  • Website: http://cs.uni-paderborn.de/ds/research/research-projects/active-projects/limes/
  • User manual: http://dice-group.github.io/LIMES/user_manual/
  • Developer manual: http://dice-group.github.io/LIMES/developer_manual/
  • Online demo: http://limes-webui.aksw.org/
  • Source code: https://github.com/dice-group/limes

SLIDE 32

Thank you for your Attention!

Mohamed Ahmed Sherif Mohamed.Sherif@upb.de

This work has been supported by LEDS project, Eurostars Project SAGE (GA no. E!10882), the H2020 projects SLIPO (GA no. 731581) and HOBBIT (GA no. 688227) as well as the DFG project LinkingLOD (project no. NG 105/3-2) and the BMWI Project GEISER (project no. 01MD16014).
