
LSWT2018 Link Discovery Presentation (September 2018)


  1. LSWT2018 Link Discovery Presentation. Presentation, September 2018. Author: Mohamed Sherif, University of Leipzig. Source: https://www.researchgate.net/publication/327417445. Uploaded by Mohamed Sherif on 04 September 2018.

  2. LSWT 2018: Linked Data Integration at Scale
     Mohamed Ahmed Sherif and Axel-Cyrille Ngonga Ngomo
     Paderborn University, Data Science Group, Pohlweg 51, D-33098 Paderborn, Germany
     {firstname.lastname}@upb.de
     June 18, 2018
     Slide footer: Sherif et al. LSWT 2018, Linked Data Integration at Scale, June 18, 2018 (27 slides)

  3. Motivation: Linked Data Principles

  4. Motivation: Why Link Discovery?
     (1) Linked Open Data Cloud: 130+ billion triples, ≈ 0.5 billion links, mostly owl:sameAs
     (2) Decentralized dataset creation
     (3) Complex information needs ⇒ need to consume data across knowledge bases
     (4) Links are central for cross-ontology QA, data integration, reasoning, federated queries, ...

  5. Motivation: Cross-Ontology QA
     Example: Give me the name and description of all drugs that cure their side-effect.
     (1) Need information from Drugbank (drug description), Sider (side-effects), and DBpedia (description)
     (2) Gather the information via a SPARQL query that follows links

  6. Motivation: Cross-Ontology QA
     Example: Give me the name and description of all drugs that cure their side-effect.
       SELECT ?drug ?name ?desc WHERE {
         ?drug a drugbank:Drug .
         ?drug rdfs:label ?name .
         ?drug drugbank:cures ?disease .
         ?drug owl:sameAs ?drug2 .
         ?drug owl:sameAs ?drug3 .
         ?drug2 sider:hasSideEffect ?effect .
         ?effect owl:sameAs ?disease .
         ?drug3 dbo:hasWikiPage ?desc .
       }

  7. Motivation: Cross-Ontology QA (Geo-spatial)
     Example (DEQA): Give me flats near kindergartens in Kobe.
       SELECT ?flat WHERE {
         ?flat a deqa:Flat .
         ?flat deqa:near ?school .
         ?school a lgdo:School .
         ?school lgdo:city lgdo:Kobe .
       }

  8. The Link Discovery Problem: Definition
     Definition (Link Discovery, informal): Given two sets of resources S and T, find links of type R between S and T.
     Here: declarative link discovery.

  9. The Link Discovery Problem: Definition
     Definition (Link Discovery, informal): Given two sets of resources S and T, find links of type R between S and T.
     Here: declarative link discovery.
     Definition (Declarative Link Discovery, formal, similarities): Given sets S and T of resources and a relation R, find M = {(s, t) ∈ S × T : R(s, t)}.
     Common approach: find M′ = {(s, t) ∈ S × T : σ(s, t) ≥ θ}, where σ is a similarity function and θ a similarity threshold.

  10. The Link Discovery Problem: Definition
      Definition (Link Discovery, informal): Given two sets of resources S and T, find links of type R between S and T.
      Here: declarative link discovery.
      Definition (Declarative Link Discovery, formal, similarities): Given sets S and T of resources and a relation R, find M = {(s, t) ∈ S × T : R(s, t)}.
      Common approach: find M′ = {(s, t) ∈ S × T : σ(s, t) ≥ θ}, where σ is a similarity function and θ a similarity threshold.
      Definition (Declarative Link Discovery, formal, distances): Given sets S and T of resources and a relation R, find M = {(s, t) ∈ S × T : R(s, t)}.
      Common approach: find M′ = {(s, t) ∈ S × T : δ(s, t) ≤ τ}, where δ is a distance function and τ a distance threshold.
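
The similarity-based definition can be sketched in a few lines of Python. This is only an illustration of M′ = {(s, t) ∈ S × T : σ(s, t) ≥ θ} computed by brute force; the function names and the trigram-based σ are assumptions made for the example, not part of the slides or of any framework's API.

```python
from itertools import product


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of character trigrams (an illustrative choice of sigma)."""
    def grams(text: str) -> set:
        padded = f"  {text.lower()} "
        return {padded[i:i + 3] for i in range(len(padded) - 2)}

    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0


def discover_links(S, T, sigma, theta):
    """Return M' = {(s, t) in S x T : sigma(s, t) >= theta}.

    Brute force: every pair is compared, so the cost is |S| * |T| comparisons.
    """
    return {(s, t) for s, t in product(S, T) if sigma(s, t) >= theta}


if __name__ == "__main__":
    S = ["Leipzig", "Paderborn"]
    T = ["leipzig", "Paderborn University"]
    print(discover_links(S, T, trigram_similarity, 0.9))
```

The nested comparison over S × T is exactly the quadratic cost that the later slides set out to avoid.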

  11. The Link Discovery Problem: Definition
      Most common: R = owl:sameAs
      Also known as deduplication

  12. The Link Discovery Problem: Definition
      Goal: address all possible relations R
      Declarative link discovery: similarity/distance is commonly derived from property (and property chain) values

  13. The Link Discovery Problem: Definition
      Goal: address all possible relations R
      Declarative link discovery: similarity/distance is commonly derived from property (and property chain) values
      Example: R = :sameModel, for two resource descriptions shown side by side:
        :s770fm rdfs:label "S770FM"@en    |  :s770fm rdfs:label "S770BEM"@en
        :s770fm rdf:type :SABER           |  :s770fm rdf:type :SABER
        :s770fm :model :770               |  :s770fm :model :770
        :s770fm :top :FlamedMaple         |  :s770fm :top :BirdEyeMaple
        :s770fm :producer :Ibanez         |  :s770fm :producer :Ibanez

  14. The Link Discovery Problem: Why is it difficult?
      (1) Time complexity (efficiency)
          Large number of triples (e.g., LinkedTCGA with 20.4 billion triples)
          Quadratic a-priori runtime
          69 days for mapping cities from DBpedia to Geonames
          Solutions usually in-memory (insufficient heap space)

  15. The Link Discovery Problem: Why is it difficult?
      (1) Time complexity (efficiency)
          Large number of triples (e.g., LinkedTCGA with 20.4 billion triples)
          Quadratic a-priori runtime
          69 days for mapping cities from DBpedia to Geonames
          Solutions usually in-memory (insufficient heap space)
      (2) Accuracy
          Combination of several attributes required for high precision
          Tedious discovery of most adequate mapping
          Dataset-dependent similarity functions
      [Figure: example link specification that combines four atomic measures with the operators ⊔, ⊓ and \: (euclidean(x.price, y.price), 0.90), (levenshtein(x.desc, y.desc), 0.50), (trigrams(x.name, y.name), 0.50), (cosine(x.name, y.name), 0.52)]
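
To make the accuracy point concrete, the following Python sketch shows how atomic (measure, threshold) pairs can be combined with set operators into one link specification. Everything here is an assumption made for illustration: the helper names, the two toy measures, and the particular way the operators are combined. It does not reproduce the specification shown on the slide, and it is not the API of Limes.

```python
from itertools import product


def atomic(measure, theta):
    """Turn a (measure, threshold) pair into a function that yields a mapping."""
    def run(S, T):
        return {(s["id"], t["id"]) for s, t in product(S, T) if measure(s, t) >= theta}
    return run


# Set operators on mappings, corresponding to the symbols ⊔, ⊓ and \.
def union(left, right):
    return lambda S, T: left(S, T) | right(S, T)


def intersection(left, right):
    return lambda S, T: left(S, T) & right(S, T)


def difference(left, right):
    return lambda S, T: left(S, T) - right(S, T)


def price_sim(x, y):
    """Hypothetical similarity derived from the absolute price difference."""
    return 1.0 / (1.0 + abs(x["price"] - y["price"]))


def name_sim(x, y):
    """Hypothetical similarity: Jaccard overlap of lower-cased name tokens."""
    a, b = set(x["name"].lower().split()), set(y["name"].lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0


if __name__ == "__main__":
    S = [{"id": "s1", "name": "Ibanez S770", "price": 900.0}]
    T = [{"id": "t1", "name": "ibanez s770", "price": 900.0},
         {"id": "t2", "name": "Ibanez RG", "price": 900.0}]
    spec = intersection(atomic(name_sim, 0.5), atomic(price_sim, 0.9))
    print(spec(S, T))
```

Price alone links s1 to both targets; only the combination with the name measure separates them, which is why single-attribute specifications rarely reach high precision.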

  16. Limes: Link Discovery Framework for Metric Spaces
      https://github.com/dice-group/limes
      (1) Time complexity: Limes algorithm, HR3, Aegle, Radon, ...
      (2) Accuracy: Raven, Eagle, Coala, Euclid, Wombat, ...

  17. Radon: Rapid Discovery of Topological Relations (AAAI 2017)
      Large number of datasets (http://stats.lod2.eu): 150+ billion triples, ≈ 0.5 billion links, mostly owl:sameAs
      Large geo-spatial datasets: LinkedGeoData contains more than 20 billion triples; NUTS contains up to 1,500 points per resource
      Only 7.1% of the links between resources connect geo-spatial entities (Ngonga Ngomo, 2013)

  18. Radon: Why is linking geo-spatial resources difficult?
      Link discovery: given two knowledge bases S and T, find links of type R between S and T
      Formally, find M = {(s, t) ∈ S × T : R(s, t)}
      Naïve computation of M requires quadratic time complexity
      Geo-spatial resources available on the LOD are described using polygons, are large in number, and demand the computation of topological relations
      Naïve computation of M is impracticable for geo-spatial resources

  19. Radon Algorithm: The Dimensionally Extended nine-Intersection Model (DE-9IM)
      Standard for describing topological relations in 2D space.
      DE-9IM is based on the intersection matrix, where I, B and E denote the interior, boundary and exterior of a geometry:
      \[
      \mathrm{DE9IM}(g_1, g_2) =
      \begin{bmatrix}
      \dim(I(g_1) \cap I(g_2)) & \dim(I(g_1) \cap B(g_2)) & \dim(I(g_1) \cap E(g_2)) \\
      \dim(B(g_1) \cap I(g_2)) & \dim(B(g_1) \cap B(g_2)) & \dim(B(g_1) \cap E(g_2)) \\
      \dim(E(g_1) \cap I(g_2)) & \dim(E(g_1) \cap B(g_2)) & \dim(E(g_1) \cap E(g_2))
      \end{bmatrix}
      \]
      There must be at least one shared point for a relation to hold, except for the disjoint relation, which is the inverse of the intersects relation.
      ⇒ Accelerating the check of whether two geometries share at least one point accelerates the computation of any topological relation.
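
A DE-9IM matrix is commonly serialized row by row as a nine-character string over F, 0, 1 and 2, and relations are tested by matching that string against patterns in which T means "any non-empty intersection" and * means "don't care". The Python sketch below uses the usual OGC-style patterns for intersects and disjoint; the function names are assumptions for this example, and the matrix strings are supplied by hand rather than computed from geometries.

```python
def matches(matrix: str, pattern: str) -> bool:
    """Check a 9-character DE-9IM matrix against a 9-character pattern.

    Pattern symbols: '*' matches anything, 'T' matches a non-empty
    intersection (0, 1 or 2), and 'F', '0', '1', '2' must match exactly.
    """
    if len(matrix) != 9 or len(pattern) != 9:
        raise ValueError("matrix and pattern must both have 9 characters")
    for cell, wanted in zip(matrix, pattern):
        if wanted == "*":
            continue
        if wanted == "T":
            if cell not in ("0", "1", "2"):
                return False
        elif cell != wanted:
            return False
    return True


# intersects holds if interiors or boundaries share at least one point.
INTERSECTS_PATTERNS = ("T********", "*T*******", "***T*****", "****T****")
DISJOINT_PATTERN = "FF*FF****"


def intersects(matrix: str) -> bool:
    return any(matches(matrix, p) for p in INTERSECTS_PATTERNS)


def disjoint(matrix: str) -> bool:
    return matches(matrix, DISJOINT_PATTERN)


if __name__ == "__main__":
    overlapping_polygons = "212101212"
    separate_polygons = "FF2FF1212"
    print(intersects(overlapping_polygons), disjoint(overlapping_polygons))
    print(intersects(separate_polygons), disjoint(separate_polygons))
```

For every valid matrix, disjoint is the negation of intersects, which is why a fast "share at least one point" filter helps with all relations.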

  20. Radon Algorithm: Basic Idea
      Radon implements an improved indexing approach based on:
      (1) Minimum bounding boxes (MBB)
      (2) Space tiling

  21. Radon Algorithm: I. Swapping Strategy
      Large geometries that span a large number of hypercubes ⇒ large spatial index when used as S
      Estimated Total Hypervolume (ETH) of a set of geometries X:
      \[
      \mathrm{ETH}(X) = |X| \prod_{i=1}^{d} \left( \frac{1}{|X|} \sum_{x \in X} \left( \max_{p \in x} \kappa_i(p) - \min_{p \in x} \kappa_i(p) \right) \right)
      \]
      If ETH(S) > ETH(T), Radon swaps S and T and computes the reverse relation r′ instead of r.
      For example, if r is covers and ETH(S) > ETH(T), Radon swaps S and T and computes coveredBy.
      Since ETH(NUTS) > ETH(CLC), S = CLC and T = NUTS.
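
The swapping rule can be sketched as follows. The sketch assumes that κ_i(p) is the i-th coordinate of point p and that a geometry is given as a plain list of coordinate tuples; both are assumptions made for this example, as are the function names and the reverse-relation table.

```python
from math import prod


def eth(geometries):
    """Estimated Total Hypervolume of a collection of geometries.

    Each geometry is a list of points and each point a tuple of d coordinates.
    Per dimension, the extent (max - min) is averaged over all geometries;
    the product of these averages is scaled by the number of geometries.
    """
    n = len(geometries)
    if n == 0:
        return 0.0
    d = len(geometries[0][0])
    mean_extents = [
        sum(max(p[i] for p in g) - min(p[i] for p in g) for g in geometries) / n
        for i in range(d)
    ]
    return n * prod(mean_extents)


def order_for_indexing(S, T, relation, reverse_of):
    """Swap S and T when S has the larger ETH and switch to the reverse relation."""
    if eth(S) > eth(T):
        return T, S, reverse_of[relation]
    return S, T, relation


if __name__ == "__main__":
    large = [[(0, 0), (10, 10)]]
    small = [[(0, 0), (1, 1)], [(2, 2), (3, 4)]]
    reverse_of = {"covers": "coveredBy", "coveredBy": "covers"}
    source, target, relation = order_for_indexing(large, small, "covers", reverse_of)
    print(eth(large), eth(small), relation)
```

Indexing the set with the smaller estimated hypervolume keeps the number of occupied hypercubes, and therefore the index, small.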

  22. Radon Algorithm: II. Optimized Sparse Space Tiling
      Insert all geometries s ∈ S into an index I(S):
      (1) Compute MBB(s)
      (2) Map each s to all hypercubes that MBB(s) spans
      Apply the same procedure to all t ∈ T, but index only those geometries t that potentially lie in hypercubes already contained in I(S).
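
The two indexing steps can be sketched in Python for the 2D case. This is a simplified illustration under stated assumptions: geometries are lists of (x, y) points, the grid uses a fixed cell size chosen by the caller, and all names are hypothetical. How Radon itself chooses the granularity and verifies candidates is not covered here.

```python
from collections import defaultdict
from itertools import product
from math import floor


def mbb(geometry):
    """Minimum bounding box of a list of (x, y) points."""
    xs = [p[0] for p in geometry]
    ys = [p[1] for p in geometry]
    return min(xs), min(ys), max(xs), max(ys)


def cells(box, size):
    """All grid cells (hypercubes in 2D) that a bounding box spans."""
    x0, y0, x1, y1 = box
    return product(
        range(floor(x0 / size), floor(x1 / size) + 1),
        range(floor(y0 / size), floor(y1 / size) + 1),
    )


def build_index(S, size):
    """Map every grid cell touched by MBB(s) to the ids of those geometries s."""
    index = defaultdict(set)
    for sid, geometry in S.items():
        for cell in cells(mbb(geometry), size):
            index[cell].add(sid)
    return dict(index)


def candidate_pairs(S, T, size=1.0):
    """Pair each t only with geometries s that share a cell already in I(S)."""
    index_s = build_index(S, size)
    pairs = set()
    for tid, geometry in T.items():
        for cell in cells(mbb(geometry), size):
            if cell in index_s:  # cells that are empty in I(S) are skipped
                pairs.update((sid, tid) for sid in index_s[cell])
    return pairs


if __name__ == "__main__":
    S = {"s1": [(0.2, 0.2), (0.8, 0.8)], "s2": [(5.1, 5.1), (5.9, 5.9)]}
    T = {"t1": [(0.5, 0.5), (1.5, 0.9)], "t2": [(9.0, 9.0), (9.5, 9.5)]}
    print(candidate_pairs(S, T))
```

The result is only a candidate set: each pair still has to pass the exact topological test, but geometries of T that fall entirely into cells unused by S are never compared at all.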
