Challenges of Making Data Interoperable during Query Processing - PowerPoint PPT Presentation


  1. Challenges of Making Data Interoperable during Query Processing Maria-Esther Vidal Scientific Data Management Group TIB, Germany Universidad Simón Bolívar, Venezuela

  2. Motivating Example
     Query: Drugs with the active substance Simvastatin:
     ○ Name of possible drug targets,
     ○ Chemical formula of a drug,
     ○ Side effects, and
     ○ Disease name

  3. Motivating Example: Available Data Sources
     Genomic data, biological data, and chemical data: diverse data sources, potentially incomplete and noisy.

  4. Motivating Example: Data Sources in Heterogeneous Formats
     Data sources come in diverse formats, e.g., XML, CSV, JSON.

  5. Data Evolution
     ● Schema changes
     ● Changes in data entity
     ● Source changes, e.g., performance and availability
     ● Data completeness
     ● Data distribution changes

  6. Impacting Data Complexity Dimensions: Veracity, Variety, and Variability

  7. Query Over Heterogeneous Data Sources
     ● Query: Drugs with the active substance Simvastatin:
       ○ Name of possible drug targets,
       ○ Chemical formula of a drug,
       ○ Side effects, and
       ○ Disease name

  8. Interoperability Issues During Query Processing
     Disease data (JSON):
     [{ "diseaseID": " ", "name": "Diabetes_mellitus", "associatedGene": ["ACE", "ABCC8", "TCF1"] },
      { "diseaseID": " ", "name": "Kaposi sarcoma, susceptibility to, 148000", "associatedGene": ["IL6", "IFNB2", "BSF2"] }]

     File names shown on the slide: side_effects.csv, drug_names.csv

     Drug:
     | accNum  | DrugName    | formula   | pubChemId |
     |---------|-------------|-----------|-----------|
     |         | simvastatin | C25H38O5  | 54454     |
     | DB00295 | Morphine    | C17H19NO3 | 5288826   |

     Drug_Target:
     | Drug    | Target |
     |---------|--------|
     |         | 631    |
     |         | 1882   |
     | DB00295 | 7683   |

     Target:
     | ID   | Name                                            | Gene  | UniprotID |
     |------|-------------------------------------------------|-------|-----------|
     | 631  | 3-hydroxy-3-methylglutaryl-coenzyme A reductase | HMGCR | P04035    |
     | 1882 | Ras-related C3 botulinum toxin substrate 1      | RAC1  | P63000    |
     | 7683 | Mu-type opioid receptor                         | OPRM1 | P35372    |
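To illustrate why the data on slide 8 is hard to query, here is a minimal Python sketch that joins the three tables on their keys. It rests on assumptions that go beyond the slide: the tables are given as CSV text, the blank cells are missing values, and the two Drug_Target rows with no drug identifier are the ones that sit before the `DB00295` row in the flattened slide. The helper names `read_rows` and `target_genes` are hypothetical.

```python
import csv
import io

# Rows transcribed from slide 8. Blank cells are kept as empty strings;
# their position in Drug_Target is an assumption about the original layout.
DRUG_CSV = """accNum,DrugName,formula,pubChemId
,simvastatin,C25H38O5,54454
DB00295,Morphine,C17H19NO3,5288826
"""
DRUG_TARGET_CSV = """Drug,Target
,631
,1882
DB00295,7683
"""
TARGET_CSV = """ID,Name,Gene,UniprotID
631,3-hydroxy-3-methylglutaryl-coenzyme A reductase,HMGCR,P04035
1882,Ras-related C3 botulinum toxin substrate 1,RAC1,P63000
7683,Mu-type opioid receptor,OPRM1,P35372
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def target_genes(drug_name):
    """Return the target genes of a drug, or None if the join key is missing."""
    drugs = read_rows(DRUG_CSV)
    acc = next(
        (d["accNum"].strip() for d in drugs
         if d["DrugName"].lower() == drug_name.lower()),
        None,
    )
    if not acc:
        # Unknown drug, or the drug has no accession number to join on.
        return None
    target_ids = {r["Target"] for r in read_rows(DRUG_TARGET_CSV)
                  if r["Drug"].strip() == acc}
    return sorted(t["Gene"] for t in read_rows(TARGET_CSV)
                  if t["ID"] in target_ids)


print(target_genes("Morphine"))
print(target_genes("simvastatin"))
```

Under these assumptions the join succeeds for Morphine but not for simvastatin, whose accession number is blank, even though candidate target rows exist. A query engine would need another linking strategy (for example the `pubChemId`) to recover them.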

  9. Query Over Heterogeneous Data Sources
     ● Query: Drugs with the active substance Simvastatin:
       ○ Name of possible drug targets,
       ○ Chemical formula of a drug,
       ○ Side effects, and
       ○ Disease name
     The query must be evaluated against heterogeneous sources that potentially suffer from quality issues and evolve over time.

  10. Agenda 1. Data Integration Systems 2. Adaptive SPARQL Query Engines 3. Hybrid SPARQL Query Engines

  11. Data Integration Systems
      A data integration system DIS = <O, S, M>:
      • O is a set of general concepts in a general (virtual) schema
      • S is a set {S1, ..., Sn} of data sources
      • M is a set of mappings between sources in S and general concepts in O
      cf. Lenzerini 2002
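The triple DIS = <O, S, M> can be sketched as a small data structure. This is only an illustration: mappings are reduced to (source, attribute, concept) records, whereas mappings in the literature are usually view definitions, and the source and attribute names below are hypothetical examples loosely modeled on the motivating example.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Mapping:
    """One rule in M: an attribute of a source is mapped to a concept in O."""
    source: str
    attribute: str
    concept: str


@dataclass
class DataIntegrationSystem:
    concepts: Set[str]       # O: general concepts of the virtual schema
    sources: Set[str]        # S: the data sources S1, ..., Sn
    mappings: List[Mapping]  # M: rules relating sources to concepts

    def validate(self) -> None:
        """Check that every mapping refers to a known source and concept."""
        for m in self.mappings:
            if m.source not in self.sources:
                raise ValueError(f"unknown source: {m.source}")
            if m.concept not in self.concepts:
                raise ValueError(f"unknown concept: {m.concept}")

    def sources_for(self, concept: str) -> Set[str]:
        """Return the sources that some mapping relates to the concept."""
        return {m.source for m in self.mappings if m.concept == concept}


# Hypothetical instance; file and attribute names are illustrative only.
dis = DataIntegrationSystem(
    concepts={"Drug", "Target", "Disease"},
    sources={"drugs.csv", "targets.csv", "diseases.json"},
    mappings=[
        Mapping("drugs.csv", "DrugName", "Drug"),
        Mapping("targets.csv", "Gene", "Target"),
        Mapping("diseases.json", "name", "Disease"),
        Mapping("diseases.json", "associatedGene", "Target"),
    ],
)
dis.validate()
print(sorted(dis.sources_for("Target")))  # ['diseases.json', 'targets.csv']
```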

  12. Data Integration Systems
      [Figure: data integration systems classified along two dimensions: heterogeneous sources (accessed through wrappers) vs. homogeneous sources, and centralized vs. distributed.]

  13. Data Integration Systems
      [Figure: the same classification (heterogeneous with wrappers vs. homogeneous, centralized vs. distributed); ✽ marks existing data integration systems for query processing over RDF.]

  14. Query Rewriting Problem (QRP)
      ● A query Q is a conjunctive query over predicates in O
      ● Find a conjunctive query Q' expressed over the sources in S, based on the rules in M, such that
        ○ the evaluation of Q' produces only answers of Q
        ○ the evaluation of Q' produces all the answers of Q given the sources in S
      Theorem [Levy et al. 1995]: Checking whether there is a valid rewriting Q' of Q with at most the same number of goals as Q is an NP-complete problem.

  15. Challenges for Query Processing
      Given a query Q in a formal language, e.g., SPARQL:
      ● Identify the relevant data sources for Q (Source Selection)
      ● Decompose Q into subqueries on the relevant data sources (Query Decomposition)
      ● Plan the evaluation of the subqueries against the relevant data sources (Query Planning)
      ● Merge the data collected from the relevant data sources (Query Execution)
      Relevant data sources for Q: a minimal set of sources S from a federation of sources F such that evaluating Q in S yields the same answer as evaluating Q in F.
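The first two steps can be sketched in Python under strong simplifying assumptions: each source is described only by the set of predicates it can answer, and triple patterns are plain tuples. The source names, the `ex:` predicates, and the functions `select_sources` and `decompose` are hypothetical. Predicate-based selection of this kind may return more sources than the minimal set defined on the slide.

```python
from collections import defaultdict

# Hypothetical source descriptions: the predicates each source can answer.
SOURCE_PREDICATES = {
    "drugbank": {"ex:activeSubstance", "ex:formula", "ex:target"},
    "sider": {"ex:sideEffect"},
    "mirror": {"ex:formula"},
}

# The motivating query as (subject, predicate, object) triple patterns.
QUERY = [
    ("?drug", "ex:activeSubstance", "Simvastatin"),
    ("?drug", "ex:formula", "?formula"),
    ("?drug", "ex:target", "?target"),
    ("?drug", "ex:sideEffect", "?effect"),
]


def select_sources(triple_patterns, descriptions):
    """Source selection: map each pattern to the sources listing its predicate."""
    return {
        tp: sorted(s for s, preds in descriptions.items() if tp[1] in preds)
        for tp in triple_patterns
    }


def decompose(selection):
    """Query decomposition: group patterns that share the same source set."""
    groups = defaultdict(list)
    for tp, sources in selection.items():
        groups[tuple(sources)].append(tp)
    return dict(groups)


selection = select_sources(QUERY, SOURCE_PREDICATES)
subqueries = decompose(selection)
for sources, patterns in subqueries.items():
    print(sources, patterns)
```

Grouping patterns by identical source sets lets the patterns answerable only by `drugbank` travel as one subquery, while `ex:formula` must still be sent to both `drugbank` and `mirror`. Planning and execution would then order and join these subqueries.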

  16. Federated SPARQL Query Engines
      Web-access interfaces (with unpredictable behavior) that allow for querying RDF data:
      ● SPARQL endpoints: respect the SPARQL protocol, i.e., accept any SPARQL query
      ● Linked Data Fragments: limited query capabilities, i.e., only one triple pattern
      [Figure: a data integration system over a federation of RDF data sources.]
      Challenges: Query processing is impacted by different parameters, e.g., query capabilities, data fragmentation, dataset size and connectivity, query selectivity, and the current conditions of the Web-access interfaces.

  17. Federated Query Engine
      [Figure: a SPARQL query Q passes through Source Selection & Query Decomposition, the Query Optimizer, and the Execution Strategies.]

  18. Federated SPARQL Query Engines and Extensions
      [Figure: a data integration system surrounded by LILAC [5], FEDRA [6], Fed-DESATUR [3], MULDER [10], DAW [9], HIBISCUS [15], ANAPSID [1], SPLENDID [3], Ontario [14], and Network of Linked Data Eddies (nLDE) [2]; the citations [4], [7], and [12] appear without a text label.]

  19. Required Solutions to Support Evolution
      Querying evolving data requires support for:
      1. Knowledge Incompleteness: considering that unknown facts may need to be predicted during query execution
      2. Environment Evolution: executing queries according to the current conditions of the environment
      3. Data Evolution: considering the status of the data, e.g., completeness, during the execution of the query
      4. Knowledge Evolution: considering the evolution of the knowledge during the execution of the query
      5. Source Evolution: selecting the sources according to their current conditions and availability

  20. Required Solutions to Support Evolution (same content as slide 19)

  21. Adaptive SPARQL Query Engines
      Adapt to source and environment evolution:
      ▪ Misestimated or missing statistics
      ▪ Unexpected correlations
      ▪ Unpredictable costs
      ▪ Dynamically changing data, workload, and source availability
      ▪ Changes in the rates at which tuples arrive from sources:
        • Initial delays
        • Slow delivery
        • Bursty arrivals

  22. Adaptivity in Federated Query Processing
      Query engines able to:
      ● Change their behavior by learning the behavior of data providers
      ● Receive and exploit information from the environment
      ● Use up-to-date information to change their behavior
      ● Keep iterating over time to adapt their behavior based on the environment conditions
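One simple way to picture "learning the behavior of data providers" is a monitor that keeps a running estimate of each source's response time and re-ranks the sources after every observation. The sketch below is an illustration under assumptions, not the mechanism of any engine named in these slides: the class `SourceMonitor`, the exponential moving average, and the smoothing factor `alpha` are all choices made for this example.

```python
class SourceMonitor:
    """Track per-source latency with an exponential moving average (EMA)."""

    def __init__(self, alpha=0.5):
        # alpha close to 1 reacts quickly to new observations;
        # alpha close to 0 favors the history.
        self.alpha = alpha
        self.latency = {}

    def observe(self, source, seconds):
        """Fold a newly measured response time into the source's estimate."""
        previous = self.latency.get(source)
        if previous is None:
            self.latency[source] = seconds
        else:
            self.latency[source] = (
                self.alpha * seconds + (1 - self.alpha) * previous
            )

    def rank(self, sources):
        """Order sources from fastest to slowest; unobserved sources go last."""
        return sorted(sources, key=lambda s: self.latency.get(s, float("inf")))


monitor = SourceMonitor(alpha=0.5)
monitor.observe("endpoint_a", 1.0)
monitor.observe("endpoint_b", 0.2)
print(monitor.rank(["endpoint_a", "endpoint_b"]))

# endpoint_b slows down; the ranking adapts on the next observation.
monitor.observe("endpoint_b", 3.0)
print(monitor.rank(["endpoint_a", "endpoint_b"]))
```

Ranking unobserved sources last is a conservative choice; an engine that wants to explore new providers could instead try them first.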

  23. Existing Federated SPARQL Query Engines
      Existing federated query engines provide:
      ● Adaptive source selection
        ○ Identification of relevant sources based on current conditions
        ○ Query decomposition based on current conditions
      ● Adaptive query processing
        ○ Adaptive operators, e.g., GJoin [1], SMJoin [13]
        ○ Adaptive query engines, e.g., Networks of Linked Data Eddies [2]

  24. Existing Federated SPARQL Query Engines
      Existing federated query engines provide:
      ● Adaptive source selection
        ○ Identification of relevant sources based on current conditions
        ○ Query decomposition based on current conditions
      ● Adaptive query processing
        ○ Adaptive operators, e.g., GJoin [1], SMJoin [13]
        ○ Adaptive query engines, e.g., Networks of Linked Data Eddies [2]
      Only adaptivity to changes in the environment is addressed!

  25. Adaptivity During Source Selection
      [Figure: ANAPSID, SPLENDID, LILAC, FEDRA, Fed-DESATUR, DAW, MULDER, and HIBISCUS grouped under No Adaptivity, Coarse-Grained Adaptivity, and Fine-Grained Adaptivity.]
      Source selection techniques that identify the sources that can be used to answer a query based on the current conditions of the sources.

  26. Adaptivity During Query Execution
      [Figure: SPLENDID, ANAPSID, Network of Linked Data Eddies (nLDE), DAW, HIBISCUS, LILAC, FEDRA, MULDER, and Fed-DESATUR grouped under No Adaptivity and Fine-Grained Adaptivity.]
      Implement physical operators and query processing techniques to adjust query schedulers to the conditions of the sources and the network.

  27. Evaluation
      Dataset: DBpedia 2015 (HDT on top of a TPF server), 837M triples
      Benchmark 1: 14 high-selective queries (<1000 intermediate results)
      Benchmark 2: Four low-selective queries (>1000 intermediate results)
      Metrics:
      • Execution time (ms)
      • Completeness over time (%)
      Compared tools:
      ● TPF: triple pattern fragment server [7]
      ● nLDE: network of Linked Data Eddies [2]
      ● SMJoin: multi-way join operator for SPARQL [13]
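The slide does not define "completeness over time". A common reading, assumed here, is the percentage of the final answers that the engine has produced by a given moment. The following sketch computes that value from the arrival timestamps of the answers; the function name and the sample numbers are illustrative only and are not results from the evaluation.

```python
from bisect import bisect_right


def completeness_over_time(arrival_ms, total_answers, checkpoints_ms):
    """Return (time, percent of answers produced by that time) pairs.

    arrival_ms: time in ms at which each answer was produced
    total_answers: number of answers in the complete result
    checkpoints_ms: times in ms at which completeness is sampled
    """
    if total_answers <= 0:
        raise ValueError("total_answers must be positive")
    arrivals = sorted(arrival_ms)
    curve = []
    for t in checkpoints_ms:
        # Number of answers whose arrival time is at or before t.
        produced = bisect_right(arrivals, t)
        curve.append((t, 100.0 * produced / total_answers))
    return curve


# Made-up arrival times for a query with four answers.
print(completeness_over_time([100, 250, 250, 900], 4, [0, 250, 1000]))
# [(0, 0.0), (250, 75.0), (1000, 100.0)]
```

A curve that rises early indicates an engine that delivers answers incrementally, which is the behavior the execution-time metric alone does not capture.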

  28. Benchmark 1: High Selective Queries
      An adaptive approach like SMJoin outperforms the other approaches on high-selective queries that produce a small number of intermediate results.

  29. Benchmark 2: Low Selective Queries
      [Figure: result panels for Q1 and Q2.]
      • SMJoin yields the first answer at about the same time as nLDE
      • SMJoin has to process more intermediate results
      • Q2: results are yielded, but all intermediate tuples have to be processed
