Challenges of Making Data Interoperable during Query Processing - PowerPoint PPT Presentation

Challenges of Making Data Interoperable during Query Processing Maria-Esther Vidal Scientific Data Management Group TIB, Germany Universidad Simón Bolívar, Venezuela

Motivating Example
Query: Drugs with the active substance Simvastatin:
○ Name of possible drug targets,
○ Chemical formula of a drug,
○ Side effects, and
○ Disease Name
Page 2

Motivating Example: Available Data Sources
Genomic Data, Biological Data, Chemical Data
Diverse data sources, potentially incomplete and noisy
Page 3

Motivating Example: Data Sources in Heterogeneous Formats
Data sources come in diverse formats, e.g., XML, CSV, JSON
Page 4

Data Evolution
● Schema Changes
● Changes in Data Entity
● Source Changes, e.g., Performance and Availability
● Data Completeness
● Data Distribution Changes
Page 5

Impacting Data Complexity Dimensions Veracity, Variety, and Variability Page 6

Query Over Heterogeneous Data Sources
● Query: Drugs with the active substance Simvastatin:
○ Name of possible drug targets,
○ Chemical formula of a drug,
○ Side effects, and
○ Disease Name
Page 7

Interoperability Issues During Query Processing
Disease data (JSON):
[{ "diseaseID": " ", "name": "Diabetes_mellitus", "associatedGene": ["ACE", "ABCC8", "TCF1"] },
 { "diseaseID": " ", "name": "Kaposi sarcoma, susceptibility to, 148000", "associatedGene": ["IL6", "IFNB2", "BSF2"] }]
Drug:
accNum | DrugName | formula | pubChemId
 | simvastatin | C25H38O5 | 54454
DB00295 | Morphine | C17H19NO3 | 5288826
Drug_Target:
Drug | Target
 | 631
 | 1882
DB00295 | 7683
Target:
ID | Name | Gene | UniprotID
631 | 3-hydroxy-3-methylglutaryl-coenzyme A reductase | HMGCR | P04035
1882 | Ras-related C3 botulinum toxin substrate 1 | RAC1 | P63000
7683 | Mu-type opioid receptor | OPRM1 | P35372
CSV files: side_effects.csv, drug_names.csv
Page 8

Query Over Heterogeneous Data Sources
● Query: Drugs with the active substance Simvastatin:
○ Name of possible drug targets,
○ Chemical formula of a drug,
○ Side effects, and
○ Disease Name
The query must be evaluated against heterogeneous sources that potentially suffer from quality issues and evolve over time
Page 9

Agenda 1. Data Integration Systems 2. Adaptive SPARQL Query Engines 3. Hybrid SPARQL Query Engines

Data Integration Systems
A data integration system DIS = <O, S, M>:
• O is a set of general concepts in a general (virtual) schema
• S is a set {S1, ..., Sn} of data sources
• M is a set of mappings between sources in S and general concepts in O
cf. Lenzerini 2002
Page 11
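The triple <O, S, M> can be pictured with a small data structure. The following is only a minimal sketch: the class names, the concept names drugName and geneSymbol, and the row layout are assumptions made for illustration, and the sample values are taken from the tables on Page 8.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """One rule in M: an attribute of a source is mapped to a concept in O."""
    source: str        # name of a source in S
    source_attr: str   # attribute name as it appears in that source
    concept: str       # general concept in O (hypothetical names below)


@dataclass
class DataIntegrationSystem:
    concepts: set[str]                 # O
    sources: dict[str, list[dict]]     # S: source name -> rows
    mappings: list[Mapping]            # M

    def values_of(self, concept: str) -> set:
        """Collect the values of one general concept across all mapped sources."""
        if concept not in self.concepts:
            raise KeyError(concept)
        found = set()
        for m in self.mappings:
            if m.concept != concept:
                continue
            for row in self.sources[m.source]:
                value = row.get(m.source_attr)
                if value not in (None, ""):   # skip missing values
                    found.add(value)
        return found


dis = DataIntegrationSystem(
    concepts={"drugName", "geneSymbol"},
    sources={
        "Drug": [{"DrugName": "simvastatin"}, {"DrugName": "Morphine"}],
        "Target": [{"Gene": "HMGCR"}, {"Gene": "RAC1"}, {"Gene": "OPRM1"}],
    },
    mappings=[
        Mapping("Drug", "DrugName", "drugName"),
        Mapping("Target", "Gene", "geneSymbol"),
    ],
)
drug_names = dis.values_of("drugName")
```

The schema O stays virtual here: nothing is copied into a global store, and values are looked up in the sources through M only when a concept is requested.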

Data Integration Systems
[Figure: data integration systems arranged by source type (heterogeneous, accessed through wrappers, or homogeneous) and by architecture (centralized or distributed)]
Page 12

Data Integration Systems
[Figure: data integration systems arranged by source type (heterogeneous, accessed through wrappers, or homogeneous) and by architecture (centralized or distributed), with ✽ marks]
✽ Existing Data Integration Systems for Query Processing over RDF
Page 13

Query Rewriting Problem
Query Rewriting Problem (QRP):
● A query Q is a conjunctive query over predicates in O
● Find a conjunctive query Q' expressed over the sources in S based on the rules in M, such that
○ Evaluation of Q' produces only answers of Q
○ Evaluation of Q' produces all the answers of Q given the sources in S
Theorem [Levy et al. 1995]: Checking whether there is a valid rewriting Q' of Q with at most the same number of goals as Q is an NP-complete problem.
Page 14
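One common way to read the two conditions on Q' is as soundness and maximality. The notation below is an assumption and does not appear on the slides: I denotes the current instances of the sources in S, Q'(I) the result of evaluating Q' over I, and ans(Q, I) the answers of Q that follow from I under the mappings in M.

```latex
\begin{align*}
\text{(only answers of } Q\text{)}\quad & Q'(I) \subseteq \mathit{ans}(Q, I) \\
\text{(all answers given } S\text{)}\quad & Q''(I) \subseteq Q'(I)
  \quad \text{for every rewriting } Q'' \text{ with } Q''(I) \subseteq \mathit{ans}(Q, I)
\end{align*}
```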

Challenges for Query Processing
Given a query Q in a formal language, e.g., SPARQL:
● Identify the relevant data sources for Q (Source Selection)
● Decompose Q into subqueries on relevant data sources (Query Decomposition)
● Plan evaluation of subqueries against relevant data sources (Query Planning)
● Merge data collected from relevant data sources (Query Execution)
Relevant data sources for Q: a minimal set of sources S from a federation of sources F such that the answer of evaluating Q in S is the same as that of evaluating Q in F
Page 15
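The first two steps can be sketched on the motivating query. This is a simplified illustration and not the method of any engine named in the talk: it assumes each source is described only by the predicates it can answer, and all source names and ex: predicates are hypothetical. Predicate matching of this kind finds candidate sources; it does not by itself guarantee a minimal set.

```python
from collections import defaultdict

# Hypothetical source descriptions: the predicates each source can answer.
SOURCE_PREDICATES = {
    "drugs": {"ex:activeSubstance", "ex:formula", "ex:target"},
    "targets": {"ex:targetName"},
    "side_effects": {"ex:sideEffect"},
    "diseases": {"ex:diseaseName", "ex:associatedGene"},
}

# Part of the motivating query as triple patterns (subject, predicate, object).
QUERY = [
    ("?drug", "ex:activeSubstance", '"Simvastatin"'),
    ("?drug", "ex:formula", "?formula"),
    ("?drug", "ex:target", "?target"),
    ("?target", "ex:targetName", "?targetName"),
    ("?drug", "ex:sideEffect", "?effect"),
]


def select_sources(query, source_predicates):
    """Source selection: map each triple pattern to the sources that know its predicate."""
    selection = {}
    for pattern in query:
        predicate = pattern[1]
        selection[pattern] = sorted(
            name for name, preds in source_predicates.items() if predicate in preds
        )
    return selection


def decompose(selection):
    """Query decomposition: group patterns with a single candidate source into one subquery."""
    subqueries = defaultdict(list)
    for pattern, sources in selection.items():
        if len(sources) == 1:
            subqueries[sources[0]].append(pattern)
        else:
            # Zero or several candidates: keep such patterns under the candidate tuple.
            subqueries[tuple(sources)].append(pattern)
    return dict(subqueries)


selection = select_sources(QUERY, SOURCE_PREDICATES)
subqueries = decompose(selection)
relevant_sources = sorted({s for srcs in selection.values() for s in srcs})
```

In this toy federation the diseases source is never contacted, because no pattern of the query uses one of its predicates.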

Federated SPARQL Query Engines
Federation of RDF Data Sources
Web-access interfaces (with unpredictable behavior) that allow for querying RDF data:
● SPARQL Endpoints: respect the SPARQL protocol, i.e., accept any SPARQL query
● Linked Data Fragments: limited query capabilities, i.e., only one triple pattern
Challenges: Query processing is impacted by different parameters, e.g., query capabilities, data fragmentation, dataset size and connectivity, query selectivity, and current conditions of the Web-access interfaces
Page 16
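The difference in query capabilities matters for where joins are computed. The sketch below assumes an in-memory stand-in for an interface that answers only one triple pattern at a time, so the client has to evaluate the join itself. The function names, the ex: terms, and the drug-to-target links are illustrative toy data; only the UniProt accessions come from the Target table on Page 8.

```python
TRIPLES = [
    ("ex:simvastatin", "ex:target", "ex:HMGCR"),
    ("ex:simvastatin", "ex:target", "ex:RAC1"),
    ("ex:morphine", "ex:target", "ex:OPRM1"),
    ("ex:HMGCR", "ex:uniprot", "P04035"),
    ("ex:RAC1", "ex:uniprot", "P63000"),
    ("ex:OPRM1", "ex:uniprot", "P35372"),
]


def is_var(term):
    return term.startswith("?")


def fragment(pattern):
    """Server side: answer exactly one triple pattern (variables match anything)."""
    return [
        t for t in TRIPLES
        if all(is_var(p) or p == v for p, v in zip(pattern, t))
    ]


def evaluate_bgp(patterns):
    """Client side: join patterns by binding variables and asking the server again."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            # Substitute variables that are already bound before contacting the server.
            bound = tuple(binding.get(term, term) for term in pattern)
            for triple in fragment(bound):
                extended = dict(binding)
                consistent = True
                for term, value in zip(bound, triple):
                    # A variable repeated inside one pattern must match the same value.
                    if is_var(term) and extended.setdefault(term, value) != value:
                        consistent = False
                        break
                if consistent:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


results = evaluate_bgp([
    ("ex:simvastatin", "ex:target", "?t"),
    ("?t", "ex:uniprot", "?acc"),
])
```

Against a SPARQL endpoint the two patterns could be sent as a single query; with the restricted interface the client issues one request for the first pattern and one more request per binding of ?t.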

Federated Query Engine
[Figure: SPARQL Query Q; Source Selection & Query Decomposition; Query Optimizer; Execution Strategies]
Page 17

Federated SPARQL Query Engines and Extensions
LILAC [5], FEDRA [6], Fed-DESATUR [3], MULDER [10], DAW [9], HIBISCUS [15], ANAPSID [1], SPLENDID [3], Ontario [14], Network of Linked Data Eddies (nLDE) [2]
Also cited without a name: [4], [7], [12]
Page 18

Required Solutions to Support Evolution
Querying Evolving Data:
● Source Evolution: Selecting the sources according to their current conditions and availability
● Environment Evolution: Executing queries according to current conditions of the environment
● Data Evolution: Considering the status of the data, e.g., completeness, during the execution of the query
● Knowledge Evolution: Considering the evolution of the knowledge during the execution of the query
● Knowledge Incompleteness: Considering that unknown facts may need to be predicted during query execution
Page 19


Adaptive SPARQL Query Engines
Adapt to Source and Environment Evolution:
▪ Misestimated or missing statistics.
▪ Unexpected correlations.
▪ Unpredictable costs.
▪ Dynamically changing data, workload, and source availability.
▪ Changes in the rates at which tuples arrive from sources:
• Initial Delays.
• Slow Delivery.
• Bursty Arrivals.
Page 21

Adaptivity in Federated Query Processing
Query engines able to:
● Change their behavior by learning the behavior of data providers
● Receive and exploit information from the environment
● Use up-to-date information to change their behavior
● Keep iterating over time to adapt their behavior based on the environment conditions
Page 22
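A small sketch of the feedback loop described above, under stated assumptions: the engine observes response times, keeps an exponentially smoothed estimate per source, and reorders the sources as the estimates change. The class name, the smoothing weight, and the latencies are hypothetical and are not taken from any engine discussed in the talk.

```python
class SourceMonitor:
    """Keeps a smoothed response-time estimate per source and ranks sources by it."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha    # weight given to the newest observation
        self.estimate = {}    # source name -> smoothed latency in seconds

    def observe(self, source, latency):
        """Fold one observed latency into the estimate for that source."""
        old = self.estimate.get(source)
        if old is None:
            self.estimate[source] = latency
        else:
            self.estimate[source] = self.alpha * latency + (1 - self.alpha) * old

    def ranking(self):
        """Sources ordered from the fastest to the slowest current estimate."""
        return sorted(self.estimate, key=self.estimate.get)


monitor = SourceMonitor(alpha=0.5)
# Source A starts fast and then degrades; source B stays stable.
for source, latency in [("A", 0.2), ("B", 0.4), ("A", 2.0), ("B", 0.4), ("A", 2.0)]:
    monitor.observe(source, latency)
ranking = monitor.ranking()
```

After the first two observations A would be ranked first; once A slows down, the ranking flips to B, which is the kind of behavior change the slide asks for.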


Existing Federated SPARQL Query Engines
Adaptive Source Selection:
● Identification of relevant sources based on current conditions
● Query decomposition based on current conditions
Adaptive Query Processing:
● Adaptive operators, e.g., GJoin [1], SMJoin [13]
● Adaptive query engines, e.g., Networks of Linked Data Eddies [2]
Only adaptivity to changes in the environment is addressed!
Page 24
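To give a feel for what an adaptive operator does, here is a sketch of a symmetric hash join, a generic non-blocking join from the adaptive query processing literature. It is not the GJoin or SMJoin operator cited above; it only illustrates how results can be produced as tuples arrive, whichever source is delayed. The event format and the sample rows are assumptions.

```python
from collections import defaultdict


def symmetric_hash_join(arrivals, key):
    """Yield joined (left, right) pairs as soon as both matching tuples have arrived.

    `arrivals` is an iterable of (side, row) events in arrival order,
    where side is "left" or "right" and row is a dict containing `key`.
    """
    tables = {"left": defaultdict(list), "right": defaultdict(list)}
    for side, row in arrivals:
        other = "right" if side == "left" else "left"
        k = row[key]
        tables[side][k].append(row)              # insert into this side's hash table
        for match in tables[other].get(k, []):   # probe the opposite hash table
            yield (row, match) if side == "left" else (match, row)


# The right source answers first; the left source has an initial delay.
arrivals = [
    ("right", {"target": "631", "gene": "HMGCR"}),
    ("right", {"target": "7683", "gene": "OPRM1"}),
    ("left", {"drug": "simvastatin", "target": "631"}),
    ("left", {"drug": "morphine", "target": "7683"}),
]
joined = list(symmetric_hash_join(arrivals, "target"))
```

Neither input has to be read completely before the first result appears, so a slow or bursty source delays only the results that depend on its own tuples.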

Adaptivity During Source Selection
Categories: No Adaptivity, Coarse-Grained Adaptivity, Fine-Grained Adaptivity
Engines compared: ANAPSID, SPLENDID, LILAC, FEDRA, Fed-DESATUR, DAW, MULDER, HIBISCUS
Source selection techniques that allow for identifying the sources that can be used to answer a query based on the current conditions of the sources
Page 25

Adaptivity During Query Execution
No Adaptivity: SPLENDID, DAW, HIBISCUS, LILAC, FEDRA, MULDER, Fed-DESATUR
Fine-Grained Adaptivity: ANAPSID, Network of Linked Data Eddies (nLDE)
Implement physical operators and query processing techniques to adjust query schedulers to the conditions of the sources and the network
Page 26

Evaluation
Dataset: DBpedia 2015 (HDT on top of a TPF server), 837M triples
Benchmark 1: 14 high-selective queries (<1000 intermediate results)
Benchmark 2: Four low-selective queries (>1000 intermediate results)
Metrics:
• Execution time (ms)
• Completeness over time (%)
Compared tools:
● TPF: triple pattern fragment server [7]
● nLDE: network of Linked Data Eddies [2]
● SMJoin: multi-way join operator for SPARQL [13]
Page 27
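The completeness-over-time metric can be computed from an answer trace, i.e., the timestamps at which an engine produced each answer. The sketch below assumes this trace format and uses made-up timestamps; it is not the evaluation script behind the reported results.

```python
from bisect import bisect_right


def completeness_over_time(answer_times_ms, total_answers, checkpoints_ms):
    """Percentage of the complete answer that has been produced by each checkpoint."""
    times = sorted(answer_times_ms)
    return [
        100.0 * bisect_right(times, t) / total_answers   # answers with timestamp <= t
        for t in checkpoints_ms
    ]


# Hypothetical trace: four answers produced at these times (ms) after query start.
trace = [120, 150, 400, 900]
curve = completeness_over_time(trace, total_answers=4, checkpoints_ms=[100, 200, 500, 1000])
```

An engine that delivers answers continuously shows a curve that rises early, while a blocking engine stays at 0% until shortly before its total execution time.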

Benchmark 1: High Selective Queries
An adaptive approach like SMJoin outperforms other approaches on high-selective queries that produce a small number of intermediate results
Page 28

Benchmark 2: Low Selective Queries
[Plots: Q1, Q2]
• SMJoin yields the first answer at about the same time as nLDE
• SMJoin has to process more intermediate results
• Q2: results are yielded but all intermediate tuples have to be processed
Page 29
