An Adaptive Query Execution Engine for Data Integration


Slide 1

An Adaptive Query Execution Engine for Data Integration

Zachary Ives, Daniela Florescu, Marc Friedman, Alon Levy, Daniel S. Weld University of Washington

Presented by Peng Li@CS.UBC

Slide 2

Outline

  • The Background of Data Integration Systems
  • Tukwila Architecture
  • Interleaving of planning and execution
  • Adaptive Query Operators
    – Collectors & the Double Pipelined Join
  • Performance
Slide 3

Background (Data Integration Systems)

A data integration system provides uniform access to multiple autonomous (the system cannot affect the behavior of the sources) and heterogeneous (different data models and schemas) data sources.

Slide 4

The key goal of DISs:

  • "Free users from having to locate the sources relevant to their query, interact with each source independently, and manually combine the data from the different sources"

Slide 5

The main challenges in the design of DISs:

  • Query reformulation
  • The construction of wrapper programs
  • Query optimizers and efficient query execution engines

Slide 6

Motivations:

  • Little information is available for cost estimates
  • Unpredictable data transfer rates
  • Unreliable, overlapping sources
  • Users want initial results quickly
  • Network bandwidth generally constrains the data sources to be smaller than in traditional database applications

⇒ The engine must be adaptive.

Slide 7

Tukwila Architecture

  • 1. a semantic description of the contents of the data sources;
  • 2. overlap information about pairs of data sources;
  • 3. key statistics about the data, such as the cost of accessing each source, and so on.

Slide 8

Novel Features of Tukwila

  • Interleaving of planning and execution
    – Compensates for lack of information
  • Event-condition-action rules
    – Decide when and how to modify the implementation of certain operators at runtime, if needed
    – Detect opportunities for re-optimization
  • Manages overlapping data sources (collectors)
  • Tolerant of latency (double pipelined join)
    – Returns initial results quickly

Slide 9

Interleaving of planning and execution

The non-traditional characteristics of Tukwila are as follows:

  – The optimizer may create only a partial plan if essential statistics are missing or uncertain.
  – The optimizer generates not only operator trees but also the appropriate event-condition-action rules.
  – The optimizer conserves the state of its search space when it calls the execution engine.

RAP1

Slide 10

Slide 9, RAP1:

  • 1. Too wordy.
  • 2. You mean "characteristics" rather than "characters".

Rachel Pottinger, 2/20/2006

Slide 11

Overview of the query plan structure

  • A plan includes a partially-ordered set of fragments and a set of global rules.
  • A fragment consists of a fully pipelined tree of physical operators and a set of local rules.

Why does the system need the fragment structure?

The fragment structure is the key mechanism for implementing the adaptive property: at the end of each fragment, the rest of the plan can be re-optimized or rescheduled.
Slide 12

Rules

  • Re-optimization
    If the optimizer's cardinality estimate for the fragment's result is significantly different from the actual size, re-invoke the optimizer.

  • Contingent planning
    The execution engine checks properties of the result to select the next fragment.

  • Rescheduling
    Reschedule if a source times out.

  • Adaptive operators
Slide 13

Rule format

  when event if condition then actions

For example:

  when closed(frag1) if card(join1) > 2 * est_card(join1) then replan

An event triggers a rule, causing it to check its condition. If the condition is true, the rule fires, executing its action(s).
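As a rough illustration (not Tukwila's actual implementation), the trigger-check-fire cycle above can be sketched in a few lines; the `Rule`/`RuleEngine` classes and the state keys (`card_join1`, `est_card_join1`, `replan`) are invented for this example:

```python
# Sketch of the "when <event> if <condition> then <actions>" rule cycle.
# All names here are illustrative, not Tukwila's real API.

class Rule:
    def __init__(self, event, condition, action):
        self.event = event          # event name that triggers the rule
        self.condition = condition  # predicate over execution-engine state
        self.action = action        # callable run when the rule fires

class RuleEngine:
    def __init__(self):
        self.rules = []

    def register(self, rule):
        self.rules.append(rule)

    def signal(self, event, state):
        """An event triggers matching rules; each checks its condition,
        and if the condition is true the rule fires its action(s)."""
        fired = []
        for rule in self.rules:
            if rule.event == event and rule.condition(state):
                rule.action(state)
                fired.append(rule)
        return fired

# "when closed(frag1) if card(join1) > 2*est_card(join1) then replan"
engine = RuleEngine()
engine.register(Rule(
    event="closed(frag1)",
    condition=lambda s: s["card_join1"] > 2 * s["est_card_join1"],
    action=lambda s: s.update(replan=True),
))

state = {"card_join1": 500, "est_card_join1": 200, "replan": False}
fired = engine.signal("closed(frag1)", state)   # 500 > 400, so the rule fires
```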

RAP2

Slide 14

Slide 12, RAP2: Given time constraints, I'd cut slides 12 & 13.

Rachel Pottinger, 2/20/2006

Slide 15

Events

  • open, closed: fragment/operator starts or completes
  • error: operator failure, e.g., unable to contact source
  • timeout(n): data source has not responded in n msec
  • out-of-memory: join has insufficient memory

Conditions

  • state(operator): the operator's current state
  • card(operator): the number of tuples produced so far
  • time(operator): the time waiting since the last tuple
  • memory(operator): the memory used so far

Actions

  • set the overflow method for a double pipelined join
  • alter a memory allotment
  • deactivate an operator or fragment, which stops its execution and deactivates its associated rules
  • reschedule the query operator tree
  • re-optimize the plan
  • return an error to the user

Slide 16

Group Discussion

  • Consider one of the following motivating situations for Tukwila:
    – Absence of statistics
    – Unpredictable data arrival characteristics
    – Overlap and redundancy among sources
    – Optimizing the time to initial answers
  • Q1: Can you give some examples where the chosen topic matters?
  • Q2: If you were a member of the Tukwila team, what rules or policy would you use to deal with the problem?
    – To help discussion, more specific situations will be given
    – But you may assume any problem or situation
  • Discussion
    – Form 8 groups (3–4 people per group, two teams per topic)
    – Discuss Q1 and Q2 for one topic (5–7 minutes)

Slide 17

Examples

Join on Orders.TrackNo = UPS.TrackNo (Orders, UPS):

Orders:
  OrderNo   TrackNo
  1234      01-23-45
  1235      02-90-85
  1399      02-90-85
  1500      03-99-10

UPS:
  TrackNo   Status
  01-23-45  In Transit
  02-90-85  Delivered
  03-99-10  Delivered
  04-08-30  Undeliverable

Result:
  OrderNo   TrackNo   Status
  1234      01-23-45  In Transit
  1235      02-90-85  Delivered
  1399      02-90-85  Delivered
  1500      03-99-10  Delivered

Slide 18

Query Plan Execution

A query plan is represented as a data-flow tree. Example query, "Show which orders have been delivered":

  Select(Status = "Delivered")
    Join(Orders.TrackNo = UPS.TrackNo)
      Read Orders    Read UPS

  • Control flow
    – Iterator (top-down)
      • Most common database model
      • Easier to implement
    – Data-driven (bottom-up)
      • Threads or external scheduling
      • Better concurrency
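As a toy illustration of the iterator (pull-based) model, each operator can be written as a generator that asks its child for the next tuple; pulling on the root drives the whole tree. This is a sketch, not Tukwila's engine; the data mirrors the Orders/UPS example:

```python
# Pull-based (iterator, top-down) operators as Python generators.
# Operator names and signatures are illustrative.

def scan(table):
    for row in table:
        yield row

def select(child, pred):
    for row in child:
        if pred(row):
            yield row

def hash_join(build, probe, key):
    # classic asymmetric hash join: consume the whole build side first,
    # then stream (pipeline) the probe side
    table = {}
    for row in build:
        table.setdefault(row[key], []).append(row)
    for row in probe:
        for match in table.get(row[key], []):
            yield {**match, **row}

orders = [{"OrderNo": 1234, "TrackNo": "01-23-45"},
          {"OrderNo": 1235, "TrackNo": "02-90-85"}]
ups = [{"TrackNo": "01-23-45", "Status": "In Transit"},
       {"TrackNo": "02-90-85", "Status": "Delivered"}]

# "Show which orders have been delivered"
plan = select(hash_join(scan(orders), scan(ups), "TrackNo"),
              lambda r: r["Status"] == "Delivered")
result = list(plan)
# result == [{"OrderNo": 1235, "TrackNo": "02-90-85", "Status": "Delivered"}]
```

A data-driven (bottom-up) engine would instead have the leaves push tuples upward as they arrive, typically via threads or a scheduler.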

Slide 19

Tukwila Plans & Execution

  • Multiple fragments ending at materialization points
  • Rules triggered by events
    – Re-optimize the remainder if necessary
    – Return statistics

Example rule over the previous plan, with its pieces labeled (1), (2), (3):

  when closed(1) if size_of(Orders) > 1000 then reoptimize {2, 3}

Slide 20

Performance evaluation

Interleaving Planning and Execution

Tukwila's strategy of interleaving planning and execution can slash the total time spent processing a query, with a total speedup of 1.42 over the pipelined strategy and 1.69 over the naïve strategy of materializing intermediate results.

RAP3

Slide 21

Slide 18, RAP3: Given time constraints, I'd cut this slide.

Rachel Pottinger, 2/20/2006

Slide 22

Adaptive Query Operators: Collectors

  • Overlap issues

    A data integration system needs to perform a union over a large number of overlapping sources. However, a standard union operator has no mechanism for handling errors, or for deciding to ignore a slow mirror data source once it has obtained the full data set.

  [Figure: sources A, B, and C (with Mirror_C) feeding the union QA ∪ QB ∪ QC]

Slide 23

Collectors (cont.)

  • Collectors deal with these problems by using policies!
  • A collector operator = a set of children (wrapper calls, local data, and so on) + a policy for contacting them
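The children-plus-policy idea can be sketched as follows; the specific policy (fall back to a mirror on error, drop duplicate tuples from overlapping sources) is an illustrative assumption, not Tukwila's actual policy language:

```python
# Sketch of a collector operator: a union over child sources plus a
# policy for contacting them. Names and the policy are illustrative.

def collector(children, mirrors=None):
    """children: list of (name, callable) pairs; each callable returns
    that source's tuples or raises OSError on failure."""
    mirrors = mirrors or {}
    seen = set()
    for name, source in children:
        try:
            rows = source()
        except OSError:
            mirror = mirrors.get(name)
            if mirror is None:
                continue            # policy choice: skip an unreachable source
            rows = mirror()         # policy choice: retry via the mirror
        for row in rows:
            if row not in seen:     # overlapping sources: emit each tuple once
                seen.add(row)
                yield row

def source_a():
    return ["t1", "t2"]

def source_c():
    raise OSError("source C unreachable")

def mirror_c():
    return ["t2", "t3"]

result = list(collector([("A", source_a), ("C", source_c)],
                        mirrors={"C": mirror_c}))
# result == ["t1", "t2", "t3"]
```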

RAP4

Slide 24

Slide 20, RAP4: Again, considering time constraints, consider cutting slides 19 and 20.

Rachel Pottinger, 2/20/2006

Slide 25

Collectors (cont.)

A complex policy example:

  [Figure: Tukwila contacting sources A and B under such a policy]

Slide 26

Adaptive Query Operators: Double Pipelined Join

Conventional joins:

  • Sort-merge joins & indexed joins
    – cannot be pipelined
  • Nested loops joins and hash joins
    – follow an asymmetric execution model

For nested loops joins, we must wait for the entire inner table to be transmitted before pipelining begins. For hash joins, we must load the entire inner relation into a hash table before we can pipeline.

Slide 27

Double Pipelined Hash Join

  • Proposed for parallel main-memory databases (Wilschut 1990)
    – One hash table per source
    – As a tuple comes in, add it to its own hash table and probe the opposite table
  • Evaluation:
    – Results as soon as tuples are received
    – Symmetric
    – Requires memory for two hash tables
  • But data-driven!
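A minimal single-threaded sketch of the insert-then-probe behavior: one hash table per input, and every arriving tuple immediately probes the opposite table, so matches stream out as soon as both halves have arrived. The real operator is data-driven (threaded); the interleaved `arrivals` stream stands in for network arrival order here:

```python
# Sketch of the double pipelined (symmetric) hash join.
# The merged arrival stream is an illustrative simplification.

def double_pipelined_join(arrivals, key):
    """arrivals yields ("L", row) or ("R", row) in arrival order."""
    left_tab, right_tab = {}, {}
    for side, row in arrivals:
        own, other = (left_tab, right_tab) if side == "L" else (right_tab, left_tab)
        own.setdefault(row[key], []).append(row)        # insert into own table
        for match in other.get(row[key], []):           # probe the opposite one
            yield {**match, **row}

arrivals = [
    ("L", {"OrderNo": 1234, "TrackNo": "01-23-45"}),
    ("R", {"TrackNo": "01-23-45", "Status": "In Transit"}),  # joins immediately
    ("R", {"TrackNo": "02-90-85", "Status": "Delivered"}),
    ("L", {"OrderNo": 1235, "TrackNo": "02-90-85"}),         # joins immediately
]
results = list(double_pipelined_join(arrivals, "TrackNo"))
# each result is produced the moment its partner tuple arrives
```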
Slide 28

[Figure: double pipelined join on Orders.TrackNo = UPS.TrackNo, with one hash table per input (Hash Table (Orders), Hash Table (UPS)); the entry 01-23-45 is highlighted]

Orders (received so far):
  OrderNo   TrackNo
  1234      01-23-45
  1235      02-90-85
  1399      02-90-85
  ...

UPS (received so far):
  TrackNo   Status
  01-23-45  In Transit
  02-90-85  Delivered
  03-99-10  Delivered
  ...

Slide 29

Double-Pipelined Join Adapted to Iterator Model

  • Use multiple threads with queues (QA, QB)
    – Each child (A or B) reads tuples until its queue is full, then sleeps & awakens the parent
    – The join sleeps until awakened, then:
      • joins tuples from QA or QB, returning all matches as output
      • wakes the owner of the queue

  [Figure: Join operator over children A and B, buffered by queues QA and QB]

Slide 30

Performance Evaluation: Double Pipelined Hash Join

RAP5

Slide 31

Slide 26, RAP5: Again, consider cutting due to time constraints.

Rachel Pottinger, 2/20/2006

Slide 32

Insufficient Memory?

  • May not be able to fit the hash tables in RAM
  • Strategy for a standard hash join:
    – Swap some buckets to overflow files
    – As new tuples arrive for those buckets, write them to the files
    – After the current phase, clear memory and repeat the join on the overflow files
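The overflow strategy can be sketched as follows; plain lists stand in for overflow files, and the bucket count and memory limit are illustrative assumptions:

```python
# Sketch of a hash join with bucket overflow: buckets that don't fit in
# memory are "swapped" out and joined in a later phase.

def hash_join_with_overflow(build, probe, key, mem_buckets=2, n_buckets=4):
    bucket = lambda row: hash(row[key]) % n_buckets
    in_memory = {}                                  # build rows for resident buckets
    spilled_build = {b: [] for b in range(mem_buckets, n_buckets)}
    spilled_probe = {b: [] for b in range(mem_buckets, n_buckets)}

    for row in build:                               # build phase
        b = bucket(row)
        if b < mem_buckets:
            in_memory.setdefault(row[key], []).append(row)
        else:
            spilled_build[b].append(row)            # "write to overflow file"

    out = []
    for row in probe:                               # probe phase
        b = bucket(row)
        if b < mem_buckets:
            out += [{**m, **row} for m in in_memory.get(row[key], [])]
        else:
            spilled_probe[b].append(row)            # defer until memory is free

    for b in range(mem_buckets, n_buckets):         # later phases: clear memory,
        table = {}                                  # repeat the join per partition
        for row in spilled_build[b]:
            table.setdefault(row[key], []).append(row)
        for row in spilled_probe[b]:
            out += [{**m, **row} for m in table.get(row[key], [])]
    return out

build = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}, {"k": 3, "a": "z"}]
probe = [{"k": 1, "b": "p"}, {"k": 3, "b": "q"}, {"k": 4, "b": "r"}]
out = hash_join_with_overflow(build, probe, "k")    # joins k=1 and k=3
```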
Slide 33

Conclusions

  • General Tukwila architecture
  • Non-conventional characteristics of Tukwila
  • Interleaving of optimization and execution
  • Main idea of the collector operator
  • Double pipelined hash join