An Adaptive Query Execution Engine for Data Integration


Slide 1

An Adaptive Query Execution Engine for Data Integration

Zachary Ives, Daniela Florescu, Marc Friedman, Alon Levy, Daniel S. Weld University of Washington

Presented by Peng Li@CS.UBC

Slide 2

Outline

  • The Background of Data Integration Systems
  • Tukwila Architecture
  • Interleaving of planning and execution
  • Adaptive Query Operators
    – Collectors & the Double Pipelined Join
  • Performance
Slide 3

Background (Data Integration Systems)

A data integration system provides uniform access to multiple autonomous (the system cannot affect the behavior of the sources) and heterogeneous (different data models and schemas) data sources.

Slide 4

The key goal of DISs:

  • "Free users from having to locate the sources relevant to their query, interact with each source independently, and manually combine the data from the different sources"

Slide 5

The main challenges in the design of DISs:

  • Query reformulation
  • The construction of wrapper programs
  • Query optimizers and efficient query execution engines

Slide 6

Motivations:

  • Little information is available for cost estimates
  • Unpredictable data transfer rates
  • Unreliable, overlapping sources
  • Users want initial results quickly
  • Network bandwidth generally constrains the data sources to be smaller than in traditional database applications

⇒ The engine must be adaptive.

Slide 7

Tukwila Architecture

  • 1. a semantic description of the contents of the data sources;
  • 2. overlap information about pairs of data sources;
  • 3. key statistics about the data, such as the cost of accessing each source, and so on.

Slide 8

Novel Features of Tukwila

  • Interleaving of planning and execution
    – Compensates for lack of information
  • Event-condition-action rules
    – Decide when and how to modify the implementation of certain operators at runtime, if needed
    – Detect opportunities for re-optimization
  • Manages overlapping data sources (collectors)
  • Tolerant of latency (double pipelined join)
    – Returns initial results quickly

Slide 9

Interleaving of planning and execution

The non-traditional characteristics of Tukwila are as follows:

  – The optimizer may create only a partial plan if essential statistics are missing or uncertain.
  – The optimizer generates not only operator trees but also the appropriate event-condition-action rules.
  – The optimizer conserves the state of its search space when it calls the execution engine.

RAP1

Slide 10

Slide 9, RAP1:

  • 1. Too wordy.
  • 2. You mean "characteristics" rather than "characters".

Rachel Pottinger, 2/20/2006

Slide 11

Overview of the query plan structure

  • A plan includes a partially-ordered set of fragments and a set of global rules.
  • A fragment consists of a fully pipelined tree of physical operators and a set of local rules.

Why does the system need the fragment structure?

The fragment structure is the key mechanism for implementing the adaptive property: at the end of each fragment, the rest of the plan can be re-optimized or rescheduled.
Slide 12

Rules

  • Re-optimization
    If the optimizer's cardinality estimate for the fragment's result is significantly different from the actual size, re-invoke the optimizer.

  • Contingent planning
    The execution engine checks properties of the result to select the next fragment.

  • Rescheduling
    Reschedule if a source times out.

  • Adaptive operators
Slide 13

Rule format

  when event if condition then actions

For example:

  when closed(frag1) if card(join1) > 2 * est_card(join1) then replan

An event triggers a rule, causing it to check its condition. If the condition is true, the rule fires, executing its action(s).
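As a rough illustration (not Tukwila's actual implementation), the trigger-check-fire cycle above can be sketched in a few lines; the `Rule`/`RuleEngine` classes and the state keys (`card_join1`, `est_card_join1`, `replan`) are invented for this example:

```python
# Sketch of the "when <event> if <condition> then <actions>" rule cycle.
# All names here are illustrative, not Tukwila's real API.

class Rule:
    def __init__(self, event, condition, action):
        self.event = event          # event name that triggers the rule
        self.condition = condition  # predicate over execution-engine state
        self.action = action        # callable run when the rule fires

class RuleEngine:
    def __init__(self):
        self.rules = []

    def register(self, rule):
        self.rules.append(rule)

    def signal(self, event, state):
        """An event triggers matching rules; each checks its condition,
        and if the condition is true the rule fires its action(s)."""
        fired = []
        for rule in self.rules:
            if rule.event == event and rule.condition(state):
                rule.action(state)
                fired.append(rule)
        return fired

# "when closed(frag1) if card(join1) > 2*est_card(join1) then replan"
engine = RuleEngine()
engine.register(Rule(
    event="closed(frag1)",
    condition=lambda s: s["card_join1"] > 2 * s["est_card_join1"],
    action=lambda s: s.update(replan=True),
))

state = {"card_join1": 500, "est_card_join1": 200, "replan": False}
fired = engine.signal("closed(frag1)", state)   # 500 > 400, so the rule fires
```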

RAP2

Slide 14

Slide 12, RAP2: Given time constraints, I'd cut slides 12 & 13.

Rachel Pottinger, 2/20/2006

Slide 15

Events

  • open, closed: fragment/operator starts or completes
  • error: operator failure, e.g., unable to contact source
  • timeout(n): data source has not responded in n msec
  • out-of-memory: join has insufficient memory

Conditions

  • state(operator): the operator's current state
  • card(operator): the number of tuples produced so far
  • time(operator): the time waiting since the last tuple
  • memory(operator): the memory used so far

Actions

  • set the overflow method for a double pipelined join
  • alter a memory allotment
  • deactivate an operator or fragment, which stops its execution and deactivates its associated rules
  • reschedule the query operator tree
  • re-optimize the plan
  • return an error to the user

Slide 16

Group Discussion

  • Consider one of the following motivating situations for Tukwila:
    – Absence of statistics
    – Unpredictable data arrival characteristics
    – Overlap and redundancy among sources
    – Optimizing the time to initial answers
  • Q1: Can you give some examples where the chosen topic matters?
  • Q2: If you were a member of the Tukwila team, what rules or policy would you use to deal with the problem?
    – To help discussion, more specific situations will be given
    – But you may assume any problem or situation
  • Discussion
    – Form 8 groups (3–4 people per group, two teams per topic)
    – Discuss Q1 and Q2 for one topic (5–7 minutes)

Slide 17

Examples

Join on Orders.TrackNo = UPS.TrackNo (Orders, UPS):

Orders:
  OrderNo   TrackNo
  1234      01-23-45
  1235      02-90-85
  1399      02-90-85
  1500      03-99-10

UPS:
  TrackNo   Status
  01-23-45  In Transit
  02-90-85  Delivered
  03-99-10  Delivered
  04-08-30  Undeliverable

Result:
  OrderNo   TrackNo   Status
  1234      01-23-45  In Transit
  1235      02-90-85  Delivered
  1399      02-90-85  Delivered
  1500      03-99-10  Delivered

Slide 18

Query Plan Execution

A query plan is represented as a data-flow tree. Example query, "Show which orders have been delivered":

  Select(Status = "Delivered")
    Join(Orders.TrackNo = UPS.TrackNo)
      Read Orders    Read UPS

  • Control flow
    – Iterator (top-down)
      • Most common database model
      • Easier to implement
    – Data-driven (bottom-up)
      • Threads or external scheduling
      • Better concurrency
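As a toy illustration of the iterator (pull-based) model, each operator can be written as a generator that asks its child for the next tuple; pulling on the root drives the whole tree. This is a sketch, not Tukwila's engine; the data mirrors the Orders/UPS example:

```python
# Pull-based (iterator, top-down) operators as Python generators.
# Operator names and signatures are illustrative.

def scan(table):
    for row in table:
        yield row

def select(child, pred):
    for row in child:
        if pred(row):
            yield row

def hash_join(build, probe, key):
    # classic asymmetric hash join: consume the whole build side first,
    # then stream (pipeline) the probe side
    table = {}
    for row in build:
        table.setdefault(row[key], []).append(row)
    for row in probe:
        for match in table.get(row[key], []):
            yield {**match, **row}

orders = [{"OrderNo": 1234, "TrackNo": "01-23-45"},
          {"OrderNo": 1235, "TrackNo": "02-90-85"}]
ups = [{"TrackNo": "01-23-45", "Status": "In Transit"},
       {"TrackNo": "02-90-85", "Status": "Delivered"}]

# "Show which orders have been delivered"
plan = select(hash_join(scan(orders), scan(ups), "TrackNo"),
              lambda r: r["Status"] == "Delivered")
result = list(plan)
# result == [{"OrderNo": 1235, "TrackNo": "02-90-85", "Status": "Delivered"}]
```

A data-driven (bottom-up) engine would instead have the leaves push tuples upward as they arrive, typically via threads or a scheduler.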

Slide 19

Tukwila Plans & Execution

  • Multiple fragments ending at materialization points
  • Rules triggered by events
    – Re-optimize the remainder if necessary
    – Return statistics

Example rule over the previous plan, with its pieces labeled (1), (2), (3):

  when closed(1) if size_of(Orders) > 1000 then reoptimize {2, 3}

Slide 20

Performance evaluation

Interleaving Planning and Execution

Tukwila's strategy of interleaving planning and execution can slash the total time spent processing a query, with a total speedup of 1.42 over the pipelined strategy and 1.69 over the naïve strategy of materializing intermediate results.

RAP3

Slide 21

Slide 18, RAP3: Given time constraints, I'd cut this slide.

Rachel Pottinger, 2/20/2006

Slide 22

Adaptive Query Operators: Collectors

  • Overlap issues

    A data integration system needs to perform a union over a large number of overlapping sources. However, a standard union operator has no mechanism for handling errors, or for deciding to ignore a slow mirror data source once it has obtained the full data set.

  [Figure: sources A, B, and C (with Mirror_C) feeding the union QA ∪ QB ∪ QC]

Slide 23

Collectors (cont.)

  • Collectors deal with these problems by using policies!
  • A collector operator = a set of children (wrapper calls, local data, and so on) + a policy for contacting them
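The children-plus-policy idea can be sketched as follows; the specific policy (fall back to a mirror on error, drop duplicate tuples from overlapping sources) is an illustrative assumption, not Tukwila's actual policy language:

```python
# Sketch of a collector operator: a union over child sources plus a
# policy for contacting them. Names and the policy are illustrative.

def collector(children, mirrors=None):
    """children: list of (name, callable) pairs; each callable returns
    that source's tuples or raises OSError on failure."""
    mirrors = mirrors or {}
    seen = set()
    for name, source in children:
        try:
            rows = source()
        except OSError:
            mirror = mirrors.get(name)
            if mirror is None:
                continue            # policy choice: skip an unreachable source
            rows = mirror()         # policy choice: retry via the mirror
        for row in rows:
            if row not in seen:     # overlapping sources: emit each tuple once
                seen.add(row)
                yield row

def source_a():
    return ["t1", "t2"]

def source_c():
    raise OSError("source C unreachable")

def mirror_c():
    return ["t2", "t3"]

result = list(collector([("A", source_a), ("C", source_c)],
                        mirrors={"C": mirror_c}))
# result == ["t1", "t2", "t3"]
```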

RAP4

Slide 24

Slide 20, RAP4: Again, considering time constraints, consider cutting slides 19 and 20.

Rachel Pottinger, 2/20/2006

Slide 25

Collectors (cont.)

A complex policy example:

  [Figure: Tukwila contacting sources A and B under such a policy]

Slide 26

Adaptive Query Operators: Double Pipelined Join

Conventional joins:

  • Sort-merge joins & indexed joins
    – cannot be pipelined
  • Nested loops joins and hash joins
    – follow an asymmetric execution model

For nested loops joins, we must wait for the entire inner table to be transmitted before pipelining begins. For hash joins, we must load the entire inner relation into a hash table before we can pipeline.

Slide 27

Double Pipelined Hash Join

  • Proposed for parallel main-memory databases (Wilschut 1990)
    – One hash table per source
    – As a tuple comes in, add it to its own hash table and probe the opposite table
  • Evaluation:
    – Results as soon as tuples are received
    – Symmetric
    – Requires memory for two hash tables
  • But data-driven!
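A minimal single-threaded sketch of the insert-then-probe behavior: one hash table per input, and every arriving tuple immediately probes the opposite table, so matches stream out as soon as both halves have arrived. The real operator is data-driven (threaded); the interleaved `arrivals` stream stands in for network arrival order here:

```python
# Sketch of the double pipelined (symmetric) hash join.
# The merged arrival stream is an illustrative simplification.

def double_pipelined_join(arrivals, key):
    """arrivals yields ("L", row) or ("R", row) in arrival order."""
    left_tab, right_tab = {}, {}
    for side, row in arrivals:
        own, other = (left_tab, right_tab) if side == "L" else (right_tab, left_tab)
        own.setdefault(row[key], []).append(row)        # insert into own table
        for match in other.get(row[key], []):           # probe the opposite one
            yield {**match, **row}

arrivals = [
    ("L", {"OrderNo": 1234, "TrackNo": "01-23-45"}),
    ("R", {"TrackNo": "01-23-45", "Status": "In Transit"}),  # joins immediately
    ("R", {"TrackNo": "02-90-85", "Status": "Delivered"}),
    ("L", {"OrderNo": 1235, "TrackNo": "02-90-85"}),         # joins immediately
]
results = list(double_pipelined_join(arrivals, "TrackNo"))
# each result is produced the moment its partner tuple arrives
```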
Slide 28

[Figure: double pipelined join on Orders.TrackNo = UPS.TrackNo, with one hash table per input (Hash Table (Orders), Hash Table (UPS)); the entry 01-23-45 is highlighted]

Orders (received so far):
  OrderNo   TrackNo
  1234      01-23-45
  1235      02-90-85
  1399      02-90-85
  ...

UPS (received so far):
  TrackNo   Status
  01-23-45  In Transit
  02-90-85  Delivered
  03-99-10  Delivered
  ...

Slide 29

Double-Pipelined Join Adapted to Iterator Model

  • Use multiple threads with queues (QA, QB)
    – Each child (A or B) reads tuples until its queue is full, then sleeps & awakens the parent
    – The join sleeps until awakened, then:
      • joins tuples from QA or QB, returning all matches as output
      • wakes the owner of the queue

  [Figure: Join operator over children A and B, buffered by queues QA and QB]

Slide 30

Performance Evaluation: Double Pipelined Hash Join

RAP5

Slide 31

Slide 26, RAP5: Again, consider cutting due to time constraints.

Rachel Pottinger, 2/20/2006

Slide 32

Insufficient Memory?

  • May not be able to fit the hash tables in RAM
  • Strategy for a standard hash join:
    – Swap some buckets to overflow files
    – As new tuples arrive for those buckets, write them to the files
    – After the current phase, clear memory and repeat the join on the overflow files
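The overflow strategy can be sketched as follows; plain lists stand in for overflow files, and the bucket count and memory limit are illustrative assumptions:

```python
# Sketch of a hash join with bucket overflow: buckets that don't fit in
# memory are "swapped" out and joined in a later phase.

def hash_join_with_overflow(build, probe, key, mem_buckets=2, n_buckets=4):
    bucket = lambda row: hash(row[key]) % n_buckets
    in_memory = {}                                  # build rows for resident buckets
    spilled_build = {b: [] for b in range(mem_buckets, n_buckets)}
    spilled_probe = {b: [] for b in range(mem_buckets, n_buckets)}

    for row in build:                               # build phase
        b = bucket(row)
        if b < mem_buckets:
            in_memory.setdefault(row[key], []).append(row)
        else:
            spilled_build[b].append(row)            # "write to overflow file"

    out = []
    for row in probe:                               # probe phase
        b = bucket(row)
        if b < mem_buckets:
            out += [{**m, **row} for m in in_memory.get(row[key], [])]
        else:
            spilled_probe[b].append(row)            # defer until memory is free

    for b in range(mem_buckets, n_buckets):         # later phases: clear memory,
        table = {}                                  # repeat the join per partition
        for row in spilled_build[b]:
            table.setdefault(row[key], []).append(row)
        for row in spilled_probe[b]:
            out += [{**m, **row} for m in table.get(row[key], [])]
    return out

build = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}, {"k": 3, "a": "z"}]
probe = [{"k": 1, "b": "p"}, {"k": 3, "b": "q"}, {"k": 4, "b": "r"}]
out = hash_join_with_overflow(build, probe, "k")    # joins k=1 and k=3
```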
Slide 33

Conclusions

  • General Tukwila architecture
  • Non-conventional characteristics of Tukwila
  • Interleaving of optimization and execution
  • Main idea of the collector operator
  • Double pipelined hash join