SEQUENCE QUERY PROCESSING Praveen Seshadri, Miron Livny, Raghu - - PowerPoint PPT Presentation

sequence query processing
SMART_READER_LITE
LIVE PREVIEW

SEQUENCE QUERY PROCESSING Praveen Seshadri, Miron Livny, Raghu - - PowerPoint PPT Presentation

SEQUENCE QUERY PROCESSING Praveen Seshadri, Miron Livny, Raghu Ramakrishnan (CS Department, University of Wisconsin-Madison, WI) Sedat Behar and Yevgeny Ioffe REFERENCES Praveen Sheshadri, Management of Sequence Data (PhD Thesis)


slide-1
SLIDE 1

SEQUENCE QUERY PROCESSING

Praveen Seshadri, Miron Livny, Raghu Ramakrishnan (CS Department, University of Wisconsin-Madison, WI)

Sedat Behar and Yevgeny Ioffe

slide-2
SLIDE 2

REFERENCES

  • Praveen Sheshadri, Management of Sequence

Data (PhD Thesis)

  • R.Ramakrishnan, M.Chend, M.Livny,

P.Sheshadri. What’s Next? Sequence Queries

  • P.Sheshadri, M.Livny, R.Ramakrishnan. SEQ: A

Model for Sequence Databases

  • H.Gunadhi, A.Segev. Event-join optimization in

temporal relational databases

slide-3
SLIDE 3

WHY? (Difficulties with expressing Sequences)

  • Relational Databases:

– Data collections treated as sets not sequences

  • Difficult to evaluate in SQL

– Data model does not help evaluation

  • Hard to optimize:

– scan two sequences in lock-step

slide-4
SLIDE 4

Query: Find the Volcano that caused an earthquake

  • f magnitude 7.0 or greater

SELECT V.name FROM Volcanos V, Earthquakes E WHERE E.Strength > 7.0 AND E.time = (SELECT max(E1.time) FROM Earthquakes E1 WHERE E1.time < V.time) It is difficult and not efficient!

slide-5
SLIDE 5

SEQUENCE MODEL

Definitions:

  • Record: <A1:T1, A2:T2,…, An:Tn>
  • Attribute: data type (int, String…)
  • Type: an instance in the type domain (Null associated)
  • Position: location of an entry in the record
  • Position Ordering: function that returns the position (index)
  • Type Domain: ( T1 x T2 x …x Tn )

Example of a Record: <Name:String, HeartBeat:Int, BloodPres:Int …>

slide-6
SLIDE 6

SEQUENCE MODEL

  • Types of Sequences:

BASE SEQUENCES: some positions map to some records CONSTANT SEQUENCES: every position maps to the same unique record

Ex: Perfect Health (Mr.Perfect, 80, 80/120)

DERIVED SEQUENCES: defined by a sequence operator

slide-7
SLIDE 7

SEQUENCE MODEL - OPERATORS

  • All operators are compositional: produce a single derived

sequence from 2 sequences

  • Simple Unary Operators:

– Selection: similar to Relational Databases – Projection: similar to Relational Databases – Positional Offset: shifts input sequence by offset – Value Offset: shifts the non-null entries by offset

  • Aggregate Unary Operators

– agg_pos(i): selects set P of positions for each position i – agg_func: aggregate function over records in input stream at positions p in P

  • Compose Operator (positional join operator)

– binary operator composing records r1 and r2 of two input sequences at each position i (Null exception)

slide-8
SLIDE 8

Sequence Queries:

  • A sequence query is an

acyclic graph of operators (just like Aurora!)

  • Output of query is output

sequence

  • No output can be input to

more than 1 operator ! graph is a tree

SEQUENCE MODEL

Project V.name compose

v E

Select STR>7.0 previous Earthquakes Volcanos

slide-9
SLIDE 9
  • important for optimization and evaluation!
  • Operator can be described by 2 functions:

– Scope: defines the positions of input records to look at – OpFunc: an operator function that actually works with input records defined by Scope to define the output sequence

SCOPE OF OPERATOR

slide-10
SLIDE 10
  • Operator properties:

– Scope size at position i – Scope sequentiality – Scope relativity at position i

  • Complex operator

– Is an acyclic composition of basic operators – Properties of basic operators determine its property (fixed scope; sequential scope; relative scope)

OPERATOR PROPERTIES

slide-11
SLIDE 11

QUERY OPTIMIZATION - TRANSFORMATIONS

  • Transform declarative query into equivalent one
  • Two sequence queries are equivalent if both have

same:

– input sequences – scopes on input sequences – operator function

  • Equivalence is independent of actual data in input

sequences

slide-12
SLIDE 12
  • Alter a sub-query but not the entire graph
  • Incorrect transformations
  • Good idea to propagate selections, projections,

and positional offsets as far down as possible

  • Non-unit scope operators ! break query into

blocks

QUERY OPTIMIZATION – TRANSFORMATIONS

slide-13
SLIDE 13

GLOBAL SPAN OPTIMIZATION

  • reduce query processing costs by restricting the span
  • f a sequence based on span of other sequences

! can modify span of output based on input and vice versa!

DEC HP IBM 200…350 #4.price 200…350 1…350

#4 #1#2

200…500 #1.close>#2.close PROJECT COMPOSE COMPOSE

#1 #2

1…750 200…500 DEC HP IBM 200…350 #4.price 200…350 200…350

#4 #1#2

200…350 #1.close>#2.close PROJECT COMPOSE COMPOSE

#1 #2

200…350 200…350 DEC HP IBM 200…350 #4.price 200…350 200…350

#4 #1#2

200…350 #1.close>#2.close PROJECT COMPOSE COMPOSE

#1 #2

200…350

slide-14
SLIDE 14
  • Concepts: Span and Density
  • A. B.
  • Utilize span first then density to reduce the workload on join

META-INFORMATION

Close>25 Selectivity=0.5 Density=0.7 DEC 200…350 IBM 200…350 COMPOSE Density=0.7 Width=20 #1 Density=0.95 Width=100 200…350 IBM 200…350 COMPOSE #1 Density=0.48 Close>25 Density=0.95 200…350 DEC 200…350 Selectivity=0.5

slide-15
SLIDE 15

A. B.

  • Join Strategy-A – two ways of doing it

– analogous to NLJ

  • Join Strategy-B: Stream both in lock-step
  • “stream access” = get the next non-Null record
  • “probed access” = get the record at a specific position

ACCESS MODES

DEC 200…350 IBM 200…350 COMPOSE Density=0.7 Width=20 #1 Density=0.95 Width=100 200…350 IBM 200…350 COMPOSE #1 Density=0.48 Close>25 Density=0.95 200…350 DEC 200…350 Selectivity=0.5 Selectivity=0.5 Close>25 Density=0.7

slide-16
SLIDE 16

CACHING OF DERIVED SEQUENCES

  • Cache-Strategy-A:

– Cache last 6 values of sequence #1 (figure on left) – If scope is large or variable ! may not be feasible to cache whole scope

  • Cache-Strategy-B (incremental cache strategy):

– Cache the value of #4 at previous position ! then the record at some position p is either cached record at previous position OR it’s the non-Null record from #3 at previous position

PROJECT PREVIOUS #1.close>#2.close PROJECT close SUM [pos, pos-5] IBM #1 close SUM [pos, pos-5] IBM #1 IBM HP #4 #3 COMPOSE #2 #1 HP #4 #3 COMPOSE #2 #1

slide-17
SLIDE 17

Sequences may be queried for

  • specific positions
  • range of positions

Plan Generation Algorithm:

SQG: signifies that there might be a derived sequence represented below, not necessarily a base sequence. Start: initiates query evaluation by invoking stream access on its input

QUERY PLAN GENERATION

start compose Position Sequence SQG

slide-18
SLIDE 18

QUERY PLAN GENERATION

Step 1- Query Specification Step 2- Meta-Information Propagation:

  • Bottom-up
  • Top-down

Step 3- Query Transformation Step 4- Identification of Query Blocks Step 5- Block-wise Plan Generation Step 6- Plan Selection

slide-19
SLIDE 19

Access costs to Base Sequences:

size of valid range, density of sequence, access paths available

Costs: Blocks with Non-Unit Scope: (Aggregate & Value Offset)

–If stream access,

  • cost = stream cost of input + cost of storing in cache each

record + cost of cache access + computational cost

–If probed access,

  • cost = probed access cost × operator scope size

QUERY PLAN GENERATION

Costs: Blocks with Positional Joins:

Stream Access: Stream access on stream-1 and probe on stream-2; or converse

  • r stream access on both streams

Probed Access: Access stream-1 in probed fashion and for every record join stream- 2 in probed fashion

Algorithmic Analysis: left-deep trees, time and space complexity

slide-20
SLIDE 20

Extensions

  • To Model:

General Sequences, Ordering Domains, Multiple Orderings, Sequence Groupings

  • To Queries:

Generalized Query Graph, Correlated Queries

  • To Framework:

Optimization Framework, Materialization of Derived Sequences, Optimizations on base sequences (sorting)

slide-21
SLIDE 21

Related Work

  • Compare to TS-based models
  • Aurora, Stream-based systems
slide-22
SLIDE 22

CONCLUSIONS/Things to remember

  • Two important things:

scope and query generation plans

Discussion Questions

  • How can this algorithm extend (relate) to Aurora or Atlas?
  • How is operator scope extended/implemented in CQL?