1

Online Query Processing

A Tutorial

Peter J. Haas

IBM Almaden Research Center

Joseph M. Hellerstein

UC Berkeley

2

Goals for Today

  • Exposure to online query processing algorithms and fundamentals

  • Usage examples
  • Basic sampling techniques and estimators
  • Preferential data delivery
  • Online join algorithms
  • Relation to OLAP, etc.
  • Some thoughts on research directions
  • More resources to appear on the web
  • Annotated bibliography
  • Extended slide set
  • Survey paper

3

Road Map

  • Background and motivation
  • Human-computer interaction
  • Tech trends and prognostications
  • Goals for online processing
  • Examples of online techniques
  • Underlying technology
  • Related work
  • Looking forward

4

Human-Computer Interaction

  • Iterative querying with progressive refinement
  • Real-time interaction (impatience!)
  • Spreadsheets, WYSIWYG editors
  • Modern statistics packages
  • Netscape STOP button
  • Visually-oriented interface
  • Approximate results are usually OK

[Screenshot: a numeric table display]

Time     System 1   System 2   System 3
3.0000   5.46571    6.87658    10.3343
2.0000   4.54673    6.56784    8.6562
1.0000   3.01325    4.32445    7.5654

vs.

[a visually-oriented display of the same data]

5

Disk Appetite

  • Greg Papadopoulos, CTO Sun:
  • "Moore's Law Ain't Good Enough" (Hot Chips '98)

[Chart: worldwide disk sales (petabytes) vs. Moore's Law, 1988-2000]

Source: J. Porter, Disk/Trend, Inc. http://www.disktrend.com/pdf/portrpkg.pdf

6

The Latest Commercial Technology


7

Drawbacks of Current Technology

  • Only exact answers are available
  • A losing proposition as data volume grows
  • Hardware improvements not sufficient
  • Interactive systems fail on massive data
  • E.g., spreadsheet programs (64K row limit)
  • DBMS not interactive
  • No user feedback or control ("back to the 60's")
  • Long processing times
  • Fundamental mismatch with preferred modes of HCI
  • OLAP: a partial solution
  • Can't handle ad hoc queries or data sets

8

Goals for Online Processing

  • New "greedy" performance regime
  • Maximize 1st derivative of the "mirth index"
  • Mirth defined on-the-fly
  • Therefore need FEEDBACK and CONTROL

[Chart: answer completeness (up to 100%) vs. time, online vs. traditional processing]

9

Road Map

  • Background and motivation
  • Examples of online techniques
  • Aggregation, visualization, cleaning/browsing
  • Underlying technology
  • Related work
  • Looking forward

10

Online Aggregation

  • SELECT AVG(temp) FROM t GROUP BY site
  • 330K rows in table
  • the exact answer:

11

Online Aggregation, cont'd

  • A simple online aggregation interface (after 74 rows)

12

Online Aggregation, cont'd

  • After 834 rows:

13

Example: Online Aggregation

Additional Features:

Speed up, Slow down, Terminate

14

Online Data Visualization

  • In Tioga DataSplash

15

Online Enumeration

  • Potter's Wheel [VLDB 2001]
  • Scalable spreadsheet
  • A fraction of data is materialized in GUI widget
  • Scrolling = preference for data delivery in a quantile
  • Permits "fuzzy" querying
  • Interactive data cleaning
  • Online structure and discrepancy detection
  • Online aggregation

16

Scalable Spreadsheets

17

Visual Transformation Shot

18


19

Road Map

  • Background and motivation
  • Examples of online techniques
  • Underlying technology
  • Building blocks: sampling, estimation
  • Preferential data delivery
  • Pipelined adaptive processing algorithms
  • Related work
  • Looking forward

20

Sampling – Design Issues

  • Granularity of sample
  • Instance-level (row-level): high I/O cost
  • Block-level (page-level): high variability from clustering
  • Type of sample
  • Often simple random sample (SRS)
  • Especially for on-the-fly
  • With/without replacement usually not critical
  • Data structure from which to sample
  • Files or relational tables
  • Indexes (B+ trees, etc.)

21

Row-level Sampling Techniques

  • Maintain file in random order
  • Sampling = scan
  • Is file initially in random order?
  • Statistical tests needed: e.g., runs test, Smirnov test
  • In DB systems: cluster via RAND function
  • Must "freshen" ordering (online reorg)
  • On-the-fly sampling
  • Via index on "random" column
  • Else get random page, then row within page
  • Ex: extent-map sampling
  • Problem: variable number of records on page

22

Acceptance/Rejection Sampling

  • Accept row on page i with probability = ni / nMAX
  • Commonly used in other settings
  • E.g. sampling from joins
  • E.g. sampling from indexes

[Diagram: original pages with variable row counts vs. pages conceptually padded to nMAX rows]
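
The acceptance/rejection rule can be sketched in a few lines. This is a hedged illustration, not the tutorial's code: pages are modeled as Python lists of rows, and a row in slot j of a uniformly chosen page is accepted only when j < ni, which accepts with probability ni/nMAX and makes every row in the table equally likely.

```python
import random

def ar_sample(pages, num_samples, rng=random.Random(0)):
    """Acceptance/rejection sampling over pages with variable row
    counts: pick a page uniformly, pick a slot among n_MAX, and
    accept only if the slot holds a real row (prob. n_i / n_MAX)."""
    n_max = max(len(page) for page in pages)
    out = []
    while len(out) < num_samples:
        page = rng.choice(pages)      # uniform random page
        slot = rng.randrange(n_max)   # slot on the "padded" page
        if slot < len(page):          # accept; else reject and redraw
            out.append(page[slot])
    return out

pages = [[1, 2, 3], [4], [5, 6]]      # 3 pages, 1-3 rows each
sample = ar_sample(pages, 100)        # uniform over the 6 rows
```

Without the rejection step (always taking a random row from a random page), rows on sparsely filled pages would be oversampled.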

23

Cost of Row-Level Sampling

[Chart: pages fetched (%) vs. sampling rate (%), for a table of 100,000 pages with 200 rows/page]

24

Estimation for Aggregates

  • Point estimates
  • Easy: SUM, COUNT, AVERAGE
  • Hard: MAX, MIN, quantiles, distinct values
  • Confidence intervals – a measure of precision
  • Two cases: single-table and joins

25

Confidence Intervals

[Charts: CI endpoints vs. sample size (%) for large-sample, conservative, and deterministic intervals; CI length vs. sample size]

26

The Good and Bad News

  • Good news: 1/√n magic (n chosen on-the-fly)
  • Bad news: needle-in-a-haystack problem
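
The 1/√n behavior can be seen in a small running-estimator sketch (an illustration, not the tutorial's code): Welford's online algorithm maintains the mean and variance of the rows seen so far, and a large-sample (CLT) interval of half-width z·s/√n shrinks as n grows, with n chosen on the fly simply by stopping the scan.

```python
import math

def running_ci(stream, z=1.96):
    """Yield (n, running mean, approx. 95% CI half-width) after each
    row, using Welford's online mean/variance; half-width ~ 1/sqrt(n)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)      # running sum of squared deviations
        if n >= 2:
            s = math.sqrt(m2 / (n - 1))
            yield n, mean, z * s / math.sqrt(n)

estimates = list(running_ci(float(i % 10) for i in range(1000)))
```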

27

Sampling Deployed in Industry

  • "Simulated" Bernoulli sampling
  • SQL: SELECT * WHERE RAND() <= 0.01
  • Similar capability in SAS
  • Bernoulli sampling with pre-specified rate
  • Informix, Oracle 8i, (DB2)
  • Ex: SELECT * FROM T1 SAMPLE ROW(10%), T2
  • Ex: SELECT * FROM T1 SAMPLE BLOCK(10%), T2
  • Not for novices
  • Need to pre-specify precision
  • no feedback/control
  • recall the "multiresolution" patterns from example
  • No estimators provided in current systems
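
As a hedged sketch of the missing piece: given a Bernoulli sample at rate p, an unbiased scale-up (Horvitz-Thompson) estimator divides the sampled SUM and COUNT by p. The function name and data here are illustrative, not from any vendor's system.

```python
import random

def bernoulli_estimates(rows, p, rng=random.Random(42)):
    """Simulate SELECT * WHERE RAND() <= p, then scale up:
    SUM and COUNT over the sample, each divided by p, are
    unbiased estimates of the true SUM and COUNT."""
    sample = [x for x in rows if rng.random() <= p]
    return sum(sample) / p, len(sample) / p

est_sum, est_count = bernoulli_estimates(range(100_000), 0.01)
# true SUM = 4,999,950,000 and COUNT = 100,000; the estimates
# should land within a few percent of these values
```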

28

Precomputation Techniques

  • Two components
  • Data reduction (often expensive)
  • Approximate reconstruction (quick)
  • Pros and cons
  • Efficiency vs. flexibility
  • Class of queries that can be handled
  • Degree of precision
  • Ease of implementation
  • How much of system must be modified
  • How sophisticated must developer be?
  • More widely deployed in industry
  • Will give overview later

29

Road Map

  • Background and motivation
  • Examples of online techniques
  • Underlying technology
  • Building blocks: sampling, estimation
  • Preferential data delivery
  • Pipelined adaptive processing algorithms
  • Related technology: precomputation
  • Looking forward

30

Preferential Data Delivery

  • Why needed
  • Speedup/slowdown arrows
  • Spreadsheet scrollbars
  • Pipeline quasi-sort
  • Continuous re-optimization (eddies)
  • Index stride
  • High I/O costs, good for outliers
  • Online Reordering ("Juggle")
  • Excellent in most cases, no index required
  • [VLDB '99, VLDBJ '00]

31

Online Reordering

  • Deliver "interesting" items first
  • "Interesting" determined on the fly
  • Exploit rate gap between produce and process/consume

[Diagram: produce (scan from disk) feeding process/consume (join, transmit)]

32

Online Reordering

  • Deliver "interesting" items first
  • "Interesting" determined on the fly
  • Exploit rate gap between produce and process/consume

[Diagram: the same pipeline with a reorder step inserted between produce and process/consume]

33

Mechanism

  • Two threads: prefetch from input; spool/enrich from auxiliary side disk
  • Juggle data between buffer and side disk
  • keep buffer full of "interesting" items
  • getNext chooses best item currently in buffer
  • getNext, enrich/spool decisions based on reordering policy
  • Side disk management
  • hash index, populated in a way that postpones random I/O
  • play both sides of sort/hash duality

[Diagram: produce → prefetch → buffer ⇄ spool/enrich ⇄ side disk; getNext feeds process/consume]

34

Policies

  • "good" permutation of items t1…tn to tπ1…tπn
  • quality of feedback for a prefix tπ1 tπ2…tπk:
    QOF(UP(tπ1), UP(tπ2), …, UP(tπk)), UP = user preference
  • determined by application
  • goodness of reordering: dQOF/dt
  • implication for juggle mechanism
  • process gets item from buffer that increases QOF the most
  • juggle tries to maintain buffer with such items

[Chart: QOF vs. time]

35

QOF in Online Aggregation

  • avg weighted confidence interval
  • preference acts as weight on confidence interval
  • QOF = −Σi UPi / (ni)^1/2, ni = number of tuples processed from group i
  • ⇒ process pulls items from group with max UPi / ni^3/2
  • ⇒ desired ratio of group i in buffer = UPi^2/3 / Σj UPj^2/3
  • juggle tries to maintain this by enrich/spool
  • Similar derivations for other preferences
  • e.g. explicit rates, explicit ranking, etc.
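
The two implications above can be sketched directly (an illustration with made-up preference values): the marginal gain in QOF = −Σi UPi/√ni from one more tuple of group i is proportional to UPi/ni^3/2, and the desired buffer shares follow the UPi^2/3 normalization.

```python
def next_group(prefs, counts):
    """Serve the group whose next tuple most increases
    QOF = -sum_i UP_i / sqrt(n_i): argmax of UP_i / n_i^(3/2)."""
    return max(prefs, key=lambda g: prefs[g] / counts[g] ** 1.5)

def buffer_ratios(prefs):
    """Desired share of each group in the buffer:
    UP_i^(2/3) normalized over all groups."""
    w = {g: up ** (2 / 3) for g, up in prefs.items()}
    total = sum(w.values())
    return {g: wi / total for g, wi in w.items()}

prefs = {"A": 1.0, "B": 4.0}        # user prefers group B
counts = {"A": 100, "B": 100}       # tuples processed so far
```

With equal counts, group B (the higher preference) is served next; as its ni grows, the policy automatically rebalances toward A.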

36

Road Map

  • Background and motivation
  • Examples of online techniques
  • Underlying technology
  • Building blocks: sampling, estimation, pre-computation
  • Preferential data delivery
  • Pipelined adaptive processing algorithms
  • Related work
  • Looking forward

37

Pipelined Data Processing

  • Never, ever wait for anything to finish
  • Selection: no problem
  • Grouping: hash, don't sort
  • Sorting: juggle if possible
  • Joins?
  • Sample of joins vs. join of samples

SELECT AVG(R.a * S.b) FROM R, S WHERE R.c = S.c

38

Traditional Nested Loops

[Diagram: nested-loops join grid over S and R]

39

Ripple Joins

  • Designed for online performance goals
  • Completely pipelined
  • Adapt to data characteristics
  • Simplest version:
  • read new tuples s from S and r from R
  • join r and s
  • join r with old S tuples
  • join s with old R tuples

[SIGMOD '99]
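
The four steps above can be sketched as a generator (a simplified square ripple join, not the SIGMOD '99 implementation): each step draws one new tuple from each input, joins it against everything seen so far, and refreshes the running AVG(R.a * S.b) estimate.

```python
def ripple_join_avg(R, S):
    """Square ripple join for AVG(R.a * S.b) WHERE R.c = S.c.
    R and S are lists of (value, c) pairs; yields the running
    average over all joined pairs seen so far."""
    seen_r, seen_s = [], []
    total, matches = 0.0, 0
    for r, s in zip(R, S):
        for s2 in seen_s + [s]:       # new r against old S plus new s
            if r[1] == s2[1]:
                total += r[0] * s2[0]; matches += 1
        for r2 in seen_r:             # new s against old R
            if r2[1] == s[1]:
                total += r2[0] * s[0]; matches += 1
        seen_r.append(r); seen_s.append(s)
        yield total / matches if matches else None

R = [(float(i), i % 2) for i in range(6)]
S = [(float(i), i % 2) for i in range(6)]
estimates = list(ripple_join_avg(R, S))   # estimate after each step
```

Each step covers a new "layer" of the join rectangle, so no pair is counted twice; once both inputs are exhausted, the estimate equals the exact answer.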

40

Basic Ripple Join

[Diagram: join grid over R and S growing in square layers]

41

Block Ripple Joins (Size = 2)

[Diagram: join grid over R and S expanding in blocks of 2 tuples per step]

42

Rectangular Ripple Join

[Diagram: join grid expanding faster along one relation than the other]


43

Ripple Joins, cont'd

  • Variants:
  • Block: minimizes I/O in alternating nested loops
  • Index: coincides with index-nested loop
  • Hash: symmetric hash tables
  • Adaptive aspect ratio
  • User sets animation rate (via slider)
  • System goal: minimize CI length subject to time constraint
  • System solves optimization problem (approximately)
  • Samples from higher-variance relation faster
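
One plausible reading of the last point, sketched with made-up numbers: split a fixed per-step tuple budget between the two inputs in proportion to their estimated standard deviations (a Neyman-style allocation), so the higher-variance relation is sampled faster. This illustrates the idea only; it is not the optimization the system actually solves.

```python
import math

def aspect_ratio(var_r, var_s, budget):
    """Split a per-step tuple budget between R and S in
    proportion to their estimated standard deviations,
    so the noisier input advances faster."""
    sd_r, sd_s = math.sqrt(var_r), math.sqrt(var_s)
    beta_r = max(1, round(budget * sd_r / (sd_r + sd_s)))
    return beta_r, max(1, budget - beta_r)

# If S is 9x more variable than R, it gets 3/4 of the budget:
# aspect_ratio(1.0, 9.0, 8) returns (2, 6)
```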

44

Ripple Joins, cont'd

  • Prototypes in Informix, IBM DB2
  • Ongoing work on scalability issues
  • Memory compaction technique
  • Parallelism
  • Graceful degradation to out-of-core hashing
  • a la Tukwila, XJoin, but sensitive to statistical issues
  • Nested queries
  • Optimization issues
  • A number of API and other systems issues
  • DMKD journal paper on Informix implementation
  • Forthcoming paper on sampling in DB2

45

Road Map

  • Background and motivation
  • Examples of online techniques
  • Underlying technology
  • Related work
  • Online query processing
  • Precomputation
  • Looking forward

46

Related Work on Online QP

  • Morgenstein's PhD, Berkeley '80
  • Online Association Rules
  • Ng, et al.'s CAP, SIGMOD '98
  • Hidber's CARMA, SIGMOD '99
  • Implications for deductive DB semantics
  • Monotone aggregation in LDL++, Zaniolo and Wang
  • Online agg with subqueries
  • Tan, et al. VLDB '99
  • Dynamic Pipeline Scheduling
  • Urhan/Franklin VLDB '01
  • Pipelining Hash Joins
  • Raschid, Wilschut/Apers, Tukwila, XJoin
  • Relation to semi-naive evaluation
  • Anytime Algorithms
  • Zilberstein, Russell, et al.

47

Precomputation: Explicit

  • OLAP Data Cubes (drill-down hierarchies)
  • MOLAP, ROLAP, HOLAP
  • Semantic hierarchies
  • APPROXIMATE (Vrbsky, et al.)
  • Query Relaxation, e.g. CoBase
  • Multiresolution Data Models (Silberschatz/Reed/Fussell)
  • More general materialized views
  • See Gupta/Mumick's text

48

Precomputation: Stat. Summaries

  • Histograms
  • Originally for aggregation queries, many flavors
  • Extended to enumeration queries recently
  • Multi-dimensional histograms
  • Parametric estimation
  • Wavelets and fractals
  • Discrete cosine transform
  • Regression
  • Curve fitting and splines
  • Singular-Value Decomposition (aka LSI, PCA)
  • Indexes: hierarchical histograms
  • Ranking and pseudo-ranking
  • Aoki's use of GiSTs as estimators for ADTs
  • Data Mining
  • Clustering, classification, other multidimensional models

49

Precomputed Samples

  • Materialized sample views
  • Olken's original work
  • Chaudhuri et al.: join samples
  • Statistical inferences complicated over "recycled" samples?
  • Barbará's quasi-cubes
  • AQUA "join synopses" on universal relation
  • Maintenance issues
  • AQUA's backing samples
  • Can use fancier/more efficient sampling techniques
  • Stratified sampling or AQUA's "congressional" samples
  • Haas and Swami AFV statistics
  • Combine precomputed "outliers" with on-the-fly samples

50

Stratified Sampling
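
The slide's figure is not reproduced here. As a hedged sketch of the technique: partition the population into strata, sample each stratum independently, and weight each stratum's sample mean by its population share; when strata are internally homogeneous, this beats a plain SRS of the same total size.

```python
import random

def stratified_mean(strata, n_per_stratum, rng=random.Random(7)):
    """Stratified estimate of the population mean: sample each
    stratum separately, then weight each stratum's sample mean
    by the stratum's share of the population."""
    total = sum(len(s) for s in strata)
    est = 0.0
    for s in strata:
        sample = [rng.choice(s) for _ in range(n_per_stratum)]
        est += (len(s) / total) * (sum(sample) / n_per_stratum)
    return est

strata = [[10.0] * 100, [20.0] * 300]   # two homogeneous strata
est = stratified_mean(strata, 5)        # true mean is 17.5
```

With perfectly homogeneous strata, five draws per stratum already recover the exact answer; an unstratified sample of ten rows would still be noisy.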

51

Road Map

  • Background and motivation
  • Examples of online techniques
  • Underlying technology
  • Related Work
  • Looking forward
  • Adaptive systems
  • Human-centered systems

52

Looking Forward: Adaptive Systems

  • Observation/Decision ≈ Modeling/Prediction
  • usually statistical
  • Already critically important in today's systems
  • And imagine how important in ubiquitous computing!

[Diagram: Observe Environment → Make Decision → Act, in a loop]

53

A DBMS Tradition

  • One instance: System R optimization
  • Observe: Runstats
  • Decide: Query Optimization
  • Act: Query Processing
  • A powerful aspect of our technologies
  • Data independence & declarative languages
  • Yet quite coarse-grained
  • Runstats once per day/week
  • Actions only per-query
  • Disk resource management: index and matview selection
  • Memory resource management: buffers and sort/hash space
  • Concurrency management: admission control

54

"Built-in" Adaptivity

  • Info systems should have adaptivity as a basic goal
  • Not just best-case performance
  • Needs to pervade system
  • Core architectural work to be done here
  • E.g. pipelining required for multi-operator adaptivity
  • Observe more than one thing at a time
  • E.g. adaptive operators (a la ripple join)
  • E.g. adaptive optimization architectures (a la Eddies)
  • E.g. unify query processing with database design
  • Adaptivity should be built-in, not "bolted-on"
  • Wizards to turn existing knobs
  • Less helpful
  • Certainly less elegant
  • Might be technically more difficult!

55

Looking Forward: Human-Centered Systems

  • Annual plea for UI work in DB Directions Workshops
  • UIs perceived as "soft", hard to measure/publish
  • Yet people use our systems
  • And arguably we are trying to make them better for people
  • Problem: our performance metrics
  • "Mirth index" vs. wall-clock time
  • One can find reasonable "hard" metrics for mirth
  • Many of these metrics may be statistical
  • Also consider "woe index", e.g. in maintainability
  • Most of these indices have to do with user time
  • Not, e.g., resource utilization
  • Good UI work need not require good UIs!
  • Can attack new metrics directly
  • We don't have to go back to art school

56

Lessons Learned

  • Dream about UIs, work on systems
  • User needs drive systems design!
  • Systems and statistics intertwine
  • All 3 go together naturally
  • User desires and behavior: 2 more things to model, predict
  • "Performance" metrics need to reflect key user needs

"What unlike things must meet and mate…"

  • – Art, Herman Melville

57

More?

  • Annotated bibliography & slides soon…

http://control.cs.berkeley.edu/sigmod01/