Processing Complex Aggregate Queries over Data Streams (SIGMOD 2002)



SLIDE 1

Processing Complex Aggregate Queries over Data Streams

SIGMOD 2002

Alin Dobra, Minos Garofalakis, Johannes Gehrke, Rajeev Rastogi

June 4, 2002

SLIDE 2

Processing Network Data Streams

[Figure: Telco/LAN routers send measurement and alarm streams R1, R2, R3 to a Network Operations Center, which evaluates a data-stream join query.]

SELECT COUNT(*) FROM R1, R2, R3 WHERE R1.a = R2.b = R3.c

Dobra, Garofalakis, Gehrke and Rastogi – Processing Aggregate Queries over Streams

SLIDE 3

Computations over Streaming Data

[Figure: the streams for R1, R2, ..., Rr feed a stream query-processing engine that keeps one sketch per relation in memory; given a query Q(R1,...,Rr), the engine returns an approximate answer to Q.]

  • Goal: Approximately answer JOIN-COUNT and JOIN-SUM queries over streams

SLIDE 4

Outline of the Talk

  • Motivation
  • Sketch-based randomized algorithms
  • Sketch-based approximation of aggregate query results
  • Sketch-partitioning for estimation accuracy boosting
  • Experimental evaluation
  • Summary

SLIDE 5

Sketch-Based Randomized Algorithms [AMS96]

  • Estimate F(D) for some function F and some data D

Method:

  • Build a probability space and a random variable X with the properties:

1) $E[X] = F(D) \ge L_E$   2) $\mathrm{Var}(X) \le U_V$

  • Combine samples of X to achieve relative error $\epsilon$ with probability at least $1-\delta$
  • Boost accuracy to $\epsilon$ by averaging $\frac{8\,U_V}{\epsilon^2 L_E^2}$ pairwise independent samples of X
  • Boost confidence to $1-\delta$ by taking the median of $2\log(1/\delta)$ averages

Example usage: frequency moments [AMS96], size of join [AGMS99], L1 norm [FKSV99], wavelet decomposition [GKMS01]
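The averaging and median steps above can be sketched in Python as follows. This is a minimal illustration, not code from the talk: the function names `sample_counts` and `median_of_means` are hypothetical, the logarithm base (2) is an assumption because the slide does not state it, and the draws in any demo are fully independent, which is stronger than the pairwise independence the slide requires.

```python
import math
import statistics
from typing import Callable, Tuple


def sample_counts(upper_var: float, lower_exp: float,
                  eps: float, delta: float) -> Tuple[int, int]:
    """Return (samples per average, number of averages)."""
    # 8 * U_V / (eps^2 * L_E^2) samples are averaged to reach relative error eps.
    s1 = math.ceil(8 * upper_var / (eps ** 2 * lower_exp ** 2))
    # 2 * log(1/delta) averages feed the median; base 2 is assumed here.
    s2 = math.ceil(2 * math.log2(1 / delta))
    return s1, s2


def median_of_means(sample: Callable[[], float], upper_var: float,
                    lower_exp: float, eps: float, delta: float) -> float:
    """Combine draws of X into an estimate of E[X] (median of averages)."""
    s1, s2 = sample_counts(upper_var, lower_exp, eps, delta)
    averages = [sum(sample() for _ in range(s1)) / s1 for _ in range(s2)]
    return statistics.median(averages)
```

A caller supplies `sample` as a zero-argument function that returns one fresh draw of X, together with the bounds U_V and L_E that it knows to hold for X.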

SLIDE 6

Sketch-Based Randomized Algorithms (cont.)

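[Figure: data values 1, 2, ..., $2^{32}-1$ and a seed drawn from the uniform random seed space (size $2^{65}$) pass through a function h to produce the family ξ of ±1 random variables used to compute X.]

  • $\xi_i(s) = h(s, i) \in \{-1, +1\}$
  • family ξ is 4-wise independent, i.e. for all pairwise distinct $i_1, i_2, i_3, i_4$ and all $v_1, v_2, v_3, v_4 \in \{-1, +1\}$:

$$P[\xi_{i_1} = v_1 \wedge \xi_{i_2} = v_2 \wedge \xi_{i_3} = v_3 \wedge \xi_{i_4} = v_4] = P[\xi_{i_1} = v_1]\,P[\xi_{i_2} = v_2]\,P[\xi_{i_3} = v_3]\,P[\xi_{i_4} = v_4]$$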
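One common way to obtain such a family from a short seed is to evaluate a random cubic polynomial over a prime field and keep one bit of the result. The slide does not say which construction the authors use, so the code below is a swapped-in textbook technique rather than their method; `make_xi` and the modulus `P` are hypothetical names. Because `P` is odd, the sign bit is very slightly biased (by about 1/P), so the family is only approximately 4-wise independent.

```python
import random

# Mersenne prime 2^61 - 1; assumed to exceed every data value i.
P = (1 << 61) - 1


def make_xi(seed: int):
    """Return a function xi(i) in {-1, +1} determined entirely by the seed."""
    rng = random.Random(seed)
    # Four uniform coefficients define a random degree-3 polynomial over Z_P,
    # whose values at any four distinct points are independent and uniform.
    a0, a1, a2, a3 = (rng.randrange(P) for _ in range(4))

    def xi(i: int) -> int:
        # Horner evaluation of a0 + a1*i + a2*i^2 + a3*i^3 (mod P).
        h = (((a3 * i + a2) * i + a1) * i + a0) % P
        # Map the lowest bit of the hash value to a sign.
        return 1 if h & 1 else -1

    return xi
```

Only the seed has to be stored; any `xi(i)` can be recomputed on demand when a tuple with value `i` arrives.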

SLIDE 7

Estimation of COUNT(F ⋈_a G) [AGMS99]

F.a values: 1, 1, 2, 3, 1, 3
G.a values: 3, 3, 1, 1, 1

  i    f_i   g_i
  1     3     3
  2     1     0
  3     2     2

  • Estimate $\mathrm{COUNT}(F \bowtie_a G) = \sum_{i=1}^{3} f_i g_i = 3 \cdot 3 + 1 \cdot 0 + 2 \cdot 2 = 13$

SLIDE 8

Estimation of COUNT(F ⋈_a G) (cont.)

  i      1    2    3
  ξ_i   −1   +1   −1

  F.a   ξ_{t.a}   X_F = Σ_{t∈F} ξ_{t.a} (running)
   1      −1        −1
   1      −1        −2
   2      +1        −1
   3      −1        −2
   1      −1        −3
   3      −1        −4

  G.a   ξ_{t.a}   X_G = Σ_{t∈G} ξ_{t.a} (running)
   3      −1        −1
   3      −1        −2
   1      −1        −3
   1      −1        −4
   1      −1        −5

X = X_F X_G = (−4) · (−5) = 20 ≈ 13
SJ(F) = (3 · 3) + (1 · 1) + (2 · 2) = 14, SJ(G) = (3 · 3) + (0 · 0) + (2 · 2) = 13
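The running totals in this example can be reproduced by updating each sketch one tuple at a time. The following is a minimal sketch in Python; `atomic_sketch`, `self_join_size` and `join_count` are illustrative names, and the ξ values are hard-coded from the example instead of being generated from a seed.

```python
from collections import Counter

XI = {1: -1, 2: +1, 3: -1}        # xi_i for the domain {1, 2, 3}
F_STREAM = [1, 1, 2, 3, 1, 3]     # F.a values in arrival order
G_STREAM = [3, 3, 1, 1, 1]        # G.a values in arrival order


def atomic_sketch(stream, xi):
    """Add xi[t.a] for each arriving tuple; one counter is all the state kept."""
    x = 0
    for a in stream:
        x += xi[a]
    return x


def self_join_size(stream):
    """SJ(R) = sum of squared frequencies of the join attribute."""
    return sum(c * c for c in Counter(stream).values())


def join_count(f_stream, g_stream):
    """Exact COUNT of the equi-join, computed from the frequency vectors."""
    f, g = Counter(f_stream), Counter(g_stream)
    return sum(f[i] * g[i] for i in f)


x_f = atomic_sketch(F_STREAM, XI)
x_g = atomic_sketch(G_STREAM, XI)
estimate = x_f * x_g              # one sample of X; the exact answer is join_count(...)
```

A single sample of X can be far from the exact count; the accuracy guarantee comes from averaging many independent samples and taking a median of the averages.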

SLIDE 9

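Estimation of COUNT(F ⋈_a G) (cont.)

  • To estimate $\mathrm{COUNT}(F \bowtie_a G) = \sum_{i=1}^{n} f_i g_i$ define:

$$X_F = \sum_{i=1}^{n} f_i \xi_i = \sum_{t \in F} \xi_{t.a}, \qquad X_G = \sum_{i=1}^{n} g_i \xi_i = \sum_{t \in G} \xi_{t.a}$$

  • With $X = X_F X_G$ we have:

$$E[X] = E\Big[\sum_{i=1}^{n} f_i g_i \xi_i^2 + \sum_{i \ne i'} f_i g_{i'} \xi_i \xi_{i'}\Big] = \mathrm{COUNT}(F \bowtie_a G)$$

$$\mathrm{Var}(X) \le 2\,\mathrm{SJ}(F)\,\mathrm{SJ}(G)$$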
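For a small domain, both properties can be checked by brute force: enumerating every sign assignment with equal weight corresponds to fully independent ξ values, which is a special case of 4-wise independence. The sketch below does this for the frequency vectors f = (3, 1, 2) and g = (3, 0, 2) used earlier in the talk; `moments` is a hypothetical helper name.

```python
from itertools import product

f = {1: 3, 2: 1, 3: 2}
g = {1: 3, 2: 0, 3: 2}


def moments(f, g):
    """Exact mean and variance of X = X_F * X_G over all sign assignments."""
    keys = sorted(f)
    samples = []
    for signs in product((-1, 1), repeat=len(keys)):
        xi = dict(zip(keys, signs))
        x_f = sum(f[i] * xi[i] for i in keys)
        x_g = sum(g[i] * xi[i] for i in keys)
        samples.append(x_f * x_g)
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    return mean, var


mean, var = moments(f, g)
sj_f = sum(v * v for v in f.values())
sj_g = sum(v * v for v in g.values())
bound = 2 * sj_f * sj_g   # the variance bound 2 * SJ(F) * SJ(G)
```

The cross terms with i ≠ i′ cancel in the average, so `mean` equals the exact join size, and `var` stays below `bound`.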

SLIDE 10

Outline of the Talk

  • Motivation
  • Sketch-based randomized algorithms
  • Sketch-based approximation of aggregate query results
  • Sketch-partitioning for estimation accuracy boosting
  • Experimental evaluation
  • Summary

SLIDE 11

Using Sketches to Answer SUM Queries

  • Estimate $\mathrm{SUM}_b(F(a) \bowtie_a G(a, b)) = \sum_{i=1}^{3} f_i \sum_{t \in G,\, t.a = i} t.b$

  i      1    2    3
  ξ_i   −1   +1   −1

  F.a   ξ_{t.a}   X_F = Σ_{t∈F} ξ_{t.a} (running)
   1      −1        −1
   1      −1        −2
   2      +1        −1
   3      −1        −2
   1      −1        −3
   3      −1        −4

  G.a   G.b   ξ_{t.a}   X_G = Σ_{t∈G} t.b ξ_{t.a} (running)
   3     2      −1        −2
   3     2      −1        −4
   1     1      −1        −5
   1     2      −1        −7
   1     1      −1        −8

X = X_F X_G = (−4) · (−8) = 32 ≈ 20

SLIDE 12

Using Sketches to Answer SUM Queries (cont.)

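  • To estimate $\mathrm{SUM}_b(F(a) \bowtie_a G(a, b)) = \sum_{i=1}^{n} f_i \sum_{t \in G,\, t.a = i} t.b$ define:

$$X_F = \sum_{i=1}^{n} f_i \xi_i = \sum_{t \in F} \xi_{t.a}, \qquad X_G = \sum_{i=1}^{n} \Big(\sum_{t \in G,\, t.a = i} t.b\Big) \xi_i = \sum_{t \in G} t.b\, \xi_{t.a}$$

  • With $X = X_F X_G$:

$$E[X] = \mathrm{SUM}_b(F(a) \bowtie_a G(a, b))$$

$$\mathrm{Var}(X) \le 2\,\mathrm{SJ}(F) \sum_{i=1}^{n} \Big(\sum_{t \in G,\, t.a = i} t.b\Big)^2$$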
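The only change from the COUNT case is that each G tuple contributes t.b · ξ_{t.a} instead of ξ_{t.a}. A minimal Python sketch on the example relations follows; the function names are illustrative, and the expectation is taken by enumerating all sign assignments (full independence, a special case of the 4-wise independence assumed in the talk).

```python
from itertools import product

F = [1, 1, 2, 3, 1, 3]                          # F.a values
G = [(3, 2), (3, 2), (1, 1), (1, 2), (1, 1)]    # (G.a, G.b) tuples
XI = {1: -1, 2: +1, 3: -1}


def exact_sum(F, G):
    """SUM of G.b over all joining (F, G) tuple pairs."""
    return sum(b for a_f in F for (a_g, b) in G if a_f == a_g)


def sum_sketch(F, G, xi):
    """One sample X = X_F * X_G, with X_G weighted by the aggregated attribute."""
    x_f = sum(xi[a] for a in F)
    x_g = sum(b * xi[a] for a, b in G)
    return x_f * x_g


def expected_sketch(F, G, domain):
    """Average of X over every +/-1 assignment to the domain values."""
    total = 0
    for signs in product((-1, 1), repeat=len(domain)):
        total += sum_sketch(F, G, dict(zip(domain, signs)))
    return total / 2 ** len(domain)
```

Calling `expected_sketch(F, G, [1, 2, 3])` averages over 8 assignments and should agree with `exact_sum(F, G)`.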

SLIDE 13

Extension to COUNT(F ⋈_a G ⋈_b H)

  • Key idea: use independent ξ families for each join attribute

  i        1    2    3          j        1    2
  ξ^a_i   −1   +1   −1          ξ^b_j   +1   −1

  F.a   ξ^a_{t.a}   X_F = Σ_{t∈F} ξ^a_{t.a} (running)
   1       −1         −1
   1       −1         −2
   2       +1         −1
   3       −1         −2
   1       −1         −3
   3       −1         −4

  G.a   G.b   ξ^a_{t.a}   ξ^b_{t.b}   X_G = Σ_{t∈G} ξ^a_{t.a} ξ^b_{t.b} (running)
   3     2       −1          −1         +1
   3     2       −1          −1         +2
   1     1       −1          +1         +1
   1     2       −1          −1         +2
   1     1       −1          +1         +1

  H.b   ξ^b_{t.b}   X_H = Σ_{t∈H} ξ^b_{t.b} (running)
   2       −1         −1
   2       −1         −2
   1       +1         −1
   2       −1         −2

X = X_F X_G X_H = (−4) · 1 · (−2) = 8 ≈ 27

SLIDE 14

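Extension to COUNT(F ⋈_a G ⋈_b H) (cont.)

  • To estimate

$$\mathrm{COUNT}(F \bowtie_a G \bowtie_b H) = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} f_i\, g_{ij}\, h_j$$

  • Define:

$$X_F = \sum_{i=1}^{n_1} f_i \xi^a_i, \qquad X_G = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} g_{ij}\, \xi^a_i \xi^b_j, \qquad X_H = \sum_{j=1}^{n_2} h_j \xi^b_j$$

  • If $\xi^a$ and $\xi^b$ are independent families of 4-wise independent ±1 pseudo-random variables, then

$$E[X_F X_G X_H] = \mathrm{COUNT}(F \bowtie_a G \bowtie_b H)$$

$$\mathrm{Var}(X_F X_G X_H) \le 4\,\mathrm{SJ}(F)\,\mathrm{SJ}(G)\,\mathrm{SJ}(H)$$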
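The two-family construction can be sketched in Python on the small relations F(a), G(a, b) and H(b) from the example. The names below are illustrative, and the expectation is computed by enumerating every sign assignment for both families independently, which is stronger than the 4-wise independence the talk requires.

```python
from itertools import product

F = [1, 1, 2, 3, 1, 3]                          # F.a values
G = [(3, 2), (3, 2), (1, 1), (1, 2), (1, 1)]    # (G.a, G.b) tuples
H = [2, 2, 1, 2]                                # H.b values
XI_A = {1: -1, 2: +1, 3: -1}                    # family for join attribute a
XI_B = {1: +1, 2: -1}                           # independent family for b


def exact_count(F, G, H):
    """Each G tuple joins with every matching F tuple and H tuple."""
    return sum(F.count(a) * H.count(b) for a, b in G)


def sketch(F, G, H, xi_a, xi_b):
    """One sample X = X_F * X_G * X_H; G uses both families at once."""
    x_f = sum(xi_a[a] for a in F)
    x_g = sum(xi_a[a] * xi_b[b] for a, b in G)
    x_h = sum(xi_b[b] for b in H)
    return x_f * x_g * x_h


def expected_sketch(F, G, H, dom_a, dom_b):
    """Average of X over all sign assignments of both families."""
    total, n = 0, 0
    for sa in product((-1, 1), repeat=len(dom_a)):
        for sb in product((-1, 1), repeat=len(dom_b)):
            total += sketch(F, G, H, dict(zip(dom_a, sa)), dict(zip(dom_b, sb)))
            n += 1
    return total / n
```

Using one family per join attribute is what makes the cross terms vanish: a term survives in the expectation only if every ξ^a and every ξ^b factor appears squared.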

SLIDE 15

Estimation of COUNT(R_1 ⋈ ··· ⋈ R_r)

  • For each of the n equality join constraints, build an independent family of pseudo-random variables
  • For every relation $R_l(a_1, \dots, a_m)$ compute samples of the random variable $X_{R_l}$ defined as:

$$X_{R_l} = \sum_{i_1}^{n_1} \cdots \sum_{i_m}^{n_m} f_{i_1, \dots, i_m}\, \xi_{1, i_1} \cdots \xi_{m, i_m} = \sum_{t \in R_l} \xi_{1, t.a_1} \cdots \xi_{m, t.a_m}$$

$$X = \prod_{l=1}^{r} X_{R_l}$$

  • Can show:

$$E[X] = \mathrm{COUNT}(R_1 \bowtie \cdots \bowtie R_r)$$

$$\mathrm{Var}(X) \le 2^{2n} \prod_{l=1}^{r} \mathrm{SJ}(R_l)$$

SLIDE 16

Outline of the Talk

  • Motivation
  • Sketch-based randomized algorithms
  • Sketch-based approximation of aggregate query results
  • Sketch-partitioning for estimation accuracy boosting
  • Experimental evaluation
  • Summary

SLIDE 17

Sketch Partitioning

Problem: large variance ⇒ loose estimation guarantees.
Our solution: sketch partitioning

  i    f_i   g_i
  1    20     2
  2     5    15
  3    10     3
  4     2    10

$\mathrm{Var}(X) \approx 2\,\mathrm{SJ}(F)\,\mathrm{SJ}(G) = 2(20^2 + 5^2 + 10^2 + 2^2)(2^2 + 15^2 + 3^2 + 10^2) = 357604$

Idea: split domain I = {1, 2, 3, 4} into I_1 = {1, 3} and I_2 = {2, 4}

  • F splits into F_1 and F_2, G into G_1 and G_2
  • build X_1 to estimate COUNT(F_1 ⋈ G_1) and, independently, X_2 to estimate COUNT(F_2 ⋈ G_2)
  • take X′ = X_1 + X_2; have E[X′] = COUNT(F ⋈ G)

SLIDE 18

Sketch Partitioning (cont.)

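  • Estimation of COUNT(F_1 ⋈ G_1)

  i    f_i   g_i
  1    20     2
  3    10     3

$\mathrm{Var}(X_1) \approx 2\,\mathrm{SJ}(F_1)\,\mathrm{SJ}(G_1) = 2(20^2 + 10^2)(2^2 + 3^2) = 13000$

  • Estimation of COUNT(F_2 ⋈ G_2)

  i    f_i   g_i
  2     5    15
  4     2    10

$\mathrm{Var}(X_2) \approx 2\,\mathrm{SJ}(F_2)\,\mathrm{SJ}(G_2) = 2(5^2 + 2^2)(15^2 + 10^2) = 18850$

  • $\mathrm{Var}(X') = \mathrm{Var}(X_1) + \mathrm{Var}(X_2) = 31850$
  • Improvement: $\frac{\mathrm{Var}(X)/2}{\mathrm{Var}(X')} = \frac{357604/2}{31850} \approx 5.6$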
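The arithmetic of this example is easy to re-derive. The short Python sketch below recomputes the variance bounds for the whole domain and for the two partitions; `var_bound` is a hypothetical helper that evaluates 2 · SJ(F_k) · SJ(G_k) on a subset of the domain, and the division by 2 in the improvement ratio is taken from the slide as given.

```python
f = {1: 20, 2: 5, 3: 10, 4: 2}
g = {1: 2, 2: 15, 3: 3, 4: 10}


def var_bound(f, g, part):
    """2 * SJ(F_k) * SJ(G_k), restricted to the domain values in `part`."""
    sj_f = sum(f[i] ** 2 for i in part)
    sj_g = sum(g[i] ** 2 for i in part)
    return 2 * sj_f * sj_g


whole = var_bound(f, g, [1, 2, 3, 4])   # unpartitioned sketch
v1 = var_bound(f, g, [1, 3])            # partition I_1
v2 = var_bound(f, g, [2, 4])            # partition I_2
# X_1 and X_2 are built independently, so their variances add.
partitioned = v1 + v2
improvement = (whole / 2) / partitioned
```

The gain comes from separating the values where f is large and g is small from those where the opposite holds, so that no single partition multiplies a large SJ(F_k) by a large SJ(G_k).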

SLIDE 19

Binary Sketch Partitioning

  • Prior information: historical data, histograms.
  • Find the partitioning $I = I_1 \cup I_2$ and the space allocation $m = m_1 + m_2$ that minimizes $\frac{\mathrm{Var}(X_1)}{m_1} + \frac{\mathrm{Var}(X_2)}{m_2}$, where $\mathrm{Var}(X_k) \approx 2 \sum_{i \in I_k} f_i^2 \sum_{i \in I_k} g_i^2$.
  • Allocate space proportional to $\sqrt{\mathrm{Var}(X_k)}$. In the example, 5:6
  • Have to look only at partitionings in the order $f_i/g_i$ to find the optimum ⇒ $O(|I|)$
  • In the example the order is {1, 3, 2, 4}. The optimal partition is {1, 3} ∪ {2, 4}.
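The search described in these bullets can be sketched as a scan over the cut points of the domain sorted by f_i/g_i. The code below is an illustration under stated assumptions, not the authors' implementation: it assumes every g_i is positive so the ratio is defined, and it uses the fact that with m_k proportional to √Var(X_k) the objective Var(X_1)/m_1 + Var(X_2)/m_2 equals (√Var(X_1) + √Var(X_2))² / m, so minimizing the sum of square roots is enough. The name `best_binary_partition` is hypothetical.

```python
import math

f = {1: 20, 2: 5, 3: 10, 4: 2}
g = {1: 2, 2: 15, 3: 3, 4: 10}


def var_bound(f, g, part):
    """2 * SJ(F_k) * SJ(G_k) on the domain subset `part`."""
    return 2 * sum(f[i] ** 2 for i in part) * sum(g[i] ** 2 for i in part)


def best_binary_partition(f, g):
    """Try every cut of the domain sorted by f_i / g_i (assumes g_i > 0)."""
    order = sorted(f, key=lambda i: f[i] / g[i], reverse=True)
    best = None
    for cut in range(1, len(order)):
        left, right = order[:cut], order[cut:]
        # Sum of square roots: the quantity to minimize under the
        # sqrt-proportional space allocation.
        cost = (math.sqrt(var_bound(f, g, left))
                + math.sqrt(var_bound(f, g, right)))
        if best is None or cost < best[0]:
            best = (cost, left, right)
    return best


cost, part1, part2 = best_binary_partition(f, g)
# Space split m1 : m2 follows the square roots of the two variance bounds.
ratio = math.sqrt(var_bound(f, g, part1)) / math.sqrt(var_bound(f, g, part2))
```

This version recomputes each bound from scratch for clarity; keeping running sums of f_i² and g_i² while moving the cut would make the scan linear in |I| after sorting.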

SLIDE 20

K-ary Sketch Partitioning

  • Want to split domain of join attribute in K parts
  • Allocate space proportional to $\sqrt{\mathrm{Var}(X_k)}$
  • Have to look only at partitionings in the order $f_i/g_i$ to find the optimum (generalization of the previous result)
  • Dynamic programming gives a solution in time $O(K|I|^2)$ and space $O(K|I|)$
  • Approximate frequencies with histograms

– time and space dependency on number of buckets instead of $|I|$
– provable approximation quality

  • Generalization to larger joins possible: details in the paper
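One dynamic program consistent with these bullets splits the domain, sorted by f_i/g_i, into K contiguous segments so that the sum of √Var(X_k) is minimal. The paper's exact formulation may differ; this is a sketch that assumes g_i > 0 for every i and uses the bound Var(X_k) ≈ 2 · SJ(F_k) · SJ(G_k). The name `kary_partition` is hypothetical.

```python
import math


def kary_partition(f, g, K):
    """Return (cost, parts): K contiguous segments in f_i/g_i order
    minimizing the sum of sqrt(2 * SJ(F_k) * SJ(G_k))."""
    order = sorted(f, key=lambda i: f[i] / g[i], reverse=True)
    n = len(order)
    # Prefix sums of f^2 and g^2 make each segment cost an O(1) lookup.
    pf, pg = [0], [0]
    for i in order:
        pf.append(pf[-1] + f[i] ** 2)
        pg.append(pg[-1] + g[i] ** 2)

    def cost(lo, hi):
        return math.sqrt(2 * (pf[hi] - pf[lo]) * (pg[hi] - pg[lo]))

    inf = float("inf")
    # best[k][hi]: cheapest way to cover order[:hi] with k segments.
    best = [[inf] * (n + 1) for _ in range(K + 1)]
    back = [[0] * (n + 1) for _ in range(K + 1)]
    best[0][0] = 0.0
    for k in range(1, K + 1):
        for hi in range(k, n + 1):
            for lo in range(k - 1, hi):
                c = best[k - 1][lo] + cost(lo, hi)
                if c < best[k][hi]:
                    best[k][hi], back[k][hi] = c, lo
    # Walk the back-pointers to recover the segments.
    parts, hi = [], n
    for k in range(K, 0, -1):
        lo = back[k][hi]
        parts.append(order[lo:hi])
        hi = lo
    parts.reverse()
    return best[K][n], parts
```

The three nested loops give the O(K|I|²) running time and the two tables the O(K|I|) space quoted above. With K = 2 on the four-value example from the binary case, the call `kary_partition({1: 20, 2: 5, 3: 10, 4: 2}, {1: 2, 2: 15, 3: 3, 4: 10}, 2)` should recover the split {1, 3} and {2, 4}.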

SLIDE 21

Experimental Study

Datasets:

  • Census data set (www.bls.census.gov):

– Current Population Survey data for Aug 1999 (72100) and Aug 2001 (81600)
– Attributes used:
    ∗ income (1:14)
    ∗ education (1:46)
    ∗ age (1:99)
    ∗ weekly wage and weekly wage overtime (0:288416)

Comparison: estimation using unidimensional equi-depth histograms

Query load: JOIN-COUNT queries relations

Error metric: relative error = $100 \cdot \frac{|\text{actual} - \text{approx}|}{\text{actual}}\,\%$

SLIDE 22

Sketches v/s Histograms: Census data

[Plot: relative error (%) versus memory (words), sketch versus histogram.]

Census1999.weekly wage = Census2001.weekly wage

SLIDE 23

Sketches v/s Histograms: Census data (cont.)

[Plots: relative error (%) versus memory (words), sketch versus histogram, for the two joins listed.]

Census1999.age = Census2001.age
Census1999.education = Census2001.education

SLIDE 24

Sketches v/s Histograms: Census data (cont.)

[Plot: relative error (%) versus memory (words), sketch versus histogram.]

Star join of four copies of Census 2001 on age, education and income

SLIDE 25

Sketch Partitioning: Census Data Sets

[Plot: relative error (%) versus number of partitions, for 25, 50 and 100 buckets.]

Census1999.weekly wage overtime = Census2001.weekly wage overtime

SLIDE 26

Sketch Partitioning: Census Data Sets (cont.)

[Plots: relative error (%) versus number of partitions, for 25, 50 and 100 buckets, for the two joins listed.]

Census1999.weekly wage overtime = Census2001.weekly wage
Census1999.weekly wage = Census2001.weekly wage overtime

SLIDE 27

Summary

  • Shown how to process multi-join decision support queries over streams
  • Proposed sketch partitioning – improves estimate guarantees
  • Shown experimental evidence that the proposed techniques work in practice
