Beyond Simple Aggregates: Indexing for Summary Queries



SLIDE 1

Beyond Simple Aggregates: Indexing for Summary Queries

Zhewei Wei and Ke Yi

Hong Kong University of Science and Technology

SLIDE 5

Reporting vs. Aggregation

Reporting query:
SELECT salary FROM T WHERE age > 30 AND age < 40
Result: $32,000, $76,300, $54,400, ..., $68,000, $28,000 (50,000 records)

Aggregation query:
SELECT AVG(salary) FROM T WHERE age > 30 AND age < 40
Result: $52,312

SLIDE 6

Reporting vs. Aggregation

SELECT salary FROM T WHERE age > 30 AND age < 40
SELECT AVG(salary) FROM T WHERE age > 30 AND age < 40

[Figure: salary distribution; axes: Salary, # of employees]

SLIDE 8

Reporting vs. Aggregation

Search Engine Log

Date        Keyword
2011.04.08  Masters 2011
2011.04.08  Libya
2011.04.07  Japan nuclear crisis
2011.04.07  Libya
...
2011.03.11  Japan earthquake
2011.03.11  Japan tsunami
2011.03.10  NCAA
...

Keyword               Frequency
Libya                 19.3%
Japan nuclear crisis  16.5%
Japan earthquake      10.2%
...

SLIDE 10

Summary Queries

Let D be a database containing N records. Each record p ∈ D is associated with a query attribute Aq(p) (e.g., age) and a summary attribute As(p) (e.g., salary).

A summary query specifies a range constraint [q1, q2] on Aq, and the database returns a summary of the As attribute over all records whose Aq attribute lies within the range.
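
To make the definition concrete, here is a minimal brute-force sketch of a summary query. It is only a reference point, not the index from the talk: it scans all N records on every query. The `Record` class, the `summary_query` function, and the choice of three quantiles as the returned summary are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Record:
    age: int        # query attribute A_q(p)
    salary: float   # summary attribute A_s(p)


def summary_query(db, q1, q2, phis=(0.25, 0.5, 0.75)):
    """Brute-force summary query: scan, filter on A_q, summarize A_s."""
    values = sorted(p.salary for p in db if q1 <= p.age <= q2)
    if not values:
        return []
    n = len(values)
    # Report the value at rank phi*n for each requested phi
    # (a tiny quantile summary of the selected salaries).
    return [values[min(n - 1, int(phi * n))] for phi in phis]


db = [Record(25, 30000), Record(32, 32000), Record(35, 54400),
      Record(38, 76300), Record(39, 68000), Record(45, 90000)]
print(summary_query(db, 30, 40))  # [54400, 68000, 76300]
```

The linear scan is exactly the cost an index for summary queries is meant to avoid.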

SLIDE 12

Summary Queries

Data summarization techniques:
  Heavy hitters (a.k.a. frequent items) [MG 82] [MAA 06] ...
  Quantiles [MP 80] [GK 01] ...
  Various sketches ([AMS 99], Count-Min [CM 05], ...)
  Histograms [PHIJ 96] [JKMPSS 98] [GGIKMS 02] ...
  Wavelets [MVW 98] [VM 99] [GKMS 01] ...
  ...

Past research focuses on computing summaries of the whole data set, either offline or in the streaming model.

SLIDE 15

Algorithm Problem vs. Data Structure Problem

Space
  The algorithm problem: offline: O(N); streaming: sublinear
  The data structure problem: O(N) (the data must be stored)

Time
  The algorithm problem: Õ(N); sublinear when sampling works
  The data structure problem:
    preprocessing time: less important
    query time: O(log N + sε) in internal memory, O(logB N + sε/B) in external memory

sε: summary size; B: block size

SLIDE 17

Quantile Summaries

φ-quantile: the value ranked at φ|D| in D.
ε-approximate φ-quantile: any value whose rank lies in [(φ − ε)|D|, (φ + ε)|D|].
Quantile summary: a structure from which an ε-approximate φ-quantile can be extracted for any 0 < φ < 1.

[Figure: salary distribution (axes: Salary, # of employees) marked at min, 20%, 40%, 60%, 80%, and max]

SLIDE 19

Quantile Summaries

[Figure: example data 6 3 9 11 1 4 16 24 3 7 13 26 21 at a node u, with a span of ε|D| values marked]

Size: sε = Θ(1/ε); Error: ε|D|
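
One simple way to obtain a summary with these size and error bounds is to sort the data and keep one value out of every ε|D|, together with its rank. The sketch below assumes that construction; the function names `build_quantile_summary` and `approx_quantile` are hypothetical, and the talk may use a different summary internally.

```python
import math


def build_quantile_summary(values, eps):
    """Keep every step-th sorted value with its 1-based rank, step = floor(eps*n)."""
    data = sorted(values)
    n = len(data)
    step = max(1, math.floor(eps * n))
    summary = [(r, data[r - 1]) for r in range(step, n + 1, step)]
    if not summary or summary[-1][0] != n:
        summary.append((n, data[-1]))  # always keep the maximum
    return n, summary


def approx_quantile(n, summary, phi):
    """Return the stored value whose rank is closest to phi*n."""
    target = phi * n
    return min(summary, key=lambda rv: abs(rv[0] - target))[1]


data = [6, 3, 9, 11, 1, 4, 16, 24, 3, 7, 13, 26, 21]
n, s = build_quantile_summary(data, 0.25)
print(s)                           # [(3, 3), (6, 7), (9, 13), (12, 24), (13, 26)]
print(approx_quantile(n, s, 0.5))  # 7
```

Stored ranks are at most `step` apart, so the returned value's rank is within ε|D| of the target, and the summary holds roughly 1/ε entries.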

SLIDE 23

A Baseline Solution

Decomposable summaries

ε-summary of D1 + ε-summary of D2 + · · · + ε-summary of Dt = ε-summary of D, where D = D1 ⊎ · · · ⊎ Dt

Error: ε|D1| + · · · + ε|Dt| = ε|D|
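
The additive error bound can be checked on a small example. The sketch below assumes a weighted-sample representation of a quantile summary (each stored value stands for a block of `step` consecutive sorted items); `summarize`, `merge`, and `estimated_rank` are illustrative names, not taken from the talk.

```python
import math


def summarize(values, eps):
    """eps-summary: first value of each block of `step` sorted items, with the block size."""
    data = sorted(values)
    step = max(1, math.floor(eps * len(data)))
    return [(data[i], min(step, len(data) - i)) for i in range(0, len(data), step)]


def merge(*summaries):
    """Combine summaries of disjoint data sets by taking the union of their entries."""
    return sorted(pair for s in summaries for pair in s)


def estimated_rank(summary, x):
    """Estimate how many items are <= x by summing the weights of stored values <= x."""
    return sum(w for v, w in summary if v <= x)


d1, d2, eps = list(range(100)), list(range(50, 250)), 0.1
merged = merge(summarize(d1, eps), summarize(d2, eps))
true_rank = sum(1 for v in d1 + d2 if v <= 120)
print(true_rank, estimated_rank(merged, 120))
```

Each part overcounts by less than one block, that is, by less than ε|Di|, so the merged estimate is off by less than ε|D1| + ε|D2| = ε|D|.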

SLIDE 24

A Baseline Solution

[Figure: ε-summary; query range]

SLIDE 25

Query Cost

[Figure: log N sorted lists, each of size sε]

log N-way merging: O(sε log N log log N)
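
The log log N factor comes from the heap used in a k-way merge: with k = log N lists, each of the sε log N output elements costs O(log k) = O(log log N). A minimal sketch using the standard library's `heapq.merge` (the wrapper name `kway_merge` is only for illustration):

```python
import heapq


def kway_merge(sorted_lists):
    """Merge k sorted lists using a binary heap with one cursor per list."""
    # Each output element costs O(log k) heap work, so merging k lists of
    # size s takes O(s * k * log k) time in total.
    return list(heapq.merge(*sorted_lists))


lists = [[1, 4, 9], [2, 3, 10], [5, 6, 7]]
print(kway_merge(lists))  # [1, 2, 3, 4, 5, 6, 7, 9, 10]
```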

SLIDE 28

A Baseline Solution

Internal memory
Query time: O(sε log N log log N)
Space: O(Nsε), reduced to O(N) by using fat leaves of size sε

SLIDE 29

Optimal Data Structure

[Figure: summaries $S(\varepsilon, D_1)$, $S(\tfrac{3}{2}\varepsilon, D_2)$, $S((\tfrac{3}{2})^2\varepsilon, D_3)$ over the query range]

SLIDE 31

Optimal Data Structure

Quantile summary S(ε, D): an ε-quantile summary for data set D. Size: Θ(1/ε); Error: ε|D|.

Data set   Data size      Error param.                  Summary size                      Absolute error
$D_1$      $k$            $\varepsilon$                 $1/\varepsilon$                   $\varepsilon k$
$D_2$      $k/2$          $(3/2)\varepsilon$            $(2/3)(1/\varepsilon)$            $(3/4)\varepsilon k$
$D_3$      $k/4$          $(3/2)^2\varepsilon$          $(2/3)^2(1/\varepsilon)$          $(3/4)^2\varepsilon k$
...
$D_t$      $k/2^{t-1}$    $(3/2)^{t-1}\varepsilon$      $(2/3)^{t-1}(1/\varepsilon)$      $(3/4)^{t-1}\varepsilon k$
$D$        $\Theta(k)$    -                             $O(1/\varepsilon)$                $O(\varepsilon k)$
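
The bottom row follows from three geometric series: the data sizes sum to less than 2k, the summary sizes to less than 3/ε, and the absolute errors to less than 4εk. A quick numeric check, as a sketch (the function name `totals` and the sample parameters are arbitrary):

```python
def totals(eps, k, t):
    """Sum the per-level data size, summary size, and absolute error for D_1..D_t."""
    data = sum(k / 2**i for i in range(t))
    size = sum((2 / 3)**i / eps for i in range(t))
    error = sum((3 / 4)**i * eps * k for i in range(t))
    return data, size, error


eps, k, t = 0.01, 1_000_000, 20
data, size, error = totals(eps, k, t)
print(data < 2 * k, size < 3 / eps, error < 4 * eps * k)  # True True True
```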

SLIDE 33

Optimal Data Structure

[Figure: an $\varepsilon$-summary, a $(\tfrac{3}{2}\varepsilon)$-summary, a $((\tfrac{3}{2})^2\varepsilon)$-summary, ... over the query range]

SLIDE 37

Query Cost

[Figure: log N sorted lists; label: sε]

log N-way merging: Θ(sε log log N)
Bottom-up two-way merging: O(sε)
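
One way to see why two-way merging can reach O(sε): when the list sizes shrink geometrically, merging from the smallest list upward keeps every intermediate result small, so the total work is a geometric sum. The sketch below assumes that merge order and list sizes shrinking by a factor of 2/3; `merge_two` and `bottom_up_merge` are illustrative names, not the talk's.

```python
def merge_two(a, b):
    """Standard linear-time merge of two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def bottom_up_merge(lists):
    """Merge lists given from largest to smallest, starting with the smallest."""
    acc, work = [], 0
    for lst in reversed(lists):
        acc = merge_two(acc, lst)
        work += len(acc)  # cost of this merge step
    return acc, work


sizes = [81, 54, 36, 24, 16]  # geometric, ratio 2/3
lists = [sorted((i * 7 + j) % 101 for i in range(n)) for j, n in enumerate(sizes)]
merged, work = bottom_up_merge(lists)
print(len(merged), work)  # 211 473
```

Here the total work (473) stays below 9 times the largest list size, independent of how many lists there are, whereas a heap-based merge pays an extra logarithmic factor per element.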

SLIDE 39

α-Exponentially Decomposable

Consider multisets $D_1, \ldots, D_t$ with $F_1(D_i) \le \alpha^{i-1} F_1(D_1)$. A summary is α-exponentially decomposable if there exists a constant $c$ such that, given $S(\varepsilon, D_1), S(c\varepsilon, D_2), \ldots, S(c^{t-1}\varepsilon, D_t)$:
  We can construct an $O(\varepsilon)$-summary for $D_1 \uplus \cdots \uplus D_t$.
  The total size of $S(\varepsilon, D_1), \ldots, S(c^{t-1}\varepsilon, D_t)$ is $O(s_\varepsilon)$, and they can be combined in $O(s_\varepsilon)$ time.
  The total size of $S(\varepsilon, D), \ldots, S(c^{t-1}\varepsilon, D)$ is $O(s_\varepsilon)$.

Theorem. For any (1/2)-exponentially decomposable summary, a database D of N records can be stored in an internal memory structure of linear size so that a summary query can be answered in O(log N + sε) time.

SLIDE 41

Optimal Data Structure - External Memory

Standard B-tree blocking with fat leaves

[Figure labels: Leaf size: sε; Θ(B); O(log B)]

SLIDE 42

Query Path

[Figure: query path through nodes u, v0, r1, v1, r2, v2, r3, v, with w1 and w2]

SLIDE 45

Summary Set

[Figure: nodes u and v, with nodes w1, w2, w3]

$R(u, v) = \{w_1, w_2, w_3\}$
$RS(u, v, \varepsilon)$: $S(\varepsilon, w_1)$, $S(c\varepsilon, w_2)$, $S(c^3\varepsilon, w_3)$

SLIDE 52

Focus on a Block

[Figure: a block with root rB and nodes u, v1, v2]

Case 1. $RS(u, v_1, \varepsilon)$. Size: $s_\varepsilon B \log B$
Case 2. $RS(r_B, v_2, \varepsilon)$, $RS(r_B, v_2, c\varepsilon)$, ... Size: $s_\varepsilon B$
Case 3. $S(r_B, \varepsilon)$, $S(r_B, c\varepsilon)$, $S(r_B, c^2\varepsilon)$, ... Size: $s_\varepsilon$

SLIDE 57

Query Process

[Figure: query path through nodes u, v0, r1, v1, r2, v2, r3, v, with w1 and w2]

Case 1. $RS(u, v_0, \varepsilon)$
Case 2. $RS(r_1, v_1, c^{d_{r_1}-d_u}\varepsilon)$, $RS(r_2, v_2, c^{d_{r_2}-d_u}\varepsilon)$, $RS(r_3, v_3, c^{d_{r_3}-d_u}\varepsilon)$
Case 3. $S(w_1, c^{d_{w_1}-d_u}\varepsilon)$, $S(w_2, c^{d_{w_2}-d_u}\varepsilon)$

Query Cost: O(logB N + sε/B)

SLIDE 60

Optimal Data Structure - External Memory

Structure so far: Query Cost: O(logB N + sε/B); Space Usage: O(N log B)
Target: Query Cost: O(logB N + sε/B); Space Usage: O(N)

Idea: pack some leaves of u to reduce space usage

SLIDE 62

Packed Structure

[Figure: node u with children ul and ur = w1, a node u′ with labels kh and h, and nodes w2, w3]

$S(\varepsilon, w_1)$, $S(c\varepsilon, w_2)$, $S(c^2\varepsilon, w_3)$
One summary for each node in u′'s subtree

SLIDE 66

Packed Structure

[Figure: node u and a node u′ with labels kh and h]

$S(\varepsilon, w_1)$, $S(c\varepsilon, w_2)$, $S(c^2\varepsilon, w_3)$
One summary for each node in u′'s subtree

The total size of all summaries below u′:

$$\sum_{i=0}^{\log k_h} \frac{k_h}{2^i}\, s_{c^{h-i-1}\varepsilon}. \qquad (1)$$

Choose $k_h$ such that (1) is $\Theta(s_\varepsilon)$.

The total size of the packed structures in B is bounded by

$$\sum_{h=1}^{\log B} \frac{B s_\varepsilon}{k_h} \le O(B s_\varepsilon).$$

SLIDE 67

Optimal Data Structure - External Memory

Theorem. For any (1/2)-exponentially decomposable summary, a database D of N records can be stored in an external memory index of linear size so that a summary query can be answered in O(logB N + sε/B) I/Os.

SLIDE 69

Exponentially Decomposable vs. Decomposable

Exponentially decomposable summaries: heavy hitters, quantiles, Count-Min sketch

Internal memory: query cost O(log N + sε); space O(N)
External memory: query cost O(logB N + sε/B); space O(N)

SLIDE 72

Exponentially Decomposable vs. Decomposable

Decomposable summaries: AMS sketch, wavelets

Internal memory: query cost O(sε log N); space O(N)
External memory: query cost O((sε/B) log N) for sε ≥ B, and O(log N / log(B/sε)) for sε < B; space O(N)

Can we improve?

SLIDE 76

Open Problems

Are the structures practical?

Multiple query attributes:
(Q4) Return a summary on the household income distribution for the area within 50 miles of Washington, DC.

Multiple summary attributes:
(Q5) What is the geographical distribution of households with annual income below $50,000?
Geometric summaries: clustering, ε-approximations

Joins? General SQL queries?

SLIDE 77

Thank you!