Beyond Simple Aggregates: Indexing for Summary Queries Zhewei Wei - PowerPoint PPT Presentation


  1. Beyond Simple Aggregates: Indexing for Summary Queries. Zhewei Wei and Ke Yi, Hong Kong University of Science and Technology.

  5. Reporting vs. Aggregation. Reporting query: SELECT salary FROM Table T WHERE 30 < age < 40, which returns 50,000 records ($32,000, $76,300, $54,400, ..., $68,000, $28,000). Aggregation query: SELECT AVG(salary) FROM Table T WHERE 30 < age < 40, which returns a single value ($52,312).

  6. Reporting vs. Aggregation. [Figure: for the same two queries, a plot of # of employees against Salary.]

  8. Reporting vs. Aggregation. Search engine log (Date, Keyword): 2011.04.08, Masters 2011; 2011.04.08, Libya; 2011.04.07, Japan nuclear crisis; 2011.04.07, Libya; ...; 2011.03.11, Japan earthquake; 2011.03.11, Japan tsunami; 2011.03.10, NCAA; ... Summary of the log (Keyword, Frequency): Libya, 19.3%; Japan nuclear crisis, 16.5%; Japan earthquake, 10.2%; ...

  10. Summary Queries. Let $D$ be a database containing $N$ records. Each record $p \in D$ is associated with a query attribute $A_q(p)$ (age) and a summary attribute $A_s(p)$ (salary). A summary query specifies a range constraint $[q_1, q_2]$ on $A_q$, and the database returns a summary on the $A_s$ attribute of all records whose $A_q$ attribute is within the range.
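To make the definition concrete, here is a minimal Python sketch of a summary query answered without any summary index: it locates the range by binary search and then scans every matching record. The function names and the sample ages are assumptions made for illustration, not part of the slides.

```python
from bisect import bisect_left, bisect_right

def summary_query(records, q1, q2, summarize):
    """Answer a summary query on [q1, q2] by scanning the matching records.

    records: list of (query_attr, summary_attr) pairs sorted by query_attr.
    summarize: any function mapping a list of summary-attribute values to a summary.
    """
    keys = [a_q for a_q, _ in records]
    lo = bisect_left(keys, q1)    # first record with A_q >= q1
    hi = bisect_right(keys, q2)   # one past the last record with A_q <= q2
    return summarize([a_s for _, a_s in records[lo:hi]])

def mean(values):
    return sum(values) / len(values) if values else None

# Hypothetical (age, salary) records; the ages are made up for illustration.
records = sorted([(25, 28000), (31, 32000), (35, 76300), (38, 54400), (45, 68000)])
avg_salary = summary_query(records, 31, 39, mean)
```

The scan costs time proportional to the number of matching records, which is the cost the indexing structures in the rest of the talk aim to avoid.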

  12. Summary Queries. Data summarization techniques: heavy hitters (a.k.a. frequent items) [MG 82] [MAA 06] ...; quantiles [MP 80] [GK 01] ...; histograms [PHIJ 96] [JKMPSS 98] [GGIKMS 02] ...; wavelets [MVW 98] [VM 99] [GKMS 01] ...; various sketches ([AMS 99], Count-Min [CM 05], ...); ... Past research focuses on computing summaries on the whole data set, either offline or streaming.

  15. Algorithm Problem vs. Data Structure Problem. The algorithm problem: space is $O(N)$ offline and sublinear for streaming; time is $\tilde{O}(N)$, or sublinear when sampling works. The data structure problem: space is $O(N)$, since the data must be stored; preprocessing time is less important; query time is $O(\log N + s_\varepsilon)$ in internal memory and $O(\log_B N + s_\varepsilon / B)$ in external memory, where $s_\varepsilon$ is the summary size and $B$ is the block size.

  17. Quantile Summaries. $\varphi$-quantile: the value ranked at $\varphi|D|$ in $D$. $\varepsilon$-approximate $\varphi$-quantile: any value whose rank is in $[(\varphi - \varepsilon)|D|, (\varphi + \varepsilon)|D|]$. Quantile summary: for any $0 < \varphi < 1$, an $\varepsilon$-approximate $\varphi$-quantile can be extracted. [Figure: # of employees against Salary, marked at min, 20%, 40%, 60%, 80%, and max.]

  19. Quantile Summaries. [Figure: example data values, with a span labeled $\varepsilon|D|$ values.] Size: $s_\varepsilon = \Theta(1/\varepsilon)$; Error: $\varepsilon|D|$.
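One simple way to obtain a quantile summary with these size and error bounds is to sort the data and keep every $\lfloor \varepsilon|D| \rfloor$-th value. The sketch below is an illustration under that assumption (and assumes $\varepsilon|D| \ge 1$); it is not necessarily the construction used in the talk, and the function names are hypothetical.

```python
import math
import random

def build_quantile_summary(data, eps):
    """Keep the values at ranks step, 2*step, ... where step = floor(eps * n)."""
    ordered = sorted(data)
    n = len(ordered)
    step = max(1, math.floor(eps * n))
    return {"n": n, "step": step, "samples": ordered[step - 1::step]}

def query_quantile(summary, phi):
    """Return a stored value whose rank is within one step of phi * n."""
    target_rank = phi * summary["n"]
    idx = round(target_rank / summary["step"]) - 1
    idx = min(max(idx, 0), len(summary["samples"]) - 1)  # clamp to stored samples
    return summary["samples"][idx]

random.seed(0)
data = list(range(1, 1001))   # value == rank, so errors are easy to read off
random.shuffle(data)
eps = 0.05
summary = build_quantile_summary(data, eps)
median_estimate = query_quantile(summary, 0.5)
```

The summary holds roughly $1/\varepsilon$ values, and any returned value is at most one step, that is at most $\varepsilon|D|$ ranks, away from the requested rank.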

  23. A Baseline Solution. Decomposable summaries: given an $\varepsilon$-summary for each of $D_1, D_2, \ldots, D_t$, the summaries can be combined into an $\varepsilon$-summary for $D = D_1 \uplus \cdots \uplus D_t$. Error: $\varepsilon|D_1| + \cdots + \varepsilon|D_t| = \varepsilon|D|$.
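The additive error bound can be checked numerically. The following sketch assumes a simple weighted-sample representation in which each kept value stands for `step` original values; merging is plain concatenation, so unlike a real decomposable summary the merged result is not re-compressed to size $s_\varepsilon$. All names are hypothetical.

```python
import math
import random

def build_summary(data, eps):
    """Weighted samples: every step-th sorted value, each carrying weight step."""
    ordered = sorted(data)
    step = max(1, math.floor(eps * len(ordered)))
    return [(v, step) for v in ordered[step - 1::step]]

def merge_summaries(summaries):
    """Combine summaries of disjoint parts by pooling their weighted samples."""
    merged = [pair for s in summaries for pair in s]
    merged.sort()
    return merged

def estimate_rank(summary, x):
    """Estimated number of original values <= x."""
    return sum(w for v, w in summary if v <= x)

random.seed(0)
eps = 0.05
parts = [[random.randint(0, 999) for _ in range(n)] for n in (400, 250, 350)]
merged = merge_summaries(build_summary(p, eps) for p in parts)
all_data = [v for p in parts for v in p]
max_err = max(
    abs(estimate_rank(merged, x) - sum(1 for v in all_data if v <= x))
    for x in range(0, 1000, 7)
)
```

Each part contributes a rank error below its own step, which is at most $\varepsilon|D_i|$, so `max_err` stays within $\varepsilon|D|$ for the union.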

  24. A Baseline Solution. [Figure: $\varepsilon$-summaries and the query range.]

  25. Query Cost. The query range yields $\log N$ sorted lists, each of size $s_\varepsilon$. $\log N$-way merging: $O(s_\varepsilon \log N \log\log N)$.

  28. A Baseline Solution. Internal memory. Query time: $O(s_\varepsilon \log N \log\log N)$. Space: $O(N s_\varepsilon)$, reduced to $O(N)$ with fat leaves of size $s_\varepsilon$.
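A rough Python sketch of such a baseline follows, assuming a binary tree over the records sorted by query attribute, one $\varepsilon$-summary per internal node, and fat leaves that keep raw records. The class name, the weighted-sample summary, and the final sort (used here instead of a multi-way merge) are simplifications for illustration, not the authors' implementation.

```python
import math
import random

def build_summary(values, eps):
    """Weighted samples: every step-th sorted value, each carrying weight step."""
    ordered = sorted(values)
    step = max(1, math.floor(eps * len(ordered)))
    return [(v, step) for v in ordered[step - 1::step]]

def estimate_rank(summary, x):
    return sum(w for v, w in summary if v <= x)

class BaselineIndex:
    def __init__(self, records, eps, leaf_size):
        self.records = sorted(records)   # (query_attr, summary_attr) pairs
        self.eps = eps
        self.leaf_size = leaf_size
        self.summaries = {}              # (lo, hi) -> summary of records[lo:hi]
        self._build(0, len(self.records))

    def _build(self, lo, hi):
        if hi - lo <= self.leaf_size:
            return                       # fat leaf: raw records only
        values = [s for _, s in self.records[lo:hi]]
        self.summaries[(lo, hi)] = build_summary(values, self.eps)
        mid = (lo + hi) // 2
        self._build(lo, mid)
        self._build(mid, hi)

    def query(self, q1, q2):
        out = []
        self._collect(0, len(self.records), q1, q2, out)
        return sorted(out)

    def _collect(self, lo, hi, q1, q2, out):
        first, last = self.records[lo][0], self.records[hi - 1][0]
        if last < q1 or first > q2:
            return                                    # node misses the range
        if hi - lo <= self.leaf_size:                 # scan a fat leaf exactly
            out.extend((s, 1) for a, s in self.records[lo:hi] if q1 <= a <= q2)
        elif q1 <= first and last <= q2:              # node fully inside the range
            out.extend(self.summaries[(lo, hi)])
        else:
            mid = (lo + hi) // 2
            self._collect(lo, mid, q1, q2, out)
            self._collect(mid, hi, q1, q2, out)

random.seed(1)
records = [(random.randint(20, 65), random.randint(20000, 90000)) for _ in range(2000)]
index = BaselineIndex(records, eps=0.05, leaf_size=20)
result = index.query(30, 40)
matching = [s for a, s in records if 30 <= a <= 40]
```

Because every fully covered node contributes an error of at most $\varepsilon$ times its own size and fat leaves are exact, rank estimates on `result` stay within $\varepsilon$ times the number of matching records.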

  29. Optimal Data Structure. [Figure: the query range is covered by summaries $S(\varepsilon, D_1)$, $S(\frac{3}{2}\varepsilon, D_2)$, $S((\frac{3}{2})^2\varepsilon, D_3)$.]

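  31. Optimal Data Structure. Quantile summary $S(\varepsilon, D)$: an $\varepsilon$-quantile summary for data set $D$. Size: $\Theta(1/\varepsilon)$; Error: $\varepsilon|D|$.

     | Data set | Data size | Error param. | Summary size | Absolute error |
     |---|---|---|---|---|
     | $D_1$ | $k$ | $\varepsilon$ | $\frac{1}{\varepsilon}$ | $\varepsilon k$ |
     | $D_2$ | $\frac{k}{2}$ | $\frac{3}{2}\varepsilon$ | $\frac{2}{3} \cdot \frac{1}{\varepsilon}$ | $\frac{3}{4}\varepsilon k$ |
     | $D_3$ | $\frac{k}{4}$ | $(\frac{3}{2})^2\varepsilon$ | $(\frac{2}{3})^2 \cdot \frac{1}{\varepsilon}$ | $(\frac{3}{4})^2\varepsilon k$ |
     | $\cdots$ | | | | |
     | $D_t$ | $\frac{k}{2^{t-1}}$ | $(\frac{3}{2})^{t-1}\varepsilon$ | $(\frac{2}{3})^{t-1} \cdot \frac{1}{\varepsilon}$ | $(\frac{3}{4})^{t-1}\varepsilon k$ |
     | $D$ | $\Theta(k)$ | | $O(\frac{1}{\varepsilon})$ | $O(\varepsilon k)$ |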
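The totals in the last row follow from geometric series with ratios $\frac{1}{2}$, $\frac{2}{3}$, and $\frac{3}{4}$. A short Python check with exact fractions is sketched below; the function name and the chosen values of $k$, $\varepsilon$, and $t$ are arbitrary assumptions.

```python
from fractions import Fraction

def level_table(k, eps, t):
    """Rows for D_1..D_t: data halves, error parameter grows by 3/2 per level."""
    rows = []
    for i in range(t):
        data_size = k * Fraction(1, 2) ** i
        error_param = eps * Fraction(3, 2) ** i
        rows.append({
            "data_size": data_size,
            "error_param": error_param,
            "summary_size": 1 / error_param,        # size is 1 / error parameter
            "abs_error": error_param * data_size,   # error parameter times |D_i|
        })
    return rows

k, eps = 1024, Fraction(1, 100)
rows = level_table(k, eps, 10)
total_size = sum(r["data_size"] for r in rows)
total_summary = sum(r["summary_size"] for r in rows)
total_error = sum(r["abs_error"] for r in rows)
```

The sums are bounded by $2k$, $3/\varepsilon$, and $4\varepsilon k$ respectively, regardless of $t$, which is why the combined summary has size $O(1/\varepsilon)$ and error $O(\varepsilon k)$.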

  33. Optimal Data Structure. [Figure: over the query range, an $\varepsilon$-summary, a $(\frac{3}{2}\varepsilon)$-summary, a $((\frac{3}{2})^2\varepsilon)$-summary, and so on.]

  37. Query Cost. The query range yields $\log N$ sorted lists [figure label: $s_\varepsilon$]. $\log N$-way merging: $\Theta(s_\varepsilon \log\log N)$. Bottom-up two-way merging: $O(s_\varepsilon)$.
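The linear bound for bottom-up two-way merging can be illustrated as follows, assuming the list sizes shrink geometrically (here by a factor of $\frac{2}{3}$, matching the summary sizes at successive levels). The sketch merges the smallest lists first and counts the elements written; the names and sizes are assumptions for illustration.

```python
import random

def merge_two(a, b):
    """Standard two-way merge of two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def bottom_up_merge(lists):
    """lists[0] is the largest list; merge from the smallest list upward."""
    acc, work = [], 0
    for lst in reversed(lists):
        acc = merge_two(acc, lst)
        work += len(acc)          # elements written by this merge step
    return acc, work

random.seed(0)
sizes = [243, 162, 108, 72, 48, 32]   # each list is 2/3 the size of the previous
lists = [sorted(random.random() for _ in range(n)) for n in sizes]
merged, work = bottom_up_merge(lists)
```

With ratio $\frac{2}{3}$, the accumulated list at each step is at most three times the size of the list being merged in, so the total work is at most nine times the largest list, independent of the number of lists.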

  39. $\alpha$-Exponentially Decomposable. For multisets $D_1, \ldots, D_t$ with $F_1(D_i) \le \alpha^{i-1} F_1(D_1)$, there exists a constant $c$ such that, given $S(\varepsilon, D_1), S(c\varepsilon, D_2), \ldots, S(c^{t-1}\varepsilon, D_t)$: (1) we can construct an $O(\varepsilon)$-summary for $D_1 \uplus \cdots \uplus D_t$; (2) the total size of $S(\varepsilon, D_1), \ldots, S(c^{t-1}\varepsilon, D_t)$ is $O(s_\varepsilon)$ and they can be combined in $O(s_\varepsilon)$ time; (3) the total size of $S(\varepsilon, D), \ldots, S(c^{t-1}\varepsilon, D)$ is $O(s_\varepsilon)$. Theorem: for any $(1/2)$-exponentially decomposable summary, a database $D$ of $N$ records can be stored in an internal memory structure of linear size so that a summary query can be answered in $O(\log N + s_\varepsilon)$ time.

  41. Optimal Data Structure - External Memory. Standard B-tree blocking with fat leaves. [Figure labels: $O(\log B)$, $\Theta(B)$; leaf size: $s_\varepsilon$.]

  42. Query Path. [Figure: a query path from $u$ to $v$ with node labels $v_0, r_1, w_1, v_1, r_2, w_2, v_2, r_3$.]

  45. Summary Set. $R(u, v) = \{w_1, w_2, w_3\}$. $RS(u, v, \varepsilon)$: $S(\varepsilon, w_1)$, $S(c\varepsilon, w_2)$, $S(c^3\varepsilon, w_3)$. [Figure: path from $u$ to $v$ with nodes $w_1, w_2, w_3$.]

  51. Focus on a Block. [Figure: a block with root $r_B$ and nodes $u$, $v_1$, $v_2$.] Case 1: $RS(u, v_1, \varepsilon)$; size: $s_\varepsilon B \log B$. Case 2: $RS(r_B, v_2, \varepsilon)$, $RS(r_B, v_2, c\varepsilon)$, $\ldots$; size: $s_\varepsilon B$. Case 3: $S(r_B, \varepsilon)$, $S(r_B, c\varepsilon)$, $S(r_B, c^2\varepsilon)$, $\ldots$
