Beyond Simple Aggregates: Indexing for Summary Queries
Zhewei Wei and Ke Yi
Hong Kong University of Science and Technology

Reporting vs. Aggregation

Reporting query:
  SELECT salary FROM T WHERE age > 30 AND age < 40
  Result: 50,000 records ($32,000, $76,300, $54,400, ..., $68,000, $28,000)

Aggregation query:
  SELECT AVG(salary) FROM T WHERE age > 30 AND age < 40
  Result: $52,312

[Figure: salary distribution; x-axis: Salary, y-axis: # of employees]

Reporting vs. Aggregation

Search Engine Log

Date        Keyword
2011.04.08  Masters 2011
2011.04.08  Libya
2011.04.07  Japan nuclear crisis
2011.04.07  Libya
...
2011.03.11  Japan earthquake
2011.03.11  Japan tsunami
2011.03.10  NCAA
...

Keyword               Frequency
Libya                 19.3%
Japan nuclear crisis  16.5%
Japan earthquake      10.2%
...

Summary Queries

Let D be a database containing N records. Each record p ∈ D is associated with a query attribute A_q(p) (e.g., age) and a summary attribute A_s(p) (e.g., salary).

A summary query specifies a range constraint [q_1, q_2] on A_q, and the database returns a summary of the A_s attribute over all records whose A_q attribute falls within the range.
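To make the definition concrete, here is a minimal brute-force sketch of a summary query. It is not the indexing method of the talk: it scans every record, and the names `summary_query` and `three_point_summary`, as well as the toy (min, median, max) summary, are assumptions chosen only for illustration.

```python
def summary_query(records, q1, q2, summarize):
    """Apply `summarize` to A_s of every record whose A_q lies in [q1, q2].

    `records` is a list of (A_q, A_s) pairs, e.g. (age, salary).
    This is a linear scan, so it costs O(N) per query.
    """
    selected = [a_s for (a_q, a_s) in records if q1 <= a_q <= q2]
    return summarize(selected)


def three_point_summary(values):
    """A toy summary (hypothetical): (min, median, max) of the values."""
    ordered = sorted(values)
    if not ordered:
        return None
    return ordered[0], ordered[len(ordered) // 2], ordered[-1]


# Hypothetical (age, salary) records.
records = [(25, 28000), (31, 32000), (35, 76300), (38, 54400), (45, 68000)]
result = summary_query(records, 30, 40, three_point_summary)
```

The point of an index is to return the same kind of answer without touching all N records.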

Summary Queries

Data summarization techniques:
  Heavy hitters (a.k.a. frequent items) [MG 82] [MAA 06] ...
  Quantiles [MP 80] [GK 01] ...
  Histograms [PHIJ 96] [JKMPSS 98] [GGIKMS 02] ...
  Wavelets [MVW 98] [VM 99] [GKMS 01] ...
  Various sketches ([AMS 99], Count-Min [CM 05], ...)
  ...

Past research focuses on computing summaries of the whole data set, either offline or in a streaming fashion.

Algorithm Problem vs. Data Structure Problem

         The algorithm problem          The data structure problem
Space    offline: O(N)                  O(N): data must be stored
         streaming: sublinear
Time     Õ(N)                           preprocessing time: less important
         sublinear when sampling works  query time:
                                          O(log N + s_ε)      internal mem
                                          O(log_B N + s_ε/B)  external mem

s_ε: summary size; B: block size

Quantile Summaries

φ-quantile: the value ranked at φ|D| in D.
ε-approximate φ-quantile: any value whose rank is in [(φ − ε)|D|, (φ + ε)|D|].
Quantile summary: a structure from which an ε-approximate φ-quantile can be extracted for any 0 < φ < 1.

[Figure: salary distribution (x-axis: Salary, y-axis: # of employees) with min, 20%, 40%, 60%, 80%, and max marked]

Quantile Summaries

[Figure: example values with a span of ε|D| values marked]

Size: s_ε = Θ(1/ε); Error: ε|D|
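One simple way to obtain a summary with this size and error is to sort the data and keep one value out of every ε|D|. The sketch below assumes that offline construction; it is an illustration, not necessarily the construction used in the talk, and the function names are hypothetical.

```python
import math


def build_quantile_summary(values, eps):
    """Keep the values at ranks step, 2*step, ... where step = floor(eps * n).

    Returns a list of (rank, value) pairs with 1-based ranks, so the
    summary holds about 1/eps entries.
    """
    ordered = sorted(values)
    n = len(ordered)
    step = max(1, math.floor(eps * n))
    return [(rank, ordered[rank - 1]) for rank in range(step, n + 1, step)]


def query_quantile(summary, n, phi):
    """Return the stored value whose rank is closest to phi * n."""
    target = phi * n
    rank, value = min(summary, key=lambda entry: abs(entry[0] - target))
    return value


data = list(range(1, 101))      # the value v has rank v
eps = 0.25
summary = build_quantile_summary(data, eps)
median = query_quantile(summary, len(data), 0.5)
```

Because consecutive stored ranks are at most ε|D| apart, the returned value is always within ε|D| ranks of the requested one.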

A Baseline Solution

Decomposable summaries:
  (ε-summary of D_1) + (ε-summary of D_2) + ... + (ε-summary of D_t) = ε-summary of D,
  where D = D_1 ⊎ ... ⊎ D_t
  Error: ε|D_1| + ... + ε|D_t| = ε|D|
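The additive error bound can be checked on a toy weighted summary. In the sketch below, which is an assumption made for illustration, each kept value carries a weight equal to the number of sorted values it stands for, and merging is a plain sorted union; `summarize`, `merge`, and `estimated_rank` are hypothetical names.

```python
import math


def summarize(values, eps):
    """Toy eps-summary: the maximum of each chunk of floor(eps * n) sorted
    values, paired with the chunk length as its weight."""
    ordered = sorted(values)
    step = max(1, math.floor(eps * len(ordered)))
    summary = []
    for start in range(0, len(ordered), step):
        chunk = ordered[start:start + step]
        summary.append((chunk[-1], len(chunk)))
    return summary


def merge(summaries):
    """Combine summaries of D_1, ..., D_t into one summary of their union."""
    return sorted(pair for s in summaries for pair in s)


def estimated_rank(summary, x):
    """Estimated number of elements <= x (never an overestimate)."""
    return sum(weight for value, weight in summary if value <= x)


eps = 0.125
parts = [list(range(0, 80)), list(range(40, 200)), list(range(300, 380))]
merged = merge([summarize(p, eps) for p in parts])
total = sum(len(p) for p in parts)
```

Each part contributes a rank error below ε|D_i|, so the merged summary is off by at most ε|D| in total.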

A Baseline Solution

[Figure: ε-summaries and the query range]

Query Cost

[Figure: log N sorted lists, labeled s_ε]
log N-way merging: O(s_ε log N log log N)

A Baseline Solution

Internal memory
  Query time: O(s_ε log N log log N)
  Space: O(N s_ε) as is; O(N) with fat leaves of size s_ε
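A minimal sketch of such a baseline, assuming a balanced binary tree over the records sorted by A_q with a summary stored at every node. It omits the fat-leaf optimization, so it uses more than linear space, and the class name `BaselineIndex` and the toy `summarize` are hypothetical. A query collects the O(log N) maximal nodes inside the range and merges their sorted summaries with a multi-way merge.

```python
import heapq
import math
from bisect import bisect_left, bisect_right


def summarize(values, eps):
    """Toy weighted eps-summary: (chunk maximum, chunk length) pairs."""
    ordered = sorted(values)
    n = len(ordered)
    step = max(1, math.floor(eps * n))
    return [(ordered[min(i + step, n) - 1], min(step, n - i))
            for i in range(0, n, step)]


class BaselineIndex:
    def __init__(self, records, eps):
        records = sorted(records)                 # sort by A_q
        self.keys = [a_q for a_q, _ in records]
        self.vals = [a_s for _, a_s in records]
        self.eps = eps
        self.node_summary = {}                    # (lo, hi) -> summary of vals[lo:hi]
        self._build(0, len(records))

    def _build(self, lo, hi):
        if hi <= lo:
            return
        self.node_summary[(lo, hi)] = summarize(self.vals[lo:hi], self.eps)
        if hi - lo > 1:
            mid = (lo + hi) // 2
            self._build(lo, mid)
            self._build(mid, hi)

    def _cover(self, lo, hi, a, b, out):
        """Collect the maximal tree nodes [lo, hi) lying inside [a, b)."""
        if hi <= lo or b <= lo or hi <= a:
            return
        if a <= lo and hi <= b:
            out.append((lo, hi))
            return
        mid = (lo + hi) // 2
        self._cover(lo, mid, a, b, out)
        self._cover(mid, hi, a, b, out)

    def query(self, q1, q2):
        a = bisect_left(self.keys, q1)
        b = bisect_right(self.keys, q2)
        nodes = []
        self._cover(0, len(self.keys), a, b, nodes)
        lists = [self.node_summary[n] for n in nodes]
        # Multi-way merge of the per-node sorted lists, keyed on the value.
        merged = list(heapq.merge(*lists, key=lambda pair: pair[0]))
        return merged, nodes


records = [(age, 1000 * ((age * 37) % 64)) for age in range(64)]
index = BaselineIndex(records, eps=0.25)
merged, nodes = index.query(10, 40)
```

The heap-based merge is where the log log N factor in the query time comes from: each of the merged elements passes through a heap holding one entry per list.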

Optimal Data Structure

[Figure: summaries S(ε, D_1), S((3/2)ε, D_2), S((3/2)^2 ε, D_3) over the query range]

Optimal Data Structure

Quantile summary S(ε, D): an ε-quantile summary for data set D. Size: Θ(1/ε); Error: ε|D|.

Data set  Data size    Error param.     Summary size         Absolute error
D_1       k            ε                1/ε                  εk
D_2       k/2          (3/2)ε           (2/3)(1/ε)           (3/4)εk
D_3       k/4          (3/2)^2 ε        (2/3)^2 (1/ε)        (3/4)^2 εk
...
D_t       k/2^(t-1)    (3/2)^(t-1) ε    (2/3)^(t-1) (1/ε)    (3/4)^(t-1) εk
D         Θ(k)                          O(1/ε)               O(εk)
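The totals in the last row follow from geometric series. As a worked check, assuming for simplicity that a summary with error parameter ε' has size exactly 1/ε' (the slides only state Θ(1/ε')):

```latex
\begin{align*}
|D| &= \sum_{i=1}^{t} \frac{k}{2^{\,i-1}} < 2k = \Theta(k), \\
\text{total summary size} &= \sum_{i=1}^{t} \left(\tfrac{2}{3}\right)^{i-1} \frac{1}{\varepsilon}
  < \frac{1}{\varepsilon} \cdot \frac{1}{1 - \tfrac{2}{3}} = \frac{3}{\varepsilon} = O\!\left(\tfrac{1}{\varepsilon}\right), \\
\text{total absolute error} &= \sum_{i=1}^{t} \left(\tfrac{3}{2}\right)^{i-1} \varepsilon \cdot \frac{k}{2^{\,i-1}}
  = \sum_{i=1}^{t} \left(\tfrac{3}{4}\right)^{i-1} \varepsilon k
  < 4 \varepsilon k = O(\varepsilon k).
\end{align*}
```

The error parameter grows by a factor of 3/2 per level while the data size halves, so both the summary sizes and the absolute errors shrink geometrically.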

Optimal Data Structure

[Figure: ε-summary, ((3/2)ε)-summary, ((3/2)^2 ε)-summary, ... over the query range]

Query Cost

[Figure: log N sorted lists, labeled s_ε]
log N-way merging: Θ(s_ε log log N)
Bottom-up two-way merging: O(s_ε)
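A sketch of why bottom-up two-way merging is linear, assuming the list sizes shrink by a factor of 2/3 from one list to the next as in the summary-size column of the table. Merging starts from the smallest lists, so the large lists are touched only a few times. The function names and the example sizes are hypothetical.

```python
def merge_two(a, b):
    """Standard two-way merge of two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def bottom_up_merge(lists):
    """Merge lists given from largest to smallest, starting at the smallest.

    Returns the merged list and the total number of elements written,
    which is a proxy for the running time.
    """
    acc, work = [], 0
    for lst in reversed(lists):
        acc = merge_two(acc, lst)
        work += len(acc)
    return acc, work


sizes = [81, 54, 36, 24, 16]                       # each is 2/3 of the previous
lists = [list(range(i, i + 5 * s, 5)) for i, s in enumerate(sizes)]
merged, work = bottom_up_merge(lists)
```

With ratio 2/3, the i-th largest list is rewritten i times, so the total work is at most s_ε · Σ i (2/3)^(i-1) ≤ 9 s_ε regardless of the number of lists.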

α-Exponentially Decomposable

For multisets D_1, ..., D_t with F_1(D_i) ≤ α^(i-1) F_1(D_1), there exists a constant c such that, given S(ε, D_1), S(cε, D_2), ..., S(c^(t-1) ε, D_t):
  We can construct an O(ε)-summary for D_1 ⊎ ... ⊎ D_t.
  The total size of S(ε, D_1), ..., S(c^(t-1) ε, D_t) is O(s_ε), and they can be combined in O(s_ε) time.
  The total size of S(ε, D), ..., S(c^(t-1) ε, D) is O(s_ε).

Theorem: For any (1/2)-exponentially decomposable summary, a database D of N records can be stored in an internal memory structure of linear size so that a summary query can be answered in O(log N + s_ε) time.

Optimal Data Structure - External Memory

Standard B-tree blocking with fat leaves
[Figure: tree partitioned into blocks; labels: O(log B), Θ(B)]
Leaf size: s_ε

Query Path

[Figure: query path from u to v; node labels v_0, r_1, w_1, v_1, r_2, w_2, v_2, r_3]

Summary Set

[Figure: path from u to v with nodes w_1, w_2, w_3]
R(u, v) = {w_1, w_2, w_3}
RS(u, v, ε): S(ε, w_1), S(cε, w_2), S(c^3 ε, w_3)

Focus on a Block

[Figure: a block with root r_B and nodes u, v_1, v_2]
Case 1. RS(u, v_1, ε). Size: s_ε B log B
Case 2. RS(r_B, v_2, ε), RS(r_B, v_2, cε), ... Size: s_ε B
Case 3. S(r_B, ε), S(r_B, cε), S(r_B, c^2 ε), ...
