VLDB 2002 A One-Pass Aggregation Algorithm with the Optimal Buffer - PowerPoint PPT Presentation

VLDB 2002
A One-Pass Aggregation Algorithm with the Optimal Buffer Size in Multidimensional OLAP
September 23, 2002
Young-Koo Lee, Kyu-Young Whang, Yang-Sae Moon, and Il-Yeol Song
Department of Computer Science and Advanced Information Technology Research Center (AITrc)
KAIST, Korea

Overview
• Introduction
• Motivation and Goals
• Computation Model Based on Disjoint-Inclusive Partition
• One-Pass Aggregation Algorithm and Its Optimality
• Experimental Results
• Conclusions

On-Line Analytical Processing: OLAP
• OLAP is a database application that allows users to easily analyze large volumes of data in order to extract information necessary for decision making
• Example: Customer Data Analysis
  [Figure: a three-dimensional cube of Sales values (in cells) with axes Age (10 to 50), Income (10,000 to 40,000), and Year (1998 to 2001)]
  • Query example: Find the total sales for each age
• Multidimensional OLAP: MOLAP
  • Uses multidimensional files as storage structures

Aggregation
• Definition (Aggregation): An operation that classifies records into groups and determines one value per group by applying the given aggregate function [Graefe 93]
  • Grouping attributes: the attributes used for grouping
  • Aggregated attribute: the attribute to which the aggregate function is applied
• Examples:
  • Find the total sales
  • Find the total sales for each year
• OLAP queries make heavy use of aggregation for summarizing data
• Since computing aggregation is very expensive, good aggregation algorithms are crucial for achieving performance in OLAP systems
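The definition above can be illustrated with a minimal Python sketch. It assumes records are plain dictionaries; the function name `aggregate` and the sample sales data are hypothetical and do not come from the slides.

```python
from collections import defaultdict

def aggregate(records, grouping_attrs, aggregated_attr, func=sum):
    """Classify records into groups and compute one value per group."""
    groups = defaultdict(list)
    for rec in records:
        # The group key is the tuple of grouping-attribute values.
        key = tuple(rec[a] for a in grouping_attrs)
        groups[key].append(rec[aggregated_attr])
    return {key: func(values) for key, values in groups.items()}

# Hypothetical sample data, for illustration only.
sales = [
    {"year": 2000, "age": 20, "sales": 10},
    {"year": 2000, "age": 30, "sales": 20},
    {"year": 2001, "age": 20, "sales": 30},
]

total_sales = aggregate(sales, [], "sales")            # "Find the total sales"
sales_per_year = aggregate(sales, ["year"], "sales")   # "... for each year"
```

With no grouping attributes, all records fall into a single group keyed by the empty tuple, which corresponds to the first example query.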

Terminology
• Organizing attributes: a subset of attributes that determines the placement of records in the multidimensional file (i.e., attributes that correspond to dimensions)
• Domain: a set of values from which an attribute value can be drawn
• Domain space: the Cartesian product of all domains
• Page region: a region associated with a disk page
• Grouping domain space: the Cartesian product of the domains of all the grouping attributes
• Grouping region: any subset of the grouping domain space
• Page grouping region: the projection of the page region onto the grouping domain space

Related Work
• Aggregation using multidimensional arrays [Zhao et al. 97]
  • Stores data in a multidimensional array
  • Computes aggregation by accessing records in the unit of a page along the line perpendicular to the axis of the grouping attribute
  • Example: Aggregation of Y values for each X value in the two-dimensional X-Y space
    [Figure: a grid of cells over X (x0 to x8) and Y (y0 to y8) with page regions marked; (a) accessing in the unit of cells, (b) accessing in the unit of pages]
  • Is not efficient for skewed distributions

Our Approach
• To use a dynamic multidimensional file that handles skewed distributions efficiently

A Naïve Aggregation Method Using a Dynamic Multidimensional File
[Figure: aggregation of Z values for each pair of X and Y values in a three-dimensional space; pages A to F over axes X, Y, and Z (each 0 to 99), with aggregation windows W1 to W4 in the X-Y plane]
• The aggregation is computed as the union of partial aggregations, each of which is computed for an aggregation window
• Definition: Aggregation windows are grouping regions that form a partition of the grouping domain space and that are used to compute aggregation
• Partial aggregation for an aggregation window is computed by retrieving records through a range query against the multidimensional file

Problems
[Figure: aggregation of Z values for each pair of X and Y values in a three-dimensional space; pages A to F over axes X, Y, and Z (each 0 to 99), with aggregation windows W1 to W4 in the X-Y plane]
• The pages having large regions are accessed multiple times
• Example:
  • Page F (marked in blue) is accessed twice since its page grouping region overlaps with two aggregation windows, W3 and W4
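The repeated access can be reproduced with a small Python sketch of the naïve method. The exact coordinates of the figure are not recoverable from the slide text, so the window and page layout below is a hypothetical one chosen to mirror the example (page F spans W3 and W4); regions are half-open rectangles in the X-Y grouping domain space, and `naive_access_counts` is an illustrative name.

```python
from collections import Counter

def overlaps(a, b):
    """True if two half-open rectangles ((x_lo, x_hi), (y_lo, y_hi)) intersect."""
    return all(lo1 < hi2 and lo2 < hi1 for (lo1, hi1), (lo2, hi2) in zip(a, b))

# Hypothetical aggregation windows partitioning the X-Y grouping domain space.
windows = {
    "W1": ((0, 50), (50, 100)),
    "W2": ((0, 50), (0, 50)),
    "W3": ((50, 100), (50, 100)),
    "W4": ((50, 100), (0, 50)),
}

# Hypothetical page grouping regions (page regions projected onto X-Y).
page_grouping_regions = {
    "A": ((0, 50), (50, 100)),
    "B": ((0, 50), (50, 100)),   # same projection as A, different Z range
    "C": ((0, 50), (0, 50)),
    "D": ((50, 100), (50, 100)),
    "E": ((50, 100), (0, 50)),
    "F": ((50, 100), (0, 100)),  # large page spanning W3 and W4
}

def naive_access_counts(windows, page_grouping_regions):
    """One range query per window; every overlapping page is read from disk."""
    counts = Counter()
    for window in windows.values():
        for page, region in page_grouping_regions.items():
            if overlaps(region, window):
                counts[page] += 1
    return counts

counts = naive_access_counts(windows, page_grouping_regions)
```

In this layout every page is read once except F, which is read twice, so six pages cost seven disk accesses when no buffer is used.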

Solution
• Use a buffer
• Control the order of accessing pages to maximize the buffering effect

Buffer Replacement Policies
• When the order of accessing pages is unknown
  • The common strategy is to select the page that has the longest expected time until the next access
  • Examples: LRU [Coffman et al. 73], CLOCK [Effelsberg et al. 84], LRU-k [O’Neil et al. 93]
• When the order of accessing pages is known in advance
  • Belady’s B0 [Coffman et al. 73]: selects as a victim the page that has the longest time until the next access
    − Proven to be the optimal buffer replacement policy
  • Toss-Immediate [Korth et al. 91]: upon each page access, immediately invalidates the page that will not be used further
• Since the order of accessing pages is not known a priori in general, Belady’s B0 and Toss-Immediate policies have been known to lack practicality
• Nevertheless, in this paper, we show that these policies can be effectively used for aggregation computation
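To make the contrast concrete, here is a minimal Python sketch that counts page faults under Belady's B0 and under LRU for an access sequence that is known in advance. The function names and the access sequence are hypothetical; this is a simulation for illustration, not the buffer manager described in the paper.

```python
from collections import OrderedDict

def belady_faults(accesses, capacity):
    """Evict the buffered page whose next access lies furthest in the future."""
    buffer, faults = set(), 0
    for i, page in enumerate(accesses):
        if page in buffer:
            continue
        faults += 1
        if len(buffer) >= capacity:
            def next_use(p):
                try:
                    return accesses.index(p, i + 1)
                except ValueError:
                    return float("inf")  # never accessed again
            buffer.remove(max(buffer, key=next_use))
        buffer.add(page)
    return faults

def lru_faults(accesses, capacity):
    """Evict the least recently used page; needs no knowledge of the future."""
    buffer, faults = OrderedDict(), 0
    for page in accesses:
        if page in buffer:
            buffer.move_to_end(page)  # mark as most recently used
            continue
        faults += 1
        if len(buffer) >= capacity:
            buffer.popitem(last=False)  # drop the least recently used page
        buffer[page] = True
    return faults

# Hypothetical access sequence with a three-page buffer.
accesses = ["A", "B", "C", "A", "B", "D", "A", "B", "C", "D"]
b0 = belady_faults(accesses, 3)
lru = lru_faults(accesses, 3)
```

On this sequence B0 incurs 5 faults and LRU incurs 6, because B0 can see that page D is needed again at the end while A and B are not.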

Goals
• We propose an aggregation method that uses dynamic multidimensional files adapting to skewed distributions
• We present a formal basis for aggregation computation, called the Disjoint-Inclusive Partition (DIP) computation model
• We propose an aggregation method that maximizes the buffering effect by controlling the page access order
• We formally prove that our algorithm achieves the optimal one-pass buffer size under the DIP computation model, which is the minimum buffer size required for one disk access per page

Disjoint-Inclusive Partition
• When page regions and aggregation windows have certain topological relationships, we can improve the performance and buffering effect of computing aggregation by exploiting them
• Definition 1: Two regions S1 and S2 satisfy the disjoint-inclusive relationship if either S1 and S2 are disjoint or one includes the other
• Definition 2: A disjoint-inclusive partition (DIP) of the domain space D is a set Q of regions satisfying the following conditions:
  (1) Q is a partition of D
  (2) When two regions in Q are projected onto any subspace, the projected regions satisfy the disjoint-inclusive relationship
• Definition 3: We call a multidimensional file whose page regions form a DIP a DIP multidimensional file
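Definitions 1 and 2 can be checked mechanically for axis-aligned regions. The sketch below is written under several assumptions that are not in the slides: regions are half-open boxes given as one (low, high) pair per organizing attribute, "any subspace" is read as any subspace spanned by a subset of the organizing attributes, and condition (1) of Definition 2 (that Q is a partition) is taken for granted rather than verified. All function names are hypothetical.

```python
from itertools import combinations

def project(region, axes):
    """Project a box onto the subspace spanned by the given axes."""
    return tuple(region[a] for a in axes)

def disjoint(a, b):
    return any(hi1 <= lo2 or hi2 <= lo1 for (lo1, hi1), (lo2, hi2) in zip(a, b))

def includes(outer, inner):
    return all(o_lo <= i_lo and i_hi <= o_hi
               for (o_lo, o_hi), (i_lo, i_hi) in zip(outer, inner))

def disjoint_inclusive(a, b):
    """Definition 1: the regions are disjoint or one includes the other."""
    return disjoint(a, b) or includes(a, b) or includes(b, a)

def satisfies_dip_condition_2(regions, n_dims):
    """Definition 2, condition (2), over every axis-aligned subspace."""
    for k in range(1, n_dims + 1):
        for axes in combinations(range(n_dims), k):
            for a, b in combinations(regions, 2):
                if not disjoint_inclusive(project(a, axes), project(b, axes)):
                    return False
    return True

# Hypothetical two-dimensional partitions of [0, 100) x [0, 100).
dip = [((0, 50), (0, 50)), ((0, 50), (50, 100)), ((50, 100), (0, 100))]
non_dip = [((0, 60), (0, 50)), ((60, 100), (0, 50)),
           ((0, 40), (50, 100)), ((40, 100), (50, 100))]

dip_ok = satisfies_dip_condition_2(dip, 2)
non_dip_ok = satisfies_dip_condition_2(non_dip, 2)
```

The second partition fails because the projections [0, 60) and [40, 100) onto the first axis overlap without either including the other.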

Example: A DIP and a non-DIP
• Organizing attributes: X, Y, Z
• Set of grouping attributes G = {X, Y}
[Figure: pages in X-Y-Z space with their projections Π_G onto the grouping domain space; (a) a DIP, (b) a non-DIP]
• In (b), Π_G A and Π_G D (also Π_G A and Π_G E) do not satisfy the disjoint-inclusive relationship

DIP Computation Model
• Definition: The DIP computation model for computing aggregations using a multidimensional file is the one that satisfies the following four conditions:
  (1) It uses a DIP multidimensional file
  (2) The aggregation for the grouping domain space is computed as the union of partial aggregations for aggregation windows
  (3) The disjoint-inclusive relationship is satisfied among aggregation windows and page grouping regions
  (4) Each partial aggregation is computed by retrieving records through a range query against the multidimensional file

Controlling the Order of Accessing Pages
• Definition (L-page): A page P is an L-page (large page) of an aggregation window Wi if the page grouping region of P properly includes Wi
• Objective
  • To make an L-page be accessed from disk only once by accessing the pages in a specific order
• For this specific order, we propose an optimal space filling curve, called the Induced Space Filling Curve, based on the formal properties of DIP
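The L-page definition translates directly into a containment test. The following Python sketch assumes half-open rectangular regions in a two-dimensional grouping domain space; the layout and the names `is_l_page` and `l_pages` are hypothetical.

```python
def includes(outer, inner):
    """True if the outer rectangle contains the inner one (possibly equal)."""
    return all(o_lo <= i_lo and i_hi <= o_hi
               for (o_lo, o_hi), (i_lo, i_hi) in zip(outer, inner))

def is_l_page(page_grouping_region, window):
    """Proper inclusion: the region contains the window and is not equal to it."""
    return page_grouping_region != window and includes(page_grouping_region, window)

def l_pages(window, page_grouping_regions):
    return sorted(page for page, region in page_grouping_regions.items()
                  if is_l_page(region, window))

# Hypothetical layout: F spans two windows, D coincides with W3.
W3 = ((50, 100), (50, 100))
W4 = ((50, 100), (0, 50))
page_grouping_regions = {
    "D": ((50, 100), (50, 100)),
    "E": ((50, 100), (0, 50)),
    "F": ((50, 100), (0, 100)),
}

l_pages_of_w3 = l_pages(W3, page_grouping_regions)
l_pages_of_w4 = l_pages(W4, page_grouping_regions)
```

Here F is an L-page of both W3 and W4, whereas D is not an L-page of W3 because its page grouping region equals the window rather than properly including it.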

Induced Space Filling Curve (ISFC)
• Definition (Induced Space Filling Curve (ISFC)): A space filling curve induced from a given set of regions so that it can traverse all smaller regions included in a region Si, and then traverse those that are not included in Si
• Lemma 2: For a given set S of regions, where elements of S satisfy the disjoint-inclusive relationship, there exists at least one ISFC
• Definition (ISFC_{R∪W}): ISFC based on the given set R ∪ W
  • R: a set of page grouping regions in a DIP multidimensional file
  • W: a set of aggregation windows
• Lemma 3: When traversing the aggregation windows in ISFC_{R∪W} order, L-pages are accessed in contiguous aggregation windows
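One way to picture an ISFC is as a depth-first walk over the containment hierarchy of the regions. The Python sketch below represents the curve only as an ordering of regions, which is a simplification; it assumes half-open rectangles that pairwise satisfy the disjoint-inclusive relationship, and the layout and function names are hypothetical rather than taken from the paper.

```python
def includes(outer, inner):
    return all(o_lo <= i_lo and i_hi <= o_hi
               for (o_lo, o_hi), (i_lo, i_hi) in zip(outer, inner))

def isfc_order(regions):
    """Visit each region, then everything inside it, before moving on."""
    regions = set(regions)
    maximal = sorted(r for r in regions
                     if not any(o != r and includes(o, r) for o in regions))
    order = []
    for m in maximal:
        order.append(m)
        inside = {r for r in regions if r != m and includes(m, r)}
        order.extend(isfc_order(inside))
    return order

def contiguous(order, members):
    """True if the members occupy consecutive positions in the order."""
    positions = sorted(order.index(w) for w in members)
    return positions == list(range(positions[0], positions[0] + len(positions)))

# Hypothetical windows W and page grouping regions R.
windows = {
    "W1": ((0, 50), (50, 100)), "W2": ((0, 50), (0, 50)),
    "W3": ((50, 100), (50, 100)), "W4": ((50, 100), (0, 50)),
}
page_grouping_regions = {
    "A": ((0, 50), (50, 100)), "C": ((0, 50), (0, 50)),
    "D": ((50, 100), (50, 100)), "E": ((50, 100), (0, 50)),
    "F": ((50, 100), (0, 100)),  # L-page of W3 and W4
}

curve = isfc_order(set(windows.values()) | set(page_grouping_regions.values()))
name_of = {region: name for name, region in windows.items()}
window_order = [name_of[r] for r in curve if r in name_of]
f_windows_contiguous = contiguous(window_order, ["W3", "W4"])
```

In this layout the windows are visited in the order W2, W1, W4, W3, so the two windows that share the L-page F are adjacent, which is the property Lemma 3 states.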
