Advanced SQL II Advanced Aggregation and OLAP 5DV120 Database - - PowerPoint PPT Presentation

advanced sql ii advanced aggregation and olap
SMART_READER_LITE
LIVE PREVIEW

Advanced SQL II Advanced Aggregation and OLAP 5DV120 Database - - PowerPoint PPT Presentation

Advanced SQL II Advanced Aggregation and OLAP 5DV120 Database System Principles Ume a University Department of Computing Science Stephen J. Hegner hegner@cs.umu.se http://www.cs.umu.se/~hegner Advanced SQL II Advanced


slide-1
SLIDE 1

Advanced SQL II — Advanced Aggregation and OLAP

5DV120 — Database System Principles Ume˚ a University Department of Computing Science Stephen J. Hegner hegner@cs.umu.se http://www.cs.umu.se/~hegner

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 1 of 35

slide-2
SLIDE 2

Aggregation Operators — Review

Query: For each department, find the minimum, maximum, average salary, as well as the number of employees.

SELECT dept_name , MIN(salary) AS min_sal , MAX(salary) AS max_sal , CAST (AVG(salary) as NUMERIC (8 ,2)) AS avg_sal , COUNT (*) AS n_emp FROM instructor GROUP BY dept_name UNION SELECT dept_name , 0, 0, 0, 0 FROM department as D WHERE (NOT EXISTS (SELECT * FROM instructor NATURAL JOIN department AS DI WHERE D.dept_name=DI.dept_name ));

An important rule: The attributes which are listed in the GROUP BY clause must include those which occur in the SELECT and are not aggregated.

  • Will not work (even though it clearly should):

SELECT dept_name , building , MIN(salary) AS min_sal , MAX(salary) AS max_sal , CAST (AVG(salary) as NUME02RIC (8 ,2)) AS avg_sal , COUNT (*) AS n_emp FROM instructor NATURAL JOIN department GROUP BY dept_name UNION ...

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 2 of 35

slide-3
SLIDE 3

Omission of GROUP BY — Review

  • If the GROUP BY clause is omitted, the aggregation is over the entire table.

SELECT MIN(salary) AS min_sal , MAX(salary) AS max_sal , CAST (AVG(salary) as NUMERIC (8 ,2)) AS avg_sal , COUNT (*) AS n_emp FROM instructor ;

  • In this case, there must be no non-aggregated attributes in the SELECT

clause.

  • Does not work:

SELECT dept_name , MIN(salary) AS min_sal , MAX(salary) AS max_sal , CAST (AVG(salary) as NUMERIC (8 ,2)) AS avg_sal , COUNT (*) AS n_emp FROM instructor ;

  • In the above case, dept name must appear in the GROUP BY clause.

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 3 of 35

slide-4
SLIDE 4

The HAVING Clause — Review

Query: For each department with at least two distinct instructors, find the minimum, maximum, average salary, as well as the number of employees.

  • The following does not work:

SELECT dept_name , MIN(salary) AS min_sal , MAX(salary) AS max_sal , CAST (AVG(salary) as NUMERIC (8 ,2)) AS avg_sal , COUNT (*) AS n_emp FROM instructor WHERE (COUNT (*) >= 2); GROUP BY dept_name;

  • The problem is that the WHERE clause is evaluated before the aggregation.
  • The solution is to use a HAVING clause, which is evaluated after the

aggregation.

SELECT dept_name , MIN(salary) AS min_sal , MAX(salary) AS max_sal , CAST (AVG(salary) as NUMERIC (8 ,2)) AS avg_sal , COUNT (*) AS n_emp FROM instructor GROUP BY dept_name HAVING (COUNT (*) >= 2);

  • The HAVING clause must come after the GROUP BY clause.

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 4 of 35

slide-5
SLIDE 5

Embedded Queries in the HAVING Clause — Review

Query: Find the department(s) with the greatest number of instructors.

SELECT I1.dept_name , COUNT(I1 .*) FROM instructor as I1 GROUP BY dept_name HAVING NOT EXISTS (SELECT I2.dept_name , COUNT(I2 .*) FROM instructor AS I2 GROUP BY I2.dept_name HAVING (COUNT(I2 .*) > COUNT(I1 .*)) );

  • It is also possible to do this with an embedded subquery in the FROM

clause.

SELECT I.dept_name , I.n_instr FROM (SELECT dept_name , COUNT (*) AS n_instr FROM instructor GROUP BY dept_name ) AS I WHERE I.n_instr >= ALL (SELECT n_instr FROM (SELECT dept_name , COUNT (*) AS n_instr FROM instructor GROUP BY dept_name ) AS Pointless );

  • Note the alias Pointless which is required by SQL rules.

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 5 of 35

slide-6
SLIDE 6

Window Functions I

Review Query: For each building, find the average budget for all departments in that building.

SELECT building , AVG(budget) AS avg_for_bldg FROM department GROUP BY building ORDER BY AVG(budget) DESC;

  • To provide this averaged information while still providing full information

for each individual department, a window function is very useful.

SELECT dept_name , building , budget , CAST ( AVG(budget) OVER (PARTITION BY building) AS NUMERIC (8 ,2) ) AS avg_for_bldg FROM department ORDER BY building , budget DESC ; dept_name | building | budget | avg_for_bldg

  • - ----------+----------+-----------+--------------

Music | Packard | 80000.00 | 80000.00 Finance | Painter | 120000.00 | 85000.00 History | Painter | 50000.00 | 85000.00

  • Comp. Sci. | Taylor

| 100000.00 | 92500.00

  • Elec. Eng. | Taylor

| 85000.00 | 92500.00 Biology | Watson | 90000.00 | 80000.00 Physics | Watson | 70000.00 | 80000.00

  • Note that a GROUP BY clause is not required, despite the aggregation.

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 6 of 35

slide-7
SLIDE 7

Window Functions II

  • When a window function is used more than once, it is helpful to declare

it by name.

SELECT dept_name , building , budget , CAST ( AVG(budget) OVER (PARTITION BY building) AS NUMERIC (8 ,2) ) AS avg_for_bldg , CAST ( budget

  • AVG(budget) OVER (PARTITION

BY building) AS NUMERIC (8 ,2) ) As deviation FROM department ;

  • The above query is represented more compactly as:

SELECT dept_name , building , budget , CAST (AVG(budget) OVER W AS NUMERIC (8 ,2)) AS avg_for_bldg , CAST (budget - AVG(budget) OVER W AS NUMERIC (8 ,2)) AS deviation FROM department WINDOW W AS (PARTITION BY building ); dept_name | building | budget | avg_for_bldg | deviation

  • - ----------+----------+-----------+--------------+-----------

Music | Packard | 80000.00 | 80000.00 | 0.00 Finance | Painter | 120000.00 | 85000.00 | 35000.00 History | Painter | 50000.00 | 85000.00 |

  • 35000.00
  • Comp. Sci. | Taylor

| 100000.00 | 92500.00 | 7500.00

  • Elec. Eng. | Taylor

| 85000.00 | 92500.00 |

  • 7500.00

Biology | Watson | 90000.00 | 80000.00 | 10000.00 Physics | Watson | 70000.00 | 80000.00 |

  • 10000.00

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 7 of 35

slide-8
SLIDE 8

Window Functions III

  • Here is a more complex query which provides, for each instructor salary,

its percentage of the average within the department.

SELECT ID , name , dept_name , salary , CAST (( salary *100)/ AVG(salary) OVER (PARTITION BY dept_name) AS NUMERIC (5 ,2) ) AS pct_avg_in_dept FROM instructor NATURAL JOIN department ; id | name | dept_name | salary | pct_avg_in_dept

  • - -----+------------+------------+----------+-----------------

76766 | Crick | Biology | 72000.00 | 100.00 10101 | Srinivasan | Comp. Sci. | 65000.00 | 84.05 45565 | Katz | Comp. Sci. | 75000.00 | 96.98 83821 | Brandt | Comp. Sci. | 92000.00 | 118.97 98345 | Kim | Elec. Eng. | 80000.00 | 100.00 12121 | Wu | Finance | 90000.00 | 105.88 76543 | Singh | Finance | 80000.00 | 94.12 32343 | El Said | History | 60000.00 | 98.36 58583 | Califieri | History | 62000.00 | 101.64 15151 | Mozart | Music | 40000.00 | 100.00 33456 | Gold | Physics | 87000.00 | 95.60 22222 | Einstein | Physics | 95000.00 | 104.40

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 8 of 35

slide-9
SLIDE 9

Window Functions IV

  • Window functions are also very useful for computing running averages.
  • Suppose that the following simple table of years and sales amounts is

given.

CREATE TABLE Sales (year NUMERIC (4 ,0) NOT NULL , amount NUMERIC (8 ,2) NOT NULL , PRIMARY KEY (year) );

  • The following query provides a moving average of sales for the current

year and two preceding.

SELECT Year , Amount , CAST ( AVG(Amount) OVER (ORDER BY year ROWS 2 PRECEDING) AS NUMERIC (8 ,2) ) AS mvg_avg_3_yr FROM Sales;

year | amount | mvg_avg_3_yr

  • - ----+-----------+--------------

1990 | 52134.00 | 52134.00 1991 | 66000.00 | 59067.00 1992 | 75000.00 | 64378.00 1993 | 65000.00 | 68666.67 1994 | 70000.00 | 70000.00 1995 | 80000.00 | 71666.67 1996 | 85000.00 | 78333.33 1997 | 82000.00 | 82333.33 1998 | 88000.00 | 85000.00 1999 | 90000.00 | 86666.67 2000 | 95000.00 | 91000.00 year | amount | mvg_avg_3_yr

  • - ----+-----------+--------------

2001 | 95000.00 | 93333.33 2002 | 98000.00 | 96000.00 2003 | 96000.00 | 96333.33 2004 | 101000.00 | 98333.33 2005 | 99000.00 | 98666.67 2006 | 104000.00 | 101333.33 2007 | 120000.00 | 107666.67 2008 | 125000.00 | 116333.33 2009 | 115000.00 | 120000.00 2010 | 130000.00 | 123333.33 2011 | 175000.00 | 140000.00 Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 9 of 35

slide-10
SLIDE 10

Window Functions V

  • Other options for the moving average are also possible.
  • All preceding years as well as the current:

SELECT Year , Amount , CAST ( AVG(Amount) OVER (ORDER BY year ROWS UNBOUNDED PRECEDING) AS NUMERIC (8 ,2) ) AS ubd_prec_mvg_avg FROM Sales;

  • A three-year running average which includes the preceding as well as the

following year:

SELECT Year , Amount , CAST ( AVG(Amount) OVER (ORDER BY year ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS NUMERIC (8 ,2) ) AS moving_avg_1p_1f FROM Sales;

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 10 of 35

slide-11
SLIDE 11

Ranking I

  • The rank of a tuple relative to some ordering is recaptured succinctly via

the RANK() construct.

SELECT dept_name , building , budget , RANK () OVER (ORDER BY budget ASC) AS rank FROM department ; dept_name | building | budget | rank

  • - -------------+----------+-----------+------

History | Painter | 50000.00 | 1 Physics | Watson | 70000.00 | 2 Music | Packard | 80000.00 | 3

  • Elec. Eng.

| Taylor | 85000.00 | 4 Biology | Watson | 90000.00 | 5

  • Comp. Sci.

| Taylor | 100000.00 | 6 Basketweaving | Packard | 100000.00 | 6 Finance | Painter | 120000.00 | 8

  • To restrict the result to certain ranks, embed the query in a FROM clause.

SELECT dept_name , building , budget , rank FROM (SELECT dept_name , building , budget , RANK () OVER (ORDER BY budget ASC) FROM department ) AS foo WHERE (rank <=3);

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 11 of 35

slide-12
SLIDE 12

Ranking II

  • To report ranking locally within each partition of the appropriate window

function:

SELECT dept_name , building , budget , RANK () OVER (PARTITION BY building ORDER BY budget ASC) AS rank FROM department ; dept_name | building | budget | rank

  • - ----------+----------+-----------+------

Music | Packard | 80000.00 | 1 History | Painter | 50000.00 | 1 Finance | Painter | 120000.00 | 2

  • Elec. Eng. | Taylor

| 85000.00 | 1

  • Comp. Sci. | Taylor

| 100000.00 | 2 Physics | Watson | 70000.00 | 1 Biology | Watson | 90000.00 | 2

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 12 of 35

slide-13
SLIDE 13

OLAP

OLAP: OnLine Analytical Processing

  • The simplest context is a relation with the following properties:

Measure attributes: One attribute is numerical, called the measure attribute. Dimension attributes: The other attributes, called dimension attributes, have relatively few alternatives, with a numerical value for each combination. Example: item name color clothes size quantity Sales

CREATE TABLE sales ( item_name varchar (10) NOT NULL , color varchar (10) , clothes_size varchar (10) , quantity integer , primary key (item_name ,color , clothes_size ), CHECK (item_name IN (’skirt ’,’dress ’,’shirt ’,’pants ’)), CHECK (color IN (’dark ’,’pastel ’,’white ’)), CHECK ( clothes_size IN (’small ’,’medium ’,’large ’)) );

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 13 of 35

slide-14
SLIDE 14

OLAP Example Table

  • A sample table for the sales

relation is shown to the right.

  • Note that there are

4 × 3 × 3 = 36 tuples, one for each possible value of (item name,color,clothes size). .

item_name | color | clothes_size | quantity

  • - ---------+--------+--------------+----------

skirt | dark | small | 2 skirt | dark | medium | 5 skirt | dark | large | 1 skirt | pastel | small | 11 skirt | pastel | medium | 9 skirt | pastel | large | 15 skirt | white | small | 2 skirt | white | medium | 5 skirt | white | large | 3 dress | dark | small | 2 dress | dark | medium | 6 dress | dark | large | 12 dress | pastel | small | 4 dress | pastel | medium | 3 dress | pastel | large | 3 dress | white | small | 2 dress | white | medium | 3 dress | white | large | shirt | dark | small | 2 shirt | dark | medium | 6 shirt | dark | large | 6 shirt | pastel | small | 4 shirt | pastel | medium | 1 shirt | pastel | large | 2 shirt | white | small | 17 shirt | white | medium | 1 shirt | white | large | 10 pants | dark | small | 14 pants | dark | medium | 6 pants | dark | large | pants | pastel | small | 1 pants | pastel | medium | pants | pastel | large | 1 pants | white | small | 3 pants | white | medium | pants | white | large | 2 Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 14 of 35

slide-15
SLIDE 15

Representation in Crosstab Format

  • The table on the previous slide may be represented alternatively with the

data cross tabulated against one of the measure attributes.

  • In this case, one column is formed for each value of the measure attribute.
  • Using color as the measure attribute which is cross tabulated, here is

the representation as a cross tabulation.

item_name | clothes_size | dark | pastel | white

  • - ---------+--------------+------+--------+-------

skirt | small | 2 | 11 | 2 skirt | medium | 5 | 9 | 5 skirt | large | 1 | 15 | 3 dress | small | 2 | 4 | 2 dress | medium | 6 | 3 | 3 dress | large | 12 | 3 | shirt | small | 2 | 4 | 17 shirt | medium | 6 | 1 | 1 shirt | large | 6 | 2 | 10 pants | small | 14 | 1 | 3 pants | medium | 6 | 0 | pants | large | 0 | 1 | 2

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 15 of 35

slide-16
SLIDE 16

Programming Cross Tabulation

  • Standard SQL supports this construction directly via the PIVOT directive.

SELECT * FROM sales PIVOT ( SUM(quantity) FOR color IN (’dark ’,’pastel ’,’white ’) ) ORDER BY item_name;

  • In PostgreSQL, it seems that there is not yet support for such a pivot.
  • It does have a contributed package tablefunc.sql which provides a

CROSSTAB function, but it is limited to a single-row key.

  • In this example, the pair (item name,clothes size) is the key.
  • Therefore, the pivot must be realized with vintage SQL:

SELECT SD.item_name , SD.clothes_size , SD.quantity AS dark , SP.quantity AS pastel , SW.quantity AS white FROM sales AS SD INNER JOIN sales AS SP ON ((SD.item_name=SP.item_name) AND (SD. clothes_size =SP. clothes_size )) INNER JOIN sales AS SW ON ((SP.item_name=SW.item_name) AND (SP. clothes_size =SW. clothes_size )) WHERE (SD.color=’dark ’) AND (SP.color=’pastel ’) AND (SW.color=’white ’);

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 16 of 35

slide-17
SLIDE 17

Pivot Tables I

  • Suppose that a cross tabulation is required which aggregates on one or

more attributes.

  • For example, here the clothes size attribute has been removed via

aggregation:

item_name | dark | pastel | white | total

  • - ---------+------+--------+-------+-------

dress | 20 | 10 | 5 | 35 pants | 20 | 2 | 5 | 27 shirt | 14 | 7 | 28 | 49 skirt | 8 | 35 | 10 | 53

  • How is this table created?

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 17 of 35

slide-18
SLIDE 18

Pivot Tables II

  • The easiest solution is to create a view By Color which gives the full

cross-tabulation:

item_name | clothes_size | dark | pastel | white

  • - ---------+--------------+------+--------+-------

| | | |

  • Then a simple aggregation query over the view does the trick:

SELECT DISTINCT item_name , SUM(dark) AS dark , SUM(pastel) AS pastel , SUM(white) AS white , SUM(dark )+ SUM(pastel )+ SUM(white) AS total FROM By_Color GROUP BY item_name;

  • Here is a definition of By Color for PostgreSQL:

CREATE VIEW By_Color AS SELECT SD.item_name , SD.clothes_size , SD.quantity AS dark , SP.quantity AS pastel , SW.quantity AS white FROM sales AS SD INNER JOIN sales AS SP ON ((SD.item_name=SP.item_name) AND (SD. clothes_size =SP. clothes_size )) INNER JOIN sales AS SW ON ((SP.item_name=SW.item_name) AND (SP. clothes_size =SW. clothes_size )) WHERE (SD.color=’dark ’) AND (SP.color=’pastel ’) AND (SW.color=’white ’);

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 18 of 35

slide-19
SLIDE 19

Pivot Tables III

  • Often, it is desirable to provide an additional row which provides the

total.

item_name | dark | pastel | white | total

  • - ---------+------+--------+-------+-------

dress | 20 | 10 | 5 | 35 pants | 20 | 2 | 5 | 27 shirt | 14 | 7 | 28 | 49 skirt | 8 | 35 | 10 | 53 total | 62 | 54 | 48 | 164

  • This is accomplished easily by augmenting the view query:

SELECT DISTINCT item_name , SUM(dark) AS dark , SUM(pastel) AS pastel , SUM(white) AS white , SUM(dark )+ SUM(pastel )+ SUM(white) AS total FROM By_Color GROUP BY item_name UNION SELECT ’total ’, SUM(dark), SUM(pastel), SUM(white), SUM(dark )+ SUM(pastel )+ SUM(white) FROM By_Color;

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 19 of 35

slide-20
SLIDE 20

Rollup I

  • The process of creating cross tabulations via aggregation is called rollup.

Example: From the table

item_name | dark | pastel | white | total

  • - ---------+------+--------+-------+-------

dress | 20 | 10 | 5 | 35 pants | 20 | 2 | 5 | 27 shirt | 14 | 7 | 28 | 49 skirt | 8 | 35 | 10 | 53 total | 62 | 54 | 48 | 164

  • each of the tables

item_name | quantity

  • - ---------+----------

all | 164 dress | 35 pants | 27 shirt | 49 skirt | 53 color | quantity

  • - ------+----------

all | 164 dark | 62 pastel | 54 white | 48

is obtained via rollup, that on the left by by selecting the last column, and that on the right by selection the last row of the table at the top.

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 20 of 35

slide-21
SLIDE 21

Rollup II

  • A table such as

item_name | dark | pastel | white | total

  • - ---------+------+--------+-------+-------

dress | 20 | 10 | 5 | 35 pants | 20 | 2 | 5 | 27 shirt | 14 | 7 | 28 | 49 skirt | 8 | 35 | 10 | 53 total | 62 | 54 | 48 | 164

is useful for visualization, but it is not well suited for computation.

  • The dimensions of the table will change as the occurring values for the

pivot attribute change.

  • Therefore, an internal representation as shown on the next slide is to be

preferred.

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 21 of 35

slide-22
SLIDE 22

Internal Representation I

  • Here is an internal representation for the cross tabulation of the previous

slide.

  • Note that the number of columns is fixed, regardless of the values
  • ccurring for item name or color.

item_name | color | clothes_size | quantity

  • - ---------+--------+--------------+----------

all | all | all | 164 all | dark | all | 62 all | pastel | all | 54 all | white | all | 48 dress | all | all | 35 dress | dark | all | 20 dress | pastel | all | 10 dress | white | all | 5 pants | all | all | 27 pants | dark | all | 20 pants | pastel | all | 2 pants | white | all | 5 shirt | all | all | 49 shirt | dark | all | 14 shirt | pastel | all | 7 shirt | white | all | 28 skirt | all | all | 53 skirt | dark | all | 8 skirt | pastel | all | 35 skirt | white | all | 10

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 22 of 35

slide-23
SLIDE 23

Internal Representation II

  • SQL to generate the internal representation of the cross tabulation, as

well as its two rollups.

CREATE VIEW By_Color_Rel AS SELECT * FROM ( SELECT item_name , color , ’all ’ AS clothes_size , SUM(quantity) AS quantity FROM sales GROUP BY item_name , color UNION SELECT ’all ’ as item_name , color , ’all ’ AS clothes_size , SUM(quantity) FROM sales GROUP BY color UNION SELECT item_name , ’all ’ AS color , ’all ’ AS clothes_size , SUM(quantity) FROM sales GROUP BY item_name UNION SELECT ’all ’ AS item_name , ’all ’ AS color , ’all ’ AS clothes_size , SUM(quantity) FROM sales ) AS Pointless ORDER BY 1,2; SELECT * SELECT item_name , quantity SELECT color , quantity FROM By_Color_Rel ; FROM By_Color_Rel FROM By_Color_Rel WHERE color=’all ’; WHERE item_name=’all ’;

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 23 of 35

slide-24
SLIDE 24

Hierarchies on Dimensions

  • Dimensions may have hierarchies defined on them, and rollup may be

performed relative to those hierarchies. Example: Partitions the items of Sales into men’s wear and women’s wear.

item name color clothes size quantity Sales item name category ItemCategory

item name menswear womenswear

item_name | category

  • - ---------+------------

skirt | womenswear dress | womenswear shirt | menswear pants | menswear

  • The desired result looks like:

category | item_name | dark | pastel | white | total

  • - ----------+-----------+------+--------+-------+-------

menswear | pants | 20 | 2 | 5 | 27 menswear | shirt | 14 | 7 | 28 | 49 menswear | subtotal | 34 | 9 | 33 | 76 womenswear | dress | 20 | 10 | 5 | 35 womenswear | skirt | 8 | 35 | 10 | 53 womenswear | subtotal | 28 | 45 | 15 | 88 ztotal | | 62 | 54 | 48 | 164

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 24 of 35

slide-25
SLIDE 25

Hierarchies on Dimensions II

  • Here is basic SQL code to accomplish this.
  • For convenience, first create a view which provides the summarizaion

without the categorization:

CREATE VIEW By_Color_no_size AS SELECT DISTINCT item_name , SUM(dark) AS dark , SUM(pastel) AS pastel , SUM(white) AS white , SUM(dark )+ SUM(pastel )+ SUM(white) AS total FROM By_Color GROUP BY item_name UNION SELECT ’total ’, SUM(dark), SUM(pastel), SUM(white), SUM(dark )+ SUM(pastel )+ SUM(white) FROM By_Color;

  • Now add the categorization:

SELECT category , item_name , dark , pastel , white , total FROM itemcategory NATURAL JOIN By_Color_no_size UNION SELECT category , ’subtotal ’, SUM(dark), SUM(pastel), SUM(white), SUM(total) FROM itemcategory NATURAL JOIN By_Color_no_size GROUP BY category UNION SELECT ’ztotal ’, ’’, SUM(dark), SUM(pastel), SUM(white), SUM(total) FROM itemcategory NATURAL JOIN By_Color_no_size ORDER BY 1,2;

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 25 of 35

slide-26
SLIDE 26

Data Cubes I

  • A relation with three dimension attributes may be visualized as a cube.

8 20 14 20 62 35 10 7 2 54 10 5 28 5 48 53 35 49 27 164 dark pastel white all color skirt dress shirt pants all item name 62 22 29 11 54 34 4 16 48 21 9 18 164 77 42 45 all large medium small clothes size 2 5 3 1 11 4 7 6 12 29 2 8 5 7 22 8 20 14 20 62

  • With n dimension attributes, an n-dimensional hypercube is obtained.

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 26 of 35

slide-27
SLIDE 27

Data Cubes – Slicing and Dicing

8 20 14 20 62 35 10 7 2 54 10 5 28 5 48 53 35 49 27 164 dark pastel white all color skirt dress shirt pants all item name 62 22 29 11 54 34 4 16 48 21 9 18 164 77 42 45 all large medium small clothes size 2 5 3 1 11 4 7 6 12 29 2 8 5 7 22 8 20 14 20 62

  • A rollup operation on an n-dimensional hypercube may be visualized as

taking an (n − k)-dimensional “slice” out of it, with k the number of attributes aggregated.

  • Illustrated above is a rollup which aggregates on clothes size.
  • The rollup operation is also called slicing and/or dicing.

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 27 of 35

slide-28
SLIDE 28

The Size of a Data Cube

Context:

  • n dimension attributes A = {a1, a2, . . . , an}.
  • mk possible values for ak.
  • For an n-dimensional hypercube which includes, for each B ⊆ A, a single

rollup for that aggregation, the number of unit dies (the “size”) is Πn

k=1(mk + 1)

Example: For the cube on the previous slides, m1 = 4, m2 = m3 = 3, so the size is (4 + 1) × (3 + 1) × (3 + 1) = 80.

  • To roll up all subsets of values for each attribute, the size is much greater.

Πn

k=1(2mk)

Example: For the cube on the previous slides, this size (24) × (23) × (23) = 1024.

  • These are mathematical sizes.
  • The efficient representation of data cubes in real systems is an ongoing

research topic.

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 28 of 35

slide-29
SLIDE 29

Data Cubes and OLAP – Drilldown

Drilldown is the opposite of rollup.

  • In drilldown, aggregation is reversed and more detail is provided.
  • Drilldown is often requested by a user who needs to see more detail for a

given situation.

  • Drilldown requires access to data not contained in the original rollup view.

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 29 of 35

slide-30
SLIDE 30

OLAP on Real Systems

  • There are two possibilities for doing OLAP on real systems.

Within SQL: Major commercial systems offer much more than PostgreSQL in the way of native OLAP constructs. GROUP BY CUBE(a1,a2,..,an) Creates an aggregation for every subset of {a1, a2, . . . , an}; in other words, for all 2n subsets. Example: GROUP BY CUBE (color, clothes size) GROUP BY ROLLUP(a1,a2,..,an) Create a rollup in which n+1 aggregations are produced: {a1, . . . , ai} for 1 ≤ i ≤ n, as well as ∅. Example: GROUP BY ROLLUP (color, clothes size) produces aggregations for {color, clothes size}, {color}, and ∅ (full aggregation). GROUP BY GROUPING SETS(S1,S2,..,Sk) Here the Si’s are subsets

  • f the set {a1, a2, . . . , an} of dimension attributes. An aggregation

is created for each Si. Example: GROUP BY GROUPING SET ((color, clothes size), (color) ()) does exactly the same thing as the ROLLUP grouping above.

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 30 of 35

slide-31
SLIDE 31

OLAP on Real Systems II

Add-on Tools: The second way of doing OLAP is via add-on tools.

  • These tools may be provide by either the DBMS vendor or by a

third-party vendor. ❯ They typically provide much better visualization than is possible in SQL. ❯ They also generally provide better analysis tools. ❉ They are not free. ❉ Even for the ones which are “open source”, only parts have this property. ❉ For example, the computational engine may be open source, with the GUI be for purchase only.

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 31 of 35

slide-32
SLIDE 32

Data Warehousing

  • A data warehouse is a repository for data from many sources, collected

into a single schema.

  • The data are typically regarded as historical.
  • This means that it is not subject to update by ordinary transactions.
  • A main purpose of a data warehouse (but not the only one) is

OLAP-style computation.

  • Schemata are often organized into star (sub)schemata, as illustrated on

the next slide.

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 32 of 35

slide-33
SLIDE 33

Data Warehousing – Star Schemata

  • A star schema has:
  • One fact table (the hub:

sales).

  • Several dimension tables (the

leaves: item info, date info, store, customer).

  • The key of each dimension

table is a foreign key to the fact table.

sales item id store id customer id date number price item info item id color size category date info date month quarter year store store id city state country customer customer id name street city state zipcode country

  • The dimension tables provide details about the items listed in the fact

table.

  • A data warehouse schema may consist of many star sub-schemata.
  • It is easy to see that the data in such schemata are suitable for OLAP

style computations, among others.

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 33 of 35

slide-34
SLIDE 34

Data Warehousing – Snowflake Schemata

Sales item id store id customer id date number price item info item id color size category mfr id date info date month quarter year store store id city state country customer customer id name street city state zipcode country manufacturer mfr id volume earnings

  • In a snowflake schema, there may be several levels of dimension tables.
  • Snowflake schemata are typically in 3NF.
  • Snowflake (sub)-schemata are also common in data warehouses.

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 34 of 35

slide-35
SLIDE 35

Data Mining

  • In data mining, the data are analyzed to find useful patterns.

Example: Look for associations in purchases by a single customer in a single store on a single day.

  • Data Mining is applied pattern classification on data sets which are

stored in databases.

  • To understand it well requires a strong background in probability and

statistics.

  • Some DBMSs do have interesting tools for doing data mining.
  • Look at IBM DB2 in particular.
  • But to understand and use the results meaningfully requires expertise in

pattern classification.

  • Data mining will not be studied further in this course.

Advanced SQL II — Advanced Aggregation and OLAP 20130417 Slide 35 of 35