overview dw performance optimization
play

Overview DW Performance Optimization Choosing aggregates - PowerPoint PPT Presentation

Overview DW Performance Optimization Choosing aggregates Maintaining views Bitmapped indices Other optimization issues Original slides were written by Torben Bach Pedersen Aalborg University 2007 - DWML course 2 Aggregates


  1. Overview DW Performance Optimization • Choosing aggregates • Maintaining views • Bitmapped indices • Other optimization issues Original slides were written by Torben Bach Pedersen Aalborg University 2007 - DWML course 2 Aggregates Aggregate Use Example • Observations • Imagine 1 billion sales rows, 1000 products, 100 locations � DW queries are simple, follow same “schema” • CREATE VIEW TotalSales (pid,locid,total) AS � Aggregate (GROUP-BY) queries SELECT s.pid,s.locid,SUM(s.sales) • Idea FROM Sales s GROUP BY s.pid,s.locid � Compute and store query results in advance (pre-aggregation) ◆ Example: store “total sales per month and product” • The materialized view has 100,000 rows � Yields large performance improvements (e.g., factor 1000) • Query rewritten to use view � No need to store everything – reuse whenever possible � SELECT p.category,SUM(s.sales) FROM Products p, Sales s ◆ Example: quarterly total can be quickly computed from monthly total WHERE p.pid=s.pid GROUP BY p.category • Prerequisites � Rewritten to � Tree-structured dimensions with fixed height � SELECT p.category,SUM(t.total) FROM Products p, TotalSales t � Many-to-one relationships from fact to dimensions WHERE p.pid=t.pid GROUP BY p.category � Facts mapped to bottom level in all dimensions � Query becomes 10,000 times faster ! � Otherwise, re-use is not possible Aalborg University 2007 - DWML course 3 Aalborg University 2007 - DWML course 4

  2. Pre-Aggregation Choices Using Aggregates • Application: aggregates used via aggregate navigator � Given a query, the best aggregate is found , and the query is • Full pre-aggregation: (all combinations of levels) rewritten to use it � Fast query response � Done by the system, not by the user � Takes a lot of space/update time (200-500 times raw data) ◆ Traditionally done in middleware, e.g., ODBC • No pre-aggregation ◆ Performed directly by DBMS � Slow query response (for terabytes…) • Four design goals for aggregate usage • Practical pre-aggregation: chosen combinations � Aggregates stored separately from detail data � A good compromise between response time and space use � “ Shrunk ” dimensions mapped to aggregate facts • Most (R)OLAP tools now support practical pre-aggregation � Connection between aggregates and detail data known by system � IBM DB2 UDB � All queries (SQL) refers only detail data � Oracle 10g • SUM, MIN, MAX, COUNT, AVG can all be handled � MS Analysis Services � The last one requires little trick – Why? Aalborg University 2007 - DWML course 5 Aalborg University 2007 - DWML course 6 Choosing Aggregates Data Cube The data cube stores multidimensional GROUP BY • Practical pre-aggregation, decide what aggregates to store relations of tables in data warehouses • Non-trivial (NP-complete) optimization problem � Space use Group By (with total) � Update speed Aggregate By color � Response time demands Red Sum � Actual queries White By make and year Database capable Blue � Index and/or aggregates 1 9 2 9 3 Ford 9 4 1 9 9 1 9 9 m Chevy 9 1 1 u By make S • Choose an aggregate if it is considerably smaller than Sum available, usable aggregates (factor 3-5-10) Data warehouse Cross Tab capable • Supported (semi)-automatically by DBMS Chevy Ford By color Red � Oracle, DB2, MS SQL Server White • “Implementing Data Cubes Efficiently” SIGMOD’96 paper By make and color Blue By color and year By make � Greedy approach: simple but effective Sum � Above DBMS now use similar, but more advanced techniques Aalborg University 2007 - DWML course 7 Aalborg University 2007 - DWML course 8

  3. Lattice Framework A Data Cube example (I) – 8 possible 1. part, supplier, customer (6M rows) Scenario: (III) – Picking views for 3 the right views 2. part, customer (6M) dimensions. A query asks for the sales of a part to materialize Each view � improve 3. part, supplier (0.8M) gives the total performance a) If view pc is available, need to sales as per 4. supplier, customer (6M) process about 6M rows; OR • Each lattice node represents a view / query that grouping. 5. part (0.2M) 19 M rows total Dependence relationship: Q1 ≤ Q2 b) If view p is available, need to • 6. supplier (0.01M) process about 0.2M rows Q1 is the descendant of Q2 � 7. customer (0.1M) Q1 can be answered using only the results of Q2 8. none (1) � • In other words, Q1 is dependent on Q2 (IV) – View pc (II) – 8 views Questions: psc 6M and sc not The ≤ operator imposes a partial ordering on the queries organized into � needed � a Lattice a) How many views to Effective rows • Top view = base data (i.e., most detail) materialize to get reasonable reduced from pc 6M ps 0.8M sc 6M performance? - and 19M to 7M • Essentially, the lattice captures dependency relationship among b) Given that we have space S, queries / views and can be represented by a lattice graph p 0.2M s 0.01M c 0.1M what views to materialize to minimize average query none 1 costs? Aalborg University 2007 - DWML course 9 Aalborg University 2007 - DWML course 10 Greedy Algorithm Cost Model Greedy algorithm: each time Model simple and realistic, Cost of answering a query based choose the view with the Enable the design and analysis on a view and it has major maximum “benefit” of powerful algorithms assumptions … Explanation: The Greedy algorithm • Given a data cube lattice with space costs Time to answer a query is equal to An illustration: 1 associated with each view the space occupied by the query from which the query is answered S = {top view} • To answer query Q, choose an ancestor • Always include the top view because it cannot be of Q, say, Qa, that has been materialized generated from other views for i=1 to k do begin All queries are identical to some 2 • We thus need to process the table of Qa queries in the given lattice • Suppose we may only select k number of views in Select that view v not addition to the top view • Cost of answering Q: number of rows in in S such that B(v,S) the table of Qa The clustering of the materialized • The benefit of view v (relative to S), is based on is maximized; 3 query and indexes have not been how v can improve the costs of evaluating views, considered including itself S = S union {v}; • The total benefit of v is the sum over all views w end; Size S Time T Ratio m Source of the benefit of using v to evaluate w, providing that benefit is positive Experimental validation of the cost From cell itself 2.07 - return S; 1 model: almost linear relationship From view s 2.38 .000031 10,000 between size and running time … From view ps 20.77 .000023 0.8M From view psc 226.23 .000037 6M Aalborg University 2007 - DWML course 11 Aalborg University 2007 - DWML course 12

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend