dw performance optimization i
play

DW Performance Optimization (I) Slides adapted from Torben Bach - PowerPoint PPT Presentation

DW Performance Optimization (I) Slides adapted from Torben Bach Pedersen Overview Choosing Aggregates Maintaining Views Bitmap Indices Aalborg University 2008 - DWML course 2 Aggregates Observations DW queries follow


  1. DW Performance Optimization (I) Slides adapted from Torben Bach Pedersen

  2. Overview • Choosing Aggregates • Maintaining Views • Bitmap Indices Aalborg University 2008 - DWML course 2

  3. Aggregates • Observations � DW queries follow the same “schema” � Aggregate (GROUP-BY) queries • Idea � Compute and store query results in advance (pre-aggregation) ◆ Example: store “total sales per month and product” � Yields large performance improvements (e.g., factor 1000) � No need to store everything – reuse whenever possible ◆ Example: quarterly total can be quickly computed from monthly total • Prerequisites � Tree-structured dimensions with fixed height � Many-to-one relationships from fact to dimensions � Facts mapped to bottom level in all dimensions Aalborg University 2008 - DWML course 3

  4. Aggregate Use Example Sales • Imagine 1 billion sales rows, 1000 products, 100 locations tid pid locid sales 1 1 1 10 • CREATE VIEW TotalSales (pid,locid,total) AS 2 1 1 20 SELECT s.pid,s.locid,SUM(s.sales) 3 2 3 40 FROM Sales s … … … … GROUP BY s.pid,s.locid 1 billion rows • Pre-compute the above view and store the result into a table (e.g., TotalSales with 100,000 rows) • Rewrite the query to use the view TotalSales SELECT p.category,SUM(s.sales) FROM Products p, � Sales s WHERE p.pid=s.pid GROUP BY p.category pid locid sales 1 1 30 ◆ can be rewritten to 2 3 40 SELECT p.category,SUM(t.total) FROM Products p, � TotalSales t WHERE p.pid=t.pid GROUP BY p.category … … … ◆ Query becomes 10,000 times faster! 100,000 rows Aalborg University 2008 - DWML course 4

  5. Pre-Aggregation Choices • Full pre-aggregation: (all combinations of levels) � Fast query response � Takes a lot of space/update time (200-500 times raw data) • No pre-aggregation � Slow query response (for terabytes…) • Practical pre-aggregation: chosen combinations � A good compromise between response time and space use � The approach supported by ROLAP tools ◆ IBM DB2 UDB ◆ Oracle 10g ◆ MS Analysis Services Aalborg University 2008 - DWML course 5

  6. Using Aggregates • The term “aggregate” refers to materialized view • Application: aggregates used via aggregate navigator � A set of aggregates stored in the system (i.e., DBMS) � Given a query, the best aggregate is found , and the query is rewritten to use it � Done by the system, not by the user • Four design goals for aggregate usage � Aggregates stored separately from detail data � “ Shrunk ” dimensions mapped to aggregate facts � Connection between aggregates and detail data known by system � All queries (SQL) refers only detail data • SUM, MIN, MAX, COUNT, AVG functions can be handled � The AVG function requires a trick – Why and how? Aalborg University 2008 - DWML course 6

  7. Choosing Aggregates • Practical pre-aggregation, decide what aggregates to store • Non-trivial (NP-complete) optimization problem � Space use � Update speed � Response time demands � …… • Choose an aggregate if it is considerably smaller than available, usable aggregates (factor 3-5-10) • “Implementing Data Cubes Efficiently” SIGMOD’96 paper � Greedy approach: simple but effective � Commerical DBMS now use similar, but more advanced techniques Aalborg University 2008 - DWML course 7

  8. Data Cube The data cube stores multidimensional GROUP BY relations of tables in data warehouses mc Group By (with total) Aggregate c By color m Red none Sum White Database capable Blue By make and year By make 1991 1992 1993 F 1994 o r Sum d Sum C h e v y Data warehouse Cross Tab capable Chevy Ford By color Red White Blue By make and color By make By color and year Sum Aalborg University 2008 - DWML course 8

  9. Lattice Framework • Each lattice node represents a view / query Dependence relationship: Q1 ≤ Q2 • Q1 is the descendant of Q2 � Q1 can be answered using only the results of Q2 � • In other words, Q1 is dependent on Q2 The ≤ operator imposes a partial ordering on the queries � • Top view = base data (i.e., most detail) • Essentially, the lattice captures dependency relationship among queries / views and can be represented by a lattice graph Aalborg University 2008 - DWML course 9

  10. A Data Cube Example 1. part, supplier, customer (6M rows) Scenario: 2. part, customer (6M) A query asks for the sales of a part 3. part, supplier (0.8M) 4. supplier, customer (6M) a) If view pc is available, need to 5. part (0.2M) 19 M rows total process about 6M rows; OR 6. supplier (0.01M) b) If view p is available, need to 7. customer (0.1M) process about 0.2M rows 8. none (1) (I) – 8 possible views for 3 dimensions. Each view (III) – Picking the right views (IV) – View pc and sc not to materialize � improve gives the total sales as needed � Effective rows psc 6M per that grouping. performance reduced from 19M to 7M pc 6M ps 0.8M sc 6M p 0.2M s 0.01M c 0.1M Problem: Given that we have space S, what views to materialize for minimizing query costs? none 1 (II) – 8 views organized into a Lattice Aalborg University 2008 - DWML course 10

  11. Cost Model Model simple and realistic, Cost of answering a query based on Enable the design and analysis a view and its major assumptions of powerful algorithms Time to answer a query is equal to 1 An illustration: the space occupied by the query from which the query is answered • To answer query Q, choose an ancestor of Q, say, Qa, that has been materialized • We thus need to process the table of Qa All queries are identical to some 2 queries in the given lattice • Cost of answering Q: number of rows in the table of Qa Size S Time T Ratio m Source Experimental validation of the cost From cell itself 2.07 - 1 model: almost linear relationship From view s 2.38 .000031 10,000 between size and running time … From view ps 20.77 .000023 0.8M From view psc 226.23 .000037 6M Aalborg University 2008 - DWML course 11

  12. Greedy Approach • We plan to solve the problem by the greedy approach � Being as greedy as possible in each step! � Now give you a taste of this approach …… • Example: Pay 5 kroner with the smallest number of coins � Suppose we have many coins of 2 kroner, 1 krone, and 50 ores (i.e., 50 cents). • How to be greedy: � Select the largest coin: 2 kroner � Select the largest coin again: 2 kroner � Remaining amount = 5 – 2 – 2 = 1 krone � We cannot select the largest coin again � We choose the next largest coin 1 krone instead • Only 3 coins are needed Aalborg University 2008 - DWML course 12

  13. Greedy Algorithm Greedy algorithm: each time choose the view with the maximum “benefit” Explanation: The Greedy algorithm • Given a data cube lattice with space costs S = {top view} associated with each view for i=1 to k do begin • Always include the top view because it cannot be generated from other views Select a view v not in S such that B(v,S) is maximized; • Given a number k, we are allowed to select k views (in addition to the top view) S = S union {v}; • B(v,S), the benefit of view v (relative to S), end; denotes the amount that v can improve the costs of evaluating views w, including itself return S; • B(v,S) = Σ view w max {Cost(w,S) – Cost(w,v), 0} only positive benefit taken into account Aalborg University 2008 - DWML course 13

  14. Greedy Algorithm – Example 100 a 50 b C 75 30 e Benefits of possible choices in 1 st round 20 d f 40 Round 1 g h 10 1 50 x 5 = 250 b 25 x 5 = 125 c Lattice with space costs 80 x 2 = 160 d e 70 x 3 =210 Initial: top view a is automatically chosen 60 x 2 =120 f 99 x 1 = 99 g 90 x 1 = 90 h • Round 1 : each query answered by a, with cost 100 End of round 1: view b selected • Which queries can be answered by view c? • What is the benefit of view c? Ans: 5*(100-75)=125 • Can we find a better view than view c? How about b? Aalborg University 2008 - DWML course 14

  15. Greedy Algorithm – Example 100 a Round 1: view b selected 50 b C 75 30 e 20 d f 40 Benefits of possible choices at each round Round 1 Round 2 g h 10 50 x 5 = 250 b 1 25 x 5 = 125 c 25 x 2 = 50 Lattice with space costs 80 x 2 = 160 d 30 x 2 = 60 20 x 3 = 60 e 70 x 3 =210 60 x 2 =120 60 + 10 = 70 f • Round 2 : views b, d, e, g, h start with cost 50 99 x 1 = 99 g 49 x 1 = 49 and the other views start with cost 100 90 x 1 = 90 h 40 x 1 = 40 • Which queries can be improved by view f? End of round 2: view f selected • Benefit of view f? Ans: (100-40)+(50-40)=70 • How about the benefit of view c? Aalborg University 2008 - DWML course 15

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend