DW Performance Optimization (I) Slides adapted from Torben Bach - PowerPoint PPT Presentation

DW Performance Optimization (I) Slides adapted from Torben Bach Pedersen

Overview • Choosing Aggregates • Maintaining Views • Bitmap Indices Aalborg University 2008 - DWML course 2

Aggregates • Observations � DW queries follow the same “schema” � Aggregate (GROUP-BY) queries • Idea � Compute and store query results in advance (pre-aggregation) ◆ Example: store “total sales per month and product” � Yields large performance improvements (e.g., factor 1000) � No need to store everything – reuse whenever possible ◆ Example: quarterly total can be quickly computed from monthly total • Prerequisites � Tree-structured dimensions with fixed height � Many-to-one relationships from fact to dimensions � Facts mapped to bottom level in all dimensions Aalborg University 2008 - DWML course 3

Aggregate Use Example Sales • Imagine 1 billion sales rows, 1000 products, 100 locations tid pid locid sales 1 1 1 10 • CREATE VIEW TotalSales (pid,locid,total) AS 2 1 1 20 SELECT s.pid,s.locid,SUM(s.sales) 3 2 3 40 FROM Sales s … … … … GROUP BY s.pid,s.locid 1 billion rows • Pre-compute the above view and store the result into a table (e.g., TotalSales with 100,000 rows) • Rewrite the query to use the view TotalSales SELECT p.category,SUM(s.sales) FROM Products p, � Sales s WHERE p.pid=s.pid GROUP BY p.category pid locid sales 1 1 30 ◆ can be rewritten to 2 3 40 SELECT p.category,SUM(t.total) FROM Products p, � TotalSales t WHERE p.pid=t.pid GROUP BY p.category … … … ◆ Query becomes 10,000 times faster! 100,000 rows Aalborg University 2008 - DWML course 4

Pre-Aggregation Choices • Full pre-aggregation: (all combinations of levels) � Fast query response � Takes a lot of space/update time (200-500 times raw data) • No pre-aggregation � Slow query response (for terabytes…) • Practical pre-aggregation: chosen combinations � A good compromise between response time and space use � The approach supported by ROLAP tools ◆ IBM DB2 UDB ◆ Oracle 10g ◆ MS Analysis Services Aalborg University 2008 - DWML course 5

Using Aggregates • The term “aggregate” refers to materialized view • Application: aggregates used via aggregate navigator � A set of aggregates stored in the system (i.e., DBMS) � Given a query, the best aggregate is found , and the query is rewritten to use it � Done by the system, not by the user • Four design goals for aggregate usage � Aggregates stored separately from detail data � “ Shrunk ” dimensions mapped to aggregate facts � Connection between aggregates and detail data known by system � All queries (SQL) refers only detail data • SUM, MIN, MAX, COUNT, AVG functions can be handled � The AVG function requires a trick – Why and how? Aalborg University 2008 - DWML course 6

Choosing Aggregates • Practical pre-aggregation, decide what aggregates to store • Non-trivial (NP-complete) optimization problem � Space use � Update speed � Response time demands � …… • Choose an aggregate if it is considerably smaller than available, usable aggregates (factor 3-5-10) • “Implementing Data Cubes Efficiently” SIGMOD’96 paper � Greedy approach: simple but effective � Commerical DBMS now use similar, but more advanced techniques Aalborg University 2008 - DWML course 7

Data Cube The data cube stores multidimensional GROUP BY relations of tables in data warehouses mc Group By (with total) Aggregate c By color m Red none Sum White Database capable Blue By make and year By make 1991 1992 1993 F 1994 o r Sum d Sum C h e v y Data warehouse Cross Tab capable Chevy Ford By color Red White Blue By make and color By make By color and year Sum Aalborg University 2008 - DWML course 8

Lattice Framework • Each lattice node represents a view / query Dependence relationship: Q1 ≤ Q2 • Q1 is the descendant of Q2 � Q1 can be answered using only the results of Q2 � • In other words, Q1 is dependent on Q2 The ≤ operator imposes a partial ordering on the queries � • Top view = base data (i.e., most detail) • Essentially, the lattice captures dependency relationship among queries / views and can be represented by a lattice graph Aalborg University 2008 - DWML course 9

A Data Cube Example 1. part, supplier, customer (6M rows) Scenario: 2. part, customer (6M) A query asks for the sales of a part 3. part, supplier (0.8M) 4. supplier, customer (6M) a) If view pc is available, need to 5. part (0.2M) 19 M rows total process about 6M rows; OR 6. supplier (0.01M) b) If view p is available, need to 7. customer (0.1M) process about 0.2M rows 8. none (1) (I) – 8 possible views for 3 dimensions. Each view (III) – Picking the right views (IV) – View pc and sc not to materialize � improve gives the total sales as needed � Effective rows psc 6M per that grouping. performance reduced from 19M to 7M pc 6M ps 0.8M sc 6M p 0.2M s 0.01M c 0.1M Problem: Given that we have space S, what views to materialize for minimizing query costs? none 1 (II) – 8 views organized into a Lattice Aalborg University 2008 - DWML course 10

Cost Model Model simple and realistic, Cost of answering a query based on Enable the design and analysis a view and its major assumptions of powerful algorithms Time to answer a query is equal to 1 An illustration: the space occupied by the query from which the query is answered • To answer query Q, choose an ancestor of Q, say, Qa, that has been materialized • We thus need to process the table of Qa All queries are identical to some 2 queries in the given lattice • Cost of answering Q: number of rows in the table of Qa Size S Time T Ratio m Source Experimental validation of the cost From cell itself 2.07 - 1 model: almost linear relationship From view s 2.38 .000031 10,000 between size and running time … From view ps 20.77 .000023 0.8M From view psc 226.23 .000037 6M Aalborg University 2008 - DWML course 11

Greedy Approach • We plan to solve the problem by the greedy approach � Being as greedy as possible in each step! � Now give you a taste of this approach …… • Example: Pay 5 kroner with the smallest number of coins � Suppose we have many coins of 2 kroner, 1 krone, and 50 ores (i.e., 50 cents). • How to be greedy: � Select the largest coin: 2 kroner � Select the largest coin again: 2 kroner � Remaining amount = 5 – 2 – 2 = 1 krone � We cannot select the largest coin again � We choose the next largest coin 1 krone instead • Only 3 coins are needed Aalborg University 2008 - DWML course 12

Greedy Algorithm Greedy algorithm: each time choose the view with the maximum “benefit” Explanation: The Greedy algorithm • Given a data cube lattice with space costs S = {top view} associated with each view for i=1 to k do begin • Always include the top view because it cannot be generated from other views Select a view v not in S such that B(v,S) is maximized; • Given a number k, we are allowed to select k views (in addition to the top view) S = S union {v}; • B(v,S), the benefit of view v (relative to S), end; denotes the amount that v can improve the costs of evaluating views w, including itself return S; • B(v,S) = Σ view w max {Cost(w,S) – Cost(w,v), 0} only positive benefit taken into account Aalborg University 2008 - DWML course 13

Greedy Algorithm – Example 100 a 50 b C 75 30 e Benefits of possible choices in 1 st round 20 d f 40 Round 1 g h 10 1 50 x 5 = 250 b 25 x 5 = 125 c Lattice with space costs 80 x 2 = 160 d e 70 x 3 =210 Initial: top view a is automatically chosen 60 x 2 =120 f 99 x 1 = 99 g 90 x 1 = 90 h • Round 1 : each query answered by a, with cost 100 End of round 1: view b selected • Which queries can be answered by view c? • What is the benefit of view c? Ans: 5*(100-75)=125 • Can we find a better view than view c? How about b? Aalborg University 2008 - DWML course 14

Greedy Algorithm – Example 100 a Round 1: view b selected 50 b C 75 30 e 20 d f 40 Benefits of possible choices at each round Round 1 Round 2 g h 10 50 x 5 = 250 b 1 25 x 5 = 125 c 25 x 2 = 50 Lattice with space costs 80 x 2 = 160 d 30 x 2 = 60 20 x 3 = 60 e 70 x 3 =210 60 x 2 =120 60 + 10 = 70 f • Round 2 : views b, d, e, g, h start with cost 50 99 x 1 = 99 g 49 x 1 = 49 and the other views start with cost 100 90 x 1 = 90 h 40 x 1 = 40 • Which queries can be improved by view f? End of round 2: view f selected • Benefit of view f? Ans: (100-40)+(50-40)=70 • How about the benefit of view c? Aalborg University 2008 - DWML course 15

DW Performance Optimization (I) Slides adapted from Torben Bach - PowerPoint PPT Presentation

DW Performance Optimization (I) Slides adapted from Torben Bach Pedersen Overview Choosing Aggregates Maintaining Views Bitmap Indices Aalborg University 2008 - DWML course 2 Aggregates Observations DW queries follow

15-780: Optimization J. Zico Kolter March 14-16, 2015 1 Outline Introduction to optimization

Convex Optimization 4. Convex Optimization Problems Prof. Ying Cui Department of Electrical

P2P Combinatorial Optimization Amir H. Payberah (amir@sics.se) P2P Combinatorial Optimization, 13

PERFORMANCE OPTIMIZATION IN RED PERFORMANCE OPTIMIZATION IN RED HAT OPENSTACK PLATFORM HAT

PERFORMANCE OF PERFORMANCE OF OPTIMIZATION OPTIMIZATION ALGORITHMS ALGORITHMS FOR DERIVING

Performance Optimization on a Performance Optimization on a Supercomputer with cTuning and

Optimization of HPSG Grammar Implementations in Trale Georgiana Dinu Optimization of HPSG

Search Engine Optimization What is Search Engine Optimization Search Engine Optimization is the

Optimization Optimization Goal: Find the minimizer ! that minimizes the objective (cost)

Five Steps to Optimization Five Steps to Optimization Beyond Best Practices Beyond Best

St Stress Aware Layout Stress Aware Layout St A A L L t t Optimization Optimization

TEG: A New Post-Layout TEG: A New Post-Layout Optimization Method Optimization Method Shuo

Evolutionary Algorithm 2. Swarm Intelligence and Ant Colony Optimization Ant Colony Optimization

Optimization Process Done by an Optimization Algorithm Jose Rueda Torres Learning Objectives

Optimization (Introduction) Optimization Goal: Find the minimizer that minimizes the

CS675: Convex and Combinatorial Optimization Fall 2019 Convex Optimization Problems Instructor:

A Typeful Integration of SQL into Curry Michael Hanus University of Kiel Programming Languages

Ghaith Haddad, Faraz Hussain and Gary T. Leavens School of Electrical Engineering and Computer

CSCI-1101A Object-Oriented Programming (OOP) Mohammad T . Irfan Reading: Handout (also on

Ancestral selection graph meets lookdown construction Ellen Baake Biomathematics and Theoretical

Brief Intro to Linear and Integer Programming Marco Chiarandini Institut for Matematik og

Logistic Regression 1 The basics Michael Claudius, Associate Professor, Roskilde 31.03.2020

Contributions to Analysis and Functional Analysis in memoriam Pawe l Doma nski Dietmar

WELCOME New Faculty! Orientation Guide These materials are intended to orient new faculty to the