DW Performance Optimization (I)
Slides adapted from Torben Bach Pedersen
DW Performance Optimization (I) Slides adapted from Torben Bach - - PowerPoint PPT Presentation
DW Performance Optimization (I) Slides adapted from Torben Bach Pedersen Overview Choosing Aggregates Maintaining Views Bitmap Indices Aalborg University 2008 - DWML course 2 Aggregates Observations DW queries follow
Slides adapted from Torben Bach Pedersen
Aalborg University 2008 - DWML course 2
Aalborg University 2008 - DWML course 3
DW queries follow the same “schema” Aggregate (GROUP-BY) queries
Compute and store query results in advance (pre-aggregation)
◆ Example: store “total sales per month and product”
Yields large performance improvements (e.g., factor 1000) No need to store everything – reuse whenever possible
◆ Example: quarterly total can be quickly computed from monthly total
Tree-structured dimensions with fixed height Many-to-one relationships from fact to dimensions Facts mapped to bottom level in all dimensions
Aalborg University 2008 - DWML course 4
locations
SELECT s.pid,s.locid,SUM(s.sales) FROM Sales s GROUP BY s.pid,s.locid
a table (e.g., TotalSales with 100,000 rows)
Sales s WHERE p.pid=s.pid GROUP BY p.category
◆ can be rewritten to
TotalSales t WHERE p.pid=t.pid GROUP BY p.category
◆ Query becomes 10,000 times faster!
40 3 2 3 … … … … 1 1 locid 2 10 1 1 20 1 sales pid tid
Sales
40 3 2 … … … 1 locid 30 1 sales pid
TotalSales
1 billion rows 100,000 rows
Aalborg University 2008 - DWML course 5
Fast query response Takes a lot of space/update time (200-500 times raw data)
Slow query response (for terabytes…)
A good compromise between response time and space use The approach supported by ROLAP tools
◆ IBM DB2 UDB ◆ Oracle 10g ◆ MS Analysis Services
Aalborg University 2008 - DWML course 6
A set of aggregates stored in the system (i.e., DBMS) Given a query, the best aggregate is found, and the query is
rewritten to use it
Done by the system, not by the user
Aggregates stored separately from detail data “Shrunk” dimensions mapped to aggregate facts Connection between aggregates and detail data known by system All queries (SQL) refers only detail data
The AVG function requires a trick – Why and how?
Aalborg University 2008 - DWML course 7
Space use Update speed Response time demands ……
Greedy approach: simple but effective Commerical DBMS now use similar, but more advanced techniques
Aalborg University 2008 - DWML course 8
By make and color By color and year By make and year
F
d C h e v y 1991 1992 1993 1994 Sum
Cross Tab
Chevy Ford By color
Sum
Red White Blue By make
Group By (with total)
By color
Sum
Red White Blue By make
Database capable Data warehouse capable Aggregate Sum mc m c none
Aalborg University 2008 - DWML course 9
queries / views and can be represented by a lattice graph
Aalborg University 2008 - DWML course 10 psc 6M pc 6M ps 0.8M sc 6M p 0.2M s 0.01M c 0.1M none 1
Scenario: A query asks for the sales of a part a) If view pc is available, need to process about 6M rows; OR b) If view p is available, need to process about 0.2M rows
(I) – 8 possible views for 3 dimensions. Each view gives the total sales as per that grouping. 1. part, supplier, customer (6M rows) 2. part, customer (6M) 3. part, supplier (0.8M) 4. supplier, customer (6M) 5. part (0.2M) 6. supplier (0.01M) 7. customer (0.1M) 8. none (1) (II) – 8 views organized into a Lattice
Problem: Given that we have space S, what views to materialize for minimizing query costs?
(III) – Picking the right views to materialize improve performance (IV) – View pc and sc not needed Effective rows reduced from 19M to 7M 19 M rows total
Aalborg University 2008 - DWML course 11
Cost of answering a query based on a view and its major assumptions
Time to answer a query is equal to the space occupied by the query from which the query is answered
All queries are identical to some queries in the given lattice
An illustration:
Q, say, Qa, that has been materialized
table of Qa
Model simple and realistic, Enable the design and analysis
Experimental validation of the cost model: almost linear relationship between size and running time …
Source Size S Time T Ratio m From cell itself 1 2.07
10,000 2.38 .000031 From view ps 0.8M 20.77 .000023 From view psc 6M 226.23 .000037
Aalborg University 2008 - DWML course 12
Being as greedy as possible in each step! Now give you a taste of this approach ……
Suppose we have many coins of 2 kroner, 1 krone, and 50 ores
(i.e., 50 cents).
Select the largest coin: 2 kroner Select the largest coin again: 2 kroner Remaining amount = 5 – 2 – 2 = 1 krone We cannot select the largest coin again We choose the next largest coin 1 krone instead
Aalborg University 2008 - DWML course 13
Greedy algorithm: each time choose the view with the maximum “benefit”
associated with each view
be generated from other views
views (in addition to the top view)
denotes the amount that v can improve the costs
Explanation: The Greedy algorithm S = {top view} for i=1 to k do begin Select a view v not in S such that B(v,S) is maximized; S = S union {v}; end; return S;
Aalborg University 2008 - DWML course 14
Benefits of possible choices in 1st round
Round 1 b c d e f g h 50 x 5 = 250 25 x 5 = 125 80 x 2 = 160 70 x 3 =210 60 x 2 =120 99 x 1 = 99 90 x 1 = 90
100 a C 75 50 b 20 d f 40 h 10 g 1 30 e
Lattice with space costs
Initial: top view a is automatically chosen
End of round 1: view b selected
Aalborg University 2008 - DWML course 15
Benefits of possible choices at each round
Round 1 Round 2 b c d e f g h 50 x 5 = 250 25 x 5 = 125 80 x 2 = 160 70 x 3 =210 60 x 2 =120 99 x 1 = 99 90 x 1 = 90 25 x 2 = 50 30 x 2 = 60 20 x 3 = 60 60 + 10 = 70 49 x 1 = 49 40 x 1 = 40
100 a C 75 50 b 20 d f 40 h 10 g 1 30 e
Lattice with space costs
and the other views start with cost 100
Round 1: view b selected End of round 2: view f selected
Aalborg University 2008 - DWML course 16
Benefits of possible choices at each round
Round 1 Round 2 Round 3 b c d e f g h 50 x 5 = 250 25 x 5 = 125 80 x 2 = 160 70 x 3 =210 60 x 2 =120 99 x 1 = 99 90 x 1 = 90 25 x 2 = 50 30 x 2 = 60 20 x 3 = 60 60 + 10 = 70 49 x 1 = 49 40 x 1 = 40 25 x 1 = 25 30 x 2 = 60 2 x 20 + 10 =50 49 x 1 = 49 30 x 1 = 30
100 a C 75 50 b 20 d f 40 h 10 g 1 30 e
Lattice with space costs
maximum “benefit”
views f and h start with cost 40; the other views start with cost 100
Aalborg University 2008 - DWML course 17
Situations where the algorithm does poorly … The benefit of Greedy is guaranteed to be at least 63% of the benefit of the optimal algorithm (see the
A Lattice not good for Greedy
200 a d 100 100 b 20 Nodes total 1000 c 99 20 Nodes total 1000 20 Nodes total 1000 20 Nodes total 1000
nodes below resulting in total benefits of 8200
Aalborg University 2008 - DWML course 18
Greedy order of view selection example
Selection Benefit Space 1 2 3 4 5 6 cp 7 8 9 10 11 12 ns nt c p cs np ct t n s none infinite 24M rows 12M 5.9M 5.8M 1M 1M 0.01M small small small small Time 72M rows 48M 36M 30.1M 24.3M 23.3M 23.3M 23.3M 23.3M 23.3M 23.3M 23.3M 23.3M 23.3M 23.3M 23.3M 23.3M 6M rows 6M 6M 6.1M 11.3M 16.3M 6.3M
Experiment with a lattice: Important to materialize only the first few (5) views, performance increases at first…
30 20 10 70 60 50 40 80 2 4 6 8 10 12 Time / Space Views Total Time Total Space
Aalborg University 2008 - DWML course 19
Two situations where the algorithm is optimal …
to optimal
greedy is optimal
… but there are also two situations where the algorithm is not realistic
Views in a lattice are unlikely to have the same probability
Instead of asking for some fixed number of views to materialize, should instead allocate a fixed amount of space to views
Aalborg University 2008 - DWML course 20
The size of views grows exponentially, until it reaches the size of the raw data at rank [logr m], i.e. the cliff
Number of group-by attributes n Size of views m Log r m
Effect of the number of group-by attributes on the size of views
data point and m of the cells with be nonempty. Use m as the size of any view.
points can be collapsed into one aggregate Assumptions and basis of reasoning
This explains why grouping of 2 attributes (p,c), (s,c) have the same size as (p,s,c) at 6 M rows in the figure on slide #10
Aalborg University 2008 - DWML course 21
What is the average time for a query when the space is optimal ?
the top view is materialized
is m2n Is there sense to minimize time by materializing all views ?
depends on
1.
Rank k = [logr m] at which the cliff occurs; and
2.
Rank j, such that rj = (n
j) is
maximized
Aalborg University 2008 - DWML course 22
Hierarchies are important as they underlie two commonly used query operations, Drill-down and Roll-up …
Day Week Month Year none
A common hierarchy … … and its dependency relations
… but, hierarchies introduce query dependencies that must be accounted for when determining which queries to materialize; and this can be complex
Aalborg University 2008 - DWML course 23
Dependencies caused by different dimensions and attribute hierarchies can be combined into a Direct Product lattice …
… assuming views can be created by independently grouping any or no member of the hierarchy for each dimension
Combining two hierarchical dimensions
customer nation none 0.1M 25 1 product size type none 6M 1 50 150
A Direct Product lattice
c p 6 M c . 1 M c s 5 M n p 5 M n t 3 7 5 n 2 5 n s 1 2 5 p . 2 M t 1 5 n
e 1 s 5 c t 5 . 9 9 M
Aalborg University 2008 - DWML course 24
Product P2 Type P1 None P0
Product: Product (P2) Type (P1) ALL (P0=∅) Store: Store (S2) Area (S1) ALL (S0=∅)
How many nodes does the lattice have? Enumerate them Connect the nodes based on their dependency
Store S2 Area S1 None S0 none P2 S2
???
Aalborg University 2008 - DWML course 25
A clean framework to work with dimensional hierarchies, since
hierarchies are themselves lattices
Dependency relationship among queries captured by the lattice
A simple descending-order topological sort on lattice nodes gives
the required order of materialization – more in the next lecture
Aalborg University 2008 - DWML course 26
Aalborg University 2008 - DWML course 27
Aalborg University 2008 - DWML course 28
Too expensive
Apply only changes since last refresh to view Insert new rows Ri and delete existing rows Rd Update = delete + insert
To make the views self-maintainable Store “number of instances” cv along with each row v in V
Aalborg University 2008 - DWML course 29
delete from V if cr=0
◆ Insert the tuple (3,7) into R ◆ Insert the tuple (7,9) into R
◆ Delete the tuple (3,4) from R ◆ Delete the tuple (6,9) from R
9 6 4 5 4 3 3 3 2 1 y x 6 (1) 5 (1) 3 (2) 1 (1) x (c)
R V
Aalborg University 2008 - DWML course 30
S
S and add to V, update counts
S and subtract from V, update counts
S
◆ Insert the tuple (3) into R ◆ Insert the tuple (4) into R ◆ Insert the tuple (7) into R
◆ Delete the tuple (2) from R ◆ Delete the tuple (3) from R ◆ Delete the tuple (7) from S
9 7 3 3 2 y 9 (1) 3 (2) 2 (1) y (c)
V
9 4 4 3 2 y
R S
Aalborg University 2008 - DWML course 31
Like projection view, but we maintain a tuple for each group g
◆ The group g can contain multiple attributes ◆ E.g., GROUP BY color, year; an instance of g = blue, 1993
Maintain <g,count> Update count based on inserts/deletes Insert <g,1> rows for new values, delete rows from V with count=0
Maintain <g,sum,count> Update count and sum based on inserts/deletes Insert <g,a,1> rows for new values, delete rows from V with count=0
Maintain <g,minimum,count> Insert (g,m): update minimum,count based on m =,<,> minimum Delete (g,m): update minimum,count based on m =,<,> minimum If count=0, then we need to search the base table (expensive). Why?
Aalborg University 2008 - DWML course 32
Immediate – in same transaction as base change Lazy – when V is used for the first time after base updates Periodic, e.g., once a day, often together with base load Forced – after a certain number of changes
Computation outside DBMS in flat files (still relevant?) Built by loader Computation in DBMS using SQL Can be expensive: DBMS must be tuned for this
Oracle, SQL Server, DB2
Aalborg University 2008 - DWML course 33
Index on dimension tables and on materialized views
Build primary B+-tree index on dimension keys (primary key) ? Build indices on each dimension key separately (index intersection) Indices on combinations of dimension keys? (many!)
Possible to compress data (values not repeated) Can save aggregates due to fast sequential scan Best sort order (almost) always “Time”
Build indices on many/all individual columns (think of type) Build indices on common combinations
Efficient for un-sorted data (e.g., color)
Aalborg University 2008 - DWML course 34
A RowID takes ~8 bytes Large space use for columns with low cardinality (gender, color) Example: Index for 1 bio. rows with gender takes 8 GB Not efficient to do “index intersection” for these columns
Female: 01110010101010… Male:
10001101010101…
Takes only (num. of values)*(num. of rows)*1 bit Example: bitmap index on gender (as before) takes only 256 MB Very efficient to do “index intersection” (AND/OR) on bitmaps
◆ Intersection of 32 bits done in a single CPU instruction (word length=32)
Can be improved for for higher cardinality using compression
Aalborg University 2008 - DWML course 35
Find female customers in Aalborg with black hair and blue eyes Female: 01010101010 Aalborg: 00000011111 Black:
10110110110
Blue:
01101101111
Result
00000000010 – use AND, only one such customer
Find … Salary BETWEEN 22000 AND 29000 Use the binning technique, i.e., group every C values into a bin
◆ E.g., group every 5000 values into a bin, and assign a bitmap for it ◆ Bitmap for [20000-25000): 001001001 ◆ Bitmap for [25000-30000): 010010010 ◆ OR together: 011011011 (Why is it OR but not AND?) ◆ Refinement step: follow those records and check their actual salaries
Tradeoff between storage size and index effectiveness
Do we need to consider the bitmap for [15000-20000) ?
Aalborg University 2008 - DWML course 36
With m possible values and n records: n*m bits required However, probability of a 1 is 1/m => very few 1’s
Run-length encoding A run is i 0’s followed by a 1 Concatenating binary numbers won’t work – no unique decoding Instead, determine j – number of bits in binary representation of i Encode as “<j-1 1’s>”+“0”+“<i in binary>” Encode next run similarly, trailing 0’s not encoded Example: 000000010000 encoded as 110111 j>1 => first bit of i is 1 – this bit can be saved – encoded as 11011 Decoding: scan bits to find j, scan next j-1 bits to find i, find next 1 Example: j = 3, i = 7 (11) => bitmap = 00000001 + 0000 (trailing)
Aalborg University 2008 - DWML course 37
Assume m=n (unique values) Each value has just one run of length i < n Each run takes at most 2 log2 n bits (j <= log2 n) Total space consumption: 2n log2 n bits (compared to n2)
Decompress one run at a time and produce relevant 1’s in output
Index with B-trees + store in blocks/block chains
Use secondary or primary index (hashing or B-tree)
Deletion: “retire” record number + update bitmaps with 1’s Insertion: add new record to file + update bitmaps with 1’s (trail 0’s) Updates: update bitmaps with old and new 1’s (expensive)
Aalborg University 2008 - DWML course 38
at the mini-project website
◆ SQL Server Management Studio ◆ Connect to Analysis Services ◆ Find your cube and ……
Aalborg University 2008 - DWML course 39