DW Performance Optimization (I) Slides adapted from Torben Bach - - PowerPoint PPT Presentation

dw performance optimization i
SMART_READER_LITE
LIVE PREVIEW

DW Performance Optimization (I) Slides adapted from Torben Bach - - PowerPoint PPT Presentation

DW Performance Optimization (I) Slides adapted from Torben Bach Pedersen Overview Choosing Aggregates Maintaining Views Bitmap Indices Aalborg University 2008 - DWML course 2 Aggregates Observations DW queries follow


slide-1
SLIDE 1

DW Performance Optimization (I)

Slides adapted from Torben Bach Pedersen

slide-2
SLIDE 2

Aalborg University 2008 - DWML course 2

Overview

  • Choosing Aggregates
  • Maintaining Views
  • Bitmap Indices
slide-3
SLIDE 3

Aalborg University 2008 - DWML course 3

Aggregates

  • Observations

DW queries follow the same “schema” Aggregate (GROUP-BY) queries

  • Idea

Compute and store query results in advance (pre-aggregation)

◆ Example: store “total sales per month and product”

Yields large performance improvements (e.g., factor 1000) No need to store everything – reuse whenever possible

◆ Example: quarterly total can be quickly computed from monthly total

  • Prerequisites

Tree-structured dimensions with fixed height Many-to-one relationships from fact to dimensions Facts mapped to bottom level in all dimensions

slide-4
SLIDE 4

Aalborg University 2008 - DWML course 4

Aggregate Use Example

  • Imagine 1 billion sales rows, 1000 products, 100

locations

  • CREATE VIEW TotalSales (pid,locid,total) AS

SELECT s.pid,s.locid,SUM(s.sales) FROM Sales s GROUP BY s.pid,s.locid

  • Pre-compute the above view and store the result into

a table (e.g., TotalSales with 100,000 rows)

  • Rewrite the query to use the view
  • SELECT p.category,SUM(s.sales) FROM Products p,

Sales s WHERE p.pid=s.pid GROUP BY p.category

◆ can be rewritten to

  • SELECT p.category,SUM(t.total) FROM Products p,

TotalSales t WHERE p.pid=t.pid GROUP BY p.category

◆ Query becomes 10,000 times faster!

40 3 2 3 … … … … 1 1 locid 2 10 1 1 20 1 sales pid tid

Sales

40 3 2 … … … 1 locid 30 1 sales pid

TotalSales

1 billion rows 100,000 rows

slide-5
SLIDE 5

Aalborg University 2008 - DWML course 5

Pre-Aggregation Choices

  • Full pre-aggregation: (all combinations of levels)

Fast query response Takes a lot of space/update time (200-500 times raw data)

  • No pre-aggregation

Slow query response (for terabytes…)

  • Practical pre-aggregation: chosen combinations

A good compromise between response time and space use The approach supported by ROLAP tools

◆ IBM DB2 UDB ◆ Oracle 10g ◆ MS Analysis Services

slide-6
SLIDE 6

Aalborg University 2008 - DWML course 6

Using Aggregates

  • The term “aggregate” refers to materialized view
  • Application: aggregates used via aggregate navigator

A set of aggregates stored in the system (i.e., DBMS) Given a query, the best aggregate is found, and the query is

rewritten to use it

Done by the system, not by the user

  • Four design goals for aggregate usage

Aggregates stored separately from detail data “Shrunk” dimensions mapped to aggregate facts Connection between aggregates and detail data known by system All queries (SQL) refers only detail data

  • SUM, MIN, MAX, COUNT, AVG functions can be handled

The AVG function requires a trick – Why and how?

slide-7
SLIDE 7

Aalborg University 2008 - DWML course 7

Choosing Aggregates

  • Practical pre-aggregation, decide what aggregates to store
  • Non-trivial (NP-complete) optimization problem

Space use Update speed Response time demands ……

  • Choose an aggregate if it is considerably smaller than

available, usable aggregates (factor 3-5-10)

  • “Implementing Data Cubes Efficiently” SIGMOD’96 paper

Greedy approach: simple but effective Commerical DBMS now use similar, but more advanced techniques

slide-8
SLIDE 8

Aalborg University 2008 - DWML course 8

Data Cube

By make and color By color and year By make and year

F

  • r

d C h e v y 1991 1992 1993 1994 Sum

The data cube stores multidimensional GROUP BY relations

  • f tables in data warehouses

Cross Tab

Chevy Ford By color

Sum

Red White Blue By make

Group By (with total)

By color

Sum

Red White Blue By make

Database capable Data warehouse capable Aggregate Sum mc m c none

slide-9
SLIDE 9

Aalborg University 2008 - DWML course 9

Lattice Framework

  • Each lattice node represents a view / query
  • Dependence relationship: Q1 ≤ Q2
  • Q1 is the descendant of Q2
  • Q1 can be answered using only the results of Q2
  • In other words, Q1 is dependent on Q2
  • The ≤ operator imposes a partial ordering on the queries
  • Top view = base data (i.e., most detail)
  • Essentially, the lattice captures dependency relationship among

queries / views and can be represented by a lattice graph

slide-10
SLIDE 10

Aalborg University 2008 - DWML course 10 psc 6M pc 6M ps 0.8M sc 6M p 0.2M s 0.01M c 0.1M none 1

A Data Cube Example

Scenario: A query asks for the sales of a part a) If view pc is available, need to process about 6M rows; OR b) If view p is available, need to process about 0.2M rows

(I) – 8 possible views for 3 dimensions. Each view gives the total sales as per that grouping. 1. part, supplier, customer (6M rows) 2. part, customer (6M) 3. part, supplier (0.8M) 4. supplier, customer (6M) 5. part (0.2M) 6. supplier (0.01M) 7. customer (0.1M) 8. none (1) (II) – 8 views organized into a Lattice

Problem: Given that we have space S, what views to materialize for minimizing query costs?

(III) – Picking the right views to materialize improve performance (IV) – View pc and sc not needed Effective rows reduced from 19M to 7M 19 M rows total

slide-11
SLIDE 11

Aalborg University 2008 - DWML course 11

Cost of answering a query based on a view and its major assumptions

Cost Model

Time to answer a query is equal to the space occupied by the query from which the query is answered

1

All queries are identical to some queries in the given lattice

2

An illustration:

  • To answer query Q, choose an ancestor of

Q, say, Qa, that has been materialized

  • We thus need to process the table of Qa
  • Cost of answering Q: number of rows in the

table of Qa

Model simple and realistic, Enable the design and analysis

  • f powerful algorithms

Experimental validation of the cost model: almost linear relationship between size and running time …

Source Size S Time T Ratio m From cell itself 1 2.07

  • From view s

10,000 2.38 .000031 From view ps 0.8M 20.77 .000023 From view psc 6M 226.23 .000037

slide-12
SLIDE 12

Aalborg University 2008 - DWML course 12

Greedy Approach

  • We plan to solve the problem by the greedy approach

Being as greedy as possible in each step! Now give you a taste of this approach ……

  • Example: Pay 5 kroner with the smallest number of coins

Suppose we have many coins of 2 kroner, 1 krone, and 50 ores

(i.e., 50 cents).

  • How to be greedy:

Select the largest coin: 2 kroner Select the largest coin again: 2 kroner Remaining amount = 5 – 2 – 2 = 1 krone We cannot select the largest coin again We choose the next largest coin 1 krone instead

  • Only 3 coins are needed
slide-13
SLIDE 13

Aalborg University 2008 - DWML course 13

Greedy Algorithm

Greedy algorithm: each time choose the view with the maximum “benefit”

  • Given a data cube lattice with space costs

associated with each view

  • Always include the top view because it cannot

be generated from other views

  • Given a number k, we are allowed to select k

views (in addition to the top view)

  • B(v,S), the benefit of view v (relative to S),

denotes the amount that v can improve the costs

  • f evaluating views w, including itself
  • B(v,S) = Σview w max {Cost(w,S) – Cost(w,v), 0}
  • nly positive benefit taken into account

Explanation: The Greedy algorithm S = {top view} for i=1 to k do begin Select a view v not in S such that B(v,S) is maximized; S = S union {v}; end; return S;

slide-14
SLIDE 14

Aalborg University 2008 - DWML course 14

Greedy Algorithm – Example

Benefits of possible choices in 1st round

Round 1 b c d e f g h 50 x 5 = 250 25 x 5 = 125 80 x 2 = 160 70 x 3 =210 60 x 2 =120 99 x 1 = 99 90 x 1 = 90

100 a C 75 50 b 20 d f 40 h 10 g 1 30 e

Lattice with space costs

Initial: top view a is automatically chosen

End of round 1: view b selected

  • Round 1: each query answered by a, with cost 100
  • Which queries can be answered by view c?
  • What is the benefit of view c? Ans: 5*(100-75)=125
  • Can we find a better view than view c? How about b?
slide-15
SLIDE 15

Aalborg University 2008 - DWML course 15

Greedy Algorithm – Example

Benefits of possible choices at each round

Round 1 Round 2 b c d e f g h 50 x 5 = 250 25 x 5 = 125 80 x 2 = 160 70 x 3 =210 60 x 2 =120 99 x 1 = 99 90 x 1 = 90 25 x 2 = 50 30 x 2 = 60 20 x 3 = 60 60 + 10 = 70 49 x 1 = 49 40 x 1 = 40

100 a C 75 50 b 20 d f 40 h 10 g 1 30 e

Lattice with space costs

  • Round 2: views b, d, e, g, h start with cost 50

and the other views start with cost 100

  • Which queries can be improved by view f?
  • Benefit of view f? Ans: (100-40)+(50-40)=70
  • How about the benefit of view c?

Round 1: view b selected End of round 2: view f selected

slide-16
SLIDE 16

Aalborg University 2008 - DWML course 16

Greedy Algorithm – Example

Benefits of possible choices at each round

Round 1 Round 2 Round 3 b c d e f g h 50 x 5 = 250 25 x 5 = 125 80 x 2 = 160 70 x 3 =210 60 x 2 =120 99 x 1 = 99 90 x 1 = 90 25 x 2 = 50 30 x 2 = 60 20 x 3 = 60 60 + 10 = 70 49 x 1 = 49 40 x 1 = 40 25 x 1 = 25 30 x 2 = 60 2 x 20 + 10 =50 49 x 1 = 49 30 x 1 = 30

100 a C 75 50 b 20 d f 40 h 10 g 1 30 e

Lattice with space costs

  • Round 1: view b selected
  • Round 2: view f selected
  • End of round 3: view d selected
  • After 3 rounds, total cost is 420 (initial cost 800)
  • In summary, each round pick the view with the

maximum “benefit”

  • Round 3: views b, d, e, g start with cost 50;

views f and h start with cost 40; the other views start with cost 100

  • How about the benefit of view e now?
slide-17
SLIDE 17

Aalborg University 2008 - DWML course 17

Greedy Algorithm vs Optimal Choice

Situations where the algorithm does poorly … The benefit of Greedy is guaranteed to be at least 63% of the benefit of the optimal algorithm (see the

  • riginal paper)

A Lattice not good for Greedy

200 a d 100 100 b 20 Nodes total 1000 c 99 20 Nodes total 1000 20 Nodes total 1000 20 Nodes total 1000

  • Round 1: Picks c whose benefit is 4141
  • Round 2: Picks b or d with benefit of 2100 each
  • Greedy results in 4141+2100 = 6241
  • But, the optimal choice is to pick b and d
  • b and d improve by 100 for itself and all 80

nodes below resulting in total benefits of 8200

  • Ratio of greedy/optimal=6241/8200=76%
slide-18
SLIDE 18

Aalborg University 2008 - DWML course 18

Greedy Algorithm – Space-Time Trade-Off

Greedy order of view selection example

Selection Benefit Space 1 2 3 4 5 6 cp 7 8 9 10 11 12 ns nt c p cs np ct t n s none infinite 24M rows 12M 5.9M 5.8M 1M 1M 0.01M small small small small Time 72M rows 48M 36M 30.1M 24.3M 23.3M 23.3M 23.3M 23.3M 23.3M 23.3M 23.3M 23.3M 23.3M 23.3M 23.3M 23.3M 6M rows 6M 6M 6.1M 11.3M 16.3M 6.3M

Experiment with a lattice: Important to materialize only the first few (5) views, performance increases at first…

30 20 10 70 60 50 40 80 2 4 6 8 10 12 Time / Space Views Total Time Total Space

slide-19
SLIDE 19

Aalborg University 2008 - DWML course 19

Optimal Cases and Anomalies

Two situations where the algorithm is optimal …

If a1 is much larger than the

  • ther a’s, the greedy is close

to optimal

If all the a’s are equal then

greedy is optimal

… but there are also two situations where the algorithm is not realistic

Views in a lattice are unlikely to have the same probability

  • f being requested in a query

Instead of asking for some fixed number of views to materialize, should instead allocate a fixed amount of space to views

slide-20
SLIDE 20

Aalborg University 2008 - DWML course 20

The size of views grows exponentially, until it reaches the size of the raw data at rank [logr m], i.e. the cliff

Number of group-by attributes n Size of views m Log r m

Effect of the number of group-by attributes on the size of views

  • Each domain size is r
  • Top element has m cells appearing in raw data
  • If group on i attributes, cube has ri cells
  • If ri ≥ m, then each cell will have at most one

data point and m of the cells with be nonempty. Use m as the size of any view.

  • If ri <m, then almost all ri cells will have at least
  • ne data point. Space cost is ri as several data

points can be collapsed into one aggregate Assumptions and basis of reasoning

This explains why grouping of 2 attributes (p,c), (s,c) have the same size as (p,s,c) at 6 M rows in the figure on slide #10

Hypercube Lattices – Observations

slide-21
SLIDE 21

Aalborg University 2008 - DWML course 21

Space-Optimal and Time-Optimal Solutions

What is the average time for a query when the space is optimal ?

  • Space is minimized when only

the top view is materialized

  • Every query would take time m
  • Total time cost for all 2n queries

is m2n Is there sense to minimize time by materializing all views ?

  • No gain past the cliff
  • No point to do so
  • Nature of time-optimal solution

depends on

1.

Rank k = [logr m] at which the cliff occurs; and

2.

Rank j, such that rj = (n

j) is

maximized

slide-22
SLIDE 22

Aalborg University 2008 - DWML course 22

Hierarchies are important as they underlie two commonly used query operations, Drill-down and Roll-up …

Day Week Month Year none

  • Year ≤ Month ≤ Day
  • Week ≤ Day; but
  • Month ≤ week; week ≤ month

A common hierarchy … … and its dependency relations

… but, hierarchies introduce query dependencies that must be accounted for when determining which queries to materialize; and this can be complex

Hierarchies & the Lattice Framework

slide-23
SLIDE 23

Aalborg University 2008 - DWML course 23

Dependencies caused by different dimensions and attribute hierarchies can be combined into a Direct Product lattice …

Composite Lattices

… assuming views can be created by independently grouping any or no member of the hierarchy for each dimension

Combining two hierarchical dimensions

+

customer nation none 0.1M 25 1 product size type none 6M 1 50 150

A Direct Product lattice

=

c p 6 M c . 1 M c s 5 M n p 5 M n t 3 7 5 n 2 5 n s 1 2 5 p . 2 M t 1 5 n

  • n

e 1 s 5 c t 5 . 9 9 M

slide-24
SLIDE 24

Aalborg University 2008 - DWML course 24

Exercise for Composite Lattice +

Product P2 Type P1 None P0

  • Consider hierarchies of two dimensions

Product: Product (P2) Type (P1) ALL (P0=∅) Store: Store (S2) Area (S1) ALL (S0=∅)

  • Draw the composite lattice

How many nodes does the lattice have? Enumerate them Connect the nodes based on their dependency

Store S2 Area S1 None S0 none P2 S2

???

slide-25
SLIDE 25

Aalborg University 2008 - DWML course 25

Applicability of the Lattice Framework

Advantages of the lattice framework

  • Clean framework

A clean framework to work with dimensional hierarchies, since

hierarchies are themselves lattices

  • Easy to model dependencies

Dependency relationship among queries captured by the lattice

  • Order of materialization

A simple descending-order topological sort on lattice nodes gives

the required order of materialization – more in the next lecture

slide-26
SLIDE 26

Aalborg University 2008 - DWML course 26

Summary on Choosing Aggregates

  • 1. Lattice framework
  • 2. Linear cost model
  • 3. Greedy algorithm (How it works?)
  • 4. Space-time trade-off
slide-27
SLIDE 27

Aalborg University 2008 - DWML course 27

Maintaining Views

slide-28
SLIDE 28

Aalborg University 2008 - DWML course 28

View Maintenance

  • How and when should we refresh materialized views?
  • Total re-computation

Too expensive

  • Incremental view maintenance

Apply only changes since last refresh to view Insert new rows Ri and delete existing rows Rd Update = delete + insert

  • Additional info must be stored in views

To make the views self-maintainable Store “number of instances” cv along with each row v in V

slide-29
SLIDE 29

Aalborg University 2008 - DWML course 29

View Maintenance

  • Projection views (with DISTINCT in SQL)
  • V=Π(R)
  • (r,c): r appears “c” times in R
  • If (r,ci)∈Π(Ri) and (r,cr)∈V then cr=cr+ci,
  • therwise insert (r,ci) into V
  • If (r,cd)∈Π(Rd) and (r,cr)∈V then cr=cr–cd,

delete from V if cr=0

  • Example
  • V=Πx(R)
  • How does V change after these operations?
  • Insertion

◆ Insert the tuple (3,7) into R ◆ Insert the tuple (7,9) into R

  • Deletion

◆ Delete the tuple (3,4) from R ◆ Delete the tuple (6,9) from R

9 6 4 5 4 3 3 3 2 1 y x 6 (1) 5 (1) 3 (2) 1 (1) x (c)

R V

slide-30
SLIDE 30

Aalborg University 2008 - DWML course 30

View Maintenance

  • Join views V=R

S

  • Compute Ri

S and add to V, update counts

  • Compute Rd

S and subtract from V, update counts

  • Example
  • V=R

S

  • How does V change after these operations?
  • Insertion

◆ Insert the tuple (3) into R ◆ Insert the tuple (4) into R ◆ Insert the tuple (7) into R

  • Deletion

◆ Delete the tuple (2) from R ◆ Delete the tuple (3) from R ◆ Delete the tuple (7) from S

9 7 3 3 2 y 9 (1) 3 (2) 2 (1) y (c)

V

9 4 4 3 2 y

R S

slide-31
SLIDE 31

Aalborg University 2008 - DWML course 31

Aggregation View Maintenance

  • COUNT

Like projection view, but we maintain a tuple for each group g

◆ The group g can contain multiple attributes ◆ E.g., GROUP BY color, year; an instance of g = blue, 1993

Maintain <g,count> Update count based on inserts/deletes Insert <g,1> rows for new values, delete rows from V with count=0

  • SUM

Maintain <g,sum,count> Update count and sum based on inserts/deletes Insert <g,a,1> rows for new values, delete rows from V with count=0

  • AVG computed as sum/count
  • MIN (similar for MAX)

Maintain <g,minimum,count> Insert (g,m): update minimum,count based on m =,<,> minimum Delete (g,m): update minimum,count based on m =,<,> minimum If count=0, then we need to search the base table (expensive). Why?

slide-32
SLIDE 32

Aalborg University 2008 - DWML course 32

Practical View Maintenance

  • Maintenance policies – When to synchronize?

Immediate – in same transaction as base change Lazy – when V is used for the first time after base updates Periodic, e.g., once a day, often together with base load Forced – after a certain number of changes

  • Updating aggregates

Computation outside DBMS in flat files (still relevant?) Built by loader Computation in DBMS using SQL Can be expensive: DBMS must be tuned for this

  • Supported by tool/DBMS

Oracle, SQL Server, DB2

slide-33
SLIDE 33

Aalborg University 2008 - DWML course 33

Indexing

  • Index used in combination with aggregates

Index on dimension tables and on materialized views

  • Fact table

Build primary B+-tree index on dimension keys (primary key) ? Build indices on each dimension key separately (index intersection) Indices on combinations of dimension keys? (many!)

  • Sort order is important (index-organized tables)

Possible to compress data (values not repeated) Can save aggregates due to fast sequential scan Best sort order (almost) always “Time”

  • Dimension tables

Build indices on many/all individual columns (think of type) Build indices on common combinations

  • Hash indices

Efficient for un-sorted data (e.g., color)

slide-34
SLIDE 34

Aalborg University 2008 - DWML course 34

Bitmap Indices

  • A B+-tree index stores a list of RowIDs for each value

A RowID takes ~8 bytes Large space use for columns with low cardinality (gender, color) Example: Index for 1 bio. rows with gender takes 8 GB Not efficient to do “index intersection” for these columns

  • Idea: make a “position bitmap” for each value (only two)

Female: 01110010101010… Male:

10001101010101…

Takes only (num. of values)*(num. of rows)*1 bit Example: bitmap index on gender (as before) takes only 256 MB Very efficient to do “index intersection” (AND/OR) on bitmaps

◆ Intersection of 32 bits done in a single CPU instruction (word length=32)

Can be improved for for higher cardinality using compression

  • Supported by some RDBMS (DB2, Oracle)
slide-35
SLIDE 35

Aalborg University 2008 - DWML course 35

Using Bitmap Indices

  • Query example

Find female customers in Aalborg with black hair and blue eyes Female: 01010101010 Aalborg: 00000011111 Black:

10110110110

Blue:

01101101111

Result

00000000010 – use AND, only one such customer

  • Numeric attributes can also be handled

Find … Salary BETWEEN 22000 AND 29000 Use the binning technique, i.e., group every C values into a bin

◆ E.g., group every 5000 values into a bin, and assign a bitmap for it ◆ Bitmap for [20000-25000): 001001001 ◆ Bitmap for [25000-30000): 010010010 ◆ OR together: 011011011 (Why is it OR but not AND?) ◆ Refinement step: follow those records and check their actual salaries

Tradeoff between storage size and index effectiveness

Do we need to consider the bitmap for [15000-20000) ?

slide-36
SLIDE 36

Aalborg University 2008 - DWML course 36

Compressed Bitmaps

  • Problem: space use

With m possible values and n records: n*m bits required However, probability of a 1 is 1/m => very few 1’s

  • Solution: compressed bitmaps

Run-length encoding A run is i 0’s followed by a 1 Concatenating binary numbers won’t work – no unique decoding Instead, determine j – number of bits in binary representation of i Encode as “<j-1 1’s>”+“0”+“<i in binary>” Encode next run similarly, trailing 0’s not encoded Example: 000000010000 encoded as 110111 j>1 => first bit of i is 1 – this bit can be saved – encoded as 11011 Decoding: scan bits to find j, scan next j-1 bits to find i, find next 1 Example: j = 3, i = 7 (11) => bitmap = 00000001 + 0000 (trailing)

slide-37
SLIDE 37

Aalborg University 2008 - DWML course 37

Managing Bitmaps

  • Compression factor

Assume m=n (unique values) Each value has just one run of length i < n Each run takes at most 2 log2 n bits (j <= log2 n) Total space consumption: 2n log2 n bits (compared to n2)

  • Operations on compressed bitmaps

Decompress one run at a time and produce relevant 1’s in output

  • Finding and storing bitvectors

Index with B-trees + store in blocks/block chains

  • Finding records

Use secondary or primary index (hashing or B-tree)

  • Handling modifications

Deletion: “retire” record number + update bitmaps with 1’s Insertion: add new record to file + update bitmaps with 1’s (trail 0’s) Updates: update bitmaps with old and new 1’s (expensive)

slide-38
SLIDE 38

Aalborg University 2008 - DWML course 38

Mini Project

  • Subtask #5
  • Build reports in Reporting Services
  • Check the Reporting Services tutorial

at the mini-project website

  • Subtask #6
  • DW performance optimization
  • Build aggregates

◆ SQL Server Management Studio ◆ Connect to Analysis Services ◆ Find your cube and ……

  • Short demo. session on this today
slide-39
SLIDE 39

Aalborg University 2008 - DWML course 39

Summary

  • Choosing Aggregates
  • Maintaining Views
  • Bitmap Indices