
DW Performance Optimization

Original slides were written by Torben Bach Pedersen

Aalborg University 2007 - DWML course 2

Overview

  • Choosing aggregates
  • Maintaining views
  • Bitmapped indices
  • Other optimization issues


Aggregates

  • Observations

DW queries are simple and follow the same "schema": most are aggregate (GROUP BY) queries

  • Idea

Compute and store query results in advance (pre-aggregation)
◆ Example: store "total sales per month and product"
Yields large performance improvements (e.g., a factor of 1000)
No need to store everything – reuse whenever possible
◆ Example: a quarterly total can be quickly computed from monthly totals

  • Prerequisites

Tree-structured dimensions with fixed height
Many-to-one relationships from facts to dimensions
Facts mapped to the bottom level in all dimensions
Otherwise, re-use is not possible


Aggregate Use Example

  • Imagine 1 billion sales rows, 1000 products, 100 locations
  • CREATE VIEW TotalSales (pid, locid, total) AS

      SELECT s.pid, s.locid, SUM(s.sales)
      FROM Sales s
      GROUP BY s.pid, s.locid

  • The materialized view has only 100,000 rows (1000 products × 100 locations)
  • The query

      SELECT p.category, SUM(s.sales)
      FROM Products p, Sales s
      WHERE p.pid = s.pid
      GROUP BY p.category

    is rewritten to use the view:

      SELECT p.category, SUM(t.total)
      FROM Products p, TotalSales t
      WHERE p.pid = t.pid
      GROUP BY p.category

The query becomes 10,000 times faster (scanning 100,000 rows instead of 1 billion)!
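The rewriting can be checked end to end on a toy instance. Below is a minimal sketch using Python's sqlite3 with hypothetical data (the slide's schema, not a real benchmark); both formulations return the same totals, while the rewritten one scans the small aggregate table:

```python
import sqlite3

# Toy instance of the slide's schema; the rows are illustrative.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE Products (pid INTEGER PRIMARY KEY, category TEXT)")
cur.execute("CREATE TABLE Sales (pid INTEGER, locid INTEGER, sales REAL)")
cur.executemany("INSERT INTO Products VALUES (?,?)",
                [(1, "food"), (2, "food"), (3, "toys")])
cur.executemany("INSERT INTO Sales VALUES (?,?,?)",
                [(1, 10, 5.0), (1, 20, 7.0), (2, 10, 3.0), (3, 20, 4.0)])

# "Materialize" the aggregate view as a stored table.
cur.execute("""CREATE TABLE TotalSales AS
               SELECT pid, locid, SUM(sales) AS total
               FROM Sales GROUP BY pid, locid""")

# Original query against the detail data...
orig = cur.execute("""SELECT p.category, SUM(s.sales)
                      FROM Products p, Sales s
                      WHERE p.pid = s.pid GROUP BY p.category""").fetchall()
# ...and its rewriting against the aggregate: same answer, fewer rows scanned.
rewritten = cur.execute("""SELECT p.category, SUM(t.total)
                           FROM Products p, TotalSales t
                           WHERE p.pid = t.pid GROUP BY p.category""").fetchall()
assert sorted(orig) == sorted(rewritten)
```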


Pre-Aggregation Choices

  • Full pre-aggregation (all combinations of levels)

      Fast query response
      Takes a lot of space and update time (200–500 times the raw data)

  • No pre-aggregation

Slow query response (for terabytes…)

  • Practical pre-aggregation: chosen combinations

A good compromise between response time and space use

  • Most (R)OLAP tools now support practical pre-aggregation

IBM DB2 UDB, Oracle 10g, MS Analysis Services


Using Aggregates

  • Application: aggregates used via aggregate navigator

Given a query, the best aggregate is found and the query is rewritten to use it

Done by the system, not by the user
◆ Traditionally done in middleware, e.g., ODBC
◆ Also performed directly by the DBMS

  • Four design goals for aggregate usage

Aggregates stored separately from detail data
"Shrunk" dimensions mapped to aggregate facts
Connection between aggregates and detail data known by the system
All queries (SQL) refer only to detail data

  • SUM, MIN, MAX, COUNT, AVG can all be handled

The last one requires a little trick – why?
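The trick is that averages of averages are wrong under inserts and deletes, so the view keeps SUM and COUNT and derives AVG on demand. A minimal in-memory sketch (group names and values are illustrative):

```python
# AVG is not self-maintainable from averages alone; the view stores
# (sum, count) per group and computes AVG = sum / count when queried.
view = {}  # group -> (sum, count)

def on_insert(group, value):
    s, c = view.get(group, (0.0, 0))
    view[group] = (s + value, c + 1)

def on_delete(group, value):
    s, c = view[group]
    if c == 1:
        del view[group]            # count reaches 0: drop the group
    else:
        view[group] = (s - value, c - 1)

def avg(group):
    s, c = view[group]
    return s / c

on_insert("jan", 10); on_insert("jan", 20); on_insert("feb", 5)
assert avg("jan") == 15.0
on_delete("jan", 20)
assert avg("jan") == 10.0
```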


Choosing Aggregates

  • With practical pre-aggregation, we must decide which aggregates to store
  • Non-trivial (NP-complete) optimization problem

Space use
Update speed
Response-time demands
Actual queries
Indexes and/or aggregates

  • Choose an aggregate only if it is considerably smaller than the
    available, usable aggregates (a factor of 3, 5, or 10)

  • Supported (semi)-automatically by DBMS

Oracle, DB2, MS SQL Server

  • “Implementing Data Cubes Efficiently” SIGMOD’96 paper

Greedy approach: simple but effective
The DBMSes above now use similar, but more advanced, techniques


Data Cube

[Figure: a data cube of car sales by make (Ford, Chevy), color (red, white,
blue), and year (1991–1994), with sums along each dimension]

The data cube stores multidimensional GROUP BY relations of tables in data
warehouses

[Figure: the same data shown as a plain GROUP BY with total (sum by color –
what an ordinary database offers), as a cross tab (make × color with row and
column sums), and as the full cube with aggregate sums – what a data
warehouse offers]


A Data Cube example

Scenario: a query asks for the total sales of a part
a) If view pc is available, we need to process about 6M rows; OR
b) If view p is available, we need to process about 0.2M rows

(I) – 8 possible views for 3 dimensions (part p, supplier s, customer c).
Each view gives the total sales per that grouping:
1. part, supplier, customer (psc: 6M rows)
2. part, customer (pc: 6M)
3. part, supplier (ps: 0.8M)
4. supplier, customer (sc: 6M)
5. part (p: 0.2M)
6. supplier (s: 0.01M)
7. customer (c: 0.1M)
8. none (1 row)

(II) – The 8 views are organized into a lattice

Questions:
a) How many views must we materialize to get reasonable performance?
b) Given that we have space S, which views do we materialize to minimize
average query cost?

(III) – Picking the right views to materialize improves performance
(IV) – Views pc and sc are not needed (they are as large as psc itself):
the effective rows are reduced from 19M in total to 7M
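The arithmetic in (III)/(IV) can be verified by summing the view sizes (in millions of rows):

```python
# The slide's view sizes in millions of rows; "none" is a single row.
sizes = {"psc": 6, "pc": 6, "ps": 0.8, "sc": 6,
         "p": 0.2, "s": 0.01, "c": 0.1, "none": 1e-6}

total = sum(sizes.values())                  # all 8 views materialized
reduced = total - sizes["pc"] - sizes["sc"]  # drop the redundant pc and sc
print(round(total, 2), round(reduced, 2))    # ~19M vs ~7M rows
```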

Lattice Framework

  • Each lattice node represents a view / query
  • Dependence relationship Q1 ≤ Q2:
      Q1 is a descendant of Q2
      Q1 can be answered using only the results of Q2
      In other words, Q1 is dependent on Q2
  • The ≤ operator imposes a partial ordering on the queries
  • Top view = base data (i.e., the most detailed view)
  • Essentially, the lattice captures the dependency relationships among
    queries / views and can be represented by a lattice graph


Cost Model

Cost of answering a query based on a view – and its major assumptions:

1. The time to answer a query equals the space occupied by (i.e., the number
   of rows in) the view from which the query is answered
2. All queries are identical to some query in the given lattice
3. The clustering of the materialized views and their indexes is not
   considered

An illustration:

  • To answer a query Q, choose an ancestor of Q, say Qa, that has been
    materialized
  • We thus need to process the table of Qa
  • Cost of answering Q: the number of rows in the table of Qa

The model is simple and realistic, and enables the design and analysis of
powerful algorithms

Experimental validation of the cost model shows an almost linear
relationship between size and running time:

  Source            Size S    Time T    Ratio (T − 2.07) / S
  From cell itself  1         2.07      –
  From view s       10,000    2.38      0.000031
  From view ps      0.8M      20.77     0.000023
  From view psc     6M        226.23    0.000037

Greedy Algorithm

Greedy algorithm: in each round, choose the view with the maximum "benefit"

  • Given a data cube lattice with space costs associated with each view
  • Always include the top view, because it cannot be generated from the
    other views
  • Suppose we may only select k views in addition to the top view
  • The benefit of view v (relative to the selected set S) is based on how v
    can improve the cost of evaluating views, including itself
  • The total benefit B(v,S) of v is the sum, over all views w, of the
    benefit of using v to evaluate w, provided that benefit is positive

The Greedy algorithm:

    S = {top view}
    for i = 1 to k do
        select the view v not in S such that B(v,S) is maximized
        S = S ∪ {v}
    return S
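A sketch of the greedy selection in Python, applied to the 8-view psc lattice from the earlier example; the dependence test uses attribute-set containment, and the linear cost model is assumed:

```python
# Greedy selection of k views (after the SIGMOD'96 approach), on the slide's
# psc lattice. Sizes in rows; a view named by its grouping attributes can
# answer exactly the views whose attribute sets it contains.
sizes = {"psc": 6_000_000, "pc": 6_000_000, "ps": 800_000, "sc": 6_000_000,
         "p": 200_000, "s": 10_000, "c": 100_000, "": 1}

def answers(v, w):
    return set(w) <= set(v)          # w's grouping attrs contained in v's

def benefit(v, S):
    # Sum over all views w of the (positive) saving from answering w via v.
    total = 0
    for w in sizes:
        if answers(v, w):
            cheapest = min(sizes[u] for u in S if answers(u, w))
            total += max(cheapest - sizes[v], 0)
    return total

def greedy(k):
    S = ["psc"]                      # the top view is always materialized
    for _ in range(k):
        best = max((v for v in sizes if v not in S),
                   key=lambda v: benefit(v, S))
        S.append(best)
    return S

print(greedy(2))  # ['psc', 'ps', 'c']
```

On this lattice the first two picks are ps and then c; the equally large views pc and sc have zero benefit and are never worth materializing.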


Greedy Algorithm – an example

Benefits of the possible choices at each round:

  View   Round 1          Round 2         Round 3
  b      50 × 5 = 250     (selected)
  c      25 × 5 = 125     25 × 2 = 50     25 × 1 = 25
  d      80 × 2 = 160     30 × 2 = 60     30 × 2 = 60
  e      70 × 2 = 140     20 × 2 = 40     2 × 20 + 10 = 50
  f      60 × 2 = 120     60 + 10 = 70    (selected)
  g      99 × 1 = 99      49 × 1 = 49     49 × 1 = 49
  h      90 × 1 = 90      40 × 1 = 40     30 × 1 = 30

[Figure: lattice with space costs – a: 100 (top), b: 50, c: 75, d: 20,
e: 30, f: 40, g: 1, h: 10]

  • Initially, each query is answered by a, with cost 100
  • Which queries can be answered by view b?
  • What is the benefit of view b? Ans: 5 × (100 − 50) = 250
  • Can we find a better view than view b?
  • End of round 1: view b is selected
  • Round 2: views b, d, e, g, h start with cost 50, and the other views
    start with cost 100
  • Which queries can be improved by view f?
  • Benefit of view f? Ans: (100 − 40) + (50 − 40) = 70
  • After 3 rounds, the total cost is 420 (initial cost: 800)
  • In summary: each round picks the view with the maximum "benefit"


Greedy Algorithm vs optimal choice

There are situations where the algorithm does poorly … but its benefit is at
least 63% (i.e., 1 − 1/e) of the benefit of the optimal choice, as reasoned
by the authors

[Figure: a lattice where the greedy does poorly – a: 200 at the top;
b: 100, c: 99, d: 100 below; four groups of 20 nodes, each group
totaling 1000]

  • Round 1: picks c, whose benefit is 4141
  • Round 2: picks b or d, with a benefit of 2100 each
  • Greedy results in 4141 + 2100 = 6241
  • But the optimal choice is to pick b and d
  • b and d improve the cost by 100 for themselves and all 80 nodes below
    them, resulting in a total benefit of 8200
  • Ratio of greedy/optimal = 6241/8200 = 76%
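The arithmetic of this bad case can be reproduced directly, assuming the structure implied by the slide: c improves itself and 40 chain nodes by 200 − 99 = 101, while b and d each cover themselves and 40 chain nodes:

```python
# Reproducing the slide's numbers (structure assumed as described above;
# after c is picked, only 20 of b's 40 chain nodes still sit at cost 200).
greedy_round1 = 101 * 41            # pick c: itself + 40 nodes, saving 101
greedy_round2 = 100 * 21            # then b (or d): itself + 20 nodes at 200
greedy_total = greedy_round1 + greedy_round2
optimal = 2 * 100 * 41              # picking b and d instead: 82 nodes x 100
print(greedy_total, optimal, round(100 * greedy_total / optimal))  # 6241 8200 76
```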


Greedy algorithm – space-time trade-off

Greedy order of view selection example

  Round  Selection  Benefit     Total time   Total space
  1      cp         (infinite)  72M rows     6M rows
  2      ns         24M rows    48M          6M
  3      nt         12M         36M          6M
  4      c          5.9M        30.1M        6.1M
  5      p          5.8M        24.3M        6.3M
  6      cs         1M          23.3M        11.3M
  7–12   np, ct, t, n, s, none – benefits of 1M, 0.01M and smaller; time
         stays around 23.3M while space keeps growing (past 16.3M)

Experiment with a lattice: it is important to materialize only the first few
(~5) views – performance increases quickly at first, after which time barely
improves while space keeps growing

[Chart: total time and total space as a function of the number of views
materialized]


Optimal cases and anomalies

Two situations where the algorithm is optimal …

  • If a1 is much larger than the other a's, the greedy is close to optimal
  • If all the a's are equal, then greedy is optimal

… but there are also two ways in which the setting is not realistic:

  • Views in a lattice are unlikely to have the same probability of being
    requested in a query
  • Instead of asking for a fixed number of views to materialize, we should
    instead allocate a fixed amount of space to views


Hypercube lattices – some observations

Effect of the number of group-by attributes n on the size of views: the size
grows exponentially until it reaches the size of the raw data, m, at rank
⌊log_r m⌋, i.e., the "cliff"

Assumptions and basis of reasoning:

  • Each attribute's domain size is r
  • The top element has m cells appearing in the raw data
  • If we group on i attributes, the cube has r^i cells
  • If r^i ≥ m, then each cell will have at most one data point and at most
    m of the cells will be nonempty – use m as the size of any such view
  • If r^i < m, then almost all r^i cells will have at least one data point –
    the space cost is r^i, as several data points collapse into one aggregate

This explains why the 2-attribute groupings (p,c) and (s,c) have the same
size as (p,s,c), 6M rows*, in the earlier example

* (p,s) does not have 6M rows because the benchmark made it so
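The min(r^i, m) size analysis and the cliff can be checked numerically (the parameters r, m, n below are illustrative, not from the slides):

```python
# Size of a view grouped on i of n attributes, each with domain size r,
# over m fact rows: roughly min(r**i, m). The "cliff" is the rank at which
# the size first reaches m (about log_r m).
r, m, n = 10, 1_000_000, 10
sizes = [min(r**i, m) for i in range(n + 1)]
cliff = next(i for i, size in enumerate(sizes) if size == m)
print(sizes)   # grows as 1, 10, 100, ..., then flattens at m
print(cliff)   # 6 here, i.e., log_r m
```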


Space-optimal and time-optimal solutions

What is the average time for a query when the space is optimal?

  • Space is minimized when only the top view is materialized
  • Every query would then take time m
  • The total time cost for all 2^n queries is m·2^n

Does it make sense to minimize time by materializing all views?

  • No gain past the cliff, so there is no point in doing so
  • The nature of the time-optimal solution depends on:
    1. the rank k = ⌊log_r m⌋ at which the cliff occurs; and
    2. the rank j at which r^j·C(n,j) is maximized


Hierarchies & the Lattice Framework

Hierarchies are important, as they underlie two commonly used query
operations, drill-down and roll-up …

A common hierarchy: Day, Week, Month, Year, none – and its dependency
relations:

  • Year ≤ Month ≤ Day
  • Week ≤ Day; but
  • neither Month ≤ Week nor Week ≤ Month

… but hierarchies introduce query dependencies that must be accounted for
when determining which queries to materialize, and this can be complex


Composite lattices

Dependencies caused by different dimensions and attribute hierarchies can be
combined into a direct product lattice … assuming views can be created by
independently grouping on any (or no) member of the hierarchy for each of
the n dimensions

Combining two hierarchical dimensions:

  customer (0.1M) → nation (25) → none (1)
+ product (0.2M) → size (50), type (150) → none (1)
= a direct product lattice with the nodes

  cp 6M, cs 5M, ct 5.99M, np 5M, ns 125, nt 375,
  c 0.1M, n 25, p 0.2M, s 50, t 150, none 1


Applicability of the Lattice Framework

Advantages of the lattice framework

  • Clean framework

A clean framework for working with dimensional hierarchies, since
hierarchies are themselves lattices

  • Easy to model dependencies

Dependency relationship among queries captured by the lattice

  • Order of materialization

A simple descending-order topological sort on the lattice nodes gives the
required order of materialization – why?


Summary on choosing aggregates

  1. Lattice framework
  2. Linear cost model
  3. Greedy algorithm (how it works)
  4. Space-time trade-off


Overview

  • Choosing aggregates
  • Maintaining views
  • Bitmapped indices
  • Other optimization issues


View Maintenance

  • How and when should we refresh materialized views?
  • Total re-computation

      Too expensive

  • Incremental view maintenance

      Apply only the changes since the last refresh to the view
      Insert new rows Ri and delete existing rows Rd
      Update = delete + insert

  • Additional info must be stored in the views

      To make the views self-maintainable
      Store the "number of derivations" cv along with each row v in V


View Maintenance

  • Projection views (with DISTINCT in SQL)

      V = Π(R), stored as pairs (r,c): row r appears c times in R
      Insert: if (r,ci) ∈ Π(Ri) and (r,cr) ∈ V, then cr := cr + ci;
      otherwise insert (r,ci) into V
      Delete: if (r,cd) ∈ Π(Rd) and (r,cr) ∈ V, then cr := cr − cd;
      delete r from V if cr = 0
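The counting rules for projection views can be sketched as a small batch refresh (in-memory and illustrative; `inserted` and `deleted` play the roles of Ri and Rd):

```python
from collections import Counter

# Count-augmented projection view: each distinct row r is stored with its
# number of derivations c, so deletes know when r truly disappears.
def refresh(view, inserted, deleted):
    for r, ci in Counter(inserted).items():   # the insert batch R_i
        view[r] = view.get(r, 0) + ci
    for r, cd in Counter(deleted).items():    # the delete batch R_d
        view[r] -= cd
        if view[r] == 0:
            del view[r]                       # last derivation removed
    return view

v = refresh({}, inserted=["a", "a", "b"], deleted=[])
assert v == {"a": 2, "b": 1}
v = refresh(v, inserted=[], deleted=["a"])
assert v == {"a": 1, "b": 1}   # "a" is still derivable once
```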

  • Join views V = R ⊗ S

      Compute Ri ⊗ S and add it to V, updating the counts
      Compute Rd ⊗ S and subtract it from V, updating the counts


Aggregation View Maintenance

  • COUNT

      Maintain <g, count>
      Update count based on inserts/deletes
      Insert <g,1> rows for new values; delete rows from V with count = 0

  • SUM

      Maintain <g, sum, count>
      Update count and sum based on inserts/deletes
      Insert <g,a,1> rows for new values; delete rows from V with count = 0

  • AVG is computed as sum/count
  • MIN (similarly for MAX)

      Maintain <g, minimum, count>
      Insert (g,m): update minimum and count based on m =, <, > minimum
      Delete (g,m): update minimum and count based on m =, <, > minimum
      If count = 0, the base table must be searched (expensive!)
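The MIN case, including the expensive base-table rescan when the count reaches 0, can be sketched as follows (structure and names are illustrative):

```python
# Maintaining <g, minimum, count> for MIN, where count is the number of
# base rows currently equal to the minimum.
view = {}  # group -> [minimum, count]

def on_insert(g, m):
    if g not in view or m < view[g][0]:
        view[g] = [m, 1]             # new group or strictly smaller minimum
    elif m == view[g][0]:
        view[g][1] += 1              # one more row ties the minimum

def on_delete(g, m, base_rows):
    lo, c = view[g]
    if m != lo:
        return                       # deleting a non-minimal row: no change
    if c > 1:
        view[g] = [lo, c - 1]
    else:                            # count hits 0: must rescan the base table
        remaining = [x for grp, x in base_rows if grp == g]
        if remaining:
            new_lo = min(remaining)
            view[g] = [new_lo, remaining.count(new_lo)]
        else:
            del view[g]

on_insert("g1", 3); on_insert("g1", 5); on_insert("g1", 3)
assert view["g1"] == [3, 2]
# Delete both minimal rows; base_rows reflects the table after each delete.
on_delete("g1", 3, [("g1", 5), ("g1", 3)])
on_delete("g1", 3, [("g1", 5)])
assert view["g1"] == [5, 1]
```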


Practical View Maintenance

  • When to synchronize?

Immediate – in the same transaction as the base change
Lazy – when V is used for the first time after base updates
Periodic – e.g., once a day, often together with the base load
Forced – after a certain number of changes

  • Updating aggregates

Computation outside the DBMS in flat files (still relevant?) – built by the loader
Computation in the DBMS using SQL – can be expensive: the DBMS must be tuned for this

  • Supported by tool/DBMS

Oracle, SQL Server, DB2


Indexing

  • Index used in combination with aggregates

Index on dimension tables and on materialized views

  • Fact table

Build a primary B-tree index on the dimension keys (the primary key)?
Build indices on each dimension key separately (index intersection)
Indices on combinations of dimension keys? (many combinations!)

  • Sort order is important (index-organized tables)

Compressing the data becomes possible (values are not repeated)
Can save aggregates due to fast sequential scans
The best sort order is (almost) always "Time"

  • Dimension tables

Build indices on many/all individual columns (consider the column type)
Build indices on common combinations

  • Hash indices

Efficient for un-sorted data


Bitmap Indices

  • A B-tree index stores a list of RowIDs for each value

A RowID takes ~8 bytes
Large space use for columns with low cardinality (gender, color)
Example: an index on gender for 1 billion rows takes 8 GB
Not efficient to do "index intersection" for these columns

  • Idea: make a "position bitmap" for each value (only two for gender)

      Female: 01110010101010…
      Male:   10001101010101…

      Takes only (no. of values) × (no. of rows) × 1 bit
      Example: a bitmap index on gender (as before) takes only 256 MB
      Very efficient to do "index intersection" (AND/OR) on bitmaps
      Can be improved for higher cardinality using compression
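Position bitmaps are easy to sketch as Python integers, one bit per row (the toy column and the size arithmetic below are illustrative):

```python
# Build one bitmap per value: bit i is set iff row i holds that value.
rows = ["F", "M", "M", "F", "F"]          # hypothetical gender column
bitmaps = {}
for i, v in enumerate(rows):
    bitmaps[v] = bitmaps.get(v, 0) | (1 << i)

assert bitmaps["F"] == 0b11001            # rows 0, 3, 4
assert bitmaps["M"] == 0b00110            # rows 1, 2

# Space check: 2 values x 1 billion rows x 1 bit, in megabytes.
print(2 * 1_000_000_000 / 8 / 1_000_000)  # 250.0 (the slide's ~256 MB)
```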

  • Supported by some RDBMS (DB2, Oracle)


Using Bitmap Indices

  • Query example

Find female customers in Aalborg with black hair and blue eyes

Female:  01010101010
Aalborg: 00000011111
Black:   10110110110
Blue:    01101101111

Result:  00000000010 – use AND; only one such customer

  • Range queries can also be handled

… AND Salary BETWEEN 200,000 AND 300,000

200–250,000: 001001001
250–300,000: 010010010
OR together: 011011011

Use the result as a regular bitmap
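Both the AND query and the range OR can be replayed with the slide's bitmaps stored as integers (the strings are reversed so that the leftmost character corresponds to row 0):

```python
# The slide's bitmaps as integers; bit i corresponds to row i.
female  = int("01010101010"[::-1], 2)
aalborg = int("00000011111"[::-1], 2)
black   = int("10110110110"[::-1], 2)
blue    = int("01101101111"[::-1], 2)

result = female & aalborg & black & blue
matches = [i for i in range(11) if (result >> i) & 1]
print(matches)  # [9] – exactly one qualifying customer

# Range predicate: OR the per-range bitmaps, then use the result as usual.
r1 = int("001001001"[::-1], 2)   # 200-250,000
r2 = int("010010010"[::-1], 2)   # 250-300,000
combined = format(r1 | r2, "09b")[::-1]
print(combined)  # 011011011
```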


Compressed Bitmaps

  • Problem: space use

With m possible values and n records: n × m bits required
However, the probability of a 1 is 1/m => very few 1's

  • Solution: compressed bitmaps

Run-length encoding: a run is i 0's followed by a 1
Simply concatenating the binary numbers won't work – no unique decoding
Instead, determine j – the number of bits in the binary representation of i
Encode the run as <j−1 1's> + "0" + <i in binary>
Encode the next run similarly; trailing 0's are not encoded
Example: 000000010000 is encoded as 110111
For j > 1 the first bit of i is always 1, so this bit can be saved – the
example becomes 11011
Decoding: scan the 1's to find j, read the next j−1 bits (prepending the
implicit leading 1) to get i, then continue with the next run
Example: from 11011, j = 3 and i = 7 (1·11) => bitmap = 00000001 + 0000
(trailing)
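A sketch of the encoding without the save-one-bit refinement, so that 000000010000 encodes to 110111 as in the example; since trailing 0's are not coded, the decoder takes the bitmap length n:

```python
# Run-length code: a run of i 0's followed by a 1 becomes (j-1) 1's, a "0",
# then i in binary on j bits, where j is the bit length of i.
def encode(bitmap):
    out, run = "", 0
    for b in bitmap:
        if b == "0":
            run += 1
        else:
            j = max(run.bit_length(), 1)
            out += "1" * (j - 1) + "0" + format(run, f"0{j}b")
            run = 0
    return out                        # trailing 0's are not encoded

def decode(code, n):
    out, k = "", 0
    while k < len(code):
        j = 0
        while code[k + j] == "1":     # count leading 1's to find j
            j += 1
        j += 1                        # include the terminating "0"
        i = int(code[k + j : k + 2 * j], 2)
        out += "0" * i + "1"
        k += 2 * j
    return out + "0" * (n - len(out))  # restore trailing 0's

assert encode("000000010000") == "110111"   # i=7 -> j=3 -> "11"+"0"+"111"
assert decode("110111", 12) == "000000010000"
```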


Managing Bitmaps

  • Compression factor

Assume m = n (all values unique)
Each value then has just one run, of length i < n
Each run takes at most 2·log2 n bits (since j ≤ log2 n)
Total space consumption: 2n·log2 n bits (compared to n² uncompressed)

  • Operations on compressed bitmaps

Decompress one run at a time and produce relevant 1’s in output

  • Finding and storing bitvectors

Index with B-trees + store in blocks/block chains

  • Finding records

Use secondary or primary index (hashing or B-tree)

  • Handling modifications

Deletion: "retire" the record number + update the bitmaps
Insertion: add the new record to the file + update the bitmaps (a 1 for the new value, trailing 0's for the rest)
Updates: update the bitmaps with the old and the new 1's (expensive)


Physical Storage

  • Partitioning

Data stored in large "lumps" (partitions)
Example: one partition per quarter
Queries need only read the relevant partitions
Can yield large performance improvements

  • Operations on partitions are independent

Creation, deletion, update, indexing
The aggregation level can differ among partitions

  • Column storage

Data stored in columns, not in rows
A different kind of partitioning
Works well for typical DW queries (only a few columns accessed)
Supports good compression of the data


Physical Configuration

  • RAID

      Gives (depending on the level) error tolerance and improved read speed
      DW is optimized for reads, not for writes
      DW is well suited for, e.g., RAID 5 (20% redundancy)

  • Disk type

      Small drives (with many controllers) are more expensive, but faster
      Large drives are cheaper and can store more aggregates for the same price

  • Block size

      Large sequential reads are faster with large blocks (32K)
      Scattered index reads are faster with small blocks (4K)

  • Memory

      RAM is cheap: buy a lot
      RAM caching must be per user session

  • Monitoring user activity

      Can give feedback to, e.g., the choice of aggregates


DBMS Possibilities

  • Aggregate navigation/use, aggregate choice, aggregate maintenance

      Oracle 10g, DB2 UDB, MS Analysis Services

  • Using ordinary indices

      Oracle 10g, DB2 UDB, MS SQL Server can do "star joins"

  • Bitmap indices

      Oracle 10g, DB2 UDB – not yet in MS SQL Server

  • Partitioning

      Oracle 10g, DB2 UDB, MS SQL Server + Analysis Services

  • Column storage

      Redbrick (Informix->IBM->?)

  • MOLAP/ROLAP/HOLAP

      Oracle 10g, DB2 UDB, MS SQL Server

Mini Project

  • New subtask

DW performance optimization
Build aggregates
◆ SQL Server Management Studio
◆ Connect to Analysis Services
◆ Find your cube and ……