Horizontal or Vertical Storage A fact table for data warehousing is - - PowerPoint PPT Presentation

horizontal or vertical storage
SMART_READER_LITE
LIVE PREVIEW

Horizontal or Vertical Storage A fact table for data warehousing is - - PowerPoint PPT Presentation

Horizontal or Vertical Storage A fact table for data warehousing is often fat Tens of even hundreds of dimensions/attributes A query is often about only a few attributes Horizontal storage: tuples are stored one by one Vertical


slide-1
SLIDE 1

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 1

Horizontal or Vertical Storage

  • A fact table for data warehousing is often fat

– Tens of even hundreds of dimensions/attributes

  • A query is often about only a few attributes
  • Horizontal storage: tuples are stored one by one
  • Vertical storage: tuples are stored by attributes

A1 A2 … A100 x1 x2 … x100 … … … … z1 z2 … z100 A1 A2 … A100 x1 x2 … x100 … … … … z1 z2 … z100

slide-2
SLIDE 2

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 2

Horizontal Versus Vertical

  • Find the information of tuple t

– Typical in OLTP – Horizontal storage: get the whole tuple in one search – Vertical storage: search 100 lists

  • Find SUM(a100) GROUP BY {a22, a83}

– Typical in OLAP – Horizontal storage (no index): search all tuples O(100n), where n is the number of tuples – Vertical storage: search 3 lists O(3n), 3% of the horizontal storage method

  • Projection index: vertical storage
slide-3
SLIDE 3

Only Vertical Storage Is Not Enough

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 3

0" 5" 10" 15" 20" 25" 30" 35" 40" 45"

Column"Store" Row"Store" Run$me'(sec)' Performance'of'Column3Oriented'Op$miza$ons' –Late" Materializa:on" –Compression" –Join"Op:miza:on" –Tuple@at@a@:me" Baseline"

Abadi et al. The Design and Implementation of Modern Column-Oriented Database Systems.

slide-4
SLIDE 4

C-Store

  • ROS: Read Optimized Store (column
  • riented)
  • WOS: Write Optimized Store (row oriented,

mainly for input)

  • Each column may be stored multiple times

in different sort orders

  • Projection: a group of columns sorted on the

same attribute

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 4

slide-5
SLIDE 5

Different Projections

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 5

17 22 6 98 12 4 14 7 8 11 1/6/08 1/6/08 1/8/08 1/13/08 1/20/08 1/24/08 2/2/08 2/4/08 2/5/08 2/12/08 West East South South North South West North East East 1 2 3 4 5 6 7 8 9 10 saleid date region (a) Sales Projection Sorted By Date 5 9 4 12 5 7 22 3 18 6 1/6/08 West 1/6/08 East 1/8/08 South 1/13/08 South 1/20/08 North 1/24/08 South 2/2/08 West 2/4/08 North 2/5/08 East 2/12/08 East 1 2 3 4 5 6 7 8 9 10 prodid date region (b) Sales Projection Sorted By Region, Date

(saleid,date,region | date) (prodid,date,region | region,date)

Abadi et al. The Design and Implementation of Modern Column-Oriented Database Systems.

slide-6
SLIDE 6

MonetDB

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 6

(memory mapped) simple memory array BAT ’byear’ BAT ’name’ (virtual) dense surrogates

SQL front−end XQuery front−end

1 2 3

SPARQL front−end

1907 1927 1927 1968

BAT Algebra

1 2 1 11 23 33 1 2 3 John Wayne\0 Roger Moore\0 Bob Fosse\0 Will Smith\0

select(byear,1927)

MonetDB back−end

Abadi et al. The Design and Implementation of Modern Column-Oriented Database Systems.

BAT: Binary Association Table

slide-7
SLIDE 7

Vectorized Processing

  • Query: SELECT AVG(A) FROM R WHERE A <

100

  • Full materialization

– Scan column A once, find the qualified rows – Compute average on the qualified rows – Drawback: large intermediate results

  • Vectorized processing

– Get N tuples in a shot – Compute the sum and count of the N tuples – N is set so that the vector can fit into the L1 cache

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 7

slide-8
SLIDE 8

Why Vectorized Processing?

  • Reduced interpretation cost
  • Better cache locality
  • Compiler optimization – SIMD instructions
  • Block algorithms
  • Parallel memory access
  • Profiling and adaptive execution

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 8

slide-9
SLIDE 9

Late Materialization

  • Keep data in columns until late in query

evaluation

– Use intermediate position lists

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 9

12 34 75 45 49 58 97 75 42 55 11 35 62 44 29 78 19 81 26 23 Relation R Rb Relation S Sa Sb select sum(R.a) from R, S where R.c = S.b and 5<R.a<20 and 40<R.b<50 and 30<S.a<40 Initial Status 12 34 53 23 78 65 33 21 29 Rc 3 16 56 9 11 27 8 41 19 35

  • 1. inter1 = select(Ra,5,20)
  • 2. inter2 = reconstruct(Rb,inter1)
  • 3. inter3 = select(inter2,30,40)
  • 4. join_input_R = reconstruct(Rc,inter3)
  • 5. inter4 = select(Sa,55,65)
  • 6. inter5 = reconstruct(Sb,inter4)
  • 7. join_input_S = reverse(inter5)
  • 8. join_res_R_S = join(join_input_R,join_input_S)
  • 9. inter6 = voidTail(join_res_R_S)
  • 10. inter7 = reconstruct(Ra,inter6)
  • 11. result = sum(inter7)

17 49 58 99 64 37 53 61 32 50 Ra Query and Query Plan (MAL Algebra)

slide-10
SLIDE 10

Example

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 10

3 16 56 9 11 27 8 41 19 35 Ra 2 4 5 7 9 inter1 select(Ra,5,20) 2 4 5 7 9 inter1 12 34 75 45 49 58 97 75 42 55 34 45 49 97 42 Rb inter2 reconstruct(Rb,inter1) 2 4 5 7 9 (1) (2)

slide-11
SLIDE 11

Example

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 11

34 45 49 97 42 inter2 inter3 2 4 5 7 9 4 5 9 select(inter2,30,40) inter3 4 5 9 12 34 53 23 78 65 33 21 29 Rc join_input_R 23 78 29 4 5 9 reconstruct(Rc,inter3) (2) (4) (3)

slide-12
SLIDE 12

Example

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 12

17 49 58 99 64 37 53 61 32 50 Sa 3 5 7 8 10 inter4 select(Sa,55,65) 62 29 19 81 23 inter5 reconstruct(Sb,inter4) 3 5 7 8 10 3 5 7 8 10 inter4 11 35 62 44 29 78 19 81 26 23 Sb 62 29 19 81 23 inter5 3 5 7 8 10 62 29 19 81 23 join_input_S 3 5 7 8 10 reverse(inter5) (5) (6) (7)

slide-13
SLIDE 13

Example

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 13

join_input_R 23 78 29 4 5 9 62 29 19 81 23 join_input_S 3 5 7 8 10 4 9 10 5 join_res_ R_S 4 9 10 5 join(join_input_R,join_input_S) join_res_ R_S 4 9 inter6 voidTail(join_res_R_S) (8) (9)

slide-14
SLIDE 14

Example

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 14

Ra 4 9 inter6 3 16 56 9 11 27 8 41 19 35 9 19 inter7 9 19 inter7 reconstruct(Ra,inter6) 28 result sum(inter7) (10) (11)

Figure 4.1: An example of a select-project-

slide-15
SLIDE 15

Example of Joins

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 15

42 36 42 44 38 1 38 42 46 36 = 1 2 3 5 2 4 2 1

slide-16
SLIDE 16

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 16

CUBE

SALES Model Year Color Sales

Chevy 1990 red 5 Chevy 1990 white 87 Chevy 1990 blue 62 Chevy 1991 red 54 Chevy 1991 white 95 Chevy 1991 blue 49 Chevy 1992 red 31 Chevy 1992 white 54 Chevy 1992 blue 71 Ford 1990 red 64 Ford 1990 white 62 Ford 1990 blue 63 Ford 1991 red 52 Ford 1991 white 9 Ford 1991 blue 55 Ford 1992 red 27 Ford 1992 white 62 Ford 1992 blue 39

DATA CUBE Model Year Color Sales

CUBE

Chevy 1990 blue 62 Chevy 1990 red 5 Chevy 1990 white 95 Chevy 1990 ALL 154 Chevy 1991 blue 49 Chevy 1991 red 54 Chevy 1991 white 95 Chevy 1991 ALL 198 Chevy 1992 blue 71 Chevy 1992 red 31 Chevy 1992 white 54 Chevy 1992 ALL 156 Chevy ALL blue 182 Chevy ALL red 90 Chevy ALL white 236 Chevy ALL ALL 508 Ford 1990 blue 63 Ford 1990 red 64 Ford 1990 white 62 Ford 1990 ALL 189 Ford 1991 blue 55 Ford 1991 red 52 Ford 1991 white 9 Ford 1991 ALL 116 Ford 1992 blue 39 Ford 1992 red 27 Ford 1992 white 62 Ford 1992 ALL 128 Ford ALL blue 157 Ford ALL red 143 Ford ALL white 133 Ford ALL ALL 433 ALL 1990 blue 125 ALL 1990 red 69 ALL 1990 white 149 ALL 1990 ALL 343 ALL 1991 blue 106 ALL 1991 red 104 ALL 1991 white 110 ALL 1991 ALL 314 ALL 1992 blue 110 ALL 1992 red 58 ALL 1992 white 116 ALL 1992 ALL 284 ALL ALL blue 339 ALL ALL red 233 ALL ALL white 369 ALL ALL ALL 941

SELECT Model, Year, Color, SUM(sales) AS Sales FROM Sales WHERE Model in {'Ford', 'Chevy'} AND Year BETWEEN 1990 AND 1992 GROUP BY CUBE(Model, Year, Color);

slide-17
SLIDE 17

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 17

MOLAP

Date Country

sum sum TV VCR PC 1Qtr 2Qtr 3Qtr 4Qtr U.S.A Canada Mexico sum

slide-18
SLIDE 18

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 18

Pros and Cons

  • Easy to implement
  • Fast retrieval
  • Many entries may be empty if data is sparse
  • Costly in space
slide-19
SLIDE 19

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 19

ROLAP – Data Cube in Table

  • A multi-dimensional database

Base table

Dimensions Measure Store Product Season AVG(Sales) S1 P1 Spring 6 S1 P2 Spring 12 S2 P1 Fall 9 S1 * Spring 9 … … … … * * * 9 Dimensions Measure Store Product Season Sales S1 P1 Spring 6 S1 P2 Spring 12 S2 P1 Fall 9

Cubing

slide-20
SLIDE 20

Data Cube: A Lattice of Cuboids

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 20

time,item time,item,location time, item, location, supplierc

all time item location supplier

time,location time,supplier item,location item,supplier location,supplier

time,item,supplier time,location,supplier

item,location,supplier

0-D(apex) cuboid 1-D cuboids 2-D cuboids 3-D cuboids 4-D(base) cuboid

slide-21
SLIDE 21

Data Cube: A Lattice of Cuboids

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 21

  • Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells

(9/15, milk, Urbana, Dairy_land), (9/15, milk, Urbana, *), (*, milk, Urbana, *), (*, milk, Urbana, *) (*, milk, Chicago, *), (*, milk, *, *)

all

time,item time,item,location time, item, location, supplier

time item location supplier

time,location time,supplier item,location item,supplier location,supplier

time,item,supplier time,location,supplier

item,location,supplier

0-D(apex) cuboid 1-D cuboids 2-D cuboids 3-D cuboids 4-D(base) cuboid

slide-22
SLIDE 22

Full Cube vs. Iceberg Cube

  • Full cube vs. iceberg cube

compute cube sales iceberg as select month, city, customer group, count(*) from salesInfo cube by month, city, customer group having count(*) >= min support

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 22

n Avoid explosive growth: A cube with 100 dimensions n 2 base cells: (a1, a2, …., a100), (b1, b2, …, b100)

n How many aggregate cells if “having count >= 1”? n What about “having count >= 2”?

iceberg condition

slide-23
SLIDE 23

Iceberg / Closed Cube & Cube Shell

  • Is iceberg cube good enough?

– 2 base cells: {(a1, a2, a3 . . . , a100):10, (a1, a2, b3, . . . , b100): 10} – How many cells will the iceberg cube have if having count(*) >= 10?

  • Close cube:

– Closed cell c: if there exists no cell d, s.t. d is a descendant of c, and d has the same measure value as c. – Closed cube: a cube consisting of only closed cells – What is the closed cube of the above base cuboid?

  • Cube Shell

– Precompute only the cuboids involving a small # of dimensions, e.g., 3 – More dimension combinations will need to be computed on the fly

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 23

slide-24
SLIDE 24

Efficient Cube Computation

  • General cube computation heuristics (Agarwal

et al.’96)

  • Computing full/iceberg cubes: 3 methodologies

– Bottom-Up: Multi-Way array aggregation (Zhao, Deshpande & Naughton, SIGMOD’97) – Top-down:

  • BUC (Beyer & Ramarkrishnan, SIGMOD’99)
  • H-cubing technique (Han, Pei, Dong & Wang:

SIGMOD’01)

– Integrating Top-Down and Bottom-Up:

  • Star-cubing algorithm (Xin, Han, Li & Wah: VLDB’03)

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 24

slide-25
SLIDE 25

Extensions

  • High-dimensional OLAP: A Minimal Cubing

Approach (Li, et al. VLDB’04)

  • Computing alternative kinds of cubes:

– Partial cube, closed cube, approximate cube, etc.

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 25

slide-26
SLIDE 26

General Heuristics

  • Sorting, hashing, and grouping operations are applied to the

dimension attributes in order to reorder and cluster related tuples

  • Aggregates may be computed from previously computed

aggregates, rather than from the base fact table

– Smallest-child: computing a cuboid from the smallest, previously computed cuboid – Cache-results: caching results of a cuboid from which other cuboids are computed to reduce disk I/Os – Amortize-scans: computing as many as possible cuboids at the same time to amortize disk reads – Share-sorts: sharing sorting costs cross multiple cuboids when sort-based method is used – Share-partitions: sharing the partitioning cost across multiple cuboids when hash-based algorithms are used

  • (Agarwal et al. VLDB’96)

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 26

slide-27
SLIDE 27

Multi-Way Array Aggregation

  • Array-based “bottom-up”

algorithm

  • Using multi-dimensional chunks
  • No direct tuple comparisons
  • Simultaneous aggregation on

multiple dimensions

  • Intermediate aggregate values

are re-used for computing ancestor cuboids

  • Cannot do Apriori pruning: No

iceberg optimization

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 27

ABC AB A All B AC BC C

slide-28
SLIDE 28

Multi-way Array Aggregation for Cube Computation (MOLAP)

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 28

  • Partition arrays into chunks (a small subcube which fits in memory).
  • Compressed sparse array addressing: (chunk_id, offset)
  • Compute aggregates in “multiway” by visiting cube cells in the order which

minimizes the # of times to visit each cell, and reduces memory access & storage cost.

What is the best traversing

  • rder to do

multi-way aggregation? A B

29 30 31 32 1 2 3 4 5 9 13 14 15 16 64 63 62 61 48 47 46 45 a1 a0 c3 c2 c1 c 0 b3 b2 b1 b0 a2 a3

C B

44 28 56 40 24 52 36 20 60

slide-29
SLIDE 29

Multi-way Array Aggregation for Cube Computation (3-D to 2-D)

all A B AB ABC AC BC C

  • The best order is

the one that minimizes the memory requirement and reduced I/Os

ABC AB A All B AC BC C Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 29

slide-30
SLIDE 30

Multi-way Array Aggregation for Cube Computation (2-D to 1-D)

ABC AB A All B AC BC C

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 30

slide-31
SLIDE 31

Multi-Way Array Aggregation for Cube Computation

  • Method: the planes should be sorted and

computed according to their size in ascending

  • rder

– Idea: keep the smallest plane in the main memory, fetch and compute only one chunk at a time for the largest plane

  • Limitation of the method: computing well only

for a small number of dimensions

– If there are a large number of dimensions, “top- down” computation and iceberg cube computation methods can be explored

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 31

slide-32
SLIDE 32

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 32

Iceberg Cube

  • In a data cube, many aggregate cells are

trivial

– Having an aggregate too small

  • Iceberg query
slide-33
SLIDE 33

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 33

Monotonic Iceberg Condition

  • If COUNT(a, b, *)<100, then COUNT(a, b,

c)<100 for any c

  • For cells c1 and c2, c1 is called an ancestor
  • f c2 if in all dimensions that c1 takes a non-*

value, c2 agrees with c1

– (a,b,*) is an ancestor of (a,b,c)

  • An iceberg condition P is monotonic if for

any aggregate cell c failing P, any descendants of c cannot honor P

slide-34
SLIDE 34

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 34

BUC

  • Once a base table (A,

B, C) is sorted by A-B- C, aggregates (*,*,*), (A,*,*), (A,B,*) and (A,B,C) can be computed with one scan and 4 counters

  • To compute other

aggregates, we can sort the base table in some other orders

slide-35
SLIDE 35

Example

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 35

Location Year Color Amount Vancouver 2015 Yellow 300 Victoria 2014 Red 400 Seattle 2015 Green 120 Vancouver 2014 Green 260 Seattle 2015 Red 160 Vancouver 2014 Yellow 280 Vancouver 2015 Red 160

Threshold: sum() >= 300

slide-36
SLIDE 36

Example: Sorting on Location

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 36

Location Year Color Amount Seattle 2015 Green 120 Seattle 2015 Red 160 Vancouver 2015 Yellow 300 Vancouver 2014 Yellow 280 Vancouver 2015 Red 160 Vancouver 2014 Green 260 Victoria 2014 Red 400 Sum(Seattle, *, *) = 280 ✗ Sum(Vancouver, *, *) = 1000 ✓ Sum(Victoria, *, *) = 400 ✓

slide-37
SLIDE 37

Sorting on Year for Vancouver

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 37

Location Year Color Amount Seattle 2015 Green 120 Seattle 2015 Red 160 Vancouver 2014 Yellow 280 Vancouver 2014 Green 260 Vancouver 2015 Yellow 300 Vancouver 2015 Red 160 Victoria 2014 Red 400 Sum(Vancouver, 2014, *) = 540 ✓ Sum(Vancouver, 2015, *) = 460 ✓

slide-38
SLIDE 38

Color on Vancouver & 2014/2015

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 38

Location Year Color Amount Seattle 2015 Green 120 Seattle 2015 Red 160 Vancouver 2014 Yellow 280 Vancouver 2014 Green 260 Vancouver 2015 Yellow 300 Vancouver 2015 Red 160 Victoria 2014 Red 400 Sum(Vancouver, 2014, Yellow) = 280 ✗ Sum(Vancouver, 2014, Green) = 260 ✗ Sum(Vancouver, 2015, Yellow) = 300 ✓ Sum(Vancouver, 2015, Red) = 160 ✗

slide-39
SLIDE 39

Sort on Color for Vancouver

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 39

Location Year Color Amount Seattle 2015 Green 120 Seattle 2015 Red 160 Vancouver 2014 Green 260 Vancouver 2015 Red 160 Vancouver 2014 Yellow 280 Vancouver 2015 Yellow 300 Victoria 2014 Red 400 Sum(Vancouver, *, Green) = 260 ✗ Sum(Vancouver, *, Red) = 160 ✗ Sum(Vancouver, Yellow) = 580 ✓

slide-40
SLIDE 40

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 40

How to Sort the Base Table?

  • General sorting in main memory O(nlogn)
  • Counting in main memory O(n), linear to the

number of tuples in the base table

– How to sort 1 million integers in range 0 to 100? – Set up 100 counters, initiate them to 0’s – Scan the integers once, count the occurrences

  • f each value in 1 to 100

– Scan the integers again, put the integers to the right places

slide-41
SLIDE 41

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 41

Pushing Monotonic Conditions

  • BUC searches the

aggregates bottom-up in depth-first manner

  • Only when a

monotonic condition holds, the descendants

  • f the current node

should be expanded

slide-42
SLIDE 42

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 42

How to Push Non-Monotonic Ones?

  • Condition P(c)=AVG(price)>=800 AND

COUNT(*)>=50 is not monotonic

  • BUC cannot push such a constraint
slide-43
SLIDE 43

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 43

Ideas

  • Let AVGk(price) be the average of the top-k

tuples

  • AVGk(price)>=800 is a monotonic condition

– If the top-10 average of (Vancouver, *, *) is less than 800, the top-10 average of (Vancouver, laptop, *) cannot be 800 or more

  • AVGk(price)>=800 can be a filter for

AVG(price)>=800

– If AVGk(price)<800, AVG(price)<800 – Generally, AVG()<=AVGk()

slide-44
SLIDE 44

To-Do List

  • Read Sections 4.2, 4.3, 4.4, 5.1 and 5.2
  • Understand how to build column store indexes in

SQL Server

  • Understand the data cube computation methods in

the textbook

– Quotient cube and semantic OLAP will not be covered by the exam

  • Find out how to build a data cube in SQL Server
  • Advanced reading

– Understand how to conduct OLAP with Cassandra and SPARK

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 44

slide-45
SLIDE 45

Advanced Reading List

  • D. J. Abadi et al. Column-Stores vs. Row-

Stores: How Different Are They Really? SIGMOD 2008.

  • D. Abadi, P. Boncz, S. Harizopoulos, S.

Idreos and S. Madden. The Design and Implementation of Modern Column-Oriented Database Systems. Foundations and Trends in Databases, vol. 5, no. 3, pp. 197– 280, 2012.

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 45