OLAP Databases Aalborg University, adapted from Torben Bach - - PowerPoint PPT Presentation

olap databases
SMART_READER_LITE
LIVE PREVIEW

OLAP Databases Aalborg University, adapted from Torben Bach - - PowerPoint PPT Presentation

OLAP Databases Aalborg University, adapted from Torben Bach Pedersen, Man Lung Yiu and Dimitra Vista Instructors: Peter Baumann email: p.baumann@jacobs-university.de tel: -3178 office: room 88, Research 1 340151 Big Data & Cloud


slide-1
SLIDE 1

1 340151 Big Data & Cloud Services (P. Baumann)

OLAP Databases

Instructors: Peter Baumann email: p.baumann@jacobs-university.de tel:

  • 3178
  • ffice:

room 88, Research 1

Aalborg University, adapted from Torben Bach Pedersen, Man Lung Yiu and Dimitra Vista

slide-2
SLIDE 2

2 340151 Big Data & Cloud Services (P. Baumann)

Overview

  • Data Warehousing & Decision Support
  • Datacubes, Dimension Hierarchies
  • ROLAP & MOLAP
  • ETL
  • Summary
slide-3
SLIDE 3

3 340151 Big Data & Cloud Services (P. Baumann)

Overview

  • Data Warehousing & Decision Support
  • Datacubes, Dimension Hierarchies
  • ROLAP & MOLAP
  • ETL
  • Summary
slide-4
SLIDE 4

4 340151 Big Data & Cloud Services (P. Baumann)

Desicion Support Systems (DSS)

  • Support business decisions
  • often based on OLTP collected data
  • Examples of high-level analytical questions:
  • What products have been most profitable for the company this year?
  • Is it the same group of products that were most profitable last year?
  • How is the company doing this quarter versus this same quarter last year?
  • Examples of data used for making decisions
  • Retail sales transaction details
  • Customer profiles (income, age, sex, etc.)
  • logs
slide-5
SLIDE 5

5 340151 Big Data & Cloud Services (P. Baumann)

DSS: Architecture

Information Sources Data Warehouse Server OLAP Servers Clients

Operational DB’s Semistructured Sources extract transform load refresh etc. Data Marts Data Warehouse e.g., ROLAP

serve

OLAP Query/Reporting Data Mining

slide-6
SLIDE 6

6 340151 Big Data & Cloud Services (P. Baumann)

Data Warehousing: Informal

  • Problem: critical enterprise information disparate, unavailable
  • locations, representations, storage, accessibility, completeness, ...
  • Data Warehouse

= system for reporting & data analysis

  • one or more disparate sources

 central, integrated repository

  • current + historical data
  • creating analytical reports

core component of business intelligence

  • data cleansing: extract, transfer, load (ETL)

[soha jamil / Wikipedia]

slide-7
SLIDE 7

7 340151 Big Data & Cloud Services (P. Baumann)

Data Warehousing: Definition

  • “A warehouse is a subject oriented, integrated, time-variant, and non-volatile collection of

data in support of management decision making process”

  • Bill Inmon, 1990
  • Key features:
  • Subject Oriented: particular subject instead of company ongoing operations
  • Integrated: gathered from a variety of sources, merged into a coherent whole
  • Time Variant: particular time period
  • Non-Volatile: data, never removed
slide-8
SLIDE 8

8 340151 Big Data & Cloud Services (P. Baumann)

OLAP

  • OLAP = Online Analytical Processing
  • Edgar Codd, 1994
  • Differentiated against OLTP = Online Transaction Processing
  • software category motivated by industry, introducing advanced data analysis
  • decision making, business modeling, operations research, …
  • enables analysts to extract & view business data from different points of view
  • dimensions
  • OLAP Characteristics
  • multidimensional data analysis techniques
  • Strong use of aggregate functions for summarizing large volumes of data
  • advanced database support
  • easy-to-use end-user interfaces (spreadsheet type)
  • client/server architecture
slide-9
SLIDE 9

9 340151 Big Data & Cloud Services (P. Baumann)

Overview

  • Data Warehousing & Decision Support
  • Datacubes, Dimension Hierarchies
  • ROLAP &MOLAP
  • ETL
  • Summary
slide-10
SLIDE 10

10 340151 Big Data & Cloud Services (P. Baumann)

Datacubes

  • Data structure for fast analysis along different views („dimensions“),
  • n all levels of detail
  • Technically: multi-dimensional array + metadata
  • Typically, time one dimension

Q4

Q1

Q2 Q3 Time Dimension Dallas Denver Chicago Apples Cherries Grapes Atlanta Sales Fact Melons

slide-11
SLIDE 11

11 340151 Big Data & Cloud Services (P. Baumann)

Dense vs Sparse Datacubes

  • Dense = every cell has meaningful value
  • Ex: climate simulation
  • Sparse = some values null
  • Clustered data
  • Empty regions
  • Ex: retail – open Mon thru Fri

http://lookfordiagnosis.com/mesh_info.php?term=cluster%20analysis&lang=1 [DKRZ] [www.agencie.fi] http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-253

slide-12
SLIDE 12

12 340151 Big Data & Cloud Services (P. Baumann)

Datacube Operations

  • Extraction + aggregation + combinations:
  • Slice
  • Dice
  • Roll-Up
  • Drill Down
  • Pivot

[guru99.com]

slide-13
SLIDE 13

13 340151 Big Data & Cloud Services (P. Baumann)

Operations: Slicing

  • Slicing = Select sub-cube by selecting

dimension values to fewer points

  • Result cube has less dimensions
  • Ex: select particular time slice

[guru99.com]

slide-14
SLIDE 14

14 340151 Big Data & Cloud Services (P. Baumann)

Operations: Dicing

  • Dicing = subsetting
  • „thicker slices“, not reducing dimensionality
  • Ex: derive subcube by selecting

along location, time, item simultaneously

[guru99.com]

slide-15
SLIDE 15

15 340151 Big Data & Cloud Services (P. Baumann)

Operations: Roll-Up

  • Roll-Up = aggregation along dimensions
  • also: „consolidation“
  • collapsing a dimension hierarchy
  • „climbing up“ concept hierarchy
  • Ex: consolidating

from cities to countries

[guru99.com]

slide-16
SLIDE 16

16 340151 Big Data & Cloud Services (P. Baumann)

Operations: Drill-Down

  • Drill-Down = fragment data into smaller parts
  • Moving down concept hierarchy
  • Expanding some dimension
  • Inverse of roll-up
  • Ex: detailing from quarters to months

[guru99.com]

slide-17
SLIDE 17

17 340151 Big Data & Cloud Services (P. Baumann)

Operations: Pivot

  • Pivot = rotate axes
  • show another view
  • Ex: swap rows & columns
  • Ex: swap cities ↔ product types

[guru99.com]

slide-18
SLIDE 18

18 340151 Big Data & Cloud Services (P. Baumann)

OLAP Datacube Querying

  • ISO SQL does not directly support cubes
  • changing with SQL/MDA
  • Multidimensional Expressions (MDX) = query language for OLAP
  • Microsoft 1997, also adopted by other vendors
  • https://docs.microsoft.com/en-us/sql/mdx/multidimensional-expressions-mdx-

reference?view=sql-analysis-services-2017

  • Ex (Wikipedia):

SELECT { [Measures].[Store Sales] } ON COLUMNS, { [Date].[2002], [Date].[2003] } ON ROWS FROM Sales WHERE ( [Store].[USA].[CA] )

slide-19
SLIDE 19

19 340151 Big Data & Cloud Services (P. Baumann)

Overview

  • Data Warehousing & Decision Support
  • Datacubes, Dimension Hierarchies
  • ROLAP & MOLAP
  • ETL
  • Summary
slide-20
SLIDE 20

20 340151 Big Data & Cloud Services (P. Baumann)

  • Mapping datacubes

to relational table schema ?

  • Central fact table

= tuples + n-D “coordinate” attributes

  • foreign keys
  • non-keys = measure
  • Dimension = table(s)
  • with coordinates

+ descriptions („metadata“)

  • One step of normalization:

keys  dimension tables

Datacubes in ROLAP: Facts & Dimensions

customer Prodcut Time

Qty Amount

Dimension Tables

customer Prodcut Time

Sales Fact Table

Measures Fact Dimensions

slide-21
SLIDE 21

21 340151 Big Data & Cloud Services (P. Baumann)

  • Measure vs dimension?
  • Ex: Web server log with browser, IP, visit duration, etc.
  • average of visit duration? Yes measure
  • average of browser brand name? No  dimension
  • Normalizing dimensions

 dimension hierarchies

  • Datacube = collection of fact & dimensions tables

Datacubes in ROLAP: Dimension Hierarchies

location: country city region time: day month year week

slide-22
SLIDE 22

22 340151 Big Data & Cloud Services (P. Baumann)

Star Schema

  • star schema = multidimensional data structure in relational database
  • Dimension hierarchies = aka lookup tables around fact table
  • MS SQL Server

Enterprise Manager:

many Dimension Tables Measures 1 Fact Table

slide-23
SLIDE 23

23 340151 Big Data & Cloud Services (P. Baumann)

Snowflake Schema

  • snowflake schema = refinement of star schema
  • Normalizing dimension tables
  • Ex:
  • Year → Month → Day
  • Week → Day
  • MS SQL Server

Enterprise Manager:

normalized Dimension Tables 1 Fact Table

[SqlPac @ Wikipedia]

slide-24
SLIDE 24

24 340151 Big Data & Cloud Services (P. Baumann)

Galaxy Schema

  • Galaxy schema = combined datacubes
  • Sharing dimension(s)
  • helpful for aggregating fact tables
  • also called „Fact Constellation Schema“

[https://www.guru99.com]

slide-25
SLIDE 25

25 340151 Big Data & Cloud Services (P. Baumann)

A Query in ROLAP

Table 1 (Dim 1)

,Table 1 T1 -- Dim 1

Table 3 (Dim 3)

,Table 3 T3 -- Dim 3

Table 4 (Facts 1)

FROM Table 4 T4 -- Fact WHERE T4.Dim_Col_1 = T1.Dim_Col_1 AND T4.Dim_Col_2 = T2.Dim_Col_1 AND T4.Dim_Col_3 = T3.Dim_Col_1

Table 2 (Dim 2)

,Table 2 T2 -- Dim 2 AND T1.Dim_Property_2 = ′Product 1‘ AND T2.Dim_Property_1 = ′City 1‘ AND T3.Dim_Property_1 = ′Salesman 1‘ SELECT Fact_Column_1 ,Fact_Column_2

SELECT Fact_Column_1 ,Fact_Column_2 FROM Table 4 T4 -- Fact ,Table 1 T1 -- Dim 1 ,Table 2 T2 -- Dim 2 ,Table 3 T3 -- Dim 3 WHERE T4.Dim_Col_1 = T1.Dim_Col_1 AND T4.Dim_Col_2 = T2.Dim_Col_1 AND T4.Dim_Col_3 = T3.Dim_Col_1 AND T1.Dim_Property_2 = ′Product 1‘ AND T2.Dim_Property_1 = ′City 1‘ AND T3.Dim_Property_1 = ′Salesman 1‘

slide-26
SLIDE 26

26 340151 Big Data & Cloud Services (P. Baumann)

Performance of ROLAP methods

  • ~ 70% of the time spent on CPU, rest on I/O
  • Most of the CPU time spent in sorting intermediate results
  • ~ 10-20% is spent on copying data
  • I/O composed of read/write into large tables
slide-27
SLIDE 27

27 340151 Big Data & Cloud Services (P. Baumann)

The MOLAP Approach

  • Native datacube

= multidimensional array

  • plus metadata
  • Fast position-based computation
  • cell values stored in fixed positions

determined by dimension values

  • Often used for data marts
slide-28
SLIDE 28

28 340151 Big Data & Cloud Services (P. Baumann)

A Query in MOLAP

  • Proprietary QLs
  • Ex: Oracle

Dim 1 Dim 3

LIMIT Dim 1 TO ’Product 1’ LIMIT Dim 2 TO ’City 1’ LIMIT Dim 3 TO ’Salesman 1’ REPORT sales REPORT sales LIMIT Dim 1 TO ’Product 1’ LIMIT Dim 2 TO ’City 1’ LIMIT Dim 3 TO ’Salesman 1’

slide-29
SLIDE 29

29 340151 Big Data & Cloud Services (P. Baumann)

Overview

  • Data Warehousing & Decision Support
  • Datacubes, Dimension Hierarchies
  • ROLAP & MOLAP
  • ETL
  • Summary
slide-30
SLIDE 30

30 340151 Big Data & Cloud Services (P. Baumann)

ETL

  • Extract
  • Extract relevant data
  • Transform
  • Transform data to DW format
  • Build keys, etc.
  • cleaning of data
  • Load
  • Load data into DW
  • Build aggregates, etc.
  • most underestimated process

in DW development

  • most time-consuming process

in DW development

  • 80% of development time spent
  • n ETL!
slide-31
SLIDE 31

31 340151 Big Data & Cloud Services (P. Baumann)

ETL in Data Warehouse Architecture

[soha jamil / Wikipedia]

slide-32
SLIDE 32

32 340151 Big Data & Cloud Services (P. Baumann)

How To Build an ETL Pipeline

  • Plan
  • Make high-level diagram of source-destination flow
  • Test, choose and implement ETL tool
  • Develop default strategies for common activities, e.g. extraction,
  • dimension management, etc
  • Drill down by target table, each column
  • One-time historic load process
  • Build and test the historic dimension table loads
  • Build and test the historic fact table loads
  • Construct fact tables, automation
  • Build and test the dimension table incremental load process
  • Build and test the fact table incremental load process
  • Build and test aggregate table / OLAP
  • Design, construct, and test ETL automation
slide-33
SLIDE 33

33 340151 Big Data & Cloud Services (P. Baumann)

Sample High-Level Extract Diagram

slide-34
SLIDE 34

34 340151 Big Data & Cloud Services (P. Baumann)

Ex: Microsoft BI Dev Studio

slide-35
SLIDE 35

35 340151 Big Data & Cloud Services (P. Baumann)

Common Transformations

  • Data type conversions
  • EBCDIC  ASCII/UniCode
  • String manipulations
  • Date/time format conversions
  • Ex: unix time 1201928400 = what time?
  • Normalization/denormalization
  • To desired DW format
  • Depending on source format
  • Building keys
  • Table matches production keys to surrogate DW keys
  • Correct handling of history - especially for total reload
slide-36
SLIDE 36

36 340151 Big Data & Cloud Services (P. Baumann)

Data Cleansing: Why?

  • Garbage In - Garbage Out
  • BI does not work on “raw” data
  • Pre-processing necessary for BI analysis
  • Handle inconsistent data formats: Spellings, codings, …
  • Remove unnecessary attributes: Production keys, comments,…
  • Replace codes with text (Why?)
  • City name instead of ZIP code, e.g., Aalborg Centrum vs. DK-9000
  • Combine data from multiple sources with common key
  • E.g., customer data from customer address, customer name, …
slide-37
SLIDE 37

38 340151 Big Data & Cloud Services (P. Baumann)

How to Check Cardinalities?

  • sorting
  • group by
  • Ex: given table product, check if product_sku and product_model are “1:many”

SELECT product_sku, count[*] as row count, count(distinct product_model) as model_count FROM staging_database.product GROUP BY product_sku HAVING count(distinct product_model) > 1

slide-38
SLIDE 38

40 340151 Big Data & Cloud Services (P. Baumann)

Data Quality

  • Data almost never has decent quality
  • Data in DW must be:
  • Precise
  • DW data must match known numbers - or explanation needed
  • Complete
  • DW has all relevant data and the users know
  • Consistent
  • No contradictory data: aggregates fit with detail data
  • Unique
  • same things called same and have same key
  • Timely
  • Data is updated ”frequently enough” and users know when
slide-39
SLIDE 39

41 340151 Big Data & Cloud Services (P. Baumann)

Improving Data Quality

  • Appoint “data quality administrator”
  • Responsibility for data quality
  • Includes manual inspections and corrections!
  • Source-controlled improvements
  • The optimal?
  • Construct programs that check data quality
  • Are totals as expected?
  • Do results agree with alternative source?
  • Number of NULL values?
  • Do not fix all problems with data quality
  • Allow management to see “weird” data in their reports?
  • Such data may be meaningful for them? (e.g., fraud detection)
slide-40
SLIDE 40

42 340151 Big Data & Cloud Services (P. Baumann)

Summary

  • ETL = Extract / Transfer / Load
  • General ETL issues
  • Building dimensions & fact tables
  • Extract
  • Transformations/cleaning
  • Load
  • Core & expensive task for Data Warehousing
  • Many tools available
slide-41
SLIDE 41

43 340151 Big Data & Cloud Services (P. Baumann)

Overview

  • Data Warehousing & Decision Support
  • Datacubes, Dimension Hierarchies
  • ROLAP & MOLAP
  • ETL
  • Summary
slide-42
SLIDE 42

44 340151 Big Data & Cloud Services (P. Baumann)

Extraction of Knowledge from Data

slide-43
SLIDE 43

45 340151 Big Data & Cloud Services (P. Baumann)

Summary: Data Warehousing Terminology

  • Typically warehouse data is multidimensional, with very large fact tables
  • Fact table
  • The subject, focus of analysis
  • Measures
  • The specific elements of analysis
  • Dimension
  • An object that allows to explore the measures from different perspectives
  • Hierarchies
  • Classification of dimensions, useful for data exploration and aggregation
  • Granularity
  • Level of detail of the stored data
slide-44
SLIDE 44

46 340151 Big Data & Cloud Services (P. Baumann)

Summary

Data warehouse software product or application, but information processing system architecture for decision making

  • Different models & schemas for data warehousing: ROLAP, MOLAP, …

Part of BI plot

ETL, Data Warehousing, OLAP, Data Mining, ...

  • Some more reading:
  • A. Shoshani: OLAP and statistical databases: Similarities and differences. ACM PODS

2000

  • OLAP council: MDAPI specification version 2.0