Data Warehousing and OLAP Decision Support Systems - - PDF document

data warehousing and olap decision support systems
SMART_READER_LITE
LIVE PREVIEW

Data Warehousing and OLAP Decision Support Systems - - PDF document

Data Warehousing and OLAP Decision Support Systems Decision-support systems are used to make business decisions, often based on data collected by online transaction-processing systems Examples of business decisions: What items to


slide-1
SLIDE 1

Data Warehousing and OLAP

slide-2
SLIDE 2

CMPT 354: Database I -- Data Warehousing and OLAP 2

Decision Support Systems

  • Decision-support systems are used to make

business decisions, often based on data collected by online transaction-processing systems

  • Examples of business decisions:

– What items to stock? – What insurance premium to change? – To whom to send advertisements?

  • Examples of data used for making decisions

– Retail sales transaction details – Customer profiles (income, age, gender, etc.)

slide-3
SLIDE 3

CMPT 354: Database I -- Data Warehousing and OLAP 3

Data and Statistical Analysis

  • Data analysis tasks are simplified by

specialized tools and SQL extensions

– Example tasks

  • For each product category and each region, what

were the total sales in the last quarter and how do they compare with the same quarter last year

  • As above, for each product category and each

customer category

  • Statistical analysis packages (e.g., SAS)

can be interfaced with databases

slide-4
SLIDE 4

CMPT 354: Database I -- Data Warehousing and OLAP 4

Data Analysis and OLAP

  • Online Analytical Processing (OLAP)

– Interactive analysis of data, allowing data to be summarized and viewed in different ways in an online fashion (with negligible delay)

  • Multidimensional data: data modeled as dimension

attributes and measure attributes

– Dimension attributes: define the dimensions on which measure attributes (or aggregates thereof) are viewed, e.g. the attributes item_name, color, and size of the sales relation – Measure attributes: can be aggregated upon, e.g., the attribute number of the sales relation

slide-5
SLIDE 5

CMPT 354: Database I -- Data Warehousing and OLAP 5

Pivot Table

  • Values for one of the dimension attributes form the row headers
  • Values for another dimension attribute form the column headers
  • Other dimension attributes are listed on top
  • Values in individual cells are (aggregates of) the values of the

dimension attributes that specify the cell

slide-6
SLIDE 6

CMPT 354: Database I -- Data Warehousing and OLAP 6

Relational Representation

  • Cross-tabs can be

represented as relations

– The value all is used to represent aggregates – All represents a set – The SQL:1999 standard uses null values in place of all despite confusion with regular null values

slide-7
SLIDE 7

CMPT 354: Database I -- Data Warehousing and OLAP 7

Data Cubes

  • A data cube is a multidimensional

generalization of a cross-tab

  • Can have n dimensions
  • Cross-tabs can be used as views on a data

cube

slide-8
SLIDE 8

CMPT 354: Database I -- Data Warehousing and OLAP 8

Online Analytical Processing

  • Pivoting: changing the dimensions used in a

cross-tab is called

  • Slicing: creating a cross-tab for fixed values only

– Sometimes called dicing, particularly when values for multiple dimensions are fixed

  • Rollup: moving from finer-granularity data to a

coarser granularity

  • Drill down: The opposite operation - that of

moving from coarser-granularity data to finer- granularity data

slide-9
SLIDE 9

CMPT 354: Database I -- Data Warehousing and OLAP 9

Hierarchies on Dimensions

  • Enable dimensions be viewed at different levels of

detail

– Dimension DateTime can be used to aggregate by hour

  • f day, date, day of week, month, quarter or year
slide-10
SLIDE 10

CMPT 354: Database I -- Data Warehousing and OLAP 10

Cross Tabulation With Hierarchy

slide-11
SLIDE 11

CMPT 354: Database I -- Data Warehousing and OLAP 11

OLAP Implementation

  • Multidimensional OLAP (MOLAP) systems

– Multidimensional arrays in memory to store data cubes

  • Relational OLAP (ROLAP) systems

– Relational tables to store data cubes

  • Hybrid OLAP (HOLAP) systems

– Store some summaries in memory and store the base data and other summaries in a relational database

slide-12
SLIDE 12

CMPT 354: Database I -- Data Warehousing and OLAP 12

Extended Aggregation in SQL:1999

  • The cube operation computes union of group by’s
  • n every subset of the specified attributes

select item-name, color, size, sum(number) from sales group by cube(item-name, color, size)

  • Compute the union of eight different groupings of

the sales relation: { (item-name, color, size), (item- name, color), (item-name, size), (color, size), (item-name), (color), (size), ( ) }

  • For each grouping, the result contains the null

value for attributes not present in the grouping

slide-13
SLIDE 13

CMPT 354: Database I -- Data Warehousing and OLAP 13

OLTP Versus OLAP

OLTP OLAP users clerk, IT professional knowledge worker function day to day operations decision support DB design application-oriented subject-oriented data current, up-to-date, detailed, flat relational Isolated historical, summarized, multidimensional integrated, consolidated usage repetitive ad-hoc access read/write, index/hash on prim. key lots of scans unit of work short, simple transaction complex query # records accessed tens millions #users thousands hundreds DB size 100MB-GB 100GB-TB metric transaction throughput query throughput, response

slide-14
SLIDE 14

CMPT 354: Database I -- Data Warehousing and OLAP 14

What Is a Data Warehouse?

  • “A data warehouse is a subject-oriented,

integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” – W. H. Inmon

  • Data warehousing: the process of

constructing and using data warehouses

slide-15
SLIDE 15

CMPT 354: Database I -- Data Warehousing and OLAP 15

Subject-Oriented

  • Organized around major subjects, such as

customer, product, sales

  • Focusing on the modeling and analysis of

data for decision makers, not on daily

  • perations or transaction processing
  • Providing a simple and concise view around

particular subject issues by excluding data that are not useful in the decision support process

slide-16
SLIDE 16

CMPT 354: Database I -- Data Warehousing and OLAP 16

Integrated

  • Integrating multiple, heterogeneous data sources

– Relational databases, flat files, on-line transaction records

  • Data cleaning and data integration

– Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources

  • E.g., Hotel price: currency, tax, breakfast covered, etc.

– When data is moved to the warehouse, it is converted

slide-17
SLIDE 17

CMPT 354: Database I -- Data Warehousing and OLAP 17

Time Variant

  • The time horizon for the data warehouse is

significantly longer than that of operational systems

– Operational database: current value data – Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)

  • Every key structure in the data warehouse

– Contains an element of time, explicitly or implicitly – But the key of operational data may or may not contain “time element”

slide-18
SLIDE 18

CMPT 354: Database I -- Data Warehousing and OLAP 18

Nonvolatile

  • A physically separate store of data

transformed from the operational environment

  • Operational update of data does not occur in

the data warehouse environment

– Does not require transaction processing, recovery, and concurrency control mechanisms – Requires only two operations in data accessing:

  • initial loading of data and access of data
slide-19
SLIDE 19

CMPT 354: Database I -- Data Warehousing and OLAP 19

Data Warehousing

slide-20
SLIDE 20

CMPT 354: Database I -- Data Warehousing and OLAP 20

Collecting Data

  • Source driven architecture: data sources transmit

new information to a warehouse, either continuously or periodically (e.g. at night)

  • Destination driven architecture: a warehouse

periodically requests new information from data sources

  • Keeping warehouse exactly synchronized with

data sources (e.g. using two-phase commit) is too expensive

– Usually OK to have slightly out-of-date data at warehouse – Data/updates are periodically downloaded form online transaction processing (OLTP) systems

slide-21
SLIDE 21

CMPT 354: Database I -- Data Warehousing and OLAP 21

Design Issues

  • Data cleansing

– Correct mistakes in addresses (misspellings, zip code errors), and merge address lists from different sources and purge duplicates

  • Update propagating

– Warehouse schema may be a (materialized) view of schema from data sources

  • Summarizing data

– Raw data may be too large to store on-line – Aggregate values (totals/subtotals) often suffice – Queries on raw data can often be transformed by query

  • ptimizer to use aggregate values
slide-22
SLIDE 22

CMPT 354: Database I -- Data Warehousing and OLAP 22

Warehouse Schemas

  • Dimension values are usually encoded

using small integers and mapped to full values via dimension tables

  • Resultant schema is called a star schema

– More complicated schema structures

  • Snowflake schema: multiple levels of dimension

tables

  • Constellation: multiple fact tables
slide-23
SLIDE 23

CMPT 354: Database I -- Data Warehousing and OLAP 23

Data Warehouse Schema

slide-24
SLIDE 24

CMPT 354: Database I -- Data Warehousing and OLAP 24

Snowflake Schema

A star schema is a snowflake schema where each dimension has only

  • ne single

dimension table

Picture from publib.boulder.ibm.com

slide-25
SLIDE 25

CMPT 354: Database I -- Data Warehousing and OLAP 25

Why Data Mining?

  • Evolution of database technology

– To collect a large amount of data primitive file processing – To store and query data efficiently DBMS

  • New challenges: huge amount of data, how

to analyze and understand?

– Data mining

slide-26
SLIDE 26

CMPT 354: Database I -- Data Warehousing and OLAP 26

What Is Data Mining?

  • Mining data – mining knowledge
  • Data mining is the non-trivial process of

identifying valid, novel, potentially useful, and ultimately understandable patterns in data

slide-27
SLIDE 27

CMPT 354: Database I -- Data Warehousing and OLAP 27

The KDD Process

Data

Target data

Preprocessed data Transformed data

Patterns

Knowledge Selection Preprocessing Transformation Data mining Interpretation/ evaluation

slide-28
SLIDE 28

CMPT 354: Database I -- Data Warehousing and OLAP 28

KDD Process Steps

  • Preprocessing

– Data cleaning – Data integration

  • Data selection
  • Data transformation
  • Data mining
  • Pattern evaluation
  • Knowledge presentation
slide-29
SLIDE 29

CMPT 354: Database I -- Data Warehousing and OLAP 29

Summary

  • OLAP
  • Data warehousing

– Star schema – Snowflake schema

  • Data mining and KDD process