Data Warehousing and OLAP Decision Support Systems - - PDF document
Data Warehousing and OLAP Decision Support Systems - - PDF document
Data Warehousing and OLAP Decision Support Systems Decision-support systems are used to make business decisions, often based on data collected by online transaction-processing systems Examples of business decisions: What items to
CMPT 354: Database I -- Data Warehousing and OLAP 2
Decision Support Systems
- Decision-support systems are used to make
business decisions, often based on data collected by online transaction-processing systems
- Examples of business decisions:
– What items to stock? – What insurance premium to change? – To whom to send advertisements?
- Examples of data used for making decisions
– Retail sales transaction details – Customer profiles (income, age, gender, etc.)
CMPT 354: Database I -- Data Warehousing and OLAP 3
Data and Statistical Analysis
- Data analysis tasks are simplified by
specialized tools and SQL extensions
– Example tasks
- For each product category and each region, what
were the total sales in the last quarter and how do they compare with the same quarter last year
- As above, for each product category and each
customer category
- Statistical analysis packages (e.g., SAS)
can be interfaced with databases
CMPT 354: Database I -- Data Warehousing and OLAP 4
Data Analysis and OLAP
- Online Analytical Processing (OLAP)
– Interactive analysis of data, allowing data to be summarized and viewed in different ways in an online fashion (with negligible delay)
- Multidimensional data: data modeled as dimension
attributes and measure attributes
– Dimension attributes: define the dimensions on which measure attributes (or aggregates thereof) are viewed, e.g. the attributes item_name, color, and size of the sales relation – Measure attributes: can be aggregated upon, e.g., the attribute number of the sales relation
CMPT 354: Database I -- Data Warehousing and OLAP 5
Pivot Table
- Values for one of the dimension attributes form the row headers
- Values for another dimension attribute form the column headers
- Other dimension attributes are listed on top
- Values in individual cells are (aggregates of) the values of the
dimension attributes that specify the cell
CMPT 354: Database I -- Data Warehousing and OLAP 6
Relational Representation
- Cross-tabs can be
represented as relations
– The value all is used to represent aggregates – All represents a set – The SQL:1999 standard uses null values in place of all despite confusion with regular null values
CMPT 354: Database I -- Data Warehousing and OLAP 7
Data Cubes
- A data cube is a multidimensional
generalization of a cross-tab
- Can have n dimensions
- Cross-tabs can be used as views on a data
cube
CMPT 354: Database I -- Data Warehousing and OLAP 8
Online Analytical Processing
- Pivoting: changing the dimensions used in a
cross-tab is called
- Slicing: creating a cross-tab for fixed values only
– Sometimes called dicing, particularly when values for multiple dimensions are fixed
- Rollup: moving from finer-granularity data to a
coarser granularity
- Drill down: The opposite operation - that of
moving from coarser-granularity data to finer- granularity data
CMPT 354: Database I -- Data Warehousing and OLAP 9
Hierarchies on Dimensions
- Enable dimensions be viewed at different levels of
detail
– Dimension DateTime can be used to aggregate by hour
- f day, date, day of week, month, quarter or year
CMPT 354: Database I -- Data Warehousing and OLAP 10
Cross Tabulation With Hierarchy
CMPT 354: Database I -- Data Warehousing and OLAP 11
OLAP Implementation
- Multidimensional OLAP (MOLAP) systems
– Multidimensional arrays in memory to store data cubes
- Relational OLAP (ROLAP) systems
– Relational tables to store data cubes
- Hybrid OLAP (HOLAP) systems
– Store some summaries in memory and store the base data and other summaries in a relational database
CMPT 354: Database I -- Data Warehousing and OLAP 12
Extended Aggregation in SQL:1999
- The cube operation computes union of group by’s
- n every subset of the specified attributes
select item-name, color, size, sum(number) from sales group by cube(item-name, color, size)
- Compute the union of eight different groupings of
the sales relation: { (item-name, color, size), (item- name, color), (item-name, size), (color, size), (item-name), (color), (size), ( ) }
- For each grouping, the result contains the null
value for attributes not present in the grouping
CMPT 354: Database I -- Data Warehousing and OLAP 13
OLTP Versus OLAP
OLTP OLAP users clerk, IT professional knowledge worker function day to day operations decision support DB design application-oriented subject-oriented data current, up-to-date, detailed, flat relational Isolated historical, summarized, multidimensional integrated, consolidated usage repetitive ad-hoc access read/write, index/hash on prim. key lots of scans unit of work short, simple transaction complex query # records accessed tens millions #users thousands hundreds DB size 100MB-GB 100GB-TB metric transaction throughput query throughput, response
CMPT 354: Database I -- Data Warehousing and OLAP 14
What Is a Data Warehouse?
- “A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” – W. H. Inmon
- Data warehousing: the process of
constructing and using data warehouses
CMPT 354: Database I -- Data Warehousing and OLAP 15
Subject-Oriented
- Organized around major subjects, such as
customer, product, sales
- Focusing on the modeling and analysis of
data for decision makers, not on daily
- perations or transaction processing
- Providing a simple and concise view around
particular subject issues by excluding data that are not useful in the decision support process
CMPT 354: Database I -- Data Warehousing and OLAP 16
Integrated
- Integrating multiple, heterogeneous data sources
– Relational databases, flat files, on-line transaction records
- Data cleaning and data integration
– Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
- E.g., Hotel price: currency, tax, breakfast covered, etc.
– When data is moved to the warehouse, it is converted
CMPT 354: Database I -- Data Warehousing and OLAP 17
Time Variant
- The time horizon for the data warehouse is
significantly longer than that of operational systems
– Operational database: current value data – Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
- Every key structure in the data warehouse
– Contains an element of time, explicitly or implicitly – But the key of operational data may or may not contain “time element”
CMPT 354: Database I -- Data Warehousing and OLAP 18
Nonvolatile
- A physically separate store of data
transformed from the operational environment
- Operational update of data does not occur in
the data warehouse environment
– Does not require transaction processing, recovery, and concurrency control mechanisms – Requires only two operations in data accessing:
- initial loading of data and access of data
CMPT 354: Database I -- Data Warehousing and OLAP 19
Data Warehousing
CMPT 354: Database I -- Data Warehousing and OLAP 20
Collecting Data
- Source driven architecture: data sources transmit
new information to a warehouse, either continuously or periodically (e.g. at night)
- Destination driven architecture: a warehouse
periodically requests new information from data sources
- Keeping warehouse exactly synchronized with
data sources (e.g. using two-phase commit) is too expensive
– Usually OK to have slightly out-of-date data at warehouse – Data/updates are periodically downloaded form online transaction processing (OLTP) systems
CMPT 354: Database I -- Data Warehousing and OLAP 21
Design Issues
- Data cleansing
– Correct mistakes in addresses (misspellings, zip code errors), and merge address lists from different sources and purge duplicates
- Update propagating
– Warehouse schema may be a (materialized) view of schema from data sources
- Summarizing data
– Raw data may be too large to store on-line – Aggregate values (totals/subtotals) often suffice – Queries on raw data can often be transformed by query
- ptimizer to use aggregate values
CMPT 354: Database I -- Data Warehousing and OLAP 22
Warehouse Schemas
- Dimension values are usually encoded
using small integers and mapped to full values via dimension tables
- Resultant schema is called a star schema
– More complicated schema structures
- Snowflake schema: multiple levels of dimension
tables
- Constellation: multiple fact tables
CMPT 354: Database I -- Data Warehousing and OLAP 23
Data Warehouse Schema
CMPT 354: Database I -- Data Warehousing and OLAP 24
Snowflake Schema
A star schema is a snowflake schema where each dimension has only
- ne single
dimension table
Picture from publib.boulder.ibm.com
CMPT 354: Database I -- Data Warehousing and OLAP 25
Why Data Mining?
- Evolution of database technology
– To collect a large amount of data primitive file processing – To store and query data efficiently DBMS
- New challenges: huge amount of data, how
to analyze and understand?
– Data mining
CMPT 354: Database I -- Data Warehousing and OLAP 26
What Is Data Mining?
- Mining data – mining knowledge
- Data mining is the non-trivial process of
identifying valid, novel, potentially useful, and ultimately understandable patterns in data
CMPT 354: Database I -- Data Warehousing and OLAP 27
The KDD Process
Data
Target data
Preprocessed data Transformed data
Patterns
Knowledge Selection Preprocessing Transformation Data mining Interpretation/ evaluation
CMPT 354: Database I -- Data Warehousing and OLAP 28
KDD Process Steps
- Preprocessing
– Data cleaning – Data integration
- Data selection
- Data transformation
- Data mining
- Pattern evaluation
- Knowledge presentation
CMPT 354: Database I -- Data Warehousing and OLAP 29
Summary
- OLAP
- Data warehousing
– Star schema – Snowflake schema
- Data mining and KDD process