Chapter 4 , Data W arehouse and OLAP Operations Young-Rae Cho - - PDF document

chapter 4 data w arehouse and olap operations
SMART_READER_LITE
LIVE PREVIEW

Chapter 4 , Data W arehouse and OLAP Operations Young-Rae Cho - - PDF document

11/16/2017 CSI 4352, Introduction to Data Mining Chapter 4 , Data W arehouse and OLAP Operations Young-Rae Cho Associate Professor Department of Computer Science Baylor University CSI 4352, Introduction to Data Mining Chapter 4 , Data W


slide-1
SLIDE 1

11/16/2017 1

Chapter 4 , Data W arehouse and OLAP Operations

Young-Rae Cho

Associate Professor Department of Computer Science Baylor University

CSI 4352, Introduction to Data Mining

Chapter 4 , Data W arehouse & OLAP Operations

CSI 4352, Introduction to Data Mining

  • Basic Concept of Data W arehouse

 Data W arehouse Modeling  Data W arehouse Architecture  Data W arehouse I m plem entation  From Data W arehousing to Data Mining

slide-2
SLIDE 2

11/16/2017 2 What is Data Warehouse?

  • Data W arehouse

( defined in many different ways )

  • A decision support database that is maintained separately from the
  • rganization’s operational database
  • The support of information processing by providing a solid platform of

consolidated, historical data for analysis

  • “A data warehouse is a (1) subject-oriented, (2) integrated,

(3) time-variant, and (4) nonvolatile collection of data in support of management’s decision-making process.” — W. H. I nmon

  • Data W arehousing
  • The process of constructing and using data warehouses

Data Warehouse – Subject-Oriented

  • Organized around major subjects
  • e.g., customers, products, sales
  • Focusing on the modeling and analysis of data for decision makers,

not on daily operations or transaction processing

  • Provide a simple and concise view around particular subject issues
  • Excluding data that are not useful in the decision support process
slide-3
SLIDE 3

11/16/2017 3 Data Warehouse – Integrated

  • I ntegrating multiple, heterogeneous data sources
  • Relational databases, flat files, on-line transaction records
  • Apply data cleaning and data integration techniques
  • Ensures consistency in naming conventions, encoding structures,

attribute measures, etc. among different data sources

Data Warehouse – Time Variant

  • The time horizon of data warehouses is significantly longer than that of
  • perational systems
  • Operational databases have current data values
  • Data warehouses provide information from a historical perspective

(e.g., 10-20 years)

  • Time is a key structure in data warehouses
  • Contain the attribute of time (explicitly or implicitly)
slide-4
SLIDE 4

11/16/2017 4 Data Warehouse – Nonvolatile

  • A physically separate storage of data transformed from operational databases
  • Operational update of data does not occur
  • Not require transaction processing, recovery, and concurrency control

mechanisms

  • Require only two operations, initial loading of data and access of data

Data Integration Methods

  • Methods

(1) Process to provide uniform interface to multiple data sources → Tradition Database I ntegration (2) Process to combine multiple data sources into coherent storage → Data warehousing

  • Traditional DB I ntegration
  • A query-driven approach
  • Wrappers / mediators on top of heterogeneous data sources
  • Data W arehousing
  • An update-driven approach
  • Combined the heterogeneous data sources in advance
  • Stored them in a warehouse for direct query and analysis
slide-5
SLIDE 5

11/16/2017 5 OLTP vs. OLAP

  • OLTP ( on-line transaction processing)
  • Major task of traditional relational DBMS
  • Day-to-day operations: e.g., purchasing, inventory, manufacturing,

banking, payroll, registration, accounting, etc.

  • OLAP ( on-line analytical processing)
  • Major task of data warehouse system
  • Data analysis and decision making
  • Distinct Features ( OLTP vs. OLAP)
  • User and system orientation (customers vs. market analysts)
  • Data contents (current, detailed vs. historical, consolidated)
  • Database design (ER + application vs. star + subject)
  • View (current, local vs. evolutionary, integrated)

OLTP vs. OLAP

Feature OLTP OLAP Characteristic

  • perational processing

information processing Orientation transaction analysis Users clerk, DBA knowledge worker (CEO, analyst) Function day-to-day operations decision support DB Design application-oriented subject-oriented Data current, up-to-date historical, integrated, summarized Unit of work short, simple transaction complex query Access read/write/update read-only (lots of scans)

slide-6
SLIDE 6

11/16/2017 6 Why Data Warehouse?

  • Perform ance I ssue
  • DBMS: tuned for OLTP

e.g., access methods, indexing, concurrency control, recovery

  • Data warehouse: tuned for OLAP

e.g., complex queries, multidimensional view, consolidation

  • Data I ssue
  • Decision support requires historical data, consolidated and summarized

data, consistent data

Chapter 4 , Data W arehouse & OLAP Operations

CSI 4352, Introduction to Data Mining  Basic Concept of Data W arehouse

  • Data W arehouse Modeling

 Data W arehouse Architecture  Data W arehouse I m plem entation  From Data W arehousing to Data Mining

slide-7
SLIDE 7

11/16/2017 7 Data Format for Warehouse

  • Dim ensions
  • Multi-dimensional data model
  • Data are stored in the form of a data cube
  • Data Cube
  • A view of multi-dimensions
  • Dimension tables, such as item (item_name, brand, type), or

time (day, week, month, quarter, year)

  • Fact table contains measures (such as dollars_sold) and keys to each of

the related dimension tables

  • Cuboid
  • Each combination of dimensional spaces in a data cube
  • 0-D cuboid, 1-D cuboid, 2-D cuboid, …

, n-D cuboid

The Lattice of Cuboids

all time item location supplier time,location time,supplier item,location item,supplier location,supplier time,item,supplier time,location,supplier item,location,supplier

0-D (apex) cuboid 1-D cuboids 2-D cuboids 3-D cuboids 4-D (base) cuboid

time,item time,item,location time, item, location, supplier

slide-8
SLIDE 8

11/16/2017 8 Conceptual Modeling

  • Key of Modeling Data W arehouses
  • Handling dimensions & measures
  • Exam ples
  • Star schema: A fact table in the middle connected to a set of dimension

tables

  • Snowflake schema: A refinement of star schema where some dimensional

hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake

  • Fact constellations: Multiple fact tables share dimension tables, viewed

as a collection of stars, therefore called galaxy schema

  • r fact constellation

Example of Star Schema

time_key day day_of_the_week month quarter year time location_key street city state country location

Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales Measures

item_key item_name brand type supplier_type item branch_key branch_name branch_type branch

slide-9
SLIDE 9

11/16/2017 9 Example of Snowflake Schema

supplier_key supplier_type

supplier

city_key city state country

city time_key day day_of_the_week month quarter year time location_key street city_key location

Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales Measures

item_key item_name brand type supplier_key item branch_key branch_name branch_type branch

Example of Fact Constellation

time_key day day_of_the_week month quarter year time location_key street city state country location

Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales Measures

item_key item_name brand type supplier_type item branch_key branch_name branch_type branch time_key item_key shipper_key from_location to_location dollars_cost units_shipped

shipper_key shipper_name location_key shipper_type

shipper Shipping Fact Table

slide-10
SLIDE 10

11/16/2017 10 Cube Definition in DMQL

  • Cube Definition

(Fact Table)

  • define cube < cube_name> [ < dimension_list> ] : < measure_list>
  • Dim ension Definition

(Dimension Table)

  • define dimension < dimension_name> as (< attribute_or_dimension_list> )
  • Special Case

(Shared Dimension Table)

  • define dimension < dimension_name> as < dimension_name_first>

in cube < cube_name_first>

Star Schema Definition in DMQL

  • Exam ple
  • define cube sales [ time, item, branch, location] :

dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(* )

  • define dimension time as (time_key, day, day_of_week, month, quarter,

year)

  • define dimension item as (item_key, item_name, brand, type,

supplier_type)

  • define dimension branch as (branch_key, branch_name, branch_type)
  • define dimension location as (location_key, street, city, state, country)
slide-11
SLIDE 11

11/16/2017 11 Snowflake Schema Definition in DMQL

  • Exam ple
  • define cube sales [ time, item, branch, location] :

dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(* )

  • define dimension time as (time_key, day, day_of_week, month, quarter,

year)

  • define dimension item as (item_key, item_name, brand, type,

supplier(supplier_key, supplier_type))

  • define dimension branch as (branch_key, branch_name, branch_type)
  • define dimension location as (location_key, street, city(city_key,

province_or_state, country))

Fact Constellation Schema Definition in DMQL

  • Exam ple
  • define cube sales [ time, item, branch, location] :

dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(* )

  • define dimension time as (time_key, day, day_of_week, month, quarter, year)
  • define dimension item as (item_key, item_name, brand, type, supplier_type)
  • define dimension branch as (branch_key, branch_name, branch_type)
  • define dimension location as (location_key, street, city, state, country)
  • define cube shipping [ time, item, shipper, from_location, to_location] :

dollar_cost = sum(cost_in_dollars), unit_shipped = count(* )

  • define dimension time as time in cube sales
  • define dimension item as item in cube sales
  • define dimension shipper as (shipper_key, shipper_name,

location as location in cube sales, shipper_type)

  • define dimension from_location as location in cube sales
  • define dimension to_location as location in cube sales
slide-12
SLIDE 12

11/16/2017 12 Measures

  • Distributive
  • I f the result derived by applying the function to n aggregate values is the

same as that derived by applying the function on all the data without partitioning

  • e.g., count(), sum(), min(), max()
  • Algebraic
  • I f it can be computed by an algebraic function with m arguments, each of

which is obtained by applying a distributive aggregate function

  • e.g., avg(), standard_deviation()
  • Holistic
  • I f there is no constant bound on the storage size needed to describe a

subaggregate

  • e.g., median(), mode(), rank()

Concept Hierarchy

  • Schem a Hierarchy
  • e.g., street < city < state < country
  • e.g., day < { month < quarter ; week } < year
  • Set-group hierarchy
  • e.g., { (0..100] ; (100..200] } < (0..200]
  • e.g., { (0..10] < lowPrice ; { (10..100] ; (100..200] } < highPrice }

< allProducts Year Quarter | Week Month Day (0..200] (0..100] (100..200] allProducts lowPrice highPrice (0..10] (10..100] (100..200]

slide-13
SLIDE 13

11/16/2017 13 Three Components of Data Cube

  • Exam ple
  • Measures: data values as a function of products, locations and time
  • Dimensions:
  • Hierarchies:

Company | Category | Product locations time Country | City | Office Year Quarter | Week Month Day product location time

Example of Cuboid Cells

all product quarter country product, quarter product, country quarter, country product, quarter, country

0-D (apex) cuboid 1-D cuboids 2-D cuboids 3-D (base) cuboid

product dimension time dimension location dimension

slide-14
SLIDE 14

11/16/2017 14 Example of Data Cube

Total annual sales

  • f TV in U.S.A.

Quarter Country

sum sum TV VCR PC 1Qtr 2Qtr 3Qtr 4Qtr U.S.A Canada Mexico sum

OLAP Operations

  • Roll-up

(drill-up)

  • Summarizes (aggregates) data
  • by climbing up hierarchy or by dimension reduction
  • Drill-dow n

(roll-down)

  • Reverse of roll-up
  • by stepping down to lower-level data or introducing new dimensions
  • Slice
  • Selecting data on one dimension
  • Dice
  • Selecting data on multi-dimensions
  • Pivot

(rotate)

  • Reorienting the cube, or transforming 3-D data to a series of 2-D spaces
slide-15
SLIDE 15

11/16/2017 15 Roll-UP & Drill-Down

country quarter city quarter country month location time

Slice, Dice & Pivot

country quarter quarter country quarter location time

Q1 Q2 Q3 Q4 USA canada mexico pc laptop supercom

country

USA canada Q1 Q2 pc laptop Q1 Q2 Q3 Q4 USA canada mexico

slide-16
SLIDE 16

11/16/2017 16 Starnet Query Model

Shipping Method air truck

  • rder

Orders contracts Customer Product category item sales-person division division Organization Promotion city country region Location daily quarterly annually Time Each circle is called a footprint customerI D

Chapter 4 , Data W arehouse & OLAP Operations

CSI 4352, Introduction to Data Mining  Basic Concept of Data W arehouse  Data W arehouse Modeling

  • Data W arehouse Architecture

 Data W arehouse I m plem entation  From Data W arehousing to Data Mining

slide-17
SLIDE 17

11/16/2017 17 Data Warehouse Design

  • Top-Dow n View
  • Allows the selection of the relevant information necessary for the data

warehouse

  • Data Source View
  • Exposes the information being captured, stored, and managed by
  • perational systems
  • Data W arehouse View
  • Consists of fact tables and dimension tables
  • Business Query View
  • Shows the perspectives of data in the warehouse to end-users

Data Warehouse Design Process

  • Categories by Process Direction
  • Top-down: Starts with overall design and planning (mature)
  • Bottom-up: Starts with experiments and prototypes (rapid)
  • Categories by Softw are Engineering View
  • Waterfall: structured, systematic analysis at each step before proceeding

to the next

  • Spiral: rapid generation of functional systems, short turn around time
  • Typical Data W arehouse Design Process
  • Choose business processes for modeling, e.g., orders, invoices, etc
  • Choose the grain (atomic level of data) of the business processes
  • Choose the dimensions that will apply to each fact table
  • Choose the measures that will populate each fact table
slide-18
SLIDE 18

11/16/2017 18 Data Warehouse Architecture

Data W arehouse

Extract Transform Load Refresh

OLAP Engine Query Analysis Reports

Monitor & Integrator

Metadata Data Sources Front-End Tools

Serve

Data Marts

Operational DBs Other sources

Data Storage OLAP Server

Three Data Warehouse Models

  • Enterprise W arehouse
  • A global view with all the information about subjects spanning the entire
  • rganization
  • Data Mart
  • A subset of corporate-wide data that is of value to a specific group of

users

  • I ts scope is confined to specific, selected groups
  • I ndependent vs. dependent (directly from warehouse) data mart
  • Virtual W arehouse
  • A set of views over operational databases
  • Only some of possible summary views may be materialized
slide-19
SLIDE 19

11/16/2017 19 Development of Data Warehouse

Define a high-level corporate data m odel Model Refinement Enterprise Data W arehouse Multi-Tier Data W arehouse Distributed Data Marts Data Mart Data Mart Model Refinement

Utilities of Back-End Tools

  • Data Extraction
  • Get data from multiple, heterogeneous, and external sources
  • Data Cleaning
  • Detect errors in the data and rectify them when possible
  • Data Transform ation
  • Convert data from the original format to the warehouse format
  • Loading
  • Sort, summarize, consolidate, compute views, check integrity, and

build indices and partitions

  • Refresh
  • Propagate the updates from data sources to the warehouse
slide-20
SLIDE 20

11/16/2017 20 OLAP Server Architecture

  • Relational OLAP ( ROLAP)
  • Uses relational or extended-relational DBMS to store and manage

warehouse data and OLAP middle ware

  • I ncludes optimization of DBMS back-end, implementation of aggregation

navigation logic, and additional tools and services

  • High scalability
  • Multidim ensional OLAP ( MOLAP)
  • Sparse array-based multidimensional storage engine
  • Fast indexing to pre-computed summarized data
  • Hybrid OLAP ( HOLAP)
  • Low-level: relational / high-level: array
  • High flexibility

Metadata Repository

  • Definition of Metadata
  • The data defining data warehouse objects
  • Exam ples
  • Description of the structure of the data warehouse, e.g., schema, view,

dimensions, hierarchies, data definitions, data mart locations and contents

  • Operational meta-data, e.g., history of migrated data, currency of data,

warehouse usage statistics, error reports

  • Algorithms used for summarization
  • Mapping from operational environment to data warehouse
  • Data related to system performance
  • Business data, e.g., business terms and definitions, ownership of data,

charging policies

slide-21
SLIDE 21

11/16/2017 21

Chapter 4 , Data W arehouse & OLAP Operations

CSI 4352, Introduction to Data Mining  Basic Concept of Data W arehouse  Data W arehouse Modeling  Data W arehouse Architecture

  • Data W arehouse I m plem entation

 From Data W arehousing to Data Mining

Data Cube Computation

  • View as a Lattice of Cuboids
  • How many cuboids in an n-dimensional cube?
  • How many cuboid cells in an n-dimensional cube

with Li levels?

  • Materialization of Data Cube
  • Full materialization (all cuboids), Partial materialization (some cuboids),

No materialization (only base cuboid)

  • Selection of cuboids to materialize
  • Based on the size, sharing, access frequency, etc.

) 1 1 (    n i i L

n

2

all product time location product, time product, location time, location product, time, location

slide-22
SLIDE 22

11/16/2017 22 Cube Operation

  • Cube Definition and Com putation in DMQL
  • define cube sales [item, city, year]: sum (sales_in_dollars)
  • compute cube sales
  • Cube Definition and Com putation in SQL
  • select item, city, year, SUM (amount)
  • from sales
  • cube by item, city, year
  • I nternal Operations
  • group by (item, city, year)
  • group by (item, city), (item, year), (city, year)
  • group by (item), (city), (year)
  • group by ()

Iceberg Cube

  • I ceberg Cube Com putation
  • Computing only the cuboid cells whose count or
  • ther aggregates satisfying the condition like

HAVI NG COUNT(* ) > = min_sup

  • Motivation
  • Only a small portion of cube cells may be “above the water’’

in a sparse cube

  • Only calculate “interesting” cells — data above certain threshold
  • Avoid explosive growth of the cube
slide-23
SLIDE 23

11/16/2017 23

Chapter 4 , Data W arehouse & OLAP Operations

CSI 4352, Introduction to Data Mining  Basic Concept of Data W arehouse  Data W arehouse Modeling  Data W arehouse Architecture  Data W arehouse I m plem entation

  • From Data W arehousing to Data Mining

Data Warehouse Usage

  • I nform ation Processing
  • Supports querying, basic statistical analysis, and reporting using

crosstabs, tables, charts and graphs

  • Analytical Processing
  • Supports OLAP operations in multi-dimensional space
  • Data Mining
  • Supports pattern discovery from warehouse data, and presenting

the mining results using visualization tools

slide-24
SLIDE 24

11/16/2017 24 From OLAP To OLAM

  • On-Line Analytical Mining ( OLAM)
  • ( OLAP + Data Mining ) in data warehouse
  • W hy OLAM ?
  • High quality data
  • Data warehouse contains integrated, consistent and cleaned data
  • I nformation processing infrastructure
  • ODBC/ OLE DB connections, web accessing, service facilities
  • OLAP-based exploratory data analysis
  • Mining with drilling, dicing, pivoting, etc.
  • On-line selection of data mining functions
  • I ntegration and swapping of multiple data mining functions

OLAM System Architecture

Data Warehouse OLAM Engine OLAP Engine

User GUI API Data Cube API Database API data cleaning data integration

Layer3 OLAP/OLAM Layer2 MDDB Layer1 Data Repository Layer4 User Interface

mining query mining result

Meta Data Databases

slide-25
SLIDE 25

11/16/2017 25 Questions?

  • References
  • Gray, J., et al., “Data Cube: A Relational Aggregation Operator

Generalizing Group-By, Cross-Tab and Sub-Totals”, Data Mining and Knowledge Discovery, Vol. 1 (1997)

  • Lecture Slides are found on the Course Website,

www.ecs.baylor.edu/ faculty/ cho/ 4352