Data Warehouse
Ali Kamandi Sharif University of Technology Spring 2007 kamandi@ce.sharif.edu
Data Warehouse Ali Kamandi Sharif University of Technology Spring - - PowerPoint PPT Presentation
Data Warehouse Ali Kamandi Sharif University of Technology Spring 2007 kamandi@ce.sharif.edu Part 1: Data Warehouse Concepts 2 Data, Data everywhere yet ... I cant find the data I need data is scattered over the network many
Ali Kamandi Sharif University of Technology Spring 2007 kamandi@ce.sharif.edu
2
3
4
The Difficulties of managing Data:
5
Internal Data Sources Personal Data External Data Sources
6
A single, complete and consistent store of data
different sources made available to end users in a what they can understand and use in a business context. [Barry Devlin]
7
8
integrated, time-variant and non-volatile collection of data in support of management’s decision making process.
9
Subject-Oriented. Data are organized by subject and contain information relevant for decision support only .
differently . In the data warehouse, though, they will be coded in a consistent manner. Time variant. The data are kept for many years so that they can be used for trends, forecasting, and comparisons over time. Non-volatile. Data are not updated once entered into the warehouse.
multidimensional structure .
10
recoverability critical
data
as users
11
12
13
Relational databases store data in two –
14
dimensional table
CUSTOMER ORDER Cus_id Cus_name … 001 Robert … 002 Lyn … … … … Ord_no Ord_date Cus_id … 01 02 Dec 02 002 … 02 03 Dec 02 Lyn … … … … …
15
Total annual sales
Date Product Country
sum sum TV VCR PC 1Qtr 2Qtr 3Qtr 4Qtr U.S.A Canada Mexico sum
16
Composed of one fact table and a set of
dimension tables.
Dimensional table: each dimension table has a
simple table (non-composite) primary key that corresponds exactly to one of the components of the composite key in the fact table.
A multidimensional data model is typically
instance.
17
Modeling data warehouses: dimensions & measures
some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake
tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
18
time_key day day_of_the_week month quarter year
time
location_key street city province_or_street country
location Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales Measures
item_key item_name brand type supplier_type
item
branch_key branch_name branch_type
branch
19
time_key day day_of_the_week month quarter year
time
location_key street city_key
location Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales Measures
item_key item_name brand type supplier_key
item
branch_key branch_name branch_type
branch
supplier_key supplier_type
supplier
city_key city province_or_street country
city
20
time_key day day_of_the_week month quarter year
time
location_key street city province_or_street country
location Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales Measures
item_key item_name brand type supplier_type
item
branch_key branch_name branch_type
branch Shipping Fact Table time_key item_key shipper_key from_location to_location dollars_cost units_shipped
shipper_key shipper_name location_key shipper_type
shipper
21
22
Rollup: summarize data
e.g., given sales data, summarize sales for last
year by product category and region
Drill down: get more details
e.g., given summarized sales as above, find
breakup of sales by city within each region, or within the Andhra region
23
Slice and dice: select and project
e.g.: Sales of soft-drinks in Andhra over the last
quarter
Pivot: change the view of data
24
25
Periodic snapshots Database triggers Log shipping Data shipping (replication service) …
26
Incremental vs. refresh Off-line vs. on-line Frequency of loading
At night, 1x a week/month, continuously
Parallel/Partitioned load
27
28
29
30
The application of data mining techniques to discover actionable and meaningful patterns form web resources.
31
Regression Classification Clustering / similarity matching Association rules and variants
32
Given old data about customers and
Age Salary Profession Location Customer type Previous customers Classifier Decision rules
Salary > 5 L
New applicant’s data
Good/ bad
33
Regression: (linear or any other polynomial)
a*x1 + b*x2 + c = Ci.
34
Tree where internal nodes are simple
Salary < 1 M Prof = teacher Good Age < 30 Bad Bad Good
35
Set of nodes connected by directed weighted
Hidden nodes Output nodes x1 x2 x3 x1 x2 x3 w1 w2 w3
y n i i i
e y x w
=
+ = =
1 ) ( ) (
1
σ σ
Basic NN unit A more typical NN
36
Hmmm, which items are frequently purchased together by my customers? milk cereal bread milk bread butter milk bread sugar eggs Customer 1 Market Analyst Customer 2 sugar eggs Customer n Customer 3 Shopping Baskets
milk brea d suga r butte r cerea l eggs Basket 1 1 1 1 Basket 2 1 1 1 1 Basket 3 1 1 1 Basket 4 1 1
37
milk bread sugar butter cereal eggs Basket 1 1 1 1 Basket 2 1 1 1 1 Basket 3 1 1 1 Basket 4 1 1 count 3 3 2 1 1 2 Support (milk)=3 Support (bread)=3 Support (sugar)=2 …… Support (milk U bread)=3 Support (milk U sugar)=1 …… Support (milk U bread U sugar)=1 …… Support (milk U bread U sugar U butter U cereal U eggs)=0
Confidence (A B)=Support (A U B)/Support (A) As Confidence (milk bread) = = Support (milk U bread)/Support (milk) = 3/3 = 100%, Then milk bread If Confidence (A B) >= min_conf, Then A B
38
39
Strategy 2: Placing milk and bread at opposite ends of the store may entice customers who purchase such items to pick up other items along the way.
40
Strategy 3:Put these two items into a package at reduced price.