Data Warehouse Ali Kamandi Sharif University of Technology Spring - - PowerPoint PPT Presentation

data warehouse
SMART_READER_LITE
LIVE PREVIEW

Data Warehouse Ali Kamandi Sharif University of Technology Spring - - PowerPoint PPT Presentation

Data Warehouse Ali Kamandi Sharif University of Technology Spring 2007 kamandi@ce.sharif.edu Part 1: Data Warehouse Concepts 2 Data, Data everywhere yet ... I cant find the data I need data is scattered over the network many


slide-1
SLIDE 1

Data Warehouse

Ali Kamandi Sharif University of Technology Spring 2007 kamandi@ce.sharif.edu

slide-2
SLIDE 2

2

Part 1:

Data Warehouse Concepts

slide-3
SLIDE 3

3

Data, Data everywhere yet ...

  • I can’t find the data I need
  • data is scattered over the network
  • many versions
  • I can’t understand the data I found
  • I can’t use the data I found
slide-4
SLIDE 4

4

Data Management

The Difficulties of managing Data:

  • The amount of data increases exponentially with time
  • Data are scattered throughout organization
  • An ever-increasing amount of external data needs
  • Data security, quality, and integrity are critical
slide-5
SLIDE 5

5

Data Sources

Internal Data Sources Personal Data External Data Sources

slide-6
SLIDE 6

6

What is a Data Warehouse?

A single, complete and consistent store of data

  • btained from a variety of

different sources made available to end users in a what they can understand and use in a business context. [Barry Devlin]

slide-7
SLIDE 7

7

  • Why Data Warehousing?
slide-8
SLIDE 8

8

Definition of data warehousing

  • A data warehouse is a subject-oriented,

integrated, time-variant and non-volatile collection of data in support of management’s decision making process.

slide-9
SLIDE 9

9

Characteristics of a Data Warehouse

Subject-Oriented. Data are organized by subject and contain information relevant for decision support only .

  • Consistency. Data in different operational databases may be encoded

differently . In the data warehouse, though, they will be coded in a consistent manner. Time variant. The data are kept for many years so that they can be used for trends, forecasting, and comparisons over time. Non-volatile. Data are not updated once entered into the warehouse.

  • Multidimensional. Typically the data warehouse uses a

multidimensional structure .

slide-10
SLIDE 10

10

OLTP vs. OLAP

  • Mostly updates
  • Many small transactions
  • Mb-Tb of data
  • Raw data
  • Clerical users
  • Up-to-date data
  • Consistency,

recoverability critical

  • Mostly reads
  • Queries long, complex
  • Gb-Tb of data
  • Summarized, consolidated

data

  • Decision-makers, analysts

as users

OLTP OLAP

slide-11
SLIDE 11

11

Part 2:

Data Warehouse Design

slide-12
SLIDE 12

12

Building a Data Warehouse

slide-13
SLIDE 13

13

Relational and Multidimensional Database

Relational databases store data in two –

dimensional tables. Multidimensional databases typically store data in arrays, which consist of at least three business dimension.

slide-14
SLIDE 14

14

Relational data model

  • based on a single structure of data values in a two

dimensional table

CUSTOMER ORDER Cus_id Cus_name … 001 Robert … 002 Lyn … … … … Ord_no Ord_date Cus_id … 01 02 Dec 02 002 … 02 03 Dec 02 Lyn … … … … …

slide-15
SLIDE 15

15

A Sample Data Cube

Total annual sales

  • f TV in U.S.A.

Date Product Country

sum sum TV VCR PC 1Qtr 2Qtr 3Qtr 4Qtr U.S.A Canada Mexico sum

slide-16
SLIDE 16

16

Multidimensional Data Model

Composed of one fact table and a set of

dimension tables.

Dimensional table: each dimension table has a

simple table (non-composite) primary key that corresponds exactly to one of the components of the composite key in the fact table.

A multidimensional data model is typically

  • rganized around a central theme, like sales, for

instance.

slide-17
SLIDE 17

17

Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions & measures

  • Star schema: A fact table in the middle connected to a set
  • f dimension tables
  • Snowflake schema: A refinement of star schema where

some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake

  • Fact constellations: Multiple fact tables share dimension

tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

slide-18
SLIDE 18

18

Example of Star Schema

time_key day day_of_the_week month quarter year

time

location_key street city province_or_street country

location Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales Measures

item_key item_name brand type supplier_type

item

branch_key branch_name branch_type

branch

slide-19
SLIDE 19

19

Example of Snowflake Schema

time_key day day_of_the_week month quarter year

time

location_key street city_key

location Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales Measures

item_key item_name brand type supplier_key

item

branch_key branch_name branch_type

branch

supplier_key supplier_type

supplier

city_key city province_or_street country

city

slide-20
SLIDE 20

20

Example of Fact Constellation

time_key day day_of_the_week month quarter year

time

location_key street city province_or_street country

location Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales Measures

item_key item_name brand type supplier_type

item

branch_key branch_name branch_type

branch Shipping Fact Table time_key item_key shipper_key from_location to_location dollars_cost units_shipped

shipper_key shipper_name location_key shipper_type

shipper

slide-21
SLIDE 21

21

Typical OLAP Operations

  • Roll up (drill-up): summarize data
  • by climbing up hierarchy or by dimension reduction
  • Drill down (roll down): reverse of roll-up
  • from higher level summary to lower level summary or detailed data,
  • r introducing new dimensions
  • Slice and dice:
  • project and select
  • Pivot (rotate):
  • reorient the cube, visualization, 3D to series of 2D planes.
  • Other operations
slide-22
SLIDE 22

22

Operations

Rollup: summarize data

e.g., given sales data, summarize sales for last

year by product category and region

Drill down: get more details

e.g., given summarized sales as above, find

breakup of sales by city within each region, or within the Andhra region

slide-23
SLIDE 23

23

More Cube Operations

Slice and dice: select and project

e.g.: Sales of soft-drinks in Andhra over the last

quarter

Pivot: change the view of data

slide-24
SLIDE 24

24

Design a Warehouse?

  • Design data warehouse
  • Design data marts
  • Design representation (Star schema, …)
  • gathering data
  • Which data is needed?
  • Where does it come from?
  • cleansing, integrating, ...
  • querying, reporting, analysis
  • monitoring, administering warehouse
  • data mining
slide-25
SLIDE 25

25

Data Gathering

Periodic snapshots Database triggers Log shipping Data shipping (replication service) …

slide-26
SLIDE 26

26

Loading Data

Incremental vs. refresh Off-line vs. on-line Frequency of loading

At night, 1x a week/month, continuously

Parallel/Partitioned load

slide-27
SLIDE 27

27

Part 3:

Data Mining

slide-28
SLIDE 28

28

Data Mining Concepts

Data mining: The process of searching for valuable business information in a large database, data warehouse, or data mart.

slide-29
SLIDE 29

29

Text Mining

The application of data mining to non- structured or less-structured text files.

slide-30
SLIDE 30

30

Web Mining

The application of data mining techniques to discover actionable and meaningful patterns form web resources.

slide-31
SLIDE 31

31

Some basic operations

Regression Classification Clustering / similarity matching Association rules and variants

slide-32
SLIDE 32

32

Classification

Given old data about customers and

payments, predict new applicant’s loan eligibility.

Age Salary Profession Location Customer type Previous customers Classifier Decision rules

Salary > 5 L

  • Prof. = Exec

New applicant’s data

Good/ bad

slide-33
SLIDE 33

33

Classification methods

Goal: Predict class Ci = f(x1, x2, .. Xn)

Regression: (linear or any other polynomial)

a*x1 + b*x2 + c = Ci.

slide-34
SLIDE 34

34

Tree where internal nodes are simple

decision rules on one or more attributes and leaf nodes are predicted class labels.

Decision trees

Salary < 1 M Prof = teacher Good Age < 30 Bad Bad Good

slide-35
SLIDE 35

35

Neural network

Set of nodes connected by directed weighted

edges

Hidden nodes Output nodes x1 x2 x3 x1 x2 x3 w1 w2 w3

y n i i i

e y x w

=

+ = =

  • 1

1 ) ( ) (

1

σ σ

Basic NN unit A more typical NN

slide-36
SLIDE 36

36

What is association rule mining?

Hmmm, which items are frequently purchased together by my customers? milk cereal bread milk bread butter milk bread sugar eggs Customer 1 Market Analyst Customer 2 sugar eggs Customer n Customer 3 Shopping Baskets

milk brea d suga r butte r cerea l eggs Basket 1 1 1 1 Basket 2 1 1 1 1 Basket 3 1 1 1 Basket 4 1 1

slide-37
SLIDE 37

37

What is association rule mining? (cont.)

milk bread sugar butter cereal eggs Basket 1 1 1 1 Basket 2 1 1 1 1 Basket 3 1 1 1 Basket 4 1 1 count 3 3 2 1 1 2 Support (milk)=3 Support (bread)=3 Support (sugar)=2 …… Support (milk U bread)=3 Support (milk U sugar)=1 …… Support (milk U bread U sugar)=1 …… Support (milk U bread U sugar U butter U cereal U eggs)=0

Confidence (A B)=Support (A U B)/Support (A) As Confidence (milk bread) = = Support (milk U bread)/Support (milk) = 3/3 = 100%, Then milk bread If Confidence (A B) >= min_conf, Then A B

slide-38
SLIDE 38

38

How DM improve your business?

Strategy 1: Placing milk and bread within close proximity may further encourage the sale of these items together within single visits to the store.

slide-39
SLIDE 39

39

How DM improve your business?

Strategy 2: Placing milk and bread at opposite ends of the store may entice customers who purchase such items to pick up other items along the way.

slide-40
SLIDE 40

40

How DM improve your business?

Strategy 3:Put these two items into a package at reduced price.