Data Science in the Wild Lecture 12: Memory-Based Data Warehouses - - PowerPoint PPT Presentation

data science in the wild
SMART_READER_LITE
LIVE PREVIEW

Data Science in the Wild Lecture 12: Memory-Based Data Warehouses - - PowerPoint PPT Presentation

Data Science in the Wild Lecture 12: Memory-Based Data Warehouses Eran Toch Data Science in the Wild, Spring 2019 1 Data Engineering Extract Transform Load & Clean Sources Data Warehouse Data Science in the Wild, Spring 2019 2


slide-1
SLIDE 1

Data Science in the Wild, Spring 2019

Eran Toch

1

Lecture 12: Memory-Based Data Warehouses

Data Science in the Wild

slide-2
SLIDE 2

Data Science in the Wild, Spring 2019

Data Engineering

2

Sources Data Warehouse Extract Transform & Clean Load

slide-3
SLIDE 3

Data Science in the Wild, Spring 2019

Agenda

3

1.What are Data Warehouse? 2.Data warehouse architecture 3.The design process 4.Transaction design 5.Periodic snapshot 6.Accumulative transactions

slide-4
SLIDE 4

Data Science in the Wild, Spring 2019

<1> What are Data Warehouse?

4

slide-5
SLIDE 5

Data Science in the Wild, Spring 2019

Data Warehouse: A definition

  • A Data Warehouse is a central

repository of integrated data from

  • ne or more disparate sources
  • Data warehouses don’t aim to solve

a single problem

  • Instead, they provide the

infrastructure for an organizational data science process

5 Sales Inventory Suppliers Customers Stores HR Pharmacy Finance

slide-6
SLIDE 6

Data Science in the Wild, Spring 2019

Data Warehouse: Basic Architecture

  • It stores current and historical data in one single place
  • Data marts represent repositories for specific subjects (sales, orders,

website navigation)

6

Operational Systems User

ETL Reporting Machine Learning Operationaliza tion Data Warehouse

Data Marts Staging

slide-7
SLIDE 7

Data Science in the Wild, Spring 2019

Technologies for Data Warehouse

  • RDBMS - Relational Database

Management Systems

  • Hadoop
  • Hadoop / Spark
  • And many other variations

7

slide-8
SLIDE 8

Data Science in the Wild, Spring 2019

Spark-based Architecture

8

slide-9
SLIDE 9

Data Science in the Wild, Spring 2019

9

slide-10
SLIDE 10

Data Science in the Wild, Spring 2019

Comparison

10

Operational Databases Data Warehouses Process Oriented Subject Oriented Add, Modify, Remove single rows Bulk load, rarely modify, never remove Online human / sensors entry ETL jobs Queries for small sets of rows with all their details Scan large sets for aggregates Using trained models Training models

slide-11
SLIDE 11

Data Science in the Wild, Spring 2019

<2> Data Warehouse Architectures

11

slide-12
SLIDE 12

Data Science in the Wild, Spring 2019

Designing the Data Warehouse

  • Dimensional model
  • f a business

process:

  • The facts we

want to analyze

  • The dimensions

we analyze the facts

12

Sale Transaction Facts Date key (FK) Product key (FK) Store Key (FK) Promotion Key (FK) POS Transaction Number Sales Quantity Sales Dollar Amount Cost Dollar Amount Gross Profit Margin Date Dimension Date Key (PK) Date Attributes (TBD) Store Dimension Store Key (PK) Store Attributes (TBD) Product Dimension Product Key (PK) Product Attributes (TBD) Promotion Dimension Product Key (PK) Promotion Attributes (TBD)

slide-13
SLIDE 13

Data Science in the Wild, Spring 2019

Two Super Architectures

13

Star Schema Snowflake

https://www.guru99.com/star-snowflake-data-warehousing.html

slide-14
SLIDE 14

Data Science in the Wild, Spring 2019

Star Schema

  • Fact table contains a key and measure
  • Every dimension in a star schema is

represented with the only one-dimension table

  • The dimension table is joined to the fact table
  • Dimension tables are not joined to each other
  • The dimension tables are not normalized
  • Main advantages
  • Queries are simpler
  • Optimizes number of joins

14

slide-15
SLIDE 15

Data Science in the Wild, Spring 2019

Snowflake

  • The dimension tables

are normalized which splits data into additional tables

  • Main advantages:
  • Optimizes storage
  • Easier to understand
  • Easier to engineer the

dimensions (adding, removing etc)

15

slide-16
SLIDE 16

Data Science in the Wild, Spring 2019

<3> The Design Process

16

slide-17
SLIDE 17

Data Science in the Wild, Spring 2019

4 Step Design Process

  • 1. Identify the business process
  • 2. Identify the facts
  • 3. Declare the grain
  • 4. Choose the dimensions

17

slide-18
SLIDE 18

Data Science in the Wild, Spring 2019

Case Study: A Retail Sales Operation

  • ~2000 Stores
  • ~$75.17B yearly revenue
  • Typical 80K individual products (SKU’s)
  • In a store
  • In any given moment
  • ~10 departments
  • Food, medicine, cosmetics, nature, kids...

18

slide-19
SLIDE 19

Data Science in the Wild, Spring 2019

Step 2: Identify the Facts

  • Identifying the numeric facts for analysis
  • Facts are determined by answering the

question, "What are we measuring?"

  • All candidate facts in a design must be true to

the grain defined in step 2

  • Facts that clearly belong to a different grain

must be in a separate fact table

19

  • Examples
  • Sales quantity
  • Sales dollar amount
  • Cost dollar amount
  • Gross profit margin
slide-20
SLIDE 20

Data Science in the Wild, Spring 2019

Step 3: Find the Grain

  • Declaring the grain means specifying

exactly what an individual fact table row represents.

  • The grain conveys the level of detail

associated with the fact table measurements.

  • It provides the answer to the question,

"How do you describe a single row in the fact table?"

20

The small golden disk is a piece of pure gold weighing one troy grain.

slide-21
SLIDE 21

Data Science in the Wild, Spring 2019

Examples

  • Total sale for each customer
  • A record for an individual line item on a customer's

retail sales ticket as measured by a scanner device

  • A record for each item

21

slide-22
SLIDE 22

Data Science in the Wild, Spring 2019

Grain Design

  • What are the options?
  • For example, summary of sales per day per

store... (what’s the problem?)

  • The grain of data should support
  • The ability to drill down
  • The ability to support independent dimensions
  • Selected option:
  • Individual line item on a POS

22

POS Sale Transaction Facts POS Transaction Number Sales Quantity Sales Dollar Amount Price reduction

slide-23
SLIDE 23

Data Science in the Wild, Spring 2019

Good Grains

  • How should we judge a design?
  • Preferably you should develop dimensional models for the most atomic

information captured by a business process

  • Atomic data is the most detailed information collected; such data

cannot be subdivided further

23

slide-24
SLIDE 24

Data Science in the Wild, Spring 2019

Step 4: Choose Dimensions

  • Dimensions can be designed by

thinking about:

  • "How do business describe the

data that results from the business process?"

  • We want a robust set of dimensions

representing all possible descriptions that take on single values in the context of each measurement

24

slide-25
SLIDE 25

Data Science in the Wild, Spring 2019

Summary

  • 1. Identify the business process
  • 2. Identify the facts
  • 3. Declare the grain
  • 4. Choose the dimensions

25

slide-26
SLIDE 26

Data Science in the Wild, Spring 2019

Grain Design Pattern

  • Transactions
  • e.g., POS transaction, financial transaction, medical action
  • Periodic snapshot
  • Inventory state in a given day, closing stock price
  • Accumulative transactions
  • Order management, hospitalization process
  • Other types…

26

slide-27
SLIDE 27

Data Science in the Wild, Spring 2019

<4> Transaction design

27

slide-28
SLIDE 28

Data Science in the Wild, Spring 2019

Case Study Data

  • Point of Sale (POS):
  • Individual product purchase (as scanned at the POS)
  • Supply:
  • the purchase price of each product
  • Promotions
  • Temporary price reductions
  • Ads
  • Inserts
  • Coupons

28

slide-29
SLIDE 29

Data Science in the Wild, Spring 2019

Preliminary Star Schema

29

POS Sale Transaction Facts Date key (FK) Product key (FK) Store Key (FK) Promotion Key (FK) POS Transaction Number Sales Quantity Sales Dollar Amount Cost Dollar Amount Gross Profit Margin Date Dimension Date Key (PK) Date Attributes (TBD) Store Dimension Store Key (PK) Store Attributes (TBD) Product Dimension Product Key (PK) Product Attributes (TBD) Promotion Dimension Product Key (PK) Promotion Attributes (TBD)

slide-30
SLIDE 30

Data Science in the Wild, Spring 2019

Date Dimension

  • Almost always exists in DW systems.
  • Can be built and populated in advance
  • Why is this important?
  • Each row represents one day (or one time

unit)

  • 10 years = 3,650 rows

30

slide-31
SLIDE 31

Data Science in the Wild, Spring 2019

Date Dimension

31

POS Sale Transaction Facts Date key (FK) Product key (FK) Store Key (FK) Promotion Key (FK) POS Transaction Number Sales Quantity Sales Dollar Amount Cost Dollar Amount Gross Profit Margin Date Dimension Date Key (PK) Date Day of Week Day num in Epoch Week num in Epoch Month num in Epoch Day num in calendar month Day num in calendar year Calendar week num in year Day num in fiscal year ... Fiscal Week Fiscal Week number in year Fiscal Month Fiscal Quarter Fiscal Year Holiday Indicator Weekday Indicator Selling Season Major Event ... Store Dimension Product Dimension Promotion Dimension

slide-32
SLIDE 32

Data Science in the Wild, Spring 2019

Date Example

  • There are many date attributes not supported by the SparkSQL date function, including

fiscal periods, seasons, holidays, and weekends.

  • Rather than attempting to determine these non-standard calendar calculations in a query,

we should look them up in a date dimension table

32

slide-33
SLIDE 33

Data Science in the Wild, Spring 2019

Advantages of Date Dimension

  • Indexing integer-based key is faster than date key
  • Simplifies calculations
  • Simpler comparisons
  • Month vs. Month
  • Year vs. Year
  • Day of week, day of month
  • Special events

33

slide-34
SLIDE 34

Data Science in the Wild, Spring 2019

Product Dimension

  • 60K current products -> 150K distinct products (SKUs)
  • Contains hierarchy:
  • Department
  • Category
  • Brand
  • SKU

34

slide-35
SLIDE 35

Data Science in the Wild, Spring 2019

Product Dimension

35 17

POS Sale Transaction Facts Date key (FK) Product key (FK) Store Key (FK) Promotion Key (FK) POS Transaction Number Sales Quantity Sales Dollar Amount Cost Dollar Amount Gross Profit Margin Product Dimension Product Key (PK) Product Description SKU number Brand Category Department Package type Package size Requires prescription Weight Shelf life Storage type Color Shelf width Shelf height Shelf depth ... Store Dimension Date Dimension Promotion Dimension

slide-36
SLIDE 36

Data Science in the Wild, Spring 2019

Store Dimension

  • Business questions:
  • Store performance
  • Geographical questions
  • Floor plan (crucial for retail)
  • Size and diversity

36

slide-37
SLIDE 37

Data Science in the Wild, Spring 2019

Store Dimension

37

POS Sale Transaction Facts Date key (FK) Product key (FK) Store Key (FK) Promotion Key (FK) POS Transaction Number Sales Quantity Sales Dollar Amount Cost Dollar Amount Gross Profit Margin Store Dimension Store Key (PK) Store name Store street address City State Country Store district Store region Selling square footage Total square footage First open date Remodel date ... Product Dimension Date Dimension Promotion Dimension

slide-38
SLIDE 38

Data Science in the Wild, Spring 2019

Promotion Dimension

  • Business analysis:
  • Sales lift, compared with baseline
  • Cannibalization
  • Profit
  • Market growth
  • Not all promotions appear in

POS

  • An example of how manual data

enrichment can improve business analysis

38

slide-39
SLIDE 39

Data Science in the Wild, Spring 2019

Promotion Dimension

39

POS Sale Transaction Facts Date key (FK) Product key (FK) Store Key (FK) Promotion Key (FK) POS Transaction Number Sales Quantity Sales Dollar Amount Cost Dollar Amount Gross Profit Margin Promotion Dimension Promotion Key (PK) Price reduction type Promotor media type Ad type Coupon type Ad media name Promotion cost Begin date End date ... Store Dimension Date Dimension Promotion Dimension

slide-40
SLIDE 40

Data Science in the Wild, Spring 2019

<5> Periodic Snapshot

40

slide-41
SLIDE 41

Data Science in the Wild, Spring 2019

Case Study: Store Inventory

  • Our story:
  • Inventory of different products is managed in

different stores.

  • The business process we are interested in

analyzing is the retail store inventory.

  • What would be the grain and main dimensions
  • f the data warehouse?

41

slide-42
SLIDE 42

Data Science in the Wild, Spring 2019

Analyzing Inventory Levels

  • Optimized inventory levels in the stores can have a major impact on profitability
  • Minimizing out-of-stocks and reducing overall inventory carrying costs
  • The retailer needs the ability to analyze daily quantity-on-hand inventory levels by product and store
  • Which questions would a business ask to better manage inventory?
  • Inventory value
  • Inventory size
  • Inventory turnover
  • Process timing
  • Planning procurement
  • physical planning
  • ...

42

slide-43
SLIDE 43

Data Science in the Wild, Spring 2019

Periodic Snapshot

Grain:

  • We want to see daily inventory by product at each individual store,

which we assume is the atomic level of detail provided by the

  • perational inventory system.

43

slide-44
SLIDE 44

Data Science in the Wild, Spring 2019

Additivity

  • Quantity is additive?
  • Yes for store and product
  • No for date
  • Therefore it is semi-additive
  • Is there another function that logically summarize quantity over time?

44

slide-45
SLIDE 45

Data Science in the Wild, Spring 2019

What is Still Missing

  • The periodic snapshot is the most common inventory schema
  • However, it does not teach us everything we need to know about

inventory

  • What is missing?

45

slide-46
SLIDE 46

Data Science in the Wild, Spring 2019

Table Size

  • Unlike the transaction table, where there is a record only if a transaction
  • ccurred...
  • This table has a record for each product/store/date, even if the

inventory did not change (or if the quantity is 0)

46

slide-47
SLIDE 47

Data Science in the Wild, Spring 2019

<6> Accumulative transactions

47

slide-48
SLIDE 48

Data Science in the Wild, Spring 2019

Example: Order Management

48

Quoting Ordering Manufacturing Shipping Paying

slide-49
SLIDE 49

Data Science in the Wild, Spring 2019

Types of Business Processes

  • We have seen single stage processes:
  • Purchasing
  • Procurement
  • But what about multi-stage

processes, like:

  • Order management
  • Patient hospitalization
  • Sales process
  • Manufacturing process

49

Ordering

slide-50
SLIDE 50

Data Science in the Wild, Spring 2019

Business Process Bus Matrix

The business process bus matrix shows which business processes require what types of data. The data warehouse bus matrix closely corresponds to the organization's value chain, refocussing the relations between data and business processes.

50

slide-51
SLIDE 51

Data Science in the Wild, Spring 2019

Design Solution

  • Accumulative transactions table: provides multi-stages time tracking
  • Each record represents a single line item on the shipment invoice
  • The Fact table has multiple date foreign keys, each assigned to a

different process

51

slide-52
SLIDE 52

Data Science in the Wild, Spring 2019

Example

52

Order Facts Order number (PK) Order date key (FK) Backlog date key (FK) Assembly date key (FK) QA date key (FK) Requested shipment date key (FK) Scheduled shipment date key (FK) Actual date key (FK) Arrival date key (FK) Invoice date key (FK) Product key (FK) Customer key (FK) Sales rep key (FK) Shipper key (FK) Manufacturing facility key (FK) Warehouse key (FK) Order quantity Order dollar amount Assembly quantity QA quantity Shipment quantity ... Sales rep Dimension Date Dimension Customer Dimension Manufacturing facility Dimension Warehouse Dimension Shipper Dimension Product Dimension

slide-53
SLIDE 53

Data Science in the Wild, Spring 2019

Summary: How Patterns Compare?

53

Characteristic Transaction Periodic Snapshot Accumulative transactions Time period represented? Point in time Regular, set intervals Changing time spans Grain One row per transaction One row per period One row per lifecycle Fact table loads Insert Insert Insert and update Date dimension Transaction date End-of-period date Multiple dates for milestones Facts Transaction activity Performance for predefined time interval Performance over lifetime

slide-54
SLIDE 54

Data Science in the Wild, Spring 2019

Summary

  • Thinking beyond a single table
  • Questions of design
  • Transaction Snapshot
  • Periodic Snapshot
  • Accumulative transactions

54