Introduction to Introduction to Datawarehousing Umberto Nanni - - PowerPoint PPT Presentation

introduction to introduction to datawarehousing
SMART_READER_LITE
LIVE PREVIEW

Introduction to Introduction to Datawarehousing Umberto Nanni - - PowerPoint PPT Presentation

D IPARTIMENTO DI I NGEGNERIA INFORMATICA AUTOMATICA E GESTIONALE A NTONIO R UBERTI Master of Science in Engineering in Computer Science (MSE-CS) (MSE-CS) Seminars in Software and Services for the Information Society Umberto Nanni Introduction


slide-1
SLIDE 1

Master of Science in Engineering in Computer Science (MSE-CS)

DIPARTIMENTO DI INGEGNERIA INFORMATICA AUTOMATICA E GESTIONALE ANTONIO RUBERTI

(MSE-CS)

Seminars in Software and Services for the Information Society

Umberto Nanni

Introduction to

1 Seminars of Software and Services for the Information Society Umberto Nanni

Introduction to Datawarehousing

slide-2
SLIDE 2

Business Intelligence Architecture

management

management system

goals results management system

reporting OLAP data mining

Datawarehouse

Datamart-1 Datamart-2 Datamart-3

KPI DSS MKT HR CRM

system

2 Seminars of Software and Services for the Information Society Umberto Nanni

  • perational

system

ETL systems

external data sources service systems ERP systems internet / extranet

  • perational system
slide-3
SLIDE 3

What is Data Warehousing

Collection of methods, technologies and tools to assist the “knowledge worker” (manager, analyst) to conduct data analysis aimed at supporting decision-making and/or improving

3 Seminars of Software and Services for the Information Society Umberto Nanni

the management of information assets

slide-4
SLIDE 4

What is a Data Warehouse

A data warehouse is a collection of data

  • integrated (far beyond the organization)
  • consistent (despite the heterogeneous origin)
  • focused (an interest area is defined)

4 Seminars of Software and Services for the Information Society Umberto Nanni

  • historical (over a consistent timeframe)
  • permanent (never delete your data!)
slide-5
SLIDE 5

Purpose of a Data Warehouse

A Data Warehouse helps (allows) you:

  • to take decisions
  • to identify and interpret phenomena
  • to make predictions about the future

5 Seminars of Software and Services for the Information Society Umberto Nanni

  • to make predictions about the future
  • to control a complex system
slide-6
SLIDE 6

Value and quantity of information

value

BD competitors marketing prices sales logistics

value

strategic information reports selected

$

$$$

competitors prices

6 Seminars of Software and Services for the Information Society Umberto Nanni

quantity

primary information sources selected information

$$$

slide-7
SLIDE 7

OLTP & OLAP

OLTP - On-Line Transaction Processing

– realm of (write and / or read) transactions, recovery, consistency consistency – many, fast and frequent operations – high level of concurrency – access to a small amount of data – on-the-fly data update

OLAP - On-Line Analytical Processing

read only

7 Seminars of Software and Services for the Information Society Umberto Nanni

– read only – few operations – low level of concurrency – access to huge amounts of data – historical but essentially static data

slide-8
SLIDE 8

Separation between: Operational Database & Data Warehouse

  • different computational load
  • different needs:
  • different needs:

– DB: dynamic data, asynchronous updates – DW: static data, periodic updates

  • integration with business activity:

– DB: supporting operations (focused, timely) – DW: supporting decisions (descriptive, historical)

8 Seminars of Software and Services for the Information Society Umberto Nanni

– DW: supporting decisions (descriptive, historical)

  • data collection:

– DB: minimal – DW: maximal

slide-9
SLIDE 9

Two issues with different perspectives

  • Data redundancy

– OLTP (DB): to avoid, bringing to inconsistency – OLTP (DB): to avoid, bringing to inconsistency and/or inefficiency on updates – OLAP (DW): redundancy avoids recomputation and shorten response time

  • Indexing

– OLTP (DB): good when you search – bad when you

9 Seminars of Software and Services for the Information Society Umberto Nanni

– OLTP (DB): good when you search – bad when you update... you need some trade-off – OLAP (DW): the more, the best

slide-10
SLIDE 10

Some Data Warehouse Systems

  • Oracle 12
  • IBM InfoSphere Warehouse
  • IBM InfoSphere Warehouse
  • Microsoft SQL-Server 2012 – Analysis Services
  • Sybase IQ
  • Hyperion (bought by Oracle)
  • Teradata (division of NCR)

10 Seminars of Software and Services for the Information Society Umberto Nanni

  • Teradata (division of NCR)
  • Netezza – Cognos (bought by IBM)
  • Business Objects (bought by SAP)
slide-11
SLIDE 11

A comparison by Gartner

11 Seminars of Software and Services for the Information Society Umberto Nanni

Donald Feinberg, Mark A. Beyer Magic Quadrant for Data Warehouse Database Management Systems Gartner RAS Core Research Note G00173535, 28 January 2010

slide-12
SLIDE 12

Architectures for Datawarehousing: issues

  • separating OLTP & OLAP
  • scalability
  • extensibility
  • security

12 Seminars of Software and Services for the Information Society Umberto Nanni

  • security
  • administrability
slide-13
SLIDE 13

Architecture for Datawarehousing

  • determined by design choices
  • determined by / determines the choice of a
  • determined by / determines the choice of a

software system

  • determines the cost and makes possible

future integration (quantitative and / or qualitative)

13 Seminars of Software and Services for the Information Society Umberto Nanni

qualitative)

  • affects the cost of data processing
slide-14
SLIDE 14

Data Mart

Collection of data focused on particular user profile or

  • n particular target analysis
  • n particular target analysis

Alternatives:

  • 1. dependent Data Mart: it is a subset and/or an aggregation of

data in the primary DW → DM extracted from a DW

  • 2. independent Data Mart: it is a subset and/or an aggregation

14 Seminars of Software and Services for the Information Society Umberto Nanni

  • 2. independent Data Mart: it is a subset and/or an aggregation
  • f data in the operational DB

→ DW=Ui(DMi), that is, DW is a set of DM

  • 3. hybrid solution, combining 1, 2
slide-15
SLIDE 15

DW architecture: 1 Level

  • there is only an operational DW
  • virtual DB (no OLTP-OLAP separation)
  • data coincident with DB operational
  • difficult integration with other sources

data - level 1

middleware

(copy of)

  • perational

15 Seminars of Software and Services for the Information Society Umberto Nanni

sources warehouse analysis

  • perational

DB external sources

slide-16
SLIDE 16

DW architecture: 2 Levels – dependent DMs

  • data sources complemented with external sources
  • running on dedicated software platform
  • ETL: Extraction, Transformation, Loading
  • ETL: Extraction, Transformation, Loading
  • materialization of the DW
  • materialization of Data Marts
  • per

BD

Data Mart ETL

data - level 1 data - level 2

16 Seminars of Software and Services for the Information Society Umberto Nanni

BD ext BD

sources warehouse analysis feeding

DW

Mart

Data Mart

slide-17
SLIDE 17

DW architecture: 2 Levels – independent DMs

  • Data Mart are materialized by feeding
  • DW = union of DMs
  • DW = union of DMs
  • per

BD Data Mart

ETL

data - level 1 data - level 2

17 Seminars of Software and Services for the Information Society Umberto Nanni

sources warehouse analysis feeding

BD ext BD Mart Data Mart

slide-18
SLIDE 18

DW architecture: 3 Levels

  • a level of "reconciled" data (operational data store) is

introduced

  • separation into two phases of ETL activities:
  • separation into two phases of ETL activities:

1. extraction / transformation 2. loading

  • per

BD Data Mart

ET(L)

data - level 1 data - level 2 data - level 3

18 Seminars of Software and Services for the Information Society Umberto Nanni

BD ext BD

DW

Mart Data Mart

reconcilied data loading

sources warehouse analysis feeding

slide-19
SLIDE 19

ETL: Extraction, Transformation, Loading

extraction Operational Data, External Data

  • extraction
  • cleaning - validation - filtering
  • transformation

Reconciled Data

19 Seminars of Software and Services for the Information Society Umberto Nanni

  • loading

Data Warehouse

slide-20
SLIDE 20

Extraction

  • initial extraction:

– targeted at the creation of the DW – targeted at the creation of the DW

  • furter extractions:

– static (replaces the whole DW) – incremental

20 Seminars of Software and Services for the Information Society Umberto Nanni

– incremental

  • log
  • timestamp
slide-21
SLIDE 21

Cleaning

  • changing VALUES
  • duplicates
  • duplicates
  • inconsistencies

– domain violation – functional dependency violation

  • null values
  • misuse of fields

21 Seminars of Software and Services for the Information Society Umberto Nanni

  • misuse of fields
  • spelling
  • abbreviations (not homogeneous)
slide-22
SLIDE 22

Transformation

  • changing FORMATS:
  • misalignment of formats
  • field overloading
  • unhomogeneous coding

22 Seminars of Software and Services for the Information Society Umberto Nanni

  • unhomogeneous coding
slide-23
SLIDE 23

Loading

  • Refresh:

ex-novo load of the whole DW ex-novo load of the whole DW

  • Update:

differential updates

23 Seminars of Software and Services for the Information Society Umberto Nanni

slide-24
SLIDE 24

Metadata

  • internal metadata

– concerning the administration of the DW (i.e., sources, transformations, schemas, users, etc..) schemas, users, etc..)

  • external metadata

– interesting for users (e.g., measurement units, possible combinations)

  • STANDARDs
  • CWM - Common Warehouse Model (OMG), defined by:

– UML (Unified Modeling Language)

24 Seminars of Software and Services for the Information Society Umberto Nanni

– UML (Unified Modeling Language) – XML (eXtensible Markup Language) – XMI (XML Metadata Interchange)

OMG = Object Management Group: CORBA (Common Object Request Broker Architecture), UML (Unified Modeling Language), MDA (Model-Driven Architecture)