Developing Data Warehouses with Quality in Mind Yannis Vassiliou - - PDF document

developing data warehouses with quality in mind
SMART_READER_LITE
LIVE PREVIEW

Developing Data Warehouses with Quality in Mind Yannis Vassiliou - - PDF document

Developing Data Warehouses with Quality in Mind Yannis Vassiliou National Technical University of Athens Workshop on Data Quality December 1, 2000 1 OUTLINE Introduction Motivation The Data Warehouse Metadata Framework Developed


slide-1
SLIDE 1

1

December 1, 2000 1

Developing Data Warehouses with Quality in Mind

Yannis Vassiliou National Technical University of Athens Workshop on Data Quality

December 1, 2000 Yannis Vassiliou Slide 2

OUTLINE

Introduction – Motivation The Data Warehouse Metadata Framework

Developed

Architecture, Processes, Quality Models

Employing the Framework Conclusions

slide-2
SLIDE 2

2

December 1, 2000 Yannis Vassiliou Slide 3

National Technical University of Athens (NTUA) Informatik V & Lehr- und Forschungsgebiet Theoretische Informatik (RWTH-Aachen) Institute National de Recherche en Informatique et en Automatique (INRIA) Deutsche Forschungszentrum fόr kόnstliche Intelligenz (DFKI) University of Rome «La Sapienza» (Uniroma) Istituto per la Ricerca Scientifica e Tecnologica (IRST) University of Manchester (UMan)

Foundations of Data Warehouse Quality- DWQ Project

December 1, 2000 Yannis Vassiliou Slide 4

Introduction – Motivation

Contribute to the systematic understanding and

usage of the interplay between QUALITY FACTORS and DESIGN / EVOLUTION OPTIONS in Data Warehousing (Objective)

Develop comprehensive DW Foundations

(Framework), Prototype and Evaluate them (Achievement)

Enriched Meta data management facilities with

embedded analysis and optimization techniques (Key Methodology)

slide-3
SLIDE 3

3

December 1, 2000 Yannis Vassiliou Slide 5

Standard DW Architecture

Data Warehouse

Text File External data DB Data Mart Data Mart

OLAP DSS GIS

Mediator Sources Clients Administration Agent Wrappers / Loaders / /

Standard DW Architecture

Data Warehouse

Text File External data DB Data Mart Data Mart

OLAP DSS GIS

Mediator Sources Clients Administration Agent Wrappers / Loaders / /

Standard DW Architecture

Data Warehouse

Text File External data DB Data Mart Data Mart

OLAP DSS GIS

Mediator Sources Clients Administration Agent Wrappers / Loaders / Meta DataBase Repository

Examples:

Microsoft Repository Metadata Interchange Specification (MDIS) control and manage metadata for OLAP databases.

December 1, 2000 Yannis Vassiliou Slide 6

Standard DW Architecture

Meta DataBase Repository Data Warehouse

Text File External data DB Data Mart Data Mart

OLAP DSS GIS

Mediator Sources Clients Administration Agent Wrappers / Loaders /

PRACTICAL QUESTIONS not Handled in the Traditional Architecture:

  • - How come the information

from the DW is not the same to the one coming from sources?

  • - What is the effort required

to get in the DW information not currently available?

  • - If I want 100 % correct data

in my DW, how do I design it? how often do I refresh it?

  • - ….
slide-4
SLIDE 4

4

December 1, 2000 Yannis Vassiliou Slide 7

DWQ DW Architecture

DWQ Repository Concept Base Data Warehouse

Text File External data DB Data Mart Data Mart

OLAP DSS GIS

Mediator query

  • ptimiser

subsumption reasoner aggregation reasoner freshness agent Sources Clients Administration Agent Wrappers/ Loaders quality manager

December 1, 2000 Yannis Vassiliou Slide 8

A Small Motivating Example

MINISTRY of HEALTH (Greece) Data Warehouse:

Sources = COBOL files for all the medical centers

in Greece (~2400)

Transformation and Cleaning Tasks

Quality requirements (Goals)

«Achieve 100% completeness and consistency of data»

slide-5
SLIDE 5

5

December 1, 2000 Yannis Vassiliou Slide 9

Metadata Framework

Introduction – Motivation The Data Warehouse Metadata Framework

Developed

Architecture, Processes, Quality Models

Employing the Framework Conclusions

December 1, 2000 Yannis Vassiliou Slide 10

Viewpoints of a DW

slide-6
SLIDE 6

6

December 1, 2000 Yannis Vassiliou Slide 11

Architecture Model: Step 1 Enterprise Version

  • f the Traditional DW

Information Source Data Warehouse Wrapper/ Loader Multidimensional Data Mart Aggregation/ Customization

?

Observation

OLTP OLAP

Analyst Operational Department Enterprise

Source Quality DW Quality Mart Quality December 1, 2000 Yannis Vassiliou Slide 12

Architecture Model: Step 2 Enterprise Version (Meta level) Extending the Traditional DW

Source Data Store DW Data Store Wrapper Client Data Store Aggregation/ Customization

?

Observation

OLTP OLAP

Client Model Operational Department Model Enterprise Model Source Schema DW Schema Transportation Agent Transportation Agent Client Schema Conceptual Perspective Logical Perspective Physical Perspective

slide-7
SLIDE 7

7

December 1, 2000 Yannis Vassiliou Slide 13

Architecture Model - Instantiation

Client Level DW Level Source Level Conceptual Perspective Logical Perspective Physical Perspective

Meta Model Level Models/ Meta Data Level

in Real World in in

December 1, 2000 Yannis Vassiliou Slide 14

Architecture Model: Step 3 Structure of the Meta Model as implemented in ConceptBase / Telos

ConceptualObject LogicalObject PhysicalObject Model Concept hasConcept Source Model Enterprise Model Client Model Schema Type

hasType

Source Schema DW Schema Client Schema DW_ Component deliversTo Agent DataStore Source DataStore Control Agent Transport Agent Client DataStore DW DataStore isViewOn hasStructure in in in isa isa isa isa isa isa isa

isSubsumedBy relatesTo

isa isa isa Measurable Object

slide-8
SLIDE 8

8

December 1, 2000 Yannis Vassiliou Slide 15

Process Meta Model: Step 1 Capturing the Dynamic Aspects

  • f the Architecture Model (static)

Client Level DW Level Source Level Conceptual Perspective Logical Perspective Physical Perspective

Meta Model Level Models/ Meta Data Level

in

Real World

in in

Process Model Process Meta Model

uses

Process

Processes

December 1, 2000 Yannis Vassiliou Slide 16

DW Process Meta Model

Workflow Reference Model (made less

abstract to fit in the DW case, e.g.: capture schedules, relationships with data)

Strategic Dependency Model (conceptual) Processes: Cleaning, transformation,

transfer, computation

ROLE – ACTIVITY – AGENT

slide-9
SLIDE 9

9

December 1, 2000 Yannis Vassiliou Slide 17

Process Meta Model: Step 3 DW Operational Process Meta Model

PROCESS ELEMENT TRANSITION ELEMENT ACTIVITY

NEXT IN SCHEDULE INPUT OUTPUT RELATES TO

DATA PACKAGE isa TYPE DATA STORE

EXECUTED BY

AGENT (application) Physical Perspective (how)

STORED

Logical Perspective (what) Conceptual Perspective (why) CONCEPT

MAPPED

COMPOSITE ACTIVITY

COMPOSED

isa

INPUT/ OUTPUT MAPPED

ROLE IS RELATED responsibility type active passive STAKEHOLDER

MAPPED

DW_OBJECT DW_USER

FOR PERSON OPERATES

isa

December 1, 2000 Yannis Vassiliou Slide 18

Quality Model

Quality in a Data Warehouse

Quality of Data Quality of Processes Quality of Service

At all perspectives

Establishment of Quality aspects

(dimensions)

Scientific vs. Pragmatic (user defined)

slide-10
SLIDE 10

10

December 1, 2000 Yannis Vassiliou Slide 19

Quality Model

Concepts:

Measurable Object (e.g. logical schema of source) Quality Goal (e.g.,improve availability of source A) Quality Query (decide whether a quality goal is achieved) Quality Dimension (e.g., “availability”, “correct”) Quality Factor (measurement) Stakeholders (decision makers, designers, administrators,

programmers)

December 1, 2000 Yannis Vassiliou Slide 20

Quality Dimensions Example: Data Usage

currency volatility responsiveness accessibility interpretability System availability Transactional availability security timeliness Data usage quality usefulness

slide-11
SLIDE 11

11

December 1, 2000 Yannis Vassiliou Slide 21

Quality Factors by Perspective

Conceptual Perspective

  • Completeness
  • Redundancy
  • Consistency
  • Correctness
  • Trace ability
  • f Concepts and

Models

Logical Perspective

  • Usefulness of

schemas

  • Correctness of

mappings

  • Interpretability of

schemas

Physical Perspective

  • Efficiency
  • Interpretability of

schemas

  • Timeliness of stored

data

  • Maintainability/

Usability of software components

  • Questions and metrics for each quality factor ?
  • Predictive models of quality impacts and trade-offs ?
  • Can the results be mapped back into data warehouse practice ?

December 1, 2000 Yannis Vassiliou Slide 22

Quality Factors - Metrics

Factor Methods of measurement Metrics Schema quality Correctness final inspection of data warehouse schema for each entity and its corresponding ones in the sources number of errors in the mapping of the entities Complete- ness final inspection of data warehouse schema for useful entities in the sources, not represented in the data warehouse schema number of useful entities, not present in the data warehouse Minimality final inspection of data warehouse schema for undesired redundant information number of undesired entities in the data warehouse trace ability final inspection of data warehouse schema for inability to cover user requirements number of requirements not covered

slide-12
SLIDE 12

12

December 1, 2000 Yannis Vassiliou Slide 23

Quality Factors – Metrics (Data Usage)

Data Usage Quality Logical Perspective Physical Perspective

Schema Type Agent Data Store Accessibility Is the schema definition accessible by the users? Is the type visible and accessible for users? Is the network sufficient for delivered data? Is the data store accessible? Availability Frequency of updates Frequency of updates Response time Uptime of data store, response time Security Level of security (access rights) Level of security (access rights) Are there physical access restrictions? Is the store able to prevent unauthorized access? Usefulness Is the schema used by any users? Is the type used by any users? Is the data delivered by the agent really used in the destination store? Is the data in this store queried by a user? Interpreta- bility Is the schema understandable? Is the type understandable? Is the data delivered understandable? Is the data stored understandable?

December 1, 2000 Yannis Vassiliou Slide 24

Quality Meta Model Management An Adapted GQM Approach

DW Designers Decision Maker DW Administrator

Quality Goal

Quality Query

DW Objects, Processes and Data

Metadata for DW Architecture, Quality and Processes

establish Measurement Processes evaluated by evidence for defined on

Quality Factor [Jarke et al., IS99]

slide-13
SLIDE 13

13

December 1, 2000 Yannis Vassiliou Slide 25

The DWQ Quality Meta Model in ConceptBase

Quality Factor

Quality Dimension

imposedOn forPerson description evaluates expected achieved

Measureable Object

  • nObject

dimension isSubDimOf prefers Measuring Agent dependsOn direction

PowerSet

MetricUnit

  • f

hasResult dimension concreteBy when unit

Stakeholder String Purpose QualityDomain ! QualityGoal ? QualityQuery

Timestamp

December 1, 2000 Yannis Vassiliou Slide 26

Metadata Management

Introduction – Motivation The Data Warehouse Metadata Framework

Developed

Architecture, Processes, Quality Models

Employing the Framework Conclusions

slide-14
SLIDE 14

14

December 1, 2000 Yannis Vassiliou Slide 27

Employing the Framework -Mapping the Architecture and Models to a Traditional DW

DWQ Repository Concept Base Data Warehouse

Text File External data DB Data Mart Data Mart

OLAP DSS GIS

Mediator query

  • ptimiser

subsumption reasoner aggregation reasoner freshness agent Sources Clients Administration Agent Wrappers/ Loaders quality manager

December 1, 2000 Yannis Vassiliou Slide 28

Methodological Approach for Quality Management

Quality Goal

  • 1. Design
  • 2. Evaluation
  • 3. Analysis

& Improvement Define Quality Factor Types Define Object Types Define Object Instances & Properties Define Metrics & Agents Compute! Acquire values for quality factors (current status) Feed values to quality scenario and play! Discover/Refine new/old "functions" Take actions! Decompose complex objects and iterate Empirically derive "functions" Analytically derive "functions" Produce a scenario for a goal Produce expected/ acceptable values Negotiate!

  • 4. Re-evaluation

& evolution

slide-15
SLIDE 15

15

December 1, 2000 Yannis Vassiliou Slide 29

Employing the Framework Running Example

Ministry of Health Example Successful detection of reasons for the inconsistencies

between DW data and source (legacy system) data

COBOL file DW Buffer table Normalized table Materialized view COBOL file Buffer table Normalized table Client tools December 1, 2000 Yannis Vassiliou Slide 30

Performing the processes

Loading Cleaning Computation Aggregation Buffer Class_info V1 CBL next next INS INS INS in in in in

  • ut
  • ut
  • ut

(H_ID, DATE, CLASS_A, CLASS_B, CLASS_C) (H_ID, DATE, CLASS_A, CLASS_B, CLASS_C) (H_ID, EDATE, CLASS_ID, #BEDS) (H_ID,EDATE, SUM(#BEDS))

SELECT * FROM BUFFER B1 WHERE EXISTS (SELECT B2.H_ID, B2.DATE FROM BUFFER B2 WHERE B1.H_ID = B2.H_ID AND B1.DATE = B2.DATE GROUP BY H_ID,DATE HAVING COUNT(*)> 1)

DEL

  • ut

next COMMIT COMMIT COMMIT

slide-16
SLIDE 16

16

December 1, 2000 Yannis Vassiliou Slide 31

DW Process Quality

Role Quality Goal ‘why?’ Conceptual Achieve 100% consistency of the information to be given to the minister! Activity Quality Query ‘what?’ Logical Is the propagation activity performing properly? Agent Quality Factor ‘how?’ Physical Correctness of software processes (performed with white box testing)

December 1, 2000 Yannis Vassiliou Slide 32

Analysis of Quality Factors

Quality Dimens ion DW objects Primary Quality Factors Derived Quality Factors Design Choices Consist ency

  • COBOL Source file
  • Buffer table
  • Normalized table
  • Materialized View
  • Loading process
  • Cleaning process
  • Computation

process

  • Consistency of a data

store

  • Completeness of a data

store

  • Correctness of an

application

  • Consistency of a

data store

  • Completeness of

a data store

  • Data flow
  • Chosen source

files

slide-17
SLIDE 17

17

December 1, 2000 Yannis Vassiliou Slide 33

Quality Factors

Loading Cleaning Computation Aggregation Buffer Class_info V1 CBL next next next INS INS INS in in in in

  • ut
  • ut
  • ut

(H_ID, DATE, CLASS_A, CLASS_B, CLASS_C) (H_ID, DATE, CLASS_A, CLASS_B, CLASS_C) (H_ID, EDATE, CLASS_ID, #BEDS) (H_ID,EDATE, SUM(#BEDS)) DEL

  • ut

COMMIT COMMIT COMMIT Correctness Correctness Correctness Correctness Completeness Consistency Completeness Consistency Completeness Consistency Completeness Consistency

December 1, 2000 Yannis Vassiliou Slide 34

Conclusions

Summarizing the Approach Formal Results Tools Developed Several Applications by all partners

(TEAM-4, DB / GMI, Telecom Italia, etc.)

slide-18
SLIDE 18

18

December 1, 2000 Yannis Vassiliou Slide 35

Summarizing the DWQ Approach Architecture, Processes, Quality

METAMODELS MODELS

PROCESS METAMODEL QUALITY METAMODEL

A R C H I T E C T U R E M E T A M O D E L PROCESS MODELS QUALITY MODEL DATA MODELS

Conceptual Logical Physical Client Enterprise Source Client Enterprise Source Conceptual Logical Physical

December 1, 2000 Yannis Vassiliou Slide 36

Key Formal Results

  • n Quality Impacts

conceptual: description logic theory and tools for complete

reasoning about the relationships between source, enterprise, and client models (Rome, Manchester, Aachen)

conceptual/logical: containment, satisfiability, and rewriting

  • f queries over views with & without aggregates (DFKI,

Rome)

logical/physical: incremental cost-based optimization of view

materializations (Athens)

physical: detailed impact analysis of replication and

refreshment policies (Aachen, INRIA)

slide-19
SLIDE 19

19

December 1, 2000 Yannis Vassiliou Slide 37

Tools Developed

R1R2R3

Enterprise Model Materialized Views C1 C2 Cm

Conj. Queries R1R2R3

S1

R1R2R3

S2

R1R2R3

S3

R1R2R3

Sn

Conj. Queries Conj. Queries

User queries OLTP updates

  • 3. Conceptual

Client Modeling

  • 1. Conceptual

Enterprise Model

  • 2. Conceptual

Source Models Rewriting of Aggregate Queries Refreshment

  • 6. Data

Reconciliation

  • 4. Translate aggregates

into OLAP operations

  • 5. Design

Optimization

Metadata Repository