An Overview of Data An Overview of Data and facts. Data is usually - PDF document

3/2/2009 Data Warehouse Motivation Data Warehouse Motivation • Businesses have a lot of data, operational data An Overview of Data An Overview of Data and facts. • Data is usually in different databases and in Warehousing and OLAP Warehousing and OLAP different physical places. T T echnology echnology • Decision makers need to access information (data that has been summarized) virtually on the Presentation by Debojit single site. Discussion by Ali • Access needs to be fast regardless of the size of data, and how data’s age. 2 What is data warehouse What is data warehouse What is decision support What is decision support  Data warehouse is a collection of decision • Decision support systems are a class of support technologies, aimed at enabling the computerized information systems that support analysts to make better and faster decisions. It decision making activities. consists of subject-oriented, integrated, time- variant, and non-volatile collection of data. • Decision support systems usually require ◦ It contains data from different sources. consolidating data form many heterogeneous ◦ It retains a long history. sources: these might include external sources. ◦ Changes as new data is added to the repository. -Such as stock market feeds. 3 4 Why do we separate DW from DB ? Why do we separate DW from DB ? Difference between OLAP and OLTP Difference between OLAP and OLTP OLTP OLAP  Performance reasons: Users Clerk, IT professional Knowledge worker ◦ OLAP requires special data organization that Function Day to day operations Decision support supports multidimensional views. DB Design Application-oriented Subject-oriented ◦ OLAP queries would degrade operational DB. Data Current, up-to-date Historical, summarized, ◦ OLAP is read only. detailed. multidimensional,… Usage repetitive Ad-hoc ◦ No concurrency control and recovery. Access Read/write Lots of scans Unit of work Short, simple transaction Complex query  Decision support requires historical data. # rec accessed tens Millions  Decision support requires consolidated data. # users thousands Hundreds DB size 100 MB-GB 100 GB-TB Metric Transaction throughput Query throughput 6 5 1

3/2/2009 T T ypical OLAP architecture ypical OLAP architecture Utilities Utilities  Data Cleaning ◦ Data Migration: simple transformation rules (replace "gender" with "sex") ◦ Data Scrubbing: use domain-specific knowledge (e.g. zip codes) to modify data. ◦ Data Auditing: discover rules and relationships (or signal violations thereof).  Load ◦ Full load: like one big xact – change from old data to new is atomic. ◦ Incremental loading ("refresh") makes sense for big warehouses, but transaction model is more complex. 7 Database Design Methodology Database Design Methodology Star Schema Example Star Schema Example  Most data warehouses use a star schema to represent the multi-dimensional model.  Each dimension is represented by a dimension- table that describes it.  A fact-table connects to all dimension-tables with a multiple join. Each tuple in fact-table consists of a pointer to each of the dimension-tables.  Links between the fact-table in the centre and the dimension-tables form a shape like a star. (Star Schema) 9 10 Database Design Methodology (contd.) Database Design Methodology (contd.) Discussion Discussion  Each dimension is represented by one table.  Do you think that star schemas are more useful in data warehouses than in RDBMSs? Why or why not?  Un-normalized (introduces redundancy) Ex: (Vancouver, BC, Canada, North America) (Victoria, BC, Canada, North America) Normalize dimension tables  Snowflake Schema 11 2

3/2/2009 Metadata Example Snowflake Schema Metadata Example Snowflake Schema Important considerations for DW Important considerations for DW servers servers  Indexing  Materialized Views  Transformation of complex queries  Parallel processing  ROLAP/MOLAP servers  SQL extensions 13 Metadata requirements Metadata requirements Materialized Views Materialized Views  Administrative metadata  Materializing summary data can help to  Source database and their contents accelerate several queries.  Back-end and front-end tools  Some of the key challenges :  Definitions of the warehouse schema ◦ Identify the views to materialize  Pre-defined queries and reports ◦ Exploit the materialized views to answer queries  Data mark locations and contents ◦ Efficiently update the materialized views during load  Data refresh and purging policies and refresh.  User profiles and user access control policies 16 Metadata requirements Metadata requirements Discussion Discussion  Business metadata  Business terms and definitions  Ownership of data  We can use materialized views both in relational data bases and in data warehouses. Using materialized  Charging policies views in which one is more crucial? Using materialized views in which one is easier? Why?  Operational metadata  Data lineage: history of migrated data and sequence of transformations applied  Currency of data: active, archived, purged  Monitoring information: warehouse usage statistics, error reports, audit trails 17 3

3/2/2009 Outline Outline  Data analysis  Visualization and dimension reduction Data Cube: A Relational Aggregation Operator Data Cube: A Relational Aggregation Operator  The relational representation of N-dimensional Generalizing Group Generalizing Group-By, Cross By, Cross-Tab, and Sub Tab, and Sub-Totals Totals data  What is CUBE J. Gray, al, Microsoft Research J. Gray, al, Microsoft Research  Summary F. Pellow, al, IBM Research F. Pellow, al, IBM Research The sales example The sales example Data analysis applications Data analysis applications  Looking for anomalies or unusual patterns.  Extract statistical information Model Year Color Number sold  Four steps to aggregate data across many dimensions Chevy 1994 Black 50 Chevy 1994 White 40 Ford 1994 Black 50 Ford 1994 White 10 Chevy 1995 Black 85 Chevy 1995 White 115 Ford 1995 Black 85 Ford 1995 White 75 We have ignored a few columns here such as the date of purchase and the dealer  Represent the dataset as an N-dimensional space Problems with SQL Problems with SQL “Dimensionality Reduction” “Dimensionality Reduction” Analyze car sales The three most common problems faced  Focus on the role of model, year and color of the cars by the SQL GROUP BY are:  Ignore differences between sales along dimensions of 1. Histograms date of sale or car dealership 2. Roll-up and drill-down  As a result, extensive constructs are used, such as cross- tabulation, subtotals, roll-up and drill-down 3. Cross Tabulations 4

3/2/2009 Histograms Histograms One Dimensional Aggregation One Dimensional Aggregation  Standard SQL does not allow aggregation over computed categories. Example: Car sales for year 1994 and 1995 showed in table_1: Table_1 :  For example, if we had to sort the car sales by Model Model Sales Sales type and then perform aggregation functions on Chevy Chevy 290 290 it, standard SQL would not support it. Ford Ford 220 220 SELECT day, nation, MAX(Temp) If we need to know the sales for model, we can easily query it by: FROM Weather GROUP BY Day(Time) AS day, SELECT sales FROM table_1 Nation(Latitude, Longitude) AS nation; GROUP BY model Three Dimensional Aggregation Three Dimensional Aggregation Discussion Discussion If we need more dimensional generalization of these operators  How useful is multi-dimensional aggregation? Table_2:  Besides the data warehousing applications mentioned Model Model Year Year Color Color Sales Sales in the paper, can you think of any other application for multi dimensional aggregation and data cubes? Chevy Chevy 1994 1994 black black 50 50 Chevy Chevy 1995 1995 black black 85 85 Chevy Chevy 1994 1994 white white 40 40 Chevy Chevy 1995 1995 white white 115 115 Roll up/Drill down( Roll up/Drill down(contd. contd.) Roll up/Drill down Roll up/Drill down For Table_3a: If we need to query the sales by model, by year, and by color, then how we can do it?  Concepts: going up the levels is called rolling-up the data. Typically, we can make a report as showed by Going down is called drilling-down into the data Table_3a:  In this table, sales are rolled up by using totals and subtotals.  Data is aggregated by Model, then by Year, then by Color.  The report shows data aggregated at three levels, that is, at Model level, Year level, and Color level.  Data aggregated at each distinct level produces a sub-total. 5

An Overview of Data An Overview of Data and facts. Data is usually - PDF document

3/2/2009 Data Warehouse Motivation Data Warehouse Motivation Businesses have a lot of data, operational data An Overview of Data An Overview of Data and facts. Data is usually in different databases and in Warehousing and OLAP

DataCamp Data Types for Data Science DataCamp Data Types for Data Science Data types Data type

01 | KPF Overview 01 | KPF Overview 01 | KPF Overview 01 | KPF Overview 01 | KPF Overview 01 |

Environmental Health Science Data Streams Data Streams Health Data Health Data Brian S.

Diagnose data for cleaning Cleaning Data in Python Cleaning data Prepare data for analysis

CS378 Introduction to Data Mining Data Exploration and Data Preprocessing Li Xiong Data

Data Preparation Discretization Data cleaning (Data pre-processing) Data

Business Statistics CONTENTS The role of data The data matrix Data types Aspects of data

Data Preparation Data cleaning Discretization (Data preprocessing) Data

EOTSS: Data Sharing and Services July 18 , 2019 Agenda Data Sharing Framework Overview

DataCamp Data Types for Data Science DataCamp Data Types for Data Science Data Set Overview

Data Presentation and Collection Week 2 Prepared by: Nurazrin Jupri Types of data Data

Big Data Analytics Armistead Boyd SVP, Product & Data Partnerships October 25, 2016 What is

61A Lecture 30 Announcements Data Processing Data Processing 4 Data Processing Many data sets

Data Preparation Data cleaning Data integration and transformation (Data

DATA MINING LECTURE 2 What is data? The data mining pipeline What is Data Mining? Data

Introduction What is data mining? to Data Mining: On what kind of data? Data Mining

CDBG Disaster Recovery Eligible Activities U.S. Department of Housing and Urban Development

Investing Via The Internet: A Guided Tour of AAII's Top-Rated Investment Web Sites John

Neural Networks 0. Logistics Spring 2019 1 Neural Networks are taking over! Neural networks

Adversarial Attacks on Probabilistic Autoregressive Forecasting Models 2 (i) Probabilistic

CS520 Data Integration, Warehousing, and Provenance 6. Data Warehousing IIT DBGroup Boris

Data Warehousing and OLAP Decision Support Systems Decision-support systems are used to make

Introduction to data stream querying and mining Georges HEBRAIL Workshop Franco-Brasileiro sobre

2. Independence and Bernoulli Trials (Euler, Ramanujan and Bernoulli Numbers) Independence :

An Overview of Data An Overview of Data and facts. Data is usually - PDF document

3/2/2009 Data Warehouse Motivation Data Warehouse Motivation Businesses have a lot of data, operational data An Overview of Data An Overview of Data and facts. Data is usually in different databases and in Warehousing and OLAP

DataCamp Data Types for Data Science DataCamp Data Types for Data Science Data types Data type

01 | KPF Overview 01 | KPF Overview 01 | KPF Overview 01 | KPF Overview 01 | KPF Overview 01 |

Environmental Health Science Data Streams Data Streams Health Data Health Data Brian S.

Diagnose data for cleaning Cleaning Data in Python Cleaning data Prepare data for analysis

CS378 Introduction to Data Mining Data Exploration and Data Preprocessing Li Xiong Data

Data Preparation Discretization Data cleaning (Data pre-processing) Data

Business Statistics CONTENTS The role of data The data matrix Data types Aspects of data

Data Preparation Data cleaning Discretization (Data preprocessing) Data

EOTSS: Data Sharing and Services July 18 , 2019 Agenda Data Sharing Framework Overview

DataCamp Data Types for Data Science DataCamp Data Types for Data Science Data Set Overview

Data Presentation and Collection Week 2 Prepared by: Nurazrin Jupri Types of data Data

Big Data Analytics Armistead Boyd SVP, Product &amp; Data Partnerships October 25, 2016 What is

61A Lecture 30 Announcements Data Processing Data Processing 4 Data Processing Many data sets

Data Preparation Data cleaning Data integration and transformation (Data

DATA MINING LECTURE 2 What is data? The data mining pipeline What is Data Mining? Data

Introduction What is data mining? to Data Mining: On what kind of data? Data Mining

CDBG Disaster Recovery Eligible Activities U.S. Department of Housing and Urban Development

Investing Via The Internet: A Guided Tour of AAII's Top-Rated Investment Web Sites John

Neural Networks 0. Logistics Spring 2019 1 Neural Networks are taking over! Neural networks

Adversarial Attacks on Probabilistic Autoregressive Forecasting Models 2 (i) Probabilistic

CS520 Data Integration, Warehousing, and Provenance 6. Data Warehousing IIT DBGroup Boris

Data Warehousing and OLAP Decision Support Systems Decision-support systems are used to make

Introduction to data stream querying and mining Georges HEBRAIL Workshop Franco-Brasileiro sobre

2. Independence and Bernoulli Trials (Euler, Ramanujan and Bernoulli Numbers) Independence :

Big Data Analytics Armistead Boyd SVP, Product & Data Partnerships October 25, 2016 What is