4/14/20 Outline 0) Course Info CS520 1) Introduction Data - PDF document

4/14/20 Outline 0) Course Info CS520 1) Introduction Data Integration, Warehousing, and 2) Data Preparation and Cleaning Provenance 3) Schema matching and mapping 4) Virtual Data Integration 6. Data Warehousing 5) Data Exchange 6) Data Warehousing IIT DBGroup 7) Big Data Analytics Boris Glavic http://www.cs.iit.edu/~glavic/ 8) Data Provenance http://www.cs.iit.edu/~cs520/ http://www.cs.iit.edu/~dbgroup/ 1 CS520 - 6) Data Warehousing 0 1 6. What is Datawarehousing? 6. What is Datawarehousing? • Problem: Data Analysis, Prediction, Mining • Solution: – Example: Walmart • Performance – Transactional databases – Store data in a different system (the datawarehouse) for analysis • Run many “cheap” updates concurrently • E.g., each store has a database storing its stock and sales – Bulk-load data to avoid wasting performance on – Complex Analysis over Transactional Databases? concurrency control during analysis • Want to analyze across several transactional databases • Heterogeneity and Distribution – E.g., compute total Walmart sales per month – Preprocess data coming from transactional – Distribution and heterogeneity databases to clean it and translate it into a unified • Want to run complex analysis over large datasets format before bulk-loading – Resource consumption of queries affects normal operations on transactional databases 2 3 CS520 - 6) Data Warehousing CS520 - 6) Data Warehousing 2 3 6. Datawarehousing Process 6. Overview • The multidimensional datamodel (cube) • 1) Design a schema for the warehouse – Multidimensional data model • 2) Create a process for preprocessing the data – Relational implementations • 3) Repeat • Preprocessing and loading (ETL) • Query language extensions – A) Preprocess data from the transactional databases – ROLL UP, CUBE, … – B) Bulk-load it into the warehouse • Query processing in datawarehouses – C) Run analytics Data Warehouse – Bitmap indexes E TL pipeline outputs – Query answering with views – Self-tuning ETL ETL ETL ETL ETL R D B M S 1 R D B M S 2 4 H TM L 5 1 X M L 1 CS520 - 6) Data Warehousing CS520 - 6) Data Warehousing 4 5 1

4/14/20 6. Multidimensional Datamodel 6. Multidimensional Datamodel • Analysis queries are typically aggregating • Commonality among these queries: lower level facts about a business – At the core are facts : a sale in a Walmart store, a toy stored in a warehouse, a call made by a certain – The revenue of Walmart in each state (country, phone city) – Data is aggregated across one or more dimensions – The amount of toy products in a warehouse of a company per week • These dimensions are typically organized hierarchically: year – month – day – hour, country – state - zip – The call volume per zip code for the Sprint network • Example – … – The revenue (sum of sale amounts) of Walmart in each state 6 7 CS520 - 6) Data Warehousing CS520 - 6) Data Warehousing 6 7 6. Generalization to multiple 6. Example 2D dimensions • Given a fixed number of dimensions 2014 2015 – E.g., product type, location, time 1. Quarter 2. Quarter 3. Quarter 4. Quarter 1. Quarter 2. Qu… Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May • Given some measure car 3 7 6 37 7 92 37 7 92 37 7 92 37 7 92 2 ... – E.g., number of sales, items in stock, … Toy puppet 9 4 5 31 1 1 1 1 1 1 1 1 1 2 2 2 … Fishing rod 11 12 22 22 22 22 22 22 7 6 6 6 6 65 4 33 … • In the multidimensional datamodel we store Moby Dick 3 40 39 37 7 92 81 6 51 7 48 51 5 7 3 3 … facts: the values of measures for a combination Mobile 3 2 5 43 7 0 81 6 51 7 48 51 5 7 3 3 … Books devel. of values for the dimensions King Lear 3 9 6 37 7 92 5 6 51 7 48 51 5 7 3 3 … 8 9 CS520 - 6) Data Warehousing CS520 - 6) Data Warehousing 8 9 6. Data cubes 6. Dimensions • Purpose • Given n dimensions – Selection of descriptive data – E.g., product type, location, time – Grouping with desired level of granularity • Given m measures • A dimension is define through a containment- – E.g., number of sales, items in stock, … hierarchy • A datacube (datahypercube) is an n - • Hierarchies typically have several levels dimensional datastructure that maps values in • The root level represents the whole dimensions the dimensions to values for the m measures • We may associate additional descriptive – Schema : D 1 , …, D n , M 1 , …, M m information with a elements in the hierarchy – Instance : a function (e.g., number of residents in a city) dom(D 1 ) x … x dom(D n ) -> dom(M 1 ) x ... x dom(M m ) 10 11 CS520 - 6) Data Warehousing CS520 - 6) Data Warehousing 10 11 2

4/14/20 6. Dimension Example 6. Dimension Schema • Location • Schema of a Dimension – Levels: location, state, city – A set D of category attributes D 1 , …, D n , Top D • These correspond to the levels Schema Instance – A partial order → over D which represents parent- location Locations child relationships in the hierarchy • These correspond to upward edges in the hierarchy state Illinois Wisconsin • Top D is larger than anything else – For every D i : D i → Top D city Chicago Schaumburg Madison Whitewater • There exists D min which is smaller than anything else – For every D i : D min → D i 12 13 CS520 - 6) Data Warehousing CS520 - 6) Data Warehousing 12 13 6. Dimension Schema Example 6. Remarks • Schema of Location Dimension • In principle there does not have to exist an order among the elements at one level of the – Set of categories D = {location, state, city} hierarchy – Partial order – E.g., cities { city → state, city → location, state → location } • Hierarchies do not have to be linear – Top D = location – D min = city Schema year Schema Instance quarter location Locations month week state Illinois Wisconsin day city Chicago Schaumburg Madison Whitewater 14 15 CS520 - 6) Data Warehousing CS520 - 6) Data Warehousing 14 15 6. Cells, Facts, and Measures Facts • Each cell in the cube corresponds to a combination of • Targets of analytics elements from each dimension – E.g., revenue, #sales, #stock – Facts are non-empty cells • A fact is uniquely defined by the combination – Cells store measures of values from the dimensions • Cube for a combination of levels of the dimension – E.g., for dimensions time and and location Time May Apr Fact : Revenue in Illinois during Jan 2015 Mar Feb Jan Items in stock in Jan at Product • Granularity: Levels in the dimension Chicago that belong to Book category Tool 3 4 hierarchy corresponding to the fact Tool 4 9 Electronic 5 1 – E.g., city, month Audio year location Gardening quarter Location state New York m onth w eek Madison Seattle Aspen Chicago city 16 17 day CS520 - 6) Data Warehousing CS520 - 6) Data Warehousing 16 17 3

4/14/20 Facts (Event vs. Snapshot) Measures • Event Facts • A measure describes a fact – Model real-world events – May be derived from other measures – E.g., Sale of an item • Two components • Snapshot Facts – Numerical value – Temporal state – Formula (optional): how to derive it – A single object (e.g., a book) may contribute to • E.g., avg(revenue) = sum(revenue) / count(revenue) several facts • We may associate multiple measures to each – E.g., number of items in stock cell – E.g., number of sales and total revenue 18 19 CS520 - 6) Data Warehousing CS520 - 6) Data Warehousing 18 19 Measures - Granularity Design Process (after Kimball) • Similar to facts, measures also have a granularity • Comparison to classical relational modeling • How to change granularity of a measure? – Analysis driven • No need to model all existing data and relationships relevant • Need algorithm to combine measures to a domain – Additive measures • Limit modeling to information that is relevant for predicted analytics • Can be aggregated along any dimension – Redundancy – Semi-additive / non-additive • Tolerate redundancy for performance if reasonable • Cannot be aggregated along some/all dimensions – E.g., in dimension tables to reduce number of joins • E.g., snapshot facts along time dimension – Number of items in stock at Jan + Feb + … != items in stock during year – Median of a measure 20 21 CS520 - 6) Data Warehousing CS520 - 6) Data Warehousing 20 21 Design Process – Steps Design Process Example • 1) Select relevant business processes • Coffee shop chain – E.g., order shipping, sales, support, stock – Processes management • Sell coffee to customers • Buy ingredients from suppliers • 2) Select granuarity • Ship supplies to branches – E.g., track stock at level of branches or regions • Pay employees • 3) Design dimensions • HR (hire, advertise positions, …) – Which process is relevant to be analysed to increase – E.g., time, location, product, … profits? • 4) Select measures – E.g., revenue, cost, #sales, items in stock, #support requests 22 23 CS520 - 6) Data Warehousing CS520 - 6) Data Warehousing 22 23 4

4/14/20 Outline 0) Course Info CS520 1) Introduction Data - PDF document

4/14/20 Outline 0) Course Info CS520 1) Introduction Data Integration, Warehousing, and 2) Data Preparation and Cleaning Provenance 3) Schema matching and mapping 4) Virtual Data Integration 6. Data Warehousing 5) Data Exchange 6) Data

MTD-BO 4: ETL Overview Including LGADs, System Testing, I&C Artur Apresyan HL-LHC CMS

address the challenges posed by Big Data Syed Muhammad Fawad Ali Agenda 1. Introduction 2.

Co u rse ratings IN TR OD U C TION TO DATA E N G IN E E R IN G Vincent Vankr u nkels v en Data

Exploiting Structure of Uncertainty for Efficient Matroid Semi-Bandits Pierre Perrault (INRIA

Bakry meets Villani Fabrice Baudoin Purdue University Purdue Probability Seminar Dominique

RAPIDS, FOSDEM19 Dr. Christoph Angerer, Manager AI Developer Technologies, NVIDIA HPC & AI

Data Orchestration with Apache Airflow Data driven empower the organization to seek more

Chernoff approximation of diffusions and further applications Yana A. Butko Analysis and

sqrrl sqrrl Secure. Scale. Adapt Secure. Scale. Adapt. Adam

e-MOTICON e-MO MObility Transnational strategy for an Interoperable CO COmmunity and

Erasmus+ Mobility: Connecting Adult Education in Ireland to its Rich Tapestry of Values,

Mathematical Models of Artificial Genetic Representations with Neutrality c 2 and Nino Ba Carlos

Social-media Storytelling Linking Hao Wu Seamus Lawless Gareth Jones Francois Pitie The ADAPT

Unstructured Texts Jinhua Du ADAPT Centre, Dublin City University, Ireland The ADAPT Centre is

Pay-fo Pa for-Us Use in th the EGI Marketplace EG Sy Holsinger, EGI Foundati tion Se

P2P-NEXT EUROPEAN UNION FRAMEWORK 7 PROJECT WWW.P2P-NEXT.ORG Johnathan Ishmael

Fostering Systems Research in Europe A White Paper by EuroSys - the European Professional

EARNINGS RESULTS | 1st Quarter 2015 May 1, 2015 FORWARD-LOOKING STATEMENTS This presentation

1 TWO QUESTIONS LUKE 9:51-62 Jesus journey to Jerusalem was underway. The Boys do a bit of

Beyond uber how platforms rule the world Matthias Walter T-Systems MMS The platform innovation

Program Verification (Rosen, Sections 5.5) TOPICS Program Correctness Preconditions &

WhatisGameTheory?I* Verygeneralmathema)calframeworktostudysitua)ons*

CSS 161 Fundamentals of Compu3ng Arrays November 28, 2012

Parallel Programming and Heterogeneous Computing Shared-Nothing Systems: Actors and Channels Max

4/14/20 Outline 0) Course Info CS520 1) Introduction Data - PDF document

4/14/20 Outline 0) Course Info CS520 1) Introduction Data Integration, Warehousing, and 2) Data Preparation and Cleaning Provenance 3) Schema matching and mapping 4) Virtual Data Integration 6. Data Warehousing 5) Data Exchange 6) Data

MTD-BO 4: ETL Overview Including LGADs, System Testing, I&amp;C Artur Apresyan HL-LHC CMS

address the challenges posed by Big Data Syed Muhammad Fawad Ali Agenda 1. Introduction 2.

Co u rse ratings IN TR OD U C TION TO DATA E N G IN E E R IN G Vincent Vankr u nkels v en Data

Exploiting Structure of Uncertainty for Efficient Matroid Semi-Bandits Pierre Perrault (INRIA

Bakry meets Villani Fabrice Baudoin Purdue University Purdue Probability Seminar Dominique

RAPIDS, FOSDEM19 Dr. Christoph Angerer, Manager AI Developer Technologies, NVIDIA HPC &amp; AI

Data Orchestration with Apache Airflow Data driven empower the organization to seek more

Chernoff approximation of diffusions and further applications Yana A. Butko Analysis and

sqrrl sqrrl Secure. Scale. Adapt Secure. Scale. Adapt. Adam

e-MOTICON e-MO MObility Transnational strategy for an Interoperable CO COmmunity and

Erasmus+ Mobility: Connecting Adult Education in Ireland to its Rich Tapestry of Values,

Mathematical Models of Artificial Genetic Representations with Neutrality c 2 and Nino Ba Carlos

Social-media Storytelling Linking Hao Wu Seamus Lawless Gareth Jones Francois Pitie The ADAPT

Unstructured Texts Jinhua Du ADAPT Centre, Dublin City University, Ireland The ADAPT Centre is

Pay-fo Pa for-Us Use in th the EGI Marketplace EG Sy Holsinger, EGI Foundati tion Se

P2P-NEXT EUROPEAN UNION FRAMEWORK 7 PROJECT WWW.P2P-NEXT.ORG Johnathan Ishmael

Fostering Systems Research in Europe A White Paper by EuroSys - the European Professional

EARNINGS RESULTS | 1st Quarter 2015 May 1, 2015 FORWARD-LOOKING STATEMENTS This presentation

1 TWO QUESTIONS LUKE 9:51-62 Jesus journey to Jerusalem was underway. The Boys do a bit of

Beyond uber how platforms rule the world Matthias Walter T-Systems MMS The platform innovation

Program Verification (Rosen, Sections 5.5) TOPICS Program Correctness Preconditions &amp;

What*is*Game*Theory?*I* Very*general*mathema)cal*framework*to*study*situa)ons*

CSS 161 Fundamentals of Compu3ng Arrays November 28, 2012

Parallel Programming and Heterogeneous Computing Shared-Nothing Systems: Actors and Channels Max

MTD-BO 4: ETL Overview Including LGADs, System Testing, I&C Artur Apresyan HL-LHC CMS

RAPIDS, FOSDEM19 Dr. Christoph Angerer, Manager AI Developer Technologies, NVIDIA HPC & AI

Program Verification (Rosen, Sections 5.5) TOPICS Program Correctness Preconditions &

WhatisGameTheory?I* Verygeneralmathema)calframeworktostudysitua)ons*