Distributed OLAP Databases Lecture # 24 Database Systems Andy - PowerPoint PPT Presentation

Distributed OLAP Databases Lecture # 24 Database Systems Andy Pavlo AP AP Computer Science 15-445/15-645 Carnegie Mellon Univ. Fall 2018

2 UPCO M IN G DATABASE EVEN TS Swarm64 Tech Talk → Thursday November 29 th @ 12pm → GHC 8102 ← Different Location! VoltDB Research Talk → Monday December 3 rd @ 4:30pm → GHC 8102 CMU 15-445/645 (Fall 2018)

8 O LTP VS. O LAP On-line Transaction Processing (OLTP): → Short-lived read/write txns. → Small footprint. → Repetitive operations. On-line Analytical Processing (OLAP): → Long-running, read-only queries. → Complex joins. → Exploratory queries. CMU 15-445/645 (Fall 2018)

9 BIFURCATED EN VIRO N M EN T Extract Transform Load OLTP Databases OLAP Database CMU 15-445/645 (Fall 2018)

10 DECISIO N SUPPO RT SYSTEM S Applications that serve the management, operations, and planning levels of an organization to help people make decisions about future issues and problems by analyzing historical data. Star Schema vs. Snowflake Schema CMU 15-445/645 (Fall 2018)

11 STAR SCH EM A PRODUCT_DIM CUSTOMER_DIM ID CATEGORY_NAME FIRST_NAME CATEGORY_DESC SALES_FACT LAST_NAME PRODUCT_CODE EMAIL PRODUCT_NAME PRODUCT_FK ZIP_CODE PRODUCT_DESC TIME_FK LOCATION_FK CUSTOMER_FK LOCATION_DIM TIME_DIM PRICE YEAR COUNTRY QUANTITY DAY_OF_YEAR STATE_CODE MONTH_NUM STATE_NAME MONTH_NAME ZIP_CODE DAY_OF_MONTH CITY CMU 15-445/645 (Fall 2018)

12 CAT_LOOKUP SN OWFLAKE SCH EM A CATEGORY_ID CATEGORY_NAME CATEGORY_DESC CUSTOMER_DIM PRODUCT_DIM ID SALES_FACT FIRST_NAME CATEGORY_FK LAST_NAME PRODUCT_CODE PRODUCT_FK EMAIL PRODUCT_NAME ZIP_CODE PRODUCT_DESC TIME_FK LOCATION_FK LOCATION_DIM TIME_DIM CUSTOMER_FK YEAR COUNTRY DAY_OF_YEAR STATE_FK PRICE MONTH_FK ZIP_CODE DAY_OF_MONTH CITY QUANTITY STATE_LOOKUP MONTH_LOOKUP STATE_ID MONTH_NUM STATE_CODE MONTH_NAME STATE_NAME MONTH_SEASON CMU 15-445/645 (Fall 2018)

13 STAR VS. SN OWFLAKE SCH EM A Issue #1: Normalization → Snowflake schemas take up less storage space. → Denormalized data models may incur integrity and consistency violations. Issue #2: Query Complexity → Snowflake schemas require more joins to get the data needed for a query. → Queries on star schemas will (usually) be faster. CMU 15-445/645 (Fall 2018)

14 PRO BLEM SETUP Partitions SELECT * FROM R JOIN S ON R.id = S.id P1 P2 Application Server P3 P4 CMU 15-445/645 (Fall 2018)

14 PRO BLEM SETUP Partitions SELECT * FROM R JOIN S ON R.id = S.id P1 P2 P2 P4 P3 Application Server P3 P4 CMU 15-445/645 (Fall 2018)

15 TO DAY'S AGEN DA Execution Models Query Planning Distributed Join Algorithms Cloud Systems CMU 15-445/645 (Fall 2018)

16 PUSH VS. PULL Approach #1: Push Query to Data → Send the query (or a portion of it) to the node that contains the data. → Perform as much filtering and processing as possible where data resides before transmitting over network. Approach #2: Pull Data to Query → Bring the data to the node that is executing a query that needs it for processing. CMU 15-445/645 (Fall 2018)

17 PUSH Q UERY TO DATA SELECT * FROM R JOIN S Node ON R.id = S.id P1→ ID:1-100 R ⨝ S IDs [101,200] Result: R ⨝ S Application Server Node P2→ ID:101-200 CMU 15-445/645 (Fall 2018)

18 PULL DATA TO Q UERY P1→ ID:1-100 SELECT * FROM R JOIN S Node Storage Page ABC ON R.id = S.id R ⨝ S IDs [101,200] Page XYZ Application Server Node P2→ ID:101-200 CMU 15-445/645 (Fall 2018)

18 PULL DATA TO Q UERY P1→ ID:1-100 SELECT * FROM R JOIN S Node Storage ON R.id = S.id R ⨝ S IDs [101,200] Result: R ⨝ S Application Server Node P2→ ID:101-200 CMU 15-445/645 (Fall 2018)

21 FAULT TO LERAN CE Traditional distributed OLAP DBMSs were designed to assume that nodes will not fail during query execution. → If the DBMS fails during query execution, then the whole query fails. The DBMS could take a snapshot of the intermediate results for a query during execution to allow it to recover after a crash. CMU 15-445/645 (Fall 2018)

22 Q UERY PLAN N IN G All the optimizations that we talked about before are still applicable in a distributed environment. → Predicate Pushdown → Early Projections → Optimal Join Orderings But now the DBMS must also consider the location of data at each partition when optimizing CMU 15-445/645 (Fall 2018)

23 Q UERY PLAN FRAGM EN TS Approach #1: Physical Operators → Generate a single query plan and then break it up into partition-specific fragments. → Most systems implement this approach. Approach #2: SQL → Rewrite original query into partition-specific queries. → Allows for local optimization at each node. → MemSQL is the only system that I know that does this. CMU 15-445/645 (Fall 2018)

25 Q UERY PLAN FRAGM EN TS SELECT * FROM R JOIN S ON R.id = S.id SELECT * FROM R JOIN S SELECT * FROM R JOIN S SELECT * FROM R JOIN S ON R.id = S.id ON R.id = S.id ON R.id = S.id WHERE R.id BETWEEN 1 AND 100 WHERE R.id BETWEEN 101 AND 200 WHERE R.id BETWEEN 201 AND 300 Id:1-100 Id:101-200 Id:201-300 CMU 15-445/645 (Fall 2018)

25 Union the output of Q UERY PLAN FRAGM EN TS each join together to produce final result. SELECT * FROM R JOIN S ON R.id = S.id SELECT * FROM R JOIN S SELECT * FROM R JOIN S SELECT * FROM R JOIN S ON R.id = S.id ON R.id = S.id ON R.id = S.id WHERE R.id BETWEEN 1 AND 100 WHERE R.id BETWEEN 101 AND 200 WHERE R.id BETWEEN 201 AND 300 Id:1-100 Id:101-200 Id:201-300 CMU 15-445/645 (Fall 2018)

26 O BSERVATIO N The efficiency of a distributed join depends on the target tables' partitioning schemes. One approach is to put entire tables on a single node and then perform the join. → You lose the parallelism of a distributed DBMS. → Costly data transfer over the network. CMU 15-445/645 (Fall 2018)

27 DISTRIBUTED J O IN ALGO RITH M S To join tables R and S , the DBMS needs to get the proper tuples on the same node. Once there, it then executes the same join algorithms that we discussed earlier in the semester. CMU 15-445/645 (Fall 2018)

28 SCEN ARIO # 1 One table is replicated at every node. SELECT * FROM R JOIN S Each node joins its local data and then ON R.id = S.id sends their results to a coordinating node. :R ⨝ S P2:R ⨝ S P1 Id:1-100 Id:101-200 R (Id) R (Id) Replicated Replicated S S CMU 15-445/645 (Fall 2018)

28 SCEN ARIO # 1 One table is replicated at every node. SELECT * FROM R JOIN S Each node joins its local data and then ON R.id = S.id sends their results to a coordinating node. :R ⨝ S P1 R ⨝ S P2:R ⨝ S Id:1-100 Id:101-200 R (Id) R (Id) Replicated Replicated S S CMU 15-445/645 (Fall 2018)

29 SCEN ARIO # 2 Tables are partitioned on the join SELECT * FROM R JOIN S attribute. Each node performs the join ON R.id = S.id on local data and then sends to a node for coalescing. :R ⨝ S P2:R ⨝ S P1 Id:1-100 Id:101-200 R (Id) R (Id) Id:1-100 Id:101-200 S (Id) S (Id) CMU 15-445/645 (Fall 2018)

29 SCEN ARIO # 2 Tables are partitioned on the join SELECT * FROM R JOIN S attribute. Each node performs the join ON R.id = S.id on local data and then sends to a node for coalescing. :R ⨝ S P1 R ⨝ S P2:R ⨝ S Id:1-100 Id:101-200 R (Id) R (Id) Id:1-100 Id:101-200 S (Id) S (Id) CMU 15-445/645 (Fall 2018)

30 SCEN ARIO # 3 Both tables are partitioned on SELECT * FROM R JOIN S different keys. If one of the tables is ON R.id = S.id small, then the DBMS broadcasts that table to all nodes. Id:1-100 Id:101-200 R (Id) R (Id) Val:1-50 Val:51-100 S (Val) S (Val) CMU 15-445/645 (Fall 2018)

30 SCEN ARIO # 3 Both tables are partitioned on SELECT * FROM R JOIN S different keys. If one of the tables is ON R.id = S.id small, then the DBMS broadcasts that table to all nodes. S Id:1-100 Id:101-200 R (Id) R (Id) Val:1-50 Val:51-100 S (Val) S (Val) CMU 15-445/645 (Fall 2018)

30 SCEN ARIO # 3 Both tables are partitioned on SELECT * FROM R JOIN S different keys. If one of the tables is ON R.id = S.id small, then the DBMS broadcasts that table to all nodes. S S Id:1-100 Id:101-200 R (Id) R (Id) Val:1-50 Val:51-100 S (Val) S (Val) CMU 15-445/645 (Fall 2018)

30 SCEN ARIO # 3 Both tables are partitioned on SELECT * FROM R JOIN S different keys. If one of the tables is ON R.id = S.id small, then the DBMS broadcasts that table to all nodes. :R ⨝ S P2:R ⨝ S P1 S S Id:1-100 Id:101-200 R (Id) R (Id) Val:1-50 Val:51-100 S (Val) S (Val) CMU 15-445/645 (Fall 2018)

30 SCEN ARIO # 3 Both tables are partitioned on SELECT * FROM R JOIN S different keys. If one of the tables is ON R.id = S.id small, then the DBMS broadcasts that table to all nodes. :R ⨝ S P1 R ⨝ S P2:R ⨝ S S S Id:1-100 Id:101-200 R (Id) R (Id) Val:1-50 Val:51-100 S (Val) S (Val) CMU 15-445/645 (Fall 2018)

31 SCEN ARIO # 4 Both tables are not partitioned on the SELECT * FROM R JOIN S join key. The DBMS copies the tables ON R.id = S.id by reshuffling them across nodes. Name:A-M Name:N-Z R (Name) R (Name) Val:1-50 S (Val) S (Val) Val:51-100 CMU 15-445/645 (Fall 2018)

Distributed OLAP Databases Lecture # 24 Database Systems Andy - PowerPoint PPT Presentation

Distributed OLAP Databases Lecture # 24 Database Systems Andy Pavlo AP AP Computer Science 15-445/15-645 Carnegie Mellon Univ. Fall 2018 2 UPCO M IN G DATABASE EVEN TS Swarm64 Tech Talk Thursday November 29 th @ 12pm GHC 8102

2 Workloa d? 3 OLTP 4 OLAP OLTP 4 OLAP OLTP Streaming 4 Scan- OLAP OLTP Streaming

Data Warehouse and OLAP II Data Warehouse and OLAP II Week 6 1 Team Homework Assignment #8

OLAP & Data Mining OLAP & Data Mining Agenda Agenda SQL Server Features (in short) SQL

Saiku Saiku taking taking OLAP OLAP databases databases into into 21 21st st

OLAP and Data Mining Chapter 17 OLTP Compared With OLAP On Line Transaction Processing

Data Warehouse and OLAP Data Warehouse and OLAP Week 5 1 Midterm I Midterm I Friday, March

Fjord Bridging the between Gap OLAP SQL and Jens-Peter Dittrich 1, *

Creating Databases and Tables Introduction to Databases in Python Creating Databases

Inductive Inductive Inductive Inductive Databases Databases Databases Databases and

Lecture 11: Persistent Memory Databases 1 / 71 Persistent Memory Databases Recap

Distributed Databases Distributed database management system A distributed database (DDB) is

DISTRIBUTED DATABASES CHAPTER 25 LECTURE OVERVIEW What are distributed databases?

Module 3: Creating and Managing Databases Overview Creating Databases Creating

CS377: Database Systems Distributed Databases Distributed Databases

Benchmarking Hybrid OLTP&OLAP Database Systems Florian Funke Alfons Kemper Thomas Neumann

A General Model for OLAP of Complex Data Jian Pei State University of New York at Buffalo, USA

Next Generation Is Islamic Fin inance Business Model - The Islamic Finance House - Facts

ELEN E6884 - Topics in Signal Processing Recap Topic: Speech Recognition Gaussian Mixture

Tokyo, September 14, 2020 OUTLINE OF THE MAIN REPORT SECTION 1 MACROECONOMIC PERFORMANCE AND

Diving Group RNLN Dive into the future with the RNLN Royal Netherlands Navy Anton van Dijk 19

15-721 DATABASE SYSTEMS Lecture #09 OLAP Indexes Andy Pavlo / / Carnegie Mellon University

Advanced SQL II Advanced Aggregation and OLAP 5DV120 Database System Principles Ume a

Outline What is de c ision suppor t An Overview of Data What is a data war e house

DATA MANAGEMENT FOR BUSINESS INTELLIGENCE OLAP: On-Line Analytical Processing Salvatore Ruggieri

Distributed OLAP Databases Lecture # 24 Database Systems Andy - PowerPoint PPT Presentation

Distributed OLAP Databases Lecture # 24 Database Systems Andy Pavlo AP AP Computer Science 15-445/15-645 Carnegie Mellon Univ. Fall 2018 2 UPCO M IN G DATABASE EVEN TS Swarm64 Tech Talk Thursday November 29 th @ 12pm GHC 8102

2 Workloa d? 3 OLTP 4 OLAP OLTP 4 OLAP OLTP Streaming 4 Scan- OLAP OLTP Streaming

Data Warehouse and OLAP II Data Warehouse and OLAP II Week 6 1 Team Homework Assignment #8

OLAP &amp; Data Mining OLAP &amp; Data Mining Agenda Agenda SQL Server Features (in short) SQL

Saiku Saiku taking taking OLAP OLAP databases databases into into 21 21st st

OLAP and Data Mining Chapter 17 OLTP Compared With OLAP On Line Transaction Processing

Data Warehouse and OLAP Data Warehouse and OLAP Week 5 1 Midterm I Midterm I Friday, March

Fjord Bridging the between Gap OLAP SQL and Jens-Peter Dittrich 1, *

Creating Databases and Tables Introduction to Databases in Python Creating Databases

Inductive Inductive Inductive Inductive Databases Databases Databases Databases and

Lecture 11: Persistent Memory Databases 1 / 71 Persistent Memory Databases Recap

Distributed Databases Distributed database management system A distributed database (DDB) is

DISTRIBUTED DATABASES CHAPTER 25 LECTURE OVERVIEW What are distributed databases?

Module 3: Creating and Managing Databases Overview Creating Databases Creating

CS377: Database Systems Distributed Databases Distributed Databases

Benchmarking Hybrid OLTP&amp;OLAP Database Systems Florian Funke Alfons Kemper Thomas Neumann

A General Model for OLAP of Complex Data Jian Pei State University of New York at Buffalo, USA

Next Generation Is Islamic Fin inance Business Model - The Islamic Finance House - Facts

ELEN E6884 - Topics in Signal Processing Recap Topic: Speech Recognition Gaussian Mixture

Tokyo, September 14, 2020 OUTLINE OF THE MAIN REPORT SECTION 1 MACROECONOMIC PERFORMANCE AND

Diving Group RNLN Dive into the future with the RNLN Royal Netherlands Navy Anton van Dijk 19

15-721 DATABASE SYSTEMS Lecture #09 OLAP Indexes Andy Pavlo / / Carnegie Mellon University

Advanced SQL II Advanced Aggregation and OLAP 5DV120 Database System Principles Ume a

Outline What is de c ision suppor t An Overview of Data What is a data war e house

DATA MANAGEMENT FOR BUSINESS INTELLIGENCE OLAP: On-Line Analytical Processing Salvatore Ruggieri

OLAP & Data Mining OLAP & Data Mining Agenda Agenda SQL Server Features (in short) SQL

Benchmarking Hybrid OLTP&OLAP Database Systems Florian Funke Alfons Kemper Thomas Neumann