Distributed OLAP Databases (Lecture 24)
Intro to Database Systems
Andy Pavlo
15-445/15-645, Computer Science, Carnegie Mellon University, Fall 2019

2 ADMINISTRIVIA
Homework #5: Monday Dec 3rd @ 11:59pm
Project #4: Monday Dec 10th @ 11:59pm
Extra Credit: Wednesday Dec 10th @ 11:59pm
Final Exam: Monday Dec 9th @ 5:30pm
Systems Potpourri: Wednesday Dec 4th
→ Vote for what system you want me to talk about.
→ https://cmudb.io/f19-systems
CMU 15-445/645 (Fall 2019)

3 ADMINISTRIVIA
Monday Dec 2nd – Oracle Lecture
→ Shasank Chavan (VP In-Memory Databases)
Monday Dec 2nd – Oracle Systems Talk
→ 4:30pm in GHC 6115
→ Pizza will be served.
Tuesday Dec 3rd – Oracle Research Talk
→ Hideaki Kimura (Oracle Beast)
→ 12:00pm in CIC 4th Floor (Panther Hollow Room)
→ Pizza will be served.

4 LAST CLASS
Atomic Commit Protocols
Replication
Consistency Issues (CAP)
Federated Databases

5 BIFURCATED ENVIRONMENT
[Diagram: OLTP Databases feed the OLAP Database through Extract, Transform, Load.]

6 DECISION SUPPORT SYSTEMS
Applications that serve the management, operations, and planning levels of an organization to help people make decisions about future issues and problems by analyzing historical data.
Star Schema vs. Snowflake Schema

7 STAR SCHEMA
SALES_FACT: PRODUCT_FK, TIME_FK, LOCATION_FK, CUSTOMER_FK, PRICE, QUANTITY
PRODUCT_DIM: CATEGORY_NAME, CATEGORY_DESC, PRODUCT_CODE, PRODUCT_NAME, PRODUCT_DESC
CUSTOMER_DIM: ID, FIRST_NAME, LAST_NAME, EMAIL, ZIP_CODE
LOCATION_DIM: COUNTRY, STATE_CODE, STATE_NAME, ZIP_CODE, CITY
TIME_DIM: YEAR, DAY_OF_YEAR, MONTH_NUM, MONTH_NAME, DAY_OF_MONTH

8 SNOWFLAKE SCHEMA
SALES_FACT: PRODUCT_FK, TIME_FK, LOCATION_FK, CUSTOMER_FK, PRICE, QUANTITY
PRODUCT_DIM: CATEGORY_FK, PRODUCT_CODE, PRODUCT_NAME, PRODUCT_DESC
CAT_LOOKUP: CATEGORY_ID, CATEGORY_NAME, CATEGORY_DESC
CUSTOMER_DIM: ID, FIRST_NAME, LAST_NAME, EMAIL, ZIP_CODE
LOCATION_DIM: COUNTRY, STATE_FK, ZIP_CODE, CITY
STATE_LOOKUP: STATE_ID, STATE_CODE, STATE_NAME
TIME_DIM: YEAR, DAY_OF_YEAR, MONTH_FK, DAY_OF_MONTH
MONTH_LOOKUP: MONTH_NUM, MONTH_NAME, MONTH_SEASON

9 STAR VS. SNOWFLAKE SCHEMA
Issue #1: Normalization
→ Snowflake schemas take up less storage space.
→ Denormalized data models may incur integrity and consistency violations.
Issue #2: Query Complexity
→ Snowflake schemas require more joins to get the data needed for a query.
→ Queries on star schemas will (usually) be faster.
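
The extra join that a snowflake schema needs can be seen in a minimal sketch using Python's built-in sqlite3 module. The table and column names below (star_location_dim, snow_location_dim, the id key, the sample rows) are assumptions made for illustration; they loosely follow the LOCATION_DIM and STATE_LOOKUP tables on the earlier slides rather than reproduce them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Star: the location dimension stores the state name inline (denormalized).
    CREATE TABLE star_location_dim (id INTEGER PRIMARY KEY, state_name TEXT, city TEXT);
    -- Snowflake: the state name moves into a separate lookup table.
    CREATE TABLE state_lookup (state_id INTEGER PRIMARY KEY, state_name TEXT);
    CREATE TABLE snow_location_dim (id INTEGER PRIMARY KEY, state_fk INTEGER, city TEXT);
    -- Hypothetical, trimmed-down fact table shared by both variants.
    CREATE TABLE sales_fact (location_fk INTEGER, price REAL, quantity INTEGER);

    INSERT INTO star_location_dim VALUES
        (1, 'Pennsylvania', 'Pittsburgh'), (2, 'Pennsylvania', 'Philadelphia');
    INSERT INTO state_lookup VALUES (10, 'Pennsylvania');
    INSERT INTO snow_location_dim VALUES (1, 10, 'Pittsburgh'), (2, 10, 'Philadelphia');
    INSERT INTO sales_fact VALUES (1, 5.0, 2), (2, 3.0, 1);
""")

# Star schema: one join reaches the state name.
star_sql = """
    SELECT l.state_name, SUM(f.price * f.quantity)
    FROM sales_fact f
    JOIN star_location_dim l ON f.location_fk = l.id
    GROUP BY l.state_name
"""

# Snowflake schema: the same question needs a second join through the lookup.
snowflake_sql = """
    SELECT s.state_name, SUM(f.price * f.quantity)
    FROM sales_fact f
    JOIN snow_location_dim l ON f.location_fk = l.id
    JOIN state_lookup s ON l.state_fk = s.state_id
    GROUP BY s.state_name
"""

star_rows = conn.execute(star_sql).fetchall()
snowflake_rows = conn.execute(snowflake_sql).fetchall()
```

Both queries return the same answer. The star variant stores 'Pennsylvania' once per location row, while the snowflake variant stores it once overall and pays for that with the additional join.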

10 PROBLEM SETUP
SELECT * FROM R JOIN S ON R.id = S.id
[Diagram: an application server issues the query against a database that is split into partitions P1, P2, P3, and P4.]

11 TODAY'S AGENDA
Execution Models
Query Planning
Distributed Join Algorithms
Cloud Systems

12 PUSH VS. PULL
Approach #1: Push Query to Data
→ Send the query (or a portion of it) to the node that contains the data.
→ Perform as much filtering and processing as possible where data resides before transmitting over network.
Approach #2: Pull Data to Query
→ Bring the data to the node that is executing a query that needs it for processing.
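
The difference between the two approaches can be sketched by counting how many tuples cross the network. This is a toy model, not how any particular DBMS implements it: the Node class, its methods, and the sample predicate are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical storage node holding one partition of a table."""
    rows: list

    def run_filter(self, predicate):
        # Push: evaluate the predicate where the data lives.
        return [r for r in self.rows if predicate(r)]

    def read_all(self):
        # Pull: ship every stored tuple to whoever asked.
        return list(self.rows)


def push_query_to_data(node, predicate):
    result = node.run_filter(predicate)
    return result, len(result)       # (result, tuples sent over the network)


def pull_data_to_query(node, predicate):
    shipped = node.read_all()
    result = [r for r in shipped if predicate(r)]
    return result, len(shipped)      # every tuple was sent before filtering


# A partition covering IDs 101-200, as on the slides.
node = Node(rows=[(i, i * 10) for i in range(101, 201)])
wanted = lambda r: r[0] > 190

pushed, pushed_cost = push_query_to_data(node, wanted)
pulled, pulled_cost = pull_data_to_query(node, wanted)
```

Both calls return the same ten tuples, but the push variant transmits only those ten while the pull variant transmits all one hundred.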

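13 PUSH QUERY TO DATA
SELECT * FROM R JOIN S ON R.id = S.id
[Diagram: the application server sends the query to the node holding P1 (ID:1-100). That node sends the join fragment R ⨝ S for IDs [101,200] to the node holding P2 (ID:101-200), and the combined Result: R ⨝ S goes back to the application server.]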

14 PULL DATA TO QUERY
SELECT * FROM R JOIN S ON R.id = S.id
[Diagram: two nodes, responsible for P1 (ID:1-100) and P2 (ID:101-200), sit in front of shared storage. The join fragment R ⨝ S for IDs [101,200] goes to the second node, each node pulls the pages it needs (Page ABC, Page XYZ) from storage, and Result: R ⨝ S goes back to the application server.]

15 OBSERVATION
The data that a node receives from remote sources are cached in the buffer pool.
→ This allows the DBMS to support intermediate results that are larger than the amount of memory available.
→ Ephemeral pages are not persisted after a restart.
What happens to a long-running OLAP query if a node crashes during execution?

16 QUERY FAULT TOLERANCE
Most shared-nothing distributed OLAP DBMSs are designed to assume that nodes do not fail during query execution.
→ If one node fails during query execution, then the whole query fails.
The DBMS could take a snapshot of the intermediate results for a query during execution to allow it to recover if nodes fail.
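
The snapshot idea can be sketched as a loop that records its progress after each completed unit of work. This is a simplified illustration under stated assumptions: the storage dictionary stands in for durable shared storage, the query is a plain sum over partitions, and NodeFailure, run_with_snapshots, and fail_at are hypothetical names.

```python
class NodeFailure(Exception):
    """Raised to simulate a node crashing mid-query."""


def run_with_snapshots(partitions, storage, fail_at=None):
    """Sum every partition, checkpointing after each one completes."""
    done = storage.get("done", 0)         # partitions already processed
    partial = storage.get("partial", 0)   # intermediate result so far
    for i in range(done, len(partitions)):
        if i == fail_at:
            raise NodeFailure(f"node crashed before partition {i}")
        partial += sum(partitions[i])
        # Snapshot the intermediate result so a restart can resume here.
        storage["done"], storage["partial"] = i + 1, partial
    return partial


partitions = [[1, 2], [3, 4], [5, 6]]
storage = {}                              # stand-in for durable storage

try:
    run_with_snapshots(partitions, storage, fail_at=2)
except NodeFailure:
    pass
snapshot_after_crash = dict(storage)

# A second attempt picks up from the snapshot instead of starting over.
resumed = run_with_snapshots(partitions, storage)
```

Without the snapshot, the second attempt would have to reprocess the first two partitions. The trade-off is the cost of writing intermediate results during normal execution.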

17 QUERY FAULT TOLERANCE
SELECT * FROM R JOIN S ON R.id = S.id
[Diagram: the application server sends the query to two nodes backed by shared storage. The intermediate result R ⨝ S is written to storage so that Result: R ⨝ S can still be produced if a node fails.]

18 QUERY PLANNING
All the optimizations that we talked about before are still applicable in a distributed environment.
→ Predicate Pushdown
→ Early Projections
→ Optimal Join Orderings
Distributed query optimization is even harder because it must consider the location of data in the cluster and data movement costs.

19 QUERY PLAN FRAGMENTS
Approach #1: Physical Operators
→ Generate a single query plan and then break it up into partition-specific fragments.
→ Most systems implement this approach.
Approach #2: SQL
→ Rewrite original query into partition-specific queries.
→ Allows for local optimization at each node.
→ MemSQL is the only system that I know that does this.

20 QUERY PLAN FRAGMENTS
Original query:
SELECT * FROM R JOIN S ON R.id = S.id
Partition-specific queries:
Id:1-100: SELECT * FROM R JOIN S ON R.id = S.id WHERE R.id BETWEEN 1 AND 100
Id:101-200: SELECT * FROM R JOIN S ON R.id = S.id WHERE R.id BETWEEN 101 AND 200
Id:201-300: SELECT * FROM R JOIN S ON R.id = S.id WHERE R.id BETWEEN 201 AND 300
Union the output of each join to produce final result.
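
The SQL-rewriting approach can be sketched with plain string manipulation plus Python's built-in sqlite3 module. The helper partition_queries, the extra columns a and b, and the sample rows are assumptions made for illustration. A real system would run each fragment on the node that owns the partition; here all fragments run against one in-memory database only to check that their union matches the original query.

```python
import sqlite3


def partition_queries(base_sql, column, ranges):
    """Append one range predicate per partition (naive string rewrite)."""
    return [f"{base_sql} WHERE {column} BETWEEN {lo} AND {hi}" for lo, hi in ranges]


base_sql = "SELECT * FROM R JOIN S ON R.id = S.id"
ranges = [(1, 100), (101, 200), (201, 300)]
fragments = partition_queries(base_sql, "R.id", ranges)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (id INTEGER, a TEXT)")
conn.execute("CREATE TABLE S (id INTEGER, b TEXT)")
conn.executemany("INSERT INTO R VALUES (?, ?)",
                 [(50, "r50"), (150, "r150"), (250, "r250")])
conn.executemany("INSERT INTO S VALUES (?, ?)",
                 [(50, "s50"), (250, "s250"), (999, "s999")])

# Run every fragment and union (concatenate) the outputs.
unioned = []
for sql in fragments:
    unioned.extend(conn.execute(sql).fetchall())

# The unpartitioned query, for comparison.
full = conn.execute(base_sql).fetchall()
```

Because the ranges are disjoint and cover every R.id in the data, concatenating the fragment outputs yields the same rows as the original join, with no duplicates to remove.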

21 OBSERVATION
The efficiency of a distributed join depends on the target tables' partitioning schemes.
One approach is to put entire tables on a single node and then perform the join.
→ You lose the parallelism of a distributed DBMS.
→ Costly data transfer over the network.

22 DISTRIBUTED JOIN ALGORITHMS
To join tables R and S, the DBMS needs to get the proper tuples on the same node.
Once there, it then executes the same join algorithms that we discussed earlier in the semester.

23 SCENARIO #1
One table is replicated at every node. Each node joins its local data and then sends its results to a coordinating node.
SELECT * FROM R JOIN S ON R.id = S.id
[Diagram: P1 holds R{Id} for Id:1-100 and P2 holds R{Id} for Id:101-200; S is replicated on both. Each node computes R ⨝ S locally, and the outputs are combined into R ⨝ S.]

24 SCENARIO #2
Tables are partitioned on the join attribute. Each node performs the join on local data and then sends to a node for coalescing.
SELECT * FROM R JOIN S ON R.id = S.id
[Diagram: P1 holds R{Id} and S{Id} for Id:1-100; P2 holds R{Id} and S{Id} for Id:101-200. Each node computes R ⨝ S locally, and the outputs are combined into R ⨝ S.]
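
When both tables are partitioned on the join attribute, matching tuples always land on the same node, so no tuples move before the join. The sketch below models that with in-memory lists. The helpers range_partition and local_hash_join, the dictionary row layout, and the choice of a hash join as the local algorithm are assumptions for illustration.

```python
def range_partition(rows, key, boundaries):
    """Place each row in the first (lo, hi) range that contains row[key]."""
    parts = [[] for _ in boundaries]
    for row in rows:
        for i, (lo, hi) in enumerate(boundaries):
            if lo <= row[key] <= hi:
                parts[i].append(row)
                break
    return parts


def local_hash_join(r_rows, s_rows):
    """Join two co-located partitions on 'id' using a hash table on S."""
    table = {}
    for s in s_rows:
        table.setdefault(s["id"], []).append(s)
    return [(r, s) for r in r_rows for s in table.get(r["id"], [])]


boundaries = [(1, 100), (101, 200)]       # P1 and P2, as on the slide
R = [{"id": 7, "a": "x"}, {"id": 150, "a": "y"}]
S = [{"id": 7, "b": "p"}, {"id": 150, "b": "q"}, {"id": 180, "b": "z"}]

r_parts = range_partition(R, "id", boundaries)
s_parts = range_partition(S, "id", boundaries)

# Each node joins only its own partitions; a coordinator coalesces the output.
result = []
for r_part, s_part in zip(r_parts, s_parts):
    result.extend(local_hash_join(r_part, s_part))

matched = [(r["id"], s["b"]) for r, s in result]
```

The only network traffic in this scenario is the final transfer of each node's join output to the coalescing node.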

25 SCENARIO #3
Both tables are partitioned on different keys. If one of the tables is small, then the DBMS broadcasts that table to all nodes.
SELECT * FROM R JOIN S ON R.id = S.id
[Diagram: R{Id} is partitioned as Id:1-100 and Id:101-200; S{Val} is partitioned as Val:1-50 and Val:51-100. A full copy of S is broadcast to the nodes.]
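
A broadcast join can be sketched as follows: every node receives the S tuples it does not already hold, then joins its local R partition against the full copy of S. The function broadcast_join, the tuple layouts, and the sample data are hypothetical and only meant to illustrate the data movement.

```python
def broadcast_join(nodes):
    """nodes: one (r_part, s_part) pair per node. Returns (output, shipped)."""
    # The full contents of the small table S, gathered from every node.
    s_full = [s for _, s_part in nodes for s in s_part]

    output, shipped = [], 0
    for r_part, s_part in nodes:
        # This node must receive every S tuple it does not already store.
        shipped += len(s_full) - len(s_part)
        by_id = {}
        for s in s_full:
            by_id.setdefault(s[0], []).append(s)
        # Local join of the R partition against the complete copy of S.
        output.extend((r, s) for r in r_part for s in by_id.get(r[0], []))
    return output, shipped


# R rows are (id, payload), partitioned on id.
# S rows are (id, val), partitioned on val, so matching ids may live elsewhere.
nodes = [
    ([(10, "r10"), (20, "r20")], [(150, 5), (10, 40)]),   # Id 1-100, Val 1-50
    ([(150, "r150")],            [(20, 60)]),             # Id 101-200, Val 51-100
]

joined, shipped = broadcast_join(nodes)
```

Here R never moves. The cost is shipping copies of S, which grows with both the size of S and the number of nodes, so the approach only pays off when S is small.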
