SLIDE 1 Parallel DBMS
CMPSCI 645
Slide content due to Ramakrishnan, Gehrke, Hellerstein, Gray. Figures taken from: DeWitt and Gray. Parallel Database Systems: The Future of High Performance Database Systems. CACM 1992.
SLIDE 2 Parallel vs. Distributed DBs
Parallel database systems
– Improve performance by parallelizing various operations: loading data, building indexes, evaluating queries. Data may be distributed, but purely for performance reasons.
Distributed database systems
– Data is physically stored across various sites, each of which runs a DBMS and can function independently. Data distribution is determined by local ownership and availability, in addition to performance.
SLIDE 3
Why Parallel Access To Data?
At 10 MB/s of disk bandwidth, scanning 1 terabyte takes about 1.2 days.
With 1,000-way parallelism, the same terabyte can be scanned in about 1.5 minutes.
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
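The slide's numbers can be sanity-checked with a little arithmetic (decimal units assumed; the slide rounds slightly differently):

```python
# Back-of-the-envelope scan times from the slide (decimal units assumed).
TB = 10**12           # 1 terabyte in bytes
rate = 10 * 10**6     # 10 MB/s from a single disk

serial_secs = TB / rate              # one disk scanning the whole table
parallel_secs = serial_secs / 1000   # 1,000 disks scanning partitions at once

print(serial_secs / 86400)   # about 1.2 days
print(parallel_secs / 60)    # under 2 minutes
```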
SLIDE 4 Parallel DBMS: Intro
Parallelism is natural to DBMS processing
– Pipeline parallelism: many machines, each doing one step in a multi-step process.
– Partition parallelism: many machines doing the same thing to different pieces of data.
– Both are natural in DBMS!
Figure: pipeline parallelism chains sequential programs end to end; partition parallelism runs copies of the same sequential program on separate pieces of the data.
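A toy sketch of the two styles (the function names and the even-number "operator" are invented; real DBMS operators are relational, not Python functions):

```python
from concurrent.futures import ThreadPoolExecutor

def step(chunk):
    """One 'sequential program': keep the even values in its input."""
    return [x for x in chunk if x % 2 == 0]

data = list(range(100))

# Pipeline parallelism: stages chained so each consumes the previous one's output.
stage1 = (x + 1 for x in data)               # an upstream transform
stage2 = [x for x in stage1 if x % 2 == 0]   # a downstream filter

# Partition parallelism: the SAME step runs on disjoint pieces of the data.
chunks = [data[i::4] for i in range(4)]      # 4 round-robin partitions
with ThreadPoolExecutor(4) as pool:
    parts = list(pool.map(step, chunks))
partition_result = sorted(x for part in parts for x in part)

assert partition_result == step(data)        # same answer as the sequential run
```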
SLIDE 5
DBMS: The || Success Story
DBMSs are the most (only?) successful application of parallelism.
– Teradata, Tandem vs. Thinking Machines, KSR...
– Every major DBMS vendor has some || server.
Reasons for success:
– Bulk processing (= partition ||-ism).
– Natural pipelining.
– Inexpensive hardware can do the trick!
– Users/app-programmers don't need to think in ||.
SLIDE 6
Some || Terminology
Speed-Up
– More resources mean proportionally less time for a given amount of data.
– Problem size constant, system grows.
Scale-Up
– If resources are increased in proportion to the increase in data size, time stays constant.
– Problem size and system both grow.
Figure: ideal speed-up is throughput (Xact/sec.) growing linearly with the degree of ||-ism; ideal scale-up is response time (sec./Xact) staying flat as the degree of ||-ism grows.
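The two metrics can be written down directly; the timings below are invented for illustration:

```python
def speedup(old_time, new_time):
    # Same problem on a bigger system: the ideal value equals the resource ratio.
    return old_time / new_time

def scaleup(small_time, big_time):
    # Problem and system grown by the same factor: the ideal value is 1.0.
    return small_time / big_time

# 4x the processors cut a fixed job from 100s to 28s:
print(speedup(100, 28))    # about 3.6, below the ideal 4.0
# 4x the data on 4x the processors takes 110s instead of 100s:
print(scaleup(100, 110))   # about 0.91, below the ideal 1.0
```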
SLIDE 7 Enemies of good speed-up / scale-up
Start up work
– If thousands of processes must be started, this can dominate actual computation time
Interference
– The slowdown each new process imposes on all others when accessing shared resources.
Skew
– Variance in the size of the jobs given to each process. The service time for the whole job is the service time of its slowest step.
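A tiny model of why skew hurts (the per-partition times are made up):

```python
# Elapsed time for a partitioned job is governed by its slowest partition.
per_partition_secs = [10, 11, 10, 42]    # one badly skewed partition

elapsed = max(per_partition_secs)        # 42s: the straggler dominates
useful_work = sum(per_partition_secs)    # 73s of total work performed
efficiency = useful_work / (elapsed * len(per_partition_secs))
print(efficiency)                        # ~0.43: over half the machine-seconds sit idle
```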
SLIDE 8 Architecture Issue: Shared What?
Alternative architectures:
– Shared memory: all processors share common global memory and have access to all disks.
– Shared disk: all processors have private memory, but direct access to all disks.
– Shared nothing: each memory/disk is owned by a processor, which acts as a server for that data.
Figure: the shared-nothing, shared-memory, and shared-disk architectures.
SLIDE 9 Architecture Issue: Shared What?
Shared-nothing advantages:
- Minimize interference by minimizing shared resources
- Exploit commodity processors and memory
- Disk and memory accesses are local
- Traffic on interconnection network is minimized
SLIDE 10 Different Types of DBMS ||-ism
Intra-operator parallelism
– get all machines working to compute a given operation (scan, sort, join)
Inter-operator parallelism
– each operator may run concurrently on a different site (exploits pipelining)
Inter-query parallelism
– different queries run on different sites
We’ll focus on intra-operator ||-ism
SLIDE 11 Limits of pipelined parallelism in DBMS
Relational pipelines are usually not very long.
Some relational operators block (e.g. sorting, aggregation).
The execution cost of one operator may be much higher than another's (an example of skew).
As a result, partitioned parallelism is key to achieving speed-up and scale-up.
SLIDE 12 Automatic Data Partitioning
Partitioning a table across sites:
– Range (e.g. A...E | F...J | K...N | O...S | T...Z): good for equijoins, range queries, and group-by.
– Hash: good for equijoins.
– Round robin: good to spread load.
Shared-disk and shared-memory are less sensitive to partitioning; shared-nothing benefits from "good" partitioning.
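Minimal sketches of the three schemes (the boundaries, site counts, and names are illustrative; note that Python's built-in hash of strings varies across runs):

```python
def range_partition(key, boundaries):
    """Send the key to the first range whose upper bound covers it."""
    for i, upper in enumerate(boundaries):
        if key <= upper:
            return i
    return len(boundaries)          # the last, open-ended range

def hash_partition(key, n_sites):
    return hash(key) % n_sites

class RoundRobin:
    """Hand out sites cyclically, ignoring the tuple's content."""
    def __init__(self, n_sites):
        self.n, self.next = n_sites, 0
    def partition(self, _row):
        p, self.next = self.next, (self.next + 1) % self.n
        return p

print(range_partition("Gray", ["E", "J", "N", "S"]))   # 1: lands in F...J
```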
SLIDE 13 Parallel query processing
Two relational scans consuming two input relations, A and B, and feeding their outputs to a join operator that in turn produces a data stream C.
SLIDE 14
Parallel Scans
Scan in parallel, and merge.
Selection may not require all sites under range or hash partitioning.
Indexes can be built at each partition.
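The pruning point can be sketched for a predicate like name < limit on a range-partitioned table (the boundary list and helper are invented):

```python
# Site i holds keys up to boundaries[i]; the last site holds everything above.
boundaries = ["E", "J", "N", "S"]

def sites_for_upper_bound(limit, boundaries):
    """Sites whose range can contain keys < limit."""
    sites, range_start = [], ""      # the first range starts at the bottom
    for i in range(len(boundaries) + 1):
        if limit > range_start:      # this site's range begins below the limit
            sites.append(i)
        if i < len(boundaries):
            range_start = boundaries[i]
    return sites

print(sites_for_upper_bound("H", boundaries))   # [0, 1]: only two of five sites
```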
SLIDE 15 Parallel Hash Join
In the first phase, partitions get distributed to different sites:
– A good hash function automatically distributes work evenly!
Do the second phase at each site.
Almost always the winner for equi-joins.
Figure (phase 1): the original relations (R, then S) are read from disk through B main-memory buffers and routed by hash function h into B-1 partitions on disk.
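A compact sketch of the two phases, with the routing and the local joins in one process (relation contents and the site count are invented; a real system streams tuples over the interconnect):

```python
from collections import defaultdict

def parallel_hash_join(R, S, n_sites=4):
    """Equi-join lists of (key, value) tuples on key."""
    # Phase 1: route every tuple of both relations to a site by hashing its key.
    r_parts, s_parts = defaultdict(list), defaultdict(list)
    for key, val in R:
        r_parts[hash(key) % n_sites].append((key, val))
    for key, val in S:
        s_parts[hash(key) % n_sites].append((key, val))

    # Phase 2: each site joins its local R and S partitions independently.
    out = []
    for site in range(n_sites):
        local = defaultdict(list)                # build a hash table on local R
        for key, val in r_parts[site]:
            local[key].append(val)
        for key, s_val in s_parts[site]:         # probe with local S
            for r_val in local[key]:
                out.append((key, r_val, s_val))
    return out

R = [(1, "a"), (2, "b"), (3, "c")]
S = [(2, "x"), (3, "y"), (3, "z")]
print(sorted(parallel_hash_join(R, S)))  # [(2, 'b', 'x'), (3, 'c', 'y'), (3, 'c', 'z')]
```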
SLIDE 16
Dataflow Network for || Join
Good use of split/merge makes it easier to build parallel versions of sequential join code.
SLIDE 17 Complex Parallel Query Plans
Complex Queries: Inter-Operator parallelism
– Pipelining between operators: note that sort and phase 1 of hash-join block the pipeline!
– Bushy trees.
Figure: a bushy plan joining A with B on sites 1-4 and R with S on sites 5-8, with the top join running on sites 1-8.
SLIDE 18 Parallel query optimization issues
Cost estimation in a parallel environment.
Consider bushy plans -- a much larger plan space.
Some parameters are only known at runtime: the number of free processors, available buffer space.
SLIDE 19 Sequential vs. Parallel Optimization
Best serial plan != best || plan! A trivial counter-example:
– Table partitioned across two nodes, with a local secondary index at each.
– Range query: matches all of node 1 and 1% of node 2.
– Node 1 should do a scan of its partition.
– Node 2 should use its secondary index.
SELECT *
FROM telephone_book
WHERE name < “NoGood”;
Figure: the A..M node does a table scan; the N..Z node does an index scan.
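The per-node decision can be caricatured as a selectivity threshold; the 2% crossover is a made-up cost-model constant, not from the slides:

```python
def choose_access_path(selectivity, crossover=0.02):
    """Pick scan vs. index from the fraction of THIS node's partition touched."""
    return "table scan" if selectivity > crossover else "index scan"

print(choose_access_path(1.00))   # node 1: its whole partition qualifies
print(choose_access_path(0.01))   # node 2: only 1% qualifies, use the index
```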
SLIDE 20
Parallel DBMS Summary
||-ism is natural to query processing:
– Both pipeline and partition ||-ism!
Shared-Nothing vs. Shared-Mem
– Shared-disk too, but less standard.
– Shared-mem easy, costly. Doesn't scale up.
– Shared-nothing cheap, scales well, harder to implement.
Intra-op, inter-op, & inter-query ||-ism are all possible.
SLIDE 21
|| DBMS Summary, cont.
Data layout choices are important!
Most DB operations can be done partition-||:
– Sort.
– Sort-merge join, hash-join.
Complex plans:
– Allow for pipeline ||-ism, but sorts and hashes block the pipeline.
– Partition ||-ism achieved via bushy trees.
SLIDE 22
|| DBMS Summary, cont.
Hardest part of the equation: optimization.
– 2-phase optimization is simplest, but can be ineffective.
– More complex schemes are still at the research stage.
We haven't said anything about Xacts and logging.
– Easy in a shared-memory architecture.
– Takes some care in shared-nothing.