Distributed Databases: Design and Query Execution Data - PowerPoint PPT Presentation

Distributed Databases: Design and Query Execution

Data Fragmentation and Placement  Fragmentation:  How to split up the data into smaller fragments? Horizontal Vertical  Placement:  Where to put the fragments? COMP-577: Distributed Database Systems 2

Horizontal Data Fragmentation Problem Statement:   Given a table R (a1, a2, a3,…), containing tuples t1, t2, …  How to distribute R across several sites Horizontal Fragmentation   For each tuple ti, place ti on one of the sites  Fragmentation given through selection  Range partitioning: σ a1 < 10 (R), σ 10 <= a1 < 20 (R), σ 20 <= a1 (R)  loc = ‘Montreal’, loc = ‘Zurich’  Correcness: each tuple (also future tuple) is member of exactly one fragment (completeness and disjointness) COMP-577: Distributed Database Systems 3

Referencing partition  Assume relations Emp(eid, name, dep) and Dep(did, name, location); dep is foreign key referencing Dep  Dep is partitioned into Dep i according to location π did (Dep i ) are disjoint sets Ι  Emp is partitioned according to Dep  Emp i = Emp π did (Dep i ) SELECT eid, name, dep FROM Emp, Dep i WHERE Emp.dep = Dep i .did  Emp i are also disjoint COMP-577: Distributed Database Systems 4

Vertical Data Fragmentation Vertical Fragmentation   For each attribute ai, place for each tuple ti, the value of its attribute ai on one of the sites.  In order to identify tuples, the key is replicated at each fragment!  determined by projection : π K,A1,A2 (R), π K,A3,A4 (R), COMP-577: Distributed Database Systems 5

Fragmentation and Query Execution  LAN  Fragment such that your typical queries are executed fast  Full Scan  SELECT * FROM R  Many result tuples => distribute work  Idea: distribute data evenly among several sites and read data in parallel  Expect scanning time long => assemble and communication overhead neglegible  Point Query  SELECT * FROM R WHERE a1 = 5  Few result tuples = localize work  Idea: if all relevant data on one site, only one site has to scan its partition; no work on the other sites  Expect search time short: assemble and communication overhead have too big impact  WAN  Each query should be executed solely locally COMP-577: Distributed Database Systems 6

Range Partitioning  Partitioning Vector indicating domain ranges of an attribute V=(4;7)  Parallel Query Execution  + good for full scan if even distribution  + good for range queries on partition attribute  + ok for point queries  - not good for range/point queries not on partition attribute  Need for good vector otherwise data skew and execution skew  Oracle provides range partitioning  Local Query Execution  Good if partition related to geography COMP-577: Distributed Database Systems 7

Data Placement  Which partition is stored on which computers  Which partitions should be replicated and where  Highly related to data fragmentation  Optimization Problems:  Efficiency  Final goal  Minimize query response time  Maximize throughput  Achieved by minimizing “costs”  minimize WAN remote access  Optimize parallelism  Minimize bottlenecks (communication system, local machined)  Availability  Choose nodes according to their availability  replication COMP-577: Distributed Database Systems 8

Data Placement  Data Allocation depends on many parameters  Number of nodes  Storage capacity and costs on each site  Number of fragments  Size of fragments  Number of different queries/updates on specific fragments  Frequency of each query/update operation  Size of input query, number of results tuples per query  Communication costs per data unit  What query processing strategies do exist (type of joins etcs.)  This is a hard problem! COMP-577: Distributed Database Systems 9

Query Processing I  SELECT A,B FROM R WHERE E=3  Generate Execution Tree π A,B E = 3 R COMP-577: Distributed Database Systems 10

Query Processing II  Localization 1: Replace Relations by Fragments π A,B E = 3 ∪ [R1: E < 10] [R2: E >= 10] COMP-577: Distributed Database Systems 11

Query Processing III  Localization II  Push ∪ up  Push π and σ down π A,B ∪ π A,B π A,B E = 3 E = 3 E = 3 ∪ [R1: E < 10] [R2: E >= 10] [R1: E < 10] [R2: E >= 10] COMP-577: Distributed Database Systems 12

Query Processing IV  Localization III  Simplify  Eliminate unnecessary operations ∪ π A,B π A,B π A,B E = 3 E = 3 E = 3 [R1: E < 10] [R1: E < 10] [R2: E >= 10] = ∅ COMP-577: Distributed Database Systems 13

Summary  Many different rules to simplify  Is getting very sophisticated for  Joins  Aggregation (sum, max, etc.)  Localization with vertical fragmentation slightly different COMP-577: Distributed Database Systems 14

Heterogeneous Query Processing client client Mediator / catalog Middleware Wrapper Wrapper DB1 DB1 DB2 COMP-577: Distributed Database Systems 15

Work distribution  Mediator  Schema Integration  must know the schemas of the component DB (external schemas)  find a reasonable global schema for the separate schemas  translate the external local schemas of a component DB into global schema and vs. versa  this is its whole own research area  very complex if data sources have different data models (Relational, OO, XML, html, file-system)  Own data dictionary (catalog)  global schema, translation mechanisms, information about local schemas, …  some statistics of the component DB  size of tables, query capacity of component DB, etc. COMP-577: Distributed Database Systems 16

Work distribution  Mediator  global query execution  query parsing, rewrite, query optimization  executes some of the operations (joining, extra operations on the data retrieved from data sources)  might cache data COMP-577: Distributed Database Systems 17

Work distribution  Wrapper (adaptor)  translates every request of the mediator into request that is understood by the underlying DB  translates the results into format understood by the mediator  Wrappers for relational DB, file systems, web-based DB  wrapper is client of DB source: can cache data and execute parts of the query by itself (e.g. if data source only returns html pages)  Expect to have many wrappers for many different data sources  new data source -> new wrapper, but same mediator  Key feature of being extensible  Query optimization and execution  distributed between mediator, wrapper and data source COMP-577: Distributed Database Systems 18

Distributed Databases: Design and Query Execution Data - PowerPoint PPT Presentation

Distributed Databases: Design and Query Execution Data Fragmentation and Placement Fragmentation: How to split up the data into smaller fragments? Horizontal Vertical Placement: Where to put the fragments? COMP-577:

Query Execution 2 and Query Optimization Instructor: Matei Zaharia cs245.stanford.edu Query

Improve Query Performance with the Query Log Analyzer Kees Vegter Field Engineer Query Log

Parallel Query Execution in POLARDB for MySQL ystein Grvlen Benny Wang Alibaba Cloud Agenda

CS4224/CS5424 Lecture 9 Distributed Query Processing Query Processing Translates query into a

Inductive Inductive Inductive Inductive Databases Databases Databases Databases and

Creating Databases and Tables Introduction to Databases in Python Creating Databases

Lecture 11: Persistent Memory Databases 1 / 71 Persistent Memory Databases Recap

Distributed Databases Distributed database management system A distributed database (DDB) is

Searching Sequence databases 1: Searching Sequence databases 1: Blast Blast Query: Query:

Query Processing Relevance feedback; query expansion; Web Search 1 Overview Indexes Query

Query Execution in Column-Stores Atte Hinkka Seminar on Columnar Databases, Fall 2012 1

MASTERING STRATEGY EXECUTION 18 BEST PRACTICES FOR STRATEGY EXECUTION STRATEGY EXECUTION AS

Module 3: Creating and Managing Databases Overview Creating Databases Creating

DISTRIBUTED DATABASES CHAPTER 25 LECTURE OVERVIEW What are distributed databases?

DATABASE SYSTEM IMPLEMENTATION GT 4420/6422 // SPRING 2019 // @JOY_ARULRAJ LECTURE #17: QUERY

S Security in Outsourced i i O d Databases Databases (Query Answer Assurance) (Q y ) 1

THE NOSQL MOUVEMENT GENOVEVA VARGAS SOLAR FRENCH COUNCIL OF SCIENTIFIC RESEARCH, LIG-LAFMIA,

Lecture 7: Indexes and Database Tuning Wednesday, November 10, 2010 Dan Suciu -- CSEP544 Fall 1

Data and Process Modelling 4. Relational Mapping Marco Montali KRDB Research Centre for

Personell Kjell Orsborn, lecturer, examiner email: kjell.orsborn@it.uu.se, phone: 471

Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB Applications C. Faloutsos A.

Why Sort? Used for eliminating duplicates Select DISTINCT External Sorting Bulk

on Many-Core GPUs Xiaochun Ye, Dongrui Fan, Wei Lin, Nan Yuan, and Paolo Ienne Key Laboratory of

Chapter 7 External Sorting Sorting Tables Larger Than Main Memory Query Processing Sorting