distributed databases design and query execution data
play

Distributed Databases: Design and Query Execution Data - PowerPoint PPT Presentation

Distributed Databases: Design and Query Execution Data Fragmentation and Placement Fragmentation: How to split up the data into smaller fragments? Horizontal Vertical Placement: Where to put the fragments? COMP-577:


  1. Distributed Databases: Design and Query Execution

  2. Data Fragmentation and Placement  Fragmentation:  How to split up the data into smaller fragments? Horizontal Vertical  Placement:  Where to put the fragments? COMP-577: Distributed Database Systems 2

  3. Horizontal Data Fragmentation Problem Statement:   Given a table R (a1, a2, a3,…), containing tuples t1, t2, …  How to distribute R across several sites Horizontal Fragmentation   For each tuple ti, place ti on one of the sites  Fragmentation given through selection  Range partitioning: σ a1 < 10 (R), σ 10 <= a1 < 20 (R), σ 20 <= a1 (R)  loc = ‘Montreal’, loc = ‘Zurich’  Correcness: each tuple (also future tuple) is member of exactly one fragment (completeness and disjointness) COMP-577: Distributed Database Systems 3

  4. Referencing partition  Assume relations Emp(eid, name, dep) and Dep(did, name, location); dep is foreign key referencing Dep  Dep is partitioned into Dep i according to location π did (Dep i ) are disjoint sets Ι  Emp is partitioned according to Dep  Emp i = Emp π did (Dep i ) SELECT eid, name, dep FROM Emp, Dep i WHERE Emp.dep = Dep i .did  Emp i are also disjoint COMP-577: Distributed Database Systems 4

  5. Vertical Data Fragmentation Vertical Fragmentation   For each attribute ai, place for each tuple ti, the value of its attribute ai on one of the sites.  In order to identify tuples, the key is replicated at each fragment!  determined by projection : π K,A1,A2 (R), π K,A3,A4 (R), COMP-577: Distributed Database Systems 5

  6. Fragmentation and Query Execution  LAN  Fragment such that your typical queries are executed fast  Full Scan  SELECT * FROM R  Many result tuples => distribute work  Idea: distribute data evenly among several sites and read data in parallel  Expect scanning time long => assemble and communication overhead neglegible  Point Query  SELECT * FROM R WHERE a1 = 5  Few result tuples = localize work  Idea: if all relevant data on one site, only one site has to scan its partition; no work on the other sites  Expect search time short: assemble and communication overhead have too big impact  WAN  Each query should be executed solely locally COMP-577: Distributed Database Systems 6

  7. Range Partitioning  Partitioning Vector indicating domain ranges of an attribute V=(4;7)  Parallel Query Execution  + good for full scan if even distribution  + good for range queries on partition attribute  + ok for point queries  - not good for range/point queries not on partition attribute  Need for good vector otherwise data skew and execution skew  Oracle provides range partitioning  Local Query Execution  Good if partition related to geography COMP-577: Distributed Database Systems 7

  8. Data Placement  Which partition is stored on which computers  Which partitions should be replicated and where  Highly related to data fragmentation  Optimization Problems:  Efficiency  Final goal  Minimize query response time  Maximize throughput  Achieved by minimizing “costs”  minimize WAN remote access  Optimize parallelism  Minimize bottlenecks (communication system, local machined)  Availability  Choose nodes according to their availability  replication COMP-577: Distributed Database Systems 8

  9. Data Placement  Data Allocation depends on many parameters  Number of nodes  Storage capacity and costs on each site  Number of fragments  Size of fragments  Number of different queries/updates on specific fragments  Frequency of each query/update operation  Size of input query, number of results tuples per query  Communication costs per data unit  What query processing strategies do exist (type of joins etcs.)  This is a hard problem! COMP-577: Distributed Database Systems 9

  10. Query Processing I  SELECT A,B FROM R WHERE E=3  Generate Execution Tree π A,B E = 3 R COMP-577: Distributed Database Systems 10

  11. Query Processing II  Localization 1: Replace Relations by Fragments π A,B E = 3 ∪ [R1: E < 10] [R2: E >= 10] COMP-577: Distributed Database Systems 11

  12. Query Processing III  Localization II  Push ∪ up  Push π and σ down π A,B ∪ π A,B π A,B E = 3 E = 3 E = 3 ∪ [R1: E < 10] [R2: E >= 10] [R1: E < 10] [R2: E >= 10] COMP-577: Distributed Database Systems 12

  13. Query Processing IV  Localization III  Simplify  Eliminate unnecessary operations ∪ π A,B π A,B π A,B E = 3 E = 3 E = 3 [R1: E < 10] [R1: E < 10] [R2: E >= 10] = ∅ COMP-577: Distributed Database Systems 13

  14. Summary  Many different rules to simplify  Is getting very sophisticated for  Joins  Aggregation (sum, max, etc.)  Localization with vertical fragmentation slightly different COMP-577: Distributed Database Systems 14

  15. Heterogeneous Query Processing client client Mediator / catalog Middleware Wrapper Wrapper DB1 DB1 DB2 COMP-577: Distributed Database Systems 15

  16. Work distribution  Mediator  Schema Integration  must know the schemas of the component DB (external schemas)  find a reasonable global schema for the separate schemas  translate the external local schemas of a component DB into global schema and vs. versa  this is its whole own research area  very complex if data sources have different data models (Relational, OO, XML, html, file-system)  Own data dictionary (catalog)  global schema, translation mechanisms, information about local schemas, …  some statistics of the component DB  size of tables, query capacity of component DB, etc. COMP-577: Distributed Database Systems 16

  17. Work distribution  Mediator  global query execution  query parsing, rewrite, query optimization  executes some of the operations (joining, extra operations on the data retrieved from data sources)  might cache data COMP-577: Distributed Database Systems 17

  18. Work distribution  Wrapper (adaptor)  translates every request of the mediator into request that is understood by the underlying DB  translates the results into format understood by the mediator  Wrappers for relational DB, file systems, web-based DB  wrapper is client of DB source: can cache data and execute parts of the query by itself (e.g. if data source only returns html pages)  Expect to have many wrappers for many different data sources  new data source -> new wrapper, but same mediator  Key feature of being extensible  Query optimization and execution  distributed between mediator, wrapper and data source COMP-577: Distributed Database Systems 18

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend