Distributed Databases: Design and Query Execution Data - - PowerPoint PPT Presentation
Distributed Databases: Design and Query Execution Data - - PowerPoint PPT Presentation
Distributed Databases: Design and Query Execution Data Fragmentation and Placement Fragmentation: How to split up the data into smaller fragments? Horizontal Vertical Placement: Where to put the fragments? COMP-577:
2
COMP-577: Distributed Database Systems
Data Fragmentation and Placement
Fragmentation: How to split up the data into smaller fragments? Placement: Where to put the fragments?
Horizontal Vertical
3
COMP-577: Distributed Database Systems
Horizontal Data Fragmentation
Problem Statement:
Given a table R (a1, a2, a3,…), containing tuples t1, t2, … How to distribute R across several sites
Horizontal Fragmentation
For each tuple ti, place ti on one of the sites Fragmentation given through selection Range partitioning: σ a1 < 10(R), σ10 <= a1 < 20(R), σ 20 <= a1(R)
loc = ‘Montreal’, loc = ‘Zurich’ Correcness: each tuple (also future tuple) is member of exactly
- ne fragment (completeness and disjointness)
4
COMP-577: Distributed Database Systems
Referencing partition
Assume relations Emp(eid, name, dep) and Dep(did, name,
location); dep is foreign key referencing Dep
Dep is partitioned into Depi according to location
Ι
πdid (Depi) are disjoint sets
Emp is partitioned according to Dep
Empi = Emp πdid (Depi) SELECT eid, name, dep FROM Emp, Depi WHERE Emp.dep = Depi.did Empi are also disjoint
5
COMP-577: Distributed Database Systems
Vertical Data Fragmentation
Vertical Fragmentation
For each attribute ai, place for each tuple ti, the value of its
attribute ai on one of the sites.
In order to identify tuples, the key is replicated at each fragment! determined by projection: πK,A1,A2 (R), πK,A3,A4 (R),
6
COMP-577: Distributed Database Systems
Fragmentation and Query Execution
LAN Fragment such that your typical queries are executed fast Full Scan
SELECT * FROM R Many result tuples => distribute work Idea: distribute data evenly among several sites and read data in parallel Expect scanning time long => assemble and communication overhead neglegible
Point Query
SELECT * FROM R WHERE a1 = 5 Few result tuples = localize work Idea: if all relevant data on one site, only one site has to scan its partition; no
work on the other sites
Expect search time short: assemble and communication overhead have too big impact
WAN Each query should be executed solely locally
7
COMP-577: Distributed Database Systems
Range Partitioning
Partitioning Vector indicating domain ranges of an attribute V=(4;7) Parallel Query Execution + good for full scan if even distribution + good for range queries on partition attribute + ok for point queries - not good for range/point queries not on partition attribute Need for good vector otherwise data skew and execution skew Oracle provides range partitioning Local Query Execution
Good if partition related to geography
8
COMP-577: Distributed Database Systems
Data Placement
Which partition is stored on which computers Which partitions should be replicated and where Highly related to data fragmentation Optimization Problems:
Efficiency
Final goal
Minimize query response time Maximize throughput
Achieved by minimizing “costs” minimize WAN remote access Optimize parallelism Minimize bottlenecks (communication system, local machined)
Availability
Choose nodes according to their availability replication
9
COMP-577: Distributed Database Systems
Data Placement
Data Allocation depends on many parameters
Number of nodes Storage capacity and costs on each site Number of fragments Size of fragments Number of different queries/updates on specific fragments Frequency of each query/update operation Size of input query, number of results tuples per query Communication costs per data unit What query processing strategies do exist (type of joins etcs.)
This is a hard problem!
10
COMP-577: Distributed Database Systems
Query Processing I
R
E = 3
SELECT A,B FROM R WHERE E=3 Generate Execution Tree
π A,B
11
COMP-577: Distributed Database Systems
Query Processing II
[R1: E < 10]
E = 3
[R2: E >= 10] ∪
Localization 1: Replace Relations by Fragments
π A,B
12
COMP-577: Distributed Database Systems
Query Processing III
[R1: E < 10]
E = 3
[R2: E >= 10] ∪
Localization II
Push ∪ up Push π and σ down
[R1: E < 10]
E = 3
[R2: E >= 10] ∪
E = 3
π A,B π A,B π A,B
13
COMP-577: Distributed Database Systems
Query Processing IV
Localization III
Simplify
Eliminate unnecessary operations
[R1: E < 10]
E = 3
[R2: E >= 10] ∪
E = 3
[R1: E < 10]
E = 3 = ∅
π A,B π A,B π A,B
14
COMP-577: Distributed Database Systems
Summary
Many different rules to simplify Is getting very sophisticated for
Joins Aggregation (sum, max, etc.)
Localization with vertical fragmentation
slightly different
15
COMP-577: Distributed Database Systems
Heterogeneous Query Processing
Mediator / Middleware Wrapper Wrapper DB2 DB1 DB1 catalog client client
16
COMP-577: Distributed Database Systems
Work distribution
Mediator
Schema Integration
must know the schemas of the component DB (external schemas) find a reasonable global schema for the separate schemas translate the external local schemas of a component DB into global
schema and vs. versa
this is its whole own research area very complex if data sources have different data models
(Relational, OO, XML, html, file-system) Own data dictionary (catalog)
global schema, translation mechanisms, information about local
schemas, …
some statistics of the component DB
size of tables, query capacity of component DB, etc.
17
COMP-577: Distributed Database Systems
Work distribution
Mediator
global query execution
query parsing, rewrite, query optimization executes some of the operations (joining, extra operations on the
data retrieved from data sources) might cache data
18
COMP-577: Distributed Database Systems
Work distribution
Wrapper (adaptor)
translates every request of the mediator into request that is
understood by the underlying DB
translates the results into format understood by the mediator Wrappers for relational DB, file systems, web-based DB wrapper is client of DB source: can cache data and execute
parts of the query by itself (e.g. if data source only returns html pages)
Expect to have many wrappers for many different data sources
new data source -> new wrapper, but same mediator
Key feature of being extensible
Query optimization and execution
distributed between mediator, wrapper and data source