

SLIDE 1

Distributed Databases: Design and Query Execution

SLIDE 2

COMP-577: Distributed Database Systems

Data Fragmentation and Placement

- Fragmentation: how do we split the data into smaller fragments?
  - Horizontal
  - Vertical
- Placement: where do we put the fragments?

SLIDE 3

Horizontal Data Fragmentation

Problem Statement:

- Given a table R(a1, a2, a3, …) containing tuples t1, t2, …
- How do we distribute R across several sites?

Horizontal Fragmentation

- For each tuple ti, place ti on one of the sites.
- The fragmentation is given through selections:
  - Range partitioning: σ a1 < 10(R), σ 10 <= a1 < 20(R), σ 20 <= a1(R)
  - By value: loc = 'Montreal', loc = 'Zurich'
- Correctness: each tuple (including future tuples) is a member of exactly one fragment (completeness and disjointness).
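The range-partitioning rule above can be sketched directly; this is an illustrative example (the site names and the dict-based row format are assumptions, not part of the slides):

```python
# Hypothetical sketch of horizontal fragmentation by range partitioning on
# attribute a1, mirroring the selections sigma_{a1<10}, sigma_{10<=a1<20},
# and sigma_{20<=a1} from the slide. Rows are plain dicts; sites are made up.

def fragment_horizontal(rows):
    """Assign each tuple to exactly one fragment (completeness + disjointness)."""
    fragments = {"site_a": [], "site_b": [], "site_c": []}
    for row in rows:
        if row["a1"] < 10:
            fragments["site_a"].append(row)
        elif row["a1"] < 20:           # 10 <= a1 < 20
            fragments["site_b"].append(row)
        else:                          # 20 <= a1
            fragments["site_c"].append(row)
    return fragments

rows = [{"a1": 3}, {"a1": 15}, {"a1": 42}]
frags = fragment_horizontal(rows)
# Every tuple lands in exactly one fragment:
assert sum(len(f) for f in frags.values()) == len(rows)
```

Because the three selection predicates cover the whole domain of a1 and do not overlap, completeness and disjointness hold by construction.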
SLIDE 4

Partitioning by Reference

- Assume relations Emp(eid, name, dep) and Dep(did, name, location); dep is a foreign key referencing Dep.
- Dep is partitioned into fragments Depi according to location.
  - The sets πdid(Depi) are disjoint.
- Emp is partitioned according to Dep:
  - Empi = Emp ⋉ πdid(Depi), i.e.
    SELECT eid, name, dep FROM Emp, Depi WHERE Emp.dep = Depi.did
- The fragments Empi are also disjoint.

SLIDE 5

Vertical Data Fragmentation

Vertical Fragmentation

- For each attribute ai, place, for each tuple ti, the value of its attribute ai on one of the sites.
- In order to identify tuples, the key K is replicated at each fragment!
- Determined by projections: πK,A1,A2(R), πK,A3,A4(R), …
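A minimal sketch of these projections, with the key replicated into every fragment (attribute names and the grouping are illustrative, not from the slides):

```python
# Vertical fragmentation sketch: split each tuple by attribute groups,
# replicating the key K in every fragment so tuples can be rejoined later.

def fragment_vertical(rows, key, groups):
    """groups: list of attribute lists; returns one fragment per group,
    each fragment carrying the key plus its group's attributes."""
    return [
        [{key: r[key], **{a: r[a] for a in attrs}} for r in rows]
        for attrs in groups
    ]

rows = [{"k": 1, "a1": "x", "a2": "y", "a3": "z"}]
f1, f2 = fragment_vertical(rows, "k", [["a1", "a2"], ["a3"]])
# f1 corresponds to pi_{K,A1,A2}(R), f2 to pi_{K,A3}(R).
```

Rejoining the fragments on the replicated key reconstructs the original tuples, which is exactly why the key must appear in every fragment.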

SLIDE 6

Fragmentation and Query Execution

- LAN
  - Fragment such that your typical queries are executed fast.
  - Full scan: SELECT * FROM R
    - Many result tuples => distribute the work.
    - Idea: distribute the data evenly among several sites and read it in parallel.
    - The expected scanning time is long => assembly and communication overhead are negligible.
  - Point query: SELECT * FROM R WHERE a1 = 5
    - Few result tuples => localize the work.
    - Idea: if all relevant data is on one site, only that site has to scan its partition; no work on the other sites.
    - The expected search time is short: assembly and communication overhead would have too big an impact.
- WAN
  - Each query should be executed solely locally.
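The full-scan versus point-query intuition can be sketched as follows; the fragments, the partitioning rule, and thread-based "sites" are all illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

# Two toy fragments of R, range-partitioned on the value itself.
fragments = {0: list(range(0, 10)), 1: list(range(10, 20))}

def full_scan():
    """All sites scan their fragment in parallel; results are assembled."""
    with ThreadPoolExecutor() as ex:
        parts = ex.map(lambda frag: list(frag), fragments.values())
    return [x for part in parts for x in part]

def point_query(v):
    """Only the single site holding v does any work at all."""
    site = 0 if v < 10 else 1
    return [x for x in fragments[site] if x == v]

assert len(full_scan()) == 20   # every site contributed
assert point_query(5) == [5]    # one site, tiny result
```

For the full scan the per-site work dominates, so the assembly step is cheap relative to it; for the point query the opposite holds, which is why localizing it to one site pays off.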

SLIDE 7

Range Partitioning

- Partitioning vector indicating the domain ranges of an attribute, e.g. V = (4; 7).
- Parallel query execution:
  - (+) good for full scans if the distribution is even
  - (+) good for range queries on the partitioning attribute
  - (+) OK for point queries
  - (−) not good for range/point queries on attributes other than the partitioning attribute
  - A well-chosen vector is needed; otherwise data skew and execution skew result.
  - Oracle provides range partitioning.
- Local query execution:
  - Good if the partitioning is related to geography.
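The vector V = (4; 7) from the slide defines three ranges; a minimal sketch of how a tuple's target partition is found (the function name is invented):

```python
import bisect

# A partitioning vector with n boundaries defines n + 1 partitions:
# here V = (4; 7) gives the ranges  value < 4,  4 <= value < 7,  7 <= value.
V = [4, 7]

def partition_of(value):
    """Index of the partition holding `value` under vector V."""
    return bisect.bisect_right(V, value)

assert partition_of(3) == 0   # value < 4
assert partition_of(5) == 1   # 4 <= value < 7
assert partition_of(9) == 2   # 7 <= value
```

If most values fall into one range, that partition's site stores and scans disproportionately much data, which is the data skew and execution skew the slide warns about.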

SLIDE 8

Data Placement

- Which partition is stored on which computers?
- Which partitions should be replicated, and where?
- Highly related to data fragmentation.
- Optimization problems:
  - Efficiency
    - Final goals: minimize query response time, maximize throughput.
    - Achieved by minimizing "costs":
      - minimize remote accesses over the WAN
      - optimize parallelism
      - minimize bottlenecks (communication system, local machines)
  - Availability
    - Choose nodes according to their availability; use replication.

SLIDE 9

Data Placement

- Data allocation depends on many parameters:
  - number of nodes
  - storage capacity and costs at each site
  - number of fragments
  - size of the fragments
  - number of different queries/updates on specific fragments
  - frequency of each query/update operation
  - size of the input query, number of result tuples per query
  - communication costs per data unit
  - which query processing strategies exist (types of joins, etc.)
- This is a hard problem!

SLIDE 10

Query Processing I

- SELECT A, B FROM R WHERE E = 3
- Generate the execution tree: π A,B (σ E=3 (R))

SLIDE 11

Query Processing II

- Localization 1: replace relations by their fragments:
  π A,B (σ E=3 ([R1: E < 10] ∪ [R2: E >= 10]))

SLIDE 12

Query Processing III

- Localization II:
  - Push ∪ up, push π and σ down:
    π A,B (σ E=3 ([R1: E < 10] ∪ [R2: E >= 10]))
    becomes
    π A,B (σ E=3 ([R1: E < 10])) ∪ π A,B (σ E=3 ([R2: E >= 10]))

SLIDE 13

Query Processing IV

- Localization III:
  - Simplify: eliminate unnecessary operations.
    π A,B (σ E=3 ([R1: E < 10])) ∪ π A,B (σ E=3 ([R2: E >= 10]))
    reduces to
    π A,B (σ E=3 ([R1: E < 10])),
    since σ E=3 ([R2: E >= 10]) = ∅.
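The elimination step compares the selection predicate against each fragment's defining predicate; a toy sketch (fragment predicates are the ones from the slides, the function name is invented):

```python
# Localization III sketch: a selection sigma_{E = v} on a fragment whose
# defining predicate cannot be satisfied by v yields the empty set, so that
# branch of the union is dropped from the plan.

fragments = {
    "R1": lambda e: e < 10,    # R1: E < 10
    "R2": lambda e: e >= 10,   # R2: E >= 10
}

def relevant_fragments(e_value):
    """Fragments whose predicate is compatible with sigma_{E = e_value}."""
    return [name for name, pred in fragments.items() if pred(e_value)]

# sigma_{E=3}(R2) is empty, so only R1 survives in the localized plan:
assert relevant_fragments(3) == ["R1"]
```

In general an optimizer proves the conjunction of the two predicates unsatisfiable rather than evaluating them on a single value, but the effect on the plan is the same.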

SLIDE 14

Summary

- Many different rules to simplify.
- It gets very sophisticated for:
  - joins
  - aggregation (sum, max, etc.)
- Localization with vertical fragmentation is slightly different.

SLIDE 15

Heterogeneous Query Processing

[Architecture diagram: clients send queries to a mediator/middleware, which holds its own catalog; one wrapper per component database (DB1, DB2) translates between the mediator and its source.]

SLIDE 16

Work distribution

- Mediator
  - Schema integration:
    - must know the schemas of the component DBs (external schemas)
    - find a reasonable global schema for the separate schemas
    - translate the external local schemas of a component DB into the global schema and vice versa
    - this is a whole research area of its own
    - very complex if the data sources have different data models (relational, OO, XML, HTML, file system)
  - Own data dictionary (catalog):
    - global schema, translation mechanisms, information about local schemas, …
    - some statistics about the component DBs (size of tables, query capacity of the component DB, etc.)

SLIDE 17

Work distribution

- Mediator
  - Global query execution:
    - query parsing, rewrite, query optimization
    - executes some of the operations itself (joins, extra operations on the data retrieved from the data sources)
    - might cache data

SLIDE 18

Work distribution

- Wrapper (adaptor)
  - translates every request of the mediator into a request that is understood by the underlying DB
  - translates the results into a format understood by the mediator
  - wrappers exist for relational DBs, file systems, web-based DBs
  - the wrapper is a client of the DB source: it can cache data and execute parts of the query by itself (e.g. if the data source only returns HTML pages)
  - expect many wrappers for many different data sources
    - new data source -> new wrapper, but the same mediator
    - a key feature for extensibility
- Query optimization and execution
  - distributed between the mediator, the wrappers, and the data sources
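The mediator/wrapper division of labor can be sketched as two tiny classes; all class and method names here are invented for illustration, not part of any real middleware API:

```python
# Toy mediator/wrapper sketch: each wrapper translates its source's native
# format into the mediator's common format (dicts); the mediator merges the
# per-source results. Adding a source means adding a wrapper, not touching
# the mediator.

class CsvWrapper:
    """Wrapper for a file-like source that only 'speaks' raw CSV lines."""
    def __init__(self, lines):
        self.lines = lines
    def query(self, table):
        # translate results into the mediator's common format
        header, *rows = [line.split(",") for line in self.lines]
        return [dict(zip(header, row)) for row in rows]

class Mediator:
    """Knows the global schema; delegates to one wrapper per source."""
    def __init__(self, wrappers):
        self.wrappers = wrappers
    def query(self, table):
        results = []
        for w in self.wrappers:     # global execution: collect and merge
            results.extend(w.query(table))
        return results

m = Mediator([CsvWrapper(["eid,name", "1,Ana"]),
              CsvWrapper(["eid,name", "2,Ben"])])
assert m.query("Emp") == [{"eid": "1", "name": "Ana"},
                          {"eid": "2", "name": "Ben"}]
```

A real system would also push selections and projections down into each wrapper when the source can evaluate them, which is the "distributed between mediator, wrapper and data source" point above.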