CS377: Database Systems Distributed Databases Distributed Databases - - PowerPoint PPT Presentation

cs377 database systems distributed databases distributed
SMART_READER_LITE
LIVE PREVIEW

CS377: Database Systems Distributed Databases Distributed Databases - - PowerPoint PPT Presentation

CS377: Database Systems Distributed Databases Distributed Databases Department of Mathematics and Computer Science Emory University 1 Centralized DBMS on a Network


slide-1
SLIDE 1

CS377: Database Systems Distributed Databases

1

Distributed Databases

  • Department of Mathematics and Computer Science

Emory University

slide-2
SLIDE 2

Centralized DBMS on a Network

  • 2
slide-3
SLIDE 3

Distributed DBMS Environment

  • 3
slide-4
SLIDE 4

Distributed Database System

A distributed database (DDB) is a collection of

multiple, databases distributed

  • ver a

4

A distributed database management system (D–

DBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the users.

Distributed database system (DDBS) = DDB + D–

DBMS

slide-5
SLIDE 5
slide-6
SLIDE 6

Distributed Database System

The EMPLOYEE, PROJECT, and WORKS_ON tables may be fragmented horizontally and stored with possible replication as shown below.

6

slide-7
SLIDE 7

Distributed DBMS Promises

Transparent management of distributed, fragmented, and replicated data Improved reliability/availability through distributed

7

transactions Improved performance Easier and more economical system expansion

slide-8
SLIDE 8

Distributed DBMS Issues

How to distribute the database

8

Optimize cost = data transmission + local processing

slide-9
SLIDE 9

Distributed DBMS Issues

Synchronization of concurrent accesses Consistency and isolation of transactions' effects

9

Deadlock management How to make the system resilient to failures Atomicity and durability

slide-10
SLIDE 10

Distributed database design

Data distribution

TopAdown A mostly in designing systems from scratch BottomAup A when the databases already exist at a

number of sites

10

Unit of distribution

relation fragments of relations (subArelations)

Data are inherently fragmented, e.g. in locality Allow concurrent execution of a number of transactions that

access different portions of a relation

slide-11
SLIDE 11

Example

Employee relation E (#,name,loc,sal,…) 40% of queries: 40% of queries: Qa: select * Qb: select * from E from E where loc=Sa where loc=Sb

11

where loc=Sa where loc=Sb and… and ...

Motivation: Two sites: Sa, Sb Qa → ← Qb

slide-12
SLIDE 12

Fragmentation Alternatives – Horizontal

PROJ1 : projects with budgets less than $200,000 PROJ2 : projects with budgets greater than or equal to

  • !!!!

" # $ % !!!! % &'"() # !!! * + #!!!!

  • $

!!!!!

  • 12

greater than or equal to $200,000

  • #

$ % !!!!

  • *

+ #!!!!

  • $

!!!!!

  • !!!!

" % &'"() # !!!

  • %
  • $

!!!!!

slide-13
SLIDE 13

Fragmentation Alternatives – Vertical

PROJ1:information about project budgets PROJ2:information about project names and

  • !!!!

" # $ % !!!! % &'"() # !!! * + #!!!!

  • $

!!!!!

  • 13

project names and locations

  • !!!!

# % !!!! % # !!! * #!!!!

  • !!!!!
  • "

# $

  • %

&'"()

  • *

+

  • $
  • %
  • $

!!!!!

slide-14
SLIDE 14

Data Fragmentation, Replication and Allocation

A horizontal subset of a relation which contain those of tuples

which satisfy selection conditions.

E.g. Employee relation with selection condition (DNO = 5)

Can be specified by a σ

σ σ σCi (R) operation in the relational algebra.

14

Can be specified by a σ

σ σ σCi (R) operation in the relational algebra.

Complete horizontal fragmentation

A set of horizontal fragments whose conditions C1, C2, …, Cn

include all the tuples in RA every tuple in R satisfies (C1 OR C2 OR … OR Cn).

Disjoint complete horizontal fragmentation: No tuple in R

satisfies (Ci AND Cj) where i ≠ j.

How to reconstruct R from complete horizontal fragments?

slide-15
SLIDE 15

Three common horizontal partitioning techniques

Round robin Hash partitioning Range partitioning

15

15

slide-16
SLIDE 16
  • Round robin

R D0 D1 D2 t1 t1 t2 t2 t3 t3 t4 t4

16

t4 t4 ... t5

slide-17
SLIDE 17
  • Hash partitioning

R D0 D1 D2 t1→h(k1)=2 t1 t2→h(k2)=0 t2 t3→h(k3)=0 t3

17

t3→h(k3)=0 t3 t4→h(k4)=1 t4 ...

slide-18
SLIDE 18
  • Range partitioning

R D0 D1 D2 t1: A=5 t1 t2: A=8 t2 t3: A=2 t3 t4: A=3 t4

  • 18

t4: A=3 t4 ...

slide-19
SLIDE 19

Data Fragmentation, Replication and Allocation

A vertical subset of a relation that contains a subset of

columns.

E.g. Employee relation: a vertical fragment of Name, Bdate, Sex Can be specified by a ΠLi(R) operation in the relational algebra.

19

Can be specified by a ΠLi(R) operation in the relational algebra. Each fragment must include the primary key attribute of the parent

relation Employee Complete vertical fragmentation

A set of vertical fragments whose projection lists L1, L2, …, Ln

include all the attributes in R but share only the primary key of R.

L1 ∪ L2 ∪ ... ∪ Ln = ATTRS (R) Li ∩ Lj = PK(R) for any i j

How to reconstruct R from complete vertical fragments?

slide-20
SLIDE 20

Data Fragmentation, Replication and Allocation

A combination of Vertical fragmentation and Horizontal

fragmentation.

This is achieved by SELECTAPROJECT operations

which is represented by ΠLi(σ σ σ σCi (R))

20

which is represented by ΠLi(σ σ σ σCi (R))

slide-21
SLIDE 21

Data Fragmentation, Replication and Allocation

!"

A definition of a set of fragments (horizontal or vertical or

mixed) that can reconstruct the original database

#"

Distribution of fragments to sites of distributed databases. It

21

Distribution of fragments to sites of distributed databases. It

can be fully or partially replicated or can be partitioned

$

Full replication: database is replicated to all sites. Partial replication: some selected part is replicated

slide-22
SLIDE 22

Distributed Database System

The EMPLOYEE, PROJECT, and WORKS_ON tables may be fragmented horizontally and stored with possible replication as shown below.

22

slide-23
SLIDE 23

Distributed DBMS Issues

How to distribute the database

23

Optimize cost = data transmission + local processing

slide-24
SLIDE 24

Query Processing in Distributed Databases

Cost of transferring data (files and results) over the network is usually high Example:

Employee at site 1 and Department at Site 2

Employee at site 1. 10,000 rows. Row size = 100 bytes. Table

size = 106 bytes.

24

size = 106 bytes.

Department at Site 2. 100 rows. Row size = 35 bytes. Table size

= 3,500 bytes.

Q submitted at Site 3: retrieve employee name and department name

where the employee works.

ΠFname,Lname,Dname (Employee Dno = Dnumber Department) Result has 10,000 tuples and each result tuple is 40 bytes

slide-25
SLIDE 25

Query Processing in Distributed Databases

Strategies:

  • 1. Transfer Employee and Department to site 3.

Total transfer size

  • 2. Transfer Employee to site 2, execute join at site 2 and send

the result to site 3.

25

the result to site 3.

Total transfer size

  • 3. Transfer Department relation to site 1, execute the join at site

1, and send the result to site 3.

Total bytes transferred

Optimization criteria: minimizing data transfer. Which strategy?

slide-26
SLIDE 26

Query Processing in Distributed Databases

Strategies:

  • 1. Transfer Employee and Department to site 3.

Total transfer bytes = 1,000,000 + 3500 = 1,003,500 bytes.

  • 2. Transfer Employee to site 2, execute join at site 2 and send

the result to site 3.

26

the result to site 3.

Query result size = 40 * 10,000 = 400,000 bytes. Total transfer

size = 400,000 + 1,000,000 = 1,400,000 bytes.

  • 3. Transfer Department relation to site 1, execute the join at site

1, and send the result to site 3.

Total bytes transferred = 400,000 + 3500 = 403,500 bytes.

Optimization criteria: minimizing data transfer.

Preferred approach: strategy 3.

slide-27
SLIDE 27

Query Processing in Distributed Databases

What if Q is submitted at site 2? Example:

Employee at site 1 and Department at Site 2

Employee at site 1. 10,000 rows. Row size = 100 bytes. Table

size = 106 bytes.

27

size = 106 bytes.

Department at Site 2. 100 rows. Row size = 35 bytes. Table size

= 3,500 bytes.

Q submitted at Site 2: retrieve employee name and department name

where the employee works.

ΠFname,Lname,Dname (Employee Dno = Dnumber Department) Result has 10,000 tuples and each result tuple is 40 bytes

slide-28
SLIDE 28

Query Processing in Distributed Databases

  • Semijoin:
  • Objective is to reduce the number of tuples in a relation

before transferring it to another site.

  • Example execution of Q:
  • 1. Project the join attributes of Department at site 2, and

transfer them to site 1. For Q, 4 * 100 = 400 bytes are

28

transfer them to site 1. For Q, 4 * 100 = 400 bytes are transferred

  • 2. Join the transferred file with the Employee relation at site 1,

and transfer the required attributes from the resulting file to site 2. For Q, 32 * 10,000 = 320,000 bytes are transferred

  • 3. Execute the query by joining the transferred file with

Department and present the result to the user at site 2.

SemiAjoin

⋉ ⋉ ⋉ = Π ( join ).

slide-29
SLIDE 29

Parallel Databases

Parallel database

Using parallel processers

Architectures

Shared memory

29

Shared memory Shared disk Shared nothing

Data partitioning (shard)