Systems Infrastructure for Data Science Web Science Group Uni - - PowerPoint PPT Presentation

systems infrastructure for data science
SMART_READER_LITE
LIVE PREVIEW

Systems Infrastructure for Data Science Web Science Group Uni - - PowerPoint PPT Presentation

Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Lecture VIII: Fragmentation Fragmentation Fragments should be subsets of database relations due to two main reasons: Access locality: Application views


slide-1
SLIDE 1

Systems Infrastructure for Data Science

Web Science Group Uni Freiburg WS 2012/13

slide-2
SLIDE 2

Lecture VIII: Fragmentation

slide-3
SLIDE 3

Fragmentation

  • Fragments should be subsets of database

relations due to two main reasons:

– Access locality: Application views are subsets of

  • relations. Also, multiple views that access a relation

may reside at different sites. – Query concurrency and system throughput: Sub- queries can operate on fragments in parallel.

  • Main issues:

– Views that cannot be defined on a single fragment will require extra processing and communication cost. – Semantic data control (e.g., integrity checking) of dependent fragments residing at different sites is more complicated and costly.

Uni Freiburg, WS2012/13 3 Systems Infrastructure for Data Science

slide-4
SLIDE 4

Fragmentation Alternatives

  • Horizontal fragmentation

– Primary horizontal fragmentation – Derived horizontal fragmentation

  • Vertical fragmentation
  • Hybrid fragmentation

Uni Freiburg, WS2012/13 4 Systems Infrastructure for Data Science

slide-5
SLIDE 5

Example Database

Uni Freiburg, WS2012/13 5 Systems Infrastructure for Data Science

slide-6
SLIDE 6

Horizontal Fragmentation Example

Projects with BUDGET < $200,000 Projects with BUDGET ≥ $200,000

Uni Freiburg, WS2012/13 6 Systems Infrastructure for Data Science

slide-7
SLIDE 7

Vertical Fragmentation Example

Project budgets Project names and locations

Uni Freiburg, WS2012/13 7 Systems Infrastructure for Data Science

slide-8
SLIDE 8

Hybrid Fragmentation Example

Projects with BUDGET < $200,000 Projects with BUDGET ≥ $200,000 Project budgets Project names and locations Horizontal Vertical

Uni Freiburg, WS2012/13 8 Systems Infrastructure for Data Science

slide-9
SLIDE 9

Correctness of Fragmentation

  • Completeness

– Decomposition of relation R into fragments R1, R2, .., Rn is complete iff each data item in R can also be found in one or more of Ri’s.

  • Reconstruction

– If a relation R is decomposed into fragments R1, R2, .., Rn, then there should exist a relational operator θ such that R = θ1≤i≤nRi.

  • Disjointness

– If a relation R is horizontally (vertically) decomposed into fragments R1, R2, .., Rn, and data item di (non-primary key attribute di) is in Rj, then di should not be in any other fragment Rk (k ≠ j).

Uni Freiburg, WS2012/13 9 Systems Infrastructure for Data Science

slide-10
SLIDE 10

Horizontal Fragmentation Algorithms

What is given?

  • Relationships among database relations

Li: one-to-many relationship from an “owner” to a “member”

Uni Freiburg, WS2012/13 10 Systems Infrastructure for Data Science

slide-11
SLIDE 11

Horizontal Fragmentation Algorithms

What is given?

  • Cardinality of each database relation
  • Mostly used predicates in user queries
  • Predicate selectivities
  • Access frequencies for data

Uni Freiburg, WS2012/13 11 Systems Infrastructure for Data Science

slide-12
SLIDE 12

Horizontal Fragmentation Algorithms

Predicates

  • Simple predicate

– Given R(A1, A2, .., An), a simple predicate pj is defined as “pj: Ai θ value”, where θ є {=, <, ≤, >, ≥, ≠} and value є Di, where Di is the domain of Ai. – Examples: PNAME = “Maintenance” BUDGET ≤ 200000

  • Minterm predicate

– A conjunction of simple and negated simple predicates – Examples: PNAME = “Maintenance” AND BUDGET ≤ 200000 NOT(PNAME = “Maintenance”) AND BUDGET ≤ 200000

Uni Freiburg, WS2012/13 12 Systems Infrastructure for Data Science

slide-13
SLIDE 13

Primary Horizontal Fragmentation

Definition

  • Given an owner relation R, its horizontal fragments are

given by Ri = σFi(R), 1 ≤ i ≤ w where Fi is a minterm predicate.

  • First step: Determine a set of simple predicates that will

form the minterm predicates. This set of simple predicates must have two key properties:

– completeness – minimality

Uni Freiburg, WS2012/13 13 Systems Infrastructure for Data Science

slide-14
SLIDE 14

Completeness of Simple Predicates

Definition

  • A set of simple predicates P is complete iff the

accesses to the tuples of the minterm fragments defined on P requires that two tuples of the same minterm fragment have the same probability of being accessed by any application.

Uni Freiburg, WS2012/13 14 Systems Infrastructure for Data Science

slide-15
SLIDE 15

Completeness of Simple Predicates

Example

App 1: Find the budgets of projects at each location. App 2: Find projects with budgets less than $200000. Set of simple predicates: P = {LOC=“Montreal”, LOC=“New York”, LOC=“Paris”} P = {LOC=“Montreal”, LOC=“New York”, LOC=“Paris”, BUDGET ≤ 200000, BUDGET > 200000}

complete

Uni Freiburg, WS2012/13 15 Systems Infrastructure for Data Science

slide-16
SLIDE 16

Minimality of Simple Predicates

Definition

  • A set of simple predicates P is complete iff for each

predicate p є P:

– if p influences how fragmentation is performed (i.e., causes a fragment f to be further fragmented into fi anf fj), then there should be at least one application that accesses fi and fj differently.

Uni Freiburg, WS2012/13 16 Systems Infrastructure for Data Science

slide-17
SLIDE 17

Minimality of Simple Predicates

Example

App 1: Find the budgets of projects at each location. App 2: Find projects with budgets less than $200000. P = {LOC=“Montreal”, LOC=“New York”, LOC=“Paris”, BUDGET ≤ 200000, BUDGET > 200000}

+ PNAME=“Instrumentation”

P = {LOC=“Montreal”, LOC=“New York”, LOC=“Paris”, BUDGET ≤ 200000, BUDGET > 200000, PNAME=“Instrumentation”}

complete & minimal complete & NOT minimal

Uni Freiburg, WS2012/13 17 Systems Infrastructure for Data Science

slide-18
SLIDE 18

Primary Horizontal Fragmentation COM_MIN Algorithm Sketch

  • Input: a relation R and a set of simple predicates Pr
  • Output: a complete and minimal set of simple predicates

Pr’ for Pr

  • Rule 1: A relation or fragment is partitioned into at least

two parts which are accessed differently by at least one application.

  • Find a pi є Pr such that pi partitions R according to Rule 1.

Initialize Pr’ = pi.

  • Iteratively add predicates to Pr’ until it is complete.

Uni Freiburg, WS2012/13 18 Systems Infrastructure for Data Science

slide-19
SLIDE 19

Primary Horizontal Fragmentation PHORIZONTAL Algorithm Sketch

  • Input: a relation R and a set of simple predicates Pr
  • Output: a set of minterm predicates M according to

which relation R is to be fragmented

  • Pr’ ← COM_MIN(R, Pr)
  • Determine the set M of minterm predicates
  • Determine the set I of implications among pi є Pr’
  • Eliminate the minterms from M that contradict with I

Uni Freiburg, WS2012/13 19 Systems Infrastructure for Data Science

slide-20
SLIDE 20

Primary Horizontal Fragmentation

Example

  • PAY(title, sal) and PROJ(pno, pname, budget, loc)
  • Fragmentation of relation PAY

– Application: Check the salary info and determine raise. (employee records kept at two sites → application run at two sites) – Simple predicates

  • p1 : sal ≤ 30000
  • p2 : sal > 30000
  • Pr = {p1, p2} which is complete and minimal Pr‘ = Pr

– Minterm predicates

  • m1 : (sal ≤ 30000)
  • m2 : NOT(sal ≤ 30000) = (sal > 30000)

Uni Freiburg, WS2012/13 20 Systems Infrastructure for Data Science

slide-21
SLIDE 21

Primary Horizontal Fragmentation

Example

Uni Freiburg, WS2012/13 21 Systems Infrastructure for Data Science

slide-22
SLIDE 22

Primary Horizontal Fragmentation

Example

  • Fragmentation of relation PROJ

– App1: Find the name and budget of projects given their location. (issued at 3 sites) – App2: Access project information according to budget (one site accesses ≤ 200000, other accesses > 200000) – Simple predicates

  • For App1:

p1 : LOC = “Montreal” p2 : LOC = “New York” p3 : LOC = “Paris”

  • For App2:

p4 : BUDGET ≤ 200000 p5 : BUDGET > 200000

  • Pr = Pr' = {p1, p2, p3, p4, p5}

Uni Freiburg, WS2012/13 22 Systems Infrastructure for Data Science

slide-23
SLIDE 23

Primary Horizontal Fragmentation

Example

  • Fragmentation of relation PROJ

– Minterm fragments left after elimination

m1 : (LOC = “Montreal”) AND (BUDGET ≤ 200000) m2 : (LOC = “Montreal”) AND (BUDGET > 200000) m3 : (LOC = “New York”) AND (BUDGET ≤ 200000) m4 : (LOC = “New York”) AND (BUDGET > 200000) m5 : (LOC = “Paris”) AND (BUDGET ≤ 200000) m6 : (LOC = “Paris”) AND (BUDGET > 200000)

Uni Freiburg, WS2012/13 23 Systems Infrastructure for Data Science

slide-24
SLIDE 24

Primary Horizontal Fragmentation

Example

Uni Freiburg, WS2012/13 24 Systems Infrastructure for Data Science

slide-25
SLIDE 25

Primary Horizontal Fragmentation

Correctness

  • Completeness

– Since Pr' is complete and minimal, the selection predicates are complete.

  • Reconstruction

– If relation R is fragmented into FR = {R1, R2, .., Rr} R = URi є FR Ri

  • Disjointness

– Minterm predicates that form the basis of fragmentation should be mutually exclusive.

Uni Freiburg, WS2012/13 25 Systems Infrastructure for Data Science

slide-26
SLIDE 26

Derived Horizontal Fragmentation

  • Defined on a member relation of a link

according to a selection operation specified on its owner.

  • Two important points:

– Each link is an equi-join. – Equi-join can be implemented using semi-joins.

Uni Freiburg, WS2012/13 26 Systems Infrastructure for Data Science

slide-27
SLIDE 27

Semi-join

Uni Freiburg, WS2012/13 27 Systems Infrastructure for Data Science

Semi-join reduces the amount of data that needs to be transmitted btw sites.

  • Given R(A) and S(B), semi-join of R with S is defined as follows:
  • Example:
slide-28
SLIDE 28
  • Given relations R and S:
  • The derived horizontal fragments of R are defined as

Ri = R Si, 1 ≤ i ≤ w where w is the maximum number of fragments that will be defined on R, and Si = σFi(S) where Fi is the formula according to which the primary horizontal fragment Si is defined.

Derived Horizontal Fragmentation

S R L

  • wner

member

Uni Freiburg, WS2012/13 28 Systems Infrastructure for Data Science

slide-29
SLIDE 29

Derived Horizontal Fragmentation

Example

Primary horizontal fragments of PAY: PAY1 = σ sal ≤ 30000 (PAY) PAY2 = σ sal > 30000 (PAY) Derived horizontal fragments of EMP: EMP1 = EMP PAY1 EMP2 = EMP PAY2

Uni Freiburg, WS2012/13 29 Systems Infrastructure for Data Science

slide-30
SLIDE 30

Derived Horizontal Fragmentation

Correctness

  • Completeness

– Referential integrity – Let R be the member relation of a link whose owner is relation S which is fragmented as FS = {S1, S2, .., Sn}. Furthermore, let A be the join attribute between R and S. Then, for each tuple t of R, there should be a tuple t' of S such that

t[A]=t'[A]

  • Reconstruction

– If relation R is fragmented into FR = {R1, R2, .., Rr} R = URi є FR Ri

  • Disjointness

– Simple join graphs between the owner and the member fragments.

Uni Freiburg, WS2012/13 30 Systems Infrastructure for Data Science

slide-31
SLIDE 31

Vertical Fragmentation

  • Divide a relation R into fragments R1, R2, .., Rr,

each of which contains a subset of R’s attributes as well as the primary key of R.

  • Goal: to minimize the execution time of user

applications that run on these fragments.

  • Too many alternatives => Use heuristic solutions

based on:

– Grouping: merge attributes to fragments – Splitting: divide a relation into fragments

  • Better for disjointness and easier dependency enforcement

Uni Freiburg, WS2012/13 31 Systems Infrastructure for Data Science

slide-32
SLIDE 32

Vertical Fragmentation Algorithms

What is given?

  • Attribute usage matrix of the application queries
  • Example: PROJ(PNO, PNAME, BUDGET, LOC)

Q1: SELECT BUDGET Q2: SELECT PNAME, BUDGET FROM PROJ FROM PROJ WHERE PNO=110 Q3: SELECT PNAME Q4: SELECT SUM(BUDGET) FROM PROJ FROM PROJ WHERE LOC=“New York” WHERE LOC=“New York”

Q1 Q2 Q3 Q4 A1 1 1 1 1 1 1 1 1 A2 A3 A4

Uni Freiburg, WS2012/13 32 Systems Infrastructure for Data Science

slide-33
SLIDE 33

Vertical Fragmentation Algorithms

What is given?

  • Attribute affinity matrix
  • Togetherness measure for attribute pairs
  • Given a relation R(A1, A2, .., An), the affinity between Ai and Aj

w.r.t. a set of application queries Q = {Q1, Q2, .., Qq} is defined as follows:

aff (Ai, Aj) = (query access) all queries that access Ai and Aj

query access = access frequency of a query * access execution all sites

comes from the attribute usage matrix

Uni Freiburg, WS2012/13 33 Systems Infrastructure for Data Science

slide-34
SLIDE 34

Vertical Fragmentation

Algorithm Sketch

  • Cluster step: Permute rows and columns of

the attribute affinity matrix to generate a clustered affinity matrix where attributes in each cluster are in high affinity to each other.

A A A A 1 2 3 4 A A A A 1 2 3 4 45 45 45 53 5 3 5 80 75 3 75 78 A A A A 1 3 2 4 A A A A 1 3 2 4 45 45 80 5 75 45 5 53 3 75 3 78 [ Bond Energy Algorithm ]

Uni Freiburg, WS2012/13 34 Systems Infrastructure for Data Science

slide-35
SLIDE 35

Vertical Fragmentation

Algorithm Sketch

  • Partition step: Divide the clustered attributes into

non-overlapping partitions such that the number of application queries that access to more than one partition is as small as possible.

TA BA Given: TQ = set of applications that access only TA BQ = set of applications that access only BA OQ = set of applications that access both TA and BA CTQ = total number of accesses to attributes by TQ CBQ = total number of accesses to attributes by BQ COQ = total number of accesses to attributes by OQ Find: The point along the diagonal that maximizes CTQ*฀ ฀ CBQ-฀ ฀ COQ2

Uni Freiburg, WS2012/13 35 Systems Infrastructure for Data Science

slide-36
SLIDE 36

Vertical Fragmentation

Correctness

  • A relation R, defined over attribute set A and key K, generates

the vertical partitioning FR = {R1, R2, …, Rr}.

  • Completeness

– The following should be true for A:

A =∪ ARi

  • Reconstruction

– Reconstruction can be achieved by

R = Ri ,∀Ri ∈ FR

  • Disjointness

– Duplicated keys are not considered to be overlapping

K

Uni Freiburg, WS2012/13 36 Systems Infrastructure for Data Science

slide-37
SLIDE 37

Hybrid Fragmentation

  • Obtained by applying horizontal and vertical

fragmentation one after the other.

  • In practice, nesting level does not exceed 2.
  • Correctness properties are guaranteed if constituent

fragmentations are correct.

  • Bottom-up reconstruction:

H H V V V V V

Uni Freiburg, WS2012/13 37 Systems Infrastructure for Data Science

slide-38
SLIDE 38

Fragment Allocation

  • Problem definition:

– Given a set of fragments F, a set of network sites S, and a set of application queries Q, find the optimal distribution

  • f F to S.
  • Optimality measures:

– Minimal cost = communication + storage + processing – Optimal performance = response time and/or throughput

  • Complex problem, heuristic solutions

Uni Freiburg, WS2012/13 38 Systems Infrastructure for Data Science

slide-39
SLIDE 39

Fragment Allocation High-Level Model

  • Minimize(total cost)
  • Subject to

– Response time constraint – Storage constraint – Processing constraint

  • Decide on variable xij

xij = 1 if fragment Fi is stored at site Sj

  • therwise

  

Uni Freiburg, WS2012/13 39 Systems Infrastructure for Data Science

slide-40
SLIDE 40

Fragment Allocation Algorithms

What is given?

  • Size of a fragment in bytes
  • Selectivity of a fragment w.r.t. a query
  • Number of read and update accesses of a query on a

fragment

  • Access localities
  • Max. response time for each application
  • Costs and capacities of sites

Uni Freiburg, WS2012/13 40 Systems Infrastructure for Data Science

slide-41
SLIDE 41

Fragment Allocation Alternatives

  • Non-replicated

– Partitioned: each fragment at only one site

  • Replicated

– Fully replicated: each fragment at each site – Partially replicated: each fragment at some of the sites

  • Rule of thumb:

– If , then replication pays off.

read-only queries update queries ≥ 1

Uni Freiburg, WS2012/13 41 Systems Infrastructure for Data Science

slide-42
SLIDE 42

Fragment Allocation Alternatives

Uni Freiburg, WS2012/13 42 Systems Infrastructure for Data Science