Systems Infrastructure for Data Science Web Science Group Uni - PowerPoint PPT Presentation

Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13

Lecture IX: Distributed query processing and optimization

Roadmap • Overview • (Query Decomposition) • Data Localization • Query Optimization Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 3

Query Processing Recap Declarative Procedural Query query specification query execution plan Processor Q Q’ SQL Relational Algebra SELECT ENAME FROM EMP, ASG Π ENAME ( EMP ENO ( σ RESP=“Manager” ( ASG ))) WHERE EMP.ENO = ASG.ENO AND RESP = “Manager” Two important requirements: 1. Correctness: Q’ must be semantically equivalent to Q. 2. Efficiency: Q’ must have the smallest execution cost. Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 4

Cost Metrics • Total cost – processing time at all sites (CPU + I/O) – communication time between sites • In WANs, communication cost usually dominates. • Query response time – time elapsed for executing the query What is the difference between total cost and query response time? Does it change in distributed/parallel settings? Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 5

Complexity of Relational Algebra Operators n: relation cardinality To reduce costs:  The most selective operations should be performed first.  Operations should be ordered by increasing complexity. Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 6

Query Processing in a Centralized System Given: EMP(ENO, ENAME, TITLE) ASG(ENO, PNO, RESP, DUR) Query: Find the names of employees who are managing a project. SELECT ENAME FROM EMP, ASG WHERE EMP.ENO = ASG.ENO AND RESP = “Manager” 1 Π ENAME ( σ RESP=“Manager” AND EMP.ENO=ASG.ENO ( EMP x ASG )) Two equivalent execution plans. 2 Π ENAME ( EMP ENO ( σ RESP=“Manager” ( ASG ))) Which one to use? Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 7

Query Processing in a Distributed System EMP ENO ( σ RESP=“Manager” ( ASG )) • Query: • Data fragments and their allocation to sites: – Site1 : ASG1 = σ ENO ≤ “E3” ( ASG )) – Site2 : ASG2 = σ ENO > “E3” ( ASG )) – Site3 : EMP1 = σ ENO ≤ “E3” ( EMP )) – Site4 : EMP2 = σ ENO > “E3” ( EMP )) – Site5 : Result • Assumptions: – size(EMP) = 400, size(ASG) = 1000, size( σ RESP=“Manager” ( ASG )) = 20 – tuple access cost = 1, tuple transfer cost = 10 – EMP locally indexed on ENO, ASG locally indexed on RESP – uniform data distribution across sites Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 8

Query Processing in a Distributed System 1 join = 400 * 20 * 1 = 8000 selection = 1000 * 1 = 1000 transfer = 400 * 10 = 4000 transfer = 1000 * 10 = 10000 total cost = 10000 + 4000 + 1000 + 8000 = 23000 Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 9

Query Processing in a Distributed System 2 transfer = 10 * 10 = 100 transfer = 10 * 10 = 100 join = 10 * 1 * 2 = 20 join = 10 * 1 * 2 = 20 transfer = 10 * 10 = 100 transfer = 10 * 10 = 100 selection = 10 * 1 = 10 selection = 10 * 1 = 10 total cost = 10 + 10 + 100 + 100 + 20 + 20 + 100 + 100 = 460 Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 10

General Query Optimization Issues • Algorithmic approach: – Cost-based vs. Heuristics-based • Granularity: – Single query at a time vs. Multi-query optimization • Timing: – Static vs. Dynamic vs. Hybrid • Statistics: – what to collect, accuracy, independence, uniformity • Decision mechanism: Specific to – Centralized vs. Distributed vs. Hybrid distributed • Network topology: query processing – WANs vs. LANs Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 11

Distributed Query Processing * * Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 12

Query Decomposition • Goal: To convert global declarative query into a correct and efficient global procedural query • Query decomposition consists of 4 steps: 1. Normalization  Transformation of query predicates into normal form 2. Semantic Analysis  Detection and rejection of semantically incorrect queries 3. Simplification  Elimination of redundant predicates 4. Restructuring  Transformation of the query into algebraic form No distribution-related processing. • Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 13

Sample Query • Transformation of the query into algebraic form Given: EMP(ENO, ENAME, TITLE) ASG(ENO, PNO, RESP, DUR) PROJ(PNO, PNAME, BUDGET, LOC) Query: Find the names of employees other than J. Doe who worked on the CAD/CAM project for either 1 or 2 years. SELECT ENAME FROM EMP, ASG, PROJ WHERE ASG.ENO = EMP.ENO AND ASG.PNO = PROJ.PNO AND ENAME ≠ “J. Doe” AND PROJ.PNAME = “CAD/CAM” AND (DUR = 12 OR DUR = 24) Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 14

Data Localization • Goal: To convert an algebraic query on global relations into an algebraic query on physical fragments • General approach: 1. Generate a localized query by substituting each global relation in the leaves of the operator tree by the appropriate subtree on fragments. Union for horizontal fragments • Join for vertical fragments • 2. Apply reduction techniques on the localized query to generate a simpler and an optimized operator tree. Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 15

Data Localization Example Query plan on global relations • EMP is fragmented as follows: EMP 1 = σ ENO ≤ “E3” (EMP) EMP 2 = σ “E3” < ENO ≤ “E6” (EMP) EMP 3 = σ ENO ≥ “E6” (EMP) • ASG is fragmented as follows: ASG 1 = σ ENO ≤ “E3” (ASG) ASG 2 = σ ENO > “E3” (ASG) Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 16

Data Localization Example Query plan on global relations Localized query plan ∪ ∪ Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science ASG 1 ASG 2 17 EMP 1 EMP 2 EMP 3

Data Localization Reduction for Primary Horizontal Fragmentation • Reduction with Selection – Given relation R and F R = { R 1 , R 2 , …, R w } where R j = σ pj ( R ) : σ pi (R j ) = φ , if ∀ x in R: ¬(p i (x) ∧ p j (x)) SELECT * – Example: EMP is fragmented as before. FROM EMP WHERE ENO = “E5” Localized query Reduced query Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 18

Data Localization Reduction for Primary Horizontal Fragmentation • EMP is fragmented as follows: EMP 1 = σ ENO ≤ “E3” (EMP) • Reduction with Selection EMP 2 = σ “E3” < ENO ≤ “E6” (EMP) EMP 3 = σ ENO ≥ “E6” (EMP) – Given relation R and F R = { R 1 , R 2 , …, R w } where R j = σ pj ( R ) : σ pi (R j ) = φ , if ∀ x in R: ¬(p i (x) ∧ p j (x)) • ASG is fragmented as follows: ASG 1 = σ ENO ≤ “E3” (ASG) SELECT * – Example: EMP is fragmented as before. ASG 2 = σ ENO > “E3” (ASG) FROM EMP WHERE ENO = “E5” Localized query Reduced query Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 19

Data Localization Reduction for Primary Horizontal Fragmentation • Reduction with Join – Apply when fragmentation is done on the join attribute – Distribute Joins over Unions (R 1 ∪ R 2 ) S ⇔ (R 1 S) ∪ (R 2 S) – Eliminate useless Joins R i R j = φ , if ∀ x in R i , ∀ y in R j : ¬(p i (x) ∧ p j (y)) • Example: – EMP and ASG are fragmented as before. SELECT * FROM EMP, ASG WHERE EMP.ENO = ASG.ENO Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 20

Data Localization Reduction for Primary Horizontal Fragmentation • Reduction with Join Example (cont’d): Localized query Reduced query Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 21

Data Localization Reduction for Primary Horizontal Fragmentation • EMP is fragmented as follows: EMP 1 = σ ENO ≤ “E3” (EMP) • Reduction with Join Example (cont’d): EMP 2 = σ “E3” < ENO ≤ “E6” (EMP) Localized query EMP 3 = σ ENO ≥ “E6” (EMP) • ASG is fragmented as follows: ASG 1 = σ ENO ≤ “E3” (ASG) ASG 2 = σ ENO > “E3” (ASG) Reduced query Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 22

Data Localization Reduction for Vertical Fragmentation • Reduction with Projection – Given a relation R defined over attributes A = {A 1 , ..., A n } and vertically fragmented as R i = Π A' (R) where A' ⊆ A : Π D,K (R i ) is useless, if the set of projection attributes D is not in A‘. • Example: – EMP is vertically fragmented as follows: EMP 1 = Π ENO,ENAME (EMP) Localized query Reduced query EMP 2 = Π ENO,TITLE (EMP) SELECT ENAME FROM EMP Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 23

Data Localization Reduction for Derived Horizontal Fragmentation • Example: ASG 1 : ASG ENO EMP 1 SELECT * ASG 2 : ASG ENO EMP 2 FROM EMP, ASG EMP 1 : σ TITLE = “Programmer” (EMP) WHERE ASG.ENO = EMP.ENO AND EMP.TITLE = “Mech. Eng.” EMP 2 : σ TITLE ≠ “Programmer” (EMP) Localized query Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 24

Data Localization Reduction for Derived Horizontal Fragmentation • Example cont’d: Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 25

Systems Infrastructure for Data Science Web Science Group Uni - PowerPoint PPT Presentation

Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Lecture IX: Distributed query processing and optimization Roadmap Overview (Query Decomposition) Data Localization Query Optimization Uni

Cyber- -Science Infrastructure: Science Infrastructure: Cyber Cyber-Science Infrastructure:

Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2014/15 Lecture I:

DataCamp Data Types for Data Science DataCamp Data Types for Data Science Data types Data type

Medical Infrastructure in Medical Infrastructure in Medical Infrastructure in Medical

What can Infrastructure do for you today? Daniel Humbedooh Gruno Infrastructure Architect,

Lecture 23 Verified Systems Software Infrastructure is Shaky Software Infrastructure is Shaky

Systems Systems Systems Integration Systems Integration Systems Systems Integration Systems

Compiler Infrastructure Systems and Internet Infrastructure Security (SIIS) Laboratory Page 1

Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Data Stream

Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Data Stream

Types of Expert Systems Interpretation Systems Prediction Systems Diagnosis Systems

EMIS/DS 1300: A Practical Introduction to Data Science Slides by Michael Hahsler Data + Science

Selecting Least Cost Green Infrastructure James W. Ridgway, PE October 14, 2015 Integrated

Infrastructure Solutions MSD 2250R Infrastructure Solutions Background: Infrastructure

Infrastructure & Shared Services Director Infrastructure & Shared Services Organisational

Broadband Infrastructure in Broadband Infrastructure in North Asia and Central Asia North Asia and

Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB Applications Lecture #26:

Feature Bagging for Author Attribution PAN - CLEF 2012 Franois-Marie Giraud / Thierry

CS 162 Intro to Computer Science II Separate Compila4on 1

ethics in NLP CS 685, Fall 2020 Introduction to Natural Language Processing

Cloud Mobile Computing Ed Crowley Tonights Topics Communicate class expectations

Source: Prof. Scott Eberhardt AE-714_332M Aircraft Design Capsule-0 Cost is the Key ! Needs

Quadrilateral Mesh Generation: Meromorphic Quartic Differential and Abel-Jacobi Condition Na Lei 1

Flynns Taxonomy Prof. Mike Flynns famous taxonomy of parallel computers 1 Flynns

Systems Infrastructure for Data Science Web Science Group Uni - PowerPoint PPT Presentation

Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Lecture IX: Distributed query processing and optimization Roadmap Overview (Query Decomposition) Data Localization Query Optimization Uni

Cyber- -Science Infrastructure: Science Infrastructure: Cyber Cyber-Science Infrastructure:

Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2014/15 Lecture I:

DataCamp Data Types for Data Science DataCamp Data Types for Data Science Data types Data type

Medical Infrastructure in Medical Infrastructure in Medical Infrastructure in Medical

What can Infrastructure do for you today? Daniel Humbedooh Gruno Infrastructure Architect,

Lecture 23 Verified Systems Software Infrastructure is Shaky Software Infrastructure is Shaky

Systems Systems Systems Integration Systems Integration Systems Systems Integration Systems

Compiler Infrastructure Systems and Internet Infrastructure Security (SIIS) Laboratory Page 1

Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Data Stream

Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Data Stream

Types of Expert Systems Interpretation Systems Prediction Systems Diagnosis Systems

EMIS/DS 1300: A Practical Introduction to Data Science Slides by Michael Hahsler Data + Science

Selecting Least Cost Green Infrastructure James W. Ridgway, PE October 14, 2015 Integrated

Infrastructure Solutions MSD 2250R Infrastructure Solutions Background: Infrastructure

Infrastructure &amp; Shared Services Director Infrastructure &amp; Shared Services Organisational

Broadband Infrastructure in Broadband Infrastructure in North Asia and Central Asia North Asia and

Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB Applications Lecture #26:

Feature Bagging for Author Attribution PAN - CLEF 2012 Franois-Marie Giraud / Thierry

CS 162 Intro to Computer Science II Separate Compila4on 1

ethics in NLP CS 685, Fall 2020 Introduction to Natural Language Processing

Cloud Mobile Computing Ed Crowley Tonights Topics Communicate class expectations

Source: Prof. Scott Eberhardt AE-714_332M Aircraft Design Capsule-0 Cost is the Key ! Needs

Quadrilateral Mesh Generation: Meromorphic Quartic Differential and Abel-Jacobi Condition Na Lei 1

Flynns Taxonomy Prof. Mike Flynns famous taxonomy of parallel computers 1 Flynns

Infrastructure & Shared Services Director Infrastructure & Shared Services Organisational